Monday, March 23, 2026

How Good Are the New GPT-OSS Models? We Put Them to the Test.

OpenAI hadn't released an open-weight language model since GPT-2 back in 2019. Six years later, they surprised everyone with two: gpt-oss-120b and the smaller gpt-oss-20b.

Naturally, we wanted to know: how do they actually perform?

To find out, we ran both models through our open-source workflow optimization framework, syftr. It evaluates models across different configurations (fast vs. cheap, high vs. low accuracy) and includes support for OpenAI's new "reasoning effort" setting.

In theory, more reasoning should mean better answers. In practice? Not always.

We also use syftr to explore questions like "is LLM-as-a-Judge actually working?" and "what workflows perform well across many datasets?".

Our first results with GPT-OSS might surprise you: the best performer wasn't the biggest model or the deepest thinker.

Instead, the 20b model with low reasoning effort consistently landed on the Pareto frontier, even rivaling the 120b medium configuration on benchmarks like FinanceBench, HotpotQA, and MultihopRAG. Meanwhile, high reasoning effort rarely mattered at all.

How we set up our experiments

We didn't just pit GPT-OSS against itself. Instead, we wanted to see how it stacked up against other strong open-weight models. So we compared gpt-oss-20b and gpt-oss-120b with:

  • qwen3-235b-a22b
  • glm-4.5-air
  • nemotron-super-49b
  • qwen3-30b-a3b
  • gemma3-27b-it
  • phi-4-multimodal-instruct

To test OpenAI's new "reasoning effort" feature, we ran each GPT-OSS model in three modes: low, medium, and high reasoning effort. That gave us six configurations in total, sketched in code after the list:

  • gpt-oss-120b-low / -medium / -high
  • gpt-oss-20b-low / -medium / -high
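To make this concrete, here is a minimal sketch of how one of these six configurations can be queried. It assumes an OpenAI-compatible endpoint (for example, a local vLLM server) hosting the gpt-oss weights and accepting the SDK's reasoning_effort parameter; the base_url, api_key, and model names are placeholders for your own deployment.

```python
# A minimal sketch of querying one gpt-oss configuration. Assumes an
# OpenAI-compatible endpoint (e.g., a local vLLM server) hosting the
# gpt-oss weights and accepting the reasoning_effort parameter; the
# base_url, api_key, and model names are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

MODELS = ["gpt-oss-20b", "gpt-oss-120b"]
EFFORTS = ["low", "medium", "high"]

def ask(model: str, effort: str, question: str) -> str:
    """Send one question to a single model/effort configuration."""
    response = client.chat.completions.create(
        model=model,
        reasoning_effort=effort,  # OpenAI's low/medium/high setting
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# The six configurations under test:
configs = [(m, e) for m in MODELS for e in EFFORTS]
```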

For evaluation, we cast a wide net: 5 RAG and agent modes, 16 embedding models, and a range of flow configuration options. To judge model responses, we used GPT-4o-mini and compared answers against known ground truth.
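As a rough illustration of that grading step, the sketch below asks GPT-4o-mini to compare a candidate answer against the reference. The prompt wording and the grade function are illustrative, not the exact rubric from our syftr runs, and it assumes an OPENAI_API_KEY is set in the environment.

```python
# Simplified LLM-as-a-Judge grading step. The prompt is illustrative,
# not the exact rubric from our syftr runs; assumes OPENAI_API_KEY is set.
from openai import OpenAI

judge = OpenAI()

def grade(question: str, answer: str, ground_truth: str) -> bool:
    """Ask GPT-4o-mini whether an answer matches the known ground truth."""
    verdict = judge.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n"
                f"Reference answer: {ground_truth}\n"
                f"Candidate answer: {answer}\n"
                "Does the candidate answer agree with the reference? "
                "Reply with exactly YES or NO."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")
```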

Finally, we tested across four datasets:

  • FinanceBench (financial reasoning)
  • HotpotQA (multi-hop QA)
  • MultihopRAG (retrieval-augmented reasoning)
  • PhantomWiki (synthetic Q&A pairs)

We optimized workflows twice: once for accuracy + latency, and once for accuracy + cost, capturing the tradeoffs that matter most in real-world deployments.
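Under the hood, this kind of two-objective search can be expressed as multi-objective optimization. The sketch below shows the idea for the accuracy + cost case using Optuna's multi-objective mode; evaluate_workflow is a hypothetical stand-in for running one candidate configuration over a dataset, and the search space is heavily abridged.

```python
# Abridged sketch of a two-objective (accuracy, cost) search using
# Optuna's multi-objective mode. evaluate_workflow() is a hypothetical
# stand-in for running one candidate workflow over a dataset.
import optuna

def evaluate_workflow(config: dict) -> tuple[float, float]:
    # Placeholder: run the configured workflow and measure it.
    # Dummy values keep the sketch runnable.
    return 0.5, 1.0

def objective(trial: optuna.Trial) -> tuple[float, float]:
    config = {
        "llm": trial.suggest_categorical("llm", [
            "gpt-oss-20b-low", "gpt-oss-20b-medium", "gpt-oss-20b-high",
            "gpt-oss-120b-low", "gpt-oss-120b-medium", "gpt-oss-120b-high",
        ]),
        "rag_top_k": trial.suggest_int("rag_top_k", 1, 10),
    }
    accuracy, cost = evaluate_workflow(config)
    return accuracy, cost

# Maximize accuracy while minimizing cost; the study's best_trials
# approximate a Pareto frontier like the ones in the figures below.
study = optuna.create_study(directions=["maximize", "minimize"])
study.optimize(objective, n_trials=200)
pareto_trials = study.best_trials
```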

Optimizing for latency, cost, and accuracy

When we optimized the GPT-OSS models, we looked at two tradeoffs: accuracy vs. latency and accuracy vs. cost. The results were more surprising than we expected:

  • GPT-OSS 20b (low reasoning effort):
    Fast, inexpensive, and consistently accurate. This setup appeared on the Pareto frontier repeatedly, making it the best default choice for most non-scientific tasks. In practice, that means quicker responses and lower bills compared to higher reasoning efforts.
  • GPT-OSS 120b (medium reasoning effort):
    Best suited to tasks that demand deeper reasoning, like financial benchmarks. Use this when accuracy on complex problems matters more than cost.
  • GPT-OSS 120b (high reasoning effort):
    Expensive and usually unnecessary. Keep it in your back pocket for edge cases where other models fall short. For our benchmarks, it didn't add value.
Figure 1: Accuracy-latency optimization with syftr
Figure 2: Accuracy-cost optimization with syftr

Reading the results more carefully

At first glance, the results look straightforward. But there's an important nuance: an LLM's top accuracy score depends not just on the model itself, but on how the optimizer weighs it against the other models in the mix. For example, let's look at FinanceBench.

When optimizing for latency, all GPT-OSS models (except high reasoning effort) landed on similar Pareto frontiers. In this case, the optimizer had little reason to focus on the 20b low reasoning configuration; its top accuracy was only 51%.

Figure 3: Per-LLM Pareto frontiers for latency optimization on FinanceBench

When optimizing for cost, the picture shifts dramatically. The same 20b low reasoning configuration jumps to 57% accuracy, while the 120b medium configuration actually drops by 22%. Why? Because the 20b model is far cheaper, so the optimizer shifts more weight toward it.

Figure 4: Per-LLM Pareto frontiers for cost optimization on FinanceBench

The takeaway: performance depends on context. Optimizers will favor different models depending on whether you're prioritizing speed, cost, or accuracy. And given the enormous search space of possible configurations, there may be even better setups beyond the ones we tested.
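If you want to reproduce frontiers like those in Figures 3 and 4 from your own results, the computation itself is simple: a configuration survives only if no other configuration is at least as accurate and at least as cheap, with one of the two strictly better. A small sketch:

```python
# A small helper to extract the Pareto frontier from (accuracy, cost)
# results, mirroring the per-LLM frontiers in Figures 3 and 4. A point
# survives if no other point has higher-or-equal accuracy at
# lower-or-equal cost, with at least one strict improvement.
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """points: (accuracy, cost) pairs; returns the non-dominated subset."""
    frontier = []
    for acc, cost in points:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for a, c in points
        )
        if not dominated:
            frontier.append((acc, cost))
    # Sort by cost so the frontier reads left-to-right on a plot.
    return sorted(frontier, key=lambda p: p[1])

# Example: the expensive mid-accuracy point is dominated and drops out.
print(pareto_frontier([(0.57, 1.0), (0.51, 0.2), (0.55, 1.5)]))
# -> [(0.51, 0.2), (0.57, 1.0)]
```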

Finding agentic workflows that work well for your setup

The new GPT-OSS models performed strongly in our tests, especially the 20b with low reasoning effort, which often outpaced more expensive rivals. The bigger lesson? More model and more effort doesn't always mean more accuracy. Sometimes, paying more just gets you less.

That's exactly why we built syftr and made it open source. Every use case is different, and the best workflow for you depends on the tradeoffs you care about most. Want lower costs? Faster responses? Maximum accuracy?

Run your own experiments and find the Pareto sweet spot that balances these priorities for your setup.
