
This blog post focuses on new features and improvements. For a complete list, including bug fixes, please see the release notes.
Benchmarking GPT-OSS Across H100s and B200s
OpenAI has released gpt-oss-120b and gpt-oss-20b, a new generation of open-weight reasoning models under the Apache 2.0 license. Built for robust instruction following, powerful tool use, and advanced reasoning, these models are designed for next-generation agentic workflows.
With a Mixture of Experts (MoE) design, an extended context length of 131K tokens, and quantization that allows the 120b model to run on a single 80 GB GPU, GPT-OSS combines massive scale with practical deployment. Developers can adjust reasoning levels from low to high to optimize for speed, cost, or accuracy, and use built-in browsing, code execution, and custom tools for complex workflows.
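As a minimal sketch of switching reasoning levels, here is a hedged example against an OpenAI-compatible server (such as a local vLLM instance); the endpoint, model name, and the system-prompt mechanism for setting the level are assumptions that depend on your serving stack:

```python
from openai import OpenAI

# Assumes gpt-oss-120b is served behind an OpenAI-compatible endpoint,
# e.g. a local vLLM server; the URL, key, and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        # Assumption: the serving stack reads the reasoning level
        # ("low", "medium", or "high") from the system prompt.
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Outline a three-step plan to parse a CSV file."},
    ],
)
print(response.choices[0].message.content)
```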
Our research team benchmarked gpt-oss-120b across NVIDIA B200 and H100 GPUs using vLLM, SGLang, and TensorRT-LLM. Tests covered single-request scenarios and high-concurrency workloads with 50–100 requests. Key findings include:
- Single-request speed: B200 with TensorRT-LLM delivers a 0.023 s time-to-first-token (TTFT), outperforming dual-H100 setups in several cases.
- High concurrency: B200 sustains 7,236 tokens/sec at maximum load with lower per-token latency.
- Efficiency: One B200 can replace two H100s for equal or better performance, with lower power use and less complexity.
- Performance gains: Some workloads see up to 15x faster inference compared to a single H100.
For detailed benchmarks on throughput, latency, time to first token, and other metrics, read our full blog on NVIDIA B200 vs H100.
If you’re looking to deploy GPT-OSS models on H100s, you can do it today on Clarifai across multiple clouds. Support for B200s is coming soon, giving you access to the latest NVIDIA GPUs for testing and production.
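For illustration, here is a hedged sketch of calling the model on Clarifai through an OpenAI-compatible client; the base URL, model identifier, and token below are assumptions, so verify them against the current Clarifai documentation:

```python
from openai import OpenAI

# Hedged example: the base URL and model identifier are assumptions,
# check the Clarifai docs for the exact values.
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key="YOUR_CLARIFAI_PAT",  # your Clarifai personal access token
)

response = client.chat.completions.create(
    model="https://clarifai.com/openai/chat-completion/models/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize MoE routing in two sentences."}],
)
print(response.choices[0].message.content)
```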
Developer Plan
Last month we launched Local Runners, and the response from developers has been incredible. From AI hobbyists to production teams, many have been eager to run open-source models locally on their own hardware while still taking advantage of the Clarifai platform. With Local Runners, you can run and test models on your own machines, then access them through a public API for integration into any application.
Now, with the arrival of the latest GPT-OSS models, including gpt-oss-20b, you can run these advanced reasoning models locally with full control of your compute and the ability to deploy agentic workflows right away.
To make it even easier, we’re introducing the Developer Plan at a promotional price of just $1/month. It includes everything in the Community Plan, plus:
Check out the Developer Plan and start running your own models locally today. If you’re ready to run GPT-OSS-20b on your hardware, follow our step-by-step tutorial here.
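As a rough sketch of what the flow looks like once a Local Runner is serving your model, the snippet below calls the exposed model with the Clarifai Python SDK; the model URL and token are placeholders, and the exact client method may vary by SDK version:

```python
from clarifai.client.model import Model

# Hedged sketch: assumes a Local Runner started with the Clarifai CLI
# is already serving this model; the URL and token are placeholders.
model = Model(
    url="https://clarifai.com/YOUR_USER_ID/YOUR_APP_ID/models/gpt-oss-20b",
    pat="YOUR_CLARIFAI_PAT",
)

# Send a text prompt to the runner through the public Clarifai API.
prediction = model.predict_by_bytes(
    b"Explain why MoE models are cheap to serve.",
    input_type="text",
)
print(prediction.outputs[0].data.text.raw)
```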
Published Models
We’ve expanded our model library with new open-weight and specialized models that are ready to use in your workflows.
The latest additions include:
- GPT-OSS-120b – an open-weight language model designed for strong reasoning, advanced tool use, and efficient on-device deployment. It supports extended context lengths and variable reasoning levels, making it ideal for complex agentic applications.
- GPT-5, GPT-5 Mini, and GPT-5 Nano – GPT-5 is the flagship model for the most demanding reasoning and generative tasks. GPT-5 Mini offers a faster, cost-effective alternative for real-time applications. GPT-5 Nano delivers ultra-low-latency inference for edge and budget-sensitive deployments.
- Qwen3-Coder-30B-A3B-Instruct – a high-efficiency coding model with long-context support and strong agentic capabilities, well-suited for code generation, refactoring, and development automation.
You can start exploring these models today in the Clarifai Playground or access them via the API to integrate into your applications.
Ollama Support
Ollama makes it simple to download and run powerful open-source models directly on your machine. With Clarifai Local Runners, you can now expose these locally running models via a secure public API.
We’ve also added an Ollama toolkit to the Clarifai CLI, letting you download, run, and expose Ollama models with a single command.
Read our step-by-step guide on running Ollama models locally and making them accessible via the API.
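Before exposing a model, you can verify it runs locally with the Ollama Python client; this is a hedged sketch, and the gpt-oss:20b tag is an assumption, so substitute any model from the Ollama library:

```python
import ollama  # pip install ollama; assumes a local Ollama server is running

# "gpt-oss:20b" is an assumed tag; swap in any model from the Ollama library.
ollama.pull("gpt-oss:20b")

reply = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
)
print(reply["message"]["content"])
```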
Playground Enhancements
You can now compare multiple models side by side in the Playground instead of testing them one at a time. Quickly spot differences in output, speed, and quality to choose the best fit for your use case.
We’ve also added enhanced inference controls, Pythonic support, and model version selectors for smoother experimentation.

Additional Updates
Python SDK:
- Improved logging, pipeline handling, authentication, Local Runner support, and code validation.
- Added live logging, verbose output, and integration with GitHub repositories for flexible model initialization.
Platform:
Clarifai Organizations:
Ready to start building?
With Clarifai’s Compute Orchestration, you can deploy GPT-OSS, Qwen3-Coder, and other open-source models, as well as your own custom models, on dedicated GPUs like NVIDIA B200s and H100s, on-prem or in the cloud. Serve models, MCP servers, or full agentic workflows directly from your hardware with full control over performance, cost, and security.
