Saturday, June 28, 2025

Gemma 3 vs. MiniCPM vs. Qwen 2.5 VL

Introduction

Vision-Language Models (VLMs) are quickly becoming the core of many generative AI applications, from multimodal chatbots and agentic systems to automated content analysis tools. As open-source models mature, they offer promising alternatives to proprietary systems, enabling developers and enterprises to build cost-effective, scalable, and customizable AI solutions.

However, the growing number of VLMs presents a common dilemma: how do you choose the right model for your use case? It is usually a balancing act between output quality, latency, throughput, context length, and infrastructure cost.

This blog aims to simplify the decision-making process by providing detailed benchmarks and model descriptions for three leading open-source VLMs: Gemma-3-4B, MiniCPM-o 2.6, and Qwen2.5-VL-7B-Instruct. All benchmarks were run using Clarifai's Compute Orchestration, our own inference engine, to ensure consistent conditions and reliable comparisons across models.

Before diving into the results, here's a quick breakdown of the key metrics used in the benchmarks. All results were generated using Clarifai's Compute Orchestration on NVIDIA L40S GPUs, with input tokens set to 500 and output tokens set to 150. The short sketch after the list shows one simple way these metrics can be computed from raw request timings.

  1. Latency per Token: The time it takes to generate each output token. Lower latency means faster responses, which is especially important for chat-like experiences.
  2. Time to First Token (TTFT): Measures how quickly the model generates the first token after receiving the input. It affects perceived responsiveness in streaming generation tasks.
  3. End-to-End Throughput: The number of tokens the model can generate per second for a single request, considering the full request processing time. Higher end-to-end throughput means the model can efficiently generate output while keeping latency low.
  4. Overall Throughput: The total number of tokens generated per second across all concurrent requests. This reflects the model's ability to scale and maintain performance under load.
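The following is a minimal Python sketch, not Clarifai's benchmark harness, of how these metrics can be derived from raw timings. The `generate_stream` callable is a hypothetical stand-in for any streaming client.

```python
import time

def measure_request(generate_stream, prompt):
    """Derive per-request metrics from one streaming generation call."""
    start = time.perf_counter()
    # generate_stream is a hypothetical callable that yields output tokens one
    # at a time; substitute your own client (vLLM, OpenAI-compatible, etc.).
    token_times = [time.perf_counter() for _ in generate_stream(prompt)]

    n_tokens = len(token_times)
    total_time = token_times[-1] - start

    ttft = token_times[0] - start              # Time to First Token (sec)
    latency_per_token = total_time / n_tokens  # average sec per output token
    e2e_throughput = n_tokens / total_time     # tokens/sec for this single request
    return ttft, latency_per_token, e2e_throughput

def overall_throughput(total_tokens: int, wall_clock_sec: float) -> float:
    """Tokens/sec summed across all concurrent requests over the whole run."""
    return total_tokens / wall_clock_sec
```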

Now, let's dive into the details of each model, starting with Gemma-3-4B.

Gemma-3-4B

Gemma-3-4B, part of Google's latest Gemma 3 family of open multimodal models, is designed to handle both text and image inputs, producing coherent and contextually rich text responses. With support for up to 128K context tokens, 140+ languages, and tasks like text generation, image understanding, reasoning, and summarization, it's built for production-grade applications across diverse use cases.
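If you want a quick feel for the model outside of a full serving stack, here is a minimal sketch of an image-plus-text prompt with Hugging Face Transformers. It assumes a recent transformers release with Gemma 3 support and access to the gated google/gemma-3-4b-it checkpoint; the image URL and prompt are placeholders.

```python
# Minimal sketch, assuming a transformers version with Gemma 3 support and access
# to the gated google/gemma-3-4b-it checkpoint. Image URL and prompt are placeholders.
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/sample.jpg"},  # placeholder image
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=150)

prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][prompt_len:], skip_special_tokens=True))
```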

Benchmark Summary: Performance on L40S GPU

Gemma-3-4B continues to show strong performance across both text and image tasks, with consistent behavior under varying concurrency levels. All benchmarks were run using Clarifai's Compute Orchestration with an input size of 500 tokens and an output size of 150 tokens. Gemma-3-4B is optimized for low-latency text processing and handles image inputs up to 512px with stable throughput across concurrency levels.

Text-Only Performance Highlights:

  • Latency per token: 0.022 sec (1 concurrent request)
  • Time to First Token (TTFT): 0.135 sec
  • End-to-end throughput: 202.25 tokens/sec
  • Requests per minute (RPM): Up to 329.90 at 32 concurrent requests
  • Overall throughput: 942.57 tokens/sec at 32 concurrency

Multimodal (Image + Text) Performance (Overall Throughput):

  • 256px images: 718.63 tokens/sec, 252.16 RPM at 32 concurrency
  • 512px images: 688.21 tokens/sec, 242.04 RPM

Scales with Concurrency: end-to-end throughput at 2, 8, 16, and 32 concurrent requests is shown in the chart below.

Overall Insight:

Gemma-3-4B provides fast and reliable performance for text-heavy and structured vision-language tasks. For large image inputs (512px), performance remains stable, but you may need to scale compute resources to maintain low latency and high throughput.

If you're evaluating GPU performance for serving this model, we've published a separate comparison of A10 vs. L40S to help you choose the best hardware for your needs.

[Figure: Gemma-3-4B throughput vs. concurrency]

MiniCPM-o 2.6

MiniCPM-o 2.6 represents a major leap in end-side multimodal LLMs. It expands input modalities to images, video, audio, and text, offering real-time speech conversation and multimodal streaming support.

With an architecture integrating SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, the model has a total of 8 billion parameters. MiniCPM-o 2.6 demonstrates significant improvements over its predecessor, MiniCPM-V 2.6, and introduces real-time speech conversation, multimodal live streaming, and superior efficiency in token processing.

Benchmark Summary: Performance on L40S GPU

All benchmarks were run using Clarifai's Compute Orchestration with an input size of 500 tokens and an output size of 150 tokens. MiniCPM-o 2.6 performs exceptionally well across both text and image workloads, scaling smoothly across concurrency levels. Shared vLLM serving provides significant gains in overall throughput while maintaining low latency.
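As a rough illustration (not Clarifai's SDK), a multimodal request to a vLLM-style OpenAI-compatible endpoint can look like the sketch below. The base URL, API key, and served model name are placeholders for your own deployment.

```python
# Hypothetical OpenAI-compatible endpoint (for example, a shared vLLM server);
# the URL, API key, and served model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM-o-2_6",  # whatever name your server registers the model under
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "What trend does this chart show?"},
            ],
        }
    ],
    max_tokens=150,
)
print(response.choices[0].message.content)
```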

Text-Only Performance Highlights:

  • Latency per token: 0.022 sec (1 concurrent request)
  • Time to First Token (TTFT): 0.087 sec
  • End-to-end throughput: 213.23 tokens/sec
  • Requests per minute (RPM): Up to 362.83 at 32 concurrent requests
  • Overall throughput: 1075.28 tokens/sec at 32 concurrency

Multimodal (Image + Text) Performance (Overall Throughput):

  • 256px images: 1039.60 tokens/sec, 353.19 RPM at 32 concurrency
  • 512px images: 957.37 tokens/sec, 324.66 RPM

Scales with Concurrency: end-to-end throughput at 2, 8, 16, and 32 concurrent requests is shown in the chart below.

Overall Insight:

MiniCPM-o 2.6 performs reliably across a wide range of tasks and input sizes. It maintains low latency, scales linearly with concurrency, and stays performant even with 512px image inputs. This makes it a solid choice for real-time applications running on modern GPUs like the L40S. These results reflect performance on that specific hardware configuration and may differ depending on the environment or GPU tier.

[Figure: MiniCPM-o 2.6 throughput vs. concurrency]

Qwen2.5-VL-7B-Instruct

Qwen2.5-VL is a vision-language model designed for visual recognition, reasoning, long video analysis, object localization, and structured data extraction.

Its architecture integrates window attention into the Vision Transformer (ViT), significantly improving both training and inference efficiency. Additional optimizations such as SwiGLU activation and RMSNorm further align the ViT with the Qwen2.5 LLM, improving overall performance and consistency.
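For reference, here is a minimal sketch of querying Qwen2.5-VL-7B-Instruct with Hugging Face Transformers, modeled on the model's published quickstart. It assumes a transformers version with Qwen2.5-VL support and the qwen-vl-utils helper package; the image URL and prompt are placeholders.

```python
# Minimal sketch following the Qwen2.5-VL quickstart; assumes transformers with
# Qwen2.5-VL support plus the qwen-vl-utils package. Image URL is a placeholder.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/invoice.png"},  # placeholder image
            {"type": "text", "text": "Extract the invoice number and the total amount."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=150)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```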

Benchmark Summary: Performance on L40S GPU

Qwen2.5-VL-7B-Instruct delivers consistent performance across both text and image-based tasks. Benchmarks from Clarifai's Compute Orchestration highlight its ability to handle multimodal inputs at scale, with strong throughput and responsiveness under varying concurrency levels.

Text-Only Performance Highlights:

  • Latency per token: 0.022 sec (1 concurrent request)
  • Time to First Token (TTFT): 0.089 sec
  • End-to-end throughput: 205.67 tokens/sec
  • Requests per minute (RPM): Up to 353.78 at 32 concurrent requests
  • Overall throughput: 1017.16 tokens/sec at 32 concurrency

Multimodal (Image + Text) Performance (Overall Throughput):

  • 256px images: 854.53 tokens/sec, 318.64 RPM at 32 concurrency
  • 512px images: 832.28 tokens/sec, 345.98 RPM

Scales with Concurrency: end-to-end throughput at 2, 8, 16, and 32 concurrent requests is shown in the chart below.

Overall Insight:

Qwen2.5-VL-7B-Instruct is well-suited for both text and multimodal tasks. While larger images introduce latency and throughput trade-offs, the model performs reliably with small to medium-sized inputs even at high concurrency. It's a strong choice for scalable vision-language pipelines that prioritize throughput and moderate latency.

[Figure: Qwen2.5-VL-7B-Instruct throughput vs. concurrency]

Which VLM is Right for You?

Choosing the right Vision-Language Model (VLM) depends on your workload type, input modality, and concurrency requirements. All benchmarks in this report were generated on NVIDIA L40S GPUs via Clarifai's Compute Orchestration.

These results reflect performance on enterprise-grade infrastructure. If you're using lower-end hardware, or targeting larger batch sizes or ultra-low latency, actual performance may differ, so it's important to evaluate against your specific deployment setup. The sketch below shows one lightweight way to run such a check.
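As a starting point only (not the harness used for the numbers in this post), the sketch below fires a batch of concurrent requests at an OpenAI-compatible endpoint and reports overall throughput and requests per minute. The endpoint URL and model name are placeholders.

```python
# Rough concurrency check against an OpenAI-compatible endpoint; the URL and
# model name are placeholders for your own deployment.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(prompt: str) -> int:
    resp = await client.chat.completions.create(
        model="your-served-model",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=150,
    )
    return resp.usage.completion_tokens  # output tokens for this request

async def load_test(concurrency: int = 32) -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(
        *[one_request("Summarize the benefits of vision-language models.")
          for _ in range(concurrency)]
    )
    elapsed = time.perf_counter() - start
    print(f"{concurrency} concurrent requests: "
          f"{sum(counts) / elapsed:.1f} tokens/sec overall, "
          f"{concurrency / elapsed * 60:.1f} RPM")

asyncio.run(load_test(32))
```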

MiniCPM-o 2.6
MiniCPM offers consistent performance across both text and image tasks, especially when deployed with shared vLLM. It scales efficiently up to 32 concurrent requests, maintaining high throughput and low latency even with 1024px image inputs.

If your application requires stable performance under load and flexibility across modalities, MiniCPM is the most well-rounded choice in this group.

Gemma-3-4B
Gemma performs best on text-heavy workloads with occasional image input. It handles concurrency well up to 16 requests but begins to dip at 32, particularly with large images such as 2048px.

If your use case is primarily focused on fast, high-quality text generation with small to medium image inputs, Gemma provides strong performance without needing high-end scaling.

Qwen2.5-VL-7B-Instruct
Qwen2.5 is optimized for structured vision-language tasks such as document parsing, OCR, and multimodal reasoning, making it a strong choice for applications that require precise visual and textual understanding.

If your priority is accurate visual reasoning and multimodal understanding, Qwen2.5 is a strong fit, especially when output quality matters more than peak throughput.

To help you compare at a glance, here's a summary of the key performance metrics for all three models at 32 concurrent requests across text and image inputs.

Vision-Language Model Benchmark Summary (32 Concurrent Requests, L40S GPU)

 

 

Metric                               Model                      Text Only   256px Image   512px Image
Latency per Token (sec)              Gemma-3-4B                   0.027       0.036         0.037
                                     MiniCPM-o 2.6                0.024       0.026         0.028
                                     Qwen2.5-VL-7B-Instruct       0.025       0.032         0.032
Time to First Token (sec)            Gemma-3-4B                   0.236       1.034         1.164
                                     MiniCPM-o 2.6                0.120       0.347         0.786
                                     Qwen2.5-VL-7B-Instruct       0.121       0.364         0.341
End-to-End Throughput (tokens/s)     Gemma-3-4B                 168.45      124.56        120.01
                                     MiniCPM-o 2.6              188.86      176.29        160.14
                                     Qwen2.5-VL-7B-Instruct     186.91      179.69        191.94
Overall Throughput (tokens/s)        Gemma-3-4B                 942.58      718.63        688.21
                                     MiniCPM-o 2.6             1075.28     1039.60        957.37
                                     Qwen2.5-VL-7B-Instruct    1017.16      854.53        832.28
Requests per Minute (RPM)            Gemma-3-4B                 329.90      252.16        242.04
                                     MiniCPM-o 2.6              362.84      353.19        324.66
                                     Qwen2.5-VL-7B-Instruct     353.78      318.64        345.98

 

Note: These benchmarks were run on L40S GPUs. Results may vary depending on GPU class (such as A100 or H100), CPU limitations, or runtime configurations including batching, quantization, or model variants.

Conclusion

We have seen the benchmarks across MiniCPM-o 2.6, Gemma-3-4B, and Qwen2.5-VL-7B-Instruct, covering their performance on latency, throughput, and scalability under different concurrency levels and image sizes. Each model performs differently depending on the task and workload requirements.

If you want to try out these models, we have launched a new AI Playground where you can explore them directly. We will continue adding the latest models to the platform, so keep an eye on our updates and join our Discord community for the latest announcements.

If you are also looking to deploy these open-source VLMs on your own dedicated compute, our platform supports production-grade inference and scalable deployments. You can quickly get started by setting up your own node pool and running inference efficiently. Check out the tutorial below to get started.

 

