Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster

By AI Maestro May 18, 2026 1 min read

Summary of Benchmarks

  • On mixed multi-GPU setups, vLLM is the fastest engine for long-context prefill: SGLang crashes when an Ada card is in the pipeline (on pure Blackwell it is within about 6% of vLLM), and llama.cpp is 2.4x to 5.8x slower.
  • vLLM handles uneven GPU splits well when the layers are partitioned manually across pipeline stages, giving substantial speedups even with a 397B model.
  • llama.cpp struggles heavily with pipeline parallelism under these conditions, falling behind vLLM by a factor of 4 to 6 on the 4-, 6-, and 7-GPU setups. This is due to issues with how the execution graph is handled across multiple devices.

Benchmark Results

| Model and Context | GPU Setup | Engine | TTFT (s) | Prefill Speed (t/s) |
|---|---|---|---|---|
| Qwen3.6-35B-A3B (184k tokens) | 2 GPUs (6000 + 5090) | vLLM | 10.2 | 18060 |
| Qwen3.6-35B-A3B (184k tokens) | 2 GPUs (6000 + 5090) | llama.cpp | 24.9 | 7405 |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | vLLM | 13.2 | 6212 |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | llama.cpp | 77.0 | 1065 |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | SGLang | Crashed | N/A |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | vLLM | 5.0 | 15084 |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | SGLang | 5.3 | 14177 |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | llama.cpp | 20.6 | 3662 |
| Qwen3.5-397B-A17B (75k tokens) | 7 GPUs (Uneven PP split) | vLLM | 9.8 | 7683 |
| Qwen3.5-397B-A17B (75k tokens) | 7 GPUs (Uneven PP split) | SGLang | Crashed | N/A |
| Qwen3.5-397B-A17B (75k tokens) | 7 GPUs (Uneven PP split) | llama.cpp | 57.2 | 1319 |

I have been running benchmarks on a heterogeneous 7-GPU cluster to evaluate how different inference engines handle long-context prefill with pipeline parallelism. The setup mixes Blackwell and Ada cards: one RTX PRO 6000 (96 GB), one PRO 5000 (48 GB), two RTX 5090s (32 GB each), and three modded RTX 4090s (48 GB each). All tests used 4-bit weights: NVFP4 for vLLM and SGLang, MXFP4 for llama.cpp.
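
As a rough sanity check on the setup, the sketch below adds up the VRAM listed above and compares it with a naive 4-bit weight footprint for the largest model tested. The helper names are hypothetical, and the estimate assumes exactly 4 bits per weight; it ignores quantization scale factors, the KV cache, and activations, so real usage will be higher.

```python
# GPU memory per card in GB, as listed in the setup description.
GPUS_GB = {
    "RTX PRO 6000": [96],
    "PRO 5000": [48],
    "RTX 5090": [32, 32],
    "modded RTX 4090": [48, 48, 48],
}


def total_vram_gb(gpus):
    """Sum the memory of every card in the cluster."""
    return sum(sum(cards) for cards in gpus.values())


def weight_footprint_gb(params_billion, bits_per_weight):
    """Naive weight size in GB: parameters * bits / 8, with no overhead."""
    return params_billion * bits_per_weight / 8


cluster_gb = total_vram_gb(GPUS_GB)
weights_gb = weight_footprint_gb(397, 4.0)  # Qwen3.5-397B-A17B at 4 bits

print(f"cluster VRAM: {cluster_gb} GB")
print(f"naive 4-bit weights for a 397B model: {weights_gb} GB")
```

Under these assumptions the weights alone take a bit more than half of the cluster's memory, which is why the largest model needs all seven cards once a long-context KV cache is added.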

The main takeaway is that vLLM significantly outperforms the others on mixed multi-GPU setups for long-context prefill. llama.cpp struggles heavily with pipeline parallelism under these conditions, falling behind vLLM by a factor of 4 to 6 in prefill speed on the 4-, 6-, and 7-GPU runs (about 2.4x on the 2-GPU run). This is due to issues with how the execution graph is handled across multiple devices, particularly CPU-side embeddings causing graph splits and pipeline bubbles.
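
The size of the gap can be recomputed from the prefill speeds in the results table. The snippet below is a minimal sketch using only those reported numbers; the function names are illustrative, and the TTFT estimate assumes that time to first token is dominated by prefill (prompt tokens divided by prefill speed).

```python
# Prefill speeds in tokens/s, copied from the benchmark results table.
RESULTS = {
    "Qwen3.6-35B-A3B, 2 GPUs": {"vLLM": 18060, "llama.cpp": 7405},
    "MiniMax-M2.7, 6 GPUs": {"vLLM": 6212, "llama.cpp": 1065},
    "Qwen3.5-122B-A10B, 4 GPUs": {"vLLM": 15084, "llama.cpp": 3662},
    "Qwen3.5-397B-A17B, 7 GPUs": {"vLLM": 7683, "llama.cpp": 1319},
}


def speedup(row, fast="vLLM", slow="llama.cpp"):
    """Ratio of prefill speeds between two engines for one benchmark row."""
    return row[fast] / row[slow]


def estimated_ttft_s(prompt_tokens, prefill_tps):
    """Estimate TTFT as pure prefill time (assumption: decode setup is negligible)."""
    return prompt_tokens / prefill_tps


ratios = {name: round(speedup(row), 1) for name, row in RESULTS.items()}
for name, ratio in ratios.items():
    print(f"{name}: vLLM is {ratio}x faster than llama.cpp")

# Cross-check one row: 82k tokens at 6212 t/s.
print(round(estimated_ttft_s(82_000, 6212), 1), "s estimated TTFT for MiniMax-M2.7 on vLLM")
```

The estimated TTFT for the MiniMax-M2.7 run comes out at the 13.2 s reported in the table, which suggests the reported prefill speed is simply prompt length divided by TTFT.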

SGLang performs very well on a pure Blackwell setup but crashes immediately if an Ada card is introduced into the pipeline: it has no software fallback for FP4 weights and requires Compute Capability 10.0 or higher. vLLM handles this seamlessly by emulating FP4 on older cards.
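
The difference in behaviour can be modelled as a simple per-device check. This is an illustrative sketch, not code from either engine: the function names are made up, the 10.0 threshold is the one quoted above, and the compute capabilities (12.0 for the Blackwell cards, 8.9 for Ada) are NVIDIA's published values rather than something stated in the post. On a live system the same tuples could be read with `torch.cuda.get_device_capability(i)`.

```python
# Threshold quoted in the post for native FP4 kernels.
MIN_NATIVE_FP4_CC = (10, 0)

# Assumed compute capabilities as (major, minor) tuples.
BLACKWELL = (12, 0)  # e.g. RTX 5090
ADA = (8, 9)         # e.g. RTX 4090


def fp4_mode(cc, has_fallback):
    """Classify how one device would run FP4 weights."""
    if cc >= MIN_NATIVE_FP4_CC:
        return "native"
    return "emulated" if has_fallback else "unsupported"


def pipeline_can_start(capabilities, has_fallback):
    """A pipeline only starts if every stage's device can run FP4 somehow."""
    return all(fp4_mode(cc, has_fallback) != "unsupported" for cc in capabilities)


pure = [BLACKWELL, BLACKWELL]
mixed = [BLACKWELL, BLACKWELL, ADA]

print(pipeline_can_start(pure, has_fallback=False))   # engine without fallback, pure Blackwell
print(pipeline_can_start(mixed, has_fallback=False))  # engine without fallback, Ada present
print(pipeline_can_start(mixed, has_fallback=True))   # engine with FP4 emulation, Ada present
```

A single unsupported device is enough to block the whole pipeline, which matches the observation that adding one Ada card makes the engine without a fallback fail outright.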

Another interesting finding is how well vLLM handles uneven GPU splits. By manually tuning the layer distribution with the VLLM_PP_LAYER_PARTITION environment variable, I was able to balance the compute load between the fast Blackwell cards and the slower 4090s doing FP4 emulation. This eliminated pipeline bottlenecks and resulted in large speedups even on a 397B model.
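
One way to derive such a partition is to split the layers in proportion to each stage's relative speed. The sketch below is an assumption-laden illustration, not the partition used in these benchmarks: the layer count (60) and the 2:1 Blackwell-to-Ada weighting are made-up example values, and `proportional_partition` is a hypothetical helper. It relies on VLLM_PP_LAYER_PARTITION taking a comma-separated list with one layer count per pipeline stage, summing to the model's layer count.

```python
def proportional_partition(num_layers, stage_weights):
    """Split num_layers across stages in proportion to stage_weights.

    Uses largest-remainder rounding so the counts always sum to num_layers.
    """
    total = sum(stage_weights)
    ideal = [num_layers * w / total for w in stage_weights]
    counts = [int(x) for x in ideal]  # floor of each ideal share
    leftover = num_layers - sum(counts)
    # Hand the leftover layers to the stages with the largest fractional part.
    order = sorted(range(len(ideal)), key=lambda i: ideal[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical example: 4 Blackwell stages weighted 2, 3 Ada stages weighted 1.
weights = [2, 2, 2, 2, 1, 1, 1]
counts = proportional_partition(60, weights)
partition = ",".join(str(c) for c in counts)

print(counts)
print(f"export VLLM_PP_LAYER_PARTITION={partition}")
```

In practice the weights would come from measuring per-layer prefill time on each card, and the resulting split may still need manual adjustment for memory limits on the smaller GPUs.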

Here is the summary of the benchmark results.

Benchmark Results

| Model and Context | GPU Setup | Engine | TTFT (s) | Prefill Speed (t/s) |
|---|---|---|---|---|
| Qwen3.6-35B-A3B (184k tokens) | 2 GPUs (6000 + 5090) | vLLM | 7.3 | 18060 |
| Qwen3.6-35B-A3B (184k tokens) | 2 GPUs (6000 + 5090) | SGLang | Crashed | N/A |
| Qwen3.6-35B-A3B (184k tokens) | 2 GPUs (6000 + 5090) | llama.cpp | 24.9 | 7405 |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | vLLM | 13.2 | 6212 |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | SGLang | Crashed | N/A |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | llama.cpp | 77.0 | 1065 |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | vLLM | 5.0 | 15084 |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | SGLang | 5.3 | 14177 |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | llama.cpp | 20.6 | 3662 |
| Qwen3.5-397B-A17B (75k tokens) | 7 GPUs (Uneven PP split) | vLLM | 9.8 | 7683 |
| Qwen3.5-397B-A17B (75k tokens) | 7 GPUs (Uneven PP split) | SGLang | Crashed | N/A |
| Qwen3.5-397B-A17B (75k tokens) | 7 GPUs (Uneven PP split) | llama.cpp | 57.2 | 1319 |

Originally published at reddit.com. Curated by AI Maestro.