Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster

Summary of Benchmarks

vLLM posted the fastest long-context prefill in every configuration tested, including the mixed Blackwell/Ada setups where SGLang crashed.
Uneven GPU splits can be balanced in vLLM by manually partitioning the layers across pipeline stages, which kept prefill at 7683 t/s even with a 397B model on seven GPUs.
llama.cpp struggles heavily with pipeline parallelism under these conditions: its prefill throughput trails vLLM by about 2.4x on two GPUs and by 4x to 6x on four or more. This is due to issues with how the execution graph is handled across multiple devices.

Benchmark Results

Model and Context	GPU Setup	Engine	TTFT (s)	Prefill Speed (t/s)
Qwen3.6-35B-A3B (184k tokens)	2 GPUs (6000 + 5090)	vLLM	10.2	18060
Qwen3.6-35B-A3B (184k tokens)	2 GPUs (6000 + 5090)	SGLang	Crashed	N/A
Qwen3.6-35B-A3B (184k tokens)	2 GPUs (6000 + 5090)	llama.cpp	24.9	7405
MiniMax-M2.7 (82k tokens)	6 GPUs (Mixed)	vLLM	13.2	6212
MiniMax-M2.7 (82k tokens)	6 GPUs (Mixed)	SGLang	Crashed	N/A
MiniMax-M2.7 (82k tokens)	6 GPUs (Mixed)	llama.cpp	77.0	1065
Qwen3.5-122B-A10B (75k tokens)	4 GPUs (Pure Blackwell)	vLLM	5.0	15084
Qwen3.5-122B-A10B (75k tokens)	4 GPUs (Pure Blackwell)	SGLang	5.3	14177
Qwen3.5-122B-A10B (75k tokens)	4 GPUs (Pure Blackwell)	llama.cpp	20.6	3662
Qwen3.5-397B-A17B (75k tokens)	7 GPUs (Uneven PP split)	vLLM	9.8	7683
Qwen3.5-397B-A17B (75k tokens)	7 GPUs (Uneven PP split)	SGLang	Crashed	N/A
Qwen3.5-397B-A17B (75k tokens)	7 GPUs (Uneven PP split)	llama.cpp	57.2	1319

I have been running benchmarks on a heterogeneous 7-GPU cluster to evaluate how different inference engines handle long context prefill using pipeline parallelism. My setup includes a mix of Blackwell and Ada cards: one RTX PRO 6000 with 96GB, one PRO 5000 with 48GB, two 5090s with 32GB each, and three modded 4090s with 48GB each. All tests were conducted using 4-bit weights (NVFP4 for vLLM and SGLang, MXFP4 for llama.cpp).

The main takeaway is that vLLM significantly outperforms the others on mixed multi-GPU setups for long context prefill. llama.cpp struggles heavily with pipeline parallelism under these conditions, often falling behind by a factor of 4 to 6. This is due to issues with how the execution graph is handled across multiple devices, particularly CPU-side embeddings that cause graph splits and pipeline bubbles.
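The "factor of 4 to 6" can be checked directly against the prefill speeds in the results table. The snippet below is only a sanity check on those published numbers; the dictionary layout and function names are illustrative choices, not part of any benchmark tooling.

```python
# Prefill throughput in tokens/s, copied from the results table.
PREFILL_TPS = {
    "Qwen3.6-35B-A3B, 2 GPUs": {"vLLM": 18060, "llama.cpp": 7405},
    "MiniMax-M2.7, 6 GPUs": {"vLLM": 6212, "llama.cpp": 1065},
    "Qwen3.5-122B-A10B, 4 GPUs": {"vLLM": 15084, "llama.cpp": 3662},
    "Qwen3.5-397B-A17B, 7 GPUs": {"vLLM": 7683, "llama.cpp": 1319},
}


def speedups(table=PREFILL_TPS, fast="vLLM", slow="llama.cpp"):
    """Return the throughput ratio fast/slow for each benchmark row."""
    return {name: round(row[fast] / row[slow], 2) for name, row in table.items()}


if __name__ == "__main__":
    for name, ratio in speedups().items():
        print(f"{name}: vLLM prefill is {ratio:.2f}x llama.cpp")
```

The ratios come out to roughly 2.4x on the 2-GPU run and between about 4.1x and 5.8x on the 4-, 6-, and 7-GPU runs, so the gap widens as more pipeline stages are added.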

SGLang performs well on a pure Blackwell setup, coming within about 6% of vLLM (14177 vs 15084 t/s), but crashes as soon as an Ada card is introduced into the pipeline because it lacks a software fallback for FP4 weights, strictly requiring Compute Capability 10.0. vLLM handles this seamlessly by emulating FP4 on older cards.
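Before launching an engine without an FP4 fallback, it can help to list which cards fall below the required compute capability. The sketch below is a hypothetical pre-flight check, not SGLang's actual logic: the threshold is the 10.0 figure quoted above, read here as "10.0 or newer", and the function and dictionary names are made up for illustration. On a live system the capability tuples could be read with `torch.cuda.get_device_capability(i)`; they are hard-coded here so the example runs without a GPU.

```python
from typing import Dict, List, Tuple

Capability = Tuple[int, int]  # (major, minor) CUDA compute capability

# Assumption: threshold taken from the text above, treated as a minimum.
MIN_NATIVE_FP4: Capability = (10, 0)

# A subset of the cluster with publicly documented compute capabilities.
CLUSTER: Dict[str, Capability] = {
    "RTX PRO 6000": (12, 0),  # Blackwell
    "RTX 5090": (12, 0),      # Blackwell
    "RTX 4090": (8, 9),       # Ada
}


def lacks_native_fp4(
    devices: Dict[str, Capability], minimum: Capability = MIN_NATIVE_FP4
) -> List[str]:
    """Return the names of devices whose capability is below the minimum."""
    return [name for name, cc in devices.items() if cc < minimum]


if __name__ == "__main__":
    print(lacks_native_fp4(CLUSTER))  # -> ['RTX 4090']
```

Any card this returns would need FP4 emulation, or would have to be left out of the pipeline for an engine that has no fallback.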

Another interesting finding is how well vLLM handles uneven GPU splits. By manually tweaking the layer distribution using the VLLM_PP_LAYER_PARTITION environment variable, I was able to balance the compute load between fast Blackwells and slower 4090s doing FP4 emulation. This eliminated pipeline bottlenecks and resulted in large speedups, with the 397B model reaching 7683 t/s prefill across all seven GPUs.
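To illustrate the idea, here is a minimal sketch of deriving a layer split from relative GPU speeds and formatting it for VLLM_PP_LAYER_PARTITION, which takes a comma-separated layer count per pipeline stage. The speeds, the layer count, and the helper names are all hypothetical; the actual partition used in these benchmarks is not given. The toy cost model assumes a stage's time is simply its layer count divided by its speed.

```python
from typing import List


def partition_layers(num_layers: int, relative_speed: List[float]) -> List[int]:
    """Split layers across stages in proportion to speed (largest-remainder rounding)."""
    total = sum(relative_speed)
    exact = [num_layers * s / total for s in relative_speed]
    counts = [int(x) for x in exact]  # floor each stage's share
    leftover = num_layers - sum(counts)
    # Give the remaining layers to the stages with the largest fractional share.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def bottleneck(counts: List[int], relative_speed: List[float]) -> float:
    """Toy model: the slowest stage (layers / speed) bounds pipeline throughput."""
    return max(c / s for c, s in zip(counts, relative_speed))


def to_env_value(counts: List[int]) -> str:
    """Format per-stage layer counts as a comma-separated string."""
    return ",".join(str(c) for c in counts)


if __name__ == "__main__":
    speeds = [2, 2, 1, 1]  # hypothetical: two fast cards, two slow ones
    layers = 48            # hypothetical layer count
    even = [12, 12, 12, 12]
    balanced = partition_layers(layers, speeds)
    print(to_env_value(balanced))                                  # -> 16,16,8,8
    print(bottleneck(even, speeds), bottleneck(balanced, speeds))  # -> 12.0 8.0
    # The string would be exported as VLLM_PP_LAYER_PARTITION before starting vLLM.
```

In this toy case the even split is limited by the slow cards at 12 time units per micro-batch, while the balanced split brings every stage to 8. In practice the relative speeds have to be measured, and the split is also constrained by how many layers fit in each card's memory.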

The full results are in the Benchmark Results table above.
