Summary of Benchmarks
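- vLLM was the fastest engine in every test for long context prefill, including the mixed multi-GPU setups. SGLang only completed the pure 4-GPU Blackwell run, where it was close to vLLM (5.3s vs 5.0s TTFT), and llama.cpp was slower in every case.
- vLLM copes well with uneven GPU splits once the layers are partitioned manually via VLLM_PP_LAYER_PARTITION, leading to substantial speedups even with a 397B model.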
- llama.cpp struggles heavily with pipeline parallelism under these conditions, often falling behind vLLM by a factor of 4 to 6. This is due to issues with how the execution graph is handled across multiple devices.
Benchmark Results
| Model and Context | GPU Setup | Engine | TTFT (s) | Prefill Speed (t/s) |
|---|---|---|---|---|
| Qwen3.6-35B-A3B (184k tokens) | 2 GPUs (6000 + 5090) | vLLM | 10.2s | 18060 t/s |
| Qwen3.6-35B-A3B (184k tokens) | 2 GPUs (6000 + 5090) | llama.cpp | 24.9s | 7405 t/s |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | vLLM | 13.2s | 6212 t/s |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | llama.cpp | 77.0s | 1065 t/s |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | SGLang | Crashed | N/A |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | vLLM | 5.0s | 15084 t/s |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | SGLang | 5.3s | 14177 t/s |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | llama.cpp | 20.6s | 3662 t/s |
| Qwen3.5-397B-A17B (75k tokens) | 7 GPUs (Uneven PP split) | vLLM | 9.8s | 7683 t/s |
| Qwen3.5-397B-A17B (75k tokens) | 7 GPUs (Uneven PP split) | SGLang | Crashed | N/A |
| Qwen3.5-397B-A17B (75k tokens) | 7 GPUs (Uneven PP split) | llama.cpp | 57.2s | 1319 t/s |
I have been running benchmarks on a heterogeneous 7-GPU cluster to evaluate how different inference engines handle long context prefill using pipeline parallelism. My setup includes a mix of Blackwell and Ada cards: one RTX PRO 6000 with 96GB, one PRO 5000 with 48GB, two 5090s with 32GB each, and three modded 4090s with 48GB each. All tests used 4-bit weights (NVFP4 for vLLM and SGLang, MXFP4 for llama.cpp).
The main takeaway is that vLLM significantly outperforms the others on mixed multi-GPU setups for long context prefill. llama.cpp struggles heavily with pipeline parallelism under these conditions, often falling behind by a factor of 4 to 6. This is due to issues with how the execution graph is handled across multiple devices, particularly CPU-side embeddings causing graph splits and pipeline bubbles.
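For anyone who wants to check the "factor of 4 to 6" claim, here is a minimal sketch that recomputes the vLLM vs llama.cpp prefill ratio from the numbers in the results table. The dictionary keys and the variable names are just labels I made up for this example; only the t/s figures come from the benchmarks.

```python
# Prefill speeds in tokens/s, copied from the results table: (vLLM, llama.cpp).
# The keys are informal labels chosen for this example only.
prefill_tps = {
    "Qwen3.6-35B-A3B, 2 GPUs": (18060, 7405),
    "MiniMax-M2.7, 6 GPUs": (6212, 1065),
    "Qwen3.5-122B-A10B, 4 GPUs": (15084, 3662),
    "Qwen3.5-397B-A17B, 7 GPUs": (7683, 1319),
}

# Speedup = vLLM throughput divided by llama.cpp throughput.
speedups = {
    name: round(vllm / llamacpp, 2)
    for name, (vllm, llamacpp) in prefill_tps.items()
}

for name, ratio in speedups.items():
    print(f"{name}: vLLM is {ratio:.2f}x faster")
# Qwen3.6-35B-A3B, 2 GPUs: vLLM is 2.44x faster
# MiniMax-M2.7, 6 GPUs: vLLM is 5.83x faster
# Qwen3.5-122B-A10B, 4 GPUs: vLLM is 4.12x faster
# Qwen3.5-397B-A17B, 7 GPUs: vLLM is 5.82x faster
```

So the gap is roughly 4x to 6x in three of the four runs, and about 2.4x on the small 2-GPU test.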
SGLang performs wonderfully on a pure Blackwell setup but instantly crashes if an Ada card is introduced into the pipeline because it lacks a software fallback for FP4 weights, strictly requiring Compute Capability 10.0. vLLM handles this seamlessly by emulating FP4 on older cards.
Another interesting finding is how well vLLM handles uneven GPU splits. By manually tweaking the layer distribution using the VLLM_PP_LAYER_PARTITION environment variable, I was able to balance the compute load between fast Blackwells and slower 4090s doing FP4 emulation. This eliminated pipeline bottlenecks and resulted in massive speedups even on a 397B model.
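To give an idea of how a partition string can be derived, here is a rough sketch that splits layers across pipeline stages in proportion to a per-GPU weight. Everything in it is an assumption for illustration: the helper `proportional_partition`, the layer count of 60, and the weights are hypothetical, not the values I actually used, and it ignores VRAM limits entirely. As far as I know, vLLM expects one comma-separated entry per pipeline stage and the entries have to add up to the model's layer count.

```python
import math


def proportional_partition(num_layers: int, weights: list[float]) -> list[int]:
    """Split num_layers across stages in proportion to weights.

    Hypothetical helper using the largest-remainder method; it does not
    account for per-GPU memory, only for relative compute weight.
    """
    total = sum(weights)
    ideal = [num_layers * w / total for w in weights]
    counts = [math.floor(x) for x in ideal]

    # Hand the layers lost to rounding to the stages with the largest remainders.
    leftover = num_layers - sum(counts)
    by_remainder = sorted(
        range(len(weights)), key=lambda i: ideal[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1

    if min(counts) < 1:
        raise ValueError("every pipeline stage needs at least one layer")
    return counts


# Made-up example: 60 layers over 7 stages, faster cards get larger weights.
example_weights = [3, 2, 2, 2, 1, 1, 1]
partition = proportional_partition(60, example_weights)
partition_str = ",".join(str(n) for n in partition)

# This string is what would go into the environment before starting vLLM.
print(f"VLLM_PP_LAYER_PARTITION={partition_str}")
# VLLM_PP_LAYER_PARTITION=15,10,10,10,5,5,5
```

In practice the weights are something you tune by watching which stage is the bottleneck, so treat the output as a starting point and not a final answer.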
Here is the summary of the benchmark results.
Benchmark Results
| Model and Context | GPU Setup | Engine | TTFT (s) | Prefill Speed (t/s) |
|---|---|---|---|---|
| Qwen3.6-35B-A3B (184k tokens) | 2 GPUs (6000 + 5090) | vLLM | 7.3s | 18060 t/s |
| Qwen3.6-35B-A3B (184k tokens) | 2 GPUs (6000 + 5090) | SGLang | Crashed | N/A |
| Qwen3.6-35B-A3B (184k tokens) | 2 GPUs (6000 + 5090) | llama.cpp | 24.9s | 7405 t/s |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | vLLM | 13.2s | 6212 t/s |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | SGLang | Crashed | N/A |
| MiniMax-M2.7 (82k tokens) | 6 GPUs (Mixed) | llama.cpp | 77.0s | 1065 t/s |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | vLLM | 5.0s | 15084 t/s |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | SGLang | 5.3s | 14177 t/s |
| Qwen3.5-122B-A10B (75k tokens) | 4 GPUs (Pure Blackwell) | llama.cpp | 20.6s | 3662 t/s |
| Qwen3.5-397B-A17B (75k tokens) | 7 GPUs (Uneven PP split) | vLLM | 9.8s | 7683 t/s |
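| Qwen3.5-397B-A17B (75k tokens) | 7 GPUs (Uneven PP split) | SGLang | Crashed | N/A |
| Qwen3.5-397B-A17B (75k tokens) | 7 GPUs (Uneven PP split) | llama.cpp | 57.2s | 1319 t/s |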




