Dual GPU llama.cpp speedup

By AI Maestro May 17, 2026 1 min read

Summary

llama.cpp has a longstanding limitation with the --split-mode tensor flag: it gives great results, but it only supports non-quantized KV caches. This has led many users to prioritize a larger KV cache, which quantization makes possible, and give up tensor parallelism.

Details

I’ve attempted to address this issue in a branch called llama.cpp_qts, which is based on the mainline code. The changes are minimal, and I’m currently running it on an RTX 3060 (12GB) paired with an RTX 4070 Super (12GB), for a total of 24GB of VRAM.

Results

Tensor Split Mode:

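| Model | Size | Params | Backend | NGL | Batch | UBatch | Type K | Type V | SM | FA | Test | Tokens/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | tensor | 1 | pp128 | 544.82 ± 6.01 |
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | tensor | 1 | tg32 | 30.05 ± 0.38 |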

No Tensor Split Mode:

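| Model | Size | Params | Backend | NGL | Batch | UBatch | Type K | Type V | FA | Test | Tokens/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | 1 | pp128 | 582.60 ± 28.57 |
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | 1 | tg32 | 21.22 ± 0.52 |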
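For anyone who wants to reproduce these runs, here is a rough sketch of how the table columns might map onto llama-bench arguments. The model path is a hypothetical placeholder, and the `tensor` value for `-sm` is taken from this post, so it may only be accepted by the branch described above. The no-split run simply leaves `-sm` out, which is an assumption based on the second table having no SM column.

```python
import shlex


def build_bench_cmd(model_path, split_mode=None):
    """Build a llama-bench argument list mirroring the table columns."""
    cmd = [
        "llama-bench",
        "-m", model_path,   # hypothetical path, adjust to your model file
        "-ngl", "99",       # NGL column
        "-b", "128",        # Batch column
        "-ub", "128",       # UBatch column
        "-ctk", "q8_0",     # Type K column (quantized K cache)
        "-ctv", "q8_0",     # Type V column (quantized V cache)
        "-fa", "1",         # FA column (flash attention on)
        "-p", "128",        # pp128 test
        "-n", "32",         # tg32 test
    ]
    if split_mode is not None:
        # "tensor" comes from this post; mainline builds may reject it.
        cmd += ["-sm", split_mode]
    return cmd


MODEL = "models/your-27b-q4_k_m.gguf"  # hypothetical placeholder
tensor_cmd = build_bench_cmd(MODEL, split_mode="tensor")
default_cmd = build_bench_cmd(MODEL)

# Print shell-ready command lines without executing anything.
print(shlex.join(tensor_cmd))
print(shlex.join(default_cmd))
```

Building the command as a list keeps the two runs identical apart from the split mode, which is the only variable the comparison is meant to isolate.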

With the tensor split mode, token generation is about 40% faster than without it (30.05 vs. 21.22 tokens/s), without any loss in quality. Prompt processing is slightly slower (544.82 vs. 582.60 tokens/s).
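To make the comparison concrete, the relative change can be worked out directly from the Tokens/s columns of the two tables. This is just arithmetic on the reported means (the ± spreads are ignored), and the helper name `percent_change` is made up for illustration.

```python
# Mean tokens/s taken from the two benchmark tables above.
tensor = {"pp128": 544.82, "tg32": 30.05}
no_split = {"pp128": 582.60, "tg32": 21.22}


def percent_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new / old - 1.0) * 100.0


changes = {test: percent_change(tensor[test], no_split[test]) for test in tensor}

for test, pct in changes.items():
    print(f"{test}: {pct:+.1f}%")
# pp128: -6.5%
# tg32: +41.6%
```

So the gain shows up in token generation (tg32), while prompt processing (pp128) is a few percent slower in this particular run.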

Additional Notes

  • This branch also supports the latest MTP changes, specifically: --spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 2
  • In personal use, I’ve seen tokens per second improve from around 25 TPS to approximately 40 TPS, especially when writing stories within shorter contexts.
  • I’m curious about feedback and results from users running dual RTX 5060 Ti or similar setups. Additionally, any insights into using this with dual Vulkan configurations would be appreciated.
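As a rough illustration of how the MTP flags from the first bullet might be combined with tensor split and a quantized KV cache, here is a sketch that assembles a launch command. Using `llama-server` as the binary is an assumption (the post does not say which program was used), the model path is a hypothetical placeholder, and the `--spec-*` flags are copied verbatim from the bullet above, so they may only exist on this branch.

```python
import shlex


def build_server_cmd(model_path):
    """Assemble a launch command combining tensor split with the MTP flags."""
    return [
        "llama-server",              # assumed binary, not stated in the post
        "-m", model_path,            # hypothetical path
        "-ngl", "99",                # offload all layers, as in the benchmarks
        "-sm", "tensor",             # split mode discussed in this post
        "-ctk", "q8_0",              # quantized K cache
        "-ctv", "q8_0",              # quantized V cache
        # Speculative decoding flags exactly as quoted in the post:
        "--spec-type", "draft-mtp",
        "--spec-draft-p-min", "0.75",
        "--spec-draft-n-max", "2",
    ]


server_cmd = build_server_cmd("models/your-27b-q4_k_m.gguf")
print(shlex.join(server_cmd))
```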

TL;DR

If you’re running dual GPUs, I recommend trying out the -sm tensor flag to see if it gives you a similar 40-50% speed boost!

Additional Note

I’ve recently identified an issue with MoE models and -sm tensor. For now, testing against dense models like Qwen3.6 (both 27B and 9B) seems to be the safer approach.
