Dual GPU llama.cpp speedup

Summary

llama.cpp has a long-standing limitation in its --split-mode tensor flag: the mode gives great results on multi-GPU setups but only supports non-quantized KV caches. As a result, many users choose a quantized KV cache, which lets them fit a larger context, and give up tensor parallelism.
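To make the trade-off concrete, here is a rough sketch of how much KV cache storage q8_0 saves compared with f16. It relies on ggml's q8_0 layout (blocks of 32 int8 values plus one f16 scale, 34 bytes per block); the function names are made up for illustration, and the calculation ignores everything else that competes for VRAM.

```python
# Storage cost per KV-cache element: f16 versus q8_0.
# q8_0 stores 32 int8 values plus one 2-byte f16 scale per block (34 bytes).

F16_BYTES_PER_ELEMENT = 2.0
Q8_0_BLOCK_ELEMENTS = 32
Q8_0_BLOCK_BYTES = Q8_0_BLOCK_ELEMENTS * 1 + 2


def q8_0_bytes_per_element() -> float:
    """Average bytes per element for a q8_0 tensor."""
    return Q8_0_BLOCK_BYTES / Q8_0_BLOCK_ELEMENTS


def context_gain(old_bytes: float, new_bytes: float) -> float:
    """How many times more tokens fit in the same KV-cache budget."""
    return old_bytes / new_bytes


if __name__ == "__main__":
    q8 = q8_0_bytes_per_element()
    gain = context_gain(F16_BYTES_PER_ELEMENT, q8)
    print(f"q8_0: {q8:.4f} bytes/element, {gain:.2f}x the context of f16")
```

In other words, a q8_0 cache holds a bit under twice as many tokens as an f16 cache in the same memory, which is why giving it up for tensor parallelism is a hard sell on 12 GB cards.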

Details

I’ve attempted to address this by creating a branch called llama.cpp_qts, based on the mainline code. The changes are minimal. I’m currently running an RTX 3060 (12 GB) together with an RTX 4070 Super (12 GB), for a total of 24 GB of VRAM.

Results

Tensor Split Mode:

Model	Size	Params	Backend	NGL	Batch	UBatch	Type K	Type V	SM	FA	Test	Tokens/s
Qwen3.5 27B Q4_K Medium	15.65 GiB	26.90 B	CUDA	99	128	128	q8_0	q8_0	tensor	1	pp128	544.82 ± 6.01
Qwen3.5 27B Q4_K Medium	15.65 GiB	26.90 B	CUDA	99	128	128	q8_0	q8_0	tensor	1	tg32	30.05 ± 0.38

No Tensor Split Mode:

Model	Size	Params	Backend	NGL	Batch	UBatch	Type K	Type V	FA	Test	Tokens/s
Qwen3.5 27B Q4_K Medium	15.65 GiB	26.90 B	CUDA	99	128	128	q8_0	q8_0	1	pp128	582.60 ± 28.57
Qwen3.5 27B Q4_K Medium	15.65 GiB	26.90 B	CUDA	99	128	128	q8_0	q8_0	1	tg32	21.22 ± 0.52
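For anyone who wants to run the same comparison, the table columns map onto llama-bench options roughly as sketched below. This is an assumption-laden sketch, not the exact command behind the numbers: the model path is hypothetical, the tensor value for -sm comes from this post and may not exist in your build, and I am assuming the second run simply left -sm at its default.

```python
import shlex


def bench_command(model_path, split_mode=None):
    """Build a llama-bench argv matching the settings in the tables above."""
    args = [
        "llama-bench",
        "-m", model_path,   # hypothetical path, substitute your own GGUF
        "-ngl", "99",       # NGL column
        "-b", "128",        # Batch column
        "-ub", "128",       # UBatch column
        "-ctk", "q8_0",     # Type K column
        "-ctv", "q8_0",     # Type V column
        "-fa", "1",         # FA column
        "-p", "128",        # pp128 test
        "-n", "32",         # tg32 test
    ]
    if split_mode is not None:
        # SM column; "tensor" is the value used in this post.
        args += ["-sm", split_mode]
    return args


if __name__ == "__main__":
    model = "models/your-model-q4_k_m.gguf"  # hypothetical
    for mode in ("tensor", None):
        print(shlex.join(bench_command(model, mode)))
```

Running both printed commands back to back on the same machine gives an A/B comparison in the same format as the tables.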

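With tensor split mode, token generation (tg32) is about 42% faster than without it (30.05 vs. 21.22 tokens/s), with no loss in quality. Prompt processing (pp128) is about 6.5% slower (544.82 vs. 582.60 tokens/s), although the no-split run has a wide error margin (± 28.57).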
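The comparison can be checked directly from the table figures. The short sketch below computes the relative change for each test; the dictionary layout and function name are only for illustration.

```python
def percent_change(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# Mean tokens/s taken from the two tables above.
results = {
    "pp128": {"tensor": 544.82, "no_tensor": 582.60},
    "tg32": {"tensor": 30.05, "no_tensor": 21.22},
}

if __name__ == "__main__":
    for test, r in results.items():
        change = percent_change(r["no_tensor"], r["tensor"])
        print(f"{test}: {change:+.1f}%")
```

This prints roughly -6.5% for pp128 and +41.6% for tg32, so the gain is in generation speed rather than prompt processing.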

Additional Notes

This branch also supports the latest MTP (multi-token prediction) changes, specifically: --spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 2
In personal use, I’ve seen generation speed go from around 25 tokens/s to roughly 40 tokens/s, especially when writing stories at shorter context lengths.
I’d like to hear feedback and results from users running dual RTX 5060 Ti or similar setups. Any insight into using this with dual-GPU Vulkan configurations would also be appreciated.

TL;DR

If you’re running dual GPUs, try the -sm tensor flag and see whether you get a similar boost: about 40% faster token generation in my benchmark, and more with MTP enabled.

Additional Note

I’ve recently found an issue with MoE models and -sm tensor. For now, testing against dense models like Qwen3.6 (both 27B and 9B) seems to be the safer approach.
