Summary
llama.cpp has a longstanding limitation with the --split-mode tensor flag: it gives great results but only supports non-quantized KV caches. As a result, many users opt for a larger (quantized) KV cache and skip tensor parallelism altogether.
Details
I’ve attempted to address this by creating a branch called llama.cpp_qts, which is based on the mainline code. The changes are minimal. I’m currently running it on an RTX 3060 (12GB) paired with an RTX 4070 Super (12GB), for a total of 24GB of VRAM.
Results
Tensor Split Mode:
| Model | Size | Params | Backend | NGL | Batch | UBatch | Type K | Type V | SM | FA | Test | Tokens/s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | tensor | 1 | pp128 | 544.82 ± 6.01 |
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | tensor | 1 | tg32 | 30.05 ± 0.38 |
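
For anyone who wants to post comparable numbers, the columns above map onto llama-bench options. The sketch below only builds and prints a command line; it is a guess at how such a run could be set up, not the exact command used for this post. The model path is a hypothetical placeholder, and llama-bench is assumed to be built from the branch and available on your PATH.

```python
import shlex

# Hypothetical path: point this at your own GGUF file.
MODEL_PATH = "models/your-model-q4_k_m.gguf"


def bench_command(split_mode=None):
    """Build a llama-bench argument list mirroring the table columns."""
    args = [
        "llama-bench",
        "-m", MODEL_PATH,
        "-ngl", "99",    # NGL column
        "-b", "128",     # Batch column
        "-ub", "128",    # UBatch column
        "-ctk", "q8_0",  # Type K column
        "-ctv", "q8_0",  # Type V column
        "-fa", "1",      # FA column
        "-p", "128",     # pp128 test
        "-n", "32",      # tg32 test
    ]
    if split_mode is not None:
        # Omitting -sm leaves llama-bench on its default split mode.
        args += ["-sm", split_mode]
    return args


if __name__ == "__main__":
    print(shlex.join(bench_command("tensor")))  # tensor split run
    print(shlex.join(bench_command()))          # baseline run
```

Running the second printed command (without -sm) should give the baseline for comparison.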
No Tensor Split Mode:
| Model | Size | Params | Backend | NGL | Batch | UBatch | Type K | Type V | FA | Test | Tokens/s |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | 1 | pp128 | 582.60 ± 28.57 |
| Qwen3.5 27B Q4_K Medium | 15.65 GiB | 26.90 B | CUDA | 99 | 128 | 128 | q8_0 | q8_0 | 1 | tg32 | 21.22 ± 0.52 |
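With tensor split mode, token generation is about 40% faster than without it (30.05 vs 21.22 tokens/s on tg32), with no loss in quality that I can observe. Prompt processing is slightly slower (544.82 vs 582.60 tokens/s on pp128).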
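
As a quick sanity check on those percentages, here is the arithmetic using the mean tokens/s values from the two tables. It ignores the ± spread reported alongside each mean, so treat the results as rough figures.

```python
def percent_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# Mean tokens/s from the tables: tensor split vs. no tensor split.
tg_change = percent_change(30.05, 21.22)    # token generation (tg32)
pp_change = percent_change(544.82, 582.60)  # prompt processing (pp128)

print(f"tg32:  {tg_change:+.1f}%")   # prints "tg32:  +41.6%"
print(f"pp128: {pp_change:+.1f}%")   # prints "pp128: -6.5%"
```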
Additional Notes
- This branch also supports the latest MTP changes, specifically:
  --spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 2
- In personal use, I’ve noticed a significant improvement in tokens per second, from around 25 TPS to approximately 40 TPS, especially when writing stories within shorter contexts.
- I’m curious about feedback and results from users running dual RTX 5060 Ti or similar setups. Additionally, any insights into using this with dual Vulkan configurations would be appreciated.
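
To make the flag combination above concrete, here is a small sketch that assembles a launch command with tensor split, a q8_0 KV cache, and the MTP options quoted in this post. Several things are assumptions: that you launch through llama-server, that the model path (a hypothetical placeholder) points at an MTP-capable GGUF, and that the --spec-* flags behave on your build as they do on this branch. The script only prints the command; it does not start anything.

```python
import shlex

# Hypothetical path: replace with your own model file.
MODEL_PATH = "models/your-model-q4_k_m.gguf"


def server_command(use_mtp=True):
    """Build a llama-server argument list for a dual-GPU tensor split run."""
    args = [
        "llama-server",
        "-m", MODEL_PATH,
        "-ngl", "99",      # offload all layers to the GPUs
        "-sm", "tensor",   # tensor split mode, as discussed in this post
        "-ctk", "q8_0",    # quantized K cache
        "-ctv", "q8_0",    # quantized V cache
    ]
    if use_mtp:
        # MTP speculative decoding flags, copied verbatim from the post.
        args += [
            "--spec-type", "draft-mtp",
            "--spec-draft-p-min", "0.75",
            "--spec-draft-n-max", "2",
        ]
    return args


if __name__ == "__main__":
    print(shlex.join(server_command()))
```

Calling server_command(use_mtp=False) gives the same command without speculative decoding, which is handy for an A/B comparison.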
TL;DR
If you’re running dual GPUs, I recommend trying the -sm tensor flag to see whether you get a comparable speed boost (about 40% on token generation in my tests)!
Additional Note
I’ve recently identified an issue with MoE models and -sm tensor. For now, testing against dense models like Qwen3.6 (both 27B and 9B) seems to be the safer approach.