Had been getting great MTP performance with llama.cpp on my RTX 4070 Super 12GB until they merged the MTP PR. Performance tanked and was barely above non-MTP. I then decided to try out ik_llama.cpp since it also supports MTP and is apparently better optimized for CPU offloading.
Key Takeaways
- Switching from llama.cpp to ik_llama.cpp resulted in a 110.24 tok/s average, or a 22% increase.
- To achieve similar results on a 12GB RTX GPU using ik_llama.cpp, ensure you use the provided launch parameters:
llama-server \ -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \ --fit \ --fit-margin 1664 \ --ctx-size 131072 \ --cache-type-k q8_0 \ --cache-type-v q8_0 \ --cache-type-k-draft q8_0 \ --cache-type-v-draft q8_0 \ --multi-token-prediction \ --draft-p-min 0.75 \ --draft-max 3 \ --no-mmap \ --mlock \ --threads 8 \ --temp 0.0
I am running my GPU as a secondary GPU with CachyOS and can use all available VRAM. If you get an OOM error when loading the model, consider adjusting these parameters accordingly.
Source Read original →
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




