110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

I had been getting great MTP (multi-token prediction) performance with llama.cpp on my RTX 4070 Super 12GB until the MTP PR was merged. After that, performance tanked and was barely above non-MTP. I decided to try ik_llama.cpp instead, since it also supports MTP and is apparently better optimized for CPU offloading.

Key Takeaways

Switching from llama.cpp to ik_llama.cpp gave an average of 110.24 tok/s, a 22% increase.
To get similar results on a 12GB RTX GPU with ik_llama.cpp, use the following launch parameters:

llama-server \
-m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
--fit \
--fit-margin 1664 \
--ctx-size 131072 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--cache-type-k-draft q8_0 \
--cache-type-v-draft q8_0 \
--multi-token-prediction \
--draft-p-min 0.75 \
--draft-max 3 \
--no-mmap \
--mlock \
--threads 8 \
--temp 0.0

If you encounter an OOM error, try increasing --fit-margin to 1792 or even 2048.
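
One way to step through those margins without retyping the command is a small wrapper script. The following is only a sketch: `launch`, `try_margins`, and `DRY_RUN` are names made up for this example, the flags are copied verbatim from the command above, and it assumes that llama-server exits with a non-zero status when the model fails to load.

```shell
#!/bin/sh
# Hypothetical helper: build the launch command for a given --fit-margin.
# All flags are copied from the post; none are verified independently here.
launch() {
  margin="$1"
  set -- llama-server \
    -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
    --fit --fit-margin "$margin" \
    --ctx-size 131072 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --cache-type-k-draft q8_0 --cache-type-v-draft q8_0 \
    --multi-token-prediction --draft-p-min 0.75 --draft-max 3 \
    --no-mmap --mlock --threads 8 --temp 0.0
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"   # print the command instead of running it
  else
    "$@"        # run the server; assumed to return non-zero on a failed load
  fi
}

# Try the original margin first, then the two larger values from the post.
try_margins() {
  for m in 1664 1792 2048; do
    launch "$m" && return 0
    echo "launch with --fit-margin $m failed, trying the next value" >&2
  done
  return 1
}
```

Setting `DRY_RUN=1` before calling `launch 1792` prints the command so it can be checked first; calling `try_margins` runs the real thing. Note that stopping a working server with a signal may also produce a non-zero exit, in which case the loop would move on to the next margin.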

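I run this GPU as a secondary card under CachyOS, so all of its VRAM is available to the model. If your GPU is also used for something else (for example, driving a display), less VRAM will be free; if you hit an OOM error while loading the model, adjust the parameters above accordingly.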
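
Before launching, it can help to see how much VRAM is actually free on each card. A minimal sketch using `nvidia-smi`, which ships with the NVIDIA driver; `vram_summary` is a hypothetical helper name, and the values are in MiB as reported by the tool:

```shell
#!/bin/sh
# Hypothetical helper: read "index, used, total" lines (MiB) from stdin
# and print a one-line summary per GPU, including the free amount.
vram_summary() {
  awk -F', *' '{ printf "GPU %d: %d MiB used, %d MiB free of %d MiB\n", $1, $2, $3 - $2, $3 }'
}

# Only query when the NVIDIA driver tools are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,memory.used,memory.total \
             --format=csv,noheader,nounits | vram_summary
fi
```

If the card you plan to use already shows a sizeable "used" figure before the model loads, that is a hint that a larger --fit-margin may be needed.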
