110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

By AI Maestro May 21, 2026 1 min read
I had been getting great MTP (multi-token prediction) performance with llama.cpp on my RTX 4070 Super 12GB until they merged the MTP PR. After that, performance tanked and was barely above non-MTP speeds. I then decided to try ik_llama.cpp, since it also supports MTP and is apparently better optimized for CPU offloading.

Key Takeaways

  • Switching from llama.cpp to ik_llama.cpp gave an average of 110.24 tok/s, a 22% increase.
  • To achieve similar results on a 12GB RTX GPU with ik_llama.cpp, use the following launch parameters:

    llama-server \
      -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
      --fit \
      --fit-margin 1664 \
      --ctx-size 131072 \
      --cache-type-k q8_0 \
      --cache-type-v q8_0 \
      --cache-type-k-draft q8_0 \
      --cache-type-v-draft q8_0 \
      --multi-token-prediction \
      --draft-p-min 0.75 \
      --draft-max 3 \
      --no-mmap \
      --mlock \
      --threads 8 \
      --temp 0.0
    
  • If you encounter an OOM error, try increasing --fit-margin to 1792 or even 2048.
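
If you would rather not edit the command by hand each time it runs out of memory, the margin bump can be scripted. The following is only a rough sketch: try_margins is a hypothetical helper name, and it assumes the launch command exits with a nonzero status when the model fails to load, which I have not verified for ik_llama.cpp. It appends --fit-margin itself, so leave that flag out of the command you pass in.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rerun a launch command with increasing --fit-margin
# values (the three values mentioned in this post) until one succeeds.
# Assumption: the command returns a nonzero exit status on a failed load.
try_margins() {
  local margin
  for margin in 1664 1792 2048; do
    echo "Trying --fit-margin ${margin}" >&2
    # "$@" is the launch command exactly as passed in; the margin is appended.
    if "$@" --fit-margin "${margin}"; then
      return 0
    fi
  done
  echo "All margins failed" >&2
  return 1
}

# Example (not run here): pass the launch command without --fit-margin, e.g.
#   try_margins llama-server -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf --fit ...
```

Note that a server that loads fine and later exits with an error would also trigger the next attempt, so treat this as a convenience for the first launch rather than a supervisor.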

My 4070 Super runs as a secondary GPU under CachyOS, so the model can use all available VRAM. If your setup leaves less VRAM free and you get an OOM error when loading the model, raise --fit-margin accordingly.
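
To see how much VRAM is actually free before launching, you can ask nvidia-smi. This is a small sketch, not part of my original setup: vram_free_mib is a made-up helper name, and it simply subtracts used from total memory for each GPU, with values in MiB as nvidia-smi reports them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "used, total" pairs (MiB, one line per GPU)
# from stdin and print the free amount for each GPU.
vram_free_mib() {
  awk -F', ' '{ print $2 - $1 }'
}

# Example (not run here, requires an NVIDIA driver):
#   nvidia-smi --query-gpu=memory.used,memory.total \
#              --format=csv,noheader,nounits | vram_free_mib
```

If the number is noticeably below the card's full 12GB before the model is even loaded, something else is holding VRAM and a larger margin is probably needed.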
