The MTP Surprise: Qwen3.6 at 128k context — why MTP doesn’t help
Key Takeaways
- For a 16GB RTX 5080, the `--fit-target 1536` config with no MTP is fastest at 128k context.
- The 35B Q4_K_XL model without MTP achieves ~97 tok/s generation and uses 15.8 GB of VRAM, compared to 14.6 GB with MTP enabled.
- At 128k context lengths for coding agents, both `--fit-target 0` (no MTP) and `--fit-target 1536` (with MTP) result in the same throughput of ~56 tok/s.
- The hybrid architecture of the 35B Q4_K_XL lets it handle context lengths up to 131k without running out of GPU memory, whereas the dense 27B IQ3 model tops out at 56k because of its full attention mechanism.
- The `-ctk q8_0 -ctv q8_0` configuration for the 27B model extends the maximum context length from 56k to 110k without significantly impacting quality, as measured by CodeNeedle.
- For a task like grade-school math problems (GSM8K), the `--fit-target 1536` config with no MTP is faster and more efficient, taking only 67 minutes compared to 106 minutes for the same task.
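
The jump from 56k to 110k with a q8_0 KV cache can be sanity-checked with a little arithmetic. The sketch below is a rough estimate, not a measurement from the original post. It assumes the 56k limit was reached with an f16 KV cache, that "56k" means 56,000 tokens, and that the KV cache is the only memory consumer that changes. The helper name `scaled_max_context` is hypothetical.

```python
# Bits stored per cached element for two KV cache types.
# f16 stores 16 bits per value; q8_0 packs 32 int8 values plus one
# 16-bit scale into a 34-byte block, i.e. 8.5 bits per value.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 34 * 8 / 32}


def scaled_max_context(base_tokens: int, base_type: str, new_type: str) -> int:
    """Estimate the context that fits in the same KV cache budget.

    Assumption: cache size grows linearly with token count and nothing
    else in VRAM changes when the cache type is switched.
    """
    ratio = BITS_PER_ELEMENT[base_type] / BITS_PER_ELEMENT[new_type]
    return int(base_tokens * ratio)


if __name__ == "__main__":
    # Assumed baseline: 56,000 tokens with an f16 cache.
    estimate = scaled_max_context(56_000, "f16", "q8_0")
    print(f"Estimated max context with q8_0 KV cache: ~{estimate:,} tokens")
```

Under these assumptions the estimate comes out a little above 105k tokens, in the same range as the reported 110k. The remaining gap could come from rounding in the reported figures or from other memory effects this sketch ignores.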
Originally published at reddit.com. Curated by AI Maestro.




