RTX 5080 16GB: Qwen3.6 35B MoE at 128k context, 56 tok/s, and why MTP doesn't help

The MTP Surprise: Qwen3.6 at 128k context, why MTP doesn’t help

Key Takeaways

For a 16GB RTX 5080, the --fit-target 1536 config with no MTP is fastest at 128k context.
The 35B Q4_K_XL model without MTP achieves ~97 tok/s generation and uses 15.8 GB of VRAM, compared to 14.6 GB with MTP enabled.
At 128k context for coding agents, both --fit-target 0 (no MTP) and --fit-target 1536 (with MTP) deliver the same throughput of ~56 tok/s.
The hybrid architecture of the 35B Q4_K_XL lets it handle contexts of up to 131k tokens without running out of GPU memory, whereas the dense 27B IQ3 model tops out at 56k because of its full attention mechanism.
The -ctk q8_0 -ctv q8_0 configuration for the 27B model extends the maximum context length from 56k to 110k without significantly impacting quality, as measured by CodeNeedle.
For a grade-school math benchmark (GSM8K), the --fit-target 1536 config with no MTP is faster, finishing in 67 minutes compared to 106 minutes for the same task.
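
The jump from 56k to 110k with a q8_0 KV cache is roughly what back-of-the-envelope arithmetic predicts. The sketch below is an estimate, not a measurement, and rests on assumptions not stated above: that the 56k baseline used the f16 KV cache, that the memory budget left for the cache is fixed, and that cache size grows linearly with context. The function name scaled_max_context is made up for illustration. The q8_0 figure comes from the ggml block layout: 32 int8 values plus one fp16 scale, or 34 bytes per 32 values.

```python
# Rough estimate of how KV cache quantization changes the maximum context.
# Assumptions (not from the benchmarks above): the baseline cache type is f16,
# the byte budget for the KV cache is fixed, and cache size is linear in context.

# Bits stored per cached value.
# f16: 16 bits. q8_0: 32 int8 values + one fp16 scale = 34 bytes per 32 values.
BITS_PER_VALUE = {
    "f16": 16.0,
    "q8_0": 34 * 8 / 32,  # 8.5 bits per value
}


def scaled_max_context(base_ctx: float, base_type: str = "f16", new_type: str = "q8_0") -> float:
    """Estimate the max context after switching KV cache type at a fixed byte budget."""
    return base_ctx * BITS_PER_VALUE[base_type] / BITS_PER_VALUE[new_type]


if __name__ == "__main__":
    estimate = scaled_max_context(56_000)
    print(f"q8_0 bits per value: {BITS_PER_VALUE['q8_0']}")
    print(f"estimated max context with q8_0: {estimate:,.0f} tokens")
```

Under these assumptions the estimate comes out near 105k tokens, in the same range as the observed 110k. The remaining gap could come from rounding in the 56k figure or from memory that does not scale with the cache type.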