Qwen3.6 35B-A3B MTP hits 249 t/s on a 24GB consumer GPU (RTX 5090M) — 3.4× the dense 27B variant on the same image

Sharing this because I didn’t believe the first run. Key Takeaways The 35B-A3B variant runs at a throughput of 249 t/s on…

By AI Maestro May 23, 2026 2 min read
Qwen3.6 35B-A3B MTP hits 249 t/s on a 24GB consumer GPU (RTX 5090M) — 3.4× the dense 27B variant on the same image

Sharing this because I didn’t believe the first run.

Key Takeaways

  • The 35B-A3B variant runs at a throughput of 249 t/s on an RTX 5090 laptop with 24GB of memory, which is 3.4 times faster than the 27B dense variant under the same conditions.
  • This result is attributed to the model’s MoE architecture, where each forward pass involves running approximately 3 billion parameters rather than all 27 billion in the dense variant. The MTP (Model-Tuning Post-processing) mechanism also plays a key role in this performance boost.
  • The context size remains flat at around 262K tokens, indicating that increasing the input length does not significantly impact the model’s throughput under these conditions.

For those interested in reproducing these results, I’ve shared the Docker image used: aamsellem/llama-cpp-mtp:master-ad27757. The recipe is available for standalone building from the master branch of llama.cpp.

Setup Details:

  • Hardware: RTX 5090, 24GB memory (Blackwell SM)
  • Operating System: Linux
  • Model: unsloth/Qwen3.6-35B-A3B-MTP-GGUF (based on the master branch of llama.cpp, including MTP merge (#22673), n_max=3 default cleanup (#23269), and NVIDIA backend sampling work (#23287))
  • Args: –spec-type draft-mtp –spec-draft-n-max 3 –ctx-size 262144 –cache-type-k q4_0 –cache-type-v q4_0 –batch-size 512 –ubatch-size 512 –parallel 1 –flash-attn on –chat-template-kwargs ‘{"enable_thinking": false}’
  • Caveats: Thinking mode must remain off. The MTP draft heads are trained for non-thinking outputs, and re-enabling thinking drops acceptance rates back to around 40%. Q3_K_XL is the largest quantized version that fits within 24GB of memory.

What threw me: I ran the 27B dense MTP variant in the exact same image / args / context for comparison. The smaller 27B variant runs at a throughput of 74.28 t/s, while the larger 35B-A3B variant hits 249.30 t/s.

The math checks out once you stop being surprised: the 35B-A3B is MoE with 128 experts + 1 shared, and the router pulls ~8 experts per token. So ~3 billion params actually run per forward pass. The 27B dense pushes all 27B every token, resulting in a per-token compute that’s roughly 9× lower on the MoE variant. Additionally, MTP with n_max=3 lands at an acceptance rate of around 86.6%, which is significantly higher than the default n_max=5.

The context scaling stayed flat: from a context size of 32K to 262K, the throughput remained relatively stable.

Throughput vs Context Size Chart
Context scaling remains flat across different input sizes (data not shown).

The RTX 5090 — with half the desktop’s memory bandwidth on paper — managed to achieve this impressive throughput of 249 t/s. The key here is the MoE-A3B architecture, which allows for efficient parallelization and reduced compute per token compared to a dense model.

Curious to see what a desktop RTX 5090 (with more memory) might achieve under similar conditions, I encourage others to run this experiment. If anyone runs Qwen3.6-35B-A3B-MTP-GGUF + master llama.cpp on their own hardware and shares the results, please let me know.


Originally published at reddit.com. Curated by AI Maestro.

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Name
Scroll to Top