I ran a quantization shootout on Qwen3-Coder and the results are… interesting

By AI Maestro May 22, 2026 2 min read

Out of random curiosity, I ran a shootout on Qwen3-Coder-Next. I’ve been using the MXFP4_MOE from unsloth for a while as it’s just really fast on my system, but I was curious about precision. I know quantization hurts the model, but I don’t think I had really understood that until I tested it myself.

  • Hardware: 3× R9700 PRO (96 GB VRAM)
  • Backend: llama.cpp Vulkan
  • Evaluation: wikitext-2 (583 chunks, ctx 512)
  • Formats tested: MXFP4_MOE, Q4_K_M, Q5_K_M, UD-Q5_K_M

TL;DR

UD-Q5_K_M is cooking! Better quality than every smaller format I tested, with barely any speed penalty. Unsloth’s dynamic precision approach is really good. I might need to test it at lower quants now.

The Numbers

Metric                 MXFP4     Q4_K_M    Q5_K_M    UD-Q5_K_M
Same top-1             89.4%     89.6%     93.0%     94.0%
Mean KL divergence     0.0746    0.0685    0.0308    0.0217
Max KL (worst token)   13.04     5.93      8.19      4.75
File size              44.7 GB   45.2 GB   52.9 GB   55.2 GB
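
For anyone unsure what these three quality metrics measure, here is a minimal sketch of how they can be computed from per-token logits. The post does not say which reference model the quants were compared against or how the numbers were produced, so the `base` and `quant` logits below are made-up toy values and the helper names (`log_softmax`, `kl_divergence`, `compare`) are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(base_logits, quant_logits):
    """KL(base || quant) in nats for one token position."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def compare(base_rows, quant_rows):
    """Aggregate the three metrics over all token positions."""
    kls = [kl_divergence(b, q) for b, q in zip(base_rows, quant_rows)]
    same = sum(argmax(b) == argmax(q) for b, q in zip(base_rows, quant_rows))
    return {
        "same_top1": same / len(kls),    # share of positions with the same most likely token
        "mean_kl": sum(kls) / len(kls),  # average distribution shift
        "max_kl": max(kls),              # worst single position
    }

# Toy data (assumption): 3 token positions, vocabulary of 3 tokens.
base = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.0], [1.0, 1.2, 3.0]]
quant = [[1.9, 1.1, 0.1], [2.0, 1.0, 0.0], [1.0, 1.1, 3.1]]
stats = compare(base, quant)
print(stats)
```

In this toy run the second position flips its top token, so it lowers "same top-1" and also dominates the max KL, which is why the worst-token number can look alarming even when the mean is small.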

UD-Q5_K_M wins on literally every quality metric

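UD-Q5_K_M is the clear winner on quality across all metrics tested, and it is only about 10.5 GB larger than MXFP4. The differences are stark once you compound per-token agreement over a long output (treating each token as independent, the chance that all 100 tokens match is the top-1 rate raised to the 100th power):

  • MXFP4 (89.4%) → 100-token output: 0.0014% chance of perfect agreement
  • UD-Q5_K_M (94%) → 100-token output: 0.21% chance of perfect agreement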
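
The two percentages above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes every token agrees or disagrees independently of the others, which real generations do not strictly satisfy; the function name is hypothetical.

```python
def perfect_agreement_probability(top1_rate, n_tokens):
    """Chance that all n_tokens match, assuming independent per-token agreement."""
    return top1_rate ** n_tokens

# Top-1 agreement rates taken from the table in the post.
rates = {"MXFP4": 0.894, "UD-Q5_K_M": 0.940}

results = {name: perfect_agreement_probability(rate, 100) for name, rate in rates.items()}
for name, p in results.items():
    print(f"{name}: {p * 100:.4f}% chance of a fully matching 100-token output")

# How many times more likely a perfect match is with UD-Q5_K_M.
ratio = results["UD-Q5_K_M"] / results["MXFP4"]
print(f"ratio: {ratio:.0f}x")
```

Both absolute numbers are tiny, so the interesting part is the ratio between them rather than either value on its own.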

That gap, roughly 150×, is not small when we’re talking about long refactoring tasks or multi-step reasoning where errors compound. MXFP4 “goes off the rails” more often.

Speed Trade-Offs

  • Prefill (batch 512): MXFP4 still fastest (hardware kernels)
  • Prefill (batch 4096): MXFP4 wins again
  • Decode: Q4_K_M edges UD-Q5 slightly, but UD-Q5 is within 9% despite being 22% larger
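
As a quick sanity check on the size figures, the sketch below recomputes the relative sizes from the file sizes in the table. Only the sizes come from the post; the helper name is hypothetical, and no speed numbers are computed because the post gives none in absolute terms.

```python
# File sizes in GB, copied from the table in the post.
sizes_gb = {"MXFP4": 44.7, "Q4_K_M": 45.2, "Q5_K_M": 52.9, "UD-Q5_K_M": 55.2}

def percent_larger(a_gb, b_gb):
    """How much larger a is than b, in percent."""
    return (a_gb / b_gb - 1.0) * 100.0

ud_vs_q4 = percent_larger(sizes_gb["UD-Q5_K_M"], sizes_gb["Q4_K_M"])
ud_vs_mxfp4_gb = sizes_gb["UD-Q5_K_M"] - sizes_gb["MXFP4"]

print(f"UD-Q5_K_M vs Q4_K_M: {ud_vs_q4:.1f}% larger")
print(f"UD-Q5_K_M vs MXFP4: {ud_vs_mxfp4_gb:.1f} GB larger")
```

This lines up with the "22% larger" comparison against Q4_K_M in the decode bullet.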

For interactive coding tasks (which are decode-bound anyway), the speed hit from using a higher-precision quant is negligible.

Conclusion

I swapped my default from MXFP4 to UD-Q5_K_M. MXFP4 is still great for heavy prefill workloads, but for daily code generation where quality matters more than speed, UD-Q5 is the clear winner.

What quants are you guys running for code models? Are you finding the same quality cliff with aggressive compression? And if you’re on Nvidia hardware, are you seeing different tradeoffs than RDNA?

Submitted by /u/alphatrad


Originally published at reddit.com. Curated by AI Maestro.
