I ran a quantization shootout on Qwen3-Coder-Next and the results are… interesting

By AI Maestro May 22, 2026 2 min read

Key Takeaways

  • UD-Q5_K_M is performing better than expected, though its file is about 23% larger than MXFP4 (55.2 GB vs. 44.7 GB).
  • The quality improvements in UD-Q5_K_M outweigh the speed trade-offs, especially for tasks requiring high accuracy over speed.
  • Nvidia hardware may offer different performance characteristics compared to RDNA architecture when using aggressive quantization techniques.

Out of random curiosity I ran a shootout on Qwen3-Coder-Next. I’ve been using the MXFP4_MOE quant from unsloth for a while because it’s really fast on my system, but I was curious about precision, so I tested several quantizations to see how they compare.

Test Setup

  • Hardware: 3× R9700 PRO (96 GB VRAM)
  • Backend: llama.cpp Vulkan
  • Evaluation dataset: wikitext-2 (583 chunks, ctx 512)
  • Formats tested: MXFP4_MOE, Q4_K_M, Q5_K_M, UD-Q5_K_M

The results show that UD-Q5_K_M has better quality than its smaller counterparts without any significant speed penalty. The dynamic precision approach used by unsloth is very effective.

The Numbers

Metric               | MXFP4   | Q4_K_M  | Q5_K_M  | UD-Q5_K_M
Same top-1 accuracy  | 89.4%   | 89.6%   | 93.0%   | 94.0%
Mean KL divergence   | 0.0746  | 0.0685  | 0.0308  | 0.0217
Max KL (worst token) | 13.04   | 5.93    | 8.19    | 4.75
File size            | 44.7 GB | 45.2 GB | 52.9 GB | 55.2 GB
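
The three quality rows are per-token comparisons between a quantized model and a reference. The post does not say which reference or tool produced them, so the following is only a minimal sketch of how such metrics are commonly defined, assuming a higher-precision baseline supplies the reference logits. The function names and the toy logits are made up for illustration.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, quant_logits):
    """KL(ref || quant) in nats for a single token position."""
    ref_lp = log_softmax(ref_logits)
    quant_lp = log_softmax(quant_logits)
    return sum(math.exp(r) * (r - q) for r, q in zip(ref_lp, quant_lp))

def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=lambda i: xs[i])

def same_top1(ref_logits, quant_logits):
    """True when both models rank the same token first."""
    return argmax(ref_logits) == argmax(quant_logits)

def summarize(pairs):
    """Aggregate per-position results into the three quality metrics."""
    kls = [kl_divergence(r, q) for r, q in pairs]
    agree = sum(same_top1(r, q) for r, q in pairs)
    return {
        "same_top1": agree / len(pairs),
        "mean_kl": sum(kls) / len(kls),
        "max_kl": max(kls),
    }

# Toy 4-token vocabulary, two positions (hypothetical logits, not measured data).
pairs = [
    ([2.0, 1.0, 0.1, -1.0], [2.0, 1.0, 0.1, -1.0]),  # identical distributions
    ([2.0, 1.0, 0.1, -1.0], [1.0, 2.0, 0.1, -1.0]),  # top-1 token flipped
]
stats = summarize(pairs)
print(stats)
```

Read this way, mean KL reflects average drift from the reference, while max KL flags the single worst position, which is why a format can look fine on average and still have a large outlier.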

UD-Q5_K_M outperforms all other formats on every quality metric, at a cost of about 10.5 GB (roughly 23%) more disk space than MXFP4.
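
To put size and quality side by side, the table's figures can be expressed relative to MXFP4. This is just arithmetic on the reported numbers; the helper name `relative_change` is an illustrative choice, not something from the original post.

```python
# File size and mean KL divergence as reported in the results table.
results = {
    "MXFP4":     {"size_gb": 44.7, "mean_kl": 0.0746},
    "Q4_K_M":    {"size_gb": 45.2, "mean_kl": 0.0685},
    "Q5_K_M":    {"size_gb": 52.9, "mean_kl": 0.0308},
    "UD-Q5_K_M": {"size_gb": 55.2, "mean_kl": 0.0217},
}

def relative_change(value, base):
    """Percentage change of value relative to base."""
    return (value / base - 1.0) * 100.0

baseline = results["MXFP4"]
for name, r in results.items():
    size_pct = relative_change(r["size_gb"], baseline["size_gb"])
    kl_pct = relative_change(r["mean_kl"], baseline["mean_kl"])
    print(f"{name:10s} size {size_pct:+6.1f}%   mean KL {kl_pct:+6.1f}%")
```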

Token Accuracy and Performance

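  • MXFP4 (89.4%): A gap of under 5 percentage points in per-token agreement compounds to roughly a 150× difference in the chance of a fully matching 100-token output (0.94^100 vs. 0.894^100, treating tokens as independent), which shows how quickly errors accumulate.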
  • UD-Q5_K_M (94%): The probability of perfect agreement for a 100-token output is much higher, suggesting that UD-Q5_K_M performs better in long tasks and reasoning scenarios.
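
The compounding argument can be checked with a few lines of arithmetic. This sketch assumes each token position agrees independently with the same probability, which is a simplification (real errors are correlated), and the function name is hypothetical. With the table's agreement rates, the gap at 100 tokens comes out on the order of 150×.

```python
def perfect_agreement_prob(per_token_agreement, n_tokens):
    """Chance that all n_tokens match, assuming independent positions."""
    return per_token_agreement ** n_tokens

# Per-token top-1 agreement rates taken from the results table.
mxfp4 = perfect_agreement_prob(0.894, 100)
ud_q5 = perfect_agreement_prob(0.940, 100)
ratio = ud_q5 / mxfp4

print(f"MXFP4: {mxfp4:.2e}  UD-Q5_K_M: {ud_q5:.2e}  ratio: {ratio:.0f}x")
```

Both probabilities are tiny in absolute terms, so the practical takeaway is the relative gap rather than the raw chance of a perfect match.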

For interactive coding tasks, which are mostly decode-bound, the speed hit from choosing UD-Q5_K_M over more aggressive quantizations like Q4_K_M or even MXFP4 is negligible. And for daily code generation, where quality matters more than speed, UD-Q5_K_M provides a clear advantage.

I’ve switched my default format from MXFP4 to UD-Q5_K_M. MXFP4 remains great for heavy prefill workloads, but for tasks that need high-quality output, like code generation, UD-Q5_K_M is now my preferred choice.

Discussion

  • What quants are you guys running for code models? Are you finding similar quality cliffs with aggressive compression?
  • If you’re on Nvidia hardware, are you seeing different trade-offs than on RDNA?

Submitted by /u/alphatrad


Originally published at reddit.com. Curated by AI Maestro.
