I ran a quantization shootout on Qwen3-Coder and the results are... interesting

I ran a quantization shootout on Qwen3-Coder-Next and the results are... interesting

Key Takeaways

UD-Q5_K_M is performing better than expected, with file sizes only slightly larger than MXFP4.
The quality improvements in UD-Q5_K_M outweigh the speed trade-offs, especially for tasks requiring high accuracy over speed.
Nvidia hardware may offer different performance characteristics compared to RDNA architecture when using aggressive quantization techniques.

Out of random curiosity I ran a shootout on Qwen3-Coder-Next. I’ve been using the MXFP4_MOE from unsloth for awhile as it’s just really fast on my system. But was curious about precision, so I tested different quantizations to see how they perform.

Hardware

Hardware: 3× R9700 PRO (96 GB VRAM)
Backend: llama.cpp Vulkan
Evaluation dataset: wikitext-2 (583 chunks, ctx 512)
Formats tested: MXFP4_MOE, Q4_K_M, Q5_K_M, UD-Q5_K_M

The results show that UD-Q5_K_M has better quality than its smaller counterparts without any significant speed penalty. The dynamic precision approach used by unsloth is very effective.

The Numbers

Metric	MXFP4	Q4_K_M	Q5_K_M	UD-Q5_K_M
Same top-1 accuracy	89.4%	89.6%	93.0%	94.0%
Mean KL divergence	0.0746	0.0685	0.0308	0.0217
Max KL (worst token)	13.04	5.93	8.19	4.75
File size	44.7 GB	45.2 GB	52.9 GB	55.2 GB

UD-Q5_K_M outperforms all other formats on every quality metric, being only slightly larger than MXFP4.

Token Accuracy and Performance

MXFP4 (89.4%): A 5% difference in per-token agreement becomes a 500× difference by token 100, indicating how quickly errors accumulate.
UD-Q5_K_M (94%): The probability of perfect agreement for a 100-token output is much higher, suggesting that UD-Q5_K_M performs better in long tasks and reasoning scenarios.

For interactive coding tasks which are mostly decode-bound, the speed hit from using more aggressive quantization techniques like Q4_K_M or even MXFP4 compared to UD-Q5_K_M is negligible. However, for daily code generation where quality over speed matters, UD-Q5_K_M provides a clear advantage.

I’ve switched my default format from MXFP4 to UD-Q5_K_M. MXFP4 remains great for heavy prefill workloads but for tasks requiring high-quality outputs like code generation, UD-Q5_K_M is now the preferred choice.

Discussion

What quants are you guys running for code models?: Are you finding similar quality cliffs with aggressive compression techniques?
Are you seeing different trade-offs than RDNA architectures when using Nvidia hardware?

For those interested, here is a link to the original image: Image Preview

Submitted by /u/alphatrad

Source Read original →

I ran a quantization shootout on Qwen3-Coder and the results are… interesting

Key Takeaways

Hardware

The Numbers

Token Accuracy and Performance

Discussion

Empowering Businesses with AI: Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

lobste.rs is now running…

OpenCoreDev Releases Domain SDK…

SpaceXAI’s Grok programming tool…

Key Takeaways

Hardware

The Numbers

Token Accuracy and Performance

Discussion

Related articles

Empowering Businesses with AI: Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

lobste.rs is now running…

OpenCoreDev Releases Domain SDK…

SpaceXAI’s Grok programming tool…