I ran a quantization shootout on Qwen3-Coder and the results are... interesting

Out of curiosity, I ran a shootout on Qwen3-Coder-Next. I’ve been using the MXFP4_MOE quant from unsloth for a while because it’s really fast on my system, but I was curious about precision. I knew quantization hurts the model, but I don’t think I really understood how much until I tested it myself.

Hardware

Hardware: 3× R9700 PRO (96 GB VRAM)

Backend

Backend: llama.cpp Vulkan

Evaluation

Evaluation: wikitext-2 (583 chunks, ctx 512)

Formats tested: MXFP4_MOE, Q4_K_M, Q5_K_M, UD-Q5_K_M

Tldr

The TL;DR: UD-Q5_K_M is cooking! Better quality than the smaller formats, with barely any speed penalty. Unsloth’s dynamic precision approach is really good. I might need to test it at lower quants now.

The Numbers

Metric	MXFP4	Q4_K_M	Q5_K_M	UD-Q5_K_M
Same top-1	89.4%	89.6%	93.0%	94.0%
Mean KL divergence	0.0746	0.0685	0.0308	0.0217
Max KL (worst token)	13.04	5.93	8.19	4.75
File size	44.7 GB	45.2 GB	52.9 GB	55.2 GB
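
For anyone unfamiliar with these metrics, here is a minimal sketch of how same top-1, mean KL, and max KL can be computed from per-token logits. It assumes the numbers above compare each quant against a higher-precision reference run, which the post does not name. The function names and the toy logits are purely illustrative, not the tooling actually used.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats. P is the reference model, Q is the quantized one."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def compare(ref_logits, quant_logits):
    """Return (same_top1_fraction, mean_kl, max_kl) over all token positions."""
    kls = []
    same = 0
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        kls.append(kl_divergence(p, q))
        # "Same top-1": both models pick the same most likely token.
        same += p.index(max(p)) == q.index(max(q))
    n = len(kls)
    return same / n, sum(kls) / n, max(kls)

# Toy example: 3 token positions over a 3-token vocabulary (made-up values).
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [1.0, 1.1, 3.0]]
quant = [[1.9, 1.1, 0.1], [2.4, 0.6, 0.3], [1.0, 1.2, 2.9]]

same_top1, mean_kl, max_kl = compare(ref, quant)
print(f"same top-1: {same_top1:.1%}, mean KL: {mean_kl:.4f}, max KL: {max_kl:.4f}")
```

Mean KL tells you how far the quant drifts on average, while max KL shows the single worst token, which is why a format can look fine on the first and bad on the second.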

UD-Q5_K_M wins on literally every quality metric

UD-Q5_K_M is the clear quality winner on every metric tested, and it is only about 10.5 GB larger than MXFP4. The differences are stark once you compound the top-1 agreement rate over a long output:

MXFP4 (89.4%) → 100-token output: 0.0014% chance of perfect agreement
UD-Q5_K_M (94%) → 100-token output: 0.21% chance of perfect agreement

That is not a small difference when we’re talking about long refactoring tasks or multi-step reasoning, where errors can compound. MXFP4 “goes off the rails” more frequently.
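
The perfect-agreement figures come from raising the same top-1 rate to the power of the output length. A rough sketch of that arithmetic, assuming every token is an independent trial at the average agreement rate (a simplification, since real errors cluster and one divergent token changes everything after it):

```python
def perfect_agreement_probability(top1_agreement, n_tokens):
    """Chance that all n_tokens match the reference, assuming independent tokens."""
    return top1_agreement ** n_tokens

# Same top-1 rates taken from the table above.
rates = {"MXFP4": 0.894, "UD-Q5_K_M": 0.940}

results = {name: perfect_agreement_probability(rate, 100) for name, rate in rates.items()}
for name, p in results.items():
    print(f"{name}: {p:.4%} chance of a fully matching 100-token output")
```

Under that assumption, a 4.6-point gap in per-token agreement becomes roughly a 150x gap over 100 tokens.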

Speed Trade-Offs

Prefill (batch 512): MXFP4 still fastest (hardware kernels)
Prefill (batch 4096): MXFP4 wins again
Decode: Q4_K_M edges UD-Q5 slightly, but UD-Q5 is within 9% despite being 22% larger
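
The "22% larger" figure can be checked against the file sizes in the table. A small sketch (the helper name is just for illustration; the raw speed numbers aren't listed in the post, so only the size side is computed here):

```python
# File sizes in GB, copied from the table above.
sizes_gb = {"MXFP4": 44.7, "Q4_K_M": 45.2, "Q5_K_M": 52.9, "UD-Q5_K_M": 55.2}

def relative_size_increase(candidate, baseline):
    """Fractional size increase of `candidate` over `baseline`."""
    return (sizes_gb[candidate] - sizes_gb[baseline]) / sizes_gb[baseline]

vs_q4 = relative_size_increase("UD-Q5_K_M", "Q4_K_M")
vs_mxfp4 = relative_size_increase("UD-Q5_K_M", "MXFP4")
print(f"UD-Q5_K_M vs Q4_K_M: {vs_q4:.1%} larger")
print(f"UD-Q5_K_M vs MXFP4:  {vs_mxfp4:.1%} larger")
```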

For interactive coding tasks (which are decode-bound anyway), the speed hit from moving to a higher-precision quant is negligible.

Conclusion

I swapped my default from MXFP4 to UD-Q5_K_M. MXFP4 is still great for heavy prefill workloads, but for daily code generation, where quality matters more than speed, UD-Q5 is the clear winner.

What quants are you guys running for code models? Are you finding the same quality cliff with aggressive compression? And if you’re on Nvidia hardware, are you seeing different tradeoffs than RDNA?

Submitted by /u/alphatrad
