I was curious how much precision matters for Qwen3-Coder-Next, so I ran a quantization shootout. I’ve been using the MXFP4_MOE quant from unsloth for my work because it’s very fast, but I wanted to see what that costs in quality.
Hardware
Hardware: 3× R9700 PRO (96 GB VRAM)
Backend: llama.cpp Vulkan
Evaluation dataset: wikitext-2 (583 chunks, context length of 512 tokens)
Formats tested
- MXFP4_MOE
- Q4_K_M
- Q5_K_M
- UD-Q5_K_M
TL;DR: UD-Q5_K_M is the clear winner! It has noticeably better quality than the two ~45 GB formats, and the speed penalty is small. Unsloth’s dynamic precision approach is really good. I might need to test it at lower quants now.
The Numbers
| Metric | MXFP4 | Q4_K_M | Q5_K_M | UD-Q5_K_M |
|---|---|---|---|---|
| Same top-1 accuracy | 89.4% | 89.6% | 93.0% | 94.0% |
| Mean KL divergence | 0.0746 | 0.0685 | 0.0308 | 0.0217 |
| Max KL (worst token) | 13.04 | 5.93 | 8.19 | 4.75 |
| File size (GB) | 44.7 | 45.2 | 52.9 | 55.2 |
UD-Q5_K_M wins on literally every quality metric, while only being ~10 GB larger than MXFP4.
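For anyone unfamiliar with the metrics in the table, here is a minimal sketch of how same-top-1 agreement, mean KL, and max KL can be computed from two sets of logits. It assumes the quantized model is compared against a higher-precision reference run of the same model on the same tokens. The function names and the toy logits are made up for illustration; this is not the code llama.cpp runs.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), where p is the reference distribution and q the quantized one."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def compare(ref_logits, quant_logits):
    """Return (same top-1 rate, mean KL, max KL) over a list of token positions."""
    kls, same = [], 0
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        kls.append(kl_divergence(p, q))
        # Count positions where both models pick the same most likely token.
        same += p.index(max(p)) == q.index(max(q))
    n = len(kls)
    return same / n, sum(kls) / n, max(kls)

# Toy example: 3-token vocabulary, 2 positions (hypothetical numbers).
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quant = [[1.9, 1.1, 0.1], [2.0, 1.8, 0.3]]
top1, mean_kl, max_kl = compare(ref, quant)
print(f"same top-1: {top1:.1%}, mean KL: {mean_kl:.4f}, max KL: {max_kl:.4f}")
```

In the toy data the second position flips its top token, which is exactly the kind of event the "same top-1" row counts, while the KL rows also capture smaller shifts that do not change the top token.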
A ~5-point difference in per-token agreement compounds fast. If you treat each token as an independent chance to match, the odds of staying on the same path differ by roughly 150× by token 100. All LLMs are autoregressive (Yann LeCun is always talking about this), so one divergent token changes everything generated after it. This is where a lot of your hallucinations and stuff happen.
MXFP4 (89.4%)
0.894^100 ≈ 0.0014% chance of 100 matching tokens in a row
UD-Q5_K_M (94%)
0.94^100 ≈ 0.21% chance of 100 matching tokens in a row
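The compounding argument can be checked with a few lines of arithmetic. This is a rough sketch that assumes every token is an independent chance to agree with the reference, which real generation does not strictly satisfy. Under that assumption the ratio between the two formats comes out in the low hundreds, so treat any exact multiplier as illustrative. The helper name is hypothetical.

```python
def match_probability(per_token_agreement, n_tokens):
    """Chance that n_tokens in a row all match, assuming independent tokens."""
    return per_token_agreement ** n_tokens

# Per-token top-1 agreement taken from the table above.
mxfp4 = match_probability(0.894, 100)
ud_q5 = match_probability(0.940, 100)

print(f"MXFP4:     {mxfp4:.4%} chance of 100 matching tokens")
print(f"UD-Q5_K_M: {ud_q5:.4%} chance of 100 matching tokens")
print(f"ratio:     {ud_q5 / mxfp4:.0f}x")
```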
There is a speed trade-off to all of this though.
Prefill (batch 512):
MXFP4 still fastest (hardware kernels)
Prefill (batch 4096):
MXFP4 wins again
Decode:
Q4_K_M edges UD-Q5 slightly, but UD-Q5 is within 9% despite being 22% larger
For interactive coding (which is decode-bound anyway), the speed hit is negligible.
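The size side of the trade-off is easy to recompute from the file sizes in the table. A small sketch (the dictionary and helper name are just for illustration):

```python
# File sizes in GB, copied from the table above.
sizes_gb = {"MXFP4_MOE": 44.7, "Q4_K_M": 45.2, "Q5_K_M": 52.9, "UD-Q5_K_M": 55.2}

def relative_overhead(name, baseline):
    """Fractional size increase of `name` over `baseline`."""
    return sizes_gb[name] / sizes_gb[baseline] - 1.0

for baseline in ("MXFP4_MOE", "Q4_K_M"):
    extra_gb = sizes_gb["UD-Q5_K_M"] - sizes_gb[baseline]
    pct = relative_overhead("UD-Q5_K_M", baseline)
    print(f"UD-Q5_K_M vs {baseline}: +{extra_gb:.1f} GB ({pct:.0%} larger)")
```

All four files fit in the 96 GB of VRAM on this setup, so the extra size mainly shows up as the decode slowdown rather than as a fit problem.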
For me, I swapped my default from MXFP4 to UD-Q5_K_M. MXFP4 is still great for heavy prefill workloads but for daily code generation where you care about quality over speed, UD-Q5 is the clear winner.
What quants are you guys running for code models?
I’m curious if others are finding similar quality cliffs with aggressive compression. And if you’re on Nvidia hardware, are you seeing different trade-offs than RDNA?
Key Takeaways
- UD-Q5_K_M is the clear winner on quality, at the cost of the largest file size (55.2 GB).
- MXFP4 is still very fast for heavy prefill workloads.
- The speed hit for interactive coding is negligible, since that workload is decode-bound and UD-Q5_K_M decodes within 9% of Q4_K_M.
- For code generation tasks where you care about quality over speed, UD-Q5_K_M is recommended.
If you’re on Nvidia hardware and have different experiences with other quantization formats, please share them!