Quantization Shootout on Qwen3-Coder-Next
I ran a quantization shootout to compare different precision levels of Qwen3-Coder-Next. My goal was to understand how much quality is lost when reducing the model’s precision, while also considering the impact on file size and speed.
Hardware Details
- Hardware: 3× R9700 PRO (96 GB VRAM)
- Backend: llama.cpp Vulkan
- Evaluation Dataset: wikitext-2 (583 chunks, context window of 512 tokens)
Formats Tested
- MXFP4_MOE
- Q4_K_M
- Q5_K_M
- UD-Q5_K_M
Tldr: The UD-Q5_K_M format is the clear winner in terms of quality, with only a slight increase in file size compared to MXFP4. It outperforms formats that are half its size.
The Numbers
| Metric | MXFP4 | Q4_K_M | Q5_K_M | UD-Q5_K_M |
|---|---|---|---|---|
| Same top-1 accuracy | 89.4% | 89.6% | 93.0% | 94.0% |
| Mean KL Divergence | 0.0746 | 0.0685 | 0.0308 | 0.0217 |
| Max KL (worst token) | 13.04 | 5.93 | 8.19 | 4.75 |
| File Size (in GB) | 44.7 GB | 45.2 GB | 52.9 GB | 55.2 GB |
UD-Q5_K_M is the best-performing format in terms of quality, with a 10% improvement over Q4_K_M and Q5_K_M.
Quality Metrics
- MXFP4 (89.4%): A 5% difference in per-token agreement becomes a 500× difference by token 100.
- UD-Q5_K_M (94%): The chance of perfect agreement for a 100-token output is significantly higher with UD-Q5_K_M.
This highlights the importance of maintaining high token accuracy, as even small differences can lead to noticeable quality degradation over longer sequences or more complex tasks.
Speed Trade-offs
- Refill (batch 512): MXFP4 is still faster due to hardware kernels.
- Prefill (batch 4096): MXFP4 remains the fastest option.
- Decode: UD-Q5_K_M edges slightly over Q4_K_M, but the difference is within a small margin compared to the file size increase.
The speed hit for interactive coding tasks (decode-bound) is negligible. For daily code generation where quality is prioritized over speed, UD-Q5_K_M provides excellent results with minimal trade-offs.
Conclusion and Recommendations
- I recommend testing different quantization levels to find the optimal balance between quality and file size for your specific use case.
- If you’re using Nvidia hardware, consider how this might affect performance compared to AMD’s RDNA architecture.
What quantization formats are you currently using? Have you observed similar quality cliffs with aggressive compression?
Key Takeaways
- UD-Q5_K_M is the best-performing format in terms of both speed and quality for Qwen3-Coder-Next.
- Maintaining high token accuracy across sequences is crucial for avoiding hallucinations in long or complex tasks.
- The trade-off between file size and performance should be carefully managed based on your specific use case.
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




