It’s OK to quantize the KV cache. Model quant matters more. Some Qwen3.6 27B tests with (approximated) KLD

There has been some discussion here about KV-cache quantization, especially following recent improvements in llama.cpp, and some debate about whether to prioritize cache quant or weight quant. I’ve noticed a lack of data-backed comparisons. One reason is that KL divergence (KLD) is expensive to compute: it measures how similar two probability distributions are, and doing it properly requires logits from an unquantized model. However, we can approximate KLD by using logits from a high-quality quantized model as the reference instead.

I used unsloth’s quants for Qwen3.6 27B. The largest quant I could fit on my 7900 XTX (with only 24 GB of VRAM) was Q5_K_M, and that was with no MTP and a small context window. I ran KLD calculations using

I tested multiple combinations of K and V cache quant types, focusing on the thresholds between Q5 and Q4 model quants, as well as the impact of using a smaller quant for V. I didn’t observe any significant slowdowns from mixed KV quants, because my llama.cpp is compiled with

The goal was to answer: "Should you quantize the cache or use a smaller model quant to achieve longer context?"

Here are some key findings:

The table below summarizes my findings:

1 Calculated w.r.t. Q5_K_M on wikitext-2 using a 16k context.

There are some limitations to this analysis:

What this means for makers and artists: the findings suggest that, when you need a longer context, keeping a higher-precision model quant and quantizing the KV cache is a better trade than dropping to a smaller model quant. This matters for applications like AI music generation or any creative task where the model needs to handle long sequences of data.
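To make the KLD approximation described in the post concrete, here is a minimal sketch of the underlying arithmetic in plain Python. It is not the author's script and not llama.cpp's implementation. The function names (`log_softmax`, `token_kld`, `mean_kld`) and the toy logits are assumptions made up for illustration. The idea is that the "reference" logits come from the best model you can run (Q5_K_M in the post, rather than an unquantized model), and the "candidate" logits come from the configuration under test.

```python
import math

def log_softmax(logits):
    """Turn raw logits into log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kld(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for a single token position."""
    ref_lp = log_softmax(ref_logits)
    test_lp = log_softmax(test_logits)
    return sum(math.exp(p) * (p - q) for p, q in zip(ref_lp, test_lp))

def mean_kld(ref_rows, test_rows):
    """Average the per-position KLD over a whole sequence."""
    if len(ref_rows) != len(test_rows):
        raise ValueError("reference and candidate must cover the same positions")
    klds = [token_kld(r, t) for r, t in zip(ref_rows, test_rows)]
    return sum(klds) / len(klds)

# Toy data (hypothetical): 2 token positions, vocabulary of 3.
reference = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]   # stand-in for the reference model
candidate = [[1.9, 1.1, 0.1], [0.4, 2.4, 0.6]]   # stand-in for a config under test

print(f"mean KLD: {mean_kld(reference, candidate):.6f}")
```

A value of 0 means the candidate reproduces the reference distribution exactly, and larger values mean more drift. Because the reference here is itself quantized, the result is only an approximation of the true KLD against the full-precision model.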
Key Takeaways
- The model quant level matters more for output quality than the KV-cache quant level, especially when the goal is a longer context window.
- The Q5_K_M configuration consistently outperformed the best Q4 configurations in these tests.
- Makers and artists who need long contexts should keep the larger model quant and quantize the KV cache, rather than switching to a smaller model quant.
Originally published at reddit.com. Curated by AI Maestro.




