It’s OK to quantize the KV cache. Model quant matters more. Some Qwen3.6 27B tests with (approximated) KLD


By AI Maestro, May 24, 2026

There has been some discussion here about KV-cache quantization, especially following recent improvements in Llama.cpp. The debate is whether, with limited memory, it is better to quantize the cache or to drop to a smaller weight quant.

I’ve noticed a lack of data-backed comparisons. One reason is that computing KL divergence (KLD) is expensive. KLD measures how far a model’s predicted token distribution drifts from a reference distribution, and the reference normally has to be the logits of the unquantized model.
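To make the measure concrete, here is a minimal sketch of per-token KLD computed from two sets of logits. It is a plain-Python illustration of the textbook formula, not the code llama-perplexity runs; the function names are made up for this example, and the result is in nats.

```python
import math

def log_softmax(logits):
    """Turn raw logits into log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for a single token position."""
    log_p = log_softmax(ref_logits)   # reference model (e.g. the best quant you can run)
    log_q = log_softmax(test_logits)  # model or cache configuration under test
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kld(ref_rows, test_rows):
    """Average the per-token KLD over many positions."""
    values = [kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)]
    return sum(values) / len(values)

if __name__ == "__main__":
    ref = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
    test = [[1.9, 1.1, 0.1], [0.4, 0.6, 2.8]]
    print(f"mean KLD: {mean_kld(ref, test):.6f} nats")
```

A value of zero means the two distributions are identical; larger values mean the tested configuration predicts tokens less like the reference does.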

However, we can approximate KLD by using the logits of a high-quality quantized model as the reference instead. I used unsloth’s quants for Qwen3.6 27B. The largest quant I could fit on my 7900 XTX (with only 24GB of VRAM) was Q5_K_M, but this had no MTP and a small context window.

I ran the KLD calculations with llama-perplexity, which is part of Llama.cpp. The dataset is wikitext-2, downloaded with the script provided by Llama.cpp. I set the context size to 16,000 tokens because the main concern with cache quantization is its effect on long-context performance.

I tested multiple combinations of K and V cache quant types, focusing on the boundary between Q5 and Q4 weight quants, as well as the impact of using a smaller quant for V than for K. I didn’t observe any significant slowdown from mixed K/V quant types, because my Llama.cpp build was compiled with -DGGML_CUDA_FA_ALL_QUANTS=ON.
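The runs described above could be scripted roughly as follows. This is a sketch, not the author’s actual setup: the file names are placeholders, and the flag spellings (`-m`, `-f`, `-c`, `-ctk`, `-ctv`, `--kl-divergence-base`, `--kl-divergence`) are as I understand them from llama-perplexity’s help text, so check them against your own build. Depending on the version, a quantized V cache may also need flash attention switched on.

```python
import shlex

# Cache-type pairs (K, V) taken from the results table in this post.
CACHE_COMBOS = [("q8_0", "q8_0"), ("q8_0", "q5_1"), ("q5_1", "q5_1"), ("q4_0", "q4_0")]

def base_logits_argv(ref_model, text_path, base_path, n_ctx=16000):
    """Step 1: run the reference model once and save its logits to base_path."""
    return ["llama-perplexity", "-m", ref_model, "-f", text_path,
            "-c", str(n_ctx), "--kl-divergence-base", base_path]

def kld_argv(model, text_path, base_path, cache_k, cache_v, n_ctx=16000):
    """Step 2: run a test configuration and compare it with the saved logits."""
    return ["llama-perplexity", "-m", model, "-f", text_path,
            "-c", str(n_ctx), "-ctk", cache_k, "-ctv", cache_v,
            "--kl-divergence-base", base_path, "--kl-divergence"]

if __name__ == "__main__":
    # Placeholder paths: substitute your own GGUF files and dataset location.
    ref = "Qwen3.6-27B-Q5_K_M.gguf"
    text = "wikitext-2-raw/wiki.test.raw"
    base = "q5_k_m.kld"
    print(shlex.join(base_logits_argv(ref, text, base)))
    for k, v in CACHE_COMBOS:
        print(shlex.join(kld_argv(ref, text, base, k, v)))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; pass a list to `subprocess.run` once the paths are real.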

The goal was to answer: "Should you quantize the cache or use a smaller model quant to achieve longer context?"

Here are some key findings:

  1. Model quant is more important than KV-cache quant: Regardless of cache quant, Q5 had lower KLD than Q4. Even the worst Q5 config (Q5_K_S weights with q4_0 KV cache) beat the best Q4 config (Q4_K_XL with f16 cache).
  2. q4_0 is not ideal for the cache: With Q5_K_M weights, going from a q5_1 cache to q4_0 roughly doubled KLD (0.007214 to 0.014045). I recommend using at least q5_* cache types.

The table below summarizes my findings:

| Weights | cache-type-k | cache-type-v | KLD¹ |
|---|---|---|---|
| Q5_K_M | q8_0 | q8_0 | 0.003838 ± 0.000322 |
| Q5_K_M | q8_0 | q5_1 | 0.006456 ± 0.000542 |
| Q5_K_M | q5_1 | q5_1 | 0.007214 ± 0.000560 |
| Q5_K_M | q4_0 | q4_0 | 0.014045 ± 0.000828 |
| Q5_K_S | q4_0 | q4_0 | 0.016304 ± 0.000865 |
| Q4_K_XL | f16 | f16 | 0.026067 ± 0.001021 |
| Q4_K_M | f16 | f16 | 0.031078 ± 0.001176 |

¹ Calculated w.r.t. Q5_K_M on wikitext-2 using 16k context.
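For anyone who wants to check the two findings against the numbers, here is a small sketch that hard-codes the table rows and tests both claims. The data is copied from the table; the helper names are only illustrative.

```python
# (weights, cache-type-k, cache-type-v, mean KLD, uncertainty), copied from the table
RESULTS = [
    ("Q5_K_M", "q8_0", "q8_0", 0.003838, 0.000322),
    ("Q5_K_M", "q8_0", "q5_1", 0.006456, 0.000542),
    ("Q5_K_M", "q5_1", "q5_1", 0.007214, 0.000560),
    ("Q5_K_M", "q4_0", "q4_0", 0.014045, 0.000828),
    ("Q5_K_S", "q4_0", "q4_0", 0.016304, 0.000865),
    ("Q4_K_XL", "f16", "f16", 0.026067, 0.001021),
    ("Q4_K_M", "f16", "f16", 0.031078, 0.001176),
]

def rows_for(prefix):
    """All rows whose weight quant starts with the given prefix, e.g. 'Q5'."""
    return [r for r in RESULTS if r[0].startswith(prefix)]

def lookup(weights, cache_k, cache_v):
    """Mean KLD for one exact configuration."""
    for w, k, v, kld, _ in RESULTS:
        if (w, k, v) == (weights, cache_k, cache_v):
            return kld
    raise KeyError((weights, cache_k, cache_v))

worst_q5 = max(rows_for("Q5"), key=lambda r: r[3])
best_q4 = min(rows_for("Q4"), key=lambda r: r[3])

# Finding 1: the worst Q5 row still beats the best Q4 row, even after
# widening both by their reported uncertainty.
q5_beats_q4 = worst_q5[3] + worst_q5[4] < best_q4[3] - best_q4[4]

# Finding 2: how much a q4_0 cache costs relative to q5_1 on the same weights.
q4_cache_penalty = lookup("Q5_K_M", "q4_0", "q4_0") / lookup("Q5_K_M", "q5_1", "q5_1")

if __name__ == "__main__":
    print("worst Q5:", worst_q5[:3], "best Q4:", best_q4[:3])
    print("Q5 beats Q4 outside the error bars:", q5_beats_q4)
    print(f"q4_0 vs q5_1 cache KLD ratio: {q4_cache_penalty:.2f}")
```

Note that the table has no Q4 rows with a quantized cache and no Q5 rows with an f16 cache, so the comparison only covers the configurations that were actually run.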

There are some limitations to this analysis:

  • The KLD reference is the largest quant I could fit rather than the unquantized model, so the numbers are approximate and might not be representative of larger models.
  • I didn’t test more quant levels due to computational constraints. Q5_K_M was the largest quant I tested because, with a q8_0 KV cache, it still reaches 100k context on my 24GB 7900 XTX.
  • The dataset (wikitext-2) isn’t ideal for coding or agent workflows; a better dataset might be available, but it needs to integrate well with the `llama-perplexity` tool.
  • 16k context is insufficient for testing long-context scenarios. I’m waiting for Llama.cpp to fix an overflow bug that currently prevents running such tests.
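The memory side of the trade-off, such as the 100k-context figure mentioned in the limitations, can be estimated with a back-of-the-envelope calculation like the one below. The bits-per-element values follow from ggml’s block layouts (for example, q8_0 stores 32 values in 34 bytes). The model shape in the example is a placeholder and NOT the real Qwen3.6 27B configuration, which I have not verified; plug in the values from your model’s metadata.

```python
# Effective storage cost per cached element, in bits, for common cache types.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q5_1": 6.0, "q5_0": 5.5, "q4_0": 4.5}

def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim, cache_k, cache_v):
    """Approximate KV-cache size in bytes for a given context length.

    Counts one K and one V vector per attention layer per token and ignores
    padding or any per-backend overhead.
    """
    elems_per_token = n_attn_layers * n_kv_heads * head_dim  # for K, and again for V
    bits = elems_per_token * n_ctx * (BITS_PER_ELEMENT[cache_k] + BITS_PER_ELEMENT[cache_v])
    return bits / 8

if __name__ == "__main__":
    # Hypothetical shape for illustration only.
    shape = dict(n_attn_layers=32, n_kv_heads=8, head_dim=128)
    for k, v in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q5_1"), ("q4_0", "q4_0")]:
        size = kv_cache_bytes(100_000, cache_k=k, cache_v=v, **shape)
        print(f"K={k:5s} V={v:5s} -> {size / 1e9:.2f} GB at 100k context")
```

Comparing the memory saved here with the size difference between two weight quants shows how much context each choice actually buys.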


For makers and artists, the findings suggest that when memory is tight, it is better to keep a higher-precision weight quant and quantize the KV cache than to drop to a smaller weight quant in order to keep the cache unquantized, at least at the 16k context tested here. This matters for applications like AI music generation or any creative task where the model needs to handle long sequences of data.

Key Takeaways

  • Weight quantization affected output quality more than KV-cache quantization in these tests, which were run at 16k context.
  • Every Q5 configuration tested, including Q5_K_S with a q4_0 cache, had lower KLD than the best Q4 configuration (Q4_K_XL with an f16 cache).
  • Makers and artists who need longer context should consider quantizing the KV cache (q5_* or better) before dropping to a smaller weight quant.

Originally published at reddit.com. Curated by AI Maestro.
