Has anyone here tested different KV cache quantizations and compared their performance?
I’m currently using the model in Q5_K_M with Q4 KV cache on a 12 GB VRAM GPU. With this setup, I’m offloading about 27 MoE layers to the CPU and getting around 90–100 tok/s with a 128k context window.
I’m trying to see if I can push it a bit further, since I’m using it inside my own AI agent. The model is already pretty smart, but in agentic workflows it’s not always as strong or consistent as I’d like.
I’d be curious to know what KV quantization settings people are using, and how much difference they noticed in speed, memory usage, and output quality.
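For comparing the memory side of this, here is a rough sketch of how KV cache size scales with the cache type. It assumes llama.cpp-style cache types, where `q8_0`, `q5_0`, and `q4_0` store blocks of 32 values plus a 16-bit scale (8.5, 5.5, and 4.5 bits per element). The model dimensions in the example are hypothetical placeholders, not the real values for any particular model, and the formula assumes every layer uses full attention over the whole context, so models with sliding-window or hybrid attention will use less.

```python
# Effective bits per cached element, assuming llama.cpp-style block formats
# (32 values per block plus a 16-bit scale for the quantized types).
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q5_0": 5.5, "q4_0": 4.5}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, type_k="f16", type_v="f16"):
    """Estimate KV cache size in bytes, assuming full attention in every layer."""
    # Each of K and V stores one head_dim vector per KV head, per layer, per token.
    elements = n_layers * n_kv_heads * head_dim * n_ctx
    k_bytes = elements * BITS_PER_ELEMENT[type_k] / 8
    v_bytes = elements * BITS_PER_ELEMENT[type_v] / 8
    return int(k_bytes + v_bytes)


if __name__ == "__main__":
    # Hypothetical model dimensions, chosen only for illustration.
    dims = dict(n_layers=48, n_kv_heads=4, head_dim=128, n_ctx=128 * 1024)
    for cache_type in ("f16", "q8_0", "q5_0", "q4_0"):
        size = kv_cache_bytes(**dims, type_k=cache_type, type_v=cache_type)
        print(f"{cache_type}: {size / 1024**3:.3f} GiB")
```

With real values for the model's layer count, KV heads, and head size plugged in, this gives a quick idea of how much VRAM a move from a Q4 cache to Q8 would cost at 128k context, and therefore how many more layers would have to go to the CPU.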
Also, would you recommend trying a different model quantization than Q5_K_M for this setup? For example, would Q4_K_M, Q6_K, or another quant be a better trade-off for speed, VRAM usage, and reasoning quality?
submitted by /u/HomoAgens1