“`html
Qwen-27B-IQ4_KS for ik_llama.cpp, especially for NVIDIA with 16GB VRAM
I’m presenting a new quantization of the Qwen-27B model, created specifically with 16GB VRAM NVIDIA GPUs in mind. I used quants that, unfortunately, are not yet available in the main upstream llama.cpp. These are the KS and KSS quants developed by ikawrakow.
Model Link: cHunter789/Qwen3.6-27B-i1-IQ4_KS-GGUF
ik_llama.cpp Project: ikawrakow/ik_llama.cpp
Unfortunately, the ik_llama.cpp project requires running with NVIDIA CUDA and CPU only. There is currently no way to run this on AMD or Apple Silicon (Metal) :/. Using this model with ik_llama.cpp and a Q4_0 Hadamard KV cache allows for a 105k context window.
Benchmark Results & Real-World Impressions
- Qwen Benchmark: Successfully passed the performance evaluations on qwen3-6-27b-benchmark.vercel.app.
- Needle In A Haystack: Successfully evaluated with satisfying results across the full 100k context window.
- Comparison: In direct testing, this model performs slightly better than my previous variant:
Qwen3.6-27B-i1-IQ4_XS-GGUF.
Benchmark Results & Real-World Impressions (continued)
The model was heavily tested in daily production workflows for several days. It runs much faster (1.5x-1.75x) and more reliably than the previous iteration—completely eliminating the issue of “blank outputs,” while the search-replace functionality works flawlessly.
Perplexity (PPL) Testing
The model was tested using a text file from Project Gutenberg (pg19.txt) with 65,536 tokens. The q4_0 KV cache quantization setup was used for this test.
“`bash
wget https://www.gutenberg.org/files/2600/2600-0.txt -O pg19.txt
./llama-perplexity -m Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KSS.gguf -f pg19.txt -c 65536 –chunks 32 -ngl 99 -khad -vhad -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 512
“`
Test Log Output:
“`text
perplexity: calculating perplexity over 12 chunks, n_ctx=65536, batch_size=512, n_seq=1
perplexity: 71.10 seconds per pass – ETA 14.22 minutes [1]6.6897,[2]7.0032,[3]7.1989,[4]7.3327,[5]7.4816,[6]7.3770,[7]7.4325,[8]7.4378,[9]7.4754,[10]7.5192,[11]7.5669,[12]7.4040
Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4040 +/- 0.02773
“`
Note: I currently do not have the capability to run KLD (Kullback–Leibler divergence) tests.
Example Server Configuration
For reference, here is the server configuration I used during my tests:
“`bash
llama-server \
-m “$MODEL_PATH” \
-a Qwen3.6-27B \
–ctx-size 105000 \
–chat-template-file chat_template.jinja \
–n-gpu-layers 99 \
–cache-type-k q4_0 \
–cache-type-v q4_0 \
–batch-size 512 \
–ubatch-size 256 \
–flash-attn on \
–no-mmap \
–host 0.0.0.0 \
–port 8081 \
–reasoning on \
–reasoning-format deepseek \
-t 8 \
–parallel 1 \
-khad \
-vhad \
–chat-template-kwargs ‘{"preserve_thinking": true}’ \
–defrag-thold 0.3 \
–jinja \
–cont-batching \
–temp 0.15 \
–top-k 1 \
–min-p 0.1 \
–repeat-last-n 512 \
–repeat-penalty 1.05
“`
Key Takeaways
- The model runs much faster and more reliably than the previous iteration.
- This quantization allows for a larger context window of 105k tokens.
- The model performs slightly better in certain benchmarks compared to the previous variant.
“`
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




