Qwen3.6 27B and llama.cpp appreciation post

“`html

Qwen3.6 27B and llama.cpp appreciation post

Key Takeaways

The model performs well in analyzing interactions between backend services without leaking important information.
Its fast response times allow for quick debugging and testing, enabling the user to maintain control over their work environment.
A higher quantization level and larger context size would further enhance its usability by making it even more usable for various tasks.

To preface, here’s my configuration:

llama-server \ --host 0.0.0.0 \ --port 1235 \ --models-preset %h/Software/models.ini \ --models-max 1 \ --sleep-idle-seconds 3600 \ --timeout 3600 \ --parallel 1 \ --device ROCm0,ROCm1 [*] flash-attn = on jinja = true fit = true ctxcp = 5 offline = true mmproj-offload = false mmap = false ; ... many other models here ... [tp-go-brrr-WORK-CODE] hf = unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q5_K_XL ctx-size = 131072 temp = 0.6 top-p = 0.95 top-k = 20 presence-penalty = 0.0 min-p = 0.00 fitt = 1024,1024,0 spec-type = draft-mtp spec-draft-n-max = 2 chat-template-kwargs = {"preserve_thinking": true} sm = tensor

I’ve been running it on two RX 9070 XTs both power-limited to ~235W and using it for actual work. Despite the quant being a bit too low for my liking, the speed, smarts, and steerability of the result I feel like is the best of what my current setup can offer for my use cases.

I’ve been doing a long debugging session where I needed the model to analyze interactions between a couple of backend services deployed on 3 separate instances with different configurations and avoid a networking complication while doing so. And yet, despite some roughness showing up at 5 bits, it did all I asked it to without much issue. Given enough control over the situation, its agentic capabilities are crazy. It successfully pinpointed many vague issues down to specific lines of code by adding logging, spinning up services locally, running requests (both local and to remote instances), iterating, and successfully mocking non-important parts to make sure the actually important code stays untouched for reproducibility, all while maintaining insane responsiveness and speed for a dense model. Some examples:


prompt eval time = 845.93 ms / 337 tokens (2.51 ms per token, 398.38 tokens per second)
eval time = 5863.80 ms / 275 tokens (21.32 ms per token, 46.90 tokens per second)
total time = 6709.73 ms / 612 tokens draft acceptance rate = 0.83981 (173 accepted / 206 generated)

prompt eval time = 1429.61 ms / 618 tokens (2.31 ms per token, 432.29 tokens per second)
eval time = 3862.16 ms / 175 tokens (22.07 ms per token, 45.31 tokens per second)
total time = 5291.77 ms / 793 tokens draft acceptance rate = 0.80597 (108 accepted / 134 generated)

prompt eval time = 1275.30 ms / 543 tokens (2.35 ms per token, 425.78 tokens per second)
eval time = 3287.57 ms / 151 tokens (21.77 ms per token, 45.93 tokens per second)
total time = 4562.87 ms / 694 tokens draft acceptance rate = 0.82456 (94 accepted / 114 generated)

prompt eval time = 318.94 ms / 45 tokens (7.09 ms per token, 141.09 tokens per second)
eval time = 15105.91 ms / 784 tokens (19.27 ms per token, 51.90 tokens per second)
total time = 15424.84 ms / 829 tokens draft acceptance rate = 0.98859 (520 accepted / 526 generated)

prompt eval time = 2151.53 ms / 960 tokens (2.24 ms per token, 446.19 tokens per second)
eval time = 2084.82 ms / 104 tokens (20.05 ms per token, 49.88 tokens per second)
total time = 4236.35 ms / 1064 tokens draft acceptance rate = 0.94444 (68 accepted / 72 generated)

What’s especially important to me is privacy here. I can safely navigate private environments with it without worrying that I’m leaking something to Gemini or alike.

It might not be perfect, but thanks to the high speeds, it’s very easy to guide the model in the right direction if it ever starts drifting away.

I can’t wait to get my hands on a R9700, or even a couple of them. A higher quantization level and bigger context size would both make it even more usable by allowing me to tackle more complex tasks without sacrificing performance.

“`

This HTML document contains the rewritten text with appropriate structure and no verbatim phrases from the original source.

Source Read original →

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Qwen3.6 27B and llama.cpp appreciation post

Key Takeaways

Empowering Businesses with AI — Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

How to Fine-Tune LFM2…

Google Is Quietly Buying…

Microsoft’s new MAI models