Llama.cpp MTP with Qwen3.6 27B on Headless RTX 3090

I saw some posts about prompt processing (PP) being slower with MTP, which made people cautious about trying it.

Here’s a real-world datapoint.

Settings:

Headless RTX 3090 24G
OpenCode
Model: unsloth’s Qwen3.6-27B-MTP-Q4_K_M.gguf
128k context
q8_0 KV cache
--spec-draft-n-max: 3
--draft-p-min: 0

Use Cases:

Research task that uses ~85,000 tokens
Coding task that uses ~85,000 tokens

Without MTP (llama.cpp:server-cuda13-b9174):

PP: 1,050 tok/s
TG: 27 tok/s
Total time to complete 85k tokens: ~39 mins

With MTP (latest master fork):

PP: 600 tok/s (down ~43%)
TG: 50 tok/s (up 85%)
Total time to complete 85k tokens: ~23 mins (1.7x faster or 41% reduction)
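
For anyone who wants to rerun the comparison with their own numbers, here is a minimal Python sketch of the arithmetic behind the percentages. It only uses the rounded figures quoted in this post; the helper name percent_change is made up for illustration.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent (negative = drop)."""
    return (after - before) / before * 100.0

# Rounded figures from this post (tok/s for PP and TG, minutes for total time).
pp_change = percent_change(1050, 600)     # prompt processing, without -> with MTP
tg_change = percent_change(27, 50)        # token generation, without -> with MTP
time_reduction = -percent_change(39, 23)  # wall-clock time saved, in percent
speedup = 39 / 23                         # how many times faster overall

print(f"PP change: {pp_change:.1f}%")
print(f"TG change: {tg_change:.1f}%")
print(f"Time reduction: {time_reduction:.1f}%  (speedup {speedup:.2f}x)")
```

Swap in your own PP, TG, and total-time measurements to see where your setup lands.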

A 41% time saving is significant, so unless your workload is PP-heavy, I’d recommend giving MTP a try on your use cases! Note that I run a dual-agent setup where a second critic agent checks the main agent’s work, so your total processing times may be better than mine.
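
To get a rough feel for what "PP heavy" means, here is a back-of-the-envelope Python sketch. It assumes total time is simply prompt tokens / PP rate plus generated tokens / TG rate, which ignores prompt caching, agent overhead, and everything else going on in a real run, so treat the result as a ballpark only. The function names are hypothetical.

```python
def total_seconds(prompt_tokens, generated_tokens, pp_rate, tg_rate):
    """Simplified model: time = prompt processing + token generation."""
    return prompt_tokens / pp_rate + generated_tokens / tg_rate

def break_even_ratio(pp_base, tg_base, pp_mtp, tg_mtp):
    """Generated-to-prompt token ratio above which MTP comes out faster.

    Assumes MTP has slower PP and faster TG than the baseline.
    """
    pp_penalty = 1 / pp_mtp - 1 / pp_base  # extra seconds per prompt token
    tg_gain = 1 / tg_base - 1 / tg_mtp     # seconds saved per generated token
    return pp_penalty / tg_gain

# Rates from this post: 1,050 / 27 tok/s without MTP, 600 / 50 tok/s with MTP.
ratio = break_even_ratio(1050, 27, 600, 50)
print(f"MTP wins once generated tokens exceed {ratio:.1%} of processed prompt tokens")
```

Under this simplified model, a workload that mostly ingests long prompts and generates very little would be better off without MTP, while anything that generates a meaningful amount of text benefits.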

submitted by /u/cleversmoke
