Llama.cpp MTP with Qwen3.6 27B on Headless RTX 3090

By AI Maestro May 17, 2026 1 min read
I saw some posts saying prompt processing (PP) gets slower with multi-token prediction (MTP), which made people cautious about trying it.

Here’s a real-world datapoint.

Settings:

  • Headless RTX 3090 24G
  • OpenCode
  • Model: unsloth’s Qwen3.6-27B-MTP-Q4_K_M.gguf
  • 128k context
  • q8_0 KV cache
  • --spec-draft-n-max: 3
  • --draft-p-min: 0
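
For anyone wanting to reproduce the settings above, here is a rough sketch of how they might map onto a llama-server command line. Several things are assumptions rather than details from the post: the model path is hypothetical, "128k" is taken to mean 131072 tokens, and the q8_0 setting is applied to both the K and V caches. The --spec-draft-n-max flag is copied from the list above as written and may only exist in the MTP build that was used.

```python
import shlex

# Hypothetical local path; the post only names the GGUF file.
MODEL_PATH = "models/Qwen3.6-27B-MTP-Q4_K_M.gguf"

args = [
    "llama-server",
    "--model", MODEL_PATH,
    "--ctx-size", str(128 * 1024),  # assumes "128k context" means 131072 tokens
    "--cache-type-k", "q8_0",       # assumes q8_0 was used for the K cache...
    "--cache-type-v", "q8_0",       # ...and for the V cache
    "--spec-draft-n-max", "3",      # flag name as given in the post (not verified on mainline)
    "--draft-p-min", "0",
]

# Join into a shell-safe string that can be pasted into a terminal.
command = shlex.join(args)
print(command)
```

Check the flag names against the --help output of your own build before relying on this.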

Use Cases:

  • Research task that uses ~85,000 tokens
  • Coding task that uses ~85,000 tokens

Without MTP (llama.cpp:server-cuda13-b9174):

  • PP: 1,050 tok/s
  • TG (token generation): 27 tok/s
  • Total time to complete 85k tokens: ~39 mins

With MTP (latest master fork):

  • PP: 600 tok/s (down 43%)
  • TG: 50 tok/s (up 85%)
  • Total time to complete 85k tokens: ~23 mins (1.7x faster or 41% reduction)
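
The percentages can be rechecked from the raw figures with a few lines of Python. This is only a sanity check on the arithmetic; the dictionary keys are names chosen here for illustration.

```python
def pct_change(before: float, after: float) -> float:
    """Signed percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0

# Figures reported in the post.
baseline = {"pp_tok_s": 1050, "tg_tok_s": 27, "total_min": 39}
mtp = {"pp_tok_s": 600, "tg_tok_s": 50, "total_min": 23}

changes = {key: pct_change(baseline[key], mtp[key]) for key in baseline}
speedup = baseline["total_min"] / mtp["total_min"]

for name, delta in changes.items():
    print(f"{name}: {delta:+.1f}%")
print(f"wall-clock speedup: {speedup:.2f}x")
# pp_tok_s: -42.9%
# tg_tok_s: +85.2%
# total_min: -41.0%
# wall-clock speedup: 1.70x
```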

A 41% time saving is substantial, so unless your workload is PP-heavy, I’d recommend giving MTP a try on your own use cases! I run a dual-agent setup in which a second critic agent checks the main agent’s work, so your total processing times may well be better than mine.
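
To put a rough number on "PP heavy", one can model total time as prompt tokens divided by PP speed plus generated tokens divided by TG speed, then solve for the prompt-to-generated ratio at which the two configurations tie. This is a simplified sketch, not something from the post: it assumes the reported throughputs stay constant, and it ignores prompt caching, varying draft acceptance rates, and agent overhead.

```python
def total_seconds(prompt_tokens: float, generated_tokens: float,
                  pp_tok_s: float, tg_tok_s: float) -> float:
    """Simple two-phase model: prefill time plus decode time."""
    return prompt_tokens / pp_tok_s + generated_tokens / tg_tok_s

def break_even_ratio(pp_base: float, tg_base: float,
                     pp_mtp: float, tg_mtp: float) -> float:
    """Prompt-to-generated token ratio at which both setups take equal time."""
    extra_prefill_per_token = 1 / pp_mtp - 1 / pp_base   # seconds lost per prompt token
    saved_decode_per_token = 1 / tg_base - 1 / tg_mtp    # seconds gained per generated token
    return saved_decode_per_token / extra_prefill_per_token

# Throughputs reported in the post (tok/s).
ratio = break_even_ratio(pp_base=1050, tg_base=27, pp_mtp=600, tg_mtp=50)
print(f"break-even prompt:generated ratio = {ratio:.1f}")
# break-even prompt:generated ratio = 23.9
```

Under this model, MTP comes out ahead whenever fewer than roughly 24 prompt tokens are processed per generated token; beyond that, the slower prefill outweighs the faster generation.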

submitted by /u/cleversmoke
