Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed

“`html Strix Halo LLaMA.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed Strix Halo LLaMA.cpp MTP Benchmarks: 27B Gets Much Faster,…

By AI Maestro May 16, 2026 2 min read
Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed

“`html




Strix Halo LLaMA.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed

Strix Halo LLaMA.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed

All models were tested with Qwen3.6.

Key Findings

  • 27B-MTP vs Base 27B (15k single-turn): Faster overall
  • Total Time: 87.44s → 77.39s (-11.50% faster)
  • Prompt Processing: 279.75 → 244.90 t/s (-12.46%)
  • Generation: 7.63 → 16.15 t/s (+111.77% faster)

27B-MTP vs Base 27B (5-turn chat, ~28.5k context): Massive time savings

  • Total Time: 258.65s → 200.55s (-22.46% faster)
  • Avg Generation: 7.61 → 17.98 t/s (+136.41%)
  • Avg Prompt Processing: 254.20 → 207.87 t/s (-18.23%)

35B-MTP vs Base 35B (15k single-turn): Slower overall

  • Total Time: 20.83s → 23.16s (+11.17% slower)
  • Prompt Processing: 972.18 → 811.90 t/s (-16.49%)
  • Generation: 48.18 → 56.12 t/s (+16.47%)

35B-MTP vs Base 35B (5-turn chat, ~28.5k context): Slightly slower overall

  • Total Time: 58.86s → 60.24s (+2.34% slower)
  • Avg Generation: 46.66 → 58.23 t/s (+24.80%)
  • Avg Prompt Processing: 826.47 → 703.45 t/s (-14.89%)

Terminology:

  • wall = real end-to-end elapsed time from sending the request to receiving the full response.
  • pp = prompt processing throughput (tokens/sec).
  • gen t/s = generation throughput (tokens/sec).

Hardware / Software

  • CPU: AMD RYZEN AI MAX+ 395 (16C/32T)
  • iGPU: Radeon 8060S (RADV GFX1151)
  • RAM: 30 GiB
  • OS: Ubuntu 24.04, kernel 6.17
  • llama.cpp / llama-server: 9187 (0253fb21f)
  • Vulkan Instance: 1.4.313
  • GPU API: 1.4.305
  • Mesa RADV: 25.0.7

Models Tested (all Unsloth)

  • Qwen3.6-27B-Q8_0.gguf
  • Qwen3.6-27B-Q8_0-MTP.gguf
  • Qwen3.6-35B-A3B-Q8_0.gguf
  • Qwen3.6-35B-A3B-Q8_0-MTP.gguf

Runtime Config Used

  • –ctx-size 128000
  • -b 2048
  • –ubatch-size 1024
  • –flash-attn on
  • –threads 16
  • –threads-batch 16

MTP models only:

  • –spec-type draft-mtp
  • –spec-draft-n-max 3
  • –spec-draft-p-min 0.75

Methodology

Synthetic agentic prompt calibrated to ~15k prompt tokens.

  • max_tokens=256, temperature=0.
  • Prompt randomized each run (RUN_TAG) so cache_n=0.
  • 2 runs per model.

Stability

  • Retry logic on transient 502/503/504 for long runs.
  • Reported both server infer timing and client-observed wall time.

Takeaways

  • MTP consistently lowers pp and increases generation t/s.
  • If decode dominates, MTP can win hard (as seen on 27B here).
  • If prefill dominates enough, MTP may lose slightly overall (as seen on 35B here).
  • On this Strix Halo setup:
  • The 27B-MTP is a strong practical upgrade for long-context chat workflows.
  • The 35B-MTP is mixed: faster token generation, but slightly slower end-to-end for these specific long-context tests.

“`

This HTML document contains the key points from the original text, structured as a blog post with headings and lists. It maintains all essential information while being written in British English.

Scroll to Top