Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed

“`html

Strix Halo LLaMA.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed

Strix Halo LLaMA.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed

All models were tested with Qwen3.6.

Key Findings

27B-MTP vs Base 27B (15k single-turn): Faster overall
Total Time: 87.44s → 77.39s (-11.50% faster)
Prompt Processing: 279.75 → 244.90 t/s (-12.46%)
Generation: 7.63 → 16.15 t/s (+111.77% faster)

27B-MTP vs Base 27B (5-turn chat, ~28.5k context): Massive time savings

Total Time: 258.65s → 200.55s (-22.46% faster)
Avg Generation: 7.61 → 17.98 t/s (+136.41%)
Avg Prompt Processing: 254.20 → 207.87 t/s (-18.23%)

35B-MTP vs Base 35B (15k single-turn): Slower overall

Total Time: 20.83s → 23.16s (+11.17% slower)
Prompt Processing: 972.18 → 811.90 t/s (-16.49%)
Generation: 48.18 → 56.12 t/s (+16.47%)

35B-MTP vs Base 35B (5-turn chat, ~28.5k context): Slightly slower overall

Total Time: 58.86s → 60.24s (+2.34% slower)
Avg Generation: 46.66 → 58.23 t/s (+24.80%)
Avg Prompt Processing: 826.47 → 703.45 t/s (-14.89%)

Terminology:

wall = real end-to-end elapsed time from sending the request to receiving the full response.
pp = prompt processing throughput (tokens/sec).
gen t/s = generation throughput (tokens/sec).

Hardware / Software

CPU: AMD RYZEN AI MAX+ 395 (16C/32T)
iGPU: Radeon 8060S (RADV GFX1151)
RAM: 30 GiB
OS: Ubuntu 24.04, kernel 6.17
llama.cpp / llama-server: 9187 (0253fb21f)
Vulkan Instance: 1.4.313
GPU API: 1.4.305
Mesa RADV: 25.0.7

Models Tested (all Unsloth)

Qwen3.6-27B-Q8_0.gguf
Qwen3.6-27B-Q8_0-MTP.gguf
Qwen3.6-35B-A3B-Q8_0.gguf
Qwen3.6-35B-A3B-Q8_0-MTP.gguf

Runtime Config Used

–ctx-size 128000
-b 2048
–ubatch-size 1024
–flash-attn on
–threads 16
–threads-batch 16

MTP models only:

–spec-type draft-mtp
–spec-draft-n-max 3
–spec-draft-p-min 0.75

Methodology

Synthetic agentic prompt calibrated to ~15k prompt tokens.

max_tokens=256, temperature=0.
Prompt randomized each run (RUN_TAG) so cache_n=0.
2 runs per model.

Stability

Retry logic on transient 502/503/504 for long runs.
Reported both server infer timing and client-observed wall time.

Takeaways

MTP consistently lowers pp and increases generation t/s.
If decode dominates, MTP can win hard (as seen on 27B here).
If prefill dominates enough, MTP may lose slightly overall (as seen on 35B here).
On this Strix Halo setup:
The 27B-MTP is a strong practical upgrade for long-context chat workflows.
The 35B-MTP is mixed: faster token generation, but slightly slower end-to-end for these specific long-context tests.

“`

This HTML document contains the key points from the original text, structured as a blog post with headings and lists. It maintains all essential information while being written in British English.

Originally published at reddit.com. Curated by AI Maestro.

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed