“`html
Strix Halo LLaMA.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed
All models were tested with Qwen3.6.
Key Findings
- 27B-MTP vs Base 27B (15k single-turn): Faster overall
- Total Time: 87.44s → 77.39s (-11.50% faster)
- Prompt Processing: 279.75 → 244.90 t/s (-12.46%)
- Generation: 7.63 → 16.15 t/s (+111.77% faster)
27B-MTP vs Base 27B (5-turn chat, ~28.5k context): Massive time savings
- Total Time: 258.65s → 200.55s (-22.46% faster)
- Avg Generation: 7.61 → 17.98 t/s (+136.41%)
- Avg Prompt Processing: 254.20 → 207.87 t/s (-18.23%)
35B-MTP vs Base 35B (15k single-turn): Slower overall
- Total Time: 20.83s → 23.16s (+11.17% slower)
- Prompt Processing: 972.18 → 811.90 t/s (-16.49%)
- Generation: 48.18 → 56.12 t/s (+16.47%)
35B-MTP vs Base 35B (5-turn chat, ~28.5k context): Slightly slower overall
- Total Time: 58.86s → 60.24s (+2.34% slower)
- Avg Generation: 46.66 → 58.23 t/s (+24.80%)
- Avg Prompt Processing: 826.47 → 703.45 t/s (-14.89%)
Terminology:
wall= real end-to-end elapsed time from sending the request to receiving the full response.pp= prompt processing throughput (tokens/sec).gen t/s= generation throughput (tokens/sec).
Hardware / Software
- CPU: AMD RYZEN AI MAX+ 395 (16C/32T)
- iGPU: Radeon 8060S (RADV GFX1151)
- RAM: 30 GiB
- OS: Ubuntu 24.04, kernel 6.17
llama.cpp / llama-server:9187 (0253fb21f)- Vulkan Instance: 1.4.313
- GPU API: 1.4.305
- Mesa RADV: 25.0.7
Models Tested (all Unsloth)
Qwen3.6-27B-Q8_0.ggufQwen3.6-27B-Q8_0-MTP.ggufQwen3.6-35B-A3B-Q8_0.ggufQwen3.6-35B-A3B-Q8_0-MTP.gguf
Runtime Config Used
- –ctx-size 128000
- -b 2048
- –ubatch-size 1024
- –flash-attn on
- –threads 16
- –threads-batch 16
MTP models only:
- –spec-type draft-mtp
- –spec-draft-n-max 3
- –spec-draft-p-min 0.75
Methodology
Synthetic agentic prompt calibrated to ~15k prompt tokens.
- max_tokens=256, temperature=0.
- Prompt randomized each run (RUN_TAG) so
cache_n=0. - 2 runs per model.
Stability
- Retry logic on transient 502/503/504 for long runs.
- Reported both server infer timing and client-observed wall time.
Takeaways
- MTP consistently lowers
ppand increases generation t/s. - If decode dominates, MTP can win hard (as seen on 27B here).
- If prefill dominates enough, MTP may lose slightly overall (as seen on 35B here).
- On this Strix Halo setup:
- The 27B-MTP is a strong practical upgrade for long-context chat workflows.
- The 35B-MTP is mixed: faster token generation, but slightly slower end-to-end for these specific long-context tests.
“`
This HTML document contains the key points from the original text, structured as a blog post with headings and lists. It maintains all essential information while being written in British English.
Source Read original →




