Strix Halo LLaMA.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed
All models tested are Qwen3.6 variants (full list below).
Key Findings
27B-MTP vs Base 27B (15k single-turn): Faster overall
- Total Time: 87.44s → 77.39s (-11.50% faster)
- Prompt Processing: 279.75 → 244.90 t/s (-12.46%)
- Generation: 7.63 → 16.15 t/s (+111.77% faster)
27B-MTP vs Base 27B (5-turn chat, ~28.5k context): Massive time savings
- Total Time: 258.65s → 200.55s (-22.46% faster)
- Avg Generation: 7.61 → 17.98 t/s (+136.41%)
- Avg Prompt Processing: 254.20 → 207.87 t/s (-18.23%)
35B-MTP vs Base 35B (15k single-turn): Slower overall
- Total Time: 20.83s → 23.16s (+11.17% slower)
- Prompt Processing: 972.18 → 811.90 t/s (-16.49%)
- Generation: 48.18 → 56.12 t/s (+16.47%)
35B-MTP vs Base 35B (5-turn chat, ~28.5k context): Slightly slower overall
- Total Time: 58.86s → 60.24s (+2.34% slower)
- Avg Generation: 46.66 → 58.23 t/s (+24.80%)
- Avg Prompt Processing: 826.47 → 703.45 t/s (-14.89%)
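For clarity, every percentage above is a plain relative change against the base model; e.g. the 27B single-turn total-time delta:

```bash
# Relative change, base -> MTP, for the 27B single-turn total time.
awk 'BEGIN { printf "%.2f%%\n", (77.39 - 87.44) / 87.44 * 100 }'
# -> -11.49% (the reported -11.50% likely reflects unrounded raw timings)
```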
Terminology
- wall = real end-to-end elapsed time from sending the request to receiving the full response.
- pp = prompt processing throughput (tokens/sec).
- gen t/s = generation throughput (tokens/sec).
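As a concrete illustration of where these numbers come from, pp and gen t/s can be read from llama-server's response while wall is measured client-side. A sketch only: the endpoint and timings field names below reflect recent llama.cpp builds and should be treated as assumptions.

```bash
# Measure wall client-side; read pp / gen t/s from the server's timings object.
start=$(date +%s.%N)
resp=$(curl -s http://localhost:8080/completion \
  -d '{"prompt": "Hello", "n_predict": 256, "temperature": 0}')
end=$(date +%s.%N)

echo "wall: $(echo "$end - $start" | bc) s"
echo "$resp" | jq '{pp: .timings.prompt_per_second, gen: .timings.predicted_per_second}'
```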
Hardware / Software
- CPU: AMD RYZEN AI MAX+ 395 (16C/32T)
- iGPU: Radeon 8060S (RADV GFX1151)
- RAM: 30 GiB
- OS: Ubuntu 24.04, kernel 6.17
- llama.cpp / llama-server: build 9187 (0253fb21f)
- Vulkan Instance: 1.4.313
- GPU API: 1.4.305
- Mesa RADV: 25.0.7
Models Tested (all Unsloth)
- Qwen3.6-27B-Q8_0.gguf
- Qwen3.6-27B-Q8_0-MTP.gguf
- Qwen3.6-35B-A3B-Q8_0.gguf
- Qwen3.6-35B-A3B-Q8_0-MTP.gguf
Runtime Config Used
- --ctx-size 128000
- -b 2048
- --ubatch-size 1024
- --flash-attn on
- --threads 16
- --threads-batch 16
MTP models only:
- --spec-type draft-mtp
- --spec-draft-n-max 3
- --spec-draft-p-min 0.75
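Assembled into a single invocation (the model path and port are placeholders; all other flags are exactly those listed above):

```bash
# Base model; the MTP builds additionally get the three --spec-* flags below.
llama-server \
  --model ./Qwen3.6-27B-Q8_0.gguf \
  --ctx-size 128000 \
  -b 2048 \
  --ubatch-size 1024 \
  --flash-attn on \
  --threads 16 \
  --threads-batch 16 \
  --port 8080

# MTP variants append:
#   --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-p-min 0.75
```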
Methodology
- Synthetic agentic prompt calibrated to ~15k prompt tokens.
- max_tokens=256, temperature=0.
- Prompt randomized each run (RUN_TAG) so cache_n=0.
- 2 runs per model.
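A minimal sketch of a single pass under these settings (the endpoint, payload shape, and RUN_TAG placement are assumptions, not the exact harness used here):

```bash
# Unique RUN_TAG per run defeats the prompt cache (cache_n=0);
# sampling fixed at max_tokens=256, temperature=0; wall time is client-side.
RUN_TAG="$(date +%s)-$RANDOM"
PROMPT="[run:${RUN_TAG}] <synthetic agentic prompt, ~15k tokens>"

start=$(date +%s.%N)
jq -n --arg p "$PROMPT" \
  '{messages: [{role: "user", content: $p}], max_tokens: 256, temperature: 0}' |
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' -d @- > response.json
end=$(date +%s.%N)
echo "wall: $(echo "$end - $start" | bc) s"
```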
Stability
- Retry logic on transient 502/503/504 for long runs.
- Reported both server infer timing and client-observed wall time.
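One way to get that retry behaviour with curl (a sketch; the actual harness may differ):

```bash
# Retry transient gateway errors (502/503/504) with linear backoff.
for attempt in 1 2 3; do
  code=$(curl -s -o response.json -w '%{http_code}' \
    http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' -d @payload.json)
  case "$code" in
    502|503|504) sleep $((attempt * 5)) ;;  # transient: back off and retry
    *) break ;;                             # success or non-transient error
  esac
done
```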
Takeaways
- MTP consistently lowers pp and increases generation t/s.
- If decode dominates, MTP can win hard (as seen on 27B here).
- If prefill dominates enough, MTP may lose slightly overall (as seen on 35B here); a back-of-envelope check follows this list.
- On this Strix Halo setup:
- The 27B-MTP is a strong practical upgrade for long-context chat workflows.
- The 35B-MTP is mixed: faster token generation, but slightly slower end-to-end for these specific long-context tests.
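The prefill-vs-decode trade can be sanity-checked from the single-turn numbers above, approximating total time as prompt_tokens/pp + generated_tokens/gen (~15k prompt, 256 generated):

```bash
awk 'BEGIN {
  # 35B: prefill dominates, so the pp regression outweighs the gen gain
  printf "35B  base: %.1fs  MTP: %.1fs\n", 15000/972.18 + 256/48.18, 15000/811.90 + 256/56.12
  # 27B: decode dominates, so the gen gain wins despite slower prefill
  printf "27B  base: %.1fs  MTP: %.1fs\n", 15000/279.75 + 256/7.63, 15000/244.90 + 256/16.15
}'
# -> 35B: ~20.7s vs ~23.0s; 27B: ~87.2s vs ~77.1s
#    (close to the measured 20.83/23.16s and 87.44/77.39s totals)
```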