Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP
Hey fellow Llamas, keeping it short.
We just added support for DFlash and PFlash on the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). It uses the same Luce DFlash stack as our previous post about the RTX 3090.
Repo: https://github.com/Luce-Org/lucebox-hub (MIT)
The numbers
- End-to-end performance on Qwen3.6-27B (Q4_K_M): 26.85 tokens per second for decode and a 20.2-second prefill at 16K context. That’s 2.23x faster for decode and 3.05x faster for prefill compared to llama.cpp HIP on the same hardware.
- Total wall clock time: Reduced from 147 seconds to 58 seconds, a 2.5x end-to-end speedup.
- Model size support: The system can now host checkpoints up to ~100 GiB, a class of models that a 24 GiB consumer GPU cannot touch (such as Qwen3.5-122B-A10B and MiniMax-M2.7-REAP 139B-A10B).
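To get a feel for the model size point, here is a back-of-the-envelope sketch of weight footprint. It assumes roughly 4.85 bits per weight for Q4_K_M (an approximate average, real GGUF files vary), assumes the larger models would also be quantized that way, and ignores KV cache and drafter memory. The helper names are made up for illustration.

```python
# Rough quantized checkpoint size: parameters * bits-per-weight / 8.
# ASSUMPTION: ~4.85 bits/weight as an average for Q4_K_M; actual files differ.
BITS_PER_WEIGHT_Q4_K_M = 4.85
GIB = 1024 ** 3


def approx_size_gib(params_billion: float,
                    bits_per_weight: float = BITS_PER_WEIGHT_Q4_K_M) -> float:
    """Approximate weight footprint in GiB (weights only, no KV cache)."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def fits(params_billion: float, budget_gib: float) -> bool:
    """True if the estimated weights fit inside the given memory budget."""
    return approx_size_gib(params_billion) <= budget_gib


# Parameter counts are read off the model names in the post.
models = {
    "Qwen3.6-27B": 27,
    "Qwen3.5-122B-A10B": 122,
    "MiniMax-M2.7-REAP 139B-A10B": 139,
}

for name, params in models.items():
    size = approx_size_gib(params)
    print(f"{name}: ~{size:.0f} GiB | fits 24 GiB: {fits(params, 24)} "
          f"| fits ~100 GiB: {fits(params, 100)}")
```

Under these assumptions the 27B target fits a 24 GiB card, while the two MoE checkpoints only fit in the ~100 GiB budget.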
Decode performance:
| Engine | tok/s | vs AR (llama.cpp HIP) |
|---|---|---|
| llama.cpp HIP AR | 12.02 | 1.00x |
| llama.cpp Vulkan AR | 12.45 | 1.04x |
| Luce DFlash (this PR) | 26.85 | 2.23x faster |
Prefill performance:
| Engine | TTFT (seconds) | vs AR (llama.cpp HIP) |
|---|---|---|
| llama.cpp HIP AR | 61.69 | 1.00x |
| Luce PFlash | 20.2 | 3.05x faster |
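For anyone who wants to check the ratios, this small sketch recomputes them from the raw numbers in the post. The `speedup` helper is just an illustrative name.

```python
def speedup(baseline: float, new: float, higher_is_better: bool) -> float:
    """Ratio of new vs baseline, oriented so that >1 means 'new is better'."""
    return new / baseline if higher_is_better else baseline / new


# Decode throughput in tok/s (higher is better).
decode = speedup(12.02, 26.85, higher_is_better=True)
vulkan = speedup(12.02, 12.45, higher_is_better=True)

# Prefill TTFT and total wall clock in seconds (lower is better).
prefill = speedup(61.69, 20.2, higher_is_better=False)
wall_clock = speedup(147, 58, higher_is_better=False)

print(f"decode:     {decode:.2f}x")
print(f"vulkan AR:  {vulkan:.2f}x")
print(f"prefill:    {prefill:.2f}x")
print(f"wall clock: {wall_clock:.1f}x")
```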
The speedup grows with context length: PFlash compression scales as O(S), while AR prefill scales quadratically, O(S^2). NIAH (needle-in-a-haystack) retrieval still passes at 16K tokens.
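Why the gap widens with context can be sketched with a toy cost model. This is an assumption for illustration, not a measurement: AR prefill is modeled as a linear term plus a quadratic attention term, PFlash as purely linear, and the coefficients are arbitrary placeholders.

```python
def ar_prefill_cost(s: int, linear: float, quadratic: float) -> float:
    """Toy AR prefill cost: linear work plus quadratic attention work."""
    return linear * s + quadratic * s * s


def pflash_cost(s: int, per_token: float) -> float:
    """Toy PFlash cost: linear in context length."""
    return per_token * s


def speedup(s: int, linear: float, quadratic: float, per_token: float) -> float:
    """Ratio of the two costs; algebraically linear/per_token + (quadratic/per_token) * s."""
    return ar_prefill_cost(s, linear, quadratic) / pflash_cost(s, per_token)


# HYPOTHETICAL coefficients in arbitrary units, chosen only to show the trend.
LINEAR, QUADRATIC, PER_TOKEN = 1.0, 1e-4, 0.5

contexts = [16_384, 32_768, 65_536, 131_072]
ratios = [speedup(s, LINEAR, QUADRATIC, PER_TOKEN) for s in contexts]

for s, r in zip(contexts, ratios):
    # Normalize to the 16K ratio so only the growth is shown, not absolute values.
    print(f"S={s:>7}: speedup grows {r / ratios[0]:.2f}x relative to 16K")
```

Because the ratio reduces to a constant plus a term proportional to S, it keeps increasing as the context gets longer. Real numbers depend on the actual coefficients, which this sketch does not try to estimate.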
Tuning note: With DFLASH27B_HIP_SM80_EQUIV=ON, the optimal budget on gfx1151 is 22 tokens. Higher budgets accept more tokens per step, but each step becomes more expensive because of LPDDR5X memory bandwidth constraints. On gfx1100, where tile utilization matters more than launch amortization, budget=8 wins.
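If you want to find the best budget on your own box, a sweep along these lines should work. It is a minimal sketch: it reuses the `scripts/bench_he.py` invocation from the Reproduce section, assumes you run it from the `dflash` directory with the same environment variables exported, and the helper names and the extra budget values (16, 32) are my own illustrative choices.

```python
import subprocess


def bench_command(budget: int, n_gen: int = 128) -> list[str]:
    """Build the bench invocation for one budget value (flags as in Reproduce)."""
    return [
        "python3", "scripts/bench_he.py",
        "--n-gen", str(n_gen),
        "--ddtree-budget", str(budget),
    ]


def run_sweep(budgets: list[int], n_gen: int = 128) -> None:
    """Run the bench once per budget; output goes straight to the terminal."""
    for budget in budgets:
        print(f"=== budget {budget} ===", flush=True)
        # check=True stops the sweep if a run fails.
        subprocess.run(bench_command(budget, n_gen), check=True)


# Example (not executed here): run_sweep([8, 16, 22, 32])
```

The sketch does not parse the bench output, since its format is not described here. Compare the reported tok/s by eye or pipe each run to a log.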
Reproduce
To build and run the PR:
    # 1. Build PR #119 for gfx1151
    git clone https://github.com/Luce-Org/lucebox-hub.git
    cd lucebox-hub
    git fetch origin pull/119/head:pr119 && git checkout pr119
    cd dflash
    cmake -B build -S . \
      -DCMAKE_BUILD_TYPE=Release \
      -DDFLASH27B_GPU_BACKEND=hip \
      -DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \
      -DDFLASH27B_HIP_SM80_EQUIV=ON
    cmake --build build --target test_dflash -j

    # 2. Models: Qwen3.6-27B target + Lucebox Q8_0 DFlash drafter
    mkdir -p models/draft
    hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
    hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir models/draft/

    # 3. Bench (DFlash decode + PFlash long-context prefill)
    LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
    DFLASH_BIN=$PWD/build/test_dflash \
    DFLASH_TARGET=$PWD/models/Qwen3.6-27B-Q4_K_M.gguf \
    DFLASH_DRAFT=$PWD/models/draft/dflash-draft-3.6-q8_0.gguf \
    DFLASH27B_DRAFT_SWA=2048 \
    DFLASH27B_PREFILL_UBATCH=512 \
    python3 scripts/bench_he.py --n-gen 128 --ddtree-budget 22
Setting the DFLASH27B_PREFILL_UBATCH=512 environment variable applies the PR #159 fix on top of PR #119. Once PR #159 merges, this will be the default.
What is still missing
- Bilateral scoring kernel for HIP: The drafter compress-score path uses BSA (block-sparse attention) on CUDA. PR #119 disables it for HIP and falls back to ggml’s flash_attn_ext, which is ~3.4x slower. A rocWMMA-native sparse-FA kernel would close this gap, reducing the PFlash TTFT at 16K from 27.6 seconds to around 8 seconds. At 128K context, the projected speedup over llama.cpp AR is 7-10x.
- Multi-row q4_K decode GEMV: This currently adds 30% overhead to the drafter forward pass at long context and still needs to be optimized.
- Tiling shape tuning for gfx1151: The current rocWMMA flashprefill tiles are tuned for gfx1100. They still need retuning for Strix Halo, which has different LDS and VGPR characteristics.
- 70B+ MoE targets: Models like Qwen3.5-122B-A10B and MiniMax-M2.7-REAP 139B-A10B fit within the 128 GiB headroom, but they require native expert routing in the speculative verify loop.
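The sparse-FA projection in the first bullet can be sanity-checked with an Amdahl-style estimate. This is a sketch under a stated assumption: `fraction_in_kernel` is the share of TTFT spent in the slow scoring path, which the post does not report, so the values other than 1.0 are hypothetical.

```python
def projected_ttft(ttft_s: float, fraction_in_kernel: float,
                   kernel_speedup: float) -> float:
    """Amdahl-style estimate: only the kernel's share of the time gets faster."""
    untouched = 1.0 - fraction_in_kernel
    return ttft_s * (untouched + fraction_in_kernel / kernel_speedup)


TTFT_16K_S = 27.6      # from the post
KERNEL_SPEEDUP = 3.4   # from the post: the fallback is ~3.4x slower

# Upper bound: the scoring path is assumed to be all of TTFT.
print(f"fraction 1.0: {projected_ttft(TTFT_16K_S, 1.0, KERNEL_SPEEDUP):.1f} s")

# HYPOTHETICAL smaller shares, to show how the estimate degrades.
for fraction in (0.9, 0.8):
    est = projected_ttft(TTFT_16K_S, fraction, KERNEL_SPEEDUP)
    print(f"fraction {fraction}: {est:.1f} s")
```

With the full 27.6 seconds attributed to the scoring path, the estimate lands at about 8.1 seconds, in line with the "around 8 seconds" figure. Any time spent elsewhere would push it higher.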
Constraints
We are working on improving many aspects, including architecture-aware tuning and multi-row decode optimization. Feedback is welcome!




