Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP

By AI Maestro May 12, 2026 3 min read

Hey fellow Llamas, keeping it short.

We just added support for DFlash and PFlash on the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). This is using the same Luce DFlash stack from our previous post about the RTX 3090.

Repo: https://github.com/Luce-Org/lucebox-hub (MIT)

The numbers

  • End-to-end performance on Qwen3.6-27B (Q4_K_M): 26.85 tokens per second for decode and 20.2 seconds prefill at 16K context. That’s 2.23x faster for decode and 3.05x faster for prefill compared to llama.cpp HIP on the same hardware.
  • Total wall clock time: Reduced from 147 seconds to 58 seconds, representing a 2.5x speedup end-to-end.
  • Model size support: The system can now host checkpoints up to ~100 GiB, a class of models that a 24 GiB consumer GPU cannot hold (like Qwen3.5-122B-A10B and MiniMax-M2.7-REAP 139B-A10B).

Hardware details

Decode performance:

| Engine | tok/s | vs AR (llama.cpp HIP) |
|---|---|---|
| llama.cpp HIP AR | 12.02 | 1.00x |
| llama.cpp Vulkan AR | 12.45 | 1.04x |
| Luce DFlash (this PR) | 26.85 | 2.23x faster |

Prefill performance:

| Engine | TTFT (seconds) | vs AR (llama.cpp HIP) |
|---|---|---|
| llama.cpp HIP AR | 61.69 | 1.00x |
| Luce PFlash | 20.2 | 3.05x faster |
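As a sanity check, the quoted ratios can be recomputed from the raw measurements. The sketch below uses only numbers from this post. The implied generation length is an inference that assumes wall-clock time is simply prefill plus decode, which the post does not state.

```python
# Recompute the speedups quoted in this post from the raw measurements.
BASE_TOK_S, LUCE_TOK_S = 12.02, 26.85    # decode throughput, tokens/s
BASE_TTFT_S, LUCE_TTFT_S = 61.69, 20.2   # time to first token at 16K, seconds
BASE_WALL_S, LUCE_WALL_S = 147.0, 58.0   # end-to-end wall clock, seconds

decode_speedup = LUCE_TOK_S / BASE_TOK_S
prefill_speedup = BASE_TTFT_S / LUCE_TTFT_S
wall_speedup = BASE_WALL_S / LUCE_WALL_S

# Assumption (not stated in the post): wall clock = prefill + decode.
# Under that assumption, the decode time implies how many tokens were generated.
base_tokens = (BASE_WALL_S - BASE_TTFT_S) * BASE_TOK_S
luce_tokens = (LUCE_WALL_S - LUCE_TTFT_S) * LUCE_TOK_S

print(f"decode  {decode_speedup:.2f}x")
print(f"prefill {prefill_speedup:.2f}x")
print(f"wall    {wall_speedup:.2f}x")
print(f"implied tokens: baseline ~{base_tokens:.0f}, Luce ~{luce_tokens:.0f}")
```

Under that assumption both engines come out at roughly 1,000 generated tokens, so the 2.5x end-to-end figure is consistent with the per-phase numbers.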

The speedup grows with context length: PFlash compression scales as O(S) in sequence length S, while AR prefill scales as O(S^2). The needle-in-a-haystack (NIAH) retrieval test still passes at 16K tokens.

Tuning note: The optimal budget for DFLASH27B_HIP_SM80_EQUIV=ON on gfx1151 is 22 tokens. Higher budgets accept more tokens per step but each step becomes more expensive due to the LPDDR5X memory bandwidth constraints. For gfx1100, where tile utilization matters more than launch amortization, budget=8 wins.
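To find the best budget on a different part, one option is to sweep the value and compare throughput. The following is a minimal sketch, assuming the benchmark script and its `--n-gen` and `--ddtree-budget` options behave as in the Reproduce section below. The helper names `bench_cmd` and `sweep` are hypothetical, and the candidate budgets 16 and 28 are illustrative values, not ones from this post. `DFLASH_BIN`, `DFLASH_TARGET`, and `DFLASH_DRAFT` still need to be set in the environment as shown in Reproduce.

```python
import os
import subprocess

def bench_cmd(budget, n_gen=128):
    """Command line for one benchmark run at a given draft-tree budget."""
    return ["python3", "scripts/bench_he.py",
            "--n-gen", str(n_gen), "--ddtree-budget", str(budget)]

def sweep(budgets, env_overrides, dry_run=True):
    """Print (and, unless dry_run, execute) one benchmark run per budget."""
    env = {**os.environ, **env_overrides}
    cmds = [bench_cmd(b) for b in budgets]
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sweep if a run fails.
            subprocess.run(cmd, env=env, check=True)
    return cmds

# Environment variables taken from the Reproduce section of this post.
overrides = {"DFLASH27B_DRAFT_SWA": "2048", "DFLASH27B_PREFILL_UBATCH": "512"}

# 8 and 22 are the budgets named in the tuning note; 16 and 28 are hypothetical.
commands = sweep([8, 16, 22, 28], overrides, dry_run=True)
```

With `dry_run=True` the sketch only prints the commands. Set it to `False` from the `dflash` directory to execute them and compare the reported tokens per second.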

Reproduce

To build and run the PR:

# 1. Build PR #119 for gfx1151
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/119/head:pr119 && git checkout pr119
cd dflash
cmake -B build -S . \
-DCMAKE_BUILD_TYPE=Release \
-DDFLASH27B_GPU_BACKEND=hip \
-DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \
-DDFLASH27B_HIP_SM80_EQUIV=ON
cmake --build build --target test_dflash -j

# 2. Models: Qwen3.6-27B target + Lucebox Q8_0 DFlash drafter
mkdir -p models/draft
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir models/draft/

# 3. Bench (DFlash decode + PFlash long-context prefill)
LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
DFLASH_BIN=$PWD/build/test_dflash \
DFLASH_TARGET=$PWD/models/Qwen3.6-27B-Q4_K_M.gguf \
DFLASH_DRAFT=$PWD/models/draft/dflash-draft-3.6-q8_0.gguf \
DFLASH27B_DRAFT_SWA=2048 \
DFLASH27B_PREFILL_UBATCH=512 \
python3 scripts/bench_he.py --n-gen 128 --ddtree-budget 22

The DFLASH27B_PREFILL_UBATCH=512 environment variable applies the PR #159 fix on top of PR #119. Once PR #159 merges, this will be the default setting.

What is still missing

  • Bilateral scoring kernel for HIP: The drafter compress-score path uses BSA (block-sparse attention) on CUDA. PR #119 disables it for HIP and falls back to ggml’s flash_attn_ext, which is ~3.4x slower. A rocWMMA-native sparse-FA kernel would close this gap, reducing the PFlash TTFT at 16K from 27.6 seconds to around 8 seconds. At 128K context, the projected speedup over llama.cpp AR is 7-10x.
  • Multi-row q4_K decode GEMV: This is currently a 30% overhead in the drafter forward pass at long context and needs to be optimized for better performance.
  • Tiling shape tuning for gfx1151: The current rocWMMA flashprefill tiles are tuned for gfx1100. Tuning them for Strix Halo, which has different LDS and VGPR characteristics, is required.
  • 70B+ MoE targets: Models like Qwen3.5-122B-A10B and MiniMax-M2.7-REAP 139B-A10B fit within the 128 GiB headroom but still need native expert routing in the speculative verify loop.
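The "around 8 seconds" estimate in the first item appears to follow from dividing the current TTFT by the 3.4x slowdown. The sketch below makes that arithmetic explicit with an Amdahl-style formula. The function `projected_ttft` and its `kernel_fraction` parameter are hypothetical: the post does not say what share of the TTFT is spent in the fallback kernel, so the default of 1.0 is the best case.

```python
def projected_ttft(ttft_s, kernel_speedup, kernel_fraction=1.0):
    """Amdahl-style estimate of TTFT after speeding up one kernel.

    kernel_fraction is the share of ttft_s spent in that kernel
    (hypothetical parameter; 1.0 means the kernel dominates entirely).
    """
    untouched = ttft_s * (1.0 - kernel_fraction)
    accelerated = ttft_s * kernel_fraction / kernel_speedup
    return untouched + accelerated

# Figures quoted above: 27.6 s at 16K, fallback ~3.4x slower than BSA.
best_case = projected_ttft(27.6, 3.4)
print(f"best case: {best_case:.1f} s")

# If only 80% of the time were in the scoring kernel (illustrative value),
# the gain would be smaller.
print(f"80% case:  {projected_ttft(27.6, 3.4, kernel_fraction=0.8):.1f} s")
```

The best case lands at about 8.1 seconds, matching the figure in the list. Any time spent outside the scoring kernel pushes the real number higher.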

Constraints

We are working on improving many aspects, including architecture-aware tuning and multi-row decode optimization. Feedback is welcome!
