Luce DFlash + PFlash on 7900XTX: Qwen3.6-27B at 2.24x decode and 3.05x prefill vs llama.cpp HIP

I ran a few tests on my 7900 XTX and am sharing the results in case they are helpful. Thanks to Lucebox!

By AI Maestro, May 18, 2026

Lucebox DFlash + PFlash PR #119 Reproduction Report (RX 7900 XTX)

Hardware Environment

| Component | Spec |
| --- | --- |
| GPU | AMD Radeon RX 7900 XTX (Navi 31, gfx1100) |
| VRAM | 24 GiB GDDR6 (~936 GB/s) |
| System RAM | 62 GiB DDR5 |
| ROCm | 7.1 |
| OS | Ubuntu 26.04, Linux 7.0.0-14-generic |

Benchmark Results

Model: Qwen3.6-27B Q4_K_M (15.65 GiB) + Lucebox Q8_0 DFlash drafter (1.84 GiB)
Test: 10-prompt HumanEval-style, --n-gen 128, --fast-rollback
Baseline: llama.cpp HIP AR (tg128), 28.07 tok/s

| Config | Mean tok/s | Mean AL | Speedup (vs llama.cpp HIP) |
| --- | --- | --- | --- |
| llama.cpp HIP AR | 28.07 | - | 1.00x |
| DFlash (chain speculation) | 64.23 | 5.36 | 2.29x |
| DFlash DDTree budget=8 | 62.75 | 4.93 | 2.24x |
| DFlash DDTree budget=22 | 60.94 | 6.11 | 2.17x |
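As a quick sanity check, the speedup column can be recomputed from the tok/s column. This is a minimal sketch that only uses the numbers in the table above; the variable and function names are mine, not from the Lucebox scripts.

```python
# Throughput figures (tok/s) copied from the results table above.
BASELINE_TOK_S = 28.07  # llama.cpp HIP autoregressive decode (tg128)

results = {
    "DFlash (chain speculation)": 64.23,
    "DFlash DDTree budget=8": 62.75,
    "DFlash DDTree budget=22": 60.94,
}


def speedup(tok_s, baseline=BASELINE_TOK_S):
    """Return throughput relative to the baseline, rounded to two decimals."""
    return round(tok_s / baseline, 2)


speedups = {name: speedup(tok_s) for name, tok_s in results.items()}

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```

The recomputed values agree with the speedup column as reported.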

Key Findings

  1. Among the DDTree budgets tested, budget=8 is the fastest on the 7900 XTX (62.75 tok/s vs 60.94 tok/s at budget=22), consistent with the blog. GDDR6’s high bandwidth favors smaller trees to avoid tile waste; Strix Halo’s LPDDR5X needs budget=22 to amortize launch overhead.
  2. The 2.24x speedup matches the blog’s 2.23x on Strix Halo. In absolute terms, the 7900 XTX’s 62.75 tok/s is about 2.3x the blog’s 26.85 tok/s, which tracks its memory-bandwidth advantage (roughly 3.7x, taking Strix Halo’s LPDDR5X at a nominal ~256 GB/s).
  3. Standard chain speculation (no DDTree) is slightly faster than either DDTree setting (64.23 tok/s), which suggests that simpler strategies carry less overhead for short generations (128 tokens).
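One way to see why the larger tree loses despite its higher acceptance length is to back out an implied time per speculation step. The sketch below assumes that "Mean AL" is the mean number of tokens accepted per verify step and that the reported tok/s covers decode only, so step time is roughly AL / tok/s. Both are my assumptions, not something the benchmark script states.

```python
# (mean tok/s, mean AL) copied from the results table.
# Assumption: AL = tokens accepted per verify step, tok/s = decode only,
# so the implied time per step is AL / tok_s.
runs = {
    "chain": (64.23, 5.36),
    "ddtree budget=8": (62.75, 4.93),
    "ddtree budget=22": (60.94, 6.11),
}
AR_TOK_S = 28.07  # llama.cpp HIP baseline, one token per step


def step_ms(tok_s, accept_len=1.0):
    """Implied milliseconds per step given throughput and tokens per step."""
    return 1000.0 * accept_len / tok_s


step_times = {name: step_ms(tok_s, al) for name, (tok_s, al) in runs.items()}
ar_step = step_ms(AR_TOK_S)

for name, ms in step_times.items():
    print(f"{name}: {ms:.1f} ms per step ({ms / ar_step:.2f}x one AR step)")
```

Under those assumptions a budget=22 step costs about 100 ms against roughly 79 ms at budget=8, so the extra accepted tokens do not pay for the larger tree on this card.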

Full Reproduction Steps

1. Clone repo and checkout PR #119

    git clone https://github.com/Luce-Org/lucebox-hub.git
    cd lucebox-hub
    git fetch origin pull/119/head:pr119 && git checkout pr119
    git submodule update --init --recursive

2. Install rocWMMA headers (optional but recommended, enables Phase 2 FlashPrefill)

If you don’t have sudo to install the rocwmma package, fetch headers directly from GitHub:

    git clone --depth 1 https://github.com/ROCm/rocWMMA.git /tmp/rocwmma
    mkdir -p /tmp/rocm_include/include
    cp -r /tmp/rocwmma/library/include/rocwmma /tmp/rocm_include/include/rocwmma

3. Build (gfx1100 / 7900 XTX)

    cd dflash
    cmake -B build -S . \
      -DCMAKE_BUILD_TYPE=Release \
      -DDFLASH27B_GPU_BACKEND=hip \
      -DDFLASH27B_HIP_ARCHITECTURES=gfx1100 \
      -DDFLASH27B_HIP_SM80_EQUIV=ON \
      -DROCM_PATH=/tmp/rocm_include  # path from step 2; omit if rocwmma is system-installed
    cmake --build build --target test_dflash -j$(nproc)

Replace gfx1100 with your GPU arch, e.g. gfx1151 (Strix Halo), gfx1030 (Navi 21), etc.
To skip rocWMMA, set -DDFLASH27B_HIP_SM80_EQUIV=OFF to use the q8 fallback.
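If you are unsure which arch string to pass, it can be read from the output of rocminfo. The helper below is a small sketch of my own (not part of the Lucebox repo) that pulls gfx names out of that text. The sample string is an abbreviated illustration, and the pattern is a heuristic that assumes arch names look like gfx followed by hex digits.

```python
import re
import shutil
import subprocess

# Heuristic: arch names look like "gfx1100" or "gfx90a"; skip generic ISA
# names such as "gfx11-generic" in case the tool lists them.
GFX_PATTERN = re.compile(r"\bgfx[0-9a-f]+\b(?!-generic)")


def detect_gfx_archs(rocminfo_text):
    """Return the sorted, de-duplicated gfx arch names found in the text."""
    return sorted(set(GFX_PATTERN.findall(rocminfo_text)))


def query_rocminfo():
    """Return rocminfo's stdout, or an empty string if it is not on PATH."""
    if shutil.which("rocminfo") is None:
        return ""
    result = subprocess.run(
        ["rocminfo"], capture_output=True, text=True, check=False
    )
    return result.stdout


# Abbreviated, illustrative sample of what the relevant lines can look like.
sample = """
  Name:                    AMD Ryzen CPU
  Name:                    gfx1100
      Name:                    amdgcn-amd-amdhsa--gfx1100
"""
print(detect_gfx_archs(sample))
```

On a real machine, `detect_gfx_archs(query_rocminfo())` gives the candidates for -DDFLASH27B_HIP_ARCHITECTURES; an empty list means rocminfo was not found or reported no GPU agent.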

4. Download models

    mkdir -p models/draft
    wget -c -O models/Qwen3.6-27B-Q4_K_M.gguf \
      "https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/resolve/main/Qwen3.6-27B-Q4_K_M.gguf"
    wget -c -O models/draft/dflash-draft-3.6-q8_0.gguf \
      "https://huggingface.co/Lucebox/Qwen3.6-27B-DFlash-GGUF/resolve/main/dflash-draft-3.6-q8_0.gguf"

5. Install Python dependencies (for bench script)

    pip3 install --break-system-packages transformers torch

6. Run the benchmark

    # DFlash DDTree budget=8 (recommended for gfx1100)
    cd dflash  # skip if you are still inside dflash/ from step 3
    LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
    DFLASH_BIN=$PWD/build/test_dflash \
    DFLASH_TARGET=$PWD/models/Qwen3.6-27B-Q4_K_M.gguf \
    DFLASH_DRAFT=$PWD/models/draft/dflash-draft-3.6-q8_0.gguf \
    DFLASH27B_DRAFT_SWA=2048 \
    DFLASH27B_PREFILL_UBATCH=512 \
    python3 scripts/bench_he.py --n-gen 128 --ddtree-budget 8

Environment variables

| Variable | Meaning |
| --- | --- |
| DFLASH_BIN | Path to test_dflash binary |
| DFLASH_TARGET | Path to target model GGUF |
| DFLASH_DRAFT | Path to draft model GGUF |
| DFLASH27B_DRAFT_SWA | Draft sliding window attention window for Qwen3.6 (2048) |
| DFLASH27B_PREFILL_UBATCH | Compressed prefill micro-batch size (512, applies PR #159) |

bench_he.py common arguments

| Argument | Description |
| --- | --- |
| --n-gen N | Tokens to generate per prompt (default 128) |
| --ddtree-budget N | DDTree node budget (8/22/32/48/64/96/128) |
| --ddtree-temp T | Draft logits temperature (T<1 widens top-1/top-2 gap) |
| --max-ctx N | Maximum context length |
| --target-tokenizer REPO | Target model tokenizer (default Qwen/Qwen3.5-27B) |
| --target-split-dflash | Use target layer-split mode (shows prefill timing) |
| --skip-tokenize | Skip tokenization step (reuse cache) |
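To compare several budgets without retyping the environment each time, the invocation from step 6 can be wrapped in a small driver. This is only a sketch: `build_env`, `build_command` and `run_sweep` are helper names I made up, the paths mirror the ones used in step 6, and the module-level loop is a dry run that prints the commands rather than launching them.

```python
import os
import subprocess
from pathlib import Path


def build_env(dflash_dir, base_env=None):
    """Return a copy of the environment with the step 6 variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "DFLASH_BIN": str(dflash_dir / "build" / "test_dflash"),
        "DFLASH_TARGET": str(dflash_dir / "models" / "Qwen3.6-27B-Q4_K_M.gguf"),
        "DFLASH_DRAFT": str(
            dflash_dir / "models" / "draft" / "dflash-draft-3.6-q8_0.gguf"
        ),
        "DFLASH27B_DRAFT_SWA": "2048",
        "DFLASH27B_PREFILL_UBATCH": "512",
    })
    # Prepend the ROCm library directory, as in the shell command.
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = "/opt/rocm/lib" + (":" + existing if existing else "")
    return env


def build_command(budget, n_gen=128):
    """Return the bench_he.py argv for one DDTree budget."""
    return [
        "python3", "scripts/bench_he.py",
        "--n-gen", str(n_gen),
        "--ddtree-budget", str(budget),
    ]


def run_sweep(dflash_dir, budgets=(8, 22)):
    """Run the benchmark once per budget from inside the dflash directory."""
    env = build_env(dflash_dir)
    for budget in budgets:
        subprocess.run(build_command(budget), cwd=dflash_dir, env=env, check=True)


# Dry run: show what would be executed for the two budgets tested here.
for budget in (8, 22):
    print(" ".join(build_command(budget)))
```

Calling `run_sweep(Path("lucebox-hub/dflash"))` would execute the real runs, assuming the build and model downloads from the earlier steps are in place.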

7. Build and run llama.cpp baseline for comparison

    # Build separately from dflash/deps/llama.cpp (the -S path assumes the repository root)
    BUILD_DIR=/tmp/llama-bench-build
    cmake -B $BUILD_DIR -S dflash/deps/llama.cpp \
      -DCMAKE_BUILD_TYPE=Release \
      -DGGML_HIP=ON \
      -DLLAMA_BUILD_TOOLS=ON
    cmake --build $BUILD_DIR --target llama-bench -j$(nproc)

    # Run baseline
    # Note: -m is relative to the current directory; step 4 placed the models under dflash/models/
    LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
    $BUILD_DIR/bin/llama-bench \
      -m models/Qwen3.6-27B-Q4_K_M.gguf \
      -n 128 -p 128 -o md

Comparison with Blog Data

| Metric | Strix Halo (gfx1151), blog | 7900 XTX (gfx1100), this run |
| --- | --- | --- |
| llama.cpp HIP AR | 12.02 tok/s | 28.07 tok/s |
| DFlash (optimal budget) | 26.85 tok/s (budget=22) | 62.75 tok/s (budget=8) |
| Speedup | 2.23x | 2.24x |
| Optimal budget | 22 (LPDDR5X bandwidth bottleneck) | 8 (GDDR6 high bandwidth) |

Blog: https://www.lucebox.com/blog/amd

Notes

  1. BSA scoring kernel is not implemented on HIP — it falls back to ggml flash_attn_ext (~3.4x slower than CUDA BSA). This is the remaining PFlash optimization headroom.
  2. PR #159 ubatch=512 is applied via the DFLASH27B_PREFILL_UBATCH=512 env variable (manually layered on top of PR #119).
  3. VRAM limitation: the 7900 XTX’s 24 GiB is insufficient for a full 16K-context PFlash test. The target weights (~16 GiB), the drafter (1.84 GiB), and a 16K KV cache (~6 GiB) add up to roughly 23.5 GiB before any compute buffers, which leaves no practical headroom in 24 GiB. Large-context runs with a large model call for something like Strix Halo’s 128 GiB unified memory.
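The VRAM note comes down to simple addition. The sketch below tallies the sizes quoted in this report; the ~6 GiB KV cache figure is the estimate from the note itself, and compute buffers are deliberately left out because I did not measure them.

```python
# Sizes in GiB, taken from the figures quoted in this report.
VRAM_TOTAL_GIB = 24.0

components = {
    "target weights (Q4_K_M)": 15.65,
    "draft weights (Q8_0)": 1.84,
    "16K KV cache (estimate from the note)": 6.0,
}

used_gib = sum(components.values())
headroom_gib = VRAM_TOTAL_GIB - used_gib

for name, size in components.items():
    print(f"{name}: {size:.2f} GiB")
print(f"total: {used_gib:.2f} GiB, headroom: {headroom_gib:.2f} GiB")
```

With about half a GiB left before activations and scratch buffers are counted, the 16K test does not fit on this card.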

submitted by /u/Fit-Courage5400

Originally published at reddit.com. Curated by AI Maestro.