I tested this a bit on my 7900 XTX and am sharing the results in case they are helpful. Thanks to Lucebox!

Lucebox DFlash + PFlash PR #119 Reproduction Report (RX 7900 XTX)

Hardware Environment

Component	Spec
GPU	AMD Radeon RX 7900 XTX (Navi 31, gfx1100)
VRAM	24 GiB GDDR6 (~960 GB/s)
System RAM	62 GiB DDR5
ROCm	7.1
OS	Ubuntu 26.04, Linux 7.0.0-14-generic

Benchmark Results

Model: Qwen3.6-27B Q4_K_M (15.65 GiB) + Lucebox Q8_0 DFlash drafter (1.84 GiB)
Test: 10-prompt HumanEval-style, --n-gen 128, --fast-rollback
Baseline: llama.cpp HIP AR (tg128), 28.07 tok/s

Config	Mean tok/s	Mean AL	Speedup (vs llama.cpp HIP)
llama.cpp HIP AR	28.07	–	1.00x
DFlash (chain speculation)	64.23	5.36	2.29x
DFlash DDTree budget=8	62.75	4.93	2.24x
DFlash DDTree budget=22	60.94	6.11	2.17x
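
The speedup column can be re-derived from the throughput figures alone. The following is a small sketch that only uses the numbers in the table above; the variable and function names are made up for illustration.

```python
# Throughput figures copied from the table above (tok/s).
BASELINE_TOK_S = 28.07  # llama.cpp HIP AR

results = {
    "DFlash (chain speculation)": 64.23,
    "DFlash DDTree budget=8": 62.75,
    "DFlash DDTree budget=22": 60.94,
}


def speedup(tok_s: float, baseline: float = BASELINE_TOK_S) -> float:
    """Return throughput relative to the baseline, rounded to 2 decimals."""
    return round(tok_s / baseline, 2)


speedups = {name: speedup(tok_s) for name, tok_s in results.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```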

Key Findings

Among the DDTree budgets tested, budget=8 is the fastest on the 7900 XTX (62.75 tok/s vs 60.94 tok/s at budget=22), consistent with the blog. GDDR6’s high bandwidth favors smaller trees to avoid tile waste, whereas Strix Halo’s LPDDR5X needs budget=22 to amortize launch overhead.
The 2.24x speedup matches the blog’s 2.23x on Strix Halo. In absolute terms, the 7900 XTX’s 62.75 tok/s is about 2.3x Strix Halo’s 26.85 tok/s, in line with its much higher memory bandwidth.
Standard chain speculation (no DDTree) is slightly faster still (64.23 tok/s), which suggests that simpler strategies carry less overhead for short generations (128 tokens).
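
One way to see why a higher acceptance length does not automatically win is to back out the implied cost of each speculation step. This is a rough sketch, assuming "Mean AL" is the mean number of tokens committed per draft-and-verify step and that the tok/s figures cover decode only; neither assumption is confirmed by the benchmark script.

```python
# (tok/s, mean AL) pairs copied from the benchmark table.
runs = {
    "chain": (64.23, 5.36),
    "ddtree budget=8": (62.75, 4.93),
    "ddtree budget=22": (60.94, 6.11),
}


def ms_per_step(tok_s: float, mean_al: float) -> float:
    """Implied wall time of one draft+verify step, in milliseconds.

    Assumes each step commits `mean_al` tokens on average, so
    steps/s = tok_s / mean_al.
    """
    return 1000.0 * mean_al / tok_s


step_ms = {name: ms_per_step(*pair) for name, pair in runs.items()}
baseline_ms_per_token = 1000.0 / 28.07  # llama.cpp HIP AR, one token per step

for name, ms in step_ms.items():
    print(f"{name}: ~{ms:.1f} ms per step")
print(f"AR baseline: ~{baseline_ms_per_token:.1f} ms per token")
```

Under those assumptions, budget=22 commits more tokens per step than budget=8 (6.11 vs 4.93) but each step costs roughly 100 ms instead of roughly 79 ms, which is enough to erase the gain.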

Full Reproduction Steps

1. Clone repo and checkout PR #119

    git clone https://github.com/Luce-Org/lucebox-hub.git
    cd lucebox-hub
    git fetch origin pull/119/head:pr119 && git checkout pr119
    git submodule update --init --recursive

2. Install rocWMMA headers (optional but recommended, enables Phase 2 FlashPrefill)

If you don’t have sudo to install the rocwmma package, fetch headers directly from GitHub:

    git clone --depth 1 https://github.com/ROCm/rocWMMA.git /tmp/rocwmma
    mkdir -p /tmp/rocm_include/include
    cp -r /tmp/rocwmma/library/include/rocwmma /tmp/rocm_include/include/rocwmma

3. Build (gfx1100 / 7900 XTX)

    cd dflash
    cmake -B build -S . \
      -DCMAKE_BUILD_TYPE=Release \
      -DDFLASH27B_GPU_BACKEND=hip \
      -DDFLASH27B_HIP_ARCHITECTURES=gfx1100 \
      -DDFLASH27B_HIP_SM80_EQUIV=ON \
      -DROCM_PATH=/tmp/rocm_include  # path from step 2; omit if rocwmma is system-installed
    cmake --build build --target test_dflash -j$(nproc)

Replace gfx1100 with your GPU arch, e.g. gfx1151 (Strix Halo), gfx1030 (Navi 21), etc.
To skip rocWMMA, set -DDFLASH27B_HIP_SM80_EQUIV=OFF to use the q8 fallback.
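
If you are unsure which arch string to pass, something like the following can pull it out of `rocminfo`. This is a sketch, not part of the Lucebox repo: the helper names are invented here, and it assumes `rocminfo` is on your PATH and prints the target as a `gfx...` token, which may vary between ROCm releases.

```python
import re
import subprocess


def parse_gfx_archs(rocminfo_text: str) -> list[str]:
    """Return the distinct gfx targets found in rocminfo-style text, in order."""
    seen: list[str] = []
    for arch in re.findall(r"\bgfx[0-9a-f]{3,}\b", rocminfo_text):
        if arch not in seen:
            seen.append(arch)
    return seen


def detect_gfx_archs() -> list[str]:
    """Run rocminfo and parse its output (raises if rocminfo is missing or fails)."""
    out = subprocess.run(
        ["rocminfo"], capture_output=True, text=True, check=True
    )
    return parse_gfx_archs(out.stdout)
```

Calling `detect_gfx_archs()` on a machine with a working ROCm install should return a list you can feed into `-DDFLASH27B_HIP_ARCHITECTURES`; on a system with both an iGPU and a dGPU it may return more than one entry.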

4. Download models

    mkdir -p models/draft
    wget -c -O models/Qwen3.6-27B-Q4_K_M.gguf \
      "https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/resolve/main/Qwen3.6-27B-Q4_K_M.gguf"
    wget -c -O models/draft/dflash-draft-3.6-q8_0.gguf \
      "https://huggingface.co/Lucebox/Qwen3.6-27B-DFlash-GGUF/resolve/main/dflash-draft-3.6-q8_0.gguf"
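
Since these are multi-gigabyte downloads, a quick sanity check before benchmarking can save time. The sketch below only verifies that a file starts with the 4-byte `GGUF` magic, which catches saved HTML error pages but not truncated downloads; the helper name is made up for this post.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def looks_like_gguf(path) -> bool:
    """Return True if `path` is a regular file that begins with the GGUF magic."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(4) == GGUF_MAGIC
```

For example, `looks_like_gguf("models/Qwen3.6-27B-Q4_K_M.gguf")` should be true after a successful download.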

5. Install Python dependencies (for bench script)

    pip3 install --break-system-packages transformers torch

6. Run the benchmark

    # DFlash DDTree budget=8 (recommended for gfx1100)
    cd dflash  # skip if you are still in dflash/ from step 3
    LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
    DFLASH_BIN=$PWD/build/test_dflash \
    DFLASH_TARGET=$PWD/models/Qwen3.6-27B-Q4_K_M.gguf \
    DFLASH_DRAFT=$PWD/models/draft/dflash-draft-3.6-q8_0.gguf \
    DFLASH27B_DRAFT_SWA=2048 \
    DFLASH27B_PREFILL_UBATCH=512 \
    python3 scripts/bench_he.py --n-gen 128 --ddtree-budget 8

Environment variables

Variable	Meaning
`DFLASH_BIN`	Path to test_dflash binary
`DFLASH_TARGET`	Path to target model GGUF
`DFLASH_DRAFT`	Path to draft model GGUF
`DFLASH27B_DRAFT_SWA`	Draft sliding window attention window for Qwen3.6 (2048)
`DFLASH27B_PREFILL_UBATCH`	Compressed prefill micro-batch size (512, applies PR #159)
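
If you prefer to drive the benchmark from Python rather than the shell, the same environment can be assembled as a dict. This is only a sketch: the `dflash_env` helper is hypothetical, and the paths simply mirror the layout used in steps 3 to 6.

```python
import os
from pathlib import Path


def dflash_env(dflash_dir, base_env=None) -> dict:
    """Build the environment used in step 6, rooted at the dflash/ directory."""
    root = Path(dflash_dir)
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "DFLASH_BIN": str(root / "build" / "test_dflash"),
            "DFLASH_TARGET": str(root / "models" / "Qwen3.6-27B-Q4_K_M.gguf"),
            "DFLASH_DRAFT": str(
                root / "models" / "draft" / "dflash-draft-3.6-q8_0.gguf"
            ),
            "DFLASH27B_DRAFT_SWA": "2048",
            "DFLASH27B_PREFILL_UBATCH": "512",
        }
    )
    # Prepend the ROCm libraries, keeping any existing search path.
    previous = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = "/opt/rocm/lib" + (":" + previous if previous else "")
    return env
```

The result can be passed as `env=` to `subprocess.run` together with `cwd=` set to the dflash directory.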

bench_he.py common arguments

Argument	Description
`--n-gen N`	Tokens to generate per prompt (default 128)
`--ddtree-budget N`	DDTree node budget (8/22/32/48/64/96/128)
`--ddtree-temp T`	Draft logits temperature (T<1 widens top-1/top-2 gap)
`--max-ctx N`	Maximum context length
`--target-tokenizer REPO`	Target model tokenizer (default Qwen/Qwen3.5-27B)
`--target-split-dflash`	Use target layer-split mode (shows prefill timing)
`--skip-tokenize`	Skip tokenization step (reuse cache)
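
To compare budgets on your own card, the command line from step 6 can be generated for each budget in the list above. A minimal sketch using only the `--n-gen` and `--ddtree-budget` flags from the table; the function name is invented, and actually running the commands is left to the caller.

```python
# Budgets listed for --ddtree-budget in the table above.
DDTREE_BUDGETS = (8, 22, 32, 48, 64, 96, 128)


def bench_argv(budget: int, n_gen: int = 128) -> list[str]:
    """Return the bench_he.py command line for one DDTree budget."""
    return [
        "python3",
        "scripts/bench_he.py",
        "--n-gen",
        str(n_gen),
        "--ddtree-budget",
        str(budget),
    ]


sweep = [bench_argv(b) for b in DDTREE_BUDGETS]
for argv in sweep:
    print(" ".join(argv))
```

Each entry can be handed to `subprocess.run(argv, cwd=..., env=...)` from inside the dflash directory with the environment variables from step 6 set.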

7. Build and run llama.cpp baseline for comparison

    # Build separately from dflash/deps/llama.cpp (run from the repository root)
    BUILD_DIR=/tmp/llama-bench-build
    cmake -B $BUILD_DIR -S dflash/deps/llama.cpp \
      -DCMAKE_BUILD_TYPE=Release \
      -DGGML_HIP=ON \
      -DLLAMA_BUILD_TOOLS=ON
    cmake --build $BUILD_DIR --target llama-bench -j$(nproc)

    # Run baseline (models were downloaded into dflash/models in step 4)
    LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
    $BUILD_DIR/bin/llama-bench \
      -m dflash/models/Qwen3.6-27B-Q4_K_M.gguf \
      -n 128 -p 128 -o md

Comparison with Blog Data

Metric	Strix Halo (gfx1151) Blog	7900 XTX (gfx1100) This Run
llama.cpp HIP AR	12.02 tok/s	28.07 tok/s
DFlash (optimal budget)	26.85 tok/s (budget=22)	62.75 tok/s (budget=8)
Speedup	2.23x	2.24x
Optimal budget	22 (LPDDR5X bandwidth bottleneck)	8 (GDDR6 high bandwidth)
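
The cross-platform picture can be checked with a few divisions. A short sketch using only the numbers from the comparison table; the dictionary layout is just for illustration.

```python
# tok/s figures copied from the comparison table.
strix_halo = {"ar": 12.02, "dflash": 26.85}  # gfx1151, blog, budget=22
rx7900xtx = {"ar": 28.07, "dflash": 62.75}   # gfx1100, this run, budget=8

# Speedup of DFlash over autoregressive decoding on each platform.
speedup_strix = round(strix_halo["dflash"] / strix_halo["ar"], 2)
speedup_xtx = round(rx7900xtx["dflash"] / rx7900xtx["ar"], 2)

# How much faster the 7900 XTX is than Strix Halo in each mode.
xtx_vs_strix_ar = round(rx7900xtx["ar"] / strix_halo["ar"], 2)
xtx_vs_strix_dflash = round(rx7900xtx["dflash"] / strix_halo["dflash"], 2)

print(f"Speedup: Strix Halo {speedup_strix}x, 7900 XTX {speedup_xtx}x")
print(f"7900 XTX vs Strix Halo: AR {xtx_vs_strix_ar}x, DFlash {xtx_vs_strix_dflash}x")
```

Both platform ratios land at about 2.34x, so DFlash scales the two cards by nearly the same factor as plain autoregressive decoding does.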

Blog: https://www.lucebox.com/blog/amd

Notes

The BSA scoring kernel is not implemented on HIP; it falls back to ggml flash_attn_ext (~3.4x slower than CUDA BSA). This is the remaining PFlash optimization headroom.
PR #159 ubatch=512 is applied via the DFLASH27B_PREFILL_UBATCH=512 env variable (manually layered on top of PR #119).
VRAM limitation: The 7900 XTX’s 24 GiB is insufficient for a full 16K context PFlash test. Target weights (~16 GiB) plus a 16K KV cache (~6 GiB) already come to ~22 GiB, and the 1.84 GiB drafter plus working buffers leave essentially no headroom within 24 GiB. Strix Halo’s 128 GiB unified memory is needed for large context + large model workloads.
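
The VRAM squeeze can be made concrete with the sizes quoted in this post. This is back-of-the-envelope arithmetic only: the ~6 GiB KV figure is the estimate from the note above, and compute/scratch buffers are not counted because their size was not measured.

```python
VRAM_GIB = 24.0

# Sizes quoted in this post (GiB).
target_weights = 15.65  # Qwen3.6-27B Q4_K_M
draft_weights = 1.84    # Lucebox Q8_0 DFlash drafter
kv_cache_16k = 6.0      # rough 16K-context estimate from the note above

used = round(target_weights + draft_weights + kv_cache_16k, 2)
headroom = round(VRAM_GIB - used, 2)

print(f"Weights + drafter + 16K KV: {used} GiB")
print(f"Left for compute buffers and the display: {headroom} GiB")
```

About half a GiB of slack is unlikely to cover activation and scratch buffers, which is why the 16K test does not fit in practice.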

submitted by /u/Fit-Courage5400

Luce DFlash + PFlash on 7900XTX: Qwen3.6-27B at 2.24x decode and 3.05x prefill vs llama.cpp HIP