BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.


By AI Maestro May 22, 2026 2 min read

BeeLlama v0.2.0 – Major DFlash Update

BeeLlama v0.2.0 is here!

Not quite a pegasus, but close enough.

GitHub | Qwen 3.6 27B Quick Start | Gemma 4 31B Quick Start

  • Full Gemma 4 31B support with efficient DFlash implementation and vision.
  • Major Qwen 3.6 27B performance update from lower DFlash overhead, cleaner prefill handling, drafter K/V projection caching, and safer CUDA execution.
  • DFlash GGUFs with upstream architecture are now supported.
  • Fixes to adaptive profit behavior around baseline probing.
  • The reduced verifier path is now stricter, with a safer fallback to full logits when grammar, sampler state, or reasoning requires it.
  • Reasoning and tool-call boundaries were tightened.
  • Stricter draft/target validation and better draft-model discovery.
  • And many more improvements!

Benchmarks

  • Setup: Windows 11, AMD Ryzen 7 5700X3D, 32 GB DDR4 RAM, RTX 3090 24 GB
  • Config: same as in quick start docs, but with reasoning off for non-chat prompts
  • Baseline and MTP servers used for comparison: llama.cpp b9275 (CUDA 13.1 Windows prebuilt)
  • The full text of the benchmark prompts is in README.md on GitHub

Qwen 3.6 27B

Target model: Qwen 3.6 27B Q5_K_S or Qwen 3.6 27B MTP Q5_K_S. DFlash model: Q4_K_M.

| Prompt | Server | Output | Median | Best | Speedup | Acceptance |
|---|---|---|---|---|---|---|
| Task store module | Baseline | ~1K tok | 37.2 tok/s | 37.2 tok/s | 1.00x | N/A |
| Task store module | DFlash | ~1K tok | 163.9 tok/s | 181.9 tok/s | 4.40x | 67.7% / 89.2% |
| Task store module | MTP | ~1K tok | 69.3 tok/s | 69.6 tok/s | 1.86x | 92.0% / 73.3% |
| KV report module | Baseline | ~1K tok | 34.6 tok/s | 36.5 tok/s | 1.00x | N/A |
| KV report module | DFlash | ~1K tok | 157.7 tok/s | 162.5 tok/s | 4.56x | 58.8% / 88.9% |
| KV report module | MTP | ~1K tok | 67.3 tok/s | 68.1 tok/s | 1.94x | 89.3% / 73.0% |
| Doubly-linked list | Baseline | ~4K tok | 36.8 tok/s | 36.9 tok/s | 1.00x | N/A |
| Doubly-linked list | DFlash | ~4K tok | 130.8 tok/s | 154.1 tok/s | 3.56x | 50.4% / 86.8% |
| Doubly-linked list | MTP | ~4K tok | 66.3 tok/s | 68.0 tok/s | 1.80x | 87.8% / 72.5% |
| Prompt processing | Baseline | ~20K tok | 1229.5 tok/s | 1229.5 tok/s | 1.00x | N/A |
| Prompt processing | DFlash | ~20K tok | 1214.4 tok/s | 1221.7 tok/s | 0.99x | N/A |
| Prompt processing | MTP | ~20K tok | 1162.6 tok/s | 1164.7 tok/s | 0.95x | N/A |
| Multi-turn coding | Baseline | ~28K tok | 33.3 tok/s | 33.3 tok/s | 1.00x | N/A |
| Multi-turn coding | DFlash | ~30K tok | 64.6 tok/s | 65.4 tok/s | 1.94x | 24.9% / 72.9% |
| Multi-turn coding | MTP | | | | | |

reddit.com. Curated by AI Maestro.
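
The Speedup figures in the Qwen 3.6 27B results appear to be each server's median throughput divided by the baseline median for the same prompt. Below is a minimal Python sketch of that arithmetic, assuming this reading is correct. The dictionary and function names are illustrative and not part of BeeLlama. Because the inputs are the already-rounded medians, a ratio can differ from the published value by about 0.01.

```python
# Median throughput in tok/s, copied from the Qwen 3.6 27B benchmark rows.
# Only rows with complete data are included.
MEDIAN_TOK_S = {
    "Task store module": {"Baseline": 37.2, "DFlash": 163.9, "MTP": 69.3},
    "KV report module": {"Baseline": 34.6, "DFlash": 157.7, "MTP": 67.3},
    "Doubly-linked list": {"Baseline": 36.8, "DFlash": 130.8, "MTP": 66.3},
    "Prompt processing": {"Baseline": 1229.5, "DFlash": 1214.4, "MTP": 1162.6},
}


def speedup(prompt: str, server: str) -> float:
    """Ratio of a server's median throughput to the baseline median
    for the same prompt (assumed definition of the Speedup column)."""
    row = MEDIAN_TOK_S[prompt]
    return row[server] / row["Baseline"]


if __name__ == "__main__":
    for prompt, row in MEDIAN_TOK_S.items():
        for server in row:
            print(f"{prompt:20s} {server:9s} {speedup(prompt, server):.2f}x")
```

Running the script prints one ratio per prompt and server, which can be compared against the published Speedup values.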
