BeeLlama v0.2.0 – Major DFlash Update
BeeLlama v0.2.0 is here!
Not quite a pegasus, but close enough.
GitHub | Qwen 3.6 27B Quick Start | Gemma 4 31B Quick Start
- Full Gemma 4 31B support with efficient DFlash implementation and vision.
- Major Qwen 3.6 27B performance update from lower DFlash overhead, cleaner prefill handling, drafter K/V projection caching, and safer CUDA execution.
- DFlash GGUFs with upstream architecture are now supported.
- Fixes to adaptive profit behavior around baseline probing.
- The reduced verifier path is now stricter and falls back safely to full logits when grammar, sampler state, or reasoning requires it.
- Reasoning and tool-call boundaries were tightened.
- Stricter draft/target validation and better draft-model discovery.
- And many more improvements!
Benchmarks
- Setup: Windows 11, AMD Ryzen 7 5700X3D, 32 GB DDR4 RAM, RTX 3090 24 GB
- Config: same as in the quick start docs, but with reasoning disabled for non-chat prompts
- Baseline and MTP servers used for comparison: llama.cpp b9275, CUDA 13.1 Windows prebuilt
- The full text of the benchmark prompts is in README.md on GitHub
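
The Speedup column in the results table appears to be each server's median throughput divided by the baseline median for the same prompt. That reading is an assumption; the post does not state the formula. The sketch below recomputes the ratio from the rounded medians in the Qwen 3.6 27B table. The `MEDIANS` dictionary and the `speedup` helper are illustrative names, not part of BeeLlama.

```python
# Median throughput in tok/s, copied from the Qwen 3.6 27B results table.
MEDIANS = {
    "Task store module": {"Baseline": 37.2, "DFlash": 163.9, "MTP": 69.3},
    "KV report module": {"Baseline": 34.6, "DFlash": 157.7, "MTP": 67.3},
    "Doubly-linked list": {"Baseline": 36.8, "DFlash": 130.8, "MTP": 66.3},
    "Prompt processing": {"Baseline": 1229.5, "DFlash": 1214.4, "MTP": 1162.6},
}


def speedup(medians, server, baseline="Baseline"):
    """Assumed definition: server median divided by baseline median."""
    return medians[server] / medians[baseline]


for prompt, medians in MEDIANS.items():
    for server in ("DFlash", "MTP"):
        print(f"{prompt:20s} {server:7s} {speedup(medians, server):.2f}x")
```

The recomputed ratios match the published column to within about 0.01x. The small differences are consistent with the published values having been derived from unrounded medians, though that is also a guess.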
Qwen 3.6 27B
Target model: Qwen 3.6 27B Q5_K_S or Qwen 3.6 27B MTP Q5_K_S. DFlash model: Q4_K_M.
| Prompt | Server | Output | Median | Best | Speedup | Acceptance |
|---|---|---|---|---|---|---|
| Task store module | Baseline | ~1K tok | 37.2 tok/s | 37.2 tok/s | 1.00x | N/A |
| Task store module | DFlash | ~1K tok | 163.9 tok/s | 181.9 tok/s | 4.40x | 67.7% / 89.2% |
| Task store module | MTP | ~1K tok | 69.3 tok/s | 69.6 tok/s | 1.86x | 92.0% / 73.3% |
| KV report module | Baseline | ~1K tok | 34.6 tok/s | 36.5 tok/s | 1.00x | N/A |
| KV report module | DFlash | ~1K tok | 157.7 tok/s | 162.5 tok/s | 4.56x | 58.8% / 88.9% |
| KV report module | MTP | ~1K tok | 67.3 tok/s | 68.1 tok/s | 1.94x | 89.3% / 73.0% |
| Doubly-linked list | Baseline | ~4K tok | 36.8 tok/s | 36.9 tok/s | 1.00x | N/A |
| Doubly-linked list | DFlash | ~4K tok | 130.8 tok/s | 154.1 tok/s | 3.56x | 50.4% / 86.8% |
| Doubly-linked list | MTP | ~4K tok | 66.3 tok/s | 68.0 tok/s | 1.80x | 87.8% / 72.5% |
| Prompt processing | Baseline | ~20K tok | 1229.5 tok/s | 1229.5 tok/s | 1.00x | N/A |
| Prompt processing | DFlash | ~20K tok | 1214.4 tok/s | 1221.7 tok/s | 0.99x | N/A |
| Prompt processing | MTP | ~20K tok | 1162.6 tok/s | 1164.7 tok/s | 0.95x | N/A |
| Multi-turn coding | Baseline | ~28K tok | 33.3 tok/s | 33.3 tok/s | 1.00x | N/A |
| Multi-turn coding | DFlash | ~30K tok | 64.6 tok/s | 65.4 tok/s | 1.94x | 24.9% / 72.9% |
| Multi-turn coding | MTP | … | … | … | … | … |

Source: reddit.com




