ExLlamaV3 Major Updates!

By AI Maestro May 11, 2026 1 min read

Turboderp has been in a frenzy recently, pushing new Llamas into smaller and faster boxes. We started last month with the release of support for Gemma 4, followed by improved caching in . DFlash support was added two weeks ago, with impressive results:

| Category                | Baseline  | N-gram/suffix     | DFlash             |
|-------------------------|-----------|-------------------|--------------------|
| Agentic, code           | 55.98 t/s | 89.58 t/s (1.60x) | 140.61 t/s (2.51x) |
| Agentic, curl           | 54.03 t/s | 74.62 t/s (1.38x) | 125.94 t/s (2.33x) |
| Coding                  | 59.21 t/s | 75.34 t/s (1.27x) | 177.67 t/s (3.00x) |
| Creative                | 59.10 t/s | 67.26 t/s (1.13x) | 89.19 t/s (1.50x)  |
| Creative (reasoning)    | 59.03 t/s | 64.25 t/s (1.09x) | 93.54 t/s (1.58x)  |
| Translation             | 58.11 t/s | 55.39 t/s (0.95x) | 75.73 t/s (1.30x)  |
| Translation (reasoning) | 58.08 t/s | 80.21 t/s (1.38x) | 119.43 t/s (2.06x) |
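
The multipliers in parentheses read as each method's t/s figure divided by the baseline figure for the same category. Here is a minimal sketch of that arithmetic, using two rows copied from the table. The `speedup` helper and the `rows` dictionary are illustrative names, not part of ExLlamaV3.

```python
def speedup(tokens_per_s: float, baseline_tokens_per_s: float) -> float:
    """Return throughput relative to the baseline (1.0 means no change)."""
    if baseline_tokens_per_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return tokens_per_s / baseline_tokens_per_s


# (baseline, N-gram/suffix, DFlash) in t/s, copied from the table
rows = {
    "Coding": (59.21, 75.34, 177.67),
    "Translation": (58.11, 55.39, 75.73),
}

for category, (baseline, ngram, dflash) in rows.items():
    print(
        f"{category}: N-gram/suffix {speedup(ngram, baseline):.2f}x, "
        f"DFlash {speedup(dflash, baseline):.2f}x"
    )
```

A value below 1.00x, as in the N-gram/suffix Translation row, means that run was slower than the baseline. Recomputing from the rounded t/s figures can differ from a published multiplier in the last digit, which would be consistent with the originals having been derived from unrounded measurements.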

More model-specific optimization landed last week, with the following improvements by GPU:

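| Model                   | 3090¹ | 4090¹ | 5090¹ | 6000 Pro¹ | 5090² | 6000 Pro² |
|-------------------------|-------|-------|-------|-----------|-------|-----------|
| Qwen3.5-35B-A3B 4.00bpw | 5.3%  | 5.8%  | 8.6%  | 10.3%     | 21.0% | 23.5%     |
| Qwen3.5-27B 4.00bpw     | 0.0%  | 1.9%  | 8.1%  | 11.7%     | 13.1% | 15.0%     |
| Trinity-Nano 4.15bpw    | 29.5% | 48.6% | 52.3% | 52.9%     | 70.5% | 72.4%     |
| Gemma4-26B-A4B 4.10bpw  | 3.1%  | 2.9%  | 7.8%  | 9.6%      | 16.4% | 19.2%     |
| Gemma4-31B 4.00bpw      | 4.0%  | 4.9%  | 10.0% | 8.0%      | 16.0% | 12.0%     |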

The last two days also brought DFlash model quantization, more bug fixes, and further efficiency improvements, with work continuing on the dev branch.

Come say hi on the ExLlama Discord.

Key Takeaways

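  • DFlash raised throughput in every tested category, from 1.30x on Translation to 3.00x on Coding, and beat the N-gram/suffix results in each row.
  • Last week's model optimization improved Qwen3.5-35B-A3B, Qwen3.5-27B, Trinity-Nano, Gemma4-26B-A4B, and Gemma4-31B, with the largest gains (up to 72.4%) on Trinity-Nano.
  • DFlash model quantization, bug fixes, and efficiency work arrived in the last two days, and development continues on the dev branch.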
