ExLlamaV3 Major Updates!

Turboderp has been in a frenzy recently, pushing new Llamas into smaller and faster boxes. We started last month with Gemma 4 support, followed by improved caching in . DFlash support was added two weeks ago, with impressive results (throughput in tokens per second, speedup over baseline in parentheses):

Category	Baseline	N-gram/suffix	DFlash
Agentic, code	55.98 t/s	89.58 t/s (1.60x)	140.61 t/s (2.51x)
Agentic, curl	54.03 t/s	74.62 t/s (1.38x)	125.94 t/s (2.33x)
Coding	59.21 t/s	75.34 t/s (1.27x)	177.67 t/s (3.00x)
Creative	59.10 t/s	67.26 t/s (1.13x)	89.19 t/s (1.50x)
Creative (reasoning)	59.03 t/s	64.25 t/s (1.09x)	93.54 t/s (1.58x)
Translation	58.11 t/s	55.39 t/s (0.95x)	75.73 t/s (1.30x)
Translation (reasoning)	58.08 t/s	80.21 t/s (1.38x)	119.43 t/s (2.06x)
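
As a quick sanity check on the table above, each multiplier in parentheses is the method's throughput divided by the baseline throughput for that category. The sketch below recomputes them from the listed t/s values. The names `THROUGHPUT`, `speedup`, and `speedup_table` are made up for this illustration and are not part of ExLlamaV3. Because the published throughputs are already rounded, a recomputed ratio can differ from the listed multiplier in the last digit (the Creative row is one such case).

```python
# Throughput figures (tokens/second) copied from the table above.
# Each entry: category -> (baseline, n-gram/suffix, DFlash).
# All names here are hypothetical helpers, not ExLlamaV3 APIs.
THROUGHPUT = {
    "Agentic, code": (55.98, 89.58, 140.61),
    "Agentic, curl": (54.03, 74.62, 125.94),
    "Coding": (59.21, 75.34, 177.67),
    "Creative": (59.10, 67.26, 89.19),
    "Creative (reasoning)": (59.03, 64.25, 93.54),
    "Translation": (58.11, 55.39, 75.73),
    "Translation (reasoning)": (58.08, 80.21, 119.43),
}


def speedup(baseline_tps, method_tps):
    """Return how many times faster a method is than the baseline."""
    return method_tps / baseline_tps


def speedup_table(rows):
    """Map each category to its (n-gram/suffix, DFlash) speedup pair."""
    return {
        category: (speedup(base, ngram), speedup(base, dflash))
        for category, (base, ngram, dflash) in rows.items()
    }


if __name__ == "__main__":
    for category, (ngram_x, dflash_x) in speedup_table(THROUGHPUT).items():
        print(f"{category:<24} n-gram/suffix {ngram_x:.2f}x   DFlash {dflash_x:.2f}x")
```

A value below 1.00x, as in the n-gram/suffix column for Translation, means that method was slower than the baseline for that category.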

Further model optimizations landed last week, with these improvements per model and GPU:

Model	3090¹	4090¹	5090¹	6000 Pro¹	5090²	6000 Pro²
Qwen3.5-35B-A3B 4.00bpw	5.3%	5.8%	8.6%	10.3%	21.0%	23.5%
Qwen3.5-27B 4.00bpw	0.0%	1.9%	8.1%	11.7%	13.1%	15.0%
Trinity-Nano 4.15bpw	29.5%	48.6%	52.3%	52.9%	70.5%	72.4%
Gemma4-26B-A4B 4.10bpw	3.1%	2.9%	7.8%	9.6%	16.4%	19.2%
Gemma4-31B 4.00bpw	4.0%	4.9%	10.0%	8.0%	16.0%	12.0%

The last two days also brought DFlash model quantization, plus more bugfixes and efficiency work, and development continues on the dev branch.

Come say hi on the ExLlama Discord.

Key Takeaways

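DFlash beats both the baseline and n-gram/suffix decoding in every tested category, with speedups over baseline ranging from 1.30x (Translation) to 3.00x (Coding).
Last week's model optimizations cover Qwen3.5-35B-A3B, Qwen3.5-27B, Trinity-Nano, Gemma4-26B-A4B, and Gemma4-31B, with Trinity-Nano showing the largest gains (29.5% to 72.4%, depending on the GPU column).
DFlash model quantization, bugfixes, and efficiency work landed in the last two days, with more in progress on the dev branch.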
