Gemma4 26b MoE running in MLX with turboquant (and custom kernel)

A British AI enthusiast, maddie-lovelace, has run the 26b-parameter Gemma4 MoE (mixture-of-experts) model in MLX on a MacBook Air M5. They did so by tuning runtime parameters and writing a custom kernel that reduces memory usage with TurboQuant.

The reported results are strong: running at 128 kilobatches, the setup matches or outperforms other inference runtimes such as llama.cpp. Specifically, it processes prompts of up to 512 tokens in about 30 seconds per batch and generates text at over 40 tokens per second. This is notable because running a model of this size on a MacBook Air M5 with only 8GB of RAM would be difficult without these optimizations.
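
To see why memory is the constraint here, the following back-of-the-envelope sketch estimates how much space the weights alone would need at a few bit widths. The bit widths are illustrative assumptions, not the settings used in the original post, and the estimate ignores the KV cache, activations, and quantization metadata such as scales. It also assumes all 26 billion parameters are stored, since an MoE model typically keeps every expert resident even though only some are active per token.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# "26b" from the headline, taken here as the total parameter count (assumption).
N_PARAMS = 26e9

# Illustrative bit widths only; the source does not state the quantization level.
footprints = {bits: weight_footprint_gb(N_PARAMS, bits) for bits in (16, 8, 4, 2)}

for bits, gb in footprints.items():
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")
```

Under these assumptions, only the 2-bit estimate falls below 8 GB, and that is before counting the KV cache and the memory the operating system itself needs.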

The setup matches or exceeds llama.cpp in both prompt processing and text generation.
The custom kernel developed by the author enables higher batch sizes while maintaining close to full precision memory usage, allowing for more efficient use of available resources.
This work showcases how optimizations can push AI models closer to their hardware limits, paving the way for future deployments on consumer-grade devices with modest compute power.

