Google DeepMind has made Quantization-Aware Training (QAT) checkpoints available for the Gemma 4 family. The rollout is designed for local deployment on edge devices and consumer GPUs. It follows the April launch of Gemma 4 and the 12B model released two days prior.
Our analysis compared the available Gemma 4 edge-model formats using only published specifications. The objective was straightforward: demonstrate the memory cost of each precision level and clarify exactly what QAT achieves.
What QAT actually does
Quantization reduces model size by lowering weight precision. Standard Post-Training Quantization (PTQ) compresses a finished model, which often degrades performance. QAT instead simulates quantization during the training phase. The model learns to compensate for the precision loss.
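The idea of simulating quantization can be illustrated with a minimal sketch. It assumes a generic symmetric 4-bit scheme with one scale per weight group; it is not Google's training recipe, and the function name `fake_quantize` and the sample weights are hypothetical. The forward pass rounds weights to the integer grid and maps them straight back to floats, so the loss is computed on the same rounding error the deployed model will have. QAT implementations commonly let gradients pass through the rounding step unchanged (a straight-through estimator), which this sketch does not show.

```python
def fake_quantize(weights, bits=4):
    """Quantize then dequantize: the rounding a QAT forward pass simulates."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    simulated = []
    for w in weights:
        q = round(w / scale)                    # snap to the integer grid
        q = max(-qmax - 1, min(qmax, q))        # clamp to the 4-bit range
        simulated.append(q * scale)             # back to float for the loss
    return simulated


# Hypothetical weights, chosen only for illustration.
weights = [0.7, -0.33, 0.12, 0.02]
simulated = fake_quantize(weights)
errors = [abs(w - s) for w, s in zip(weights, simulated)]
print(simulated)
print(errors)
```

Each error is bounded by half a quantization step. During QAT the optimizer sees these rounded values and can shift the weights so that the rounding costs less accuracy.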
Google’s AI team states that QAT yields higher overall quality than standard PTQ baselines, but it did not publish Gemma 4 QAT benchmark scores in the announcement. For context, Gemma 3 QAT reduced the perplexity drop from Q4_0 quantization by 54%, as measured with llama.cpp’s perplexity evaluation. We cite that only as prior-generation precedent.
The comparison task
We compared Gemma 4 E2B and E4B across three formats: BF16, Q4_0 QAT, and the new mobile QAT schema. We ranked them on five dimensions: memory footprint, quality preservation, decode speed, deployment breadth, and on-device accessibility. We used published figures only.
Memory results
| Format | E2B | E4B | Basis |
|---|---|---|---|
| BF16 (16-bit) | 9.6 GB | 15 GB | Official Gemma 4 docs |
| Q4_0 (4-bit, QAT) | 3.2 GB | 5 GB | Official Gemma 4 docs |
| Mobile (QAT, E2B) | ~1 GB | — | QAT announcement |
The Q4_0 figures match the footprint of PTQ Q4_0. QAT does not change the size at a given format. It improves quality at that size. The new mobile schema delivers the additional reduction.
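The reduction implied by the table can be checked with a few lines of arithmetic. The sketch below uses only the published footprints; the dictionary layout and the helper name `reduction_ratio` are our own.

```python
# Published footprints in GB, copied from the memory table.
footprint_gb = {
    "E2B": {"BF16": 9.6, "Q4_0": 3.2},
    "E4B": {"BF16": 15.0, "Q4_0": 5.0},
}


def reduction_ratio(model):
    """How many times smaller the Q4_0 checkpoint is than BF16."""
    sizes = footprint_gb[model]
    return sizes["BF16"] / sizes["Q4_0"]


for model in footprint_gb:
    saved = footprint_gb[model]["BF16"] - footprint_gb[model]["Q4_0"]
    print(f"{model}: {reduction_ratio(model):.2f}x smaller, {saved:.1f} GB saved")
```

Both models come out at roughly 3x. A plain 16-bit to 4-bit conversion of every weight would give 4x, so the published sizes presumably include quantization scales and some tensors kept at higher precision. The published figures do not break this down.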
Using that mobile schema, Google reduced Gemma 4 E2B to about 1 GB. Developers can go lower still: a text-only deployment without Per-Layer Embeddings (PLE), which also drops the audio and vision encoders, needs under 1 GB.
Per-format breakdown
BF16 is the quality baseline. E2B needs 9.6 GB and E4B needs 15 GB. It is the reference point, not a phone deployment target.
Q4_0 QAT is the general-purpose local format. E2B drops to 3.2 GB and E4B to 5 GB. QAT preserves more quality here than PTQ at the same size. This format fits consumer GPUs. Earlier E2B testing also ran on a Raspberry Pi 5 at INT4.
The mobile format is the edge-specialized schema. It brings E2B to about 1 GB. It uses static activations, channel-wise quantization, and targeted 2-bit compression.
How the mobile schema works
Google’s AI team engineered four techniques for mobile hardware. Static activations pre-calculate scaling during training, reducing on-device work. Channel-wise quantization fits the design of mobile accelerators. Targeted 2-bit quantization compresses only the token-generation layers. Embedding and KV cache optimization shrinks the active memory footprint.
Core reasoning layers stay at higher precision. That protects capability while cutting storage. Developers can also deploy text-only and drop the audio and vision encoders. That trims memory further for use cases that need no multimodality.
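To make the channel-wise idea concrete, here is a small sketch comparing one scale for a whole weight matrix against one scale per channel. It is a generic illustration under our own assumptions, not Google's mobile schema: each row is treated as one output channel, the grid is signed 4-bit, and the weights and function names are hypothetical.

```python
def quantize_row(row, scale, qmax=7):
    """Round values to a signed integer grid, then map back to floats."""
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in row]


def safe_scale(values, qmax=7):
    """Largest magnitude mapped to qmax; fall back to 1.0 for all-zero input."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def per_tensor(matrix, qmax=7):
    """One scale shared by every channel in the matrix."""
    scale = safe_scale([v for row in matrix for v in row], qmax)
    return [quantize_row(row, scale, qmax) for row in matrix]


def per_channel(matrix, qmax=7):
    """A separate scale for each row (assumed here to be one channel)."""
    return [quantize_row(row, safe_scale(row, qmax), qmax) for row in matrix]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical weights: the second channel is 100x smaller than the first.
weights = [
    [7.0, -5.0, 3.0],
    [0.07, -0.05, 0.03],
]
shared = per_tensor(weights)
separate = per_channel(weights)
print("per-tensor, small channel:", shared[1])
print("per-channel, small channel:", separate[1])
```

With a shared scale, the small channel collapses to zeros because the large channel sets the step size. With one scale per channel, both channels keep their values almost exactly. This is the general motivation for channel-wise schemes; how Gemma 4 assigns channels and scales is not detailed in the announcement.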
Dimension breakdown
Scores are a qualitative 1-to-5 ranking of the formats for on-device use, with 5 as best. Memory is the only axis backed by published figures; we did not measure it ourselves. Quality reflects Google’s disclosed design, not measured Gemma 4 numbers. Each score has a one-line basis.
| Dimension | BF16 | Q4_0 QAT | Mobile QAT |
|---|---|---|---|
| Memory footprint | 1 — heaviest, 9.6 GB E2B | 4 — 3.2 GB E2B | 5 — ~1 GB E2B text-only |
| Quality preservation | 5 — full-precision baseline | 4 — QAT-preserved, near baseline | 3 — 2-bit token layers, core kept higher |
| Decode speed | 2 — no quantization speedup | 4 — 4-bit accelerates decode | 5 — mobile-optimized static activations |
| Deployment breadth | 4 — loadable but heavy | 5 — llama.cpp, Ollama, LM Studio, vLLM, MLX | 3 — LiteRT-LM, Transformers.js, edge-focused |
| On-device accessibility | 1 — needs large GPU | 4 — consumer GPU, Raspberry Pi 5 | 5 — runs on phones |
| Total (/25) | 13 | 21 | 21 |
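The totals are plain sums of the five dimension scores, which the following sketch reproduces. It also shows how the tie breaks once the dimensions are weighted for a specific target. The `phone_weights` values are hypothetical and exist only to illustrate the effect; they are not part of our scoring.

```python
# Scores in table order: memory, quality, decode speed, breadth, accessibility.
scores = {
    "BF16":       [1, 5, 2, 4, 1],
    "Q4_0 QAT":   [4, 4, 4, 5, 4],
    "Mobile QAT": [5, 3, 5, 3, 5],
}

totals = {fmt: sum(vals) for fmt, vals in scores.items()}


def weighted_total(vals, weights):
    """Sum of scores, each multiplied by its dimension weight."""
    return sum(v * w for v, w in zip(vals, weights))


# Hypothetical weighting for a phone: memory and accessibility count double.
phone_weights = [2, 1, 1, 1, 2]
phone_totals = {fmt: weighted_total(vals, phone_weights)
                for fmt, vals in scores.items()}

print(totals)
print(phone_totals)
```

Unweighted, Q4_0 QAT and mobile QAT both reach 21. Under the illustrative phone weighting, mobile QAT moves ahead of Q4_0 QAT.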
Winner
The result is a tie by design. Q4_0 QAT and mobile QAT both score 21, but for different hardware. For phones, the mobile format leads. It reaches about 1 GB on E2B and targets mobile accelerators directly. For laptops and consumer GPUs, Q4_0 QAT is the practical default. BF16 stays the quality reference, not a practical choice for local deployment.
Methodology and limits
Memory figures come from Google’s Gemma 4 documentation. The ~1 GB E2B figure comes from the QAT announcement. Quality is Google’s stated claim. No independent Gemma 4 QAT quality numbers were published at release. We did not run the models locally for this comparison. Developers should test their chosen format against their own workload before building on it.
Key takeaways
- Q4_0 QAT cuts Gemma 4 E2B to 3.2 GB and E4B to 5 GB, from 9.6 GB and 15 GB at BF16.
- A new mobile QAT schema brings E2B to about 1 GB; text-only without PLE goes under 1 GB.
- QAT changes quality at a given size, not the size itself; the mobile format drives the extra memory cut.
- Google claims higher quality than PTQ but published no Gemma 4 QAT benchmark numbers at release.
- Weights ship today on Hugging Face with llama.cpp, Ollama, LM Studio, vLLM, MLX, and LiteRT-LM support.