Google DeepMind Releases Gemma 4 12B: An Encoder-Free Multimodal Model with Native audio that runs on a 16 GB laptop

Google DeepMind has launched Gemma 4 12B, a dense multimodal model that eliminates the need for separate vision and audio encoders. Instead, visual and sonic data flow directly into the LLM backbone. The result is an agentic workflow engine capable of running on a standard consumer laptop with 16 GB of RAM. It is released under the Apache 2.0 license.

Model Overview & Access

Gemma 4 12B is a 12-billion-parameter decoder-only transformer. It natively processes text, images, audio, and video without relying on external encoders. The decoder mirrors the structure of the Gemma 4 31B Dense model, effectively bridging the gap between the edge-optimised E4B and the larger 26B Mixture of Experts variant.

Architecture: A unified, encoder-free decoder-only transformer.
Modalities: Text, image, video, and native audio input — marking the first mid-sized Gemma to support audio natively.
Hardware requirement: Requires 16 GB of VRAM or unified memory. Compatible with consumer GPU laptops and Apple Silicon Macs.
License: Apache 2.0. Weights are open and publicly downloadable.
Inference stack: Compatible with llama.cpp, MLX, vLLM, Ollama, SGLang, Unsloth, and LM Studio.
Download: Available on Hugging Face and Kaggle. The instruct variant is google/gemma-4-12B-it.
Integration: Supports Hugging Face Transformers, LiteRT-LM CLI, and an OpenAI-compatible local API server via litert-lm serve.

A dedicated Multi-Token Prediction (MTP) drafter model is also included to reduce inference latency on local hardware.

Architecture: The Encoder-Free Design

Previous mid-sized Gemma models relied on separate Transformer encoders for vision and audio, adding latency and parameter overhead. The medium-sized Gemma 4 models typically carried a 550M-parameter vision encoder, while the E2B and E4B variants included a 300M-parameter audio encoder. All of that complexity is removed in the 12B model.

Vision embedder (35M parameters): Raw images are segmented into 48×48 pixel patches. Each patch is projected into the LLM’s hidden dimension using a single matrix multiplication. There is no attention layer; each patch is processed independently. Spatial position is injected via a factorized coordinate lookup: a learned X matrix and a learned Y matrix. For a patch at (x, y), the model retrieves two learned embeddings and adds them to form a position vector. This is added to the patch embedding, followed by normalization. That constitutes the entire vision pipeline.

Audio wave projection: Raw 16 kHz audio is sliced into 40 ms frames. Each frame contains 640 values, which are linearly projected into the same embedding space as text tokens. There is no feature extraction and no conformer layers. The LLM’s existing Rotary Position Embedding (RoPE) manages the 1-D temporal sequence. The audio encoder found in the E2B and E4B models, which used 12 conformer layers, has been entirely removed.

Importance: The unified weight space means you no longer need to co-tune separate frozen encoders. Downstream fine-tuning with LoRA or full tuning updates vision, audio, and text processing in a single pass. Hugging Face Transformers and Unsloth already support this.

The encoder-free design reduces multimodal latency. The LLM backbone begins processing immediately, without waiting for an encoder to finish first.

Capabilities & Performance

Google DeepMind has not published full benchmark results in the initial launch materials. The official release notes state the 12B model performs nearly as well as the 26B MoE model on standard benchmarks, at less than half the total memory footprint.

https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/

The model’s demonstrated capabilities include:

Automatic speech recognition. Transcribes audio natively without an external ASR pipeline.
Agentic reasoning. Runs multi-step workflows locally, with performance approaching the 26B MoE model.
Diarization. Distinguishes speakers in audio input.
Video understanding. Processes video frames alongside audio. A demo analyzed a 5-minute Google I/O keynote segment using 313 frames at 1 FPS with a visual token budget of 70 per frame.
Coding. Built a Gradio image-processing app using its own code generation, served locally with llama.cpp.
Multimodal agentic workflows. The official Gemma Skills repository at github.com/google-gemma/gemma-skills provides pre-built agent capabilities.

In Google’s own Google AI Edge Eloquent app, the switch to Gemma 4 12B produced what Google reports as a 60%+ jump in overall quality, with improved instruction following and scope adherence.

Marktechpost’s Visual Explainer

Released June 3, 2026

Gemma 4 12B

Google DeepMind’s unified, encoder-free multimodal model

A 12-billion-parameter decoder-only transformer that drops separate vision and audio encoders. Vision and audio flow straight into the LLM backbone. It runs locally on a 16 GB laptop under an Apache 2.0 license.

Encoder-free — no separate vision or audio encoders
First mid-sized Gemma with native audio input; adds video
Local-ready — 16 GB VRAM or unified memory

Overview & Access

What ships

Specs, weights, and the inference stack

Architecture — decoder-only, same structure as Gemma 4 31B Dense
Modalities — text, image, video, and native audio
Hardware — 16 GB VRAM / unified memory; GPU laptops and Apple Silicon
License — Apache 2.0; weights on Hugging Face and Kaggle
Instruct variant — google/gemma-4-12B-it
Speed — a dedicated Multi-Token Prediction (MTP) drafter is also released

Architecture · Vision

A 35M vision embedder

Replacing the 550M vision encoder of the medium-sized models

Raw images split into 48×48 pixel patches
Each patch projected to the LLM hidden dimension with a single matrix multiplication
No attention layer — each patch is processed independently
Position via a factorized X/Y coordinate lookup, then normalization
That is the entire vision pipeline

Architecture · Audio

Direct audio wave projection

No conformer layers, no feature extraction

Removes the 12 conformer layers used in Gemma 4 E2B and E4B
Raw 16 kHz audio sliced into 40 ms frames (640 values each)
Frames projected into the same embedding space as text tokens
The LLM’s existing RoPE handles the temporal sequence
The first mid-sized Gemma to natively ingest audio

Capabilities & Performance

Near-26B reasoning, half the memory

Google reports performance nearing the 26B MoE at under half the memory footprint

ASR & Diarization — native transcription
Source Read original →
The SignalThe Signal: Edition 01Read this edition →Every Friday: the one AI story that actually mattered, plus the tools worth your time.
AI Maestro is an independent British AI publication. We test what we recommend, and we write it the way we would say it. More about us

Google DeepMind Releases Gemma 4 12B: An Encoder-Free Multimodal Model with Native audio that runs on a 16 GB laptop

Model Overview & Access

Architecture: The Encoder-Free Design

Capabilities & Performance

Marktechpost’s Visual Explainer

Gemma 4 12B

Google DeepMind’s unified, encoder-free multimodal model

What ships

Specs, weights, and the inference stack

A 35M vision embedder

Replacing the 550M vision encoder of the medium-sized models

Direct audio wave projection

No conformer layers, no feature extraction

Near-26B reasoning, half the memory

Google reports performance nearing the 26B MoE at under half the memory footprint

Empowering Businesses with AI — Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

The running list: major…

Porting the Moebius 0.2B…

Prompt Injection as Role…

Model Overview & Access

Architecture: The Encoder-Free Design

Capabilities & Performance

Marktechpost’s Visual Explainer

Gemma 4 12B

Google DeepMind’s unified, encoder-free multimodal model

What ships

Specs, weights, and the inference stack

A 35M vision embedder

Replacing the 550M vision encoder of the medium-sized models

Direct audio wave projection

No conformer layers, no feature extraction

Near-26B reasoning, half the memory

Google reports performance nearing the 26B MoE at under half the memory footprint

More in AI Music

Empowering Businesses with AI — Smart Tools, Smarter Business Decisions.

follow us

Popular Tag

Popular Post

The running list: major…

Porting the Moebius 0.2B…

Prompt Injection as Role…