Google Releases Gemini 3.5 Live Translate, a Streaming Speech-to-Speech Audio Model Covering 70+ Languages Across Meet, Translate, and the Live API

For makers and artists building real-time audio applications, Google has introduced a significant shift in how speech translation is handled. Rather than waiting for a speaker to finish a sentence before responding, the new Gemini 3.5 Live Translate model processes audio as a continuous stream and generates translated speech in real time. The output stays only a few seconds behind the original audio and preserves the speaker’s intonation, pacing, and pitch, without the awkward pauses typical of turn-by-turn systems.

Gemini 3.5 Live Translate

This is a dedicated audio model, identified as gemini-3.5-live-translate-preview, designed strictly for speech-to-speech conversion. It accepts audio as it streams in, handling multilingual inputs without requiring manual configuration. Crucially, its noise robustness allows applications to function effectively in loud, unpredictable environments.

The rollout targets three distinct channels. Developers can access the model via public preview through the Gemini Live API and Google AI Studio. Enterprise users receive a private preview within Google Meet starting this month. For general consumers, the feature is available via the Google Translate app on Android and iOS.

How the Continuous Streaming Works

The architectural difference is vital for building responsive features. While conversational agents rely on turn-based interactions, pauses, and interruption handling, Live Translation utilises continuous stream processing. It translates as the speaker talks, eliminating the need to wait for a turn to conclude.

To maintain strict real-time latency thresholds, the translation path accepts audio input only; text input is unsupported in this mode. Furthermore, the model drops tool use and system instructions, focusing the pipeline exclusively on translation rather than acting as a general assistant.

Building With the Live API

Developers configure translation in the Live API session setup by defining a `translationConfig` block inside `generationConfig`. The `targetLanguageCode` field takes a BCP-47 code, such as "pl" or "es", and defaults to "en". The `echoTargetLanguage` boolean controls what happens when the input is already in the target language: true makes the model echo that speech, while false keeps it silent.
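As a rough illustration of the configuration described above, the following Python sketch builds a session setup payload. The field names `translationConfig`, `targetLanguageCode`, and `echoTargetLanguage` and the model ID come from the article. The enclosing `setup` key, the `models/` prefix, the helper name `build_translation_setup`, and the `False` default for `echoTargetLanguage` are assumptions made for illustration and should be checked against the official Live API reference.

```python
import json

# Model ID as given in the article.
MODEL_ID = "gemini-3.5-live-translate-preview"


def build_translation_setup(target_language_code: str = "en",
                            echo_target_language: bool = False) -> dict:
    """Build a hypothetical Live API setup payload for translation.

    Assumptions (not confirmed by the article): the top-level "setup"
    key, the "models/" prefix, and the False default for echoing.
    """
    if not target_language_code:
        raise ValueError("target_language_code must be a non-empty BCP-47 code")
    return {
        "setup": {
            "model": f"models/{MODEL_ID}",
            "generationConfig": {
                "translationConfig": {
                    # BCP-47 code such as "pl" or "es"; "en" is the default.
                    "targetLanguageCode": target_language_code,
                    # True: repeat speech already in the target language.
                    # False: stay silent for such speech.
                    "echoTargetLanguage": echo_target_language,
                }
            },
        }
    }


if __name__ == "__main__":
    # Print the payload that would be sent as the first session message.
    print(json.dumps(build_translation_setup("pl"), indent=2))
```

Keeping the payload in a small builder function makes it easy to open several sessions with different target languages from the same audio source.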

Technical specifications are fixed: input is raw 16-bit PCM at 16 kHz, mono, little-endian, while output is raw 16-bit PCM at 24 kHz, mono, little-endian. Audio must be sent in 100 ms chunks. For client-side applications, ephemeral tokens on the v1alpha endpoint are used to prevent API key exposure.
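The audio format above fixes the chunk size: 100 ms of 16 kHz, 16-bit mono audio is 1,600 samples, or 3,200 bytes. The Python sketch below shows one way to slice a raw PCM buffer into such chunks. The helper names are hypothetical, and padding a short final chunk with silence is an assumption, since the article does not say how a trailing partial chunk should be handled.

```python
from typing import Iterator

INPUT_SAMPLE_RATE_HZ = 16_000   # input rate from the article
OUTPUT_SAMPLE_RATE_HZ = 24_000  # output rate from the article
BYTES_PER_SAMPLE = 2            # 16-bit PCM, mono
CHUNK_MS = 100                  # chunk duration from the article


def bytes_per_chunk(sample_rate_hz: int, chunk_ms: int = CHUNK_MS) -> int:
    """Return the byte size of one mono 16-bit chunk of the given duration."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE


def iter_input_chunks(pcm: bytes, pad_last: bool = True) -> Iterator[bytes]:
    """Yield 100 ms chunks from a raw 16 kHz, 16-bit mono PCM buffer.

    If pad_last is True, a short final chunk is padded with zero bytes
    (silence). This padding behaviour is an assumption.
    """
    size = bytes_per_chunk(INPUT_SAMPLE_RATE_HZ)
    for start in range(0, len(pcm), size):
        chunk = pcm[start:start + size]
        if pad_last and len(chunk) < size:
            chunk += b"\x00" * (size - len(chunk))
        yield chunk


if __name__ == "__main__":
    one_second_of_silence = b"\x00" * (INPUT_SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    chunks = list(iter_input_chunks(one_second_of_silence))
    print(len(chunks), "chunks of", len(chunks[0]), "bytes")
```

The same `bytes_per_chunk` helper applied to the 24 kHz output rate gives 4,800 bytes per 100 ms, which is useful when sizing playback buffers.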

Dimension	Live Agent	Live Translation
Model role	Assistant that listens, reasons, and acts	Interpreter / real-time translator pipeline
Interaction	Turn-based, with interruption handling	Continuous stream processing, no turns
Tools	Function calling, Google Search, instructions	Translation only, no tools or instructions
Inputs	Text, audio, video, and image	Audio only, for strict latency
Configuration	Generation, speech, tools, instructions	`targetLanguageCode` and `echoTargetLanguage`

Use Case

The model is designed for live interpretation across various settings, including multilingual calls, meetings, lessons, and broadcasts. By offloading complex real-time media streaming infrastructure to platforms like Agora, Fishjam, LiveKit, Pipecat, and Vision Agents, developers can focus entirely on the user experience.

Google’s example app demonstrates dubbing and simultaneous multi-language translation. Meanwhile, Grab is testing the model to facilitate communication between drivers and travellers at pickups, a critical feature given that Grab users make over 10 million voice calls per month. Early reports from CJ ENM, LiveKit, and others indicate positive feedback regarding quality, accuracy, and low latency.

How It Changes Google Meet and Translate

Google confirms that Google Meet will soon integrate Gemini 3.5 Live Translate for speech translation. The update expands capabilities significantly:

Capability	Previous Meet	With 3.5 Live Translate
Languages	5	70+
Combinations per meeting	Only to and from English	2000+ combinations
Access	Existing interface	Updated interface for instant access

The Meet update is in private preview this month for select business Workspace customers, with a broader rollout planned for later this year. In the Translate app, the Live translate feature works with any connected headphones and mirrors the speaker’s tone across 70+ languages. Android also introduces a listening mode in which users hold the phone to their ear as on a regular call; the translated audio then plays through the earpiece so that others nearby do not hear it.

Key takeaways

Gemini 3.5 Live Translate is a dedicated streaming model enabling speech-to-speech translation across 70+ languages with minimal latency.
Unlike turn-based agents, it processes audio continuously, ensuring the output stays just a few seconds behind the original speaker.
Developers can configure the model via the Live API, which accepts audio-only input (16 kHz in, 24 kHz out) and a target language code.
All generated audio carries an imperceptible SynthID watermark so that it can be identified as AI-generated.
