Google Releases Gemini 3.5 Live Translate, a Streaming Speech-to-Speech Audio Model Covering 70+ Languages Across Meet, Translate, and the Live API

By AI Maestro, June 9, 2026
For makers and artists building real-time audio applications, Google has introduced a significant shift in how speech translation is handled. Rather than waiting for a speaker to finish a sentence before responding, the new Gemini 3.5 Live Translate model processes audio as a continuous stream and generates translated speech in real time. The output stays only a few seconds behind the original audio and preserves the speaker’s intonation, pacing, and pitch, without the awkward pauses typical of turn-by-turn systems.

Gemini 3.5 Live Translate

This is a dedicated audio model, identified as gemini-3.5-live-translate-preview, designed strictly for speech-to-speech translation. It accepts audio as it streams in and handles multilingual input without manual configuration. It is also robust to noise, so applications can keep working in loud, unpredictable environments.

The rollout targets three distinct channels. Developers can access the model via public preview through the Gemini Live API and Google AI Studio. Enterprise users receive a private preview within Google Meet starting this month. For general consumers, the feature is available via the Google Translate app on Android and iOS.

How the Continuous Streaming Works

The architectural difference is vital for building responsive features. While conversational agents rely on turn-based interactions, pauses, and interruption handling, Live Translation utilises continuous stream processing. It translates as the speaker talks, eliminating the need to wait for a turn to conclude.
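The difference in ordering can be illustrated with a toy sketch. Nothing here calls the real model: fake_translate, speaker, turn_based, and continuous are hypothetical names invented for this example, and the real model works on audio with a lag of a few seconds rather than on text segments. The sketch only shows when output becomes available relative to input.

```python
from typing import Callable, Iterator, List


def fake_translate(text: str) -> str:
    """Hypothetical stand-in for the model: tags text instead of translating it."""
    return f"[translated] {text}"


def speaker(segments: List[str], events: List[str]) -> Iterator[str]:
    """Yield speech segments one at a time, logging when each one is heard."""
    for segment in segments:
        events.append(f"heard: {segment}")
        yield segment


def turn_based(segments: Iterator[str], events: List[str]) -> None:
    """Buffer the whole turn, then emit a single translation at the end."""
    buffered = list(segments)  # consumes everything before any output exists
    events.append("out: " + fake_translate(" ".join(buffered)))


def continuous(segments: Iterator[str], events: List[str]) -> None:
    """Emit a translation for each segment as soon as it arrives."""
    for segment in segments:
        events.append("out: " + fake_translate(segment))


def run(mode: Callable[[Iterator[str], List[str]], None]) -> List[str]:
    """Run one mode over a fixed two-segment utterance and return the event log."""
    events: List[str] = []
    mode(speaker(["buenos dias", "a todos"], events), events)
    return events


if __name__ == "__main__":
    print(run(turn_based))
    print(run(continuous))
```

In the turn-based log every "heard" entry precedes the single "out" entry, while in the continuous log the two interleave, which is the behaviour the section describes.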

To maintain strict real-time latency thresholds, the translation path accepts audio input only; text input is unsupported in this mode. Furthermore, the model drops tool use and system instructions, focusing the pipeline exclusively on translation rather than acting as a general assistant.
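A client could check for these restrictions before opening a session. The following is a minimal sketch of such a guard, not part of any SDK: the field names "tools" and "systemInstruction", the message keys "audio" and "text", and both helper functions are assumptions made for illustration.

```python
from typing import Any, Dict, List

# Assumed names for the setup fields that carry tools and system instructions.
UNSUPPORTED_SETUP_FIELDS = ("tools", "systemInstruction")


def unsupported_fields(setup: Dict[str, Any]) -> List[str]:
    """Return the setup fields that the translation path would not use."""
    return [name for name in UNSUPPORTED_SETUP_FIELDS if name in setup]


def is_audio_only(message: Dict[str, Any]) -> bool:
    """True if a (hypothetical) client message carries audio and no text."""
    return "audio" in message and "text" not in message


if __name__ == "__main__":
    setup = {"model": "gemini-3.5-live-translate-preview", "tools": []}
    print(unsupported_fields(setup))
    print(is_audio_only({"audio": b"\x00\x00"}))
```

Failing fast on the client keeps a misconfigured agent-style setup from being sent to a translation session in the first place.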

Building With the Live API

Developers configure translation settings within the Live API session setup by defining a translationConfig block inside the generationConfig. The targetLanguageCode field takes a BCP-47 code, such as "pl" or "es", and defaults to "en". The echoTargetLanguage boolean controls what happens when the input is already in the target language: true echoes the speech, while false keeps the model silent.
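As a rough sketch, the configuration described above might be assembled like this. The field names translationConfig, targetLanguageCode, and echoTargetLanguage and the model identifier come from the article; the outer "setup" envelope, the "model" key, the build_setup helper, and the False default for echoTargetLanguage are assumptions, so check the Live API reference for the exact message shape.

```python
import json
from typing import Any, Dict

# Model identifier as given in the article.
MODEL_ID = "gemini-3.5-live-translate-preview"


def build_setup(
    target_language_code: str = "en",    # the article states "en" is the default
    echo_target_language: bool = False,  # assumed default, not stated in the article
) -> Dict[str, Any]:
    """Build a session setup payload with a translationConfig block (assumed envelope)."""
    return {
        "setup": {
            "model": MODEL_ID,
            "generationConfig": {
                "translationConfig": {
                    "targetLanguageCode": target_language_code,
                    "echoTargetLanguage": echo_target_language,
                }
            },
        }
    }


if __name__ == "__main__":
    # Translate into Polish and stay silent when the speaker is already using Polish.
    print(json.dumps(build_setup("pl"), indent=2))
```

Keeping the payload in one small builder makes it easy to adjust if the real setup message nests these fields differently.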

The audio formats are fixed: input is raw 16-bit PCM at 16 kHz, mono, little-endian, while output is raw 16-bit PCM at 24 kHz, mono, little-endian. Audio must be sent in 100 ms chunks. For client-side applications, ephemeral tokens on the v1alpha endpoint are used to prevent API key exposure.
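The chunk size follows from the format: 16,000 samples per second at 2 bytes per sample is 3,200 bytes per 100 ms of mono input, and 4,800 bytes per 100 ms of 24 kHz output. A minimal sketch of splitting a raw PCM buffer accordingly, using hypothetical helper names chunk_size_bytes and iter_chunks:

```python
from typing import Iterator

BYTES_PER_SAMPLE = 2     # 16-bit PCM, mono
INPUT_RATE_HZ = 16_000   # input sample rate from the article
OUTPUT_RATE_HZ = 24_000  # output sample rate from the article
CHUNK_MS = 100           # chunk duration from the article


def chunk_size_bytes(sample_rate_hz: int, chunk_ms: int = CHUNK_MS) -> int:
    """Bytes of mono 16-bit PCM that cover chunk_ms at the given sample rate."""
    return sample_rate_hz * chunk_ms // 1000 * BYTES_PER_SAMPLE


def iter_chunks(pcm: bytes, sample_rate_hz: int = INPUT_RATE_HZ) -> Iterator[bytes]:
    """Yield consecutive 100 ms slices of a raw PCM buffer; the last may be shorter."""
    size = chunk_size_bytes(sample_rate_hz)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


if __name__ == "__main__":
    one_second_of_silence = bytes(INPUT_RATE_HZ * BYTES_PER_SAMPLE)
    chunks = list(iter_chunks(one_second_of_silence))
    print(len(chunks), len(chunks[0]))
```

This sketch passes a short final chunk through unchanged; whether to pad it with silence or hold it until more audio arrives is a choice left to the application.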

| Dimension | Live Agent | Live Translation |
| --- | --- | --- |
| Model role | Assistant that listens, reasons, and acts | Interpreter / real-time translator pipeline |
| Interaction | Turn-based, with interruption handling | Continuous stream processing, no turns |
| Tools | Function calling, Google Search, instructions | Translation only, no tools or instructions |
| Inputs | Text, audio, video, and image | Audio only, for strict latency |
| Configuration | Generation, speech, tools, instructions | targetLanguageCode and echoTargetLanguage |

Use Case

The model is designed for live interpretation across various settings, including multilingual calls, meetings, lessons, and broadcasts. By offloading complex real-time media streaming infrastructure to platforms like Agora, Fishjam, LiveKit, Pipecat, and Vision Agents, developers can focus entirely on the user experience.

Google’s example app demonstrates dubbing and simultaneous multi-language translation. Meanwhile, Grab is testing the model to facilitate communication between drivers and travellers at pickups, a critical feature given that Grab users make over 10 million voice calls per month. Early reports from CJ ENM, LiveKit, and others indicate positive feedback regarding quality, accuracy, and low latency.

How It Changes Google Meet and Translate

Google confirms that Google Meet will soon integrate Gemini 3.5 Live Translate for speech translation. The update expands capabilities significantly:

| Capability | Previous Meet | With 3.5 Live Translate |
| --- | --- | --- |
| Languages | 5 | 70+ |
| Combinations per meeting | Only to and from English | 2000+ combinations |
| Access | Existing interface | Updated interface for instant access |
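The "2000+ combinations" figure is consistent with simple pair counting over 70 languages. The article does not say how Google counts a combination, so the short check below computes both readings as an assumption; since "70+" is a lower bound, either result is a floor.

```python
from math import comb, perm

LANGUAGES = 70  # lower bound taken from the "70+" figure

# Unordered pairs: {A, B} counted once, regardless of direction.
unordered_pairs = comb(LANGUAGES, 2)

# Ordered pairs: A -> B and B -> A counted separately.
ordered_pairs = perm(LANGUAGES, 2)

if __name__ == "__main__":
    print(unordered_pairs, ordered_pairs)  # 2415 4830
```

Both readings exceed 2,000, so the table's figure holds under either interpretation.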

The Meet update is in private preview for select business Workspace customers this month, with a broader rollout planned for later this year. In the Translate app, the Live translate feature works with any connected headphones and mirrors the speaker’s tone across 70+ languages. Android also introduces a listening mode in which users hold the phone to their ear as on a regular call; the translated audio then streams through the earpiece without being overheard by others.

Key takeaways

  • Gemini 3.5 Live Translate is a dedicated streaming model enabling speech-to-speech translation across 70+ languages with minimal latency.
  • Unlike turn-based agents, it processes audio continuously, ensuring the output stays just a few seconds behind the original speaker.
  • Developers configure the model via the Live API, which accepts audio input only (16 kHz in, 24 kHz out), by setting the target language code and echo behaviour.
  • All generated audio carries an imperceptible SynthID watermark to ensure detectability.