Gemini 3.1 Flash TTS: The Next Generation of Expressive AI Speech
Today, we’re unveiling Gemini 3.1 Flash TTS, a new text-to-speech model that offers enhanced controllability, expressiveness, and quality — enabling developers, enterprises, and everyday users to build the next wave of AI-powered speech applications.
Starting today, 3.1 Flash TTS is rolling out:
- For developers in preview via the Gemini API and Google AI Studio
- For enterprises in preview on Vertex AI
- For Workspace users via Google Vids
Enhanced Speech Quality and Control
We’ve improved the overall speech quality of Gemini 3.1 Flash TTS, making it sound more natural and expressive than previous versions. On the Artificial Analysis TTS leaderboard, which ranks text-to-speech models using thousands of human preference votes, 3.1 Flash TTS achieved an Elo score of 1,211.
Artificial Analysis has also positioned Gemini 3.1 Flash TTS as one of its most attractive options due to its balance between high-quality speech generation and low cost. The model stands out further with native support for multiple speakers, over 70 languages, and granular creative control via natural language commands.
New Audio Tags for More Expressive Speech
3.1 Flash TTS introduces audio tags, a user-friendly way to control vocal style, pace, and delivery by embedding natural language commands directly into the text input. This allows developers to steer AI speech output with greater precision.
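To make the idea of embedding commands in the text input concrete, here is a minimal Python sketch that assembles a single input string from tagged segments. It is an illustration only: the square-bracket tag syntax, the tag names ("whispering", "excited"), and the helper render_tagged_text are assumptions made for this example and are not taken from the announcement. Check the official Gemini API documentation for the actual tag format.

```python
from typing import Iterable, Optional, Tuple

# (audio tag or None, spoken text). Hypothetical structure for illustration.
Segment = Tuple[Optional[str], str]


def render_tagged_text(segments: Iterable[Segment]) -> str:
    """Join segments into one TTS input string with inline tags.

    ASSUMPTION: a tag is written in square brackets directly before the
    text it affects. The real syntax may differ.
    """
    parts = []
    for tag, text in segments:
        text = text.strip()
        if not text:
            continue  # skip empty segments rather than emit a bare tag
        parts.append(f"[{tag.strip()}] {text}" if tag else text)
    return " ".join(parts)


line = render_tagged_text([
    (None, "Welcome back."),
    ("whispering", "I have a secret to tell you."),
    ("excited", "We shipped it!"),
])
print(line)
# Welcome back. [whispering] I have a secret to tell you. [excited] We shipped it!
```

Keeping tags as data rather than hand-typed strings makes it easier to swap delivery styles for the same script without touching the spoken text.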
You can start experimenting with these new audio tags along with other updates in Google AI Studio, which provides configurable controls that place you in the “director’s chair”:
- Scene direction: Use this feature to set the stage and provide specific dialogue instructions. This helps characters remain consistent across multiple turns and react naturally.
- Speaker-level specificity: Cast unique Audio Profiles for each character, then specify Director’s Notes to adjust pace, tone, and accent. You can also change expression mid-sentence using inline tags.
- Seamless export: Once you’ve perfected the performance, these exact parameters can be exported as Gemini API code to ensure consistent, recognizable voices across various projects and platforms.
This new level of creative precision allows developers to create memorable characters and immersive audio experiences. Try out this feature in the Google AI Studio Playground today.
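One way to keep scene direction, per-speaker profiles, and director's notes consistent across projects is to hold them as plain data and render them into a prompt. The sketch below is hypothetical: the class names, field names, and prompt layout are assumptions chosen for illustration, and they do not represent the code that Google AI Studio exports or the Gemini API's request format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Speaker:
    name: str
    audio_profile: str          # who the character is (hypothetical field)
    directors_notes: str = ""   # pace, tone, accent (hypothetical field)


@dataclass
class Scene:
    direction: str
    speakers: List[Speaker]
    lines: List[Tuple[str, str]] = field(default_factory=list)  # (speaker, text)

    def say(self, speaker: str, text: str) -> None:
        """Append a line of dialogue, rejecting speakers that were never cast."""
        if speaker not in {s.name for s in self.speakers}:
            raise ValueError(f"unknown speaker: {speaker}")
        self.lines.append((speaker, text))

    def to_prompt(self) -> str:
        """Render the scene as text. ASSUMPTION: this layout is illustrative only."""
        out = [f"Scene: {self.direction}", ""]
        for s in self.speakers:
            out.append(f"Profile for {s.name}: {s.audio_profile}")
            if s.directors_notes:
                out.append(f"  Notes: {s.directors_notes}")
        out.append("")
        out.extend(f"{name}: {text}" for name, text in self.lines)
        return "\n".join(out)


scene = Scene(
    direction="A quiet library at closing time.",
    speakers=[
        Speaker("Ava", "A warm, patient librarian.", "Slow pace, soft tone."),
        Speaker("Ben", "An anxious student."),
    ],
)
scene.say("Ava", "We close in five minutes.")
scene.say("Ben", "[whispering] Just one more page, please.")
prompt = scene.to_prompt()
print(prompt)
```

Because the cast and notes live in one object, the same Scene can be reused for every turn of a dialogue, which is the consistency the controls above are meant to provide.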
Global Reach and Control
Gemini 3.1 Flash TTS delivers high-fidelity speech in more than 70 languages, with precise control for major markets around the world. It brings advanced style, pacing, and accent control to key regions, helping developers create localized, expressive speech experiences at global scale.
Early testers have already noted the model’s impressive controllability and expressiveness. They’ve highlighted how audio tags provide a new level of creative precision, transforming simple text into high-fidelity vocal performances.
Watermarked for Transparency
All audio generated by Gemini 3.1 Flash TTS is watermarked with SynthID. This invisible watermark, interwoven directly into the audio output, allows reliable detection of AI-generated content to help prevent misinformation. For more information on our approach to safety and responsibility, you can review the model card.
Key Takeaways
- Gemini 3.1 Flash TTS offers enhanced controllability, expressiveness, and quality for text-to-speech applications.
- The introduction of audio tags allows developers to control vocal style, pace, and delivery with greater precision.
- With support for over 70 languages, Gemini 3.1 Flash TTS can deliver high-fidelity speech experiences across major markets at global scale.
Originally published at blog.google. Curated by AI Maestro.

