For developers and audio creators, the release of MisoTTS represents a shift from static, robotic voices to dynamic, context-aware dialogue. Miso Labs has unveiled an open-weights 8-billion-parameter model capable of generating expressive speech by processing both text and preceding audio. By utilising residual vector quantisation (RVQ), the system expands its sonic palette without inflating the parameter count, a critical efficiency for local deployment.
What is MisoTTS?
MisoTTS functions as an 8B-parameter text-to-dialogue RVQ Transformer, drawing inspiration from the Sesame CSM architecture. It couples a Llama 3.2-style backbone with a compact audio decoder to produce Mimi audio codes from text prompts and optional audio context. Crucially, the model conditions generation on prior audio, allowing it to match the interlocutor’s tone rather than ignoring it.
The system operates with a text vocabulary of 128,256 tokens and employs 32 audio codebooks. The Mimi audio tokenizer supports a maximum sequence length of 2,048, with default inference running in torch.bfloat16.
Miso Labs reports an inference latency of 110 ms, against quoted figures of 700 ms for ElevenLabs and 300 ms for Sesame.
The Vocabulary Size Problem
Standard transformers rely on a fixed vocabulary of discrete tokens, an approach that works for simple text but fails to capture the nuance of human speech. Voice quality fluctuates across pitch, rhythm, emphasis, emotion, and accent. While expanding the audio vocabulary seems like the logical solution, doing so in a standard transformer requires a proportional increase in parameters to represent and predict each new token.
Miso Labs identifies this as the vocabulary size problem. Furthermore, most existing TTS models condition solely on text, disregarding the speaker’s tone. This limitation often contributes to the “uncanny valley” effect, where synthetic speech sounds unnaturally detached from the conversational context.
Residual Vector Quantisation: The Core Idea
MisoTTS resolves these issues through residual vector quantisation (RVQ). Adapting concepts from image generation and Sesame’s CSM, the model does not emit a single token index. Instead, it outputs a vector of indices.
Every audio token consists of 32 codebook indices drawn from 2048-way codebooks. The model maintains a distinct codebook for each position within the vector. To reconstruct the sound, the system sums the looked-up vectors, with each codebook layer adding further refinement to the signal.
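The sum-of-lookups idea can be illustrated with a toy residual quantiser. This is a minimal sketch, not the Mimi codec or MisoTTS code: the codebooks are random, the sizes are tiny (4 codebooks of 16 entries instead of 32 of 2,048), the greedy nearest-neighbour encoder is a common textbook choice, and the zero vector placed at entry 0 of each codebook is an assumption added so that an extra layer can never worsen the fit.

```python
import random

def make_codebooks(depth, size, dim, seed=0):
    """Build `depth` toy codebooks, each holding `size` vectors of length `dim`.

    Entry 0 of every codebook is the zero vector (an assumption of this sketch).
    Deeper layers use a smaller scale to mimic coarse-to-fine refinement.
    """
    rng = random.Random(seed)
    books = []
    for level in range(depth):
        scale = 0.5 ** level
        book = [[0.0] * dim]
        for _ in range(size - 1):
            book.append([rng.gauss(0.0, scale) for _ in range(dim)])
        books.append(book)
    return books

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rvq_encode(vec, books):
    """Greedy RVQ: each layer quantises the residual left by the previous layers."""
    residual = list(vec)
    indices = []
    for book in books:
        idx = min(range(len(book)), key=lambda i: sq_dist(residual, book[i]))
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return indices

def rvq_decode(indices, books):
    """Reconstruct by summing one looked-up vector per codebook position."""
    out = [0.0] * len(books[0][0])
    for idx, book in zip(indices, books):  # zip stops at the shorter input
        out = [o + c for o, c in zip(out, book[idx])]
    return out

books = make_codebooks(depth=4, size=16, dim=8)
target = [random.Random(1).gauss(0.0, 1.0) for _ in range(8)]
indices = rvq_encode(target, books)

# Reconstruction error using only the first k codebooks, for k = 0..4.
errors = [sq_dist(target, rvq_decode(indices[:k], books)) for k in range(len(books) + 1)]
print(indices, errors)
```

Printing `errors` shows the reconstruction error after each layer; in this sketch it cannot rise from one layer to the next, which is the refinement behaviour the paragraph describes.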
This architecture allows the model to scale its addressable vocabulary exponentially without a matching increase in model size. The addressable vocabulary equals the codebook size raised to the power of the depth. With 32 codebooks of 2,048 entries each, MisoTTS reaches 2048^32, or roughly 10^105, addressable tokens, a range that naive vocabulary scaling would need a vastly larger network to match.
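The arithmetic behind that figure is easy to check. The snippet below is only a worked calculation using the two numbers quoted in the article (codebook size 2,048 and depth 32); the variable names are illustrative.

```python
import math

CODEBOOK_SIZE = 2048  # entries per codebook (from the article)
DEPTH = 32            # codebooks per audio token (from the article)

# Exact count of distinct index vectors; Python integers are unbounded.
addressable = CODEBOOK_SIZE ** DEPTH

# Information carried by one audio token: 32 * log2(2048) = 32 * 11 = 352 bits.
bits_per_token = DEPTH * math.log2(CODEBOOK_SIZE)

# Number of decimal digits: 106, i.e. about 9 x 10^105.
decimal_digits = len(str(addressable))

# Lookup rows RVQ needs across all codebooks: 2048 * 32 = 65,536.
rvq_rows = CODEBOOK_SIZE * DEPTH

print(bits_per_token, decimal_digits, rvq_rows)
```

A flat vocabulary of the same size would need one row per addressable token, whereas the RVQ layout needs only 65,536 rows in total, which is the gap the paragraph refers to.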
The Two-Transformer Architecture
The system divides into a backbone and a decoder. The backbone is a 7.7B-parameter transformer that operates autoregressively over time, predicting the first codebook index and a final hidden state for each audio frame.
A 300M-parameter decoder then runs autoregressively over depth, predicting the remaining codebook indices one position at a time. Each prediction conditions on the indices already selected in the frame. The same 300M parameters are reused for every position, optimising resource usage.
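The control flow of the two loops, one over time and one over depth, can be sketched as follows. Both `toy_backbone` and `toy_decoder` are hypothetical stand-ins that return pseudo-random values; they are not MisoTTS components, and only the loop structure reflects the description above.

```python
import random

NUM_CODEBOOKS = 32    # depth, from the article
CODEBOOK_SIZE = 2048  # entries per codebook, from the article

def toy_backbone(history):
    """Stand-in for the backbone: returns (first index, hidden state).

    A real backbone would attend over the full history; this stub only
    seeds a random generator from the history length.
    """
    rng = random.Random(len(history))
    hidden = [rng.random() for _ in range(4)]
    return rng.randrange(CODEBOOK_SIZE), hidden

def toy_decoder(hidden, chosen):
    """Stand-in for the decoder: next index given the hidden state and
    the indices already selected in this frame."""
    rng = random.Random(repr((hidden, chosen)))
    return rng.randrange(CODEBOOK_SIZE)

def generate_frame(history):
    """Produce one audio frame of NUM_CODEBOOKS indices."""
    first, hidden = toy_backbone(history)  # one backbone step per frame (time axis)
    frame = [first]
    for _ in range(1, NUM_CODEBOOKS):      # the same decoder is reused along depth
        frame.append(toy_decoder(hidden, tuple(frame)))
    return frame

def generate(num_frames):
    """Autoregress over time, feeding each finished frame back as history."""
    history = []
    for _ in range(num_frames):
        history.append(generate_frame(history))
    return history

frames = generate(3)
print(len(frames), len(frames[0]))
```

The point of the split is visible in the call counts: the large model runs once per frame, while the small one runs 31 times per frame with the same weights.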
Embeddings follow this logic. Text tokens use a single lookup, whereas an audio token’s embedding is the sum of per-position codebook lookups. Interleaving text and audio allows the backbone to utilise conversation history, enabling the model to carry context across turns.
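On the input side, the difference between the two embedding paths can be shown with toy tables. This is a sketch under stated assumptions: the table sizes are shrunk for readability, the values are random, and the `("text", id)` / `("audio", frame)` sequence layout is a hypothetical way to represent interleaving, not the format MisoTTS uses.

```python
import random

DIM = 4            # toy embedding width
TEXT_VOCAB = 10    # toy size; the article cites 128,256
NUM_CODEBOOKS = 3  # toy depth; the article cites 32
CODEBOOK_SIZE = 5  # toy size; the article cites 2,048

rng = random.Random(0)

def make_table(rows):
    """Random embedding table with `rows` vectors of length DIM."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(rows)]

text_table = make_table(TEXT_VOCAB)
audio_tables = [make_table(CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS)]

def embed_text(token_id):
    """Text token: a single table lookup."""
    return text_table[token_id]

def embed_audio(frame):
    """Audio token: sum of one lookup per codebook position."""
    out = [0.0] * DIM
    for position, index in enumerate(frame):
        row = audio_tables[position][index]
        out = [o + r for o, r in zip(out, row)]
    return out

# Hypothetical interleaved conversation: text tokens and audio frames in one stream.
sequence = [("text", 3), ("text", 7), ("audio", (1, 4, 0)), ("text", 2), ("audio", (2, 2, 3))]
embedded = [embed_text(v) if kind == "text" else embed_audio(v) for kind, v in sequence]
print(len(embedded), len(embedded[0]))
```

Because both paths yield vectors of the same width, the backbone can consume text and audio positions in a single sequence.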
Strengths and Challenges
Strengths:
- Open weights available immediately under a modified MIT license.
- RVQ scales the sonic range without scaling the parameter count.
- Conditions on audio context, not just text.
- Local deployment ensures sensitive audio data remains in-house.
- The architecture and mathematics are fully documented in a public blog post.
Challenges:
- Currently half-duplex only, with no turn-taking functionality.
- The large model requires a capable CUDA GPU.
- API access is announced but not yet available.
- Latency and quality claims require independent third-party verification.
Key takeaways
- MisoTTS is an open-weights 8B model under a modified MIT license that conditions on both text and audio to track speaker tone.
- Residual vector quantisation allows the model to reach an addressable vocabulary of ~10^105 without adding parameters.
- The architecture uses a 7.7B backbone for temporal prediction and a 300M decoder for depth, running half-duplex for now.
- While local deployment is possible, API access is pending and full-duplex capabilities are planned for future releases.