Scenema Audio: Zero-shot expressive voice cloning and speech generation

By AI Maestro · May 14, 2026 · 3 min read

We’ve been building Scenema Audio as part of our video production platform at scenema.ai, and we’re releasing the model weights and inference code.

How it works

The core idea: emotional performance and voice identity are independent. You describe how the speech should be performed (rage, grief, excitement, a child’s wonder), and optionally provide reference audio for voice identity. The reference provides the "who." The prompt provides the "how." Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.
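
To make that concrete, here is roughly what a request could look like. This is a minimal sketch: the endpoint, port, and field names are assumptions for illustration, not the documented Scenema Audio API.

```python
import requests

# Hypothetical request shape: the endpoint and field names are illustrative
# assumptions, not the documented Scenema Audio API.
with open("narrator_sample.wav", "rb") as ref:  # the "who": voice identity
    resp = requests.post(
        "http://localhost:8000/generate",
        files={"reference": ref},
        data={
            "text": "We never should have opened that door.",
            # the "how": a performance this speaker was never recorded giving
            "prompt": "A grief-stricken whisper, voice cracking on the last word",
        },
    )

with open("take_01.wav", "wb") as f:
    f.write(resp.content)
```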

Limitations (and why we still use it)

This is a diffusion model, not a traditional TTS pipeline. Common issues include repetition and gibberish on some seeds. Different seeds give different results, and no single generation is guaranteed to be error-free. This model is meant for a post-editing workflow: generate, pick the best take, trim if needed. Same way you’d work with any generative model.
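
In practice that means sweeping seeds and auditioning the takes. A minimal sketch, reusing the same hypothetical endpoint and assuming a seed parameter exists:

```python
import requests

# Generate several takes with different seeds, then pick the best by ear.
# The endpoint and the "seed" parameter are assumptions for illustration.
for seed in range(4):
    resp = requests.post(
        "http://localhost:8000/generate",
        data={
            "text": "Ladies and gentlemen, welcome aboard.",
            "prompt": "Warm, unhurried airline captain",
            "seed": seed,
        },
    )
    with open(f"take_{seed:02d}.wav", "wb") as f:
        f.write(resp.content)
```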

That said, we keep coming back to Scenema Audio, even over Gemini 3.1 Flash TTS, which is already more controllable than most TTS systems out there. The reason is simple: the output just sounds more natural and less robotic. There’s a quality to diffusion-generated speech that autoregressive TTS doesn’t quite match, especially for emotional delivery.

Audio-first video generation

We’ve used Scenema Audio in some cases to generate audio first and then use it to drive video generation with A2V pipelines.

On distillation and speed

A few people have asked about distilling this further. Denoising steps are not our bottleneck: the diffusion pass is a small fraction of total generation time, and the real costs sit elsewhere in the pipeline. We’re already at 8 steps (down from 50 in the base model), and that’s the sweet spot where quality holds.

Prompting matters

This model is sensitive to prompting, just like LTX 2.3 for video. A generic voice description gives you generic output. A specific, theatrical description with action tags gives you a performance. There’s also a pace parameter that controls how much time the model gets per word. Takes some experimentation to find what works for your use case, but once you do, you can generate hours of audio with minimal quality loss.
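
For instance, here is a sketch of the difference between a generic and a specific prompt. The payload shape is an assumption; pace is the time-per-word control described above, though the value and the action-tag placement are guesses:

```python
import requests

# Generic vs. specific prompting. The payload shape is an illustrative
# assumption; "pace" is the time-per-word control described in the post,
# though its direction and range here are guesses.
generic = {"text": "Run. Now.", "prompt": "A man speaking"}

specific = {
    "text": "Run. [gasps] Now!",  # action-tag placement is a guess
    "prompt": ("A panicked father, out of breath, shouting over wind, "
               "voice raw and cracking"),
    "pace": 0.7,  # time the model gets per word; value is a guess
}

for name, payload in (("generic", generic), ("specific", specific)):
    resp = requests.post("http://localhost:8000/generate", data=payload)
    with open(f"{name}.wav", "wb") as f:
        f.write(resp.content)
```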

Complex words and proper nouns benefit from phonetic spelling. Unlike traditional TTS, the model has no phoneme-to-audio pipeline or pronunciation dictionary. If it garbles "Tchaikovsky," spell it "Chai-koff-skee" or whatever makes sense to you.
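
Since the fix lives in the input text, it can be as simple as a substitution before you send the request:

```python
# No pronunciation dictionary, so pronunciation is fixed in the text itself.
text = "Tonight we perform Tchaikovsky's Fifth Symphony."
text = text.replace("Tchaikovsky", "Chai-koff-skee")  # phonetic respelling
```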

Docker REST API with automatic VRAM management

We built this as a Docker container with a REST API. It’s the same setup we use in production on scenema.ai. The service auto-detects your GPU and picks the right configuration:

| VRAM  | Audio Model   | Gemma         | Notes                  |
|-------|---------------|---------------|------------------------|
| 16 GB | INT8 (4.9 GB) | CPU streaming | Needs 32 GB system RAM |
| 24 GB | INT8 (4.9 GB) | NF4 on GPU    | Default config         |
| 48 GB | bf16 (9.8 GB) | bf16 on GPU   | Best quality           |

We went with Docker because that’s how we serve it. No dependency hell, no conda environments. We built it for production deployment.
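
Once the container is up, you can ask it which configuration it picked. The route and port below are assumptions for illustration; check the repo for the real ones:

```python
import requests

# Query the running container for what it auto-detected. The /health route
# and the response fields are assumptions; the repo documents the real API.
info = requests.get("http://localhost:8000/health").json()
print(info)  # e.g. detected VRAM tier, audio-model precision, Gemma placement
```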

ComfyUI

Native ComfyUI node support is planned. We’re hoping to release it in the coming weeks, unless someone from the community beats us to it. In the meantime, the REST API is straightforward to call from a custom node since it’s just a local HTTP service.
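
Until then, a bare-bones custom node only needs to wrap that HTTP call. A sketch, with the endpoint and payload shape again assumed rather than taken from the repo:

```python
import requests

class ScenemaAudioTTS:
    """Minimal ComfyUI custom node that proxies the local Scenema Audio
    REST service. Endpoint and payload shape are illustrative assumptions."""

    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {
            "text": ("STRING", {"multiline": True}),
            "prompt": ("STRING", {"multiline": True}),
        }}

    RETURN_TYPES = ("STRING",)  # path to the generated wav file
    FUNCTION = "generate"
    CATEGORY = "audio"

    def generate(self, text, prompt):
        resp = requests.post(
            "http://localhost:8000/generate",
            data={"text": text, "prompt": prompt},
        )
        out_path = "scenema_take.wav"
        with open(out_path, "wb") as f:
            f.write(resp.content)
        return (out_path,)

NODE_CLASS_MAPPINGS = {"ScenemaAudioTTS": ScenemaAudioTTS}
```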

Links

This is fully open source. The model weights are released under the LTX-2 Community License, but all inference and pipeline code is MIT.

How to Try Scenema Audio

  1. Clone the repo and run docker compose up locally, or
  2. Go to Scenema and start a conversation to create a voiceover. You can try voice design for free, iterate on your prompts, tune pacing, and more.

Key Takeaways

  • Scenema Audio is a diffusion model for expressive speech generation in which emotional performance is independent of voice identity.
  • The model is sensitive to prompting, allowing users to create specific performances with detailed instructions.
  • The audio-first approach can be used in conjunction with video generation pipelines like LTX 2.3 or Wan 2.6.
  • ComfyUI support for this model is planned and will allow easier integration into existing workflows.

Originally published at reddit.com. Curated by AI Maestro.
