Meet OmniVoice Studio: A Local, Open-Source Alternative to ElevenLabs
ElevenLabs charges between $5 and $330 per month for voice AI services, and every audio file you process goes through its cloud servers. For those looking for an open-source alternative to ElevenLabs, OmniVoice Studio is a good fit: an open-source desktop application that runs the same categories of tasks locally. It is an interesting individual project that handles voice cloning, video dubbing, real-time dictation, vocal isolation, and speaker diarization without sending data to an external server.
What OmniVoice Studio Does
The application bundles six distinct capabilities. Understanding each one helps clarify what the system is doing under the hood.
- **Voice cloning**: Works from a 3-second audio clip. The system uses zero-shot learning to clone a voice it has never been trained on before, by conditioning a diffusion-based TTS model on the short reference audio. The underlying model, OmniVoice from k2-fsa, supports 600+ languages.
- **Voice design**: Lets you build a new voice from parameters: gender, age, accent, pitch, speed, emotion, and dialect — without cloning any existing voice.
- **Video dubbing**: Takes a YouTube URL or a local video file. It runs transcription using WhisperX, translates the transcript, synthesizes new audio using the TTS engine, and exports an MP4. The entire pipeline runs locally.
- **Dictation widget**: A system-wide floating overlay that activates via ⌘+⇧+Space from any application. It streams transcription via WebSocket and auto-pastes the result into whatever app is in focus.
- **Batch Queue**: Lets you drop up to 50 videos and walk away, with per-job progress bars tracking each one through the full pipeline.
- **MCP Server**: Exposes OmniVoice Studio’s capabilities to any MCP client — including Claude, Cursor, or your own tooling.
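
The dubbing flow described in the list (transcribe, translate, synthesize, export) can be pictured as a chain of stages. The sketch below is only an illustration of that ordering: `dub`, the `Segment` shape, and all four stage functions are hypothetical stand-ins, not OmniVoice Studio's actual code or API.

```python
from typing import Callable, List, Tuple

# Hypothetical segment shape: (start_seconds, end_seconds, text)
Segment = Tuple[float, float, str]


def dub(
    source: str,
    transcribe: Callable[[str], List[Segment]],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    mux: Callable[[str, list], dict],
) -> dict:
    """Run the four stages in order, keeping each segment's timing."""
    segments = transcribe(source)
    translated = [(start, end, translate(text)) for start, end, text in segments]
    clips = [(start, end, synthesize(text)) for start, end, text in translated]
    return mux(source, clips)


# Stub stages so the sketch runs without any models installed.
result = dub(
    "talk.mp4",
    transcribe=lambda src: [(0.0, 1.5, "hello"), (1.5, 3.0, "world")],
    translate=lambda text: {"hello": "hola", "world": "mundo"}[text],
    synthesize=lambda text: f"<audio:{text}>".encode(),
    mux=lambda src, clips: {"video": src, "audio": clips, "container": "mp4"},
)
```

Passing the stages in as callables keeps the ordering explicit and makes each step easy to swap out, which is one plausible way to structure a pipeline like this.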
The Architecture
The project uses a React frontend talking to a FastAPI backend. The backend exposes 97 API endpoints, uses Server-Sent Events (SSE) for streaming updates, and stores data in SQLite.
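
To make the streaming part concrete, here is a minimal sketch of how a client might decode a Server-Sent Events stream. Only the `event:` / `data:` line framing comes from the SSE format itself; the `progress` event name and the JSON fields are assumptions for illustration, not the project's actual schema.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event from an iterable of text lines.

    Handles the basic framing only: `field: value` lines, with a blank
    line ending each event. Fields other than `event` and `data` are ignored.
    """
    event_name = "message"  # default event type in SSE
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it has data.
            if data_parts:
                yield {"event": event_name, "data": "\n".join(data_parts)}
            event_name, data_parts = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space is stripped
            if field == "event":
                event_name = value
            elif field == "data":
                data_parts.append(value)


# Hypothetical stream, as a job-progress endpoint might emit it.
sample = [
    "event: progress\n",
    'data: {"job": "dub-1", "percent": 40}\n',
    "\n",
    ": keep-alive\n",
    "event: progress\n",
    'data: {"job": "dub-1", "percent": 100}\n',
    "\n",
]
events = [json.loads(e["data"]) for e in parse_sse(sample)]
```

In a real client the lines would come from an open HTTP response rather than a list, but the parsing logic would look much the same.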
Four open-source components handle the audio processing:
- WhisperX handles automatic speech recognition (ASR) with word-level alignment. It supports 99 languages for transcription.
- Demucs (Meta) handles source separation. It splits speech from background music and preserves both stems independently.
- Pyannote handles speaker diarization — identifying which speaker said which words in a multi-speaker audio file. It is used together with WhisperX.
- AudioSeal (Meta) embeds an invisible neural watermark into generated audio. This watermark survives compression and serves as AI provenance metadata.
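
Combining word-level timestamps with diarization boils down to matching time ranges. The sketch below assigns each word to the speaker turn that overlaps it most. The `Word` and `Turn` shapes and the `SPEAKER_00` style labels are assumptions for illustration; they are not the actual return types of WhisperX or Pyannote.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class Turn:
    speaker: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time ranges."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose turn overlaps it most."""
    labeled = []
    for w in words:
        best, best_overlap = None, 0.0
        for t in turns:
            o = overlap(w.start, w.end, t.start, t.end)
            if o > best_overlap:
                best, best_overlap = t.speaker, o
        labeled.append((w.text, best))  # None if no turn overlaps the word
    return labeled


words = [Word("hello", 0.0, 0.4), Word("there", 0.5, 0.9), Word("ok", 0.9, 1.2)]
turns = [Turn("SPEAKER_00", 0.0, 1.0), Turn("SPEAKER_01", 1.0, 2.0)]
labeled = assign_speakers(words, turns)
```

The word "ok" straddles the boundary between the two turns and goes to `SPEAKER_01` because more of it falls inside that turn.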
The desktop wrapper is built with Tauri, a Rust-based framework for cross-platform native apps. The codebase is 56% Python, 23.6% JavaScript, 11% CSS, 3.4% Shell, 3.3% Rust, and 2.6% TypeScript.
For GPU support, the backend auto-detects CUDA (NVIDIA), MPS (Apple Silicon Metal), and ROCm (AMD). With 8 GB VRAM or less, TTS automatically offloads to CPU during transcription. No configuration is required.
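
As a rough picture of what auto-detection and offloading could look like, here is a minimal sketch. The function names and the exact offload rule are assumptions based on the description above, not the project's code. One detail that is standard PyTorch behavior: ROCm builds expose AMD GPUs through the same `torch.cuda` API, so a single `"cuda"` branch covers both NVIDIA and AMD.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer a CUDA/ROCm GPU, then Apple's MPS, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch if it is installed; otherwise assume CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


def tts_device(device: str, vram_gb: float, transcribing: bool) -> str:
    """Hypothetical offload rule: with 8 GB of VRAM or less, run TTS on
    the CPU while transcription is using the GPU."""
    if device != "cpu" and transcribing and vram_gb <= 8:
        return "cpu"
    return device
```

For example, `tts_device("cuda", vram_gb=8, transcribing=True)` returns `"cpu"`, while the same call with 12 GB keeps TTS on the GPU.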
Six TTS Engines, One Backend Registry
OmniVoice Studio ships a pluggable multi-engine TTS backend. You can switch engines in Settings → TTS Engine or by setting the OMNIVOICE_TTS_BACKEND environment variable.
- Default engine: OmniVoice (600+ languages)
- CosyVoice 3 (9 languages plus 18 dialects, Apache-2.0)
- MLX-Audio (Apple Silicon-only, includes Kokoro and Qwen3-TTS among others, Apache-2.0)
- VoxCPM2 (30 languages, Apache-2.0)
- MOSS-TTS-Nano (20 languages, runs realtime on CPU)
- KittenTTS (English-only, CPU-only, MIT)
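
A pluggable backend selected by an environment variable is commonly built as a small registry. The sketch below shows that pattern under stated assumptions: only the `OMNIVOICE_TTS_BACKEND` variable name comes from the project; the backend keys (`omnivoice`, `kittentts`), the `register` decorator, and the stub engines are hypothetical.

```python
import os
from typing import Callable, Dict

Synthesizer = Callable[[str], bytes]  # text in, audio bytes out

_REGISTRY: Dict[str, Callable[[], Synthesizer]] = {}


def register(name: str):
    """Decorator that records an engine factory under a backend name."""
    def decorator(factory: Callable[[], Synthesizer]):
        _REGISTRY[name] = factory
        return factory
    return decorator


@register("omnivoice")  # hypothetical key for the default engine
def _omnivoice() -> Synthesizer:
    return lambda text: f"[omnivoice] {text}".encode()  # stub, no real audio


@register("kittentts")  # hypothetical key
def _kittentts() -> Synthesizer:
    return lambda text: f"[kittentts] {text}".encode()  # stub, no real audio


def get_engine(default: str = "omnivoice") -> Synthesizer:
    """Build the engine named by OMNIVOICE_TTS_BACKEND, or the default."""
    name = os.environ.get("OMNIVOICE_TTS_BACKEND", default).lower()
    if name not in _REGISTRY:
        raise ValueError(f"unknown TTS backend {name!r}; choose from {sorted(_REGISTRY)}")
    return _REGISTRY[name]()
```

Using factories rather than ready-made instances means an engine's model weights would only need to load when that engine is actually selected.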
Language Coverage
ElevenLabs supports 32 languages. OmniVoice Studio supports 646 languages for TTS and 99 languages for transcription via WhisperX. Translation coverage depends on the target language pair.
Getting Started
Prerequisites: ffmpeg, Bun, and uv. With those installed, clone the repo, install the dependencies, and start the dev server:
$ git clone https://github.com/debpalash/OmniVoice-Studio.git
$ cd OmniVoice-Studio
$ uv sync
$ bun install
$ bun dev
The frontend loads at http://localhost:5173 and the API runs on port 8000. Model weights download automatically on first generation.
Key Takeaways
- OmniVoice Studio is a local, open-source alternative to ElevenLabs’ voice AI services.
- It supports a wide range of tasks including voice cloning, video dubbing, and real-time dictation without sending data to an external server.
- The application ships multiple TTS engines, covering 646 languages for text-to-speech and 99 languages for transcription via WhisperX.




