For makers and artists, Microsoft’s latest speech-to-text engine, MAI-Transcribe-1.5, aims to move beyond raw transcription towards text that is usable as content. It is designed to handle the messy reality of real-world audio (noisy environments, diverse accents, and long-form recordings) while delivering text fast enough to keep creative workflows moving. Whether you are captioning video, analysing call logs, or feeding voice agents, the model is pitched as a production-ready tool that copes better with domain-specific vocabulary.
What is MAI-Transcribe-1.5?
MAI-Transcribe-1.5 is an automatic speech recognition (ASR) system built entirely in-house by Microsoft rather than on a third-party foundation. It accepts audio input and outputs text, covering 43 languages with a single unified model. The architecture is tuned to handle diverse dialects and real-world acoustic conditions where background noise or poor-quality audio might trip up other models.
Microsoft is rolling this out across its ecosystem, integrating it into Copilot, Teams, GitHub, and Dynamics 365 Contact Centre. It is also accessible via Foundry, the company’s dedicated model platform.
The Accuracy Picture
Performance is measured using word error rate (WER): the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript, so a lower figure means fewer mistakes. Microsoft claims best-in-class WER across 43 languages based on the FLEURS benchmark, a standard multilingual test set. On the independent Artificial Analysis leaderboard, the model achieves a WER of 2.4%, placing it third. This creates a split narrative: Microsoft positions the model as a leader on FLEURS, while third-party analysis places it behind two competitors on Artificial Analysis.
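To make the metric concrete, here is a minimal sketch of a WER calculation using a word-level edit distance. It assumes simple lowercasing and whitespace tokenisation; published benchmarks typically apply their own text normalisation, so this will not reproduce leaderboard figures. The function name and sample sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("fox" -> "box") plus one insertion ("over")
# against a five-word reference.
print(word_error_rate("the quick brown fox jumps",
                      "the quick brown box jumps over"))  # 0.4
```

On this scale, a WER of 2.4% corresponds to roughly 24 word errors per 1,000 reference words.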
Beyond raw scores, the expansion of language coverage is significant. Support grew from 25 to 43 languages without sacrificing accuracy. The 18 new additions include ten South Asian languages such as Bengali, Tamil, and Telugu, alongside eight European options including Ukrainian, Greek, and Catalan.
Speed on Long Audio
MAI-Transcribe-1.5 targets the trade-off between accuracy and speed tracked on the Artificial Analysis leaderboard. Microsoft claims up to 5x faster processing than models of comparable accuracy, with the largest gains on long audio files. The model is reported to transcribe an hour of audio in under 15 seconds.
Microsoft cites speedups of up to 5x over Gemini 3.1, Scribe v2, and GPT-4o-Transcribe specifically for long-form audio. Against the previous generation, MAI-Transcribe-1, Azure documentation lists up to 5.7x faster long-form inference. For batch pipelines processing large archives, that difference in processing time adds up quickly.
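A rough back-of-the-envelope sketch shows what these figures imply for a batch job. It treats the 15-second figure as an upper bound per hour of audio and assumes strictly sequential processing with no upload, queueing, or network overhead, so real pipelines will differ. The function names and the 1,000-hour archive are illustrative assumptions, not figures from Microsoft.

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of compute time."""
    return audio_seconds / processing_seconds


def archive_processing_hours(archive_hours: float,
                             seconds_per_audio_hour: float = 15.0) -> float:
    """Sequential processing time, in hours, for an archive of a given length."""
    return archive_hours * seconds_per_audio_hour / 3600.0


# One hour of audio in 15 seconds is at least 240x faster than real time.
print(real_time_factor(3600, 15))  # 240.0

# Hypothetical 1,000-hour archive at 15 s per audio hour.
print(round(archive_processing_hours(1000), 2))  # 4.17

# The same archive on a hypothetical system that is 5x slower.
print(round(archive_processing_hours(1000, seconds_per_audio_hour=75.0), 2))  # 20.83
```

Under these assumptions, a 5x speed gap turns an afternoon job into most of a day, which is why the difference matters most for large archives.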
Keyword (Entity) Biasing: The Feature Worth Understanding
Generic transcribers often fail when encountering domain-specific words. These include proper names, product titles, medical terms, and internal acronyms. For enterprise users, these specific terms are frequently the most critical.
MAI-Transcribe-1.5 introduces keyword biasing, also known as entity biasing. You supply a list of domain-specific keywords; the Azure model card lists support for up to 200 entries. The model biases its predictions toward that list but does not blindly force matches. Instead, it uses shared context to decide when biasing should apply. Microsoft reports a WER reduction of up to 30% on FLEURS when this feature is used.
A short example illustrates the effect: without biasing, names might render as “Sean,” “Oif,” and “Societal.” With a supplied name list, the model recovers the correct “Shaun,” “Aoife,” and “Xochitl.” This capability is vital for meetings, healthcare, and call centres dealing with niche vocabulary.
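Before sending a keyword list to any service, it helps to clean it and check it against the stated limit. The sketch below is a hypothetical client-side helper, not part of any Microsoft SDK: the name `prepare_keywords`, the case-insensitive deduplication, and the choice to raise an error rather than truncate are all assumptions. How the list is actually attached to a transcription request depends on the Azure API and is deliberately not shown here.

```python
MAX_KEYWORDS = 200  # limit attributed to the Azure model card in this article


def prepare_keywords(terms, limit=MAX_KEYWORDS):
    """Trim, deduplicate (case-insensitively), and size-check a keyword list.

    Hypothetical helper for illustration; not an official API.
    """
    seen = set()
    cleaned = []
    for term in terms:
        normalized = " ".join(term.split())  # collapse stray whitespace
        if not normalized:
            continue  # skip empty entries
        key = normalized.casefold()
        if key in seen:
            continue  # keep the first spelling of each keyword
        seen.add(key)
        cleaned.append(normalized)
    if len(cleaned) > limit:
        raise ValueError(
            f"{len(cleaned)} keywords supplied, but the limit is {limit}"
        )
    return cleaned


names = ["Shaun", "  Aoife ", "Xochitl", "shaun", ""]
print(prepare_keywords(names))  # ['Shaun', 'Aoife', 'Xochitl']
```

Raising an error instead of silently truncating is a design choice: with a hard cap, it is usually better to decide deliberately which terms matter most than to drop the tail of the list unnoticed.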
Use Cases
The Azure model card lists concrete production scenarios, each mapping to a common engineering workload:
- Video captions for media and content platforms.
- Accessibility tools that depend on accurate captions.
- Meeting transcription for collaboration tools like Teams.
- Call analysis for contact centres and support analytics.
- Content creation workflows requiring fast draft transcripts.
- Voice agents that convert speech to text before reasoning.
Automatic language identification is also supported: the model detects the spoken language without a manual setting, which helps when the input language is unknown.
MAI-Transcribe-1.5 vs MAI-Transcribe-1
The table below compares the two generations using stated facts only.
| Attribute | MAI-Transcribe-1 | MAI-Transcribe-1.5 |
|---|---|---|
| Languages covered | 25 | 43 |
| Keyword/entity biasing | Not listed | Up to 200 keywords |
| Long-form inference speed | Baseline | Up to 5.7x faster |
| Artificial Analysis WER | Not specified | 2.4% (ranked #3) |
| FLEURS position (per Microsoft) | State-of-the-art | Best-in-class across 43 languages |
| Automatic language identification | Not specified | Yes |
| Lifecycle | Prior release | Generally available (GA) |
| Input / Output | Audio / Text | Audio / Text |
Strengths and Limitations
Strengths
- 43-language coverage from a single model, up from 25.
- Keyword/entity biasing yields up to 30% WER reduction on FLEURS.
- Sub-15-second transcription for an hour of audio.
- Generally available now through Azure AI Foundry.
- Robust on noisy, real-world audio, per Microsoft.
Limitations
- No diarization yet, so speaker labels are unavailable.
- No native streaming API, so real-time use is limited.
- Several accuracy, speed, and cost claims are first-party.
- Ranked third on Artificial Analysis, behind two competitors.
Key takeaways
- MAI-Transcribe-1.5 expands language support from 25 to 43 languages and introduces keyword biasing, which Microsoft reports cuts WER on FLEURS by up to 30%.
- The model offers significant speed improvements, transcribing an hour of audio in under 15 seconds and claiming up to 5.7x faster inference over its predecessor.
- While Microsoft claims best-in-class FLEURS accuracy, independent benchmarks place the model third on the Artificial Analysis leaderboard.
- Current limitations include a lack of speaker diarization and no native streaming API, which restricts real-time applications.