Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

For Makers and Artists: Why Your Voice Agents Might Be Failing Bilingual Users

More than half of the world's population speaks more than one language. For bilinguals, code-switching (fluidly shifting between languages mid-sentence) is often the natural rhythm of conversation. Yet when this habit reaches enterprise contact centres or IT helpdesks, standard voice agents often stumble. A misheard name or a garbled policy query can derail a support ticket instantly. If you build tools that rely on voice, you cannot afford to ignore how models handle mixed-language input.

ServiceNow recently addressed this gap by constructing a custom benchmark to test how Automatic Speech Recognition (ASR) systems perform on code-switched data. Their focus was deliberate: transcription errors cascade through every downstream step of a voice pipeline. In enterprise environments, where a misunderstood query leads to real operational friction, getting the transcript right is non-negotiable.

The dataset targets four specific language pairs relevant to their client base: Spanish-English, French-English, Canadian French-English, and German-English. The data simulates real-world Human Resources (HR) and IT Service Management (ITSM) scenarios, ranging from payroll inquiries to VPN access requests. To evaluate model robustness, they tracked three distinct metrics: Word Error Rate (WER), Semantic Word Error Rate (SWER), and Answer Error Rate (AER). These measures capture both raw transcription precision and the system’s ability to preserve meaning for downstream tasks.

The full benchmark and dataset are available via the ServiceNow AU-Harness. The study evaluated seven major ASR systems, including frontier models and open-source options. The headline finding is clear: the penalty for code-switching is not uniform; it varies significantly depending on the language pair and the specific model architecture.

The Benchmark Construction

Data Pipeline

The team began with an internal corpus of IT support and HR interactions. They constructed parallel datasets in English and one of the four target languages, filtering for utterances between 12 and 40 words. This range ensures natural spoken turns while providing sufficient length for meaningful language switching. Crucially, they excluded sentences dominated by entities like phone numbers or URLs, which force bilingualism by necessity rather than choice. To ensure quality, every generated utterance required at least three switchable content words—nouns, verbs, or adjectives—to provide a model with substantive material to process.
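
The filtering rules above can be sketched as a small predicate. This is a minimal illustration, not ServiceNow's pipeline: the regular expressions, the 50% entity threshold, and the `keep_utterance` name are assumptions, and the count of switchable content words is assumed to come from a part-of-speech tagger that is not shown.

```python
import re

# Hypothetical patterns for the entities the article says were excluded
# (URLs, phone numbers); the source does not publish its actual rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

MIN_WORDS, MAX_WORDS = 12, 40   # length window stated in the article
MIN_SWITCHABLE = 3              # minimum switchable content words
MAX_ENTITY_SHARE = 0.5          # assumed cut-off for "dominated by entities"


def keep_utterance(text: str, switchable_words: int) -> bool:
    """Return True if an utterance passes the length, entity, and content filters.

    `switchable_words` is the number of nouns, verbs, and adjectives,
    assumed to be computed upstream by a POS tagger.
    """
    tokens = text.split()
    if not MIN_WORDS <= len(tokens) <= MAX_WORDS:
        return False
    entity_tokens = sum(
        1 for tok in tokens if URL_RE.search(tok) or PHONE_RE.search(tok)
    )
    if entity_tokens / len(tokens) > MAX_ENTITY_SHARE:
        return False
    return switchable_words >= MIN_SWITCHABLE
```

Keeping the part-of-speech count as an input leaves the predicate independent of any particular tagger.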

They employed a persona prompt sent to an LLM (OpenAI/GPT-5) to generate the code-switched text, followed by a verbalization pass to create spoken audio using ElevenLabs Multilingual V2. A native-speaker linguist reviewed every record, and any flagged entry was excluded or regenerated. The final dataset comprises 259 Spanish-English records, 298 French-English records, 188 Canadian French-English records, and 173 German-English records.

Evaluation Methodology

The evaluation relied on three specific metrics per model and language pair:

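Word Error Rate (WER): The standard measure. The model output is aligned with the ground-truth transcript, and the word-level substitutions, deletions, and insertions are counted relative to the length of the reference.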
Semantic WER (SWER): Based on Pipecat’s STT benchmark and judged by Gemma-4-31B, this score identifies errors that alter the meaning of the utterance.
Answer Error Rate (AER): A functional test where an LLM reads the ASR transcript to answer three comprehension questions per utterance. This measures whether critical details like case numbers, names, and reasons for requests were preserved.
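
To make the first and third metrics concrete, here is a minimal sketch in plain Python. The WER function is the textbook word-level edit distance divided by reference length. The AER function assumes the score is simply the share of comprehension questions answered incorrectly, which the article implies but does not spell out. Both function names are illustrative, and the benchmark's text normalization (beyond the lowercasing done here) is not described in the source.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def answer_error_rate(answers_correct: list[bool]) -> float:
    """Assumed definition: fraction of comprehension questions answered wrongly."""
    if not answers_correct:
        raise ValueError("need at least one graded answer")
    return 1 - sum(answers_correct) / len(answers_correct)
```

Because insertions are counted, WER can exceed 1.0 when the output is much longer than the reference.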

Findings

The study assessed seven systems:

AssemblyAI / Universal 3-Pro
Deepgram / Nova 3 Multilang
ElevenLabs / Scribe V2
Google / Gemini 3 Flash
Mistral AI / Voxtral Small 24B-2507
Nvidia / Parakeet TDT 0.6b V3
OpenAI / Whisper Large V3 Turbo

A. Performance on Code-Switched Benchmarks

The analysis separated errors into word-level accuracy (WER) and semantic accuracy (SWER and AER). While WER is simple, it fails to distinguish between a spelling mistake and a completely wrong word. SWER offers a holistic view of utterance performance, while AER acts as a functional stress test for downstream comprehension.

WER Results

ElevenLabs Scribe V2 and AssemblyAI Universal 3-Pro led the pack for transcription accuracy. They tied on Spanish-English and were separated by a mere fraction of a percentage point on the other pairs, with Scribe holding a slight edge. Google Gemini 3 Flash followed closely but trailed on Canadian French-English. Deepgram Nova 3, Mistral Voxtral, and Nvidia Parakeet occupied the middle ground, with Parakeet notably outperforming its rivals on German-English. OpenAI Whisper Large V3 Turbo ranked last, with WER ranging from 0.16 to 0.61. This underperformance stems from Whisper's tendency to translate code-switched audio into English rather than transcribe the original languages when no explicit language parameter is set.

SWER and AER Results

Scribe V2 maintained first place with low semantic error scores. Although AssemblyAI ranked near the top on WER, Gemini 3 Flash consistently outperformed it in AER, pushing AssemblyAI to third place. As a Large Audio Language Model (LALM), Gemini is optimised for reasoning, giving it an advantage on meaning-sensitive metrics even where raw transcription lags. Whisper remained last, though the gap narrowed under semantic metrics due to its translation bias. Notably, Deepgram Nova 3 showed a divergence: it ranked mid-tier on SWER but fell toward the bottom on AER, suggesting it captures general meaning but fails on specific, consequential details.

B. The Cost of Code-Switching

To isolate the difficulty of switching languages versus the inherent difficulty of transcription, the team tested code-switched audio against monolingual baselines in both the matrix language and English.

Scribe V2, Gemini 3 Flash, and AssemblyAI showed the smallest performance deltas, with Scribe V2 notably outperforming its own non-English baseline.
Top-tier systems incur only a small penalty, whereas lower-ranked models degrade significantly. This indicates code-switching exposes robustness differences rather than uniformly raising difficulty.
Whisper showed the largest degradation relative to English (peaking at +0.85 on German-English). It is the only model that performed better on code-switched speech than on monolingual non-English audio, a direct result of its default translation behaviour.
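
The comparison above comes down to a difference of error rates. As a sketch, assuming the penalty is the code-switched error minus the monolingual baseline error (the article's "+0.85" suggests an additive delta, but the exact formula is not given), and with an illustrative function name:

```python
def code_switch_penalties(
    cs_error: float, matrix_error: float, english_error: float
) -> dict[str, float]:
    """Deltas between the code-switched error rate and each monolingual baseline.

    A positive value means the model did worse on code-switched audio than
    on the baseline; a negative value means it did better.
    """
    return {
        "vs_matrix_language": cs_error - matrix_error,
        "vs_english": cs_error - english_error,
    }


# Placeholder numbers for illustration only, not benchmark results.
example = code_switch_penalties(cs_error=0.25, matrix_error=0.5, english_error=0.125)
```

In this made-up example the model loses ground against its English baseline but beats its own non-English baseline, which is the pattern a negative delta would indicate.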

C. How Code-Switching Breaks Systems

The final analysis used regression models to identify specific failure conditions. A logistic regression determined which variables correlated with the occurrence of an error, while an ordinary least squares (OLS) regression examined which factors influenced the magnitude of that error.

The predictors included the number of language switches and the utterance’s Code-Mixing Index. The results suggest that as the number of switches increases, the likelihood of error rises. Furthermore, the complexity of the code-mixing itself directly correlates with the severity of the transcription failure.
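
Both predictors can be computed from per-token language tags. The sketch below assumes the commonly cited Gambäck and Das formulation of the Code-Mixing Index, CMI = 100 × (1 − max(w_i) / (n − u)), where n is the token count, u the number of language-independent tokens, and max(w_i) the token count of the dominant language. The article does not say which variant the study used, and the "other" tag for language-independent tokens is an assumption of this example.

```python
from collections import Counter

NEUTRAL = "other"  # assumed tag for language-independent tokens (names, numbers)


def count_switches(lang_tags: list[str]) -> int:
    """Number of points where consecutive language-tagged tokens differ."""
    tagged = [tag for tag in lang_tags if tag != NEUTRAL]
    return sum(1 for a, b in zip(tagged, tagged[1:]) if a != b)


def code_mixing_index(lang_tags: list[str]) -> float:
    """CMI on a 0-100 scale; 0 for monolingual or fully neutral utterances."""
    tagged = [tag for tag in lang_tags if tag != NEUTRAL]
    if not tagged:
        return 0.0
    dominant = max(Counter(tagged).values())
    return 100.0 * (1 - dominant / len(tagged))


tags = ["es", "es", "en", "en", "es", "other", "es"]
switches = count_switches(tags)   # es -> en, then en -> es
cmi = code_mixing_index(tags)     # dominant language covers 4 of 6 tagged tokens
```

Features like these would then serve as the regressors in the logistic and OLS models described above.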

Key takeaways

Top performers are consistent: ElevenLabs Scribe V2 and AssemblyAI Universal 3-Pro demonstrate the highest robustness across language pairs, maintaining low error rates even when semantic nuance is critical.
Translation bias hurts accuracy: Models like Whisper that default to translating mixed-language audio into English suffer the greatest penalties, failing to preserve the original linguistic context.
Meaning matters more than words: A model can achieve decent Word Error Rates while still failing downstream tasks; semantic metrics like AER are essential for evaluating real-world utility.
