Adding Benchmaxxer Repellant to the Open ASR Leaderboard

By AI Maestro, May 10, 2026

We have recently received high-quality English ASR datasets from Appen Inc. and DataoceanAI, covering both scripted and conversational speech across multiple accents. To guard against “benchmaxing” and test-set contamination, these datasets will remain private, so that scores on them reflect how well a model actually transcribes unseen speech.

Since its inception in September 2023, the Open ASR Leaderboard has been accessed over 710K times. This reflects the community’s strong interest and motivation to continuously improve speech recognition systems through benchmarking.

To address challenges such as standardization and openness, we have gathered all test sets into a single dataset on the Hub for easy access and previewing. Additionally, we apply a normalizer based on Whisper's that removes punctuation and casing and maps words to American spelling, so that model outputs and dataset transcripts are compared in a consistent form.
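To make the effect of normalization concrete, here is a minimal sketch of the idea, not the leaderboard's actual normalizer. The function names `normalize` and `word_error_rate` and the four-entry spelling table are assumptions made for illustration; a Whisper-style normalizer handles many more cases (numbers, contractions, a far larger spelling map).

```python
import string

# Hypothetical, deliberately tiny British-to-American spelling table.
# A production normalizer would use a much larger mapping.
BRITISH_TO_AMERICAN = {
    "colour": "color",
    "favourite": "favorite",
    "recognise": "recognize",
    "centre": "center",
}

# ASCII punctuation to strip; apostrophes are kept so "don't" stays one word.
_PUNCTUATION = string.punctuation.replace("'", "")
_STRIP_TABLE = str.maketrans(_PUNCTUATION, " " * len(_PUNCTUATION))


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, map to American spelling."""
    words = text.lower().translate(_STRIP_TABLE).split()
    return " ".join(BRITISH_TO_AMERICAN.get(word, word) for word in words)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Standard Levenshtein dynamic programme, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(normalize("My favourite Colour, obviously!"))
    print(word_error_rate("The colour is red.", "the color is red"))
```

Without the normalization step, the cased and punctuated reference in the last line would be charged errors for "The", "colour", and "red." even though the hypothesis contains the same words.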

Despite the openness of our UI code and evaluation scripts, maintaining benchmarks like the Open ASR Leaderboard remains challenging. Models may perform differently depending on factors such as their ability to handle diverse accents or specific use cases. The goal is to capture these nuances and provide a more comprehensive view of ASR performance.

New High-Quality Private Datasets

| Dataset | Accent | Duration [h] | Male (%) / Female (%) | Style | Transcription |
|---|---|---|---|---|---|
| Appen Scripted AU | Australian | 1.42 | 49 / 51 | Read | Punctuated, cased. |
| Appen Scripted CA | Canadian | 1.53 | 52 / 48 | Read | Punctuated, cased. |
| Appen Scripted IN | Indian | 1.02 | 49 / 51 | Read | Punctuated, cased. |
| Appen Scripted US | American | 1.45 | 49 / 51 | Read | Punctuated, cased. |
| Appen Conversational IN | Indian | 1.37 | 51 / 49 | Conversational, spontaneous | Punctuated, disfluencies. |
| Appen Conversational US003 | American | 1.64 | 49 / 51 | Conversational, spontaneous | Punctuated, cased, disfluencies. |
| Appen Conversational US004 | American | 1.65 | 49 / 51 | Conversational, spontaneous | Punctuated, disfluencies. |
| DataoceanAI Scripted US | American | 2.43 | 54 / 46 | Read | Punctuated, cased (proper nouns), disfluencies. |
| DataoceanAI Scripted GB | British | 2.43 | 47 / 53 | Read | Punctuated, disfluencies. |
| DataoceanAI Conversational US | American | 8.82 | NA | Conversational, spontaneous | Punctuated, disfluencies. |
| DataoceanAI Conversational GB | British | 5.96 | NA | Conversational, spontaneous | Punctuated, disfluencies. |

The datasets cover both scripted and conversational speech and include acronyms, disfluencies, and proper nouns. They are kept private so that they cannot be exploited for “benchmaxing,” where a model improves its score on a benchmark without corresponding gains in real-world robustness.

How Can I Evaluate My Model?

To evaluate your model using the new private datasets, you need to add it to the Open ASR Leaderboard. Once added, we will run evaluations on both public and private datasets. You can also self-report your results for models that are not yet part of the leaderboard.

Are Models Trained on These Datasets at an Advantage?

Data from these providers should not give any model a significant advantage, as we have instructed the providers to withhold this data. Moreover, drawing on multiple data providers balances out any such advantage and allows for more diverse evaluations.

To ensure fairness, the default Average WER, a macro-average over datasets, excludes the private datasets, so that no single data provider can bias the headline number. Users can toggle individual splits on or off to tailor the evaluation to their application’s requirements.
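The averaging rule described above can be sketched as follows. This illustrates a macro-average with per-split toggles; it is not the leaderboard's code. The `SplitResult` type, the `average_wer` function, the split names, and the WER values are all hypothetical placeholders, not real results.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class SplitResult:
    name: str       # hypothetical split identifier
    wer: float      # per-dataset WER in percent
    private: bool   # True for the privately held test sets


def average_wer(
    results: Iterable[SplitResult],
    include_private: bool = False,
    enabled: Optional[Set[str]] = None,
) -> float:
    """Macro-average: unweighted mean of per-dataset WERs over selected splits.

    Private splits are excluded unless include_private is True; `enabled`
    optionally restricts the average to a user-chosen set of split names.
    """
    selected = [
        r for r in results
        if (include_private or not r.private)
        and (enabled is None or r.name in enabled)
    ]
    if not selected:
        raise ValueError("no splits selected")
    return sum(r.wer for r in selected) / len(selected)


# Placeholder numbers purely for illustration.
RESULTS = [
    SplitResult("public_a", 4.0, private=False),
    SplitResult("public_b", 6.0, private=False),
    SplitResult("private_a", 11.0, private=True),
]

if __name__ == "__main__":
    print(average_wer(RESULTS))                        # public splits only
    print(average_wer(RESULTS, include_private=True))  # all splits
    print(average_wer(RESULTS, include_private=True,
                      enabled={"public_a", "private_a"}))
```

Because a macro-average weights every dataset equally regardless of its duration, switching a single split on or off can shift the headline number noticeably, which is why the choice of default matters.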

Key Takeaways

  • The Open ASR Leaderboard now includes private datasets to prevent “benchmaxing” and ensure more accurate performance assessments.
  • Datasets have been standardized across the board, ensuring consistent model outputs and transcripts for fair comparisons.
  • Users can now toggle different dataset splits to better reflect their specific application needs.
