Adding Benchmaxxer Repellant to the Open ASR Leaderboard
We recently received high-quality English ASR datasets from Appen Inc. and DataoceanAI, covering both scripted and conversational speech across multiple accents. To guard against “benchmaxing” and test-set contamination, these datasets will remain private, so that scores reflect how models perform on data they have not seen.
Since its inception in September 2023, the Open ASR Leaderboard has been accessed over 710K times. This reflects the community’s strong interest and motivation to continuously improve speech recognition systems through benchmarking.
To address challenges such as standardization and openness, we have gathered all test sets into a single dataset on the Hub for easy access and previewing. We also apply a text normalizer, based on Whisper’s, that removes punctuation and casing and maps spellings to American English, so that model outputs and dataset transcripts are compared on equal terms.
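As a rough illustration of why normalization matters for WER, here is a minimal Python sketch. It is not the Whisper normalizer or the leaderboard's evaluation code: the `normalize` and `wer` functions and the three-entry spelling table are assumptions made up for this example, and the real normalizer handles many more cases (numbers, contractions, a much larger spelling map).

```python
import string

# Tiny illustrative British -> American mapping (assumed for this example);
# a production normalizer uses a far larger table.
BRITISH_TO_AMERICAN = {
    "colour": "color",
    "recognise": "recognize",
    "centre": "center",
}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and map spellings to American English."""
    text = text.lower().translate(_PUNCT_TABLE)
    words = [BRITISH_TO_AMERICAN.get(word, word) for word in text.split()]
    return " ".join(words)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "The colour of the Centre, please."
hypothesis = "the color of the center please"

print(wer(reference, hypothesis))                        # raw strings
print(wer(normalize(reference), normalize(hypothesis)))  # after normalization
```

Without normalization, differences in casing, punctuation, and spelling are counted as word errors even though the model recognized every word; after normalization the two strings compare as identical.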
Despite the openness of our UI code and evaluation scripts, maintaining benchmarks like the Open ASR Leaderboard remains challenging. Models may perform differently depending on factors such as their ability to handle diverse accents or specific use cases. The goal is to capture these nuances and provide a more comprehensive view of ASR performance.
New High-Quality Private Datasets
| Dataset | Accent | Duration [h] | Male (%) / Female (%) | Style | Transcription |
|---|---|---|---|---|---|
| Appen Scripted AU | Australian | 1.42 | 49 / 51 | Read | Punctuated, cased. |
| Appen Scripted CA | Canadian | 1.53 | 52 / 48 | Read | Punctuated, cased. |
| Appen Scripted IN | Indian | 1.02 | 49 / 51 | Read | Punctuated, cased. |
| Appen Scripted US | American | 1.45 | 49 / 51 | Read | Punctuated, cased. |
| Appen Conversational IN | Indian | 1.37 | 51 / 49 | Conversational, spontaneous | Punctuated, disfluencies. |
| Appen Conversational US003 | American | 1.64 | 49 / 51 | Conversational, spontaneous | Punctuated, cased, disfluencies. |
| Appen Conversational US004 | American | 1.65 | 49 / 51 | Conversational, spontaneous | Punctuated, disfluencies. |
| DataoceanAI Scripted US | American | 2.43 | 54 / 46 | Read | Punctuated, cased (proper nouns), disfluencies. |
| DataoceanAI Scripted GB | British | 2.43 | 47 / 53 | Read | Punctuated, disfluencies. |
| DataoceanAI Conversational US | American | 8.82 | NA | Conversational, spontaneous | Punctuated, disfluencies. |
| DataoceanAI Conversational GB | British | 5.96 | NA | Conversational, spontaneous | Punctuated, disfluencies. |
The datasets cover scripted and conversational speech and include acronyms, disfluencies, and proper nouns. Keeping them private prevents them from being exploited for “benchmaxing,” where models improve their score on a benchmark without corresponding gains in real-world robustness.
How Can I Evaluate My Model?
To evaluate your model using the new private datasets, you need to add it to the Open ASR Leaderboard. Once added, we will run evaluations on both public and private datasets. You can also self-report your results for models that are not yet part of the leaderboard.
Are Models Trained on These Datasets at an Advantage?
Including data from these providers does not inherently give any model a significant advantage, as we have instructed the providers to withhold this information. Where some advantage does exist, drawing on multiple data providers balances it out and allows for more diverse evaluations.
To ensure fairness, the default Average WER, a macro-average over datasets, excludes the private datasets, which prevents potential biases or gains tied to specific data providers. Users can toggle individual splits on or off to tailor the evaluation to their application’s requirements.
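To make the macro-average and the toggling concrete, here is a minimal Python sketch. The dataset names, the WER values, and the `macro_average_wer` helper are hypothetical and are not taken from the leaderboard's code or results; the sketch only shows that a macro-average weights every selected dataset equally, regardless of its duration.

```python
from statistics import mean

# Hypothetical per-dataset WER values (in %) for a single model; not real results.
wer_by_dataset = {
    "public_a": 4.0,
    "public_b": 6.0,
    "private_scripted": 8.0,
    "private_conversational": 14.0,
}

# Hypothetical set of dataset names treated as private.
PRIVATE_DATASETS = {"private_scripted", "private_conversational"}


def macro_average_wer(scores, include_private=False, private=PRIVATE_DATASETS):
    """Unweighted mean of per-dataset WER over the selected datasets."""
    selected = [
        value
        for name, value in scores.items()
        if include_private or name not in private
    ]
    if not selected:
        raise ValueError("no datasets selected")
    return mean(selected)


print(macro_average_wer(wer_by_dataset))                        # public only
print(macro_average_wer(wer_by_dataset, include_private=True))  # all datasets
```

With these made-up numbers, the public-only average is 5.0 and the average over all four datasets is 8.0, which shows how toggling splits can change a model's headline score.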
Key Takeaways
- The Open ASR Leaderboard now includes private datasets to prevent “benchmaxing” and ensure more accurate performance assessments.
- Datasets have been standardized across the board, ensuring consistent model outputs and transcripts for fair comparisons.
- Users can now toggle different dataset splits to better reflect their specific application needs.