Adding Benchmaxxer Repellant to the Open ASR Leaderboard

We have recently received high-quality English ASR datasets from Appen Inc. and DataoceanAI, covering both scripted and conversational speech across multiple accents. To guard against “benchmaxing” and test-set contamination, these datasets will remain private, so that scores on them reflect performance on data that models are unlikely to have seen during training.

Since its inception in September 2023, the Open ASR Leaderboard has been accessed over 710K times. This reflects the community’s strong interest and motivation to continuously improve speech recognition systems through benchmarking.

To address challenges such as standardization and openness, we have gathered all test sets into a single dataset on the Hub for easy access and previewing. Additionally, we use a normalizer, based on Whisper's, that removes punctuation and casing and maps words to American spelling, so that model outputs and dataset transcripts are compared on the same footing.
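To make the normalization step concrete, here is a minimal sketch of the three operations described above (lowercasing, punctuation removal, American spelling). It is not the leaderboard's or Whisper's actual normalizer: the function name `normalize_text` and the tiny `BRITISH_TO_AMERICAN` table are assumptions made for illustration, and the sketch handles plain ASCII English only.

```python
import re

# Hypothetical, deliberately tiny British-to-American mapping for illustration.
# A real normalizer would rely on a much larger spelling table.
BRITISH_TO_AMERICAN = {
    "colour": "color",
    "recognise": "recognize",
    "centre": "center",
}


def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation, and map a few British spellings to American."""
    text = text.lower()
    # Replace anything that is not a letter, digit, apostrophe, or whitespace
    # with a space, so punctuation never glues two words together.
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    # Splitting on whitespace also collapses repeated spaces.
    words = [BRITISH_TO_AMERICAN.get(word, word) for word in text.split()]
    return " ".join(words)


if __name__ == "__main__":
    print(normalize_text("The Colour, of the Centre!"))
```

Applying the same function to both the reference transcript and the model output before scoring means that differences in casing, punctuation, or spelling convention are not counted as recognition errors.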

Despite the openness of our UI code and evaluation scripts, maintaining benchmarks like the Open ASR Leaderboard remains challenging. Models may perform differently depending on factors such as their ability to handle diverse accents or specific use cases. The goal is to capture these nuances and provide a more comprehensive view of ASR performance.

New High-Quality Private Datasets

Dataset	Accent	Duration [h]	Male (%) / Female (%)	Style	Transcription
Appen Scripted AU	Australian	1.42	49 / 51	Read	Punctuated, cased.
Appen Scripted CA	Canadian	1.53	52 / 48	Read	Punctuated, cased.
Appen Scripted IN	Indian	1.02	49 / 51	Read	Punctuated, cased.
Appen Scripted US	American	1.45	49 / 51	Read	Punctuated, cased.
Appen Conversational IN	Indian	1.37	51 / 49	Conversational, spontaneous	Punctuated, disfluencies.
Appen Conversational US003	American	1.64	49 / 51	Conversational, spontaneous	Punctuated, cased, disfluencies.
Appen Conversational US004	American	1.65	49 / 51	Conversational, spontaneous	Punctuated, disfluencies.
DataoceanAI Scripted US	American	2.43	54 / 46	Read	Punctuated, cased (proper nouns), disfluencies.
DataoceanAI Scripted GB	British	2.43	47 / 53	Read	Punctuated, disfluencies.
DataoceanAI Conversational US	American	8.82	NA	Conversational, spontaneous	Punctuated, disfluencies.
DataoceanAI Conversational GB	British	5.96	NA	Conversational, spontaneous	Punctuated, disfluencies.

The variety of content in the datasets includes scripted and conversational speech, as well as acronyms, disfluencies, and proper nouns. The private nature of these datasets is designed to prevent them from being exploited for “benchmaxing,” where models might improve their performance on a benchmark without corresponding gains in real-world robustness.

How Can I Evaluate My Model?

To evaluate your model using the new private datasets, you need to add it to the Open ASR Leaderboard. Once added, we will run evaluations on both public and private datasets. You can also self-report your results for models that are not yet part of the leaderboard.

Are Models Trained on These Datasets at an Advantage?

Models trained on data from these providers should not gain a significant advantage, as we have instructed the providers to withhold this information. Where some advantage remains, drawing on multiple data providers helps balance it out and allows for more diverse evaluations.

To ensure fairness, the default Average WER, a macro-average across datasets, excludes the private datasets, so that potential gains tied to a specific data provider do not affect the headline score. Users can toggle individual splits on or off to tailor the evaluation to their application’s requirements.
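The effect of excluding private datasets from the default average can be sketched as follows. This is an illustration under stated assumptions, not the leaderboard's code: the function `macro_average_wer`, the dataset names, and the WER values are all hypothetical placeholders, and the sketch assumes one WER figure per dataset is already available.

```python
from typing import Dict, Iterable


def macro_average_wer(
    wer_by_dataset: Dict[str, float],
    private: Iterable[str],
    include_private: bool = False,
) -> float:
    """Unweighted mean of per-dataset WERs, skipping private sets by default."""
    private_names = set(private)
    selected = [
        wer
        for name, wer in wer_by_dataset.items()
        if include_private or name not in private_names
    ]
    if not selected:
        raise ValueError("no datasets selected for averaging")
    # Macro-average: every dataset counts equally, regardless of its duration.
    return sum(selected) / len(selected)


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not real results.
    scores = {"public_a": 4.0, "public_b": 6.0, "private_a": 11.0}
    print(macro_average_wer(scores, private=["private_a"]))
    print(macro_average_wer(scores, private=["private_a"], include_private=True))
```

Because each dataset carries equal weight, toggling a split changes the average by adding or removing one term, independent of how many hours of audio that split contains.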

Key Takeaways

The Open ASR Leaderboard now includes private datasets to prevent “benchmaxing” and ensure more accurate performance assessments.
Datasets have been standardized across the board, ensuring consistent model outputs and transcripts for fair comparisons.
Users can now toggle different dataset splits to better reflect their specific application needs.
