QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

QIMMA validates benchmarks before evaluating models, ensuring reported scores reflect genuine Arabic language capability in LLMs.

🏆 Leaderboard · 🔧 GitHub · 📄 Paper

If you’ve been tracking Arabic LLM evaluation, you’ve probably noticed a growing tension: the number of benchmarks and leaderboards is expanding rapidly, but are we actually measuring what we think we’re measuring?

We built QIMMA قمّة (Arabic for “summit”) to answer that question systematically. Instead of aggregating existing Arabic benchmarks as-is and running models on them, we applied a rigorous quality-validation pipeline before any evaluation took place. What we found was sobering: even widely used, well-regarded Arabic benchmarks contain systematic quality issues that can quietly corrupt evaluation results.

This post walks through what QIMMA is, how we built it, what problems we found, and what the model rankings look like once you clean things up.

🔍 The Problem: Arabic NLP Evaluation Is Fragmented and Unvalidated

Arabic is spoken by over 400 million people across diverse dialects and cultural contexts, yet the Arabic NLP evaluation landscape remains fragmented. A few key pain points have motivated this work:

  • Translation issues. Many Arabic benchmarks are translations from English. This introduces distributional shifts. Questions that feel natural in English become awkward or culturally misaligned in Arabic, making benchmark data less representative of how Arabic is naturally used.
  • Absent quality validation. Even native Arabic benchmarks are often released without rigorous quality checks. Annotation inconsistencies, incorrect gold answers, encoding errors, and cultural bias in ground-truth labels have all been documented in established resources (a sketch of the kind of automated checks that catch such issues follows this list).
  • Reproducibility gaps. Evaluation scripts and per-sample outputs are rarely released publicly, making it hard to audit results or build on prior work.
  • Coverage fragmentation. Existing leaderboards cover isolated tasks and narrow domains, making holistic model assessment difficult.
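
To make these failure modes concrete, here is a minimal sketch of the kind of automated checks a validation pipeline can run before any model is evaluated. It assumes multiple-choice items stored as dicts with `question`, `choices`, and `answer` fields; the schema and all function names are illustrative assumptions, not QIMMA's actual pipeline code.

```python
# Hypothetical benchmark quality checks: the schema and function names
# are illustrative, not QIMMA's actual code.
import unicodedata
from collections import Counter

def check_encoding(item: dict) -> list[str]:
    """Flag common signs of mis-decoded or corrupted text."""
    issues = []
    text = item["question"]
    # U+FFFD typically appears when bytes were decoded with the wrong codec.
    if "\ufffd" in text:
        issues.append("contains Unicode replacement character")
    # Stray control characters (other than newline/tab) suggest corruption.
    if any(unicodedata.category(ch) == "Cc" and ch not in "\n\t" for ch in text):
        issues.append("contains stray control characters")
    return issues

def check_gold_answer(item: dict) -> list[str]:
    """Flag missing or out-of-range gold labels for multiple-choice items."""
    answer = item.get("answer")
    choices = item.get("choices", [])
    if answer is None or answer not in range(len(choices)):
        return ["gold answer missing or not a valid choice index"]
    return []

def check_duplicates(items: list[dict]) -> list[str]:
    """Flag exact duplicate questions, which shrink the effective test size."""
    counts = Counter(item["question"].strip() for item in items)
    return [f"duplicated question ({n}x): {q[:40]}" for q, n in counts.items() if n > 1]

def validate(items: list[dict]) -> dict:
    """Run all checks and return a report keyed by item index."""
    report = {i: issues for i, item in enumerate(items)
              if (issues := check_encoding(item) + check_gold_answer(item))}
    if dupes := check_duplicates(items):
        report["dataset"] = dupes
    return report
```

On a clean benchmark, `validate` returns an empty report; in practice, checks along these lines are what surface the encoding errors and incorrect gold answers described above.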

To illustrate where QIMMA sits relative to existing platforms:

| Leaderboard | Open Source | Native Arabic | Quality Validation | Coding Eval | Public Outputs |
|---|---|---|---|---|---|
| OALL v1 | | Mixed | | | |
| OALL v2 | | Mostly | | | |
| BALSAM | Partial | 50% | | | |
| AraGen | | 100% | | | |
| SILMA ABL | | 100% | | | |
| ILMAAM | Partial | 100% | | | |
| HELM Arabic | | Mixed | | | |
