QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

QIMMA validates benchmarks before evaluating models, ensuring reported scores reflect genuine Arabic language capability in LLMs.

🏆 Leaderboard · 🔧 GitHub · 📄 Paper

If you’ve been tracking Arabic LLM evaluation, you’ve probably noticed a growing tension: the number of benchmarks and leaderboards is expanding rapidly, but are we actually measuring what we think we’re measuring?

We built QIMMA قمّة (Arabic for “summit”) to answer that question systematically. Instead of aggregating existing Arabic benchmarks as-is and running models on them, we applied a rigorous quality-validation pipeline before any evaluation took place. What we found was sobering: even widely used, well-regarded Arabic benchmarks contain systematic quality issues that can quietly corrupt evaluation results.

This post walks through what QIMMA is, how we built it, what problems we found, and what the model rankings look like once you clean things up.

🔍 The Problem: Arabic NLP Evaluation Is Fragmented and Unvalidated

Arabic is spoken by over 400 million people across diverse dialects and cultural contexts, yet the Arabic NLP evaluation landscape remains fragmented. A few key pain points have motivated this work:

  • Translation issues. Many Arabic benchmarks are translations from English. This introduces distributional shifts. Questions that feel natural in English become awkward or culturally misaligned in Arabic, making benchmark data less representative of how Arabic is naturally used.
  • Absent quality validation. Even native Arabic benchmarks are often released without rigorous quality checks. Annotation inconsistencies, incorrect gold answers, encoding errors, and cultural bias in ground-truth labels have all been documented in established resources (a sketch of the kind of automated checks that catch such issues follows this list).
  • Reproducibility gaps. Evaluation scripts and per-sample outputs are rarely released publicly, making it hard to audit results or build on prior work.
  • Coverage fragmentation. Existing leaderboards cover isolated tasks and narrow domains, making holistic model assessment difficult.
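
To make these failure modes concrete, here is a minimal sketch of the kind of automated checks a validation pipeline can run before any model is evaluated. It assumes multiple-choice items stored as dicts with `question`, `choices`, and `answer` fields; the schema and all function names are illustrative assumptions, not QIMMA's actual pipeline code.

```python
# Hypothetical benchmark quality checks: the schema and function names
# are illustrative, not QIMMA's actual code.
import unicodedata
from collections import Counter

def check_encoding(item: dict) -> list[str]:
    """Flag common signs of mis-decoded or corrupted text."""
    issues = []
    text = item["question"]
    # U+FFFD typically appears when bytes were decoded with the wrong codec.
    if "\ufffd" in text:
        issues.append("contains Unicode replacement character")
    # Stray control characters (other than newline/tab) suggest corruption.
    if any(unicodedata.category(ch) == "Cc" and ch not in "\n\t" for ch in text):
        issues.append("contains stray control characters")
    return issues

def check_gold_answer(item: dict) -> list[str]:
    """Flag missing or out-of-range gold labels for multiple-choice items."""
    answer = item.get("answer")
    choices = item.get("choices", [])
    if answer is None or answer not in range(len(choices)):
        return ["gold answer missing or not a valid choice index"]
    return []

def check_duplicates(items: list[dict]) -> list[str]:
    """Flag exact duplicate questions, which shrink the effective test size."""
    counts = Counter(item["question"].strip() for item in items)
    return [f"duplicated question ({n}x): {q[:40]}" for q, n in counts.items() if n > 1]

def validate(items: list[dict]) -> dict:
    """Run all checks and return a report keyed by item index."""
    report = {i: issues for i, item in enumerate(items)
              if (issues := check_encoding(item) + check_gold_answer(item))}
    if dupes := check_duplicates(items):
        report["dataset"] = dupes
    return report
```

On a clean benchmark, `validate` returns an empty report; in practice, checks along these lines are what surface the encoding errors and incorrect gold answers described above.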

To illustrate where QIMMA sits relative to existing platforms:

| Leaderboard | Open Source | Native Arabic | Quality Validation | Coding Eval | Public Outputs |
|---|---|---|---|---|---|
| OALL v1 | | Mixed | | | |
| OALL v2 | | Mostly | | | |
| BALSAM | Partial | 50% | | | |
| AraGen | | 100% | | | |
| SILMA ABL | | 100% | | | |
| ILMAAM | Partial | 100% | | | |
| HELM Arabic | | Mixed | | | |
