LLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]

Disclosure: Some links in this article are affiliate links. AI Maestro may earn a commission if you make a purchase, at no…

By AI Maestro May 12, 2026 2 min read

I built a small website called LLM Win:

https://llm-win.com

It turns LLM benchmark results into a directed graph:

If model A beats model B on benchmark X, add an edge A -> B. 

Then it searches for the shortest transitive chain between two models.
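As a rough illustration of the idea (not the site's actual code), the edge rule and the shortest-chain search can be sketched with a plain breadth-first search. The model names, benchmark names, and scores below are invented for the example, and `build_win_graph` and `shortest_chain` are hypothetical helper names.

```python
from collections import defaultdict, deque

# Hypothetical scores: model -> {benchmark: score}. All values are invented.
SCORES = {
    "small": {"bench_a": 30.0, "bench_b": 55.0},
    "medium": {"bench_a": 50.0, "bench_b": 40.0, "bench_c": 20.0},
    "large": {"bench_a": 80.0, "bench_c": 10.0},
}


def build_win_graph(scores):
    """Return graph[a][b] = one benchmark on which a strictly beats b."""
    graph = defaultdict(dict)
    for a, a_scores in scores.items():
        for b, b_scores in scores.items():
            if a == b:
                continue
            for bench, a_val in a_scores.items():
                b_val = b_scores.get(bench)
                if b_val is not None and a_val > b_val:
                    # Keep the first winning benchmark found as the edge label.
                    graph[a].setdefault(b, bench)
    return graph


def shortest_chain(graph, source, target):
    """BFS for the shortest win chain; returns (winner, benchmark, loser) hops or None."""
    if source == target:
        return []
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt, bench in graph.get(node, {}).items():
            if nxt in parent:
                continue
            parent[nxt] = (node, bench)
            if nxt == target:
                # Walk the parent links back to the source, then reverse.
                chain, cur = [], target
                while parent[cur] is not None:
                    prev, via = parent[cur]
                    chain.append((prev, via, cur))
                    cur = prev
                return chain[::-1]
            queue.append(nxt)
    return None


graph = build_win_graph(SCORES)
print(shortest_chain(graph, "small", "large"))
# [('small', 'bench_b', 'medium'), ('medium', 'bench_c', 'large')]
```

In this toy data, "small" never beats "large" directly, but it beats "medium" on one benchmark and "medium" beats "large" on another, which is exactly the kind of two-hop chain the site searches for.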

The meme version is:

Can LLaMA 2 7B beat Claude Opus 4.7? 

In an absurd transitive-benchmark sense, sometimes yes. But I added a Report tab because the structure itself seems useful for model evaluation.

Some experimental findings from the current Artificial Analysis data snapshot:

  1. Weak-to-strong reachability is high. I checked 126,937 pairs where the source model has a lower Intelligence Index than the target model. Of these, 119,514 are reachable through benchmark win chains, a reachability rate of 94.2%.
  2. Most paths are short. Among reachable weak-to-strong pairs, paths of 2-3 hops account for 91.4%, so this is not mostly long-chain cherry-picking.
  3. Direct reversal triples are abundant. After treating non-positive benchmark values as missing, there are still about 119k direct weak-over-strong triples of the form (source model, target model, benchmark), where the source has a lower Intelligence Index but a higher score on that benchmark.
  4. Some benchmarks create more reversals than others. Current high-reversal / useful-signal candidates include Humanity’s Last Exam, IFBench, AIME 2025, TAU2, and SciCode.
  5. Different benchmarks call for different interpretations. For example, IFBench has roughly a reversal rate of ~17.5%, coverage of ~80.0%, and a correlation with Intelligence Index of r≈0.82. This suggests it may provide an independent skill signal rather than simply duplicating the overall ranking.
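The statistics in the list above can be approximated on toy data with the minimal sketch below. The post does not spell out exact definitions, so these are assumptions: the reversal rate is taken as the share of comparable model pairs on a benchmark where the lower-index model scores strictly higher, and coverage as the share of models with a positive score on that benchmark. All names and numbers are invented.

```python
from collections import Counter, deque
from itertools import combinations

# Hypothetical data: model -> (overall index, {benchmark: score}). Values are invented.
MODELS = {
    "small": (10, {"bench_a": 30.0, "bench_b": 55.0}),
    "medium": (20, {"bench_a": 50.0, "bench_b": 40.0, "bench_c": 20.0}),
    "large": (30, {"bench_a": 80.0, "bench_b": 0.0, "bench_c": 10.0}),
}


def valid(score):
    """Treat missing and non-positive values as missing, as the post describes."""
    return score is not None and score > 0


def beats(a_scores, b_scores):
    """True if a has a strictly higher valid score than b on any shared benchmark."""
    return any(
        valid(v) and valid(b_scores.get(k)) and v > b_scores[k]
        for k, v in a_scores.items()
    )


def hops_from(models, source):
    """BFS over win edges; returns {model: number of hops from source}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for other in models:
            if other not in dist and beats(models[node][1], models[other][1]):
                dist[other] = dist[node] + 1
                queue.append(other)
    return dist


def weak_to_strong_reachability(models):
    """Count weak-to-strong pairs and tally shortest hop counts for reachable ones."""
    total, hops = 0, Counter()
    for src, (src_idx, _) in models.items():
        dist = hops_from(models, src)
        for dst, (dst_idx, _) in models.items():
            if src_idx < dst_idx:
                total += 1
                if dst in dist:
                    hops[dist[dst]] += 1
    return total, hops


def benchmark_stats(models):
    """Per-benchmark coverage, reversal count, and reversal rate (assumed definitions)."""
    benches = sorted({b for _, s in models.values() for b in s})
    stats = {}
    for bench in benches:
        scored = [(idx, s[bench]) for idx, s in models.values() if valid(s.get(bench))]
        comparable = reversals = 0
        for (i1, v1), (i2, v2) in combinations(scored, 2):
            if i1 == i2:
                continue  # equal overall index: neither model is the "weak" one
            comparable += 1
            weak_val, strong_val = (v1, v2) if i1 < i2 else (v2, v1)
            if weak_val > strong_val:
                reversals += 1
        stats[bench] = {
            "coverage": len(scored) / len(models),
            "reversals": reversals,
            "reversal_rate": reversals / comparable if comparable else 0.0,
        }
    return stats


total, hops = weak_to_strong_reachability(MODELS)
print(total, sum(hops.values()), dict(hops))  # 3 3 {1: 2, 2: 1}
print(benchmark_stats(MODELS)["bench_b"])
```

On the real snapshot, the same ratio gives the figure quoted above: 119,514 / 126,937 ≈ 0.9415, which rounds to 94.2%.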

My current interpretation:

LLM rankings are better represented as a benchmark-specific capability graph than as a single ladder. Some reversals probably reflect real specialization; some reflect benchmark coverage limits, volatility, or measurement noise.

The next question is whether reversal structure can help build better evaluation metrics:

  • identify specialist models;
  • identify volatile benchmarks;
  • build robust generalist scores;
  • select complementary benchmark sets;
  • decompose models into capability fingerprints.

Curious what people think: Is benchmark reversal structure a useful evaluation signal, or mostly an artifact of noisy benchmarks?


submitted by /u/Spico197

Key Takeaways

  • LLM rankings are better represented as a benchmark-specific capability graph.
  • Different benchmarks have different interpretations and may provide independent skill signals.
  • The reversal structure of some benchmarks could be useful for building robust evaluation metrics.

Originally published at reddit.com. Curated by AI Maestro.
