LLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]

I built a small website called LLM Win.

It turns LLM benchmark results into a directed graph:

If model A beats model B on benchmark X, add an edge A -> B.

Then it searches for the shortest transitive chain between two models.
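To make the construction concrete, here is a minimal sketch of the edge rule and the shortest-chain search. It is not the site's actual code. The score table, the model and benchmark names, and the helpers `build_win_graph` and `shortest_chain` are all made up for illustration, and I am assuming a plain breadth-first search, which finds a chain with the fewest hops in an unweighted graph.

```python
from collections import defaultdict, deque

# Hypothetical toy scores: {benchmark: {model: score}}. Not real data.
scores = {
    "bench_x": {"small": 0.40, "mid": 0.35, "large": 0.90},
    "bench_y": {"small": 0.10, "mid": 0.60, "large": 0.55},
}

def build_win_graph(scores):
    """Add an edge A -> B, labelled with a benchmark, whenever A outscores B."""
    graph = defaultdict(dict)
    for bench, by_model in scores.items():
        for a, score_a in by_model.items():
            for b, score_b in by_model.items():
                if a != b and score_a > score_b:
                    # Keep only the first benchmark that witnesses the win.
                    graph[a].setdefault(b, bench)
    return graph

def shortest_chain(graph, source, target):
    """Breadth-first search. Returns [(winner, benchmark, loser), ...] or None."""
    if source == target:
        return []
    parent = {source: None}  # node -> (previous node, benchmark used)
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt, bench in graph.get(node, {}).items():
            if nxt in parent:
                continue
            parent[nxt] = (node, bench)
            if nxt == target:
                # Walk back from the target to rebuild the chain.
                chain = []
                cur = target
                while parent[cur] is not None:
                    prev, used = parent[cur]
                    chain.append((prev, used, cur))
                    cur = prev
                return chain[::-1]
            queue.append(nxt)
    return None

graph = build_win_graph(scores)
chain = shortest_chain(graph, "small", "large")
for winner, bench, loser in chain:
    print(f"{winner} beats {loser} on {bench}")
```

In this toy table "small" never beats "large" directly, but it beats "mid" on one benchmark and "mid" beats "large" on another, so the search returns a 2-hop chain.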

The meme version is:

Can LLaMA 2 7B beat Claude Opus 4.7?

In an absurd transitive benchmark sense, sometimes yes. But I added a Report tab because the structure itself seems useful for model evaluation. Some experimental findings from the current Artificial Analysis data snapshot:

Weak-to-strong reachability is high. I checked 126,937 pairs where the source model has a lower Intelligence Index than the target model. Of these, 119,514 are reachable through benchmark win chains, a reachability rate of 94.2%.
Most paths are short. Among reachable weak-to-strong pairs, paths of 2-3 hops account for 91.4%. So this is not mostly long-chain cherry-picking.
Direct reversal triples are abundant. After treating non-positive benchmark values as missing, there are still about 119k direct weak-over-strong triples of the form (source model, target model, benchmark), where the source has a lower Intelligence Index but a higher score on that benchmark.
Some benchmarks create more reversals than others. Current high-reversal / useful-signal candidates include Humanity’s Last Exam, IFBench, AIME 2025, TAU2, and SciCode.
Different benchmarks call for different interpretations. For example, IFBench has a reversal rate of roughly 17.5%, coverage of roughly 80.0%, and a correlation with the Intelligence Index of r≈0.82. This suggests it may provide an independent skill signal rather than simply duplicating the overall ranking.
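For anyone who wants to reproduce this kind of count, here is a rough sketch of how reversal triples and per-benchmark rates could be computed. The data is a made-up toy table, not the Artificial Analysis snapshot, and the helper names are hypothetical. The exact definitions are my assumptions here: "coverage" is taken as the share of indexed models with a valid score on the benchmark, and "reversal rate" as the share of comparable weak/strong pairs where the weaker model scores higher.

```python
from itertools import permutations

# Hypothetical toy data, not the real snapshot.
index = {"small": 10.0, "mid": 20.0, "large": 30.0}  # overall index per model
scores = {
    "bench_x": {"small": 0.40, "mid": 0.35, "large": 0.90},
    "bench_y": {"small": 0.10, "mid": 0.60, "large": 0.0},  # 0.0 counts as missing
}

def valid(by_model):
    """Drop None and non-positive values, which are treated as missing."""
    return {m: s for m, s in by_model.items() if s is not None and s > 0}

def reversal_triples(index, scores):
    """All (source, target, benchmark) with a lower index but a higher score."""
    triples = []
    for bench, by_model in scores.items():
        ok = valid(by_model)
        for src, tgt in permutations(ok, 2):
            if src not in index or tgt not in index:
                continue
            if index[src] < index[tgt] and ok[src] > ok[tgt]:
                triples.append((src, tgt, bench))
    return triples

def benchmark_stats(index, scores, bench):
    """Assumed definitions of coverage and reversal rate for one benchmark."""
    ok = valid(scores[bench])
    models = [m for m in ok if m in index]
    # Ordered pairs (weaker, stronger) by overall index.
    comparable = [(a, b) for a, b in permutations(models, 2) if index[a] < index[b]]
    reversed_pairs = [(a, b) for a, b in comparable if ok[a] > ok[b]]
    return {
        "coverage": len(models) / len(index),
        "reversal_rate": len(reversed_pairs) / len(comparable) if comparable else 0.0,
    }

triples = reversal_triples(index, scores)
stats_x = benchmark_stats(index, scores, "bench_x")
stats_y = benchmark_stats(index, scores, "bench_y")
print(triples, stats_x, stats_y)
```

Note that the missing-value filter matters: without it, the 0.0 for "large" on "bench_y" would count as two extra reversals that are really just absent data.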

My current interpretation:

LLM rankings are better represented as a benchmark-specific capability graph than as a single ladder. Some reversals probably reflect real specialization; some reflect benchmark coverage limits, volatility, or measurement noise.

The next question is whether reversal structure can help build better evaluation metrics:

identify specialist models;
identify volatile benchmarks;
build robust generalist scores;
select complementary benchmark sets;
decompose models into capability fingerprints.

Curious what people think: Is benchmark reversal structure a useful evaluation signal, or mostly an artifact of noisy benchmarks?

submitted by /u/Spico197

Key Takeaways

LLM rankings are better represented as a benchmark-specific capability graph.
Different benchmarks have different interpretations and may provide independent skill signals.
The reversal structure of some benchmarks could be useful for building robust evaluation metrics.
