| One model paid a 22.9% Agent Execution Tax (wasted / productive inference). The same model that looked cheapest per token cost 2.3x more per successful task. Ran 720 browser agent tasks across these four models on the WebVoyager benchmark. Open-weight models held their own against Gemini 2.5 Flash. Highlights: – MiniMax M2.5: 2.3x cheaper per successful task than Gemini – GLM-5: highest accuracy (57.1%), strongest on structured data – Kimi K2.5: 0% parse retries across 852 calls (Gemini was 18.6%) What surprised us: open-weight models are now winning agent benchmarks not because they got smarter but because they’re more reliable per call. Token pricing comparisons are misleading once retries compound. Full benchmark + reproducibility steps in the link submitted by /u/ogandrea |
Originally published at reddit.com. Curated by AI Maestro.
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




