New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously

By AI Maestro, May 16, 2026

Key Takeaways

  • Researchers at Carnegie Mellon University developed ExploitBench, a new benchmark that evaluates how well AI agents such as Claude Mythos and GPT-5.5 can exploit vulnerabilities in Google’s V8 JavaScript engine.
  • Claude Mythos significantly outperformed GPT-5.5, and a co-author who reviewed its transcripts judged its work comparable to that of an experienced human security researcher.
  • That performance came at a price: the full Claude Mythos test run cost approximately $36,428, more than ten times the cost of the GPT-5.5 run.

The benchmark measures exploit progress across five tiers. The highest tier is arbitrary code execution, meaning the ability to run attacker-chosen commands on a target system. V8 is a high-value target because it powers Chrome, Edge, Node.js, and Cloudflare Workers.

Claude Mythos scored an average of 9.55 points in fully autonomous mode, reaching the top tier on 21 out of 41 vulnerabilities tested. GPT-5.5 managed only 4.30 points across all tests.
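To put the reported numbers in proportion, the short Python sketch below derives a few ratios from the figures quoted in this article. The variable names are illustrative, the 4.30 figure for GPT-5.5 is assumed to be an average like the Mythos score, and the GPT-5.5 cost is only an upper bound implied by the "more than ten times" comparison, not a number reported here.

```python
# Figures quoted in the article; variable names are illustrative.
mythos_avg_score = 9.55      # average points, fully autonomous mode
gpt_avg_score = 4.30         # assumed to be an average as well
top_tier_hits = 21           # vulnerabilities where Mythos reached the top tier
total_vulns = 41             # vulnerabilities tested
mythos_cost_usd = 36_428     # approximate cost of the full Mythos run

# How many times higher the Mythos score is than the GPT-5.5 score.
score_ratio = mythos_avg_score / gpt_avg_score

# Share of tested vulnerabilities on which Mythos reached the top tier.
top_tier_rate = top_tier_hits / total_vulns

# "More than ten times" the GPT-5.5 cost implies the GPT-5.5 run cost
# less than one tenth of the Mythos run. This is a bound, not a reported figure.
gpt_cost_upper_bound_usd = mythos_cost_usd / 10

print(f"Score ratio (Mythos / GPT-5.5): {score_ratio:.2f}x")
print(f"Top-tier rate for Mythos: {top_tier_rate:.1%}")
print(f"Implied GPT-5.5 cost: under ${gpt_cost_upper_bound_usd:,.2f}")
```

On these figures, Mythos scores a little over twice as high as GPT-5.5 and reaches the top tier on roughly half of the tested vulnerabilities.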

The benchmark is available on GitHub and the paper is published on arXiv. Anthropic provided API credits for the test run; however, all analysis was conducted independently by the authors.

Mythos works like a “fairly competent” browser security researcher

ExploitBench co-author Seunghyun Lee reviewed the Mythos transcripts and found that it performs similarly to an experienced human security researcher. For instance, the model developed an exploit technique previously dismissed as too complex by human researchers.

The benchmark includes both known and unknown vulnerabilities. The authors acknowledge that while models like Claude Mythos can identify and reproduce known issues, the benchmark does not yet measure a model's ability to find new flaws or to fully weaponize an exploit for real attacks.

Key Takeaways

  • The benchmark is available on GitHub and the paper is published on arXiv.
  • Claude Mythos significantly outperformed GPT-5.5 in this new evaluation.
  • The full test run cost substantially more for Claude Mythos than for GPT-5.5, a gap that suggests room for further research on efficiency.
