New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously

Key Takeaways

Researchers at Carnegie Mellon University developed a new benchmark to evaluate how well AI agents such as Claude Mythos and GPT-5.5 can exploit vulnerabilities in Google’s V8 JavaScript engine.
Claude Mythos significantly outperformed GPT-5.5 and, according to a co-author's review of its transcripts, performed on par with an experienced human security researcher.
That performance came at a price: the full test run of Claude Mythos cost approximately $36,428, more than ten times the cost for GPT-5.5.

The benchmark measures progress across five tiers, with the top tier being arbitrary code execution, meaning the ability to run any command on a target system. V8 powers systems like Chrome, Edge, Node.js, and Cloudflare Workers.

Claude Mythos scored an average of 9.55 points in fully autonomous mode, reaching the top tier on 21 out of 41 vulnerabilities tested. GPT-5.5 managed only 4.30 points across all tests.
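To put those figures in proportion, the short Python sketch below derives a few ratios from the numbers quoted in this article. It is only illustrative arithmetic, not part of the benchmark: the variable names are made up for this example, the score ratio assumes both point totals are averages on the same scale, and the GPT-5.5 cost bound follows only from the "more than ten times" statement, since the article gives no exact figure.

```python
# Figures as quoted in the article (variable names are illustrative only).
mythos_avg_points = 9.55      # Claude Mythos, fully autonomous mode
gpt55_points = 4.30           # GPT-5.5 (assumed to be on the same scale)
top_tier_hits = 21            # vulnerabilities where Mythos reached the top tier
total_vulns = 41              # vulnerabilities tested
mythos_cost_usd = 36_428      # approximate cost of the full Mythos run

# Share of vulnerabilities on which Mythos reached the top tier.
top_tier_rate = top_tier_hits / total_vulns

# How many times higher the Mythos score is than the GPT-5.5 score.
score_ratio = mythos_avg_points / gpt55_points

# "More than ten times" implies the GPT-5.5 run cost less than this amount.
gpt55_cost_upper_bound = mythos_cost_usd / 10

print(f"Top-tier rate: {top_tier_rate:.1%}")
print(f"Score ratio (Mythos / GPT-5.5): {score_ratio:.2f}")
print(f"GPT-5.5 run cost: under ${gpt55_cost_upper_bound:,.2f}")
```

In other words, Mythos reached the top tier on roughly half of the tested vulnerabilities and scored a little over twice as many points, while its run cost more than ten times as much.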

The benchmark is available on GitHub and the paper is published on arXiv. Anthropic provided API credits for the test run; however, all analysis was conducted independently by the authors.

Mythos works like a “fairly competent” browser security researcher

ExploitBench co-author Seunghyun Lee reviewed the Mythos transcripts and found that it performs similarly to an experienced human security researcher. For instance, the model developed an exploit technique previously dismissed as too complex by human researchers.

The benchmark includes both known and unknown vulnerabilities. The authors acknowledge that while models like Claude Mythos can identify and reproduce known issues, the benchmark does not yet measure a model's ability to find new flaws or to fully weaponize a given exploit for real attacks.

Key Takeaways

The benchmark is available on GitHub and the paper is published on arXiv.
Claude Mythos significantly outperformed GPT-5.5 in this new evaluation.
The full test run cost substantially more for Claude Mythos than for GPT-5.5, which points to potential inefficiencies and a need for further research on optimizing such models.
