If you missed the Project Glasswing announcement last month: Anthropic built a security-focused model that autonomously found thousands of high-severity vulnerabilities across every major OS and web browser, then decided it was too dangerous to release publicly. Instead, they gave ~40 organizations access so they could use it defensively.
Cloudflare just posted their honest breakdown of the experience.
The genuinely impressive part:
The model can take several exploit primitives and reason about how to chain them into a working proof of concept. The reasoning reads like the work of a senior researcher, not an automated scanner.
The catch:
Its built-in guardrails aren’t consistent. The same task, framed differently, can produce completely different outcomes. Cloudflare’s point is that this inconsistency is exactly why any future public release needs hardened safeguards layered on top.
They also acknowledge the same capabilities that helped them find bugs in their own code will, in the wrong hands, accelerate attacks against every application on the internet.
Worth a read if you’ve been following the Glasswing story.
submitted by /u/Direct-Attention8597
Originally published at reddit.com. Curated by AI Maestro.