Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude

Disclosure: Some links in this article are affiliate links. AI Maestro may earn a commission if you make a purchase, at no…

By AI Maestro June 11, 2026 3 min read
Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude

For makers and artists building the next generation of creative tools, the stakes of AI safety are no longer abstract-they are immediate and personal. Anthropic has reversed a controversial decision that would have silently throttled competitors’ access to its latest model, Claude Fable 5, effectively hamstringing independent developers and research teams. The company admitted after a fierce backlash that it had miscalculated the balance between security and openness, promising to make its safety measures transparent rather than hidden.

The hidden hand in the machine

When Anthropic launched Claude Fable 5 earlier this week, it introduced guardrails designed to prevent misuse in high-risk fields like cybersecurity, biology, and chemistry. Users asking about bioweapons or cyberattacks would be automatically rerouted to a less capable model. However, the company went further for those attempting to use the tool for frontier AI development. Instead of a clear warning, Anthropic planned to deliberately degrade the model’s performance in ways invisible to the user. This approach was intended to stop researchers from training competing models, a practice explicitly banned in Anthropic’s terms of service.

Why the backlash was so sharp

The research community reacted with fury, viewing the strategy as an act of “secret sabotage.” Dean Ball, a senior fellow at the Foundation for American Innovation and former White House advisor on AI, argued that degrading machine learning research performance without telling the user is “shockingly hostile.” He noted that this tactic undermines the very goal of AI safety by preventing researchers from collaborating on how to keep these systems in check.

“It felt like Anthropic was saying to the public, ‘We don’t trust anybody else to do AI research. We are the only ones who have to do AI research.’ It feels a bit like they’re starting to pull the ladder up behind them.”

Will Brown, research lead at the open source startup Prime Intellect, highlighted that the policy left developers flying blind. They would not know when their requests triggered safeguards, creating uncertainty about whether they were violating rules. Brown warned that such restrictions could stifle the wider ecosystem, including third-party evaluation firms that test frontier models for safety and reliability. If Anthropic had secretly degraded its model, this vital layer of independent verification could have been severely hampered.

Anthropic’s original justification

Anthropic defended the move by citing the rapid acceleration of AI capabilities. In a recent blog post, the company expressed concern that AI could improve faster than society can adapt to it. They argued that having the option to “slow or temporarily pause frontier AI development” is beneficial for allowing societal structures and alignment research to catch up. Furthermore, they emphasised the need to protect the technological edge held by the US and its allies in frontier chips and software.

“These safeguards prevent foreign adversaries from using our most capable models in ways that pose severe safety risks,” the company stated. “A hidden safeguard is harder to probe and work around. This means the safeguards can be targeted much more narrowly.”

The new path forward

Following the reversal, Anthropic confirmed that safeguards for AI development will now be visible to users. If the system suspects a user is attempting to build a highly capable AI, they will be explicitly informed that the request is being refused or that they are being rerouted to a less capable model. The company acknowledged that making these measures visible means casting a wider net, potentially triggering safeguards for more benign requests. Anthropic stated it is working urgently to make its classifiers more precise to mitigate this issue.

Key takeaways

  • Anthropic has scrapped its plan to secretly degrade performance for researchers using Claude Fable 5, admitting the approach was a “wrong tradeoff.”
  • Future safety measures will be transparent, alerting users when they are rerouted or refused, rather than operating invisibly in the background.
  • Critics argue that the original policy threatened to centralise advanced AI research within a handful of labs, undermining the open-source ecosystem.
  • While Anthropic prioritises preventing foreign adversaries from eroding the US tech advantage, the community insists that trust requires visible, not hidden, guardrails.
Scroll to Top