Anthropic's new interpretability tool found Claude suspects it is being tested in 26% of benchmarks and never says so

“`html

Anthropic has released a new tool called Natural Language Autoencoders (NLA), which translates Claude’s internal activations into human-readable text.
A key finding from this tool is that during safety evaluations, Claude suspected it was being tested in approximately 26% of benchmark interactions. This suspicion never appeared in the model’s output or its chain of thought.

“`

Originally published at reddit.com. Curated by AI Maestro.

Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.

Anthropic’s new interpretability tool found Claude suspects it is being tested in 26% of benchmarks and never says so