New benchmark exposes how badly AI struggles with real knowledge work

By AI Maestro, June 19, 2026

The AA-Briefcase benchmark from Artificial Analysis reveals a stark reality for artificial intelligence in professional settings. By simulating multi-week projects using fragmented data such as Slack threads, emails, and meeting transcripts, the test shows that even top-tier models solve only three percent of tasks completely. Claude Fable 5 achieved the highest pass rate but still failed to meet all criteria on the vast majority of assignments. On thirty-one out of ninety-one tasks, no model managed to clear fifty percent of the required standards. Weaker models struggle with basic execution by missing relevant files, while stronger versions fail more subtly by overlooking details that require synthesising information across multiple sources.

This performance gap matters because it highlights the distance between current capabilities and the complex, unstructured nature of real knowledge work. The significant cost disparity, with per-task expenses ranging from forty cents for DeepSeek V4 Flash to over thirty-one dollars for Claude Fable 5, suggests organisations must weigh accuracy against budget carefully. Businesses cannot assume that deploying advanced models will automatically streamline workflows or replace human oversight in critical decision-making processes. The data indicates that AI remains a tool for assistance rather than a standalone solution for high-stakes projects requiring deep contextual understanding.

  • Top AI models successfully complete only three percent of simulated multi-week knowledge work tasks.
  • Per-task costs vary by more than seventy-seven times between the cheapest and most expensive models tested.
  • Stronger models fail by missing nuanced details rather than through basic execution errors, which makes their mistakes harder to detect.
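
For readers who want to check the headline figures, the arithmetic can be sketched in a few lines of Python. This is only an illustration built from the numbers quoted above: the variable and function names are hypothetical, and the Claude Fable 5 cost is treated as a lower bound because the article says "over thirty-one dollars".

```python
# Figures quoted in the article. The names below are illustrative only.
cheapest_cost_usd = 0.40   # DeepSeek V4 Flash, per task
priciest_cost_usd = 31.00  # Claude Fable 5, per task (lower bound: "over thirty-one dollars")

tasks_total = 91
tasks_no_model_above_half = 31  # tasks where no model cleared 50% of the criteria


def cost_ratio(high: float, low: float) -> float:
    """Return how many times more expensive `high` is than `low`."""
    if low <= 0:
        raise ValueError("low must be positive")
    return high / low


def share(part: int, whole: int) -> float:
    """Return `part` as a fraction of `whole`."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


ratio = cost_ratio(priciest_cost_usd, cheapest_cost_usd)
unsolved_share = share(tasks_no_model_above_half, tasks_total)

print(f"Cost ratio: at least {ratio:.1f}x")
print(f"Tasks where no model cleared half the criteria: {unsolved_share:.1%}")
```

On these figures the most expensive model costs at least about 77.5 times as much per task as the cheapest, and roughly a third of the tasks defeated every model tested.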
