AI agents are shifting from answering queries to autonomously executing complex, multi-step tasks.
Before these systems can book trips or handle financial analysis, providers and startups need proof they work reliably across countless scenarios.
Standard benchmarks show off model prowess but do not prove an AI can complete real-world jobs correctly.
Patronus AI, founded in 2023 by former Meta researchers Anand Kannappan and Rebecca Qian, helps model makers fine-tune models by building simulated digital environments to evaluate agent performance.
This San Francisco startup is solving a critical problem. Virtually every frontier AI lab and many emerging startups are now customers, according to Glenn Solomon, managing director at Notable Capital, who describes the demand as nearly insatiable.
Patronus revenue has grown 15-fold over the past year, fueling significant investor interest. On Thursday, the company announced a $50 million Series B round led by Greenfield Partners, with participation from Notable Capital, Lightspeed, Datadog, and Samsung. The round brings the company’s total funding to $70 million.
Patronus uses “digital world models” to create replicas of websites and internal systems. In these environments, agents are stress-tested after training using reinforcement learning, which rewards successful task completion and penalises errors.
AI labs see great value in these digital simulations because they give agents a chance to try different, sometimes unpredictable, scenarios. The company compares its approach to how Waymo trained autonomous cars by first building synthetic worlds to test vehicles against rare hazards, such as severe weather or a child running after a ball.
The difference with AI agents is that they tend to take shortcuts, which means they fail to complete the task correctly. “Patronus is really good at spotting the hacks and making sure they are holding the models accountable,” Solomon said.
Patronus is currently providing its simulated digital worlds for software engineering and finance, but these are just the start, according to Kannappan.
“Today we’re very focused on the problems that are verifiable, so the problems that you can immediately check and verify, but there are a ton more areas that are very non-verifiable or very hard to verify,” he said.
Just because these processes are verifiable does not mean they are simple. “We want to be able to actually create the environment in which you can operate an agent that can run for 10 hours or 10 days or 10 weeks,” Kannappan said.
As for rivals, Patronus believes it is primarily competing against the internal teams AI labs have already built to evaluate agent behavior. While human-data firms like Mercor and Surge help model makers with reinforcement learning, Patronus operates differently by evaluating how agents behave without any human involvement.
What it means
For developers, this means agents will face environments that mimic real-world complexity rather than static tests. The focus is on catching shortcuts and ensuring systems can operate for days or weeks without human intervention.




