Princeton University researchers have built a benchmark where AI agents run a fictional software firm for 500 simulated days. Of the fourteen models tested, only three finished with more money than the starting capital of one million dollars. A simple rule-based strategy that never uses an AI model beat almost every other entry.
Current AI agents excel at narrow tasks like fixing bugs or following service policies. These problems share a simple structure: a clear goal, a brief action, and quick feedback. Running a real business is different. It involves long chains of decisions under uncertainty, and agents must set priorities, allocate limited resources, read noisy signals, and adapt to changing conditions.
CEO-Bench simulates a startup
The benchmark simulates running a startup for 500 days. As an example of strategic steering, the researchers point to Apple in 1997, when the company was 90 days from bankruptcy. Steve Jobs drew a two-by-two grid (consumer and pro, desktop and portable) and decided Apple would build products only for those four quadrants. The iMac, iPod, and iPhone followed.
Steering an entire organization toward long-term goals is different from what AI agents do today. CEO-Bench is a first attempt at measuring this “steering intelligence.”
An AI CEO for a fictional software company
In the test, an agent runs a made-up subscription software company called NovaMind. It starts with zero customers and one million dollars in the bank. Performance is measured by remaining cash at the end. If the balance drops below zero even once, the company is bankrupt and the simulation ends.
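The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's code: the function name `run_company` and the idea of feeding it a list of daily net cash flows are assumptions made for the example. Only the starting capital and the "below zero even once" rule come from the article.

```python
from typing import Iterable

STARTING_CASH = 1_000_000  # starting capital stated in the article


def run_company(daily_net_cash_flows: Iterable[float],
                starting_cash: float = STARTING_CASH) -> tuple[float, int, bool]:
    """Apply daily net cash flows and stop the first time cash goes negative.

    Returns (final_cash, days_completed, bankrupt).
    `run_company` is a hypothetical helper, not part of CEO-Bench.
    """
    cash = starting_cash
    days = 0
    for flow in daily_net_cash_flows:
        cash += flow
        days += 1
        if cash < 0:  # dropping below zero even once ends the run
            return cash, days, True
    return cash, days, False


if __name__ == "__main__":
    # The large payoff on day 3 never arrives: the run ends on day 2.
    print(run_company([-600_000, -500_000, 5_000_000]))
```

The example shows why the rule is unforgiving: a later windfall does not help if the balance dips below zero first.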
The agent controls the company through a Python API with 34 tools and a database of 19 tables. Instead of just issuing individual commands, it writes its own code, queries the database with SQL, and builds custom workflows from the results. This puts it in front of the same challenges a human CEO would face.
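To give a sense of what "querying the database with SQL" might look like, here is a small sketch using Python's built-in `sqlite3` module. The table `subscriptions`, its columns, and the sample rows are invented for illustration. The article does not describe the benchmark's actual 19-table schema or its 34 tools.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical subscriptions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subscriptions ("
        "customer_id INTEGER, segment TEXT, monthly_price REAL, cancelled INTEGER)"
    )
    conn.executemany(
        "INSERT INTO subscriptions VALUES (?, ?, ?, ?)",
        [
            (1, "smb", 50.0, 0),
            (2, "smb", 50.0, 1),
            (3, "enterprise", 500.0, 0),
            (4, "enterprise", 500.0, 0),
        ],
    )
    return conn


def segment_summary(conn: sqlite3.Connection) -> list[tuple]:
    """Per segment: customer count, churn rate, and active monthly revenue."""
    query = """
        SELECT segment,
               COUNT(*)                            AS customers,
               AVG(cancelled)                      AS churn_rate,
               SUM(monthly_price * (1 - cancelled)) AS active_mrr
        FROM subscriptions
        GROUP BY segment
        ORDER BY segment
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in segment_summary(build_demo_db()):
        print(row)
```

An agent could feed a summary like this into its next pricing or advertising decision instead of reacting to single events.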
There is a lot to decide: pricing and tiers, ad spend across channels, product quality and R&D, infrastructure capacity and customer support, plus multi-round negotiations with enterprise clients. On top of that, there is a simulated social network where the agent can read complaints, competitor news, and economic trends, and publish posts of its own.
Delayed feedback and hidden variables make the test hard
What makes the task hard is time and uncertainty. Decisions play out on realistic business timelines. Revenue arrives only on billing dates. R&D projects take days to weeks. Mistakes often surface only later, through churn or a damaged reputation. Costs, by contrast, hit right away. The agent has to spend money whose payoff might not show up for weeks.
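The gap between immediate costs and delayed revenue can be made concrete with a toy cash trajectory. The numbers and the function `cash_trajectory` are assumptions chosen for illustration, not values from the benchmark. The sketch only shows the mechanism: costs are paid daily, while earned revenue is collected on billing days.

```python
def cash_trajectory(days: int, daily_cost: float, daily_revenue_earned: float,
                    billing_interval: int, starting_cash: float) -> list[float]:
    """Return the end-of-day cash balance for each day.

    Costs are deducted every day. Revenue accrues daily but only reaches
    the bank account every `billing_interval` days (hypothetical model).
    """
    cash = starting_cash
    accrued = 0.0
    history = []
    for day in range(1, days + 1):
        cash -= daily_cost               # costs hit right away
        accrued += daily_revenue_earned  # revenue is earned but not yet paid
        if day % billing_interval == 0:  # billing date: collect what accrued
            cash += accrued
            accrued = 0.0
        history.append(cash)
    return history


if __name__ == "__main__":
    history = cash_trajectory(days=60, daily_cost=10_000,
                              daily_revenue_earned=12_000,
                              billing_interval=30, starting_cash=250_000)
    print("lowest balance:", min(history))
    print("final balance:", history[-1])
```

In this toy setup the company earns more than it spends every day, yet its balance is negative on day 29, one day before the first billing date. Under the benchmark's bankruptcy rule, that dip alone would end the run.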
Much of the company’s state stays hidden. The agent cannot directly see customer satisfaction, willingness to pay, or minimum quality expectations. It has to piece these together from noisy signals like cancellations, support tickets, or reactions on the social network. The simulation models 26 customer segments and individual customers, each with their own budgets, price sensitivities, and expectations.
The world keeps changing too. Competitors periodically raise customer quality expectations. Preferences shift over time. A simulated business cycle affects demand and willingness to pay, so the agent has to keep adjusting.
The researchers deliberately chose fixed, transparent rules rather than a language model as referee. They wanted to avoid a weakness they see in Vending-Bench, a test with a simulated vending machine. There, an AI-simulated supplier can reward an agent for unrealistic verbal promises.
Most models go bankrupt
Most of the fourteen models tested fail the task. Nearly all can generate valid commands and database queries, but few can maintain a coherent strategy over time. Many go bankrupt before the simulation ends.
Only three models finish their best run above the starting capital of one million dollars: Claude Fable 5 at $47.15 million, Claude Opus 4.8 at $27.8 million, and GPT-5.5 at $21.3 million. Claude Fable 5 is the only model that lands above starting capital in more than one run.
There is a caveat, though. One Fable 5 run was aborted because the model refused to continue. In the other two, some requests fell back to Opus 4.8. GPT-5.5 went bankrupt in two of its three runs.
The most telling comparison is with a simple rule-based heuristic that never calls a language model at all. It sets fixed prices, quotas, and tiers, focuses advertising and targeted development on a small set of customer segments, and adjusts capacity based on recent usage. This heuristic reaches $15.76 million, beating every model except Fable 5, Opus 4.8, and GPT-5.5.
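The article does not spell out the heuristic's rules, but the "adjusts capacity based on recent usage" part could look roughly like the sketch below. The function `target_capacity`, the seven-day window, and the 20 percent headroom are assumptions for illustration only.

```python
import math


def target_capacity(recent_usage: list[int], headroom_pct: int = 20,
                    window: int = 7) -> int:
    """Provision the recent average usage plus a fixed safety margin.

    Hypothetical rule: average the last `window` usage readings and add
    `headroom_pct` percent, rounding up to a whole capacity unit.
    """
    if not recent_usage:
        raise ValueError("need at least one usage reading")
    recent = recent_usage[-window:]
    # Integer arithmetic before the single division avoids float drift.
    return math.ceil(sum(recent) * (100 + headroom_pct) / (100 * len(recent)))


if __name__ == "__main__":
    usage = [100, 100, 100, 200, 200, 200, 200, 200, 200, 200]
    print(target_capacity(usage))
```

A rule like this never reasons about strategy. It simply reacts to the last few readings, which is what makes its strong result against most language models notable.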
The researchers also roughly estimate the upper bound of achievable final cash at around $2.2 billion. Even the best agents fall far short. The test is nowhere near maxed out, the authors say.
Exploration beats caution
Analyzing the decision trajectories reveals clear behavioural differences. GPT-5.5 and Claude Opus 4.8 keep trying new strategies as conditions change, whether that means ramping up customer acquisition, adjusting tiers, or shifting support and R&D budgets. Claude Opus 4.7, by contrast, mostly responds to setbacks by cutting costs and preserving cash. This passive approach lets the model survive to the end but prevents it from turning a profit.
Interestingly, Opus 4.8 and GPT-5.5 reach similar final results through very different paths. Opus 4.8 acquires more customers early on but drops to zero customers mid-simulation. GPT-5.5 holds its customer base throughout. Both write surprisingly sophisticated code. Opus 4.8 builds its own internal simulation that models customer cohorts to predict future cash flow. GPT-5.5 digs through negotiation history in the database to uncover hidden customer preferences.
The researchers measure four capabilities that correlate with success:
- uncovering hidden information, like which ad channel works best for a given customer segment
- predicting the future, measured by error in four-week cash forecasts
- adapting quickly to change, measured by how fast a model notices a competitor’s move
- and planning ahead, measured partly by how often if-then scenarios appear in the agent’s notes
On all four points, Opus 4.8 and GPT-5.5 score above the average of the other models.
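As an illustration of the second capability, forecast error can be scored by comparing each predicted cash balance with the balance actually observed four weeks later. The article does not say which error measure the researchers use. The sketch below assumes mean absolute percentage error, and the function name is hypothetical.

```python
def mean_abs_pct_error(forecasts: list[float], actuals: list[float]) -> float:
    """Mean absolute percentage error between forecasts and outcomes.

    Assumed metric for illustration; the benchmark may score forecasts
    differently.
    """
    if len(forecasts) != len(actuals) or not forecasts:
        raise ValueError("need two non-empty lists of equal length")
    if any(a == 0 for a in actuals):
        raise ValueError("percentage error is undefined for an actual of zero")
    errors = [abs(f - a) / abs(a) for f, a in zip(forecasts, actuals)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Two forecasts, each off by 10 percent in opposite directions.
    print(mean_abs_pct_error([110.0, 90.0], [100.0, 100.0]))
```

A lower value means the agent's picture of its future cash position was closer to what happened.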
The tool environment matters too
Another finding concerns the software environment agents use to act. The researchers also tested Claude Opus 4.7 with Claude Code and GPT-5.5 with Codex, two popular coding assistants. In both cases, the agents acted far less often and performed worse. The researchers suspect the system prompts in these tools, which are tuned for software development, are the cause.
Shortening the time horizon does not solve the problem either. When the simulation is compressed to 50 days, only GPT-5.5 manages to finish with a profit. Most models, the researchers conclude, remain weak at coordinating decisions even toward a short-term goal.
The authors acknowledge limits in their setup. The product is represented by a single quality score because they found no reliable way to evaluate qualitative product changes. Compliance, security, and fundraising are left out to keep each run economically feasible. Still, CEO-Bench exposes a gap between the local tool competence of today’s models and the ability to connect actions over long time horizons into a coherent strategy.
What it means
For people building businesses or managing teams, the lesson is clear. Current AI tools are good at executing specific instructions. They struggle to hold a long-term strategy when money is tight and information is incomplete. A simple, rigid rule set outperforms most advanced models at this task. Until AI can plan across long horizons without constant human oversight, it will remain a tool for execution rather than a replacement for strategic leadership.




