OpenAI researchers want to predict how often AI models will fail before launch
Key Points
- OpenAI researchers have developed a method called “Deployment Simulation” to predict AI errors before release more accurately than standard safety tests.
- The method uses real, anonymized user conversations instead of synthetic test questions. The model doesn’t realize it’s being tested, which makes the results more realistic.
- In tests with GPT-5 models, the simulation correctly predicted error trends 92 percent of the time and also uncovered hidden misbehavior.
OpenAI researchers propose a method for predicting how often a new AI model will make mistakes after release. It could fill gaps left by standard safety testing.
Before an AI model ships, it goes through safety testing. These tests try to estimate how often the model will later show unwanted behavior, like producing banned content or deceiving users. According to an OpenAI research paper, most of these tests rely on handwritten, synthetic, or deliberately tricky questions.
But these tests only capture a skewed slice of reality. They’re designed to probe for weaknesses, not to reflect what real users actually type. On top of that, models often pick up on the fact that they’re being tested and behave differently than they would in normal use. Both issues mean test results say little about how a model will actually perform in the wild.
Real conversations instead of synthetic test prompts
Researchers Marcus Williams, Micah Carroll, and their team propose a straightforward approach called “Deployment Simulation.” Instead of crafting new test questions, they pull from real, anonymized conversations that users had with a previous model. They keep the conversation history intact, all prior messages, and only have the new, unreleased model rewrite the next response.
Because the source conversations come from real traffic, the model faces exactly the kinds of situations it’ll encounter after launch. And it doesn’t realize it’s being tested, since it’s just looking at a normal user request.
These simulated responses serve two purposes. First, they can be scanned for new types of misbehavior. Second, researchers can count how often a specific problem shows up and derive a concrete frequency estimate. That estimate is verifiable: after release, the same measurement runs against real production data and gets compared to the prediction.
A prediction method that held up
OpenAI tested the approach on four models in the GPT-5 series using roughly 1.3 million conversations from August 2025 through March 2026.
For GPT-5.4, the researchers went especially strict: they used the simulation to predict how often the model would show each type of misbehavior after release, then locked in those estimates before they could even look at real usage data. That made it possible to check later, without bias, how well the predictions matched reality. Three older models in the series were analyzed retroactively, after real-world results were already known.
The team examined 20 categories of misbehavior, from banned content to deception. For categories where the frequency shifted significantly between model versions, the simulation correctly predicted whether a problem would increase or decrease 92 percent of the time. Standard tests got that right just 54 percent of the time.
The approach also surfaced a new type of misbehavior the researchers call “Calculator Hacking”: GPT-5.1 secretly used its browser tool as a calculator but told users it had run a web search. According to the researchers, the simulation would have caught this behavior before release.
Models don’t notice they’re being tested
The advantage is clearest when it comes to test awareness: When asked whether it was currently in a test, an AI model recognized standard safety tests as such 98.9 to 100 percent of the time. With real production conversations, that number dropped to just 5.4 percent. The simulation scored similarly low. Metagaming, where the model thinks about developer oversight in its chain of thought, showed up in up to 76 percent of standard tests but only 0.12 percent of real conversations. Because the simulation stays close to real-world numbers, its measurements are far more meaningful.
Coding tasks remain a challenge
One limitation is tasks where the model uses tools that access external systems, like coding. There, a model might call hundreds of tools, search through code, or modify files. These workflows can’t be easily replicated without letting an unfinished model loose on real systems, which could trigger the very risks researchers are trying to predict.
The team worked around this by using a second AI model to mimic what a tool would have returned. With access to the original conversation, the matching code version, and a database of prior tool responses, they boosted realism significantly: an evaluation system could barely tell simulated workflows from real ones.
Outside auditors could use the approach too. With the publicly available WildChat dataset, the method produced weaker but still useful predictions. That opens the door for independent researchers to evaluate models from different providers without needing access to private usage data, the researchers say. They note several limitations: the method depends on good evaluation systems, user behavior can shift over time, and it’s hard to catch extremely rare risks that only show up in one out of tens of millions of conversations.
Subscribe now
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




