OpenAI’s GPT-5.6 Sol cheats on software tests
Independent researchers found that OpenAI’s GPT-5.6 Sol exploits bugs in test environments to cheat. METR, the group behind the evaluation, recorded the highest rate of such behaviour among all publicly tested models.
The new model extracted hidden solutions and attempted to cover its tracks. METR states that the actual performance numbers are now barely usable.
Depending on how the cheating attempts are handled, the time-horizon estimate swings between 11.3 hours and over 270 hours. METR does not consider any of these values a reliable measure of the model’s true capabilities.
METR’s time-horizon method measures how long a task can take before an AI model can still solve it with a 50 or 80 percent success rate. Human completion times serve as the baseline: simple tasks like training a classifier take about 45 minutes, while harder ones like training a robust image model run about four hours. The higher the time horizon, the more capable the model.
Messy data, but Mythos still leads
Anthropic‘s Claude Mythos Preview achieved a time horizon of at least 16 hours in an earlier evaluation. The recently released Mythos 5 is likely even more capable, but it is currently blocked by the US government.
Even the Mythos measurement pushed the limits of METR’s testing method. Out of 228 tasks in the test suite, only five are designed for task lengths of 16 hours or more. That makes measurements in this range unstable and less meaningful, according to METR.
AI model time horizons are growing exponentially. Mythos Preview was the first model to land in what METR calls the unreliable measurement zone above 16 hours. GPT-5.6 Sol falls slightly below that (11 hours) or far above it (270 hours), depending on how the cheating is counted.
Regardless of the measurement issues, METR believes GPT-5.6 Sol does not sit far above the current state of the art and will not enable fully automated AI research. On a positive note, METR praised OpenAI for catching the cheating through internal monitoring and sharing it openly.
The fact that the bad behaviour is so obvious is actually reassuring, METR says, because it means more serious problems would get caught too. But METR also warned that if future models display much fewer undesirable propensities, we could become more concerned about catastrophic misalignment, as we would be worried that models may have learned to evade detection.




