An AI model programmed nonstop for 19 days on a single MirrorCode task that cost $2,600 to run

By AI Maestro, June 26, 2026
A single run of the MirrorCode benchmark cost $2,600 and kept an AI model working nonstop for 19 days.

Epoch AI and METR have released this new test to see whether artificial intelligence can recreate entire computer programs from scratch. The models must generate code without seeing the original source.

The benchmark targets 25 specific utilities. These range from Unix tools and data serializers to bioinformatics software, interpreters, static analysers, cryptography libraries, and compression tools. Every AI-generated solution must pass hidden end-to-end tests to prove it reproduces the original output exactly.
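The exact-output requirement can be pictured as a differential test: run the original program and the reimplementation on the same input and compare what they produce. The sketch below is only an illustration under stated assumptions. The article does not describe MirrorCode's actual harness, and the function names and the three tiny stand-in programs are hypothetical.

```python
import subprocess
import sys


def run_program(cmd, stdin_text):
    """Run a command with the given stdin; return (exit code, stdout)."""
    result = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=30
    )
    return result.returncode, result.stdout


def passes_case(reference_cmd, candidate_cmd, stdin_text):
    """A case passes only if exit code and stdout match exactly.

    Assumption: 'reproduces the original output exactly' is modelled here
    as byte-for-byte equality of stdout plus an identical exit code.
    """
    return run_program(reference_cmd, stdin_text) == run_program(
        candidate_cmd, stdin_text
    )


# Hypothetical stand-ins for a target program and two reimplementations.
REFERENCE = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper(), end='')"]
GOOD = [sys.executable, "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())"]
# Prints an extra trailing newline, so its output differs from the reference.
BAD = [sys.executable, "-c",
       "import sys; print(sys.stdin.read().upper())"]

if __name__ == "__main__":
    print(passes_case(REFERENCE, GOOD, "abc\n"))
    print(passes_case(REFERENCE, BAD, "abc\n"))
```

Even a single stray newline fails the comparison, which is what makes an exact-match criterion strict.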

Unlike other software engineering tests that limit spending to $1 or $10 per task, MirrorCode allows for much higher costs. The developers note that existing benchmarks often cap expenses even when a human would need weeks to complete the same work.

One model finishes a 16,000-line toolkit in 14 hours

Claude Opus 4.7 currently leads the leaderboard with a 56 percent solve rate. In a standout test, the model reimplemented gotree, a bioinformatics toolkit written in Go. This codebase contains roughly 16,000 lines and over 40 commands.

A human engineer working without AI assistance would need between two and 17 weeks to finish that job. Opus 4.7 completed the task in 14 hours for a total cost of $251.
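For a rough sense of scale, the two runs mentioned in this article can be converted to an hourly cost. This is back-of-the-envelope arithmetic, assuming the reported durations are continuous wall-clock time and the dollar figures are total run costs; the article does not break the costs down further.

```python
def cost_per_hour(total_cost_usd, hours):
    """Average cost per hour of a run, in US dollars."""
    return total_cost_usd / hours


# Figures quoted in the article; durations assumed to be wall-clock time.
long_run = cost_per_hour(2600, 19 * 24)   # the 19-day, $2,600 run
gotree_run = cost_per_hour(251, 14)       # the 14-hour, $251 gotree run

if __name__ == "__main__":
    print(f"19-day run: ${long_run:.2f} per hour")
    print(f"gotree run: ${gotree_run:.2f} per hour")
```

Under those assumptions the 19-day run averages about $5.70 per hour and the gotree run about $17.93 per hour.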

Other models performed well behind the leader. GPT-5.5 achieved a 44 percent solve rate, while Gemini 3.1 Pro Preview managed 32 percent. Even when models fail to fully reimplement a program, they typically pass 90 percent or more of the individual tests.
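The gap between a solve rate and a per-test pass rate is easy to see in a small sketch. It assumes a task counts as solved only when every one of its tests passes, which is how the article's wording reads but is not spelled out. The task names and results below are made up for illustration and are not benchmark data.

```python
def summarize(results):
    """Return (solve_rate, mean_test_pass_rate).

    results maps a task name to a non-empty list of booleans, one per test.
    Assumption: a task is 'solved' only if all of its tests pass.
    """
    solved = sum(1 for tests in results.values() if all(tests))
    solve_rate = solved / len(results)
    mean_pass_rate = sum(
        sum(tests) / len(tests) for tests in results.values()
    ) / len(results)
    return solve_rate, mean_pass_rate


# Hypothetical results: two tasks fully pass, two miss a single test each.
example = {
    "task_a": [True] * 10,
    "task_b": [True] * 9 + [False],
    "task_c": [True] * 19 + [False],
    "task_d": [True] * 20,
}

if __name__ == "__main__":
    solve_rate, mean_pass_rate = summarize(example)
    print(f"solve rate: {solve_rate:.0%}")
    print(f"mean test pass rate: {mean_pass_rate:.1%}")
```

In this made-up example only half the tasks count as solved even though more than 96 percent of individual tests pass, so a modest solve rate can coexist with a high pass rate.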

Complex tasks remain unsolved

Despite the progress, the largest tasks still defeat every model tested. The tasks are divided into three categories: small, medium, and large. Small programs such as uuid and parseqsv are reliably reimplemented by all models.

By comparison, leading models from a year ago would have scored only about 30 percent and were limited to simpler programs like a calendar utility.

Cost trends do not follow a clear pattern. GPT-5.5 costs three times as much as GPT-5 for the same tasks, while Claude Opus 4.7 costs about a third as much as Claude Opus 4.1.

What it means

For developers and engineers, this benchmark shows that AI can handle demanding, long-running programming tasks. A model can produce code that passes rigorous testing in a fraction of the time a human would need. However, the high cost of running these models means they are not yet ready to replace human engineers for complex, large-scale projects.

Epoch AI has open-sourced the scaffold and 22 of the 25 target programs. This covers 132 task instances across six programming languages. Three programs remain private for testing.

The researchers note an important caveat. Since MirrorCode uses open-source programs as targets, the models may have already seen the original code during training. Initial tests suggest the results were not dominated by memorisation, but the possibility that memorisation contributes to AI performance cannot be ruled out.
