AI agents can now complete 16 percent of freelance jobs at pro quality, up from 2.5 percent eight months ago

The Remote Labor Index shows AI agents completed 16.1 per cent of paid freelance projects at professional quality, a rise from 2.5 per cent eight months ago.

This benchmark tracks how often automation finishes real, commercially valuable work to a standard a paying client would accept. It covers 3D and CAD, architecture, graphic design, video, animation, audio, data analysis, and web apps. The study tested 240 projects worth $144,000, sourced from 358 verified freelancers. Human evaluators at the Center for AI Safety scored each result against a gold standard set by a paid professional. The Remote Labor Index was developed with Scale Labs.

The key metric is the automation rate: the share of projects where evaluators rate the AI’s deliverable at least as good as the human professional’s reference work.

Top automation rate jumps from 2.5 to 16.1 percent

When the benchmark launched, the best AI agent automated just 2.5 per cent of projects. Fable 5 now hits 16.1 per cent, the highest score ever recorded. That is roughly double Opus 4.8’s 8.3 per cent. GPT-5.5 comes in at 6.3 per cent. All three models beat every previously tested system. The prior leader, Opus 4.6 running on the Claude Cowork framework, sat at 4.17 per cent.

The top score has risen more than sixfold in about eight months, from 2.5 to 16.1 per cent. One caveat about Fable 5’s score: only 218 of the 240 projects could be evaluated before the U.S. government restricted access to the model. Even in the worst case, where Fable 5 is assumed to have failed every unevaluated project, its rate would still be 14.6 per cent, higher than any other model’s.
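The worst-case figure follows from simple arithmetic. The sketch below is a minimal reconstruction, assuming the reported 16.1 per cent was measured over the 218 evaluated projects. The article does not give the raw number of successful projects, so that count is inferred by rounding, and the names used are illustrative.

```python
# Worst-case bound on Fable 5's automation rate.
# Assumption: the reported 16.1% was measured over the 218 evaluated projects.
# The raw success count is not in the article; it is inferred here by rounding.

TOTAL_PROJECTS = 240
EVALUATED_PROJECTS = 218
REPORTED_RATE = 0.161


def automation_rate(successes: int, projects: int) -> float:
    """Share of projects rated at least as good as the human reference."""
    return successes / projects


# Success count implied by the reported rate (inferred, not reported).
inferred_successes = round(REPORTED_RATE * EVALUATED_PROJECTS)

# Worst case: every unevaluated project is counted as a failure.
worst_case = automation_rate(inferred_successes, TOTAL_PROJECTS)

print(f"inferred successes: {inferred_successes}")
print(f"worst-case automation rate: {worst_case:.1%}")
```

Under that assumption, the inferred count is 35 successful projects, and 35 out of 240 gives roughly 14.6 per cent, which matches the worst-case figure quoted above.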

Progress does not track neatly with release dates. On the full Scale Labs leaderboard, the newer Gemini 3 Pro lands near the bottom at just 1.25 per cent, behind much older systems.

Some examples show where even top models still fall short. On a ring design task, Fable 5’s result is clearly better than those of earlier models but still looks unprofessional on closer inspection. On an architecture project, GPT-5.5 faked an appealing render using an image generator while its actual 3D model remained flawed.

Human evaluators still can’t be replaced

The team tested whether expensive human evaluation could be replaced by AI judges. The answer was clear: AI judges rated the new models far too generously. For GPT-5.5, the AI evaluator’s score was almost three times too high. For Opus 4.8, about two and a half times. The automated judge did get the ranking order right, but the actual numbers were way off.

To fairly judge delivered work, you need to open the files in the right professional software, operate that software correctly, and form a judgment like a paying client would. That kind of hands-on software use is exactly what current AI agents are worst at. An AI judge runs into the same limits as the AI workers it is supposed to evaluate. GPT-5.5’s faked rendering is a good example: catching the trick requires opening the 3D model and inspecting the actual geometry.

To let the models show their full ability, the team runs them in the same tools developers use day to day, like Claude Code and Codex CLI. These were extended with the ability to operate graphical programs directly. The work environment is a virtual Linux machine loaded with over 30 professional apps, including Blender, GIMP, and Audacity. Each project gets up to 24 hours of compute time. The setup also uses a critic loop: a second AI agent reviews the output as critically as a demanding client, and the first agent then revises its work.
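The critic loop can be pictured as a simple revise-until-accepted cycle. The sketch below is illustrative only: `worker`, `critic`, the `Review` type, and the `max_rounds` cap are hypothetical names rather than the benchmark's actual harness, and the article does not say how many revision rounds are used. The stub agents stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    accepted: bool  # would a demanding client accept the deliverable?
    feedback: str   # critique handed back to the worker agent


def critic_loop(
    brief: str,
    worker: Callable[[str, str], str],     # (brief, feedback) -> deliverable
    critic: Callable[[str, str], Review],  # (brief, deliverable) -> Review
    max_rounds: int = 3,                   # hypothetical cap, not from the article
) -> str:
    """Produce a deliverable, then revise it until the critic accepts it."""
    deliverable = worker(brief, "")
    for _ in range(max_rounds):
        review = critic(brief, deliverable)
        if review.accepted:
            break
        # Feed the critique back so the worker can revise its output.
        deliverable = worker(brief, review.feedback)
    return deliverable


# Stub agents standing in for real model calls (purely illustrative).
def stub_worker(brief: str, feedback: str) -> str:
    return f"{brief} (revised)" if feedback else f"{brief} (draft)"


def stub_critic(brief: str, deliverable: str) -> Review:
    ok = deliverable.endswith("(revised)")
    return Review(accepted=ok, feedback="" if ok else "tighten the details")


print(critic_loop("logo design", stub_worker, stub_critic))
```

With these stubs, the first draft is rejected once and the revised version is accepted. In a real setup, both callables would wrap agent sessions, and the critic would need to inspect the actual files rather than a text summary.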

AI still fails to hit professional quality on most projects. None of the three Fable 5 results shown in the researchers’ blog post would pass as finished work. But the rise in automation rates within a single year is rapid, the authors say, and directly reflects how fast remote work automation is advancing.

What it means

Freelancers should expect a sharper divide between simple tasks and complex professional work. While automation can now handle a larger slice of basic jobs, high-end clients will still reject outputs that lack genuine software expertise. The rise in automation rates does not mean humans are becoming obsolete: even the best agent still misses the professional bar on more than 80 per cent of projects, although the share it clears is growing quickly.
