AI won’t become a real coworker until it stops answering and starts finishing tasks


By AI Maestro, June 28, 2026

A survey paper argues that AI systems won’t become reliable coworkers until they finish entire tasks in persistent work environments instead of just generating answers. The key lies in reusable “skills.”

A research team from Tencent’s Youtu Lab and several Chinese universities maps the shift “from chatbot to digital colleague” along two dimensions in a new survey paper: the cognitive core and tool-assisted task execution.

The central question is no longer how a model produces a better answer, but how it reliably turns intent into finished work, the researchers say. The goal shifts from reactive Q&A to delegated task execution.

From fast answers to slow thinking

In the chatbot era, models mostly generated text fast. They stored language patterns and facts in their parameters, then wrote answers in one pass, token by token, following the most likely continuation without checking intermediate steps or searching for solutions.

The thinking-LLM era, initiated by OpenAI’s o1 and DeepSeek-R1, pours more compute into the moment of answering. These models produce long chains of thought, check intermediate steps, and learn through reinforcement learning to search and self-correct. Only verifiably correct solutions get rewarded. The researchers frame this as a shift from fast, intuitive “System 1” thinking to slow, deliberate “System 2” reasoning, borrowing psychologist Daniel Kahneman’s framework.
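
As a rough illustration of what “only verifiably correct solutions get rewarded” can mean, the sketch below scores a sampled chain of thought solely on whether its final answer matches a known reference. The `Answer:` marker, the function names, and the sample completions are assumptions made for this example; they are not taken from the survey or from any lab’s training code.

```python
import re

def extract_final_answer(completion: str):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, reference: str) -> float:
    """1.0 only when the final answer matches the reference exactly, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0

# Hypothetical sampled completions for the prompt "What is 17 * 3 + 4?"
samples = [
    "17 * 3 = 51, then 51 + 4 = 55. Answer: 55",
    "17 * 3 = 41, then 41 + 4 = 45. Answer: 45",
    "I think it is probably around fifty.",
]
rewards = [verifiable_reward(s, "55") for s in samples]
print(rewards)
```

The reasoning text itself is never graded here. A fluent but wrong chain and a completion with no checkable answer both earn zero, which is the pressure that pushes a model toward searching and self-correcting.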

From tool calls to work environments

First-generation agents could call APIs, write code, and browse the web, but they remained fragile. The researchers identify four structural bottlenecks: agents perceived their environment only in fragments, tool calls left no lasting state, unexpected behavior broke them, and they rarely finished tasks.

In the OpenClaw era, the environment itself becomes persistent: files, sessions, logs, browsers, permissions, and skills all survive across the entire workflow. The paper cites OpenHands and SWE-agent, both of which embed agents in controlled development environments.

The paper’s core argument is that combining workspace and skill is what enables the real performance leap. A workspace provides state, storage, and consequences, while a skill packages operational knowledge into reusable bundles. Anthropic’s Agent Skills already formalize this pattern as folders containing a SKILL.md file with instructions, scripts, and resources.
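
To make the folder pattern concrete, here is a minimal sketch of loading one skill from disk. The `name` and `description` frontmatter fields follow the SKILL.md format Anthropic has published; the hand-rolled parser, the `Skill` class, and the `release-notes` example are assumptions for illustration only, not Anthropic’s loader.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile

@dataclass
class Skill:
    name: str
    description: str
    instructions: str
    resources: list  # other files bundled in the folder (scripts, templates)

def load_skill(folder: Path) -> Skill:
    """Read SKILL.md from a skill folder and split metadata from instructions."""
    text = (folder / "SKILL.md").read_text(encoding="utf-8")
    # Frontmatter sits between the first two '---' delimiters.
    _, header, body = text.split("---", 2)
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    resources = sorted(p.name for p in folder.iterdir() if p.name != "SKILL.md")
    return Skill(meta["name"], meta["description"], body.strip(), resources)

# Build a throwaway skill folder so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "release-notes"
    folder.mkdir()
    (folder / "SKILL.md").write_text(
        "---\nname: release-notes\n"
        "description: Draft release notes from merged pull requests\n---\n"
        "1. Run collect_prs.py.\n2. Group changes by label.\n",
        encoding="utf-8",
    )
    (folder / "collect_prs.py").write_text("print('stub')\n", encoding="utf-8")
    skill = load_skill(folder)

print(skill.name, skill.resources)
```

Keeping the short description separate from the full instructions lets an agent scan many skills cheaply and read the body only for the one it decides to use.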

According to the researchers, skills aren’t prompts, and they aren’t traditional tools either. They sit between the model’s reasoning and workspace execution, letting organizations capture know-how in modular, testable, portable form. But the authors also warn that reusable procedures can go stale, overfit to specific workflows, or become attack vectors.

Why training and evaluation need to change

The shift also transforms how these systems are trained and evaluated. Chatbots learned from instruction-response pairs and were graded on answer accuracy. Workspace-based systems learn from state-action-observation trajectories instead. Success is no longer about plausible responses, the researchers argue, but about task closure: whether the system brings the target environment to a verifiable end state.
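
The difference between grading an answer and grading task closure can be sketched in a few lines. In this toy example the workspace is a dictionary standing in for a file system, each step is logged as a state-action-observation record, and success is decided only by inspecting the end state. All class and function names are hypothetical and do not come from the paper or from any benchmark harness.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    state: dict        # workspace snapshot before acting
    action: str        # what the agent did
    observation: str   # what the environment reported back

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

def run_task(workspace: dict, actions: list) -> Trajectory:
    """Apply (key, value) write actions to a toy workspace, logging each step."""
    traj = Trajectory()
    for key, value in actions:
        before = dict(workspace)
        workspace[key] = value
        traj.steps.append(Step(before, f"write {key}", f"{key} now holds {value!r}"))
    return traj

def task_closed(workspace: dict, expected_end_state: dict) -> bool:
    """Task closure: every expected key holds the expected value."""
    return all(workspace.get(k) == v for k, v in expected_end_state.items())

workspace = {"config.json": "{}"}
goal = {"config.json": '{"debug": false}', "CHANGELOG.md": "Disabled debug mode"}

trajectory = run_task(workspace, [("config.json", '{"debug": false}')])
partial = task_closed(workspace, goal)    # config edited, changelog still missing
trajectory.steps += run_task(workspace, [("CHANGELOG.md", "Disabled debug mode")]).steps
complete = task_closed(workspace, goal)
print(len(trajectory.steps), partial, complete)
```

An agent that stops after the first write may describe its work convincingly, yet the end-state check still fails. The logged trajectory is what a workspace-based system would learn from.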

Benchmarks like SWE-bench, OSWorld, and WebArena demand reproducible starting states, executable tools, trajectory logs, and end-state checks. GPT-4 initially completed just 14 percent of WebArena tasks, showing how far realistic web environments are from static Q&A scenarios.

Security becomes an operational problem

Persistent workspaces also expand the attack surface. Agents hold credentials, local files, identity tokens, and communication channels. Projects like OpenClaw PRISM and ClawGuard are trying to establish permissions, provenance tracking, and audit logs as runtime safeguards. Data sovereignty matters just as much, the authors argue, since workspace agents observe sensitive repos, internal documents, and intermediate results that could later become memories, skills, or training data.
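
A runtime safeguard of this kind can be pictured as a gate that every tool call must pass, with each decision written to an audit log. The sketch below is a generic illustration under assumed names (`PermissionGate`, the tool names, the paths); it does not describe how OpenClaw PRISM or ClawGuard actually work, and its simple prefix matching would need hardening (path normalization, for instance) before real use.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PermissionGate:
    allowed: set                                  # (tool, resource_prefix) pairs
    audit_log: list = field(default_factory=list)

    def check(self, agent: str, tool: str, resource: str) -> bool:
        """Allow the call only if a matching grant exists; log every decision."""
        ok = any(tool == t and resource.startswith(prefix)
                 for t, prefix in self.allowed)
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": agent,
            "tool": tool,
            "resource": resource,
            "decision": "allow" if ok else "deny",
        })
        return ok

gate = PermissionGate({("read_file", "repo/"), ("write_file", "repo/docs/")})
decisions = [
    gate.check("agent-1", "read_file", "repo/src/main.py"),
    gate.check("agent-1", "write_file", "repo/src/main.py"),
    gate.check("agent-1", "read_file", "/home/user/.ssh/id_rsa"),
]
print(decisions, [entry["decision"] for entry in gate.audit_log])
```

Denied calls are logged as well as allowed ones, so the log records what the agent attempted and not only what it was permitted to do.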

The authors acknowledge the workspace-plus-skill combination isn’t a complete solution. Skills can overfit, and workspaces fill up with stale files and broken artifacts. Reliable deployment, the researchers argue, requires skill lifecycle management, workspace hygiene, permission controls, sandboxing, rollback, and trajectory-based evaluation. Reuse without governance just creates new failure modes, they warn.

A recent survey by Meta, Stanford, and the University of Illinois Urbana-Champaign made a related argument from a different angle: autonomous system performance depends less on the base model than on the software layer around it. This “harness” bundles tools, sandboxed execution environments, and verification mechanisms.

The “skill” half of this argument gets complicated in practice, according to a recent Vercel evaluation. It found that coding agents didn’t even call a provided skill system 56 percent of the time, and that the skill system topped out at 79 percent success, while a compressed documentation index embedded in an AGENTS.md file hit 100 percent. Passive, always-present context beat active skill retrieval, tilting the balance toward the workspace.
