When robots write their own code, what does that mean for makers and artists?
Artists and makers relying on automation are witnessing a shift where the tools themselves become self-improving. Instead of manually tuning parameters or curating datasets, a fleet of robots can now generate its own training strategies and evaluation metrics through AI coding agents. This moves the bottleneck from human oversight to autonomous iteration, potentially freeing creators to focus on high-level design while the machines handle the repetitive refinement of physical tasks.
Researchers from Nvidia, Carnegie Mellon University, and UC Berkeley have demonstrated this with a project called ENPIRE. The system uses AI coding agents to teach robots dexterous grasping in the real world, achieving up to 99 percent success on complex manipulation tasks without constant human intervention.
Breaking the manual loop
Dexterous manipulation remains a significant hurdle for robotics. Traditionally, humans must oversee every stage: gathering training data, resetting the physical scene after every failed attempt, and tweaking algorithms. This manual overhead slows progress dramatically. ENPIRE aims to bypass this by handing the entire workflow to AI coding agents.
The core mechanism is a feedback loop running directly on physical hardware: reset the workspace, execute a strategy, verify the outcome, and refine the next attempt.
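The four steps of that loop can be sketched in a few lines of Python. This is only an illustration of the reset, execute, verify, refine cycle, not ENPIRE's actual code: the function names, the `Attempt` record, and the toy "grip width" task are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One pass through the loop: the parameters tried and whether they worked."""
    params: dict
    success: bool


def run_feedback_loop(reset, execute, verify, refine, params, max_attempts=20):
    """Repeat reset -> execute -> verify -> refine until success or the attempt budget runs out."""
    history = []
    for _ in range(max_attempts):
        reset()                          # put the workspace back in its start state
        outcome = execute(params)        # run the current strategy
        success = verify(outcome)        # automated success check
        history.append(Attempt(dict(params), success))
        if success:
            break
        params = refine(params, history)  # propose the next attempt
    return history


def demo():
    """Toy stand-in for a robot: narrow a hypothetical grip width until it is close to a target."""
    target_width = 4.0
    return run_feedback_loop(
        reset=lambda: None,
        execute=lambda p: p["grip_width"],
        verify=lambda width: abs(width - target_width) < 0.5,
        refine=lambda p, hist: {"grip_width": p["grip_width"] - 1.0},
        params={"grip_width": 10.0},
    )


if __name__ == "__main__":
    for attempt in demo():
        print(attempt)
```

On real hardware each of the four callables would be far more involved, but the control flow stays the same: no human sits between one attempt and the next.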
The agent writes its own tests
ENPIRE operates in two distinct phases. Initially, the agent configures a working environment with minimal human input, establishing safety boundaries, automatic resets, and automated success checks. Rather than requiring a human to judge every single attempt, the agent writes its own reward function to distinguish success from failure. It requires only a few minutes of example video footage showing both successful and failed attempts to bootstrap this process.
For instance, when tasked with pin insertion, the agent developed a specific check combining visual alignment, gripper height, and estimated force. For cutting a cable tie, it merged two camera angles to eliminate false positives and pushed reaction times below 150 milliseconds. These evaluation tools are built once and reused indefinitely.
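A success check of this kind might look roughly like the sketch below. The article only says which signals were combined, so everything concrete here is an assumption: the field names, the threshold values, and the choice to require that both camera views agree are illustrative, not the checks the agent actually wrote.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    alignment_error_mm: float  # visual offset between pin and hole (hypothetical unit)
    gripper_height_mm: float   # gripper height above the fixture (hypothetical unit)
    estimated_force_n: float   # estimated contact force (hypothetical unit)


def pin_inserted(obs, max_align_mm=1.0, max_height_mm=5.0, max_force_n=10.0):
    """Count an attempt as a success only if all three signals are within bounds.

    The thresholds are placeholders chosen for the example.
    """
    return (
        obs.alignment_error_mm <= max_align_mm
        and obs.gripper_height_mm <= max_height_mm
        and obs.estimated_force_n <= max_force_n
    )


def tie_cut(camera_a_sees_cut, camera_b_sees_cut):
    """Report a cut only when both camera views agree.

    Requiring agreement is one simple way to suppress false positives,
    at the price of possibly missing a cut that only one camera sees.
    """
    return camera_a_sees_cut and camera_b_sees_cut


if __name__ == "__main__":
    print(pin_inserted(Observation(0.5, 3.0, 4.0)))
    print(tie_cut(True, False))
```

The point of combining signals is that any single one can be fooled: a pin can look aligned on camera while still sitting above the hole, which is why a height reading is checked alongside it.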
In the second phase, the agent operates entirely independently. It reads research papers, formulates hypotheses, and edits the training code directly. It selects methods such as behavior cloning, where the strategy mimics human demonstrations, or reinforcement learning, where the strategy improves through trial and error. The choice of method is driven by real-world success signals.
Scaling via Git
The system scales to a fleet of eight dual-arm YAM robot stations. Each station possesses its own hardware, computer, and coding agent. These agents test different hypotheses simultaneously and share results exclusively through Git, the standard version control tool for software. They adopt successful training recipes from peers and discard ineffective ideas autonomously. A breakthrough discovered at one station propagates across the entire fleet.
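How an agent might decide whether to adopt a peer's recipe can be sketched as follows. This is a hypothetical illustration: the `Result` record, the minimum-trials rule, and the recipe names are assumptions, and the sketch leaves out Git itself. In a setup like the one described, such records would presumably live in files committed to a shared repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    station: str    # which station produced the result
    recipe: str     # identifier of the training recipe (hypothetical names)
    successes: int
    trials: int

    @property
    def success_rate(self):
        return self.successes / self.trials if self.trials else 0.0


def best_recipe(results, min_trials=10):
    """Pick the best-performing shared result, ignoring ones with too few trials."""
    eligible = [r for r in results if r.trials >= min_trials]
    if not eligible:
        return None
    # Prefer the higher success rate; break ties by the larger number of trials.
    return max(eligible, key=lambda r: (r.success_rate, r.trials))


def next_recipe(own, shared, min_trials=10):
    """Adopt a peer's recipe only if it clearly beats the station's own result."""
    best = best_recipe(shared, min_trials)
    if best is not None and best.success_rate > own.success_rate:
        return best.recipe
    return own.recipe


if __name__ == "__main__":
    shared = [
        Result("station-1", "bc_v1", 6, 20),
        Result("station-2", "rl_v3", 18, 20),
        Result("station-3", "rl_v4", 3, 3),  # perfect score, but too few trials to trust
    ]
    print(next_recipe(shared[0], shared))
```

The minimum-trials filter reflects a general caution rather than anything the study specifies: a recipe that succeeded three times out of three has not yet earned a place across the whole fleet.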
The study reports up to 99 percent success on demanding tasks, including the Push-T test (in which a robot must slide a T-shaped block into a target position and orientation), sorting pins into a box, and cutting a cable tie. For pin insertion, the strategy converged to 100 percent success faster than comparable human-in-the-loop methods.
Scaling yields significant time savings. On the Push-T test, increasing the fleet from one to eight agents reduced the time to full success from approximately five hours to two. For pin insertion, the duration dropped from over 90 minutes to roughly 40 minutes. The researchers tested three current coding agents: Codex with GPT-5.5, Claude Code with Opus 4.7, and Kimi Code with Kimi K2.6. Codex performed best in most scenarios.
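The reported times translate into speedup figures with a little arithmetic. The sketch below uses the article's rounded numbers (five hours to two, 90 minutes to 40), so the results are approximate. The "parallel efficiency" figure, speedup divided by fleet size, is a standard way to express how much of the added hardware turned into faster results; it is computed here for illustration and is not a metric the study is said to report.

```python
def speedup(t_single, t_fleet):
    """How many times faster the fleet reached the goal than a single station."""
    return t_single / t_fleet


def time_reduction(t_single, t_fleet):
    """Fraction of wall-clock time saved by using the fleet."""
    return 1 - t_fleet / t_single


def parallel_efficiency(t_single, t_fleet, n_stations):
    """Speedup per station; 1.0 would mean perfectly linear scaling."""
    return speedup(t_single, t_fleet) / n_stations


if __name__ == "__main__":
    # (task, single-station time, eight-station time), in minutes, from the article's rounded figures
    tasks = [("Push-T", 300, 120), ("Pin insertion", 90, 40)]
    for name, t1, t8 in tasks:
        print(
            f"{name}: speedup {speedup(t1, t8):.2f}x, "
            f"time saved {time_reduction(t1, t8):.0%}, "
            f"efficiency {parallel_efficiency(t1, t8, 8):.0%}"
        )
```

With these inputs, Push-T comes out at a 2.5x speedup (60 percent less time) and pin insertion at 2.25x (about 56 percent less time). Eight times the hardware therefore bought well under eight times the speed.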
Simulation is still not the real world
The results highlight that the physical world remains far more difficult than simulated environments. On the Push-T test, all three agents solved the task in simulation, yet two out of three failed in the real environment. The researchers attribute this to unpredictable and variable conditions, such as robot dynamics, friction, and object movement. However, in the RoboCasa simulation, ENPIRE outperformed both an end-to-end vision-language-action model (GR00T) and a tool-based approach without autoresearch (CaP-X).
To measure efficiency, the researchers propose two metrics: Mean Robot Utilization (MRU), which tracks actual working time, and Mean Token Utilization (MTU), which counts language model usage per minute. Learned skills also transfer; experience gained from pin insertion helped the agents slot GPUs into a motherboard using the robot arms.
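One plausible reading of the two metrics is sketched below. The study's exact definitions are not given in this article, so both formulas are assumptions: MRU is taken as the share of wall-clock time each robot spends executing, averaged over the fleet, and MTU as tokens consumed per minute of wall-clock time.

```python
def mean_robot_utilization(active_minutes_per_robot, wall_clock_minutes):
    """Average fraction of the run during which each robot was actually working.

    Assumed definition: per-robot active time divided by wall-clock time,
    averaged across robots.
    """
    fractions = [active / wall_clock_minutes for active in active_minutes_per_robot]
    return sum(fractions) / len(fractions)


def mean_token_utilization(total_tokens, wall_clock_minutes):
    """Language-model tokens consumed per minute of wall-clock time (assumed definition)."""
    return total_tokens / wall_clock_minutes


if __name__ == "__main__":
    # Made-up numbers for illustration: two robots over a one-hour run.
    print(mean_robot_utilization([30, 15], 60))
    print(mean_token_utilization(120_000, 60))
```

Read together, the two numbers show where a run spends its budget: a low MRU paired with a high MTU would mean the agents are mostly thinking while the robots sit idle.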
The study is clear about its limitations. Robots and compute are not fully utilized because agents spend considerable time reading logs, writing code, and waiting. As the fleet size increases, per-robot utilization drops because agents spend more time summarizing each other’s results. Token costs also grow faster than performance gains: larger fleets reach the goal sooner but consume a significantly higher compute budget. Despite this, the researchers view ENPIRE as a practical path toward robots capable of self-improvement in the real world.
Key takeaways
- AI coding agents can autonomously build evaluation tools and training strategies, reducing the need for human oversight in robot manipulation tasks.
- Scaling from one to eight robot stations, coordinated through Git, cut the time to full success by up to 60 percent (on Push-T, from roughly five hours to two).
- While performance in simulation is high, real-world deployment still faces significant hurdles due to unpredictable physical dynamics like friction and object movement.