NVIDIA AI has unveiled ASPIRE, a robotics framework that lets systems write their own control programs and keep the fixes for later. The team reports that, by reusing skills learned on other tasks, the system reaches about 31% zero-shot success on held-out LIBERO-Pro Long tasks, where prior methods saturate near 4%.
Standard robot programming remains difficult to scale. It demands manual orchestration of perception, physical contact dynamics, diverse configurations, and execution failures. Code-as-policy systems address this by letting language models compose these elements into executable programs. This approach makes robot behaviour inspectable, editable, and debuggable.
Existing robotic coding agents, however, work with impoverished feedback. They receive only coarse, task-level signals: a failed rollout reports that the task failed but not why. The root cause could lie in perception, motion planning, grasping, contact dynamics, or long-horizon coordination. These systems also discard their fixes once a task ends, so an agent solving its hundredth task has no more experience than it had at the first.
Researchers from NVIDIA, the University of Michigan, UIUC, UC Berkeley, and CMU have introduced ASPIRE (Agentic Skill Programming through Iterative Robot Exploration). It is a continual learning system that writes and refines robot control programs. It also distils validated fixes into a reusable, transferable skill library.
How ASPIRE works
ASPIRE runs an open-ended learning loop with three components. It uses a coordinator–actor architecture. A central coordinator manages the shared skill library and dispatches actor coding agents to tasks. Actors do not exchange full chat histories or raw trajectories. Only distilled skills move between them.
Closed-loop robot execution engine: This replaces coarse rollout feedback with per-primitive multimodal traces. For each perception, planning, and control call, it stores inputs, outputs, and return status. It also stores RGB keyframes, overlays, grasp candidates, object poses, and motion-planning results. The agent inspects only the calls implicated by a failure. It then localises the fault and validates a repair through re-execution.
Skill library: Reusable knowledge is rarely an entire task program. So the library stores heterogeneous fixes. These include localisation heuristics, perception prompts, grasping constraints, motion primitives, and debugging workflows. Each skill is compact in-context guidance. It holds a failure signature, a when-to-apply condition, a repair strategy, and often a code sketch. The coordinator admits only patterns that pass debug validation and API-policy checks.
Evolutionary search: Trace-guided debugging alone can collapse into local repair loops. The agent keeps patching the same failed strategy. To broaden exploration, ASPIRE proposes K candidate programs each round. Candidates condition on top-performing prior programs and their remaining failure traces. The next round explores distinct strategies rather than refining one solution.
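The round structure described above can be sketched as a small loop. This is only an illustration under stated assumptions: `evolve`, `toy_propose`, and `toy_evaluate` are hypothetical names, and the toy proposer stands in for the language-model call that would actually generate candidate programs from the parents and their failure traces.

```python
from typing import Callable

# (score, program text, remaining failure trace); a hypothetical representation
Scored = tuple[float, str, str]

def evolve(
    propose: Callable[[list[Scored], int], list[str]],
    evaluate: Callable[[str], tuple[float, str]],
    k: int,
    rounds: int,
    top_n: int = 2,
) -> Scored:
    """Each round, propose k candidates conditioned on the top_n prior programs."""
    population: list[Scored] = []
    for _ in range(rounds):
        # Parents are the best programs so far, with their failure traces attached.
        parents = sorted(population, key=lambda s: s[0], reverse=True)[:top_n]
        for program in propose(parents, k):
            score, failure = evaluate(program)
            population.append((score, program, failure))
    return max(population, key=lambda s: s[0])

def toy_propose(parents: list[Scored], k: int) -> list[str]:
    # Stand-in for an LLM call: start just past the best parent's strategy index.
    start = 0 if not parents else int(parents[0][1].split("-")[1]) + 1
    return [f"strategy-{i}" for i in range(start, start + k)]

def toy_evaluate(program: str) -> tuple[float, str]:
    # Stand-in for rollouts: later strategies score higher in this toy setup.
    idx = int(program.split("-")[1])
    score = min(1.0, idx / 5)
    return score, "" if score == 1.0 else "PLANNING_ERROR"

best = evolve(toy_propose, toy_evaluate, k=3, rounds=2)
print(best)
```

Because every round draws several candidates from the best prior programs rather than patching one program in place, a dead-end strategy does not monopolise the search budget.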
In simulation, the coding agent is Claude Code with Claude Opus 4.6 and a 1M-token context window. Programs are written in CaP-X, an open-source code-as-policy framework built on MuJoCo Playground. The agent cannot read simulator ground truth. Reading physics-engine state or asset files like .bddl, .xml, or .urdf is forbidden. The rule is simple. If a real robot with a camera could do it, it is allowed.
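One way to picture the file-access part of that rule is a suffix check on every path the agent asks to read. This is a hypothetical sketch, not ASPIRE's enforcement mechanism: `violates_asset_policy` and the example paths are invented, the suffix set contains only the three extensions the article names as examples, and the check does not cover reads of physics-engine state.

```python
from pathlib import PurePosixPath

# Suffixes the article names as examples of forbidden asset files.
FORBIDDEN_SUFFIXES = frozenset({".bddl", ".xml", ".urdf"})

def violates_asset_policy(path: str) -> bool:
    """Return True if the path has one of the forbidden asset suffixes."""
    return PurePosixPath(path).suffix.lower() in FORBIDDEN_SUFFIXES

# Hypothetical read requests from an agent
requested = ["frames/rgb_0001.png", "assets/kitchen_scene.bddl", "robots/arm.URDF"]
blocked = [p for p in requested if violates_asset_policy(p)]
print(blocked)
```

Camera frames pass the check because a real robot could capture them. Scene and robot description files are blocked because they expose information no physical sensor provides.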
A worked example: the Multi-Angle Approach skill
Consider a BEHAVIOR-1K task where a robot must pick up a radio near a table. Perception returns the radio pose, but repeated navigate_to_pose calls fail. The generated goal lies within about 20 centimetres of the table edge. That falls inside the table’s collision-avoidance buffer, and cuRobo returns PLANNING_ERROR.
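To make the failure concrete, the per-primitive trace for such an episode might look like the sketch below. The `PrimitiveCall` schema, the `detect_object` call name, and all coordinates are hypothetical. Only `navigate_to_pose` and the `PLANNING_ERROR` status come from the example above.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class PrimitiveCall:
    """One logged perception, planning, or control call (hypothetical schema)."""
    index: int
    kind: str                 # "perception", "planning", or "control"
    name: str
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    status: str               # return status, e.g. "OK" or "PLANNING_ERROR"
    artifacts: list[str] = field(default_factory=list)  # e.g. RGB keyframe paths

def implicated_calls(trace: list[PrimitiveCall]) -> list[PrimitiveCall]:
    """Keep only the calls that did not return OK, so they are inspected first."""
    return [call for call in trace if call.status != "OK"]

trace = [
    PrimitiveCall(0, "perception", "detect_object",
                  {"query": "radio"}, {"pose": [2.0, 1.0, 0.75]}, "OK",
                  artifacts=["keyframe_000.png"]),
    PrimitiveCall(1, "planning", "navigate_to_pose",
                  {"goal": [2.0, 0.8, 1.57]}, {}, "PLANNING_ERROR"),
]

failed = implicated_calls(trace)
print([(call.name, call.status) for call in failed])
```

Perception returned a pose and the planner rejected the goal, so the filter points at the navigation target rather than at detection or grasping.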
The agent reads the trace and localises the cause. The failure is target infeasibility, not perception or grasping. It then writes a repair that samples standoff poses around the radio.
    import numpy as np

    # radio_pos, safe_navigate() and dist_to() are provided by ASPIRE's robot API
    for angle_deg in [180, -90, 90, -45, 45]:
        angle = np.radians(angle_deg)
        tx = radio_pos[0] + 0.7 * np.cos(angle)  # standoff 0.7 m from the radio
        ty = radio_pos[1] + 0.7 * np.sin(angle)
        # yaw that points the robot from the standoff pose back at the radio
        face_yaw = np.arctan2(radio_pos[1] - ty, radio_pos[0] - tx)
        moved = safe_navigate([tx, ty, face_yaw], f"ang_{angle_deg}")
        if moved and dist_to(radio_pos[:2]) < 0.8:  # reached a pose within 0.8 m
            break
Each angle puts the goal on a different side of the object. When one side is blocked, another is often open. Here the 180-degree pose clears the buffer. The validated fix is admitted as a reusable navigation-recovery skill.
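As a rough illustration, the admitted skill could be stored as a record with the four fields the article lists for library entries. The `Skill` and `SkillLibrary` classes, the field wording, and the two boolean gate arguments are assumptions made for this sketch. The article says only that admission requires passing debug validation and API-policy checks.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    """Compact in-context guidance (hypothetical schema)."""
    name: str
    failure_signature: str
    when_to_apply: str
    repair_strategy: str
    code_sketch: str = ""   # optional, since not every skill carries code

class SkillLibrary:
    """Stores only skills that pass both admission gates."""

    def __init__(self) -> None:
        self._skills: list[Skill] = []

    def admit(self, skill: Skill, passed_debug_validation: bool,
              passed_api_policy: bool) -> bool:
        if passed_debug_validation and passed_api_policy:
            self._skills.append(skill)
            return True
        return False

    def __len__(self) -> int:
        return len(self._skills)

skill = Skill(
    name="multi_angle_approach",
    failure_signature="navigate_to_pose returns PLANNING_ERROR near an obstacle",
    when_to_apply="target pose is perceived correctly but the goal is infeasible",
    repair_strategy="sample standoff poses around the target; take the first reachable",
    code_sketch="for angle_deg in [180, -90, 90, -45, 45]: ...",
)

library = SkillLibrary()
admitted = library.admit(skill, passed_debug_validation=True, passed_api_policy=True)
rejected = library.admit(skill, passed_debug_validation=True, passed_api_policy=False)
print(admitted, rejected, len(library))
```

Keeping the failure signature and the when-to-apply condition separate from the repair lets a later agent match on symptoms first and read the code sketch only when the skill applies.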
Benchmarks and results
ASPIRE is evaluated on three benchmark families. LIBERO-Pro tests short-horizon robustness under object, goal, and spatial perturbations. Robosuite covers contact-rich single- and dual-arm manipulation. BEHAVIOR-1K covers long-horizon household mobile manipulation. The primary coding-agent baseline is CaP-Agent0. It uses visual differencing, a predefined skill library, and per-episode test-time retries. The comparison also includes end-to-end vision-language-action policies: OpenVLA, π0, and π0.5.
On LIBERO-Pro, ASPIRE gains up to 77 percentage points over the strongest baseline on the Object suite, averaged across both perturbation axes. It also gains 41.5 points on Goal and 42.5 points on Spatial. On Robosuite, bimanual handover rises from 20% to 92%. On BEHAVIOR-1K, the radio pickup task rises from 56% to 88%.
The zero-shot result is notable. Reusing skills accumulated on LIBERO-90, ASPIRE reaches about 31% on held-out LIBERO-Pro Long tasks. Prior methods saturate near 4%.
Comparison summary
| Dimension | End-to-end VLAs (OpenVLA, π0, π0.5) | CaP-Agent0 | ASPIRE |
|---|---|---|---|
| Paradigm | Learned-weight policy | Code-as-policy agent | Code-as-policy agent |
| Cross-task experience | None (frozen weights) | Discarded after each task | Distilled into a skill library |
| Failure feedback | None at test time | Coarse scene-level summaries | Per-primitive multimodal traces |
| Test-time strategy | Direct inference | Per-seed reasoning + retries | One program per task |
| LIBERO-Pro overall | 0–13% | 18% | 72% |
| LIBERO-Pro Long zero-shot | 0–5% | ~4% | ~31% |
Real-robot skill transfer
The research team tests three simulation-discovered skills on a real bimanual YAM station. The real-robot coding agent is OpenAI Codex GPT-5.5. The embodiment and API differ from simulation. Transferred skills reduce debugging cost. Soda-can lifting improved from 13/20 to 19/20 while using about 10x fewer tokens. Drawer opening moved from 0/20 to 11/20, where the no-skill baseline never succeeded.
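The trial counts are easier to compare with the simulation results once converted to percentages. The short calculation below uses only the counts quoted above. The helper name `success_rate` is an illustrative choice.

```python
def success_rate(successes: int, trials: int) -> float:
    """Success rate as a percentage of trials."""
    return 100.0 * successes / trials

# (without transferred skills, with transferred skills), as reported above
results = {
    "soda-can lifting": ((13, 20), (19, 20)),
    "drawer opening": ((0, 20), (11, 20)),
}

for task, (before, after) in results.items():
    b = success_rate(*before)
    a = success_rate(*after)
    print(f"{task}: {b:.0f}% -> {a:.0f}% ({a - b:+.0f} points)")
# soda-can lifting: 65% -> 95% (+30 points)
# drawer opening: 0% -> 55% (+55 points)
```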
What it means
For people building physical systems, the shift is from writing brittle scripts to maintaining a library of repairs. The agent stops guessing from rollout outcomes and starts localising faults using per-primitive multimodal traces. The real-robot results suggest that skills discovered in simulation can carry over to hardware with a different embodiment and API.




