NVIDIA Research has introduced SpatialClaw, a framework designed to address a persistent bottleneck in vision-language models (VLMs): their difficulty reasoning accurately about object locations, relationships, and 3D motion. Rather than retraining the underlying model, SpatialClaw alters the action interface through which the agent interacts with perception tools. By treating code as the primary interface for executing actions, the team argues they have removed a critical constraint. With a Gemma4-31B backbone, SpatialClaw averages 59.9% accuracy across 20 benchmarks, 11.2 percentage points above the recent spatial agent SpaceTools.
What is SpatialClaw?
SpatialClaw operates as an agent loop wrapped around a stateful Python kernel. This kernel is pre-loaded with input frames and a suite of primitives. Perception tools are implemented as plain Python callables, meaning their outputs—such as masks, depth maps, camera geometry, and trajectories—are returned as standard Python variables.
The kernel exposes six public entry points. InputImages stores the sampled frames, while Metadata handles frame rate, duration, and frame indices. The tools object exposes perception and geometry primitives. The show() function embeds an image into the agent’s next context, vlm dispatches queries to a separate VLM session, and ReturnAnswer() submits the final result.
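A stateful kernel of this kind can be approximated by executing every cell against one shared namespace. The sketch below is a toy stand-in, not SpatialClaw's implementation: ToyKernel, run_cell, and the stub bodies of show and ReturnAnswer are hypothetical, and only the entry-point names come from the description above. The tools and vlm entry points are left out because they need real models.

```python
class ToyKernel:
    """Hypothetical stand-in for a stateful kernel: one namespace shared by all cells."""

    def __init__(self, frames, metadata):
        self.shown = []      # images queued for the agent's next context
        self.answer = None   # set once ReturnAnswer() is called
        # Names mirror the entry points described above; the bodies are stubs.
        self.namespace = {
            "InputImages": frames,
            "Metadata": metadata,
            "show": self.shown.append,
            "ReturnAnswer": self._return_answer,
        }

    def _return_answer(self, value):
        self.answer = value

    def run_cell(self, source):
        # Variables defined by one cell stay visible to every later cell.
        exec(source, self.namespace)


# Placeholder strings stand in for image frames (assumption for illustration).
kernel = ToyKernel(frames=["frame0", "frame1", "frame2"], metadata={"fps": 2.0})
kernel.run_cell("n_frames = len(InputImages)")
kernel.run_cell("duration = n_frames / Metadata['fps']")  # reuses n_frames
kernel.run_cell("show(InputImages[0]); ReturnAnswer(duration)")
print(kernel.answer)
```

Because the namespace persists, a later cell can reuse a mask or depth map computed earlier instead of recomputing it, which is what makes step-by-step revision cheap.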
Two perception tools are central to the workflow. tools.Reconstruct wraps Depth Anything 3 to return per-frame depth, camera intrinsics, extrinsics, and dense point maps. tools.SAM3 wraps SAM 3 to produce image or video masks from text, point, or box prompts. The framework also includes lightweight utilities: tools.Geometry, tools.Mask, tools.Time, tools.Graph, and tools.Draw.
Crucially, the system is training-free. The same system prompt, tool set, and hyperparameters run across every benchmark and backbone.
Why the Action Interface Matters
The research team compared three action interfaces on the same task: measuring the closest distance between a heater and a door.
- Single-pass code writes one complete program and runs it once. It commits to a full strategy before seeing any intermediate mask or depth map. A wrong assumption then propagates straight to the answer.
- Structured tool-call invokes named tools through a fixed JSON schema. It cannot freely combine outputs with NumPy or SciPy to express test-time computations. Because the closest-point operation has no pre-registered tool, the result is incorrect.
- SpatialClaw composes tools in code, inspects results, then revises. It first computes a centroid distance, then notices that the centroid is computed as a median, which does not give the closest point. The agent switches to scipy.spatial.KDTree to find the true closest point. It submits 0.9439 m against a 0.9 m ground truth.
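The gap between a centroid distance and a closest-point distance is easy to reproduce on synthetic data. The sketch below is an illustration under stated assumptions: the two point grids are made up, and a brute-force nearest-pair search stands in for the KD-tree so the example needs only the standard library. A KD-tree returns the same minimum, just faster on large point clouds.

```python
import math
from statistics import median


def median_centroid(points):
    """Per-axis median of a point set."""
    return tuple(median(axis) for axis in zip(*points))


def centroid_distance(a, b):
    """Distance between the median centroids of two point sets."""
    return math.dist(median_centroid(a), median_centroid(b))


def closest_distance(a, b):
    """Smallest distance between any point of a and any point of b (brute force)."""
    return min(math.dist(p, q) for p in a for q in b)


# Synthetic stand-ins (assumption): a low, wide heater and a tall door on one wall.
heater = [(x / 10, 0.0, z / 10) for x in range(11) for z in range(6)]
door = [(2.0 + x / 10, 0.0, z / 10) for x in range(9) for z in range(21)]

print(round(centroid_distance(heater, door), 3))
print(round(closest_distance(heater, door), 3))
```

On these grids the centroid distance is roughly twice the closest-point distance, because the centroids sit in the middle of each object while the nearest edges face each other.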
Benchmark
SpatialClaw was tested on 20 benchmarks across five categories: single-image, multi-view, general, video and 4D, and general video understanding. It improves over the no-tool baseline on all six backbones tested. These backbones range from 26B to 397B parameters across the Qwen3.5/3.6 and Gemma4 families.
A controlled comparison isolates the interface. All three variants share the same toolset and prompt; only the action interface differs.
| Action interface | Avg. (20 bench.) | Δ vs no-tool |
|---|---|---|
| No-tool baseline | 53.4 | – |
| Single-pass code | 55.2 | +1.8 |
| Structured tool-call | 56.7 | +3.3 |
| SpatialClaw (code as action) | 59.9 | +6.5 |
Gemma4-31B backbone, 20-benchmark average.
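The Δ column is each average minus the no-tool baseline. A quick check of that arithmetic, using only the numbers in the table:

```python
# Averages copied from the table above (Gemma4-31B, 20 benchmarks).
averages = {
    "No-tool baseline": 53.4,
    "Single-pass code": 55.2,
    "Structured tool-call": 56.7,
    "SpatialClaw (code as action)": 59.9,
}

baseline = averages["No-tool baseline"]
# Round to one decimal to hide floating-point noise in the subtraction.
deltas = {name: round(score - baseline, 1) for name, score in averages.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")

# Gap between the two tool-using interfaces that share the same toolset.
interface_gap = round(
    averages["SpatialClaw (code as action)"] - averages["Structured tool-call"], 1
)
print(f"Code as action vs structured tool-call: {interface_gap:+.1f}")
```

The same subtraction puts code as action 3.2 points above structured tool-call, which is the part of the gain attributable to the interface alone.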
Against prior spatial agents on the same Gemma4-31B backbone, the gap widens.
| Method | Interface | Avg. | Δ vs SpatialClaw |
|---|---|---|---|
| VADAR | Single-pass | 40.5* | −19.4 |
| pySpatial | Single-pass | 47.8 | −12.1 |
| SpaceTools-Toolshed | Structured tool-call | 48.7 | −11.2 |
| SpatialClaw | Code as action | 59.9 | best |
The largest gains land on dynamic tasks. On Gemma4-31B, DSI-Bench gains 17.6 points and MindCube gains 15.3 points. These categories need chained geometric computation across frames and viewpoints.
An LLM-as-judge attribution explains the wins over structured tool-call. Code composition accounts for 52.2% of them. Control flow accounts for 19.5%, and the remaining 28.3% are interface-neutral.
Inside the Five-Stage Loop
Each sample runs a five-stage loop: planning, code generation, code execution, feedback assembly, and answer submission. A planner drafts a strategy without seeing the images. The main agent then writes one Python cell per step. A static AST checker rejects unsafe code before execution. The loop repeats until ReturnAnswer() is called or 30 steps pass.
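The check-then-execute part of this loop can be sketched with the standard library's ast module. This is a minimal illustration, not the repo's checker: the deny-lists, is_safe, run_loop, and the scripted agent are all hypothetical, and the planning stage is omitted. Only the 30-step budget and the stop-on-ReturnAnswer rule come from the text.

```python
import ast

MAX_STEPS = 30                                          # step budget stated in the text
BLOCKED_MODULES = {"os", "subprocess", "shutil"}        # assumption: illustrative deny-list
BLOCKED_CALLS = {"eval", "exec", "open", "__import__"}  # assumption: illustrative deny-list


def is_safe(source):
    """Return False if the cell fails to parse or uses a blocked import or call."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in BLOCKED_MODULES for a in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BLOCKED_MODULES:
                return False
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                return False
    return True


def run_loop(write_cell, max_steps=MAX_STEPS):
    """Generate, check, execute, and collect feedback until an answer or the step limit."""
    result = {}
    namespace = {"ReturnAnswer": lambda value: result.setdefault("answer", value)}
    feedback = "start"
    for step in range(1, max_steps + 1):
        source = write_cell(step, feedback)   # code generation
        if not is_safe(source):               # static check before execution
            feedback = "rejected: unsafe code"
            continue
        try:
            exec(source, namespace)           # code execution in a shared namespace
            feedback = "ok"
        except Exception as exc:              # feedback assembly
            feedback = f"error: {exc!r}"
        if "answer" in result:                # answer submission
            return result["answer"], step
    return None, max_steps


def scripted_agent(step, feedback):
    """Stand-in for the model: one unsafe cell, then two valid ones."""
    cells = {
        1: "import os\nos.remove('x')",  # rejected by the checker, never executed
        2: "gap = 3 * 0.5",
        3: "ReturnAnswer(gap)",
    }
    return cells[step]


answer, steps_used = run_loop(scripted_agent)
print(answer, steps_used)
```

A rejected cell still consumes a step in this sketch, so a model that keeps emitting unsafe code runs out of budget instead of looping forever.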
The official repo runs on a LangGraph workflow and a persistent Jupyter kernel. Backbones are served through vLLM, and perception runs behind a FastAPI GPU service. A single quickstart runs one benchmark on one machine:

    git clone --recursive https://github.com/NVlabs/SpatialClaw.git
    cd SpatialClaw
    bash spatial_agent/scripts/setup.sh
    cp .env.example .env  # add API keys, or self-host vLLM
    python -m spatial_agent.entrypoints.run \
        --dataset spatial_agent/config/dataset/erqa.json \
        --model spatial_agent/config/model/gemini-3-pro.json \
        --concurrency 4

A representative agent cell composes perception with geometry, then revises:

    # Reconstruct the scene, then segment both objects in one video pass
    recon = tools.Reconstruct.Reconstruct(InputImages)
    seg = tools.SAM3.segment_video_by_text(["radiator heater", "door"])
    show(seg.visualize(1))  # inspect the masks first

    # Closest-point distance via KD-tree, not centroids
    pts_h = seg.get_masked_points(recon, frame=1, object=0)  # object 0 = heater
    pts_d = seg.get_masked_points(recon, frame=2, object=1)  # object 1 = door
    dists, _ = scipy.spatial.KDTree(pts_d).query(pts_h, k=1)
    ReturnAnswer(float(dists.min()))

The agent picks primitives from the question itself. Distance questions invoke KD-tree search and vector norms. Direction questions rely on dot products. No category-specific routing was applied.
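For direction questions, the dot-product idea can be sketched as follows. The axis convention (camera looking along +z with +x to its right), the function name, and the sample coordinates are assumptions for illustration, not taken from SpatialClaw.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def relative_direction(camera_pos, forward, right, target):
    """Classify a target as front/behind and left/right of the camera."""
    offset = tuple(t - c for t, c in zip(target, camera_pos))
    depth = dot(offset, forward)   # positive: target lies along the viewing direction
    lateral = dot(offset, right)   # positive: target lies on the camera's right side
    front_back = "front" if depth >= 0 else "behind"
    left_right = "right" if lateral >= 0 else "left"
    return front_back, left_right


# Assumed convention: camera at the origin, looking along +z, with +x to its right.
camera = (0.0, 0.0, 0.0)
forward = (0.0, 0.0, 1.0)
right = (1.0, 0.0, 0.0)

print(relative_direction(camera, forward, right, (-2.0, 0.0, 3.0)))
print(relative_direction(camera, forward, right, (1.0, 0.0, -4.0)))
```

With real camera extrinsics, the forward and right vectors would come from the camera's rotation rather than being hard-coded.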
Use Cases
The design fits problems that need step-by-step geometric reasoning. Concrete examples include