For creators and engineers, the ability to anticipate how objects move in three-dimensional space is the missing link between passive observation and active creation. While modern AI excels at describing what has already happened in a video, building robots that can grasp a cup or generating video that obeys the laws of physics requires looking ahead. MolmoMotion addresses this gap: given a single video frame, a set of marked points on an object, and a text instruction, it forecasts where those points will move over the next few seconds.
The new standard for 3D motion prediction
MolmoMotion predicts future 3D trajectories for marked points from RGB input and an action description, and it outperforms existing forecasting methods at that task. The predictions are designed to drive practical applications, from planning robot movements to keeping generated video frames physically plausible. Alongside the model, the team has published MolmoMotion-1M, a dataset of 1.16 million videos with paired 3D point trajectories and action descriptions, and PointMotionBench, a human-validated benchmark of 2,700 clips used to test accuracy.
Designing for the real world
Unlike previous approaches that rely on specific templates for human bodies or rigid objects, MolmoMotion represents motion as object-attached 3D points in world space. This representation was chosen to provide three properties:
- Class-agnostic: The system works regardless of whether the object is a hand, a tool, or any other category.
- View-stable: Motion remains consistent even when camera angles or viewpoints change.
- Directly usable: The output can be fed straight into robotics policies or video generation models without further conversion.
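To illustrate the view-stable property, the sketch below expresses one 3D point in the frames of two hypothetical cameras and then maps both observations back to world space. The camera poses and the helper names `rotation_y`, `world_to_camera`, and `camera_to_world` are assumptions made for illustration; they are not part of the MolmoMotion release.

```python
import numpy as np

def rotation_y(angle_rad: float) -> np.ndarray:
    """Rotation matrix for a turn of angle_rad about the y axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def camera_to_world(points_cam: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Map (N, 3) camera-frame points to world space: p_world = R @ p_cam + t."""
    return points_cam @ R.T + t

def world_to_camera(points_world: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Inverse mapping: p_cam = R.T @ (p_world - t), written for row vectors."""
    return (points_world - t) @ R

# One object-attached point, in metres, in world space (hypothetical values).
point_world = np.array([[0.2, 0.0, 1.5]])

# Two hypothetical camera poses looking at the same scene.
R_a, t_a = rotation_y(0.0), np.zeros(3)
R_b, t_b = rotation_y(np.pi / 4), np.array([1.0, 0.0, 0.5])

# The camera-frame coordinates depend on the viewpoint...
seen_a = world_to_camera(point_world, R_a, t_a)
seen_b = world_to_camera(point_world, R_b, t_b)

# ...but both map back to the same world-space coordinates.
back_a = camera_to_world(seen_a, R_a, t_a)
back_b = camera_to_world(seen_b, R_b, t_b)
```

Because trajectories are stored in the shared world frame, a change of viewpoint alters `seen_a` and `seen_b` but leaves the represented motion untouched.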
The architecture leverages Molmo 2 as its backbone, linking language instructions to visual objects and points. It accepts a short video history, an action description, and initial 3D coordinates for query points. The model then identifies the relevant object and predicts the future trajectory for each point.
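The inputs and outputs described above can be written down as a small data contract. This is a minimal sketch under stated assumptions: `ForecastRequest`, `validate_request`, and `hold_still_forecast` are hypothetical names, the array layouts are guesses at a convenient convention, and the forecaster is a trivial stand-in that keeps every point where it started rather than anything resembling the real model.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class ForecastRequest:
    frames: np.ndarray        # (num_frames, height, width, 3) RGB video history
    instruction: str          # action description, e.g. "push the bowl left"
    query_points: np.ndarray  # (K, 3) initial world-space coordinates

def validate_request(req: ForecastRequest) -> None:
    """Raise ValueError if the request does not match the assumed layout."""
    if req.frames.ndim != 4 or req.frames.shape[-1] != 3:
        raise ValueError("frames must have shape (num_frames, height, width, 3)")
    if req.query_points.ndim != 2 or req.query_points.shape[1] != 3:
        raise ValueError("query_points must have shape (K, 3)")
    if not req.instruction.strip():
        raise ValueError("instruction must be a non-empty string")

def hold_still_forecast(req: ForecastRequest, horizon: int) -> np.ndarray:
    """Stand-in forecaster: repeats the initial points for `horizon` steps.

    Returns an array of shape (horizon, K, 3), one 3D position per future
    step and query point. A real model would replace this function.
    """
    validate_request(req)
    return np.repeat(req.query_points[None, :, :], horizon, axis=0)

request = ForecastRequest(
    frames=np.zeros((4, 32, 32, 3), dtype=np.uint8),
    instruction="push the bowl to the left",
    query_points=np.zeros((8, 3)),
)
forecast = hold_still_forecast(request, horizon=12)
```

The useful part is the output shape: whatever produces the forecast, each query point gets one 3D position per future step.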
Two variants are available for different needs:
- MolmoMotion-AR: Predicts future coordinates step-by-step as structured text. This variant suits cases where the future path is well-defined, and its step-by-step prediction encourages smooth rollouts.
- MolmoMotion-FM: Predicts trajectories in continuous 3D space by transforming noise into motion. This variant is better suited for scenarios where an instruction allows for multiple plausible futures, effectively representing uncertainty.
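The noise-to-motion idea behind the FM variant can be sketched with a few lines of Euler integration, assuming it follows the general flow-matching recipe (the article does not detail the formulation). In a real model a learned network would supply the velocity, conditioned on the video and instruction. Here `toy_velocity` is a hypothetical closed-form stand-in: it is the exact velocity field of a straight-line path from any starting point to one fixed target trajectory.

```python
import numpy as np

def toy_velocity(x: np.ndarray, t: float, target: np.ndarray) -> np.ndarray:
    """Velocity of the straight-line path from x toward `target` at time t < 1."""
    return (target - x) / (1.0 - t)

def sample_trajectory(target: np.ndarray, num_steps: int = 10, seed: int = 0) -> np.ndarray:
    """Integrate from Gaussian noise at t=0 to a trajectory at t=1 with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # pure noise, same shape as the output
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                         # t stays below 1, so no division by zero
        x = x + dt * toy_velocity(x, t, target)
    return x

# Hypothetical target: 8 points sliding 0.3 m along x over 12 future steps.
progress = np.linspace(0.0, 1.0, 12)[:, None, None]   # (12, 1, 1)
start = np.zeros((8, 3))
target = start[None, :, :] + progress * np.array([0.3, 0.0, 0.0])  # (12, 8, 3)

sample = sample_trajectory(target, num_steps=10)
```

In this toy every seed lands on the same target because the velocity field encodes a single future. A learned field trained on many possible futures could carry different noise samples to different plausible trajectories, which is how such a model can represent uncertainty.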
Building the dataset from scratch
Creating the training data required a custom pipeline because existing 3D-track datasets were too small, while internet videos lacked the necessary annotations. The team developed an automatic system to extract object-grounded 3D trajectories from unconstrained video.
The pipeline grounds the moving object and samples query points, then tracks dense 2D points before lifting them into a shared metric 3D frame. To ensure reliability, the system filters out jittery tracks caused by depth errors, smooths the remaining data, and segments clips to focus only on intervals where the object undergoes meaningful motion. The resulting MolmoMotion-1M corpus covers 736 motion types and 5,600 distinct objects.
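The last three cleanup stages (filtering jittery tracks, smoothing, and keeping only the interval with meaningful motion) can be sketched with NumPy. This is an illustration of the general idea, not the team's pipeline: the function names, the second-difference jitter test, the moving-average smoother, and all thresholds are assumptions.

```python
import numpy as np

def is_jittery(track: np.ndarray, max_accel: float = 0.1) -> bool:
    """Flag a (T, 3) track whose largest second difference exceeds max_accel."""
    accel = np.diff(track, n=2, axis=0)
    return bool(np.linalg.norm(accel, axis=1).max() > max_accel)

def smooth(track: np.ndarray, window: int = 5) -> np.ndarray:
    """Moving average over time with edge padding; `window` must be odd."""
    pad = window // 2
    padded = np.pad(track, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    columns = [np.convolve(padded[:, d], kernel, mode="valid") for d in range(track.shape[1])]
    return np.stack(columns, axis=1)

def motion_interval(track: np.ndarray, min_speed: float = 0.01):
    """Return (first, last) frame indices of the moving span, or None if static."""
    speed = np.linalg.norm(np.diff(track, axis=0), axis=1)
    moving = np.flatnonzero(speed > min_speed)
    if moving.size == 0:
        return None
    return int(moving[0]), int(moving[-1]) + 1

# Synthetic track: still for 10 frames, slides along x for 10 frames, still again.
x = np.concatenate([np.zeros(10), 0.05 * np.arange(1, 11), np.full(10, 0.5)])
clean = np.stack([x, np.zeros(30), np.ones(30)], axis=1)

# Same track with a single-frame depth spike, as a depth error might produce.
spiked = clean.copy()
spiked[15, 2] += 0.5

kept = [trk for trk in (clean, spiked) if not is_jittery(trk)]  # filter
smoothed = [smooth(trk) for trk in kept]                        # smooth
interval = motion_interval(clean)                               # segment
```

Here the spiked track is rejected, and `interval` marks the frames between which the object actually moves, so a clip could be cropped to `clean[interval[0]:interval[1] + 1]`.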
Proving it works
The evaluation covered three areas: raw forecasting accuracy, robotics planning, and video generation control. On PointMotionBench, MolmoMotion beats existing methods, including pixel-space generators and parametric 3D approaches, across a range of objects and scenes. In examples such as a lint roller moving on cloth, a bowl sliding on a table, or a car turning on a road, the predicted path follows the instruction and closely matches the ground-truth motion.
Robotics planning
Because the model learns general motion paths rather than robot-specific actions, it transfers well to physical manipulation. After fine-tuning on the DROID dataset, a control policy built on MolmoMotion succeeded on 76.3% of pick-and-place tasks in simulation, compared to 56.0% for a policy based on Molmo 2. It also learns faster: it reaches 51% success after 10,000 training steps, where the baseline reaches only 19%. On real robots, MolmoMotion matches the baseline’s test L2 error after about 2,000 steps, where the baseline needs 12,000, roughly a sixfold reduction.
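The article does not spell out how the test L2 error is computed. One common reading is the mean Euclidean distance between predicted and reference vectors, sketched below for 3D points. The function name `mean_l2_error` and the example values are hypothetical.

```python
import numpy as np

def mean_l2_error(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean Euclidean distance between arrays of shape (..., 3)."""
    return float(np.linalg.norm(pred - target, axis=-1).mean())

# Hypothetical example: every predicted point is off by 3 cm in x and 4 cm in y.
reference = np.zeros((4, 8, 3))
predicted = reference + np.array([0.03, 0.04, 0.0])
error = mean_l2_error(predicted, reference)  # 0.05 m, up to floating-point rounding
```

Under this definition, "matching the baseline's error in fewer steps" means the metric falls to the same value earlier in training.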
Video generation
MolmoMotion’s predictions can also steer image-to-video models. Instead of letting a generator guess motion from a vague prompt, feeding in the model’s 3D forecasts results in video that follows requested actions more closely. This is particularly effective for small, precise movements. Quantitatively, using MolmoMotion to guide a generator improves motion quality on all five measured metrics and outperforms larger image-to-video models on four of those five.
Limitations and future scope
While capable, the model uses only eight query points per object during training. This is sufficient for general trajectory forecasting but limits its ability to handle complex deformable motion where dense surface geometry is required. The authors view forecasting as fundamental to machine intelligence, comparable to perception itself, and expect MolmoMotion to pave the way for broader applications in robotics and creative tools.
Key takeaways
- Superior accuracy: MolmoMotion outperforms existing 3D motion forecasting methods across diverse objects, scenes, and actions on the PointMotionBench benchmark.
- Robotics efficiency: Integrating MolmoMotion into robot planning policies leads to higher task success rates and significantly faster learning speeds compared to standard baselines.
- Enhanced video control: Using MolmoMotion’s 3D predictions to guide image-to-video models results in more accurate adherence to motion instructions and improved quality metrics.