Gradient-based Planning for World Models at Longer Horizons


GRASP: Gradient-based Planning for World Models at Longer Horizons

GRASP is a new gradient-based planner for learned dynamics (a “world model”) that makes long-horizon planning practical by (1) lifting the trajectory into virtual states so optimization is parallel across time, (2) adding stochasticity directly to the state iterates for exploration, and (3) reshaping gradients so actions get clean signals while we avoid brittle “state-input” gradients through high-dimensional vision models.

Large, learned world models are becoming increasingly capable. They can predict long sequences of future observations in high-dimensional visual spaces and generalize across tasks in ways that were difficult to imagine a few years ago. As these models scale, they start to look less like task-specific predictors and more like general-purpose simulators.

But having a powerful predictive model is not the same as being able to use it effectively for control/learning/planning. In practice, long-horizon planning with modern world models remains fragile: optimization becomes ill-conditioned, non-greedy structure creates bad local minima, and high-dimensional latent spaces introduce subtle failure modes.

In this blog post, I describe the problems that motivated this project and our approach to address them: why planning with modern world models can be surprisingly fragile, why long horizons are the real stress test, and what we changed to make gradient-based planning much more robust.

This blog post discusses work done with Mike Rabbat, Aditi Krishnapriyan, Yann LeCun, and Amir Bar (* denotes equal advisorship), where we propose GRASP.

What is a world model?

These days, the term “world model” is quite overloaded, and depending on the context can either mean an explicit dynamics model or some implicit, reliable internal state that a generative model relies on (e.g., when an LLM generates chess moves, whether there is some internal representation of the board). We give our loose working definition below.

Suppose you take actions $a_t \in \mathcal{A}$ and observe states $s_t \in \mathcal{S}$ (images, latent vectors, proprioception). A world model is a learned model that, given the states observed so far and the next action, predicts what will happen next; applied repeatedly, it predicts the outcome of a whole sequence of future actions. Formally, it defines a predictive distribution over the next state, conditioned on a window of observed states $s_{t-h:t}$ and the current action $a_t$:

\[P_\theta(s_{t+1} \mid s_{t-h:t},\; a_t)\]

that approximates the environment’s true conditional $P(s_{t+1} \mid s_{t-h:t},\; a_t)$. For this blog post, we’ll assume a Markovian model $P_\theta(s_{t+1} \mid s_t,\; a_t)$ for simplicity (all results here can be extended to the more general case), and when the model is deterministic it reduces to a map over states:

\[s_{t+1} = F_\theta(s_t, a_t).\]

In practice the state $s_t$ is often a learned latent representation (e.g., encoded from pixels), so the model operates in a (theoretically) compact, differentiable space. The key point is that a world model gives you a differentiable simulator; you can roll it forward under hypothetical action sequences and backpropagate through the predictions.

Planning: choosing actions by optimizing through the model

Given a start $s_0$ and a goal $g$, the simplest planner chooses an action sequence $\mathbf{a}=(a_0,\dots,a_{T-1})$ by rolling out the model and minimizing terminal error:

\[\min_{\mathbf{a}} \; \| s_T(\mathbf{a}) - g \|_2^2, \quad \text{where } s_T(\mathbf{a}) = \mathcal{F}_{\theta}^{T}(s_0,\mathbf{a}).\]

Here we use $\mathcal{F}^T$ as shorthand for the full rollout through the world model (dependence on model parameters $\theta$ is implicit):

\[\mathcal{F}_{\theta}^{T}(s_0, \mathbf{a}) = F_\theta(F_\theta(\cdots F_\theta(s_0, a_0), \cdots, a_{T-2}), a_{T-1}).\]
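To make the rollout objective concrete, here is a minimal sketch in Python with NumPy. The linear map `F` (with matrices `A` and `B`) is a hypothetical stand-in for a learned $F_\theta$, chosen only so the example runs on its own; the names `rollout` and `terminal_loss` are ours and are not taken from any GRASP code.

```python
import numpy as np

# Hypothetical toy world model: s_{t+1} = A s_t + B a_t (a stand-in for F_theta).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])

def F(s, a):
    """One deterministic model step."""
    return A @ s + B @ a

def rollout(s0, actions):
    """Apply F sequentially and return s_T for actions (a_0, ..., a_{T-1})."""
    s = s0
    for a in actions:
        s = F(s, a)
    return s

def terminal_loss(s0, actions, g):
    """Squared distance between the final rolled-out state and the goal."""
    diff = rollout(s0, actions) - g
    return float(diff @ diff)

s0 = np.array([0.0, 0.0])
g = np.array([1.0, 0.0])
actions = np.zeros((20, 1))              # T = 20 steps, 1-D actions
print(terminal_loss(s0, actions, g))     # prints 1.0: zero actions leave the state at s0
```

Note that `rollout` is inherently sequential: step $t+1$ cannot start until step $t$ has finished, and any gradient with respect to an early action has to pass through every later step.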

In short horizons and low-dimensional systems, this can work reasonably well. But as horizons grow and models become larger and more expressive, its weaknesses become amplified.

So why doesn’t this just work at scale?

Why long-horizon planning is hard (even when everything is differentiable)

There are two separate pain points that apply to world models in general, plus a third that is specific to learned, deep learning-based models.

1) Long-horizon rollouts create deep, ill-conditioned computation graphs

Those familiar with backprop through time (BPTT) may notice that we’re differentiating through a model applied to itself repeatedly, which will lead to the exploding/vanishing gradients problem. Namely, if we take derivatives (note we’re differentiating vector-valued functions, resulting in Jacobians that we denote with $D_x (\cdots)$) with respect to earlier actions (e.g., $a_0$):

\[D_{a_0} \mathcal{F}_{\theta}^{T}(s_0, \mathbf{a}) = \Bigl(\prod_{t=1}^{T-1} D_s F_\theta(s_t, a_t)\Bigr) D_{a_0}F_\theta(s_0, a_0).\]

We see that the Jacobian’s conditioning scales exponentially with time $T$:

\[\sigma_{\text{max/min}}(D_{a_0}\mathcal{F}_{\theta}^{T}) \sim \sigma_{\text{max/min}}(D_s F_\theta)^{T-1},\]

leading to exploding or vanishing gradients.
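A small numerical illustration of this scaling, under the simplifying assumption of linear dynamics $s_{t+1} = A s_t + B a_t$ (our toy choice, not a learned model). In that case the Jacobian of $s_T$ with respect to $a_0$ is exactly $A^{T-1}B$, so its largest singular value can be computed directly:

```python
import numpy as np

def jacobian_wrt_first_action(A, B, T):
    """D_{a_0} s_T for linear dynamics s_{t+1} = A s_t + B a_t, which equals A^(T-1) B."""
    return np.linalg.matrix_power(A, T - 1) @ B

B = np.eye(2)
for scale in (0.5, 1.5):
    A = scale * np.eye(2)        # every singular value of D_s F equals `scale`
    for T in (5, 20, 50):
        J = jacobian_wrt_first_action(A, B, T)
        sigma_max = np.linalg.norm(J, 2)   # spectral norm = largest singular value
        print(f"scale={scale} T={T:3d} sigma_max={sigma_max:.3e}")
```

With `scale = 0.5` the signal reaching $a_0$ shrinks like $0.5^{T-1}$, and with `scale = 1.5` it grows like $1.5^{T-1}$. A learned model is not linear, but the same product structure applies to its per-step Jacobians.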

2) The landscape is non-greedy and full of traps

At short horizons, the greedy solution, where we move straight toward the goal at every step, is often good enough. If you only need to plan a few steps ahead, the optimal trajectory usually doesn’t deviate much from “head toward $g$” at each step.

As horizons grow, two things happen. First, longer tasks are more likely to require non-greedy behavior: going around a wall, repositioning before pushing, backing up to take a better path. And as horizons grow, more of these non-greedy steps are typically needed. Second, the optimization space itself scales with horizon: $\mathrm{dim}(\mathcal{A} \times \cdots \times \mathcal{A}) = T \cdot \mathrm{dim}(\mathcal{A})$, further expanding the space of local minima for the optimization problem.

Loss landscape — *Distance to goal along the optimal path is non-monotonic, and the resulting loss landscape can be rough.*

A long-horizon fix: lifting the dynamics constraint

Suppose we treat the dynamics constraint $s_{t+1} = F_{\theta}(s_t, a_t)$ as a soft constraint, and we instead optimize the following penalty function over both actions $(a_0,\ldots,a_{T-1})$ and states $(s_0,\ldots,s_T)$:

\[\min_{\mathbf{s},\mathbf{a}} \mathcal{L}(\mathbf{s}, \mathbf{a}) = \sum_{t=0}^{T-1} \big\|F_\theta(s_t,a_t) - s_{t+1}\big\|_2^2,
\quad \text{with } s_0 \text{ fixed and } s_T=g.\]

This is also sometimes called collocation in the planning and robotics literature. Note that whenever the goal is reachable in $T$ steps, the lifted formulation shares the same global minimizers as the original rollout objective: both are zero exactly when the trajectory is dynamically feasible and ends at $g$. But the optimization landscapes are very different, and we get two immediate benefits:

Each world model evaluation $F_{\theta}(s_t,a_t)$ depends only on local variables, so all $T$ terms can be computed in parallel across time, resulting in a huge speed-up for longer horizons, and
You no longer backpropagate through a single deep $T$-step composition to get a learning signal, since the previous product of Jacobians now splits into a sum, e.g.:

\[D_{a_0} \mathcal{L} = 2\,\bigl(F_\theta(s_0, a_0) - s_1\bigr)^{\top} D_{a_0} F_\theta(s_0, a_0).\]
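Both benefits can be seen in a short sketch. It again assumes a hypothetical linear stand-in for $F_\theta$ (matrices `A` and `B` are ours), so the chain rule through a single model call can be written by hand; for a real network one would use an autodiff library instead. All $T$ residuals come from one batched call, and the gradient for each action uses only that action's own residual.

```python
import numpy as np

# Hypothetical toy world model: F(s, a) = A s + B a.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])

def F_batch(S, U):
    """Apply the toy model to all (s_t, a_t) pairs at once; rows index time."""
    return S @ A.T + U @ B.T

def lifted_loss(S, U):
    """S has shape (T+1, d) with S[0] = s_0 and S[T] = g; U has shape (T, m)."""
    residual = F_batch(S[:-1], U) - S[1:]   # all T defects, no sequential rollout
    return float(np.sum(residual ** 2)), residual

def grad_actions(residual):
    """Row t is dL/da_t = 2 B^T (F(s_t, a_t) - s_{t+1}); only residual t is involved."""
    return 2.0 * residual @ B

T = 10
rng = np.random.default_rng(0)
U = rng.normal(size=(T, 1))          # arbitrary actions
S = rng.normal(size=(T + 1, 2))      # arbitrary (not yet feasible) virtual states
loss, residual = lifted_loss(S, U)
G = grad_actions(residual)           # shape (T, 1)
```

No term in `G` contains a product over time steps, which is why the conditioning no longer degrades with $T$ the way the rollout Jacobian does.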

Being able to optimize states directly also helps with exploration, as we can temporarily navigate through unphysical domains to find the optimal plan:

Collocation planning in BallNav — *Collocation-based planning allows us to directly perturb states and explore midpoints more effectively.*
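As a rough illustration of optimizing states and actions together, the sketch below runs plain gradient descent on the lifted objective for a hypothetical linear toy model. This is ordinary collocation with hand-written gradients, not GRASP's update rule; the function name `plan_lifted`, the step size, and the straight-line initialization are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy world model: F(s, a) = A s + B a.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])

def residuals(S, U):
    """Dynamics defects F(s_t, a_t) - s_{t+1} for all t at once."""
    return S[:-1] @ A.T + U @ B.T - S[1:]

def plan_lifted(s0, g, T=20, steps=500, lr=0.05):
    S = np.linspace(s0, g, T + 1)        # virtual states on a straight line from s0 to g
    U = np.zeros((T, B.shape[1]))
    history = []
    for _ in range(steps):
        r = residuals(S, U)
        history.append(float(np.sum(r ** 2)))
        grad_U = 2.0 * r @ B
        grad_S = np.zeros_like(S)
        grad_S[:-1] += 2.0 * r @ A       # contribution through the model's state input
        grad_S[1:] -= 2.0 * r            # contribution through the target s_{t+1}
        grad_S[0] = 0.0                  # s_0 stays fixed
        grad_S[-1] = 0.0                 # s_T = g stays fixed
        U -= lr * grad_U
        S -= lr * grad_S
    return S, U, history

s0 = np.array([0.0, 0.0])
g = np.array([1.0, 0.0])
S, U, history = plan_lifted(s0, g)
print(history[0], history[-1])           # lifted loss before and after optimization
```

The intermediate states are free variables here, so early iterates need not be dynamically feasible; feasibility is only approached as the loss goes to zero. The line marked "through the model's state input" is the state-input gradient that becomes problematic for deep models.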

However, lunch is never free. And indeed, especially for deep learning-based world models, there is a critical issue that makes the above optimization quite difficult in practice.

An issue for deep learning-based world models: sensitivity of state-input gradients

The tl;dr of this section is: directly optimizing states through a deep learning-based $F_{\theta}$
