RL without TD learning


By AI Maestro May 10, 2026 3 min read

Problem setting: off-policy RL

The setting here is off-policy reinforcement learning (RL). On-policy RL can use only fresh data collected by the current policy, whereas off-policy RL can learn from any available data, regardless of which policy collected it. This flexibility makes it particularly useful when collecting new data is expensive.

Two paradigms in value learning: Temporal Difference (TD) and Monte Carlo (MC)

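In off-policy RL, we typically train a value function with temporal difference (TD) learning, of which Q-learning is the best-known example. TD learning moves the value of a state-action pair toward the immediate reward plus the discounted value predicted at the next state, that is, it bootstraps: $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$. The problem is that any error in the next-state value propagates into the current value through bootstrapping, so errors accumulate over the horizon and TD learning struggles to scale to long-horizon tasks.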
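To make the bootstrapping step concrete, here is a minimal tabular sketch of one Q-learning update. The function name, the toy table size, and the learning rate are illustrative assumptions, not part of any particular library or of the method discussed later.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, gamma=0.99, lr=0.5):
    """One tabular Q-learning (TD) update on a single transition."""
    # Bootstrapped target: immediate reward plus the discounted best
    # value currently predicted at the next state (zero if terminal).
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += lr * (target - Q[s, a])
    return Q

# Toy usage (hypothetical sizes): 3 states, 2 actions.
Q = np.zeros((3, 2))
Q = q_learning_update(Q, s=1, a=0, r=1.0, s_next=2, done=True)
Q = q_learning_update(Q, s=0, a=1, r=0.0, s_next=1, done=False)
```

Note that the second update uses the estimate produced by the first one as part of its target. Any error in `Q[1]` is copied, discounted, into `Q[0]`, which is the error-accumulation mechanism described above.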

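To mitigate this, we can mix TD with Monte Carlo (MC) returns. For example, $n$-step TD learning uses the actual returns from the dataset for the first $n$ steps and bootstraps only after that, which cuts the number of Bellman recursions by a factor of $n$. This works reasonably well, but it still has issues: as $n$ grows, the targets have higher variance and become suboptimal, because they reflect the policy that collected the data rather than the best one, and $n$ has to be tuned for each task.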
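A minimal sketch of the $n$-step target may help here. The function name and arguments are hypothetical; `bootstrap_value` stands for whatever the current value estimate is at the state reached after $n$ steps.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99, n=3):
    """n-step TD target: the first n discounted rewards, then a bootstrap.

    rewards: rewards r_t, r_{t+1}, ... observed along a dataset trajectory.
    bootstrap_value: current value estimate at the state reached after
        n steps (pass 0.0 if the trajectory ended before then).
    """
    n = min(n, len(rewards))
    # Monte Carlo part: actual rewards from the data, no bootstrapping.
    mc_part = sum(gamma**i * rewards[i] for i in range(n))
    # TD part: a single bootstrapped estimate, discounted by gamma^n.
    return mc_part + gamma**n * bootstrap_value

one_step = n_step_target([1.0, 1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5, n=1)
two_step = n_step_target([1.0, 1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5, n=2)
```

With $n = 1$ this reduces to the usual one-step TD target, and as $n$ grows the target relies more on the rewards actually seen in the data and less on the bootstrapped estimate.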

The “Third” Paradigm: Divide and Conquer

My claim is that a third paradigm in value learning, divide and conquer, may provide an ideal solution to off-policy RL. This approach splits a trajectory into smaller segments and combines the values of those segments to update the value of the full trajectory.

The key idea is that combining segment values in this way makes the number of Bellman recursions logarithmic, rather than linear, in the horizon, so long-horizon tasks can be handled more effectively without tuning a hyperparameter like $n$. It also prevents the overestimation issues typically associated with max operations in TD learning.
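The logarithmic claim can be illustrated on a toy problem. The sketch below is not the algorithm from this post; it assumes a deterministic chain of `T + 1` states with unit step costs and counts how many sweeps each style of update needs before the distance from the first state to the last is known. All names are illustrative.

```python
import numpy as np

def min_plus(A, B):
    """Min-plus product: out[i, j] = min over k of A[i, k] + B[k, j]."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def sweeps_until_solved(T, divide_and_conquer):
    # Chain of T + 1 states; each step moves from state i to i + 1 at cost 1.
    step = np.full((T + 1, T + 1), np.inf)
    np.fill_diagonal(step, 0.0)
    step[np.arange(T), np.arange(1, T + 1)] = 1.0

    dist, sweeps = step.copy(), 0
    while not np.isfinite(dist[0, T]):
        # TD-style: extend every known path by one environment step.
        # Divide and conquer: join two known paths at a midpoint k.
        left = dist if divide_and_conquer else step
        dist = min_plus(left, dist)
        sweeps += 1
    return sweeps

td_sweeps = sweeps_until_solved(64, divide_and_conquer=False)
dc_sweeps = sweeps_until_solved(64, divide_and_conquer=True)
```

In this toy setting the one-step update needs a number of sweeps that grows linearly with `T`, while the midpoint-joining update doubles the covered path length on every sweep and therefore needs about $\log_2 T$ of them.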

A practical algorithm

In recent work co-led with Aditya, we made meaningful progress toward realizing and scaling up this divide-and-conquer idea. Specifically, we were able to scale it to highly complex tasks in at least one important class of RL problems: goal-conditioned RL.

In goal-conditioned RL, the problem structure lends itself naturally to divide and conquer: the value of reaching a goal $g$ from a state $s$ can be updated from the values of two shorter segments, one from $s$ to an intermediate subgoal $w$ and one from $w$ to $g$. To choose the best subgoal $w$, we restrict the search to states that appear in the dataset and use expectile regression as a soft maximization.
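As a rough sketch of how expectile regression can stand in for a hard maximum over subgoals, consider the following. It is not the implementation from our work. It assumes a value of the form $V(s, g) = \gamma^{d(s, g)}$, where $d$ is the number of steps needed to reach $g$ from $s$, so that the values of two segments combine by multiplication; the candidate numbers and the expectile level `kappa` are made up for illustration.

```python
import numpy as np

def expectile_loss(pred, targets, kappa=0.9):
    """Asymmetric squared loss: under-predictions are weighted by kappa."""
    diff = targets - pred
    weight = np.where(diff > 0, kappa, 1.0 - kappa)
    return np.mean(weight * diff**2)

def expectile(targets, kappa=0.9, iters=100):
    """Scalar minimizer of expectile_loss via iteratively reweighted averaging."""
    pred = float(np.mean(targets))
    for _ in range(iters):
        weight = np.where(targets > pred, kappa, 1.0 - kappa)
        pred = float(np.sum(weight * targets) / np.sum(weight))
    return pred

# Hypothetical subgoals w lying between s and g on a dataset trajectory,
# with made-up current estimates of V(s, w) and V(w, g).
v_s_w = np.array([0.9, 0.7, 0.5])
v_w_g = np.array([0.4, 0.6, 0.8])
targets = v_s_w * v_w_g  # assumed combination rule: V(s, w) * V(w, g)

soft_max = expectile(targets, kappa=0.9)
plain_mean = expectile(targets, kappa=0.5)
```

With `kappa = 0.5` the expectile is just the mean of the candidate targets, and as `kappa` approaches 1 it moves toward the largest one. Regressing onto targets from in-dataset subgoals with a high expectile therefore approximates picking the best subgoal without taking an explicit max.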

Does it work well?



[Figures: the humanoidmaze and puzzle tasks]

To see whether our method (TRL) scales to complex tasks, we evaluated it directly on some of the most challenging tasks in OGBench: humanoidmaze and puzzle, with large, 1B-sized datasets. These tasks require performing highly complex skills over as many as 3,000 environment steps.

The results are quite exciting! Compared to many strong baselines across different categories (TD, MC, quasimetric learning, etc.), TRL achieves the best performance on most tasks.

This is my favorite plot. We compared TRL with $n$-step TD learning for various values of $n$, from $1$ (pure TD) to $\infty$ (pure MC). The result is really nice: TRL matches the best TD-$n$ performance on all tasks, without needing to set $\boldsymbol{n}$. This is exactly what we wanted from the divide-and-conquer paradigm.

What’s next?

In this post, I shared some promising results. However, there are still many open questions and directions for future work. For example:

  • How can we further enhance the divide-and-conquer approach to handle even more complex tasks?
  • Can we apply this method to other types of RL problems beyond goal-conditioned RL?
  • What are the theoretical guarantees for the convergence and stability of these methods?
