RL without TD learning

By AI Maestro June 26, 2026 3 min read

A new reinforcement learning algorithm, Transitive RL (TRL), has been developed to handle complex tasks lasting up to 3,000 steps without relying on temporal difference (TD) learning. This approach uses a divide and conquer strategy to scale off-policy learning where previous methods failed.

Problem setting: off-policy RL

Reinforcement learning algorithms fall into two categories based on the data they consume. On-policy methods, such as PPO and GRPO, require fresh data collected by the current policy. This means data gathered by earlier versions of the policy must be discarded after each update.

Off-policy RL removes this restriction. It allows the use of any available data, including historical experience, human demonstrations, or internet sources. This flexibility is vital for domains where gathering new data is expensive, such as robotics, healthcare, and dialogue systems. While on-policy scaling recipes are now common, a scalable off-policy algorithm for complex tasks has remained elusive.

Two paradigms in value learning: Temporal Difference (TD) and Monte Carlo (MC)

Traditional off-policy training relies on temporal difference (TD) learning, often seen in Q-learning. The update rule follows this pattern:

$$Q(s, a) \gets r + \gamma \max_{a'} Q(s', a')$$

The flaw lies in bootstrapping: the update target itself depends on the current estimate at the next state. Errors in the next value propagate back to the current one and accumulate over the entire task horizon. This accumulation prevents TD learning from scaling to long tasks.
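To make the bootstrapping step concrete, here is a minimal tabular sketch of the update above. The function name `td_update`, the learning rate `alpha`, and the toy states are illustrative assumptions, not part of TRL or of any particular library.

```python
from collections import defaultdict


def td_update(Q, s, a, r, s_next, actions, gamma=0.99, alpha=0.5, done=False):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    # The bootstrap term reuses the current (possibly wrong) estimate at s_next.
    bootstrap = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * bootstrap
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical two-state example.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 1.0  # an existing estimate; any error in it leaks into s0
new_value = td_update(Q, "s0", "right", r=0.0, s_next="s1", actions=actions)
```

Whatever error sits in `Q[("s1", "right")]` is copied, discounted, into the estimate for `s0`. Over a long task this copying happens once per step along the chain.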

To address this, researchers mixed TD learning with Monte Carlo (MC) returns. An n-step TD approach uses actual returns for the first n steps before bootstrapping. This reduces the number of error-prone recursions.

In the extreme case where n is infinite, the method becomes pure Monte Carlo value learning. Mixing in MC returns often works well in practice, but it does not fundamentally solve error accumulation: it only cuts the number of Bellman recursions by a constant factor of n. Furthermore, increasing n raises variance and introduces suboptimality, so n must be tuned carefully for each task.
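As a rough illustration of the n-step idea, the sketch below builds a target from n observed rewards followed by a single bootstrapped value. The helper name `n_step_target` and its arguments are assumptions made for this example.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """n-step TD target: discounted sum of n real rewards, then one bootstrap.

    `rewards` holds the n rewards actually observed after the current state;
    `bootstrap_value` is the value estimate at the state reached after n steps.
    """
    n = len(rewards)
    mc_part = sum(gamma ** i * r for i, r in enumerate(rewards))
    return mc_part + gamma ** n * bootstrap_value


# Two observed rewards, then bootstrap from an estimate of 10.0.
target = n_step_target([1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

With more rewards in the list, the bootstrapped estimate is discounted more heavily and matters less, but the target inherits the randomness of every sampled reward. That is the variance trade-off described above.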

The third paradigm: divide and conquer

A different approach, divide and conquer, offers a solution that scales to arbitrarily long tasks. The method splits a trajectory into two equal segments and combines their values to update the full trajectory.

This makes the number of Bellman recursions grow logarithmically with the horizon rather than linearly. It also avoids the need to tune hyperparameters like n and mitigates the variance issues found in n-step TD learning.
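A back-of-the-envelope comparison shows how large the gap is. The sketch below counts the chained value updates that separate a horizon-length task from single-step segments under each scheme. The function names and the choice n = 10 are illustrative assumptions.

```python
import math


def chained_updates_n_step(horizon, n=1):
    """Bootstrapped updates needed to span `horizon` steps with n-step TD (n=1 is plain TD)."""
    return math.ceil(horizon / n)


def chained_updates_divide_and_conquer(horizon):
    """Number of halvings needed to reduce a `horizon`-step segment to single steps."""
    depth = 0
    length = horizon
    while length > 1:
        length = math.ceil(length / 2)  # split into two (near-)equal halves
        depth += 1
    return depth


horizon = 3000
td_chain = chained_updates_n_step(horizon)            # 3000
ten_step_chain = chained_updates_n_step(horizon, 10)  # 300
dc_chain = chained_updates_divide_and_conquer(horizon)  # 12
```

For a 3,000-step task, plain TD chains 3,000 updates and 10-step TD chains 300, while repeated halving needs only 12 levels.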

A practical algorithm

Recent work co-led with Aditya has made this concept practical for goal-conditioned RL. In this setting, the policy must learn to reach any state from any other state. Shortest-path distances between states satisfy a triangle inequality: the direct distance from one state to another is never greater than the distance via an intermediate state.

This allows a transitive Bellman update rule. The value of reaching a goal can be updated using two smaller values: the value to reach an intermediate subgoal and the value to reach the final goal from that subgoal.
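One way to write this down, as a sketch under stated assumptions: let d(s, g) be the shortest-path distance from state s to goal g, let w be an intermediate subgoal, and parametrize the value as V(s, g) = γ^{d(s, g)} with 0 < γ < 1. This parametrization and the symbols d and w are assumptions made for illustration; the article does not spell out the exact form used in TRL.

```latex
% Triangle inequality on distances becomes a product inequality on values,
% because 0 < \gamma < 1 reverses the direction of the inequality.
d(s, g) \le d(s, w) + d(w, g)
\quad\Longrightarrow\quad
V(s, g) \ge V(s, w)\, V(w, g),
\qquad V(s, g) := \gamma^{d(s, g)}

% Transitive update: combine two shorter-horizon values through the best subgoal.
V(s, g) \gets \max_{w} \; V(s, w)\, V(w, g)
```

Equality holds when w lies on a shortest path from s to g, which is why the maximum over subgoals recovers the true value.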

The problem

Identifying the optimal subgoal in practice is difficult. In tabular settings, one could enumerate all states to find the best midpoint, similar to the Floyd-Warshall algorithm. However, continuous environments have large state spaces where this is impossible.
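For intuition, the tabular case can be sketched with the classic Floyd-Warshall recursion, which tries every state as a midpoint. This is the textbook shortest-path algorithm on a hypothetical four-state graph, not TRL itself; the function name and the unit edge costs are assumptions.

```python
import math


def all_pairs_steps(n_states, edges):
    """Floyd-Warshall: fewest steps between every pair of states, trying each midpoint w."""
    d = [[math.inf] * n_states for _ in range(n_states)]
    for s in range(n_states):
        d[s][s] = 0
    for u, v in edges:
        d[u][v] = 1  # one environment step per directed transition
    for w in range(n_states):          # candidate midpoint (subgoal)
        for s in range(n_states):      # start state
            for g in range(n_states):  # goal state
                via_w = d[s][w] + d[w][g]
                if via_w < d[s][g]:
                    d[s][g] = via_w
    return d


# Hypothetical chain of states 0 -> 1 -> 2 -> 3.
dist = all_pairs_steps(4, [(0, 1), (1, 2), (2, 3)])
```

The triple loop costs on the order of the number of states cubed, which is exactly what becomes infeasible once states are continuous.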

The solution

The new method restricts the search for subgoals to states present in the dataset trajectory. Instead of finding a strict maximum, it computes a soft maximum using expectile regression.

The algorithm minimizes a loss function over tuples of states found in the data. This approach has two benefits. It avoids searching the entire state space, and it prevents value overestimation caused by the max operator. The resulting method is called Transitive RL (TRL).
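To show how expectile regression acts as a soft maximum, here is a small sketch using the generic asymmetric squared loss. The exact loss used in TRL is not given in this article, so the function names, the value of `tau`, and the candidate numbers below are all illustrative assumptions.

```python
def expectile_loss(prediction, target, tau=0.9):
    """Asymmetric squared error: under-predictions are weighted by tau, over-predictions by 1 - tau."""
    diff = target - prediction
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def expectile(targets, tau=0.9, iters=200):
    """Value minimizing the summed expectile loss over `targets`, via reweighted averaging."""
    m = sum(targets) / len(targets)
    for _ in range(iters):
        weights = [tau if x > m else 1.0 - tau for x in targets]
        m = sum(w * x for w, x in zip(weights, targets)) / sum(weights)
    return m


# Hypothetical products V(s, w) * V(w, g) for three subgoals w found on one trajectory.
candidates = [0.2, 0.5, 0.9]
mean_like = expectile(candidates, tau=0.5)   # tau = 0.5 recovers the ordinary mean
soft_max = expectile(candidates, tau=0.99)   # tau near 1 approaches the maximum
```

With `tau` close to 1 the fitted value sits just below the largest candidate without ever exceeding it. That is the sense in which the soft maximum avoids the overestimation a hard max can cause.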

Does it work well?

TRL was tested on OGBench, a benchmark for offline goal-conditioned RL. The evaluation focused on difficult tasks like humanoidmaze and puzzle, using datasets of 1 billion samples.

These tasks require performing complex skills across up to 3,000 environment steps. TRL achieved the best performance on most tasks compared to strong baselines, including TD, MC, and quasimetric learning methods.

When compared to n-step TD learning, TRL matched the best result obtained over all tested values of n. Crucially, it did so without n having to be set at all. By recursively splitting trajectories, the algorithm naturally handles long horizons without arbitrary chunking.

What it means

For practitioners, the shift away from TD learning removes the need to tune the n parameter for every new task. This makes off-policy learning more robust for long-horizon applications where data collection is costly.
