NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule

By AI Maestro · May 24, 2026 · 8 min read

Linear attention replaces the unbounded KV cache of softmax attention with a fixed-size recurrent state. This cuts sequence mixing to linear time and decoding to constant memory. The hard part is not what to forget. It is how to edit a compressed memory without scrambling existing associations.

NVIDIA has released Gated DeltaNet-2, a linear attention layer that targets that bottleneck. The model decouples the active memory edit into two channel-wise gates. It is trained at 1.3B parameters on 100B FineWeb-Edu tokens. It outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 across the paper's benchmark suite.

The scalar gate problem in delta-rule models

A recurrent linear attention layer stores a matrix state S_t and reads it with the query. DeltaNet adds an active edit by subtracting the value currently associated with the current key. It uses a scalar step size β_t to control how much to overwrite. Mamba-2 adds a data-dependent scalar decay α_t for global forgetting. Gated DeltaNet combines both operations, but both gates remain scalar per head.

Kimi Delta Attention (KDA) refines the decay side. It replaces the scalar α_t with a channel-wise vector. KDA still keeps a single scalar β_t for the active edit. That scalar controls two different things at once. It decides how much old content to erase on the key side. It also decides how much new content to commit on the value side. These two decisions act on different axes of the state. Tying them together is a modeling restriction, not a property of the delta rule.

https://github.com/NVlabs/GatedDeltaNet-2/blob/main/paper/GDN2_paper.pdf

Gated Delta Rule-2: two gates instead of one

Gated DeltaNet-2 separates the two decisions through Gated Delta Rule-2. It introduces a channel-wise erase gate b_t ∈ [0,1]^{d_k} on the key axis. It also introduces a channel-wise write gate w_t ∈ [0,1]^{d_v} on the value axis. Both gates are produced by sigmoid projections of the token representation. The update applies decay before the active edit.

Written compactly, the recurrence is:

S_t = (I − k_t (b_t ⊙ k_t)^⊤) D_t S_{t−1} + k_t (w_t ⊙ v_t)^⊤

Here D_t = Diag(α_t) is the channel-wise decay carried over from KDA. The left factor of the erase matrix stays k_t, preserving the delta-rule write direction. The right factor becomes (b_t ⊙ k_t)^⊤, making the read direction channel-selective. The write term k_t z_t^⊤ uses z_t = w_t ⊙ v_t, making the value update channel-selective.
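As a rough illustration, a single step of this recurrence fits in a few lines of NumPy. This is a minimal sketch, not the repository's fused Triton kernel: the function name gdr2_step, the (d_k, d_v) state layout, and the toy sizes are assumptions made for this example, and the outer products are written out explicitly.

```python
import numpy as np

def gdr2_step(S_prev, k, v, alpha, b, w):
    """One step of S_t = (I - k (b*k)^T) Diag(alpha) S_{t-1} + k (w*v)^T.

    S_prev: (d_k, d_v) state; k, alpha, b: (d_k,); v, w: (d_v,).
    """
    d_k = k.shape[0]
    decayed = alpha[:, None] * S_prev          # Diag(alpha) @ S_prev
    erase = np.eye(d_k) - np.outer(k, b * k)   # I - k (b * k)^T
    write = np.outer(k, w * v)                 # k (w * v)^T
    return erase @ decayed + write

# Toy sizes (hypothetical); the key is unit-norm, as after L2 normalization.
rng = np.random.default_rng(0)
d_k, d_v = 4, 3
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)
v = rng.normal(size=d_v)
S0 = rng.normal(size=(d_k, d_v))
alpha = rng.uniform(0.9, 1.0, size=d_k)

# No decay, both gates fully open: plain delta rule.
S1 = gdr2_step(S0, k, v, np.ones(d_k), np.ones(d_k), np.ones(d_v))
# Both gates closed: only the channel-wise decay acts.
S2 = gdr2_step(S0, k, v, alpha, np.zeros(d_k), np.zeros(d_v))
```

With no decay and both gates fully open, reading the state with the same unit-norm key (S1.T @ k) returns the value just written. With both gates closed, the state is only decayed.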

When both gates collapse to the same scalar β_t, the update recovers KDA exactly. When the decay α_t also collapses to a scalar, it recovers Gated DeltaNet. Both prior models are preserved as tied subspaces of the new update.
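The collapse can be checked numerically. The sketch below assumes the scalar-gate update takes the form obtained by substituting b_t = β_t·1 and w_t = β_t·1 into the recurrence, namely S_t = (I − β_t k_t k_t^⊤) D_t S_{t−1} + β_t k_t v_t^⊤. The function names and toy sizes are hypothetical.

```python
import numpy as np

def gdr2_step(S_prev, k, v, alpha, b, w):
    # Gated Delta Rule-2 with channel-wise erase gate b and write gate w.
    decayed = alpha[:, None] * S_prev
    erase = np.eye(k.shape[0]) - np.outer(k, b * k)
    return erase @ decayed + np.outer(k, w * v)

def scalar_gate_step(S_prev, k, v, alpha, beta):
    # Single scalar beta shared by erase and write (KDA-style when alpha is
    # a vector, Gated DeltaNet-style when alpha is one repeated scalar).
    decayed = alpha[:, None] * S_prev
    return decayed - beta * np.outer(k, k @ decayed) + beta * np.outer(k, v)

rng = np.random.default_rng(1)
d_k, d_v = 6, 5
S0 = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
v = rng.normal(size=d_v)
alpha = rng.uniform(0.8, 1.0, size=d_k)
beta = 0.37

# Tie both channel-wise gates to the same scalar.
tied = gdr2_step(S0, k, v, alpha, np.full(d_k, beta), np.full(d_v, beta))
kda_like = scalar_gate_step(S0, k, v, alpha, beta)

# Additionally tie the decay to one scalar per head.
a = 0.95
tied_scalar_decay = gdr2_step(S0, k, v, np.full(d_k, a),
                              np.full(d_k, beta), np.full(d_v, beta))
gdn_like = scalar_gate_step(S0, k, v, np.full(d_k, a), beta)

max_diff = max(np.abs(tied - kda_like).max(),
               np.abs(tied_scalar_decay - gdn_like).max())
```

Both differences are zero up to floating-point error, which is the sense in which the earlier models sit inside the new update as tied settings of the gates.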

In the fast-weight view, Gated Delta Rule-2 is one online gradient step on a local regression loss. The update keeps the new state close to the decayed memory, while the residual edit uses a gated read and a gated write target.
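One way to see the "residual edit" reading is to expand the recurrence algebraically: the new state equals the decayed state minus k_t times the gap between a gated read of that state and a gated write target. The sketch below only verifies this rewriting and does not reproduce the paper's loss. The function names are hypothetical.

```python
import numpy as np

def matrix_form(S_prev, k, v, alpha, b, w):
    # S_t = (I - k (b*k)^T) Diag(alpha) S_{t-1} + k (w*v)^T
    decayed = alpha[:, None] * S_prev
    return (np.eye(k.shape[0]) - np.outer(k, b * k)) @ decayed + np.outer(k, w * v)

def residual_form(S_prev, k, v, alpha, b, w):
    decayed = alpha[:, None] * S_prev   # memory after forgetting
    gated_read = (b * k) @ decayed      # what the erase gate reads, shape (d_v,)
    gated_target = w * v                # what the write gate commits, shape (d_v,)
    # Correct the decayed memory along k by the read/target residual.
    return decayed - np.outer(k, gated_read - gated_target)

rng = np.random.default_rng(2)
d_k, d_v = 5, 4
S0 = rng.normal(size=(d_k, d_v))
k, v = rng.normal(size=d_k), rng.normal(size=d_v)
alpha = rng.uniform(0.8, 1.0, size=d_k)
b = rng.uniform(size=d_k)
w = rng.uniform(size=d_v)

same = np.allclose(matrix_form(S0, k, v, alpha, b, w),
                   residual_form(S0, k, v, alpha, b, w))
```

When the gated read already matches the gated target, the residual vanishes and the step leaves the decayed memory untouched.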

Chunkwise training and gate-aware backward

The recurrence admits a chunkwise WY form that matches the structure used by KDA. Cumulative channel-wise decay is absorbed into the two factors of each rank-one erase. The per-chunk update becomes a product of asymmetric matrices of the form I − k̄_r ē_r^⊤. The implementation uses chunk size C = 64 with fused Triton kernels.
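The reason chunking works can be sketched without the WY machinery. Each step is affine in the state, so a whole chunk collapses to one transition matrix P and one accumulated write U that do not depend on the incoming state. The NumPy sketch below materializes P naively, which the WY form is designed to avoid. It is an illustration under that simplification, not the paper's algorithm or kernel, and all names and sizes are hypothetical.

```python
import numpy as np

def transition(k, alpha, b):
    # A_t = (I - k (b*k)^T) Diag(alpha); scaling columns applies Diag(alpha).
    return (np.eye(k.shape[0]) - np.outer(k, b * k)) * alpha[None, :]

def chunk_summary(K, V, Alpha, B, W):
    """Summarize a chunk as (P, U) so that S_end = P @ S_start + U."""
    C, d_k = K.shape
    P = np.eye(d_k)                    # product of per-step transitions
    U = np.zeros((d_k, V.shape[1]))    # accumulated, already-propagated writes
    for t in range(C):
        A = transition(K[t], Alpha[t], B[t])
        P = A @ P
        U = A @ U + np.outer(K[t], W[t] * V[t])
    return P, U

rng = np.random.default_rng(3)
C, d_k, d_v = 8, 4, 3                  # toy chunk; the article's C is 64
K = rng.normal(size=(C, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.normal(size=(C, d_v))
Alpha = rng.uniform(0.9, 1.0, size=(C, d_k))
B = rng.uniform(size=(C, d_k))
W = rng.uniform(size=(C, d_v))
S_start = rng.normal(size=(d_k, d_v))

# Reference: run the recurrence token by token.
S_seq = S_start.copy()
for t in range(C):
    S_seq = transition(K[t], Alpha[t], B[t]) @ S_seq + np.outer(K[t], W[t] * V[t])

P, U = chunk_summary(K, V, Alpha, B, W)
S_chunk = P @ S_start + U
```

Because (P, U) can be computed for every chunk independently, only the short chain of chunk-to-chunk state updates remains sequential.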

For the backward pass, the scalar shortcut used by KDA no longer applies. The write side contains a different diagonal gate over value channels. The erase side contains a different diagonal gate over key channels. So the gate factors must appear inside the dot products that accumulate gradients. The paper derives this gate-aware vector-Jacobian product explicitly. On Hopper GPUs, the fused WY backward kernel is restricted to two and four warps to avoid a Triton WGMMA layout assertion.

Block design and hybrid model

Gated DeltaNet-2 is used as the recurrent token mixer in a standard Transformer-style block. Query and key paths use linear projection, short causal convolution, SiLU, and L2 normalization. The value path uses linear projection, short convolution, and SiLU. The decay α_t, erase gate b_t, and write gate w_t come from separate linear branches. The recurrent output is RMS-normalized, multiplied by a SiLU output gate, and projected back.
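A simplified view of the per-token inputs to the recurrence might look like the sketch below. It is not the lit_gpt module: the short causal convolution is omitted, the weight names are hypothetical, and squashing the decay with a sigmoid is an assumption (the article only states that the erase and write gates are sigmoid projections).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    return x * sigmoid(x)

def l2_normalize(x, eps=1e-6):
    return x / np.sqrt((x * x).sum(axis=-1, keepdims=True) + eps)

def token_inputs(x, params):
    """Map one token representation x (d_model,) to the recurrence inputs."""
    q = l2_normalize(silu(x @ params["W_q"]))   # conv omitted in this sketch
    k = l2_normalize(silu(x @ params["W_k"]))   # conv omitted in this sketch
    v = silu(x @ params["W_v"])                 # conv omitted in this sketch
    alpha = sigmoid(x @ params["W_alpha"])      # decay branch (sigmoid assumed)
    b = sigmoid(x @ params["W_b"])              # erase gate, key axis
    w = sigmoid(x @ params["W_w"])              # write gate, value axis
    return q, k, v, alpha, b, w

rng = np.random.default_rng(4)
d_model, d_k, d_v = 32, 8, 8                    # toy sizes
shapes = {"W_q": d_k, "W_k": d_k, "W_v": d_v,
          "W_alpha": d_k, "W_b": d_k, "W_w": d_v}
params = {name: rng.normal(scale=0.1, size=(d_model, out))
          for name, out in shapes.items()}

q, k, v, alpha, b, w = token_inputs(rng.normal(size=d_model), params)
```

Keeping the three gate branches separate is what lets the erase and write decisions differ per channel instead of sharing one scalar.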

A hybrid variant inserts Sliding-Window Attention (SWA) after the recurrent mixer. A repeated cell contains Gated DeltaNet-2, an MLP, SWA, and another MLP. SWA handles exact local interactions, while the recurrent mixer compresses long histories. The hybrid retains linear sequence scaling with a bounded attention cache.
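The repeated cell and its bounded cache can be written down schematically. The layer labels below are placeholders rather than class names from the repository, and the 2K window is the value the article reports for the hybrid models.

```python
def hybrid_layout(num_cells):
    """Layer order for a stack of hybrid cells (labels are placeholders)."""
    cell = ["GatedDeltaNet-2", "MLP", "SWA", "MLP"]
    return cell * num_cells

def swa_cache_tokens(seq_len, window=2048):
    """Tokens a sliding-window attention layer must keep while decoding."""
    return min(seq_len, window)

layout = hybrid_layout(2)
cache_short = swa_cache_tokens(1_000)     # shorter than the window
cache_long = swa_cache_tokens(100_000)    # capped at the window size
```

However long the sequence grows, the attention cache stops at the window size, while the recurrent state stays a fixed-size matrix.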

Results at 1.3B parameters

All models are 1.3B parameters trained on 100B FineWeb-Edu tokens. Parameter count and recurrent state size are matched across models. The recurrent state holds 262,144 floats per layer per batch element. Training length is 4K tokens, and hybrid models use a 2K SWA window. The Mamba-3 MIMO baseline uses rank R = 4.

On language modeling and commonsense reasoning, Gated DeltaNet-2 has the best average in both settings. The recurrent model averages 53.11 across LAMBADA and the reasoning suite. That sits above Mamba-3 MIMO at 52.39 and KDA at 52.28. In the hybrid setting, Gated DeltaNet-2 averages 53.97 against Mamba-3 MIMO at 52.72. Since recurrent state size is matched, the gain points to the update rule, not more memory.

The clearest gains appear on RULER long-context retrieval. In the recurrent setting, S-NIAH-2 at 4K rises from 89.0 (KDA) to 93.0. S-NIAH-3 at 2K jumps from 63.2 (KDA) to 89.8. MK-NIAH-1 at 4K climbs from 28.0 (KDA) to 37.8.

On real-world retrieval (SWDE, SQuAD, FDA, TriviaQA, NQ, DROP), Gated DeltaNet-2 also leads both settings. The recurrent average is 29.88 and the hybrid average is 42.28.

Marktechpost’s Visual Explainer


Gated DeltaNet-2 · Quickstart

NVIDIA · 2026

Gated DeltaNet-2

Decoupling Erase and Write in Linear Attention. A delta-rule recurrent attention layer with channel-wise erase and write gates.

PyTorch · Triton kernels · 1.3B params · 100B FineWeb-Edu tokens
Authors: Ali Hatamizadeh, Yejin Choi, Jan Kautz
Repo: github.com/NVlabs/GatedDeltaNet-2
License: NVIDIA Source Code License-NC

Step 01 · The Idea

Two gates instead of one scalar

Linear attention compresses an unbounded KV cache into a fixed-size recurrent state. Editing this memory without scrambling existing associations is the hard part.

The Problem

Prior delta-rule models (Gated DeltaNet, KDA) tie erasing old content and writing new content to one scalar gate β_t.

The Fix

Split it: a channel-wise erase gate b_t on the key axis, and a channel-wise write gate w_t on the value axis.

  • Erase gate picks which key-side coordinates of the decayed state are read and removed.
  • Write gate picks which value-side coordinates of the new content are committed.
  • Channel-wise decay is inherited from KDA for fine-grained global forgetting.

Step 02 · The Update Rule

The Gated Delta Rule-2

With erase gate b_t ∈ [0,1]^{d_k}, write gate w_t ∈ [0,1]^{d_v}, and channel-wise decay D_t = Diag(α_t), the recurrent state evolves as:

S_t = (I − k_t (b_t ⊙ k_t)^⊤) D_t S_{t−1} + k_t (w_t ⊙ v_t)^⊤

  • Recovers KDA exactly when both gates collapse to the same scalar.
  • Recovers Gated DeltaNet when the decay also collapses to a scalar.
  • Trains efficiently via a chunkwise WY form with channel-wise decay absorbed into asymmetric erase factors.

Step 03 · Get the Code

Clone the repo and build the environment

The official PyTorch implementation ships with a Dockerfile, training scripts, and the lit_gpt model definitions.

git clone https://github.com/NVlabs/GatedDeltaNet-2.git
cd GatedDeltaNet-2

# build the environment from the provided Dockerfile
docker build -t gdn2 .
docker run --gpus all -it --ipc=host -v $PWD:/workspace gdn2

Repo layout

lit_gpt/ model code · scripts/ launchers · pretrain.py training entry · data.py, cache.py data & KV cache · paper/ arXiv PDF

Step 04 · Launch Training

Run pretrain.py

The streamlined command from the official README. Replace placeholders with your dataset paths and config name.

python ../pretrain.py \
  --train_data_dir ${TRAIN_DATA} \
  --val_data_dir ${VALIDATION_DATA} \
  --output_root ${SAVE_DIR} \
  --exp_name ${NAME} \
  --model_name ${MODEL} \
  --train_config ${CONFIG} \
  --eval_iters ${EVAL_ITERS} \
  --learning_rate ${LR} \
  --micro_batch_size ${MICRO_BATCH_SIZE}

Pro tip

Add --interactive_job --debug for an interactive debugging session.

Step 05 · Default Recipe

The 1.3B / 100B FineWeb-Edu setup

Matched against Mamba-2, Gated DeltaNet, KDA, and Mamba-3 baselines under identical optimizer settings and recurrent state size.

Optimizer

AdamW · peak LR 4e-4 · weight decay 0.1 · gradient clip 1.0 · cosine schedule · 1B-token warmup.

Batch & Sequence

Global batch 0.5M tokens · sequence length 4K · hybrid models use a 2K sliding-window attention size.
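For a sense of scale, the token budgets above translate into step counts and a learning-rate curve roughly as follows. The sketch assumes "0.5M" means exactly 500,000 tokens per step, a linear warmup, and a cosine decay to zero. None of these details are stated in the article, and lr_at is a hypothetical helper, not a function from the repository.

```python
import math

PEAK_LR = 4e-4
TOTAL_TOKENS = 100_000_000_000   # 100B training tokens
WARMUP_TOKENS = 1_000_000_000    # 1B-token warmup
BATCH_TOKENS = 500_000           # assumption: 0.5M read as 500,000

TOTAL_STEPS = TOTAL_TOKENS // BATCH_TOKENS     # 200,000 optimizer steps
WARMUP_STEPS = WARMUP_TOKENS // BATCH_TOKENS   # 2,000 warmup steps

def lr_at(step, min_lr=0.0):
    """Linear warmup to PEAK_LR, then cosine decay to min_lr (both assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the warmup covers 1% of the run, so almost all of training happens on the cosine part of the curve.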

Model Shape

16 heads · d_k = d_v = 128 · per-layer recurrent state 262,144 floats, matched against Mamba-2/3.
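The state-size figure follows directly from the head configuration: one d_k × d_v matrix per head. The short check below also contrasts it with a softmax-attention KV cache using the same head shape. That comparison is a hypothetical illustration, not a configuration from the paper.

```python
HEADS, D_K, D_V = 16, 128, 128

# One (d_k x d_v) matrix per head, independent of sequence length.
state_floats = HEADS * D_K * D_V

def kv_cache_floats(seq_len, heads=HEADS, d_k=D_K, d_v=D_V):
    """Keys plus values cached per layer by softmax attention (hypothetical)."""
    return seq_len * heads * (d_k + d_v)

kv_at_4k = kv_cache_floats(4096)
ratio_at_4k = kv_at_4k // state_floats
```

The recurrent state stays at 262,144 floats per layer per batch element, whereas the illustrative KV cache grows linearly with the number of tokens seen.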

Hybrid Block

Repeated cell: Gated DeltaNet-2 → MLP → SWA → MLP. The recurrent mixer compresses long histories; SWA handles local interactions.

Step 06 · Results

Numbers worth pasting into a comparison

Best average across language modeling and commonsense reasoning, with the largest gains on long-context retrieval.

Setting · Metric                      KDA     Mamba-3 MIMO   GDN-2
Recurrent avg. (LMB + reasoning)      52.28   52.39          53.11
Hybrid avg. (LMB + reasoning)         52.68   52.72          53.97
S-NIAH-3 @2K (recurrent)              63.2    72.4           89.8
MK-NIAH-1 @4K (recurrent)             28.0    18.0           37.8
Real-world recall, recurrent avg.     28.67   28.35          29.88
Real-world recall, hybrid avg.        40.14   40.11          42.28
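To compare the rows on equal footing, the margin of GDN-2 over the stronger of the two listed baselines can be computed directly from the figures above. The snippet copies the numbers as reported and introduces no new results. The variable names are hypothetical.

```python
# (KDA, Mamba-3 MIMO, GDN-2) per row, copied from the table above.
results = {
    "Recurrent avg. (LMB + reasoning)": (52.28, 52.39, 53.11),
    "Hybrid avg. (LMB + reasoning)": (52.68, 52.72, 53.97),
    "S-NIAH-3 @2K (recurrent)": (63.2, 72.4, 89.8),
    "MK-NIAH-1 @4K (recurrent)": (28.0, 18.0, 37.8),
    "Real-world recall, recurrent avg.": (28.67, 28.35, 29.88),
    "Real-world recall, hybrid avg.": (40.14, 40.11, 42.28),
}

# Margin of GDN-2 over the better baseline in each row, in points.
margins = {
    name: round(gdn2 - max(kda, mamba3), 2)
    for name, (kda, mamba3, gdn2) in results.items()
}

largest = max(margins, key=margins.get)

for name, margin in margins.items():
    print(f"{name:36s} +{margin:.2f}")
```

The averages move by roughly one to two points, while the two RULER rows account for the largest margins, consistent with the article's emphasis on long-context retrieval.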

Step 07 · Resources

Paper, code, and citation

Everything you need to read, run, and cite Gated DeltaNet-2 in one place.

@article{hatamizadeh2026gdn2,
  title   = {Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention},
  author  = {Hatamizadeh, Ali and Choi, Yejin and Kautz, Jan},
  journal = {arXiv preprint},
  year    = {2026}
}










Key Takeaways

  • Gated DeltaNet-2 splits the scalar β_t into a channel-wise erase gate b_t (key axis) and a channel-wise write gate w_t (value axis).
  • The update recovers KDA when both gates collapse to one scalar, and Gated DeltaNet when the decay collapses too.
  • Training stays parallel via a chunkwise WY form, with channel-wise decay absorbed into asymmetric erase factors and a gate-aware backward fused in Triton.
  • At 1.3B params on 100B FineWeb-Edu with matched state size, it has the best average over Mamba-2, Gated DeltaNet, KDA, and Mamba-3 in both recurrent and hybrid settings.
  • Largest gains come on RULER long-context retrieval — S-NIAH-3 at 2K rises 63.2 → 89.8 and MK-NIAH-1 at 4K rises 28.0 → 37.8 over KDA (recurrent).



Originally published at marktechpost.com. Curated by AI Maestro.
