Scaling large language models (LLMs) involves significant costs. The primary reason is that most of these models rely heavily on feedforward layers, which account for over two-thirds of all model parameters and more than 80% of total FLOPs, even in larger models. A team from Sakana AI and NVIDIA has introduced a new approach that attacks this bottleneck by making the computation inside feedforward layers significantly cheaper through unstructured sparsity.
Sparsity Exists, But GPUs Ignore It
Within a transformer’s feedforward block, only a small fraction of hidden neurons actually fire for any given input token; the rest are zeroed out by the activation function. This phenomenon is called activation sparsity, and prior research has documented it in models with ReLU activations.
The frustrating reality is that the theoretical savings from activation sparsity rarely translate into actual speedups on modern GPUs. The hardware is optimized for dense matrix multiplications via Tensor Cores, while sparse formats such as ELLPACK (ELL) require additional kernel passes to convert activations from dense to sparse representations. These conversions often cancel out any potential savings.
Previous work on sparse LLM kernels, such as TurboSparse, ProSparse, and Q-Sparse, focused mainly on single-token inference operations (GEMV). The research team here tackles a fundamentally harder problem: batched GEMM operations with thousands of input tokens. This regime covers both training and high-throughput inference scenarios.
GUIDE
Sparser, Faster, Lighter LLMs — TwELL & Sparse CUDA Kernels
Sakana AI × NVIDIA — arXiv:2603.23198 — ICML 2026
01 — The Problem
Feedforward layers dominate LLM cost — and most of that work is wasted.
> ⅔
of all model parameters live in feedforward layers
80%+
of total FLOPs consumed by feedforward layers
99%+
of hidden activations can be zero with no accuracy drop
For any given token, only a tiny fraction of hidden neurons actually fire. The rest output zero after the activation function. This is called activation sparsity. Historically, this has been impossible to exploit on modern GPUs because sparse operations ran slower than dense ones.
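To make the effect concrete, here is a minimal PyTorch sketch (illustrative, not from the paper; all module and variable names are assumptions) that measures activation sparsity in a ReLU-gated feedforward block:

```python
# Minimal sketch: activation sparsity in a ReLU-gated FFN block.
import torch
import torch.nn as nn

d_model, d_hidden, n_tokens = 1024, 4096, 8

gate_proj = nn.Linear(d_model, d_hidden, bias=False)
up_proj = nn.Linear(d_model, d_hidden, bias=False)
down_proj = nn.Linear(d_hidden, d_model, bias=False)

x = torch.randn(n_tokens, d_model)

# ReLU zeroes every negative gate pre-activation, and the elementwise
# product inherits those exact zeros.
gate = torch.relu(gate_proj(x))
hidden = gate * up_proj(x)
out = down_proj(hidden)

sparsity = (hidden == 0).float().mean().item()
print(f"fraction of zero hidden activations: {sparsity:.1%}")
```

At random initialization roughly half the gate pre-activations are negative, so measured sparsity starts near 50%; the 99%+ figure in the callout above comes from training with the L1 penalty described in section 03.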
Prior work on sparse LLM kernels (TurboSparse, ProSparse, Q-Sparse) only targeted single-token GEMV operations. The research team instead tackles the harder problem: batched GEMM with thousands of tokens — the regime that covers both training and high-throughput inference.
02 — The Innovation
TwELL: a sparse format built around how GPU kernels actually work.
Old Way — ELL
Row-wise packing, costly to build
Standard ELLPACK packs non-zeros row-by-row across the entire matrix. To construct it from a tiled matmul output you need a separate kernel launch, a full global memory read, and synchronization across all CTAs. Those overheads cancel out the savings from skipping zeros.
New Way — TwELL
Tile-wise packing, built in the epilogue
TwELL partitions the columns into horizontal tiles that match the matmul kernel’s tile size T_n, and non-zeros are packed locally within each tile. Because the tile dimensions match, TwELL is constructed inside the existing gate projection kernel epilogue: no extra kernel, no extra memory read, no synchronization overhead.
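A NumPy sketch of the contrast (illustrative assumptions throughout; `pack_ell` and `pack_tilewise` are hypothetical helpers, and T_n is the tile width from the text):

```python
import numpy as np

def pack_ell(A):
    """Row-wise ELL: every row is padded to the global max non-zero count,
    which requires a pass over the whole matrix before packing can start."""
    rows, _ = A.shape
    width = max(int((A[r] != 0).sum()) for r in range(rows))  # global reduction
    vals = np.zeros((rows, width), dtype=A.dtype)
    idx = np.zeros((rows, width), dtype=np.int64)
    for r in range(rows):
        nz = np.flatnonzero(A[r])
        vals[r, :len(nz)] = A[r, nz]
        idx[r, :len(nz)] = nz
    return vals, idx

def pack_tilewise(A, T_n):
    """Tile-wise packing: columns are split into tiles of width T_n and each
    tile is packed independently, so a matmul epilogue that already holds one
    output tile can pack it locally, with no cross-tile synchronization."""
    return [pack_ell(A[:, c0:c0 + T_n]) for c0 in range(0, A.shape[1], T_n)]

A = np.random.randn(4, 16)
A[A < 0.8] = 0.0                 # fake a highly sparse activation matrix
ell = pack_ell(A)                # needs the whole matrix up front
tiles = pack_tilewise(A, T_n=4)  # each tile packable as it is produced
```

The point of the tile-wise variant is that each tile can be packed using only information already local to one CTA, which is exactly what an epilogue has in hand.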
The inference pipeline uses one fused kernel that reads gate activations in TwELL format and performs the up and down projections together. The intermediate hidden state is never written to global memory, cutting DRAM traffic on every forward pass.
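What the fused kernel computes can be sketched in NumPy (a semantic sketch under my own assumptions, not the paper’s kernel): for each token, only the hidden columns with non-zero gate values are ever touched, and the dense hidden vector is never materialized.

```python
import numpy as np

def fused_up_down(x_tok, idx, vals, W_up, W_down):
    # Up projection restricted to the active hidden columns only.
    up_active = x_tok @ W_up[:, idx]        # shape (nnz,)
    hidden_active = vals * up_active        # gated, still only nnz values
    # Down projection accumulates only the matching rows of W_down.
    return hidden_active @ W_down[idx, :]   # shape (d_model,)

d_model, d_hidden = 64, 256
rng = np.random.default_rng(0)
W_up = rng.standard_normal((d_model, d_hidden))
W_down = rng.standard_normal((d_hidden, d_model))
x = rng.standard_normal(d_model)

gate = np.maximum(rng.standard_normal(d_hidden), 0.0)  # stand-in ReLU gate
idx = np.flatnonzero(gate)                             # packed column indices
out_sparse = fused_up_down(x, idx, gate[idx], W_up, W_down)
out_dense = (gate * (x @ W_up)) @ W_down
assert np.allclose(out_sparse, out_dense)
```

The assert confirms that skipping the zeroed columns changes nothing numerically; the kernel-level win is that the equivalent of `hidden_active` stays in registers instead of round-tripping through global memory.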
For training, a hybrid sparse format dynamically routes rows into a compact ELL matrix (sparse rows) or a dense backup (overflow rows). Sparsity during training is highly non-uniform — max non-zeros per row can be orders of magnitude above the average — so the hybrid design handles this without becoming brittle.
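A sketch of the routing idea (my reading of the hybrid design; `hybrid_split` and the width threshold are illustrative):

```python
import numpy as np

def hybrid_split(A, ell_width):
    """Route each row by its non-zero count: rows that fit the compact ELL
    width take the sparse path; overflow rows fall back to a dense matrix."""
    nnz = (A != 0).sum(axis=1)
    sparse_rows = np.flatnonzero(nnz <= ell_width)
    dense_rows = np.flatnonzero(nnz > ell_width)  # overflow path
    return sparse_rows, dense_rows

A = np.random.randn(8, 32)
A[np.abs(A) < 1.5] = 0.0      # mostly sparse rows...
A[0] = np.random.randn(32)    # ...plus one pathological dense row
sparse_rows, dense_rows = hybrid_split(A, ell_width=4)
```

Without the dense backup, one pathological row would force the padded ELL width up for every row; routing it to the overflow path keeps the sparse path compact.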
03 — Training Recipe
Two changes to your training config. Nothing else.
01
Replace SiLU with ReLU as the gate activation function. ReLU produces exact zeros for negative inputs — this is what enables unstructured sparsity. No other architectural change is needed. (Unregularized ReLU sits slightly below SiLU on task accuracy: 46.4% vs 47.1% on the 1.5B model, offset by the efficiency gains.)
02
Add an L1 loss term on the hidden feedforward activations, averaged over all tokens and hidden dimensions across all layers. Recommended coefficient: λ_L1 = 2×10⁸. Add it to your standard cross-entropy loss. No changes to learning rate, weight decay, batch size, or optimizer. (Both changes are sketched in code after this list.)
03
Sparsity stabilizes fast. The non-zero count settles within ~1,000 training steps (~1B tokens). The training kernels deliver memory and throughput benefits for almost the entire training run, not just toward the end.
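A minimal PyTorch sketch of changes 01 and 02 together (the code promised above; everything except the L1 coefficient quoted in the text is illustrative boilerplate, and the class and function names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseGatedFFN(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        # Change 01: ReLU instead of SiLU, producing exact zeros in the gate.
        hidden = F.relu(self.gate_proj(x)) * self.up_proj(x)
        # Stash mean |activation| (averaged over tokens and hidden dims);
        # loss_fn reads it after the forward pass for the current batch.
        self.l1_term = hidden.abs().mean()
        return self.down_proj(hidden)

LAMBDA_L1 = 2e8  # coefficient quoted in section 03

def loss_fn(logits, targets, ffn_layers):
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # Change 02: L1 penalty on hidden activations, averaged across layers,
    # added straight onto cross-entropy. Optimizer settings stay unchanged.
    l1 = torch.stack([f.l1_term for f in ffn_layers]).mean()
    return ce + LAMBDA_L1 * l1
```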
Watch Out
At λ_L1 = 2×10⁸, over 30% of neurons become permanently inactive (dead neurons) on average across layers. Downstream accuracy is not visibly affected at this level. The paper explores targeted gate weight reinitialization as a mitigation, yielding a +19.1% speedup versus the +17.9% baseline with no accuracy cost.
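The article does not spell out the reinitialization procedure, so the following is a hypothetical sketch of one way it could work: detect gate neurons that never fire on a calibration batch and redraw their gate_proj rows from the default init distribution. It assumes the SparseGatedFFN module from the previous sketch.

```python
import torch

@torch.no_grad()
def reinit_dead_gate_rows(ffn, calib_x, eps=0.0):
    # A gate neuron is "dead" if it never fires on the calibration batch.
    gate = torch.relu(ffn.gate_proj(calib_x))  # (tokens, d_hidden)
    dead = (gate <= eps).all(dim=0)
    n_dead = int(dead.sum())
    if n_dead:
        # Redraw the dead rows from nn.Linear's default init distribution.
        fresh = torch.empty(n_dead, ffn.gate_proj.in_features)
        torch.nn.init.kaiming_uniform_(fresh, a=5 ** 0.5)
        ffn.gate_proj.weight[dead] = fresh
    return n_dead
```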
04 — Benchmark Results
Accuracy preserved. Efficiency scales up with model size.
Model | Accuracy (baseline → sparse) | Inference speedup | Energy / token | Training speedup | Peak memory
0.5B | 40.4% → 40.4% | +17.0% | −11.8% | −1.5% | −19.2%
1B | 44.6% → 44.7% | +18.1% | −14.6% | +7.1% | −25.5%
1.5B | 46.4% → 46.2% | +18.8% | −15.0% | +11.6% | −28.1%
2B | 49.1% → 48.8% | +20.5% | | |
Originally published at marktechpost.com.