Scaling large language models (LLMs) involves significant costs. The primary reason is that most of these models rely heavily on feedforward layers, which account for over two-thirds of all model parameters and more than 80% of total FLOPs, even in larger models. A team from Sakana AI and NVIDIA has introduced a new approach that attacks this bottleneck by making the computation inside feedforward layers significantly cheaper through unstructured sparsity.
Sparsity Exists, But GPUs Ignore It
Within a transformer’s feedforward block, only a small fraction of hidden neurons actually fire for any given input token; the rest are zeroed out by the activation function. This phenomenon is called activation sparsity, and prior research has documented it in models with ReLU activations.
The frustrating reality is that the theoretical savings from activation sparsity rarely translate into actual speedups on modern GPUs. The hardware is optimized for dense matrix multiplications via Tensor Cores, while sparse formats such as ELLPACK (ELL) require additional kernel passes to convert activations from dense to sparse representations. These conversions often cancel out any potential savings.
Previous work on sparse LLM kernels, such as TurboSparse, ProSparse, and Q-Sparse, focused mainly on single-token inference operations (GEMV). The research team here tackles a fundamentally harder problem: batched GEMM operations with thousands of input tokens. This regime covers both training and high-throughput inference scenarios.
GUIDE
Sparser, Faster, Lighter LLMs — TwELL & Sparse CUDA Kernels
Sakana AI × NVIDIA — arXiv:2603.23198 — ICML 2026
01 — The Problem
Feedforward layers dominate LLM cost — and most of that work is wasted.
> ⅔
of all model parameters live in feedforward layers
80%+
of total FLOPs consumed by feedforward layers
99%+
of hidden activations can be zero with no accuracy drop
For any given token, only a tiny fraction of hidden neurons actually fire. The rest output zero after the activation function. This is called activation sparsity. Historically, this has been impossible to exploit on modern GPUs because sparse operations ran slower than dense ones.
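To make the effect concrete, here is a minimal PyTorch sketch (illustrative, not from the paper; all module and variable names are assumptions) that measures activation sparsity in a ReLU-gated feedforward block:

```python
# Minimal sketch: activation sparsity in a ReLU-gated FFN block.
import torch
import torch.nn as nn

d_model, d_hidden, n_tokens = 1024, 4096, 8

gate_proj = nn.Linear(d_model, d_hidden, bias=False)
up_proj = nn.Linear(d_model, d_hidden, bias=False)
down_proj = nn.Linear(d_hidden, d_model, bias=False)

x = torch.randn(n_tokens, d_model)

# ReLU zeroes every negative gate pre-activation, and the elementwise
# product inherits those exact zeros.
gate = torch.relu(gate_proj(x))
hidden = gate * up_proj(x)
out = down_proj(hidden)

sparsity = (hidden == 0).float().mean().item()
print(f"fraction of zero hidden activations: {sparsity:.1%}")
```

At random initialization roughly half the gate pre-activations are negative, so measured sparsity starts near 50%; the 99%+ figure in the callout above comes from training with the L1 penalty described in section 03.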
Prior work on sparse LLM kernels (TurboSparse, ProSparse, Q-Sparse) only targeted single-token GEMV operations. The research team instead tackles the harder problem: batched GEMM with thousands of tokens — the regime that covers both training and high-throughput inference.
02 — The Innovation
TwELL: a sparse format built around how GPU kernels actually work.
Old Way — ELL
Row-wise packing, costly to build
Standard ELLPACK packs non-zeros row-by-row across the entire matrix. To construct it from a tiled matmul output you need a separate kernel launch, a full global memory read, and synchronization across all CTAs. Those overheads cancel out the savings from skipping zeros.
New Way — TwELL
Tile-wise packing, built in the epilogue
TwELL partitions the columns into horizontal tiles that match the matmul kernel’s tile size T_n, and non-zeros are packed locally within each tile. Because the tile dimensions match, TwELL is constructed inside the existing gate projection kernel epilogue: no extra kernel, no extra memory read, no synchronization overhead.
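A NumPy sketch of the contrast (illustrative assumptions throughout; `pack_ell` and `pack_tilewise` are hypothetical helpers, and T_n is the tile width from the text):

```python
import numpy as np

def pack_ell(A):
    """Row-wise ELL: every row is padded to the global max non-zero count,
    which requires a pass over the whole matrix before packing can start."""
    rows, _ = A.shape
    width = max(int((A[r] != 0).sum()) for r in range(rows))  # global reduction
    vals = np.zeros((rows, width), dtype=A.dtype)
    idx = np.zeros((rows, width), dtype=np.int64)
    for r in range(rows):
        nz = np.flatnonzero(A[r])
        vals[r, :len(nz)] = A[r, nz]
        idx[r, :len(nz)] = nz
    return vals, idx

def pack_tilewise(A, T_n):
    """Tile-wise packing: columns are split into tiles of width T_n and each
    tile is packed independently, so a matmul epilogue that already holds one
    output tile can pack it locally, with no cross-tile synchronization."""
    return [pack_ell(A[:, c0:c0 + T_n]) for c0 in range(0, A.shape[1], T_n)]

A = np.random.randn(4, 16)
A[A < 0.8] = 0.0                 # fake a highly sparse activation matrix
ell = pack_ell(A)                # needs the whole matrix up front
tiles = pack_tilewise(A, T_n=4)  # each tile packable as it is produced
```

The point of the tile-wise variant is that each tile can be packed using only information already local to one CTA, which is exactly what an epilogue has in hand.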
The inference pipeline uses one fused kernel that reads gate activations in TwELL format and performs the up and down projections together. The intermediate hidden state is never written to global memory, cutting DRAM traffic on every forward pass.
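What the fused kernel computes can be sketched in NumPy (a semantic sketch under my own assumptions, not the paper’s kernel): for each token, only the hidden columns with non-zero gate values are ever touched, and the dense hidden vector is never materialized.

```python
import numpy as np

def fused_up_down(x_tok, idx, vals, W_up, W_down):
    # Up projection restricted to the active hidden columns only.
    up_active = x_tok @ W_up[:, idx]        # shape (nnz,)
    hidden_active = vals * up_active        # gated, still only nnz values
    # Down projection accumulates only the matching rows of W_down.
    return hidden_active @ W_down[idx, :]   # shape (d_model,)

d_model, d_hidden = 64, 256
rng = np.random.default_rng(0)
W_up = rng.standard_normal((d_model, d_hidden))
W_down = rng.standard_normal((d_hidden, d_model))
x = rng.standard_normal(d_model)

gate = np.maximum(rng.standard_normal(d_hidden), 0.0)  # stand-in ReLU gate
idx = np.flatnonzero(gate)                             # packed column indices
out_sparse = fused_up_down(x, idx, gate[idx], W_up, W_down)
out_dense = (gate * (x @ W_up)) @ W_down
assert np.allclose(out_sparse, out_dense)
```

The assert confirms that skipping the zeroed columns changes nothing numerically; the kernel-level win is that the equivalent of `hidden_active` stays in registers instead of round-tripping through global memory.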
For training, a hybrid sparse format dynamically routes rows into a compact ELL matrix (sparse rows) or a dense backup (overflow rows). Sparsity during training is highly non-uniform — max non-zeros per row can be orders of magnitude above the average — so the hybrid design handles this without becoming brittle.
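A sketch of the routing idea (my reading of the hybrid design; `hybrid_split` and the width threshold are illustrative):

```python
import numpy as np

def hybrid_split(A, ell_width):
    """Route each row by its non-zero count: rows that fit the compact ELL
    width take the sparse path; overflow rows fall back to a dense matrix."""
    nnz = (A != 0).sum(axis=1)
    sparse_rows = np.flatnonzero(nnz <= ell_width)
    dense_rows = np.flatnonzero(nnz > ell_width)  # overflow path
    return sparse_rows, dense_rows

A = np.random.randn(8, 32)
A[np.abs(A) < 1.5] = 0.0      # mostly sparse rows...
A[0] = np.random.randn(32)    # ...plus one pathological dense row
sparse_rows, dense_rows = hybrid_split(A, ell_width=4)
```

Without the dense backup, one pathological row would force the padded ELL width up for every row; routing it to the overflow path keeps the sparse path compact.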
03 — Training Recipe
Two changes to your training config. Nothing else.
01
Replace SiLU with ReLU as the gate activation function. ReLU produces exact zeros for negative inputs — this is what enables unstructured sparsity. No other architectural change is needed. (Unregularized ReLU sits slightly below SiLU on task accuracy: 46.4% vs 47.1% on the 1.5B model, offset by the efficiency gains.)
02
Add an L1 loss term on the hidden feedforward activations, averaged over all tokens and hidden dimensions across all layers. Recommended coefficient: λ_L1 = 2×10⁸. Add it to your standard cross-entropy loss. No changes to learning rate, weight decay, batch size, or optimizer. (Both changes are sketched in code after this list.)
03
Sparsity stabilizes fast. The non-zero count settles within ~1,000 training steps (~1B tokens). The training kernels deliver memory and throughput benefits for almost the entire training run, not just toward the end.
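A minimal PyTorch sketch of changes 01 and 02 together (the code promised above; everything except the L1 coefficient quoted in the text is illustrative boilerplate, and the class and function names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseGatedFFN(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        # Change 01: ReLU instead of SiLU, producing exact zeros in the gate.
        hidden = F.relu(self.gate_proj(x)) * self.up_proj(x)
        # Stash mean |activation| (averaged over tokens and hidden dims);
        # loss_fn reads it after the forward pass for the current batch.
        self.l1_term = hidden.abs().mean()
        return self.down_proj(hidden)

LAMBDA_L1 = 2e8  # coefficient quoted in section 03

def loss_fn(logits, targets, ffn_layers):
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # Change 02: L1 penalty on hidden activations, averaged across layers,
    # added straight onto cross-entropy. Optimizer settings stay unchanged.
    l1 = torch.stack([f.l1_term for f in ffn_layers]).mean()
    return ce + LAMBDA_L1 * l1
```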
Watch Out
At λ_L1 = 2×10⁸, over 30% of neurons become permanently inactive (dead neurons) on average across layers. Downstream accuracy is not visibly affected at this level. The paper explores targeted gate weight reinitialization as a mitigation, yielding a +19.1% speedup versus the +17.9% baseline with no accuracy cost.
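The article does not spell out the reinitialization procedure, so the following is a hypothetical sketch of one way it could work: detect gate neurons that never fire on a calibration batch and redraw their gate_proj rows from the default init distribution. It assumes the SparseGatedFFN module from the previous sketch.

```python
import torch

@torch.no_grad()
def reinit_dead_gate_rows(ffn, calib_x, eps=0.0):
    # A gate neuron is "dead" if it never fires on the calibration batch.
    gate = torch.relu(ffn.gate_proj(calib_x))  # (tokens, d_hidden)
    dead = (gate <= eps).all(dim=0)
    n_dead = int(dead.sum())
    if n_dead:
        # Redraw the dead rows from nn.Linear's default init distribution.
        fresh = torch.empty(n_dead, ffn.gate_proj.in_features)
        torch.nn.init.kaiming_uniform_(fresh, a=5 ** 0.5)
        ffn.gate_proj.weight[dead] = fresh
    return n_dead
```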
04 — Benchmark Results
Accuracy preserved. Efficiency scales up with model size.
Model | Accuracy (baseline → sparse) | Inference speedup | Energy / token | Training speedup | Peak memory
0.5B | 40.4% → 40.4% | +17.0% | −11.8% | −1.5% | −19.2%
1B | 44.6% → 44.7% | +18.1% | −14.6% | +7.1% | −25.5%
1.5B | 46.4% → 46.2% | +18.8% | −15.0% | +11.6% | −28.1%
2B | 49.1% → 48.8% | +20.5% | | |
Originally published at marktechpost.com.