Overview of Star Elastic Method
The traditional approach to training large language models (LLMs) involves a separate full training run for each model variant in the family, whether 8B, 30B, or 70B. Compute costs and storage requirements therefore grow with every variant added. NVIDIA researchers propose a solution called Star Elastic, which embeds multiple nested submodels inside a single parent reasoning model using a single training run.
Nested Architecture: What Does It Mean?
If you haven’t encountered elastic or nested architectures before, the idea is this: instead of training three separate models at 30B, 23B, and 12B, one model contains the smaller ones as subsets. The smaller submodels reuse the most important weights from the parent, identified through a process called importance estimation. Star Elastic scores each model component based on how much it contributes to model accuracy. Components are then ranked and sorted so that smaller-budget submodels always use the highest-ranked contiguous subset of components from the larger model.
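The ranking-then-prefix idea can be illustrated with a small sketch. The importance scores and component counts below are made up for illustration; the article does not say how Star Elastic computes the scores.

```python
import numpy as np

# Hypothetical importance scores for 8 components (e.g. attention heads).
importance = np.array([0.12, 0.90, 0.05, 0.47, 0.33, 0.81, 0.02, 0.60])

# Rank components from most to least important.
ranking = np.argsort(-importance)

def nested_subset(ranking, budget):
    """Return the indices of the `budget` highest-ranked components."""
    return ranking[:budget]

small = nested_subset(ranking, 3)
medium = nested_subset(ranking, 5)
large = nested_subset(ranking, 8)

# Nesting property: each smaller subset is contained in every larger one.
assert set(small) <= set(medium) <= set(large)
print(small.tolist(), medium.tolist())
```

Because every budget takes a prefix of the same ranking, a smaller submodel never needs a weight that the larger one lacks, which is what lets a single checkpoint serve all sizes.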
The method supports nesting along multiple axes: the SSM (State Space Model) dimension, embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN intermediate dimensions. For MoE layers specifically, Star Elastic uses Router-Weighted Expert Activation Pruning (REAP), which ranks experts by both routing gate values and expert output magnitudes—a more principled signal than naive frequency-based pruning.
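As a rough illustration of why combining gate values with output magnitudes differs from counting how often an expert is selected, here is a minimal sketch. The scoring formula (mean of gate value times output norm over the tokens routed to each expert) is an assumption based on the description above, not a formula quoted from the paper, and the function name `reap_scores` is hypothetical.

```python
import numpy as np

def reap_scores(gates, expert_out_norms):
    """Score experts by gate-weighted output magnitude (assumed formula).

    gates:            (tokens, experts) routing gate values, 0 where not selected.
    expert_out_norms: (tokens, experts) L2 norm of each expert's output.
    Returns the mean of gate * norm over the tokens routed to each expert.
    """
    routed = gates > 0
    weighted = gates * expert_out_norms
    counts = np.maximum(routed.sum(axis=0), 1)  # avoid division by zero
    return (weighted * routed).sum(axis=0) / counts

# Toy data: 3 tokens, 3 experts, top-2 routing. Every expert is used twice,
# so a frequency-based criterion cannot tell them apart.
gates = np.array([[0.7, 0.3, 0.0],
                  [0.6, 0.0, 0.4],
                  [0.0, 0.2, 0.8]])
norms = np.array([[1.0, 5.0, 2.0],
                  [1.0, 5.0, 2.0],
                  [1.0, 5.0, 2.0]])

scores = reap_scores(gates, norms)
keep_order = np.argsort(-scores)  # most important expert first
print(scores, keep_order)
```

In this toy case expert 0 receives the largest gate values but produces small outputs, so it ranks last, even though its selection frequency matches the others.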
A Learnable Router: Not a Fixed Compression Recipe
A key distinction from prior compression methods like Minitron is that Star Elastic uses an end-to-end trainable router to determine the nested submodel architectures. The router takes a target budget (e.g., “give me a 2.8B active parameter model”) as a one-hot input and outputs differentiable masks that select which components are active at that budget level. These masks are trained jointly with the model through Gumbel-Softmax, allowing gradient flow through discrete architectural decisions.
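A forward-pass sketch of how a budget-conditioned router could emit a soft, nested mask follows. The parameterization is an assumption made for illustration: the router weights `W`, the choice to predict "how many rank-sorted heads to keep", and the cumulative-sum trick are not taken from the paper. Real training would use an autodiff framework so that gradients flow through the relaxed sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample over the last axis (forward pass only)."""
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))                      # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

num_budgets, num_heads = 3, 8

# Hypothetical router parameters: one row of logits per budget, where entry k
# scores the option "keep the k+1 highest-ranked heads".
W = rng.normal(size=(num_budgets, num_heads))

budget = np.array([0.0, 1.0, 0.0])               # one-hot target budget
logits = budget @ W
probs = gumbel_softmax(logits, tau=0.5, rng=rng)

# Soft mask over rank-sorted heads: head i is active whenever the chosen
# keep-count exceeds i, so mask[i] = sum of probs[i:].
mask = np.cumsum(probs[::-1])[::-1]
print(np.round(mask, 3))
```

The mask is non-increasing along the ranking, so it stays consistent with the contiguous-prefix nesting described earlier; lowering `tau` pushes it toward a hard 0/1 mask.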
The loss function combines knowledge distillation (KD), in which the non-elastified parent model acts as the teacher, with a router loss that penalizes deviation from the target resource budget (parameter count, memory, or latency). The router therefore learns to make architecture choices that actually improve accuracy under KD, rather than just minimizing a proxy metric.
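The combined objective can be sketched as a KD term plus a weighted budget penalty. The specific choices here are assumptions, since the article does not give the exact form: forward KL for the distillation term, a relative absolute deviation for the router term, and the weighting factor `lam`.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def kd_loss(student_logits, teacher_logits):
    """KL(teacher || student), averaged over tokens (assumed KD form)."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return float((np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1).mean())

def router_loss(actual_cost, target_cost):
    """Relative deviation from the target budget (assumed penalty form)."""
    return abs(actual_cost - target_cost) / target_cost

def total_loss(student_logits, teacher_logits, actual_cost, target_cost, lam=1.0):
    return (kd_loss(student_logits, teacher_logits)
            + lam * router_loss(actual_cost, target_cost))

# Toy example: one token, three-way vocabulary, and a submodel that came out
# at 3.0B active parameters against a 2.8B target.
teacher = np.array([[2.0, 0.5, -1.0]])
student = np.array([[1.5, 0.7, -0.5]])
loss = total_loss(student, teacher, actual_cost=3.0e9, target_cost=2.8e9)
print(loss)
```

The cost passed to `router_loss` could equally be memory or latency; only the unit of the target changes.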
The training uses a two-stage curriculum: a short-context phase with uniform budget sampling followed by an extended-context phase with non-uniform sampling that prioritizes the full 30B model. The extended context phase is critical for reasoning performance. The research team’s ablations on Nano v2—explicitly reproduced as the empirical basis for the same curriculum choice on Nano v3—show gains of up to 19.8% on AIME-2025 for the 6B variant and 4.0 percentage points for the 12B variant from Stage 2 alone.
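The difference between the two sampling regimes can be sketched as follows. The budget names come from the article; the Stage 2 probabilities are hypothetical placeholders, since the article only says that sampling is non-uniform and favors the full 30B model.

```python
import numpy as np

BUDGETS = ["12B", "23B", "30B"]

STAGE_PROBS = {
    1: [1 / 3, 1 / 3, 1 / 3],  # Stage 1 (short context): uniform
    2: [0.2, 0.3, 0.5],        # Stage 2 (extended context): hypothetical weights
}

def sample_budget(stage, rng):
    """Pick the budget to train on for the next step."""
    return str(rng.choice(BUDGETS, p=STAGE_PROBS[stage]))

rng = np.random.default_rng(0)
stage2 = [sample_budget(2, rng) for _ in range(10_000)]
share_30b = stage2.count("30B") / len(stage2)
print(round(share_30b, 2))
```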
Elastic Budget Control: Different Models for Different Reasoning Phases
Existing budget control in reasoning models, such as Nemotron Nano v3’s default behavior, caps the number of tokens generated during the <think> phase before forcing a final answer. This approach uses the same model throughout. Star Elastic unlocks a different strategy: using different nested submodels for the thinking phase versus the answering phase.
The researchers evaluated four configurations. The optimal one, called ℳS → ℳL, allocates a cheaper model to generate extended reasoning traces and reserves the full-capacity model for synthesizing the final answer. The 23B → 30B configuration in particular advances the accuracy–latency Pareto frontier, achieving up to 16% higher accuracy and 1.9× lower latency. The intuition: reasoning tokens are high-volume but tolerant of some capacity reduction; the final answer requires higher precision.
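The ℳS → ℳL control flow can be sketched with stand-in models. The `Generate` interface, the function `think_then_answer`, and the token budgets are hypothetical; no real inference API is implied, and the stubs exist only so the sketch runs.

```python
from typing import Callable

# Hypothetical interface: (prompt, max_new_tokens) -> generated text.
Generate = Callable[[str, int], str]

def think_then_answer(prompt: str, thinker: Generate, answerer: Generate,
                      think_budget: int, answer_budget: int) -> str:
    # Phase 1: the cheaper submodel writes the reasoning trace, capped at think_budget.
    reasoning = thinker(prompt + "<think>", think_budget)
    # Phase 2: the full model reads the prompt plus the trace and writes the answer.
    context = prompt + "<think>" + reasoning + "</think>"
    return answerer(context, answer_budget)

# Stand-in models so the sketch runs without any real checkpoint.
calls = []

def small_model(text: str, max_new_tokens: int) -> str:
    calls.append(("23B", max_new_tokens))
    return "2 + 2 = 4"

def large_model(text: str, max_new_tokens: int) -> str:
    calls.append(("30B", max_new_tokens))
    return "4" if "2 + 2 = 4" in text else "unknown"

answer = think_then_answer("What is 2 + 2?", small_model, large_model,
                           think_budget=4096, answer_budget=256)
print(answer, calls)
```

Most generated tokens fall in the first phase, so that is where the cheaper submodel saves latency, while the short final phase still runs at full capacity.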
Quantization Without Breaking the Nested Structure
A naive approach to deploying a quantized elastic model would be to quantize each variant separately after slicing. That breaks the nested weight-sharing property and requires a separate quantization pass per size. Instead, Star Elastic applies Quantization-Aware Distillation (QAD) directly on the elastic checkpoint, preserving the nested mask hierarchy throughout.
For FP8 (E4M3 format), post-training quantization (PTQ) is sufficient, recovering 98.69% of BF16 accuracy on the 30B variant. For NVFP4 (NVIDIA’s 4-bit floating-point format), PTQ alone causes a 4.12% average accuracy drop, so a short nested QAD phase (~5B tokens at 48K context) brings recovery back to 97.79% for the 30B variant. In both cases, zero-shot slicing of the 23B and 12B variants from the single quantized checkpoint is preserved.
The memory implications are significant. Storing separate 12B, 23B, and 30B BF16 checkpoints requires 126.1 GB; the single elastic checkpoint requires 58.9 GB. The 30B NVFP4 elastic checkpoint fits in 18.7 GB, enabling the 12B NVFP4 variant to run on an RTX 5080 where every BF16 configuration runs out of memory. On an RTX Pro 6000, the 12B NVFP4 variant reaches 7,426 tokens/s, a 3.4× throughput improvement over the 30B BF16 baseline.
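The ratios implied by these figures can be worked out directly. All inputs below are the numbers quoted above; the derived 30B BF16 throughput is only what the stated 3.4× factor implies, not a separately reported measurement.

```python
separate_bf16_gb = 126.1   # separate 12B + 23B + 30B BF16 checkpoints
elastic_bf16_gb = 58.9     # single elastic BF16 checkpoint
elastic_nvfp4_gb = 18.7    # 30B NVFP4 elastic checkpoint

saved_gb = separate_bf16_gb - elastic_bf16_gb
bf16_ratio = separate_bf16_gb / elastic_bf16_gb
nvfp4_ratio = elastic_bf16_gb / elastic_nvfp4_gb

nvfp4_12b_tps = 7426       # 12B NVFP4 on RTX Pro 6000, tokens/s
speedup = 3.4              # stated improvement over the 30B BF16 baseline
implied_baseline_tps = nvfp4_12b_tps / speedup

print(f"saved: {saved_gb:.1f} GB, {bf16_ratio:.2f}x smaller than separate checkpoints")
print(f"NVFP4 vs elastic BF16: {nvfp4_ratio:.2f}x smaller")
print(f"implied 30B BF16 throughput: {implied_baseline_tps:.0f} tokens/s")
```

So the elastic BF16 checkpoint saves about 67 GB (roughly 2.1× smaller), and NVFP4 shrinks it by a further factor of about 3.1.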
Depth vs. Width: Why Star Elastic Compresses Width
One design choice worth calling out explicitly: the research team compared two compression strategies, removing layers entirely (depth compression) versus reducing internal dimensions like hidden size, expert count, and head count (width compression). With a 15% parameter reduction and 25B tokens of knowledge distillation, width compression recovered 98.1% of baseline performance. Depth compression (layer skipping) remains available as a mechanism for extreme latency-constrained scenarios.
Key Takeaways
- NVIDIA AI proposes Star Elastic: one checkpoint containing multiple nested reasoning models, reducing training and deployment costs.
- The method uses an end-to-end trainable router to determine the architecture of each submodel, and the resulting submodels can be mixed at inference time, for example a smaller one for the thinking phase and the full model for the final answer.
- Quantization-aware distillation (QAD) is applied directly on the elastic checkpoint, preserving the nested mask hierarchy and enabling zero-shot slicing from a single quantized checkpoint.
- The researchers demonstrate significant memory savings: a single BF16 elastic checkpoint takes 58.9 GB versus 126.1 GB for separate 12B, 23B, and 30B checkpoints, and the 30B NVFP4 elastic checkpoint fits in 18.7 GB.