For creators and developers, the ability to generate text at over 1,000 tokens per second transforms the user experience from a wait into an instant reaction. Xiaomi’s MiMo team, alongside the TileRT systems group, has demonstrated that a 1-trillion-parameter model can decode faster than 1,000 tokens per second on standard hardware. This is not a theoretical claim; demos show generation rates peaking near 1,200 tokens per second. The significance lies in the hardware: this speed is achieved on commodity GPUs, not proprietary silicon.
What is MiMo-V2.5-Pro-UltraSpeed
UltraSpeed is a high-throughput mode for the existing MiMo-V2.5-Pro model. The base architecture operates at the trillion-parameter scale using a Mixture-of-Experts (MoE) design. The mode adds no new capability; it raises the rate at which the model emits tokens through a coordinated approach Xiaomi terms “extreme model-system codesign.” Crucially, the entire stack executes on a single standard node equipped with eight commodity GPUs.
The Speed Case: Three Layers Working Together
The first layer employs FP4 quantization. At trillion-parameter scale, FP8 or FP16 weights put heavy pressure on memory capacity and bandwidth, because every decode step has to read the active weights from memory. Dropping from 8 bits to 4 roughly halves the bytes moved per weight, which directly lifts decode speed. Xiaomi applies the MXFP4 format selectively to the MoE Experts only, while other modules retain higher precision, reported as FP8 by TileRT. Since Experts hold the majority of parameters and tolerate quantization best, the trade-off is favourable. Quantization-Aware Training (QAT) keeps benchmark quality essentially on par with the original.
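To make the MXFP4 idea concrete, here is a minimal sketch of block-scaled 4-bit quantization. It assumes the layout commonly described for the microscaling formats: 32 elements share one power-of-two scale, and each element is an E2M1 value. The function names are hypothetical, the rounding is plain nearest-value, and nothing here reflects Xiaomi's actual kernels or QAT recipe.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # assumed number of elements sharing one 8-bit scale
BITS_PER_WEIGHT = 4 + 8 / BLOCK_SIZE  # 4-bit element plus amortised scale


def quantize_block(block):
    """Return (shared_exponent, scaled_values) for one block of floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale chosen so the largest magnitude lands in [4, 8),
    # next to the top of the E2M1 range (6.0 has exponent 2).
    exp = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** exp
    values = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)  # saturate at the largest E2M1 value
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        values.append(math.copysign(nearest, x))
    return exp, values


def dequantize_block(exp, values):
    """Rebuild approximate floats from the shared exponent and E2M1 values."""
    return [v * 2.0 ** exp for v in values]


if __name__ == "__main__":
    weights = [i / 16 - 1 for i in range(BLOCK_SIZE)]
    exp, q = quantize_block(weights)
    restored = dequantize_block(exp, q)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"shared exponent {exp}, worst error {worst}, "
          f"{BITS_PER_WEIGHT} bits per weight")
```

The per-block scale is what lets eight magnitude levels cover weights of very different sizes, and it costs only a quarter of a bit per weight under the assumed block size.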
The second layer is DFlash speculative decoding. The third is TileRT, the system executing the workload on the GPU. Individually, these techniques are insufficient; the 1,000 tokens per second result requires all three to align tightly.
DFlash: Parallel Drafting Without a Serial Bottleneck
Standard speculative decoding uses a smaller draft model to guess upcoming tokens, which the large model then verifies in parallel. Rejection sampling ensures the output follows the same distribution as normal decoding, preserving quality. The limitation is that the draft model still generates its tokens one at a time. DFlash, a method from the research community, removes this constraint by using block-level masked parallel prediction. The draft model fills an entire block of masked positions in a single forward pass.
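The verification step can be sketched with the textbook speculative-sampling rule: accept a drafted token with probability min(1, p/q), and on the first rejection resample from the leftover target probability. This is the generic scheme, not DFlash's own code. The function name and the dict-based distributions are assumptions for illustration, and the extra "bonus" token that some implementations sample after a fully accepted block is omitted.

```python
import random


def verify_block(draft_tokens, draft_probs, target_probs, rng):
    """Verify one drafted block and return the tokens to emit this round.

    draft_probs[i] and target_probs[i] map token -> probability at
    position i, as given by the draft and target models respectively.
    """
    emitted = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        # q[tok] > 0 because the draft model proposed tok.
        accept_prob = min(1.0, p.get(tok, 0.0) / q[tok])
        if rng.random() < accept_prob:
            emitted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q). Its total is
        # positive here because p(tok) < q(tok) on a rejection.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        replacement = rng.choices(
            list(residual), weights=list(residual.values())
        )[0]
        emitted.append(replacement)
        break  # every drafted token after the first rejection is discarded
    return emitted


if __name__ == "__main__":
    rng = random.Random(0)
    q = {"a": 0.6, "b": 0.4}   # toy draft distribution
    p = {"a": 0.5, "b": 0.5}   # toy target distribution
    print(verify_block(["a", "a", "b"], [q, q, q], [p, p, p], rng))
```

When the draft and target distributions agree, every drafted token is accepted, which is why a well-distilled draft model raises the acceptance length.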
Xiaomi tuned DFlash using the Muon second-order optimizer and model self-distillation. The draft model uses Sliding Window Attention (SWA) only, matching the MiMo-V2 design. This keeps per-prediction compute constant rather than growing with context length. The block size is capped at eight to limit verification costs and raise concurrency.
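The constant-cost property of sliding window attention is easy to see in a small sketch. The helper name and the window size of 128 are hypothetical; the source does not state the draft model's window.

```python
def swa_window(position, window):
    """Key positions a query at `position` may attend to (causal, inclusive)."""
    start = max(0, position - window + 1)
    return range(start, position + 1)


WINDOW = 128  # hypothetical window size, chosen only for illustration

if __name__ == "__main__":
    for pos in (10, 1_000, 100_000):
        attended = len(swa_window(pos, WINDOW))
        full = pos + 1  # keys visible under full causal attention
        print(f"position {pos}: SWA attends to {attended}, full attention to {full}")
```

Once the context is longer than the window, the number of keys per query stops growing, so the draft model's per-prediction attention cost stays flat while full attention keeps scaling with context length.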
Acceptance length measures how many draft tokens survive verification each round.
| Scenario | Acceptance Length |
|---|---|
| Coding | 6.30 |
| Math / Reasoning | 5.56 |
| Agent | 4.29 |
In coding scenarios, a little over six of the eight draft tokens are accepted per round on average. The best samples reach 7.14.
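As a rough back-of-the-envelope check, the acceptance lengths in the table can be turned into a per-round time budget. The sketch assumes that acceptance length equals the tokens emitted per draft-and-verify round and that the 1,000 tokens per second target applies to a single stream; the source states neither explicitly.

```python
ACCEPTANCE_LENGTH = {
    "Coding": 6.30,
    "Math / Reasoning": 5.56,
    "Agent": 4.29,
}
TARGET_TPS = 1000  # tokens per second, assumed to be per stream


def rounds_per_second(tps, tokens_per_round):
    """Draft-and-verify rounds needed each second to sustain `tps`."""
    return tps / tokens_per_round


def round_budget_ms(tps, tokens_per_round):
    """Wall-clock time available for one round, in milliseconds."""
    return 1000.0 * tokens_per_round / tps


if __name__ == "__main__":
    for scenario, length in ACCEPTANCE_LENGTH.items():
        rps = rounds_per_second(TARGET_TPS, length)
        budget = round_budget_ms(TARGET_TPS, length)
        print(f"{scenario}: {rps:.1f} rounds/s, {budget:.2f} ms per round")
```

Under these assumptions a lower acceptance length leaves less time per round, so agent workloads have to complete each draft-and-verify cycle faster than coding workloads to hold the same token rate.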
TileRT: Squeezing the Microseconds
At 1,000 tokens per second, each operator runs for only microseconds. Traditional systems launch operators sequentially, and each launch incurs a time cost. These gaps fracture the execution stream and become the real bottleneck, and even small operations like RMSNorm, RoPE, and KV cache writes turn into bottlenecks at this scale. TileRT replaces per-operator launches with a Persistent Engine Kernel that remains resident on the GPU. It uses Warp Specialization to split data movement, compute, and communication into coordinated roles. The system was co-designed with the FP4 and DFlash choices, not added as an afterthought.
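A toy model shows why launch gaps matter at this speed. Every number below is a placeholder chosen for illustration: the source gives no operator counts or launch latencies, and the 6.3 ms round budget is only the figure implied by a 6.30 acceptance length at 1,000 tokens per second.

```python
def launch_overhead_fraction(num_launches, launch_gap_us, round_budget_us):
    """Share of one round's time budget consumed by launch gaps alone."""
    return num_launches * launch_gap_us / round_budget_us


if __name__ == "__main__":
    ROUND_BUDGET_US = 6300.0   # assumed budget per draft-and-verify round
    LAUNCH_GAP_US = 3.0        # hypothetical gap per operator launch
    for launches in (500, 1000, 2000):  # hypothetical launches per round
        share = launch_overhead_fraction(launches, LAUNCH_GAP_US, ROUND_BUDGET_US)
        print(f"{launches} launches: {share:.0%} of the round budget")
```

With budgets measured in milliseconds, a few microseconds per launch multiplied across many operators can consume a large share of the round, which is the motivation for keeping one kernel resident rather than launching operators one by one.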
Use Cases
The release targets latency-sensitive work where waiting breaks the user loop:
- Parallel reasoning: run many Best-of-N or tree-search paths within the same wall-clock time.
- Coding agents: faster code generation cuts the wait between agent steps.
- Real-time decision loops: trading signal generation, fraud interception, and live dialogue.
- Interactive prototyping: demos show a Snake game in about 10 seconds and a macOS interface in about one minute.
These are workloads where raw per-stream token speed is the binding constraint.
How It Compares
The first table contrasts the two routes to extreme decode speed.
| Approach | Hardware | How speed is achieved |
|---|---|---|
| Cerebras | Wafer-Scale integration (custom) | Scale on a single custom wafer |
| Groq | Custom architecture | Pure on-chip SRAM |
| MiMo × TileRT | Commodity GPUs (8-GPU node) | Model-system codesign: FP4 + DFlash + TileRT |
The second table compares the standard model with the UltraSpeed mode.
| Dimension | MiMo-V2.5-Pro | MiMo-V2.5-Pro-UltraSpeed |
|---|---|---|
| Decode speed | Baseline | ~10× faster (1,000+ TPS) |
| Price | 1× | 3× |
| Weight precision | Standard | FP4 MoE Experts via QAT |
| Decoding | Standard autoregressive | DFlash speculative decoding |
| Access | Standard model plans | API only, application-based trial |
| Token Plan | Supported | Not supported |
Access, Pricing, and Open Source
UltraSpeed ships through a limited, application-based window. The API trial runs from June 9 to June 23, 2026. Pricing is three times the standard MiMo-V2.5-Pro rate, reflecting roughly ten times the speed. It is API only, and the Token Plan is not supported. Approved users also receive free Chat access during the trial. Chat limits apply: 10 queue entries daily, 30-minute sessions, and 5-minute idle release. Xiaomi has open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face. TileRT has open-sourced select modules on GitHub.
Strengths and Limitations
Strengths
- 1,000+ TPS on a 1T model without custom silicon.
- Lossless decoding through rejection sampling in DFlash.
- FP4 applied only where tolerance is highest, preserving quality.
- An open checkpoint lets the community test the claims.
Limitations
- Access is gated, short, and approval-based at launch.
- Pricing triples per token versus the standard model.
- Acceptance length drops in open-ended conversation.
- Independent third-party speed verification is not yet public.
Key takeaways
- Xiaomi MiMo and TileRT decode a 1-trillion-parameter model past 1,000 tokens per second on commodity GPUs.
- The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime.
- FP4 (MXFP4) is applied only to MoE Experts; QAT keeps capability essentially on par.
- DFlash predicts a whole masked block per forward pass, hitting 6.30 average acceptance length in coding.
- UltraSpeed runs on a single 8-GPU node via an application-based API trial, June 9–23, 2026.