Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs

For creators and developers, the ability to generate text at over 1,000 tokens per second transforms the user experience from a wait into an instant reaction. Xiaomi’s MiMo team, alongside the TileRT systems group, has demonstrated that a 1-trillion-parameter model can decode faster than 1,000 tokens per second on standard hardware. This is not a theoretical claim; demos show generation rates peaking near 1,200 tokens per second. The significance lies in the hardware: this speed is achieved on commodity GPUs, not proprietary silicon.

What is MiMo-V2.5-Pro-UltraSpeed

UltraSpeed is a high-speed decoding mode for the existing MiMo-V2.5-Pro model. The base architecture operates at the trillion-parameter scale using a Mixture-of-Experts (MoE) design. The mode adds no new capability; it changes only how fast the model emits tokens, through a coordinated approach Xiaomi terms “extreme model-system codesign.” Crucially, the entire stack executes on a single standard node equipped with eight commodity GPUs.

The Speed Case: Three Layers Working Together

The first layer employs FP4 quantization. At trillion-parameter scale, standard FP8 or FP16 weights put heavy pressure on memory capacity and bandwidth. Reducing the bit-width means fewer bytes move per weight, which directly lifts decode speed. Xiaomi applies the MXFP4 format selectively, to the MoE Experts only, while other modules retain higher precision (reported as FP8 by TileRT). Since Experts hold the majority of parameters and tolerate quantization best, the trade-off is favourable. Quantization-Aware Training (QAT) keeps benchmark quality essentially on par with the original.
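To make the format concrete, here is a minimal sketch of block quantization in the style of MXFP4 as defined by the OCP Microscaling specification: 32-element blocks, one shared power-of-two scale per block, and 4-bit E2M1 elements. The function names are hypothetical, the rounding is simplified to nearest-value with ties going to the smaller magnitude, and nothing here reflects how Xiaomi's QAT pipeline or TileRT actually implements the format.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (the sign is a separate bit).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2      # exponent of the largest E2M1 binade (4.0 = 2**2)
BLOCK_SIZE = 32    # elements sharing one scale in the MX formats


def quantize_block(values):
    """Quantize one block to (shared_exponent, list of E2M1 values)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Shared power-of-two scale chosen from the largest magnitude in the block.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    codes = []
    for v in values:
        mag = min(abs(v) / scale, FP4_E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        codes.append(math.copysign(nearest, v))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Recover approximate values from a quantized block."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]


def quantize_weights(weights):
    """Split a flat weight list into BLOCK_SIZE chunks and quantize each."""
    return [quantize_block(weights[i:i + BLOCK_SIZE])
            for i in range(0, len(weights), BLOCK_SIZE)]
```

With a 4-bit element and an 8-bit scale shared across 32 elements, storage works out to 4.25 bits per weight, against 8 for FP8 and 16 for FP16.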

The second layer is DFlash speculative decoding. The third is TileRT, the system executing the workload on the GPU. Individually, these techniques are insufficient; the 1,000 tokens per second result requires all three to align tightly.

DFlash: Parallel Drafting Without a Serial Bottleneck

Standard speculative decoding uses a smaller draft model to guess upcoming tokens, which the large model then verifies in parallel. Rejection sampling ensures the output distribution matches that of normal decoding, so quality is preserved. The limitation is that the draft model still generates its tokens sequentially. DFlash, a method from the research community, removes this constraint with block-level masked parallel prediction: the draft model fills an entire block of masked positions in a single forward pass.
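The verification step can be sketched with the standard speculative-sampling acceptance rule: accept a drafted token with probability min(1, p/q), where p is the target model's probability and q the draft model's, and on the first rejection resample from the normalized residual max(0, p − q). This is a generic illustration, not DFlash's or Xiaomi's code; the function name and the dictionary-based distributions are assumptions made to keep the example small.

```python
import random


def verify_draft_block(draft_tokens, draft_probs, target_probs, rng):
    """Return the tokens emitted by one draft-and-verify round.

    draft_probs[i] / target_probs[i] map token -> probability at position i.
    target_probs holds one extra entry, used for the bonus token that is
    sampled when every drafted token is accepted.
    """
    emitted = []
    for i, token in enumerate(draft_tokens):
        p = target_probs[i].get(token, 0.0)
        q = draft_probs[i][token]
        if rng.random() < min(1.0, p / q):
            emitted.append(token)  # accepted: keep checking the next position
            continue
        # Rejected: resample from the residual distribution max(0, p - q),
        # then stop, because later drafts were conditioned on this token.
        residual = {t: max(0.0, pt - draft_probs[i].get(t, 0.0))
                    for t, pt in target_probs[i].items()}
        tokens = list(residual)
        weights = [residual[t] for t in tokens]
        emitted.append(rng.choices(tokens, weights=weights)[0])
        return emitted
    # Every draft survived: the target's next-position distribution is
    # already available, so one extra token comes for free.
    bonus = target_probs[len(draft_tokens)]
    emitted.append(rng.choices(list(bonus), weights=list(bonus.values()))[0])
    return emitted
```

In a real system the target probabilities for all positions come from a single parallel forward pass of the large model over the drafted block, which is what makes verification cheaper than decoding token by token.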

Xiaomi tuned DFlash using the Muon second-order optimizer and model self-distillation. The draft model uses Sliding Window Attention (SWA) only, matching the MiMo-V2 design. This keeps per-prediction compute constant rather than growing with context length. The block size is capped at eight to limit verification costs and raise concurrency.
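The constant-cost property of sliding-window attention can be shown with a small counting sketch. Under a causal mask, a query under full attention sees every earlier position, while a sliding-window query sees at most a fixed number of them. The window size used below is a hypothetical value chosen for illustration; the article does not state the one MiMo uses.

```python
def attended_keys(query_pos, window, full_attention=False):
    """Count the key positions visible to a query at 0-based `query_pos`."""
    visible = query_pos + 1  # causal mask: itself plus everything before it
    return visible if full_attention else min(visible, window)


HYPOTHETICAL_WINDOW = 128  # assumption, not a published MiMo setting

for pos in (10, 1_000, 100_000):
    swa = attended_keys(pos, HYPOTHETICAL_WINDOW)
    full = attended_keys(pos, HYPOTHETICAL_WINDOW, full_attention=True)
    print(f"position {pos}: sliding window sees {swa} keys, full attention sees {full}")
```

Once the context is longer than the window, the sliding-window count stops growing, which is why the draft model's per-prediction attention cost stays flat as the conversation lengthens.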

Acceptance length measures how many draft tokens survive verification each round.

Scenario	Acceptance Length
Coding	6.30
Math / Reasoning	5.56
Agent	4.29

In coding scenarios, roughly six of the eight draft tokens are accepted per round on average. The best samples reach an acceptance length of 7.14.
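These figures translate into a time budget per draft-and-verify round. The sketch below assumes each round emits, on average, a number of tokens equal to the acceptance length; whether the reported metric counts the bonus token from verification is not stated, so treat the results as rough arithmetic rather than measured latencies.

```python
# Acceptance lengths from the table above.
ACCEPTANCE_LENGTH = {"Coding": 6.30, "Math / Reasoning": 5.56, "Agent": 4.29}


def round_budget_ms(target_tps, acceptance_length):
    """Average wall-clock time available per draft-and-verify round, in ms."""
    rounds_per_second = target_tps / acceptance_length
    return 1000.0 / rounds_per_second


for scenario, length in ACCEPTANCE_LENGTH.items():
    budget = round_budget_ms(1000, length)
    print(f"{scenario}: {budget:.2f} ms per round at 1,000 tokens per second")
```

The lower the acceptance length, the less time each round may take: an agent workload at 4.29 must complete a full draft-and-verify cycle in about two thirds of the time a coding workload at 6.30 is allowed.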

TileRT: Squeezing the Microseconds

At 1,000 tokens per second, each operator runs for only microseconds. Traditional systems launch operators sequentially, and each launch incurs a time cost. These gaps fracture the execution stream and become the real bottleneck. TileRT replaces this with a Persistent Engine Kernel that remains resident on the GPU. It uses Warp Specialization to split data movement, compute, and communication into coordinated roles. Small operations like RMSNorm, RoPE, and KV cache writes turn into bottlenecks at this scale. The system was co-designed with the FP4 and DFlash choices, not added as an afterthought.
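A back-of-the-envelope calculation shows why launch gaps matter at this speed. The launch count and per-launch cost below are hypothetical placeholders, not measurements from TileRT or any specific GPU; only the 1,000 tokens-per-second target and the 6.30 coding acceptance length come from the article.

```python
def pass_budget_us(target_tps, tokens_per_pass):
    """Microseconds available for one forward pass that yields several tokens."""
    return 1_000_000 / target_tps * tokens_per_pass


def launch_overhead_fraction(launches_per_pass, launch_cost_us, budget_us):
    """Share of the pass budget spent purely on launching kernels."""
    return launches_per_pass * launch_cost_us / budget_us


# Hypothetical inputs for illustration only.
HYPOTHETICAL_LAUNCHES_PER_PASS = 1000
HYPOTHETICAL_LAUNCH_COST_US = 5.0

budget = pass_budget_us(target_tps=1000, tokens_per_pass=6.30)
fraction = launch_overhead_fraction(
    HYPOTHETICAL_LAUNCHES_PER_PASS, HYPOTHETICAL_LAUNCH_COST_US, budget
)
print(f"budget per pass: {budget:.0f} us, launch overhead share: {fraction:.0%}")
```

Under these assumed numbers, launching alone would consume most of the budget before any computation happens. A persistent kernel sidesteps that cost by staying resident on the GPU instead of being relaunched for every operator.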

Use Cases

The release targets latency-sensitive work where waiting breaks the user loop:

Parallel reasoning: run many Best-of-N or tree-search paths within the same wall-clock time.
Coding agents: faster code generation cuts the wait between agent steps.
Real-time decision loops: trading signal generation, fraud interception, and live dialogue.
Interactive prototyping: demos show a Snake game in about 10 seconds and a macOS interface in about one minute.

These are latency-bound workloads where per-request token speed is the binding constraint.

How It Compares

The first table contrasts the two routes to extreme decode speed.

Approach	Hardware	How speed is achieved
Cerebras	Wafer-Scale integration (custom)	Scale on a single custom wafer
Groq	Custom architecture	Pure on-chip SRAM
MiMo × TileRT	Commodity GPUs (8-GPU node)	Model-system codesign: FP4 + DFlash + TileRT

The second table compares the standard model with the UltraSpeed mode.

Dimension	MiMo-V2.5-Pro	MiMo-V2.5-Pro-UltraSpeed
Decode speed	Baseline	~10× faster (1000+ TPS)
Price	1×	3×
Weight precision	Standard	FP4 MoE Experts via QAT
Decoding	Standard autoregressive	DFlash speculative decoding
Access	Standard model plans	API only, application-based trial
Token Plan	Supported	Not supported

Access, Pricing, and Open Source

UltraSpeed ships through a limited, application-based window. The API trial runs from June 9 to June 23, 2026. Pricing is three times the standard MiMo-V2.5-Pro rate, reflecting roughly ten times the speed. It is API only, and the Token Plan is not supported. Approved users also receive free Chat access during the trial. Chat limits apply: 10 queue entries daily, 30-minute sessions, and 5-minute idle release.

Xiaomi has open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face. TileRT has open-sourced select modules on GitHub.

Strengths and Limitations

Strengths

1000+ TPS on a 1T model without custom silicon.
Lossless decoding through rejection sampling in DFlash.
FP4 applied only where tolerance is highest, preserving quality.
An open checkpoint lets the community test the claims.

Limitations

Access is gated, short, and approval-based at launch.
Pricing triples per token versus the standard model.
Acceptance length drops on less predictable workloads: 4.29 in agent scenarios versus 6.30 in coding.
Independent third-party speed verification is not yet public.

Key takeaways

Xiaomi MiMo and TileRT decode a 1-trillion-parameter model past 1,000 tokens per second on commodity GPUs.
The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime.
FP4 (MXFP4) is applied only to MoE Experts; QAT keeps capability essentially on par.
DFlash predicts a whole masked block per forward pass, hitting 6.30 average acceptance length in coding.
UltraSpeed runs on a single 8-GPU node via an application-based API trial, June 9–23, 2026.
