Xiaomi MiMo and TileRT Push a 1-Trillion-Parameter Model Past 1000 Tokens Per Second on Commodity GPUs

By AI Maestro June 8, 2026 4 min read

For creators and developers, the ability to generate text at over 1,000 tokens per second transforms the user experience from a wait into an instant reaction. Xiaomi’s MiMo team, alongside the TileRT systems group, has demonstrated that a 1-trillion-parameter model can decode faster than 1,000 tokens per second on standard hardware. This is not a theoretical claim; demos show generation rates peaking near 1,200 tokens per second. The significance lies in the hardware: this speed is achieved on commodity GPUs, not proprietary silicon.

What is MiMo-V2.5-Pro-UltraSpeed

UltraSpeed is a high-throughput mode for the existing MiMo-V2.5-Pro model. The base architecture operates at the trillion-parameter scale using a Mixture-of-Experts (MoE) design. This specific mode prioritises raw generation velocity over new capability. It alters the rate at which the model outputs tokens through a coordinated approach Xiaomi terms “extreme model-system codesign.” Crucially, the entire stack executes on a single standard node equipped with eight commodity GPUs.

The Speed Case: Three Layers Working Together

The first layer is FP4 quantization. At trillion-parameter scale, FP8 or FP16 weights put heavy pressure on memory capacity and bandwidth. Reducing the bit width means fewer bytes are read per decoded token, which directly lifts decode speed. Xiaomi applies the MXFP4 format selectively, to the MoE Experts only, while other modules retain higher precision (reported as FP8 by TileRT). Because the Experts hold the majority of parameters and tolerate quantization best, the trade-off is favourable. Quantization-Aware Training (QAT) keeps benchmark quality essentially on par with the original.
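To make the format concrete, here is a minimal sketch of block-wise MXFP4 quantization. It assumes the layout of the OCP Microscaling formats (blocks of 32 elements, 4-bit E2M1 elements, one shared 8-bit power-of-two scale per block) and uses simple nearest-value rounding. The article does not describe Xiaomi's actual quantizer or its QAT procedure, so the function names and the rounding choice are illustrative only.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # assumed: elements sharing one scale, per the MX formats
E2M1_EMAX = 2     # exponent of the largest E2M1 magnitude (6.0 = 1.5 * 2**2)


def quantize_block(values):
    """Return (shared power-of-two scale, signed E2M1 values) for one block."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(values)
    # Pick a power-of-two scale so the largest element lands near the top
    # of the E2M1 range.
    scale = 2.0 ** (math.floor(math.log2(max_abs)) - E2M1_EMAX)
    quantized = []
    for v in values:
        # Clamp to the largest representable magnitude, then round to the
        # nearest representable value (illustrative rounding, not QAT).
        target = min(abs(v) / scale, E2M1_VALUES[-1])
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - target))
        quantized.append(math.copysign(nearest, v))
    return scale, quantized


def dequantize_block(scale, quantized):
    """Reconstruct approximate weights from the shared scale and elements."""
    return [scale * q for q in quantized]


def bits_per_weight(block_size=BLOCK_SIZE):
    """4 bits per element plus one 8-bit scale amortized over the block."""
    return 4 + 8 / block_size
```

Under these assumptions each expert weight costs 4.25 bits instead of 8 for FP8, so roughly half as many bytes of expert weights have to be read per decode step.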

The second layer is DFlash speculative decoding. The third is TileRT, the system executing the workload on the GPU. Individually, these techniques are insufficient; the 1,000 tokens per second result requires all three to align tightly.

DFlash: Parallel Drafting Without a Serial Bottleneck

Standard speculative decoding uses a smaller draft model to guess upcoming tokens, which the large model then verifies in parallel. Rejection sampling ensures the output follows the same distribution as normal decoding, so quality is preserved. The limitation is that the draft model still generates its tokens sequentially. DFlash, a method from the research community, removes this constraint with block-level masked parallel prediction: the draft model fills an entire block of masked positions in a single forward pass.
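The verification step can be sketched as follows. This is the standard speculative-sampling acceptance rule, not DFlash's own code: the block-parallel drafting is not reproduced, the bonus token usually sampled when every draft token is accepted is omitted, and the function name and the dictionary-based probability tables are hypothetical.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng=random):
    """Return the tokens emitted in one speculative round.

    draft_tokens: token ids proposed by the draft model, one per position.
    draft_probs[i] / target_probs[i]: dicts mapping token id -> probability
    at position i under the draft model and the target model.
    """
    emitted = []
    for i, token in enumerate(draft_tokens):
        q = draft_probs[i].get(token, 0.0)
        p = target_probs[i].get(token, 0.0)
        # Accept the draft token with probability min(1, p / q).
        if q > 0.0 and rng.random() < min(1.0, p / q):
            emitted.append(token)
            continue
        # Rejected: resample from the residual max(p - q, 0), renormalized
        # implicitly by the weighted draw below.
        residual = {t: max(pt - draft_probs[i].get(t, 0.0), 0.0)
                    for t, pt in target_probs[i].items()}
        if sum(residual.values()) == 0.0:
            residual = dict(target_probs[i])  # degenerate case: use the target
        tokens = list(residual)
        weights = [residual[t] for t in tokens]
        emitted.append(rng.choices(tokens, weights=weights)[0])
        break  # everything after the first rejection is discarded
    return emitted
```

The number of tokens this returns per round is what the acceptance-length figures measure: the closer the draft distribution is to the target, the more of the block survives.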

Xiaomi tuned DFlash using the Muon second-order optimizer and model self-distillation. The draft model uses Sliding Window Attention (SWA) only, matching the MiMo-V2 design. This keeps per-prediction compute constant rather than growing with context length. The block size is capped at eight to limit verification costs and raise concurrency.

Acceptance length measures how many draft tokens survive verification each round.

| Scenario | Acceptance length |
| --- | --- |
| Coding | 6.30 |
| Math / Reasoning | 5.56 |
| Agent | 4.29 |

In coding scenarios, an average of 6.30 of the eight drafted tokens is accepted per round, and some samples reach 7.14.
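A rough way to read these figures is as a time budget per draft-and-verify round. The sketch below assumes that the tokens emitted per round equal the acceptance length and that the 1,000 tokens per second target applies to a single stream; the article states neither, and the function names are hypothetical.

```python
# Average acceptance lengths reported in the article.
ACCEPTANCE_LENGTH = {"Coding": 6.30, "Math / Reasoning": 5.56, "Agent": 4.29}


def round_budget_ms(tokens_per_second, tokens_per_round):
    """Wall-clock time available for one draft-and-verify round, in ms."""
    return 1000.0 * tokens_per_round / tokens_per_second


def tokens_per_second_at(round_ms, tokens_per_round):
    """Decode speed if every round takes round_ms milliseconds."""
    return 1000.0 * tokens_per_round / round_ms


if __name__ == "__main__":
    for scenario, length in ACCEPTANCE_LENGTH.items():
        budget = round_budget_ms(1000.0, length)
        print(f"{scenario}: {budget:.2f} ms per round at 1,000 tokens/s")
```

Under these assumptions a coding workload has about 6.3 ms for each round, while an agent workload has about 4.3 ms, so a lower acceptance length demands a faster round to hold the same speed.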

TileRT: Squeezing the Microseconds

At 1,000 tokens per second, each operator runs for only microseconds, and small operations such as RMSNorm, RoPE, and KV cache writes become bottlenecks. Traditional systems launch operators one after another, and every launch adds overhead; the resulting gaps fragment the execution stream and become the real bottleneck. TileRT replaces this with a Persistent Engine Kernel that stays resident on the GPU and uses Warp Specialization to split data movement, compute, and communication into coordinated roles. The system was co-designed with the FP4 and DFlash choices, not added as an afterthought.
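A back-of-the-envelope model shows why launch gaps matter once operators are this short. Every number in the usage example is hypothetical (the article gives no operator counts, operator timings, or launch costs), and the model treats a persistent kernel simply as a launch gap of zero, which is an idealization.

```python
def step_time_us(num_ops, op_time_us, launch_gap_us):
    """Time for one forward step when each operator pays a launch gap."""
    return num_ops * (op_time_us + launch_gap_us)


def gap_fraction(op_time_us, launch_gap_us):
    """Share of the step spent waiting in launch gaps instead of computing."""
    return launch_gap_us / (op_time_us + launch_gap_us)


if __name__ == "__main__":
    # Hypothetical workload: 1,000 operators of 4 us each per forward step.
    sequential = step_time_us(1000, 4.0, 4.0)   # assumed 4 us gap per launch
    persistent = step_time_us(1000, 4.0, 0.0)   # idealized: no launch gaps
    print(f"sequential launches: {sequential:.0f} us per step")
    print(f"persistent kernel:   {persistent:.0f} us per step")
    print(f"time lost to gaps:   {gap_fraction(4.0, 4.0):.0%}")
```

When the gap is comparable to the operator itself, half of each step is idle time, which is the regime the paragraph above describes.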

Use Cases

The release targets latency-sensitive work where waiting breaks the user loop:

  • Parallel reasoning: run many Best-of-N or tree-search paths within the same wall-clock time.
  • Coding agents: faster code generation cuts the wait between agent steps.
  • Real-time decision loops: trading signal generation, fraud interception, and live dialogue.
  • Interactive prototyping: demos show a Snake game in about 10 seconds and a macOS interface in about one minute.

In all of these workloads, raw token speed is the binding constraint.

How It Compares

The first table contrasts the two routes to extreme decode speed: custom silicon and commodity GPUs.

| Approach | Hardware | How speed is achieved |
| --- | --- | --- |
| Cerebras | Wafer-scale integration (custom) | Scale on a single custom wafer |
| Groq | Custom architecture | Pure on-chip SRAM |
| MiMo × TileRT | Commodity GPUs (8-GPU node) | Model-system codesign: FP4 + DFlash + TileRT |

The second table compares the standard model with the UltraSpeed mode.

| Dimension | MiMo-V2.5-Pro | MiMo-V2.5-Pro-UltraSpeed |
| --- | --- | --- |
| Decode speed | Baseline | ~10× faster (1,000+ TPS) |
| Price | Standard rate | 3× the standard rate |
| Weight precision | Standard | FP4 MoE Experts via QAT |
| Decoding | Standard autoregressive | DFlash speculative decoding |
| Access | Standard model plans | API only, application-based trial |
| Token Plan | Supported | Not supported |

Access, Pricing, and Open Source

UltraSpeed ships through a limited, application-based window. The API trial runs from June 9 to June 23, 2026. Pricing is three times the standard MiMo-V2.5-Pro rate, for roughly ten times the speed. Access is API only, and the Token Plan is not supported. Approved users also receive free Chat access during the trial, subject to limits: 10 queue entries per day, 30-minute sessions, and release after 5 minutes of idle time.

Xiaomi has open-sourced the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face, and TileRT has open-sourced select modules on GitHub.

Strengths and Limitations

Strengths

  • 1000+ TPS on a 1T model without custom silicon.
  • Lossless decoding through rejection sampling in DFlash.
  • FP4 applied only where tolerance is highest, preserving quality.
  • An open checkpoint lets the community test the claims.

Limitations

  • Access is gated, short, and approval-based at launch.
  • Per-token pricing is three times that of the standard model.
  • Acceptance length is lower outside coding (4.29 in agent scenarios), so the gain from DFlash varies by workload.
  • Independent third-party speed verification is not yet public.

Key takeaways

  • Xiaomi MiMo and TileRT decode a 1-trillion-parameter model past 1,000 tokens per second on commodity GPUs.
  • The speedup comes from three layers: FP4 quantization, DFlash speculative decoding, and the TileRT runtime.
  • FP4 (MXFP4) is applied only to MoE Experts; QAT keeps capability essentially on par.
  • DFlash predicts a whole masked block per forward pass, hitting 6.30 average acceptance length in coding.
  • UltraSpeed runs on a single 8-GPU node via an application-based API trial, June 9–23, 2026.
