Microsoft Research's Mirage gives video generation a persistent spatial memory that doesn't forget what's around the corner

Mirage, a new video world model from Microsoft Research and partner universities, tackles a persistent problem in video generation: keeping a scene spatially consistent during long camera movements. Instead of relying on expensive pixel-based 3D reconstructions, the system stores image features directly in a spatial memory inside the model's latent space. According to the researchers, this makes generation up to 10.57 times faster and cuts memory use by up to 55 times compared with color-based memory systems. Crucially, the architecture filters out moving objects so that only stable geometry is stored, which keeps the scene from losing its structural integrity.

Video world models take a single starting frame and a defined camera path and produce a plausible, navigable environment, which is why they are seen as a route to general-purpose world simulators. However, without robust memory, even powerful generators struggle to maintain spatial coherence over time. A corner of a room seen at the start may look different when the camera returns, with furniture shifting or textures changing inexplicably.

Previous attempts to solve this, such as Voyager, WonderWorld, and Spatia, relied on maintaining a 3D point cloud fed by a steady stream of color data. Every generation step required rendering this cloud and translating the result back into the model’s feature space. Microsoft’s research paper identifies this as a double bottleneck: it consumes excessive compute power, and information is lost as the data passes through pixel space.

Mirage bypasses this detour entirely. Rather than storing visible color points, it retains the internal image features the diffusion model already utilises. Each feature is assigned a specific location in 3D space, effectively creating an entry in the system’s spatial memory. When generating a new viewpoint, the model projects this stored data directly onto the target camera coordinates, skipping the inefficient step of rendering a point cloud and re-encoding it.
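The projection step described above can be illustrated with a toy pinhole camera. This is a minimal sketch, not Mirage's actual code: the `MemoryEntry` class, the `project_memory` function, the nearest-entry-wins rule, and all parameters are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    position: tuple  # (x, y, z) in world space
    feature: list    # latent feature vector (illustrative placeholder)


def project_memory(entries, rotation, translation, focal, width, height):
    """Project stored features onto a target camera's latent grid.

    rotation is a 3x3 world-to-camera matrix (list of rows), translation a
    3-vector, and focal is expressed in grid cells. Returns a dict mapping
    (row, col) to the feature of the nearest entry that lands in that cell.
    """
    nearest = {}  # (row, col) -> (depth, feature)
    for entry in entries:
        # World coordinates -> camera coordinates.
        cx, cy, cz = (
            sum(r * p for r, p in zip(rot_row, entry.position)) + t
            for rot_row, t in zip(rotation, translation)
        )
        if cz <= 0:
            continue  # point lies behind the camera
        # Pinhole projection onto the latent grid.
        col = math.floor(focal * cx / cz + width / 2)
        row = math.floor(focal * cy / cz + height / 2)
        if not (0 <= col < width and 0 <= row < height):
            continue  # outside the field of view
        # Assumed occlusion rule: keep only the closest entry per cell.
        if (row, col) not in nearest or cz < nearest[(row, col)][0]:
            nearest[(row, col)] = (cz, entry.feature)
    return {cell: feature for cell, (_, feature) in nearest.items()}
```

No image is rendered and nothing is re-encoded in this sketch: the stored feature vectors are moved to their new grid cells as they are, which is the shortcut the paragraph describes.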

How the memory evolves

The system constructs videos in segments, seeding the spatial memory from the initial image. For every subsequent segment, Mirage retrieves relevant data from memory, generates the new frames, and writes their contents back to the cache. Consequently, the memory grows continuously as the video progresses.
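The read, generate, write cycle can be sketched as a simple loop. This is only an illustration under stated assumptions: `generate_video` and the `generate_segment` callable are hypothetical names standing in for the diffusion model and its memory interface, neither of which the article specifies.

```python
def generate_video(seed_entries, camera_segments, generate_segment):
    """Generate a video segment by segment while growing a spatial memory.

    generate_segment(memory, cameras) is a hypothetical stand-in for the
    video model. It receives a read-only snapshot of the memory plus the
    camera poses of one segment and returns (frames, new_entries).
    """
    memory = list(seed_entries)  # seeded from the initial image
    video = []
    for cameras in camera_segments:
        # Read: the generator conditions on everything stored so far.
        frames, new_entries = generate_segment(tuple(memory), cameras)
        video.extend(frames)
        # Write: features of the new frames go back into the memory.
        memory.extend(new_entries)
    return video, memory
```

Because every segment appends to `memory`, later segments can draw on everything generated before them, which is why the memory keeps growing as the video gets longer.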

To prevent the system from conflicting with itself, a filter strips out moving objects and the sky before writing to memory. This ensures that only stable geometry is preserved in long-term storage. The researchers built upon Alibaba’s open-source video model, Wan2.2, by attaching a small module to teach the model how to utilise this new memory structure, followed by fine-tuning the entire system using LoRA adapters.
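A toy version of the write filter might look like the following. The article does not say how Mirage detects moving objects or sky, so the per-entry `label` field, the label names, and the `filter_stable` function are purely hypothetical.

```python
# Assumed label names; the real system's detection method is not described.
UNSTABLE_LABELS = {"dynamic", "sky"}


def filter_stable(entries):
    """Keep only entries that are safe to store in long-term memory.

    Each entry is a dict with a hypothetical "label" key that marks it as
    static geometry, a moving object ("dynamic"), or sky.
    """
    return [entry for entry in entries if entry["label"] not in UNSTABLE_LABELS]


candidates = [
    {"label": "static", "feature": [0.1]},
    {"label": "dynamic", "feature": [0.2]},  # e.g. a passing car
    {"label": "sky", "feature": [0.3]},
]
stable = filter_stable(candidates)
```

Only the entry labelled `static` survives in this example, so only it would be written to memory.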

Performance and efficiency gains

On the WorldScore benchmark, Mirage outperforms its closest rival, Spatia, which still relies on color-point memory. It also leaves general video generators like Wan2.1 and CogVideoX far behind. The model excels at maintaining a scene’s spatial structure and ensuring surfaces look consistent across many frames.

It also leads two of three metrics on the RealEstate10K dataset during closed-loop tests. In this scenario, the camera circles back to its starting point, serving as a rigorous stress test because even minor errors accumulate over the full path.

Efficiency remains Mirage’s strongest attribute. Color-based memory scales poorly on longer runs, constantly demanding more graphics memory. In contrast, Mirage’s compute cost per frame stabilises after the first segment. The researchers estimate a total gain of up to 10.57x faster generation and up to 55x less memory usage compared to color-based systems.

The team is candid about a limitation: moving objects are dropped at segment boundaries because their geometry cannot be trusted, and the filter deliberately excludes them. Consequently, busy scenes derive less benefit from spatial memory than quiet interiors do. The team identifies storing dynamic content as the obvious next challenge to address.

Further details on Mirage can be found on the project page, where Microsoft also hosts a GitHub repository for Latent Spatial Memory.

Video world models represent one of the most active areas of research in AI video today. While models like Veo primarily produce single, internally consistent clips, world models aim to create navigable scenes that remain consistent over time. Google DeepMind recently demonstrated this capability with Genie 3, which spins up interactive environments in real time and sustains them for several minutes. At I/O, Google also pitched Gemini Omni as a world model and the potential successor to its text-to-video model, Veo.

Key takeaways

Mirage achieves superior spatial consistency by storing features in latent space rather than expensive pixel-based point clouds.
The system delivers significant efficiency gains, with the researchers estimating up to 10.57x faster generation and up to 55x less memory usage than color-based memory systems.
While highly effective for static environments, the model currently struggles to retain moving objects, limiting its utility for busy scenes.
