Mirage is a new video world model that eliminates the costly detour through pixel-based memory. This speeds up generation and keeps the spatial structure of the scene stable during long camera movements. Researchers from several universities worked with Microsoft Research to build it.
Video world models turn starting frames and camera paths into plausible videos, which makes them useful for simulations. But without some kind of memory, even a powerful generator loses track of the space over time. Corners of a room the camera has already passed look different when it turns back: furniture moves and textures change.
Systems such as Voyager, WonderWorld, and Spatia try to solve this problem with 3D point clouds that are continuously fed color data. Every new generation step requires rendering that cloud and converting the result back into the model's internal feature space. Microsoft's new paper calls this a double bottleneck: it costs compute and loses information every time data passes through pixel space.
Mirage takes a different approach. Rather than preserving visible color points, it preserves the internal image features already used by the diffusion model. Each feature gets a spot in 3D space, which turns into an entry in spatial memory.

To generate a new viewpoint, the model projects this memory directly onto the target camera and passes the result to the generator, skipping the point cloud rendering and re-encoding steps. The authors say this also reduces memory usage, because the data is stored at the model's compact internal resolution rather than at full image size.
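The projection step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Mirage's actual implementation: the pinhole camera model, the nearest-point z-buffer, and every name here (MemoryEntry, world_to_camera, project_memory) are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    position: tuple  # (x, y, z) in world coordinates (assumed layout)
    feature: list    # latent feature vector stored for this point


def world_to_camera(point, rotation, translation):
    """Apply a rigid transform: rotation (3x3 nested list) then translation."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project_memory(entries, rotation, translation, focal, width, height):
    """Splat stored features onto a width x height latent grid.

    Uses an assumed pinhole camera; when two entries land in the same
    cell, the one closer to the camera wins (simple z-buffer).
    """
    grid = [[None] * width for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    cx, cy = width / 2.0, height / 2.0
    for entry in entries:
        x, y, z = world_to_camera(entry.position, rotation, translation)
        if z <= 0:
            continue  # point lies behind the target camera
        u = math.floor(focal * x / z + cx)
        v = math.floor(focal * y / z + cy)
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            grid[v][u] = entry.feature
    return grid
```

Cells that no entry reaches stay empty (None), which is where a generator would have to fill in content it has not seen before. Because the grid has latent resolution, no image is rendered or re-encoded along the way.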
How memory grows step by step
Mirage builds the video segment by segment and seeds its spatial memory from the starting image. For each subsequent segment, the system retrieves the relevant data from memory, generates the new frames, and writes their contents back to memory. The memory keeps growing as the video progresses.
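The retrieve, generate, write cycle can be pictured as a simple loop. The sketch below is a toy under stated assumptions: the function names are hypothetical, the generator is a caller-supplied placeholder rather than a real diffusion model, and retrieval simply returns everything stored so far.

```python
def retrieve(memory, segment_index):
    """Stand-in for retrieval: returns all entries stored so far (assumption)."""
    return list(memory)


def rollout(start_entries, num_segments, generate_segment):
    """Grow spatial memory segment by segment.

    generate_segment(conditioning, k) is a hypothetical callable that
    returns the new memory entries produced while generating segment k.
    """
    memory = list(start_entries)  # seeded from the starting image
    sizes = [len(memory)]
    for k in range(1, num_segments + 1):
        conditioning = retrieve(memory, k)             # read
        new_entries = generate_segment(conditioning, k)  # generate
        memory.extend(new_entries)                     # write back
        sizes.append(len(memory))
    return memory, sizes


def fake_generator(conditioning, k):
    """Toy generator: emits one made-up entry per segment."""
    return [((float(k), 0.0, 1.0), [float(k)])]


memory, sizes = rollout([((0.0, 0.0, 1.0), [0.0])], 3, fake_generator)
print(sizes)  # memory size after the seed and after each segment
```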

Filters keep the system from tripping up by removing moving objects and sky before writing, so only stable geometry ends up in long-term memory. The researchers built on Alibaba's open source video model Wan2.2, added a small add-on module that teaches the model to use the new memory, and fine-tuned the whole thing with a LoRA adapter.
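A minimal sketch of such a write filter follows. The article does not say how moving objects and sky are detected, so this example assumes each candidate entry already carries a semantic label from some upstream step; the label names and the dictionary layout are hypothetical.

```python
# Hypothetical label sets; the real system's detection method is not described here.
DYNAMIC_LABELS = {"person", "car", "animal"}
SKY_LABEL = "sky"


def filter_static(candidates):
    """Keep only entries that are neither dynamic objects nor sky."""
    return [
        c for c in candidates
        if c["label"] not in DYNAMIC_LABELS and c["label"] != SKY_LABEL
    ]


candidates = [
    {"label": "wall", "feature": [0.1]},
    {"label": "person", "feature": [0.2]},
    {"label": "sky", "feature": [0.3]},
    {"label": "sofa", "feature": [0.4]},
]
kept = filter_static(candidates)
print([c["label"] for c in kept])  # only the static entries remain
```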
Faster and lighter than color-based rivals
On the WorldScore benchmark, Mirage outperforms its closest rival, Spatia, which still keeps its memory as color points, and holds a significant lead over popular video generators such as Wan2.1 and CogVideoX. It is especially strong at preserving the spatial structure of the scene and keeping surfaces consistent across many frames.
It also leads in two out of three metrics on the RealEstate10K dataset in closed-loop tests. Here the camera returns to its starting point: a grueling stress test where every little error adds up over the entire pass.
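One way to picture a closed-loop check is to compare the first frame with the frame produced once the camera is back at its starting pose. The article does not name the three metrics, so PSNR is used here purely as an illustrative stand-in, and frames are simplified to flat lists of pixel values in [0, 1].

```python
import math


def psnr(frame_a, frame_b, max_value=1.0):
    """Peak signal-to-noise ratio between two equally sized frames."""
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf  # identical frames: no drift at all
    return 10 * math.log10(max_value ** 2 / mse)


start_frame = [0.0, 0.5, 1.0, 0.25]
returned_frame = [0.0, 0.5, 0.9, 0.25]  # small drift after the loop
print(psnr(start_frame, returned_frame))
```

A higher value means the returned view sits closer to the original, so accumulated errors over the loop show up as a lower score.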

Efficiency is Mirage's strength. Color-based memory does not scale well over long runs and keeps demanding more graphics memory. Mirage's computational cost per frame changes little after the first segment. The researchers estimate an overall gain of up to 10.57 times faster generation and up to 55 times less memory compared to color-based systems.
They are upfront about one catch. Moving objects are lost at segment boundaries because their geometry is unreliable and the filter intentionally discards them. Crowded scenes therefore benefit less from spatial memory than quiet interiors. The team names storing dynamic content as the obvious next problem to solve.
More information about Mirage is available on the project page. Microsoft also maintains a GitHub repository for latent spatial memory.
Video world models are currently one of the hottest research areas in AI video. Models like Veo primarily produce a single, internally consistent clip, while world models try to make the scene navigable and keep it consistent over time. Google DeepMind recently demonstrated this with Genie 3, which generates an interactive environment in real time and keeps it stable for several minutes. At I/O, Google also touted Gemini Omni as a world model and a possible successor to its text-to-video model Veo.
