Microsoft Research’s Mirage gives video generation a persistent spatial memory so the model doesn’t forget its surroundings

Mirage is a new video world model that eliminates the costly detour through pixel-based memory. This speeds up generation and keeps the spatial structure of the scene stable during long camera movements. Researchers from several universities worked with Microsoft Research to build it.

Video world models turn starting frames and camera paths into plausible videos, which makes them useful as simulators. But without some kind of memory, even a powerful generator loses track of the space over time. Corners of a room the camera has already passed look different when it turns back: furniture moves and textures change.

Systems such as Voyager, WonderWorld, and Spatia try to solve this problem with 3D point clouds that are continuously fed color data. Each new generation step requires rendering that cloud and encoding the result back into the model’s internal feature space. Microsoft’s new paper calls this a double bottleneck: it costs compute and loses information every time data passes through pixel space.

Mirage takes a different approach. Rather than storing visible color points, it stores the internal image features the diffusion model already uses. Each feature is assigned a position in 3D space and becomes an entry in the spatial memory.
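As a rough illustration of what giving a feature a position in 3D space can look like, the sketch below back-projects a grid of latent features using a per-cell depth map and a pinhole camera. The function name `lift_latents_to_3d`, the pinhole model, and the array layouts are assumptions made for this example and are not taken from the Mirage paper or code.

```python
import numpy as np

def lift_latents_to_3d(latents, depth, K, cam_to_world):
    """Assign a world-space position to every latent feature (illustrative only).

    latents:      (H, W, C) feature grid at latent resolution
    depth:        (H, W) depth per latent cell, in camera units
    K:            (3, 3) pinhole intrinsics scaled to the latent grid (assumed)
    cam_to_world: (4, 4) camera pose
    Returns (H*W, 3) world positions and (H*W, C) features in matching order.
    """
    h, w, c = latents.shape
    # Centre coordinates of every latent cell; v indexes rows, u indexes columns.
    v, u = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Back-project: a ray with unit z, scaled by depth, gives camera-space points.
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)
    # Apply the camera pose to move the points into world space.
    pts_world = pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    return pts_world, latents.reshape(-1, c)
```

Each returned row pairs one feature vector with one 3D point, which is the kind of entry a spatial memory of this sort could hold.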

Two video world model pipelines side by side. Top: RGB point cloud memory with a rendering and encoding loop. Bottom: Mirage’s latent spatial memory, constructed in latent space and read directly. | Image: Wang et al.

To generate a new viewpoint, the model projects this memory directly into the target camera view and passes the result to the generator, skipping the point cloud rendering and re-encoding steps. The authors say this also reduces memory usage, because the data is stored at the model’s compact internal resolution rather than at full image size.
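The read step described above can be pictured as a projection of stored features into the target view, with the nearest entry winning when several land in the same cell. The sketch below is a minimal version under stated assumptions: a pinhole camera, nearest-cell splatting, and a hypothetical function name `read_memory`. The paper's actual projection may differ.

```python
import numpy as np

def read_memory(points, feats, K, world_to_cam, h, w):
    """Project stored features into a target view at latent resolution.

    points: (N, 3) world positions, feats: (N, C) stored features.
    Returns an (h, w, C) feature map and an (h, w) boolean coverage mask.
    """
    c = feats.shape[1]
    # World space -> camera space of the target view.
    pts_cam = points @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    z = pts_cam[:, 2]
    in_front = z > 1e-6
    pts_cam, z, feats = pts_cam[in_front], z[in_front], feats[in_front]
    # Perspective projection onto the latent grid.
    proj = pts_cam @ K.T
    u = np.floor(proj[:, 0] / z).astype(int)
    v = np.floor(proj[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, feats = u[inside], v[inside], z[inside], feats[inside]
    # Z-buffer: sort nearest first, then keep the first hit per grid cell.
    order = np.argsort(z, kind="stable")
    cells, first = np.unique((v * w + u)[order], return_index=True)
    winners = order[first]
    out = np.zeros((h * w, c), dtype=feats.dtype)
    mask = np.zeros(h * w, dtype=bool)
    out[cells] = feats[winners]
    mask[cells] = True
    return out.reshape(h, w, c), mask.reshape(h, w)
```

Because the output is already a feature map at latent resolution, nothing in this sketch renders to RGB or runs an encoder, which is the detour the article says Mirage avoids.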

How memory grows step by step

Mirage builds the video segment by segment and seeds the spatial memory from the starting image. For each subsequent segment, the system retrieves the relevant data from memory, generates the new frames, and writes their content back to the cache. The memory keeps growing as the video progresses.
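The read, generate, write cycle can be summarized as a short control loop. The sketch below only illustrates that ordering: `rollout`, `read`, `generate_chunk`, and `write` are hypothetical placeholders passed in as callables, not functions from the Mirage code.

```python
def rollout(first_frame, camera_chunks, read, generate_chunk, write):
    """Generate a video chunk by chunk around a growing memory (illustrative).

    read(memory, cameras)            -> context for the generator
    generate_chunk(context, cameras) -> list of new frames
    write(memory, frames)            -> updated memory
    All three callables are stand-ins supplied by the caller.
    """
    # Seed the memory from the starting image before any generation happens.
    memory = write([], [first_frame])
    video = [first_frame]
    for cameras in camera_chunks:
        context = read(memory, cameras)            # 1. read what is already known
        frames = generate_chunk(context, cameras)  # 2. generate the next chunk
        video.extend(frames)
        memory = write(memory, frames)             # 3. write new content back
    return video, memory
```

With trivial stand-ins (a list as memory, string frames), the memory holds one entry after seeding and gains one per generated frame, mirroring the growth the article describes.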

Mirage seeds a latent cache from the starting image using a VAE and depth estimation, reads from and writes to it chunk by chunk, and keeps static scene content intact throughout the run while the latent 3D representation grows from t0 to tN. | Image: Wang et al.

Filters keep the memory clean by removing moving objects and sky regions before writing, so only stable geometry ends up in long-term memory. The researchers built on Alibaba’s open-source video model Wan2.2, added a small add-on module that teaches the model to use the new memory, and fine-tuned the whole thing with a LoRA adapter.
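A write-time filter of this kind amounts to masking out candidate entries before they reach the cache. The sketch below assumes boolean masks for moving objects and sky are already available at latent resolution, for example from some segmentation model. The function name `select_static` and the mask inputs are hypothetical and not taken from the paper.

```python
import numpy as np

def select_static(points, feats, dynamic_mask, sky_mask):
    """Keep only entries that are neither moving objects nor sky (illustrative).

    points:       (H*W, 3) candidate positions for one frame
    feats:        (H*W, C) matching latent features
    dynamic_mask: (H, W) bool, True where a moving object was detected (assumed input)
    sky_mask:     (H, W) bool, True where sky was detected (assumed input)
    """
    # An entry is written only if both masks are False for its grid cell.
    keep = ~(dynamic_mask | sky_mask).reshape(-1)
    return points[keep], feats[keep]
```

Only the rows that survive the mask would be appended to long-term memory, so anything flagged as dynamic in a frame leaves no trace in the cache.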

Faster and lighter than color-based rivals

On the WorldScore benchmark, Mirage outperforms its closest rival, Spatia, which still stores its memory as color points, and holds a significant lead over popular video generators such as Wan2.1 and CogVideoX. It does especially well at preserving the spatial structure of the scene and keeping surfaces looking consistent across many frames.

Mirage also leads on two out of three metrics in closed-loop tests on the RealEstate10K dataset. Here the camera returns to its starting point, a grueling stress test in which every small error adds up over the entire pass.

Two bar charts spanning five generation chunks. Left: average generation time per frame. Right: peak cache VRAM. Mirage keeps compute time and memory roughly flat across runs, while Spatia, VMem, and Gen3C get hungrier with each chunk. | Image: Wang et al.

Efficiency is Mirage’s main strength. Color-based memory scales poorly over long runs and keeps demanding more graphics memory, while Mirage’s computational cost per frame barely changes after the first segment. The researchers report up to 10.57 times faster generation and up to 55 times lower memory use compared to color-based systems.

The researchers are upfront about one catch. Moving objects are lost at segment boundaries because their geometry is unreliable and the filter intentionally discards them. Crowded scenes therefore benefit less from spatial memory than quiet interiors. The team names storing dynamic content as the obvious next problem to solve.

More information about Mirage is available on the project page. Microsoft also maintains a GitHub repository for latent spatial memory.

Video world models are currently one of the hottest research areas in AI video. Models like Veo primarily produce a single, internally consistent clip, while world models attempt to make the scene navigable and keep it consistent over time. Google DeepMind recently demonstrated this with Genie 3, which generates an interactive environment in real time and keeps it stable for several minutes. At I/O, Google also touted Gemini Omni as a world model and a possible successor to its text-to-video model Veo.
