Microsoft Research’s World-R1 uses Flow-GRPO and 3D-Aware Rewards to introduce geometric consistency to Wan 2.1 without architectural changes

Video foundation models can paint beautiful frames. They are still notoriously bad at remembering the scene they just painted. When you push a camera down a hallway in Wan 2.1 or CogVideoX, walls distort, objects deform, and detail disappears. This suggests that these models are fitting 2D pixel correlations rather than simulating a consistent 3D scene.

A team of researchers from Microsoft Research and Zhejiang University has introduced World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. The research team builds on the recent finding that video foundation models already encode rich 3D geometric information internally. The job, then, is to draw out that latent knowledge rather than to curate expensive 3D assets. World-R1 does this by post-training an existing text-to-video (T2V) model with reinforcement learning, using rewards from a pre-trained 3D foundation model and a vision-language critic. The base architecture remains unchanged and the inference cost stays the same.

Two World-R1 variants have been released: World-R1-Small (built on Wan2.1-T2V-1.3B) and World-R1-Large (built on Wan2.1-T2V-14B).

Setup: Flow-GRPO on a flow-matching video model

World-R1 uses Flow-GRPO-Fast, a recent adaptation of GRPO to flow-matching diffusion models. Flow-GRPO converts the deterministic ODE sampler into a reverse-time SDE so that the policy is stochastic enough for advantage estimation, and then optimizes the clipped GRPO surrogate with KL regularization toward the reference policy. The Fast variant injects SDE noise only at randomly selected intermediate steps to reduce rollout costs.
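To make the objective concrete, here is a minimal sketch of the group-relative advantage and clipped surrogate that GRPO-style methods optimize. It is not World-R1's or Flow-GRPO's code: the function names, the `clip_eps` and `beta` values, and the scalar `kl_to_ref` are illustrative assumptions, and in a real run the log-probabilities would come from the SDE sampler. Only the group size of 8 is taken from the article.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards inside one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO/GRPO-style clipped term for a single rollout (clip_eps is assumed)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def grpo_objective(logp_new, logp_old, rewards, kl_to_ref, beta=0.01):
    """Mean clipped surrogate minus a KL penalty toward the reference policy.

    kl_to_ref is treated as one precomputed scalar here, which is a
    simplification; beta is an assumed weight.
    """
    advantages = group_advantages(rewards)
    terms = [
        clipped_surrogate(n, o, a)
        for n, o, a in zip(logp_new, logp_old, advantages)
    ]
    return mean(terms) - beta * kl_to_ref


# One group of G = 8 rollouts with made-up rewards and unchanged log-probs.
rewards = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
objective = grpo_objective([0.0] * 8, [0.0] * 8, rewards, kl_to_ref=0.0)
```

Because advantages are normalized within the group, a rollout is only rewarded for beating its siblings on the same prompt, which is why no separate value network is needed.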

Training is performed on 48 NVIDIA H200 GPUs for the small model and 96 H200 GPUs for the large model at 832 × 480 resolution, with a GRPO group size of G = 8 across 48 parallel groups.

3D-aware rewards: analysis by synthesis

The interesting work happens in the reward. For each generated video x, the system reconstructs a 3D Gaussian splatting (3DGS) representation Φ_GS using Depth Anything 3 and recovers the estimated camera trajectory Ê. The combined 3D reward is:

R_3D = S_meta + S_recon + S_traj

S_meta renders Φ_GS from meta views (camera poses offset from the generation trajectory) and asks Qwen3-VL, prompted as a 3D vision expert, to score the reconstruction from 0 to 9, deducting points for floaters, billboard artifacts, and texture stretching that look fine from the front but fall apart off-axis.
S_recon re-renders the scene along Ê and compares the result with x via 1 − LPIPS.
S_traj measures the deviation between the requested trajectory E and the recovered Ê, using L2 distance for translation and geodesic distance for rotation, wrapped in a negative exponential.
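The trajectory term can be sketched as follows. This is an illustration under stated assumptions, not the paper's formula: the article only says that an L2 translation error and a geodesic rotation error are wrapped in a negative exponential, so the per-frame averaging, the weights `w_trans` and `w_rot`, and all function names are hypothetical.

```python
import math


def rotation_geodesic(rot_a, rot_b):
    """Geodesic angle in radians between two 3x3 rotation matrices."""
    # trace(rot_a^T rot_b) equals the sum of elementwise products.
    trace = sum(rot_a[i][j] * rot_b[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_theta)


def translation_l2(t_a, t_b):
    """Euclidean distance between two camera positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_a, t_b)))


def trajectory_score(requested, recovered, w_trans=1.0, w_rot=1.0):
    """Score in (0, 1]; 1 means the recovered path matches the request.

    requested / recovered: equal-length lists of (rotation, translation)
    poses. The averaging and the weights are assumptions.
    """
    n = len(requested)
    t_err = sum(
        translation_l2(req[1], rec[1]) for req, rec in zip(requested, recovered)
    ) / n
    r_err = sum(
        rotation_geodesic(req[0], rec[0]) for req, rec in zip(requested, recovered)
    ) / n
    return math.exp(-(w_trans * t_err + w_rot * r_err))


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

perfect = trajectory_score([(IDENTITY, [0.0, 0.0, 0.0])], [(IDENTITY, [0.0, 0.0, 0.0])])
rotated = trajectory_score([(IDENTITY, [0.0, 0.0, 0.0])], [(ROT_Z_90, [0.0, 0.0, 0.0])])
```

The negative exponential keeps the term bounded, so a badly recovered trajectory drives the score toward zero instead of producing an arbitrarily large penalty.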

A general aesthetic term R_gen, computed as the average HPSv3 score over the first K frames, is added with weight λ_gen = 1 to keep visual quality from collapsing under geometric pressure.
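Putting the pieces together, the total reward can be read as R = R_3D + λ_gen · R_gen. The sketch below shows only that arithmetic. The value of `k`, the function names, and the assumption that all scores are already on comparable scales are illustrative, since the article does not say how the 0 to 9 critic score is normalized.

```python
def aesthetic_reward(frame_scores, k):
    """Average per-frame aesthetic score over the first k frames."""
    first_k = frame_scores[:k]
    return sum(first_k) / len(first_k)


def total_reward(s_meta, s_recon, s_traj, frame_scores, k=4, lambda_gen=1.0):
    """R = (S_meta + S_recon + S_traj) + lambda_gen * R_gen.

    k is a hypothetical value; lambda_gen = 1 follows the article.
    """
    r_3d = s_meta + s_recon + s_traj
    return r_3d + lambda_gen * aesthetic_reward(frame_scores, k)


# Made-up scores for one rollout.
reward = total_reward(0.5, 0.8, 0.9, [0.2, 0.4, 0.9], k=2)
```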

Implicit camera conditioning with noise warping

Rather than training a CameraCtrl-style adapter, World-R1 uses the Go-with-the-Flow paradigm: the prompt carries a motion token (push_in, orbit_left, pull_out, etc.), a set of camera extrinsics is generated from it, projected to 2D optical flow under a fronto-parallel scene assumption, and used to perform discrete noise transport on the initial latent. The transported noise preserves unit variance through density-tracking normalization, so the diffusion prior is not perturbed, but the latent already encodes the requested trajectory. There are no new parameters or architectural changes.
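The variance-preserving idea can be illustrated with a heavily simplified sketch. This is not the Go-with-the-Flow algorithm: it uses nearest-pixel transport on a small 2D grid, and the helper names and the `scale` parameter are assumptions. It shows only two things: under a fronto-parallel plane a forward camera move reduces to a uniform zoom about the image center, and dividing each summed noise value by the square root of its source count keeps every output pixel at unit variance.

```python
import math
import random


def zoom_flow(height, width, scale):
    """Integer flow for a uniform zoom about the image center.

    For a fronto-parallel plane, pure forward camera motion is a zoom;
    scale > 1 mimics push_in, scale < 1 mimics pull_out.
    """
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    return [
        [(round((y - cy) * (scale - 1.0)), round((x - cx) * (scale - 1.0)))
         for x in range(width)]
        for y in range(height)
    ]


def warp_noise(noise, flow, rng):
    """Move each noise sample along the flow, then renormalize by density."""
    h, w = len(noise), len(noise[0])
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            ty, tx = y + dy, x + dx
            if 0 <= ty < h and 0 <= tx < w:
                total[ty][tx] += noise[y][x]
                count[ty][tx] += 1
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            n = count[y][x]
            if n == 0:
                # Nothing landed here (disocclusion): draw fresh noise.
                out[y][x] = rng.gauss(0.0, 1.0)
            else:
                # A sum of n unit Gaussians has variance n.
                out[y][x] = total[y][x] / math.sqrt(n)
    return out


rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(64)]
warped = warp_noise(noise, zoom_flow(64, 64, 0.5), rng)
```

Since each source sample lands in at most one target pixel, the warped field is still made of independent unit-variance Gaussians, which is the property that lets the frozen diffusion prior accept it as ordinary initial noise.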

Text-only dataset and periodic decoupling to preserve motion

The training data is a synthetic, text-only dataset of approximately 3,000 prompts generated by Gemini. The prompts are organized along WorldScore's camera-trajectory taxonomy (intra-scene, inter-scene, composite, static) and span natural landscapes, cities and architecture, micro and still life, fantasy and surrealism, and artistic styles. Training on text alone decouples 3D learning from the visual bias of any particular video corpus.

Strict 3D rewards have a known failure mode: the model overfits to rigid scenes and stops generating dynamic content. World-R1 mitigates this with periodic decoupled training. Every 100 steps, R_3D is paused and the model is fine-tuned on R_gen alone, using a subset of approximately 500 dynamic prompts (waterfalls, crowds, fire, changing objects). Ablating this stage raises reconstruction PSNR, but VBench AVG drops from 85.21 to 82.64, which is exactly the reward-hacking degeneration the research team points out.
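One way to picture the schedule is a small step-to-mode switch. The article gives the period (100 steps) but not how long each decoupled phase lasts, so the single-step phase below is an assumption, as are the function names and mode strings.

```python
def reward_mode(step, period=100):
    """Return which reward drives the update at a given training step.

    Assumes the decoupled phase is a single update every `period` steps;
    the source does not state its length.
    """
    if step > 0 and step % period == 0:
        return "gen_only"
    return "3d_plus_gen"


def prompt_pool(step, main_prompts, dynamic_prompts, period=100):
    """Pick the prompt pool that matches the reward mode for this step."""
    if reward_mode(step, period) == "gen_only":
        return dynamic_prompts
    return main_prompts


# Count decoupled updates in a hypothetical 1,000-step run.
decoupled_steps = [s for s in range(1, 1001) if reward_mode(s) == "gen_only"]
```

The design intent described in the article is that the aesthetic-only phases on dynamic prompts keep motion alive, so the 3D reward cannot be maximized by simply freezing the scene.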

Understanding the results

Under the 3DGS-based reconstruction protocol, World-R1-Large reaches 27.67 PSNR / 0.865 SSIM / 0.162 LPIPS, a 7.91 dB PSNR gain over the 19.76 / 0.629 / 0.405 of Wan2.1-T2V-14B. World-R1-Small records a 10.23 dB gain on the 1.3B backbone. On the reconstruction-independent Multi-View Consistency Score (MVCS), borrowed from GeoVideo, World-R1-Large reaches 0.993, outperforming all tested 3D-conditioned and camera-control baselines (Voyager, ViewCrafter, FlashWorld, ReCamMaster, etc.).
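For a sense of scale, the dB gain can be converted into an error ratio. The sketch below relies only on the standard definition PSNR = 10·log10(MAX² / MSE). Reading the reported averages as if they were single-image figures is a simplification, because a mean of per-video PSNR values is not the PSNR of the mean MSE.

```python
def psnr_gain_db(psnr_new, psnr_base):
    """Difference between two PSNR values in dB."""
    return psnr_new - psnr_base


def mse_reduction_factor(gain_db):
    """A gain of g dB corresponds to dividing MSE by 10 ** (g / 10)."""
    return 10.0 ** (gain_db / 10.0)


# Figures reported in the article for World-R1-Large vs. Wan2.1-T2V-14B.
gain = psnr_gain_db(27.67, 19.76)
factor = mse_reduction_factor(gain)
```

Under that simplification, the 7.91 dB gain corresponds to a mean squared reconstruction error roughly six times lower.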

Despite not being a dedicated camera-control architecture, World-R1 is competitive with specialized methods on camera control: the large model reaches RotErr 1.21, TransErr 1.30, and CamMC 2.95, outperforming CamCloneMaster and ReCamMaster. VBench scores improve over the base Wan 2.1 in aesthetic quality, image quality, motion smoothness, and subject consistency, with only a slight regression in background consistency.

For AI practitioners, two robustness results stand out. First, a dataset-scaling sweep shows monotonic gains from 1K → 2K → 3K prompts in both 3D consistency and VBench AVG, suggesting that the recipe is data-efficient and has room to scale further. Second, although training uses short clips, World-R1-Large generalizes to 121-frame videos: PSNR on the Wan2.1-T2V-14B backbone rises from 18.32 to 26.32. A double-blind user study with 25 participants reported win rates versus Wan 2.1 of 92% for geometric consistency, 76% for camera control accuracy, and 86% for overall preference.

Key takeaways

RL replaces architectural surgery for 3D consistency. World-R1 uses Flow-GRPO-Fast to post-train Wan2.1 instead of adding a 3D module or training on a 3D supervised dataset. The basic architecture and inference costs remain unchanged.
The reward is analysis by synthesis. Each generated video is lifted into a 3D Gaussian splatting representation via Depth Anything 3 and scored on three axes: meta-view quality (as judged by Qwen3-VL), reconstruction fidelity (1 − LPIPS), and trajectory alignment. An HPSv3 aesthetic reward is added to prevent quality collapse.
Camera control is achieved through noise warping rather than new parameters. Motion tokens in the prompt are converted into camera extrinsics, projected to 2D optical flow, and used to warp the initial latent through Go-with-the-Flow discrete noise transport. CameraCtrl-style adapters are not required.
Periodic decoupled training prevents reward hacking. The 3D reward is paused every 100 steps, and the model is fine-tuned using only the aesthetic reward on about 500 dynamic prompts. Removing this stage improves PSNR but reduces the VBench score, because the model collapses into static, easy-to-reconstruct output.
The gains are large and hold outside the training pipeline. World-R1-Large improves PSNR by 7.91 dB compared to Wan2.1-T2V-14B, generalizes to 121-frame videos, and improves the reconstruction-independent MVCS metric. The overall preference win rate reached 86% in a double-blind user study with 25 participants.


Michal Sutter is a data science expert with a master’s degree in data science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
