Netflix AI Team Open Sources VOID: An AI Model to Erase Objects from Videos – Physics and All



Video editing has always had a dirty secret: removing an object from footage is easy, but making the scene look as if the object was never there is very hard. Take out a person holding a guitar and you are left with an instrument floating in defiance of gravity. Hollywood VFX teams spend weeks solving exactly this kind of problem. A team of researchers from Netflix and INSAIT, Sofia University 'St. Kliment Ohridski', has released VOID (Video Object and Interaction Deletion), a model that can do it automatically.

VOID removes objects from videos along with all the interactions they induce on the scene. That covers secondary effects such as shadows and reflections, and also physical interactions such as objects falling when the person holding them is removed.

What problem does VOID actually solve?

Standard video inpainting models (the kind used in most editing workflows today) are trained to fill in the pixel region where the object used to be. They are essentially very sophisticated background painters. What they do not do is reason about causality: if I delete an actor who is holding a prop, what happens to that prop?

Existing video object removal methods are good at inpainting the content "behind" the object and at correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models cannot correct them and produce implausible results.

VOID is built on CogVideoX and fine-tuned for video inpainting with interaction-aware mask conditioning. The key innovation lies in how the model understands the scene: it asks not just "which pixels should I fill?" but also "what would physically happen to this scene once the object is gone?"

The standard example from the research paper: when a person holding a guitar is removed, VOID also removes the person's influence on the guitar, so the guitar falls naturally. This is not trivial. The model has to understand that the guitar was being supported by the person, and that removing the person means gravity takes over.

VOID was also evaluated head-to-head against real competitors. According to the authors, experiments on both synthetic and real data show that the approach better preserves consistent scene dynamics after object removal than prior video object removal methods, including ProPainter, DiffuEraser, Runway, MiniMax-Remover, ROSE, and Gen-Omnimatte.

https://arxiv.org/pdf/2604.02296

Architecture: CogVideoX internals

VOID is built on Alibaba PAI's CogVideoX-Fun-V1.5-5b-InP and fine-tuned for interaction-aware video inpainting with quadmask conditioning. CogVideoX is a 3D Transformer-based video generation model. Think of it as a video version of Stable Diffusion: a diffusion model that operates on a temporal sequence of frames rather than a single image. The specific base model (CogVideoX-Fun-V1.5-5b-InP) is a checkpoint released by Alibaba PAI on Hugging Face, and engineers need to download it separately before running VOID.

The fine-tuned architecture in brief: a CogVideoX 3D Transformer with 5B parameters that takes the video, a quadmask, and a text prompt describing the scene after removal as input. It operates at a default resolution of 384 × 672, processes up to 197 frames, uses the DDIM scheduler, and runs in BF16 with FP8 quantization for memory efficiency.

The quadmask is probably the most interesting technical contribution here. Rather than a binary mask (remove this pixel / keep this pixel), the quadmask is a four-valued mask that encodes the primary object to remove, overlap regions, affected regions (falling objects, displaced items), and the background to keep.

In practice, each pixel in the mask takes one of four values: 0 (primary object to be removed), 63 (overlap between the primary and affected regions), 127 (the region affected by the interaction, meaning whatever will move or change as a result of the removal), and 255 (background, kept as is). This gives the model a structured semantic map of what is happening in the scene, not just where the object is.
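To make the four-value encoding concrete, here is a minimal NumPy sketch that combines two boolean masks into a quadmask. The function name `build_quadmask` and the idea of starting from two separate boolean masks are assumptions for illustration; the VOID repository may build its masks differently. Only the four values (0, 63, 127, 255) come from the description above.

```python
import numpy as np

# Quadmask values as listed in the article.
PRIMARY, OVERLAP, AFFECTED, BACKGROUND = 0, 63, 127, 255


def build_quadmask(primary: np.ndarray, affected: np.ndarray) -> np.ndarray:
    """Combine two boolean (H, W) masks into one uint8 quadmask.

    Hypothetical helper: `primary` marks the object to remove and
    `affected` marks the region expected to change once it is gone.
    """
    if primary.shape != affected.shape:
        raise ValueError("masks must have the same shape")
    primary = primary.astype(bool)
    affected = affected.astype(bool)

    quad = np.full(primary.shape, BACKGROUND, dtype=np.uint8)
    quad[affected] = AFFECTED            # region that will move or change
    quad[primary] = PRIMARY              # object to delete
    quad[primary & affected] = OVERLAP   # pixels belonging to both
    return quad


# Toy 1 x 4 frame containing one pixel of each kind.
primary = np.array([[True, True, False, False]])
affected = np.array([[False, True, True, False]])
print(build_quadmask(primary, affected).tolist())  # [[0, 63, 127, 255]]
```

For a video, the same function would be applied per frame, giving one quadmask per frame alongside the input clip.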

2-pass inference pipeline

VOID uses two transformer checkpoints that are trained sequentially. You can run inference with Pass 1 alone, or chain both passes for better temporal consistency.

Pass 1 (void_pass1.safetensors) is the base inpainting model and is sufficient for most videos. Pass 2 serves a specific purpose: correcting a known failure mode. If the model detects object morphing, a known failure mode of smaller video diffusion models, an optional second pass re-runs inference using flow-warped noise derived from the first pass, stabilizing the object's shape along the newly synthesized trajectory.

The distinction is worth understanding. Pass 2 is not simply for longer clips; it is specifically a shape-stability fix. If the diffusion model produces objects that gradually warp or deform across frames (a well-documented artifact in video diffusion), Pass 2 uses optical flow to warp the latents from Pass 1 and feeds them as initialization into a second diffusion run, anchoring the shape of the synthesized object from frame to frame.
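The warping step can be illustrated with a small NumPy sketch. This is not VOID's implementation: the function names, the flow convention (each output pixel stores the offset to the source pixel it samples from), and the nearest-neighbour sampling are all assumptions chosen to keep the example short. It only shows the general idea of carrying an array (latents or noise) along an optical flow field so that consecutive frames stay aligned with the motion.

```python
import numpy as np


def warp_with_flow(latent: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp a (C, H, W) array with a (2, H, W) flow field.

    Assumed convention: flow[0] is the x offset and flow[1] the y offset,
    in pixels, from each output pixel to the source pixel it copies.
    Sampling is nearest-neighbour and out-of-range lookups are clamped.
    """
    _, h, w = latent.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    return latent[:, src_y, src_x]


def propagate(first: np.ndarray, flows: list) -> np.ndarray:
    """Carry frame 0 along a list of per-frame flows; returns (T, C, H, W)."""
    frames = [first]
    for flow in flows:
        frames.append(warp_with_flow(frames[-1], flow))
    return np.stack(frames)


# Toy example: a 1-channel, 1 x 4 "latent" whose content shifts right by
# one pixel per frame (each output pixel samples its left neighbour).
first = np.array([[[1.0, 2.0, 3.0, 4.0]]])
shift_right = np.zeros((2, 1, 4))
shift_right[0] = -1.0
sequence = propagate(first, [shift_right, shift_right])
print(sequence[:, 0, 0].tolist())
# [[1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 2.0]]
```

In a real pipeline the flow would be estimated from the Pass 1 output, and the warped result would initialize the second diffusion run; how VOID blends it with fresh noise is not covered here.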

How the training data was generated

This is where things get really interesting. Training a model to understand physical interactions requires paired videos: the same scene with and without the object, with the physics playing out correctly in both. Real-world paired data of this kind does not exist at scale, so the team built it synthetically.

For training, the team used paired counterfactual videos generated from two sources: HUMOTO (human-object interactions rendered in Blender with physics simulation) and Kubric (object-only interactions using Google Scanned Objects).

HUMOTO uses motion capture data of human-object interactions. The key mechanism is Blender re-simulation. The scene is set up with both the human and the objects and rendered once with the human present; then the human is removed from the simulation and the physics is re-run from that point on. The result is a physically correct counterfactual: objects that were being held or supported now fall, as they should. Kubric, developed by Google Research, applies the same idea to collisions between objects. Together, they yield a dataset of paired videos whose physics is correct by construction rather than approximated by human annotators.
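The counterfactual idea can be shown with a toy one-dimensional simulation in plain Python. This is a deliberately simplified stand-in, not the Blender or Kubric pipeline: the function `simulate_height`, the 24 fps frame rate, and the single-object setup are all assumptions made for illustration. The same scene is simulated twice, once with the object held and once with the support removed, which yields a "factual" and a "counterfactual" trajectory.

```python
def simulate_height(start_height: float, frames: int, held: bool,
                    fps: int = 24, gravity: float = 9.81) -> list:
    """Return the object's height (in metres) at each frame.

    Hypothetical toy model: a held object stays where it is; an unsupported
    object accelerates downward until it reaches the floor at height 0.
    """
    dt = 1.0 / fps
    height, velocity = start_height, 0.0
    trajectory = []
    for _ in range(frames):
        trajectory.append(height)
        if not held and height > 0.0:
            velocity -= gravity * dt
            height = max(0.0, height + velocity * dt)
    return trajectory


# Factual clip: the person is present and holds the object at 1 m.
factual = simulate_height(1.0, frames=24, held=True)
# Counterfactual clip: the person is removed, so the physics is re-run
# with nothing supporting the object.
counterfactual = simulate_height(1.0, frames=24, held=False)

print(factual[-1], counterfactual[-1])  # 1.0 0.0
```

Each (factual, counterfactual) pair plays the role of one training example: the model sees the first and is supervised to produce the second.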

Key takeaways

  • VOID goes beyond pixel filling. Unlike existing video inpainting tools that only correct visual artifacts such as shadows and reflections, VOID understands physical cause and effect. If you remove a person holding an object, the object falls naturally in the output video.
  • The quadmask is the core innovation. Instead of a simple binary remove/keep mask, VOID uses a four-valued quadmask (values 0, 63, 127, 255) that encodes not only what to remove but also which surrounding regions of the scene will be physically affected, giving the diffusion model a structured understanding of the scene.
  • Two-pass inference addresses a real failure mode. Pass 1 handles most videos. Pass 2 exists specifically to correct object morphing artifacts, a known weakness of video diffusion models, by using optical-flow-warped latents from Pass 1 as the initialization for a second diffusion run.
  • Synthetic paired data made training possible. Real-world counterfactual video pairs do not exist at scale, so the researchers built the dataset with Blender physics re-simulation (HUMOTO) and Google's Kubric framework, generating before-and-after ground-truth pairs whose physics is correct by construction.



Michal Sutter is a data science expert with a master’s degree in data science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.


