Netflix Just Open-Sourced Reality’s Undo Button (and the Physics Are Terrifying)
Released under Apache 2.0 with full training code, VOID represents a watershed moment in open-source video AI. It also happens to be a regulatory nightmare, a potential advertising revolution, and possibly the most effective historical revisionism tool since the airbrush.
The Technical Reality: Physics-Aware Inpainting
Most video inpainting models treat removal like a glorified clone stamp: they fill the hole with background texture and call it a day. VOID instead operates on counterfactual reasoning. Built atop CogVideoX-Fun-V1.5-5b-InP, this 5-billion-parameter diffusion model uses a quadmask system that encodes four distinct semantic regions per pixel:
| Value | Region | Description |
|---|---|---|
| 0 | Primary Object | The entity being removed |
| 63 | Overlap | Boundary regions between object and affected areas |
| 127 | Affected Region | Objects that will fall, collide, or change trajectory |
| 255 | Background | Static scene elements to preserve |
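A quadmask frame is just a single-channel image taking those four values. As a minimal sketch (the helper name and inputs are illustrative, not from the released code), one could assemble a frame from per-pixel boolean masks like this:

```python
import numpy as np

# Value scheme follows the table above; this helper is a hypothetical
# illustration of how such a frame could be built, not VOID's actual code.
QUAD_VALUES = {"object": 0, "overlap": 63, "affected": 127, "background": 255}

def build_quadmask(object_mask: np.ndarray, affected_mask: np.ndarray) -> np.ndarray:
    """object_mask / affected_mask: HxW boolean arrays for one frame."""
    quad = np.full(object_mask.shape, QUAD_VALUES["background"], dtype=np.uint8)
    quad[affected_mask] = QUAD_VALUES["affected"]   # things that would move/fall
    quad[object_mask] = QUAD_VALUES["object"]       # the entity being removed
    # Pixels claimed by both masks are boundary/overlap regions.
    quad[object_mask & affected_mask] = QUAD_VALUES["overlap"]
    return quad

obj = np.zeros((4, 4), dtype=bool); obj[1:3, 1:3] = True
aff = np.zeros((4, 4), dtype=bool); aff[2:4, 2:4] = True
mask = build_quadmask(obj, aff)
print(mask)
```

The ordering matters: overlap is written last so it wins wherever object and affected regions intersect.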
The magic happens in the affected-region detection. VOID employs a Vision-Language Model (VLM) pipeline, using SAM2 segmentation and Gemini reasoning, to identify not just what you're removing but what that removal causes. Remove the person leaning on a Jenga tower, and the model understands the tower remains standing without the extra load; remove the cat batting at the tower, and it understands the structure stays intact once the cat's influence disappears.

Training required synthetic counterfactuals generated from two sources: HUMOTO (human-object interactions rendered in Blender with physics simulation) and Kubric (object-only collisions using Google Scanned Objects). The team trained on 8× A100 80GB GPUs using DeepSpeed ZeRO Stage 2, producing a two-pass inference system:
- Pass 1: Base inpainting with quadmask conditioning (sufficient for most videos)
- Pass 2: Warped-noise refinement using optical flow from Pass 1 output to stabilize temporal consistency on longer clips
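The intuition behind Pass 2 is that if the initial diffusion noise is advected along the optical flow instead of being resampled per frame, corresponding pixels in consecutive frames see correlated noise, so the denoiser's output stops flickering. A toy illustration (a constant one-pixel shift stands in for real optical flow; this is not the model's code):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 8, 8
flow = (0, 1)  # toy "optical flow": the whole scene moves one pixel right

noise0 = rng.standard_normal((H, W))
# Warped noise for the next frame: advect along the flow rather than resample.
noise1 = np.roll(noise0, shift=flow, axis=(0, 1))

# The same scene content now sees the same noise sample in both frames;
# independent resampling would decorrelate them completely.
corr = np.corrcoef(noise0[:, :-1].ravel(), noise1[:, 1:].ravel())[0, 1]
print(round(corr, 3))  # 1.0: perfectly correlated along the flow
```

Real warped-noise methods interpolate fractional flow vectors and handle occlusions, but the temporal-consistency mechanism is the same.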
The hardware requirements are brutal: 40GB+ VRAM minimum (A100 territory), making local inference costs a significant consideration compared to cloud APIs for production workloads. Netflix provides a ready-to-run Colab notebook, but you’re not running this on your laptop.
The Censorship Engine Nobody Asked For
Within hours of release, developers identified the obvious dystopian applications. VOID doesn't just remove objects; it removes consequences. The immediate speculation focused on retroactive censorship: cigarettes digitally excised from classic films, political figures vanishing from archival footage, or "localized" content where certain demographics are algorithmically erased for specific markets.
More concerning is the advertising angle. The model enables dynamic product placement at the distribution layer. A character could hold a generic "beverage can" during filming, with Netflix inserting Coca-Cola for viewers in Atlanta and Pepsi for viewers in New York, while removing the original prop and its physical interactions (condensation, finger placement, weight shift) in real time. When sponsorship deals expire, brands can be scrubbed from the content entirely, the physics recalculated to show the scene as if the product never existed.
This isn't theoretical. The model's architecture specifically handles "interaction deletion": if a removed object caused another to fall, the falling stops; if it caused a splash, the water settles. The causal chain is broken and re-simulated.
Netflix vs. The Closed AI Labs
There's an irony here that hasn't escaped the ML community: Netflix, an entertainment company, just released a more open video AI model than most dedicated AI labs. While Anthropic and OpenAI keep their video generation capabilities behind API gates, Netflix dropped the full stack: weights, training code, data generation pipelines, and a Gradio demo.
This continues Netflix's open-source legacy. They pioneered Chaos Monkey a decade ago (randomly killing production servers to build resilience) and have consistently released infrastructure tools. But VOID is different: it's a creative tool with immediate implications for content authenticity.
The release strategy suggests Netflix recognizes they can’t control the genie, so they’re opting to influence the bottle. By open-sourcing under Apache 2.0, they accelerate adoption while potentially establishing the technical standards for “ethical” object removal, complete with paper trails and provenance tracking that closed systems obscure.
The Deepfake Regulation Problem
VOID breaks current deepfake detection paradigms. Existing forensic tools look for inpainting artifacts, lighting inconsistencies, or shadow mismatches. VOID generates physically consistent shadows, reflections, and collision physics. When you remove a person from a video of a crowd, the model doesn’t just paint over them, it recalculates how the remaining people would have stood, moved, and cast shadows in that person’s absence.
This creates a class of "negative deepfakes": authentic footage altered by subtraction rather than addition. Current provenance standards like C2PA (the Coalition for Content Provenance and Authenticity) focus on tracking generative additions, but VOID demonstrates we need cryptographic signing for removals too. If a news organization deletes a protestor from footage, or a government excises a dissident from historical records, the technical artifacts will be harder to detect than traditional deepfakes because the physics are "correct."
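What "signing for removals" could look like, in miniature: a manifest that records content hashes of the footage before and after the edit, plus a human-readable description of what was subtracted, all covered by a signature. The sketch below is a conceptual illustration, not C2PA's actual manifest format; HMAC with a hard-coded key stands in for real key management.

```python
import hashlib
import hmac
import json

SECRET = b"newsroom-signing-key"  # illustrative only; real systems use PKI

def footage_hash(frames: list[bytes]) -> str:
    """Hash a sequence of encoded frames into one digest."""
    h = hashlib.sha256()
    for f in frames:
        h.update(f)
    return h.hexdigest()

def sign_removal(original: list[bytes], edited: list[bytes], note: str) -> dict:
    # The manifest binds the pre-edit and post-edit states together, so a
    # verifier can prove a subtraction happened even though nothing was added.
    manifest = {
        "action": "object_removal",
        "before": footage_hash(original),
        "after": footage_hash(edited),
        "note": note,
    }
    payload = json.dumps(manifest, sort_keys=True).encode()
    manifest["signature"] = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return manifest

m = sign_removal([b"frame-a", b"frame-b"], [b"frame-a", b"frame-b2"],
                 "removed boom mic, frames 12-40")
print(m["before"] != m["after"])  # True: the edit is detectable by hash
```

The point is architectural: provenance must attest to what was taken out, not merely certify what was put in.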
The model’s ability to process up to 197 frames (roughly 6.5 seconds at 30fps) at 384×672 resolution means entire scenes can be altered with temporal consistency. The second-pass warping ensures objects don’t morph or flicker, the bane of earlier inpainting methods.
Practical Implementation (If You Have the Silicon)
For those with the hardware to run it, VOID offers a complete pipeline:
```bash
# Install dependencies
pip install -r requirements.txt

# Download base model and VOID checkpoints
huggingface-cli download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP --local-dir ./CogVideoX-Fun-V1.5-5b-InP
huggingface-cli download netflix/void-model --local-dir .

# Run inference
python inference/cogvideox_fun/predict_v2v.py \
  --config config/quadmask_cogvideox.py \
  --config.data.data_rootdir="./sample" \
  --config.experiment.run_seqs="lime" \
  --config.video_model.transformer_path="./void_pass1.safetensors"
```
The input format requires three files per video: the source MP4, a quadmask MP4 (generated via the included SAM2+Gemini pipeline), and a prompt.json describing the background after removal, not the removal itself. Describe “A table with a cup on it”, not “Remove the person.”
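Inferred from the flags above, a plausible input layout looks like the following (the sequence name `lime` comes from the command; the exact filenames and the `prompt.json` schema are assumptions, not documented specifics):

```
sample/
└── lime/
    ├── video.mp4       # source clip
    ├── quadmask.mp4    # from the SAM2+Gemini pipeline (or the mask GUI)
    └── prompt.json     # e.g. {"prompt": "A table with a cup on it."}
```

The prompt describes the counterfactual scene the model should converge to, which is why it names the background, not the edit.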
The mask generation GUI allows manual refinement, crucial because the VLM reasoning occasionally misses secondary interactions (like a displaced curtain or rolling pencil). Users can toggle grid cells between affected (127) and background (255) regions, or paint pixel-level corrections.
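The grid-toggle behavior described above reduces to swapping two values inside one cell-aligned block of the quadmask array. A sketch, assuming a uint8 mask with VOID's value scheme (the function name and 16-pixel cell size are illustrative, not taken from the released tool):

```python
import numpy as np

AFFECTED, BACKGROUND = 127, 255  # the two classes the GUI lets you flip

def toggle_cell(quadmask: np.ndarray, row: int, col: int, cell: int = 16) -> None:
    """Flip one grid cell between affected and background, in place."""
    ys = slice(row * cell, (row + 1) * cell)
    xs = slice(col * cell, (col + 1) * cell)
    block = quadmask[ys, xs]
    # Swap only the two editable classes; object (0) and overlap (63)
    # pixels inside the cell are left untouched.
    block[block == AFFECTED] = 1            # temporary sentinel value
    block[block == BACKGROUND] = AFFECTED
    block[block == 1] = BACKGROUND

mask = np.full((32, 32), BACKGROUND, dtype=np.uint8)
toggle_cell(mask, 0, 0)
print(mask[0, 0], mask[31, 31])  # 127 255
```

Pixel-level corrections would simply paint `AFFECTED` or `BACKGROUND` values directly rather than operating on a whole cell.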
The Verdict
VOID represents a shift from "generative AI" to "counterfactual AI": models that don't just create content but rewrite existing content's causal history. It's a tool of immense creative utility for filmmakers removing boom mics or stunt wires, and of immense danger for anyone concerned with media authenticity.
Netflix’s decision to open-source it, rather than keep it as a proprietary editing suite, forces the conversation about video provenance into the open. The code is out there. The physics simulations are getting better. And soon, the question won’t be “Was that video generated?” but “Who was removed from it, and what fell when they vanished?”
The undo button for reality exists now. We just have to decide who gets to press it.




