SpriteDX - Video Matting SOTA 2025


Let’s look at some SOTA video matting models in the context of removing the background from sprite animations.

Robust Video Matting (RVM)

August 2021 — Shanchuan Lin, Linjie Yang, Imran Saleemi, Soumyadip Sengupta

https://github.com/PeterL1n/RobustVideoMatting

RVM gives a great composite when it works. However, it doesn’t detect legs well until they start moving. I suspect legs are underrepresented in the training data, which probably has lots of portrait shots but few full-body shots.

    uv run python inference.py \
        --variant resnet50 \
        --checkpoint "rvm_resnet50.pth" \
        --device cpu \
        --input-source "input.mp4" \
        --output-type video \
        --output-composition "composition.mp4" \
        --output-alpha "alpha.mp4" \
        --output-foreground "foreground.mp4" \
        --output-video-mbps 4 \
        --seq-chunk 1

Background Matting V2

December 2020 — Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian Curless, Steve Seitz, Ira Kemelmacher-Shlizerman

https://github.com/PeterL1n/BackgroundMattingV2

BMv2 is RVM’s ancestor. It takes a video paired with an image of the background. In our case we know the background, so the hope is that the matting engine will be more lenient toward unseen data since we are providing a strong negative signal.

    uv run python inference_video.py \
        --model-type mattingrefine \
        --model-backbone resnet50 \
        --model-backbone-scale 0.25 \
        --model-refine-mode sampling \
        --model-refine-sample-pixels 80000 \
        --model-checkpoint "pytorch_resnet50.pth" \
        --video-src "input.mp4" \
        --video-bgr "input_bgr.png" \
        --output-dir "outputs" \
        --output-type com fgr pha err ref \
        --device cpu

It does seem to do a lot better than stock RVM, but there are quite a few small errors (false positives and false negatives).

I also ran the mattingbase option; without refinement, things look much worse.

    uv run python inference_video.py \
        --model-type mattingbase \
        --model-backbone resnet50 \
        --model-backbone-scale 0.25 \
        --model-refine-mode sampling \
        --model-refine-sample-pixels 80000 \
        --model-checkpoint "pytorch_resnet50.pth" \
        --video-src "input.mp4" \
        --video-bgr "input_bgr.png" \
        --output-dir "outputs" \
        --output-type com fgr pha err \
        --device cpu

Both RVM and BMv2 use CNN-based ResNet backbones, so perhaps there is a better SOTA model that uses vision transformers. But let’s hold that thought and diagnose the data further. Let’s use a dark gray matte to see the issue better.

There is not much I can directly analyze from these pictures, but in the unrefined version it seems that the alpha needs a lot more tightening.

  • The model is good at global-level prediction but makes lots of mistakes at the local level. Perhaps we can build a tool that lets an annotator mark erroneous areas and then do RLHF.

  • Interestingly, the mattingbase model has almost no false negatives (i.e. foreground classified as background). That seems promising: if we can direct the model to tighten things, perhaps we have a chance at better accuracy.

  • I could try different architectures that use ViT.

  • Alternatively, there is a near-infinite amount of data on the web with alpha channels, so I could add that to the dataset and re-train either the RVM or the BMv2 model. Before I commit to something like that, I want to study related work further.
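The false positive / false negative observation can be quantified whenever a reference alpha is available for a frame. Below is a minimal sketch in plain Python (nested lists, so it runs without dependencies; NumPy arrays would be the natural choice in practice). The function name `matte_error_rates` and the 0.5 threshold are illustrative assumptions, not part of BMv2 or RVM.

```python
def matte_error_rates(pred, truth, threshold=0.5):
    """Return (false_positive_rate, false_negative_rate) for a predicted matte.

    pred, truth: 2D lists of alpha values in [0, 1] with identical shape.
    A pixel counts as foreground when its alpha is >= threshold (assumed cutoff).
    """
    fp = fn = bg_total = fg_total = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            is_fg = t >= threshold
            says_fg = p >= threshold
            if is_fg:
                fg_total += 1
                fn += int(not says_fg)  # foreground predicted as background
            else:
                bg_total += 1
                fp += int(says_fg)      # background predicted as foreground
    fp_rate = fp / bg_total if bg_total else 0.0
    fn_rate = fn / fg_total if fg_total else 0.0
    return fp_rate, fn_rate
```

A matte that is too loose but never drops foreground, like the mattingbase output described above, would show a high first value and a second value near zero.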

BMv2 also makes foreground correction an explicit part of the architecture, not just a byproduct of alpha estimation: it predicts alpha and foreground simultaneously.

The key compositional constraint used during training is:

$$I_t = \alpha_t F_t + (1-\alpha_t)B_t$$

This lets the model learn both a clean matte ($\alpha_t$) and a color-corrected foreground $F_t$, where $F_t$ is independent of background color influence.
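To make the constraint concrete, here is a minimal sketch that applies it per pixel and composites a predicted foreground over a flat dark gray matte for inspection. It is plain Python with colors as (r, g, b) tuples of floats in [0, 1]; the names `composite_pixel` and `composite_frame` and the gray value 0.25 are assumptions for illustration, not values taken from BMv2.

```python
DARK_GRAY = (0.25, 0.25, 0.25)  # assumed matte color for inspection


def composite_pixel(alpha, fg, bg):
    """Apply I = alpha * F + (1 - alpha) * B to one RGB pixel."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


def composite_frame(alpha, fgr, bg_color=DARK_GRAY):
    """Composite a predicted foreground over a flat background color.

    alpha: 2D list of floats in [0, 1]
    fgr:   2D list of (r, g, b) tuples, same shape as alpha
    """
    return [
        [composite_pixel(a, f, bg_color) for a, f in zip(alpha_row, fgr_row)]
        for alpha_row, fgr_row in zip(alpha, fgr)
    ]
```

Because the foreground is predicted separately from alpha, any leftover tint from the original background shows up in `fgr` rather than being hidden by the matte, which is why a corrected foreground matters for compositing onto a new background.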

Matting-Anything

Nov 2023 — Jiachen Li, Jitesh Jain, Humphrey Shi

https://github.com/SHI-Labs/Matting-Anything

Not quite what I was expecting. BMv2 seems to be much better.

ViTMatte

May 2023 — Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang

https://github.com/hustvl/ViTMatte

An interesting thing about ViTMatte is that it lets us provide guidance through what’s called a “trimap.”

    uv run python run_one_image.py \
        --model vitmatte-b \
        --checkpoint-dir ./checkpoints/ViTMatte_B_Com.pth \
        --device cpu

TriMap Used

Generated Composite

So it probably requires a lot of tuning, but so far I didn’t get what I wanted.

A trimap is really helpful if we have a human annotator who can do some pre-annotation, but providing imperfect trimaps actually has a negative impact, so it requires good-quality annotations to begin with.

Another limitation is that these models don’t predict “corrected foregrounds,” which are critical for compositing.
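For reference, a trimap labels every pixel as definite background, definite foreground, or unknown. The sketch below derives one from a coarse alpha matte by thresholding, using the common 0 / 128 / 255 encoding. This is only one possible way to get a trimap without an annotator and was not tested here; the function name `alpha_to_trimap` and the cutoffs 0.05 and 0.95 are assumptions.

```python
def alpha_to_trimap(alpha, lo=0.05, hi=0.95):
    """Map a coarse alpha matte to a trimap.

    alpha: 2D list of floats in [0, 1]
    Returns a 2D list with 0 = background, 128 = unknown, 255 = foreground.
    """
    def label(a):
        if a <= lo:
            return 0
        if a >= hi:
            return 255
        return 128  # uncertain band the matting model is asked to resolve

    return [[label(a) for a in row] for row in alpha]
```

Pixels mislabeled as definite foreground or background cannot be recovered by the matting model, which is consistent with imperfect trimaps hurting the result.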

MatteFormer

Seems very similar to ViTMatte. Inference is very sensitive to the trimap used.

Generative Video Matting (GVM)

https://github.com/aim-uofa/GVM

…still looking into it…


Current thinking

Current thinking is that we should fine-tune BMv2 on an augmented dataset of animated GIF sprites.

  • Collect GIF animations.

  • Preprocess them to be slightly blurry.

  • Create fake foregrounds.

  • Then train on that dataset.
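One possible reading of these steps, sketched in plain Python on single-channel frames for brevity: soften the sprite (GIF transparency is binary, so edges are hard), then composite it over a known background to get a training sample with ground-truth alpha and foreground. The helper names `box_blur` and `make_training_sample` are hypothetical, and the dictionary keys only mirror the `src` / `bgr` / `pha` / `fgr` naming from the BMv2 commands above; this is not BMv2’s actual data loader.

```python
def box_blur(channel):
    """3x3 box blur of a 2D list of floats; edges average in-bounds neighbors only."""
    h, w = len(channel), len(channel[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [
                channel[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def make_training_sample(sprite_gray, sprite_alpha, background):
    """Build one synthetic sample from a sprite frame and a known background.

    All arguments are 2D lists of floats in [0, 1] with identical shape.
    """
    alpha = box_blur(sprite_alpha)  # soften the hard GIF edge
    fgr = box_blur(sprite_gray)     # slightly blurry foreground
    image = [
        [a * f + (1.0 - a) * b for a, f, b in zip(a_row, f_row, b_row)]
        for a_row, f_row, b_row in zip(alpha, fgr, background)
    ]
    return {"src": image, "bgr": background, "pha": alpha, "fgr": fgr}
```

Since the background is chosen by us, every sample comes with an exact `bgr` image, which matches the video-plus-background input that BMv2 expects.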


— Sprited Dev 🌱