
SpriteDX - Pixel Alignment - Lab Note 5


Okay, in Lab Note 4, we demonstrated that the anti-corruption model shows promising results when using a magenta background.

Pupuri

We also discovered that the “sub-pixel translation” behavior isn’t ideal. Here is a zoomed-in view:

When the character is idling, the head bobs, and the contour at the top of the head changes more dynamically than it would in most hand-drawn sprite animations.

In a hand-pixelled sprite animation, more often than not, the whole head would move in whole-pixel steps instead of being redrawn at sub-pixel offsets.

Is sub-pixel translation a problem?

Titles like Metal Slug are known to use pixel color shifts to emulate sub-pixel movement. So our animations showing sub-pixel transitions may not necessarily be a bad thing.

However, the main issue is that sub-pixel translations are often the visual cue used to tell apart fake pixel-art animations (often generated from 3D models) vs. hand-crafted sprite animations.

So it is not that sub-pixel translations are wrong; rather, we want our resulting animations to have that hand-crafted feel.

What is the cause of sub-pixel translations?

The animation models we use are not trained to produce pixel-snapped motion. As a result, the animations they generate are more fluid than real sprite animations.

What can we do?

These are not mutually exclusive, but here are some options.

Approach 1 - Video Training

The approach I have in mind is to extend and re-train the anti-corruption model on image sequences (i.e. animations) rather than single frames. Instead of feeding a single frame during training, we will feed in multiple frames and have the model emit multiple frames.

The hypothesis is that the model will learn whole-pixel animation (e.g. the whole head bobs by whole pixels) and suppress sub-pixel translations.

Approach 2 - Lower FPS

While this won’t get rid of sub-pixel translation effects, it may reduce their impact. This is only a hypothesis and needs to be tested. I am not entirely hopeful about this approach.

Approach 3 - Learn to Skip Frames

Not all frames are created equal; some frames last longer than others. Instead of using a fixed frame rate, we can train a model to tell which frames should be kept and which should be skipped.

Training this may be very difficult, as frame-duration data is not easy to come by. Given a large set of animated GIFs (which store per-frame durations), it may be possible, but it would still be hard.

Alternatively, we could borrow heuristics from animation fundamentals. Animation is well studied, and there is a wealth of books on it. We can refer to those books to figure out good heuristics for which frames to keep and which to skip.
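As a crude stand-in for such a heuristic, here is a minimal numpy sketch that keeps a frame only when it differs enough from the last kept frame. The function name and the threshold value are hypothetical, not part of any existing pipeline:

```python
import numpy as np

def select_keyframes(frames, threshold=0.05):
    # frames: (N, H, W) grayscale array in [0, 1].
    # Keep a frame when its mean absolute pixel difference from the
    # last kept frame exceeds the (hypothetical) threshold.
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        diff = np.abs(frames[i] - frames[kept[-1]]).mean()
        if diff > threshold:
            kept.append(i)
    return kept
```

A real heuristic would likely weigh pose extremes and holds, not just raw pixel difference, but this gives a baseline to test against.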

What’s the plan?

Approach 2 and Approach 3 only mitigate the issue and do not address the root cause. So, I want to focus on Approach 1.

Prior Work?

I will need to consult with AI agents on this one. However, the idea is to extend the U-Net structure from our anti-corruption model.

In a previous study of the Seedance 1 Pro model, we noticed that it compresses 4 frames of information into a single latent frame. I want to take a hint from this and design our model to emulate this “compression of 4 real frames.”

The current idea is that each group of 4 frames will be treated as a unit. If we are inferencing from a static frame, we will run inference with that frame repeated. If we are inferencing from a video, we will divide the frames into blocks of 4 frames and run each block.
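The blocking scheme above can be sketched as follows. `to_blocks` is a hypothetical helper; padding the last partial block by repeating its final frame is one possible choice, not a settled decision:

```python
def to_blocks(frames, block=4):
    # frames: a list of frames (any per-frame representation).
    # A single static frame is simply repeated to fill one block.
    if len(frames) == 1:
        return [frames * block]
    blocks = []
    for i in range(0, len(frames), block):
        chunk = frames[i:i + block]
        # Pad a short final block by repeating its last frame
        # (assumed padding strategy).
        while len(chunk) < block:
            chunk.append(chunk[-1])
        blocks.append(chunk)
    return blocks
```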

Then there will need to be some type of temporal convolution or attention. I think I will need to re-read the Seedance paper to figure out what this means in practice.

Model Proposal — Anti-Corruption V2

In the current v1 of the anti-corruption model, we solved frame-wise pseudo-frame to clean-frame conversion.

However, this model can’t capture the subtleties of real sprite animations between frames. The current model is not able “to read between the lines.”

In V2, we must learn:

pseudo-sequence → clean-sequence
with temporal coherence and sprite-like motion priors.

Dataset

For the dataset, we will use per-character clips with T frames (e.g. 8-32) that are aligned (same canvas size) with transparent backgrounds.

Input Data Generation

We will create pseudo inputs via existing corruption pipeline.

We will also need to add some motion blur, or try to reproduce the temporal artifacts of Seedance 1 Pro animation results.
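One cheap way to fake such temporal artifacts is to ghost each frame with its predecessor. This is an assumed corruption for illustration, not a faithful reproduction of Seedance’s actual artifacts:

```python
import numpy as np

def add_ghosting(frames, strength=0.3):
    # frames: (T, H, W) or (T, C, H, W) float array.
    # Blend each frame with the previous frame to fake the temporal
    # smearing often seen in video-model outputs; strength is a
    # hypothetical knob for the corruption pipeline.
    out = frames.astype(float).copy()
    out[1:] = (1.0 - strength) * frames[1:] + strength * frames[:-1]
    return out
```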

Model Design Tier A

In this option, we use a 2D U-Net + temporal module. Basically we keep our proven per-frame U-Net but add a small temporal backbone.

Structure

  • Encode each frame with shared 2D conv encoder → feature maps (T, C, H, W)

  • Temporal mixing in feature space (not pixels):

    • Temporal depthwise conv (Conv1D over time) per spatial location

    • or ConvGRU over time

    • or a small windowed temporal attention (Seedance-style idea but tiny)

  • Decode per-frame with a shared decoder.
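To make the temporal depthwise Conv1D idea concrete, here is a minimal numpy sketch: each spatial location is mixed across time with a small kernel. For clarity the kernel is shared across all channels; a real implementation would use learned per-channel kernels (e.g. a grouped `nn.Conv1d` in PyTorch):

```python
import numpy as np

def temporal_depthwise_conv(feats, kernel):
    # feats: (T, C, H, W) feature maps from the shared 2D encoder.
    # kernel: (K,) 1D kernel applied along the time axis at every
    # spatial location (shared across channels in this sketch).
    T = feats.shape[0]
    K = len(kernel)
    pad = K // 2
    # Pad along time by repeating edge frames so output length stays T.
    padded = np.pad(feats, ((pad, pad), (0, 0), (0, 0), (0, 0)), mode="edge")
    out = np.zeros_like(feats, dtype=float)
    for k in range(K):
        out += kernel[k] * padded[k:k + T]
    return out
```

With an identity kernel the features pass through unchanged; a smoothing kernel like `[0.25, 0.5, 0.25]` mixes each feature with its temporal neighbors.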

Why it works

  • We get temporal consistency without blowing up memory like full 3D U-Net

When to choose

  • We want something we can train soon and iterate on quickly.

Model Design Tier B

Alternatively, we can use a classic video-restoration architecture (3D U-Net).

Structure

  • Treat inputs as (C, T, H, W)

  • Use Conv3D + downsample in space (and optionally in time)

  • Decode back to (C, T, H, W)

Pros

  • Strong at spatiotemporal smoothing and coherence

Cons

  • Heavy VRAM usage. Harder to scale to longer sequences or bigger batches.

When to choose

  • If sprite frames are small (128×128) and sequences are short (8-16 frames), it should be feasible.

Model Design Tier C

This one is Seedance-inspired. We learn in latent space and decouple spatial fidelity from temporal coherence.

Structure

  1. SpriteVAE: encode each frame (or chunk) into latents (smaller H/W)

  2. Latent restoration model: does pseudo→clean in latent space.

    1. spatial blocks: per-frame attention/conv

    2. temporal blocks: across-time mixing.

  3. Decode with VAE decoder

When to choose

  • If we want this to become a core pipeline component and we’re okay investing in infra.

Proposal

The recommendation from Pixel is that we start simple with Tier A, since it is the easiest to get started with.

Inference Strategy

The suggestion from Pixel is that we do inference in a windowed fashion.

  • windows of T = 8 or T = 16

  • overlap by 2-4 frames

  • blend overlaps (linear blend or just take center frames)
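The windowed scheme can be sketched as below. This version blends overlaps by simple averaging (one of the options above, rather than a linear ramp), and `restore_fn` stands in for the trained model:

```python
import numpy as np

def windowed_restore(frames, restore_fn, T=8, overlap=2):
    # frames: (N, ...) array of frames; restore_fn maps a window of
    # up to T frames to restored frames of the same shape.
    N = frames.shape[0]
    step = T - overlap
    out = np.zeros_like(frames, dtype=float)
    weight = np.zeros(N)
    start = 0
    while True:
        end = min(start + T, N)
        out[start:end] += restore_fn(frames[start:end])
        weight[start:end] += 1
        if end == N:
            break
        start += step
    # Average overlapping predictions frame by frame.
    return out / weight.reshape(-1, *([1] * (frames.ndim - 1)))
```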

Summary

For v2.0, we create a temporally consistent sprite U-Net.

Input: X ∈ R^{T×4×H×W} (RGBA or RGB+mask)
Output: Y ∈ R^{T×4×H×W}

Backbone

  • Frame encoder: our current U-Net encoder (shared weights)

  • Temporal mixer at bottleneck + skip levels:

    • ConvGRU on feature maps (per scale), or

    • Temporal Conv1D (kernel 3/5) applied per spatial location (efficient)

  • Frame decoder: our current decoder (shared weights)


That’s what I have for the initial proposal.

Let’s pause here. Next step is:

  • Prepare some kind of dataset.

— Sprited Dev 🐛
