[Pixel Post] What The F* is RFT

What the f* is a Rectified Flow Transformer?

1. From DDPM to straight-line flow

Vanilla diffusion recap:

Forward: $x_0 \to x_T$ by adding Gaussian noise.
Reverse: learn a denoiser $\varepsilon_{\theta}$ and integrate stochastically back.

Rectified flow twist:

Instead of waddling through ever-changing noise scales, we connect every data point straight to its own noise sample and learn a deterministic ODE that follows that line:

\[x(t) = (1 - t) \, x_0 + t z, \quad z \sim \mathcal{N}(0, I), \quad t \in [0, 1].\]

The target velocity is the analytic derivative of that line:

\[\underbrace{\dot{x}(t)}_{\text{“ground-truth” flow}} = z - x_0.\]

Training is just MSE: shove $(x(t),t)$ into a network $f_{\theta}$ and make it spit out $z - x_0$. No variance schedule, no KLs.
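Written out as a formula, the objective above is a plain regression loss. This is a sketch in the post's own notation; treating the sampling of $x_0$, $z$, and $t \sim U[0,1]$ as one expectation is the usual way to formalize it and is assumed here:

\[\mathcal{L}(\theta) = \mathbb{E}_{x_0,\; z \sim \mathcal{N}(0, I),\; t \sim U[0,1]} \left\| f_{\theta}\big((1 - t)\, x_0 + t z,\; t\big) - (z - x_0) \right\|^2.\]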

Why it matters: every training path is a straight line, so the learned ODE tends to have low curvature. At inference you can often integrate it with ~8-16 Euler steps and still beat 50-step DDPMs on FID. That’s the “rectified” part: we made crooked score trajectories straight.
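To see why a straight path is so cheap to integrate, here is a minimal sketch for a single data/noise pair using the exact ("oracle") velocity $z - x_0$ instead of a trained network. The values in `x0` and the helper `euler_to_data` are made up for illustration. Because the velocity is constant along the line, one Euler step already lands on the data point:

```python
import random

def euler_to_data(z, velocity, num_steps):
    """Integrate dx/dt = velocity from t = 1 (noise) down to t = 0 (data)."""
    dt = 1.0 / num_steps
    x = list(z)
    for _ in range(num_steps):
        # moving backwards in t, so subtract velocity * dt
        x = [xi - vi * dt for xi, vi in zip(x, velocity)]
    return x

random.seed(0)
x0 = [0.3, -1.2, 2.0]                       # toy "data point" (hypothetical values)
z = [random.gauss(0.0, 1.0) for _ in x0]    # its paired noise sample
v = [zi - x0i for zi, x0i in zip(z, x0)]    # oracle velocity, constant along the line

one_step = euler_to_data(z, v, num_steps=1)
eight_steps = euler_to_data(z, v, num_steps=8)
```

A trained network only approximates the velocity averaged over many pairs, so its trajectories bend a little; that is why a handful of steps is used in practice rather than one.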

2. Probability-flow ODE ↔ rectified flow

The probability flow ODE is the deterministic sibling of diffusion sampling. Rectified flow learns that deterministic field directly instead of first learning scores and then converting them into a flow. So the math pipeline is shorter:

DDPM:       learn score → build prob-flow ODE → integrate
Rectified:  learn prob-flow ODE directly      → integrate

Less indirection = faster convergence + fewer inference steps.
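To make the link concrete for the straight-line path of section 1: given $x_0$, the point $x(t)$ is Gaussian with mean $(1 - t)\,x_0$ and standard deviation $t$. Under that assumption, a sketch of the standard identity follows, where $s(x, t) = \nabla_x \log p_t(x)$ is the score and $v(x, t) = \mathbb{E}[z - x_0 \mid x(t) = x]$ is the velocity the MSE loss converges to (both symbols are introduced here for illustration):

\[\mathbb{E}[z \mid x(t) = x] = -t\, s(x, t), \qquad z - x_0 = \frac{z - x(t)}{1 - t} \;\Longrightarrow\; v(x, t) = -\frac{x + t\, s(x, t)}{1 - t}, \quad t \in (0, 1).\]

So on this path a score model and a velocity model carry the same information; rectified flow just regresses the velocity form directly.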

3. Enter the Rectified-Flow Transformer (RFT)

Picture a ViT strapped to an ODE-solver:

Patchify the VAE latent (8×8 or 16×16 windows).
Time token $t$ gets its own embedding; add to every patch embedding.
Dual-modality attention if you’re conditioning on text—separate QKV weights for image/text tokens so text can attend back.
Output is the velocity field $f_{\theta}(x,t)$.
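The first two steps above can be sketched without any deep-learning library. This is an illustration only: `num_patches`, `time_embedding`, and `add_time` are hypothetical helpers, and the sinusoidal embedding is an assumed (though common) choice that the post does not specify.

```python
import math

def num_patches(height, width, patch):
    """Number of tokens after cutting an H x W latent into patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("latent size must be divisible by the patch size")
    return (height // patch) * (width // patch)

def time_embedding(t, dim):
    """Sinusoidal embedding of scalar t in [0, 1]; dim must be even (assumed choice)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def add_time(patch_tokens, t):
    """Add the same time embedding to every patch embedding."""
    emb = time_embedding(t, len(patch_tokens[0]))
    return [[p + e for p, e in zip(tok, emb)] for tok in patch_tokens]

tokens_64_p16 = num_patches(64, 64, 16)   # 4 x 4 grid of patches
tokens_64_p8 = num_patches(64, 64, 8)     # 8 x 8 grid of patches
# zero-valued placeholder tokens of width 8, one per patch
conditioned = add_time([[0.0] * 8 for _ in range(tokens_64_p8)], t=0.5)
```

Halving the patch size quadruples the token count, which is the main cost lever in attention.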

Functional differences from DiT:

DiT (diffusion)	RFT
Predicts noise $\varepsilon$ or data $\mathbf{x}_0$	Predicts velocity along a straight path.
Needs a variance schedule ($\beta_t$).	Schedule-free (just $t \sim U[0,1]$).
20-100 sampler steps typical.	4-16 Euler / DPM steps typical.

4. Quick-start mental model

Analogy: Think of DDPM as hiking down a twisty mountain trail in fog with random gusts (stochastic). Rectified flow bulldozes a zip-line straight from summit to base; you just slide deterministically.

Code-ish snippet (PyTorch-pseudo):

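import torch

# draw data/noise pair (model and data_loader are assumed to exist)
x0 = next(iter(data_loader))        # batch of latents, shape (B, C, H, W)
z  = torch.randn_like(x0)

# one t per sample, broadcast over C, H, W
t  = torch.rand(x0.shape[0], 1, 1, 1, device=x0.device)   # U[0,1]
xt = (1 - t) * x0 + t * z           # straight-line mix

v_target = z - x0                   # analytic velocity
v_pred   = model(xt, t)

loss = ((v_pred - v_target) ** 2).mean()
loss.backward()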

At inference:

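num_steps = 8                       # e.g., 8 Euler steps
dt = 1.0 / num_steps
x = torch.randn_like(x0)            # start at pure noise (t = 1)
for i in range(num_steps):
    t = torch.full((x.shape[0], 1, 1, 1), 1.0 - i * dt, device=x.device)
    v = model(x, t)                 # predicted z - x0, points from data to noise
    x = x - v * dt                  # step backwards in t, towards data (t = 0)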
5. Where you might trip

Concept	Common pitfall	One-liner fix
Divergence ≠ volume	Divergence 0 ⇒ incompressible locally, but global vol-preservation needs boundary conditions.	Treat divergence as “no squishing per voxel”, not “global Jacobian = 1”.
Patch size vs. pixel art	64×64 sprites → only 4×4 = 16 patches at 16×16, model may under-attend.	Train LoRA or patch-drop augment so small objects still get love.
Video / GIF generation	Naïvely stack frames → attention misses temporal cues.	Encode time as an extra dimension (Spatial-Temporal Patches) or treat the sprite sheet as one big panorama and let RFT fill it in.
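For the divergence row above, a small numerical check can make "no squishing per voxel" tangible. This is a sketch with two hand-picked 2-D fields (`rotation` and `contraction` are illustrative, not learned velocities): a rotation has zero divergence everywhere, while a uniform contraction has divergence -2.

```python
def divergence(field, x, y, h=1e-5):
    """Central-difference estimate of d(vx)/dx + d(vy)/dy at (x, y)."""
    dvx_dx = (field(x + h, y)[0] - field(x - h, y)[0]) / (2 * h)
    dvy_dy = (field(x, y + h)[1] - field(x, y - h)[1]) / (2 * h)
    return dvx_dx + dvy_dy

def rotation(x, y):
    """Rigid rotation about the origin: locally volume-preserving."""
    return (-y, x)

def contraction(x, y):
    """Pull every point towards the origin: shrinks volume."""
    return (-x, -y)

div_rot = divergence(rotation, 0.7, -0.3)
div_con = divergence(contraction, 0.7, -0.3)
```

A learned rectified-flow velocity is generally not divergence-free, since it has to turn a broad noise distribution into a concentrated data distribution.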
