[Pixel Post] What The F* is RFT
![[Pixel Post] What The F* is RFT](/_next/image?url=https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1752182535922%2F194087af-16fd-4e4c-b707-0d185f3b9bb8.jpeg&w=3840&q=75)
What the f* is Rectified Flow Transformer?
1. From DDPM to straight-line flow
Vanilla diffusion recap:
Forward: \(x_0 \to x_T\) by adding Gaussian noise.
Reverse: learn a denoiser \(\varepsilon_{\theta}\) and step stochastically back from \(x_T\) to \(x_0\).
Rectified flow twist:
Instead of waddling through ever-changing noise scales, we connect every data point straight to its own noise sample and learn a deterministic ODE that follows that line:
\[x(t) = (1 - t) \, x_0 + t z, \quad z \sim \mathcal{N}(0, I), \quad t \in [0, 1].\]
The target velocity is the analytic derivative of that line:
\[\underbrace{\dot{x}(t)}_{\text{“ground-truth” flow}} = z - x_0.\]
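A quick sanity check of that derivative, as a minimal sketch in plain Python (lists instead of tensors so it runs anywhere; the function names and toy numbers are made up for illustration): a central finite difference of the straight-line path should land on \(z - x_0\) at every \(t\).

```python
def interpolate(x0, z, t):
    """Point on the straight line from data x0 (t=0) to noise z (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, z)]


def target_velocity(x0, z):
    """Analytic derivative of interpolate() with respect to t: z - x0."""
    return [b - a for a, b in zip(x0, z)]


def finite_difference_velocity(x0, z, t, eps=1e-6):
    """Central-difference estimate of dx/dt, used only as a sanity check."""
    ahead = interpolate(x0, z, t + eps)
    behind = interpolate(x0, z, t - eps)
    return [(p - q) / (2.0 * eps) for p, q in zip(ahead, behind)]


x0 = [0.5, -1.0, 2.0]  # toy "data" vector (made-up numbers)
z = [1.5, 0.25, -0.5]  # toy "noise" sample (fixed so the check is repeatable)

# Largest gap between the numerical and analytic velocity over a few times t.
max_error = max(
    abs(fd - v)
    for t in (0.1, 0.5, 0.9)
    for fd, v in zip(finite_difference_velocity(x0, z, t), target_velocity(x0, z))
)
print(f"max |finite difference - (z - x0)| = {max_error:.2e}")
```

The velocity does not depend on \(t\) at all, which is exactly why the regression target is so cheap to compute.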
Training is just MSE: shove \((x(t), t)\) into a network \(f_{\theta}\) and make it spit out \(z - x_0\). No variance schedule, no KLs.
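Written out, that objective looks like this (the symbol \(\mathcal{L}\), the name \(p_{\text{data}}\), and the uniform sampling of \(t\) are notation assumed here for illustration):
\[\mathcal{L}(\theta) = \mathbb{E}_{x_0 \sim p_{\text{data}},\; z \sim \mathcal{N}(0, I),\; t \sim U[0,1]} \left\| f_{\theta}\big(x(t), t\big) - (z - x_0) \right\|^2, \qquad x(t) = (1 - t)\, x_0 + t z.\]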
Why it matters: every training path is a straight line, so the learned ODE has far less curvature to track than a diffusion sampler's trajectory. At inference you can integrate it with ~8-16 Euler steps and can still match or beat 50-step DDPM sampling on FID. That’s the “rectified” part: we made crooked score trajectories straight.
2. Probability-flow ODE ↔ rectified flow
The probability flow ODE is the deterministic sibling of diffusion sampling. Rectified flow learns that deterministic field directly instead of first learning scores and then converting them into an ODE. So the math pipeline is shorter:
DDPM: learn score → build prob-flow ODE → integrate
Rectified: learn prob-flow ODE directly → integrate
Less indirection = faster convergence + fewer inference steps.
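To see the two routes agree, here is a toy check in plain Python. Everything in it is an assumption for illustration: the data distribution is a single point `c`, so the marginal at time \(t\) is \(\mathcal{N}((1-t)c,\ t^2)\) and its score has a closed form. For the straight-line path, the velocity can be recovered from the score as \(v(x,t) = -\big(x + t\,\nabla_x \log p_t(x)\big)/(1-t)\); that identity is derived for this sketch, not quoted from anywhere in this post.

```python
def gaussian_score(x, t, c):
    """Score d/dx log p_t(x) for p_t = N((1 - t) * c, t**2). Needs t > 0."""
    return -(x - (1.0 - t) * c) / (t * t)


def velocity_from_score(x, t, c):
    """Indirect route: compute the score, then convert it to a velocity.

    Valid for 0 < t < 1 under the straight-line interpolation.
    """
    return -(x + t * gaussian_score(x, t, c)) / (1.0 - t)


def velocity_direct(x, t, c):
    """Direct route: E[z - x0 | x(t) = x], the field a rectified-flow
    model regresses. With all data at c this is (x - c) / t."""
    return (x - c) / t


c = 2.0  # hypothetical location of the single data point
for x, t in [(0.3, 0.5), (-1.2, 0.25), (4.0, 0.9)]:
    indirect = velocity_from_score(x, t, c)
    direct = velocity_direct(x, t, c)
    print(f"x={x:+.2f} t={t:.2f}  from score: {indirect:+.6f}  direct: {direct:+.6f}")
```

Both functions describe the same vector field; the rectified-flow route just skips the conversion step.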
3. Enter the Rectified-Flow Transformer (RFT)
Picture a ViT strapped to an ODE-solver:
Patchify the VAE latent (8×8 or 16×16 windows).
The timestep \(t\) gets its own embedding, which is added to every patch embedding.
Dual-modality attention if you’re conditioning on text: image and text tokens get separate QKV weights, so text can attend back to the image tokens.
Output is the velocity field \(f_{\theta}(x,t)\).
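The first two steps of that list can be sketched in dependency-free Python. Treat this as a toy under stated assumptions: the names `patchify`, `timestep_embedding`, and `add_time_embedding` are hypothetical, the sinusoidal embedding is one common choice rather than something this post specifies, and the learned linear projection of each patch is skipped, so the embedding width is simply set to the flattened patch size.

```python
import math


def patchify(latent, patch):
    """Split an H x W grid (list of rows) into flattened patches, row-major."""
    height, width = len(latent), len(latent[0])
    if height % patch or width % patch:
        raise ValueError("latent size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                latent[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar t: [cos terms..., sin terms...]."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]


def add_time_embedding(tokens, embedding):
    """Add the same time embedding to every patch token."""
    return [[v + e for v, e in zip(token, embedding)] for token in tokens]


# Toy 4x4 single-channel "latent" with entries 0..15, cut into 2x2 patches.
latent = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tokens = patchify(latent, patch=2)               # 4 tokens, 4 values each
emb = timestep_embedding(t=0.5, dim=len(tokens[0]))
conditioned = add_time_embedding(tokens, emb)
print(len(conditioned), "tokens of width", len(conditioned[0]))
```

In a real model these would be batched tensor ops, and the conditioned tokens would then go through the transformer blocks to produce \(f_{\theta}(x,t)\).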
Functional differences from DiT:
| DiT (diffusion) | RFT |
| --- | --- |
| Predicts noise \(\varepsilon\) or data \(\mathbf{x}_0\). | Predicts velocity along a straight path. |
| Needs a variance schedule (\(\beta_t\) schedule). | Schedule-free (just \(t \sim U[0,1]\)). |
| 20-100 sampler steps typical. | 4-16 Euler / DPM steps typical. |
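The first row is less of a gulf than it looks. Under the straight-line path \(x(t) = (1-t)\,x_0 + t z\), a velocity prediction can be turned back into a data or noise estimate with one line of algebra: \(x_0 = x(t) - t\,v\) and \(z = x(t) + (1-t)\,v\). A minimal sketch, with hypothetical function names and made-up toy numbers:

```python
def data_from_velocity(xt, v, t):
    """Recover x0 from x(t) and v = z - x0:  x0 = x(t) - t * v."""
    return [a - t * b for a, b in zip(xt, v)]


def noise_from_velocity(xt, v, t):
    """Recover z from x(t) and v = z - x0:  z = x(t) + (1 - t) * v."""
    return [a + (1.0 - t) * b for a, b in zip(xt, v)]


# Build a point on the path from a known pair, then invert it.
x0 = [0.5, -1.0]
z = [1.5, 0.25]
t = 0.25
xt = [(1.0 - t) * a + t * b for a, b in zip(x0, z)]
v = [b - a for a, b in zip(x0, z)]  # the "perfect" velocity prediction

print("recovered x0:", data_from_velocity(xt, v, t))
print("recovered z: ", noise_from_velocity(xt, v, t))
```

With an imperfect model the recovered values are only estimates, but the conversion itself is exact.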
4. Quick-start mental model
Analogy: Think of DDPM as hiking down a twisty mountain trail in fog with random gusts (stochastic). Rectified flow strings a zip-line straight from summit to base; you just slide deterministically.
Code-ish snippet (PyTorch-pseudo):
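import torch

# assumes `model`, `optimizer`, and `data_loader` already exist
optimizer.zero_grad()
# draw data/noise pair
x0 = next(iter(data_loader))                             # one batch, shape (B, C, H, W)
z = torch.randn_like(x0)
t = torch.rand(x0.shape[0], 1, 1, 1, device=x0.device)   # U[0,1], one t per sample
xt = (1 - t) * x0 + t * z                                # straight-line mix
v_target = z - x0                                        # analytic velocity
v_pred = model(xt, t)
loss = ((v_pred - v_target) ** 2).mean()
loss.backward()
optimizer.step()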
At inference:
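num_steps = 8                                   # e.g., 8 Euler steps
dt = 1.0 / num_steps
x = torch.randn_like(x0)                        # start at pure noise (t = 1)
with torch.no_grad():
    for i in range(num_steps):
        t = torch.full((x.shape[0], 1, 1, 1), 1.0 - i * dt, device=x.device)
        v = model(x, t)                         # predicted z - x0 (points data -> noise)
        x = x - v * dt                          # so subtract to walk noise -> data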
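One thing worth checking by hand is the direction of integration: with \(x(t) = (1-t)\,x_0 + t z\), noise lives at \(t = 1\) and data at \(t = 0\), and the velocity \(z - x_0\) points from data toward noise, so sampling walks \(t\) from 1 down to 0 and subtracts the velocity. The toy below is a sketch under a strong assumption: all data sits at one point `c`, which makes the exact velocity field \((x - c)/t\) available in closed form and stands in for a trained model. Function names are hypothetical.

```python
def toy_velocity(x, t, c):
    """Exact rectified-flow field when every data point equals c."""
    return (x - c) / t


def euler_sample(x_noise, c, num_steps):
    """Euler-integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data)."""
    dt = 1.0 / num_steps
    x = x_noise
    for i in range(num_steps):
        t = 1.0 - i * dt                      # current time; never reaches 0 here
        x = x - toy_velocity(x, t, c) * dt    # step backward in t
    return x


c = 2.0          # hypothetical data point
x_noise = -0.7   # a fixed "noise" draw so the run is repeatable
for steps in (1, 4, 8):
    print(steps, "steps ->", euler_sample(x_noise, c, steps))
```

Because the trajectory in this toy is perfectly straight, Euler lands on `c` even with a single step. A trained network's field is only approximately straight, which is why a handful of steps is still used in practice.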
5. Where you might trip
| Concept | Common pitfall | One-liner fix |
| --- | --- | --- |
| Divergence ≠ volume | Divergence 0 ⇒ incompressible locally, but global volume preservation needs boundary conditions. | Treat divergence as “no squishing per voxel”, not “global Jacobian = 1”. |
| Patch size vs. pixel art | 64×64 sprites → only a 4×4 grid (16 patches) at 16×16, model may under-attend. | Train LoRA or patch-drop augment so small objects still get love. |
| Video / GIF generation | Naïvely stack frames → attention misses temporal cues. | Encode time as an extra dimension (spatial-temporal patches), or lay the frames out as one big sprite-sheet panorama and let RFT fill it in. |
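The patch-size and video rows both come down to token-count arithmetic, which is easy to sanity-check before training. A small helper, with a hypothetical name and signature, counting tokens in whatever space the patches are cut from (pixels or latents):

```python
def num_tokens(height, width, patch, frames=1, temporal_patch=1):
    """Number of patch tokens for an image, or for a clip when frames > 1.

    Spatial patches are patch x patch; temporal_patch frames are grouped
    into each spatio-temporal patch.
    """
    if height % patch or width % patch or frames % temporal_patch:
        raise ValueError("sizes must be divisible by the patch sizes")
    return (frames // temporal_patch) * (height // patch) * (width // patch)


print(num_tokens(64, 64, patch=16))                              # 64x64 sprite, 16x16 patches
print(num_tokens(64, 64, patch=8))                               # same sprite, 8x8 patches
print(num_tokens(64, 64, patch=8, frames=8, temporal_patch=2))   # 8-frame clip
```

A 64×64 sprite at patch size 16 is only a 4×4 grid, so halving the patch size quadruples the tokens the attention layers get to work with, at a matching cost in compute.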




