# [Pixel Post] What The F* is RFT

What the f\* is Rectified Flow Transformer?

## 1\. From DDPM to straight-line flow

**Vanilla diffusion recap:**

* Forward: \\(x_0 \to x_T\\) by adding Gaussian noise.
    
* Reverse: learn a denoiser \\(\varepsilon_{\theta}\\) and integrate stochastically back.
    

**Rectified flow twist:**

Instead of waddling through ever-changing noise scales, we connect every data point straight to its own noise sample and learn a deterministic ODE that follows that line:

\\[x(t) = (1 - t) \, x_0 + t z, \quad z \sim \mathcal{N}(0, I), \quad t \in [0, 1].\\]

The target velocity is the analytic derivative of that line:

\\[\underbrace{\dot{x}(t)}_{\text{“ground-truth” flow}} = z - x_0.\\]

Training is just MSE: shove \\((x(t),t)\\) into a network \\(f_{\theta}\\) and make it spit out \\(z - x_0\\). No variance schedule, no KLs.

*Why it matters:* every noise/data pair is joined by a straight line, so the learned ODE has far less curvature to track than a diffusion sampling trajectory. At inference you can integrate it with **~8-16 Euler steps** and still compete with 50-step DDPM sampling on FID. That’s the “rectified” part: we made crooked score trajectories straight.
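
To see why straight paths need so few steps, here is a toy sketch in plain Python. It assumes the idealized case of a single \\((x_0, z)\\) pair, where the velocity is exactly the constant \\(z - x_0\\); the helper names `euler_sample` and `true_velocity` are made up for illustration. A trained network averages over many pairs, so its field is only approximately straight and will need more than one step.

```python
def euler_sample(velocity, x_start, n_steps):
    """Integrate dx/dt = velocity(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = list(x_start)
    dt = 1.0 / n_steps
    t = 1.0
    for _ in range(n_steps):
        v = velocity(x, t)
        # Move backward in time, so subtract the velocity.
        x = [xi - dt * vi for xi, vi in zip(x, v)]
        t -= dt
    return x


# One hypothetical data point and its paired noise sample.
x0 = [0.5, -1.0, 2.0]
z = [1.5, 0.25, -0.75]


def true_velocity(x, t):
    """Ground-truth velocity on the line x(t) = (1 - t) * x0 + t * z."""
    return [zi - xi for zi, xi in zip(z, x0)]


one_step = euler_sample(true_velocity, z, n_steps=1)
eight_steps = euler_sample(true_velocity, z, n_steps=8)
```

Because the velocity does not change along the path, `one_step` and `eight_steps` both land on `x0` up to floating-point error. Euler integration has no curvature to miss here.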

---

## 2\. Probability-flow ODE ↔ rectified flow

The *probability flow ODE* is the deterministic sibling of diffusion sampling. Rectified flow learns a deterministic velocity field of that kind directly, instead of first learning scores and then converting them into an ODE. So the math pipeline is shorter:

```plaintext
DDPM:       learn score → build prob-flow ODE → integrate
Rectified:  learn prob-flow ODE directly      → integrate
```

Less indirection = faster convergence + fewer inference steps.
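
For reference, here is a sketch of the two objects being compared, in standard score-based notation. The drift \\(h\\), diffusion coefficient \\(g\\), and marginal density \\(p_t\\) are symbols assumed here for illustration and are not used elsewhere in this post. For a forward SDE \\(dx = h(x,t)\,dt + g(t)\,dw\\), the probability-flow ODE is

\\[\frac{dx}{dt} = h(x,t) - \tfrac{1}{2}\, g(t)^2 \, \nabla_x \log p_t(x),\\]

so the score \\(\nabla_x \log p_t(x)\\) has to be learned before the ODE can be built. Rectified flow regresses the velocity directly. The minimizer of the MSE objective from section 1 is the conditional mean

\\[f^{*}(x,t) = \mathbb{E}\left[\, z - x_0 \mid x(t) = x \,\right],\\]

and sampling integrates \\(\dot{x} = f_{\theta}(x,t)\\) from \\(t = 1\\) down to \\(t = 0\\).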

## 3\. Enter the **Rectified-Flow Transformer (RFT)**

Picture a ViT strapped to an ODE-solver:

1. **Patchify the VAE latent** (8×8 or 16×16 windows).
    
2. **Time token** \\(t\\) gets its own embedding, which is added to every patch embedding.
    
3. **Dual-modality attention** if you’re conditioning on text—separate QKV weights for image/text tokens so text can attend back.
    
4. Output is the velocity field \\(f_{\theta}(x,t)\\).
    
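Steps 1 and 2 above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the architecture of any particular model: the function names `patchify` and `time_embedding`, the sinusoidal form of the embedding, the latent shape, and the hidden width of 64 are all assumptions chosen for the example.

```python
import math

import torch


def patchify(latent: torch.Tensor, patch: int) -> torch.Tensor:
    """(B, C, H, W) -> (B, N, C * patch * patch), with N = (H / patch) * (W / patch)."""
    b, c, h, w = latent.shape
    assert h % patch == 0 and w % patch == 0, "patch must divide H and W"
    x = latent.reshape(b, c, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 4, 1, 3, 5)  # (B, H/p, W/p, C, p, p)
    return x.reshape(b, (h // patch) * (w // patch), c * patch * patch)


def time_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of t in [0, 1]; (B,) -> (B, dim). Assumes dim is even."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)


# Hypothetical VAE latent: batch of 2, 4 channels, 32x32 spatial.
latent = torch.randn(2, 4, 32, 32)
tokens = patchify(latent, patch=8)            # (2, 16, 256)

proj = torch.nn.Linear(tokens.shape[-1], 64)  # patch embedding
t = torch.rand(2)                             # one timestep per sample
# Broadcast the time embedding across all 16 patch tokens.
hidden = proj(tokens) + time_embedding(t, 64)[:, None, :]
```

The `hidden` tensor is what would then go through the transformer blocks. Real implementations typically pass the time embedding through an MLP first, or inject it via adaptive layer norm rather than plain addition.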

Functional differences from DiT:

| **DiT (diffusion)** | **RFT** |
| --- | --- |
| Predicts noise \\(\varepsilon\\) or data \\(\mathbf{x}_0\\) | Predicts *velocity* along a straight path. |
| Needs variance schedule (\\(\beta_t\\) schedule). | Schedule-free (just \\(t \sim U[0,1]\\)). |
| 20-100 sampler steps typical. | 4-16 Euler / DPM steps typical. |

---

## 4\. Quick-start mental model

**Analogy**: Think of DDPM as hiking down a twisty mountain trail in fog with random gusts (stochastic). Rectified flow bulldozes a *zip-line* straight from summit to base; you just slide deterministically.

**Code-ish snippet (PyTorch-pseudo):**

```python
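import torch

# Assumes `model`, `optimizer`, and `data_loader` are defined elsewhere,
# and that `model(xt, t)` returns a tensor shaped like `xt`.

# draw data/noise pair
x0 = next(iter(data_loader))        # clean batch, (B, C, H, W)
z  = torch.randn_like(x0)           # paired Gaussian noise

# one t ~ U[0,1] per sample, broadcast over C, H, W
t  = torch.rand(x0.shape[0], 1, 1, 1, device=x0.device)
xt = (1 - t) * x0 + t * z           # straight-line mix

v_target = z - x0                   # analytic velocity
v_pred   = model(xt, t)

loss = ((v_pred - v_target) ** 2).mean()   # plain MSE
optimizer.zero_grad()
loss.backward()
optimizer.step()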
```

At inference:

```python
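num_steps = 8                         # e.g., 8 Euler steps
dt = 1.0 / num_steps
x = torch.randn_like(x0)              # start at pure noise (t = 1); x0 only sets the shape
with torch.no_grad():
    for i in range(num_steps):
        t = torch.full((x.shape[0], 1, 1, 1), 1.0 - i * dt, device=x.device)
        v = model(x, t)               # predicted velocity, approximately z - x0
        x = x - v * dt                # step backward in time, toward data (t = 0)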
```

---

## 5\. Where you might trip

| **Concept** | **Common pitfall** | **One-liner fix** |
| --- | --- | --- |
| **Divergence ≠ volume** | Divergence 0 ⇒ incompressible locally, but global vol-preservation needs boundary conditions. | Treat divergence as “no squishing per voxel”, not “global Jacobian = 1”. |
| **Patch size vs. pixel art** | 64×64 sprites → only a 4×4 grid (16 patches) at 16×16, so the model may under-attend to fine detail. | Train LoRA or patch-drop augment so small objects still get love. |
| **Video / GIF generation** | Naïvely stack frames → attention misses temporal cues. | Encode time as an extra dimension (spatial-temporal patches), or treat the sprite sheet as one big panorama and let RFT fill it in. |

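The patch-count arithmetic behind the pixel-art pitfall is easy to check. This is a small sketch with a made-up helper, `patch_grid`, and it assumes the image size and patch size are measured in the same space (both in pixels, or both in latent cells).

```python
def patch_grid(height, width, patch):
    """Return (rows, cols, total) for non-overlapping square patches."""
    if height % patch or width % patch:
        raise ValueError("patch size must divide both height and width")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


sprite_16 = patch_grid(64, 64, 16)  # coarse patches
sprite_8 = patch_grid(64, 64, 8)    # finer patches
```

A 64×64 sprite gives a 4×4 grid (16 tokens) at patch size 16 and an 8×8 grid (64 tokens) at patch size 8. Halving the patch size quadruples the token count, which is the trade-off between detail and attention cost.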