Generative Models
Lesson 8 ⏱ 14 min

Diffusion models: the forward noising process

Video coming soon

Diffusion Models Part 1: Destroying Data with Noise

The forward diffusion process that gradually adds Gaussian noise to data, the noise schedule, and the closed-form expression that jumps directly to any noise level without stepping through all intermediate steps.

⏱ ~8 min

🧮

Quick refresher

Properties of Gaussian random variables

If X ~ N(μ₁, σ₁²) and Y ~ N(μ₂, σ₂²) are independent, then aX + bY ~ N(aμ₁ + bμ₂, a²σ₁² + b²σ₂²). Scaling a Gaussian scales its mean by the same factor and its variance by the square of that factor.

Example

If X ~ N(0,1), then 3X ~ N(0,9).

If X ~ N(0,1) and Y ~ N(0,1) are independent, then 0.6X + 0.8Y ~ N(0, 0.36 + 0.64) = N(0,1).

This is the key identity used to derive the diffusion closed form.
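If you want to sanity-check this identity numerically, here is a minimal NumPy sketch (the seed and sample size are arbitrary choices, not part of the lesson):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # X ~ N(0, 1)
y = rng.standard_normal(1_000_000)   # Y ~ N(0, 1), independent of X
z = 0.6 * x + 0.8 * y                # should be N(0, 0.36 + 0.64) = N(0, 1)

print(z.mean(), z.var())             # close to 0 and 1
```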

The Core Idea: Destroy Then Learn to Rebuild

Diffusion models take a surprising approach to generation: first learn how to completely destroy data, then learn to reverse that destruction.

The forward process is fixed by design: it has no learned parameters, you control it completely, and you can derive all its properties analytically. The only thing you must learn is the reverse: given a noisy image, predict what the noise was and denoise it slightly. Chain enough denoising steps together, and you go from pure noise back to a real image.

Diffusion models are behind Stable Diffusion, DALL-E 2, and Sora — the systems generating photorealistic images and videos from text prompts. Understanding the forward noising process is the first half of understanding how these systems actually work.

This is qualitatively different from VAEs (which learn a compressed representation) and GANs (which learn a sampler adversarially). Diffusion models reduce generation to a sequence of denoising regression problems — which are individually simple and stable to train.

The Forward Process: One Step

At each time step $t$, the forward process adds a small amount of Gaussian noise:

$$x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

  • $x_t$: noisy image at step $t$
  • $x_{t-1}$: image at the previous step
  • $\beta_t$: noise variance schedule parameter at step $t$, a small positive number
  • $\varepsilon$: Gaussian noise, $\varepsilon \sim \mathcal{N}(0, I)$

This single step defines a Markov chain: each $x_t$ depends only on $x_{t-1}$.

In distribution form: $q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(\sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$.

Why these coefficients? The signal term is scaled by $\sqrt{1-\beta_t}$ and the noise term by $\sqrt{\beta_t}$. Since $(\sqrt{1-\beta_t})^2 + (\sqrt{\beta_t})^2 = 1$, if $x_{t-1} \sim \mathcal{N}(0, I)$ then $x_t \sim \mathcal{N}(0, I)$ as well. The process is variance-preserving: the total variance stays at 1 throughout, which stabilizes training.
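As a minimal sketch of this single step in NumPy (the function name and array shape are illustrative assumptions, not part of the lesson):

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    # x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps,  eps ~ N(0, I)
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))          # stand-in for a normalized image
x1 = forward_step(x0, beta_t=0.0001, rng=rng)  # one slightly noisier image
```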

The Noise Schedule

The sequence $\beta_1, \dots, \beta_T$ is a hyperparameter called the noise schedule:

  • Linear (Ho et al., 2020): $\beta_t$ increases linearly from $\beta_1 = 0.0001$ to $\beta_T = 0.02$ over $T = 1000$ steps
  • Cosine (Nichol & Dhariwal, 2021): $\bar{\alpha}_t = \cos^2\!\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$, which is smoother and behaves better at the extremes

With $\beta_t \in [0.0001, 0.02]$, each step adds only a tiny amount of noise. After 1000 steps, the accumulation completely destroys the signal.
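A sketch of how both schedules could be computed in NumPy (the $s = 0.008$ offset and the 0.999 clip follow the Nichol & Dhariwal paper; treat the exact constants as assumptions):

```python
import numpy as np

T = 1000

# Linear schedule (Ho et al., 2020): beta_t rises linearly from 1e-4 to 0.02.
betas_linear = np.linspace(1e-4, 0.02, T)

# Cosine schedule (Nichol & Dhariwal, 2021): defined through alpha-bar,
# then converted back to per-step betas. s is a small offset.
s = 0.008
t = np.arange(T + 1)
f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
alpha_bar = f / f[0]                               # normalize so alpha_bar at t=0 is 1
betas_cosine = 1 - alpha_bar[1:] / alpha_bar[:-1]
betas_cosine = np.clip(betas_cosine, 0.0, 0.999)   # avoid beta close to 1 near t = T
```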

The Closed Form: Any Step Directly from x₀

A critical observation: since Gaussians are closed under linear combination, you can derive a formula for $x_t$ directly from $x_0$, skipping all $t-1$ intermediate steps.

Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Then:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

  • $x_t$: noisy image at step $t$
  • $x_0$: clean original image
  • $\bar{\alpha}_t$: cumulative product of $(1-\beta_s)$, i.e. how much original signal survives
  • $\varepsilon$: a single noise sample from $\mathcal{N}(0, I)$, not the product of $t$ individual noise terms
  • $\sqrt{1-\bar{\alpha}_t}$: how much noise has accumulated by step $t$

Derivation sketch. Apply the single-step formula twice: $x_1 = \sqrt{1-\beta_1}\, x_0 + \sqrt{\beta_1}\, \varepsilon_1$ and $x_2 = \sqrt{1-\beta_2}\, x_1 + \sqrt{\beta_2}\, \varepsilon_2$. Substituting:

$$x_2 = \sqrt{(1-\beta_2)(1-\beta_1)}\, x_0 + \underbrace{\sqrt{(1-\beta_2)\beta_1}\, \varepsilon_1 + \sqrt{\beta_2}\, \varepsilon_2}_{\text{two independent Gaussians}}$$

The combined noise term is Gaussian with variance $(1-\beta_2)\beta_1 + \beta_2 = 1 - (1-\beta_1)(1-\beta_2) = 1 - \bar{\alpha}_2$. So $x_2 = \sqrt{\bar{\alpha}_2}\, x_0 + \sqrt{1-\bar{\alpha}_2}\, \varepsilon$. By induction, the same pattern holds for all $t$.
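The identity is also easy to check numerically: iterate the single-step update $T$ times and compare the resulting statistics with a direct jump via the closed form. A minimal sketch, with an arbitrary $T$ and many copies of a constant 1D "pixel" (these choices are illustrative, not from the lesson):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# Many copies of the same 1D "pixel" so we can compare statistics.
x0 = np.full(100_000, 2.0)

# Route A: iterate the single-step formula T times.
x = x0.copy()
for beta in betas:
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# Route B: jump directly with the closed form.
x_direct = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1 - alpha_bar[-1]) * rng.standard_normal(x0.shape)

# Individual samples differ (different noise draws), but both routes should
# have mean ~ sqrt(alpha_bar_T) * 2 and variance ~ 1 - alpha_bar_T.
print(np.sqrt(alpha_bar[-1]) * 2.0, 1 - alpha_bar[-1])   # predicted mean and variance
print(x.mean(), x.var())
print(x_direct.mean(), x_direct.var())
```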

Numerical Example

Use $T = 4$ steps with $\beta = [0.1, 0.1, 0.1, 0.1]$ (equal for simplicity) and a 1D image $x_0 = 2.0$.

| Step | $\bar\alpha_t$ | $\sqrt{\bar\alpha_t}$ | $\sqrt{1-\bar\alpha_t}$ | Expected $x_t$ ($\varepsilon = 0$) |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 0.000 | 2.000 |
| 1 | 0.900 | 0.949 | 0.316 | 1.897 |
| 2 | 0.810 | 0.900 | 0.436 | 1.800 |
| 3 | 0.729 | 0.854 | 0.521 | 1.708 |
| 4 | 0.656 | 0.810 | 0.586 | 1.620 |

At each step the signal coefficient $\sqrt{\bar{\alpha}_t}$ shrinks and the noise coefficient $\sqrt{1-\bar{\alpha}_t}$ grows. With 1000 steps and $\beta_T = 0.02$, $\bar{\alpha}_{1000} \approx 4.9 \times 10^{-5} \approx 0$, so the signal is effectively gone.
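The table above can be reproduced in a few lines (a sketch, using NumPy):

```python
import numpy as np

betas = np.array([0.1, 0.1, 0.1, 0.1])
alpha_bar = np.cumprod(1.0 - betas)   # [0.9, 0.81, 0.729, 0.6561]
x0 = 2.0

for t, ab in enumerate(alpha_bar, start=1):
    signal = np.sqrt(ab)              # coefficient on x0
    noise = np.sqrt(1.0 - ab)         # std of the accumulated noise
    print(f"t={t}  alpha_bar={ab:.3f}  signal={signal:.3f}  noise={noise:.3f}  E[x_t]={signal * x0:.3f}")
```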

Why This Is Useful for Training

The closed form means: given any training image $x_0$ and any time step $t$, we can instantly create a training example $(x_t, \varepsilon, t)$ by:

  1. Sample $\varepsilon \sim \mathcal{N}(0, I)$
  2. Compute $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon$

No simulation of $t$ Markov steps is needed, and we know exactly what noise was added. The training task for the reverse model is: given $x_t$ and $t$, predict $\varepsilon$. This is derived in lesson 14-9.
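Putting steps 1 and 2 together, here is a minimal sketch of how such a training example could be constructed (the function name and array shapes are illustrative assumptions, not part of the lesson):

```python
import numpy as np

def make_training_example(x0, alpha_bar, rng):
    # Pick a random time step, sample the noise, and jump straight to x_t.
    t = rng.integers(1, len(alpha_bar) + 1)    # uniform t in {1, ..., T}
    eps = rng.standard_normal(x0.shape)        # the noise the model will learn to predict
    ab = alpha_bar[t - 1]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    return x_t, eps, t

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
x0 = rng.standard_normal((3, 32, 32))          # stand-in for a clean, normalized image
x_t, eps, t = make_training_example(x0, alpha_bar, rng)
```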

Interactive example

Apply the forward noising process to an image and watch it gradually become pure noise across 1000 steps

Coming soon

Quiz

1 / 3

In the forward process x_t = √(1−β_t)·x_{t−1} + √β_t·ε, what happens to the signal component √(1−β_t)·x_{t−1} as t increases (assuming β_t is small and positive)?