Generative Models
Lesson 8 ⏱ 14 min

Diffusion models: the forward noising process

Video coming soon

Diffusion Models Part 1: Destroying Data with Noise

The forward diffusion process that gradually adds Gaussian noise to data, the noise schedule, and the closed-form expression that jumps directly to any noise level without stepping through all intermediate steps.

⏱ ~8 min

🧮

Quick refresher

Properties of Gaussian random variables

If X ~ N(μ₁, σ₁²) and Y ~ N(μ₂, σ₂²) are independent, then aX + bY ~ N(aμ₁ + bμ₂, a²σ₁² + b²σ₂²). Scaling a Gaussian scales its mean by the same factor and its variance by the square of that factor.

Example

If X ~ N(0,1), then 3X ~ N(0,9).

If X ~ N(0,1) and Y ~ N(0,1) are independent, then 0.6X + 0.8Y ~ N(0, 0.36 + 0.64) = N(0,1).

This is the key identity used to derive the diffusion closed form.
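If you want to sanity-check this identity numerically, here is a minimal NumPy sketch (the seed and sample size are arbitrary choices, not part of the lesson):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # X ~ N(0, 1)
y = rng.standard_normal(1_000_000)   # Y ~ N(0, 1), independent of X
z = 0.6 * x + 0.8 * y                # should be N(0, 0.36 + 0.64) = N(0, 1)

print(z.mean(), z.var())             # close to 0 and 1
```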

The Core Idea: Destroy Then Learn to Rebuild

Diffusion models take a surprising approach to generation: first learn how to completely destroy data, then learn to reverse that destruction.

The forward process is fixed by design: it has no learned parameters, you control it completely, and you can derive all its properties analytically. The only thing you must learn is the reverse: given a noisy image, predict what the noise was and denoise it slightly. Chain enough denoising steps together, and you go from pure noise back to a real image.

Diffusion models are behind Stable Diffusion, DALL-E 2, and Sora — the systems generating photorealistic images and videos from text prompts. Understanding the forward noising process is the first half of understanding how these systems actually work.

This is qualitatively different from VAEs (which learn a compressed representation) and GANs (which learn a sampler adversarially). Diffusion models reduce generation to a sequence of denoising regression problems — which are individually simple and stable to train.

The Forward Process: One Step

At each time step $t$, the forward process adds a small amount of Gaussian noise:

$$x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

  • $x_t$: noisy image at step $t$
  • $x_{t-1}$: image at the previous step
  • $\beta_t$: noise variance schedule parameter at step $t$, a small positive number
  • $\varepsilon$: Gaussian noise, $\varepsilon \sim \mathcal{N}(0, I)$

This single step defines a Markov chain: each $x_t$ depends only on $x_{t-1}$.

In distribution form: $q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(\sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$.

Why these coefficients? The signal term is scaled by $\sqrt{1-\beta_t}$ and the noise term by $\sqrt{\beta_t}$. Since $(\sqrt{1-\beta_t})^2 + (\sqrt{\beta_t})^2 = 1$, if $x_{t-1} \sim \mathcal{N}(0, I)$ then $x_t \sim \mathcal{N}(0, I)$ as well. The process is variance-preserving: the total variance stays at 1 throughout, which stabilizes training.
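As a minimal sketch of this single step in NumPy (the function name and array shape are illustrative assumptions, not part of the lesson):

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    # x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps,  eps ~ N(0, I)
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))          # stand-in for a normalized image
x1 = forward_step(x0, beta_t=0.0001, rng=rng)  # one slightly noisier image
```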

The Noise Schedule

The sequence $\beta_1, \dots, \beta_T$ is a hyperparameter called the noise schedule:

  • Linear (Ho et al., 2020): $\beta_t$ increases linearly from $\beta_1 = 0.0001$ to $\beta_T = 0.02$ over $T = 1000$ steps
  • Cosine (Nichol & Dhariwal, 2021): $\bar{\alpha}_t = \cos^2\!\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$, which is smoother and behaves better at the extremes

With $\beta_t \in [0.0001, 0.02]$, each step adds only a tiny amount of noise. After 1000 steps, the accumulation completely destroys the signal.
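A sketch of how both schedules could be computed in NumPy (the $s = 0.008$ offset and the 0.999 clip follow the Nichol & Dhariwal paper; treat the exact constants as assumptions):

```python
import numpy as np

T = 1000

# Linear schedule (Ho et al., 2020): beta_t rises linearly from 1e-4 to 0.02.
betas_linear = np.linspace(1e-4, 0.02, T)

# Cosine schedule (Nichol & Dhariwal, 2021): defined through alpha-bar,
# then converted back to per-step betas. s is a small offset.
s = 0.008
t = np.arange(T + 1)
f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
alpha_bar = f / f[0]                               # normalize so alpha_bar at t=0 is 1
betas_cosine = 1 - alpha_bar[1:] / alpha_bar[:-1]
betas_cosine = np.clip(betas_cosine, 0.0, 0.999)   # avoid beta close to 1 near t = T
```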

The Closed Form: Any Step Directly from x₀

A critical observation: since Gaussians are closed under linear combination, you can derive a formula for $x_t$ directly from $x_0$, skipping all $t-1$ intermediate steps.

Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Then:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

  • $x_t$: noisy image at step $t$
  • $x_0$: clean original image
  • $\bar{\alpha}_t$: cumulative product of $(1-\beta_s)$, i.e. how much original signal survives
  • $\varepsilon$: a single noise sample from $\mathcal{N}(0, I)$, not the product of $t$ individual noise terms
  • $\sqrt{1-\bar{\alpha}_t}$: how much noise has accumulated by step $t$

Derivation sketch. Apply the single-step formula twice: $x_1 = \sqrt{1-\beta_1}\, x_0 + \sqrt{\beta_1}\, \varepsilon_1$ and $x_2 = \sqrt{1-\beta_2}\, x_1 + \sqrt{\beta_2}\, \varepsilon_2$. Substituting:

$$x_2 = \sqrt{(1-\beta_2)(1-\beta_1)}\, x_0 + \underbrace{\sqrt{(1-\beta_2)\beta_1}\, \varepsilon_1 + \sqrt{\beta_2}\, \varepsilon_2}_{\text{two independent Gaussians}}$$

The combined noise term is Gaussian with variance $(1-\beta_2)\beta_1 + \beta_2 = 1 - (1-\beta_1)(1-\beta_2) = 1 - \bar{\alpha}_2$. So $x_2 = \sqrt{\bar{\alpha}_2}\, x_0 + \sqrt{1-\bar{\alpha}_2}\, \varepsilon$. By induction, the same pattern holds for all $t$.
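The identity is also easy to check numerically: iterate the single-step update $T$ times and compare the resulting statistics with a direct jump via the closed form. A minimal sketch, with an arbitrary $T$ and many copies of a constant 1D "pixel" (these choices are illustrative, not from the lesson):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# Many copies of the same 1D "pixel" so we can compare statistics.
x0 = np.full(100_000, 2.0)

# Route A: iterate the single-step formula T times.
x = x0.copy()
for beta in betas:
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# Route B: jump directly with the closed form.
x_direct = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1 - alpha_bar[-1]) * rng.standard_normal(x0.shape)

# Individual samples differ (different noise draws), but both routes should
# have mean ~ sqrt(alpha_bar_T) * 2 and variance ~ 1 - alpha_bar_T.
print(np.sqrt(alpha_bar[-1]) * 2.0, 1 - alpha_bar[-1])   # predicted mean and variance
print(x.mean(), x.var())
print(x_direct.mean(), x_direct.var())
```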

Numerical Example

Use $T = 4$ steps with $\beta = [0.1, 0.1, 0.1, 0.1]$ (equal for simplicity) and a 1D image $x_0 = 2.0$.

| Step | $\bar\alpha_t$ | $\sqrt{\bar\alpha_t}$ | $\sqrt{1-\bar\alpha_t}$ | Expected $x_t$ ($\varepsilon = 0$) |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 0.000 | 2.000 |
| 1 | 0.900 | 0.949 | 0.316 | 1.897 |
| 2 | 0.810 | 0.900 | 0.436 | 1.800 |
| 3 | 0.729 | 0.854 | 0.521 | 1.708 |
| 4 | 0.656 | 0.810 | 0.586 | 1.620 |

At each step the signal coefficient $\sqrt{\bar{\alpha}_t}$ shrinks and the noise coefficient $\sqrt{1-\bar{\alpha}_t}$ grows. With 1000 steps and $\beta_T = 0.02$, $\bar{\alpha}_{1000} \approx 4.9 \times 10^{-5} \approx 0$, so the signal is effectively gone.
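The table above can be reproduced in a few lines (a sketch, using NumPy):

```python
import numpy as np

betas = np.array([0.1, 0.1, 0.1, 0.1])
alpha_bar = np.cumprod(1.0 - betas)   # [0.9, 0.81, 0.729, 0.6561]
x0 = 2.0

for t, ab in enumerate(alpha_bar, start=1):
    signal = np.sqrt(ab)              # coefficient on x0
    noise = np.sqrt(1.0 - ab)         # std of the accumulated noise
    print(f"t={t}  alpha_bar={ab:.3f}  signal={signal:.3f}  noise={noise:.3f}  E[x_t]={signal * x0:.3f}")
```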

Why This Is Useful for Training

The closed form means: given any training image $x_0$ and any time step $t$, we can instantly create a training example $(x_t, \varepsilon, t)$ by:

  1. Sample $\varepsilon \sim \mathcal{N}(0, I)$
  2. Compute $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon$

No simulation of $t$ Markov steps is needed, and we know exactly what noise was added. The training task for the reverse model is: given $x_t$ and $t$, predict $\varepsilon$. This is derived in lesson 14-9.
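Putting steps 1 and 2 together, here is a minimal sketch of how such a training example could be constructed (the function name and array shapes are illustrative assumptions, not part of the lesson):

```python
import numpy as np

def make_training_example(x0, alpha_bar, rng):
    # Pick a random time step, sample the noise, and jump straight to x_t.
    t = rng.integers(1, len(alpha_bar) + 1)    # uniform t in {1, ..., T}
    eps = rng.standard_normal(x0.shape)        # the noise the model will learn to predict
    ab = alpha_bar[t - 1]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    return x_t, eps, t

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
x0 = rng.standard_normal((3, 32, 32))          # stand-in for a clean, normalized image
x_t, eps, t = make_training_example(x0, alpha_bar, rng)
```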

Interactive example

Apply the forward noising process to an image and watch it gradually become pure noise across 1000 steps

Coming soon

Quiz

1 / 3

In the forward process x_t = √(1−β_t)·x_{t−1} + √β_t·ε, what happens to the signal component √(1−β_t)·x_{t−1} as t increases (assuming β_t is small and positive)?