Diffusion Models Part 1: Destroying Data with Noise
The forward diffusion process that gradually adds Gaussian noise to data, the noise schedule, and the closed-form expression that jumps directly to any noise level without stepping through all intermediate steps.
⏱ ~8 min
🧮 Quick refresher
Properties of Gaussian random variables
If X ~ N(μ₁, σ₁²) and Y ~ N(μ₂, σ₂²) are independent, then aX + bY ~ N(aμ₁ + bμ₂, a²σ₁² + b²σ₂²). Scaling a Gaussian scales its mean by the same factor and its variance by the square of that factor.
Example
If X ~ N(0,1), then 3X ~ N(0,9).
If X ~ N(0,1) and Y ~ N(0,1) are independent, then 0.6X + 0.8Y ~ N(0, 0.36 + 0.64) = N(0,1).
This is the key identity used to derive the diffusion closed form.
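This identity is easy to verify empirically. A minimal NumPy sketch (the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Two independent standard Gaussians
x = rng.standard_normal(n)
y = rng.standard_normal(n)

# 3X should be N(0, 9): variance scales by the square of the factor
print((3 * x).var())       # ≈ 9.0

# 0.6X + 0.8Y should be N(0, 1), since 0.36 + 0.64 = 1
z = 0.6 * x + 0.8 * y
print(z.mean(), z.var())   # ≈ 0.0, ≈ 1.0
```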
The Core Idea: Destroy Then Learn to Rebuild
Diffusion models take a surprising approach to generation: first learn how to completely destroy data, then learn to reverse that destruction.
The forward process is fixed by design: you choose it completely and can derive all its properties analytically. The only thing you must learn is the reverse: given a noisy image, predict what the noise was and denoise it slightly. Chain enough denoising steps together, and you go from pure noise back to a real image.
Diffusion models are behind Stable Diffusion, DALL-E 2, and Sora — the systems generating photorealistic images and videos from text prompts. Understanding the forward noising process is the first half of understanding how these systems actually work.
This is qualitatively different from VAEs (which learn a compressed representation) and GANs (which learn a sampler adversarially). Diffusion models reduce generation to a sequence of denoising regression problems — which are individually simple and stable to train.
The Forward Process: One Step
At each time step t, the forward process adds a small amount of Gaussian noise:
x_t = √(1−β_t)·x_{t−1} + √β_t·ε,   ε ~ N(0, I)
x_t: noisy image at step t
x_{t−1}: image at the previous step
β_t: noise variance schedule parameter at step t, a small positive number
ε: Gaussian noise, ε ~ N(0, I)
This is a single step of the forward Markov chain.
In distribution form: q(x_t | x_{t−1}) = N(x_t; √(1−β_t)·x_{t−1}, β_t·I).
Why these coefficients? The signal term is scaled by √(1−β_t) and the noise by √β_t. Since (√(1−β_t))² + (√β_t)² = 1, if x_{t−1} ~ N(0, I) then x_t ~ N(0, I) as well. The process is variance-preserving: the total variance stays at 1 throughout, which stabilizes training.
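Here is a minimal NumPy sketch of one forward step; the helper name forward_step and the use of a random vector in place of an image are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x_prev, beta, rng):
    """One forward step: x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * eps

# Variance preservation: if x_{t-1} ~ N(0, I), then x_t ~ N(0, I) as well
x_prev = rng.standard_normal(1_000_000)
x_t = forward_step(x_prev, beta=0.02, rng=rng)
print(x_prev.var(), x_t.var())   # both ≈ 1.0
```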
The Noise Schedule
The sequence β_1, …, β_T is a hyperparameter called the noise schedule:
Linear (Ho et al., 2020): β_t increases linearly from β_1 = 0.0001 to β_T = 0.02 over T = 1000 steps
Cosine (Nichol & Dhariwal, 2021): ᾱ_t = cos²((t/T + s)/(1 + s) · π/2), which is smoother and behaves better at the extremes
With β_t ∈ [0.0001, 0.02], each step adds only a tiny amount of noise. After 1000 steps, the accumulation completely destroys the signal.
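Both schedules take only a few lines. This sketch follows the formulas above; the β clipping in the cosine case follows Nichol & Dhariwal's practice, and the function names are illustrative:

```python
import numpy as np

def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear schedule from Ho et al., 2020."""
    return np.linspace(beta_1, beta_T, T)

def cosine_alpha_bar(T=1000, s=0.008):
    """Cosine schedule: alpha_bar_t = cos^2((t/T + s)/(1 + s) * pi/2), normalized."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Recover per-step betas: beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}."""
    return np.clip(1.0 - alpha_bar[1:] / alpha_bar[:-1], 0.0, max_beta)

print(linear_betas()[:3])                            # starts near 1e-4
print(betas_from_alpha_bar(cosine_alpha_bar())[:3])  # small betas near t = 0
```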
The Closed Form: Any Step Directly from x₀
A critical observation: since Gaussians are closed under linear combination, you can derive a formula for x_t directly from x_0, skipping all t−1 intermediate steps.
Define α_t = 1 − β_t and ᾱ_t = α_1·α_2 ⋯ α_t, the cumulative product. Then:
x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ε,   ε ~ N(0, I)
x_t: noisy image at step t
x_0: clean original image
ᾱ_t: cumulative product of (1−β_s), i.e. how much of the original signal survives
ε: single noise sample from N(0, I), not the product of t individual noise terms
√(1−ᾱ_t): how much noise has accumulated by step t
Derivation sketch. Apply the single-step formula twice:
x_1 = √(1−β_1)·x_0 + √β_1·ε_1
x_2 = √(1−β_2)·x_1 + √β_2·ε_2
Substituting the first into the second:
x_2 = √((1−β_2)(1−β_1))·x_0 + √(1−β_2)·√β_1·ε_1 + √β_2·ε_2
The combined noise term is a sum of two independent Gaussians, so it is Gaussian with variance (1−β_2)β_1 + β_2 = 1 − (1−β_1)(1−β_2) = 1 − ᾱ_2. Therefore x_2 = √ᾱ_2·x_0 + √(1−ᾱ_2)·ε for a single ε ~ N(0, I). By induction, the same pattern holds for all t.
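The induction can be checked numerically: run the single-step recursion all the way to step T and compare the resulting statistics against the closed form. A sketch (the schedule and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

x0, n = 2.0, 500_000

# Path A: iterate the single-step recursion T times
x = np.full(n, x0)
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(n)

# Path B: jump straight to step T with the closed form
x_direct = np.sqrt(alpha_bar[-1]) * x0 \
    + np.sqrt(1.0 - alpha_bar[-1]) * rng.standard_normal(n)

print(x.mean(), x_direct.mean())  # both ≈ sqrt(alpha_bar_T) * 2.0
print(x.var(), x_direct.var())    # both ≈ 1 - alpha_bar_T
```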
Numerical Example
Use T = 4 steps with β = [0.1, 0.1, 0.1, 0.1] (equal for simplicity) and a 1D image x_0 = 2.0.
Step   ᾱ_t     √ᾱ_t    √(1−ᾱ_t)   Expected x_t (ε = 0)
0      1.000   1.000   0.000      2.000
1      0.900   0.949   0.316      1.897
2      0.810   0.900   0.436      1.800
3      0.729   0.854   0.522      1.708
4      0.656   0.810   0.586      1.620
At each step the signal coefficient √ᾱ_t shrinks and the noise coefficient √(1−ᾱ_t) grows. With 1000 steps and β_T = 0.02, ᾱ_1000 ≈ 4×10⁻⁵ ≈ 0: the signal is effectively gone.
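The table above can be reproduced directly; a minimal sketch:

```python
import numpy as np

betas = np.full(4, 0.1)
alpha_bar = np.cumprod(1.0 - betas)  # [0.9, 0.81, 0.729, 0.6561]
x0 = 2.0

print("step  abar   sqrt(abar)  sqrt(1-abar)  E[x_t]")
for t, ab in enumerate(alpha_bar, start=1):
    print(f"{t}     {ab:.3f}  {np.sqrt(ab):.3f}       "
          f"{np.sqrt(1 - ab):.3f}         {np.sqrt(ab) * x0:.3f}")
```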
Why This Is Useful for Training
The closed form means: given any training image x_0 and any time step t, we can instantly create a training example (x_t, ε, t) by:
Sample ε ~ N(0, I)
Compute x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ε
No simulation of t Markov steps is needed, and we know exactly what noise was added. The training task for the reverse model is: given x_t and t, predict ε. This is derived in lesson 14-9.
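In code, creating a training example is a single function. A minimal sketch, where make_training_example is an illustrative name and the 32×32 random array stands in for a real image:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def make_training_example(x0, rng):
    """Sample (x_t, eps, t) in one shot using the closed form."""
    t = rng.integers(T)                   # random time step, uniform over [0, T)
    eps = rng.standard_normal(x0.shape)   # the noise the reverse model must predict
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps, t

x0 = rng.standard_normal((32, 32))        # stand-in for a clean training image
x_t, eps, t = make_training_example(x0, rng)
print(t, x_t.shape)
```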
⚙️ Interactive example
Apply the forward noising process to an image and watch it gradually become pure noise across 1000 steps
Coming soon
Quiz
1 / 3
In the forward process x_t = √(1−β_t)·x_{t−1} + √β_t·ε, what happens to the signal component √(1−β_t)·x_{t−1} as t increases (assuming β_t is small and positive)?