From Noisy Back to Clean
Lesson 14-8 showed how to destroy data with noise. Now we learn to reverse the destruction: starting from pure Gaussian noise $x_T \sim \mathcal{N}(0, I)$, iteratively denoise to produce a clean sample $x_0$.
Each reverse step is a small denoising operation: take a slightly noisy image $x_t$ and produce a slightly less noisy image $x_{t-1}$. After $T$ such steps, you have a sample from the data distribution.
The reverse diffusion process is the actual generation step in Stable Diffusion and DALL-E — when you type a prompt and watch the image gradually emerge from noise, you are watching this algorithm run in real time. The noise prediction network trained here is the U-Net at the heart of these image diffusion models.
The True Reverse Posterior
By Bayes' theorem, the true reverse is:

$$q(x_{t-1} \mid x_t, x_0) = q(x_t \mid x_{t-1}, x_0)\,\frac{q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$$

- $q(x_{t-1} \mid x_t, x_0)$ — reverse step conditioned on both the current noisy state and the clean original
- $q(x_t \mid x_{t-1}, x_0) = q(x_t \mid x_{t-1})$ — forward step kernel (known, Gaussian, by the Markov property)
- $q(x_{t-1} \mid x_0)$ — marginal of $x_{t-1}$ given $x_0$ (known from closed form)
- $q(x_t \mid x_0)$ — marginal of $x_t$ given $x_0$ (also known from closed form)
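Concretely, the forward kernel and the marginals are the Gaussians from lesson 14-8's closed form:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{\alpha_t}\,x_{t-1},\ \beta_t I\right), \qquad q(x_{t-1} \mid x_0) = \mathcal{N}\!\left(x_{t-1};\ \sqrt{\bar{\alpha}_{t-1}}\,x_0,\ (1 - \bar{\alpha}_{t-1})\,I\right)$$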
All three factors on the right are Gaussian (from the closed-form derivation in lesson 14-8). The product of Gaussians is Gaussian. Working out the algebra, the mean and variance of this posterior are:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,x_t, \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t$$

- $\tilde{\mu}_t(x_t, x_0)$ — true posterior mean, a weighted average of $x_t$ and $x_0$
- $\tilde{\beta}_t$ — true posterior variance
- $\alpha_t = 1 - \beta_t$
- $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ — cumulative product of $\alpha_s$ for $s = 1$ to $t$
The problem: this requires $x_0$, which is unknown at generation time (that is what we are trying to produce). We cannot use this formula directly.
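To make the formula concrete, here is a minimal NumPy sketch of the true posterior. The linear β schedule and the function name are illustrative assumptions, not fixed by the lesson — and note that the function cannot run without being handed $x_0$:

```python
import numpy as np

# Assumed linear beta schedule (illustrative; the lesson does not fix one).
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # beta_1 ... beta_T
alphas = 1.0 - betas                   # alpha_t = 1 - beta_t
alphas_bar = np.cumprod(alphas)        # alpha_bar_t = prod of alpha_s up to t

def true_posterior(x_t, x_0, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0); t is 1-based."""
    a_t, ab_t, beta_t = alphas[t - 1], alphas_bar[t - 1], betas[t - 1]
    ab_prev = alphas_bar[t - 2] if t > 1 else 1.0
    mean = (np.sqrt(ab_prev) * beta_t / (1 - ab_t)) * x_0 \
         + (np.sqrt(a_t) * (1 - ab_prev) / (1 - ab_t)) * x_t
    var = (1 - ab_prev) / (1 - ab_t) * beta_t   # beta_tilde_t
    return mean, var                            # requires x_0 -- the problem
```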
The Learned Reverse Process
We parameterize the reverse step as a Gaussian:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

- $p_\theta(x_{t-1} \mid x_t)$ — learned reverse process: approximate the true posterior without $x_0$
- $\mu_\theta(x_t, t)$ — neural network output: predicted mean
- $\Sigma_\theta(x_t, t)$ — variance, set to $\beta_t I$ or $\tilde{\beta}_t I$ in practice
We train a neural network to predict the mean (and optionally the variance). The key DDPM insight: instead of predicting the mean $\mu_\theta$ directly, predict the noise $\varepsilon$ that was added.
The Noise Prediction Parameterization
From the closed form $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\varepsilon$, we can solve for $x_0$:

$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1 - \bar{\alpha}_t}\,\varepsilon\right)$$

- $x_0$ — original clean image, expressed in terms of $x_t$ and $\varepsilon$
- $\varepsilon_\theta(x_t, t)$ — neural network that predicts the noise $\varepsilon$ given the noisy image and timestep
Substituting this estimate into the true posterior mean $\tilde{\mu}_t$, the reverse step mean becomes:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\varepsilon_\theta(x_t, t)\right)$$

Everything in this formula is either a hyperparameter ($\beta_t$, $\alpha_t$, $\bar{\alpha}_t$), the current noisy image $x_t$, or the network output $\varepsilon_\theta(x_t, t)$. No $x_0$ required.
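In code, the substitution is one line. This sketch reuses the schedule arrays from the posterior example above; `reverse_mean` is our own name:

```python
def reverse_mean(x_t, eps_pred, t):
    """mu_theta(x_t, t): the same posterior mean, but computed from the
    predicted noise instead of the unknown clean image. t is 1-based."""
    a_t, ab_t, beta_t = alphas[t - 1], alphas_bar[t - 1], betas[t - 1]
    return (x_t - beta_t / np.sqrt(1 - ab_t) * eps_pred) / np.sqrt(a_t)
```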
The DDPM Training Objective
The full ELBO objective simplifies (with some algebra that Ho et al. work through) to:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \varepsilon}\left[\left\|\varepsilon - \varepsilon_\theta(x_t, t)\right\|^2\right]$$

- $L_{\text{simple}}$ — simplified DDPM training loss
- $t$ — uniformly sampled timestep from $\{1, \ldots, T\}$
- $\varepsilon$ — noise used to create $x_t$, known at training time
- $\varepsilon_\theta(x_t, t)$ — network's prediction of that noise
That is it. Each training step:
- Sample $x_0$ from training data
- Sample $t \sim \text{Uniform}\{1, \ldots, T\}$
- Sample $\varepsilon \sim \mathcal{N}(0, I)$
- Compute $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\varepsilon$
- Predict: $\hat{\varepsilon} = \varepsilon_\theta(x_t, t)$
- Loss: $\|\varepsilon - \hat{\varepsilon}\|^2$
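A minimal PyTorch sketch of one such step. The model interface (taking $x_t$ and $t$ and returning a same-shaped noise prediction) and the linear β schedule are illustrative assumptions; note the code uses 0-based timestep indices:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, x0, optimizer):
    """One DDPM training step on a batch of clean images x0: (B, C, H, W)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                 # uniform timestep, 0-based
    eps = torch.randn_like(x0)                    # the noise to be predicted
    ab = alphas_bar[t].view(b, 1, 1, 1)           # alpha_bar_t per sample
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # closed-form forward step
    eps_pred = model(x_t, t)                      # network predicts the noise
    loss = F.mse_loss(eps_pred, eps)              # mean squared noise error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```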
The Network Architecture: U-Net + Time Conditioning
The noise predictor $\varepsilon_\theta(x_t, t)$ is a U-Net conditioned on timestep $t$. Time conditioning is injected by:
- Encoding $t$ as a sinusoidal embedding (same as transformer positional encodings)
- Projecting it to a vector and adding it to intermediate feature maps
The U-Net produces an output the same spatial size as the input — predicting a noise vector for each pixel.
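A sketch of the sinusoidal embedding, following the transformer convention (the frequency base 10000 and the function name are the usual choices, assumed here):

```python
import math
import torch

def timestep_embedding(t, dim):
    """Map integer timesteps t of shape (B,) to (B, dim) sinusoidal features,
    the same construction as transformer positional encodings."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]    # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```

Inside the U-Net, this vector is typically passed through a small MLP and then added, broadcast over the spatial dimensions, to each residual block's feature maps.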
The Full Sampling Procedure
Generation at inference time (no training data needed). Start from $x_T \sim \mathcal{N}(0, I)$ and apply, for $t = T, \ldots, 1$:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\varepsilon_\theta(x_t, t)\right) + \sigma_t z$$

- $x_{t-1}$ — denoised image at step $t - 1$
- $\alpha_t = 1 - \beta_t$
- $\sigma_t$ — noise to add: set to $\sqrt{\beta_t}$ or $\sqrt{\tilde{\beta}_t}$
- $z$ — fresh noise sample: $z \sim \mathcal{N}(0, I)$ for $t > 1$, $z = 0$ for $t = 1$
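Putting it together, a PyTorch sketch of the full loop with $\sigma_t = \sqrt{\beta_t}$, reusing the assumed model interface and schedule tensors from the training sketch:

```python
@torch.no_grad()
def sample(model, shape):
    """Run the reverse chain from pure noise to a sample. shape: (B, C, H, W)."""
    x = torch.randn(shape)                        # x_T ~ N(0, I)
    for i in reversed(range(T)):                  # i = T-1, ..., 0 (0-based t)
        beta, ab = betas[i], alphas_bar[i]
        t = torch.full((shape[0],), i)
        eps_pred = model(x, t)                    # predict the added noise
        mean = (x - beta / (1 - ab).sqrt() * eps_pred) / (1 - beta).sqrt()
        z = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        x = mean + beta.sqrt() * z                # sigma_t = sqrt(beta_t)
    return x
```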
Numerical Walk-Through (t=2 → t=1)
Setup (1D, toy values): $\beta_1 = 0.1$, $\beta_2 = 0.2$, so $\alpha_2 = 0.8$ and $\bar{\alpha}_2 = 0.9 \times 0.8 = 0.72$.

Current state: $x_2 = 0.5$. Network predicts $\varepsilon_\theta(x_2, 2) = 0.4$.

Predicted mean:

$$\mu_\theta = \frac{1}{\sqrt{0.8}}\left(0.5 - \frac{0.2}{\sqrt{1 - 0.72}} \times 0.4\right) = \frac{0.5 - 0.151}{0.894} \approx 0.390$$

Add stochasticity ($\sigma_2 = \sqrt{\beta_2} \approx 0.447$, sample $z = 0.1$):

$$x_1 = 0.390 + 0.447 \times 0.1 \approx 0.435$$

The step moved slightly toward the predicted clean value ($\hat{x}_0 \approx 0.34$), with a small amount of added noise to maintain the stochastic character of the reverse chain.
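A few lines of Python reproduce the arithmetic, including the implied clean-image estimate $\hat{x}_0$ (all values are the toy numbers from the setup above):

```python
import math

beta2, alpha_bar_2 = 0.2, 0.72       # toy schedule values from the setup
x2, eps_pred = 0.5, 0.4              # current state and noise prediction

mean = (x2 - beta2 / math.sqrt(1 - alpha_bar_2) * eps_pred) / math.sqrt(1 - beta2)
x1 = mean + math.sqrt(beta2) * 0.1   # sigma_2 = sqrt(beta_2), sampled z = 0.1
x0_hat = (x2 - math.sqrt(1 - alpha_bar_2) * eps_pred) / math.sqrt(alpha_bar_2)

print(round(mean, 3), round(x1, 3), round(x0_hat, 3))   # 0.39 0.435 0.34
```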
Interactive example
Watch 1000 reverse diffusion steps denoise pure Gaussian noise into a recognizable image
Coming soon