
Reverse diffusion and DDPM


Reverse Diffusion: Learning to Denoise

How to reverse the noising process step by step, why predicting noise is equivalent to predicting the reverse mean, the DDPM training objective, and a full numerical walk-through of one denoising step.


Quick refresher

Bayes' theorem

P(A|B) = P(B|A)·P(A)/P(B). Bayes' theorem lets you reverse conditional probabilities. In diffusion: we know q(x_t|x_{t−1}) (forward step) and want q(x_{t−1}|x_t) (reverse step). Bayes gives us the reverse — but it requires knowing x₀.

Example

q(x_{t−1}|x_t, x₀) is tractable because conditioning on x₀ makes all three distributions Gaussian with known parameters.

Without conditioning on x₀, q(x_{t−1}|x_t) is intractable.

From Noisy Back to Clean

Lesson 14-8 showed how to destroy data with noise. Now we learn to reverse the destruction: starting from pure Gaussian noise $x_T \sim \mathcal{N}(0, I)$, iteratively denoise to produce a clean sample $x_0$.

Each reverse step is a small denoising operation: take a slightly noisy image $x_t$ and produce a slightly less noisy image $x_{t-1}$. After $T$ such steps, you have a sample from the data distribution.

The reverse diffusion process is the actual generation step in Stable Diffusion and DALL-E — when you type a prompt and watch the image gradually emerge from noise, you are watching this algorithm run in real time. The noise prediction network trained here is the U-Net at the heart of most image diffusion models.

The True Reverse Posterior

By Bayes' theorem, the true reverse is:

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}) \, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$$
$q(x_{t-1} \mid x_t, x_0)$
reverse step conditioned on both the current noisy state and the clean original
$q(x_t \mid x_{t-1})$
forward step kernel (known, Gaussian)
$q(x_{t-1} \mid x_0)$
marginal of $x_{t-1}$ given $x_0$ (known from closed form)

All three factors on the right are Gaussian (from the closed-form derivation in lesson 14-8). The product of Gaussians is Gaussian. Working out the algebra, the mean and variance of this posterior are:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\, \beta_t$$
$\tilde{\mu}_t(x_t, x_0)$
true posterior mean — a weighted average of $x_t$ and $x_0$
$\tilde{\beta}_t$
true posterior variance
$\alpha_t$
$1 - \beta_t$
$\bar{\alpha}_t$
cumulative product of $\alpha_s$ for $s = 1$ to $t$
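
Where do these expressions come from? A brief sketch of the algebra (my rendering of the standard completing-the-square argument, not spelled out in the lesson): expand the two exponents that depend on $x_{t-1}$ and collect terms.

$$\log q(x_{t-1} \mid x_t, x_0) = -\frac{(x_t - \sqrt{\alpha_t}\, x_{t-1})^2}{2\beta_t} - \frac{(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\, x_0)^2}{2(1 - \bar{\alpha}_{t-1})} + \text{const}$$

The $x_{t-1}^2$ term has coefficient $-\tfrac{1}{2}\bigl(\tfrac{\alpha_t}{\beta_t} + \tfrac{1}{1-\bar{\alpha}_{t-1}}\bigr) = -\tfrac{1-\bar{\alpha}_t}{2\beta_t(1-\bar{\alpha}_{t-1})} = -\tfrac{1}{2\tilde{\beta}_t}$, which gives the posterior variance; matching the linear term then gives $\tilde{\mu}_t$.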

The problem: this requires $x_0$, which is unknown at generation time (that is what we are trying to produce). We cannot use this formula directly.

The Learned Reverse Process

We parameterize the reverse step as a Gaussian:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(\mu_\theta(x_t, t),\, \sigma_t^2 I\right)$$
$p_\theta(x_{t-1} \mid x_t)$
learned reverse process: approximates the true posterior without $x_0$
$\mu_\theta(x_t, t)$
neural network output: predicted mean
$\sigma_t^2 I$
variance — set to $\beta_t$ or $\tilde{\beta}_t$ in practice

We train a neural network to predict the mean $\mu_\theta$ (and optionally the variance). The key DDPM insight: instead of predicting $\mu_\theta$ directly, predict the noise $\varepsilon_\theta$ that was added.

The Noise Prediction Parameterization

From the closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon$, we can solve for $x_0$:

$$\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\, \varepsilon_\theta(x_t, t)\right)$$
$\hat{x}_0$
estimate of the original clean image, expressed in terms of $x_t$ and the predicted noise
$\varepsilon_\theta(x_t, t)$
neural network that predicts the noise $\varepsilon$ given the noisy image and timestep
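
As a quick sanity check (a toy NumPy sketch, not from the lesson): with the true $\varepsilon$ in hand, the inversion is exact. At generation time the network's prediction stands in for $\varepsilon$, so the recovered $\hat{x}_0$ is only an estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bar_t = 0.81                 # toy value, matching the walk-through below
x0 = rng.normal(size=4)
eps = rng.normal(size=4)

# Forward closed form, then invert it with the *true* eps.
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps
x0_back = (x_t - np.sqrt(1 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
assert np.allclose(x0, x0_back)    # exact recovery when eps is known
```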

Substituting this $\hat{x}_0$ estimate into the true posterior mean $\tilde{\mu}_t$, the reverse step mean becomes:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta(x_t, t)\right)$$

Everything in this formula is either a hyperparameter ($\alpha_t, \beta_t, \bar{\alpha}_t$), the current noisy image $x_t$, or the network output $\varepsilon_\theta$. No $x_0$ required.
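
The cancellation is worth seeing once (a sketch of the substitution, using $\sqrt{\bar{\alpha}_{t-1}}/\sqrt{\bar{\alpha}_t} = 1/\sqrt{\alpha_t}$):

$$\tilde{\mu}_t = \frac{\beta_t + \alpha_t(1-\bar{\alpha}_{t-1})}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}\, x_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta\right),$$

since $\beta_t + \alpha_t(1 - \bar{\alpha}_{t-1}) = \beta_t + \alpha_t - \bar{\alpha}_t = 1 - \bar{\alpha}_t$.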

The DDPM Training Objective

The full ELBO objective simplifies (with some algebra that Ho et al. work through) to:

$$L_{\text{simple}} = \mathbb{E}_{x_0, \varepsilon, t}\!\left[\left\|\varepsilon - \varepsilon_\theta(x_t, t)\right\|^2\right]$$
$L_{\text{simple}}$
simplified DDPM training loss
$t$
uniformly sampled timestep from $\{1, \ldots, T\}$
$\varepsilon$
noise used to create $x_t$, known at training time
$\varepsilon_\theta(x_t, t)$
network's prediction of that noise

That is it. Each training step:

  1. Sample $x_0$ from training data
  2. Sample $t \sim \text{Uniform}(1, T)$
  3. Sample $\varepsilon \sim \mathcal{N}(0, I)$
  4. Compute $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon$
  5. Predict: $\hat{\varepsilon} = \varepsilon_\theta(x_t, t)$
  6. Loss: $\|\varepsilon - \hat{\varepsilon}\|^2$
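
In code, the whole recipe fits in a dozen lines. Below is a minimal PyTorch sketch, assuming a noise-prediction network callable as `model(x_t, t)`; the names (`model`, `T`, `betas`, `training_step`) and the linear schedule values are illustrative, not a specific library's API.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear schedule, as in Ho et al.
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products, \bar{\alpha}_t

def training_step(model, x0):
    """One DDPM step: batch-averaged ||eps - eps_theta(x_t, t)||^2."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                        # step 2 (0-indexed here)
    eps = torch.randn_like(x0)                           # step 3
    ab = alpha_bars[t].view(b, *([1] * (x0.dim() - 1)))  # broadcast over pixel dims
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps         # step 4: closed-form noising
    eps_hat = model(x_t, t)                              # step 5: predict the noise
    return ((eps - eps_hat) ** 2).mean()                 # step 6: simple MSE loss
```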

The Network Architecture: U-Net + Time Conditioning

The noise predictor $\varepsilon_\theta(x_t, t)$ is a U-Net conditioned on the timestep $t$. Time conditioning is injected by:

  1. Encoding $t$ as a sinusoidal embedding (same as transformer positional encodings)
  2. Projecting to a vector and adding to intermediate feature maps

The U-Net produces an output the same spatial size as the input — predicting a noise vector for each pixel.
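
Step 1 above can be made concrete. Here is a common way to implement the sinusoidal embedding in PyTorch (a sketch under the same assumptions as the training snippet; `timestep_embedding` is a hypothetical helper, not a fixed API):

```python
import math
import torch

def timestep_embedding(t, dim=128):
    """Sinusoidal embedding of integer timesteps, transformer-style:
    sin/cos pairs at geometrically spaced frequencies."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]                    # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)

# Inside each U-Net block the embedding is typically passed through a small
# MLP and added to the feature maps, broadcast over the spatial dims, e.g.:
#   h = h + time_mlp(emb)[:, :, None, None]
```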

The Full Sampling Procedure

Generation at inference time (no training data needed):

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta(x_t, t)\right) + \sigma_t z$$
$x_{t-1}$
denoised image at step $t-1$
$\alpha_t$
$1 - \beta_t$
$\sigma_t$
noise to add: set to $\sqrt{\beta_t}$ or $\sqrt{\tilde{\beta}_t}$
$z$
fresh noise sample: $z \sim \mathcal{N}(0, I)$ for $t > 1$, $z = 0$ for $t = 1$
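
Putting the update rule in code, here is a minimal ancestral-sampling loop (a sketch reusing the hypothetical `model`, `T`, `betas`, `alphas`, `alpha_bars` from the training snippet, with $\sigma_t = \sqrt{\beta_t}$):

```python
import torch

@torch.no_grad()
def sample(model, shape):
    """DDPM ancestral sampling: start from pure noise, denoise T times."""
    x = torch.randn(shape)                              # x_T ~ N(0, I)
    for t in reversed(range(T)):                        # T-1 down to 0 (0-indexed)
        t_batch = torch.full((shape[0],), t)
        eps_hat = model(x, t_batch)                     # predict the noise
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise at the end
        x = mean + betas[t].sqrt() * z
    return x
```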

Numerical Walk-Through (t=2 → t=1)

Setup (1D, toy values): $\beta_1 = 0.1$, $\beta_2 = 0.1$, so $\alpha_2 = 0.9$, $\bar{\alpha}_2 = 0.81$.

Current state: $x_2 = 1.4$. Network predicts $\varepsilon_\theta(x_2, 2) = 0.6$.

Predicted mean:

$$\mu_\theta = \frac{1}{\sqrt{0.9}}\!\left(1.4 - \frac{0.1}{\sqrt{1-0.81}} \times 0.6\right) = \frac{1}{0.949}\!\left(1.4 - \frac{0.1}{0.436} \times 0.6\right) = \frac{1}{0.949}\,(1.4 - 0.1376) = \frac{1.262}{0.949} = 1.330$$

Add stochasticity: from the posterior variance formula, $\tilde{\beta}_2 = \frac{1-\bar{\alpha}_1}{1-\bar{\alpha}_2}\,\beta_2 = \frac{0.1}{0.19} \times 0.1 \approx 0.053$, so $\sigma_2 = \sqrt{\tilde{\beta}_2} \approx 0.229$. Sampling $z = -0.2$:

$$x_1 = 1.330 + 0.229 \times (-0.2) = 1.330 - 0.046 = 1.284$$

The step moved slightly toward the predicted clean value, with a small amount of added noise to maintain the stochastic character of the reverse chain.
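
You can reproduce the walk-through in a few lines of NumPy (small rounding differences from the hand calculation above are expected, since the text rounds intermediates):

```python
import numpy as np

beta = np.array([0.1, 0.1])        # beta_1, beta_2
alpha = 1 - beta                   # [0.9, 0.9]
alpha_bar = np.cumprod(alpha)      # [0.9, 0.81]

x2, eps_hat, z = 1.4, 0.6, -0.2
mean = (x2 - beta[1] / np.sqrt(1 - alpha_bar[1]) * eps_hat) / np.sqrt(alpha[1])
beta_tilde = (1 - alpha_bar[0]) / (1 - alpha_bar[1]) * beta[1]
x1 = mean + np.sqrt(beta_tilde) * z
print(round(mean, 3), round(np.sqrt(beta_tilde), 3), round(x1, 3))
# 1.331 0.229 1.285
```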


Quiz


The true reverse posterior q(x_{t−1}|x_t, x₀) is tractable only when conditioned on x₀. Why is this a problem for generation?