
Reverse diffusion and DDPM


Reverse Diffusion: Learning to Denoise

How to reverse the noising process step by step, why predicting noise is equivalent to predicting the reverse mean, the DDPM training objective, and a full numerical walk-through of one denoising step.


Quick refresher

Bayes' theorem

P(A|B) = P(B|A)·P(A)/P(B). Bayes' theorem lets you reverse conditional probabilities. In diffusion: we know q(x_t|x_{t−1}) (forward step) and want q(x_{t−1}|x_t) (reverse step). Bayes gives us the reverse — but it requires knowing x₀.

Example

q(x_{t−1}|x_t, x₀) is tractable because conditioning on x₀ makes all three distributions Gaussian with known parameters.

Without conditioning on x₀, q(x_{t−1}|x_t) is intractable.

From Noisy Back to Clean

Lesson 14-8 showed how to destroy data with noise. Now we learn to reverse the destruction: starting from pure Gaussian noise $x_T \sim \mathcal{N}(0, I)$, iteratively denoise to produce a clean sample $x_0$.

Each reverse step is a small denoising operation: take a slightly noisy image $x_t$ and produce a slightly less noisy image $x_{t-1}$. After $T$ such steps, you have a sample from the data distribution.

The reverse diffusion process is the actual generation step in Stable Diffusion and DALL-E — when you type a prompt and watch the image gradually emerge from noise, you are watching this algorithm run in real time. The noise prediction network trained here is the U-Net at the heart of most image diffusion models.

The True Reverse Posterior

By Bayes' theorem, the true reverse is:

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}) \, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$$
$q(x_{t-1} \mid x_t, x_0)$
reverse step conditioned on both the current noisy state and the clean original
$q(x_t \mid x_{t-1})$
forward step kernel (known, Gaussian)
$q(x_{t-1} \mid x_0)$
marginal of $x_{t-1}$ given $x_0$ (known from closed form)

All three factors on the right are Gaussian (from the closed-form derivation in lesson 14-8). The product of Gaussians is Gaussian. Working out the algebra, the mean and variance of this posterior are:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\, \beta_t$$
$\tilde{\mu}_t(x_t, x_0)$
true posterior mean — a weighted average of $x_t$ and $x_0$
$\tilde{\beta}_t$
true posterior variance
$\alpha_t$
$1 - \beta_t$
$\bar{\alpha}_t$
cumulative product of $\alpha_s$ for $s = 1$ to $t$
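
Where do these expressions come from? A brief sketch of the algebra (my rendering of the standard completing-the-square argument, not spelled out in the lesson): expand the two exponents that depend on $x_{t-1}$ and collect terms.

$$\log q(x_{t-1} \mid x_t, x_0) = -\frac{(x_t - \sqrt{\alpha_t}\, x_{t-1})^2}{2\beta_t} - \frac{(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\, x_0)^2}{2(1 - \bar{\alpha}_{t-1})} + \text{const}$$

The $x_{t-1}^2$ term has coefficient $-\tfrac{1}{2}\bigl(\tfrac{\alpha_t}{\beta_t} + \tfrac{1}{1-\bar{\alpha}_{t-1}}\bigr) = -\tfrac{1-\bar{\alpha}_t}{2\beta_t(1-\bar{\alpha}_{t-1})} = -\tfrac{1}{2\tilde{\beta}_t}$, which gives the posterior variance; matching the linear term then gives $\tilde{\mu}_t$.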

The problem: this requires $x_0$, which is unknown at generation time (that is what we are trying to produce). We cannot use this formula directly.

The Learned Reverse Process

We parameterize the reverse step as a Gaussian:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(\mu_\theta(x_t, t),\, \sigma_t^2 I\right)$$
$p_\theta(x_{t-1} \mid x_t)$
learned reverse process: approximates the true posterior without $x_0$
$\mu_\theta(x_t, t)$
neural network output: predicted mean
$\sigma_t^2 I$
variance — set to $\beta_t$ or $\tilde{\beta}_t$ in practice

We train a neural network to predict the mean $\mu_\theta$ (and optionally the variance). The key DDPM insight: instead of predicting $\mu_\theta$ directly, predict the noise $\varepsilon_\theta$ that was added.

The Noise Prediction Parameterization

From the closed form $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon$, we can solve for $x_0$:

$$\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\, \varepsilon_\theta(x_t, t)\right)$$
$\hat{x}_0$
estimate of the original clean image, expressed in terms of $x_t$ and the predicted noise
$\varepsilon_\theta(x_t, t)$
neural network that predicts the noise $\varepsilon$ given the noisy image and timestep
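
As a quick sanity check (a toy NumPy sketch, not from the lesson): with the true $\varepsilon$ in hand, the inversion is exact. At generation time the network's prediction stands in for $\varepsilon$, so the recovered $\hat{x}_0$ is only an estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bar_t = 0.81                 # toy value, matching the walk-through below
x0 = rng.normal(size=4)
eps = rng.normal(size=4)

# Forward closed form, then invert it with the *true* eps.
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps
x0_back = (x_t - np.sqrt(1 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
assert np.allclose(x0, x0_back)    # exact recovery when eps is known
```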

Substituting this $\hat{x}_0$ estimate into the true posterior mean $\tilde{\mu}_t$, the reverse step mean becomes:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta(x_t, t)\right)$$

Everything in this formula is either a hyperparameter ($\alpha_t, \beta_t, \bar{\alpha}_t$), the current noisy image $x_t$, or the network output $\varepsilon_\theta$. No $x_0$ required.
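
The cancellation is worth seeing once (a sketch of the substitution, using $\sqrt{\bar{\alpha}_{t-1}}/\sqrt{\bar{\alpha}_t} = 1/\sqrt{\alpha_t}$):

$$\tilde{\mu}_t = \frac{\beta_t + \alpha_t(1-\bar{\alpha}_{t-1})}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}\, x_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta\right),$$

since $\beta_t + \alpha_t(1 - \bar{\alpha}_{t-1}) = \beta_t + \alpha_t - \bar{\alpha}_t = 1 - \bar{\alpha}_t$.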

The DDPM Training Objective

The full ELBO objective simplifies (with some algebra that Ho et al. work through) to:

$$L_{\text{simple}} = \mathbb{E}_{x_0, \varepsilon, t}\!\left[\left\|\varepsilon - \varepsilon_\theta(x_t, t)\right\|^2\right]$$
$L_{\text{simple}}$
simplified DDPM training loss
$t$
uniformly sampled timestep from $\{1, \ldots, T\}$
$\varepsilon$
noise used to create $x_t$, known at training time
$\varepsilon_\theta(x_t, t)$
network's prediction of that noise

That is it. Each training step:

  1. Sample $x_0$ from training data
  2. Sample $t \sim \text{Uniform}(1, T)$
  3. Sample $\varepsilon \sim \mathcal{N}(0, I)$
  4. Compute $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon$
  5. Predict: $\hat{\varepsilon} = \varepsilon_\theta(x_t, t)$
  6. Loss: $\|\varepsilon - \hat{\varepsilon}\|^2$
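
In code, the whole recipe fits in a dozen lines. Below is a minimal PyTorch sketch, assuming a noise-prediction network callable as `model(x_t, t)`; the names (`model`, `T`, `betas`, `training_step`) and the linear schedule values are illustrative, not a specific library's API.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear schedule, as in Ho et al.
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products, \bar{\alpha}_t

def training_step(model, x0):
    """One DDPM step: batch-averaged ||eps - eps_theta(x_t, t)||^2."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                        # step 2 (0-indexed here)
    eps = torch.randn_like(x0)                           # step 3
    ab = alpha_bars[t].view(b, *([1] * (x0.dim() - 1)))  # broadcast over pixel dims
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps         # step 4: closed-form noising
    eps_hat = model(x_t, t)                              # step 5: predict the noise
    return ((eps - eps_hat) ** 2).mean()                 # step 6: simple MSE loss
```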

The Network Architecture: U-Net + Time Conditioning

The noise predictor $\varepsilon_\theta(x_t, t)$ is a U-Net conditioned on the timestep $t$. Time conditioning is injected by:

  1. Encoding $t$ as a sinusoidal embedding (same as transformer positional encodings)
  2. Projecting to a vector and adding to intermediate feature maps

The U-Net produces an output the same spatial size as the input — predicting a noise vector for each pixel.
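
Step 1 above can be made concrete. Here is a common way to implement the sinusoidal embedding in PyTorch (a sketch under the same assumptions as the training snippet; `timestep_embedding` is a hypothetical helper, not a fixed API):

```python
import math
import torch

def timestep_embedding(t, dim=128):
    """Sinusoidal embedding of integer timesteps, transformer-style:
    sin/cos pairs at geometrically spaced frequencies."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]                    # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)

# Inside each U-Net block the embedding is typically passed through a small
# MLP and added to the feature maps, broadcast over the spatial dims, e.g.:
#   h = h + time_mlp(emb)[:, :, None, None]
```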

The Full Sampling Procedure

Generation at inference time (no training data needed):

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta(x_t, t)\right) + \sigma_t z$$
$x_{t-1}$
denoised image at step $t-1$
$\alpha_t$
$1 - \beta_t$
$\sigma_t$
noise to add: set to $\sqrt{\beta_t}$ or $\sqrt{\tilde{\beta}_t}$
$z$
fresh noise sample: $z \sim \mathcal{N}(0, I)$ for $t > 1$, $z = 0$ for $t = 1$
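
Putting the update rule in code, here is a minimal ancestral-sampling loop (a sketch reusing the hypothetical `model`, `T`, `betas`, `alphas`, `alpha_bars` from the training snippet, with $\sigma_t = \sqrt{\beta_t}$):

```python
import torch

@torch.no_grad()
def sample(model, shape):
    """DDPM ancestral sampling: start from pure noise, denoise T times."""
    x = torch.randn(shape)                              # x_T ~ N(0, I)
    for t in reversed(range(T)):                        # T-1 down to 0 (0-indexed)
        t_batch = torch.full((shape[0],), t)
        eps_hat = model(x, t_batch)                     # predict the noise
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise at the end
        x = mean + betas[t].sqrt() * z
    return x
```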

Numerical Walk-Through (t=2 → t=1)

Setup (1D, toy values): $\beta_1 = 0.1$, $\beta_2 = 0.1$, so $\alpha_2 = 0.9$, $\bar{\alpha}_2 = 0.81$.

Current state: $x_2 = 1.4$. Network predicts $\varepsilon_\theta(x_2, 2) = 0.6$.

Predicted mean:

$$\mu_\theta = \frac{1}{\sqrt{0.9}}\!\left(1.4 - \frac{0.1}{\sqrt{1-0.81}} \times 0.6\right) = \frac{1}{0.949}\!\left(1.4 - \frac{0.1}{0.436} \times 0.6\right) = \frac{1}{0.949}\,(1.4 - 0.1376) = \frac{1.262}{0.949} = 1.330$$

Add stochasticity: from the posterior variance formula, $\tilde{\beta}_2 = \frac{1-\bar{\alpha}_1}{1-\bar{\alpha}_2}\,\beta_2 = \frac{0.1}{0.19} \times 0.1 \approx 0.053$, so $\sigma_2 = \sqrt{\tilde{\beta}_2} \approx 0.229$. Sampling $z = -0.2$:

$$x_1 = 1.330 + 0.229 \times (-0.2) = 1.330 - 0.046 = 1.284$$

The step moved slightly toward the predicted clean value, with a small amount of added noise to maintain the stochastic character of the reverse chain.
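
You can reproduce the walk-through in a few lines of NumPy (small rounding differences from the hand calculation above are expected, since the text rounds intermediates):

```python
import numpy as np

beta = np.array([0.1, 0.1])        # beta_1, beta_2
alpha = 1 - beta                   # [0.9, 0.9]
alpha_bar = np.cumprod(alpha)      # [0.9, 0.81]

x2, eps_hat, z = 1.4, 0.6, -0.2
mean = (x2 - beta[1] / np.sqrt(1 - alpha_bar[1]) * eps_hat) / np.sqrt(alpha[1])
beta_tilde = (1 - alpha_bar[0]) / (1 - alpha_bar[1]) * beta[1]
x1 = mean + np.sqrt(beta_tilde) * z
print(round(mean, 3), round(np.sqrt(beta_tilde), 3), round(x1, 3))
# 1.331 0.229 1.285
```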


Quiz


The true reverse posterior q(x_{t−1}|x_t, x₀) is tractable only when conditioned on x₀. Why is this a problem for generation?