Stepping Back: A Unified View
Lessons 14-8 and 14-9 presented diffusion as a discrete Markov chain: add noise for T steps, learn to reverse each step. This works well in practice, but it hides the deeper mathematical structure. This lesson reveals that structure — and in doing so, unifies VAEs, diffusion models, and a third approach called score-based generative models into a single framework.
Score-based generative models are the mathematical foundation behind the most recent class of state-of-the-art generative systems. Understanding the score function unifies diffusion, energy-based models, and Langevin dynamics into one framework — and explains why sampling algorithms like DDIM and DDPM are related rather than competing.
The key object: the score function.
The Score Function
For any probability distribution P(x), the score function is s(x) = ∇_x log P(x):
- s(x): the score, a vector field over data space
- ∇_x: gradient with respect to x — the data, not model parameters
- log P(x): log probability density of x under the distribution P
This is a vector the same size as x. For a 28×28 image, the score has 784 components. Each component says: "if you increase pixel i by a tiny amount, does the log probability go up or down, and by how much?"
Intuition: the score points toward high-density regions. If you are at a low-probability image and follow the score, you move toward images that look more like real data. This is analogous to gradient ascent toward modes — but the score field knows about the full distribution's geometry, not just local slope.
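To make the vector-field picture concrete, here is a small sketch (the means and weights are illustrative choices, not from the lesson): for a mixture of unit-covariance Gaussians, the score has a closed form, a responsibility-weighted sum of directions toward each component mean.

```python
import numpy as np

# Score of a 2-component Gaussian mixture in 2D with unit covariances.
# The means and weights below are illustrative choices.
MEANS = np.array([[-2.0, 0.0], [2.0, 0.0]])
WEIGHTS = np.array([0.5, 0.5])

def score(x):
    # For P(x) = sum_k w_k N(x; mu_k, I), the score is
    #   grad_x log P(x) = sum_k r_k(x) * (mu_k - x),
    # where r_k(x) are the posterior responsibilities of each component.
    d2 = ((x - MEANS) ** 2).sum(axis=-1)      # squared distance to each mean
    logs = np.log(WEIGHTS) - 0.5 * d2         # unnormalized log responsibilities
    r = np.exp(logs - logs.max())
    r /= r.sum()
    return (r[:, None] * (MEANS - x)).sum(axis=0)

# At a mode the score vanishes; away from it, the field points back uphill.
print(score(np.array([2.0, 0.0])))   # near-zero vector at the right mode
print(score(np.array([1.0, 1.0])))   # points toward the nearby mode at (2, 0)
```

Evaluating this function on a grid reproduces the kind of score field the interactive example at the end of the lesson visualizes.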
Langevin Dynamics: Sampling from the Score
If we know the score at every point, we can generate samples without ever computing P(x) directly. The method is Langevin dynamics:

x_{t+1} = x_t + (η/2) ∇_x log P(x_t) + √η · ε_t

- x_{t+1}: next position in the Markov chain
- x_t: current position
- η: step size (small positive number)
- ε_t: fresh Gaussian noise at each step: ε_t ~ N(0, I)
This is gradient ascent on log P(x) with added noise. The noise is crucial: without it, the chain converges to a mode (a point that locally maximizes P). With noise, the stationary distribution of the chain is exactly P(x): run the chain long enough and its draws are genuine samples from the data distribution.
Why does this work? This is the Langevin equation from statistical physics. The gradient term pushes toward high probability; the noise term provides the thermal fluctuations needed to explore the full distribution. The two forces balance at the correct stationary distribution.
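A minimal runnable sketch of this balance, using a 1D standard Gaussian as the target so the score is available in closed form (it is simply -x); the step size and step count are illustrative:

```python
import numpy as np

# Langevin dynamics on a 1D standard Gaussian, whose score is known
# analytically: grad_x log N(x; 0, 1) = -x.
rng = np.random.default_rng(0)
eta = 0.01                               # step size (illustrative)
x = rng.normal(5.0, 1.0, size=10_000)    # 10k chains, started far from the target

for _ in range(2_000):
    s = -x                               # the score of N(0, 1)
    x = x + 0.5 * eta * s + np.sqrt(eta) * rng.standard_normal(x.shape)

# Without the noise term every chain would collapse to the mode x = 0;
# with it, the stationary distribution is N(0, 1).
print(x.mean(), x.std())
```

The gradient term pulls the chains from their starting point near 5 down toward 0, while the injected noise keeps them spread out with the correct unit variance.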
Score Matching: Learning the Score
We don't have access to the true score — that would require knowing the true data distribution. Instead, train a neural network to match it.
The score matching objective (Hyvärinen, 2005):

J(θ) = E_{P(x)}[ tr(∇_x s_θ(x)) + ½ ‖s_θ(x)‖² ]

- J(θ): score matching loss
- tr(∇_x s_θ(x)): trace of the Jacobian of the score network — expensive to compute
- ½ ‖s_θ(x)‖²: squared norm of the score
This objective can be minimized without ever evaluating P(x) directly. However, computing tr(∇_x s_θ(x)) is expensive — it requires second derivatives of the log density, via the Jacobian of the score network.
Denoising score matching (Vincent, 2011) gives a cheaper equivalent: add small noise to data x to get x̃, then train:

J(θ) = E_{x, ε}[ ‖ s_θ(x̃) + ε/σ ‖² ]

- σ: noise level
- ε: the noise added: ε ~ N(0, I)
- x̃: noisy data: x̃ = x + σε
This is the connection to DDPM. The score of the forward-noised distribution is ∇_{x_t} log q(x_t | x_0) = −ε / √(1 − ᾱ_t). So:

s_θ(x_t) = −ε_θ(x_t, t) / √(1 − ᾱ_t)

The DDPM noise predictor ε_θ is exactly a scaled negative score function. DDPM and score-based models are doing the same thing — just using different parameterizations.
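The "noise predictor equals scaled negative score" relationship can be checked numerically in a toy setting. The sketch below (1D Gaussian data, illustrative σ) uses the fact that the optimal denoising-score-matching regressor is the conditional mean of the target −ε/σ given x̃, which equals the true score of the noised marginal; here that marginal is N(0, 1 + σ²), whose score is −x̃/(1 + σ²). A least-squares fit recovers that slope.

```python
import numpy as np

# Denoising score matching sanity check on 1D Gaussian data x ~ N(0, 1).
# The noised marginal is x_noisy ~ N(0, 1 + sigma^2), so its true score
# is -x_noisy / (1 + sigma^2). The DSM regression target -eps/sigma has
# the true score as its conditional mean, so a least-squares fit of the
# target against x_noisy should recover the slope -1 / (1 + sigma^2).
rng = np.random.default_rng(0)
sigma = 0.5                                # noise level (illustrative)
x = rng.standard_normal(1_000_000)         # clean data
eps = rng.standard_normal(1_000_000)       # the added noise
x_noisy = x + sigma * eps

target = -eps / sigma                      # the DSM regression target
slope = (x_noisy @ target) / (x_noisy @ x_noisy)   # least-squares slope

print(slope, -1 / (1 + sigma**2))          # should agree closely
```

The regression never sees the true score; it only sees noisy samples and the noise that produced them, yet it converges to the score of the noised distribution. That is exactly why training a network on the DSM loss yields a score estimator.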
Continuous-Time: The SDE Perspective
Song et al. (2020) unified all diffusion models by taking the number of steps T → ∞. The discrete Markov chain becomes a continuous-time stochastic differential equation (SDE):

dx = f(x, t) dt + g(t) dw

- x_t: continuous-time process in data space
- f(x, t): drift term, a deterministic force (depends on the schedule)
- g(t): diffusion coefficient, controls noise magnitude
- w: standard Wiener process (continuous Brownian motion)
The forward (noising) SDE destroys data. There exists a reverse SDE (Anderson, 1982):

dx = [f(x, t) − g(t)² ∇_x log p_t(x)] dt + g(t) dw̄

The reverse SDE involves the score function ∇_x log p_t(x) — exactly what we learn. Different choices of f and g recover different diffusion models:
- DDPM forward process (the VP SDE): f(x, t) = −½ β(t) x, g(t) = √β(t)
- SMLD (Song & Ermon, 2019; the VE SDE): f(x, t) = 0, g(t) = √(d[σ²(t)]/dt)
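The forward process can be sketched with an Euler–Maruyama simulation. This toy version (not the actual SMLD schedule) uses a VE-style SDE with a constant diffusion coefficient g, so the variance grows linearly: Var[x_t] = Var[x_0] + g²t.

```python
import numpy as np

# Euler-Maruyama simulation of a VE-style forward SDE dx = g dw with a
# constant g -- a simplification of the SMLD schedule, for illustration.
rng = np.random.default_rng(0)
g = 3.0                                # diffusion coefficient (illustrative)
n_steps = 1_000
dt = 1.0 / n_steps

x = rng.standard_normal(100_000)       # "data": x_0 ~ N(0, 1)
for _ in range(n_steps):
    # each step adds independent Gaussian noise of variance g^2 * dt
    x = x + g * np.sqrt(dt) * rng.standard_normal(x.shape)

# After time T = 1: Var[x_T] = Var[x_0] + g^2 * T = 1 + 9 = 10
print(x.var())
```

By the end of the simulation the data distribution is buried under noise of much larger variance, which is the job of the forward process.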
Probability flow ODE (DDIM): the reverse SDE can also be converted to a deterministic ODE with the same marginal distributions — the probability flow ODE, dx = [f(x, t) − ½ g(t)² ∇_x log p_t(x)] dt. Discretizing this ODE gives the DDIM sampler — fast, deterministic, fewer steps needed.
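The deterministic sampler can be sketched in a toy setting where the score is known exactly (all parameters illustrative): for 1D data x_0 ~ N(0, 1) under a constant-g VE forward SDE, the marginal at time t is N(0, 1 + g²t), so the score is −x/(1 + g²t). Integrating the probability flow ODE dx = −½ g² ∇_x log p_t(x) dt backward from pure noise should recover unit-variance samples, with no noise injected along the way.

```python
import numpy as np

# Probability flow ODE sampler for a toy case with a closed-form score:
# data x_0 ~ N(0, 1) under dx = g dw, so p_t = N(0, 1 + g^2 t) and
# grad_x log p_t(x) = -x / (1 + g^2 t).
rng = np.random.default_rng(0)
g = 3.0                                # diffusion coefficient (illustrative)
n_steps = 1_000
dt = 1.0 / n_steps

# Start from the terminal (fully noised) distribution N(0, 1 + g^2)
# and take deterministic Euler steps backward in time.
x = rng.normal(0.0, np.sqrt(1 + g**2), size=100_000)
for i in range(n_steps, 0, -1):
    t = i * dt
    s = -x / (1 + g**2 * t)            # exact score at time t
    x = x + 0.5 * g**2 * s * dt        # Euler step; no noise injected

print(x.std())   # should be close to 1: samples from the data distribution
```

Every run from the same starting noise produces the same samples, which is the defining property of DDIM-style deterministic sampling.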
What Unifies Everything
| Model | Core object | How it generates |
|---|---|---|
| VAE | Latent code z | Decode z ~ N(0, I) |
| GAN | Generator G | Apply G to z ~ N(0, I) |
| Score model | Score ∇_x log P(x) | Langevin dynamics from noise |
| DDPM | Noise predictor ε_θ | Reverse Markov chain |
| SDE | Reverse SDE | Continuous reverse diffusion |
All generative models are trying to describe the same thing — where data lives in high-dimensional space — using different mathematical tools. Score functions describe this geometrically (gradient field). SDEs describe it dynamically (continuous process). VAEs describe it probabilistically (latent variable integral). GANs describe it implicitly (adversarial sampler). The connections between them deepen with each new result: score matching led to DDPM improvements, SDE theory enabled faster samplers, and the probability flow ODE enables image editing and inversion.
Interactive example
Visualize the score field ∇_x log P(x) for a 2D Gaussian mixture and watch Langevin dynamics converge to samples
Coming soon