Generative Models
Lesson 10 ⏱ 14 min

Score matching: the deeper theory

Video coming soon

Score Matching: A Unified View of Diffusion Models

The score function ∇_x log P(x), Langevin dynamics for sampling, why DDPM noise prediction equals score estimation, and the continuous-time SDE formulation that unifies all diffusion models.



Quick refresher

Gradient of a function

∇_x f(x) is the vector of partial derivatives of f with respect to each component of x. It points in the direction of steepest increase of f. To maximize f, follow the gradient; to minimize, go against it.

Example

If f(x₁,x₂) = −x₁² − x₂², then ∇f = [−2x₁, −2x₂].

At (1,1), the gradient is (−2,−2), pointing toward the origin (the maximum of f).

Following this gradient moves you toward higher density under a N(0,I) distribution.
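The refresher example can be checked numerically with finite differences; a minimal sketch in NumPy (function names are illustrative):

```python
import numpy as np

def f(x):
    # f(x1, x2) = -x1^2 - x2^2, peaked at the origin
    return -x[0]**2 - x[1]**2

def grad_f(x):
    # Analytic gradient from the example: [-2*x1, -2*x2]
    return np.array([-2.0 * x[0], -2.0 * x[1]])

def numerical_grad(x, h=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, 1.0])
print(grad_f(x))          # [-2. -2.], pointing toward the origin
print(numerical_grad(x))  # matches the analytic gradient
```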

Stepping Back: A Unified View

Lessons 14-8 and 14-9 presented diffusion as a discrete Markov chain: add noise for T steps, learn to reverse each step. This works well in practice, but it hides the deeper mathematical structure. This lesson reveals that structure — and in doing so, unifies VAEs, diffusion models, and a third approach called score-based generative models into a single framework.

Score-based generative models are the mathematical foundation behind the most recent class of state-of-the-art generative systems. Understanding the score function unifies diffusion, energy-based models, and Langevin dynamics into one framework — and explains why sampling algorithms like DDIM and DDPM are related rather than competing.

The key object: the score function.

The Score Function

For any probability distribution P(x), the score function is:

s(x) = \nabla_x \log P(x)
s(x)
the score: a vector field over data space
\nabla_x
gradient with respect to x — the data, not model parameters
\log P(x)
log probability density of x under the distribution P

This is a vector the same size as x. For a 28×28 image, the score has 784 components. Each component says: "if you increase pixel i by a tiny amount, does the log probability go up or down, and by how much?"

Intuition: the score points toward high-density regions. If you are at a low-probability image and follow the score, you move toward images that look more like real data. This is analogous to gradient ascent toward modes — but the score field knows about the full distribution's geometry, not just local slope.
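For one concrete case, the score of a standard Gaussian has a closed form: log N(x; 0, I) = −½‖x‖² + const, so ∇_x log P(x) = −x, and at every point the score points straight at the mode. A minimal numerical check (NumPy; the finite-difference helper is illustrative):

```python
import numpy as np

def log_p(x):
    # Log-density of N(0, I) up to a constant (constants drop out of the gradient)
    return -0.5 * np.sum(x**2)

def score_numerical(x, h=1e-5):
    # Finite-difference estimate of the score s(x) = grad_x log P(x)
    s = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        s[i] = (log_p(x + e) - log_p(x - e)) / (2 * h)
    return s

x = np.array([0.5, -1.2])
print(score_numerical(x))  # ≈ -x: the score points back toward the mode at 0
```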

Langevin Dynamics: Sampling from the Score

If we know the score s(x) = \nabla_x \log P(x) at every point, we can generate samples without ever computing P(x) directly. The method is Langevin dynamics:

x^{(k+1)} = x^{(k)} + \delta \cdot s(x^{(k)}) + \sqrt{2\delta}\,\varepsilon^{(k)}
x^{(k+1)}
next position in the Markov chain
x^{(k)}
current position
\delta
step size (small positive number)
\varepsilon^{(k)}
fresh Gaussian noise at each step: ε ~ N(0,I)

This is gradient ascent on log P(x) with added noise. The noise is crucial: without it, the chain converges to a mode (a point that maximizes P). With noise, the stationary distribution of the chain is exactly P(x) — as the chain runs, x^{(k)} converges in distribution to a genuine sample from the data distribution.

Why does this work? This is the Langevin equation from statistical physics. The gradient term pushes toward high probability; the noise term provides the thermal fluctuations needed to explore the full distribution. The two forces balance at the correct stationary distribution.
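The update rule above is a few lines of code once the score is available. A minimal sketch targeting N(0, I), whose score is simply −x (an assumed toy target; in practice a learned network s_θ would stand in for `score`):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    # Score of the toy target N(0, I): grad_x log P(x) = -x
    return -x

def langevin(x0, step=0.1, n_steps=2000):
    # x_{k+1} = x_k + delta * s(x_k) + sqrt(2 * delta) * eps,  eps ~ N(0, I)
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step * score(x) + np.sqrt(2 * step) * noise
    return x

# Run many chains from a far-away start; they settle into N(0, I)
samples = np.stack([langevin(np.full(2, 5.0)) for _ in range(500)])
print(samples.mean(axis=0), samples.std(axis=0))  # ≈ [0, 0] and ≈ [1, 1]
```

Note that the chains start at (5, 5), far from the mode, and still end up distributed as the target: the gradient term pulls them in, the noise term keeps them spread out.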

Score Matching: Learning the Score

We don't have access to the true score \nabla_x \log P(x) — that would require knowing the true data distribution. Instead, train a neural network to match it.

The score matching objective (Hyvärinen, 2005):

L_{\text{SM}} = \mathbb{E}_{x \sim P}\!\left[\text{tr}(\nabla_x s_\theta(x)) + \frac{1}{2}\|s_\theta(x)\|^2\right]
L_{\text{SM}}
score matching loss
\text{tr}(\nabla_x s_\theta)
trace of the Jacobian of the score network — expensive to compute
\|s_\theta(x)\|^2
squared norm of the score

This objective can be minimized without ever evaluating \nabla_x \log P(x) directly. However, computing \text{tr}(\nabla_x s_\theta) is expensive — it requires a second-order Jacobian computation.
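One common workaround for that cost is to replace the exact trace with the randomized identity tr(A) = E_v[vᵀAv] for v ~ N(0, I) (Hutchinson's estimator, the idea behind sliced score matching). A minimal sketch on a linear "score" s(x) = Ax, whose Jacobian is exactly A (the linear model is illustrative, not a real score network):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

def score(x):
    # A linear stand-in for a score network: s(x) = A x, so its Jacobian is A
    return A @ x

def hutchinson_trace(n_probes=10_000):
    # tr(A) ≈ mean over random probes v of v^T (A v), with v ~ N(0, I);
    # each probe needs only one evaluation of score(), no full Jacobian
    total = 0.0
    for _ in range(n_probes):
        v = rng.standard_normal(5)
        total += v @ score(v)
    return total / n_probes

print(hutchinson_trace(), np.trace(A))  # the two values agree closely
```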

Denoising score matching (Vincent, 2011) gives a cheaper equivalent: add small noise \varepsilon to data x to get \tilde{x}, then train:

L_{\text{DSM}} = \mathbb{E}_{x, \varepsilon}\!\left[\left\|s_\theta(\tilde{x}) + \frac{\varepsilon}{\sigma}\right\|^2\right]
\sigma
noise level
\varepsilon
the noise added: ε ~ N(0,I)
\tilde{x}
noisy data: x + σε

This is the connection to DDPM. The score of the forward-noised distribution is \nabla_{x_t} \log q(x_t) = -\varepsilon/\sqrt{1-\bar{\alpha}_t}. So:

s_\theta(x_t, t) \approx \nabla_{x_t} \log q(x_t) = -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}

The DDPM noise predictor \varepsilon_\theta is exactly a scaled negative score function. DDPM and score-based models are doing the same thing — just using different parameterizations.
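The DSM regression target can be sanity-checked in the simplest possible case: a "dataset" consisting of a single point x₀. The noised distribution is then exactly N(x₀, σ²I), whose true score at x̃ is (x₀ − x̃)/σ², and that coincides with the target −ε/σ from the loss above. A minimal sketch (the toy data point is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x0 = np.array([2.0, -1.0])     # a single toy "data point"

# Noise the data: x_tilde = x0 + sigma * eps, with eps ~ N(0, I)
eps = rng.standard_normal(2)
x_tilde = x0 + sigma * eps

# DSM regression target for s_theta(x_tilde): -eps / sigma
target = -eps / sigma
# True score of q(x_tilde) = N(x0, sigma^2 I): (x0 - x_tilde) / sigma^2
true_score = (x0 - x_tilde) / sigma**2

print(np.allclose(target, true_score))  # True: the target IS the true score
```

The algebra behind the match is one line: x̃ − x₀ = σε, so (x₀ − x̃)/σ² = −σε/σ² = −ε/σ.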

Continuous-Time: The SDE Perspective

Song et al. (2020) unified all diffusion models by taking the number of steps T \to \infty. The discrete Markov chain becomes a continuous-time stochastic differential equation (SDE):

dx_t = f(x_t, t)\,dt + g(t)\,dW_t
x_t
continuous-time process in data space
f(x,t)
drift term: deterministic force (depends on schedule)
g(t)
diffusion coefficient: controls noise magnitude
dW_t
standard Wiener process (continuous Brownian motion)

The forward (noising) SDE destroys data. There exists a reverse SDE (Anderson, 1982):

dx_t = \left[f(x_t, t) - g(t)^2 \nabla_{x_t} \log q_t(x_t)\right] dt + g(t)\,d\bar{W}_t

The reverse SDE involves the score function \nabla_{x_t} \log q_t(x_t) — exactly what we learn. Different choices of f and g recover different diffusion models:

  • DDPM forward process: f = -\frac{1}{2}\beta(t) x, g = \sqrt{\beta(t)}
  • SMLD (Song & Ermon, 2019): f = 0, g = \sqrt{d[\sigma^2(t)]/dt}

Numerical ODE (DDIM): the reverse SDE can also be converted to a deterministic ODE with the same marginals (the probability flow ODE), obtained by dropping the noise term and halving the score coefficient. This gives the DDIM sampler — fast, deterministic, fewer steps needed.

What Unifies Everything

| Model | Core object | How it generates |
| --- | --- | --- |
| VAE | Latent code z | Decode z \sim \mathcal{N}(0,I) |
| GAN | Generator G | Apply G to z |
| Score model | Score \nabla_x \log P | Langevin dynamics from noise |
| DDPM | Noise predictor \varepsilon_\theta | Reverse Markov chain |
| SDE | Reverse SDE | Continuous reverse diffusion |

All generative models are trying to describe the same thing — where data lives in high-dimensional space — using different mathematical tools. Score functions describe this geometrically (gradient field). SDEs describe it dynamically (continuous process). VAEs describe it probabilistically (latent variable integral). GANs describe it implicitly (adversarial sampler). The connections between them deepen with each new result: score matching led to DDPM improvements, SDE theory enabled faster samplers, and the probability flow ODE enables image editing and inversion.

Interactive example

Visualize the score field ∇_x log P(x) for a 2D Gaussian mixture and watch Langevin dynamics converge to samples

Coming soon

Quiz

1 / 3

The score function is defined as s(x) = ∇_x log P(x). What does it output?