Generative Models
Lesson 10 ⏱ 14 min

Score matching: the deeper theory

Video coming soon

Score Matching: A Unified View of Diffusion Models

The score function ∇_x log P(x), Langevin dynamics for sampling, why DDPM noise prediction equals score estimation, and the continuous-time SDE formulation that unifies all diffusion models.



Quick refresher

Gradient of a function

∇_x f(x) is the vector of partial derivatives of f with respect to each component of x. It points in the direction of steepest increase of f. To maximize f, follow the gradient; to minimize, go against it.

Example

If f(x₁,x₂) = −x₁² − x₂², then ∇f = [−2x₁, −2x₂].

At (1,1), the gradient is (−2,−2), pointing toward the origin (the maximum of f).

Following this gradient moves you toward higher density under a N(0,I) distribution.
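The refresher example can be checked numerically with finite differences; a minimal sketch in NumPy (function names are illustrative):

```python
import numpy as np

def f(x):
    # f(x1, x2) = -x1^2 - x2^2, peaked at the origin
    return -x[0]**2 - x[1]**2

def grad_f(x):
    # Analytic gradient from the example: [-2*x1, -2*x2]
    return np.array([-2.0 * x[0], -2.0 * x[1]])

def numerical_grad(x, h=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([1.0, 1.0])
print(grad_f(x))          # [-2. -2.], pointing toward the origin
print(numerical_grad(x))  # matches the analytic gradient
```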

Stepping Back: A Unified View

Lessons 14-8 and 14-9 presented diffusion as a discrete Markov chain: add noise for T steps, learn to reverse each step. This works well in practice, but it hides the deeper mathematical structure. This lesson reveals that structure — and in doing so, unifies VAEs, diffusion models, and a third approach called score-based generative models into a single framework.

Score-based generative models are the mathematical foundation behind the most recent class of state-of-the-art generative systems. Understanding the score function unifies diffusion, energy-based models, and Langevin dynamics into one framework — and explains why sampling algorithms like DDIM and DDPM are related rather than competing.

The key object: the score function.

The Score Function

For any probability distribution P(x), the score function is:

s(x) = \nabla_x \log P(x)
s(x)
the score: a vector field over data space
\nabla_x
gradient with respect to x — the data, not model parameters
\log P(x)
log probability density of x under the distribution P

This is a vector the same size as x. For a 28×28 image, the score has 784 components. Each component says: "if you increase pixel i by a tiny amount, does the log probability go up or down, and by how much?"

Intuition: the score points toward high-density regions. If you are at a low-probability image and follow the score, you move toward images that look more like real data. This is analogous to gradient ascent toward modes — but the score field knows about the full distribution's geometry, not just local slope.
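For one concrete case, the score of a standard Gaussian has a closed form: log N(x; 0, I) = −½‖x‖² + const, so ∇_x log P(x) = −x, and at every point the score points straight at the mode. A minimal numerical check (NumPy; the finite-difference helper is illustrative):

```python
import numpy as np

def log_p(x):
    # Log-density of N(0, I) up to a constant (constants drop out of the gradient)
    return -0.5 * np.sum(x**2)

def score_numerical(x, h=1e-5):
    # Finite-difference estimate of the score s(x) = grad_x log P(x)
    s = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        s[i] = (log_p(x + e) - log_p(x - e)) / (2 * h)
    return s

x = np.array([0.5, -1.2])
print(score_numerical(x))  # ≈ -x: the score points back toward the mode at 0
```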

Langevin Dynamics: Sampling from the Score

If we know the score s(x) = \nabla_x \log P(x) at every point, we can generate samples without ever computing P(x) directly. The method is Langevin dynamics:

x^{(k+1)} = x^{(k)} + \delta \cdot s(x^{(k)}) + \sqrt{2\delta}\,\varepsilon^{(k)}
x^{(k+1)}
next position in the Markov chain
x^{(k)}
current position
\delta
step size (small positive number)
\varepsilon^{(k)}
fresh Gaussian noise at each step: ε ~ N(0,I)

This is gradient ascent on log P(x) with added noise. The noise is crucial: without it, the chain converges to a mode (a point that maximizes P). With noise, the stationary distribution of the chain is exactly P(x) — as the chain runs, x^{(k)} converges in distribution to a genuine sample from the data distribution.

Why does this work? This is the Langevin equation from statistical physics. The gradient term pushes toward high probability; the noise term provides the thermal fluctuations needed to explore the full distribution. The two forces balance at the correct stationary distribution.
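The update rule above is a few lines of code once the score is available. A minimal sketch targeting N(0, I), whose score is simply −x (an assumed toy target; in practice a learned network s_θ would stand in for `score`):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    # Score of the toy target N(0, I): grad_x log P(x) = -x
    return -x

def langevin(x0, step=0.1, n_steps=2000):
    # x_{k+1} = x_k + delta * s(x_k) + sqrt(2 * delta) * eps,  eps ~ N(0, I)
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step * score(x) + np.sqrt(2 * step) * noise
    return x

# Run many chains from a far-away start; they settle into N(0, I)
samples = np.stack([langevin(np.full(2, 5.0)) for _ in range(500)])
print(samples.mean(axis=0), samples.std(axis=0))  # ≈ [0, 0] and ≈ [1, 1]
```

Note that the chains start at (5, 5), far from the mode, and still end up distributed as the target: the gradient term pulls them in, the noise term keeps them spread out.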

Score Matching: Learning the Score

We don't have access to the true score \nabla_x \log P(x) — that would require knowing the true data distribution. Instead, train a neural network to match it.

The score matching objective (Hyvärinen, 2005):

L_{\text{SM}} = \mathbb{E}_{x \sim P}\!\left[\text{tr}(\nabla_x s_\theta(x)) + \frac{1}{2}\|s_\theta(x)\|^2\right]
L_{\text{SM}}
score matching loss
\text{tr}(\nabla_x s_\theta)
trace of the Jacobian of the score network — expensive to compute
\|s_\theta(x)\|^2
squared norm of the score

This objective can be minimized without ever evaluating \nabla_x \log P(x) directly. However, computing \text{tr}(\nabla_x s_\theta) is expensive — it requires a second-order Jacobian computation.
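One common workaround for that cost is to replace the exact trace with the randomized identity tr(A) = E_v[vᵀAv] for v ~ N(0, I) (Hutchinson's estimator, the idea behind sliced score matching). A minimal sketch on a linear "score" s(x) = Ax, whose Jacobian is exactly A (the linear model is illustrative, not a real score network):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

def score(x):
    # A linear stand-in for a score network: s(x) = A x, so its Jacobian is A
    return A @ x

def hutchinson_trace(n_probes=10_000):
    # tr(A) ≈ mean over random probes v of v^T (A v), with v ~ N(0, I);
    # each probe needs only one evaluation of score(), no full Jacobian
    total = 0.0
    for _ in range(n_probes):
        v = rng.standard_normal(5)
        total += v @ score(v)
    return total / n_probes

print(hutchinson_trace(), np.trace(A))  # the two values agree closely
```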

Denoising score matching (Vincent, 2011) gives a cheaper equivalent: add small noise \varepsilon to data x to get \tilde{x}, then train:

L_{\text{DSM}} = \mathbb{E}_{x, \varepsilon}\!\left[\left\|s_\theta(\tilde{x}) + \frac{\varepsilon}{\sigma}\right\|^2\right]
\sigma
noise level
\varepsilon
the noise added: ε ~ N(0,I)
\tilde{x}
noisy data: x + σε

This is the connection to DDPM. The score of the forward-noised distribution is \nabla_{x_t} \log q(x_t) = -\varepsilon/\sqrt{1-\bar{\alpha}_t}. So:

s_\theta(x_t, t) \approx \nabla_{x_t} \log q(x_t) = -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}

The DDPM noise predictor \varepsilon_\theta is exactly a scaled negative score function. DDPM and score-based models are doing the same thing — just using different parameterizations.
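The DSM regression target can be sanity-checked in the simplest possible case: a "dataset" consisting of a single point x₀. The noised distribution is then exactly N(x₀, σ²I), whose true score at x̃ is (x₀ − x̃)/σ², and that coincides with the target −ε/σ from the loss above. A minimal sketch (the toy data point is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5
x0 = np.array([2.0, -1.0])     # a single toy "data point"

# Noise the data: x_tilde = x0 + sigma * eps, with eps ~ N(0, I)
eps = rng.standard_normal(2)
x_tilde = x0 + sigma * eps

# DSM regression target for s_theta(x_tilde): -eps / sigma
target = -eps / sigma
# True score of q(x_tilde) = N(x0, sigma^2 I): (x0 - x_tilde) / sigma^2
true_score = (x0 - x_tilde) / sigma**2

print(np.allclose(target, true_score))  # True: the target IS the true score
```

The algebra behind the match is one line: x̃ − x₀ = σε, so (x₀ − x̃)/σ² = −σε/σ² = −ε/σ.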

Continuous-Time: The SDE Perspective

Song et al. (2020) unified all diffusion models by taking the number of steps T \to \infty. The discrete Markov chain becomes a continuous-time stochastic differential equation (SDE):

dx_t = f(x_t, t)\,dt + g(t)\,dW_t
x_t
continuous-time process in data space
f(x,t)
drift term: deterministic force (depends on schedule)
g(t)
diffusion coefficient: controls noise magnitude
dW_t
standard Wiener process (continuous Brownian motion)

The forward (noising) SDE destroys data. There exists a reverse SDE (Anderson, 1982):

dx_t = \left[f(x_t, t) - g(t)^2 \nabla_{x_t} \log q_t(x_t)\right] dt + g(t)\,d\bar{W}_t

The reverse SDE involves the score function \nabla_{x_t} \log q_t(x_t) — exactly what we learn. Different choices of f and g recover different diffusion models:

  • DDPM forward process: f = -\frac{1}{2}\beta(t) x, g = \sqrt{\beta(t)}
  • SMLD (Song & Ermon, 2019): f = 0, g = \sqrt{d[\sigma^2(t)]/dt}

Numerical ODE (DDIM): the reverse SDE can also be converted to a deterministic ODE with the same marginals (the probability flow ODE), obtained by dropping the noise term and halving the score coefficient. This gives the DDIM sampler — fast, deterministic, fewer steps needed.

What Unifies Everything

| Model | Core object | How it generates |
| --- | --- | --- |
| VAE | Latent code z | Decode z \sim \mathcal{N}(0,I) |
| GAN | Generator G | Apply G to z |
| Score model | Score \nabla_x \log P | Langevin dynamics from noise |
| DDPM | Noise predictor \varepsilon_\theta | Reverse Markov chain |
| SDE | Reverse SDE | Continuous reverse diffusion |

All generative models are trying to describe the same thing — where data lives in high-dimensional space — using different mathematical tools. Score functions describe this geometrically (gradient field). SDEs describe it dynamically (continuous process). VAEs describe it probabilistically (latent variable integral). GANs describe it implicitly (adversarial sampler). The connections between them deepen with each new result: score matching led to DDPM improvements, SDE theory enabled faster samplers, and the probability flow ODE enables image editing and inversion.

Interactive example

Visualize the score field ∇_x log P(x) for a 2D Gaussian mixture and watch Langevin dynamics converge to samples

Coming soon

Quiz

1 / 3

The score function is defined as s(x) = ∇_x log P(x). What does it output?