KL Divergence: How Different Are Two Distributions?

What KL divergence measures. Why it's not symmetric. The connection to cross-entropy. Worked numerical example. Where KL appears: VAEs, variational inference, RLHF.


Quick refresher

Entropy

Entropy H(X) = -Σ P(x)·log P(x) measures the average surprise (uncertainty) of a random variable. A fair coin has 1 bit of entropy. A deterministic variable has 0 entropy. Entropy is maximized by the uniform distribution.

Example

H(fair coin) = -(0.5·log₂0.5 + 0.5·log₂0.5) = 1 bit.

H(biased coin, P(H)=0.9) ≈ 0.47 bits.
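
As a quick sanity check, here is the formula in plain Python (a minimal sketch; the helper name entropy_bits is ours):

import math

def entropy_bits(probs):
    # H(X) = -Σ P(x)·log₂ P(x); zero-probability outcomes contribute nothing
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))  # 1.0 bit (fair coin)
print(entropy_bits([0.9, 0.1]))  # ≈ 0.469 bits (biased coin)
print(entropy_bits([1.0]))       # 0.0 bits (deterministic)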

The Setup

You have a true distribution P, but you are using an approximate distribution Q to reason about the world.

How much are you paying for this approximation? How many extra bits do you waste every time you use Q when the truth is P?

KL divergence answers this precisely.

If you have two probability distributions — what your model predicts versus what actually happens — how do you measure how wrong the model is? Subtracting them misses how the mismatch compounds across low-probability events. You need a number that captures the full shape of the disagreement. KL divergence is that number, and it turns out to be exactly what you're minimizing every time you train a model with cross-entropy loss.

Definition

The KL divergence from Q to P is:

D_{\text{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

where:

D_KL(P‖Q): KL divergence from Q to P, read as "KL of P from Q"
P(x): probability under the true distribution P
Q(x): probability under the approximate distribution Q
Σ: sum over all values x where P(x) > 0

Expanding using \log(P/Q) = \log P - \log Q:

D_{\text{KL}}(P \,\|\, Q) = \sum_x P(x)\log P(x) - \sum_x P(x)\log Q(x) = -H(P) + H(P, Q)

where H(P) is the entropy of P and H(P, Q) = -\sum_x P(x)\log Q(x) is the cross-entropy of P and Q.

So: D_{\text{KL}}(P \,\|\, Q) = H(P, Q) - H(P). KL divergence is the extra cost beyond what you'd pay if you used the optimal code for P.
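
This decomposition is easy to verify numerically. A minimal plain-Python sketch, using the same two-outcome distributions as the worked example below:

import math

p = [0.7, 0.3]  # true distribution P
q = [0.5, 0.5]  # model distribution Q

H_p  = -sum(pi * math.log(pi) for pi in p)                  # entropy H(P)
H_pq = -sum(pi * math.log(qi) for pi, qi in zip(p, q))      # cross-entropy H(P, Q)
kl   = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))  # D_KL(P || Q)

print(kl)          # ≈ 0.0823 nats
print(H_pq - H_p)  # same number: KL = cross-entropy minus entropy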

KL ≥ 0 Always (via Jensen's Inequality)

This is a key result. We need Jensen's inequality: for any convex function f, \mathbb{E}[f(X)] \geq f(\mathbb{E}[X]).

Since -\log is convex:

D_{\text{KL}}(P \,\|\, Q) = \mathbb{E}_P\!\left[\log\frac{P(x)}{Q(x)}\right] = -\mathbb{E}_P\!\left[\log\frac{Q(x)}{P(x)}\right] \geq -\log \mathbb{E}_P\!\left[\frac{Q(x)}{P(x)}\right] = -\log\sum_x Q(x) = 0

where \mathbb{E}_P denotes the expectation taken under distribution P.

Therefore D_{\text{KL}}(P \,\|\, Q) \geq 0 always, with equality if and only if P = Q everywhere. This is sometimes called Gibbs' inequality.
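
If you want to see Gibbs' inequality hold empirically, here is a quick sketch over random distribution pairs (the setup is ours, purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.random(5); p /= p.sum()  # random distribution P over 5 outcomes
    q = rng.random(5); q /= q.sum()  # random distribution Q over 5 outcomes
    kl = np.sum(p * np.log(p / q))
    assert kl >= 0  # Gibbs' inequality: never negative
print("D_KL(P || Q) was non-negative for all 1000 random pairs")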

Worked Numerical Example

Suppose there are only two outcomes (like a biased coin), and:

Outcome | P(x) (true) | Q(x) (model)
Heads   | 0.7         | 0.5
Tails   | 0.3         | 0.5

Compute D_{\text{KL}}(P \,\|\, Q) (using natural log):

D_{\text{KL}}(P \,\|\, Q) = 0.7 \cdot \ln\frac{0.7}{0.5} + 0.3 \cdot \ln\frac{0.3}{0.5}

(the first term is the contribution from Heads, the second from Tails)

= 0.7 \cdot \ln(1.4) + 0.3 \cdot \ln(0.6) = 0.7 \cdot 0.336 + 0.3 \cdot (-0.511) = 0.235 - 0.153 = 0.082 \text{ nats}

Now compute D_{\text{KL}}(Q \,\|\, P) (reversed):

= 0.5 \cdot \ln(0.5/0.7) + 0.5 \cdot \ln(0.5/0.3) = 0.5 \cdot (-0.336) + 0.5 \cdot (0.511) = -0.168 + 0.255 = 0.087 \text{ nats}

D_{\text{KL}}(P \,\|\, Q) = 0.082 \neq 0.087 = D_{\text{KL}}(Q \,\|\, P).

KL divergence is not symmetric. It is not a true distance metric.
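
The same calculation in plain Python, checking both directions (the helper kl is ours):

import math

def kl(p, q):
    # D_KL(p || q) in nats for discrete distributions, skipping p(x) = 0 terms
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.3]  # true coin
q = [0.5, 0.5]  # model coin

print(kl(p, q))  # ≈ 0.0823 nats: forward KL
print(kl(q, p))  # ≈ 0.0872 nats: reverse KL differs, so KL is asymmetric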

Why Asymmetry Matters: Forward vs Reverse KL

The asymmetry has deep consequences in practice. Forward KL, D_KL(P‖Q), averages the log-ratio under P, so Q is penalized wherever P has mass that Q misses: minimizing it makes Q spread out to cover all of P's modes (mass-covering). Reverse KL, D_KL(Q‖P), averages under Q, so Q is penalized for putting mass where P has almost none: minimizing it makes Q collapse onto a single mode (mode-seeking, zero-forcing), as the sketch below illustrates.
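
The classic illustration is fitting a single Gaussian Q to a bimodal P. This sketch (grid, widths, and helper names are all illustrative choices) discretizes both distributions and scans the mean of Q under each direction:

import numpy as np

x = np.linspace(-8, 8, 2001)

def gauss(mu, sigma):
    d = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return d / d.sum()  # normalized discrete distribution on the grid

p = 0.5 * gauss(-3, 0.5) + 0.5 * gauss(3, 0.5)  # bimodal "true" P

def kl(a, b, eps=1e-300):
    return np.sum(a * np.log((a + eps) / (b + eps)))

mus = np.linspace(-4, 4, 401)
forward = [kl(p, gauss(mu, 1.0)) for mu in mus]  # D_KL(P || Q)
reverse = [kl(gauss(mu, 1.0), p) for mu in mus]  # D_KL(Q || P)

print(mus[np.argmin(forward)])  # ≈ 0: forward KL spreads Q across both modes
print(mus[np.argmin(reverse)])  # ≈ ±3: reverse KL locks Q onto one mode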

KL Divergence in the Wild

Variational Autoencoders (VAEs): the ELBO loss contains D_{\text{KL}}(q_\phi(z \mid x) \,\|\, p(z)), which forces the approximate posterior to stay close to the prior.
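
For the common diagonal-Gaussian posterior with a standard-normal prior, this KL term has a well-known closed form. A minimal sketch (the function name is ours, not a library API):

import torch

def gaussian_kl_to_standard_normal(mu, logvar):
    # D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims:
    # -0.5 * Σ (1 + log σ² - μ² - σ²)
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)

mu, logvar = torch.zeros(4), torch.zeros(4)  # posterior equals the prior
print(gaussian_kl_to_standard_normal(mu, logvar))  # tensor(0.)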

RLHF (Reinforcement Learning from Human Feedback): the fine-tuning objective contains a KL penalty D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) to prevent the model from drifting too far from the reference policy.
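
In practice this penalty is typically estimated per token from log-probabilities of the sampled sequence; log π_θ − log π_ref is a standard single-sample estimator of the KL when tokens are drawn from π_θ. A hedged sketch with made-up numbers:

import torch

# Log-probs of the same sampled tokens under the policy and the frozen reference
logprobs_policy = torch.tensor([-1.2, -0.8, -2.1])  # log π_θ(token | context)
logprobs_ref    = torch.tensor([-1.5, -0.7, -2.0])  # log π_ref(token | context)

# Per-token KL estimate; its mean (scaled by a coefficient) is added to the loss
kl_per_token = logprobs_policy - logprobs_ref
print(kl_per_token.mean())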

Information bottleneck: the tradeoff between compression and prediction is formalized with mutual information terms, and mutual information is itself a KL divergence (between the joint distribution and the product of the marginals).

In Code

import torch
import torch.nn.functional as F

# F.kl_div expects log-probabilities for the input (first argument) and,
# by default, plain probabilities for the target (log_target=False)
log_q = torch.log(torch.tensor([0.5, 0.5]))  # log Q
p = torch.tensor([0.7, 0.3])                 # P

# F.kl_div(log_q, p) computes sum(p * (log p - log q)) = D_KL(P || Q)
# reduction='sum' gives the total; 'batchmean' averages over the batch
kl = F.kl_div(log_q, p, reduction='sum')
print(kl.item())  # ≈ 0.082 nats, matching the manual calculation

The next lesson connects KL divergence to mutual information — measuring how much two variables tell you about each other.
