The Setup
You have a true distribution $P$, but you are using an approximate distribution $Q$ to reason about the world.
How much are you paying for this approximation? How many extra bits do you waste every time you use $Q$ when the truth is $P$?
The KL divergence answers this precisely.
If you have two probability distributions — what your model predicts versus what actually happens — how do you measure how wrong the model is? Subtracting them misses how the mismatch compounds across low-probability events. You need a number that captures the full shape of the disagreement. KL divergence is that number, and it turns out to be exactly what you're minimizing every time you train a model with cross-entropy loss.
Definition
The KL divergence from $Q$ to $P$ is:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

- $D_{\mathrm{KL}}(P \,\|\, Q)$: KL divergence from Q to P — read as 'KL of P from Q'
- $P(x)$: probability under the true distribution P
- $Q(x)$: probability under the approximate distribution Q
- $\sum_{x}$: sum over all values x where P(x) > 0
Expanding using $\log \frac{P(x)}{Q(x)} = \log P(x) - \log Q(x)$:

$$D_{\mathrm{KL}}(P \,\|\, Q) = -\sum_x P(x) \log Q(x) - \left(-\sum_x P(x) \log P(x)\right) = H(P, Q) - H(P)$$

- $H(P) = -\sum_x P(x) \log P(x)$: entropy of P
- $H(P, Q) = -\sum_x P(x) \log Q(x)$: cross-entropy of P and Q

where $H(P)$ is the average code length of the optimal code for $P$.
So: $D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P)$. KL divergence is the extra cost beyond what you'd pay if you used the optimal code.
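To see the decomposition numerically, here is a minimal sketch with an arbitrary pair of three-outcome distributions (the values are illustrative, not from the text above):

```python
import torch

p = torch.tensor([0.6, 0.3, 0.1])   # "true" distribution P (illustrative values)
q = torch.tensor([0.4, 0.4, 0.2])   # "model" distribution Q (illustrative values)

entropy_p = -(p * p.log()).sum()        # H(P)
cross_entropy = -(p * q.log()).sum()    # H(P, Q)
kl_direct = (p * (p / q).log()).sum()   # sum_x P(x) log(P(x) / Q(x))

print(kl_direct.item())                    # direct definition
print((cross_entropy - entropy_p).item())  # H(P, Q) - H(P): same number
```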
KL ≥ 0 Always (via Jensen's Inequality)
This is a key result. We need Jensen's inequality: for any convex function $f$, $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$.
Since $-\log$ is convex:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_P\left[-\log \frac{Q(x)}{P(x)}\right] \ge -\log \mathbb{E}_P\left[\frac{Q(x)}{P(x)}\right] = -\log \sum_x P(x)\,\frac{Q(x)}{P(x)} = -\log \sum_x Q(x) \ge -\log 1 = 0$$

- $\mathbb{E}_P[\cdot]$: expectation taken under distribution P

Therefore $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$ always, with equality if and only if $P(x) = Q(x)$ everywhere. This is sometimes called Gibbs' inequality.
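As a quick empirical sanity check of this result (an illustrative sketch; the distributions are just random softmax vectors):

```python
import torch

torch.manual_seed(0)
# Gibbs' inequality in action: KL between random distribution pairs is never negative.
for _ in range(5):
    p = torch.softmax(torch.randn(10), dim=0)  # random "true" distribution
    q = torch.softmax(torch.randn(10), dim=0)  # random "model" distribution
    kl = (p * (p / q).log()).sum()
    print(kl.item() >= 0)  # prints True every time
```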
Worked Numerical Example
Suppose there are only two outcomes (like a biased coin), and:
| Outcome | $P$ (true) | $Q$ (model) |
|---|---|---|
| Heads | 0.7 | 0.5 |
| Tails | 0.3 | 0.5 |
Compute $D_{\mathrm{KL}}(P \,\|\, Q)$ (using natural log):

- $0.7 \ln \frac{0.7}{0.5} \approx 0.7 \times 0.336 \approx 0.235$: contribution from Heads
- $0.3 \ln \frac{0.3}{0.5} \approx 0.3 \times (-0.511) \approx -0.153$: contribution from Tails

Adding them up: $D_{\mathrm{KL}}(P \,\|\, Q) \approx 0.235 - 0.153 = 0.082$ nats.

Now compute $D_{\mathrm{KL}}(Q \,\|\, P)$ (reversed):

$D_{\mathrm{KL}}(Q \,\|\, P) = 0.5 \ln \frac{0.5}{0.7} + 0.5 \ln \frac{0.5}{0.3} \approx -0.168 + 0.255 = 0.087$ nats.
The two numbers differ ($0.082$ vs $0.087$ nats): KL divergence is not symmetric, and it is not a true distance metric.
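To double-check both directions of the coin example, here is a small sketch using plain tensor arithmetic (the library form appears in the code section below):

```python
import torch

p = torch.tensor([0.7, 0.3])  # true coin
q = torch.tensor([0.5, 0.5])  # model coin

kl_pq = (p * (p / q).log()).sum()  # forward: D_KL(P || Q)
kl_qp = (q * (q / p).log()).sum()  # reverse: D_KL(Q || P)
print(kl_pq.item(), kl_qp.item())  # ≈ 0.082 and ≈ 0.087 nats
```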
Why Asymmetry Matters: Forward vs Reverse KL
The asymmetry has deep consequences in practice. Minimizing the forward KL $D_{\mathrm{KL}}(P \,\|\, Q)$ over $Q$ punishes $Q$ for assigning too little probability to outcomes that $P$ cares about, so the fitted $Q$ spreads out to cover all of $P$'s modes (mass-covering). Minimizing the reverse KL $D_{\mathrm{KL}}(Q \,\|\, P)$ punishes $Q$ for putting probability where $P$ has almost none, so the fitted $Q$ tends to lock onto a single mode (mode-seeking). The sketch below makes this concrete.
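Here is a minimal, illustrative sketch of that behavior (the grid, the bimodal target, and the single-Gaussian family are assumptions made for this example, not part of the lesson): it fits one Gaussian $Q$ to a two-mode $P$ by gradient descent, once under each direction of the KL.

```python
import torch

# Discretize the problem on a grid so both KLs are plain sums.
x = torch.linspace(-4.0, 4.0, 201, dtype=torch.float64)

def normalize(w):
    return w / w.sum()

# Bimodal "true" distribution P: equal-weight bumps at -2 and +2.
# The tiny floor avoids log(0) at the edges of the grid.
p = normalize(torch.exp(-(x + 2) ** 2 / 0.5) + torch.exp(-(x - 2) ** 2 / 0.5) + 1e-12)

def fit(reverse):
    mu = torch.tensor(0.5, dtype=torch.float64, requires_grad=True)        # offset init breaks the symmetry
    log_sigma = torch.tensor(0.0, dtype=torch.float64, requires_grad=True)
    opt = torch.optim.Adam([mu, log_sigma], lr=0.05)
    for _ in range(2000):
        q = normalize(torch.exp(-(x - mu) ** 2 / (2 * torch.exp(2 * log_sigma))) + 1e-12)
        if reverse:
            loss = (q * (q / p).log()).sum()   # D_KL(Q || P): mode-seeking
        else:
            loss = (p * (p / q).log()).sum()   # D_KL(P || Q): mass-covering
        opt.zero_grad()
        loss.backward()
        opt.step()
    return round(mu.item(), 2), round(torch.exp(log_sigma).item(), 2)

print("forward KL fit (mu, sigma):", fit(reverse=False))  # broad Q straddling both modes
print("reverse KL fit (mu, sigma):", fit(reverse=True))   # narrow Q locked onto one mode
```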
KL Divergence in the Wild
Variational Autoencoders (VAEs): the ELBO loss contains $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))$ — it forces the approximate posterior to stay close to the prior (a closed-form sketch of this term appears at the end of this section).
RLHF (Reinforcement Learning from Human Feedback): the fine-tuning objective contains a KL penalty to prevent the model from drifting too far from the reference policy.
Information bottleneck: the tradeoff between compression and prediction is formalized as minimizing KL divergence between the learned representation and a target distribution.
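For the VAE term specifically, the KL has a well-known closed form when the approximate posterior is a diagonal Gaussian and the prior is a standard normal. A minimal sketch (tensor shapes and names here are illustrative):

```python
import torch

def gaussian_kl_to_standard_normal(mu, log_var):
    # D_KL( N(mu, sigma^2) || N(0, 1) ) = 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1),
    # summed over the latent dimensions.
    return 0.5 * torch.sum(mu ** 2 + log_var.exp() - log_var - 1, dim=-1)

mu = torch.zeros(4, 8)        # batch of 4, latent dimension 8 (illustrative shapes)
log_var = torch.zeros(4, 8)   # sigma = 1 everywhere
print(gaussian_kl_to_standard_normal(mu, log_var))  # all zeros: q already equals the prior
```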
In Code
```python
import torch
import torch.nn.functional as F

# KL divergence: F.kl_div expects log-probabilities for input
log_q = torch.log(torch.tensor([0.5, 0.5]))
p = torch.tensor([0.7, 0.3])

# F.kl_div(log_q, p) computes sum(p * (log_p - log_q))
# reduction='sum' or 'batchmean' depending on use
kl = F.kl_div(log_q, p, reduction='sum')  # ≈ 0.082 nats, matching our manual calculation
```
The next lesson connects KL divergence to mutual information — measuring how much two variables tell you about each other.