Step-by-step derivation of the Evidence Lower BOund using Jensen's inequality, connecting log P(x) to the reconstruction + KL loss, and deriving the closed-form KL for diagonal Gaussians.
⏱ ~9 min
🧮
Quick refresher
KL divergence
KL(Q || P) = Σ Q(x) log(Q(x)/P(x)) measures how much distribution Q differs from distribution P. It is always ≥ 0, and equals 0 only when Q = P. It is not symmetric: KL(Q||P) ≠ KL(P||Q) in general.
For example, a unit-variance Gaussian shifted to mean μ = 1 is KL = μ²/2 = 0.5 nats away from the standard normal N(0, 1).
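To make that number concrete, here is a minimal Python sketch using the standard closed form for the KL between two univariate Gaussians (the kl_gaussians helper and the parameter choices are illustrative, not part of this lesson). It reproduces the 0.5-nat figure and shows the asymmetry mentioned in the refresher.

```python
import numpy as np

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) in nats, via the closed form."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

# A unit-variance Gaussian shifted to mean 1, versus the standard normal N(0, 1):
print(kl_gaussians(1.0, 1.0, 0.0, 1.0))   # 0.5 nats

# KL is not symmetric: swap the roles of the two distributions and the value changes.
print(kl_gaussians(0.0, 2.0, 0.0, 1.0))   # ~0.807 nats
print(kl_gaussians(0.0, 1.0, 0.0, 2.0))   # ~0.318 nats
```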
The Problem: An Intractable Integral
The ELBO is the actual training objective used in every VAE implementation: every PyTorch VAE tutorial's loss function is a form of the ELBO. Understanding the derivation tells you exactly what the reconstruction term and the KL term are doing, and why tuning their balance matters so much in practice.
We want to train the VAE by maximum likelihood: maximize log P(x) for each training example. Using the latent variable formula from lesson 14-1:
log P(x) = log ∫ P(x∣z) P(z) dz
log P(x)
log-likelihood of x under our model — what we want to maximize
P(x∣z)
decoder: probability of x given latent code z
P(z)
prior over latent codes: N(0,I)
dz
we must integrate over all possible z values
This integral is intractable. For a 32-dimensional latent space it runs over all of ℝ³², and we cannot evaluate it exactly. We need an approximation.
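To see the problem concretely, here is a toy Python sketch: the linear "decoder" W, the dimensions, and the sample count are all hypothetical, chosen only for illustration. It estimates log P(x) by naive Monte Carlo over prior samples z ~ N(0, I); almost every sample decodes to something far from x, so the estimate rests on a handful of lucky draws and is extremely noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 32, 784

# Hypothetical linear "decoder": P(x|z) = N(x; Wz, I), a stand-in for a real network.
W = rng.normal(size=(data_dim, latent_dim)) / np.sqrt(latent_dim)

# One data point generated from the model itself, so P(x) is not absurdly small.
z_true = rng.normal(size=latent_dim)
x = W @ z_true + rng.normal(size=data_dim)

# Naive Monte Carlo: log P(x) = log E_{z~P(z)}[P(x|z)] ≈ logsumexp_i log P(x|z_i) - log N
n_samples = 10_000
zs = rng.normal(size=(n_samples, latent_dim))     # z ~ N(0, I), the prior
residuals = x - zs @ W.T                          # x - Wz for every sampled z
log_ps = -0.5 * ((residuals ** 2).sum(axis=1) + data_dim * np.log(2 * np.pi))

log_p_estimate = np.logaddexp.reduce(log_ps) - np.log(n_samples)
print(log_p_estimate)
print(np.sort(log_ps)[-5:] - log_ps.max())  # a handful of samples carry nearly all the weight
```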
Introducing the Approximate Posterior
The trick: introduce an approximate posterior q_φ(z∣x) — the encoder's distribution. We multiply and divide inside the integral by q_φ(z∣x):
log P(x) = log ∫ q_φ(z∣x) · [ P(x∣z) P(z) / q_φ(z∣x) ] dz = log E_q[ P(x, z) / q_φ(z∣x) ]
where P(x, z) = P(x∣z) P(z) is the joint distribution.
This has rewritten the log of an integral as the log of an expectation. Now apply Jensen's inequality: for a concave function f, f(E[X]) ≥ E[f(X)]. Since log is concave (a quick numeric check of this appears after the definitions below):
log E_q[ P(x, z) / q_φ(z∣x) ] ≥ E_q[ log( P(x, z) / q_φ(z∣x) ) ]
E_q
expectation over z drawn from q_φ(z|x)
≥
Jensen gives a lower bound, not an equality
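As a sanity check on the direction of the bound, here is a tiny numeric experiment; the log-normal variable is just an arbitrary positive random variable, not anything VAE-specific:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)  # any positive random variable works

print(np.log(x.mean()))   # log E[X]  -> about 0.5
print(np.log(x).mean())   # E[log X] -> about 0.0, strictly smaller, as Jensen predicts
```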
The right-hand side is the ELBO — Evidence Lower BOund. Expanding P(x, z) = P(x∣z) P(z) inside the log and splitting the expectation gives the familiar two-term form:
ELBO = E_q[ log P(x∣z) ] − KL( q_φ(z∣x) ∥ P(z) )
E_q[ log P(x∣z) ]
expected log-likelihood of x under the decoder — reconstruction quality
KL( q_φ(z∣x) ∥ P(z) )
KL divergence between encoder and prior — regularization
Maximizing the ELBO therefore means maximizing the reconstruction likelihood while keeping the encoder's KL divergence from the prior small. This is exactly the two-term loss from lesson 14-3, now derived from first principles.
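In code, the loss you minimize is the negative ELBO. Below is a minimal PyTorch-style sketch, assuming a Bernoulli decoder (so the reconstruction term becomes binary cross-entropy) and a diagonal-Gaussian encoder; the names recon_x, mu, and logvar are illustrative rather than taken from any particular tutorial.

```python
import torch
import torch.nn.functional as F

def negative_elbo(recon_x, x, mu, logvar):
    """-ELBO for one batch: reconstruction term + KL term.

    recon_x    : decoder output (Bernoulli means), same shape as x
    x          : input batch with values in [0, 1]
    mu, logvar : encoder outputs defining q_phi(z|x) = N(mu, diag(exp(logvar)))
    """
    # Reconstruction term: -E_q[log P(x|z)], estimated from the single z sample
    # that produced recon_x, under a Bernoulli decoder.
    reconstruction = F.binary_cross_entropy(recon_x, x, reduction="sum")

    # KL term: KL(q_phi(z|x) || N(0, I)), using the closed form derived in the
    # next section: 1/2 * sum(mu^2 + sigma^2 - log sigma^2 - 1).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return reconstruction + kl
```

Minimizing this quantity over the training set is the same as maximizing the ELBO; the relative size of the two terms is exactly the reconstruction/regularization balance mentioned at the top of the lesson.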
Closed-Form KL for Diagonal Gaussians
In the VAE, q_φ(z∣x) = N(μ, diag(σ²)) and P(z) = N(0, I). The KL divergence between two Gaussians has a closed form. For a single dimension:
KL( N(μ, σ²) ∥ N(0, 1) ) = ½ ( μ² + σ² − log σ² − 1 )
Because both distributions factorize across dimensions, the full KL is the sum of this expression over the latent dimensions:
KL( q_φ(z∣x) ∥ P(z) ) = ½ Σ_j ( μ_j² + σ_j² − log σ_j² − 1 )
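A quick way to trust the formula is to check it numerically. The sketch below (the mu and logvar values are arbitrary) compares the closed form against torch.distributions, which treats the diagonal Gaussian as a product of independent one-dimensional Normals:

```python
import torch
from torch.distributions import Normal, kl_divergence

mu = torch.tensor([0.3, -1.2, 0.0])        # illustrative encoder means
logvar = torch.tensor([0.1, -0.5, 0.8])    # illustrative encoder log-variances

# Closed form: 1/2 * sum_j (mu_j^2 + sigma_j^2 - log sigma_j^2 - 1)
closed_form = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1)

# Cross-check: a diagonal Gaussian is a product of independent 1-D Normals,
# so the total KL is the sum of the per-dimension KLs.
q = Normal(mu, torch.exp(0.5 * logvar))    # scale = sigma, not sigma^2
p = Normal(torch.zeros(3), torch.ones(3))
reference = kl_divergence(q, p).sum()

print(closed_form.item(), reference.item())  # agree up to floating-point precision
```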
Every symbol is now defined from first principles. The reconstruction term comes from maximizing E_q[ log P(x∣z) ] under a Gaussian decoder; the KL term falls out of the same Jensen's inequality derivation. Together they are the ELBO.
⚙️
Interactive example
Watch reconstruction loss and KL term trade off as you adjust the encoder variance slider
Coming soon
Quiz
1 / 3
Jensen's inequality states that for a concave function f and random variable X: f(E[X]) ≥ E[f(X)]. The log function is concave. Which step of the ELBO derivation uses this?