Generative Models
Lesson 4 ⏱ 18 min

The ELBO: deriving the VAE objective

Video (coming soon, ~9 min): The ELBO: Why the VAE Loss Is What It Is. A step-by-step derivation of the Evidence Lower BOund using Jensen's inequality, connecting log P(x) to the reconstruction + KL loss, and deriving the closed-form KL for diagonal Gaussians.

Quick refresher

KL divergence

KL(Q || P) = Σ Q(x) log(Q(x)/P(x)) measures how much distribution Q differs from distribution P. It is always ≥ 0, and equals 0 only when Q = P. It is not symmetric: KL(Q||P) ≠ KL(P||Q) in general.

Example

KL(N(1,1) || N(0,1)) = (1² + 1 − log(1) − 1)/2 = 0.5, using the closed form for Gaussians derived later in this lesson.

The off-center Gaussian is 0.5 nats away from the standard normal.
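To make the arithmetic concrete, here is the same computation in Python; the helper name is ours, and the formula is the Gaussian closed form derived below:

```python
import numpy as np

def kl_gauss_vs_std_normal(mu, var):
    # KL( N(mu, var) || N(0, 1) ) in nats, via the closed form derived below
    return 0.5 * (mu**2 + var - np.log(var) - 1)

print(kl_gauss_vs_std_normal(1.0, 1.0))  # 0.5, matching the example above
```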

The Problem: An Intractable Integral

Why this matters: the ELBO is the actual training objective used in every VAE implementation; every PyTorch VAE tutorial's loss function is a form of it. Understanding the derivation tells you exactly what the reconstruction term and the KL term are doing, and why tuning their balance matters so much in practice.

We want to train the VAE by maximum likelihood: maximize $\log P(x)$ for each training example. Using the latent variable formula from lesson 14-1:

$$\log P(x) = \log \int P(x \mid z)\, P(z)\, dz$$

$\log P(x)$: log-likelihood of x under our model, the quantity we want to maximize
$P(x \mid z)$: decoder, the probability of x given latent code z
$P(z)$: prior over latent codes, $\mathcal{N}(0, I)$
$dz$: we must integrate over all possible z values

This integral is intractable. For a 32-dimensional latent space, the integral runs over all of $\mathbb{R}^{32}$; we cannot evaluate it exactly. We need an approximation.
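To see the problem concretely, here is a naive Monte Carlo attempt in numpy. The linear decoder W and the data point x are invented purely for illustration; estimates from prior sampling are noisy and converge slowly, and with a realistic image decoder almost no prior samples land near a z that explains x:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # latent dimensionality, as in the text

# Toy linear decoder, made up for illustration: P(x|z) = N(x; Wz, I)
W = rng.normal(size=(4, D)) / np.sqrt(D)
x = rng.normal(size=4)

# Naive Monte Carlo: P(x) = E_{z ~ P(z)}[P(x|z)], sampling z from the prior
for n in (100, 10_000, 1_000_000):
    zs = rng.normal(size=(n, D))
    log_w = -0.5 * ((x - zs @ W.T) ** 2).sum(axis=1) - 2.0 * np.log(2 * np.pi)
    log_p_x = np.logaddexp.reduce(log_w) - np.log(n)  # log of the sample mean
    print(f"n={n:>9}: log P(x) estimate = {log_p_x:.3f}")
```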

Introducing the Approximate Posterior

The trick: introduce an approximate posterior $q_\phi(z \mid x)$, the encoder's distribution, and multiply and divide inside the integral by it:

$$\log P(x) = \log \int \frac{P(x,z)}{q_\phi(z \mid x)} \cdot q_\phi(z \mid x)\, dz = \log \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{P(x,z)}{q_\phi(z \mid x)}\right]$$

This rewrites the log of an integral as the log of an expectation. Now apply Jensen's inequality: for a concave function $f$, $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$. Since $\log$ is concave:

$$\log \mathbb{E}_{q}\left[\frac{P(x,z)}{q_\phi(z \mid x)}\right] \;\geq\; \mathbb{E}_{q}\left[\log \frac{P(x,z)}{q_\phi(z \mid x)}\right]$$

$\mathbb{E}_{q}$: expectation over z drawn from $q_\phi(z \mid x)$
$\geq$: Jensen gives a lower bound, not an equality

The right-hand side is the ELBO, the Evidence Lower BOund. The gap between $\log P(x)$ and the ELBO is exactly $\text{KL}(q_\phi(z \mid x) \,\|\, P(z \mid x))$, so the bound is tight when the encoder matches the true posterior.
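A quick numerical check of Jensen's inequality for $\log$; the lognormal distribution is an arbitrary choice of positive random variable:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)  # any positive X works

print(np.log(x.mean()))   # log E[X], approx 0.5 here (lognormal mean is e^{1/2})
print(np.log(x).mean())   # E[log X], approx 0.0; strictly smaller, as Jensen predicts
```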

Expanding the ELBO

Factor $P(x, z) = P(x \mid z) \cdot P(z)$ and expand the log:

$$\text{ELBO} = \mathbb{E}_{q}\left[\log P(x \mid z) + \log P(z) - \log q_\phi(z \mid x)\right]$$

$$= \underbrace{\mathbb{E}_{q}\left[\log P(x \mid z)\right]}_{\text{reconstruction}} - \underbrace{\mathbb{E}_{q}\left[\log \frac{q_\phi(z \mid x)}{P(z)}\right]}_{\text{KL divergence}}$$

$$\text{ELBO} = \mathbb{E}_{q_\phi(z \mid x)}\left[\log P(x \mid z)\right] - \text{KL}\left(q_\phi(z \mid x) \,\|\, P(z)\right)$$

$\mathbb{E}_{q}[\log P(x \mid z)]$: expected log-likelihood of x under the decoder; reconstruction quality
$\text{KL}(q_\phi(z \mid x) \,\|\, P(z))$: KL divergence between encoder and prior; regularization

Maximizing the ELBO means maximizing the reconstruction likelihood AND minimizing the KL divergence from the prior. This is exactly the two-term loss from lesson 14-3, now derived from first principles.
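In code, the expectation over $q_\phi$ is typically estimated with a single reparameterized sample per example. A minimal PyTorch sketch, with made-up single-layer stand-ins for the encoder and decoder, and a unit-variance Gaussian decoder so that $\log P(x \mid z)$ reduces to a squared error up to a constant:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D_X, D_Z = 784, 32  # data and latent dimensions (arbitrary choices here)

# Minimal stand-ins for the encoder and decoder; a real VAE uses deeper nets
enc = nn.Linear(D_X, 2 * D_Z)   # outputs [mu, log_var]
dec = nn.Linear(D_Z, D_X)       # outputs the decoder mean x_hat

def elbo_single_sample(x):
    mu, log_var = enc(x).chunk(2, dim=1)
    z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)  # one z ~ q_phi(z|x)
    x_hat = dec(z)
    # log P(x|z) for a unit-variance Gaussian decoder, up to a constant
    log_px_z = -0.5 * (x - x_hat).pow(2).sum(dim=1)
    # closed-form KL(q_phi(z|x) || N(0, I)), derived in the next section
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=1)
    return (log_px_z - kl).mean()

x = torch.rand(8, D_X)          # a dummy batch
print(elbo_single_sample(x))    # maximize this; the training loss is its negative
```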

Closed-Form KL for Diagonal Gaussians

In the VAE, $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and $P(z) = \mathcal{N}(0, I)$. The KL between two Gaussians has a closed form. For a single dimension:

$$\text{KL}\left(\mathcal{N}(\mu_j, \sigma_j^2) \,\|\, \mathcal{N}(0,1)\right) = \frac{1}{2}\left(\mu_j^2 + \sigma_j^2 - \ln \sigma_j^2 - 1\right)$$

$\mu_j$: encoder mean for dimension j
$\sigma_j^2$: encoder variance for dimension j

Sum over all $D$ latent dimensions (they are independent by the diagonal assumption):

$$\text{KL} = \frac{1}{2}\sum_{j=1}^{D}\left(\mu_j^2 + \sigma_j^2 - \ln \sigma_j^2 - 1\right)$$

$D$: latent dimensionality
$\sum_{j=1}^D$: sum over all latent dimensions

Full derivation of the single-dimension case. By definition:

$$\text{KL} = \int q(z) \ln\frac{q(z)}{p(z)}\, dz = \mathbb{E}_q\left[\ln q(z) - \ln p(z)\right]$$

Substituting the Gaussian log-densities $\ln \mathcal{N}(z; \mu, \sigma^2) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(z-\mu)^2}{2\sigma^2}$, the $2\pi$ terms cancel, and the expectation collects $\mathbb{E}[(z-\mu)^2] = \sigma^2$ and $\mathbb{E}[z^2] = \sigma^2 + \mu^2$:

$$\text{KL} = -\frac{1}{2}\ln\sigma^2 - \frac{\sigma^2}{2\sigma^2} + \frac{\sigma^2 + \mu^2}{2} = \frac{1}{2}\left(\mu^2 + \sigma^2 - \ln\sigma^2 - 1\right) \quad \checkmark$$
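A quick sanity check of this closed form against a Monte Carlo estimate, with arbitrary test values of $\mu$ and $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, var = 1.0, 0.5                                 # arbitrary test values
z = rng.normal(mu, np.sqrt(var), size=1_000_000)   # samples from q

# Monte Carlo estimate of E_q[ln q(z) - ln p(z)]
log_q = -0.5 * np.log(2 * np.pi * var) - (z - mu) ** 2 / (2 * var)
log_p = -0.5 * np.log(2 * np.pi) - z ** 2 / 2
print((log_q - log_p).mean())                      # approx 0.597

# Closed form: (mu^2 + var - ln var - 1) / 2
print(0.5 * (mu**2 + var - np.log(var) - 1))       # 0.5966...
```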

Numerical example. Two dimensions, $\mu = [1.0, 0.0]$, $\sigma^2 = [0.5, 2.0]$:

Dim | $\mu_j^2$ | $\sigma_j^2$ | $-\ln\sigma_j^2$ | $-1$ | Sum
1 | 1.00 | 0.50 | 0.693 | −1 | 1.193
2 | 0.00 | 2.00 | −0.693 | −1 | 0.307

$$\text{KL} = \frac{1}{2}(1.193 + 0.307) = \frac{1}{2}(1.500) = 0.75$$

Dimension 1 contributes more because its mean is pulled away from zero.
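The same numbers in PyTorch, cross-checked against torch.distributions (the helper name kl_to_std_normal is ours, not a library function):

```python
import torch
from torch.distributions import Normal, kl_divergence

def kl_to_std_normal(mu, log_var):
    # 1/2 * sum_j (mu_j^2 + sigma_j^2 - ln sigma_j^2 - 1)
    return 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=-1)

mu = torch.tensor([1.0, 0.0])
var = torch.tensor([0.5, 2.0])
print(kl_to_std_normal(mu, var.log()))   # tensor(0.7500)

# Cross-check: per-dimension KL from torch.distributions, then sum
q = Normal(mu, var.sqrt())
p = Normal(torch.zeros(2), torch.ones(2))
print(kl_divergence(q, p).sum())         # tensor(0.7500)
```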

Putting It Together: The Final VAE Objective

For a batch of $n$ examples, the loss to minimize is:

$$L = \frac{1}{n}\sum_{i=1}^n \left[\underbrace{\left\|x^{(i)} - \hat{x}^{(i)}\right\|^2}_{\text{reconstruction}} + \underbrace{\frac{1}{2}\sum_{j=1}^{D}\left(\mu_j^{(i)\,2} + \sigma_j^{(i)\,2} - \ln\sigma_j^{(i)\,2} - 1\right)}_{\text{KL}}\right]$$

$\hat{x}^{(i)}$: reconstruction of example i
$\mu^{(i)},\, \sigma^{2(i)}$: encoder outputs for example i
$D$: latent dimensionality

Every symbol is now defined from first principles. The reconstruction term comes from maximizing $\mathbb{E}[\log P(x \mid z)]$ under a Gaussian decoder. The KL term comes from the Jensen's inequality derivation. Together they are the ELBO.
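As a PyTorch sketch of the whole loss; the function name is ours, and the squared-error reconstruction assumes the unit-variance Gaussian decoder mentioned above:

```python
import torch

def vae_loss(x, x_hat, mu, log_var):
    """Negative ELBO, averaged over the batch (the loss L above)."""
    # Reconstruction term: ||x - x_hat||^2 per example
    recon = (x - x_hat).pow(2).flatten(1).sum(dim=1)
    # KL term: 1/2 * sum_j (mu_j^2 + sigma_j^2 - ln sigma_j^2 - 1)
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=1)
    return (recon + kl).mean()
```

Many implementations also weight the KL term by a factor β: β = 1 recovers the ELBO exactly, while other values trade reconstruction sharpness against latent regularity, which is the balance the note at the top of this lesson refers to.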

Interactive example (coming soon): watch the reconstruction loss and the KL term trade off as you adjust the encoder variance slider.

Quiz

1 / 3

Jensen's inequality states that for a concave function f and random variable X: f(E[X]) ≥ E[f(X)]. The log function is concave. Which step of the ELBO derivation uses this?