Generative Models
Lesson 4 ⏱ 18 min

The ELBO: deriving the VAE objective

Video (coming soon, ~9 min): The ELBO: Why the VAE Loss Is What It Is. A step-by-step derivation of the Evidence Lower BOund using Jensen's inequality, connecting log P(x) to the reconstruction + KL loss, and deriving the closed-form KL for diagonal Gaussians.

Quick refresher

KL divergence

KL(Q || P) = Σ Q(x) log(Q(x)/P(x)) measures how much distribution Q differs from distribution P. It is always ≥ 0, and equals 0 only when Q = P. It is not symmetric: KL(Q||P) ≠ KL(P||Q) in general.

Example

KL(N(1,1) || N(0,1)) = (1² + 1 − log(1) − 1)/2 = 0.5, using the closed form for Gaussians derived later in this lesson.

The off-center Gaussian is 0.5 nats away from the standard normal.
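To make the arithmetic concrete, here is the same computation in Python; the helper name is ours, and the formula is the Gaussian closed form derived below:

```python
import numpy as np

def kl_gauss_vs_std_normal(mu, var):
    # KL( N(mu, var) || N(0, 1) ) in nats, via the closed form derived below
    return 0.5 * (mu**2 + var - np.log(var) - 1)

print(kl_gauss_vs_std_normal(1.0, 1.0))  # 0.5, matching the example above
```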

The Problem: An Intractable Integral

Why this matters: the ELBO is the actual training objective used in every VAE implementation; every PyTorch VAE tutorial's loss function is a form of it. Understanding the derivation tells you exactly what the reconstruction term and the KL term are doing, and why tuning their balance matters so much in practice.

We want to train the VAE by maximum likelihood: maximize $\log P(x)$ for each training example. Using the latent variable formula from lesson 14-1:

$$\log P(x) = \log \int P(x \mid z)\, P(z)\, dz$$

$\log P(x)$: log-likelihood of x under our model, the quantity we want to maximize
$P(x \mid z)$: decoder, the probability of x given latent code z
$P(z)$: prior over latent codes, $\mathcal{N}(0, I)$
$dz$: we must integrate over all possible z values

This integral is intractable. For a 32-dimensional latent space, the integral runs over all of $\mathbb{R}^{32}$; we cannot evaluate it exactly. We need an approximation.
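To see the problem concretely, here is a naive Monte Carlo attempt in numpy. The linear decoder W and the data point x are invented purely for illustration; estimates from prior sampling are noisy and converge slowly, and with a realistic image decoder almost no prior samples land near a z that explains x:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # latent dimensionality, as in the text

# Toy linear decoder, made up for illustration: P(x|z) = N(x; Wz, I)
W = rng.normal(size=(4, D)) / np.sqrt(D)
x = rng.normal(size=4)

# Naive Monte Carlo: P(x) = E_{z ~ P(z)}[P(x|z)], sampling z from the prior
for n in (100, 10_000, 1_000_000):
    zs = rng.normal(size=(n, D))
    log_w = -0.5 * ((x - zs @ W.T) ** 2).sum(axis=1) - 2.0 * np.log(2 * np.pi)
    log_p_x = np.logaddexp.reduce(log_w) - np.log(n)  # log of the sample mean
    print(f"n={n:>9}: log P(x) estimate = {log_p_x:.3f}")
```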

Introducing the Approximate Posterior

The trick: introduce an approximate posterior $q_\phi(z \mid x)$, the encoder's distribution, and multiply and divide inside the integral by it:

$$\log P(x) = \log \int \frac{P(x,z)}{q_\phi(z \mid x)} \cdot q_\phi(z \mid x)\, dz = \log \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{P(x,z)}{q_\phi(z \mid x)}\right]$$

This rewrites the log of an integral as the log of an expectation. Now apply Jensen's inequality: for a concave function $f$, $f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]$. Since $\log$ is concave:

$$\log \mathbb{E}_{q}\left[\frac{P(x,z)}{q_\phi(z \mid x)}\right] \;\geq\; \mathbb{E}_{q}\left[\log \frac{P(x,z)}{q_\phi(z \mid x)}\right]$$

$\mathbb{E}_{q}$: expectation over z drawn from $q_\phi(z \mid x)$
$\geq$: Jensen gives a lower bound, not an equality

The right-hand side is the ELBO, the Evidence Lower BOund. The gap between $\log P(x)$ and the ELBO is exactly $\text{KL}(q_\phi(z \mid x) \,\|\, P(z \mid x))$, so the bound is tight when the encoder matches the true posterior.
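A quick numerical check of Jensen's inequality for $\log$; the lognormal distribution is an arbitrary choice of positive random variable:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)  # any positive X works

print(np.log(x.mean()))   # log E[X], approx 0.5 here (lognormal mean is e^{1/2})
print(np.log(x).mean())   # E[log X], approx 0.0; strictly smaller, as Jensen predicts
```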

Expanding the ELBO

Factor $P(x, z) = P(x \mid z) \cdot P(z)$ and expand the log:

$$\text{ELBO} = \mathbb{E}_{q}\left[\log P(x \mid z) + \log P(z) - \log q_\phi(z \mid x)\right]$$

$$= \underbrace{\mathbb{E}_{q}\left[\log P(x \mid z)\right]}_{\text{reconstruction}} - \underbrace{\mathbb{E}_{q}\left[\log \frac{q_\phi(z \mid x)}{P(z)}\right]}_{\text{KL divergence}}$$

$$\text{ELBO} = \mathbb{E}_{q_\phi(z \mid x)}\left[\log P(x \mid z)\right] - \text{KL}\left(q_\phi(z \mid x) \,\|\, P(z)\right)$$

$\mathbb{E}_{q}[\log P(x \mid z)]$: expected log-likelihood of x under the decoder; reconstruction quality
$\text{KL}(q_\phi(z \mid x) \,\|\, P(z))$: KL divergence between encoder and prior; regularization

Maximizing the ELBO means maximizing the reconstruction likelihood AND minimizing the KL divergence from the prior. This is exactly the two-term loss from lesson 14-3, now derived from first principles.
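In code, the expectation over $q_\phi$ is typically estimated with a single reparameterized sample per example. A minimal PyTorch sketch, with made-up single-layer stand-ins for the encoder and decoder, and a unit-variance Gaussian decoder so that $\log P(x \mid z)$ reduces to a squared error up to a constant:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D_X, D_Z = 784, 32  # data and latent dimensions (arbitrary choices here)

# Minimal stand-ins for the encoder and decoder; a real VAE uses deeper nets
enc = nn.Linear(D_X, 2 * D_Z)   # outputs [mu, log_var]
dec = nn.Linear(D_Z, D_X)       # outputs the decoder mean x_hat

def elbo_single_sample(x):
    mu, log_var = enc(x).chunk(2, dim=1)
    z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)  # one z ~ q_phi(z|x)
    x_hat = dec(z)
    # log P(x|z) for a unit-variance Gaussian decoder, up to a constant
    log_px_z = -0.5 * (x - x_hat).pow(2).sum(dim=1)
    # closed-form KL(q_phi(z|x) || N(0, I)), derived in the next section
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=1)
    return (log_px_z - kl).mean()

x = torch.rand(8, D_X)          # a dummy batch
print(elbo_single_sample(x))    # maximize this; the training loss is its negative
```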

Closed-Form KL for Diagonal Gaussians

In the VAE, $q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and $P(z) = \mathcal{N}(0, I)$. The KL between two Gaussians has a closed form. For a single dimension:

$$\text{KL}\left(\mathcal{N}(\mu_j, \sigma_j^2) \,\|\, \mathcal{N}(0,1)\right) = \frac{1}{2}\left(\mu_j^2 + \sigma_j^2 - \ln \sigma_j^2 - 1\right)$$

$\mu_j$: encoder mean for dimension j
$\sigma_j^2$: encoder variance for dimension j

Sum over all $D$ latent dimensions (they are independent by the diagonal assumption):

$$\text{KL} = \frac{1}{2}\sum_{j=1}^{D}\left(\mu_j^2 + \sigma_j^2 - \ln \sigma_j^2 - 1\right)$$

$D$: latent dimensionality
$\sum_{j=1}^D$: sum over all latent dimensions

Full derivation of the single-dimension case. By definition:

$$\text{KL} = \int q(z) \ln\frac{q(z)}{p(z)}\, dz = \mathbb{E}_q\left[\ln q(z) - \ln p(z)\right]$$

Substituting the Gaussian log-densities $\ln \mathcal{N}(z; \mu, \sigma^2) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(z-\mu)^2}{2\sigma^2}$, the $2\pi$ terms cancel, and the expectation collects $\mathbb{E}[(z-\mu)^2] = \sigma^2$ and $\mathbb{E}[z^2] = \sigma^2 + \mu^2$:

$$\text{KL} = -\frac{1}{2}\ln\sigma^2 - \frac{\sigma^2}{2\sigma^2} + \frac{\sigma^2 + \mu^2}{2} = \frac{1}{2}\left(\mu^2 + \sigma^2 - \ln\sigma^2 - 1\right) \quad \checkmark$$
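A quick sanity check of this closed form against a Monte Carlo estimate, with arbitrary test values of $\mu$ and $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, var = 1.0, 0.5                                 # arbitrary test values
z = rng.normal(mu, np.sqrt(var), size=1_000_000)   # samples from q

# Monte Carlo estimate of E_q[ln q(z) - ln p(z)]
log_q = -0.5 * np.log(2 * np.pi * var) - (z - mu) ** 2 / (2 * var)
log_p = -0.5 * np.log(2 * np.pi) - z ** 2 / 2
print((log_q - log_p).mean())                      # approx 0.597

# Closed form: (mu^2 + var - ln var - 1) / 2
print(0.5 * (mu**2 + var - np.log(var) - 1))       # 0.5966...
```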

Numerical example. Two dimensions, $\mu = [1.0, 0.0]$, $\sigma^2 = [0.5, 2.0]$:

Dim | $\mu_j^2$ | $\sigma_j^2$ | $-\ln\sigma_j^2$ | $-1$ | Sum
1 | 1.00 | 0.50 | 0.693 | −1 | 1.193
2 | 0.00 | 2.00 | −0.693 | −1 | 0.307

$$\text{KL} = \frac{1}{2}(1.193 + 0.307) = \frac{1}{2}(1.500) = 0.75$$

Dimension 1 contributes more because its mean is pulled away from zero.
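The same numbers in PyTorch, cross-checked against torch.distributions (the helper name kl_to_std_normal is ours, not a library function):

```python
import torch
from torch.distributions import Normal, kl_divergence

def kl_to_std_normal(mu, log_var):
    # 1/2 * sum_j (mu_j^2 + sigma_j^2 - ln sigma_j^2 - 1)
    return 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=-1)

mu = torch.tensor([1.0, 0.0])
var = torch.tensor([0.5, 2.0])
print(kl_to_std_normal(mu, var.log()))   # tensor(0.7500)

# Cross-check: per-dimension KL from torch.distributions, then sum
q = Normal(mu, var.sqrt())
p = Normal(torch.zeros(2), torch.ones(2))
print(kl_divergence(q, p).sum())         # tensor(0.7500)
```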

Putting It Together: The Final VAE Objective

For a batch of $n$ examples, the loss to minimize is:

$$L = \frac{1}{n}\sum_{i=1}^n \left[\underbrace{\left\|x^{(i)} - \hat{x}^{(i)}\right\|^2}_{\text{reconstruction}} + \underbrace{\frac{1}{2}\sum_{j=1}^{D}\left(\mu_j^{(i)\,2} + \sigma_j^{(i)\,2} - \ln\sigma_j^{(i)\,2} - 1\right)}_{\text{KL}}\right]$$

$\hat{x}^{(i)}$: reconstruction of example i
$\mu^{(i)},\, \sigma^{2(i)}$: encoder outputs for example i
$D$: latent dimensionality

Every symbol is now defined from first principles. The reconstruction term comes from maximizing $\mathbb{E}[\log P(x \mid z)]$ under a Gaussian decoder. The KL term comes from the Jensen's inequality derivation. Together they are the ELBO.
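As a PyTorch sketch of the whole loss; the function name is ours, and the squared-error reconstruction assumes the unit-variance Gaussian decoder mentioned above:

```python
import torch

def vae_loss(x, x_hat, mu, log_var):
    """Negative ELBO, averaged over the batch (the loss L above)."""
    # Reconstruction term: ||x - x_hat||^2 per example
    recon = (x - x_hat).pow(2).flatten(1).sum(dim=1)
    # KL term: 1/2 * sum_j (mu_j^2 + sigma_j^2 - ln sigma_j^2 - 1)
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=1)
    return (recon + kl).mean()
```

Many implementations also weight the KL term by a factor β: β = 1 recovers the ELBO exactly, while other values trade reconstruction sharpness against latent regularity, which is the balance the note at the top of this lesson refers to.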

Interactive example (coming soon): watch the reconstruction loss and the KL term trade off as you adjust the encoder variance slider.

Quiz

1 / 3

Jensen's inequality states that for a concave function f and random variable X: f(E[X]) ≥ E[f(X)]. The log function is concave. Which step of the ELBO derivation uses this?