Generative Models
Lesson 3 ⏱ 14 min

Variational autoencoders: stochastic encoders

VAEs: From Bottleneck Points to Bottleneck Distributions

How replacing the deterministic encoder with a distribution encoder solves the structured latent space problem, and the full VAE training and generation pipelines.

⏱ ~8 min

🧮 Quick refresher

Gaussian (normal) distribution

N(μ, σ²) is the bell-curve distribution with mean μ and variance σ². About 68% of samples fall within one standard deviation of μ. The standard normal N(0,1) has mean 0 and variance 1.

Example

If z ~ N(2, 9), then z has mean 2 and standard deviation 3.

A sample of z = 5 is one standard deviation above the mean.
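To make these numbers concrete, here is a quick NumPy check of the refresher (the seed and sample count are arbitrary, chosen just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many samples from N(2, 9): mean 2, variance 9, so standard deviation 3.
samples = rng.normal(loc=2.0, scale=3.0, size=100_000)

print(samples.mean())  # close to 2
print(samples.std())   # close to 3
# Fraction of samples within one standard deviation of the mean -- roughly 0.68.
print(np.mean(np.abs(samples - 2.0) < 3.0))
```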

The Core Insight

The standard autoencoder maps each input to a single point $z$. The VAE maps each input to a region — a probability distribution over the latent space. Concretely:

  • Standard AE encoder: $f_\theta(x) = z$ — one specific point
  • VAE encoder: $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x),\, \sigma^2_\phi(x) \cdot I)$ — a Gaussian ball

Variational autoencoders are how generative AI learned to create smooth, controllable representations. They are the mathematical foundation for latent space manipulation — the technique behind face morphing, style transfer, and controlled image generation. Understanding VAEs is essential for understanding diffusion models and modern image synthesis.

The encoder network now outputs two vectors: $\mu$ (the center) and $\sigma^2$ (the spread). In practice, the network outputs $\log \sigma^2$ rather than $\sigma^2$ directly — this keeps the parameter unconstrained (outputting $\sigma^2$ itself would require enforcing the constraint $\sigma^2 > 0$).
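As a rough sketch of what this looks like in code — a minimal PyTorch-style encoder; the layer sizes and names (`in_dim`, `hidden_dim`, `latent_dim`) are illustrative, not prescribed by the lesson:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Encoder q_phi(z|x): maps x to the parameters of a diagonal Gaussian."""
    def __init__(self, in_dim=784, hidden_dim=400, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.mu_head = nn.Linear(hidden_dim, latent_dim)      # mu: the center
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)  # log sigma^2: unconstrained spread

    def forward(self, x):
        h = self.backbone(x)
        return self.mu_head(h), self.logvar_head(h)
```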

The Full VAE Architecture

During training:

  1. Feed input $x$ to the encoder → get $\mu, \log \sigma^2$
  2. Sample $z \sim q_\phi(z \mid x) = \mathcal{N}(\mu, \sigma^2 I)$
  3. Feed $z$ to the decoder → get reconstruction $\hat{x} = g_\theta(z)$
  4. Compute loss: reconstruction error + KL penalty
  5. Backpropagate through decoder and encoder (we cover how to backpropagate through step 2 in lesson 14-5)
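The five steps above, condensed into one PyTorch-style training step. This is a sketch rather than the lesson's reference implementation: it assumes an encoder/decoder pair like the one sketched earlier, and step 2 uses the reparameterization trick ($z = \mu + \sigma \cdot \varepsilon$) that lesson 14-5 covers in detail.

```python
import torch

def vae_training_step(encoder, decoder, optimizer, x, beta=1.0):
    # 1. Encode: get the parameters of q_phi(z|x).
    mu, logvar = encoder(x)

    # 2. Sample z ~ N(mu, sigma^2 I) via the reparameterization trick,
    #    so gradients can flow back through the sampling step (lesson 14-5).
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)

    # 3. Decode: reconstruct x from z.
    x_hat = decoder(z)

    # 4. Loss = reconstruction error + beta * KL penalty (closed form, see below).
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1).mean()
    loss = recon + beta * kl

    # 5. Backpropagate through decoder and encoder.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```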

During generation:

  1. Sample $z \sim \mathcal{N}(0, I)$ — no encoder needed
  2. Feed $z$ to the decoder → get $\hat{x}$
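Generation in code is even shorter — a sketch assuming the same decoder as above:

```python
import torch

@torch.no_grad()
def generate(decoder, num_samples=16, latent_dim=32):
    # 1. Sample z ~ N(0, I) directly from the prior -- no encoder needed.
    z = torch.randn(num_samples, latent_dim)
    # 2. Decode each z into a new sample.
    return decoder(z)
```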

Step 1 of generation is why the structure of the latent space matters. If the encoder is trained to produce distributions close to $\mathcal{N}(0, I)$, then any point sampled from $\mathcal{N}(0, I)$ will lie in a region the decoder has seen — so the decoder can produce a coherent output.

The Two Loss Terms

The VAE optimizes a sum of two objectives:

$$L_{\text{VAE}} = L_{\text{recon}} + \beta \cdot L_{\text{KL}}$$

  • $L_{\text{VAE}}$ — total VAE loss (to minimize)
  • $L_{\text{recon}}$ — reconstruction term: how well does the decoder recover $x$?
  • $L_{\text{KL}}$ — KL divergence: how far is $q_\phi(z \mid x)$ from the prior $\mathcal{N}(0, I)$?
  • $\beta$ — trade-off weight ($\beta = 1$ in the standard VAE)

Reconstruction term: just like the standard autoencoder — penalizes the decoder for poor reconstructions. For continuous data: $\| x - \hat{x} \|^2$. For binary data: binary cross-entropy.
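A small sketch of the two reconstruction options in PyTorch (the function name and the `data_type` flag are illustrative):

```python
import torch.nn.functional as F

def reconstruction_loss(x_hat, x, data_type="continuous"):
    """Reconstruction term of the VAE loss, summed over dimensions."""
    if data_type == "continuous":
        # Squared error ||x - x_hat||^2 for real-valued data.
        return F.mse_loss(x_hat, x, reduction="sum")
    # Binary cross-entropy for binary data; assumes x_hat lies in (0, 1),
    # e.g. because the decoder ends in a sigmoid.
    return F.binary_cross_entropy(x_hat, x, reduction="sum")
```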

KL term: measures how much the encoder's distribution $q_\phi(z \mid x)$ diverges from the prior $\mathcal{N}(0, I)$. The formula for diagonal Gaussians has a clean closed form (derived in lesson 14-4):

$$\text{KL}\left(\mathcal{N}(\mu, \sigma^2 I) \,\|\, \mathcal{N}(0, I)\right) = \frac{1}{2}\sum_{j=1}^{D}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$
  • $D$ — latent dimension
  • $\mu_j$ — mean of dimension $j$
  • $\sigma_j^2$ — variance of dimension $j$
  • $\text{KL}$ — Kullback–Leibler divergence

Worked example. Suppose $D = 2$, $\mu = [0.5, -0.3]$, $\sigma^2 = [1.2, 0.8]$:

$$\text{KL} = \frac{1}{2}\left[(0.5^2 + 1.2 - \ln 1.2 - 1) + ((-0.3)^2 + 0.8 - \ln 0.8 - 1)\right]$$
$$= \frac{1}{2}\left[(0.25 + 1.2 - 0.182 - 1) + (0.09 + 0.8 + 0.223 - 1)\right] = \frac{1}{2}[0.268 + 0.113] \approx 0.19$$

If the encoder perfectly reproduces the prior ($\mu = 0$, $\sigma^2 = 1$), the KL is exactly 0. Any deviation from the standard normal contributes positively to the loss.
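A quick NumPy check of the worked example and of the zero-KL case:

```python
import numpy as np

def kl_diag_gaussian(mu, var):
    """Closed-form KL( N(mu, diag(var)) || N(0, I) )."""
    mu, var = np.asarray(mu), np.asarray(var)
    return 0.5 * np.sum(mu**2 + var - np.log(var) - 1)

# The worked example: D = 2, mu = [0.5, -0.3], sigma^2 = [1.2, 0.8].
print(kl_diag_gaussian([0.5, -0.3], [1.2, 0.8]))  # ~0.19

# Matching the prior exactly (mu = 0, sigma^2 = 1) gives KL = 0.
print(kl_diag_gaussian([0.0, 0.0], [1.0, 1.0]))   # 0.0
```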

Interpolation and the Structured Latent Space

The payoff of the KL regularization is a smooth, continuous latent space. Because every region near the origin is encouraged to correspond to valid data, you can:

  • Interpolate between two images: encode $x_1$ and $x_2$, linearly interpolate $\mu_1$ and $\mu_2$, decode each interpolated $z$ — the result smoothly morphs between the images
  • Sample new examples: draw $z \sim \mathcal{N}(0, I)$, decode — you get realistic new data

Compare this to the standard autoencoder: interpolating between two encoded points crosses uncharted latent territory, producing incoherent outputs. VAE regularization fills that territory.
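A sketch of latent-space interpolation, assuming the encoder and decoder sketched earlier (the number of steps is arbitrary):

```python
import torch

@torch.no_grad()
def interpolate(encoder, decoder, x1, x2, steps=8):
    # Encode both inputs and keep the means of their latent distributions.
    mu1, _ = encoder(x1)
    mu2, _ = encoder(x2)
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        # Linear interpolation between mu1 and mu2, decoded at each step.
        z = (1 - t) * mu1 + t * mu2
        frames.append(decoder(z))
    return torch.stack(frames)
```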

Intuition: What Each Part Does

Think of the VAE as a map-making system. The encoder maps each training image to a region on a 2D (or 32D) map. The KL term says the regions must all fit inside a standard normal — the map must cover a bounded, organized territory. The decoder learns to look up any point on this map and draw a plausible image.

After training: the map is dense and organized. Pick any point on the map and the decoder returns something meaningful. This is what separates VAEs from plain autoencoders — not the architecture, but the constraint on the latent space.


Quiz

Question 1 of 3

In a VAE, the encoder network outputs two vectors: μ and log σ². What do these define?