The Core Insight
The standard autoencoder maps each input $x$ to a single point $z$ in latent space. The VAE maps each input to a region — a probability distribution over the latent space. Concretely:
- Standard AE encoder: $x \mapsto z$ — one specific point
- VAE encoder: $x \mapsto \mathcal{N}(\mu(x),\, \sigma^2(x)\, I)$ — a Gaussian ball
Variational autoencoders are how generative AI learned to create smooth, controllable representations. They are the mathematical foundation for latent space manipulation — the technique behind face morphing, style transfer, and controlled image generation. Understanding VAEs is essential for understanding diffusion models and modern image synthesis.
The encoder network now outputs two vectors: $\mu$ (the center) and $\sigma^2$ (the spread). In practice, the network outputs $\log \sigma^2$ rather than $\sigma^2$ directly — this keeps the parameter unconstrained (outputting $\sigma^2$ directly would require enforcing $\sigma^2 > 0$).
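A minimal sketch of such an encoder, assuming PyTorch and a small MLP on flattened 784-dimensional inputs (the class name `VAEEncoder`, the layer sizes, and the head names are illustrative, not a reference implementation):

```python
import torch
from torch import nn

class VAEEncoder(nn.Module):
    """Map an input x to the parameters (mu, log_var) of a diagonal Gaussian q(z|x)."""

    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)       # the center of the Gaussian
        self.to_log_var = nn.Linear(hidden_dim, latent_dim)  # log sigma^2: any real number is valid

    def forward(self, x):
        h = self.hidden(x)
        return self.to_mu(h), self.to_log_var(h)
```

Because the head predicts $\log \sigma^2$, its linear output can be any real number; $\sigma^2 = \exp(\log \sigma^2)$ is recovered wherever the variance itself is needed.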
The Full VAE Architecture
During training (see the code sketch after this list):
1. Feed the input $x$ to the encoder → get $\mu$ and $\log \sigma^2$
2. Sample $z \sim \mathcal{N}(\mu, \sigma^2 I)$
3. Feed $z$ to the decoder → get the reconstruction $\hat{x}$
4. Compute the loss: reconstruction error + KL penalty
5. Backpropagate through decoder and encoder (we cover how to backpropagate through step 2 in lesson 14-5)
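A sketch of one training step under a few assumptions: PyTorch, a squared-error reconstruction term, and the reparameterized sampling that lesson 14-5 derives. The names `encoder`, `decoder`, and `optimizer` stand in for your own modules; this illustrates the five steps rather than a tuned implementation:

```python
import torch
import torch.nn.functional as F

def training_step(x, encoder, decoder, optimizer, beta=1.0):
    mu, log_var = encoder(x)                          # step 1: encode x -> mu, log sigma^2
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps           # step 2: sample z ~ N(mu, sigma^2 I)
    x_hat = decoder(z)                                # step 3: decode z -> reconstruction
    recon = F.mse_loss(x_hat, x, reduction="sum")     # step 4: reconstruction error ...
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1)  # ... plus the KL penalty
    loss = recon + beta * kl
    optimizer.zero_grad()
    loss.backward()                                   # step 5: backpropagate through both networks
    optimizer.step()
    return loss.item()
```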
During generation:
1. Sample $z \sim \mathcal{N}(0, I)$ — no encoder needed
2. Feed $z$ to the decoder → get a new sample $\hat{x}$
Step 1 of generation is why the structure of the latent space matters. If the encoder is trained to produce distributions close to $\mathcal{N}(0, I)$, then any point sampled from $\mathcal{N}(0, I)$ will lie in a region the decoder has seen — so the decoder can produce a coherent output.
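The generation path is short enough to sketch in a few lines, assuming the same hypothetical `decoder` module and a 32-dimensional latent space:

```python
import torch

@torch.no_grad()
def generate(decoder, n_samples=16, latent_dim=32):
    z = torch.randn(n_samples, latent_dim)  # step 1: z ~ N(0, I), no encoder needed
    return decoder(z)                       # step 2: decode z into new samples
```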
The Two Loss Terms
The VAE optimizes a sum of two objectives:

$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \, D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, \mathcal{N}(0, I)\right)$$

- $\mathcal{L}$: total VAE loss (to minimize)
- $\mathcal{L}_{\text{recon}}$: reconstruction term (how well does the decoder recover $x$?)
- $D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, \mathcal{N}(0, I))$: KL divergence (how far is $q_\phi(z|x)$ from the prior $\mathcal{N}(0, I)$?)
- $\beta$: trade-off weight ($\beta = 1$ in the standard VAE)
Reconstruction term: just like the standard autoencoder — it penalizes the decoder for poor reconstructions. For continuous data: squared error $\|x - \hat{x}\|^2$. For binary data: binary cross-entropy.
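A minimal sketch of those two options in PyTorch; the tensors `x` and `x_hat` here are placeholder data, and the binary case assumes the decoder outputs per-pixel probabilities in $[0, 1]$:

```python
import torch
import torch.nn.functional as F

x = torch.rand(16, 784)      # a batch of flattened inputs in [0, 1] (placeholder data)
x_hat = torch.rand(16, 784)  # the decoder's reconstructions (placeholder data)

# Continuous data: squared error, summed over dimensions.
recon_continuous = F.mse_loss(x_hat, x, reduction="sum")

# Binary data: binary cross-entropy, with x_hat read as per-pixel probabilities.
recon_binary = F.binary_cross_entropy(x_hat, x, reduction="sum")
```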
KL term: measures how much the encoder's distribution $q_\phi(z|x)$ diverges from the prior $\mathcal{N}(0, I)$. The formula for diagonal Gaussians has a clean closed form (derived in lesson 14-4):

$$D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\right) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \ln \sigma_j^2 - 1 \right)$$

- $d$: latent dimension
- $\mu_j$: mean of dimension $j$
- $\sigma_j^2$: variance of dimension $j$
- $D_{\mathrm{KL}}$: Kullback–Leibler divergence
Worked example. Suppose $d = 2$, $\mu = (0.5,\ 0)$, $\sigma^2 = (1,\ 0.25)$:

$$D_{\mathrm{KL}} = \frac{1}{2}\Big[\big(0.25 + 1 - \ln 1 - 1\big) + \big(0 + 0.25 - \ln 0.25 - 1\big)\Big] = \frac{1}{2}\big(0.25 + 0.636\big) \approx 0.44$$

If the encoder perfectly reproduces the prior ($\mu_j = 0$ and $\sigma_j^2 = 1$ for every $j$), the KL is exactly 0. Any deviation from the standard normal contributes positively to the loss.
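A few lines that implement the closed form and reproduce the worked example (the helper name `kl_to_standard_normal` is illustrative):

```python
import math

def kl_to_standard_normal(mu, var):
    """Closed-form KL( N(mu, diag(var)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m**2 + v - math.log(v) - 1 for m, v in zip(mu, var))

print(kl_to_standard_normal([0.5, 0.0], [1.0, 0.25]))  # ~0.443, the worked example above
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))   # 0.0: a perfect match to the prior
```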
Interpolation and the Structured Latent Space
The payoff of the KL regularization is a smooth, continuous latent space. Because every region near the origin is encouraged to correspond to valid data, you can:
- Interpolate between two images: encode $x_1$ and $x_2$, linearly interpolate $z_1$ and $z_2$ via $z_t = (1 - t)\, z_1 + t\, z_2$, decode each interpolated $z_t$ — the result smoothly morphs between the images (see the code sketch below)
- Sample new examples: draw $z \sim \mathcal{N}(0, I)$, decode it — you get realistic new data
Compare this to the standard autoencoder: interpolating between two encoded points crosses uncharted latent territory, producing incoherent outputs. VAE regularization fills that territory.
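A sketch of that interpolation under the same assumptions as earlier: an `encoder` returning `(mu, log_var)` and a `decoder` mapping latent codes back to images, both placeholders for your own modules:

```python
import torch

@torch.no_grad()
def interpolate(encoder, decoder, x1, x2, steps=8):
    """Walk a straight line between the latent means of x1 and x2, decoding each point."""
    mu1, _ = encoder(x1)
    mu2, _ = encoder(x2)
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        z_t = (1 - t) * mu1 + t * mu2   # linear interpolation in latent space
        frames.append(decoder(z_t))
    return torch.stack(frames)
```

Interpolating between the means rather than sampled codes keeps the sequence deterministic; either choice works once the KL term has organized the latent space.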
Intuition: What Each Part Does
Think of the VAE as a map-making system. The encoder maps each training image to a region on a 2D (or 32D) map. The KL term says the regions must all fit inside a standard normal — the map must cover a bounded, organized territory. The decoder learns to look up any point on this map and draw a plausible image.
After training: the map is dense and organized. Pick any point on the map and the decoder returns something meaningful. This is what separates VAEs from plain autoencoders — not the architecture, but the constraint on the latent space.
Interactive example
Explore the 2D VAE latent space for MNIST — drag to sample and watch the decoder output change continuously
Coming soon