Generative Models
Lesson 3 ⏱ 14 min

Variational autoencoders: stochastic encoders

VAEs: From Bottleneck Points to Bottleneck Distributions

How replacing the deterministic encoder with a distribution encoder solves the structured latent space problem, and the full VAE training and generation pipelines.

⏱ ~8 min

🧮 Quick refresher

Gaussian (normal) distribution

N(μ, σ²) is the bell-curve distribution with mean μ and variance σ². About 68% of samples fall within one standard deviation of μ. The standard normal N(0,1) has mean 0 and variance 1.

Example

If z ~ N(2, 9), then z has mean 2 and standard deviation 3.

A sample of z = 5 is one standard deviation above the mean.
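To make these numbers concrete, here is a quick NumPy check of the refresher (the seed and sample count are arbitrary, chosen just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many samples from N(2, 9): mean 2, variance 9, so standard deviation 3.
samples = rng.normal(loc=2.0, scale=3.0, size=100_000)

print(samples.mean())  # close to 2
print(samples.std())   # close to 3
# Fraction of samples within one standard deviation of the mean -- roughly 0.68.
print(np.mean(np.abs(samples - 2.0) < 3.0))
```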

The Core Insight

The standard autoencoder maps each input to a single point $z$. The VAE maps each input to a region — a probability distribution over the latent space. Concretely:

  • Standard AE encoder: $f_\theta(x) = z$ — one specific point
  • VAE encoder: $q_\phi(z \mid x) = \mathcal{N}(\mu_\phi(x),\, \sigma^2_\phi(x) \cdot I)$ — a Gaussian ball

Variational autoencoders are how generative AI learned to create smooth, controllable representations. They are the mathematical foundation for latent space manipulation — the technique behind face morphing, style transfer, and controlled image generation. Understanding VAEs is essential for understanding diffusion models and modern image synthesis.

The encoder network now outputs two vectors: $\mu$ (the center) and $\sigma^2$ (the spread). In practice, the network outputs $\log \sigma^2$ rather than $\sigma^2$ directly — this keeps the parameter unconstrained (outputting $\sigma^2$ itself would require enforcing the constraint $\sigma^2 > 0$).
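As a rough sketch of what this looks like in code — a minimal PyTorch-style encoder; the layer sizes and names (`in_dim`, `hidden_dim`, `latent_dim`) are illustrative, not prescribed by the lesson:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Encoder q_phi(z|x): maps x to the parameters of a diagonal Gaussian."""
    def __init__(self, in_dim=784, hidden_dim=400, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.mu_head = nn.Linear(hidden_dim, latent_dim)      # mu: the center
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)  # log sigma^2: unconstrained spread

    def forward(self, x):
        h = self.backbone(x)
        return self.mu_head(h), self.logvar_head(h)
```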

The Full VAE Architecture

During training:

  1. Feed input $x$ to the encoder → get $\mu, \log \sigma^2$
  2. Sample $z \sim q_\phi(z \mid x) = \mathcal{N}(\mu, \sigma^2 I)$
  3. Feed $z$ to the decoder → get reconstruction $\hat{x} = g_\theta(z)$
  4. Compute loss: reconstruction error + KL penalty
  5. Backpropagate through decoder and encoder (we cover how to backpropagate through step 2 in lesson 14-5)
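The five steps above, condensed into one PyTorch-style training step. This is a sketch rather than the lesson's reference implementation: it assumes an encoder/decoder pair like the one sketched earlier, and step 2 uses the reparameterization trick ($z = \mu + \sigma \cdot \varepsilon$) that lesson 14-5 covers in detail.

```python
import torch

def vae_training_step(encoder, decoder, optimizer, x, beta=1.0):
    # 1. Encode: get the parameters of q_phi(z|x).
    mu, logvar = encoder(x)

    # 2. Sample z ~ N(mu, sigma^2 I) via the reparameterization trick,
    #    so gradients can flow back through the sampling step (lesson 14-5).
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)

    # 3. Decode: reconstruct x from z.
    x_hat = decoder(z)

    # 4. Loss = reconstruction error + beta * KL penalty (closed form, see below).
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=1).mean()
    loss = recon + beta * kl

    # 5. Backpropagate through decoder and encoder.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```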

During generation:

  1. Sample $z \sim \mathcal{N}(0, I)$ — no encoder needed
  2. Feed $z$ to the decoder → get $\hat{x}$
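Generation in code is even shorter — a sketch assuming the same decoder as above:

```python
import torch

@torch.no_grad()
def generate(decoder, num_samples=16, latent_dim=32):
    # 1. Sample z ~ N(0, I) directly from the prior -- no encoder needed.
    z = torch.randn(num_samples, latent_dim)
    # 2. Decode each z into a new sample.
    return decoder(z)
```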

Step 1 of generation is why the structure of the latent space matters. If the encoder is trained to produce distributions close to $\mathcal{N}(0, I)$, then any point sampled from $\mathcal{N}(0, I)$ will lie in a region the decoder has seen — so the decoder can produce a coherent output.

The Two Loss Terms

The VAE optimizes a sum of two objectives:

$$L_{\text{VAE}} = L_{\text{recon}} + \beta \cdot L_{\text{KL}}$$

  • $L_{\text{VAE}}$ — total VAE loss (to minimize)
  • $L_{\text{recon}}$ — reconstruction term: how well does the decoder recover $x$?
  • $L_{\text{KL}}$ — KL divergence: how far is $q_\phi(z \mid x)$ from the prior $\mathcal{N}(0, I)$?
  • $\beta$ — trade-off weight ($\beta = 1$ in the standard VAE)

Reconstruction term: just like the standard autoencoder — penalizes the decoder for poor reconstructions. For continuous data: $\| x - \hat{x} \|^2$. For binary data: binary cross-entropy.
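A small sketch of the two reconstruction options in PyTorch (the function name and the `data_type` flag are illustrative):

```python
import torch.nn.functional as F

def reconstruction_loss(x_hat, x, data_type="continuous"):
    """Reconstruction term of the VAE loss, summed over dimensions."""
    if data_type == "continuous":
        # Squared error ||x - x_hat||^2 for real-valued data.
        return F.mse_loss(x_hat, x, reduction="sum")
    # Binary cross-entropy for binary data; assumes x_hat lies in (0, 1),
    # e.g. because the decoder ends in a sigmoid.
    return F.binary_cross_entropy(x_hat, x, reduction="sum")
```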

KL term: measures how much the encoder's distribution $q_\phi(z \mid x)$ diverges from the prior $\mathcal{N}(0, I)$. The formula for diagonal Gaussians has a clean closed form (derived in lesson 14-4):

$$\text{KL}\left(\mathcal{N}(\mu, \sigma^2 I) \,\|\, \mathcal{N}(0, I)\right) = \frac{1}{2}\sum_{j=1}^{D}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$
  • $D$ — latent dimension
  • $\mu_j$ — mean of dimension $j$
  • $\sigma_j^2$ — variance of dimension $j$
  • $\text{KL}$ — Kullback–Leibler divergence

Worked example. Suppose $D = 2$, $\mu = [0.5, -0.3]$, $\sigma^2 = [1.2, 0.8]$:

$$\text{KL} = \frac{1}{2}\left[(0.5^2 + 1.2 - \ln 1.2 - 1) + ((-0.3)^2 + 0.8 - \ln 0.8 - 1)\right]$$
$$= \frac{1}{2}\left[(0.25 + 1.2 - 0.182 - 1) + (0.09 + 0.8 + 0.223 - 1)\right] = \frac{1}{2}[0.268 + 0.113] \approx 0.19$$

If the encoder perfectly reproduces the prior ($\mu = 0$, $\sigma^2 = 1$), the KL is exactly 0. Any deviation from the standard normal contributes positively to the loss.
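A quick NumPy check of the worked example and of the zero-KL case:

```python
import numpy as np

def kl_diag_gaussian(mu, var):
    """Closed-form KL( N(mu, diag(var)) || N(0, I) )."""
    mu, var = np.asarray(mu), np.asarray(var)
    return 0.5 * np.sum(mu**2 + var - np.log(var) - 1)

# The worked example: D = 2, mu = [0.5, -0.3], sigma^2 = [1.2, 0.8].
print(kl_diag_gaussian([0.5, -0.3], [1.2, 0.8]))  # ~0.19

# Matching the prior exactly (mu = 0, sigma^2 = 1) gives KL = 0.
print(kl_diag_gaussian([0.0, 0.0], [1.0, 1.0]))   # 0.0
```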

Interpolation and the Structured Latent Space

The payoff of the KL regularization is a smooth, continuous latent space. Because every region near the origin is encouraged to correspond to valid data, you can:

  • Interpolate between two images: encode $x_1$ and $x_2$, linearly interpolate $\mu_1$ and $\mu_2$, decode each interpolated $z$ — the result smoothly morphs between the images
  • Sample new examples: draw $z \sim \mathcal{N}(0, I)$, decode — you get realistic new data

Compare this to the standard autoencoder: interpolating between two encoded points crosses uncharted latent territory, producing incoherent outputs. VAE regularization fills that territory.
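A sketch of latent-space interpolation, assuming the encoder and decoder sketched earlier (the number of steps is arbitrary):

```python
import torch

@torch.no_grad()
def interpolate(encoder, decoder, x1, x2, steps=8):
    # Encode both inputs and keep the means of their latent distributions.
    mu1, _ = encoder(x1)
    mu2, _ = encoder(x2)
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        # Linear interpolation between mu1 and mu2, decoded at each step.
        z = (1 - t) * mu1 + t * mu2
        frames.append(decoder(z))
    return torch.stack(frames)
```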

Intuition: What Each Part Does

Think of the VAE as a map-making system. The encoder maps each training image to a region on a 2D (or 32D) map. The KL term says the regions must all fit inside a standard normal — the map must cover a bounded, organized territory. The decoder learns to look up any point on this map and draw a plausible image.

After training: the map is dense and organized. Pick any point on the map and the decoder returns something meaningful. This is what separates VAEs from plain autoencoders — not the architecture, but the constraint on the latent space.


Quiz

Question 1 of 3

In a VAE, the encoder network outputs two vectors: μ and log σ². What do these define?