The Problem with Sampling
Recall the VAE training loop from lesson 14-3:
1. Encoder outputs $\mu$ and $\sigma$
2. Sample $z \sim \mathcal{N}(\mu, \sigma^2)$
3. Decoder produces $\hat{x}$
4. Compute loss, backpropagate
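In code, that loop looks roughly like the following — a minimal PyTorch sketch with a toy linear encoder and decoder (the sizes, the `enc`/`dec` names, and the squared-error reconstruction loss are illustrative, not the lesson's exact model):

```python
import torch
import torch.nn as nn

# Toy sizes: 784-dim input, 2-dim latent (illustrative).
enc = nn.Linear(784, 2 * 2)      # outputs [mu, log_var], concatenated
dec = nn.Linear(2, 784)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(16, 784)          # dummy batch

# 1. Encoder outputs mu and sigma
mu, log_var = enc(x).chunk(2, dim=-1)
sigma = torch.exp(0.5 * log_var)

# 2. Sample z ~ N(mu, sigma^2)   <-- the step discussed below
z = torch.distributions.Normal(mu, sigma).sample()

# 3. Decoder produces x_hat
x_hat = dec(z)

# 4. Compute loss, backpropagate
recon = ((x_hat - x) ** 2).sum(dim=-1).mean()
kl = 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - log_var).sum(dim=-1).mean()
loss = recon + kl

opt.zero_grad()
loss.backward()   # runs, but with .sample() no reconstruction gradient reaches enc
opt.step()
```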
The reparameterization trick is the single idea that makes VAE training work at all. Without it, gradients cannot flow through the sampling step, and the entire architecture is untrainable. It also appears in many other models with stochastic components — diffusion models, normalizing flows, and stochastic policies in RL all use versions of this trick.
Step 2 is the problem. Backpropagation needs to compute $\partial \mathcal{L}/\partial \mu$ and $\partial \mathcal{L}/\partial \sigma$ — the gradients with respect to the encoder's outputs, and from there its parameters. To do that it needs to propagate gradients through the sampling step, which requires $\partial z/\partial \mu$ and $\partial z/\partial \sigma$.
But $z$ is a random draw. It is not a deterministic function of $\mu$ and $\sigma$. Every time step 2 runs, a different $z$ comes out. There is no fixed derivative to place in the computational graph. The gradient is undefined.
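A quick way to see the failure in PyTorch (a sketch — `torch.distributions.Normal(...).sample()` is a non-reparameterized draw, and the weight `w` is a stand-in for the decoder):

```python
import torch

mu = torch.tensor([0.0], requires_grad=True)
sigma = torch.tensor([1.0], requires_grad=True)

# Direct sampling: the draw is performed without building a backward graph,
# so z carries no connection to mu or sigma.
z = torch.distributions.Normal(mu, sigma).sample()
print(z.requires_grad, z.grad_fn)   # False None

w = torch.tensor([2.0], requires_grad=True)   # stand-in for decoder weights
loss = (w * z).pow(2).sum()
loss.backward()

print(w.grad)                # populated — the decoder still gets a gradient
print(mu.grad, sigma.grad)   # None None — the encoder receives nothing
```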
The Trick: Move the Randomness Outside
Instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ directly, rewrite:

$$z = \mu + \sigma \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

where:
- $z$ — the latent code we want to sample
- $\mu$ — encoder mean output
- $\sigma$ — encoder standard deviation (element-wise)
- $\varepsilon$ — auxiliary noise sampled from a standard normal — independent of the encoder parameters
- $\odot$ — element-wise multiplication
Check: is $z = \mu + \sigma \odot \varepsilon$ still distributed as $\mathcal{N}(\mu, \sigma^2)$? Yes — adding the constant $\mu$ to a Gaussian shifts its mean, and scaling by $\sigma$ scales its standard deviation. The distribution is identical to before.
What has changed: the randomness now lives entirely in $\varepsilon$, which is independent of the encoder parameters. The path from $\mu$ and $\sigma$ to $z$ is now a deterministic function:

$$z = \mu + \sigma \odot \varepsilon$$
This is differentiable. The local gradients are:
- $\partial z_j / \partial \mu_j = 1$ — gradient of the j-th latent dimension with respect to the j-th mean — always 1
- $\partial z_j / \partial \sigma_j = \varepsilon_j$ — gradient with respect to the j-th standard deviation — equals the sampled noise $\varepsilon_j$
Backprop can now flow gradients from the decoder loss all the way back through $z$, through $\mu$ and $\sigma$, and into the encoder network weights. The encoder is trainable.
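A sketch verifying these local gradients with autograd (three latent dimensions, chosen arbitrarily; `eps` is the fixed noise draw):

```python
import torch

mu = torch.zeros(3, requires_grad=True)      # 3 latent dimensions, illustrative
sigma = torch.ones(3, requires_grad=True)

eps = torch.randn(3)          # auxiliary noise: a leaf with no parameters
z = mu + sigma * eps          # deterministic, differentiable in mu and sigma

z.sum().backward()            # accumulate dz_j/dmu_j and dz_j/dsigma_j into .grad
print(mu.grad)                           # tensor([1., 1., 1.]) — always 1
print(torch.allclose(sigma.grad, eps))   # True — equals the sampled noise
```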
Visualizing the Computational Graph
Without reparameterization (broken):
The arrow labeled "sample" has no backward pass. Gradients stop at $z$ and never reach $\mu$ or $\sigma$.
With reparameterization (working):
The paths $\mu \to z$ and $\sigma \to z$ are fully differentiable. The noise $\varepsilon$ is a leaf node with no parameters — its gradient is never needed.
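Both graphs can be inspected directly in PyTorch — `sample()` is the non-reparameterized draw, while `rsample()` builds the $\mu + \sigma \odot \varepsilon$ path (a sketch):

```python
import torch

mu = torch.zeros(2, requires_grad=True)
sigma = torch.ones(2, requires_grad=True)
dist = torch.distributions.Normal(mu, sigma)

z_broken = dist.sample()     # drawn without building a backward graph
z_reparam = dist.rsample()   # drawn as mu + sigma * eps

print(z_broken.grad_fn)      # None — no backward edge into mu or sigma
print(z_reparam.grad_fn)     # an autograd node (e.g. AddBackward0) — fully connected
```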
A Complete Numerical Walk-Through
Suppose the encoder outputs a mean $\mu$ and standard deviation $\sigma$ for a single latent dimension, and we draw a noise value $\varepsilon \sim \mathcal{N}(0, 1)$.
Forward pass:

$$z = \mu + \sigma \varepsilon$$

The decoder receives $z$ and produces some output. Suppose that after computing the full loss, the upstream gradient arriving at the latent is $\partial \mathcal{L} / \partial z$.
Backward pass through the reparameterization:

$$\frac{\partial \mathcal{L}}{\partial \mu} = \frac{\partial \mathcal{L}}{\partial z} \cdot \frac{\partial z}{\partial \mu} = \frac{\partial \mathcal{L}}{\partial z} \cdot 1, \qquad \frac{\partial \mathcal{L}}{\partial \sigma} = \frac{\partial \mathcal{L}}{\partial z} \cdot \frac{\partial z}{\partial \sigma} = \frac{\partial \mathcal{L}}{\partial z} \cdot \varepsilon$$

These gradients flow backward into the encoder network. The KL gradient (from lesson 14-4) adds on top:

$$\frac{\partial \mathcal{L}_{\mathrm{KL}}}{\partial \mu} = \mu, \qquad \frac{\partial \mathcal{L}_{\mathrm{KL}}}{\partial \sigma} = \sigma - \frac{1}{\sigma}$$
The encoder weights are updated using the sum of both gradient contributions — reconstruction signal (via reparameterization) and regularization signal (from the KL term directly).
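The same walk-through in code, with illustrative numbers ($\mu = 0.5$, $\sigma = 2.0$, $\varepsilon = 0.3$, and a squared-error stand-in for the decoder loss — values made up for this sketch, not the lesson's originals), checked against autograd:

```python
import torch

# Illustrative values for a single latent dimension.
mu    = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(2.0, requires_grad=True)
eps   = torch.tensor(0.3)            # the noise we happened to draw

# Forward pass through the reparameterization
z = mu + sigma * eps                 # 0.5 + 2.0 * 0.3 = 1.1

# Stand-in for the decoder + reconstruction loss: (z - target)^2
target = torch.tensor(2.0)
recon = (z - target) ** 2            # dL/dz = 2 * (z - target) = -1.8

# KL term for one Gaussian dimension: 0.5 * (mu^2 + sigma^2 - 1 - log sigma^2)
kl = 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - torch.log(sigma ** 2))

(recon + kl).backward()

# Hand-derived contributions:
#   reconstruction: dL/dmu = -1.8 * 1 = -1.8,   dL/dsigma = -1.8 * eps = -0.54
#   KL:             dKL/dmu = mu = 0.5,         dKL/dsigma = sigma - 1/sigma = 1.5
print(mu.grad)      # -1.8 + 0.5  = -1.3
print(sigma.grad)   # -0.54 + 1.5 =  0.96
```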
Generalizing Beyond Gaussians
The reparameterization trick applies to any distribution with a tractable inverse CDF (quantile function). To sample from a distribution with parameters $\theta$ and CDF $F_\theta$:
- Draw $u \sim \mathrm{Uniform}(0, 1)$
- Return $z = F_\theta^{-1}(u)$, where $F_\theta^{-1}$ is the inverse CDF
Since $F_\theta^{-1}$ is differentiable in $\theta$ (when it exists), the path from $\theta$ to $z$ is differentiable. This works for the Exponential, Logistic, Laplace, Beta (approximately), and others. For distributions without easy inverse CDFs, alternatives like implicit reparameterization (Figurnov et al., 2018) generalize the idea further.
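As a concrete sketch, the Exponential distribution has inverse CDF $F_\lambda^{-1}(u) = -\ln(1 - u)/\lambda$, so a reparameterized draw and its gradient with respect to the rate $\lambda$ look like this (the loss here is just the sample mean, chosen for illustration):

```python
import torch

lam = torch.tensor(2.0, requires_grad=True)   # rate parameter of Exponential(lam)

u = torch.rand(1000)                          # u ~ Uniform(0, 1), parameter-free noise
z = -torch.log(1 - u) / lam                   # inverse CDF: z ~ Exponential(lam)

loss = z.mean()                               # any downstream loss of the samples
loss.backward()
print(lam.grad)       # ≈ -1/lam^2 = -0.25, the gradient of E[z] = 1/lam
```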
Interactive example
Visualize how gradients flow through reparameterized vs direct sampling — toggle to see the broken graph
Coming soon