
The Reparameterization Trick: Making Sampling Differentiable

Why you cannot backpropagate through a sampling operation, how reparameterization shifts the randomness outside the computational graph, and what the gradients look like explicitly.

Quick refresher

Backpropagation through the chain rule

Backprop computes ∂L/∂w by multiplying local gradients along the path from the loss to w. Every operation on the path must have a defined derivative for the chain rule to apply.

Example

For L = (wx)², ∂L/∂w = 2wx · x = 2x²w.

The chain rule multiplies the outer derivative (2wx) by the inner derivative (x).
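A quick way to see this concretely is to let autograd do the same computation. A minimal PyTorch sketch (the values of $w$ and $x$ are arbitrary):

```python
import torch

# Chain-rule check for L = (w*x)^2; the hand-derived gradient is dL/dw = 2*x^2*w.
x = 3.0
w = torch.tensor(2.0, requires_grad=True)

L = (w * x) ** 2
L.backward()

print(w.grad)               # tensor(36.)
print(2 * x**2 * w.item())  # 36.0, matches the chain-rule result
```

Every operation on the path from $w$ to $L$ (multiply, square) has a known local derivative, which is exactly what the sampling step below lacks.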

The Problem with Sampling

Recall the VAE training loop from lesson 14-3:

  1. Encoder outputs $\mu_\phi(x)$ and $\sigma_\phi(x)$
  2. Sample $z \sim \mathcal{N}(\mu_\phi(x), \sigma_\phi(x)^2 \cdot I)$
  3. Decoder produces $\hat{x} = g_\theta(z)$
  4. Compute loss, backpropagate

The reparameterization trick is the single idea that makes VAE training work at all. Without it, gradients cannot flow through the sampling step, and the entire architecture is untrainable. It also appears in many other models with stochastic components — diffusion models, normalizing flows, and stochastic policies in RL all use versions of this trick.

Step 2 is the problem. Backpropagation needs to compute $\partial L / \partial \mu_\phi$ and $\partial L / \partial \sigma_\phi$ — the gradients with respect to the encoder parameters. To do that it needs to propagate gradients through the sampling step, which requires $\partial z / \partial \mu$ and $\partial z / \partial \sigma$.

But $z$ is a random draw. It is not a deterministic function of $\mu$ and $\sigma$. Every time step 2 runs, a different $z$ comes out. There is no fixed derivative to place in the computational graph. The gradient is undefined.
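A minimal PyTorch sketch of the broken version (the numbers are made up; `torch.distributions.Normal.sample()` draws under `no_grad`, so it cuts the graph exactly at step 2):

```python
import torch
from torch.distributions import Normal

# Pretend these are the encoder outputs for one latent dimension.
mu = torch.tensor(0.6, requires_grad=True)
sigma = torch.tensor(0.4, requires_grad=True)

z = Normal(mu, sigma).sample()   # a raw draw: no backward path to mu or sigma
print(z.requires_grad)           # False -> the graph is cut here

loss = (z - 1.0) ** 2
# loss.backward() would fail: the loss has no grad_fn, because nothing
# upstream of it requires grad once the sampling step detached z.
```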

The Trick: Move the Randomness Outside

Instead of sampling $z$ directly, rewrite:

$$z = \mu + \sigma \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

where:

  $z$: the latent code we want to sample
  $\mu$: encoder mean output
  $\sigma$: encoder standard deviation (element-wise)
  $\varepsilon$: auxiliary noise sampled from a standard normal, independent of the encoder parameters
  $\odot$: element-wise multiplication

Check: is $z$ still distributed as $\mathcal{N}(\mu, \sigma^2 I)$? Yes — adding a constant to a Gaussian shifts its mean, and scaling by $\sigma$ scales its standard deviation. The distribution is identical to before.
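This can also be checked empirically. A small sketch (using the same $\mu = 0.6$, $\sigma = 0.4$ that appear in the walk-through below):

```python
import torch

# Draw many eps ~ N(0, 1) and form z = mu + sigma * eps.
mu, sigma = 0.6, 0.4
eps = torch.randn(1_000_000)
z = mu + sigma * eps

print(z.mean().item())  # ~0.6, the mean is shifted to mu
print(z.std().item())   # ~0.4, the spread is scaled to sigma
```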

What has changed: the randomness now lives entirely in $\varepsilon$, which is independent of the encoder parameters. The path from $\mu$ and $\sigma$ to $z$ is now a deterministic function:

$$z = \mu + \sigma \odot \varepsilon$$

This is differentiable. The local gradients are:

$$\frac{\partial z_j}{\partial \mu_j} = 1, \qquad \frac{\partial z_j}{\partial \sigma_j} = \varepsilon_j$$

  $\partial z_j / \partial \mu_j$: gradient of the $j$-th latent dimension with respect to the $j$-th mean — always 1
  $\partial z_j / \partial \sigma_j$: gradient with respect to the $j$-th standard deviation — equals the sampled noise $\varepsilon_j$

Backprop can now flow gradients from the decoder loss all the way back through zz, through μ\mu and σ\sigma, and into the encoder network weights. The encoder is trainable.
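A minimal PyTorch sketch of the working version (arbitrary values, with the decoder loss replaced by a simple stand-in):

```python
import torch

mu = torch.tensor([0.6, -0.2], requires_grad=True)
sigma = torch.tensor([0.4, 1.1], requires_grad=True)

eps = torch.randn_like(sigma)   # leaf noise: no parameters, no gradient needed
z = mu + sigma * eps            # deterministic, differentiable in mu and sigma

loss = (z ** 2).sum()           # stand-in for the decoder/reconstruction loss
loss.backward()

print(mu.grad)     # dL/dmu    = dL/dz * 1   = 2 * z
print(sigma.grad)  # dL/dsigma = dL/dz * eps = 2 * z * eps
```

In PyTorch, `Normal(mu, sigma).rsample()` performs this same reparameterized draw, in contrast to `.sample()`.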

Visualizing the Computational Graph

Without reparameterization (broken):

$$\mu, \sigma \;\xrightarrow{\text{sample}}\; z \;\xrightarrow{g_\theta}\; \hat{x} \;\to\; L$$

The arrow labeled "sample" has no backward pass. Gradients stop at $z$ and never reach $\mu$ or $\sigma$.

With reparameterization (working):

$$\varepsilon \sim \mathcal{N}(0, I) \;\to\; z = \mu + \sigma \odot \varepsilon \;\xrightarrow{g_\theta}\; \hat{x} \;\to\; L$$

Here $\mu$ and $\sigma$ flow directly into the node $z = \mu + \sigma \odot \varepsilon$ as ordinary differentiable inputs. The paths $\mu \to z \to \hat{x} \to L$ and $\sigma \to z \to \hat{x} \to L$ are fully differentiable. The noise $\varepsilon$ is a leaf node with no parameters — its gradient is never needed.

A Complete Numerical Walk-Through

Suppose the encoder outputs $\mu = 0.6$, $\sigma = 0.4$ for a single latent dimension. We draw $\varepsilon = -0.7$.

Forward pass:

$$z = 0.6 + 0.4 \times (-0.7) = 0.6 - 0.28 = 0.32$$

The decoder receives $z = 0.32$ and produces some output. Suppose after computing the full loss, $\partial L / \partial z = 1.5$.

Backward pass through the reparameterization:

$$\frac{\partial L}{\partial \mu} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial \mu} = 1.5 \times 1 = 1.5$$
$$\frac{\partial L}{\partial \sigma} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial \sigma} = 1.5 \times (-0.7) = -1.05$$

These gradients flow backward into the encoder network. The KL gradient (from lesson 14-4) adds on top:

$$\frac{\partial \text{KL}}{\partial \mu} = \mu = 0.6, \qquad \frac{\partial \text{KL}}{\partial \sigma} = \sigma - \frac{1}{\sigma} = 0.4 - 2.5 = -2.1$$

The encoder weights are updated using the sum of both gradient contributions — reconstruction signal (via reparameterization) and regularization signal (from the KL term directly).
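The same arithmetic can be reproduced with autograd. A small sketch: the reconstruction loss is replaced by the stand-in $1.5 \cdot z$ so that $\partial L / \partial z = 1.5$ as in the example, and the KL term uses the standard closed form for a unit-Gaussian prior, consistent with the gradients above:

```python
import torch

mu = torch.tensor(0.6, requires_grad=True)
sigma = torch.tensor(0.4, requires_grad=True)
eps = torch.tensor(-0.7)

z = mu + sigma * eps            # forward pass: z = 0.32
recon = 1.5 * z                 # stand-in loss with dL/dz = 1.5
kl = 0.5 * (mu**2 + sigma**2 - 1.0 - torch.log(sigma**2))
(recon + kl).backward()

print(mu.grad)     # 1.5 * 1      + 0.6           =  2.1
print(sigma.grad)  # 1.5 * (-0.7) + (0.4 - 2.5)   = -3.15
```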

Generalizing Beyond Gaussians

The reparameterization trick applies to any distribution with a tractable inverse CDF (quantile function). To sample from a distribution $p$:

  1. Draw $u \sim \text{Uniform}(0, 1)$
  2. Return $z = F^{-1}(u)$, where $F^{-1}$ is the inverse CDF

Since $F^{-1}$ is differentiable in the distribution's parameters $\theta$ (when it exists), the path $\theta \to z$ is differentiable. This works for Exponential, Logistic, Laplace, Beta (approximately), and others. For distributions without easy inverse CDFs, alternatives like implicit reparameterization (Figurnov et al., 2018) generalize the idea further.
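A minimal sketch of inverse-CDF reparameterization for an Exponential distribution (an assumed example; its inverse CDF is $F^{-1}(u) = -\log(1 - u) / \lambda$ for rate $\lambda$):

```python
import torch

rate = torch.tensor(2.0, requires_grad=True)   # the distribution parameter theta

u = torch.rand(5)                  # step 1: u ~ Uniform(0, 1), independent of rate
z = -torch.log1p(-u) / rate        # step 2: z = F^{-1}(u), differentiable in rate

z.sum().backward()                 # stand-in loss
print(rate.grad)                   # nonzero: gradients reach the parameter
```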

Quiz


Why can't you backpropagate through the operation z ~ N(μ, σ²) directly?