Generative Models
Lesson 2 ⏱ 12 min

Autoencoders: compression and reconstruction


Autoencoders: Learning to Compress and Reconstruct

How an encoder-decoder architecture learns a compact representation by passing data through a bottleneck, with worked examples on reconstruction loss and the limitations that motivate VAEs.



Quick refresher

Mean squared error (MSE)

MSE measures how far predictions are from targets: MSE = (1/n)·Σ(yᵢ - ŷᵢ)². It penalizes large errors more than small ones. As a loss it pushes outputs toward the target values.

Example

If the true pixel value is 0.8 and the reconstruction is 0.5, the squared error contribution is (0.8−0.5)² = 0.09.
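The refresher above can be checked with a few lines of Python (the function name `mse` is illustrative, not from any particular library):

```python
def mse(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# The single-pixel example from the refresher: (0.8 - 0.5)^2 = 0.09
print(round(mse([0.8], [0.5]), 6))  # 0.09
```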

The Idea: Learn to Forget Wisely

An autoencoder learns by solving an unusual task: compress data, then reconstruct it exactly. If that sounds circular, it is — by design. The magic is the bottleneck in the middle that forces compression.

Autoencoders are used in practice for denoising images, detecting anomalies in industrial sensors, and compressing data for storage. They are also the direct precursor to VAEs — the architecture that made controllable latent space manipulation possible. Understanding autoencoders is the first step toward understanding every modern generative model.

The architecture has two parts:

z = f_\theta(x), \qquad \hat{x} = g_\phi(z)

f_\theta: encoder with parameters θ, maps input to latent code
g_\phi: decoder with parameters φ, maps latent code back to input
x: original input (e.g., a 784-dimensional image)
z: latent code, much lower-dimensional than x
\hat{x}: reconstruction, the decoder's output

The latent code z has far fewer dimensions than x. If x \in \mathbb{R}^{784} (a 28×28 MNIST image) and z \in \mathbb{R}^{32}, the bottleneck compresses by a factor of 784/32 ≈ 24.5×.
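The shapes involved can be sketched with single linear maps in NumPy. This is only a dimensional illustration, not a trained model: the weights are random, so the "reconstruction" is meaningless until gradient descent adjusts them.

```python
import numpy as np

rng = np.random.default_rng(0)

W_enc = rng.normal(size=(32, 784))   # f_theta: R^784 -> R^32 (linear sketch)
W_dec = rng.normal(size=(784, 32))   # g_phi:   R^32  -> R^784

x = rng.normal(size=784)             # a flattened 28x28 "image"
z = W_enc @ x                        # latent code at the bottleneck
x_hat = W_dec @ z                    # reconstruction, back in input space

print(x.shape, z.shape, x_hat.shape)  # (784,) (32,) (784,)
```

A real autoencoder would stack several nonlinear layers in each half, but the bottleneck dimension is what does the compressing.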

The Reconstruction Loss

Training minimizes how different the reconstruction is from the original:

L = \frac{1}{n} \sum_{i=1}^{n} \| x^{(i)} - \hat{x}^{(i)} \|^2

L: total reconstruction loss over n training examples
x^{(i)}: the i-th training example
\hat{x}^{(i)}: its reconstruction
n: number of training examples

This is pixel-wise MSE. For binary data (black-and-white images, 0/1 pixels), binary cross-entropy is often preferred because the decoder's outputs can then be interpreted as per-pixel probabilities rather than continuous values.

Concrete example. Suppose x has five pixels with values [0.9, 0.1, 0.7, 0.3, 0.5] and the reconstruction is \hat{x} = [0.8, 0.2, 0.6, 0.4, 0.5]. The squared errors are:

(0.9-0.8)^2 + (0.1-0.2)^2 + (0.7-0.6)^2 + (0.3-0.4)^2 + (0.5-0.5)^2 = 0.01+0.01+0.01+0.01+0 = 0.04

Loss = 0.04 / 5 = 0.008. Gradient descent nudges \theta and \phi to shrink this number.
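The same five-pixel arithmetic, computed directly:

```python
import numpy as np

x     = np.array([0.9, 0.1, 0.7, 0.3, 0.5])  # original pixels
x_hat = np.array([0.8, 0.2, 0.6, 0.4, 0.5])  # reconstruction

loss = np.mean((x - x_hat) ** 2)             # pixel-wise MSE
print(round(loss, 6))  # 0.008
```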

What the Bottleneck Forces

Without the bottleneck, the network could copy xx straight through — a trivial identity mapping. The narrow middle layer prevents this. The encoder must find a lower-dimensional summary that retains enough information to reconstruct the input from the decoder side.

The bottleneck acts as a lossy filter: you keep what reconstruction needs; you discard what it doesn't. For natural images this means edges, colors, shapes — not pixel-level noise.

Practical Applications

Denoising Autoencoders

Corrupt the input: add Gaussian noise or randomly zero out pixels to get \tilde{x}. Feed \tilde{x} to the encoder but train to reconstruct the clean x:

L = \| x - g_\phi(f_\theta(\tilde{x})) \|^2

\tilde{x}: corrupted input
x: clean target

The model learns to undo noise, which forces the latent code to represent only genuine signal. At test time, feed in a noisy image and get a clean reconstruction.
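Both corruption schemes are a couple of lines of NumPy. The noise scale and masking rate here are illustrative choices; in a real training loop the corruption is drawn fresh for every batch while the clean x stays the target.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=784)                        # clean input

x_gauss = x + rng.normal(scale=0.1, size=784)    # additive Gaussian noise
mask = rng.random(784) > 0.25                    # keep ~75% of pixels
x_masked = x * mask                              # ~25% zeroed out

print(x_gauss.shape, round((x_masked == 0).mean(), 2))
```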

Anomaly Detection

Train on normal examples only (e.g., healthy medical images). The encoder-decoder pair learns a representation of normality. At test time, compute reconstruction error for a new sample. An anomaly lies off the learned manifold — the decoder cannot reconstruct it well, so error spikes. Threshold the error to flag anomalies.
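The flagging step reduces to a single comparison. The errors and threshold below are made-up values for illustration; in practice the threshold is usually chosen from a held-out set of normal data, e.g. a high percentile of its reconstruction errors.

```python
import numpy as np

# Reconstruction errors for five test samples (illustrative numbers);
# the last sample is far off the learned manifold.
errors = np.array([0.010, 0.012, 0.009, 0.011, 0.095])
threshold = 0.05  # hypothetical, tuned on normal validation data

flags = errors > threshold
print(flags)  # only the last sample is flagged as anomalous
```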

Feature Learning

The latent code z is a compressed feature vector. These features can be used as inputs to a downstream classifier, often outperforming raw pixels — especially when labeled data is scarce but unlabeled data is plentiful.

The Fatal Flaw for Generation

Here is the problem: the latent space has no structure. The encoder maps inputs to latent codes through a deterministic function trained only to minimize reconstruction. Two nearly identical images might land at very different points in \mathbb{R}^{32}. Large regions of latent space may correspond to nothing — the decoder was never trained on those z values.

If you try to generate by sampling z \sim \mathcal{N}(0, I) and decoding, you will mostly get incoherent noise, because most random points in \mathbb{R}^{32} are far from any encoded training example.
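A toy illustration of that geometry, with no real encoder involved: pretend the encoder happened to place all training codes in a tight cluster away from the origin (the cluster center is a made-up value), then measure how far a fresh \mathcal{N}(0, I) sample lands from the nearest one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend encoded training codes: a tight, off-origin cluster in R^32.
codes = rng.normal(loc=5.0, scale=0.1, size=(100, 32))

z = rng.normal(size=32)  # a fresh sample from N(0, I)

nearest = np.min(np.linalg.norm(codes - z, axis=1))
print(nearest)  # large: z sits in a region the decoder never saw
```

Nothing forces a plain autoencoder's codes to sit near the origin, or near each other, which is exactly the gap the next paragraph's fix addresses.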

The fix: constrain the encoder to produce latent codes that follow a known distribution — specifically \mathcal{N}(0, I). If every point in latent space corresponds to plausible data, sampling becomes meaningful. This is the Variational Autoencoder, and building it carefully is the work of the next three lessons.

Interactive example

Encode MNIST digits and visualize the 2D latent space — note the unstructured scatter of class clusters


Quiz

1 / 3

Why does forcing data through a narrow bottleneck make the encoder learn useful features?