Generative Models
Lesson 2 ⏱ 12 min

Autoencoders: compression and reconstruction


Autoencoders: Learning to Compress and Reconstruct

How an encoder-decoder architecture learns a compact representation by passing data through a bottleneck, with worked examples on reconstruction loss and the limitations that motivate VAEs.



Quick refresher

Mean squared error (MSE)

MSE measures how far predictions are from targets: MSE = (1/n)·Σ(yᵢ - ŷᵢ)². It penalizes large errors more than small ones. As a loss it pushes outputs toward the target values.

Example

If the true pixel value is 0.8 and the reconstruction is 0.5, the squared error contribution is (0.8−0.5)² = 0.09.
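The refresher above can be checked with a few lines of Python (the function name `mse` is illustrative, not from any particular library):

```python
def mse(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# The single-pixel example from the refresher: (0.8 - 0.5)^2 = 0.09
print(round(mse([0.8], [0.5]), 6))  # 0.09
```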

The Idea: Learn to Forget Wisely

An autoencoder learns by solving an unusual task: compress data, then reconstruct it exactly. If that sounds circular, it is — by design. The magic is the bottleneck in the middle that forces compression.

Autoencoders are used in practice for denoising images, detecting anomalies in industrial sensors, and compressing data for storage. They are also the direct precursor to VAEs — the architecture that made controllable latent space manipulation possible. Understanding autoencoders is the first step toward understanding every modern generative model.

The architecture has two parts:

z = f_\theta(x), \qquad \hat{x} = g_\phi(z)

f_\theta: encoder with parameters θ, maps input to latent code
g_\phi: decoder with parameters φ, maps latent code back to input
x: original input (e.g., a 784-dimensional image)
z: latent code, much lower-dimensional than x
\hat{x}: reconstruction, the decoder's output

The latent code z has far fewer dimensions than x. If x \in \mathbb{R}^{784} (a 28×28 MNIST image) and z \in \mathbb{R}^{32}, the bottleneck compresses by a factor of 784/32 ≈ 24.5×.
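The shapes involved can be sketched with single linear maps in NumPy. This is only a dimensional illustration, not a trained model: the weights are random, so the "reconstruction" is meaningless until gradient descent adjusts them.

```python
import numpy as np

rng = np.random.default_rng(0)

W_enc = rng.normal(size=(32, 784))   # f_theta: R^784 -> R^32 (linear sketch)
W_dec = rng.normal(size=(784, 32))   # g_phi:   R^32  -> R^784

x = rng.normal(size=784)             # a flattened 28x28 "image"
z = W_enc @ x                        # latent code at the bottleneck
x_hat = W_dec @ z                    # reconstruction, back in input space

print(x.shape, z.shape, x_hat.shape)  # (784,) (32,) (784,)
```

A real autoencoder would stack several nonlinear layers in each half, but the bottleneck dimension is what does the compressing.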

The Reconstruction Loss

Training minimizes how different the reconstruction is from the original:

L = \frac{1}{n} \sum_{i=1}^{n} \| x^{(i)} - \hat{x}^{(i)} \|^2

L: total reconstruction loss over n training examples
x^{(i)}: the i-th training example
\hat{x}^{(i)}: its reconstruction
n: number of training examples

This is pixel-wise MSE. For binary data (black-and-white images, 0/1 pixels), binary cross-entropy is often preferred because the decoder's outputs can then be interpreted as per-pixel probabilities rather than continuous values.

Concrete example. Suppose x has five pixels with values [0.9, 0.1, 0.7, 0.3, 0.5] and the reconstruction is \hat{x} = [0.8, 0.2, 0.6, 0.4, 0.5]. The squared errors are:

(0.9-0.8)^2 + (0.1-0.2)^2 + (0.7-0.6)^2 + (0.3-0.4)^2 + (0.5-0.5)^2 = 0.01+0.01+0.01+0.01+0 = 0.04

Loss = 0.04 / 5 = 0.008. Gradient descent nudges \theta and \phi to shrink this number.
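The same five-pixel arithmetic, computed directly:

```python
import numpy as np

x     = np.array([0.9, 0.1, 0.7, 0.3, 0.5])  # original pixels
x_hat = np.array([0.8, 0.2, 0.6, 0.4, 0.5])  # reconstruction

loss = np.mean((x - x_hat) ** 2)             # pixel-wise MSE
print(round(loss, 6))  # 0.008
```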

What the Bottleneck Forces

Without the bottleneck, the network could copy xx straight through — a trivial identity mapping. The narrow middle layer prevents this. The encoder must find a lower-dimensional summary that retains enough information to reconstruct the input from the decoder side.

The bottleneck acts as a lossy filter: you keep what reconstruction needs; you discard what it doesn't. For natural images this means edges, colors, shapes — not pixel-level noise.

Practical Applications

Denoising Autoencoders

Corrupt the input: add Gaussian noise or randomly zero out pixels to get \tilde{x}. Feed \tilde{x} to the encoder but train to reconstruct the clean x:

L = \| x - g_\phi(f_\theta(\tilde{x})) \|^2

\tilde{x}: corrupted input
x: clean target

The model learns to undo noise, which forces the latent code to represent only genuine signal. At test time, feed in a noisy image and get a clean reconstruction.
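Both corruption schemes are a couple of lines of NumPy. The noise scale and masking rate here are illustrative choices; in a real training loop the corruption is drawn fresh for every batch while the clean x stays the target.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=784)                        # clean input

x_gauss = x + rng.normal(scale=0.1, size=784)    # additive Gaussian noise
mask = rng.random(784) > 0.25                    # keep ~75% of pixels
x_masked = x * mask                              # ~25% zeroed out

print(x_gauss.shape, round((x_masked == 0).mean(), 2))
```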

Anomaly Detection

Train on normal examples only (e.g., healthy medical images). The encoder-decoder pair learns a representation of normality. At test time, compute reconstruction error for a new sample. An anomaly lies off the learned manifold — the decoder cannot reconstruct it well, so error spikes. Threshold the error to flag anomalies.
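The flagging step reduces to a single comparison. The errors and threshold below are made-up values for illustration; in practice the threshold is usually chosen from a held-out set of normal data, e.g. a high percentile of its reconstruction errors.

```python
import numpy as np

# Reconstruction errors for five test samples (illustrative numbers);
# the last sample is far off the learned manifold.
errors = np.array([0.010, 0.012, 0.009, 0.011, 0.095])
threshold = 0.05  # hypothetical, tuned on normal validation data

flags = errors > threshold
print(flags)  # only the last sample is flagged as anomalous
```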

Feature Learning

The latent code z is a compressed feature vector. These features can be used as inputs to a downstream classifier, often outperforming raw pixels — especially when labeled data is scarce but unlabeled data is plentiful.

The Fatal Flaw for Generation

Here is the problem: the latent space has no structure. The encoder maps inputs to latent codes through a deterministic function trained only to minimize reconstruction. Two nearly identical images might land at very different points in \mathbb{R}^{32}. Large regions of latent space may correspond to nothing — the decoder was never trained on those z values.

If you try to generate by sampling z \sim \mathcal{N}(0, I) and decoding, you will mostly get incoherent noise, because most random points in \mathbb{R}^{32} are far from any encoded training example.
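A toy illustration of that geometry, with no real encoder involved: pretend the encoder happened to place all training codes in a tight cluster away from the origin (the cluster center is a made-up value), then measure how far a fresh \mathcal{N}(0, I) sample lands from the nearest one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend encoded training codes: a tight, off-origin cluster in R^32.
codes = rng.normal(loc=5.0, scale=0.1, size=(100, 32))

z = rng.normal(size=32)  # a fresh sample from N(0, I)

nearest = np.min(np.linalg.norm(codes - z, axis=1))
print(nearest)  # large: z sits in a region the decoder never saw
```

Nothing forces a plain autoencoder's codes to sit near the origin, or near each other, which is exactly the gap the next paragraph's fix addresses.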

The fix: constrain the encoder to produce latent codes that follow a known distribution — specifically \mathcal{N}(0, I). If every point in latent space corresponds to plausible data, sampling becomes meaningful. This is the Variational Autoencoder, and building it carefully is the work of the next three lessons.

Interactive example

Encode MNIST digits and visualize the 2D latent space — note the unstructured scatter of class clusters


Quiz

1 / 3

Why does forcing data through a narrow bottleneck make the encoder learn useful features?