Generative Models
Lesson 1 ⏱ 10 min

The generation problem: modeling p(x)


The Generation Problem: From Prediction to Creation

Why modeling the data distribution P(x) is fundamentally different from classification, and the three architectural strategies — autoregressive, latent variable, and implicit — that make it tractable.

⏱ ~7 min

🧮 Quick refresher

Conditional probability

P(y|x) is the probability of y given that we already know x. Discriminative models like classifiers learn exactly this: given an input, what is the likely label?

Example

A spam filter models P(spam | email text).

It does not try to generate new emails — just to classify the one in front of it.

What Does a Generative Model Do?

Every model you have built so far has answered the same question: given input $x$, what is $y$? A linear regression predicts a price. A classifier predicts a category. A transformer predicts the next token. All of these are discriminative models: they learn the conditional distribution $P(y \mid x)$.

Generative AI — the technology behind ChatGPT, Stable Diffusion, and GitHub Copilot — is entirely built on generative models. Autoencoders, VAEs, GANs, and diffusion models are the four foundational architectures this unit covers. Understanding them means understanding how AI can create, not just classify.

A generative model asks something fundamentally different: what does a typical $x$ look like? It learns the distribution $P(x)$ over the data itself — not over labels, but over raw inputs. With this distribution you can:

  • Sample new examples that look like the training data (generate images, text, audio)
  • Evaluate likelihood: assign a score $P(x)$ to any input
  • Detect anomalies: a sample with very low $P(x)$ is unusual
  • Complete partial inputs: given the first half of an image, infer the rest
  • Understand structure: the distribution reveals what the data "cares about"

These capabilities are qualitatively different from classification. You are not labeling — you are modeling reality.
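
As a concrete (if deliberately trivial) illustration of these operations, the sketch below fits the simplest possible $P(x)$, a one-dimensional Gaussian, to toy data and then samples from it, scores likelihoods, and flags an anomaly. All numbers are illustrative; real images and text need the far richer models covered in this unit.

```python
import numpy as np

# What "having P(x)" buys you, shown with the simplest possible generative
# model: a 1-D Gaussian fitted to toy training data. Real images and text
# need far richer models; the operations below are the same in spirit.

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=1.0, size=10_000)   # stand-in "real" data

mu, sigma = train.mean(), train.std()                  # fit P(x) = N(mu, sigma^2)

def density(x):
    """Evaluate the fitted density P(x) at any point."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# 1. Sample new examples that look like the training data
print(rng.normal(mu, sigma, size=5))

# 2. Evaluate likelihood: assign a score P(x) to any input
print(density(5.1))     # typical point: high density

# 3. Detect anomalies: very low P(x) means unusual
print(density(42.0))    # far from the data: effectively zero
```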

Why Is This Hard?

Consider a 256 × 256 color image: $256 \times 256 \times 3 = 196{,}608$ pixel values. A meaningful probability distribution must assign a number to every possible image — and the vast, overwhelming majority of random pixel arrangements look like static, not photographs.

The space has $256^{196{,}608}$ configurations. You cannot store a table. You cannot fit a histogram. You must find some compact parameterized structure that concentrates probability mass exactly where real images live.
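
A quick back-of-the-envelope computation (plain Python, using only the numbers above) shows just how large this space is:

```python
import math

# How many distinct 256 x 256 RGB images are there?
# Each of the 196,608 channel values takes one of 256 levels,
# so the space has 256**196608 configurations. Count its decimal digits:
n_values = 256 * 256 * 3                     # 196,608
digits = n_values * math.log10(256)          # log10(256**196608)
print(f"{n_values} values -> roughly 10^{digits:,.0f} possible images")
# roughly 10^473,479 configurations: no table or histogram can cover this.
```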

Three Approaches

Over the past decade, three broad strategies have emerged for tractable generative modeling.

1 — Autoregressive Models

Apply the chain rule of probability to factor the joint distribution one dimension at a time:

$$P(x) = \prod_{i=1}^{d} P(x_i \mid x_1, x_2, \ldots, x_{i-1})$$

  • $P(x)$: joint probability of the full data point $x$
  • $x_i$: the $i$-th component (e.g., one pixel, one token)
  • $x_{<i}$: all components before $i$

Each factor is modeled by a neural network conditioned on the previous values. GPT is an autoregressive model over tokens. PixelCNN is an autoregressive model over pixels. The advantage: exact likelihoods, stable training. The disadvantage: slow sequential sampling.
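
The sketch below is a minimal illustration of this recipe, not GPT or PixelCNN: a toy four-token vocabulary, a hand-rolled conditional that looks only at the previous token (the chain-rule bookkeeping is identical for richer contexts), exact log-likelihood, and slow one-token-at-a-time sampling. Every name and number here is illustrative.

```python
import numpy as np

# Minimal autoregressive sketch (illustrative, not GPT or PixelCNN).
# Here each conditional P(x_i | x_{<i}) looks only at the previous token,
# but the chain-rule bookkeeping is the same for any conditional model.

rng = np.random.default_rng(0)
VOCAB = 4                                    # tiny vocabulary: tokens 0..3
logits = rng.normal(size=(VOCAB, VOCAB))     # logits[prev] -> scores for next token

def conditional(prev_token):
    """P(x_i = . | x_{i-1} = prev_token) as a probability vector (softmax)."""
    z = logits[prev_token]
    p = np.exp(z - z.max())
    return p / p.sum()

def log_likelihood(x):
    """Exact log P(x) = sum_i log P(x_i | x_{<i}) via the chain rule."""
    total = np.log(1.0 / VOCAB)              # uniform P(x_1) for simplicity
    for prev, cur in zip(x[:-1], x[1:]):
        total += np.log(conditional(prev)[cur])
    return total

def sample(length):
    """Sequential (and therefore slow) sampling: draw x_i one at a time."""
    x = [rng.integers(VOCAB)]
    for _ in range(length - 1):
        x.append(rng.choice(VOCAB, p=conditional(x[-1])))
    return x

x = sample(8)
print(x, log_likelihood(x))
```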

2 — Latent Variable Models

Introduce a hidden (latent) variable $z$ and write:

$$P(x) = \int P(x \mid z)\, P(z)\, dz$$

  • $P(x)$: marginal distribution of observations
  • $P(x \mid z)$: likelihood, i.e. how $z$ generates $x$
  • $P(z)$: prior over latent variables (often $\mathcal{N}(0, I)$)
  • $dz$: integrate over all possible $z$ values

The idea: the high-dimensional data $x$ is a noisy, transformed view of a simpler low-dimensional latent code $z$. Variational Autoencoders (VAEs) and diffusion models fall here.
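
As a rough numerical illustration of the integral above (not how VAEs are actually trained), the sketch below uses a hypothetical fixed one-dimensional "decoder" and approximates $P(x)$ by Monte Carlo, sampling $z$ from the prior. Real models learn the decoder and optimize the ELBO, which lessons 14-3 and 14-4 cover.

```python
import numpy as np

# Sketch of the latent-variable integral (illustrative only).
# A hypothetical fixed 1-D "decoder" maps z ~ N(0, 1) to a Gaussian over x,
# and P(x) = integral of P(x|z) P(z) dz is approximated by Monte Carlo
# over prior samples. Real VAEs learn the decoder and optimize the ELBO.

rng = np.random.default_rng(0)

def decoder_mean(z):
    """Hypothetical decoder: a fixed nonlinearity standing in for a network."""
    return np.tanh(2.0 * z)

def log_p_x_given_z(x, z, sigma=0.1):
    """Gaussian likelihood P(x | z) = N(x; decoder_mean(z), sigma^2)."""
    mu = decoder_mean(z)
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def estimate_p_x(x, n_samples=100_000):
    """P(x) ~= (1/N) * sum_i P(x | z_i)  with  z_i ~ P(z) = N(0, 1)."""
    z = rng.standard_normal(n_samples)
    return np.exp(log_p_x_given_z(x, z)).mean()

print(estimate_p_x(0.5))   # plausible under the decoder: noticeable density
print(estimate_p_x(3.0))   # outside the decoder's range: essentially zero
```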

3 — Implicit Models (GANs)

Instead of writing down a formula for $P(x)$, train a sampler directly. A generator network $G$ maps noise $z \sim \mathcal{N}(0, I)$ to samples $x = G(z)$. The distribution of $G(z)$ is the implicit model. A discriminator network provides the training signal. You never compute $P(x)$ explicitly — but you can generate samples.
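
Here is a minimal sketch of the implicit-model interface, with a hand-picked transform standing in for a trained generator network: you can draw samples from $G(z)$, but at no point is there a formula for $P(x)$.

```python
import numpy as np

# Implicit-model sketch: a fixed "generator" maps noise to samples.
# There is no formula for P(x); the model *is* the sampling procedure.
# (In a real GAN, G is a neural network trained against a discriminator;
# here G is a hand-picked transform, just to show the interface.)

rng = np.random.default_rng(0)

def G(z):
    """Hypothetical generator: push 2-D Gaussian noise onto a noisy ring."""
    angle = np.arctan2(z[:, 1], z[:, 0])
    radius = 1.0 + 0.05 * z[:, 0]              # ring of radius ~1, slightly noisy
    return np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)

z = rng.standard_normal((1000, 2))             # z ~ N(0, I)
x = G(z)                                       # samples from the implicit distribution
print(np.linalg.norm(x, axis=1).mean())        # samples concentrate near radius 1
```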

A Roadmap for This Unit

This unit builds each approach from scratch:

Lesson | Model | Core idea
14-2 | Autoencoder | Bottleneck reconstruction
14-3 | VAE | Stochastic encoder
14-4 | ELBO | Principled VAE objective
14-5 | Reparameterization | Differentiating through sampling
14-6 | GAN | Adversarial training
14-7 | GAN dynamics | Mode collapse and fixes
14-8 | Diffusion (forward) | Scheduled noising
14-9 | Diffusion (reverse) | DDPM denoising
14-10 | Score matching | Unified theory

Every model in this unit is an answer to the same question: how do we compress a complex data distribution into a trainable neural network? Each gives a different trade-off between tractability, sample quality, and training stability.

Interactive example

Compare samples from autoregressive, VAE, GAN, and diffusion models on the same dataset


Quiz


A discriminative model learns P(y|x) while a generative model learns P(x). What does this mean practically?