Generative Models
Lesson 1 ⏱ 10 min

The generation problem: modeling p(x)


The Generation Problem: From Prediction to Creation

Why modeling the data distribution P(x) is fundamentally different from classification, and the three architectural strategies — autoregressive, latent variable, and implicit — that make it tractable.

⏱ ~7 min

🧮 Quick refresher

Conditional probability

P(y|x) is the probability of y given that we already know x. Discriminative models like classifiers learn exactly this: given an input, what is the likely label?

Example

A spam filter models P(spam | email text).

It does not try to generate new emails — just to classify the one in front of it.

What Does a Generative Model Do?

Every model you have built so far has answered the same question: given input $x$, what is $y$? A linear regression predicts a price. A classifier predicts a category. A transformer predicts the next token. All of these are discriminative models: they learn the conditional distribution $P(y \mid x)$.

Generative AI — the technology behind ChatGPT, Stable Diffusion, and GitHub Copilot — is entirely built on generative models. Autoencoders, VAEs, GANs, and diffusion models are the four foundational architectures this unit covers. Understanding them means understanding how AI can create, not just classify.

A generative model asks something fundamentally different: what does a typical $x$ look like? It learns the distribution $P(x)$ over the data itself — not over labels, but over raw inputs. With this distribution you can:

  • Sample new examples that look like the training data (generate images, text, audio)
  • Evaluate likelihood: assign a score $P(x)$ to any input
  • Detect anomalies: a sample with very low $P(x)$ is unusual
  • Complete partial inputs: given the first half of an image, infer the rest
  • Understand structure: the distribution reveals what the data "cares about"

These capabilities are qualitatively different from classification. You are not labeling — you are modeling reality.
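
As a concrete (if deliberately trivial) illustration of these operations, the sketch below fits the simplest possible $P(x)$, a one-dimensional Gaussian, to toy data and then samples from it, scores likelihoods, and flags an anomaly. All numbers are illustrative; real images and text need the far richer models covered in this unit.

```python
import numpy as np

# What "having P(x)" buys you, shown with the simplest possible generative
# model: a 1-D Gaussian fitted to toy training data. Real images and text
# need far richer models; the operations below are the same in spirit.

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=1.0, size=10_000)   # stand-in "real" data

mu, sigma = train.mean(), train.std()                  # fit P(x) = N(mu, sigma^2)

def density(x):
    """Evaluate the fitted density P(x) at any point."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# 1. Sample new examples that look like the training data
print(rng.normal(mu, sigma, size=5))

# 2. Evaluate likelihood: assign a score P(x) to any input
print(density(5.1))     # typical point: high density

# 3. Detect anomalies: very low P(x) means unusual
print(density(42.0))    # far from the data: effectively zero
```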

Why Is This Hard?

Consider a 256 × 256 color image: $256 \times 256 \times 3 = 196{,}608$ pixel values. A meaningful probability distribution must assign a number to every possible image — and the vast, overwhelming majority of random pixel arrangements look like static, not photographs.

The space has $256^{196{,}608}$ configurations. You cannot store a table. You cannot fit a histogram. You must find some compact parameterized structure that concentrates probability mass exactly where real images live.
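
A quick back-of-the-envelope computation (plain Python, using only the numbers above) shows just how large this space is:

```python
import math

# How many distinct 256 x 256 RGB images are there?
# Each of the 196,608 channel values takes one of 256 levels,
# so the space has 256**196608 configurations. Count its decimal digits:
n_values = 256 * 256 * 3                     # 196,608
digits = n_values * math.log10(256)          # log10(256**196608)
print(f"{n_values} values -> roughly 10^{digits:,.0f} possible images")
# roughly 10^473,479 configurations: no table or histogram can cover this.
```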

Three Approaches

Over the past decade, three broad strategies have emerged for tractable generative modeling.

1 — Autoregressive Models

Apply the chain rule of probability to factor the joint distribution one dimension at a time:

$$P(x) = \prod_{i=1}^{d} P(x_i \mid x_1, x_2, \ldots, x_{i-1})$$

  • $P(x)$: joint probability of the full data point $x$
  • $x_i$: the $i$-th component (e.g., one pixel, one token)
  • $x_{<i}$: all components before $i$

Each factor is modeled by a neural network conditioned on the previous values. GPT is an autoregressive model over tokens. PixelCNN is an autoregressive model over pixels. The advantage: exact likelihoods, stable training. The disadvantage: slow sequential sampling.
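
The sketch below is a minimal illustration of this recipe, not GPT or PixelCNN: a toy four-token vocabulary, a hand-rolled conditional that looks only at the previous token (the chain-rule bookkeeping is identical for richer contexts), exact log-likelihood, and slow one-token-at-a-time sampling. Every name and number here is illustrative.

```python
import numpy as np

# Minimal autoregressive sketch (illustrative, not GPT or PixelCNN).
# Here each conditional P(x_i | x_{<i}) looks only at the previous token,
# but the chain-rule bookkeeping is the same for any conditional model.

rng = np.random.default_rng(0)
VOCAB = 4                                    # tiny vocabulary: tokens 0..3
logits = rng.normal(size=(VOCAB, VOCAB))     # logits[prev] -> scores for next token

def conditional(prev_token):
    """P(x_i = . | x_{i-1} = prev_token) as a probability vector (softmax)."""
    z = logits[prev_token]
    p = np.exp(z - z.max())
    return p / p.sum()

def log_likelihood(x):
    """Exact log P(x) = sum_i log P(x_i | x_{<i}) via the chain rule."""
    total = np.log(1.0 / VOCAB)              # uniform P(x_1) for simplicity
    for prev, cur in zip(x[:-1], x[1:]):
        total += np.log(conditional(prev)[cur])
    return total

def sample(length):
    """Sequential (and therefore slow) sampling: draw x_i one at a time."""
    x = [rng.integers(VOCAB)]
    for _ in range(length - 1):
        x.append(rng.choice(VOCAB, p=conditional(x[-1])))
    return x

x = sample(8)
print(x, log_likelihood(x))
```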

2 — Latent Variable Models

Introduce a hidden (latent) variable $z$ and write:

$$P(x) = \int P(x \mid z)\, P(z)\, dz$$

  • $P(x)$: marginal distribution of observations
  • $P(x \mid z)$: likelihood, i.e. how $z$ generates $x$
  • $P(z)$: prior over latent variables (often $\mathcal{N}(0, I)$)
  • $dz$: integrate over all possible $z$ values

The idea: the high-dimensional data $x$ is a noisy, transformed view of a simpler low-dimensional latent code $z$. Variational Autoencoders (VAEs) and diffusion models fall here.
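
As a rough numerical illustration of the integral above (not how VAEs are actually trained), the sketch below uses a hypothetical fixed one-dimensional "decoder" and approximates $P(x)$ by Monte Carlo, sampling $z$ from the prior. Real models learn the decoder and optimize the ELBO, which lessons 14-3 and 14-4 cover.

```python
import numpy as np

# Sketch of the latent-variable integral (illustrative only).
# A hypothetical fixed 1-D "decoder" maps z ~ N(0, 1) to a Gaussian over x,
# and P(x) = integral of P(x|z) P(z) dz is approximated by Monte Carlo
# over prior samples. Real VAEs learn the decoder and optimize the ELBO.

rng = np.random.default_rng(0)

def decoder_mean(z):
    """Hypothetical decoder: a fixed nonlinearity standing in for a network."""
    return np.tanh(2.0 * z)

def log_p_x_given_z(x, z, sigma=0.1):
    """Gaussian likelihood P(x | z) = N(x; decoder_mean(z), sigma^2)."""
    mu = decoder_mean(z)
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def estimate_p_x(x, n_samples=100_000):
    """P(x) ~= (1/N) * sum_i P(x | z_i)  with  z_i ~ P(z) = N(0, 1)."""
    z = rng.standard_normal(n_samples)
    return np.exp(log_p_x_given_z(x, z)).mean()

print(estimate_p_x(0.5))   # plausible under the decoder: noticeable density
print(estimate_p_x(3.0))   # outside the decoder's range: essentially zero
```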

3 — Implicit Models (GANs)

Instead of writing down a formula for $P(x)$, train a sampler directly. A generator network $G$ maps noise $z \sim \mathcal{N}(0, I)$ to samples $x = G(z)$. The distribution of $G(z)$ is the implicit model. A discriminator network provides the training signal. You never compute $P(x)$ explicitly — but you can generate samples.
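
Here is a minimal sketch of the implicit-model interface, with a hand-picked transform standing in for a trained generator network: you can draw samples from $G(z)$, but at no point is there a formula for $P(x)$.

```python
import numpy as np

# Implicit-model sketch: a fixed "generator" maps noise to samples.
# There is no formula for P(x); the model *is* the sampling procedure.
# (In a real GAN, G is a neural network trained against a discriminator;
# here G is a hand-picked transform, just to show the interface.)

rng = np.random.default_rng(0)

def G(z):
    """Hypothetical generator: push 2-D Gaussian noise onto a noisy ring."""
    angle = np.arctan2(z[:, 1], z[:, 0])
    radius = 1.0 + 0.05 * z[:, 0]              # ring of radius ~1, slightly noisy
    return np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)

z = rng.standard_normal((1000, 2))             # z ~ N(0, I)
x = G(z)                                       # samples from the implicit distribution
print(np.linalg.norm(x, axis=1).mean())        # samples concentrate near radius 1
```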

A Roadmap for This Unit

This unit builds each approach from scratch:

Lesson | Model | Core idea
14-2 | Autoencoder | Bottleneck reconstruction
14-3 | VAE | Stochastic encoder
14-4 | ELBO | Principled VAE objective
14-5 | Reparameterization | Differentiating through sampling
14-6 | GAN | Adversarial training
14-7 | GAN dynamics | Mode collapse and fixes
14-8 | Diffusion (forward) | Scheduled noising
14-9 | Diffusion (reverse) | DDPM denoising
14-10 | Score matching | Unified theory

Every model in this unit is an answer to the same question: how do we compress a complex data distribution into a trainable neural network? Each gives a different trade-off between tractability, sample quality, and training stability.

Interactive example

Compare samples from autoregressive, VAE, GAN, and diffusion models on the same dataset


Quiz


A discriminative model learns P(y|x) while a generative model learns P(x). What does this mean practically?