
GANs: the minimax game


GANs: Two Networks Playing a Game

The generator-discriminator adversarial setup, the minimax objective derived from scratch, what the Nash equilibrium looks like, and the alternating training procedure.


🧮 Quick refresher

Binary cross-entropy loss

BCE(y, ŷ) = −[y·log(ŷ) + (1−y)·log(1−ŷ)] penalizes a classifier for outputting probability ŷ when the true label is y ∈ {0,1}. High confidence in the wrong answer incurs large loss.

Example

True label y=1, predicted ŷ=0.9: BCE = −log(0.9) ≈ 0.105.

True label y=1, predicted ŷ=0.1: BCE = −log(0.1) ≈ 2.30.

Confident wrong answers are severely penalized.
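
The two numbers above can be reproduced directly from the formula. A minimal sketch in plain Python:

```python
import math

def bce(y: float, y_hat: float) -> float:
    """Binary cross-entropy for a single prediction y_hat against label y."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(bce(1.0, 0.9))  # ~0.105: confident and correct, small loss
print(bce(1.0, 0.1))  # ~2.303: confident and wrong, large loss
```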

A New Approach: Don't Write Down P(x)

VAEs optimize an explicit objective for P(x) (the ELBO, a lower bound on log P(x)). GANs take a completely different route: forget writing down probabilities entirely. Instead, train a neural network to be a sampler. If the samples look like real data, you win.

The key insight from Goodfellow et al. (2014): pit two networks against each other. One tries to generate realistic data; the other tries to detect forgeries. Their competition drives both toward excellence.

Generative adversarial networks produce photorealistic images, synthetic training data, and the deepfakes you've heard about in the news. The adversarial training framework introduced here is one of the most influential ideas in deep learning — and understanding it explains both the power and the instability of GANs.

The Two Players

The generator G: takes a noise vector z ~ N(0, I), outputs a fake sample G(z).

The discriminator D: takes any input and outputs a probability in [0, 1]. High output means "looks real"; low means "looks fake."
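
To make the two players concrete, here is a minimal PyTorch sketch. The layer sizes, activations, and the LATENT_DIM/DATA_DIM constants are illustrative assumptions, not part of the lesson:

```python
import torch
import torch.nn as nn

LATENT_DIM, DATA_DIM = 64, 784  # illustrative sizes (e.g., flattened 28x28 images)

# Generator: noise vector z -> fake sample G(z) in data space
G = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, DATA_DIM), nn.Tanh(),  # outputs scaled to [-1, 1]
)

# Discriminator: any sample -> probability in [0, 1] that it is real
D = nn.Sequential(
    nn.Linear(DATA_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

z = torch.randn(16, LATENT_DIM)  # z ~ N(0, I)
fake = G(z)                      # G(z): a batch of fake samples
p_real = D(fake)                 # D(G(z)): "looks real" probabilities
```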

Deriving the Objective from BCE

The discriminator is a binary classifier: real samples (label 1) vs fake samples (label 0). Its standard loss is binary cross-entropy:

L_D = -\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] - \mathbb{E}_{z \sim \mathcal{N}(0,I)}\left[\log(1 - D(G(z)))\right]

  • x: a real sample from the training data
  • G(z): a fake sample from the generator
  • D(·): the discriminator's output probability

Minimizing L_D trains D to output 1 for real, 0 for fake. The generator wants the opposite; it wants D to output 1 for fakes:

L_G = \mathbb{E}_{z}\left[\log(1 - D(G(z)))\right]

  • L_G: generator loss; penalizes G when D correctly identifies its output as fake

G minimizes L_G (makes fakes harder to catch); D minimizes L_D (gets better at catching fakes). These are opposing forces, a minimax game:

\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z}\left[\log(1 - D(G(z)))\right]
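
Both losses translate directly into code. A minimal sketch, assuming d_real and d_fake are the discriminator's outputs on a real batch and a generated batch:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # L_D = -E[log D(x)] - E[log(1 - D(G(z)))]: BCE with labels 1 (real), 0 (fake)
    real_term = F.binary_cross_entropy(d_real, torch.ones_like(d_real))
    fake_term = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    return real_term + fake_term

def generator_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # L_G = E[log(1 - D(G(z)))]: the original minimax form
    # (see the non-saturating variant later in this lesson)
    return torch.log(1 - d_fake).mean()
```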

Nash Equilibrium: When Does Training End?

The game has a solution — a Nash equilibrium — when neither player can benefit by changing their strategy.

Step 1: Optimal discriminator for a fixed generator. Given a fixed G, at every point x the discriminator sees a mixture of real density p_data(x) and fake density p_G(x). The discriminator that maximizes the objective is:

D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}

  • p_data(x): probability density of x under the real data distribution
  • p_G(x): probability density of x under the generator's distribution
  • D^*(x): optimal discriminator for a given G

Derivation: pointwise, the integrand is p_data(x) log D(x) + p_G(x) log(1 − D(x)). Maximize over D(x) by setting the derivative to zero: p_data/D − p_G/(1 − D) = 0, which gives D* = p_data/(p_data + p_G).

Step 2: Optimal generator given optimal D. Substitute D* into the objective. The resulting generator minimizer satisfies p_G = p_data: the generator perfectly matches the data distribution. At this point D*(x) = 1/2 everywhere: the discriminator guesses randomly because it cannot distinguish real from fake.
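
Both steps can be sanity-checked numerically. A small sketch with made-up density values at a single point x:

```python
import numpy as np

p_data, p_g = 0.8, 0.2  # illustrative densities at one point x (assumptions)

# Step 1: maximize the pointwise integrand over candidate values of D(x)
d = np.linspace(1e-4, 1 - 1e-4, 100_000)
integrand = p_data * np.log(d) + p_g * np.log(1 - d)

print(d[np.argmax(integrand)])   # ~0.8, found numerically
print(p_data / (p_data + p_g))   # 0.8, the closed-form D*(x)

# Step 2: when p_G matches p_data, D*(x) collapses to 1/2 everywhere
print(p_data / (p_data + p_data))  # 0.5
```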

Training Procedure

GANs are trained with alternating updates:

Discriminator update (repeat k times, typically k = 1):

  1. Sample a minibatch of real data: x^{(1)}, \ldots, x^{(m)} \sim p_{\text{data}}
  2. Sample noise: z^{(1)}, \ldots, z^{(m)} \sim \mathcal{N}(0, I)
  3. Compute the discriminator gradient and update \phi to maximize:

\frac{1}{m}\sum_{i=1}^m \left[\log D_\phi(x^{(i)}) + \log(1 - D_\phi(G_\theta(z^{(i)})))\right]

Generator update (1 step):

  1. Sample fresh noise: z^{(1)}, \ldots, z^{(m)} \sim \mathcal{N}(0, I)
  2. Compute the generator gradient and update \theta to maximize \frac{1}{m}\sum_i \log D_\phi(G_\theta(z^{(i)}))

Note: step 2 uses the non-saturating variant — maximize log D(G(z)) rather than minimize log(1−D(G(z))). The reason matters and is explained below.
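
The whole procedure fits in a short function. A sketch of one training iteration, reusing the G, D, and LATENT_DIM from the earlier sketch; the Adam learning rate is an illustrative choice:

```python
import torch

opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = torch.nn.BCELoss()

def train_step(real_batch: torch.Tensor, k: int = 1):
    m = real_batch.size(0)

    for _ in range(k):  # discriminator update(s)
        z = torch.randn(m, LATENT_DIM)
        fake = G(z).detach()  # don't backprop into G during D's update
        loss_d = (bce(D(real_batch), torch.ones(m, 1))
                  + bce(D(fake), torch.zeros(m, 1)))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update: non-saturating loss. Maximizing log D(G(z)) is the
    # same as BCE of D(G(z)) against "real" labels.
    z = torch.randn(m, LATENT_DIM)
    loss_g = bce(D(G(z)), torch.ones(m, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```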

Non-Saturating Loss: The Critical Fix

Early in training, the generator is terrible. D(G(z)) ≈ 0 because fakes are obviously fake. In this regime:

\log(1 - D(G(z))) \approx \log(1) = 0 \quad \Rightarrow \quad \nabla_\theta \approx 0

Vanishing gradient: the generator receives almost no training signal when it needs it most. The fix is to instead train G to maximize \log D(G(z)):

\log D(G(z)) \approx \log(0) = -\infty \quad \Rightarrow \quad \text{large gradient}

Same Nash equilibrium (the generator still wants D(G(z)) = 1), but much stronger gradients early in training.

Numerical example. Suppose D outputs 0.05 for a fake sample (D is fairly confident it is fake).

  • Saturating generator loss contribution: log(1 − 0.05) = log(0.95) ≈ −0.051; its gradient with respect to D(G(z)) is −1/0.95 ≈ −1.05. Small.
  • Non-saturating generator loss: log(0.05) ≈ −2.996; its gradient is 1/0.05 = 20, roughly 19× larger in magnitude.

The non-saturating formulation is nearly always used in practice.
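
Autograd confirms the comparison. This small check differentiates both objectives with respect to the discriminator's output at 0.05:

```python
import torch

d_out = torch.tensor(0.05, requires_grad=True)   # D(G(z)) for an obvious fake
(g_sat,) = torch.autograd.grad(torch.log(1 - d_out), d_out)

d_out = torch.tensor(0.05, requires_grad=True)
(g_non,) = torch.autograd.grad(torch.log(d_out), d_out)

print(g_sat.item())  # -1/0.95 ~ -1.05  (saturating: tiny gradient)
print(g_non.item())  # 1/0.05 = 20.0    (non-saturating: ~19x larger)
```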

Interactive example (coming soon): watch the generator distribution evolve against a fixed 2D real distribution as training progresses.

Quiz


In the GAN minimax objective min_G max_D E[log D(x)] + E[log(1−D(G(z)))], what is the discriminator D trying to do?