The generator-discriminator adversarial setup, the minimax objective derived from scratch, what the Nash equilibrium looks like, and the alternating training procedure.
⏱ ~8 min
🧮 Quick refresher
Binary cross-entropy loss
BCE(y, ŷ) = −[y·log(ŷ) + (1−y)·log(1−ŷ)] penalizes a classifier for outputting probability ŷ when the true label is y ∈ {0,1}. High confidence in the wrong answer incurs large loss.
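To make the refresher concrete, here is that formula in a few lines of NumPy (a minimal sketch; the function name bce and the example probabilities are our own):

```python
import numpy as np

def bce(y, y_hat):
    """Binary cross-entropy for a single prediction y_hat with true label y."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(bce(1, 0.9))  # confident and correct: ~0.105
print(bce(1, 0.1))  # confident and wrong:   ~2.303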
VAEs work with an explicit probability model of the data, optimizing the ELBO — a tractable lower bound on log P(x). GANs take a completely different route: forget writing down probabilities entirely. Instead, train a neural network to be a sampler. If the samples look like real data, you win.
The key insight from Goodfellow et al. (2014): pit two networks against each other. One tries to generate realistic data; the other tries to detect forgeries. Their competition drives both toward excellence.
Generative adversarial networks produce photorealistic images, synthetic training data, and the deepfakes you've heard about in the news. The adversarial training framework introduced here is one of the most influential ideas in deep learning — and understanding it explains both the power and the instability of GANs.
The Two Players
The generator G: takes a noise vector z ∼ N(0, I), outputs a fake sample G(z).
The discriminator D: takes any input x and outputs a probability D(x) in [0,1]. High output means "looks real"; low means "looks fake."
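To make the two players concrete, here is a minimal PyTorch sketch of both networks for flat vectors (the layer sizes and dimensions are illustrative assumptions, not from the original):

```python
import torch
import torch.nn as nn

LATENT_DIM, DATA_DIM = 64, 784  # illustrative: e.g., flattened 28x28 images

# Generator: noise vector z -> fake sample G(z)
G = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, DATA_DIM), nn.Tanh(),   # outputs scaled to [-1, 1]
)

# Discriminator: any sample x -> probability in [0, 1] that x is real
D = nn.Sequential(
    nn.Linear(DATA_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

z = torch.randn(16, LATENT_DIM)  # minibatch of noise
fake = G(z)                      # 16 fake samples
print(D(fake).shape)             # torch.Size([16, 1]) of probabilities
```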
Deriving the Objective from BCE
The discriminator is a binary classifier: real samples (label 1) vs fake samples (label 0). Its standard loss is binary cross-entropy. Averaging that BCE over real and fake samples and flipping the sign (so D maximizes rather than minimizes) yields the minimax objective of the game:

min_G max_D V(D, G) = E_{x∼pdata}[log D(x)] + E_{z∼N(0,I)}[log(1 − D(G(z)))]

The discriminator pushes V up by classifying correctly; the generator pushes V down by producing fakes that fool it.
The game has a solution — a Nash equilibrium — when neither player can benefit by changing their strategy.
Step 1: Optimal discriminator for a fixed generator. Given a fixed G, the discriminator sees an equal mixture of real samples (density pdata) and fake samples (density pG) at every point x. The discriminator that maximizes the objective is:
D∗(x) = pdata(x) / (pdata(x) + pG(x))
where:
pdata(x): probability density of x under the real data distribution
pG(x): probability density of x under the generator's distribution
D∗(x): optimal discriminator for a given G
Derivation: pointwise, the integrand is pdata(x)·log D(x) + pG(x)·log(1 − D(x)). Maximize over D(x) by setting the derivative to zero: pdata/D − pG/(1 − D) = 0 ⇒ D∗ = pdata/(pdata + pG).
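A quick numerical sanity check (the density values are made up for illustration): fix pdata(x) = 0.8 and pG(x) = 0.2 at some point x, scan over candidate outputs D(x), and confirm the integrand peaks at pdata/(pdata + pG) = 0.8:

```python
import numpy as np

p_data, p_g = 0.8, 0.2                 # illustrative densities at one point x
d = np.linspace(0.001, 0.999, 9999)    # candidate discriminator outputs
integrand = p_data * np.log(d) + p_g * np.log(1 - d)

print(d[np.argmax(integrand)])   # ~0.8, found by brute force
print(p_data / (p_data + p_g))   # 0.8, the closed-form optimum
# When p_g == p_data, the same formula gives D* = 0.5 everywhere.
```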
Step 2: Optimal generator given optimal D. Substitute D∗ into the objective. The resulting generator minimizer satisfies pG=pdata — the generator perfectly matches the data distribution. At this point D∗(x)=1/2 everywhere: the discriminator guesses randomly because it cannot distinguish real from fake.
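A worked consequence worth noting: substituting D∗ = 1/2 into the objective gives the equilibrium value V = log(1/2) + log(1/2) = −log 4 ≈ −1.386, a handy reference point when reading GAN loss curves.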
Training Procedure
GANs are trained with alternating updates:
Discriminator update (repeat k times, typically k=1):
Sample minibatch of real data: x(1),…,x(m)∼pdata
Sample noise: z(1),…,z(m)∼N(0,I)
Compute discriminator gradient and update ϕ to maximize:
(1/m) Σ_{i=1}^{m} [log Dϕ(x(i)) + log(1 − Dϕ(Gθ(z(i))))]
Generator update (1 step):
Sample fresh noise: z(1),…,z(m)∼N(0,I)
Compute generator gradient and update θ to maximize: (1/m) Σ_{i=1}^{m} log Dϕ(Gθ(z(i)))
Note: step 2 uses the non-saturating variant — maximize log D(G(z)) rather than minimize log(1−D(G(z))). The reason matters and is explained below.
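Putting the alternating procedure into code, here is a minimal PyTorch training-loop sketch (the network shapes, the stand-in real-data batch, learning rates, and k = 1 are all illustrative assumptions, not prescribed by the text):

```python
import torch
import torch.nn as nn

# Illustrative setup: G maps 64-d noise to 784-d samples; D outputs a probability.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
eps = 1e-8  # numerical safety inside log

for step in range(1000):
    real = torch.randn(128, 784)  # stand-in for a minibatch of real data

    # --- Discriminator update (k = 1): ascend log D(x) + log(1 - D(G(z))) ---
    z = torch.randn(128, 64)
    fake = G(z).detach()  # don't backprop into G during D's step
    loss_d = -(torch.log(D(real) + eps).mean()
               + torch.log(1 - D(fake) + eps).mean())
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # --- Generator update: ascend log D(G(z)) (non-saturating variant) ---
    z = torch.randn(128, 64)  # fresh noise
    loss_g = -torch.log(D(G(z)) + eps).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```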
Non-Saturating Loss: The Critical Fix
Early in training, the generator is terrible. D(G(z))≈0 because fakes are obviously fake. In this regime:
log(1 − D(G(z))) ≈ log(1) = 0 ⇒ ∇θ ≈ 0
Vanishing gradient: the generator receives almost no training signal when it needs it most. The fix: train G instead to maximize log D(G(z)):
log D(G(z)) ≈ log(0) → −∞ ⇒ large gradient (the slope 1/D(G(z)) blows up as D(G(z)) → 0)
Same Nash equilibrium (generator still wants D(G(z))=1), but much stronger gradients early in training.
Numerical example. Suppose D outputs 0.05 for a fake sample (D is fairly confident it is fake).
Saturating generator loss contribution: log(1 − 0.05) = log(0.95) ≈ −0.051, with slope −1/0.95 ≈ −1.05. Gradient: small.
Non-saturating contribution: log(0.05) ≈ −3.0, with slope 1/0.05 = 20. Gradient: roughly 19× larger, exactly when the generator needs the push.
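The same numbers fall out of autograd; a minimal PyTorch check (the variable d stands in for D(G(z)) and is our own naming):

```python
import torch

d = torch.tensor(0.05, requires_grad=True)  # D(G(z)) for an obviously fake sample

saturating = torch.log(1 - d)   # the term G minimizes in the original objective
saturating.backward()
print(d.grad)                   # -1/0.95 ≈ -1.05: weak signal

d.grad = None
non_saturating = torch.log(d)   # the term G maximizes in the non-saturating variant
non_saturating.backward()
print(d.grad)                   # 1/0.05 = 20.0: strong signal
```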