The generator-discriminator adversarial setup, the minimax objective derived from scratch, what the Nash equilibrium looks like, and the alternating training procedure.
⏱ ~8 min
🧮 Quick refresher
Binary cross-entropy loss
BCE(y, ŷ) = −[y·log(ŷ) + (1−y)·log(1−ŷ)] penalizes a classifier for outputting probability ŷ when the true label is y ∈ {0,1}. High confidence in the wrong answer incurs large loss.
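To make the refresher concrete, here is that formula in a few lines of NumPy (a minimal sketch; the function name bce and the example probabilities are our own):

```python
import numpy as np

def bce(y, y_hat):
    """Binary cross-entropy for a single prediction y_hat with true label y."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(bce(1, 0.9))  # confident and correct: ~0.105
print(bce(1, 0.1))  # confident and wrong:   ~2.303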
VAEs work with an explicit probability model of the data, optimizing the ELBO — a tractable lower bound on log P(x). GANs take a completely different route: forget writing down probabilities entirely. Instead, train a neural network to be a sampler. If the samples look like real data, you win.
The key insight from Goodfellow et al. (2014): pit two networks against each other. One tries to generate realistic data; the other tries to detect forgeries. Their competition drives both toward excellence.
Generative adversarial networks produce photorealistic images, synthetic training data, and the deepfakes you've heard about in the news. The adversarial training framework introduced here is one of the most influential ideas in deep learning — and understanding it explains both the power and the instability of GANs.
The Two Players
The generator G: takes a noise vector z ∼ N(0, I), outputs a fake sample G(z).
The discriminator D: takes any input x and outputs a probability D(x) in [0,1]. High output means "looks real"; low means "looks fake."
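To make the two players concrete, here is a minimal PyTorch sketch of both networks for flat vectors (the layer sizes and dimensions are illustrative assumptions, not from the original):

```python
import torch
import torch.nn as nn

LATENT_DIM, DATA_DIM = 64, 784  # illustrative: e.g., flattened 28x28 images

# Generator: noise vector z -> fake sample G(z)
G = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, DATA_DIM), nn.Tanh(),   # outputs scaled to [-1, 1]
)

# Discriminator: any sample x -> probability in [0, 1] that x is real
D = nn.Sequential(
    nn.Linear(DATA_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

z = torch.randn(16, LATENT_DIM)  # minibatch of noise
fake = G(z)                      # 16 fake samples
print(D(fake).shape)             # torch.Size([16, 1]) of probabilities
```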
Deriving the Objective from BCE
The discriminator is a binary classifier: real samples (label 1) vs fake samples (label 0). Its standard loss is binary cross-entropy. Averaging that BCE over real and fake samples and flipping the sign (so D maximizes rather than minimizes) yields the minimax objective of the game:

min_G max_D V(D, G) = E_{x∼pdata}[log D(x)] + E_{z∼N(0,I)}[log(1 − D(G(z)))]

The discriminator pushes V up by classifying correctly; the generator pushes V down by producing fakes that fool it.
The game has a solution — a Nash equilibrium — when neither player can benefit by changing their strategy.
Step 1: Optimal discriminator for a fixed generator. Given a fixed G, the discriminator sees an equal mixture of real samples (density pdata) and fake samples (density pG) at every point x. The discriminator that maximizes the objective is:
D∗(x) = pdata(x) / (pdata(x) + pG(x))
where:
pdata(x): probability density of x under the real data distribution
pG(x): probability density of x under the generator's distribution
D∗(x): optimal discriminator for a given G
Derivation: pointwise, the integrand is pdata(x)·log D(x) + pG(x)·log(1 − D(x)). Maximize over D(x) by setting the derivative to zero: pdata/D − pG/(1 − D) = 0 ⇒ D∗ = pdata/(pdata + pG).
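A quick numerical sanity check (the density values are made up for illustration): fix pdata(x) = 0.8 and pG(x) = 0.2 at some point x, scan over candidate outputs D(x), and confirm the integrand peaks at pdata/(pdata + pG) = 0.8:

```python
import numpy as np

p_data, p_g = 0.8, 0.2                 # illustrative densities at one point x
d = np.linspace(0.001, 0.999, 9999)    # candidate discriminator outputs
integrand = p_data * np.log(d) + p_g * np.log(1 - d)

print(d[np.argmax(integrand)])   # ~0.8, found by brute force
print(p_data / (p_data + p_g))   # 0.8, the closed-form optimum
# When p_g == p_data, the same formula gives D* = 0.5 everywhere.
```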
Step 2: Optimal generator given optimal D. Substitute D∗ into the objective. The resulting generator minimizer satisfies pG=pdata — the generator perfectly matches the data distribution. At this point D∗(x)=1/2 everywhere: the discriminator guesses randomly because it cannot distinguish real from fake.
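A worked consequence worth noting: substituting D∗ = 1/2 into the objective gives the equilibrium value V = log(1/2) + log(1/2) = −log 4 ≈ −1.386, a handy reference point when reading GAN loss curves.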
Training Procedure
GANs are trained with alternating updates:
Discriminator update (repeat k times, typically k=1):
Sample minibatch of real data: x(1),…,x(m)∼pdata
Sample noise: z(1),…,z(m)∼N(0,I)
Compute discriminator gradient and update ϕ to maximize:
(1/m) Σ_{i=1}^{m} [log Dϕ(x(i)) + log(1 − Dϕ(Gθ(z(i))))]
Generator update (1 step):
Sample fresh noise: z(1),…,z(m)∼N(0,I)
Compute generator gradient and update θ to maximize: (1/m) Σ_{i=1}^{m} log Dϕ(Gθ(z(i)))
Note: step 2 uses the non-saturating variant — maximize log D(G(z)) rather than minimize log(1−D(G(z))). The reason matters and is explained below.
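Putting the alternating procedure into code, here is a minimal PyTorch training-loop sketch (the network shapes, the stand-in real-data batch, learning rates, and k = 1 are all illustrative assumptions, not prescribed by the text):

```python
import torch
import torch.nn as nn

# Illustrative setup: G maps 64-d noise to 784-d samples; D outputs a probability.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
eps = 1e-8  # numerical safety inside log

for step in range(1000):
    real = torch.randn(128, 784)  # stand-in for a minibatch of real data

    # --- Discriminator update (k = 1): ascend log D(x) + log(1 - D(G(z))) ---
    z = torch.randn(128, 64)
    fake = G(z).detach()  # don't backprop into G during D's step
    loss_d = -(torch.log(D(real) + eps).mean()
               + torch.log(1 - D(fake) + eps).mean())
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # --- Generator update: ascend log D(G(z)) (non-saturating variant) ---
    z = torch.randn(128, 64)  # fresh noise
    loss_g = -torch.log(D(G(z)) + eps).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```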
Non-Saturating Loss: The Critical Fix
Early in training, the generator is terrible. D(G(z))≈0 because fakes are obviously fake. In this regime:
log(1 − D(G(z))) ≈ log(1) = 0 ⇒ ∇θ ≈ 0
Vanishing gradient: the generator receives almost no training signal when it needs it most. The fix: train G instead to maximize log D(G(z)):
log D(G(z)) ≈ log(0) → −∞ ⇒ large gradient (the slope 1/D(G(z)) blows up as D(G(z)) → 0)
Same Nash equilibrium (generator still wants D(G(z))=1), but much stronger gradients early in training.
Numerical example. Suppose D outputs 0.05 for a fake sample (D is fairly confident it is fake).
Saturating generator loss contribution: log(1 − 0.05) = log(0.95) ≈ −0.051, with slope −1/0.95 ≈ −1.05. Gradient: small.
Non-saturating contribution: log(0.05) ≈ −3.0, with slope 1/0.05 = 20. Gradient: roughly 19× larger, exactly when the generator needs the push.
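The same numbers fall out of autograd; a minimal PyTorch check (the variable d stands in for D(G(z)) and is our own naming):

```python
import torch

d = torch.tensor(0.05, requires_grad=True)  # D(G(z)) for an obviously fake sample

saturating = torch.log(1 - d)   # the term G minimizes in the original objective
saturating.backward()
print(d.grad)                   # -1/0.95 ≈ -1.05: weak signal

d.grad = None
non_saturating = torch.log(d)   # the term G maximizes in the non-saturating variant
non_saturating.backward()
print(d.grad)                   # 1/0.05 = 20.0: strong signal
```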