
GAN training dynamics: mode collapse


GAN Training: What Goes Wrong and How to Fix It

The three main failure modes of GAN training — mode collapse, vanishing gradients, and oscillation — with intuitive explanations and the key solutions: WGAN, gradient penalty, and spectral normalization.

⏱ ~7 min

🧮 Quick refresher

Gradient vanishing in deep networks

When gradients are near zero, weights receive tiny updates and learning stalls. Vanishing gradients occur when activation functions saturate (output at a plateau) or when many layers multiply small numbers together. The generator in a GAN can suffer vanishing gradients when the discriminator is too strong.

Example

sigmoid(10) ≈ 1.0, sigmoid'(10) ≈ 0.00005.

A neuron with a large pre-activation contributes essentially zero gradient during backpropagation — it has saturated.
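
A quick NumPy check makes the saturation concrete (the printed values are exact to the precision shown):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for z in [0.0, 2.0, 10.0]:
    s = sigmoid(z)
    grad = s * (1.0 - s)   # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    print(f"z = {z:5.1f}   sigmoid = {s:.6f}   sigmoid' = {grad:.6f}")

# z =   0.0   sigmoid = 0.500000   sigmoid' = 0.250000
# z =   2.0   sigmoid = 0.880797   sigmoid' = 0.104994
# z =  10.0   sigmoid = 0.999955   sigmoid' = 0.000045
```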

GANs Are Hard to Train

The minimax equilibrium from lesson 14-6 is elegant in theory. In practice, getting two neural networks to converge to a Nash equilibrium is notoriously difficult. Unlike supervised learning — where a falling loss curve reliably signals progress — GAN losses oscillate, mislead, and sometimes never converge.

GAN training instability is the reason diffusion models overtook GANs as the dominant generative architecture. The failure modes described here — mode collapse, discriminator saturation, oscillation — are exactly what motivated the development of WGAN, StyleGAN, and ultimately the diffusion approach. Understanding them means understanding a decade of generative model research.

There are three classic failure modes. Understanding them is essential for any practical GAN work.

Failure Mode 1: Mode Collapse

What happens: The generator finds one output (or a handful) that reliably fools the discriminator, and collapses to producing only those outputs. It ignores the full diversity of the training distribution.

Why it happens: The generator's job is to maximize $\log D(G(z))$. If a single output $G(z) = x^*$ achieves $D(x^*) = 0.9$, the generator has little incentive to explore elsewhere — finding $D(x^{**}) = 0.91$ gives only marginally more reward, while the risk of a bad sample is high.

The collapse is especially insidious because metrics like discriminator loss may look fine — the fake samples are high quality — but diversity has disappeared.

Example: Training a GAN on MNIST, you want all 10 digit classes. A collapsed generator might produce only 3s (the most "average"-looking digit) regardless of what $z$ is fed in.

Partial fixes:

  • Minibatch discrimination: feed batches of samples to D, not just individual samples. D can then detect low diversity and penalize it (see the sketch after this list).
  • Unrolled GANs: train G against a D that has been optimized several steps into the future, reducing G's incentive to exploit current blind spots.
  • Multiple generators: use an ensemble of generators that are rewarded for producing diverse outputs.
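
A minimal PyTorch sketch of the minibatch-statistics idea, using the simplified standard-deviation variant popularized by ProGAN; the layer name and the single-scalar simplification are illustrative choices here, not the exact Salimans et al. construction:

```python
import torch
import torch.nn as nn

class MinibatchStdDev(nn.Module):
    """Append the batch's average feature standard deviation as an extra
    channel, so the discriminator can see how diverse the batch is."""
    def forward(self, x):                       # x: (N, C, H, W)
        std = x.std(dim=0)                      # per-feature std across the batch
        mean_std = std.mean()                   # one scalar: batch diversity
        stat = mean_std.expand(x.size(0), 1, x.size(2), x.size(3))
        return torch.cat([x, stat], dim=1)      # (N, C + 1, H, W)

# A collapsed generator produces near-identical samples, so mean_std ≈ 0,
# giving the discriminator an easy feature for flagging the whole batch as fake.
```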

Failure Mode 2: Vanishing Gradients

What happens: The discriminator becomes too good too fast. It classifies all real samples as real and all fakes as fake with high confidence. In this regime:

$$D(G(z)) \approx 0 \;\Rightarrow\; \log(1 - D(G(z))) \approx 0 \;\Rightarrow\; \nabla_\theta L_G \approx 0$$

  • $D(G(z))$: discriminator output for a fake sample — near 0 when D is strong
  • $\log(1 - D(G(z)))$: generator's original loss — near $\log(1) = 0$ when D is strong

The generator's loss saturates. Gradients vanish. Training stops.

The non-saturating fix (from lesson 14-6) helps early in training, but does not eliminate the problem. The deeper fix is to change the distance measure entirely.
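
A tiny autograd experiment shows the difference. Here `a` is the discriminator's logit for a fake sample (an illustrative value, chosen so that $D(G(z)) = \sigma(a) \approx 10^{-4}$):

```python
import torch

# Logit of the discriminator for a fake sample: a = -9 means
# D(G(z)) = sigmoid(-9) ≈ 1.2e-4, i.e. D confidently rejects the fake.
a = torch.tensor(-9.0, requires_grad=True)

# Original (saturating) generator loss: minimize log(1 - D(G(z))).
torch.log(1 - torch.sigmoid(a)).backward()
print(f"saturating loss:     dL/da = {a.grad.item():+.6f}")   # ≈ -0.000123, vanishing

a.grad = None

# Non-saturating loss: minimize -log D(G(z)).
(-torch.log(torch.sigmoid(a))).backward()
print(f"non-saturating loss: dL/da = {a.grad.item():+.6f}")   # ≈ -0.999877, healthy
```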

Failure Mode 3: Training Instability

Even without collapse or vanishing gradients, GAN losses oscillate wildly. The discriminator and generator loss curves do not decrease monotonically — they cycle up and down, making it hard to know if training is progressing. There is no reliable single-number stopping criterion.

This is a consequence of the minimax structure: you are optimizing a saddle point, not a minimum. Gradient descent on both players simultaneously does not have convergence guarantees analogous to single-objective gradient descent.
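
You can see this on the simplest possible minimax game. A plain-NumPy sketch (the game $f(x, y) = xy$ is a standard toy example, chosen here for illustration) of simultaneous gradient descent-ascent:

```python
import numpy as np

# Toy minimax game: min over x, max over y of f(x, y) = x * y.
# The unique saddle point is (0, 0), yet simultaneous gradient
# descent (on x) and ascent (on y) orbits it and spirals outward.
x, y, lr = 1.0, 1.0, 0.1
for step in range(1, 201):
    gx, gy = y, x                       # df/dx = y, df/dy = x
    x, y = x - lr * gx, y + lr * gy     # simultaneous updates
    if step % 50 == 0:
        print(f"step {step:3d}: x = {x:+.3f}, y = {y:+.3f}, "
              f"distance from equilibrium = {np.hypot(x, y):.3f}")

# Each update multiplies the distance from (0, 0) by sqrt(1 + lr**2):
# the iterates never converge, they oscillate with growing amplitude.
```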

The Main Solution: Wasserstein GAN

Arjovsky et al. (2017) diagnosed the root problem: the original GAN implicitly minimizes the Jensen–Shannon (JS) divergence between $p_{\text{data}}$ and $p_G$. JS divergence saturates (equals a constant) when the two distributions have disjoint support — which is common early in training when the generator is poor. Zero gradient, stuck training.

The fix: replace JS divergence with the Wasserstein distance (also called Earth Mover's Distance). It measures the minimum cost to "move" probability mass from $p_G$ to $p_{\text{data}}$.

$$W(p_{\text{data}}, p_G) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_G)} \mathbb{E}_{(x,y) \sim \gamma}\left[\|x - y\|\right]$$

  • $W(p, q)$: Wasserstein-1 distance between distributions $p$ and $q$
  • $\Pi(p, q)$: set of all joint distributions with marginals $p$ and $q$
  • $\mathbb{E}_{(x,y)\sim\gamma}$: expectation under the joint transport plan $\gamma$
  • $\|x - y\|$: cost to move mass from $y$ to $x$

Unlike JS, Wasserstein distance is continuous and differentiable almost everywhere — even when the distributions don't overlap. It always provides gradient signal.
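
A small numerical check of this claim on the WGAN paper's toy example of two point masses, one at 0 and one at $\theta$, using SciPy; the grid discretization is an illustrative choice:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

# p_data is a point mass at 0, p_G a point mass at theta.
# Both are represented on a shared 1-D grid for comparison.
grid = np.linspace(-5.0, 5.0, 1001)   # spacing 0.01

def point_mass(center):
    p = np.zeros_like(grid)
    p[np.argmin(np.abs(grid - center))] = 1.0
    return p

p_data = point_mass(0.0)
for theta in [3.0, 2.0, 1.0, 0.5]:
    p_g = point_mass(theta)
    js = jensenshannon(p_data, p_g) ** 2   # squared distance = JS divergence (nats)
    w1 = wasserstein_distance(grid, grid, u_weights=p_data, v_weights=p_g)
    print(f"theta = {theta:3.1f}   JS = {js:.4f} (stuck at log 2)   W1 = {w1:.4f}")
```

As the generator moves closer ($\theta$ shrinks), JS stays frozen at $\log 2$ while $W_1$ decreases smoothly — exactly the gradient signal the generator needs.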

WGAN in practice: replace the discriminator with a critic whose output ranges over $(-\infty, \infty)$ rather than $[0, 1]$, and enforce a Lipschitz constraint on the critic (required for the Wasserstein estimator to be valid):

  • Original WGAN: clip critic weights to $[-0.01, 0.01]$ after each update (crude but works)
  • WGAN-GP (gradient penalty): add $\lambda \cdot \mathbb{E}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big]$ to the critic loss (better; see the sketch below)
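
A minimal PyTorch sketch of the gradient-penalty term, assuming a `critic` that maps a batch of images of shape `(N, C, H, W)` to one score per sample; the function name and the `lam=10` default follow common implementations rather than any official API:

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP term: push the critic's gradient norm toward 1 at points
    interpolated between real and fake samples."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)  # per-sample mix
    x_hat = eps * real + (1 - eps) * fake.detach()
    x_hat.requires_grad_(True)
    scores = critic(x_hat)                                       # one score per sample
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat,
                                 create_graph=True)              # keep graph for backprop
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)        # ||grad D(x_hat)||_2
    return lam * ((grad_norm - 1.0) ** 2).mean()

# Critic objective to minimize:
#   critic(fake).mean() - critic(real).mean() + gradient_penalty(critic, real, fake)
```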

Practical Training Tips

In rough order of impact:

  1. Non-saturating G loss: maximize $\log D(G(z))$, not minimize $\log(1 - D(G(z)))$
  2. Wasserstein loss + gradient penalty (WGAN-GP): stable training with meaningful loss values
  3. Label smoothing: replace discriminator targets 1 → 0.9 to prevent overconfidence
  4. Different learning rates: lower LR for G than D (e.g., 0.0001 G, 0.0004 D with Adam β₁=0, β₂=0.9)
  5. Progressive growing (Karras et al., ProGAN): start training at 4×4 resolution, gradually add layers up to full resolution — stability improves dramatically at high resolutions
  6. Evaluate FID (Fréchet Inception Distance): measures both quality and diversity, harder to game than generator loss alone
$$\text{FID} = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$

  • $\mu_r, \Sigma_r$: mean and covariance of real image features (from an Inception network)
  • $\mu_g, \Sigma_g$: mean and covariance of generated image features
  • $\text{FID}$: Fréchet Inception Distance — lower is better, 0 is perfect
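
The formula translates directly into NumPy, assuming you have already extracted Inception features for real and generated images; the function name and feature shapes here are illustrative:

```python
import numpy as np
from scipy import linalg

def fid(feat_real, feat_gen):
    """FID from two feature matrices of shape (N, D), e.g. Inception-v3
    pool features for N real and N generated images."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    covmean = linalg.sqrtm(sigma_r @ sigma_g)   # matrix square root (Sigma_r Sigma_g)^(1/2)
    covmean = covmean.real                      # discard tiny imaginary numerical noise
    return np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean)
```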

A well-trained modern GAN (e.g., StyleGAN3 or StyleGAN-XL) achieves FID < 5 on standard benchmarks. Random noise has FID ≈ 300.
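
Several of the tips above take only a few lines. A sketch of tips 1, 3, and 4 in PyTorch, with tiny stand-in networks where a real generator and discriminator would go:

```python
import torch
import torch.nn as nn

# Tiny stand-in networks; a real G maps noise to images, a real D returns a logit.
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

# Tip 4: two-timescale Adam — slower G than D, with beta1 = 0, beta2 = 0.9.
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.9))
opt_D = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.9))

bce = nn.BCEWithLogitsLoss()
real, z = torch.randn(64, 2), torch.randn(64, 16)

# Discriminator step. Tip 3: real targets smoothed from 1.0 to 0.9.
loss_D = (bce(D(real), torch.full((64, 1), 0.9))
          + bce(D(G(z).detach()), torch.zeros(64, 1)))   # detach: no G update here
opt_D.zero_grad(); loss_D.backward(); opt_D.step()

# Generator step. Tip 1: non-saturating loss — push D(G(z)) toward "real".
loss_G = bce(D(G(z)), torch.ones(64, 1))
opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```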

