GANs Are Hard to Train
The minimax equilibrium from lesson 14-6 is elegant in theory. In practice, getting two neural networks to converge to a Nash equilibrium is notoriously difficult. Unlike supervised learning — where a monotonically decreasing loss signals progress — GAN losses oscillate, mislead, and sometimes never converge.
GAN training instability is the reason diffusion models overtook GANs as the dominant generative architecture. The failure modes described here — mode collapse, discriminator saturation, oscillation — are exactly what motivated the development of WGAN, StyleGAN, and ultimately the diffusion approach. Understanding them means understanding a decade of generative model research.
There are three classic failure modes. Understanding them is essential for any practical GAN work.
Failure Mode 1: Mode Collapse
What happens: The generator finds one output (or a handful) that reliably fools the discriminator, and collapses to producing only those outputs. It ignores the full diversity of the training distribution.
Why it happens: The generator's job is to maximize D(G(z)). If a single output x* achieves D(x*) ≈ 1, the generator has little incentive to explore elsewhere — finding other outputs with D ≈ 1 gives only marginally more reward, while the risk of a bad sample is high.
The collapse is especially insidious because metrics like discriminator loss may look fine — the fake samples are high quality — but diversity has disappeared.
Example: Training a GAN on MNIST, you want all 10 digit classes. A collapsed generator might produce only 3s (the most "average"-looking digit) regardless of which latent code z is fed in.
Partial fixes:
- Minibatch discrimination: feed batches of samples to D, not just individual samples. D can then detect low diversity and penalize it.
- Unrolled GANs: train G against a D that has been optimized several steps into the future, reducing G's incentive to exploit current blind spots.
- Multiple generators: use an ensemble of generators that are rewarded for producing diverse outputs.
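The first fix can be sketched in a few lines. This is a simplified variant — the single batch-stddev feature popularized by ProGAN rather than the full minibatch-discrimination tensor — and the function name and shapes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_stddev_feature(batch):
    """Append a batch-diversity scalar to every sample so the discriminator
    can see batch-level statistics (simplified minibatch-stddev sketch)."""
    std = batch.std(axis=0).mean()                # one diversity scalar per batch
    feat = np.full((batch.shape[0], 1), std)      # broadcast it to each sample
    return np.concatenate([batch, feat], axis=1)  # extra input feature for D

diverse = rng.normal(size=(8, 4))   # healthy batch: varied samples
collapsed = np.ones((8, 4))         # collapsed batch: identical samples

div_stat = minibatch_stddev_feature(diverse)[0, -1]
col_stat = minibatch_stddev_feature(collapsed)[0, -1]
print(div_stat, col_stat)  # collapse drives the diversity statistic to 0
```

A discriminator that receives this extra feature can learn that a diversity statistic of zero is itself evidence of fakeness, penalizing collapsed generators.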
Failure Mode 2: Vanishing Gradients
What happens: The discriminator becomes too good too fast. It classifies all real samples as real and all fakes as fake with high confidence. In this regime:
- D(G(z)) — discriminator output for a fake sample — near 0 when D is strong
- log(1 − D(G(z))) — the generator's original loss — near log(1) = 0 when D is strong
The generator's loss saturates. Gradients vanish. Training stops.
The non-saturating fix (from lesson 14-6) helps early in training, but does not eliminate the problem. The deeper fix is to change the distance measure entirely.
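The saturation is easy to verify numerically. A small sketch, assuming D(G(z)) = sigmoid(l) for a discriminator logit l, with both generator-loss gradients differentiated by hand:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gradients of the two generator losses w.r.t. the discriminator logit l,
# where D(G(z)) = sigmoid(l):
#   d/dl log(1 - sigmoid(l)) = -sigmoid(l)      (saturating, original loss)
#   d/dl log(sigmoid(l))     = 1 - sigmoid(l)   (non-saturating loss)
strong_d_logit = -10.0  # a strong D confidently rejecting the fake
grad_saturating = -sigmoid(strong_d_logit)
grad_nonsaturating = 1.0 - sigmoid(strong_d_logit)
print(grad_saturating, grad_nonsaturating)
# saturating gradient is ~1e-5 (vanished); non-saturating is ~1.0 (full signal)
```

The stronger the discriminator, the more negative the logit on fakes, and the closer the saturating gradient gets to zero while the non-saturating gradient approaches 1.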
Failure Mode 3: Training Instability
Even without collapse or vanishing gradients, GAN losses oscillate wildly. The discriminator and generator loss curves do not decrease monotonically — they cycle up and down, making it hard to know if training is progressing. There is no reliable single-number stopping criterion.
This is a consequence of the minimax structure: you are optimizing a saddle point, not a minimum. Gradient descent on both players simultaneously does not have convergence guarantees analogous to single-objective gradient descent.
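A minimal sketch of this saddle-point pathology on the simplest possible game, f(x, y) = xy (the step size and starting point here are illustrative):

```python
import numpy as np

# Toy saddle-point game: min over x, max over y of f(x, y) = x * y.
# The Nash equilibrium is (0, 0), yet simultaneous gradient descent/ascent
# spirals outward, never converging. This is the simplest picture of why
# GAN loss curves oscillate instead of decreasing.
eta = 0.1
x, y = 1.0, 1.0
dist = [np.hypot(x, y)]            # distance from the equilibrium (0, 0)
for _ in range(100):
    gx, gy = y, x                  # df/dx = y, df/dy = x
    x, y = x - eta * gx, y + eta * gy  # simultaneous update of both players
    dist.append(np.hypot(x, y))
print(dist[0], dist[-1])  # the distance from equilibrium grows every step
```

Each simultaneous step multiplies the distance from the equilibrium by sqrt(1 + eta²) > 1, so the iterates spiral outward no matter how small the step size.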
The Main Solution: Wasserstein GAN
Arjovsky et al. (2017) diagnosed the root problem: the original GAN implicitly minimizes the Jensen–Shannon (JS) divergence between the data distribution p_data and the generator distribution p_g. JS divergence saturates (equals the constant log 2) when the two distributions have disjoint support — which is common early in training when the generator is poor. Zero gradient, stuck training.
The fix: replace JS divergence with the Wasserstein distance (also called Earth Mover's Distance). It measures the minimum cost to "move" probability mass from p_g to p_data.
- W(p, q) = inf_{γ ∈ Π(p, q)} E_{(x, y) ∼ γ}[‖x − y‖] — Wasserstein-1 distance between distributions p and q
- Π(p, q) — the set of all joint distributions γ(x, y) with marginals p and q
- E_{(x, y) ∼ γ}[‖x − y‖] — expected cost under the joint transport plan γ
- ‖x − y‖ — cost to move a unit of mass from y to x
Unlike JS, Wasserstein distance is continuous and differentiable almost everywhere — even when the distributions don't overlap. It always provides gradient signal.
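The contrast can be made concrete with the standard point-mass example, here with the closed forms written out by hand rather than computed from samples:

```python
import numpy as np

# Textbook example in the spirit of Arjovsky et al.: P is a point mass at 0,
# Q is a point mass at theta. Both quantities below are hand-derived closed
# forms for this special case, not general-purpose estimators.
def js_divergence(theta):
    # Disjoint supports (theta != 0) saturate JS at its maximum, log 2.
    return 0.0 if theta == 0 else float(np.log(2))

def wasserstein_1(theta):
    # Moving one unit of mass from 0 to theta costs |theta|.
    return abs(theta)

for theta in (0.0, 0.5, 1.0, 2.0):
    print(theta, js_divergence(theta), wasserstein_1(theta))
# JS is flat in theta (zero gradient); W1 grows linearly (useful gradient)
```

Flat JS means a generator at theta = 2 gets exactly the same divergence value as one at theta = 0.5, so there is no signal pointing it toward the data; the Wasserstein distance slopes all the way down to the optimum.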
WGAN in practice: replace the discriminator with a critic whose output is an unbounded score in ℝ rather than a probability in [0, 1]. Enforce a Lipschitz constraint on the critic (required for the Wasserstein estimator to be valid):
- Original WGAN: clip critic weights to [−c, c] (c = 0.01 in the paper) after each update (crude but works)
- WGAN-GP (gradient penalty): add λ E_x̂[(‖∇_x̂ C(x̂)‖ − 1)²] to the critic loss, where x̂ is sampled along lines between real and fake samples (better)
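A rough NumPy sketch of the gradient-penalty term. Everything here is illustrative: the critic is a tiny fixed-weight MLP, and the gradient with respect to x̂ is estimated by finite differences where a real implementation would use autograd:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy critic: a tiny 2-layer MLP with fixed random weights (illustrative only).
W1, b1 = rng.normal(size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

def critic(x):
    h = np.tanh(x @ W1 + b1)
    return (h @ W2 + b2).squeeze(-1)

def gradient_penalty(real, fake, eps=1e-4, lam=10.0):
    # Sample x_hat on straight lines between real and fake samples.
    alpha = rng.uniform(size=(real.shape[0], 1))
    x_hat = alpha * real + (1 - alpha) * fake
    # Finite-difference estimate of grad_x critic(x_hat), one input dim at a time.
    grads = np.stack([
        (critic(x_hat + eps * np.eye(2)[i]) - critic(x_hat - eps * np.eye(2)[i]))
        / (2 * eps)
        for i in range(2)
    ], axis=1)
    norms = np.linalg.norm(grads, axis=1)
    # Penalize deviation of the gradient norm from 1 (the Lipschitz target).
    return lam * np.mean((norms - 1.0) ** 2)

real = rng.normal(size=(64, 2))
fake = rng.normal(loc=3.0, size=(64, 2))
gp = gradient_penalty(real, fake)
print(gp)  # the penalty is zero only when critic gradients have unit norm
```

In training, this penalty is added to the critic's Wasserstein loss each step, softly pushing the critic toward 1-Lipschitz behavior instead of hard weight clipping.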
Practical Training Tips
In rough order of impact:
- Non-saturating G loss: maximize log D(G(z)), not minimize log(1−D(G(z)))
- Wasserstein loss + gradient penalty (WGAN-GP): stable training with meaningful loss values
- Label smoothing: replace discriminator targets 1 → 0.9 to prevent overconfidence
- Different learning rates: lower LR for G than D (e.g., 0.0001 G, 0.0004 D with Adam β₁=0, β₂=0.9)
- Progressive growing (Karras et al., ProGAN): start training at 4×4 resolution, gradually add layers up to full resolution — stability improves dramatically at high resolutions
- Evaluate FID (Fréchet Inception Distance): measures both quality and diversity, harder to game than generator loss alone
- μ_r, Σ_r — mean and covariance of real image features (from an Inception network)
- μ_g, Σ_g — mean and covariance of generated image features
- FID = ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}) — Fréchet Inception Distance — lower is better, 0 is perfect
A well-trained modern GAN (StyleGAN3, StyleGAN-XL) achieves FID < 5 on standard benchmarks. Random noise has FID ≈ 300.
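As a sketch of the FID formula, here is the Fréchet distance restricted to diagonal covariances, where the matrix square root reduces to elementwise square roots (a simplification: real FID uses full covariance matrices of Inception features):

```python
import numpy as np

def frechet_distance(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances.

    The full FID uses Tr(S1 + S2 - 2 * sqrtm(S1 @ S2)); with diagonal
    covariances this reduces to elementwise square roots, which keeps the
    sketch dependency-free.
    """
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return mean_term + cov_term

# Identical distributions score 0; shifting every feature mean by 1 adds
# 1 per dimension to the mean term.
mu, var = np.zeros(4), np.ones(4)
print(frechet_distance(mu, var, mu, var))      # 0.0
print(frechet_distance(mu, var, mu + 1, var))  # 4.0
```

In real evaluation the means and covariances come from Inception-network activations over thousands of real and generated images, which is what makes the metric sensitive to both quality and diversity.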
Interactive example
Observe mode collapse, gradient vanishing, and successful training with WGAN-GP on a toy 2D distribution
Coming soon