
GAN training dynamics: mode collapse


GAN Training: What Goes Wrong and How to Fix It

The three main failure modes of GAN training — mode collapse, vanishing gradients, and oscillation — with intuitive explanations and the key solutions: WGAN, gradient penalty, and spectral normalization.

⏱ ~7 min

🧮 Quick refresher

Gradient vanishing in deep networks

When gradients are near zero, weights receive tiny updates and learning stalls. Vanishing gradients occur when activation functions saturate (output at a plateau) or when many layers multiply small numbers together. The generator in a GAN can suffer vanishing gradients when the discriminator is too strong.

Example

sigmoid(10) ≈ 1.0, sigmoid'(10) ≈ 0.00005.

A neuron with a large pre-activation contributes essentially zero gradient during backpropagation — it has saturated.
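
A quick NumPy check makes the saturation concrete (the printed values are exact to the precision shown):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for z in [0.0, 2.0, 10.0]:
    s = sigmoid(z)
    grad = s * (1.0 - s)   # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    print(f"z = {z:5.1f}   sigmoid = {s:.6f}   sigmoid' = {grad:.6f}")

# z =   0.0   sigmoid = 0.500000   sigmoid' = 0.250000
# z =   2.0   sigmoid = 0.880797   sigmoid' = 0.104994
# z =  10.0   sigmoid = 0.999955   sigmoid' = 0.000045
```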

GANs Are Hard to Train

The minimax equilibrium from lesson 14-6 is elegant in theory. In practice, getting two neural networks to converge to a Nash equilibrium is notoriously difficult. Unlike supervised learning — where a falling loss curve reliably signals progress — GAN losses oscillate, mislead, and sometimes never converge.

GAN training instability is the reason diffusion models overtook GANs as the dominant generative architecture. The failure modes described here — mode collapse, discriminator saturation, oscillation — are exactly what motivated the development of WGAN, StyleGAN, and ultimately the diffusion approach. Understanding them means understanding a decade of generative model research.

There are three classic failure modes. Understanding them is essential for any practical GAN work.

Failure Mode 1: Mode Collapse

What happens: The generator finds one output (or a handful) that reliably fools the discriminator, and collapses to producing only those outputs. It ignores the full diversity of the training distribution.

Why it happens: The generator's job is to maximize $\log D(G(z))$. If a single output $G(z) = x^*$ achieves $D(x^*) = 0.9$, the generator has little incentive to explore elsewhere — finding $D(x^{**}) = 0.91$ gives only marginally more reward, while the risk of a bad sample is high.

The collapse is especially insidious because metrics like discriminator loss may look fine — the fake samples are high quality — but diversity has disappeared.

Example: Training a GAN on MNIST, you want all 10 digit classes. A collapsed generator might produce only 3s (the most "average"-looking digit) regardless of what $z$ is fed in.

Partial fixes:

  • Minibatch discrimination: feed batches of samples to D, not just individual samples. D can then detect low diversity and penalize it (see the sketch after this list).
  • Unrolled GANs: train G against a D that has been optimized several steps into the future, reducing G's incentive to exploit current blind spots.
  • Multiple generators: use an ensemble of generators that are rewarded for producing diverse outputs.
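
A minimal PyTorch sketch of the minibatch-statistics idea, using the simplified standard-deviation variant popularized by ProGAN; the layer name and the single-scalar simplification are illustrative choices here, not the exact Salimans et al. construction:

```python
import torch
import torch.nn as nn

class MinibatchStdDev(nn.Module):
    """Append the batch's average feature standard deviation as an extra
    channel, so the discriminator can see how diverse the batch is."""
    def forward(self, x):                       # x: (N, C, H, W)
        std = x.std(dim=0)                      # per-feature std across the batch
        mean_std = std.mean()                   # one scalar: batch diversity
        stat = mean_std.expand(x.size(0), 1, x.size(2), x.size(3))
        return torch.cat([x, stat], dim=1)      # (N, C + 1, H, W)

# A collapsed generator produces near-identical samples, so mean_std ≈ 0,
# giving the discriminator an easy feature for flagging the whole batch as fake.
```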

Failure Mode 2: Vanishing Gradients

What happens: The discriminator becomes too good too fast. It classifies all real samples as real and all fakes as fake with high confidence. In this regime:

$$D(G(z)) \approx 0 \;\Rightarrow\; \log(1 - D(G(z))) \approx 0 \;\Rightarrow\; \nabla_\theta L_G \approx 0$$

  • $D(G(z))$: discriminator output for a fake sample — near 0 when D is strong
  • $\log(1 - D(G(z)))$: generator's original loss — near $\log(1) = 0$ when D is strong

The generator's loss saturates. Gradients vanish. Training stops.

The non-saturating fix (from lesson 14-6) helps early in training, but does not eliminate the problem. The deeper fix is to change the distance measure entirely.
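
A tiny autograd experiment shows the difference. Here `a` is the discriminator's logit for a fake sample (an illustrative value, chosen so that $D(G(z)) = \sigma(a) \approx 10^{-4}$):

```python
import torch

# Logit of the discriminator for a fake sample: a = -9 means
# D(G(z)) = sigmoid(-9) ≈ 1.2e-4, i.e. D confidently rejects the fake.
a = torch.tensor(-9.0, requires_grad=True)

# Original (saturating) generator loss: minimize log(1 - D(G(z))).
torch.log(1 - torch.sigmoid(a)).backward()
print(f"saturating loss:     dL/da = {a.grad.item():+.6f}")   # ≈ -0.000123, vanishing

a.grad = None

# Non-saturating loss: minimize -log D(G(z)).
(-torch.log(torch.sigmoid(a))).backward()
print(f"non-saturating loss: dL/da = {a.grad.item():+.6f}")   # ≈ -0.999877, healthy
```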

Failure Mode 3: Training Instability

Even without collapse or vanishing gradients, GAN losses oscillate wildly. The discriminator and generator loss curves do not decrease monotonically — they cycle up and down, making it hard to know if training is progressing. There is no reliable single-number stopping criterion.

This is a consequence of the minimax structure: you are optimizing a saddle point, not a minimum. Gradient descent on both players simultaneously does not have convergence guarantees analogous to single-objective gradient descent.
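
You can see this on the simplest possible minimax game. A plain-NumPy sketch (the game $f(x, y) = xy$ is a standard toy example, chosen here for illustration) of simultaneous gradient descent-ascent:

```python
import numpy as np

# Toy minimax game: min over x, max over y of f(x, y) = x * y.
# The unique saddle point is (0, 0), yet simultaneous gradient
# descent (on x) and ascent (on y) orbits it and spirals outward.
x, y, lr = 1.0, 1.0, 0.1
for step in range(1, 201):
    gx, gy = y, x                       # df/dx = y, df/dy = x
    x, y = x - lr * gx, y + lr * gy     # simultaneous updates
    if step % 50 == 0:
        print(f"step {step:3d}: x = {x:+.3f}, y = {y:+.3f}, "
              f"distance from equilibrium = {np.hypot(x, y):.3f}")

# Each update multiplies the distance from (0, 0) by sqrt(1 + lr**2):
# the iterates never converge, they oscillate with growing amplitude.
```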

The Main Solution: Wasserstein GAN

Arjovsky et al. (2017) diagnosed the root problem: the original GAN implicitly minimizes the Jensen–Shannon (JS) divergence between $p_{\text{data}}$ and $p_G$. JS divergence saturates (equals a constant) when the two distributions have disjoint support — which is common early in training when the generator is poor. Zero gradient, stuck training.

The fix: replace JS divergence with the Wasserstein distance (also called Earth Mover's Distance). It measures the minimum cost to "move" probability mass from $p_G$ to $p_{\text{data}}$.

$$W(p_{\text{data}}, p_G) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_G)} \mathbb{E}_{(x,y) \sim \gamma}\left[\|x - y\|\right]$$

  • $W(p, q)$: Wasserstein-1 distance between distributions $p$ and $q$
  • $\Pi(p, q)$: set of all joint distributions with marginals $p$ and $q$
  • $\mathbb{E}_{(x,y)\sim\gamma}$: expectation under the joint transport plan $\gamma$
  • $\|x - y\|$: cost to move mass from $y$ to $x$

Unlike JS, Wasserstein distance is continuous and differentiable almost everywhere — even when the distributions don't overlap. It always provides gradient signal.
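
A small numerical check of this claim on the WGAN paper's toy example of two point masses, one at 0 and one at $\theta$, using SciPy; the grid discretization is an illustrative choice:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

# p_data is a point mass at 0, p_G a point mass at theta.
# Both are represented on a shared 1-D grid for comparison.
grid = np.linspace(-5.0, 5.0, 1001)   # spacing 0.01

def point_mass(center):
    p = np.zeros_like(grid)
    p[np.argmin(np.abs(grid - center))] = 1.0
    return p

p_data = point_mass(0.0)
for theta in [3.0, 2.0, 1.0, 0.5]:
    p_g = point_mass(theta)
    js = jensenshannon(p_data, p_g) ** 2   # squared distance = JS divergence (nats)
    w1 = wasserstein_distance(grid, grid, u_weights=p_data, v_weights=p_g)
    print(f"theta = {theta:3.1f}   JS = {js:.4f} (stuck at log 2)   W1 = {w1:.4f}")
```

As the generator moves closer ($\theta$ shrinks), JS stays frozen at $\log 2$ while $W_1$ decreases smoothly — exactly the gradient signal the generator needs.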

WGAN in practice: replace the discriminator with a critic whose output ranges over $(-\infty, \infty)$ rather than $[0, 1]$, and enforce a Lipschitz constraint on the critic (required for the Wasserstein estimator to be valid):

  • Original WGAN: clip critic weights to $[-0.01, 0.01]$ after each update (crude but works)
  • WGAN-GP (gradient penalty): add $\lambda \cdot \mathbb{E}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big]$ to the critic loss (better; see the sketch below)
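
A minimal PyTorch sketch of the gradient-penalty term, assuming a `critic` that maps a batch of images of shape `(N, C, H, W)` to one score per sample; the function name and the `lam=10` default follow common implementations rather than any official API:

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP term: push the critic's gradient norm toward 1 at points
    interpolated between real and fake samples."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)  # per-sample mix
    x_hat = eps * real + (1 - eps) * fake.detach()
    x_hat.requires_grad_(True)
    scores = critic(x_hat)                                       # one score per sample
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat,
                                 create_graph=True)              # keep graph for backprop
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)        # ||grad D(x_hat)||_2
    return lam * ((grad_norm - 1.0) ** 2).mean()

# Critic objective to minimize:
#   critic(fake).mean() - critic(real).mean() + gradient_penalty(critic, real, fake)
```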

Practical Training Tips

In rough order of impact:

  1. Non-saturating G loss: maximize $\log D(G(z))$, not minimize $\log(1 - D(G(z)))$
  2. Wasserstein loss + gradient penalty (WGAN-GP): stable training with meaningful loss values
  3. Label smoothing: replace discriminator targets 1 → 0.9 to prevent overconfidence
  4. Different learning rates: lower LR for G than D (e.g., 0.0001 G, 0.0004 D with Adam β₁=0, β₂=0.9)
  5. Progressive growing (Karras et al., ProGAN): start training at 4×4 resolution, gradually add layers up to full resolution — stability improves dramatically at high resolutions
  6. Evaluate FID (Fréchet Inception Distance): measures both quality and diversity, harder to game than generator loss alone
$$\text{FID} = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$

  • $\mu_r, \Sigma_r$: mean and covariance of real image features (from an Inception network)
  • $\mu_g, \Sigma_g$: mean and covariance of generated image features
  • $\text{FID}$: Fréchet Inception Distance — lower is better, 0 is perfect
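
The formula translates directly into NumPy, assuming you have already extracted Inception features for real and generated images; the function name and feature shapes here are illustrative:

```python
import numpy as np
from scipy import linalg

def fid(feat_real, feat_gen):
    """FID from two feature matrices of shape (N, D), e.g. Inception-v3
    pool features for N real and N generated images."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    covmean = linalg.sqrtm(sigma_r @ sigma_g)   # matrix square root (Sigma_r Sigma_g)^(1/2)
    covmean = covmean.real                      # discard tiny imaginary numerical noise
    return np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean)
```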

A well-trained modern GAN (e.g., StyleGAN3 or StyleGAN-XL) achieves FID < 5 on standard benchmarks. Random noise has FID ≈ 300.
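
Several of the tips above take only a few lines. A sketch of tips 1, 3, and 4 in PyTorch, with tiny stand-in networks where a real generator and discriminator would go:

```python
import torch
import torch.nn as nn

# Tiny stand-in networks; a real G maps noise to images, a real D returns a logit.
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

# Tip 4: two-timescale Adam — slower G than D, with beta1 = 0, beta2 = 0.9.
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.9))
opt_D = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.9))

bce = nn.BCEWithLogitsLoss()
real, z = torch.randn(64, 2), torch.randn(64, 16)

# Discriminator step. Tip 3: real targets smoothed from 1.0 to 0.9.
loss_D = (bce(D(real), torch.full((64, 1), 0.9))
          + bce(D(G(z).detach()), torch.zeros(64, 1)))   # detach: no G update here
opt_D.zero_grad(); loss_D.backward(); opt_D.step()

# Generator step. Tip 1: non-saturating loss — push D(G(z)) toward "real".
loss_G = bce(D(G(z)), torch.ones(64, 1))
opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```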

