You have a model that is overfitting. The weights have grown large - the model found a complicated, wiggly function that memorizes training data. The fix is conceptually simple: penalize large weights. If the model wants to keep a large weight, it must pay for it in the loss.
Think of it like a government charging permit fees for complexity: you can build a large structure, but each additional floor costs you. The developer (model) will only build taller if the benefit outweighs the permit fee. L2 regularization makes "complexity" (weight magnitude) costly, so the model only keeps weights that genuinely earn their place.
That is the entire idea behind L2 regularization. Everything else is just math.
The Penalty Term
Instead of minimizing just the original data loss, you minimize a penalized loss:
$$L_{\text{total}} = L_{\text{data}} + \lambda \sum_i w_i^2$$

- $L_{\text{data}}$ - original loss (cross-entropy, MSE, etc.)
- $\lambda$ - regularization strength hyperparameter
- $\sum_i w_i^2$ - sum of squared weights

where $\lambda \ge 0$ controls penalty strength.
The squared term means large weights are penalized much more than small weights. A weight of 10 contributes 100 to the penalty; a weight of 0.1 contributes only 0.01.
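The penalty is cheap to compute directly. A minimal sketch in Python (function names are illustrative, not from any library):

```python
# Minimal sketch of the L2-penalized loss; names are illustrative.
def l2_penalty(weights, lam):
    """lam times the sum of squared weights."""
    return lam * sum(w ** 2 for w in weights)

def penalized_loss(data_loss, weights, lam):
    """Total loss = original data loss + L2 penalty."""
    return data_loss + l2_penalty(weights, lam)

# The squared term makes large weights disproportionately expensive:
print(l2_penalty([10.0], lam=1.0))   # 100.0
print(l2_penalty([0.1], lam=1.0))    # ~0.01
```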
What This Does to the Gradient
Taking the gradient of $L_{\text{total}}$ with respect to a single weight $w$:

$$\frac{\partial L_{\text{total}}}{\partial w} = \frac{\partial L_{\text{data}}}{\partial w} + 2\lambda w$$

- $\partial L_{\text{data}}/\partial w$ - gradient from data loss
- $2\lambda w$ - extra gradient term from L2 penalty

The L2 penalty adds $2\lambda w$ to the gradient. That extra term always points in the direction of $-w$ - it constantly pushes the update toward reducing $|w|$.
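A quick way to convince yourself of the extra term is a finite-difference check on a toy quadratic data loss (the loss and all constants below are illustrative):

```python
# Finite-difference check that the penalized gradient picks up 2*lam*w.
lam = 0.01

def data_loss(w):
    return (w - 3.0) ** 2            # toy data loss, minimum at w = 3

def total_loss(w):
    return data_loss(w) + lam * w ** 2

def numeric_grad(f, w, eps=1e-6):
    """Central finite-difference approximation of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 2.0
analytic = 2 * (w - 3.0) + 2 * lam * w   # data gradient + 2*lam*w
print(abs(numeric_grad(total_loss, w) - analytic))   # tiny finite-diff error
```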
The Weight Decay Update
Plugging into the gradient descent update rule:

$$w \leftarrow w - \eta\left(\frac{\partial L_{\text{data}}}{\partial w} + 2\lambda w\right) = (1 - 2\eta\lambda)\,w - \eta\,\frac{\partial L_{\text{data}}}{\partial w}$$

- $\eta$ - learning rate
- $(1 - 2\eta\lambda)$ - weight decay factor - multiplied against the weight each step

Notice the factor $(1 - 2\eta\lambda)$. It is multiplied against $w$ at every single update step, regardless of the data gradient. It shrinks the weight by a constant fraction each step.

Concrete example: learning rate $\eta = 0.1$, $\lambda = 0.01$:

$$(1 - 2\eta\lambda) = 1 - 2 \times 0.1 \times 0.01 = 0.998$$

Every step, the weight is first multiplied by 0.998 - shrunk by 0.2% - then the normal gradient step is applied. This is why L2 regularization is also called weight decay.
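The decay factor is easiest to see in isolation. A sketch with the same $\eta$ and $\lambda$ as above, assuming (for clarity) a zero data gradient so only the shrinkage acts:

```python
# Decay factor in isolation: eta and lam as in the example above,
# with the data gradient assumed zero so only shrinkage acts.
eta, lam = 0.1, 0.01
decay = 1 - 2 * eta * lam      # = 0.998: shrink by 0.2% per step

w = 5.0
for _ in range(1000):
    w = decay * w              # data-gradient step omitted here
print(w)                       # exponential decay toward zero
```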
Interactive example
Weight decay demo - watch a single weight decay over steps with adjustable lambda
Coming soon
The Geometric Intuition
Picture a 2D weight space with axes $w_1$ and $w_2$. Without regularization, training finds the minimum of the data loss - call it $w^*$.

L2 regularization adds circular contours centered at the origin - the further from the origin, the higher the penalty. The training objective is the sum of two things: fit the data (moves you toward $w^*$) and stay near the origin (pulls you back).

The optimal solution is a compromise between the origin and $w^*$: it fits the data reasonably well while keeping weights small. The model can only maintain a large weight if it is genuinely supported by a strong data gradient.
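In one dimension this compromise has a closed form: for data loss $(w - w^*)^2$ plus penalty $\lambda w^2$, setting the gradient $2(w - w^*) + 2\lambda w$ to zero gives $w = w^*/(1 + \lambda)$. A sketch (the value of $w^*$ and the $\lambda$ grid are illustrative):

```python
# Closed-form compromise in 1-D: (w - w_star)**2 + lam * w**2
# is minimized at w = w_star / (1 + lam).
w_star = 4.0

for lam in [0.0, 0.1, 1.0, 10.0]:
    w_opt = w_star / (1 + lam)
    print(lam, w_opt)   # lam = 0 recovers w_star; large lam pulls toward 0
```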
Tuning
Here, $\lambda$ must be tuned on the validation set:
- too small (e.g., $\lambda = 10^{-6}$): penalty is negligible, no real effect.
- too large (e.g., $\lambda = 10$): all weights forced near zero. Severe underfitting.
- just right (often $\lambda \approx 10^{-4}$ to $10^{-2}$): reduces variance without destroying fit.
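A sweep can be sketched with a one-parameter ridge model that has a closed-form fit. Everything here - the data, the grid, the helper names - is illustrative, not from the text:

```python
# Hedged sketch of a lambda sweep. fit_ridge_1d is a one-parameter
# ridge model y ~ w*x with a closed-form solution.
def fit_ridge_1d(xs, ys, lam):
    """Minimize sum((w*x - y)^2) + lam*w^2: w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def val_error(w, xs, ys):
    """Squared error of the fitted slope on held-out data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

train_x, train_y = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
val_x, val_y = [4.0, 5.0], [3.9, 5.1]

# Pick the lambda with the lowest validation error, not the lowest train error.
best = min(
    [1e-6, 1e-4, 1e-2, 1.0, 10.0],
    key=lambda lam: val_error(fit_ridge_1d(train_x, train_y, lam), val_x, val_y),
)
print(best)
```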
Interactive example
Lambda sweep - drag lambda slider and watch train error vs. validation error change
Coming soon
A Note on Names
You will see L2 regularization called different things in different contexts:
- "L2 regularization" - the ML/deep learning framing: add to the loss.
- "Ridge regression" - the statistics framing: same thing applied to linear regression.
- "Weight decay" - the optimizer framing: the factor in the update rule.
- "AdamW" - Adam optimizer with weight decay applied directly to weights (subtly different from applying to the gradient).
When you see weight_decay=0.01 in PyTorch, that is L2 regularization with .
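The coupled-vs-decoupled distinction behind AdamW can be shown in a few lines of plain Python (not actual PyTorch; `scale` is an illustrative stand-in for Adam's per-parameter adaptive rescaling):

```python
# Coupled vs. decoupled weight decay, plain-Python sketch.
# With plain SGD (scale = 1) the two updates coincide.
eta, lam, scale = 0.1, 0.01, 0.5   # scale stands in for Adam's rescaling
grad = 2.0                          # data gradient (illustrative)
w0 = 1.0

# Coupled (classic L2): the decay term rides along in the gradient,
# so the adaptive scale rescales it too.
w_coupled = w0 - eta * scale * (grad + 2 * lam * w0)

# Decoupled (AdamW-style): decay applied directly to the weight,
# untouched by the adaptive scale.
w_decoupled = w0 - eta * scale * grad - eta * 2 * lam * w0

print(w_coupled, w_decoupled)   # the two differ whenever scale != 1
```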
Bayesian Interpretation
Formally, L2 regularization is equivalent to MAP (maximum a posteriori) estimation under a zero-mean Gaussian prior on the weights. This framing is useful when you want to reason about what assumptions your regularizer encodes - L2 encodes the belief that weights should be small and normally distributed around zero.
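The equivalence can be sketched in a few lines (a standard derivation; $\sigma^2$ is the prior variance):

```latex
% MAP with Gaussian prior w_i ~ N(0, sigma^2); L_data(w) = -log p(D | w)
\hat{w} = \arg\max_w \; p(D \mid w)\, p(w)
        = \arg\min_w \; \big[ -\log p(D \mid w) - \log p(w) \big]
        = \arg\min_w \; \Big[ L_{\text{data}}(w) + \frac{1}{2\sigma^2} \sum_i w_i^2 \Big]
% i.e. L2 regularization with \lambda = 1/(2\sigma^2):
% a tighter prior (smaller sigma) corresponds to a larger lambda.
```

So the penalty strength is just the inverse width of the prior: trusting the data less (small $\sigma$) means a stronger pull toward zero.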