You have a model that is overfitting. The weights have grown large - the model found a complicated, wiggly function that memorizes training data. The fix is conceptually simple: penalize large weights. If the model wants to keep a large weight, it must pay for it in the loss.
Think of it like a government charging permit fees for complexity: you can build a large structure, but each additional floor costs you. The developer (model) will only build taller if the benefit outweighs the permit fee. L2 regularization makes "complexity" (weight magnitude) costly, so the model only keeps weights that genuinely earn their place.
That is the entire idea behind L2 regularization. Everything else is just math.
The Penalty Term
Instead of minimizing just the original data loss, you minimize a penalized loss:
$$L_{\text{total}} = L_{\text{data}} + \lambda \sum_i w_i^2$$

- $L_{\text{data}}$ - original loss (cross-entropy, MSE, etc.)
- $\lambda$ - regularization strength hyperparameter
- $\sum_i w_i^2$ - sum of squared weights

where $\lambda \ge 0$ controls penalty strength.
The squared term means large weights are penalized much more than small weights. A weight of 10 contributes 100 to the penalty; a weight of 0.1 contributes only 0.01.
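The penalty is cheap to compute directly. A minimal sketch in Python (function names are illustrative, not from any library):

```python
# Minimal sketch of the L2-penalized loss; names are illustrative.
def l2_penalty(weights, lam):
    """lam times the sum of squared weights."""
    return lam * sum(w ** 2 for w in weights)

def penalized_loss(data_loss, weights, lam):
    """Total loss = original data loss + L2 penalty."""
    return data_loss + l2_penalty(weights, lam)

# The squared term makes large weights disproportionately expensive:
print(l2_penalty([10.0], lam=1.0))   # 100.0
print(l2_penalty([0.1], lam=1.0))    # ~0.01
```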
What This Does to the Gradient
Taking the gradient of $L_{\text{total}}$ with respect to a single weight $w$:

$$\frac{\partial L_{\text{total}}}{\partial w} = \frac{\partial L_{\text{data}}}{\partial w} + 2\lambda w$$

- $\partial L_{\text{data}}/\partial w$ - gradient from data loss
- $2\lambda w$ - extra gradient term from L2 penalty

The L2 penalty adds $2\lambda w$ to the gradient. That extra term always points in the direction of $-w$ - it constantly pushes the update toward reducing $|w|$.
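A quick way to convince yourself of the extra term is a finite-difference check on a toy quadratic data loss (the loss and all constants below are illustrative):

```python
# Finite-difference check that the penalized gradient picks up 2*lam*w.
lam = 0.01

def data_loss(w):
    return (w - 3.0) ** 2            # toy data loss, minimum at w = 3

def total_loss(w):
    return data_loss(w) + lam * w ** 2

def numeric_grad(f, w, eps=1e-6):
    """Central finite-difference approximation of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 2.0
analytic = 2 * (w - 3.0) + 2 * lam * w   # data gradient + 2*lam*w
print(abs(numeric_grad(total_loss, w) - analytic))   # tiny finite-diff error
```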
The Weight Decay Update
Plugging into the gradient descent update rule:

$$w \leftarrow w - \eta\left(\frac{\partial L_{\text{data}}}{\partial w} + 2\lambda w\right) = (1 - 2\eta\lambda)\,w - \eta\,\frac{\partial L_{\text{data}}}{\partial w}$$

- $\eta$ - learning rate
- $(1 - 2\eta\lambda)$ - weight decay factor - multiplied against the weight each step

Notice the factor $(1 - 2\eta\lambda)$. It is multiplied against $w$ at every single update step, regardless of the data gradient. It shrinks the weight by a constant fraction each step.

Concrete example: learning rate $\eta = 0.1$, $\lambda = 0.01$:

$$(1 - 2\eta\lambda) = 1 - 2 \times 0.1 \times 0.01 = 0.998$$

Every step, the weight is first multiplied by 0.998 - shrunk by 0.2% - then the normal gradient step is applied. This is why L2 regularization is also called weight decay.
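The decay factor is easiest to see in isolation. A sketch with the same $\eta$ and $\lambda$ as above, assuming (for clarity) a zero data gradient so only the shrinkage acts:

```python
# Decay factor in isolation: eta and lam as in the example above,
# with the data gradient assumed zero so only shrinkage acts.
eta, lam = 0.1, 0.01
decay = 1 - 2 * eta * lam      # = 0.998: shrink by 0.2% per step

w = 5.0
for _ in range(1000):
    w = decay * w              # data-gradient step omitted here
print(w)                       # exponential decay toward zero
```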
Interactive example
Weight decay demo - watch a single weight decay over steps with adjustable lambda
Coming soon
The Geometric Intuition
Picture a 2D weight space with axes $w_1$ and $w_2$. Without regularization, training finds the minimum of the data loss - call it $w^*$.

L2 regularization adds circular contours centered at the origin - the further from the origin, the higher the penalty. The training objective is the sum of two things: fit the data (moves you toward $w^*$) and stay near the origin (pulls you back).

The optimal solution is a compromise between the origin and $w^*$: it fits the data reasonably well while keeping weights small. The model can only maintain a large weight if it is genuinely supported by a strong data gradient.
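In one dimension this compromise has a closed form: for data loss $(w - w^*)^2$ plus penalty $\lambda w^2$, setting the gradient $2(w - w^*) + 2\lambda w$ to zero gives $w = w^*/(1 + \lambda)$. A sketch (the value of $w^*$ and the $\lambda$ grid are illustrative):

```python
# Closed-form compromise in 1-D: (w - w_star)**2 + lam * w**2
# is minimized at w = w_star / (1 + lam).
w_star = 4.0

for lam in [0.0, 0.1, 1.0, 10.0]:
    w_opt = w_star / (1 + lam)
    print(lam, w_opt)   # lam = 0 recovers w_star; large lam pulls toward 0
```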
Tuning
Here, $\lambda$ must be tuned on the validation set:
- too small (e.g., $\lambda = 10^{-6}$): penalty is negligible, no real effect.
- too large (e.g., $\lambda = 10$): all weights forced near zero. Severe underfitting.
- just right (often $\lambda \approx 10^{-4}$ to $10^{-2}$): reduces variance without destroying fit.
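A sweep can be sketched with a one-parameter ridge model that has a closed-form fit. Everything here - the data, the grid, the helper names - is illustrative, not from the text:

```python
# Hedged sketch of a lambda sweep. fit_ridge_1d is a one-parameter
# ridge model y ~ w*x with a closed-form solution.
def fit_ridge_1d(xs, ys, lam):
    """Minimize sum((w*x - y)^2) + lam*w^2: w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def val_error(w, xs, ys):
    """Squared error of the fitted slope on held-out data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

train_x, train_y = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
val_x, val_y = [4.0, 5.0], [3.9, 5.1]

# Pick the lambda with the lowest validation error, not the lowest train error.
best = min(
    [1e-6, 1e-4, 1e-2, 1.0, 10.0],
    key=lambda lam: val_error(fit_ridge_1d(train_x, train_y, lam), val_x, val_y),
)
print(best)
```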
Interactive example
Lambda sweep - drag lambda slider and watch train error vs. validation error change
Coming soon
A Note on Names
You will see L2 regularization called different things in different contexts:
- "L2 regularization" - the ML/deep learning framing: add to the loss.
- "Ridge regression" - the statistics framing: same thing applied to linear regression.
- "Weight decay" - the optimizer framing: the factor in the update rule.
- "AdamW" - Adam optimizer with weight decay applied directly to weights (subtly different from applying to the gradient).
When you see weight_decay=0.01 in PyTorch, that is L2 regularization with .
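The coupled-vs-decoupled distinction behind AdamW can be shown in a few lines of plain Python (not actual PyTorch; `scale` is an illustrative stand-in for Adam's per-parameter adaptive rescaling):

```python
# Coupled vs. decoupled weight decay, plain-Python sketch.
# With plain SGD (scale = 1) the two updates coincide.
eta, lam, scale = 0.1, 0.01, 0.5   # scale stands in for Adam's rescaling
grad = 2.0                          # data gradient (illustrative)
w0 = 1.0

# Coupled (classic L2): the decay term rides along in the gradient,
# so the adaptive scale rescales it too.
w_coupled = w0 - eta * scale * (grad + 2 * lam * w0)

# Decoupled (AdamW-style): decay applied directly to the weight,
# untouched by the adaptive scale.
w_decoupled = w0 - eta * scale * grad - eta * 2 * lam * w0

print(w_coupled, w_decoupled)   # the two differ whenever scale != 1
```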
Bayesian Interpretation
Formally, L2 regularization is equivalent to MAP (maximum a posteriori) estimation under a zero-mean Gaussian prior on the weights. This framing is useful when you want to reason about what assumptions your regularizer encodes - L2 encodes the belief that weights should be small and normally distributed around zero.
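The equivalence can be sketched in a few lines (a standard derivation; $\sigma^2$ is the prior variance):

```latex
% MAP with Gaussian prior w_i ~ N(0, sigma^2); L_data(w) = -log p(D | w)
\hat{w} = \arg\max_w \; p(D \mid w)\, p(w)
        = \arg\min_w \; \big[ -\log p(D \mid w) - \log p(w) \big]
        = \arg\min_w \; \Big[ L_{\text{data}}(w) + \frac{1}{2\sigma^2} \sum_i w_i^2 \Big]
% i.e. L2 regularization with \lambda = 1/(2\sigma^2):
% a tighter prior (smaller sigma) corresponds to a larger lambda.
```

So the penalty strength is just the inverse width of the prior: trusting the data less (small $\sigma$) means a stronger pull toward zero.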