Regularization
Lesson 2 ⏱ 12 min

L2 regularization (Ridge)

Video coming soon

L2 Regularization - Penalizing Large Weights

How the L2 penalty term modifies the loss, changes the gradient update, and geometrically constrains the solution toward the origin.

⏱ ~7 min

🧮 Quick refresher

Gradient of loss function

The gradient $dL/dw$ tells us how the loss changes when $w$ changes. The update rule is $w \leftarrow w - \alpha \, dL/dw$.

Example

If $L = w^2$, then $dL/dw = 2w$.

Update: $w \leftarrow w - \alpha \cdot 2w = w(1 - 2\alpha)$.
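
To see the update in action, here is a tiny sketch in plain Python (the starting value and learning rate are arbitrary):

```python
# Gradient descent on L = w^2: dL/dw = 2w, so each step gives w <- w * (1 - 2*alpha).
w, alpha = 4.0, 0.1

for step in range(5):
    grad = 2 * w            # dL/dw
    w = w - alpha * grad    # update rule
    print(f"step {step + 1}: w = {w:.4f}")

# Each step multiplies w by (1 - 2*alpha) = 0.8, so w shrinks toward 0:
# 3.2000, 2.5600, 2.0480, 1.6384, 1.3107
```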

You have a model that is overfitting. The weights have grown large - the model found a complicated, wiggly function that memorizes training data. The fix is conceptually simple: penalize large weights. If the model wants to keep a large weight, it must pay for it in the loss.

Think of it like a government charging permit fees for complexity: you can build a large structure, but each additional floor costs you. The developer (model) will only build taller if the benefit outweighs the permit fee. L2 regularization makes "complexity" (weight magnitude) costly, so the model only keeps weights that genuinely earn their place.

That is the entire idea behind L2 regularization. Everything else is just math.

The Penalty Term

Instead of minimizing just the original data loss, you minimize a penalized loss:

$$L_{\text{reg}} = L_{\text{data}} + \lambda \cdot \|w\|^2$$

where:

  • $L_{\text{data}}$ - the original loss (cross-entropy, MSE, etc.)
  • $\lambda$ - the regularization strength hyperparameter that controls the penalty strength
  • $\|w\|^2$ - the sum of squared weights

The squared term means large weights are penalized much more than small weights. A weight of 10 contributes 100 to the penalty; a weight of 0.1 contributes only 0.01.
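
As a quick sketch in code (the function and variable names are illustrative, not from any library), the penalized loss is just the data loss plus $\lambda$ times the sum of squared weights:

```python
import numpy as np

def l2_penalized_loss(data_loss, weights, lam):
    """L_reg = L_data + lambda * sum of squared weights."""
    return data_loss + lam * np.sum(weights ** 2)

weights = np.array([10.0, 0.1])
print(np.sum(weights ** 2))                  # ~100.01: the weight of 10 contributes 100, the 0.1 only 0.01
print(l2_penalized_loss(2.5, weights, 0.1))  # ~12.5 = 2.5 + 0.1 * 100.01
```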

What This Does to the Gradient

Taking the gradient of $L_{\text{reg}}$ with respect to a single weight $w$:

$$\frac{\partial L_{\text{reg}}}{\partial w} = \frac{\partial L_{\text{data}}}{\partial w} + 2\lambda w$$

  • $\partial L_{\text{data}}/\partial w$ - gradient from the data loss
  • $2\lambda w$ - extra gradient term from the L2 penalty

The L2 penalty adds $2\lambda w$ to the gradient. That extra term always points in the direction of $w$ - it constantly pushes the update toward reducing $|w|$.
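
In code form (again a sketch with illustrative names), the extra term is just $2\lambda w$ added element-wise to the data gradient:

```python
import numpy as np

def l2_regularized_grad(data_grad, weights, lam):
    """dL_reg/dw = dL_data/dw + 2 * lambda * w."""
    return data_grad + 2 * lam * weights

weights   = np.array([3.0, -2.0, 0.5])
data_grad = np.zeros(3)   # pretend the data loss is flat, to isolate the penalty's effect
print(l2_regularized_grad(data_grad, weights, lam=0.1))
# -> [ 0.6 -0.4  0.1]: same sign as each weight, so the step -alpha*grad pushes every weight toward 0
```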

The Weight Decay Update

Plugging into the gradient descent update rule:

$$w \leftarrow w(1 - 2\alpha\lambda) - \alpha \cdot \frac{\partial L_{\text{data}}}{\partial w}$$

  • $\alpha$ - learning rate
  • $(1 - 2\alpha\lambda)$ - weight decay factor, multiplied against the weight each step

Notice the factor $(1 - 2\alpha\lambda)$. It is multiplied against $w$ at every single update step, regardless of the data gradient. It shrinks the weight by a constant fraction each step.

Concrete example: learning rate $\alpha = 0.01$, $\lambda = 0.1$:

$$(1 - 2 \times 0.01 \times 0.1) = (1 - 0.002) = 0.998$$

Every step, the weight is first multiplied by 0.998 - shrunk by 0.2% - then the normal gradient step is applied. This is why L2 regularization is also called weight decay.
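
Here is a minimal sketch of that decay in isolation (the data gradient is set to zero so only the shrinkage is visible; the numbers match the example above):

```python
alpha, lam = 0.01, 0.1
decay = 1 - 2 * alpha * lam        # 0.998

w, data_grad = 5.0, 0.0            # flat data loss: only the decay acts on w
for step in range(3):
    w = w * decay - alpha * data_grad
    print(f"step {step + 1}: w = {w:.6f}")

# 4.990000, 4.980020, 4.970060 -- shrunk by 0.2% every step
```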

Interactive example

Weight decay demo - watch a single weight decay over steps with adjustable lambda

Coming soon

The Geometric Intuition

Picture a 2D weight space with axes $w_1$ and $w_2$. Without regularization, training finds the minimum of the data loss - call it $w^* = (3.5, 2.1)$.

L2 regularization adds circular contours centered at the origin - the further from $(0, 0)$, the higher the penalty. The training objective is the sum of two things: fit the data (moves toward $w^*$) and stay near the origin (pulls you back).

The optimal solution is a compromise between the origin and $w^*$: it fits the data reasonably well while keeping weights small. The model can only maintain a large weight if it is genuinely supported by a strong data gradient.
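
A small numerical sketch of that compromise (the quadratic data loss below is invented purely so that its minimum sits at $w^* = (3.5, 2.1)$):

```python
import numpy as np

w_star = np.array([3.5, 2.1])

def data_grad(w):
    # Gradient of the toy data loss L_data = ||w - w*||^2
    return 2 * (w - w_star)

def minimize(lam, steps=5000, alpha=0.01):
    w = np.zeros(2)
    for _ in range(steps):
        w -= alpha * (data_grad(w) + 2 * lam * w)   # data gradient + L2 term
    return w

print(minimize(lam=0.0))   # ~[3.5  2.1 ] -- the unregularized minimum w*
print(minimize(lam=1.0))   # ~[1.75 1.05] -- pulled halfway back toward the origin
```

For this particular toy loss the compromise has a closed form, $w = w^*/(1 + \lambda)$, which is why $\lambda = 1$ lands exactly halfway to the origin.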

Tuning $\lambda$

In practice, $\lambda$ must be tuned on the validation set (a minimal sweep sketch follows this list):

  • $\lambda$ too small (e.g., $10^{-6}$): penalty is negligible, no real effect.
  • $\lambda$ too large (e.g., $10.0$): all weights forced near zero. Severe underfitting.
  • $\lambda$ just right (often $10^{-4}$ to $10^{-2}$): reduces variance without destroying fit.
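
To make the sweep concrete, here is a minimal sketch using closed-form ridge regression on invented synthetic data (all sizes, seeds, and noise levels are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: a few informative features, many irrelevant ones, noisy targets.
n_train, n_val, d = 30, 30, 20
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.0, 0.5]
X_train, X_val = rng.normal(size=(n_train, d)), rng.normal(size=(n_val, d))
y_train = X_train @ true_w + rng.normal(scale=0.5, size=n_train)
y_val   = X_val   @ true_w + rng.normal(scale=0.5, size=n_val)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

for lam in [1e-6, 1e-2, 1.0, 100.0]:
    w = ridge_fit(X_train, y_train, lam)
    print(f"lambda={lam:g}\ttrain MSE={mse(X_train, y_train, w):.3f}\tval MSE={mse(X_val, y_val, w):.3f}")
```

Typically the train error creeps up as $\lambda$ grows while the validation error first drops and then rises - the sweet spot is where the validation error bottoms out.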

Interactive example

Lambda sweep - drag lambda slider and watch train error vs. validation error change

Coming soon

A Note on Names

You will see L2 regularization called different things in different contexts:

  • "L2 regularization" - the ML/deep learning framing: add λw2\lambda\mid w\mid ^2 to the loss.
  • "Ridge regression" - the statistics framing: same thing applied to linear regression.
  • "Weight decay" - the optimizer framing: the (12αλ)(1-2\alpha\lambda) factor in the update rule.
  • "AdamW" - Adam optimizer with weight decay applied directly to weights (subtly different from applying to the gradient).

When you see weight_decay=0.01 in PyTorch, that is L2 regularization with $\lambda = 0.01$.
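
For example, a minimal sketch (the model and hyperparameter values are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder model

# SGD: weight_decay folds the L2 term into the gradient (it adds weight_decay * w
# to each parameter's gradient before the update).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.01)

# AdamW: decoupled weight decay -- the weights are shrunk directly each step
# instead of the decay passing through Adam's adaptive gradient rescaling.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```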

Bayesian Interpretation

The L2 penalty also has a Bayesian reading: adding $\lambda \|w\|^2$ to the loss is equivalent to placing a zero-mean Gaussian prior on the weights and doing maximum a posteriori (MAP) estimation. This framing is useful when you want to reason about what assumptions your regularizer encodes - L2 encodes the belief that weights should be small and normally distributed around zero.
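
Sketching the standard derivation (this identifies the data loss with a negative log-likelihood, which holds for losses like cross-entropy and, up to scaling, MSE):

$$\hat{w}_{\text{MAP}} = \arg\min_w \big[-\log p(D \mid w) - \log p(w)\big] = \arg\min_w \Big[L_{\text{data}}(w) + \tfrac{1}{2\sigma^2}\|w\|^2\Big]$$

with prior $p(w) = \mathcal{N}(0, \sigma^2 I)$, so $\lambda$ plays the role of $\tfrac{1}{2\sigma^2}$: the stronger your belief that weights are small (smaller $\sigma^2$), the larger the effective $\lambda$.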

Quiz

1 / 3

L2 regularization modifies the training update by...