
L1 Regularization (Lasso): Sparsity and Feature Selection


Why the constant L1 gradient drives weights to exactly zero, the diamond vs. circle geometry, and when to choose L1 over L2.


🧮 Quick refresher

Absolute value and its derivative

The absolute value |w| has derivative sign(w): +1 if w > 0, -1 if w < 0 (undefined at w = 0). The derivative has constant magnitude regardless of how large or small w is.

Example

|5| = 5, derivative = +1.

|-5| = 5, derivative = -1.

|-0.001| = 0.001, derivative = -1.

Same magnitude gradient for all nonzero weights.
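
A quick check of this in NumPy (note that np.sign(0.0) returns 0, one reason w = 0 needs special handling):

```python
import numpy as np

# The gradient of |w| is sign(w): magnitude 1 for every nonzero w,
# no matter how large or small the weight is.
for w in [5.0, -5.0, -0.001]:
    print(f"|{w}| = {abs(w)}, derivative = {np.sign(w):+.0f}")
```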

L1 and L2 regularization both penalize weight size to prevent overfitting. But they behave very differently in one critical way: L1 can drive weights to exactly zero, while L2 only whispers at weights, pushing them ever closer to zero without ever reaching it. This distinction turns L1 into an automatic feature selector.

The L1 Penalty

The L1 penalty adds the sum of absolute weight values to the loss:

$$L_{\text{reg}} = L_{\text{data}} + \lambda \|w\|_1 = L_{\text{data}} + \lambda \sum_i |w_i|$$

  • $L_{\text{data}}$ - original data loss
  • $\lambda$ - regularization strength
  • $\|w\|_1$ - L1 norm: sum of absolute values of all weights

Absolute values instead of squares. This seems like a small change, but it creates a very different gradient.
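
In code, the penalty is a one-liner. A minimal NumPy sketch (function name and values are illustrative):

```python
import numpy as np

def l1_regularized_loss(data_loss, weights, lam):
    """Total loss = data loss + lambda * ||w||_1 (sum of absolute values)."""
    return data_loss + lam * np.sum(np.abs(weights))

w = np.array([0.5, -2.0, 0.0, 3.0])
print(l1_regularized_loss(data_loss=1.2, weights=w, lam=0.1))
# 1.2 + 0.1 * (0.5 + 2.0 + 0.0 + 3.0) = 1.75
```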

The Constant Gradient

For L2, the gradient of the penalty with respect to $w_i$ is $2\lambda w_i$ - proportional to the weight's current value. Large weights get a large push; small weights get a small push.

For L1, the gradient of $\lambda|w_i|$ is:

$$\frac{\partial(\lambda|w_i|)}{\partial w_i} = \lambda \cdot \text{sign}(w_i)$$

  • $\text{sign}(w_i)$ - the sign of $w_i$: +1 if positive, -1 if negative

The sign factor is either +1 or -1; it does not depend on the magnitude of $w_i$. Whether $w_i = 100$ or $w_i = 0.001$, the penalty gradient has the same fixed magnitude $\lambda$.

This means L1 pulls every nonzero weight toward zero by the same fixed amount $\alpha\lambda$ each step (where $\alpha$ is the learning rate), regardless of the weight's size.
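
A small NumPy sketch of the contrast (the $\lambda$ value is illustrative):

```python
import numpy as np

lam = 0.01
for w in [100.0, 5.0, 0.001]:
    l2_grad = 2 * lam * w        # proportional to the weight itself
    l1_grad = lam * np.sign(w)   # fixed magnitude, always lambda
    print(f"w = {w:>8}: L2 grad = {l2_grad:.5f}, L1 grad = {l1_grad:.5f}")
```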

Why L1 Produces Exact Zeros

Imagine a weight $w_i = 0.002$ where the data gradient is near zero (this feature barely helps).

L2 update: $w \leftarrow w(1 - 2\alpha\lambda)$. With small $w$, this barely moves. The weight asymptotes toward zero but never quite arrives.

L1 update: $w \leftarrow w - \alpha\lambda \cdot \text{sign}(w) - \alpha \cdot \frac{\partial L_{\text{data}}}{\partial w}$. With a fixed decrement of $\alpha\lambda$, this weight crosses zero in a finite number of steps. Once a step would overshoot zero, the implementation clamps the weight to exactly 0.

For a large weight (say $w_i = 5.0$) with a strong data gradient, the data gradient fights back and wins - the weight stays nonzero. L1 only zeroes out weights that cannot justify their existence.
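
A minimal sketch of the update with the zero-clamp (a proximal/soft-thresholding step; the function name and hyperparameters are illustrative):

```python
import numpy as np

def l1_step(w, data_grad, lr, lam):
    """One gradient step with an L1 penalty, clamping to exactly 0 on overshoot."""
    w = w - lr * data_grad               # data-loss part of the update
    if abs(w) <= lr * lam:               # the L1 decrement would cross zero...
        return 0.0                       # ...so clamp to exactly zero
    return w - lr * lam * np.sign(w)     # otherwise shrink by the fixed amount

# A tiny weight whose data gradient is ~0 reaches exactly 0.0 in finite steps.
w, steps = 0.002, 0
while w != 0.0:
    w = l1_step(w, data_grad=0.0, lr=0.1, lam=0.01)
    steps += 1
print(f"reached exactly zero after {steps} steps")  # 2 steps here
```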

The Geometric Picture

Consider a loss function with two weights $w_1$ and $w_2$:

The L2 constraint $\|w\|_2^2 \leq C$ is a circle in 2D (a sphere in higher dimensions). Its boundary is smooth everywhere - no corners.

The L1 constraint $\|w\|_1 \leq C$ is a diamond in 2D (a cross-polytope in higher dimensions - an octahedron in 3D). It has sharp corners on the coordinate axes, where one weight is nonzero and all others are zero.

When the loss contours first touch the constraint boundary, the L2 circle offers a smooth curve where the solution can land anywhere. The L1 diamond has corners that stick out - the solution is geometrically likely to land at a corner, which corresponds to a sparse solution.

Why do corners produce zero weights? A corner of the L1 diamond is a point like (C, 0) or (0, C) - exactly one weight is nonzero. The loss contours are smooth ellipses. As an ellipse expands outward from the data-loss minimum, it first touches the diamond boundary, and a smooth ellipse is far more likely to hit a sharp corner than one of the diamond's flat faces - especially in high dimensions, where there are many corners (two per coordinate axis) and each corner sets all but one weight to zero.


Practical Application: Feature Selection

Suppose you are building a model to predict house prices with 1,000 candidate features: square footage, bedrooms, proximity to schools, age of roof, fireplace presence, day-of-year the listing went live, and 993 others. Many are noise.

With L1 regularization, training drives irrelevant weights to exactly zero. You might end up with only 50 nonzero weights. The other 950 are zeroed out, not just small.

This is automatic feature selection - invaluable for interpretability in medicine, finance, and any domain where you must explain predictions.
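
A sketch of this workflow with scikit-learn's Lasso (the synthetic data and alpha value are illustrative; the exact nonzero count depends on both):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 1,000 candidate features, only 50 of which actually carry signal
X, y = make_regression(n_samples=500, n_features=1000, n_informative=50,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
n_nonzero = int(np.sum(lasso.coef_ != 0))
print(f"{n_nonzero} of {lasso.coef_.size} weights are nonzero")
```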

When to Use L1 vs L2

Use L1 when:

  • You expect most features to be irrelevant (sparse ground truth)
  • Interpretability matters - you want to know which features drive predictions
  • Statistics name: Lasso (Least Absolute Shrinkage and Selection Operator)

Use L2 when:

  • Most features contribute something (dense solution is reasonable)
  • You want smooth, stable optimization (L2 is differentiable everywhere)
  • Statistics name: Ridge regression

Use Elastic Net when:

  • You want some sparsity with the smoothness of L2
  • There are groups of correlated features (L1 arbitrarily picks one; Elastic Net can include all)
$$L_{\text{elastic}} = L_{\text{data}} + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$$

  • $\lambda_1$ - L1 penalty strength: controls sparsity
  • $\lambda_2$ - L2 penalty strength: controls weight magnitude
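
Note that scikit-learn's ElasticNet parameterizes the two strengths through alpha and l1_ratio rather than separate $\lambda_1$ and $\lambda_2$ (a sketch; data and values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=500, n_features=100, n_informative=20,
                       random_state=0)

# l1_ratio blends the penalties: 1.0 is pure L1 (Lasso), 0.0 is pure L2 (Ridge)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print(int((enet.coef_ != 0).sum()), "of", enet.coef_.size, "weights are nonzero")
```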

In deep learning, L2 (weight decay) is far more common than L1 because optimization is easier. L1 shines in classical machine learning on tabular data with explicit feature engineering.
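
One common way to bolt an L1 term onto a training step in PyTorch, as a sketch (model, data, and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(1000, 1)
# L2 is built into the optimizer as weight_decay...
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(32, 1000), torch.randn(32, 1)
lam_l1 = 1e-4

optimizer.zero_grad()
data_loss = nn.functional.mse_loss(model(x), y)
# ...but an L1 term has to be added to the loss by hand.
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = data_loss + lam_l1 * l1_penalty
loss.backward()
optimizer.step()
```

A plain gradient step on the L1 term oscillates around zero rather than landing on it exactly; producing true zeros requires the clamping (proximal) step sketched earlier.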
