L1 and L2 regularization both penalize weight size to prevent overfitting. But they behave very differently in one critical way: L1 can drive weights to exactly zero. L2 only whispers at weights, pushing them close to zero but never reaching it. This distinction turns L1 into an automatic feature selector.
The L1 Penalty
The L1 penalty adds the sum of absolute weight values to the loss:
$$L_{\text{total}} = L_{\text{data}} + \lambda \sum_i |w_i|$$

- $L_{\text{data}}$ - original data loss
- $\lambda$ - regularization strength
- $\sum_i |w_i|$ - L1 norm - sum of absolute values of all weights
Absolute values instead of squares. This seems like a small change, but it creates a very different gradient.
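As a quick illustration, here is a minimal NumPy sketch of the two penalties side by side (the function names, weight vector, and $\lambda$ value are made up for the example):

```python
import numpy as np

def l1_regularized_loss(data_loss, w, lam):
    # total loss = data loss + lam * sum of |w_i|
    return data_loss + lam * np.sum(np.abs(w))

def l2_regularized_loss(data_loss, w, lam):
    # for comparison: data loss + lam * sum of w_i^2
    return data_loss + lam * np.sum(w ** 2)

w = np.array([0.5, -2.0, 0.0, 3.0])
print(l1_regularized_loss(0.8, w, lam=0.01))  # 0.8 + 0.01 * 5.5
print(l2_regularized_loss(0.8, w, lam=0.01))  # 0.8 + 0.01 * 13.25
```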
The Constant Gradient
For L2, the gradient of the penalty $\lambda \sum_i w_i^2$ with respect to $w_i$ is $2\lambda w_i$ - proportional to the weight's current value. Large weights get a large push; small weights get a small push.
For L1, the gradient of $\lambda \sum_i |w_i|$ with respect to $w_i$ is:

$$\lambda \cdot \text{sign}(w_i)$$

- $\text{sign}(w_i)$ - the sign of $w_i$ - equals +1 if $w_i$ is positive, -1 if negative

The $\text{sign}(w_i)$ factor is either +1 or -1. It does not depend on the magnitude of $w_i$. Whether $w_i$ is tiny or enormous, the gradient has the same fixed magnitude $\lambda$.
This means L1 pushes every weight toward zero by the same fixed amount each step, regardless of its size.
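A small sketch of the difference, with an illustrative $\lambda$ and weight vector:

```python
import numpy as np

lam = 0.01                      # illustrative regularization strength
w = np.array([3.0, 0.5, -0.002])

l2_grad = 2 * lam * w           # proportional to w: [0.06, 0.01, -0.00004]
l1_grad = lam * np.sign(w)      # fixed magnitude:   [0.01, 0.01, -0.01]
```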
Why L1 Produces Exact Zeros
Imagine a weight where the data gradient is near zero (this feature barely helps).
L2 update: $w \leftarrow w - \eta \cdot 2\lambda w$, where $\eta$ is the learning rate. With small $w$, this step barely moves. The weight asymptotes toward zero but never quite arrives.
L1 update: $w \leftarrow w - \eta\lambda \cdot \text{sign}(w)$. With a fixed decrement of $\eta\lambda$, this weight crosses zero in a finite number of steps. Once a step would overshoot zero, the implementation clamps the weight to exactly 0.
For a large weight with a strong data gradient, the data gradient fights back and wins - the weight stays nonzero. L1 only zeroes out weights that cannot justify their existence.
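The clamping step is commonly implemented as soft-thresholding. A minimal sketch, where the helper name and all constants are illustrative:

```python
import numpy as np

def l1_step(w, data_grad, lr, lam):
    """One update with an L1 penalty: take the data step, then shrink
    toward zero by lr*lam, stopping at exactly zero on overshoot."""
    w = w - lr * data_grad
    shrink = lr * lam
    return np.sign(w) * np.maximum(np.abs(w) - shrink, 0.0)

w = 0.05                                   # small weight on a useless feature
for step in range(1, 101):
    w = l1_step(w, data_grad=0.0, lr=0.1, lam=0.1)
    if w == 0.0:
        print(f"reached exactly zero at step {step}")   # step 5
        break
```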
The Geometric Picture
Consider a loss function with two weights $w_1$ and $w_2$:
The L2 constraint is a circle in 2D (sphere in higher dimensions). Its boundary is smooth everywhere - no corners.
The L1 constraint is a diamond in 2D (an octahedron in 3D, a cross-polytope in higher dimensions). It has sharp corners on the coordinate axes, where one weight is nonzero and all others are zero.
When loss contours first touch the constraint boundary, the L2 circle offers a smooth curve where the solution can land anywhere. The L1 diamond has corners that stick out - the solution is geometrically likely to land at a corner, which corresponds to a sparse solution.
Why do corners produce zero weights? A corner on the L1 diamond is a point like (C, 0) or (0, C) - exactly one weight is nonzero. The loss contours are smooth ellipses. When an ellipse expands outward from the data loss minimum, it first touches the diamond boundary. Smooth ellipses are far more likely to touch a sharp corner than one of the flat faces of the diamond, especially in high dimensions, where there are many corners (two per coordinate axis). Each corner assigns zero to all but one weight.
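To make the corner claim concrete, here is a small numerical sketch: a toy quadratic data loss whose unconstrained minimum sits at (2.0, 0.5), minimized over the boundary of the unit L1 diamond. All numbers are contrived for illustration:

```python
import numpy as np

def data_loss(w1, w2):
    # toy quadratic loss, unconstrained minimum at (2.0, 0.5)
    return (w1 - 2.0) ** 2 + (w2 - 0.5) ** 2

# Parameterize the boundary of the diamond |w1| + |w2| = 1
t = np.linspace(0.0, 2.0 * np.pi, 100001)
w1 = np.sign(np.cos(t)) * np.cos(t) ** 2
w2 = np.sign(np.sin(t)) * np.sin(t) ** 2

best = np.argmin(data_loss(w1, w2))
print(w1[best], w2[best])   # 1.0 0.0 - the constrained minimum lands on a corner
```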
Interactive example
L1 vs L2 constraint geometry - see how loss contours touch the circle vs. the diamond
Coming soon
Practical Application: Feature Selection
Suppose you are building a model to predict house prices with 1,000 candidate features: square footage, bedrooms, proximity to schools, age of roof, fireplace presence, day-of-year the listing went live, and 993 others. Many are noise.
With L1 regularization, training drives irrelevant weights to exactly zero. You might end up with only 50 nonzero weights. The other 950 are zeroed out, not just small.
This is automatic feature selection. Invaluable for interpretability in medicine, finance, and any domain where you must explain predictions.
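A minimal scikit-learn sketch of the effect on synthetic data (the feature counts, alpha values, and noise level are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]        # only 5 features matter
y = X @ true_w + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("Lasso nonzero weights:", np.sum(lasso.coef_ != 0))   # roughly the 5 true ones
print("Ridge nonzero weights:", np.sum(ridge.coef_ != 0))   # all 50, just small
```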
When to Use L1 vs L2
Use L1 when:
- You expect most features to be irrelevant (sparse ground truth)
- Interpretability matters - you want to know which features drive predictions
- Statistics name: Lasso (Least Absolute Shrinkage and Selection Operator)
Use L2 when:
- Most features contribute something (dense solution is reasonable)
- You want smooth, stable optimization (L2 is differentiable everywhere)
- Statistics name: Ridge regression
Use Elastic Net when:
- You want some sparsity with the smoothness of L2
- There are groups of correlated features (L1 arbitrarily picks one; Elastic Net can include all - see the sketch after this list)
- $\lambda_1$ - L1 penalty strength - controls sparsity
- $\lambda_2$ - L2 penalty strength - controls weight magnitude
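Elastic Net combines both terms, $\lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2$. A minimal scikit-learn sketch of the correlated-features point; the duplicated feature and all constants are contrived for illustration, and scikit-learn exposes the two strengths through `alpha` and `l1_ratio`:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=200)   # feature 1 nearly duplicates feature 0
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

enet = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)  # l1_ratio balances L1 vs L2
lasso = Lasso(alpha=0.05).fit(X, y)

print(enet.coef_[:2])    # the L2 term tends to keep both correlated features
print(lasso.coef_[:2])   # pure L1 often keeps one and zeroes the other
```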
In deep learning, L2 (weight decay) is far more common than L1 because optimization is easier. L1 shines in classical machine learning on tabular data with explicit feature engineering.
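For reference, a minimal PyTorch sketch of both in a deep learning setting: L2 via the optimizer's weight decay, and an L1 term added to the loss by hand. The model, learning rate, and penalty strengths are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 1)

# L2 ("weight decay") is built into most optimizers
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# L1 is usually added to the loss manually
def loss_with_l1(base_loss, model, lam=1e-5):
    l1 = sum(p.abs().sum() for p in model.parameters())
    return base_loss + lam * l1
```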