
L1 Regularization (Lasso): Sparsity and Feature Selection


Why the constant L1 gradient drives weights to exactly zero, the diamond vs. circle geometry, and when to choose L1 over L2.


🧮 Quick refresher

Absolute value and its derivative

The absolute value |w| has derivative sign(w): +1 if w > 0, -1 if w < 0 (undefined at w = 0). The derivative has constant magnitude regardless of how large or small w is.

Example

|5| = 5, derivative = +1.

|-5| = 5, derivative = -1.

|-0.001| = 0.001, derivative = -1.

Same magnitude gradient for all nonzero weights.
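
A quick check of this in NumPy (note that np.sign(0.0) returns 0, one reason w = 0 needs special handling):

```python
import numpy as np

# The gradient of |w| is sign(w): magnitude 1 for every nonzero w,
# no matter how large or small the weight is.
for w in [5.0, -5.0, -0.001]:
    print(f"|{w}| = {abs(w)}, derivative = {np.sign(w):+.0f}")
```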

L1 and L2 regularization both penalize weight size to prevent overfitting. But they behave very differently in one critical way: L1 can drive weights to exactly zero, while L2 only whispers at weights, pushing them ever closer to zero without ever reaching it. This distinction turns L1 into an automatic feature selector.

The L1 Penalty

The L1 penalty adds the sum of absolute weight values to the loss:

$$L_{\text{reg}} = L_{\text{data}} + \lambda \|w\|_1 = L_{\text{data}} + \lambda \sum_i |w_i|$$

  • $L_{\text{data}}$ - original data loss
  • $\lambda$ - regularization strength
  • $\|w\|_1$ - L1 norm: sum of absolute values of all weights

Absolute values instead of squares. This seems like a small change, but it creates a very different gradient.
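
In code, the penalty is a one-liner. A minimal NumPy sketch (function name and values are illustrative):

```python
import numpy as np

def l1_regularized_loss(data_loss, weights, lam):
    """Total loss = data loss + lambda * ||w||_1 (sum of absolute values)."""
    return data_loss + lam * np.sum(np.abs(weights))

w = np.array([0.5, -2.0, 0.0, 3.0])
print(l1_regularized_loss(data_loss=1.2, weights=w, lam=0.1))
# 1.2 + 0.1 * (0.5 + 2.0 + 0.0 + 3.0) = 1.75
```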

The Constant Gradient

For L2, the gradient of the penalty with respect to $w_i$ is $2\lambda w_i$ - proportional to the weight's current value. Large weights get a large push; small weights get a small push.

For L1, the gradient of $\lambda|w_i|$ is:

$$\frac{\partial(\lambda|w_i|)}{\partial w_i} = \lambda \cdot \text{sign}(w_i)$$

  • $\text{sign}(w_i)$ - the sign of $w_i$: +1 if positive, -1 if negative

The sign factor is either +1 or -1; it does not depend on the magnitude of $w_i$. Whether $w_i = 100$ or $w_i = 0.001$, the penalty gradient has the same fixed magnitude $\lambda$.

This means L1 pulls every nonzero weight toward zero by the same fixed amount $\alpha\lambda$ each step (where $\alpha$ is the learning rate), regardless of the weight's size.
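
A small NumPy sketch of the contrast (the $\lambda$ value is illustrative):

```python
import numpy as np

lam = 0.01
for w in [100.0, 5.0, 0.001]:
    l2_grad = 2 * lam * w        # proportional to the weight itself
    l1_grad = lam * np.sign(w)   # fixed magnitude, always lambda
    print(f"w = {w:>8}: L2 grad = {l2_grad:.5f}, L1 grad = {l1_grad:.5f}")
```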

Why L1 Produces Exact Zeros

Imagine a weight $w_i = 0.002$ where the data gradient is near zero (this feature barely helps).

L2 update: $w \leftarrow w(1 - 2\alpha\lambda)$. With small $w$, this barely moves. The weight asymptotes toward zero but never quite arrives.

L1 update: $w \leftarrow w - \alpha\lambda \cdot \text{sign}(w) - \alpha \cdot \frac{\partial L_{\text{data}}}{\partial w}$. With a fixed decrement of $\alpha\lambda$, this weight crosses zero in a finite number of steps. Once a step would overshoot zero, the implementation clamps the weight to exactly 0.

For a large weight (say $w_i = 5.0$) with a strong data gradient, the data gradient fights back and wins - the weight stays nonzero. L1 only zeroes out weights that cannot justify their existence.
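
A minimal sketch of the update with the zero-clamp (a proximal/soft-thresholding step; the function name and hyperparameters are illustrative):

```python
import numpy as np

def l1_step(w, data_grad, lr, lam):
    """One gradient step with an L1 penalty, clamping to exactly 0 on overshoot."""
    w = w - lr * data_grad               # data-loss part of the update
    if abs(w) <= lr * lam:               # the L1 decrement would cross zero...
        return 0.0                       # ...so clamp to exactly zero
    return w - lr * lam * np.sign(w)     # otherwise shrink by the fixed amount

# A tiny weight whose data gradient is ~0 reaches exactly 0.0 in finite steps.
w, steps = 0.002, 0
while w != 0.0:
    w = l1_step(w, data_grad=0.0, lr=0.1, lam=0.01)
    steps += 1
print(f"reached exactly zero after {steps} steps")  # 2 steps here
```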

The Geometric Picture

Consider a loss function with two weights $w_1$ and $w_2$:

The L2 constraint $\|w\|_2^2 \leq C$ is a circle in 2D (a sphere in higher dimensions). Its boundary is smooth everywhere - no corners.

The L1 constraint $\|w\|_1 \leq C$ is a diamond in 2D (a cross-polytope in higher dimensions - an octahedron in 3D). It has sharp corners on the coordinate axes, where one weight is nonzero and all others are zero.

When the loss contours first touch the constraint boundary, the L2 circle offers a smooth curve where the solution can land anywhere. The L1 diamond has corners that stick out - the solution is geometrically likely to land at a corner, which corresponds to a sparse solution.

Why do corners produce zero weights? A corner of the L1 diamond is a point like (C, 0) or (0, C) - exactly one weight is nonzero. The loss contours are smooth ellipses. As an ellipse expands outward from the data-loss minimum, it first touches the diamond boundary, and a smooth ellipse is far more likely to hit a sharp corner than one of the diamond's flat faces - especially in high dimensions, where there are many corners (two per coordinate axis) and each corner sets all but one weight to zero.


Practical Application: Feature Selection

Suppose you are building a model to predict house prices with 1,000 candidate features: square footage, bedrooms, proximity to schools, age of roof, fireplace presence, day-of-year the listing went live, and 993 others. Many are noise.

With L1 regularization, training drives irrelevant weights to exactly zero. You might end up with only 50 nonzero weights. The other 950 are zeroed out, not just small.

This is automatic feature selection - invaluable for interpretability in medicine, finance, and any domain where you must explain predictions.
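
A sketch of this workflow with scikit-learn's Lasso (the synthetic data and alpha value are illustrative; the exact nonzero count depends on both):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 1,000 candidate features, only 50 of which actually carry signal
X, y = make_regression(n_samples=500, n_features=1000, n_informative=50,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
n_nonzero = int(np.sum(lasso.coef_ != 0))
print(f"{n_nonzero} of {lasso.coef_.size} weights are nonzero")
```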

When to Use L1 vs L2

Use L1 when:

  • You expect most features to be irrelevant (sparse ground truth)
  • Interpretability matters - you want to know which features drive predictions
  • Statistics name: Lasso (Least Absolute Shrinkage and Selection Operator)

Use L2 when:

  • Most features contribute something (dense solution is reasonable)
  • You want smooth, stable optimization (L2 is differentiable everywhere)
  • Statistics name: Ridge regression

Use Elastic Net when:

  • You want some sparsity with the smoothness of L2
  • There are groups of correlated features (L1 arbitrarily picks one; Elastic Net can include all)
$$L_{\text{elastic}} = L_{\text{data}} + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$$

  • $\lambda_1$ - L1 penalty strength: controls sparsity
  • $\lambda_2$ - L2 penalty strength: controls weight magnitude
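
Note that scikit-learn's ElasticNet parameterizes the two strengths through alpha and l1_ratio rather than separate $\lambda_1$ and $\lambda_2$ (a sketch; data and values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=500, n_features=100, n_informative=20,
                       random_state=0)

# l1_ratio blends the penalties: 1.0 is pure L1 (Lasso), 0.0 is pure L2 (Ridge)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print(int((enet.coef_ != 0).sum()), "of", enet.coef_.size, "weights are nonzero")
```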

In deep learning, L2 (weight decay) is far more common than L1 because optimization is easier. L1 shines in classical machine learning on tabular data with explicit feature engineering.
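
One common way to bolt an L1 term onto a training step in PyTorch, as a sketch (model, data, and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(1000, 1)
# L2 is built into the optimizer as weight_decay...
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(32, 1000), torch.randn(32, 1)
lam_l1 = 1e-4

optimizer.zero_grad()
data_loss = nn.functional.mse_loss(model(x), y)
# ...but an L1 term has to be added to the loss by hand.
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = data_loss + lam_l1 * l1_penalty
loss.backward()
optimizer.step()
```

A plain gradient step on the L1 term oscillates around zero rather than landing on it exactly; producing true zeros requires the clamping (proximal) step sketched earlier.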
