Classification
Lesson 1 ⏱ 10 min

Why linear regression fails for classification


Classification vs Regression - Why Linear Models Break for Class Labels

Three concrete demonstrations of why linear regression fails for binary classification: unbounded outputs, wrong gradients from outliers, and no principled decision threshold.


Quick refresher

Linear regression model

Linear regression predicts ŷ = w·x + b, an unbounded real number. It minimizes mean squared error (MSE) = (1/n)Σ(ŷ - y)² to find the best-fit line through the data.

Example

Predicting house price from size: ŷ = 200·(sqft) + 50000.

Output is an unbounded real number — valid for prices, invalid for probabilities.
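In code, that refresher model is a one-liner. A minimal sketch, using the illustrative coefficients from the example above:

import numpy as np

# Illustrative house-price model from the example above: ŷ = 200·sqft + 50000
sqft = np.array([800, 1500, 2400])
price_pred = 200 * sqft + 50000
print(price_pred)   # [210000 350000 530000], unbounded real numbers, which is fine for prices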

The Setup: Binary Classification

Imagine building a spam filter. Every email either is spam (label y = 1) or isn't (label y = 0). You have features — word counts, sender reputation, message length — and want a model that produces the correct label.

This is binary classification: the output y ∈ {0, 1} — exactly two choices.

Your first instinct: I already know linear regression. Why not just fit a hyperplane through the data and predict 1 if the output exceeds 0.5, predict 0 otherwise?
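Concretely, that naive recipe looks like the sketch below (a minimal illustration with a one-dimensional feature, using np.polyfit as the least-squares fit; the 0.5 cutoff is the ad-hoc rule just described):

import numpy as np

# Naive approach: least-squares fit to 0/1 labels, then threshold at 0.5
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5])   # feature
y = np.array([0,   0,   0,   0,   1,   1,   1,   1,   1])      # binary labels

slope, intercept = np.polyfit(x, y, 1)               # best-fit line through the labels
labels = (slope * x + intercept > 0.5).astype(int)   # tacked-on decision rule
print(labels)   # recovers the labels on this tidy data; the failures below show where it breaks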

That instinct fails for three distinct reasons. Each failure is concrete and instructive.

Problem 1: Predictions Escape [0, 1]

Linear regression outputs ŷ = w·x + b. That expression is unbounded — it can produce 1.7, −0.4, or 823.9.

ŷ = w·x + b ∈ (−∞, +∞)

where ŷ is the model prediction (an unbounded real number for linear regression), w is the weight vector, and b is the bias term.

These values make sense as house prices. They make no sense as probabilities. A "probability" of 1.7 or −0.4 is mathematically meaningless. You'd have to clamp the output to [0, 1] after the fact, and that clamp is completely arbitrary — the model was never trained to stay inside that range.

We want ŷ ∈ (0, 1) — a genuine probability that represents P(y = 1 | x). Linear regression cannot guarantee this.

Problem 2: MSE Gives Wrong Gradients for Class Labels

MSE — the mean squared error we used for regression — behaves badly when labels are 0/1. It penalizes predictions in proportion to how far they are from the true label, but when labels are strictly 0 or 1, this creates specific problems.

Concrete outlier problem: Suppose you have spam emails clustered near x = 5 (with label 1) and legitimate emails near x = 0 (label 0). Your linear model fits this well. Now you add one extreme spam email at x = 100.

The regression line tilts dramatically to pull its prediction for x = 100 closer to 1. In doing so, it pushes the predictions for x = 5 away from 1 — hurting the clearly correct region. The decision boundary slides, even though the data near x = 5 was already correctly classified.

MSE also incurs nonzero loss on correct, confident predictions: if ŷ = 0.9 and y = 1, the loss is (0.9 − 1)² = 0.01 — small but nonzero. The optimizer still nudges the model even for correct answers, which can destabilize the boundary.
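To see both effects numerically, here is a small sketch (the specific prediction values are illustrative, not taken from a fitted model) that computes the squared-error loss and its gradient with respect to ŷ:

def mse_loss_and_grad(y_hat, y):
    # per-example squared error and its gradient: d/dŷ (ŷ - y)² = 2(ŷ - y)
    return (y_hat - y) ** 2, 2 * (y_hat - y)

# A confident, correct prediction still has nonzero loss and gradient
print(mse_loss_and_grad(0.9, 1))   # ≈ (0.01, -0.2): the optimizer keeps nudging a correct answer

# A prediction whose raw linear output has escaped far past 1 (e.g. at an extreme x)
print(mse_loss_and_grad(6.0, 1))   # (25.0, 10.0): one far-off term dominates the total gradient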

Problem 3: No Natural Decision Threshold

With linear regression, you'd add "predict 1 if ŷ > 0.5" as an external rule tacked on after training. But the model was never trained with this threshold in mind. Nothing in the MSE objective cares about where 0.5 falls relative to the outputs.

A model trained to minimize MSE on labels 0 and 1 might produce outputs primarily in [0.1, 0.9] — the 0.5 boundary might split the data perfectly, or it might not. The model has no idea it's supposed to be producing probabilities that straddle 0.5.
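One way to make the arbitrariness concrete (a sketch under assumed data, not the lesson's reference code): on heavily imbalanced but perfectly separable data, every least-squares output can sit below 0.5, so the tacked-on rule never predicts class 1, while a cutoff read off the outputs themselves separates the classes exactly.

import numpy as np

# 200 negatives spread over [0, 1], only 2 positives near x = 2: separable but imbalanced
x = np.concatenate([np.linspace(0.0, 1.0, 200), [2.0, 2.2]])
y = np.concatenate([np.zeros(200), [1.0, 1.0]])

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

print("largest prediction on positives:", round(y_hat[-2:].max(), 2))   # ≈ 0.25, well below 0.5
print("accuracy at threshold 0.5:", np.mean((y_hat > 0.5) == y))        # ≈ 0.99, yet it never predicts class 1

# A threshold chosen from the outputs (midway between the two classes' predictions) is perfect
t = (y_hat[:200].max() + y_hat[-2:].min()) / 2
print("accuracy at a data-driven threshold:", np.mean((y_hat > t) == y))   # 1.0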

What We Actually Want

We want a model that directly outputs P(y = 1 | x) — the probability that example x belongs to class 1. This is a number in (0, 1) by definition. When it exceeds a threshold (usually 0.5, but adjustable), we predict class 1.

The model needs to:

  1. Output a bounded value in (0, 1)
  2. Be trained with a loss designed for probabilities, not for real-valued targets
  3. Have a principled threshold emerge naturally from the training objective

Code: Seeing the Failures in Python

import numpy as np

# Small classification dataset: feature = hours of practice, label = pass/fail
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
y = np.array([0,   0,   0,   0,   1,   1,   1,   1,   1])   # binary labels

# Fit a linear regression line (slope and intercept)
slope, intercept = np.polyfit(x, y, 1)

# Problem 1: predictions can escape [0, 1]
x_test = np.array([-2.0, 0.0, 6.0, 10.0])
y_pred = slope * x_test + intercept
print("Linear predictions:", np.round(y_pred, 2))
# e.g. [-0.94 -0.28  1.72  3.06] — values outside [0, 1] are meaningless as probabilities

# Problem 2: an outlier shifts the decision boundary
x_outlier = np.append(x, 20.0)    # add one extreme example far to the right
y_outlier  = np.append(y, 1.0)
slope2, intercept2 = np.polyfit(x_outlier, y_outlier, 1)

# Where each line crosses 0.5 (the "decision boundary")
boundary_original = (0.5 - intercept)  / slope
boundary_shifted  = (0.5 - intercept2) / slope2
print(f"Boundary without outlier: x ≈ {boundary_original:.2f}")
print(f"Boundary with outlier:    x ≈ {boundary_shifted:.2f}")
# The outlier at x=20 flattens the line and drags the boundary from x ≈ 2.33 to x ≈ 1.83,
# so the point at x=2.0 (label 0), previously classified correctly, is now misclassified

The Bridge: Sigmoid

We still want to use the linear combination w·x + b — it's fast and powerful. We just need to squash its unbounded output into (0, 1) in a smooth, differentiable way. That's exactly what the sigmoid function does — the subject of the next lesson.
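As a small preview (a sketch only; the next lesson develops it properly), the sigmoid σ(z) = 1 / (1 + e⁻ᶻ) takes the same kind of unbounded linear score and maps it smoothly into (0, 1):

import numpy as np

def sigmoid(z):
    # squash an unbounded score into (0, 1), smoothly and differentiably
    return 1.0 / (1.0 + np.exp(-z))

# The same kinds of scores that were meaningless as probabilities in Problem 1 ...
z = np.array([-4.0, -0.4, 0.0, 1.7, 823.9])
print(np.round(sigmoid(z), 3))   # approx. [0.018 0.401 0.5 0.846 1.], always inside (0, 1); the last entry only rounds to 1.0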

