Why Not MSE for Classification?
When your model output passes through a sigmoid and you compute MSE loss, two things go wrong:
1. The loss surface becomes non-convex. MSE composed with sigmoid creates a highly nonlinear curve with plateaus and saddle points. Training slows and gets trapped.
2. The gradients from MSE include the sigmoid derivative. We saw that \sigma'(z) = \sigma(z)(1-\sigma(z)) \leq 0.25, and near saturation it approaches zero. MSE propagates this near-zero derivative into every weight update, causing the vanishing gradient problem even for a single layer (a numeric sketch follows this list).
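Here is a small numeric sketch of point 2 in plain Python (the specific values are illustrative, not from the text): for a confidently wrong prediction, the MSE-through-sigmoid gradient is scaled by σ'(z) and nearly vanishes, while the cross-entropy gradient derived later in this section stays large.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A confidently wrong prediction: the true label is 1, but z is very negative,
# so the sigmoid output is close to 0.
y = 1.0
z = -6.0
y_hat = sigmoid(z)                      # ~0.0025

# MSE loss L = (y_hat - y)^2, so the chain rule gives dL/dz = 2*(y_hat - y)*sigma'(z),
# and sigma'(z) = sigma(z)*(1 - sigma(z)) is nearly zero at saturation.
sigmoid_deriv = y_hat * (1.0 - y_hat)   # ~0.0025
mse_grad = 2.0 * (y_hat - y) * sigmoid_deriv

# Cross-entropy's gradient with respect to z (derived below) is simply y_hat - y.
ce_grad = y_hat - y

print(f"sigma'(z)           = {sigmoid_deriv:.6f}")  # tiny
print(f"MSE dL/dz           = {mse_grad:.6f}")       # ~ -0.005: almost no learning signal
print(f"cross-entropy dL/dz = {ce_grad:.6f}")        # ~ -1.0: strong corrective signal
```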
We need a loss designed specifically for probability outputs. Enter cross-entropy.
The Cross-Entropy Loss
Binary cross-entropy for a single example:

L = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]

- L - loss for a single training example - a non-negative number
- y - true label - either 0 or 1
- ŷ - predicted probability - σ(w·x + b), always in (0,1)
- log - natural logarithm - undefined for arguments ≤ 0, equals 0 at 1
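A minimal sketch of this formula in plain Python (the function name and the clipping epsilon are my own additions, included only so that log(0) can never occur):

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy loss for a single example.

    y     : true label, 0 or 1
    y_hat : predicted probability in (0, 1)
    eps   : clamp to keep y_hat strictly inside (0, 1)
    """
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))
```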
This looks complex but breaks into two clean cases.
Case 1: y = 1 (example is actually class 1)
The second term vanishes (since 1 - y = 0), leaving L = -log(ŷ):
- ŷ = 0.99 (99% confident, correct!): L = -log(0.99) ≈ 0.01. Tiny loss.
- ŷ = 0.5 (uncertain): L = -log(0.5) ≈ 0.693. Moderate loss.
- ŷ = 0.01 (99% confident in the wrong answer!): L = -log(0.01) ≈ 4.6. Huge loss.
Case 2: y = 0 (example is actually class 0)
The first term vanishes, leaving L = -log(1 - ŷ):
Now we penalize the probability assigned to class 1 (since class 0 is correct). If ŷ = 0.01 (correctly predicting class 0): L = -log(0.99) ≈ 0.01. If ŷ = 0.99 (wrong!): L = -log(0.01) ≈ 4.6. Same penalty structure, mirrored.
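A quick check in plain Python that reproduces the numbers from both cases:

```python
import math

# Case 1: true class is 1, so the loss reduces to -log(y_hat)
for y_hat in (0.99, 0.50, 0.01):
    print(f"y=1, y_hat={y_hat:.2f}  ->  loss = {-math.log(y_hat):.3f}")

# Case 2: true class is 0, so the loss reduces to -log(1 - y_hat)
for y_hat in (0.01, 0.99):
    print(f"y=0, y_hat={y_hat:.2f}  ->  loss = {-math.log(1 - y_hat):.3f}")
```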
The pattern: cross-entropy penalizes the negative log of the probability assigned to the correct class. The more confident and wrong you are, the larger the penalty — growing without bound as the prediction approaches 0 for the correct class.
The Gradient Is Beautiful
Applying the chain rule to compute ∂L/∂z (where z = w·x + b is the pre-sigmoid linear output), the derivative simplifies dramatically because the sigmoid derivative \sigma'(z) = \sigma(z)(1-\sigma(z)) cancels with terms in the log derivative:

\frac{\partial L}{\partial z} = \hat{y} - y

- ∂L/∂z - gradient of loss with respect to the linear output z
- ŷ - predicted probability = σ(z)
- y - true label (0 or 1)
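For the curious, here is the cancellation written out step by step (a standard derivation using the definitions above):

```latex
\frac{\partial L}{\partial \hat{y}}
  = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
  = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})},
\qquad
\frac{\partial \hat{y}}{\partial z} = \sigma'(z) = \hat{y}(1 - \hat{y})

\frac{\partial L}{\partial z}
  = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}
  = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y})
  = \hat{y} - y
```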
Prediction minus truth. That's it. Let's check this makes sense:
- ŷ = 0.9, y = 1: gradient = -0.1. Negative: push z up to increase ŷ toward 1.
- ŷ = 0.8, y = 0: gradient = +0.8. Positive: push z down to decrease ŷ toward 0.
- ŷ = y: gradient = 0. No update — the prediction is perfect.
This gradient is numerically clean: no near-zero sigmoid derivative, no plateaus, just direct proportional feedback. This is why cross-entropy and sigmoid are designed to work together — the sigmoid's messy derivative cancels exactly with the log's derivative, yielding the simplest possible gradient.
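If you want to verify the cancellation numerically, a centered finite difference of the loss with respect to z should match ŷ - y (a small sketch in plain Python; the point z = 0.3 and the step h are arbitrary choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ce_loss(y, z):
    # Binary cross-entropy expressed as a function of the pre-sigmoid output z.
    y_hat = sigmoid(z)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

y, z, h = 1.0, 0.3, 1e-6
y_hat = sigmoid(z)

analytic = y_hat - y                                          # "prediction minus truth"
numeric = (ce_loss(y, z + h) - ce_loss(y, z - h)) / (2 * h)   # centered finite difference

print(f"analytic dL/dz = {analytic:.6f}")
print(f"numeric  dL/dz = {numeric:.6f}")   # the two should agree to ~6 decimal places
```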
Interactive example
Drag the predicted probability slider to see how cross-entropy loss grows as the model becomes confident and wrong
Coming soon