Cross-Entropy - The Right Loss for Probabilities

Why MSE fails with sigmoid outputs, the cross-entropy formula and its two cases, the maximum likelihood derivation, and the clean gradient ŷ - y.


Quick refresher

Logarithm basics

log(x) is the inverse of e^x. Key values: log(1)=0, log(0.5)≈-0.693, log(very small)→-∞. Logarithm is only defined for x>0.

Example

log(0.9) ≈ -0.105 (small penalty for high-confidence correct prediction).

log(0.01) ≈ -4.6 (large penalty for very low probability on correct class).
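These values are easy to verify yourself; here is a minimal check using Python's standard library (math.log is the natural logarithm):

```python
import math

print(math.log(0.9))   # ≈ -0.105  small penalty: confident and correct
print(math.log(0.5))   # ≈ -0.693  moderate penalty: uncertain
print(math.log(0.01))  # ≈ -4.605  large penalty: almost no probability on the correct class
print(math.log(1.0))   # 0.0; math.log(0) raises ValueError, since log is undefined at 0
```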

Why Not MSE for Classification?

When your model output passes through a sigmoid and you compute MSE loss, two things go wrong:

1. The loss surface becomes non-convex. MSE composed with sigmoid creates a highly nonlinear curve with plateaus and saddle points. Training slows and gets trapped.

2. The gradients from MSE include the sigmoid derivative. We saw that \sigma'(z) = \sigma(z)(1-\sigma(z)) \leq 0.25, and near saturation it approaches zero. MSE propagates this near-zero derivative through every weight update, causing vanishing gradients even for a single layer; the sketch below makes this concrete.
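A minimal sketch of that second failure mode, assuming NumPy and MSE defined as (ŷ - y)² on a single example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Confidently wrong: the true label is 1, but z is very negative
z, y = -6.0, 1.0
y_hat = sigmoid(z)   # ≈ 0.0025, should be pushed toward 1

# d/dz of the MSE loss (y_hat - y)^2 carries sigmoid'(z) = y_hat * (1 - y_hat)
grad_mse = 2 * (y_hat - y) * y_hat * (1 - y_hat)
print(grad_mse)      # ≈ -0.005: the learning signal has almost vanished
```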

We need a loss designed specifically for probability outputs. Enter cross-entropy.

The Cross-Entropy Loss

The binary form, for a single example:

L = -\bigl[y \cdot \log(\hat{y}) + (1-y) \cdot \log(1-\hat{y})\bigr]
where:

  • L - loss for a single training example, a non-negative number
  • y - true label, either 0 or 1
  • \hat{y} - predicted probability, σ(w·x + b), always in (0, 1)
  • \log - natural logarithm, undefined for arguments ≤ 0, equal to 0 at 1

This looks complex but breaks into two clean cases.

Case 1: y = 1 (the example is actually class 1)

The second term vanishes (1 - y = 0):

L = -\log(\hat{y})
  • ŷ = 0.99 (99% confident, correct!): L = -log(0.99) ≈ 0.01. Tiny loss.
  • ŷ = 0.5 (uncertain): L = -log(0.5) ≈ 0.693. Moderate loss.
  • ŷ = 0.01 (99% confident in the wrong answer!): L = -log(0.01) ≈ 4.6. Huge loss.

Case 2: y = 0 (the example is actually class 0)

The first term vanishes:

L = -\log(1 - \hat{y})

Now we penalize the probability assigned to class 1 (since class 0 is correct). If \hat{y} = 0.01 (correctly predicting class 0): L = -\log(0.99) \approx 0.01. If \hat{y} = 0.99 (wrong!): L = -\log(0.01) \approx 4.6. Same penalty structure, mirrored.

The pattern: cross-entropy penalizes the negative log of the probability assigned to the correct class. The more confident and wrong you are, the larger the penalty, growing without bound as the predicted probability of the correct class approaches 0.
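A direct implementation of the formula, reproducing the numbers from both cases above (a minimal sketch; the eps clipping is an added guard against log(0), not part of the formula itself):

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy loss for a single example; eps guards against log(0)."""
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Case 1: true label is 1
print(binary_cross_entropy(1, 0.99))  # ≈ 0.01   confident and correct
print(binary_cross_entropy(1, 0.5))   # ≈ 0.693  uncertain
print(binary_cross_entropy(1, 0.01))  # ≈ 4.6    confident and wrong

# Case 2: true label is 0 -- same structure, mirrored
print(binary_cross_entropy(0, 0.01))  # ≈ 0.01
print(binary_cross_entropy(0, 0.99))  # ≈ 4.6
```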

The Gradient Is Beautiful

Applying the chain rule to \partial L / \partial z (where z = \mathbf{w} \cdot \mathbf{x} + b is the pre-sigmoid linear output), the derivative simplifies dramatically because the sigmoid derivative \sigma'(z) = \sigma(z)(1-\sigma(z)) cancels with terms in the log derivative:

\frac{\partial L}{\partial z} = \hat{y} - y

  • \partial L / \partial z - gradient of the loss with respect to the linear output z
  • \hat{y} - predicted probability, σ(z)
  • y - true label (0 or 1)
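For readers who want the intermediate step, here is a sketch of the cancellation, using the same symbols:

\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})}

\frac{\partial \hat{y}}{\partial z} = \sigma'(z) = \hat{y}(1-\hat{y})

\frac{\partial L}{\partial z} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})} \cdot \hat{y}(1-\hat{y}) = \hat{y} - y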

Prediction minus truth. That's it. Let's check this makes sense:

  • ŷ = 0.9, y = 1: gradient = -0.1. Negative: push z up to increase ŷ toward 1.
  • ŷ = 0.8, y = 0: gradient = +0.8. Positive: push z down to decrease ŷ toward 0.
  • ŷ = y: gradient = 0. No update — the prediction is perfect.

This gradient is numerically clean: no near-zero sigmoid derivative, no plateaus, just direct proportional feedback. This is why cross-entropy and sigmoid are designed to work together — the sigmoid's messy derivative cancels exactly with the log's derivative, yielding the simplest possible gradient.
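As a sanity check, here is a small sketch (assuming NumPy; the step size 1e-5 is arbitrary) comparing ŷ - y against a central finite-difference estimate of the gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(z, y):
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y, h = 0.7, 1.0, 1e-5
analytic = sigmoid(z) - y                                                # y_hat - y
numeric = (cross_entropy(z + h, y) - cross_entropy(z - h, y)) / (2 * h)  # finite difference
print(analytic, numeric)   # both ≈ -0.332
```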

Interactive example

Coming soon: drag the predicted probability slider to see how cross-entropy loss grows as the model becomes confident and wrong.

Quiz


For y=1 (true label is positive class), the cross-entropy loss simplifies to...