Why Not MSE for Classification?
When your model output passes through a sigmoid and you compute MSE loss, two things go wrong:
1. The loss surface becomes non-convex. MSE composed with sigmoid creates a highly nonlinear curve with plateaus and saddle points. Training slows and gets trapped.
2. The gradients from MSE include the sigmoid derivative. We saw that \sigma'(z) = \sigma(z)(1-\sigma(z)) \leq 0.25, and near saturation it approaches zero. MSE propagates this near-zero derivative into every weight update, causing the vanishing gradient problem even for a single layer (a numeric sketch follows this list).
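Here is a small numeric sketch of point 2 in plain Python (the specific values are illustrative, not from the text): for a confidently wrong prediction, the MSE-through-sigmoid gradient is scaled by σ'(z) and nearly vanishes, while the cross-entropy gradient derived later in this section stays large.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A confidently wrong prediction: the true label is 1, but z is very negative,
# so the sigmoid output is close to 0.
y = 1.0
z = -6.0
y_hat = sigmoid(z)                      # ~0.0025

# MSE loss L = (y_hat - y)^2, so the chain rule gives dL/dz = 2*(y_hat - y)*sigma'(z),
# and sigma'(z) = sigma(z)*(1 - sigma(z)) is nearly zero at saturation.
sigmoid_deriv = y_hat * (1.0 - y_hat)   # ~0.0025
mse_grad = 2.0 * (y_hat - y) * sigmoid_deriv

# Cross-entropy's gradient with respect to z (derived below) is simply y_hat - y.
ce_grad = y_hat - y

print(f"sigma'(z)           = {sigmoid_deriv:.6f}")  # tiny
print(f"MSE dL/dz           = {mse_grad:.6f}")       # ~ -0.005: almost no learning signal
print(f"cross-entropy dL/dz = {ce_grad:.6f}")        # ~ -1.0: strong corrective signal
```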
We need a loss designed specifically for probability outputs. Enter cross-entropy.
The Cross-Entropy Loss
Binary cross-entropy for a single example:

L = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]

- L - loss for a single training example - a non-negative number
- y - true label - either 0 or 1
- ŷ - predicted probability - σ(w·x + b), always in (0,1)
- log - natural logarithm - undefined for arguments ≤ 0, equals 0 at 1
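A minimal sketch of this formula in plain Python (the function name and the clipping epsilon are my own additions, included only so that log(0) can never occur):

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy loss for a single example.

    y     : true label, 0 or 1
    y_hat : predicted probability in (0, 1)
    eps   : clamp to keep y_hat strictly inside (0, 1)
    """
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))
```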
This looks complex but breaks into two clean cases.
Case 1: y = 1 (example is actually class 1)
The second term vanishes (since 1 - y = 0), leaving L = -log(ŷ):
- ŷ = 0.99 (99% confident, correct!): L = -log(0.99) ≈ 0.01. Tiny loss.
- ŷ = 0.5 (uncertain): L = -log(0.5) ≈ 0.693. Moderate loss.
- ŷ = 0.01 (99% confident in the wrong answer!): L = -log(0.01) ≈ 4.6. Huge loss.
Case 2: y = 0 (example is actually class 0)
The first term vanishes, leaving L = -log(1 - ŷ):
Now we penalize the probability assigned to class 1 (since class 0 is correct). If ŷ = 0.01 (correctly predicting class 0): L = -log(0.99) ≈ 0.01. If ŷ = 0.99 (wrong!): L = -log(0.01) ≈ 4.6. Same penalty structure, mirrored.
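A quick check in plain Python that reproduces the numbers from both cases:

```python
import math

# Case 1: true class is 1, so the loss reduces to -log(y_hat)
for y_hat in (0.99, 0.50, 0.01):
    print(f"y=1, y_hat={y_hat:.2f}  ->  loss = {-math.log(y_hat):.3f}")

# Case 2: true class is 0, so the loss reduces to -log(1 - y_hat)
for y_hat in (0.01, 0.99):
    print(f"y=0, y_hat={y_hat:.2f}  ->  loss = {-math.log(1 - y_hat):.3f}")
```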
The pattern: cross-entropy penalizes the negative log of the probability assigned to the correct class. The more confident and wrong you are, the larger the penalty — growing without bound as the prediction approaches 0 for the correct class.
The Gradient Is Beautiful
Applying the chain rule to compute ∂L/∂z (where z = w·x + b is the pre-sigmoid linear output), the derivative simplifies dramatically because the sigmoid derivative \sigma'(z) = \sigma(z)(1-\sigma(z)) cancels with terms in the log derivative:

\frac{\partial L}{\partial z} = \hat{y} - y

- ∂L/∂z - gradient of loss with respect to the linear output z
- ŷ - predicted probability = σ(z)
- y - true label (0 or 1)
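For the curious, here is the cancellation written out step by step (a standard derivation using the definitions above):

```latex
\frac{\partial L}{\partial \hat{y}}
  = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
  = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})},
\qquad
\frac{\partial \hat{y}}{\partial z} = \sigma'(z) = \hat{y}(1 - \hat{y})

\frac{\partial L}{\partial z}
  = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}
  = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y})
  = \hat{y} - y
```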
Prediction minus truth. That's it. Let's check this makes sense:
- ŷ = 0.9, y = 1: gradient = -0.1. Negative: push z up to increase ŷ toward 1.
- ŷ = 0.8, y = 0: gradient = +0.8. Positive: push z down to decrease ŷ toward 0.
- ŷ = y: gradient = 0. No update — the prediction is perfect.
This gradient is numerically clean: no near-zero sigmoid derivative, no plateaus, just direct proportional feedback. This is why cross-entropy and sigmoid are designed to work together — the sigmoid's messy derivative cancels exactly with the log's derivative, yielding the simplest possible gradient.
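If you want to verify the cancellation numerically, a centered finite difference of the loss with respect to z should match ŷ - y (a small sketch in plain Python; the point z = 0.3 and the step h are arbitrary choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ce_loss(y, z):
    # Binary cross-entropy expressed as a function of the pre-sigmoid output z.
    y_hat = sigmoid(z)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

y, z, h = 1.0, 0.3, 1e-6
y_hat = sigmoid(z)

analytic = y_hat - y                                          # "prediction minus truth"
numeric = (ce_loss(y, z + h) - ce_loss(y, z - h)) / (2 * h)   # centered finite difference

print(f"analytic dL/dz = {analytic:.6f}")
print(f"numeric  dL/dz = {numeric:.6f}")   # the two should agree to ~6 decimal places
```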
Interactive example
Drag the predicted probability slider to see how cross-entropy loss grows as the model becomes confident and wrong
Coming soon