
Cross-entropy as negative log-likelihood


Cross-Entropy Loss: The Unifying Principle

Cross-entropy defined. Connection to KL divergence and entropy. Why minimizing cross-entropy IS minimizing KL divergence. The unification: every standard loss is cross-entropy under a distributional assumption.



Quick refresher

Maximum Likelihood Estimation

MLE finds the parameters θ that make the observed data most probable: θ̂ = argmax_θ Σᵢ log p(xᵢ|θ). Taking the log turns the product of probabilities into a sum. Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood (NLL).

Example

For Gaussian noise: maximizing the Gaussian log-likelihood is equivalent to minimizing the sum of squared residuals (MSE).

The two objectives have the same argmax.
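To see the refresher concretely, here is a small numerical sketch (the data and the grid of candidate means are illustrative): with Gaussian noise of known σ, the negative log-likelihood and the MSE pick out the same parameter.

import math
import torch

torch.manual_seed(0)
x = 3.0 + torch.randn(1000)          # data drawn from N(3, 1)
sigma = 1.0                          # assumed known noise scale

mus = torch.linspace(0.0, 6.0, 601)  # candidate values for the mean parameter
sq = (x.unsqueeze(0) - mus.unsqueeze(1)) ** 2             # shape (601, 1000)
nll = sq.sum(dim=1) / (2 * sigma**2) + x.numel() * 0.5 * math.log(2 * math.pi * sigma**2)
mse = sq.mean(dim=1)

print(mus[nll.argmin()], mus[mse.argmin()], x.mean())     # all roughly 3.0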

The Unifying Idea

There are dozens of loss functions in machine learning: MSE, binary cross-entropy, categorical cross-entropy, focal loss, Poisson loss, and more. They seem like a disconnected catalog of choices.

They are not. Every standard loss function is cross-entropy under a distributional assumption. This lesson derives that unification, tying together MLE, KL divergence, and the losses you write in code.

Cross-Entropy Defined

The cross-entropy between a true distribution P and an approximate distribution Q is:

H(P, Q) = -\sum_{x} P(x)\log Q(x)
H(P, Q)
cross-entropy — expected negative log-probability under Q when data comes from P
P(x)
probability of outcome x under the true distribution
Q(x)
probability of outcome x under the model distribution

Compare to entropy: H(P) = -\sum_x P(x)\log P(x). Cross-entropy replaces \log P(x) with \log Q(x): you're using the model's code for data that comes from the true distribution.

Since \log(P/Q) = \log P - \log Q, expanding D_{\text{KL}}(P \,\|\, Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)} gives \sum_x P(x)\log P(x) - \sum_x P(x)\log Q(x) = H(P, Q) - H(P). Rearranging:

H(P, Q) = H(P) + D_{\text{KL}}(P \,\|\, Q)
D_{\text{KL}}(P \,\|\, Q)
KL divergence between the true distribution P and the model Q

This is the key equation. Cross-entropy = entropy of the truth + the extra cost from using the wrong distribution.

Since H(P) is the entropy of the true distribution, which we cannot change by adjusting model parameters, minimizing cross-entropy over \theta is identical to minimizing the KL divergence D_{\text{KL}}(P \,\|\, Q_\theta).
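A quick numerical sanity check of the decomposition (the two three-outcome distributions below are arbitrary):

import torch

P = torch.tensor([0.5, 0.3, 0.2])   # "true" distribution
Q = torch.tensor([0.4, 0.4, 0.2])   # model distribution

H_P  = -(P * P.log()).sum()         # entropy H(P)
H_PQ = -(P * Q.log()).sum()         # cross-entropy H(P, Q)
D_KL =  (P * (P / Q).log()).sum()   # KL divergence D_KL(P || Q)

print(H_PQ, H_P + D_KL)             # equal, up to floating-point error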

Cross-Entropy IS Negative Log-Likelihood

Now connect to MLE. Suppose we have a dataset with empirical distribution \hat{P}: it places mass 1/n on each training example x_i.

The cross-entropy between the empirical distribution and the model is:

H(\hat{P}, Q_\theta) = -\sum_i \hat{P}(x_i)\log Q(x_i;\theta) = -\frac{1}{n}\sum_{i=1}^{n}\log Q(x_i;\theta)
n
number of training examples
Q(xᵢ;θ)
model probability of training example xᵢ given parameters θ

This is exactly the negative log-likelihood divided by n. Minimizing this cross-entropy over \theta is identical to maximizing the log-likelihood.

The chain of equivalences: minimizing cross-entropy H(\hat{P}, Q_\theta) ⇔ minimizing D_{\text{KL}}(\hat{P} \,\|\, Q_\theta) ⇔ maximizing the log-likelihood of the data.
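A small sketch of the identity, with an arbitrary four-outcome categorical model: averaging -log Q(xᵢ) over the training examples gives the same number as the cross-entropy computed from the empirical distribution.

import torch

Q = torch.tensor([0.1, 0.2, 0.3, 0.4])    # model probabilities over 4 outcomes
data = torch.tensor([3, 1, 3, 2, 0, 3])   # observed outcomes, n = 6

nll_per_example = -Q[data].log().mean()   # (1/n) * Σᵢ -log Q(xᵢ)

P_hat = torch.bincount(data, minlength=4).float() / data.numel()  # empirical distribution
cross_entropy = -(P_hat * Q.log()).sum()                          # H(P̂, Q)

print(nll_per_example, cross_entropy)     # equal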

For Classification: Softmax Cross-Entropy

In a multi-class classification problem, the true label y is a one-hot vector, and the model outputs softmax probabilities \hat{y}.

The cross-entropy loss per example is:

\mathcal{L} = H(y, \hat{y}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_{y^*}
K
number of classes
yₖ
1 if class k is the true class, 0 otherwise
ŷₖ
model's predicted probability for class k

where y^* is the true class index. Since y is one-hot, only the k = y^* term survives. The loss is simply -\log of the probability assigned to the correct class.

Numerical example: the true class is index 2 (zero-indexed), and the model predicts \hat{y} = [0.1, 0.7, 0.2].

The probability assigned to the true class is \hat{y}_2 = 0.2, so \mathcal{L} = -\log(0.2) \approx 1.609 nats. The model put most of its mass (0.7) on a wrong class, and the loss is correspondingly large; had the true class been index 1, the loss would have been only -\log(0.7) \approx 0.357 nats.
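The same number falls out of PyTorch's built-in loss. In this sketch the log-probabilities are passed as logits, which is valid here because ŷ already sums to 1, so softmax maps them back to ŷ.

import torch
import torch.nn.functional as F

y_hat = torch.tensor([[0.1, 0.7, 0.2]])   # predicted probabilities, one example
label = torch.tensor([2])                 # true class index (zero-indexed)

loss = F.cross_entropy(y_hat.log(), label)    # softmax(log ŷ) = ŷ, so loss = -log ŷ[2]
print(loss, -torch.log(torch.tensor(0.2)))    # both ≈ 1.609 nats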

MSE = Cross-Entropy under Gaussian Noise

Assume outputs are generated by a Gaussian: y \sim \mathcal{N}(\hat{y}, \sigma^2).

The cross-entropy loss for one example:

-\log P(y \mid \hat{y}) = \frac{(y - \hat{y})^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)
σ
noise standard deviation — fixed hyperparameter
ŷ
model prediction

The second term is constant with respect to the model parameters, and the 1/(2\sigma^2) factor only rescales the objective. Minimizing the cross-entropy therefore reduces to minimizing:

\mathcal{L}_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
MSE
mean squared error

MSE is cross-entropy under a Gaussian assumption. You choose MSE when you believe the noise in your outputs is roughly Gaussian.
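A short check of that reduction with σ = 1 and arbitrary example values: the Gaussian NLL equals the MSE once the MSE is rescaled by 1/(2σ²) and shifted by the constant term.

import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
y = torch.randn(5)       # targets
y_hat = torch.randn(5)   # predictions
sigma = 1.0              # assumed noise standard deviation

gaussian_nll = ((y - y_hat) ** 2 / (2 * sigma**2)
                + 0.5 * math.log(2 * math.pi * sigma**2)).mean()
mse = F.mse_loss(y_hat, y)

print(gaussian_nll, mse / (2 * sigma**2) + 0.5 * math.log(2 * math.pi * sigma**2))  # equal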

Binary Cross-Entropy = Bernoulli NLL

For binary classification (y \in \{0, 1\}), assume y \sim \text{Bernoulli}(\hat{y}):

\mathcal{L}_{\text{BCE}} = -\left[y \log \hat{y} + (1-y)\log(1-\hat{y})\right]
ŷ
model's predicted probability of y=1 (e.g., sigmoid output)
y
true binary label (0 or 1)

This is the NLL of a Bernoulli distribution. When y = 1, only the first term survives: the loss is -\log\hat{y}. When y = 0, only the second term survives: the loss is -\log(1-\hat{y}).
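A quick check with arbitrary example values: computing the Bernoulli NLL by hand matches PyTorch's binary cross-entropy.

import torch
import torch.nn.functional as F

y_hat = torch.tensor([0.9, 0.2, 0.6])   # predicted P(y = 1)
y = torch.tensor([1.0, 0.0, 1.0])       # true binary labels

manual = -(y * y_hat.log() + (1 - y) * (1 - y_hat).log()).mean()
builtin = F.binary_cross_entropy(y_hat, y)

print(manual, builtin)                  # equal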

In Code

import torch
import torch.nn.functional as F

# Example tensors so the snippet runs end to end (shapes and values are illustrative)
logits = torch.randn(4, 3)                       # raw scores before softmax
labels = torch.tensor([2, 0, 1, 2])              # integer class indices
sigmoid_outputs = torch.sigmoid(torch.randn(4))  # predicted P(y = 1) per example
binary_targets = torch.tensor([1.0, 0.0, 1.0, 0.0])
predictions, targets = torch.randn(4), torch.randn(4)

# Categorical cross-entropy (softmax CE) = NLL of a categorical distribution
loss_ce = F.cross_entropy(logits, labels)

# Binary cross-entropy = NLL of a Bernoulli distribution
loss_bce = F.binary_cross_entropy(sigmoid_outputs, binary_targets)

# MSE = NLL of a Gaussian N(prediction, σ²), up to constants
loss_mse = F.mse_loss(predictions, targets)

# All three are minimizing cross-entropy H(P_true, P_model)
# All three are minimizing D_KL(P_true || P_model)
# All three are maximizing likelihood

The choice of loss function is the choice of noise model. Make that choice explicitly, not by default.
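One way to make the choice explicit is to write each loss as the NLL of a noise model from torch.distributions; a sketch with illustrative tensors:

import torch
from torch.distributions import Bernoulli, Categorical, Normal

torch.manual_seed(0)

# Classification: categorical noise model (equals softmax cross-entropy)
logits = torch.randn(4, 3)
labels = torch.tensor([2, 0, 1, 2])
nll_categorical = -Categorical(logits=logits).log_prob(labels).mean()

# Binary classification: Bernoulli noise model (equals binary cross-entropy)
probs = torch.sigmoid(torch.randn(4))
binary_labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
nll_bernoulli = -Bernoulli(probs=probs).log_prob(binary_labels).mean()

# Regression: Gaussian noise model with σ = 1 (equals MSE/2 plus a constant)
preds, regression_targets = torch.randn(4), torch.randn(4)
nll_gaussian = -Normal(preds, 1.0).log_prob(regression_targets).mean()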

Quiz

1 / 3

Cross-entropy H(P,Q) and entropy H(P) are related to KL divergence by: