
Cross-entropy as negative log-likelihood


Cross-Entropy Loss: The Unifying Principle

Cross-entropy defined. Connection to KL divergence and entropy. Why minimizing cross-entropy IS minimizing KL divergence. The unification: every standard loss is cross-entropy under a distributional assumption.



Quick refresher

Maximum Likelihood Estimation

MLE finds the parameters θ that make the observed data most probable: θ̂ = argmax_θ Σᵢ log p(xᵢ|θ). Taking the log turns the product of probabilities into a sum. Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood (NLL).

Example

For Gaussian noise: maximizing the Gaussian log-likelihood is equivalent to minimizing the sum of squared residuals (MSE).

The two objectives have the same argmax.
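To see the refresher concretely, here is a small numerical sketch (the data and the grid of candidate means are illustrative): with Gaussian noise of known σ, the negative log-likelihood and the MSE pick out the same parameter.

import math
import torch

torch.manual_seed(0)
x = 3.0 + torch.randn(1000)          # data drawn from N(3, 1)
sigma = 1.0                          # assumed known noise scale

mus = torch.linspace(0.0, 6.0, 601)  # candidate values for the mean parameter
sq = (x.unsqueeze(0) - mus.unsqueeze(1)) ** 2             # shape (601, 1000)
nll = sq.sum(dim=1) / (2 * sigma**2) + x.numel() * 0.5 * math.log(2 * math.pi * sigma**2)
mse = sq.mean(dim=1)

print(mus[nll.argmin()], mus[mse.argmin()], x.mean())     # all roughly 3.0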

The Unifying Idea

There are dozens of loss functions in machine learning: MSE, binary cross-entropy, categorical cross-entropy, focal loss, Poisson loss, and more. They seem like a disconnected catalog of choices.

They are not. Every standard loss function is cross-entropy under a distributional assumption. This lesson derives that unification, tying together MLE, KL divergence, and the losses you write in code.

Cross-Entropy Defined

The cross-entropy between a true distribution P and an approximate distribution Q is:

H(P, Q) = -\sum_{x} P(x)\log Q(x)
H(P, Q)
cross-entropy — expected negative log-probability under Q when data comes from P
P(x)
probability of outcome x under the true distribution
Q(x)
probability of outcome x under the model distribution

Compare to entropy: H(P) = -\sum_x P(x)\log P(x). Cross-entropy replaces \log P(x) with \log Q(x): you're using the model's code for data that comes from the true distribution.

Since \log(P/Q) = \log P - \log Q, expanding D_{\text{KL}}(P \,\|\, Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)} gives \sum_x P(x)\log P(x) - \sum_x P(x)\log Q(x) = H(P, Q) - H(P). Rearranging:

H(P, Q) = H(P) + D_{\text{KL}}(P \,\|\, Q)
D_{\text{KL}}(P \,\|\, Q)
KL divergence between the true distribution P and the model Q

This is the key equation. Cross-entropy = entropy of the truth + the extra cost from using the wrong distribution.

Since H(P) is the entropy of the true distribution, which we cannot change by adjusting model parameters, minimizing cross-entropy over \theta is identical to minimizing the KL divergence D_{\text{KL}}(P \,\|\, Q_\theta).
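A quick numerical sanity check of the decomposition (the two three-outcome distributions below are arbitrary):

import torch

P = torch.tensor([0.5, 0.3, 0.2])   # "true" distribution
Q = torch.tensor([0.4, 0.4, 0.2])   # model distribution

H_P  = -(P * P.log()).sum()         # entropy H(P)
H_PQ = -(P * Q.log()).sum()         # cross-entropy H(P, Q)
D_KL =  (P * (P / Q).log()).sum()   # KL divergence D_KL(P || Q)

print(H_PQ, H_P + D_KL)             # equal, up to floating-point error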

Cross-Entropy IS Negative Log-Likelihood

Now connect to MLE. Suppose we have a dataset with empirical distribution \hat{P}: it places mass 1/n on each training example x_i.

The cross-entropy between the empirical distribution and the model is:

H(\hat{P}, Q_\theta) = -\sum_i \hat{P}(x_i)\log Q(x_i;\theta) = -\frac{1}{n}\sum_{i=1}^{n}\log Q(x_i;\theta)
n
number of training examples
Q(xᵢ;θ)
model probability of training example xᵢ given parameters θ

This is exactly the negative log-likelihood divided by n. Minimizing this cross-entropy over \theta is identical to maximizing the log-likelihood.

The chain of equivalences: minimizing cross-entropy H(\hat{P}, Q_\theta) ⇔ minimizing D_{\text{KL}}(\hat{P} \,\|\, Q_\theta) ⇔ maximizing the log-likelihood of the data.
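A small sketch of the identity, with an arbitrary four-outcome categorical model: averaging -log Q(xᵢ) over the training examples gives the same number as the cross-entropy computed from the empirical distribution.

import torch

Q = torch.tensor([0.1, 0.2, 0.3, 0.4])    # model probabilities over 4 outcomes
data = torch.tensor([3, 1, 3, 2, 0, 3])   # observed outcomes, n = 6

nll_per_example = -Q[data].log().mean()   # (1/n) * Σᵢ -log Q(xᵢ)

P_hat = torch.bincount(data, minlength=4).float() / data.numel()  # empirical distribution
cross_entropy = -(P_hat * Q.log()).sum()                          # H(P̂, Q)

print(nll_per_example, cross_entropy)     # equal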

For Classification: Softmax Cross-Entropy

In a multi-class classification problem, the true label y is a one-hot vector, and the model outputs softmax probabilities \hat{y}.

The cross-entropy loss per example is:

\mathcal{L} = H(y, \hat{y}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_{y^*}
K
number of classes
yₖ
1 if class k is the true class, 0 otherwise
ŷₖ
model's predicted probability for class k

where y^* is the true class index. Since y is one-hot, only the k = y^* term survives. The loss is simply -\log of the probability assigned to the correct class.

Numerical example: the true class is index 2 (zero-indexed), and the model predicts \hat{y} = [0.1, 0.7, 0.2].

The probability assigned to the true class is \hat{y}_2 = 0.2, so \mathcal{L} = -\log(0.2) \approx 1.609 nats. The model put most of its mass (0.7) on a wrong class, and the loss is correspondingly large; had the true class been index 1, the loss would have been only -\log(0.7) \approx 0.357 nats.
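The same number falls out of PyTorch's built-in loss. In this sketch the log-probabilities are passed as logits, which is valid here because ŷ already sums to 1, so softmax maps them back to ŷ.

import torch
import torch.nn.functional as F

y_hat = torch.tensor([[0.1, 0.7, 0.2]])   # predicted probabilities, one example
label = torch.tensor([2])                 # true class index (zero-indexed)

loss = F.cross_entropy(y_hat.log(), label)    # softmax(log ŷ) = ŷ, so loss = -log ŷ[2]
print(loss, -torch.log(torch.tensor(0.2)))    # both ≈ 1.609 nats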

MSE = Cross-Entropy under Gaussian Noise

Assume outputs are generated by a Gaussian: y \sim \mathcal{N}(\hat{y}, \sigma^2).

The cross-entropy loss for one example:

-\log P(y \mid \hat{y}) = \frac{(y - \hat{y})^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)
σ
noise standard deviation — fixed hyperparameter
ŷ
model prediction

The second term is constant with respect to the model parameters, and the 1/(2\sigma^2) factor only rescales the objective. Minimizing the cross-entropy therefore reduces to minimizing:

\mathcal{L}_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
MSE
mean squared error

MSE is cross-entropy under a Gaussian assumption. You choose MSE when you believe the noise in your outputs is roughly Gaussian.
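A short check of that reduction with σ = 1 and arbitrary example values: the Gaussian NLL equals the MSE once the MSE is rescaled by 1/(2σ²) and shifted by the constant term.

import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
y = torch.randn(5)       # targets
y_hat = torch.randn(5)   # predictions
sigma = 1.0              # assumed noise standard deviation

gaussian_nll = ((y - y_hat) ** 2 / (2 * sigma**2)
                + 0.5 * math.log(2 * math.pi * sigma**2)).mean()
mse = F.mse_loss(y_hat, y)

print(gaussian_nll, mse / (2 * sigma**2) + 0.5 * math.log(2 * math.pi * sigma**2))  # equal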

Binary Cross-Entropy = Bernoulli NLL

For binary classification (y \in \{0, 1\}), assume y \sim \text{Bernoulli}(\hat{y}):

\mathcal{L}_{\text{BCE}} = -\left[y \log \hat{y} + (1-y)\log(1-\hat{y})\right]
ŷ
model's predicted probability of y=1 (e.g., sigmoid output)
y
true binary label (0 or 1)

This is the NLL of a Bernoulli distribution. When y = 1, only the first term survives: the loss is -\log\hat{y}. When y = 0, only the second term survives: the loss is -\log(1-\hat{y}).
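A quick check with arbitrary example values: computing the Bernoulli NLL by hand matches PyTorch's binary cross-entropy.

import torch
import torch.nn.functional as F

y_hat = torch.tensor([0.9, 0.2, 0.6])   # predicted P(y = 1)
y = torch.tensor([1.0, 0.0, 1.0])       # true binary labels

manual = -(y * y_hat.log() + (1 - y) * (1 - y_hat).log()).mean()
builtin = F.binary_cross_entropy(y_hat, y)

print(manual, builtin)                  # equal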

In Code

import torch
import torch.nn.functional as F

# Example tensors so the snippet runs end to end (shapes and values are illustrative)
logits = torch.randn(4, 3)                       # raw scores before softmax
labels = torch.tensor([2, 0, 1, 2])              # integer class indices
sigmoid_outputs = torch.sigmoid(torch.randn(4))  # predicted P(y = 1) per example
binary_targets = torch.tensor([1.0, 0.0, 1.0, 0.0])
predictions, targets = torch.randn(4), torch.randn(4)

# Categorical cross-entropy (softmax CE) = NLL of a categorical distribution
loss_ce = F.cross_entropy(logits, labels)

# Binary cross-entropy = NLL of a Bernoulli distribution
loss_bce = F.binary_cross_entropy(sigmoid_outputs, binary_targets)

# MSE = NLL of a Gaussian N(prediction, σ²), up to constants
loss_mse = F.mse_loss(predictions, targets)

# All three are minimizing cross-entropy H(P_true, P_model)
# All three are minimizing D_KL(P_true || P_model)
# All three are maximizing likelihood

The choice of loss function is the choice of noise model. Make that choice explicitly, not by default.
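One way to make the choice explicit is to write each loss as the NLL of a noise model from torch.distributions; a sketch with illustrative tensors:

import torch
from torch.distributions import Bernoulli, Categorical, Normal

torch.manual_seed(0)

# Classification: categorical noise model (equals softmax cross-entropy)
logits = torch.randn(4, 3)
labels = torch.tensor([2, 0, 1, 2])
nll_categorical = -Categorical(logits=logits).log_prob(labels).mean()

# Binary classification: Bernoulli noise model (equals binary cross-entropy)
probs = torch.sigmoid(torch.randn(4))
binary_labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
nll_bernoulli = -Bernoulli(probs=probs).log_prob(binary_labels).mean()

# Regression: Gaussian noise model with σ = 1 (equals MSE/2 plus a constant)
preds, regression_targets = torch.randn(4), torch.randn(4)
nll_gaussian = -Normal(preds, 1.0).log_prob(regression_targets).mean()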

Quiz

1 / 3

Cross-entropy H(P,Q) and entropy H(P) are related to KL divergence by: