Cross-entropy defined. Connection to KL divergence and entropy. Why minimizing cross-entropy IS minimizing KL divergence. The unification: every standard loss is cross-entropy under a distributional assumption.
⏱ ~8 min
Quick refresher
Maximum Likelihood Estimation
MLE finds the parameters θ that maximize the probability of observing the data: θ̂ = argmax_θ Σᵢ log p(xᵢ | θ). The log-likelihood turns a product of probabilities into a sum. Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood (NLL).
Example
For Gaussian noise: maximizing the Gaussian log-likelihood is equivalent to minimizing the sum of squared residuals (MSE).
The two objectives share the same optimizer: the θ that maximizes the log-likelihood is the θ that minimizes the squared error.
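A quick numerical check of that equivalence (a minimal sketch with arbitrary made-up data and a fixed σ): the Gaussian negative log-likelihood is the sum of squared residuals rescaled by 1/(2σ²) plus a constant, so both are minimized by the same parameters.
import math
import torch

torch.manual_seed(0)
y = torch.randn(5)       # observed targets (arbitrary example data)
y_hat = torch.randn(5)   # model predictions
sigma = 1.5              # assumed fixed noise standard deviation

# Gaussian negative log-likelihood, summed over examples
nll = ((y - y_hat) ** 2 / (2 * sigma ** 2)).sum() + 0.5 * len(y) * math.log(2 * math.pi * sigma ** 2)

# Sum of squared residuals, rescaled and shifted, gives the same number
sse = ((y - y_hat) ** 2).sum()
print(nll.item(), (sse / (2 * sigma ** 2)).item() + 0.5 * len(y) * math.log(2 * math.pi * sigma ** 2))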
The Unifying Idea
There are dozens of loss functions in machine learning: MSE, binary cross-entropy, categorical cross-entropy, focal loss, Poisson loss, and more. They seem like a disconnected catalog of choices.
They are not. Every standard loss function is cross-entropy under a distributional assumption. This lesson derives that unification, tying together MLE, KL divergence, and the losses you write in code.
Cross-Entropy Defined
The cross-entropy between a true distribution P and an approximate distribution Q is:
H(P, Q) = −Σₓ P(x) log Q(x)
H(P, Q): cross-entropy — expected negative log-probability under Q when data comes from P
P(x): probability of outcome x under the true distribution
Q(x): probability of outcome x under the model distribution
Compare to entropy: H(P) = −Σₓ P(x) log P(x). Cross-entropy replaces log P(x) with log Q(x) — you're using the model's code for data that comes from the true distribution.
Since log(P(x)/Q(x)) = log P(x) − log Q(x), the KL divergence D_KL(P ‖ Q) = Σₓ P(x) log(P(x)/Q(x)) splits into −H(P) + H(P, Q), which rearranges to:
H(P, Q) = H(P) + D_KL(P ‖ Q)
D_KL(P ‖ Q): KL divergence of the true distribution P from the model Q
This is the key equation. Cross-entropy = entropy of the truth + the extra cost from using the wrong distribution.
Since H(P) is the entropy of the true distribution — which we cannot change by adjusting model parameters — minimizing cross-entropy over θ is identical to minimizing the KL divergence D_KL(P ‖ Q_θ).
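A small numeric check of the decomposition (a sketch with two arbitrary three-outcome distributions):
import torch

P = torch.tensor([0.5, 0.3, 0.2])   # true distribution (arbitrary example)
Q = torch.tensor([0.4, 0.4, 0.2])   # model distribution

entropy = -(P * P.log()).sum()          # H(P)
cross_entropy = -(P * Q.log()).sum()    # H(P, Q)
kl = (P * (P / Q).log()).sum()          # D_KL(P ‖ Q)

# H(P, Q) equals H(P) + D_KL(P ‖ Q)
print(cross_entropy.item(), (entropy + kl).item())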
Cross-Entropy IS Negative Log-Likelihood
Now connect to MLE. Suppose we have a dataset with empirical distribution P̂: it places mass 1/n on each of the n training examples xᵢ.
The cross-entropy between the empirical distribution and the model is:
H(P̂, Q_θ) = −(1/n) Σᵢ₌₁ⁿ log q_θ(xᵢ)
q_θ(xᵢ): model probability of training example xᵢ given parameters θ
This is exactly the negative log-likelihood divided by n. Minimizing cross-entropy over θ is identical to maximizing the log-likelihood.
The chain of equivalences: minimizing H(P̂, Q_θ) ⇔ minimizing D_KL(P̂ ‖ Q_θ) ⇔ maximizing Σᵢ log q_θ(xᵢ).
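To make the identity concrete, here is a sketch with a made-up discrete model over four outcomes: the per-example average NLL equals the cross-entropy computed against the empirical distribution.
import torch

q_theta = torch.tensor([0.1, 0.2, 0.3, 0.4])   # hypothetical model over 4 outcomes
data = torch.tensor([3, 1, 3, 2, 0, 3])        # n = 6 observed outcomes (indices)

# Average negative log-likelihood over the dataset
nll = -q_theta[data].log().mean()

# Cross-entropy H(P̂, Q_θ) against the empirical distribution P̂
p_hat = torch.bincount(data, minlength=4).float() / len(data)
cross_entropy = -(p_hat * q_theta.log()).sum()

print(nll.item(), cross_entropy.item())   # identical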
For Classification: Softmax Cross-Entropy
In a multi-class classification problem, the true label is a one-hot vector y, and the model outputs a vector of softmax probabilities ŷ.
The cross-entropy loss per example is:
L = H(y, ŷ) = −Σₖ₌₁ᴷ yₖ log ŷₖ = −log ŷ_{y∗}
K: number of classes
yₖ: 1 if class k is the true class, 0 otherwise
ŷₖ: model's predicted probability for class k
where y∗ is the true class index. Since y is one-hot, only the k = y∗ term survives. The loss is simply −log of the probability assigned to the correct class.
Numerical example: the true class is index 2 (0-indexed) and the model predicts ŷ = [0.1, 0.7, 0.2].
The correct class receives probability ŷ₂ = 0.2, so L = −log(0.2) ≈ 1.609 nats. The model put most of its probability (0.7) on the wrong class, and the loss is correspondingly large.
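To verify that number with PyTorch (a sketch; F.cross_entropy expects raw logits, and passing log ŷ works here because softmax(log ŷ) = ŷ when ŷ sums to 1):
import torch
import torch.nn.functional as F

y_hat = torch.tensor([[0.1, 0.7, 0.2]])   # predicted probabilities from the example
true_class = torch.tensor([2])            # true class index (0-indexed)

# Directly: -log of the probability assigned to the correct class
print(-torch.log(y_hat[0, 2]).item())                         # ≈ 1.609

# F.cross_entropy applied to log-probabilities gives the same value
print(F.cross_entropy(torch.log(y_hat), true_class).item())   # ≈ 1.609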
MSE = Cross-Entropy under Gaussian Noise
Assume outputs are generated by a Gaussian: y ∼ N(ŷ, σ²).
The cross-entropy loss for one example:
−log p(y | ŷ) = (y − ŷ)² / (2σ²) + ½ log(2πσ²)
σ: noise standard deviation — a fixed hyperparameter
ŷ: model prediction
The second term is constant with respect to the model parameters, and the 1/(2σ²) factor only rescales the objective. Minimizing the cross-entropy therefore gives:
L_MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
MSE: mean squared error
MSE is cross-entropy under a Gaussian assumption. You choose MSE when you believe the noise in your outputs is roughly Gaussian.
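One way to check this numerically (a sketch using torch.distributions.Normal with an arbitrary fixed σ): the mean Gaussian NLL equals the MSE rescaled by 1/(2σ²) plus the constant ½ log(2πσ²).
import math
import torch
from torch.distributions import Normal

torch.manual_seed(0)
predictions = torch.randn(8)   # model outputs ŷ (arbitrary example data)
targets = torch.randn(8)       # observed y
sigma = 1.0                    # assumed fixed noise standard deviation

# Mean Gaussian negative log-likelihood
nll = -Normal(predictions, sigma).log_prob(targets).mean()

# MSE rescaled and shifted recovers the same number
mse = ((targets - predictions) ** 2).mean()
print(nll.item(), (mse / (2 * sigma ** 2)).item() + 0.5 * math.log(2 * math.pi * sigma ** 2))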
Binary Cross-Entropy = Bernoulli NLL
For binary classification (y ∈ {0, 1}), assume y ∼ Bernoulli(ŷ):
L_BCE = −[y log ŷ + (1 − y) log(1 − ŷ)]
ŷ: model's predicted probability of y = 1 (e.g., sigmoid output)
y: true binary label (0 or 1)
This is the NLL of a Bernoulli distribution. When y = 1, only the first term survives: loss = −log ŷ. When y = 0, only the second term survives: loss = −log(1 − ŷ).
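A quick check with made-up predictions (a sketch): the hand-written Bernoulli NLL, torch.distributions.Bernoulli, and the built-in F.binary_cross_entropy all agree.
import torch
import torch.nn.functional as F
from torch.distributions import Bernoulli

y_hat = torch.tensor([0.9, 0.2, 0.6])   # predicted P(y = 1) (arbitrary example)
y = torch.tensor([1.0, 0.0, 1.0])       # true binary labels

# Hand-written Bernoulli negative log-likelihood
manual = -(y * y_hat.log() + (1 - y) * (1 - y_hat).log()).mean()

# Same value from the Bernoulli log-probability and the built-in loss
dist = -Bernoulli(probs=y_hat).log_prob(y).mean()
builtin = F.binary_cross_entropy(y_hat, y)

print(manual.item(), dist.item(), builtin.item())   # all equal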
In Code
import torch
import torch.nn.functional as F

# Hypothetical example tensors: batch of 4, 3 classes
logits = torch.randn(4, 3)                       # raw scores before softmax
labels = torch.randint(0, 3, (4,))               # integer class indices
sigmoid_outputs = torch.sigmoid(torch.randn(4))  # predicted P(y = 1)
targets = torch.randint(0, 2, (4,)).float()      # binary labels in {0, 1}
predictions = torch.randn(4)                     # regression outputs

# Categorical cross-entropy (softmax CE) = NLL of categorical
loss = F.cross_entropy(logits, labels)
# Binary cross-entropy = NLL of Bernoulli
loss = F.binary_cross_entropy(sigmoid_outputs, targets)
# MSE = NLL of N(prediction, σ²), up to constants
loss = F.mse_loss(predictions, targets)

# All three are minimizing cross-entropy H(P_true, P_model)
# All three are minimizing D_KL(P_true || P_model)
# All three are maximizing likelihood
The choice of loss function is the choice of noise model. Make that choice explicitly, not by default.
Quiz
Cross-entropy H(P,Q) and entropy H(P) are related to KL divergence by: