Maximum Likelihood Estimation: Finding the Best Parameters

The core idea: given data, find the parameters that make observing that data most probable. Coin flip derivation. Gaussian likelihood → MSE. Why we minimize negative log-likelihood.

Quick refresher

Probability of independent events

If events are independent, the probability they all occur is the product of their individual probabilities: P(A and B) = P(A)·P(B). For n independent coin flips, the probability of a specific sequence is the product of n per-flip probabilities.

Example

Three fair coin flips H, T, H: P = (1/2)·(1/2)·(1/2) = 1/8.

If p=0.7 for heads: P = (0.7)·(0.3)·(0.7) = 0.147.
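
A quick sanity check in Python (a minimal sketch; the helper sequence_prob is ours, written for this example):

def sequence_prob(flips, p):
    # Independent flips: multiply per-flip probabilities,
    # p for heads ("H"), 1 - p for tails
    prob = 1.0
    for f in flips:
        prob *= p if f == "H" else 1 - p
    return prob

print(sequence_prob("HTH", 0.5))  # 0.125 = 1/8
print(sequence_prob("HTH", 0.7))  # ~0.147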

The Core Question

You have a model with unknown parameters θ. You have observed data D. The question: which value of θ makes the data most probable?

That is the entirety of maximum likelihood estimation (MLE):

\hat{\theta} = \arg\max_{\theta} L(\theta; D) = \arg\max_{\theta} P(D \mid \theta)
θ̂: the MLE estimate — the parameter value that maximizes likelihood
L(θ; D): the likelihood function — probability of data D given parameters θ
P(D | θ): probability of observing data D if the true parameters are θ

The Log-Likelihood Trick

There is a critical practical problem. If the data points are x₁, …, xₙ, then the joint probability is a product:

L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)
n: number of data points
xᵢ: the i-th data point
p(xᵢ | θ): probability of the i-th observation given parameters θ

For 1000 data points, each with probability 0.8, this is 0.8^1000 ≈ 10^-97. In 32-bit floats (the default in most ML frameworks) that underflows to zero, and a few thousand points would sink 64-bit floats too. And taking derivatives of a product is painful.

The fix: take the log. Since log is strictly monotone increasing, the argmax is identical:

\hat{\theta} = \arg\max_{\theta} \ell(\theta) \quad \text{where} \quad \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)
ℓ(θ): the log-likelihood — the log of the likelihood function
log: natural logarithm (ln) unless specified otherwise

Products become sums. Sums are numerically stable. Sums are easy to differentiate. This is why every ML training objective you will encounter involves a sum of log-probabilities.
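
You can watch both effects in a few lines of NumPy (a sketch; float32 is chosen because it is the default precision in most ML frameworks):

import numpy as np

# 1000 observations, each assigned probability 0.8 by the model
probs = np.full(1000, 0.8, dtype=np.float32)

# The joint probability 0.8**1000 ~ 1e-97 is far below float32's
# smallest representable positive value (~1e-45): underflow to 0.0
print(np.prod(probs))         # 0.0

# The log-likelihood is a sum of moderate negative numbers: stable
print(np.sum(np.log(probs)))  # ~ -223.14 (= 1000 * log(0.8))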

Worked Example: Coin Flip

You flip a coin n times. Let h = number of heads, t = number of tails, so n = h + t. The parameter is p ∈ [0, 1], the probability of heads.

The likelihood of observing this specific sequence (treating each flip as independent):

L(p) = p^h \cdot (1-p)^t
p: probability of heads (the parameter to optimize)
h: number of heads in the sequence
t: number of tails in the sequence

Take the log-likelihood:

\ell(p) = h \log p + t \log(1-p)
ℓ(p): log-likelihood as a function of p

Differentiate and set to zero:

\frac{d\ell}{dp} = \frac{h}{p} - \frac{t}{1-p} = 0
dℓ/dp: derivative of log-likelihood with respect to p

Solving: h(1 − p) = tp, which rearranges to

\hat{p} = \frac{h}{h + t} = \frac{h}{n}

MLE says: estimate the probability as the fraction of times heads occurred. For 7 heads in 10 flips, p̂ = 0.7. Reassuringly intuitive — and now rigorously derived.
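
A grid search confirms the calculus (a sketch, assuming NumPy; the grid resolution of 0.01 is an arbitrary choice):

import numpy as np

h, t = 7, 3  # 7 heads, 3 tails out of n = 10 flips

# Evaluate l(p) = h*log(p) + t*log(1 - p) over candidate values of p
p_grid = np.linspace(0.01, 0.99, 99)
log_lik = h * np.log(p_grid) + t * np.log(1 - p_grid)

print(p_grid[np.argmax(log_lik)])  # ~0.7, matching h/n = 7/10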

The Key Application: Gaussian Likelihood → MSE

This is the derivation that ties MLE to every regression loss you'll ever see.

Assume each output yᵢ is generated by a Gaussian centered at the model prediction ŷᵢ with fixed variance σ²:

p(y_i \mid x_i, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right)
p(yᵢ | xᵢ, θ): probability of observing yᵢ given input xᵢ and parameters θ
σ: standard deviation of the noise
ŷᵢ: model prediction f(xᵢ; θ)

The log-likelihood over n data points:

\ell(\theta) = \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
ℓ(θ): log-likelihood of all data given parameters θ

The first term is a constant with respect to θ. So maximizing ℓ(θ) over θ is equivalent to minimizing:

\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
MSE: mean squared error — the standard regression loss function

MSE is not arbitrary. It is the correct loss function under the assumption that residuals are Gaussian noise. Every time you use MSE, you are implicitly doing MLE under a Gaussian noise model.

Numeric Check

Say you have 5 data points, y = [2.1, 1.9, 2.0, 2.3, 1.8], and your model predicts a constant ŷ = μ. The log-likelihood (up to constants) is:

\ell(\mu) = -\frac{1}{2\sigma^2}\left[(2.1-\mu)^2 + (1.9-\mu)^2 + (2.0-\mu)^2 + (2.3-\mu)^2 + (1.8-\mu)^2\right]

Taking dℓ/dμ = 0 gives μ̂ = (2.1 + 1.9 + 2.0 + 2.3 + 1.8)/5 = 10.1/5 = 2.02. The sample mean. MLE recovered the mean from the Gaussian noise model.
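
The same check in code (a sketch; σ is dropped because it only scales the log-likelihood and never moves the argmax):

import numpy as np

y = np.array([2.1, 1.9, 2.0, 2.3, 1.8])

# Log-likelihood up to constants: -sum((y - mu)**2), sigma omitted
mu_grid = np.linspace(1.5, 2.5, 1001)
log_lik = np.array([-np.sum((y - mu) ** 2) for mu in mu_grid])

print(mu_grid[np.argmax(log_lik)])  # ~2.02
print(y.mean())                     # 2.02 -- the sample mean, as derived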

In Code and Practice

In PyTorch, when you write:

import torch.nn.functional as F

loss = F.mse_loss(predictions, targets)              # Gaussian noise assumption
loss = F.binary_cross_entropy(predictions, targets)  # Bernoulli assumption
loss = F.cross_entropy(logits, labels)               # Categorical assumption

Each of these is computing the negative log-likelihood under a specific probabilistic model. They are all MLE, just with different distributional assumptions about the data-generating process.

The choice of loss function is really a choice of noise model. Pick the loss that matches your assumption about how outputs are distributed given inputs.
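
To see the Gaussian case concretely, here is a sketch that rebuilds F.mse_loss from an explicit unit-variance Gaussian log-likelihood via torch.distributions (the choice σ = 1 and the random data are assumptions made for the demo):

import math
import torch
import torch.nn.functional as F
from torch.distributions import Normal

torch.manual_seed(0)
predictions = torch.randn(100)
targets = predictions + 0.1 * torch.randn(100)

# Average negative log-likelihood under y ~ Normal(prediction, sigma=1)
nll = -Normal(predictions, 1.0).log_prob(targets).mean()

# The same number rebuilt from MSE: NLL = 0.5 * MSE + 0.5 * log(2*pi)
mse = F.mse_loss(predictions, targets)
print(nll.item())
print((0.5 * mse + 0.5 * math.log(2 * math.pi)).item())  # matches

The additive constant is why minimizing MSE and minimizing the Gaussian negative log-likelihood find the same parameters.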

Quiz

Question 1 of 3

You flip a coin 10 times and get 7 heads. What does MLE say p̂ (the probability of heads) is?