Maximum Likelihood Estimation: Finding the Best Parameters

The core idea: given data, find the parameters that make observing that data most probable. Coin flip derivation. Gaussian likelihood → MSE. Why we minimize negative log-likelihood.

Quick refresher

Probability of independent events

If events are independent, the probability they all occur is the product of their individual probabilities: P(A and B) = P(A)·P(B). For n independent coin flips, the probability of a specific sequence is the product of n per-flip probabilities.

Example

Three fair coin flips H, T, H: P = (1/2)·(1/2)·(1/2) = 1/8.

If p=0.7 for heads: P = (0.7)·(0.3)·(0.7) = 0.147.
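
A quick sanity check in Python (a minimal sketch; the helper sequence_prob is ours, written for this example):

def sequence_prob(flips, p):
    # Independent flips: multiply per-flip probabilities,
    # p for heads ("H"), 1 - p for tails
    prob = 1.0
    for f in flips:
        prob *= p if f == "H" else 1 - p
    return prob

print(sequence_prob("HTH", 0.5))  # 0.125 = 1/8
print(sequence_prob("HTH", 0.7))  # ~0.147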

The Core Question

You have a model with unknown parameters θ. You have observed data D. The question: which value of θ makes the data most probable?

That is the entirety of maximum likelihood estimation (MLE):

\hat{\theta} = \arg\max_{\theta} L(\theta; D) = \arg\max_{\theta} P(D \mid \theta)
θ̂: the MLE estimate — the parameter value that maximizes likelihood
L(θ; D): the likelihood function — probability of data D given parameters θ
P(D | θ): probability of observing data D if the true parameters are θ

The Log-Likelihood Trick

There is a critical practical problem. If the data points are x₁, …, xₙ, then the joint probability is a product:

L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)
n: number of data points
xᵢ: the i-th data point
p(xᵢ | θ): probability of the i-th observation given parameters θ

For 1000 data points, each with probability 0.8, this is 0.8^1000 ≈ 10^-97. In 32-bit floats (the default in most ML frameworks) that underflows to zero, and a few thousand points would sink 64-bit floats too. And taking derivatives of a product is painful.

The fix: take the log. Since log is strictly monotone increasing, the argmax is identical:

\hat{\theta} = \arg\max_{\theta} \ell(\theta) \quad \text{where} \quad \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)
ℓ(θ): the log-likelihood — the log of the likelihood function
log: natural logarithm (ln) unless specified otherwise

Products become sums. Sums are numerically stable. Sums are easy to differentiate. This is why every ML training objective you will encounter involves a sum of log-probabilities.
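
You can watch both effects in a few lines of NumPy (a sketch; float32 is chosen because it is the default precision in most ML frameworks):

import numpy as np

# 1000 observations, each assigned probability 0.8 by the model
probs = np.full(1000, 0.8, dtype=np.float32)

# The joint probability 0.8**1000 ~ 1e-97 is far below float32's
# smallest representable positive value (~1e-45): underflow to 0.0
print(np.prod(probs))         # 0.0

# The log-likelihood is a sum of moderate negative numbers: stable
print(np.sum(np.log(probs)))  # ~ -223.14 (= 1000 * log(0.8))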

Worked Example: Coin Flip

You flip a coin n times. Let h = number of heads, t = number of tails, so n = h + t. The parameter is p ∈ [0, 1], the probability of heads.

The likelihood of observing this specific sequence (treating each flip as independent):

L(p) = p^h \cdot (1-p)^t
p: probability of heads (the parameter to optimize)
h: number of heads in the sequence
t: number of tails in the sequence

Take the log-likelihood:

\ell(p) = h \log p + t \log(1-p)
ℓ(p): log-likelihood as a function of p

Differentiate and set to zero:

\frac{d\ell}{dp} = \frac{h}{p} - \frac{t}{1-p} = 0
dℓ/dp: derivative of log-likelihood with respect to p

Solving: h(1 − p) = tp, which rearranges to

\hat{p} = \frac{h}{h + t} = \frac{h}{n}

MLE says: estimate the probability as the fraction of times heads occurred. For 7 heads in 10 flips, p̂ = 0.7. Reassuringly intuitive — and now rigorously derived.
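
A grid search confirms the calculus (a sketch, assuming NumPy; the grid resolution of 0.01 is an arbitrary choice):

import numpy as np

h, t = 7, 3  # 7 heads, 3 tails out of n = 10 flips

# Evaluate l(p) = h*log(p) + t*log(1 - p) over candidate values of p
p_grid = np.linspace(0.01, 0.99, 99)
log_lik = h * np.log(p_grid) + t * np.log(1 - p_grid)

print(p_grid[np.argmax(log_lik)])  # ~0.7, matching h/n = 7/10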

The Key Application: Gaussian Likelihood → MSE

This is the derivation that ties MLE to every regression loss you'll ever see.

Assume each output yᵢ is generated by a Gaussian centered at the model prediction ŷᵢ with fixed variance σ²:

p(y_i \mid x_i, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right)
p(yᵢ | xᵢ, θ): probability of observing yᵢ given input xᵢ and parameters θ
σ: standard deviation of the noise
ŷᵢ: model prediction f(xᵢ; θ)

The log-likelihood over n data points:

\ell(\theta) = \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
ℓ(θ): log-likelihood of all data given parameters θ

The first term is a constant with respect to θ. So maximizing ℓ(θ) over θ is equivalent to minimizing:

\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
MSE: mean squared error — the standard regression loss function

MSE is not arbitrary. It is the correct loss function under the assumption that residuals are Gaussian noise. Every time you use MSE, you are implicitly doing MLE under a Gaussian noise model.

Numeric Check

Say you have 5 data points, y = [2.1, 1.9, 2.0, 2.3, 1.8], and your model predicts a constant ŷ = μ. The log-likelihood (up to constants) is:

\ell(\mu) = -\frac{1}{2\sigma^2}\left[(2.1-\mu)^2 + (1.9-\mu)^2 + (2.0-\mu)^2 + (2.3-\mu)^2 + (1.8-\mu)^2\right]

Taking dℓ/dμ = 0 gives μ̂ = (2.1 + 1.9 + 2.0 + 2.3 + 1.8)/5 = 10.1/5 = 2.02. The sample mean. MLE recovered the mean from the Gaussian noise model.
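
The same check in code (a sketch; σ is dropped because it only scales the log-likelihood and never moves the argmax):

import numpy as np

y = np.array([2.1, 1.9, 2.0, 2.3, 1.8])

# Log-likelihood up to constants: -sum((y - mu)**2), sigma omitted
mu_grid = np.linspace(1.5, 2.5, 1001)
log_lik = np.array([-np.sum((y - mu) ** 2) for mu in mu_grid])

print(mu_grid[np.argmax(log_lik)])  # ~2.02
print(y.mean())                     # 2.02 -- the sample mean, as derived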

In Code and Practice

In PyTorch, when you write:

import torch.nn.functional as F

loss = F.mse_loss(predictions, targets)              # Gaussian noise assumption
loss = F.binary_cross_entropy(predictions, targets)  # Bernoulli assumption
loss = F.cross_entropy(logits, labels)               # Categorical assumption

Each of these is computing the negative log-likelihood under a specific probabilistic model. They are all MLE, just with different distributional assumptions about the data-generating process.

The choice of loss function is really a choice of noise model. Pick the loss that matches your assumption about how outputs are distributed given inputs.
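
To see the Gaussian case concretely, here is a sketch that rebuilds F.mse_loss from an explicit unit-variance Gaussian log-likelihood via torch.distributions (the choice σ = 1 and the random data are assumptions made for the demo):

import math
import torch
import torch.nn.functional as F
from torch.distributions import Normal

torch.manual_seed(0)
predictions = torch.randn(100)
targets = predictions + 0.1 * torch.randn(100)

# Average negative log-likelihood under y ~ Normal(prediction, sigma=1)
nll = -Normal(predictions, 1.0).log_prob(targets).mean()

# The same number rebuilt from MSE: NLL = 0.5 * MSE + 0.5 * log(2*pi)
mse = F.mse_loss(predictions, targets)
print(nll.item())
print((0.5 * mse + 0.5 * math.log(2 * math.pi)).item())  # matches

The additive constant is why minimizing MSE and minimizing the Gaussian negative log-likelihood find the same parameters.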

Quiz

Question 1 of 3

You flip a coin 10 times and get 7 heads. What does MLE say p̂ (the probability of heads) is?