The Core Question
You have a model with unknown parameters θ. You have observed data D. The question: which value of θ makes the data most probable?
That is the entirety of maximum likelihood estimation:

θ̂_MLE = argmax_θ L(θ) = argmax_θ P(D | θ)

- θ̂_MLE — the MLE estimate — the parameter value that maximizes the likelihood
- L(θ) — the likelihood function — probability of data D given parameters θ
- P(D | θ) — probability of observing data D if the true parameters are θ
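The argmax can be made concrete by brute force on a tiny dataset. A minimal sketch, assuming hypothetical coin-flip data (1 = heads) and a grid of candidate θ values:

```python
# Evaluate P(D | theta) on a grid and take the argmax.
data = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]  # hypothetical observations, 1 = heads

def likelihood(theta):
    # Joint probability of the data, assuming independent flips.
    prob = 1.0
    for x in data:
        prob *= theta if x == 1 else 1 - theta
    return prob

grid = [i / 100 for i in range(1, 100)]
theta_hat = max(grid, key=likelihood)
print(theta_hat)  # 0.7 — the fraction of heads in the data
```

Grid search only works for a one-dimensional parameter, but it makes the definition tangible: MLE picks the θ under which the observed data had the highest probability.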
The Log-Likelihood Trick
There is a critical practical problem. If the data points are x₁, x₂, …, xₙ, then the joint probability is a product:

L(θ) = P(D | θ) = ∏ᵢ₌₁ⁿ P(xᵢ | θ)

- n — number of data points
- xᵢ — the i-th data point
- P(xᵢ | θ) — probability of the i-th observation given parameters θ
For 1000 data points, each with probability 0.8, this is 0.8^1000 ≈ 10⁻⁹⁷. A few thousand more points and the product underflows double precision to exactly zero. And taking derivatives of a product is painful.
The fix: take the log. Since log is strictly monotone increasing, the argmax is identical:

ℓ(θ) = log L(θ) = Σᵢ₌₁ⁿ log P(xᵢ | θ)

- ℓ(θ) — the log-likelihood — the log of the likelihood function
- log — natural logarithm (ln) unless specified otherwise
Products become sums. Sums are numerically stable. Sums are easy to differentiate. This is why every ML training objective you will encounter involves a sum of log-probabilities.
Worked Example: Coin Flip
You flip a coin n times. Let h = number of heads, t = number of tails, h + t = n. The parameter is p ∈ [0,1], the probability of heads.
The likelihood of observing this specific sequence (treating each flip as independent):

L(p) = p^h (1 − p)^t

- p — probability of heads (the parameter to optimize)
- h — number of heads in the sequence
- t — number of tails in the sequence
Take the log-likelihood:

ℓ(p) = h log p + t log(1 − p)

- ℓ(p) — log-likelihood as a function of p
Differentiate and set to zero:

dℓ/dp = h/p − t/(1 − p) = 0  ⟹  p̂ = h/(h + t) = h/n

- dℓ/dp — derivative of the log-likelihood with respect to p
MLE says: estimate the probability as the fraction of times heads occurred. For 7 heads in 10 flips, p̂ = 7/10 = 0.7. Reassuringly intuitive — and now rigorously derived.
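The closed form can be checked numerically. A minimal sketch, assuming 7 heads and 3 tails and a brute-force grid over p:

```python
import math

h, t = 7, 3  # 7 heads, 3 tails

def log_likelihood(p):
    # ell(p) = h log p + t log(1 - p)
    return h * math.log(p) + t * math.log(1 - p)

# Brute-force search over a fine grid of p values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)
print(p_hat)  # 0.7 — matches the closed form h / (h + t)
```

The log-likelihood is strictly concave in p, so the grid maximum sits exactly at the calculus answer.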
The Key Application: Gaussian Likelihood → MSE
This is the derivation that ties MLE to every regression loss you'll ever see.
Assume each output yᵢ is generated by a Gaussian centered at the model prediction f(xᵢ; θ) with fixed variance σ²:

P(yᵢ | xᵢ, θ) = (1/√(2πσ²)) exp(−(yᵢ − f(xᵢ; θ))² / (2σ²))

- P(yᵢ | xᵢ, θ) — probability of observing yᵢ given input xᵢ and parameters θ
- σ — standard deviation of the noise
- f(xᵢ; θ) — model prediction for input xᵢ
The log-likelihood over n data points:

ℓ(θ) = −(n/2) log(2πσ²) − (1/(2σ²)) Σᵢ₌₁ⁿ (yᵢ − f(xᵢ; θ))²

- ℓ(θ) — log-likelihood of all data given parameters θ
The first term is a constant with respect to θ. So maximizing ℓ(θ) over θ is equivalent to minimizing:

MSE(θ) = (1/n) Σᵢ₌₁ⁿ (yᵢ − f(xᵢ; θ))²

- MSE(θ) — mean squared error — the standard regression loss function
MSE is not arbitrary. It is the correct loss function under the assumption that residuals are Gaussian noise. Every time you use MSE, you are implicitly doing MLE under a Gaussian noise model.
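The equivalence is an affine relationship: the negative log-likelihood equals a constant plus n/(2σ²) times the MSE, so both have the same minimizer. A sketch with hypothetical data, a constant-prediction model, and an assumed σ = 1.5:

```python
import math

ys = [1.0, 2.0, 4.0, 7.0]  # hypothetical observations
sigma = 1.5                # assumed, fixed noise scale
n = len(ys)

def neg_log_likelihood(c):
    # Negative Gaussian log-likelihood for a model predicting the constant c.
    return sum(
        0.5 * math.log(2 * math.pi * sigma**2)
        + (y - c) ** 2 / (2 * sigma**2)
        for y in ys
    )

def mse(c):
    return sum((y - c) ** 2 for y in ys) / n

# NLL(c) = const + n/(2*sigma^2) * MSE(c) for every c,
# so minimizing one minimizes the other.
const = n * 0.5 * math.log(2 * math.pi * sigma**2)
for c in [0.0, 2.0, 3.5, 5.0]:
    assert abs(neg_log_likelihood(c) - (const + n / (2 * sigma**2) * mse(c))) < 1e-9
```

Note that σ only scales the loss; it never moves the argmin, which is why it can be dropped from the training objective.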
Numeric Check
Say you have 5 data points y₁, …, y₅, and your model predicts a constant c. The log-likelihood (up to constants) is:

ℓ(c) ∝ −Σᵢ₌₁⁵ (yᵢ − c)²

Setting dℓ/dc = 2 Σᵢ (yᵢ − c) = 0 gives ĉ = (1/5) Σᵢ yᵢ. The sample mean. MLE recovered the mean from the Gaussian noise model.
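This check is easy to replicate with any five numbers. A minimal sketch with hypothetical data:

```python
ys = [3.0, 1.0, 4.0, 1.0, 5.0]  # hypothetical data points
mean = sum(ys) / len(ys)        # 2.8

def sse(c):
    # Sum of squared errors for a model predicting the constant c.
    return sum((y - c) ** 2 for y in ys)

# The sample mean beats every other candidate constant.
for c in [0.0, 1.0, 2.5, 3.0, 10.0]:
    assert sse(mean) <= sse(c)
print(mean)  # 2.8
```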
In Code and Practice
In PyTorch, when you write:
```python
loss = F.mse_loss(predictions, targets)              # Gaussian noise assumption
loss = F.binary_cross_entropy(predictions, targets)  # Bernoulli assumption
loss = F.cross_entropy(logits, labels)               # Categorical assumption
```
Each of these is computing the negative log-likelihood under a specific probabilistic model. They are all MLE, just with different distributional assumptions about the data-generating process.
The choice of loss function is really a choice of noise model. Pick the loss that matches your assumption about how outputs are distributed given inputs.
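The Bernoulli case can be spelled out without any framework: binary cross-entropy is the batch-averaged negative Bernoulli log-likelihood. A minimal pure-Python sketch, with hypothetical predicted probabilities and binary targets:

```python
import math

# Hypothetical predicted probabilities and binary targets.
preds = [0.9, 0.2, 0.7]
targets = [1.0, 0.0, 1.0]

def bce(ps, ys):
    # Negative Bernoulli log-likelihood, averaged over the batch:
    # -(1/n) * sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(ps, ys)
    ) / len(ps)

print(round(bce(preds, targets), 4))  # 0.2284
```

Swapping the Bernoulli density for a categorical or Gaussian one yields cross-entropy or (up to constants) MSE by the same recipe.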