
The EM algorithm

Video (coming soon): Expectation-Maximization Explained (~11 min)

Walking through the E-step and M-step for a 1D GMM with two components, showing how parameters converge.

🧮 Quick refresher: Maximum likelihood estimation

Maximum likelihood estimation (MLE) finds the model parameters that maximize the probability of observing the data you actually saw. For a dataset X and parameters θ, you maximize the log-likelihood log P(X|θ). MLE is the standard way to fit probabilistic models.

Example

To estimate the bias of a coin from 7 heads out of 10 flips, MLE gives p = 7/10 = 0.7 — the value of p that makes the observed 7 heads most probable.
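
A quick numerical check, as a minimal Python sketch (the grid search is purely illustrative; the closed-form answer is heads/flips):

    import numpy as np

    heads, flips = 7, 10
    p_grid = np.linspace(0.01, 0.99, 99)
    # Binomial log-likelihood, dropping the constant binomial coefficient
    log_lik = heads * np.log(p_grid) + (flips - heads) * np.log(1 - p_grid)
    print(p_grid[np.argmax(log_lik)])  # 0.7, matching heads/flips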

You want to fit a GMM to data, but there's a catch: you don't know which Gaussian generated each point. If you knew the assignments, fitting would be easy (just compute weighted means and covariances). If you knew the parameters, assignments would be easy (just compute responsibilities). You know neither.

This circular dependency is exactly what the Expectation-Maximization (EM) algorithm resolves.

The Missing Data Problem

In a GMM, each data point x_n was generated by some component k — but we don't know which. Call this unknown label z_n. If we observed z_n, the complete-data log-likelihood would be:

\ln p(X, Z \mid \theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \bigl[\ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\bigr]

where:

  • ln p(X, Z | θ): complete-data log-likelihood
  • z_{nk}: indicator, 1 if point n belongs to component k, 0 otherwise
  • θ: model parameters {μ_k, Σ_k, π_k}

But z_{nk} is unknown. EM's solution: take the expectation of z_{nk} over its posterior distribution, then maximize that expected log-likelihood.

The Two Steps

E-step (Expectation): Given current parameters θ_old, compute the expected value of the missing data — the responsibilities:

r_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

where:

  • r_{nk}: responsibility of component k for point n
  • π_k: current mixing weight for component k
  • N(x_n | μ_k, Σ_k): Gaussian density at x_n under component k
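
In code, the E-step is only a few lines. Here is a minimal 1D sketch with NumPy/SciPy (the function and variable names are my own):

    import numpy as np
    from scipy.stats import norm

    def e_step(x, pi, mu, sigma):
        """Responsibilities r[n, k] for a 1D GMM; x has shape (N,), the rest (K,)."""
        # Unnormalized weights pi_k * N(x_n | mu_k, sigma_k), shape (N, K)
        dens = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)
        return dens / dens.sum(axis=1, keepdims=True)  # each row sums to 1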

M-step (Maximization): Treat the responsibilities as weights, and maximize the expected complete-data log-likelihood. This yields closed-form updates:

N_k = \sum_{n=1}^{N} r_{nk}, \qquad \mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \, x_n

\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^\top

\pi_k^{\text{new}} = \frac{N_k}{N}

where:

  • N_k: effective number of points assigned to component k
  • μ_k^new, Σ_k^new, π_k^new: updated mean, covariance, and mixing weight for component k
  • r_{nk}: responsibility from the E-step
  • N: total number of data points
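
The matching M-step sketch, continuing the 1D example above (in one dimension Σ_k is just a variance σ_k²):

    def m_step(x, r):
        """Update (pi, mu, sigma) from responsibilities r of shape (N, K)."""
        N_k = r.sum(axis=0)                                    # effective counts
        mu = (r * x[:, None]).sum(axis=0) / N_k                # weighted means
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / N_k   # weighted variances
        pi = N_k / len(x)                                      # mixing weights
        return pi, mu, np.sqrt(var)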

Repeat until the log-likelihood stops increasing (or changes by less than some tolerance ε).
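
Gluing the two sketches above together gives the full loop (tol and max_iter are illustrative defaults):

    def fit_gmm(x, pi, mu, sigma, tol=1e-6, max_iter=200):
        """Alternate E- and M-steps until the log-likelihood stops improving."""
        prev_ll = -np.inf
        for _ in range(max_iter):
            r = e_step(x, pi, mu, sigma)
            pi, mu, sigma = m_step(x, r)
            ll = np.log((pi * norm.pdf(x[:, None], loc=mu, scale=sigma)).sum(axis=1)).sum()
            if ll - prev_ll < tol:  # EM never decreases ll, so this is a valid stopping test
                break
            prev_ll = ll
        return pi, mu, sigma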

Worked Example: 1D, Two Components

Data: {1.5, 2.0, 2.5, 8.0, 9.0, 9.5}

Initial parameters:

  • Component 1: μ₁ = 2, σ₁ = 1, π₁ = 0.5
  • Component 2: μ₂ = 9, σ₂ = 1, π₂ = 0.5

E-step — compute responsibilities:

For x = 2.5:

  • N(2.5 | μ₁=2, σ₁=1) ∝ exp(-(2.5-2)²/2) = exp(-0.125) ≈ 0.882
  • N(2.5 | μ₂=9, σ₂=1) ∝ exp(-(2.5-9)²/2) = exp(-21.125) ≈ 0.000

r₁(2.5) ≈ 1.000, r₂(2.5) ≈ 0.000 — component 1 claims this point entirely.

For x = 8.0:

  • N(8.0 | μ₁=2, σ₁=1) ∝ exp(-18) ≈ 0.000
  • N(8.0 | μ₂=9, σ₂=1) ∝ exp(-0.5) ≈ 0.607

r₁(8.0) ≈ 0.000, r₂(8.0) ≈ 1.000 — component 2 claims it.

After computing all six responsibilities:

  • N₁ ≈ 3.0 (points 1.5, 2.0, 2.5 fully owned by component 1)
  • N₂ ≈ 3.0 (points 8.0, 9.0, 9.5 fully owned by component 2)

M-step — update parameters:

μ₁_new = (1×1.5 + 1×2.0 + 1×2.5) / 3 = 2.0 (no change — already well-positioned)

μ₂_new = (1×8.0 + 1×9.0 + 1×9.5) / 3 = 8.83

σ₁_new² = (1×(1.5-2)² + 1×(2.0-2)² + 1×(2.5-2)²) / 3 = (0.25 + 0 + 0.25)/3 = 0.167 → σ₁ ≈ 0.41

σ₂_new² = ((8-8.83)² + (9-8.83)² + (9.5-8.83)²) / 3 ≈ (0.69 + 0.03 + 0.45)/3 ≈ 0.39 → σ₂ ≈ 0.62

After just one iteration, the algorithm has already tightened each component around its actual data points.
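
These numbers can be checked mechanically by running one iteration of the e_step/m_step sketches from above:

    x = np.array([1.5, 2.0, 2.5, 8.0, 9.0, 9.5])
    pi, mu, sigma = np.array([0.5, 0.5]), np.array([2.0, 9.0]), np.array([1.0, 1.0])

    r = e_step(x, pi, mu, sigma)
    pi, mu, sigma = m_step(x, r)
    print(mu)     # approximately [2.00, 8.83]
    print(sigma)  # approximately [0.41, 0.62]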

The Log-Likelihood as Objective

The quantity EM optimizes is the observed-data log-likelihood:

\ln p(X \mid \theta) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)

where:

  • ln p(X | θ): log-likelihood of the observed data
  • ∑_k: sum over the K components

Each EM iteration increases this value (or leaves it unchanged). Track it during training: smooth, monotone convergence is the expected behavior. If it oscillates, or ever decreases, something is wrong — numerical issues or degenerate components.
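
In practice the inner sum of tiny densities can underflow, so the standard trick is to compute it in log space with logsumexp. A sketch, again with names of my own choosing:

    from scipy.special import logsumexp

    def log_likelihood(x, pi, mu, sigma):
        """Observed-data log-likelihood, computed stably in log space."""
        # log pi_k + log N(x_n | mu_k, sigma_k), shape (N, K)
        log_terms = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
        return logsumexp(log_terms, axis=1).sum()  # log-sum-exp over components, sum over points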

Beyond GMMs: EM Is General

EM applies to any model where:

  1. Some data is missing or hidden (the "E" in EM)
  2. With the missing data filled in, the MLE update has a closed form (the "M" in EM)

Examples beyond GMMs:

  • Hidden Markov Models: latent state sequence is the missing data
  • Latent Dirichlet Allocation: topic assignments per word are the missing data
  • Probabilistic PCA: latent coordinates are the missing data
  • Factor analysis: latent factors are the missing data

EM typically converges more slowly than direct gradient optimization near the optimum, but it is often more numerically stable and works well whenever the M-step has a clean closed form.

Interactive example (coming soon): step through EM iterations on a 1D dataset and watch the Gaussians drift into position.

Practical Notes

  • Initialize GMM with K-Means centroids — much better than random initialization
  • Watch for degenerate solutions: a component with very few points can have σ → 0 and likelihood → ∞. Add regularization (a small ridge to Σ_k diagonal) to avoid this.
  • Use BIC or AIC to compare GMMs with different K values: BIC = -2 ln L + p ln N, where p is the number of free parameters and N the number of points; the p ln N term penalizes complexity. (A scikit-learn sketch follows this list.)
  • Multiple restarts help — like K-Means, EM only finds local optima
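
A minimal model-selection sketch using scikit-learn's GaussianMixture, which reports BIC directly (the synthetic dataset and parameter choices here are illustrative):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Two well-separated 1D clusters, mimicking the worked example
    x = np.concatenate([rng.normal(2.0, 0.5, 100),
                        rng.normal(9.0, 0.6, 100)]).reshape(-1, 1)

    for K in (1, 2, 3, 4):
        gm = GaussianMixture(n_components=K, n_init=5, random_state=0).fit(x)
        print(K, gm.bic(x))  # lower BIC is better; expect K = 2 to win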

What to Remember

  • EM alternates between E-step (compute responsibilities using current parameters) and M-step (update parameters using responsibilities as weights)
  • Each iteration is guaranteed not to decrease the log-likelihood, so the algorithm converges (to a local optimum)
  • M-step updates have clean closed forms for GMMs: weighted means, covariances, and mixing weights
  • EM applies to any model with latent variables where the complete-data MLE is tractable
  • Initialize with K-Means to avoid bad local optima; add Σ regularization to avoid degenerate components

Quiz


What does the E-step (Expectation step) compute in the EM algorithm for GMMs?