You want to fit a GMM to data, but there's a catch: you don't know which Gaussian generated each point. If you knew the assignments, fitting would be easy (just compute weighted means and covariances). If you knew the parameters, assignments would be easy (just compute responsibilities). You know neither.
This circular dependency is solved by Expectation-Maximization (EM).
The Missing Data Problem
In a GMM, each data point x_n was generated by some component k — but we don't know which. Call this unknown label z_n. If we observed z_n, the complete-data log-likelihood would be:
log p(X, Z | θ) = ∑_n ∑_k z_{nk} [ log π_k + log N(x_n | μ_k, Σ_k) ]
- log p(X, Z | θ): the complete-data log-likelihood
- z_{nk}: indicator, 1 if point n belongs to component k, 0 otherwise
- θ: the model parameters {μ_k, Σ_k, π_k}
But z_{nk} is unknown. EM's solution: take the expectation of z_{nk} over its posterior distribution, then maximize that expected log-likelihood.
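The "if you knew the assignments, fitting is easy" half of the loop is worth seeing concretely. A minimal sketch, assuming synthetic 1D data with known labels (the variable names are my own, purely illustrative):

```python
import numpy as np

# With observed labels z, the MLE for each component is just a per-component
# sample mean, variance, and proportion; no iteration needed.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(2.0, 1.0, 50), rng.normal(9.0, 1.0, 50)])  # 1D data
z = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])     # known labels

for k in range(2):
    Xk = X[z == k]
    mu_k = Xk.mean()          # component mean
    var_k = Xk.var()          # component variance (MLE, i.e. divide by N_k)
    pi_k = len(Xk) / len(X)   # mixing weight
    print(f"component {k}: mu={mu_k:.2f}, var={var_k:.2f}, pi={pi_k:.2f}")
```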
The Two Steps
E-step (Expectation): Given current parameters θ_old, compute the expected value of the missing data — the responsibilities:
r_{nk} = π_k N(x_n | μ_k, Σ_k) / ∑_j π_j N(x_n | μ_j, Σ_j)
- r_{nk}: responsibility of component k for point n
- π_k: current mixing weight for component k
- N(x_n | μ_k, Σ_k): Gaussian density at x_n under component k
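In code, the E-step is one vectorized normalization. A minimal NumPy/SciPy sketch; the data and parameter values are the ones used in the worked example further down:

```python
import numpy as np
from scipy.stats import norm

x = np.array([1.5, 2.0, 2.5, 8.0, 9.0, 9.5])             # data points
mus = np.array([2.0, 9.0])                                # current means
sigmas = np.array([1.0, 1.0])                             # current std devs
pis = np.array([0.5, 0.5])                                # current mixing weights

# Numerator pi_k * N(x_n | mu_k, sigma_k) for every (point, component) pair.
num = pis * norm.pdf(x[:, None], loc=mus, scale=sigmas)   # shape (N, K)
# Responsibilities: normalize each row so it sums to 1 over components.
resp = num / num.sum(axis=1, keepdims=True)
print(np.round(resp, 3))
```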
M-step (Maximization): Treat the responsibilities as weights, and maximize the expected complete-data log-likelihood. This yields closed-form updates:
N_k = ∑_n r_{nk}
μ_k_new = (1/N_k) ∑_n r_{nk} x_n
Σ_k_new = (1/N_k) ∑_n r_{nk} (x_n − μ_k_new)(x_n − μ_k_new)ᵀ
π_k_new = N_k / N
- N_k: effective number of points assigned to component k
- μ_k_new: updated mean for component k, a weighted average using the responsibilities r_{nk} from the E-step
- Σ_k_new: updated covariance for component k, built from the deviations (x_n − μ_k_new) of each point from the new mean
- π_k_new: updated mixing weight for component k, where N is the total number of data points
Repeat until the log-likelihood stops increasing (or changes by less than some tolerance ε).
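Putting both steps in a loop gives the whole algorithm. A minimal sketch for a 1D mixture, in the same NumPy/SciPy style as above (illustrative only; no safeguards against degenerate components):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, mus, sigmas, pis, tol=1e-6, max_iter=100):
    """Fit a 1D Gaussian mixture with EM (illustrative sketch, not production code)."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities r_{nk} under the current parameters
        num = pis * norm.pdf(x[:, None], loc=mus, scale=sigmas)        # (N, K)
        resp = num / num.sum(axis=1, keepdims=True)

        # M-step: weighted means, variances, and mixing weights
        Nk = resp.sum(axis=0)                                          # effective counts
        mus = (resp * x[:, None]).sum(axis=0) / Nk
        sigmas = np.sqrt((resp * (x[:, None] - mus) ** 2).sum(axis=0) / Nk)
        pis = Nk / len(x)

        # Observed-data log-likelihood at the parameters used in this E-step
        ll = np.log(num.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mus, sigmas, pis

x = np.array([1.5, 2.0, 2.5, 8.0, 9.0, 9.5])
print(em_gmm_1d(x, np.array([2.0, 9.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])))
```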
Worked Example: 1D, Two Components
Data: {1.5, 2.0, 2.5, 8.0, 9.0, 9.5}
Initial parameters:
- Component 1: μ₁ = 2, σ₁ = 1, π₁ = 0.5
- Component 2: μ₂ = 9, σ₂ = 1, π₂ = 0.5
E-step — compute responsibilities:
For x = 2.5:
- N(2.5 | μ₁=2, σ₁=1) ∝ exp(-(2.5-2)²/2) = exp(-0.125) ≈ 0.882
- N(2.5 | μ₂=9, σ₂=1) ∝ exp(-(2.5-9)²/2) = exp(-21.125) ≈ 0.000
r₁(2.5) ≈ 1.000, r₂(2.5) ≈ 0.000 — component 1 claims this point entirely.
For x = 8.0:
- N(8.0 | μ₁=2, σ₁=1) ∝ exp(-18) ≈ 0.000
- N(8.0 | μ₂=9, σ₂=1) ∝ exp(-0.5) ≈ 0.607
r₁(8.0) ≈ 0.000, r₂(8.0) ≈ 1.000 — component 2 claims it.
After computing the responsibilities for all six points:
- N₁ ≈ 3.0 (points 1.5, 2.0, 2.5 fully owned by component 1)
- N₂ ≈ 3.0 (points 8.0, 9.0, 9.5 fully owned by component 2)
M-step — update parameters:
μ₁_new = (1×1.5 + 1×2.0 + 1×2.5) / 3 = 2.0 (no change — already well-positioned)
μ₂_new = (1×8.0 + 1×9.0 + 1×9.5) / 3 = 8.83
σ₁_new² = (1×(1.5-2)² + 1×(2.0-2)² + 1×(2.5-2)²) / 3 = (0.25 + 0 + 0.25)/3 = 0.167 → σ₁ ≈ 0.41
σ₂_new² = ((8-8.83)² + (9-8.83)² + (9.5-8.83)²) / 3 ≈ (0.69 + 0.03 + 0.45)/3 ≈ 0.39 → σ₂ ≈ 0.62
After just one iteration, the algorithm has already tightened each component around its actual data points.
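The same numbers fall out of a few lines of NumPy. A quick check of this single iteration, in the same sketch style as the code above:

```python
import numpy as np
from scipy.stats import norm

x = np.array([1.5, 2.0, 2.5, 8.0, 9.0, 9.5])
mus, sigmas, pis = np.array([2.0, 9.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

# E-step: responsibilities
num = pis * norm.pdf(x[:, None], loc=mus, scale=sigmas)
resp = num / num.sum(axis=1, keepdims=True)

# M-step: effective counts, weighted means, weighted variances
Nk = resp.sum(axis=0)
mus_new = (resp * x[:, None]).sum(axis=0) / Nk
vars_new = (resp * (x[:, None] - mus_new) ** 2).sum(axis=0) / Nk

print(np.round(Nk, 2))        # ~[3.   3.  ]
print(np.round(mus_new, 2))   # ~[2.   8.83]
print(np.round(vars_new, 3))  # ~[0.167 0.389]
```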
The Log-Likelihood as Objective
The quantity EM optimizes is the observed-data log-likelihood:
log p(X | θ) = ∑_{n=1}^N log ∑_{k=1}^K π_k N(x_n | μ_k, Σ_k)
- log p(X | θ): log-likelihood of the observed data
- ∑_{k=1}^K: the sum over the K components sits inside the log, which is why there is no closed-form maximum
Each EM iteration increases this value (or leaves it unchanged). Track it during training: smooth, monotone convergence is what you should see. If it oscillates or ever decreases, something is wrong (numerical issues or degenerate components).
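Computing it naively can underflow, since the per-component densities for far-away points are tiny. A sketch of the standard fix, working in log space with logsumexp (1D, with hypothetical parameter values):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

x = np.array([1.5, 2.0, 2.5, 8.0, 9.0, 9.5])
mus, sigmas, pis = np.array([2.0, 8.83]), np.array([0.41, 0.62]), np.array([0.5, 0.5])

# log(pi_k * N(x_n | mu_k, sigma_k)) for every (point, component) pair
log_comp = np.log(pis) + norm.logpdf(x[:, None], loc=mus, scale=sigmas)
# sum_n log sum_k ..., computed stably in log space
ll = logsumexp(log_comp, axis=1).sum()
print(ll)  # track this across iterations; it should never decrease
```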
Beyond GMMs: EM Is General
EM applies to any model where:
- Some of the data is missing or hidden, and you can take its expectation given the observed data (the "E" in EM)
- With the missing data filled in, the maximum-likelihood update has a closed form you can maximize (the "M" in EM)
Examples beyond GMMs:
- Hidden Markov Models: latent state sequence is the missing data
- Latent Dirichlet Allocation: topic assignments per word are the missing data
- Probabilistic PCA: latent coordinates are the missing data
- Factor analysis: latent factors are the missing data
EM typically converges more slowly than direct gradient optimization near the optimum, but it is often more numerically stable and works well when the M-step has a clean closed form.
Practical Notes
- Initialize GMM with K-Means centroids — much better than random initialization
- Watch for degenerate solutions: a component that collapses onto very few points can drive σ → 0 and the likelihood → ∞. Add regularization (a small ridge on the diagonal of Σ_k) to avoid this.
- Use BIC or AIC to compare GMMs with different K values: BIC = -2 log L + p log n, where p is the number of free parameters and n the number of data points (penalizes complexity)
- Multiple restarts help: like K-Means, EM only finds a local optimum (see the sketch below)
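These notes map directly onto scikit-learn's GaussianMixture (k-means initialization via init_params, a covariance ridge via reg_covar, restarts via n_init, and a bic method). A sketch of model selection over K, assuming synthetic 1D data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(2.0, 0.5, 100), rng.normal(9.0, 0.7, 100)]).reshape(-1, 1)

scores = {}
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, init_params="kmeans",  # k-means initialization
                          reg_covar=1e-6,                        # ridge on covariance diagonals
                          n_init=5, random_state=0).fit(X)       # multiple restarts
    scores[k] = gmm.bic(X)                                       # lower BIC is better

best_k = min(scores, key=scores.get)
print(scores, "-> best K:", best_k)
```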
What to Remember
- EM alternates between E-step (compute responsibilities using current parameters) and M-step (update parameters using responsibilities as weights)
- Each iteration is guaranteed not to decrease the log-likelihood, so the algorithm converges (to a local optimum, not necessarily the global one)
- M-step updates have clean closed forms for GMMs: weighted means, covariances, and mixing weights
- EM applies to any model with latent variables where the complete-data MLE is tractable
- Initialize with K-Means to avoid bad local optima; add Σ regularization to avoid degenerate components