
The EM algorithm

Video (coming soon): Expectation-Maximization Explained (~11 min)

Walking through the E-step and M-step for a 1D GMM with two components, showing how parameters converge.

🧮 Quick refresher: Maximum likelihood estimation

Maximum likelihood estimation (MLE) finds the model parameters that maximize the probability of observing the data you actually saw. For a dataset X and parameters θ, you maximize the log-likelihood log P(X|θ). MLE is the standard way to fit probabilistic models.

Example

To estimate the bias of a coin from 7 heads out of 10 flips, MLE gives p = 7/10 = 0.7 — the value of p that makes the observed 7 heads most probable.
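
A quick numerical check, as a minimal Python sketch (the grid search is purely illustrative; the closed-form answer is heads/flips):

    import numpy as np

    heads, flips = 7, 10
    p_grid = np.linspace(0.01, 0.99, 99)
    # Binomial log-likelihood, dropping the constant binomial coefficient
    log_lik = heads * np.log(p_grid) + (flips - heads) * np.log(1 - p_grid)
    print(p_grid[np.argmax(log_lik)])  # 0.7, matching heads/flips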

You want to fit a GMM to data, but there's a catch: you don't know which Gaussian generated each point. If you knew the assignments, fitting would be easy (just compute weighted means and covariances). If you knew the parameters, assignments would be easy (just compute responsibilities). You know neither.

This circular dependency is exactly what the Expectation-Maximization (EM) algorithm resolves.

The Missing Data Problem

In a GMM, each data point x_n was generated by some component k — but we don't know which. Call this unknown label z_n. If we observed z_n, the complete-data log-likelihood would be:

\ln p(X, Z \mid \theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \bigl[\ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\bigr]

where:

  • ln p(X, Z | θ): complete-data log-likelihood
  • z_{nk}: indicator, 1 if point n belongs to component k, 0 otherwise
  • θ: model parameters {μ_k, Σ_k, π_k}

But z_{nk} is unknown. EM's solution: take the expectation of z_{nk} over its posterior distribution, then maximize that expected log-likelihood.

The Two Steps

E-step (Expectation): Given current parameters θ_old, compute the expected value of the missing data — the responsibilities:

r_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

where:

  • r_{nk}: responsibility of component k for point n
  • π_k: current mixing weight for component k
  • N(x_n | μ_k, Σ_k): Gaussian density at x_n under component k
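
In code, the E-step is only a few lines. Here is a minimal 1D sketch with NumPy/SciPy (the function and variable names are my own):

    import numpy as np
    from scipy.stats import norm

    def e_step(x, pi, mu, sigma):
        """Responsibilities r[n, k] for a 1D GMM; x has shape (N,), the rest (K,)."""
        # Unnormalized weights pi_k * N(x_n | mu_k, sigma_k), shape (N, K)
        dens = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)
        return dens / dens.sum(axis=1, keepdims=True)  # each row sums to 1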

M-step (Maximization): Treat the responsibilities as weights, and maximize the expected complete-data log-likelihood. This yields closed-form updates:

N_k = \sum_{n=1}^{N} r_{nk}, \qquad \mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \, x_n

\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^\top

\pi_k^{\text{new}} = \frac{N_k}{N}

where:

  • N_k: effective number of points assigned to component k
  • μ_k^new, Σ_k^new, π_k^new: updated mean, covariance, and mixing weight for component k
  • r_{nk}: responsibility from the E-step
  • N: total number of data points
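
The matching M-step sketch, continuing the 1D example above (in one dimension Σ_k is just a variance σ_k²):

    def m_step(x, r):
        """Update (pi, mu, sigma) from responsibilities r of shape (N, K)."""
        N_k = r.sum(axis=0)                                    # effective counts
        mu = (r * x[:, None]).sum(axis=0) / N_k                # weighted means
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / N_k   # weighted variances
        pi = N_k / len(x)                                      # mixing weights
        return pi, mu, np.sqrt(var)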

Repeat until the log-likelihood stops increasing (or changes by less than some tolerance ε).
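
Gluing the two sketches above together gives the full loop (tol and max_iter are illustrative defaults):

    def fit_gmm(x, pi, mu, sigma, tol=1e-6, max_iter=200):
        """Alternate E- and M-steps until the log-likelihood stops improving."""
        prev_ll = -np.inf
        for _ in range(max_iter):
            r = e_step(x, pi, mu, sigma)
            pi, mu, sigma = m_step(x, r)
            ll = np.log((pi * norm.pdf(x[:, None], loc=mu, scale=sigma)).sum(axis=1)).sum()
            if ll - prev_ll < tol:  # EM never decreases ll, so this is a valid stopping test
                break
            prev_ll = ll
        return pi, mu, sigma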

Worked Example: 1D, Two Components

Data: {1.5, 2.0, 2.5, 8.0, 9.0, 9.5}

Initial parameters:

  • Component 1: μ₁ = 2, σ₁ = 1, π₁ = 0.5
  • Component 2: μ₂ = 9, σ₂ = 1, π₂ = 0.5

E-step — compute responsibilities:

For x = 2.5:

  • N(2.5 | μ₁=2, σ₁=1) ∝ exp(-(2.5-2)²/2) = exp(-0.125) ≈ 0.882
  • N(2.5 | μ₂=9, σ₂=1) ∝ exp(-(2.5-9)²/2) = exp(-21.125) ≈ 0.000

r₁(2.5) ≈ 1.000, r₂(2.5) ≈ 0.000 — component 1 claims this point entirely.

For x = 8.0:

  • N(8.0 | μ₁=2, σ₁=1) ∝ exp(-18) ≈ 0.000
  • N(8.0 | μ₂=9, σ₂=1) ∝ exp(-0.5) ≈ 0.607

r₁(8.0) ≈ 0.000, r₂(8.0) ≈ 1.000 — component 2 claims it.

After computing all six responsibilities:

  • N₁ ≈ 3.0 (points 1.5, 2.0, 2.5 fully owned by component 1)
  • N₂ ≈ 3.0 (points 8.0, 9.0, 9.5 fully owned by component 2)

M-step — update parameters:

μ₁_new = (1×1.5 + 1×2.0 + 1×2.5) / 3 = 2.0 (no change — already well-positioned)

μ₂_new = (1×8.0 + 1×9.0 + 1×9.5) / 3 = 8.83

σ₁_new² = (1×(1.5-2)² + 1×(2.0-2)² + 1×(2.5-2)²) / 3 = (0.25 + 0 + 0.25)/3 = 0.167 → σ₁ ≈ 0.41

σ₂_new² = ((8-8.83)² + (9-8.83)² + (9.5-8.83)²) / 3 ≈ (0.69 + 0.03 + 0.45)/3 ≈ 0.39 → σ₂ ≈ 0.62

After just one iteration, the algorithm has already tightened each component around its actual data points.
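
These numbers can be checked mechanically by running one iteration of the e_step/m_step sketches from above:

    x = np.array([1.5, 2.0, 2.5, 8.0, 9.0, 9.5])
    pi, mu, sigma = np.array([0.5, 0.5]), np.array([2.0, 9.0]), np.array([1.0, 1.0])

    r = e_step(x, pi, mu, sigma)
    pi, mu, sigma = m_step(x, r)
    print(mu)     # approximately [2.00, 8.83]
    print(sigma)  # approximately [0.41, 0.62]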

The Log-Likelihood as Objective

The quantity EM optimizes is the observed-data log-likelihood:

\ln p(X \mid \theta) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)

where:

  • ln p(X | θ): log-likelihood of the observed data
  • ∑_k: sum over the K components

Each EM iteration increases this value (or leaves it unchanged). Track it during training: smooth, monotone convergence is the expected behavior. If it oscillates, or ever decreases, something is wrong — numerical issues or degenerate components.
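
In practice the inner sum of tiny densities can underflow, so the standard trick is to compute it in log space with logsumexp. A sketch, again with names of my own choosing:

    from scipy.special import logsumexp

    def log_likelihood(x, pi, mu, sigma):
        """Observed-data log-likelihood, computed stably in log space."""
        # log pi_k + log N(x_n | mu_k, sigma_k), shape (N, K)
        log_terms = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=sigma)
        return logsumexp(log_terms, axis=1).sum()  # log-sum-exp over components, sum over points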

Beyond GMMs: EM Is General

EM applies to any model where:

  1. Some data is missing or hidden (the "E" in EM)
  2. With the missing data filled in, the MLE update has a closed form (the "M" in EM)

Examples beyond GMMs:

  • Hidden Markov Models: latent state sequence is the missing data
  • Latent Dirichlet Allocation: topic assignments per word are the missing data
  • Probabilistic PCA: latent coordinates are the missing data
  • Factor analysis: latent factors are the missing data

EM typically converges more slowly than direct gradient optimization near the optimum, but it is often more numerically stable and works well whenever the M-step has a clean closed form.

Interactive example (coming soon): step through EM iterations on a 1D dataset and watch the Gaussians drift into position.

Practical Notes

  • Initialize GMM with K-Means centroids — much better than random initialization
  • Watch for degenerate solutions: a component with very few points can have σ → 0 and likelihood → ∞. Add regularization (a small ridge to Σ_k diagonal) to avoid this.
  • Use BIC or AIC to compare GMMs with different K values: BIC = -2 ln L + p ln N, where p is the number of free parameters and N the number of points; the p ln N term penalizes complexity. (A scikit-learn sketch follows this list.)
  • Multiple restarts help — like K-Means, EM only finds local optima
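
A minimal model-selection sketch using scikit-learn's GaussianMixture, which reports BIC directly (the synthetic dataset and parameter choices here are illustrative):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Two well-separated 1D clusters, mimicking the worked example
    x = np.concatenate([rng.normal(2.0, 0.5, 100),
                        rng.normal(9.0, 0.6, 100)]).reshape(-1, 1)

    for K in (1, 2, 3, 4):
        gm = GaussianMixture(n_components=K, n_init=5, random_state=0).fit(x)
        print(K, gm.bic(x))  # lower BIC is better; expect K = 2 to win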

What to Remember

  • EM alternates between E-step (compute responsibilities using current parameters) and M-step (update parameters using responsibilities as weights)
  • Each iteration is guaranteed not to decrease the log-likelihood, so the algorithm converges (to a local optimum)
  • M-step updates have clean closed forms for GMMs: weighted means, covariances, and mixing weights
  • EM applies to any model with latent variables where the complete-data MLE is tractable
  • Initialize with K-Means to avoid bad local optima; add Σ regularization to avoid degenerate components

Quiz


What does the E-step (Expectation step) compute in the EM algorithm for GMMs?