In 1957, Stuart Lloyd at Bell Labs invented an algorithm to solve a telephone signal quantization problem. It turned out to be one of the most widely used algorithms in all of machine learning. You know it as K-Means.
The Core Idea
K-Means is beautifully simple: place K representative points (centroids) in your data space. Assign every data point to its nearest centroid. Move each centroid to the average position of its assigned points. Repeat until nothing changes.
That's it. Two steps, alternating, until convergence.
Let's make it precise.
The Algorithm
Input: Data points X = {x₁, x₂, ..., xₙ}, number of clusters K
Initialize: Randomly choose K points from X as initial centroids {μ₁, μ₂, ..., μ_K}
Repeat until convergence:
Assignment step: For each point xᵢ, assign it to the cluster cᵢ whose centroid is nearest:
cᵢ = argmin_k ||xᵢ − μₖ||²
- cᵢ: cluster assignment for point i
- μₖ: centroid of cluster k
- ||·||: Euclidean distance
Update step: Recompute each centroid as the mean of its assigned points:
μₖ = (1 / |Cₖ|) Σ_{i : cᵢ = k} xᵢ
- μₖ: new centroid of cluster k
- |Cₖ|: number of points assigned to cluster k
- cᵢ = k: point i is assigned to cluster k
Objective (Inertia): K-Means minimizes the within-cluster sum of squares:
J = Σᵢ₌₁ⁿ ||xᵢ − μ_cᵢ||²
- J: inertia (within-cluster sum of squares)
- xᵢ: data point i
- μ_cᵢ: centroid of the cluster point i belongs to
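Here is what those two steps look like in code. This is a minimal NumPy sketch, not a library implementation; the function name and signature are our own, and empty-cluster handling is deliberately omitted:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and update steps.

    X: (n, d) array of points; k: number of clusters.
    (Empty clusters are not handled; fine for well-behaved data.)
    """
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: (n, k) matrix of point-to-centroid distances;
        # each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Converged once the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia
```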
Worked Example: Two Iterations
Six points in 2D. K = 2. Let's trace two full iterations.
Points:
| Point | x | y |
|---|---|---|
| P1 | 1.0 | 1.0 |
| P2 | 1.5 | 2.0 |
| P3 | 2.0 | 1.5 |
| P4 | 8.0 | 8.0 |
| P5 | 8.5 | 7.5 |
| P6 | 9.0 | 8.0 |
Iteration 1 — Initialize: Suppose we randomly pick P2 = (1.5, 2.0) as μ₁ and P5 = (8.5, 7.5) as μ₂.
Assignment: Compute distance from each point to μ₁ and μ₂.
For P1 = (1, 1):
- d(P1, μ₁) = √((1-1.5)² + (1-2)²) = √(0.25 + 1) = √1.25 ≈ 1.12
- d(P1, μ₂) = √((1-8.5)² + (1-7.5)²) = √(56.25 + 42.25) = √98.5 ≈ 9.92
- Assign to Cluster 1 ✓
For P4 = (8, 8):
- d(P4, μ₁) = √((8-1.5)² + (8-2)²) = √(42.25 + 36) = √78.25 ≈ 8.85
- d(P4, μ₂) = √((8-8.5)² + (8-7.5)²) = √(0.25 + 0.25) = √0.5 ≈ 0.71
- Assign to Cluster 2 ✓
Result: Cluster 1 = {P1, P2, P3}, Cluster 2 = {P4, P5, P6}
Update:
- μ₁ = mean of {(1,1), (1.5,2), (2,1.5)} = ((1+1.5+2)/3, (1+2+1.5)/3) = (1.5, 1.5)
- μ₂ = mean of {(8,8), (8.5,7.5), (9,8)} = (25.5/3, 23.5/3) ≈ (8.5, 7.83)
Iteration 2:
Assignment: All six points are still closer to their current centroid — no changes.
Convergence: No points changed cluster → algorithm terminates.
Final clusters: {P1, P2, P3} and {P4, P5, P6}. Inertia J = 0.5 + 0.25 + 0.25 + 0.28 + 0.11 + 0.28 ≈ 1.67: every term is small, because each centroid sits right in the middle of its group.
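You can verify this trace numerically. A standalone snippet (our own code, mirroring the example's initialization with P2 and P5):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 7.5], [9.0, 8.0]])
centroids = X[[1, 4]]  # start from P2 and P5, as in the trace

for it in range(2):
    # Assignment step, then update step, exactly as traced above.
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    labels = dists.argmin(axis=1)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    print(f"iteration {it + 1}: centroids =\n{centroids}")

inertia = ((X - centroids[labels]) ** 2).sum()
print(f"inertia ≈ {inertia:.2f}")  # prints ≈ 1.67
```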
Convergence: Guaranteed, But Not Perfect
K-Means always converges. Here's why: the objective J can only decrease or stay the same after each iteration.
- The assignment step: reassigning a point to a closer centroid can only decrease J
- The update step: replacing a centroid with the mean of its assigned points can only decrease J, since the mean is exactly the point that minimizes the sum of squared distances to the cluster's points
Since J is bounded below by 0, never increases, and there are only finitely many ways to partition n points into K clusters, the algorithm cannot cycle and must terminate.
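You can watch the monotone decrease directly. A sketch in the same NumPy style as above, logging J once per iteration on arbitrary synthetic data:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))  # any data works; the argument needs no cluster structure
centroids = X[rng.choice(len(X), size=3, replace=False)]

prev_J = np.inf
for it in range(20):
    # Assignment step, plus the current value of the objective.
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    labels = dists.argmin(axis=1)
    J = (dists.min(axis=1) ** 2).sum()
    assert J <= prev_J + 1e-9, "J must never increase"
    print(f"iteration {it}: J = {J:.3f}")
    prev_J = J
    # Update step (empty-cluster handling omitted for brevity).
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(3)])
```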
K-Means can converge to a bad local minimum. The objective is non-convex. If you initialize unluckily (say, both initial centroids land in the same natural cluster), the algorithm may converge to a solution far worse than optimal.
The practical consequence: always run K-Means multiple times with different random initializations and take the run with the lowest inertia. Most libraries default to n_init=10 for exactly this reason. K-Means++ (next lesson) is a smarter approach to initialization that dramatically reduces the chance of bad starts.
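With scikit-learn, for instance, the restart logic is a single parameter (the data here is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two blob-shaped groups of synthetic points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(8, 1, size=(100, 2))])

# n_init=10: run 10 random initializations, keep the one with lowest inertia.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.inertia_)         # inertia of the best run
print(km.cluster_centers_) # centroids of the best run
```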
Computational Complexity
Each iteration costs O(n × K × d): n points, K distance computations each, d dimensions per distance. The number of iterations until convergence is usually small in practice (10–100 for well-separated clusters), making K-Means one of the fastest clustering algorithms available.
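A rough way to see the linear scaling in n: time a single assignment step while doubling the number of points (a sketch using our own NumPy step; absolute times depend on the machine, but each doubling of n should roughly double the time):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
K, d = 10, 32
centroids = rng.normal(size=(K, d))

for n in (10_000, 20_000, 40_000):
    X = rng.normal(size=(n, d))
    t0 = time.perf_counter()
    # One assignment step: n x K distance computations in d dimensions.
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    print(f"n={n}: {time.perf_counter() - t0:.3f}s")
```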
Assumptions and Limitations
K-Means works well when clusters are:
- Spherical — roughly equal radius in all directions
- Similar size — roughly the same number of points
- Well-separated — clear gaps between groups
It struggles when clusters are elongated, crescent-shaped, or nested. It also requires you to specify K ahead of time (next lesson covers how to choose K).
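To see the failure mode concretely, fit K-Means on crescent-shaped data; a sketch using scikit-learn's make_moons (our choice of dataset and metric):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: clearly two clusters, but not spherical ones.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Well below 1.0: K-Means cuts the crescents with a straight boundary
# instead of following their shapes.
print(adjusted_rand_score(y_true, labels))
```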
What to Remember
- K-Means alternates between assigning points to the nearest centroid and recomputing centroids as means
- The objective (inertia) is guaranteed to decrease every iteration → convergence is guaranteed
- The objective is non-convex → can converge to local optima → always run multiple times
- Complexity is O(n × K × d) per iteration — fast and scalable
- Assumes spherical, equal-sized, well-separated clusters