You already know how to multiply a matrix by a vector: it transforms a vector into a new vector. But there's a different kind of multiplication that builds a matrix from scratch using just two vectors. It's called the outer product, and it shows up constantly in ML.
Building a matrix from two vectors
Say you have two vectors:

a = [1, 2, 3]   and   b = [4, 5]

The outer product a ⊗ b (also written abᵀ) multiplies every entry of a with every entry of b:

        [1·4  1·5]   [ 4   5]
a ⊗ b = [2·4  2·5] = [ 8  10]
        [3·4  3·5]   [12  15]

The result is a 3×2 matrix: the length of a (3) sets the number of rows, and the length of b (2) sets the number of columns.
The rule
For a column vector a of length m and a row vector bᵀ of length n:

       [a₁b₁  a₁b₂  …  a₁bₙ]
abᵀ =  [a₂b₁  a₂b₂  …  a₂bₙ]
       [  ⋮     ⋮        ⋮ ]
       [aₘb₁  aₘb₂  …  aₘbₙ]
Entry at row i, column j is just the product of the i-th element of a and the j-th element of b. No summing — just multiplying pairs.
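To see the rule in code, here's a minimal NumPy sketch (using the same illustrative vectors as above) that builds the matrix entry by entry and checks it against the built-in np.outer:

```python
import numpy as np

a = np.array([1, 2, 3])   # length m = 3
b = np.array([4, 5])      # length n = 2

# Build the outer product entry by entry: no sums, just pairwise products
M = np.empty((len(a), len(b)))
for i in range(len(a)):
    for j in range(len(b)):
        M[i, j] = a[i] * b[j]

assert np.array_equal(M, np.outer(a, b))  # matches the built-in
```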
What does a rank-1 matrix look like?
The matrix you get from an outer product has a special structure: every row is a scaled copy of the same vector (bᵀ), scaled by the corresponding entry of a. In our example:

[ 4   5] = 1 · [4  5]
[ 8  10] = 2 · [4  5]
[12  15] = 3 · [4  5]
All three rows point in the same direction — just different lengths. This is called a rank-1 matrix: it only has one independent direction.
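You can verify both claims directly; a quick sketch with the same example matrix:

```python
import numpy as np

M = np.outer(np.array([1, 2, 3]), np.array([4, 5]))

# Every row is a multiple of [4, 5]: dividing each row by the first
# row reveals the per-row scaling factors 1, 2, 3
print(M / M[0])   # [[1. 1.] [2. 2.] [3. 3.]]

# One independent direction means rank 1
print(np.linalg.matrix_rank(M))  # 1
```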
Why this matters for ML
Outer products appear in three key places:
1. Weight gradients in a single layer
When a neural network layer computes y = Wx, the gradient of the loss with respect to the weight matrix is:

∂L/∂W = δ ⊗ x = δxᵀ

That's an outer product. δ = ∂L/∂y is the error signal flowing back, and x is the layer's input. Their outer product tells us how much each weight contributed to the error.
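A small PyTorch sanity check of this (the layer size and the error signal δ are made-up illustrative values): injecting δ as the upstream gradient makes autograd produce exactly the outer product δ ⊗ x as W's gradient.

```python
import torch

W = torch.randn(2, 3, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])
y = W @ x                           # forward pass: y = Wx

delta = torch.tensor([0.5, -0.3])   # pretend upstream error signal ∂L/∂y
y.backward(delta)                   # inject δ as the gradient flowing back

# Autograd's W.grad matches the hand-computed outer product δ ⊗ x
assert torch.allclose(W.grad, torch.outer(delta, x))
```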
2. Attention patterns
In transformers, the attention matrix is built by computing dot products between all pairs of query and key vectors. For one head:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
The QKᵀ part is a grid of dot products: entry (i, j) is query i dotted with key j. Equivalently, QKᵀ is a sum of rank-1 outer products, one for each of the dₖ feature dimensions: column c of Q outer-multiplied with column c of K.
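A quick NumPy check of that column-wise decomposition, with made-up sizes (4 queries, 6 keys, head dimension 8):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 queries, head dim 8
K = rng.standard_normal((6, 8))   # 6 keys

# QKᵀ computed directly...
scores = Q @ K.T                  # shape (4, 6): entry (i, j) = qᵢ · kⱼ

# ...equals a sum of 8 rank-1 outer products, one per feature column
rank1_sum = sum(np.outer(Q[:, c], K[:, c]) for c in range(Q.shape[1]))
assert np.allclose(scores, rank1_sum)
```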
3. Low-rank approximations
Many large matrices in ML (embeddings, weight matrices) are approximated as a sum of a few rank-1 matrices. This is the foundation of techniques like LoRA (Low-Rank Adaptation), which adapts huge models cheaply by adding small outer-product corrections.
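Here's a simplified sketch of the idea (the sizes and rank r are illustrative, and LoRA's extra scaling factor is omitted): the low-rank update B @ A is exactly a sum of r outer products.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4               # illustrative sizes; r is the adapter rank

W = rng.standard_normal((d_out, d_in))   # frozen base weights
B = rng.standard_normal((d_out, r))      # small trainable factors
A = rng.standard_normal((r, d_in))

# The update B @ A is a sum of r rank-1 outer products
delta_W = sum(np.outer(B[:, i], A[i, :]) for i in range(r))
assert np.allclose(B @ A, delta_W)

W_adapted = W + delta_W                  # cheap rank-r correction to the full matrix
```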
Contrast: outer vs dot
It's worth pausing to keep the two operations distinct in your mind:
| Operation | Takes | Returns | Intuition |
|---|---|---|---|
| Dot product | Two vectors (same length) | A single number | Measures alignment |
| Outer product | Two vectors (lengths m and n) | An m×n matrix | Combines every pair |
Think of the dot product as collapsing two vectors into one number, and the outer product as expanding two vectors into a grid of combinations.
```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5])

# Outer product: shape (3, 2)
outer = np.outer(a, b)
print(outer)
# [[ 4  5]
#  [ 8 10]
#  [12 15]]
# entry (i, j) = a[i] * b[j]

# In gradient computation: δ ⊗ xᵀ gives the weight gradient
delta = np.array([0.5, -0.3])     # output error signal (2-dim)
x_in = np.array([1.0, 2.0, 3.0])  # input features (3-dim)
dW = np.outer(delta, x_in)        # weight gradient, shape (2, 3)
print(dW)

# PyTorch equivalent
import torch

a_t = torch.tensor([1.0, 2.0, 3.0])
b_t = torch.tensor([4.0, 5.0])
print(torch.outer(a_t, b_t))
```
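To round out the contrast from the table above, a two-line comparison with same-length vectors:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print(np.dot(u, v))    # 32.0: collapses the pair to a single number
print(np.outer(u, v))  # 3×3 matrix: expands to every pairwise product
```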
A useful identity
Any matrix can be written as a sum of rank-1 matrices:

A = Σᵢ σᵢ uᵢvᵢᵀ = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + …

This is the Singular Value Decomposition (SVD), which you'll meet later when studying PCA and embeddings. Each term uᵢvᵢᵀ is an outer product, weighted by the singular value σᵢ.
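A short NumPy sketch of the identity (the matrix is random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as a weighted sum of rank-1 outer products σᵢ · (uᵢ ⊗ vᵢ)
recon = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(A, recon)
```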
Summary
- The outer product of an m-vector and an n-vector gives an m×n matrix.
- Entry (i, j) = aᵢ × bⱼ — multiply pairs, no summing.
- The result is rank-1: all rows are multiples of the same vector.
- Outer products describe weight gradients (δ·xᵀ) and underlie attention and low-rank methods.