You already know how to multiply a matrix by a vector: it transforms a vector into a new vector. But there's a different kind of multiplication that builds a matrix from scratch using just two vectors. It's called the outer product, and it shows up constantly in ML.
Building a matrix from two vectors
Say you have two vectors:

a = [1, 2, 3]   and   b = [4, 5]

The outer product a ⊗ b (also written abᵀ) multiplies every entry of a with every entry of b:

        [1·4  1·5]   [ 4   5]
a ⊗ b = [2·4  2·5] = [ 8  10]
        [3·4  3·5]   [12  15]

The result is a 3×2 matrix: the length of a (3) sets the number of rows, and the length of b (2) sets the number of columns.
The rule
For a column vector a of length m and a row vector bᵀ of length n:

       [a₁b₁  a₁b₂  …  a₁bₙ]
abᵀ =  [a₂b₁  a₂b₂  …  a₂bₙ]
       [  ⋮     ⋮        ⋮ ]
       [aₘb₁  aₘb₂  …  aₘbₙ]
Entry at row i, column j is just the product of the i-th element of a and the j-th element of b. No summing — just multiplying pairs.
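To see the rule in code, here's a minimal NumPy sketch (using the same illustrative vectors as above) that builds the matrix entry by entry and checks it against the built-in np.outer:

```python
import numpy as np

a = np.array([1, 2, 3])   # length m = 3
b = np.array([4, 5])      # length n = 2

# Build the outer product entry by entry: no sums, just pairwise products
M = np.empty((len(a), len(b)))
for i in range(len(a)):
    for j in range(len(b)):
        M[i, j] = a[i] * b[j]

assert np.array_equal(M, np.outer(a, b))  # matches the built-in
```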
What does a rank-1 matrix look like?
The matrix you get from an outer product has a special structure: every row is a scaled copy of the same vector (bᵀ), scaled by the corresponding entry of a. In our example:

[ 4   5] = 1 · [4  5]
[ 8  10] = 2 · [4  5]
[12  15] = 3 · [4  5]
All three rows point in the same direction — just different lengths. This is called a rank-1 matrix: it only has one independent direction.
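You can verify both claims directly; a quick sketch with the same example matrix:

```python
import numpy as np

M = np.outer(np.array([1, 2, 3]), np.array([4, 5]))

# Every row is a multiple of [4, 5]: dividing each row by the first
# row reveals the per-row scaling factors 1, 2, 3
print(M / M[0])   # [[1. 1.] [2. 2.] [3. 3.]]

# One independent direction means rank 1
print(np.linalg.matrix_rank(M))  # 1
```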
Why this matters for ML
Outer products appear in three key places:
1. Weight gradients in a single layer
When a neural network layer computes y = Wx, the gradient of the loss with respect to the weight matrix is:

∂L/∂W = δ ⊗ x = δxᵀ

That's an outer product. δ = ∂L/∂y is the error signal flowing back, and x is the layer's input. Their outer product tells us how much each weight contributed to the error.
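A small PyTorch sanity check of this (the layer size and the error signal δ are made-up illustrative values): injecting δ as the upstream gradient makes autograd produce exactly the outer product δ ⊗ x as W's gradient.

```python
import torch

W = torch.randn(2, 3, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])
y = W @ x                           # forward pass: y = Wx

delta = torch.tensor([0.5, -0.3])   # pretend upstream error signal ∂L/∂y
y.backward(delta)                   # inject δ as the gradient flowing back

# Autograd's W.grad matches the hand-computed outer product δ ⊗ x
assert torch.allclose(W.grad, torch.outer(delta, x))
```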
2. Attention patterns
In transformers, the attention matrix is built by computing dot products between all pairs of query and key vectors. For one head:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
The QKᵀ part is a grid of dot products: entry (i, j) is query i dotted with key j. Equivalently, QKᵀ is a sum of rank-1 outer products, one for each of the dₖ feature dimensions: column c of Q outer-multiplied with column c of K.
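A quick NumPy check of that column-wise decomposition, with made-up sizes (4 queries, 6 keys, head dimension 8):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 queries, head dim 8
K = rng.standard_normal((6, 8))   # 6 keys

# QKᵀ computed directly...
scores = Q @ K.T                  # shape (4, 6): entry (i, j) = qᵢ · kⱼ

# ...equals a sum of 8 rank-1 outer products, one per feature column
rank1_sum = sum(np.outer(Q[:, c], K[:, c]) for c in range(Q.shape[1]))
assert np.allclose(scores, rank1_sum)
```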
3. Low-rank approximations
Many large matrices in ML (embeddings, weight matrices) are approximated as a sum of a few rank-1 matrices. This is the foundation of techniques like LoRA (Low-Rank Adaptation), which adapts huge models cheaply by adding small outer-product corrections.
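Here's a simplified sketch of the idea (the sizes and rank r are illustrative, and LoRA's extra scaling factor is omitted): the low-rank update B @ A is exactly a sum of r outer products.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4               # illustrative sizes; r is the adapter rank

W = rng.standard_normal((d_out, d_in))   # frozen base weights
B = rng.standard_normal((d_out, r))      # small trainable factors
A = rng.standard_normal((r, d_in))

# The update B @ A is a sum of r rank-1 outer products
delta_W = sum(np.outer(B[:, i], A[i, :]) for i in range(r))
assert np.allclose(B @ A, delta_W)

W_adapted = W + delta_W                  # cheap rank-r correction to the full matrix
```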
Contrast: outer vs dot
It's worth pausing to keep the two operations distinct in your mind:
| Operation | Takes | Returns | Intuition |
|---|---|---|---|
| Dot product | Two vectors (same length) | A single number | Measures alignment |
| Outer product | Two vectors (lengths m and n) | An m×n matrix | Combines every pair |
Think of the dot product as collapsing two vectors into one number, and the outer product as expanding two vectors into a grid of combinations.
```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5])

# Outer product: shape (3, 2)
outer = np.outer(a, b)
print(outer)
# [[ 4  5]
#  [ 8 10]
#  [12 15]]
# entry (i, j) = a[i] * b[j]

# In gradient computation: δ ⊗ xᵀ gives the weight gradient
delta = np.array([0.5, -0.3])     # output error signal (2-dim)
x_in = np.array([1.0, 2.0, 3.0])  # input features (3-dim)
dW = np.outer(delta, x_in)        # weight gradient, shape (2, 3)
print(dW)

# PyTorch equivalent
import torch

a_t = torch.tensor([1.0, 2.0, 3.0])
b_t = torch.tensor([4.0, 5.0])
print(torch.outer(a_t, b_t))
```
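To round out the contrast from the table above, a two-line comparison with same-length vectors:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print(np.dot(u, v))    # 32.0: collapses the pair to a single number
print(np.outer(u, v))  # 3×3 matrix: expands to every pairwise product
```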
A useful identity
Any matrix can be written as a sum of rank-1 matrices:

A = Σᵢ σᵢ uᵢvᵢᵀ = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + …

This is the Singular Value Decomposition (SVD), which you'll meet later when studying PCA and embeddings. Each term uᵢvᵢᵀ is an outer product, weighted by the singular value σᵢ.
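A short NumPy sketch of the identity (the matrix is random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as a weighted sum of rank-1 outer products σᵢ · (uᵢ ⊗ vᵢ)
recon = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(A, recon)
```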
Summary
- The outer product of an m-vector and an n-vector gives an m×n matrix.
- Entry (i, j) = aᵢ × bⱼ — multiply pairs, no summing.
- The result is rank-1: all rows are multiples of the same vector.
- Outer products describe weight gradients (δ·xᵀ) and underlie attention and low-rank methods.