Attention & Transformers
Lesson 4 ⏱ 12 min

Multi-head attention

Video coming soon

Multi-Head Attention - Parallel Specialization

Animates how h parallel attention heads each learn different projection matrices, how their outputs are concatenated, and why the output projection mixes information across heads.

⏱ ~7 min

🧮 Quick refresher

Matrix concatenation and projection

Concatenating matrices along a dimension stacks them side by side. A projection matrix (linear layer) then maps the larger concatenated representation back to the original size.

Example

3 heads each producing (seq_len x 64) outputs, concatenated: (seq_len x 192).

Output projection W_O (192 x 192) maps back to (seq_len x 192).
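To make this concrete, here is a minimal NumPy sketch of the refresher example (the array contents are random and purely illustrative):

```python
import numpy as np

seq_len, d_head, n_heads = 5, 64, 3                 # sizes from the example above

# Three head outputs, each (seq_len x 64)
heads = [np.random.randn(seq_len, d_head) for _ in range(n_heads)]

concat = np.concatenate(heads, axis=-1)             # stacked side by side: (5, 192)
W_O = np.random.randn(n_heads * d_head, n_heads * d_head)   # projection, (192 x 192)
out = concat @ W_O                                  # back to (5, 192)

print(concat.shape, out.shape)                      # (5, 192) (5, 192)
```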

Single-head attention is powerful, but it computes only one set of attention weights per token. A word might need to simultaneously track its syntactic role (it's the subject of the verb), its semantic relationship (it's a synonym of another word), and its positional context (it's adjacent to a modifier). These are fundamentally different types of relationships, and one set of Q, K, V projections can only capture one perspective at a time.

Multi-head attention is what allows transformers to track syntax, semantics, and coreference simultaneously. Research probing BERT has shown that individual heads specialize: some track direct objects, others track positional proximity. This specialization is a key reason transformers displaced earlier architectures across a wide range of NLP benchmarks.

Multi-head attention runs several attention mechanisms in parallel, each with different learned projections. Each can specialize in a different type of relationship.

The Architecture

For each of the h attention heads, compute independent Q, K, V projections and run attention:

$$\text{head}_i = \text{Attention}(X W_i^Q,\; X W_i^K,\; X W_i^V)$$

$\text{head}_i$: the output of the $i$-th attention head
$W_i^Q, W_i^K, W_i^V$: learned projection matrices for head $i$, different for each head

Each set of projection matrices is learned independently. Head 1 might learn to project in a direction that captures subject-verb agreement; head 2 coreference; head 3 positional proximity.

After computing all h heads, concatenate them along the feature dimension:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O$$

$\text{Concat}$: concatenation along the feature/column dimension
$W_O$: output projection matrix of shape $(h \cdot d_v) \times d_{\text{model}}$

Since each head outputs $d_v$ dimensions, $W_O$ has shape $(h \cdot d_v) \times d_{\text{model}}$.
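Putting both formulas together, here is a compact sketch in PyTorch. It uses one fused $(d_{\text{model}} \times d_{\text{model}})$ projection each for Q, K, and V and then splits the result into $h$ heads of width $d_k$, which is equivalent to $h$ separate per-head matrices stacked side by side. The function name, weights, and sizes are illustrative, not a specific library's API:

```python
import math
import torch

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (T, d_model); W_q, W_k, W_v, W_o: (d_model, d_model); h: number of heads."""
    T, d_model = X.shape
    d_k = d_model // h

    # Project once, then split into h heads of width d_k (equivalent to h separate W_i^Q etc.)
    Q = (X @ W_q).view(T, h, d_k).transpose(0, 1)       # (h, T, d_k)
    K = (X @ W_k).view(T, h, d_k).transpose(0, 1)       # (h, T, d_k)
    V = (X @ W_v).view(T, h, d_k).transpose(0, 1)       # (h, T, d_k)

    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (h, T, T): each head's own attention
    weights = torch.softmax(scores, dim=-1)
    heads = weights @ V                                  # (h, T, d_k): head_1 ... head_h

    concat = heads.transpose(0, 1).reshape(T, h * d_k)   # Concat along the feature dimension
    return concat @ W_o                                   # output projection W_O -> (T, d_model)

# Illustrative usage
T, d_model, h = 10, 512, 8
X = torch.randn(T, d_model)
W_q, W_k, W_v, W_o = (torch.randn(d_model, d_model) / d_model**0.5 for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)   # torch.Size([10, 512])
```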

Interactive example

Multi-head attention - toggle heads on/off to see how each specializes on a sample sentence

Coming soon

Dimension Arithmetic

To keep multi-head attention computationally comparable to single-head, you reduce the dimensionality per head:

  • Single-head with $d_{\text{model}} = 512$: use $d_k = 512$. Compute is proportional to $\text{seq\_len}^2 \times 512$.
  • Multi-head with $h = 8$: set $d_k = d_v = d_{\text{model}} / h = 512 / 8 = 64$ per head.

Each head is 8× cheaper; with 8 heads, total compute is the same as single-head. After concatenation: 8 heads × 64 dimensions = 512. Applying $W_O$ maps back to $d_{\text{model}}$.
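A quick sanity check of this arithmetic in Python (sizes taken from the bullets above; the sequence length is illustrative):

```python
d_model, h = 512, 8
d_k = d_model // h                       # 64 per head
assert h * d_k == d_model                # concatenation restores d_model

seq_len = 128                            # illustrative sequence length
single_head = seq_len**2 * d_model       # proportional cost of one 512-wide head
multi_head = h * (seq_len**2 * d_k)      # 8 heads, each 8x cheaper
assert single_head == multi_head         # same total compute
```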

What Different Heads Learn

Interpretability researchers have probed what specific attention heads learn in trained transformers:

Syntactic heads: track grammatical structure - a verb attending strongly to its subject, or a noun to its modifying adjective, even across long-distance dependencies.

Coreference heads: connect pronouns to antecedents. "The woman said she would arrive" - a coreference head shows strong attention from "she" back to "woman."

Positional heads: primarily attend to adjacent tokens - next token, previous token, start of sentence. These capture local context.

Copy heads: near-identity behavior - a token attends to itself or to an earlier instance of the same token. These help copy information through the network.

This specialization emerges from training alone; you don't program these behaviors.

Typical Hyperparameters

| Model       | Heads (h) | d_model | d_k per head |
|-------------|-----------|---------|--------------|
| GPT-2 small | 12        | 768     | 64           |
| GPT-2 large | 20        | 1280    | 64           |
| BERT base   | 12        | 768     | 64           |
| GPT-3       | 96        | 12288   | 128          |

Notice that $d_k = 64$ is remarkably consistent. What scales is the number of heads and $d_{\text{model}}$.

Parameter Count

For one multi-head attention block with $h$ heads, hidden dimension $d_{\text{model}}$, and $d_k = d_v = d_{\text{model}} / h$:

$$\text{Total} = 4 \times d_{\text{model}}^2$$

$d_{\text{model}}$: the model's hidden dimension, which determines the total parameter count

  • $h$ sets of $W_i^Q$: total $= h \times d_{\text{model}} \times d_k = d_{\text{model}}^2$
  • Same for $W_i^K$ and $W_i^V$
  • Output projection $W_O$: $d_{\text{model}}^2$

For $d_{\text{model}} = 768$: $4 \times 768^2 \approx 2.36\text{M}$ parameters per attention layer.
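A short check of this count, using the BERT-base-like sizes from the table above:

```python
d_model, h = 768, 12
d_k = d_model // h                        # 64

qkv_params = 3 * h * d_model * d_k        # all W_i^Q, W_i^K, W_i^V across heads
out_params = d_model * d_model            # W_O
total = qkv_params + out_params

print(total, 4 * d_model**2)              # 2359296 2359296  (~2.36M)
```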

The Output Projection

The output projection does more than resize the concatenated output. After concatenating all $h$ head outputs (shape $T \times h \cdot d_v$), $W_O$ applies a learned linear combination that allows head information to mix across the feature dimension.

Without $W_O$, head 1's syntactic features and head 2's semantic features would stay in separate channels of the output. The output projection is what lets the model build token representations simultaneously informed by multiple relationship types, which is the entire point of multi-head attention.
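A tiny sketch of that effect: before $W_O$, the concatenated matrix is block-structured, with each group of $d_v$ columns coming from exactly one head; after multiplying by a (here random, purely illustrative) $W_O$, every output feature combines information from all heads:

```python
import torch

T, h, d_v = 4, 2, 3
d_model = h * d_v

# Fill head i's output entirely with the value i+1 so its columns are easy to spot
heads = [torch.full((T, d_v), float(i + 1)) for i in range(h)]
concat = torch.cat(heads, dim=-1)     # columns 0-2 come only from head 1, columns 3-5 only from head 2

W_O = torch.randn(d_model, d_model)
mixed = concat @ W_O                  # every column of the result mixes both heads

print(concat[0])   # tensor([1., 1., 1., 2., 2., 2.])
print(mixed[0])    # each entry is a weighted combination of head-1 and head-2 features
```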

Quiz

1 / 3

The main motivation for multi-head attention over single-head attention is...