We know that attention computes a weighted average of values, with weights based on relevance. Now we need to answer the key question: exactly how do you compute those relevance scores? The query-key-value framework is the answer - a wonderfully clean decomposition of "what am I looking for," "what do I contain," and "what will I contribute."
The QKV formulation is the core operation in every transformer — GPT, BERT, T5, LLaMA — all compute attention as a scaled dot product of queries and keys. Understanding this framework means you can read any transformer paper and know exactly what the attention block is doing.
The Search Metaphor
Think about searching a library. You have a question in mind - your search query. The library catalog has entries describing what each book contains. You match your query against those catalog entries to find the most relevant books, then actually read the content of those books.
In QKV attention:
- Query (Q): "What am I looking for?" Each token produces a query vector representing its information needs.
- Key (K): "What do I contain?" Each token produces a key vector representing what it offers for matching.
- Value (V): "What will I contribute?" Each token produces a value vector - the actual information it provides when selected.
The process: match each query against all keys to get relevance scores, apply softmax for weights, take a weighted sum of values.
"bank" has a query asking "what kind of object am I?" "deposit" has a key saying "financial transaction." These match well → high score → "deposit"'s value vector is strongly mixed into "bank"'s new representation.
Computing Q, K, V from Input
The input to an attention layer is a matrix of shape (\text{seq_len} \times d_{\text{model}}). Each row is one token's representation.
You compute Q, K, V by multiplying the input X by three learned projection matrices:
- W_Q - learned query projection matrix (d_model x d_k), giving Q = XW_Q
- W_K - learned key projection matrix (d_model x d_k), giving K = XW_K
- W_V - learned value projection matrix (d_model x d_v), giving V = XW_V
Q and K have shape (\text{seq_len} \times d_k); V has shape (\text{seq_len} \times d_v). Every token gets its own query, key, and value vector derived from its representation via the learned projections. W_Q, W_K, and W_V are trainable parameters, learned along with the rest of the network.
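The three projections are just matrix multiplies. Here is a minimal NumPy sketch — random weights stand in for the learned parameters, and the sizes (4 tokens, d_model = 16, d_k = d_v = 8) are toy values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v = 4, 16, 8, 8   # toy sizes, not from any real model

X = rng.normal(size=(seq_len, d_model))     # input: one row per token
W_Q = 0.1 * rng.normal(size=(d_model, d_k)) # learned in practice; random here
W_K = 0.1 * rng.normal(size=(d_model, d_k))
W_V = 0.1 * rng.normal(size=(d_model, d_v))

Q = X @ W_Q   # (seq_len, d_k) - one query vector per token
K = X @ W_K   # (seq_len, d_k) - one key vector per token
V = X @ W_V   # (seq_len, d_v) - one value vector per token
```

Note that all three projections read from the same input X — the query, key, and value roles come entirely from the three different weight matrices.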
Scaled Dot-Product Attention
The full formula. Given Q, K, V:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

where:
- Q - query matrix (seq_len x d_k)
- K - key matrix (seq_len x d_k)
- V - value matrix (seq_len x d_v)
- d_k - dimension of the query/key vectors, used as the scaling factor
Let's unpack each step.
Step 1: QKᵀ - The Score Matrix
Q has shape (\text{seq_len} \times d_k). Kᵀ has shape (d_k \times \text{seq_len}). Their product:
- QKᵀ - score matrix of shape (seq_len x seq_len) - entry [i, j] is the dot product of query i with key j
Entry [i, j] is the dot product of query vector i with key vector j - the raw relevance score between token i and token j. Matrix multiplication computes all \text{seq_len}^2 scores simultaneously.
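In NumPy this is a single matrix multiply — a small sketch with random toy matrices (4 tokens, d_k = 8, both illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
Q = rng.normal(size=(4, 8))   # 4 query vectors
K = rng.normal(size=(4, 8))   # 4 key vectors

scores = Q @ K.T              # (4, 4): all pairwise query-key dot products
# scores[i, j] equals the dot product of query i with key j
```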
Step 2: Divide by √d_k - The Scaling
Before softmax, divide every score by √d_k.
Why? For d_k-dimensional random vectors with roughly unit-variance components, the variance of the dot product scales with d_k. When you feed large values into softmax, it saturates: one element gets weight ≈ 1, all others ≈ 0, and gradients vanish.
Dividing by √d_k brings the scores back to approximately unit variance. For d_k = 64: divide by 8. For d_k = 128: divide by ≈ 11.3.
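You can verify the variance claim empirically — a quick simulation (d_k = 64 and the sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d_k = 64
n_samples = 100_000

q = rng.normal(size=(n_samples, d_k))   # unit-variance components
k = rng.normal(size=(n_samples, d_k))

dots = np.einsum('nd,nd->n', q, k)      # one dot product per sample pair
print(dots.var())                        # ≈ 64, i.e. ≈ d_k
print((dots / np.sqrt(d_k)).var())       # ≈ 1 after scaling
```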
Step 3: softmax(QKᵀ/√d_k) - The Probability Weights
Apply softmax to each row of the scaled score matrix independently. Each row corresponds to one query token; softmax converts the seq_len raw scores in that row into a probability distribution.
After softmax, the matrix shape is still (\text{seq_len} \times \text{seq_len}) and each row sums to 1. Entry [i, j] is now the attention weight: how much token i attends to token j.
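A row-wise softmax can be sketched in a few lines of NumPy (the max-subtraction is a standard numerical-stability trick, not part of the math):

```python
import numpy as np

def softmax_rows(scores):
    # Subtract each row's max before exponentiating; softmax is
    # shift-invariant, so this changes nothing but avoids overflow.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
weights = softmax_rows(rng.normal(size=(4, 4)))
# every row of `weights` is a probability distribution over the 4 tokens
```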
Step 4: Multiply by V - The Weighted Sum
- softmax(QKᵀ/√d_k) - attention weight matrix (seq_len x seq_len)
- V - value matrix (seq_len x d_v)
Each row of the output is the attended representation of one token - a weighted sum of all value vectors. Each token has now gathered information from across the sequence based on relevance.
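Putting all four steps together gives a complete scaled dot-product attention function — a minimal single-head NumPy sketch, with toy random inputs standing in for real projections:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q Kᵀ / sqrt(d_k)) V - scaled dot-product attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # step 1 + 2: scaled scores
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # step 3: row-wise softmax
    return weights @ V                             # step 4: weighted sum of values

rng = np.random.default_rng(4)
out = attention(rng.normal(size=(4, 8)),
                rng.normal(size=(4, 8)),
                rng.normal(size=(4, 8)))
# out has shape (seq_len, d_v) = (4, 8): one attended vector per token
```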
Interactive example
QKV walkthrough - trace a 4-token sentence through each step of the attention formula
Coming soon
Causal (Masked) Attention for Generation
In language models like GPT, token i should only attend to tokens at positions 1 through i - it cannot see the future.
Implementation: before applying softmax, set score[i, j] = −∞ for all j > i. Softmax of −∞ is 0, so those attention weights become exactly 0.
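The masking amounts to filling the upper triangle of the score matrix with −∞ before the softmax — a sketch, again with toy random inputs:

```python
import numpy as np

def causal_attention(Q, K, V):
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask the future: entries with j > i (strict upper triangle) get -inf,
    # so exp(-inf) = 0 and those weights vanish after the softmax.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[mask] = -np.inf
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(5)
out, weights = causal_attention(rng.normal(size=(4, 8)),
                                rng.normal(size=(4, 8)),
                                rng.normal(size=(4, 8)))
# weights is lower-triangular: token i only attends to positions <= i,
# and the first token can only attend to itself.
```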
The Computational Cost
Attention's main cost is the \text{seq_len} \times \text{seq_len} score matrix: O(n^2 \cdot d_k) time to compute and O(n^2) memory to store, where n = \text{seq_len}. For short sequences (n = 512), this is fine. For very long sequences, the quadratic cost is a serious bottleneck - which is why efficient attention is an active research area.
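To make the quadratic growth concrete, here is a back-of-the-envelope memory calculation, assuming float32 scores and a single attention head:

```python
# Memory for one (n x n) float32 score matrix at various sequence lengths.
for n in (512, 4096, 32768):
    bytes_needed = n * n * 4              # 4 bytes per float32 entry
    print(f"n = {n:>6}: {bytes_needed / 2**20:>8.1f} MiB")
# n = 512 needs 1 MiB; n = 32768 needs 4096 MiB (4 GiB) - per head, per layer.
```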