Attention & Transformers
Lesson 3 ⏱ 16 min

Q, K, V: the full attention formula


Query, Key, Value - The Attention Mechanism in Detail

Step-by-step walkthrough of the QKV projections, the scaled dot-product formula, the score matrix shape, causal masking, and a concrete 4-token worked example.


🧮 Quick refresher

Softmax and matrix multiplication

Softmax converts a vector of scores into a probability distribution summing to 1. Matrix multiplication combines an (m×n) matrix with an (n×p) matrix to give an (m×p) matrix - it is how all attention scores are computed at once.

Example

Softmax([2, 1, 0]) = [e²/(e²+e+1), e¹/(e²+e+1), e⁰/(e²+e+1)] ≈ [0.67, 0.24, 0.09].
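
The same numbers in a few lines of numpy (a minimal sketch; subtracting the max before exponentiating is the standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the output is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.0])))  # ≈ [0.665, 0.245, 0.090]
```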

We know that attention computes a weighted average of values, with weights based on relevance. Now we need to answer the key question: exactly how do you compute those relevance scores? The query-key-value framework is the answer - a wonderfully clean decomposition of "what am I looking for," "what do I contain," and "what will I contribute."

The QKV formulation is the core operation in every transformer: GPT, BERT, T5, and LLaMA all compute attention as a scaled dot product of queries and keys. Understanding this framework means you can read any transformer paper and know exactly what the attention block is doing.

The Search Metaphor

Think about searching a library. You have a question in mind - your search query. The library catalog has entries describing what each book contains. You match your query against those catalog entries to find the most relevant books, then actually read the content of those books.

In QKV attention:

  • Query (Q): "What am I looking for?" Each token produces a query vector representing its information needs.
  • Key (K): "What do I contain?" Each token produces a key vector representing what it offers for matching.
  • Value (V): "What will I contribute?" Each token produces a value vector - the actual information it provides when selected.

The process: match each query against all keys to get relevance scores, apply softmax for weights, take a weighted sum of values.

"bank" has a query asking "what kind of object am I?" "deposit" has a key saying "financial transaction." These match well → high score → "deposit"'s value vector is strongly mixed into "bank"'s new representation.

Computing Q, K, V from Input

The input to an attention layer is a matrix X of shape (seq_len × d_model). Each row is one token's representation.

You compute Q, K, V from X using three learned projection matrices:

Q = X W_Q, \quad K = X W_K, \quad V = X W_V

  • W_Q - learned query projection matrix (d_model × d_k)
  • W_K - learned key projection matrix (d_model × d_k)
  • W_V - learned value projection matrix (d_model × d_v)

Q and K have shape (seq_len × d_k); V has shape (seq_len × d_v). Every token gets its own query, key, and value vector derived from its representation via the learned projections. W_Q, W_K, and W_V are trainable parameters.
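
In code, the three projections are just matrix multiplies. A minimal numpy sketch with illustrative toy dimensions (in a real model the W matrices are trained, not random):

```python
import numpy as np

seq_len, d_model, d_k, d_v = 4, 8, 8, 8      # toy sizes for illustration

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))      # one row per token

# Learned projections; random stand-ins here since we are not training.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)             # (4, 8) (4, 8) (4, 8)
```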

Scaled Dot-Product Attention

The full formula. Given Q, K, V:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V

  • Q - query matrix (seq_len × d_k)
  • K - key matrix (seq_len × d_k)
  • V - value matrix (seq_len × d_v)
  • d_k - dimension of the query/key vectors, used as the scaling factor

Let's unpack each step.

Step 1: QKᵀ - The Score Matrix

Q has shape (seq_len × d_k); Kᵀ has shape (d_k × seq_len). Their product:

QKᵀ: (seq_len × d_k) · (d_k × seq_len) = (seq_len × seq_len)

Entry [i, j] of this score matrix is the dot product of query vector i with key vector j - the raw relevance score between token i and token j. Matrix multiplication computes all seq_len² scores simultaneously.
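
Continuing the projection sketch above, the whole score matrix is a single matrix multiply:

```python
scores = Q @ K.T     # (seq_len, d_k) @ (d_k, seq_len) -> (seq_len, seq_len)
print(scores.shape)  # (4, 4)
print(scores[1, 3])  # raw score: dot product of query 1 with key 3
```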

Step 2: Divide by √d_k - The Scaling

Before softmax, divide every score by √d_k.

Why? For d_k-dimensional random vectors, the variance of the dot product grows with d_k. When you feed large values into softmax, it saturates: one element gets weight ≈ 1, all others get ≈ 0, and gradients vanish.

Dividing by √d_k normalizes the scores to approximately unit variance. For d_k = 64, divide by 8; for d_k = 128, divide by ≈ 11.3.
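
The variance claim is easy to check empirically. A self-contained sketch, with unit-variance Gaussian vectors standing in for queries and keys:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 1024):
    q = rng.normal(size=(10_000, d_k))       # 10k sample query vectors
    k = rng.normal(size=(10_000, d_k))       # 10k sample key vectors
    dots = (q * k).sum(axis=1)               # 10k sample dot products
    print(d_k, dots.var(), (dots / np.sqrt(d_k)).var())
# Variance of the raw dot product grows ≈ d_k; after scaling it stays ≈ 1.
```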

Step 3: softmax() - The Probability Weights

Apply softmax to each row of the scaled score matrix independently. Each row corresponds to one query token; softmax converts the seq_len raw scores in that row into a probability distribution.

After softmax, the matrix shape is still (seq_len × seq_len) and each row sums to 1. Entry [i, j] is now the attention weight α_ij.
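
The only subtlety in code is applying softmax along the right axis. Continuing the toy sketch (softmax_rows is just an illustrative helper name):

```python
def softmax_rows(scores):
    # Softmax over the last axis: each row becomes a distribution.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

weights = softmax_rows(Q @ K.T / np.sqrt(d_k))
print(weights.sum(axis=-1))  # [1. 1. 1. 1.] - every row sums to 1
```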

Step 4: Multiply by V - The Weighted Sum

α · V: (seq_len × seq_len) · (seq_len × d_v) = (seq_len × d_v)

  • α - attention weight matrix (seq_len × seq_len)
  • V - value matrix (seq_len × d_v)

Each row of the output is the attended representation of one token - a weighted sum of all value vectors. Each token has now gathered information from across the sequence based on relevance.
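
Putting the four steps together, the whole operation is a few lines of numpy. A minimal sketch, reusing the toy Q, K, V from the projection example:

```python
def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_len, seq_len)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V                            # (seq_len, d_v)

out = attention(Q, K, V)
print(out.shape)  # (4, 8): one attended vector per token
```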

Interactive example (coming soon): QKV walkthrough - trace a 4-token sentence through each step of the attention formula.

Causal (Masked) Attention for Generation

In language models like GPT, token i should only attend to tokens at positions 1 through i - it cannot see the future.

Implementation: before applying softmax, set QKᵀ[i, j] = −∞ for all j > i. Softmax maps a score of −∞ to a weight of 0, so those attention weights become exactly 0.
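
A sketch of the mask in numpy, extending the attention function above (np.triu with k=1 selects the strictly upper triangle, i.e. the future positions):

```python
def causal_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    seq_len = scores.shape[0]
    # True above the diagonal = future positions j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)    # mask before softmax
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)   # masked weights are exactly 0
    return weights @ V

out = causal_attention(Q, K, V)  # same toy Q, K, V as before
```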

The Computational Cost

Attention's main cost is the seq_len × seq_len score matrix: time and memory grow as O(n²) in the sequence length n. For short sequences (n = 512), this is fine. For very long sequences, the quadratic cost is a serious bottleneck - which is why efficient attention is an active research area.
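
To make the quadratic growth concrete, here is the memory footprint of just the score matrix at a few sequence lengths (illustrative arithmetic: float32, a single head):

```python
# float32 score matrix, one head: n × n × 4 bytes
for n in (512, 4_096, 32_768):
    print(f"n={n:>6}: {n * n * 4 / 1e6:>8,.0f} MB")
# n=   512:        1 MB
# n= 4,096:       67 MB
# n=32,768:    4,295 MB
```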

Quiz

Question 1 of 3

In query-key-value attention, the 'key' vector represents...