We know that attention computes a weighted average of values, with weights based on relevance. Now we need to answer the key question: exactly how do you compute those relevance scores? The query-key-value framework is the answer - a wonderfully clean decomposition of "what am I looking for," "what do I contain," and "what will I contribute."
The QKV formulation is the core operation in every transformer — GPT, BERT, T5, LLaMA — all compute attention as a scaled dot product of queries and keys. Understanding this framework means you can read any transformer paper and know exactly what the attention block is doing.
The Search Metaphor
Think about searching a library. You have a question in mind - your search query. The library catalog has entries describing what each book contains. You match your query against those catalog entries to find the most relevant books, then actually read the content of those books.
In QKV attention:
- Query (Q): "What am I looking for?" Each token produces a query vector representing its information needs.
- Key (K): "What do I contain?" Each token produces a key vector representing what it offers for matching.
- Value (V): "What will I contribute?" Each token produces a value vector - the actual information it provides when selected.
The process: match each query against all keys to get relevance scores, apply softmax for weights, take a weighted sum of values.
"bank" has a query asking "what kind of object am I?" "deposit" has a key saying "financial transaction." These match well → high score → "deposit"'s value vector is strongly mixed into "bank"'s new representation.
Computing Q, K, V from Input
The input to an attention layer is a matrix of shape (\text{seq_len} \times d_{\text{model}}). Each row is one token's representation.
You compute Q, K, V by multiplying the input X by three learned projection matrices:
- W_Q - learned query projection matrix (d_model x d_k), giving Q = XW_Q
- W_K - learned key projection matrix (d_model x d_k), giving K = XW_K
- W_V - learned value projection matrix (d_model x d_v), giving V = XW_V
Q and K have shape (\text{seq_len} \times d_k); V has shape (\text{seq_len} \times d_v). Every token gets its own query, key, and value vector derived from its representation via the learned projections. W_Q, W_K, and W_V are trainable parameters, learned along with the rest of the network.
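The three projections are just matrix multiplies. Here is a minimal NumPy sketch — random weights stand in for the learned parameters, and the sizes (4 tokens, d_model = 16, d_k = d_v = 8) are toy values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v = 4, 16, 8, 8   # toy sizes, not from any real model

X = rng.normal(size=(seq_len, d_model))     # input: one row per token
W_Q = 0.1 * rng.normal(size=(d_model, d_k)) # learned in practice; random here
W_K = 0.1 * rng.normal(size=(d_model, d_k))
W_V = 0.1 * rng.normal(size=(d_model, d_v))

Q = X @ W_Q   # (seq_len, d_k) - one query vector per token
K = X @ W_K   # (seq_len, d_k) - one key vector per token
V = X @ W_V   # (seq_len, d_v) - one value vector per token
```

Note that all three projections read from the same input X — the query, key, and value roles come entirely from the three different weight matrices.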
Scaled Dot-Product Attention
The full formula. Given Q, K, V:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V

where:
- Q - query matrix (seq_len x d_k)
- K - key matrix (seq_len x d_k)
- V - value matrix (seq_len x d_v)
- d_k - dimension of the query/key vectors, used as the scaling factor
Let's unpack each step.
Step 1: QKᵀ - The Score Matrix
Q has shape (\text{seq_len} \times d_k). Kᵀ has shape (d_k \times \text{seq_len}). Their product:
- QKᵀ - score matrix of shape (seq_len x seq_len) - entry [i, j] is the dot product of query i with key j
Entry [i, j] is the dot product of query vector i with key vector j - the raw relevance score between token i and token j. Matrix multiplication computes all \text{seq_len}^2 scores simultaneously.
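In NumPy this is a single matrix multiply — a small sketch with random toy matrices (4 tokens, d_k = 8, both illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
Q = rng.normal(size=(4, 8))   # 4 query vectors
K = rng.normal(size=(4, 8))   # 4 key vectors

scores = Q @ K.T              # (4, 4): all pairwise query-key dot products
# scores[i, j] equals the dot product of query i with key j
```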
Step 2: Divide by √d_k - The Scaling
Before softmax, divide every score by √d_k.
Why? For d_k-dimensional random vectors with roughly unit-variance components, the variance of the dot product scales with d_k. When you feed large values into softmax, it saturates: one element gets weight ≈ 1, all others ≈ 0, and gradients vanish.
Dividing by √d_k brings the scores back to approximately unit variance. For d_k = 64: divide by 8. For d_k = 128: divide by ≈ 11.3.
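You can verify the variance claim empirically — a quick simulation (d_k = 64 and the sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d_k = 64
n_samples = 100_000

q = rng.normal(size=(n_samples, d_k))   # unit-variance components
k = rng.normal(size=(n_samples, d_k))

dots = np.einsum('nd,nd->n', q, k)      # one dot product per sample pair
print(dots.var())                        # ≈ 64, i.e. ≈ d_k
print((dots / np.sqrt(d_k)).var())       # ≈ 1 after scaling
```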
Step 3: softmax(QKᵀ/√d_k) - The Probability Weights
Apply softmax to each row of the scaled score matrix independently. Each row corresponds to one query token; softmax converts the seq_len raw scores in that row into a probability distribution.
After softmax, the matrix shape is still (\text{seq_len} \times \text{seq_len}) and each row sums to 1. Entry [i, j] is now the attention weight: how much token i attends to token j.
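A row-wise softmax can be sketched in a few lines of NumPy (the max-subtraction is a standard numerical-stability trick, not part of the math):

```python
import numpy as np

def softmax_rows(scores):
    # Subtract each row's max before exponentiating; softmax is
    # shift-invariant, so this changes nothing but avoids overflow.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
weights = softmax_rows(rng.normal(size=(4, 4)))
# every row of `weights` is a probability distribution over the 4 tokens
```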
Step 4: Multiply by V - The Weighted Sum
- softmax(QKᵀ/√d_k) - attention weight matrix (seq_len x seq_len)
- V - value matrix (seq_len x d_v)
Each row of the output is the attended representation of one token - a weighted sum of all value vectors. Each token has now gathered information from across the sequence based on relevance.
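Putting all four steps together gives a complete scaled dot-product attention function — a minimal single-head NumPy sketch, with toy random inputs standing in for real projections:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q Kᵀ / sqrt(d_k)) V - scaled dot-product attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # step 1 + 2: scaled scores
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # step 3: row-wise softmax
    return weights @ V                             # step 4: weighted sum of values

rng = np.random.default_rng(4)
out = attention(rng.normal(size=(4, 8)),
                rng.normal(size=(4, 8)),
                rng.normal(size=(4, 8)))
# out has shape (seq_len, d_v) = (4, 8): one attended vector per token
```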
Interactive example
QKV walkthrough - trace a 4-token sentence through each step of the attention formula
Coming soon
Causal (Masked) Attention for Generation
In language models like GPT, token i should only attend to tokens at positions 1 through i - it cannot see the future.
Implementation: before applying softmax, set score[i, j] = −∞ for all j > i. Softmax of −∞ is 0, so those attention weights become exactly 0.
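The masking amounts to filling the upper triangle of the score matrix with −∞ before the softmax — a sketch, again with toy random inputs:

```python
import numpy as np

def causal_attention(Q, K, V):
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask the future: entries with j > i (strict upper triangle) get -inf,
    # so exp(-inf) = 0 and those weights vanish after the softmax.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[mask] = -np.inf
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(5)
out, weights = causal_attention(rng.normal(size=(4, 8)),
                                rng.normal(size=(4, 8)),
                                rng.normal(size=(4, 8)))
# weights is lower-triangular: token i only attends to positions <= i,
# and the first token can only attend to itself.
```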
The Computational Cost
Attention's main cost is the \text{seq_len} \times \text{seq_len} score matrix: O(n^2 \cdot d_k) time to compute and O(n^2) memory to store, where n = \text{seq_len}. For short sequences (n = 512), this is fine. For very long sequences, the quadratic cost is a serious bottleneck - which is why efficient attention is an active research area.
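To make the quadratic growth concrete, here is a back-of-the-envelope memory calculation, assuming float32 scores and a single attention head:

```python
# Memory for one (n x n) float32 score matrix at various sequence lengths.
for n in (512, 4096, 32768):
    bytes_needed = n * n * 4              # 4 bytes per float32 entry
    print(f"n = {n:>6}: {bytes_needed / 2**20:>8.1f} MiB")
# n = 512 needs 1 MiB; n = 32768 needs 4096 MiB (4 GiB) - per head, per layer.
```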