The core idea of attention is beautiful in its simplicity: to update the representation of any given word, look at all the other words in the sentence, weight them by how relevant they are, and blend their information together. That's it. Everything else in the transformer is engineering details around this central intuition.
Consider the sentence: "The trophy wouldn't fit in the suitcase because it was too big." Which object is "it"? You resolved that in milliseconds by connecting "it" back to "trophy" across six intervening words. A model needs the same ability - to look at any word and weigh the relevance of every other word in context. Attention is the mechanism that makes that possible.
The Intuition: Context as a Weighted Average
Let's ground this in the "bank" example. You have the sentence:
"I went to the bank to deposit money."
When you process the word "bank," you want to figure out which sense it carries - riverbank or financial institution. Which nearby words are relevant?
- "deposit" is highly relevant - deposits happen at financial institutions, not rivers
- "money" is highly relevant - same reason
- "went" and "to" and "the" are not very relevant - they don't disambiguate the meaning
Attention encodes this as a weighted average of representations:
- Collect the representation of every word: "I", "went", "to", "the", "bank", "to", "deposit", "money"
- For "bank" as the current token, assign weights: "deposit" gets 0.35, "money" gets 0.30, "bank" itself gets 0.15, remaining words share 0.20
- Compute a weighted sum: $h^{\text{new}}_{\text{bank}} = 0.35 \, h_{\text{deposit}} + 0.30 \, h_{\text{money}} + 0.15 \, h_{\text{bank}} + \dots$
- The result is a context-aware blended representation for "bank" that carries financial meaning
The magic is that these weights are learned from data and computed from the content of the tokens - not fixed or hand-designed.
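To make the blending concrete, here is a minimal NumPy sketch of the weighted average above. The 4-dimensional vectors are made-up placeholders for learned representations, and the weights are the illustrative values from the list, not learned quantities:

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["I", "went", "to", "the", "bank", "to", "deposit", "money"]
# Toy 4-dimensional representations, one row per position (made-up
# numbers standing in for learned embeddings).
reps = rng.normal(size=(len(tokens), 4))

# Hand-set weights for "bank", matching the example above: "deposit"
# gets 0.35, "money" 0.30, "bank" itself 0.15, and the other five
# words share the remaining 0.20 (0.04 each).
weights = np.array([0.04, 0.04, 0.04, 0.04, 0.15, 0.04, 0.35, 0.30])
assert np.isclose(weights.sum(), 1.0)

# The context-aware representation for "bank" is a weighted average
# of every token's representation.
bank_contextual = weights @ reps  # shape: (4,)
print(bank_contextual)
```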
What the Weights Must Satisfy
For this to work as a proper blending operation, the weights $\alpha_{ij}$ must form a valid probability distribution:
- Each weight must be non-negative: $\alpha_{ij} \geq 0$
- All weights must sum to 1: $\sum_j \alpha_{ij} = 1$
This is exactly what softmax provides. Compute raw scores (any real numbers), pass through softmax, get a valid probability distribution.
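A minimal sketch of that pipeline - the raw score values below are arbitrary illustrative numbers:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Map arbitrary real scores to a valid probability distribution."""
    shifted = scores - scores.max()  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

raw_scores = np.array([2.1, 1.9, -0.3, 0.5])  # any real numbers
weights = softmax(raw_scores)

print(weights)        # every entry is non-negative
print(weights.sum())  # 1.0
```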
The Formal Update
For token $i$, its attended output is:

$$\mathbf{y}_i = \sum_j \alpha_{ij} \, \mathbf{v}_j$$

- $\mathbf{y}_i$ - attended output for token $i$
- $\alpha_{ij}$ - attention weight from token $i$ to token $j$
- $\mathbf{v}_j$ - value vector of token $j$ - the information token $j$ contributes

where the sum is over all positions $j$ in the sequence.
Every token gets its own attended output using its own attention weight vector. "bank" has one set of weights; "deposit" has a different set reflecting what is relevant for understanding "deposit" in context.
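In code, this update is a single matrix product. The sketch below uses random placeholder values and a random (but row-normalized) attention matrix, just to show the shapes involved:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 8, 4

# One value vector v_j per token (placeholder random values).
values = rng.normal(size=(seq_len, d))

# A valid attention matrix: attn[i, j] is alpha_ij, each row sums to 1.
scores = rng.normal(size=(seq_len, seq_len))
scores -= scores.max(axis=1, keepdims=True)  # numerical stability
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# y_i = sum_j alpha_ij * v_j for every token at once: row i of the
# product is token i's attended output.
outputs = attn @ values
print(outputs.shape)  # (8, 4)
```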
Computing Attention Weights
The attention weights $\alpha_{ij}$ are computed from token representations in two steps:
- Compute a relevance score $s_{ij}$ between token $i$ and token $j$ (how well do they "match"?)
- Apply softmax over all $j$ scores: $\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_k \exp(s_{ik})}$
The specific way scores are computed (dot products of query and key vectors) is the subject of the next lesson. For now, the key fact is that scores are data-dependent - they change for every input sentence.
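A sketch of the two steps for a single token. Since scoring is the subject of the next lesson, a raw dot product of representations stands in for the real query/key score here - an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 8, 4
x = rng.normal(size=(seq_len, d))  # token representations (placeholder values)

i = 4  # "bank" in the running example

# Step 1: a relevance score s_ij for every j. A raw dot product of
# representations stands in for the query/key scoring of the next lesson.
scores = np.array([x[i] @ x[j] for j in range(seq_len)])

# Step 2: softmax over all j scores to get alpha_ij.
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

print(alpha)        # non-negative, sums to 1
print(alpha.sum())  # 1.0
```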
Key Properties of Attention
Computed in parallel: unlike RNNs, all attention weights for all positions can be computed simultaneously. The weight from token 3 to token 7 does not require token 4's attention to be computed first. This parallelism enables training massive language models efficiently on GPUs.
Arbitrary-range dependencies: there is no distance penalty. Token 1 can attend to token 500 just as easily as to token 2. If "bank" appears at position 50 and "deposit" at position 200, attention connects them directly in a single step.
Data-dependent receptive field: different sentences produce different attention weights. "I deposited money at the bank" produces strong attention between "bank" and "deposited"; "We fished by the bank of the river" produces strong attention between "bank" and "river". Same architecture, different behavior based on content.
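The parallelism is easy to see in code: with the same simplified dot-product scoring as above, every attention weight for every position comes out of one matrix multiply and one row-wise softmax - no sequential dependence, no distance penalty:

```python
import numpy as np

def attention_weights(x: np.ndarray) -> np.ndarray:
    """All attention weights for all positions at once - no recurrence.

    x: (seq_len, d) token representations.
    Returns (seq_len, seq_len), where row i holds token i's weights.
    """
    scores = x @ x.T                             # every pair scored in one matmul
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)  # row-wise softmax

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 16))

attn = attention_weights(x)
# Token 1 reaches token 499 as directly as token 2: both are entries in
# the same row, produced by the same single matrix multiply.
print(attn[1, 499], attn[1, 2])
```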
Comparison to What Came Before
Compared to RNNs: RNNs must route information through every intermediate step. Connecting tokens 1 and 100 means 99 transformation steps that blur and distort. Attention connects them directly in one step.
Compared to CNNs: CNN receptive fields are local (3x3, 5x5). Distant positions require many stacked layers to interact. Attention has a global receptive field from layer 1.
The key difference: attention weights are computed from content, not fixed. A 3x3 convolution always applies the same kernel. Attention looks at all positions but weights them dynamically based on relevance. This is why attention is called "soft" - it is a differentiable, content-based routing mechanism.