Attention & Transformers
Lesson 2 ⏱ 12 min

Attention as weighted averaging

Video coming soon

Attention as Weighted Averaging - Context by Blending

Animates how attention weights are assigned to tokens, how a weighted average of value vectors produces a context-aware representation, and the contrast with fixed pooling.

⏱ ~7 min

🧮

Quick refresher

Weighted averages

A weighted average multiplies each value by its weight and sums the results, where the weights are non-negative and sum to 1. Result: w1*v1 + w2*v2 + ..., with all wi >= 0 and w1 + w2 + ... = 1.

Example

Values [3, 7, 5] with weights [0.5, 0.3, 0.2]: 0.5*3 + 0.3*7 + 0.2*5 = 1.5 + 2.1 + 1.0 = 4.6.
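The refresher example can be checked in a few lines of Python (the variable names are just illustrative):

```python
values = [3, 7, 5]
weights = [0.5, 0.3, 0.2]

# Weights must form a valid distribution: non-negative, summing to 1
assert all(w >= 0 for w in weights)
assert abs(sum(weights) - 1.0) < 1e-9

result = sum(w * v for w, v in zip(weights, values))
print(round(result, 2))  # 4.6
```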

The core idea of attention is beautiful in its simplicity: to update the representation of any given word, look at all the other words in the sentence, weight them by how relevant they are, and blend their information together. That's it. Everything else in the transformer is engineering details around this central intuition.

Consider the sentence: "The trophy wouldn't fit in the suitcase because it was too big." Which object is "it"? You resolved that in milliseconds by connecting "it" to "trophy" across eight words. A model needs the same ability — to look at any word and weigh the relevance of every other word in context. Attention is the mechanism that makes that possible.

The Intuition: Context as a Weighted Average

Let's ground this in the "bank" example. You have the sentence:

"I went to the bank to deposit money."

When you process the word "bank," you want to figure out which sense it carries - riverbank or financial institution. Which nearby words are relevant?

  • "deposit" is highly relevant - deposits happen at financial institutions, not rivers
  • "money" is highly relevant - same reason
  • "went" and "to" and "the" are not very relevant - they don't disambiguate the meaning

Attention encodes this as a weighted average of representations:

  1. Collect the representation of every word: "I", "went", "to", "the", "bank", "to", "deposit", "money"
  2. For "bank" as the current token, assign weights: "deposit" gets 0.35, "money" gets 0.30, "bank" itself gets 0.15, remaining words share 0.20
  3. Compute a weighted sum: $0.35 \times \mathbf{v}_{\text{deposit}} + 0.30 \times \mathbf{v}_{\text{money}} + \ldots$
  4. The result is a context-aware blended representation for "bank" that carries financial meaning
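The steps above can be sketched numerically. The 2-d value vectors and the pooling of the low-weight words into one "other" entry are made up purely for illustration:

```python
import numpy as np

# Toy value vectors: first coordinate ~ "financial", second ~ "everything else"
values = {
    "deposit": np.array([1.0, 0.0]),
    "money":   np.array([0.9, 0.1]),
    "bank":    np.array([0.5, 0.5]),   # ambiguous on its own
    "other":   np.array([0.0, 1.0]),   # remaining words, pooled for brevity
}
weights = {"deposit": 0.35, "money": 0.30, "bank": 0.15, "other": 0.20}

# Step 3: weighted sum of value vectors = context-aware representation of "bank"
blended = sum(w * values[tok] for tok, w in weights.items())
print(blended)  # leans strongly toward the "financial" coordinate
```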

The magic is that these weights are learned from data and computed from the content of the tokens - not fixed or hand-designed.

Interactive example

Attention weight explorer - adjust token relevance and see the blended output shift

Coming soon

What the Weights Must Satisfy

For this to work as a proper blending operation, the weights must form a valid probability distribution:

  • Each weight must be non-negative: $\alpha_{ij} \geq 0$
  • All weights must sum to 1: $\sum_j \alpha_{ij} = 1$

This is exactly what softmax provides. Compute raw scores (any real numbers), pass them through softmax, and get a valid probability distribution.
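A minimal sketch of that conversion, with arbitrary made-up scores:

```python
import numpy as np

scores = np.array([2.0, -1.0, 0.5])          # raw scores: any real numbers
weights = np.exp(scores) / np.exp(scores).sum()   # softmax

print(weights)        # every entry is non-negative
print(weights.sum())  # sums to 1: a valid probability distribution
```

In practice the maximum score is subtracted before exponentiating for numerical stability, but the result is mathematically identical.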

The Formal Update

For token $i$, its attended output is:

$$\mathbf{a}_i = \sum_j \alpha_{ij} \cdot \mathbf{v}_j$$

  • $\mathbf{a}_i$ — attended output for token $i$
  • $\alpha_{ij}$ — attention weight from token $i$ to token $j$
  • $\mathbf{v}_j$ — value vector of token $j$, the information token $j$ contributes

where the sum is over all positions j in the sequence.

Every token gets its own attended output using its own attention weight vector. "bank" has one set of weights; "deposit" has a different set reflecting what is relevant for understanding "deposit" in context.
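The update $\mathbf{a}_i = \sum_j \alpha_{ij} \mathbf{v}_j$ for every token at once is a single matrix product. The weight matrix and value vectors below are random toy numbers, not learned parameters:

```python
import numpy as np

n, d = 3, 4                                   # 3 tokens, 4-dim values
rng = np.random.default_rng(0)
V = rng.normal(size=(n, d))                   # one value vector per token (rows)

scores = rng.normal(size=(n, n))              # raw relevance scores s_ij
A = np.exp(scores)
A /= A.sum(axis=1, keepdims=True)             # row-wise softmax: row i is token i's weights

attended = A @ V                              # row i is a_i = sum_j alpha_ij * v_j
```

Each row of `A` is one token's own attention-weight vector, which is exactly the "bank has one set of weights, deposit has another" point above.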

Computing Attention Weights

The attention weights $\alpha_{ij}$ are computed from token representations in two steps:

  1. Compute a relevance score between token i and token j (how well do they "match"?)
  2. Apply softmax over all $j$ scores: $\alpha_{ij} = \text{softmax}(s_{i1}, s_{i2}, \ldots, s_{in})[j]$

The specific way scores are computed (dot products of query and key vectors) is the subject of the next lesson. For now, the key fact is that scores are data-dependent - they change for every input sentence.
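The two steps can be sketched with a plain dot product of token vectors standing in for the score function (the query/key form from the next lesson replaces this; the toy representations are made up):

```python
import numpy as np

X = np.array([[1.0, 0.2],     # toy token representations, one row per token
              [0.8, 0.1],
              [0.0, 1.0]])

scores = X @ X.T                              # step 1: s_ij = x_i . x_j
scores -= scores.max(axis=1, keepdims=True)   # shift for numerical stability
alpha = np.exp(scores)
alpha /= alpha.sum(axis=1, keepdims=True)     # step 2: softmax over j

# alpha changes whenever X changes: the weights are data-dependent
```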

Key Properties of Attention

Computed in parallel: unlike RNNs, all attention weights for all positions can be computed simultaneously. The weight from token 3 to token 7 does not require token 4's attention to be computed first. This parallelism enables training massive language models efficiently on GPUs.

Arbitrary-range dependencies: there is no distance penalty. Token 1 can attend to token 500 just as easily as to token 2. If "bank" appears at position 50 and "deposit" at position 200, attention connects them directly in a single step.

Data-dependent receptive field: different sentences produce different attention weights. "I deposited money at the bank" produces strong bank-deposited attention; "We fished by the bank of the river" produces strong bank-river attention. Same architecture, different behavior based on content.

Comparison to What Came Before

Compared to RNNs: RNNs must route information through every intermediate step. Connecting tokens 1 and 100 means 99 transformation steps that blur and distort. Attention connects them directly in one step.

Compared to CNNs: CNN receptive fields are local (3x3, 5x5). Distant positions require many stacked layers to interact. Attention has a global receptive field from layer 1.

The key difference: attention weights are computed from content, not fixed. A 3x3 convolution always applies the same kernel. Attention looks at all positions but weights them dynamically based on relevance. This is why attention is called "soft" - it is a differentiable, content-based routing mechanism.

Quiz

1 / 3

In attention, the output for token i is...