Interpretability & Fairness
Lesson 5 ⏱ 10 min

Attention visualization

Video coming soon

What Are Attention Heads Looking At?

BertViz walkthrough showing how different attention heads in BERT track syntax, coreference, and positional patterns — and why this doesn't prove attention is the reason for the model's decision.

⏱ ~6 min

🧮 Quick refresher

Transformer attention mechanism

Transformer models process sequences by computing attention weights over all positions. For each output position, attention weights determine how much each input position contributes to that output. Multi-head attention runs several attention operations in parallel, each potentially capturing different relationships.
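
As a concrete illustration of "several attention operations in parallel," here is a minimal NumPy sketch with toy shapes (random stand-ins, not a real model) that computes one attention-weight matrix per head:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 32, 4
d_head = d_model // n_heads

# Random stand-ins for each head's query and key vectors
Q = rng.normal(size=(n_heads, seq_len, d_head))
K = rng.normal(size=(n_heads, seq_len, d_head))

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)      # (heads, seq, seq)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)

# Each head has its own seq_len x seq_len attention map; these are the
# matrices that tools like BertViz visualize, one per head per layer.
print(weights.shape)   # (4, 6, 6)
```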

Example

When reading 'The trophy didn't fit in the bag because it was too big,' attention should link 'it' back to 'trophy' — a coreference relationship.

Some BERT attention heads have been shown to do exactly this.

Reading Minds Through Heatmaps

When BERT reads the sentence "The animal didn't cross the street because it was too tired," a linguist would say the model needs to resolve that "it" refers to "animal," not "street" — a classic coreference resolution task.

Sure enough, when researchers visualized BERT's attention using BertViz, they found a specific attention head in layer 8 that reliably attends from "it" back to "animal" across many such sentences. This was exciting evidence that BERT had learned syntax.

Then came a paper that complicated everything.

Attention visualization is often the first tool researchers reach for when debugging transformer behavior. It has been used to discover that individual attention heads specialize in syntactic roles — subject-verb agreement, coreference, relative clauses — revealing that transformers implicitly learn linguistic structure. It's also the tool that exposed the limits of interpreting attention as explanation.

What Attention Weights Actually Are

In a transformer, attention is a weighted average: for each output position, the model scores how relevant every other position is, then takes a weighted mix of their information. High weight means "draw heavily from this position"; low weight means "mostly ignore it." The formula below is that weighted average, with a scaling factor (dividing by \sqrt{d_k}) to prevent the dot products from exploding before the softmax.

Formally, the attention output for position i is a weighted sum of value vectors:

\text{Attn}(i) = \sum_j a_{ij} v_j, \qquad a_{ij} = \text{softmax}\left(\frac{q_i k_j^T}{\sqrt{d_k}}\right)

  • a_{ij}: attention weight from position i to position j
  • v_j: value vector at position j
  • q_i, k_j: query and key vectors at positions i and j (rows of the query matrix Q and key matrix K)
  • d_k: key/query dimension, used for scaling

The weights for each output position live on a probability simplex (they sum to 1 and are non-negative). When plotted as a heatmap — rows are output positions, columns are input tokens — they produce the familiar attention maps you've seen in blog posts.
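
To make the formula concrete, here is a minimal NumPy sketch (all arrays are random stand-ins, not real BERT activations) that computes the attention weights for a short sequence and checks that each row sums to 1, exactly the matrix you would plot as a heatmap:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 5, 16

# Random stand-ins for the query, key, and value vectors of one head
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

# Scaled dot-product scores, then softmax over the key positions
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(weights.round(2))        # this matrix is what a heatmap visualizes
print(weights.sum(axis=-1))    # rows sum to 1: the weights live on a simplex

# The attention output: a weighted average of the value vectors
attn_out = weights @ V         # shape (seq_len, d_k)
```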

The question is: do these weights explain the model's behavior?

BertViz: A Tour of What Heads Learn

Jesse Vig's BertViz lets you explore all 144 attention heads (12 layers × 12 heads) in BERT-base interactively. Specific patterns appear repeatedly (a usage sketch follows the list):

  • Positional heads: certain heads always attend to the previous or next token (local context)
  • Delimiter heads: some heads attend heavily to [CLS] and [SEP] tokens — apparently using them as "garbage collectors" for global information
  • Syntactic heads: layer 8–10 heads track dependency parse structure — subjects and their verbs, adjectives and their nouns
  • Coreference heads: specific heads in layers 6–12 track pronoun antecedents
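
Here is a minimal sketch of how you might load BERT's attention and hand it to BertViz's head view; the exact bertviz call signature can vary between versions, so treat this as an outline rather than the library's definitive API:

```python
from transformers import BertTokenizer, BertModel
from bertviz import head_view  # interactive head-level view, renders in a notebook

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "The animal didn't cross the street because it was too tired."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model(**inputs)

# outputs.attentions is a tuple of 12 tensors, one per layer,
# each shaped (batch, heads, seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(outputs.attentions, tokens)
```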

Attention Is Not Explanation

In 2019, Sarthak Jain and Byron Wallace published "Attention is not Explanation" — one of the most cited ML papers of that year. Their finding: for many NLP classification tasks, you can replace the model's learned attention weights with adversarially constructed attention distributions, and the model's predictions barely change.

If changing the attention weights dramatically doesn't change the output, then the attention weights don't determine the output — something else does. The attention pattern is not the causal story of the prediction.
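
Here is a toy sketch of the kind of counterfactual test this argument rests on (random stand-ins, not the paper's actual models or datasets): swap in a very different attention distribution and measure how much the downstream prediction moves.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 8, 16

V = rng.normal(size=(seq_len, d))   # value vectors feeding the classifier head
w = rng.normal(size=d)              # toy linear classifier on the pooled vector

learned = rng.dirichlet(np.ones(seq_len))         # stand-in for learned attention
counterfactual = rng.dirichlet(np.ones(seq_len))  # adversarial / permuted weights

def predict(attn):
    # pooled representation = attention-weighted average of values
    return float(w @ (attn @ V))

# Large change in attention ...
print("total variation between distributions:",
      0.5 * np.abs(learned - counterfactual).sum())
# ... how much does the prediction change? Jain & Wallace found that for many
# NLP classifiers the answer is "barely at all", which is the core of their critique.
print("prediction shift:", abs(predict(learned) - predict(counterfactual)))
```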

The counter-paper, "Attention is not not Explanation" (Wiegreffe & Pinter, 2019), argued the criticism was too strict: attention weights don't need to be the sole explanation to be useful. They provide genuine information about input-output dependencies. Both papers are right about different things.

Better Proxies: Gradient × Attention

A more principled approach: weight each attention head's map by the gradient of the loss with respect to that head's output:

\hat{A}_h = A_h \cdot \left|\frac{\partial L}{\partial A_h}\right|

  • A_h: attention weight matrix for head h
  • \partial L / \partial A_h: gradient of the loss with respect to head h's attention map
  • \hat{A}_h: gradient-weighted attention, a better importance proxy

This filters out heads that have strong attention patterns but don't actually influence the final prediction. Heads with high gradient magnitude are genuinely important; heads with near-zero gradient are along for the ride.
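
A minimal PyTorch sketch of the idea, using a toy attention tensor rather than a real BERT head (hooking gradients out of a full transformer takes more plumbing):

```python
import torch

torch.manual_seed(0)
n_heads, seq_len, d = 4, 6, 8

# Toy per-head attention maps that require gradients, plus value vectors
scores = torch.randn(n_heads, seq_len, seq_len, requires_grad=True)
A = torch.softmax(scores, dim=-1)          # (heads, seq, seq) attention weights
V = torch.randn(n_heads, seq_len, d)

# Toy downstream loss: squared norm of the attended output
out = A @ V
loss = out.pow(2).sum()

# Gradient of the loss with respect to each head's attention map
(grad_A,) = torch.autograd.grad(loss, A)

# Gradient-weighted attention: heads whose maps barely affect the loss get downweighted
A_hat = A.detach() * grad_A.abs()
print(A_hat.sum(dim=(1, 2)))   # rough per-head importance scores
```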

Attention Rollout for Multi-Layer Models

A single layer's attention map doesn't tell the whole story. In transformers, each layer's output is a mix of newly attended information and the previous layer's output carried forward unchanged — this "carry forward" path is called a residual connection:

x_l = \text{Attn}_l(x_{l-1}) + x_{l-1}

  • x_l: output of layer l
  • \text{Attn}_l: attention function of layer l
  • x_{l-1}: output of the previous layer, carried forward by the residual connection

The residual connection means each layer's output mixes attended information with the unmodified previous layer. Attention rollout (Abnar & Zuidema, 2020) propagates attention maps from the output back to the input:

\tilde{A}_l = \left(\frac{A_l + I}{2}\right) \cdot \tilde{A}_{l-1}

  • \tilde{A}_l: rollout attention at layer l
  • A_l: attention matrix at layer l
  • I: identity matrix (models the residual pass-through)

This gives a single matrix showing how much each input token ultimately contributed to each output position, integrated across all layers — a more faithful view of long-range information flow.
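
A short NumPy sketch of the rollout recursion, assuming you already have one head-averaged attention matrix per layer (here they are random stand-ins):

```python
import numpy as np

def attention_rollout(layer_attns):
    """layer_attns: list of (seq_len, seq_len) attention matrices, input layer first."""
    seq_len = layer_attns[0].shape[0]
    rollout = np.eye(seq_len)
    for A in layer_attns:
        A_res = (A + np.eye(seq_len)) / 2   # fold in the residual connection
        rollout = A_res @ rollout           # propagate toward the input
    return rollout   # rollout[i, j]: how much input token j feeds output position i

# Random stand-ins for 12 layers of head-averaged attention over 6 tokens
rng = np.random.default_rng(0)
attns = [rng.dirichlet(np.ones(6), size=6) for _ in range(12)]
print(attention_rollout(attns).round(2))
```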

What Attention CAN Tell You

Despite the controversy, attention visualization provides genuine value for:

  • Debugging: discovering that a sentiment model attends to punctuation rather than sentiment words suggests a data artifact (see the sketch after this list)
  • Model comparison: comparing attention patterns between two checkpoints can reveal what changed during fine-tuning
  • Error analysis: when a model makes a wrong prediction, checking attention for unexpected focus patterns can suggest why
  • Linguistic structure: strong evidence that transformers encode syntactic and semantic relationships in their attention
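
As one example of the debugging use case, here is a small sketch (a hypothetical helper built on the same Hugging Face outputs as above) that sums how much attention each token receives across all heads and layers, a quick way to notice a model fixating on punctuation or [SEP] rather than content words:

```python
import torch

def attention_received(attentions, tokens):
    """attentions: tuple of (1, heads, seq, seq) tensors from output_attentions=True."""
    # Stack to (layers, heads, seq, seq), then sum the attention flowing INTO each token
    stacked = torch.stack([a[0] for a in attentions])
    received = stacked.sum(dim=(0, 1, 2))   # total weight each token (column) receives
    received = received / received.sum()
    for tok, score in sorted(zip(tokens, received.tolist()), key=lambda x: -x[1]):
        print(f"{tok:>12s}  {score:.3f}")

# Usage (continuing the BertViz example above):
# attention_received(outputs.attentions, tokens)
```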

Interactive example

Attention Head Explorer: Visualize BERT Attention Patterns

Coming soon

Summary

  • Transformer attention weights can be visualized as heatmaps; specific heads track syntax, coreference, and position.
  • BertViz provides interactive multi-head, multi-layer attention visualization.
  • Jain & Wallace (2019) showed attention weights can be swapped without changing predictions — they are not the sole causal explanation.
  • Gradient × Attention weighting gives a more faithful importance proxy by filtering out non-influential heads.
  • Attention rollout integrates attention across all layers, accounting for residual connections.
  • Attention visualization is most useful for debugging, linguistic analysis, and error analysis — not for auditable compliance explanations.

Quiz

1 / 3

The Jain & Wallace (2019) paper challenged attention as explanation by showing: