Attention is a beautiful mechanism, but it has a blind spot: it is completely indifferent to order. Shuffle the words in a sentence and the core computation proceeds exactly as before, because nothing in it depends on where a token sits. That is a problem, because word order is everything in language.
We need to tell the model where each token lives in the sequence.
Positional encoding is the solution that makes transformers work for language — without it, "the dog bit the man" and "the man bit the dog" would be indistinguishable. Every transformer-based model in production today, from BERT to GPT-4, relies on some form of positional encoding.
The Problem: Attention Is Permutation-Invariant
Let's be precise. The attention computation depends only on the content of the tokens via Q, K, V - not on their positions.
If you swap the token embeddings for "dog" and "man" in "the dog bit the man," getting "the man bit the dog," the attention mechanism computes the same set of relationships - just between different word contents. Without position information, it cannot know that "the dog" was the subject and "the man" the object.
This matters for:
- Subject vs. object roles ("the dog bit the man" vs "the man bit the dog")
- Modifier attachment ("I saw the man with a telescope" - who has the telescope?)
- Temporal reasoning ("she ate, then she slept" vs "she slept, then she ate")
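The invariance is easy to verify numerically. Below is a minimal NumPy sketch of single-head attention without positional information; the weight names (`Wq`, `Wk`, `Wv`) and sizes are illustrative, not taken from any particular model. Permuting the input tokens permutes the outputs in exactly the same way, so each token's representation is identical regardless of where it appears.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))                      # 4 token embeddings, no position info
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                # pairwise content similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

perm = [2, 0, 3, 1]                              # shuffle the sequence
out, out_perm = attention(X), attention(X[perm])

# Each token's output vector is unchanged -- only reordered.
print(np.allclose(out[perm], out_perm))          # True
```

The check passes because permuting rows of `X` permutes both the rows and columns of the score matrix consistently, which the softmax and the final matmul simply carry through.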
The fix: inject position information into the token representations before they enter the transformer.
Sinusoidal Positional Encodings
The original "Attention Is All You Need" paper proposed a fixed, formula-based encoding. For position \(pos\) and dimension index \(i\):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where:

- \(pos\): position of the token in the sequence (0-indexed)
- \(i\): dimension index within the encoding vector
- \(d_{\text{model}}\): total embedding dimension
- \(PE_{(pos,\,2i+1)}\): the encoding at odd dimension \(2i+1\)

Even-indexed dimensions use \(\sin\); odd-indexed use \(\cos\). The base 10000 gives a wide range of frequencies.
A helpful analogy is a binary counter. The number 5 in binary is 101: the least significant bit flips every step, the next bit every two steps, the next every four. Sinusoidal encoding is the continuous, smooth version of this idea, with each dimension oscillating at its own frequency.
A critical property: for any fixed offset \(k\), \(PE_{pos+k}\) can be expressed as a linear function of \(PE_{pos}\). This means attention can learn to recognize relative positions: "this token is \(k\) positions ahead" can be computed from the encodings alone.
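Both the formula and the linearity property can be checked in a few lines. The sketch below builds the sinusoidal table in NumPy (the function name and sizes are illustrative), then verifies that for one frequency pair, shifting the position by \(k\) is exactly a fixed 2×2 rotation of the \((\sin, \cos)\) pair.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); odd dims use cos."""
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)    # (max_len, d_model/2)
    pe = np.empty((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(128, 64)
print(pe.shape)                                          # (128, 64)

# Linearity check: for a fixed offset k, the (sin, cos) pair at pos+k
# is a rotation (by angle k*f) of the pair at pos, for each frequency f.
p, k = 10, 5
f = 1.0 / np.power(10000.0, 2 * 3 / 64)                  # frequency of pair i=3
R = np.array([[np.cos(k * f), np.sin(k * f)],
              [-np.sin(k * f), np.cos(k * f)]])
v_p = np.array([np.sin(p * f), np.cos(p * f)])
v_pk = np.array([np.sin((p + k) * f), np.cos((p + k) * f)])
print(np.allclose(R @ v_p, v_pk))                        # True
```

The rotation matrix depends only on the offset \(k\), not on \(pos\), which is exactly what lets attention recover relative position from the encodings alone.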
Learned Positional Embeddings
The alternative - used by BERT, GPT-2, GPT-3, and most modern language models - is to learn the positional embeddings directly.
Instead of computing a fixed pattern, maintain a learned embedding table: a matrix of shape \(\text{max\_length} \times d_{\text{model}}\), where row \(p\) contains the learned embedding for position \(p\).
| | Sinusoidal | Learned |
|---|---|---|
| Generalization beyond training length | Yes (formula is defined for any position) | No (no entry for unseen positions) |
| Flexibility | Fixed | Adapts to the task |
| Parameters | 0 | max_length × d_model |
Learned embeddings empirically often perform slightly better on fixed-length tasks; sinusoidal encodings generalize better to longer sequences.
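A learned positional table is just an embedding layer indexed by position. Here is a minimal PyTorch sketch of that idea (the class name and sizes are illustrative, not any particular model's implementation):

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """One trainable d_model-vector per position, up to max_length."""

    def __init__(self, max_length: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_length, d_model)

    def forward(self, token_emb):                # token_emb: (batch, seq, d_model)
        seq_len = token_emb.size(1)
        positions = torch.arange(seq_len, device=token_emb.device)
        return token_emb + self.pos_emb(positions)   # broadcasts over the batch

layer = LearnedPositionalEmbedding(max_length=512, d_model=64)
x = torch.randn(2, 10, 64)
print(layer(x).shape)                            # torch.Size([2, 10, 64])
```

Note the failure mode from the table above: `torch.arange(seq_len)` with `seq_len > max_length` would index past the table, which is exactly why learned embeddings do not extrapolate beyond the training length.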
Adding to Token Embeddings
Positional encodings are added to token embeddings, not concatenated:

$$x_i = e_i + p_i$$

where:

- \(x_i\): the combined representation for token \(i\), used as input to the first transformer layer
- \(e_i\): the token embedding for token \(i\)
- \(p_i\): the positional encoding for position \(i\)

Both the token embedding and the positional encoding have dimension \(d_{\text{model}}\). Adding them elementwise produces a vector that contains both content information (what word) and position information (where in the sequence). Concatenation would double the dimension to \(2\,d_{\text{model}}\) - adding keeps everything at \(d_{\text{model}}\).
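The dimensional bookkeeping is easy to see directly (shapes here are illustrative):

```python
import numpy as np

seq_len, d_model = 10, 64
tok = np.random.randn(seq_len, d_model)            # token embeddings e_i
pos = np.random.randn(seq_len, d_model)            # positional encodings p_i

added = tok + pos                                  # x_i = e_i + p_i, still d_model
concat = np.concatenate([tok, pos], axis=-1)       # would force 2 * d_model

print(added.shape, concat.shape)                   # (10, 64) (10, 128)
```

Keeping the width at \(d_{\text{model}}\) means every downstream weight matrix in the transformer stays the same size.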
Relative Positional Encodings
Absolute position (token 47) is often less useful than relative position (token 47 is 5 positions before token 52). Many recent models use relative positional encodings.
Relative encodings generally generalize better to longer sequences and have become the dominant approach in modern large language models.
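One common ingredient of relative schemes is a matrix of pairwise offsets \(j - i\), clipped to a maximum distance and shifted so it can index an embedding or bias table. The sketch below is in the spirit of clipped relative positions (as in Shaw et al.-style encodings); the function name and `max_distance` parameter are illustrative.

```python
import numpy as np

def relative_position_ids(seq_len, max_distance=8):
    """Pairwise offsets j - i, clipped to [-max_distance, max_distance],
    then shifted to non-negative indices for a table lookup."""
    pos = np.arange(seq_len)
    rel = pos[None, :] - pos[:, None]        # entry [i, j] = j - i
    rel = np.clip(rel, -max_distance, max_distance)
    return rel + max_distance                # indices in [0, 2 * max_distance]

ids = relative_position_ids(5, max_distance=2)
print(ids)
```

Because the table is indexed by offset rather than absolute position, the same entries apply at position 7 and position 7,000, which is one intuition for why relative encodings extrapolate better to longer sequences.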