Attention & Transformers
Lesson 5 ⏱ 12 min

Positional encoding

Video coming soon

Positional Encoding - Teaching the Transformer About Order

Visualizes the sinusoidal encoding pattern, shows how different frequencies create unique position fingerprints, and compares fixed vs learned positional embeddings.

⏱ ~7 min

🧮

Quick refresher

Sine and cosine waves

sin(x) and cos(x) oscillate between -1 and 1 with period 2π. Different frequencies mean different numbers of oscillations per unit distance: sin(2x) completes a full cycle twice as fast as sin(x).

Example

sin(pos/10000^0) oscillates rapidly.

sin(pos/10000^1) oscillates 10000x more slowly.

Each frequency creates a unique pattern for each position.
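As a quick numerical illustration (a NumPy sketch, values rounded), the fast frequency sweeps through much of its cycle within a handful of positions, while the slow one barely moves:

```python
import numpy as np

positions = np.arange(5)
fast = np.sin(positions / 10000 ** 0)   # argument grows by 1 per position
slow = np.sin(positions / 10000 ** 1)   # argument grows by 0.0001 per position

print(np.round(fast, 4))  # [ 0.      0.8415  0.9093  0.1411 -0.7568]
print(np.round(slow, 4))  # [0.     0.0001 0.0002 0.0003 0.0004]
```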

Attention is a beautiful mechanism, but it has a blind spot: it is completely indifferent to order. Shuffle the words in a sentence and the fundamental operation behaves the same; it does not care where in the sequence a token sits. That is a problem, because word order is everything in language.

We need to tell the model where each token lives in the sequence.

Positional encoding is the solution that makes transformers work for language — without it, "the dog bit the man" and "the man bit the dog" would be indistinguishable. Every transformer-based model in production today, from BERT to GPT-4, relies on some form of positional encoding.

The Problem: Attention Is Permutation-Invariant

Let's be precise. The attention computation $\text{Attention}(Q, K, V) = \text{softmax}(QK^\top / \sqrt{d_k})\, V$ depends only on the content of the tokens via Q, K, V - not on their positions.

If you swap the token embeddings for "dog" and "man" in "the dog bit the man," getting "the man bit the dog," the attention mechanism computes the same set of relationships - just between different word contents. Without position information, it cannot know that "the dog" was the subject and "the man" the object.

This matters for:

  • Subject vs. object roles ("the dog bit the man" vs "the man bit the dog")
  • Modifier attachment ("I saw the man with a telescope" - who has the telescope?)
  • Temporal reasoning ("she ate, then she slept" vs "she slept, then she ate")
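To see the invariance concretely, here is a minimal NumPy sketch. It uses identity Q/K/V projections as a simplifying assumption (real transformers learn these projections): permuting the input tokens just permutes the output rows the same way, so no output row carries any information about where its token sits.

```python
import numpy as np

def attention(X, d_k):
    # Toy self-attention with identity Q/K/V projections: Q = K = V = X.
    scores = X @ X.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))      # 5 tokens, d_model = 8
perm = rng.permutation(5)        # shuffle the token order

out = attention(X, d_k=8)
out_shuffled = attention(X[perm], d_k=8)

print(np.allclose(out[perm], out_shuffled))  # True: only the row order changed
```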

The fix: inject position information into the token representations before they enter the transformer.

Interactive example

Permutation test - shuffle tokens and watch how attention scores change with and without positional encodings

Coming soon

Sinusoidal Positional Encodings

The original "Attention Is All You Need" paper proposed a fixed, formula-based encoding. For a token at position $\text{pos}$ and dimension index $i$:

$$\text{PE}(\text{pos},\, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)$$

$$\text{PE}(\text{pos},\, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)$$

where:

  • $\text{pos}$: the position of the token in the sequence (0-indexed)
  • $i$: the dimension index within the encoding vector
  • $d_{\text{model}}$: the total embedding dimension
  • $\text{PE}(\text{pos}, 2i)$ and $\text{PE}(\text{pos}, 2i+1)$: the encoding values at even dimension $2i$ and odd dimension $2i+1$

Even-indexed dimensions use sine; odd-indexed dimensions use cosine. The base 10000 gives a wide range of frequencies, with wavelengths ranging from $2\pi$ up to $10000 \cdot 2\pi$ positions.
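As a minimal sketch of these formulas (NumPy; the function name and shapes are my own choices, not from the paper), the full encoding matrix can be built in a few lines:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix: row p is the encoding for position p."""
    positions = np.arange(max_len)[:, None]              # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # even dimension indices 2i
    angles = positions / (10000 ** (dims / d_model))      # pos / 10000^(2i / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```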

A useful analogy is a binary counter. The number 5 in binary is 101: the least significant bit flips every step, the next bit every two steps, the most significant bit every four. Sinusoidal encoding is the continuous, smooth version of this idea.

A critical property: for any fixed offset $k$, $\text{PE}(\text{pos} + k)$ can be expressed as a linear function of $\text{PE}(\text{pos})$. This means attention can learn to recognize relative positions: "this token is k positions ahead" can be computed from the encodings alone.
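This is easy to verify for a single sine/cosine dimension pair. The sketch below (an illustration, not taken from the paper) builds a 2×2 matrix that depends only on the offset k and maps the encoding pair at position pos to the pair at pos + k, for any pos:

```python
import numpy as np

d_model, i, k = 16, 3, 5
omega = 1.0 / (10000 ** (2 * i / d_model))   # frequency of dimension pair (2i, 2i+1)

def pe_pair(pos):
    # The (sin, cos) pair for this frequency at a given position.
    return np.array([np.sin(omega * pos), np.cos(omega * pos)])

# Linear map that depends only on the offset k, not on pos.
M = np.array([[ np.cos(omega * k), np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

for pos in (0, 7, 42):
    assert np.allclose(M @ pe_pair(pos), pe_pair(pos + k))
print("PE(pos + k) is a linear function of PE(pos) for every pos")
```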

Learned Positional Embeddings

The alternative - used by BERT, GPT-2, GPT-3, and many other models - is to learn the positional embeddings directly.

Instead of computing a fixed pattern, maintain a learned embedding table: a matrix of shape $\text{max\_length} \times d_{\text{model}}$, where row $p$ contains the learned embedding for position $p$.
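A minimal PyTorch sketch of this idea (the class name and interface are illustrative, not any particular model's implementation): the table is an `nn.Embedding` indexed by position and trained by backpropagation like any other weight.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, max_length, d_model):
        super().__init__()
        # One trainable row per position, shape (max_length, d_model).
        self.pos_emb = nn.Embedding(max_length, d_model)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, d_model)
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)
        return token_embeddings + self.pos_emb(positions)  # broadcast over batch
```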

Sinusoidal vs. learned, at a glance:

  • Generalization beyond training length: sinusoidal yes (the formula is defined for any pos); learned no (there is no entry for unseen positions)
  • Flexibility: sinusoidal is fixed; learned adapts to the task
  • Parameters: sinusoidal adds 0; learned adds max_length × d_model

Learned embeddings empirically often perform slightly better on fixed-length tasks; sinusoidal encodings generalize better to longer sequences.

Adding to Token Embeddings

Positional encodings are added to token embeddings, not concatenated:

$$\mathbf{x}_i = \mathbf{e}_i + \text{PE}(i)$$

where:

  • $\mathbf{x}_i$: the combined representation for token $i$, used as input to the first transformer layer
  • $\mathbf{e}_i$: the token embedding for token $i$
  • $\text{PE}(i)$: the positional encoding for position $i$

Both the token embedding and the positional encoding have dimension $d_{\text{model}}$. Adding them elementwise produces a vector that contains both content information (what word) and position information (where in the sequence). Concatenation would double the dimension to $2 \times d_{\text{model}}$; adding keeps everything at $d_{\text{model}}$.
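A small end-to-end sketch of this addition (NumPy; the vocabulary size, token ids, and random embedding table are made up purely for illustration):

```python
import numpy as np

d_model, seq_len, vocab_size = 16, 4, 1000
rng = np.random.default_rng(0)

embedding_table = rng.normal(size=(vocab_size, d_model))   # toy token embeddings
token_ids = np.array([12, 7, 431, 7])                      # made-up ids for a 4-token input

# Sinusoidal encodings for positions 0 .. seq_len-1 (same formula as above).
pos = np.arange(seq_len)[:, None]
rates = 1.0 / (10000 ** (np.arange(0, d_model, 2)[None, :] / d_model))
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(pos * rates)
pe[:, 1::2] = np.cos(pos * rates)

x = embedding_table[token_ids] + pe   # elementwise add: still (seq_len, d_model)
print(x.shape)                        # (4, 16)
```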

Relative Positional Encodings

Absolute position (token 47) is often less useful than relative position (token 47 is 5 positions before token 52). Many recent models use relative positional encodings.

Relative encodings generally generalize better to longer sequences and have become the dominant approach in modern large language models.

Quiz

1 / 3

Why do Transformers need positional encodings?