Attention is a beautiful mechanism, but it has a blind spot: it is completely indifferent to order. Shuffle the words in a sentence and the core computation proceeds exactly as before, because nothing in it depends on where a token sits. That is a problem, because word order is everything in language.
We need to tell the model where each token lives in the sequence.
Positional encoding is the solution that makes transformers work for language — without it, "the dog bit the man" and "the man bit the dog" would be indistinguishable. Every transformer-based model in production today, from BERT to GPT-4, relies on some form of positional encoding.
The Problem: Attention Is Permutation-Invariant
Let's be precise. The attention computation depends only on the content of the tokens via Q, K, V - not on their positions.
If you swap the token embeddings for "dog" and "man" in "the dog bit the man," getting "the man bit the dog," the attention mechanism computes the same set of relationships - just between different word contents. Without position information, it cannot know that "the dog" was the subject and "the man" the object.
This matters for:
- Subject vs. object roles ("the dog bit the man" vs "the man bit the dog")
- Modifier attachment ("I saw the man with a telescope" - who has the telescope?)
- Temporal reasoning ("she ate, then she slept" vs "she slept, then she ate")
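The invariance is easy to verify numerically. Below is a minimal NumPy sketch of single-head attention without positional information; the weight names (`Wq`, `Wk`, `Wv`) and sizes are illustrative, not taken from any particular model. Permuting the input tokens permutes the outputs in exactly the same way, so each token's representation is identical regardless of where it appears.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))                      # 4 token embeddings, no position info
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                # pairwise content similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

perm = [2, 0, 3, 1]                              # shuffle the sequence
out, out_perm = attention(X), attention(X[perm])

# Each token's output vector is unchanged -- only reordered.
print(np.allclose(out[perm], out_perm))          # True
```

The check passes because permuting rows of `X` permutes both the rows and columns of the score matrix consistently, which the softmax and the final matmul simply carry through.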
The fix: inject position information into the token representations before they enter the transformer.
Sinusoidal Positional Encodings
The original "Attention Is All You Need" paper proposed a fixed, formula-based encoding. For position \(pos\) and dimension index \(i\):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where:

- \(pos\): position of the token in the sequence (0-indexed)
- \(i\): dimension index within the encoding vector
- \(d_{\text{model}}\): total embedding dimension
- \(PE_{(pos,\,2i+1)}\): the encoding at odd dimension \(2i+1\)

Even-indexed dimensions use \(\sin\); odd-indexed use \(\cos\). The base 10000 gives a wide range of frequencies.
A helpful analogy is a binary counter. The number 5 in binary is 101: the least significant bit flips every step, the next bit every two steps, the next every four. Sinusoidal encoding is the continuous, smooth version of this idea, with each dimension oscillating at its own frequency.
A critical property: for any fixed offset \(k\), \(PE_{pos+k}\) can be expressed as a linear function of \(PE_{pos}\). This means attention can learn to recognize relative positions: "this token is \(k\) positions ahead" can be computed from the encodings alone.
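Both the formula and the linearity property can be checked in a few lines. The sketch below builds the sinusoidal table in NumPy (the function name and sizes are illustrative), then verifies that for one frequency pair, shifting the position by \(k\) is exactly a fixed 2×2 rotation of the \((\sin, \cos)\) pair.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); odd dims use cos."""
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)    # (max_len, d_model/2)
    pe = np.empty((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(128, 64)
print(pe.shape)                                          # (128, 64)

# Linearity check: for a fixed offset k, the (sin, cos) pair at pos+k
# is a rotation (by angle k*f) of the pair at pos, for each frequency f.
p, k = 10, 5
f = 1.0 / np.power(10000.0, 2 * 3 / 64)                  # frequency of pair i=3
R = np.array([[np.cos(k * f), np.sin(k * f)],
              [-np.sin(k * f), np.cos(k * f)]])
v_p = np.array([np.sin(p * f), np.cos(p * f)])
v_pk = np.array([np.sin((p + k) * f), np.cos((p + k) * f)])
print(np.allclose(R @ v_p, v_pk))                        # True
```

The rotation matrix depends only on the offset \(k\), not on \(pos\), which is exactly what lets attention recover relative position from the encodings alone.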
Learned Positional Embeddings
The alternative - used by BERT, GPT-2, GPT-3, and most modern language models - is to learn the positional embeddings directly.
Instead of computing a fixed pattern, maintain a learned embedding table: a matrix of shape \(\text{max\_length} \times d_{\text{model}}\), where row \(p\) contains the learned embedding for position \(p\).
| | Sinusoidal | Learned |
|---|---|---|
| Generalization beyond training length | Yes (formula is defined for any position) | No (no entry for unseen positions) |
| Flexibility | Fixed | Adapts to the task |
| Parameters | 0 | max_length × d_model |
Learned embeddings empirically often perform slightly better on fixed-length tasks; sinusoidal encodings generalize better to longer sequences.
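A learned positional table is just an embedding layer indexed by position. Here is a minimal PyTorch sketch of that idea (the class name and sizes are illustrative, not any particular model's implementation):

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """One trainable d_model-vector per position, up to max_length."""

    def __init__(self, max_length: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_length, d_model)

    def forward(self, token_emb):                # token_emb: (batch, seq, d_model)
        seq_len = token_emb.size(1)
        positions = torch.arange(seq_len, device=token_emb.device)
        return token_emb + self.pos_emb(positions)   # broadcasts over the batch

layer = LearnedPositionalEmbedding(max_length=512, d_model=64)
x = torch.randn(2, 10, 64)
print(layer(x).shape)                            # torch.Size([2, 10, 64])
```

Note the failure mode from the table above: `torch.arange(seq_len)` with `seq_len > max_length` would index past the table, which is exactly why learned embeddings do not extrapolate beyond the training length.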
Adding to Token Embeddings
Positional encodings are added to token embeddings, not concatenated:

$$x_i = e_i + p_i$$

where:

- \(x_i\): the combined representation for token \(i\), used as input to the first transformer layer
- \(e_i\): the token embedding for token \(i\)
- \(p_i\): the positional encoding for position \(i\)

Both the token embedding and the positional encoding have dimension \(d_{\text{model}}\). Adding them elementwise produces a vector that contains both content information (what word) and position information (where in the sequence). Concatenation would double the dimension to \(2\,d_{\text{model}}\) - adding keeps everything at \(d_{\text{model}}\).
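The dimensional bookkeeping is easy to see directly (shapes here are illustrative):

```python
import numpy as np

seq_len, d_model = 10, 64
tok = np.random.randn(seq_len, d_model)            # token embeddings e_i
pos = np.random.randn(seq_len, d_model)            # positional encodings p_i

added = tok + pos                                  # x_i = e_i + p_i, still d_model
concat = np.concatenate([tok, pos], axis=-1)       # would force 2 * d_model

print(added.shape, concat.shape)                   # (10, 64) (10, 128)
```

Keeping the width at \(d_{\text{model}}\) means every downstream weight matrix in the transformer stays the same size.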
Relative Positional Encodings
Absolute position (token 47) is often less useful than relative position (token 47 is 5 positions before token 52). Many recent models use relative positional encodings.
Relative encodings generally generalize better to longer sequences and have become the dominant approach in modern large language models.
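One common ingredient of relative schemes is a matrix of pairwise offsets \(j - i\), clipped to a maximum distance and shifted so it can index an embedding or bias table. The sketch below is in the spirit of clipped relative positions (as in Shaw et al.-style encodings); the function name and `max_distance` parameter are illustrative.

```python
import numpy as np

def relative_position_ids(seq_len, max_distance=8):
    """Pairwise offsets j - i, clipped to [-max_distance, max_distance],
    then shifted to non-negative indices for a table lookup."""
    pos = np.arange(seq_len)
    rel = pos[None, :] - pos[:, None]        # entry [i, j] = j - i
    rel = np.clip(rel, -max_distance, max_distance)
    return rel + max_distance                # indices in [0, 2 * max_distance]

ids = relative_position_ids(5, max_distance=2)
print(ids)
```

Because the table is indexed by offset rather than absolute position, the same entries apply at position 7 and position 7,000, which is one intuition for why relative encodings extrapolate better to longer sequences.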