Attention & Transformers
Lesson 1 ⏱ 10 min

The sequence problem

Video coming soon

The Sequence Problem - Why RNNs Fall Short

Visual walkthrough of RNN sequential bottleneck, the information bottleneck of a single hidden state, and why Transformers process all tokens in parallel.

⏱ ~6 min

🧮

Quick refresher

Vectors and dimensionality

A vector is an ordered list of numbers. A 512-dimensional hidden state is a list of 512 numbers that encodes a summary of the sequence so far.

Example

h = [0.3, -1.2, 0.7, ...] (512 numbers) compresses the entire history of a sequence.
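In plain Python, a hidden state is nothing more exotic than a fixed-length list of floats (a minimal sketch; the values here are placeholders, not learned weights):

```python
# A hidden state is an ordered, fixed-length list of numbers.
h = [0.3, -1.2, 0.7]     # a tiny 3-dimensional example
h_512 = [0.0] * 512      # a realistic 512-dimensional hidden state

print(len(h))      # 3
print(len(h_512))  # 512
```

However long the input sequence is, the summary it produces has the same fixed size - that fixed size is the bottleneck this lesson keeps returning to.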

Before transformers, working with language and sequential data meant wrestling with a fundamental tension: language has structure that depends on order and context, but the architectures available were bad at capturing long-range relationships. Understanding why previous approaches failed makes the transformer's design choices feel inevitable rather than arbitrary.

Transformers are the architecture behind GPT, BERT, LLaMA, and virtually every state-of-the-art language model in production today. Understanding what they were designed to fix — and why the alternatives failed — is the foundation for understanding modern AI.

Language Requires Context

Consider the word "bank." In "I walked to the bank by the river," it means a riverbank. In "The bank approved my loan application," it means a financial institution. Same word, completely different meaning - determined entirely by the surrounding context.

Or consider: "The trophy didn't fit in the suitcase because it was too big." What does "it" refer to? The trophy. You need to read the whole sentence and reason about which object is too big to fit. This is the kind of thing that makes language understanding genuinely hard.

Any architecture for language must handle:

  1. Variable-length inputs - sentences range from 3 words to 100+
  2. Sequential order - "dog bites man" is not the same as "man bites dog"
  3. Long-range dependencies - a word early in the sentence can be crucial to understanding a word near the end
  4. Context-dependent meaning - the same word means different things in different contexts

Simple feedforward networks fail immediately on points 1 and 3: they expect a fixed-size input and have no mechanism for relating distant positions.

Recurrent Neural Networks: A Partial Solution

RNNs process sequences one step at a time. At each position t, the model computes a new hidden state h_t:

h_t = f(h_{t-1}, x_t)

h_t: the hidden state at step t
h_{t-1}: the hidden state from the previous step
x_t: the input (e.g. a word embedding) at step t

The hidden state accumulates information as the sequence progresses. This elegantly handles variable-length inputs: just keep processing until you run out of tokens.
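The recurrence can be sketched in a few lines of Python. This is a toy tanh cell with random, untrained weights and tiny dimensions (real models use a learned cell and sizes like 512) - the point is only the shape of the computation:

```python
import math
import random

random.seed(0)

HIDDEN = 4  # toy hidden size (real models use e.g. 512)
EMBED = 3   # toy input embedding size

# Random weights for a single tanh RNN cell (tanh is a common choice of f).
W_h = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
W_x = [[random.uniform(-0.5, 0.5) for _ in range(EMBED)] for _ in range(HIDDEN)]

def rnn_step(h_prev, x_t):
    """One update h_t = f(h_{t-1}, x_t); here f is tanh(W_h h + W_x x)."""
    return [
        math.tanh(
            sum(W_h[i][j] * h_prev[j] for j in range(HIDDEN))
            + sum(W_x[i][j] * x_t[j] for j in range(EMBED))
        )
        for i in range(HIDDEN)
    ]

# Variable-length input: just keep stepping until the tokens run out.
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
h = [0.0] * HIDDEN
for x in tokens:        # each step needs the previous h - no parallelism
    h = rnn_step(h, x)

print(len(h))  # the whole sequence is summarized in one HIDDEN-dim vector
```

Notice that the loop body reads `h` from the previous iteration - that single line is both the elegance (any sequence length works) and, as the next sections show, the problem.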

RNNs worked - genuinely. They powered machine translation, speech recognition, and text generation for years. But they had deep structural problems.

Problem 1: Sequential Computation

The update h_t = f(h_{t-1}, x_t) requires h_{t-1} to be ready before you can compute h_t. This is an inherent sequential dependency: step 5 requires step 4, which requires step 3, which requires step 2.

You cannot parallelize over the sequence. On modern hardware (GPUs, TPUs) that excel at massive parallel computation, this is crippling. Processing a 1,000-word document requires 1,000 sequential steps regardless of how much parallel compute you have. Training time scales linearly with sequence length, and GPUs sit mostly idle.

Problem 2: The Information Bottleneck

All the information about everything the model has seen must be compressed into one hidden state. For typical RNNs, this might be a 512-dimensional vector.

Compress 1,000 words of rich context into 512 numbers, then use those 512 numbers to predict the next word. For short sentences, this works. For long documents, early information inevitably gets overwritten or diluted as the model processes more tokens.

Consider: "The country where the reporter who was chasing the politician had grown up celebrated its independence." The subject is "country" but it appears 15 tokens before the verb "celebrated." An RNN must carry "country" through 15 steps of processing while also tracking "reporter," "chasing," and "politician." Something gets lost.
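The fading of early tokens can be made concrete with a toy "leaky average" update, h = 0.9·h + 0.1·x (an illustration, not a trained RNN - real cells are nonlinear, but the geometric decay is the same qualitative story):

```python
# Toy illustration: with the update h = 0.9 * h + 0.1 * x, the first
# token's contribution to h shrinks by a factor of 0.9 at every
# subsequent step.
def first_token_weight(seq_len, decay=0.9, gain=0.1):
    """How much the very first input still contributes after seq_len steps."""
    return gain * decay ** (seq_len - 1)

for n in (1, 10, 100, 1000):
    print(n, first_token_weight(n))
```

By step 100 the first token's weight has dropped below one part in ten thousand - "country" has effectively vanished from the summary.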

Problem 3: Vanishing Gradients Through Time

Training an RNN uses backpropagation through time: the gradient is multiplied by another factor for every time step it travels back through the sequence. If these multiplications shrink the gradient, it vanishes long before reaching the beginning of the sequence.

LSTMs and GRUs were clever engineering solutions - they add gating mechanisms that selectively remember and forget. They improved things substantially, but the fundamental bottleneck of a single hidden state and sequential computation remained.
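The arithmetic behind vanishing (and its mirror image, exploding) gradients fits in a few lines - the per-step factors 0.9 and 1.1 are toy numbers chosen purely for illustration:

```python
# Backpropagation through time multiplies one gradient factor per step.
# Slightly below 1 per step -> the product vanishes; slightly above 1
# -> it explodes.
def gradient_after(steps, factor):
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

print(gradient_after(100, 0.9))  # ~2.7e-5 -> vanishes
print(gradient_after(100, 1.1))  # ~1.4e+4 -> explodes
```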

Interactive example

RNN information bottleneck - watch how early tokens fade as sequence length grows

Coming soon

The Transformer's Answer

In 2017, Vaswani et al. published "Attention Is All You Need" and proposed abandoning recurrence entirely.

The key insight: you don't need to process sequences step by step. Instead, process all tokens simultaneously. Let each token directly attend to every other token to gather context. No sequential bottleneck. No information bottleneck. No vanishing gradients through time.
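A stripped-down sketch shows the shape of the idea. Real transformers first project each token into queries, keys, and values; this simplified version attends over the raw vectors to keep the code short:

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Every token gathers context from every other token directly."""
    d = len(X[0])
    out = []
    for q in X:  # each output depends only on X, so all rows could
                 # be computed in parallel - no sequential chain
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in X]
        weights = softmax(scores)  # how much q attends to each token
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three 2-d token vectors
Y = self_attention(X)
print(len(Y), len(Y[0]))  # one context-mixed vector per token
```

Contrast this with the RNN loop: here no output row waits on another, which is exactly what lets GPUs process every position of a 1,000-word document at once.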

The mechanism that makes this work - attention - is what the next several lessons are about. But knowing why the transformer was designed as it was makes the mechanism much easier to understand.

The transformer changed everything. GPT, BERT, T5, LLaMA - every major language model today is built on transformer blocks. The architecture that started as an improvement to machine translation became the foundation for AI systems that can write code, explain concepts, and reason about complex problems.

Quiz

1 / 3

Why are recurrent neural networks (RNNs) slow to train on long sequences?