Recurrent Networks
Lesson 8 ⏱ 12 min

Why transformers replaced RNNs


Why Transformers Replaced RNNs: Three Fundamental Limitations

Analyzes the three structural limitations of RNNs (sequential compute, information bottleneck, long-range dependencies), explains how transformers solve each one, and discusses where RNNs still hold advantages.

⏱ ~7 min

🧮 Quick refresher

Parallelization and sequential computation

An algorithm is sequential if each step depends on the previous step's result. An algorithm is parallelizable if steps can be computed simultaneously. GPUs are designed for massive parallelism — they can execute thousands of operations simultaneously. Sequential bottlenecks prevent GPU utilization.

Example

Computing the sum 1+2+3+...+100 sequentially takes 99 additions in series.

Computing the sum as (1+2+...+50) + (51+52+...+100) lets the two halves be summed at the same time, roughly 50 additions of serial work instead of 99, then 1 more to combine them. Splitting further (as in the sketch below) cuts the serial depth to about log2(100) ≈ 7 rounds on parallel hardware.
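A minimal Python sketch of that idea, summing 1 through 100 as a tree of pairwise additions; the numbers and pairing scheme are purely illustrative:

```python
# Tree reduction: every addition within a round is independent,
# so a GPU could run each round's additions in parallel.
# A purely sequential sum needs 99 dependent additions instead.
values = list(range(1, 101))

rounds = 0
while len(values) > 1:
    # Pair up neighbours; all additions in this round are independent.
    reduced = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
    if len(values) % 2 == 1:          # carry an unpaired element forward
        reduced.append(values[-1])
    values = reduced
    rounds += 1

print(values[0], rounds)              # 5050 after 7 parallel rounds
```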

Every major language model today — GPT, LLaMA, Claude, Gemini — is built on transformer blocks, not RNNs. This shift happened rapidly between 2017 and 2020. Understanding why RNNs lost isn't just historical trivia; it reveals the design constraints that transformers were built to solve, and clarifies when RNNs are still the right choice.

There are three fundamental limitations of RNNs, each corresponding to a structural decision transformers made differently.

Limitation 1: Sequential Computation

The hidden state update is:

h_t = f(h_{t-1}, x_t)

where h_t is the hidden state at step t and x_t is the input at step t.

Computing h_t requires h_{t-1} to be already computed. This creates a strict sequential dependency chain. For a sequence of length T, you must complete steps 1, 2, 3, ..., T in order — no step can start until the previous one finishes.

Modern GPUs have tens of thousands of compute units operating in parallel. A matrix multiply of size 512×512 is one GPU operation — thousands of multiplications happen simultaneously. But an RNN's sequential dependency means the GPU's parallelism is almost entirely wasted: at each step, you're doing one matrix multiply and then waiting.
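To make the dependency chain concrete, here is a minimal vanilla-RNN forward pass in NumPy; the shapes and random weights are illustrative, not from any particular model:

```python
import numpy as np

# The loop is the sequential bottleneck: step t cannot start until the
# hidden state from step t-1 exists, no matter how many GPU cores are idle.
T, d_in, d_h = 512, 64, 128
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d_in))          # input sequence
W_xh = rng.standard_normal((d_in, d_h)) * 0.01
W_hh = rng.standard_normal((d_h, d_h)) * 0.01

h = np.zeros(d_h)
for t in range(T):                          # T strictly serial steps
    h = np.tanh(x[t] @ W_xh + h @ W_hh)     # h_t = f(h_{t-1}, x_t)
```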

The transformer's answer: process all positions simultaneously. The attention operation:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V

where Q is the query matrix (all positions at once), K is the key matrix (all positions at once), and V is the value matrix.

This operation computes interactions between all T positions with a single matrix multiply QK^T of shape T \times T. All positions are processed in parallel. For a GPU with 10,000 parallel units, processing a 512-token sequence is nearly as fast as processing 1 token.
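A small NumPy sketch of scaled dot-product attention over a whole sequence at once; the sizes are arbitrary, and it omits masking, multiple heads, and the learned Q/K/V projections:

```python
import numpy as np

# Q, K, V each contain ALL T positions; QK^T is one T x T matrix product,
# so no position has to wait for any other.
T, d_k = 512, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_k))

scores = Q @ K.T / np.sqrt(d_k)                      # shape (T, T)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
out = weights @ V                                    # shape (T, d_k)
```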

Limitation 2: The Information Bottleneck

As established in the seq2seq lesson, all information about the source sequence must pass through the final hidden state h_T — a single fixed-size vector. For long sequences, this vector must compress more information than it can fully represent.

The transformer's answer: attention directly connects every decoder position to every encoder position. There is no bottleneck — the decoder can access any part of the source sequence with equal ease. The question "what did the encoder produce at position 7?" is answered by directly retrieving the position-7 key and value vectors.

Information path length between any two positions: always 1 step. In an RNN, the path between position 1 and position 100 goes through 99 hidden state updates, each of which may discard some information.

Limitation 3: Long-Range Dependencies

You've seen quantitatively how gradients vanish over T steps in vanilla RNNs. LSTMs and GRUs extend this range substantially — in practice to ~100-500 steps — but not indefinitely.

The root cause: information still flows through sequential multiplicative operations, even in LSTMs. The forget gate f_t is close to 1 for good memory, but "close to 1" applied 500 times gives 0.99^{500} \approx 0.007 — a reduction of more than 99%.
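A quick back-of-the-envelope check of that decay; 0.99 is just an example gate value:

```python
# How much of a signal survives N multiplications by 0.99?
for steps in (50, 100, 500):
    print(steps, round(0.99 ** steps, 4))
# 50 0.605
# 100 0.366
# 500 0.0066
```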

The transformer's answer: attention creates direct connections between any two positions in one step, with no decay over distance. "The cat" in position 1 and "was hungry" in position 100 are connected by a single attention computation, not 99 multiplicative operations. The gradient path between any two positions has length 1 in a transformer.

The Cost: O(T²) Attention

Transformers solve RNN limitations, but introduce their own:

Memory: the attention matrix QK^T has shape T \times T. For T = 512: 262,144 values. For T = 4096: 16.8 million values per layer. For T = 100,000: 10 billion values — impossible to fit in GPU memory.

Compute: computing the T \times T attention matrix costs O(T^2 \cdot d) operations, where d is the model dimension. For an RNN, each step costs O(n^2) (with n the hidden size) and there are T steps, giving O(T \cdot n^2) total — linear in T. Transformers are quadratic in sequence length.

For short sequences (T < 2048) on modern hardware, transformers win overwhelmingly. For very long sequences (T > 32,768), the quadratic cost becomes prohibitive.
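A rough calculation of the attention-matrix footprint, assuming one head, one layer, and 4 bytes per value (fp32); real implementations differ:

```python
# Back-of-the-envelope size of the T x T attention matrix.
for T in (512, 4096, 100_000):
    n_values = T * T
    gigabytes = n_values * 4 / 1e9        # fp32: 4 bytes per value
    print(f"T={T}: {n_values:,} values, ~{gigabytes:.2f} GB")
# T=512: 262,144 values, ~0.00 GB
# T=4096: 16,777,216 values, ~0.07 GB
# T=100000: 10,000,000,000 values, ~40.00 GB
```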

When RNNs Still Make Sense

Despite transformer dominance, RNNs retain advantages in specific settings:

Streaming inference: a token arrives, you immediately update the hidden state and produce a response. O(1) per step, constant memory. A transformer must re-attend over the full context — O(T) per step, growing memory (both are sketched in the code after this list).

On-device / edge inference: a 2-layer LSTM with 256 hidden units has ~500K parameters and runs at 1ms per step on a phone CPU. A small transformer with comparable quality has millions of parameters and requires substantially more compute.

Online learning: learning from a continuous data stream where the "batch" is individual examples arriving in sequence. RNNs handle this naturally. Transformers require attention over a context window that must fit in memory.

Long sequences with limited compute: for sequences of 50,000+ tokens where transformer attention is infeasible, recurrent architectures with their O(T) scaling may be the only practical option.
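To make the streaming contrast from the first point above concrete, here is an illustrative sketch of the per-token work in each case; the shapes and random weights are arbitrary, and the transformer side shows only the attention lookup over a cached context:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_k = 64, 256, 64
W_xh = rng.standard_normal((d_in, d_h)) * 0.01
W_hh = rng.standard_normal((d_h, d_h)) * 0.01

def rnn_step(h, x_t):
    """One streaming RNN update: fixed cost and memory per incoming token."""
    return np.tanh(x_t @ W_xh + h @ W_hh)

def attention_step(q_t, K_cache, V_cache):
    """One decoding step of attention: work grows with the cached length T."""
    scores = K_cache @ q_t / np.sqrt(d_k)    # O(T) dot products
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_cache                        # weighted sum over all T cached values

# After 10,000 streamed tokens the RNN still holds a single d_h vector,
# while the attention cache holds 10,000 keys and 10,000 values.
h = rnn_step(np.zeros(d_h), rng.standard_normal(d_in))
out = attention_step(rng.standard_normal(d_k),
                     rng.standard_normal((10_000, d_k)),
                     rng.standard_normal((10_000, d_k)))
```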

Summary: The Three Problems, Three Solutions

| Problem | RNN behavior | Transformer solution |
| --- | --- | --- |
| Sequential compute | Must process T steps in series — GPU mostly idle | All positions computed in parallel |
| Information bottleneck | All source info compressed to one vector | Decoder directly attends to all encoder states |
| Long-range dependencies | Gradients decay exponentially with distance | Direct O(1)-path connections via attention |
| New cost | — | O(T²) memory and compute |

The transformer didn't simply improve on RNNs — it traded one set of tradeoffs for another. RNNs are O(T) time and memory with limited long-range modeling. Transformers are O(T²) time and memory with unlimited long-range modeling. For most NLP tasks at moderate sequence lengths, the transformer's tradeoff is overwhelmingly better. For streaming, edge, or very-long-sequence applications, the RNN's tradeoff sometimes wins.

Understanding both sides of this tradeoff is what lets you choose the right architecture for a new problem — rather than defaulting to whichever one is currently fashionable.

Quiz


An RNN processing a 512-token sequence must take at least 512 serial steps. A transformer processing the same sequence...