Every major language model today — GPT, LLaMA, Claude, Gemini — is built on transformer blocks, not RNNs. This shift happened rapidly between 2017 and 2020. Understanding why RNNs lost isn't just historical trivia; it reveals the design constraints that transformers were built to solve, and clarifies when RNNs are still the right choice.
There are three fundamental limitations of RNNs, each corresponding to a structural decision transformers made differently.
Limitation 1: Sequential Computation
The hidden state update is:

$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$

- $h_t$ — hidden state at step $t$
- $x_t$ — input at step $t$

Computing $h_t$ requires $h_{t-1}$ to be already computed. This creates a strict sequential dependency chain. For a sequence of length $T$, you must complete steps 1, 2, 3, ..., T in order — no step can start until the previous one finishes.
Modern GPUs have tens of thousands of compute units operating in parallel. A matrix multiply of size 512×512 is one GPU operation — thousands of multiplications happen simultaneously. But an RNN's sequential dependency means the GPU's parallelism is almost entirely wasted: at each step, you're doing one matrix multiply and then waiting.
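To make the dependency concrete, here is a minimal NumPy sketch of the recurrence; the sizes and weight initialization are illustrative, not values from this lesson:

```python
import numpy as np

T, d_in, d_h = 100, 64, 128            # illustrative sizes
x = np.random.randn(T, d_in)           # input sequence, one row per time step
W_h = 0.01 * np.random.randn(d_h, d_h)
W_x = 0.01 * np.random.randn(d_h, d_in)
b = np.zeros(d_h)

h = np.zeros(d_h)
for t in range(T):
    # step t cannot start until h from step t-1 exists: the loop is irreducibly sequential
    h = np.tanh(W_h @ h + W_x @ x[t] + b)
```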
The transformer's answer: process all positions simultaneously. The attention operation:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

- $Q$ — query matrix — all positions at once
- $K$ — key matrix — all positions at once
- $V$ — value matrix

The product $QK^\top$ computes interactions between all positions with a single matrix multiply — shape $T \times T$. All positions are processed in parallel. For a GPU with 10,000 parallel units, processing a 512-token sequence is nearly as fast as processing 1 token.
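A minimal NumPy sketch of single-head scaled dot-product attention, with illustrative shapes, shows the same computation with no loop over time steps:

```python
import numpy as np

T, d_k = 512, 64                        # illustrative sequence length and head dimension
Q = np.random.randn(T, d_k)
K = np.random.randn(T, d_k)
V = np.random.randn(T, d_k)

scores = Q @ K.T / np.sqrt(d_k)         # (T, T): every pair of positions in one matrix multiply
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
out = weights @ V                       # (T, d_k): all positions updated in parallel
```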
Limitation 2: The Information Bottleneck
As established in the seq2seq lesson, all information about the source sequence must pass through the final hidden state — a single fixed-size vector. For long sequences, this vector must compress information it cannot fully represent.
The transformer's answer: attention directly connects every decoder position to every encoder position. There is no bottleneck — the decoder can access any part of the source sequence with equal ease. The question "what did the encoder produce at position 7?" is answered by directly retrieving the position-7 key and value vectors.
Information path length between any two positions: always 1 step. In an RNN, the path between position 1 and position 100 goes through 99 hidden state updates, each of which may discard some information.
Limitation 3: Long-Range Dependencies
You've seen quantitatively how gradients vanish over T steps in vanilla RNNs. LSTMs and GRUs extend this range substantially — in practice to ~100-500 steps — but not indefinitely.
The root cause: information still flows through sequential multiplicative operations, even in LSTMs. The forget gate is close to 1 for good memory, but "close to 1" applied 500 times decays anyway: $0.99^{500} \approx 0.0066$ — more than a 99% reduction.
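You can check the decay directly; 0.99 here is an illustrative forget-gate value, not one fixed by the lesson:

```python
forget_gate = 0.99           # "close to 1"
steps = 500
print(forget_gate ** steps)  # ~0.0066: more than 99% of the original signal is gone
```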
The transformer's answer: attention creates direct connections between any two positions in one step, with no decay over distance. "The cat" in position 1 and "was hungry" in position 100 are connected by a single attention computation, not 99 multiplicative operations. The gradient path between any two positions has length 1 in a transformer.
The Cost: O(T²) Attention
Transformers solve RNN limitations, but introduce their own:
Memory: the attention matrix has shape $T \times T$. For $T = 512$: 262,144 values. For $T = 4096$: 16.8 million values per layer. For $T = 100{,}000$: 10 billion values — impossible to fit in GPU memory.
Compute: computing the attention matrix costs $O(T^2 \cdot d)$ operations, where $d$ is the model dimension. For RNNs, each step costs $O(d^2)$ and there are $T$ steps, giving $O(T \cdot d^2)$ total — linear in $T$. Transformers are quadratic.
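A quick back-of-the-envelope check of these numbers, taking $d = 512$ as an illustrative model dimension:

```python
d = 512                                          # illustrative model dimension
for T in (512, 4_096, 100_000):
    attention_entries = T * T                    # attention matrix size per layer
    attention_ops = T * T * d                    # transformer: quadratic in T
    rnn_ops = T * d * d                          # RNN: T steps of O(d^2) work each
    print(f"T={T:>7,}: {attention_entries:>14,} attention entries, "
          f"{attention_ops:.1e} attention ops vs {rnn_ops:.1e} RNN ops")
```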
For short sequences (T < 2048) on modern hardware, transformers win overwhelmingly. For very long sequences (T > 32,768), the quadratic cost becomes prohibitive.
When RNNs Still Make Sense
Despite transformer dominance, RNNs retain advantages in specific settings:
Streaming inference: a token arrives, you immediately update the hidden state and produce a response. O(1) per step, constant memory. A transformer must re-attend over the full context — O(T) per step, growing memory.
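A sketch of that per-token contrast, with illustrative shapes and hypothetical helper names (not a full decoder, just the cost structure of handling one incoming token):

```python
import numpy as np

d = 128                                    # illustrative hidden size

def rnn_stream_step(h, x, W_h, W_x):
    """One incoming token: O(1) work, and memory is just the fixed-size h."""
    return np.tanh(W_h @ h + W_x @ x)

def transformer_stream_step(cached_keys, cached_values, q):
    """One incoming token: re-attend over everything seen so far.

    cached_keys / cached_values hold one row per previous token, so both the
    work and the cache grow as O(T).
    """
    scores = cached_keys @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ cached_values
```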
On-device / edge inference: a 2-layer LSTM with 256 hidden units has ~500K parameters and runs at 1ms per step on a phone CPU. A small transformer with comparable quality has millions of parameters and requires substantially more compute.
Online learning: learning from a continuous data stream where the "batch" is individual examples arriving in sequence. RNNs handle this naturally. Transformers require attention over a context window that must fit in memory.
Long sequences with limited compute: for sequences of 50,000+ tokens where transformer attention is infeasible, recurrent architectures with their O(T) scaling may be the only practical option.
Summary: The Three Problems, Three Solutions
| Problem | RNN behavior | Transformer solution |
|---|---|---|
| Sequential compute | Must process T steps in series — GPU mostly idle | All positions computed in parallel |
| Information bottleneck | All source info compressed to one vector | Decoder directly attends to all encoder states |
| Long-range dependencies | Gradients decay exponentially with distance | Direct O(1)-path connections via attention |
| New cost | O(T) memory and compute | O(T²) memory and compute |
The transformer didn't simply improve on RNNs — it traded one set of tradeoffs for another. RNNs are O(T) time and memory with limited long-range modeling. Transformers are O(T²) time and memory with unlimited long-range modeling. For most NLP tasks at moderate sequence lengths, the transformer's tradeoff is overwhelmingly better. For streaming, edge, or very-long-sequence applications, the RNN's tradeoff sometimes wins.
Understanding both sides of this tradeoff is what lets you choose the right architecture for a new problem — rather than defaulting to whichever one is currently fashionable.