Recurrent Networks
Lesson 8 ⏱ 12 min

Why transformers replaced RNNs


Why Transformers Replaced RNNs: Three Fundamental Limitations

Analyzes the three structural limitations of RNNs (sequential compute, information bottleneck, long-range dependencies), explains how transformers solve each one, and discusses where RNNs still hold advantages.

⏱ ~7 min

🧮 Quick refresher

Parallelization and sequential computation

An algorithm is sequential if each step depends on the previous step's result. An algorithm is parallelizable if steps can be computed simultaneously. GPUs are designed for massive parallelism — they can execute thousands of operations simultaneously. Sequential bottlenecks prevent GPU utilization.

Example

Computing the sum 1+2+3+...+100 sequentially takes 99 additions in series.

Computing the sum as (1+2+...+50) + (51+52+...+100) lets the two halves be summed at the same time, roughly 50 additions of serial work instead of 99, then 1 more to combine them. Splitting further (as in the sketch below) cuts the serial depth to about log2(100) ≈ 7 rounds on parallel hardware.
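A minimal Python sketch of that idea, summing 1 through 100 as a tree of pairwise additions; the numbers and pairing scheme are purely illustrative:

```python
# Tree reduction: every addition within a round is independent,
# so a GPU could run each round's additions in parallel.
# A purely sequential sum needs 99 dependent additions instead.
values = list(range(1, 101))

rounds = 0
while len(values) > 1:
    # Pair up neighbours; all additions in this round are independent.
    reduced = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
    if len(values) % 2 == 1:          # carry an unpaired element forward
        reduced.append(values[-1])
    values = reduced
    rounds += 1

print(values[0], rounds)              # 5050 after 7 parallel rounds
```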

Every major language model today — GPT, LLaMA, Claude, Gemini — is built on transformer blocks, not RNNs. This shift happened rapidly between 2017 and 2020. Understanding why RNNs lost isn't just historical trivia; it reveals the design constraints that transformers were built to solve, and clarifies when RNNs are still the right choice.

There are three fundamental limitations of RNNs, each corresponding to a structural decision transformers made differently.

Limitation 1: Sequential Computation

The hidden state update is:

h_t = f(h_{t-1}, x_t)

where h_t is the hidden state at step t and x_t is the input at step t.

Computing h_t requires h_{t-1} to be already computed. This creates a strict sequential dependency chain. For a sequence of length T, you must complete steps 1, 2, 3, ..., T in order — no step can start until the previous one finishes.

Modern GPUs have tens of thousands of compute units operating in parallel. A matrix multiply of size 512×512 is one GPU operation — thousands of multiplications happen simultaneously. But an RNN's sequential dependency means the GPU's parallelism is almost entirely wasted: at each step, you're doing one matrix multiply and then waiting.
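To make the dependency chain concrete, here is a minimal vanilla-RNN forward pass in NumPy; the shapes and random weights are illustrative, not from any particular model:

```python
import numpy as np

# The loop is the sequential bottleneck: step t cannot start until the
# hidden state from step t-1 exists, no matter how many GPU cores are idle.
T, d_in, d_h = 512, 64, 128
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d_in))          # input sequence
W_xh = rng.standard_normal((d_in, d_h)) * 0.01
W_hh = rng.standard_normal((d_h, d_h)) * 0.01

h = np.zeros(d_h)
for t in range(T):                          # T strictly serial steps
    h = np.tanh(x[t] @ W_xh + h @ W_hh)     # h_t = f(h_{t-1}, x_t)
```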

The transformer's answer: process all positions simultaneously. The attention operation:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V

where Q is the query matrix (all positions at once), K is the key matrix (all positions at once), and V is the value matrix.

This operation computes interactions between all T positions with a single matrix multiply QK^T of shape T \times T. All positions are processed in parallel. For a GPU with 10,000 parallel units, processing a 512-token sequence is nearly as fast as processing 1 token.
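A small NumPy sketch of scaled dot-product attention over a whole sequence at once; the sizes are arbitrary, and it omits masking, multiple heads, and the learned Q/K/V projections:

```python
import numpy as np

# Q, K, V each contain ALL T positions; QK^T is one T x T matrix product,
# so no position has to wait for any other.
T, d_k = 512, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_k))

scores = Q @ K.T / np.sqrt(d_k)                      # shape (T, T)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
out = weights @ V                                    # shape (T, d_k)
```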

Limitation 2: The Information Bottleneck

As established in the seq2seq lesson, all information about the source sequence must pass through the final hidden state h_T — a single fixed-size vector. For long sequences, this vector must compress more information than it can fully represent.

The transformer's answer: attention directly connects every decoder position to every encoder position. There is no bottleneck — the decoder can access any part of the source sequence with equal ease. The question "what did the encoder produce at position 7?" is answered by directly retrieving the position-7 key and value vectors.

Information path length between any two positions: always 1 step. In an RNN, the path between position 1 and position 100 goes through 99 hidden state updates, each of which may discard some information.

Limitation 3: Long-Range Dependencies

You've seen quantitatively how gradients vanish over T steps in vanilla RNNs. LSTMs and GRUs extend this range substantially — in practice to ~100-500 steps — but not indefinitely.

The root cause: information still flows through sequential multiplicative operations, even in LSTMs. The forget gate f_t is close to 1 for good memory, but "close to 1" applied 500 times gives 0.99^{500} \approx 0.007 — a reduction of more than 99%.
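A quick back-of-the-envelope check of that decay; 0.99 is just an example gate value:

```python
# How much of a signal survives N multiplications by 0.99?
for steps in (50, 100, 500):
    print(steps, round(0.99 ** steps, 4))
# 50 0.605
# 100 0.366
# 500 0.0066
```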

The transformer's answer: attention creates direct connections between any two positions in one step, with no decay over distance. "The cat" in position 1 and "was hungry" in position 100 are connected by a single attention computation, not 99 multiplicative operations. The gradient path between any two positions has length 1 in a transformer.

The Cost: O(T²) Attention

Transformers solve RNN limitations, but introduce their own:

Memory: the attention matrix QK^T has shape T \times T. For T = 512: 262,144 values. For T = 4096: 16.8 million values per layer. For T = 100,000: 10 billion values — impossible to fit in GPU memory.

Compute: computing the T \times T attention matrix costs O(T^2 \cdot d) operations, where d is the model dimension. For an RNN, each step costs O(n^2) (with n the hidden size) and there are T steps, giving O(T \cdot n^2) total — linear in T. Transformers are quadratic in sequence length.

For short sequences (T < 2048) on modern hardware, transformers win overwhelmingly. For very long sequences (T > 32,768), the quadratic cost becomes prohibitive.
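A rough calculation of the attention-matrix footprint, assuming one head, one layer, and 4 bytes per value (fp32); real implementations differ:

```python
# Back-of-the-envelope size of the T x T attention matrix.
for T in (512, 4096, 100_000):
    n_values = T * T
    gigabytes = n_values * 4 / 1e9        # fp32: 4 bytes per value
    print(f"T={T}: {n_values:,} values, ~{gigabytes:.2f} GB")
# T=512: 262,144 values, ~0.00 GB
# T=4096: 16,777,216 values, ~0.07 GB
# T=100000: 10,000,000,000 values, ~40.00 GB
```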

When RNNs Still Make Sense

Despite transformer dominance, RNNs retain advantages in specific settings:

Streaming inference: a token arrives, you immediately update the hidden state and produce a response. O(1) per step, constant memory. A transformer must re-attend over the full context — O(T) per step, growing memory (both are sketched in the code after this list).

On-device / edge inference: a 2-layer LSTM with 256 hidden units has ~500K parameters and runs at 1ms per step on a phone CPU. A small transformer with comparable quality has millions of parameters and requires substantially more compute.

Online learning: learning from a continuous data stream where the "batch" is individual examples arriving in sequence. RNNs handle this naturally. Transformers require attention over a context window that must fit in memory.

Long sequences with limited compute: for sequences of 50,000+ tokens where transformer attention is infeasible, recurrent architectures with their O(T) scaling may be the only practical option.
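To make the streaming contrast from the first point above concrete, here is an illustrative sketch of the per-token work in each case; the shapes and random weights are arbitrary, and the transformer side shows only the attention lookup over a cached context:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_k = 64, 256, 64
W_xh = rng.standard_normal((d_in, d_h)) * 0.01
W_hh = rng.standard_normal((d_h, d_h)) * 0.01

def rnn_step(h, x_t):
    """One streaming RNN update: fixed cost and memory per incoming token."""
    return np.tanh(x_t @ W_xh + h @ W_hh)

def attention_step(q_t, K_cache, V_cache):
    """One decoding step of attention: work grows with the cached length T."""
    scores = K_cache @ q_t / np.sqrt(d_k)    # O(T) dot products
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_cache                        # weighted sum over all T cached values

# After 10,000 streamed tokens the RNN still holds a single d_h vector,
# while the attention cache holds 10,000 keys and 10,000 values.
h = rnn_step(np.zeros(d_h), rng.standard_normal(d_in))
out = attention_step(rng.standard_normal(d_k),
                     rng.standard_normal((10_000, d_k)),
                     rng.standard_normal((10_000, d_k)))
```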

Summary: The Three Problems, Three Solutions

| Problem | RNN behavior | Transformer solution |
| --- | --- | --- |
| Sequential compute | Must process T steps in series — GPU mostly idle | All positions computed in parallel |
| Information bottleneck | All source info compressed to one vector | Decoder directly attends to all encoder states |
| Long-range dependencies | Gradients decay exponentially with distance | Direct O(1)-path connections via attention |
| New cost | — | O(T²) memory and compute |

The transformer didn't simply improve on RNNs — it traded one set of tradeoffs for another. RNNs are O(T) time and memory with limited long-range modeling. Transformers are O(T²) time and memory with unlimited long-range modeling. For most NLP tasks at moderate sequence lengths, the transformer's tradeoff is overwhelmingly better. For streaming, edge, or very-long-sequence applications, the RNN's tradeoff sometimes wins.

Understanding both sides of this tradeoff is what lets you choose the right architecture for a new problem — rather than defaulting to whichever one is currently fashionable.

Quiz


An RNN processing a 512-token sequence must take at least 512 serial steps. A transformer processing the same sequence...