The LSTM solves the vanishing gradient problem but adds complexity: four weight matrices, two state vectors ($h_t$ and $c_t$), and several gates to interpret. Cho et al. introduced the GRU (Gated Recurrent Unit) in 2014 as a simpler alternative that achieves comparable performance with fewer moving parts.
The GRU is still widely used in production systems where lower latency or fewer parameters matter — on-device NLP, time series forecasting, and streaming audio processing often use GRUs over LSTMs precisely because they are faster to run and easier to tune.
## The Simplification Idea
The LSTM has separate forget and input gates that decide, independently, how much of the old state to erase and how much new information to add. The GRU observes that these two decisions are related: if you're mostly retaining old information, you're mostly not adding new information, and vice versa.
This suggests merging them into a single gate: the update gate $z_t$. When the update gate is near 0, keep the old state. When it's near 1, replace it with the new candidate. One gate does the work of two.
The LSTM's output gate (which filtered the cell state into the hidden state) is replaced by a reset gate $r_t$ that filters how much history enters the candidate computation.
## The GRU Equations
Reset gate — how much past hidden state to use when computing the candidate:

$$r_t = \sigma(W_r\,[h_{t-1}, x_t])$$

- $r_t$: reset gate vector, values in $(0, 1)$
- $W_r$: reset gate weight matrix
Update gate — how much to update the hidden state:

$$z_t = \sigma(W_z\,[h_{t-1}, x_t])$$

- $z_t$: update gate vector, values in $(0, 1)$
- $W_z$: update gate weight matrix
Candidate hidden state — new information to potentially write:

$$\tilde{h}_t = \tanh(W_h\,[r_t \odot h_{t-1},\, x_t])$$

- $\tilde{h}_t$: candidate hidden state
- $W_h$: candidate weight matrix
New hidden state — interpolation between old and candidate:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

- $h_t$: new hidden state after this step
That's the complete GRU — four equations and no separate cell state.
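To make the equations concrete, here is a minimal single-step GRU written directly from the formulas above. This is an illustrative NumPy sketch, not a library implementation; the sizes and random weights are arbitrary placeholders.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU step; each W acts on the concatenation [h_prev, x_t]."""
    concat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ concat)                    # reset gate
    z_t = sigmoid(W_z @ concat)                    # update gate
    concat_r = np.concatenate([r_t * h_prev, x_t])
    h_tilde = np.tanh(W_h @ concat_r)              # candidate
    return (1 - z_t) * h_prev + z_t * h_tilde      # interpolation

# Toy dimensions: hidden size 3, input size 2
rng = np.random.default_rng(0)
H, I = 3, 2
W_r, W_z, W_h = (rng.normal(scale=0.5, size=(H, H + I)) for _ in range(3))

h = np.zeros(H)
for t in range(4):                                 # run 4 steps
    h = gru_step(rng.normal(size=I), h, W_r, W_z, W_h)
print(h)
```

One caveat on conventions: some references (and PyTorch internally) swap the roles of $z_t$ and $1 - z_t$ in the interpolation; the two forms are equivalent up to relabeling the gate.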
## Interpreting the Gates
When $z_t \approx 0$ (update gate near zero): Here, $h_t \approx h_{t-1}$ — the hidden state barely changes. The network is saying "nothing important happened at this step; maintain memory."
When $z_t \approx 1$ (update gate near one): Here, $h_t \approx \tilde{h}_t$ — the hidden state is mostly replaced by the candidate. The network is saying "this input is important; update memory substantially."
When $r_t \approx 0$ (reset gate near zero): In the candidate computation, $r_t \odot h_{t-1} \approx 0$, so the past is ignored: $\tilde{h}_t \approx \tanh(W_h\,[0,\, x_t])$. The candidate is computed as if there's no prior memory — useful at the start of a new topic in language, or a new regime in time series.
When $r_t \approx 1$ (reset gate near one): Full history is available for computing the candidate: $\tilde{h}_t = \tanh(W_h\,[h_{t-1},\, x_t])$. The candidate can draw on everything the network has seen.
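A quick numeric sanity check of the two update-gate extremes, plugging made-up values into the interpolation formula:

```python
import numpy as np

h_prev = np.array([0.9, -0.5])   # existing memory
h_tilde = np.array([0.1, 0.7])   # new candidate

for z in (0.02, 0.98):           # update gate near 0, then near 1
    h_new = (1 - z) * h_prev + z * h_tilde
    print(f"z={z}: h_new={np.round(h_new, 3)}")
# z=0.02 -> h_new ≈ [ 0.884 -0.476]  (stays near h_prev: memory kept)
# z=0.98 -> h_new ≈ [ 0.116  0.676]  (jumps to h_tilde: memory replaced)
```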
## Worked Example: One Step of a 2-Unit GRU
Use hidden size 2 and a single scalar input $x_1$, starting from $h_0 = [0, 0]$.
For this example, suppose the gate pre-activations (the values inside $\sigma$) are:
- Reset: $a_r = [0.5, -0.5]$
- Update: $a_z = [1.0, -0.4]$
Reset gate: $r_1 = \sigma([0.5, -0.5]) = [0.62, 0.38]$

Update gate: $z_1 = \sigma([1.0, -0.4]) = [0.73, 0.40]$
Candidate (since $h_0 = [0, 0]$, the term $r_1 \odot h_0$ vanishes, so only $x_1$ contributes):

Suppose $\tilde{h}_1 = [0.8, -0.6]$.

New hidden state:

$$h_1 = (1 - z_1) \odot h_0 + z_1 \odot \tilde{h}_1 = z_1 \odot \tilde{h}_1 = [0.73 \cdot 0.8,\; 0.40 \cdot (-0.6)] = [0.58, -0.24]$$
The first component ($z = 0.73$) adopted mostly the candidate. The second component ($z = 0.40$) adopted a bit less than half of it. Both started from a zero hidden state, so the distinction is mainly about how much of the candidate to use.
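The same arithmetic in a few lines of NumPy, to verify the step (the pre-activations and candidate are the supposed values from above):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

h0 = np.array([0.0, 0.0])
r1 = sigmoid(np.array([0.5, -0.5]))   # [0.62, 0.38]
z1 = sigmoid(np.array([1.0, -0.4]))   # [0.73, 0.40]
h_tilde = np.array([0.8, -0.6])       # supposed candidate

h1 = (1 - z1) * h0 + z1 * h_tilde
print(np.round(h1, 2))                # [ 0.58 -0.24]
```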
## GRU vs LSTM: When to Use Which
| Criterion | GRU | LSTM |
|---|---|---|
| Parameters | ~25% fewer | More |
| Training speed | Faster | Slower |
| Memory (RAM) | Less (one state) | More (two states) |
| Performance: short sequences | Similar | Similar |
| Performance: long sequences | Often similar | Slight edge |
| Limited data | Preferred (less prone to overfitting) | May overfit |
| Large dataset | Similar | Similar |
The practical rule: start with a GRU for simplicity and speed. Switch to LSTM if performance is unsatisfactory on tasks requiring long-range memory.
```python
import torch
import torch.nn as nn

# GRU in PyTorch — same interface as LSTM but simpler state
gru = nn.GRU(input_size=50, hidden_size=128, batch_first=True)

x = torch.randn(32, 20, 50)    # batch of 32, 20 time steps, 50 features
h0 = torch.zeros(1, 32, 128)   # only one state tensor, not two

output, hn = gru(x, h0)
# output: [32, 20, 128] — hidden states at all steps
# hn:     [1, 32, 128]  — final hidden state
```
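To check the parameter claim from the table, one can count parameters directly; the sizes below are the same illustrative ones as in the snippet above:

```python
import torch.nn as nn

gru = nn.GRU(input_size=50, hidden_size=128, batch_first=True)
lstm = nn.LSTM(input_size=50, hidden_size=128, batch_first=True)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"GRU:  {count(gru):,}")                    # 3 gate blocks of weights
print(f"LSTM: {count(lstm):,}")                   # 4 gate blocks of weights
print(f"ratio: {count(gru) / count(lstm):.2f}")   # 0.75 -> ~25% fewer
```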
Both GRU and LSTM are strong sequence models for moderate sequence lengths. But they share a fundamental limitation: sequential computation. Processing a sequence of length $T$ requires $T$ serial steps, no matter how much parallel compute you have. The next two lessons cover architectures that sidestep this: seq2seq as the bridge, and transformers as the solution.