Recurrent Networks
Lesson 6 ⏱ 14 min

GRUs: a simpler gating mechanism


GRUs: Simplifying the LSTM to Two Gates

Derives the GRU equations by merging the LSTM's forget and input gates into an update gate, explains how the reset gate differs from the output gate, and compares GRU vs LSTM performance tradeoffs.

⏱ ~6 min

🧮 Quick refresher

Interpolation between two values

Linear interpolation between a and b with weight z: result = (1-z)·a + z·b. When z=0, result=a. When z=1, result=b. When z=0.5, result is the midpoint. This is how the GRU update gate blends old and new state.

Example

h_prev = [1, 3], h_candidate = [5, 1], z = [0.2, 0.8].

New h = [(1-0.2)×1 + 0.2×5, (1-0.8)×3 + 0.8×1] = [0.8+1.0, 0.6+0.8] = [1.8, 1.4].

First component mostly keeps old; second mostly adopts new.
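The same blend in a few lines of plain Python, using the values from the example above (just an illustration of the arithmetic):

# Elementwise linear interpolation: (1-z)·old + z·new
h_prev = [1.0, 3.0]
h_cand = [5.0, 1.0]
z      = [0.2, 0.8]

h_new = [(1 - zi) * a + zi * b for zi, a, b in zip(z, h_prev, h_cand)]
print(h_new)   # [1.8, 1.4], up to floating-point rounding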

The LSTM solves the vanishing gradient problem but adds complexity: four weight matrices, two state vectors (h_t and c_t), and several gates to interpret. Cho et al. introduced the GRU (gated recurrent unit) in 2014 as a simpler alternative that achieves comparable performance with fewer moving parts.

The GRU is still widely used in production systems where lower latency or fewer parameters matter — on-device NLP, time series forecasting, and streaming audio processing often use GRUs over LSTMs precisely because they are faster to run and easier to tune.

The Simplification Idea

The LSTM has separate forget and input gates that decide, independently, how much of the old state to erase and how much new information to add. The GRU observes that these two decisions are related: if you're mostly retaining old information, you're mostly not adding new information, and vice versa.

This suggests merging them into a single gate: the update gate. When the update gate is near 0, keep the old state. When it's near 1, replace it with the new candidate. One gate does the work of two.

The LSTM's output gate (which filtered the cell state into the hidden state) is replaced by a reset gate that filters how much history enters the candidate computation.

The GRU Equations

Reset gate — how much past hidden state to use when computing the candidate:

r_t = σ(W_r · [h_{t-1}; x_t])

r_t: reset gate vector, values in (0,1)
W_r: reset gate weight matrix

Update gate — how much to update the hidden state:

z_t = σ(W_z · [h_{t-1}; x_t])

z_t: update gate vector, values in (0,1)
W_z: update gate weight matrix

Candidate hidden state — new information to potentially write:

h̃_t = tanh(W · [r_t ⊙ h_{t-1}; x_t])

h̃_t: candidate hidden state
W: candidate weight matrix

New hidden state — interpolation between old and candidate:

h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

h_t: new hidden state after this step

That's the complete GRU — three equations and no separate cell state.
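To make the three equations concrete, here is a minimal single-step sketch in PyTorch. The function name gru_step and the explicit weight matrices are illustrative only, and bias terms are omitted for brevity; in practice you would use nn.GRU, shown later in this lesson.

import torch

def gru_step(x_t, h_prev, W_r, W_z, W):
    # Concatenate previous hidden state and current input: [h_{t-1}; x_t]
    hx = torch.cat([h_prev, x_t])
    r_t = torch.sigmoid(W_r @ hx)                              # reset gate
    z_t = torch.sigmoid(W_z @ hx)                              # update gate
    h_cand = torch.tanh(W @ torch.cat([r_t * h_prev, x_t]))    # candidate state
    return (1 - z_t) * h_prev + z_t * h_cand                   # interpolate old vs. candidate

# Example call: hidden size 2, input size 1, random (illustrative) weights
h1 = gru_step(torch.tensor([2.0]), torch.zeros(2),
              torch.randn(2, 3), torch.randn(2, 3), torch.randn(2, 3))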

Interpreting the Gates

When z_t ≈ 0 (update gate near zero): Here, h_t ≈ h_{t-1} — the hidden state barely changes. The network is saying "nothing important happened at this step; maintain memory."

When z_t ≈ 1 (update gate near one): Here, h_t ≈ h̃_t — the hidden state is mostly replaced by the candidate. The network is saying "this input is important; update memory substantially."

When r_t ≈ 0 (reset gate near zero): In the candidate computation, r_t ⊙ h_{t-1} ≈ 0, so the past is ignored: h̃_t = tanh(W · [0; x_t]). The candidate is computed as if there's no prior memory — useful at the start of a new topic in language, or a new regime in time series.

When r_t ≈ 1 (reset gate near one): Full history is available for computing the candidate: h̃_t = tanh(W · [h_{t-1}; x_t]). The candidate can draw on everything the network has seen.
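A quick numeric check of the two update-gate extremes (hypothetical state values, just to show the interpolation behavior):

import torch

h_prev = torch.tensor([0.5, -0.2])
h_cand = torch.tensor([0.9, 0.3])

z = torch.zeros(2)                       # z_t ≈ 0: keep the old state
print((1 - z) * h_prev + z * h_cand)     # equals h_prev

z = torch.ones(2)                        # z_t ≈ 1: adopt the candidate
print((1 - z) * h_prev + z * h_cand)     # equals h_cand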

Worked Example: 2-Step GRU

Use hidden size n = 2 and input size d = 1, start with h_0 = [0, 0]^T, input x_1 = 2.0.

For this example, suppose the gate pre-activations are:

  • Reset: W_r · [h_0; x_1] = [-0.3, 1.1]
  • Update: W_z · [h_0; x_1] = [0.5, -0.4]

Reset gate:

r_1 = σ([-0.3, 1.1]) = [0.426, 0.750]

Update gate:

z_1 = σ([0.5, -0.4]) = [0.622, 0.401]

Candidate (r_1 ⊙ h_0 = [0, 0] since h_0 = 0, so only x_1 contributes):

Suppose W · [0; 2.0] = [1.4, -0.6]:

h̃_1 = tanh([1.4, -0.6]) = [0.885, -0.537]

New hidden state:

h_1 = (1 - [0.622, 0.401]) ⊙ [0, 0] + [0.622, 0.401] ⊙ [0.885, -0.537]
h_1 = [0, 0] + [0.550, -0.215] = [0.550, -0.215]

The first component (z_{1,1} = 0.622) mostly adopted the candidate. The second component (z_{1,2} = 0.401) adopted a bit less than half of it. Both started from a zero hidden state, so the distinction is mainly about how much of the candidate to use.
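The arithmetic can be verified directly; the pre-activation values below are the ones assumed in the example:

import torch

h0 = torch.zeros(2)
r1 = torch.sigmoid(torch.tensor([-0.3, 1.1]))    # [0.426, 0.750] (not needed here since h0 = 0)
z1 = torch.sigmoid(torch.tensor([0.5, -0.4]))    # [0.622, 0.401]
h1_cand = torch.tanh(torch.tensor([1.4, -0.6]))  # [0.885, -0.537]

h1 = (1 - z1) * h0 + z1 * h1_cand
print(h1)   # ≈ [0.551, -0.216], matching the worked values up to rounding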

GRU vs LSTM: When to Use Which

Criterion | GRU | LSTM
Parameters | ~25% fewer | More
Training speed | Faster | Slower
Memory (RAM) | Less (one state) | More (two states)
Performance: short sequences | Similar | Similar
Performance: long sequences | Often similar | Slight edge
Limited data | Preferred (less overfitting) | May overfit
Large dataset | Similar | Similar

The practical rule: start with a GRU for simplicity and speed. Switch to LSTM if performance is unsatisfactory on tasks requiring long-range memory.

# GRU in PyTorch — same interface as LSTM but simpler state
import torch
import torch.nn as nn

gru = nn.GRU(input_size=50, hidden_size=128, batch_first=True)
x = torch.randn(32, 20, 50)    # batch of 32 sequences, 20 steps, 50 features
h0 = torch.zeros(1, 32, 128)   # only one state vector, not two

output, hn = gru(x, h0)
# output: [32, 20, 128] — hidden states at all steps
# hn:     [1, 32, 128]  — final hidden state
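To see the parameter savings from the comparison table, you can count parameters directly. This is just a quick sketch; the exact ratio depends on layer sizes and bias terms.

import torch.nn as nn

gru  = nn.GRU(input_size=50, hidden_size=128, batch_first=True)
lstm = nn.LSTM(input_size=50, hidden_size=128, batch_first=True)

n_gru  = sum(p.numel() for p in gru.parameters())
n_lstm = sum(p.numel() for p in lstm.parameters())
print(n_gru, n_lstm, round(n_gru / n_lstm, 2))   # GRU has roughly 3/4 the parameters of the LSTM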

Both GRU and LSTM are strong sequence models for moderate sequence lengths. But they share a fundamental limitation: sequential computation. Processing a sequence of length T requires T serial steps, no matter how much parallel compute you have. The next two lessons cover architectures that sidestep this: seq2seq as the bridge, and transformers as the solution.
