Recurrent Networks
Lesson 6 ⏱ 14 min

GRUs: a simpler gating mechanism


GRUs: Simplifying the LSTM to Two Gates

Derives the GRU equations by merging the LSTM's forget and input gates into an update gate, explains how the reset gate differs from the output gate, and compares GRU vs LSTM performance tradeoffs.

⏱ ~6 min

🧮 Quick refresher

Interpolation between two values

Linear interpolation between a and b with weight z: result = (1-z)·a + z·b. When z=0, result=a. When z=1, result=b. When z=0.5, result is the midpoint. This is how the GRU update gate blends old and new state.

Example

h_prev = [1, 3], h_candidate = [5, 1], z = [0.2, 0.8].

New h = [(1-0.2)×1 + 0.2×5, (1-0.8)×3 + 0.8×1] = [0.8+1.0, 0.6+0.8] = [1.8, 1.4].

First component mostly keeps old; second mostly adopts new.
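The same blend in a few lines of plain Python, using the values from the example above (just an illustration of the arithmetic):

# Elementwise linear interpolation: (1-z)·old + z·new
h_prev = [1.0, 3.0]
h_cand = [5.0, 1.0]
z      = [0.2, 0.8]

h_new = [(1 - zi) * a + zi * b for zi, a, b in zip(z, h_prev, h_cand)]
print(h_new)   # [1.8, 1.4], up to floating-point rounding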

The LSTM solves the vanishing gradient problem but adds complexity: four weight matrices, two state vectors (h_t and c_t), and several gates to interpret. Cho et al. introduced the GRU (gated recurrent unit) in 2014 as a simpler alternative that achieves comparable performance with fewer moving parts.

The GRU is still widely used in production systems where lower latency or fewer parameters matter — on-device NLP, time series forecasting, and streaming audio processing often use GRUs over LSTMs precisely because they are faster to run and easier to tune.

The Simplification Idea

The LSTM has separate forget and input gates that decide, independently, how much of the old state to erase and how much new information to add. The GRU observes that these two decisions are related: if you're mostly retaining old information, you're mostly not adding new information, and vice versa.

This suggests merging them into a single gate: the update gate. When the update gate is near 0, keep the old state. When it's near 1, replace it with the new candidate. One gate does the work of two.

The LSTM's output gate (which filtered the cell state into the hidden state) is replaced by a reset gate that filters how much history enters the candidate computation.

The GRU Equations

Reset gate — how much past hidden state to use when computing the candidate:

r_t = σ(W_r · [h_{t-1}; x_t])

r_t: reset gate vector, values in (0,1)
W_r: reset gate weight matrix

Update gate — how much to update the hidden state:

z_t = σ(W_z · [h_{t-1}; x_t])

z_t: update gate vector, values in (0,1)
W_z: update gate weight matrix

Candidate hidden state — new information to potentially write:

h̃_t = tanh(W · [r_t ⊙ h_{t-1}; x_t])

h̃_t: candidate hidden state
W: candidate weight matrix

New hidden state — interpolation between old and candidate:

h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

h_t: new hidden state after this step

That's the complete GRU — three equations and no separate cell state.
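To make the three equations concrete, here is a minimal single-step sketch in PyTorch. The function name gru_step and the explicit weight matrices are illustrative only, and bias terms are omitted for brevity; in practice you would use nn.GRU, shown later in this lesson.

import torch

def gru_step(x_t, h_prev, W_r, W_z, W):
    # Concatenate previous hidden state and current input: [h_{t-1}; x_t]
    hx = torch.cat([h_prev, x_t])
    r_t = torch.sigmoid(W_r @ hx)                              # reset gate
    z_t = torch.sigmoid(W_z @ hx)                              # update gate
    h_cand = torch.tanh(W @ torch.cat([r_t * h_prev, x_t]))    # candidate state
    return (1 - z_t) * h_prev + z_t * h_cand                   # interpolate old vs. candidate

# Example call: hidden size 2, input size 1, random (illustrative) weights
h1 = gru_step(torch.tensor([2.0]), torch.zeros(2),
              torch.randn(2, 3), torch.randn(2, 3), torch.randn(2, 3))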

Interpreting the Gates

When z_t ≈ 0 (update gate near zero): Here, h_t ≈ h_{t-1} — the hidden state barely changes. The network is saying "nothing important happened at this step; maintain memory."

When z_t ≈ 1 (update gate near one): Here, h_t ≈ h̃_t — the hidden state is mostly replaced by the candidate. The network is saying "this input is important; update memory substantially."

When r_t ≈ 0 (reset gate near zero): In the candidate computation, r_t ⊙ h_{t-1} ≈ 0, so the past is ignored: h̃_t = tanh(W · [0; x_t]). The candidate is computed as if there's no prior memory — useful at the start of a new topic in language, or a new regime in time series.

When r_t ≈ 1 (reset gate near one): Full history is available for computing the candidate: h̃_t = tanh(W · [h_{t-1}; x_t]). The candidate can draw on everything the network has seen.
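A quick numeric check of the two update-gate extremes (hypothetical state values, just to show the interpolation behavior):

import torch

h_prev = torch.tensor([0.5, -0.2])
h_cand = torch.tensor([0.9, 0.3])

z = torch.zeros(2)                       # z_t ≈ 0: keep the old state
print((1 - z) * h_prev + z * h_cand)     # equals h_prev

z = torch.ones(2)                        # z_t ≈ 1: adopt the candidate
print((1 - z) * h_prev + z * h_cand)     # equals h_cand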

Worked Example: 2-Step GRU

Use hidden size n = 2 and input size d = 1, start with h_0 = [0, 0]^T, input x_1 = 2.0.

For this example, suppose the gate pre-activations are:

  • Reset: W_r · [h_0; x_1] = [-0.3, 1.1]
  • Update: W_z · [h_0; x_1] = [0.5, -0.4]

Reset gate:

r_1 = σ([-0.3, 1.1]) = [0.426, 0.750]

Update gate:

z_1 = σ([0.5, -0.4]) = [0.622, 0.401]

Candidate (r_1 ⊙ h_0 = [0, 0] since h_0 = 0, so only x_1 contributes):

Suppose W · [0; 2.0] = [1.4, -0.6]:

h̃_1 = tanh([1.4, -0.6]) = [0.885, -0.537]

New hidden state:

h_1 = (1 - [0.622, 0.401]) ⊙ [0, 0] + [0.622, 0.401] ⊙ [0.885, -0.537]
h_1 = [0, 0] + [0.550, -0.215] = [0.550, -0.215]

The first component (z_{1,1} = 0.622) mostly adopted the candidate. The second component (z_{1,2} = 0.401) adopted a bit less than half of it. Both started from a zero hidden state, so the distinction is mainly about how much of the candidate to use.
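The arithmetic can be verified directly; the pre-activation values below are the ones assumed in the example:

import torch

h0 = torch.zeros(2)
r1 = torch.sigmoid(torch.tensor([-0.3, 1.1]))    # [0.426, 0.750] (not needed here since h0 = 0)
z1 = torch.sigmoid(torch.tensor([0.5, -0.4]))    # [0.622, 0.401]
h1_cand = torch.tanh(torch.tensor([1.4, -0.6]))  # [0.885, -0.537]

h1 = (1 - z1) * h0 + z1 * h1_cand
print(h1)   # ≈ [0.551, -0.216], matching the worked values up to rounding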

GRU vs LSTM: When to Use Which

Criterion | GRU | LSTM
Parameters | ~25% fewer | More
Training speed | Faster | Slower
Memory (RAM) | Less (one state) | More (two states)
Performance: short sequences | Similar | Similar
Performance: long sequences | Often similar | Slight edge
Limited data | Preferred (less overfitting) | May overfit
Large dataset | Similar | Similar

The practical rule: start with a GRU for simplicity and speed. Switch to LSTM if performance is unsatisfactory on tasks requiring long-range memory.

# GRU in PyTorch — same interface as LSTM but simpler state
import torch
import torch.nn as nn

gru = nn.GRU(input_size=50, hidden_size=128, batch_first=True)
x = torch.randn(32, 20, 50)    # batch of 32 sequences, 20 steps, 50 features
h0 = torch.zeros(1, 32, 128)   # only one state vector, not two

output, hn = gru(x, h0)
# output: [32, 20, 128] — hidden states at all steps
# hn:     [1, 32, 128]  — final hidden state
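To see the parameter savings from the comparison table, you can count parameters directly. This is just a quick sketch; the exact ratio depends on layer sizes and bias terms.

import torch.nn as nn

gru  = nn.GRU(input_size=50, hidden_size=128, batch_first=True)
lstm = nn.LSTM(input_size=50, hidden_size=128, batch_first=True)

n_gru  = sum(p.numel() for p in gru.parameters())
n_lstm = sum(p.numel() for p in lstm.parameters())
print(n_gru, n_lstm, round(n_gru / n_lstm, 2))   # GRU has roughly 3/4 the parameters of the LSTM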

Both GRU and LSTM are strong sequence models for moderate sequence lengths. But they share a fundamental limitation: sequential computation. Processing a sequence of length T requires T serial steps, no matter how much parallel compute you have. The next two lessons cover architectures that sidestep this: seq2seq as the bridge, and transformers as the solution.
