The previous lesson defined the abstract hidden state update $h_t = f(h_{t-1}, x_t)$. Now we make this concrete: what exactly is the function $f$ in a vanilla RNN? How do we choose its dimensions? And what does weight sharing mean in practice?
Understanding the vanilla RNN equations is essential groundwork — every LSTM and GRU cell is a direct extension of this structure, and the weight sharing mechanism here is what makes sequence models efficient enough to be practical.
The Vanilla RNN Equations
A vanilla RNN uses a simple linear transformation followed by a nonlinearity:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

where:

- $h_t$: hidden state at time step $t$ — the updated memory
- $W_{hh}$: weight matrix applied to the previous hidden state
- $h_{t-1}$: previous hidden state
- $W_{xh}$: weight matrix applied to the current input
- $x_t$: input at time step $t$
- $b_h$: bias vector
And an optional output at each step (for tasks that predict at every position):

$$y_t = W_{hy} h_t + b_y$$

- $y_t$: output prediction at time step $t$
- $W_{hy}$: output weight matrix
- $b_y$: output bias
That's it. Two equations. All the complexity of sequential modeling emerges from applying these equations repeatedly.
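Translated directly into code, each equation is one line. Here is a minimal NumPy sketch; the function and variable names are ours, chosen to mirror the symbols above:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One vanilla RNN update: h_t = tanh(W_hh @ h_prev + W_xh @ x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

def rnn_output(h_t, W_hy, b_y):
    """Optional per-step prediction: y_t = W_hy @ h_t + b_y."""
    return W_hy @ h_t + b_y
```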
Dimensions
Let's be explicit about shapes. Say the input is $d$-dimensional and the hidden state is $n$-dimensional:
| Symbol | Shape | Role |
|---|---|---|
| $x_t$ | $d \times 1$ | Input at time $t$ |
| $h_t$ | $n \times 1$ | Hidden state (memory) |
| $W_{xh}$ | $n \times d$ | Maps input to hidden space |
| $W_{hh}$ | $n \times n$ | Maps previous hidden to current hidden |
| $b_h$ | $n \times 1$ | Hidden bias |
Dimension check for $W_{hh} h_{t-1} + W_{xh} x_t + b_h$:

- $W_{hh} h_{t-1}$: $(n \times n) \cdot (n \times 1) = n \times 1$ ✓
- $W_{xh} x_t$: $(n \times d) \cdot (d \times 1) = n \times 1$ ✓
- Sum: $(n \times 1) + (n \times 1) + (n \times 1) = n \times 1$ ✓
- After $\tanh$: still $n \times 1$ ✓
The result $h_t$ has the same shape as $h_{t-1}$, which is necessary since we'll use it again at the next step.
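The same bookkeeping, checked in code; the sizes $d = 50$ and $n = 128$ are arbitrary choices matching the PyTorch example later in this lesson:

```python
import numpy as np

d, n = 50, 128                   # assumed sizes: 50-dim input, 128-dim hidden
x_t    = np.random.randn(d, 1)   # d x 1
h_prev = np.random.randn(n, 1)   # n x 1
W_xh   = np.random.randn(n, d)   # n x d
W_hh   = np.random.randn(n, n)   # n x n
b_h    = np.random.randn(n, 1)   # n x 1

h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
assert h_t.shape == h_prev.shape == (n, 1)  # same shape, ready for the next step
```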
Weight Sharing: One Rule for All Steps
The critical feature of the RNN is that $W_{hh}$, $W_{xh}$, and $b_h$ are identical at every time step. The RNN doesn't have a "step 1 update rule" and a different "step 7 update rule." It learns one universal update rule and applies it at every position.
This is analogous to how a convolutional filter learns one feature detector and applies it everywhere in the image. The RNN learns one "how to update memory given a new input" rule and applies it everywhere in the sequence.
Consequences:
- The model generalizes to sequence lengths never seen during training
- Total parameters = $n^2 + nd + n$ (for the hidden update) + $kn + k$ (for the output, where $k$ = output size), regardless of sequence length; see the parameter count check after this list
- During backprop, gradients from all T time steps accumulate into the same parameters
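A quick sanity check on the parameter count, using the same sizes as the PyTorch example below (one caveat: PyTorch stores two bias vectors, `b_ih` and `b_hh`, where our equation folds them into a single $b_h$):

```python
import torch.nn as nn

d, n = 50, 128
rnn = nn.RNN(input_size=d, hidden_size=n, num_layers=1)

total = sum(p.numel() for p in rnn.parameters())
# n*n (W_hh) + n*d (W_xh) + 2n (PyTorch's two bias vectors)
print(total, n*n + n*d + 2*n)  # prints 23040 23040, independent of sequence length
```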
Worked Example: 2-Step Sequence
Let's use $n = 2$ (hidden size) and $d = 1$ (scalar input).
Initialize $h_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, and suppose the learned weights are (illustrative values, with the bias set to zero to keep the arithmetic clean):

$$W_{hh} = \begin{bmatrix} 0.5 & -0.3 \\ 0.2 & 0.1 \end{bmatrix}, \quad W_{xh} = \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix}, \quad b_h = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Step 1: Input $x_1 = 1$

$$h_1 = \tanh(W_{hh} h_0 + W_{xh} x_1 + b_h) = \tanh\begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix} \approx \begin{bmatrix} 0.7616 \\ 0.4621 \end{bmatrix}$$

Because $h_0$ is all zeros, the recurrent term contributes nothing at the first step.

Step 2: Input $x_2 = 0.5$

$$h_2 = \tanh(W_{hh} h_1 + W_{xh} x_2 + b_h) = \tanh\left(\begin{bmatrix} 0.2422 \\ 0.1985 \end{bmatrix} + \begin{bmatrix} 0.5 \\ 0.25 \end{bmatrix}\right) \approx \begin{bmatrix} 0.6305 \\ 0.4207 \end{bmatrix}$$
The hidden state now encodes both inputs. Notice how $x_1$ influenced $h_1$, which influenced the computation of $h_2$ — the network remembers the first input while processing the second.
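A few lines of NumPy reproduce the arithmetic (the weights are the illustrative values assumed above, not learned ones):

```python
import numpy as np

# Illustrative weights from the worked example (assumed, not learned)
W_hh = np.array([[0.5, -0.3],
                 [0.2,  0.1]])
W_xh = np.array([[1.0],
                 [0.5]])
b_h  = np.zeros((2, 1))

h = np.zeros((2, 1))              # h0 = [0, 0]^T
for x in [1.0, 0.5]:              # x1, x2
    h = np.tanh(W_hh @ h + W_xh * x + b_h)
    print(h.ravel())              # prints h1 ≈ [0.7616 0.4621], then h2 ≈ [0.6305 0.4207]
```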
The Unrolled View
The sequential computation is easiest to visualize by "unrolling" the RNN into a chain of cells:
h₀ → [RNN cell, x₁] → h₁ → [RNN cell, x₂] → h₂ → ... → hₜ
Each box is the same function with the same weights. The arrows passing hidden states left-to-right are what make it recurrent. This "unrolled" view is not just conceptual — it's exactly how backpropagation is implemented. You treat the unrolled network as a deep feedforward network and apply standard backprop. The next lesson shows what this means for computing gradients.
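In PyTorch, the unrolled loop can be written explicitly with `nn.RNNCell`, the single-step counterpart of the `nn.RNN` module used below. A minimal sketch:

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=50, hidden_size=128)  # one cell, one set of weights
x = torch.randn(20, 32, 50)                        # [seq_len, batch, input_size]
h = torch.zeros(32, 128)                           # h0

for t in range(x.size(0)):   # unroll: the same weights applied at every step
    h = cell(x[t], h)        # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)
# h now holds the final hidden state
```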
PyTorch Implementation
```python
import torch
import torch.nn as nn

rnn = nn.RNN(
    input_size=50,     # d: input dimension
    hidden_size=128,   # n: hidden state dimension
    num_layers=1,
    batch_first=True   # input shape: [batch, seq_len, input_size]
)

# Input: batch of 32 sequences, each 20 steps long, each step is 50-dim
x = torch.randn(32, 20, 50)
h0 = torch.zeros(1, 32, 128)  # initial hidden state

output, hn = rnn(x, h0)
# output: [32, 20, 128] — hidden state at every step
# hn:     [1, 32, 128] — final hidden state
```
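A consistency check worth knowing: for a single-layer, unidirectional RNN, the last time step of `output` is the same as `hn`:

```python
assert torch.allclose(output[:, -1, :], hn[0])  # last step of output == final hidden state
```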
The vanilla RNN is conceptually clean. But it has a serious training problem that we'll confront in lesson 4 — and that motivated the LSTM architecture.