Recurrent Networks
Lesson 2 ⏱ 12 min

Vanilla RNNs: the hidden state equation

Video coming soon

Vanilla RNNs: Equations, Dimensions, and Weight Sharing

Derives the RNN update equations from scratch, walks through a numerical example with a 2-step sequence, visualizes the unrolled network, and explains why weight sharing across time steps is essential.

⏱ ~7 min

🧮 Quick refresher: Matrix-vector multiplication

Multiplying an (n×d) matrix by a d-dimensional vector produces an n-dimensional vector. Each output element is the dot product of one row of the matrix with the input vector.

Example

[[1,2],[3,4]] × [5,6] = [1×5+2×6, 3×5+4×6] = [17, 39].

A (2×2) matrix times a 2-vector gives a 2-vector.
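To check this numerically, a couple of lines of PyTorch (the library used later in this lesson) will do:

import torch

A = torch.tensor([[1., 2.], [3., 4.]])
v = torch.tensor([5., 6.])
print(A @ v)  # tensor([17., 39.]): each entry is one row of A dotted with v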

The previous lesson defined the hidden state update $h_t = f(h_{t-1}, x_t)$. Now we make this concrete: what exactly is the function $f$ in a vanilla RNN? How do we choose its dimensions? And what does weight sharing mean in practice?

Understanding the vanilla RNN equations is essential groundwork — every LSTM and GRU cell is a direct extension of this structure, and the weight sharing mechanism here is what makes sequence models efficient enough to be practical.

The Vanilla RNN Equations

A vanilla RNN uses a simple linear transformation followed by a nonlinearity:

$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$

where:

  • $h_t$: hidden state at time step $t$ (the updated memory)
  • $W_h$: weight matrix applied to the previous hidden state
  • $h_{t-1}$: previous hidden state
  • $W_x$: weight matrix applied to the current input
  • $x_t$: input at time step $t$
  • $b$: bias vector

And an optional output at each step (for tasks that predict at every position):

$$y_t = W_y h_t + b_y$$

where:

  • $y_t$: output prediction at time step $t$
  • $W_y$: output weight matrix
  • $b_y$: output bias

That's it. Two equations. All the complexity of sequential modeling emerges from applying these equations repeatedly.
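Translated literally into code, the two equations fit in a few lines. This is a sketch using 1-D tensors for the vectors; rnn_step is a hypothetical helper, not a library function:

import torch

def rnn_step(h_prev, x_t, W_h, W_x, b, W_y, b_y):
    # Equation 1: update the memory
    h_t = torch.tanh(W_h @ h_prev + W_x @ x_t + b)
    # Equation 2: optional per-step prediction
    y_t = W_y @ h_t + b_y
    return h_t, y_t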

Dimensions

Let's be explicit about shapes. Say the input is $d$-dimensional and the hidden state is $n$-dimensional:

| Symbol | Shape | Role |
| --- | --- | --- |
| $x_t$ | $d \times 1$ | Input at time $t$ |
| $h_t$ | $n \times 1$ | Hidden state (memory) |
| $W_x$ | $n \times d$ | Maps input to hidden space |
| $W_h$ | $n \times n$ | Maps previous hidden state to current |
| $b$ | $n \times 1$ | Hidden bias |

Dimension check for $W_h h_{t-1} + W_x x_t + b$:

  • $W_h h_{t-1}$: (n×n)·(n×1) = n×1 ✓
  • $W_x x_t$: (n×d)·(d×1) = n×1 ✓
  • Sum: (n×1) + (n×1) + (n×1) = n×1 ✓
  • After tanh: still n×1 ✓

The result $h_t$ has the same shape as $h_{t-1}$, which is necessary since we'll use it again at the next step.
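You can replay this dimension check mechanically. A minimal sketch, using 1-D tensors in place of the n×1 column vectors, with arbitrary sizes n=4 and d=3:

import torch

n, d = 4, 3  # arbitrary sizes for illustration
W_h, W_x, b = torch.randn(n, n), torch.randn(n, d), torch.randn(n)
h_prev, x_t = torch.randn(n), torch.randn(d)

h_t = torch.tanh(W_h @ h_prev + W_x @ x_t + b)
assert h_t.shape == h_prev.shape  # same shape as the previous hidden state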

Weight Sharing: One Rule for All Steps

The critical feature of the RNN is that $W_h$, $W_x$, and $b$ are identical at every time step. The RNN doesn't have a "step 1 update rule" and a different "step 7 update rule." It learns one universal update rule and applies it at every position.

This is analogous to how a convolutional filter learns one feature detector and applies it everywhere in the image. The RNN learns one "how to update memory given a new input" rule and applies it everywhere in the sequence.

Consequences:

  • The model generalizes to sequence lengths never seen during training
  • Total parameters = $n^2 + nd + n$ (for the hidden update) plus $n \cdot k$ (for the output, where $k$ is the output size), regardless of sequence length (see the sketch after this list)
  • During backprop, gradients from all T time steps accumulate into the same parameters
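The first two points are easy to demonstrate with the PyTorch module used later in this lesson. One caveat: nn.RNN stores two bias vectors (bias_ih and bias_hh), so its count is n² + nd + 2n rather than the n² + nd + n above:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=50, hidden_size=128, batch_first=True)

# The same weights process any sequence length; no retraining needed:
out_short, _ = rnn(torch.randn(8, 5, 50))    # 5 time steps
out_long, _ = rnn(torch.randn(8, 500, 50))   # 500 time steps
print(out_short.shape, out_long.shape)       # [8, 5, 128] and [8, 500, 128]

# Parameter count is fixed regardless of length:
print(sum(p.numel() for p in rnn.parameters()))  # 23040 = 128² + 128·50 + 2·128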

Worked Example: 2-Step Sequence

Let's use $n = 2$ (hidden size) and $d = 1$ (scalar input).

Initialize $h_0 = [0, 0]^T$, and suppose the learned weights are:

$$W_h = \begin{bmatrix} 0.5 & 0.2 \\ 0.3 & 0.6 \end{bmatrix}, \quad W_x = \begin{bmatrix} 0.8 \\ 0.4 \end{bmatrix}, \quad b = \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix}$$

Step 1: Input $x_1 = 2.0$

$$W_h h_0 + W_x x_1 + b = \begin{bmatrix} 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0.8 \times 2 \\ 0.4 \times 2 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix} = \begin{bmatrix} 1.7 \\ 0.9 \end{bmatrix}$$

$$h_1 = \tanh\begin{bmatrix} 1.7 \\ 0.9 \end{bmatrix} = \begin{bmatrix} 0.935 \\ 0.716 \end{bmatrix}$$

Step 2: Input $x_2 = -1.0$

$$W_h h_1 + W_x x_2 + b = \begin{bmatrix} 0.5 \times 0.935 + 0.2 \times 0.716 \\ 0.3 \times 0.935 + 0.6 \times 0.716 \end{bmatrix} + \begin{bmatrix} -0.8 \\ -0.4 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix}$$

$$= \begin{bmatrix} 0.610 \\ 0.710 \end{bmatrix} + \begin{bmatrix} -0.8 \\ -0.4 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix} = \begin{bmatrix} -0.090 \\ 0.410 \end{bmatrix}$$

$$h_2 = \tanh\begin{bmatrix} -0.090 \\ 0.410 \end{bmatrix} = \begin{bmatrix} -0.090 \\ 0.389 \end{bmatrix}$$

The hidden state $h_2$ now encodes both inputs. Notice how $x_1 = 2.0$ influenced $h_1$, which influenced the computation of $h_2$ — the network remembers the first input while processing the second.
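A few lines of PyTorch reproduce these numbers, which is a handy way to check the arithmetic:

import torch

W_h = torch.tensor([[0.5, 0.2], [0.3, 0.6]])
W_x = torch.tensor([[0.8], [0.4]])
b = torch.tensor([0.1, 0.1])

h = torch.zeros(2)        # h_0
for x in (2.0, -1.0):     # x_1, then x_2
    h = torch.tanh(W_h @ h + W_x @ torch.tensor([x]) + b)
    print(h)
# step 1: ≈ [0.9354, 0.7163]; step 2: ≈ [-0.089, 0.389]
# (the text rounds intermediates, so the last digits differ slightly)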

The Unrolled View

The sequential computation is easiest to visualize by "unrolling" the RNN into a chain of cells:

h₀ → [RNN cell, x₁] → h₁ → [RNN cell, x₂] → h₂ → ... → hₜ

Each box is the same function with the same weights. The arrows passing hidden states left-to-right are what make it recurrent. This "unrolled" view is not just conceptual — it's exactly how backpropagation is implemented. You treat the unrolled network as a deep feedforward network and apply standard backprop. The next lesson shows what this means for computing gradients.
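In code, "unrolling" is just a loop that reuses the same weights at every iteration. A sketch (run_rnn is a hypothetical helper; autograd will backpropagate through the loop automatically):

import torch

def run_rnn(x_seq, h0, W_h, W_x, b):
    h, states = h0, []
    for x_t in x_seq:  # one iteration per time step, same weights each time
        h = torch.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return torch.stack(states)  # shape [T, n]: h_1 through h_T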

PyTorch Implementation

import torch
import torch.nn as nn

rnn = nn.RNN(
    input_size=50,   # d: input dimension
    hidden_size=128, # n: hidden state dimension
    num_layers=1,
    batch_first=True # input shape: [batch, seq_len, input_size]
)

# Input: batch of 32 sequences, each 20 steps long, each step is 50-dim
x = torch.randn(32, 20, 50)
h0 = torch.zeros(1, 32, 128)  # initial hidden state

output, hn = rnn(x, h0)
# output: [32, 20, 128] — hidden state at every step
# hn:     [1, 32, 128]  — final hidden state
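One relationship worth knowing: for a single-layer, unidirectional RNN like this one, the last time step of output and hn hold the same values, which makes a quick sanity check of the two return values:

print(torch.allclose(output[:, -1, :], hn[0]))  # True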

The vanilla RNN is conceptually clean. But it has a serious training problem that we'll confront in lesson 4 — and that motivated the LSTM architecture.

Quiz


Why do RNNs use the same weight matrices $W_h$, $W_x$, $b$ at every time step?