The previous lesson defined the abstract hidden state update $h_t = f(h_{t-1}, x_t)$. Now we make this concrete: what exactly is the function $f$ in a vanilla RNN? How do we choose its dimensions? And what does weight sharing mean in practice?
Understanding the vanilla RNN equations is essential groundwork — every LSTM and GRU cell is a direct extension of this structure, and the weight sharing mechanism here is what makes sequence models efficient enough to be practical.
The Vanilla RNN Equations
A vanilla RNN uses a simple linear transformation followed by a nonlinearity:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

where:

- $h_t$: hidden state at time step $t$ — the updated memory
- $W_{hh}$: weight matrix applied to the previous hidden state
- $h_{t-1}$: previous hidden state
- $W_{xh}$: weight matrix applied to the current input
- $x_t$: input at time step $t$
- $b_h$: bias vector
And an optional output at each step (for tasks that predict at every position):

$$y_t = W_{hy} h_t + b_y$$

- $y_t$: output prediction at time step $t$
- $W_{hy}$: output weight matrix
- $b_y$: output bias
That's it. Two equations. All the complexity of sequential modeling emerges from applying these equations repeatedly.
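Translated directly into code, each equation is one line. Here is a minimal NumPy sketch; the function and variable names are ours, chosen to mirror the symbols above:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One vanilla RNN update: h_t = tanh(W_hh @ h_prev + W_xh @ x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

def rnn_output(h_t, W_hy, b_y):
    """Optional per-step prediction: y_t = W_hy @ h_t + b_y."""
    return W_hy @ h_t + b_y
```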
Dimensions
Let's be explicit about shapes. Say the input is $d$-dimensional and the hidden state is $n$-dimensional:
| Symbol | Shape | Role |
|---|---|---|
| $x_t$ | $d \times 1$ | Input at time $t$ |
| $h_t$ | $n \times 1$ | Hidden state (memory) |
| $W_{xh}$ | $n \times d$ | Maps input to hidden space |
| $W_{hh}$ | $n \times n$ | Maps previous hidden to current hidden |
| $b_h$ | $n \times 1$ | Hidden bias |
Dimension check for $W_{hh} h_{t-1} + W_{xh} x_t + b_h$:

- $W_{hh} h_{t-1}$: $(n \times n) \cdot (n \times 1) = n \times 1$ ✓
- $W_{xh} x_t$: $(n \times d) \cdot (d \times 1) = n \times 1$ ✓
- Sum: $(n \times 1) + (n \times 1) + (n \times 1) = n \times 1$ ✓
- After $\tanh$: still $n \times 1$ ✓
The result $h_t$ has the same shape as $h_{t-1}$, which is necessary since we'll use it again at the next step.
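The same bookkeeping, checked in code; the sizes $d = 50$ and $n = 128$ are arbitrary choices matching the PyTorch example later in this lesson:

```python
import numpy as np

d, n = 50, 128                   # assumed sizes: 50-dim input, 128-dim hidden
x_t    = np.random.randn(d, 1)   # d x 1
h_prev = np.random.randn(n, 1)   # n x 1
W_xh   = np.random.randn(n, d)   # n x d
W_hh   = np.random.randn(n, n)   # n x n
b_h    = np.random.randn(n, 1)   # n x 1

h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
assert h_t.shape == h_prev.shape == (n, 1)  # same shape, ready for the next step
```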
Weight Sharing: One Rule for All Steps
The critical feature of the RNN is that $W_{hh}$, $W_{xh}$, and $b_h$ are identical at every time step. The RNN doesn't have a "step 1 update rule" and a different "step 7 update rule." It learns one universal update rule and applies it at every position.
This is analogous to how a convolutional filter learns one feature detector and applies it everywhere in the image. The RNN learns one "how to update memory given a new input" rule and applies it everywhere in the sequence.
Consequences:
- The model generalizes to sequence lengths never seen during training
- Total parameters = $n^2 + nd + n$ (for the hidden update) + $kn + k$ (for the output, where $k$ = output size), regardless of sequence length; see the parameter count check after this list
- During backprop, gradients from all T time steps accumulate into the same parameters
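A quick sanity check on the parameter count, using the same sizes as the PyTorch example below (one caveat: PyTorch stores two bias vectors, `b_ih` and `b_hh`, where our equation folds them into a single $b_h$):

```python
import torch.nn as nn

d, n = 50, 128
rnn = nn.RNN(input_size=d, hidden_size=n, num_layers=1)

total = sum(p.numel() for p in rnn.parameters())
# n*n (W_hh) + n*d (W_xh) + 2n (PyTorch's two bias vectors)
print(total, n*n + n*d + 2*n)  # prints 23040 23040, independent of sequence length
```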
Worked Example: 2-Step Sequence
Let's use $n = 2$ (hidden size) and $d = 1$ (scalar input).
Initialize $h_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, and suppose the learned weights are (illustrative values, with the bias set to zero to keep the arithmetic clean):

$$W_{hh} = \begin{bmatrix} 0.5 & -0.3 \\ 0.2 & 0.1 \end{bmatrix}, \quad W_{xh} = \begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix}, \quad b_h = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Step 1: Input $x_1 = 1$

$$h_1 = \tanh(W_{hh} h_0 + W_{xh} x_1 + b_h) = \tanh\begin{bmatrix} 1.0 \\ 0.5 \end{bmatrix} \approx \begin{bmatrix} 0.7616 \\ 0.4621 \end{bmatrix}$$

Because $h_0$ is all zeros, the recurrent term contributes nothing at the first step.

Step 2: Input $x_2 = 0.5$

$$h_2 = \tanh(W_{hh} h_1 + W_{xh} x_2 + b_h) = \tanh\left(\begin{bmatrix} 0.2422 \\ 0.1985 \end{bmatrix} + \begin{bmatrix} 0.5 \\ 0.25 \end{bmatrix}\right) \approx \begin{bmatrix} 0.6305 \\ 0.4207 \end{bmatrix}$$
The hidden state now encodes both inputs. Notice how $x_1$ influenced $h_1$, which influenced the computation of $h_2$ — the network remembers the first input while processing the second.
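A few lines of NumPy reproduce the arithmetic (the weights are the illustrative values assumed above, not learned ones):

```python
import numpy as np

# Illustrative weights from the worked example (assumed, not learned)
W_hh = np.array([[0.5, -0.3],
                 [0.2,  0.1]])
W_xh = np.array([[1.0],
                 [0.5]])
b_h  = np.zeros((2, 1))

h = np.zeros((2, 1))              # h0 = [0, 0]^T
for x in [1.0, 0.5]:              # x1, x2
    h = np.tanh(W_hh @ h + W_xh * x + b_h)
    print(h.ravel())              # prints h1 ≈ [0.7616 0.4621], then h2 ≈ [0.6305 0.4207]
```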
The Unrolled View
The sequential computation is easiest to visualize by "unrolling" the RNN into a chain of cells:
h₀ → [RNN cell, x₁] → h₁ → [RNN cell, x₂] → h₂ → ... → hₜ
Each box is the same function with the same weights. The arrows passing hidden states left-to-right are what make it recurrent. This "unrolled" view is not just conceptual — it's exactly how backpropagation is implemented. You treat the unrolled network as a deep feedforward network and apply standard backprop. The next lesson shows what this means for computing gradients.
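In PyTorch, the unrolled loop can be written explicitly with `nn.RNNCell`, the single-step counterpart of the `nn.RNN` module used below. A minimal sketch:

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=50, hidden_size=128)  # one cell, one set of weights
x = torch.randn(20, 32, 50)                        # [seq_len, batch, input_size]
h = torch.zeros(32, 128)                           # h0

for t in range(x.size(0)):   # unroll: the same weights applied at every step
    h = cell(x[t], h)        # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)
# h now holds the final hidden state
```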
PyTorch Implementation
```python
import torch
import torch.nn as nn

rnn = nn.RNN(
    input_size=50,     # d: input dimension
    hidden_size=128,   # n: hidden state dimension
    num_layers=1,
    batch_first=True   # input shape: [batch, seq_len, input_size]
)

# Input: batch of 32 sequences, each 20 steps long, each step is 50-dim
x = torch.randn(32, 20, 50)
h0 = torch.zeros(1, 32, 128)  # initial hidden state

output, hn = rnn(x, h0)
# output: [32, 20, 128] — hidden state at every step
# hn:     [1, 32, 128] — final hidden state
```
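A consistency check worth knowing: for a single-layer, unidirectional RNN, the last time step of `output` is the same as `hn`:

```python
assert torch.allclose(output[:, -1, :], hn[0])  # last step of output == final hidden state
```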
The vanilla RNN is conceptually clean. But it has a serious training problem that we'll confront in lesson 4 — and that motivated the LSTM architecture.