The first 12 units of this course dealt almost exclusively with fixed-size data: one fixed-size vector in, one fixed-size vector out. A patient's medical features → a diagnosis. An image's pixels → a class label.
But a huge class of important problems doesn't fit this mold. Language, audio, time series, video, DNA — all involve data where order and history matter. Understanding what makes these problems fundamentally different, and what kind of architecture is required to handle them, is the subject of this unit.
Recurrent networks were the architecture that powered the first generation of voice assistants, machine translation systems, and predictive text. They are still used today in streaming and on-device settings where the full sequence isn't available in advance. Understanding them also explains exactly what transformers were designed to improve.
What Makes Sequential Data Different
Consider these three prediction problems:
- Predict whether an image contains a dog. (Fixed input, order irrelevant)
- Predict the next word in "The cat sat on the ___". (Variable-length input, order critical)
- Predict tomorrow's stock price given the last 30 days. (Ordered input, recent history more relevant)
Problems 2 and 3 share a structure: the correct answer depends not just on the current input, but on the history of previous inputs in a specific order. "mat" is the correct next word not because of any single word, but because of the full phrase "The cat sat on the." Swap the words to "on sat cat The the" and the phrase becomes nonsense; no sensible prediction is possible.
A feedforward network has no mechanism for this. It processes inputs independently: it has no memory of previous inputs when it processes the current one.
Defining a Sequence
A sequence is an ordered list $(x_1, x_2, \dots, x_T)$, where:
- $x_t$ — the input at position (time step) $t$
- $T$ — the length of the sequence, which can vary between examples
Each $x_t$ might be a single number (univariate time series), a vector (word embedding, sensor readings), or even an image (video frame).
The critical property: $T$ can vary between examples. A sentence can be 3 words or 50 words. An audio clip can be 1 second or 60 seconds. Any architecture for sequential data must handle variable lengths naturally.
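To make this concrete, here is a minimal sketch of what variable-length data looks like in practice: a plain Python list of NumPy arrays, each of shape $(T, d)$ with a different $T$. The feature dimension and lengths are made-up toy values.

```python
import numpy as np

d = 4  # feature dimension per time step (toy value)

# Three sequences of different lengths: a fixed-width design matrix
# cannot hold these rows without padding or truncation.
batch = [
    np.random.randn(3, d),   # e.g., a 3-word sentence (T = 3)
    np.random.randn(7, d),   # e.g., a 7-word sentence (T = 7)
    np.random.randn(5, d),   # e.g., 5 sensor readings (T = 5)
]

for i, seq in enumerate(batch):
    T, dim = seq.shape
    print(f"example {i}: T = {T}, each x_t is a vector of dimension {dim}")
```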
The Hidden State: Memory in a Vector
The key idea underlying all sequential models is the hidden state, denoted $h_t$.
At each time step $t$, the model receives the current input $x_t$ and the previous hidden state $h_{t-1}$, and produces a new hidden state $h_t$:

$$h_t = f(x_t, h_{t-1})$$

- $h_t$ — hidden state after processing $x_t$
- $f$ — some learned function of the current input and previous state
This one equation is the core of all recurrent models. The function $f$ determines how to update the memory given a new observation. If we design $f$ well, $h_t$ will contain the information from the sequence needed to make predictions.
The process unfolds step by step:
- Start: $h_0 = \mathbf{0}$ (empty memory)
- Step 1: $h_1 = f(x_1, h_0)$ — process first input, update memory
- Step 2: $h_2 = f(x_2, h_1)$ — update memory with second input
- ...
- Step T: $h_T = f(x_T, h_{T-1})$ — final memory after full sequence
After processing the whole sequence, $h_T$ is a fixed-size summary of everything the model has seen.
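The unrolled computation is short enough to write out directly. Here is a minimal NumPy sketch of the recurrence as a loop; `toy_f` is an illustrative stand-in for $f$, not a real model.

```python
import numpy as np

def run_recurrence(f, xs, h0):
    """Fold f over a sequence: h_t = f(x_t, h_{t-1})."""
    h = h0
    for x in xs:        # process inputs strictly in order
        h = f(x, h)     # new memory = function of current input and old memory
    return h            # h_T: fixed-size summary of the whole sequence

# Toy stand-in for f (illustrative only): blend the input into the memory.
toy_f = lambda x, h: np.tanh(h + x)

xs_short = np.random.randn(3, 4)    # T = 3, each x_t has dimension 4
xs_long  = np.random.randn(300, 4)  # T = 300, same feature dimension

h0 = np.zeros(4)
print(run_recurrence(toy_f, xs_short, h0).shape)  # (4,)
print(run_recurrence(toy_f, xs_long,  h0).shape)  # (4,) -- same size either way
```

Note that $h_T$ has the same shape for a 3-step sequence and a 300-step one; this is exactly how recurrence handles variable lengths naturally.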
What Problems Require This
Sequential models are the right tool for any task where:
- Input length varies. You can't fix the input size ahead of time.
- Order matters. Shuffling the input changes its meaning.
- History matters. The correct output at step $t$ depends on what happened at steps $1, \dots, t-1$.
This covers:
- Language modeling: predict the next word given all previous words
- Machine translation: encode an English sentence, decode a French one
- Speech recognition: convert audio signal (sequence of amplitudes) to text
- Time series forecasting: predict next values given past observations
- DNA sequence analysis: classify or annotate sequences of nucleotides
The Three Solutions
The hidden-state recurrence $h_t = f(x_t, h_{t-1})$ underlies three progressively refined architectures:
Vanilla RNN — the simplest form: $f$ is one matrix multiplication followed by a tanh. Fast to compute but practically limited to short-range dependencies (the reason comes in the next two lessons).
LSTM (Long Short-Term Memory) — adds a "cell state" that carries information over long distances with only minimal modification at each step. Mitigates the vanishing gradient problem through clever gating.
GRU (Gated Recurrent Unit) — a simplified LSTM with fewer parameters. Achieves similar performance on most tasks.
All three share the same interface: take $x_t$ and $h_{t-1}$, produce $h_t$. The next lesson builds the vanilla RNN from scratch and derives its equations in full.
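As a preview of that shared interface, here is a minimal NumPy sketch of one vanilla RNN step. The weight names ($W_{xh}$, $W_{hh}$, $b$) and dimensions are illustrative assumptions; the next lesson derives the real equations.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8   # input and hidden dimensions (arbitrary toy sizes)

# Hypothetical parameter names; derived properly in the next lesson.
W_xh = 0.1 * rng.standard_normal((n, d))   # input  -> hidden
W_hh = 0.1 * rng.standard_normal((n, n))   # hidden -> hidden
b    = np.zeros(n)

def rnn_step(x_t, h_prev):
    """One vanilla RNN update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

h = np.zeros(n)                            # h_0: empty memory
for x_t in rng.standard_normal((5, d)):    # a length-5 toy sequence
    h = rnn_step(x_t, h)                   # same (x_t, h_{t-1}) -> h_t interface
print(h.shape)                             # (8,)
```

An LSTM or GRU step would plug into the same loop unchanged; only the body of `rnn_step` differs.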