Recurrent Networks
Lesson 1 ⏱ 10 min

Sequences and the memory problem


Sequences and the Memory Problem

Defines sequential data, shows why feedforward networks can't handle it, introduces the hidden state as a memory mechanism, and previews the RNN, LSTM, and GRU as solutions.

⏱ ~6 min


Quick refresher

Feedforward networks and fixed-size inputs

A feedforward network maps a fixed-size input vector to a fixed-size output vector. Each input is processed independently — the network has no mechanism to relate one input to another or to use earlier inputs when processing later ones.

Example

A network trained on 28×28 images takes exactly 784 inputs every time.

If you want to classify a sequence of images, it can only classify each one in isolation — it has no way to remember image 1 while classifying image 3.
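The fixed-size constraint is easy to see in code. Below is a minimal sketch of a single-layer feedforward classifier with untrained random weights (the weight matrix and `classify` helper are illustrative, not from any particular library): the weight matrix hard-codes 784 inputs, so any other input size is rejected, and each call is scored with no memory of previous calls.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative single-layer classifier for 28x28 images:
# the weight matrix is fixed at 784 inputs x 10 classes.
W = rng.normal(size=(784, 10))
b = np.zeros(10)

def classify(image):
    """Flatten one image and score it against 10 classes, independently of any other image."""
    x = image.reshape(-1)              # must be exactly 784 values
    return x @ W + b

img = rng.normal(size=(28, 28))
print(classify(img).shape)             # (10,) — works for a 28x28 image

try:
    classify(rng.normal(size=(32, 32)))  # 1024 inputs: shape mismatch
except ValueError as err:
    print("rejected:", err)
```

Nothing carries over between calls to `classify` — that independence is exactly the memory problem this lesson introduces.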

The first 12 units of this course dealt almost exclusively with tabular data: one fixed-size vector in, one fixed-size vector out. A patient's medical features → a diagnosis. An image's pixels → a class label.

But a huge class of important problems doesn't fit this mold. Language, audio, time series, video, DNA — all involve data where order and history matter. Understanding what makes these problems fundamentally different, and what kind of architecture is required to handle them, is the subject of this unit.

Recurrent networks were the architecture that powered the first generation of voice assistants, machine translation systems, and predictive text. They are still used today in streaming and on-device settings where the full sequence isn't available in advance. Understanding them also explains exactly what transformers were designed to improve.

What Makes Sequential Data Different

Consider these three prediction problems:

  1. Predict whether an image contains a dog. (Fixed input, order irrelevant)
  2. Predict the next word in "The cat sat on the ___". (Variable-length input, order critical)
  3. Predict tomorrow's stock price given the last 30 days. (Variable-length input, recent history more relevant)

Problems 2 and 3 share a structure: the correct answer depends not just on the current input, but on the history of previous inputs in a specific order. "mat" is the correct next word not because of any single word, but because of the full phrase "The cat sat on the". Shuffle the words — "on sat cat The the" — and the phrase becomes nonsense; no sensible prediction is possible.

A feedforward network has no mechanism for this. It processes inputs independently. It has no memory of previous inputs when it processes the current one.

Defining a Sequence

A sequence is an ordered list:

$$x_1, x_2, x_3, \ldots, x_T$$

where:

  • $x_t$ — the input at position (time step) $t$
  • $T$ — the length of the sequence; can vary between examples

Each $x_t$ might be a single number (univariate time series), a vector (word embedding, sensor readings), or even an image (video frame).

The critical property: $T$ can vary between examples. A sentence can be 3 words or 50 words. An audio clip can be 1 second or 60 seconds. Any architecture for sequential data must handle variable lengths naturally.
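The variable-length property is concrete in code. A minimal sketch (the ragged list below is an invented example): sequences with different $T$ cannot be packed into one fixed-size array, which is exactly why a fixed-input architecture cannot consume them directly.

```python
import numpy as np

# Three "sequences" of 2-d feature vectors with different lengths T.
seqs = [np.ones((3, 2)), np.ones((50, 2)), np.ones((7, 2))]
print([s.shape[0] for s in seqs])   # [3, 50, 7] — T varies per example

# They cannot be stacked into one fixed-size batch array as-is:
try:
    np.stack(seqs)
except ValueError as err:
    print("cannot stack ragged sequences:", err)
```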

The Hidden State: Memory in a Vector

The key idea underlying all sequential models is the hidden state, denoted $h_t$.

At each time step, the model receives the current input $x_t$ and the previous hidden state $h_{t-1}$, and produces a new hidden state $h_t$:

$$h_t = f(h_{t-1}, x_t)$$

where:

  • $h_t$ — the hidden state after processing $x_t$
  • $f$ — some learned function of the current input and previous state

This one equation is the core of all recurrent models. The function $f$ determines how to update the memory given a new observation. If we design $f$ well, $h_t$ will contain the information from the sequence needed to make predictions.

The process unfolds step by step:

  • Start: $h_0 = \mathbf{0}$ (empty memory)
  • Step 1: $h_1 = f(h_0, x_1)$ — process first input, update memory
  • Step 2: $h_2 = f(h_1, x_2)$ — update memory with second input
  • ...
  • Step T: $h_T = f(h_{T-1}, x_T)$ — final memory after full sequence

After processing the whole sequence, $h_T$ is a fixed-size summary of everything the model has seen.
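The unrolled process above can be sketched in a few lines. This is a hedged illustration, not a trained model: the update function `f` here is a tanh of a linear map (previewing the vanilla RNN of the next lesson), with random untrained weights and invented dimensions. The point is the loop structure: whatever the length $T$, the final $h_T$ has the same fixed size.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5        # illustrative sizes

# Random (untrained) parameters for an example update function f.
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

def f(h_prev, x_t):
    """One memory update: combine previous state and current input."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

def encode(sequence):
    """Run the recurrence h_t = f(h_{t-1}, x_t) over a whole sequence."""
    h = np.zeros(hidden_dim)        # h_0 = 0: empty memory
    for x_t in sequence:
        h = f(h, x_t)
    return h                        # h_T: fixed-size summary of the sequence

short = rng.normal(size=(4, input_dim))    # T = 4
long = rng.normal(size=(50, input_dim))    # T = 50
print(encode(short).shape, encode(long).shape)  # (5,) (5,) — same size either way
```

Note how variable length is handled for free: the loop simply runs for as many steps as the sequence has.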

What Problems Require This

Sequential models are the right tool for any task where:

  1. Input length varies. You can't fix the input size ahead of time.
  2. Order matters. Shuffling the input changes its meaning.
  3. History matters. The correct output at step $t$ depends on what happened at steps $1, \ldots, t-1$.

This covers:

  • Language modeling: predict the next word given all previous words
  • Machine translation: encode an English sentence, decode a French one
  • Speech recognition: convert audio signal (sequence of amplitudes) to text
  • Time series forecasting: predict next values given past observations
  • DNA sequence analysis: classify or annotate sequences of nucleotides

The Three Solutions

The idea $h_t = f(h_{t-1}, x_t)$ underlies three progressively refined architectures:

Vanilla RNN — the simplest form: $f$ is one matrix multiplication followed by a tanh. Fast to compute but practically limited to short-range dependencies (the reason comes in the next two lessons).

LSTM (Long Short-Term Memory) — adds a "cell state" that carries information over long distances with minimal modification. Solves the vanishing gradient problem through clever gating.

GRU (Gated Recurrent Unit) — a simplified LSTM with fewer parameters. Achieves similar performance on most tasks.

All three share the same interface: take $h_{t-1}$ and $x_t$, produce $h_t$. The next lesson builds the vanilla RNN from scratch and derives its equations in full.
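The shared interface can be sketched directly. Below, a vanilla RNN step and a GRU step are written side by side with random untrained weights and invented dimensions (the standard GRU update/reset-gate equations are used; the LSTM is omitted only because it also threads a separate cell state). Both consume $(h_{t-1}, x_t)$ and return an $h_t$ of the same shape, so either could drive the encoding loop.

```python
import numpy as np

rng = np.random.default_rng(1)
input_dim, hidden_dim = 3, 5        # illustrative sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init(shape):
    return rng.normal(scale=0.3, size=shape)

# Vanilla RNN step: one linear map squashed by tanh.
Wx, Wh = init((hidden_dim, input_dim)), init((hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def rnn_step(h_prev, x_t):
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

# GRU step: gates decide how much of the old memory to keep.
Wz, Uz = init((hidden_dim, input_dim)), init((hidden_dim, hidden_dim))
Wr, Ur = init((hidden_dim, input_dim)), init((hidden_dim, hidden_dim))
Wc, Uc = init((hidden_dim, input_dim)), init((hidden_dim, hidden_dim))

def gru_step(h_prev, x_t):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)          # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)          # reset gate
    h_cand = np.tanh(Wc @ x_t + Uc @ (r * h_prev))
    return (1 - z) * h_prev + z * h_cand         # blend old memory and candidate

# Same interface: (h_{t-1}, x_t) -> h_t of the same shape.
h, x = np.zeros(hidden_dim), rng.normal(size=input_dim)
print(rnn_step(h, x).shape, gru_step(h, x).shape)  # (5,) (5,)
```

Because the interface is identical, everything built on top of one cell type (stacking, unrolling, decoding) transfers to the others unchanged.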

Quiz

1 / 3

Why can't a standard feedforward network model the sentence 'The bank approved my loan' correctly after seeing it word by word?