The vanishing gradient problem comes from multiplying many small Jacobians together over long sequences. The LSTM's solution is architectural: introduce an additional state variable that updates additively rather than multiplicatively, creating a "gradient highway" that lets information (and gradients) flow over hundreds of steps without shrinking.
LSTMs were the dominant architecture for language modeling, speech recognition, and machine translation from roughly 2014 to 2018. They powered the first wave of production neural NLP systems at Google, Baidu, and Apple. Understanding the LSTM cell is essential for reading that entire body of work.
The Key Idea: Additive Updates
In a vanilla RNN, the hidden state updates multiplicatively:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b)$$

The gradient involves multiplying by $W_{hh}^\top$ and by $\tanh'$ at every step. After $T$ such multiplications, gradients vanish.
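To see this concretely, here is a minimal sketch (the dimensions, weight scale, and sum-loss are illustrative choices, not from any particular model) that backpropagates through 100 vanilla-RNN steps and compares the gradient reaching the first input with the gradient at the last:

import torch

torch.manual_seed(0)
n, d, T = 16, 8, 100
W_hh = torch.randn(n, n) * 0.1  # small recurrent weights: the vanishing regime
W_xh = torch.randn(n, d) * 0.1
x = torch.randn(T, d, requires_grad=True)

h = torch.zeros(n)
for t in range(T):
    h = torch.tanh(W_hh @ h + W_xh @ x[t])

h.sum().backward()
print(x.grad[0].norm())   # gradient reaching the first input: vanishingly small
print(x.grad[-1].norm())  # gradient at the last input: healthy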
The LSTM introduces a cell state $c_t$ that updates as:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

- $c_t$ — cell state at time $t$
- $f_t$ — forget gate — values in $(0, 1)$, controls how much of $c_{t-1}$ to retain
- $c_{t-1}$ — previous cell state
- $i_t$ — input gate — values in $(0, 1)$, controls how much new info to add
- $\tilde{c}_t$ — candidate cell state — new information to potentially add
The gradient of $c_t$ with respect to $c_{t-1}$ is just $\mathrm{diag}(f_t)$ — a diagonal matrix of forget gate values. When $f_t \approx 1$ (remember everything), this is near-identity: gradients flow backward through the cell state with minimal change. No matrix multiplication, no tanh derivative. This is the gradient highway.
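You can confirm the Jacobian claim directly with autograd. A tiny sketch with arbitrary gate values:

import torch
from torch.autograd.functional import jacobian

f = torch.tensor([0.9, 0.5, 0.99])        # forget gate (arbitrary values)
i = torch.tensor([0.3, 0.7, 0.1])         # input gate (arbitrary)
c_tilde = torch.tensor([0.4, -0.2, 0.8])  # candidate (arbitrary)

def cell_update(c_prev):
    return f * c_prev + i * c_tilde

J = jacobian(cell_update, torch.zeros(3))
print(J)  # a diagonal matrix: diag([0.9, 0.5, 0.99]), exactly diag(f_t)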
The Four Gate Equations
The LSTM has four components, all computed from the concatenation $[h_{t-1}; x_t]$ (the previous hidden state stacked with the current input):

Forget gate — what fraction of the cell state to erase:

$$f_t = \sigma(W_f [h_{t-1}; x_t] + b_f)$$

- $f_t$ — forget gate vector, values in $(0, 1)$
- $W_f$ — forget gate weight matrix
- $b_f$ — forget gate bias

Input gate — what fraction of the candidate to write:

$$i_t = \sigma(W_i [h_{t-1}; x_t] + b_i)$$

- $i_t$ — input gate vector

Candidate cell state — what new information to potentially store:

$$\tilde{c}_t = \tanh(W_c [h_{t-1}; x_t] + b_c)$$

- $\tilde{c}_t$ — candidate values for cell state update

Output gate — what to expose from the cell state:

$$o_t = \sigma(W_o [h_{t-1}; x_t] + b_o)$$

- $o_t$ — output gate vector

Then the updates:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

- $h_t$ — LSTM hidden state — filtered view of cell state
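Putting the four equations together, here is a from-scratch cell. This is a pedagogical sketch; production implementations (including nn.LSTM) fuse the four weight matrices into one big matrix multiplication for speed, but the math is identical:

import torch
import torch.nn as nn

class LSTMCellScratch(nn.Module):
    """One LSTM step, written to mirror the four gate equations."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        concat_size = hidden_size + input_size
        self.W_f = nn.Linear(concat_size, hidden_size)  # forget gate
        self.W_i = nn.Linear(concat_size, hidden_size)  # input gate
        self.W_c = nn.Linear(concat_size, hidden_size)  # candidate
        self.W_o = nn.Linear(concat_size, hidden_size)  # output gate

    def forward(self, x_t, h_prev, c_prev):
        z = torch.cat([h_prev, x_t], dim=-1)  # [h_{t-1}; x_t]
        f = torch.sigmoid(self.W_f(z))        # fraction of c_{t-1} to keep
        i = torch.sigmoid(self.W_i(z))        # fraction of candidate to write
        c_tilde = torch.tanh(self.W_c(z))     # new candidate information
        o = torch.sigmoid(self.W_o(z))        # what to expose from the cell
        c = f * c_prev + i * c_tilde          # additive cell-state update
        h = o * torch.tanh(c)                 # filtered view of the cell state
        return h, c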
Parameter Count
Each gate has its own weight matrix applied to $[h_{t-1}; x_t]$. If the hidden size is $n$ and input size is $d$, each weight matrix is $n \times (n + d)$. With 4 matrices (plus 4 bias vectors):

$$\text{params} = 4n(n + d) + 4n$$

For $n = 128$, $d = 50$: $4 \cdot 128 \cdot 178 + 4 \cdot 128 = 91{,}648$ — roughly 4× more than a vanilla RNN.
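You can check the count against PyTorch. One wrinkle to know about: nn.LSTM stores two bias vectors per layer (bias_ih and bias_hh, a cuDNN-compatibility detail), so its total is $4n(n + d) + 8n$ rather than $4n(n + d) + 4n$:

import torch.nn as nn

n, d = 128, 50
lstm = nn.LSTM(input_size=d, hidden_size=n)
print(sum(p.numel() for p in lstm.parameters()))  # 92160 = 4n(n+d) + 8n
print(4 * n * (n + d) + 4 * n)                    # 91648, the formula above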
Worked Forward Pass
Let's trace one LSTM step. Use $n = 2$, $d = 1$, starting from $h_0 = [0, 0]$, $c_0 = [0, 0]$.

Current input: say $x_1 = [1.0]$. Concatenation: $[h_0; x_1] = [0, 0, 1.0]$.
Forget gate (suppose pre-activation = $[1.0, 1.0]$; since $c_0 = [0, 0]$, the exact value is immaterial on this first step):

$$f_1 = \sigma([1.0, 1.0]) = [0.731, 0.731]$$

Input gate (suppose pre-activation = $[-0.5, 1.2]$):

$$i_1 = \sigma([-0.5, 1.2]) = [0.378, 0.769]$$

Candidate (suppose pre-activation = $[0.6, -0.3]$):

$$\tilde{c}_1 = \tanh([0.6, -0.3]) = [0.537, -0.291]$$

Cell state update:

$$c_1 = f_1 \odot c_0 + i_1 \odot \tilde{c}_1 = [0.731, 0.731] \odot [0, 0] + [0.378, 0.769] \odot [0.537, -0.291] = [0.203, -0.224]$$

Output gate (suppose pre-activation = $[0.3, -0.7]$):

$$o_1 = \sigma([0.3, -0.7]) = [0.574, 0.332]$$

Hidden state:

$$h_1 = o_1 \odot \tanh(c_1) = [0.574, 0.332] \odot [0.200, -0.220] = [0.115, -0.073]$$
The cell state and hidden state carry forward to the next step.
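The arithmetic is easy to verify in PyTorch by plugging in the same supposed pre-activations:

import torch

c0 = torch.zeros(2)
f1 = torch.sigmoid(torch.tensor([1.0, 1.0]))     # [0.731, 0.731]
i1 = torch.sigmoid(torch.tensor([-0.5, 1.2]))    # [0.378, 0.769]
c_tilde1 = torch.tanh(torch.tensor([0.6, -0.3])) # [0.537, -0.291]
o1 = torch.sigmoid(torch.tensor([0.3, -0.7]))    # [0.574, 0.332]

c1 = f1 * c0 + i1 * c_tilde1  # tensor([ 0.2028, -0.2239])
h1 = o1 * torch.tanh(c1)      # tensor([ 0.1149, -0.0731])
print(c1, h1)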
The Gradient Highway in Action
During backpropagation, the gradient of the loss with respect to $c_{t-1}$ through the cell state path is:

$$\frac{\partial c_t}{\partial c_{t-1}} = \mathrm{diag}(f_t)$$

As long as $f_t$ is close to 1, this is close to the identity matrix — gradients pass through with minimal modification. Over 100 steps:

$$\frac{\partial c_{100}}{\partial c_0} = \prod_{t=1}^{100} \mathrm{diag}(f_t)$$

If forget gate values are 0.95: $0.95^{100} \approx 0.006$. Still small — but the key difference from vanilla RNNs is that the forget gate learns to be close to 1 when long-range memory is needed. A vanilla RNN's Jacobian depends on $W_{hh}$ and $\tanh'$ — it can't selectively decide to preserve certain memory components over long distances.
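The contrast shows up directly in gradient norms. A sketch with toy dimensions (the forget-gate bias is pushed positive here so $f_t$ starts near 1; PyTorch's default initialization leaves it near 0.5):

import torch
import torch.nn as nn

torch.manual_seed(0)
T, d, n = 100, 8, 16
x = torch.randn(1, T, d, requires_grad=True)

rnn = nn.RNN(d, n, batch_first=True)
lstm = nn.LSTM(d, n, batch_first=True)
# PyTorch packs gate parameters in input|forget|candidate|output order,
# so indices [n:2n] of each bias vector belong to the forget gate.
for name in ("bias_ih_l0", "bias_hh_l0"):
    getattr(lstm, name).data[n:2 * n] = 2.0

for label, net in (("vanilla RNN", rnn), ("LSTM", lstm)):
    out, _ = net(x)
    out[0, -1].sum().backward()               # loss at the final time step
    print(label, x.grad[0, 0].norm().item())  # gradient reaching step 1
    x.grad = None                             # reset between models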
PyTorch Implementation
import torch
import torch.nn as nn

lstm = nn.LSTM(
input_size=50, # d
hidden_size=128, # n
num_layers=1,
batch_first=True
)
x = torch.randn(32, 20, 50) # [batch, seq_len, input_size]
h0 = torch.zeros(1, 32, 128) # initial hidden state
c0 = torch.zeros(1, 32, 128) # initial cell state
output, (hn, cn) = lstm(x, (h0, c0))
# output: [32, 20, 128] — hidden states at every step
# hn, cn: [1, 32, 128] — final hidden and cell states
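One detail worth knowing: for a single-layer, unidirectional LSTM, the final time step of output contains the same values as hn:

assert torch.allclose(output[:, -1, :], hn[0])  # last step of output == final hidden state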
The LSTM is powerful but adds complexity: 4 matrix multiplications per step instead of 1, plus more hyperparameters to tune. The next lesson covers the GRU — a streamlined version that achieves similar results with fewer gates.