Recurrent Networks
Lesson 7 ⏱ 12 min

Sequence-to-sequence models

Video coming soon

Sequence-to-Sequence: Encoder, Decoder, and the Bottleneck

Builds the seq2seq architecture from two LSTMs, explains teacher forcing during training vs. autoregressive generation at inference, and demonstrates the information bottleneck problem that motivates attention.

⏱ ~7 min

🧮

Quick refresher

Conditional probability

P(B|A) is the probability of B given that A is known. In language modeling, P(wₜ|w₁,...,wₜ₋₁) is the probability of word t given all previous words. A seq2seq model learns this conditional distribution for the target language, conditioned on the source sentence.

Example

P("faim" | "J'", "ai") might be high if the model has learned French.

P("faim" | "J'", "ai", "très") is even higher because "très faim" is a common phrase.

The context (everything to the left) shapes the probability.
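To make the chain-rule decomposition concrete, here is a minimal sketch in plain Python. The probability values in `cond_prob` are invented for illustration, not taken from any trained model:

```python
import math

# Hypothetical conditional probabilities a model might assign (illustrative only).
cond_prob = {
    ("<SOS>",): {"J'": 0.6},
    ("<SOS>", "J'"): {"ai": 0.7},
    ("<SOS>", "J'", "ai"): {"faim": 0.5},
}

def sequence_log_prob(tokens):
    """log P(w_1, ..., w_T) = sum over t of log P(w_t | w_1, ..., w_{t-1})."""
    context = ("<SOS>",)
    total = 0.0
    for tok in tokens:
        total += math.log(cond_prob[context][tok])
        context = context + (tok,)
    return total

lp = sequence_log_prob(["J'", "ai", "faim"])
print(round(math.exp(lp), 2))  # 0.6 * 0.7 * 0.5 = 0.21
```

Summing log-probabilities rather than multiplying raw probabilities is the standard trick to avoid numerical underflow on long sequences.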

The RNN, LSTM, and GRU all process one sequence as input and produce one output (either at each step, or a single summary at the end). Many real problems require something more: a sequence in, a different sequence out. Translation, summarization, transcription, code generation — all map one sequence to another, often of different length.

Sequence-to-sequence models are behind machine translation, text summarization, and voice assistants. The encoder-decoder architecture introduced here is also the direct ancestor of GPT and BERT: both split the problem into encoding context and generating output — a pattern that dominated NLP for years before attention changed everything.

The sequence-to-sequence (seq2seq) framework, introduced by Sutskever et al. (2014), provides an elegant structure for this: split the problem into two halves.

The Architecture

Encoder: an RNN (or LSTM/GRU) that reads the input sequence x_1, x_2, \ldots, x_{T_x} and produces a context vector:

h_t^\text{enc} = \text{LSTM}(h_{t-1}^\text{enc}, x_t)

h_t^\text{enc}: encoder hidden state at step t
x_t: source token at step t

c = h_{T_x}^\text{enc}

c: context vector, the encoder's final hidden state

Decoder: an RNN (or LSTM/GRU) that generates the output sequence y_1, y_2, \ldots, y_{T_y} conditioned on c:

s_t = \text{LSTM}(s_{t-1}, [y_{t-1}; c])

s_t: decoder hidden state at step t
y_{t-1}: previously generated output token

P(y_t \mid y_1, \ldots, y_{t-1}, c) = \text{softmax}(W_y s_t + b_y)

P(y_t \mid \ldots): probability distribution over vocabulary for next token

The decoder is initialized with the context vector: s_0 = c. A special start token \langle\text{SOS}\rangle is fed as y_0 to kick off generation. The decoder continues until it outputs an \langle\text{EOS}\rangle (end of sequence) token.
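The generation loop can be sketched as follows. Here `decoder_step` is a hard-coded stub standing in for one real LSTM step plus softmax/argmax, so the "model" is purely an assumption for illustration; only the SOS/EOS control flow is the point:

```python
# Greedy autoregressive decoding loop (sketch).
SOS, EOS = "<SOS>", "<EOS>"

def decoder_step(prev_token, state):
    # Stub transition table standing in for P(y_t | ...) = softmax(W_y s_t + b_y)
    # followed by argmax. A real implementation would run the LSTM here.
    table = {SOS: "J'", "J'": "ai", "ai": "faim", "faim": EOS}
    return table[prev_token], state

def greedy_decode(context, max_len=10):
    state = context           # s_0 = c
    token = SOS               # y_0 = <SOS>
    output = []
    for _ in range(max_len):  # length cap in case <EOS> is never produced
        token, state = decoder_step(token, state)
        if token == EOS:
            break
        output.append(token)
    return output

print(greedy_decode(context=None))  # the tokens "J'", "ai", "faim"
```

Note the `max_len` cap: real decoders always bound generation length, since nothing guarantees the model ever emits \langle\text{EOS}\rangle.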

Translation Example

Source: "I am hungry" (3 tokens)

Encoder processes:

  • Step 1: h_1 = \text{LSTM}(h_0, \text{embed}(\text{"I"}))
  • Step 2: h_2 = \text{LSTM}(h_1, \text{embed}(\text{"am"}))
  • Step 3: h_3 = \text{LSTM}(h_2, \text{embed}(\text{"hungry"}))
  • Context vector: c = h_3, a 512-dimensional summary of "I am hungry"

Decoder generates:

  • Step 1: s_0 = c, input = \langle\text{SOS}\rangle
  • Produces distribution over French vocabulary → samples/argmaxes "J'"
  • Step 2: s_1 = \text{LSTM}(s_0, [\text{embed}(\text{"J'"}); c])
  • Produces distribution → "ai"
  • Step 3 → "faim"
  • Step 4 → \langle\text{EOS}\rangle: stop

Output: "J' ai faim" ✓

Teacher Forcing vs Autoregressive Generation

During training, there's a choice: what do we feed as y_{t-1} to the decoder?

Teacher forcing: always feed the true target token from the training set.

Advantages: stable gradients (decoder never sees compounding errors), faster convergence.

Disadvantage: the decoder has never learned to recover from its own mistakes. At inference time, an error at step 3 corrupts the input to step 4, which corrupts step 5, etc. This is called exposure bias — the model is trained with teacher inputs but tested with its own outputs.

Free running (autoregressive): feed the decoder's own previous prediction.

Advantages: matches inference conditions exactly.

Disadvantage: early in training, predictions are wrong → decoder sees bad inputs → gradients are noisy and unstable.

Scheduled sampling (Bengio et al., 2015): gradually transition from teacher forcing to free running over training. Start with 100% teacher forcing; by the end, mostly free running. This bridges the gap.
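A minimal sketch of the scheduled-sampling decision, assuming a linear decay schedule (Bengio et al. also describe exponential and inverse-sigmoid decays); the helper names and the floor value are made up for illustration:

```python
import random

def scheduled_sampling_inputs(targets, predictions, teacher_prob, rng=random):
    """Choose the decoder input at each step: the ground-truth token with
    probability `teacher_prob`, otherwise the model's own previous prediction."""
    return [t if rng.random() < teacher_prob else p
            for t, p in zip(targets, predictions)]

def teacher_prob_at(epoch, total_epochs, floor=0.1):
    """Linear decay from pure teacher forcing toward mostly free running."""
    return max(floor, 1.0 - epoch / total_epochs)

print(teacher_prob_at(0, 50))   # 1.0 -> pure teacher forcing at the start
print(teacher_prob_at(50, 50))  # 0.1 -> mostly free running at the end
```

In practice the sampling decision is often made per time step inside the decoding loop rather than per sequence, so a single training example can mix teacher and model inputs.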

The Information Bottleneck

The seq2seq architecture has a fundamental limitation: all information about the source sequence is compressed into the single context vector c = h_{T_x}^\text{enc}. This is a fixed-size vector — 256 or 512 numbers, regardless of source length.

For short sentences (5-10 words), this is fine. For long documents (200+ words), the encoder must cram all the meaning, structure, and content into the same number of slots. Detail is inevitably lost. Empirically, machine translation scores (e.g. BLEU) drop sharply for source sentences longer than 20-30 words.
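The bottleneck is easy to see directly in PyTorch: whatever the source length, the final hidden state has the same fixed shape. A small sketch (input and hidden sizes chosen arbitrarily):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=512, batch_first=True)

short = torch.randn(1, 5, 8)    # 5-token source (pre-embedded, toy data)
long = torch.randn(1, 200, 8)   # 200-token source

_, (h_short, _) = lstm(short)
_, (h_long, _) = lstm(long)

# The context vector c = h_{T_x} has the same shape either way:
print(h_short.shape, h_long.shape)  # torch.Size([1, 1, 512]) torch.Size([1, 1, 512])
```

Forty times more input, the same 512 slots of summary: that is the bottleneck attention was invented to remove.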

Practical Notes

# Minimal seq2seq encoder/decoder in PyTorch
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)

    def forward(self, src):  # src: [B, T_src]
        x = self.embed(src)
        _, (h, c) = self.lstm(x)  # h, c: [1, B, hidden]
        return h, c  # context vectors

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt_token, h, c):  # one step at a time
        x = self.embed(tgt_token.unsqueeze(1))  # [B, 1, embed]
        out, (h, c) = self.lstm(x, (h, c))
        logits = self.fc(out.squeeze(1))  # [B, vocab_size]
        return logits, h, c
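One way to wire the encoder and decoder into a single teacher-forced training step is sketched below. The classes are repeated so the snippet runs standalone; the vocabulary size, dimensions, and toy random batch are invented for illustration:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):  # same as the listing above
    def __init__(self, vocab_size, embed_dim, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
    def forward(self, src):
        _, (h, c) = self.lstm(self.embed(src))
        return h, c

class Decoder(nn.Module):  # same as the listing above
    def __init__(self, vocab_size, embed_dim, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
    def forward(self, tgt_token, h, c):
        x = self.embed(tgt_token.unsqueeze(1))
        out, (h, c) = self.lstm(x, (h, c))
        return self.fc(out.squeeze(1)), h, c

# One teacher-forced training step on a toy batch (all sizes are made up).
torch.manual_seed(0)
VOCAB, EMBED, HIDDEN = 50, 16, 32
enc, dec = Encoder(VOCAB, EMBED, HIDDEN), Decoder(VOCAB, EMBED, HIDDEN)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
loss_fn = nn.CrossEntropyLoss()

src = torch.randint(0, VOCAB, (4, 7))   # [B=4, T_src=7]
tgt = torch.randint(0, VOCAB, (4, 5))   # [B=4, T_tgt=5]; tgt[:, 0] plays <SOS>

h, c = enc(src)                          # encoder states initialize the decoder
loss = 0.0
for t in range(tgt.size(1) - 1):
    logits, h, c = dec(tgt[:, t], h, c)  # teacher forcing: feed the TRUE token
    loss = loss + loss_fn(logits, tgt[:, t + 1])

opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))  # scalar training loss for this batch
```

Note one simplification shared with the listing above: the decoder receives the context only through its initial LSTM state, rather than concatenating c to every input as in the equation s_t = LSTM(s_{t-1}, [y_{t-1}; c]); both variants appear in practice.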

The seq2seq architecture is the direct ancestor of the transformer. Its central innovation — an encoder that builds a representation, a decoder that consumes it — appears in every modern language model. The only thing that changes is how the decoder accesses the encoder's representation: through a single vector cc (seq2seq), or through direct attention to all encoder states (transformer).

Quiz

1 / 3

In a seq2seq model, the encoder's role is to...