Recurrent Networks
Lesson 7 ⏱ 12 min

Sequence-to-sequence models

Video coming soon

Sequence-to-Sequence: Encoder, Decoder, and the Bottleneck

Builds the seq2seq architecture from two LSTMs, explains teacher forcing during training vs. autoregressive generation at inference, and demonstrates the information bottleneck problem that motivates attention.

⏱ ~7 min

🧮

Quick refresher

Conditional probability

P(B|A) is the probability of B given that A is known. In language modeling, P(wₜ|w₁,...,wₜ₋₁) is the probability of word t given all previous words. A seq2seq model learns this conditional distribution for the target language, conditioned on the source sentence.

Example

P("faim" | "J'", "ai") might be high if the model has learned French.

P("faim" | "J'", "ai", "très") is even higher because "très faim" is a common phrase.

The context (everything to the left) shapes the probability.
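To make the chain-rule decomposition concrete, here is a minimal sketch in plain Python. The probability values in `cond_prob` are invented for illustration, not taken from any trained model:

```python
import math

# Hypothetical conditional probabilities a model might assign (illustrative only).
cond_prob = {
    ("<SOS>",): {"J'": 0.6},
    ("<SOS>", "J'"): {"ai": 0.7},
    ("<SOS>", "J'", "ai"): {"faim": 0.5},
}

def sequence_log_prob(tokens):
    """log P(w_1, ..., w_T) = sum over t of log P(w_t | w_1, ..., w_{t-1})."""
    context = ("<SOS>",)
    total = 0.0
    for tok in tokens:
        total += math.log(cond_prob[context][tok])
        context = context + (tok,)
    return total

lp = sequence_log_prob(["J'", "ai", "faim"])
print(round(math.exp(lp), 2))  # 0.6 * 0.7 * 0.5 = 0.21
```

Summing log-probabilities rather than multiplying raw probabilities is the standard trick to avoid numerical underflow on long sequences.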

The RNN, LSTM, and GRU all process one sequence as input and produce one output (either at each step, or a single summary at the end). Many real problems require something more: a sequence in, a different sequence out. Translation, summarization, transcription, code generation — all map one sequence to another, often of different length.

Sequence-to-sequence models are behind machine translation, text summarization, and voice assistants. The encoder-decoder architecture introduced here is also the direct ancestor of GPT and BERT: both split the problem into encoding context and generating output — a pattern that dominated NLP for years before attention changed everything.

The sequence-to-sequence (seq2seq) framework, introduced by Sutskever et al. (2014), provides an elegant structure for this: split the problem into two halves.

The Architecture

Encoder: an RNN (or LSTM/GRU) that reads the input sequence x_1, x_2, \ldots, x_{T_x} and produces a context vector:

h_t^\text{enc} = \text{LSTM}(h_{t-1}^\text{enc}, x_t)

h_t^\text{enc}: encoder hidden state at step t
x_t: source token at step t

c = h_{T_x}^\text{enc}

c: context vector, the encoder's final hidden state

Decoder: an RNN (or LSTM/GRU) that generates the output sequence y_1, y_2, \ldots, y_{T_y} conditioned on c:

s_t = \text{LSTM}(s_{t-1}, [y_{t-1}; c])

s_t: decoder hidden state at step t
y_{t-1}: previously generated output token

P(y_t \mid y_1, \ldots, y_{t-1}, c) = \text{softmax}(W_y s_t + b_y)

P(y_t \mid \ldots): probability distribution over vocabulary for next token

The decoder is initialized with the context vector: s_0 = c. A special start token \langle\text{SOS}\rangle is fed as y_0 to kick off generation. The decoder continues until it outputs an \langle\text{EOS}\rangle (end of sequence) token.
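The generation loop can be sketched as follows. Here `decoder_step` is a hard-coded stub standing in for one real LSTM step plus softmax/argmax, so the "model" is purely an assumption for illustration; only the SOS/EOS control flow is the point:

```python
# Greedy autoregressive decoding loop (sketch).
SOS, EOS = "<SOS>", "<EOS>"

def decoder_step(prev_token, state):
    # Stub transition table standing in for P(y_t | ...) = softmax(W_y s_t + b_y)
    # followed by argmax. A real implementation would run the LSTM here.
    table = {SOS: "J'", "J'": "ai", "ai": "faim", "faim": EOS}
    return table[prev_token], state

def greedy_decode(context, max_len=10):
    state = context           # s_0 = c
    token = SOS               # y_0 = <SOS>
    output = []
    for _ in range(max_len):  # length cap in case <EOS> is never produced
        token, state = decoder_step(token, state)
        if token == EOS:
            break
        output.append(token)
    return output

print(greedy_decode(context=None))  # the tokens "J'", "ai", "faim"
```

Note the `max_len` cap: real decoders always bound generation length, since nothing guarantees the model ever emits \langle\text{EOS}\rangle.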

Translation Example

Source: "I am hungry" (3 tokens)

Encoder processes:

  • Step 1: h_1 = \text{LSTM}(h_0, \text{embed}(\text{"I"}))
  • Step 2: h_2 = \text{LSTM}(h_1, \text{embed}(\text{"am"}))
  • Step 3: h_3 = \text{LSTM}(h_2, \text{embed}(\text{"hungry"}))
  • Context vector: c = h_3, a 512-dimensional summary of "I am hungry"

Decoder generates:

  • Step 1: s_0 = c, input = \langle\text{SOS}\rangle
  • Produces distribution over French vocabulary → samples/argmaxes "J'"
  • Step 2: s_1 = \text{LSTM}(s_0, [\text{embed}(\text{"J'"}); c])
  • Produces distribution → "ai"
  • Step 3 → "faim"
  • Step 4 → \langle\text{EOS}\rangle: stop

Output: "J' ai faim" ✓

Teacher Forcing vs Autoregressive Generation

During training, there's a choice: what do we feed as y_{t-1} to the decoder?

Teacher forcing: always feed the true target token from the training set.

Advantages: stable gradients (decoder never sees compounding errors), faster convergence.

Disadvantage: the decoder has never learned to recover from its own mistakes. At inference time, an error at step 3 corrupts the input to step 4, which corrupts step 5, etc. This is called exposure bias — the model is trained with teacher inputs but tested with its own outputs.

Free running (autoregressive): feed the decoder's own previous prediction.

Advantages: matches inference conditions exactly.

Disadvantage: early in training, predictions are wrong → decoder sees bad inputs → gradients are noisy and unstable.

Scheduled sampling (Bengio et al., 2015): gradually transition from teacher forcing to free running over training. Start with 100% teacher forcing; by the end, mostly free running. This bridges the gap.
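A minimal sketch of the scheduled-sampling decision, assuming a linear decay schedule (Bengio et al. also describe exponential and inverse-sigmoid decays); the helper names and the floor value are made up for illustration:

```python
import random

def scheduled_sampling_inputs(targets, predictions, teacher_prob, rng=random):
    """Choose the decoder input at each step: the ground-truth token with
    probability `teacher_prob`, otherwise the model's own previous prediction."""
    return [t if rng.random() < teacher_prob else p
            for t, p in zip(targets, predictions)]

def teacher_prob_at(epoch, total_epochs, floor=0.1):
    """Linear decay from pure teacher forcing toward mostly free running."""
    return max(floor, 1.0 - epoch / total_epochs)

print(teacher_prob_at(0, 50))   # 1.0 -> pure teacher forcing at the start
print(teacher_prob_at(50, 50))  # 0.1 -> mostly free running at the end
```

In practice the sampling decision is often made per time step inside the decoding loop rather than per sequence, so a single training example can mix teacher and model inputs.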

The Information Bottleneck

The seq2seq architecture has a fundamental limitation: all information about the source sequence is compressed into the single context vector c = h_{T_x}^\text{enc}. This is a fixed-size vector — 256 or 512 numbers, regardless of source length.

For short sentences (5-10 words), this is fine. For long documents (200+ words), the encoder must cram all the meaning, structure, and content into the same number of slots. Detail is inevitably lost. Empirically, machine translation scores (e.g. BLEU) drop sharply for source sentences longer than 20-30 words.
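The bottleneck is easy to see directly in PyTorch: whatever the source length, the final hidden state has the same fixed shape. A small sketch (input and hidden sizes chosen arbitrarily):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=512, batch_first=True)

short = torch.randn(1, 5, 8)    # 5-token source (pre-embedded, toy data)
long = torch.randn(1, 200, 8)   # 200-token source

_, (h_short, _) = lstm(short)
_, (h_long, _) = lstm(long)

# The context vector c = h_{T_x} has the same shape either way:
print(h_short.shape, h_long.shape)  # torch.Size([1, 1, 512]) torch.Size([1, 1, 512])
```

Forty times more input, the same 512 slots of summary: that is the bottleneck attention was invented to remove.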

Practical Notes

# Minimal seq2seq encoder/decoder in PyTorch
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)

    def forward(self, src):  # src: [B, T_src]
        x = self.embed(src)
        _, (h, c) = self.lstm(x)  # h, c: [1, B, hidden]
        return h, c  # context vectors

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt_token, h, c):  # one step at a time
        x = self.embed(tgt_token.unsqueeze(1))  # [B, 1, embed]
        out, (h, c) = self.lstm(x, (h, c))
        logits = self.fc(out.squeeze(1))  # [B, vocab_size]
        return logits, h, c
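One way to wire the encoder and decoder into a single teacher-forced training step is sketched below. The classes are repeated so the snippet runs standalone; the vocabulary size, dimensions, and toy random batch are invented for illustration:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):  # same as the listing above
    def __init__(self, vocab_size, embed_dim, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
    def forward(self, src):
        _, (h, c) = self.lstm(self.embed(src))
        return h, c

class Decoder(nn.Module):  # same as the listing above
    def __init__(self, vocab_size, embed_dim, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
    def forward(self, tgt_token, h, c):
        x = self.embed(tgt_token.unsqueeze(1))
        out, (h, c) = self.lstm(x, (h, c))
        return self.fc(out.squeeze(1)), h, c

# One teacher-forced training step on a toy batch (all sizes are made up).
torch.manual_seed(0)
VOCAB, EMBED, HIDDEN = 50, 16, 32
enc, dec = Encoder(VOCAB, EMBED, HIDDEN), Decoder(VOCAB, EMBED, HIDDEN)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
loss_fn = nn.CrossEntropyLoss()

src = torch.randint(0, VOCAB, (4, 7))   # [B=4, T_src=7]
tgt = torch.randint(0, VOCAB, (4, 5))   # [B=4, T_tgt=5]; tgt[:, 0] plays <SOS>

h, c = enc(src)                          # encoder states initialize the decoder
loss = 0.0
for t in range(tgt.size(1) - 1):
    logits, h, c = dec(tgt[:, t], h, c)  # teacher forcing: feed the TRUE token
    loss = loss + loss_fn(logits, tgt[:, t + 1])

opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))  # scalar training loss for this batch
```

Note one simplification shared with the listing above: the decoder receives the context only through its initial LSTM state, rather than concatenating c to every input as in the equation s_t = LSTM(s_{t-1}, [y_{t-1}; c]); both variants appear in practice.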

The seq2seq architecture is the direct ancestor of the transformer. Its central innovation — an encoder that builds a representation, a decoder that consumes it — appears in every modern language model. The only thing that changes is how the decoder accesses the encoder's representation: through a single vector cc (seq2seq), or through direct attention to all encoder states (transformer).

Quiz

1 / 3

In a seq2seq model, the encoder's role is to...