Sequence-to-Sequence Models
Sequence-to-sequence (Seq2Seq) models are a class of deep learning architectures designed to map a variable-length input sequence to a variable-length output sequence. They form the backbone of many neural machine translation, text summarization, and conversational AI systems.
Definition & Overview
Sequence-to-sequence learning addresses tasks where the input and output lengths are not fixed and may differ significantly. Unlike traditional classification tasks that map input to a discrete label, Seq2Seq models generate sequences token-by-token, maintaining contextual coherence throughout the generation process.
Introduced prominently in the context of neural machine translation by Sutskever et al. (2014) and Cho et al. (2014), these models utilize an encoder-decoder framework, often enhanced with attention mechanisms to handle long-range dependencies.
Encoder-Decoder Architecture
The standard Seq2Seq architecture consists of two recurrent neural network (RNN) components:
1. The Encoder
The encoder processes the input sequence \( X = (x_1, x_2, ..., x_T) \) step-by-step. At each time step \( t \), it updates a hidden state \( h_t \) based on the current input and the previous hidden state:
\( h_t = f(x_t, h_{t-1}) \)
Here \( f \) is the recurrent cell, for example an LSTM or GRU.
The final hidden state \( h_T \) (or a derived representation) is often used as a context vector \( c \) that summarizes the entire input sequence. This vector is passed to the decoder.
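The encoder recurrence can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses a vanilla tanh cell rather than an LSTM or GRU, and the names `rnn_encoder`, `W_x`, `W_h`, and `b` are hypothetical, not taken from any particular library.

```python
import math
import random

def rnn_encoder(inputs, W_x, W_h, b):
    """Run a vanilla tanh RNN over `inputs` and return every hidden state."""
    hidden_size = len(b)
    h = [0.0] * hidden_size              # h_0 is initialised to zeros
    states = []
    for x in inputs:                     # one update per input vector x_t
        # h_t = tanh(W_x x_t + W_h h_{t-1} + b); `h` on the right is h_{t-1}
        h = [
            math.tanh(
                sum(W_x[i][j] * x[j] for j in range(len(x)))
                + sum(W_h[i][k] * h[k] for k in range(hidden_size))
                + b[i]
            )
            for i in range(hidden_size)
        ]
        states.append(h)
    return states

random.seed(0)
input_size, hidden_size = 3, 4
# Random toy weights (assumption: a trained model would learn these)
W_x = [[random.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
W_h = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
b = [0.0] * hidden_size

X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # toy sequence, T = 3
encoder_states = rnn_encoder(X, W_x, W_h, b)
context = encoder_states[-1]             # c = h_T, the final hidden state
```

Whatever the input length, `context` has a fixed size (`hidden_size`), which is what makes it convenient to hand to a decoder.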
2. The Decoder
The decoder generates the output sequence \( Y = (y_1, y_2, ..., y_{T'}) \) conditioned on the context vector \( c \) and previously generated tokens. It predicts the probability distribution of the next token:
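P(y_t | y_{<t}, c) = softmax(W_o h'_t)
Where \( h'_t \) is the decoder's hidden state at time \( t \), \( y_{<t} \) denotes the previously generated tokens, and \( W_o \) is a learned projection from the hidden state onto the vocabulary.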
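A single decoder prediction step might look like the following sketch. It assumes the common design in which the decoder hidden state is projected linearly onto the vocabulary and normalised with a softmax; the matrix `W_o`, the toy vocabulary, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(h_dec, W_o):
    """Project the decoder state onto the vocabulary and normalise."""
    logits = [sum(w * h for w, h in zip(row, h_dec)) for row in W_o]
    return softmax(logits)

vocab = ["<eos>", "le", "chat"]                   # hypothetical toy vocabulary
W_o = [[0.1, -0.2], [0.4, 0.3], [-0.3, 0.5]]      # one row per vocabulary entry
h_dec = [1.0, 2.0]                                # decoder hidden state h'_t

probs = next_token_distribution(h_dec, W_o)
# Greedy choice: pick the most probable token
greedy_token = vocab[max(range(len(probs)), key=probs.__getitem__)]
```

In a full decoder, the chosen token would be fed back as the next input and the step repeated until an end-of-sequence token is produced.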
The encoder maps the input sequence into a fixed-dimensional semantic space, while the decoder traverses this space to reconstruct or transform the information into a target sequence.
The Attention Mechanism
Early Seq2Seq models suffered from the information bottleneck problem, where all input information had to be compressed into a single context vector. This limited performance on long sequences.
Bahdanau et al. (2015) introduced Attention, allowing the decoder to access all encoder hidden states at each generation step. Instead of relying solely on \( c \), the decoder computes a weighted sum of encoder states:
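e_{t,i} = score(h'_t, h_i)
a_{t,i} = exp(e_{t,i}) / Σ_j exp(e_{t,j})
c_t = Σ_i a_{t,i} * h_i
Here \( h_i \) is the encoder hidden state at input position \( i \), and \( a_{t,i} \) is the weight the decoder places on that position when generating output token \( t \).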
This dynamic focus enables the model to align relevant parts of the input with the current output token, dramatically improving translation quality and gradient flow.
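The weighted sum can be made concrete with a short sketch. One simplification is swapped in here: the score is a plain dot product between the decoder state and each encoder state, not the additive scoring network used by Bahdanau et al. The function name `attend` and the toy vectors are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return attention weights and the context vector for one decoder step."""
    # One score per encoder position (dot-product scoring, an assumption)
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: weighted sum of encoder states
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy h_1..h_3
decoder_state = [2.0, 0.0]                              # toy h'_t
weights, context = attend(decoder_state, encoder_states)
```

With these toy values the first and third encoder states receive equal weight and the second receives less, because its dot product with the decoder state is smaller.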
From RNNs to Transformers
While RNN-based Seq2Seq models with attention achieved state-of-the-art results, they faced challenges with parallelization and long-range dependency modeling.
The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), replaced recurrence entirely with self-attention mechanisms. Transformers are technically a type of Seq2Seq architecture but operate via parallel matrix operations, enabling:
- Massive parallelization during training.
- Direct connections between any pair of positions, which eases the modeling of long-range dependencies.
- Better gradient propagation.
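To illustrate why self-attention parallelizes well, here is a minimal sketch of scaled dot-product self-attention. It makes a strong simplifying assumption: queries, keys, and values are all the raw inputs, with no learned projections and a single head. The function name `self_attention` is hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (an assumption)."""
    d = len(X[0])
    outputs = []
    # Each query position depends only on X, never on another position's
    # output, so these iterations could all run at the same time.
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return outputs

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy sequence of three vectors
Y = self_attention(X)
```

Contrast this with a recurrent encoder, where step \( t \) cannot start until step \( t-1 \) has finished.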
Many modern language models build on the Transformer foundation. T5 and BART keep the full encoder-decoder (Seq2Seq) structure, whereas GPT-style models use only the decoder stack and generate text autoregressively.
Applications
- Machine translation: translating text between languages (e.g., EN → FR). Seq2Seq models learn to map sentence structures across linguistic boundaries.
- Text summarization: abstractive summarization generates concise summaries that may introduce new wording, which requires generative Seq2Seq capabilities.
- Conversational AI: chatbots and virtual assistants use Seq2Seq models to map user utterances to coherent, context-aware responses.
- Speech recognition: mapping audio feature sequences to text sequences, often combined with CTC loss or attention-based decoders.
Training & Loss Functions
Seq2Seq models are typically trained using Teacher Forcing, where the ground-truth previous token, rather than the model's own prediction, is fed as decoder input during training. The loss is the cross-entropy between the predicted token distributions and the actual tokens, summed over the target sequence:
Loss = -Σ_{t=1}^{T'} log P(y_t | y_{<t}, X)
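The following sketch shows how this loss is accumulated under teacher forcing. The `toy_model` function is a hypothetical stand-in for a trained decoder: it simply favours the token that cyclically follows the last one in the prefix.

```python
import math

def toy_model(prefix):
    """Hypothetical decoder stand-in: returns P(. | prefix) over 3 tokens."""
    last = prefix[-1] if prefix else 0
    probs = [0.1, 0.1, 0.1]
    probs[(last + 1) % 3] = 0.8          # favour the cyclic successor
    return probs

def teacher_forced_loss(model, target):
    """Negative log-likelihood of `target`, conditioning on ground truth."""
    loss = 0.0
    for t, y_t in enumerate(target):
        # The prefix is the ground-truth y_{<t}, not the model's own output
        probs = model(target[:t])
        loss -= math.log(probs[y_t])
    return loss

target = [1, 2, 0]                       # toy target token ids
loss = teacher_forced_loss(toy_model, target)
```

Because every prefix comes from the reference sequence, all time steps can be scored without waiting for the model to generate anything.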
At inference time, Beam Search or Sampling strategies are used to generate sequences autoregressively.
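A compact sketch of beam search over a toy model is given below. It is illustrative only: `toy_model` is a hypothetical hand-written distribution, token id 0 is assumed to be the end-of-sequence marker, and no length normalisation is applied.

```python
import math

EOS = 0   # assumption: token id 0 marks end of sequence

def toy_model(prefix):
    """Hypothetical next-token distribution over tokens {0: <eos>, 1, 2}."""
    if len(prefix) >= 2:
        return [0.9, 0.05, 0.05]
    if not prefix:
        return [0.1, 0.5, 0.4]
    return [0.1, 0.3, 0.6] if prefix[-1] == 1 else [0.1, 0.8, 0.1]

def beam_search(model, beam_width=2, max_len=5):
    """Return the (log-probability, tokens) pair of the best hypothesis found."""
    beams = [(0.0, [])]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token
        candidates = []
        for score, tokens in beams:
            for tok, p in enumerate(model(tokens)):
                candidates.append((score + math.log(p), tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Keep the top `beam_width`; set aside those that just emitted EOS
        beams = []
        for score, tokens in candidates[:beam_width]:
            if tokens[-1] == EOS:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)   # keep unfinished hypotheses if max_len was hit
    return max(finished, key=lambda c: c[0])

best_score, best_tokens = beam_search(toy_model)
```

In this toy setup a greedy decoder would start with token 1, the single most probable first token, and end with a sequence of probability 0.27, while the beam of width 2 also keeps token 2 alive and finds a sequence of probability 0.288.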
Exposure bias is a known issue with this setup: the model sees ground-truth tokens during training but its own predictions during inference, so early mistakes can accumulate. Techniques like Scheduled Sampling or Reinforcement Learning can mitigate this.
References
- Cho, K. et al. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078.
- Sutskever, I. et al. (2014). Sequence to Sequence Learning with Neural Networks. NeurIPS 2014.
- Bahdanau, D. et al. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.
- Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017.
- Sutskever, I. & Le, Q. (2019). Sequence to Sequence. Stanford CS224n Lecture Notes.