You now have all the pieces: multi-head attention, positional encodings, the scaled dot-product formula. A transformer block assembles these components with two additional elements - feedforward networks and layer normalization - plus residual connections to make the whole stack trainable at depth. This is the repeating unit that, stacked dozens or hundreds of times, produces models like GPT and BERT.
The transformer block is the single most important architectural unit in modern AI. GPT-4 is hundreds of these blocks stacked together. Understanding one block completely means you can read any transformer paper, implement any LLM from scratch, and reason about what each component contributes.
The Structure of One Block
The modern "Pre-LN" transformer block (used in GPT-2 and most recent models) applies the following data flow:
- x: input to the block (seq_len x d_model)
- h1 = x + Attention(LayerNorm(x)): after the self-attention sublayer
- h2 = h1 + FFN(LayerNorm(h1)): after the feedforward sublayer - the block output
Two sublayers, each wrapped in: LayerNorm → sublayer → residual addition.
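A minimal sketch of that structure, assuming PyTorch (the class name PreLNBlock and the sizes d_model=512, n_heads=8 are illustrative, not taken from any particular model):

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """One Pre-LN transformer block: LayerNorm -> sublayer -> residual add, twice."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expand to the intermediate dimension
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),  # project back to d_model
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)      # self-attention: tokens exchange information
        x = x + attn_out                      # residual around the attention sublayer
        x = x + self.ffn(self.ln2(x))         # residual around the feedforward sublayer
        return x

x = torch.randn(2, 10, 512)                   # (batch, seq_len, d_model)
print(PreLNBlock()(x).shape)                  # torch.Size([2, 10, 512])
```

The input and output shapes match, which is exactly what allows these blocks to be stacked.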
Multi-Head Self-Attention: Token Communication
The attention sublayer is where tokens talk to each other. Each token queries all others, gathers weighted context, and updates its representation. After attention, each token's representation has been enriched by information from the entire sequence.
This is self-attention - the sequence is attending to itself. The model asks: "given what I know about each token, what information should each token gather from the others?"
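One way to see this concretely, assuming PyTorch's nn.MultiheadAttention: pass the same tensor as queries, keys, and values. Each token comes back with its own attention distribution over every token in the sequence (the sizes here are arbitrary):

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 512, 8, 6
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)   # one sequence of 6 token representations
# Self-attention: the same tensor supplies queries, keys, and values.
out, weights = attn(x, x, x)

print(out.shape)               # torch.Size([1, 6, 512]) - same shape, but context-enriched
print(weights.shape)           # torch.Size([1, 6, 6])   - one distribution per token, over all tokens
print(weights[0].sum(dim=-1))  # each row sums to 1.0
```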
The Feedforward Network: Per-Token Processing
After attention handles cross-token communication, the feedforward network (FFN) handles per-token processing:
- W1: first linear layer weights (d_model x 4*d_model)
- W2: second linear layer weights (4*d_model x d_model)
- b1, b2: bias vectors
The intermediate dimension is typically 4*d_model: expand from d_model to 4*d_model, apply a nonlinearity, project back to d_model.
Critically: the same FFN transformation is applied independently to each token position - no interaction between positions. The FFN is a per-token "thinking" step: given what this token has gathered from attention, what should its representation be?
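A sketch of the FFN and a quick check of that position independence, assuming PyTorch (GELU is one common choice of nonlinearity; the original Transformer used ReLU):

```python
import torch
import torch.nn as nn

d_model = 512
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # W1: expand d_model -> 4*d_model
    nn.GELU(),                         # nonlinearity
    nn.Linear(4 * d_model, d_model),   # W2: project 4*d_model -> d_model
)

x = torch.randn(1, 10, d_model)        # (batch, seq_len, d_model)
out = ffn(x)                           # the same weights applied at every position

# Perturb only token 3: every other token's output is unchanged.
x2 = x.clone()
x2[0, 3] += 1.0
out2 = ffn(x2)
print(torch.allclose(out[0, :3], out2[0, :3]),
      torch.allclose(out[0, 4:], out2[0, 4:]))   # True True
```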
Layer Normalization: Stability
The LayerNorm operation normalizes each token's representation independently, computing LN(x) = γ * (x - μ) / σ + β with:
- μ: mean across the d_model features of a single token
- σ: standard deviation across the d_model features
- γ, β: learned per-feature scale and shift parameters
where μ and σ are computed across the features of a single token (not across the batch, unlike batch normalization).
The γ and β are learned parameters that allow the network to optionally undo the normalization.
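A from-scratch version, assuming PyTorch, checked against nn.LayerNorm (the small eps guards against division by zero; with γ = 1 and β = 0 the affine part is a no-op):

```python
import torch
import torch.nn as nn

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature dimension of each token independently.
    mu = x.mean(dim=-1, keepdim=True)                  # per-token mean
    var = x.var(dim=-1, keepdim=True, unbiased=False)  # per-token variance
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

d_model = 8
x = torch.randn(2, 5, d_model)                           # (batch, seq_len, d_model)
gamma, beta = torch.ones(d_model), torch.zeros(d_model)  # learned parameters in practice

ours = layer_norm(x, gamma, beta)
ref = nn.LayerNorm(d_model, elementwise_affine=False)(x)
print(torch.allclose(ours, ref, atol=1e-6))              # True
```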
Residual Connections: The Key to Depth
Each sublayer is wrapped in a residual connection: output = x + Sublayer(LayerNorm(x)).
Two benefits:
Gradient highways: during backpropagation, the gradient flows through two paths - through the sublayer and through the direct path. The residual path has a gradient of exactly 1 regardless of what the sublayer does. Even if the sublayer gradient is tiny, gradients still flow through the residual path unimpeded. This is how you train 96-layer networks.
Identity initialization: early in training, when weights are near zero, each sublayer produces near-zero outputs. Residual connections mean h1 ≈ x and h2 ≈ h1 - the whole stack acts like the identity function. The network can learn from this blank slate incrementally, each layer gradually refining the representation.
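A small sketch of both effects, assuming PyTorch: the sublayer's linear weights are zeroed to mimic the start of training, so the residual-wrapped sublayer is exactly the identity, and the gradient reaching the input is exactly 1 because it arrives through the direct path:

```python
import torch
import torch.nn as nn

d_model = 512
sublayer = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model))
nn.init.zeros_(sublayer[1].weight)   # mimic "weights near zero" early in training
nn.init.zeros_(sublayer[1].bias)

x = torch.randn(1, 10, d_model, requires_grad=True)
out = x + sublayer(x)                # residual wrapping: x + Sublayer(LayerNorm(x))

print(torch.equal(out, x))           # True: the sublayer adds nothing, so the block is the identity

out.sum().backward()
print(torch.all(x.grad == 1.0))      # True: gradient 1 flows through the residual path regardless
```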
A Full Transformer: Stacking Blocks
A full transformer model is N identical blocks stacked sequentially. Each block takes a seq_len x d_model input and produces a seq_len x d_model output. After N blocks, a final linear layer projects to vocabulary size for prediction.
Typical depths: GPT-2 small (N=12), GPT-2 medium (N=24), GPT-3 (N=96), LLaMA 70B (N=80). The same equations - just more layers.
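A sketch of a full stack, assuming PyTorch: nn.TransformerEncoderLayer with norm_first=True is a Pre-LN block, and nn.TransformerEncoder repeats it N times. The sizes below follow GPT-2 small (12 layers, d_model = 768, 12 heads, 50257-token vocabulary); the final linear layer is the language-model head:

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers, vocab_size = 768, 12, 12, 50257   # GPT-2 small sizes

# One Pre-LN block (norm_first=True), repeated n_layers times.
block = nn.TransformerEncoderLayer(
    d_model, n_heads, dim_feedforward=4 * d_model,
    activation="gelu", batch_first=True, norm_first=True,
)
blocks = nn.TransformerEncoder(block, num_layers=n_layers)

# Final projection from d_model to vocabulary size for next-token prediction.
lm_head = nn.Linear(d_model, vocab_size)

x = torch.randn(1, 16, d_model)   # already-embedded tokens: (batch, seq_len, d_model)
h = blocks(x)                     # (1, 16, 768) - same shape after every block
logits = lm_head(h)               # (1, 16, 50257)
print(h.shape, logits.shape)
```

As written this stack attends bidirectionally; the causal mask that makes it GPT-like is the subject of the next section.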
GPT vs BERT: Decoder vs Encoder
GPT (decoder-only): uses causal (masked) self-attention - each token only attends to tokens at its position or earlier. Trained to predict the next token. Used for generation tasks.
BERT (encoder-only): no causal mask - every token attends to every other token, in both directions. Trained with masked language modeling (predict randomly masked tokens using full context). Used for classification and understanding.
Neither is strictly better - they are designed for different purposes.
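To make the difference concrete, here is the standard causal mask pattern in a generic PyTorch sketch (not code from either model): True marks pairs a GPT-style decoder blocks with -inf before the softmax, while a BERT-style encoder simply omits the mask:

```python
import torch

seq_len = 5
# Causal mask: position i may attend only to positions <= i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.randn(seq_len, seq_len)   # raw attention scores for one head
weights = scores.masked_fill(causal_mask, float("-inf")).softmax(dim=-1)

print(causal_mask[0])   # tensor([False,  True,  True,  True,  True])
print(weights[0])       # tensor([1., 0., 0., 0., 0.]) - the first token sees only itself
```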
Estimating Parameter Count
One useful skill: back-of-envelope parameter counting. For one transformer block with d_model features:
- d_model: model hidden dimension
- Attention (Q, K, V, O projections): 4 * d_model^2
- FFN (two linear layers, 4x expansion): 8 * d_model^2
- LayerNorm: negligible
For GPT-3 with d_model = 12288 and 96 layers: roughly 12 * d_model^2 ≈ 1.8 billion parameters per block, times 96 blocks ≈ 174 billion parameters - close to GPT-3's reported 175 billion. The same simple equations, applied at enormous scale.
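The arithmetic is small enough to check in a few lines (a sketch: the 12288 hidden size and 96 layers are GPT-3's published figures; embeddings, biases, and LayerNorm are ignored):

```python
def block_params(d_model: int) -> int:
    attn = 4 * d_model * d_model       # Q, K, V, O projection matrices
    ffn = 2 * d_model * 4 * d_model    # two linear layers with 4x expansion
    return attn + ffn                  # 12 * d_model^2; LayerNorm (~4*d_model) is negligible

# GPT-3-scale sanity check: d_model = 12288, 96 layers.
total = 96 * block_params(12288)
print(f"{total / 1e9:.0f}B")           # 174B - close to the reported 175B
```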