You now have all the pieces: multi-head attention, positional encodings, the scaled dot-product formula. A transformer block assembles these components with two additional elements - feedforward networks and layer normalization - plus residual connections to make the whole stack trainable at depth. This is the repeating unit that, stacked dozens or hundreds of times, produces models like GPT and BERT.
The transformer block is the single most important architectural unit in modern AI. GPT-4 is hundreds of these blocks stacked together. Understanding one block completely means you can read any transformer paper, implement any LLM from scratch, and reason about what each component contributes.
The Structure of One Block
The modern "Pre-LN" transformer block (used in GPT-2 and most recent models) applies the following data flow:
- x: input to the block (seq_len x d_model)
- h1 = x + Attention(LayerNorm(x)): after the self-attention sublayer
- h2 = h1 + FFN(LayerNorm(h1)): after the feedforward sublayer - the block output
Two sublayers, each wrapped in: LayerNorm → sublayer → residual addition.
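A minimal sketch of that structure, assuming PyTorch (the class name PreLNBlock and the sizes d_model=512, n_heads=8 are illustrative, not taken from any particular model):

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """One Pre-LN transformer block: LayerNorm -> sublayer -> residual add, twice."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expand to the intermediate dimension
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),  # project back to d_model
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)      # self-attention: tokens exchange information
        x = x + attn_out                      # residual around the attention sublayer
        x = x + self.ffn(self.ln2(x))         # residual around the feedforward sublayer
        return x

x = torch.randn(2, 10, 512)                   # (batch, seq_len, d_model)
print(PreLNBlock()(x).shape)                  # torch.Size([2, 10, 512])
```

The input and output shapes match, which is exactly what allows these blocks to be stacked.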
Multi-Head Self-Attention: Token Communication
The attention sublayer is where tokens talk to each other. Each token queries all others, gathers weighted context, and updates its representation. After attention, each token's representation has been enriched by information from the entire sequence.
This is self-attention - the sequence is attending to itself. The model asks: "given what I know about each token, what information should each token gather from the others?"
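One way to see this concretely, assuming PyTorch's nn.MultiheadAttention: pass the same tensor as queries, keys, and values. Each token comes back with its own attention distribution over every token in the sequence (the sizes here are arbitrary):

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 512, 8, 6
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)   # one sequence of 6 token representations
# Self-attention: the same tensor supplies queries, keys, and values.
out, weights = attn(x, x, x)

print(out.shape)               # torch.Size([1, 6, 512]) - same shape, but context-enriched
print(weights.shape)           # torch.Size([1, 6, 6])   - one distribution per token, over all tokens
print(weights[0].sum(dim=-1))  # each row sums to 1.0
```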
The Feedforward Network: Per-Token Processing
After attention handles cross-token communication, the feedforward network (FFN) handles per-token processing:
- W1: first linear layer weights (d_model x 4*d_model)
- W2: second linear layer weights (4*d_model x d_model)
- b1, b2: bias vectors
The intermediate dimension is typically 4*d_model: expand from d_model to 4*d_model, apply a nonlinearity, project back to d_model.
Critically: the same FFN transformation is applied independently to each token position - no interaction between positions. The FFN is a per-token "thinking" step: given what this token has gathered from attention, what should its representation be?
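A sketch of the FFN and a quick check of that position independence, assuming PyTorch (GELU is one common choice of nonlinearity; the original Transformer used ReLU):

```python
import torch
import torch.nn as nn

d_model = 512
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # W1: expand d_model -> 4*d_model
    nn.GELU(),                         # nonlinearity
    nn.Linear(4 * d_model, d_model),   # W2: project 4*d_model -> d_model
)

x = torch.randn(1, 10, d_model)        # (batch, seq_len, d_model)
out = ffn(x)                           # the same weights applied at every position

# Perturb only token 3: every other token's output is unchanged.
x2 = x.clone()
x2[0, 3] += 1.0
out2 = ffn(x2)
print(torch.allclose(out[0, :3], out2[0, :3]),
      torch.allclose(out[0, 4:], out2[0, 4:]))   # True True
```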
Layer Normalization: Stability
The LayerNorm operation normalizes each token's representation independently, computing LN(x) = γ * (x - μ) / σ + β with:
- μ: mean across the d_model features of a single token
- σ: standard deviation across the d_model features
- γ, β: learned per-feature scale and shift parameters
where μ and σ are computed across the features of a single token (not across the batch, unlike batch normalization).
The γ and β are learned parameters that allow the network to optionally undo the normalization.
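A from-scratch version, assuming PyTorch, checked against nn.LayerNorm (the small eps guards against division by zero; with γ = 1 and β = 0 the affine part is a no-op):

```python
import torch
import torch.nn as nn

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature dimension of each token independently.
    mu = x.mean(dim=-1, keepdim=True)                  # per-token mean
    var = x.var(dim=-1, keepdim=True, unbiased=False)  # per-token variance
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

d_model = 8
x = torch.randn(2, 5, d_model)                           # (batch, seq_len, d_model)
gamma, beta = torch.ones(d_model), torch.zeros(d_model)  # learned parameters in practice

ours = layer_norm(x, gamma, beta)
ref = nn.LayerNorm(d_model, elementwise_affine=False)(x)
print(torch.allclose(ours, ref, atol=1e-6))              # True
```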
Residual Connections: The Key to Depth
Each sublayer is wrapped in a residual connection: output = x + Sublayer(LayerNorm(x)).
Two benefits:
Gradient highways: during backpropagation, the gradient flows through two paths - through the sublayer and through the direct path. The residual path has a gradient of exactly 1 regardless of what the sublayer does. Even if the sublayer gradient is tiny, gradients still flow through the residual path unimpeded. This is how you train 96-layer networks.
Identity initialization: early in training, when weights are near zero, each sublayer produces near-zero outputs. Residual connections mean h1 ≈ x and h2 ≈ h1 - the whole stack acts like the identity function. The network can learn from this blank slate incrementally, each layer gradually refining the representation.
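A small sketch of both effects, assuming PyTorch: the sublayer's linear weights are zeroed to mimic the start of training, so the residual-wrapped sublayer is exactly the identity, and the gradient reaching the input is exactly 1 because it arrives through the direct path:

```python
import torch
import torch.nn as nn

d_model = 512
sublayer = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model))
nn.init.zeros_(sublayer[1].weight)   # mimic "weights near zero" early in training
nn.init.zeros_(sublayer[1].bias)

x = torch.randn(1, 10, d_model, requires_grad=True)
out = x + sublayer(x)                # residual wrapping: x + Sublayer(LayerNorm(x))

print(torch.equal(out, x))           # True: the sublayer adds nothing, so the block is the identity

out.sum().backward()
print(torch.all(x.grad == 1.0))      # True: gradient 1 flows through the residual path regardless
```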
A Full Transformer: Stacking Blocks
A full transformer model is N identical blocks stacked sequentially. Each block takes a seq_len x d_model input and produces a seq_len x d_model output. After N blocks, a final linear layer projects to vocabulary size for prediction.
Typical depths: GPT-2 small (N=12), GPT-2 medium (N=24), GPT-3 (N=96), LLaMA 70B (N=80). The same equations - just more layers.
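A sketch of a full stack, assuming PyTorch: nn.TransformerEncoderLayer with norm_first=True is a Pre-LN block, and nn.TransformerEncoder repeats it N times. The sizes below follow GPT-2 small (12 layers, d_model = 768, 12 heads, 50257-token vocabulary); the final linear layer is the language-model head:

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers, vocab_size = 768, 12, 12, 50257   # GPT-2 small sizes

# One Pre-LN block (norm_first=True), repeated n_layers times.
block = nn.TransformerEncoderLayer(
    d_model, n_heads, dim_feedforward=4 * d_model,
    activation="gelu", batch_first=True, norm_first=True,
)
blocks = nn.TransformerEncoder(block, num_layers=n_layers)

# Final projection from d_model to vocabulary size for next-token prediction.
lm_head = nn.Linear(d_model, vocab_size)

x = torch.randn(1, 16, d_model)   # already-embedded tokens: (batch, seq_len, d_model)
h = blocks(x)                     # (1, 16, 768) - same shape after every block
logits = lm_head(h)               # (1, 16, 50257)
print(h.shape, logits.shape)
```

As written this stack attends bidirectionally; the causal mask that makes it GPT-like is the subject of the next section.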
GPT vs BERT: Decoder vs Encoder
GPT (decoder-only): uses causal (masked) self-attention - each token only attends to tokens at its position or earlier. Trained to predict the next token. Used for generation tasks.
BERT (encoder-only): no causal mask - every token attends to every other token, in both directions. Trained with masked language modeling (predict randomly masked tokens using full context). Used for classification and understanding.
Neither is strictly better - they are designed for different purposes.
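To make the difference concrete, here is the standard causal mask pattern in a generic PyTorch sketch (not code from either model): True marks pairs a GPT-style decoder blocks with -inf before the softmax, while a BERT-style encoder simply omits the mask:

```python
import torch

seq_len = 5
# Causal mask: position i may attend only to positions <= i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.randn(seq_len, seq_len)   # raw attention scores for one head
weights = scores.masked_fill(causal_mask, float("-inf")).softmax(dim=-1)

print(causal_mask[0])   # tensor([False,  True,  True,  True,  True])
print(weights[0])       # tensor([1., 0., 0., 0., 0.]) - the first token sees only itself
```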
Estimating Parameter Count
One useful skill: back-of-envelope parameter counting. For one transformer block with d_model features:
- d_model: model hidden dimension
- Attention (Q, K, V, O projections): 4 * d_model^2
- FFN (two linear layers, 4x expansion): 8 * d_model^2
- LayerNorm: negligible
For GPT-3 with d_model = 12288 and 96 layers: roughly 12 * d_model^2 ≈ 1.8 billion parameters per block, times 96 blocks ≈ 174 billion parameters - close to GPT-3's reported 175 billion. The same simple equations, applied at enormous scale.
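The arithmetic is small enough to check in a few lines (a sketch: the 12288 hidden size and 96 layers are GPT-3's published figures; embeddings, biases, and LayerNorm are ignored):

```python
def block_params(d_model: int) -> int:
    attn = 4 * d_model * d_model       # Q, K, V, O projection matrices
    ffn = 2 * d_model * 4 * d_model    # two linear layers with 4x expansion
    return attn + ffn                  # 12 * d_model^2; LayerNorm (~4*d_model) is negligible

# GPT-3-scale sanity check: d_model = 12288, 96 layers.
total = 96 * block_params(12288)
print(f"{total / 1e9:.0f}B")           # 174B - close to the reported 175B
```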