The math behind modern AI
The sequence problem
Attention as weighted averaging
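The weighted-averaging view named by this heading can be sketched minimally: each output is a convex combination of value vectors, with weights produced by a softmax over raw relevance scores. This is an illustrative NumPy sketch; the specific vectors and scores are made-up examples, not from the original text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Three token "value" vectors (d = 4); the output will be their weighted average.
values = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])

# Hypothetical raw relevance scores for one query position against the three tokens.
scores = np.array([2.0, 0.5, -1.0])

weights = softmax(scores)   # non-negative and sums to 1
output = weights @ values   # convex combination of the value vectors
print(output.shape)  # (4,)
```

Because the softmax weights are non-negative and sum to one, the output always lies inside the convex hull of the value vectors, which is what makes "weighted averaging" an accurate description.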
Q, K, V: the full attention formula
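The standard scaled dot-product formula, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, can be written directly in NumPy. This is a minimal sketch; the shapes and random inputs are assumptions for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n_q, n_k) similarities
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                     # (n_q, d_v)

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 8)
```

The division by √d_k keeps the dot products from growing with dimension, which would otherwise push the softmax into a near-one-hot regime with vanishing gradients.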
Multi-head attention
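Multi-head attention runs the formula above several times in parallel on lower-dimensional projections, then concatenates and mixes the results. A minimal NumPy sketch, assuming a single input sequence and randomly initialized projection matrices (all names here are illustrative):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads
    # Project once, then split the feature dimension into heads: (h, n, d_head).
    Q = (X @ W_q).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, n, n) per head
    out = softmax(scores) @ V                              # (h, n, d_head)
    # Concatenate heads back along features and mix with the output projection.
    out = out.transpose(1, 0, 2).reshape(n, d_model)
    return out @ W_o

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
X = rng.standard_normal((n, d_model))
W = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
Y = multi_head_attention(X, *W, n_heads=h)
print(Y.shape)  # (6, 16)
```

Each head attends with its own learned projections, so different heads can specialize in different relations, at the same total cost as one full-width attention.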
Positional encoding
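Since attention itself is permutation-invariant, position information is injected separately. The classic sinusoidal scheme, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)), can be built in a few lines (a sketch; the sizes below are arbitrary):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(n_positions)[:, None]        # (n, 1)
    i = np.arange(0, d_model, 2)[None, :]              # (1, d/2), even indices
    angles = positions / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(50, 16)
print(pe.shape)            # (50, 16)
print(pe[0, 0], pe[0, 1])  # 0.0 1.0  (sin(0) and cos(0) at position 0)
```

The encoding is typically added elementwise to the token embeddings, and the geometric range of frequencies lets the model represent both nearby and distant offsets.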
The full transformer block
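Putting the pieces together, a transformer block is self-attention plus a position-wise feed-forward network, each wrapped in a residual connection and layer normalization. A minimal single-head, post-norm sketch with made-up weight shapes (real implementations add multiple heads, masking, and dropout):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(X, W_q, W_k, W_v, W_o, W1, W2):
    # Self-attention sub-layer with residual connection and layer norm.
    d_k = W_q.shape[1]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V
    X = layer_norm(X + attn @ W_o)
    # Position-wise feed-forward sub-layer (ReLU), again residual + norm.
    ff = np.maximum(0.0, X @ W1) @ W2
    return layer_norm(X + ff)

rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 32
X = rng.standard_normal((n, d))
W_q, W_k, W_v, W_o = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
W1 = rng.standard_normal((d, d_ff)) * 0.1
W2 = rng.standard_normal((d_ff, d)) * 0.1
Y = transformer_block(X, W_q, W_k, W_v, W_o, W1, W2)
print(Y.shape)  # (4, 8)
```

The block maps (n, d) to (n, d), so identical blocks can be stacked dozens of times; the residual paths are what keep gradients flowing through that depth.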