
Transformers and Attention Architectures

Mathematical Foundations of Deep Learning


Attention Mechanism and Transformers

The Transformer (Vaswani et al., "Attention Is All You Need", 2017) is a revolutionary architecture that eliminated recurrence in sequence processing. The self-attention mechanism allows each element of the sequence to interact with any other in a single step, opening the way to scaling models up to hundreds of billions of parameters.

Scaled Dot-Product Attention

The key operation of the transformer. Three matrices: Q (queries) ∈ ℝ^{n×d_k}, K (keys) ∈ ℝ^{n×d_k}, V (values) ∈ ℝ^{n×d_v}.

Attention(Q, K, V) = softmax(QKᵀ/√d_k) · V

Step-by-step interpretation:

  • QKᵀ — matrix of dot products for all query-key pairs: how "compatible" each query is with each key. Element [i,j] — relevance of position j for position i.
  • /√d_k — scaling to prevent softmax saturation (with large d_k, the products become large → softmax concentrates on one position → vanishing gradient).
  • softmax(...) — normalization into probabilities: rows sum to 1 — "distribution of attention" across positions.
  • · V — weighted sum of values: the output = mixture of values from all positions with attention weights.

For the sentence "The cat sat": the query for "sat" attends to the keys "The" (0.1), "cat" (0.8), "sat" (0.1); the attention mechanism has discovered the link "sat" → "cat".
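
A minimal sketch of this operation in PyTorch (function name, shapes, and the smoke test are illustrative, not taken from the article):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (n, d_k); V: (n, d_v)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # QK^T / sqrt(d_k): compatibility matrix
    weights = F.softmax(scores, dim=-1)                        # rows sum to 1: attention distribution
    return weights @ V, weights                                # weighted sum of values + the weights

# Tiny smoke test with random tensors
Q, K, V = torch.randn(3, 2), torch.randn(3, 2), torch.randn(3, 4)
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(dim=-1))  # torch.Size([3, 4]); each row of weights sums to ~1
```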

Multi-Head Attention

A single attention mechanism sees only one type of interaction. Multi-head Attention uses h parallel "heads" with different projections:

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) W^O, where headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V)

Each head is trained to "look" at its own aspect: one head captures syntactic dependencies (subject-verb), another coreference ("she" → "Maria"), a third semantic similarity. Size of each head: d_k = d_model/h, which keeps the total computational cost the same as a single full-width head.
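
A compact sketch of multi-head attention (dimensions and the example input are illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Sketch: h heads, each of size d_k = d_model // h."""
    def __init__(self, d_model, h):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)  # W^O

    def forward(self, x):
        B, n, d = x.shape
        # Project, then split the last dimension into h heads of size d_k
        q = self.q_proj(x).view(B, n, self.h, self.d_k).transpose(1, 2)  # (B, h, n, d_k)
        k = self.k_proj(x).view(B, n, self.h, self.d_k).transpose(1, 2)
        v = self.v_proj(x).view(B, n, self.h, self.d_k).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5               # (B, h, n, n)
        out = scores.softmax(dim=-1) @ v                                 # (B, h, n, d_k)
        out = out.transpose(1, 2).reshape(B, n, d)                       # Concat(head_1, ..., head_h)
        return self.o_proj(out)

x = torch.randn(2, 5, 64)
print(MultiHeadAttention(64, 8)(x).shape)  # torch.Size([2, 5, 64])
```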

Transformer Architecture

Positional Encoding: Self-attention is invariant to permutation of positions (no built-in concept of "order"). Positional encoding is added to the embeddings:

PE(pos, 2i) = sin(pos/10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos/10000^{2i/d_model})

Idea: different frequencies for different dimensions → each position has a unique "fingerprint".
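
A sketch of building the sinusoidal table (function name and sizes are illustrative):

```python
import torch

def sinusoidal_pe(max_len, d_model, base=10000.0):
    """Sinusoidal positional encoding: sin on even dimensions, cos on odd dimensions."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices 2i
    angle = pos / base ** (i / d_model)                             # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe  # added to token embeddings: x = embedding + pe[:seq_len]

print(sinusoidal_pe(50, 16).shape)  # torch.Size([50, 16])
```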

Encoder layer: Input → MHA → Add&Norm → FFN → Add&Norm. FFN = max(0, xW₁ + b₁)W₂ + b₂ (dim_ff = 4 × d_model). Add&Norm: residual connections + LayerNorm.
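
A sketch of how one encoder layer is wired together; here nn.MultiheadAttention stands in for the MHA sketch above, and the post-norm ordering follows the original paper:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Sketch of one encoder layer: MHA -> Add&Norm -> FFN -> Add&Norm."""
    def __init__(self, d_model=64, h=8, dim_ff=None):
        super().__init__()
        dim_ff = dim_ff or 4 * d_model                      # dim_ff = 4 * d_model
        self.mha = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.mha(x, x, x)                     # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)                        # Add & Norm (residual connection)
        x = self.norm2(x + self.ffn(x))                     # Add & Norm around the FFN
        return x

print(EncoderLayer()(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```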

Decoder layer: Additionally, Cross-Attention (Q from decoder, K,V from encoder output) — allows the decoder to "look" at the input sequence. Masked self-attention: prevents "looking ahead" during autoregressive generation.
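
A sketch of the causal (look-ahead) mask used in masked self-attention: position i is allowed to attend only to positions j ≤ i.

```python
import torch

n = 5
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)  # True above the diagonal = forbidden
scores = torch.randn(n, n)
scores = scores.masked_fill(mask, float("-inf"))   # -inf becomes an attention weight of exactly 0
weights = scores.softmax(dim=-1)
print(weights)                                     # lower-triangular attention pattern
```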

Self-attention complexity: O(n²·d_model) in time and memory (n×n matrix). For long documents (n > 4096) — this is a bottleneck.

Efficiency Improvements

Flash Attention (Dao et al., 2022): does not materialize the n×n matrix in memory; it computes attention in tiles using an online softmax. Result: 3–6× speedup and 10–20× less memory, which allows processing contexts of 100K+ tokens.
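
In PyTorch 2.x the fused kernel is available through torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention-style implementation on supported GPUs; a usage sketch (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# q, k, v: (batch, heads, seq_len, head_dim). On a supported GPU this call can use a
# FlashAttention-style fused kernel and never materializes the full n x n matrix.
q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```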

Sparse Attention (Longformer): each token "looks" only at a local window ± w plus several global tokens. Complexity O(n·w + n·g). Significant acceleration for long documents.
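
A sketch of such a sparsity pattern as a boolean mask (window size and global positions are illustrative, not Longformer's actual configuration):

```python
import torch

def local_plus_global_mask(n, w, global_idx=(0,)):
    """Local window of +/- w positions plus a few global tokens; True = attention permitted."""
    i = torch.arange(n)
    allowed = (i.unsqueeze(0) - i.unsqueeze(1)).abs() <= w  # band of width 2w+1 around the diagonal
    for g in global_idx:                                    # global tokens attend / are attended everywhere
        allowed[g, :] = True
        allowed[:, g] = True
    return allowed

print(local_plus_global_mask(8, 1).int())
```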

Grouped Query Attention (GQA, Ainslie et al., 2023): several Q heads share one pair of K,V. Reduces KV-cache during inference — a key bottleneck with large batch size. Used in LLaMA-2, Mistral.
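
A minimal sketch of the grouping idea, assuming 8 query heads sharing 2 K/V heads (4 queries per group); the numbers are illustrative:

```python
import torch

B, n, d_head = 1, 16, 64
q = torch.randn(B, 8, n, d_head)            # 8 query heads
k = torch.randn(B, 2, n, d_head)            # only 2 key/value heads -> 4x smaller KV-cache
v = torch.randn(B, 2, n, d_head)
k = k.repeat_interleave(4, dim=1)           # broadcast each K/V head to its group of 4 query heads
v = v.repeat_interleave(4, dim=1)
scores = (q @ k.transpose(-2, -1)) / d_head ** 0.5
out = scores.softmax(dim=-1) @ v
print(out.shape)                            # torch.Size([1, 8, 16, 64])
```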

RoPE (Rotary Position Embedding): positions are encoded by rotating the Q and K vectors, so the dot product Q(pos)·K(pos′) depends on position only through the difference (pos−pos′). Better extrapolation to long contexts than absolute PE.
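
A toy sketch of rotary embeddings, pairing adjacent dimensions (my own minimal implementation, not copied from any particular model):

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate each (even, odd) pair of dimensions by a position-dependent angle."""
    seq_len, d = x.shape[-2], x.shape[-1]
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(-1)     # (seq_len, 1)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angles = pos * freqs                                               # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                               # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Applied to Q and K before the dot product; the positional part of Q·K then depends
# only on the relative offset between the two positions.
q, k = apply_rope(torch.randn(10, 64)), apply_rope(torch.randn(10, 64))
print((q @ k.T).shape)  # torch.Size([10, 10])
```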

Scaling Laws

Kaplan et al. (OpenAI, 2020): Loss ∝ N^{-0.076} (vs parameters), Loss ∝ D^{-0.095} (vs data). Predictable improvement when scaling.

Chinchilla (Hoffmann et al., DeepMind, 2022): compute-optimal training scales data in proportion to model size, roughly 20 tokens per parameter. GPT-3 (175B parameters, 300B tokens) is "undertrained": a 70B model trained on 1.4T tokens (Chinchilla) outperformed it.
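
A rough back-of-the-envelope check using the ~20-tokens-per-parameter ratio implied by the figures above (a rule of thumb, not an exact formula from the papers):

```python
# Compute-optimal token budget at ~20 tokens per parameter (70B params ~ 1.4T tokens).
for n_params in (7e9, 70e9, 175e9):
    print(f"{n_params / 1e9:.0f}B params -> ~{20 * n_params / 1e12:.1f}T tokens")
# 7B params -> ~0.1T tokens
# 70B params -> ~1.4T tokens
# 175B params -> ~3.5T tokens
```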

Numerical Example

Let's compute attention for a sequence of length 3, d_k = 2: Q = [[1,0],[0,1],[1,1]], K = [[1,0],[0,1],[1,1]], V = [[1,0],[0,1],[1,1]]. QKᵀ = [[1,0,1],[0,1,1],[1,1,2]]. After dividing by √2 and applying row-wise softmax: [0.40, 0.20, 0.40], [0.20, 0.40, 0.40], [0.25, 0.25, 0.50]. Position 3 ([1,1]) receives the largest share of attention overall, acting as the semantically "richest" token.
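
The same numbers can be reproduced with a few lines of PyTorch:

```python
import math
import torch

# Reproducing the numerical example above.
Q = K = V = torch.tensor([[1., 0.], [0., 1.], [1., 1.]])
weights = (Q @ K.T / math.sqrt(2)).softmax(dim=-1)
print(weights)      # rows ~ [0.40, 0.20, 0.40], [0.20, 0.40, 0.40], [0.25, 0.25, 0.50]
print(weights @ V)  # the attention output itself
```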

Assignment: Implement from scratch (PyTorch without nn.MultiheadAttention): Scaled Dot-Product Attention, Multi-Head Attention, Encoder layer (MHA + FFN + LayerNorm + residual). Train a 2-layer transformer encoder on the sentiment classification task (IMDB, 2000 examples). Visualize attention matrices for 3 examples — which words receive the highest weight?
