Attention Mechanism and Transformer Architecture
Transformers and Large Language Models
Transformer (Vaswani et al., “Attention Is All You Need”, 2017) is a revolutionary architecture that eliminated recurrence in sequence processing. Self-attention allows each element to interact directly with any other, without recurrent bottlenecks. Transformers now dominate in NLP, vision, protein biology, and audio.
Attention Mechanism
Intuition: when translating “The animal didn't cross the street because it was too tired”, the model must realize that “it” refers to “animal”. An RNN “forgets” long-range dependencies. Attention lets the model explicitly “look” at the relevant positions when generating each token.
Scaled Dot-Product Attention:
Attention(Q, K, V) = softmax(QKᵀ/√d_k) · V
Q (queries) ∈ ℝ^{n×d_k}: “what we’re searching for”. K (keys) ∈ ℝ^{m×d_k}: “what is offered” at all positions. V (values) ∈ ℝ^{m×d_v}: “what we take” from the selected positions.
Steps: (1) QKᵀ ∈ ℝ^{n×m}: dot products of every query-key pair. Element [i,j]: how much position j “responds” to query at position i. (2) /√d_k: scaling prevents overly large values with high d_k (otherwise softmax saturates). (3) softmax: normalization into a distribution (rows sum to 1). (4) ·V: weighted sum of values.
Complexity: O(n²·d_k) time and O(n²) memory for the attention matrix, i.e. quadratic in sequence length.
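A minimal PyTorch sketch of the formula above (variable names and the toy shapes are illustrative, not from the text):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
    Q: (n, d_k), K: (m, d_k), V: (m, d_v); mask: (n, m), True = keep."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (n, m): every query-key dot product
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))  # blocked positions get zero weight
    weights = torch.softmax(scores, dim=-1)                 # each row sums to 1
    return weights @ V, weights                             # (n, d_v): weighted sum of values

# Toy check: n=2 queries attend over m=3 positions
Q = torch.randn(2, 4); K = torch.randn(3, 4); V = torch.randn(3, 8)
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(dim=-1))   # torch.Size([2, 8]), rows of weights sum to 1
```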
Multi-Head Attention
A single attention head captures only one kind of relationship. Multi-head: h heads, each with its own projections (W^Q_i, W^K_i, W^V_i):
MultiHead(Q,K,V) = Concat(head₁, ..., headₕ)·W^O, where headᵢ = Attention(Q·W^Q_i, K·W^K_i, V·W^V_i)
d_k = d_v = d_model/h. Heads specialize: one in syntactic dependencies, another in semantic proximity, a third in coreference. W^O ∈ ℝ^{h·d_v × d_model} projects the concatenation back to d_model.
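A compact sketch of multi-head attention under the same conventions. Fusing the h projection matrices into one Linear per Q/K/V is an implementation choice, equivalent to keeping separate W^Q_i, W^K_i, W^V_i:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # One fused matrix per Q/K/V is equivalent to h separate per-head projections
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # W^O: projects the concatenated heads

    def forward(self, q, k, v, mask=None):
        B, n, _ = q.shape
        # Project, then split d_model into h heads of size d_k: (B, h, n, d_k)
        def split(x, proj):
            return proj(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        out = torch.softmax(scores, dim=-1) @ V                 # (B, h, n, d_k)
        out = out.transpose(1, 2).contiguous().view(B, n, -1)   # Concat(head_1, ..., head_h)
        return self.w_o(out)

x = torch.randn(2, 20, 512)
print(MultiHeadAttention()(x, x, x).shape)   # torch.Size([2, 20, 512])
```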
Transformer Architecture
Positional Encoding: Self-attention is permutation-invariant and has no built-in notion of order. PE(pos, 2i) = sin(pos/10000^{2i/d_model}), PE(pos, 2i+1) = cos(pos/10000^{2i/d_model}). Added to token embeddings. Learnable PE (BERT): positional embeddings are trainable parameters.
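A sketch of the sinusoidal encoding above (the max_len=5000 buffer size is an arbitrary assumption):

```python
import math
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    def __init__(self, d_model=512, max_len=5000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)             # (max_len, 1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)   # PE(pos, 2i)   = sin(pos / 10000^{2i/d_model})
        pe[:, 1::2] = torch.cos(pos * div)   # PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
        self.register_buffer("pe", pe)

    def forward(self, x):                    # x: (B, n, d_model) token embeddings
        return x + self.pe[: x.size(1)]      # added to embeddings, not concatenated

print(SinusoidalPE()(torch.zeros(1, 20, 512)).shape)   # torch.Size([1, 20, 512])
```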
Encoder layer: Input tokens → Add(x, MHA(x, x, x)) → LayerNorm → Add(·, FFN(·)) → LayerNorm. FFN(x) = max(0, xW₁ + b₁)W₂ + b₂ (d_ff = 4·d_model). Residual connections + LayerNorm = stable training.
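A post-LN encoder layer matching this flow, sketched with PyTorch's built-in nn.MultiheadAttention; dropout is omitted for brevity:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048):    # d_ff = 4 * d_model
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ffn = nn.Sequential(                        # FFN(x) = max(0, xW1 + b1) W2 + b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.mha(x, x, x)      # self-attention: Q = K = V = x
        x = self.ln1(x + attn_out)           # Add & Norm (residual + LayerNorm)
        x = self.ln2(x + self.ffn(x))        # Add & Norm after the FFN
        return x

print(EncoderLayer()(torch.randn(2, 20, 512)).shape)   # torch.Size([2, 20, 512])
```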
Decoder layer: Masked self-attention (cannot see future tokens) + Cross-attention (Q from the decoder, K and V from the encoder) + FFN. Masking enables autoregressive generation: when generating token t, the model must not “see” tokens t+1, ...
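The causal mask itself is just a lower-triangular matrix; a minimal sketch:

```python
import torch

n = 5
# True = allowed: token t attends to tokens 0..t, never to t+1..n-1
causal_mask = torch.tril(torch.ones(n, n)).bool()
print(causal_mask.int())
# Equivalent additive form: 0 where allowed, -inf where blocked (added to QK^T / sqrt(d_k))
additive = torch.zeros(n, n).masked_fill(~causal_mask, float("-inf"))
```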
Pre-training/Fine-tuning: BERT: Masked LM + NSP → fine-tune on NLU tasks (QA, NER, entailment). GPT: LM (next token prediction) → few-shot and zero-shot.
Efficient Alternatives
Flash Attention (Dao et al., 2022): Does not materialize the n×n attention matrix. Computes in tiles, using online softmax. 3–6× speedup, 10–20× less memory (HBM). Allows for a 100K+ context window.
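PyTorch ≥ 2.0 exposes this through torch.nn.functional.scaled_dot_product_attention, which dispatches to a FlashAttention kernel when the device, dtype, and shapes allow it. A hedged sketch assuming a CUDA GPU; which kernel actually runs depends on your hardware and PyTorch build:

```python
import torch
import torch.nn.functional as F

B, h, n, d_k = 1, 8, 4096, 64
q = torch.randn(B, h, n, d_k, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Fused attention: the (n x n) matrix is never materialized in HBM;
# a FlashAttention kernel is used when the inputs are eligible.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([1, 8, 4096, 64])
```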
Sparse Attention (Longformer): Local window ±w + several global tokens. O(n·w) instead of O(n²). Efficient for long documents.
Grouped Query Attention (GQA, Ainslie et al., 2023): Several Q-heads share K,V. Reduces KV-cache during inference. LLaMA-2-70B: 8 KV-heads for 64 Q-heads.
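A sketch of the GQA idea with the LLaMA-2-70B head counts: only 8 K/V heads are projected and cached, and each is broadcast to its group of 8 query heads (repeat_interleave is just one way to do the broadcast):

```python
import torch

B, n, d_k = 1, 16, 128
n_q_heads, n_kv_heads = 64, 8           # 8 query heads share each K/V head
group = n_q_heads // n_kv_heads

q = torch.randn(B, n_q_heads, n, d_k)
k = torch.randn(B, n_kv_heads, n, d_k)  # KV-cache stores only 8 heads instead of 64
v = torch.randn(B, n_kv_heads, n, d_k)

# Expand the shared K/V so each group of 8 query heads sees the same keys/values
k_exp = k.repeat_interleave(group, dim=1)   # (B, 64, n, d_k)
v_exp = v.repeat_interleave(group, dim=1)
scores = q @ k_exp.transpose(-2, -1) / d_k ** 0.5
out = torch.softmax(scores, dim=-1) @ v_exp
print(out.shape)   # torch.Size([1, 64, 16, 128])
```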
RoPE (Rotary Position Embedding): Positions are encoded by rotating Q and K: Q(pos) = R(pos)·Q, K(pos) = R(pos)·K. The dot product Q·K then depends only on (pos₁−pos₂), i.e. on the relative position. Extrapolates better to long contexts.
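A minimal RoPE sketch illustrating the relative-position property (the repeated q/k vectors below exist only to make the check meaningful):

```python
import torch

def rope(x):
    """Apply rotary position embedding to x of shape (n, d), d even.
    Each pair of dims (2i, 2i+1) is rotated by the angle pos * 10000^(-2i/d)."""
    n, d = x.shape
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)              # (n, 1)
    theta = 10000 ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    angles = pos * theta                                                 # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin   # 2-D rotation of each (x1, x2) pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Q·K depends only on the offset: the same vectors at offsets 5-2 and 7-4 give equal scores
q = torch.randn(1, 64).repeat(8, 1)   # same query vector at every position
k = torch.randn(1, 64).repeat(8, 1)   # same key vector at every position
rq, rk = rope(q), rope(k)
print(torch.allclose(rq[2] @ rk[5], rq[4] @ rk[7], atol=1e-4))   # True
```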
Numerical Example
Transformer for MT (Russian → English): n=20 tokens, d_model=512, h=8, N=6 layers. Parameters (per-layer attention + FFN weights, embeddings excluded): 6×(4·512² + 2·512·2048) = 6×(1.05M + 2.1M) ≈ 19M. KV-cache during inference (one token at a time): 6 layers × 2 (K,V) × 8 heads × d_k=64 = 6×2×512 = 6144 floats per token. With 1000 tokens: 6144×1000×4 bytes ≈ 24.6 MB (a small model).
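The same arithmetic as a quick check (counts cover per-layer attention and FFN weight matrices only; embeddings and biases are left out, as in the estimate above):

```python
d_model, h, N, d_ff = 512, 8, 6, 2048
d_k = d_model // h                                        # 64

params_per_layer = 4 * d_model**2 + 2 * d_model * d_ff    # W^Q, W^K, W^V, W^O + FFN
print(N * params_per_layer / 1e6)                         # ≈ 18.9M parameters

kv_floats_per_token = N * 2 * h * d_k                     # 6 layers x (K,V) x 8 heads x 64
print(kv_floats_per_token)                                # 6144
print(kv_floats_per_token * 1000 * 4 / 1e6)               # ≈ 24.6 MB for 1000 cached tokens
```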
Attention patterns (visualization): head 1 captures local grammatical dependencies (adjective→noun); head 5 captures global semantic links (subject→predicate).
Assignment: Implement a simplified transformer from scratch (PyTorch): encoder-only, 2 layers, 4 heads, d_model=64. Task: IMDB sentiment classification (25K reviews). (1) BPE tokenization (tokenizers library, vocab=5000). (2) Train from scratch on 5000 examples. (3) Compare with fine-tuning BERT on the same data (do transformers give universal language understanding?). (4) Visualize attention matrices for 5 examples: which words attend to “not”, “terrible”, “excellent”?