
Large Language Models: architecture and lifecycle

Algorithms for Big Data

GPT-4, Claude, Gemini — large language models (LLMs) revolutionized AI in 2020–2024. To apply and further develop these models effectively, it is necessary to understand their architecture, training process, and adaptation methods.

Pretraining: the language modeling task

Autoregressive language modeling: $P(x_1,\ldots,x_n) = \prod_i P(x_i|x_1,\ldots,x_{i-1})$. Loss during pretraining: $L_{LM} = -\frac{1}{n} \sum_i \log P(x_i|x_1,\ldots,x_{i-1};\theta)$. Predicting the next token looks simple, but the task forces the model to learn semantics, syntax, factual knowledge, and causality.
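
A minimal PyTorch sketch of this loss on toy data (the vocabulary size, sequence length, and random logits are illustrative, not from the article):

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 16
tokens = torch.randint(0, vocab_size, (1, seq_len))   # x_1 .. x_n
logits = torch.randn(1, seq_len, vocab_size)          # stand-in for model outputs

# Position i predicts token i+1, so logits are shifted against the targets;
# cross_entropy averages -log P(x_i | x_<i) over positions, i.e. L_LM.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss)
```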

Tokenization: Byte-Pair Encoding (BPE) iteratively merges the most frequent adjacent pairs of symbols, starting from raw bytes. Typical vocabulary: 32K–256K tokens. "Moscow" → ["Mos", "cow"] or a single token, depending on the vocabulary.
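
A toy illustration of a single BPE merge step, assuming a tiny character-level corpus; production tokenizers work on raw bytes and repeat this until the target vocabulary size is reached:

```python
from collections import Counter

corpus = ["low", "lower", "lowest", "slow"]
words = [list(w) for w in corpus]          # start from individual symbols

def merge_step(words):
    # Count adjacent symbol pairs across the corpus and pick the most frequent one.
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    # Replace every occurrence of that pair with a single merged symbol.
    merged = []
    for w in words:
        new_w, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                new_w.append(a + b)
                i += 2
            else:
                new_w.append(w[i])
                i += 1
        merged.append(new_w)
    return merged, (a, b)

words, pair = merge_step(words)
print(pair, words)   # e.g. ('l', 'o')  [['lo', 'w'], ['lo', 'w', 'e', 'r'], ...]
```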

Data: Common Crawl (filtered web), Books, Wikipedia, GitHub, arXiv, StackExchange. Open corpora: The Pile (~800GB, EleutherAI), RedPajama, Dolma. Filtering quality is critical: careless filtering → degradation.

Scale: the original LLaMA (65B): 1.4 trillion tokens, 2048 A100 GPUs, about three weeks of training. Cost: $5–50M for a single run. This explains why only large companies can train frontier models from scratch.

Architectural Details of LLMs

GPT architecture (decoder-only transformer): N identical layers, each {RMSNorm → Masked Self-Attention → RMSNorm → FFN}, with a residual connection around every sub-block (pre-norm). Why decoder-only: generation is autoregressive, so each new token depends only on the previous ones.
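
A sketch of one such pre-norm block, under simplifying assumptions (single-head attention, a GELU FFN instead of SwiGLU, hand-rolled RMSNorm):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.g = nn.Parameter(torch.ones(d))
    def forward(self, x):
        return self.g * x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)

class DecoderBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.norm1, self.norm2 = RMSNorm(d), RMSNorm(d)
        self.qkv = nn.Linear(d, 3 * d, bias=False)
        self.proj = nn.Linear(d, d, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    def forward(self, x):                       # x: (batch, seq, d)
        h = self.norm1(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        # is_causal=True applies the autoregressive mask: token i attends only to positions <= i
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(attn)                 # residual connection
        x = x + self.ffn(self.norm2(x))         # residual connection
        return x

x = torch.randn(2, 8, 64)
print(DecoderBlock(64)(x).shape)                # torch.Size([2, 8, 64])
```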

SwiGLU (Shazeer, 2020): $FFN = W_2 \cdot (\sigma(W_1x) \odot W_3x)$, where $\sigma$ is swish. Better than ReLU/GELU — used in LLaMA, PaLM.
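
A minimal sketch of this FFN; the hidden size 11008 is the value used in LLaMA-7B, the other dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d, hidden):
        super().__init__()
        self.w1 = nn.Linear(d, hidden, bias=False)   # gate branch
        self.w3 = nn.Linear(d, hidden, bias=False)   # value branch
        self.w2 = nn.Linear(hidden, d, bias=False)   # output projection
    def forward(self, x):
        # F.silu is the swish activation x * sigmoid(x)
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

print(SwiGLU(4096, 11008)(torch.randn(2, 4096)).shape)   # torch.Size([2, 4096])
```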

Grouped Query Attention (GQA): several Q-heads share K,V. LLaMA-2-70B: 8 KV-heads across 64 Q-heads. Reduces KV cache by 8× without significant quality loss.
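
A shape-level sketch of GQA with the LLaMA-2-70B configuration from the text (64 Q-heads, 8 KV-heads); the point is that only the small K/V tensors need to be cached:

```python
import torch
import torch.nn.functional as F

batch, seq, head_dim = 1, 16, 128
n_q_heads, n_kv_heads = 64, 8

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # cached: 8x smaller than full MHA
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Broadcast each KV head across its group of query heads, then run standard attention.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([1, 64, 16, 128])
```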

RoPE (Rotary Position Embedding): Q and K are multiplied by rotation matrices whose angles depend on the position. The scalar product $Q(pos_1)\cdot K(pos_2)$ then depends only on the offset $pos_1 - pos_2$. Extrapolates better to long contexts. Standard in LLaMA, Mistral, GPT-NeoX.
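
A sketch of RoPE plus a numerical check of the relative-position property (dimensions and positions are illustrative):

```python
import torch

def rope(x, base=10000.0):
    # x: (seq_len, dim), dim even; rotate each (even, odd) coordinate pair
    # by an angle pos * freq that grows with the position index.
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]                      # (seq, 1)
    freqs = 1.0 / base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)     # (dim/2,)
    theta = pos * freqs                                                            # (seq, dim/2)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * torch.cos(theta) - x2 * torch.sin(theta)
    out[:, 1::2] = x1 * torch.sin(theta) + x2 * torch.cos(theta)
    return out

# The attention score of a fixed q/k pair depends only on the positional offset.
q_vec, k_vec = torch.randn(64), torch.randn(64)

def score(pos_q, pos_k, length=16):
    Q = torch.zeros(length, 64); Q[pos_q] = q_vec
    K = torch.zeros(length, 64); K[pos_k] = k_vec
    return rope(Q)[pos_q] @ rope(K)[pos_k]

print(torch.allclose(score(5, 3), score(9, 7), atol=1e-4))   # True: 5-3 == 9-7
```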

Mixture of Experts (MoE): each token is processed by $k$ out of $N$ FFN "experts": $y = \sum_{i \in \text{Top-}k(\text{router}(x))} g_i \cdot FFN_i(x)$. Router = linear layer + softmax. Mixtral 8×7B: 8 experts, $k=2$ active → 13B active parameters with 47B total. Throughput as in 13B, quality as in 70B.
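
A simplified top-k routing sketch following the formula above; real MoE layers add load-balancing losses and capacity limits, omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoE(nn.Module):
    def __init__(self, d, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
            for _ in range(n_experts)
        )
    def forward(self, x):                                   # x: (tokens, d)
        gates = F.softmax(self.router(x), dim=-1)           # (tokens, n_experts)
        weights, idx = gates.topk(self.k, dim=-1)           # top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(MoE(64)(torch.randn(10, 64)).shape)                   # torch.Size([10, 64])
```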

RLHF and Post-training

A pretrained model generates text similar to the internet — including toxicity, falsehoods, and unhelpful answers. RLHF (Reinforcement Learning from Human Feedback) aligns its behavior.

Step 1 — SFT (Supervised Fine-Tuning): trained on examples “question → good answer” from human annotators. ~10K–100K examples. The model learns to respond in the “assistant” format.
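
A common (but here assumed) way to build SFT training examples: supervise only the answer tokens and mask the prompt with the ignore index of cross-entropy:

```python
# Illustrative token IDs; -100 is the default ignore_index of PyTorch cross_entropy,
# so the prompt contributes nothing to the loss and only the answer is supervised.
prompt_ids = [101, 523, 9176, 7]      # tokens of the question / chat prompt
answer_ids = [742, 88, 1003, 2]       # tokens of the annotated good answer
input_ids = prompt_ids + answer_ids
labels = [-100] * len(prompt_ids) + answer_ids
print(input_ids, labels, sep="\n")
```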

Step 2 — Reward Model (RM): annotators rank pairs of answers. We train a “judge” model: $r_\theta(x,y)$ — numerical assessment of answer $y$ to question $x$. Loss function (Bradley-Terry): $L = -\mathbb{E}[\log \sigma(r_\theta(x,y_w) - r_\theta(x,y_l))]$, $y_w > y_l$ according to annotator opinion.
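
The Bradley-Terry loss in code, assuming the reward model has already produced scalar scores for the preferred and rejected answers:

```python
import torch
import torch.nn.functional as F

def rm_loss(r_chosen, r_rejected):
    # -log sigma(r_w - r_l) rewritten as softplus(r_l - r_w) for numerical stability
    return F.softplus(r_rejected - r_chosen).mean()

r_w = torch.tensor([1.3, 0.2])    # r_theta(x, y_w): scores of preferred answers
r_l = torch.tensor([0.4, -0.5])   # r_theta(x, y_l): scores of rejected answers
print(rm_loss(r_w, r_l))
```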

Step 3 — PPO: optimize the LLM with PPO, using the RM as the reward signal. A KL penalty keeps the policy from drifting too far from SFT: $r(x,y) = RM(x,y) - \beta\cdot KL(\pi_\theta(y|x) \,||\, \pi_{SFT}(y|x))$.
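
A sketch of this KL-shaped reward, using the summed per-token log-ratio as a single-sample estimate of the KL term (function and variable names are illustrative):

```python
import torch

def shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    # Sum of per-token log-ratios estimates KL(pi_theta || pi_SFT) on this sampled answer.
    kl_estimate = (logp_policy - logp_sft).sum()
    return rm_score - beta * kl_estimate

print(shaped_reward(torch.tensor(2.1),
                    torch.tensor([-1.2, -0.8, -2.0]),    # log-probs under the current policy
                    torch.tensor([-1.5, -0.9, -1.7])))   # log-probs under the frozen SFT model
```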

DPO (Direct Preference Optimization, Rafailov et al., 2023): eliminates RM and PPO. Directly optimizes preferences:

$ L_{DPO} = -\mathbb{E}\left[\log \sigma\left(\beta\left(\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right)\right] $

Mathematically equivalent to RLHF under certain conditions, but much simpler and more stable. De facto standard for open-source fine-tuning.
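
A sketch of the DPO loss from the formula above; the inputs are summed log-probabilities of whole answers under the trained policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # beta * (log pi(y_w)/pi_ref(y_w) - log pi(y_l)/pi_ref(y_l))
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
               torch.tensor([-13.0]), torch.tensor([-14.0])))
```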

Scaling Laws and Capabilities

Scaling Laws (Kaplan, 2020): Loss $\propto N^{-0.076}$ (vs parameters), Loss $\propto D^{-0.095}$ (vs data). Predictable improvement. But beyond a threshold scale (~$10^{23}$ FLOPs) emergent abilities appear — capabilities not present in smaller models: chain-of-thought reasoning, few-shot learning, arithmetic.

Chinchilla (Hoffmann, 2022): optimal $N \propto D$, roughly $D \approx 20N$ tokens. With a fixed FLOPs budget, a smaller model trained on more data beats a larger model trained on less. Llama-3-8B trained on 15T tokens outperforms GPT-3-175B trained on 300B tokens.
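
A back-of-the-envelope check using the common approximations $C \approx 6ND$ FLOPs and the Chinchilla rule of thumb $D \approx 20N$ tokens:

```python
def train_flops(n_params, n_tokens):
    # Standard approximation: ~6 FLOPs per parameter per training token
    return 6 * n_params * n_tokens

budget = train_flops(70e9, 1.4e12)     # 70B model on 1.4T tokens ≈ 5.9e23 FLOPs
# Chinchilla-optimal split of the same budget: D = 20 N  =>  6 * N * 20N = budget
n_opt = (budget / 120) ** 0.5
print(f"budget {budget:.1e} FLOPs, compute-optimal N ≈ {n_opt:.1e} params")
# ≈ 7e10: a 70B model on 1.4T tokens is roughly compute-optimal by this rule
```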

Numerical Example

LLaMA-2-7B (7B parameters, 2T training tokens):

  • Architecture: 32 layers, 32 attention heads, $d_{model}=4096$ (standard multi-head attention; GQA is used only in the larger LLaMA-2 variants)
  • KV-cache (32K context, FP16): $32 \times 4096 \times 2 \times 32768 \times 2$ bytes ≈ 16 GB, more than the 14 GB of FP16 weights (see the sketch after this list)
  • Generation speed (A100): ~50 tokens/s (without batching)
  • After DPO fine-tuning (10K preference pairs): MT-Bench score +1.2 points
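
A quick check of the KV-cache estimate above (the helper name and the 8-KV-head comparison are illustrative):

```python
def kv_cache_bytes(n_layers, d_model, seq_len, n_heads, n_kv_heads, bytes_per_el=2):
    d_kv = d_model * n_kv_heads // n_heads                  # per-token K (or V) width actually stored
    return 2 * n_layers * seq_len * d_kv * bytes_per_el     # 2 = keys + values

mha = kv_cache_bytes(32, 4096, 32_768, n_heads=32, n_kv_heads=32)  # full multi-head attention
gqa = kv_cache_bytes(32, 4096, 32_768, n_heads=32, n_kv_heads=8)   # hypothetical 8 KV heads
print(mha / 2**30, "GiB vs", gqa / 2**30, "GiB")                   # 16.0 GiB vs 4.0 GiB
```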

Assignment: (1) Fine-tune LLaMA-3-8B using LoRA (rank=16) on a Russian-language instruction dataset (Saiga). Batch size=4, lr=2e-4, 1000 steps. Evaluate quality via MT-Bench (GPT-4 as judge). (2) Implement DPO training on synthetic preference data (GPT-4 generates $y_w$, SFT model generates $y_l$). Compare to SFT by win rate (GPT-4 evaluation). (3) Measure latency and throughput at batch size 1, 4, 16. Where is the bottleneck: compute or memory bandwidth?
