Large Language Models: Capabilities and Limitations
GPT-4, Claude, Gemini, LLaMA: large language models (LLMs) have become a transformative technology of the 2020s. Understanding their capabilities, training mechanisms, and limitations is essential for every AI specialist.
Scaling Laws
Scaling Law (Kaplan et al., OpenAI, 2020):
Loss(N) ∝ N^{−0.076}, Loss(D) ∝ D^{−0.095}, Loss(C) ∝ C^{−0.050}
Interpretation: N is the number of parameters, D the dataset size in tokens, and C the compute budget in FLOPs; each power law holds when the other two factors are not the bottleneck. Loss decreases as a power law as each factor grows, so improvement is predictable and continuous. Practical conclusion: more data + a larger model + more compute → better results.
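A rough illustration of what these exponents imply; a minimal sketch using only the exponents quoted above (absolute constants are omitted, so only loss ratios are meaningful):

    # Relative loss change predicted by a Kaplan-style power law when one
    # factor is scaled up and the others are not the bottleneck.
    def loss_ratio(scale_factor: float, exponent: float) -> float:
        """Loss(after) / Loss(before) when a factor is multiplied by scale_factor."""
        return scale_factor ** (-exponent)

    print(loss_ratio(10, 0.076))  # 10x parameters -> ~0.84 (about 16% lower loss)
    print(loss_ratio(10, 0.095))  # 10x data       -> ~0.80 (about 20% lower loss)
    print(loss_ratio(10, 0.050))  # 10x compute    -> ~0.89 (about 11% lower loss)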
Hoffmann-Chinchilla Law (DeepMind, 2022): at a fixed compute budget C, the optimal N and D grow proportionally (N ∝ D), with roughly D ≈ 20·N, i.e. about 20 tokens per parameter. GPT-3 (175B) was trained on 300B tokens → "undertrained". Chinchilla (70B, 1.4T tokens) outperforms Gopher (280B, 300B tokens). Strategy: fewer parameters, more data → greater efficiency.
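A quick check of the 20-tokens-per-parameter rule of thumb against the models mentioned here; a sketch of the rule of thumb, not the exact DeepMind fit:

    # Chinchilla rule of thumb: compute-optimal token count ≈ 20 × parameter count.
    def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
        return tokens_per_param * n_params

    for name, n_params, trained_on in [("GPT-3", 175e9, 0.3e12),
                                       ("Chinchilla", 70e9, 1.4e12),
                                       ("LLaMA-3-8B", 8e9, 15e12)]:
        optimal = chinchilla_optimal_tokens(n_params)
        print(f"{name}: optimal ≈ {optimal / 1e12:.2f}T tokens, trained on {trained_on / 1e12:.1f}T")
    # GPT-3:      optimal ≈ 3.50T, trained on 0.3T  -> undertrained
    # Chinchilla: optimal ≈ 1.40T, trained on 1.4T  -> compute-optimal
    # LLaMA-3-8B: optimal ≈ 0.16T, trained on 15T   -> deliberately over-trained for cheap inference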
Inference efficiency: LLaMA-3-8B (8B parameters, 15T tokens): 90% of GPT-3 performance with 22× fewer parameters and open access.
Emergent Abilities
Phenomenon: upon reaching a threshold scale (~10²³ training FLOPs), models suddenly acquire new capabilities that smaller models lack. This cannot be predicted in advance by extrapolating from smaller models.
Examples of emergent abilities: multi-step arithmetic (GPT-3-small: 0%, GPT-3-large: 75%), chain-of-thought reasoning, multilingual translation (even without multilingual data!), code generation, analogical reasoning.
Chain-of-Thought (Wei et al., 2022): a few worked examples with explicit reasoning (or even just the instruction "let's think step by step") → a sharp improvement in reasoning. GSM8K (arithmetic word problems): zero-shot GPT-3: 17%; CoT GPT-3: 57%; CoT GPT-4: 92%; self-consistency (sample several CoT traces, take a majority vote): 95%.
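Concretely, a chain-of-thought prompt just prepends worked examples whose answers spell out the reasoning. A minimal sketch (the demonstration problem is made up for illustration):

    # Few-shot chain-of-thought prompt: the demonstration shows explicit reasoning,
    # nudging the model to reason step by step before giving its final answer.
    COT_PROMPT = """\
    Q: A farmer has 15 sheep and buys 8 more. Then 5 run away. How many are left?
    A: Start with 15. After buying 8: 15 + 8 = 23. After 5 run away: 23 - 5 = 18. The answer is 18.

    Q: {question}
    A: Let's think step by step."""

    question = "There are 48 passengers on a bus. At a stop, 12 get off and 8 get on. How many are on the bus?"
    prompt = COT_PROMPT.format(question=question)
    # `prompt` is then sent to the model; the final number is parsed from its reply.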
RLHF: Reinforcement Learning from Human Feedback
Alignment problem: a language model optimized only for next-token prediction can be helpful or harmful, honest or dishonest, all depending on the prompt. An additional training stage is necessary.
RLHF pipeline (InstructGPT, Ouyang et al., 2022):
- SFT (Supervised Fine-Tuning): further train the LLM on pairs (question, high-quality answer) written by human annotators (≈50K pairs). The model learns to "answer correctly".
- Reward Model: annotators rank 4 model responses to each question. A reward model r_θ(x, y) is trained to give a scalar quality score to answer y for question x. Objective (Bradley-Terry): L = −E[log σ(r_θ(x, y_w) − r_θ(x, y_l))], where y_w ≻ y_l (see the sketch after this list).
- PPO: optimize the LLM with PPO using the reward r = RM(x, y) − β·KL(π_θ || π_SFT). The KL penalty prevents collapse ("saying only what people like").
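The Bradley-Terry objective from the reward-model step maps almost one-to-one onto code. A minimal PyTorch sketch, assuming the reward model has already produced scalar scores for the preferred and rejected answers in a batch:

    import torch
    import torch.nn.functional as F

    def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        """Bradley-Terry loss: -E[log sigmoid(r_theta(x, y_w) - r_theta(x, y_l))]."""
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    # Toy batch of 3 preference pairs: the loss falls as the chosen rewards
    # pull further above the rejected ones.
    r_w = torch.tensor([1.2, 0.3, 2.0])   # rewards for the preferred answers y_w
    r_l = torch.tensor([0.1, 0.5, -1.0])  # rewards for the rejected answers y_l
    print(reward_model_loss(r_w, r_l))    # ≈ 0.38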
DPO (Rafailov et al., 2023): Direct preference optimization without RM and PPO:
L_DPO = −E[log σ(β(log π_θ(y_w|x)/π_ref(y_w|x) − log π_θ(y_l|x)/π_ref(y_l|x)))]
Simpler and more stable than RLHF, and no worse on most tasks.
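The DPO loss can be written directly from the formula above. A minimal PyTorch sketch; it assumes the summed log-probability of each whole answer under the trained policy π_θ and the frozen reference π_ref has already been computed:

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
        """-E[log sigmoid(beta * (log-ratio of y_w minus log-ratio of y_l))]."""
        chosen_logratio = policy_logp_w - ref_logp_w      # log pi_theta(y_w|x) / pi_ref(y_w|x)
        rejected_logratio = policy_logp_l - ref_logp_l    # log pi_theta(y_l|x) / pi_ref(y_l|x)
        return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

    # Toy example: the policy prefers y_w a bit more strongly than the reference does.
    loss = dpo_loss(policy_logp_w=torch.tensor([-12.0]), policy_logp_l=torch.tensor([-15.0]),
                    ref_logp_w=torch.tensor([-13.0]), ref_logp_l=torch.tensor([-14.0]))
    print(loss)  # ≈ 0.60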
Limitations of LLMs
Hallucinations: LLMs confidently generate factually incorrect information. The reason: there is no explicit memory of facts, only statistical patterns, and the model does not distinguish "I know" from "I don't know." Studies report GPT-4 hallucinating on 15–30% of legal/medical queries.
Reasoning: LLMs are weak on formal tasks (logic, arithmetic) whenever no template from the training data can be reproduced. Tool use (a Python interpreter, a calculator) solves the arithmetic; RAG compensates for the lack of factual memory.
Context window: the context window is finite (4K–200K tokens), and LLMs use information in the middle of the context worse than at its edges ("Lost in the Middle," Liu et al., 2023). There is no true long-term memory.
Prompt injection: an attacker can hijack control of an LLM through specially crafted text placed in its context, a safety problem for autonomous AI agents.
Numerical Example
GPT-3 (text-davinci-003) on a GSM8K-style task: "There are 48 passengers on a bus. At a stop, 12 get off and 8 get on. How many passengers are on the bus?"
Zero-shot: “48 − 12 + 8 = 44.” ✓ (simple problem)
"A shop sold 52 apples on the first day and twice as many on the second. How many apples were sold over the two days?" Zero-shot: "52 + 104 = 156." ✓
Chain-of-thought: “First day: 52. Second: 52×2=104. Total: 52+104=156.” ✓ (CoT helps on multi-step problems, not simple ones)
"Three parts add up to a total X: A + B + C = X. If A = X/3 and B = X/4, find C" (formal logic). Zero-shot: incorrect answer 65% of the time. CoT: 78% correct. Symbolic reasoning via Python: 100%.
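This last case is exactly where handing the formal step to a tool pays off. A minimal sketch of "symbolic reasoning via Python" using sympy:

    # Offload the formal step to a symbolic solver instead of free-form generation.
    from sympy import Eq, solve, symbols

    X, C = symbols("X C")
    A, B = X / 3, X / 4
    # A + B + C = X  ->  solve for C
    print(solve(Eq(A + B + C, X), C)[0])  # 5*X/12, i.e. C is five twelfths of the total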
Assignment: use the GPT-3.5 API (or an open-source LLaMA-7B via ollama). (1) Write 3 prompts for a reasoning task (a logical problem requiring ~5 steps): zero-shot, few-shot (3 examples), and chain-of-thought. (2) Evaluate accuracy on 20 problems. (3) For 5 incorrect answers, analyze the type of error (distraction, arithmetic, logic). (4) Implement RAG: embed a Wikipedia text, retrieve the relevant fragments, and add them to the context. How much does accuracy improve on factual questions?
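A possible starting point for parts (1)-(2) of the assignment, assuming the official openai Python client (v1+) and an OPENAI_API_KEY in the environment; the prompt templates, the answer check, and the problem list are placeholders to extend:

    # Sketch of an evaluation harness comparing prompting strategies.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPTS = {
        "zero-shot": "Answer with a single number.\n\nQ: {q}\nA:",
        "cot": "Q: {q}\nA: Let's think step by step.",
        # "few-shot": add 3 worked examples here for part (1)
    }

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content

    # Placeholder problem set; replace with your 20 reasoning problems for part (2).
    problems = [("There are 48 passengers on a bus. 12 get off, 8 get on. How many?", "44")]
    for name, template in PROMPTS.items():
        correct = sum(gold in ask(template.format(q=q)) for q, gold in problems)  # crude substring check
        print(f"{name}: {correct}/{len(problems)}")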