
Gradient Descent and Nesterov Acceleration

First-Order Algorithms

Why Are First-Order Algorithms Needed?

Second-order methods (Newton's method) converge very quickly, but require computing and inverting the Hessian matrix, which costs O(n³) operations per step. With n = 10⁶ parameters (a medium-sized neural network), this is simply infeasible. First-order methods use only the gradient, at O(n) operations per step. They need more iterations to converge, but each iteration is cheap, which makes them irreplaceable for high-dimensional problems.

Gradient Descent: Basic Algorithm

The idea is simple: move in the direction of greatest decrease of the function.

x^{k+1} = x^k − α ∇f(x^k)

Here α > 0 is the step size (learning rate). If α is too large, the algorithm diverges; if it is too small, convergence is very slow.
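
A minimal sketch of this update in Python (the function and parameter names are illustrative, not tied to any library):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, n_iters):
    """Plain gradient descent: repeat x <- x - alpha * grad_f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - alpha * grad_f(x)
    return x

# f(x) = 0.5 * ||x||^2 has gradient x and its minimum at the origin.
print(gradient_descent(lambda x: x, x0=[3.0, -2.0], alpha=0.1, n_iters=100))
```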

The Class of L-smooth Functions: f is called L-smooth if ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y. The constant L measures the "degree of smoothness": for twice-differentiable f, it is an upper bound on the largest eigenvalue of the Hessian matrix.

Convergence Theorem: for convex L-smooth f with step size α = 1/L:

f(x^k) − f* ≤ L‖x⁰ − x*‖² / (2k)

This is O(1/k) rate: to halve the error, you need to double the number of iterations.

For μ-strongly convex functions (all eigenvalues of ∇²f ≥ μ > 0):

f(x^k) − f* ≤ (1 − μ/L)^k · (f(x⁰) − f*)

This is linear (exponential) convergence! The number κ = L/μ is the "condition number" of the problem. If κ = 1, convergence is instantaneous; if κ = 1000, roughly κ·ln(10⁴) ≈ 9000 iterations are needed for 4 digits of accuracy.
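
The linear rate is easy to verify numerically. The sketch below runs gradient descent on a small diagonal quadratic (an illustrative choice), where L and μ are simply the largest and smallest eigenvalues of the Hessian, and checks the bound above at every step:

```python
import numpy as np

# f(x) = 0.5 * sum(d_i * x_i^2): the Hessian is diag(d), so L = max(d),
# mu = min(d), and f* = 0 at x* = 0.
d = np.array([1.0, 4.0, 10.0])
L, mu = d.max(), d.min()
f = lambda x: 0.5 * np.sum(d * x**2)

x = np.ones(3)
f0 = f(x)
for k in range(1, 51):
    x = x - (1.0 / L) * (d * x)      # gradient step with alpha = 1/L
    bound = (1 - mu / L) ** k * f0   # the linear-rate bound above
    assert f(x) <= bound + 1e-12     # observed error stays under the bound
```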

Nesterov Acceleration (1983)

Yurii Nesterov discovered a remarkable fact: by adding "momentum" to gradient descent, the convergence rate can be improved from O(1/k) to O(1/k²) without increasing the cost of a single iteration.

Nesterov Algorithm (NAG):

y^{k+1} = x^k − α ∇f(x^k) (gradient step)
x^{k+1} = y^{k+1} + βₖ(y^{k+1} − y^k) (momentum)

where βₖ = (k−1)/(k+2) — the "momentum" coefficient, which grows towards 1.
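
Transcribed directly into Python, the two steps might look like this (a sketch; names are illustrative):

```python
import numpy as np

def nesterov(grad_f, x0, alpha, n_iters):
    """NAG as written above: gradient step to y, then extrapolation to x."""
    x = np.asarray(x0, dtype=float)
    y_prev = x.copy()
    for k in range(1, n_iters + 1):
        y = x - alpha * grad_f(x)     # gradient step
        beta = (k - 1) / (k + 2)      # momentum coefficient, grows towards 1
        x = y + beta * (y - y_prev)   # momentum (extrapolation) step
        y_prev = y
    return y_prev                     # the last gradient-step iterate
```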

Convergence Rate: O(1/k²), doubling the exponent of gradient descent's O(1/k). This is theoretically optimal for first-order methods on smooth convex problems (Nesterov's lower bound).

Physical Meaning: momentum prevents getting stuck in "ravines", regions with large κ. Like a ball rolling with inertia, the iterate sweeps along the narrow bottom of the ravine instead of bouncing between its walls, and stops closer to the minimum.

Full Example Analysis

Problem: Minimize f(x) = (1/2)(x₁² + 100x₂²) — an elongated parabola with κ = 100.

Gradient descent with α = 1/L = 1/100:

x^{k+1} = x^k − (1/100)·(x₁^k, 100x₂^k) = ((1 − 1/100)·x₁^k, (1 − 1)·x₂^k) = (0.99·x₁^k, 0)

Along x₂, instant zeroing out; along x₁, slow progress: x₁^k = (99/100)^k x₁⁰. After 1000 iterations: (0.99)^{1000} ≈ 0.000043, i.e. about 4 digits of accuracy.

Nesterov: converges in ~20√κ ≈ 200 iterations instead of 1000. That's a 5-fold speedup at κ = 100.
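
The comparison is easy to reproduce. The sketch below runs both methods for 200 iterations on this exact quadratic (the exact numbers will differ somewhat from the back-of-the-envelope estimate above):

```python
import numpy as np

d = np.array([1.0, 100.0])           # f(x) = 0.5*(x1^2 + 100*x2^2), kappa = 100
f = lambda x: 0.5 * np.sum(d * x**2)
grad = lambda x: d * x
alpha = 1.0 / 100.0                  # alpha = 1/L

x_gd = np.array([1.0, 1.0])          # gradient-descent iterate
x, y_prev = x_gd.copy(), x_gd.copy() # NAG state
for k in range(1, 201):
    x_gd = x_gd - alpha * grad(x_gd)          # plain gradient descent
    y = x - alpha * grad(x)                   # NAG: gradient step
    x = y + (k - 1) / (k + 2) * (y - y_prev)  # NAG: momentum step
    y_prev = y

print(f(x_gd), f(y_prev))   # NAG's error is far smaller after the same 200 steps
```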

Stochastic Gradient Descent (SGD)

For machine learning problems: f(x) = (1/N) Σᵢ fᵢ(x). With N = 10⁶, computing the full gradient costs O(N). SGD uses a random subsample:

x^{k+1} = x^k − αₖ ∇fᵢ(x^k), where i is chosen randomly

Iteration cost: O(1) in N instead of O(N). The price: "noise" in the gradient estimate, so a fixed step size only reaches a noise-dominated neighborhood of the minimum; the step must decay over time.

Rate: O(1/√k) for convex, O(1/k) for strongly convex (with proper decay of the step size).
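
A minimal SGD sketch on a noisy least-squares problem (the problem, its sizes, and the 1/k schedule are illustrative; 1/k-type decay is the classical choice for strongly convex objectives):

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = (1/N) * sum_i 0.5 * (a_i . x - b_i)^2, a strongly convex least-squares fit
N, dim = 1000, 5
A = rng.normal(size=(N, dim))
x_true = rng.normal(size=dim)
b = A @ x_true + 0.1 * rng.normal(size=N)  # noisy labels keep gradients noisy at the optimum

x = np.zeros(dim)
for k in range(1, 20001):
    i = rng.integers(N)              # pick one random component f_i
    g = (A[i] @ x - b[i]) * A[i]     # stochastic gradient: cost independent of N
    x = x - (1.0 / k) * g            # decaying step alpha_k = 1/k

print(np.linalg.norm(x - x_true))    # close to x_true, up to noise-level error
```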

Adaptive methods (Adam, RMSProp, AdaGrad) scale the step per coordinate. In deep learning, Adam usually works better out of the box than SGD with a fixed step size; a sketch of its update follows.
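
For intuition, here is what a single Adam update looks like, following the moment estimates of Kingma & Ba (2014) with the standard default hyperparameters (a sketch, not a library implementation):

```python
import numpy as np

def adam_step(x, g, m, v, k, alpha=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponentially smoothed first and second moments,
    bias-corrected, then a per-coordinate scaled step."""
    m = b1 * m + (1 - b1) * g            # first moment (smoothed gradient)
    v = b2 * v + (1 - b2) * g**2         # second moment (smoothed square)
    m_hat = m / (1 - b1**k)              # bias correction for early steps
    v_hat = v / (1 - b2**k)
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)  # adaptive coordinate-wise step
    return x, m, v
```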

Applications in Machine Learning

All modern neural network optimization builds on variants of SGD with momentum; large language models such as GPT-4 are trained with Adam-family optimizers at the scale of hundreds of billions of parameters. Nesterov acceleration is used in training physics-based models, in compression problems, and in recommender systems. Understanding theoretical convergence rates lets you choose the right algorithm and tune its hyperparameters.

Connection with Modern Libraries

In machine learning practice, first-order algorithms are implemented in specialized libraries: PyTorch (torch.optim.SGD, Adam, AdamW), TensorFlow (tf.keras.optimizers), JAX (optax). All of them are built around gradient methods, extended with adaptive steps, momentum, and normalization. Adam (Kingma & Ba, 2014) maintains exponentially smoothed estimates of the first and second moments of the gradient; this automatically adapts the step for each coordinate, which is critical for sparse or ill-conditioned problems. For very large models (LLMs with billions of parameters), distributed versions such as ZeRO and FSDP are used, sharding the optimizer state across GPUs.
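
In PyTorch, switching between these optimizers is a one-line change. A minimal training loop might look like this (the toy model and data are illustrative):

```python
import torch

model = torch.nn.Linear(10, 1)                   # toy model
X, y = torch.randn(64, 10), torch.randn(64, 1)   # toy data
loss_fn = torch.nn.MSELoss()

# SGD with Nesterov momentum (nesterov=True requires momentum > 0)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
# Adaptive alternative: opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()                              # gradients via autograd
    opt.step()                                   # one first-order update
```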

Comparison of Convergence Rates

In a convex problem:

  • Gradient descent: O(1/k) for smooth functions, O(1/√k) for non-smooth
  • Accelerated (Nesterov): O(1/k²) — theoretical optimum for smooth convex problems
  • Prox methods: O(1/k) or O(1/k²) with acceleration (FISTA)
  • Stochastic gradient descent (SGD): O(1/√k) for convex, O(1/k) for strongly convex with averaging

This order of convergence determines how many iterations are needed to reach a given accuracy ε: for O(1/k), about 1/ε iterations; for O(1/k²), only about 1/√ε. The difference is enormous at ε = 10⁻⁶: 10⁶ versus 10³ iterations.
