Optimization in Deep Learning

Mathematical Foundations of Deep Learning

Optimization of Neural Networks

Training a neural network means solving the optimization problem $\min_\theta L(\theta)$ in a space with billions of variables. The landscape of the loss function $L(\theta)$ is complex: saddle points, flat plateaus, ravines. Understanding optimization methods is the key to training deep models successfully.

Stochastic Gradient Descent (SGD)

The full gradient $\nabla L(\theta) = (1/n)\sum_i \nabla l_i(\theta)$ is expensive when $n$ is in the millions. Stochastic approximation: sample a mini-batch $B \subset \{1, \dots, n\}$ and approximate $\hat{g}_t = (1/|B|)\sum_{i \in B} \nabla l_i(\theta_t)$. Update: $\theta_{t+1} = \theta_t - \alpha_t \hat{g}_t$.

Key property: $\mathbb{E}[\hat{g}_t] = \nabla L(\theta_t)$ — unbiased estimator. Variance $\mathrm{Var}[\hat{g}_t] = \sigma^2/|B|$ decreases with batch size.

Convergence theorem (convex case): with a decreasing learning rate $\alpha_t = O(1/\sqrt{t})$ on a smooth convex function, $\mathbb{E}[L(\theta_T)] - L(\theta^*) \leq O(1/\sqrt{T})$. For a $\mu$-strongly convex function with $\alpha_t = O(1/t)$: $O(\sigma^2/(\mu T))$.

SGD issues: (1) Ravines (ill-conditioning): when curvature differs sharply across directions (the Hessian $H$ has a large condition number), SGD oscillates across the steep, narrow direction and makes slow progress along the shallow one. (2) Local minima: not a real problem in deep learning, since they are almost equally good. (3) Saddle points: $\nabla L = 0$ but not a minimum; a theoretical concern, though in practice SGD escapes them quickly thanks to its stochasticity.
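
To make the loop concrete, here is a minimal sketch of mini-batch SGD on a toy linear model ($y = wx + b$, MSE, the same setup as the numerical example below); the noise level and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + 1 plus a little noise (illustrative assumption)
n = 1024
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x + 1.0 + 0.05 * rng.normal(size=n)

w, b = 0.0, 0.0                                 # parameters theta
alpha, batch_size = 0.1, 16

for t in range(100):
    idx = rng.integers(0, n, size=batch_size)   # sample a mini-batch B
    err = w * x[idx] + b - y[idx]               # residuals on the batch
    # Stochastic gradient of the MSE loss: an unbiased estimate of the full gradient
    g_w = 2.0 * np.mean(err * x[idx])
    g_b = 2.0 * np.mean(err)
    w -= alpha * g_w                            # theta_{t+1} = theta_t - alpha * g_hat
    b -= alpha * g_b

print(f"w = {w:.3f}, b = {b:.3f}")              # should land near w* = 2, b* = 1
```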

Adaptive Methods

Momentum (Polyak, 1964): accumulate "velocity" in the gradient direction: $ v_t = \beta v_{t-1} - \alpha \nabla L(\theta_t), \quad \theta_{t+1} = \theta_t + v_t $

Physical analogy: a ball is rolling over the loss surface, gaining speed in persistent directions and slowing down when the direction changes. $\beta = 0.9$ is standard.
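
A minimal sketch of the heavy-ball update above, applied to a toy 1-D quadratic; the quadratic and the step size are illustrative assumptions.

```python
import numpy as np

def momentum_step(theta, v, grad, alpha=0.1, beta=0.9):
    """One momentum (heavy-ball) update."""
    v = beta * v - alpha * grad     # v_t = beta * v_{t-1} - alpha * grad L(theta_t)
    theta = theta + v               # theta_{t+1} = theta_t + v_t
    return theta, v

# Toy example: L(theta) = 0.5 * theta^2, gradient = theta, minimum at 0
theta, v = 5.0, 0.0
for _ in range(100):
    theta, v = momentum_step(theta, v, grad=theta)
print(theta)                        # spirals in toward the minimum at 0
```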

RMSProp (Hinton, 2012): Adaptive learning rate for each parameter: $ v_t = \beta v_{t-1} + (1 - \beta)(\nabla L)^2, \quad \theta_{t+1} = \theta_t - \alpha \nabla L / \sqrt{v_t + \epsilon} $

$v_t$ is the exponentially weighted average of the squared gradient. Parameters with larger gradients receive a smaller effective lr. Helpful with sparse gradients (NLP).
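
A sketch of a single RMSProp step following the formula above; note that $v_t$ and the division are elementwise, so each parameter gets its own effective learning rate.

```python
import numpy as np

def rmsprop_step(theta, v, grad, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update; v is a running average of squared gradients, elementwise."""
    v = beta * v + (1.0 - beta) * grad**2
    theta = theta - alpha * grad / np.sqrt(v + eps)   # effective lr = alpha / sqrt(v + eps)
    return theta, v
```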

Adam (Kingma & Ba, 2014): combines momentum and RMSProp:

  • $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ (first moment — mean)
  • $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ (second moment — variance)
  • $\hat{m}_t = m_t/(1-\beta_1^t)$, $\hat{v}_t = v_t/(1-\beta_2^t)$ (bias correction)
  • $\theta_{t+1} = \theta_t - \alpha \hat{m}_t/\sqrt{\hat{v}_t + \epsilon}$

$\beta_1 = 0.9$, $\beta_2 = 0.999$ — standard. Bias correction is important at the start: without it, $\hat{m}_1 = (1-\beta_1)g_1 \ll g_1$.
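
The four lines above translate directly into code; a from-scratch sketch of one Adam step (the hyperparameter defaults follow the paper).

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad         # first moment (mean of gradients)
    v = beta2 * v + (1.0 - beta2) * grad**2      # second moment (uncentered variance)
    m_hat = m / (1.0 - beta1**t)                 # without this, m_1 = (1 - beta1) * g_1 << g_1
    v_hat = v / (1.0 - beta2**t)
    theta = theta - alpha * m_hat / np.sqrt(v_hat + eps)
    return theta, m, v
```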

AdamW: Adam with decoupled (correct) weight decay: $\theta \leftarrow (1 - \alpha\lambda)\,\theta - \alpha\,\hat{m}_t/\sqrt{\hat{v}_t + \epsilon}$. The standard choice for transformers.
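
In practice the difference is where $\lambda$ enters: plain Adam adds $\lambda\theta$ to the gradient (so the decay gets rescaled by $\sqrt{\hat{v}_t}$), while AdamW decays the weights directly. A minimal PyTorch comparison; the model and hyperparameter values are placeholders.

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder model

# Adam: weight_decay is folded into the gradient, then rescaled by the adaptive denominator
opt_adam = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.01)

# AdamW: decoupled decay, theta <- (1 - lr * lambda) * theta, outside the adaptive update
opt_adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```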

Learning Rate Schedules

The learning rate is the most critical hyperparameter. Too large: divergence. Too small: slow convergence.

Cosine annealing: $\alpha_t = \alpha_\text{min} + (\alpha_\text{max} - \alpha_\text{min})/2 \cdot (1 + \cos(\pi t / T))$. Smoothly lowers lr from $\alpha_\text{max}$ to $\alpha_\text{min}$ over $T$ steps.

Warmup + cosine (BERT, ViT): for the first $W$ steps the lr increases linearly from 0 to $\alpha_\text{max}$, then follows cosine decay. Warmup is critical for transformers: without it, a large lr applied to randomly initialized weights causes instability.
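
A small sketch of warmup + cosine as a function of the step number, matching the formulas above; the step counts and peak lr in the usage line are placeholder values.

```python
import math

def warmup_cosine_lr(step, warmup_steps, total_steps, lr_max, lr_min=0.0):
    """Linear warmup from 0 to lr_max, then cosine decay to lr_min."""
    if step < warmup_steps:
        return lr_max * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# e.g. 1,000 warmup steps out of 100,000 total, peak lr 3e-4 (placeholder values)
lrs = [warmup_cosine_lr(s, 1_000, 100_000, 3e-4) for s in range(100_000)]
```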

1-cycle policy (Smith, 2017): the lr increases from base_lr to max_lr (warmup), then decreases to base_lr/100 (cosine annealing). It gives fast convergence: training that previously needed around 10 epochs can sometimes be done in 1.
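
PyTorch ships this schedule as `torch.optim.lr_scheduler.OneCycleLR`; a minimal usage sketch with placeholder model and step counts.

```python
import torch

model = torch.nn.Linear(10, 1)                                   # placeholder
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
sched = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=0.1, total_steps=1_000)

# In the training loop: opt.step() followed by sched.step(), once per batch.
```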

Normalization

Batch Normalization (Ioffe & Szegedy, 2015): for a batch $X$: $\hat{x} = (x - \mu)/\sqrt{\sigma^2 + \epsilon}$, $y = \gamma \hat{x} + \beta$ ($\gamma, \beta$ — trainable). Advantages: stabilization of activations → larger lr, regularization effect (noise from batch stats).
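
A from-scratch sketch of the training-mode forward pass above (at inference, running averages of $\mu$ and $\sigma^2$ are used instead; omitted here).

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm: x has shape (batch, features); gamma, beta have shape (features,)."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta            # learnable scale and shift
```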

Layer Normalization: normalization over the feature dimension (not over the batch), which is needed for NLP (variable-length sequences). RMSNorm (no centering): $\hat{x} = x/\mathrm{RMS}(x)$ with $\mathrm{RMS}(x) = \sqrt{\tfrac{1}{d}\sum_i x_i^2}$; used in LLaMA and Mistral.
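
A sketch of both normalizations over the feature axis; the learnable scale `gamma` in RMSNorm follows the usual convention, and the epsilon values are typical defaults rather than anything mandated by the text.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each example over its feature axis; no batch statistics involved."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: scale by the root mean square of the features, no centering."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms
```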

Numerical Example

Simple task: fitting $y = w \cdot x + b$ with MSE, $x \in [-1, 1]$, $w^* = 2$, $b^* = 1$, $\alpha = 0.1$, batch size 16. After 100 iterations of SGD: $w \approx 1.95$, $b \approx 0.99$ (some noise remains). Adam with $\beta_1 = 0.9$, $\beta_2 = 0.999$ over the same 100 iterations: $w \approx 1.999$, $b \approx 1.000$. Adam converges faster and more precisely on this task thanks to its adaptive lr.
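
A sketch to reproduce the comparison with PyTorch's built-in optimizers; the noise level, seed, and the use of $\alpha = 0.1$ for Adam as well are assumptions, so the exact numbers will wobble.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(1024, 1) * 2 - 1                    # x in [-1, 1]
y = 2.0 * x + 1.0 + 0.05 * torch.randn_like(x)     # w* = 2, b* = 1, small noise (assumption)

def run(opt_name):
    model = torch.nn.Linear(1, 1)
    torch.nn.init.zeros_(model.weight)
    torch.nn.init.zeros_(model.bias)
    if opt_name == "sgd":
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
    else:
        opt = torch.optim.Adam(model.parameters(), lr=0.1, betas=(0.9, 0.999))
    for _ in range(100):
        idx = torch.randint(0, x.shape[0], (16,))  # mini-batch of 16
        loss = F.mse_loss(model(x[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model.weight.item(), model.bias.item()

print("SGD :", run("sgd"))
print("Adam:", run("adam"))
```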

Assignment: Train ResNet-20 on CIFAR-10 with three optimizers: SGD + momentum, Adam, AdamW. For each, select the lr via grid search over $\{0.001, 0.01, 0.1\}$. Plot val accuracy vs. epoch curves. Implement cosine annealing with warmup. Which optimizer + schedule combination gives the best val accuracy?
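
A starting-point sketch for the training setup; `ResNet20` stands in for your own implementation (torchvision does not ship one), and the warmup length, epoch count, and weight-decay values are placeholder choices.

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

def make_optimizer(name, params, lr):
    if name == "sgd":
        return torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=5e-4)
    if name == "adam":
        return torch.optim.Adam(params, lr=lr)
    return torch.optim.AdamW(params, lr=lr, weight_decay=0.05)

def make_scheduler(opt, warmup_epochs=5, total_epochs=100):
    """Linear warmup for the first epochs, then cosine annealing to (near) zero."""
    warmup = LinearLR(opt, start_factor=0.01, total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(opt, T_max=total_epochs - warmup_epochs, eta_min=1e-5)
    return SequentialLR(opt, schedulers=[warmup, cosine], milestones=[warmup_epochs])

# model = ResNet20()                         # your implementation (hypothetical name)
# for lr in (0.001, 0.01, 0.1):              # grid search from the assignment
#     opt = make_optimizer("sgd", model.parameters(), lr)
#     sched = make_scheduler(opt)
#     # train for total_epochs, call sched.step() once per epoch, log val accuracy
```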
