Module IV · Article I
Convex Optimization: First-Order Methods
Convex Optimization in Machine Learning
Many ML problems have a convex structure: linear regression, logistic regression, support vector machines (SVM), LASSO, and Ridge regression. Such problems guarantee a global optimum, and a rich theory of efficient algorithms exists for them. Knowledge of convex optimization is fundamental to understanding why ML model training works.
Convex Functions and Problems
Definition of convexity: A function f: ℝⁿ → ℝ is convex if for all x, y ∈ dom(f) and α ∈ [0,1]:
f(αx + (1−α)y) ≤ αf(x) + (1−α)f(y)
Geometrically: the chord between any two points on the graph lies on or above the graph. Consequence: a local minimum is automatically a global minimum.
L-smoothness: the gradient of f is Lipschitz, ||∇f(x) − ∇f(y)|| ≤ L||x−y|| (the gradient does not change too rapidly). If f is twice differentiable, this is equivalent to ||∇²f(x)|| ≤ L (the largest eigenvalue of the Hessian is bounded by L). Consequence (descent lemma):
f(y) ≤ f(x) + ∇f(x)ᵀ(y−x) + (L/2)||y−x||²
μ-strong convexity: f(y) ≥ f(x) + ∇f(x)ᵀ(y−x) + (μ/2)||y−x||² for all x, y. Meaning: the function is “not too flat”: it admits a quadratic lower bound. The condition number κ = L/μ: the larger it is, the harder the problem is to optimize.
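As a quick illustration (not part of the exposition above), here is a minimal Python sketch for a made-up quadratic f(x) = 0.5·xᵀAx: L and μ are the extreme eigenvalues of the Hessian, and both inequalities can be verified numerically.

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T A x: its Hessian is A everywhere,
# so L = lambda_max(A) and mu = lambda_min(A).
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 20))
A = M.T @ M / 50 + 0.1 * np.eye(20)     # symmetric positive definite
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

eig = np.linalg.eigvalsh(A)
L, mu = eig[-1], eig[0]
print(f"L = {L:.3f}, mu = {mu:.3f}, kappa = L/mu = {L / mu:.1f}")

# Check the descent lemma (upper bound) and strong convexity (lower bound) at random points.
x, y = rng.standard_normal(20), rng.standard_normal(20)
linear = f(x) + grad(x) @ (y - x)
assert f(y) <= linear + 0.5 * L * np.sum((y - x) ** 2) + 1e-9
assert f(y) >= linear + 0.5 * mu * np.sum((y - x) ** 2) - 1e-9
```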
Gradient Descent Methods
Gradient Descent (GD): xₜ₊₁ = xₜ − α∇f(xₜ), with the classical step size α = 1/L for L-smooth functions (assumed in the rates below).
Convergence theorem:
- Convex, L-smooth: f(x_T) − f* ≤ L||x₀−x*||²/(2T). Convergence: O(1/T).
- μ-strongly convex, L-smooth: ||xₜ−x*||² ≤ (1−μ/L)ᵗ ||x₀−x*||². Convergence: O(exp(−t/κ)).
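A minimal GD sketch with the 1/L step on an illustrative strongly convex quadratic (all data and names below are made up for illustration):

```python
import numpy as np

def gradient_descent(grad, x0, L, iters=500):
    """Plain GD with the constant step size alpha = 1/L."""
    x = x0.copy()
    for _ in range(iters):
        x = x - (1.0 / L) * grad(x)
    return x

# Illustrative strongly convex quadratic f(x) = 0.5 x^T A x - b^T x, grad f(x) = A x - b.
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 20))
A = M.T @ M / 50 + 0.1 * np.eye(20)
b = rng.standard_normal(20)
L = np.linalg.eigvalsh(A)[-1]              # smoothness constant
x_star = np.linalg.solve(A, b)             # exact minimizer, for reference
x_gd = gradient_descent(lambda x: A @ x - b, np.zeros(20), L)
print("distance to optimum:", np.linalg.norm(x_gd - x_star))
```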
Nesterov Acceleration (1983): adds “momentum” from the previous iteration:
yₜ = xₜ + ((t−1)/(t+2))·(xₜ − xₜ₋₁) (momentum step)
xₜ₊₁ = yₜ − (1/L)∇f(yₜ) (gradient step)
Convergence: O(1/t²) for convex functions — optimal for first-order methods! Doubling t reduces error 4-fold (vs 2-fold for GD).
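A sketch of the accelerated method with the (t−1)/(t+2) momentum weight, again on a made-up quadratic:

```python
import numpy as np

def nesterov_agd(grad, x0, L, iters=200):
    """Accelerated GD: momentum weight (t-1)/(t+2), then a 1/L gradient step."""
    x_prev, x = x0.copy(), x0.copy()
    for t in range(1, iters + 1):
        y = x + (t - 1) / (t + 2) * (x - x_prev)   # momentum step
        x_prev, x = x, y - (1.0 / L) * grad(y)     # gradient step
    return x

# Same kind of illustrative quadratic as in the GD sketch above.
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 20))
A = M.T @ M / 50 + 0.1 * np.eye(20)
b = rng.standard_normal(20)
L = np.linalg.eigvalsh(A)[-1]
x_agd = nesterov_agd(lambda x: A @ x - b, np.zeros(20), L)
print("distance to optimum:", np.linalg.norm(x_agd - np.linalg.solve(A, b)))
```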
Proximal Methods
Many ML problems have the form min_x f(x) + g(x), where f is smooth and convex (e.g., MSE) and g is nonsmooth and convex (e.g., λ||x||₁ for LASSO).
Proximal operator: prox_{αg}(v) = argmin_x {(1/2)||x−v||² + αg(x)}. Has analytic solution for many g:
- g(x) = λ||x||₁: prox = sign(v)·max(|v|−αλ, 0) (soft-thresholding)
- g(x) = λ||x||₂: prox = max(1−αλ/||v||, 0)·v (block soft-thresholding: shrinks the whole vector toward 0)
- g = I_C (indicator of convex set C): prox = projection onto C
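A short sketch of these three prox maps; the indicator example is specialized to a box constraint, which is an illustrative choice:

```python
import numpy as np

def prox_l1(v, t):
    """prox of t*||.||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_l2(v, t):
    """prox of t*||.||_2: block soft-thresholding (shrinks the whole vector toward 0)."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= t else (1.0 - t / nrm) * v

def prox_box(v, lo, hi):
    """prox of the indicator of the box [lo, hi]^n: Euclidean projection onto the box."""
    return np.clip(v, lo, hi)

v = np.array([3.0, -0.5, 0.2, -2.0])
print(prox_l1(v, 1.0))        # entries shrink by 1 in magnitude; small ones become 0
print(prox_l2(v, 1.0))        # the whole vector is scaled toward 0
print(prox_box(v, -1.0, 1.0)) # each entry is clipped to [-1, 1]
```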
ISTA (Iterative Shrinkage-Thresholding Algorithm):
xₜ₊₁ = prox_{αg}(xₜ − α∇f(xₜ))
Step 1: gradient step on f. Step 2: proximal step on g. Convergence: O(1/t) for convex functions.
FISTA: Nesterov acceleration for ISTA → O(1/t²). De facto standard for LASSO with theoretical guarantees.
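A minimal sketch of both methods for the LASSO objective 0.5·||Ax−y||² + λ||x||₁, assuming the step size α = 1/L with L = λ_max(AᵀA):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, alpha, iters=1000):
    """ISTA for 0.5*||Ax - y||^2 + lam*||x||_1: gradient step on f, then prox of lam*||.||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - alpha * A.T @ (A @ x - y), alpha * lam)
    return x

def fista(A, y, lam, alpha, iters=1000):
    """FISTA: ISTA plus Nesterov-style momentum."""
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0
    for _ in range(iters):
        x_new = soft_threshold(z - alpha * A.T @ (A @ z - y), alpha * lam)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x_new + (t - 1.0) / t_new * (x_new - x)    # momentum step
        x, t = x_new, t_new
    return x
```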
ADMM: Decomposition of the Problem
Constrained problem: min f(x) + g(z), subject to Ax + Bz = c. Examples: distributed ML, portfolio optimization.
Augmented Lagrangian: L_ρ(x,z,y) = f(x) + g(z) + yᵀ(Ax+Bz−c) + (ρ/2)||Ax+Bz−c||²
ADMM steps:
x^{k+1} = argmin_x L_ρ(x, z^k, y^k)
z^{k+1} = argmin_z L_ρ(x^{k+1}, z, y^k)
y^{k+1} = y^k + ρ(Ax^{k+1} + Bz^{k+1} − c)
Each step is a simple optimization (often with an analytic solution). Parallelizable: the x-update frequently splits into independent blocks. Used in distributed optimization and in portfolio selection with constraints.
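As an illustration, a sketch of ADMM for LASSO written as min 0.5·||Ax−b||² + λ||z||₁ subject to x − z = 0 (A = I, B = −I, c = 0 in the template above); it uses the scaled dual variable u = y/ρ instead of the unscaled multiplier from the formulas above.

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, iters=200):
    """ADMM for 0.5*||Ax - b||^2 + lam*||z||_1 with constraint x - z = 0 (scaled dual u)."""
    p = A.shape[1]
    x, z, u = np.zeros(p), np.zeros(p), np.zeros(p)
    AtA, Atb = A.T @ A, A.T @ b
    # The x-update solves (A^T A + rho*I) x = A^T b + rho*(z - u); factor the matrix once.
    chol = np.linalg.cholesky(AtA + rho * np.eye(p))
    for _ in range(iters):
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(chol.T, np.linalg.solve(chol, rhs))          # x-minimization
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)  # z-minimization (soft-threshold)
        u = u + x - z                                                    # dual update
    return z
```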
Coordinate Descent
Idea: instead of optimizing all variables at once — optimize one variable, fixing the others: min_{x_j} f(x₁,...,x_{j−1}, x_j, x_{j+1},...,x_n).
For LASSO the one-dimensional subproblem has a closed form. Assuming unit-norm columns of A: x_j^* = soft-thresh_λ(zⱼ), where zⱼ = xⱼ − (Aᵀ(Ax−y))ⱼ (a general version is sketched below). Randomized coordinate descent (randomized CD): at each iteration pick j uniformly at random → O(n/T) convergence. Used in liblinear (SVM) and in glmnet/sklearn for LASSO.
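A sketch of cyclic coordinate descent for 0.5·||Ax−y||² + λ||x||₁ that does not assume unit-norm columns; the randomized variant would simply pick j at random instead of cycling.

```python
import numpy as np

def cd_lasso(A, y, lam, iters=100):
    """Cyclic coordinate descent for 0.5*||Ax - y||^2 + lam*||x||_1.
    Each one-dimensional subproblem is solved exactly by soft-thresholding."""
    n, p = A.shape
    x = np.zeros(p)
    r = y - A @ x                        # residual, kept up to date
    col_sq = np.sum(A * A, axis=0)       # ||a_j||^2 for each column
    for _ in range(iters):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho_j = A[:, j] @ r + col_sq[j] * x[j]                      # correlation with partial residual
            x_new = np.sign(rho_j) * max(abs(rho_j) - lam, 0.0) / col_sq[j]
            r += A[:, j] * (x[j] - x_new)                               # incremental residual update
            x[j] = x_new
    return x
```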
Numerical Example
LASSO: y = Xβ* + ε, n=100 observations, p=500 features, true β* is 10-sparse. λ = 0.1·||Xᵀy||_∞.
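A sketch of this setup (the noise scale and the placement of nonzeros are assumptions, not specified in the text):

```python
import numpy as np

# Synthetic LASSO instance: n = 100 observations, p = 500 features, 10-sparse ground truth.
rng = np.random.default_rng(0)
n, p, s = 100, 500, 10
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta_true = np.zeros(p)
beta_true[rng.choice(p, size=s, replace=False)] = rng.standard_normal(s)
y = X @ beta_true + 0.01 * rng.standard_normal(n)   # noise scale is an assumption

lam = 0.1 * np.max(np.abs(X.T @ y))                 # lambda = 0.1 * ||X^T y||_inf
L = np.linalg.eigvalsh(X.T @ X)[-1]                 # for the step size alpha = 1/L
```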
ISTA: 1000 iterations, α=1/L, final MSE = 0.023. FISTA: same 1000 iterations, MSE = 0.0018 (13 times better). Coordinate descent (sklearn): 200 iterations, MSE = 0.0015 (faster in practice). ISTA finds 12 nonzero coefficients (2 extra), FISTA and CD — exactly 10.
Assignment: Implement ISTA and FISTA for LASSO on synthetically generated data (X ∈ ℝ^{100×500}, s=20 nonzero coefficients).
- Plot convergence curves (log(f(xₜ)−f*) vs t) for both methods.
- Theoretically FISTA converges as O(1/t²) and ISTA as O(1/t); validate these rates empirically.
- Implement ADMM for Ridge regression with the sum constraint Σxⱼ = 1.