Cheatsheet

Convex Analysis & Optimization

All topics on one page

4 modules

12 articles

94 definitions

30 formulas

Convex Sets and Functions

Basic concepts of convex analysis: convex sets, convex functions, and their properties

Convex Sets: Properties and Operations

Why Is Convexity Important? → Definition of a Convex Set → Classic Examples of Convex Sets → Operations Preserving Convexity → The Separating Hyperplane Theorem → Projection onto a Convex Set → Full Analysis of an Example → Real Applications

Definitions

Geometric test: draw the figure. If two points of the figure can be joined by a segment that passes outside it, the figure is nonconvex. The letter "C" is nonconvex. A disk, a square, and a triangle are convex.
Hyperplane: — $\{x \in \mathbb{R}^n : a^\top x = b\}$, where $a \neq 0$ is a fixed vector, $b$ is a number. This is an $n-1$ dimensional “plane” in space. Example in $\mathbb{R}^2$: line $2x_1 + 3x_2 = 6$. Any two points on the line are connected by a segment l...
Halfspace: — $\{x : a^\top x \leq b\}$ — one "side" from the hyperplane. In $\mathbb{R}^2$ this is a halfplane on one side of a line.
Ball: — $\{x : \|x - x_c\| \leq r\}$ — all points at distance no more than $r$ from the center $x_c$. Convexity: if two points lie in the ball (distance to $x_c \leq r$), then any point between them also lies in the ball — this follows from the triangle i...
Ellipsoid: — $\{x : (x-x_c)^\top P^{-1}(x-x_c) \leq 1\}$, where $P$ is a positive definite matrix. This is a “stretched ball” along different axes. Actively used in control theory for describing admissible regions of states.
Second-order cone (SOCP): — $\{(x, t) : \|x\| \leq t,\, t \geq 0\}$ — "ice cream cone" in space. The surface of the cone is points with $\|x\| = t$, interior — with $\|x\| < t$.
Set of positive semidefinite matrices: — $S^n_+ = \{X \in \mathbb{R}^{n \times n} : X = X^\top,\, u^\top X u \geq 0\,\, \text{for all } u\}$. This is a convex cone in the space of symmetric matrices.
Intersection: — if $C_1$ and $C_2$ are convex, then $C_1 \cap C_2$ is also convex. Proof: if $x, y \in C_1 \cap C_2$, then $x, y \in C_1$ $\rightarrow$ segment $xy \subseteq C_1$, and $x, y \in C_2$ $\rightarrow$ segment $xy \subseteq C_2$, thus segment $xy \subs...
Image under affine transformation: $f(C) = \{Ax + b : x \in C\}$ is convex if $C$ is convex. Affine transformations "preserve" convexity.
Preimage: — if $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ is affine ($f(x) = Ax+b$) and $D$ is convex, then $f^{-1}(D) = \{x : f(x) \in D\}$ is convex.
Minkowski sum: — $C_1 + C_2 = \{x + y : x \in C_1,\ y \in C_2\}$ — is convex if $C_1$, $C_2$ are convex.
Theorem: — if $C_1$ and $C_2$ are nonempty convex sets with empty intersection ($C_1 \cap C_2 = \varnothing$), then there exists a vector $a \neq 0$ and a number $b$ such that:
Meaning: — the hyperplane $\{x : a^\top x = b\}$ “separates” the two sets. This is geometrically obvious in $\mathbb{R}^2$: two non-intersecting convex sets on the plane can always be separated by a line.
Supporting hyperplane: — at a boundary point $x_0$ of convex $C$ there exists a vector $g \neq 0$ such that $g^\top(x - x_0) \leq 0$ for all $x \in C$. This is a "tangent" hyperplane to $C$ at point $x_0$, lying "outside".
Existence and uniqueness: for a nonempty closed convex set $C$, the projection $P_C(x) = \arg\min_{y \in C} \|y - x\|$ always exists and is unique. Uniqueness is a consequence of the strict convexity of the function $\|y - x\|^2$ in $y$.

Formulas

Image under affine transformation: $f(C) = \{Ax + b : x \in C\}$ is convex if $C$ is convex. Affine transformations "preserve" convexity.
Preimage: if $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ is affine ($f(x) = Ax+b$) and $D$ is convex, then $f^{-1}(D) = \{x : f(x) \in D\}$ is convex.
Minkowski sum: $C_1 + C_2 = \{x + y : x \in C_1,\ y \in C_2\}$ is convex if $C_1$, $C_2$ are convex.
Problem: find the point closest to $x = (3, 4)$ in the square $C = \{y \in \mathbb{R}^2 : 0 \leq y_1 \leq 1,\, 0 \leq y_2 \leq 1\}$.
Solution: projection onto the square is a coordinate-wise clip: $P_C(x) = (\min(\max(x_1, 0), 1),\, \min(\max(x_2, 0), 1))$. For $x = (3, 4)$ this gives $P_C(x) = (1, 1)$.
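The coordinate-wise clipping formula for the projection onto a box can be sketched in a few lines of Python. This is a minimal illustration, assuming the box is given by per-coordinate lower and upper bounds; the function name `project_onto_box` is hypothetical and does not come from the source.

```python
def project_onto_box(x, lower, upper):
    """Euclidean projection of x onto the box {y : lower[i] <= y[i] <= upper[i]}."""
    # Clip each coordinate independently: min(max(x_i, lower_i), upper_i).
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]


# The example from the text: x = (3, 4) projected onto the unit square.
p = project_onto_box([3.0, 4.0], [0.0, 0.0], [1.0, 1.0])
print(p)  # [1.0, 1.0]
```

The projection separates by coordinate because both the squared distance and the box constraints are sums of independent one-dimensional terms.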

·$a^\top x \leq b$ for all $x \in C_1$
·$a^\top x \geq b$ for all $x \in C_2$

Imagine you are searching for a path in the mountains. If the terrain is convex (a single bowl with no separate depressions or pockets), any trail that keeps going downhill leads you to the lowest level. This very idea underlies convex analysis: in a convex optimization problem every local minimum is a global minimum, and ...

A set $C \subseteq \mathbb{R}^n$ is called convex if for any two points $x, y \in C$ and any number $\theta \in [0,1]$ it holds that:

$ \theta x + (1 - \theta) y \in C $

What does this mean in words? Take any two points in the set. Connect them by a segment. If the entire segment lies inside the set — it is convex. The parameter $\theta$ "runs" from 0 to 1, describing all points between $x$ and $y$: at $\theta=1$ we get $x$, at $\theta=0$ — $y$, at $\theta=0.5$ —...


Convex Functions: Definitions and Criteria

Why Study Convex Functions? → Definition of a Convex Function → Examples of Convex Functions → Criteria of Convexity → Sublevel Sets and the Epigraph → Operations Preserving Convexity → Complete Example Breakdown → Applications

Formulas

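Important consequence: for a convex differentiable $f$, the first-order condition gives $f(y) \geq f(x) + \nabla f(x)^\top (y - x)$ for all $x, y$. Hence, if $\nabla f(x^*) = 0$, then $x^*$ is a global minimum: $f(y) \geq f(x^*) + 0 = f(x^*)$ for all $y$.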

·$f(x) = x^2$ — convex (parabola "opens upwards")
·$f(x) = e^x$ — convex (exponential)
·$f(x) = |x|$ — convex, but not differentiable at $0$
·$f(x) = x \log x$ on $x > 0$ — convex (used in information theory)
·$f(x) = \log x$ on $x > 0$ — concave
·$f(x) = \|x\|^2 = x_1^2 + \ldots + x_n^2$ — convex
·$f(x) = \max(x_1, \ldots, x_n)$ — convex (maximum of linear functions)
·$f(x) = \log\left(\sum_i e^{x_i}\right)$: log-sum-exp (a smooth "soft max"), convex and smooth
·$f(X) = \lambda_{\max}(X)$ — largest eigenvalue of a symmetric matrix — convex
·Sum: $f_1 + f_2$ is convex if $f_1$ and $f_2$ are convex
·Nonnegative scaling: $\lambda f$ for $\lambda \geq 0$ is convex
·Maximum: $\max(f_1, f_2, \ldots, f_m)$ is convex (the maximum of convex functions is convex)
·Affine substitution: $f(Ax + b)$ is convex (if $f$ is convex)
·Supremum: $\sup_{y \in S} f(x, y)$ is convex in $x$ (if $f(\cdot, y)$ is convex for every $y$)
·Leading principal minor $1 \times 1$: $2 > 0$ ✓
·Determinant $2 \times 2$: $2 \cdot 6 - 2 \cdot 2 = 12 - 4 = 8 > 0$ ✓
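The last two checks above are Sylvester's criterion applied to a 2×2 Hessian. The sketch below repeats them in Python. The matrix [[2, 2], [2, 6]] is an assumption inferred from the arithmetic 2·6 − 2·2 in the bullets, and the function name is illustrative only.

```python
def is_positive_definite_2x2(h):
    """Sylvester's criterion for a symmetric 2x2 matrix h = [[a, b], [b, d]]."""
    (a, b), (c, d) = h
    assert b == c, "matrix must be symmetric"
    minor_1 = a               # leading principal minor of order 1
    minor_2 = a * d - b * c   # determinant (leading principal minor of order 2)
    return minor_1 > 0 and minor_2 > 0


hessian = [[2, 2], [2, 6]]  # assumed Hessian, reconstructed from the bullets
print(is_positive_definite_2x2(hessian))  # True
```

A function whose Hessian is positive definite everywhere is strictly convex, which is what the two check marks establish.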

The main practical fact: if a function is convex, then any of its local minima is also a global minimum. This is an enormous advantage. In a non-convex problem, you may have thousands of local minima, and the algorithm can get "stuck" in a bad one. In a convex problem, there is only one minimum, ...

A function $f: \mathbb{R}^n \to \mathbb{R}$ (or $f: \operatorname{dom}(f) \to \mathbb{R}$, where $\operatorname{dom}(f)$ is a convex set) is called convex if for any $x, y \in \operatorname{dom}(f)$ and any $\lambda \in [0,1]$:

$ f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y) $

Geometric meaning: consider two points on the graph of the function: $A = (x, f(x))$, $B = (y, f(y))$. The right-hand side, $\lambda f(x) + (1-\lambda)f(y)$, is a point on the chord AB corresponding to parameter $\lambda$. The left-hand side, $f(\lambda x + (1-\lambda)y)$, is the value of the fun...

If a strict inequality holds ($<$ instead of $\leq$) for $x \neq y$ and $\lambda \in (0,1)$, the function is strictly convex. A strictly convex function has at most one minimizer.

Subgradients and Subdifferential

Motivation: What to Do with Nondifferentiable Functions? → Definition of Subgradient → Examples of Subdifferentials → Optimality Condition via Subdifferential → Subdifferentiation Rules → Proximal Operator → Full Analysis of Example: LASSO → Applications

Definitions

Interpretation: — a subgradient is a vector that defines a hyperplane through the point $(x, f(x))$ that lies "below" (or touches) the graph of $f$. By the first-order criterion, if $f$ is differentiable, the only subgradient is the gradient $\nabla f(x)$. If $f$ i...
Theorem: a point $x^*$ is a global minimum of a convex function $f$ if and only if $0 \in \partial f(x^*)$.
Meaning: — zero must be a subgradient. If $f$ is differentiable, this is the standard condition $\nabla f(x^*) = 0$. For nondifferentiable functions: zero must lie in the "fan" of subgradients.
Example: — for $f(x) = |x|$: $0 \in \partial|x^*| \iff x^* = 0$ (because only at $x = 0$ does the subdifferential contain 0).
Sum: — $\partial(f + g)(x) \supseteq \partial f(x) + \partial g(x)$. Under regularity conditions (e.g., one function is continuous): equality holds.
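Affine substitution of argument: if $h(x) = f(Ax + b)$, then $\partial h(x) = A^\top \partial f(Ax + b)$ (under mild regularity conditions; the inclusion $\supseteq$ always holds).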
Problem: — $\min_{x \in \mathbb{R}^n} F(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$
Optimality condition: — $0 \in \partial F(x^*) = \partial\left[\frac{1}{2}\|Ax-b\|^2\right](x^*) + \lambda\ \partial\|x^*\|_1$
Feature selection: — LASSO regression with L1 penalty automatically zeros out unnecessary coefficients. This is "automatic variable selection". In medical data with thousands of genes, LASSO selects several dozen most significant.
Compressed signal recovery: — An MRI scanner takes 10 times fewer measurements, using L1 minimization to recover a sparse image. The subdifferential of the L1 norm is key to understanding why this works.

Formulas


·For $x > 0$: $f'(x) = 1$, so $\partial|x| = \{1\}$
·For $x < 0$: $f'(x) = -1$, so $\partial|x| = \{-1\}$
·For $x = 0$: the subdifferential $\partial|0| = [-1, 1]$ — the entire segment
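The three cases above can be checked numerically against the subgradient inequality $|y| \geq |x| + g\,(y - x)$. The following is a small sketch. The helper names and the finite test grid are assumptions made for illustration, and a grid check is evidence rather than proof.

```python
def abs_subdifferential(x):
    """Return the subdifferential of f(x) = |x| as a closed interval (lo, hi)."""
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)


def is_subgradient(g, x, test_points):
    """Check |y| >= |x| + g * (y - x) on a finite set of test points."""
    return all(abs(y) >= abs(x) + g * (y - x) - 1e-12 for y in test_points)


grid = [k / 10 for k in range(-50, 51)]
print(abs_subdifferential(0.0))         # (-1.0, 1.0)
print(is_subgradient(0.5, 0.0, grid))   # True: 0.5 lies in [-1, 1]
print(is_subgradient(1.5, 0.0, grid))   # False: the inequality fails for y > 0
```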

Many important convex functions do not have a gradient at every point. The absolute value |x| is not differentiable at zero. The L1 norm ‖x‖₁ is not differentiable when at least one component is zero. The maximum of several functions is not differentiable on the switching set. If we want to solve...

A vector $g \in \mathbb{R}^n$ is called a subgradient of a convex function $f$ at the point $x$ if:

$ f(y) \geq f(x) + g^\top(y - x) \quad \text{for all } y \in \operatorname{dom}(f) $

Interpretation: a subgradient is a vector that defines a hyperplane through the point $(x, f(x))$ that lies "below" (or touches) the graph of $f$. By the first-order criterion, if $f$ is differentiable, the only subgradient is the gradient $\nabla f(x)$. If $f$ is nondifferentiable, there may be ...

Duality and Optimality Conditions

Lagrangian duality, Slater’s theorem, and Karush–Kuhn–Tucker conditions

Lagrangian Duality and Slater's Theorem

Why is duality needed? → The primal problem and the Lagrangian → Weak duality → The dual problem and Slater's theorem → KKT (Karush-Kuhn-Tucker) conditions → Economic interpretation → Full analysis of an example → Geometric interpretation of duality → Saddle points of the Lagrangian → Applications of duality

Definitions

The Lagrangian: — we “relax” the constraints by moving them into the objective function with penalties λᵢ and νⱼ:
Physical meaning of λᵢ: — the price of violating the i-th constraint. If gᵢ(x) > 0 (violation), we pay λᵢ gᵢ(x) > 0 for it. If λᵢ = 0 — there is no penalty. For large λᵢ — violation is “expensive”.
Dual function: — g(λ, ν) = inf_{x} L(x, λ, ν) — the infimum of the Lagrangian over x for fixed multipliers.
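Theorem (weak duality): for any λ ≥ 0 and any ν it holds that g(λ, ν) ≤ p*.
Proof: let x* be an optimal feasible solution of the primal problem. Then g(λ, ν) ≤ L(x*, λ, ν) = f(x*) + Σᵢ λᵢ gᵢ(x*) + Σⱼ νⱼ hⱼ(x*) ≤ f(x*) = p*, because λᵢ gᵢ(x*) ≤ 0 and hⱼ(x*) = 0.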
Corollary: — maximizing g(λ, ν) over λ ≥ 0, ν gives the best lower bound for p*.
The dual problem: — max_{λ ≥ 0, ν} g(λ, ν).
Slater’s theorem: if the problem is convex (f, gᵢ are convex, hⱼ are affine) and there exists a strictly feasible point x̃ such that gᵢ(x̃) < 0 for all i (and hⱼ(x̃) = 0), then strong duality holds: d* = p*, where d* is the optimal value of the dual problem.
1. Primal admissibility: — gᵢ(x*) ≤ 0, hⱼ(x*) = 0 (x* is feasible)
2. Dual admissibility: — λᵢ* ≥ 0
3. Complementary slackness: — λᵢ* gᵢ(x*) = 0 for all i
4. Stationarity: — ∇f(x*) + Σᵢ λᵢ* ∇gᵢ(x*) + Σⱼ νⱼ* ∇hⱼ(x*) = 0
Interpretation of condition 3: either the constraint is “active” (gᵢ(x*) = 0), or its price is zero (λᵢ* = 0). An “inactive” constraint cannot have a nonzero price.
Problem: — min (x₁ − 1)² + (x₂ − 1)² subject to x₁ + x₂ ≤ 1, x₁ ≥ 0, x₂ ≥ 0.
Slater’s condition: — x̃ = (0.1, 0.1) is strictly feasible: 0.1 + 0.1 = 0.2 < 1, 0.1 > 0. Strong duality holds.
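The four KKT conditions can be verified directly for the example problem above. The sketch below is a check, not a solver. The candidate point x* = (0.5, 0.5), the multipliers λ* = 1, μ₁* = μ₂* = 0 and the closed form of the dual function were worked out by hand for this illustration and are not taken from the source.

```python
def f(x):
    return (x[0] - 1) ** 2 + (x[1] - 1) ** 2


def dual_function(lam):
    """g(lam) with mu = 0: min over x of f(x) + lam*(x1 + x2 - 1), attained at x_i = 1 - lam/2."""
    return lam - lam ** 2 / 2


x_star = (0.5, 0.5)             # candidate primal solution (assumption)
lam, mu1, mu2 = 1.0, 0.0, 0.0   # candidate multipliers (assumption)

# 1. Primal feasibility
primal_ok = x_star[0] + x_star[1] <= 1 and x_star[0] >= 0 and x_star[1] >= 0
# 2. Dual feasibility
dual_ok = lam >= 0 and mu1 >= 0 and mu2 >= 0
# 3. Complementary slackness, with the constraints written as g_i(x) <= 0
slack = (lam * (x_star[0] + x_star[1] - 1), mu1 * (-x_star[0]), mu2 * (-x_star[1]))
# 4. Stationarity of the Lagrangian in x
grad = (2 * (x_star[0] - 1) + lam - mu1, 2 * (x_star[1] - 1) + lam - mu2)

print(primal_ok, dual_ok, slack, grad)
print(f(x_star), dual_function(lam))  # 0.5 0.5
```

The primal and dual values coincide, which is the zero duality gap that Slater's condition guarantees for this problem.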

Formulas

3. Complementary slackness: λᵢ* gᵢ(x*) = 0 for all i
4. Stationarity: ∇f(x*) + Σᵢ λᵢ* ∇gᵢ(x*) + Σⱼ νⱼ* ∇hⱼ(x*) = 0
Slater’s condition: x̃ = (0.1, 0.1) is strictly feasible: 0.1 + 0.1 = 0.2 < 1, 0.1 > 0. Strong duality holds.
KKT: Stationarity: 2(x₁* − 1) + λ* − μ₁* = 0, 2(x₂* − 1) + λ* − μ₂* = 0, where λ* is the multiplier for x₁+x₂ ≤ 1, μ₁*, μ₂* for x₁ ≥ 0, x₂ ≥ 0.

·Support Vector Machines (SVM): the dual problem has one variable per training point rather than one per feature, only the support vectors receive nonzero multipliers, and the formulation permits the kernel trick
·Benders decomposition in large-scale optimization: decomposition into “easy” and “hard” parts via duality
·Distributed optimization (ADMM): dual variables serve as the “coordinator” among parallel solvers
·Sensitivity analysis: dual variable λᵢ is the “shadow price” of a constraint, showing how much the optimum improves if the constraint is relaxed by one unit

Sometimes it is difficult to solve an optimization problem directly, but there exists a “reformulation” that is easier to solve. Lagrangian duality is a systematic way to construct such a dual problem. It turns out that every minimization problem has a “dual maximization problem” whose optimal va...

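The primal problem: min f(x) subject to gᵢ(x) ≤ 0 for all i and hⱼ(x) = 0 for all j. Here f is the objective function, gᵢ define the inequality constraints, and hⱼ define the equality constraints. The optimal value is denoted p*.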

The Lagrangian: we “relax” the constraints by moving them into the objective function with penalties λᵢ and νⱼ:

L(x, λ, ν) = f(x) + Σᵢ λᵢ gᵢ(x) + Σⱼ νⱼ hⱼ(x)

where λᵢ ≥ 0 are the “dual variables” (Lagrange multipliers) for inequalities, and νⱼ are for equations (can be of any sign).

Fenchel Conjugate Functions

The Idea of the Legendre-Fenchel Transform → Definition of a Conjugate Function → Fenchel-Young Inequality → Calculation Examples → Duality via Conjugate Functions → Complete Example Analysis: Computing f* for the Log-Barrier → Applications → Properties of the Fenchel Transform → Examples of Conjugate Pairs → Applications

Definitions

Key property: f* is always a convex function, even if the original f is non-convex! (It is a pointwise supremum of functions that are affine in y.)
Bipolar theorem (Fenchel-Moreau): if f is a closed convex function, then f** = f (the double conjugate coincides with the original). For non-convex f: f** = cl(conv(f)), the closure of the convex hull.
Calculation: — if |yᵢ| > 1 for some i, take xᵢ → ±∞ → supremum = +∞. If |yᵢ| ≤ 1 for all i, then yᵀx ≤ ‖y‖_∞‖x‖₁ ≤ ‖x‖₁ → yᵀx − ‖x‖₁ ≤ 0, maximum = 0 (at x = 0).
LASSO problem: — min_x {(1/2)‖Ax−b‖² + λ‖x‖₁}
Task: — Find the conjugate for f(x) = −log x (x > 0).
Step 1: — Derivative with respect to x: y + 1/x = 0 → x* = −1/y (only for y < 0).
Step 2: For y ≥ 0: yx + log x → +∞ as x → +∞ (this holds both for y > 0 and for y = 0), so the supremum is +∞.
Step 3: — For y < 0: f*(y) = y·(−1/y) + log(−1/y) = −1 + log(−1/y) = −1 − log(−y).
Result: — f*(y) = −1 − log(−y) for y < 0, +∞ for y ≥ 0.
Duality in optimization: — the conjugate function automatically generates the dual problem. This is used in optimal portfolio calculations (duality of the Markowitz problem), SVM (kernel trick through duality), and in the ADMM method.
Prox-operator via conjugate: — by Moreau's theorem, prox_{τf}(x) + τ prox_{f*/τ}(x/τ) = x. If computing the prox operator of one function is hard, compute the prox of the conjugate.
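The closed form derived in Steps 1 to 3 for the conjugate of f(x) = −log x can be cross-checked numerically. The sketch below compares it with a crude grid search for the supremum of yx + log x. The grid range and resolution are arbitrary assumptions, so the comparison is only approximate.

```python
import math


def conjugate_closed_form(y):
    """f*(y) for f(x) = -log(x): -1 - log(-y) if y < 0, +inf otherwise."""
    return -1.0 - math.log(-y) if y < 0 else math.inf


def conjugate_grid(y, x_max=50.0, n=200_000):
    """Approximate sup over x > 0 of (y*x + log x) by a grid search on (0, x_max]."""
    best = -math.inf
    for k in range(1, n + 1):
        x = x_max * k / n
        best = max(best, y * x + math.log(x))
    return best


for y in (-0.5, -1.0, -2.0):
    print(y, conjugate_closed_form(y), conjugate_grid(y))
```

For each negative y the two columns should agree to several decimal places, because the maximizer x = −1/y lies well inside the search range.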

Formulas

Negative entropy f(x) = Σᵢ xᵢ log xᵢ (for xᵢ > 0):

·y — “dual variable,” direction (hyperplane slope)
·yᵀx — scalar product (linear function of x)
·f(x) — “subtract” the function
·sup — take the greatest value over all x
·The conjugate function is always convex (even if f is not convex) — the transformation “convexifies” the function
·Double conjugate f** = f, if f is convex and closed; in general, f** is the closed convex hull of f
·Order correspondence: f₁ ≤ f₂ → f₂* ≤ f₁*
·Conjugate of the sum: (f₁ + f₂)* = f₁* □ f₂* (infimal convolution)
·Connection with subdifferential: y ∈ ∂f(x) ⟺ x ∈ ∂f*(y) ⟺ f(x) + f*(y) = xᵀy
·f(x) = (1/2)xᵀPx (quadratic) ↔ f*(y) = (1/2)yᵀP⁻¹y
·f(x) = exp(x) ↔ f*(y) = y log y − y (for y > 0)
·f(x) = log(1 + eˣ) (softplus) ↔ f*(y) = y log y + (1−y) log(1−y) (negative binary entropy) for y ∈ [0,1]
·f(x) = ‖x‖_p ↔ f*(y) = δ_{‖·‖_q ≤ 1}(y), where 1/p + 1/q = 1 (dual norms)
·f(x) = max(x₁,...,xₙ) ↔ f*(y) = δ_Δ(y) (indicator of the simplex)

Imagine that you want to characterize a convex function not through its values at points, but through the hyperplanes supporting it. Each tangent line to a convex function is defined by its slope y and “intercept point.” The conjugate function f*(y) fixes this "intersection" for the hyperplane wi...

Geometrically: f*(y) is the maximal “gap” between the linear function yᵀx and f(x).

Key property: f* is always a convex function, even if the original f is non-convex! (It is a pointwise supremum of functions that are affine in y.)

Bipolar theorem (Fenchel-Moreau): if f is a closed convex function, then f** = f (the double conjugate coincides with the original). For non-convex f: f** = cl(conv(f)), the closure of the convex hull.

Linear, Quadratic, and Semidefinite Programming

Hierarchy of Convex Problems → Linear Programming (LP) → Quadratic Programming (QP) → Second-Order Cone Programming (SOCP) → Semidefinite Programming (SDP) → Complete Analysis: SDP Relaxation of the MAX-CUT Problem → Solvers → Hierarchy of Problem Classes → Industrial Solvers → Modern Applications

Definitions

Standard form: — min $c^\mathrm{T}x$ subject to $Ax \leq b$, $x \geq 0$.
Geometry: — the feasible region is a convex polyhedron (polytope). The linear objective function attains its minimum at a vertex of the polyhedron. The simplex method (Dantzig, 1947) "moves" from vertex to vertex until it finds the optimum. The interior point...
Duality in LP: — the primal problem min $c^\mathrm{T}x$ subject to $Ax \geq b$, $x \geq 0$ has a dual max $b^\mathrm{T}y$ subject to $A^\mathrm{T}y \leq c$, $y \geq 0$. Strong duality always holds (if both problems are feasible).
Example — Diet Problem: — minimize the cost of a set of products while ensuring sufficient nutrients. This is a classic LP problem, solved back in the 1940s.
SOCP constraint: — $\|A_i x + b_i\| \leq c_i^\mathrm{T}x + d_i$ — the norm of a vector is bounded by a linear function of $x$.
Includes LP and QP: — LP is a special case (when $A_i = 0$, degenerates to linear). QP with $P \succeq 0$ allows an SOCP formulation.
SDP problem: — $\min_{X} \mathrm{tr}(CX)$ subject to $\mathrm{tr}(A_i X) = b_i$, $i=1,...,m$, $X \succeq 0$
Variable: — matrix $X$! This generalizes LP (variable — vector $x$) to the matrix case.
Problem: — graph $G = (V, E)$. Partition the vertices into two sets $S$ and $V \setminus S$, maximizing the number of edges between them.
Integer formulation: $x_i \in \{-1, +1\}$. Cut size: $\frac{1}{2}\sum_{(i,j)\in E} (1 - x_i x_j)$, with each edge counted once. The problem is NP-hard.
SDP relaxation (Goemans-Williamson, 1995): replace $x_i \in \{\pm 1\}$ by vectors $v_i \in \mathbb{R}^n$ with $\|v_i\| = 1$. The product $x_i x_j$ becomes the scalar product $v_i^\mathrm{T} v_j$. The Gram matrix $Y$ with entries $Y_{ij} = v_i^\mathrm{T} v_j$ satisfies $Y \succeq 0$ and $Y_{ii} = 1$.
Solution: this is an SDP. Randomized rounding (random hyperplane) yields a $0.878$-approximation of MAX-CUT.
What this means: in expectation, the cut found is at least $87.8\%$ of the optimum. This is the best approximation ratio known for a polynomial-time algorithm.
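The ±1 encoding of a cut is easy to test on a tiny graph. The sketch below evaluates the cut size as a sum of (1 − xᵢxⱼ)/2 over edges, counting each edge once, and finds the maximum by exhaustive search. It does not implement the SDP relaxation or the random-hyperplane rounding, and the 5-cycle is a hypothetical test graph.

```python
from itertools import product


def cut_value(edges, x):
    """Cut size for labels x_i in {-1, +1}: each edge contributes (1 - x_i*x_j)/2."""
    return sum((1 - x[i] * x[j]) / 2 for i, j in edges)


def max_cut_brute_force(n, edges):
    """Exhaustive search over all 2^n labelings (only viable for tiny graphs)."""
    return max(cut_value(edges, x) for x in product((-1, 1), repeat=n))


# Hypothetical test graph: a 5-cycle, whose maximum cut has 4 edges.
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(max_cut_brute_force(5, cycle5))  # 4.0
```

Exhaustive search takes 2ⁿ evaluations, which is why a polynomial-time relaxation with a provable approximation ratio matters for larger graphs.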

Formulas


·CVXPY (Python): high-level language for describing problems
·MOSEK: commercial, very fast for LP/QP/SOCP/SDP
·SCS: open, scalable for large SDP
·ECOS: efficient for embedded systems
·LP: Gurobi, CPLEX, MOSEK, HiGHS (open source) — millions of variables in seconds
·QP: OSQP, qpOASES (for real-time control), Gurobi
·SOCP: ECOS, MOSEK, SCS — widely used in finance for robust portfolios
·SDP: SDPT3, SeDuMi, MOSEK, COSMO — for problems up to several thousand variables
·Universal modeling languages: CVXPY, JuMP, YALMIP — allow writing the problem in natural form and automatically reducing to canonical for the solver
·LP in aviation: American Airlines solves problems with millions of variables for crew assignments
·QP in robotics: quadratic regulators (LQR) and MPC (Model Predictive Control) — foundation for controlling manipulators and quadcopters
·SOCP in finance: Goldfarb-Iyengar robust portfolios account for uncertainty in return estimates using ellipsoidal sets
·SDP in quantum computing: quantum tomography problems, estimation of quantum channels are formulated via SDP
·SDP in combinatorics: Goemans-Williamson’s MAX-CUT relaxation yields a $0.878$-approximation via SDP — a record in theoretical computer science

Convex programming is a "family" of optimization problems of varying complexity. Linear programming (LP) is the simplest: linear objective, linear constraints. Quadratic programming (QP) involves a quadratic objective. Second-order cone programming (SOCP) has conic constraints. Semidefinite progr...

Here $c \in \mathbb{R}^n$ is the vector of objective coefficients, $A \in \mathbb{R}^{m \times n}$ is the constraint matrix, and $b \in \mathbb{R}^m$ is the vector of right-hand sides.

Geometry: the feasible region is a convex polyhedron (polytope). The linear objective function attains its minimum at a vertex of the polyhedron. The simplex method (Dantzig, 1947) "moves" from vertex to vertex until it finds the optimum. The interior point method moves through the interior.

Duality in LP: the primal problem min $c^\mathrm{T}x$ subject to $Ax \geq b$, $x \geq 0$ has a dual max $b^\mathrm{T}y$ subject to $A^\mathrm{T}y \leq c$, $y \geq 0$. Strong duality always holds (if both problems are feasible).
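Strong duality can be observed on a small two-variable LP in the primal/dual form stated above. The sketch below enumerates the vertices of each feasible region and compares the two optimal values. The data A = [[1, 1], [1, 3]], b = (4, 6), c = (2, 3) are hypothetical. Vertex enumeration is used only because the example is tiny and is known to have an optimal vertex; it is not how practical solvers work.

```python
from itertools import combinations


def feasible_vertices(constraints):
    """Vertices of {(u, v) : a*u + b*v >= r for each (a, b, r)} in the plane."""
    points = []
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - b1 * a2
        if abs(det) < 1e-12:
            continue  # parallel boundary lines: no intersection point
        u = (r1 * b2 - b1 * r2) / det   # Cramer's rule
        v = (a1 * r2 - r1 * a2) / det
        if all(a * u + b * v >= r - 1e-9 for a, b, r in constraints):
            points.append((u, v))
    return points


# Primal: min 2*x1 + 3*x2  s.t.  x1 + x2 >= 4,  x1 + 3*x2 >= 6,  x >= 0.
primal = [(1, 1, 4), (1, 3, 6), (1, 0, 0), (0, 1, 0)]
# Dual: max 4*y1 + 6*y2  s.t.  y1 + y2 <= 2,  y1 + 3*y2 <= 3,  y >= 0
# (each "<=" row is negated to fit the ">=" convention of feasible_vertices).
dual = [(-1, -1, -2), (-1, -3, -3), (1, 0, 0), (0, 1, 0)]

primal_opt = min(2 * u + 3 * v for u, v in feasible_vertices(primal))
dual_opt = max(4 * u + 6 * v for u, v in feasible_vertices(dual))
print(primal_opt, dual_opt)  # 9.0 9.0
```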

First-Order Algorithms

Gradient descent, Nesterov acceleration, proximal algorithms, and ADMM

Gradient Descent and Nesterov Acceleration

Why Are First-Order Algorithms Needed? → Gradient Descent: Basic Algorithm → Nesterov Acceleration (1983) → Full Example Analysis → Stochastic Gradient Descent (SGD) → Applications in Machine Learning → Connection with Modern Libraries → Comparison of Convergence Rates

Definitions

The Class of L-smooth Functions: f is called L-smooth if ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y. The constant L measures how fast the gradient can change; for twice-differentiable f it is an upper bound on the largest eigenvalue of the Hessian over all x.
Convergence Theorem: for convex L-smooth f with step size α = 1/L: f(x_k) − f(x*) ≤ L‖x₀ − x*‖² / (2k).
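Convergence Rate: O(1/k²), compared with O(1/k) for gradient descent: accuracy ε needs on the order of 1/√ε iterations instead of 1/ε. This is theoretically optimal for first-order methods on smooth convex problems (Nesterov's lower bound).
Physical Meaning: "Momentum" prevents getting stuck in "ravines", that is, regions with a large condition number κ. A ball rolling with momentum keeps moving along the narrow bottom of the ravine instead of zigzagging between its walls, and so approaches the minimum faster.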
Problem: — Minimize f(x) = (1/2)(x₁² + 100x₂²) — an elongated parabola with κ = 100.
Nesterov: — converges in ~20√κ ≈ 200 iterations instead of 1000. That's a 5-fold speedup at κ = 100.
Rate: — O(1/√k) for convex, O(1/k) for strongly convex (with proper decay of the step size).
Adaptive methods: Adam, RMSProp, AdaGrad scale the step per coordinate. In deep learning, Adam often works better than SGD with a fixed step size, although this is an empirical tendency rather than a guarantee.
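The effect of acceleration on the quadratic example f(x) = (1/2)(x₁² + 100x₂²) can be reproduced with a short experiment. The sketch below compares plain gradient descent with a constant-momentum Nesterov scheme for strongly convex functions. The starting point (1, 1), the stopping rule f(x) < 10⁻⁶ and the momentum formula β = (√κ − 1)/(√κ + 1) are assumptions of this illustration, so the printed iteration counts need not match the figures quoted above.

```python
def grad(x):
    """Gradient of f(x) = 0.5 * (x1^2 + 100 * x2^2)."""
    return (x[0], 100.0 * x[1])


def f(x):
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)


def gradient_descent(x0, alpha, tol, max_iter=100_000):
    """Return the number of iterations needed to reach f(x) < tol."""
    x = x0
    for k in range(max_iter):
        if f(x) < tol:
            return k
        g = grad(x)
        x = (x[0] - alpha * g[0], x[1] - alpha * g[1])
    return max_iter


def nesterov(x0, alpha, beta, tol, max_iter=100_000):
    """Same stopping rule, with an extrapolation step before each gradient step."""
    x_prev, x = x0, x0
    for k in range(max_iter):
        if f(x) < tol:
            return k
        # Extrapolate along the previous displacement, then take a gradient step.
        y = (x[0] + beta * (x[0] - x_prev[0]), x[1] + beta * (x[1] - x_prev[1]))
        g = grad(y)
        x_prev, x = x, (y[0] - alpha * g[0], y[1] - alpha * g[1])
    return max_iter


L, mu = 100.0, 1.0                               # largest / smallest Hessian eigenvalue
kappa = L / mu
beta = (kappa ** 0.5 - 1) / (kappa ** 0.5 + 1)   # constant momentum (assumed variant)
print(gradient_descent((1.0, 1.0), 1 / L, 1e-6))
print(nesterov((1.0, 1.0), 1 / L, beta, 1e-6))
```

Both methods use the same step size 1/L, so any difference in the iteration counts comes from the momentum term alone.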

Formulas


·Gradient descent: O(1/k) for smooth functions, O(1/√k) for non-smooth
·Accelerated (Nesterov): O(1/k²) — theoretical optimum for smooth convex problems
·Prox methods: O(1/k) or O(1/k²) with acceleration (FISTA)
·Stochastic gradient descent (SGD): O(1/√k) for convex, O(1/k) for strongly convex (with a decaying step size)

Second-order methods (Newton's method) are very fast, but require the computation and inversion of the Hessian matrix — O(n³) operations per step. With n = 10⁶ parameters (a medium-sized neural network), this is simply impossible. First-order methods use only the gradient — O(n) operations. They ...

The basic update is x_{k+1} = x_k − α∇f(x_k). Here α > 0 is the step size (learning rate). If α is too large, the algorithm diverges; if it is too small, convergence is very slow.

The Class of L-smooth Functions: f is called L-smooth if ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y. The constant L measures how fast the gradient can change; for twice-differentiable f it is an upper bound on the largest eigenvalue of the Hessian over all x.

This is O(1/k) rate: to halve the error, you need to double the number of iterations.

Proximal Algorithms and Operator Splitting

Motivation: How to Handle Nondifferentiable Terms? → Proximal Operator → Proximal Gradient Method (ISTA/FISTA) → Complete Walkthrough of LASSO with FISTA → ADMM (Alternating Direction Method of Multipliers) → Applications → Proximal Operators for Typical Regularizers → Applications of Operator Splitting

Definitions

Definition: — $\operatorname{prox}_{\tau f}(x) = \arg\min_{y} \left\{ f(y) + \frac{1}{2\tau} \|y - x\|^2 \right\}$
Problem: — $\min F(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$
Numerical example: — $A$ is a $50 \times 100$ matrix, $x^*$ is sparse (5 nonzero components). FISTA with $\lambda = 0.1$ achieves accuracy $10^{-6}$ in about 200 iterations. Without acceleration (ISTA) — in about 2000 iterations.
Why ADMM is more powerful: it splits the problem into two subproblems, one in $x$ and one in $z$, which are solved in alternation, followed by an update of the dual variable. If both subproblems have convenient proximal operators, the whole algorithm is very efficient.
Distributed optimization: — with $n$ machines, each stores its own portion of the data. ADMM enables solving the global problem without gathering all the data in one place — only exchange of “dual” variables.
Image processing: Total Variation (TV) denoising: $\min_u \frac{1}{2}\|u-f\|^2 + \lambda\, TV(u)$. The TV term is nonsmooth, but its proximal operator can be computed efficiently. FISTA and ADMM are standard methods.

Formulas


·$\tau > 0$ — step size
·$f(y)$ — “minimize” $f$
·$\frac{1}{2\tau}\|y - x\|^2$ — “don’t move far from $x$”
·$\arg\min$ — returns the minimizer $y$
·For L2-regularization $\tau\|x\|^2$: $\operatorname{prox}(x) = x / (1 + 2\tau)$ — simple scaling
·For group LASSO $\tau\|x\|_{2,1}$: group soft thresholding, zeroing out entire groups of coordinates
·For nuclear norm $\|X\|_*$ (sum of singular values of a matrix): SVD decomposition $X = U\Sigma V^T$, then soft threshold the singular values and reassemble
·For the simplex indicator: projection onto the simplex via sorting — $O(n \log n)$

LASSO problem: min (1/2)‖Ax−b‖² + λ‖x‖₁. The first term is smooth (can be differentiated), the second is not ($\|x\|_1$ is nondifferentiable wherever a component of $x$ is zero). Gradient descent cannot be applied directly. The subgradient method is too slow ($O(1/\sqrt{k})$). Proximal algorithms solve this problem elegantly:...

Definition: $\operatorname{prox}_{\tau f}(x) = \arg\min_{y} \left\{ f(y) + \frac{1}{2\tau} \|y - x\|^2 \right\}$

This is a “soft step toward the minimum of $f$.” When $\tau \to 0$: prox $\approx x$ (do not move). When $\tau \to \infty$: prox $\to \arg\min f$ (go straight to the minimum).

1. $f(x) = \|x\|_1$: $\operatorname{prox}_{\tau f}(x) = \operatorname{sign}(x) \cdot \max(|x| - \tau, 0)$ — soft thresholding
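The soft-thresholding formula can be checked against the definition of the proximal operator. The sketch below implements the closed form component-wise and compares it, in one dimension, with a brute-force minimization of |y| + (y − x)²/(2τ) on a grid. The test values x = 1.3, τ = 0.5 and the grid are arbitrary assumptions.

```python
def soft_threshold(x, tau):
    """Component-wise prox of tau * ||.||_1: sign(x_i) * max(|x_i| - tau, 0)."""
    return [((xi > 0) - (xi < 0)) * max(abs(xi) - tau, 0.0) for xi in x]


def prox_objective(y, x, tau):
    """Scalar prox objective |y| + (y - x)^2 / (2 * tau)."""
    return abs(y) + (y - x) ** 2 / (2 * tau)


# Compare the closed form with a brute-force minimization on a fine grid.
x, tau = 1.3, 0.5
grid = [k / 1000 for k in range(-3000, 3001)]
y_grid = min(grid, key=lambda y: prox_objective(y, x, tau))
print(soft_threshold([x], tau)[0], y_grid)  # both approximately 0.8
```

Components with |xᵢ| ≤ τ are mapped exactly to zero, which is the mechanism behind the sparsity of LASSO solutions.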

Interior-Point Method and Barrier Functions

Idea: how to circumvent constraints? → Logarithmic Barrier → Central Path → IPM Algorithm → Self-Concordant Barriers → Full Analysis: LP via IPM → Applications → Algorithmic Implementation → Modern Packages

Definitions

Meaning: — When gᵢ(x) → 0 (approaching the constraint boundary), −gᵢ(x) → 0⁺ → log → −∞ → φ(x) → +∞. The barrier "repels" from the boundary.
Barrier problem: — min f(x) + (1/t) φ(x)
Duality gap on the central path: f(x*(t)) − p* ≤ m/t, where m is the number of inequality constraints. Precision ε is guaranteed at t = m/ε.
Complexity: O(√m · log(1/ε)) Newton steps in total for a short-step path-following scheme, each costing O(n³) (solving a linear system). Overall: O(√m · n³ · log(1/ε)).
Problem: — min cᵀx subject to Ax = b, x ≥ 0 (standard LP form).
KKT conditions: — c − (1/t)X⁻¹e + Aᵀλ = 0, Ax = b. Here X = diag(x).
Newton step: — Solve the linear system:
Example: — Transportation problem 100×100 (10,000 variables): IPM solves in ~50 iterations (~50 linear systems), simplex method — in ~10,000 vertex steps.

Formulas


·For LP: φ(x) = −Σ log xᵢ, ν = n (constraints x ≥ 0)
·For SDP: φ(X) = −log det X, ν = n (n×n matrix)
·For SOCP: φ = −log(t² − ‖x‖²), ν = 2

A constrained optimization problem: min f(x) subject to gᵢ(x) ≤ 0. One approach is to "forget" about the constraints, but add a large penalty for their violation. The logarithmic barrier does this elegantly: it goes to +∞ as x approaches the boundary of the feasible set. The interior-point method...

For the problem min f(x) subject to gᵢ(x) ≤ 0 we introduce the logarithmic barrier:

φ(x) = −Σᵢ log(−gᵢ(x))

Meaning: When gᵢ(x) → 0 (approaching the constraint boundary), −gᵢ(x) → 0⁺ → log → −∞ → φ(x) → +∞. The barrier "repels" from the boundary.

For gᵢ(x) = −t (x is strictly inside, gap = t): φ(x) = − log t. When the gap doubles, the barrier decreases by log 2.
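The behavior of the barrier along the central path can be seen on a one-dimensional toy problem: min x subject to x ≥ 0, with optimal value p* = 0 and m = 1 constraint. The sketch below minimizes the barrier objective x − (1/t) log x by Newton's method for increasing t. The toy problem, the factor 10 between successive values of t, and the simple feasibility damping are assumptions of this illustration, not a general interior-point implementation.

```python
def central_point(t, x, iters=50):
    """Minimize x - (1/t) * log(x) over x > 0 by Newton's method, started at x."""
    for _ in range(iters):
        g = 1.0 - 1.0 / (t * x)       # first derivative of the barrier objective
        h = 1.0 / (t * x * x)         # second derivative (always positive)
        if abs(g) < 1e-14:
            break
        step, s = -g / h, 1.0
        while x + s * step <= 0:      # damp the step to stay strictly feasible
            s *= 0.5
        x += s * step
    return x


m, x, t = 1, 1.0, 1.0                 # one constraint; x = 1 is on the central path for t = 1
for _ in range(4):
    x = central_point(t, x)           # warm start from the previous central point
    print(t, x, m / t)                # here the gap f(x*(t)) - p* equals m/t
    t *= 10.0
```

Each central point is x*(t) = 1/t, so the objective gap shrinks in proportion to 1/t, which is the m/t behavior described for the central path.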

Applications in Machine Learning

Regularization, SVM, convex neural networks, and compressed sensing

Regularization: Lasso, Ridge, Elastic Net

The Overfitting Problem and Why Regularization Is Needed → Ridge Regression (L2 Regularization) → Lasso (L1 Regularization) → Elastic Net: The Best of Both Worlds → Compressed Sensing → Full Breakdown: Lasso on a Numerical Example → Practical Applications

Formulas

Key effect: sparsity. For sufficiently large λ, many xᵢ* = 0 exactly! This is not an approximation: it is an exact zero.
L0-minimization: min ‖x‖₀ with Ax = b (‖x‖₀ = number of nonzero components). This is an NP-hard combinatorial problem.

·Sparsity from L1: some coefficients are set to zero
·Stability from L2: in the case of multicollinearity (similar features), L1 arbitrarily selects one, Elastic Net selects a “group” together
·No closed-form solution, just as for pure Lasso (it is computed only iteratively)
·Gradient: ∇f(x⁰) = Aᵀ(Ax⁰ − b) = Aᵀ(−b) = [[−22], [−28]]
·Gradient step: z = x⁰ − τ∇f = [0.24, 0.31]
·Soft threshold with τλ ≈ 0.006: x¹ = [0.234, 0.304]

Imagine you are building a model to predict apartment prices using 1000 features, but have only 100 observations. Without constraints, the model can “memorize” the training data (overfitting), showing zero error on it but terrible error on new data. Regularization is the addition of a penalty ter...

Here, A ∈ ℝ^{m×n} is the feature matrix, b ∈ ℝᵐ are the responses, λ > 0 is the regularization parameter.

Closed-form solution: Take the derivative with respect to x, set it to zero, and solve: x* = (AᵀA + λI)⁻¹Aᵀb.

Importantly, the matrix AᵀA + λI is always invertible for λ > 0, even if AᵀA is singular! This solves the problem of multicollinearity.
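The invertibility claim can be illustrated with two identical feature columns, for which AᵀA is singular. The sketch below solves the 2×2 system (AᵀA + λI)x = Aᵀb by Cramer's rule. It assumes the ridge objective ‖Ax − b‖² + λ‖x‖² (or the same with a factor 1/2 on both terms), and the data and the function name `ridge_2d` are hypothetical.

```python
def ridge_2d(A, b, lam):
    """Solve (A^T A + lam*I) x = A^T b for a matrix A with two columns."""
    # Normal-equation matrix G = A^T A + lam*I and right-hand side r = A^T b.
    g11 = sum(row[0] * row[0] for row in A) + lam
    g12 = sum(row[0] * row[1] for row in A)
    g22 = sum(row[1] * row[1] for row in A) + lam
    r1 = sum(row[0] * bi for row, bi in zip(A, b))
    r2 = sum(row[1] * bi for row, bi in zip(A, b))
    det = g11 * g22 - g12 * g12       # strictly positive whenever lam > 0
    return [(g22 * r1 - g12 * r2) / det, (g11 * r2 - g12 * r1) / det]


# Hypothetical data with two identical columns, so A^T A is singular.
A = [[1.0, 1.0], [1.0, 1.0]]
b = [2.0, 2.0]
print(ridge_2d(A, b, lam=1.0))  # [0.8, 0.8]
```

With λ = 0 the determinant would be zero and the system would have no unique solution; the penalty splits the weight evenly between the two identical features.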

SVM and Kernel Methods

Idea: Maximize the Margin → Primal SVM Problem → Dual Problem and the Kernel Trick → Support Vectors → Soft Margin SVM → Popular Kernels → Full Example Analysis → Generalization Guarantees → Support Vector Method: Mathematical Core → Popular Kernels and Their Properties

Definitions

SVM Problem (hard margin): maximize the margin under correct classification, that is, $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i(w^\top x_i + b) \ge 1$ for all $i$.
Lagrangian: — $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i [y_i(w^\top x_i + b) - 1],\ \alpha_i \ge 0.$

Formulas

Lagrangian: $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i [y_i(w^\top x_i + b) - 1],\ \alpha_i \ge 0.$
Linear: $K(x, y) = x^\top y$. SVM becomes a linear classifier.
Polynomial: $K(x, y) = (x^\top y + 1)^d$. Operates implicitly in the space of degree $d$ polynomial features.
Symmetry: $w = \alpha_1 y_1 x_1 + \alpha_2 y_2 x_2 + \alpha_3 y_3 x_3 + \alpha_4 y_4 x_4$.

·$\partial L / \partial w = 0$: $w = \sum_i \alpha_i y_i x_i$
·$\partial L / \partial b = 0$: $\sum_i \alpha_i y_i = 0$
·Class +1: $x_1 = (1, 2)$, $x_2 = (2, 1)$
·Class −1: $x_3 = (-1, -2)$, $x_4 = (-2, -1)$
·Linear: $K(x,y) = x^\top y$—for linearly separable data
·Polynomial: $K(x,y) = (x^\top y + c)^d$—for polynomial boundary of degree $d$
·RBF (Gaussian): $K(x,y) = \exp(-\gamma\|x - y\|^2)$—infinite-dimensional kernel, universal approximator
·Sigmoid: $K(x,y) = \tanh(\alpha x^\top y + c)$—related to neural networks
·String kernels, graph kernels—for non-geometric data
·SMO (Sequential Minimal Optimization, Platt, 1998): iteratively optimizes pairs of Lagrange multipliers
·Pegasos (Shalev-Shwartz): stochastic gradient descent for linear SVMs
·LIBLINEAR, LIBSVM—standard libraries
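The kernels listed above can be written as Gram-matrix functions. This is a minimal sketch with hypothetical function names and default parameters (`c=1.0`, `d=2`, `gamma=0.5`); the four points are the example points from the list. A valid kernel must produce a symmetric positive semidefinite Gram matrix, which the last lines check numerically for the RBF kernel.

```python
import numpy as np

def linear_kernel(X, Y):
    """K(x, y) = x^T y for all pairs of rows."""
    return X @ Y.T

def polynomial_kernel(X, Y, c=1.0, d=2):
    """K(x, y) = (x^T y + c)^d for all pairs of rows."""
    return (X @ Y.T + c) ** d

def rbf_kernel(X, Y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2) for all pairs of rows."""
    sq_dists = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clip tiny negative round-off

# The four example points listed above
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])

K = rbf_kernel(X, X)
eigs = np.linalg.eigvalsh(K)   # eigenvalues of the symmetric Gram matrix
print(K.round(3))
print(eigs.min() >= -1e-10)    # numerically positive semidefinite
```

The dual SVM problem touches the data only through such a matrix K, which is what makes swapping kernels possible without changing the optimizer.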

Classification problem: given N points $(x_i, y_i)$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$ are class labels. The goal is to find a hyperplane separating the two classes. There are infinitely many such hyperplanes—any separating one will do. SVM (Support Vector Machine) chooses the...

Hyperplane: $w^\top x + b = 0$. Class of a point $x$: $\operatorname{sign}(w^\top x + b)$.

Margin = $2/\|w\|$ (distance between the two parallel hyperplanes $w^\top x + b = \pm 1$).
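For the four-point example in this section (class +1: (1, 2), (2, 1); class −1: (−1, −2), (−2, −1)), the margin can be checked numerically. The sketch below assumes a candidate solution that is not stated in the text: equal multipliers αᵢ = 1/18 and b = 0, motivated by the symmetry of the data. It then verifies the stationarity conditions w = Σ αᵢ yᵢ xᵢ and Σ αᵢ yᵢ = 0 and that every constraint is active.

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Assumed candidate: equal multipliers (by symmetry) and zero offset
alpha = np.full(4, 1.0 / 18.0)
b = 0.0

w = (alpha * y) @ X                      # stationarity in w: w = sum_i alpha_i y_i x_i
balance = float(alpha @ y)               # stationarity in b: sum_i alpha_i y_i = 0
margins = y * (X @ w + b)                # functional margins y_i (w^T x_i + b)
margin_width = 2.0 / np.linalg.norm(w)   # geometric margin 2 / ||w||

print(w, balance, margins, margin_width)
```

All four functional margins equal 1, so every point is a support vector under this candidate, and the KKT conditions (stationarity, feasibility, αᵢ ≥ 0, complementary slackness) hold. Since the problem is convex, that is enough for optimality.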

SVM Problem (hard margin): maximize the margin under correct classification, that is, $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i(w^\top x_i + b) \ge 1$ for all $i$.

Theory of Learnability and Convexity of Neural Networks

When Does a Learning Algorithm “Work”? → VC Dimension → Rademacher Complexity for Convex Functions → Convexity of Neural Networks Under Overparameterization → Implicit Regularization of SGD → Double Descent → Complete Example: Algorithm Comparison → Double Descent and Overfitting → Modern Directions

Definitions

Shattering: — a set of points S is “shattered” by a hypothesis class H if for any labeling of points ∈ {−1, +1}, there exists a hypothesis h ∈ H that correctly classifies all points.
Main VC theorem: — finite vc(H) ↔ PAC (Probably Approximately Correct) learnability. Number of samples for (ε, δ)-learning: N = O((vc(H) + log(1/δ)) / ε²).
Theorem: — with probability ≥ 1−δ:
Phenomenon of implicit bias: SGD does not converge to an arbitrary global minimum. In overparameterized linear models started from zero, it converges to the global minimum with minimal norm.
Double descent phenomenon (Belkin et al., 2019): under overparameterization (number of parameters > number of training samples), the test error begins to decrease again!
Task: — MNIST classification (60,000 training samples, 784 features, 10 classes).
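The implicit-bias definition above can be illustrated on the simplest overparameterized model, underdetermined least squares. This is a sketch under stated assumptions: the data are random and hypothetical, and plain full-batch gradient descent started from zero is swapped in for SGD to keep the example deterministic. The result is compared with the minimum-norm solution given by the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))   # 5 samples, 20 parameters: infinitely many exact fits
b = rng.standard_normal(5)

x = np.zeros(20)                           # zero initialization matters here
step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / (largest eigenvalue of A^T A)
for _ in range(5000):
    x -= step * A.T @ (A @ x - b)          # gradient of 0.5 * ||A x - b||^2

x_min_norm = np.linalg.pinv(A) @ b         # minimum-norm interpolating solution

print(np.linalg.norm(A @ x - b))           # residual: the data are fit exactly
print(np.linalg.norm(x - x_min_norm))      # distance to the minimum-norm solution
```

Every update is a combination of rows of A, so the iterates never leave the row space of A. The only interpolating solution in that subspace is the minimum-norm one.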

Formulas

VC dimension vc(H) = the maximal number of points that H can shatter.

Method	Accuracy	Number of parameters	Guarantees
Logistic regression	92%	7,840	Convex, global optimum
SVM (RBF)	98%	~10,000 support vectors	Convex dual
3-layer neural network	98.5%	100,000	No strict guarantees; works in practice
ResNet-50	99.7%	25 million	Empirically reliable

·Linear classifiers in ℝⁿ: vc = n+1. In ℝ², a linear classifier can shatter any 3 points in general position, but no set of 4 points.
·RBF kernel: vc = ∞ (can shatter any number of points). But the SVM with RBF kernel generalizes via margin!
·Neural network with W parameters: vc ≈ W log W.
·All critical points (∇L = 0) are either global minima or saddle points
·There are no “bad” local minima
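·Classical (underparameterized) regime: M < N, where M is the number of parameters and N is the number of training samples; classical learning behavior
·Interpolation threshold: M ≈ N; test error peaks
·Overparameterized regime: M ≫ N; test error decreases again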
·Generalization bounds for deep networks: PAC-Bayes provides nontrivial estimates for specific trained models based on “flatness” of the loss minimum
·Lottery tickets (Frankle & Carbin, 2019): a large network contains a small subnet that can be trained to comparable quality — related to convexity of local basins
·Safe learning: convex-concave optimization for adversarial training guarantees robustness to attacks
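The claim in the list above about linear classifiers in ℝ² (three points in general position can be shattered, four cannot) can be checked by brute force. The sketch below assumes SciPy is available and tests linear separability of each labeling as a feasibility linear program. The helper names and the two point sets are hypothetical choices; the four-point set is only one example, so the code does not prove the statement for every set of four points.

```python
from itertools import product

import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Feasibility LP: is there (w, b) with y_i (w^T x_i + b) >= 1 for all i?"""
    n, d = X.shape
    # Variables are [w_1, ..., w_d, b]; rewrite as -y_i (x_i^T w + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0   # 0 means a feasible (optimal) point was found

def shattered(X):
    """True if every +/-1 labeling of the rows of X is linearly separable."""
    return all(linearly_separable(X, np.array(labels))
               for labels in product([-1.0, 1.0], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])               # general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])    # corners of a square

print(shattered(three))
print(shattered(four))
```

For the square, the labeling that puts opposite corners in the same class (the XOR pattern) is the one that no line can realize.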

Machine learning seems like an empirical discipline: you try it — it works. But behind the scenes, there is a mathematical theory explaining why and when learning yields generalization to new data. VC theory and PAC learnability are strict frameworks. Convex problems have special guarantees: inde...

Shattering: a set of points S is “shattered” by a hypothesis class H if for any labeling of points ∈ {−1, +1}, there exists a hypothesis h ∈ H that correctly classifies all points.

Main VC theorem: finite vc(H) ↔ PAC (Probably Approximately Correct) learnability. Number of samples for (ε, δ)-learning: N = O((vc(H) + log(1/δ)) / ε²).

Corollary: the generalization gap is at most O(Bρ/√N), which does not depend on the dimensionality of the space! Only the margin (1/B) and the data norm (ρ) matter.