
Proximal Algorithms and Operator Splitting

First-Order Algorithms


Motivation: How to Handle Nondifferentiable Terms?

LASSO problem: $\min \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$. The first term is smooth (can be differentiated), the second is not ($\|x\|_1$ is nondifferentiable at zero). Gradient descent cannot be applied directly. The subgradient method is too slow ($O(1/\sqrt{k})$). Proximal algorithms solve this problem elegantly: they “split” the function into smooth and nonsmooth parts and handle each differently.

Proximal Operator

Definition: $\operatorname{prox}_{\tau f}(x) = \arg\min_{y} \left\{ f(y) + \frac{1}{2\tau} \|y - x\|^2 \right\}$

Meaning of each symbol:

  • $\tau > 0$ — step size
  • $f(y)$ — “minimize” $f$
  • $\frac{1}{2\tau}\|y - x\|^2$ — “don’t move far from $x$”
  • $\arg\min$ — returns the minimizer $y$

This is a “soft step toward the minimum of $f$.” When $\tau \to 0$: prox $\approx x$ (do not move). When $\tau \to \infty$: prox $\to \arg\min f$ (go straight to the minimum).

Key examples (a code sketch follows the list):

  1. $f(x) = \|x\|_1$: $\operatorname{prox}_{\tau f}(x) = \operatorname{sign}(x) \cdot \max(|x| - \tau, 0)$ — soft thresholding

    Meaning: components with $|x_i| \leq \tau$ are set to zero, the rest are shifted toward zero by $\tau$.

  2. $f(x) = \delta_C(x)$ — indicator of a convex set $C$: $\operatorname{prox}_{\delta_C}(x) = P_C(x)$ — projection onto $C$

  3. $f(X) = \|X\|_*$ (nuclear norm = sum of singular values): $\operatorname{prox}_{\tau\|\cdot\|_*}(X) = U\tilde{\Sigma}V^T$, where $X = U\Sigma V^T$ (SVD) and $\tilde{\Sigma}_{ii} = \max(\sigma_i - \tau, 0)$ — soft thresholding for singular values.
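To make these concrete, here is a minimal NumPy sketch of the three proximal operators above (soft thresholding, projection, and singular value thresholding). The function names and the box constraint used to illustrate the projection are illustrative choices, not fixed notation.

```python
import numpy as np

def soft_threshold(x, tau):
    """prox of tau * ||x||_1: componentwise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def project_box(x, lo, hi):
    """prox of the indicator of the box [lo, hi]^n: Euclidean projection (clipping)."""
    return np.clip(x, lo, hi)

def svt(X, tau):
    """prox of tau * ||X||_*: soft thresholding of the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# small usage example: entries with |x_i| <= 0.5 are zeroed, the rest shrink by 0.5
print(soft_threshold(np.array([2.0, 0.3, -0.7]), 0.5))  # -> [ 1.5  0.  -0.2]
```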

Proximal Gradient Method (ISTA/FISTA)

For the problem $\min F(x) = f(x) + g(x)$, where $f$ is smooth ($L$-smooth), $g$ is convex (prox-friendly):

ISTA (Iterative Shrinkage-Thresholding Algorithm):

$ x^{k+1} = \operatorname{prox}_{\tau g} (x^k - \tau \nabla f(x^k)) $

Steps: 1) Gradient step for $f$ 2) Proximal step for $g$. Convergence: $O(1/k)$.

FISTA = ISTA + Nesterov momentum:

$ y^{k+1} = \operatorname{prox}_{\tau g} (x^k - \tau \nabla f(x^k)) $
$ x^{k+1} = y^{k+1} + \frac{k-1}{k+2} (y^{k+1} - y^k) $

Convergence: $O(1/k^2)$ — optimal!
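The whole loop is short enough to show in full. Below is a minimal NumPy sketch of the proximal gradient method with optional Nesterov momentum, following the update rules above; the callables `grad_f` and `prox_g` are assumed to be supplied by the user.

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, tau, n_iter=500, accelerate=True):
    """ISTA (accelerate=False) or FISTA (accelerate=True).

    grad_f : gradient of the smooth part f
    prox_g : prox_g(v, tau), proximal operator of the nonsmooth part g
    tau    : step size, typically 1/L for an L-smooth f
    """
    x = np.asarray(x0, dtype=float).copy()
    y_prev = x.copy()
    for k in range(1, n_iter + 1):
        # 1) gradient step on f, 2) proximal step on g
        y = prox_g(x - tau * grad_f(x), tau)
        # 3) Nesterov extrapolation (FISTA); plain ISTA simply sets x = y
        x = y + (k - 1) / (k + 2) * (y - y_prev) if accelerate else y
        y_prev = y
    return y_prev
```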

Complete Walkthrough of LASSO with FISTA

Problem: $\min F(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$

$f(x) = \frac{1}{2}\|Ax-b\|^2$, $\nabla f(x) = A^T(Ax-b)$, $L = \lambda_{\max}(A^TA)$

$g(x) = \lambda\|x\|_1$, $\operatorname{prox}_{\tau g}(z) = \operatorname{sign}(z) \cdot \max(|z| - \tau \lambda, 0)$

FISTA iteration:

  1. Compute $r^k = Ax^k - b$ (residual)
  2. Gradient: $\nabla f(x^k) = A^T r^k$
  3. Gradient step: $z^{k+1} = x^k - (1/L) A^T r^k$
  4. Proximal: $y^{k+1} = \operatorname{sign}(z^{k+1}) \cdot \max(|z^{k+1}| - \lambda/L, 0)$
  5. Momentum: $x^{k+1} = y^{k+1} + \beta_k(y^{k+1} - y^k)$, where $\beta_k = \frac{k-1}{k+2}$

Numerical example: $A$ is a $50 \times 100$ matrix, $x^*$ is sparse (5 nonzero components). FISTA with $\lambda = 0.1$ achieves accuracy $10^{-6}$ in about 200 iterations. Without acceleration (ISTA) — in about 2000 iterations.
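This walkthrough fits in a few lines of NumPy. The sketch below mimics the described setup (a $50 \times 100$ matrix, a 5-sparse ground truth, $\lambda = 0.1$) with randomly generated data; exact iteration counts and objective values will vary with the random draw and the stopping criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s, lam = 50, 100, 5, 0.1                     # sizes from the example above
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
b = A @ x_true + 0.01 * rng.standard_normal(m)

L = np.linalg.eigvalsh(A.T @ A).max()              # Lipschitz constant of grad f
tau = 1.0 / L

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

x = np.zeros(n)
y_prev = np.zeros(n)
for k in range(1, 501):
    r = A @ x - b                                  # 1) residual
    z = x - tau * (A.T @ r)                        # 2-3) gradient step
    y = soft_threshold(z, lam * tau)               # 4) proximal step
    x = y + (k - 1) / (k + 2) * (y - y_prev)       # 5) momentum
    y_prev = y

obj = 0.5 * np.linalg.norm(A @ y_prev - b) ** 2 + lam * np.abs(y_prev).sum()
print(f"F(x) = {obj:.6f}, nonzeros = {np.count_nonzero(np.abs(y_prev) > 1e-8)}")
```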

ADMM (Alternating Direction Method of Multipliers)

ADMM solves problems of the form: $\min f(x) + g(z)$ subject to $Ax + Bz = c$

ADMM Algorithm (iteration):

  1. x-step: $x^{k+1} = \arg\min_x \left\{ f(x) + \frac{\rho}{2}\|Ax + Bz^k - c + u^k\|^2 \right\}$
  2. z-step: $z^{k+1} = \arg\min_z \left\{ g(z) + \frac{\rho}{2}\|Ax^{k+1} + Bz - c + u^k\|^2 \right\}$
  3. Dual step: $u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c$

Here $\rho > 0$ is the penalty parameter, $u$ is the (scaled) dual variable.

Why ADMM is more powerful: it splits the problem into two subproblems, one in $x$ and one in $z$, which are solved in alternation and each involve only their own term. If both subproblems have convenient proximal operators (or closed-form solutions), the whole algorithm is very efficient.

Example: LASSO via ADMM

  • x-step: ordinary ridge regression (closed-form solution via an $LDL^T$ decomposition)
  • z-step: soft thresholding

ADMM converges in $O(1/k)$ and is highly parallelizable over data blocks.
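Here is a minimal NumPy sketch of this ADMM variant for LASSO, using the splitting $\min \frac{1}{2}\|Ax-b\|^2 + \lambda\|z\|_1$ subject to $x - z = 0$. It uses a Cholesky factorization instead of $LDL^T$ for brevity, and $\rho$ and the iteration count are illustrative defaults that would need tuning in practice.

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, n_iter=200):
    """ADMM for min (1/2)||Ax - b||^2 + lam*||z||_1  s.t.  x - z = 0 (scaled dual form)."""
    m, n = A.shape
    AtA, Atb = A.T @ A, A.T @ b
    # x-step is a ridge-like solve; factor (A^T A + rho I) once and reuse it
    chol = np.linalg.cholesky(AtA + rho * np.eye(n))

    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    for _ in range(n_iter):
        # x-step: quadratic subproblem, closed form via the cached factorization
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(chol.T, np.linalg.solve(chol, rhs))
        # z-step: prox of (lam/rho)*||.||_1 applied to x + u -> soft thresholding
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)
        # dual step (scaled): accumulate the constraint residual x - z
        u = u + x - z
    return z
```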

Applications

Distributed optimization: with $n$ machines, each stores its own portion of the data. ADMM enables solving the global problem without gathering all the data in one place — only exchange of “dual” variables.

Image processing: Total Variation (TV) denoising: $\min \frac{1}{2}\|u-f\|^2 + \lambda\,\mathrm{TV}(u)$. The TV term is nonsmooth but has a convenient proximal operator. FISTA and ADMM are standard methods here.

Proximal Operators for Typical Regularizers

Besides soft thresholding for L1 and projection for the indicator, important proximal operators are (a code sketch follows the list):

  • For L2-regularization $\tau\|x\|^2$: $\operatorname{prox}(x) = x / (1 + 2\tau)$ — simple scaling
  • For group LASSO $\tau\|x\|_{2,1}$: group soft thresholding, zeroing out entire groups of coordinates
  • For the nuclear norm $\|X\|_*$ (sum of singular values of a matrix): SVD decomposition $X = U\Sigma V^T$, then soft threshold the singular values and reassemble
  • For the simplex indicator: projection onto the simplex via sorting — $O(n \log n)$
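Two of the items above are easy to miss in formula form, so here is a minimal NumPy sketch of group soft thresholding and the sorting-based projection onto the probability simplex. The function names and the group encoding (a list of index arrays) are illustrative choices.

```python
import numpy as np

def prox_group_l21(x, groups, tau):
    """Group soft thresholding: prox of tau * sum_g ||x_g||_2.
    Groups whose norm is <= tau are zeroed out entirely; the rest shrink."""
    y = x.copy()
    for g in groups:
        norm = np.linalg.norm(x[g])
        y[g] = 0.0 if norm <= tau else (1.0 - tau / norm) * x[g]
    return y

def project_simplex(v):
    """Euclidean projection onto the probability simplex via sorting, O(n log n)."""
    u = np.sort(v)[::-1]                                   # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1.0) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)
```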

Applications of Operator Splitting

Douglas-Rachford splitting and ADMM form the foundation of modern convex optimization systems: TFOCS, CVXPY, COSMO. They are especially effective for problems with a separable structure, where the objective function splits into simple terms with easily computable proximal operators. ADMM is actively used in distributed optimization: each compute node solves a local subproblem via a prox, then synchronizes through dual variables. This allows scaling training to thousands of machines.
