Proximal Algorithms and Operator Splitting
First-Order Algorithms
Motivation: How to Handle Nondifferentiable Terms?
LASSO problem: $\min \frac{1}{2}\|Ax-b\|^2 + \lambda\|x\|_1$. The first term is smooth (can be differentiated), the second is not ($\|x\|_1$ is nondifferentiable at zero). Gradient descent cannot be applied directly. The subgradient method is too slow ($O(1/\sqrt{k})$). Proximal algorithms solve this problem elegantly: they “split” the function into smooth and nonsmooth parts and handle each differently.
Proximal Operator
Definition: $\operatorname{prox}_{\tau f}(x) = \arg\min_{y} \left\{ f(y) + \frac{1}{2\tau} \|y - x\|^2 \right\}$
Meaning of each symbol:
- $\tau > 0$ — step size
- $f(y)$ — “minimize” $f$
- $\frac{1}{2\tau}\|y - x\|^2$ — “don’t move far from $x$”
- $\arg\min$ — returns the minimizer $y$
This is a “soft step toward the minimum of $f$.” When $\tau \to 0$: prox $\approx x$ (do not move). When $\tau \to \infty$: prox $\to \arg\min f$ (go straight to the minimum).
Key examples:
- $f(x) = \|x\|_1$: $\operatorname{prox}_{\tau f}(x) = \operatorname{sign}(x) \cdot \max(|x| - \tau, 0)$ — soft thresholding
Meaning: components with $|x_i| \leq \tau$ are set to zero, the rest are shifted toward zero by $\tau$.
- $f(x) = \delta_C(x)$ — indicator of a convex set $C$: $\operatorname{prox}_{\delta_C}(x) = P_C(x)$ — projection onto $C$
- $f(X) = \|X\|_*$ (nuclear norm = sum of singular values): $\operatorname{prox}_{\tau\|\cdot\|_*}(X) = U\tilde{\Sigma}V^T$, where $X = U\Sigma V^T$ (SVD) and $\tilde{\Sigma}_{ii} = \max(\sigma_i - \tau, 0)$ — soft thresholding of the singular values.
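The first and third examples translate directly into NumPy. A minimal sketch (function names here are illustrative, not from a library):

```python
import numpy as np

def prox_l1(x, tau):
    """Soft thresholding: prox of tau * ||.||_1, applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_nuclear(X, tau):
    """Singular-value soft thresholding: prox of tau * ||.||_* ."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```

Note that `prox_nuclear` is just `prox_l1` applied to the vector of singular values, with the orthogonal factors $U, V$ left untouched.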
Proximal Gradient Method (ISTA/FISTA)
For the problem $\min F(x) = f(x) + g(x)$, where $f$ is smooth ($L$-smooth), $g$ is convex (prox-friendly):
ISTA (Iterative Shrinkage-Thresholding Algorithm):
$ x^{k+1} = \operatorname{prox}_{\tau g} (x^k - \tau \nabla f(x^k)) $
Steps: 1) Gradient step for $f$ 2) Proximal step for $g$. Convergence: $O(1/k)$.
FISTA = ISTA + Nesterov momentum:
$ y^{k+1} = \operatorname{prox}_{\tau g} (x^k - \tau \nabla f(x^k)), \qquad x^{k+1} = y^{k+1} + \frac{k-1}{k+2} (y^{k+1} - y^k) $
Convergence: $O(1/k^2)$ — optimal!
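The scheme can be sketched as a short generic routine, assuming the user supplies callables for $\nabla f$ and $\operatorname{prox}_{\tau g}$ (names are illustrative):

```python
import numpy as np

def fista(grad_f, prox_g, x0, tau, iters=500):
    """Proximal gradient with Nesterov momentum (FISTA).

    grad_f : gradient of the smooth part f
    prox_g : prox_g(z, tau) -> prox of the nonsmooth part g
    tau    : step size, tau <= 1/L for L-smooth f
    """
    x = x0.copy()       # extrapolated point
    y_prev = x0.copy()  # main iterate
    for k in range(iters):
        y = prox_g(x - tau * grad_f(x), tau)   # gradient + proximal step
        x = y + (k / (k + 3)) * (y - y_prev)   # momentum (k-1)/(k+2) at step k+1
        y_prev = y
    return y_prev
```

Setting the momentum coefficient to zero recovers plain ISTA.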
Complete Walkthrough of LASSO with FISTA
Problem: $\min F(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$
$f(x) = \frac{1}{2}\|Ax-b\|^2$, $\nabla f(x) = A^T(Ax-b)$, $L = \lambda_{\max}(A^TA)$
$g(x) = \lambda\|x\|_1$, $\operatorname{prox}_{\tau g}(z) = \operatorname{sign}(z) \cdot \max(|z| - \tau\lambda, 0)$
FISTA iteration:
- Compute $r^k = Ax^k - b$ (residual)
- Gradient: $\nabla f(x^k) = A^T r^k$
- Gradient step: $z^{k+1} = x^k - (1/L) A^T r^k$
- Proximal: $y^{k+1} = \operatorname{sign}(z^{k+1}) \cdot \max(|z^{k+1}| - \lambda/L, 0)$
- Momentum: $x^{k+1} = y^{k+1} + \beta_k(y^{k+1} - y^k)$
Numerical example: $A$ is a $50 \times 100$ matrix, $x^*$ is sparse (5 nonzero components). FISTA with $\lambda = 0.1$ achieves accuracy $10^{-6}$ in about 200 iterations. Without acceleration (ISTA) — in about 2000 iterations.
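Putting the iteration steps together, a self-contained sketch (the $50 \times 100$ sizes, 5-sparse $x^*$, and $\lambda = 0.1$ follow the numerical example above; the random data itself is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k_sparse = 50, 100, 5
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, k_sparse, replace=False)] = rng.standard_normal(k_sparse)
b = A @ x_true

lam = 0.1
L = np.linalg.eigvalsh(A.T @ A).max()   # Lipschitz constant of grad f

x = np.zeros(n)
y_prev = np.zeros(n)
for k in range(500):
    r = A @ x - b                        # residual
    z = x - (A.T @ r) / L                # gradient step, tau = 1/L
    y = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox step
    x = y + (k / (k + 3)) * (y - y_prev)                   # momentum
    y_prev = y
```

After the loop, `y_prev` holds a sparse approximation of $x^*$; the objective $F$ is far below its value at $x = 0$.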
ADMM (Alternating Direction Method of Multipliers)
ADMM solves problems of the form: $\min f(x) + g(z)$ subject to $Ax + Bz = c$
ADMM Algorithm (iteration):
- x-step: $x^{k+1} = \arg\min_x \left\{ f(x) + \frac{\rho}{2}\|Ax + Bz^k - c + u^k\|^2 \right\}$
- z-step: $z^{k+1} = \arg\min_z \left\{ g(z) + \frac{\rho}{2}\|Ax^{k+1} + Bz - c + u^k\|^2 \right\}$
- Dual step: $u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c$
Here $\rho > 0$ is the penalty parameter, $u$ is the (scaled) dual variable.
Why ADMM is more powerful: it splits the problem into two subproblems, one in $x$ and one in $z$, which are solved alternately, each in its simplest form. If both subproblems have convenient proximal operators or closed-form solutions, the whole algorithm is very efficient.
Example: Lasso via ADMM
- x-step: ordinary ridge regression (closed-form solution via an $LDL^T$ decomposition)
- z-step: soft thresholding
ADMM converges in $O(1/k)$ and is highly parallelizable over data blocks.
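A compact sketch of this splitting for the lasso, using the constraint $x = z$ (so $A = I$, $B = -I$, $c = 0$ in the general template; the function name, `rho`, and the iteration count are illustrative):

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, iters=200):
    """ADMM for min (1/2)||Ax - b||^2 + lam * ||z||_1  s.t.  x = z."""
    n = A.shape[1]
    # The x-step matrix is fixed, so it could be factored once up front
    M = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    for _ in range(iters):
        x = np.linalg.solve(M, Atb + rho * (z - u))  # x-step: ridge-type solve
        w = x + u
        z = np.sign(w) * np.maximum(np.abs(w) - lam / rho, 0.0)  # z-step: soft threshold
        u = u + x - z                                            # scaled dual update
    return z
```

In production code one would factor `M` once (e.g. Cholesky/$LDL^T$) and reuse the factorization every iteration instead of calling `solve` repeatedly.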
Applications
Distributed optimization: with $n$ machines, each stores its own portion of the data. ADMM enables solving the global problem without gathering all the data in one place — only exchange of “dual” variables.
Image processing: Total Variation (TV) denoising: $\min \frac{1}{2}\|u-f\|^2 + \lambda\, TV(u)$. The TV norm is nonsmooth but has a convenient proximal operator. FISTA and ADMM are standard methods here.
Proximal Operators for Typical Regularizers
Besides soft thresholding for L1 and projection for the indicator, important proximal operators are:
- For L2-regularization $\tau\|x\|^2$: $\operatorname{prox}(x) = x / (1 + 2\tau)$ — simple scaling
- For group LASSO $\tau\|x\|_{2,1}$: group soft thresholding, zeroing out entire groups of coordinates
- For the nuclear norm $\|X\|_*$ (sum of singular values of a matrix): compute the SVD $X = U\Sigma V^T$, soft-threshold the singular values, and reassemble
- For the simplex indicator: projection onto the simplex via sorting — $O(n \log n)$
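Two of these admit short direct implementations. A sketch of the sort-based simplex projection and of group soft thresholding (function names are illustrative):

```python
import numpy as np

def project_simplex(v):
    """Projection onto the probability simplex {x : x >= 0, sum(x) = 1},
    via sorting: O(n log n)."""
    u = np.sort(v)[::-1]                 # sort descending
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    k = ks[u > css / ks][-1]             # largest k with u_k > (sum_k - 1)/k
    theta = css[k - 1] / k
    return np.maximum(v - theta, 0.0)

def prox_group_l1(x, groups, tau):
    """Group soft thresholding: prox of tau * sum_g ||x_g||_2.
    Each group is either zeroed entirely or shrunk toward zero."""
    y = x.copy()
    for g in groups:
        nrm = np.linalg.norm(x[g])
        y[g] = 0.0 if nrm <= tau else (1.0 - tau / nrm) * x[g]
    return y
```

Note how `prox_group_l1` reduces to ordinary soft thresholding when every group has size one.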
Applications of Operator Splitting
Douglas-Rachford splitting and ADMM form the foundation of modern convex optimization systems: TFOCS, CVXPY, COSMO. They are especially effective for problems with a separable structure, where the objective function splits into simple terms with easily computable proximal operators. ADMM is actively used in distributed optimization: each compute node solves a local subproblem via a prox, then synchronizes through dual variables. This allows scaling training to thousands of machines.