Subgradients and Subdifferential
Motivation: What to Do with Nondifferentiable Functions?
Many important convex functions do not have a gradient at every point. The absolute value |x| is not differentiable at zero. The L1 norm ‖x‖₁ is not differentiable when at least one component is zero. The maximum of several functions is not differentiable on the switching set. If we want to solve LASSO problems, minimizing ½‖Ax − b‖² + λ‖x‖₁, we need a tool that generalizes the gradient to nonsmooth functions. This is the subdifferential.
Definition of Subgradient
A vector $g \in \mathbb{R}^n$ is called a subgradient of a convex function $f$ at the point $x$ if:
$ f(y) \geq f(x) + g^\top(y - x) \quad \text{for all } y \in \operatorname{dom}(f) $
Interpretation: a subgradient is a vector that defines a hyperplane through the point $(x, f(x))$ that lies "below" (or touches) the graph of $f$. By the first-order criterion, if $f$ is differentiable, the only subgradient is the gradient $\nabla f(x)$. If $f$ is nondifferentiable, there may be a whole "fan" of admissible subgradients.
The subdifferential $\partial f(x)$ is the set of all subgradients of $f$ at $x$:
$ \partial f(x) = \{g : f(y) \geq f(x) + g^\top(y - x) \text{ for all } y \} $
This is always a closed convex set.
Examples of Subdifferentials
Absolute value |x| in one variable:
- For $x > 0$: $f'(x) = 1$, so $\partial|x| = \{1\}$
- For $x < 0$: $f'(x) = -1$, so $\partial|x| = \{-1\}$
- For $x = 0$: the subdifferential $\partial|0| = [-1, 1]$ — the entire segment
Why? For $x = 0$ we need to find all $g$ such that $|y| \geq |0| + g(y - 0) = gy$ for all $y$. This holds when $|y| \geq gy$ for all $y$, which means $g \in [-1, 1]$.
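This argument can be checked numerically: a slope $g$ is a subgradient of $|x|$ at $0$ exactly when the line $gy$ stays below $|y|$ everywhere. A minimal sketch (the function name and test grid are ours):

```python
import numpy as np

# Check the subgradient inequality |y| >= g*y at x = 0 on a grid of y values.
ys = np.linspace(-5, 5, 1001)

def is_subgradient_at_zero(g):
    # g is a subgradient of |x| at 0 iff |y| >= g*y for all y
    return bool(np.all(np.abs(ys) >= g * ys - 1e-12))

print(is_subgradient_at_zero(0.7))   # True: 0.7 lies in [-1, 1]
print(is_subgradient_at_zero(1.0))   # True: boundary of the segment
print(is_subgradient_at_zero(1.1))   # False: the line cuts the graph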
L1 norm $\|x\|_1 = \sum_i |x_i|$ in $\mathbb{R}^n$:
$ \partial\|x\|_1 = \{g : g_i = \operatorname{sign}(x_i) \text{ when } x_i \neq 0,\ g_i \in [-1, 1] \text{ when } x_i = 0 \} $
Componentwise: each component $g_i$ is chosen independently from $\partial|x_i|$.
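Because the components decouple, the whole subdifferential can be stored as per-component intervals. A sketch (the representation and the names `lo`/`hi` are our illustrative choice):

```python
import numpy as np

def l1_subdifferential_bounds(x):
    # Each component of the subdifferential of ||x||_1 is an interval
    # [lo_i, hi_i]: the point {sign(x_i)} when x_i != 0, and [-1, 1] when x_i = 0.
    s = np.sign(x)
    lo = np.where(x != 0, s, -1.0)
    hi = np.where(x != 0, s, 1.0)
    return lo, hi

lo, hi = l1_subdifferential_bounds(np.array([2.0, 0.0, -3.0]))
# component 0: {1}, component 1: [-1, 1], component 2: {-1}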
Maximum $\max(f_1, ..., f_m)(x)$ (assuming $f_i$ are smooth):
$ \partial \max(f_1,\ldots,f_m)(x) = \operatorname{conv}\{\nabla f_i(x) : i \in I(x)\} $
where $I(x) = \{i : f_i(x) = \max_j f_j(x)\}$ is the set of "active" functions, and $\operatorname{conv}$ is the convex hull.
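The formula reduces to bookkeeping: find the active indices, collect their gradients, take the convex hull. A one-dimensional sketch (function names are ours), using $\max(x, -x) = |x|$ to recover the absolute-value example:

```python
import numpy as np

def active_gradients(x, fs, grads, tol=1e-9):
    # Gradients of the active functions; the subdifferential of max_i f_i
    # at x is the convex hull of this list.
    vals = np.array([f(x) for f in fs])
    m = vals.max()
    active = [i for i, v in enumerate(vals) if v >= m - tol]
    return [grads[i](x) for i in active]

# f(x) = max(x, -x) = |x|: at x = 0 both pieces are active, so the
# gradients are {+1, -1} and their convex hull is the segment [-1, 1].
fs = [lambda x: x, lambda x: -x]
grads = [lambda x: 1.0, lambda x: -1.0]
print(active_gradients(0.0, fs, grads))   # [1.0, -1.0]
print(active_gradients(2.0, fs, grads))   # [1.0]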
Indicator function $\delta_C(x) = 0$ if $x \in C$, otherwise $+\infty$:
$ \partial\delta_C(x) = N_C(x) = \{g : g^\top(y - x) \leq 0 \text{ for all } y \in C\} $
— the normal cone to $C$ at the point $x$.
Optimality Condition via Subdifferential
Theorem: a point $x^*$ is a global minimum of a convex function $f$ if and only if:
$ 0 \in \partial f(x^*) $
Meaning: zero must be a subgradient. If $f$ is differentiable, this is the standard condition $\nabla f(x^*) = 0$. For nondifferentiable functions: zero must lie in the "fan" of subgradients.
Example: for $f(x) = |x|$: $0 \in \partial|x^*| \iff x^* = 0$ (because only at $x = 0$ does the subdifferential contain 0).
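The example can be spelled out directly from the case analysis of $\partial|x|$ above. A tiny sketch (the function name is ours):

```python
def abs_subdiff_contains_zero(x):
    # For f(x) = |x|: the subdifferential is {sign(x)} for x != 0
    # and the segment [-1, 1] at x = 0, so 0 is in it only at x = 0.
    if x > 0:
        return 0.0 in {1.0}       # partial |x| = {1}
    if x < 0:
        return 0.0 in {-1.0}      # partial |x| = {-1}
    return -1.0 <= 0.0 <= 1.0     # partial |0| = [-1, 1]

print(abs_subdiff_contains_zero(0.0))   # True: x = 0 is the minimizer
print(abs_subdiff_contains_zero(2.0))   # False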
Subdifferentiation Rules
Sum: $\partial(f + g)(x) \supseteq \partial f(x) + \partial g(x)$. Under regularity conditions (e.g., one of the functions is continuous at a point of the other's domain), equality holds.
Affine substitution of argument: if $h(x) = f(Ax + b)$, then
$ \partial h(x) = A^\top \partial f(Ax + b) $
Proximal Operator
A subgradient gives a "direction" to move in, but the subgradient method converges slowly, at rate $O(1/\sqrt{k})$. The proximal operator provides a more efficient alternative:
$ \operatorname{prox}_{\tau f}(x) = \arg\min_y \left\{ f(y) + \frac{1}{2\tau} \|y - x\|^2 \right\} $
Meaning: "move toward the minimum of $f$", but not too far from the current point $x$. The parameter $\tau$ is the step size.
Examples of proximal operator calculation:
For $f(x) = \|x\|_1$ with step $\tau$ (the resulting operator is soft thresholding):
$ \operatorname{prox}_{\tau\|\cdot\|_1}(x) = \operatorname{sign}(x) \cdot \max(|x| - \tau, 0) $
— elementwise.
This is "soft thresholding": it zeros out components with $|x_i| \leq \tau$, and shifts the rest toward zero by $\tau$.
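This operator is one line of NumPy; a minimal sketch following the elementwise formula above:

```python
import numpy as np

def soft_threshold(x, tau):
    # Zeroes components with |x_i| <= tau, shifts the rest toward 0 by tau.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

print(soft_threshold(np.array([3.0, 0.5, -2.0]), 1.0).tolist())  # [2.0, 0.0, -1.0]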
For $f(x) = \delta_C(x)$ (indicator):
$ \operatorname{prox}_{\delta_C}(x) = P_C(x) $
— just projection onto $C$.
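For a set with a simple geometry the projection is explicit. A sketch for the box $C = [lo, hi]^n$ (the box is our illustrative choice of $C$; for it, $P_C$ is a componentwise clip):

```python
import numpy as np

def prox_indicator_box(x, lo, hi):
    # prox of the indicator of C = [lo, hi]^n is the Euclidean projection
    # onto the box, i.e. clipping each component into [lo, hi].
    return np.clip(x, lo, hi)

print(prox_indicator_box(np.array([-0.5, 0.3, 2.0]), 0.0, 1.0).tolist())  # [0.0, 0.3, 1.0]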
Full Analysis of Example: LASSO
Problem: $\min_{x \in \mathbb{R}^n} F(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$
Optimality condition: $0 \in \partial F(x^*) = \partial\left(\frac{1}{2}\|Ax^* - b\|^2\right) + \lambda\,\partial\|x^*\|_1$
The first subdifferential is the gradient of the smooth part: $A^\top(Ax^* - b)$. The second is $\lambda \cdot \partial\|x^*\|_1$.
So: $-A^\top(Ax^* - b) \in \lambda\,\partial\|x^*\|_1$
Componentwise: if $x^*_i \neq 0$, then $[A^\top(Ax^* - b)]_i = -\lambda \operatorname{sign}(x^*_i)$. If $x^*_i = 0$, then $|[A^\top(Ax^* - b)]_i| \leq \lambda$.
ISTA (Iterative Shrinkage-Thresholding Algorithm):
$ x^{k+1} = \operatorname{prox}_{\tau \lambda\|\cdot\|_1}(x^k - \tau A^\top(Ax^k - b)) = \operatorname{softthresh}(x^k - \tau A^\top(Ax^k - b),\ \tau\lambda) $
At each step: gradient step on the smooth part + proximal on the nonsmooth.
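The iteration fits in a few lines of NumPy. A minimal sketch on toy data of our own (the matrix $A$, vector $b$, and $\lambda$ are invented for illustration; the step $\tau = 1/\|A\|_2^2$ is the standard $1/L$ choice for the smooth part):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam, n_iter=500):
    # Step size 1/L, where L = ||A||_2^2 is the Lipschitz constant
    # of the gradient of the smooth part.
    tau = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                        # gradient step (smooth part)
        x = soft_threshold(x - tau * grad, tau * lam)   # prox step (L1 part)
    return x

# Toy sparse-recovery instance: x_true has two nonzero components.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
x_true = np.zeros(10)
x_true[[1, 4]] = [2.0, -3.0]
b = A @ x_true
x_hat = ista(A, b, lam=0.1)
# x_hat is sparse and close to x_true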
Applications
Feature selection: LASSO regression with L1 penalty automatically zeros out unnecessary coefficients. This is "automatic variable selection". In medical data with thousands of genes, LASSO selects several dozen most significant.
Compressed signal recovery: An MRI scanner takes 10 times fewer measurements, using L1 minimization to recover a sparse image. The subdifferential of the L1 norm is key to understanding why this works.