
Subgradients and Subdifferential



Motivation: What to Do with Nondifferentiable Functions?

Many important convex functions do not have a gradient at every point. The absolute value $|x|$ is not differentiable at zero. The L1 norm $\|x\|_1$ is not differentiable whenever at least one component is zero. The maximum of several functions is not differentiable on the set where the active function switches. If we want to solve LASSO problems ($\min_x \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$), we need a tool that generalizes the gradient to nonsmooth functions. That tool is the subdifferential.

Definition of Subgradient

A vector $g \in \mathbb{R}^n$ is called a subgradient of a convex function $f$ at the point $x$ if:

$ f(y) \geq f(x) + g^\top(y - x) \quad \text{for all } y \in \operatorname{dom}(f) $

Interpretation: a subgradient is a vector that defines a hyperplane through the point $(x, f(x))$ that lies "below" (or touches) the graph of $f$. By the first-order criterion, if $f$ is differentiable, the only subgradient is the gradient $\nabla f(x)$. If $f$ is nondifferentiable, there may be a whole "fan" of admissible subgradients.

The subdifferential $\partial f(x)$ is the set of all subgradients of $f$ at $x$:

$ \partial f(x) = \{ g : f(y) \geq f(x) + g^\top(y - x) \text{ for all } y \} $

This is always a closed convex set.

Examples of Subdifferentials

Absolute value |x| in one variable:

  • For $x > 0$: $f'(x) = 1$, so $\partial|x| = \{1\}$
  • For $x < 0$: $f'(x) = -1$, so $\partial|x| = \{-1\}$
  • For $x = 0$: the subdifferential $\partial|0| = [-1, 1]$ — the entire segment

Why? For $x = 0$ we need to find all $g$ such that $|y| \geq |0| + g(y - 0) = gy$ for all $y$. This holds when $|y| \geq gy$ for all $y$, which means $g \in [-1, 1]$.
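
As a quick numerical sanity check, here is a minimal Python sketch (the grid and the helper name are ours, and checking a finite grid is only illustrative) that tests the subgradient inequality at $x = 0$:

```python
# Grid-based sanity check of the subgradient inequality |y| >= g*y at x = 0.
import numpy as np

ys = np.linspace(-5, 5, 1001)  # test points

def is_subgradient_at_zero(g):
    # g is a subgradient of |.| at 0 iff |y| >= |0| + g*(y - 0) for all y;
    # here we only check a finite grid, so this is illustrative.
    return bool(np.all(np.abs(ys) >= g * ys))

print(is_subgradient_at_zero(0.7))   # True:  0.7 lies in [-1, 1]
print(is_subgradient_at_zero(-1.0))  # True:  endpoint of the interval
print(is_subgradient_at_zero(1.2))   # False: violated, e.g., at y = 1
```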

L1 norm $\|x\|_1 = \sum_i |x_i|$ in $\mathbb{R}^n$:

$ \partial\|x\|_1 = \{ g : g_i = \operatorname{sign}(x_i) \text{ when } x_i \neq 0,\ g_i \in [-1, 1] \text{ when } x_i = 0 \} $

Componentwise: each component $g_i$ is chosen independently from $\partial|x_i|$.
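
A minimal sketch of this componentwise rule (the function name l1_subdifferential and the interval representation are our own conventions), returning each $\partial|x_i|$ as an interval:

```python
# Componentwise subdifferential of the L1 norm, represented as a list of
# intervals (lo_i, hi_i), one per coordinate.
def l1_subdifferential(x, tol=1e-12):
    intervals = []
    for xi in x:
        if xi > tol:
            intervals.append((1.0, 1.0))    # singleton {+1}
        elif xi < -tol:
            intervals.append((-1.0, -1.0))  # singleton {-1}
        else:
            intervals.append((-1.0, 1.0))   # the whole segment [-1, 1]
    return intervals

print(l1_subdifferential([2.0, 0.0, -3.0]))
# [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0)]
```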

Maximum $\max(f_1, ..., f_m)(x)$ (assuming $f_i$ are smooth):

$ \partial \max(f_1,\ldots,f_m)(x) = \operatorname{conv}\{ \nabla f_i(x) : i \in I(x) \} $

where $I(x) = \{ i : f_i(x) = \max_j f_j(x) \}$ is the set of "active" functions, and $\operatorname{conv}$ is the convex hull.
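
Sanity check: for $f_1(x) = x$ and $f_2(x) = -x$ we have $\max(f_1, f_2)(x) = |x|$. At $x = 0$ both functions are active, so the formula gives $\operatorname{conv}\{1, -1\} = [-1, 1]$, matching the subdifferential of $|x|$ computed above.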

Indicator function $\delta_C(x) = 0$ if $x \in C$, otherwise $+\infty$:

$ \partial\delta_C(x) = N_C(x) = \{ g : g^\top(y - x) \leq 0 \text{ for all } y \in C \} $

— the normal cone to $C$ at the point $x$.

Optimality Condition via Subdifferential

Theorem: a point $x^*$ is a global minimum of a convex function $f$ if and only if:

$ 0 \in \partial f(x^*) $

Meaning: zero must be a subgradient. If $f$ is differentiable, this is the standard condition $\nabla f(x^*) = 0$. For nondifferentiable functions: zero must lie in the "fan" of subgradients.

Example: for $f(x) = |x|$: $0 \in \partial|x^*| \iff x^* = 0$ (because only at $x = 0$ does the subdifferential contain 0).

Subdifferentiation Rules

Sum: $\partial(f + g)(x) \supseteq \partial f(x) + \partial g(x)$. Under regularity conditions (e.g., one of the functions is continuous at a point of the common domain), equality holds.

Affine substitution of argument: if $h(x) = f(Ax + b)$, then

$ \partial h(x) = A^\top \partial f(Ax + b) $
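
Example: for $h(x) = |2x - 1|$ (here $A = 2$, $b = -1$, $f = |\cdot|$), the rule gives $\partial h(x) = 2\,\partial|\cdot|(2x - 1)$, so at the kink $x = 1/2$ the subdifferential is $2 \cdot [-1, 1] = [-2, 2]$.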

Proximal Operator

A subgradient is a "direction" to move in. But the subgradient method is slow ($O(1/\sqrt{k})$). The proximal operator provides a more efficient alternative:

$ \operatorname{prox}_{\tau f}(x) = \arg\min_y \left\{ f(y) + \frac{1}{2\tau} \|y - x\|^2 \right\} $

Meaning: "move toward the minimum of $f

quot;, but not too far from the current point $x$. Parameter $\tau$ is the step size.

Examples of proximal operator calculation:

For $f = \|\cdot\|_1$, the proximal operator with parameter $\tau$ is the soft-thresholding function:

$ \operatorname{prox}_{\tau\|\cdot\|_1}(x) = \operatorname{sign}(x) \cdot \max(|x| - \tau,\ 0) $

— elementwise.

This is "soft thresholding": it zeros out components with $|x_i| \leq \tau$, and shifts the rest toward zero by $\tau$.

For $f(x) = \delta_C(x)$ (indicator):

$ \operatorname{prox}_{\delta_C}(x) = P_C(x) $

— just projection onto $C$.
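
For example, if $C$ is the box $\{x : lo \leq x \leq hi\}$, the projection is elementwise clipping. A minimal sketch (the box $C$ is our choice of illustration):

```python
# Projection onto the box C = {x : lo <= x <= hi} is elementwise clipping.
import numpy as np

def project_box(x, lo, hi):
    return np.clip(x, lo, hi)

print(project_box(np.array([-2.0, 0.3, 5.0]), lo=0.0, hi=1.0))
# [0.  0.3 1. ]
```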

Worked Example: LASSO

Problem: $\min_{x \in \mathbb{R}^n} F(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$

Optimality condition: $0 \in \partial F(x^*) = \partial\left(\frac{1}{2}\|Ax^* - b\|^2\right) + \lambda\,\partial\|x^*\|_1$

The first subdifferential is the gradient of the smooth part: $A^\top(Ax^* - b)$. The second is $\lambda \cdot \partial\|x^*\|_1$.

So: $-A^\top(Ax^* - b) \in \lambda\, \partial\|x^*\|_1$

Componentwise: if $x^*_i \neq 0$, then $[A^\top(Ax^* - b)]_i = -\lambda \operatorname{sign}(x^*_i)$. If $x^*_i = 0$, then $|[A^\top(Ax^* - b)]_i| \leq \lambda$.

The ISTA algorithm (Iterative Shrinkage-Thresholding Algorithm):

$ x^{k+1} = \operatorname{prox}_{\tau \lambda\|\cdot\|_1}\left(x^k - \tau A^\top(Ax^k - b)\right) = \operatorname{softthresh}\left(x^k - \tau A^\top(Ax^k - b),\ \tau\lambda\right) $

At each step: gradient step on the smooth part + proximal on the nonsmooth.
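
A minimal self-contained ISTA sketch (our own illustration, not a library implementation; the synthetic data and the choice $\tau = 1/\|A\|_2^2$ are assumptions for the demo):

```python
# Minimal ISTA sketch for LASSO: min_x 0.5*||Ax - b||^2 + lam*||x||_1.
import numpy as np

def soft_threshold(x, tau):
    # Elementwise soft thresholding: prox of tau * ||.||_1.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(A, b, lam, n_iters=1000):
    tau = 1.0 / np.linalg.norm(A, 2) ** 2  # safe step: 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)                       # gradient step on the smooth part
        x = soft_threshold(x - tau * grad, tau * lam)  # prox step on the L1 part
    return x

# Demo on synthetic data with a sparse ground truth (made up for illustration).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[[3, 17, 42]] = [1.5, -2.0, 1.0]
b = A @ x_true
x_hat = ista(A, b, lam=0.1)
print(np.nonzero(np.abs(x_hat) > 1e-2)[0])  # approximately the true support {3, 17, 42}
```

The step size $\tau = 1/\|A\|_2^2$ is the standard safe choice here, since $\|A\|_2^2$ is the Lipschitz constant of the gradient of the smooth part.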

Applications

Feature selection: LASSO regression with an L1 penalty automatically zeros out unnecessary coefficients. This is "automatic variable selection". In medical data with thousands of genes, LASSO selects the few dozen most significant ones.

Compressed signal recovery: An MRI scanner takes 10 times fewer measurements, using L1 minimization to recover a sparse image. The subdifferential of the L1 norm is key to understanding why this works.
