Module IV · Article I
Regularization: Lasso, Ridge, Elastic Net
Applications in Machine Learning
The Overfitting Problem and Why Regularization Is Needed
Imagine you are building a model to predict apartment prices using 1000 features, but have only 100 observations. Without constraints, the model can “memorize” the training data (overfitting), showing zero error on it but terrible error on new data. Regularization is the addition of a penalty term to the loss function, which “restrains” the model parameters. Convex analysis explains why different types of regularization have fundamentally different effects: L2 (Ridge) shrinks, L1 (Lasso) zeros out.
Ridge Regression (L2 Regularization)
Problem: min_{x ∈ ℝⁿ} ‖Ax − b‖² + λ‖x‖²
Here, A ∈ ℝ^{m×n} is the feature matrix, b ∈ ℝᵐ are the responses, λ > 0 is the regularization parameter.
Closed-form solution: Take the derivative with respect to x, set it to zero:
2Aᵀ(Ax − b) + 2λx = 0 → (AᵀA + λI)x = Aᵀb → x_R = (AᵀA + λI)⁻¹Aᵀb
Importantly, the matrix AᵀA + λI is always invertible for λ > 0, even if AᵀA is singular! This solves the problem of multicollinearity.
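A minimal NumPy sketch of this closed-form solution; the random A, b, and the value of λ below are illustrative placeholders mirroring the 1000-features / 100-observations setup above:

```python
import numpy as np

def ridge_closed_form(A, b, lam):
    """Solve min ||Ax - b||^2 + lam*||x||^2 via (A^T A + lam*I) x = A^T b."""
    n = A.shape[1]
    # For lam > 0 the matrix A^T A + lam*I is positive definite,
    # so the system is solvable even when A^T A is singular.
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

# Illustrative data: more features than observations, so A^T A is singular.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 1000))
b = rng.normal(size=100)
x_ridge = ridge_closed_form(A, b, lam=10.0)
print(x_ridge.shape)  # (1000,) -- all coefficients small but nonzero
```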
Effect: all coefficients are shrunk toward zero (in the singular-value basis of A, each component is multiplied by σᵢ²/(σᵢ² + λ)), but none are set exactly to zero. This is good for stabilization (reducing variance), but bad for feature selection.
Bayesian interpretation: Ridge = MAP estimate with Gaussian prior p(x) ∝ exp(−λ‖x‖²/2). Assumption: all coefficients are “roughly zero” with the same variance.
Lasso (L1 Regularization)
Problem: min_{x ∈ ℝⁿ} ‖Ax − b‖² + λ‖x‖₁
There is no closed-form solution in the general case (due to the nonsmoothness of ‖x‖₁). But the problem is convex, so every local minimum is global, and it can be solved efficiently (for example, by coordinate descent or proximal gradient methods).
Key effect: sparsity. For sufficiently large λ, many xᵢ* = 0 exactly! This is not an approximation — it is an exact zero.
Geometric explanation: in the equivalent constrained form min ‖Ax−b‖² subject to ‖x‖₁ ≤ t, the level sets of the loss ‖Ax−b‖² are ellipsoids centered at the least-squares solution, while the constraint set {x : ‖x‖₁ ≤ t} is a “diamond” (in 2D) with corners on the axes. Growing the ellipsoids outward from the least-squares solution, the first contact with the diamond typically occurs at a corner → xᵢ = 0 for some i.
Bayesian interpretation: Lasso = MAP with double exponential (Laplace) prior p(xᵢ) ∝ exp(−λ|xᵢ|). This prior is “sharper” at zero — it more actively zeros out small coefficients.
Practical example: genomic data may contain 20,000 candidate genes, of which only about 50 are truly relevant. Lasso automatically selects roughly these 50, zeroing out the rest; Ridge gives 20,000 small nonzero coefficients, which is impossible to interpret.
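A hedged sketch of this contrast on synthetic data (the sizes and alpha values below are arbitrary, not tuned):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic "wide" data: many candidate features, few of them truly relevant.
rng = np.random.default_rng(0)
n_samples, n_features, n_relevant = 200, 1000, 10
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:n_relevant] = rng.uniform(1.0, 3.0, size=n_relevant)
y = X @ true_coef + 0.1 * rng.normal(size=n_samples)

lasso = Lasso(alpha=0.1, max_iter=10_000).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("Lasso nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))  # a handful
print("Ridge nonzero coefficients:", int(np.sum(ridge.coef_ != 0)))  # essentially all 1000
```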
Elastic Net: The Best of Both Worlds
Problem: min ‖Ax−b‖² + λ₁‖x‖₁ + λ₂‖x‖²
Combines L1 and L2 with parameters λ₁ and λ₂.
Advantages:
- Sparsity from L1: some coefficients are set to zero
- Stability from L2: under multicollinearity (groups of similar features), Lasso arbitrarily picks one feature from a correlated group, while Elastic Net tends to keep the group together
- Unique solution compared to pure Lasso: the L2 term makes the objective strictly convex (although the solution is still computed iteratively)
When to use: when there are groups of correlated features and you want both sparsity and robustness.
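A brief scikit-learn sketch of the grouping effect; note that sklearn's ElasticNet is parameterized by alpha and l1_ratio rather than separate λ₁ and λ₂, and the data and values below are illustrative:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Two groups of strongly correlated features plus irrelevant noise columns.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=(n, 2))
X = np.hstack([
    base[:, [0]] + 0.01 * rng.normal(size=(n, 3)),  # three near-duplicates of signal 1
    base[:, [1]] + 0.01 * rng.normal(size=(n, 3)),  # three near-duplicates of signal 2
    rng.normal(size=(n, 10)),                       # irrelevant features
])
y = base @ np.array([2.0, -1.5]) + 0.1 * rng.normal(size=n)

# sklearn objective: 1/(2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1
#                    + 0.5*alpha*(1 - l1_ratio)*||w||^2
model = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)
print(np.round(model.coef_, 2))  # correlated columns tend to share weight;
                                 # irrelevant columns are driven to zero
```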
Compressed Sensing
Problem: recover x ∈ ℝⁿ from measurements b = Ax where A ∈ ℝ^{m×n} and m << n (fewer measurements than unknowns!).
It would seem the problem is underdetermined — infinitely many solutions. But if x is sparse (only s << n nonzero components), the problem is solvable!
L0-minimization: min ‖x‖₀ subject to Ax = b (‖x‖₀ = number of nonzero components). This is an NP-hard combinatorial problem.
L1-relaxation (Candès, Romberg, Tao, 2004): min ‖x‖₁ subject to Ax = b — a convex problem!
RIP theorem: if the matrix A satisfies the RIP (Restricted Isometry Property) with constant δ_{2s} < √2 − 1, then L1-minimization exactly recovers any s-sparse x.
RIP meaning: A almost preserves the norm of sparse vectors. Random matrices (Gaussian, Bernoulli) satisfy RIP with probability →1 when m ≥ O(s log(n/s)).
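A hedged sketch of this recovery via a linear-programming reformulation of basis pursuit (x is split into nonnegative parts p and q with x = p − q); the dimensions m, n, s below are arbitrary:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, s = 40, 100, 5                       # far fewer measurements than unknowns
A = rng.normal(size=(m, n)) / np.sqrt(m)   # random Gaussian matrix (RIP holds w.h.p.)
x_true = np.zeros(n)
x_true[rng.choice(n, size=s, replace=False)] = rng.normal(size=s)
b = A @ x_true

# min ||x||_1 subject to Ax = b, rewritten as an LP in (p, q) >= 0:
#   min 1^T p + 1^T q   subject to   [A, -A] [p; q] = b
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=b, bounds=(0, None))
x_rec = res.x[:n] - res.x[n:]
print("recovery error:", np.linalg.norm(x_rec - x_true))  # close to zero
```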
Full Breakdown: Lasso on a Numerical Example
Data: A = [[1, 2], [3, 4], [5, 6]], b = [1, 2, 3], λ = 0.5.
Step 1: For this walkthrough we minimize ½‖Ax − b‖² + λ‖x‖₁ (the factor ½ only rescales λ), so ∇f(x) = Aᵀ(Ax − b) and the Lipschitz constant of the gradient is L = λ_max(AᵀA). Here AᵀA = [[35, 44], [44, 56]], λ_max ≈ 90.7. ISTA step size: τ = 1/L ≈ 0.011.
Step 2: Initial point x⁰ = [0, 0].
Iteration 1:
- Gradient: ∇f(x⁰) = Aᵀ(Ax⁰ − b) = −Aᵀb = [−22, −28]
- Gradient step: z = x⁰ − τ∇f = [0.24, 0.31]
- Soft-thresholding with threshold τλ ≈ 0.0055, where S_θ(z) = sign(z)·max(|z| − θ, 0): x¹ ≈ [0.237, 0.303]
After convergence: x* ≈ [0, 0.49]: the first coefficient is driven exactly to zero by the L1 penalty (a larger λ would zero out even more).
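A compact sketch of the full ISTA loop for this example, under the same ½‖Ax−b‖² convention used in the walkthrough:

```python
import numpy as np

def soft_threshold(z, theta):
    """Proximal operator of theta*||.||_1: sign(z) * max(|z| - theta, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

def ista(A, b, lam, n_iter=5000):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 by proximal gradient descent."""
    L = np.linalg.eigvalsh(A.T @ A).max()  # Lipschitz constant of the gradient (~90.7 here)
    tau = 1.0 / L                          # step size (~0.011 here)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)           # gradient of the smooth part
        x = soft_threshold(x - tau * grad, tau * lam)
    return x

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
print(ista(A, b, lam=0.5))  # approximately [0.0, 0.49]
```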
Practical Applications
Medicine: Lasso regression on ECG data selects 5–10 significant rhythmic features out of 500. Ridge regression retains all features with small coefficients.
MRI: Compressed sensing allows scans to be done 8–10 times faster (fewer measurements) by leveraging sparsity of medical images in frequency space.
Finance: Elastic Net for selecting equity return factors from thousands of candidates under multicollinearity.