
Sparsity and Compressed Sensing Theory

High-Dimensional Statistics


The Theory of Sparsity and Compressed Measurements

Imagine: you want to record an MRI image, but the patient has little time, so you cannot obtain all the measurements. Or you need to transmit a signal through a channel with limited bandwidth. The theory of compressed sensing (Candès, Romberg, Tao; Donoho, 2006) says: if the signal is "sparse", a small number of linear measurements is enough for its exact recovery.

Main Theorem of Compressed Sensing

Problem Statement: $x \in \mathbb{R}^n$ is the signal (for example, an image with $n$ pixels). $A \in \mathbb{R}^{m \times n}$, $m \ll n$, is the measurement matrix. We observe $y = Ax$: only $m \ll n$ measurements, an obviously underdetermined system. Goal: recover $x$.

Sparsity Condition: $x$ has at most $s$ non-zero components: $||x||_0 = s \ll n$. Most signals are sparse in some basis: images in the wavelet basis, audio in the Fourier basis.

Theorem (Candès, 2006): If $A$ satisfies the RIP$(2s, \delta)$ condition:

$(1-\delta)||x||_2^2 \leq ||Ax||_2^2 \leq (1+\delta)||x||_2^2 \quad \text{for all } 2s\text{-sparse } x$

then the solution $\hat{x}$ of the Basis Pursuit Denoising problem:

$ \hat{x} = \arg\min ||x||_1 \quad \text{subject to} \quad ||Ax - y||_2 \leq \epsilon $

satisfies: $||\hat{x} - x||_2 \leq C_1 \sigma_s(x) / \sqrt{s} + C_2 \epsilon$

Here $||x||_1 = \sum_i |x_i|$ is the L1 norm, and $\sigma_s(x)$ is the error of the best $s$-sparse approximation (measured in the L1 norm). The key point: the L1 minimization problem (convex!) solves the seemingly impossible problem of recovery from $m \ll n$ measurements.

Why L1? The L0 "norm" (count of non-zeros) leads to an NP-hard optimization problem. The L2 norm finds the least-squares solution, which is not sparse. L1 is the tightest convex relaxation of L0: it promotes sparsity while allowing convex programming.
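
To make the relaxation concrete, here is a minimal sketch (our illustration, not from the original): noiseless basis pursuit ($\epsilon = 0$) becomes a linear program after splitting $x = u - v$ with $u, v \geq 0$, solvable with scipy.optimize.linprog. All sizes below are assumed for the example.

```python
# Noiseless basis pursuit: min ||x||_1 subject to Ax = y, as a linear program.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, s = 200, 60, 8                      # illustrative toy sizes

# s-sparse ground truth and a Gaussian measurement matrix
x_true = np.zeros(n)
x_true[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x_true

# Split x = u - v with u, v >= 0, so min ||x||_1 becomes min sum(u + v)
# under the linear constraint A(u - v) = y.
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None))
x_hat = res.x[:n] - res.x[n:]
print("recovery error:", np.linalg.norm(x_hat - x_true))
```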

Random Matrices Satisfy the RIP

If $A_{ij} \sim N(0, 1/m)$ (a Gaussian random matrix), then $A$ satisfies RIP$(s, \delta)$ with probability $\geq 1-\epsilon$ (the constant $C$ below depends on $\epsilon$) when:

$ m \geq C \cdot s \cdot \log(n/s) / \delta^2 $

Example: $n = 10000$, $s = 100$ $\rightarrow$ $s \cdot \log(n/s) = 100 \cdot \log(100) \approx 460$, so in practice $m \approx 500$ measurements are enough instead of $n = 10000$! (The worst-case $\delta^{-2}$ factor inflates the theoretical bound, but moderate constants work empirically.)

Physical meaning: a random projection "sees" all $s$-sparse vectors at once (cf. the Johnson–Lindenstrauss lemma). Other matrices satisfying the RIP: the partial Fourier matrix (with randomly selected rows) and the randomly subsampled Hadamard matrix.
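
As a sanity check (illustrative only, not a proof: the RIP quantifies over all sparse vectors, not a random sample), one can verify empirically that a Gaussian matrix with the sizes from the example above approximately preserves the norm of random $s$-sparse vectors:

```python
# Empirical norm preservation for random s-sparse vectors under A ~ N(0, 1/m).
import numpy as np

rng = np.random.default_rng(1)
n, m, s = 10_000, 500, 100                     # sizes from the example above

A = rng.standard_normal((m, n)) / np.sqrt(m)   # A_ij ~ N(0, 1/m)

ratios = []
for _ in range(200):
    x = np.zeros(n)
    x[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
    ratios.append(np.linalg.norm(A @ x) / np.linalg.norm(x))

print(f"||Ax|| / ||x|| over random s-sparse x: "
      f"min={min(ratios):.3f}, max={max(ratios):.3f}")   # both close to 1
```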

Recovery Algorithms

LASSO (Tibshirani, 1996): the penalized form equivalent to BPDN, standard in regression: $\min_x \frac{1}{2}||y - Ax||_2^2 + \lambda||x||_1$. There is no closed-form solution (except for orthogonal $A$), but there are efficient algorithms.
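
A sketch of solving the LASSO with scikit-learn (sizes and the regularization value are illustrative choices). Note that sklearn's Lasso minimizes $\frac{1}{2m}||y - Ax||_2^2 + \alpha||x||_1$, so its $\alpha$ corresponds to $\lambda/m$ in the formula above.

```python
# LASSO recovery of a sparse vector with scikit-learn.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
m, n, s = 60, 200, 8
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
y = A @ x_true

lasso = Lasso(alpha=1e-3, max_iter=50_000, fit_intercept=False)
lasso.fit(A, y)
print("non-zeros found:", np.count_nonzero(lasso.coef_),
      "error:", np.linalg.norm(lasso.coef_ - x_true))
```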

ISTA (Iterative Shrinkage-Thresholding Algorithm): uses the proximal operator of the L1 norm: $ \text{prox}_{\alpha\lambda||\cdot||_1}(v)_i = \text{sign}(v_i) \cdot \max(|v_i| - \alpha\lambda, 0) \quad \text{(soft thresholding)} $

Step: $x_{t+1} = \text{prox}_{\alpha\lambda||\cdot||_1}(x_t - \alpha A^T(Ax_t - y))$, where $\alpha$ is the step size. Convergence: $O(1/t)$ in objective value.

FISTA: Nesterov-accelerated version of ISTA: $O(1/t^2)$.
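
The two steps above fit in a dozen lines of NumPy. A compact sketch (our illustration) of ISTA with optional Nesterov acceleration (FISTA), using the standard step size $\alpha = 1/L$ with $L = ||A||_2^2$:

```python
# ISTA / FISTA for the LASSO objective (1/2)||y - Ax||_2^2 + lam * ||x||_1.
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (elementwise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, y, lam, n_iter=500, accelerate=False):
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # step 1/L, L = spectral norm^2
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0                      # FISTA momentum state
    for _ in range(n_iter):
        grad = A.T @ (A @ z - y)              # gradient of the smooth part
        x_new = soft_threshold(z - alpha * grad, alpha * lam)
        if accelerate:                        # Nesterov momentum (FISTA)
            t_new = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
            z = x_new + ((t - 1) / t_new) * (x_new - x)
            t = t_new
        else:
            z = x_new
        x = x_new
    return x

# usage (with A, y from any of the examples above):
# x_hat = ista(A, y, lam=0.05, accelerate=True)
```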

Orthogonal Matching Pursuit (OMP): a greedy algorithm: at each step it selects the column of $A$ most correlated with the current residual. Faster, but with weaker theoretical guarantees.
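
scikit-learn ships an OMP implementation; a short usage sketch with illustrative sizes:

```python
# Greedy sparse recovery with scikit-learn's OMP.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(3)
m, n, s = 60, 200, 8
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
y = A @ x_true

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=s, fit_intercept=False)
omp.fit(A, y)
print("error:", np.linalg.norm(omp.coef_ - x_true))
```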

Applications

MRI: k-space (frequency measurements) is randomly undersampled (e.g., Poisson-disc sampling) $\rightarrow$ LASSO reconstruction $\rightarrow$ 4–8× less time in the machine. The first clinical implementations appeared in 2011.

Single-Pixel Camera: random linear combinations of pixels are measured with a single photodiode $\rightarrow$ CS reconstruction. This enables imaging in ranges where a pixel array is expensive (infrared, terahertz).

Genomics: from $m = 500$ lab tests we recover the full genotype of $n = 10000$ SNPs, because each person carries a sparse set of mutations.

Structural Sparsity: Group LASSO and Total Variation

Group LASSO: features are split into groups $g$. Penalty: $\lambda\sum_g ||\beta_g||_2$. Groups are selected "entirely": either all features in the group are non-zero, or all are zero. Applications: time series (groups of lags), neural networks (structured pruning: removing entire attention heads).
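
The group-LASSO proximal operator is block soft thresholding: each group's L2 norm is shrunk, and a group whose norm falls below the threshold is zeroed entirely. A small illustrative sketch (our own, with assumed group layout):

```python
# Block soft thresholding: the proximal operator of tau * sum_g ||beta_g||_2.
import numpy as np

def group_soft_threshold(beta, groups, tau):
    """`groups` maps a group id to the indices of its features."""
    out = beta.copy()
    for idx in groups.values():
        norm = np.linalg.norm(beta[idx])
        scale = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
        out[idx] = scale * beta[idx]          # whole group survives or dies
    return out

beta = np.array([0.1, -0.2, 0.05, 2.0, -1.5, 0.8])
groups = {0: [0, 1, 2], 1: [3, 4, 5]}         # two groups of three features
print(group_soft_threshold(beta, groups, tau=0.5))
# group 0 (small norm) is zeroed entirely; group 1 is shrunk but kept
```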

Total Variation (TV) denoising: $y \in \mathbb{R}^n$ is a noisy signal. $\min_u ||y - u||_2^2 + \lambda \sum_i |u_{i+1} - u_i|$. This preserves jumps (edges) and smooths noise within flat regions. Standard in image processing.
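
A hedged example using scikit-image's denoise_tv_chambolle (one standard TV solver) on a piecewise-constant 1-D signal; its weight parameter plays the role of $\lambda$, and the values below are illustrative:

```python
# TV denoising of a piecewise-constant signal: edges survive, noise is smoothed.
import numpy as np
from skimage.restoration import denoise_tv_chambolle

rng = np.random.default_rng(4)
clean = np.concatenate([np.zeros(100), np.ones(100), 0.3 * np.ones(56)])
noisy = clean + 0.1 * rng.standard_normal(clean.size)

denoised = denoise_tv_chambolle(noisy, weight=0.2)
print("noisy error:   ", np.linalg.norm(noisy - clean))
print("denoised error:", np.linalg.norm(denoised - clean))
```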

Numerical Example

$n = 256$ (signal length), $s = 10$ (sparsity in the Fourier basis), $m = 60$ random measurements, Gaussian matrix $A$ ($60 \times 256$). Signal: $x$ with $10$ non-zero frequencies. BPDN recovers $x$ with error $||\hat{x} - x||_2 < 0.01$ at $\epsilon = 0$ (no noise). With noise $\sigma = 0.01$: $||\hat{x} - x||_2 \approx 0.08$. The naive pseudoinverse yields $||\hat{x}_{\text{lstsq}} - x||_2 \approx 2.1$, about 26 times worse.
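
A sketch reproducing the spirit of the noiseless part of this experiment (exact error values depend on the random draw; the sparse coefficient vector is recovered directly, standing in for the Fourier coefficients):

```python
# Basis pursuit vs. naive pseudoinverse on the n=256, m=60, s=10 setup.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
n, m, s = 256, 60, 10

x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x_true

# Basis pursuit as a linear program (epsilon = 0), as in the sketch earlier
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None))
x_bp = res.x[:n] - res.x[n:]

x_lstsq = np.linalg.pinv(A) @ y               # naive pseudoinverse baseline
print("BP error:    ", np.linalg.norm(x_bp - x_true))
print("lstsq error: ", np.linalg.norm(x_lstsq - x_true))
```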

Assignment: Implement CS for an audio signal: generate a signal = sum of 5 sine waves, $n=1024$ points. Create a random measurement matrix $A$ ($m \times 1024$, $m=100$). Recover via LASSO (sklearn). Plot: (1) original spectrum; (2) recovered spectrum; (3) recovery error vs $m$ (from 50 to 500). At which $m$ does the error drop below 1%?
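
A starter skeleton for the assignment (the frequencies, alpha, and dictionary construction are assumptions; the sweep over $m$ and the plots are left as the exercise):

```python
# Assignment skeleton: sparse-in-Fourier audio signal, random measurements.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m = 1024, 100
t = np.arange(n)
freqs = [17, 43, 80, 151, 310]                # assumed 5 frequency bins
signal = sum(np.sin(2 * np.pi * f * t / n) for f in freqs)

# Real Fourier dictionary: columns are cosines and sines, signal = D @ c
k = np.arange(n // 2 + 1)
D = np.hstack([np.cos(2 * np.pi * np.outer(t, k) / n),
               np.sin(2 * np.pi * np.outer(t, k) / n)])

A = rng.standard_normal((m, n)) / np.sqrt(m)  # random measurement matrix
y = A @ signal

lasso = Lasso(alpha=1e-3, max_iter=100_000, fit_intercept=False)
lasso.fit(A @ D, y)                           # recover sparse coefficients c
x_hat = D @ lasso.coef_                       # reconstructed signal
print("relative error:",
      np.linalg.norm(x_hat - signal) / np.linalg.norm(signal))
# TODO: plot the true vs. recovered spectra and sweep m from 50 to 500
# to find where the relative error drops below 1%.
```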
