
High-Dimensional Geometry and Concentration

High-Dimensional Statistics


High-Dimensional Geometry

Most of our geometric intuitions were formed in 2D and 3D. But data about people, texts, and molecules live in spaces of dimension $d = 100$, $1000$, $100000$. In these spaces everything works differently, and understanding the “curse of dimensionality” is crucial for the ML practitioner.

High-Dimensional Paradoxes

Volume of the “shell”: In $\mathbb{R}^{d}$ the volume of the ball of radius $r$ is $\operatorname{vol}(B^d(r)) = \pi^{d/2} r^d/\Gamma(d/2+1)$. As $d \to \infty$ almost all of this volume is concentrated in a thin layer near the surface: the fraction of the volume within distance $\epsilon r$ of the boundary is $1-(1-\epsilon)^d \to 1$ for any fixed $\epsilon > 0$. Likewise, for $X \sim \text{Uniform}(B^d(r))$, $E[|X|] = r \cdot d/(d+1) \to r$ and $\operatorname{Var}(|X|) \to 0$.

Practical implication: all random points in the ball are located approximately at the same distance from the center $\to$ distance from the center is not a distinguishing feature.
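A quick way to see this concentration numerically is to sample points uniformly from the ball; the sketch below uses NumPy, and the dimensions and sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_uniform_ball(n, d, r=1.0):
    """n points drawn uniformly from the d-dimensional ball of radius r."""
    g = rng.standard_normal((n, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform random directions
    radii = r * rng.uniform(size=(n, 1)) ** (1.0 / d)          # radius ~ U^(1/d) matches the volume density
    return directions * radii

for d in [2, 10, 100, 1000]:
    norms = np.linalg.norm(sample_uniform_ball(10_000, d), axis=1)
    print(f"d={d:5d}  mean |X| = {norms.mean():.3f}  std = {norms.std():.4f}")
# As d grows, the mean distance approaches the radius and the spread shrinks.
```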

Random vectors are almost orthogonal: For independent $X, Y \sim N(0, I_d)$ the cosine $\cos(\angle(X,Y)) = X^{\top} Y/(|X|\,|Y|)$ is approximately $N(0, 1/d)$. For $d = 1000$ the standard deviation of the cosine is $\approx 1/\sqrt{1000} \approx 0.03$, i.e. the angle deviates from $90^{\circ}$ by only about $1.8^{\circ}$. Almost all random vectors are nearly orthogonal, which is both good (it is easy to embed many nearly independent directions) and bad (distance-based clustering becomes less informative).
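A minimal simulation of this near-orthogonality (the dimension and the number of pairs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_pairs = 1000, 10_000

X = rng.standard_normal((n_pairs, d))
Y = rng.standard_normal((n_pairs, d))
cos = (X * Y).sum(axis=1) / (np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1))

print(f"std of cosine: {cos.std():.4f}   (theory ~ 1/sqrt(d) = {1/np.sqrt(d):.4f})")
print(f"typical deviation of the angle from 90 degrees: {np.degrees(cos.std()):.2f} deg")
```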

Measure concentration: A Lipschitz function $F:\mathbb{R}^d \to \mathbb{R}$ (with $|\nabla F| \leq L$) is almost constant with respect to the standard Gaussian measure in high dimensions:

$ P(|F(X) - EF(X)| \geq t) \leq 2 \exp(-t^2/(2L^2)) $

Practically: a single realization of $F(X)$ is almost always close to its expectation $E F(X)$. Example: the norm $|X|$ for $X \sim N(0, I_d)$ concentrates at $\sqrt{d}$ with fluctuations of constant order $O(1)$, i.e. a relative deviation of $O(1/\sqrt{d})$.
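The same effect is easy to check numerically for the norm of a standard Gaussian vector:

```python
import numpy as np

rng = np.random.default_rng(2)
for d in [10, 100, 1000, 10_000]:
    norms = np.linalg.norm(rng.standard_normal((5_000, d)), axis=1)
    print(f"d={d:6d}  mean |X| = {norms.mean():8.2f}  sqrt(d) = {np.sqrt(d):8.2f}  std = {norms.std():.3f}")
# The absolute fluctuation stays O(1) while the mean grows like sqrt(d),
# so the relative deviation shrinks as 1/sqrt(d).
```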

Johnson–Lindenstrauss Lemma

Theorem (Johnson–Lindenstrauss, 1984): For $n$ points in $\mathbb{R}^{d}$ and any $\epsilon \in (0, 1/2)$, a random linear projection to $\mathbb{R}^{k}$ with $k = O(\log n/\epsilon^2)$ preserves all pairwise distances up to a factor of $(1 \pm \epsilon)$ with high probability.

Random projection matrix: $R \in \mathbb{R}^{k \times d}$ with $R_{ij} \sim N(0, 1/k)$. For each pair $(i,j)$: $\big|\,|R(x_i - x_j)|^2 - |x_i - x_j|^2\big| \leq \epsilon\,|x_i - x_j|^2$ with probability $\geq 1 - 2 \exp(-c\, \epsilon^2 k)$.
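A sketch of this random-projection check; the dimensions are scaled down from the TF-IDF example below so it runs quickly, and the synthetic Gaussian data simply stand in for real feature vectors:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k, eps = 200, 5_000, 1_000, 0.2        # scaled-down sizes, purely illustrative

X = rng.standard_normal((n, d))              # stand-in for real feature vectors
R = rng.standard_normal((d, k)) / np.sqrt(k) # entries R_ij ~ N(0, 1/k)
Z = X @ R                                    # projected data, n x k

# Relative distortion of squared pairwise distances on a random sample of pairs.
pairs = rng.choice(n, size=(1_000, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]
orig = np.linalg.norm(X[pairs[:, 0]] - X[pairs[:, 1]], axis=1) ** 2
proj = np.linalg.norm(Z[pairs[:, 0]] - Z[pairs[:, 1]], axis=1) ** 2
distortion = np.abs(proj / orig - 1.0)
print(f"median distortion: {np.median(distortion):.3f}, max: {distortion.max():.3f} (target eps = {eps})")
# With k = O(log n / eps^2) the maximum distortion stays below eps with high probability;
# increasing k tightens the guarantee.
```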

Example: 10000 documents in a TF-IDF space with $d = 50000$ $\to$ a JL projection to roughly $k \approx 1700$ dimensions (taking $k = O(\log 10000/\epsilon^2)$ with $\epsilon = 0.1$) preserves all pairwise distances to within 10%.

Applications: LSH (Locality Sensitive Hashing) — fast approximate nearest neighbor search, feature compression for ML, acceleration of matrix operations.

Principal Component Analysis (PCA)

PCA solves the problem: find $k$ orthogonal directions $v_1, ..., v_k$ that maximize the variance of data projections.

Solution: SVD of the centered data matrix $X$ ($n \times p$): $X = U \Sigma V^{\top}$. The first $k$ columns of $V$ are the principal components (PCs). Projection: $X \cdot V_{1:k} \in \mathbb{R}^{n \times k}$. Explained variance: $\sum_{i\le k} \sigma_i^2 / \sum_j \sigma_j^2$.
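A compact NumPy sketch of this recipe (centering, SVD, projection, explained variance); the function and variable names here are my own:

```python
import numpy as np

def pca_svd(X, k):
    """PCA of an (n x p) data matrix via SVD of the centered data."""
    Xc = X - X.mean(axis=0)                         # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                             # first k rows of V^T = top k directions
    scores = Xc @ components.T                      # projections of the data, n x k
    explained = (s[:k] ** 2).sum() / (s ** 2).sum() # fraction of variance explained
    return components, scores, explained

# Example: 3 principal components of random data.
X = np.random.default_rng(0).standard_normal((100, 20))
_, _, ev = pca_svd(X, 3)
print(f"explained variance by 3 PCs: {ev:.2%}")
```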

Geometrically: PCA rotates the coordinate system so that the greatest variance is along the first axis, the next greatest along the second axis, and so on.

Random matrices and PCA: For $p/n \to c > 0$ (many features, not so many observations) the spectrum of the sample covariance matrix of pure noise follows the Marchenko–Pastur law: its eigenvalues concentrate in the interval $[(1-\sqrt{c})^2, (1+\sqrt{c})^2]$, the “sea of noise”. Signal components above $(1+\sqrt{c})^2$ protrude as “spikes”.

The BBP phase transition (Baik, Ben Arous, Péché): a principal component can be reliably estimated only if its signal strength $\lambda$ exceeds $\sqrt{c}$, equivalently only if its sample eigenvalue separates from the noise bulk and rises above $(1+\sqrt{c})^2$. Below this threshold the estimated PC is asymptotically orthogonal to the true direction!
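The following simulation sketches both effects: a pure-noise spectrum stays inside the Marchenko–Pastur bulk, a weak planted spike is swallowed by it, and a strong spike separates. The spike strengths here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 2000, 1000                      # c = p/n = 0.5
c = p / n
edge = (1 + np.sqrt(c)) ** 2           # upper edge of the MP bulk

def top_eigenvalues(spike, k=3):
    """Top eigenvalues of the sample covariance with one planted rank-one spike."""
    v = np.zeros(p); v[0] = 1.0                                 # true signal direction
    X = rng.standard_normal((n, p))                             # unit-variance noise
    X += np.sqrt(spike) * rng.standard_normal((n, 1)) * v       # rank-one signal of strength `spike`
    eig = np.linalg.eigvalsh(X.T @ X / n)
    return np.sort(eig)[::-1][:k]

print(f"MP bulk edge: {edge:.2f}")
print("no spike           :", np.round(top_eigenvalues(0.0), 2))
print("weak spike  (0.3)  :", np.round(top_eigenvalues(0.3), 2))   # below sqrt(c) ~ 0.71
print("strong spike (2.0) :", np.round(top_eigenvalues(2.0), 2))   # above the threshold
```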

Robust PCA and Sparse PCA

Robust PCA (Candès, Li, Ma, Wright, 2011): Given $X = L + S$, where $L$ is a low-rank matrix and $S$ is sparse (outliers), the task is to decompose $X$. Algorithm: the intractable $\min \operatorname{rank}(L) + \lambda |S|_0$ is relaxed to $\min |L|_* + \lambda |S|_1$ subject to $L + S = X$ (nuclear norm + $\ell_1$ norm). This accurately recovers $L$ and $S$ under certain conditions. Application: separating “background + moving object” in video.
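Below is a minimal sketch of the augmented-Lagrangian (inexact ALM) scheme commonly used to solve this relaxation, alternating singular-value thresholding for $L$ and soft thresholding for $S$. The step-size heuristics (`mu`, `rho`) and the stopping tolerance are assumptions on my part, not the exact settings from the paper; the weight $\lambda = 1/\sqrt{\max(m,n)}$ is the one recommended by Candès et al.

```python
import numpy as np

def soft_threshold(A, tau):
    """Entrywise shrinkage (proximal operator of the l1 norm)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svd_threshold(A, tau):
    """Singular-value shrinkage (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def rpca(X, lam=None, n_iter=200, rho=1.5):
    """Decompose X into low-rank L + sparse S via an augmented-Lagrangian scheme."""
    m, n = X.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))  # weight from the paper
    mu = 1.25 / np.linalg.norm(X, ord=2)                        # heuristic initial penalty
    Y = np.zeros_like(X)                                        # dual variable
    L, S = np.zeros_like(X), np.zeros_like(X)
    for _ in range(n_iter):
        L = svd_threshold(X - S + Y / mu, 1.0 / mu)             # keep only strong singular values
        S = soft_threshold(X - L + Y / mu, lam / mu)            # keep only large-magnitude entries
        Y = Y + mu * (X - L - S)                                # dual update
        mu = min(mu * rho, 1e7)
        if np.linalg.norm(X - L - S) <= 1e-7 * np.linalg.norm(X):
            break
    return L, S
```

On a matrix that really is low-rank plus sparse, this sketch recovers the two parts up to small numerical error; for production use a tested implementation is preferable.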

Sparse PCA: Standard PCs are linear combinations of all $p$ features. In Sparse PCA the PCs are sparse (few nonzero coefficients). Interpretability: $\text{PC}_1 = 0.8 \cdot \text{gene}_A + 0.6 \cdot \text{gene}_B$ (not a combination of all thousands of genes). The problem is NP-hard in general; effective approximations include ADMM and SCoTLASS.
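For experimentation, scikit-learn ships a `SparsePCA` estimator (based on an $\ell_1$-penalized dictionary-learning formulation rather than SCoTLASS). A toy comparison with ordinary PCA on made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(5)
X = rng.standard_normal((200, 30))
X[:, :3] += 2.0 * rng.standard_normal((200, 1))   # one latent factor shared by 3 features

dense = PCA(n_components=1).fit(X)
sparse = SparsePCA(n_components=1, alpha=2.0, random_state=0).fit(X)

print("nonzero loadings, ordinary PCA:", np.count_nonzero(dense.components_))   # all 30
print("nonzero loadings, sparse PCA  :", np.count_nonzero(sparse.components_))  # only a few
```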

Numerical Example

Data: 100 observations in 50-dimensional space. True dimensionality = 3 (3 latent factors + noise). PCA: the first 3 PCs explain 78% of the variance. For $p/n = 0.5$ ($c=0.5$) the noise threshold is $(1+\sqrt{0.5})^2 \approx 2.9$. If the 4th eigenvalue of the sample covariance is $\approx 2.7 < 2.9$, that component is indistinguishable from noise and is not reliable.
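A simulation in the spirit of this example; the factor strength (1.1) is an arbitrary choice, so the exact percentages will differ from the numbers above:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, r = 100, 50, 3                         # observations, features, latent factors
F = rng.standard_normal((n, r))              # latent factors
W = 1.1 * rng.standard_normal((r, p))        # loadings (signal strength is an assumption)
X = F @ W + rng.standard_normal((n, p))      # data = low-rank signal + unit-variance noise

Xc = X - X.mean(axis=0)
eigvals = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / n))[::-1]
c = p / n
print(f"noise threshold (1 + sqrt(c))^2 = {(1 + np.sqrt(c))**2:.2f}")
print("top 6 eigenvalues:", np.round(eigvals[:6], 2))
print(f"variance explained by 3 PCs: {eigvals[:3].sum() / eigvals.sum():.1%}")
# Expect roughly 3 eigenvalues far above the threshold and the rest inside the noise bulk.
```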

Assignment: Apply PCA to the MNIST dataset (784-dimensional images). Build a scree plot of explained variance. How many PCs are needed to explain 95% of the variance? Visualize the first 2 PCs for all digits (scatter plot colored by label). Implement random projection (JL) to $k=50$ dimensions. Compare kNN accuracy (5 nearest neighbors) on the original and compressed features.
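A possible starting point for the assignment. It downloads MNIST from OpenML via scikit-learn and subsamples for speed; the split sizes and the plotting calls are just one reasonable setup, not the required solution.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load MNIST (70,000 images, 784 features) and subsample to keep kNN fast.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data / 255.0, mnist.target
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=10_000, test_size=2_000, random_state=0, stratify=y)

# Scree plot and number of PCs needed for 95% of the variance.
pca = PCA().fit(X_tr)
cum = np.cumsum(pca.explained_variance_ratio_)
print("PCs needed for 95% variance:", int(np.searchsorted(cum, 0.95)) + 1)
plt.plot(cum); plt.xlabel("number of PCs"); plt.ylabel("cumulative explained variance"); plt.show()

# Scatter of the first two PCs, colored by digit label.
Z2 = pca.transform(X_tr)[:, :2]
plt.scatter(Z2[:, 0], Z2[:, 1], c=y_tr.astype(int), s=2, cmap="tab10"); plt.colorbar(); plt.show()

# Random (JL-style) projection to k = 50 dimensions and kNN comparison.
k = 50
R = np.random.default_rng(0).standard_normal((X.shape[1], k)) / np.sqrt(k)

def knn_accuracy(Xtr, Xte):
    return KNeighborsClassifier(n_neighbors=5).fit(Xtr, y_tr).score(Xte, y_te)

print("kNN accuracy, 784 original features :", knn_accuracy(X_tr, X_te))
print("kNN accuracy, random projection k=50:", knn_accuracy(X_tr @ R, X_te @ R))
```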
