Module V·Article I·~4 min read

Randomized Linear Algebra

Algorithms for Big Data

A data matrix: 10 million users × 100 thousand products. Full SVD is infeasible: O(mn·min(m,n)) ≈ 10¹⁷ operations. Randomized algorithms reduce the complexity to O(mnk) with theoretical guarantees of closeness to the exact solution. This is not approximation out of laziness; it is a mathematically rigorous theory.

Randomized SVD

Task: Compute the best rank-$k$ approximation of matrix $A \in \mathbb{R}^{m \times n}$: $\min_{\mathrm{rank}(B) \leq k} \|A - B\|_F$.

Exact algorithm: SVD $A = U\Sigma V^\top$, $A_k = U_{1:k}\Sigma_{1:k}V^\top_{1:k}$. Complexity $O(mn^2)$ (for $m \geq n$).

Randomized SVD (Halko, Martinsson, Tropp, 2011):

Step 1. Random matrix $\Omega \in \mathbb{R}^{n \times k}$: $\Omega_{ij} \sim N(0, 1/k)$. Sketch $Y = A\Omega \in \mathbb{R}^{m \times k}$. Meaning: $Y$ approximates the "column space" of $A$—we project $A$ onto $k$ random directions.

Step 2. QR decomposition $Y = QR$, $Q \in \mathbb{R}^{m \times k}$—orthonormal basis of the column space of $Y$.

Step 3. $B = Q^\top A \in \mathbb{R}^{k \times n}$. Matrix $B$ is much smaller: $k \times n$.

Step 4. $\mathrm{SVD}(B) = \tilde{U} \Sigma V^\top$, $U = Q\tilde{U}$.

Result: $U \in \mathbb{R}^{m \times k}$, $\Sigma \in \mathbb{R}^{k \times k}$, $V \in \mathbb{R}^{n \times k}$—approximate SVD. Complexity: O(mnk) vs O(mn²).
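
A minimal NumPy sketch of the four steps (the helper name, the oversampling parameter `p`, and the power-iteration count `q` are illustrative choices, not part of the original recipe):

```python
import numpy as np

def randomized_svd(A, k, p=10, q=0, seed=0):
    """Rank-k randomized SVD (Halko, Martinsson, Tropp, 2011) -- a sketch."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Step 1: Gaussian test matrix and sketch Y = A @ Omega (k + p columns).
    Omega = rng.standard_normal((n, k + p))
    Y = A @ Omega
    # Optional power iterations: Y = (A A^T)^q A Omega sharpens the spectrum.
    # (For larger q, re-orthonormalize Y between steps for numerical stability.)
    for _ in range(q):
        Y = A @ (A.T @ Y)
    # Step 2: orthonormal basis Q of the sketch via QR.
    Q, _ = np.linalg.qr(Y)
    # Step 3: project A onto the basis -> small (k + p) x n matrix.
    B = Q.T @ A
    # Step 4: exact SVD of the small matrix, then lift U back to R^m.
    U_tilde, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_tilde
    return U[:, :k], S[:k], Vt[:k, :]
```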

Theoretical guarantees: with probability ≥ 1−δ:

$\|A - QQ^\top A\|_F \leq \left(1 + O\!\left(\sqrt{k/\delta}\right)\right) \sigma_{k+1}(A)$

Here $\sigma_{k+1}(A)$ is the $(k+1)$-th singular value, the lower bound on the error of any rank-$k$ approximation. In practice, oversampling $k_\mathrm{actual} = k + p$ (with $p \approx 10$ extra random columns) gives a nearly exact solution.

Degree-$q$ power iteration: instead of $Y = A\Omega$ use $Y = (AA^\top)^q A\Omega$. The singular values of the sketched matrix become $\sigma_i^{2q+1}$, so already for $q = 1$–$2$ the tail is suppressed sharply and the error improves rapidly with $q$. Cost: $2q$ additional passes over $A$ (multiplications by $A$ and $A^\top$).

By default, scikit-learn's randomized SVD performs power iterations (n_iter=4).
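
For comparison, a usage sketch of scikit-learn's built-in routine (the matrix and parameter values are illustrative):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

A = np.random.rand(2000, 500)
# n_components = target rank k, n_oversamples = p, n_iter = power iterations q
U, S, Vt = randomized_svd(A, n_components=100, n_oversamples=10,
                          n_iter=4, random_state=0)
```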

Approximate Nearest Neighbor Search (ANNS)

Task: for a query $q$, find $x^* = \arg\min_{x \in X} d(q, x)$. Exact kNN costs $O(nd)$ per query. For $n=10^8$, $d=768$ (BERT embeddings) that is $\sim 10^{11}$ operations per query, which is prohibitive.

LSH (Locality Sensitive Hashing): Family of hash functions $h$ such that $P(h(x) = h(y))$ is high when $d(x, y)$ is small and low when $d(x, y)$ is large.

For L2 (Datar, 2004): $h_{a,b}(x) = \left\lfloor \frac{a^\top x + b}{w} \right\rfloor$, where $a \sim N(0, I)$, $b \sim \mathrm{Uniform}[0, w]$. Close points fall into the same “bucket” with high probability.

Amplification: $L$ independent tables × $k$ hash functions per table (points collide only if all $k$ hashes match). Query: probe the $L$ tables and take the union of the buckets. Time: sublinear, $O(d \cdot n^\rho)$ with $\rho < 1$; for $c$-approximate search with p-stable LSH, $\rho \approx 1/c$.
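
A toy sketch of L2 LSH with amplification (the class name and the values of k, L, and w are illustrative; a real system would re-rank the returned candidates by exact distance):

```python
import numpy as np
from collections import defaultdict

class L2LSH:
    """L hash tables, each concatenating k random-projection hashes (Datar et al., 2004)."""

    def __init__(self, dim, k=8, L=16, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w
        self.a = rng.standard_normal((L, k, dim))   # random directions a ~ N(0, I)
        self.b = rng.uniform(0, w, size=(L, k))     # random offsets b ~ Uniform[0, w]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _keys(self, x):
        # floor((a^T x + b) / w), concatenated over the k hashes of each table
        h = np.floor((self.a @ x + self.b) / self.w).astype(int)
        return [tuple(row) for row in h]

    def add(self, idx, x):
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)

    def query(self, x):
        # union of buckets over the L tables -> candidate set for exact re-ranking
        candidates = set()
        for table, key in zip(self.tables, self._keys(x)):
            candidates.update(table[key])
        return candidates
```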

HNSW (Malkov & Yashunin, 2018): Hierarchical Navigable Small World graph. Each level is a nearest-neighbor graph of different density. Search: start at the top (sparsest) level, quickly move toward the query's region, then descend level by level. Time: $O(\log n)$. SOTA for most ANNS tasks.
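
A usage sketch assuming the hnswlib library as the HNSW implementation (the M, ef_construction, and ef values are illustrative knobs for graph density and recall):

```python
import numpy as np
import hnswlib  # pip install hnswlib

dim, n = 768, 100_000
data = np.random.rand(n, dim).astype("float32")   # toy stand-in for embeddings

index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)  # graph density
index.add_items(data, np.arange(n))
index.set_ef(50)                                  # search breadth / recall trade-off

labels, distances = index.knn_query(data[:5], k=10)  # approximate top-10 neighbors
```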

FAISS (Facebook AI Similarity Search): a library with GPU acceleration. Product Quantization: compress vectors into short codes (4–8 bytes instead of $768 \times 4 = 3072$ bytes). IVF (Inverted File): cluster the space and search only the closest clusters. Searches $10^8$ vectors in <1 ms on GPU.
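
A usage sketch of an IVF-PQ index in FAISS (the data is random and the nlist/nprobe/code-size parameters are illustrative):

```python
import numpy as np
import faiss  # pip install faiss-cpu (or faiss-gpu)

d = 768
xb = np.random.rand(100_000, d).astype("float32")   # database (toy stand-in)
xq = np.random.rand(5, d).astype("float32")         # queries

# IVF + Product Quantization: 1024 clusters, 8 sub-quantizers of 8 bits each
# -> every vector is stored as an 8-byte code instead of 768 * 4 = 3072 bytes.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 1024, 8, 8)
index.train(xb)                # learn cluster centroids and PQ codebooks
index.add(xb)
index.nprobe = 16              # how many closest clusters to scan at query time
D, I = index.search(xq, 10)    # distances and ids of the approximate top-10
```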

Streaming Algorithms: Big Data with Small Memory

Task: a stream of $N$ elements $x_1, x_2, \dots, x_N$ arrives. Using $O(N)$ memory is unacceptable; we need to compute statistics of the stream in a single pass.

Count-Min Sketch (Cormode & Muthukrishnan, 2004): Estimate the frequency $f(x)$ with small memory.

Structure: counter matrix $C[d][w]$ ($d$ rows × $w$ columns), with $d$ independent hash functions $h_1,\dots,h_d: U \to \{1,\dots,w\}$.

Update for element $x$: for $i = 1,...,d$: $C[i][h_i(x)] += 1$.

Estimate $\hat{f}(x) = \min_i C[i][h_i(x)]$ (minimum across rows).

Guarantees: $P(|\hat{f}(x) - f(x)| \leq \varepsilon N) \geq 1 - \delta$ for $w = \lceil e / \varepsilon \rceil$, $d = \lceil \ln(1/\delta) \rceil$. Memory: $O(1/\varepsilon \cdot \log(1/\delta))$—does not depend on $N$!
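
A compact sketch in Python (the salted built-in hash() is an illustrative stand-in for proper pairwise-independent hash functions; it is consistent within one process):

```python
import math
import random

class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)           # columns: w = ceil(e / eps)
        self.d = math.ceil(math.log(1 / delta))    # rows:    d = ceil(ln(1 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]
        rnd = random.Random(seed)
        self.salts = [rnd.getrandbits(64) for _ in range(self.d)]

    def _buckets(self, x):
        # one bucket per row, derived from a salted hash of x
        for i, salt in enumerate(self.salts):
            yield i, hash((salt, x)) % self.w

    def update(self, x, count=1):
        for i, j in self._buckets(x):
            self.table[i][j] += count

    def query(self, x):
        # minimum across rows: never underestimates, overestimates by <= eps*N w.h.p.
        return min(self.table[i][j] for i, j in self._buckets(x))
```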

HyperLogLog (Flajolet et al., 2007): Estimate the number of unique elements (cardinality). Idea: hash $x$ into a uniform bit string; the maximum number of leading zeros observed (the position of the first 1 bit) estimates $\log_2(n)$. Average over $m = 2^b$ registers via a harmonic mean. Error: $\approx 1.04/\sqrt{m}$. Memory: $O(m \log\log N)$ bits ($\log\log N$ bits per register). Implemented in Redis; used for Count Distinct in analytics.
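
A simplified sketch (64-bit hash taken from SHA-1; no small- or large-range corrections, so estimates for tiny cardinalities will be rough):

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b                       # m = 2^b registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias correction, valid for m >= 128

    def add(self, x):
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")
        j = h & (self.m - 1)                  # low b bits pick a register
        w = h >> self.b                       # remaining (64 - b) bits
        rho = (64 - self.b) - w.bit_length() + 1   # position of the leftmost 1 bit
        self.registers[j] = max(self.registers[j], rho)

    def estimate(self):
        # harmonic mean of 2^registers, scaled by alpha * m^2
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m * z
```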

Morris Counter: Count $n$ up to $\sim 2^{64}$ in $O(\log \log n)$ bits. Store only $k \approx \log_2(n)$, a noisy estimate of the exponent. Increment: $k \leftarrow k+1$ with probability $2^{-k}$. Estimate: $2^k - 1$ (unbiased). Variance: $O(n^2)$.
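
The whole counter fits in a few lines (a sketch; the function name is illustrative):

```python
import random

def morris_count(stream, seed=0):
    """Approximate counting in O(log log n) bits: store only the exponent k."""
    rng = random.Random(seed)
    k = 0
    for _ in stream:
        if rng.random() < 2.0 ** -k:   # increment with probability 2^-k
            k += 1
    return 2 ** k - 1                  # unbiased estimate of the stream length
```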

Numerical Example

Randomized SVD on the Netflix Prize matrix (480K users × 17K movies):

  • True matrix rank ≈ 100
  • Exact SVD: $480K \times 17K \times 100 \approx 816 \times 10^9$ operations (~4 hours)
  • Random SVD ($k=100$, $q=2$): ~240 × $10^9$ → ~1 hour, error <1%

HNSW search over 10M BERT embeddings (768d): indexing 10 minutes, top-10 search in 0.3 ms (vs 4.5 seconds for exact brute-force).

Assignment: (1) Implement randomized SVD without sklearn for the MovieLens (100K) ratings matrix. Compare accuracy (Frobenius error) with truncated SVD for $k=10$, $50$, $100$ at $q=0$, $1$, $2$ power iterations. (2) Implement Count-Min Sketch and HyperLogLog in Python. Test on a stream of 10M words from English text. Compare estimated number of unique words with the true count.
