Module II·Article I·~3 min read

Approximation Theory and Deep Networks

Mathematical Foundations of Deep Learning

Approximation Theory for Neural Networks

Why do neural networks work? What class of functions can they approximate? Why is depth needed, and can a single wide layer suffice? Approximation theory answers these questions with mathematical rigor, grounding the practical success of deep learning.

Universal Approximation Theorem

Classic theorem (Cybenko, 1989; Hornik, Stinchcombe, White, 1989): a feedforward network with a single hidden layer, a sufficient number of hidden neurons, and any nonconstant sigmoidal activation function can approximate any continuous function $f: [0,1]^n \to \mathbb{R}$ to any precision $\epsilon > 0$.

Formally: for any $\epsilon > 0$, there exist $N$ and parameters $\{\alpha_j, w_j, b_j\}_{j=1}^{N}$ such that

$ \sup_{x \in [0,1]^n} \Big| f(x) - \sum_{j=1}^{N} \alpha_j \sigma(w_j^\mathrm{T} x + b_j) \Big| < \epsilon $

Limitation of the theorem: it does not say how many neurons are needed (the number can be astronomical), nor how to find the parameters by training.
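The existence claim can be checked numerically with a hedged sketch: fix random inner parameters $(w_j, b_j)$ and solve for the outer coefficients $\alpha_j$ by least squares. The sup-norm error shrinks as the number of sigmoids $N$ grows. This random-feature shortcut is an illustration, not the theorem's construction, and the weight scales are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 400)
f = np.sin(2 * np.pi * x)  # target function on [0, 1]

def fit_error(N):
    """Fit f with sum_j alpha_j * sigmoid(w_j x + b_j), random (w_j, b_j)."""
    w = rng.normal(0, 10, N)        # random inner weights (assumed scale)
    b = rng.uniform(-10, 10, N)     # random biases (assumed scale)
    Phi = 1.0 / (1.0 + np.exp(-(np.outer(x, w) + b)))  # (400, N) feature matrix
    alpha, *_ = np.linalg.lstsq(Phi, f, rcond=None)    # least-squares outer weights
    return float(np.max(np.abs(Phi @ alpha - f)))      # sup-norm error

errors = {N: fit_error(N) for N in (5, 20, 100)}
```

With this seed the error drops by orders of magnitude from $N = 5$ to $N = 100$, matching the theorem's promise that precision improves with width.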

Barron's Theorem (1993): for functions whose Fourier spectrum has a bounded first moment, $C_f = \int \|\omega\| \, |\hat{f}(\omega)| \, d\omega < \infty$, a single-hidden-layer network with $n$ neurons achieves approximation error $O(C_f^2/n)$ in $L_2$. Crucially, the input dimension does not appear explicitly in the bound (it enters only through $C_f$), so there is no “curse of dimensionality” for this class of functions.

The Advantage of Depth

Intuition for depth: the first layer extracts elementary patterns (edges in an image), the second combines them into shapes, the third into objects. This hierarchical representation is analogous to how the brain processes information.

Theorems on Exponential Advantage (Montúfar, Pascanu, Cho, Bengio, 2014): a ReLU network with $O(n)$ neurons per layer across $d$ layers can represent a piecewise-linear function with $O(n^d)$ linear regions. A single-layer network with the same $O(n)$ neurons yields only $O(n)$ regions. Exponential gain from depth!
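The exponential growth is easy to see in one dimension with a self-contained sketch (an illustration in the spirit of the result, not the paper's construction): the “tent” map is expressible as a two-ReLU-neuron layer, and each composition doubles the number of linear pieces, so $d$ such layers give $2^d$ regions, while a single hidden layer with $n$ ReLU neurons on a 1-D input gives at most $n + 1$ regions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    """Tent map on [0, 1] as a two-ReLU-neuron layer: 2*relu(x) - 4*relu(x - 0.5)."""
    return 2 * relu(x) - 4 * relu(x - 0.5)

def count_linear_regions(d, grid=4096):
    """Count linear pieces of the d-fold composition by detecting slope changes."""
    x = np.linspace(0.0, 1.0, grid + 1)
    y = x.copy()
    for _ in range(d):
        y = tent(y)                # one more hidden layer
    slopes = np.round(np.diff(y) / np.diff(x)).astype(int)
    return 1 + int(np.sum(slopes[1:] != slopes[:-1]))

regions = [count_linear_regions(d) for d in range(1, 5)]  # → [2, 4, 8, 16]
```

Each layer costs two neurons, so $2d$ neurons arranged in depth produce $2^d$ regions, while the same $2d$ neurons in one layer produce at most $2d + 1$.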

Consequence: deep networks efficiently "reuse" features from the lower layers. A classic proven fact: XOR is not linearly separable, so it requires at least two layers of neurons; no single-layer perceptron can represent it.
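As a minimal illustration, a two-layer ReLU network with hand-picked weights (chosen here for the example, not taken from the article) computes XOR exactly:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def xor_net(x1, x2):
    """Two-layer ReLU network computing XOR.
    Hidden units: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1).
    Output: h1 - 2*h2 (linear readout)."""
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    return h1 - 2.0 * h2

outputs = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
# (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 0
```

The hidden layer bends the input space once (at $x_1 + x_2 = 1$), which is exactly the nonlinearity a single linear threshold cannot provide.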

Neural Tangent Kernels (NTK)

In the limit of infinitely wide networks (Jacot, Gabriel, Hongler, 2018), training a neural network by gradient descent is equivalent to kernel regression with the kernel, evaluated at initialization:

$ K_\mathrm{NTK}(x, x') = \langle \frac{\partial f(x; \theta)}{\partial \theta}, \frac{\partial f(x'; \theta)}{\partial \theta} \rangle $

Physical meaning: how similar the "directions of improvement" of the network are for two examples $x$ and $x'$. In the infinite-width limit, $K_\mathrm{NTK}$ does not change during training → linear regime → predictions coincide with Gaussian-process regression with kernel $K_\mathrm{NTK}$.
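For a small one-hidden-layer ReLU network, the empirical NTK can be computed directly from analytic parameter gradients. The sketch below (numpy, with assumed initialization scales) builds the Gram matrix of gradients, which is symmetric and positive semidefinite by construction:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 256                                # hidden width
w = rng.normal(0, 1, m)                # inner weights (1-D input)
b = rng.normal(0, 1, m)                # biases
a = rng.normal(0, 1 / np.sqrt(m), m)   # outer weights, NTK-style 1/sqrt(m) scaling

def param_grad(x):
    """Gradient of f(x) = sum_j a_j * relu(w_j x + b_j) w.r.t. all parameters."""
    pre = w * x + b
    act = np.maximum(pre, 0.0)
    ind = (pre > 0).astype(float)      # ReLU derivative
    return np.concatenate([act,        # d f / d a_j
                           a * ind * x,  # d f / d w_j
                           a * ind])     # d f / d b_j

xs = np.linspace(-1, 1, 8)
G = np.stack([param_grad(x) for x in xs])  # (8, 3m) Jacobian at initialization
K = G @ G.T                                # empirical NTK Gram matrix
```

Because $K$ is an inner-product (Gram) matrix of gradients, it is automatically a valid kernel matrix; in the infinite-width limit it concentrates around its deterministic value and stays fixed during training.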

Practical consequences: explains implicit bias (minimal norm solution, SGD regularization effect), predicts generalization for wide networks, explains the double descent phenomenon.

Double Descent

Classic U-shaped tradeoff: as the number of parameters increases, error first decreases, then grows (overfitting). But in modern neural networks a different pattern is observed:

Error: decreases → peaks at the interpolation threshold (number of parameters ≈ number of training examples $n$) → decreases again when the number of parameters $p \gg n$.

An interpolating network (parameters $p \gg n$): there exist infinitely many solutions that drive the train loss to zero. SGD finds the solution with minimal norm (implicit bias), and it generalizes well! Through NTK: minimal RKHS norm → good generalization according to the theory of kernel methods.
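The "infinitely many interpolants, minimal norm wins" picture can be illustrated with an underdetermined linear system, a linear stand-in for the overparameterized network: `numpy.linalg.pinv` returns the minimal-norm interpolating solution, and adding any null-space direction yields another exact interpolant with strictly larger norm.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 20                      # fewer equations than unknowns: p >> n
A = rng.normal(size=(n, p))       # "features"
y = rng.normal(size=n)            # "labels"

theta_min = np.linalg.pinv(A) @ y   # minimal-norm solution with A @ theta = y

# Any null-space direction gives another exact interpolant with larger norm.
_, _, Vt = np.linalg.svd(A)
null_dir = Vt[-1]                   # right singular vector spanning the null space
theta_alt = theta_min + 3.0 * null_dir

resid_min = float(np.max(np.abs(A @ theta_min - y)))
resid_alt = float(np.max(np.abs(A @ theta_alt - y)))
norms = (np.linalg.norm(theta_min), np.linalg.norm(theta_alt))
```

Both solutions fit the data exactly, but the pseudoinverse solution is orthogonal to the null space and therefore has the smallest norm among all interpolants, the linear analogue of SGD's implicit bias.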

Numerical Example

Approximation of $f(x) = \sin(2\pi x)$ on $[0,1]$:

  • 1 hidden layer, 10 neurons (ReLU): error ≈ 0.05 (piecewise-linear)
  • 1 layer, 100 neurons: error ≈ 0.005
  • 2 layers × 10 neurons (20 neurons total): error ≈ 0.003

Two layers with 20 neurons in total beat one layer with 100 neurons: a quantitative advantage of depth.
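The error figures above depend on unstated training details; as a rough, self-contained sketch under assumed hyperparameters (tanh activations, full-batch gradient descent, plain-numpy backpropagation), a small two-hidden-layer network can be trained on the same target:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 200).reshape(-1, 1)
y = np.sin(2 * np.pi * x)  # target from the example

# Two hidden tanh layers of 10 neurons each (assumed architecture and scales).
W1 = rng.normal(0, 1.0, (1, 10)); b1 = np.zeros(10)
W2 = rng.normal(0, 1.0, (10, 10)); b2 = np.zeros(10)
W3 = rng.normal(0, 1.0, (10, 1));  b3 = np.zeros(1)

def forward(x):
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    return h1, h2, h2 @ W3 + b3

lr = 0.05
_, _, out0 = forward(x)
mse0 = float(np.mean((out0 - y) ** 2))     # loss at initialization
for _ in range(2000):
    h1, h2, out = forward(x)
    g = 2 * (out - y) / len(x)             # d MSE / d out
    gW3 = h2.T @ g; gb3 = g.sum(0)
    g2 = (g @ W3.T) * (1 - h2 ** 2)        # backprop through second tanh layer
    gW2 = h1.T @ g2; gb2 = g2.sum(0)
    g1 = (g2 @ W2.T) * (1 - h1 ** 2)       # backprop through first tanh layer
    gW1 = x.T @ g1; gb1 = g1.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
    W3 -= lr * gW3; b3 -= lr * gb3
_, _, out = forward(x)
mse = float(np.mean((out - y) ** 2))       # loss after training
```

Exact error values will differ from the article's (different activation, optimizer, and budget), but the loss drops substantially from its initial value, and the depth-vs-width comparison can be repeated by swapping in other layer shapes.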

Applications of the Theory

Architecture selection (understanding when depth vs. width is needed), development of regularization methods (weight decay as minimization of RKHS norm), understanding generalization (NTK → generalization bounds), training diagnostics (if validation loss does not decrease, the network may lack depth or width).

Assignment: Implement an MLP in PyTorch for approximating $f(x,y) = \sin(x) \cdot \cos(y) + 0.5 \cdot \sin(2x + y)$ on $[-\pi, \pi]^2$. Compare architectures: [1 layer × 64, 2 layers × 32, 4 layers × 16]. With the same number of parameters: which depth gives the lowest MSE? Plot error surfaces for each architecture. How does behavior change when noise is added to the data?
