
Theory of Learnability and Convexity of Neural Networks

Applications in Machine Learning


When Does a Learning Algorithm “Work”?

Machine learning can look like a purely empirical discipline: you try something, and it works. But behind the scenes there is a mathematical theory explaining why and when learning generalizes to new data. VC theory and PAC learnability are rigorous frameworks for this question. Convex problems enjoy special guarantees: bounds that are independent of the dimensionality of the space and depend only on the "complexity" of the hypothesis class.

VC Dimension

Shattering: a set of points S is "shattered" by a hypothesis class H if for every labeling of the points with values in {−1, +1} there exists a hypothesis h ∈ H that classifies all of them correctly.

VC dimension vc(H) = the maximal number of points that H can shatter.

Examples:

  • Linear classifiers in ℝⁿ: vc = n+1. In ℝ², a linear classifier can shatter any 3 points (in general position), but not any 4.
  • RBF kernel: vc = ∞ (can shatter any number of points). But the SVM with RBF kernel generalizes via margin!
  • Neural network with W parameters: vc ≈ W log W.

Main VC theorem: finite vc(H) ↔ PAC (Probably Approximately Correct) learnability. Number of samples for (ε, δ)-learning: N = O((vc(H) + log(1/δ)) / ε²).
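To make the shattering definition concrete, here is a small illustrative sketch (not from the original text): it enumerates every ±1 labeling of a point set in ℝ² and asks an LP solver whether a separating hyperplane exists. The helper names linearly_separable and is_shattered are ours; NumPy and SciPy are assumed.

```python
# Brute-force check of the shattering definition for linear classifiers in R^2.
# For each +/-1 labeling we solve a feasibility LP for (w, b) with y_i*(w.x_i + b) >= 1.
from itertools import product

import numpy as np
from scipy.optimize import linprog


def linearly_separable(X, y):
    """Return True if some (w, b) satisfies y_i * (w . x_i + b) >= 1 for all i."""
    n, d = X.shape
    # Variables are [w_1, ..., w_d, b]; zero objective, only feasibility matters.
    A_ub = -(y[:, None] * np.hstack([X, np.ones((n, 1))]))
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success


def is_shattered(X):
    """True if every labeling of the rows of X admits a linear separator."""
    return all(linearly_separable(X, np.array(labels))
               for labels in product([-1.0, 1.0], repeat=len(X)))


three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # general position
four_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # XOR-style square

print(is_shattered(three_points))  # True: three points in general position are shattered
print(is_shattered(four_points))   # False: the XOR labeling of these four points is not separable
```

This matches the statement above: vc = 3 for linear classifiers in ℝ², since some set of 3 points is shattered but no set of 4 points is.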

Rademacher Complexity for Convex Functions

Rademacher complexity is a more modern and finer-grained measure of complexity than the VC dimension:

R_N(F) = E_{σ,x} [sup_{f ∈ F} (1/N) Σᵢ σᵢ f(xᵢ)]

where the σᵢ are independent random signs, each equal to ±1 with probability 1/2 (Rademacher variables).

Theorem: with probability ≥ 1−δ:

E[f(x)] − (1/N) Σᵢ f(xᵢ) ≤ 2R_N(F) + O(√(log(1/δ)/N))

For classes with bounded norm ‖w‖ ≤ B and ‖xᵢ‖ ≤ ρ:

R_N(F) ≤ Bρ/√N

Corollary: the generalization gap is O(Bρ/√N), independent of the dimensionality of the space; only the margin (≈ 1/B) and the data norm ρ matter.
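A quick sanity check of this bound, as an illustrative sketch under our own setup (not from the article): for the norm-bounded linear class the supremum has a closed form, so the empirical Rademacher complexity can be estimated by Monte Carlo over random sign vectors and compared with Bρ/√N. The dimension is deliberately chosen larger than N.

```python
# Monte Carlo estimate of the empirical Rademacher complexity of the class
# F = { x -> <w, x> : ||w|| <= B }. For this class the supremum is explicit:
#   sup_{||w||<=B} (1/N) sum_i sigma_i <w, x_i> = (B/N) * || sum_i sigma_i x_i ||.
import numpy as np

rng = np.random.default_rng(0)
N, d, B = 200, 1000, 1.0                         # note: d >> N on purpose
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # normalize so ||x_i|| = rho = 1
rho = np.linalg.norm(X, axis=1).max()

sigmas = rng.choice([-1.0, 1.0], size=(5000, N))                  # 5000 Rademacher draws
estimate = (B / N) * np.linalg.norm(sigmas @ X, axis=1).mean()

print(f"Monte Carlo R_N(F) ~= {estimate:.4f}")
print(f"Bound B*rho/sqrt(N) = {B * rho / np.sqrt(N):.4f}")
# The estimate never exceeds the bound, even though d = 1000 > N = 200.
```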

Convexity of Neural Networks Under Overparameterization

Training a neural network is a non-convex optimization problem. So why does SGD work?

Theorem (Kawaguchi, 2016): linear networks

For a neural network without nonlinearity (f(x) = WₗWₗ₋₁...W₁x):

  • All critical points (∇L = 0) are either global minima or saddle points
  • There are no “bad” local minima

For sufficiently wide ReLU networks:

Theorem: if the width of the hidden layer satisfies m ≥ N (the number of training samples), then with high probability over the random initialization, SGD finds a global minimum of the training error.

Intuition: under overparameterization (m ≫ N) there are no "bad" local minima; a global minimum of the training loss lies close to almost any random initialization, so SGD only has to move the weights slightly to reach one.
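A minimal sketch of this behaviour, assuming PyTorch is available and with purely illustrative hyperparameters (width, learning rate, and step count are our choices): an overparameterized one-hidden-layer ReLU network driven by plain full-batch SGD fits even random targets to near-zero training error.

```python
# Overparameterized one-hidden-layer ReLU network memorizing random targets with SGD.
import torch

torch.manual_seed(0)
N, d, m = 30, 10, 1000                      # width m >> N: overparameterized regime
X = torch.randn(N, d)
y = torch.randn(N, 1)                       # random targets: nothing structured to learn

model = torch.nn.Sequential(
    torch.nn.Linear(d, m),
    torch.nn.ReLU(),
    torch.nn.Linear(m, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=5e-3)
loss_fn = torch.nn.MSELoss()

for step in range(20_001):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    if step % 5_000 == 0:
        print(f"step {step:6d}   train MSE {loss.item():.3e}")
# The training error keeps shrinking toward zero: the network interpolates
# even arbitrary labels, as the width condition above suggests.
```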

Implicit Regularization of SGD

The phenomenon of implicit bias: SGD does not merely find some global minimum; it finds the one with the smallest norm.

Theorem: SGD (and gradient flow) for neural networks converges to the minimum ‖w‖₂-norm solution among all global minima (for linear networks).

This is "convexity by default": SGD implicitly solves min ‖w‖² s.t. Xw = y, the minimum-norm interpolation problem (the limit of ridge regression as the penalty goes to zero), without any explicit penalty in the loss.
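A minimal numerical illustration of this claim (our own sketch, using NumPy): gradient descent started from w = 0 on an underdetermined least-squares problem ends up at the pseudoinverse solution, which is exactly the minimum-‖w‖₂ interpolant.

```python
# Implicit bias of gradient descent on underdetermined least squares.
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 100                      # d > N: the system Xw = y is underdetermined
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

w = np.zeros(d)                     # starting from zero matters for this result
lr = 0.05
for _ in range(5_000):
    grad = X.T @ (X @ w - y) / N    # gradient of (1/2N) * ||Xw - y||^2
    w -= lr * grad

w_min_norm = np.linalg.pinv(X) @ y  # closed-form minimum-L2-norm interpolant

print("residual ||Xw - y||       :", np.linalg.norm(X @ w - y))
print("distance to min-norm soln :", np.linalg.norm(w - w_min_norm))
# Both numbers are ~0: gradient descent interpolates AND picks the minimum-norm
# solution, with no explicit regularization term anywhere.
```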

Double Descent

Classical machine learning: as the number of parameters increases, the error first decreases (less bias), then increases (more variance) — the “U-shaped error curve.”

Double descent phenomenon (Belkin et al., 2019): under overparameterization (number of parameters > number of data), the error begins to decrease again!

  • Classical (underparameterized) regime, M < N: the usual bias-variance behavior
  • Interpolation threshold, M ≈ N: the test error peaks
  • Overparameterized regime, M ≫ N: the test error decreases again

Reason: in the overparameterized regime SGD finds "flat" minima (small eigenvalues of the Hessian ∇²L) that generalize well, rather than "sharp" minima (large eigenvalues of ∇²L) that generalize poorly.
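The curve is easy to reproduce with a toy model. The sketch below is our own construction, not from the article: random-ReLU-feature regression fitted with the minimum-norm least-squares solution (the solution gradient descent from zero converges to), with the number of features M swept through the interpolation threshold M ≈ N. Exact numbers depend on the random seed.

```python
# Double descent with random ReLU features and minimum-norm least squares.
import numpy as np

rng = np.random.default_rng(0)
N, N_test, d = 100, 1000, 5

def make_data(n):
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n)  # nonlinear target + noise
    return X, y

X_train, y_train = make_data(N)
X_test, y_test = make_data(N_test)

# One fixed, wide bank of random ReLU features; a model with M parameters uses the first M.
M_max = 2000
W = rng.normal(size=(d, M_max))
phi_train = np.maximum(X_train @ W, 0.0)
phi_test = np.maximum(X_test @ W, 0.0)

for M in [10, 50, 90, 100, 110, 200, 500, 2000]:
    beta = np.linalg.pinv(phi_train[:, :M]) @ y_train    # min-norm fit with M features
    test_mse = np.mean((phi_test[:, :M] @ beta - y_test) ** 2)
    print(f"M = {M:4d}   test MSE = {test_mse:10.3f}")
# Typical pattern: the test MSE improves for small M, spikes near M ~ N = 100,
# and then decreases again in the overparameterized regime M >> N.
```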

Complete Example: Algorithm Comparison

Task: MNIST classification (60,000 training samples, 784 features, 10 classes).

| Method | Accuracy | Number of parameters | Guarantees |
|---|---|---|---|
| Logistic regression | 92% | 7,840 | Convex, global optimum |
| SVM (RBF) | 98% | ~10,000 support vectors | Convex dual |
| 3-layer neural network | 98.5% | 100,000 | No strict guarantees, works |
| ResNet-50 | 99.7% | 25 million | Empirically reliable |
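As a runnable stand-in for this comparison (our sketch, assuming scikit-learn): the tiny built-in digits dataset (8×8 images, ~1,800 samples) replaces MNIST, so the accuracies will not match the table; the point is only to show the three kinds of models side by side.

```python
# Convex model vs. kernel method vs. small neural network on a lightweight digits task.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X = X / 16.0                                     # pixel values 0..16 -> 0..1
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "logistic regression (convex)": LogisticRegression(max_iter=2000),
    "SVM with RBF kernel (convex dual)": SVC(kernel="rbf", C=10.0),
    "MLP, one hidden layer (non-convex)": MLPClassifier(hidden_layer_sizes=(256,),
                                                        max_iter=500, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name:36s} test accuracy = {model.score(X_te, y_te):.3f}")
```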

Convex models come with rigorous guarantees but are limited in expressive power. Neural networks are more powerful, but theory only partially explains why they work. The task for mathematicians and ML researchers is to develop tighter theoretical guarantees for real neural networks.

Double Descent and Overfitting

Classical learning theory predicts that beyond a certain point the test error grows with model complexity: the "U-shaped curve." Modern neural networks, however, exhibit the double descent phenomenon: after the first error peak, the test error decreases again as the parameter count grows further. This appears to contradict classical worst-case bounds based on VC dimension and Rademacher complexity. The explanation lies in implicit regularization: SGD prefers small-norm solutions even without an explicit penalty. For linear models with an underdetermined system, this is exactly the minimum-L2-norm interpolating solution.

Modern Directions

  • Generalization bounds for deep networks: PAC-Bayes provides nontrivial estimates for specific trained models based on “flatness” of the loss minimum
  • Lottery tickets (Frankle & Carbin, 2019): a large network contains a small subnet that can be trained to comparable quality — related to convexity of local basins
  • Safe learning: convex-concave optimization for adversarial training guarantees robustness to attacks
