Regularization and Prevention of Overfitting
Fundamentals of Neural Networks
Neural networks with millions of parameters can "memorize" the training set completely, that is, overfit. In practice, the gap between training and test error is the main enemy. Regularization is a set of techniques that reduce overfitting without sacrificing model expressiveness.
Bias-Variance Tradeoff
Generalization error can be analyzed through the decomposition:
E[L(y, f̂(x))] = Bias²(f̂) + Var(f̂) + σ²_noise
Simple model: high bias, low variance. Complex model (DNN): low bias (can approximate everything), but high variance (sensitive to data).
Double descent: with modern overparameterized networks, the classical U-shaped tradeoff breaks down; once the number of parameters ≫ the number of training examples n, test error decreases again.
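The decomposition can be checked numerically by refitting the same model on many independent training samples drawn from one data-generating process. The sketch below does this with a degree-3 polynomial fit in NumPy; the true function, noise level, sample size, and test point are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # assumed "true" function
sigma_noise = 0.1                        # assumed noise level
x_test = 0.3                             # point at which the error is decomposed

# Refit the same model class on many independent training sets,
# then estimate Bias^2 and Var of the prediction at x_test.
preds = []
for _ in range(2000):
    x = rng.uniform(0, 1, 20)
    y = f(x) + rng.normal(0, sigma_noise, 20)
    coef = np.polyfit(x, y, 3)           # degree-3 polynomial as the model
    preds.append(np.polyval(coef, x_test))

preds = np.array(preds)
bias2 = (preds.mean() - f(x_test)) ** 2
var = preds.var()
print(f"bias^2 = {bias2:.4f}, variance = {var:.4f}, noise = {sigma_noise**2:.4f}")
```

Repeating the experiment with a degree-1 polynomial shows the opposite regime: bias² grows while the variance term shrinks.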
L2 Regularization (Weight Decay)
Penalty: L_reg = L + λ/2 · ||θ||² = L + λ/2 · Σᵢ θᵢ². SGD update: θ ← θ − α·∇L − αλθ = θ(1 − αλ) − α·∇L. "Weight decay": each step decreases weights by a factor of (1−αλ) — pulls towards zero.
Bayesian interpretation: MAP estimate with Gaussian prior N(0, 1/λ) on parameters. L2 implies the assumption that "true" weights are small.
In practice: typical values are λ = 10⁻⁴ ... 10⁻². AdamW implements weight decay correctly, as a decoupled decay applied directly to the weights rather than through the gradient.
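A minimal PyTorch sketch of both variants with illustrative hyperparameters: decoupled weight decay via AdamW, and classic L2 regularization expressed through SGD's weight_decay argument, which matches the update above.

```python
import torch
import torch.nn as nn

# toy model; architecture and sizes are illustrative
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Decoupled weight decay (AdamW): the decay step θ ← θ(1 − αλ) is applied
# directly to the weights instead of being folded into the adaptive gradient.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Classic L2 with plain SGD: weight_decay=λ adds λθ to the gradient, which is
# equivalent to adding λ/2 · ||θ||² to the loss.
sgd = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```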
L1 Regularization (LASSO)
Penalty: L + λ||θ||₁ = L + λΣᵢ|θᵢ|. Creates sparsity: many θᵢ are driven exactly to 0. Bayesian interpretation: MAP with a Laplace (double exponential) prior. Used less often in neural networks: the penalty is not differentiable at zero (only subdifferentiable), and optimization is less stable.
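When an explicit L1 term is wanted anyway, it is usually just added to the task loss. A minimal sketch, with an illustrative coefficient lam:

```python
import torch

def l1_penalty(model: torch.nn.Module, lam: float = 1e-5) -> torch.Tensor:
    """lam * Σ|θᵢ| over all parameters; add to the task loss to encourage sparsity."""
    return lam * sum(p.abs().sum() for p in model.parameters())

# inside a training step:
#   total_loss = task_loss + l1_penalty(model)
#   total_loss.backward()
```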
Dropout (Srivastava et al., 2014)
Idea: During training, each neuron is "switched off" with probability p (typically p = 0.5 for fully connected layers, p = 0.1 for convolutions). Multiplicative mask: zˡ = aˡ⊙m, mᵢ ~ Bernoulli(1−p). At inference all neurons are active and outputs are scaled by (1−p). With inverted dropout (the PyTorch default), activations are instead divided by (1−p) during training, so inference needs no scaling.
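A minimal sketch of inverted dropout as described above; in practice torch.nn.Dropout(p) implements exactly this behavior.

```python
import torch

def inverted_dropout(a: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
    """Zero each activation with probability p during training and rescale the
    survivors by 1/(1-p), so no extra scaling is needed at inference."""
    if not training or p == 0.0:
        return a
    mask = (torch.rand_like(a) > p).float()   # 1 with probability 1-p (keep)
    return a * mask / (1.0 - p)
```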
Dropout interpretations:
- Ensemble: training samples exponentially many "thinned" subnetworks; at inference, their average is approximated via the output scaling.
- Regularization: neurons cannot "rely" on specific partners → more independent features.
- Noise: adds noise to activations → regularization through stochasticity.
Dropout perturbs batch statistics and combines poorly with Batch Normalization; in modern architectures, BN typically replaces dropout in the convolutional parts.
Batch Normalization (Ioffe & Szegedy, 2015)
Idea: Normalize activations across the batch before applying activation. For batch B = {x₁,...,xₘ}:
x̂ᵢ = (xᵢ − μB)/√(σB² + ε), yᵢ = γx̂ᵢ + β
Here, μB = (1/m)Σᵢxᵢ, σB² = (1/m)Σᵢ(xᵢ−μB)², γ and β are trainable parameters (affine transform).
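A minimal sketch of the training-time computation, matching the formulas above for an (m, d) batch; torch.nn.BatchNorm1d does the same and additionally tracks running statistics for inference.

```python
import torch

def batch_norm_train(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
                     eps: float = 1e-5) -> torch.Tensor:
    """x has shape (m, d); normalize each feature over the batch dimension."""
    mu = x.mean(dim=0)                        # μB
    var = x.var(dim=0, unbiased=False)        # σB²
    x_hat = (x - mu) / torch.sqrt(var + eps)  # x̂ᵢ
    return gamma * x_hat + beta               # yᵢ

x = torch.randn(32, 64)                       # batch of 32, 64 features
y = batch_norm_train(x, torch.ones(64), torch.zeros(64))
```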
Advantages: stabilizes training, allows larger learning rates, reduces sensitivity to initialization. Regularizing effect (via batch statistics noise). At inference: use running mean/variance (exponential moving average).
Limitations: effective for batch sizes ≥ 16; for small batches, use Group Normalization or Instance Normalization instead.
Data Augmentation
Idea: Expand the training set via "synthetic" transformations that do not change the class. For images: horizontal flip (50%), random crop, color jitter, rotation ±15°, cutout (random rectangular masking).
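A typical CIFAR-10 training pipeline built from the transforms listed above using torchvision; the exact parameters are illustrative, and RandomErasing (applied after ToTensor) stands in for cutout.

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                                 # horizontal flip (50%)
    T.RandomCrop(32, padding=4),                                   # random crop with padding
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),   # color jitter
    T.RandomRotation(degrees=15),                                  # rotation ±15°
    T.ToTensor(),
    T.RandomErasing(p=0.25),                                       # cutout-style rectangular masking
])
```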
Mixup (Zhang et al., 2018): x_mix = λx_i + (1−λ)x_j, y_mix = λy_i + (1−λ)y_j, λ ~ Beta(α,α). Train on "mixed" examples → smoother predictions, better generalization.
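A minimal sketch of mixup on a batch, assuming the labels are already one-hot (or soft) so they can be mixed linearly:

```python
import torch

def mixup(x: torch.Tensor, y_onehot: torch.Tensor, alpha: float = 1.0):
    """Mix the batch with a shuffled copy of itself; λ ~ Beta(α, α)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[idx]
    return x_mix, y_mix
```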
CutMix (Yun et al., 2019): patches are cut from one image and pasted into another; labels are mixed in proportion to the patch area.
RandAugment (Cubuk et al., 2019): randomly choose N out of K augmentations with strength M. Simplifies the augmentation policy search.
Early Stopping
Algorithm: Train, periodically evaluate val loss. Save best weights (checkpoint) at minimum val loss. Stop if patience is exceeded (patience = 10–20 epochs without improvement).
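A sketch of the loop; train_one_epoch and validate are hypothetical helpers standing in for one training epoch and one validation pass.

```python
import torch

def fit_with_early_stopping(model, train_one_epoch, validate,
                            max_epochs: int = 200, patience: int = 10) -> float:
    """train_one_epoch(model) runs one epoch; validate(model) returns the val loss
    (both are placeholders for project-specific code)."""
    best_val, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validate(model)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            # checkpoint the best weights seen so far
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            bad_epochs += 1
            if bad_epochs >= patience:    # patience exceeded: stop training
                break
    model.load_state_dict(best_state)     # restore the best checkpoint
    return best_val
```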
Theoretically: under certain conditions, early stopping is equivalent to L2 regularization (Bishop, 2006); the number of training iterations plays the role of ≈ 1/λ.
Numerical Example
ResNet-50 on CIFAR-10 (50K train, 10K test):
- No regularization: train acc 99%, test acc 88% (train/test gap of 11 points)
- L2 (λ=1e-4): test acc 91%
- Dropout (p=0.5 before FC): test acc 91.5%
- Data augmentation (flip+crop): test acc 93.2%
- Mixup (α=1.0): test acc 94.1%
- All together: test acc 95.3%
Assignment: Reproduce the experiment on CIFAR-10: (1) Basic CNN (without regularization) — record train/val accuracy. (2) Add one at a time: L2 λ=1e-4, dropout p=0.3, BN, augmentation. (3) Plot: epochs vs val accuracy for each configuration. (4) Implement early stopping with patience=10. In how many epochs does training stop in each case?