Module I·Article III·~3 min read

Regularization and Prevention of Overfitting

Fundamentals of Neural Networks

Neural networks with millions of parameters can "memorize" the training set completely, i.e. overfit. In practice, the main enemy is the gap between train and test error. Regularization is the set of techniques that reduce overfitting without sacrificing model expressiveness.

Bias-Variance Tradeoff

The expected error decomposes into three components:

E[L(y, f̂(x))] = Bias²(f̂) + Var(f̂) + σ²_noise

Simple model: high bias, low variance. Complex model (DNN): low bias (can approximate everything), but high variance (sensitive to data).
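The decomposition can be checked numerically on a toy problem. A minimal sketch (the setup is illustrative: estimating a scalar mean with a deliberately biased shrinkage estimator, using only numpy):

```python
import numpy as np

rng = np.random.default_rng(0)

mu_true = 2.0          # "true" value being estimated
sigma_noise = 1.0      # observation noise std
n_obs = 5              # samples per dataset
shrink = 0.5           # shrinkage: biased but lower-variance estimator
n_trials = 200_000     # Monte Carlo repetitions

# Each trial: draw a small dataset, form the shrunken-mean estimate
data = rng.normal(mu_true, sigma_noise, size=(n_trials, n_obs))
estimates = shrink * data.mean(axis=1)

bias2 = (estimates.mean() - mu_true) ** 2
variance = estimates.var()

# Fresh noisy observations y; mean squared prediction error
y = rng.normal(mu_true, sigma_noise, size=n_trials)
mse = ((y - estimates) ** 2).mean()

# Decomposition holds: mse ≈ bias2 + variance + sigma_noise**2
print(bias2, variance, mse)
```

Raising `shrink` toward 1 trades bias for variance, mirroring the simple-vs-complex model contrast above.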

Double descent: with modern overparameterized networks the classical U-shaped tradeoff breaks down: once the number of parameters greatly exceeds the number of training examples n, test error starts decreasing again.

L2 Regularization (Weight Decay)

Penalty: L_reg = L + λ/2 · ||θ||² = L + λ/2 · Σᵢ θᵢ². SGD update: θ ← θ − α·∇L − αλθ = θ(1 − αλ) − α·∇L. "Weight decay": each step decreases weights by a factor of (1−αλ) — pulls towards zero.
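The multiplicative-shrinkage view of the update can be sketched in a few lines (the learning rate and λ values are arbitrary, chosen only to make the decay visible):

```python
import numpy as np

def sgd_weight_decay_step(theta, grad, lr=0.1, wd=0.01):
    # theta <- theta * (1 - lr*wd) - lr * grad
    # i.e. shrink the weights, then take the usual gradient step
    return theta * (1 - lr * wd) - lr * grad

theta = np.array([1.0, -2.0, 0.5])
grad = np.zeros(3)  # zero gradient isolates the decay effect

for _ in range(100):
    theta = sgd_weight_decay_step(theta, grad)

# With zero gradient the weights shrink geometrically by (1 - lr*wd)^100
print(theta)
```

With a nonzero gradient the two terms compete: the loss pulls weights toward its minimum while the decay pulls them toward zero.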

Bayesian interpretation: MAP estimate with Gaussian prior N(0, 1/λ) on parameters. L2 implies the assumption that "true" weights are small.

Practice: typical values are λ = 10⁻⁴ to 10⁻². AdamW implements weight decay correctly, as a decoupled multiplicative shrinkage rather than through the gradient.

L1 Regularization (LASSO)

Penalty: L + λ||θ||₁ = L + λΣᵢ|θᵢ|. Creates sparsity: many θᵢ are driven exactly to zero. Bayesian interpretation: MAP with a Laplace (double exponential) prior. Used less often in neural networks: the penalty is non-differentiable at zero (only a subgradient exists), which makes optimization less stable.
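Why L1 produces exact zeros is easiest to see through soft-thresholding, the proximal step standard LASSO solvers use. A minimal sketch (the coefficient values are illustrative):

```python
import numpy as np

def soft_threshold(theta, tau):
    # Proximal operator of tau * ||theta||_1: shrink every coefficient
    # toward zero by tau; anything with |theta_i| <= tau becomes exactly 0
    return np.sign(theta) * np.maximum(np.abs(theta) - tau, 0.0)

theta = np.array([0.8, -0.05, 0.02, -1.5, 0.0])
sparse = soft_threshold(theta, tau=0.1)
print(sparse)  # small coefficients are zeroed out, large ones shrunk
```

L2 shrinkage, by contrast, multiplies weights by a factor less than one and never lands exactly on zero.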

Dropout (Srivastava et al., 2014)

Idea: During training, each neuron "shuts off" with probability p (usually p = 0.5 for fully connected, p = 0.1 for convolutions). Mask mult: zˡ = aˡ⊙m, m_i ~ Bernoulli(1−p). During inference: all neurons are active, outputs are scaled by (1−p). Inverted dropout (PyTorch default): during training, divide by (1−p).

Dropout interpretations:

  1. Ensemble: train exponentially many "sparse" subnetworks, at inference — average (approximately via scaling).
  2. Regularization: neurons cannot "rely" on specific partners → more independent features.
  3. Noise: adds noise to activations → regularization through stochasticity.
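Inverted dropout as described above can be sketched with a numpy mask (the drop probability and layer size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def inverted_dropout(a, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p and divide
    survivors by (1-p), so the expected activation is unchanged and
    inference needs no rescaling."""
    if not training:
        return a  # inference: identity
    mask = rng.random(a.shape) >= p  # keep each unit with prob 1-p
    return a * mask / (1.0 - p)

a = np.ones(10_000)
out = inverted_dropout(a, p=0.5)

# Expectation is preserved: the mean stays near 1.0 despite the zeros
print(out.mean())
```

The classic (non-inverted) variant instead scales by (1−p) at inference, which is easy to forget; this is why frameworks such as PyTorch default to the inverted form.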

Dropout perturbs batch statistics and interacts poorly with Batch Normalization (variance shift between training and inference). In modern architectures, BN largely replaces dropout in the convolutional parts.

Batch Normalization (Ioffe & Szegedy, 2015)

Idea: Normalize activations across the batch before applying activation. For batch B = {x₁,...,xₘ}:

x̂ᵢ = (xᵢ − μB)/√(σB² + ε), yᵢ = γx̂ᵢ + β

Here, μB = (1/m)Σᵢxᵢ, σB² = (1/m)Σᵢ(xᵢ−μB)², γ and β are trainable parameters (affine transform).
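The training-mode forward pass can be sketched for a (batch, features) tensor (running averages for inference are omitted; the input distribution is illustrative):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Normalize each feature over the batch,
    # then apply the trainable affine transform y = gamma * x_hat + beta.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 4))  # shifted, scaled inputs
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

# Per-feature mean ~0 and std ~1, regardless of the input scale
print(y.mean(axis=0), y.std(axis=0))
```

With non-trivial γ and β the layer can undo the normalization if that helps the loss, which is why BN does not restrict expressiveness.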

Advantages: stabilizes training, allows larger learning rates, reduces sensitivity to initialization. Regularizing effect (via batch statistics noise). At inference: use running mean/variance (exponential moving average).

Limitations: effective for batch size ≥ 16. For small batches — Group Normalization, Instance Normalization.

Data Augmentation

Idea: Expand the training set via "synthetic" transformations that do not change the class. For images: horizontal flip (50%), random crop, color jitter, rotation ±15°, cutout (random rectangular masking).
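The flip + crop combination can be sketched as follows (assuming HWC image layout and 4-pixel zero padding, the common CIFAR-10 recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_flip_and_crop(img, pad=4):
    """Horizontal flip with probability 0.5, then a random crop
    of the original size from a zero-padded image."""
    h, w, _ = img.shape
    if rng.random() < 0.5:
        img = img[:, ::-1, :]  # mirror left-right
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]

img = rng.random((32, 32, 3))
aug = random_flip_and_crop(img)
print(aug.shape)  # the crop restores the original 32x32x3 shape
```

Each epoch the network sees a slightly different version of every image, which acts as implicit regularization.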

Mixup (Zhang et al., 2018): x_mix = λx_i + (1−λ)x_j, y_mix = λy_i + (1−λ)y_j, λ ~ Beta(α,α). Train on "mixed" examples → smoother predictions, better generalization.
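A minimal mixup sketch for one batch, mixing it with a shuffled copy of itself (the batch contents, α, and the one-hot label encoding are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x, y, alpha=1.0):
    """Mixup: convex combination of a batch with a shuffled copy.
    y is assumed one-hot so labels can be mixed the same way."""
    lam = rng.beta(alpha, alpha)
    idx = rng.permutation(len(x))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y + (1 - lam) * y[idx]
    return x_mix, y_mix, lam

x = rng.normal(size=(8, 3, 32, 32))           # toy image batch (NCHW)
y = np.eye(10)[rng.integers(0, 10, size=8)]   # one-hot labels
x_mix, y_mix, lam = mixup(x, y)
print(lam, y_mix.sum(axis=1))  # mixed label rows still sum to 1
```

Training then uses the ordinary cross-entropy loss against the soft `y_mix` targets.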

CutMix (2019): cut and paste patches between images → labels are mixed accordingly.

RandAugment (Cubuk et al., 2019): randomly choose N out of K augmentations with strength M. Simplifies the augmentation policy search.

Early Stopping

Algorithm: Train, periodically evaluate val loss. Save best weights (checkpoint) at minimum val loss. Stop if patience is exceeded (patience = 10–20 epochs without improvement).
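The checkpoint-and-patience loop can be sketched as a small helper class (the loss curve and patience value below are illustrative):

```python
class EarlyStopping:
    """Signal a stop when val loss has not improved for `patience`
    consecutive checks; remember the best value and when it occurred."""
    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_step = -1
        self.bad_checks = 0

    def step(self, val_loss, step_idx):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.best_step = step_idx  # a checkpoint would be saved here
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience  # True -> stop training

# Toy val-loss curve: improves for four epochs, then plateaus
losses = [1.0, 0.8, 0.7, 0.65] + [0.66] * 15
stopper = EarlyStopping(patience=5)
stopped_at = None
for i, loss in enumerate(losses):
    if stopper.step(loss, i):
        stopped_at = i
        break
print(stopper.best, stopper.best_step, stopped_at)
```

After stopping, the weights saved at `best_step` (not the final ones) are restored.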

Theoretically: for a quadratic loss, early stopping acts like L2 regularization (Bishop, 2006); the number of training iterations plays the role of ≈ 1/λ, so stopping earlier regularizes more strongly.

Numerical Example

ResNet-50 on CIFAR-10 (50K train, 10K test):

  • No regularization: train acc 99%, test acc 88% (11-point train–test gap)
  • L2 (λ=1e-4): test acc 91%
  • Dropout (p=0.5 before FC): test acc 91.5%
  • Data augmentation (flip+crop): test acc 93.2%
  • Mixup (α=1.0): test acc 94.1%
  • All together: test acc 95.3%

Assignment: Reproduce the experiment on CIFAR-10: (1) Basic CNN (without regularization) — record train/val accuracy. (2) Add one technique at a time: L2 λ=1e-4, dropout p=0.3, BN, augmentation. (3) Plot epochs vs val accuracy for each configuration. (4) Implement early stopping with patience=10. After how many epochs does training stop in each case?
