Module II · Article I
Training Deep Networks: Problems and Solutions
Deep Learning: Theory and Practice
Training neural networks with dozens of layers is a non-trivial task. Vanishing gradients, exploding gradients, dead neurons, and poor initialization are all practical problems that took decades of research to solve.
The Vanishing/Exploding Gradient Problem
Mechanism: during backpropagation through $L$ layers, $\dfrac{\partial \mathcal{L}}{\partial x_0} = \left(\prod_{l=1}^{L} \dfrac{\partial a^l}{\partial a^{l-1}}\right) \cdot \dfrac{\partial \mathcal{L}}{\partial a^L}$, where $\mathcal{L}$ is the loss. Each factor is $\dfrac{\partial a^l}{\partial a^{l-1}} = W^l \cdot \operatorname{diag}(\sigma'(z^l))$. If $||W^l \cdot \operatorname{diag}(\sigma'(z^l))|| < 1$ for every layer, the product $\rightarrow 0$ exponentially; if $> 1$, the product $\rightarrow \infty$.
Sigmoid: $\sigma'(z) \leq 0.25$ (maximum at $z=0$), and tanh saturates similarly. For $L$ sigmoid layers: $||\dfrac{\partial \mathcal{L}}{\partial x_0}|| \leq (0.25 \cdot \max_l||W^l||)^L \cdot ||\dfrac{\partial \mathcal{L}}{\partial a^L}|| \rightarrow 0$ whenever $\max_l||W^l|| < 4$. ReLU eliminates the saturation problem: $\sigma'(z) = 1$ for $z>0$, so the gradient passes through without attenuation.
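To see the effect numerically, here is a minimal sketch (the depth, width, and dummy loss are illustrative choices, not from the text) comparing the gradient norm that reaches the first layer of a deep sigmoid MLP versus a deep ReLU MLP:

```python
import torch
import torch.nn as nn

def make_mlp(depth, width, act):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), act()]
    return nn.Sequential(*layers)

torch.manual_seed(0)
x = torch.randn(32, 64)

for name, act in [("sigmoid", nn.Sigmoid), ("relu", nn.ReLU)]:
    net = make_mlp(depth=30, width=64, act=act)
    net(x).pow(2).mean().backward()          # dummy loss, just to get gradients
    # gradient norm at the very first layer: typically orders of magnitude
    # smaller for the sigmoid network than for the ReLU one
    print(name, net[0].weight.grad.norm().item())
```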
Gradient clipping (against explosion): $\nabla \leftarrow \nabla \cdot \min(1, \tau/||\nabla||)$ caps the gradient norm at a threshold $\tau$. Standard in RNN/LSTM training. PyTorch: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).
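Clipping goes between backward() and optimizer.step(). A self-contained toy sketch (the linear model and random data are only there to make it runnable):

```python
import torch
import torch.nn as nn

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 10), torch.randn(8, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# rescales all gradients in place so their global L2 norm is at most max_norm (tau)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```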
Residual Learning
Degradation problem: as depth increases in plain (non-residual) networks, training error grows. The issue is that optimization becomes harder, not overfitting: a 56-layer plain network performs worse than a 20-layer one on CIFAR-10, even on the training set.
Residual block (He et al., 2015): $H(x) = F(x) + x$, where $F(x) = W_2 \cdot \operatorname{ReLU}(W_1x + b_1) + b_2$. The network learns the residual $F(x) = H(x) - x$ rather than the full transformation $H(x)$. If the optimal $H(x) \approx x$ (identity mapping), it is enough to drive $F(x) \rightarrow 0$, which is easy. The shortcut connection carries the gradient directly, so it does not vanish.
Mathematically, for a stack of residual blocks: $a^N = a^l + \sum_{k=l}^{N-1} F_k(a^k)$. Gradient: $\dfrac{\partial a^N}{\partial a^l} = I + \dfrac{\partial}{\partial a^l}\sum_{k=l}^{N-1} F_k(a^k)$. The identity term is always present, so the gradient cannot vanish.
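A minimal fully connected residual block matching the formula above (He et al.'s original blocks use convolutions and batch normalization; this stripped-down version only illustrates the shortcut):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """H(x) = F(x) + x with F(x) = W2 @ relu(W1 x + b1) + b2."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        f = self.fc2(torch.relu(self.fc1(x)))  # residual branch F(x)
        return f + x                           # shortcut: identity + residual

block = ResidualBlock(64)
print(block(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```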
DenseNet (Huang et al., 2017): each layer receives the concatenation of all previous feature maps as input: $x^l = \sigma(W^l[x^0, x^1, ..., x^{l-1}])$, where $[\cdot]$ denotes concatenation. Maximum feature reuse; fewer parameters at comparable accuracy.
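A fully connected sketch of the dense connectivity pattern (real DenseNets use convolutional layers and a fixed "growth rate"; the sizes here are illustrative):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """x_l = sigma(W_l [x_0, ..., x_{l-1}]): concatenate all earlier outputs, then transform."""
    def __init__(self, in_dim, growth):
        super().__init__()
        self.fc = nn.Linear(in_dim, growth)

    def forward(self, features):  # features: list of all earlier outputs
        return torch.relu(self.fc(torch.cat(features, dim=1)))

x0 = torch.randn(8, 16)
features = [x0]
for l in range(3):
    # each new layer sees the concatenation of the input and all previous layers
    layer = DenseLayer(in_dim=16 * (l + 1), growth=16)
    features.append(layer(features))
print(torch.cat(features, dim=1).shape)  # torch.Size([8, 64])
```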
Highway Networks and Gating Mechanisms
Highway Network (Srivastava et al., 2015): $y = H(x,W_h)\cdot T(x,W_t) + x\cdot(1-T(x,W_t))$, where $T = \operatorname{sigmoid}(W_tx + b_t)$ is the "transform gate". If $T \rightarrow 0$: $y=x$ (pass-through path); if $T \rightarrow 1$: $y=H(x)$ (full transformation). The gating mechanism is adapted from LSTM gates.
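A sketch of one highway layer. The ReLU inside $H$ and the negative initialization of the gate bias (so the layer starts close to a pass-through) are common choices assumed here, not stated above:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = H(x) * T(x) + x * (1 - T(x)), with T a sigmoid 'transform gate'."""
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)
        self.T = nn.Linear(dim, dim)
        # negative bias => T starts near 0, so the layer initially passes x through
        nn.init.constant_(self.T.bias, -2.0)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))              # transform gate in (0, 1)
        return torch.relu(self.H(x)) * t + x * (1 - t)

layer = HighwayLayer(64)
print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```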
Normalization Within Layers
Layer Normalization: normalizes along the feature axis (not the batch axis): $\hat{x} = (x - \mu_{\text{feature}})/\sigma_{\text{feature}}$, followed by a learned affine transform. Independent of batch size, so it works even at $\text{bs}=1$. Standard in NLP (BERT, GPT).
Group Normalization: divides channels into $G$ groups, normalizes within each. Compromise between BN and LN. Works at $\text{bs}=1–2$.
RMSNorm: $\hat{x} = x/\operatorname{RMS}(x)$, where $\operatorname{RMS}(x) = \sqrt{\operatorname{mean}(x^2)}$. Cheaper than LayerNorm (no mean centering). Used in LLaMA and Mistral.
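A minimal RMSNorm next to PyTorch's built-in nn.LayerNorm. The epsilon and the learned per-feature scale are conventional additions (e.g. as in LLaMA-style implementations), not given in the text:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """x_hat = x / RMS(x), with a learned per-feature scale and no mean subtraction."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.weight * x / rms

x = torch.randn(4, 10, 64)            # (batch, tokens, features)
print(RMSNorm(64)(x).shape)           # torch.Size([4, 10, 64])
print(nn.LayerNorm(64)(x).shape)      # built-in LayerNorm normalizes the same axis
```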
Transfer Learning and Fine-tuning
Pretrained models: a network trained on ImageNet (1.2M images, 1000 classes) extracts rich, general-purpose features from images. For transfer to a new task (medical imaging, satellite images), use the pretrained lower layers as a feature extractor.
Transfer learning strategies (a code sketch follows below):
- Feature extraction: freeze all pretrained layers, train only the classifier head. Fast and needs little data.
- Fine-tuning: unfreeze the upper layers of the pretrained network and train them together with the classifier head. Better when more than ~1K examples are available.
- Full fine-tuning: retrain all layers. Requires a lot of data and a small learning rate ($1\mathrm{e}{-5}$ instead of $1\mathrm{e}{-3}$).
Rule of thumb: the closer the task is to ImageNet, the fewer layers need retraining; the more data you have, the more layers you can afford to fine-tune.
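A sketch of the first two strategies with a torchvision ResNet-50 (assumes torchvision ≥ 0.13 for the weights API; num_classes is a placeholder for the target task):

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # placeholder: number of classes in the target task

# Strategy 1: feature extraction -- freeze the pretrained backbone, train only the new head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head is trainable by default

# Strategy 2: fine-tuning -- additionally unfreeze the last residual stage, use a small lr.
for p in model.layer4.parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```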
Numerical Example
Task: diagnosis of tumors by MRI (1000 labeled scans). Approach 1: CNN from scratch $\rightarrow$ val accuracy 72% (little data, overfitting). Approach 2: ResNet-50 (ImageNet), feature extraction $\rightarrow$ 85% (freeze everything except FC). Approach 3: ResNet-50, fine-tune last 2 blocks $\rightarrow$ 91% (optimal for 1000 examples). Approach 4: ResNet-50, full fine-tune $\rightarrow$ 89% (worse than 3 — overfitting with little data).
Exercise: implement transfer learning for the Stanford Cars dataset (196 car classes, 16K photos): (1) Feature extraction with ResNet-50 (freeze all Conv layers, train only the FC head). (2) Fine-tune the top 2 blocks of ResNet-50 (lr=1e-4). (3) Full fine-tuning (lr=1e-5). Compare val accuracy and training time. (4) Add strong augmentation (RandAugment). Visualize which image regions produce the strongest activations in the last Conv layer (Grad-CAM).