
Convolutional Neural Networks (CNN)

Fundamentals of Neural Networks



Image processing has become a breakthrough application of neural networks. Convolutional networks (CNN, LeCun et al., 1989) leverage the local structure of images: features (edges, textures) are local and shift-invariant. CNNs revolutionized computer vision: they won ImageNet decisively in 2012 and surpassed human-level accuracy by 2015.

Convolution Operation

A fully connected network for a 224×224×3 image: the first layer has 224×224×3 = 150,528 inputs. With 1,000 neurons: ~150M parameters — inefficient, and it does not scale.

2D Convolution: A kernel (filter) $K \in \mathbb{R}^{f\times f\times C_{in}}$ “slides” over the input image $X \in \mathbb{R}^{H\times W\times C_{in}}$:

$(X * K)[i,j] = \sum_{m=0}^{f-1} \sum_{n=0}^{f-1} \sum_{c=0}^{C_{in}-1} X[i+m, j+n, c] \cdot K[m, n, c] + b$

Here $K$ is a filter of size $f\times f$, applied to each $f\times f$ patch of the input. The output object (feature map): $H' \times W' \times C_{out}$ ($C_{out}$ filters). Number of parameters in a convolutional layer: $f \times f \times C_{in} \times C_{out} + C_{out}$ (vs. $H\times W\times C_{in} \times C_{out}$ in fully connected).
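The triple sum above maps directly to a few lines of NumPy. A minimal sketch of one filter applied with no padding and stride 1 (a loop version for clarity, not an efficient implementation):

```python
import numpy as np

def conv2d(X, K, b=0.0):
    """Valid 2D convolution (cross-correlation, as used in deep learning).

    X: input of shape (H, W, C_in); K: one filter of shape (f, f, C_in).
    Returns a single-channel feature map of shape (H - f + 1, W - f + 1).
    """
    H, W, _ = X.shape
    f = K.shape[0]
    out = np.zeros((H - f + 1, W - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # inner product of the f x f x C_in patch with the kernel, plus bias
            out[i, j] = np.sum(X[i:i + f, j:j + f, :] * K) + b
    return out

X = np.random.randn(32, 32, 3)
K = np.random.randn(3, 3, 3)
print(conv2d(X, K).shape)  # (30, 30)
```

With $C_{out}$ filters, this function is applied once per filter and the results are stacked along the channel axis.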

Key properties of CNNs:

  • Weight sharing: one filter is applied to the entire image → shift invariance
  • Locality: each neuron “looks at” a local patch → local features
  • Hierarchy: early layers — edges, color; deep layers — objects, semantics

Padding: SAME — pad with zeros, output has the same spatial size as the input; VALID — no padding, output is smaller. Stride: the sliding step $s$; with $s > 1$ the output shrinks by a factor of $s$.
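All three cases follow from one formula for the output spatial size, $H' = \lfloor (H + 2p - f)/s \rfloor + 1$, where $p$ is the padding. A quick check:

```python
def conv_out_size(n, f, p=0, s=1):
    """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# VALID (p=0): 224x224 input, 3x3 kernel, stride 1 -> output shrinks
print(conv_out_size(224, 3))            # 222
# SAME (p=1 for f=3, s=1): spatial size preserved
print(conv_out_size(224, 3, p=1))       # 224
# stride 2 halves the spatial size
print(conv_out_size(224, 3, p=1, s=2))  # 112
```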

Pooling and CNN Architecture

Max pooling: take the maximum in a $k\times k$ window. Reduces spatial size by a factor of $k$. Creates invariance to small shifts and deformations. Has no parameters.

Global Average Pooling (GAP): average over the entire feature map → vector of $C$ values. Replaces fully connected layers at the end — fewer parameters, better regularization.

Typical architecture: Conv → BN → ReLU → Conv → BN → ReLU → Pool → ... → GAP → FC → Softmax.
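The typical pattern above can be sketched in a few lines of PyTorch (the channel widths here are illustrative, not prescribed by the article; Softmax is omitted because it lives inside the cross-entropy loss):

```python
import torch
import torch.nn as nn

# Minimal sketch of the Conv -> BN -> ReLU -> Pool -> GAP -> FC pattern.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.MaxPool2d(2),              # halves the spatial size
    nn.AdaptiveAvgPool2d(1),      # Global Average Pooling -> (N, 64, 1, 1)
    nn.Flatten(),                 # -> (N, 64)
    nn.Linear(64, 10),            # FC -> class logits
)

x = torch.randn(8, 3, 32, 32)
print(model(x).shape)  # torch.Size([8, 10])
```

Note how GAP makes the head independent of the input resolution: the final `Linear` sees only the channel count.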

Evolution of Architectures

AlexNet (Krizhevsky, Sutskever, Hinton, 2012): 5 convolutional layers + 3 fully connected. ImageNet Top-5 error: 15.3% (vs 26% for classical methods). Used ReLU, dropout, data augmentation — key innovations.

VGGNet (2014): deep networks using only $3\times 3$ convolutions + max-pool. "Two 3×3 = one 5×5, but with fewer parameters." Top-5: 7.3%.

ResNet (He et al., 2015): residual connections $F(x) + x$ — the “shortcut” enables training networks with 50, 100, 152 layers. It solves the degradation problem (deeper plain networks train worse), which is distinct from vanishing gradients. Top-5: 3.57% (better than the human 5.1%). Mathematically: the block approximates the residual $F(x) = H(x) - x$, not the complete transformation $H(x)$.
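A basic residual block is easy to write down. A sketch assuming input and output channel counts match (real ResNets also use strided and 1×1 "projection" shortcuts when shapes change):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), F = two 3x3 convs."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.f(x) + x)  # the shortcut: F(x) + x

x = torch.randn(4, 64, 16, 16)
print(ResidualBlock(64)(x).shape)  # same shape as the input
```

If $F$ learns to output zeros, the block is an identity map — which is why stacking many such blocks does not degrade training.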

EfficientNet (Tan & Le, 2019): scaling depth, width, and resolution together using a compound coefficient. SOTA with fewer parameters — found through NAS.

Specialized Architectures

U-Net (2015): Encoder-decoder with skip connections. For semantic segmentation: need to predict class for each pixel. Skip connections preserve high-frequency spatial details.

YOLO (You Only Look Once): Object detection in a single network pass. We divide the image into $S\times S$ cells, each predicts $B$ bounding boxes + classes. Speed: 30+ FPS — real-time detection.

Numerical Example

VGG-16 on ImageNet: 13 convolutional layers ($3\times 3$) + 3 fully connected. Parameters: ~138M. First convolutional layer: 64 filters, $3\times 3\times 3 = 1{,}728$ weights each. The middle $4096\times 4096$ fully connected layer alone: 16.7M weights — the majority of parameters sit in the FC layers! Global Average Pooling instead of FC reduces the count to ~7M with comparable quality.

ResNet-50 vs VGG-16: 25M parameters (vs 138M), Top-1 accuracy: 76% vs 74%. Depth + residual connections > width.
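These counts follow directly from the layer formulas quoted earlier ($f \times f \times C_{in} \times C_{out}$ for a conv filter bank, $n_{in} \times n_{out}$ weights for an FC layer). A quick arithmetic check:

```python
# Weights of the first VGG-16 conv layer: 3x3 kernel, 3 input channels, 64 filters
conv1 = 3 * 3 * 3 * 64        # per the conv parameter formula (biases excluded)

# Weights of the 4096 -> 4096 fully connected layer
fc_mid = 4096 * 4096          # ~16.7M

print(conv1)    # 1728
print(fc_mid)   # 16777216
```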

Exercise: Implement a simple CNN in PyTorch for CIFAR-10 ($32\times 32\times 3$, 10 classes): [Conv(3→32, $3\times 3$) → BN → ReLU → MaxPool → Conv(32→64, $3\times 3$) → BN → ReLU → MaxPool → FC($64\times 8\times 8 \to 256$) → FC($256 \to 10$)]. Train for 30 epochs, Adam, lr=0.001. Compare with a fully connected network [3072→256→128→10]. Build a confusion matrix. Visualize the first layer’s filters.
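One possible layout of the exercise architecture as a PyTorch sketch (the training loop, CIFAR-10 loading, confusion matrix, and filter visualization are left to the reader; the ReLU after the first FC is an assumption, as the exercise does not specify it):

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Conv(3->32) -> BN -> ReLU -> MaxPool -> Conv(32->64) -> BN -> ReLU
    -> MaxPool -> FC(64*8*8 -> 256) -> FC(256 -> 10), for 32x32x3 inputs."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                       # 32x32 -> 16x16
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                       # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SimpleCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # as in the exercise
print(model(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 10])
```

The first layer's filters can be read off `model.features[0].weight` (shape 32×3×3×3) for visualization.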
