
Explainable AI and Analysis of Neural Network Representations

Neural networks are "black boxes": they output predictions but do not explain their decisions. In high-stakes domains (medicine, finance, law), explainability can be a legal requirement (GDPR Art. 22 on automated decision-making). Interpretability also helps debug models and detect systematic errors.

Visualization of CNN Features

What has the CNN learned? One can directly inspect the first-layer filters $W^{(1)}$ (dimensions 3×3×3 or 7×7×3): they often resemble Gabor-like edge and color detectors. Deeper layers, however, are high-dimensional and not directly interpretable.

Activation maps: for an image $x$, we visualize the feature map $a^l[k]$ (the $k$-th channel of layer $l$). Bright pixels = strong response to the $k$-th pattern.
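A minimal sketch of extracting and viewing one activation map with a forward hook; the ResNet-18 backbone, the choice of layer2, the random input tensor, and the channel index are illustrative assumptions.

```python
import torch
import torchvision.models as models
import matplotlib.pyplot as plt

# Assumed setup: a pretrained ResNet-18 and a preprocessed image tensor
# of shape (1, 3, 224, 224); layer and channel choices are illustrative.
model = models.resnet18(weights="IMAGENET1K_V1").eval()

activations = {}

def hook(module, inp, out):
    activations["layer2"] = out.detach()   # store the feature maps a^l

model.layer2.register_forward_hook(hook)

x = torch.randn(1, 3, 224, 224)            # stand-in for a real preprocessed image
with torch.no_grad():
    model(x)

fmap = activations["layer2"][0]            # (C, H, W) feature maps of layer2
k = 5                                      # channel to inspect
plt.imshow(fmap[k].cpu(), cmap="viridis")  # bright = strong response of filter k
plt.title(f"layer2, channel {k}")
plt.show()
```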

Grad-CAM (Selvaraju et al., 2017):

$ w_k = \frac{1}{Z}\sum_{i,j} \frac{\partial y^c}{\partial A_{ij}^k} \quad \text{(global average pooling of the gradients)} $

$ L_\text{Grad-CAM} = \mathrm{ReLU}\left(\sum_k w_k A^k\right) $

Here $y^c$ is the score for class $c$ and $A^k$ is the feature map of channel $k$ in the last convolutional layer. The heatmap shows which image regions influenced the prediction. Grad-CAM++ and Score-CAM are refinements.
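A compact Grad-CAM sketch following the two formulas above; the ResNet-18 backbone, the choice of layer4 as the last convolutional block, and the random input are assumptions standing in for a real model and image.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()

feats, grads = {}, {}

def fwd_hook(module, inp, out):
    feats["A"] = out                                  # feature maps A^k
    out.register_hook(lambda g: grads.update(dA=g))   # gradient dy^c/dA^k

model.layer4.register_forward_hook(fwd_hook)          # last conv block

x = torch.randn(1, 3, 224, 224)        # stand-in for a real preprocessed image
scores = model(x)                      # class scores y^c
c = scores.argmax(dim=1).item()        # explain the predicted class
scores[0, c].backward()

A = feats["A"]                                     # (1, K, H, W)
w = grads["dA"].mean(dim=(2, 3), keepdim=True)     # w_k: GAP of the gradients
cam = F.relu((w * A).sum(dim=1, keepdim=True))     # ReLU(sum_k w_k A^k)
cam = F.interpolate(cam, size=x.shape[-2:],
                    mode="bilinear", align_corners=False)[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heatmap in [0, 1]
```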

Interpretability via Feature Perturbations

LIME (Ribeiro et al., 2016): to explain a prediction $f(x)$, generate $n'$ "perturbed" versions $x'$ (randomly mask superpixels); obtain $f(x'_1),\dots,f(x'_{n'})$; weight them by proximity to $x$; fit an interpretable model $g$ (e.g., lasso): $\min_g E_\pi[(f(x') - g(x'))^2] + \Omega(g)$. The coefficients of $g$ are the explanation.
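A minimal LIME-style sketch under these definitions; `image`, `predict_fn`, the SLIC superpixel settings, and the kernel width are assumptions (`predict_fn` is any function returning the probability of the explained class for a batch of images).

```python
import numpy as np
from skimage.segmentation import slic
from sklearn.linear_model import Lasso

def lime_explain(image, predict_fn, n_samples=500, kernel_width=0.25):
    segments = slic(image, n_segments=50, compactness=10)   # superpixels
    seg_ids = np.unique(segments)

    # Interpretable representation: binary vector "superpixel on/off"
    Z = np.random.randint(0, 2, size=(n_samples, len(seg_ids)))
    Z[0] = 1                                      # keep the original image

    perturbed = []
    for z in Z:
        img = image.copy()
        off = seg_ids[z == 0]
        img[np.isin(segments, off)] = 0           # mask the "off" superpixels
        perturbed.append(img)
    preds = predict_fn(np.stack(perturbed))       # f(x') for each perturbation

    # Proximity weights pi(x, x'): exponential kernel on the masked fraction
    d = 1.0 - Z.mean(axis=1)
    pi = np.exp(-(d ** 2) / kernel_width ** 2)

    # Sparse weighted surrogate g; its coefficients are the explanation
    g = Lasso(alpha=0.01)
    g.fit(Z, preds, sample_weight=pi)
    return g.coef_, segments
```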

Integrated Gradients: $\mathrm{IG}_i(x) = (x_i - x'_i)\cdot \int_0^1 \left.\frac{\partial f}{\partial x_i}\right|_{x' + \alpha(x - x')} d\alpha$. Attributes the contribution of feature $i$ as the integrated gradient from the baseline $x'$ to $x$. Axioms: completeness ($\sum_i \mathrm{IG}_i = f(x) - f(x')$), sensitivity (if feature $i$ affects the output, then $\mathrm{IG}_i \neq 0$).
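A short Riemann-sum approximation of the integral above; `model`, the zero baseline, the flat input shape, and the number of steps are assumptions.

```python
import torch

def integrated_gradients(model, x, target_class, steps=50):
    """Approximate IG for an input x of shape (1, d) with a zero baseline."""
    baseline = torch.zeros_like(x)                     # x'
    alphas = torch.linspace(0, 1, steps).view(-1, 1)   # interpolation points
    path = baseline + alphas * (x - baseline)          # x' + alpha (x - x')
    path.requires_grad_(True)

    out = model(path)[:, target_class].sum()
    out.backward()
    avg_grad = path.grad.mean(dim=0, keepdim=True)     # approximates the integral
    return (x - baseline) * avg_grad                   # IG_i(x) per feature
```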

Probing

Idea: what do the hidden representations of a neural network "store"? Take $h$, the hidden vector from layer $l$, and train a simple (linear) classifier $P: h \rightarrow$ label. If $P$ works well, the information about the label is encoded in $h$.

BERTology: probing BERT's layers showed that lower layers capture morphology, middle layers syntax (POS, NER), and upper layers semantics and discourse.

Probing classifier: for each layer $l$, train logistic regression: $P: a^l(x) \rightarrow$ target. The accuracy of $P$ as a function of $l$ shows which layer best encodes the target.
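A probing sketch along these lines; `acts` (a dict mapping layer index to an (N, d_l) activation matrix) and `labels` are assumed to be precomputed.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layers(acts, labels):
    """Train one linear probe per layer and return held-out accuracy per layer."""
    scores = {}
    for l, A in acts.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            A, labels, test_size=0.3, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores[l] = clf.score(X_te, y_te)   # probe accuracy for layer l
    return scores                           # accuracy as a function of l
```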

Mechanistic Interpretability

Superposition hypothesis (Elhage et al., 2022): neural networks store more "features" than there are neurons. A single neuron encodes several features via interference. Sparse autoencoders allow one to separate superposed features.
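A toy sparse-autoencoder sketch in this spirit: an overcomplete dictionary trained to reconstruct hidden activations with an L1 penalty on the code; all sizes, coefficients, and the random activations are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d, n_features):
        super().__init__()
        self.enc = nn.Linear(d, n_features)   # n_features >> d (overcomplete)
        self.dec = nn.Linear(n_features, d)

    def forward(self, h):
        z = torch.relu(self.enc(h))           # sparse feature activations
        return self.dec(z), z

sae = SparseAutoencoder(d=256, n_features=2048)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

h = torch.randn(64, 256)                     # stand-in for real hidden activations
recon, z = sae(h)
loss = ((recon - h) ** 2).mean() + 1e-3 * z.abs().mean()  # MSE + L1 sparsity
loss.backward()
opt.step()
```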

Induction heads (Olsson et al., 2022): certain attention heads in transformers perform "in-context learning"—after training, they detect A...B...A patterns and predict B. This is a mechanistic explanation for few-shot learning in GPT.

Circuit analysis: find the minimal "computational circuit"—a subgraph of weights—responsible for a specific behavior (for example, addition of numbers). IOI task (Indirect Object Identification): "John and Mary went to the store. Mary gave a book to ___"—specific attention heads responsible for computing this answer have been identified.

Numerical Example

A ResNet-50 diagnoses skin cancer (HAM10000 dataset) but gives no explanations. We apply Grad-CAM: for a "melanoma" image, the heatmap focuses on a dark spot with irregular edges; for "seborrheic keratosis", on comedones.

Problem: Grad-CAM showed that the model was focused on... the doctor's ruler! (A confound: the scale marker was present only on malignant images.) Interpretability helped reveal this systematic bias.

Task: Train ResNet-18 on Cats vs Dogs (25K photos). (1) Implement Grad-CAM for 20 random test images. Visualize the heatmaps. Interpret: what parts of the cat/dog are important for the classifier? (2) Conduct probing: for each residual block, train a linear classifier (background color, breed, presence of sky). Which layer best encodes each feature? (3) Implement LIME for 5 incorrect predictions: why did the model err?
