Module IV·Article III·~4 min read

Interpretability and Reliability of DL Models



A neural network predicts cancer from an MRI scan. The doctor asks: "Why?" A bank denies a loan. The client demands an explanation. A regulator audits the model. Interpretability is not an academic toy, but a legal and ethical requirement. The GDPR in the EU guarantees the “right to explanation” of automated decisions.

Global vs. Local Interpretability

Global: to understand how the model works as a whole, i.e. which features matter across all predictions. Example: feature importance in a random forest, computed as the MDI (mean decrease in impurity) averaged over all trees (a short sketch follows below).

Local: to explain a specific prediction — why the model made this decision for this particular object. Critical in high-stakes applications (medicine, lending, justice).
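A minimal sketch of the global example above, assuming scikit-learn and using its built-in breast-cancer dataset purely for illustration: MDI importances are read directly off a fitted random forest.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Global importance: MDI averaged over all trees in the ensemble.
top5 = sorted(zip(forest.feature_importances_, load_breast_cancer().feature_names),
              reverse=True)[:5]
print(top5)
```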

SHAP: An Axiomatically Grounded Method

Shapley values (from game theory): The contribution of feature $i$ in the “coalition game” $f$:

$ \varphi_i(f) = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \cdot [f(S \cup \{i\}) - f(S)] $

Interpretation: we average the marginal contribution of feature $i$ over all possible coalitions $S$ of the remaining features. The factor $|S|! \, (|F| - |S| - 1)! / |F|!$ is the weight of a given coalition; it corresponds to a uniformly random order of adding features.

SHAP axioms: (1) Efficiency: $\sum_i \varphi_i = f(x) - \mathbb{E}[f(X)]$, i.e. the contributions sum to the deviation from the baseline. (2) Symmetry: if $i$ and $j$ contribute equally to every coalition, then $\varphi_i = \varphi_j$. (3) Dummy: $\varphi_i = 0$ if $i$ affects no prediction.
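To make the formula concrete, here is a minimal sketch that computes exact Shapley values by enumerating all coalitions. The toy value function `payoff` is invented for illustration, and the exponential enumeration is feasible only for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(v, features):
    """Exact Shapley values of set function v by enumerating all coalitions (O(2^p), toy sizes only)."""
    p = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(p):
            for S in combinations(others, size):
                S = frozenset(S)
                # weight |S|! (p - |S| - 1)! / p!  -- a uniformly random insertion order
                w = factorial(len(S)) * factorial(p - len(S) - 1) / factorial(p)
                total += w * (v(S | {i}) - v(S))
        phi[i] = total
    return phi

# Toy "coalition game": value of each feature subset (hypothetical numbers).
payoff = {frozenset(): 0.0, frozenset({"a"}): 1.0, frozenset({"b"}): 2.0,
          frozenset({"a", "b"}): 4.0}
v = lambda S: payoff[frozenset(S)]

phi = shapley_values(v, ["a", "b"])
print(phi)                                           # {'a': 1.5, 'b': 2.5}
print(sum(phi.values()), v({"a", "b"}) - v(set()))   # efficiency: 4.0 == 4.0
```

The last line checks the efficiency axiom: the contributions sum exactly to $v(F) - v(\varnothing)$.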

TreeSHAP (Lundberg et al., 2020): exact SHAP for tree ensembles in $O(TLD^2)$ ($T$ trees, $L$ leaves, $D$ maximum depth), instead of the exponential $O(2^p)$ of the naive algorithm. Fully deterministic.

KernelSHAP: model-agnostic SHAP via LIME-like coalition sampling. More expensive than TreeSHAP, applicable to any model.
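In practice both variants are available in the `shap` package. The sketch below assumes `shap`, `xgboost`, and `scikit-learn` are installed and uses a synthetic dataset purely for illustration.

```python
import shap
import xgboost
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = xgboost.XGBClassifier(n_estimators=100).fit(X, y)

# TreeSHAP: exact and polynomial-time, but tree models only.
tree_explainer = shap.TreeExplainer(model)
shap_values = tree_explainer.shap_values(X[:50])   # one phi_i per feature (log-odds for XGBoost)

# KernelSHAP: model-agnostic, sampling-based; works with any black-box f(x).
f = lambda X_: model.predict_proba(X_)[:, 1]
kernel_explainer = shap.KernelExplainer(f, shap.sample(X, 100))   # background data
kernel_values = kernel_explainer.shap_values(X[:5], nsamples=500)
```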

LIME (Ribeiro et al., 2016): locally approximates $f(x)$ with a linear model $g$ in the neighborhood of $x$: $\arg\min_g \mathbb{E}_{x'\sim\pi_x}[(f(x')-g(x'))^2] + \Omega(g)$. $\pi_x$ is a proximity kernel (Gaussian). $g$ is an interpretable linear model → its weights serve as the explanation.
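A minimal tabular-LIME-style sketch, assuming a black-box `f` that maps a batch of feature vectors to scores; here the complexity penalty $\Omega(g)$ is replaced by a ridge penalty for simplicity.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(f, x, sigma=1.0, n_samples=1000, seed=0):
    """Minimal tabular LIME sketch: fit a locally weighted linear surrogate around x."""
    rng = np.random.default_rng(seed)
    X_pert = x + rng.normal(scale=sigma, size=(n_samples, x.shape[0]))   # samples near x
    y_pert = f(X_pert)                                                   # black-box predictions
    d2 = ((X_pert - x) ** 2).sum(axis=1)
    weights = np.exp(-d2 / (2 * sigma ** 2))                             # Gaussian proximity kernel pi_x
    g = Ridge(alpha=1.0).fit(X_pert, y_pert, sample_weight=weights)      # interpretable surrogate g
    return g.coef_                                                       # local feature importances
```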

Integrated Gradients (Sundararajan, 2017): for differentiable models:

$ IG_i(x) = (x_i - x'_i) \cdot \int_0^1 \frac{\partial f(x' + \alpha(x-x'))}{\partial x_i} d\alpha $

It averages the gradient along the straight path from baseline $x'$ (e.g., a black image) to input $x$. Satisfies sensitivity and completeness axioms.
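A sketch of the Riemann-sum approximation of this integral in PyTorch; `model` is assumed to return logits for a batch, and the baseline is typically `torch.zeros_like(x)` (a black image).

```python
import torch

def integrated_gradients(model, x, baseline, target, steps=50):
    """IG for class `target` via a Riemann-sum approximation of the path integral."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)   # straight path x' -> x
    logits = model(path)                       # (steps, n_classes)
    logits[:, target].sum().backward()         # d f_target / d x at every point on the path
    avg_grad = path.grad.mean(dim=0)           # approximate the integral by the mean gradient
    return (x - baseline) * avg_grad           # completeness: entries sum to ~ f(x) - f(x')

# Usage sketch: ig = integrated_gradients(model, img, torch.zeros_like(img), target=label)
```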

Adversarial Robustness

Adversarial examples: $\delta = \arg\max_{||\delta||_\infty \leq \varepsilon} \mathcal{L}(f(x+\delta), y)$. Add a small (imperceptible to a human) perturbation → incorrect prediction. Panda image → gibbon with 99% confidence.

FGSM (Fast Gradient Sign Method, Goodfellow, 2014): $\delta = \varepsilon \cdot \mathrm{sgn}(\nabla_x \mathcal{L}(f(x),y))$. One step in the direction of increasing loss.
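A one-step FGSM sketch in PyTorch, assuming inputs are scaled to $[0, 1]$ so the result can be clamped back into the valid range.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step FGSM: move each input coordinate by eps along the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()
```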

PGD (Projected Gradient Descent, Madry et al., 2018): multistep FGSM + projection onto the $\varepsilon$-ball:

$ x_{t+1} = \Pi_{\|x' - x\|_\infty \leq \varepsilon}\left[ x_t + \alpha \cdot \mathrm{sgn}(\nabla_x \mathcal{L}(f(x_t),y)) \right] $

PGD is the “strongest” first-order attack.
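A PGD sketch under the same assumptions as the FGSM code above (PyTorch, inputs in $[0, 1]$); the step size `alpha` and the number of steps are typical but arbitrary choices.

```python
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps, alpha, steps=10):
    """PGD attack: repeat signed-gradient steps and project back into the L_inf eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()    # FGSM-style step
        x_adv = x + (x_adv - x).clamp(-eps, eps)        # projection onto the eps-ball around x
        x_adv = x_adv.clamp(0, 1)                       # stay in the valid input range
    return x_adv.detach()
```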

Adversarial Training: train on perturbed examples: $\min_\theta \mathbb{E}[\max_{||\delta||\leq \varepsilon} \mathcal{L}(f(x+\delta), y)]$. Inner maximization — PGD. This is the most reliable defense method. Price: 3–10× slower training, small reduction in clean accuracy.
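A compact adversarial-training loop sketch reusing the `pgd` helper above; `model`, `optimizer`, and `train_loader` are hypothetical stand-ins for your own training setup.

```python
# Outer minimization over theta, inner maximization over delta (solved approximately by PGD).
for x, y in train_loader:                                          # hypothetical DataLoader
    x_adv = pgd(model, x, y, eps=8 / 255, alpha=2 / 255, steps=7)  # inner max: worst-case perturbation
    loss = F.cross_entropy(model(x_adv), y)                        # outer min: train on perturbed inputs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```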

Certified Robustness (Randomized Smoothing, Cohen et al., 2019): smoothed version of the classifier, $g(x) = \arg\max_c P(f(x+\epsilon)=c)$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. Guaranteed robust in an $L_2$-ball of radius $r = \sigma \cdot \Phi^{-1}(p_A)$, where $p_A$ is the probability that $f(x+\epsilon)$ predicts the top class $A$. One of the few approaches with mathematically proven robustness guarantees that scales to modern neural networks.
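A Monte Carlo sketch of the smoothed prediction and the simplified radius $r = \sigma \cdot \Phi^{-1}(p_A)$. Note that the actual procedure in Cohen et al. replaces the raw estimate of $p_A$ with a Clopper-Pearson lower confidence bound and abstains when the top class is not clearly dominant; this sketch skips that step.

```python
import torch
from scipy.stats import norm

def smoothed_predict(model, x, sigma, n=1000):
    """Randomized smoothing sketch: majority class under Gaussian noise plus the simplified L2 radius."""
    noise = torch.randn(n, *x.shape) * sigma
    with torch.no_grad():
        preds = model(x.unsqueeze(0) + noise).argmax(dim=1)   # class votes under noise
    counts = torch.bincount(preds)
    c_hat = counts.argmax().item()
    p_a = min(counts[c_hat].item() / n, 1 - 1e-6)             # empirical p_A (no confidence bound here)
    radius = sigma * norm.ppf(p_a) if p_a > 0.5 else 0.0      # r = sigma * Phi^{-1}(p_A)
    return c_hat, radius
```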

Calibration: Correctness of Model Confidence

The problem: neural networks often have poor calibration — 90% confidence does not mean being right in 90% of cases. Overconfidence is especially dangerous in medicine.

Expected Calibration Error (ECE): split predictions into $M$ bins by confidence. ${\rm ECE} = \sum_m \frac{|B_m|}{n} |{\rm acc}(B_m) - {\rm conf}(B_m)|$. Reliability diagram: acc vs conf by bins.
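A NumPy sketch of ECE; `conf` holds the predicted max-softmax probabilities and `correct` is a boolean array marking whether each prediction was right (both are assumed inputs).

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=15):
    """ECE: confidence-binned gap between accuracy and mean confidence."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())   # |acc(B_m) - conf(B_m)|
            ece += mask.mean() * gap                              # weight |B_m| / n
    return ece
```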

Temperature Scaling: after training, rescale the logits: $p = \mathrm{softmax}(z / T)$. $T>1$ “softens” confidence, $T<1$ “sharpens” it. The optimal $T$ is found by minimizing NLL on a validation set. A simple yet powerful method.
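A sketch of fitting $T$ on held-out validation logits; it optimizes $\log T$ with Adam so that $T$ stays positive, which is just one implementation choice for this 1-D problem.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, lr=0.01, steps=200):
    """Temperature scaling: fit a single scalar T by minimizing NLL on validation logits."""
    log_t = torch.zeros(1, requires_grad=True)        # optimize log T so that T > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)   # NLL of rescaled logits
        loss.backward()
        opt.step()
    return log_t.exp().item()                         # use at inference: probs = softmax(z / T)
```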

Data Drift and OOD Detection

In-distribution vs Out-of-distribution (OOD): the model should express uncertainty for objects unlike those it was trained on. A medical model trained on European patients should be uncertain when analyzing Asian patients.

Maximum Softmax Probability baseline: if $\max_c P(y=c \mid x) < \textrm{threshold} \rightarrow$ OOD. Simple, works on many tasks.

Energy Score: $E(x;f) = -\log \sum_y \exp(f_y(x))$. OOD objects have high energy (low normalized class probability).

Mahalanobis distance: compute the Mahalanobis distance from the feature representation of $x$ to the class centroids (with a shared covariance estimated on training data). If the distance is much larger than typical $\rightarrow$ OOD.
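NumPy sketches of the three scores above side by side; `logits` is an $(n, C)$ array of model outputs, `feats` an $(n, d)$ array of penultimate-layer features, and `class_means` / `cov_inv` are the class centroids and inverse shared covariance estimated on training data (all assumed inputs).

```python
import numpy as np
from scipy.special import logsumexp

def msp_score(logits):
    """Maximum softmax probability: a low max probability suggests OOD."""
    probs = np.exp(logits - logsumexp(logits, axis=1, keepdims=True))
    return probs.max(axis=1)

def energy_score(logits):
    """Energy E(x; f) = -log sum_y exp(f_y(x)): high energy suggests OOD."""
    return -logsumexp(logits, axis=1)

def mahalanobis_score(feats, class_means, cov_inv):
    """Minimum Mahalanobis distance to a class centroid: a large distance suggests OOD."""
    dists = [np.einsum('ni,ij,nj->n', feats - mu, cov_inv, feats - mu) for mu in class_means]
    return np.min(dists, axis=0)
```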

Numerical Example

SHAP for credit scoring (XGBoost, 50 features): a client was denied. SHAP values: salary (−0.15), overdue history (+0.32), loan amount (+0.18), age (−0.05). Interpretation: overdue history is the main reason for the denial, despite high salary.
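A quick check against the efficiency axiom: the four listed contributions alone shift the output by $-0.15 + 0.32 + 0.18 - 0.05 = +0.30$ toward denial; together with the remaining 46 features they sum exactly to $f(x) - \mathbb{E}[f(X)]$.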

Assignment: (1) Train ResNet-18 on CIFAR-10. Compute SHAP for 10 random images using KernelSHAP. Visualize pixel importance heatmaps. (2) Implement Temperature Scaling: train, plot reliability diagram before and after. (3) Perform PGD attack ($\varepsilon=8/255$) on 100 test examples. What % of correct predictions is retained? Train an adversarially robust model.
