Robust Estimation and M-Estimators
Asymptotic Statistics and Robustness
Classical estimators (mean, MLE) are sensitive to outliers. Robust statistics develops estimators resilient to data contamination while maintaining high efficiency in the normal case.
Breakdown Point and Influence Function
Breakdown Point (BP): The minimal proportion of contaminated data at which an estimator breaks down. $\bar{X}$: BP = $1/n \to 0$. Median: BP = $0.5$. $S^2$: BP = $1/n$. MAD $= \text{median}|X_i - \text{median}|$: BP = $0.5$.
Influence Function (Hampel, 1974): $IF(x; T, F) = \lim_{\varepsilon \to 0} \frac{T((1-\varepsilon)F + \varepsilon \delta_x) - T(F)}{\varepsilon}$. For $\bar{X}$: $IF(x) = x - \mu$ — unbounded. For the median: $IF(x) = \frac{\text{sign}(x-\mu)}{2f(\mu)}$ — bounded. Gross-error sensitivity: $\gamma^* = \sup|IF(x)| < \infty$ → robust estimator.
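A minimal numeric sketch of this contrast (data and seed are arbitrary): one gross outlier drags the mean by roughly outlier/$n$ but leaves the median essentially untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 99)
x_bad = np.append(x, 1000.0)           # contaminate with one gross outlier

print(np.mean(x), np.mean(x_bad))      # mean jumps by about 1000/100 = 10
print(np.median(x), np.median(x_bad))  # median barely moves
```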
M-Estimators (Huber, 1964)
Definition: M-estimator $\hat{\theta} = \arg\min_\theta \sum_i \rho(X_i - \theta)$, or solution to $\sum_i \psi(X_i - \theta) = 0$, $\psi = \rho'$. Mean: $\rho(r) = r^2/2$. Median: $\rho(r) = |r|$. MLE: $\rho(r) = -\log f(r)$.
Huber M-Estimator ($c > 0$): $\rho_c(r) = \frac{r^2}{2}$ if $|r| \leq c$; $c|r| - c^2/2$ if $|r| > c$. $\psi_c(r) = \min(c, \max(-c, r))$. Mean for small residuals, median for large. $c = 1.345$: 95% relative efficiency under $N(0,1)$.
Tukey’s Biweight Estimator: $\psi(r) = r\left(1-(r/c)^2\right)^2$ if $|r| \leq c$; $0$ otherwise. Completely “turns off” extreme outliers. $c = 4.685$: 95% efficiency. The objective is nonconvex, so an iteration started from a non-robust value can be captured by a single outlier (effective BP = $1/n$); in practice it is initialized from the median. The IF is bounded and redescends to zero.
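Both $\psi$ functions are one-liners in code; a sketch with the tuning constants quoted above (function names are mine):

```python
import numpy as np

def psi_huber(r, c=1.345):
    # linear in the core, clipped at +/- c in the tails
    return np.clip(r, -c, c)

def psi_biweight(r, c=4.685):
    # redescending: smoothly returns to 0 for |r| > c
    return np.where(np.abs(r) <= c, r * (1 - (r / c) ** 2) ** 2, 0.0)
```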
IRLS Algorithm and Asymptotics
IRLS (Iteratively Reweighted Least Squares): $\theta^{(t+1)} = \frac{\sum w_i(\theta^{(t)})X_i}{\sum w_i(\theta^{(t)})}$, $w_i = \psi(r_i)/r_i$. Each iteration is a weighted least-squares step, and the iteration converges to the M-estimator. In linear regression: $\hat{\beta} = (X^\top WX)^{-1}X^\top WY$.
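A location-only version of the update, as a sketch (the name irls_location is mine; residuals are assumed already standardized, e.g. by the MAD-based $\hat{\sigma}$ below):

```python
import numpy as np

def irls_location(x, psi, tol=1e-8, max_iter=100):
    """M-estimate of location: iterate theta <- sum(w*x)/sum(w), w = psi(r)/r."""
    theta = np.median(x)                   # robust starting value
    for _ in range(max_iter):
        r = x - theta
        safe_r = np.where(r == 0, 1.0, r)  # avoid 0/0; psi(r)/r -> 1 as r -> 0
        w = np.where(r == 0, 1.0, psi(safe_r) / safe_r)
        theta_new = np.sum(w * x) / np.sum(w)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```

With psi_huber from the sketch above, irls_location(np.array([1., 2, 3, 4, 5, 50]), psi_huber) gives a Huber-type location estimate for the exercise data below (the exercise's $c=1.5$ and residual scaling can be swapped in).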
Asymptotics of M-Estimators: $\sqrt{n}(\hat{\theta} - \theta_0) \to_d N(0, V)$, $V = \frac{E[\psi^2(X-\theta_0)]}{\left(E[\psi'(X-\theta_0)]\right)^2}$. For the Huber estimator under $N(0,1)$ with $c = 1.345$: $V = 1/0.95 \approx 1.05$ (almost as good as the mean!).
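The variance formula can be spot-checked by simulation, reusing the sketches above (sample size and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 2000
est = np.array([irls_location(rng.normal(size=n), psi_huber)
                for _ in range(reps)])
print(n * est.var())  # should land near 1.05 for c = 1.345
```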
Scale Estimate: Standardized MAD: $\hat{\sigma} = \text{MAD}/0.6745$ — consistent estimator for $\sigma$ under normality, BP = $0.5$.
Exercise: (a) Data: $1,2,3,4,5,50$. Compute $\bar{X}$, the median, the 20% trimmed mean, and the Huber M-estimator ($c=1.5$). How does the outlier affect the results? (b) IRLS for linear regression: data with 2 outliers. Compare $\hat{\beta}_{\text{OLS}}$ and $\hat{\beta}_{\text{Huber}}$: coefficients and their SE. (c) Simulate: $n=100$ from $0.9\cdot N(0,1) + 0.1\cdot N(10,1)$. Compare the variance of $\bar{X}$ and the median: which estimator is more robust? (A simulation scaffold for part (c) follows below.)
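A scaffold for part (c): the mixture follows the exercise, while the seed and replication count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 5000
means, medians = np.empty(reps), np.empty(reps)
for i in range(reps):
    is_out = rng.random(n) < 0.1  # 10% contamination
    x = np.where(is_out, rng.normal(10, 1, n), rng.normal(0, 1, n))
    means[i], medians[i] = x.mean(), np.median(x)
print(means.var(), medians.var())  # sampling variances of the two estimators
```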
Breakdown Point and Robustness of Estimators
Breakdown point ($\epsilon^*$): the maximum proportion of contaminated data under which the estimator remains bounded. $\bar{X}$: $\epsilon^* = 1/n \approx 0$ (a single outlier can drive it to infinity). Median: $\epsilon^* = 0.5$ (the theoretical maximum). Trimmed mean ($\alpha$-trimmed): $\epsilon^* = \alpha$. Huber: $\epsilon^*$ depends on $c$.
Influence function: $IF(x, T, F) = \lim_{t\to 0} \frac{T(F_t) - T(F)}{t}$, where $F_t = (1-t)F + t\delta_x$. For the mean: $IF(x, \mu, F) = x-\mu$ (unbounded). For the median: $IF(x, Med, F) = \frac{\text{sgn}(x-Med)}{2f(Med)}$, bounded! For an M-estimator: $IF(x, T, F) = \frac{\psi(x-T)}{E[\psi'(X-T)]}$.
Robust Estimates of Dispersion and Covariance Matrix
MAD (Median Absolute Deviation): $\text{MAD} = \text{median}|X_i - \text{median}(X)|$. Breakdown point 50%. $\hat{\sigma} = 1.4826\cdot\text{MAD}$ is consistent for $\sigma$ under a normal distribution.
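In code this is two lines (the name mad_sigma is mine; scipy.stats.median_abs_deviation with scale="normal" applies the same factor):

```python
import numpy as np

def mad_sigma(x):
    """1.4826 * MAD: consistent for sigma under normality, BP = 0.5."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))
```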
MCD (Minimum Covariance Determinant) Estimator: For multivariate data, find the subset of $h$ observations whose sample covariance has the smallest determinant $\det(\text{Cov})$. Breakdown point $\approx (n-h)/n$. Implemented in sklearn.covariance.MinCovDet. Used for detecting multivariate outliers via Mahalanobis distance (see the sketch below).
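A minimal usage sketch of the named estimator (the data and threshold are illustrative; the 0.999 quantile of $\chi^2_2$ is roughly 13.8):

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
X[:10] += 8                       # plant 10 multivariate outliers

mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)           # squared robust Mahalanobis distances
print(np.flatnonzero(d2 > 13.8))  # the planted rows 0..9 should be flagged
```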
Outlier Test and Anomaly Detection Methods
Grubbs Test: $H_0$: no outliers. Statistic $G = \max_i|X_i-\bar{X}|/s$; the critical value depends on $n$ and $\alpha$. Assumes normality. Tukey's Fences (boxplot rule): an outlier is any point outside $[Q_1-1.5\,IQR,\; Q_3+1.5\,IQR]$. Nonparametric: does not depend on the shape of the distribution, but for symmetric heavy-tailed distributions it may flag legitimate points as outliers.
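Both rules take a few lines of numpy (only the Grubbs statistic is computed; its critical value is table-based):

```python
import numpy as np

def tukey_flags(x, k=1.5):
    """Boxplot rule: flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def grubbs_statistic(x):
    """G = max |X_i - Xbar| / s; compare to the critical value for given n, alpha."""
    return np.max(np.abs(x - x.mean())) / x.std(ddof=1)

x = np.array([1, 2, 3, 4, 5, 50.0])
print(x[tukey_flags(x)], grubbs_statistic(x))  # [50.], G ~ 2.04
```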
Connection with Survival Theory
Survival function: $S(t) = P(T > t)$. Hazard function: $h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\ln S(t)$. For exponential $T$: $h(t) = \lambda$ (constant hazard). For Weibull: $h(t) = \lambda\beta(\lambda t)^{\beta-1}$ (decreasing for $\beta<1$, increasing for $\beta>1$). Kaplan-Meier estimator: Nonparametric estimate of $S(t)$ under censoring. $\hat{S}(t) = \prod_{t_i \leq t} (1-d_i/n_i)$, where $d_i$ is the number of events, $n_i$ is the number at risk at time $t_i$.
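The product-limit formula translates directly into code (a self-contained sketch; libraries such as lifelines provide the same with confidence bands):

```python
import numpy as np

def kaplan_meier(times, events):
    """S(t) = prod over t_i <= t of (1 - d_i/n_i); events: 1 = event, 0 = censored."""
    times, events = np.asarray(times, float), np.asarray(events, int)
    order = np.argsort(times)
    times, events = times[order], events[order]
    s, curve, n_at_risk = 1.0, {}, len(times)
    for t in np.unique(times):
        at_t = times == t
        d = events[at_t].sum()   # events at time t
        if d:
            s *= 1 - d / n_at_risk
            curve[t] = s
        n_at_risk -= at_t.sum()  # events and censorings both leave the risk set
    return curve

print(kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]))
# {2.0: 0.8, 3.0: 0.6, 5.0: 0.3}
```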
Copulas and Tail Dependence
Copula $C:[0,1]^n \to [0,1]$: a joint distribution with uniform marginals. Sklar’s theorem: $F(x_1,\dots,x_n) = C(F_1(x_1),\dots,F_n(x_n))$, which allows the marginals and the dependence structure to be modeled separately. Gaussian copula: no tail dependence ($\lambda_L = \lambda_U = 0$); used in CDO pricing until 2008, where it catastrophically underestimated joint default probabilities. Clayton copula: $\lambda_L > 0$, lower tail dependence; applied to credit risk.
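Sklar's theorem in code: sample a Gaussian copula, then attach arbitrary marginals via inverse CDFs (the correlation and the exponential marginals are illustrative):

```python
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(4)
rho = 0.7
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
z = rng.standard_normal((10_000, 2)) @ L.T  # correlated Gaussians
u = norm.cdf(z)                             # copula sample: uniform marginals
x = expon.ppf(u)                            # any marginals via inverse CDFs
```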
Multivariate Statistics and Principal Component Analysis
PCA: Minimizes reconstruction error → eigenvectors of the covariance matrix. Maximum-variance property: the first $k$ principal components maximize the variance captured by a $k$-dimensional projection. Kernel PCA: replace the inner products $x_i^\top x_j$ with a kernel matrix $K_{ij} = k(x_i, x_j)$ for nonlinear PCA. Sparse PCA (SPCA): add an $l_1$ penalty on the loadings for interpretability. Independent Component Analysis (ICA): decompose mixed signals into statistically independent components, the "cocktail party problem" (see the sketch below).
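A compact demo of the three decompositions in scikit-learn (the two sources and the mixing matrix are illustrative, echoing the cocktail-party setup):

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA, FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # two independent sources
X = S @ np.array([[1.0, 0.5], [0.5, 1.0]])        # observed linear mixture

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)              # variance split across PCs

X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)  # nonlinear features
S_hat = FastICA(n_components=2, random_state=0).fit_transform(X)   # recovered sources
```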
Numerical Example: Principal Component Method (PCA)
Task: Data matrix $4 \times 2$: points $(3,1)$, $(1,3)$, $(-3,-1)$, $(-1,-3)$. Find principal components.
Step 1: $\bar{x} = (0, 0)$. Covariance matrix: $\Sigma = \frac{1}{4} X^\top X$. $X^\top X = \begin{pmatrix} 9+1+9+1 & 3+3+3+3 \\ 3+3+3+3 & 1+9+1+9 \end{pmatrix} = \begin{pmatrix} 20 & 12 \\ 12 & 20 \end{pmatrix}$, so $\Sigma = \begin{pmatrix} 5 & 3 \\ 3 & 5 \end{pmatrix}$.
Step 2: Eigenvalues: $\det(\Sigma-\lambda I) = (5-\lambda)^2-9=0 \to \lambda_1=8, \lambda_2=2$.
Step 3: $\lambda_1=8$: vector $(1,1)/\sqrt{2}$ (PC1 — diagonal). $\lambda_2=2$: vector $(1,-1)/\sqrt{2}$ (PC2 — anti-diagonal).
Step 4: PC1 explains $8/(8+2)=80\%$ of the variance. Projection on PC1: $z_i = (x_{i1}+x_{i2})/\sqrt{2} \to \{2\sqrt{2}, 2\sqrt{2}, -2\sqrt{2}, -2\sqrt{2}\} \approx \{2.83, 2.83, -2.83, -2.83\}$. Compression from 2D to 1D with only a 20% loss of variance. ICA could then split the projection into independent signal sources.
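The worked example checks out in a few lines (np.linalg.eigh returns eigenvalues in ascending order; eigenvector signs are arbitrary):

```python
import numpy as np

X = np.array([[3, 1], [1, 3], [-3, -1], [-1, -3]], dtype=float)
sigma = X.T @ X / 4  # data is already centered
vals, vecs = np.linalg.eigh(sigma)
print(vals)          # [2. 8.]
print(vecs[:, 1])    # [0.7071 0.7071] up to sign: PC1
print(X @ vecs[:, 1])  # [ 2.83  2.83 -2.83 -2.83]
```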