Linear Regression and Analysis of Variance
Statistical Hypothesis Testing
Linear regression models the dependence between variables and is one of the most important tools in applied statistics. Analysis of variance generalizes the t-test to multiple groups.
Simple Linear Regression
Model: $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$ i.i.d. OLS estimates: $\hat{\beta}_1 = \Sigma (x_i-\bar{x})(y_i-\bar{y})/\Sigma(x_i-\bar{x})^2$, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$.
Gauss-Markov Theorem: $\hat{\beta}$ is BLUE: the best (minimum-variance) linear unbiased estimator. $\operatorname{Var}[\hat{\beta}_1] = \sigma^2/\Sigma(x_i-\bar{x})^2$. $S^2 = \operatorname{RSS}/(n-2)$ is an unbiased estimator of $\sigma^2$.
Coefficient of Determination $R^2$: $R^2 = 1 - \operatorname{RSS}/\operatorname{TSS}$, $\operatorname{RSS} = \Sigma(y_i-\hat{y}_i)^2$, $\operatorname{TSS} = \Sigma(y_i-\bar{y})^2$. $R^2 = \operatorname{cor}(y, \hat{y})^2$ in simple regression. The proportion of explained variation $\in [0,1]$.
Significance Test for $\beta_1$: $H_0: \beta_1=0$. $T = \hat{\beta}_1/\operatorname{SE}(\hat{\beta}_1) \sim t(n-2)$, $\operatorname{SE}(\hat{\beta}_1) = s/\sqrt{\Sigma(x_i-\bar{x})^2}$. $|T| > t_{\alpha/2, n-2}$ $\rightarrow$ reject $H_0$.
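A minimal numpy/scipy sketch of these formulas on synthetic data (the model and all values here are illustrative, not from the text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 50
x = rng.uniform(0.0, 10.0, n)                      # synthetic predictor
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.5, n)        # assumed truth: beta0=1, beta1=2

xbar, ybar = x.mean(), y.mean()
sxx = np.sum((x - xbar) ** 2)
beta1 = np.sum((x - xbar) * (y - ybar)) / sxx      # OLS slope
beta0 = ybar - beta1 * xbar                        # OLS intercept

yhat = beta0 + beta1 * x
rss = np.sum((y - yhat) ** 2)                      # residual sum of squares
tss = np.sum((y - ybar) ** 2)                      # total sum of squares
r2 = 1.0 - rss / tss                               # coefficient of determination

s2 = rss / (n - 2)                                 # unbiased estimate of sigma^2
se_beta1 = np.sqrt(s2 / sxx)
T = beta1 / se_beta1                               # t-statistic for H0: beta1 = 0
p_value = 2.0 * stats.t.sf(abs(T), df=n - 2)
print(f"beta0={beta0:.3f}  beta1={beta1:.3f}  R^2={r2:.3f}  T={T:.2f}  p={p_value:.2g}")
```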
Multiple Linear Regression
Matrix Form: $Y = X\beta + \varepsilon$. OLS: $\hat{\beta} = (X^\top X)^{-1} X^\top Y$. $H = X(X^\top X)^{-1} X^\top$ — the projection (hat) matrix. $\operatorname{Var}[\hat{\beta}] = \sigma^2(X^\top X)^{-1}$.
F-test: $H_0: \beta_1=...=\beta_p=0$. $F = (\operatorname{SSR}/p)/(\operatorname{RSS}/(n-p-1)) \sim F(p, n-p-1)$ under $H_0$. Adjusted $R^2 = 1 - (\operatorname{RSS}/(n-p-1))/(\operatorname{TSS}/(n-1))$ — does not automatically increase with the addition of predictors.
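The matrix-form fit and the overall F-test, sketched in numpy on simulated data (lstsq is used instead of the explicit inverse for numerical stability; the coefficients are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS without forming (X'X)^(-1)
yhat = X @ beta_hat
rss = np.sum((y - yhat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
ssr = tss - rss                                    # regression sum of squares

F = (ssr / p) / (rss / (n - p - 1))                # H0: beta_1 = ... = beta_p = 0
p_value = stats.f.sf(F, p, n - p - 1)
r2_adj = 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))
print(f"F={F:.2f}  p={p_value:.2g}  adj R^2={r2_adj:.3f}")
```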
Diagnostics: Multicollinearity: $VIF_j = 1/(1-R_j^2) > 10$ signals a problem. Heteroskedasticity: Breusch-Pagan test. Residual normality: Shapiro-Wilk test.
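All three diagnostics are available off the shelf; a sketch assuming statsmodels and scipy (data simulated with x2 deliberately near-collinear with x1):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy.stats import shapiro

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)           # near-collinear predictor
X = sm.add_constant(np.column_stack([x1, x2]))
y = 1.0 + x1 + rng.normal(size=n)

fit = sm.OLS(y, X).fit()

vifs = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]  # VIF > 10: problem
_, bp_pval, _, _ = het_breuschpagan(fit.resid, X)   # H0: homoskedastic residuals
_, sw_pval = shapiro(fit.resid)                     # H0: residuals are normal
print(vifs, bp_pval, sw_pval)
```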
One-Way Analysis of Variance (ANOVA)
Task: $k$ groups, $H_0: \mu_1=...=\mu_k$. Decomposition: $SS_\text{Total} = SS_\text{Between} + SS_\text{Within}$. $F = [SS_B/(k-1)]/[SS_W/(n-k)] \sim F(k-1, n-k)$ under $H_0$.
Post-hoc Analysis (Tukey HSD): After a significant F-test, pairwise comparison: $|\bar{y}_i - \bar{y}_j| > \frac{q_{\alpha,k,n-k}}{\sqrt{2}} \sqrt{MS_W(1/n_i + 1/n_j)}$.
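Both steps in a few lines, assuming scipy and statsmodels (groups are simulated; group C has a shifted mean so the F-test should fire):

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(2)
g1, g2 = rng.normal(0.0, 1.0, 20), rng.normal(0.0, 1.0, 20)
g3 = rng.normal(1.0, 1.0, 20)                      # shifted mean

F, p = f_oneway(g1, g2, g3)                        # one-way ANOVA F-test
print(f"F={F:.2f}  p={p:.3g}")

y = np.concatenate([g1, g2, g3])                   # Tukey HSD after a significant F
labels = np.repeat(["A", "B", "C"], 20)
print(pairwise_tukeyhsd(y, labels, alpha=0.05).summary())
```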
Assignment: (a) $x=(1,2,3,4,5)$, $y=(2.1,3.9,6.2,7.8,10.1)$. Find $\hat{\beta}_0$, $\hat{\beta}_1$, $R^2$, test $H_0: \beta_1=0$. (b) Three fertilizers for 6 fields each: $SS_B=90$, $SS_W=60$. F-statistic, conclusion at $\alpha=0.05$. (c) Simulate the F-test under $H_0$ ($k=3$, $n=15$, normal data) 5000 times, plot the histogram versus $F(2,12)$.
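For part (c), a minimal simulation sketch (numpy/scipy/matplotlib assumed; under $H_0$ the histogram should track the $F(2,12)$ density):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f, f_oneway

rng = np.random.default_rng(3)
k, n_per = 3, 5                                    # k=3 groups, n=15 total
sims = [f_oneway(*(rng.normal(0, 1, n_per) for _ in range(k)))[0]
        for _ in range(5000)]

grid = np.linspace(0.01, 8, 400)
plt.hist(sims, bins=60, density=True, alpha=0.5, label="simulated F under H0")
plt.plot(grid, f.pdf(grid, k - 1, k * n_per - k), label="F(2, 12) density")
plt.legend()
plt.show()
```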
Regularized Regression: Ridge and Lasso
Ridge (L2-regularization): $\hat{\beta}_\text{Ridge} = \operatorname{argmin}_\beta \|Y-X\beta\|^2 + \lambda\|\beta\|^2 = (X^\top X + \lambda I)^{-1} X^\top Y$. As $\lambda\to0$: OLS. As $\lambda\to\infty$: $\beta\to0$. Coefficients shrink uniformly but do not become exactly zero. Well suited to correlated predictors.
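The closed form in numpy; a sketch that assumes standardized columns and leaves the (conventionally unpenalized) intercept aside:

```python
import numpy as np

def ridge(X, y, lam):
    # (X'X + lambda*I)^(-1) X'y, computed via solve rather than an explicit inverse
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(10)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)
for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge(X, y, lam), 2))      # coefficients shrink toward 0
```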
Lasso (L1-regularization): $\hat{\beta}_\text{Lasso} = \operatorname{argmin}_\beta \|Y-X\beta\|^2 + \lambda\|\beta\|_1$. No closed form; solved via coordinate descent. At sufficiently large $\lambda$, some $\beta_j = 0$ (variable selection). Geometrically: the L1-ball has corners $\rightarrow$ the solution lands on a corner $\rightarrow$ sparsity.
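A didactic coordinate-descent sketch for this exact objective (the soft-threshold update follows from the one-dimensional subgradient condition; not tuned for production use):

```python
import numpy as np

def soft_threshold(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Cycles through coordinates; each 1-D subproblem of
    # ||y - Xb||^2 + lam*||b||_1 has the soft-threshold solution below.
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r_j = y - X @ b + X[:, j] * b[j]       # partial residual excluding feature j
            rho = X[:, j] @ r_j
            b[j] = soft_threshold(rho, lam / 2.0) / (X[:, j] @ X[:, j])
    return b

rng = np.random.default_rng(11)
X = rng.normal(size=(100, 8))
y = X @ np.array([3.0, -2.0, 0, 0, 0, 0, 0, 1.0]) + rng.normal(size=100)
print(np.round(lasso_cd(X, y, lam=50.0), 2))       # several exact zeros expected
```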
Elastic Net: A combination of L1 and L2 penalties: $\lambda_1\|\beta\|_1 + \lambda_2\|\beta\|^2$. Combines variable selection (Lasso) with stability under multicollinearity (Ridge). Parameters $\lambda_1$, $\lambda_2$ are selected by cross-validation.
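In practice one reaches for a library; a sketch with scikit-learn's ElasticNetCV (note that sklearn parameterizes the penalty via alpha and l1_ratio rather than separate $\lambda_1$, $\lambda_2$):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)  # only 2 of 10 features matter

# 5-fold CV over both the penalty strength and the L1/L2 mix
model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], cv=5).fit(X, y)
print(model.l1_ratio_, model.alpha_, np.round(model.coef_, 2))
```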
Two-Way ANOVA and Interaction
Model: $Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$. Three F-tests: factor A, factor B, interaction A×B. Significant interaction: the effect of A depends on the level of B. Interaction plots: parallel lines $\rightarrow$ no interaction; intersecting lines $\rightarrow$ interaction.
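A statsmodels sketch of the three F-tests (data simulated with a built-in A×B interaction; the formula syntax and anova_lm are standard statsmodels):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({"A": np.repeat(["a1", "a2"], 30),
                   "B": np.tile(np.repeat(["b1", "b2", "b3"], 10), 2)})
df["y"] = (rng.normal(size=60)
           + 1.0 * (df["A"] == "a2")
           + 0.5 * (df["B"] == "b3")
           + 1.5 * ((df["A"] == "a2") & (df["B"] == "b3")))  # interaction effect

fit = smf.ols("y ~ C(A) * C(B)", data=df).fit()    # * = main effects + interaction
print(sm.stats.anova_lm(fit, typ=2))               # F-tests for A, B, and A:B
```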
Mixed Models (LMM): $Y_{ij} = X_{ij}\beta + Z_{ij}b_i + \varepsilon_{ij}$, $b_i \sim N(0,D)$. Random effects account for correlation within groups (repeated measures, multilevel data). Estimation via REML (restricted maximum likelihood). Random effects tested by LRT or permutation test.
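A random-intercept sketch with statsmodels' MixedLM, fitted by REML (the grouping structure here is 20 hypothetical subjects with 10 measurements each):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
groups = np.repeat(np.arange(20), 10)              # 20 subjects x 10 repeated measures
b = rng.normal(0.0, 1.0, 20)[groups]               # random intercept per subject
x = rng.normal(size=200)
df = pd.DataFrame({"y": 1.0 + 0.5 * x + b + rng.normal(0.0, 0.5, 200),
                   "x": x, "g": groups})

fit = smf.mixedlm("y ~ x", df, groups=df["g"]).fit(reml=True)
print(fit.summary())
```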
Linear Regression Diagnostics: In-Depth Analysis
Influence Points (Leverage): $h_{ii}$ — diagonal elements of the projection matrix $H = X(X^\top X)^{-1} X^\top$. Mean $h_{ii} = p/n$. Points with $h_{ii} > 2p/n$ are high-leverage. Cook's Distance $D_i = (\hat{\beta} - \hat{\beta}_{(-i)})^\top (X^\top X)(\hat{\beta} - \hat{\beta}_{(-i)})/(p\,\hat{\sigma}^2)$ — the cumulative influence of removing the $i$th observation. $D_i > 4/n$ is a common threshold. Studentized residuals: $e_i^* = e_i/(\hat{\sigma}\sqrt{1-h_{ii}}) \sim t(n-p-1)$ under $H_0$. If $|e_i^*| > t_{0.025, n-p-1}$: outlier.
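The same quantities computed directly in numpy (a sketch; the studentized residuals here are the internal version, using the overall $\hat{\sigma}$ rather than the leave-one-out estimate):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)              # projection (hat) matrix
h = np.diag(H)                                     # leverages, mean p/n
e = y - H @ y                                      # residuals
s2 = e @ e / (n - p)                               # estimate of sigma^2

# Equivalent leverage form of Cook's distance: D_i = e_i^2 h_ii / (p s^2 (1-h_ii)^2)
D = e**2 * h / (p * s2 * (1 - h) ** 2)
r = e / np.sqrt(s2 * (1 - h))                      # internally studentized residuals
print(np.where(h > 2 * p / n)[0], np.where(D > 4 / n)[0])
```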
Nonlinear Regression
Nonlinear Regression: $Y_i = f(x_i, \beta) + \varepsilon_i$. OLS estimates: $\hat{\beta} = \operatorname{argmin}_\beta \Sigma(Y_i-f(x_i,\beta))^2$. No closed form: the iterative Gauss-Newton algorithm: $\beta_{t+1} = \beta_t + (J^\top J)^{-1}J^\top(Y-f(\beta_t))$, where $J$ is the Jacobian of $f$. Logistic curve: $f(x,\beta) = \beta_1/(1+\exp(-\beta_2(x-\beta_3)))$. Applied in biology (population growth) and pharmacokinetics (drug absorption).
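A fitting sketch with scipy's curve_fit, whose default solver is a damped relative of the Gauss-Newton iteration above (the true parameters are assumptions of the simulation):

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, b1, b2, b3):
    return b1 / (1.0 + np.exp(-b2 * (x - b3)))

rng = np.random.default_rng(7)
x = np.linspace(0.0, 10.0, 80)
y = logistic(x, 5.0, 1.2, 5.0) + rng.normal(0.0, 0.2, 80)   # true beta = (5, 1.2, 5)

popt, pcov = curve_fit(logistic, x, y, p0=[4.0, 1.0, 4.0])  # p0: the start point matters
print(np.round(popt, 2))
```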
Spatial Regression and Geostatistics
Kriging: BLUE prediction for spatial data with a given covariance structure. Semivariogram $\gamma(h) = 0.5\operatorname{Var}[Z(x+h)-Z(x)]$. Spherical and exponential models. Ordinary Kriging: $\hat{Z}(x_0) = \Sigma\lambda_i Z(x_i)$, $\Sigma\lambda_i=1$, minimizes the prediction variance. Widely used in geology, meteorology, and ecology.
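A numpy sketch of the classical (Matheron) empirical semivariogram estimator, the first step before fitting a spherical or exponential model (data and bins are illustrative):

```python
import numpy as np

def empirical_semivariogram(coords, z, bins):
    # Average 0.5*(z_i - z_j)^2 over point pairs, grouped by distance bin
    d, g = [], []
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            d.append(np.linalg.norm(coords[i] - coords[j]))
            g.append(0.5 * (z[i] - z[j]) ** 2)
    idx = np.digitize(d, bins)
    g = np.array(g)
    return np.array([g[idx == k].mean() if np.any(idx == k) else np.nan
                     for k in range(1, len(bins))])

rng = np.random.default_rng(12)
coords = rng.uniform(0.0, 10.0, size=(60, 2))
z = np.sin(coords[:, 0]) + 0.1 * rng.normal(size=60)   # spatially correlated field
print(np.round(empirical_semivariogram(coords, z, np.linspace(0, 10, 11)), 3))
```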
Causal Inference
Distinguishing correlation from causality. Potential Outcomes (Rubin): $Y_i(1)$ — outcome under treatment, $Y_i(0)$ — without. $ATE = E[Y(1)-Y(0)]$. The "fundamental problem": only one of the two is ever observed. ATE estimation from observational data requires the ignorability condition $Y(t)\perp T\mid X$ (no unobserved confounders). IPW: weight each observation by the inverse of its estimated treatment probability (propensity score). Doubly robust estimators remain valid if at least one of the two models (outcome or treatment) is correct.
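An IPW sketch on simulated data where the true ATE is 2 by construction (the confounder enters both treatment assignment and outcome; sklearn's LogisticRegression estimates the propensity score):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n = 5000
X = rng.normal(size=(n, 2))                           # observed confounders
T = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))   # treatment depends on X
Y = 1.0 + 2.0 * T + X[:, 0] + rng.normal(size=n)      # true ATE = 2

e_hat = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]  # propensity scores
ate_ipw = np.mean(T * Y / e_hat) - np.mean((1 - T) * Y / (1 - e_hat))
print(ate_ipw)                                        # should be close to 2
```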
Numerical Example: Simple Linear Regression
Task: Data $(x, y)$: $(1,2)$, $(2,4)$, $(3,5)$, $(4,7)$. Build $\hat{y}=\hat{\beta}_0+\hat{\beta}_1x$ and assess fit quality.
Step 1: $\bar{x} = (1+2+3+4)/4 = 2.5$, $\bar{y} = (2+4+5+7)/4 = 4.5$.
Step 2: $\hat{\beta}_1 = \Sigma(x_i-\bar{x})(y_i-\bar{y})/\Sigma(x_i-\bar{x})^2 = [(-1.5)(-2.5)+(-0.5)(-0.5)+(0.5)(0.5)+(1.5)(2.5)] / [2.25+0.25+0.25+2.25] = [3.75+0.25+0.25+3.75]/5.0 = 8.0/5.0 = 1.6$.
Step 3: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\cdot\bar{x} = 4.5 - 1.6\cdot2.5 = 0.5$. Equation: $\hat{y} = 0.5 + 1.6x$.
Step 4: Prediction at $x=5$: $\hat{y}=8.5$. $R^2$: $SS_\text{res} = \Sigma(y_i-\hat{y}_i)^2 = (2-2.1)^2+(4-3.7)^2+(5-5.3)^2+(7-6.9)^2=0.01+0.09+0.09+0.01=0.20$. $SS_\text{tot} = \Sigma(y_i-\bar{y})^2=6.25+0.25+0.25+6.25=13.0$. $R^2=1-0.20/13.0\approx0.985$ — excellent fit.
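The whole calculation checks out in a few lines of numpy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 7.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x
r2 = 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
print(b0, b1, b0 + b1 * 5, r2)                     # 0.5, 1.6, 8.5, ~0.9846
```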