Module VI·Article III·~4 min read
Goodness-of-Fit Tests
Statistical Hypothesis Testing
Goodness-of-fit tests assess whether data conform to a hypothesized theoretical distribution. This is an important step in statistical analysis before applying parametric methods.
Pearson's Chi-Squared Test
Idea: Divide the data into k cells with observed frequencies Oᵢ and theoretical Eᵢ = nPᵢ(θ). Statistic: χ² = Σᵢ(Oᵢ - Eᵢ)²/Eᵢ. Under H₀, asymptotically: χ² ~ χ²(k-1-r), where r is the number of estimated parameters.
Conditions for applicability: Eᵢ ≥ 5 in each cell (if violated, combine adjacent cells); independent observations; a minimum sample size of roughly n ≥ 30–50.
Contingency tables: Testing independence of two categorical variables. χ² = Σᵢⱼ(Oᵢⱼ − Eᵢⱼ)²/Eᵢⱼ, with Eᵢⱼ = nᵢ₊·n₊ⱼ/n, where nᵢ₊ and n₊ⱼ are the row and column totals. Degrees of freedom: (r−1)(c−1).
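Both variants can be run with scipy; a minimal sketch, where the dice frequencies and the 2×2 table are hypothetical numbers chosen for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical dice data: n = 120 rolls, H0: each face has p = 1/6.
observed = np.array([22, 17, 20, 26, 21, 14])
expected = np.full(6, observed.sum() / 6)        # E_i = n * (1/6) = 20

# Pearson chi-squared GOF test, df = k - 1 = 5 (no estimated parameters).
chi2_stat, p_value = stats.chisquare(observed, expected)

# Hypothetical 2x2 contingency table: independence of two categorical variables.
# E_ij = (row total) * (column total) / n, df = (r - 1)(c - 1) = 1.
table = np.array([[30, 10],
                  [20, 40]])
chi2_ind, p_ind, dof, exp_freq = stats.chi2_contingency(table)
```

Note that for 2×2 tables `chi2_contingency` applies Yates' continuity correction by default.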
Kolmogorov-Smirnov Test
Statistic: Dₙ = sup_x |F̂ₙ(x) - F₀(x;θ)|, where F̂ₙ is the empirical distribution function. Under H₀ (with known θ): √n·Dₙ →_d K (Kolmogorov distribution).
Important limitation: When parameters are estimated from the data (composite hypothesis), the critical value changes—use modified tables (Lilliefors for normality).
Two-sample KS test: Dₙ,ₘ = sup_x |F̂ₙ(x) - Ĝₘ(x)|. H₀: F = G (two samples from the same distribution). Nonparametric test.
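Both KS variants are available in scipy.stats; a sketch on simulated data, where the N(0,1) parameters are treated as known so the one-sample p-value is valid:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=200)

# One-sample KS against a fully specified F0 = N(0, 1) (simple hypothesis).
d_stat, p_one = stats.kstest(x, 'norm', args=(0.0, 1.0))
# NOTE: if mu and sigma were estimated from x, this p-value would be invalid;
# use the Lilliefors correction in that case.

# Two-sample KS: H0: F = G, fully nonparametric.
y = rng.normal(loc=0.0, scale=1.0, size=150)
d_two, p_two = stats.ks_2samp(x, y)
```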
Shapiro-Wilk Test for Normality
One of the most powerful tests of normality, applicable up to n ≈ 2000. W = (Σaᵢx₍ᵢ₎)²/Σ(xᵢ−x̄)², where x₍ᵢ₎ are order statistics and aᵢ are tabulated coefficients. For normal data, W is close to 1.
Practical recommendations: n < 50: Shapiro–Wilk. n = 50–2000: Anderson–Darling. n > 2000: visual QQ-plot + χ² are sufficient.
QQ-plot: Quantile-quantile plot. If the data come from F₀, sample quantiles plotted against theoretical quantiles lie approximately on a straight line; deviations from the line signal a distribution mismatch.
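A sketch of both checks on simulated data; note that scipy's `probplot` returns the QQ-plot coordinates plus a fitted line rather than drawing anything itself:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(size=100)
skewed_data = rng.lognormal(size=100)     # clearly non-normal

# Shapiro-Wilk: W close to 1 supports normality.
w_norm, p_norm = stats.shapiro(normal_data)
w_skew, p_skew = stats.shapiro(skewed_data)

# QQ-plot coordinates: theoretical quantiles (osm) vs ordered sample (osr);
# (slope, intercept, r) describe the least-squares line through them.
(osm, osr), (slope, intercept, r) = stats.probplot(normal_data, dist='norm')
```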
Exercise: (a) 120 dice rolls: frequencies (18,22,19,21,17,23). χ² test for uniformity. (b) Sample from LN(2,1): χ² test for normality (8 equiprobable cells). What will happen with right skewness? (c) 50 observations: simulate from t(3) and test for normality using KS, Shapiro–Wilk, and QQ-plot. Compare p-values.
Anderson-Darling Test
Anderson–Darling statistic (AD): A² = −n − (1/n)Σᵢ(2i−1)[ln F₀(x₍ᵢ₎) + ln(1−F₀(x₍ₙ₊₁₋ᵢ₎))]. Gives greater weight to the tails of the distribution than the KS test. Especially sensitive to deviations in the tails—which is important for financial data. For normality: special critical values with Stephens’ correction (depend on n).
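In scipy, `stats.anderson` returns A² together with Stephens-corrected critical values at the 15/10/5/2.5/1% levels; a sketch on simulated heavy-tailed "returns":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
returns = rng.standard_t(df=3, size=500)   # heavy tails, like financial returns

result = stats.anderson(returns, dist='norm')
# result.statistic is A^2; result.critical_values line up with
# result.significance_level = [15., 10., 5., 2.5, 1.].
reject_at_5pct = result.statistic > result.critical_values[2]
```

Because A² weights the tails, it is well suited to detecting the non-normality of t(3) data at this sample size.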
Shapiro–Wilk test: W = (Σaᵢx₍ᵢ₎)²/(Σ(xᵢ-x̄)²), where aᵢ are coefficients derived from expected normal order statistics. W ∈ (0,1], with W = 1 for perfectly normal data. Most powerful for small samples (n ≤ 50). For n > 5000 the test becomes overly sensitive: even negligible real deviations from normality come out statistically significant.
Checking Symmetry and Heavy Tails
Skewness coefficient: g₁ = m₃/m₂^{3/2} (standardized third moment); for the normal distribution g₁ = 0. Excess kurtosis: g₂ = m₄/m₂² − 3; for the normal, g₂ = 0. D'Agostino–Pearson test: combines g₁ and g₂ into a χ² statistic with 2 degrees of freedom. Log-normal distributions have g₁ > 0 and g₂ > 0: right-skewed with heavy tails.
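Sample versions of g₁ and g₂, and the combined D'Agostino–Pearson statistic (`normaltest` in scipy); the log-normal sample is illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

g1 = stats.skew(sample)          # m3 / m2^(3/2); > 0 means right-skewed
g2 = stats.kurtosis(sample)      # m4 / m2^2 - 3 (excess); > 0 means heavy tails

# D'Agostino-Pearson: combines g1 and g2 into a chi^2(2) statistic.
k2, p_value = stats.normaltest(sample)
```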
PP-plot vs QQ-plot: PP-plot compares empirical vs theoretical probabilities and is sensitive to the center; QQ-plot compares empirical vs theoretical quantiles and is sensitive to the tails. For finance the QQ-plot is the tool that reveals heavy tails: with sample quantiles on the vertical axis, points drop below the line in the left tail and rise above it in the right tail.
Goodness-of-Fit Tests for Discrete Distributions
Poisson test: combine the tail (k ≥ K) into one cell; if this leaves m cells in total, df = m − 1 − r, where r = 1 parameter is estimated (λ̂ = x̄). Binomial test: similarly with r = 1 (p̂ = x̄/N, where N is the number of trials per observation). For the negative binomial: r = 2 (n̂, p̂). The χ² test has low power against overdispersion; specialized tests (Cochran, a discrete Anderson–Darling) exist for that.
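A sketch of the Poisson fit test with a combined tail cell, on simulated counts; the cutoff K is chosen by hand here so that every expected count is at least 5:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.poisson(lam=2.0, size=300)

lam_hat = y.mean()                            # r = 1 estimated parameter

# Cells {0}, {1}, ..., {K-1} plus the combined tail {k >= K}.
K = 6
observed = np.array([np.sum(y == k) for k in range(K)] + [np.sum(y >= K)])
probs = np.append(stats.poisson.pmf(np.arange(K), lam_hat),
                  stats.poisson.sf(K - 1, lam_hat))   # probs sum to 1
expected = len(y) * probs

# ddof=1 subtracts one extra df for lambda_hat: df = (K + 1) - 1 - 1.
chi2_stat, p_value = stats.chisquare(observed, expected, ddof=1)
```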
Testing for Overdispersion
In real count data, the Poisson model often fails: Var[Y] > E[Y] (overdispersion). Dean's test: statistic Z = Σᵢ[(Yᵢ − λ̂ᵢ)² − Yᵢ] / √(2Σᵢλ̂ᵢ²) →_d N(0,1). Alternatives to Poisson: negative binomial (NB), Var = μ + μ²/r; zero-inflated Poisson (ZIP), a mixture of Poisson and a point mass at 0. Model selection via LRT or AIC.
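A sketch of the score statistic for the iid case (λ̂ᵢ ≡ ȳ), on simulated negative-binomial counts, which are overdispersed by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# NB counts with mean mu = 3 and Var = mu + mu^2/r = 7.5 > 3 (overdispersed).
r_nb, mu = 2.0, 3.0
y = rng.negative_binomial(r_nb, r_nb / (r_nb + mu), size=500)

# Dean-style score statistic for the common-mean Poisson model:
# Z = sum[(Y_i - lam)^2 - Y_i] / sqrt(2 * n * lam^2), with lam = mean(Y).
lam = y.mean()
z = (np.sum((y - lam) ** 2) - y.sum()) / (lam * np.sqrt(2 * len(y)))
p_value = stats.norm.sf(z)       # one-sided: H1 is Var > mean
```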
Goodness-of-Fit Tests in Bayesian Statistics
In the Bayesian paradigm, the analogue of a goodness-of-fit test is the posterior predictive p-value (PPP): p_ppc = P(T(X_rep) ≥ T(X_obs) | X_obs), where X_rep is drawn by sampling θ ~ P(θ|X_obs) and then X_rep ~ P(X|θ). PPP ≈ 0.5 indicates a good fit. Criticism: PPPs are conservative, pulled toward 0.5, because the data are used twice (once to fit, once to test).
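A minimal sketch with a conjugate Gamma–Poisson model, using the sample mean as the discrepancy T; all numbers are simulated, and for a correctly specified model the PPP lands near 0.5:

```python
import numpy as np

rng = np.random.default_rng(11)
y_obs = rng.poisson(lam=4.0, size=100)

# Conjugate Gamma(a, b) prior on lambda -> posterior Gamma(a + sum y, b + n).
a, b = 1.0, 1.0
a_post, b_post = a + y_obs.sum(), b + len(y_obs)

def T(y):
    return y.mean()              # discrepancy statistic

n_rep = 2000
hits = 0
for _ in range(n_rep):
    lam = rng.gamma(a_post, 1.0 / b_post)       # theta ~ P(theta | X_obs)
    y_rep = rng.poisson(lam, size=len(y_obs))   # X_rep ~ P(X | theta)
    hits += T(y_rep) >= T(y_obs)

ppp = hits / n_rep               # ~0.5 suggests a good fit
```

Using the mean as T illustrates the conservatism directly: the posterior centers on the observed mean, so the PPP clusters around 0.5.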
Information Criteria and Model Selection
AIC (Akaike): AIC = −2ℓ(θ̂) + 2k; select the model with minimal AIC. It estimates the expected KL divergence between the true and fitted distributions. BIC (Bayesian): BIC = −2ℓ(θ̂) + k·ln(n); it penalizes complexity more heavily for large n and therefore selects simpler models. A BIC difference > 10 is strong evidence for the model with the lower BIC. Likelihood ratio test (nested models only): Λ = 2(ℓ₁−ℓ₀) ~ χ²(k₁−k₀).
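A sketch comparing two nested Gaussian models, N(0, σ²) against N(μ, σ²), on simulated data where the free mean is genuinely needed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=200)
n = len(x)

# M0: mu fixed at 0 (k = 1 parameter); M1: free mu (k = 2). Nested models.
sigma0 = np.sqrt(np.mean(x ** 2))            # MLE of sigma under mu = 0
ll0 = stats.norm.logpdf(x, 0.0, sigma0).sum()

mu1, sigma1 = x.mean(), x.std()              # joint MLEs under M1
ll1 = stats.norm.logpdf(x, mu1, sigma1).sum()

aic0, aic1 = -2 * ll0 + 2 * 1, -2 * ll1 + 2 * 2
bic0, bic1 = -2 * ll0 + 1 * np.log(n), -2 * ll1 + 2 * np.log(n)

lrt = 2 * (ll1 - ll0)                        # ~ chi^2(1) under H0: mu = 0
p_lrt = stats.chi2.sf(lrt, df=1)
```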
Numerical Example: χ² Goodness-of-Fit Test and AIC Criterion
Problem: A die is thrown n=60 times. Observed frequencies O={8,10,12,11,9,10}. Test H₀: fair die; choose the best of two models by AIC.
Step 1: Expected frequencies under H₀ (uniform): Eᵢ=60/6=10 for each face.
Step 2: χ²=Σ(Oᵢ−Eᵢ)²/Eᵢ=(8−10)²/10+(10−10)²/10+(12−10)²/10+(11−10)²/10+(9−10)²/10+(10−10)²/10 = 4/10+0+4/10+1/10+1/10+0 = 1.0.
Step 3: df=k−1=5. Critical χ²(0.05;5)=11.07. Since 1.0<11.07, H₀ is not rejected. p-value≈0.96.
Step 4: Model M₁ (uniform, 0 free parameters): log-likelihood ℓ₁ = 60·ln(1/6) ≈ −107.5; AIC₁ = 2·107.5 + 2·0 = 215.0. Model M₂ (free cell probabilities, 5 parameters): ℓ₂ = ΣOᵢ·ln(Oᵢ/60) ≈ −107.0; AIC₂ = 2·107.0 + 2·5 = 224.0. ΔAIC ≈ 9 > 6, so M₁ is preferred (Occam's razor confirmed).
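The arithmetic of the worked example can be checked in a few lines (same data; only the rounding is refined):

```python
import numpy as np
from scipy import stats

O = np.array([8, 10, 12, 11, 9, 10])
n = O.sum()                                   # 60 throws
E = np.full(6, n / 6)                         # 10 per face under H0

chi2_stat = np.sum((O - E) ** 2 / E)          # = 1.0
p_value = stats.chi2.sf(chi2_stat, df=5)      # ~ 0.96

ll_uniform = n * np.log(1 / 6)                # ~ -107.5
ll_free = np.sum(O * np.log(O / n))           # ~ -107.0
aic_uniform = -2 * ll_uniform + 2 * 0         # 0 free parameters
aic_free = -2 * ll_free + 2 * 5               # 5 free parameters
delta_aic = aic_free - aic_uniform            # ~ 9: uniform model wins
```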