Module VI·Article I·~4 min read
Statistical Hypothesis Testing
Hypothesis Testing
Hypothesis testing is a formalized, data-driven decision procedure: the null hypothesis H₀ is rejected or not rejected according to the value of a test statistic.
Key Concepts
Null (H₀) and alternative (H₁) hypotheses. Significance level α = P(Type I error) = P(reject H₀ | H₀ is true). Power 1-β = P(reject H₀ | H₁ is true).
p-value: p = P(statistic as extreme as observed, or more so | H₀). If p < α → reject H₀. The p-value is not the probability that H₀ is true (a common misinterpretation!).
Tests for the Normal Distribution
z-test: σ known. Z = (X̄-μ₀)/(σ/√n) ~ N(0,1). |Z| > z_{α/2} → reject.
Student's t-test: σ unknown. T = (X̄-μ₀)/(S/√n) ~ t(n-1). |T| > t_{α/2,n-1} → reject.
Two-sample t-test: H₀: μ₁ = μ₂. T = (X̄₁-X̄₂)/√(Sp²/n₁ + Sp²/n₂) ~ t(n₁+n₂-2).
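The one-sample z- and t-statistics above can be sketched directly; a minimal implementation using SciPy (the function names `z_test`/`t_test` and the example numbers, taken from the worked example and assignment below, are illustrative):

```python
import numpy as np
from scipy import stats

# One-sample z-test (sigma known): Z = (xbar - mu0) / (sigma / sqrt(n))
def z_test(xbar, mu0, sigma, n):
    z = (xbar - mu0) / (sigma / np.sqrt(n))
    p = 2 * stats.norm.sf(abs(z))          # two-sided p-value
    return z, p

# One-sample t-test (sigma unknown): T = (xbar - mu0) / (S / sqrt(n)), df = n - 1
def t_test(xbar, mu0, s, n):
    t = (xbar - mu0) / (s / np.sqrt(n))
    p = 2 * stats.t.sf(abs(t), df=n - 1)
    return t, p

z, pz = z_test(xbar=492, mu0=500, sigma=20, n=25)
t, pt = t_test(xbar=52, mu0=50, s=10, n=25)
print(round(z, 2), round(pz, 4))   # -2.0 0.0455
print(round(t, 2), round(pt, 4))
```

Note that the t-test yields a larger p-value than a z-test on the same data would, because the t distribution has heavier tails at small df.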
Neyman-Pearson Lemma
Theorem: the most powerful test of a simple H₀: f(x;θ₀) against a simple H₁: f(x;θ₁) is the likelihood-ratio test: reject when Λ(x) = L(θ₁)/L(θ₀) > c. For a given α this yields the unique most powerful test (the NP test).
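A Monte Carlo sketch of the likelihood-ratio test for two simple Gaussian hypotheses (the scenario N(0,1) vs N(1,1), sample size, and replication count are illustrative assumptions): calibrate the threshold c under H₀ to hit level α, then estimate the power under H₁.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, alpha, reps = 10, 0.05, 20000
theta0, theta1 = 0.0, 1.0   # simple H0: N(0,1) vs simple H1: N(1,1)

def log_lr(x):
    # log of L(theta1)/L(theta0), summed over each sample (row)
    return (stats.norm.logpdf(x, theta1) - stats.norm.logpdf(x, theta0)).sum(axis=1)

# calibrate the rejection threshold c under H0 by Monte Carlo ...
lr0 = log_lr(rng.normal(theta0, 1.0, (reps, n)))
c = np.quantile(lr0, 1 - alpha)
# ... then estimate the power of the rule {reject iff log LR > c} under H1
power = np.mean(log_lr(rng.normal(theta1, 1.0, (reps, n))) > c)
print(round(power, 2))
```

For this Gaussian pair the log-likelihood ratio is monotone in the sample mean, so the rule reduces to "reject for a large X̄", and the estimate should sit near the analytic power ≈ 0.93.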
Multiple Testing
Problem: with m tests at α=0.05, P(at least one false rejection) = 1−(1−0.05)^m → 1 as m grows. FWER (family-wise error rate) is controlled by the Bonferroni correction: αᵢ = α/m.
FDR (False Discovery Rate, Benjamini-Hochberg, 1995): Control the proportion of false discoveries among all discoveries. Less strict, more powerful for large m. Used in bioinformatics (gene expression).
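Both corrections are a few lines of NumPy; a minimal sketch (the function names and the example p-values are illustrative):

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    # reject H_i iff p_i <= alpha / m (controls FWER)
    p = np.asarray(pvals)
    return p <= alpha / len(p)

def benjamini_hochberg(pvals, alpha=0.05):
    # BH step-up: reject the k smallest p-values, k = max{i : p_(i) <= i*alpha/m}
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

pvals = [0.001, 0.005, 0.009, 0.012, 0.014, 0.2, 0.5, 0.9]
n_bonf = bonferroni(pvals).sum()
n_bh = benjamini_hochberg(pvals).sum()
print(n_bonf, n_bh)   # 2 5
```

On this example Bonferroni rejects 2 hypotheses while BH rejects 5, illustrating the power gain from controlling FDR instead of FWER.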
Assignment: (a) 20 coins were flipped 10 times each; for each coin, test H₀: p=0.5 with a z-test at α=0.05. If 3 coins give a "significant" result, what does this indicate? (b) Data: n=25, X̄=52, S=10, H₀: μ=50. Compute the t-statistic and the two-sided p-value. (c) With n=100 simultaneous tests, how many false discoveries are expected at α=0.05 if H₀ is true everywhere? Compare the BH correction with Bonferroni.
Test Power and Sample Size
Power 1−β = P(reject H₀ | H₁ is true). It depends on the standardized effect size δ = |μ−μ₀|/σ, the significance level α, and the sample size n. Normal approximation for the two-sample t-test, per group: n ≈ 2(z_{α/2} + z_β)²/δ². For a medium effect δ=0.5, α=0.05, power 0.8: n ≈ 64 per group; for a small effect δ=0.2: n ≈ 394.
Power curves: at fixed n and α, power is a function of δ; as δ→∞, power→1. Understanding power is critical for study planning (a priori power analysis before data collection), for interpreting non-significant results, and for assessing practical significance.
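The per-group sample-size calculation can be sketched with the normal approximation for the two-sample t-test (n ≈ 2(z_{α/2}+z_β)²/δ²); this is an approximation, and the exact t-based values are one or two observations larger, matching the ≈64 and ≈394 commonly quoted:

```python
import math
from scipy import stats

def n_per_group(d, alpha=0.05, power=0.80):
    """Per-group n for a two-sample test of standardized effect d,
    normal approximation: n = 2 * (z_{alpha/2} + z_beta)^2 / d^2."""
    za = stats.norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    zb = stats.norm.ppf(power)           # 0.84 for power = 0.80
    return math.ceil(2 * (za + zb) ** 2 / d ** 2)

print(n_per_group(0.5), n_per_group(0.2))   # 63 393
```

Note how the required n scales as 1/δ²: halving the detectable effect quadruples the sample size.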
Bayes Factors and Alternatives to p-values
The p-value is often criticized: it does not answer the question "how probable is H₀?". The Bayes factor BF₁₀ = P(data|H₁)/P(data|H₀): BF₁₀ > 10 indicates strong support for H₁, BF₁₀ < 1/10 strong support for H₀ (Jeffreys scale). It requires a prior distribution for the parameter under H₁ (informative, or a default such as a Cauchy prior).
Effect size: Cohen's d = (μ₁−μ₂)/σ_pooled. For d=0.2 — small effect; 0.5 — medium; 0.8 — large. Reporting only p-values without effect size is uninformative: as n→∞ any nonzero effect becomes significant.
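Cohen's d with the pooled SD is straightforward to compute; a minimal sketch (the simulated two-group data with a true standardized difference of 0.5 is illustrative):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: standardized mean difference using the pooled SD."""
    nx, ny = len(x), len(y)
    var_p = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(var_p)

rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, 1000)   # simulated treatment group, true d = 0.5
y = rng.normal(0.0, 1.0, 1000)   # simulated control group
d = cohens_d(x, y)
print(round(d, 2))
```

Reporting d alongside the p-value separates "is there an effect?" from "is the effect large enough to matter?".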
Sequential Tests (SPRT)
Wald's sequential probability ratio test (SPRT): at each step n, compute Λₙ = ∏ᵢ₌₁ⁿ f(Xᵢ;θ₁)/f(Xᵢ;θ₀). Stopping rule: if Λₙ ≥ B, accept H₁; if Λₙ ≤ A, accept H₀; otherwise continue sampling. Wald's approximate boundaries: B ≈ (1−β)/α, A ≈ β/(1−α). By the Wald-Wolfowitz theorem, the SPRT minimizes the expected sample size among all tests with the given α and β.
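The stopping rule above can be sketched for Gaussian data (the scenario N(0,1) vs N(1,1) and the replication count are illustrative assumptions; accumulating the log of Λₙ avoids overflow):

```python
import numpy as np
from scipy import stats

def sprt(stream, theta0, theta1, alpha=0.05, beta=0.2):
    # Wald's SPRT for N(theta, 1) observations: accumulate the log-likelihood
    # ratio, stop at log boundaries log A = log(beta/(1-alpha)), log B = log((1-beta)/alpha)
    log_a = np.log(beta / (1 - alpha))
    log_b = np.log((1 - beta) / alpha)
    llr = 0.0
    for n, x in enumerate(stream, start=1):
        llr += stats.norm.logpdf(x, theta1) - stats.norm.logpdf(x, theta0)
        if llr >= log_b:
            return "accept H1", n
        if llr <= log_a:
            return "accept H0", n
    return "undecided", len(stream)

rng = np.random.default_rng(1)
# 200 data streams generated under H1 (theta = 1): SPRT should usually accept H1
runs = [sprt(rng.normal(1.0, 1.0, 1000), 0.0, 1.0) for _ in range(200)]
accept_h1 = sum(d == "accept H1" for d, _ in runs)
avg_n = np.mean([n for _, n in runs])
print(accept_h1, round(avg_n, 1))
```

The average stopping time is only a handful of observations, far below what a fixed-n design with the same α and β would require, which is the point of the Wald-Wolfowitz optimality result.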
Bayesian Tests and Procedures
In the Bayesian paradigm, H₀ and H₁ themselves carry probabilities: P(H₀|data) = P(data|H₀)P(H₀)/P(data). Decision rule: accept the hypothesis with the greater posterior probability (maximizing expected correctness); under symmetric 0-1 loss this is the MAP decision.
Contrast with the p-value (the Jeffreys-Lindley effect): n=10000, X̄=50.3, S=10, H₀: μ=50. z=(50.3−50)/(10/√10000)=3, p ≈ 0.0027, highly significant. A Bayes factor under a default unit-information prior (BIC approximation BF₁₀ ≈ e^{z²/2}/√n) gives BF₁₀ ≈ 0.9: essentially no evidence for H₁. Reason: at very large n even negligible effects yield tiny p-values, while the Bayes factor weighs the effect size against the prior.
Multiple Comparison in Clinical Trials
In clinical trials, testing multiple outcomes requires controlling the FWER. Closed testing principle (Marcus-Peritz-Gabriel): reject an elementary hypothesis H only if every intersection hypothesis containing H is rejected at level α. Holm method (Holm-Bonferroni): sort the p-values p₍₁₎ ≤ ... ≤ p₍ₘ₎ and reject H₍ᵢ₎ while p₍ᵢ₎ ≤ α/(m−i+1), stopping at the first failure. More powerful than Bonferroni, and still controls the FWER.
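The Holm step-down procedure can be sketched directly (the function name and the example p-values are illustrative):

```python
import numpy as np

def holm(pvals, alpha=0.05):
    # Holm-Bonferroni step-down: compare p_(i) with alpha/(m-i+1), stop at first failure
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for i, idx in enumerate(order):          # i-th smallest p-value
        if p[idx] <= alpha / (m - i):        # threshold alpha/(m-i+1) in 1-based indexing
            reject[idx] = True
        else:
            break                            # all larger p-values also fail
    return reject

pvals = [0.010, 0.020, 0.005, 0.200]
print(holm(pvals))   # [ True  True  True False]
```

Plain Bonferroni (α/4 = 0.0125) would reject only two of these hypotheses; Holm rejects three while keeping the same FWER guarantee.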
Combining p-values and Meta-analysis
Combining p-values: for independent tests, Fisher's method: −2Σ log pᵢ ~ χ²(2m) under the global null. For dependent tests: Brown's method and the Kost-McDermott extension. Meta-analytic approach: combine multiple studies via weighted z-scores (Stouffer's method). The power of a meta-analysis is much higher than that of single studies; it is standard practice in medicine.
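Fisher and Stouffer combination are both one-liners once the transforms are written down; a minimal sketch for independent tests (the function names and the example p-values are illustrative):

```python
import numpy as np
from scipy import stats

def fisher_combine(pvals):
    # Fisher: -2 * sum(log p_i) ~ chi2(2m) under the global null
    p = np.asarray(pvals)
    stat = -2 * np.sum(np.log(p))
    return stats.chi2.sf(stat, df=2 * len(p))

def stouffer_combine(pvals, weights=None):
    # Stouffer: combine z_i = Phi^{-1}(1 - p_i) with weights, renormalize to N(0,1)
    p = np.asarray(pvals)
    w = np.ones_like(p) if weights is None else np.asarray(weights, dtype=float)
    z = stats.norm.isf(p)
    return stats.norm.sf(np.sum(w * z) / np.sqrt(np.sum(w ** 2)))

pvals = [0.08, 0.06, 0.10, 0.04]   # individually non-significant studies
p_fisher = fisher_combine(pvals)
p_stouffer = stouffer_combine(pvals)
print(round(p_fisher, 4), round(p_stouffer, 4))
```

Four individually non-significant results combine to a clearly significant global p-value, which is exactly the power gain meta-analysis exploits.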
Neyman-Pearson Test and Uniformly Most Powerful Tests
UMP test: a level-α test that is at least as powerful as any other level-α test for every θ ∈ Θ₁. For one-parameter exponential families with a monotone likelihood ratio, a UMP test exists for one-sided hypotheses (Karlin-Rubin theorem), built from the Neyman-Pearson test. For two-sided hypotheses a UMP test generally does not exist; one must trade off power in the two directions.
Numerical Example: z-test for Mean
Problem: Manufacturer claims μ=500 g (package weight). Sample n=25: x̄=492 g, σ=20 g (known). Test H₀: μ=500 against H₁: μ≠500 at α=0.05.
Step 1: Test statistic: z=(x̄−μ₀)/(σ/√n)=(492−500)/(20/5)=−8/4=−2.0.
Step 2: Critical region (two-sided): |z|>z_{α/2}=1.96. Since |−2.0|=2.0>1.96, H₀ is rejected.
Step 3: p-value: p=2·P(Z<−2.0)=2·Φ(−2.0)=2·0.0228=0.0456<0.05. Conclusion: deviation is significant.
Step 4: Power at alternative μ=495 (δ=(495−500)/4=−1.25): β=P(accept H₀|μ=495)=P(|z|<1.96|μ=495)=Φ(1.96+1.25)−Φ(−1.96+1.25)=Φ(3.21)−Φ(−0.71)≈0.9993−0.239=0.760. Power=1−0.760=0.240: the test weakly distinguishes μ=500 and μ=495 at n=25. To reach 80% power, need n≥(σ(z_{α/2}+z_β)/δ)²=(20·(1.96+0.84)/5)²≈(11.2)²≈126.
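Steps 1-4 can be checked numerically; a sketch reproducing the example's arithmetic with SciPy (the variable names are illustrative):

```python
import math
from scipy import stats

n, mu0, xbar, sigma, alpha = 25, 500, 492, 20, 0.05
se = sigma / math.sqrt(n)                        # standard error: 20/5 = 4

# Steps 1-3: test statistic, critical region, p-value
z = (xbar - mu0) / se                            # -2.0
reject = abs(z) > stats.norm.ppf(1 - alpha / 2)  # |z| > 1.96
p = 2 * stats.norm.sf(abs(z))                    # two-sided p-value ~ 0.0455

# Step 4: power at mu = 495; the shift in SE units is delta = (495 - 500)/4 = -1.25
delta = (495 - mu0) / se
beta = stats.norm.cdf(1.96 - delta) - stats.norm.cdf(-1.96 - delta)
power = 1 - beta                                 # ~ 0.24

# sample size for 80% power at the same alternative (|mu - mu0| = 5)
n_req = math.ceil((sigma * (stats.norm.ppf(0.975) + stats.norm.ppf(0.8)) / 5) ** 2)
print(z, round(p, 4), reject, round(power, 3), n_req)
```

Running this confirms each hand-computed value, including the jump from n=25 to n=126 needed to lift power from 0.24 to 0.80.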