Asymptotic Properties of Estimators and the Delta Method
Asymptotic Statistics and Robustness
Asymptotic statistics studies the behavior of estimators as n → ∞. Consistency, asymptotic normality, and the delta method are the main tools for analyzing estimator properties.
Consistency
Weak consistency: $\hat{\theta}_n \to_P \theta_0$ as $n\to\infty$. A sufficient condition: $\operatorname{Bias}(\hat{\theta}_n)\to 0$ and $\operatorname{Var}[\hat{\theta}_n] \to 0$. Mean square consistency (MSE→0) implies weak consistency.
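A minimal simulation sketch of this criterion (assuming NumPy; the exponential distribution and sample sizes are illustrative): the MSE of the sample mean shrinks roughly like $\sigma^2/n$, so the estimator is mean-square, and hence weakly, consistent.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 3.0                          # true mean of the exponential distribution

for n in (10, 100, 1000, 10000):
    # 2000 replications of the sample mean at sample size n
    means = rng.exponential(scale=mu_true, size=(2000, n)).mean(axis=1)
    mse = np.mean((means - mu_true) ** 2)
    print(f"n={n:6d}  MSE={mse:.5f}")  # MSE ~ sigma^2 / n -> 0
```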
Strong consistency: $\hat{\theta}_n \to_{a.s.} \theta_0$. The MLE is strongly consistent under regularity conditions; this follows from the strong law of large numbers applied to the log-likelihood.
Invariance of MLE: If $\hat{\theta}$ is the MLE for $\theta$, then $g(\hat{\theta})$ is the MLE for $g(\theta)$ (for any function $g$). This follows as a consequence of the definition via likelihood maximization.
Asymptotic Normality of MLE
Theorem: Under regularity conditions: $\sqrt{n}(\hat{\theta}_{MLE} - \theta_0) \to_d N(0, I(\theta_0)^{-1})$. The estimator is approximately normal with variance $1/(nI(\theta_0))$.
Violations of regularity: Uniform $U[0,\theta]$: the MLE is $X_{(n)}$, which converges at rate $n$ (not $\sqrt{n}$) with an exponential limiting distribution. Mixture models: rate $\sqrt{n}$, but the limiting distribution is non-normal.
Asymptotic confidence intervals: three types: (1) Wald: $\hat{\theta} \pm z_{\alpha/2} \cdot SE(\hat{\theta})$; (2) profile-likelihood intervals: obtained by inverting the likelihood-ratio statistic; (3) score intervals: based on the derivative of the log-likelihood. Profile-likelihood intervals are more accurate for small $n$.
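A minimal sketch of a Wald interval (assuming NumPy; the exponential-rate model and sample size are illustrative). For Exp($\lambda$), $I(\lambda)=1/\lambda^2$, so $SE(\hat{\lambda}) \approx \hat{\lambda}/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(1)
lam_true, n = 2.0, 500
x = rng.exponential(scale=1 / lam_true, size=n)

lam_hat = 1 / x.mean()             # MLE of the rate
se = lam_hat / np.sqrt(n)          # I(lambda) = 1/lambda^2  =>  SE = lam_hat / sqrt(n)
z = 1.96
print("MLE:", lam_hat, " Wald 95% CI:", (lam_hat - z * se, lam_hat + z * se))
```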
Delta Method
Univariate: If $\sqrt{n}(\hat{\theta}_n - \theta_0) \to_d N(0, \sigma^2)$ and $g$ is differentiable at $\theta_0$ ($g'(\theta_0)\ne 0$): $\sqrt{n}(g(\hat{\theta}_n) - g(\theta_0)) \to_d N(0, \sigma^2 (g'(\theta_0))^2)$. The variance of a nonlinear transformation $\approx$ (derivative)$^2$ times the original variance.
Multivariate delta method: $\sqrt{n}(g(\hat{\theta})-g(\theta_0))\to_d N(0, \nabla g(\theta_0)^\top \Sigma \nabla g(\theta_0))$, where $\Sigma$ is the asymptotic covariance matrix of $\hat{\theta}$.
Practical applications: Coefficient of variation $CV = \sigma/\mu$. Logit transformation: $g(p) = \log(p/(1-p))$. Risk ratio in survival analysis. Difference of proportions.
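A minimal simulation check of the univariate delta method for the odds transformation $g(p)=p/(1-p)$ (assuming NumPy; the proportion and sample sizes are illustrative), comparing the delta-method SE with a simulated SE:

```python
import numpy as np

rng = np.random.default_rng(2)
p_true, n = 0.25, 400

# Delta method for the odds g(p) = p/(1-p): g'(p) = 1/(1-p)^2
se_delta = np.sqrt(p_true * (1 - p_true) / n) / (1 - p_true) ** 2

# Simulation check: SE of the odds over 5000 replicated samples
p_hats = rng.binomial(n, p_true, size=5000) / n
odds = p_hats / (1 - p_hats)
print(f"delta SE = {se_delta:.5f},  simulated SE = {odds.std(ddof=1):.5f}")
```

The two values should agree up to simulation error, since $g$ is smooth at $p=0.25$ and $n$ is moderately large.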
Assignment: (a) Poisson($\lambda$=3): $I(\lambda)=1/\lambda$. Delta method for $g(\hat{\lambda}) = e^{-\hat{\lambda}} = P(X=0)$. Asymptotic 95% CI. (b) $LN(\mu,\sigma^2)$: MLE for median $e^{\hat{\mu}}$. Delta method for $SE(e^{\hat{\mu}})$. (c) $n=200$ from Gamma($\alpha=2$, $\beta=1$). Check delta method for $1/\bar{X}$ by simulation (1000 repetitions).
Delta Method: Second Order and Multivariate Case
Second order delta method: When $g'(\mu) = 0$: $\sqrt{n}(g(\bar{X}) - g(\mu))$ is not normal. Use the second order: $n(g(\bar{X}) - g(\mu)) \to \sigma^2 g''(\mu)/2 \cdot \chi^2(1)$—chi-squared distribution. Example: $g(\mu) = \mu^2$ at $\mu=0$: $n(\bar{X}^2) \to \sigma^2 \cdot \chi^2(1)$.
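A simulation sketch of this degenerate case (assuming NumPy and SciPy; the parameters are illustrative): for centered normal data, $n\bar{X}^2/\sigma^2$ is compared against $\chi^2(1)$ quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sigma, n = 2.0, 1000
# X_i ~ N(0, sigma^2); g(mu) = mu^2 with g'(0) = 0, so the first-order delta method degenerates
xbar = rng.normal(0, sigma, size=(10000, n)).mean(axis=1)
t = n * xbar**2 / sigma**2          # should be approximately chi^2(1)

qs = [0.5, 0.9, 0.95]
print("simulated quantiles:", np.quantile(t, qs))
print("chi^2(1) quantiles: ", stats.chi2.ppf(qs, df=1))
```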
Multivariate delta method: If $\sqrt{n}(\bar{X} - \mu) \to N_k(0,\Sigma)$, then for $g: \mathbb{R}^k \to \mathbb{R}$: $\sqrt{n}(g(\bar{X})-g(\mu)) \to N(0, \nabla g(\mu)^\top \Sigma \nabla g(\mu))$. For a vector-valued function $g: \mathbb{R}^k \to \mathbb{R}^m$: $\sqrt{n}(g(\bar{X})-g(\mu)) \to N(0, J\Sigma J^\top)$, where $J$ is the Jacobian.
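A sketch of the multivariate delta method for a ratio of means $g(\mu_1,\mu_2)=\mu_1/\mu_2$ (assuming NumPy; the bivariate normal data are illustrative), with the gradient evaluated at the sample mean:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
mu = np.array([2.0, 5.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
X = rng.multivariate_normal(mu, Sigma, size=n)

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)          # estimate of Sigma

# g(m1, m2) = m1/m2, gradient (1/m2, -m1/m2^2) at the sample mean
grad = np.array([1 / xbar[1], -xbar[0] / xbar[1] ** 2])
var_ratio = grad @ S @ grad / n      # grad^T Sigma grad / n
print("ratio =", xbar[0] / xbar[1], " SE ≈", np.sqrt(var_ratio))
```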
Fisher Information: Meaning and Applications
$I(\theta) = -E\left[\frac{\partial^2}{\partial \theta^2} \log f(X; \theta)\right]$ is the "expected information" in one observation. The greater the curvature of the log-likelihood, the more information. Information inequality (Cramér-Rao bound): for an unbiased estimator, $\operatorname{Var}(\hat{\theta}) \geq 1/(nI(\theta))$, a lower bound on the variance. For normal $N(\mu, \sigma^2)$: $I(\mu)=1/\sigma^2$, $I(\sigma^2)=1/(2\sigma^4)$. For Poisson($\lambda$): $I(\lambda)=1/\lambda$.
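A quick numerical check of the Poisson case (assuming NumPy; $\lambda=3$ is illustrative): the Monte Carlo average of the observed information $-\partial^2_\lambda \log f(X;\lambda) = X/\lambda^2$ should match $I(\lambda)=1/\lambda$.

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 3.0
x = rng.poisson(lam, size=200_000)

# log f(x; lambda) = x*log(lambda) - lambda - log(x!)
# second derivative in lambda: -x / lambda^2, so observed information is x / lambda^2
info_mc = np.mean(x / lam**2)
print(info_mc, "vs analytic", 1 / lam)
```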
Information in several parameters: Fisher information matrix $I(\theta) \in \mathbb{R}^{p \times p}$: $I_{ij} = -E\left[\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log f\right]$. Matrix CRB for unbiased estimators: $\operatorname{Var}(\hat{\theta}) \succeq (nI(\theta))^{-1}$ (in the Löwner sense). For the MLE: $\sqrt{n}(\hat{\theta} - \theta) \to_d N_p(0, I(\theta)^{-1})$, so it is asymptotically efficient.
Regression as an Estimation Problem
In linear regression $Y = X\beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I)$: the MLE coincides with OLS. Fisher information $I(\beta) = X^\top X/\sigma^2$. CRB: $\operatorname{Var}(\hat{\beta}) \geq \sigma^2 (X^\top X)^{-1}$, attained by OLS under normal errors; without normality, OLS is still the best linear unbiased estimator (Gauss-Markov theorem). For $p > n$: regularized estimators (Ridge, Lasso) can achieve lower MSE than the CRB by trading bias for variance.
Asymptotic Criteria: Wald, Likelihood Ratio, and Rao (Score) Tests
Three asymptotic tests for $H_0: \theta = \theta_0$. Wald test: $W_n = n(\hat{\theta}-\theta_0)^\top I(\hat{\theta})(\hat{\theta} - \theta_0) \to_d \chi^2(k)$. Likelihood Ratio Test (LRT): $\Lambda_n = 2(\ell(\hat{\theta}) - \ell(\theta_0)) \to_d \chi^2(k)$. Rao (score) test: $R_n = n^{-1} S(\theta_0)^\top I(\theta_0)^{-1} S(\theta_0) \to_d \chi^2(k)$, where $S$ is the score (gradient of the full-sample log-likelihood). All three are asymptotically equivalent, but their finite-sample properties differ. The Rao test is convenient because it does not require finding the MLE.
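A minimal sketch of the three statistics for a Bernoulli proportion with $H_0: p = 0.5$ (assuming NumPy and SciPy; the counts are illustrative, and $I(p)=1/(p(1-p))$):

```python
import numpy as np
from scipy import stats

n, k, p0 = 100, 62, 0.5            # 62 successes out of 100, H0: p = 0.5
p_hat = k / n

def loglik(p):
    return k * np.log(p) + (n - k) * np.log(1 - p)

# Wald: n (p_hat - p0)^2 * I(p_hat)
W = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))
# Likelihood ratio: 2 * (l(p_hat) - l(p0))
LR = 2 * (loglik(p_hat) - loglik(p0))
# Score: S(p0) = k/p0 - (n-k)/(1-p0), R = S^2 / (n I(p0))
S = k / p0 - (n - k) / (1 - p0)
R = S ** 2 * p0 * (1 - p0) / n

for name, stat in (("Wald", W), ("LRT", LR), ("Score", R)):
    print(name, round(stat, 3), "p-value", round(1 - stats.chi2.cdf(stat, df=1), 4))
```

For these counts the three statistics land close together (roughly 5.8 to 6.1) and lead to the same decision, in line with their asymptotic equivalence.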
Bootstrap in Estimating Fisher Information
For complex models, analytical $I(\theta)$ is hard to obtain. Parametric bootstrap: generate $B$ samples from $F(\hat{\theta})$ and estimate $\hat{\theta}^*_b$ for each. $SE_{boot} \approx \operatorname{std}(\hat{\theta}^*) \approx 1/\sqrt{nI(\hat{\theta})}$. Nonparametric bootstrap: resample from the data; asymptotically correct, but requires extra caution for dependent data.
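A sketch of the parametric bootstrap SE for a Poisson mean (assuming NumPy; the rate, sample size, and $B$ are illustrative), compared with the information-based SE $1/\sqrt{nI(\hat{\lambda})} = \sqrt{\hat{\lambda}/n}$:

```python
import numpy as np

rng = np.random.default_rng(6)
lam_true, n, B = 4.0, 150, 1000
x = rng.poisson(lam_true, size=n)
lam_hat = x.mean()                        # MLE of lambda

# Parametric bootstrap: resample from Poisson(lam_hat), re-estimate each time
boot = rng.poisson(lam_hat, size=(B, n)).mean(axis=1)
se_boot = boot.std(ddof=1)

se_info = np.sqrt(lam_hat / n)            # 1 / sqrt(n * I(lam_hat)), I(lambda) = 1/lambda
print(f"bootstrap SE = {se_boot:.4f},  information-based SE = {se_info:.4f}")
```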
Regularized Estimation in High Dimensions
For $p > n$: the MLE is not uniquely defined ($X^\top X$ is singular). Ridge estimator: $\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1}X^\top y$. Bayesian interpretation: Gaussian prior $N(0, \tau^2 I) \to$ posterior mean with $\lambda = \sigma^2/\tau^2$. Never produces exactly zero coefficients. Lasso: $\hat{\beta}_{\text{lasso}} = \arg\min_\beta \left\{\|y-X\beta\|^2 + \lambda \|\beta\|_1\right\}$. Laplace prior $\to$ sparse solution (feature selection). Geometrically: the $\ell_1$-ball has vertices, so the optimum often lands at a corner.
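A minimal $p > n$ sketch (assuming NumPy and scikit-learn; the design, sparsity pattern, and penalty values are illustrative, and scikit-learn's Lasso scales the squared loss by $1/(2n)$, so its $\alpha$ is not identical to $\lambda$ above): ridge via the closed form, lasso to show exact zeros.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
n, p = 50, 100                                  # p > n: X^T X is singular
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0                             # sparse truth: 5 active features
y = X @ beta_true + rng.normal(size=n)

# Ridge: closed form (X^T X + lambda I)^{-1} X^T y, well defined for any lambda > 0
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print("ridge: exact zeros =", np.sum(beta_ridge == 0))   # shrinks, never zeroes

# Lasso: l1 penalty produces exact zeros (feature selection)
beta_lasso = Lasso(alpha=0.1).fit(X, y).coef_
print("lasso: exact zeros =", np.sum(beta_lasso == 0))
```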
Estimation Theory Under Constraints
Estimation with constraints: $\theta \in C$ (closed convex $C$). Constrained MLE: $\arg\max_{\theta \in C} L(\theta)$, found via Lagrange multipliers; for inequality constraints, the Karush-Kuhn-Tucker (KKT) conditions. Testing constraints: Wald test of $H_0: R\theta = r$ versus $R\theta \ne r$. In linear regression the statistic is $(R\hat{\beta} - r)^\top [R(X^\top X)^{-1} R^\top]^{-1} (R\hat{\beta} - r)/(q\hat{\sigma}^2) \sim F(q, n-p)$, used to test several linear restrictions simultaneously.
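A sketch of this F-test for one linear restriction (assuming NumPy and SciPy; the design, true coefficients, and the restriction $\beta_1 = \beta_2$ are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p = 120, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 0.5, 0.5, 0.0])
y = X @ beta + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)             # unbiased estimate of sigma^2

# H0: beta_1 = beta_2  (q = 1 restriction), written as R beta = r
R = np.array([[0.0, 1.0, -1.0, 0.0]])
r = np.array([0.0])
q = R.shape[0]

diff = R @ beta_hat - r
XtX_inv = np.linalg.inv(X.T @ X)
F = diff @ np.linalg.solve(R @ XtX_inv @ R.T, diff) / (q * sigma2_hat)
print("F =", F, " p-value =", 1 - stats.f.cdf(F, q, n - p))
```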
Numerical Example: Delta Method for Confidence Interval of the Logit
Task: $X_1,\ldots,X_{100} \sim \text{Bernoulli}(p)$, $\bar{x}=0.30$. Construct a 95% CI for the logit $g(p)=\ln(p/(1-p))$.
Step 1: By the CLT: $\sqrt{n}(\bar{x}-p)\to N(0,p(1-p))$. $g(p)=\ln(p/(1-p))$, $g'(p)=1/(p(1-p))$.
Step 2: At $\hat{p}=0.3$: $g'(0.3)=1/(0.3 \cdot 0.7)=1/0.21\approx4.762$. $\operatorname{Var}[g(\hat{p})]\approx[g'(\hat{p})]^2 \cdot \hat{p}(1-\hat{p})/n = 22.68 \cdot 0.21/100=0.0476$.
Step 3: $SE\approx \sqrt{0.0476}\approx 0.218$. $g(0.3)=\ln(0.3/0.7)=\ln(0.4286)\approx-0.847$. 95% CI for the logit: $-0.847 \pm 1.96 \cdot 0.218 = [-1.274, -0.420]$.
Step 4: Back-transformation: lower bound $p: e^{-1.274}/(1+e^{-1.274})=(0.2793/1.2793)\approx0.218$. Upper: $e^{-0.420}/(1+e^{-0.420}) \approx 0.397$. CI for $p$: [0.22, 0.40]. Direct Wald interval: $0.30 \pm 1.96 \cdot 0.0458 = [0.21, 0.39]$—comparable, but the logit interval is better for extreme $p$.
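A short script reproducing these steps (assuming NumPy; the printed values may differ from the ones above in the third decimal because of rounding):

```python
import numpy as np

n, p_hat = 100, 0.30
g = np.log(p_hat / (1 - p_hat))                  # logit(0.3) ≈ -0.847
se = np.sqrt(1 / (n * p_hat * (1 - p_hat)))      # delta method: [g'(p)]^2 p(1-p)/n = 1/(n p(1-p))
lo, hi = g - 1.96 * se, g + 1.96 * se

def inv_logit(t):
    # back-transformation from the logit scale to the probability scale
    return np.exp(t) / (1 + np.exp(t))

print(f"logit CI: [{lo:.3f}, {hi:.3f}]")
print(f"p CI:     [{inv_logit(lo):.3f}, {inv_logit(hi):.3f}]")
```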