Parametric Estimation
Sample Statistics and Estimation
Statistical estimation is the construction of estimates for unknown parameters based on observed data. The maximum likelihood method, Bayesian approach, and method of moments are the three main paradigms.
Maximum Likelihood Estimation (MLE)
Likelihood function: $L(\theta; x_1, ..., x_n) = \prod_i f(x_i; \theta)$ (for an i.i.d. sample). Log-likelihood: $\ell(\theta) = \sum_i \log f(x_i; \theta)$.
MLE: $\hat{\theta}_{\text{MLE}} = \arg\max_\theta \ell(\theta)$. Solution: $\partial\ell/\partial\theta = 0$ (likelihood equations).
Properties of MLE: Consistency: $\hat{\theta}_n \to_P \theta_0$. Asymptotic normality: $\sqrt{n}(\hat{\theta}_n - \theta_0) \to_d N(0, I(\theta_0)^{-1})$. Asymptotic efficiency: attains the Cramér–Rao lower bound.
Fisher information (per observation): $I(\theta) = \mathbb{E}[-\partial^2 \log f(X;\theta)/\partial\theta^2] = \operatorname{Var}[\partial \log f(X;\theta)/\partial\theta]$. Cramér–Rao lower bound: $\operatorname{Var}[\hat{\theta}] \geq 1/(nI(\theta))$.
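To make the optimization concrete, here is a minimal numerical sketch (not from the article) for an exponential sample, where the closed-form answer $\hat{\lambda} = 1/\bar{x}$ lets us check the optimizer; the data values are made up and SciPy is assumed to be available.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical i.i.d. exponential sample (illustrative values, not from the article)
x = np.array([0.8, 1.3, 0.4, 2.1, 0.9, 1.6])
n = len(x)

def neg_log_likelihood(lam):
    # l(lambda) = n*log(lambda) - lambda * sum(x_i)
    return -(n * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded")
lam_hat = res.x
print(lam_hat, 1.0 / x.mean())   # closed form: lambda_hat = 1 / sample mean

# Fisher information per observation: I(lambda) = 1/lambda^2,
# so the asymptotic standard error of the MLE is lambda_hat / sqrt(n)
se = lam_hat / np.sqrt(n)
print("approx 95% CI:", (lam_hat - 1.96 * se, lam_hat + 1.96 * se))
```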
Bayesian Estimation
Bayesian updating: $\pi(\theta|x) \propto L(\theta; x) \cdot \pi(\theta)$. Posterior = Likelihood × Prior / Normalization.
Bayesian estimates: MAP (maximum a posteriori): $\hat{\theta}_{\text{MAP}} = \arg\max_\theta \pi(\theta|x)$. EAP (expected a posteriori): $\hat{\theta}_{\text{Bayes}} = \mathbb{E}[\theta|x]$ — minimizes MSE.
Conjugate priors: a prior $\pi(\theta)$ is conjugate for the likelihood $L(\theta; x)$ if the posterior $\pi(\theta|x)$ belongs to the same parametric family as the prior. Beta–Binomial example: prior $\operatorname{Beta}(\alpha,\beta)$, binomial likelihood with $k$ successes in $n$ trials $\to$ posterior $\operatorname{Beta}(\alpha+k, \beta+n-k)$.
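A minimal sketch of this update in code (the prior and data numbers are made up and SciPy is assumed): the posterior is obtained by simply adding the observed counts to the Beta parameters.

```python
from scipy.stats import beta

# Hypothetical prior and data (not the article's exercise numbers)
alpha0, beta0 = 2.0, 5.0        # prior Beta(2, 5)
n, k = 20, 12                   # 12 successes in 20 Bernoulli trials

# Conjugate update: Beta(alpha0 + k, beta0 + n - k)
alpha_post, beta_post = alpha0 + k, beta0 + (n - k)
posterior = beta(alpha_post, beta_post)

print("posterior mean (EAP):", posterior.mean())
print("MAP:", (alpha_post - 1) / (alpha_post + beta_post - 2))
print("MLE:", k / n)
print("95% credible interval:", posterior.interval(0.95))
```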
EM Algorithm
Task: MLE with incomplete data or hidden variables. Maximize $\ell(\theta; x) = \log P(x; \theta) = \log \sum_z P(x, z; \theta)$.
EM: E-step: $Q(\theta|\theta^t) = \mathbb{E}_{z|x,\theta^t}[\log P(x, z; \theta)]$. M-step: $\theta^{t+1} = \arg\max_\theta Q(\theta|\theta^t)$. Guaranteed not to decrease $\ell$ at each step.
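To see the E/M alternation in code, here is a small sketch for a different latent-variable model than the exercise below: a mixture of two biased coins, where each session of tosses comes from an unknown coin. The data and starting values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import binom

# Each row: (heads, tosses) for one session with an unknown coin (illustrative data)
data = np.array([[5, 10], [9, 10], [8, 10], [4, 10], [7, 10]])
p = np.array([0.6, 0.5])          # initial guesses for the two coin biases
w = np.array([0.5, 0.5])          # initial mixing weights

for _ in range(50):
    # E-step: responsibilities gamma[i, j] = P(coin j | session i, current parameters)
    like = np.stack([w[j] * binom.pmf(data[:, 0], data[:, 1], p[j]) for j in range(2)], axis=1)
    gamma = like / like.sum(axis=1, keepdims=True)

    # M-step: weighted MLE of each coin's bias and of the mixing weights
    p = (gamma * data[:, [0]]).sum(axis=0) / (gamma * data[:, [1]]).sum(axis=0)
    w = gamma.mean(axis=0)

print("estimated biases:", p, "weights:", w)
```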
Exercise:
(a) Poisson sample: $x_1,...,x_n$. Find MLE for $\lambda$. Fisher information $I(\lambda)$. Confidence interval.
(b) Beta prior $\operatorname{Beta}(2,2)$ for coin. 7 heads out of 10. Posterior? MAP vs MLE vs Bayes EAP.
(c) EM for Gaussian Mixture with two components: write E-step ($\gamma_{ik}$) and M-step ($\mu_k, \sigma_k, \pi_k$).
Estimation Theory: Completeness and Sufficiency
Sufficient statistic $T(X)$ contains all information from the sample about parameter $\theta$: the distribution of $X|T$ does not depend on $\theta$. Neyman–Fisher factorization criterion: $T$ is sufficient if and only if $f(x|\theta) = g(T(x), \theta)\cdot h(x)$. For exponential families $f(x|\theta) = h(x)\cdot\exp\{\eta(\theta)T(x) - A(\theta)\}$ the natural sufficient statistic is $T(x)$.
Complete statistic $T$: $\mathbb{E}_\theta[g(T)] = 0$ for all $\theta \implies g(T) = 0$ almost surely. Lehmann–Scheffé theorem: if $T$ is complete and sufficient, then any unbiased estimator that is a function of $T$ is the UMVUE (uniformly minimum-variance unbiased estimator).
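A short worked illustration of both criteria (a standard textbook fact, not part of the exercises): for an i.i.d. Poisson($\lambda$) sample, $f(x|\lambda) = \prod_i \lambda^{x_i} e^{-\lambda}/x_i! = \left(\lambda^{\sum_i x_i} e^{-n\lambda}\right)\cdot\left(\prod_i 1/x_i!\right)$, so by factorization $T(x) = \sum_i x_i$ is sufficient (the first factor is $g(T(x),\lambda)$, the second is $h(x)$). The Poisson family is a full-rank exponential family, hence $T$ is also complete; since $\bar{x} = T/n$ is unbiased for $\lambda$, the Lehmann–Scheffé theorem gives $\bar{x}$ as the UMVUE.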
Bayesian Estimation: Posterior Functionals
MAP (Maximum A Posteriori): $\hat{\theta}_{\text{MAP}} = \arg\max p(\theta|x)$. With uniform prior $\to$ MLE. With Laplacian prior $\to$ L1-regularization (Lasso). With Gaussian prior $\to$ L2-regularization (Ridge).
EAP (Expected A Posteriori) = Bayes estimate: $\hat{\theta}_{\text{EAP}} = \mathbb{E}[\theta|x]$ — optimal for quadratic losses. MAP is optimal for 0-1 loss. Posterior median is optimal for absolute loss. Choice of estimator depends on loss function.
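A three-line sketch of how the three loss functions pick three different summaries of the same posterior (the Beta(4, 9) posterior is a made-up example; SciPy assumed):

```python
from scipy.stats import beta

post = beta(4, 9)                                      # hypothetical posterior for a probability

print("MAP (0-1 loss):   ", (4 - 1) / (4 + 9 - 2))     # mode of Beta(a, b) = (a-1)/(a+b-2)
print("mean (quadratic): ", post.mean())               # EAP / Bayes estimate
print("median (absolute):", post.median())             # posterior median
```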
Empirical Bayes: the prior is estimated from the data itself (Stein, 1956). The James–Stein estimator dominates the usual estimate of a mean vector $\mu \in \mathbb{R}^k$ in total MSE for $k \geq 3$: $\hat{\theta}_{\text{JS}} = (1 - (k-2)/||X||^2)\cdot X$. Stein's paradox: jointly shrinking the estimates of $k$ unrelated means gives lower total MSE than estimating each component separately. EM algorithm: an iterative method for finding the MLE with incomplete data. E-step: compute $\mathbb{E}[\log L(\theta; X_{\text{complete}})|X_{\text{obs}}, \theta_t]$; M-step: maximize over $\theta$. The observed-data log-likelihood does not decrease at any iteration.
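A small simulation sketch of the shrinkage effect (all numbers illustrative, assuming unit-variance Gaussian observations): compare the total MSE of the raw observations with the James–Stein estimate over repeated draws.

```python
import numpy as np

rng = np.random.default_rng(0)
k, trials = 10, 2000
mu = rng.normal(size=k)                      # a fixed true mean vector in R^k

mse_mle, mse_js = 0.0, 0.0
for _ in range(trials):
    x = mu + rng.normal(size=k)              # X ~ N(mu, I_k), one observation per component
    shrink = 1.0 - (k - 2) / np.sum(x**2)    # James-Stein shrinkage factor
    js = shrink * x
    mse_mle += np.sum((x - mu) ** 2)
    mse_js += np.sum((js - mu) ** 2)

print("MLE total MSE:        ", mse_mle / trials)
print("James-Stein total MSE:", mse_js / trials)   # typically smaller for k >= 3
```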
Conjugate Prior Families in Bayesian Analysis
A prior $\pi(\theta)$ is called conjugate for likelihood $f(x|\theta)$ if the posterior distribution $\pi(\theta|x)$ belongs to the same parametric family as the prior. Table: Bernoulli $\to$ Beta; Poisson $\to$ Gamma; Normal $(\mu, \sigma^2$ known$)\to$ Normal; Multinomial $\to$ Dirichlet; Exponential $\to$ Gamma.
Beta–Binomial: prior $\operatorname{Beta}(\alpha, \beta)$, data $\operatorname{Bin}(n, p)$: posterior $\operatorname{Beta}(\alpha+k, \beta + n - k)$. “Pseudo-data”: $\alpha$ and $\beta$ are “prior-observations” of heads and tails. As $n$ grows, prior washes out. MAP estimator: $\hat{p}_{\text{MAP}} = (\alpha + k - 1)/(\alpha + \beta + n - 2)$. For $\alpha = \beta = 1$ (uniform prior) MAP = MLE = $k/n$.
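The "washing out" can be made explicit (standard algebra, not in the original text): the posterior mean $\frac{\alpha+k}{\alpha+\beta+n} = w\cdot\frac{\alpha}{\alpha+\beta} + (1-w)\cdot\frac{k}{n}$ with weight $w = \frac{\alpha+\beta}{\alpha+\beta+n}$, so as $n$ grows the weight on the prior mean vanishes and the posterior mean approaches the MLE $k/n$.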
Bayesian Model Selection and Comparison
Bayes factor: $BF_{12} = P(X|M_1)/P(X|M_2) = \int L(\theta_1)\pi(\theta_1)d\theta_1 / \int L(\theta_2)\pi(\theta_2)d\theta_2$. Requires marginal likelihoods — computationally intensive (MCMC, Laplace approximation). DIC criterion: $DIC = \bar{D} + p_D$, where $\bar{D}$ is mean deviance, $p_D$ is the effective number of parameters. Bayesian analogue of AIC.
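For conjugate models the marginal likelihoods are available in closed form. A minimal sketch (hypothetical data) of the Bayes factor for a point-null model $M_1: p = 0.5$ against $M_2: p \sim \operatorname{Beta}(1,1)$:

```python
from math import comb, exp
from scipy.special import betaln

n, k = 20, 14                      # hypothetical data: 14 successes in 20 trials

# M1: fixed p = 0.5, so P(X|M1) is just the binomial probability
m1 = comb(n, k) * 0.5 ** n

# M2: p ~ Beta(1, 1); marginal likelihood = C(n, k) * B(k+1, n-k+1)
m2 = comb(n, k) * exp(betaln(k + 1, n - k + 1))

print("Bayes factor BF_12 =", m1 / m2)   # < 1 here, i.e. the data favour the vague-prior model M2
```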
Posterior Prediction and Calibration
Posterior predictive distribution: $P(\tilde{x}|x) = \int P(\tilde{x}|\theta)\pi(\theta|x)d\theta$ — averages over parameter uncertainty. A Bayesian model is calibrated if the probabilities it assigns to events under the posterior predictive match their observed long-run frequencies. This is stronger than plug-in (frequentist) calibration because it accounts for parametric uncertainty.
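A minimal sketch of drawing from the posterior predictive in the Beta–Bernoulli setting (the posterior parameters are made up): first draw $\theta$ from the posterior, then draw the future observation given $\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_post, beta_post = 14.0, 13.0                      # hypothetical Beta posterior for p

# Posterior predictive for one future Bernoulli trial:
# P(x_tilde = 1 | x) = E[p | x] = alpha_post / (alpha_post + beta_post)
theta = rng.beta(alpha_post, beta_post, size=100_000)   # integrate over parameter uncertainty
x_tilde = rng.binomial(1, theta)                        # then draw the future observation
print(x_tilde.mean(), alpha_post / (alpha_post + beta_post))  # Monte Carlo vs closed form
```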
Variational Bayesian Inference
Instead of MCMC: approximate $P(\theta|X)$ by a family $Q(\theta; \lambda)$. Maximize the ELBO (evidence lower bound) $L(\lambda) = \mathbb{E}_Q[\log P(X, \theta)] - \mathbb{E}_Q[\log Q(\theta; \lambda)]$, which is equivalent to minimizing $\mathrm{KL}(Q \,\|\, P(\theta|X))$. For the mean-field approximation $Q(\theta) = \prod_i Q(\theta_i)$: iterative (coordinate-ascent) update of each factor. Advantage: faster than MCMC for large data sets. Drawback: tends to underestimate posterior variance, i.e. to understate uncertainty.
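A toy sketch of the idea (data, prior, and grid are all illustrative assumptions): fit a Gaussian $Q(\theta; m, s)$ to the posterior of a conjugate normal-mean model by maximizing a Monte Carlo estimate of the ELBO over a grid, then compare with the exact posterior.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.array([1.2, 0.7, 1.9, 1.4, 0.8])   # data; prior theta ~ N(0, 1), likelihood x_i ~ N(theta, 1)
n = len(x)

def elbo(m, s, n_mc=5_000):
    theta = rng.normal(m, s, size=n_mc)                       # samples from Q(theta; m, s)
    log_joint = norm.logpdf(theta, 0.0, 1.0) \
              + norm.logpdf(x[:, None], theta, 1.0).sum(axis=0)
    log_q = norm.logpdf(theta, m, s)
    return np.mean(log_joint - log_q)                         # E_Q[log P(x, theta) - log Q]

scores = [(elbo(m, s), m, s)
          for m in np.linspace(0.5, 1.5, 21)
          for s in np.linspace(0.2, 1.0, 17)]
best = max(scores)                                            # grid point with the highest ELBO
print("variational fit: mean", best[1], "sd", best[2])
print("exact posterior: mean", x.sum() / (n + 1), "sd", (1 / (n + 1)) ** 0.5)
```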
Bayesian Updating in Real Time
As new data arrives, the posterior becomes the new prior: $P(\theta|x_1, ..., x_n) = P(x_n|\theta) \cdot P(\theta|x_1, ..., x_{n-1})/P(x_n|x_1, ..., x_{n-1})$. For conjugate priors — analytical update. Online learning: the Bayesian approach naturally supports streaming data. Kalman filter is a special case: Bayesian filter for linear Gaussian models.
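A minimal streaming sketch with a conjugate Beta–Bernoulli pair (the flip stream is made up): after each observation the posterior becomes the prior for the next, and the update is a counter increment.

```python
# Sequential (online) Bayesian updating with a conjugate Beta-Bernoulli pair.
alpha, beta_ = 1.0, 1.0                 # start from a uniform Beta(1, 1) prior
stream = [1, 0, 1, 1, 0, 1, 1, 1]       # hypothetical incoming coin flips

for flip in stream:
    # Yesterday's posterior is today's prior; the conjugate update is one addition.
    alpha += flip
    beta_ += 1 - flip
    print(f"after x={flip}: Beta({alpha:.0f}, {beta_:.0f}), mean={alpha/(alpha+beta_):.3f}")
```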
Numerical Example: MLE for the Normal Distribution
Task: Sample: $x = \{2.1, 3.4, 2.8, 3.1, 2.6\}$. Find MLE for $\mu$ and $\sigma^2$. Compare to Bayesian estimate.
Step 1: Log-likelihood: $\ell(\mu, \sigma^2) = -(n/2)\ln(2\pi\sigma^2) - \sum (x_i - \mu)^2/(2\sigma^2)$. $\partial\ell/\partial\mu = 0 \to \hat{\mu} = \bar{x}$.
Step 2: $\bar{x} = (2.1 + 3.4 + 2.8 + 3.1 + 2.6)/5 = 14.0/5 = 2.80$.
Step 3: $\partial\ell/\partial\sigma^2 = 0 \to \hat{\sigma}^2 = \sum (x_i - \bar{x})^2 / n$. Deviations: $(-0.7)^2 + (0.6)^2 + (0)^2 + (0.3)^2 + (-0.2)^2 = 0.49 + 0.36 + 0 + 0.09 + 0.04 = 0.98$. $\hat{\sigma}^2 = 0.98/5 = 0.196$.
Step 4: Unbiased variance: $s^2 = 0.98/4 = 0.245$. 95% CI for $\mu$: $\bar{x} \pm t_{0.025,4} \cdot s/\sqrt{n} = 2.80 \pm 2.776 \cdot (0.495 / \sqrt{5}) = 2.80 \pm 0.614 = [2.19, 3.41]$. Bayesian estimate with $N(\mu_0=3, \tau^2=1)$ prior: $\hat{\mu}_{\text{Bayes}} = (0.196 \cdot 3 + 5 \cdot 1 \cdot 2.80)/(0.196+5) = 14.588/5.196 \approx 2.808$ — shifted toward the prior by only 0.008.
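The arithmetic in Steps 2–4 can be reproduced in a few lines (SciPy assumed for the $t$ quantile):

```python
import numpy as np
from scipy.stats import t

x = np.array([2.1, 3.4, 2.8, 3.1, 2.6])
n = len(x)

mu_hat = x.mean()                          # 2.80
sigma2_mle = ((x - mu_hat) ** 2).mean()    # 0.196 (MLE, divide by n)
s2 = ((x - mu_hat) ** 2).sum() / (n - 1)   # 0.245 (unbiased, divide by n-1)

half_width = t.ppf(0.975, n - 1) * np.sqrt(s2 / n)
print("95% CI:", mu_hat - half_width, mu_hat + half_width)   # about [2.19, 3.41]

# Normal prior N(mu0=3, tau^2=1); sigma^2 taken as the MLE 0.196, as in the text
mu0, tau2 = 3.0, 1.0
mu_bayes = (sigma2_mle * mu0 + n * tau2 * mu_hat) / (sigma2_mle + n * tau2)
print("Bayes estimate:", mu_bayes)         # about 2.808
```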