Convergences of Random Variables and Large Deviations
Limit Theorems
There are several distinct notions of convergence for sequences of random variables. Large deviations theory studies the probabilities of rare events: the exponentially small probabilities of large deviations from the mean.
Types of Convergence
In probability: $X_n \to_P X$ if $\forall \varepsilon > 0:\ P(|X_n-X| > \varepsilon) \to 0$ as $n\to\infty$.
Almost surely (a.s.): $X_n \to_{a.s.} X$ if $P(\lim_{n\to\infty} X_n = X) = 1$. Stronger than convergence in probability, which it implies.
In distribution (weak): $X_n \to_d X$ if $F_{X_n}(x) \to F_X(x)$ at all continuity points of $F_X$. The weakest notion: it does not even require the variables to be defined on the same probability space.
In quadratic mean ($L^2$): $X_n \to_{L^2} X$ if $E[(X_n-X)^2] \to 0$. Stronger than convergence in probability; neither implies nor is implied by almost sure convergence.
Hierarchy of convergences: almost surely $\implies$ in probability $\implies$ in distribution; $L^2 \implies$ in probability.
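Below is a minimal Monte Carlo sketch (Python/NumPy; not from the original text) that estimates $P(|\overline{X}_n - \mu| > \varepsilon)$ for Bernoulli(0.5) sample means and also reports $E[(\overline{X}_n - \mu)^2]$, illustrating convergence in probability and in quadratic mean; the choices $\varepsilon = 0.05$, the sample sizes, and the number of repetitions are arbitrary.

```python
# Monte Carlo sketch: convergence in probability and in quadratic mean
# for sample means of Bernoulli(0.5) variables (illustrative parameters).
import numpy as np

rng = np.random.default_rng(0)
mu, eps, trials = 0.5, 0.05, 20_000

for n in [10, 100, 1000, 10_000]:
    means = rng.binomial(n, mu, size=trials) / n    # sample means X_bar_n
    p_dev = np.mean(np.abs(means - mu) > eps)       # estimate of P(|X_bar_n - mu| > eps)
    mse = np.mean((means - mu) ** 2)                # estimate of E[(X_bar_n - mu)^2] = mu(1-mu)/n
    print(f"n={n:6d}  P(|X_bar - mu| > eps) ~ {p_dev:.4f}   E[(X_bar - mu)^2] ~ {mse:.2e}")
```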
Large Deviations Theory
Principle of large deviations: The probability of a “large” deviation of $\overline{X}_n$ from $\mu$ decreases exponentially: $P(\overline{X}_n \geq \mu+\varepsilon) \approx e^{-nI(\mu+\varepsilon)}$, where $I(x)$ is the “rate function”.
Rate function: $I(x) = \sup_t \{tx - \Lambda(t)\}$, where $\Lambda(t) = \log E[e^{tX}]$ is the logarithmic MGF. $I$ is convex, $I(\mu) = 0$, and $I(x) > 0$ for $x \neq \mu$.
Cramér's Theorem (1938): For i.i.d. $X_i$: $\lim_{n\to\infty} \frac{1}{n} \log P(\overline{X}_n \geq x) = -I(x)$ for $x > \mu$.
Applications: Insurance: $P(S_n > na)$ for $a > \mu$ → estimation of catastrophic losses. Telecommunications: $P(\mathrm{queue} > n\cdot\mathrm{threshold})$ → buffer design. Statistics: exponential decay rates of p-values and test error probabilities as $n\to\infty$ (asymptotic efficiency of tests).
Exercise: (a) For Bernoulli($p=0.5$): compute $I(0.7) = \sup_t \{0.7t - \log(e^t/2 + 1/2)\}$. Find $t^*$ via the derivative. (b) Estimate $P(\overline{X}_{100} \geq 0.7)$ for Bernoulli three ways: exact (binomial), the Chernoff bound, and Cramér's theorem. (c) Explain the relationship between Cramér's theorem and the Chernoff inequality.
Rate Function and Rate of Probability Decay
Rate function: $I(x) = \sup_{t\in\mathbb{R}}\{tx - \Lambda(t)\}$, where $\Lambda(t) = \log E[e^{tX}]$ is the logarithmic MGF (cumulant generating function). $I(x) \geq 0$, $I(\mu) = 0$, and $I$ is convex. Cramér's theorem: for $x > \mu$, $\lim_{n\to\infty} (1/n)\log P(\overline{X}_n \geq x) = -I(x)$. The rate of decay of the tail probability is thus determined by the rate function.
For normal $N(\mu,\sigma^2)$: $I(x) = (x-\mu)^2/(2\sigma^2)$. For Bernoulli($p$): $I(x) = x\cdot\ln(x/p) + (1-x)\cdot\ln((1-x)/(1-p))$ (the binary KL divergence). More generally, under standard conditions the rate function has a Kullback-Leibler representation $I(x) = \inf\{D_{KL}(Q||P) : E_Q[X] = x\}$, and the infimum is attained by the exponentially tilted distribution $dQ_{t^*} \propto e^{t^*x}\,dP$, a member of the exponential family generated by $P$.
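A sketch (illustrative, not part of the original text) that computes the Bernoulli rate function by numerically maximizing $tx - \Lambda(t)$ with SciPy and checks it against the closed-form binary KL divergence; the values of $x$ are arbitrary examples.

```python
# Numerical Legendre transform of the log-MGF for Bernoulli(p), compared
# with the closed-form binary KL divergence (illustrative sketch).
import numpy as np
from scipy.optimize import minimize_scalar

def rate_numeric(x, p):
    # I(x) = sup_t { t*x - Lambda(t) },  Lambda(t) = log(p*e^t + 1 - p)
    neg = lambda t: -(t * x - np.log(p * np.exp(t) + 1.0 - p))
    res = minimize_scalar(neg, bounds=(-50, 50), method="bounded")
    return -res.fun

def rate_closed_form(x, p):
    return x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))

p = 0.5
for x in [0.6, 0.7, 0.9]:
    print(f"x={x}:  numeric I(x) = {rate_numeric(x, p):.4f},  KL formula = {rate_closed_form(x, p):.4f}")
```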
Principle of Large Deviations: General Theory
Large deviation principle (LDP): A family of measures $\{P_n\}$ satisfies the LDP with rate function $I$ if: (1) for open $G$: $\liminf (1/n)\log P_n(G) \geq -\inf_{x\in G} I(x)$; (2) for closed $F$: $\limsup (1/n)\log P_n(F) \leq -\inf_{x\in F} I(x)$. Cramér's theorem is the LDP for $\overline{X}_n$; Sanov's theorem is the LDP for the empirical distribution (with $I$ given by the KL divergence).
LDP applications: estimation of tail probabilities of rare events (reliability, financial risk). Statistical physics: the Gibbs distribution arises from the principle of maximum entropy, a consequence of the LDP via Sanov's theorem. Information theory: the probability of decoding error is controlled by a rate function (Chernoff error exponents).
Relationship of Rate Function with Entropy
For the empirical distribution $\widehat{L}_n = (1/n)\sum_{i=1}^n \delta_{X_i}$: $P(\widehat{L}_n \approx Q) \sim e^{-n\cdot D_{KL}(Q||P)}$, where $D_{KL}$ is the Kullback-Leibler divergence. This is Sanov's theorem. Consequence: typical sequences are those whose empirical distribution is close to $P$ (in the KL sense); atypical sequences are exponentially unlikely, with exponent $D_{KL}(Q||P)$. Atypicality is distance from the “true” distribution.
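A sketch (illustrative; the distributions $P$ and $Q$ are made-up example values) checking Sanov's exponent on a three-letter alphabet: the exact multinomial log-probability of observing the type $Q$, normalized by $n$, approaches $-D_{KL}(Q||P)$.

```python
# Sanov check: the exact probability that the empirical distribution of n i.i.d.
# samples from P equals the type Q decays like exp(-n * KL(Q||P)) up to a
# polynomial factor (illustrative values of P and Q).
import numpy as np
from scipy.special import gammaln

P = np.array([0.5, 0.3, 0.2])     # true distribution (example)
Q = np.array([0.3, 0.3, 0.4])     # atypical empirical distribution (example)
kl = float(np.sum(Q * np.log(Q / P)))

for n in [10, 100, 1000]:         # chosen so that n*Q are integer counts
    counts = Q * n
    # log of the multinomial pmf at these counts
    log_p = gammaln(n + 1) - gammaln(counts + 1).sum() + (counts * np.log(P)).sum()
    print(f"n={n:5d}  (1/n) log P(type = Q) = {log_p / n:+.4f}    -KL(Q||P) = {-kl:+.4f}")
```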
Langevin Dynamics and Sampling from Continuous Distributions
For sampling from a distribution $\pi(x) \propto e^{-U(x)}$: the Langevin SDE $dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,dW_t$ has invariant distribution $e^{-U(x)}/Z$. Discretization (ULA, unadjusted Langevin algorithm): $X_{k+1} = X_k - \eta\nabla U(X_k) + \sqrt{2\eta}\,\xi_k$, $\xi_k \sim N(0,I)$. For a fixed step size $\eta$ the chain carries a discretization bias; as $\eta\to0$ it converges to $\pi$. Metropolis-adjusted Langevin (MALA) adds an accept/reject step, which removes this bias.
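A minimal ULA sketch (illustrative; the double-well potential $U(x) = x^4/4 - x^2/2$, the step size, and the chain lengths are arbitrary choices, not from the text).

```python
# Unadjusted Langevin algorithm (ULA) for pi(x) proportional to exp(-U(x)),
# with the double-well potential U(x) = x^4/4 - x^2/2 (illustrative example).
import numpy as np

rng = np.random.default_rng(0)
grad_U = lambda x: x**3 - x                  # gradient of U(x) = x^4/4 - x^2/2
eta, n_steps, burn_in = 1e-3, 200_000, 20_000

x, samples = 0.0, []
for k in range(n_steps):
    # one ULA step: gradient drift plus Gaussian noise of scale sqrt(2*eta)
    x = x - eta * grad_U(x) + np.sqrt(2 * eta) * rng.standard_normal()
    if k >= burn_in:
        samples.append(x)

samples = np.array(samples)
print("empirical mean  :", samples.mean())   # ~0 by symmetry of the double well
print("empirical E[x^2]:", (samples ** 2).mean())
```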
Principle of Maximum Entropy
Problem: among all distributions $P$ satisfying the constraints $E[f_i(X)] = \mu_i$, find the one that maximizes the entropy $H(P) = -E[\ln P]$. Solution: $P^*(x) = (1/Z)\exp(\sum_i \lambda_i f_i(x))$, an exponential family with Lagrange multipliers $\lambda_i$. This explains the Gibbs distribution in statistical physics and the form of maximum likelihood estimates in exponential families (moment matching).
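A sketch (illustrative) of the maximum-entropy construction for a single moment constraint: the distribution on $\{1,\dots,6\}$ with $E[X] = 4.5$ (the target value is an arbitrary example) has the form $P(x) \propto e^{\lambda x}$, and $\lambda$ is found by one-dimensional root finding.

```python
# Maximum-entropy distribution on {1,...,6} subject to E[X] = 4.5 (example constraint):
# P(x) proportional to exp(lambda * x); solve for lambda on the moment equation.
import numpy as np
from scipy.optimize import brentq

xs, target = np.arange(1, 7), 4.5

def moment_gap(lam):
    w = np.exp(lam * xs)
    p = w / w.sum()
    return float(p @ xs) - target          # E_lambda[X] - target, monotone in lambda

lam = brentq(moment_gap, -5.0, 5.0)        # bracket is wide enough for this target
p = np.exp(lam * xs); p /= p.sum()
print("lambda =", round(lam, 4))
print("maxent distribution:", np.round(p, 4))
print("entropy (nats):", -(p * np.log(p)).sum())
```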
Information Theory and Probability
Mutual information $I(X;Y) = D_{KL}(P_{XY}||P_X\cdot P_Y) = H(X) - H(X|Y)$. Nonnegative, and $= 0$ if and only if $X$ and $Y$ are independent. In coding theory: channel capacity is the maximum of $I(X;Y)$ over input distributions. In ML: mutual information is a criterion for feature selection (maximizing $I(\text{feature}; \text{label})$).
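A sketch (illustrative; the joint table is made up) computing $I(X;Y)$ from a discrete joint distribution and checking that it vanishes for a product distribution.

```python
# Mutual information I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) )
# from a small joint table (numbers made up for illustration).
import numpy as np

P_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])                 # joint distribution of (X, Y), sums to 1
P_x = P_xy.sum(axis=1, keepdims=True)           # marginal of X (column vector)
P_y = P_xy.sum(axis=0, keepdims=True)           # marginal of Y (row vector)

mi = np.sum(P_xy * np.log(P_xy / (P_x * P_y)))
print("I(X;Y) =", mi)                           # > 0: X and Y are dependent

P_ind = P_x * P_y                               # a product (independent) distribution
print("I for independent table:", np.sum(P_ind * np.log(P_ind / (P_x * P_y))))  # = 0
```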
Differential Entropy and Its Properties
Differential entropy: $h(X) = -\int f(x)\log f(x)\,dx$. For $N(\mu,\sigma^2)$: $h = (1/2)\log(2\pi e\sigma^2)$. Among densities with fixed variance it is maximized by the normal (maximum-entropy property). KL divergence: $D_{KL}(P||Q) \geq 0$, with equality only when $P = Q$. Connection with Fisher information: for small perturbations, $D_{KL}(P_\theta||P_{\theta+d\theta}) \approx (1/2) I(\theta)(d\theta)^2$, where $I(\theta)$ here denotes the Fisher information; this is the Fisher-Rao geometry on the space of distributions.
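A sketch (illustrative) checking the Fisher-information approximation on the Gaussian location family $N(\theta, 1)$, where the Fisher information equals $1$ and $D_{KL}(N(\theta,1)||N(\theta+d,1)) = d^2/2$ holds exactly, so the quadratic approximation is easy to verify.

```python
# Check D_KL(P_theta || P_{theta+d}) ~ (1/2) * I(theta) * d^2 on the Gaussian
# location family N(theta, 1), whose Fisher information is I(theta) = 1.
import numpy as np

def kl_equal_var_normals(mu0, mu1, sigma=1.0):
    # closed form: KL(N(mu0, s^2) || N(mu1, s^2)) = (mu1 - mu0)^2 / (2 s^2)
    return (mu1 - mu0) ** 2 / (2.0 * sigma ** 2)

theta, fisher = 0.0, 1.0
for d in [0.1, 0.01, 0.001]:
    print(f"d={d}:  KL = {kl_equal_var_normals(theta, theta + d):.2e}   "
          f"(1/2)*I*d^2 = {0.5 * fisher * d ** 2:.2e}")
```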
Numerical Example: A Large Deviation Probability via Cramér's Rate Function
Problem: $X_1,\dots,X_n \sim$ Bernoulli($p=0.5$). Find an approximation to $P(\overline{X}_n \geq 0.7)$ for $n=100$ via Cramér's theorem.
Step 1: Cramér’s rate function for Bernoulli: $I(x) = x\cdot\ln(x/p)+(1-x)\cdot\ln((1-x)/(1-p))$. For $x=0.7$, $p=0.5$.
Step 2: $I(0.7) = 0.7\cdot\ln(0.7/0.5)+0.3\cdot\ln(0.3/0.5) = 0.7\cdot\ln(1.4)+0.3\cdot\ln(0.6) = 0.7\cdot0.3365+0.3\cdot(-0.5108) = 0.2355-0.1532 = 0.0823$.
Step 3: By Cramér's theorem: $\ln P(\overline{X}_n \geq 0.7)\approx -n\cdot I(0.7) = -100\cdot 0.0823 = -8.23$, so $P(\overline{X}_{100} \geq 0.7)\approx e^{-8.23}\approx 2.66\cdot 10^{-4}$.
Step 4: Normal approximation (CLT): $P(Z \geq (0.7-0.5)/0.05) = P(Z \geq 4.0) \approx 3.2\cdot 10^{-5}$, about 8 times smaller than the Cramér estimate. Neither number is exact: the crude estimate $e^{-nI}$ drops a polynomial prefactor (of order $1/\sqrt{n}$), while the CLT replaces the true exponent $I(0.7) = 0.0823$ by the Gaussian exponent $(x-\mu)^2/(2\sigma^2) = 0.08$ and therefore misses the correct exponential decay rate for large deviations. Since $I(x) = D_{KL}(\mathrm{Ber}(x)||\mathrm{Ber}(p))$, it is precisely the Kullback-Leibler divergence that determines the rate of exponential decay.
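A short verification sketch (illustrative) that recomputes the three numbers of the example: the exact binomial tail via SciPy, the crude Cramér estimate $e^{-nI}$, and the CLT approximation.

```python
# Verification of the worked example: exact binomial tail vs. the crude Cramér
# estimate exp(-n*I) vs. the CLT approximation, for n=100, p=0.5, x=0.7.
import numpy as np
from scipy.stats import binom, norm

n, p, x = 100, 0.5, 0.7
I = x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))   # rate function = binary KL

exact = binom.sf(int(n * x) - 1, n, p)                         # P(X >= 70)
cramer = np.exp(-n * I)                                        # drops the polynomial prefactor
clt = norm.sf((x - p) / np.sqrt(p * (1 - p) / n))              # P(Z >= 4.0)

print(f"I(0.7)         = {I:.4f}")
print(f"exact binomial = {exact:.3e}")
print(f"Cramer e^(-nI) = {cramer:.3e}")
print(f"CLT            = {clt:.3e}")
```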