Convergences of Random Variables and Large Deviations
Limit Theorems
There are several distinct notions of convergence for sequences of random variables. Large deviations theory studies the probabilities of rare events: the exponentially small probabilities of large deviations from the mean.
Types of Convergence
In probability: $X_n \to_P X$ if $\forall \varepsilon > 0:\ P(|X_n-X| > \varepsilon) \to 0$ as $n\to\infty$.
Almost surely (a.s.): $X_n \to_{a.s.} X$ if $P(\lim_{n\to\infty} X_n = X) = 1$. Stronger than convergence in probability, which it implies.
In distribution (weak): $X_n \to_d X$ if $F_{X_n}(x) \to F_X(x)$ at all continuity points of $F_X$. The weakest notion: it does not even require the variables to be defined on the same probability space.
In quadratic mean ($L^2$): $X_n \to_{L^2} X$ if $E[(X_n-X)^2] \to 0$. Stronger than convergence in probability; neither implies nor is implied by almost sure convergence.
Hierarchy of convergences: almost surely $\implies$ in probability $\implies$ in distribution; $L^2 \implies$ in probability.
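Below is a minimal Monte Carlo sketch (Python/NumPy; not from the original text) that estimates $P(|\overline{X}_n - \mu| > \varepsilon)$ for Bernoulli(0.5) sample means and also reports $E[(\overline{X}_n - \mu)^2]$, illustrating convergence in probability and in quadratic mean; the choices $\varepsilon = 0.05$, the sample sizes, and the number of repetitions are arbitrary.

```python
# Monte Carlo sketch: convergence in probability and in quadratic mean
# for sample means of Bernoulli(0.5) variables (illustrative parameters).
import numpy as np

rng = np.random.default_rng(0)
mu, eps, trials = 0.5, 0.05, 20_000

for n in [10, 100, 1000, 10_000]:
    means = rng.binomial(n, mu, size=trials) / n    # sample means X_bar_n
    p_dev = np.mean(np.abs(means - mu) > eps)       # estimate of P(|X_bar_n - mu| > eps)
    mse = np.mean((means - mu) ** 2)                # estimate of E[(X_bar_n - mu)^2] = mu(1-mu)/n
    print(f"n={n:6d}  P(|X_bar - mu| > eps) ~ {p_dev:.4f}   E[(X_bar - mu)^2] ~ {mse:.2e}")
```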
Large Deviations Theory
Principle of large deviations: The probability of a “large” deviation of $\overline{X}_n$ from $\mu$ decreases exponentially: $P(\overline{X}_n \geq \mu+\varepsilon) \approx e^{-nI(\mu+\varepsilon)}$, where $I(x)$ is the “rate function”.
Rate function: $I(x) = \sup_t \{tx - \Lambda(t)\}$, where $\Lambda(t) = \log E[e^{tX}]$ is the logarithmic MGF. $I$ is convex, $I(\mu) = 0$, and $I(x) > 0$ for $x \neq \mu$.
Cramér's Theorem (1938): For i.i.d. $X_i$: $\lim_{n\to\infty} \frac{1}{n} \log P(\overline{X}_n \geq x) = -I(x)$ for $x > \mu$.
Applications: Insurance: $P(S_n > na)$ for $a > \mu$ → estimation of catastrophic losses. Telecommunications: $P(\mathrm{queue} > n\cdot\mathrm{threshold})$ → buffer design. Statistics: exponential decay rates of p-values and test error probabilities as $n\to\infty$ (asymptotic efficiency of tests).
Exercise: (a) For Bernoulli($p=0.5$): compute $I(0.7) = \sup_t \{0.7t - \log(e^t/2 + 1/2)\}$. Find $t^*$ via the derivative. (b) Estimate $P(\overline{X}_{100} \geq 0.7)$ for Bernoulli three ways: exact (binomial), the Chernoff bound, and Cramér's theorem. (c) Explain the relationship between Cramér's theorem and the Chernoff inequality.
Rate Function and Rate of Probability Decay
Rate function: $I(x) = \sup_{t\in\mathbb{R}}\{tx - \Lambda(t)\}$, where $\Lambda(t) = \log E[e^{tX}]$ is the logarithmic MGF (cumulant generating function). $I(x) \geq 0$, $I(\mu) = 0$, and $I$ is convex. Cramér's theorem: for $x > \mu$, $\lim_{n\to\infty} (1/n)\log P(\overline{X}_n \geq x) = -I(x)$. The rate of decay of the tail probability is thus determined by the rate function.
For normal $N(\mu,\sigma^2)$: $I(x) = (x-\mu)^2/(2\sigma^2)$. For Bernoulli($p$): $I(x) = x\cdot\ln(x/p) + (1-x)\cdot\ln((1-x)/(1-p))$ (the binary KL divergence). More generally, under standard conditions the rate function has a Kullback-Leibler representation $I(x) = \inf\{D_{KL}(Q||P) : E_Q[X] = x\}$, and the infimum is attained by the exponentially tilted distribution $dQ_{t^*} \propto e^{t^*x}\,dP$, a member of the exponential family generated by $P$.
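A sketch (illustrative, not part of the original text) that computes the Bernoulli rate function by numerically maximizing $tx - \Lambda(t)$ with SciPy and checks it against the closed-form binary KL divergence; the values of $x$ are arbitrary examples.

```python
# Numerical Legendre transform of the log-MGF for Bernoulli(p), compared
# with the closed-form binary KL divergence (illustrative sketch).
import numpy as np
from scipy.optimize import minimize_scalar

def rate_numeric(x, p):
    # I(x) = sup_t { t*x - Lambda(t) },  Lambda(t) = log(p*e^t + 1 - p)
    neg = lambda t: -(t * x - np.log(p * np.exp(t) + 1.0 - p))
    res = minimize_scalar(neg, bounds=(-50, 50), method="bounded")
    return -res.fun

def rate_closed_form(x, p):
    return x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))

p = 0.5
for x in [0.6, 0.7, 0.9]:
    print(f"x={x}:  numeric I(x) = {rate_numeric(x, p):.4f},  KL formula = {rate_closed_form(x, p):.4f}")
```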
Principle of Large Deviations: General Theory
Large deviation principle (LDP): A family of measures $\{P_n\}$ satisfies the LDP with rate function $I$ if: (1) for open $G$: $\liminf (1/n)\log P_n(G) \geq -\inf_{x\in G} I(x)$; (2) for closed $F$: $\limsup (1/n)\log P_n(F) \leq -\inf_{x\in F} I(x)$. Cramér's theorem is the LDP for $\overline{X}_n$; Sanov's theorem is the LDP for the empirical distribution (with $I$ given by the KL divergence).
LDP applications: estimation of tail probabilities of rare events (reliability, financial risk). Statistical physics: the Gibbs distribution arises from the principle of maximum entropy, a consequence of the LDP via Sanov's theorem. Information theory: the probability of decoding error is controlled by a rate function (Chernoff error exponents).
Relationship of Rate Function with Entropy
For the empirical distribution $\widehat{L}_n = (1/n)\sum_{i=1}^n \delta_{X_i}$: $P(\widehat{L}_n \approx Q) \sim e^{-n\cdot D_{KL}(Q||P)}$, where $D_{KL}$ is the Kullback-Leibler divergence. This is Sanov's theorem. Consequence: typical sequences are those whose empirical distribution is close to $P$ (in the KL sense); atypical sequences are exponentially unlikely, with exponent $D_{KL}(Q||P)$. Atypicality is distance from the “true” distribution.
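A sketch (illustrative; the distributions $P$ and $Q$ are made-up example values) checking Sanov's exponent on a three-letter alphabet: the exact multinomial log-probability of observing the type $Q$, normalized by $n$, approaches $-D_{KL}(Q||P)$.

```python
# Sanov check: the exact probability that the empirical distribution of n i.i.d.
# samples from P equals the type Q decays like exp(-n * KL(Q||P)) up to a
# polynomial factor (illustrative values of P and Q).
import numpy as np
from scipy.special import gammaln

P = np.array([0.5, 0.3, 0.2])     # true distribution (example)
Q = np.array([0.3, 0.3, 0.4])     # atypical empirical distribution (example)
kl = float(np.sum(Q * np.log(Q / P)))

for n in [10, 100, 1000]:         # chosen so that n*Q are integer counts
    counts = Q * n
    # log of the multinomial pmf at these counts
    log_p = gammaln(n + 1) - gammaln(counts + 1).sum() + (counts * np.log(P)).sum()
    print(f"n={n:5d}  (1/n) log P(type = Q) = {log_p / n:+.4f}    -KL(Q||P) = {-kl:+.4f}")
```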
Langevin Dynamics and Sampling from Continuous Distributions
For sampling from a distribution $\pi(x) \propto e^{-U(x)}$: the Langevin SDE $dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,dW_t$ has invariant distribution $e^{-U(x)}/Z$. Discretization (ULA, unadjusted Langevin algorithm): $X_{k+1} = X_k - \eta\nabla U(X_k) + \sqrt{2\eta}\,\xi_k$, $\xi_k \sim N(0,I)$. For a fixed step size $\eta$ the chain carries a discretization bias; as $\eta\to0$ it converges to $\pi$. Metropolis-adjusted Langevin (MALA) adds an accept/reject step, which removes this bias.
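A minimal ULA sketch (illustrative; the double-well potential $U(x) = x^4/4 - x^2/2$, the step size, and the chain lengths are arbitrary choices, not from the text).

```python
# Unadjusted Langevin algorithm (ULA) for pi(x) proportional to exp(-U(x)),
# with the double-well potential U(x) = x^4/4 - x^2/2 (illustrative example).
import numpy as np

rng = np.random.default_rng(0)
grad_U = lambda x: x**3 - x                  # gradient of U(x) = x^4/4 - x^2/2
eta, n_steps, burn_in = 1e-3, 200_000, 20_000

x, samples = 0.0, []
for k in range(n_steps):
    # one ULA step: gradient drift plus Gaussian noise of scale sqrt(2*eta)
    x = x - eta * grad_U(x) + np.sqrt(2 * eta) * rng.standard_normal()
    if k >= burn_in:
        samples.append(x)

samples = np.array(samples)
print("empirical mean  :", samples.mean())   # ~0 by symmetry of the double well
print("empirical E[x^2]:", (samples ** 2).mean())
```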
Principle of Maximum Entropy
Problem: among all distributions $P$ satisfying the constraints $E[f_i(X)] = \mu_i$, find the one that maximizes the entropy $H(P) = -E[\ln P]$. Solution: $P^*(x) = (1/Z)\exp(\sum_i \lambda_i f_i(x))$, an exponential family with Lagrange multipliers $\lambda_i$. This explains the Gibbs distribution in statistical physics and the form of maximum likelihood estimates in exponential families (moment matching).
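A sketch (illustrative) of the maximum-entropy construction for a single moment constraint: the distribution on $\{1,\dots,6\}$ with $E[X] = 4.5$ (the target value is an arbitrary example) has the form $P(x) \propto e^{\lambda x}$, and $\lambda$ is found by one-dimensional root finding.

```python
# Maximum-entropy distribution on {1,...,6} subject to E[X] = 4.5 (example constraint):
# P(x) proportional to exp(lambda * x); solve for lambda on the moment equation.
import numpy as np
from scipy.optimize import brentq

xs, target = np.arange(1, 7), 4.5

def moment_gap(lam):
    w = np.exp(lam * xs)
    p = w / w.sum()
    return float(p @ xs) - target          # E_lambda[X] - target, monotone in lambda

lam = brentq(moment_gap, -5.0, 5.0)        # bracket is wide enough for this target
p = np.exp(lam * xs); p /= p.sum()
print("lambda =", round(lam, 4))
print("maxent distribution:", np.round(p, 4))
print("entropy (nats):", -(p * np.log(p)).sum())
```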
Information Theory and Probability
Mutual information $I(X;Y) = D_{KL}(P_{XY}||P_X\cdot P_Y) = H(X) - H(X|Y)$. Nonnegative, and $= 0$ if and only if $X$ and $Y$ are independent. In coding theory: channel capacity is the maximum of $I(X;Y)$ over input distributions. In ML: mutual information is a criterion for feature selection (maximizing $I(\text{feature}; \text{label})$).
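A sketch (illustrative; the joint table is made up) computing $I(X;Y)$ from a discrete joint distribution and checking that it vanishes for a product distribution.

```python
# Mutual information I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) )
# from a small joint table (numbers made up for illustration).
import numpy as np

P_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])                 # joint distribution of (X, Y), sums to 1
P_x = P_xy.sum(axis=1, keepdims=True)           # marginal of X (column vector)
P_y = P_xy.sum(axis=0, keepdims=True)           # marginal of Y (row vector)

mi = np.sum(P_xy * np.log(P_xy / (P_x * P_y)))
print("I(X;Y) =", mi)                           # > 0: X and Y are dependent

P_ind = P_x * P_y                               # a product (independent) distribution
print("I for independent table:", np.sum(P_ind * np.log(P_ind / (P_x * P_y))))  # = 0
```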
Differential Entropy and Its Properties
Differential entropy: $h(X) = -\int f(x)\log f(x)\,dx$. For $N(\mu,\sigma^2)$: $h = (1/2)\log(2\pi e\sigma^2)$. Among densities with fixed variance it is maximized by the normal (maximum-entropy property). KL divergence: $D_{KL}(P||Q) \geq 0$, with equality only when $P = Q$. Connection with Fisher information: for small perturbations, $D_{KL}(P_\theta||P_{\theta+d\theta}) \approx (1/2) I(\theta)(d\theta)^2$, where $I(\theta)$ here denotes the Fisher information; this is the Fisher-Rao geometry on the space of distributions.
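A sketch (illustrative) checking the Fisher-information approximation on the Gaussian location family $N(\theta, 1)$, where the Fisher information equals $1$ and $D_{KL}(N(\theta,1)||N(\theta+d,1)) = d^2/2$ holds exactly, so the quadratic approximation is easy to verify.

```python
# Check D_KL(P_theta || P_{theta+d}) ~ (1/2) * I(theta) * d^2 on the Gaussian
# location family N(theta, 1), whose Fisher information is I(theta) = 1.
import numpy as np

def kl_equal_var_normals(mu0, mu1, sigma=1.0):
    # closed form: KL(N(mu0, s^2) || N(mu1, s^2)) = (mu1 - mu0)^2 / (2 s^2)
    return (mu1 - mu0) ** 2 / (2.0 * sigma ** 2)

theta, fisher = 0.0, 1.0
for d in [0.1, 0.01, 0.001]:
    print(f"d={d}:  KL = {kl_equal_var_normals(theta, theta + d):.2e}   "
          f"(1/2)*I*d^2 = {0.5 * fisher * d ** 2:.2e}")
```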
Numerical Example: A Large Deviation Probability via Cramér's Rate Function
Problem: $X_1,\dots,X_n \sim$ Bernoulli($p=0.5$). Find an approximation to $P(\overline{X}_n \geq 0.7)$ for $n=100$ via Cramér's theorem.
Step 1: Cramér’s rate function for Bernoulli: $I(x) = x\cdot\ln(x/p)+(1-x)\cdot\ln((1-x)/(1-p))$. For $x=0.7$, $p=0.5$.
Step 2: $I(0.7) = 0.7\cdot\ln(0.7/0.5)+0.3\cdot\ln(0.3/0.5) = 0.7\cdot\ln(1.4)+0.3\cdot\ln(0.6) = 0.7\cdot0.3365+0.3\cdot(-0.5108) = 0.2355-0.1532 = 0.0823$.
Step 3: By Cramér's theorem: $\ln P(\overline{X}_n \geq 0.7)\approx -n\cdot I(0.7) = -100\cdot 0.0823 = -8.23$, so $P(\overline{X}_{100} \geq 0.7)\approx e^{-8.23}\approx 2.66\cdot 10^{-4}$.
Step 4: Normal approximation (CLT): $P(Z \geq (0.7-0.5)/0.05) = P(Z \geq 4.0) \approx 3.2\cdot 10^{-5}$, about 8 times smaller than the Cramér estimate. Neither number is exact: the crude estimate $e^{-nI}$ drops a polynomial prefactor (of order $1/\sqrt{n}$), while the CLT replaces the true exponent $I(0.7) = 0.0823$ by the Gaussian exponent $(x-\mu)^2/(2\sigma^2) = 0.08$ and therefore misses the correct exponential decay rate for large deviations. Since $I(x) = D_{KL}(\mathrm{Ber}(x)||\mathrm{Ber}(p))$, it is precisely the Kullback-Leibler divergence that determines the rate of exponential decay.
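A short verification sketch (illustrative) that recomputes the three numbers of the example: the exact binomial tail via SciPy, the crude Cramér estimate $e^{-nI}$, and the CLT approximation.

```python
# Verification of the worked example: exact binomial tail vs. the crude Cramér
# estimate exp(-n*I) vs. the CLT approximation, for n=100, p=0.5, x=0.7.
import numpy as np
from scipy.stats import binom, norm

n, p, x = 100, 0.5, 0.7
I = x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))   # rate function = binary KL

exact = binom.sf(int(n * x) - 1, n, p)                         # P(X >= 70)
cramer = np.exp(-n * I)                                        # drops the polynomial prefactor
clt = norm.sf((x - p) / np.sqrt(p * (1 - p) / n))              # P(Z >= 4.0)

print(f"I(0.7)         = {I:.4f}")
print(f"exact binomial = {exact:.3e}")
print(f"Cramer e^(-nI) = {cramer:.3e}")
print(f"CLT            = {clt:.3e}")
```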