
Law of Large Numbers

Limit Theorems


The Law of Large Numbers (LLN) is a fundamental result in probability theory: the average of a large number of independent, identically distributed random variables converges to their common expected value. This is the mathematical foundation of statistics.

Weak and Strong LLN

Weak LLN (Chebyshev, 1866): For i.i.d. $X_1,\ldots,X_n$ with $E[X_i]=\mu$ and $Var[X_i]=\sigma^2 < \infty$: $\overline{X}_n = (X_1+\ldots+X_n)/n \to_P \mu$ (convergence in probability). Proof via Chebyshev: $P(|\overline{X}_n-\mu|\geq\epsilon) \leq \sigma^2/(n\epsilon^2)\to 0$.
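
A quick numerical check of this bound (a minimal NumPy sketch; the choice of Exp(1), for which $\mu=\sigma^2=1$, is ours). The sample mean of $n$ i.i.d. Exp(1) variables is exactly Gamma$(n, 1/n)$, so it can be drawn directly:

```python
import numpy as np

# Weak LLN vs. the Chebyshev bound for X_i ~ Exp(1) (mu = sigma^2 = 1).
# The mean of n i.i.d. Exp(1) variables is exactly Gamma(n, 1/n).
rng = np.random.default_rng(0)
eps, reps = 0.1, 100_000

for n in (100, 1_000, 10_000):
    means = rng.gamma(shape=n, scale=1.0 / n, size=reps)
    empirical = np.mean(np.abs(means - 1.0) >= eps)
    chebyshev = min(1.0 / (n * eps**2), 1.0)  # sigma^2 / (n * eps^2)
    print(f"n={n:6d}  empirical: {empirical:.4f}  Chebyshev bound: {chebyshev:.4f}")
```

The empirical tail vanishes much faster than the bound; Chebyshev is crude, but sufficient for the proof.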

Strong LLN (Kolmogorov): Under the same conditions: $\overline{X}_n \to \mu$ almost surely ($P(\lim \overline{X}_n = \mu) = 1$). This is stronger than the weak law: almost sure convergence implies convergence in probability, but not vice versa.

LLN without finite variance (Khinchin): finiteness of $E[|X|]$ is sufficient. Even when $Var[X]=\infty$, the LLN still holds!

Exception: the Cauchy distribution has no finite mean, so the LLN fails! In fact, $\overline{X}_n$ has exactly the same Cauchy(0,1) distribution for every $n$.

Applications of LLN

Monte Carlo Method: $E[f(X)]\approx (1/n)\sum f(X_i)$. As $n\to\infty$ the estimate becomes exact. Used for complex integrals: $\int_D f(x)dx \approx \text{Vol}(D)\cdot(1/n)\sum f(x_i)$ for $x_i \sim U[D]$.
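
A minimal sketch in Python (the integrand $e^{-x^2}$ on $[0,1]$ is our choice; its exact value is known via the error function):

```python
import math
import numpy as np

# Monte Carlo integration: int_0^1 exp(-x^2) dx = E[f(U)], U ~ Uniform[0,1],
# estimated by the sample mean; the LLN guarantees convergence.
rng = np.random.default_rng(1)
exact = math.sqrt(math.pi) / 2 * math.erf(1.0)  # closed form via erf

for n in (10**2, 10**4, 10**6):
    estimate = np.exp(-rng.uniform(0.0, 1.0, size=n) ** 2).mean()
    print(f"n={n:>7}  estimate={estimate:.6f}  error={abs(estimate - exact):.2e}")
```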

Frequency as probability: The relative frequency of event $A$ in $n$ experiments: $f_n(A) = N_n(A)/n\to P(A)$ (LLN). "Probability = limit of frequency"—operational definition.

Exercise: (a) Simulate the LLN: generate Exp(1) and accumulate $\overline{X}_n$ for $n=1,\ldots,10000$. Plot $\overline{X}_n$ vs $n$. (b) For Cauchy: simulate $\overline{X}_n$ for Cauchy(0,1). How does the mean behave? Explain. (c) Monte Carlo for $\pi$: generate points in the square $[-1,1]^2$, count the proportion inside the unit circle $\to \pi/4$. Plot the convergence.
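
For part (c), one possible sketch (plotting omitted; only the running estimate is printed):

```python
import numpy as np

# Exercise (c): Monte Carlo estimate of pi. Points are uniform in [-1,1]^2;
# the fraction landing inside the unit circle tends to pi/4 by the LLN.
rng = np.random.default_rng(2)
n = 1_000_000
pts = rng.uniform(-1.0, 1.0, size=(n, 2))
inside = (pts**2).sum(axis=1) <= 1.0
running_pi = 4.0 * np.cumsum(inside) / np.arange(1, n + 1)  # estimate after each point
for k in (100, 10_000, 1_000_000):
    print(f"n={k:>9}  pi estimate: {running_pi[k - 1]:.5f}")
```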

Law of the Iterated Logarithm

Besides the LLN ($\overline{X}_n\to\mu$) and the CLT ($\sqrt{n}\,(\overline{X}_n-\mu)\to N(0, \sigma^2)$), there is a more precise result, the law of the iterated logarithm (LIL): $\limsup_{n\to\infty} \frac{S_n-n\mu}{\sigma\sqrt{2n\ln\ln n}}=1$ almost surely. This describes the exact size of the maximal fluctuations of the partial sums. Speed of divergence: $O(\sqrt{n\ln\ln n})$, only slightly faster than $\sqrt{n}$. The proof uses exponential tail bounds (e.g., Hoeffding's inequality) and the Borel–Cantelli lemma.
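
A simulation sketch of the LIL normalization for a $\pm 1$ random walk (our choice of example; here $\mu=0$, $\sigma=1$). The limsup equals 1 only asymptotically and convergence is very slow, so over a finite horizon the running maximum usually sits below 1:

```python
import numpy as np

# Normalized partial sums S_n / sqrt(2 n ln ln n) for a +/-1 random walk.
# By the LIL their limsup equals 1 almost surely.
rng = np.random.default_rng(3)
n = 10**6
s = np.cumsum(rng.choice([-1.0, 1.0], size=n))
idx = np.arange(3, n + 1)              # ln ln n is positive from n >= 3
ratio = s[2:] / np.sqrt(2 * idx * np.log(np.log(idx)))
print("running max over n in [3, 1e6]:", ratio.max())
print("value at n = 1e6:", ratio[-1])
```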

Practical corollary: when monitoring a running average, deviations of $\overline{X}_n$ from $\mu$ exceeding roughly $\sigma\sqrt{2\ln\ln n/n}$ should be considered "unusual": the LIL turns this into an objective threshold for anomalies.

Strong vs. Weak LLN

Weak LLN (Chebyshev, 1866): $\overline{X}_n\to_P\mu$ (in probability). For any $\epsilon>0$: $P(|\overline{X}_n-\mu|>\epsilon)\to 0$. Chebyshev's proof requires a finite second moment; Khinchin's version needs only $E[|X|]<\infty$.

Strong LLN (Borel, Kolmogorov): $\overline{X}_n \to \mu$ almost surely. $P(\overline{X}_n\to\mu)=1$. Requires $E[|X|]<\infty$. Significantly stronger: not just that each fixed-$n$ tail probability is small, but that the entire trajectory converges with probability one.

The difference is fundamental: the weak LLN speaks only about the marginal distribution at each fixed $n$, not about whole trajectories. Example: for $\epsilon=0.01$ after $10^6$ trials, the "tail" event $|\overline{X}_n-\mu|>0.01$ has small but nonzero probability, and the weak law alone does not rule out infinitely many future excursions. The strong LLN says: almost surely, for all large enough $n$, no deviation exceeds $\epsilon$.

Applications of LLN in Computation

Monte Carlo Method relies directly on LLN: $\theta = \int f(x)dx = E[f(X)] \approx (1/n)\sum f(X_i)$. The convergence rate is $O(1/\sqrt{n})$ by CLT, regardless of space dimensionality! This is a key advantage over deterministic quadrature methods (which suffer from the "curse of dimensionality").
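
A sketch illustrating the dimension-independent rate (the test integrand $f(x)=\sum_i x_i^2$ on $[0,1]^d$, with exact integral $d/3$, is our choice):

```python
import numpy as np

# The relative MC error decays like O(1/sqrt(n)) regardless of dimension d.
rng = np.random.default_rng(4)
for d in (1, 10, 100):
    exact = d / 3.0                       # integral of sum(x_i^2) over [0,1]^d
    for n in (10**3, 10**5):
        x = rng.uniform(size=(n, d))
        est = (x**2).sum(axis=1).mean()
        print(f"d={d:3d} n={n:>6}  relative error = {abs(est - exact) / exact:.2e}")
```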

Quasi-Monte Carlo: uses deterministic low-discrepancy sequences (Halton, Sobol) instead of random ones. The convergence rate is $O((\log n)^d / n)$: better than $O(1/\sqrt{n})$ for small $d$, worse for large $d$. In practice QMC often outperforms plain MC for moderate dimensions, roughly $d \leq 20$.
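
A small MC-vs-QMC comparison, assuming SciPy >= 1.7 for `scipy.stats.qmc` (the integrand $f(x)=\prod_i x_i$ with exact value $2^{-d}$ is our choice, an illustration rather than a benchmark):

```python
import numpy as np
from scipy.stats import qmc  # requires SciPy >= 1.7

# Plain MC vs. scrambled Sobol QMC on f(x) = prod(x_i) over [0,1]^d.
rng = np.random.default_rng(5)
d, m = 5, 14                  # 2^14 = 16384 points (Sobol prefers powers of 2)
exact = 0.5**d

x_mc = rng.uniform(size=(2**m, d))
x_qmc = qmc.Sobol(d=d, scramble=True, seed=5).random_base2(m=m)

for name, x in (("MC ", x_mc), ("QMC", x_qmc)):
    est = x.prod(axis=1).mean()
    print(f"{name}  estimate={est:.6e}  relative error={abs(est - exact) / exact:.2e}")
```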

Ergodic Theorem and Strong LLN for Dependent Data

Birkhoff's Ergodic Theorem: for an ergodic measure-preserving transformation $T$: $(1/n)\sum_{i=0}^{n-1} f(T^i\omega)\to \int f\,d\mu$ almost surely. This generalizes the LLN to dependent data. MCMC rests on this theorem: time averages along a long Markov chain converge to integrals over the invariant distribution.
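
A minimal illustration (a random-walk Metropolis chain targeting $N(0,1)$, our choice): the time average of $x^2$ along the chain should approach $E[X^2]=1$:

```python
import numpy as np

# Random-walk Metropolis targeting N(0,1); by the ergodic theorem the
# time average of f(x) = x^2 converges a.s. to E[X^2] = 1.
rng = np.random.default_rng(6)
n, x = 200_000, 0.0
samples = np.empty(n)
for i in range(n):
    prop = x + rng.normal(0.0, 1.0)                 # symmetric proposal
    # accept with prob. min(1, pi(prop)/pi(x)), where pi(x) ~ exp(-x^2 / 2)
    if np.log(rng.uniform()) < 0.5 * (x**2 - prop**2):
        x = prop
    samples[i] = x
burn = n // 10                                      # discard warm-up
print("time average of x^2:", np.mean(samples[burn:] ** 2))  # close to 1.0
```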

Characteristic Function Method in LLN

Proof of the weak LLN via CF: $\varphi_{\overline{X}_n}(t) = (\varphi_X(t/n))^n$. For $E[X]=\mu$: $\varphi_X(t/n) = 1 + i\mu t/n + o(1/n)$. Then $\varphi_{\overline{X}_n}(t) \to e^{i\mu t} = \varphi_{\delta_\mu}(t)$, the CF of a point mass at $\mu$. By Lévy's continuity theorem: $\overline{X}_n \to \mu$ in probability.
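
For $X \sim \mathrm{Exp}(1)$ (so $\mu=1$) the CF has the closed form $\varphi_X(s)=1/(1-is)$, so the convergence can be checked exactly (a small sketch; the distribution and the value of $t$ are our choices):

```python
import cmath

# phi_X(s) = 1/(1 - i s) for Exp(1), hence phi_mean(t) = (1 - i t/n)^(-n),
# which should approach e^{i mu t} with mu = 1.
t = 2.0
target = cmath.exp(1j * t)       # CF of the point mass at mu = 1
for n in (10, 100, 10_000):
    cf = (1 - 1j * t / n) ** (-n)
    print(f"n={n:>6}  |phi(t) - e^(i*mu*t)| = {abs(cf - target):.2e}")
```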

Probabilistic Method in Combinatorics

Erdős' theorem: if a random 2-coloring avoids a monochromatic $k$-clique with positive probability, then a coloring with no such clique exists. Application: $R(k,k)>n$ whenever $2\,C(n,k)/2^{C(k,2)}<1$. Hence $R(k,k)>(1+o(1))\,k\cdot 2^{k/2}/(e\sqrt{2})$: the first exponential lower bound for Ramsey numbers. This is a pure existence proof; the probabilistic method establishes that such objects exist even though no explicit construction is known.
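
A small script (our illustration) that finds, for a given $k$, the largest $n$ satisfying the inequality and compares it with the asymptotic lower bound:

```python
import math

# Find the largest n with C(n, k) * 2^(1 - C(k, 2)) < 1; the probabilistic
# method then yields R(k, k) > n.
for k in (5, 10, 20):
    n = k
    while math.comb(n + 1, k) * 2.0 ** (1 - math.comb(k, 2)) < 1:
        n += 1
    asymptotic = k * 2 ** (k / 2) / (math.e * math.sqrt(2))
    print(f"k={k:2d}  R(k,k) > {n}   asymptotic ~ {asymptotic:.0f}")
```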

Inequalities for Sums of Random Variables

Hoeffding's inequality for bounded r.v.'s ($a_i\leq X_i\leq b_i$): $P(\sum(X_i-EX_i)\geq t) \leq \exp(-2t^2/\sum(b_i-a_i)^2)$. Used for $\epsilon$-$\delta$ generalization bounds in supervised learning. Azuma's inequality: for a martingale $(M_n)$ with bounded differences $|M_k-M_{k-1}|\leq c_k$: $P(|M_n-M_0|\geq t)\leq 2\exp(-t^2/(2\sum c_k^2))$. Used in concentration proofs for random graphs and randomized algorithms.
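
A quick empirical check of Hoeffding's bound (Bernoulli(0.5) variables, so $\sum (b_i-a_i)^2 = n$; the parameter choices are ours):

```python
import numpy as np

# Empirical tail vs. Hoeffding: P(sum(X_i) - n/2 >= t) <= exp(-2 t^2 / n)
# for X_i ~ Bernoulli(0.5) on [0, 1].
rng = np.random.default_rng(7)
n, reps = 1_000, 200_000
sums = rng.binomial(n, 0.5, size=reps) - n / 2      # centered sums
for t in (20, 40, 60):
    empirical = np.mean(sums >= t)
    bound = np.exp(-2 * t**2 / n)
    print(f"t={t}:  empirical {empirical:.5f}  <=  bound {bound:.5f}")
```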

Numerical Example: LLN with Coin Tossing

Problem: $X_1,\ldots,X_n \sim $ Bernoulli($p=0.5$). Find $P(|\overline{X}_n-0.5|>0.02)$ for $n=100$ and $n=10,000$.

Step 1: $Var[\overline{X}_n] = p(1-p)/n = 0.25/n$. Standard deviation: $\sigma[\overline{X}_n]=0.5/\sqrt{n}$.

Step 2 ($n=100$): $\sigma[\overline{X}_{100}]=0.5/10=0.05$. $P(|\overline{X}-0.5|>0.02)\approx 2(1-\Phi(0.02/0.05)) = 2(1-\Phi(0.4)) \approx 2\cdot0.345=0.690$. About 69%—such a deviation is quite likely.

Step 3 ($n=10,000$): $\sigma[\overline{X}_{10,000}]=0.5/100=0.005$. $P(|\overline{X}-0.5|>0.02)\approx 2(1-\Phi(0.02/0.005))=2(1-\Phi(4.0))\approx 2\cdot0.000032=0.000064$. Less than 0.007%—LLN in action.

Step 4: For $n=10,000$ the mean is almost always in $[0.48, 0.52]$. Weak LLN (Khinchin): for any $\epsilon>0$, $P(|\overline{X}_n-p|>\epsilon)\to 0$. Strong LLN (Kolmogorov): $\overline{X}_n\to p$ almost surely. A casino, with millions of bets, knows its revenue to a fraction of a percent.
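
The normal approximations in Steps 2-3 can be checked against the exact binomial tail (a sketch assuming SciPy for `binom`):

```python
import math
from scipy.stats import binom

# P(|mean - 0.5| > 0.02) for S_n ~ Binomial(n, 0.5): exact vs. normal approx.
for n in (100, 10_000):
    k = 0.02 * n                              # threshold: |S_n - n/2| > k
    exact = binom.cdf(n / 2 - k - 1, n, 0.5) + binom.sf(n / 2 + k, n, 0.5)
    z = 0.02 / (0.5 / math.sqrt(n))
    normal = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
    print(f"n={n:>6}  exact={exact:.6f}  normal approx={normal:.6f}")
```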
