Probability Theory Inequalities
Expectation and Moments
Probability inequalities allow us to estimate event probabilities without full knowledge of the distribution. They are critically important for statistics, machine learning, and information theory.
Basic Inequalities
Markov's Inequality: For $X \geq 0$, $a > 0$: $P(X \geq a) \leq \mathbb{E}[X]/a$. Proof: $\mathbb{E}[X] \geq \mathbb{E}[X \cdot 1_{X \geq a}] \geq a \cdot P(X \geq a)$.
Chebyshev's Inequality: For any $X$ with finite $\mathbb{E}[X] = \mu$, $\operatorname{Var}[X] = \sigma^2$: $P(|X-\mu| \geq k) \leq \sigma^2/k^2$ or $P(|X-\mu| \geq k\sigma) \leq 1/k^2$. Does not require knowledge of the distribution, only $\mu$ and $\sigma^2$.
Jensen's Inequality: For convex $g$: $\mathbb{E}[g(X)] \geq g(\mathbb{E}[X])$. Corollaries: $\mathbb{E}[X^2] \geq (\mathbb{E}[X])^2$ (variance $\geq 0$). $\mathbb{E}[e^X] \geq e^{\mathbb{E}[X]}$.
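A minimal numerical sketch (not part of the text) comparing the first two bounds with the exact tail for $X \sim \mathrm{Exp}(1)$, where $\mathbb{E}[X] = \operatorname{Var}[X] = 1$ and $P(X \geq a) = e^{-a}$:

```python
import math

# Illustrative check for X ~ Exp(1): E[X] = 1, Var[X] = 1, P(X >= a) = exp(-a).
for a in [2, 4, 8]:
    exact = math.exp(-a)
    markov = 1.0 / a                  # P(X >= a) <= E[X] / a
    chebyshev = 1.0 / (a - 1) ** 2    # P(X >= a) <= P(|X - 1| >= a - 1) <= Var / (a - 1)^2
    print(f"a={a}: exact={exact:.5f}  Markov<={markov:.3f}  Chebyshev<={chebyshev:.3f}")

# Jensen check for Exp(1): E[X^2] = 2 >= (E[X])^2 = 1.
```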
Concentration Inequalities
Chernoff Inequality: For a sum of i.i.d. $X_1, ..., X_n$ with $\mathbb{E}[X_i] = \mu$: $P(\overline{X}_n \geq \mu + \varepsilon) \leq \exp(-n I(\mu+\varepsilon))$, where $I$ is the rate function. The bound decays exponentially in $n$, which is much sharper than Chebyshev's polynomial decay.
For Bounded Random Variables ($a_i \leq X_i \leq b_i$):
Hoeffding's Inequality: $P(\overline{X}_n - \mu \geq \varepsilon) \leq \exp\left(-2 n^2 \varepsilon^2/\sum_i (b_i - a_i)^2\right)$; with identical bounds $[a, b]$ this becomes $\exp(-2n\varepsilon^2/(b-a)^2)$. Tail decay is exponential! Important for PAC learning.
Bernstein's Inequality: if in addition $|X_i - \mu| \leq b$, then $P(S_n - n\mu \geq t) \leq \exp\left(-\frac{t^2}{2(n\sigma^2 + b t/3)}\right)$. It uses both the variance and the boundedness, so it is more accurate for small deviations.
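A small Python sketch comparing the two bounds for a sum of Bernoulli($p$) variables with small $p$; the parameters here are illustrative, not from the text:

```python
import math

# Illustrative comparison for a sum of n i.i.d. Bernoulli(p) variables in [0, 1]
# (so b - a = 1 in Hoeffding, and |X_i - mu| <= 1 serves as b in Bernstein).
n, p = 1000, 0.1
sigma2 = p * (1 - p)

for eps in [0.02, 0.05, 0.1]:
    t = n * eps                                            # deviation of the sum
    hoeffding = math.exp(-2 * n * eps ** 2)
    bernstein = math.exp(-t ** 2 / (2 * (n * sigma2 + t / 3)))
    print(f"eps={eps}: Hoeffding <= {hoeffding:.3e}   Bernstein <= {bernstein:.3e}")
```

Because the variance $p(1-p)$ is small here, the Bernstein bound is noticeably tighter for small deviations.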
Azuma–Hoeffding Inequality
For Martingales: if the martingale differences $D_k = X_k - X_{k-1}$ satisfy $|D_k| \leq c_k$, then $P(X_n - X_0 \geq t) \leq \exp\left(-t^2/\left(2\sum_k c_k^2\right)\right)$. Applied in the analysis of randomized algorithms.
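A quick sanity check of the bound on the simplest martingale, a $\pm 1$ random walk (illustrative simulation, not from the text):

```python
import math
import random

# Azuma-Hoeffding on a +/-1 random walk: X_n = D_1 + ... + D_n with |D_k| <= 1,
# so P(X_n >= t) <= exp(-t^2 / (2n)).
rng = random.Random(0)
n, t, trials = 1000, 80, 10000
hits = sum(
    sum(rng.choice((-1, 1)) for _ in range(n)) >= t
    for _ in range(trials)
)
print(f"empirical P(X_n >= {t}) ~ {hits / trials:.4f},  "
      f"Azuma bound = {math.exp(-t ** 2 / (2 * n)):.4f}")
```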
Assignment: (a) $X \sim \mathrm{Poisson}(10)$: estimate $P(X \geq 20)$ by Markov and Chebyshev. Exact value via the Poisson CDF. Which inequality is more accurate? (b) 1000 coins, $p=0.5$. $P($more than 550 heads$)$ — Hoeffding vs. normal approximation. (c) Learning task: we need $P(|E_\text{test} - E_\text{train}| < \varepsilon) \geq 1-\delta$. For $n$ samples and a loss bounded in $[0,1]$: $n \geq \log(2/\delta)/(2\varepsilon^2)$ suffices (Hoeffding). For $\varepsilon=0.05$, $\delta=0.05$: $n=?$
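One possible way to check the three parts numerically (a sketch; scipy is assumed to be available for the exact CDFs):

```python
import math
from scipy import stats  # assumed available; used only for exact CDF values

# (a) X ~ Poisson(10): bounds on P(X >= 20)
mu, a = 10, 20
markov = mu / a                               # Markov: E[X] / a
chebyshev = mu / (a - mu) ** 2                # Chebyshev: Var[X] = 10 for Poisson(10)
exact = 1 - stats.poisson.cdf(a - 1, mu)
print("(a)", markov, chebyshev, exact)

# (b) 1000 fair coins: P(more than 550 heads)
n, p, k = 1000, 0.5, 550
eps = k / n - p                               # deviation of the sample mean
hoeffding = math.exp(-2 * n * eps ** 2)
normal = 1 - stats.norm.cdf((k - n * p) / math.sqrt(n * p * (1 - p)))
print("(b)", hoeffding, normal)

# (c) sample size from Hoeffding: n >= log(2/delta) / (2 eps^2)
eps, delta = 0.05, 0.05
print("(c)", math.ceil(math.log(2 / delta) / (2 * eps ** 2)))
```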
Chernoff Inequality and Applications
Chernoff inequality for Bernoulli($p$): $P(\overline{X} \geq (1+\delta)p) \leq e^{-np\delta^2/3}$ for $\delta \in (0,1)$. For the sum $S = \sum X_i$: $P(S \geq (1+\delta)\mu) \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu$. Significantly sharper than Hoeffding when $p$ is small, because it exploits the small variance $p(1-p)$.
Application in load balancing: random distribution of $n$ tasks among $n$ servers — expected maximum load is $O(\log n/\log \log n)$ (analyzed via Chernoff). Balls-and-bins: with $n$ balls and $n$ bins maximum load $\leq 3 \ln n/\ln \ln n$ with high probability.
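An illustrative balls-and-bins simulation (parameters chosen arbitrarily) comparing the observed maximum load with the $3 \ln n/\ln \ln n$ benchmark:

```python
import math
import random
from collections import Counter

# Balls-and-bins: throw n balls uniformly at random into n bins,
# record the maximum load, and average over several trials.
def average_max_load(n, trials=200, seed=0):
    rng = random.Random(seed)
    loads = []
    for _ in range(trials):
        bins = Counter(rng.randrange(n) for _ in range(n))
        loads.append(max(bins.values()))
    return sum(loads) / trials

for n in [100, 1000, 10000]:
    bound = 3 * math.log(n) / math.log(math.log(n))
    print(f"n={n}: average max load ~ {average_max_load(n):.2f},  "
          f"3 ln n / ln ln n = {bound:.2f}")
```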
Sub-Gaussian and Sub-Exponential Random Variables
A centered random variable $X$ (with $\mathbb{E}[X] = 0$) is sub-Gaussian with parameter $\sigma$ if $\mathbb{E}[e^{\lambda X}] \leq e^{\sigma^2 \lambda^2/2}$ for all $\lambda$. Then $P(X \geq t) \leq e^{-t^2/(2\sigma^2)}$ — a Gaussian tail. Examples: centered bounded random variables on $[a,b]$ are sub-Gaussian with $\sigma = (b-a)/2$; normal $N(0, \sigma^2)$ — with parameter $\sigma$.
A random variable $X$ is sub-exponential (finite $\psi_1$-norm) if its tails decay at least exponentially: $P(|X| > t) \leq 2e^{-t/K}$ for some constant $K$. Examples: exponential, chi-square, the product of two Gaussian variables. Bernstein's inequality combines sub-Gaussian behavior near the mean with sub-exponential behavior in the tails.
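An illustrative simulation of the difference between the two tail regimes, using a standard normal $Z$ and the product $W = Z_1 Z_2$ of two independent standard normals:

```python
import numpy as np

# Z ~ N(0,1) is sub-Gaussian; W = Z1 * Z2 is sub-exponential,
# so its tail decays roughly like exp(-t), not exp(-t^2/2).
rng = np.random.default_rng(0)
m = 1_000_000
z = rng.standard_normal(m)
w = rng.standard_normal(m) * rng.standard_normal(m)

for t in [2, 4, 6]:
    print(f"t={t}: P(|Z|>t) ~ {np.mean(np.abs(z) > t):.2e}   "
          f"P(|W|>t) ~ {np.mean(np.abs(w) > t):.2e}")
```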
VC Dimension and Generalization in ML
Vapnik–Chervonenkis dimension (VC dimension) of a hypothesis class $H$ is a key parameter for generalization. VC inequality: $P\left(\sup_{h \in H} |E_\text{test}(h) - E_\text{train}(h)| > \varepsilon\right) \leq 4 \cdot \Pi_H(2n) \cdot e^{-n\varepsilon^2/8}$. Here $\Pi_H(n)$ is the growth function; by Sauer's lemma $\Pi_H(n) = O(n^{d_{VC}})$. With finite VC dimension, generalization is guaranteed for $n = O\left(\frac{d_{VC}}{\varepsilon^2}\log(1/\delta)\right)$.
Example: linear classifiers in $\mathbb{R}^d$ have VC dimension $d+1$. Neural networks with $W$ weights (threshold activations): VC dimension $O(W \log W)$. PAC-learnability theorem (Blumer, Ehrenfeucht, Haussler, Warmuth, 1989): a hypothesis class is PAC-learnable $\Leftrightarrow$ its VC dimension is finite.
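An order-of-magnitude sketch of the sample-size bound; the constant hidden in the $O(\cdot)$ is set to 1 purely for illustration (real bounds carry larger constants and logarithmic factors):

```python
import math

# Order-of-magnitude sketch of n = O((d_VC / eps^2) * log(1/delta)).
# The hidden constant is taken to be 1 purely for illustration.
def vc_sample_size(d_vc, eps, delta, const=1.0):
    return math.ceil(const * d_vc / eps ** 2 * math.log(1 / delta))

# Linear classifiers in R^10: d_VC = d + 1 = 11.
print(vc_sample_size(d_vc=11, eps=0.05, delta=0.05))
```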
Randomized Algorithms and Probabilistic Analysis
Many algorithms use randomness to improve efficiency. Quicksort: with random pivot selection, the expected runtime is $O(n \log n)$. Proof sketch: $\mathbb{E}[C(n)] = n-1 + (2/n)\sum_{i=1}^{n-1} \mathbb{E}[C(i)]$ — a recurrence whose solution is $O(n \log n)$ (about $2n \ln n$ comparisons in expectation).
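A short sketch that counts the comparisons made by randomized quicksort and compares them with the $\approx 2n \ln n$ prediction (illustrative, not the article's code):

```python
import math
import random

# Count comparisons of randomized quicksort; C(n) obeys the recurrence above.
def quicksort_comparisons(arr, rng):
    if len(arr) <= 1:
        return 0
    pivot = rng.choice(arr)
    less = [x for x in arr if x < pivot]
    greater = [x for x in arr if x > pivot]
    # len(arr) - 1 comparisons against the pivot at this level
    return len(arr) - 1 + quicksort_comparisons(less, rng) + quicksort_comparisons(greater, rng)

rng = random.Random(0)
for n in [1000, 10000]:
    data = list(range(n))
    rng.shuffle(data)
    print(f"n={n}: comparisons = {quicksort_comparisons(data, rng)},  "
          f"2 n ln n = {2 * n * math.log(n):.0f}")
```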
Probabilistic analysis method: the algorithm is analyzed not for the worst case, but on average or under a random input. Hiring problem (a variant of the secretary problem): the expected number of hires when candidates arrive in random order is $H(n) = \sum_{k=1}^{n} 1/k \approx \ln n$, because the $i$-th candidate is hired with probability $1/i$ (the probability of being the best among the first $i$).
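An illustrative simulation of the hiring analysis, confirming that the expected number of hires tracks $H(n)$:

```python
import random

# Hire whenever the current candidate is the best seen so far;
# the expected number of hires is H(n) = 1 + 1/2 + ... + 1/n ~ ln n.
def average_hires(n, trials=2000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        order = list(range(n))
        rng.shuffle(order)
        best, hires = -1, 0
        for rank in order:
            if rank > best:
                best, hires = rank, hires + 1
        total += hires
    return total / trials

n = 1000
print(average_hires(n), sum(1 / k for k in range(1, n + 1)))
```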
Second Moment Method
For a non-negative r.v. $X$: $P(X > 0) \geq (\mathbb{E}[X])^2/\mathbb{E}[X^2]$. This is the second moment method: if $\mathbb{E}[X^2]/(\mathbb{E}[X])^2 \to 1$, then $P(X > 0) \to 1$. Applied in combinatorics to prove the existence of objects. For example, for the number of cliques of a fixed size in $G(n,p)$: if $\mathbb{E}[X] \to \infty$ and the second moment is not too large, a clique exists with high probability. Limitation: if $\mathbb{E}[X] \to c < \infty$, neither moment method settles the question, and analysis via the PGF or the Stein–Chen (Poisson approximation) method is needed.
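A quick Monte-Carlo check of the bound for a non-negative variable, here $X \sim \mathrm{Binomial}(50, 0.02)$ (an arbitrary illustrative choice with $\mathbb{E}[X] = 1$):

```python
import numpy as np

# Check P(X > 0) >= (E[X])^2 / E[X^2] for X ~ Binomial(50, 0.02), E[X] = 1.
rng = np.random.default_rng(0)
x = rng.binomial(50, 0.02, size=1_000_000).astype(float)

lhs = np.mean(x > 0)
rhs = np.mean(x) ** 2 / np.mean(x ** 2)
print(f"P(X > 0) ~ {lhs:.3f}  >=  (E X)^2 / E X^2 ~ {rhs:.3f}")
```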
Concentration of Measure and High Dimensionality
The concentration of measure phenomenon: in high dimensions, most of the mass of the uniform measure on the sphere is concentrated near any equator. Lévy's inequality: for a function $f$ on the sphere $S^n$ with Lipschitz constant $L$, $P(|f-\mathbb{E}f| > \varepsilon) \leq 2\exp(-n\varepsilon^2/(2L^2))$ (up to absolute constants). Consequence for ML: in high-dimensional spaces the Euclidean metric loses discriminative power (all points are at approximately the same distance).
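An illustrative simulation of distance concentration: the ratio of the largest to the smallest pairwise distance between random Gaussian points shrinks toward 1 as the dimension grows (all parameters here are arbitrary illustrative choices):

```python
import numpy as np

# 200 random Gaussian points in dimension d; compare the spread of pairwise distances.
rng = np.random.default_rng(0)
for d in [2, 100, 10000]:
    pts = rng.standard_normal((200, d))
    sq = (pts ** 2).sum(axis=1)
    # squared distances via ||x||^2 + ||y||^2 - 2 x.y (memory-friendly)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * pts @ pts.T, 0)
    dist = np.sqrt(d2[np.triu_indices(200, k=1)])
    print(f"d={d}: max/min pairwise distance = {dist.max() / dist.min():.2f}")
```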
Numerical Example: Comparing Three Inequalities
Problem: $X \sim \mathrm{Bernoulli}(p=0.2)$, sample $n=100$. Estimate $P(|\overline{X}-0.2|>0.1)$ by Chebyshev, Hoeffding, and normal approximation.
Step 1 (Chebyshev): $\operatorname{Var}[\overline{X}] = p(1-p)/n = 0.16/100 = 0.0016$. $P(|\overline{X}-0.2| \geq 0.1) \leq 0.0016/0.01 = 0.16$ (estimate: 16%).
Step 2 (Hoeffding): For $X_i \in [0,1]$: $P(|\overline{X}-p| \geq t) \leq 2e^{-2nt^2} = 2e^{-2\cdot 100\cdot 0.01} = 2e^{-2} \approx 0.271$ (estimate: 27%).
Step 3 (CLT): $\sigma[\overline{X}] = \sqrt{0.0016} = 0.04$. $P(|\overline{X}-0.2| > 0.1) \approx 2(1-\Phi(0.1/0.04)) = 2(1-\Phi(2.5)) \approx 2 \cdot 0.0062 = 0.0124$ (approximately 1.2%).
Step 4: Chebyshev: 16%, Hoeffding: 27%, CLT approximation: 1.2%. The inequalities are conservative by a factor of 13–22, but they work without any assumptions on the shape of the distribution — this is precisely what makes them valuable for theoretical proofs and robust guarantees.
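A sketch to re-derive the four numbers, plus the exact two-sided binomial tail for reference (scipy assumed available for the normal and binomial CDFs):

```python
import math
from scipy import stats  # assumed available for the normal and binomial CDFs

n, p, t = 100, 0.2, 0.1
var_mean = p * (1 - p) / n

chebyshev = var_mean / t ** 2                              # Step 1
hoeffding = 2 * math.exp(-2 * n * t ** 2)                  # Step 2
clt = 2 * (1 - stats.norm.cdf(t / math.sqrt(var_mean)))    # Step 3
# exact two-sided binomial tail: |X_bar - 0.2| > 0.1 means S <= 9 or S >= 31
exact = stats.binom.cdf(9, n, p) + 1 - stats.binom.cdf(30, n, p)
print(chebyshev, hoeffding, clt, exact)
```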