Probability Theory Inequalities
Expectation and Moments
Probability inequalities allow us to estimate event probabilities without full knowledge of the distribution. They are critically important for statistics, machine learning, and information theory.
Basic Inequalities
Markov's Inequality: For $X \geq 0$, $a > 0$: $P(X \geq a) \leq \mathbb{E}[X]/a$. Proof: $\mathbb{E}[X] \geq \mathbb{E}[X \cdot 1_{X \geq a}] \geq a \cdot P(X \geq a)$.
Chebyshev's Inequality: For any $X$ with finite $\mathbb{E}[X] = \mu$, $\operatorname{Var}[X] = \sigma^2$: $P(|X-\mu| \geq k) \leq \sigma^2/k^2$ or $P(|X-\mu| \geq k\sigma) \leq 1/k^2$. Does not require knowledge of the distribution, only $\mu$ and $\sigma^2$.
Jensen's Inequality: For convex $g$: $\mathbb{E}[g(X)] \geq g(\mathbb{E}[X])$. Corollaries: $\mathbb{E}[X^2] \geq (\mathbb{E}[X])^2$ (variance $\geq 0$). $\mathbb{E}[e^X] \geq e^{\mathbb{E}[X]}$.
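A minimal numerical sketch (not part of the text) comparing the first two bounds with the exact tail for $X \sim \mathrm{Exp}(1)$, where $\mathbb{E}[X] = \operatorname{Var}[X] = 1$ and $P(X \geq a) = e^{-a}$:

```python
import math

# Illustrative check for X ~ Exp(1): E[X] = 1, Var[X] = 1, P(X >= a) = exp(-a).
for a in [2, 4, 8]:
    exact = math.exp(-a)
    markov = 1.0 / a                  # P(X >= a) <= E[X] / a
    chebyshev = 1.0 / (a - 1) ** 2    # P(X >= a) <= P(|X - 1| >= a - 1) <= Var / (a - 1)^2
    print(f"a={a}: exact={exact:.5f}  Markov<={markov:.3f}  Chebyshev<={chebyshev:.3f}")

# Jensen check for Exp(1): E[X^2] = 2 >= (E[X])^2 = 1.
```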
Concentration Inequalities
Chernoff Inequality: For a sum of i.i.d. $X_1, ..., X_n$ with $\mathbb{E}[X_i] = \mu$: $P(\overline{X}_n \geq \mu + \varepsilon) \leq \exp(-n I(\mu+\varepsilon))$, where $I$ is the rate function. The bound decays exponentially in $n$, which is much sharper than Chebyshev's polynomial decay.
For Bounded Random Variables ($a_i \leq X_i \leq b_i$):
Hoeffding's Inequality: $P(\overline{X}_n - \mu \geq \varepsilon) \leq \exp\left(-2 n^2 \varepsilon^2/\sum_i (b_i - a_i)^2\right)$; with identical bounds $[a, b]$ this becomes $\exp(-2n\varepsilon^2/(b-a)^2)$. Tail decay is exponential! Important for PAC learning.
Bernstein's Inequality: if in addition $|X_i - \mu| \leq b$, then $P(S_n - n\mu \geq t) \leq \exp\left(-\frac{t^2}{2(n\sigma^2 + b t/3)}\right)$. It uses both the variance and the boundedness, so it is more accurate for small deviations.
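A small Python sketch comparing the two bounds for a sum of Bernoulli($p$) variables with small $p$; the parameters here are illustrative, not from the text:

```python
import math

# Illustrative comparison for a sum of n i.i.d. Bernoulli(p) variables in [0, 1]
# (so b - a = 1 in Hoeffding, and |X_i - mu| <= 1 serves as b in Bernstein).
n, p = 1000, 0.1
sigma2 = p * (1 - p)

for eps in [0.02, 0.05, 0.1]:
    t = n * eps                                            # deviation of the sum
    hoeffding = math.exp(-2 * n * eps ** 2)
    bernstein = math.exp(-t ** 2 / (2 * (n * sigma2 + t / 3)))
    print(f"eps={eps}: Hoeffding <= {hoeffding:.3e}   Bernstein <= {bernstein:.3e}")
```

Because the variance $p(1-p)$ is small here, the Bernstein bound is noticeably tighter for small deviations.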
Azuma–Hoeffding Inequality
For Martingales: if the martingale differences $D_k = X_k - X_{k-1}$ satisfy $|D_k| \leq c_k$, then $P(X_n - X_0 \geq t) \leq \exp\left(-t^2/\left(2\sum_k c_k^2\right)\right)$. Applied in the analysis of randomized algorithms.
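A quick sanity check of the bound on the simplest martingale, a $\pm 1$ random walk (illustrative simulation, not from the text):

```python
import math
import random

# Azuma-Hoeffding on a +/-1 random walk: X_n = D_1 + ... + D_n with |D_k| <= 1,
# so P(X_n >= t) <= exp(-t^2 / (2n)).
rng = random.Random(0)
n, t, trials = 1000, 80, 10000
hits = sum(
    sum(rng.choice((-1, 1)) for _ in range(n)) >= t
    for _ in range(trials)
)
print(f"empirical P(X_n >= {t}) ~ {hits / trials:.4f},  "
      f"Azuma bound = {math.exp(-t ** 2 / (2 * n)):.4f}")
```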
Assignment: (a) $X \sim \mathrm{Poisson}(10)$: estimate $P(X \geq 20)$ by Markov and Chebyshev. Exact value via the Poisson CDF. Which inequality is more accurate? (b) 1000 coins, $p=0.5$. $P($more than 550 heads$)$ — Hoeffding vs. normal approximation. (c) Learning task: we need $P(|E_\text{test} - E_\text{train}| < \varepsilon) \geq 1-\delta$. For $n$ samples and a loss bounded in $[0,1]$: $n \geq \log(2/\delta)/(2\varepsilon^2)$ suffices (Hoeffding). For $\varepsilon=0.05$, $\delta=0.05$: $n=?$
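One possible way to check the three parts numerically (a sketch; scipy is assumed to be available for the exact CDFs):

```python
import math
from scipy import stats  # assumed available; used only for exact CDF values

# (a) X ~ Poisson(10): bounds on P(X >= 20)
mu, a = 10, 20
markov = mu / a                               # Markov: E[X] / a
chebyshev = mu / (a - mu) ** 2                # Chebyshev: Var[X] = 10 for Poisson(10)
exact = 1 - stats.poisson.cdf(a - 1, mu)
print("(a)", markov, chebyshev, exact)

# (b) 1000 fair coins: P(more than 550 heads)
n, p, k = 1000, 0.5, 550
eps = k / n - p                               # deviation of the sample mean
hoeffding = math.exp(-2 * n * eps ** 2)
normal = 1 - stats.norm.cdf((k - n * p) / math.sqrt(n * p * (1 - p)))
print("(b)", hoeffding, normal)

# (c) sample size from Hoeffding: n >= log(2/delta) / (2 eps^2)
eps, delta = 0.05, 0.05
print("(c)", math.ceil(math.log(2 / delta) / (2 * eps ** 2)))
```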
Chernoff Inequality and Applications
Chernoff inequality for Bernoulli($p$): $P(\overline{X} \geq (1+\delta)p) \leq e^{-np\delta^2/3}$ for $\delta \in (0,1)$. For the sum $S = \sum X_i$: $P(S \geq (1+\delta)\mu) \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu$. Significantly sharper than Hoeffding when $p$ is small, because it exploits the small variance $p(1-p)$.
Application in load balancing: random distribution of $n$ tasks among $n$ servers — expected maximum load is $O(\log n/\log \log n)$ (analyzed via Chernoff). Balls-and-bins: with $n$ balls and $n$ bins maximum load $\leq 3 \ln n/\ln \ln n$ with high probability.
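An illustrative balls-and-bins simulation (parameters chosen arbitrarily) comparing the observed maximum load with the $3 \ln n/\ln \ln n$ benchmark:

```python
import math
import random
from collections import Counter

# Balls-and-bins: throw n balls uniformly at random into n bins,
# record the maximum load, and average over several trials.
def average_max_load(n, trials=200, seed=0):
    rng = random.Random(seed)
    loads = []
    for _ in range(trials):
        bins = Counter(rng.randrange(n) for _ in range(n))
        loads.append(max(bins.values()))
    return sum(loads) / trials

for n in [100, 1000, 10000]:
    bound = 3 * math.log(n) / math.log(math.log(n))
    print(f"n={n}: average max load ~ {average_max_load(n):.2f},  "
          f"3 ln n / ln ln n = {bound:.2f}")
```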
Sub-Gaussian and Sub-Exponential Random Variables
A centered random variable $X$ (with $\mathbb{E}[X] = 0$) is sub-Gaussian with parameter $\sigma$ if $\mathbb{E}[e^{\lambda X}] \leq e^{\sigma^2 \lambda^2/2}$ for all $\lambda$. Then $P(X \geq t) \leq e^{-t^2/(2\sigma^2)}$ — a Gaussian tail. Examples: centered bounded random variables on $[a,b]$ are sub-Gaussian with $\sigma = (b-a)/2$; normal $N(0, \sigma^2)$ — with parameter $\sigma$.
A random variable $X$ is sub-exponential (finite $\psi_1$-norm) if its tails decay at least exponentially: $P(|X| > t) \leq 2e^{-t/K}$ for some constant $K$. Examples: exponential, chi-square, the product of two Gaussian variables. Bernstein's inequality combines sub-Gaussian behavior near the mean with sub-exponential behavior in the tails.
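An illustrative simulation of the difference between the two tail regimes, using a standard normal $Z$ and the product $W = Z_1 Z_2$ of two independent standard normals:

```python
import numpy as np

# Z ~ N(0,1) is sub-Gaussian; W = Z1 * Z2 is sub-exponential,
# so its tail decays roughly like exp(-t), not exp(-t^2/2).
rng = np.random.default_rng(0)
m = 1_000_000
z = rng.standard_normal(m)
w = rng.standard_normal(m) * rng.standard_normal(m)

for t in [2, 4, 6]:
    print(f"t={t}: P(|Z|>t) ~ {np.mean(np.abs(z) > t):.2e}   "
          f"P(|W|>t) ~ {np.mean(np.abs(w) > t):.2e}")
```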
VC Dimension and Generalization in ML
Vapnik–Chervonenkis dimension (VC dimension) of a hypothesis class $H$ is a key parameter for generalization. VC inequality: $P\left(\sup_{h \in H} |E_\text{test}(h) - E_\text{train}(h)| > \varepsilon\right) \leq 4 \cdot \Pi_H(2n) \cdot e^{-n\varepsilon^2/8}$. Here $\Pi_H(n)$ is the growth function; by Sauer's lemma $\Pi_H(n) = O(n^{d_{VC}})$. With finite VC dimension, generalization is guaranteed for $n = O\left(\frac{d_{VC}}{\varepsilon^2}\log(1/\delta)\right)$.
Example: linear classifiers in $\mathbb{R}^d$ have VC dimension $d+1$. Neural networks with $W$ weights (threshold activations): VC dimension $O(W \log W)$. PAC-learnability theorem (Blumer, Ehrenfeucht, Haussler, Warmuth, 1989): a hypothesis class is PAC-learnable $\Leftrightarrow$ its VC dimension is finite.
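An order-of-magnitude sketch of the sample-size bound; the constant hidden in the $O(\cdot)$ is set to 1 purely for illustration (real bounds carry larger constants and logarithmic factors):

```python
import math

# Order-of-magnitude sketch of n = O((d_VC / eps^2) * log(1/delta)).
# The hidden constant is taken to be 1 purely for illustration.
def vc_sample_size(d_vc, eps, delta, const=1.0):
    return math.ceil(const * d_vc / eps ** 2 * math.log(1 / delta))

# Linear classifiers in R^10: d_VC = d + 1 = 11.
print(vc_sample_size(d_vc=11, eps=0.05, delta=0.05))
```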
Randomized Algorithms and Probabilistic Analysis
Many algorithms use randomness to improve efficiency. Quicksort: with random pivot selection, the expected runtime is $O(n \log n)$. Proof sketch: $\mathbb{E}[C(n)] = n-1 + (2/n)\sum_{i=1}^{n-1} \mathbb{E}[C(i)]$ — a recurrence whose solution is $O(n \log n)$ (about $2n \ln n$ comparisons in expectation).
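A short sketch that counts the comparisons made by randomized quicksort and compares them with the $\approx 2n \ln n$ prediction (illustrative, not the article's code):

```python
import math
import random

# Count comparisons of randomized quicksort; C(n) obeys the recurrence above.
def quicksort_comparisons(arr, rng):
    if len(arr) <= 1:
        return 0
    pivot = rng.choice(arr)
    less = [x for x in arr if x < pivot]
    greater = [x for x in arr if x > pivot]
    # len(arr) - 1 comparisons against the pivot at this level
    return len(arr) - 1 + quicksort_comparisons(less, rng) + quicksort_comparisons(greater, rng)

rng = random.Random(0)
for n in [1000, 10000]:
    data = list(range(n))
    rng.shuffle(data)
    print(f"n={n}: comparisons = {quicksort_comparisons(data, rng)},  "
          f"2 n ln n = {2 * n * math.log(n):.0f}")
```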
Probabilistic analysis method: the algorithm is analyzed not for the worst case, but on average or under a random input. Hiring problem (a variant of the secretary problem): the expected number of hires when candidates arrive in random order is $H(n) = \sum_{k=1}^{n} 1/k \approx \ln n$, because the $i$-th candidate is hired with probability $1/i$ (the probability of being the best among the first $i$).
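An illustrative simulation of the hiring analysis, confirming that the expected number of hires tracks $H(n)$:

```python
import random

# Hire whenever the current candidate is the best seen so far;
# the expected number of hires is H(n) = 1 + 1/2 + ... + 1/n ~ ln n.
def average_hires(n, trials=2000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        order = list(range(n))
        rng.shuffle(order)
        best, hires = -1, 0
        for rank in order:
            if rank > best:
                best, hires = rank, hires + 1
        total += hires
    return total / trials

n = 1000
print(average_hires(n), sum(1 / k for k in range(1, n + 1)))
```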
Second Moment Method
For a non-negative r.v. $X$: $P(X > 0) \geq (\mathbb{E}[X])^2/\mathbb{E}[X^2]$. This is the second moment method: if $\mathbb{E}[X^2]/(\mathbb{E}[X])^2 \to 1$, then $P(X > 0) \to 1$. Applied in combinatorics to prove the existence of objects. For example, for the number of cliques of a fixed size in $G(n,p)$: if $\mathbb{E}[X] \to \infty$ and the second moment is not too large, a clique exists with high probability. Limitation: if $\mathbb{E}[X] \to c < \infty$, neither moment method settles the question, and analysis via the PGF or the Stein–Chen (Poisson approximation) method is needed.
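A quick Monte-Carlo check of the bound for a non-negative variable, here $X \sim \mathrm{Binomial}(50, 0.02)$ (an arbitrary illustrative choice with $\mathbb{E}[X] = 1$):

```python
import numpy as np

# Check P(X > 0) >= (E[X])^2 / E[X^2] for X ~ Binomial(50, 0.02), E[X] = 1.
rng = np.random.default_rng(0)
x = rng.binomial(50, 0.02, size=1_000_000).astype(float)

lhs = np.mean(x > 0)
rhs = np.mean(x) ** 2 / np.mean(x ** 2)
print(f"P(X > 0) ~ {lhs:.3f}  >=  (E X)^2 / E X^2 ~ {rhs:.3f}")
```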
Concentration of Measure and High Dimensionality
The concentration of measure phenomenon: in high dimensions, most of the mass of the uniform measure on the sphere is concentrated near any equator. Lévy's inequality: for a function $f$ on the sphere $S^n$ with Lipschitz constant $L$, $P(|f-\mathbb{E}f| > \varepsilon) \leq 2\exp(-n\varepsilon^2/(2L^2))$ (up to absolute constants). Consequence for ML: in high-dimensional spaces the Euclidean metric loses discriminative power (all points are at approximately the same distance).
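An illustrative simulation of distance concentration: the ratio of the largest to the smallest pairwise distance between random Gaussian points shrinks toward 1 as the dimension grows (all parameters here are arbitrary illustrative choices):

```python
import numpy as np

# 200 random Gaussian points in dimension d; compare the spread of pairwise distances.
rng = np.random.default_rng(0)
for d in [2, 100, 10000]:
    pts = rng.standard_normal((200, d))
    sq = (pts ** 2).sum(axis=1)
    # squared distances via ||x||^2 + ||y||^2 - 2 x.y (memory-friendly)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * pts @ pts.T, 0)
    dist = np.sqrt(d2[np.triu_indices(200, k=1)])
    print(f"d={d}: max/min pairwise distance = {dist.max() / dist.min():.2f}")
```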
Numerical Example: Comparing Three Inequalities
Problem: $X \sim \mathrm{Bernoulli}(p=0.2)$, sample $n=100$. Estimate $P(|\overline{X}-0.2|>0.1)$ by Chebyshev, Hoeffding, and normal approximation.
Step 1 (Chebyshev): $\operatorname{Var}[\overline{X}] = p(1-p)/n = 0.16/100 = 0.0016$. $P(|\overline{X}-0.2| \geq 0.1) \leq 0.0016/0.01 = 0.16$ (estimate: 16%).
Step 2 (Hoeffding): For $X_i \in [0,1]$: $P(|\overline{X}-p| \geq t) \leq 2e^{-2nt^2} = 2e^{-2\cdot 100\cdot 0.01} = 2e^{-2} \approx 0.271$ (estimate: 27%).
Step 3 (CLT): $\sigma[\overline{X}] = \sqrt{0.0016} = 0.04$. $P(|\overline{X}-0.2| > 0.1) \approx 2(1-\Phi(0.1/0.04)) = 2(1-\Phi(2.5)) \approx 2 \cdot 0.0062 = 0.0124$ (approximately 1.2%).
Step 4: Chebyshev: 16%, Hoeffding: 27%, CLT approximation: 1.2%. The inequalities are conservative by a factor of 13–22, but they work without any assumptions on the shape of the distribution — this is precisely what makes them valuable for theoretical proofs and robust guarantees.
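A sketch to re-derive the four numbers, plus the exact two-sided binomial tail for reference (scipy assumed available for the normal and binomial CDFs):

```python
import math
from scipy import stats  # assumed available for the normal and binomial CDFs

n, p, t = 100, 0.2, 0.1
var_mean = p * (1 - p) / n

chebyshev = var_mean / t ** 2                              # Step 1
hoeffding = 2 * math.exp(-2 * n * t ** 2)                  # Step 2
clt = 2 * (1 - stats.norm.cdf(t / math.sqrt(var_mean)))    # Step 3
# exact two-sided binomial tail: |X_bar - 0.2| > 0.1 means S <= 9 or S >= 31
exact = stats.binom.cdf(9, n, p) + 1 - stats.binom.cdf(30, n, p)
print(chebyshev, hoeffding, clt, exact)
```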