
Mathematical Expectation and Moments


Mathematical expectation is the "center of mass" of a distribution. Moments characterize the shape of a distribution: mean, variance, skewness, kurtosis.

Mathematical Expectation

Definition: $E[X] = \sum_x x \cdot P(X=x)$ (discrete) or $E[X] = \int x \cdot f(x)\,dx$ (continuous). The expectation exists if $\sum_x |x|\,P(X=x) < \infty$ (respectively $\int |x|\,f(x)\,dx < \infty$).

Linearity: $E[aX+bY] = aE[X] + bE[Y]$ — always (regardless of the dependence of $X, Y$!). $E[X_1+\dots+X_n] = n\mu$ (if $E[X_i]=\mu$ for all $i$).

For functions: $E[g(X)] = \sum_x g(x)P(X=x)$ in the discrete case and $\int g(x)f(x)\,dx$ in the continuous case (the law of the unconscious statistician). Jensen's inequality: for convex $g$, $E[g(X)] \geq g(E[X])$.
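A quick Monte Carlo illustration (a minimal sketch assuming NumPy; the distributions and constants are arbitrary choices): linearity of expectation holds even for strongly dependent variables, and Jensen's inequality holds for the convex function $e^t$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)   # E[X] = 2
y = x ** 2                                       # Y is strongly dependent on X

# Linearity: E[3X + 5Y] = 3 E[X] + 5 E[Y], no independence needed
print(np.mean(3 * x + 5 * y), 3 * np.mean(x) + 5 * np.mean(y))  # agree up to Monte Carlo error

# Jensen with the convex function g(t) = exp(t) applied to U ~ Uniform[0, 1]
u = rng.uniform(0, 1, size=1_000_000)
print(np.mean(np.exp(u)), np.exp(np.mean(u)))    # E[e^U] ≈ 1.718 >= e^{E[U]} ≈ 1.649
```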

Variance and Covariance

Variance: $\mathrm{Var}[X] = E[(X-E[X])^2] = E[X^2] - (E[X])^2$; the standard deviation is $\sigma = \sqrt{\mathrm{Var}[X]}$. Properties: $\mathrm{Var}[aX+b] = a^2\mathrm{Var}[X]$, $\mathrm{Var}[X+Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\mathrm{Cov}(X,Y)$.

Covariance: $\mathrm{Cov}(X,Y) = E[(X-E[X])(Y-E[Y])] = E[XY] - E[X]E[Y]$. If independent: $\mathrm{Cov}(X,Y) = 0$ (but not vice versa!). Correlation: $\rho = \mathrm{Cov}(X,Y)/(\sigma_X \sigma_Y) \in [-1,1]$.

Moments: the $n$-th (raw) moment is $E[X^n]$; the $n$-th central moment is $\mu_n = E[(X-\mu)^n]$. Skewness: $\gamma_1 = \mu_3/\sigma^3$. Excess kurtosis: $\gamma_2 = \mu_4/\sigma^4 - 3$. For the normal distribution $\gamma_1=0$ and $\gamma_2=0$ (mesokurtic).
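The sample versions of these quantities are easy to check against known values (a sketch assuming NumPy; recall $\mathrm{Exp}(1)$ has $\gamma_1 = 2$, $\gamma_2 = 6$, while the normal has both equal to 0):

```python
import numpy as np

def skew_kurt(x):
    """Sample skewness and excess kurtosis."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3), np.mean(z ** 4) - 3.0

rng = np.random.default_rng(1)
print(skew_kurt(rng.exponential(size=2_000_000)))   # ≈ (2, 6)
print(skew_kurt(rng.standard_normal(2_000_000)))    # ≈ (0, 0): mesokurtic
```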

Exercise: (a) $X\sim\mathrm{Poisson}(3)$: compute $E[X]$, $E[X^2]$, $\mathrm{Var}[X]$, $E[X(X-1)]$. (b) $X\sim U[0,1]$: what is $E[X^n]$? (c) Let $X\sim N(0,1)$ and $Y=X^2$ (dependent by construction). Show that $\mathrm{Cov}(X,Y)=0$ even though $X$ and $Y$ are dependent. Why does zero covariance not imply independence?
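A simulation harness for checking the answers numerically (an illustrative sketch assuming NumPy; sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

x = rng.poisson(3, size=2_000_000)                          # (a) X ~ Poisson(3)
print(x.mean(), (x ** 2).mean(), x.var(), (x * (x - 1)).mean())

u = rng.uniform(0, 1, size=2_000_000)                       # (b) estimate E[U^n] for small n
print([round((u ** n).mean(), 3) for n in (1, 2, 3, 4)])

z = rng.standard_normal(2_000_000)                          # (c) Cov(Z, Z^2) should be 0
print(np.cov(z, z ** 2)[0, 1])                              # ≈ 0, yet Z^2 is a function of Z
```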

Conditional Expectation

Conditional expectation $E[X|Y]$ is a random variable that "optimally predicts" $X$ given $Y$ in the sense of minimal mean squared deviation. Formally, $E[X|Y=y] = \int x\cdot f_{X|Y}(x|y)\,dx$. The function $g(y) = E[X|Y=y]$ is the best predictor of $X$ from $Y$ among all functions of $Y$.

Law of total expectation: $E[X] = E[E[X|Y]]$. A powerful computational tool: $E[X] = \sum_j E[X|Y=j]\cdot P(Y=j)$. For example, $E[\text{number of die rolls until the sum exceeds } 10]$ is computed via conditional expectation.
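The die-roll example can be solved exactly by this kind of conditioning (a minimal sketch assuming NumPy; the threshold of 10 and the Monte Carlo size are illustrative). Let $e_k$ be the expected number of additional rolls when the current sum is $k$; conditioning on the next roll gives $e_k = 1 + \frac{1}{6}\sum_{j=1}^{6} e_{k+j}$, with $e_k = 0$ for $k > 10$:

```python
import numpy as np

threshold = 10
e = np.zeros(threshold + 7)                   # e[k] = 0 once the sum already exceeds the threshold
for k in range(threshold, -1, -1):
    e[k] = 1 + np.mean(e[k + 1 : k + 7])      # condition on the six outcomes of the next roll
print(e[0])                                   # ≈ 3.61 expected rolls starting from sum 0

# Monte Carlo cross-check
rng = np.random.default_rng(3)
def rolls(threshold=10):
    s, n = 0, 0
    while s <= threshold:
        s += rng.integers(1, 7)               # one fair die roll
        n += 1
    return n
print(np.mean([rolls() for _ in range(100_000)]))
```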

Law of total variance: $\mathrm{Var}[X] = E[\mathrm{Var}[X|Y]] + \mathrm{Var}[E[X|Y]]$. Splits variance into "average within-group" and "between-group" — directly analogous to ANOVA.
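A numerical check of the decomposition on a two-component Gaussian mixture (a sketch assuming NumPy; the mixture parameters are arbitrary): with $Y\sim\mathrm{Bernoulli}(0.5)$, $X|Y{=}0 \sim N(0,1)$, $X|Y{=}1 \sim N(3,4)$, we get $E[\mathrm{Var}[X|Y]] = 2.5$ and $\mathrm{Var}[E[X|Y]] = 2.25$, so $\mathrm{Var}[X] = 4.75$.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=2_000_000)                       # Y ~ Bernoulli(0.5)
x = np.where(y == 0, rng.normal(0, 1, y.size), rng.normal(3, 2, y.size))
print(x.var())   # ≈ 4.75 = E[Var[X|Y]] + Var[E[X|Y]]
```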

$E[X|Y]$ as a projection: in the space $L^2(\Omega)$ of random variables with finite second moment, $E[X\mid\mathcal{F}_Y]$ (where $\mathcal{F}_Y = \sigma(Y)$ is the $\sigma$-algebra generated by $Y$) is the orthogonal projection of $X$ onto the subspace of $\mathcal{F}_Y$-measurable random variables. Orthogonality means $E[(X - E[X|Y])\cdot g(Y)] = 0$ for any function $g$. This geometric interpretation links probability theory with functional analysis.
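The orthogonality relation is easy to see numerically when $Y$ takes finitely many values, so that $E[X|Y]$ is just the per-group mean (a sketch assuming NumPy; the model for $X$ and the test function $g$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.integers(0, 3, size=1_000_000)             # Y takes values in {0, 1, 2}
x = y + rng.standard_normal(y.size)                # X depends on Y plus noise

cond_mean = np.array([x[y == k].mean() for k in range(3)])
residual = x - cond_mean[y]                        # X - E[X|Y]

g = np.sin(y) + y ** 2                             # an arbitrary function of Y
print(np.mean(residual * g))                       # ≈ 0: the residual is orthogonal to g(Y)
```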

Jensen's Inequality and Convexity

Jensen's inequality $E[g(X)] \geq g(E[X])$ for convex functions has numerous applications. AM-GM: for nonnegative $x_1,\dots,x_n$, the arithmetic mean dominates the geometric mean, $(x_1+\dots+x_n)/n \geq (x_1\cdots x_n)^{1/n}$, a consequence of the convexity of $-\ln$ (apply Jensen to a uniform random choice among the $x_i$). Information inequality: $D_{\mathrm{KL}}(P\,\|\,Q) \geq 0$ (the Kullback–Leibler divergence is non-negative), again a consequence of Jensen for the convex function $-\log$. Option pricing: $E[\max(S_T-K,0)] \geq \max(E[S_T]-K,0)$, so an option is worth at least its intrinsic value.

For concave functions the inequality reverses: $E[g(X)] \leq g(E[X])$. Example: $E[\ln X] \leq \ln E[X]$, i.e. the logarithm of the mean is at least the mean of the logarithm.
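Both directions of Jensen can be checked numerically (a sketch assuming NumPy; the lognormal variable and the discrete distributions $P$, $Q$ below are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1_000_000)   # a positive random variable
print(np.mean(np.log(x)), np.log(np.mean(x)))            # ≈ 0.0 <= ≈ 0.5 (ln is concave)

p = np.array([0.1, 0.2, 0.3, 0.4])                       # D_KL(P || Q) >= 0 by Jensen
q = np.array([0.25, 0.25, 0.25, 0.25])
print(np.sum(p * np.log(p / q)))                         # ≈ 0.106 >= 0
```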

Markov's Inequality: Origins and Consequences

Markov's inequality $P(X \geq a) \leq E[X]/a$ for $X \geq 0$ is proved in one line: $E[X] = \int_0^a x f(x)\,dx + \int_a^\infty x f(x)\,dx \geq \int_a^\infty x f(x)\,dx \geq a\cdot P(X\geq a)$. Its simplicity makes it universal: it does not require finite variance. Applicability limit: for $X \sim \mathrm{Exp}(1)$ and $a=2$, the true tail is $P(X\geq 2) = e^{-2} \approx 0.135$, while Markov's inequality gives $0.5$, coarser by a factor of $3.7$. The lighter the tail, the cruder Markov's estimate.
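The same comparison for several thresholds (a sketch assuming NumPy; the thresholds are illustrative): the exact tail of $\mathrm{Exp}(1)$ decays exponentially, while Markov's bound decays only like $1/a$.

```python
import numpy as np

for a in (1, 2, 4, 8):
    exact = np.exp(-a)                 # P(X >= a) for X ~ Exp(1)
    markov = 1.0 / a                   # E[X]/a with E[X] = 1
    print(a, exact, markov, markov / exact)   # the gap grows rapidly with a
```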

Hoeffding's inequality: for $X_1,\dots,X_n$ independent with $a_i \leq X_i \leq b_i$ and $S = \sum X_i$: $P(S - E[S] \geq t) \leq \exp(-2t^2/\sum(b_i-a_i)^2)$. This is an exponential bound for bounded random variables, significantly sharper than Chebyshev's. Widely used in learning theory (PAC-learning, generalization bounds).
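A numerical comparison for a Bernoulli sum (a sketch assuming NumPy; $n$, $t$, and the sample size are arbitrary):

```python
import numpy as np

n, t, p = 1000, 50, 0.5
hoeffding = np.exp(-2 * t**2 / (n * 1.0**2))     # each b_i - a_i = 1
chebyshev = n * p * (1 - p) / t**2               # Var[S]/t^2, also valid for the one-sided event

rng = np.random.default_rng(7)
s = rng.binomial(n, p, size=500_000)
print(hoeffding, chebyshev, np.mean(s - n * p >= t))   # ≈ 0.0067, 0.1, ≈ 0.0009
```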

Concentration of Measure and High Dimensions

In high-dimensional spaces, striking phenomena occur: concentration of measure. For i.i.d. standard normal random variables $X_1,\dots,X_n \sim N(0,1)$: as $n\rightarrow \infty$, $X_1^2+\dots+X_n^2 \approx n$ with deviations of order $\sqrt{2n}$. The volume of the unit ball $B^n$ concentrates near the equator and near the surface: a uniformly random point satisfies $\|x\| \approx \sqrt{n/(n+2)} \approx 1$.
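The $\|X\| \approx \sqrt{n}$ effect is visible already at moderate dimensions (a sketch assuming NumPy; dimensions and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
for n in (10, 100, 1000):
    sq = np.sum(rng.standard_normal((20_000, n)) ** 2, axis=1)   # ||X||^2 for 20,000 draws
    print(n, sq.mean(), sq.std(), np.sqrt(2 * n))                # mean ≈ n, std ≈ sqrt(2n)
```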

Lévy's concentration inequality (the isoperimetric inequality on the sphere): if $A \subset S^{n-1}$ has measure $\geq 1/2$, then its $\varepsilon$-expansion has measure $\geq 1 - 2e^{-n\varepsilon^2/2}$. Concentration of measure explains why neural networks work in high dimensions, why random projections preserve distances (the Johnson–Lindenstrauss lemma), and why bootstrap averages concentrate around the sample mean.
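A minimal Johnson–Lindenstrauss-style experiment (a sketch assuming NumPy; the dimensions, number of points, and the Gaussian projection are illustrative choices, not the only valid construction):

```python
import numpy as np

rng = np.random.default_rng(9)
d, k, m = 10_000, 500, 20                        # ambient dim, target dim, number of points
X = rng.standard_normal((m, d))
R = rng.standard_normal((d, k)) / np.sqrt(k)     # random projection, scaled to preserve norms on average
Y = X @ R

orig = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
proj = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
off_diag = ~np.eye(m, dtype=bool)
ratios = proj[off_diag] / orig[off_diag]
print(ratios.min(), ratios.max())                # all pairwise distance ratios close to 1
```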

Martingales and Azuma–Hoeffding Inequality

A martingale $M_0, M_1, \dots$ is a sequence of random variables with $E[M_{n+1}\mid M_0,\dots,M_n] = M_n$: there is no systematic trend. Examples: a random walk, cumulative winnings in a fair game, partial sums of i.i.d. random variables with zero mean.

Azuma's inequality: for a martingale with $|M_k - M_{k-1}| \leq c_k$: $P(M_n - M_0 \geq t) \leq \exp(-t^2/(2\sum c_k^2))$. It generalizes Hoeffding's inequality to dependent increments. It is used in the analysis of randomized algorithms, in bounding the chromatic number of random graphs via Doob martingales, and in estimating Lipschitz functions of many variables in combinatorics (the method of bounded differences, the "Lipschitz martingale" method).
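Azuma's bound checked on a simple $\pm 1$ random walk (a sketch assuming NumPy; here $c_k = 1$, and the endpoint of the walk is sampled as $2\,\mathrm{Binomial}(n, 1/2) - n$):

```python
import numpy as np

n, t = 400, 40
rng = np.random.default_rng(10)
m_n = 2 * rng.binomial(n, 0.5, size=500_000) - n      # M_n: sum of n independent ±1 steps, M_0 = 0
print(np.mean(m_n >= t), np.exp(-t**2 / (2 * n)))     # empirical ≈ 0.025, Azuma bound ≈ 0.135
```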

Numerical Example: Application of Chebyshev's Inequality (LLN)

Problem: $X_1,...,X_n \sim \mathrm{Bernoulli}(p=0.4)$. Find the minimal $n$ such that $P(|\overline{X}_n-0.4|>0.05) \leq 0.05$.

Step 1: Variance: $\sigma^2 = p(1-p) = 0.4\cdot 0.6 = 0.24$. $\mathrm{Var}[\overline{X}_n] = \sigma^2/n = 0.24/n$.

Step 2: Chebyshev’s inequality: $P(|\overline{X}_n-\mu|\geq\varepsilon) \leq \mathrm{Var}[\overline{X}_n]/\varepsilon^2 = 0.24/(n\cdot0.0025)$.

Step 3: Need: $0.24/(0.0025n) \leq 0.05 \Rightarrow n \geq 0.24/(0.0025\cdot0.05) = 1920$.

Step 4: The CLT gives a sharper estimate: $P(|\overline{X}_n-0.4|>0.05) \approx 2(1-\Phi(0.05\sqrt{n}/\sqrt{0.24}))$. Requiring this to be $\leq 0.05$ gives $\Phi(z) = 0.975 \Rightarrow z = 1.96$, so $0.05\sqrt{n}/0.490 \geq 1.96 \Rightarrow n \geq (1.96\cdot 0.490/0.05)^2 \approx 369$. The LLN guarantees convergence; the CLT quantifies the rate.
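An empirical check of both sample sizes (a sketch assuming NumPy; the number of replications is arbitrary): at $n = 369$ the target probability is already about $0.05$, while Chebyshev's conservative $n = 1920$ drives it far below the target.

```python
import numpy as np

rng = np.random.default_rng(11)
for n in (369, 1920):
    xbar = rng.binomial(n, 0.4, size=200_000) / n         # sample means of n Bernoulli(0.4) draws
    print(n, np.mean(np.abs(xbar - 0.4) > 0.05))          # ≈ 0.05 for n=369, essentially 0 for n=1920
```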
