Module VII · Article I
Discrete-Time Markov Chains
Stochastic Processes
A Markov chain is a memoryless stochastic process: the future depends only on the present, not on the past. This is a powerful model for systems with "states": queues, markets, genomic sequences, PageRank.
Definition and Properties of Markov Chains
Definition: A sequence X₀, X₁, X₂, ... with values in a finite/countable set S satisfies the Markov property: P(Xₙ₊₁ = j | X₀,...,Xₙ) = P(Xₙ₊₁ = j | Xₙ) for all n and j.
Transition matrix (time-homogeneous, i.e. independent of n): Pᵢⱼ = P(Xₙ₊₁ = j | Xₙ = i). Stochasticity: Pᵢⱼ ≥ 0 and ΣⱼPᵢⱼ = 1 for all i.
n-step transitions: P(Xₙ = j | X₀ = i) = (Pⁿ)ᵢⱼ — the (i,j)-th element of the matrix Pⁿ.
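The n-step formula is easy to check numerically. A minimal sketch in NumPy, using the two-state weather chain from exercise (a) below (P(sun|sun) = 0.8, P(rain|rain) = 0.6):

```python
import numpy as np

# Two-state weather chain: state 0 = sun, state 1 = rain
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

# P(X_n = j | X_0 = i) is the (i, j) entry of the matrix power P^n
P5 = np.linalg.matrix_power(P, 5)
print(P5[0, 1])   # chance of rain in 5 days given sun today: 0.32992
```

The closed form for this chain is P⁵₀₁ = (1/3)(1 − 0.4⁵) = 0.32992, matching the matrix power.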
Classification of States
Recurrent state: i is recurrent if P(return to i) = 1. Null-recurrent: E[return time] = ∞. Positive-recurrent: E[return time] < ∞.
Transient state: i is transient if P(return to i) < 1. With probability 1 it is visited only finitely many times.
Periodicity: the period is d(i) = GCD{n ≥ 1: (Pⁿ)ᵢᵢ > 0}. If d(i) = 1, the state is aperiodic.
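The GCD definition of the period can be evaluated directly over a finite horizon. A small illustrative helper (not a library routine; the horizon `max_n` is an assumption):

```python
from math import gcd
import numpy as np

def period(P, i, max_n=50):
    """d(i) = gcd{n >= 1 : (P^n)_ii > 0}, scanned up to max_n steps."""
    d, Pn = 0, np.eye(len(P))
    for n in range(1, max_n + 1):
        Pn = Pn @ P
        if Pn[i, i] > 1e-12:
            d = gcd(d, n)
    return d

flip = np.array([[0.0, 1.0],
                 [1.0, 0.0]])   # deterministic alternation: period 2
lazy = np.array([[0.5, 0.5],
                 [0.5, 0.5]])   # self-loop possible at n = 1: aperiodic
```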
Stationary and Limiting Distribution
Stationary distribution π: A vector π ≥ 0, Σπᵢ = 1: π = πP. For a finite irreducible aperiodic chain: unique π, and Pⁿ → 1·πᵀ.
Ergodic theorem: For an irreducible positive-recurrent chain, the fraction of time spent in state i converges to πᵢ as n → ∞ (almost surely), and πᵢ = 1/E[τᵢ], the inverse of the mean return time to i.
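Since π = πP says πᵀ is a left eigenvector of P with eigenvalue 1, the stationary distribution can be computed by eigendecomposition. A sketch for the exercise-(a) weather chain below:

```python
import numpy as np

P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

# pi = pi P  <=>  pi^T is an eigenvector of P^T with eigenvalue 1
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi /= pi.sum()                 # normalize so the entries sum to 1

mean_return = 1 / pi           # Kac's formula: E[tau_i] = 1 / pi_i
print(pi, mean_return)         # pi = [2/3, 1/3], return times [1.5, 3.0]
```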
Exercise: (a) Weather chain with P(sun|sun) = 0.8, P(rain|rain) = 0.6: write down the matrix P, then find π and the mean return times. (b) PageRank: a graph of 4 pages, each linking to all the others with equal probability. Find π from π = πP. (c) Simple random walk on {0,1,...,N} with absorption at 0 and N: prove that P(absorption at N | start at i) = i/N.
Ergodicity and Limit Theorems for Markov Chains
Ergodic chain: irreducible, aperiodic, positive recurrent. For an ergodic chain the distribution μₙ of Xₙ converges to π (geometrically fast when S is finite). Mixing rate: governed by the spectral gap, gap = 1 − λ₂, where λ₂ is the second-largest eigenvalue of P. Mixing time: τ_mix ≈ 1/gap. For the "shuffling" chain (uniform transitions on the complete graph): gap = 1, so the chain mixes in a single step.
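The claim gap = 1 for the uniform "shuffling" chain can be checked directly (K = 5 is an arbitrary choice):

```python
import numpy as np

K = 5                               # number of states (arbitrary choice)
P = np.full((K, K), 1.0 / K)        # jump to a uniformly random state
eigs = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
gap = 1.0 - eigs[1]                 # rank-1 matrix: eigenvalues 1, 0, ..., 0
print(gap)                          # -> 1.0: mixing in one step
```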
Ergodic average theorem: (1/n)Σᵢ₌₀^{n-1} f(Xᵢ) →_{a.s.} Σⱼ f(j)πⱼ for any function f with Σ|f(j)|πⱼ < ∞. This is the law of large numbers for Markov chains — the foundation of MCMC (Markov Chain Monte Carlo).
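A simulation sketch of the ergodic average for the weather chain, with f the indicator of "sun" (run length and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])           # weather chain from exercise (a)
f = np.array([1.0, 0.0])             # f = indicator of state 0 ("sun")

# Ergodic average (1/n) * sum f(X_i) along one trajectory
n, x, total = 50_000, 0, 0.0
for _ in range(n):
    total += f[x]
    x = rng.choice(2, p=P[x])

print(total / n)                     # close to pi_0 = 2/3
```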
MCMC: Metropolis-Hastings and Gibbs Sampling
Problem: How to sample from a complex distribution π(x) (normalization constant Z unknown)? MCMC solution: construct a chain with invariant distribution π.
Metropolis-Hastings algorithm: Propose x' from q(x'|x). Accept with probability α = min(1, π(x')q(x|x')/(π(x)q(x'|x))). Detailed balance: π(x)P(x→x') = π(x')P(x'→x) ⟹ π is invariant. Works even without normalization Z!
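A minimal random-walk Metropolis sketch: with a symmetric Gaussian proposal the ratio q(x|x′)/q(x′|x) cancels, so only the unnormalized log-density is needed (target, step size, and run length below are illustrative choices):

```python
import numpy as np

def metropolis_hastings(log_p, x0, n, step=1.0, seed=0):
    """Random-walk Metropolis: symmetric proposal, so the acceptance
    ratio reduces to p(x')/p(x); p may be unnormalized."""
    rng = np.random.default_rng(seed)
    x, lp = x0, log_p(x0)
    samples = np.empty(n)
    for i in range(n):
        x_new = x + step * rng.normal()
        lp_new = log_p(x_new)
        if np.log(rng.random()) < lp_new - lp:   # accept w.p. min(1, ratio)
            x, lp = x_new, lp_new
        samples[i] = x
    return samples

# Target: N(3, 1) known only up to a constant: log pi(x) = -(x-3)^2/2 + C
samples = metropolis_hastings(lambda x: -0.5 * (x - 3.0) ** 2, 0.0, 50_000)
```

After discarding a burn-in prefix, the sample mean and variance approach 3 and 1, even though the normalizing constant was never used.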
Gibbs sampling: Sequentially update each coordinate from the conditional distribution: Xᵢ ~ π(Xᵢ|X_{-ᵢ}). Special case of MH with acceptance rate 1. Used in Bayesian models, LDA (Latent Dirichlet Allocation), HMM.
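A two-variable Gibbs sketch for a bivariate normal with correlation ρ, where both full conditionals are one-dimensional normals (ρ and the run length are assumptions):

```python
import numpy as np

# Gibbs sampler for a bivariate normal (zero mean, unit variances, corr rho):
# each full conditional is one-dimensional, X | Y=y ~ N(rho*y, 1 - rho^2).
rho, n = 0.8, 50_000
sigma = np.sqrt(1 - rho**2)
rng = np.random.default_rng(1)
x, y = 0.0, 0.0
xs = np.empty(n)
for i in range(n):
    x = rng.normal(rho * y, sigma)   # sample X | Y = y
    y = rng.normal(rho * x, sigma)   # sample Y | X = x
    xs[i] = x
```

The marginal of the x-samples converges to N(0, 1); every proposal is accepted, as in the MH special case noted above.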
Hidden Markov Models (HMM)
HMM: hidden Markov chain Z₁,Z₂,...,Zₙ (K states) + observables X₁,...,Xₙ with P(Xₜ|Zₜ). Three tasks: (1) Decoding (Viterbi): find the most probable hidden sequence Z* — Viterbi algorithm (DP, O(nK²)). (2) Filtering (forward-backward): P(Zₜ|X₁,...,Xₙ) — forward-backward algorithm (O(nK²)). (3) Parameter estimation (Baum-Welch = EM): iterative algorithm.
Applications: speech recognition (states = phonemes), genomics (CpG-islands), finance (market regimes).
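The Viterbi decoding task (1) can be sketched in a few lines of log-space dynamic programming; the 2-state HMM below uses hypothetical toy numbers:

```python
import numpy as np

def viterbi(pi0, A, B, obs):
    """Most probable hidden path, O(n K^2) DP in log space.
    pi0: initial probs (K,), A: transitions (K,K), B: emissions (K,M),
    obs: list of observed symbol indices."""
    n, K = len(obs), len(pi0)
    delta = np.log(pi0) + np.log(B[:, obs[0]])
    back = np.zeros((n, K), dtype=int)
    for t in range(1, n):
        scores = delta[:, None] + np.log(A)      # scores[from, to]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):                # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy 2-state HMM (hypothetical numbers): sticky states, noisy emissions
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.8, 0.2], [0.2, 0.8]])
path = viterbi(np.array([0.5, 0.5]), A, B, [0, 0, 1, 1, 1])
print(path)   # -> [0, 0, 1, 1, 1]
```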
Theorem on MCMC Convergence
For an ergodic Markov chain: ||Pⁿ(x,·) − π||_TV ≤ r·ρⁿ, where r > 0 and ρ < 1 set the mixing rate. For reversible chains the rate is governed by the spectral gap 1 − λ₂, where λ₂ is the second-largest eigenvalue of P. Spectral-gap and log-Sobolev inequalities give sharper bounds.
Practice: the 'burn-in' period should be sufficient to reach the stationary regime. Convergence diagnostics: Gelman-Rubin criterion (R̂ < 1.1), chain autocorrelation, Geweke test.
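The Gelman-Rubin statistic compares within-chain and between-chain variance. A simplified sketch of the basic R̂ (without the split-chain refinement used in modern practice; the example chains are synthetic):

```python
import numpy as np

def gelman_rubin(chains):
    """chains: array (m, n) of m parallel chains. Returns basic R-hat."""
    chains = np.asarray(chains)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 2000))            # 4 well-mixed chains: R-hat ~ 1
stuck = mixed + np.arange(4)[:, None]         # chains at different offsets
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

Well-mixed chains give R̂ near 1 (below the 1.1 threshold above); chains stuck in different regions push R̂ well above it.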
Variational Methods and Normalizing Flows
Normalizing flows: transform a simple distribution (standard normal) through invertible differentiable functions f: z ~ p_z, x = f(z), p_x(x) = p_z(f⁻¹(x))·|det(∂f⁻¹/∂x)|. Exact computation of log-likelihood! RealNVP: split z = (z_a, z_b), x_a = z_a, x_b = z_b·exp(s(z_a)) + t(z_a). The Jacobian is triangular → det computed in O(d). Applications: generative models, density estimation, variational inference.
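A NumPy sketch of one affine coupling layer; `s_fn` and `t_fn` below are hypothetical linear stand-ins for the small neural networks RealNVP would learn:

```python
import numpy as np

def coupling_forward(z, s_fn, t_fn):
    """One RealNVP affine coupling layer: x_a = z_a, x_b = z_b*e^s + t.
    Returns x and log|det J| = sum_j s_j (the Jacobian is triangular)."""
    d = z.shape[1] // 2
    za, zb = z[:, :d], z[:, d:]
    s = s_fn(za)
    x = np.concatenate([za, zb * np.exp(s) + t_fn(za)], axis=1)
    return x, s.sum(axis=1)

def coupling_inverse(x, s_fn, t_fn):
    """Exact inverse: z_b = (x_b - t) * e^{-s}, computable from x_a alone."""
    d = x.shape[1] // 2
    xa, xb = x[:, :d], x[:, d:]
    return np.concatenate([xa, (xb - t_fn(xa)) * np.exp(-s_fn(xa))], axis=1)

# Hypothetical stand-ins for the learned networks s and t
s_fn = lambda a: 0.1 * a
t_fn = lambda a: a - 1.0
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))
x, logdet = coupling_forward(z, s_fn, t_fn)
```

Inverting the layer recovers z exactly, and the log-determinant costs O(d), which is what makes exact likelihood computation feasible.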
Spectral Methods in Clustering
Spectral clustering: build an affinity graph, normalized Laplacian L = I − D^{-1/2}AD^{-1/2}. The first k eigenvectors of L form a new representation → k-means in the new space. Manifold connection theorem: spectral clustering reveals geometric clusters that k-means misses. Applications: image segmentation, community detection in social networks.
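A toy sketch of the embedding step: for a graph with two disconnected triangles, the 2-dimensional spectral embedding collapses each component to a single point, so k-means separates them trivially:

```python
import numpy as np

def spectral_embedding(A, k):
    """Normalized Laplacian L = I - D^{-1/2} A D^{-1/2}; rows of the
    k lowest eigenvectors of L are the new point representations."""
    dinv = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(len(A)) - dinv[:, None] * A * dinv[None, :]
    w, V = np.linalg.eigh(L)        # eigh returns ascending eigenvalues
    return V[:, :k]

# Two disconnected triangles: the embedding separates the components exactly
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)
emb = spectral_embedding(A, 2)
```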
Numerical Example: Stationary Distribution of a Markov Chain
Problem: Two-state MC (sun/rain): P(sun|rain)=0.4, P(rain|sun)=0.3. Matrix: P=[[0.7,0.3],[0.4,0.6]]. Find π and first return times.
Step 1: System π·P=π: π₁=0.7π₁+0.4π₂, π₂=0.3π₁+0.6π₂, π₁+π₂=1.
Step 2: From the first equation: 0.3π₁=0.4π₂ → π₁=4π₂/3. Normalization: 4π₂/3+π₂=1 → 7π₂/3=1 → π₂=3/7≈0.429.
Step 3: π₁=4/7≈0.571, π₂=3/7≈0.429. In the long run: 57.1% of days are sunny.
Step 4: E[T_sun→sun] = 1/π₁ = 7/4 = 1.75 steps. Mixing rate: the eigenvalues of a 2×2 stochastic matrix are 1 and tr(P) − 1, so λ₂ = 0.7 + 0.6 − 1 = 0.3. Pⁿ converges to [[π₁,π₂],[π₁,π₂]] at rate 0.3ⁿ. For n = 10: 0.3¹⁰ ≈ 6·10⁻⁶ — mixing is essentially complete.
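The whole worked example can be verified in a few lines (the eigenvalue check confirms λ₂ = 0.3, and the n = 10 error is indeed on the 0.3¹⁰ scale):

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
pi = np.array([4/7, 3/7])

eigs = np.sort(np.linalg.eigvals(P).real)            # -> [0.3, 1.0]
err = np.abs(np.linalg.matrix_power(P, 10) - pi).max()
print(eigs, err)                                     # err ~ 0.3**10 scale
```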