Module I·Article II·~5 min read

Conditional Probability and Independence

Axiomatic Foundations of Probability Theory


Conditional probability is the probability of an event given that another event has occurred. This allows us to update our knowledge as information arrives and is the foundation of Bayesian inference.

Conditional Probability

Definition: For events A and B with P(B) > 0, the conditional probability of A given B is P(A|B) = P(A∩B)/P(B).

Multiplication Theorem: P(A₁∩A₂∩...∩Aₙ) = P(A₁)·P(A₂|A₁)·P(A₃|A₁A₂)·...·P(Aₙ|A₁...Aₙ₋₁).

Law of Total Probability: If B₁,...,Bₙ form a complete group (mutually exclusive, ⋃Bᵢ = Ω): P(A) = Σᵢ P(A|Bᵢ)P(Bᵢ).

Bayes’ Theorem: P(Bᵢ|A) = P(A|Bᵢ)P(Bᵢ) / Σⱼ P(A|Bⱼ)P(Bⱼ). “Posterior probability of hypothesis Bᵢ upon observing A.”
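The law of total probability and Bayes’ theorem can be sketched directly in code. The priors and likelihoods below are illustrative numbers, not taken from the article:

```python
# Sketch: law of total probability and Bayes' theorem over hypotheses B_i.
# The priors and likelihoods are hypothetical, chosen only for illustration.
priors = [0.5, 0.3, 0.2]        # P(B_i); a complete group, sums to 1
likelihoods = [0.9, 0.5, 0.1]   # P(A | B_i)

# Law of total probability: P(A) = sum_i P(A|B_i) P(B_i)
p_a = sum(l * p for l, p in zip(likelihoods, priors))

# Bayes' theorem: P(B_i | A) = P(A|B_i) P(B_i) / P(A)
posteriors = [l * p / p_a for l, p in zip(likelihoods, priors)]

print(round(p_a, 2))                       # P(A) = 0.62
print([round(x, 3) for x in posteriors])   # posteriors sum to 1
```

The denominator of Bayes’ theorem is exactly the total probability P(A), which is why the posteriors automatically normalize to 1.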

Independence of Events

Two events: A and B are independent if P(A∩B) = P(A)P(B). Equivalently: P(A|B) = P(A) (knowledge about B does not change the probability of A).

Pairwise vs. mutual independence: Three events A, B, C are pairwise independent if every pair of them is independent. They are mutually independent if, in addition, P(A∩B∩C) = P(A)P(B)P(C). Pairwise independence does not imply mutual independence!

Counterexample: Ω = {1,2,3,4}, P(k)=1/4. A={1,2}, B={1,3}, C={1,4}. P(A)=P(B)=P(C)=1/2. P(AB)=P(AC)=P(BC)=1/4 = P(A)P(B) — pairwise independent. P(ABC) = 1/4 ≠ 1/8 = P(A)P(B)P(C) — not mutually independent.
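The counterexample can be checked exactly with rational arithmetic:

```python
from fractions import Fraction

# Verify the counterexample: Ω = {1,2,3,4}, each outcome with probability 1/4.
omega = {1, 2, 3, 4}
A, B, C = {1, 2}, {1, 3}, {1, 4}

def P(event):
    return Fraction(len(event & omega), len(omega))

# Pairwise independence holds for every pair:
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)

# But mutual independence fails:
print(P(A & B & C), P(A) * P(B) * P(C))  # 1/4 vs 1/8
```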

Exercise: (a) Disease test: sensitivity 99%, specificity 95%. Prevalence 1%. P(disease|positive test) — compute using Bayes. (b) A coin is tossed 10 times. P(exactly 3 heads) = C(10,3)·(1/2)¹⁰. Why? (c) Monty Hall paradox: three doors, car behind one. You chose door 1, the host opened door 3 (goat). Should you switch?

Bayesian Inference in Practice

Bayes’ Theorem is not just a formula, but a philosophy of updating beliefs in light of new data. Prior probability P(Bᵢ) reflects our initial knowledge about a hypothesis before observation. The likelihood function P(A|Bᵢ) says how probable the observation is under the hypothesis. Posterior probability P(Bᵢ|A) is the updated probability after observing A.

Example: evaluation of vaccine efficacy. Before the study: P(vaccine works) = 0.5 (prior uncertainty). During the trial we observe that 2% of the vaccinated group fell ill versus 8% in the control group. Bayesian posterior inference substantially raises P(vaccine works). This formalizes how science updates hypotheses on the basis of data.

Paradox of false positives: if disease prevalence is 0.1% and the test has 99% sensitivity and 99% specificity, then after a positive test result P(disease) ≈ 9% — the overwhelming majority of positive results are false! This explains why screening programs for rare diseases require confirmatory tests.
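A short sketch of the base-rate effect, reading the stated “99% accuracy” as both sensitivity and specificity equal to 0.99:

```python
# Base-rate effect: prevalence 0.1%; "99% accuracy" interpreted here as
# sensitivity = specificity = 0.99 (an assumption of this sketch).
prevalence = 0.001
sensitivity = 0.99   # P(+ | diseased)
specificity = 0.99   # P(- | healthy)

# Law of total probability for a positive result:
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: probability of disease given a positive test
p_disease_given_pos = sensitivity * prevalence / p_positive
print(round(p_disease_given_pos, 3))  # ≈ 0.09
```

Almost all positives come from the huge healthy population, which is why the posterior stays near 9% despite the accurate test.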

Chains of Conditional Probabilities

The multiplication theorem allows a joint probability to be decomposed. For example, P(A₁ ∩ A₂ ∩ A₃) = P(A₁) · P(A₂|A₁) · P(A₃|A₁A₂). This is important in sequential trials: P(three red cards in a row from a 52-card deck) = (26/52)(25/51)(24/50) ≈ 0.118.
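The card computation can be done exactly with fractions:

```python
from fractions import Fraction

# Chain rule for three draws without replacement from a 52-card deck:
# P(red1) * P(red2 | red1) * P(red3 | red1, red2)
p = Fraction(26, 52) * Fraction(25, 51) * Fraction(24, 50)
print(p, float(p))  # 2/17 ≈ 0.118
```

Each factor conditions on the previous draws, so the numerators and denominators shrink together.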

An event tree is a visual way to apply the conditional probability formulas. Every node is an event, and each branch carries a conditional transition probability. The probability of a final outcome is the product of the probabilities along the path from the root, and the probabilities of the branches leaving any single node sum to 1.
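A minimal event-tree sketch with hypothetical weather/commute probabilities; nested dicts stand in for the diagram:

```python
# Event-tree sketch: each node maps an outcome label to
# (conditional branch probability, child node or None for a leaf).
# All numbers are illustrative.
tree = {
    "rain": (0.3, {"late": (0.5, None), "on_time": (0.5, None)}),
    "dry":  (0.7, {"late": (0.1, None), "on_time": (0.9, None)}),
}

def leaf_probs(node, acc=1.0, path=()):
    # Leaf probability = product of branch probabilities along its path.
    out = {}
    for label, (p, child) in node.items():
        if child is None:
            out[path + (label,)] = acc * p
        else:
            out.update(leaf_probs(child, acc * p, path + (label,)))
    return out

leaves = leaf_probs(tree)
print(leaves)
print(sum(leaves.values()))  # all leaves of the tree sum to 1
```

Summing selected leaves applies the law of total probability: here P(late) = 0.3·0.5 + 0.7·0.1 = 0.22.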

Independence in the Flow of Information

The concept of independence has a deep informational meaning. A and B are independent means that knowledge of B does not change the probability of A: P(A|B) = P(A). In terms of information theory (Shannon, 1948), the pointwise mutual information log[P(A∩B)/(P(A)P(B))] equals 0 exactly when the events are independent.

In practice, independence is a powerful assumption that simplifies calculations. The naïve Bayes classifier assumes conditional independence of all features given the class, providing a computationally simple, yet surprisingly effective machine learning model.
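A toy naïve Bayes sketch, with made-up spam-filter probabilities: the likelihood of a set of words factors into a product because the features are assumed conditionally independent given the class.

```python
# Naive Bayes sketch. All probabilities are hypothetical illustration values.
priors = {"spam": 0.4, "ham": 0.6}
likelihood = {                      # P(word present | class)
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham":  {"free": 0.1, "meeting": 0.7},
}

def posterior(words):
    # Unnormalized score: P(class) * prod_i P(word_i | class),
    # using the conditional-independence assumption for the product.
    scores = {}
    for c in priors:
        s = priors[c]
        for w in words:
            s *= likelihood[c][w]
        scores[c] = s
    z = sum(scores.values())        # normalizing constant P(words)
    return {c: s / z for c, s in scores.items()}

print(posterior(["free"]))          # "free" pushes the posterior toward spam
```

Real implementations work with log-probabilities and smoothing, but the factorization shown here is the core of the model.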

Conditional independence: A and B are conditionally independent given C if P(A∩B|C) = P(A|C)·P(B|C). Neither form of independence implies the other — events can be conditionally independent yet dependent unconditionally, and vice versa. Conditional independence is fundamental for Bayesian networks (directed acyclic graphical models) — a powerful tool of probabilistic inference in AI systems. In a Bayesian network every node is conditionally independent of its non-descendants given its parents — the so-called Markov condition.
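A numeric sketch of the distinction: the joint distribution below makes A and B conditionally independent given C by construction, yet they are dependent unconditionally because both are influenced by C. All numbers are hypothetical.

```python
from itertools import product

pA = {0: 0.2, 1: 0.7}   # P(A=1 | C=c), illustrative values
pB = {0: 0.3, 1: 0.6}   # P(B=1 | C=c)
pC = 0.5                # P(C=1)

def joint(a, b, c):
    # Built so that A and B are independent within each value of C.
    pc = pC if c else 1 - pC
    pa = pA[c] if a else 1 - pA[c]
    pb = pB[c] if b else 1 - pB[c]
    return pc * pa * pb

def P(pred):
    return sum(joint(a, b, c)
               for a, b, c in product((0, 1), repeat=3) if pred(a, b, c))

for c in (0, 1):
    pc = P(lambda a, b, cc: cc == c)
    lhs = P(lambda a, b, cc: a == 1 and b == 1 and cc == c) / pc
    rhs = (P(lambda a, b, cc: a == 1 and cc == c) / pc) * \
          (P(lambda a, b, cc: b == 1 and cc == c) / pc)
    print(c, round(lhs, 6), round(rhs, 6))   # equal: P(AB|C) = P(A|C)P(B|C)

# Unconditionally A and B are correlated through C:
print(round(P(lambda a, b, c: a == 1 and b == 1), 4),
      round(P(lambda a, b, c: a == 1) * P(lambda a, b, c: b == 1), 4))
```

This is the same structure as a two-leaf Bayesian network C → A, C → B: the children are independent given their common parent.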

Causal Inference and Simpson’s Paradox

Simpson's paradox: a trend observed in aggregated data may disappear or reverse when the data are split into subgroups. This occurs because of hidden variables — confounding factors. Example: treatment A yields a better result than treatment B in each of two clinics, but worse in the combined data (if more severe patients are sent to clinic A). Conditional probabilities P(recovery|treatment, clinic) correctly account for the data structure, whereas P(recovery|treatment) misleads. Simpson's paradox is a key reason why randomized controlled trials are used in medical research: randomization balances confounders across treatment groups.
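The reversal can be reproduced with concrete counts; the numbers below are the classic kidney-stone study figures, relabeled here as two clinics for illustration:

```python
# Simpson's paradox: treatment A wins inside each clinic but loses
# in the pooled data. Counts are (recovered, total); the clinic labels
# are this sketch's relabeling of the classic kidney-stone numbers.
data = {
    "clinic1": {"A": (81, 87),   "B": (234, 270)},
    "clinic2": {"A": (192, 263), "B": (55, 80)},
}

def rate(recovered, total):
    return recovered / total

for clinic, arms in data.items():
    print(clinic, {t: round(rate(*arms[t]), 2) for t in arms})

pooled = {t: tuple(sum(data[c][t][i] for c in data) for i in (0, 1))
          for t in ("A", "B")}
print("pooled", {t: round(rate(*pooled[t]), 2) for t in pooled})
```

A leads 93% vs 87% and 73% vs 69% within the clinics, yet trails 78% vs 83% overall, because A treats far more of the harder clinic-2 cases.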

Bayesian Inference and Updating Beliefs

The Bayes formula P(H|E) = P(E|H)·P(H)/P(E) recalculates the posterior probability of a hypothesis H after observing evidence E. Prior P(H) — our belief before the observation. Likelihood P(E|H) — the probability of evidence if H is true. Evidence P(E) = Σ P(E|Hᵢ)P(Hᵢ) — normalizing constant. Posterior P(H|E) — updated belief.

Sequential Bayesian inference: the posterior distribution after n observations becomes the prior for the (n+1)-th observation. The result does not depend on the order of the observations (if they are conditionally independent given the hypothesis). Bayesian vs. frequentist: the Bayesian approach treats parameters as random variables with distributions, while the frequentist approach treats them as fixed unknown constants.
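Sequential updating and its order-independence can be sketched with a hypothetical fair-vs-biased coin:

```python
# Sequential Bayesian updating: the posterior after each observation
# becomes the prior for the next. Hypotheses and likelihoods are
# illustrative: a fair coin vs one biased toward heads.
def update(prior, likelihoods):
    post = {h: prior[h] * likelihoods[h] for h in prior}
    z = sum(post.values())               # normalizing constant
    return {h: p / z for h, p in post.items()}

prior = {"fair": 0.5, "biased": 0.5}
lik = {"H": {"fair": 0.5, "biased": 0.8},
       "T": {"fair": 0.5, "biased": 0.2}}

p1 = prior
for e in ["H", "H", "T"]:
    p1 = update(p1, lik[e])

p2 = prior
for e in ["T", "H", "H"]:                # same data, different order
    p2 = update(p2, lik[e])

print(p1)
print(p2)                                # identical posteriors
```

Since the posterior is proportional to the prior times the product of the likelihoods, reordering the (conditionally independent) observations only reorders the factors of that product.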

Application: medical diagnostics. Test sensitivity = P(+|diseased), specificity = P(−|healthy). For a rare disease (P(diseased) = 0.001) even a test with sensitivity 0.99 and specificity 0.99 yields P(diseased|+) ≈ 0.09 — only 9%! This illustrates how neglecting the base rate leads to erroneous conclusions.

Numerical Example: Bayes’ Theorem — Three Urns

Problem: Three urns: urn 1 contains 2 red and 8 blue balls, urn 2 — 6 red and 4 blue, urn 3 — 5 red and 5 blue. An urn is chosen at random (P=1/3 each), then a ball is drawn. The ball turns out red. Find P(urn 2 | red ball).

Step 1: P(red|urn 1)=0.2; P(red|urn 2)=0.6; P(red|urn 3)=0.5.

Step 2: Total probability: P(red) = (1/3)·0.2+(1/3)·0.6+(1/3)·0.5 = 1.3/3 ≈ 0.433.

Step 3: Bayes: P(urn 2|red) = 0.6·(1/3) / (1.3/3) = 0.6/1.3 ≈ 0.462.

Step 4: P(urn 1|red)=0.2/1.3≈0.154; P(urn 3|red)=0.5/1.3≈0.385. Sum: 1.001≈1 ✓. The posterior probability of urn 2 has increased from 33% to 46% — the observation has "shifted" the probabilities according to Bayes’ rule.
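The rounding in the worked steps can be avoided entirely by redoing the computation with exact fractions:

```python
from fractions import Fraction

# Exact check of the three-urn computation.
priors = [Fraction(1, 3)] * 3                                  # P(urn i)
p_red = [Fraction(2, 10), Fraction(6, 10), Fraction(5, 10)]    # P(red | urn i)

# Law of total probability, then Bayes' theorem:
p_red_total = sum(l * p for l, p in zip(p_red, priors))
posteriors = [l * p / p_red_total for l, p in zip(p_red, priors)]

print(p_red_total)                    # 13/30 ≈ 0.433
print([str(x) for x in posteriors])   # 2/13, 6/13, 5/13 — sum exactly 1
```

The exact posteriors 2/13 ≈ 0.154, 6/13 ≈ 0.462, 5/13 ≈ 0.385 match the rounded steps, and they sum to exactly 1 rather than 1.001.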
