Probability Theory Basics for Engineers
The mathematics of uncertainty — for people who need to make decisions, not just pass exams.
"Probability is the very guide of life."

Probability theory is the mathematical framework for reasoning about uncertainty. Every prediction, every decision under risk, every machine learning model rests on probability. Yet most engineers and scientists who use probability have only a hazy intuition for what it means — they remember formulas without understanding the structure.
This article covers what a probability is, the formal structure (probability spaces), random variables, the most important distributions, conditional probability and Bayes' theorem, expectation and variance, the central limit theorem, and applications in engineering and machine learning.
What probability is
Probability is a measure of the likelihood of an event. We assign a number between 0 and 1 to each event:
- 0 means the event is impossible.
- 1 means the event is certain.
- 0.5 means equally likely as not.
This is the intuition. The formal definition is more careful.
A probability space is a triple $(\Omega, \mathcal{F}, P)$ where:
- $\Omega$ is the sample space — the set of all possible outcomes.
- $\mathcal{F}$ is a collection of events — subsets of $\Omega$ that we can assign probabilities to.
- $P$ is the probability measure — a function from $\mathcal{F}$ to $[0, 1]$ satisfying:
- $P(\Omega) = 1$.
- $P(A) \geq 0$ for every event $A$.
- For disjoint events $A_1, A_2, \ldots$: $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$.
This is the Kolmogorov axiomatization (1933), which gave probability theory its current rigorous foundation.
In practice, you rarely think about probability spaces explicitly. You work with random variables, distributions, and probabilities of events.
Discrete vs continuous probability
Discrete probability handles situations with countable outcomes. Coin flips, dice rolls, customer counts, page views.
Continuous probability handles situations with uncountably many outcomes. Measurements, times, weights, lengths.
The distinction matters because discrete distributions assign positive probability to individual outcomes (e.g., $P(X = 3) = \frac{1}{6}$ for a fair die), while continuous distributions assign probability zero to any individual outcome. For continuous variables, you talk about probabilities of ranges: $P(a \leq X \leq b)$ or $P(X > t)$.
Continuous probability is described by a probability density function (PDF). The PDF $f$ has the property that $f(x) \geq 0$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
The PDF is not itself a probability — it is a density. The integral of the PDF over an interval gives the probability of falling in that interval: $P(a \leq X \leq b) = \int_a^b f(x)\,dx$.
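A minimal sketch makes the point concrete, assuming numpy and scipy are available (the choice of library is mine): integrating a PDF over an interval yields a probability, while the density itself can exceed 1 without contradiction.

```python
from scipy import stats
from scipy.integrate import quad

# Standard normal PDF: integrating it over [-1, 1] gives the
# familiar ~68% probability of falling within one standard deviation.
pdf = stats.norm(loc=0, scale=1).pdf

area, _ = quad(pdf, -1, 1)                  # numerical integral of the density
print(f"P(-1 <= X <= 1) ~ {area:.4f}")      # ~0.6827

# A density value is not a probability: for a narrow normal, it exceeds 1.
narrow = stats.norm(loc=0, scale=0.1)
print(narrow.pdf(0))                        # ~3.989 -- a density, not a probability
```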
Random variables
A random variable $X$ is a function $X : \Omega \to \mathbb{R}$ from the sample space to the real numbers. In English: a quantity whose value depends on a random outcome.
Examples:
- $X$ = number of heads in 10 coin flips (discrete, takes values 0 to 10).
- $H$ = height of a randomly selected adult (continuous, takes values in roughly 50–250 cm).
- $T$ = time until a server fails (continuous, takes positive real values).
Random variables are usually denoted by capital letters ($X$, $Y$, $Z$). Specific realizations (observed values) are denoted by lowercase ($x$, $y$, $z$).
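Simulation is the fastest way to build intuition for random variables. A minimal sketch, assuming numpy (the seed and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# X = number of heads in 10 fair coin flips: one realization...
flips = rng.integers(0, 2, size=10)   # ten 0/1 outcomes
x = flips.sum()                        # one observed value (lowercase x)
print(f"one realization: x = {x}")

# ...and many realizations, so the distribution of X emerges.
many_x = rng.binomial(n=10, p=0.5, size=100_000)
print(f"empirical P(X = 5) ~ {np.mean(many_x == 5):.4f}")  # exact: C(10,5)/2^10 ~ 0.2461
```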
Common distributions
A small number of distributions appear repeatedly in applications. Memorize their forms and properties.
Discrete distributions:
Bernoulli — single trial with two outcomes. $P(X = 1) = p$, $P(X = 0) = 1 - p$. Example: single coin flip.
Binomial — number of successes in $n$ independent Bernoulli trials. $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$. Example: number of heads in 10 coin flips.
Poisson — number of events in a fixed interval, when events occur independently at rate $\lambda$. $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$. Example: number of customers arriving in an hour.
Geometric — number of trials until the first success. $P(X = k) = (1 - p)^{k - 1} p$. Example: number of coin flips until the first heads.
Continuous distributions:
Uniform — equally likely to be anywhere in $[a, b]$. PDF: $f(x) = \frac{1}{b - a}$ for $a \leq x \leq b$, zero elsewhere.
Normal (Gaussian) — the bell curve. PDF: $f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)}$. Mean $\mu$, variance $\sigma^2$. The single most important distribution in probability and statistics.
Exponential — waiting time between events of a Poisson process. PDF: $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$. Memoryless property.
Beta — distribution on $[0, 1]$, useful for modeling probabilities. PDF: $f(x) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}$.
Each distribution has parameters that determine its shape. Each has a mean, variance, and other moments computable from the parameters.
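All of these are available in scipy.stats; the sketch below assumes that library and uses its parameter conventions (note that scipy parameterizes the exponential by scale = $1/\lambda$, not by $\lambda$ directly):

```python
from scipy import stats

# Discrete: P(X = k) via the PMF.
print(stats.binom(n=10, p=0.5).pmf(5))      # ~0.2461
print(stats.poisson(mu=3).pmf(2))           # ~0.2240
print(stats.geom(p=0.5).pmf(3))             # (1-p)^2 * p = 0.125

# Continuous: moments follow from the parameters.
normal = stats.norm(loc=0, scale=2)         # mu = 0, sigma = 2
print(normal.mean(), normal.var())          # 0.0, 4.0
expon = stats.expon(scale=1 / 0.5)          # rate lambda = 0.5 -> scale = 2
print(expon.mean())                         # 1/lambda = 2.0
print(stats.beta(a=2, b=5).mean())          # alpha/(alpha+beta) ~ 0.2857
```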
Conditional probability and Bayes' theorem
The conditional probability of $A$ given $B$ is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)},$$
assuming $P(B) > 0$. In words: the probability of $A$, given that we know $B$ has occurred.
Bayes' theorem rewrites this:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.$$
Why this matters: it lets you reverse conditional probabilities. If you know $P(B \mid A)$ and you want $P(A \mid B)$, Bayes' theorem gets you there.
Example 1 — the classic medical test: A disease affects 1 in 1000 people. A test for the disease is 99% accurate (both sensitivity and specificity). If a person tests positive, what is the probability they have the disease?
Intuition says: about 99%, since the test is 99% accurate. Bayes' theorem says otherwise.
Let = "has disease," = "tests positive."
. (true positive rate). (false positive rate).
By the law of total probability: .
Bayes: .
So if you test positive, the probability you actually have the disease is about 9%, not 99%. The 99% accuracy of the test is overwhelmed by the very low prior probability of the disease.
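The same computation as a small function (the function and argument names here are my own, hypothetical choices):

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    # Law of total probability: P(T) = P(T|D) P(D) + P(T|~D) P(~D)
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

print(posterior(prior=0.001, sensitivity=0.99, false_positive_rate=0.01))
# ~0.0902 -- about 9%, matching the calculation above
```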
This is the base rate fallacy, and it is one of the most consequential cognitive errors in medical, judicial, and security contexts.
Expectation and variance
The expected value (or mean) of a random variable $X$ is a weighted average:
Discrete: $E[X] = \sum_x x \, P(X = x)$. Continuous: $E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$.
Example 2: Expected value of a die roll.
$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = \frac{21}{6} = 3.5$.
(Note: 3.5 is not a possible outcome. Expected value is an average, not a typical value.)
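A quick check in code, assuming numpy: the weighted-average definition and a brute-force simulation of rolls agree.

```python
import numpy as np

outcomes = np.arange(1, 7)             # die faces 1..6
probs = np.full(6, 1 / 6)              # fair die

expectation = np.sum(outcomes * probs)                   # weighted average: 3.5
variance = np.sum((outcomes - expectation) ** 2 * probs) # 35/12 ~ 2.9167

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=1_000_000)
print(expectation, rolls.mean())       # 3.5 vs ~3.5 empirically
print(variance, rolls.var())           # ~2.9167 vs ~2.9167 empirically
```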
Variance measures the spread of a distribution: $\mathrm{Var}(X) = E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2$.
The standard deviation is the square root of the variance: $\sigma = \sqrt{\mathrm{Var}(X)}$.
Linearity of expectation: for any random variables $X$, $Y$ and constants $a$, $b$, $E[aX + bY] = a\,E[X] + b\,E[Y]$. This holds whether or not $X$ and $Y$ are independent.
Variance is not linear: $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$. For independent variables, the covariance is zero and variances add.
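A simulation shows the covariance term at work. A sketch assuming numpy; the correlated pair below is constructed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Independent case: variances add.
x = rng.normal(0, 1, n)
y = rng.normal(0, 2, n)
print(np.var(x + y))                   # ~ 1 + 4 = 5

# Dependent case: Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
z = x + 0.5 * rng.normal(0, 1, n)      # z is correlated with x
lhs = np.var(x + z)
rhs = np.var(x) + np.var(z) + 2 * np.cov(x, z)[0, 1]
print(lhs, rhs)                        # both ~4.25: the two sides agree
```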
The central limit theorem
The single most important theorem in probability. Plain version:
The sum (or average) of a large number of independent random variables, regardless of their individual distributions, is approximately normally distributed.
Formally: if $X_1, X_2, \ldots, X_n$ are independent and identically distributed (i.i.d.) random variables with mean $\mu$ and variance $\sigma^2$, then for large $n$:
$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \to N(0, 1)$$
in distribution, where $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ is the sample mean.
Why this matters. Most real-world quantities are sums or averages of many smaller random influences — measurement errors, financial returns, biological traits, manufacturing variations. The central limit theorem explains why the normal distribution appears everywhere: not because the universe loves it, but because sums of independent random things approach it.
The central limit theorem is also the foundation of statistical inference. Confidence intervals, hypothesis tests, and most of practical statistics rely on the asymptotic normality it guarantees.
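You can watch the theorem take hold numerically. The sketch below, assuming numpy, draws sample means of an exponential distribution (strongly skewed) and standardizes them as in the statement above; the fraction landing within one standard unit approaches the standard normal's ~0.683 as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 1.0                   # exponential with rate 1: mean 1, std 1

for n in (1, 5, 30, 200):
    # 100,000 sample means, each averaging n exponential draws
    means = rng.exponential(scale=1.0, size=(100_000, n)).mean(axis=1)
    z = (means - mu) / (sigma / np.sqrt(n))   # standardized as in the CLT
    # Fraction within one standard unit -> ~0.683 for a standard normal
    print(n, np.mean(np.abs(z) <= 1).round(3))
```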
Applications
Engineering reliability. The exponential distribution models the time-to-failure of components with constant failure rate. Compound systems (multiple components in series or parallel) have failure-time distributions computable from the components' distributions.
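As one concrete case, a series system fails at the first component failure, so its lifetime is the minimum of the component lifetimes; for exponential (constant-rate) components, that minimum is again exponential with the summed rate. A sketch assuming numpy, with made-up failure rates:

```python
import numpy as np

rng = np.random.default_rng(3)
rates = np.array([0.01, 0.02, 0.005])    # failures per hour, one per component

# Simulate 100,000 system lifetimes: one failure time per component.
times = rng.exponential(scale=1 / rates, size=(100_000, 3))
series_lifetime = times.min(axis=1)      # series system: first failure kills it

# Theory: min of exponentials is exponential with the summed rate.
print(series_lifetime.mean())            # ~ 1 / (0.01 + 0.02 + 0.005) ~ 28.6 hours
print(1 / rates.sum())
```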
Communication theory. Shannon's information theory uses probability to quantify information. Channel capacity, error rates, and compression are all probability problems.
Machine learning. Classification outputs are probability distributions. Bayesian methods model uncertainty in parameters and predictions. Generative models learn the probability distributions of data.
Finance. Asset returns are modeled as random variables. Portfolio variance is computed from the covariance matrix of returns. The Black-Scholes option pricing formula assumes normally distributed log returns (equivalently, lognormal prices).
Quality control. Statistical process control uses probability distributions of manufacturing measurements to detect process drift before quality problems become severe.
Frequently asked
- What is the difference between probability and statistics?
- Probability is the forward direction — given a model of randomness, predict outcomes. Statistics is the backward direction — given observed outcomes, infer the underlying model. They use the same mathematics but in opposite directions.
- What is the central limit theorem and why is it important?
- The CLT says the sum of many independent random variables approaches a normal distribution, regardless of their individual distributions. It's important because (1) it explains why normality is so common in nature, and (2) it's the foundation of statistical inference — confidence intervals, hypothesis tests, etc.
- What is the difference between independent and uncorrelated?
- Independent means the joint distribution factorizes: $P(X \leq x, Y \leq y) = P(X \leq x)\,P(Y \leq y)$ for all $x, y$. Uncorrelated means $E[XY] = E[X]\,E[Y]$, i.e., zero covariance. Independence implies uncorrelatedness, but not vice versa: uncorrelated variables can still have nonlinear dependence.
- Is Bayes' theorem just for medical diagnosis?
- No — Bayes is used everywhere uncertainty matters. Spam filtering (Naive Bayes), evidence in court, machine learning classifiers, scientific inference (Bayesian statistics), forecasting (Bayesian methods in weather). The medical example is famous because it dramatizes the base-rate fallacy.
- Why is the normal distribution so common?
- Because of the central limit theorem. Anything that is a sum of many independent influences will be approximately normally distributed. Heights, IQ scores, measurement errors, financial returns over short intervals — all approximately normal because of the CLT.
- What is a probability distribution function vs density function?
- For discrete variables, the probability mass function (PMF) gives $P(X = x)$. For continuous variables, the probability density function (PDF) gives the density $f(x)$ — the probability of falling in an interval is the integral of the PDF over that interval. The cumulative distribution function (CDF), $F(x) = P(X \leq x)$, is defined for both.