Module III·Article I·~5 min read

Reliability Functions and Failure Analysis

Reliability Theory of Systems

Modern technical systems such as airplanes, nuclear reactors, and medical equipment can contain millions of components. Failure of a single one can cost lives and billions of dollars. Reliability theory studies the probability of failure-free operation and offers quantitative methods for evaluating and designing systems with a specified level of safety. It is applied in aviation (DO-254), nuclear energy (IAEA SSG-3), the automotive industry (ISO 26262), and medicine (IEC 60601). Without it, civil aviation (target catastrophe probability of 10⁻⁹ per flight hour) and nuclear power would be impossible.

Basic Reliability Functions

For a component with time to failure $T$ (a random variable):

Reliability function: $R(t) = P(T > t)$ — the probability of failure-free operation up to moment $t$. Analogous to the survival function $S(x)$ in actuarial mathematics. $R(0) = 1$, $R(\infty) = 0$.

Failure distribution function: $F(t) = 1 − R(t) = P(T \leq t)$.

Failure density: $f(t) = F'(t) = −R'(t)$.

Hazard rate (failure rate): $h(t) = \dfrac{f(t)}{R(t)} = -\dfrac{d \ln R(t)}{dt}$.

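Interpretation of $h(t)$: the instantaneous failure rate at time $t$, given that the component has survived up to $t$. For small $\Delta t$, $h(t)\,\Delta t$ approximates the conditional probability of failure in $(t, t + \Delta t]$. Analogous to the force of mortality $\mu(x)$ in survival theory. Inverse relationship: $R(t) = \exp\left(-\int_0^t h(s)\, ds\right)$.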
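The relationships between $R(t)$, $f(t)$, and $h(t)$ can be checked numerically. The sketch below is a minimal illustration, assuming a hypothetical linearly increasing hazard $h(t) = a + bt$; the constants `A` and `B` and the helper names are invented for the example and do not come from any standard.

```python
import math

def reliability_from_hazard(h, t, steps=10_000):
    """R(t) = exp(-integral_0^t h(s) ds), with the integral by the trapezoidal rule."""
    if t == 0:
        return 1.0
    ds = t / steps
    integral = 0.5 * (h(0.0) + h(t)) * ds
    integral += sum(h(i * ds) for i in range(1, steps)) * ds
    return math.exp(-integral)

# Hypothetical hazard parameters, chosen only for illustration.
A, B = 0.01, 0.002

def hazard(t):
    return A + B * t

def reliability_closed_form(t):
    # The integral of A + B*s from 0 to t is A*t + B*t**2 / 2.
    return math.exp(-(A * t + 0.5 * B * t * t))

t = 10.0
r_numeric = reliability_from_hazard(hazard, t)
r_exact = reliability_closed_form(t)

# Recover h(t) = f(t) / R(t), with f = -R' estimated by a central difference.
eps = 1e-4
density = -(reliability_closed_form(t + eps) - reliability_closed_form(t - eps)) / (2 * eps)
h_recovered = density / r_exact

print(f"R({t}) numeric = {r_numeric:.6f}, closed form = {r_exact:.6f}")
print(f"h({t}) recovered = {h_recovered:.6f}, defined = {hazard(t):.6f}")
```

Both routes should agree: integrating the hazard reproduces $R(t)$, and dividing the density by $R(t)$ returns the hazard that was put in.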

Bathtub Curve

Real technical components often have $h(t)$ with three characteristic periods:

  1. Infant mortality: $h(t)$ decreases. Hidden manufacturing defects manifest and are eliminated. Lasts days to months. Solution: "burn-in test"—the manufacturer tests components, rejecting defective ones.

  2. Useful life: $h(t) \approx$ const $= \lambda$. Random failures (lightning, power surges, human factor). Lasts years.

  3. Wear-out: $h(t)$ increases. Physical wear—metal fatigue, corrosion, electrolyte leakage. Solution: preventive replacement.
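One way to picture the three periods is to add a decaying term, a constant term, and a growing term, since the hazards of independent competing failure mechanisms add. The sketch below assumes an exponential decay for early failures and a cubic growth for wear-out. Both functional forms and all parameter values are hypothetical, picked only to produce a bathtub shape.

```python
import math

def bathtub_hazard(t, infant_rate=0.02, infant_decay=50.0,
                   base_rate=0.001, wear_coeff=1e-15, wear_power=3):
    """Illustrative hazard (per hour) as the sum of three failure mechanisms."""
    infant = infant_rate * math.exp(-t / infant_decay)  # decreasing: early defects
    wear_out = wear_coeff * t ** wear_power             # increasing: physical wear
    return infant + base_rate + wear_out                # base_rate: random failures

for hours in (0, 100, 1_000, 2_000, 50_000):
    print(f"t = {hours:>6} h  h(t) = {bathtub_hazard(hours):.6f} per hour")
```

With these made-up numbers the hazard falls during the first few hundred hours, stays close to `base_rate` for thousands of hours, and then climbs as the wear term takes over.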

Parametric Models

1. Exponential: $R(t) = e^{−\lambda t}$, $h(t) = \lambda = \text{const}$.
Corresponds to "useful life." Memoryless: $P(T > s + t\mid T > s) = P(T > t)$.
MTBF (Mean Time Between Failures): $E[T] = 1/\lambda$.
Example: processor with $\lambda = 10^{-5}$/hour $\to$ MTBF $= 100{,}000$ hours $\approx 11$ years.

2. Weibull: $R(t) = \exp\left(-\left(\dfrac{t}{\eta}\right)^\beta\right)$, $h(t) = \dfrac{\beta}{\eta}\left(\dfrac{t}{\eta}\right)^{\beta-1}$.

  • $\beta < 1$: $h$ decreases (infant mortality).
  • $\beta = 1$: exponential (useful life).
  • $\beta > 1$: $h$ increases (wear-out).
  • $\beta = 2$: Rayleigh distribution—linear growth of $h$.

$\eta$ — scale (characteristic time), $\beta$ — shape. Flexibility makes Weibull the reliability standard.

3. Gompertz: $h(t) = Bc^t$, so $R(t) = \exp\left(-\dfrac{B}{\ln c}(c^t-1)\right)$, because $\int_0^t Bc^s\, ds = \dfrac{B}{\ln c}(c^t - 1)$. Analogous to the actuarial model; used for biological systems and humans.

4. Lognormal: $\ln T \sim N(\mu, \sigma^2)$. Used for failures due to metal fatigue (Bazovsky’s law).
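The effect of the Weibull shape parameter is easy to see by evaluating the formulas directly. This is a minimal sketch using only the $R(t)$ and $h(t)$ expressions given above; the scale `eta` is an assumed value set to $1/\lambda$ from the processor example, and the function names are illustrative.

```python
import math

def weibull_reliability(t, eta, beta):
    """R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))

def weibull_hazard(t, eta, beta):
    """h(t) = (beta/eta) * (t/eta)^(beta-1); use t > 0 when beta < 1."""
    return (beta / eta) * (t / eta) ** (beta - 1)

eta = 100_000.0  # hours; assumed scale, equal to 1/lambda in the processor example
times = (1_000.0, 10_000.0, 100_000.0)

for beta in (0.5, 1.0, 2.0):
    rates = ", ".join(f"{weibull_hazard(t, eta, beta):.2e}" for t in times)
    print(f"beta = {beta}: h(t) at {times} = {rates}")

# Memoryless check for beta = 1: P(T > s + t | T > s) equals P(T > t).
s, t = 20_000.0, 5_000.0
conditional = weibull_reliability(s + t, eta, 1.0) / weibull_reliability(s, eta, 1.0)
print(conditional, weibull_reliability(t, eta, 1.0))
```

For $\beta = 0.5$ the printed rates fall with time, for $\beta = 1$ they stay at $1/\eta$, and for $\beta = 2$ they grow in proportion to $t$.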

Structural System Reliability

Complex systems consist of components. The structure of their connection determines system reliability $R_s$.

Series system. All components must work (failure of any $\to$ system failure):
$R_s = \prod_i R_i$.

Example: chain of 5 components with $R_i = 0.99$: $R_s = 0.99^5 = 0.951$. Each extra link reduces reliability.

Parallel system. Sufficient for at least one to work:
$R_s = 1 − \prod_i(1 − R_i)$.

Example: 3 parallel components with $R = 0.9$: $R_s = 1 − 0.1^3 = 0.999$. Redundancy is the main method to increase reliability.

k-out-of-n. System works if at least $k$ out of $n$ identical components work:
$R_s = \sum_{j=k}^n C(n, j) \cdot R^j \cdot (1 − R)^{n−j}$.

Example: 2-of-3 with $R = 0.95$: $R_s = 3 \cdot 0.95^2 \cdot 0.05 + 0.95^3 = 0.135 + 0.857 = 0.993$.
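The three structural formulas translate directly into code. The sketch below assumes independent component failures, as the formulas themselves do; the function names are illustrative.

```python
from math import comb, prod

def series(reliabilities):
    """All components must work: product of the R_i."""
    return prod(reliabilities)

def parallel(reliabilities):
    """At least one component must work: 1 - product of (1 - R_i)."""
    return 1 - prod(1 - r for r in reliabilities)

def k_out_of_n(k, n, r):
    """At least k of n identical components (each with reliability r) must work."""
    return sum(comb(n, j) * r**j * (1 - r) ** (n - j) for j in range(k, n + 1))

print(f"series, 5 x 0.99:   {series([0.99] * 5):.3f}")     # 0.951 in the text
print(f"parallel, 3 x 0.9:  {parallel([0.9] * 3):.3f}")    # 0.999 in the text
print(f"2-of-3, R = 0.95:   {k_out_of_n(2, 3, 0.95):.3f}") # 0.993 in the text
```

The series and parallel cases are the two extremes of k-out-of-n: `k_out_of_n(n, n, r)` matches a series of $n$ identical components, and `k_out_of_n(1, n, r)` matches a parallel group.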

Fault Tree Analysis (FTA) and Reliability Block Diagrams (RBD)

FTA — fault tree analysis. Deductive method: top (undesirable) event $\to$ intermediate causes $\to$ basic events (component failures).

Logic gates:

  • AND: top event requires all inputs (product of probabilities).
  • OR: at least one input is sufficient.

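Minimal cut sets. A minimal cut set is a smallest combination of basic events whose joint occurrence causes the top event. For minimal cut sets $S_1, \dots, S_n$ and basic-event failure probabilities $q_i$: $P(\text{top}) \approx \sum_{j=1}^{n} \prod_{i \in S_j} q_i$ (the rare-event approximation, valid for small $q_i$).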
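To see how good the cut-set approximation is, it can be compared with the exact top-event probability found by enumerating every combination of basic-event states. The sketch below assumes independent basic events; the event names, probabilities, and cut sets are hypothetical and exist only for this illustration.

```python
from itertools import product
from math import prod

# Hypothetical basic-event failure probabilities and minimal cut sets.
q = {"pump_a": 0.01, "pump_b": 0.02, "valve": 0.001, "power": 0.005}
cut_sets = [{"pump_a", "pump_b"}, {"valve"}, {"power", "pump_a"}]

def rare_event_approximation(cut_sets, q):
    """Sum over cut sets of the product of their event probabilities."""
    return sum(prod(q[e] for e in cs) for cs in cut_sets)

def exact_top_probability(cut_sets, q):
    """Enumerate all 2^n failure states; add those where some cut set has fully failed."""
    events = sorted(q)
    total = 0.0
    for state in product([False, True], repeat=len(events)):
        failed = {e for e, is_failed in zip(events, state) if is_failed}
        if any(cs <= failed for cs in cut_sets):
            total += prod(q[e] if e in failed else 1 - q[e] for e in events)
    return total

approx = rare_event_approximation(cut_sets, q)
exact = exact_top_probability(cut_sets, q)
print(f"approximation = {approx:.6e}, exact = {exact:.6e}")
```

The approximation is an upper bound because it counts states where several cut sets fail at once more than once. Enumeration grows as $2^n$, so it is only practical as a check on small trees.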

RBD — reliability block diagrams. Graphical representation of system structure in terms of series/parallel connected blocks. Equivalent to FTA, but more convenient for calculating $R_s$.

Numerical Example

System: A and B in series, C, D, E in parallel, then (AB) and (CDE) in parallel.
$R_A = 0.98$, $R_B = 0.95$, $R_C = R_D = R_E = 0.90$.

$R_{AB} = 0.98 \times 0.95 = 0.931$.
$R_{CDE} = 1 - (1 - 0.9)^3 = 1 - 0.001 = 0.999$.
$R_\text{system} = 1 - (1 - 0.931)\times (1 - 0.999) = 1 - 0.069 \times 0.001 = 0.99993$.

Importance analysis (Birnbaum importance): $I_B(i) = R_s(R_i = 1) - R_s(R_i = 0)$, the difference in system reliability between component $i$ being perfect and being failed.
For component A: $R_s(R_A = 1) = 1 - (1 - 0.95)\times 0.001 = 0.99995$;
$R_s(R_A = 0) = 1 - 1 \times 0.001 = 0.999$.
$I_B(A) = 0.00095 \approx 0.001$: low importance (A is backed up by the parallel CDE branch).
For B, by the same calculation: $I_B(B) = 0.98 \times 0.001 = 0.00098$.
For C: $I_B(C) = 0.069 \times 0.01 = 0.00069$, and the same holds for D and E.
The most important are A and B (the series link determines operation of the AB branch, though the system still functions via CDE).
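The hand calculation can be reproduced for all five components at once. This is a minimal sketch of the example system only (A and B in series, in parallel with the parallel group C, D, E), assuming independent components; the function names are illustrative.

```python
def system_reliability(r):
    """R_system for (A series B) in parallel with (C parallel D parallel E)."""
    r_ab = r["A"] * r["B"]
    q_cde = (1 - r["C"]) * (1 - r["D"]) * (1 - r["E"])  # all three fail
    return 1 - (1 - r_ab) * q_cde

def birnbaum_importance(r, name):
    """R_s with the component perfect minus R_s with the component failed."""
    perfect = system_reliability({**r, name: 1.0})
    failed = system_reliability({**r, name: 0.0})
    return perfect - failed

r = {"A": 0.98, "B": 0.95, "C": 0.90, "D": 0.90, "E": 0.90}

print(f"R_system = {system_reliability(r):.6f}")
for name in r:
    print(f"I_B({name}) = {birnbaum_importance(r, name):.5f}")
```

Because $R_s$ is linear in each $R_i$, the Birnbaum importance equals the partial derivative $\partial R_s / \partial R_i$, so it also measures how much a small improvement in one component raises system reliability.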

Real-World Applications

  • Aviation. Boeing 787, Airbus A350: each critical system (flight control, hydraulics, avionics) has triple redundancy. Calculated probability of FCS failure $< 10^{-9}$/hour.
  • Nuclear energy. WANO benchmark: probability of core meltdown $< 10^{-5}$/reactor-year. After Fukushima: revised standards, additional safety systems.
  • Automotive industry. ISO 26262 ASIL D (for airbag, ABS): requirement $R(10\ \text{years}) > 1 - 10^{-9}$. Verified by FMEDA + FTA.
  • Medicine. Pacemakers (Medtronic, St. Jude): MTBF $> 8$ years, calculated MTTF $> 12$ years. Tested by accelerated life testing.
  • IT infrastructure. AWS, Azure SLA 99.99% ($\approx 53$ min downtime/year). Achieved by redundancy of data centers (multi-AZ deployment).
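The SLA figure is simple arithmetic, and the same parallel-system formula shows why redundancy helps. The sketch below assumes a 365-day year and independent failures between zones; the per-zone availability of 99% is a hypothetical value, not a published figure for any provider.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored

def annual_downtime_minutes(availability):
    """Expected downtime per year for a given availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR

def parallel_availability(availability, replicas):
    """Availability when any one of several independent replicas suffices."""
    return 1 - (1 - availability) ** replicas

print(f"99.99% availability: {annual_downtime_minutes(0.9999):.1f} min/year")

# Hypothetical: two independent zones, each available 99% of the time.
combined = parallel_availability(0.99, 2)
print(f"two zones at 99% each: {combined:.4%} combined")
```

In practice zones share dependencies such as networking or control planes, so the independence assumption makes this an optimistic estimate.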

Assignment. System: 5 components in the same configuration as the numerical example: A and B in series, and that branch in parallel with C, D, E (which are also parallel among themselves). $R_A = 0.98$, $R_B = 0.95$, $R_C = R_D = R_E = 0.90$. (a) Build the RBD. (b) Calculate $R_\text{system}$. (c) Find the Birnbaum importance of each component. (d) Which component is most critical for improving $R_\text{system}$? (e) If the budget allows improving one component to $R = 0.99$, which would you choose to maximize $R_\text{system}$?
