Software Reliability and System Safety
Reliability Theory of Systems
Software is a critical component of modern systems: a Boeing 787 runs roughly 10 million lines of code, a Tesla Model S about 100 million, and the F-35 about 25 million. Software reliability is fundamentally different from hardware reliability: errors are deterministic (the same inputs yield the same error) rather than random, and software does not physically wear out. System safety focuses on preventing catastrophic consequences regardless of the failure type. Standards such as IEC 61508 (general), DO-178C (aviation), ISO 26262 (automotive), and IEC 62304 (medical devices) define processes for developing safety-critical software.
Features of Software Reliability
Principal differences from hardware:
- Failures are deterministic: the same inputs produce the same error. The apparent randomness of software failures comes from the randomness of inputs and operating states.
- No physical wear: software does not age in the hardware sense.
- No "bathtub curve": after commissioning, the failure intensity decreases over time as defects are fixed.
- Defects are introduced during development: bugs in the code, design flaws in the architecture.
Mean Time To Failure (MTTF) for software makes sense only with a stable usage profile. If input patterns change, MTTF changes.
Software Reliability Models
NHPP models (Non-Homogeneous Poisson Process). The cumulative number of detected defects N(t) is an NHPP with time-varying intensity λ(t); as defects are detected and fixed, λ(t) decreases. Four classic models (compared in the code sketch after the list):
1. Jelinski-Moranda (1972). Initially there are N₀ defects. The failure intensity at time t is λ(t) = φ·(N₀ − n(t)), where n(t) is the number of defects detected by time t; each fixed bug decreases λ by φ. A classic model, but it overestimates the number of remaining defects.
2. Goel-Okumoto (1979). N(t) ~ Poisson(a·(1 − e^{−b·t})). Parameters: a is the total number of defects, b the detection rate. Example: a = 200, b = 0.05 → in the first 50 days, 1 − e^{−2.5} ≈ 92% of defects are found.
3. Musa-Okumoto (1984). μ(t) = (1/θ)·ln(λ₀·θ·t + 1). A logarithmic model with slowly decreasing intensity.
4. Yamada S-shaped (1984). μ(t) = a·(1 − (1 + b·t)·e^{−b·t}). An S-shaped defect-detection curve: a slow start (testers are still learning the system), then rapid detection, then a plateau.
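The following minimal sketch compares the mean-value functions μ(t) of models 2–4 (Jelinski-Moranda is stated through its discrete intensity, so it is left out). All parameter values are assumptions chosen only for illustration.

```python
import math

# Mean-value functions mu(t): expected number of defects detected by time t.
# All parameter values below are illustrative assumptions, not recommendations.

def goel_okumoto(t, a=200.0, b=0.05):
    """Goel-Okumoto: a = total defects, b = detection rate per day."""
    return a * (1.0 - math.exp(-b * t))

def musa_okumoto(t, lam0=5.0, theta=0.02):
    """Musa-Okumoto logarithmic model: slowly decreasing intensity."""
    return math.log(lam0 * theta * t + 1.0) / theta

def yamada_s_shaped(t, a=200.0, b=0.05):
    """Yamada delayed S-shaped model: slow start, ramp-up, plateau."""
    return a * (1.0 - (1.0 + b * t) * math.exp(-b * t))

for t in (10, 50, 100, 200):
    print(f"t={t:>3}d  GO={goel_okumoto(t):6.1f}  "
          f"MO={musa_okumoto(t):6.1f}  Yamada={yamada_s_shaped(t):6.1f}")
```

Note that goel_okumoto(50) ≈ 183.6, i.e. the ~92% of a = 200 from item 2.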
Model calibration: by maximum likelihood estimation on testing data (number of defects in each interval).
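A sketch of such a fit for the Goel-Okumoto model. The per-interval defect counts are hypothetical data invented for illustration; the objective is the standard NHPP Poisson interval likelihood.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical defect counts per 10-day test interval (assumed data).
t_edges = np.arange(0, 110, 10)                        # interval boundaries, days
counts = np.array([18, 14, 12, 9, 7, 6, 5, 4, 3, 2])   # sums to 80 defects

def neg_log_likelihood(params):
    a, b = params
    mu = a * (1.0 - np.exp(-b * t_edges))   # mu(t) at the interval boundaries
    d_mu = np.diff(mu)                      # expected defects per interval
    # Poisson log-likelihood up to an additive constant: sum(n*log(dmu) - dmu)
    return np.sum(d_mu - counts * np.log(d_mu))

res = minimize(neg_log_likelihood, x0=[100.0, 0.05],
               bounds=[(1.0, None), (1e-4, None)], method="L-BFGS-B")
a_hat, b_hat = res.x
print(f"a = {a_hat:.1f} defects total, b = {b_hat:.4f}/day, "
      f"remaining ~ {a_hat - counts.sum():.1f}")
```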
Numerical Example
The team tested the software for 100 days and detected 80 defects. A Goel-Okumoto model was fitted to the data: a = 95, b = 0.04. Interpretation: about 95 defects in total; by day 100 the model predicts 95·(1 − e^{−4}) = 95·0.9817 ≈ 93.3 detected (noticeably above the actual 80, so the fit needs refinement).
Forecast for the next 30 days: μ(130) − μ(100) = 95·(e^{−4} − e^{−5.2}) = 95·(0.0183 − 0.0055) ≈ 1.2, so we expect to find about one more defect.
Release decision: release if the defect intensity λ(t) ≤ 0.01/day. Currently λ(100) = a·b·e^{−b·100} = 95·0.04·e^{−4} ≈ 0.07/day, so it is too early to release.
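A few lines of Python reproduce the arithmetic above and make the release criterion easy to re-run as testing continues:

```python
import math

a, b = 95.0, 0.04                                   # fitted parameters

def mu(t):  return a * (1.0 - math.exp(-b * t))     # expected defects by day t
def lam(t): return a * b * math.exp(-b * t)         # defect intensity, per day

print(f"mu(100)           = {mu(100):.1f}")             # ~93.3
print(f"forecast 100..130 = {mu(130) - mu(100):.2f}")   # ~1.2 defects
print(f"lambda(100)       = {lam(100):.3f}/day")        # ~0.070
print("release" if lam(100) <= 0.01 else "too early to release")
```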
Functional Safety (IEC 61508)
The umbrella safety standard for electrical/electronic/programmable electronic systems. It defines four Safety Integrity Levels (SIL):
| SIL | PFD (low demand) | PFH (high demand) | Required risk reduction |
|---|---|---|---|
| 1 | 10⁻² – 10⁻¹ | 10⁻⁶ – 10⁻⁵ | 10–100× |
| 2 | 10⁻³ – 10⁻² | 10⁻⁷ – 10⁻⁶ | 100–1000× |
| 3 | 10⁻⁴ – 10⁻³ | 10⁻⁸ – 10⁻⁷ | 1000–10000× |
| 4 | 10⁻⁵ – 10⁻⁴ | 10⁻⁹ – 10⁻⁸ | 10000–100000× |
PFD = Probability of Failure on Demand (for rarely actuated protections). PFH = Probability of dangerous Failure per Hour (for continuously operating protections).
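A small helper maps a low-demand PFD onto the table above; a minimal sketch, using the convention that a band's upper edge belongs to the next-lower SIL.

```python
def sil_from_pfd(pfd):
    """Low-demand SIL band per the IEC 61508 table (None: worse than SIL 1)."""
    bands = [(4, 1e-5, 1e-4), (3, 1e-4, 1e-3),
             (2, 1e-3, 1e-2), (1, 1e-2, 1e-1)]
    for sil, lo, hi in bands:
        if lo <= pfd < hi:
            return sil
    return None

print(sil_from_pfd(5e-3))    # 2
print(sil_from_pfd(2.5e-5))  # 4
```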
Architectures with Redundancy
Notation KooN: “K out of N must work”.
1oo1. One channel. PFD ≈ λ_DU·T_test/2 (DU = dangerous undetected, T_test — proof-test interval).
1oo2. Two channels; the protection trips if at least one signals. Reduces the risk of failure to act, but increases spurious trips. PFD ≈ (λ_DU·T_test)²/3.
2oo2. Two channels; the protection trips only if both signal. Reduces spurious trips, but increases PFD. Rarely used on its own in safety-critical systems.
2oo3 (TMR, Triple Modular Redundancy). Three channels; the protection trips by majority vote. Balances both risks. Used in the Boeing 777 fly-by-wire and the NASA Space Shuttle.
Common Cause Failure (CCF). All channels can fail simultaneously (the same software error, a common power supply, temperature). Accounted for with the β-factor: PFD_total = (1 − β)²·PFD_independent + β·PFD_single, where β is typically 1–10%.
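A sketch of the simplified PFD formulas above with the β-factor folded in. The 1oo1 and 1oo2 terms follow the text; the 2oo3 independent term (λ_DU·T)² is an assumed order-of-magnitude approximation, not a quote from the standard.

```python
def pfd_1oo1(lam_du, t_test):
    """Single channel: lambda_DU * T_test / 2."""
    return lam_du * t_test / 2.0

def pfd_1oo2(lam_du, t_test, beta=0.0):
    """Two channels, trip on either; beta-factor CCF added per the text."""
    independent = (lam_du * t_test) ** 2 / 3.0
    return (1 - beta) ** 2 * independent + beta * pfd_1oo1(lam_du, t_test)

def pfd_2oo3(lam_du, t_test, beta=0.0):
    """Majority voting; independent term (lambda_DU*T)^2 is an assumption."""
    independent = (lam_du * t_test) ** 2
    return (1 - beta) ** 2 * independent + beta * pfd_1oo1(lam_du, t_test)

lam_du, T = 1.14e-6, 8760.0   # failures/hour; 1-year proof-test interval
print(f"1oo1:           {pfd_1oo1(lam_du, T):.2e}")
print(f"1oo2, beta=0:   {pfd_1oo2(lam_du, T):.2e}")
print(f"1oo2, beta=0.1: {pfd_1oo2(lam_du, T, 0.1):.2e}")
```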
HAZOP (Hazard and Operability Study)
A systematic qualitative analysis of deviations from design intent. Guide words (NO/NONE, MORE, LESS, REVERSE, OTHER THAN) are applied to process parameters (flow, temperature, pressure, concentration). For each combination the team asks:
- Is a cause possible?
- What are the consequences?
- Is there protection?
- What is recommended?
A standard technique in the petrochemical, pharmaceutical, and nuclear industries.
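The combinatorial core of a HAZOP worksheet is easy to sketch; the guide words and parameters come from the text, the worksheet columns from the checklist above.

```python
from itertools import product

guide_words = ["NO", "MORE", "LESS", "REVERSE", "OTHER THAN"]
parameters = ["flow", "temperature", "pressure", "concentration"]

# Each (guide word, parameter) pair names one deviation for the team to review:
# possible cause, consequences, existing protection, recommendations.
for word, param in product(guide_words, parameters):
    print(f"{word} {param}: cause? consequences? protection? recommendation?")
```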
Numerical Example: SIL for a Protection System
Reactor overheating protection: 2 temperature sensors + 1oo2 logic (triggers on signal from at least one). λ_DU of each sensor = 0.01/year = 1.14·10⁻⁶/hour. Proof-test interval T = 1 year = 8760 hours.
PFD of one channel = λ_DU·T/2 = 1.14·10⁻⁶·8760/2 = 5·10⁻³. 1oo2 without CCF: PFD = PFD_A·PFD_B = (5·10⁻³)² = 2.5·10⁻⁵, which lies in the SIL4 band of the table above. With CCF (β = 0.1): PFD = (1 − 0.1)²·2.5·10⁻⁵ + 0.1·5·10⁻³ = 2.0·10⁻⁵ + 5·10⁻⁴ = 5.2·10⁻⁴, which drops to the SIL3 band: a full level lost to CCF.
Lesson: CCF dominates. Reducing β (via diversity, different manufacturers) is critical.
To hold SIL3 with margin, or to reach SIL4, under CCF conditions β itself must be reduced: for example, a 2oo3 architecture with heterogeneous sensors (see the sketch below).
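A sketch reproducing the calculation above and sweeping β shows how quickly the common-cause term dominates (SIL banding as in the earlier helper):

```python
def sil_from_pfd(pfd):
    """Low-demand SIL band per the IEC 61508 table (None: worse than SIL 1)."""
    for sil, lo, hi in [(4, 1e-5, 1e-4), (3, 1e-4, 1e-3),
                        (2, 1e-3, 1e-2), (1, 1e-2, 1e-1)]:
        if lo <= pfd < hi:
            return sil
    return None

pfd_ch = 5e-3                        # per-channel PFD from the example
for beta in (0.0, 0.05, 0.10):
    pfd = (1 - beta) ** 2 * pfd_ch ** 2 + beta * pfd_ch   # 1oo2 with CCF
    print(f"beta={beta:.2f}: PFD={pfd:.2e} -> SIL {sil_from_pfd(pfd)}")
```

Even β = 0.05 leaves the CCF term an order of magnitude above the independent term, which is why diversity, not more identical channels, is the effective lever.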
Real-World Applications
- Boeing 737 MAX MCAS crashes (2018–2019). A 1oo1 architecture (a single angle-of-attack sensor) for a critical function violated IEC 61508 principles. After the redesign: a 2oo2 comparison with segregation (separate power supply and software).
- Tesla Autopilot. ASIL D components (brakes, steering) use TMR. Camera + radar + ultrasonic sensor fusion reduces CCF.
- Nuclear power plants (VVER, EPR). Four-channel reactor protection systems. Diversity: different operating principles (mechanical + electronic).
- Therac-25 (1985–1987). A medical linear accelerator for radiation therapy. A race condition in the software caused massive radiation overdoses: six accidents, at least three deaths. Lesson: software-only safety with no hardware interlock is unacceptable.
- Ariane 5 Flight 501 (1996). A 64-bit floating-point value converted to a 16-bit signed integer overflowed in the navigation software, destroying a $370M rocket. Ariane 4 code had been reused without re-validation.
Assignment. Reactor protection system: two independent temperature sensors + 1oo2 logic. PFD of each sensor 0.01 (with proof-test).
(a) Compute the system PFD without considering CCF. What SIL does this correspond to?
(b) With β = 0.1 (CCF), recalculate the PFD. Does the SIL change?
(c) What architecture (1oo2, 2oo3) should be used to guarantee SIL3 at β = 0.1?
(d) If one sensor costs €5000 and CCF can be reduced to β = 0.05 by using sensors from two different manufacturers, is it worth investing in diversity?