Maintenance Strategies — Risk Theory & Actuarial Math

Any equipment requires maintenance, but too much maintenance wastes money and time, while too little leads to catastrophic failures. The optimal maintenance strategy balances the cost of preventive work against the cost of failures, and mathematical models let us choose the maintenance intervals and types that achieve this balance. This is a key practical task for airlines (maintenance runs to thousands of dollars per flight hour), the oil and gas industry (a turbine overhaul means weeks of downtime), and manufacturing companies (line downtime can cost millions in lost revenue per day).

Types of Maintenance Strategies

1. Reactive (Corrective Maintenance, CM).
"If it breaks, fix it." Requires minimal planning but carries the highest downtime cost. Suitable for cheap components (light bulbs) whose failure has low consequences.

2. Preventive Maintenance (PM).
Maintenance at a fixed interval (for example, an oil change every 10,000 km) regardless of condition. Reduces the probability of failure, but leads to "excess" maintenance of components that could still have operated. This was the standard approach in the automotive and aviation industries of the 20th century.

3. Predictive / Condition-Based Maintenance (CBM).
Maintenance is performed when a component reaches a threshold level of degradation measured by sensors. Requires monitoring (IoT, vibration diagnostics, oil analysis, thermography). Often the most economical strategy, but it requires investment in sensors and analytics. The standard approach of the 21st century (Industry 4.0).

4. Reliability-Centered Maintenance (RCM).
A systematic approach (MSG-3 standard in aviation, SAE JA1011): the best strategy is chosen separately for each functional failure. Typical questions are "is the failure hidden, so that it can develop unnoticed?", "are there protective systems?", and "what are the consequences?"; the maintenance tactic is chosen based on the answers.

Optimal Preventive Maintenance Interval

Model. The time to failure $T$ has distribution function $F(t)$, density $f(t)$, and reliability (survival) function $R(t) = 1 - F(t)$. A failure costs $C_f$. Preventive maintenance at age $T_p$ (if the component has not yet failed) costs $C_p$. Usually $C_f > C_p$ (a failure is more expensive than a planned replacement).

Expected cost per unit time (renewal reward theory):
$ \text{ECPU}(T_p) = \frac{\mathbf{E}[\text{cycle cost}]}{\mathbf{E}[\text{cycle duration}]} = \frac{C_p \cdot R(T_p) + C_f \cdot F(T_p)}{\int_0^{T_p} R(t) dt}. $

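Numerator: expected cost of one cycle (planned cost $C_p$ if the component survives to $T_p$; failure cost $C_f$ if it fails earlier).
Denominator: expected duration of a cycle. A cycle lasts until $\min(T, T_p)$, and because $\min(T, T_p) > t$ exactly when $T > t$ and $t < T_p$, we get $\mathbf{E}[\min(T, T_p)] = \int_0^{T_p} R(t)\,dt$.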

Optimal $T^*$: the value of $T_p$ that minimizes $\text{ECPU}(T_p)$. It is usually found numerically, either by direct minimization or by solving $d\,\text{ECPU}/dT_p = 0$.
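The stationarity condition can be written out explicitly. The following is a sketch of the standard derivation; the hazard rate $h(t) = f(t)/R(t)$ is a symbol introduced here for convenience and is not used elsewhere in this text. Writing the numerator as $C_p + (C_f - C_p) F(T_p)$ and differentiating the ratio gives

$ \frac{d\,\text{ECPU}}{dT_p} = 0 \iff h(T_p) \int_0^{T_p} R(t)\,dt - F(T_p) = \frac{C_p}{C_f - C_p}. $

The derivative of the left side is $h'(T_p) \int_0^{T_p} R(t)\,dt$, so for an increasing hazard rate (aging) the left side is increasing and the solution, when it exists, is unique. At the optimum the cost rate simplifies to $\text{ECPU}(T^*) = (C_f - C_p)\, h(T^*)$. For a constant hazard rate (exponential lifetime) the left side is identically zero, so no finite $T_p$ satisfies the condition.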

Numerical Example: Pump

Failure distribution: Weibull($\eta = 1000$ h, $\beta = 2.5$) (pronounced aging). $C_p = 500$ units (planned replacement), $C_f = 5000$ units (failure during operation).

$R(t) = \exp\left(- (t/1000)^{2.5}\right)$.

Numerically (Python):

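$T_p$ (h)	$R(T_p)$	$F(T_p)$	$\int_0^{T_p} R(t)\,dt$ (h)	ECPU (units/h)
200	0.982	0.018	199	2.91
400	0.904	0.096	389	2.40
600	0.757	0.243	556	2.87
800	0.564	0.436	688	3.58

Optimum $T^* \approx 355$ hours with $\text{ECPU} \approx 2.38$ units/h (between the tabulated 200 h and 400 h rows, slightly below the 400 h value).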
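A minimal sketch for evaluating $\text{ECPU}(T_p)$ for this pump is given below. It uses only the Python standard library: the integral of $R(t)$ is approximated with composite Simpson's rule and the optimum is located by a coarse grid search. The function names, the Simpson panel count, and the 5-hour grid step are illustrative assumptions; in practice `scipy.integrate.quad` and `scipy.optimize.minimize_scalar` would be the usual tools.

```python
import math

ETA, BETA = 1000.0, 2.5     # Weibull scale (hours) and shape from the example
C_P, C_F = 500.0, 5000.0    # planned and failure replacement costs


def reliability(t):
    """Weibull survival function R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / ETA) ** BETA))


def mean_cycle_length(t_p, n=200):
    """Composite Simpson approximation of the integral of R over [0, t_p] (n even)."""
    h = t_p / n
    total = reliability(0.0) + reliability(t_p)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * reliability(i * h)
    return total * h / 3.0


def ecpu(t_p):
    """Expected cost per unit time for age replacement at age t_p."""
    r = reliability(t_p)
    return (C_P * r + C_F * (1.0 - r)) / mean_cycle_length(t_p)


def argmin_on_grid(lo, hi, step=5.0):
    """Brute-force search for the grid point with the lowest ECPU."""
    grid = [lo + k * step for k in range(int((hi - lo) / step) + 1)]
    return min(grid, key=ecpu)


if __name__ == "__main__":
    for t_p in (200, 400, 600, 800):
        print(t_p, round(reliability(t_p), 3),
              round(mean_cycle_length(t_p)), round(ecpu(t_p), 2))
    t_star = argmin_on_grid(50.0, 1500.0)
    print("grid optimum:", t_star, "h, ECPU:", round(ecpu(t_star), 3))
```

A grid search is crude but hard to get wrong; because the cost curve is very flat near its minimum, a 5-hour step costs almost nothing in accuracy.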

Comparison with strategies:

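Without PM: $\text{ECPU} = C_f / \mathbf{E}[T] = 5000 / 887 \approx 5.64$ units/h, where $\mathbf{E}[T] = \eta\,\Gamma(1 + 1/\beta) \approx 887$ h is the Weibull mean (more than twice the optimal cost rate).
Too frequent PM ($T_p = 100$): $\text{ECPU} \approx 5.15$ units/h (also much worse, because of "excess" replacements).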

Sensitivity: If $C_f = 10{,}000$ (twice as high), the optimal $T^*$ drops to about 260 hours. If $\beta = 1$ (exponential distribution, no aging), PM does not help and the optimum is $T^* \to \infty$.
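Sensitivity checks of this kind are easy to script once the cost rate is parameterized. The sketch below is one possible way to do it with the standard library only: `ecpu` takes the Weibull and cost parameters as arguments, and `optimal_interval` runs a golden-section search, which is valid here because the cost rate has a single minimum when $\beta > 1$ and decreases monotonically when $\beta = 1$. The function names, the search bounds (10 to 5000 h), and the tolerance are assumptions made for illustration.

```python
import math


def ecpu(t_p, eta, beta, c_p, c_f, n=400):
    """Age-replacement cost rate for a Weibull(eta, beta) lifetime.

    The integral of R(t) over [0, t_p] uses composite Simpson with n (even) panels.
    """
    def rel(t):
        return math.exp(-((t / eta) ** beta))

    h = t_p / n
    area = rel(0.0) + rel(t_p)
    area += sum((4 if i % 2 else 2) * rel(i * h) for i in range(1, n))
    area *= h / 3.0
    return (c_p * rel(t_p) + c_f * (1.0 - rel(t_p))) / area


def optimal_interval(eta, beta, c_p, c_f, lo=10.0, hi=5000.0, tol=0.5):
    """Golden-section search for the interval minimizing ecpu on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if ecpu(c, eta, beta, c_p, c_f) < ecpu(d, eta, beta, c_p, c_f):
            b = d   # minimum lies in [a, d]
        else:
            a = c   # minimum lies in [c, b]
    return 0.5 * (a + b)


if __name__ == "__main__":
    cases = {
        "base case": (1000.0, 2.5, 500.0, 5000.0),
        "C_f doubled": (1000.0, 2.5, 500.0, 10000.0),
        "no aging (beta = 1)": (1000.0, 1.0, 500.0, 5000.0),
    }
    for label, params in cases.items():
        print(label, round(optimal_interval(*params)))
```

For $\beta = 1$ the search runs into the upper bound `hi`, which is the numerical signature of $T^* \to \infty$: widening the bound simply moves the reported "optimum" with it.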

Age Replacement vs. Block Replacement

Age replacement. Replace on failure or upon reaching age $T_p$ (for a specific component). Each component "starts over" after failure.

Block replacement. Replace all components at fixed moments $T, 2T, 3T, \dots$ (even if some were just replaced after a failure). Logistically simpler (scheduled maintenance), but it consumes more spare parts, because recently installed components are replaced as well.

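If $\beta > 1$ (aging): both strategies have a finite optimal interval $T$. Age replacement avoids discarding nearly new components and so typically achieves a somewhat lower cost rate; block replacement is more often chosen when work is grouped into large maintenance windows.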
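To compare the two policies numerically, the block-replacement cost rate is needed. The text does not give it; the sketch below assumes the standard textbook expression $(C_p + C_f\,M(T))/T$, where $M(t)$ is the renewal function (expected number of failures in $[0, t]$) satisfying $M(t) = F(t) + \int_0^t M(t - s)\,dF(s)$. The renewal equation is discretized with a simple right-endpoint sum, a first-order scheme chosen for brevity, so the time step should be kept small. Parameters are those of the pump example; all function names are illustrative.

```python
import math

ETA, BETA = 1000.0, 2.5     # Weibull parameters of the pump example
C_P, C_F = 500.0, 5000.0    # planned and failure replacement costs


def cdf(t):
    """Weibull distribution function F(t)."""
    return 1.0 - math.exp(-((t / ETA) ** BETA))


def renewal_function(t_max, dt=1.0):
    """Return [M(0), M(dt), ..., M(t_max)] from the discretized renewal equation.

    On each sub-interval (t_{j-1}, t_j] the term M(t_k - s) is replaced by its
    value at s = t_j, which makes the recursion explicit.
    """
    n = int(round(t_max / dt))
    big_f = [cdf(k * dt) for k in range(n + 1)]
    d_f = [0.0] + [big_f[k] - big_f[k - 1] for k in range(1, n + 1)]
    m = [0.0] * (n + 1)
    for k in range(1, n + 1):
        m[k] = big_f[k] + sum(m[k - j] * d_f[j] for j in range(1, k + 1))
    return m


def block_cost_rates(t_max, dt=1.0):
    """Map each candidate block interval T = k*dt to (C_p + C_f * M(T)) / T."""
    m = renewal_function(t_max, dt)
    return {k * dt: (C_P + C_F * m[k]) / (k * dt) for k in range(1, len(m))}


if __name__ == "__main__":
    rates = block_cost_rates(1000.0)
    t_best = min(rates, key=rates.get)
    print("block optimum:", t_best, "h, cost rate:", round(rates[t_best], 3))
```

The recursion is quadratic in the number of grid points, which is acceptable for about a thousand steps; a Monte Carlo simulation of the failure process would be an alternative check on the same quantity.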

Reliability-Centered Maintenance (RCM)

MSG-3 Standard (Maintenance Steering Group, 3rd edition). Used throughout civil aviation. For each functional failure:

1. Consequence analysis (Safety, Operational, Economic).
2. Selection of maintenance strategy:
- On-Condition (CBM by indicators).
- Hard Time (fixed life limit, replacement after a set number of flight hours or cycles).
- Failure Finding (testing hidden functions such as fire alarms).
- Servicing/Lubrication.

FMEA (Failure Mode and Effects Analysis). For each component, list the failure modes and score each one for frequency (Occurrence O), severity (Severity S), and detectability (Detectability D). Risk Priority Number: $RPN = O \times S \times D$. Components with a high RPN get maintenance priority.
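The RPN ranking is simple enough to show in a few lines. The sketch below is illustrative only: the failure modes and their scores are hypothetical, and it assumes the common convention of integer scores from 1 to 10 in which a higher D means the failure is harder to detect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str
    mode: str
    occurrence: int      # O: how often the failure mode occurs
    severity: int        # S: how serious the consequences are
    detectability: int   # D: higher score = harder to detect (assumed convention)

    @property
    def rpn(self) -> int:
        """Risk Priority Number, RPN = O * S * D."""
        return self.occurrence * self.severity * self.detectability


# Hypothetical scores on an assumed 1-10 scale, for illustration only.
MODES = [
    FailureMode("pump", "bearing wear", 6, 5, 4),
    FailureMode("pump", "seal leak", 4, 7, 3),
    FailureMode("motor", "winding insulation breakdown", 2, 9, 7),
]


def rank_by_rpn(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


if __name__ == "__main__":
    for m in rank_by_rpn(MODES):
        print(f"{m.rpn:4d}  {m.component}: {m.mode}")
```

Because the three scores are multiplied, a rare but severe and hard-to-detect mode can outrank a frequent but benign one, as the hypothetical motor entry does here.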

Examples with Actual Figures

CFM56 aircraft engines (Boeing 737 NG). Heavy maintenance every 24,000 takeoff-landing cycles. Cost: $4–6 million. The optimal service life is calculated from a combination of Weibull models for the individual parts.
GE 9HA gas turbines. CBM based on vibration diagnostics, exhaust temperature, and pressure. Reduces downtime by 30% compared with a fixed schedule.
Wind turbines. Lifetime 20–25 years; maintenance of the hydraulic and electrical systems is critical (offshore repairs are expensive). Vestas and Siemens Gamesa use CBM with remote monitoring.

Real-World Applications

Aviation (Lufthansa Technik, AAR Corp). Scheduled checks A, B, C, D (intervals from 250 h to 12 years). RCM analysis for each system. Costs $3,000–8,000 per flight hour.
Oil & Gas (Shell, ExxonMobil). Equipment on platforms and pipelines: losses from an unplanned shutdown can exceed $10M/day. CBM with thousands of IoT sensors.
Railways (RZD, Deutsche Bahn). Predictive maintenance of rolling stock and track; acoustic monitoring of bearings.
Data Centers and IT. Hard drive replacement: Backblaze statistics are consistent with a Weibull lifetime model, with optimal preventive replacement around years 5–7.
Medical equipment. MRI (Siemens, GE): CBM that monitors the magnet (helium pressure) and the gradient coils. An unplanned failure costs about $50K/day.

Assignment. Pump with failure distribution Weibull($\eta = 1000$ h, $\beta = 2.5$). $C_p = 500$, $C_f = 5000$. (a) In Python (numpy, scipy.stats.weibull_min) implement the function ECPU($T_p$) and plot it for $T_p \in [50, 1500]$. (b) Find $T^*$ via scipy.optimize.minimize. (c) What is the percentage savings compared to the "no PM" strategy? (d) Sensitivity: how does $T^*$ change for $C_f = 10{,}000$? For $\beta = 4$ (strongly pronounced aging)? For $\beta = 1$ (no aging)? (e) Compare with block replacement: compute ECPU numerically for the block strategy with the same parameters.