Stochastic DP and LQG Control
Stochastic Optimal Control
When a system is random, the optimal control becomes a policy: a function of the information observed so far (the stochastic history). Stochastic DP generalizes the Bellman equation to random transitions and discounted expected rewards — this is the mathematical foundation of reinforcement learning (RL). A special case — LQG (Linear-Quadratic-Gaussian) — admits an elegant closed-form solution via the separation principle (Joseph-Tou, 1961): the optimal control is an LQR regulator applied to the state estimate from the Kalman filter. This is an exceptional "gift" of linear-Gaussian systems; in the general case, no such separation exists.
Stochastic Bellman Equation
Problem: max E[Σ_{t=0}^T β^t·r(x_t, u_t)] subject to x_{t+1} = f(x_t, u_t, w_t), w_t ~ p(w).
Stochastic Bellman equation: V_t(x) = max_{u ∈ U} [r(x, u) + β·E_{w}[V_{t+1}(f(x, u, w))]].
Backward induction: V_T(x) = r_T(x) (terminal). For t = T−1, ..., 0: V_t(x) = max_u [r(x, u) + β·Σ_{x'} P(x' | x, u)·V_{t+1}(x')].
Here, P(x' | x, u) is the transition probability. The expectation E_w integrates over random noise.
Infinite horizon: V(x) = max_u [r(x, u) + β·E_w[V(f(x, u, w))]]. This is the Bellman operator equation; its solution is the unique fixed point of the Bellman operator, which is a β-contraction in the sup norm (Banach fixed-point theorem).
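As a sanity check on the contraction claim, here is a small numpy sketch on a randomly generated finite MDP (the transition and reward arrays are purely illustrative, not from the text): the Bellman operator shrinks the sup-norm distance between two value functions by at least a factor β, so value iteration converges to the unique fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, beta = 20, 4, 0.95

# Random finite MDP (purely illustrative): P[u, x, x'] and r[x, u]
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)            # normalize rows to probabilities
r = rng.random((n_states, n_actions))

def bellman(V):
    """Bellman operator: (T V)(x) = max_u [ r(x, u) + beta * sum_x' P(x'|x,u) V(x') ]."""
    Q = r + beta * np.einsum("axy,y->xa", P, V)  # Q[x, u]
    return Q.max(axis=1)

# Empirical check of the beta-contraction in the sup norm
V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
lhs = np.abs(bellman(V1) - bellman(V2)).max()
rhs = beta * np.abs(V1 - V2).max()
print(f"||TV1 - TV2||_inf = {lhs:.4f} <= beta*||V1 - V2||_inf = {rhs:.4f}")

# Value iteration: iterate T until it reaches its unique fixed point
V = np.zeros(n_states)
for _ in range(1000):
    V_new = bellman(V)
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new
```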
Numerical Example: Inventory Management Problem
State x ∈ {0, 1, ..., 100} — inventory. Action u ∈ {0, 1, ..., 50} — order size. Demand shock d ~ Binomial(20, 0.5) — random. Dynamics: x_{t+1} = max(0, x_t + u_t − d_t). Reward: r(x, u) = (sales) − (ordering cost) − (storage cost) = min(x + u, d) − 0.5·u − 0.1·max(0, x + u − d).
Value function iteration (VFI) for the infinite horizon with β = 0.95 produces an (s, S)-type policy: "if inventory < s, order up to level S." Numerically: s* = 8, S* = 25 for the given parameters.
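A minimal VFI sketch for this inventory problem, under the assumption that the next state is capped at 100 so it stays on the grid; the policy should show the (s, S) threshold structure, though the exact thresholds depend on such discretization details.

```python
import numpy as np
from scipy.stats import binom

beta = 0.95
states = np.arange(101)             # inventory level x
actions = np.arange(51)             # order size u
demand = np.arange(21)              # d ~ Binomial(20, 0.5)
p_d = binom.pmf(demand, 20, 0.5)

def reward(x, u, d):
    sales = np.minimum(x + u, d)
    return sales - 0.5 * u - 0.1 * np.maximum(0, x + u - d)

V = np.zeros(len(states))
for _ in range(500):                # value function iteration
    Q = np.empty((len(states), len(actions)))
    for i, x in enumerate(states):
        for j, u in enumerate(actions):
            # x' = max(0, x + u - d), capped at 100 to stay on the grid (assumption)
            x_next = np.clip(x + u - demand, 0, 100)
            Q[i, j] = np.dot(p_d, reward(x, u, demand) + beta * V[x_next])
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-6:
        break
    V = V_new

policy = actions[Q.argmax(axis=1)]  # optimal order size for each inventory level
print(policy[:30])                  # an (s, S)-type threshold structure should be visible
```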
Separation Principle in LQG
LQG problem: min E[Σ_{t=0}^T (x_tᵀ·Q·x_t + u_tᵀ·R·u_t)] for a stochastic linear system: x_{t+1} = A·x_t + B·u_t + w_t, y_t = C·x_t + v_t, where w_t ~ N(0, W), v_t ~ N(0, V), independent.
Separation theorem (Joseph-Tou, 1961): Optimal control u_t* = −K·x̂_{t|t}, where:
- K — LQR gain matrix from the deterministic problem (solution of the discrete algebraic Riccati equation with A, B, Q, R).
- x̂_{t|t} — state estimate from the Kalman filter with matrices A, C, W, V.
That is, the control and estimation problems are solved independently, and their combination is optimal. This result is critically important — it lets us reuse mature LQR and Kalman-filter implementations without any joint optimization.
Proof (outline). Combine dynamic programming with the fact that for linear-Gaussian systems the conditional distribution of x_t given y_{0:t} is Gaussian, hence fully described by the mean x̂_{t|t} and covariance P_{t|t}, and P_{t|t} does not depend on the control.
Limitations: The separation principle is valid only for linear-Gaussian systems with quadratic objective. For nonlinear or non-Gaussian problems, separation is suboptimal (although it is often used as a practical approximation, "certainty equivalence").
Stochastic MPC (Model Predictive Control)
MPC idea. At each step, we solve an optimal control problem on a finite horizon {t, ..., t + N}. We apply only the first control u_t*, then recompute with the updated state (a minimal receding-horizon sketch follows after the list of advantages below).
Advantages:
- Naturally accounts for constraints on u and x.
- Adaptive: recalculates the plan when conditions change.
- Applicable to nonlinear systems.
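A minimal receding-horizon sketch (deterministic, with a box constraint on u) built around scipy.optimize.minimize; the double-integrator model and the bound u_max are illustrative assumptions, not from the text.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative double-integrator model and quadratic stage cost (assumed for the sketch)
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.5], [1.0]])
Q, R, N = np.eye(2), 0.1 * np.eye(1), 10           # horizon length N

def horizon_cost(u_seq, x0):
    """Simulate the horizon and accumulate the quadratic cost."""
    x, J = x0.copy(), 0.0
    for u in u_seq:
        J += x @ Q @ x + R[0, 0] * u**2
        x = A @ x + B.flatten() * u
    return J + x @ Q @ x                            # terminal cost

def mpc_step(x0, u_max=1.0):
    """Solve the finite-horizon problem with |u| <= u_max; return only the first input."""
    res = minimize(horizon_cost, np.zeros(N), args=(x0,),
                   bounds=[(-u_max, u_max)] * N, method="L-BFGS-B")
    return res.x[0]

# Closed-loop simulation: re-solve at every step with the updated state
x = np.array([1.0, 0.0])
for t in range(30):
    u = mpc_step(x)
    x = A @ x + B.flatten() * u                     # no noise here; add w_t for the stochastic case
```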
Stochastic MPC. Noise is explicitly considered:
- Scenario approach: generate M = 100 scenarios w^{(i)}_{0:N} and choose the input sequence that performs best across all of them, e.g., by average or worst-case cost (see the sketch after this list).
- Tube MPC: design a "tube" of allowable trajectories, guaranteeing constraint satisfaction for all allowable w.
- Chance-constrained MPC: P(C·x_t ≤ d) ≥ 1 − ε — probabilistic constraints.
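As a crude illustration of the scenario approach, the following random-shooting sketch samples M disturbance scenarios, scores a set of candidate input sequences by their average cost over the scenarios, and applies the first input of the best candidate; all numerical settings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.5], [1.0]])
Q, R, N, M = np.eye(2), 0.1, 10, 100               # horizon N, M noise scenarios

def scenario_cost(u_seq, x0, w_scen):
    """Average quadratic cost of one input sequence over all sampled disturbance scenarios."""
    costs = []
    for w in w_scen:                               # each w has shape (N, 2)
        x, J = x0.copy(), 0.0
        for k, u in enumerate(u_seq):
            J += x @ Q @ x + R * u**2
            x = A @ x + B.flatten() * u + w[k]
        costs.append(J + x @ Q @ x)
    return np.mean(costs)                          # use np.max(costs) for a worst-case (robust) choice

def scenario_mpc_step(x0, n_candidates=200, u_max=1.0):
    w_scen = 0.1 * rng.standard_normal((M, N, 2))  # M sampled disturbance scenarios
    candidates = rng.uniform(-u_max, u_max, (n_candidates, N))
    scores = [scenario_cost(u_seq, x0, w_scen) for u_seq in candidates]
    return candidates[int(np.argmin(scores))][0]   # apply only the first input of the best sequence

u0 = scenario_mpc_step(np.array([1.0, 0.0]))
```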
Numerical Example: LQG for a Double Integrator
System: A = [1, 1; 0, 1], B = [0.5; 1], C = [1, 0]. Q = I, R = 0.1, W = 0.01·I, V = 0.1. LQR (via scipy.linalg.solve_discrete_are): K ≈ [1.78, 2.45]. Kalman (via solve_discrete_are for Aᵀ, Cᵀ, W, V): L ≈ [0.27; 0.10] (gain).
Simulation of 50 steps from x_0 = (1, 0): LQG cost J = E[Σ x_tᵀ Q x_t + u_tᵀ R u_t] ≈ 4.2. LQR (with direct access to x): J ≈ 3.8. Estimation overhead ~10% — the price of not knowing the state.
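A sketch reproducing this example with scipy: the gains should match the rounded values above, while the simulated cost depends on the noise realization.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.5], [1.0]])
C = np.array([[1.0, 0.0]])
Q, R = np.eye(2), np.array([[0.1]])
W, V = 0.01 * np.eye(2), np.array([[0.1]])

# LQR gain from the control Riccati equation
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)      # u = -K x

# Kalman gain from the dual (filtering) Riccati equation
S = solve_discrete_are(A.T, C.T, W, V)
L = S @ C.T @ np.linalg.inv(C @ S @ C.T + V)           # steady-state filter gain

# Closed-loop LQG simulation: control acts on the estimate, not the true state
rng = np.random.default_rng(0)
x, x_hat, J = np.array([1.0, 0.0]), np.zeros(2), 0.0
for t in range(50):
    u = -(K @ x_hat)
    J += x @ Q @ x + float(u @ R @ u)
    w = rng.multivariate_normal(np.zeros(2), W)
    v = rng.normal(0.0, np.sqrt(V[0, 0]))
    x = A @ x + B.flatten() * u[0] + w                 # true state with process noise
    y = C @ x + v                                      # noisy measurement
    x_pred = A @ x_hat + B.flatten() * u[0]            # Kalman predict
    x_hat = x_pred + (L @ (y - C @ x_pred)).flatten()  # Kalman update with steady-state gain
print("K =", K, "\nL =", L, "\ncost J =", J)
```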
Reinforcement Learning (RL) — Modern Continuation
When the model is unknown, E[V(f(x, u, w))] cannot be computed. RL replaces this with sampling:
- Q-learning: Q(x, u) ← Q(x, u) + α·[r + β·max_{u'} Q(x', u') − Q(x, u)].
- DQN (Mnih 2015): Q is approximated by a neural network, trained on batches from a replay buffer.
- Actor-Critic, PPO, SAC: train policy and V/Q in parallel.
These "model-free" methods complement the classical theory of optimal control; a tabular Q-learning sketch follows below.
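A tabular Q-learning sketch on the inventory problem from the earlier example, using sampled Binomial demand instead of the known transition model; the learning rate, exploration rate, and step count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_u, beta, alpha, eps = 101, 51, 0.95, 0.1, 0.1
Q = np.zeros((n_x, n_u))

x = 50
for step in range(500_000):
    # epsilon-greedy action selection
    u = rng.integers(n_u) if rng.random() < eps else int(Q[x].argmax())
    d = rng.binomial(20, 0.5)                          # sampled demand; the model itself is not used
    r = min(x + u, d) - 0.5 * u - 0.1 * max(0, x + u - d)
    x_next = min(max(0, x + u - d), 100)               # keep the state on the 0..100 grid (assumption)
    # Q-learning update: bootstrap from the greedy value at the next state
    Q[x, u] += alpha * (r + beta * Q[x_next].max() - Q[x, u])
    x = x_next

policy = Q.argmax(axis=1)                              # learned order sizes; compare with the VFI policy
```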
Real Applications
- HVAC and smart buildings. Stochastic MPC balances weather forecasts, electricity prices, and comfort. Achieves 15–30% energy savings.
- Portfolio management. Dynamic asset allocation under stochastic returns — an LQG-type problem after linearization; in the nonlinear case it becomes stochastic control via the HJB equation.
- Process control (refining, chemistry). MPC is the industrial standard: Honeywell APC systems, AspenTech DMCplus, ExxonMobil PIC. Thousands of installations.
- Autonomous vehicles. Tesla Autopilot and Waymo use MPC in real time for maneuver planning; accounting for uncertainty in the motion of other vehicles leads to stochastic MPC.
- Energy. Optimal control of a power plant with renewables: stochastic demand, prices, and solar/wind generation make this a stochastic DP problem.
Assignment. LQG for the discrete-time double integrator: A = [1, 0.1; 0, 1], B = [0.005; 0.1], C = [1, 0]. Q = diag(1, 0.1), R = 1, W = 0.01·I, V = 0.1. (a) Using scipy.linalg.solve_discrete_are, compute the LQR gain matrix K. (b) Solving the dual system (Aᵀ, Cᵀ, W, V), find the Kalman gain L. (c) Simulate the closed-loop LQG system for 100 steps with x_0 = (1, 0), x̂_0 = (0, 0). Plot x(t), x̂(t), u(t). (d) Compute the empirical J = Σ(x_tᵀ Q x_t + u_tᵀ R u_t) by averaging over 100 trajectories. Compare with the theoretical LQR minimum (without measurement noise).