Module V·Article II·~5 min read

Stochastic DP and LQG Control

Stochastic Optimal Control


When a system is random, the optimal control becomes a feedback policy: a function of the information available at each moment (in general, the history of observations). Stochastic DP generalizes the Bellman equation to random transitions and discounted expected rewards — this is the mathematical foundation of reinforcement learning (RL). A special case, LQG (Linear-Quadratic-Gaussian), admits an elegant closed-form solution via the separation principle (Joseph and Tou, 1961): the optimal control is the LQR regulator applied to the state estimate produced by the Kalman filter. This is an exceptional "gift" of linear-Gaussian systems; in the general case, no such separation exists.

Stochastic Bellman Equation

Problem: max E[Σ_{t=0}^T β^t·r(x_t, u_t)] subject to x_{t+1} = f(x_t, u_t, w_t), w_t ~ p(w).

Stochastic Bellman equation: V_t(x) = max_{u ∈ U} [r(x, u) + β·E_{w}[V_{t+1}(f(x, u, w))]].

Backward induction: V_T(x) = r_T(x) (terminal). For t = T−1, ..., 0: V_t(x) = max_u [r(x, u) + β·Σ_{x'} P(x' | x, u)·V_{t+1}(x')].

Here, P(x' | x, u) is the transition probability. The expectation E_w integrates over random noise.
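The backward-induction recursion can be sketched on a tiny discrete MDP (all numbers are illustrative, and the terminal value is taken to be zero rather than a terminal reward r_T):

```python
import numpy as np

# Toy 2-state, 2-action MDP; all numbers illustrative, terminal value V_T = 0
beta, T = 0.95, 5
r = np.array([[1.0, 0.0],          # r[x, u]: reward for action u in state x
              [0.0, 2.0]])
P = np.array([[[0.9, 0.1],         # P[x, u, x']: transition probabilities
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.1, 0.9]]])

V = np.zeros(2)                    # V_T
for t in reversed(range(T)):
    Q = r + beta * P @ V           # Q_t(x, u) = r(x, u) + β·Σ_x' P(x'|x, u)·V_{t+1}(x')
    policy = Q.argmax(axis=1)      # greedy action at stage t
    V = Q.max(axis=1)              # V_t(x)
```

Each pass of the loop is one application of the Bellman maximization; the matrix product `P @ V` computes the expectation Σ_x' P(x'|x, u)·V_{t+1}(x') for every (x, u) pair at once.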

Infinite horizon: V(x) = max_u [r(x, u) + β·E_w[V(f(x, u, w))]]. This is the Bellman operator equation; its solution is the unique fixed point of the Bellman operator, which is a β-contraction (Banach fixed-point theorem).

Numerical Example: Inventory Management Problem

State x ∈ {0, 1, ..., 100} — inventory. Action u ∈ {0, 1, ..., 50} — order size. Demand shock d ~ Binomial(20, 0.5) — random. Dynamics: x_{t+1} = max(0, x_t + u_t − d_t). Reward for a realized demand d: r(x, u, d) = (sales) − (ordering cost) − (storage cost) = min(x + u, d) − 0.5·u − 0.1·max(0, x + u − d); the Bellman equation averages this over d.

VFI for infinite horizon with β = 0.95 produces an (s, S)-type policy: "if inventory < s, order up to level S." Numerically: s* = 8, S* = 25 for the given parameters.

Separation Principle in LQG

LQG problem: min E[Σ_{t=0}^T (x_tᵀ·Q·x_t + u_tᵀ·R·u_t)] for a stochastic linear system: x_{t+1} = A·x_t + B·u_t + w_t, y_t = C·x_t + v_t, where w_t ~ N(0, W), v_t ~ N(0, V), independent.

Separation theorem (Joseph-Tou, 1961): Optimal control u_t* = −K·x̂_{t|t}, where:

  • K — LQR gain matrix from the deterministic problem (solution of algebraic Riccati with A, B, Q, R).
  • x̂_{t|t} — state estimate from the Kalman filter with matrices A, C, W, V.

That is, the estimation and control problems are solved independently, and their combination is optimal. This result is critically important in practice: it lets us reuse mature LQR and Kalman-filter implementations without any joint optimization.

Proof (outline). Via the principle of dynamic programming and the fact that for linear-Gaussian systems, the conditional distribution x_t | y_{0:t} is completely described by the mean x̂_{t|t} and covariance P_{t|t}, and P_{t|t} does not depend on the control.

Limitations: The separation principle is valid only for linear-Gaussian systems with quadratic objective. For nonlinear or non-Gaussian problems, separation is suboptimal (although it is often used as a practical approximation, "certainty equivalence").

Stochastic MPC (Model Predictive Control)

MPC idea. At each step, we solve an optimal control problem on a finite horizon {t, ..., t + N}. We apply only the first control u_t*, then recalculate with the updated state.

Advantages:

  • Naturally accounts for constraints on u and x.
  • Adaptive: recalculates the plan when conditions change.
  • Applicable to nonlinear systems.
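The receding-horizon loop can be sketched for a double integrator with a box constraint on the input (all numbers are illustrative, and a generic NLP solver stands in for the QP solvers used in production MPC):

```python
import numpy as np
from scipy.optimize import minimize

# Double integrator with an input constraint |u| <= 0.5 (illustrative numbers)
A = np.array([[1., 1.], [0., 1.]])
B = np.array([[0.5], [1.]])
Q, Rw = np.eye(2), 0.1
N = 10                                       # prediction horizon

def horizon_cost(u_seq, x0):
    """Quadratic cost of applying u_seq to the nominal model from x0."""
    x, J = x0.copy(), 0.0
    for u in u_seq:
        J += x @ Q @ x + Rw * u ** 2
        x = A @ x + B.ravel() * u
    return J + x @ Q @ x                     # terminal cost

x = np.array([1., 0.])
for t in range(30):
    res = minimize(horizon_cost, np.zeros(N), args=(x,),
                   bounds=[(-0.5, 0.5)] * N)
    x = A @ x + B.ravel() * res.x[0]         # apply only the first control, re-solve next step
```

The `bounds` argument is where MPC earns its keep: the input constraint is enforced by construction, which an unconstrained LQR gain cannot do.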

Stochastic MPC. Noise is explicitly considered:

  • Scenario approach: generate M = 100 scenarios w^{(i)}_{0:N} and optimize a single control sequence that performs well across all of them (e.g., minimizing the average or worst-case cost).
  • Tube MPC: design a "tube" of allowable trajectories, guaranteeing constraint satisfaction for all allowable w.
  • Chance-constrained MPC: P(C·x_t ≤ d) ≥ 1 − ε — probabilistic constraints.
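One step of the scenario approach can be sketched as follows (illustrative double-integrator setup; the cost averaged over sampled noise paths is minimized, and in a full MPC loop this solve would be repeated at every step with fresh samples):

```python
import numpy as np
from scipy.optimize import minimize

# One stochastic-MPC step via the scenario approach (illustrative numbers)
rng = np.random.default_rng(1)
A = np.array([[1., 1.], [0., 1.]])
B = np.array([[0.5], [1.]])
N, M = 8, 100
w = 0.05 * rng.standard_normal((M, N, 2))    # M sampled noise paths

def avg_cost(u_seq, x0):
    """Average quadratic cost of one input sequence over all scenarios."""
    J = 0.0
    for i in range(M):
        x = x0.copy()
        for t in range(N):
            J += x @ x + 0.1 * u_seq[t] ** 2
            x = A @ x + B.ravel() * u_seq[t] + w[i, t]
        J += x @ x                           # terminal cost
    return J / M

res = minimize(avg_cost, np.zeros(N), args=(np.array([1., 0.]),),
               bounds=[(-1.0, 1.0)] * N)     # input constraint |u| <= 1
u_first = res.x[0]                           # apply this, then re-sample and re-solve
```

Replacing the mean with a max over scenarios gives the robust (worst-case) variant. Note that with purely additive zero-mean noise and no state constraints, averaging reduces to the nominal problem, so the scenario machinery pays off mainly when state constraints or multiplicative noise are present.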

Numerical Example: LQG for a Double Integrator

System: A = [1, 1; 0, 1], B = [0.5; 1], C = [1, 0]. Q = I, R = 0.1, W = 0.01·I, V = 0.1. LQR (via scipy.linalg.solve_discrete_are): K ≈ [1.78, 2.45]. Kalman (via solve_discrete_are for Aᵀ, Cᵀ, W, V): L ≈ [0.27; 0.10] (gain).

Simulation of 50 steps from x_0 = (1, 0): LQG cost J = E[Σ x_tᵀ Q x_t + u_tᵀ R u_t] ≈ 4.2. LQR (with direct access to x): J ≈ 3.8. Estimation overhead ~10% — the price of not knowing the state.
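The design above can be reproduced with a short script (the simulated cost varies with the noise seed, so the numbers quoted in the text should be read as orders of magnitude):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1., 1.], [0., 1.]])
B = np.array([[0.5], [1.]])
C = np.array([[1., 0.]])
Q, R = np.eye(2), np.array([[0.1]])
W, V = 0.01 * np.eye(2), np.array([[0.1]])

# LQR: control Riccati equation with (A, B, Q, R)
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Kalman: dual Riccati equation with (A^T, C^T, W, V)
S = solve_discrete_are(A.T, C.T, W, V)
L = S @ C.T @ np.linalg.inv(C @ S @ C.T + V)

rng = np.random.default_rng(0)
x, xhat, cost = np.array([1., 0.]), np.zeros(2), 0.0
for t in range(50):
    u = -K @ xhat                                    # LQR gain on the *estimate*
    cost += x @ Q @ x + float(u @ R @ u)
    x = A @ x + B.ravel() * u[0] + rng.multivariate_normal(np.zeros(2), W)
    y = C @ x + rng.normal(0.0, np.sqrt(V[0, 0]))    # noisy measurement
    xpred = A @ xhat + B.ravel() * u[0]              # Kalman predict
    xhat = xpred + (L @ (y - C @ xpred)).ravel()     # Kalman correct
```

The loop is the separation principle in code: `K` never sees the noise statistics, `L` never sees the cost matrices, and the controller simply feeds the filter's estimate into the LQR gain.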

Reinforcement Learning (RL) — Modern Continuation

When the model is unknown, E[V(f(x, u, w))] cannot be computed. RL replaces this with sampling:

  • Q-learning: Q(x, u) ← Q(x, u) + α·[r + β·max_{u'} Q(x', u') − Q(x, u)].
  • DQN (Mnih et al., 2015): Q is approximated by a neural network, trained on batches from a replay buffer.
  • Actor-Critic, PPO, SAC: train policy and V/Q in parallel.

These are "model-free" methods, complementing the classical theory of optimal control.
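The Q-learning update can be sketched on a toy chain MDP (entirely illustrative: 5 states, a reward of 1 for reaching the rightmost state, and a 10% chance that the "wind" flips the chosen move):

```python
import numpy as np

# Toy chain MDP: states 0..4, reward 1 for reaching state 4, 10% "wind"
rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
beta, alpha, eps = 0.95, 0.1, 0.1
Q = np.zeros((n_states, n_actions))

def step(x, u):
    move = 1 if u == 1 else -1          # action 1 = right, 0 = left
    if rng.random() < 0.1:              # wind flips the move
        move = -move
    x_next = min(max(x + move, 0), n_states - 1)
    return x_next, (1.0 if x_next == n_states - 1 else 0.0)

x = 0
for _ in range(50_000):
    # epsilon-greedy exploration
    u = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[x].argmax())
    x_next, r = step(x, u)
    # Q-learning update: bootstrap with the best next-state value
    Q[x, u] += alpha * (r + beta * Q[x_next].max() - Q[x, u])
    x = 0 if x_next == n_states - 1 else x_next   # episode restarts at the goal
```

No transition model P(x'|x, u) is ever built: the expectation in the Bellman equation is replaced by sampled transitions, which is exactly the "model-free" trade mentioned above.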

Real Applications

  • HVAC and smart buildings. Stochastic MPC balances weather forecasts, electricity prices, and comfort. Achieves 15–30% energy savings.
  • Portfolio management. Dynamic asset allocation taking into account stochastic returns — an LQG-type problem in linearization; in nonlinear form — stochastic control with HJB.
  • Process control (refining, chemistry). MPC is the industrial standard: Honeywell APC systems, AspenTech DMCplus, ExxonMobil PIC. Thousands of installations.
  • Autonomous vehicles. Tesla Autopilot, Waymo use MPC in real time for maneuver planning. Accounting for uncertainty in movements of other cars — stochastic MPC.
  • Energy. Optimal control of a power station with renewables: stochastic demand, prices, solar/wind generation — stochastic DP problem.

Assignment. LQG for the discrete-time double integrator: A = [1, 0.1; 0, 1], B = [0.005; 0.1], C = [1, 0]. Q = diag(1, 0.1), R = 1, W = 0.01·I, V = 0.1. (a) Using scipy.linalg.solve_discrete_are, compute the LQR gain matrix K. (b) Solving the dual system (Aᵀ, Cᵀ, W, V), find the Kalman gain L. (c) Simulate the closed LQG system for 100 steps with x_0 = (1, 0), x̂_0 = (0, 0). Plot x(t), x̂(t), u(t). (d) Compute empirical J = Σ(x_tᵀ Q x_t + u_tᵀ R u_t) by averaging over 100 trajectories. Compare with the theoretical LQR minimum (without measurement noise).
