Module V·Article II·~5 min read

Stochastic DP and LQG Control

Stochastic Optimal Control


When a system is random, the optimal control becomes a feedback policy: a function of the information available at each moment (in general, the history of observations). Stochastic DP generalizes the Bellman equation to random transitions and discounted expected rewards — this is the mathematical foundation of reinforcement learning (RL). A special case, LQG (Linear-Quadratic-Gaussian), admits an elegant closed-form solution via the separation principle (Joseph and Tou, 1961): the optimal control is the LQR regulator applied to the state estimate produced by the Kalman filter. This is an exceptional "gift" of linear-Gaussian systems; in the general case, no such separation exists.

Stochastic Bellman Equation

Problem: max E[Σ_{t=0}^T β^t·r(x_t, u_t)] subject to x_{t+1} = f(x_t, u_t, w_t), w_t ~ p(w).

Stochastic Bellman equation: V_t(x) = max_{u ∈ U} [r(x, u) + β·E_{w}[V_{t+1}(f(x, u, w))]].

Backward induction: V_T(x) = r_T(x) (terminal). For t = T−1, ..., 0: V_t(x) = max_u [r(x, u) + β·Σ_{x'} P(x' | x, u)·V_{t+1}(x')].

Here, P(x' | x, u) is the transition probability. The expectation E_w integrates over random noise.
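The backward-induction recursion can be sketched on a tiny discrete MDP (all numbers are illustrative, and the terminal value is taken to be zero rather than a terminal reward r_T):

```python
import numpy as np

# Toy 2-state, 2-action MDP; all numbers illustrative, terminal value V_T = 0
beta, T = 0.95, 5
r = np.array([[1.0, 0.0],          # r[x, u]: reward for action u in state x
              [0.0, 2.0]])
P = np.array([[[0.9, 0.1],         # P[x, u, x']: transition probabilities
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.1, 0.9]]])

V = np.zeros(2)                    # V_T
for t in reversed(range(T)):
    Q = r + beta * P @ V           # Q_t(x, u) = r(x, u) + β·Σ_x' P(x'|x, u)·V_{t+1}(x')
    policy = Q.argmax(axis=1)      # greedy action at stage t
    V = Q.max(axis=1)              # V_t(x)
```

Each pass of the loop is one application of the Bellman maximization; the matrix product `P @ V` computes the expectation Σ_x' P(x'|x, u)·V_{t+1}(x') for every (x, u) pair at once.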

Infinite horizon: V(x) = max_u [r(x, u) + β·E_w[V(f(x, u, w))]]. This is the Bellman operator equation; its solution is the unique fixed point of the Bellman operator, which is a β-contraction (Banach fixed-point theorem).

Numerical Example: Inventory Management Problem

State x ∈ {0, 1, ..., 100} — inventory. Action u ∈ {0, 1, ..., 50} — order size. Demand shock d ~ Binomial(20, 0.5) — random. Dynamics: x_{t+1} = max(0, x_t + u_t − d_t). Reward for a realized demand d: r(x, u, d) = (sales) − (ordering cost) − (storage cost) = min(x + u, d) − 0.5·u − 0.1·max(0, x + u − d); the Bellman equation averages this over d.

VFI for infinite horizon with β = 0.95 produces an (s, S)-type policy: "if inventory < s, order up to level S." Numerically: s* = 8, S* = 25 for the given parameters.

Separation Principle in LQG

LQG problem: min E[Σ_{t=0}^T (x_tᵀ·Q·x_t + u_tᵀ·R·u_t)] for a stochastic linear system: x_{t+1} = A·x_t + B·u_t + w_t, y_t = C·x_t + v_t, where w_t ~ N(0, W), v_t ~ N(0, V), independent.

Separation theorem (Joseph-Tou, 1961): Optimal control u_t* = −K·x̂_{t|t}, where:

  • K — LQR gain matrix from the deterministic problem (solution of algebraic Riccati with A, B, Q, R).
  • x̂_{t|t} — state estimate from the Kalman filter with matrices A, C, W, V.

That is, the estimation and control problems are solved independently, and their combination is optimal. This result is critically important in practice: it lets us reuse mature LQR and Kalman-filter implementations without any joint optimization.

Proof (outline). Via the principle of dynamic programming and the fact that for linear-Gaussian systems, the conditional distribution x_t | y_{0:t} is completely described by the mean x̂_{t|t} and covariance P_{t|t}, and P_{t|t} does not depend on the control.

Limitations: The separation principle is valid only for linear-Gaussian systems with quadratic objective. For nonlinear or non-Gaussian problems, separation is suboptimal (although it is often used as a practical approximation, "certainty equivalence").

Stochastic MPC (Model Predictive Control)

MPC idea. At each step, we solve an optimal control problem on a finite horizon {t, ..., t + N}. We apply only the first control u_t*, then recalculate with the updated state.

Advantages:

  • Naturally accounts for constraints on u and x.
  • Adaptive: recalculates the plan when conditions change.
  • Applicable to nonlinear systems.
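The receding-horizon loop can be sketched for a double integrator with a box constraint on the input (all numbers are illustrative, and a generic NLP solver stands in for the QP solvers used in production MPC):

```python
import numpy as np
from scipy.optimize import minimize

# Double integrator with an input constraint |u| <= 0.5 (illustrative numbers)
A = np.array([[1., 1.], [0., 1.]])
B = np.array([[0.5], [1.]])
Q, Rw = np.eye(2), 0.1
N = 10                                       # prediction horizon

def horizon_cost(u_seq, x0):
    """Quadratic cost of applying u_seq to the nominal model from x0."""
    x, J = x0.copy(), 0.0
    for u in u_seq:
        J += x @ Q @ x + Rw * u ** 2
        x = A @ x + B.ravel() * u
    return J + x @ Q @ x                     # terminal cost

x = np.array([1., 0.])
for t in range(30):
    res = minimize(horizon_cost, np.zeros(N), args=(x,),
                   bounds=[(-0.5, 0.5)] * N)
    x = A @ x + B.ravel() * res.x[0]         # apply only the first control, re-solve next step
```

The `bounds` argument is where MPC earns its keep: the input constraint is enforced by construction, which an unconstrained LQR gain cannot do.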

Stochastic MPC. Noise is explicitly considered:

  • Scenario approach: generate M = 100 scenarios w^{(i)}_{0:N} and optimize a single control sequence that performs well across all of them (e.g., minimizing the average or worst-case cost).
  • Tube MPC: design a "tube" of allowable trajectories, guaranteeing constraint satisfaction for all allowable w.
  • Chance-constrained MPC: P(C·x_t ≤ d) ≥ 1 − ε — probabilistic constraints.
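One step of the scenario approach can be sketched as follows (illustrative double-integrator setup; the cost averaged over sampled noise paths is minimized, and in a full MPC loop this solve would be repeated at every step with fresh samples):

```python
import numpy as np
from scipy.optimize import minimize

# One stochastic-MPC step via the scenario approach (illustrative numbers)
rng = np.random.default_rng(1)
A = np.array([[1., 1.], [0., 1.]])
B = np.array([[0.5], [1.]])
N, M = 8, 100
w = 0.05 * rng.standard_normal((M, N, 2))    # M sampled noise paths

def avg_cost(u_seq, x0):
    """Average quadratic cost of one input sequence over all scenarios."""
    J = 0.0
    for i in range(M):
        x = x0.copy()
        for t in range(N):
            J += x @ x + 0.1 * u_seq[t] ** 2
            x = A @ x + B.ravel() * u_seq[t] + w[i, t]
        J += x @ x                           # terminal cost
    return J / M

res = minimize(avg_cost, np.zeros(N), args=(np.array([1., 0.]),),
               bounds=[(-1.0, 1.0)] * N)     # input constraint |u| <= 1
u_first = res.x[0]                           # apply this, then re-sample and re-solve
```

Replacing the mean with a max over scenarios gives the robust (worst-case) variant. Note that with purely additive zero-mean noise and no state constraints, averaging reduces to the nominal problem, so the scenario machinery pays off mainly when state constraints or multiplicative noise are present.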

Numerical Example: LQG for a Double Integrator

System: A = [1, 1; 0, 1], B = [0.5; 1], C = [1, 0]. Q = I, R = 0.1, W = 0.01·I, V = 0.1. LQR (via scipy.linalg.solve_discrete_are): K ≈ [1.78, 2.45]. Kalman (via solve_discrete_are for Aᵀ, Cᵀ, W, V): L ≈ [0.27; 0.10] (gain).

Simulation of 50 steps from x_0 = (1, 0): LQG cost J = E[Σ x_tᵀ Q x_t + u_tᵀ R u_t] ≈ 4.2. LQR (with direct access to x): J ≈ 3.8. Estimation overhead ~10% — the price of not knowing the state.
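The design above can be reproduced with a short script (the simulated cost varies with the noise seed, so the numbers quoted in the text should be read as orders of magnitude):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1., 1.], [0., 1.]])
B = np.array([[0.5], [1.]])
C = np.array([[1., 0.]])
Q, R = np.eye(2), np.array([[0.1]])
W, V = 0.01 * np.eye(2), np.array([[0.1]])

# LQR: control Riccati equation with (A, B, Q, R)
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Kalman: dual Riccati equation with (A^T, C^T, W, V)
S = solve_discrete_are(A.T, C.T, W, V)
L = S @ C.T @ np.linalg.inv(C @ S @ C.T + V)

rng = np.random.default_rng(0)
x, xhat, cost = np.array([1., 0.]), np.zeros(2), 0.0
for t in range(50):
    u = -K @ xhat                                    # LQR gain on the *estimate*
    cost += x @ Q @ x + float(u @ R @ u)
    x = A @ x + B.ravel() * u[0] + rng.multivariate_normal(np.zeros(2), W)
    y = C @ x + rng.normal(0.0, np.sqrt(V[0, 0]))    # noisy measurement
    xpred = A @ xhat + B.ravel() * u[0]              # Kalman predict
    xhat = xpred + (L @ (y - C @ xpred)).ravel()     # Kalman correct
```

The loop is the separation principle in code: `K` never sees the noise statistics, `L` never sees the cost matrices, and the controller simply feeds the filter's estimate into the LQR gain.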

Reinforcement Learning (RL) — Modern Continuation

When the model is unknown, E[V(f(x, u, w))] cannot be computed. RL replaces this with sampling:

  • Q-learning: Q(x, u) ← Q(x, u) + α·[r + β·max_{u'} Q(x', u') − Q(x, u)].
  • DQN (Mnih et al., 2015): Q is approximated by a neural network, trained on batches from a replay buffer.
  • Actor-Critic, PPO, SAC: train policy and V/Q in parallel.

These are "model-free" methods, complementing the classical theory of optimal control.
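The Q-learning update can be sketched on a toy chain MDP (entirely illustrative: 5 states, a reward of 1 for reaching the rightmost state, and a 10% chance that the "wind" flips the chosen move):

```python
import numpy as np

# Toy chain MDP: states 0..4, reward 1 for reaching state 4, 10% "wind"
rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
beta, alpha, eps = 0.95, 0.1, 0.1
Q = np.zeros((n_states, n_actions))

def step(x, u):
    move = 1 if u == 1 else -1          # action 1 = right, 0 = left
    if rng.random() < 0.1:              # wind flips the move
        move = -move
    x_next = min(max(x + move, 0), n_states - 1)
    return x_next, (1.0 if x_next == n_states - 1 else 0.0)

x = 0
for _ in range(50_000):
    # epsilon-greedy exploration
    u = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[x].argmax())
    x_next, r = step(x, u)
    # Q-learning update: bootstrap with the best next-state value
    Q[x, u] += alpha * (r + beta * Q[x_next].max() - Q[x, u])
    x = 0 if x_next == n_states - 1 else x_next   # episode restarts at the goal
```

No transition model P(x'|x, u) is ever built: the expectation in the Bellman equation is replaced by sampled transitions, which is exactly the "model-free" trade mentioned above.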

Real Applications

  • HVAC and smart buildings. Stochastic MPC balances weather forecasts, electricity prices, and comfort. Achieves 15–30% energy savings.
  • Portfolio management. Dynamic asset allocation taking into account stochastic returns — an LQG-type problem in linearization; in nonlinear form — stochastic control with HJB.
  • Process control (refining, chemistry). MPC is the industrial standard: Honeywell APC systems, AspenTech DMCplus, ExxonMobil PIC. Thousands of installations.
  • Autonomous vehicles. Tesla Autopilot, Waymo use MPC in real time for maneuver planning. Accounting for uncertainty in movements of other cars — stochastic MPC.
  • Energy. Optimal control of a power station with renewables: stochastic demand, prices, solar/wind generation — stochastic DP problem.

Assignment. LQG for the discrete-time double integrator: A = [1, 0.1; 0, 1], B = [0.005; 0.1], C = [1, 0]. Q = diag(1, 0.1), R = 1, W = 0.01·I, V = 0.1. (a) Using scipy.linalg.solve_discrete_are, compute the LQR gain matrix K. (b) Solving the dual system (Aᵀ, Cᵀ, W, V), find the Kalman gain L. (c) Simulate the closed LQG system for 100 steps with x_0 = (1, 0), x̂_0 = (0, 0). Plot x(t), x̂(t), u(t). (d) Compute empirical J = Σ(x_tᵀ Q x_t + u_tᵀ R u_t) by averaging over 100 trajectories. Compare with the theoretical LQR minimum (without measurement noise).
