
Discrete Dynamic Programming and Numerical Methods

Bellman’s Dynamic Programming


Analytical solutions to optimal control problems exist only for special structures (LQR, the Ramsey problem with Cobb-Douglas preferences, the Merton portfolio problem). Most practical problems, being nonlinear, high-dimensional, or involving discrete decisions, must be solved numerically. Two major families of methods exist: discrete dynamic programming (via the Bellman equation) and direct trajectory optimization (via nonlinear programming). Choosing between them is a central practical question.

Discrete Dynamic Programming

Problem: $\max \sum_{t=0}^{T-1} r(x_t, u_t) + V_T(x_T)$ subject to $x_{t+1} = f(x_t, u_t)$, $x_t \in S$, $u_t \in U$.

Here $S$ is the set of states (for example, $|S| = 100$), $U$ is the set of actions, $r$ is the immediate reward, and $f$ is the transition function.

Backward Induction Algorithm.

  1. Initialization: $V_T(x)$ is given for all $x \in S$ (terminal value).
  2. For $t = T-1, T-2, ..., 0$ and each $x \in S$: $ V_t(x) = \max_{u \in U} [r(x, u) + V_{t+1}(f(x, u))]. $
  3. Optimal policy: $u^*(t, x) = \arg\max_{u} [r(x, u) + V_{t+1}(f(x, u))]$.

Complexity: $O(T \cdot |S| \cdot |U|)$. For $T = 100$, $|S| = 100$, $|U| = 10$: $10^5$ operations—instantaneous.
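A minimal sketch of this loop in Python (the model is passed in as illustrative `reward` and `step` callables; nothing here is specific to a particular problem):

```python
import numpy as np

def backward_induction(T, n_states, n_actions, reward, step, V_terminal):
    """Tabular finite-horizon DP. `reward(x, u)` and `step(x, u)` (next-state
    index) describe the model; `V_terminal` holds V_T(x) for every state."""
    V = np.zeros((T + 1, n_states))
    policy = np.zeros((T, n_states), dtype=int)
    V[T] = V_terminal
    for t in range(T - 1, -1, -1):            # t = T-1, T-2, ..., 0
        for x in range(n_states):
            q = [reward(x, u) + V[t + 1, step(x, u)] for u in range(n_actions)]
            policy[t, x] = int(np.argmax(q))  # u*(t, x)
            V[t, x] = max(q)                  # V_t(x)
    return V, policy                          # O(T * |S| * |U|) model evaluations
```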

Numerical Example: The Knapsack-in-Time Problem

A backpacker collects mushrooms over $T = 10$ days. The state $x \in \{0, 1, \ldots, 100\}$ is the number of kilograms in the backpack (capacity 100 kg). The action $u \in \{0, 1, \ldots, 5\}$ is the number of kilograms collected today. The reward $r(x, u) = \sqrt{u} \cdot (1 + 0.01 \cdot (100 - x))$ captures that mushrooms bring more joy while the backpack is still light.

Backward induction (VFI) yields the optimal policy: take 5 kg per day in the early days and slow the pace toward the end. Numerically, $V_0(0) \approx 12.4$ is the total joy over the 10 days.
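A sketch of this computation, assuming the dynamics $x_{t+1} = \min(x_t + u_t, 100)$ and zero terminal value $V_T \equiv 0$; neither is stated explicitly above, so the value and policy this produces depend on those assumptions:

```python
import numpy as np

T, CAP, U_MAX = 10, 100, 5

V = np.zeros((T + 1, CAP + 1))            # V[T] = 0: assumed terminal value
policy = np.zeros((T, CAP + 1), dtype=int)
for t in range(T - 1, -1, -1):
    for x in range(CAP + 1):
        # Q-value of each action: today's joy + value of the fuller backpack.
        q = [np.sqrt(u) * (1 + 0.01 * (100 - x)) + V[t + 1, min(x + u, CAP)]
             for u in range(U_MAX + 1)]
        policy[t, x] = int(np.argmax(q))
        V[t, x] = max(q)

print("V_0(0) =", round(float(V[0, 0]), 1))
x = 0
for t in range(T):                        # follow the optimal policy from x = 0
    u = policy[t, x]
    print(f"day {t}: take {u} kg")
    x = min(x + u, CAP)
```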

Curse of Dimensionality

For continuous states a grid is needed. With $\dim(x) = d$ and $N$ points per axis, $|S| \sim N^d$. For $d = 5$ and $N = 50$ this gives $|S| \approx 3 \cdot 10^8$, at the edge of computability; for $d = 10$ it is out of reach ($50^{10} \approx 10^{17}$).
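The arithmetic behind these numbers:

```python
# Number of grid points N^d for N = 50 points per axis.
N = 50
for d in (2, 5, 10):
    print(f"d = {d}: {float(N) ** d:.1e} points")   # 2.5e+03, 3.1e+08, 9.8e+16
```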

This is the “curse of dimensionality”—a fundamental barrier for tabular DP. Overcoming it is a key subject of modern methods (approximate DP, reinforcement learning).

Approximate DP

1. Grid Approximation with Interpolation. Discretize $S$ on an irregular grid (denser in “interesting” regions). Interpolate $V$ linearly or with splines between nodes. Complexity decreases, but accuracy depends on the quality of the grid.
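A minimal sketch of one value-iteration sweep with linear interpolation between grid nodes (1-D state; the dynamics, reward, and discount factor are all illustrative):

```python
import numpy as np

grid = np.linspace(0.0, 10.0, 21)         # an irregular grid works the same way
actions = np.linspace(-1.0, 1.0, 9)
gamma = 0.95                              # discount, for an infinite-horizon variant

def f(x, u):                              # illustrative dynamics
    return np.clip(x + u, 0.0, 10.0)

def r(x, u):                              # illustrative reward
    return -(x - 5.0) ** 2 - 0.1 * u ** 2

def bellman_sweep(V_nodes):
    # One VFI sweep: V is known only at the nodes, so evaluate it at the
    # off-grid next state f(x, u) by linear interpolation (np.interp).
    V_new = np.empty_like(V_nodes)
    for i, x in enumerate(grid):
        V_new[i] = max(r(x, u) + gamma * np.interp(f(x, u), grid, V_nodes)
                       for u in actions)
    return V_new

V = np.zeros_like(grid)
for _ in range(300):                      # iterate until (approximately) converged
    V = bellman_sweep(V)
```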

2. Parametric Approximation (ADP/RL). Approximate $V^*$ by a parametric family $V(x; \theta)$: a neural network, Chebyshev polynomials, or radial basis functions. The parameters $\theta$ are trained to satisfy the Bellman equation via TD-learning:

$ L(\theta) = \sum_{(x, u, x')} [r(x, u) + V(x'; \theta^-) - V(x; \theta)]^2 \to \min, $

where the sum runs over observed transitions $x' = f(x, u)$ and $\theta^-$ is a periodically frozen copy of the parameters that stabilizes the target.

This is the basis of DQN (Deep Q-Network; Mnih et al., 2015), AlphaGo, and AlphaZero.
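A minimal sketch of the TD update behind this loss, using a tabular (one-hot linear) value function and a periodically frozen target copy $\theta^-$; the random-walk environment is illustrative, and real DQN learns $Q(x, u)$ from a replay buffer:

```python
import numpy as np

# Random-walk chain: states 0..10, terminals 0 and 10, reward 1 at the right end.
n = 11
theta = np.zeros(n)                     # one-hot linear V(x; theta) = theta[x]
theta_target = theta.copy()             # frozen copy theta^-
alpha = 0.1
rng = np.random.default_rng(0)

for step in range(50_000):
    x = rng.integers(1, n - 1)          # sample a non-terminal state
    x_next = x + rng.choice([-1, 1])    # one random-walk transition
    reward = 1.0 if x_next == n - 1 else 0.0
    done = x_next in (0, n - 1)
    target = reward + (0.0 if done else theta_target[x_next])
    theta[x] += alpha * (target - theta[x])   # semi-gradient step on the TD loss
    if step % 100 == 0:
        theta_target = theta.copy()     # periodically refresh theta^-

print(np.round(theta[1:-1], 2))          # should approach x/10 for x = 1..9
```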

3. Monte Carlo Tree Search (MCTS). Sample trajectories randomly from the current state and average the returns, growing a search tree toward promising branches. AlphaGo combines MCTS with neural-network approximation.
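The simplest member of this family, flat Monte Carlo without a tree: score each action by averaging the returns of random rollouts (a sketch; the model callables and horizon are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def rollout_value(x, step, reward, actions, horizon, n_rollouts=100):
    """Average return of random rollouts starting from state x."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = x, 0.0
        for _ in range(horizon):
            u = actions[rng.integers(len(actions))]   # random rollout policy
            ret += reward(s, u)
            s = step(s, u)
        total += ret
    return total / n_rollouts

def mc_action_selection(x, step, reward, actions, horizon):
    # Pick the action with the best one-step reward + rollout estimate.
    scores = [reward(x, u) + rollout_value(step(x, u), step, reward,
                                           actions, horizon) for u in actions]
    return actions[int(np.argmax(scores))]
```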

Direct Trajectory Optimization Methods

Transcription: Continuous problem $\to$ nonlinear program (NLP).

  1. Discretize time: $t_0, t_1, ..., t_N$ with step $\Delta t$.
  2. Discretize the dynamics (Euler, RK4): $x_{k+1} = x_k + \Delta t \cdot f(x_k, u_k)$.
  3. Form the objective: $\min \sum_k L(x_k, u_k) \cdot \Delta t + \varphi(x_N)$.
  4. Add the constraints: the initial condition on $x_0$, the terminal condition $x_N = x_T$, and the bounds $u_{\min} \leq u_k \leq u_{\max}$.

This results in an NLP in the variables $\{x_0, ..., x_N, u_0, ..., u_{N-1}\}$, a total of $(n + m) \cdot (N + 1) - m$ variables, plus the dynamics as equality constraints.

Solving the NLP: IPOPT (interior-point method) or SNOPT (SQP). For smooth problems these scale to thousands of variables.
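A minimal end-to-end transcription sketch on a toy double integrator ($\dot{x}_1 = x_2$, $\dot{x}_2 = u$), moving from rest at 0 to rest at 1 over $T = 1$ while minimizing $\int u^2 \, dt$; the problem and all constants are illustrative and deliberately not the rocket assignment below:

```python
# Direct transcription of a toy problem: double integrator, Euler dynamics,
# minimum-effort objective.
import numpy as np
from scipy.optimize import minimize

n, m, N, T = 2, 1, 20, 1.0                    # state dim, control dim, steps, horizon
dt = T / N

def unpack(z):
    x = z[: n * (N + 1)].reshape(N + 1, n)    # states x_0 .. x_N
    u = z[n * (N + 1):].reshape(N, m)         # controls u_0 .. u_{N-1}
    return x, u

def objective(z):                             # sum L(x_k, u_k) * dt with L = u^2
    _, u = unpack(z)
    return float(np.sum(u ** 2) * dt)

def dynamics(z):                              # Euler: x_{k+1} - x_k - dt * f = 0
    x, u = unpack(z)
    f = np.column_stack([x[:-1, 1], u[:, 0]]) # f(x, u) = (x2, u)
    return (x[1:] - x[:-1] - dt * f).ravel()

def boundary(z):                              # start at (0, 0), end at (1, 0)
    x, _ = unpack(z)
    return np.concatenate([x[0] - [0.0, 0.0], x[-1] - [1.0, 0.0]])

z0 = np.zeros(n * (N + 1) + m * N)            # (n + m)(N + 1) - m variables
bounds = [(None, None)] * (n * (N + 1)) + [(-10.0, 10.0)] * (m * N)
res = minimize(objective, z0, method="SLSQP", bounds=bounds,
               constraints=[{"type": "eq", "fun": dynamics},
                            {"type": "eq", "fun": boundary}])
print(res.success, round(res.fun, 3))
```

For this toy the continuous-time optimum is $u(t) = 6 - 12t$ with cost 12, so the reported objective should approach 12 as $N$ grows.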

Pseudospectral Methods: Approximate the trajectory by Chebyshev or Legendre polynomials. High accuracy for smooth solutions (exponential convergence). Packages: GPOPS-II, DIDO, PSOPT.

Comparison of Approaches

| Approach | Advantages | Disadvantages |
| --- | --- | --- |
| DP (VFI/PFI) | Global optimum, feedback $u^*(x, t)$ | Curse of dimensionality |
| PMP + boundary value problem | Efficient in low dimensions | Local optimum, no feedback |
| Direct methods (NLP) | Large problems, constraints | Local optimum, open-loop control |
| Approximate DP / RL | Complex environments, neural nets | Requires delicate tuning, no guarantees |

Real-World Applications

  • Mars Curiosity & Perseverance. Rover path planning: discrete DP on an altitude grid (DEM maps from orbit); the cost combines energy use and rollover risk.
  • Amazon Inventory Management. A multi-level model (supplier → warehouse → regional center → customer): a gigantic DP with an approximate value function.
  • Self-Driving Cars. MPC (Model Predictive Control) with a horizon of 3–5 seconds: at each step an NLP is solved to select the trajectory, at rates of 10–50 Hz.
  • Power Grids. Optimal control of EV charging in a smart grid: a stochastic DP with hundreds of thousands of agents, solved by approximate methods.

Assignment. “Soft landing” of a rocket: height $h(t)$, velocity $v(t)$, mass $m(t)$; the control is thrust $u(t) \in [0, u_{\max}]$. Equations: $\dot{h} = v$, $\dot{v} = -g + u/m$, $\dot{m} = -\alpha \cdot u$. Initial conditions: $h(0) = 1000$ m, $v(0) = -100$ m/s, $m(0) = 10000$ kg. Terminal conditions: $h(T) = 0$, $v(T) = 0$. Minimize fuel consumption $m(0) - m(T)$ for $g = 9.81$ m/s$^2$, $u_{\max} = 200{,}000$ N, $\alpha = 0.001$.

  (a) Discretize into $N = 100$ steps.
  (b) Solve the resulting NLP via scipy.optimize.minimize (method 'SLSQP').
  (c) Plot $h(t)$, $v(t)$, $m(t)$, $u(t)$.
  (d) Explain why the optimal strategy is close to bang-bang.
