Principle of Optimality and the Bellman Equation
Bellman’s Dynamic Programming
Richard Bellman in the 1950s proposed a radically different perspective on optimization: not “find the entire trajectory at once” (as in the Maximum Principle), but “solve the problem recursively — from right to left.” At its core lies a deep principle: the optimal plan “remembers” only the current state, not the path by which we arrived there. This is the principle of optimality, and it gives rise to the Bellman equation — a functional equation for the optimal value function. The entire modern theory of reinforcement learning, including AlphaGo and ChatGPT (RLHF), is built on this principle.
Bellman’s Principle of Optimality
Formulation: “An optimal policy has the property that, whatever the initial state and initial decision are, the remaining decisions constitute an optimal policy with regard to the state resulting from the first decision.”
More simply: any “tail part” of an optimal trajectory is itself optimal for the corresponding subproblem.
Formally. Define the optimal value function: $ V^*(x, t) = \max_{u(\cdot)\ \text{on}\ [t,T]} \left[ \int_t^T L(x, u, s)\ ds + \varphi(x(T)) \right], \quad \text{where}\ x(t) = x. $
Then for any $\tau \in (t, T)$: $ V^*(x, t) = \max_{u(\cdot)\ \text{on}\ [t, \tau]} \left[ \int_t^\tau L(x, u, s)\ ds + V^*(x(\tau), \tau) \right]. $
That is: we optimize the “short future” $[t, \tau]$, and thereafter use the already known optimal value $V^*(\cdot, \tau)$.
Corollary. The optimal control at instant $s$ depends only on the current state $x(s)$ and the time $s$, not on the history (the Markov property). This is a tremendous simplification: we seek a function $u^*(x, t)$, not a functional $u^*(\text{entire history})$.
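To see the “right to left” recursion in the simplest possible setting, here is a minimal backward-induction sketch in discrete time (the grid, horizon, and costs are illustrative choices, not taken from the article): dynamics $x_{k+1} = x_k + u_k$, stage cost $x_k^2 + u_k^2$, terminal cost $x_N^2$. The computed policy depends only on the current state and stage, exactly as the corollary states.

```python
# Backward induction on a toy discrete-time problem (illustrative numbers):
# dynamics x_{k+1} = x_k + u_k, stage cost x^2 + u^2, terminal cost x^2.
STATES = range(-3, 4)           # small integer state grid
CONTROLS = (-1, 0, 1)
N = 3                           # horizon (number of stages)

V = {x: x**2 for x in STATES}   # terminal condition V_N(x) = x^2
policy = []
for k in reversed(range(N)):    # solve "from right to left"
    V_new, pi = {}, {}
    for x in STATES:
        # only controls that keep the next state on the grid
        candidates = [(x**2 + u**2 + V[x + u], u)
                      for u in CONTROLS if x + u in V]
        V_new[x], pi[x] = min(candidates)   # pick the cheapest continuation
    V, policy = V_new, [pi] + policy

print(V[2])          # optimal cost-to-go from x_0 = 2
print(policy[0][2])  # first optimal decision at x_0 = 2: a function of state only
```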
Hamilton-Jacobi-Bellman Equation (HJB)
Let us take the principle of optimality with $\tau = t + \Delta t$ and let $\Delta t \to 0$. Expanding into a Taylor series, we obtain a partial differential equation:
$-\frac{\partial V^*}{\partial t} = \max_{u \in U} \left[ L(x, u, t) + \left(\frac{\partial V^*}{\partial x}\right)^\top f(x, u, t) \right]$.
Terminal condition: $V^*(x, T) = \varphi(x)$.
This is the Hamilton-Jacobi-Bellman (HJB) equation. In more detail: the left side is the “rate of change of the value over time”; the right side is the “optimal instantaneous rate of utility accumulation.” There is a deep connection between the Maximum Principle and HJB: the adjoint variable $\psi = \frac{\partial V^*}{\partial x}$ — the shadow price of the state.
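For completeness, here is the Taylor-expansion step behind the derivation (assuming $V^*$ is smooth): $ V^*(x, t) = \max_{u}\left[ L(x, u, t)\,\Delta t + V^*\big(x + f(x, u, t)\,\Delta t,\ t + \Delta t\big) \right] + o(\Delta t) = \max_{u}\left[ L\,\Delta t + V^* + \left(\frac{\partial V^*}{\partial x}\right)^{\top} f\,\Delta t + \frac{\partial V^*}{\partial t}\,\Delta t \right] + o(\Delta t). $ Canceling $V^*(x, t)$ on both sides, dividing by $\Delta t$, and letting $\Delta t \to 0$ yields exactly the HJB equation above.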
Remark. HJB is a nonlinear partial differential equation. Its solution, generally speaking, is not smooth — “kinks” are allowed. For a rigorous definition, viscosity solutions (Crandall-Lions, 1980s) are used.
Verification Theorem
Theorem: If $W(x, t)$ is a smooth solution to HJB with $W(x, T) = \varphi(x)$, and the control $u^*(x, t) = \arg\max_u [L + W_x \cdot f]$ is measurable and admissible, then $W = V^*$ (the optimal value function) and $u^*$ is the optimal policy.
This yields a practical recipe: “guess” the functional form of $V$ (for example, quadratic for the LQR), substitute into HJB, obtain an ODE for the coefficients, solve — and verification is complete.
Numerical Example: LQR with Finite Horizon
Problem: $\min \int_0^T (x^2 + u^2)\,dt + x(T)^2,\ \dot{x} = -x + u.$
HJB: $-V_t = \min_u[x^2 + u^2 + V_x \cdot (-x + u)].$ Minimizing over $u$: $2u + V_x = 0 \implies u^* = -V_x/2$.
Substituting: $-V_t = x^2 - V_x^2/4 - x V_x$.
Trial form: $V(x, t) = P(t)\,x^2$. Then $V_x = 2P x$, $V_t = \dot{P} x^2$. The equation: $ -\dot{P} x^2 = x^2 - P^2 x^2 - 2P x^2 \implies \dot{P} = P^2 + 2P - 1. $
ODE for $P$ with condition $P(T) = 1$ (from $V(x, T) = x^2$). This is a Riccati equation backward in time!
Numerical integration ($T = 1$) gives $P(0) \approx 0.44$. The optimal control $u^*(x, t) = -P(t)\,x$ is a linear feedback with a time-varying coefficient.
As $T \to \infty$: $P(t) \to P_\infty = \frac{-2 + \sqrt{8}}{2} = \sqrt{2} - 1 \approx 0.414$, which coincides with the algebraic Riccati solution (infinite horizon).
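A minimal sketch of this backward integration (explicit Euler with a small step is my choice of scheme; only the Riccati equation and the terminal condition come from the example):

```python
import math

# Riccati ODE from the quadratic ansatz V(x, t) = P(t) x^2:
#   dP/dt = P^2 + 2P - 1,  with terminal condition P(T) = 1.
T, dt = 1.0, 1e-4
P = 1.0                             # P(T) = 1, from V(x, T) = x^2
for _ in range(int(T / dt)):
    P -= dt * (P**2 + 2 * P - 1)    # one Euler step backward in time
print(round(P, 3))                  # P(0) ~ 0.443
print(round(math.sqrt(2) - 1, 3))   # steady state ~ 0.414
```

Increasing $T$ drives $P(0)$ toward the steady state $\sqrt{2} - 1$, as the infinite-horizon limit predicts.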
Connection with the Maximum Principle
If $V^*$ is smooth, then $\psi = \frac{\partial V^*}{\partial x}$, and the HJB equation and the Maximum Principle yield the same optimal $u^*$ (a numerical check of this correspondence follows the list below). The difference is in approach:
- Maximum Principle: a boundary value problem (forward for $x$, backward for $\psi$) — efficient for a small number of states.
- HJB: PDE for $V^*(x, t)$ — efficient with complex constraints, but suffers from the curse of dimensionality for large $\dim(x)$.
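A rough numerical check of the $\psi = \frac{\partial V^*}{\partial x}$ correspondence on the $T = 1$ example above (the Euler discretization and step count are my choices): integrate $P$ backward, simulate $x$ forward under $u^* = -P x$, integrate the adjoint equation $\dot{\psi} = -\partial H / \partial x = \psi - 2x$ backward from $\psi(T) = 2x(T)$, and compare $\psi$ with $2Px$.

```python
import numpy as np

T, n = 1.0, 1000
dt = T / n

# 1) Riccati backward: dP/dt = P^2 + 2P - 1, P(T) = 1.
P = np.empty(n + 1)
P[n] = 1.0
for k in range(n, 0, -1):
    P[k - 1] = P[k] - dt * (P[k]**2 + 2 * P[k] - 1)

# 2) State forward under the feedback u = -P(t) x, from x(0) = 1.
x = np.empty(n + 1)
x[0] = 1.0
for k in range(n):
    x[k + 1] = x[k] + dt * (-x[k] - P[k] * x[k])

# 3) Adjoint backward: dpsi/dt = psi - 2x, psi(T) = 2 x(T).
psi = np.empty(n + 1)
psi[n] = 2 * x[n]
for k in range(n, 0, -1):
    psi[k - 1] = psi[k] - dt * (psi[k] - 2 * x[k])

print(np.max(np.abs(psi - 2 * P * x)))  # small: psi(t) = 2 P(t) x(t)
```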
Real Applications
- Finance. Merton’s problem (optimal consumption and investment) was solved explicitly via HJB in 1969: the policy “invest a fixed share $\pi^* = \frac{\mu - r}{\sigma^2 \gamma}$ in risky assets” was obtained.
- Inventory management. The Scarf and Arrow-Harris models use HJB to find the optimal $(s, S)$ policy: “if inventory $< s$, order up to $S$.”
- Advertising and marketing. Dynamic allocation of budget between advertising channels — HJB problem with empirically calibrated response models.
- Reinforcement Learning. The Bellman equation is the basis of Q-learning, DQN, AlphaGo. In each episode, the agent updates $V$ (or $Q$) according to the discrete-time version of the HJB.
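To make the last point concrete, here is a minimal value-iteration sketch on a hypothetical five-state chain (the environment, reward, and discount factor are illustrative assumptions, not anything from AlphaGo or DQN); each sweep applies the discrete-time Bellman update $V(s) \leftarrow \max_a [r(s, a) + \gamma V(s')]$.

```python
# Hypothetical chain: states 0..4, actions move left/right (clipped at the
# ends), reward 1 for arriving at state 4, discount gamma = 0.9.
GAMMA, N_STATES = 0.9, 5

def step(s, a):  # deterministic transition; a is -1 (left) or +1 (right)
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

V = [0.0] * N_STATES
for _ in range(100):               # iterate the Bellman operator to a fixed point
    V_new = []
    for s in range(N_STATES):
        best = float("-inf")
        for a in (-1, 1):
            s2, r = step(s, a)
            best = max(best, r + GAMMA * V[s2])
        V_new.append(best)
    V = V_new

print([round(v, 2) for v in V])    # values increase toward the rewarding state
```

Q-learning replaces these full sweeps with sampled updates of $Q(s, a)$ along observed trajectories, but the fixed point it seeks is the same Bellman equation.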
Assignment. LQR finite-horizon problem: $\min \int_0^T (x^2 + u^2)\,dt + x(T)^2,\ \dot{x} = -x + u,\ T = 2$. (a) Write the HJB. (b) Use the trial form $V(x, t) = P(t)\,x^2$; find the ODE for $P(t)$ and solve numerically backward in time with step $dt = 0.01$. (c) Find the optimal $u^*(x, t) = -P(t)\,x$. (d) Simulate $x(t)$ at $x(0) = 1$ and compare with the open-loop solution $u \equiv 0$. (e) Confirm the equality $\psi(t) = 2P(t)\,x(t)$ with the solution of the adjoint system from the Maximum Principle.