
Reinforcement Learning and Differential Games

Stochastic Differential Games


When Analytic Solutions Are Not Enough

Differential games provide an elegant theory, but in practice analytic solutions of the HJI equation are available only in a few special cases (LQ games, problems with simple dynamics). Real tasks involve nonlinear dynamics, high dimensionality, and an unknown system model. Here reinforcement learning (RL) comes to the rescue: agents learn to act optimally by interacting with the environment, without explicit knowledge of the dynamics. Multi-agent RL generalizes this to the game setting.

Game RL: Formulation

Markov game: the discrete-time counterpart of the continuous-time differential game.

  • State $s \in S$
  • Actions of agents: $a_1 \in A_1$, $a_2 \in A_2$
  • Transition: $P(s' \mid s, a_1, a_2)$
  • Rewards: $r_1(s, a_1, a_2)$, $r_2(s, a_1, a_2)$

Agent $i$ wants to maximize the discounted sum of rewards: $\sum_t \gamma^t r_{i,t}$.

Non-stationarity problem: if agent 1 updates its policy, the environment from the viewpoint of agent 2 changes. The "target" moves—Q-learning convergence is not guaranteed!
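To make the formulation concrete, here is a minimal Python sketch of a two-player Markov game with a stochastic transition $P(s' \mid s, a_1, a_2)$ and per-agent rewards. The 1-D pursuit-evasion setting, grid size, and slip noise are illustrative assumptions, not part of the definition above.

```python
import numpy as np

class PursuitEvasionGame:
    """Toy two-player Markov game on a 1-D grid (illustrative assumptions:
    grid size, slip noise, and zero-sum distance rewards are made up here)."""

    def __init__(self, size=10, noise=0.1, seed=0):
        self.size = size
        self.noise = noise                      # probability that an action "slips"
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.s = (0, self.size - 1)             # s = (pursuer position, evader position)
        return self.s

    def step(self, a1, a2):
        """a1, a2 in {-1, 0, +1}; returns (s', r1, r2)."""
        p, e = self.s
        # stochastic transition P(s' | s, a1, a2): each action slips with prob. noise
        if self.rng.random() < self.noise:
            a1 = self.rng.choice([-1, 0, 1])
        if self.rng.random() < self.noise:
            a2 = self.rng.choice([-1, 0, 1])
        p = int(np.clip(p + a1, 0, self.size - 1))
        e = int(np.clip(e + a2, 0, self.size - 1))
        self.s = (p, e)
        dist = abs(p - e)
        # zero-sum rewards: the pursuer minimizes the distance, the evader maximizes it
        return self.s, -dist, dist
```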

Independent Q-learning (IQL)

Each agent $i$ learns its own Q-function $Q_i(s, a_i)$ independently, ignoring the actions of others.

Update: $Q_i(s, a_i) \leftarrow (1-\alpha)\, Q_i(s, a_i) + \alpha \left[ r_i + \gamma \max_{a_i'} Q_i(s', a_i') \right]$.

Advantages: simplicity, scalability to many agents.

Drawbacks: no theoretical convergence guarantees, since the environment is non-stationary from each agent's viewpoint. It often works in practice nonetheless!
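A minimal tabular sketch of the IQL update for one agent is given below; the learning rate, discount, exploration scheme, and the tiny two-state example are illustrative choices.

```python
import numpy as np

def iql_update(Q, s, a_i, r_i, s_next, alpha=0.1, gamma=0.95):
    """One independent Q-learning step for agent i. Q maps state -> array of
    action values; the other agents' actions are simply ignored, i.e. folded
    into the (non-stationary) environment."""
    target = r_i + gamma * np.max(Q[s_next])
    Q[s][a_i] = (1 - alpha) * Q[s][a_i] + alpha * target

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    """Simple exploration policy used while the table is being learned."""
    if np.random.random() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

# Tiny usage example with made-up states 0 and 1 and 3 actions
Q = {0: np.zeros(3), 1: np.zeros(3)}
a = epsilon_greedy(Q, 0, n_actions=3)
iql_update(Q, s=0, a_i=a, r_i=1.0, s_next=1)
```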

MADDPG (Multi-Agent DDPG)

Key idea: centralized training, decentralized execution (CTDE).

During training: critic $Q_i(x, a_1,...,a_N)$ sees all states and actions. Actor $\mu_i(o_i)$ sees only its own observation. The critic provides a "stable" quality estimate.

During execution: each agent acts only based on its own $o_i$ — decentralized.

Critic update: $L_i = E[(Q_i(x,a) - y_i)^2]$, $y_i = r_i + \gamma Q_i'(x', a_1',...,a_N')|_{a_j' = \mu_j'(o_j')}$.

Actor update: $\nabla_{\theta_i} J = E\left[\nabla_{\theta_i} \mu_i(o_i)\, \nabla_{a_i} Q_i(x, a_1, \ldots, a_N) \big|_{a_i = \mu_i(o_i)}\right]$.

Advantage: the critic "sees" the full picture → stable learning. In NE: $Q_i(x, a^*)$ accurately estimates the equilibrium value.
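Below is a minimal PyTorch sketch of one MADDPG update step under CTDE. The number of agents, dimensions, network sizes, and the dummy batch are illustrative assumptions; target networks and the replay buffer are omitted for brevity.

```python
import torch
import torch.nn as nn

# Illustrative assumptions: 2 agents, 4-D observations, 2-D continuous actions.
N_AGENTS, OBS_DIM, ACT_DIM = 2, 4, 2
GLOBAL_DIM = N_AGENTS * OBS_DIM                 # x = all observations concatenated

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

# Decentralized actors mu_i(o_i); centralized critics Q_i(x, a_1, ..., a_N)
actors  = [nn.Sequential(mlp(OBS_DIM, ACT_DIM), nn.Tanh()) for _ in range(N_AGENTS)]
critics = [mlp(GLOBAL_DIM + N_AGENTS * ACT_DIM, 1) for _ in range(N_AGENTS)]
actor_opts  = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in actors]
critic_opts = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]

def maddpg_step(obs, acts, rews, next_obs, gamma=0.95):
    """One CTDE update: centralized critic loss, then decentralized actor loss.
    Target networks are replaced by the current critics for brevity."""
    x, x2 = torch.cat(obs, dim=-1), torch.cat(next_obs, dim=-1)
    a = torch.cat(acts, dim=-1)
    with torch.no_grad():                       # a'_j = mu_j(o'_j)
        a2 = torch.cat([actors[j](next_obs[j]) for j in range(N_AGENTS)], dim=-1)
    for i in range(N_AGENTS):
        # Critic update: L_i = E[(Q_i(x, a) - y_i)^2]
        y = rews[i] + gamma * critics[i](torch.cat([x2, a2], dim=-1))
        q = critics[i](torch.cat([x, a], dim=-1))
        critic_loss = ((q - y.detach()) ** 2).mean()
        critic_opts[i].zero_grad(); critic_loss.backward(); critic_opts[i].step()
        # Actor update: ascend Q_i with a_i replaced by mu_i(o_i)
        a_mix = [actors[j](obs[j]) if j == i else acts[j].detach() for j in range(N_AGENTS)]
        actor_loss = -critics[i](torch.cat([x] + a_mix, dim=-1)).mean()
        actor_opts[i].zero_grad(); actor_loss.backward(); actor_opts[i].step()

# Dummy batch of 32 transitions, just to show the expected shapes
obs      = [torch.randn(32, OBS_DIM) for _ in range(N_AGENTS)]
acts     = [torch.randn(32, ACT_DIM) for _ in range(N_AGENTS)]
rews     = [torch.randn(32, 1) for _ in range(N_AGENTS)]
next_obs = [torch.randn(32, OBS_DIM) for _ in range(N_AGENTS)]
maddpg_step(obs, acts, rews, next_obs)
```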

Self-Play and Convergence to Nash

Self-play: the agent plays against a copy of itself. With a proper implementation (for example, averaging over past policies), it converges to NE in two-player zero-sum games (chess, Go).

AlphaGo/AlphaZero: pure self-play + MCTS (Monte Carlo Tree Search) + deep neural networks. It reaches superhuman level in Go and chess, in effect solving a huge discrete analogue of a differential game.

League Training (AlphaStar): for StarCraft II, a game with substantial imperfect information. The "league" is a population of heterogeneous past agent versions, which prevents cyclic exploitation ($A$ beats $B$, $B$ beats $C$, $C$ beats $A$).
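To see in miniature why self-play can reach NE in a two-player zero-sum game, here is a self-contained sketch of fictitious self-play on matching pennies, a toy game chosen purely for illustration (AlphaZero's self-play with MCTS and neural networks is far more elaborate). Each player repeatedly best-responds to the opponent's empirical mixture of past actions, and the empirical averages converge to the equilibrium mixed strategy (0.5, 0.5).

```python
import numpy as np

# Payoff matrix of player 1 in matching pennies; player 2 receives -A (zero-sum).
A = np.array([[ 1.0, -1.0],
              [-1.0,  1.0]])

def fictitious_self_play(A, n_iters=10000):
    """Each player best-responds to the opponent's empirical action frequencies.
    In two-player zero-sum games these frequencies converge to a Nash equilibrium."""
    counts1 = np.ones(A.shape[0])                # action counts of player 1
    counts2 = np.ones(A.shape[1])                # action counts of player 2
    for _ in range(n_iters):
        p1 = counts1 / counts1.sum()             # empirical mixed strategy of P1
        p2 = counts2 / counts2.sum()             # empirical mixed strategy of P2
        counts1[np.argmax(A @ p2)] += 1          # P1 maximizes its expected payoff
        counts2[np.argmin(p1 @ A)] += 1          # P2 minimizes P1's expected payoff
    return counts1 / counts1.sum(), counts2 / counts2.sum()

p1, p2 = fictitious_self_play(A)
print(p1, p2)   # both approach the NE mixture [0.5, 0.5]
```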

Connection to the HJI Equation via Actor-Critic

Continuous actor-critic $\leftrightarrow$ differential game:

Critic $\approx V(s,t)$—value function (approximation of HJI solution). Actor $\approx u^*(s,t)$—optimal feedback strategy.

Policy gradient for games: $\nabla_\theta E[J] = E\left[\sum_t \nabla_\theta \log \pi_\theta(a_{i,t}|o_{i,t}) \cdot A_{i,t}\right]$, where $A_{i,t} = Q_i(s_t, a_t) - V_i(s_t)$ is the advantage.

This is an approximate gradient of the game value with respect to policy parameters.
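As a minimal single-agent PyTorch sketch of this advantage-based gradient (dimensions, network sizes, and the estimate of $A$ as return $- V(s)$ are simplifying assumptions; the PPO-style clipping used in practice is omitted):

```python
import torch
import torch.nn as nn

# Illustrative dimensions: 4-D observations, 3 discrete actions.
OBS_DIM, N_ACTIONS = 4, 3
policy = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
value  = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

def policy_gradient_loss(obs, actions, returns):
    """Advantage-based policy gradient for one agent: the advantage A = Q - V
    is estimated here as (return - V(s)), a common simplification."""
    log_prob = torch.distributions.Categorical(logits=policy(obs)).log_prob(actions)
    advantage = returns - value(obs).squeeze(-1)             # A_{i,t}
    actor_loss  = -(log_prob * advantage.detach()).mean()    # ascend E[log pi * A]
    critic_loss = advantage.pow(2).mean()                    # fit V to the returns
    return actor_loss + critic_loss

# Dummy batch of 32 transitions, just to show the expected shapes
obs     = torch.randn(32, OBS_DIM)
actions = torch.randint(0, N_ACTIONS, (32,))
returns = torch.randn(32)
policy_gradient_loss(obs, actions, returns).backward()
```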

MAPPO (Multi-Agent PPO): an extension of PPO to $N$ agents with separate critics. A de facto standard in modern MARL benchmarks (StarCraft, Google Research Football, traffic optimization).

Full Analysis: MADDPG Convergence on a Cooperation Task

Task: 2 agents whose goal is to meet at the point (5, 5). Joint reward: $r = -\|x_1 - \text{goal}\| - \|x_2 - \text{goal}\|$.

Each agent chooses acceleration (2D action). Observation: its own position.

MADDPG training (1000 episodes):

  • Episodes 1-100: agents move randomly, average reward $\approx -15$
  • Episodes 100-500: agents start moving toward the center, but by different paths, $\approx -8$
  • Episodes 500-1000: convergence to a "rendezvous" strategy, $\approx -2$

Nash interpretation: in NE both agents move directly to (5,5)—neither can improve their result unilaterally. MADDPG finds this NE through interaction.
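The article specifies only the joint reward and the 2-D acceleration actions, so the environment sketch below fills in the rest with assumptions (double-integrator dynamics, a 0.1 s step, zero initial positions and velocities):

```python
import numpy as np

GOAL = np.array([5.0, 5.0])    # meeting point from the task description
DT = 0.1                       # integration step (an assumption)

class RendezvousEnv:
    """Two point-mass agents driven by 2-D accelerations; both receive the
    joint reward r = -||x1 - goal|| - ||x2 - goal|| from the task above."""

    def reset(self):
        self.pos = np.zeros((2, 2))   # rows: positions of agent 1 and agent 2
        self.vel = np.zeros((2, 2))   # corresponding velocities
        return self.pos.copy()        # each agent observes only its own row

    def step(self, a1, a2):
        acc = np.vstack([a1, a2])
        self.vel += DT * acc                       # double-integrator dynamics
        self.pos += DT * self.vel
        r = -np.linalg.norm(self.pos[0] - GOAL) - np.linalg.norm(self.pos[1] - GOAL)
        return self.pos.copy(), r, r               # shared cooperative reward

# Usage: both agents accelerate toward the goal for one step
env = RendezvousEnv()
obs = env.reset()
obs, r1, r2 = env.step(np.array([1.0, 1.0]), np.array([1.0, 1.0]))
```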

Multi-Agent Reinforcement Learning (MARL)

When the environment contains other learning agents, classical Q-learning (one agent against a stationary environment) breaks down: the "environment" is non-stationary because the other agents are learning too. This is where the connection to the theory of differential games arises.

MARL Algorithms

  • Independent Q-learning: each agent learns independently. Simple, but with no convergence guarantees.
  • MADDPG (Multi-Agent DDPG, Lowe et al., 2017): centralized training, decentralized execution. The critic sees all states and actions during training; the actor uses only its own observation during execution.
  • Nash-Q learning: explicit search for Nash equilibrium at each step.
  • PSRO (Policy-Space Response Oracles): iteratively expands the policy population, seeks the best response to the current mixture.
  • MFRL (Mean Field RL): application of mean field approximation to large $N$-agent games.

Connection to Differential Games

In the limit of continuous time and states, RL for games approximates the solution of the HJI equation: the Q-function approximates $V$, the value function of the game. This is the rationale for why RL can find equilibria in complex games:

  • AlphaZero—analogous to solving a zero-sum game (chess, Go) via self-play
  • AlphaStar—partial information (StarCraft II) via league training (population-based self-play)
  • OpenAI Five—cooperation within a team versus a team (Dota 2)

Open Problems

  • Guarantees of MARL convergence in the general case
  • Scaling up to thousands of agents
  • Explainability of strategies
  • Safety (avoiding adversarial exploitation)

Applications

MARL and differential games are applied in trading bots (HFT, market making), drone-swarm control, coordination of autonomous vehicles, power-grid balancing, and simulator-based training of military tactics.
