Reinforcement Learning
Modern Machine Learning Methods
Reinforcement Learning (RL)
Reinforcement learning is a paradigm in which an agent learns to act in an environment by receiving reward signals for its decisions. Unlike in supervised learning, there are no labeled correct answers: the agent must discover a good strategy by interacting with the environment. It is RL that enabled computers to defeat a world Go champion (AlphaGo, 2016), master Atari video games, and control industrial robots.
Markov Decision Process (MDP)
The formal foundation of RL is the MDP, a tuple (S, A, P, R, γ), where:
- S — state space: possible situations in which the agent may find itself
- A — action space: what the agent can do
- P(s'|s, a) — probability of transition from state s to s' under action a
- R(s, a, s') — immediate reward upon transition
- γ ∈ [0, 1) — discount factor for future rewards (e.g., γ = 0.99 means future rewards count almost as much as immediate ones)
The agent's goal is to find a policy π(a|s) (the probability of taking action a in state s) that maximizes the expected discounted sum of rewards: max_π E[Σₜ γᵗ rₜ].
State value function: V^π(s) = E_π[Σₜ γᵗrₜ | s₀ = s] — expected sum of rewards starting from s and following policy π.
Q-function (action-value function): Q^π(s,a) = E_π[Σₜ γᵗrₜ | s₀=s, a₀=a] — value of taking action a in state s. Knowing Q*, the optimal policy is trivial: π*(s) = argmax_a Q*(s,a).
Bellman equations: V^π(s) = Σₐ π(a|s)[R(s,a) + γ Σ_{s'} P(s'|s,a) V^π(s')], where R(s,a) = Σ_{s'} P(s'|s,a) R(s,a,s') is the expected immediate reward. This is a system of equations linking the value of a state to the values of its successor states.
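Below is a minimal sketch of how these equations become an algorithm: value iteration on a tiny invented two-state MDP (every number in P and R is illustrative, not taken from the text).

```python
import numpy as np

# Toy MDP (invented for illustration): 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1 under actions 0, 1
])
R = np.array([
    [1.0, 0.0],                 # rewards in state 0 for actions 0, 1
    [0.0, 2.0],                 # rewards in state 1 for actions 0, 1
])
gamma = 0.9

V = np.zeros(2)
for _ in range(200):
    # Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * P @ V        # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)        # greedy policy pi*(s) = argmax_a Q*(s, a)
print("V* =", V, "pi* =", policy)
```

Once the backup has converged, reading off the greedy policy from Q gives π*, exactly as in the argmax formula above.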
Q-learning and DQN
Q-learning (Watkins, 1989) is a model-free method: it learns without an explicit model of the environment. The temporal-difference (TD) update:
Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
Here: α is the learning rate, r is the received reward, s' is the new state, [r + γ max Q(s',a') − Q(s,a)] is the TD error (how much the prediction differed from reality). Q-learning converges to Q* if sufficient exploration takes place.
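In code, a single tabular Q-learning step might look like the sketch below (a Gym-style `env.step` interface and an ε-greedy behavior policy are assumptions, not part of the text).

```python
import numpy as np

def q_learning_step(Q, s, env, alpha=0.1, gamma=0.99, eps=0.1, rng=np.random):
    """One tabular Q-learning step on a Gym-style environment (interface assumed)."""
    # epsilon-greedy exploration
    if rng.random() < eps:
        a = rng.randint(Q.shape[1])        # random action
    else:
        a = int(np.argmax(Q[s]))           # greedy action
    s_next, r, done, _ = env.step(a)       # assumed Gym-style step signature
    # TD target: just r at a terminal state, otherwise r + gamma * max_a' Q(s', a')
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])  # TD update with TD error in parentheses
    return s_next, done
```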
DQN — Deep Q-Network (Mnih et al., DeepMind, 2015): Q(s,a;θ) is a neural network with parameters θ. Two key stabilization tricks:
- Experience replay: the agent stores transitions (s, a, r, s') in a buffer of size N. Training is done on random minibatches sampled from the buffer, which breaks temporal correlations between consecutive transitions and stabilizes learning.
- Target network: a separate network with “frozen” parameters θ⁻ is used to compute target values y = r + γ max_{a'} Q(s', a'; θ⁻). Parameters θ⁻ are updated infrequently, providing stable targets for learning.
DQN with raw Atari pixels surpassed human-level performance in 29 out of 49 games.
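The sketch below illustrates both tricks with a small PyTorch network; the architecture, buffer size, and hyperparameters are illustrative assumptions, not the DeepMind setup, and the buffer is assumed to already hold enough transitions (states stored as arrays of floats).

```python
import random
from collections import deque
import torch
import torch.nn as nn

buffer = deque(maxlen=100_000)                       # replay buffer of (s, a, r, s', done)

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))        # Q(s, ·; theta)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # Q(s, ·; theta^-)
target_net.load_state_dict(q_net.state_dict())       # start with identical weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(batch_size=32):
    batch = random.sample(buffer, batch_size)        # random minibatch breaks temporal correlations
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    s, s2 = s.float(), s2.float()
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    with torch.no_grad():                            # frozen target network gives stable targets
        y = r.float() + gamma * (1 - done.float()) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# every C updates: target_net.load_state_dict(q_net.state_dict())
```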
Policy Gradient Methods
REINFORCE (Williams, 1992): Update θ ← θ + α Σₜ ∇_θ log π_θ(aₜ|sₜ) Gₜ, where Gₜ = Σ_{k≥t} γᵏ⁻ᵗ rₖ is the discounted sum of rewards from time t onward. Logic: make good actions (high G) more likely, and bad ones less likely.
Problem: the gradient estimate has high variance. Solution: subtract a baseline V(sₜ), forming the advantage Aₜ = Gₜ − V(sₜ). Actor-Critic: the actor updates the policy π_θ, while the critic estimates V_φ and is trained by minimizing the MSE between its predictions and the observed returns.
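A sketch of the resulting actor-critic loss for a single episode (PyTorch assumed; `log_probs`, `rewards`, and `values` are presumed to have been collected while running the current policy).

```python
import torch
import torch.nn.functional as F

def actor_critic_loss(log_probs, rewards, values, gamma=0.99):
    """REINFORCE-with-baseline loss for one episode.
    log_probs: list of log pi_theta(a_t|s_t) tensors (with grad),
    rewards:   list of floats r_t,
    values:    list of critic outputs V_phi(s_t) (with grad)."""
    returns, G = [], 0.0
    for r in reversed(rewards):              # G_t = r_t + gamma * G_{t+1}
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    values = torch.stack(values).reshape(-1)
    advantages = returns - values.detach()   # A_t = G_t - V(s_t); the baseline lowers variance
    actor_loss = -(torch.stack(log_probs) * advantages).sum()   # maximize sum_t log pi * A_t
    critic_loss = F.mse_loss(values, returns)                   # fit V_phi to the returns
    return actor_loss + critic_loss
```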
PPO (Proximal Policy Optimization, Schulman et al., 2017):
L_CLIP(θ) = E[min(rₜ(θ) Aₜ, clip(rₜ(θ), 1−ε, 1+ε) Aₜ)]
Here, rₜ(θ) = π_θ(aₜ|sₜ)/π_{θ_old}(aₜ|sₜ) is the ratio of the action probabilities under the new and old policies. Clipping prevents excessively large updates from destabilizing learning. PPO is robust and widely used; for example, the RLHF fine-tuning behind ChatGPT uses PPO.
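The clipped objective itself is only a few lines; the sketch below assumes PyTorch and takes the log-probabilities and advantages as given.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate loss (returned with a minus sign, to be minimized)."""
    ratio = torch.exp(logp_new - logp_old)               # r_t(theta) = pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()         # negative because optimizers minimize
```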
Numerical Example: Grid World
A 3×3 grid; the agent starts at (0,0), the goal is (2,2). Reward +10 for reaching the goal, −1 for each step, γ = 0.9. After Q-learning training, the optimal policy is to always move down or right (a path of 4 steps). Value of the starting position: V((0,0)) = Σ_{t=0}^{3} 0.9ᵗ·(−1) + 0.9³·10 ≈ −3.44 + 7.29 ≈ 3.85 (the +10 arrives on the fourth step, so it is discounted by γ³, while each step cost is discounted by γᵗ).
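This example can be reproduced with a short tabular Q-learning script; the environment below is a minimal sketch of the grid described above (the bump-into-wall handling and the hyperparameters are assumptions).

```python
import numpy as np

N, GOAL, gamma, alpha, eps = 3, (2, 2), 0.9, 0.5, 0.1
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]             # up, down, left, right
Q = np.zeros((N, N, len(actions)))
rng = np.random.default_rng(0)

def step(s, a):
    r_, c_ = s[0] + actions[a][0], s[1] + actions[a][1]
    if not (0 <= r_ < N and 0 <= c_ < N):                 # bump into a wall: stay put
        r_, c_ = s
    s_next = (r_, c_)
    if s_next == GOAL:
        return s_next, 10 - 1, True                        # +10 for the goal, -1 step cost
    return s_next, -1, False

for episode in range(2000):
    s, done = (0, 0), False
    while not done:
        a = rng.integers(4) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])               # TD update
        s = s_next

print("V(0,0) ≈", np.max(Q[0, 0]))                          # close to the hand-computed ≈ 3.85
```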
Applications
Robot control (Boston Dynamics), optimization of Google data center cooling (DeepMind reported a ~40% reduction in cooling energy), stock trading, ad-bidding optimization, energy system management, and games (AlphaGo, AlphaZero, OpenAI Five for Dota 2).
Assignment: Implement Q-learning for CartPole (pole balancing). The state consists of 4 variables (cart position, cart velocity, pole angle, pole angular velocity); discretize each into 6 bins, giving a Q-table of size 6⁴ × 2. Train for 5000 episodes. Use ε-greedy exploration with ε decaying from 1 to 0.01. Plot the average reward per episode (100-episode moving average). Compare with DQN on the same task.
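As a possible starting point for the discretization step, the sketch below assumes Gymnasium's CartPole-v1; the bin edges are illustrative and will likely need tuning.

```python
import numpy as np
import gymnasium as gym   # Gymnasium assumed (pip install gymnasium)

env = gym.make("CartPole-v1")
# Illustrative ranges for (cart position, cart velocity, pole angle, pole angular velocity);
# the velocities are formally unbounded, so they are clipped to a hand-picked range.
lows  = np.array([-2.4, -3.0, -0.21, -3.0])
highs = np.array([ 2.4,  3.0,  0.21,  3.0])
n_bins = 6

def discretize(obs):
    """Map a continuous observation to a tuple of 4 bin indices in [0, n_bins - 1]."""
    ratios = (np.clip(obs, lows, highs) - lows) / (highs - lows)
    return tuple(np.minimum((ratios * n_bins).astype(int), n_bins - 1))

Q = np.zeros((n_bins,) * 4 + (env.action_space.n,))   # 6^4 states x 2 actions
obs, _ = env.reset()
print(discretize(obs), Q.shape)                        # e.g. (2, 3, 3, 2) and (6, 6, 6, 6, 2)
```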