Module V·Article II·~5 min read

Learning in Games and Behavioral Game Theory


How Do Players "Find" Equilibrium?

Theory states that in Nash equilibrium, no one benefits from deviating. But how do players end up at equilibrium? It is unlikely that everyone solves the system of equations in advance. The real process is adaptive learning: observing the past, correcting strategies, and gradually approaching equilibrium.

Learning models drop the assumption of instantly rational agents. In its place is an iterative process, which may or may not converge to equilibrium.

Fictitious Play

Rule: In each period t, player i: (1) observes the frequencies of past actions of opponents; (2) forms beliefs as empirical frequencies; (3) chooses the best response to these beliefs.

Formally: let $n_j^t(s)$ be the number of times $j$ played $s$ before $t$. Belief: $\pi_j^t(s) = n_j^t(s)/t$. Choice: $a_i^t = \arg\max_s u_i(s, \pi_{-i}^t)$.

Convergence: FP converges to equilibrium in: (1) zero-sum games (Robinson, 1951); (2) potential games; (3) $2\times 2$ games with common interests. Does not converge in the general case (Shapley’s example, 1964: cycle in $3\times 3$).

Numerical Example ($2\times 2$, fictitious play): Coordination matrix:

      L      R
T  (2,2)  (0,0)
B  (0,0)  (1,1)

Start: Player 1 plays T, Player 2 plays L. $t=1$: $u_1(T|L)=2 > u_1(B|L)=0 \rightarrow T$; $u_2(L|T)=2 > u_2(R|T)=0 \rightarrow L$. Both keep (T,L). The process immediately converges to the equilibrium (T,L) with payoffs (2,2): the payoff-dominant equilibrium wins in the dynamics.
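The dynamics above can be sketched in a few lines. This is an illustrative simulation, with the assumptions that argmax ties break toward the first action and play starts at (T, L):

```python
# Fictitious play on the coordination game above.
import numpy as np

U1 = np.array([[2.0, 0.0], [0.0, 1.0]])  # row player's payoffs (rows: T, B; cols: L, R)
U2 = np.array([[2.0, 0.0], [0.0, 1.0]])  # column player's payoffs (same: common interests)
A1, A2 = ["T", "B"], ["L", "R"]

def fictitious_play(T=10, start=(0, 0)):
    counts = [np.zeros(2), np.zeros(2)]  # empirical action counts for each player
    a1, a2 = start
    history = []
    for _ in range(T):
        counts[0][a1] += 1
        counts[1][a2] += 1
        history.append((A1[a1], A2[a2]))
        # Beliefs are the empirical frequencies of the opponent's past actions.
        belief_about_2 = counts[1] / counts[1].sum()
        belief_about_1 = counts[0] / counts[0].sum()
        a1 = int(np.argmax(U1 @ belief_about_2))  # player 1's best response
        a2 = int(np.argmax(belief_about_1 @ U2))  # player 2's best response
    return history

print(fictitious_play())  # (T, L) in every period: immediate convergence
```

Rerunning with other starting points shows how the initial conditions select which equilibrium the process locks onto.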

Regret Minimization

Regret: After $T$ periods, regret for strategy $s$:

$ R^T_i(s) = \sum_{t=1}^{T} u_i(s, a^t_{-i}) - \sum_{t=1}^{T} u_i(a^t) $

Regret = how much better it would have been to always play $s$ instead of what was actually played. External regret: $R^T_i = \max_s R^T_i(s)$.
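This definition is easy to compute from a play history. A minimal sketch of average external regret (cumulative regret divided by $T$); the game matrix and the history below are illustrative, not from the text:

```python
import numpy as np

U1 = np.array([[2.0, 0.0], [0.0, 1.0]])  # row player's payoffs

def external_regret(my_actions, opp_actions, U):
    """Best fixed-strategy payoff minus realized payoff, averaged over T periods."""
    T = len(my_actions)
    realized = sum(U[a, b] for a, b in zip(my_actions, opp_actions)) / T
    # Payoff of always playing each fixed strategy s against the same opponent moves:
    fixed = [sum(U[s, b] for b in opp_actions) / T for s in range(U.shape[0])]
    return max(fixed) - realized

# The row player alternated T and B while the opponent always played L:
print(external_regret([0, 1, 0, 1], [0, 0, 0, 0], U1))  # 1.0: always-T earns 2.0 vs realized 1.0
```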

No-regret algorithm: An algorithm with zero external regret: $R^T_i/T \to 0$ as $T \to \infty$. One of the first no-regret algorithms is Hannan’s algorithm (1957).

Connection to equilibrium: If all players use no-external-regret algorithms, the empirical distribution of play converges to the set of coarse correlated equilibria; with the stronger no-internal-regret (no-swap-regret) property, it converges to correlated equilibria (Aumann, 1987) — a broader class than NE.

Multiplicative Weights Update (MWU) Algorithm

One of the most powerful online algorithms: weights $w^t(s)$ are updated as follows:

$ w^{t+1}(s) = w^t(s) \cdot \exp(\eta \cdot u_i(s, a^t_{-i})) $

Choice is proportional to the weights. This guarantees $R^T_i = O(\sqrt{T})$ — cumulative regret grows sublinearly in $T$, so average regret vanishes and the algorithm is no-regret. The same multiplicative update underlies boosting (AdaBoost) and exponentiated-gradient methods in online learning.
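The update rule can be sketched directly. In this illustrative run the step size $\eta = 0.1$, the payoff matrix rescaled to $[0,1]$, and the opponent's fixed mixed strategy are all assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
U = np.array([[1.0, 0.0], [0.0, 0.5]])  # row payoffs rescaled to [0, 1]
opp_mix = np.array([0.5, 0.5])          # opponent plays L/R with equal probability
eta = 0.1

w = np.ones(2)                          # initial weights over own actions T, B
for _ in range(1000):
    b = rng.choice(2, p=opp_mix)        # opponent's realized action
    payoffs = U[:, b]                   # payoff each own action would have earned
    w *= np.exp(eta * payoffs)          # the multiplicative update from the text

p = w / w.sum()
print(p)  # nearly all mass on T, the action with higher expected payoff (0.5 vs 0.25)
```

The weights concentrate exponentially fast on the action with the higher cumulative payoff, which is exactly why the average regret vanishes.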

Behavioral Game Theory

Real people systematically deviate from the predictions of standard game theory. Three key directions:

1. Bounded rationality: the "level-k reasoning" model (Camerer): a level-$k$ agent believes all opponents reason at level $k-1$. Level 0: random choice. Level 1: best response to random choice. Level 2: best response to level 1. Empirically, most people are at levels 1–2. Prediction: outcomes intermediate between "naive" play and equilibrium.

2. Social preferences: People care not only about their own payoff but also about "fairness" and the payoffs of others. Fehr–Schmidt model: $u_i = x_i - \alpha \max(x_j - x_i, 0) - \beta \max(x_i - x_j, 0)$. First term: payoff. Second: "envy" (pain from others having more). Third: "guilt" (pain from having more than others).

3. Limited willpower and framing effects: The same decision described differently $\rightarrow$ different choices. Loss aversion effect: losses are felt more strongly than gains. Default effect: the default option is chosen disproportionately often. Nudge (Thaler–Sunstein): changing the "choice architecture" changes behavior without changing incentives.
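The level-k hierarchy from point 1 can be sketched recursively. This uses the coordination game from the fictitious-play example, with the assumptions that level 0 randomizes uniformly and higher levels play pure best responses:

```python
import numpy as np

U1 = np.array([[2.0, 0.0], [0.0, 1.0]])  # row player's payoffs (T, B)
U2 = np.array([[2.0, 0.0], [0.0, 1.0]])  # column player's payoffs, indexed (own action, opponent action)

def level_k(k, U_self, U_other):
    """Mixed strategy of a level-k player over their own actions."""
    if k == 0:
        return np.ones(U_self.shape[0]) / U_self.shape[0]  # level 0: uniform random
    opp = level_k(k - 1, U_other, U_self)        # opponent modeled as level k-1
    strategy = np.zeros(U_self.shape[0])
    strategy[int(np.argmax(U_self @ opp))] = 1.0  # pure best response to that belief
    return strategy

print(level_k(1, U1, U2))  # level 1 best-responds to uniform: plays T
print(level_k(2, U1, U2))  # level 2 best-responds to a level-1 opponent: still T
```

In this game every level above 0 picks the payoff-dominant action; in games with more strategies, the chosen action typically shifts as $k$ grows, which is what level-k experiments measure.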
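The Fehr–Schmidt utility from point 2 is a one-line computation. The outcome (10, 4) and the parameters $\alpha = 0.9$, $\beta = 0.6$ below are illustrative choices, not values from the text:

```python
def fs_utility(x_own, x_other, alpha, beta):
    envy = max(x_other - x_own, 0)    # disadvantageous inequality ("envy")
    guilt = max(x_own - x_other, 0)   # advantageous inequality ("guilt")
    return x_own - alpha * envy - beta * guilt

print(fs_utility(10, 4, alpha=0.9, beta=0.6))  # 10 - 0.6*6 = 6.4 for the richer player
print(fs_utility(4, 10, alpha=0.9, beta=0.6))  # 4 - 0.9*6 = -1.4 for the poorer player
```

Note that the poorer player's utility is negative: with $\alpha$ this high, they would prefer to destroy the surplus rather than accept the unequal split, which is the mechanism behind ultimatum-game rejections.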

Real-World Applications

Retirement savings (Thaler, Nobel 2017): In the US, "opt-out" default (automatic enrollment with the right to withdraw) increased participation in retirement plans from ~40% to ~90%. Standard game theory predicts: default is not important (the rational agent will sign up themselves). Behavioral theory: default is a "nudge" that uses inertia.

Negotiations: Anchor effect: the first offer disproportionately influences the final outcome. Behavioral negotiators (and the best diplomats) take this into account.

Learning in Games and Behavioral Economics in Practice

Learning algorithms in games are used in algorithmic trading and artificial intelligence. Market-making algorithms on financial markets use online learning (no-regret algorithms): the market maker updates the bid–ask spread based on historical data, minimizing losses to informed traders. In the ad systems of Google and Meta, real-time bidding (RTB) algorithms use reinforcement learning methods that update bidding strategies based on the results of past auctions.

The Fehr–Schmidt "inequity aversion" model explains the rejection of "unfair" offers in ultimatum games: in the 15 cultures studied by Henrich et al. (2001), rejection rates ranged from 10 to 60%, correlating with cultural norms of fairness. Behavioral game theory also informs the design of bonus systems: employers who offer "gift" wages above the market norm obtain higher productivity through reciprocal worker behavior (the Akerlof gift-exchange effect, measured experimentally by Fehr and Gächter).

In negotiations, knowledge of level-k reasoning helps choose a strategy that accounts for the opponent's actual, rather than fully rational, behavior.

Assignment: (a) For the Matching Pennies matrix (Heads–Tails), simulate fictitious play for 10 periods: start with T (Heads) for P1 and L (Heads) for P2. How do beliefs and actions change? What do the empirical frequencies converge to? (b) Fehr–Schmidt model with $\alpha = 0.5$, $\beta = 0.25$: for outcome (8, 4) compute each player’s utility. Should player 1 "reduce" their payoff to (6,6)? (c) In the ultimatum game, real players reject offers below 20–30%. How does the Fehr–Schmidt model explain this?
