Module III · Article II

Optimal Transport and the Monge-Kantorovich Problem

The Earth Moving Problem

In 1781, the French mathematician Gaspard Monge posed the question: how can a pile of earth (density μ) be moved into a hole (density ν) with minimal labor cost? The problem turned out to be incredibly difficult and remained unsolved for 160 years. In 1942, the Soviet mathematician Leonid Kantorovich solved a relaxed version of the problem; his work on optimal resource allocation later earned him the Nobel Prize in Economics (1975). Today, optimal transport is a central mathematical theory underlying generative neural networks, image processing, medicine, and economic theory. It is a direct descendant of variational calculus in infinite-dimensional spaces.

Monge's Problem: Rigorous Formulation

Suppose we are given two measures μ and ν on a space X (for example, probability distributions on ℝⁿ) and a cost function c(x, y) ≥ 0 — the cost of moving a “unit of material” from x to y.

Monge’s problem: find a mapping T: X → X (a “transport map”) that pushes μ forward to ν (that is, T_#μ = ν — the pushforward of μ under T equals ν) and minimizes the total cost:

min_T ∫_X c(x, T(x)) dμ(x)

The trouble with Monge’s formulation: the mapping T is a “deterministic” plan, but sometimes it is optimal — or even necessary — to split a unit of material. For example, part of the earth at location x₁ goes to hole y₁, part to y₂. Monge’s formulation does not allow this: no map T can push μ = δ₀ onto ν = ½δ₋₁ + ½δ₁.

Kantorovich's Formulation

Kantorovich extended the problem by considering “probabilistic plans” — joint measures γ(x,y):

min_{γ ∈ Π(μ,ν)} ∫_{X×X} c(x, y) dγ(x, y)

where Π(μ,ν) = {γ : marginals of γ with respect to x and y equal μ and ν, respectively}.

Key properties:

  • Kantorovich's problem is linear in γ (infinite-dimensional LP!)
  • Always has a solution (under reasonable conditions on c and μ, ν)
  • Monge’s problem is a special case: γ concentrated on the graph of T
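Since the discrete Kantorovich problem is literally a finite linear program, any LP solver handles it directly. A minimal sketch in Python, assuming scipy is available (`kantorovich_lp` is an illustrative helper name, not a library function):

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_lp(mu, nu, C):
    """Solve min_gamma <C, gamma> over plans with marginals mu and nu."""
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums: sum_j gamma_ij = mu_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # column sums: sum_i gamma_ij = nu_j
    b_eq = np.concatenate([mu, nu])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.x.reshape(n, m), res.fun

# Two-point example: mass 0.5 at x = 0, 2; holes at y = 1, 3; c(x,y) = |x-y|^2.
x, y = np.array([0.0, 2.0]), np.array([1.0, 3.0])
C = (x[:, None] - y[None, :]) ** 2
gamma, cost = kantorovich_lp(np.array([0.5, 0.5]), np.array([0.5, 0.5]), C)
```

The solver returns a plan concentrated on the diagonal (0→1, 2→3), confirming that the optimal Kantorovich plan here is a Monge map.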

Kantorovich's dual problem: by the LP duality theorem:

max ∫u(x) dμ(x) + ∫v(y) dν(y) subject to u(x) + v(y) ≤ c(x,y) for all x,y

Strong duality holds: inf = sup.

Wasserstein Distance

W_p-distance between probability measures μ, ν:

W_p(μ, ν) = (min_{γ ∈ Π(μ,ν)} ∫|x−y|^p dγ)^{1/p}

for c(x,y) = |x−y|^p.

Properties: W_p is a true metric on the space of probability measures. Convergence in W_p is equivalent to weak convergence + convergence of the p-th moment.

Wasserstein space: P_p(ℝⁿ) with metric W_p is the “space of shapes”. Interpolation between μ and ν by W₂ is “optimal morphing” between two distributions.
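In one dimension W_p has a closed form via quantile functions, W_p(μ, ν)^p = ∫₀¹ |F_μ⁻¹(q) − F_ν⁻¹(q)|^p dq, which for equal-size empirical samples reduces to sorting. A small sketch (`wasserstein_1d` is an illustrative helper, not a library function):

```python
import numpy as np

def wasserstein_1d(xs, ys, p=2):
    """W_p between equal-size empirical measures on the real line."""
    xs, ys = np.sort(xs), np.sort(ys)   # sorted samples = empirical quantiles
    return np.mean(np.abs(xs - ys) ** p) ** (1 / p)

# Shifting a sample by a constant c gives W_p = |c| for any p,
# since the shift preserves the ordering of the points.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
print(wasserstein_1d(a, a + 3.0))  # ≈ 3.0
```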

Brenier-McCann Theorem

Theorem (for c(x,y) = |x−y|²/2, μ absolutely continuous): there exists a unique optimal mapping T = ∇φ, where φ is a convex function.

Meaning: the optimal transport plan is a “shift along the gradient of a convex function”. This is striking: from the earth-moving problem, convexity arises!
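In one dimension the Brenier map is the monotone rearrangement T = F_ν⁻¹ ∘ F_μ; being nondecreasing, it is indeed the derivative of a convex φ. A sketch for two Gaussians, assuming scipy.stats is available — here T must coincide with the affine map x ↦ m₂ + (s₂/s₁)(x − m₁):

```python
from scipy.stats import norm

m1, s1 = 0.0, 1.0   # source: N(0, 1)
m2, s2 = 5.0, 2.0   # target: N(5, 4)

def T(x):
    """Monotone rearrangement F_nu^{-1}(F_mu(x))."""
    return norm.ppf(norm.cdf(x, m1, s1), m2, s2)

print(T(1.0))  # ≈ m2 + (s2/s1)*(1.0 - m1) = 7.0
```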

Brenier’s polar factorization: any sufficiently nondegenerate mapping S: ℝⁿ → ℝⁿ decomposes as S = ∇φ ∘ s, where s preserves the measure μ (s_#μ = μ) and φ is convex. This is the infinite-dimensional analogue of the polar decomposition of a matrix into a symmetric positive-definite factor times an orthogonal one.

Applications in Machine Learning

WGAN (Wasserstein GAN): training generative neural networks. The classic GAN effectively minimizes the Jensen–Shannon divergence, which gives vanishing gradients when the real and generated distributions have disjoint supports. WGAN replaces it with W₁, computed via Kantorovich–Rubinstein duality:

L(G, D) = W₁(μ_real, μ_G) ≈ max_{‖D‖_Lip ≤ 1} E_{x∼μ_real}[D(x)] − E_{x∼μ_G}[D(x)]

The result: more stable training and markedly less mode collapse.

Point cloud matching: match two sets of points {xᵢ} ~ μ and {yⱼ} ~ ν at minimal cost. The Sinkhorn algorithm is a fast iterative method for approximately solving the Kantorovich problem with entropic regularization.
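The Sinkhorn iteration itself fits in a few lines: entropic regularization turns the Kantorovich LP into alternating row/column rescalings of the Gibbs kernel K = exp(−C/ε). A minimal sketch (the parameter values are illustrative):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.05, n_iter=1000):
    """Approximate optimal plan via Sinkhorn scaling of K = exp(-C/eps)."""
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)   # rescale to match column marginals
        u = mu / (K @ v)     # rescale to match row marginals
    return u[:, None] * K * v[None, :]

# Two-point example: mass 0.5 at x = 0, 2; holes at y = 1, 3; c(x,y) = |x-y|^2.
x, y = np.array([0.0, 2.0]), np.array([1.0, 3.0])
C = (x[:, None] - y[None, :]) ** 2
gamma = sinkhorn(np.array([0.5, 0.5]), np.array([0.5, 0.5]), C)
print(gamma)  # ≈ [[0.5, 0], [0, 0.5]] — concentrates on 0→1, 2→3
```

As ε → 0 the entropic plan converges to the true optimal plan; larger ε blurs the plan but speeds up and stabilizes the iterations.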

Morphing between images: averaging images in Wasserstein space. Wasserstein barycenter: min_{ν} Σᵢ wᵢ W₂²(μᵢ, ν). The result is a “mean” image that preserves geometry.

Complete Analysis: Transport Plan in ℝ

Problem: μ = 0.5 δ₀ + 0.5 δ₂ (mass 0.5 at points 0 and 2), ν = 0.5 δ₁ + 0.5 δ₃ (mass 0.5 at points 1 and 3). Cost c(x,y) = |x−y|².

Admissible plans: γ is defined by a 2×2 matrix with row sums (0.5, 0.5) and column sums (0.5, 0.5):

γ = [[a, 0.5−a], [0.5−a, a]], a ∈ [0, 0.5].

Cost: C(γ) = a|0−1|² + (0.5−a)|0−3|² + (0.5−a)|2−1|² + a|2−3|² = a + 9(0.5−a) + (0.5−a) + a = 5 − 8a.

Minimum at a = 0.5: γ = [[0.5, 0], [0, 0.5]] — the “Monge plan”: 0→1, 2→3. Cost = 5 − 8·0.5 = 1. Optimum!

W₂ = √C = √1 = 1.
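The one-parameter family of plans is easy to check numerically (a small sketch):

```python
import numpy as np

def cost(a):
    """Total cost of the plan gamma(a) for the two-point example."""
    gamma = np.array([[a, 0.5 - a], [0.5 - a, a]])
    C = np.array([[1.0, 9.0], [1.0, 1.0]])  # |x_i - y_j|^2 for x ∈ {0,2}, y ∈ {1,3}
    return np.sum(gamma * C)

a_grid = np.linspace(0, 0.5, 51)
best = a_grid[np.argmin([cost(a) for a in a_grid])]
print(best, cost(best))  # → 0.5 1.0
```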
