Module III · Article II
Evolutionary Algorithms and Genetic Programming
Evolutionary algorithms are optimization methods inspired by biological evolution. Where classical optimization methods (gradient descent, linear programming) fail, on multimodal landscapes, combinatorial spaces, and non-differentiable objectives, evolution still finds good solutions.
Genetic Algorithm (GA)
Biological analogy: individuals in a population = solutions to the problem. Chromosome = encoding of the solution. Fitness = solution quality. Selection, crossover, mutation = search in the solution space.
GA Structure:
- Initialization: random population of P chromosomes (strings of length L)
- Evaluation: calculate fitness f(xᵢ) for each individual
- Selection: select "parents" with probability ∝ f(xᵢ) (roulette wheel selection) or top-k (tournament selection)
- Crossover: take two "parents," with probability p_c create "offspring" by exchanging fragments of chromosomes (one-point, two-point, uniform)
- Mutation: with a small probability p_m, change a random bit in the chromosome
- Replacement: offspring replace part of the population (elitism strategy: the best individual is always preserved)
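The loop above can be sketched on a toy problem. A minimal generational GA, assuming OneMax (maximize the number of 1-bits) as the fitness function and tournament selection with k = 3; all parameter values are illustrative:

```python
import random

def ga_onemax(L=20, P=30, p_c=0.9, p_m=0.02, generations=60, seed=0):
    """Minimal generational GA on OneMax (maximize the number of 1-bits)."""
    rng = random.Random(seed)
    fitness = lambda ind: sum(ind)
    pop = [[rng.randint(0, 1) for _ in range(L)] for _ in range(P)]

    def tournament(k=3):
        # Tournament selection: best of k randomly drawn individuals.
        return max(rng.sample(pop, k), key=fitness)

    for _ in range(generations):
        elite = max(pop, key=fitness)              # elitism: best survives
        children = [elite[:]]
        while len(children) < P:
            a, b = tournament(), tournament()
            if rng.random() < p_c:                 # one-point crossover
                cut = rng.randrange(1, L)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):
                # Bit-flip mutation with probability p_m per bit.
                children.append([bit ^ 1 if rng.random() < p_m else bit
                                 for bit in child])
        pop = children[:P]
    return max(pop, key=fitness)

best = ga_onemax()
```

With these settings the population typically reaches the all-ones string within a few dozen generations; elitism guarantees the best fitness never decreases.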
Holland’s Schema Theorem: a schema is a template H (a string over {0,1,*}). Defining length δ(H) is the distance between the first and last specific (non-*) symbols; order o(H) is the number of specific symbols. Short (small δ), low-order schemata with above-average fitness receive exponentially increasing numbers of trials in successive generations. These “building blocks” combine toward the optimal solution.
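A quick illustration of the schema definitions (the helper names are my own):

```python
def schema_stats(H):
    """Order = number of fixed symbols; defining length = distance
    between the first and last fixed positions."""
    fixed = [i for i, s in enumerate(H) if s != '*']
    order = len(fixed)
    length = (fixed[-1] - fixed[0]) if fixed else 0
    return order, length

def matches(H, x):
    """A bitstring x is an instance of schema H if it agrees on all fixed symbols."""
    return all(s == '*' or s == b for s, b in zip(H, x))

print(schema_stats("1**0*"))      # → (2, 3)
print(matches("1**0*", "10100"))  # → True
print(matches("1**0*", "10110"))  # → False
```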
Genetic Programming (GP)
An extension of the GA: chromosomes are not strings but syntax trees (programs). Internal nodes = operators (+, −, sin, if). Leaves = variables and constants.
GP Operations: tree crossover = exchange of random subtrees between parents. Mutation = replacement of a random subtree by a randomly generated one.
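Subtree crossover can be sketched with trees as nested lists (a minimal hypothetical representation; operator names and arities are illustrative):

```python
import random

# Expression trees as nested lists: [op, child, ...] or a terminal ('x', constant).
def nodes(t, path=()):
    """Enumerate (path, subtree) pairs; a path is a tuple of child indices."""
    yield path, t
    if isinstance(t, list):
        for i, child in enumerate(t[1:], start=1):
            yield from nodes(child, path + (i,))

def replace(t, path, new):
    """Return a copy of t with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    t = t[:]                                   # shallow copy along the path
    t[path[0]] = replace(t[path[0]], path[1:], new)
    return t

def crossover(p1, p2, rng):
    """Subtree crossover: graft a random subtree of p2 at a random point of p1."""
    path1, _ = rng.choice(list(nodes(p1)))
    _, sub2 = rng.choice(list(nodes(p2)))
    return replace(p1, path1, sub2)

rng = random.Random(1)
parent1 = ['+', ['*', 'x', 'x'], 3.0]          # x*x + 3
parent2 = ['sin', ['+', 'x', 1.0]]             # sin(x + 1)
child = crossover(parent1, parent2, rng)
```

Mutation follows the same pattern: pick a random path and `replace` the subtree there with a freshly generated random tree.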
Symbolic regression: GP searches for a formula f(x) that fits the data. AI Feynman (Udrescu &amp; Tegmark, 2020): from synthetic data, symbolic regression “rediscovers” physical laws (Newtonian gravitation, Ohm’s law, and dozens of other equations from the Feynman Lectures), a machine playing the role of Newton.
Evolution Strategies (ES) for Continuous Optimization
For continuous spaces: chromosome = real-valued vector x ∈ ℝⁿ. Mutation: x_new = x + ε, where ε ~ N(0, σ²I); the “step size” σ is adapted during the run.
Rechenberg’s 1/5 success rule: if more than 1/5 of mutations improve fitness, increase σ; if fewer, decrease it. This keeps the step size efficient.
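A minimal (1+1)-ES with the 1/5 success rule, assuming a sphere objective and an adaptation window of 20 mutations (both are illustrative choices):

```python
import random

def one_plus_one_es(f, x0, sigma=1.0, iters=400, seed=0):
    """(1+1)-ES minimizing f, adapting sigma with the 1/5 success rule."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)
    successes = 0
    for t in range(1, iters + 1):
        y = [xi + sigma * rng.gauss(0, 1) for xi in x]   # Gaussian mutation
        fy = f(y)
        if fy < fx:
            x, fx = y, fy
            successes += 1
        if t % 20 == 0:                                  # adapt every 20 mutations
            rate = successes / 20
            sigma *= 1.5 if rate > 0.2 else 0.66         # >1/5 success: widen; else shrink
            successes = 0
    return x, fx, sigma

sphere = lambda v: sum(c * c for c in v)
x, fx, s = one_plus_one_es(sphere, [3.0, -2.0])
```

On the sphere function the rule steadily shrinks σ as the search closes in, giving near-geometric convergence.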
CMA-ES (Covariance Matrix Adaptation): samples xᵢ ~ N(mₜ, σₜ²Cₜ) and adapts the full covariance matrix. Mean: mₜ₊₁ = mₜ + Σᵢ wᵢ(xᵢ − mₜ), summing over the μ best of λ samples with weights wᵢ. Covariance: Cₜ₊₁ = (1 − c₁ − c_μ)Cₜ + c₁·p_c p_cᵀ + c_μ·Σᵢ wᵢ yᵢyᵢᵀ, where yᵢ = (xᵢ − mₜ)/σₜ and p_c is the evolution path. One of the best gradient-free optimization methods: on BBOB (Black-Box Optimization Benchmark), CMA-ES dominates across many problem types.
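A heavily simplified sketch of the covariance-adaptation idea: rank-μ update only, with fixed σ and no evolution paths, so it omits most of the real CMA-ES; the learning rate c_μ = 0.3 and the sphere objective are assumptions:

```python
import numpy as np

def rank_mu_cma(f, m0, sigma=0.5, lam=12, iters=60, seed=0):
    """Toy sketch of CMA-ES: mean shift plus rank-mu covariance update only."""
    rng = np.random.default_rng(seed)
    n = len(m0)
    m = np.array(m0, float)
    C = np.eye(n)
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))   # log-decreasing weights
    w /= w.sum()
    c_mu = 0.3                                            # learning rate (assumed)
    for _ in range(iters):
        A = np.linalg.cholesky(C)
        ys = rng.standard_normal((lam, n)) @ A.T          # y ~ N(0, C)
        xs = m + sigma * ys
        order = np.argsort([f(x) for x in xs])[:mu]       # mu best of lam samples
        y_sel = ys[order]
        m = m + sigma * (w @ y_sel)                       # weighted mean shift
        C = (1 - c_mu) * C + c_mu * (y_sel.T * w) @ y_sel # rank-mu update
    return m

m = rank_mu_cma(lambda x: float(np.sum(x ** 2)), [2.0, 1.5, -1.0])
```

Even this stripped-down version learns to scale its sampling distribution: on the sphere, selection favors short steps near the optimum, so C contracts and the mean settles near zero.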
Multiobjective Evolution (MOEA)
Pareto dominance: x dominates y (x ≻ y) if x is no worse than y on every criterion and strictly better on at least one. The set of non-dominated solutions is the Pareto set; its image in objective space is the Pareto front.
NSGA-II (Deb et al., 2002): (1) Non-dominated sorting: split the population into fronts F₁, F₂, ... (F₁ is the Pareto front of the population). (2) Crowding distance: maintain diversity within a front. (3) Selection: prefer the earlier front; within the same front, choose the higher crowding distance.
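Step (1), non-dominated sorting, can be sketched directly from the dominance definition (this is the simple O(N³) variant, not NSGA-II’s faster bookkeeping):

```python
def dominates(a, b):
    """a dominates b (minimization): no worse on every objective, better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(points):
    """Split points into fronts F1, F2, ... (indices into `points`)."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i]) for j in remaining)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

# Two objectives, both minimized (e.g. weight vs. cost).
pts = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
fronts = non_dominated_sort(pts)
print(fronts)  # → [[0, 1, 3], [2], [4]]
```

Points (1,5), (2,3), and (4,1) are mutually non-dominated and form F₁; (3,4) is dominated only by F₁ members, so it lands in F₂.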
Applications: design optimization (weight vs strength), trading strategies (profitability vs risk), tuning ML hyperparameters (accuracy vs inference time).
OpenAI ES and Neuroevolution
OpenAI ES (Salimans et al., 2017): instead of backpropagation, an evolution strategy trains the neural network. Perturb parameters: θ + σεᵢ with εᵢ ~ N(0, I), i = 1, ..., n. Evaluate rewards F(θ + σεᵢ). Update: θ′ = θ + α/(nσ)·Σᵢ F(θ + σεᵢ)εᵢ. Parallelizes across thousands of CPU workers (only scalar rewards and random seeds need to be exchanged); robust to sparse rewards.
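The update rule can be sketched as a function. The toy reward (negative squared distance to a hypothetical target vector) and the z-scoring of rewards, a common variance-reduction trick, are illustrative choices:

```python
import numpy as np

def es_step(F, theta, alpha=0.05, sigma=0.1, n=100, rng=None):
    """One ES update: theta + alpha/(n*sigma) * sum_i F(theta + sigma*eps_i) * eps_i."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((n, len(theta)))            # eps_i ~ N(0, I)
    rewards = np.array([F(theta + sigma * e) for e in eps])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # z-score rewards
    return theta + alpha / (n * sigma) * eps.T @ rewards

# Toy "reward": negative squared distance to a hypothetical target vector.
target = np.array([1.0, -2.0, 0.5])
F = lambda th: -float(np.sum((th - target) ** 2))

theta = np.zeros(3)
rng = np.random.default_rng(0)
for _ in range(300):
    theta = es_step(F, theta, rng=rng)
```

Note that F is treated as a black box: no gradients are ever computed, yet θ drifts toward the high-reward region, which is why each perturbation can be evaluated on a separate worker.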
NAS by evolution (Real et al., Google, 2017–2019): neural network architectures are evolved via mutations (add/remove a layer, change an operation). At a cost of thousands of GPU hours, regularized evolution produced the AmoebaNet family, matching or surpassing human-designed architectures on ImageNet.
Numerical Example: GA for the Traveling Salesman Problem
10 cities, TSP (find the shortest route through all). Population P=50, p_c=0.85, p_m=0.02. Start: random routes, average length ≈ 450 (normalized units). After 200 generations: best route ≈ 310 (31% decrease). Optimal (brute force): 295. GA found a solution close to the optimum in seconds.
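One detail the example glosses over: routes are permutations, so plain one-point crossover would duplicate cities. A sketch of the order crossover (OX) and swap mutation commonly used for TSP GAs, assuming cities labeled 0..9:

```python
import random

def order_crossover(p1, p2, rng):
    """OX: copy a random slice of p1, fill the rest in p2's order
    (always yields a valid permutation)."""
    n = len(p1)
    i, j = sorted(rng.sample(range(n), 2))
    child = [None] * n
    child[i:j] = p1[i:j]
    fill = [c for c in p2 if c not in child[i:j]]
    k = 0
    for idx in range(n):
        if child[idx] is None:
            child[idx] = fill[k]
            k += 1
    return child

def swap_mutation(route, p_m, rng):
    """With probability p_m, swap two random cities."""
    route = route[:]
    if rng.random() < p_m:
        a, b = rng.sample(range(len(route)), 2)
        route[a], route[b] = route[b], route[a]
    return route

rng = random.Random(0)
child = order_crossover(list(range(10)), [3, 1, 4, 0, 5, 9, 2, 6, 8, 7], rng)
```

Both operators preserve the permutation property, so every offspring is a feasible route and fitness is simply the total tour length.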
Assignment: implement a GA to optimize f(x, y) = sin(x)·cos(y) − 0.5·sin(x + y) on [−π, π]². Encoding: a 16-bit string for x and 16 bits for y (Gray code). P = 100, p_c = 0.8, p_m = 0.01, 500 generations. Plot: (1) a convergence curve (best and average fitness vs. generation); (2) a “heat map” of how often points (x, y) are visited by the population. Compare with CMA-ES on the same problem: which is faster and more accurate?