SVM and Kernel Methods
Applications in Machine Learning
Idea: Maximize the Margin
Classification problem: given $N$ points $(x_i, y_i)$, where $x_i \in \mathbb{R}^n$ are feature vectors and $y_i \in \{-1, +1\}$ are class labels, the goal is to find a hyperplane separating the two classes. If the classes are separable, there are infinitely many such hyperplanes. SVM (Support Vector Machine) chooses the best one: the hyperplane maximally distant from both classes. Maximum margin means maximum robustness to new data. The elegance of SVM lies in the fact that finding the optimal hyperplane is a convex quadratic programming problem.
Primal SVM Problem
Hyperplane: $w^\top x + b = 0$. Class of a point $x$: $\operatorname{sign}(w^\top x + b)$.
Distance from point $x_0$ to the hyperplane: $|w^\top x_0 + b| / |w|$.
Margin = $2/|w|$: the distance between the two parallel hyperplanes $w^\top x + b = \pm 1$. Indeed, for points $x_\pm$ on these hyperplanes, projecting $x_+ - x_-$ onto the unit normal $w/|w|$ gives $w^\top (x_+ - x_-)/|w| = 2/|w|$.
SVM Problem (hard margin): maximize the margin under correct classification:
$ \max \frac{2}{|w|} \quad \text{subject to } y_i(w^\top x_i + b) \ge 1 \text{ for all } i $
Equivalently:
$ \min_{w,b} \frac{1}{2}|w|^2 \quad \text{subject to } y_i(w^\top x_i + b) \ge 1 $
This is a convex quadratic programming problem! The solution in $w$ is unique (the objective is strictly convex in $w$).
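To see the QP concretely, here is a minimal sketch that hands the primal to a generic convex solver (assuming the cvxpy package and a small hand-made separable dataset; the data and variable names are illustrative):

```python
import cvxpy as cp
import numpy as np

# Toy separable data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# min (1/2)|w|^2  subject to  y_i (w . x_i + b) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()
print(w.value, b.value)  # the maximum-margin hyperplane for this toy set
```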
Dual Problem and the Kernel Trick
Lagrangian: $L(w, b, \alpha) = \frac{1}{2}|w|^2 - \sum_i \alpha_i [y_i(w^\top x_i + b) - 1],\ \alpha_i \ge 0.$
KKT conditions:
- $\partial L / \partial w = 0$: $w = \sum_i \alpha_i y_i x_i$
- $\partial L / \partial b = 0$: $\sum_i \alpha_i y_i = 0$
Substituting into the Lagrangian, we obtain the dual problem:
$ \max \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{subject to } \sum_i \alpha_i y_i = 0, \ \alpha_i \ge 0 $
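In detail: substituting $w = \sum_i \alpha_i y_i x_i$, the quadratic term becomes $\frac{1}{2}|w|^2 = \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j$, the cross term $-\sum_i \alpha_i y_i w^\top x_i = -\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j$ combines with it to leave $-\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j$, the $b$-term vanishes since $\sum_i \alpha_i y_i = 0$, and the remaining $+\sum_i \alpha_i$ gives the linear term.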
Key observation: the input data $x_i$ enter only via inner products $x_i^\top x_j$!
Kernel trick: replace $x_i^\top x_j$ with $K(x_i, x_j)$—a kernel function. Then, the SVM implicitly operates in a high-dimensional (possibly infinite-dimensional) feature space, without computing it explicitly.
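Computationally, this means the optimizer only ever sees the Gram matrix. A small numpy sketch (function names are illustrative):

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    # K_ij = exp(-|x_i - x_j|^2 / (2 sigma^2)), from pairwise squared distances
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def dual_objective(alpha, y, K):
    # sum_i alpha_i - (1/2) sum_ij alpha_i alpha_j y_i y_j K_ij
    Q = (y[:, None] * y[None, :]) * K
    return alpha.sum() - 0.5 * alpha @ Q @ alpha
```

Swapping kernels changes only how $K$ is built; the optimization problem itself is untouched.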
Support Vectors
From the KKT conditions (complementary slackness): $\alpha_i [y_i(w^\top x_i + b) - 1] = 0$.
Therefore, $\alpha_i > 0$ only for points "on the margin", i.e. those with $y_i(w^\top x_i + b) = 1$. These are the support vectors: the points closest to the separating hyperplane. There are usually few of them; in a problem with 1000 points, often only a few dozen. Only they determine the solution!
Classification of a new point $x$: $\operatorname{sign}(\sum_i \alpha_i y_i K(x_i, x) + b)$—depends only on kernels with support vectors.
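A quick illustration with scikit-learn (assuming it is installed; the data here are synthetic): after fitting, only the support vectors are stored, and the decision function is exactly the kernel expansion above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print(len(clf.support_))   # typically a small fraction of the 1000 points
print(clf.dual_coef_)      # the products alpha_i * y_i, one per support vector
# decision_function(x) = sum_i dual_coef_[0, i] * K(sv_i, x) + intercept_
```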
Soft Margin SVM
When classes are not separable, add "slack variables" $\xi_i \ge 0$:
$ \min \frac{1}{2}|w|^2 + C \sum_i \xi_i \quad \text{subject to } y_i(w^\top x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0 $
The parameter $C$ controls the trade-off: large $C$ → strict margin (few violations), small $C$ → wide margin (many violations allowed). In the dual problem: $\alpha_i \in [0, C]$.
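The trade-off is easy to observe numerically; a sketch with scikit-learn on noisy synthetic data (the values of $C$ are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0, 1, -1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)
    print(f"C={C}: {len(clf.support_)} support vectors, margin ~ {margin:.2f}")
# Expected trend: small C gives a wide margin and many support vectors,
# large C gives a narrow margin and fewer violations.
```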
Popular Kernels
Linear: $K(x, y) = x^\top y$. SVM becomes a linear classifier.
Polynomial: $K(x, y) = (x^\top y + 1)^d$. Operates implicitly in the space of degree $d$ polynomial features.
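For example, with $d = 2$ and $x, y \in \mathbb{R}^2$, expanding the square exhibits the feature map explicitly:
$ (x^\top y + 1)^2 = \phi(x)^\top \phi(y), \quad \phi(x) = \left(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\right) $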
RBF (radial): $K(x, y) = \exp\left(-|x - y|^2 / (2\sigma^2)\right)$. Corresponds to an infinite-dimensional space (!) of Gaussian functions; $\sigma$ is the "width" parameter. As $\sigma \to 0$: memorization; as $\sigma \to \infty$: the classifier approaches a linear one.
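Both limits are visible directly in the Gram matrix; a small numpy sketch (points and widths chosen only for illustration):

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
for sigma in (0.05, 1.0, 50.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma**2))
    print(f"sigma={sigma}\n{K.round(3)}")
# sigma -> 0: K -> identity, each point is similar only to itself (memorization)
# sigma -> infinity: K -> all-ones, distinctions fade and the decision
# boundary approaches that of a linear classifier
```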
Full Example Analysis
Data (2D, 4 points):
- Class +1: $x_1 = (1, 2)$, $x_2 = (2, 1)$
- Class −1: $x_3 = (−1, −2)$, $x_4 = (−2, −1)$
From the KKT stationarity condition: $w = \alpha_1 y_1 x_1 + \alpha_2 y_2 x_2 + \alpha_3 y_3 x_3 + \alpha_4 y_4 x_4$.
By symmetry of the problem ($x_3 = -x_1$, $x_4 = -x_2$), all multipliers are equal: $\alpha_i = \alpha$. Moreover $y_3 x_3 = y_1 x_1 = (1, 2)$ and $y_4 x_4 = y_2 x_2 = (2, 1)$, so $w = 2\alpha[(1,2) + (2,1)] = 2\alpha(3, 3)$.
Dual problem in the single variable $\alpha$: since $|w|^2 = 4\alpha^2 \cdot |(3,3)|^2 = 72\alpha^2$, the objective is $4\alpha - 36\alpha^2$, maximized at $\alpha = 1/18$. Hence $w = (1/3, 1/3)$ and, by symmetry, $b = 0$. Margin $= 2/|w| = 2/(\sqrt{2}/3) = 3\sqrt{2} \approx 4.24$. All four points satisfy $y_i(w^\top x_i + b) = 1$: all of them are support vectors.
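This is easy to verify numerically, e.g. with scikit-learn, approximating the hard margin by a very large $C$:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # huge C ~ hard margin
print(clf.coef_)                        # ~ [[0.333, 0.333]] = (1/3, 1/3)
print(clf.intercept_)                   # ~ [0.]
print(2.0 / np.linalg.norm(clf.coef_))  # ~ 4.243 = 3*sqrt(2)
print(clf.n_support_)                   # [2 2]: all four points are support vectors
```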
Generalization Guarantees
Theorem: with probability $\ge 1-\delta$ (over random training sample), the SVM error is bounded:
$ L(h) \le \frac{\text{number of training errors}}{N} + O\left(\sqrt{\frac{\rho^2 / \gamma^2 + \log(1/\delta)}{N}}\right) $
Here $\rho$ is the radius of the data (the largest norm $|x_i|$) and $\gamma$ is the margin. The feature-space dimensionality does not appear! The margin, not the model complexity, determines generalization. This is the theoretical basis: large margin = good generalization.
Support Vector Method: Mathematical Core
SVM solves a classification problem by finding a hyperplane with maximum margin between two classes. This is naturally formulated as a QP. The dual form is especially important: it depends only on inner products $x_i^\top x_j$ between training samples, which paves the way for the kernel trick.
Popular Kernels and Their Properties
- Linear: $K(x,y) = x^\top y$—for linearly separable data
- Polynomial: $K(x,y) = (x^\top y + c)^d$—for polynomial boundary of degree $d$
- RBF (Gaussian): $K(x,y) = \exp(-\gamma|x - y|^2)$—infinite-dimensional kernel, universal approximator
- Sigmoid: $K(x,y) = \tanh(\alpha x^\top y + c)$—related to neural networks
- String kernels, graph kernels—for non-geometric data
Key requirement: Gram matrix $K_{ij} = K(x_i, x_j)$ must be positive semi-definite (Mercer's theorem). This guarantees that the kernel corresponds to an inner product in some Hilbert space.
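A necessary condition is easy to check empirically on any sample: compute the Gram matrix and inspect its smallest eigenvalue. A sketch (the sigmoid parameters are arbitrary):

```python
import numpy as np

def min_gram_eigenvalue(K):
    # Mercer requires the Gram matrix to be symmetric PSD on every sample.
    return np.linalg.eigvalsh((K + K.T) / 2).min()

X = np.random.default_rng(0).normal(size=(50, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

print(min_gram_eigenvalue(np.exp(-0.5 * sq)))  # >= 0: RBF is a valid kernel
print(min_gram_eigenvalue(np.tanh(2.0 * X @ X.T - 1.0)))  # can come out negative:
# the sigmoid kernel is not positive semi-definite for all parameter choices
```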
Implementation and Scaling
For large datasets, generic QP solvers do not scale. Specialized algorithms are used:
- SMO (Sequential Minimal Optimization, Platt, 1998): iteratively optimizes pairs of Lagrange multipliers
- Pegasos (Shalev-Shwartz et al., 2007): stochastic sub-gradient descent for linear SVMs; a sketch follows this list
- LIBLINEAR, LIBSVM—standard libraries
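A minimal sketch of the Pegasos idea (simplified: no projection step and no bias term):

```python
import numpy as np

def pegasos(X, y, lam=0.01, epochs=20, seed=0):
    """Stochastic sub-gradient descent on (lam/2)|w|^2 + mean_i max(0, 1 - y_i w.x_i)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)             # decreasing step size
            violated = y[i] * (w @ X[i]) < 1  # margin check with the current w
            w = (1.0 - eta * lam) * w         # shrink: gradient of the regularizer
            if violated:
                w += eta * y[i] * X[i]        # hinge-loss sub-gradient step
    return w
```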
Applications
For many years, SVMs were state-of-the-art in text, image, and bioinformatics classification tasks. Before the deep learning era, it was SVM with RBF kernel that won competitions for handwritten digit recognition (MNIST), protein classification, and spam filtering. Today, SVMs remain relevant for small datasets, where deep networks overfit, and in industrial applications requiring model interpretability.