SVM and Kernel Methods
Applications in Machine Learning
Idea: Maximize the Margin
Classification problem: given $N$ points $(x_i, y_i)$, where $x_i \in \mathbb{R}^n$ are feature vectors and $y_i \in \{-1, +1\}$ are class labels, the goal is to find a hyperplane separating the two classes. If the classes are separable, there are infinitely many such hyperplanes. SVM (Support Vector Machine) chooses the best one: the hyperplane maximally distant from both classes. Maximum margin means maximum robustness to new data. The elegance of SVM lies in the fact that finding the optimal hyperplane is a convex quadratic programming problem.
Primal SVM Problem
Hyperplane: $w^\top x + b = 0$. Class of a point $x$: $\operatorname{sign}(w^\top x + b)$.
Distance from point $x_0$ to the hyperplane: $|w^\top x_0 + b| / |w|$.
Margin = $2/|w|$: the distance between the two parallel hyperplanes $w^\top x + b = \pm 1$. Indeed, for points $x_\pm$ on these hyperplanes, projecting $x_+ - x_-$ onto the unit normal $w/|w|$ gives $w^\top (x_+ - x_-)/|w| = 2/|w|$.
SVM Problem (hard margin): maximize the margin under correct classification:
$ \max \frac{2}{|w|} \quad \text{subject to } y_i(w^\top x_i + b) \ge 1 \text{ for all } i $
Equivalently:
$ \min_{w,b} \frac{1}{2}|w|^2 \quad \text{subject to } y_i(w^\top x_i + b) \ge 1 $
This is a convex quadratic programming problem! The solution in $w$ is unique (the objective is strictly convex in $w$).
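To see the QP concretely, here is a minimal sketch that hands the primal to a generic convex solver (assuming the cvxpy package and a small hand-made separable dataset; the data and variable names are illustrative):

```python
import cvxpy as cp
import numpy as np

# Toy separable data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# min (1/2)|w|^2  subject to  y_i (w . x_i + b) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()
print(w.value, b.value)  # the maximum-margin hyperplane for this toy set
```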
Dual Problem and the Kernel Trick
Lagrangian: $L(w, b, \alpha) = \frac{1}{2}|w|^2 - \sum_i \alpha_i [y_i(w^\top x_i + b) - 1],\ \alpha_i \ge 0.$
KKT conditions:
- $\partial L / \partial w = 0$: $w = \sum_i \alpha_i y_i x_i$
- $\partial L / \partial b = 0$: $\sum_i \alpha_i y_i = 0$
Substituting into the Lagrangian, we obtain the dual problem:
$ \max \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{subject to } \sum_i \alpha_i y_i = 0, \ \alpha_i \ge 0 $
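In detail: substituting $w = \sum_i \alpha_i y_i x_i$, the quadratic term becomes $\frac{1}{2}|w|^2 = \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j$, the cross term $-\sum_i \alpha_i y_i w^\top x_i = -\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j$ combines with it to leave $-\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j$, the $b$-term vanishes since $\sum_i \alpha_i y_i = 0$, and the remaining $+\sum_i \alpha_i$ gives the linear term.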
Key observation: the input data $x_i$ enter only via inner products $x_i^\top x_j$!
Kernel trick: replace $x_i^\top x_j$ with $K(x_i, x_j)$—a kernel function. Then, the SVM implicitly operates in a high-dimensional (possibly infinite-dimensional) feature space, without computing it explicitly.
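Computationally, this means the optimizer only ever sees the Gram matrix. A small numpy sketch (function names are illustrative):

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    # K_ij = exp(-|x_i - x_j|^2 / (2 sigma^2)), from pairwise squared distances
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def dual_objective(alpha, y, K):
    # sum_i alpha_i - (1/2) sum_ij alpha_i alpha_j y_i y_j K_ij
    Q = (y[:, None] * y[None, :]) * K
    return alpha.sum() - 0.5 * alpha @ Q @ alpha
```

Swapping kernels changes only how $K$ is built; the optimization problem itself is untouched.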
Support Vectors
From the KKT conditions (complementary slackness): $\alpha_i [y_i(w^\top x_i + b) - 1] = 0$.
Therefore, $\alpha_i > 0$ only for points "on the margin", i.e. those with $y_i(w^\top x_i + b) = 1$. These are the support vectors: the points closest to the separating hyperplane. There are usually few of them; in a problem with 1000 points, often only a few dozen. Only they determine the solution!
Classification of a new point $x$: $\operatorname{sign}(\sum_i \alpha_i y_i K(x_i, x) + b)$—depends only on kernels with support vectors.
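A quick illustration with scikit-learn (assuming it is installed; the data here are synthetic): after fitting, only the support vectors are stored, and the decision function is exactly the kernel expansion above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print(len(clf.support_))   # typically a small fraction of the 1000 points
print(clf.dual_coef_)      # the products alpha_i * y_i, one per support vector
# decision_function(x) = sum_i dual_coef_[0, i] * K(sv_i, x) + intercept_
```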
Soft Margin SVM
When classes are not separable, add "slack variables" $\xi_i \ge 0$:
$ \min \frac{1}{2}|w|^2 + C \sum_i \xi_i \quad \text{subject to } y_i(w^\top x_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0 $
The parameter $C$ controls the trade-off: large $C$ → strict margin (few violations), small $C$ → wide margin (many violations allowed). In the dual problem: $\alpha_i \in [0, C]$.
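The trade-off is easy to observe numerically; a sketch with scikit-learn on noisy synthetic data (the values of $C$ are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=300) > 0, 1, -1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)
    print(f"C={C}: {len(clf.support_)} support vectors, margin ~ {margin:.2f}")
# Expected trend: small C gives a wide margin and many support vectors,
# large C gives a narrow margin and fewer violations.
```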
Popular Kernels
Linear: $K(x, y) = x^\top y$. SVM becomes a linear classifier.
Polynomial: $K(x, y) = (x^\top y + 1)^d$. Operates implicitly in the space of degree $d$ polynomial features.
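For example, with $d = 2$ and $x, y \in \mathbb{R}^2$, expanding the square exhibits the feature map explicitly:
$ (x^\top y + 1)^2 = \phi(x)^\top \phi(y), \quad \phi(x) = \left(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\right) $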
RBF (radial): $K(x, y) = \exp\left(-|x - y|^2 / (2\sigma^2)\right)$. Corresponds to an infinite-dimensional space (!) of Gaussian functions; $\sigma$ is the "width" parameter. As $\sigma \to 0$: memorization; as $\sigma \to \infty$: the classifier approaches a linear one.
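Both limits are visible directly in the Gram matrix; a small numpy sketch (points and widths chosen only for illustration):

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
for sigma in (0.05, 1.0, 50.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma**2))
    print(f"sigma={sigma}\n{K.round(3)}")
# sigma -> 0: K -> identity, each point is similar only to itself (memorization)
# sigma -> infinity: K -> all-ones, distinctions fade and the decision
# boundary approaches that of a linear classifier
```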
Full Example Analysis
Data (2D, 4 points):
- Class +1: $x_1 = (1, 2)$, $x_2 = (2, 1)$
- Class −1: $x_3 = (−1, −2)$, $x_4 = (−2, −1)$
From the KKT stationarity condition: $w = \alpha_1 y_1 x_1 + \alpha_2 y_2 x_2 + \alpha_3 y_3 x_3 + \alpha_4 y_4 x_4$.
By symmetry of the problem ($x_3 = -x_1$, $x_4 = -x_2$), all multipliers are equal: $\alpha_i = \alpha$. Moreover $y_3 x_3 = y_1 x_1 = (1, 2)$ and $y_4 x_4 = y_2 x_2 = (2, 1)$, so $w = 2\alpha[(1,2) + (2,1)] = 2\alpha(3, 3)$.
Dual problem in the single variable $\alpha$: since $|w|^2 = 4\alpha^2 \cdot |(3,3)|^2 = 72\alpha^2$, the objective is $4\alpha - 36\alpha^2$, maximized at $\alpha = 1/18$. Hence $w = (1/3, 1/3)$ and, by symmetry, $b = 0$. Margin $= 2/|w| = 2/(\sqrt{2}/3) = 3\sqrt{2} \approx 4.24$. All four points satisfy $y_i(w^\top x_i + b) = 1$: all of them are support vectors.
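This is easy to verify numerically, e.g. with scikit-learn, approximating the hard margin by a very large $C$:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # huge C ~ hard margin
print(clf.coef_)                        # ~ [[0.333, 0.333]] = (1/3, 1/3)
print(clf.intercept_)                   # ~ [0.]
print(2.0 / np.linalg.norm(clf.coef_))  # ~ 4.243 = 3*sqrt(2)
print(clf.n_support_)                   # [2 2]: all four points are support vectors
```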
Generalization Guarantees
Theorem: with probability $\ge 1-\delta$ (over random training sample), the SVM error is bounded:
$ L(h) \le \frac{\text{number of training errors}}{N} + O\left(\sqrt{\frac{\rho^2 / \gamma^2 + \log(1/\delta)}{N}}\right) $
Here $\rho$ is the radius of the data (the largest norm $|x_i|$) and $\gamma$ is the margin. The feature-space dimensionality does not appear! The margin, not the model complexity, determines generalization. This is the theoretical basis: large margin = good generalization.
Support Vector Method: Mathematical Core
SVM solves a classification problem by finding a hyperplane with maximum margin between two classes. This is naturally formulated as a QP. The dual form is especially important: it depends only on inner products $x_i^\top x_j$ between training samples, which paves the way for the kernel trick.
Popular Kernels and Their Properties
- Linear: $K(x,y) = x^\top y$—for linearly separable data
- Polynomial: $K(x,y) = (x^\top y + c)^d$—for polynomial boundary of degree $d$
- RBF (Gaussian): $K(x,y) = \exp(-\gamma|x - y|^2)$—infinite-dimensional kernel, universal approximator
- Sigmoid: $K(x,y) = \tanh(\alpha x^\top y + c)$—related to neural networks
- String kernels, graph kernels—for non-geometric data
Key requirement: Gram matrix $K_{ij} = K(x_i, x_j)$ must be positive semi-definite (Mercer's theorem). This guarantees that the kernel corresponds to an inner product in some Hilbert space.
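A necessary condition is easy to check empirically on any sample: compute the Gram matrix and inspect its smallest eigenvalue. A sketch (the sigmoid parameters are arbitrary):

```python
import numpy as np

def min_gram_eigenvalue(K):
    # Mercer requires the Gram matrix to be symmetric PSD on every sample.
    return np.linalg.eigvalsh((K + K.T) / 2).min()

X = np.random.default_rng(0).normal(size=(50, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

print(min_gram_eigenvalue(np.exp(-0.5 * sq)))  # >= 0: RBF is a valid kernel
print(min_gram_eigenvalue(np.tanh(2.0 * X @ X.T - 1.0)))  # can come out negative:
# the sigmoid kernel is not positive semi-definite for all parameter choices
```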
Implementation and Scaling
For large datasets, generic QP solvers do not scale. Specialized algorithms are used:
- SMO (Sequential Minimal Optimization, Platt, 1998): iteratively optimizes pairs of Lagrange multipliers
- Pegasos (Shalev-Shwartz et al., 2007): stochastic sub-gradient descent for linear SVMs; a sketch follows this list
- LIBLINEAR, LIBSVM—standard libraries
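A minimal sketch of the Pegasos idea (simplified: no projection step and no bias term):

```python
import numpy as np

def pegasos(X, y, lam=0.01, epochs=20, seed=0):
    """Stochastic sub-gradient descent on (lam/2)|w|^2 + mean_i max(0, 1 - y_i w.x_i)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)             # decreasing step size
            violated = y[i] * (w @ X[i]) < 1  # margin check with the current w
            w = (1.0 - eta * lam) * w         # shrink: gradient of the regularizer
            if violated:
                w += eta * y[i] * X[i]        # hinge-loss sub-gradient step
    return w
```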
Applications
For many years, SVMs were state-of-the-art in text, image, and bioinformatics classification tasks. Before the deep learning era, it was SVM with RBF kernel that won competitions for handwritten digit recognition (MNIST), protein classification, and spam filtering. Today, SVMs remain relevant for small datasets, where deep networks overfit, and in industrial applications requiring model interpretability.