
Gradient Boosting and Ensemble Methods

Modern Machine Learning Methods


Ensemble Methods and Boosting

A single classifier is almost always worse than the collective: different models make different mistakes, and their combination reduces the total error. Ensemble methods are the foundation of winning solutions in machine learning on real-world data.

Error Decomposition: Bias and Variance

The error of any algorithm can be decomposed as Err = Bias² + Variance + Irreducible noise. Bias is the systematic error caused by incorrect model assumptions: an overly simple model "misses" the correct answer even with infinite data. Variance is the model's sensitivity to random fluctuations in the training sample: an overly complex model fits the noise of whichever particular sample it was trained on.
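
Written out for quadratic loss at a fixed point $x$ (a standard form, with $f$ the true function, $\hat{f}$ the trained model, and $\sigma^2$ the irreducible noise), the decomposition reads:

$ \mathbb{E}\big[(y - \hat{f}(x))^2\big] = \big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2 $

The first term is Bias², the second is Variance, and the expectation is taken over random draws of the training sample.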

The intuition: a shooter with high bias consistently aims off-center and misses in the same direction every time; a shooter with high variance aims at the bullseye on average but scatters shots all over the target. We want neither.

Bagging: Reducing Variance

Bagging (Bootstrap Aggregating; Breiman, 1994): train B models on B bootstrap samples (each a random sample of size n drawn with replacement from the original n objects), then combine the predictions. For regression, average them: $\hat{F}(x) = (1/B)\sum_b F_b(x)$. For classification, take a majority vote.
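
A minimal sketch of bagging with scikit-learn (assuming scikit-learn is available; the synthetic dataset and all settings are illustrative, not part of the article):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, purely for illustration
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# A single unpruned decision tree: low bias, high variance
single_tree = DecisionTreeRegressor(random_state=0)

# Bagging: 100 trees, each fit on a bootstrap sample of the data;
# predictions are averaged (the default base learner is a decision tree)
bagged = BaggingRegressor(n_estimators=100, bootstrap=True, random_state=0)

print("single tree CV R^2:", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees CV R^2:", cross_val_score(bagged, X, y, cv=5).mean())
```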

Why this works: if the models were independent and each had variance σ², the average of B models would have variance σ²/B. The bias stays the same while the variance shrinks by a factor of B. In practice bootstrap models are correlated, so the reduction is smaller, which is exactly why extra decorrelation helps (see Random Forest below).

Random Forest (Breiman, 2001): bagging of trees with additional randomization. At each node split, a random subset of features is used—√p for classification, p/3 for regression (p is the total number of features). The extra randomness reduces correlation between trees → greater improvement from averaging.

Out-of-bag estimate: each object appears in a given bootstrap sample with probability ≈63%, so on average ≈37% of the trees never saw it during training. Averaging the predictions of those trees for that object gives an honest estimate of the generalization error without a separate validation set.
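
Both ideas, feature subsampling at each split and the out-of-bag estimate, can be tried in a few lines with scikit-learn's RandomForestRegressor (a sketch; the data and parameter values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=12, noise=15.0, random_state=0)

# max_features limits how many features are considered at each split
# (the p/3 heuristic for regression); oob_score=True evaluates each object
# only with the trees whose bootstrap sample did not contain it (~37%).
rf = RandomForestRegressor(
    n_estimators=300,
    max_features=1 / 3,   # roughly p/3 features per split
    oob_score=True,
    bootstrap=True,
    random_state=0,
)
rf.fit(X, y)
print("OOB R^2 estimate:", rf.oob_score_)
```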

Boosting: Reducing Bias

The idea of boosting: models are trained sequentially, and each subsequent model focuses on the errors of the previous ones. If the first model predicts "easy" examples well, the next will focus more on the "hard" ones.

Gradient Boosting (Friedman, 2001): Additive model $F_M(x) = F_0(x) + \sum_{m=1}^M h_m(x)$. At each step m, a new weak tree $h_m$ approximates the "pseudo-residuals"—the negative gradient of the loss function with respect to the prediction:

$ r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}} $

Here $L(y, F)$ is the loss function (quadratic, logloss, Huber, etc.), $y_i$ is the true value, and $F_{m-1}(x_i)$ is the current prediction. For quadratic loss, $r_{im} = y_i - F_{m-1}(x_i)$, the usual residuals; for logloss, weighted residuals. The new tree is then added with shrinkage, $F_m(x) = F_{m-1}(x) + \eta\, h_m(x)$, where the learning rate $\eta \in (0,1]$ controls the "step size" in function space.
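
To make the pseudo-residual step concrete, here is a minimal from-scratch sketch for quadratic loss, where the negative gradient is simply $y_i - F_{m-1}(x_i)$ (the tree depth, learning rate, and synthetic data are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

eta, M = 0.1, 200                    # learning rate and number of boosting steps
F = np.full(len(y), y.mean())        # F_0: constant initial prediction
trees = []

for m in range(M):
    residuals = y - F                # negative gradient of quadratic loss
    h = DecisionTreeRegressor(max_depth=3, random_state=m).fit(X, residuals)
    F = F + eta * h.predict(X)       # F_m = F_{m-1} + eta * h_m
    trees.append(h)

def predict(X_new):
    """Initial constant plus the scaled contribution of every tree."""
    pred = np.full(X_new.shape[0], y.mean())
    for h in trees:
        pred += eta * h.predict(X_new)
    return pred

print("training RMSE:", np.sqrt(np.mean((y - F) ** 2)))
```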

Modern Implementations

XGBoost (Chen, Guestrin, 2016): uses a second-order Taylor expansion of the loss, with gᵢ the first derivative (gradient) and hᵢ the second derivative (Hessian) at the current prediction. The optimal leaf weight is $w_j^* = -(\sum g_i)/(\sum h_i + \lambda)$, summed over the objects in leaf j. Leaf regularization (λ on leaf weights, γ per leaf) prevents overfitting. Feature-wise parallelism accelerates split search. XGBoost dominated Kaggle competitions in 2015–2018.
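
A minimal training sketch with the xgboost Python package (assuming it is installed); reg_lambda and gamma correspond to the λ and γ regularizers mentioned above, and all values are illustrative:

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    reg_lambda=1.0,   # lambda: L2 penalty on leaf weights from the leaf-score formula
    gamma=0.0,        # gamma: minimum loss reduction required to add a split
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("final validation RMSE:", model.evals_result()["validation_0"]["rmse"][-1])
```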

LightGBM: the histogram trick (bucketing continuous features into at most 255 bins) speeds up split search by a factor of 10–20. GOSS (Gradient-based One-Side Sampling): keep all objects with large gradients and randomly sample those with small ones. EFB (Exclusive Feature Bundling): merge mutually exclusive sparse features (ones that are rarely nonzero at the same time) into a single feature.
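
A comparable sketch with the lightgbm package (assuming it is installed), with the histogram binning made explicit via max_bin; the values are illustrative:

```python
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

# max_bin is the histogram trick: continuous features are bucketed into
# at most 255 bins before the split search.
model = lgb.LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,
    max_bin=255,
    random_state=0,
)
print("5-fold CV R^2:", cross_val_score(model, X, y, cv=5).mean())
```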

CatBoost: ordered boosting computes the residual for object i using only models trained on objects that precede i in a random permutation, which prevents the target leakage that is especially harmful with categorical features. Categorical features are encoded automatically via ordered Target Statistics (TS).
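
And a sketch with the catboost package (assuming it is installed), passing a categorical column so the library applies its ordered Target Statistics encoding; the toy data is invented for illustration:

```python
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "area": rng.uniform(30, 150, n),                          # numeric feature
    "district": rng.choice(["center", "north", "south"], n),  # categorical feature
})
y = 2.0 * df["area"] + 50 * (df["district"] == "center") + rng.normal(0, 10, n)

# cat_features marks the columns to be encoded with ordered Target Statistics
model = CatBoostRegressor(iterations=300, learning_rate=0.05, depth=6, verbose=0)
model.fit(df, y, cat_features=["district"])
print("training RMSE:", np.sqrt(np.mean((y - model.predict(df)) ** 2)))
```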

Numerical Example

Data: predicting house price. After two boosting steps, $F_2(x) = 200$ thousand rubles. True price $y = 250$ thousand. Residual $r = 50$ thousand. The third tree $h_3$ predicts 50 thousand, $\eta = 0.1$: $F_3 = 200 + 0.1 \cdot 50 = 205$ thousand. The process is slow but accurate—a "tortoise" that will reach its goal.

Real-World Applications

Credit scoring (banks: LightGBM on 500+ features), churn prediction (telecom), medical diagnostics (disease risk), recommender systems (ranking), real estate price prediction. In 2023, XGBoost/LightGBM formed the basis of the top-10 solutions in 7 of the 10 largest tabular competitions on Kaggle.

Assignment: Train XGBoost and LightGBM on the Boston Housing dataset. Tune hyperparameters (n_estimators, max_depth, learning_rate, subsample) using 5-fold CV. Plot learning curves (train/val RMSE vs number of trees). Compare the OOB estimate of random forest with the CV estimate of XGBoost. Which method converges faster and gives a better result?
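
A possible starting point (a sketch under assumptions, not a reference solution): recent scikit-learn versions no longer ship the Boston Housing dataset, so it is fetched here from OpenML; the hyperparameter grid is illustrative, and the LightGBM run and the learning-curve plots are left as the actual exercise.

```python
import xgboost as xgb
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Boston Housing is no longer bundled with scikit-learn; OpenML still hosts it.
data = fetch_openml(name="boston", version=1, as_frame=False)
X, y = data.data, data.target

# Random forest baseline with its built-in out-of-bag estimate
rf = RandomForestRegressor(n_estimators=500, max_features=1 / 3,
                           oob_score=True, random_state=0).fit(X, y)
print("Random forest OOB R^2:", rf.oob_score_)

# Small illustrative grid for XGBoost, evaluated with 5-fold CV
grid = GridSearchCV(
    xgb.XGBRegressor(random_state=0),
    param_grid={
        "n_estimators": [200, 500],
        "max_depth": [3, 6],
        "learning_rate": [0.05, 0.1],
        "subsample": [0.8, 1.0],
    },
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)
print("best params:", grid.best_params_)
print("XGBoost 5-fold CV RMSE:", -grid.best_score_)
```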
