Module I · Article III · ~4 min read

Automatic Architecture Search and AutoML

Modern Machine Learning Methods

AutoML and Neural Architecture Search

Developing an ML pipeline is a sequence of complex choices: algorithm, preprocessing, hyperparameters, architecture. Each choice requires expertise and hundreds of experiments. AutoML automates these choices, making machine learning accessible to non-experts and accelerating experts' workflows.

The Hyperparameter Optimization Problem

Hyperparameters (learning rate, tree depth, hidden layer size) are not learned via backpropagation—they must be searched for using an external method. The task: find a configuration $\lambda \in \Lambda$ that minimizes the validation error: $\lambda^* = \arg\min_{\lambda\in\Lambda} L_{val}(f_{\lambda}(D_{train}))$.

The difficulty: evaluating $L_{val}(f_\lambda)$ for a single configuration means running a full training, which is expensive. On top of that, the space $\Lambda$ can be combinatorially large and conditional (e.g., if kernel='rbf', additional kernel parameters appear).

Random search (Bergstra & Bengio, 2012): better than grid search in high-dimensional spaces. If only 2 out of 10 hyperparameters actually matter, a grid with 10 values per axis spends its $10^{10}$ evaluations mostly on the irrelevant axes, whereas random search places every sample at a fresh value along each axis, so the important ones are covered densely.
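
A minimal sketch of the contrast in scikit-learn, assuming an illustrative GradientBoostingRegressor on synthetic data; the parameter ranges are made up for the example:

```python
# Grid vs. random search with scikit-learn; estimator, ranges and data are illustrative.
from scipy.stats import loguniform, randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=500, n_features=20, random_state=0)
model = GradientBoostingRegressor(random_state=0)

# Grid search: evaluates every point of the Cartesian product (4 * 4 = 16 candidates).
grid = GridSearchCV(
    model,
    param_grid={"learning_rate": [0.01, 0.03, 0.1, 0.3],
                "max_depth": [2, 3, 4, 5]},
    cv=3,
)

# Random search: the same budget of 16 candidates, but drawn from continuous
# distributions, so every trial lands on a new value along each axis.
rand = RandomizedSearchCV(
    model,
    param_distributions={"learning_rate": loguniform(1e-2, 3e-1),
                         "max_depth": randint(2, 6)},
    n_iter=16,
    cv=3,
    random_state=0,
)

grid.fit(X, y)
rand.fit(X, y)
print("grid best:", grid.best_params_, "| random best:", rand.best_params_)
```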

Bayesian optimization: we build a surrogate model $p(L|\lambda)$ (often a Gaussian process) which approximates the function $L(\lambda)$ based on already evaluated configurations. Acquisition function (EI — Expected Improvement): $EI(\lambda) = \mathbb{E}[\max(0, L_{best} − L(\lambda))]$. We select $\lambda$ with the maximum EI—a balance of exploration and exploitation.
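
A compact sketch of the loop, assuming a 1-D toy objective and a scikit-learn Gaussian process as the surrogate; the EI expression in the comments is the closed-form version for minimization under a Gaussian posterior:

```python
# Minimal Bayesian-optimization sketch: GP surrogate + Expected Improvement.
# The 1-D objective and all settings are illustrative assumptions.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(lam):                  # stands in for "train the model, return validation loss"
    return np.sin(3 * lam) + 0.1 * lam ** 2

rng = np.random.default_rng(0)
lam_obs = rng.uniform(-3, 3, size=4)             # a few already evaluated configurations
loss_obs = objective(lam_obs)

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(20):
    gp.fit(lam_obs.reshape(-1, 1), loss_obs)
    grid = np.linspace(-3, 3, 500).reshape(-1, 1)
    mu, sigma = gp.predict(grid, return_std=True)

    # Expected Improvement for minimization:
    # EI = (L_best - mu) * Phi(z) + sigma * phi(z),  with z = (L_best - mu) / sigma
    best = loss_obs.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    lam_next = grid[np.argmax(ei), 0]            # balance of exploration and exploitation
    lam_obs = np.append(lam_obs, lam_next)
    loss_obs = np.append(loss_obs, objective(lam_next))

print("best lambda:", lam_obs[np.argmin(loss_obs)], "best loss:", loss_obs.min())
```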

Hyperopt / SMAC: Hyperopt uses TPE (Tree-structured Parzen Estimator): it models $p(\lambda \mid L < \text{threshold})$ and $p(\lambda \mid L \geq \text{threshold})$ separately and picks the $\lambda$ with the best ratio between the two densities. SMAC uses a Random Forest as the surrogate (convenient for conditional spaces).
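
A toy illustration of the TPE idea (not the Hyperopt implementation itself): the 1-D objective, the 25% threshold, and the KDE densities below are assumptions made for the sketch.

```python
# TPE in miniature: split evaluated configurations into "good" and "bad" by a loss
# threshold, fit a density to each group, and propose the candidate where the ratio
# p_good(lambda) / p_bad(lambda) is highest. Everything here is illustrative.
import numpy as np
from scipy.stats import gaussian_kde

def loss(lam):                                   # stands in for a full training run
    return (lam - 0.3) ** 2

rng = np.random.default_rng(0)
lams = rng.uniform(0, 1, size=30)
losses = loss(lams)

threshold = np.quantile(losses, 0.25)            # best 25% of trials count as "good"
good, bad = lams[losses < threshold], lams[losses >= threshold]

p_good = gaussian_kde(good)
p_bad = gaussian_kde(bad)

candidates = rng.uniform(0, 1, size=1000)
ratio = p_good(candidates) / p_bad(candidates)
next_lam = candidates[np.argmax(ratio)]          # the next configuration to evaluate
print("next lambda to try:", next_lam)
```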

Auto-Sklearn (Feurer et al., 2015): combines meta-learning (choosing a warm-start configuration based on similar datasets), Bayesian optimization (searching the configuration space), and an ensemble built from the best configurations found. Achieves expert-level performance on many benchmark tasks.
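
A minimal usage sketch, assuming the `auto-sklearn` package; the California Housing dataset is used purely as an example, and the time budgets mirror the 30-minute limit from the assignment below:

```python
# Auto-Sklearn regression sketch; dataset choice and time budgets are illustrative.
import autosklearn.regression
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=1800,   # total budget: 30 minutes
    per_run_time_limit=300,         # at most 5 minutes per candidate pipeline
)
automl.fit(X_train, y_train)

pred = automl.predict(X_test)
print("test RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print(automl.leaderboard())         # which pipelines ended up in the final ensemble
```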

Neural Architecture Search (NAS)

NAS is the search for the optimal neural network architecture. The search space: choice of operations in each cell (3×3 convolution, 5×5, max-pool, skip connection, etc.), number of layers, width.

Early NAS (Zoph & Le, 2017): an LSTM controller generates an architecture description, the architecture is trained on CIFAR-10, and its validation accuracy serves as the reward for updating the controller via REINFORCE. Expensive: 800 GPUs × 28 days.

DARTS (Liu et al., 2019, Differentiable Architecture Search): continuous relaxation of the discrete choice. Each edge of a cell computes a weighted sum of all candidate operations: $\bar{o}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_{o})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'})} \, o(x)$. The architecture parameters $\alpha$ are trained jointly with the weights $w$ via bi-level optimization:

$ \min_\alpha \; L_{val}(w^*(\alpha), \alpha), \quad \text{where} \quad w^*(\alpha) = \arg\min_w L_{train}(w, \alpha) $

The outer loop updates $\alpha$ on validation data, the inner loop updates $w$ on training data. After searching, discretize: in each edge, keep the operation with max $\alpha$. Time: 1 GPU × 4 days.
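
A minimal PyTorch sketch of the mixed operation behind this relaxation; the candidate operation set and channel count are assumptions for illustration:

```python
# DARTS-style continuous relaxation: each edge mixes candidate operations with
# softmax weights over architecture parameters alpha. Shapes/ops are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),   # 3x3 convolution
            nn.Conv2d(channels, channels, 5, padding=2, bias=False),   # 5x5 convolution
            nn.MaxPool2d(3, stride=1, padding=1),                      # max-pool
            nn.Identity(),                                             # skip connection
        ])
        # Architecture parameters alpha: one logit per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# After the search, discretization keeps only the operation with the largest alpha:
edge = MixedOp(channels=16)
out = edge(torch.randn(2, 16, 32, 32))
print(out.shape, "| selected op:", edge.ops[int(edge.alpha.argmax())])
```

In the full algorithm two optimizers alternate: one updates the operation weights $w$ on training batches, the other updates $\alpha$ on validation batches, and discretization then keeps the argmax operation on each edge.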

EfficientNet (Tan & Le, 2019): The base architecture is found via NAS, then scaled in three dimensions: depth $d$, width $w$, resolution $r$ with compound coefficient $\phi$: $d = \alpha^\phi$, $w = \beta^\phi$, $r = \gamma^\phi$. Best accuracy/parameter trade-off among models of its time.
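
A small arithmetic sketch of compound scaling; $\alpha=1.2$, $\beta=1.1$, $\gamma=1.15$ are the coefficients reported in the EfficientNet paper, while the base depth and width values below are illustrative:

```python
# Compound scaling: depth, width and resolution grow jointly with one coefficient phi.
# alpha/beta/gamma come from the EfficientNet paper; base_depth/base_width are illustrative.
def compound_scale(phi, base_depth=18, base_width=64, base_resolution=224,
                   alpha=1.2, beta=1.1, gamma=1.15):
    depth = round(base_depth * alpha ** phi)            # number of layers
    width = round(base_width * beta ** phi)             # channels per layer
    resolution = round(base_resolution * gamma ** phi)  # input image size
    return depth, width, resolution

for phi in range(4):
    print(f"phi={phi}:", compound_scale(phi))
```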

Automation of Feature Engineering

Deep Feature Synthesis (DFS, Kanter & Veeramachaneni, 2015): automatically generates features from relational data. Aggregations (mean, max, sum, number of client transactions per month), transformations (log, sqrt, date differences).
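
A minimal DFS sketch assuming the `featuretools` library (1.x API); the customer/transaction tables, column names, and chosen primitives are illustrative:

```python
# Deep Feature Synthesis with featuretools; tables, columns and primitives are made up.
import pandas as pd
import featuretools as ft

customers = pd.DataFrame({
    "customer_id": [1, 2],
    "joined": pd.to_datetime(["2020-01-01", "2020-02-01"]),
})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [50.0, 20.0, 70.0],
    "time": pd.to_datetime(["2020-03-01", "2020-03-05", "2020-03-02"]),
})

es = ft.EntitySet(id="retail")
es.add_dataframe(dataframe_name="customers", dataframe=customers, index="customer_id")
es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                 index="transaction_id", time_index="time")
es.add_relationship("customers", "customer_id", "transactions", "customer_id")

# DFS stacks aggregation and transformation primitives across the relationship,
# e.g. MEAN(transactions.amount) or COUNT(transactions) per customer.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "max", "sum", "count"],
    trans_primitives=["month"],
)
print(feature_matrix.columns.tolist())
```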

Categorical embeddings: instead of one-hot encoding (dimension = number of categories), we train a dense embedding vector of size $d$. Example: a learned city embedding can implicitly capture attributes such as location or economic profile. Works better than one-hot for categories with thousands of values.
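
A short PyTorch sketch of such an embedding for one categorical column; the category count, embedding size, and network head are illustrative assumptions:

```python
# Learned categorical embedding for a tabular model; all sizes are illustrative.
import torch
import torch.nn as nn

n_cities = 5000      # number of distinct categories
emb_dim = 16         # dense embedding size instead of a 5000-dim one-hot vector

class TabularNet(nn.Module):
    def __init__(self, n_numeric):
        super().__init__()
        self.city_emb = nn.Embedding(n_cities, emb_dim)
        self.head = nn.Sequential(
            nn.Linear(emb_dim + n_numeric, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, city_idx, numeric):
        # Concatenate the learned city vector with the numeric features.
        x = torch.cat([self.city_emb(city_idx), numeric], dim=1)
        return self.head(x)

model = TabularNet(n_numeric=8)
out = model(torch.randint(0, n_cities, (32,)), torch.randn(32, 8))
print(out.shape)     # torch.Size([32, 1])
```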

TabNet (Arik & Pfister, 2021): an attention-based architecture for tabular data. At each decision step it selects a sparse subset of features via a learned attention mask, which also gives interpretability: you can see which features were used.
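
A brief usage sketch, assuming the community `pytorch-tabnet` package; the synthetic data and settings are illustrative:

```python
# TabNet regression via the pytorch-tabnet package; data and settings are made up.
import numpy as np
from pytorch_tabnet.tab_model import TabNetRegressor

X_train = np.random.randn(1000, 20).astype(np.float32)
y_train = X_train[:, :2].sum(axis=1, keepdims=True).astype(np.float32)
X_valid, y_valid = X_train[:200], y_train[:200]       # a held-out slice for the sketch

model = TabNetRegressor()
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], max_epochs=50)

# Aggregated attention masks give per-feature relevance scores.
print(model.feature_importances_)
```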

MLOps: ML in Production

Problem: a model that performs well on historical data degrades in production when the input distribution shifts (data drift) or the relationship between features and target changes (concept drift). Feature store (Feast, Hopsworks): centralized storage of features for training and inference, which eliminates training-serving skew. Model registry (MLflow): versioning of experiments, models, and metrics. Monitoring: Evidently AI detects drift via KL divergence and Kolmogorov-Smirnov tests.
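
A minimal experiment-tracking sketch with MLflow; the model, parameter, and metric names are illustrative assumptions:

```python
# Logging a run to MLflow: parameters, metrics, and the model artifact for the registry.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

with mlflow.start_run(run_name="rf_baseline"):
    model = RandomForestRegressor(max_depth=5, random_state=0).fit(X, y)
    rmse = mean_squared_error(y, model.predict(X)) ** 0.5

    mlflow.log_param("max_depth", 5)           # hyperparameters of this run
    mlflow.log_metric("train_rmse", rmse)      # metrics to compare runs against each other
    mlflow.sklearn.log_model(model, "model")   # versioned model artifact
```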

Numerical Example

Hyperopt on LightGBM: search space — $\text{n\_estimators} \in [100, 1000]$, $\text{learning\_rate} \in [0.01, 0.3]$, $\text{max\_depth} \in [3, 8]$. After 50 iterations of Bayesian optimization it found $\text{n\_estimators} = 300$, $\text{learning\_rate} = 0.05$, $\text{max\_depth} = 5$, with CV RMSE = 4.2 (vs 5.1 with default parameters). A 10×10×6 grid would require 600 iterations.
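
A sketch of that search setup, assuming the `lightgbm` and `hyperopt` packages; the data here is synthetic, so the resulting RMSE will not match the numbers quoted above:

```python
# Hyperopt (TPE) over a LightGBM regressor; the dataset is synthetic and illustrative.
import numpy as np
from hyperopt import fmin, tpe, hp, Trials
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=2000, n_features=30, noise=5.0, random_state=0)

def objective(params):
    model = LGBMRegressor(
        n_estimators=int(params["n_estimators"]),
        learning_rate=params["learning_rate"],
        max_depth=int(params["max_depth"]),
    )
    # Minimize cross-validated RMSE.
    scores = cross_val_score(model, X, y, cv=3, scoring="neg_root_mean_squared_error")
    return -scores.mean()

space = {
    "n_estimators": hp.quniform("n_estimators", 100, 1000, 50),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
    "max_depth": hp.quniform("max_depth", 3, 8, 1),
}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)
print("best configuration:", best)
```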

Assignment: Apply Auto-Sklearn to the House Prices dataset (Kaggle). Time limit: 30 minutes. Compare with “manual” tuning (XGBoost + your Hyperopt). Visualize which configurations were selected. Apply SHAP to interpret the best model.
