Architectures for Time Series and Forecasting

Recurrent Neural Networks

Time series (financial data, IoT sensors, meteorological observations, energy consumption) require specialized architectures. Neural networks compete with classical methods (ARIMA, exponential smoothing) and often outperform them when sufficient data is available.

Specifics of Time Series

Key properties: autocorrelation (values depend on the past), trend (long-term direction), seasonality (periodic patterns: daily, weekly, yearly), and non-stationarity (statistical properties such as the mean and variance change over time).

Preprocessing: Normalization: $\hat{x}_t = (x_t - \mu)/\sigma$. Differencing (removes non-stationarity): $\Delta x_t = x_t - x_{t-1}$. Seasonal differencing: $x_t - x_{t-s}$. Log transformation: $\ln(x_t)$, which stabilizes variance under multiplicative seasonality.
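A minimal NumPy sketch of these transforms on a synthetic series (variable names and the period $s = 24$ are illustrative):

```python
import numpy as np

x = np.random.rand(1000).cumsum()   # synthetic non-stationary series

x_norm = (x - x.mean()) / x.std()   # normalization
dx = np.diff(x)                     # first differencing: x_t - x_{t-1}
sdx = x[24:] - x[:-24]              # seasonal differencing with period s = 24
x_log = np.log1p(x)                 # log transform (x is non-negative here)
```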

Train/test split: you cannot shuffle randomly! Shuffling would leak future information into training. Only chronological splitting: train = first 80%, test = last 20%.
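Continuing the sketch above; note that $\mu$ and $\sigma$ for normalization should also come from the training part only, otherwise the test set leaks into preprocessing:

```python
# Chronological split: train precedes test in time, no shuffling
split = int(len(x) * 0.8)
x_train, x_test = x[:split], x[split:]

mu, sigma = x_train.mean(), x_train.std()   # statistics from train only
x_test_norm = (x_test - mu) / sigma
```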

Seq2Seq for Forecasting

Encoder-Decoder (Sutskever et al., 2014): Encoder (LSTM): processes the input sequence (T steps) → bottleneck vector $z = h_T$. Decoder (LSTM): generates a forecast for H steps, starting from $z$.
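A minimal PyTorch sketch of such an encoder-decoder (no attention; the layer sizes and the autoregressive decoding scheme are illustrative choices, not the paper's exact setup):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """LSTM encoder-decoder: compresses T input steps into z = h_T, decodes H steps."""
    def __init__(self, hidden=64, horizon=24):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(1, hidden, batch_first=True)
        self.decoder = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (batch, T, 1)
        _, (h, c) = self.encoder(x)            # bottleneck: final hidden state
        step = x[:, -1:, :]                    # start from the last observed value
        outs = []
        for _ in range(self.horizon):
            out, (h, c) = self.decoder(step, (h, c))
            step = self.head(out)              # feed the prediction back in
            outs.append(step)
        return torch.cat(outs, dim=1)          # (batch, H, 1)
```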

Problem: the entire history is compressed into one vector $z$—information loss for long histories. Solution: attention mechanism allows the decoder to "look" at all encoder steps.

Temporal Fusion Transformer (TFT, Lim et al., 2020): SOTA for multi-horizon time-series forecasting. Components: Gated Residual Networks (GRN) for variable selection and feature importance; LSTM for short- and long-term patterns; temporal self-attention for identifying important time steps; quantile loss for predicting uncertainty intervals.
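The full TFT is too large to sketch here, but its quantile (pinball) loss is compact. A minimal version (our sketch, not the reference implementation):

```python
import torch

def quantile_loss(y, y_hat, q):
    """Pinball loss: asymmetric penalty around quantile q in (0, 1)."""
    e = y - y_hat
    return torch.mean(torch.maximum(q * e, (q - 1) * e))
```

Training against $q = 0.1, 0.5, 0.9$ simultaneously yields a median forecast plus an 80% uncertainty interval.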

N-BEATS (Oreshkin et al., 2020)

Idea: a pure neural network, with no recurrence and no convolutions. A stack of blocks: each block predicts a "component" of the series and passes the "residual" onward. Two architectures: Generic (no constraints) and Interpretable (trend + seasonality, like a Holt-Winters decomposition). It did not compete in the M4 Competition (2018) itself, but outperformed the M4 winner on the M4 benchmark while remaining interpretable.

N-BEATS Block:

  1. FC layers: $x_t \rightarrow (\theta^b, \theta^f)$ (backcast and forecast parameters)
  2. Backcast: $g_b(\theta^b)$—reconstruction of input (what the block "explained")
  3. Forecast: $g_f(\theta^f)$—block forecast
  4. Residual: the input minus the backcast, $x_{\ell+1} = x_\ell - g_b(\theta^b_\ell)$ (here $\ell$ indexes blocks, not time), is passed to the next block; see the sketch after this list
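A minimal PyTorch sketch of a generic block and the residual stacking (we fold the paper's $\theta$-plus-basis split into single linear heads, a simplification of the Generic architecture; sizes are illustrative):

```python
import torch
import torch.nn as nn

class NBeatsBlock(nn.Module):
    """Generic N-BEATS block: FC stack -> (backcast, forecast)."""
    def __init__(self, backcast_len, forecast_len, hidden=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(backcast_len, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.backcast_head = nn.Linear(hidden, backcast_len)  # g_b(theta_b)
        self.forecast_head = nn.Linear(hidden, forecast_len)  # g_f(theta_f)

    def forward(self, x):
        h = self.fc(x)
        return self.backcast_head(h), self.forecast_head(h)

def nbeats_stack(blocks, x, forecast_len):
    """Each block explains part of the input; the residual goes to the next block."""
    forecast = torch.zeros(x.size(0), forecast_len)
    for block in blocks:
        backcast, f = block(x)
        x = x - backcast            # residual passed onward
        forecast = forecast + f     # block forecasts are summed
    return forecast
```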

PatchTST and Transformers for Time Series

PatchTST (Nie et al., 2023): splits the time series into patches of L consecutive points; each patch becomes a "token" with its own positional encoding, and a standard Transformer is applied over the patch tokens. More efficient than treating every individual point as a token: fewer tokens and better local semantics.
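A sketch of the patching step (non-overlapping patches for simplicity; PatchTST itself uses a stride, so patches may overlap):

```python
import torch

def to_patches(x, patch_len):
    """x: (batch, T) -> (batch, num_patches, patch_len)."""
    num = x.size(1) // patch_len
    return x[:, :num * patch_len].reshape(x.size(0), num, patch_len)

x = torch.randn(32, 512)               # batch of 32 series, 512 points each
tokens = to_patches(x, patch_len=16)   # (32, 32, 16): 32 tokens instead of 512
# Each patch is then linearly embedded and fed to a standard Transformer encoder
```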

TimesNet (2023): transforms a 1D time series into a 2D "image" using its periodicity; a CNN then processes it as an image. SOTA on several tasks.
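A simplified illustration of the folding idea (one dominant period found via FFT; TimesNet itself uses the top-k periods):

```python
import numpy as np

def fold_by_period(x):
    """Fold a 1D series into a 2D (cycles x period) array via the dominant FFT period."""
    spectrum = np.abs(np.fft.rfft(x))
    spectrum[0] = 0.0                                # ignore the DC component
    period = len(x) // int(np.argmax(spectrum))
    rows = len(x) // period
    return x[:rows * period].reshape(rows, period)   # a 2D "image" for the CNN
```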

Comparison with Classical Methods

ARIMA(p,d,q): $p$—AR order, $d$—order of differencing, $q$—MA order. Good for stationary series, small data volumes, interpretability. Does not scale to thousands of series.
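Fitting an ARIMA in statsmodels, for reference (synthetic data; the order would normally be chosen by AIC or an automatic search):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.random.randn(500).cumsum()        # synthetic non-stationary series
model = ARIMA(y, order=(2, 1, 2)).fit()  # p=2, d=1, q=2
forecast = model.forecast(steps=24)      # 24-step-ahead forecast
```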

Prophet (Facebook, 2017): $y(t) = $ trend $+$ seasonality $+$ holidays $+ \epsilon$. Trend: piecewise-linear or logistic. Seasonality: Fourier series. Good for business series with holidays and trend changes.
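A minimal Prophet usage sketch (synthetic daily data; the column names "ds" and "y" are what the library requires):

```python
import numpy as np
import pandas as pd
from prophet import Prophet

df = pd.DataFrame({"ds": pd.date_range("2020-01-01", periods=730, freq="D"),
                   "y": np.linspace(0, 100, 730)})

m = Prophet()   # piecewise-linear trend + Fourier seasonalities by default
m.fit(df)
future = m.make_future_dataframe(periods=90)
fcst = m.predict(future)   # includes yhat, yhat_lower, yhat_upper, trend
```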

LSTM vs ARIMA: with more than 500–1000 training points, LSTM is usually better; with small data, classical methods are competitive. An ensemble (LSTM + ARIMA + Prophet) often outperforms any single method.

M4 Competition (2018, 100K series): the winner was a hybrid of exponential smoothing and an LSTM (Smyl's ES-RNN). Pure neural networks (without ES) took only 5th–6th place. Hybrid approach > pure neural.

Numerical Example

Forecasting electricity consumption in Australia (hourly data, 5 years): 43,800 points.

SARIMA(2,1,2)(1,1,1,24): MASE = 1.24 (baseline). LSTM (2 layers × 64, 72h lag): MASE = 1.08 (-13%). TFT: MASE = 0.96 (-23%). Ensemble (TFT + SARIMA): MASE = 0.89 (-28%).

MASE < 1 means better than the seasonal naive baseline (predicting the same value as at the same hour yesterday).
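For reference, MASE scales the forecast error by the in-sample error of that seasonal naive baseline; a minimal sketch ($\text{season} = 24$ for hourly data):

```python
import numpy as np

def mase(y_true, y_pred, y_train, season=24):
    """MAE of the forecast, scaled by the in-sample seasonal-naive MAE."""
    mae = np.mean(np.abs(y_true - y_pred))
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return mae / naive_mae
```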

Assignment: Use the M4 Competition dataset (100K series of various frequencies). (1) For 10 random monthly series: train SARIMA(1,1,1)(1,1,1,12)—automatically via pmdarima. (2) Train LSTM with 12 lags, 2 layers × 64 neurons. (3) Ensemble with weights 0.5/0.5. Evaluate MASE on the test set (last 18 points). (4) Visualize forecasts on 6 random series—when does LSTM outperform SARIMA, and vice versa?
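A possible starter for steps (1) and (3), assuming pmdarima's automatic order search (function and variable names are ours):

```python
import pmdarima as pm

def sarima_forecast(y_train, horizon=18):
    """Step (1): automatic SARIMA with monthly seasonality (m=12)."""
    model = pm.auto_arima(y_train, seasonal=True, m=12, suppress_warnings=True)
    return model.predict(n_periods=horizon)

# Step (3): equal-weight ensemble once both forecasts are available
# ensemble = 0.5 * sarima_pred + 0.5 * lstm_pred
```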
