RNN, LSTM, and GRU: Architectures for Sequences

Recurrent Neural Networks

Recurrent neural networks are designed for processing sequences—text, time series, audio. They maintain “memory” through a hidden state passed through time. LSTM is a key invention from 1997 that enabled learning of long-term dependencies.

Basic RNN

RNN equations:

hₜ = tanh(Wₕₕ hₜ₋₁ + Wₓₕ xₜ + b), ŷₜ = Wₕᵧ hₜ + c

Breakdown: hₜ is the hidden state (memory of the past), xₜ is the input at time t, Wₕₕ is the “memory” matrix (links hₜ and hₜ₋₁), Wₓₕ is the input matrix, Wₕᵧ is the output matrix. The same weights W are used at every step (weight sharing over time).
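A minimal NumPy sketch of one forward pass through these equations (the dimensions, the rnn_step name, and the random initialization are illustrative assumptions, not from the text):

```python
import numpy as np

# Illustrative sizes (assumptions): input dim 3, hidden dim 4, output dim 2
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(4, 4))   # "memory" matrix, links h_t and h_{t-1}
W_xh = rng.normal(scale=0.1, size=(4, 3))   # input matrix
W_hy = rng.normal(scale=0.1, size=(2, 4))   # output matrix
b, c = np.zeros(4), np.zeros(2)

def rnn_step(h_prev, x):
    """One step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b), y_t = W_hy h_t + c."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x + b)
    return h, W_hy @ h + c

# Unroll over a sequence; the SAME weights are reused at every step (weight sharing)
h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):         # a toy sequence of 5 inputs
    h, y = rnn_step(h, x_t)
```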

Backpropagation Through Time (BPTT): To train an RNN, we unroll it over time and apply the chain rule: ∂L/∂Wₕₕ = Σₜ ∂L/∂hₜ · ∂hₜ/∂Wₕₕ. Gradient: ∂hₜ/∂hₛ = Π_{k=s}^{t-1} ∂hₖ₊₁/∂hₖ = Π_{k=s}^{t-1} Wₕₕᵀ diag(tanh'(zₖ)).

A product of (t−s) Jacobians: if ‖Wₕₕᵀ diag(tanh′(zₖ))‖ < 1, the gradient vanishes; if > 1, it explodes. This is a problem for long sequences (T > 20–30).
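A quick numerical illustration of this effect, a sketch under assumed random weights: we multiply 30 step Jacobians Wₕₕᵀ diag(tanh′(zₖ)) and watch the norm collapse or blow up depending on the scale of Wₕₕ.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
for scale in (0.5, 2.0):                      # spectral scale of W_hh: < 1 vs > 1
    W = scale * np.linalg.qr(rng.normal(size=(n, n)))[0]  # orthogonal matrix times scale
    J = np.eye(n)
    for _ in range(30):                       # product of 30 step Jacobians
        z = rng.normal(size=n)                # random stand-in for the pre-activation z_k
        J = W.T @ np.diag(1 - np.tanh(z)**2) @ J
    print(f"scale={scale}: ||J|| = {np.linalg.norm(J):.2e}")  # near zero or huge
```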

RNN modes: one-to-many (text generation), many-to-one (sentiment classification), many-to-many (machine translation via seq2seq, Named Entity Recognition).
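A brief PyTorch sketch of how the same recurrent output serves the different modes (layer sizes and shapes are illustrative; one-to-many generation would instead feed each output back as the next input):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(2, 10, 8)                    # (batch, time, features)
out, h_n = rnn(x)                            # out: (2, 10, 16)

# many-to-one (sentiment classification): predict from the last hidden state
logits_cls = nn.Linear(16, 3)(out[:, -1])    # (2, 3)

# many-to-many (NER): one prediction per time step
logits_tags = nn.Linear(16, 5)(out)          # (2, 10, 5)
```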

LSTM (Long Short-Term Memory, Hochreiter & Schmidhuber, 1997)

LSTM solves the vanishing gradient problem through a “memory cell” Cₜ and three gates. The key idea: the additive update of the cell creates a direct path for the gradient.

Forget gate: fₜ = σ(Wf · [hₜ₋₁, xₜ] + bf)

Breakdown: the sigmoid output lies in (0,1) — what fraction of the previous memory Cₜ₋₁ to keep. fₜ ≈ 0: wipe memory. fₜ ≈ 1: retain.

Input gate: iₜ = σ(Wᵢ · [hₜ₋₁, xₜ] + bᵢ) — “how much to add”
Candidate: C̃ₜ = tanh(Wc · [hₜ₋₁, xₜ] + bc) — “what to add”

Memory update: Cₜ = fₜ ⊙ Cₜ₋₁ + iₜ ⊙ C̃ₜ

Breakdown: we combine the “filtered” old memory (fₜ ⊙ Cₜ₋₁) and new information (iₜ ⊙ C̃ₜ). The key: addition (not multiplication!) — the gradient can flow without attenuation.

Output gate: oₜ = σ(Wo · [hₜ₋₁, xₜ] + bo)
Hidden state: hₜ = oₜ ⊙ tanh(Cₜ)

Direct gradient path: ∂Cₜ/∂Cₜ₋₁ = diag(fₜ), an elementwise scaling with no repeated multiplication by a weight matrix. If fₜ ≈ 1 (the gate remembers), the gradient passes with coefficient ≈ 1. This resolves vanishing for long dependencies.
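Putting the four equations together, a minimal NumPy sketch of one LSTM step (dimensions, names, and initialization are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, C_prev, x, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step following the gate equations above; [h, x] is concatenation."""
    hx = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ hx + b_f)        # forget gate: fraction of C_{t-1} to keep
    i = sigmoid(W_i @ hx + b_i)        # input gate: how much to add
    C_tilde = np.tanh(W_c @ hx + b_c)  # candidate: what to add
    C = f * C_prev + i * C_tilde       # additive update -> direct gradient path
    o = sigmoid(W_o @ hx + b_o)        # output gate
    h = o * np.tanh(C)
    return h, C

# Illustrative sizes: hidden dim 4, input dim 3
H, X = 4, 3
rng = np.random.default_rng(2)
Ws = [rng.normal(scale=0.1, size=(H, H + X)) for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h, C = lstm_step(np.zeros(H), np.zeros(H), rng.normal(size=X), *Ws, *bs)
```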

GRU (Gated Recurrent Unit, Cho et al., 2014)

A simplification of LSTM: two gates instead of three, and no separate memory cell Cₜ.

Update gate: zₜ = σ(Wz · [hₜ₋₁, xₜ]) — how much to update hₜ.
Reset gate: rₜ = σ(Wr · [hₜ₋₁, xₜ]) — how much to “forget” hₜ₋₁.
New state: h̃ₜ = tanh(W · [rₜ ⊙ hₜ₋₁, xₜ])
Result: hₜ = (1 − zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ

GRU is simpler (fewer parameters) and achieves comparable quality on most tasks.
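The corresponding GRU step as a NumPy sketch (shapes and names are again illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, W_z, W_r, W_h):
    """One GRU step: two gates, no separate memory cell."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx)                                     # update gate
    r = sigmoid(W_r @ hx)                                     # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                     # interpolation from above
```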

Bidirectional RNN and Deep RNN

BiLSTM: a forward LSTM (t = 1,...,T) plus a backward LSTM (t = T,...,1). Concatenating hₜ(fwd) and hₜ(bwd) means each token “sees” both directions. Standard for NER, POS tagging, and machine translation (ELMo is essentially a deep BiLSTM; BERT later replaced recurrence entirely with attention).

Stacked RNN: the input to layer l is hₜ⁽ˡ⁻¹⁾, the output of the previous layer at time t. 2–4 layers give significant improvements on complex tasks.
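In PyTorch both ideas are flags on nn.LSTM; a minimal sketch with assumed sizes:

```python
import torch
import torch.nn as nn

# 2 stacked layers, both directions; sizes are illustrative
lstm = nn.LSTM(input_size=32, hidden_size=128, num_layers=2,
               bidirectional=True, batch_first=True)
x = torch.randn(8, 50, 32)    # (batch, time, features)
out, _ = lstm(x)              # (8, 50, 256): forward and backward states concatenated
```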

Numerical Example: Language Model

An LSTM language model that predicts the next character. Vocabulary: 100 characters. Architecture: Embedding(100→32) → LSTM(32→128) → FC(128→100) → Softmax.
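One possible PyTorch definition of this architecture, a sketch with training code omitted (softmax is left to the loss function or the sampler):

```python
import torch
import torch.nn as nn

class CharLM(nn.Module):
    """Embedding(100→32) → LSTM(32→128) → FC(128→100), as described above."""
    def __init__(self, vocab=100, emb=32, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, vocab)

    def forward(self, idx, state=None):
        out, state = self.lstm(self.emb(idx), state)
        return self.fc(out), state    # logits over the 100-character vocabulary
```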

Training on “War and Peace” (3M characters) for 50 epochs: perplexity = 45. Generated text at temperature = 0.7: “Prince Andrei felt as his heart...” — meaningful. At temperature = 1.5: “To chew this should be bright...” — incoherent. Temperature controls diversity: T < 1 is conservative; T > 1 is more diverse but less accurate.
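Temperature sampling is a one-liner: divide the logits by T before the softmax (the function name here is an assumption):

```python
import torch

def sample_next(logits, temperature=0.7):
    """T < 1 sharpens the distribution (conservative); T > 1 flattens it (diverse)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```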

Assignment: Train an LSTM to predict the next word on the Tolstoy corpus (1 million tokens). Vocabulary size: 10,000 tokens. LSTM: 2 layers, hidden=512. (1) Evaluate perplexity on the test set. (2) Generate 200 words with temperature=0.7 and 1.2. How does temperature affect coherence and diversity? (3) Implement beam search (beam=5) — how does output quality improve?
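For part (3), a minimal beam-search skeleton as a starting point; it assumes a model with the (logits, state) interface sketched above, and omits details such as end-of-sequence handling and length normalization:

```python
import torch

def beam_search(model, start_idx, steps=50, beam=5):
    """Keep the `beam` highest log-probability sequences at every step."""
    beams = [(0.0, [start_idx], None)]        # (log-prob, tokens, recurrent state)
    for _ in range(steps):
        candidates = []
        for score, toks, state in beams:
            logits, new_state = model(torch.tensor([[toks[-1]]]), state)
            logp = torch.log_softmax(logits[0, -1], dim=-1)
            top = torch.topk(logp, beam)
            for lp, idx in zip(top.values, top.indices):
                candidates.append((score + lp.item(), toks + [idx.item()], new_state))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam]
    return beams[0][1]                        # best-scoring token sequence
```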
