Word Embeddings and Pretrained Language Models
Before neural networks can process text, words must be transformed into numerical representations. The evolution from one-hot vectors to Word2Vec, FastText, and BERT demonstrates how neural networks have "learned" to "understand" language. This is one of the most revolutionary trajectories in NLP.
The Problem of Word Representation
One-hot: word $i$ → a vector of zeros with a one at the $i$-th position. Dimension = |V| (vocabulary size, 50K–200K). Problems: (1) All pairwise cosine similarities = 0: every pair of distinct words is orthogonal, so no semantic similarity is captured. (2) Huge dimensionality. (3) Sparsity.
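A quick sanity check of problem (1): any two distinct one-hot vectors are orthogonal, so their cosine similarity is exactly zero regardless of meaning. A minimal NumPy sketch:

```python
import numpy as np

V = 5                # toy vocabulary size (real vocabularies: 50K-200K)
one_hot = np.eye(V)  # row i is the one-hot vector of word i

cat, dog = one_hot[0], one_hot[1]
cos_sim = cat @ dog / (np.linalg.norm(cat) * np.linalg.norm(dog))
print(cos_sim)       # 0.0: "cat" and "dog" look completely unrelated
```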
Distributional Hypothesis (Harris, 1954): words that occur in similar contexts have similar meanings. "Bank" and "finance" often appear in the same contexts, so their vectors should be close; likewise "cat" and "dog". This is the basis for Word2Vec.
Word2Vec (Mikolov et al., 2013)
Skip-gram: Given a word $w_t$ — predict the surrounding words $w_{t-2}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+2}$ (window $c=2$).
Goal: $\max \sum_t \sum_{-c \leq j \leq c,\, j \neq 0} \log P(w_{t+j} \mid w_t)$
$P(w \mid w_t) = \frac{\exp({v'_w}^\top v_{w_t})}{\sum_{i=1}^{|V|} \exp({v'_{w_i}}^\top v_{w_t})}$
Computing the softmax is expensive (the denominator sums over the entire vocabulary). Negative sampling: for each positive pair ($w_t$, $w_{t+j}$), sample $k$ random "negative" words and train a binary classifier to distinguish real context words from random ones, as in the sketch below.
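A minimal NumPy sketch of one SGD step of skip-gram with negative sampling (the function name `sgns_step` and the uniform negative sampler are illustrative; real implementations draw negatives from the unigram distribution raised to the power 3/4):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 100, 5              # vocab size, embedding dim, negatives per pair
W_in = rng.normal(0, 0.1, (V, d))   # v_w : center-word (input) embeddings
W_out = rng.normal(0, 0.1, (V, d))  # v'_w: context-word (output) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, lr=0.025):
    """Binary classification: label 1 for the real context word, 0 for k random words."""
    negatives = rng.integers(0, V, size=k)  # illustrative: uniform, not unigram^0.75
    v_w = W_in[center].copy()
    grad_in = np.zeros_like(v_w)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        g = sigmoid(W_out[word] @ v_w) - label  # gradient of the logistic loss
        grad_in += g * W_out[word]
        W_out[word] -= lr * g * v_w
    W_in[center] -= lr * grad_in

sgns_step(center=3, context=17)
```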
CBOW (Continuous Bag of Words): the reverse task — predict $w_t$ from its context. Faster than skip-gram, slightly worse on rare words.
Properties of Word2Vec embeddings: Linear analogies: king − man + woman ≈ queen. Paris − France + Russia ≈ Moscow. Syntactic patterns: jumped − jump ≈ ran − run.
Why does this work: Levy & Goldberg (2014) showed that skip-gram with negative sampling implicitly factorizes the word-context PMI (Pointwise Mutual Information) matrix, $\log \frac{P(w,c)}{P(w)P(c)}$, shifted by $\log k$.
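To make PMI concrete, here is a toy computation of the (positive) PMI matrix from raw co-occurrence counts; the counts below are invented for illustration:

```python
import numpy as np

# Invented co-occurrence counts X[i, j]: how often context word j appears near word i
X = np.array([[10., 2., 0.],
              [ 2., 8., 1.],
              [ 0., 1., 6.]])

p_wc = X / X.sum()                     # joint P(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)  # marginal P(w)
p_c = p_wc.sum(axis=0, keepdims=True)  # marginal P(c)

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))   # -inf where a pair never co-occurs
ppmi = np.maximum(pmi, 0.0)            # positive PMI, the common practical variant
print(ppmi.round(2))
```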
FastText (Bojanowski et al., 2017): the embedding of a word = the sum of the embeddings of its character n-grams. The word "programming" → {"progr", "rogra", ...}. Advantages: handles out-of-vocabulary (OOV) words ("programmer" → average of its subword embeddings; see the sketch below) and works better for morphologically rich languages (Russian, Finnish, Arabic).
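A sketch of FastText-style subword extraction (the helper `char_ngrams` is illustrative; by default FastText uses n = 3..6 and wraps words in the boundary markers `<` and `>`):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with FastText-style boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

print(char_ngrams("cat"))
# ['<ca', 'cat', 'at>', '<cat', 'cat>', '<cat>']
# An OOV word's vector = the average (or sum) of its n-gram vectors.
```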
GloVe (Pennington et al., 2014)
Co-occurrence matrix: $X_{ij}$ = the number of times word $j$ occurred in the context of word $i$. The ratio of co-occurrence probabilities $P(k|i)/P(k|j)$ carries semantics: for $i$ = ice, $j$ = steam, $k$ = water the ratio is $\approx 1$ (both relate to water); for $k$ = solid it is $\gg 1$ (ice is solid, steam is not).
GloVe's objective function:
$J = \sum_{i,j} f(X_{ij}) (v_i^\top \tilde v_j + b_i + \tilde b_j - \log X_{ij})^2$
$f(x) = (x/x_{\max})^\alpha$ for $x < x_{\max}$, else $1$: the weighting function downweights rare co-occurrences and caps the influence of very frequent ones. GloVe combines global statistics (like LSA) with local context (like Word2Vec).
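A NumPy sketch of this objective (the names `W`, `W_tilde`, `b`, `b_tilde` are illustrative; the paper's defaults are $x_{\max} = 100$, $\alpha = 3/4$):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): grows to 1 at x_max, then stays flat."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted least squares over the observed (non-zero) co-occurrence cells."""
    pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    mask = X > 0
    log_X = np.log(np.where(mask, X, 1.0))  # avoid log(0); masked out below anyway
    return np.sum(glove_weight(X) * np.where(mask, (pred - log_X) ** 2, 0.0))

rng = np.random.default_rng(0)
V, d = 50, 16
X = rng.poisson(1.0, (V, V)).astype(float)
print(glove_loss(X, rng.normal(0, 0.1, (V, d)), rng.normal(0, 0.1, (V, d)),
                 np.zeros(V), np.zeros(V)))
```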
Contextual Embeddings: ELMo and BERT
The problem of context-independent embeddings: "bank account" vs "river bank" is the same word with different meanings, yet Word2Vec assigns it a single vector. Contextual representations are needed.
ELMo (Peters et al., 2018): a bidirectional LSTM language model. The embedding of a word = a learned weighted sum of all BiLSTM layers, so it depends on the entire sentence. It improved results by roughly 10-20% on many NLP tasks of that time.
BERT (Devlin et al., 2018): a Transformer encoder pretrained on two tasks. Masked Language Modeling (MLM): mask 15% of the tokens and predict them. Next Sentence Prediction (NSP): predict whether one sentence follows another. Fine-tuning on downstream tasks: add a task-specific head and train for a few epochs. A revolution at the time: gains of up to 10-20% across GLUE, SQuAD, NER, and more.
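Contextuality is easy to observe directly: compare BERT's vectors for "bank" in two sentences. A sketch assuming the Hugging Face `transformers` and `torch` packages are installed:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Last-layer BERT vector of the token 'bank' in the given sentence."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]

v1 = bank_vector("I deposited money in the bank account.")
v2 = bank_vector("We sat on the river bank and watched the water.")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0
```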
GPT (OpenAI, 2018–2023): an autoregressive Transformer decoder. Pretraining: predict the next token. GPT-3 (175B parameters): few-shot learning, solving tasks without fine-tuning, using only examples given in the context. ChatGPT = GPT-3.5/GPT-4 + RLHF (Reinforcement Learning from Human Feedback).
Numerical Example: Word2Vec Analogies
Model: Google News Word2Vec (300-dimensional vectors, trained on a corpus of about 100B words).
king − man + woman: closest vector = queen (cos sim 0.71). Moscow − Russia + France: Paris (0.69). jumped − jump + run: runs (0.74).
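These analogies can be reproduced with gensim's downloader (the model key below is gensim's standard identifier; the first call downloads about 1.6 GB):

```python
import gensim.downloader as api

model = api.load("word2vec-google-news-300")  # KeyedVectors, 3M words and phrases

print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
print(model.most_similar(positive=["Moscow", "France"], negative=["Russia"], topn=1))
print(model.most_similar(positive=["jumped", "run"], negative=["jump"], topn=1))
```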
Interesting failure: doctor − he + she = nurse (gender bias in the corpus). This illustrates the problem of social biases in language models.
Assignment: Use pretrained Word2Vec (gensim + Google News 300d). (1) Find the 10 closest words to "Moscow", "programming", "crisis". (2) Test the analogies: Moscow − Russia + France = ?; president − man + woman = ? (3) Compare with GloVe on the same tasks: where do the results differ? (4) Implement Word2Vec from scratch (skip-gram + negative sampling) on a small corpus, train it on 1M tokens of Wikipedia, and compare its analogies with the pretrained model.