Graph Neural Networks (GNN)
Deep Learning: Theory and Practice
Many kinds of real-world data have a graph structure: social networks (users + connections), molecules (atoms + bonds), road networks (intersections + roads), knowledge bases (entities + relations). Standard CNNs and MLPs ignore this structure; GNNs preserve it.
The Problem of Learning on Graphs
A graph G = (V, E) has n nodes and m edges. Each node v carries a feature vector x_v ∈ ℝᵈ, and the adjacency matrix is A ∈ {0,1}^{n×n}. The task is to predict a label for each node (node classification), for an edge (link prediction), or for the entire graph (graph classification).
The main differences from ordinary grid-like data: there is no fixed node order (the model must be permutation-equivariant), the graph size can vary between examples, and the neighborhood structure must be taken into account.
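To make the setup concrete, here is a tiny PyTorch sketch of storing such a graph as a feature matrix X and an adjacency matrix A; the node count, feature size, and edge list are made up for illustration.

```python
import torch

# Toy graph: n = 4 nodes, d = 3 features per node (hypothetical numbers)
n, d = 4, 3
X = torch.randn(n, d)                      # node features x_v ∈ ℝᵈ
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]   # undirected edges

A = torch.zeros(n, n)
for u, v in edges:
    A[u, v] = 1.0
    A[v, u] = 1.0                          # symmetric: undirected graph

# Permuting the node order permutes rows/columns of A and rows of X,
# but it is still the same graph: a GNN's predictions must respect this.
perm = torch.randperm(n)
X_perm, A_perm = X[perm], A[perm][:, perm]
```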
Message Passing Framework
The general principle behind GNNs is "message passing" between neighbors: each node aggregates information from its neighbors and updates its own representation:
hᵥ^{(k)} = UPDATE^{(k)}(hᵥ^{(k-1)}, AGGREGATE^{(k)}({hᵤ^{(k-1)}: u ∈ N(v)}))
After K iterations: hᵥ^{(K)} encodes information about the K-hop neighborhood of v.
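A minimal PyTorch sketch of one such iteration, assuming mean aggregation and a single linear UPDATE; the class name MessagePassingLayer and these particular choices are illustrative, not a specific published architecture.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One generic message-passing step: AGGREGATE neighbor states, then UPDATE."""
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)  # UPDATE sees [h_v, aggregated message]

    def forward(self, H, A):
        # H: (n, dim) node states h_v^{(k-1)};  A: (n, n) adjacency matrix
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        msg = (A @ H) / deg                    # AGGREGATE: mean over neighbors N(v)
        return torch.relu(self.update(torch.cat([H, msg], dim=-1)))  # UPDATE
```

Stacking K such layers gives each node a K-hop receptive field, exactly as the formula above describes.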
Specific GNN Architectures
Graph Convolutional Network (GCN, Kipf & Welling, 2017):
H^{(l+1)} = σ(D̃^{−1/2} Ã D̃^{−1/2} H^{(l)} W^{(l)})
Here Ã = A + I (self-loops added) and D̃ is the degree matrix of Ã, so D̃^{−1/2} Ã D̃^{−1/2} is the symmetrically normalized adjacency matrix. Intuition: each node averages its neighbors' features together with its own, and the normalization keeps feature magnitudes from growing with node degree.
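A dense-matrix sketch of this propagation rule in PyTorch; real implementations use sparse operations, and GCNLayer here is an illustrative name, not the library class.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H' = σ(D̃^{-1/2} Ã D̃^{-1/2} H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        A_tilde = A + torch.eye(A.size(0), device=A.device)   # Ã = A + I (self-loops)
        d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)              # D̃^{-1/2}
        A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
        return torch.relu(self.W(A_hat @ H))
```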
Problem: stacking many GCN layers "smooths" the features; as K→∞, all node representations converge to nearly the same value (over-smoothing).
GraphSAGE (Hamilton et al., 2017):
hᵥ^{(k)} = σ(W · CONCAT(hᵥ^{(k−1)}, AGG({hᵤ^{(k−1)}: u ∈ N(v)})))
AGG can be mean, max, or an LSTM. GraphSAGE is inductive: it generalizes to nodes unseen during training without retraining (whereas the original GCN setup is transductive).
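In PyTorch Geometric this layer is available as SAGEConv; a small usage sketch, assuming torch_geometric is installed (the sizes and the placeholder edge list below are made up).

```python
import torch
from torch_geometric.nn import SAGEConv

# Cora-sized feature matrix and a tiny placeholder edge list in PyG's
# (2, num_edges) format; both are illustrative.
x = torch.randn(2708, 1433)
edge_index = torch.tensor([[0, 1], [1, 0]])

conv = SAGEConv(in_channels=1433, out_channels=64, aggr='mean')  # aggr: 'mean', 'max', ...
h = conv(x, edge_index)   # inductive: the same trained layer works on a new, larger graph
```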
Graph Attention Network (GAT, Veličković et al., 2018):
αᵥᵤ = exp(LeakyReLU(aᵀ[Whᵥ || Whᵤ])) / Σ_{u′∈N(v)} exp(LeakyReLU(aᵀ[Whᵥ || Whᵤ′]))
hᵥ′ = σ(Σᵤ αᵥᵤ Whᵤ)
Attention weights αᵥᵤ are learned automatically, so different neighbors contribute differently. Multi-head attention: several independent attention mechanisms whose outputs are combined.
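A single-head, dense-matrix sketch of these two formulas; GATLayer is an illustrative name, and the real GAT uses sparse neighborhoods, ELU, and multiple heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head GAT layer: attention α_vu over neighbors, then a weighted sum."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Parameter(torch.randn(2 * out_dim))   # attention vector a

    def forward(self, H, A):
        Wh = self.W(H)                                     # (n, out_dim)
        n = Wh.size(0)
        # scores e_vu = LeakyReLU(aᵀ [Wh_v || Wh_u]) for every pair (v, u)
        pairs = torch.cat([Wh.unsqueeze(1).expand(n, n, -1),
                           Wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(pairs @ self.a, negative_slope=0.2)
        mask = (A + torch.eye(n, device=A.device)) == 0    # keep only neighbors + self
        alpha = torch.softmax(e.masked_fill(mask, float('-inf')), dim=1)  # rows sum to 1
        return torch.relu(alpha @ Wh)                      # h_v' = σ(Σ_u α_vu W h_u)
```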
Graph Transformer
Graphormer (Ying et al., 2021) applies a standard Transformer to graphs with special structural encodings: centrality encoding (node degree), spatial encoding (shortest-path distance between nodes), and edge encoding. It reached SOTA results on many graph tasks.
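A rough sketch of the first two encodings on a small random graph; the function and variable names are illustrative, not the Graphormer codebase, and edge encoding is omitted.

```python
import torch
import torch.nn as nn

def shortest_path_lengths(A, max_dist=10):
    """All-pairs shortest paths (Floyd-Warshall) on an unweighted graph."""
    n = A.size(0)
    dist = torch.full((n, n), float(max_dist))
    dist[A > 0] = 1.0
    dist.fill_diagonal_(0.0)
    for k in range(n):
        dist = torch.minimum(dist, dist[:, k:k+1] + dist[k:k+1, :])
    return dist.long().clamp(max=max_dist)

n, dim, max_dist = 8, 32, 10
A = (torch.rand(n, n) < 0.3).float()
A = ((A + A.t()) > 0).float()
A.fill_diagonal_(0.0)

centrality_emb = nn.Embedding(n, dim)          # centrality encoding: embed node degree
spatial_emb = nn.Embedding(max_dist + 1, 1)    # spatial encoding: bias per path length

deg = A.sum(dim=1).long().clamp(max=n - 1)
node_bias = centrality_emb(deg)                                 # added to token embeddings
attn_bias = spatial_emb(shortest_path_lengths(A)).squeeze(-1)   # added to attention logits
```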
GNN Applications
Chemistry and Pharmaceuticals: a molecule is a graph (atoms + bonds). GNNs predict molecular properties: solubility, toxicity, biological activity. AlphaFold itself is built on attention (Transformer-style) rather than a GNN, but DeepMind uses GNNs widely for molecular dynamics.
Recommendation Systems: a user-item graph. A GNN propagates information through the graph, yielding better recommendations. PinSage (Pinterest, 2018): a real production system on a graph of 3 billion nodes.
Fraud Detection: Transaction graph. Fraudulent accounts create characteristic patterns in the graph. GNN detects fraud substructures.
Physical Simulations: particle systems as a graph (particle = node, interaction = edge). A GNN approximates the dynamics, running about 1000× faster than a numerical solver. DeepMind's MeshGraphNets for CFD simulations.
Numerical Example
Node classification task: Cora dataset (2708 scientific papers, 7 classes, 5429 citations). Features: bag-of-words (1433 features).
Logistic Regression (ignoring the graph): test accuracy 58%. GCN (2 layers): 81.5%. GAT (2 layers, 8 heads): 83.0%. GraphSAGE (mean aggregation): 82.1%.
Importance of structure: a GCN without edges (using only node features) performs like logistic regression, so the graph structure alone contributes roughly +23 percentage points.
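A minimal PyTorch Geometric starting point for reproducing the GCN number above, assuming torch_geometric is installed; the hyperparameters are typical defaults, not tuned.

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

dataset = Planetoid(root='data/Cora', name='Cora')   # 2708 nodes, 1433 features, 7 classes
data = dataset[0]

class GCN(torch.nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_features, hidden)
        self.conv2 = GCNConv(hidden, dataset.num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        return self.conv2(x, edge_index)

model = GCN()
opt = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

for epoch in range(200):
    model.train()
    opt.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    opt.step()

model.eval()
pred = model(data.x, data.edge_index).argmax(dim=1)
acc = (pred[data.test_mask] == data.y[data.test_mask]).float().mean()
print(f'test accuracy: {acc:.3f}')
```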
Assignment: Implement GCN and GAT in PyTorch Geometric for the Cora dataset. (1) Train a GCN (2 layers, hidden=64) and report train/val/test accuracy. (2) Train a GAT (2 layers, 8 heads). (3) Visualize GAT attention (a heatmap of αᵥᵤ weights for 5 nodes): which neighbors get the highest weight? (4) Study over-smoothing: how does accuracy change with 1, 2, 4, and 8 layers? (5) Link prediction: predict whether an edge exists between a pair of nodes.