Generative Models: from VAE to Diffusion
Discriminative models answer the question “which class?”; generative models answer “what does the data look like?”. They learn a representation of the data distribution and can create new samples from it—images, molecules, music. Three families have defined modern generative AI (AIGC): VAEs, GANs, and diffusion models.
Variational Autoencoder (VAE, Kingma & Welling, 2013)
Problem statement: given data $x$ (e.g., an image), we want to train a generative model $p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$, where $z$ is a latent code (hidden factors of variation). The integral over all $z$ is intractable.
Variational inference (ELBO): we introduce an encoder $q_\phi(z|x) \approx p(z|x)$ and maximize the evidence lower bound (ELBO):
$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \Vert p(z)) = \mathcal{L}(\theta, \phi; x)$
Unpacking the two terms:
- Reconstruction term $\mathbb{E}[\log p_\theta(x|z)]$: how well the decoder reconstructs $x$ from $z$—the reconstruction quality
- KL divergence $KL(q_\phi(z|x)\Vert p(z))$: how close the approximate posterior is to the prior $p(z)=\mathcal{N}(0, I)$—regularization of the latent space
Reparameterization trick: $z \sim q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x)I)$. We cannot take the gradient through a stochastic sample. Trick: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, I)$. Now the randomness is in $\varepsilon$ (does not depend on parameters), the gradient $\partial z / \partial \phi$ is analytic.
β-VAE: scale the KL term: $\mathcal{L} = \mathbb{E}[\log p(x|z)] - \beta \cdot KL$. For $\beta > 1$: strong regularization $\rightarrow$ disentangled representation (variation factors are independent in $z$). Example: $z_1 =$ lighting, $z_2 =$ face rotation, $z_3 =$ expression.
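Below is a minimal PyTorch sketch of the (β-)VAE objective with the reparameterization trick. The `encoder` (returning $\mu$ and $\log\sigma^2$) and `decoder` (returning Bernoulli logits) are assumed, hypothetical modules; this illustrates the loss, it is not a reference implementation.

```python
import torch
import torch.nn.functional as F

def vae_loss(encoder, decoder, x, beta=1.0):
    """(beta-)VAE objective: reconstruction term + beta * KL, averaged over the batch."""
    mu, logvar = encoder(x)                       # parameters of q_phi(z|x)
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)                   # epsilon ~ N(0, I): all randomness lives here
    z = mu + std * eps                            # reparameterization: gradients flow through mu, std
    x_logits = decoder(z)                         # Bernoulli logits for the reconstruction
    recon = F.binary_cross_entropy_with_logits(   # = -E_q[log p_theta(x|z)] for a Bernoulli decoder
        x_logits, x, reduction="sum") / x.shape[0]
    # Closed-form KL( N(mu, sigma^2 I) || N(0, I) ), averaged over the batch
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.shape[0]
    return recon + beta * kl
```

With `beta=1` this is the standard VAE objective; `beta > 1` strengthens the pull toward the $\mathcal{N}(0, I)$ prior, as in the β-VAE above.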
Generative Adversarial Networks (GAN, Goodfellow et al., 2014)
Game formulation: Generator $G$: $z \rightarrow \hat{x}$ and discriminator $D$: $x \rightarrow [0,1]$ play a minimax game:
$\min_G \max_D V(D, G) = \mathbb{E}_{x\sim p_{data}}[\log D(x)] + \mathbb{E}_{z\sim p_z}[\log(1 - D(G(z)))]$
Nash equilibrium: for a fixed $G$, the optimal discriminator is $D^*(x) = \dfrac{p_{data}(x)}{p_{data}(x) + p_G(x)}$. At equilibrium $G$ reproduces $p_{data}$, so $D^*(x) = 1/2$ everywhere—the discriminator can no longer distinguish real from generated samples.
Wasserstein GAN (Arjovsky et al., 2017): more stable than the original GAN. Replaces the JS divergence with the Wasserstein distance: $W(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x, y)\sim\gamma}[\lVert x-y \rVert]$. Kantorovich–Rubinstein duality: $W(p, q) = \sup_{\lVert f \rVert_L \leq 1} [\mathbb{E}_p[f] - \mathbb{E}_q[f]]$. The critic (not a discriminator) must be 1-Lipschitz $\rightarrow$ gradient penalty.
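The 1-Lipschitz constraint is usually enforced softly via a gradient penalty on the critic (WGAN-GP). A sketch, assuming a hypothetical `critic` network with scalar output and 4-D image batches; the weight $\lambda = 10$ is the commonly used default:

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """Push the critic's gradient norm toward 1 on points between real and fake samples."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)     # random mixing per sample
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)  # interpolated samples
    grads = torch.autograd.grad(
        outputs=critic(x_hat).sum(), inputs=x_hat, create_graph=True
    )[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)                       # per-sample gradient norm
    return lam * ((grad_norm - 1.0) ** 2).mean()

# Critic loss (to minimize): E[critic(fake)] - E[critic(real)] + gradient_penalty(critic, real, fake)
```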
Diffusion Models (Ho et al., DDPM, 2020)
Forward process (adding noise): A sequence of $T$ steps, each adds Gaussian noise:
$q(x_t|x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I)$
$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon$, where $\bar{\alpha}_t = \prod_{s \leq t} (1-\beta_s)$, $\varepsilon \sim \mathcal{N}(0, I)$
Interpretation: $\sqrt{\bar{\alpha}_t}$ is the “signal fraction” at step $t$ (decreases to $0$), $\sqrt{1-\bar{\alpha}_t}$ is the “noise fraction” (increases to $1$). For $T=1000$ with a linear schedule $\beta_1=0.0001 \rightarrow \beta_T=0.02$: $x_T \approx \mathcal{N}(0, I)$.
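A short sketch of the forward process in closed form, using the linear schedule quoted above ($T=1000$, $\beta_1=10^{-4}$, $\beta_T=0.02$); shapes assume batches of images, and the helper name `q_sample` is just a convention:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear schedule beta_1 ... beta_T from the text
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in one shot: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alphas_bar.to(x0.device)[t].view(-1, 1, 1, 1)   # per-sample \bar{alpha}_t, broadcast over C, H, W
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise
```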
Reverse process (denoising): Train a neural network $\varepsilon_\theta(x_t, t)$ to predict the added noise:
$\mathcal{L} = \mathbb{E}_{x_0, \varepsilon, t}[\lVert \varepsilon - \varepsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon,\ t) \rVert^2]$
Intuition: if the noise $\varepsilon$ is known, $x_{t-1}$ can be restored from $x_t$. The trained network iteratively “removes noise” from random $\mathcal{N}(0, I)$ to a real image.
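The corresponding training step is a plain MSE between true and predicted noise. A sketch assuming the schedule and `q_sample` helper from the block above and a hypothetical noise-prediction network `eps_model(x_t, t)` (in practice a U-Net):

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0):
    """Simplified DDPM objective: predict the noise added at a random timestep t."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # one random step per sample
    eps = torch.randn_like(x0)                                  # the noise that will be added
    x_t = q_sample(x0, t, eps)                                  # closed-form forward sample
    return F.mse_loss(eps_model(x_t, t), eps)                   # ||eps - eps_theta(x_t, t)||^2
```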
Latent Diffusion (Rombach et al., 2022—Stable Diffusion): instead of working in pixel space (slow), encode $x$ into a latent space $z = E(x)$ with a VAE encoder (spatial downsampling by a factor of $\sim 8$ per side), run the diffusion in $z$, then decode $x = D(z)$. Each denoising step operates on a much smaller tensor, so generation becomes far cheaper. Text conditioning: a CLIP text encoder feeds cross-attention layers in the U-Net.
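For orientation, this is roughly how the full latent-diffusion pipeline (VAE + U-Net with CLIP cross-attention + scheduler) is invoked through the Hugging Face `diffusers` library; the checkpoint id, prompt, and step count are illustrative and may differ in your environment:

```python
import torch
from diffusers import StableDiffusionPipeline

# from_pretrained loads the VAE encoder/decoder, the U-Net, the CLIP text encoder, and a scheduler.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # illustrative checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

# The prompt is embedded by the CLIP text encoder and injected via cross-attention;
# the diffusion itself runs in the VAE latent space, and the result is decoded to pixels.
image = pipe("a watercolor fox in a snowy forest", num_inference_steps=30).images[0]
image.save("fox.png")
```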
Comparison of the Three Families
| Method | Quality | Diversity | Speed | Controllability |
|---|---|---|---|---|
| VAE | Medium | High | Fast | Good (in $z$) |
| GAN | High | Low (mode collapse) | Fast | Difficult |
| Diffusion | SOTA | High | Slow | Very good |
Numerical Example
VAE for MNIST (28×28), $z$-dim = 2. After training, the $z$-space splits into clusters by digit (0–9). Interpolating from $z_1$ (a digit 1) to $z_2$ (a digit 7): the intermediate $z$ values decode into “transition” images—a smooth transformation of 1 into 7. KL loss $\approx 8$ nats (compact latent space), reconstruction BCE $\approx 15$ nats.
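A sketch of that interpolation, assuming a trained VAE with hypothetical `encoder` (returning $\mu, \log\sigma^2$) and `decoder` (returning Bernoulli logits) modules; names and shapes are illustrative:

```python
import torch

@torch.no_grad()
def interpolate(encoder, decoder, x_a, x_b, steps=8):
    """Decode evenly spaced points on the segment between the latent means of two inputs."""
    mu_a, _ = encoder(x_a.unsqueeze(0))                 # posterior mean as latent code for image A
    mu_b, _ = encoder(x_b.unsqueeze(0))                 # posterior mean for image B
    ts = torch.linspace(0.0, 1.0, steps, device=mu_a.device).view(-1, 1)
    z = (1 - ts) * mu_a + ts * mu_b                     # linear interpolation in z-space
    return torch.sigmoid(decoder(z))                    # decoded "transition" images
```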
Assignment: Implement a VAE for CelebA (64×64 faces). Architecture: Conv encoder $\rightarrow (\mu, \sigma) \rightarrow$ reparameterization $\rightarrow z$ (dim = 128) $\rightarrow$ ConvTranspose decoder. Train for 20 epochs. (1) Visualize reconstructions. (2) Generate 16 random faces. (3) Interpolate between two faces ($z_1 \rightarrow z_2$, 8 steps). (4) Set $\beta=4$—how do reconstruction quality and interpolation smoothness change?