Generative Models: from VAE to Diffusion
Discriminative models answer the question “which class?”; generative models answer “what does the data look like?”. They learn a representation of the data distribution and can create new samples from it—images, molecules, music. Three families have defined modern generative AI (AIGC): VAEs, GANs, and diffusion models.
Variational Autoencoder (VAE, Kingma & Welling, 2013)
Problem statement: given data $x$ (e.g., an image), we want to train a generative model $p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$, where $z$ is a latent code (hidden factors of variation). The integral over all $z$ is intractable.
Variational inference (ELBO): we introduce an encoder $q_\phi(z|x) \approx p(z|x)$ and maximize the evidence lower bound (ELBO):
$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \Vert p(z)) = \mathcal{L}(\theta, \phi; x)$
Unpacking the two terms:
- Reconstruction term $\mathbb{E}[\log p_\theta(x|z)]$: how well the decoder reconstructs $x$ from $z$—the reconstruction quality
- KL divergence $KL(q_\phi(z|x)\Vert p(z))$: how close the approximate posterior is to the prior $p(z)=\mathcal{N}(0, I)$—regularization of the latent space
Reparameterization trick: $z \sim q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x)I)$. We cannot take the gradient through a stochastic sample. Trick: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, I)$. Now the randomness is in $\varepsilon$ (does not depend on parameters), the gradient $\partial z / \partial \phi$ is analytic.
β-VAE: scale the KL term: $\mathcal{L} = \mathbb{E}[\log p(x|z)] - \beta \cdot KL$. For $\beta > 1$: strong regularization $\rightarrow$ disentangled representation (variation factors are independent in $z$). Example: $z_1 =$ lighting, $z_2 =$ face rotation, $z_3 =$ expression.
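Below is a minimal PyTorch sketch of the (β-)VAE objective with the reparameterization trick. The `encoder` (returning $\mu$ and $\log\sigma^2$) and `decoder` (returning Bernoulli logits) are assumed, hypothetical modules; this illustrates the loss, it is not a reference implementation.

```python
import torch
import torch.nn.functional as F

def vae_loss(encoder, decoder, x, beta=1.0):
    """(beta-)VAE objective: reconstruction term + beta * KL, averaged over the batch."""
    mu, logvar = encoder(x)                       # parameters of q_phi(z|x)
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)                   # epsilon ~ N(0, I): all randomness lives here
    z = mu + std * eps                            # reparameterization: gradients flow through mu, std
    x_logits = decoder(z)                         # Bernoulli logits for the reconstruction
    recon = F.binary_cross_entropy_with_logits(   # = -E_q[log p_theta(x|z)] for a Bernoulli decoder
        x_logits, x, reduction="sum") / x.shape[0]
    # Closed-form KL( N(mu, sigma^2 I) || N(0, I) ), averaged over the batch
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.shape[0]
    return recon + beta * kl
```

With `beta=1` this is the standard VAE objective; `beta > 1` strengthens the pull toward the $\mathcal{N}(0, I)$ prior, as in the β-VAE above.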
Generative Adversarial Networks (GAN, Goodfellow et al., 2014)
Game formulation: Generator $G$: $z \rightarrow \hat{x}$ and discriminator $D$: $x \rightarrow [0,1]$ play a minimax game:
$\min_G \max_D V(D, G) = \mathbb{E}_{x\sim p_{data}}[\log D(x)] + \mathbb{E}_{z\sim p_z}[\log(1 - D(G(z)))]$
Nash equilibrium: for a fixed $G$, the optimal discriminator is $D^*(x) = \dfrac{p_{data}(x)}{p_{data}(x) + p_G(x)}$. At equilibrium $G$ reproduces $p_{data}$, so $D^*(x) = 1/2$ everywhere—the discriminator can no longer distinguish real from generated samples.
Wasserstein GAN (Arjovsky et al., 2017): more stable than the original GAN. Replaces the JS divergence with the Wasserstein distance: $W(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x, y)\sim\gamma}[\lVert x-y \rVert]$. Kantorovich–Rubinstein duality: $W(p, q) = \sup_{\lVert f \rVert_L \leq 1} [\mathbb{E}_p[f] - \mathbb{E}_q[f]]$. The critic (not a discriminator) must be 1-Lipschitz $\rightarrow$ gradient penalty.
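The 1-Lipschitz constraint is usually enforced softly via a gradient penalty on the critic (WGAN-GP). A sketch, assuming a hypothetical `critic` network with scalar output and 4-D image batches; the weight $\lambda = 10$ is the commonly used default:

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """Push the critic's gradient norm toward 1 on points between real and fake samples."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)     # random mixing per sample
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)  # interpolated samples
    grads = torch.autograd.grad(
        outputs=critic(x_hat).sum(), inputs=x_hat, create_graph=True
    )[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)                       # per-sample gradient norm
    return lam * ((grad_norm - 1.0) ** 2).mean()

# Critic loss (to minimize): E[critic(fake)] - E[critic(real)] + gradient_penalty(critic, real, fake)
```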
Diffusion Models (Ho et al., DDPM, 2020)
Forward process (adding noise): A sequence of $T$ steps, each adds Gaussian noise:
$q(x_t|x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I)$
$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon$, where $\bar{\alpha}_t = \prod_{s \leq t} (1-\beta_s)$, $\varepsilon \sim \mathcal{N}(0, I)$
Interpretation: $\sqrt{\bar{\alpha}_t}$ is the “signal fraction” at step $t$ (decreases to $0$), $\sqrt{1-\bar{\alpha}_t}$ is the “noise fraction” (increases to $1$). For $T=1000$ with a linear schedule $\beta_1=0.0001 \rightarrow \beta_T=0.02$: $x_T \approx \mathcal{N}(0, I)$.
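A short sketch of the forward process in closed form, using the linear schedule quoted above ($T=1000$, $\beta_1=10^{-4}$, $\beta_T=0.02$); shapes assume batches of images, and the helper name `q_sample` is just a convention:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear schedule beta_1 ... beta_T from the text
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in one shot: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alphas_bar.to(x0.device)[t].view(-1, 1, 1, 1)   # per-sample \bar{alpha}_t, broadcast over C, H, W
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise
```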
Reverse process (denoising): Train a neural network $\varepsilon_\theta(x_t, t)$ to predict the added noise:
$\mathcal{L} = \mathbb{E}_{x_0, \varepsilon, t}[\lVert \varepsilon - \varepsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon,\ t) \rVert^2]$
Intuition: if the noise $\varepsilon$ is known, $x_{t-1}$ can be restored from $x_t$. The trained network iteratively “removes noise” from random $\mathcal{N}(0, I)$ to a real image.
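The corresponding training step is a plain MSE between true and predicted noise. A sketch assuming the schedule and `q_sample` helper from the block above and a hypothetical noise-prediction network `eps_model(x_t, t)` (in practice a U-Net):

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0):
    """Simplified DDPM objective: predict the noise added at a random timestep t."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # one random step per sample
    eps = torch.randn_like(x0)                                  # the noise that will be added
    x_t = q_sample(x0, t, eps)                                  # closed-form forward sample
    return F.mse_loss(eps_model(x_t, t), eps)                   # ||eps - eps_theta(x_t, t)||^2
```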
Latent Diffusion (Rombach et al., 2022—Stable Diffusion): instead of working in pixel space (slow), encode $x$ into a latent space $z = E(x)$ with a VAE encoder (spatial downsampling by a factor of $\sim 8$ per side), run the diffusion in $z$, then decode $x = D(z)$. Each denoising step operates on a much smaller tensor, so generation becomes far cheaper. Text conditioning: a CLIP text encoder feeds cross-attention layers in the U-Net.
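For orientation, this is roughly how the full latent-diffusion pipeline (VAE + U-Net with CLIP cross-attention + scheduler) is invoked through the Hugging Face `diffusers` library; the checkpoint id, prompt, and step count are illustrative and may differ in your environment:

```python
import torch
from diffusers import StableDiffusionPipeline

# from_pretrained loads the VAE encoder/decoder, the U-Net, the CLIP text encoder, and a scheduler.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # illustrative checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

# The prompt is embedded by the CLIP text encoder and injected via cross-attention;
# the diffusion itself runs in the VAE latent space, and the result is decoded to pixels.
image = pipe("a watercolor fox in a snowy forest", num_inference_steps=30).images[0]
image.save("fox.png")
```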
Comparison of the Three Families
| Method | Quality | Diversity | Speed | Controllability |
|---|---|---|---|---|
| VAE | Medium | High | Fast | Good (in $z$) |
| GAN | High | Low (mode collapse) | Fast | Difficult |
| Diffusion | SOTA | High | Slow | Very good |
Numerical Example
VAE for MNIST (28×28), $z$-dim = 2. After training, the $z$-space splits into clusters by digit (0–9). Interpolating from $z_1$ (a digit 1) to $z_2$ (a digit 7): the intermediate $z$ values decode into “transition” images—a smooth transformation of 1 into 7. KL loss $\approx 8$ nats (compact latent space), reconstruction BCE $\approx 15$ nats.
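A sketch of that interpolation, assuming a trained VAE with hypothetical `encoder` (returning $\mu, \log\sigma^2$) and `decoder` (returning Bernoulli logits) modules; names and shapes are illustrative:

```python
import torch

@torch.no_grad()
def interpolate(encoder, decoder, x_a, x_b, steps=8):
    """Decode evenly spaced points on the segment between the latent means of two inputs."""
    mu_a, _ = encoder(x_a.unsqueeze(0))                 # posterior mean as latent code for image A
    mu_b, _ = encoder(x_b.unsqueeze(0))                 # posterior mean for image B
    ts = torch.linspace(0.0, 1.0, steps, device=mu_a.device).view(-1, 1)
    z = (1 - ts) * mu_a + ts * mu_b                     # linear interpolation in z-space
    return torch.sigmoid(decoder(z))                    # decoded "transition" images
```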
Assignment: Implement a VAE for CelebA (64×64 faces). Architecture: Conv encoder $\rightarrow (\mu, \sigma) \rightarrow$ reparameterization $\rightarrow z$ (dim = 128) $\rightarrow$ ConvTranspose decoder. Train for 20 epochs. (1) Visualize reconstructions. (2) Generate 16 random faces. (3) Interpolate between two faces ($z_1 \rightarrow z_2$, 8 steps). (4) Set $\beta=4$—how do reconstruction quality and interpolation smoothness change?