Variational Autoencoders and Diffusion Models
Generative Models
VAEs and diffusion models are alternatives to GANs for data generation, built on fundamentally different mathematical foundations: VAEs rest on variational Bayesian inference, while diffusion models rest on iterative denoising, with roots in diffusion processes.
Variational Autoencoder (VAE, Kingma & Welling, 2013)
Problem: We want to train a generative model $p_\theta(x) = \int p_\theta(x|z) p(z) dz$, where $z$ is a latent code. The integral over all $z$ is intractable (high-dimensional integration).
Evidence Lower Bound (ELBO):
$ \log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathrm{KL}\bigl(q_\phi(z|x)\,\|\,p(z)\bigr) = \mathrm{ELBO}(\theta, \phi; x) $
Decoding the two terms: (1) $\mathbb{E}[\log p_\theta(x|z)]$ — reconstruction term: how well the decoder reconstructs $x$ from $z$. (2) $-\mathrm{KL}\bigl(q_\phi(z|x)\,\|\,p(z)\bigr)$ — KL divergence between the approximate posterior $q_\phi(z|x)$ and the prior $p(z)=\mathcal{N}(0,I)$ — a regularizer that pulls the posterior toward the standard Gaussian.
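For a diagonal Gaussian posterior both distributions are Gaussian, so the KL term has a closed form (a standard identity, stated here for completeness):

$ \mathrm{KL}\bigl(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))\,\|\,\mathcal{N}(0,I)\bigr) = \frac{1}{2}\sum_{j=1}^{d}\bigl(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\bigr) $

This is exactly the quantity the encoder's $(\mu, \log\sigma^2)$ outputs feed into.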
VAE architecture: Encoder: $x \to (\mu_\phi(x), \log \sigma^2_\phi(x))$ (posterior parameters). Sampling: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon$, $\varepsilon \sim N(0,I)$ (reparameterization trick — makes sampling differentiable). Decoder: $z \to \hat{x}$ (reconstruction).
Reparameterization trick: Problem: gradients cannot flow through the stochastic sampling $z \sim q_\phi(z|x)$. Solution: $z = \mu + \sigma \cdot \varepsilon$, $\varepsilon \sim \mathcal{N}(0,I)$. Now the randomness lives in $\varepsilon$ (which does not depend on $\phi$), and $\frac{\partial z}{\partial \phi} = \frac{\partial \mu}{\partial \phi} + \sigma \cdot \frac{\partial (\log \sigma)}{\partial \phi} \cdot \varepsilon$ is analytic.
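A minimal PyTorch sketch of the encoder–sample–decode loop; the fully connected layer sizes and $z$ dimension are illustrative choices, not prescribed by the article:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal fully connected VAE; a conv encoder/decoder would be used for images."""
    def __init__(self, x_dim=784, z_dim=128):
        super().__init__()
        self.enc = nn.Linear(x_dim, 400)
        self.mu = nn.Linear(400, z_dim)       # mu_phi(x)
        self.logvar = nn.Linear(400, z_dim)   # log sigma^2_phi(x)
        self.dec = nn.Sequential(
            nn.Linear(z_dim, 400), nn.ReLU(),
            nn.Linear(400, x_dim),            # returns logits (no sigmoid)
        )

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps: the randomness lives in eps,
        # so gradients flow through mu and sigma analytically.
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar
```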
$\beta$-VAE (Higgins et al., 2017): $ \mathrm{ELBO}_\beta = \mathbb{E}[\log p(x|z)] - \beta \cdot \mathrm{KL} $ For $\beta > 1$: stronger regularization $\rightarrow$ “disentangled” representations (independent factors of variation). For example, in the CelebA latent space: $z_1 \rightarrow$ illumination, $z_2 \rightarrow$ rotation, $z_3 \rightarrow$ smile (interpretable independent dimensions). A loss for the sketch above is shown below.
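The full training loss for the VAE sketch, with the closed-form KL; `beta=1` recovers the standard ELBO, and the BCE reconstruction term assumes a Bernoulli decoder over pixel values (an assumption of this sketch):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x_logits, x, mu, logvar, beta=4.0):
    # Reconstruction term: Bernoulli decoder -> binary cross-entropy on logits.
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # Closed-form KL between N(mu, sigma^2) and N(0, I).
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1)
    return recon + beta * kl  # beta = 1 recovers the standard ELBO
```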
Diffusion Models (Ho et al., DDPM, 2020)
Key idea: Gradually add noise to data (forward process), then train a neural network to “invert” this process (reverse process — denoising).
Forward process:
$ q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I) $
Directly to any step: $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon$, where $\bar{\alpha}_t = \prod_{s \leq t}(1-\beta_s)$, $\varepsilon \sim \mathcal{N}(0,I)$.
Explanation: $\bar{\alpha}_t$ decreases from $1$ (at $t=0$) to $0$ (at $t=T$). $\sqrt{\bar{\alpha}_t} x_0$ — “signal” component. $\sqrt{1-\bar{\alpha}_t} \varepsilon$ — “noise” component. For $T=1000$: $x_T \approx N(0,I)$ (pure noise).
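A sketch of the closed-form jump to step $t$; the linear $\beta$ schedule from $10^{-4}$ to $0.02$ follows the DDPM paper, everything else is illustrative:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule (DDPM)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t, decreasing toward 0

def q_sample(x0, t, eps):
    """Jump directly to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    ab = alpha_bar[t].view(-1, 1, 1, 1)  # broadcast over image batch (B, C, H, W)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps
```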
Reverse process — training objective:
$ L = \mathbb{E}_{x_0, \varepsilon, t}\bigl[\,\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2\,\bigr] $
We train $\varepsilon_\theta(x_t, t)$ to predict the added noise $\varepsilon$ given the noised image $x_t$ and the step number $t$. The neural network is usually a U-Net with a timestep embedding for $t$.
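A single training step under this objective, reusing `q_sample` and the schedule from the sketch above (single-device, for brevity):

```python
def ddpm_loss(model, x0):
    """Simplified DDPM objective: mean squared error on the predicted noise."""
    t = torch.randint(0, T, (x0.shape[0],))  # random step per sample
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)
    return torch.mean((eps - model(x_t, t)) ** 2)
```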
Generation: Start from $x_T \sim \mathcal{N}(0,I)$. Iteratively denoise: $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Bigl(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\varepsilon_\theta(x_t,t)\Bigr) + \sigma_t \cdot \varepsilon'$, where $\alpha_t = 1-\beta_t$ and $\varepsilon' \sim \mathcal{N}(0,I)$. $T$ steps $\to$ $x_0$ (realistic image). Speed: $1000$ steps = slow. DDIM (Song et al., 2021): $20$–$50$ steps, deterministic sampling.
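The corresponding ancestral sampling loop, again reusing the schedule above; the choice $\sigma_t^2 = \beta_t$ is one of the variants considered in the DDPM paper:

```python
@torch.no_grad()
def ddpm_sample(model, shape):
    """Start from pure noise and apply T reverse (denoising) steps."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps_hat = model(x, torch.full((shape[0],), t))
        alpha_t = 1.0 - betas[t]
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) / alpha_t.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # sigma_t = sqrt(beta_t)
    return x
```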
Latent Diffusion and Stable Diffusion
DDPM problem: it works in pixel space, which is slow — the network processes $512 \times 512 \times 3 \approx 786$K values at every denoising step.
Latent Diffusion (Rombach et al., 2022): Compress $x \to z = E(x)$ via a pretrained VAE ($\approx 8\times$ spatial compression, latent shape $64 \times 64 \times 4$). Run diffusion in $z$ (a much smaller space). Decode $x = D(z)$. Result: $\approx 8\times$ faster at comparable quality.
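Putting the pieces together as a hypothetical pipeline sketch; `vae_decoder` and `unet` stand in for pretrained models, and `ddpm_sample` is the sampler defined earlier (a real system would use a faster sampler such as DDIM):

```python
def latent_diffusion_generate(unet, vae_decoder):
    # Diffusion runs entirely in the compressed latent space.
    z0 = ddpm_sample(unet, shape=(1, 4, 64, 64))  # 64x64x4 latents
    return vae_decoder(z0)                        # decode to 512x512x3 pixels
```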
Conditioning (Stable Diffusion): Text prompt $\to$ CLIP text encoder $\to$ embedding $c$. The U-Net injects $c$ via cross-attention: $Q = W_Q\cdot z$ (from the latent), $K = W_K\cdot c$, $V = W_V\cdot c$. At each denoising step the image is “aligned” with the text.
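A sketch of one cross-attention layer; the dimensions (320 for latent channels, 768 for CLIP text embeddings) are typical Stable Diffusion values but illustrative here:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Queries come from image latents, keys/values from the text embedding c."""
    def __init__(self, dim=320, ctx_dim=768):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)      # W_Q applied to z
        self.k = nn.Linear(ctx_dim, dim, bias=False)  # W_K applied to c
        self.v = nn.Linear(ctx_dim, dim, bias=False)  # W_V applied to c
        self.scale = dim ** -0.5

    def forward(self, z, c):
        # z: (B, N_latent_positions, dim), c: (B, N_text_tokens, ctx_dim)
        q, k, v = self.q(z), self.k(c), self.v(c)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # each latent position attends to the text tokens
```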
Classifier-Free Guidance: train a single network with the condition randomly dropped, so it provides both conditional $\varepsilon_\theta(x_t, c)$ and unconditional $\varepsilon_\theta(x_t, \emptyset)$ predictions. At sampling: $\varepsilon_{\text{guided}} = \varepsilon_\theta(x_t,\emptyset) + w\,\bigl(\varepsilon_\theta(x_t, c) - \varepsilon_\theta(x_t, \emptyset)\bigr)$. $w$ is the guidance scale: $w=7$ — strong adherence to the prompt; $w=1$ reduces to the plain conditional model (more diversity).
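A sketch of the guided prediction; the `model(x_t, t, cond)` signature and `null_c` (the embedding of the empty prompt) are assumptions of this sketch:

```python
def guided_eps(model, x_t, t, c, null_c, w=7.5):
    """Classifier-free guidance: extrapolate from unconditional toward conditional."""
    eps_uncond = model(x_t, t, null_c)  # condition replaced by the empty prompt
    eps_cond = model(x_t, t, c)
    return eps_uncond + w * (eps_cond - eps_uncond)  # w = 1 -> plain conditional
```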
Numerical Example
VAE for CelebA-HQ ($256 \times 256$ faces): $z_{dim}=512$. KL-loss after training: $\approx 30$ nats (reasonable reconstruction/regularization balance). Interpolating $z_1 \to z_2$ (8 steps): smooth “morphing” between faces. $\beta$-VAE ($\beta=10$): latent variables interpretable — $z_1$ correlates with illumination ($r=0.83$).
DDPM (unconditional CIFAR-10): $1000$ steps, FID $= 3.17$ (competitive with StyleGAN2-ADA). DDIM ($100$ steps): FID $= 4.16$ at $10\times$ faster generation. Classifier-free guidance with $w=7.5$: generated images follow the text prompt closely.
Assignment: Implement VAE for CelebA ($64\times64$). (1) Architecture: Conv-Encoder $\to$ $(\mu,\sigma) \to z(\mathrm{dim}=128) \to$ ConvTranspose-Decoder. Loss $= \mathrm{BCE} + \beta \cdot \mathrm{KL}$. (2) Train at $\beta=1,4,10$. Visualize: reconstructions, random generations, interpolations. (3) $\beta=10$: use t-SNE to visualize $z$-space (color by “smile” attribute). Is the separation clear?