Diffusion Models & Flow Matching
Diffusion Models
Forward Process: Add Noise
The forward process is fixed and gradually adds Gaussian noise to data: \[ q(x_{1:T}\mid x_0) := \prod_{t=1}^{T} q(x_t\mid x_{t-1}) \]
\[ q(x_t\mid x_{t-1}) := \mathcal N \left( x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I \right) \]
Equivalently, one step of the forward process can be written as: \[ x_t = \sqrt{1-\beta_t}x_{t-1} + \sqrt{\beta_t}\epsilon, \qquad \epsilon\sim\mathcal N(0,I) \]
Define: \[ \alpha_t=1-\beta_t, \qquad \bar\alpha_t=\prod_{s=1}^{t}\alpha_s \]
Then we can sample \(x_t\) directly from \(x_0\): \[ x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon, \qquad \epsilon\sim\mathcal N(0,I) \]
The goal of the forward process is to transform the data distribution into a standard Gaussian: \[ x_T\approx\mathcal N(0,I) \]
When \(T\) is large: \[ \sqrt{\bar\alpha_T}\rightarrow 0, \qquad 1-\bar\alpha_T\rightarrow 1 \]
Reverse Process: Denoise
The reverse process is learned and gradually removes noise: \[ p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}\mid x_t) \]
Each reverse step is modeled as a Gaussian: \[ p_\theta(x_{t-1}\mid x_t) := \mathcal N \left( x_{t-1}; \mu_\theta(x_t,t), \Sigma_\theta(x_t,t) \right) \]
A denoising neural network predicts the reverse transition parameters: \[ \mu_\theta(x_t,t), \qquad \Sigma_\theta(x_t,t) \]
Generation starts from pure noise: \[ x_T\sim\mathcal N(0,I) \]
and repeatedly denoises: \[ x_T \rightarrow x_{T-1}\rightarrow \cdots \rightarrow x_0 \]
Variational inference identity: \[ -\log p(x) = -\int q_x(z)\log p(x\mid z)dz + KL(q_x(z)\parallel p(z)) - KL(q_x(z)\parallel p(z\mid x)) \]
Therefore, minimizing negative log-likelihood can be replaced by minimizing an upper bound: \[ -\log p(x) \le -\int q_x(z)\log p(x\mid z)dz + KL(q_x(z)\parallel p(z)) \]
DDPM is similar to a hierarchical VAE.
The latent code is the entire noisy trajectory: \[ z=\{x_1,x_2,\dots,x_T\} \]
Forward process as encoder: \[ q(x_{1:T}\mid x_0) = \prod_{t=1}^{T}q(x_t\mid x_{t-1}) \]
Reverse process as decoder: \[ p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T}p_\theta(x_{t-1}\mid x_t) \]
DDPM variational objective: \[ \mathbb E_q \left[ D_{KL}(q(x_T\mid x_0)\parallel p(x_T)) + \sum_{t>1} D_{KL} \left( q(x_{t-1}\mid x_t,x_0) \parallel p_\theta(x_{t-1}\mid x_t) \right) - \log p_\theta(x_0\mid x_1) \right] \]
Term notation: \[ L_T = D_{KL}(q(x_T\mid x_0)\parallel p(x_T)) \] \[ L_{t-1} = D_{KL} \left( q(x_{t-1}\mid x_t,x_0) \parallel p_\theta(x_{t-1}\mid x_t) \right) \] \[ L_0 = -\log p_\theta(x_0\mid x_1) \]
The true posterior of the forward process is Gaussian: \[ q(x_{t-1}\mid x_t,x_0) = \mathcal N \left( x_{t-1}; \tilde\mu_t(x_t,x_0), \tilde\beta_t I \right) \]
where \[ \tilde\mu_t(x_t,x_0) := \frac{\sqrt{\bar\alpha_{t-1}}\beta_t}{1-\bar\alpha_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t \]
and \[ \tilde\beta_t := \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t \]
The forward process is decomposed into a similar form as the reverse process so that the KL divergence terms are easy to compute.
Since \[ q(x_{t-1}\mid x_t,x_0) \] and \[ p_\theta(x_{t-1}\mid x_t) \] are both Gaussian, their KL divergence can be reduced to a mean-matching loss: \[ L_{t-1} = D_{KL} \left( q(x_{t-1}\mid x_t,x_0) \parallel p_\theta(x_{t-1}\mid x_t) \right) \] \[ = \mathbb E_q \left[ \frac{1}{2\sigma_t^2} \left\| \tilde\mu_t(x_t,x_0) - \mu_\theta(x_t,t) \right\|^2 \right] + C \]
The true posterior mean can be written with the noise \(\epsilon\): \[ \tilde\mu_t(x_t,x_0) = \frac{1}{\sqrt{1-\beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon \right) \]
Reparametrize the model mean by predicting noise: \[ \mu_\theta(x_t,t) = \frac{1}{\sqrt{1-\beta_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}} \epsilon_\theta(x_t,t) \right) \]
Therefore: \[ L_{t-1} = \mathbb E_{x_0\sim q(x_0),\epsilon\sim\mathcal N(0,I)} \left[ \frac{\beta_t^2} {2\sigma_t^2(1-\beta_t)(1-\bar\alpha_t)} \left\| \epsilon - \epsilon_\theta \left( \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon, t \right) \right\|^2 \right] + C \]
Let \[ x_t= \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon \]
and define the weighting coefficient: \[ \lambda_t = \frac{\beta_t^2} {2\sigma_t^2(1-\beta_t)(1-\bar\alpha_t)} \]
DDPM uses the simplified loss by setting \(\lambda_t=1\): \[ L_{\text{simple}}(\theta) := \mathbb E_{t,x_0,\epsilon} \left[ \left\| \epsilon - \epsilon_\theta \left( \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon, t \right) \right\|^2 \right] \]
Algorithm 1: Training
- Repeat until converged.
- Sample a clean data point: \[ x_0\sim q(x_0) \]
- Sample a timestep: \[ t\sim \mathrm{Uniform}(\{1,\dots,T\}) \]
- Sample Gaussian noise: \[ \epsilon\sim\mathcal N(0,I) \]
- Construct noisy input: \[ x_t= \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon \]
- Take a gradient descent step on: \[ \nabla_\theta \left\| \epsilon - \epsilon_\theta(x_t,t) \right\|^2 \]
Algorithm 2: Sampling
Start from pure Gaussian noise: \[ x_T\sim\mathcal N(0,I) \]
For \[ t=T,\dots,1 \] repeat:
sample \[ z\sim\mathcal N(0,I) \] if \(t>1\), otherwise set \(z=0\).
update: \[ x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}} \epsilon_\theta(x_t,t) \right) + \sigma_t z \]
Return generated sample: \[ x_0 \]