A diffusion probabilistic model is a parameterized Markov chain trained to reverse a predefined forward process, and it is closely related to both likelihood-based optimization and score matching. The forward diffusion process is a stochastic process constructed to gradually corrupt the original data into random noise.

Gaussian Diffusion (Continuous)

Diffusion models ^[1]^[2] are latent variable models inspired by non-equilibrium statistical physics (thermodynamics). They gradually destroy structure in the data distribution through an iterative forward diffusion process, and then learn a reverse process that recovers the original data structure through iterative denoising.

Diffusion models can be treated as a Markovian Hierarchical Variational Autoencoder with three restrictions:^[6]

The latent dimension is the same as the original data.
The encoder is not learned; instead, it uses a predefined linear Gaussian model.
The latent at the final timestep $T$ is an isotropic Gaussian.

Forward (Diffusion) process

Given a data point sampled from the data distribution, $\mathbf{x}_0 \sim q(\mathbf{x})$, the forward diffusion process applies a fixed linear Gaussian model at each timestep $t$ out of $T$ steps:

$\begin{align} q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N} (\mathbf{x}_t; \sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}) \end{align}$

where the forward diffusion transitions produce a sequence of increasingly noisy samples $\mathbf{x}_1, \cdots, \mathbf{x}_T$. Each noisy sample has exactly the same dimension as the original data point $\mathbf{x}_0$.

Under the Markov assumption, Gaussian noise is gradually added to the sample from the previous timestep according to the variance schedule $\{\beta_t \in (0, 1) \}_{t=1}^T$. As $T \rightarrow \infty$, $\mathbf{x}_T$ approaches isotropic Gaussian noise.

Letting $\alpha_t = 1 - \beta_t$, the linear Gaussian model in the forward process can be rewritten as:

$\begin{align} q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N} (\mathbf{x}_t; \sqrt{\alpha_t} \mathbf{x}_{t-1}, (1-\alpha_t) \mathbf{I}) \end{align}$

Under the reparameterization trick, samples $\mathbf{x}_t \sim q (\mathbf{x}_t | \mathbf{x}_{t-1})$ can be rewritten as:

$\begin{align} \mathbf{x}_t = \sqrt{\alpha_t} \mathbf{x}_{t-1} + \sqrt{1-\alpha_t} \pmb{\epsilon} \,\,\,\, \text{with } \,\,\,\, \pmb{\epsilon}\sim \mathcal{N} (\pmb{\epsilon}; \mathbf{0},\mathbf{I}) \end{align}$

In a similar vein, samples $\mathbf{x}_{t-1}$ can be rewritten as:

$\begin{align} \mathbf{x}_{t-1} = \sqrt{\alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1-\alpha_{t-1}} \pmb{\epsilon} \,\,\,\, \text{with } \,\,\,\, \pmb{\epsilon}\sim \mathcal{N} (\pmb{\epsilon}; \mathbf{0},\mathbf{I}) \end{align}$

Let $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$. Usually, the update step gets larger as the timestep increases, i.e., $\beta_1 < \beta_2 < \cdots < \beta_T$. Since every $\alpha_i \in (0, 1)$, the cumulative product is strictly decreasing: $\bar{\alpha}_1 > \bar{\alpha}_2 > \cdots > \bar{\alpha}_T$.

Suppose we have $2T$ random noise variables $\{ \pmb{\epsilon}_t, \bar{\pmb{\epsilon}}_t \}_{t=1}^T \overset{\text{i.i.d}}{\sim} \mathcal{N} (\pmb{\epsilon}; \mathbf{0},\mathbf{I})$ .

For an arbitrary sample $\mathbf{x}_t \sim q(\mathbf{x}_t | \mathbf{x}_0)$, we have:

$\begin{align} \mathbf{x}_t &{}= \sqrt{\alpha_t} \mathbf{x}_{t-1} + \sqrt{1-\alpha_t} \pmb{\epsilon}_{t-1} \\ &{}= \sqrt{\alpha_t} (\sqrt{\alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1-\alpha_{t-1}} \pmb{\epsilon}_{t-2}) + \sqrt{1-\alpha_{t}} \pmb{\epsilon}_{t-1} \nonumber \\ &{}= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{\alpha_t- \alpha_t \alpha_{t-1}} \pmb{\epsilon}_{t-2} + \sqrt{1-\alpha_{t}} \pmb{\epsilon}_{t-1} \nonumber \\ &{}= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{\sqrt{\alpha_t- \alpha_t \alpha_{t-1}}^2 + \sqrt{1-\alpha_{t}}^2 } \, \bar{\pmb{\epsilon}}_{t-2} \nonumber \\ &{}= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \, \bar{\pmb{\epsilon}}_{t-2} \\ &{}= \cdots \nonumber \\ &{}= \sqrt{\prod_{i=1}^t \alpha_i} \, \mathbf{x}_0 + \sqrt{1 - \prod_{i=1}^t \alpha_i} \, \bar{\pmb{\epsilon}}_0 \\ &{}= \color{blue}{\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \bar{\pmb{\epsilon}}_0} \label{forward_add_noise} \\ &{} \sim \mathcal{N} (\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1-\bar{\alpha}_t) \mathbf{I}) \label{noise_process} \end{align}$

Therefore, the forward marginal has the linear Gaussian form $q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N} (\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1-\bar{\alpha}_t) \mathbf{I})$.
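The closed form can be checked numerically against the step-by-step chain. The sketch below is a minimal NumPy illustration, not a reference implementation: the helper names (`linear_schedule`, `q_sample`, `q_sample_iterative`) and the schedule endpoints are assumptions chosen for the example.

```python
import numpy as np

def linear_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Return (betas, alphas, alpha_bars) for an illustrative linear variance schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # alpha_bars[t-1] = prod_{i<=t} alpha_i
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, eps):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-based."""
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

def q_sample_iterative(x0, t, alphas, rng):
    """Draw x_t by applying the one-step transition q(x_i | x_{i-1}) t times."""
    x = x0
    for i in range(t):
        x = np.sqrt(alphas[i]) * x + np.sqrt(1.0 - alphas[i]) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = linear_schedule(T=1000)
x0 = np.full(20000, 3.0)  # many copies of the same scalar data point
t = 300
one_shot = q_sample(x0, t, alpha_bars, rng.standard_normal(x0.shape))
stepwise = q_sample_iterative(x0, t, alphas, rng)
print(one_shot.mean(), stepwise.mean())  # both near sqrt(alpha_bar_t) * 3
print(one_shot.var(), stepwise.var())    # both near 1 - alpha_bar_t
```

Sampling $\mathbf{x}_t$ in one shot is what makes training cheap: any timestep can be reached without simulating the chain.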

Reverse process

The reverse diffusion process, with marginal $p_\theta(\mathbf{x}_0) := \int p_\theta (\mathbf{x}_{0:T}) \, d \mathbf{x}_{1:T}$, learns to invert the diffusion process by gradually denoising from timestep $T$ to 1. It is defined as a Markov chain with learned Gaussian transitions starting at $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$:

$\begin{align} p_\theta (\mathbf{x}_{0:T}) &{}:= p(\mathbf{x}_T) \prod_{t=1}^T p_\theta (\mathbf{x}_{t-1}\vert \mathbf{x}_t) \\ p_\theta (\mathbf{x}_{t-1}|\mathbf{x}_{t}) &{} := \mathcal{N} (\mathbf{x}_{t-1}; \pmb{\mu}_\theta (\mathbf{x}_{t},t), \pmb{\Sigma}_\theta (\mathbf{x}_{t}, t)) \end{align}$

From the forward process, we know the Gaussian form of both $q(\mathbf{x}_t | \mathbf{x}_0)$ and $q(\mathbf{x}_{t-1} | \mathbf{x}_0)$. Using Bayes' rule, we have:

$\begin{align} q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) &{}= \frac{q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0) \cdot q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_t | \mathbf{x}_0)} \\ &{}= \frac{\mathcal{N} (\mathbf{x}_t; \sqrt{\alpha_t} \mathbf{x}_{t-1}, (1-\alpha_t) \mathbf{I}) \cdot \mathcal{N} (\mathbf{x}_{t-1}; \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0, (1-\bar{\alpha}_{t-1}) \mathbf{I})}{\mathcal{N} (\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1-\bar{\alpha}_t) \mathbf{I})} \\ &{}\propto \exp \Big\{ -\frac{1}{2} \Big( \frac{(\mathbf{x}_t - \sqrt{\alpha_t} \mathbf{x}_{t-1})^2}{1-\alpha_t} + \frac{(\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0)^2}{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \Big) \Big\} \\ &{}= \exp \Big\{ -\frac{1}{2} \Big( \frac{-2\sqrt{\alpha_t} \mathbf{x}_t \mathbf{x}_{t-1} + \alpha_t \mathbf{x}_{t-1}^2 }{ 1 - \alpha_t} + \frac{\mathbf{x}_{t-1}^2 - 2 \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_{t-1}\mathbf{x}_0}{1-\bar{\alpha}_{t-1}} + C(\mathbf{x}_t, \mathbf{x}_0) \Big) \Big\} \\ &{}\propto \exp\Big\{ -\frac{1}{2} \big( (\frac{\alpha_t}{1-\alpha_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) \mathbf{x}_{t-1}^2 - 2(\frac{\sqrt{\alpha_t}}{1-\alpha_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0) \mathbf{x}_{t-1} \big) \Big\} \\ &{}= \exp\Big\{ -\frac{1}{2} (\frac{1}{\frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}}) \Big[ \mathbf{x}_{t-1}^2 - 2\frac{\sqrt{\alpha_t} (1-\bar{\alpha}_{t-1})\mathbf{x}_t + \sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)\mathbf{x}_0 }{1-\bar{\alpha}_t} \mathbf{x}_{t-1}\Big] \Big\} \\ &{}\propto \mathcal{N}\Big(\mathbf{x}_{t-1}; \underbrace{\frac{\sqrt{\alpha_t} (1-\bar{\alpha}_{t-1})\mathbf{x}_t + \sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)\mathbf{x}_0 }{1-\bar{\alpha}_t}}_{\color{blue}{\pmb{\mu}(\mathbf{x}_t, \mathbf{x}_0)}}, 
\underbrace{\frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} \mathbf{I}}_{\color{green}{\pmb{\Sigma}_q(t)}} \Big) \label{mu} \end{align}$

At each timestep, $\mathbf{x}_{t-1} \sim q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$ follows a Gaussian distribution. The mean $\pmb{\mu}(\mathbf{x}_t, \mathbf{x}_0)$ is a function of $\mathbf{x}_t$ and $\mathbf{x}_0$, and $\pmb{\Sigma}_q(t)$ is a function of the $\alpha$ coefficients only (which are either fixed as hyperparameters or learned with neural networks). The variance can be written as $\pmb{\Sigma}_q (t) = \sigma^2_q (t) \mathbf{I}$, where $\sigma_q^2(t)=\frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}$.
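As a sanity check on the completed square, the simplified posterior mean and variance can be compared with the values read directly off the quadratic in $\mathbf{x}_{t-1}$. This is a minimal sketch; the function names and the linear schedule are assumptions made for illustration.

```python
import numpy as np

def q_posterior(x_t, x0, t, alphas, alpha_bars):
    """Mean and scalar variance of q(x_{t-1} | x_t, x_0) for a 1-based t >= 2."""
    a_t, ab_t, ab_prev = alphas[t - 1], alpha_bars[t - 1], alpha_bars[t - 2]
    mean = (np.sqrt(a_t) * (1 - ab_prev) * x_t
            + np.sqrt(ab_prev) * (1 - a_t) * x0) / (1 - ab_t)
    var = (1 - a_t) * (1 - ab_prev) / (1 - ab_t)
    return mean, var

def q_posterior_by_precision(x_t, x0, t, alphas, alpha_bars):
    """Same posterior, read off the quadratic in x_{t-1} before simplification."""
    a_t, ab_prev = alphas[t - 1], alpha_bars[t - 2]
    precision = a_t / (1 - a_t) + 1 / (1 - ab_prev)
    linear = np.sqrt(a_t) / (1 - a_t) * x_t + np.sqrt(ab_prev) / (1 - ab_prev) * x0
    return linear / precision, 1 / precision

alphas = 1.0 - np.linspace(1e-4, 0.02, 1000)  # hypothetical linear schedule
alpha_bars = np.cumprod(alphas)
x0, x_t = np.array([0.5, -1.0]), np.array([0.2, 0.3])
m1, v1 = q_posterior(x_t, x0, 500, alphas, alpha_bars)
m2, v2 = q_posterior_by_precision(x_t, x0, 500, alphas, alpha_bars)
print(np.allclose(m1, m2), np.isclose(v1, v2))
```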

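Since $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$ does not condition on $\mathbf{x}_0$, we match it to the tractable posterior by minimizing the KL divergence between the two Gaussians. With the variance of $p_\theta$ fixed to $\pmb{\Sigma}_q(t)$, this reduces to matching their means: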

$\begin{align} &{}\mathop{\arg\min}_\theta \; \mathbb{KL} \Big( q(\mathbf{x}_{t-1}\vert \mathbf{x}_t, \mathbf{x}_0) \Vert p_\theta (\mathbf{x}_{t-1} \vert \mathbf{x}_t) \Big) \\ = &{}\mathop{\arg\min}_\theta \; \mathbb{KL} \Big( \mathcal{N} \big( \mathbf{x}_{t-1}; \pmb{\mu}_q, \pmb{\Sigma}_q(t) \big) \Vert \mathcal{N} \big( \mathbf{x}_{t-1}; \pmb{\mu}_\theta, \pmb{\Sigma}_q(t) \big) \Big) \\ = &{}\mathop{\arg\min}_\theta \; \frac{1}{2} \Big[ \log \frac{|\pmb{\Sigma}_q (t)|}{| \pmb{\Sigma}_q (t)|} -d + \text{tr} (\pmb{\Sigma}_q (t)^{-1} \pmb{\Sigma}_q (t)) + (\pmb{\mu}_\theta - \pmb{\mu}_q)^\top \pmb{\Sigma}_q (t)^{-1} (\pmb{\mu}_\theta - \pmb{\mu}_q) \Big] \\ = &{}\mathop{\arg\min}_\theta \; \frac{1}{2} \Big[ \log 1 -d + d + (\pmb{\mu}_\theta - \pmb{\mu}_q)^\top \pmb{\Sigma}_q (t)^{-1} (\pmb{\mu}_\theta - \pmb{\mu}_q) \Big] \\ = &{}\mathop{\arg\min}_\theta \;\frac{1}{2} \Big[ (\pmb{\mu}_\theta - \pmb{\mu}_q)^\top \pmb{\Sigma}_q (t)^{-1} (\pmb{\mu}_\theta - \pmb{\mu}_q) \Big] \\ = &{}\mathop{\arg\min}_\theta \;\frac{1}{2} \Big[ (\pmb{\mu}_\theta - \pmb{\mu}_q)^\top (\sigma_q^2 (t)\mathbf{I})^{-1} (\pmb{\mu}_\theta - \pmb{\mu}_q) \Big] \\ = &{}\mathop{\arg\min}_\theta \; \frac{1}{2 \sigma_q^2 (t)} \Vert \pmb{\mu}_\theta - \pmb{\mu}_q \Vert_2^2 \label{kl} \end{align}$

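Given Eq.$\eqref{mu}$, we parameterize $\pmb{\mu}_\theta$ in the same form as $\pmb{\mu}_q$, replacing the unknown $\mathbf{x}_0$ with a network prediction $\hat{\mathbf{x}}_\theta (\mathbf{x}_t, t)$: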

$\begin{align} \pmb{\mu}_q (\mathbf{x}_t, \mathbf{x}_0) &{}= \frac{\sqrt{\alpha_t} (1-\bar{\alpha}_{t-1})\mathbf{x}_t + \sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t) \color{green}{\mathbf{x}_0} }{1-\bar{\alpha}_t} \label{mu_q} \\ \pmb{\mu}_\theta (\mathbf{x}_t, t) &{}= \frac{\sqrt{\alpha_t} (1-\bar{\alpha}_{t-1})\mathbf{x}_t + \sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t) \color{blue}{\hat{\mathbf{x}}_\theta (\mathbf{x}_t, t)} }{1-\bar{\alpha}_t} \label{mu_theta} \\ \end{align}$

Therefore, Eq.$\eqref{kl}$ can be rewritten as:

$\begin{align} &{}\mathop{\arg\min}_\theta \; \mathbb{KL} \Big( q(\mathbf{x}_{t-1}\vert \mathbf{x}_t, \mathbf{x}_0) \Vert p_\theta (\mathbf{x}_{t-1} \vert \mathbf{x}_t) \Big) \\ = &{}\mathop{\arg\min}_\theta \; \mathbb{KL} \Big( \mathcal{N} \big( \mathbf{x}_{t-1}; \pmb{\mu}_q, \pmb{\Sigma}_q(t) \big) \Vert \mathcal{N} \big( \mathbf{x}_{t-1}; \pmb{\mu}_\theta, \pmb{\Sigma}_q(t) \big) \Big) \\ = &{}\mathop{\arg\min}_\theta \; \frac{1}{2\sigma_q^2 (t)} \big\Vert \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t) \hat{\mathbf{x}}_\theta (\mathbf{x}_t, t) }{1-\bar{\alpha}_t} - \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t) \mathbf{x}_0 }{1-\bar{\alpha}_t} \big\Vert_2^2 \\ = &{}\mathop{\arg\min}_\theta \; \frac{1}{2\sigma_q^2 (t)} \big\Vert \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t) }{1-\bar{\alpha}_t} (\hat{\mathbf{x}}_\theta (\mathbf{x}_t, t) - \mathbf{x}_0) \big\Vert_2^2 \\ = &{}\mathop{\arg\min}_\theta \; \frac{1}{2\sigma_q^2 (t)} \frac{\bar{\alpha}_{t-1}(1-\alpha_t)^2 }{(1-\bar{\alpha}_t)^2} \big\Vert \hat{\mathbf{x}}_\theta (\mathbf{x}_t, t) - \mathbf{x}_0 \big\Vert_2^2 \label{loss_mu} \end{align}$

An intuitive understanding of the diffusion process comes from decomposing the evidence lower bound (ELBO):^[1]^[6]

$\begin{align}\log p(\mathbf{x}) &{}= \log \int p(\mathbf{x}_{0:T}) d \mathbf{x}_{1:T} \\&{}= \log \int \frac{ p(\mathbf{x}_{0:T}) q(\mathbf{x}_{1:T} | \mathbf{x}_0)}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} d \mathbf{x}_{1:T} \\&{}= \log \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0) }\Big[ \frac{p(\mathbf{x}_{0:T}) }{q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \Big] \\&{} \geq \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0) }\Big[ \log \frac{p(\mathbf{x}_{0:T}) }{q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \Big] \\&{} =\mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0) }\Big[ \log \frac{p(\mathbf{x}_{T}) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) }{\prod_{t=1}^T q(\mathbf{x}_t |\mathbf{x}_{t-1})} \Big] \\&{} = \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0) }\Big[ \log \frac{p(\mathbf{x}_{T}) p_\theta( \mathbf{x}_0 | \mathbf{x}_1) \prod_{t=2}^T p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) }{q(\mathbf{x}_T |\mathbf{x}_{T-1}) \prod_{t=1}^{T-1} q(\mathbf{x}_t |\mathbf{x}_{t-1})} \Big] \\&{} = \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0) }\Big[ \log \frac{p(\mathbf{x}_{T}) p_\theta ( \mathbf{x}_0 | \mathbf{x}_1) \prod_{t=1}^{T-1} p_\theta(\mathbf{x}_{t}|\mathbf{x}_{t+1}) }{q(\mathbf{x}_T |\mathbf{x}_{T-1}) \prod_{t=1}^{T-1} q(\mathbf{x}_t |\mathbf{x}_{t-1})} \Big] \\&{} = \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0) }\Big[ \log p_\theta( \mathbf{x}_0 | \mathbf{x}_1) \Big] + \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0) }\Big[ \log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T |\mathbf{x}_{T-1})} \Big] + \nonumber \\ &{} \qquad\qquad\qquad \quad\quad \sum_{t=1}^{T-1} \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0) }\Big[ \log \frac{p_\theta (\mathbf{x}_t \vert \mathbf{x}_{t+1}) }{ q (\mathbf{x}_t \vert \mathbf{x}_{t-1}) } \Big]\\&{} = \mathbb{E}_{q(\mathbf{x}_{1}|\mathbf{x}_0) }\Big[ \log p_\theta( \mathbf{x}_0 | \mathbf{x}_1) \Big] + \mathbb{E}_{q(\mathbf{x}_{T-1}, \mathbf{x}_{T}|\mathbf{x}_0) }\Big[ \log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T |\mathbf{x}_{T-1})} \Big] + \nonumber \\ &{} 
\qquad\qquad\qquad \quad\quad \sum_{t=1}^{T-1} \mathbb{E}_{q(\mathbf{x}_{t-1}, \mathbf{x}_{t}, \mathbf{x}_{t+1}|\mathbf{x}_0) }\Big[ \log \frac{p_\theta (\mathbf{x}_t \vert \mathbf{x}_{t+1}) }{ q (\mathbf{x}_t \vert \mathbf{x}_{t-1}) } \Big] \\&{} = \underbrace{\mathbb{E}_{q(\mathbf{x}_{1}|\mathbf{x}_0) }\Big[ \log p_\theta( \mathbf{x}_0 | \mathbf{x}_1) \Big]}_{\text{reconstruction}} - \underbrace{\mathbb{E}_{q(\mathbf{x}_{T-1} |\mathbf{x}_0) } \Big[ \mathbb{KL}(q(\mathbf{x}_T |\mathbf{x}_{T-1}) \Vert p(\mathbf{x}_T)) \Big]}_{\text{prior matching} \rightarrow 0} + \nonumber \\ &{} \qquad\qquad\qquad \quad\quad \underbrace{\sum_{t=1}^{T-1} \mathbb{E}_{q(\mathbf{x}_{t-1}, \mathbf{x}_{t}, \mathbf{x}_{t+1}|\mathbf{x}_0) }\Big[ \log \frac{p_\theta (\mathbf{x}_t \vert \mathbf{x}_{t+1}) }{ q (\mathbf{x}_t \vert \mathbf{x}_{t-1}) } \Big]}_{\text{consistency}}\end{align}$

The reconstruction term corresponds to the first-step optimization.
The prior matching term does not contain trainable parameters, requiring no optimization.
The consistency term makes the denoising process at timestep $t$ match the corresponding diffusion step from a cleaner input.

The ELBO objective is thus approximated across all noise levels by taking the expectation over all timesteps.

Training

The ELBO objective can be derived as ^[7]

$\begin{align} - \log p_\theta(\mathbf{x}_0) &\leq - \log p_\theta(\mathbf{x}_0) + D_\text{KL}(q(\mathbf{x}_{1:T}\vert\mathbf{x}_0) \| p_\theta(\mathbf{x}_{1:T}\vert\mathbf{x}_0) ) \\ &= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_{\mathbf{x}_{1:T}\sim q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T}) / p_\theta(\mathbf{x}_0)} \Big] \\ &= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} + \log p_\theta(\mathbf{x}_0) \Big] \\ &= \mathbb{E}_q \Big[ \log \frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] =: \mathcal{L}_\text{VLB} \\ &= \mathbb{E}_q \Big[ \log\frac{\prod_{t=1}^T q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{ p_\theta(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t) } \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \Big( \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)}\cdot \frac{q(\mathbf{x}_t \vert \mathbf{x}_0)}{q(\mathbf{x}_{t-1}\vert\mathbf{x}_0)} \Big) + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t \vert \mathbf{x}_0)}{q(\mathbf{x}_{t-1} \vert \mathbf{x}_0)} + 
\log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{q(\mathbf{x}_1 \vert \mathbf{x}_0)} + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big]\\ &= \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_T)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} - \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1) \Big] \\ &= \mathbb{E}_q [\underbrace{D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^T \underbrace{D_\text{KL}(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t))}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)}_{L_0} ] \end{align}$

Training the diffusion process amounts to learning a neural network that predicts one of the following four targets, given an arbitrarily noised version $\mathbf{x}_t$:

The original natural input $\mathbf{x}_0$. See Eq.$\eqref{loss_mu}$. ^[6] empirically finds that this leads to worse sampling quality early on.
The source noise $\pmb{\epsilon}_0$ ($\pmb{\epsilon}$-prediction parameterization). ^[2]
The score of input at an arbitrary noise level $\nabla \log p(\mathbf{x}_t)$ . ^[8]
The velocity of diffusion latents $\mathbf{x}_t$ . ^[19]

$\pmb{\epsilon}$-prediction parameterization

Rearranging Eq.$\eqref{noise_process}$ gives:

$\begin{align} \mathbf{x}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\pmb{\epsilon}_t}{\sqrt{\bar{\alpha}_t}} \label{denoise} \end{align}$

Plugging Eq.$\eqref{denoise}$ into the denoising transition mean in Eq.$\eqref{mu_q}$, we have:

$\begin{align} \pmb{\mu}_q (\mathbf{x}_t, \mathbf{x}_0) &{}= \frac{\sqrt{\alpha_t} (1-\bar{\alpha}_{t-1})\mathbf{x}_t + \sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t) \color{green}{\mathbf{x}_0} }{1-\bar{\alpha}_t} \\ &{}= \frac{\sqrt{\alpha_t} (1-\bar{\alpha}_{t-1})\mathbf{x}_t + \sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t) \color{grey}{\frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\pmb{\epsilon}_t}{\sqrt{\bar{\alpha}_t}}} }{1-\bar{\alpha}_t} \\ &{}= \frac{1}{\sqrt{\alpha_t}} \mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}\sqrt{\alpha_t}}\pmb{\epsilon}_t \label{eps_q} \end{align}$
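The algebra between the last two lines is easy to get wrong, so a numerical check is useful. The sketch below assumes a hypothetical linear schedule and random test vectors, and verifies that the $\mathbf{x}_0$ form and the $\pmb{\epsilon}$ form of the posterior mean agree.

```python
import numpy as np

rng = np.random.default_rng(0)
alphas = 1.0 - np.linspace(1e-4, 0.02, 1000)  # hypothetical linear schedule
alpha_bars = np.cumprod(alphas)

t = 400  # 1-based timestep, t >= 2
a_t, ab_t, ab_prev = alphas[t - 1], alpha_bars[t - 1], alpha_bars[t - 2]

x0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps  # forward noising

# Posterior mean written in terms of x_0
mu_from_x0 = (np.sqrt(a_t) * (1 - ab_prev) * x_t
              + np.sqrt(ab_prev) * (1 - a_t) * x0) / (1 - ab_t)
# Posterior mean written in terms of the source noise
mu_from_eps = x_t / np.sqrt(a_t) - (1 - a_t) / (np.sqrt(1 - ab_t) * np.sqrt(a_t)) * eps

print(np.allclose(mu_from_x0, mu_from_eps))
```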

Similarly, the approximate denoising transition mean $\pmb{\mu}_\theta (\mathbf{x}_t, t)$, parameterized by the noise predictor $\hat{\pmb{\epsilon}}_\theta (\mathbf{x}_t, t)$, is:

$\begin{align} \pmb{\mu}_\theta (\mathbf{x}_t, t) &{}= \frac{1}{\sqrt{\alpha_t}} \mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}\sqrt{\alpha_t}}\hat{\pmb{\epsilon}}_\theta(\mathbf{x}_t, t) \label{eps_theta} \end{align}$

Plugging Eq.$\eqref{eps_q}$ and Eq.$\eqref{eps_theta}$ into Eq.$\eqref{kl}$, we can write:

$\begin{align} &{}\mathop{\arg\min}_\theta \; \mathbb{KL} \Big( q(\mathbf{x}_{t-1}\vert \mathbf{x}_t, \mathbf{x}_0) \Vert p_\theta (\mathbf{x}_{t-1} \vert \mathbf{x}_t) \Big) \\ = &{}\mathop{\arg\min}_\theta \; \mathbb{KL} \Big( \mathcal{N} \big( \mathbf{x}_{t-1}; \pmb{\mu}_q, \pmb{\Sigma}_q(t) \big) \Vert \mathcal{N} \big( \mathbf{x}_{t-1}; \pmb{\mu}_\theta, \pmb{\Sigma}_q(t) \big) \Big) \\ =&{}\mathop{\arg\min}_\theta \; \frac{1}{2 \sigma_q^2 (t)} \Vert \pmb{\mu}_\theta - \pmb{\mu}_q \Vert_2^2 \\ =&{}\mathop{\arg\min}_\theta \; \frac{1}{2 \sigma_q^2 (t)} \Vert \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}\sqrt{\alpha_t}}\pmb{\epsilon}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}\sqrt{\alpha_t}}\hat{\pmb{\epsilon}}_\theta(\mathbf{x}_t, t) \Vert_2^2 \\ =&{}\mathop{\arg\min}_\theta \; \frac{1}{2 \sigma_q^2 (t)} \frac{(1-\alpha_t)^2}{(1-\bar{\alpha}_t)\alpha_t} \Vert \pmb{\epsilon}_t - \hat{\pmb{\epsilon}}_\theta(\mathbf{x}_t, t) \Vert_2^2 \\ =&{}\mathop{\arg\min}_\theta \; \frac{1}{2 \sigma_q^2 (t)} \frac{(1-\alpha_t)^2}{(1-\bar{\alpha}_t)\alpha_t} \Vert \pmb{\epsilon}_t - \pmb{\epsilon}_\theta ( \underbrace{\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \pmb{\epsilon}_t}_{\text{Plugging Eq.\eqref{forward_add_noise}}} , t) \Vert_2^2 \label{loss_noise} \end{align}$

Simplified objective: ^[2] empirically finds it better to remove the weighting term in Eq.$\eqref{loss_noise}$:

$\begin{align} \color{blue}{\mathcal{L}_\text{simple}} &{}= \Vert \pmb{\epsilon}_t - \hat{\pmb{\epsilon}}_\theta(\mathbf{x}_t, t) \Vert_2^2 \\ &{}= \Vert \pmb{\epsilon}_t - \pmb{\epsilon}_\theta ( \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \pmb{\epsilon}_t , t) \Vert_2^2 \end{align}$

The training objective resembles denoising score matching over multiple noise scales indexed by $t$. It can be treated as using variational inference to fit the finite-time marginal of a sampling chain resembling Langevin dynamics.

The overall DDPM training algorithm is:
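A minimal sketch of one training iteration, based on the simplified objective above: sample a timestep uniformly, noise the data in closed form, and regress the predicted noise onto the true noise. The names `ddpm_simple_loss` and `zero_model` are hypothetical, the schedule is an illustrative assumption, and the gradient step itself is left to whatever autodiff framework hosts the real network.

```python
import numpy as np

def ddpm_simple_loss(eps_model, x0, alpha_bars, rng):
    """Monte Carlo estimate of L_simple for a batch x0 of shape (B, D)."""
    batch = x0.shape[0]
    idx = rng.integers(0, len(alpha_bars), size=batch)  # idx = t - 1, uniform over timesteps
    eps = rng.standard_normal(x0.shape)                 # target noise
    a_bar = alpha_bars[idx][:, None]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps  # closed-form forward noising
    eps_hat = eps_model(x_t, idx)
    return np.mean(np.sum((eps - eps_hat) ** 2, axis=1))

def zero_model(x_t, idx):
    """Hypothetical stand-in for a neural network; always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # illustrative schedule
x0 = rng.standard_normal((4096, 2))
loss = ddpm_simple_loss(zero_model, x0, alpha_bars, rng)
print(loss)  # roughly the data dimension D = 2 for a predictor that always outputs zero
```

A trained network should drive this value well below the zero-predictor baseline.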

The sampling process resembles Langevin dynamics with $\pmb{\epsilon}_\theta$ as a learned gradient of the data density.
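The corresponding ancestral sampler can be sketched as follows, using the $\pmb{\epsilon}$-parameterized mean and the posterior variance $\sigma_q^2(t)$ derived earlier. To keep the example checkable without a trained network, it assumes a toy data distribution concentrated on a single point `c`, for which the exact noise predictor is known; `ddpm_sample` and `make_point_mass_oracle` are hypothetical names.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling from x_T ~ N(0, I) down to x_0."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    alpha_bars_prev = np.concatenate([[1.0], alpha_bars[:-1]])  # alpha_bar_0 = 1
    x = rng.standard_normal(shape)
    for i in reversed(range(len(betas))):  # i = t - 1
        eps_hat = eps_model(x, i)
        mean = (x - (1.0 - alphas[i]) / np.sqrt(1.0 - alpha_bars[i]) * eps_hat) / np.sqrt(alphas[i])
        var = (1.0 - alphas[i]) * (1.0 - alpha_bars_prev[i]) / (1.0 - alpha_bars[i])
        x = mean + np.sqrt(var) * rng.standard_normal(shape)  # var is 0 at the last step
    return x

def make_point_mass_oracle(c, betas):
    """Exact noise predictor when every data point equals c (toy assumption)."""
    alpha_bars = np.cumprod(1.0 - betas)
    def eps_model(x, i):
        return (x - np.sqrt(alpha_bars[i]) * c) / np.sqrt(1.0 - alpha_bars[i])
    return eps_model

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 200)  # illustrative schedule
c = np.array([1.5, -0.5])
sample = ddpm_sample(make_point_mass_oracle(c, betas), c.shape, betas, rng)
print(sample)  # recovers c, because the oracle knows the single data point
```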

Velocity prediction

^[19] proposes to parameterize the model by predicting the velocity of the diffusion latents, $\mathbf{v} \equiv \alpha_t \epsilon - \sigma_t \mathbf{x}$, which gives $\hat{\mathbf{x}} = \alpha_t \mathbf{z}_t - \sigma_t \hat{\mathbf{v}}_\theta (\mathbf{z}_t)$.

Let $\phi_t = \arctan (\sigma_t / \alpha_t)$. Assuming a variance-preserving diffusion process, we have $\alpha_\phi = \cos (\phi), \sigma_\phi = \sin (\phi)$, and hence $\mathbf{z}_\phi = \cos (\phi) \mathbf{x} + \sin (\phi) \epsilon$.

^[19] thus define the velocity of $\mathbf{z}_\phi$ as:

$\begin{align} \mathbf{v}_\phi &{}\equiv \frac{d \mathbf{z}_\phi}{d \phi} \\ &{}= \frac{d \cos (\phi)}{d \phi} \mathbf{x} + \frac{d \sin (\phi)}{d \phi} \epsilon \\ &{}= \cos (\phi) \epsilon - \sin (\phi) \mathbf{x} \end{align}$

By rearranging the $\epsilon$, $\mathbf{x}$, $\mathbf{v}$, we then get:

$\begin{align} \sin (\phi) \mathbf{x} &{}= \cos(\phi) \epsilon - \mathbf{v}_\phi \\ &{}= \frac{\cos(\phi)}{\sin(\phi)} (\mathbf{z} - \cos(\phi) \mathbf{x}) - \mathbf{v}_\phi \\ \sin^2(\phi) \mathbf{x} &{}= \cos(\phi) \mathbf{z} - \cos^2(\phi)\mathbf{x} - \sin (\phi) \mathbf{v}_\phi \\ (\sin^2(\phi) + \cos^2(\phi))\mathbf{x} &{}= \mathbf{x} = \cos(\phi) \mathbf{z} - \sin (\phi) \mathbf{v}_\phi \end{align}$

We also get $\epsilon = \sin (\phi) \mathbf{z}_\phi + \cos (\phi)\mathbf{v}_\phi$ .

The predicted velocity is defined as:

$\begin{align} \hat{v}_\theta (\mathbf{z}_\phi) \equiv \cos(\phi) \hat{\epsilon}_\theta (\mathbf{z}_\phi) - \sin (\phi) \hat{\mathbf{x}}_\theta (\mathbf{z}_\phi) \end{align}$

where $\hat{\epsilon}_\theta (\mathbf{z}_\phi) = (\mathbf{z}_\phi - \cos(\phi)\hat{\mathbf{x}}_\theta (\mathbf{z}_\phi) ) / \sin(\phi)$ .
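The identities above can be verified with a short round trip: build $\mathbf{z}_\phi$ and $\mathbf{v}_\phi$ from $\mathbf{x}$ and $\epsilon$, then recover both from $(\mathbf{z}_\phi, \mathbf{v}_\phi)$. The function names are illustrative assumptions.

```python
import numpy as np

def to_velocity(x, eps, phi):
    """v = cos(phi) * eps - sin(phi) * x."""
    return np.cos(phi) * eps - np.sin(phi) * x

def from_velocity(z, v, phi):
    """Recover (x, eps) from the latent z and the velocity v."""
    x = np.cos(phi) * z - np.sin(phi) * v
    eps = np.sin(phi) * z + np.cos(phi) * v
    return x, eps

rng = np.random.default_rng(0)
x, eps = rng.standard_normal(4), rng.standard_normal(4)
phi = 0.7                                  # arbitrary angle in (0, pi/2)
z = np.cos(phi) * x + np.sin(phi) * eps    # variance-preserving latent
v = to_velocity(x, eps, phi)
x_rec, eps_rec = from_velocity(z, v, phi)
print(np.allclose(x_rec, x), np.allclose(eps_rec, eps))
```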

Figure: Visualization of the reparameterization in terms of $\phi$ and $\mathbf{v}_\phi$.

The following algorithm illustrates the complete training process:

Conditional Generation

Conditional generation uses either classifier-guided or classifier-free methods. The difference is whether an extra classifier is used to provide the condition guidance.

Classifier Guidance

^[4] uses a classifier $f_\phi (y \vert \mathbf{x}_t,t)$ trained on noisy images $\mathbf{x}_t$ and takes its gradient with respect to the input, $\nabla_{\mathbf{x}_t} \log f_\phi (y \vert \mathbf{x}_t)$, to guide the sampling process toward the condition $y$, such as a target class label.

Given a Gaussian $\mathbf{x} \sim \mathcal{N}(\pmb{\mu}, \pmb{\sigma}^2\mathbf{I})$, the gradient of the log-density^[8] is:

$\begin{align}\nabla_\mathbf{x} \log p(\mathbf{x}) &{}= \nabla_\mathbf{x} \Big( - \frac{1}{2\pmb{\sigma}^2} (\mathbf{x} - \pmb{\mu})^2 \Big) \\&{}= -\frac{\mathbf{x} - \pmb{\mu}}{\pmb{\sigma}^2} \\&{}= -\frac{\pmb{\epsilon}}{\pmb{\sigma}} \qquad \qquad\qquad \text{with}\qquad\pmb{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\end{align}$

Given Eq.$\eqref{noise_process}$, we have:

$\begin{align}\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) &{}= \mathbb{E}_{q(\mathbf{x}_0 \vert \mathbf{x}_t)} \Big[ \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \vert \mathbf{x}_0) \Big] \\&{}\approx \mathbb{E}_{q(\mathbf{x}_0 \vert \mathbf{x}_t)} \Big[ -\frac{\pmb{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}} \Big] \\&{}= -\frac{\pmb{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1-\bar{\alpha}_t}}\end{align}$

The score function for the joint distribution $q (\mathbf{x}_t, y)$ is:

$\begin{align} \nabla_{\mathbf{x}_t} \log q (\mathbf{x}_t, y) &{}= \nabla_{\mathbf{x}_t} \log q (\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log q (y \vert \mathbf{x}_t) \\ &{}\approx - \frac{1}{\sqrt{1-\bar{\alpha}_t}} \pmb{\epsilon}_\theta (\mathbf{x}_t, t) + \nabla_{\mathbf{x}_t} \log f_\phi (y \vert \mathbf{x}_t) \\ &{}= - \frac{1}{\sqrt{1-\bar{\alpha}_t}} \big( \pmb{\epsilon}_\theta (\mathbf{x}_t, t) - \sqrt{1 - \bar{\alpha}_t} \nabla_{\mathbf{x}_t} \log f_\phi (y \vert \mathbf{x}_t) \big) \end{align}$

The classifier-guided predictor $\bar{\pmb{\epsilon}}_\theta$ thus obtains a truncation-like effect by sampling in the direction of the gradient of the image classifier to perform conditional generation:

$\begin{align} \bar{\pmb{\epsilon}}_\theta (\mathbf{x}_t, t) = \pmb{\epsilon}_\theta (\mathbf{x}_t, t) - \sqrt{1-\bar{\alpha}_t} \nabla_{\mathbf{x}_t} \log f_\phi (y \vert \mathbf{x}_t) \end{align}$

Classifier-guided prediction ^[4] uses a weight factor $w$ to control the strength of the classifier gradient:

$\begin{align} \bar{\pmb{\epsilon}}_\theta (\mathbf{x}_t, t) = \pmb{\epsilon}_\theta (\mathbf{x}_t, t) - \sqrt{1-\bar{\alpha}_t} \, {\color{red} w} \nabla_{\mathbf{x}_t} \log f_\phi (y \vert \mathbf{x}_t) \label{classifier_guidance} \end{align}$
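The guided update can be sketched with a toy classifier whose gradient is available in closed form. Here the classifier is assumed to be a logistic model $f(y{=}1 \vert \mathbf{x}_t) = \text{sigmoid}(\mathbf{a}^\top \mathbf{x}_t + b)$, purely for illustration; a real system would backpropagate through a trained noisy-image classifier instead.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def grad_log_classifier(x_t, a, b):
    """Gradient w.r.t. x_t of log sigmoid(a . x_t + b) for the toy logistic classifier."""
    return (1.0 - sigmoid(a @ x_t + b)) * a

def guided_eps(eps_pred, grad_log_f, alpha_bar_t, w=1.0):
    """Shift the predicted noise against the classifier gradient, scaled by w."""
    return eps_pred - np.sqrt(1.0 - alpha_bar_t) * w * grad_log_f

x_t = np.array([0.3, -0.8])
a, b = np.array([1.0, 2.0]), 0.1      # hypothetical classifier parameters
eps_pred = np.array([0.5, 0.5])       # stand-in for eps_theta(x_t, t)
g = grad_log_classifier(x_t, a, b)
eps_bar = guided_eps(eps_pred, g, alpha_bar_t=0.4, w=2.0)
print(eps_bar)
```

Setting `w=0` recovers the unguided prediction, and larger `w` pushes samples further toward regions the classifier assigns to $y$.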

Classifier-Free Guidance

Classifier guidance introduces an auxiliary classifier and thus complicates the training process. It is natural to ask whether conditional generation is possible without any explicit classifier $f_\phi$. Instead of sampling in the direction of the gradient of an image classifier, ^[5] proposes to combine the score estimates of a conditional diffusion model $p_\theta (\mathbf{x}|y)$ and a jointly trained unconditional model $p_\theta (\mathbf{x})$ via a single model.

Specifically, when training the conditional diffusion model $p_\theta (\mathbf{x}|y)$ parameterized by the score estimator $\pmb{\epsilon}_\theta (\mathbf{x}_t, t, y)$, ^[5] randomly drops the condition by setting $y=\emptyset$, so that the unconditional estimate is $\pmb{\epsilon}_\theta (\mathbf{x}_t, t) = \pmb{\epsilon}_\theta (\mathbf{x}_t, t, \emptyset)$.

The gradient of an implicit classifier can be expressed as the difference between the conditional and unconditional score estimates:

$\begin{align} \nabla_{\mathbf{x}_t} \log f_\phi (y \vert \mathbf{x}_t) &{} = \nabla_{\mathbf{x}_t} \log p (\mathbf{x}_t \vert y) - \nabla_{\mathbf{x}_t} \log p (\mathbf{x}_t) \\ &{}= - \frac{1}{\sqrt{1-\bar{\alpha}_t}} \big( \pmb{\epsilon}_\theta (\mathbf{x}_t, t, y) - \pmb{\epsilon}_\theta (\mathbf{x}_t, t, y=\emptyset) \big) \end{align}$

Plugging this into Eq.$\eqref{classifier_guidance}$, with the conditional estimate $\pmb{\epsilon}_\theta (\mathbf{x}_t, t, y)$ as the base prediction, the guided score estimator becomes:

$\begin{align} \bar{\pmb{\epsilon}}_\theta (\mathbf{x}_t, t, y) &{} = \pmb{\epsilon}_\theta (\mathbf{x}_t, t, y) - \sqrt{1-\bar{\alpha}_t} \, w \nabla_{\mathbf{x}_t} \log f_\phi (y \vert \mathbf{x}_t) \\ &{}= \pmb{\epsilon}_\theta (\mathbf{x}_t, t, y) + w \Big( \pmb{\epsilon}_\theta (\mathbf{x}_t, t, y) - \pmb{\epsilon}_\theta (\mathbf{x}_t, t, y=\emptyset) \Big)\\ &{}= (w+1) \pmb{\epsilon}_\theta (\mathbf{x}_t, t, y) - w \cdot \pmb{\epsilon}_\theta (\mathbf{x}_t, t, y=\emptyset) \end{align}$
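In code, the final combination $(w+1)\,\pmb{\epsilon}_\theta(\mathbf{x}_t, t, y) - w\,\pmb{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)$ is a one-liner, and the training-time condition dropout is a coin flip. The sketch below uses hypothetical names (`cfg_eps`, `maybe_drop_condition`, `p_uncond`) and placeholder arrays instead of real network outputs.

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w):
    """Classifier-free guided noise estimate: (w + 1) * eps_cond - w * eps_uncond."""
    return (1.0 + w) * eps_cond - w * eps_uncond

def maybe_drop_condition(y, null_token, p_uncond, rng):
    """Training-time dropout: replace y with the null condition with probability p_uncond."""
    return null_token if rng.random() < p_uncond else y

eps_cond = np.array([0.2, -0.1])    # stand-in for eps_theta(x_t, t, y)
eps_uncond = np.array([0.0, 0.3])   # stand-in for eps_theta(x_t, t, null)
guided = cfg_eps(eps_cond, eps_uncond, w=3.0)
print(guided)

rng = np.random.default_rng(0)
labels = [maybe_drop_condition(7, None, p_uncond=0.1, rng=rng) for _ in range(1000)]
print(sum(label is None for label in labels))  # about 10% of the labels are dropped
```

With `w=0` the estimate reduces to the purely conditional prediction; increasing `w` extrapolates away from the unconditional one.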

Categorical Diffusion (Discrete)

The Gaussian diffusion process operates on continuous state spaces, such as real-valued image and waveform data. Some research has applied Gaussian diffusion to categorical data, which requires relaxing or embedding the discrete data into continuous spaces. A more natural approach is categorical diffusion, which corrupts categorical data such as language directly in discrete state spaces.

^[1] first introduced diffusion models with discrete state spaces over binary random variables. ^[9] extended the model class to categorical random variables with transition matrices characterized by uniform transition probabilities. ^[10] introduced discrete denoising diffusion probabilistic models (D3PM) by generalizing the state corruption process.

Figure: Quantized swiss roll. Each dot represents a 2D categorical variable. Top: diffused samples from the uniform, discretized Gaussian, and absorbing-state models, with transition matrices $\mathbf{Q}$. Bottom: learned discretized Gaussian reverse process.

Discrete Diffusion (D3PM)

For scalar discrete random variables with $K$ categories, $x_t, x_{t-1} \in \{1,\cdots, K\}$, the forward transition probabilities can be represented by matrices: $[\mathbf{Q}_t]_{ij} = q (x_t = j \vert x_{t-1}=i)$.

Denote the one-hot version of $x$ by the row vector $\mathbf{x}$, and let $\text{Cat} (\mathbf{x}; \mathbf{p})$ be a categorical distribution over the one-hot row vector $\mathbf{x}$ with probabilities given by the row vector $\mathbf{p}$. We can then write:

$\begin{align} q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \text{Cat} (\mathbf{x}_t; \mathbf{p}=\mathbf{x}_{t-1}\mathbf{Q}_t) \end{align}$

The term $\mathbf{x}_{t-1}\mathbf{Q}_t$ can be understood as a row vector-matrix product. $\mathbf{Q}$ is assumed to apply to each image pixel or sequence token independently. $q$ factorizes over the higher dimensions. Thus we write $q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$ w.r.t a single element.

Discrete state spaces

Starting from $\mathbf{x}_0$, the $t$-step marginal at time $t$ is:

$\begin{align} q(\mathbf{x}_t \vert \mathbf{x}_0) = \text{Cat} (\mathbf{x}_t; \mathbf{p}=\mathbf{x}_0 \mathbf{\overline{Q}}_t) \quad \quad \text{with} \quad\quad \mathbf{\overline{Q}}_t= \prod_{i=1}^t\mathbf{Q}_i = \mathbf{Q}_1 \mathbf{Q}_2 \cdots \mathbf{Q}_t \end{align}$
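A minimal sketch of the discrete forward marginal, assuming the uniform transition matrix $(1-\beta_t)\mathbf{I} + \frac{\beta_t}{K}\mathbb{1}\mathbb{1}^\top$ and an illustrative $\beta_t$ schedule. The helper names are hypothetical.

```python
import numpy as np

def uniform_transition(K, beta):
    """Q_t = (1 - beta) * I + (beta / K) * 1 1^T; every row sums to 1."""
    return (1.0 - beta) * np.eye(K) + (beta / K) * np.ones((K, K))

def cumulative_transitions(Qs):
    """Return [Q_bar_1, ..., Q_bar_T] with Q_bar_t = Q_1 @ Q_2 @ ... @ Q_t."""
    out, Q_bar = [], np.eye(Qs[0].shape[0])
    for Q in Qs:
        Q_bar = Q_bar @ Q
        out.append(Q_bar)
    return out

rng = np.random.default_rng(0)
K, T = 5, 50
Qs = [uniform_transition(K, beta) for beta in np.linspace(0.01, 0.2, T)]
Q_bars = cumulative_transitions(Qs)

x0 = np.eye(K)[2]                      # one-hot row vector for category 2
p_T = x0 @ Q_bars[T - 1]               # probabilities of q(x_T | x_0)
x_T = np.eye(K)[rng.choice(K, p=p_T)]  # one-hot sample of x_T
print(p_T)                             # approaches the uniform distribution as t grows
```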

The posterior is:

$\begin{align} q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) &{}= \frac{q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) q(\mathbf{x}_{t-1}\vert \mathbf{x}_0)}{q(\mathbf{x}_{t}\vert \mathbf{x}_0)} \\ &{}= \frac{q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) q(\mathbf{x}_{t-1}\vert \mathbf{x}_0)}{q(\mathbf{x}_{t}\vert \mathbf{x}_0)} \qquad\qquad\qquad \text{(Markov property)} \\ &{}=\text{Cat} \Big(\mathbf{x}_{t-1}; \mathbf{p}= \frac{\mathbf{x}_t \mathbf{Q}_t^\top \odot \mathbf{x}_0 \mathbf{\overline{Q}}_{t-1}}{\mathbf{x}_0 \mathbf{\overline{Q}}_{t} \mathbf{x}_t^\top} \Big) \end{align}$

Assuming that the reverse process $p_\theta (\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ is factorized as conditionally independent over all the elements, the KL divergence between $q$ and $p_\theta$ is computed by summing over all possible values of each random variable.
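The posterior formula can be checked against a brute-force application of Bayes' rule. The sketch below uses random row-stochastic matrices as hypothetical stand-ins for $\mathbf{\overline{Q}}_{t-1}$ and $\mathbf{Q}_t$, so the check does not rely on any symmetry of the transition matrices.

```python
import numpy as np

def d3pm_posterior(x_t, x0, Q_t, Q_bar_prev, Q_bar_t):
    """Probabilities of q(x_{t-1} | x_t, x_0) for one-hot row vectors x_t and x0."""
    numer = (x_t @ Q_t.T) * (x0 @ Q_bar_prev)  # elementwise product over values of x_{t-1}
    denom = x0 @ Q_bar_t @ x_t                 # scalar q(x_t | x_0)
    return numer / denom

rng = np.random.default_rng(0)
K = 4
Q_bar_prev = rng.dirichlet(np.ones(K), size=K)  # stands in for Q_bar_{t-1}; rows sum to 1
Q_t = rng.dirichlet(np.ones(K), size=K)         # hypothetical one-step transition matrix
Q_bar_t = Q_bar_prev @ Q_t

i, k = 1, 3                                     # x_0 = category i, x_t = category k
x0, x_t = np.eye(K)[i], np.eye(K)[k]
post = d3pm_posterior(x_t, x0, Q_t, Q_bar_prev, Q_bar_t)

# Brute-force Bayes' rule over every value j of x_{t-1}
brute = np.array([Q_t[j, k] * Q_bar_prev[i, j] for j in range(K)]) / Q_bar_t[i, k]
print(np.allclose(post, brute), post.sum())
```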

Forward Markov transition matrices

Uniform^[9]. Given $\beta_t \in [0,1]$, the transition matrix $\mathbf{Q}_t = (1-\beta_t)\mathbf{I} + \frac{\beta_t}{K} \mathbb{1}\mathbb{1}^\top$ .
Absorbing state. Define transition matrix with an absorbing state (called [MASK]), such that each token either stays the same or transitions to [MASK] with some probability $\beta_t$. This is motivated by BERT. For images, it reuses the grey pixels as the [MASK] absorbing token.
Discretized Gaussian. ^[9] uses a discretized, truncated Gaussian distribution for ordinal data such as images.
Token embedding distance. ^[9] uses similarity in an embedding space to guide the forward process, so that transitions are more frequent between tokens that have similar embeddings, while maintaining a uniform stationary distribution.
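The first two transition matrices in the list are simple to construct. In this sketch the [MASK] state is assumed to occupy the last of the $K$ indices, and the function names are hypothetical; the discretized Gaussian and embedding-distance variants are omitted.

```python
import numpy as np

def uniform_transition(K, beta):
    """Q_t = (1 - beta) * I + (beta / K) * 1 1^T."""
    return (1.0 - beta) * np.eye(K) + (beta / K) * np.ones((K, K))

def absorbing_transition(K, beta, mask_index):
    """Each state stays put with prob. 1 - beta or jumps to the [MASK] state with prob. beta."""
    Q = (1.0 - beta) * np.eye(K)
    Q[:, mask_index] += beta   # the [MASK] row becomes a self-loop with probability 1
    return Q

K, beta = 6, 0.1
Q_uni = uniform_transition(K, beta)
Q_abs = absorbing_transition(K, beta, mask_index=K - 1)

uniform = np.full(K, 1.0 / K)
print(np.allclose(uniform @ Q_uni, uniform))       # the uniform distribution is stationary
print(np.linalg.matrix_power(Q_abs, 200)[0])       # mass of state 0 ends up in [MASK]
```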

Training

$\begin{align} \mathcal{L}_\lambda = \mathcal{L}_{\text{vlb}} + \lambda \mathbb{E}_{q(\mathbf{x}_0}\mathbb{E}_{q(\mathbf{x}_t \vert \mathbf{x}_0)} [- \log \tilde{p}_\theta (\mathbf{x}_0 \vert \mathbf{x}_t)] \end{align}$

BERT is a one-step diffusion model. For a one-step diffusion process in which $q(\mathbf{x}_1 \vert \mathbf{x}_0)$ replaces 10% of tokens with [MASK] and 5% with tokens drawn uniformly at random, we have:

$\begin{align}\mathcal{L}_\text{vlb} - \mathcal{L}_\text{T} &{}= - \mathbb{E}_{q(\mathbf{x}_1 \vert \mathbf{x}_0)} [\log p_\theta (\mathbf{x}_0 \vert \mathbf{x}_1)] \\&{}= \mathcal{L}_\text{BERT}\end{align}$

Autoregressive models are (discrete) diffusion models. Consider a diffusion process that deterministically masks tokens one by one in a sequence of length $T$:

$\begin{align}q([\textbf{x}_t]_i | \textbf{x}_0) =\left\{ \begin{array}{ll} [\textbf{x}_0]_i \qquad \text{if}\quad i<T-t\\ \text{[MASK]} \quad\text{otherwise} \end{array} \right.\end{align}$
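
A tiny sketch makes this schedule concrete. It assumes 0-indexed positions and represents tokens as Python strings; `mask_schedule` is a hypothetical helper name.

```python
MASK = "[MASK]"

def mask_schedule(x0, t):
    """Return x_t: keep position i if i < T - t, otherwise replace it with [MASK]."""
    T = len(x0)
    return [tok if i < T - t else MASK for i, tok in enumerate(x0)]

x0 = ["the", "cat", "sat"]
# Forward process: one more token is masked, from the right, at every step.
trajectory = [mask_schedule(x0, t) for t in range(len(x0) + 1)]
for t, xt in enumerate(trajectory):
    print(t, xt)
```

Reading the trajectory backwards, each reverse step from $\mathbf{x}_t$ to $\mathbf{x}_{t-1}$ reveals exactly one token, at position $T-t$, conditioned on the unmasked tokens to its left.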

For the position $i \neq T-t$, the KL divergence

$\begin{align}\mathbb{KL}(q([\mathbf{x}_{t-1}]_i \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta([\mathbf{x}_{t-1}]_i \vert\mathbf{x}_t)) \rightarrow 0\end{align}$

Therefore, the KL divergence only needs to be computed for the token at position $i = T-t$, where it is exactly the standard cross-entropy loss of an autoregressive model.

$\begin{align}\mathbb{KL}(q([\mathbf{x}_{t-1}]_i \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta([\mathbf{x}_{t-1}]_i \vert\mathbf{x}_t)) &= \sum_{[\mathbf{x}_{t-1}]_i} q([\mathbf{x}_{t-1}]_i \vert \mathbf{x}_t, \mathbf{x}_0) \cdot \log \frac{q([\mathbf{x}_{t-1}]_i \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta([\mathbf{x}_{t-1}]_i \vert\mathbf{x}_t)} \\&=-\log p_\theta([\mathbf{x}_0]_i \vert\mathbf{x}_t) \\&= -\log p_\theta([\mathbf{x}_0]_i \vert [\mathbf{x}_0]_{<i})\end{align}$
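
The reduction to cross-entropy relies on the posterior at position $i = T-t$ being a point mass on the true token $[\mathbf{x}_0]_i$. A short numerical check, with a made-up model distribution and a hypothetical helper name, illustrates this:

```python
import numpy as np

def kl_point_mass(target_index, p_model):
    """KL(q || p_model) where q is a one-hot distribution on target_index."""
    p_model = np.asarray(p_model, dtype=float)
    q = np.zeros_like(p_model)
    q[target_index] = 1.0
    support = q > 0                      # terms with q = 0 contribute nothing
    return float(np.sum(q[support] * np.log(q[support] / p_model[support])))

p_model = np.array([0.1, 0.6, 0.3])      # hypothetical next-token distribution
kl = kl_point_mass(1, p_model)
cross_entropy = -float(np.log(p_model[1]))
```

Here `kl` and `cross_entropy` agree, since a point mass has zero entropy and the KL divergence collapses to the negative log-probability of the true token.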

(Generative) Masked Language Models are diffusion models. Generative MLMs^[15]^[16] are generative models that generate text from a sequence of [MASK] tokens.

References

1. Sohl-Dickstein, Jascha, et al. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. ICML 2015.
2. Ho, Jonathan, et al. Denoising Diffusion Probabilistic Models. arXiv:2006.11239, arXiv, 16 Dec. 2020.
3. Nichol, Alex, and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models. arXiv:2102.09672, arXiv, 18 Feb. 2021.
4. Dhariwal, Prafulla, and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. arXiv:2105.05233, arXiv, 1 June 2021.
5. Ho, Jonathan, and Tim Salimans. Classifier-Free Diffusion Guidance. 2021. openreview.net.
6. Luo, C. (2022). Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970.
7. Weng, Lilian. (Jul 2021). What are diffusion models? Lil’Log. https://lilianweng.github.io/posts/2021-07-11-diffusion-models/.
8. Yang Song & Stefano Ermon. Generative modeling by estimating gradients of the data distribution. NeurIPS 2019.
9. Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P. and Welling, M. Argmax flows and multinomial diffusion: Learning categorical distributions. NeurIPS 2021.
10. Austin, J., Johnson, D.D., Ho, J., Tarlow, D. and van den Berg, R., 2021. Structured denoising diffusion models in discrete state-spaces. NeurIPS 2021.
11. Li, X.L., Thickstun, J., Gulrajani, I., Liang, P. and Hashimoto, T.B., 2022. Diffusion-LM Improves Controllable Text Generation. arXiv preprint arXiv:2205.14217.
12. Gong, S., Li, M., Feng, J., Wu, Z. and Kong, L., 2022. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933.
13. Lin, Z., Gong, Y., Shen, Y., Wu, T., Fan, Z., Lin, C., Chen, W. and Duan, N., 2022. GENIE: Large Scale Pre-training for Text Generation with Diffusion Model. arXiv preprint arXiv:2212.11685.
14. He, Z., Sun, T., Wang, K., Huang, X. and Qiu, X., 2022. DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models. arXiv preprint arXiv:2211.15029.
15. Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-Predict: Parallel decoding of conditional masked language models. arXiv preprint arXiv:1904.09324, April 2019.
16. Alex Wang and Kyunghyun Cho. BERT has a mouth, and it must speak: BERT as a markov random field language model. arXiv preprint arXiv:1902.04094, February 2019.
17. Sergios Karagiannakos, Nikolas Adaloglou. How diffusion models work: the math from scratch. AI Summer. September 2022.
18. The Annotated Diffusion Model. Huggingface Blog. June 2022.
19. Salimans, Tim and Jonathan Ho. Progressive Distillation for Fast Sampling of Diffusion Models. ICLR 2022.

Yekun's Note

Diffusion Models: A Mathematical Note from Scratch
