# Yekun's ML Notes

Some machine learning notes and writeup.

A diffusion probabilistic model is a parameterzed Markov chain trained to reverse a predefined forward process, closely related to both likelihood-based optimization and score matching. The forward diffusion process is a stochastic process constructed to gradually corrupt the original data into random nose.

# Gaussian Diffusion (Continuous)

Diffusion models [1][2] are latent variable models inspired by the non-equilibrium statistical physics ( thermodynamics) that gradually destroy structure in data distribution through an iterative forward diffusion process, and then learn a reversal process to recover the original data structure through iterative denoising.

Diffusion models can be treated as a Markovian Hierarchical Variational Autoencoder with three restrictions:[6]

1. The latent dimension is the same as the original data.
2. The encoder is not learned, instead uses a (pre-defined) linear Gaussian model.
3. The latent in final timestep $T$ is an isotropic Gaussian.

## Forward (Diffusion) process

Given a data point sampled from the data distribution $\mathbf{x}_0 \sim q(\mathbf{x})$. The forward diffusion process gradually applied a (fixed) linear Gaussian model at each timestep $t$ out of $T$ steps:

where the forward diffusion transitions produce a series of gradually noisy samples $\mathbf{x}_1, \cdots, \mathbf{x}_T$. Each noisy sample has the exactly same dimension as the original data point $\mathbf{x}_0$.

Under the Markovian assumption, the Gaussian noise is gradually added to examples from previous timestep, with the variance schedule $\{\beta_t \in (0, 1) \}_{t=1}^T$. Given a large number of $T \rightarrow \infty$, $\mathbf{x}_T$ can ideally be an isotropic Gaussian noise.

Let $\alpha_t = 1 - \beta_t$, the linear Gaussian model in the forward process is rewritten as:

Under the reparameterization trick, samples $\mathbf{x}_t \sim q (\mathbf{x}_t | \mathbf{x}_{t-1})$ can be rewritten as:

In similar vein, samples $\mathbf{x}_{t-1}$ can be rewritten as:

Let $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$. Usually, the update step gets larger as the timestep increases, i.e., $\beta_1 < \beta_2 < \cdots < \beta_T$ and thus $\bar{\alpha}_1 > \bar{\alpha}_2 > \cdots \bar{\alpha}_T$.

Suppose we have $2T$ random noise variables $\{ \pmb{\epsilon}_t, \bar{\pmb{\epsilon}}_t \}_{t=1}^T \overset{\text{i.i.d}}{\sim} \mathcal{N} (\pmb{\epsilon}; \mathbf{0},\mathbf{I})$.

For an arbitrary sample $\mathbf{x}_t \sim q(\mathbf{x}_t | \mathbf{x}_0)$, we have:

Therefore, the linear Gaussian form is derived as: $q(\mathbf{x}_t | \mathbf{x}_0) \sim \mathcal{N} (\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1-\bar{\alpha}_t) \mathbf{I})$.

## Reverse process

The reverse diffusion process, with the form $p_\theta(\mathbf{x}_0) := \int p_\theta (\mathbf{x}_{0:T} d \mathbf{x}_{1:T})$, learns the reversal of diffusion process by gradually denoising from timestep T to 1. The reverse process is defined as a Markov chain with learned Gaussian transitions starting at $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$:

Therefore, we can derive the Gussian form of both $q(\mathbf{x}_t | \mathbf{x}_0)$ and $q(\mathbf{x}_{t-1} | \mathbf{x}_0)$. Using Bayes rule, we have:

In each timestep, $\mathbf{x}_{t-1} \sim q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$ follows the Gaussian distribution. The mean $\pmb{\mu}(\mathbf{x}_t, \mathbf{x}_0)$ is a function of $\mathbf{x}_t$ and $\mathbf{x}_0$, and $\pmb{\Sigma}_q(t)$ is a function of $\alpha$ coefficient (either as hyperparameter or learned with neural networks). The variance can be formulated as: $\pmb{\Sigma}_q (t) = \sigma^2_q (t) \mathbf{I}$, where $\sigma^2=\frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}$.

Since $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$ does not condition on $\mathbf{x}_0$, we thus optimize the KL divergence between the means of two Gaussians:

Given Eq.$\eqref{mu}$, we have:

Therefore, Eq.$\eqref{kl}$ can be rewritten as:

Intuitive understanding towards the diffusion process[1][6].

1. The reconstruction term corresponds to the first-step optimization.
2. The prior matching term does not contain trainable parameters, requiring no optimization.
3. The consistency term makes the denoising process at timestep $t$ match the corresponding diffusion step from a cleaner input.

The ELBO objective is thus approximated across all noise levels over the expection of all timesteps.

## Training

The ELBO objective can be derived as [7]

The training of diffusion process can be implemented by learning a neural network to predict either of following three formats (given a arbitrary noised version $\mathbf{x}_t$):

1. The original natural input $\mathbf{x}_0$. See Eq.$\eqref{loss_mu}$.[6] empirically finds it leads to worse sampling quality early.
2. The source noise $\pmb{\epsilon}_0$ ($\pmb{\epsilon}$-prediction parameterization). [2]
3. The score of input at an arbitrary noise level $\nabla \log p(\mathbf{x}_t)$. [8]
4. The velocity of diffusion latents $\mathbf{x}_t$. [19]

### $\pmb{\epsilon}$-prediction parameterization

We arrange the Eq.$\eqref{noise_process}$ as:

Plugging Eq.$\eqref{denoise}$ into the denoising transition mean in Eq.$\eqref{mu_q}$, we have:

Similarly, the approximate denoising transition mean $\hat{\pmb{\epsilon}}_\theta (\mathbf{x}_t, t)$ is:

Plugging the Eq.$\eqref{eps_q}$ and $\eqref{eps_theta}$ into Eq.$\eqref{kl}$, we can write:

Simplified objective: [2] empirically find it better to remove the weighting term in Eq.$\eqref{loss_noise}$:

The training objective resembles denoising score matching over multiple noise scales indexed by $t$. It can be treated as using variational inference to fit the finite-time marginal of a sampling chain resembling Langevin dynamics.

The overall DDPM training algorithm is:

The sampling process resembles Langevin dynamics with $\pmb{\epsilon}_\theta$ as a learned gradient of the data density.

### Velocity prediction

[19] propose to parameterize the diffusion velocity by predicting the velocity of diffusion latents, by predicting $\mathbf{v} \equiv \alpha_t \epsilon - \sigma_t \mathbf{x}$, which gives $\hat{\mathbf{x}} = \alpha_t \mathbf{z}_t - \sigma_t \hat{\mathbf{x}}_\theta (\mathbf{z}_t)$.

Let $\phi_t = \arctan (\sigma_t / \alpha_t)$, assumming a variance preserving diffusion process, we have $\alpha_\phi = \cos (\phi), \sigma_\phi = \sin (\phi)$, and hence $\mathbf{z}_\phi = \cos (\phi) \mathbf{x} + \sin (\phi) \epsilon$.

[19] thus define the velocity of $\mathbf{z}_\phi$ as:

By rearranging the $\epsilon$, $\mathbf{x}$, $\mathbf{v}$, we then get:

We also get $\epsilon = \sin (\phi) \mathbf{z}_\phi + \cos (\phi)\mathbf{v}_\phi$.

The predicted velocity is defined as:

where $\hat{\epsilon}_\theta (\mathbf{z}_\phi) = (\mathbf{z}_\phi - \cos(\phi)\hat{\mathbf{x}}_\theta (\mathbf{z}_\phi) ) / \sin(\phi)$.

Following algorithm illustrates the complete training process:

## Conditional Generation

For conditional generation, it includes classifier-guided or classifier-free methods. The distinct difference is the existence of an extra classifier for condition guidance.

### Classifier Guidance

[4] utilized a trained classifier $f_\phi (y \vert \mathbf{x}_t,t)$ on noisy image $\mathbf{x}_t$ to obtain the gradients towards input $\nabla_\mathbf{x} \log f_\phi (y \vert \mathbf{x}_t)$ to guide the sampling process using the condition $y$, such as the target class label.

Given a Gaussian $\mathbf{x} \sim \mathcal{N}(\pmb{\mu}, \pmb{\sigma}^2\mathbf{I})$, the log derivative of the density function[8] is:

Given Eq.$\eqref{noise_process}$, we have:

The score function for the joint distribution $q (\mathbf{x}_t, y)$ is:

The classifier-guided predictor $\bar{\pmb{\epsilon}}_\theta$ thus obtains a truncation-like effect by sampling in the direction of the gradient of image classifier to perform conditional generation:

Classifier guided prediction [4] uses a weight factor $w$ to contrail the shifted gradient:

### Classifier-Free Guidance

Classifier guidiance introduces an auxiliary classifier and thus complicates the training process. It is naturally to think about the approach of conditional generation without any explicit classifier $f_\phi$ entirely. Instead of sampling in the direction of the gradient of image classifier, [5] proposes to combine the score estimates of a conditional diffusion model $p_\theta (\mathbf{x}|y)$ and a jointly trained unconditional model $p_\theta (\mathbf{x})$ via a single model.

Specifically, when training conditional diffusion $p_\theta (\mathbf{x}|y)$ parameterized by the score estimator $\pmb{\epsilon}_\theta (\mathbf{x}_t, t, y)$, [5] randomly gets rid of the conditions by setting $y=\emptyset$, that is $\pmb{\epsilon}_\theta (\mathbf{x}_t, t) = \pmb{\epsilon}_\theta (\mathbf{x}_t, t, \emptyset)$

The gradient of an implicit classifier can be formulated with the difference between conditional and unconditional classifiers:

Plugging into the Eq.$\eqref{classifier_guidance}$, the score estimator will be:

# Categorical Diffusion (Discrete)

Gaussian diffusion process focuses on continuous state space, such as real-valued image and waveform data. There has been research trials by applying the Gaussian diffusion into categorical data, which requires relaxing or embedding discrete data into continuous spaces. A more natural way is to use categorical diffusion that corrupts the categorical data such as language in discrete state spaces.

[1] firstly introduces the diffusion models with discrete state spaces over binary random variables. [9] extended the model class to categorical random variables with transition matrices characterized by uniform transition probabilities. [10] introduces discrete denoising diffusion probabilistic models (D3PM) by more generally extending the state corruption process.

## Discrete Diffusion (D3PM)

For scalar discrete random variables with $K$ categories $x_t, x_{t-1} \in 1,\cdots, K$, the forward transition probability can be represented by matrices: $[\mathbf{Q}_t]_{ij} = q (x_t = j \vert x_{t-1}=i)$.

Denoting the one hot version of $x$ with the row vector $\mathbf{x}$ , a categorical distribution $\text{Cat} (\mathbf{x}, \mathbf{p})$ over the one-hot row vector $\mathbf{x}$ with probabilities given by the row vector $\mathbf{p}$, we can write:

The term $\mathbf{x}_{t-1}\mathbf{Q}_t$ can be understood as a row vector-matrix product. $\mathbf{Q}$ is assumed to apply to each image pixel or sequence token independently. $q$ factorizes over the higher dimensions. Thus we write $q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$ w.r.t a single element.

## Discrete state spaces

Starting from $\mathbf{x}_0$, the $t$-step marginal at time $t-1$:

The posterior is:

Assuming that the reverse process $p_\theta (\mathbf{x}_t \vert \mathbf{x}_{t-1})$ is factorized as conditionally independent over all the elements, the KL divergence between $q$ and $p_\theta$ is summing over all values of each random variable.

### Forward Markov transition matrices

1. Uniform[9]. Given $\beta_t \in [0,1]$, the transition matrix $\mathbf{Q}_t = (1-\beta_t)\mathbf{I} + \frac{\beta_t}{K} \mathbb{1}\mathbb{1}^\top$.
2. Absorbing state. Define transition matrix with an absorbing state (called [MASK]), such that each token either stays the same or transitions to [MASK] with some probability $\beta_t$. This is motivated by BERT. For images, it reuses the grey pixels as the [MASK] absorbing token.
3. Discretized Gaussian. [9] uses a discretized, truncated Gaussian distribution for ordinal data such as images.
4. Token embedding distance. [9] uses similarity in an embedding space to guide the forward process, so that the transitions become more frequently between tokens that have simialr embeddings, , while maintaining a uniform stationary distribution.

### Training

BERT is a one-step diffusion model. For a one-step diffusion process in which $q(\mathbf{x}_1 \vert \mathbf{x}_0)$ replaces 10% of tokens with [MASK] and 5% uniformly at random. We have:

Autoregressive models are (discrete) diffusion models. Consider a diffusion process taht deterministically masks tokens one-by-one in a sequence of length $T$:

For the position $i \neq T-t$, the KL divergence

Therefore, the KL divergence is computed over the tokens at position $i$, which is exactly the standard cross entropy loss for an autoregressive model.

(Generative) Maskde Language-Models are diffusion models. Generated MLMs[15][16] are generative models that generate text from a sequence of [MASK] tokens.

# References

1. 1.Sohl-Dickstein, Jascha, et al. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. ICML 2015
2. 2.Ho, Jonathan, et al. Denoising Diffusion Probabilistic Models. arXiv:2006.11239, arXiv, 16 Dec. 2020
3. 3.Nichol, Alex, and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models. arXiv:2102.09672, arXiv, 18 Feb. 2021
4. 4.Dhariwal, Prafulla, and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. arXiv:2105.05233, arXiv, 1 June 2021
5. 5.Ho, Jonathan, and Tim Salimans. Classifier-Free Diffusion Guidance. 2021. openreview.net
6. 6.Luo, C. (2022). Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970.
7. 7.Weng, Lilian. (Jul 2021). What are diffusion models? Lil’Log. https://lilianweng.github.io/posts/2021-07-11-diffusion-models/.
8. 8.Yang Song & Stefano Ermon. Generative modeling by estimating gradients of the data distribution. NeurIPS 2019.
9. 9.Hoogeboom, E., Nielsen, D., Jaini, P., Forré, P. and Welling, M. Argmax flows and multinomial diffusion: Learning categorical distributions. NeurIPS 2021.
10. 10.Austin, J., Johnson, D.D., Ho, J., Tarlow, D. and van den Berg, R., 2021. Structured denoising diffusion models in discrete state-spaces. NeurIPS 2021.
11. 11.Li, X.L., Thickstun, J., Gulrajani, I., Liang, P. and Hashimoto, T.B., 2022. Diffusion-LM Improves Controllable Text Generation. arXiv preprint arXiv:2205.14217.
12. 12.Gong, S., Li, M., Feng, J., Wu, Z. and Kong, L., 2022. Diffuseq: Sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933.
13. 13.Lin, Z., Gong, Y., Shen, Y., Wu, T., Fan, Z., Lin, C., Chen, W. and Duan, N., 2022. GENIE: Large Scale Pre-training for Text Generation with Diffusion Model. arXiv preprint arXiv:2212.11685.
14. 14.He, Z., Sun, T., Wang, K., Huang, X. and Qiu, X., 2022. DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models. arXiv preprint arXiv:2211.15029.
15. 15.Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-Predict: Parallel decoding of conditional masked language models. arXiv preprint arXiv:1904.09324, April 2019.
16. 16.Alex Wang and Kyunghyun Cho. BERT has a mouth, and it must speak: BERT as a markov random field language model. arXiv preprint arXiv:1902.04094, February 2019.
17. 17.Sergios Karagiannakos,Nikolas Adaloglou. How diffusion models work: the math from scratch. AI Summer. September 2022.
18. 18.The Annotated Diffusion Model. Huggingface Blog. June 2022.
19. 19.Salimans, Tim and Jonathan Ho. Progressive Distillation for Fast Sampling of Diffusion Models. ICLR 2022.