Flow models are used to learn **continuous data**.

**Discrete autoregressive models** tractably learn the joint distribution by decomposing it into a product of conditionals using the *probability chain rule* according to a fixed ordering over dimensions.

The sequential nature limits its computational efficiency:

- its sampling procedure is sequential and non-parallelizable;
- there is no natural latent representation associated with autoregressive models.

To sum up:

- Pros:
- fast evaluation of p(x)
- great compression performance
- good samples with carefully designed dependency structure

- Cons:
- slow sampling (without significant engineering)
- discrete data only

**Flows explanation**

Ideas: instead of using a pdf over $\mathbf{x}$, work with a flow from $\mathbf{x}$ to $\mathbf{z}$

**Flows** in 1D

**Flow from $\mathbf{x}$ to $\mathbf{z}$ **: an invertible differentiable mapping $f$ from $\mathbf{x}$ (data) to $\mathbf{z}$ (noise):

**Training**

- $\mathbf{x}$ to $\mathbf{z}$: transform the data distribution into a
**base distribution p(z)**- common choices: uniform, std Gaussian

- Training the CDF is equivalent to the PDF:

**Ancestral sampling** (a.k.a *forward sampling*):

- $\mathbf{z}$ to $\mathbf{x}$: mapping $\mathbf{z} \sim p(\mathbf{z})$ through the flow’s inverse will yield the data distribution $\mathbf{x}=f_\theta^{-1}(\mathbf{z})$

^{[3]}

*Change of variables*:

**Fitting flows**:

- fit with MLE:
- is the density of $x$ under the sampling process and is calculated using
*Change of variables*.

CDF flows:

- Recover the original objective with CDF flows:
- Flows can be more general than CDFs.

**Flows** in high dimensions

For high-dimensional data, $x$ and $z$ have the same dimension.

^{[3]}

*Change of variables*: For $z \sim p(z)$, ancestral sampling process f^{-1} linearly transforms a small cube $dz$ to a small parallelepiped $dx$.

*Intuition*: $x$ is likely if it maps to a “large” region in $z$ space.

### Training

Train with MLE:

where the Jacobian determinant must be easy to calculate and differentiate!

### Constructing flows: composition

- Fows
^{[5]}can be composed:

### Affine flows (multivariate Gaussian)

Affine flows (multivariate Gaussian)

- Parameters: an invertible matrix $A$ and a vector $b$
- $f(x) = A^{-1} (x-b)$

**Sampling**: $x= Az + b$, where $z \sim \mathcal{N}(0,1)$

Log likelihood is expensive when dimension is large:

- the Jacobian of $f$ is $A^{-1}$
- Log likelihood involves calculating $\text{det}(A)$

### Elementwise flows

- flexible: it can use elementwise affine functions or CDF flows

The Jacobian is diaganal, so the determinant is easy to evaluate:

### More flow types

Coupling layers (NICE^{[1]}/RealNVP^{[2]}), directed graphical models, invertible 1x1 convs^{[9]}, FFJORD

# NICE

*A good representation is one in which the distribution of the data is easy to model.*

Consider the problem of learning a pdf from a parametric family of densities over finite dataset $\mathcal{D}$ of $N$ samples, of each in a space $\chi$, typically $\chi=\mathbb{R}^D$

**N**on-linear **I**ndependent **C**omponent **E**stimation (NICE)^{[1]} learns a non-linear deterministic transformation $h=f(\mathbf{x})$ to map the input $\mathbf{x}$ into the latent space with a factorized distribution, i.e. independent latent variables :

Apply the *change of variables* $h=f(x)$, assuming that $f$ is invertible and differentiable, the dimension of $h$ and $x$ are the same in order to fit the distribution :

where is the Jocobian matrix of function $f$ at $x$.

where , the *prior distribution*, is a predefined PDF, e.g. isotropic Gaussian. If the prior distribution is factorial (i.e. with independent dimensions), then we have the deterministic transform of a factorial distribution:

where

**NICE** can be viewed as an *invertible preprocessing transform* of the dataset.

- Invertible preprocessings can increase likelihood arbitrarily simply by contracting the data. The factorized prior encourages to discover meaningful structures in the dataset.

## Coupling layer

Given $x \in \chi$, , are partitions of $[1,D]$ such that $d=|I_1|$ and $m$ is a function. Define where:

where g: is the coupling law. If and , the Jacobian is:

where is the identity matrix of size $d$, thus

The inverted mapping is:

### Additive coupling layer

NICE applies so that taking and :

# RealNVP

Real-valued non-volume preserving (real NVP)

**Affine coupling layer**

- Given a $D$ dimensional input $x$ and $d<D$, the output $y$ of an affine coupling layer follows:where $s$ and $t$ represent
*scale*and*translation*functions from $\mathbb{R}^{d} \rightarrow \mathbb{R}^{D-d}$.

The Jocobian is:

The determinant can be computed as:

## Masked convolution

Partitioning adopts the binary mask $b$:

where $s(\cdot)$ and $t(\cdot)$ are rectified CNNs.

Two partitioning are applied:

- spatial checkerboard patterns
- channel-wise masking

# Autoregressive flows

Construct flows from directed acyclic graphs.

**Autoregressive flows**:

- The sampling process of a Bayes net is a flow:
- if autoregressive, called
*autoregressive flow*:

- if autoregressive, called
- Sampling is an
**invertible**mapping from $z$ to $x$ - The DAG structure causes the Jacobian to be
*triangular*, when variables are ordered by topological sort.

**How to fit autoregressive flow*?

- map $\mathbf{x}$ to $\mathbf{z}$
- fully parallelizable

## Variational lossy autoencoder (VLAE)

Variational Lossy Autoencoder (VLAE)^{[6]} force the global latent code to discard irrelevant information and thus encode the data in a lossy fassion.

For an **autoregressive flow** $f$, some continuous noise source $\epsilon$ is transformed into latent code $\mathbf{z}$: $\mathbf{z} = f(\epsilon)$. Assuming the density function for noise source is $u(\epsilon)$, then $\log p(\mathbf{z}) = \log u(\epsilon) + \log \text{det}\frac{d \epsilon}{d \mathbf{z}}$

## Masked autoregressive flow (MAF)

Considering an AR model whose conditionals are parameterized as single Gaussians, i.e., the $i$-th conditional is given by:

where and are unconstrained scalar functions that compute the mean and log standard deviation of the $i$-th conditional given all previous variables

Thus the data can be generated by:

- The vector of random numbers (with
`randn()`

) is used to generate data.

Given data point $\mathbf{x}$, the random number vector $\mathbf{u}$ is:

- where and .

Due to the AR structure, the Jacobian of $f^{-1}$ is triangular by design, hence:

- where .
- This can be equivalently interpreted as a normalizing flow.

**Masking**

- The functions are with masking, like MADE.

It is derived that MAF and IAF has the theoretical equivalence (See ^{[7]}).

# Inverse autoregressive flow (IAF)

Inverse autoregressive flow (IAF)^{[8]} is the inverse of an AR flow:

- $\mathbf{x} \rightarrow \mathbf{z}$ has the same structure as the
**sampling**in an AR model - has the same structure as
**loglikelihood**computation of an AR model. Thus,**IAF sampling is fast**

IAF approximates the posterior $q(\mathbf{z} \vert \mathbf{x})$ with the chain. At $t$-th step of the flow, IAFapply the AR NN with inputs $\mathbf{z}_{t-1}$ and $\mathbf{h}$, and outputs and . (See the figure)

The density at $T$-th step:

The output of autogressive NN:

- where

## Pseudocode

Given:

- $\mathbf{x}$: a datapoint
- $\mathbf{\theta}$: NN parameters
- $\textrm{EncoderNN}(\mathbf{x}; \mathbf{\theta})$: encoder network, with output $\mathbf{h}$
- $\text{AutoregressiveNN}[*](\mathbf{z,h;\theta})$: autoregressive networks, with input $\mathbf{h}$
- let $l$ denote the scalar value of $\log q(\mathbf{z} \vert \mathbf{x})$, evaluated at sample $\mathbf{z}$

**IAF algorithm**:

- $[\mathbf{\mu},\sigma, \mathbf{h}] \leftarrow$ EncoderNN$(\mathbf{x; \theta})$
- $\mathbf{\epsilon} \sim \mathcal{N}(0, I)$
- $\mathbf{z} \leftarrow \sigma \odot \epsilon + \mu$
- $l \leftarrow -\text{ sum}(\log \sigma + \frac{1}{2}\epsilon^2) + \frac{1}{2}\log(2\pi)$
- for $t = {1,\cdots, T}$:
- $[\mathbf{m}, \mathbf{s}] \rightarrow \text{AutoregressiveNN}[t](\mathbf{z,h;\theta})$
- $\sigma \leftarrow \text{sigmoid}(\mathbf{s})$
- $\mathbf{z} \leftarrow \sigma \odot \mathbf{z} + (1-\sigma) \odot \mathbf{m}$
- $l \leftarrow -\text{sum}(\log \sigma) + l$

## AF vs IAF

**AF**:

- Fast evaluation of $p(x)$ for arbitrary $x$
- Slow sampling

**IAF**:

- Slow evaluation of $p(x)$ for arbitrary $x$, so training directly by MLE is slow
- Fast sampling
- Fast evaluation of $p(x)$ if $x$ is a sample

Parallel WaveNet, IAF-VAE exploit IAF’s fast sampling.

# Glow

Glow^{[9]}

- Activation norm
- Invertible 1x1 convolution
- Affine coupling layers

# Flow++

Flow++^{[10]}

- Dequantization via VI
- Improved coupling layers
- Apply CDF for a mixture of $K$ logistics
- stacking conv and multi-head self-attention

# References

- 1.Dinh, L., Krueger, D., & Bengio, Y. (2014). NICE: Non-linear Independent Components Estimation. CoRR, abs/1410.8516. ↩
- 2.Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using Real NVP. ArXiv, abs/1605.08803. ↩
- 3.CS294-158 Lecture 2c+3a slides ↩
- 4.CS228 notes: sampling methods ↩
- 5.Rezende, D.J., & Mohamed, S. (2015). Variational Inference with Normalizing Flows. ArXiv, abs/1505.05770. ↩
- 6.Chen, X., Kingma, D.P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., & Abbeel, P. (2016). Variational Lossy Autoencoder. ArXiv, abs/1611.02731. ↩
- 7.Papamakarios, G., Murray, I., & Pavlakou, T. (2017). Masked Autoregressive Flow for Density Estimation. NIPS. ↩
- 8.Kingma, D.P., Salimans, T., & Welling, M. (2017). Improved Variational Inference with Inverse Autoregressive Flow. ArXiv, abs/1606.04934. ↩
- 9.Kingma, D.P., & Dhariwal, P. (2018). Glow: Generative Flow with Invertible 1x1 Convolutions. NeurIPS. ↩
- 10.Ho, J., Chen, X., Srinivas, A., Duan, Y., & Abbeel, P. (2019). Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design. ICML. ↩
- 11.Müller, T., McWilliams, B., Rousselle, F., Gross, M., & Novák, J. (2018). Neural Importance Sampling. ACM Trans. Graph., 38, 145:1-145:19. ↩
- 12.Grathwohl, W., Chen, R.T., Bettencourt, J., Sutskever, I., & Duvenaud, D. (2018). FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models. ArXiv, abs/1810.01367. ↩
- 13.Huang, C., Krueger, D., Lacoste, A., & Courville, A.C. (2018). Neural Autoregressive Flows. ArXiv, abs/1804.00779. ↩
- 14.Oord, A.V., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G.V., Lockhart, E., Cobo, L.C., Stimberg, F., Casagrande, N., Grewe, D., Noury, S., Dieleman, S., Elsen, E., Kalchbrenner, N., Zen, H., Graves, A., King, H., Walters, T., Belov, D., & Hassabis, D. (2017). Parallel WaveNet: Fast High-Fidelity Speech Synthesis. ArXiv, abs/1711.10433. ↩