This is a concise introduction to the Variational Autoencoder (VAE).

# Background

• PixelCNN defines a tractable density function and directly maximizes the likelihood of the training data:

$$p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \vert x_1, \dots, x_{i-1})$$

• VAEs define an intractable density function with latent variable $\mathbf{z}$:

$$p_\theta(x) = \int p_\theta(z)\, p_\theta(x \vert z)\, dz$$

This integral cannot be optimized directly; instead, VAEs derive and optimize a lower bound on the likelihood.

## Autoencoder

Autoencoders (AEs) encode inputs into a lower-dimensional latent representation $\mathbf{z}$ that captures meaningful factors of variation in the data, then use $\mathbf{z}$ to reconstruct the original input.

• After training, throw away the decoder and retain only the encoder.
• The encoder can then be used to initialize a supervised model on downstream tasks.
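As a minimal sketch of the autoencoder idea (not from the original text): under squared-error loss, the optimal *linear* autoencoder is equivalent to PCA, so we can "encode" by projecting onto the top principal components and "decode" by mapping back. All names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in R^5 that really live near a 2-D subspace.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
x = latent @ mixing + 0.01 * rng.normal(size=(200, 5))
x = x - x.mean(axis=0)  # center the data

# Linear "encoder"/"decoder": top-k right singular vectors (PCA).
k = 2
_, _, vt = np.linalg.svd(x, full_matrices=False)

def encode(x):
    return x @ vt[:k].T      # R^5 -> R^2 latent code z

def decode(z):
    return z @ vt[:k]        # R^2 -> R^5 reconstruction

z = encode(x)
x_hat = decode(z)
mse = np.mean((x - x_hat) ** 2)
print(f"latent shape: {z.shape}, reconstruction MSE: {mse:.6f}")
```

Because the data is nearly rank-2, a 2-D bottleneck reconstructs it almost perfectly; the residual is just the injected noise.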

# Variational Autoencoder

Assume training data $\{ \mathbf{x}^{(i)}\}_{i=1}^N$ is generated from an underlying unobserved (latent) representation $\mathbf{z}$.

Intuition:

• $\mathbf{x}$ -> image
• $\mathbf{z}$ -> latent factors used to generate $\mathbf{x}$: attributes, orientation, pose, how much smile, etc. Choose prior $p(z)$ to be simple, e.g. Gaussian.

## Training

### Problems

The integral makes maximum-likelihood estimation on the training data intractable:

$$p_\theta(x) = \int \color{red}{p_\theta(z)\, p_\theta(x \vert z)\, dz}$$

It is intractable to compute $p(x \vert z)$ for every $z$, i.e., to evaluate the integral. The intractability is marked in red.

Consequently, the posterior density is also intractable, because it contains the intractable data likelihood:

$$p_\theta(z \vert x) = \frac{p_\theta(x \vert z)\, p_\theta(z)}{\color{red}{p_\theta(x)}}$$

*Figure: VAE decoder [5]*

### Solutions

Encoder -> “recognition / inference” network.

• Define an encoder network $q_\phi (z \vert x)$ that approximates the intractable true posterior $p_\theta(z \vert x)$. The VAE makes the variational approximate posterior a multivariate Gaussian with diagonal covariance for data point $\mathbf{x}^{(i)}$:

$$\log q_\phi \big( \mathbf{z} \vert \mathbf{x}^{(i)} \big) = \log \mathcal{N} \big( \mathbf{z};\, \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{2(i)} \mathbf{I} \big)$$

where the mean $\boldsymbol{\mu}^{(i)}$ and variance $\boldsymbol{\sigma}^{2(i)}$ are outputs of the encoder network.

• For a Gaussian MLP encoder or decoder[4], the mean and log-variance are computed from a shared hidden layer:

$$\boldsymbol{\mu} = \mathbf{W}_\mu \mathbf{h} + \mathbf{b}_\mu, \qquad \log \boldsymbol{\sigma}^2 = \mathbf{W}_\sigma \mathbf{h} + \mathbf{b}_\sigma, \qquad \mathbf{h} = \tanh(\mathbf{W}_h \mathbf{x} + \mathbf{b}_h)$$

The network models $\log \sigma^2$ instead of $\sigma^2$ because $\log \sigma^2 \in (-\infty, \infty)$ can be output without any constraint, whereas $\sigma^2 \geq 0$ would require constraining the network output.
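A small numerical sketch of this point (all layer names hypothetical): the raw network output for the log-variance can be any real number, and exponentiating it always yields a valid positive variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-hidden-layer Gaussian MLP encoder head.
W_h, b_h = rng.normal(size=(4, 8)), np.zeros(8)
W_mu, b_mu = rng.normal(size=(8, 2)), np.zeros(2)
W_ls, b_ls = rng.normal(size=(8, 2)), np.zeros(2)

def encode(x):
    h = np.tanh(x @ W_h + b_h)
    mu = h @ W_mu + b_mu
    log_var = h @ W_ls + b_ls   # unconstrained: any real value is valid
    return mu, log_var

x = rng.normal(size=(16, 4))
mu, log_var = encode(x)
sigma2 = np.exp(log_var)        # exp maps (-inf, inf) -> (0, inf)
print(bool(np.all(sigma2 > 0)))  # variance is always strictly positive
```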

Decoder -> “generation” network $p_\theta (x \vert z)$.

The data likelihood decomposes as:

$$\log p_\theta \big( x^{(i)} \big) = \underbrace{\mathbb{E}_z \Big[ \log \color{green}{p_\theta \big( x^{(i)} \vert z \big)} \Big] - \mathbb{KL} \big( \color{blue}{q_\phi (z \vert x^{(i)})} \,\|\, \color{blue}{p_\theta (z)} \big)}_{\mathcal{L}(x^{(i)}, \theta, \phi)} + \mathbb{KL} \big( q_\phi (z \vert x^{(i)}) \,\|\, p_\theta (z \vert x^{(i)}) \big)$$

• The first two RHS terms form the tractable lower bound $\mathcal{L}(x^{(i)}, \theta, \phi)$, wherein $p_\theta (x \vert z)$ and the $\mathbb{KL}$ term are differentiable; the last $\mathbb{KL}$ term is intractable but always $\geq 0$.
• Thus, the variational lower bound (ELBO) is derived:

$$\log p_\theta \big( x^{(i)} \big) \geq \mathcal{L}(x^{(i)}, \theta, \phi)$$

• Training: maximize the lower bound

$$\theta^*, \phi^* = \arg\max_{\theta, \phi} \sum_{i=1}^{N} \mathcal{L}(x^{(i)}, \theta, \phi)$$

where

• the first term $\mathbb{E}_z \bigg[ \log \color{green}{p_\theta (x^{(i)} \vert z)} \bigg]$ reconstructs the input data; it is a negative reconstruction error.
• the second term $\mathbb{KL} \big( \color{blue}{q_\phi (z \vert x^{(i)})} \| \color{blue}{p_\theta (z)} \big)$ makes the approximate posterior close to the prior; it acts as a regularizer.

The derived estimator when using an isotropic multivariate Gaussian prior $p_\theta(\mathbf{z})=\mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I})$, for which the KL term has a closed form:

$$\mathcal{L} \big( x^{(i)}, \theta, \phi \big) \simeq \frac{1}{2} \sum_{j=1}^{J} \Big( 1 + \log \big( \sigma_j^{(i)} \big)^2 - \big( \mu_j^{(i)} \big)^2 - \big( \sigma_j^{(i)} \big)^2 \Big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta \big( x^{(i)} \vert z^{(i,l)} \big)$$

where $\mathbf{z}^{(i,l)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)}$ and $\epsilon^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
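The closed-form KL term can be sanity-checked against a Monte Carlo estimate of $\mathbb{KL}\big(\mathcal{N}(\mu, \sigma^2 \mathbf{I}) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\big)$; a sketch with arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.2])
log_var = np.array([0.1, -0.3, 0.4])
sigma = np.exp(0.5 * log_var)

# Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.
kl_closed = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

# Monte Carlo estimate: E_q[ log q(z) - log p(z) ] with reparameterized z.
eps = rng.normal(size=(200_000, 3))
z = mu + sigma * eps
log_q = -0.5 * np.sum(np.log(2 * np.pi) + log_var + eps**2, axis=1)
log_p = -0.5 * np.sum(np.log(2 * np.pi) + z**2, axis=1)
kl_mc = np.mean(log_q - log_p)

print(f"closed form: {kl_closed:.4f}, Monte Carlo: {kl_mc:.4f}")
```

The two values agree to a few decimal places, confirming the sign convention: the estimator above adds the *negative* KL, so maximizing the ELBO shrinks this divergence.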

### Reparameterization trick

Given the deterministic mapping $\mathbf{z} = g_\phi (\epsilon, x)$ with auxiliary noise $\epsilon \sim p(\epsilon)$, we know that

$$q_\phi (z \vert x) \prod_i dz_i = p(\epsilon) \prod_i d\epsilon_i$$

Thus,

$$\int q_\phi (z \vert x) f(z)\, dz = \int p(\epsilon) f \big( g_\phi (\epsilon, x) \big)\, d\epsilon \simeq \frac{1}{L} \sum_{l=1}^{L} f \big( g_\phi (\epsilon^{(l)}, x) \big)$$

Take the univariate Gaussian case for example: for $z \sim q_\phi(z \vert x) = \mathcal{N}(\mu, \sigma^2)$, a valid reparameterization is $z = \mu + \sigma \epsilon$, where the auxiliary noise variable $\epsilon \sim \mathcal{N}(0,1)$.
Thus,

$$\mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)} \big[ f(z) \big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, 1)} \big[ f(\mu + \sigma \epsilon) \big] \simeq \frac{1}{L} \sum_{l=1}^{L} f \big( \mu + \sigma \epsilon^{(l)} \big)$$

so the sampling step no longer blocks gradients with respect to $\mu$ and $\sigma$.
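A quick numerical check of the trick (a sketch, with $f$ chosen for illustration): for $f(z) = z^2$ and $z \sim \mathcal{N}(\mu, \sigma^2)$, we have $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, so $\partial \mathbb{E}[f] / \partial \mu = 2\mu$; differentiating through $z = \mu + \sigma \epsilon$ recovers this pathwise.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7

# Reparameterize: z = mu + sigma * eps, with eps ~ N(0, 1).
eps = rng.normal(size=500_000)
z = mu + sigma * eps

# f(z) = z^2; the pathwise derivative is df/dmu = 2z, since dz/dmu = 1.
grad_mc = np.mean(2 * z)        # Monte Carlo estimate of dE[f]/dmu
grad_true = 2 * mu              # analytic gradient of mu^2 + sigma^2

print(f"MC gradient: {grad_mc:.4f}, analytic: {grad_true:.4f}")
```

Without the reparameterization, the expectation is over a distribution that itself depends on $\mu$, and a naive sample-then-differentiate approach gives no gradient signal.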

## Generation

• After training, remove the encoder network and use the decoder network to generate new data.
• Sample $z$ from the prior as the input!
• Diagonal prior on $z$ -> independent latent variables!

• Different dimensions of $z$ encode interpretable factors of variation.
• Good feature representation that can be computed using $q_\phi (z \vert x)$.
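Generation then reduces to sampling from the prior and decoding. A sketch of the pipeline with a hypothetical decoder whose weights are random here (in practice they come from training):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical decoder: these weights would normally be learned.
latent_dim, hidden, out_dim = 2, 16, 8
W1, b1 = rng.normal(size=(latent_dim, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, out_dim)), np.zeros(out_dim)

def decode(z):
    """Mean of p_theta(x|z) for a Bernoulli MLP decoder (sigmoid output)."""
    h = np.tanh(z @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

# Sample z from the diagonal Gaussian prior N(0, I) -- no encoder needed.
z = rng.normal(size=(5, latent_dim))
x_gen = decode(z)
print(x_gen.shape)   # five generated samples, each with out_dim values in (0, 1)
```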

## Pros & cons

• Probabilistic spin on traditional autoencoders => allows generating data
• Defines an intractable density => derive and optimize a (variational) lower bound

Pros:

• Principled approach to generative models
• Allows inference of $q(z \vert x)$, can be useful feature representation for downstream tasks

Cons:

• Maximizes a lower bound of the likelihood: okay, but not as direct an evaluation as PixelRNN / PixelCNN
• Lower sample quality compared to the state of the art (GANs)

# References

3. Goodfellow, I. (2016). Tutorial: Generative Adversarial Networks. NIPS.
4. Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.
5. Doersch, C. (2016). Tutorial on Variational Autoencoders. arXiv preprint arXiv:1606.05908.