This is a concise introduction to the Variational Autoencoder (VAE).

# Background

PixelCNN defines a tractable density function and directly maximizes the likelihood of the training data:

$$p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \vert x_1, \ldots, x_{i-1})$$

VAEs define an intractable density function with latent variable $\mathbf{z}$:

$$p_\theta(x) = \int p_\theta(z) \, p_\theta(x \vert z) \, dz$$

This integral cannot be optimized directly; VAEs instead derive and optimize a *lower bound* on the likelihood.

## Autoencoder

An autoencoder (AE) encodes the input into a lower-dimensional latent representation $\mathbf{z}$ that captures meaningful factors of variation in the data, then uses $\mathbf{z}$ to reconstruct the original input.

- After training, throw away the decoder and **only retain the encoder**.
- The encoder can be used to initialize a supervised model on downstream tasks.
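
A minimal sketch of this encode/decode pipeline in PyTorch (the layer sizes, 784-dimensional input, and MSE reconstruction loss are illustrative assumptions, not taken from a specific paper):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Plain autoencoder: compress x into a low-dimensional z, then reconstruct."""
    def __init__(self, in_dim=784, z_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, z_dim))
        self.decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.encoder(x)      # latent representation (dimension reduction)
        return self.decoder(z)   # reconstruction of the original input

x = torch.rand(16, 784)                    # toy batch of flattened 28x28 images
x_hat = AutoEncoder()(x)
loss = nn.functional.mse_loss(x_hat, x)    # reconstruction (L2) loss
```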

# Variational Autoencoder

Assume the training data is generated from an underlying unobserved (latent) representation $\mathbf{z}$.

**Intuition**:

- $\mathbf{x}$ -> an image
- $\mathbf{z}$ -> latent factors used to generate $\mathbf{x}$: attributes, orientation, pose, how much smile, *etc*.

Choose the prior $p(z)$ to be simple, e.g. a Gaussian.

## Training

### Problem

**Intractable integral** in the MLE of training data:

$$p_\theta(x) = \int p_\theta(z) \, {\color{red} p_\theta(x \vert z)} \, dz$$

where it is *intractable* to compute $p(x \vert z)$ for every $z$ in the integral. The intractable term is marked in red.

Thus, the posterior density is also intractable due to the intractable data likelihood ^{[5]}:

$$p_\theta(z \vert x) = \frac{p_\theta(x \vert z) \, p_\theta(z)}{{\color{red} p_\theta(x)}}$$

### Solution

**Encoder** -> “recognition / inference” networks.

- Define an encoder network $q_\phi(z \vert x)$ that approximates the intractable true posterior $p_\theta(z \vert x)$. VAE makes the variational approximate posterior a multivariate Gaussian with diagonal covariance for data point $\mathbf{x}^{(i)}$:

$$\log q_\phi(\mathbf{z} \vert \mathbf{x}^{(i)}) = \log \mathcal{N}\big(\mathbf{z};\, \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{2(i)} \mathbf{I}\big)$$

where the mean $\boldsymbol{\mu}^{(i)}$ and standard deviation $\boldsymbol{\sigma}^{(i)}$ are outputs of the encoder network.

- For a Gaussian MLP encoder or decoder ^{[4]}, the Gaussian parameters are computed from a shared hidden layer $\mathbf{h}$:

$$\boldsymbol{\mu} = \mathbf{W}_4 \mathbf{h} + \mathbf{b}_4, \qquad \log \boldsymbol{\sigma}^2 = \mathbf{W}_5 \mathbf{h} + \mathbf{b}_5$$

Using a NN to model $\log \sigma^2$ instead of $\sigma^2$ works better because $\log \sigma^2 \in (-\infty, \infty)$ whereas $\sigma^2 \geq 0$: the network output can stay unconstrained.
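
A minimal sketch of such a Gaussian encoder in PyTorch; the two heads output $\boldsymbol{\mu}$ and $\log \boldsymbol{\sigma}^2$, and the architecture and layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps x to the parameters (mu, log sigma^2) of q(z|x)."""
    def __init__(self, in_dim=784, hidden=256, z_dim=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.fc_mu = nn.Linear(hidden, z_dim)       # mu head
        self.fc_logvar = nn.Linear(hidden, z_dim)   # log sigma^2 head

    def forward(self, x):
        h = self.body(x)                            # shared hidden layer
        return self.fc_mu(h), self.fc_logvar(h)     # both unconstrained in R

mu, logvar = GaussianEncoder()(torch.rand(16, 784))
sigma2 = logvar.exp()                               # recovers sigma^2 >= 0
```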

**Decoder** -> “generation” networks.

The data likelihood decomposes as

$$\log p_\theta(x) = \underbrace{\mathbb{E}_{z \sim q_\phi(z \vert x)}\big[\log p_\theta(x \vert z)\big] - \mathbb{KL}\big(q_\phi(z \vert x) \,\|\, p_\theta(z)\big)}_{\text{tractable lower bound } \mathcal{L}(x; \theta, \phi)} + \underbrace{\mathbb{KL}\big(q_\phi(z \vert x) \,\|\, p_\theta(z \vert x)\big)}_{\geq 0}$$

- The first RHS term is a **tractable lower bound**: both $p_\theta(x \vert z)$ and the first $\mathbb{KL}$ term are differentiable, while the last $\mathbb{KL}$ term is always $\geq 0$.
- Thus, the variational lower bound (ELBO) is derived:

$$\log p_\theta(x) \geq \mathcal{L}(x; \theta, \phi) = \mathbb{E}_{z \sim q_\phi(z \vert x)}\big[\log p_\theta(x \vert z)\big] - \mathbb{KL}\big(q_\phi(z \vert x) \,\|\, p_\theta(z)\big)$$

- Training: maximize the lower bound

$$\theta^*, \phi^* = \arg\max_{\theta, \phi} \sum_{i=1}^{N} \mathcal{L}\big(x^{(i)}; \theta, \phi\big)$$

where

- the first term $\mathbb{E}_{z}\big[\log p_\theta(x \vert z)\big]$ reconstructs the input data; it is a *negative reconstruction error*.
- the second $\mathbb{KL}$ term pulls the approximate posterior toward the prior; it acts as a regularizer.

The derived estimator, when using an isotropic multivariate Gaussian posterior and prior, is:

$$\mathcal{L}\big(x^{(i)}; \theta, \phi\big) \simeq \frac{1}{2} \sum_{j=1}^{J} \Big( 1 + \log \big(\sigma_j^{(i)}\big)^2 - \big(\mu_j^{(i)}\big)^2 - \big(\sigma_j^{(i)}\big)^2 \Big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x^{(i)} \vert z^{(i,l)}\big)$$

where $z^{(i,l)} = \boldsymbol{\mu}^{(i)} + \boldsymbol{\sigma}^{(i)} \odot \boldsymbol{\epsilon}^{(l)}$ with $\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and $\mu_j^{(i)}$ and $\sigma_j^{(i)}$ denote the $j$-th elements of the mean and standard-deviation vectors.
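
A sketch of this estimator with $L = 1$ in PyTorch, assuming a Bernoulli decoder likelihood (so the reconstruction term is a negative binary cross-entropy); the function name and signature are hypothetical:

```python
import torch
import torch.nn.functional as F

def elbo_estimate(x, x_logits, mu, logvar):
    """One-sample (L=1) estimate of the lower bound, to be maximized."""
    # E[log p(x|z)]: negative reconstruction error (Bernoulli likelihood)
    log_px = -F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0,I)): -1/2 * sum(1 + log s^2 - mu^2 - s^2)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return log_px - kl
```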

### Reparameterization trick

Given the deterministic mapping $z = g_\phi(\epsilon, x)$ with auxiliary noise $\epsilon \sim p(\epsilon)$, we know that

$$q_\phi(z \vert x) \prod_i dz_i = p(\epsilon) \prod_i d\epsilon_i$$

Thus,

$$\int q_\phi(z \vert x) f(z) \, dz = \int p(\epsilon) f\big(g_\phi(\epsilon, x)\big) \, d\epsilon \simeq \frac{1}{L} \sum_{l=1}^{L} f\big(g_\phi(\epsilon^{(l)}, x)\big)$$

Take the univariate Gaussian case as an example: for $z \sim p(z \vert x) = \mathcal{N}(\mu, \sigma^2)$, a valid reparameterization is $z = \mu + \sigma \epsilon$, where the auxiliary noise variable $\epsilon \sim \mathcal{N}(0,1)$.

Thus,

$$\mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}\big[f(z)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, 1)}\big[f(\mu + \sigma \epsilon)\big] \simeq \frac{1}{L} \sum_{l=1}^{L} f\big(\mu + \sigma \epsilon^{(l)}\big)$$

The sampling distribution no longer depends on $\phi$, so gradients can flow through $\mu$ and $\sigma$.
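
In code the trick is a single line; the randomness is pushed into the parameter-free noise `eps`, so gradients flow through `mu` and `sigma` (a minimal PyTorch sketch):

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    sigma = torch.exp(0.5 * logvar)    # sigma = exp((log sigma^2) / 2)
    eps = torch.randn_like(sigma)      # auxiliary noise, parameter-free
    return mu + sigma * eps            # differentiable w.r.t. mu and sigma

z = reparameterize(torch.zeros(16, 32), torch.zeros(16, 32))  # z ~ N(0, I)
```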

## Generation

- After training, remove the encoder network and use the decoder network to generate: **sample $z$ from the prior** as the input!
- A diagonal prior on $z$ -> independent latent variables! *Different dimensions of $z$ encode interpretable factors of variation*.
- Also a good feature representation can be computed using $q_\phi(z \vert x)$.
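
A sketch of generation in PyTorch; the `decoder` below is a stand-in for any trained decoder network and its architecture is an illustrative assumption:

```python
import torch
import torch.nn as nn

z_dim = 32
# Stand-in for a trained decoder network modeling p(x|z).
decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                        nn.Linear(256, 784), nn.Sigmoid())

with torch.no_grad():
    z = torch.randn(16, z_dim)   # sample z from the prior N(0, I)
    x_new = decoder(z)           # decode into 16 new samples
```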

## Pros & cons

- Probabilistic spin to traditional autoencoders => allows generating data
- Defines an intractable density => derive and optimize a (variational) lower bound

**Pros**:

- Principled approach to generative models
- Allows inference of $q(z \vert x)$, which can be a useful feature representation for downstream tasks

**Cons**:

- Maximizes a lower bound of the likelihood: okay, but the evaluation is not as direct as for *PixelRNN / PixelCNN*!
- Lower sample quality compared to the state of the art (GANs)

# Variational Graph Auto-Encoder (VGAE)

## Definition

Given an undirected, unweighted graph $\mathcal{G}=(\mathcal{V}, \mathcal{E})$ with $N=|\mathcal{V}|$ nodes, we have the adjacency matrix $\mathbf{A}$ with self-connections (i.e., the diagonal is set to 1), the degree matrix $\mathbf{D}$, stochastic latent variables collected in a matrix $\mathbf{Z} \in \mathbb{R}^{N \times F}$, and node features $\mathbf{X} \in \mathbb{R}^{N \times D}$. ^{[7]}

## Inference model

Apply a two-layer Graph Convolutional Network (GCN) for parameterization:

$$q(\mathbf{Z} \vert \mathbf{X}, \mathbf{A}) = \prod_{i=1}^{N} q(\mathbf{z}_i \vert \mathbf{X}, \mathbf{A}), \qquad q(\mathbf{z}_i \vert \mathbf{X}, \mathbf{A}) = \mathcal{N}\big(\mathbf{z}_i \,\vert\, \boldsymbol{\mu}_i, \operatorname{diag}(\boldsymbol{\sigma}_i^2)\big)$$

where

- Mean: $\boldsymbol{\mu} = \operatorname{GCN}_{\boldsymbol{\mu}}(\mathbf{X}, \mathbf{A})$ is the matrix of mean vectors $\boldsymbol{\mu}_i$
- Variance: $\log \boldsymbol{\sigma} = \operatorname{GCN}_{\boldsymbol{\sigma}}(\mathbf{X}, \mathbf{A})$

The two-layer GCN is defined as

$$\operatorname{GCN}(\mathbf{X}, \mathbf{A}) = \tilde{\mathbf{A}} \operatorname{ReLU}\big(\tilde{\mathbf{A}} \mathbf{X} \mathbf{W}_0\big) \mathbf{W}_1$$

where $\tilde{\mathbf{A}} = \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$ is the symmetrically normalized adjacency matrix. $\operatorname{GCN}_{\boldsymbol{\mu}}$ and $\operatorname{GCN}_{\boldsymbol{\sigma}}$ share the first-layer parameters $\mathbf{W}_0$.
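
A dense-matrix sketch of this encoder in PyTorch (sparse matrix ops are replaced by dense ones for brevity; layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class GCNEncoder(nn.Module):
    """Two-layer GCNs for mu and log(sigma), sharing first-layer weights W0."""
    def __init__(self, in_dim, hidden, z_dim):
        super().__init__()
        self.w0 = nn.Linear(in_dim, hidden, bias=False)
        self.w_mu = nn.Linear(hidden, z_dim, bias=False)
        self.w_logsigma = nn.Linear(hidden, z_dim, bias=False)

    def forward(self, x, a_norm):
        h = torch.relu(a_norm @ self.w0(x))     # shared first GCN layer
        mu = a_norm @ self.w_mu(h)              # GCN_mu(X, A)
        logsigma = a_norm @ self.w_logsigma(h)  # GCN_sigma(X, A)
        return mu, logsigma

a = torch.eye(4)                                # toy graph: self-loops only
d_inv_sqrt = a.sum(dim=1).pow(-0.5)
a_norm = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2
mu, logsigma = GCNEncoder(8, 16, 4)(torch.rand(4, 8), a_norm)
```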

## Generative model

The generative model applies an inner product between latent variables:

$$p(\mathbf{A} \vert \mathbf{Z}) = \prod_{i=1}^{N} \prod_{j=1}^{N} p(A_{ij} \vert \mathbf{z}_i, \mathbf{z}_j), \qquad p(A_{ij} = 1 \vert \mathbf{z}_i, \mathbf{z}_j) = \sigma\big(\mathbf{z}_i^{\top} \mathbf{z}_j\big)$$

where $A_{ij}$ are the elements of the adjacency matrix $\mathbf{A}$ and $\sigma(\cdot)$ is the logistic sigmoid function.
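
In code the decoder is just a matrix product followed by an element-wise sigmoid (a minimal sketch):

```python
import torch

def inner_product_decoder(z):
    """Edge probabilities p(A_ij = 1 | z_i, z_j) = sigmoid(z_i . z_j)."""
    return torch.sigmoid(z @ z.t())      # N x N matrix of probabilities

a_prob = inner_product_decoder(torch.randn(5, 16))   # reconstructed adjacency
```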

## Learning

Optimize the variational lower bound (ELBO) w.r.t. the variational parameters $\mathbf{W}_i$:

$$\mathcal{L} = \mathbb{E}_{q(\mathbf{Z} \vert \mathbf{X}, \mathbf{A})}\big[\log p(\mathbf{A} \vert \mathbf{Z})\big] - \mathbb{KL}\big(q(\mathbf{Z} \vert \mathbf{X}, \mathbf{A}) \,\|\, p(\mathbf{Z})\big)$$

where the Gaussian prior is $p(\mathbf{Z}) = \prod_i p(\mathbf{z}_i) = \prod_i \mathcal{N}(\mathbf{z}_i \vert \mathbf{0}, \mathbf{I})$.
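
A sketch of this objective as a loss to minimize (the negative ELBO), using binary cross-entropy over all $N^2$ adjacency entries; the edge reweighting used in the reference implementation is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def vgae_loss(a_target, mu, logsigma, z):
    """Negative ELBO: reconstruction term plus KL to the N(0, I) prior."""
    a_logits = z @ z.t()                           # inner-product decoder
    recon = F.binary_cross_entropy_with_logits(a_logits, a_target,
                                               reduction="sum")
    # KL(q || p) for diagonal Gaussians, with logsigma = log(sigma)
    kl = -0.5 * torch.sum(1 + 2 * logsigma - mu.pow(2)
                          - (2 * logsigma).exp())
    return recon + kl
```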

# References

- 1. Stanford CS231n: Generative Models ↩
- 2. Goodfellow, I., et al. Deep Learning ↩
- 3. Goodfellow, I. (2016). Tutorial: Generative Adversarial Networks. In NIPS. ↩
- 4. Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114. ↩
- 5. Doersch, C. (2016). Tutorial on Variational Autoencoders. arXiv preprint arXiv:1606.05908. ↩
- 6. Stanford CS236: VAE notes ↩
- 7. Kipf, T. N., & Welling, M. (2016). Variational Graph Auto-Encoders. arXiv preprint arXiv:1611.07308. ↩