This is a concise introduction of Variational Autoencoder (VAE).

Background

• PixelCNN define tractable density function with MLE:

• VAE define the intractable density function with latent $\mathbf{z}$:

This cannot directly optimize, VAEs derive and optimize the lower bound on likelihood instead.

Autoencoder

Autoencoder (AE) encodes the inputs into latent representations $\mathbf{z}$ with dimension reduction to capture meaningful factors of variation in data. Then employ $\mathbf{z}$ to reconstruct original data by autoencoding itself.

• After training, throw away the decoder and only retain the encoder.
• Encoder can be used to initialize the supervised model on downstream tasks.

Variational Autoencoder

Assume training data $\{ x^{(i)}\}_{i=1}^N$ is generated from underlying unobserved (latent) representation $\mathbf{z}$.

Intuition:

• $\mathbf{x}$ -> image
• $\mathbf{z}$ -> latent factors used to generate $\mathbf{x}$: attributes, orientation, pose, how much smile, etc. Choose prior $p(z)$ to be simple, e.g. Gaussian.

Training

Problems

Intractable integral to MLE of training data:

where it is intractible to compute $p(x \vert z)$ for every $z$, i.e. integral. The intractability is marked in red.

Thus, the posterior density is also intractable due to the intractable data likelihood:

VAE Decoder[5]

Solutions

Encoder -> “recognition / inference” networks.

• Define encoder network $q_\phi (z \vert x)$ that approximates the intractable true posterior $p_\theta(z \vert x)$. VAE makes the variational approximate posterior be a multivariate Gaussian with diagonal covariance for data point $\mathbf{x}^{(i)}$:

where

• For Gaussian MLP encoder or decoder[4],

Use NN to model $\log \sigma^2$ instead of $\sigma^2$ is because that $\log \sigma^2 \in (-\infty, \infty)$ whereas $\sigma^2 \geq 0$

Decoder -> “generation” networks $p_\theta (x \vert z)$

• The first RHS term represents tractable lower bound $\mathcal{L}(x^{(i)}, \theta, \phi)$, wherein $p_\theta (x \vert z)$ and $\mathbb{KL}$ terms are differentiable.
• Thus, the variational lower bound (ELBO) is derived：
• Training: maximize lower bound

where

• the fist term $\mathbb{E}_z \bigg[ \log \color{green}{p_\theta (x^{(i)} \vert z)} \bigg]$: reconstruct the input data. It is a negative reconstruction error.
• the second term $\mathbb{KL} \big( \color{blue}{q_\phi (z \vert x^{(i)})} \| \color{blue}{p_\theta (z)} \big)$ make approximate posterior distribution close to the prior. It acts as a regularizer.

The derived estimator when using isotropic multivariate Gaussian $p_\theta(\mathbf{z})=\mathcal{N}(\mathbf{z}; \mathbf{0,I})$:

where $\mathbf{z}^{(i,l)} = \mu^{i} + \sigma^{(i)} \odot \epsilon^{(l)}$ and $\epsilon^{(l)} \sim \mathcal{N}(0, \mathbf{I})$, $\mu_j$ and $\sigma_j$ denote the $j$-th element of mean and variance vectors.

Reparameterization trick

Given the deterministic mapping $\mathbf{z}= g_\phi (\epsilon, x)$, we know that

Thus,

Take the univariate Gaussian case for example: $z \sim p(z \vert x) = \mathcal{N}(\mu, \sigma^2)$, the valid reparameterization is: $z = \mu + \sigma \epsilon$, where the auxiliary noise variable $\epsilon \sim \mathcal{N}(0,1)$.
Thus,

Generation

• After training, remove the encoder network, and use decoder network to generate.
• Sample $z$ from prior as the input!
• Diagonal prior on $z$ -> independent latent variables!

• Different dimensions of $z$ encode interpretable factors of variation.
• Good feature representation that can be computed using $q_\phi (z \vert x)$

Pros & cons

• Probabilistic spin to traditional autoencoders => allows generating data
• Defines an intractable density => derive and optimize a (variational) lower bound

Pros:

• Principles approach to generative models
• Allows inference of $q(z \vert x)$, can be useful feature representation for downstream tasks

Cons:

• Maximizes lower bound of likelihood: okay, but not as good evalution as PixelRNN / PixelCNN
• loert quality compared to the sota (GANs)

Variational Graph Auto-Encoder (VGAE)

Definition

Given an undirected, unweighted graph $\mathcal{G}=(\mathcal{V}, \mathcal{E})$ with $N=|V|$ nodes, the ajacency matrix $\mathbf{A}$ with self-connection (i.e., the diagonal is set to 1), degree matrix $\mathbf{D}$, stochastic latent variable $\mathbf{z}_i$ in matrix $\mathbf{Z} \in \mathbb{R}^{N \times F}$, node features $\mathbf{X} \in \mathbb{R}^{N \times D}$. [7]

Inference model

Apply a 2-layer Graph Convolutional Networks (GCN) to for parameterization:

where

• Mean: $\mu = \text{GCN}_\mu (\mathbf{X}, \mathbf{A})$
• Variance: $\log \sigma = \text{GCN}_\sigma (\mathbf{X}, \mathbf{A})$

The two-layer GCN is defined as $\text{GCN}(\mathbf{X}, \mathbf{A}) = \tilde{\mathbf{A}} \text{ReLU}(\tilde{\mathbf{A}}\mathbf{X}\mathbf{W}_0)\mathbf{W}_1$
where $\tilde{\mathbf{A}} = \mathbf{D}^{-1/2} \mathbf{A}\mathbf{D}^{-1/2}$ is the semmetrically normalized adjacency matrix.

Generative model

The generative model applies an inner product between latent variables:

where $A_{ij}$ are elements of ajacency matrix $\mathbf{A}$ and $\sigma(\cdot)$ represents the sigmoid function.

Learning

Optimize the variational lower bound (ELBO) $\mathcal{L}$ w.r.t the variational parameters $\mathbf{W}_i$:

where the Gaussian prior $p(\mathbf{Z}) = \prod_I p(\mathbf{z}_i) = \prod_i \mathcal{N}(\mathbf{z}_i \vert 0, \mathbf{I})$

References

1. 1.
2. 2.
3. 3.Goodfellow, I. (2016). Tutorial: Generative adversarial networks. In NIPS.
4. 4.Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
5. 5.Doersch, C. (2016). Tutorial on Variational Autoencoders. ArXiv, abs/1606.05908.
6. 6.
7. 7.Kipf, T., & Welling, M. (2016). Variational Graph Auto-Encoders. ArXiv, abs/1611.07308.