
Variational Auto-Encoders

This is a concise introduction to the Variational Autoencoder (VAE).



  • PixelCNN defines a tractable density function and optimizes the likelihood of the training data directly:
    $$p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \vert x_1, \ldots, x_{i-1})$$

  • VAE defines an intractable density function with latent $\mathbf{z}$:
    $$p_\theta(x) = \int p_\theta(z)\, p_\theta(x \vert z)\, dz$$

This integral cannot be optimized directly, so VAEs derive and optimize a lower bound on the likelihood instead.


An autoencoder (AE) encodes the input into a latent representation $\mathbf{z}$ via dimensionality reduction, so that $\mathbf{z}$ captures meaningful factors of variation in the data; the decoder then uses $\mathbf{z}$ to reconstruct the original input.

  • After training, throw away the decoder and retain only the encoder.
  • The encoder can be used to initialize a supervised model on downstream tasks (a minimal sketch follows this list).
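
The sketch below is a minimal plain autoencoder in PyTorch; the framework and layer sizes are my assumptions for illustration and are not taken from the references.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Plain autoencoder: compress x to a latent code z, then reconstruct x from z."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        z = self.encoder(x)      # dimensionality-reduced latent representation
        return self.decoder(z)   # reconstruction of the input

# Usage: minimize a reconstruction loss, e.g. nn.MSELoss()(model(x), x).
# After training, model.encoder can initialize a supervised downstream model.
```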

Variational Autoencoder

Assume the training data is generated from an underlying unobserved (latent) representation $\mathbf{z}$.


  • $\mathbf{x}$ -> image
  • $\mathbf{z}$ -> latent factors used to generate $\mathbf{x}$: attributes, orientation, pose, degree of smile, etc. Choose the prior $p(z)$ to be simple, e.g. a Gaussian.



Maximum likelihood estimation on the training data requires an intractable integral:
$$p_\theta(x) = \int p_\theta(z)\, p_\theta(x \vert z)\, dz$$

It is intractable to compute $p(x \vert z)$ for every $z$, i.e. the integral cannot be evaluated.

Thus, the posterior density is also intractable, because it involves the intractable data likelihood:
$$p_\theta(z \vert x) = \frac{p_\theta(x \vert z)\, p_\theta(z)}{p_\theta(x)}$$

[Figure: VAE decoder network [5]]


Encoder -> “recognition / inference” networks.

  • Define an encoder network $q_\phi(\mathbf{z} \vert \mathbf{x})$ that approximates the intractable true posterior $p_\theta(\mathbf{z} \vert \mathbf{x})$. The VAE makes the variational approximate posterior a multivariate Gaussian with diagonal covariance for data point $\mathbf{x}^{(i)}$:
    $$\log q_\phi(\mathbf{z} \vert \mathbf{x}^{(i)}) = \log \mathcal{N}\big(\mathbf{z}; \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{2(i)}\mathbf{I}\big)$$


  • For a Gaussian MLP encoder or decoder [4], the mean and log-variance are produced from a shared hidden layer, e.g. for the decoder:
    $$\log p_\theta(\mathbf{x} \vert \mathbf{z}) = \log \mathcal{N}\big(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\sigma}^2\mathbf{I}\big), \quad \boldsymbol{\mu} = \mathbf{W}_4\mathbf{h} + \mathbf{b}_4, \quad \log \boldsymbol{\sigma}^2 = \mathbf{W}_5\mathbf{h} + \mathbf{b}_5, \quad \mathbf{h} = \tanh(\mathbf{W}_3\mathbf{z} + \mathbf{b}_3)$$

The network outputs $\log \sigma^2$ instead of $\sigma^2$ because $\log \sigma^2 \in (-\infty, \infty)$ is unconstrained, whereas $\sigma^2 \geq 0$ would require a constrained output.
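
As a concrete sketch (assuming a PyTorch implementation with 784-dimensional inputs; these choices are illustrative and not prescribed by [4]), the encoder can output $\boldsymbol{\mu}$ and $\log \boldsymbol{\sigma}^2$ from a shared hidden layer:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Amortized inference network q_phi(z|x) = N(mu(x), diag(sigma^2(x)))."""
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Tanh())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)      # mean vector mu
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log sigma^2 (unconstrained output)

    def forward(self, x):
        h = self.hidden(x)
        return self.fc_mu(h), self.fc_logvar(h)
```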

Decoder -> “generation” networks


The log-likelihood of a data point can be decomposed as
$$\log p_\theta(\mathbf{x}^{(i)}) = \underbrace{\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \vert \mathbf{x}^{(i)})}\big[\log p_\theta(\mathbf{x}^{(i)} \vert \mathbf{z})\big] - D_{KL}\big(q_\phi(\mathbf{z} \vert \mathbf{x}^{(i)}) \,\Vert\, p_\theta(\mathbf{z})\big)}_{\mathcal{L}(\mathbf{x}^{(i)}; \theta, \phi)} + D_{KL}\big(q_\phi(\mathbf{z} \vert \mathbf{x}^{(i)}) \,\Vert\, p_\theta(\mathbf{z} \vert \mathbf{x}^{(i)})\big)$$

  • The first RHS term is the tractable lower bound $\mathcal{L}(\mathbf{x}^{(i)}; \theta, \phi)$, wherein the expectation and $\mathbb{KL}$ terms are differentiable; the last KL term is always $\geq 0$.
  • Thus, the variational lower bound (ELBO) is derived:
    $$\mathcal{L}(\mathbf{x}^{(i)}; \theta, \phi) = \mathbb{E}_{\mathbf{z}}\big[\log p_\theta(\mathbf{x}^{(i)} \vert \mathbf{z})\big] - D_{KL}\big(q_\phi(\mathbf{z} \vert \mathbf{x}^{(i)}) \,\Vert\, p_\theta(\mathbf{z})\big) \leq \log p_\theta(\mathbf{x}^{(i)})$$
  • Training: maximize the lower bound
    $$\theta^*, \phi^* = \arg\max_{\theta, \phi} \sum_{i=1}^{N} \mathcal{L}(\mathbf{x}^{(i)}; \theta, \phi)$$


  • the first term $\mathbb{E}_{\mathbf{z}}\big[\log p_\theta(\mathbf{x}^{(i)} \vert \mathbf{z})\big]$ reconstructs the input data; it is the negative reconstruction error.
  • the second term $D_{KL}\big(q_\phi(\mathbf{z} \vert \mathbf{x}^{(i)}) \,\Vert\, p_\theta(\mathbf{z})\big)$ makes the approximate posterior close to the prior; it acts as a regularizer.

With an isotropic multivariate Gaussian prior $p_\theta(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I})$, the KL term has a closed form and the derived estimator of the lower bound becomes [4]:
$$\mathcal{L}(\mathbf{x}^{(i)}; \theta, \phi) \simeq \frac{1}{2} \sum_{j=1}^{J} \Big(1 + \log\big((\sigma_j^{(i)})^2\big) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\Big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(\mathbf{x}^{(i)} \vert \mathbf{z}^{(i,l)}\big)$$

where $\mathbf{z}^{(i,l)} = \boldsymbol{\mu}^{(i)} + \boldsymbol{\sigma}^{(i)} \odot \boldsymbol{\epsilon}^{(l)}$ with $\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and $\mu_j^{(i)}$, $\sigma_j^{(i)}$ denote the $j$-th element of the mean and standard-deviation vectors.
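
In code, a single-sample ($L = 1$) estimate of the negative lower bound might look as follows; this sketch assumes a PyTorch setup with a Bernoulli decoder (binary cross-entropy reconstruction), which is one common choice rather than the only one:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x_recon, x, mu, logvar):
    """Single-sample estimate of -L(x; theta, phi) for a Bernoulli decoder."""
    # Reconstruction term: -E_q[log p(x|z)], estimated with one z sample per data point.
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # Closed-form KL(q(z|x) || N(0, I)) = -1/2 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing this maximizes the ELBO
```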


Reparameterization trick

Given the deterministic mapping $\mathbf{z} = g_\phi(\boldsymbol{\epsilon}, \mathbf{x})$ with auxiliary noise $\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})$, we know that
$$\int q_\phi(\mathbf{z} \vert \mathbf{x})\, f(\mathbf{z})\, d\mathbf{z} = \int p(\boldsymbol{\epsilon})\, f\big(g_\phi(\boldsymbol{\epsilon}, \mathbf{x})\big)\, d\boldsymbol{\epsilon} \simeq \frac{1}{L} \sum_{l=1}^{L} f\big(g_\phi(\boldsymbol{\epsilon}^{(l)}, \mathbf{x})\big),$$
so the expectation can be estimated by sampling $\boldsymbol{\epsilon}$, and gradients can flow through $g_\phi$ to the parameters $\phi$.


Take the univariate Gaussian case as an example: for $z \sim p(z \vert x) = \mathcal{N}(\mu, \sigma^2)$, a valid reparameterization is $z = \mu + \sigma \epsilon$, where the auxiliary noise variable $\epsilon \sim \mathcal{N}(0,1)$.
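
A sketch of this in PyTorch (an illustrative implementation choice, not part of [4]) is the usual reparameterization helper:

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping gradients w.r.t. mu and logvar."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(0.5 * log sigma^2)
    eps = torch.randn_like(std)     # auxiliary noise, independent of the parameters
    return mu + std * eps
```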


  • After training, remove the encoder network and use the decoder network to generate new data (a sampling sketch follows this list).
  • Sample $z$ from the prior as the input!
  • A diagonal prior on $z$ -> independent latent variables!
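
A minimal generation sketch in PyTorch; the decoder architecture and sizes here are hypothetical placeholders, and a trained VAE decoder would be used in practice:

```python
import torch
import torch.nn as nn

latent_dim = 20
# Hypothetical decoder network; replace with the trained VAE decoder.
decoder = nn.Sequential(nn.Linear(latent_dim, 400), nn.Tanh(),
                        nn.Linear(400, 784), nn.Sigmoid())

with torch.no_grad():
    z = torch.randn(16, latent_dim)  # sample z from the N(0, I) prior
    samples = decoder(z)             # decode into 16 generated data points
```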


  • Different dimensions of $z$ encode interpretable factors of variation.
  • A good feature representation can also be computed from $q_\phi(\mathbf{z} \vert \mathbf{x})$.

Pros & cons

  • Probabilistic spin on traditional autoencoders => allows generating data
  • Defines an intractable density => derive and optimize a (variational) lower bound


  • Principled approach to generative models
  • Allows inference of $q(z \vert x)$, which can be a useful feature representation for downstream tasks


  • Maximizes a lower bound of the likelihood: okay, but not as good an evaluation as PixelRNN / PixelCNN
  • Samples are of lower quality compared to the state of the art (GANs)

Variational Graph Auto-Encoder (VGAE)


Given an undirected, unweighted graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N = |\mathcal{V}|$ nodes, define the adjacency matrix $\mathbf{A}$ with self-connections (i.e., the diagonal is set to 1), the degree matrix $\mathbf{D}$, stochastic latent variables collected in the matrix $\mathbf{Z} \in \mathbb{R}^{N \times F}$, and node features $\mathbf{X} \in \mathbb{R}^{N \times D}$. [7]

Inference model

Apply a two-layer Graph Convolutional Network (GCN) to parameterize the approximate posterior:
$$q(\mathbf{Z} \vert \mathbf{X}, \mathbf{A}) = \prod_{i=1}^{N} q(\mathbf{z}_i \vert \mathbf{X}, \mathbf{A}), \quad \text{with} \quad q(\mathbf{z}_i \vert \mathbf{X}, \mathbf{A}) = \mathcal{N}\big(\mathbf{z}_i \,\vert\, \boldsymbol{\mu}_i, \operatorname{diag}(\boldsymbol{\sigma}_i^2)\big)$$

  • Mean: $\boldsymbol{\mu} = \mathrm{GCN}_{\boldsymbol{\mu}}(\mathbf{X}, \mathbf{A})$, the matrix of mean vectors $\boldsymbol{\mu}_i$
  • Variance: $\log \boldsymbol{\sigma} = \mathrm{GCN}_{\boldsymbol{\sigma}}(\mathbf{X}, \mathbf{A})$

The two-layer GCN is defined as
$$\mathrm{GCN}(\mathbf{X}, \mathbf{A}) = \tilde{\mathbf{A}}\, \mathrm{ReLU}\big(\tilde{\mathbf{A}} \mathbf{X} \mathbf{W}_0\big) \mathbf{W}_1,$$
where $\tilde{\mathbf{A}} = \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$ is the symmetrically normalized adjacency matrix. A sketch of such an encoder follows below.
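
This is a dense-matrix sketch of the encoder in PyTorch, given as an illustration rather than the reference implementation of [7]; it shares the first-layer weights between $\mathrm{GCN}_{\boldsymbol{\mu}}$ and $\mathrm{GCN}_{\boldsymbol{\sigma}}$.

```python
import torch
import torch.nn as nn

class GCNEncoder(nn.Module):
    """Two-layer GCN producing per-node mu and log sigma (dimensions are illustrative)."""
    def __init__(self, in_dim, hidden_dim, latent_dim):
        super().__init__()
        self.w0 = nn.Linear(in_dim, hidden_dim, bias=False)           # shared first-layer weights W_0
        self.w_mu = nn.Linear(hidden_dim, latent_dim, bias=False)     # second layer of GCN_mu
        self.w_sigma = nn.Linear(hidden_dim, latent_dim, bias=False)  # second layer of GCN_sigma

    def forward(self, x, a_norm):
        # a_norm is the symmetrically normalized adjacency D^{-1/2} A D^{-1/2} (dense here).
        h = torch.relu(a_norm @ self.w0(x))   # first GCN layer
        mu = a_norm @ self.w_mu(h)            # GCN_mu(X, A)
        log_sigma = a_norm @ self.w_sigma(h)  # GCN_sigma(X, A)
        return mu, log_sigma
```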

Generative model

The generative model applies an inner product between latent variables:
$$p(\mathbf{A} \vert \mathbf{Z}) = \prod_{i=1}^{N} \prod_{j=1}^{N} p(A_{ij} \vert \mathbf{z}_i, \mathbf{z}_j), \quad \text{with} \quad p(A_{ij} = 1 \vert \mathbf{z}_i, \mathbf{z}_j) = \sigma(\mathbf{z}_i^{\top} \mathbf{z}_j),$$

where $A_{ij}$ are the elements of the adjacency matrix $\mathbf{A}$ and $\sigma(\cdot)$ represents the logistic sigmoid function.
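
The decoder is just a matrix of pairwise inner products passed through a sigmoid; a one-line sketch:

```python
import torch

def inner_product_decoder(z):
    """p(A_ij = 1 | z_i, z_j) = sigmoid(z_i^T z_j) for all node pairs at once."""
    return torch.sigmoid(z @ z.t())   # N x N matrix of predicted edge probabilities
```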


Optimize the variational lower bound (ELBO) w.r.t. the variational parameters $\mathbf{W}_i$:
$$\mathcal{L} = \mathbb{E}_{q(\mathbf{Z} \vert \mathbf{X}, \mathbf{A})}\big[\log p(\mathbf{A} \vert \mathbf{Z})\big] - D_{KL}\big(q(\mathbf{Z} \vert \mathbf{X}, \mathbf{A}) \,\Vert\, p(\mathbf{Z})\big),$$

where the Gaussian prior is $p(\mathbf{Z}) = \prod_i p(\mathbf{z}_i) = \prod_i \mathcal{N}(\mathbf{z}_i \vert \mathbf{0}, \mathbf{I})$.
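
A sketch of the resulting training objective in PyTorch, assuming the encoder above outputs $\log \boldsymbol{\sigma}$; re-weighting of positive edges, used in practice for sparse graphs, is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def vgae_negative_elbo(a_pred, a_target, mu, log_sigma):
    """Negative ELBO: adjacency reconstruction error plus KL to the N(0, I) prior."""
    # Reconstruction term: binary cross-entropy over all entries of the adjacency matrix.
    recon = F.binary_cross_entropy(a_pred.reshape(-1), a_target.reshape(-1), reduction='sum')
    # KL(q(Z|X,A) || N(0, I)) written in terms of log sigma:
    # -1/2 * sum (1 + 2*log sigma - mu^2 - sigma^2).
    kl = -0.5 * torch.sum(1 + 2 * log_sigma - mu.pow(2) - (2 * log_sigma).exp())
    return recon + kl
```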


  1. Stanford CS231n: Generative Models.
  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  3. Goodfellow, I. (2016). Tutorial: Generative Adversarial Networks. In NIPS.
  4. Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.
  5. Doersch, C. (2016). Tutorial on Variational Autoencoders. arXiv preprint arXiv:1606.05908.
  6. Stanford CS236: VAE notes.
  7. Kipf, T. N., & Welling, M. (2016). Variational Graph Auto-Encoders. arXiv preprint arXiv:1611.07308.