
An Introduction to Variational Autoencoders

This is a concise introduction to the Variational Autoencoder (VAE).



  • PixelCNN defines a tractable density function and optimizes the likelihood of the training data directly with MLE:

$$p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \vert x_1, \dots, x_{i-1})$$

  • VAE defines an intractable density function with latent $\mathbf{z}$:

$$p_\theta(x) = \int p_\theta(z) \, p_\theta(x \vert z) \, dz$$

This cannot be optimized directly, so VAEs derive and optimize a lower bound on the likelihood instead.


Autoencoders (AEs) encode the inputs into latent representations $\mathbf{z}$ via dimension reduction, in order to capture meaningful factors of variation in the data. $\mathbf{z}$ is then used to reconstruct the original data, i.e. the model autoencodes its input.

  • After training, throw away the decoder and retain only the encoder.
  • The encoder can be used to initialize a supervised model on downstream tasks.
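As a sketch, a plain autoencoder is just an encode/decode pair trained to minimize reconstruction error. The single-layer networks, sizes, and random weights below are illustrative assumptions, not any particular implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

x_dim, z_dim = 784, 32  # illustrative sizes (e.g. flattened 28x28 images)
W_enc = rng.normal(0, 0.01, (z_dim, x_dim))  # hypothetical encoder weights
W_dec = rng.normal(0, 0.01, (x_dim, z_dim))  # hypothetical decoder weights

def encode(x):
    # dimension reduction: map the input to a latent representation z
    return np.tanh(W_enc @ x)

def decode(z):
    # reconstruct the input from z
    return W_dec @ z

x = rng.normal(size=x_dim)
z = encode(x)          # compressed representation
x_hat = decode(z)      # reconstruction
# the L2 reconstruction error that training would minimize
loss = np.mean((x - x_hat) ** 2)
```

After training, only `encode` would be kept to produce features for downstream tasks.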

Variational Autoencoder

Assume the training data is generated from an underlying unobserved (latent) representation $\mathbf{z}$.


  • $\mathbf{x}$ -> image
  • $\mathbf{z}$ -> latent factors used to generate $\mathbf{x}$: attributes, orientation, pose, how much smile, etc. Choose prior $p(z)$ to be simple, e.g. Gaussian.



Maximum likelihood estimation on the training data requires the marginal likelihood, which contains an intractable integral:

$$p_\theta(x) = \int p_\theta(z) \, p_\theta(x \vert z) \, dz$$

where it is intractable to compute $p(x \vert z)$ for every $z$, i.e. the integral cannot be evaluated.

Thus, the posterior density is also intractable due to the intractable data likelihood:

$$p_\theta(z \vert x) = \frac{p_\theta(x \vert z) \, p_\theta(z)}{p_\theta(x)}$$

VAE Decoder[5]


Encoder -> “recognition / inference” networks.

  • Define an encoder network $q_\phi(z \vert x)$ that approximates the intractable true posterior $p_\theta(z \vert x)$. VAE makes the variational approximate posterior a multivariate Gaussian with diagonal covariance for data point $\mathbf{x}^{(i)}$:

$$\log q_\phi(z \vert x^{(i)}) = \log \mathcal{N}\!\left(z; \mu^{(i)}, \sigma^{2(i)} I\right)$$


  • For the Gaussian MLP encoder or decoder[4], $\mu$ and $\log \sigma^2$ are computed by an MLP with a shared hidden layer:

$$h = \tanh(W_3 x + b_3), \quad \mu = W_4 h + b_4, \quad \log \sigma^2 = W_5 h + b_5$$

The NN models $\log \sigma^2$ instead of $\sigma^2$ because $\log \sigma^2 \in (-\infty, \infty)$ whereas $\sigma^2 \geq 0$, so the network output needs no positivity constraint.
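The Gaussian MLP encoder can be sketched in a few lines of NumPy. The layer sizes and weights here are hypothetical (randomly initialized, untrained); only the structure matters:

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim, z_dim = 784, 128, 2  # illustrative sizes

# hypothetical MLP weights, randomly initialized for the sketch
W3, b3 = rng.normal(0, 0.01, (h_dim, x_dim)), np.zeros(h_dim)
W4, b4 = rng.normal(0, 0.01, (z_dim, h_dim)), np.zeros(z_dim)  # -> mu
W5, b5 = rng.normal(0, 0.01, (z_dim, h_dim)), np.zeros(z_dim)  # -> log sigma^2

def gaussian_encoder(x):
    # shared hidden layer, then two linear heads for mu and log-variance
    h = np.tanh(W3 @ x + b3)
    mu = W4 @ h + b4
    log_var = W5 @ h + b5  # unconstrained real output
    return mu, log_var

mu, log_var = gaussian_encoder(rng.normal(size=x_dim))
sigma2 = np.exp(log_var)  # exponentiating guarantees sigma^2 > 0
```

Exponentiating the unconstrained `log_var` head is exactly why modeling $\log \sigma^2$ is convenient: the variance is positive by construction.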

Decoder -> “generation” networks


The data log-likelihood decomposes as:

$$\log p_\theta(x^{(i)}) = \underbrace{\mathbb{E}_{z \sim q_\phi(z \vert x^{(i)})}\!\left[\log p_\theta(x^{(i)} \vert z)\right] - \mathbb{KL}\!\left(q_\phi(z \vert x^{(i)}) \,\|\, p_\theta(z)\right)}_{\mathcal{L}(x^{(i)}; \theta, \phi)} + \mathbb{KL}\!\left(q_\phi(z \vert x^{(i)}) \,\|\, p_\theta(z \vert x^{(i)})\right)$$

  • The first two RHS terms form the tractable lower bound $\mathcal{L}(x^{(i)}; \theta, \phi)$, wherein $p_\theta(x \vert z)$ and the $\mathbb{KL}$ term are differentiable.
  • The last $\mathbb{KL}$ term is $\geq 0$, thus the variational lower bound (ELBO) is derived: $\log p_\theta(x^{(i)}) \geq \mathcal{L}(x^{(i)}; \theta, \phi)$.
  • Training: maximize the lower bound, $\theta^{*}, \phi^{*} = \arg\max_{\theta, \phi} \sum_{i=1}^{N} \mathcal{L}(x^{(i)}; \theta, \phi)$.


  • the first term $\mathbb{E}_{z \sim q_\phi(z \vert x)}[\log p_\theta(x \vert z)]$: reconstruct the input data. It is a negative reconstruction error.
  • the second term $-\mathbb{KL}(q_\phi(z \vert x) \,\|\, p_\theta(z))$: make the approximate posterior distribution close to the prior. It acts as a regularizer.

The derived estimator when using an isotropic multivariate Gaussian prior $p_\theta(z) = \mathcal{N}(z; 0, I)$:

$$\mathcal{L}(x^{(i)}) \simeq \frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log\left((\sigma_j^{(i)})^2\right) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2 \right) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\!\left(x^{(i)} \vert z^{(i,l)}\right)$$

where $z^{(i,l)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)}$ and $\epsilon^{(l)} \sim \mathcal{N}(0, I)$.
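For a diagonal Gaussian posterior and a standard normal prior, the KL term has the closed form that appears as the first sum of the estimator. A quick NumPy check (with made-up values of $\mu$ and $\log \sigma^2$) confirms it against a Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.3])        # made-up posterior mean
log_var = np.array([0.1, -0.2])   # made-up posterior log-variance
sigma2 = np.exp(log_var)

# closed-form KL(q(z|x) || p(z)) for diagonal Gaussian q and standard normal p,
# i.e. the negative of the first sum in the estimator above
kl_closed = -0.5 * np.sum(1 + log_var - mu**2 - sigma2)

# Monte Carlo check: average log q(z) - log p(z) over samples z ~ q(z|x)
z = mu + np.sqrt(sigma2) * rng.normal(size=(200_000, 2))
log_q = -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (z - mu)**2 / sigma2, axis=1)
log_p = -0.5 * np.sum(np.log(2 * np.pi) + z**2, axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # the two estimates agree closely
```

Because this term is available in closed form, only the reconstruction term needs Monte Carlo sampling during training.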


Reparameterization trick

Given the deterministic mapping $z = g_\phi(\epsilon, x)$, we know that

$$q_\phi(z \vert x) \prod_i dz_i = p(\epsilon) \prod_i d\epsilon_i$$

so expectations under $q_\phi(z \vert x)$ can be rewritten as expectations under the fixed noise distribution $p(\epsilon)$, making the Monte Carlo estimate differentiable w.r.t. $\phi$.


Take the univariate Gaussian case for example: for $z \sim q_\phi(z \vert x) = \mathcal{N}(\mu, \sigma^2)$, a valid reparameterization is $z = \mu + \sigma \epsilon$, where the auxiliary noise variable $\epsilon \sim \mathcal{N}(0,1)$.
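A quick NumPy sanity check of this univariate case (with arbitrary $\mu$ and $\sigma$) shows that $\mu + \sigma \epsilon$ indeed has the desired distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.4  # arbitrary target mean and standard deviation

# sampling z ~ N(mu, sigma^2) directly is not differentiable w.r.t. mu, sigma;
# instead draw the auxiliary noise epsilon ~ N(0, 1) and apply the
# deterministic, differentiable mapping z = mu + sigma * epsilon
eps = rng.normal(size=1_000_000)
z = mu + sigma * eps

print(z.mean(), z.std())  # close to mu and sigma
```

The gradient now flows through the deterministic mapping to `mu` and `sigma`, while all randomness is isolated in `eps`.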


  • After training, remove the encoder network and use the decoder network to generate.
  • Sample $z$ from the prior as the input!
  • A diagonal prior on $z$ -> independent latent variables!
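Generation can be sketched with a hypothetical (untrained, randomly weighted) decoder: sample $z$ straight from the prior and decode it. Because the prior is diagonal, each latent coordinate can be swept independently:

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim, x_dim = 2, 784  # illustrative sizes
W_dec = rng.normal(0, 0.1, (x_dim, z_dim))  # hypothetical decoder weights

def decode(z):
    # decoder maps latent z to Bernoulli means over pixels (values in (0, 1))
    return 1.0 / (1.0 + np.exp(-(W_dec @ z)))

# generation needs no encoder: sample z from the prior p(z) = N(0, I)
z = rng.normal(size=z_dim)
x_generated = decode(z)

# with a diagonal prior, latent dimensions are independent, so sweeping one
# coordinate at a time traverses one factor of variation
sweep = [decode(np.array([t, 0.0])) for t in np.linspace(-3, 3, 7)]
```

With a trained decoder, such a sweep is what produces the familiar grids of smoothly varying digits or faces.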


  • Different dimensions of $z$ encode interpretable factors of variation.
  • Good feature representation that can be computed using $q_\phi(z \vert x)$!

Pros & cons

  • Probabilistic spin on traditional autoencoders => allows generating data
  • Defines an intractable density => derives and optimizes a (variational) lower bound


Pros:

  • Principled approach to generative models
  • Allows inference of $q(z \vert x)$, which can be a useful feature representation for downstream tasks


Cons:

  • Maximizes a lower bound of the likelihood: okay, but not as good an evaluation as PixelRNN / PixelCNN
  • Samples are of lower quality compared to the state of the art (GANs)

