Yekun's Note

Machine learning notes and writeup.

Fork me on GitHub

Generative Adversarial Networks

GANs are widely applied to estimate generative models without any explicit density function, which instead take the game-theoretic approach: learn to generate from training distribution via 2-player games.

Generative Adversarial Networks (GAN)


GANs consist of two components:

  • Generator network $G$: try to fool the discriminator by generating real-looking images
  • Discriminator network $D$: try to distinguish between real and fake images


Train joinly in minimax game -> minimax objective function


  • $D$ outputs likelihood in (0,1) of real image
  • Discriminator $D$ aims to maximize the objective such that $D(x) \approx 1$(real) and (fake).
  • Generator $G$ aims to minimize the objective such that $D(G(z)) \approx 1$ (to fool the $D$)

The objective is also:

  1. Gradient ascent on discriminator $D$:
  • (not work well) Gradient descent on $G$ minimizes the $D$ being correct:

    Problems: when samples are likely fake, the gradient the the left region of the figure is relatively flat!

  • Instead, gradient ascent on $G$ maximize the likelihood of $D$ being wrong


Minibatch SGD training

  1. for number of training iterations do
    1. for $k$ steps do
      • Sample minibatch of $m$ noise samples from noise distribution
      • Sample minibatch of $m$ examples from data generating distribution
      • Update the discriminator by ascending its stochastic gradient:
    2. Sample minibatch of $m$ noise samples $m$ noise samples from noise distribution
    3. Update the $G$ by ascending its stochastic gradient:


After training, use $G$ to generate new images


Parzen-window density estimator[9] should be avoided for evaluation.

  • a.k.a. Kernel Density Esitimator (KDE)
  • An estimator with kernel $K$ and brandwidth $h$:
  • In generative model evaluation, $K$ is usually density function of standard Gaussian distribution.
  • Parzen-window estimator can be unreliable

The average log-likelihood[9] is also not correlated with the sample qualities, i.e., a model have poor log-likelihood and produce great samples, or have great log-likelihood and produce poor samples.

Theoretical Results

The global optimum

  • For fixed $G$, the optimal $D$ is

The training criteron

  1. The global minimum of the virtual training criterion $C(G)$ gets the value $-\log4$ iff .

The salient difference between VAEs and GANs when using backprop:

  • VAEs cannot have discrete variables at the input to the generator
  • GANs cannot have discrete variable at the output of the generator.



  • Beautiful, sota samples!


  • Trickier / unstable to train
  • Cannot solve inference queries such as $p(x)$, $p(z \vert x)$

Active research

  • Better loss function, more stable training (Wasserstein GAN, LSGAN, …)
  • Conditional GANs, GANs for all kinds of applications

Generative models:

  • PixelCNN/ PixelRNN: Explicit density model, optimizes exact likelihood, good samples. But inefficient sequential generation
  • VAE: Optimize variational lower bound on likelihood. Useful latent representations, inference queries. But current sample quality not the best.
  • GAN: Game-theoretic approach, best samples! But can be tricky and unstable to train, no inference queries.

Deep Convolutional GAN (DCGAN)

DCGAN architecture[2]

  • Generator $G$ => upsampling network with fractionally-strided convolutions
  • Discriminator $D$ => CNN
  • Replace pooling layers with strided convolutions ($D$) and fractional-strided convolutions ($G$)
  • BatchNorm in both $G$ and $D$
  • Remove FC hidden layer for deeper architectures
  • Use ReLU activation in $G$ for all layers except for the output which uses $\tanh$
  • Use LeakyReLU in $D$ for all layers

upload successful

Vector Arithmetic

Inspired by Word2Vec that vec(“King”) - vec(“man”) + vec(“woman”) = vec(“Queen”), DCGAN found similar latent semantic representation space in $Z$ representations in the generator $G$.

  • The arithmetic between single samples is unstable, thus DCGAN takes the average vector of three examplars.
  • z(“smiling woman”) - z(“neutral woman”) + z(“neutral man”) = z(“smiling man”)
    DCGAN vector semantics

  • z(“man with glasses”) - z(“man without glasses”) + z(“woman without glasses”) = z(“woman with glasses”)
    DCGAN glasses

Improved Techniques for Training GANs

  • Feature Matching[8] specifies the objective for $G$ to prevent the overtraining on the current $D$. Let $\mathbf{f(x)}$ be activations on intermediate layers of the $D$, the objective is:

  • Minibatch Discrimination is applied to allow the $D$ to look at multiple data in combination so as to tell the output of $G$ to be more dissimilar to each other. Let be a feature vector for input , which is multiplied with a tensor $T \in \mathbb{R}^{A \times B \times C}$:

Allows to incoporate side information from other samples and is superior to feature matching in the unconditional setting. This helps addressing mode collapse by allowing $D$ to detect if the generated samples are too close to each other.

  • Historical averaging modifies each palyer’s cost as the term

    where $\mathbf{\theta}[i]$ is the parameters at past time $i$.

  • One-sided label smoothing applies smoothed values like .0 or .1 as the target of 0 and 1.

  • Virtual batch normalization uses a reference batch (fixed) to compute normalization statistics and constract a batch containing the sample and reference batch.
  • Inception score has been empirically shown to be well correlated with human judgement.

Conditional GAN

Conditional GAN[6] feeds the input $y$ to conditional on both the generator $G$ and discriminator $D$.

  • In $G$, the prior input noise and $y$ are combined in joint hidden representations;
  • In $D$, $x$ and $y$ are presented as inputs.

The objective function is:


InfoGAN [20] maximizes the mutual information between a small subset of the latent variables and the observation. Given the generator $G$ with both the incompressible noise $z$ and the latent code $c$, the generator becomes $G(z,c)$. InfoGAN applies the information-theoretic regularization, $i.e.$, the high mutual information (MI) between latent codes $c$ and generator distribution $G(z,c)$.

InfoGAN aims to solve the information-regularized minimax game:

In practice, the mutual information term $I(c; G(z,c))$ is hard to maximize directly as it requires the posterior $P(c \vert x)$. InfoGAN defines an auxiliary distribution $Q(c \vert x)$ to approximate $P(c \vert x)$:

The variational lower bound of the mutual information $I(c;G(z,c))$:

where can be maximized w.r.t. $Q$ directly and w.r.t. $G$ via the reparametrization trick. Hence can be added to the GAN’s objectives with no changes to GAN’s training procedure.

Thus, InfoGAN defines the minimax game with a variational regularization of mutua information and a hyperparameter $\lambda$:

Wasserstein GAN (WGAN)

WGAN[7] designed the objective function such that $G$ minimizes the Earth Mover/Wasserstein distance between data and generative distributions.
It improved the stability of learning, avoiding to balance generator $G$ and discriminator $D$’s capacity properly. It also got rid of the mode collapse.

EM distance

It applies Earth Mover (EM) distance or Wasserstein-1:

  • where denotes the set of all joint distributions $\gamma(x,y)$ whose marginals are respectively and .
  • “Intuitively, $\gamma(x,y)$ indicates how much ‘mass’ must be transported from $x$ to $y$ in order to transform the distributions into . The EM distance is the ‘cost’ of the optimal transport plan.”[7]
    -In EM distance, the infimum $\inf$ is intractable to compute!

Thus it applies Kantorovinch-Rubinstein duality:

  • where the supremum is over all the 1-Lipschitz functions $f: \chi \rightarrow \mathbb{R}$

$f: X \rightarrow Y$ is $K$-Lipschitz if for distance functions and on $X$ and $Y$,

Assume that we search over a parameterized family of functions which $w \in \mathcal{W}$:

  • For induced by we can backprop through :

WGAN pseudocode


  • $\alpha=1e-5$: learning rate
  • $c=0.01$: clipping hyperparameter
  • $m=64$: batch size
  • $n_\text{critic}=5$: the # of iterations of the critic per generator iteration
  • : initial critic parameters; : initial generator’s parameters

while $\theta$ has not converged do

  1. for $t=0,\cdots, n_\text{critic}$ do:
    1. Sample , a batch from the real data
    2. Sample a batch of prior samples
    3. Update
    4. Update
    5. Update
  2. Sample a batch of prior samples
  3. Update
  4. Update


  • It can be seen that the gradient of regular GAN’s discriminator could get vanishing gradients whereas the WGAN’s critic cannot saturate and converge to a linear function with gradients everywhere.
  • Momentum-based optimizer like Adam perform worse since the loss for the critic is non-stationary, whereas RMSProp is known to perform well even on very non-stationary problems.

  • No mode collapse has been evidenced during WGAN experiments.


Problems of WGAN:

  • the weight clipping for $k$-Lipshitz constraint biases the $D$ towards much simpler functions.
  • “It is observed that our NNs try to attain the maximum gradient norm $k$ and end up learning extremely simple functions”[10] (see figures below).

WGAN-GP value surface

WGAN-GP weights

Gradient Penalty

The gradient penalty[10] (WGAN-GP) is an alternative to implement the Lipschitz constraints. “The differentiable function is 1-Lipschtiz iff it has gradients with norm at most 1 everywhere”[10].

The new objective is:

  • Sample distribution: define sampling uniformly along straight lines between pairs $\hat{\mathbf{x}}$ and $\mathbf{x}$ sampled from the data distribution and generator distribution , i.e.:



  • gradient penalty coefficient $\lambda = 10$
  • # of critic iterations per generator iteration
  • batch size $m$
  • Adam hyperparameters
  • Initial critic parameters , initial generator parameters

while $\theta$ has not converged do

  1. for do:
    1. for $i=1, \cdots, m$ do:
      1. Sample real data , latent variable , a random number $\epsilon \sim U[0,1]$
      2. forward pass of $G$:
      3. $ \color{orange}{ \hat{\mathbf{x}}} \leftarrow \epsilon \color{blue}{\mathbf{x}} + (1-\epsilon) \color{red}{\mathbf{\tilde{x}}} $
      4. Objective:
    2. Update $D$:
  2. Sample a batch of latent variables
  3. Update $G$:

Spectral Normalization GAN (SN-GAN)

Spectral normalization[11] is a novel weight normalization method to stablize the training of discriminator $D$.

For a linear layer , the norm is given by:

If the Lipchiz norm of the activation function , we can use the inequality to observe the following bound on :

Our spectral normalization normalizes the spectral norm of the weigth matrix $W$ so that it satisfies the Lipschitz constraint $\sigma(W)=1$:

Projection Discriminator

[14] proposed a novel projection based discriminator to incorporate conditional information into the $D$ of GANs that respects the role of the conditional information.

Self-Attention GAN (SAGAN)

SA-GAN[17] applies self-attention op to capture the long0range depdencies in images, whereas conventional convolution op learns the information in a local neighborhood which requires deep layers to model larger receptive regions.


Given images features from prev layer , firstly transform into two feature spaces $f$, $g$:


  • where , are 1x1 convolutions.
  • $C$ is the # of channels, $N$ is the # of feature locations

Further, a scale parameter are applied on the attention layer output:

where $\gamma$ is a learnable scalar and is initialized as 0.

Hinge adversarial loss

SAGAN applies the hinge adversarial loss ([13])

Stablizing techniques

  • Spectral normalization [11]
  • two timescale update rule (TTUR) [12]


  • Large batch size, model size
  • Fuse class information at all levels (cGAN)
  • Hinge loss
  • Orthonormal regularization & Truncation trick

BigGAN [18] applies onthogonal regularization to directly enforces the orthogonality condition:

BigGAN architecture:


StyleGAN[19] automatically learned unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stocastic variation in generation.

  • Given a latent code $\mathbf{z}$ in the input latent space $\mathcal{Z}$, a non-linear mapping network $f: \mathcal{Z} \rightarrow \mathcal{W}$ first produces $\mathbf{w} \in \mathcal{W}$. Here $f$ is an 8-layer MLP.
  • The learned affine transformations then specialize $\mathbf{w}$ to styles that control the adaptive instance normalization (AdaIN):

    where each feature map $\mathbf{x}_i$ is normalized separately, and then scaled and biased using learned style $\mathbf{y}$.

  • The explicit noise inputs are broadcast to all feature maps using learned per-feature scaling factors and then added to the convolution output.


  1. 1.Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., & Bengio, Y. (2014). Generative Adversarial Networks. ArXiv, abs/1406.2661.
  2. 2.Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. CoRR, abs/1511.06434.
  3. 3.Goodfellow, I.J. (2017). NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv, abs/1701.00160.
  4. 4.cs236 notes
  5. 5.cs231 slides
  6. 6.Mirza, M., & Osindero, S. (2014). Conditional Generative Adversarial Nets. ArXiv, abs/1411.1784.
  7. 7.Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. ArXiv, abs/1701.07875.
  8. 8.Salimans, T., Goodfellow, I.J., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved Techniques for Training GANs. ArXiv, abs/1606.03498.
  9. 9.Theis, L., Oord, A.V., & Bethge, M. (2015). A note on the evaluation of generative models. CoRR, abs/1511.01844.
  10. 10.Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A.C. (2017). Improved Training of Wasserstein GANs. NIPS.
  11. 11.Miyato, T., Kataoka, T., Koyama, M., & Yoshida, Y. (2018). Spectral Normalization for Generative Adversarial Networks. ArXiv, abs/1802.05957.
  12. 12.Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NIPS.
  13. 13.Lim, J.H., & Ye, J.C. (2017). Geometric GAN. ArXiv, abs/1705.02894.
  14. 14.Miyato, T., & Koyama, M. (2018). cGANs with Projection Discriminator. ArXiv, abs/1802.05637.
  15. 15.Zhao, J.J., Mathieu, M., & LeCun, Y. (2016). Energy-based Generative Adversarial Network. ArXiv, abs/1609.03126.
  16. 16.Nowozin, S., Cseke, B., & Tomioka, R. (2016). f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. ArXiv, abs/1606.00709.
  17. 17.Zhang, H., Goodfellow, I.J., Metaxas, D.N., & Odena, A. (2018). Self-Attention Generative Adversarial Networks. ArXiv, abs/1805.08318.
  18. 18.(BigGAN) Brock, A., Donahue, J., & Simonyan, K. (2018). Large Scale GAN Training for High Fidelity Natural Image Synthesis. ArXiv, abs/1809.11096.
  19. 19.Karras, T., Laine, S., & Aila, T. (2018). A Style-Based Generator Architecture for Generative Adversarial Networks. CVPR.
  20. 20.Chen, Xi, et al. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems. 2016.