GANs are widely applied to estimate generative models without any explicit density function, which instead take the game-theoretic approach: learn to generate from training distribution via 2-player games.

# Generative Adversarial Networks (GAN)

## Architecture

GANs consist of two components:

- Generator network $G$: try to fool the discriminator by generating real-looking images
- Discriminator network $D$: try to distinguish between real and fake images

## Training

Train joinly in **minimax game** -> minimax objective function

where

- $D$ outputs likelihood in (0,1) of real image

- Discriminator $D$ aims to
**maximize the objective**such that $D(x) \approx 1$(real) and (fake). - Generator $G$ aims to
**minimize the objective**such that $D(G(z)) \approx 1$ (to fool the $D$)

The objective is also:

**Gradient ascent**on discriminator $D$:

- (not work well)
*Gradient descent*on $G$ minimizes the $D$ being correct:Problems: when samples are likely fake, the gradient the the left region of the figure is relatively flat!

- Instead,
**gradient ascent**on $G$ maximize the likelihood of $D$ being wrong

### Pseudocode

Minibatch SGD training

**for**number of training iterations**do****for**$k$ steps**do**- Sample minibatch of $m$ noise samples from noise distribution
- Sample minibatch of $m$ examples from data generating distribution
- Update the discriminator by ascending its stochastic gradient:

- Sample minibatch of $m$ noise samples $m$ noise samples from noise distribution
- Update the $G$ by ascending its stochastic gradient:

### Generate

After training, use $G$ to generate new images

### Evaluation note

**Parzen-window density estimator**^{[9]} should be avoided for evaluation.

- a.k.a. Kernel Density Esitimator (KDE)
- An estimator with kernel $K$ and brandwidth $h$:
- In generative model evaluation, $K$ is usually density function of standard Gaussian distribution.
- Parzen-window estimator can be
**unreliable**

The average **log-likelihood**^{[9]} is also not correlated with the sample qualities, i.e., a model have poor log-likelihood and produce great samples, or have great log-likelihood and produce poor samples.

## Theoretical results

The global optimum

- For fixed $G$, the optimal $D$ is

The training criteron

- The global minimum of the virtual training criterion $C(G)$ gets the value $-\log4$ iff .

The salient *difference* between VAEs and GANs when using backprop:

**VAEs**cannot have discrete variables at the**input**to the generator**GANs**cannot have discrete variable at the**output**of the generator.

## Comparison

Pros:

- beautiful, sota samples!

Cons:

- Trickier / unstable to train
- cannot solve inference queries such as $p(x)$, $p(z \vert x)$

Active research

- better loss function, more stable training (Wasserstein GAN, LSGAN, …)
- Conditional GANs, GANs for all kinds of applications

Generative models:

*PixelCNN*/*PixelRNN*: explicit density model, optimizes exact likelihood, good samples. But inefficient sequential generation*VAE*: optimize variational lower bound on likelihood. Useful latent representations, inference queries. But current sample quality not the best.*GAN*: game-theoretic approach, best samples! But can be tricky and unstable to train, no inference queries.

# Deep Convolutional GAN (DCGAN)

**DCGAN** architecture^{[2]}

- Generator $G$ => upsampling network with fractionally-strided convolutions
- Discriminator $D$ => CNN
- Replace pooling layers with
**strided convolutions**($D$) and**fractional-strided convolutions**($G$) *BatchNorm*in both $G$ and $D$- Remove FC hidden layer for deeper architectures
- Use
*ReLU*activation in $G$ for all layers except for the output which uses $\tanh$ - Use
*LeakyReLU*in $D$ for all layers

## Vector arithmetic

Inspired by *Word2Vec* that vec(“King”) - vec(“man”) + vec(“woman”) = vec(“Queen”), DCGAN found similar latent semantic representation space in $Z$ representations in the generator $G$.

- The arithmetic between single samples is unstable, thus DCGAN takes the average vector of three examplars.
z(“smiling woman”) - z(“neutral woman”) + z(“neutral man”) = z(“smiling man”)

z(“man with glasses”) - z(“man without glasses”) + z(“woman without glasses”) = z(“woman with glasses”)

# Improved techniques for training GANs

**Feature matching**^{[8]}specifies the objective for $G$ to prevent the overtraining on the current $D$. Let $\mathbf{f(x)}$ be activations on intermediate layers of the $D$, the objective is:**Minibatch discrimination**is applied to allow the $D$ to look at multiple data in combination so as to tell the output of $G$ to be more dissimilar to each other. Let be a feature vector for input , which is multiplied with a tensor $T \in \mathbb{R}^{A \times B \times C}$:

Allows to incoporate side information from other samples and is superior to feature matching in the unconditional setting. This helps addressing mode collapse by allowing $D$ to detect if the generated samples are too close to each other.

**Historical averaging**modifies each palyer’s cost as the termwhere $\mathbf{\theta}[i]$ is the parameters at past time $i$.

**One-sided label smoothing**applies smoothed values like .0 or .1 as the target of 0 and 1.

**Virtual batch normalization**uses a reference batch (fixed) to compute normalization statistics and constract a batch containing the sample and reference batch.

*Inception score*has been empirically shown to be well correlated with human judgement.

# Conditional GAN

Conditional GAN^{[6]} feeds the input $y$ to conditional on both the generator $G$ and discriminator $D$.

- In $G$, the prior input noise and $y$ are combined in joint hidden representations;
- In $D$, $x$ and $y$ are presented as inputs.

The objective function is:

# Wasserstein GAN (WGAN)

WGAN^{[7]} designed the objective function such that $G$ minimizes the Earth Mover/Wasserstein distance between data and generative distributions.

It improved the stability of learning, avoiding to balance generator $G$ and discriminator $D$’s capacity properly. It also got rid of the *mode collapse*.

## EM distance

It applies **Earth Mover** (EM) distance or Wasserstein-1：

- where denotes the set of all joint distributions $\gamma(x,y)$ whose marginals are respectively and .
- “Intuitively, $\gamma(x,y)$ indicates how much ‘mass’ must be transported from $x$ to $y$ in order to transform the distributions into . The EM distance is the ‘cost’ of the optimal transport plan.”
^{[7]}

-In EM distance, the infimum $\inf$ is intractable to compute!

Thus it applies **Kantorovinch-Rubinstein duality**:

- where the supremum is over all the 1-Lipschitz functions $f: \chi \rightarrow \mathbb{R}$

$f: X \rightarrow Y$ is $K$-Lipschitz if for distance functions and on $X$ and $Y$,

Assume that we search over a parameterized family of functions which $w \in \mathcal{W}$:

- For induced by we can backprop through :

## WGAN pseudocode

Given:

- $\alpha=1e-5$: learning rate
- $c=0.01$: clipping hyperparameter
- $m=64$: batch size
- $n_\text{critic}=5$: the # of iterations of the critic per generator iteration
- : initial critic parameters; : initial generator’s parameters

**while** $\theta$ has not converged **do**

- for $t=0,\cdots, n_\text{critic}$ do:
- Sample , a batch from the real data
- Sample a batch of prior samples
- Update
- Update
- Update

- Sample a batch of prior samples
- Update
- Update

## Empirical results

- It can be seen that the gradient of regular GAN’s discriminator could get vanishing gradients whereas the WGAN’s critic cannot saturate and converge to a linear function with gradients everywhere.

Momentum-based optimizer like Adam perform worse since the loss for the critic is non-stationary, whereas

**RMSProp**is known to perform well even on very non-stationary problems.No mode collapse has been evidenced during WGAN experiments.

# WGAN-GP

**Problems** of WGAN:

- the
**weight clipping**for $k$-Lipshitz constraint biases the $D$ towards much simpler functions. - “It is observed that our NNs try to attain the maximum gradient norm $k$ and end up learning extremely simple functions”
^{[10]}(see figures below).

## Gradient penalty

The *gradient penalty*^{[10]} (WGAN-GP) is an alternative to implement the Lipschitz constraints. “The differentiable function is 1-Lipschtiz iff it has gradients with norm at most 1 everywhere”^{[10]}.

The new objective is:

**Sample distribution**: define sampling uniformly along straight lines between pairs $\hat{\mathbf{x}}$ and $\mathbf{x}$ sampled from the data distribution and generator distribution , i.e.:

## Pseudocode

Given:

- gradient penalty coefficient $\lambda = 10$
- # of critic iterations per generator iteration
- batch size $m$
- Adam hyperparameters
- Initial critic parameters , initial generator parameters

**while** $\theta$ has not converged **do**

- for do:
- for $i=1, \cdots, m$ do:
- Sample real data , latent variable , a random number $\epsilon \sim U[0,1]$
- forward pass of $G$:
- $ \color{orange}{ \hat{\mathbf{x}}} \leftarrow \epsilon \color{blue}{\mathbf{x}} + (1-\epsilon) \color{red}{\mathbf{\tilde{x}}} $
- Objective:

- Update $D$:

- for $i=1, \cdots, m$ do:
- Sample a batch of latent variables
- Update $G$:

# Spectral Normalization GAN (SN-GAN)

**Spectral normalization**^{[11]} is a novel weight normalization method to stablize the training of discriminator $D$.

For a linear layer , the norm is given by:

If the Lipchiz norm of the activation function , we can use the inequality to observe the following bound on :

Our *spectral normalization* normalizes the spectral norm of the weigth matrix $W$ so that it satisfies the Lipschitz constraint $\sigma(W)=1$:

# Projection Discriminator

^{[14]} proposed a novel **projection based discriminator** to incorporate conditional information into the $D$ of GANs that respects the role of the conditional information.

# Self-Attention GAN (SAGAN)

SA-GAN^{[17]} applies self-attention op to capture the long0range depdencies in images, whereas conventional convolution op learns the information in a local neighborhood which requires deep layers to model larger receptive regions.

## Self-attention

Given images features from prev layer , firstly transform into two feature spaces $f$, $g$:

Thus,

- where , are 1x1 convolutions.
- $C$ is the # of channels, $N$ is the # of feature locations

Further, a scale parameter are applied on the attention layer output:

where $\gamma$ is a learnable scalar and is initialized as 0.

## Hinge adversarial loss

SAGAN applies the hinge adversarial loss (^{[13]})

## Stablizing techniques

- Spectral normalization
^{[11]} - two timescale update rule (TTUR)
^{[12]}

# BigGAN

- Large batch size, model size
- Fuse class information at all levels (cGAN)
- Hinge loss
- Orthonormal regularization & Truncation trick

BigGAN ^{[18]} applies onthogonal regularization to directly enforces the orthogonality condition:

**BigGAN architecture**:

# StyleGAN

StyleGAN^{[19]} automatically learned unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stocastic variation in generation.

- Given a latent code $\mathbf{z}$ in the input latent space $\mathcal{Z}$, a non-linear mapping network $f: \mathcal{Z} \rightarrow \mathcal{W}$ first produces $\mathbf{w} \in \mathcal{W}$. Here $f$ is an 8-layer MLP.
The learned affine transformations then specialize $\mathbf{w}$ to

*styles*that control the**adaptive instance normalization**(AdaIN):where each feature map $\mathbf{x}_i$ is normalized separately, and then scaled and biased using learned style $\mathbf{y}$

The explicit noise inputs are broadcast to all feature maps using learned per-feature scaling factors and then added to the convolution output.

# References

- 1.Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., & Bengio, Y. (2014). Generative Adversarial Networks. ArXiv, abs/1406.2661. ↩
- 2.Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. CoRR, abs/1511.06434. ↩
- 3.Goodfellow, I.J. (2017). NIPS 2016 Tutorial: Generative Adversarial Networks. ArXiv, abs/1701.00160. ↩
- 4.cs236 notes ↩
- 5.cs231 slides ↩
- 6.Mirza, M., & Osindero, S. (2014). Conditional Generative Adversarial Nets. ArXiv, abs/1411.1784. ↩
- 7.Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. ArXiv, abs/1701.07875. ↩
- 8.Salimans, T., Goodfellow, I.J., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved Techniques for Training GANs. ArXiv, abs/1606.03498. ↩
- 9.Theis, L., Oord, A.V., & Bethge, M. (2015). A note on the evaluation of generative models. CoRR, abs/1511.01844. ↩
- 10.Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A.C. (2017). Improved Training of Wasserstein GANs. NIPS. ↩
- 11.Miyato, T., Kataoka, T., Koyama, M., & Yoshida, Y. (2018). Spectral Normalization for Generative Adversarial Networks. ArXiv, abs/1802.05957. ↩
- 12.Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. NIPS. ↩
- 13.Lim, J.H., & Ye, J.C. (2017). Geometric GAN. ArXiv, abs/1705.02894. ↩
- 14.Miyato, T., & Koyama, M. (2018). cGANs with Projection Discriminator. ArXiv, abs/1802.05637. ↩
- 15.Zhao, J.J., Mathieu, M., & LeCun, Y. (2016). Energy-based Generative Adversarial Network. ArXiv, abs/1609.03126. ↩
- 16.Nowozin, S., Cseke, B., & Tomioka, R. (2016). f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. ArXiv, abs/1606.00709. ↩
- 17.Zhang, H., Goodfellow, I.J., Metaxas, D.N., & Odena, A. (2018). Self-Attention Generative Adversarial Networks. ArXiv, abs/1805.08318. ↩
- 18.(BigGAN) Brock, A., Donahue, J., & Simonyan, K. (2018). Large Scale GAN Training for High Fidelity Natural Image Synthesis. ArXiv, abs/1809.11096. ↩
- 19.Karras, T., Laine, S., & Aila, T. (2018). A Style-Based Generator Architecture for Generative Adversarial Networks. CVPR. ↩