Yekun's Note

Machine learning notes and writeup.

Fork me on GitHub

Sequence GANs in a Nutshell

Background: Conventional maximum likelihood approaches for sequence generation with teacher forcing algorithms are inherently prone to exposure bias at the inference stage due to the training-testing discrepancy—the generator produces a sequence iteratively conditioned on its previously predicted ones that may be never observed during training—leading to accumulative mismatch with the increment of generated sequences. In other words, the model is only trained on demonstrated behaviors (real data samples) but not free-running mode.
Generative Adversarial Networks (GANs) hold the promise of mitigating such issues for generating discrete sequences, such as language modeling, speech/music generation, etc.

GANs have demonstrated the compelling performance in generating real-valued data such as pixel-based images but have fallen short of discrete data generation primarily resulting from the incapability of gradient propagation passing from the discriminator (denoted as $\mathcal{D}$) to the generator (denoted as $\mathcal{G}$) in the original (image) GAN framework, which is incurred by the non-differential sampling/argmax operation in between.

Existing solutions to discrete sequence generation using GANs could be mainly sorted into different groups by resorting to:

  1. Reinforcement Learning (RL): modeling the sequence generation procedure as a sequential decision-making process [1][6][7][8]; typically yielding high-variance but unbiased gradient estimates.
  2. RL-free: utilizing soft-argmax operator[2], Gumbel-softmax trick[9], or continuous relaxation[16] to provide the continuous approximation of the discrete distribution on the sequences; yielding low variance but biased estimation.


Policy Gradient Gumbel-softmax Soft-argmax Dense Reward Internal Feature Pretraining $\mathcal{G}$ $\mathcal{D}$
MaskGAN (ICLR’18) LSTM (seq2seq) LSTM (seq2seq)
ScratchGAN (NeurIPS’19) LSTM LSTM
ColdGAN (NeurIPS’20) T5 / BART N/A

SeqGAN (AAAI’17)


There exist limitations in discrete sequence generation using GANs, such as:

  1. The discrete output of the generator $\mathcal{G}$;
  2. $\mathcal{D}$ can only assess the complete sequence, while it is non-trivial to balance the current score and future one for partially generated sequence once the entire sequence has been generated.


SeqGAN[1] bypasses the generator differentiation problem by directly performing a policy gradient update, which adopts the judgments of $\mathcal{D}$ on the complete generated sequences as reward signals using Monte Carlo (MC) search.

SeqGAN considers the sequence generation as a sequential decision-making process with a stochastic parameterized policy, in which the generator $\mathcal{G}$ is treated as the actor/agent of RL, the state is previously generated tokens so far and the action is the next token to be generated.

Image source: [1]


Given a dataset of real-word structured sequences, train a $\theta$-parameterized generative model to produce a sequence , where $\mathcal{Y}$ is the vocabulary of candidate tokens. The policy is stochastic: at the $t$-th timestep, the state $s$ is the current partially predicted sequences , and the action $a$ is the next token $y_t$ to be selected.

The discriminator parameterized by $\phi$ predicts how likely the sampled sequence is from real data, providing the guidance (reward) to update the policy .

Let be the action-value function of a sequence, i.e., the expected accumulative reward starting from the state $s$ taking action $a$ with policy ; be the reward for a complete sequence. The objective of is to generate a sequence from the start state to maxmize its expected reward at the end of the episode:

SeqGAN adopts the estimated probability of being real by as the reward, but $D$ can only provides the reward for a finished sequence. Thus, in order to evaluate the action-value for an intermediate state, Monte Carlo (MC) search with a roll-out policy is applied to sample the unknown last $T-t$ tokens. Let an $N$-time Monte Carlo search be , where and is sampled based on the roll-out policy and the current state.

It runs the roll-out policy starting from current state till the end of the sequence for $N$ times to get a batch of output samples. Thus,

where . The intermediate reward is iteratively defined as the next-state value starting from the state and rolling out to the end.

The is trained as follows:

The gradient of objective function $J(\theta)$ w.r.t. policy’s parameter $\theta$ is:

where is the obvserved intermediate state sampled from .

Training Algorithm

Require: generator policy ; roll-out policy ; discriminator ; a sequence dataset ; learning rate $\alpha$

  1. Initialize , with random weights $\theta$, $\phi$
  2. Pretrain using MLE on $\mathcal{S}$
  3. $\beta \leftarrow \theta$
  4. Generate negative samples using for training
  5. Pretrain via minimizing the cross entropy
  6. repeat
    • for g-steps do
      1. Generate a sequence
        1. for $t$ in $1:T$ do
          • compute using Eq.(\ref{eq1})
      2. Update generator parameters with policy gradient:
    • for d-steps do
      • Use current to generate negative (synthetic) examples and combine with sampled positive (real) examples $\mathcal{S}$
      • Train discriminator for $k$ epochs using Eq.(\ref{eq2})
    • $\beta \leftarrow \theta$
  7. until SeqGAN converges

Model Architecture


$G_\theta$: LSTM actor.

where represents the $t$-th hidden state of LSTMs, $\mathbf{x}_t$ denotes the input embedding at the time step $t$.


$D_\phi$: CNN critic.
The input word embeddings are:

where represents the $k$ dimensional embedding, $\oplus$ is the vertical concatenation operator to build the matrix $\varepsilon_{1:T} \in \mathbb{R}^{T \times k}$. Then a kernel $\mathbf{w} \in \mathbb{R}^{n \times k}$ applies a convolutional operation to extract $n$-gram features:

where $\otimes$ operator is the summation of elementwise product, $b$ is a bias term, $\rho$ is a non-linear function. Then concatenate the output of multi-channel convolutions with various kernel sizes followed by a max-over-time-pooling:

Then apply a highway architecture before the final dense layer:

where , , are highway layer weights, $H$ denotes an affine transform with non-linearity, $\pmb{\tau}$ represents the transform gate.

Finally, apply a sigmoid function to get the probability of being real given the input sequences:

where and are the weight and bias respectively.

TextGAN (ICML’17)


Two fundamental problems of the GAN framework limit their usage in practice:

  1. Mode collapse: $G$ tends to produce a single observation for multiple latent representations.[3]
  2. Vanishing gradient: $G$’s contribution to the learning signal is insubstantial when $D$ is close to its local optimum.[4]

    When $D$ is optimal, using standard GAN’s miminax objective is equivalent to minimizing the Jenson-Shannon Divergence (JSD)[4] between the real data distribution and the synthetic data distribution , where . However, the saddile-point solution of the object is intractable. Thus iteratively updating $D$ and $G$ is required.

    However, standard GAN’s objective suffers from unstable weak learning signal when $D$ gets close to its local minimum resulting from the vanishing gradient problem, which comes from that JSD implied by the original GAN objective approaches to a constant when and share no support, thus minimizing JSD yields no learning signal. This problem also exists in the distance metric of Total Variance Distance (TVD) of energy-based GAN (EBGAN).


TextGAN[2] leverages the kernel-based moment-matching scheme over a Reproducing Kernel Hilbert Space (RKHS) to force the empirical distributions of real and synthetic sentences to have matched moments in latent-feature space, which consequentially ameliorates the mode collapsing issues associated with standard GAN training.

Objective Function

Given a sentence corpus $\mathcal{S}$, TextGAN proposes the objective:

where is the Euclidean distance between the reconstructed latent code $\hat{z}$ and the original code $z$ drawn from prior distribution ; represents the Maximum Mean Discrepany (MMD) between the emprical distribution of sentence embeddings and for synthetic and real data respectively.

$\mathcal{L}(G)$ attempts to adjust to force the synthetic sentences’ features to match the real sentence features encoded by $D(\cdot)$, by matching the empirical distributions of and with a kernel discrepancy metric, MMD.


In Eq.(\ref{eq3}), the reconstruction and MMD loss in $D$ serve as the regularizer to the binary classification loss in that $D$ features tend to be more spread out in the feature space.

Thus, $D(\cdot)$ attempts to select informative sentence features, whereas $G(\cdot)$ aims to match these features. Hyperparameters and act as the trade-off.

The original GAN objective is prone to mode collapsing especially when applying $\log D$ alternative for the generator loss, i.e., replacing the second term of Eq.(\ref{eq3}) with . If so, fake samples are more severely penalized than less diverse samples, thus grossly underestimating the variance of latent features[3].

The $G$’s loss in Eq.(\ref{eq4}) forces $G$ to produce highly diverse sentences to match the variations of real data by latent moment matching, thus alleviating the mode-collapsing problem.

Feature Matching via MMD

MMD measures the mean squared difference between two sets of samples $\mathcal{X}$ andq $\mathcal{Y}$ over a RKHD $\mathcal{H}$ with kernel function $k(\cdot): \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$, where , . The kernel can be written as an inner product over $\mathcal{H}$: , and denotes the feature mapping. Fomally the MMD between $\mathcal{X}$ and $\mathcal{Y}$ is given by:

Here TextGANs adopt a gaussian (rbf) kernel $k(x,y)=\exp\big( - \frac{|x-y|^2}{2 \sigma} \big)$ with brandwidth $\sigma$.

Model Architecture

  • $G$: LSTM generator.
  • $D$: CNN discriminator.



Instability of GAN training: When optimizing $G$ USING $D$’s output as a reward via RL, the policy $G$ has difficulties to get positive and stable reward signals from $D$ even with careful pretraining.

When applying the GAN framework to discrete data, the discontinuity prohibits the update of the generator parameters via standard back-propagation. One way is to employ an RL strategy that directly uses the generator’s output, $D(\cdot)$, or $\log D(\cdot)$ as a reward.

Thus the objective for $G$ is to optimize:

Define the normalized probability distribution in some bounded region to guarantee the integrability ($D$ is an approxmation to if well trained) and also put a maximum-entropy regularizer to encourage diversity, yielding the regularized loss:

where $c(D)$ is a constant only depending on $D$. Hence, optimizing the original GAN is equivalent to minimizing the KL-divergence . However, since initially $p$ generates sentences with bad quality, it has little chance of generating good sequences to get a positive reward. Though with dedicated pre-training and variance reduction mechanisms, RL based on the moving reward signals still shows the unstable training and does not work on large scale datasets.


Maximum-Likelihood Augmented Discrete GAN (MaliGAN)[7] utilizes the information of $D$ as an additional source of training signals on top of the maximum-likelihood objective, significantly reducing the variance during training.

Basic MaliGAN

MaliGAN keeps a delayed copy $p^\prime(\mathbf{x})$ of $G$ who is less often optimized. We know that the optimal $D$ is: ; so we have . Thus MaliGAN sets the target distribution $q$ for maximum likelihood training to be $\frac{D}{1-D}p^\prime$.

Let , we define the augmented target distribution as:

Regarding $q$ as a fixed probablity distribution, the target is to optimize:

This objective has an attractive prob=perty that $q$ is a “fixed” distribution during training, i.e., if $D$ is sufficiently trained, then $q$ is always approximately the data generating distribution .
Defining the gradient as , we have:

where we assume that and the delayed generator is only one step behind the current update in the experiments.

Then $G$ is optimized as:

where $b$ is the baseline to reduce variance. In practice, $b$ increases very slowly from 0 to 1 (as $D$).


MaliGAN with Variance Reduction

Mixed MLE-Mali Training

To alleviate the accumulated variance for long sequence generation, MaliGAN clamps the input using the training data for $N$ time steps< and switch to the free-running mode for the remaining $T-N$ time steps. During training, $N$ slowly moves from $T$ towards 0.


For each sample from the real data batch, if it has length larger than $N$, we fix the first $N$ words of , then sample $n$ times from $G$ till the end of the sequence and get $n$ samples . Then for each mini-batch with $0 \leq N \leq T$:


GSGAN (2016)


In the standard GAN framework, samples from a distribution on discrete objects such as multinomial are not differentiable w.r.t. the distribution parameters.

Gumbel-softmax Distribution

GSGAN[9] uses the Gumbel-softmax distribution parameterized in terms of the softmax function to avoid the non-differential problem in GAN.

The softmax function can be used to parameterize a multinomial distribution on a one-hot-encoding $d$-dimensional vector $\mathbf{y}$ in terms of a continuous $d$-dimensional vector $\mathbf{h}$. Let $\mathbf{p}$ be a $d$-dimensional vector of probabilities specifying the multinomial distribution on $\mathbf{y}$ with .



Sampling $\mathbf{y}$ accoridng to the previous multinomial distribution with probability vector is the same as sampling $\mathbf{y}$ according to

where are independent and follow a Gumbel distribution with zero lcoation and unit scale. The sampled result has gradient zero w.r.t. $\mathbf{h}$ because the $\textrm{one_hot}(\arg\max(\cdot))$ is not differentiable. Thus, GSGAN propose to approximate with a differentiable function based on the soft-max transformtion:

where $\tau$ is an inverse temperature parameter. When $\tau \rightarrow 0$, the samples have the same output as argmax versionl when $\tau \rightarrow \infty$, the samples are always the uniform probability vector. GAN on discrete data can be trained with this, starting with soem relatively large $\tau$ and then annealing it to zero during training.

RankGAN (NIPS’17)


GANs assume the output of $D$ to be a binary predicate indicating whether the given sequence is from real or fake data, which is too restrictive since the diversity and richness of the sentences are constrained by the degenerated distribution due to binary classification.


RankGAN[8] replaces the original binary classifier discriminator with a ranking model by taking a softmax over the expected cosine distances from the generated sequences to the real data. It relaxes the training of binary discriminator to a learning-to-rank optimization problem, consisting of a generator and a ranker . Instead of performing binary classification, the ranker is trained to rank the machine-generated sequences lower than the human-generated sequences.

$G$ is to confuse the ranker $R$ so that synthetic samples are ranked higher than real samples, while $R$ is to rank the synthetic sample (denoted “G” in the figure) lower than human-written setences (denoted “H” in the figure). Thus, $G$ and $R$ play a minimax game:

where denotes the read data from human-written sentences, $C^+, C^-$ are comparison set w.r.t. different input $s$: when $s$ is the real data, $C^-$ generated data pre-sampled from ; If $s$ is the synthetic data, $C^+$ is the human written data.

Rank Score

The relevance score of the input sequence $s$ given a reference $u$ is:

where and are embedded feature vectors of the reference and input sequence, respectively.

Then the ranking score for a sequence $s$ is computed given a comparison set $\mathcal{S}$:

which is similar to Boltzmann exploration in RL. Lower $\gamma$ results in all setenecs to be nearly equiprobable (uniform), while higher $\gamma$ increases the biases towards the sentence with higher score. denotes the set of input sentences to be ranked.


Like SeqGAN, RankGAN employs Monte Carlo rollout methods to simulate the intermediate rewards when a sequence is incomplete. The expected future reward $V$ for partial sequences is computed by:

where represents the complete setence sampled by rollout methods with given partial sequence . Specifically, the beginning tokens are fixed and the rest tokens are consecutively sampled by unitl the last token is generated. It samples $n$ times and take the average ranking score to approximate the expected reward.

The gradient of $G$’s objective is:

In practice, minimizing $\log R(\cdot)$ instead of maximizing $\log (1-R(\cdot))$ performs better to train the ranker $R$. Thus, maximize the ranking objective:

In a sense, replacing binary predicates with (multi-sentence) ranking scores can relieve the gradient vanishing problem.[8]

LeakGAN (AAAI’18)


  1. Sparsity: GANs with policy gradient can only get a scalar guiding signal after generating the entire texts and lack intermediate information about text structure during the generation process, which grossly hinders the generation of long texts (>20 words).
  2. Non-informativeness: the scalar guiding signal for a whole text is non-informative as it does not necessarily preserve the picture about the intermediate syntactic and semantics of the text that is being generated for $G$ to sufficiently learn.


Inspired by Hierarchical Reinforcement Learning (HRL), LeakGAN[6] designs a hierarchical generator $G$, consisting of a high-level “MANAGER” module and a low-level “WORKER” module. In each step, “MANAGER” receives $D$’s high-level feature representation to form the guiding goal for the “WORKER” module, which is a leakage of information from $D$. Then the “WORKER” module firstly encodes the currently generated tokens and combines with the goal embedding to take the final action at the current state. As such, the guiding signals from $D$ is available not only at the end but during the generation process.

LeakGAN can implicitly learn sentence structures, such as punctuation, clause structure, and long suffix without any supervision[6].

Feature Leakage from $D$

LeakGAN allows to provide additional information, i.e., feature of the current sequence to generate .

Typically, can be decomposed into a feature extractor and a final sigmoid classification layer with weight . Mathematically, given the input $s$, we have:

where $f$ is the exxtracted features of CNN after max-over-time pooling.

In each time step $t$, “MANAGER” is an LSTM that takes the extracted feature vector and generates a goal vector , which is then fed into the “WORKER” module to guide the next word’s generation.


The “MANAGER” and “WORKER” of LSTMs are all zero-initialized. At each step, the “MANAGER” receives the leaked feature vector from the $D$ to produce the goal vector as:

where denotes the LSTM of “MANAGER” with parameters and hidden vector .

The goal is a linear transformation $\psi$ with weight matrix with a summation over recent $c$ goals to produce a $k$-dimensional goal embedding as:

Then the “WORKER” takes the current word and combines the output with the goal embedding with a dot product before softmax:

where denotes the LSTM of “WORKER”, $\alpha$ is the temperature to control the generation entropy.

Training of $G$

“MANAGER” is trained to predict advantageous directions in the discriminative feature space and the “WORKER” is intrinsically rewarded to follow such directions. The gradient of manager is defined as:

where is the expected reward under the current policy. $\cos(\cdot)$ measures the cosine similarity between the change of feature representation after $c$ step transitions, i.e., , and the goal vector . This loss functin is intuitively force the goal vector to match the transition inArrow the feature space while achieving high reward.

Meanwhile, the “WORKER” is trined to maximize the reaward using the REINFORCE algorithm:

where the intrinsit reward for “WORKER” is defined as:

To be consistent, in pretraining stage, the gradient of “MANAGER” is:

Interleaved training of MLE and GAN instead of full GAN training after pretraining. Blending these two training would help GAN get rid of some local minimum and alleviate mode collapse. Inserting MLE performs an implicit regularization on GAN to prevent it from going too far away from the MLE solution.

FM-GAN (NeurIPS’18)


TextGAN[2] applied feature matching with MMD in the objective, which is difficult to train:

  1. Choices of the bandwidth of the RBF kernel;
  2. Kernel methods often suffer from poor scaling;
  3. Empirically, TextGAN tends to generate short sentences.


Feature Mover GAN (FM-GAN)[11] leverages earth-mover’s distance (EMD) in optimal transport (OT), which considers the problem of optimally transporting one set of data points to another. FM-GAN proposes feature-mover’s distance, a variant of EMD between the feature distribution of real and synthetic sentences. In this adversarial setting, $D$ aims to maximize the dissimilarity of the feature distributions based on the FMD, while the generator is trained to minimize the FMD by synthesizing more-realistic data.

See [11] for detailed formula of FMD.

MaskGAN (ICLR’18)


Training instability and mode dropping.


MaskGAN[10] introduces an actor-critic conditional GAN that provides rewards at every time step. It fills in missing text conditioned on the surrounding context including text fill-in-the-blank or in-filling tasks, in which portions of the body of text are deleted or redacted. The goal of the model is to infill the missing portions of the text so that it is indistinguishable from the original data.

  • In-filling text: autoregressively output tokens that have thus far filled in as in standard language modeling while conditioned on the true known context.
  • If the entire body of the text is redacted, then this reduces to language modeling.


Let denote pairs of input and target tokens; is the filled-in token. Either real or fake will be passed to $D$ during training.

MaskGAN uses seq2seq encoder-decoder architecture. For a discrete sequence , a binary mask is generated of the same length where determining whether to retain or mask.

The masked sequence is fed to the encoder (as below figure), and the decoder fills in missing tokens auto-regressively conditioned on both the masked input and what has filled-in upfront. The generator decomposes the distribution over the sequence into an ordered conditional sequence:

Generator architecture[10]

The discriminator $D$ has the identical architecture to $G$ except the scalar output at each time step, computing the probability of each token $\tilde{x}_t$ being real given the true context of masked sequences $\mathbf{m(x)}$:

The logrithm of the $D$’s estimates are regarded as the reward:

The critic net is an additional head off the discriminator, estimating the value function in RL.


MaskGAN employs policy gradient estimation for generator $G$:

where $gamma$ is the discount vector, $b_t$ is the critic.

Finally, $D$ is updated with:


  1. Trin LM using MLE for encoder/decoder.
  2. Then pretrain the seq2seq model on the in-filling task using MLE. Select with holdout set.
  3. Not include critic.

SentiGAN (IJCAI’18)

SentiGAN[12] employs $k$ generators with $k$ sentiment labels and one multi-class ($k+1$) discriminator.

Let represent the partially generated sequence , where is a token generated at time $t$. It defines the penalty based loss function at step $t$ for $G$:

where is generated by $D$.

The objective of $G$ is defined with MC search:

$D$ is a CNN-based multi-class discriminator, producing a ${k+1}$-dimensional probability vector. The score at $i$-th ($i \in {1,\cdots,k}$) index represents the probablity of being the $i$-th sentiment, the $(k+1)$-th index denote the probability to be synthetic.

Refer to [12] for details.

RelGAN (ICLR’19)


GANs suffer from mode collapse issue due to either a lack of expressive power in $G$ (not considering many more complex modes in the data distribution), or by a less informative guiding signal in $D$ (constrain the $G$’s update to within certain modes).

The LSTM-based generator might be the bottleneck of GANs with such experimental observations:

  1. $D$’s loss value very quickly goes t near minimum after few iterations, which means $D$ may be more powerful than $G$ and can easily distinguish between real/fake samples;
  2. Mode collapse may partly indicate the incapacity of $G$, as it may not be expressive enough to fit all modes of data distribution;
  3. Existing GANs perform poorly at long sentence generation, and LSTM encodes all previous sequences into a fixed hidden vector, potentially limiting its ability to modeling long-distance dependency.


RelGAN[13] employs a relational memory based generator; Gumbel-softmax trick; and multi-representations in $D$.

Relational Memory based $G$

As below figure, let each row of the memory denote a memory slot. Given input at time $t$ and $H$ heads, the memory is updated with self-attention mechanisms.

For each head, we have query , key , and value , where $[;]$ denotes row-wise concatenation. Thus, the updated memory :

where is the column dimension of , $[:]$ denotes column-wise concatenation.

Then the next memory is computed with skip-connections/MLP/gated operations.

Gumbel-Softmax Relaxation

The multinomial softmax can be parameterized as:

where denotes the $i$-th entry of and is from the $i.i.d.$ Gumbel distribution with .

Further, the one-hot with argmax op can be approximated as:

where the incerse temperature $\beta \in \mathbb{R}+$ is a tunable parameter. Large $\beta$ encourages exploration for better sample diversity while smaller one does more explitation for bettter sample quality.

Thus it has an exponential policy: , where denotes the maximum inverse temperature, $N$ is the maximum # of training iteration, $n$ denotes current iteration. The increase rate of inverse temperature is from exploitation phrase to exploration phrase.

Multiple Representaions in $D$

RelGAN applies multiple embedded representations for each input with each independently passed through CNN-based classifiers to get the score. Finally, take the average of different representations as the final guiding signal to update $G$. This resembles the use of multiple discriminators in image GANs but keeps a weight-sharing CNN-based classifier to curtail the computational cost.


Loss function

RelGAN use the loss of Relativistic GAN (RSGAN), i.e., for $a,b \in \mathbb{R}$.

Intuitively, this loss is to directly estimate the average probability that real sentences are more realistic than generated sentences in terms of different embedded representations.

ScratchGAN (NeurIPS’19)


Having suffered from challenges with gradient estimation, optimization instability, and mode collapse, existing language GANs resorted to MLE pretraining followed by adversarial fine-tuning with restrictive fine-tuning epochs and a small learning rate.
This suggests that “the best-performing GANs tend to stay close to the solution given by MLE training”. Even with pre-training, it shows that discrete GANs do not improve over MLE training.

Learning Signals

The REINFORCE gradient estimator for $G$:

where $R(\mathbf{x})$ is provided by $D(\cdot)$. When setting , it recovers the MLE estimator:

The gradient updates of MLE can be seen as a special case of the REINFORCE updates in discrete GAN training, whereas the language GANs’ rewards are learned.

We postulate the learned rewards provide a smoother signal to $G$ than classical MLE loss: $D$ can learn to generalize and provide a meaningful signal over parts of the distribution uncovered by the training data. As the training progresses and the signal from $D$ improves, $G$ also explores other parts of the data space, providing a natural curriculum, whereas MLE training is only exposed to the expert demonstration (real data).


ScratchGAN[14] combines existing techniques such as large batch sizes, dense rewards, and discriminator regularization to stabilize and improve the discrete GANs.

Dense Rewards

ScratchGAN emplolys a recurrent discriminator to provide rewards for each generated token. The discriminator learns to distinguish between sentence prefixees coming from real data and sampled sentence prefixes:

The recurrent $D$ is much cheaper than Monte Carlo Tree Search (MCTS) to score partial sentences.

FOr the geerated token , the reward at time step $t$ is scaled linearly with $D$’s output:

The goal of $G$ at timestep $t$ is to maximize the sum of discounted future rewards using a discount factor $\gamma$:

Large Batch Size for Variance Reduction

$G$ is updated using MC estimates of policy gradients, where $N$ is the batch size:

ScratchGAN uses a global moving-average of rewards as a baseline .


  • $D$ and $G$ both use an embedding layer followed by one or more LSTM layers.
  • Discriminator regularization: layer normalization, dropout, L2 weight decay.
  • Concatenating the fixed sinusoidal position matrices and word embeddings in $D$.



MLE is equivalent to minimizing the KL divergence between the empirical data distribution and the model distribution, which tends to favor approximations of model distribtuion that overgeneralize the data distribtuion. Instead the reverse KL divergence favors under-generalization. JSD combines KL and reverse KL, which is symmetric.

GAN is regarded as a two-play minimax game with distinguishability game value function $V(G,D)$:

where denotes the empirical data distribution over training data , and

GAN without Explicit $D$

[15] claimed that optimal $D$ has a closed form solution, and approximation on $D$ with neural networks is unnecessary.
It directly optimizes the JSD divergence between the distribution between $G$ and real data without sampling from $G$, which implies an alternative minimax optimizatin procedure.

The optimal discriminator is:

The value function with optimal becomes:

This approach is only applicable when has explicit representations.

CatGAN (AAAI’20)

Category-aware GAN (CatGAN)[18] employs such methods to generate sentences of different categories:

  • Gumbel-softmax relaxation (as [13])
  • SAN-based relational memory (as [13])
  • Category-wise relativistic objective.
  • Hierarchical evolutionary learning.



  • Reward spasity
  • Mode collapse

Comparative discriminaor

SALGAN[17] employs a comparative discriminaor to pairwisely compare the text quality between a pair of samples: better($>$), worse ($<$), or indistinguishable ($\approx$). Given a training set with $n$ real samples and $n$ generated samples, the comparative discimination can construct $\binom{2n}{2}$ pairwise training examples.


ColdGAN[19] adopts such methods on T5 (small) and BART:

  • Importance sampling
  • PPO Clip
  • Nucleus sampling


  1. 1.Yu, Lantao, et al. "Seqgan: Sequence generative adversarial nets with policy gradient." Thirty-first AAAI conference on artificial intelligence. (2017).
  2. 2.Zhang, Yizhe, et al. "Adversarial feature matching for text generation." arXiv preprint arXiv:1706.03850 (2017).
  3. 3.Metz, Luke, et al. "Unrolled generative adversarial networks." arXiv preprint arXiv:1611.02163 (2016).
  4. 4.Arjovsky, Martin, and Léon Bottou. "Towards principled methods for training generative adversarial networks." arXiv preprint arXiv:1701.04862 (2017).
  5. 5.Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems (2014).
  6. 6.Guo, Jiaxian, et al. "Long text generation via adversarial training with leaked information." Thirty-Second AAAI Conference on Artificial Intelligence (2018).
  7. 7.Che, Tong, et al. "Maximum-likelihood augmented discrete generative adversarial networks." arXiv preprint arXiv:1702.07983 (2017).
  8. 8.Lin, Kevin, et al. "Adversarial ranking for language generation." Advances in Neural Information Processing Systems (2017).
  9. 9.Kusner, Matt J., and José Miguel Hernández-Lobato. "Gans for sequences of discrete elements with the gumbel-softmax distribution." arXiv preprint arXiv:1611.04051 (2016).
  10. 10.Fedus, William, Ian Goodfellow, and Andrew M. Dai. "MaskGAN: Better text generation via filling in the _." ICLR (2018).
  11. 11.Chen, Liqun, et al. "Adversarial text generation via feature-mover's distance." Advances in Neural Information Processing Systems (2018).
  12. 12.Wang, Ke, and Xiaojun Wan. "SentiGAN: Generating Sentimental Texts via Mixture Adversarial Networks." IJCAI (2018).
  13. 13.Nie, Weili, Nina Narodytska, and Ankit Patel. "Relgan: Relational generative adversarial networks for text generation." International conference on learning representations (2019).
  14. Masson d'Autume, Cyprien, et al. "Training language gans from scratch." Advances in Neural Information Processing Systems (2019).
  15. 15.Li, Zhongliang, et al. "Adversarial discrete sequence generation without explicit neuralnetworks as discriminators." The 22nd International Conference on Artificial Intelligence and Statistics (2019).
  16. 16.Gulrajani, Ishaan, et al. "Improved training of wasserstein gans." Advances in Sneural information processing systems (2017).
  17. 17.Zhou, Wangchunshu, et al. "Self-Adversarial Learning with Comparative Discrimination for Text Generation." arXiv preprint arXiv:2001.11691 (2020).
  18. 18.Liu, Zhiyue, Jiahai Wang, and Zhiwei Liang. "CatGAN: Category-Aware Generative Adversarial Networks with Hierarchical Evolutionary Learning for Category Text Generation." AAAI. 2020.
  19. 19.Scialom, Thomas, et al. "ColdGANs: Taming Language GANs with Cautious Sampling Strategies." arXiv preprint arXiv:2006.04643 (2020).