A summary of image-to-text translation.

Neural Image Captioning (CVPR 2015)

As the first end-to-end neural model for image captioning tasks, Neural Image Captioning (NIC)^[1] combines the pretrained convolutional neural networks (CNNs) for image classification with recurrent networks (RNNs) for sequence modeling.

Image source: ^[1]

Let $I$ denote the input image, $\mathbf{W}_e \in \mathbb{R}^{D \times \vert V \vert}$ be the $D$-dimensional word embedding matrix of vocabulary $V$, and $\mathbf{s}_t$ be the one-hot vector of the $t$-th word.

$\begin{align} x_{-1} &{}= \textrm{CNN}(I) & \\ x_t &{}= \mathbf{W}_e \mathbf{s}_t, & t \in \{ 0 \cdots N-1\} \\ p_{t+1} &{}= \textrm{LSTM}(x_t), & t \in \{0 \cdots N-1 \} \\ \mathcal{L} &{}= -\sum_{t=1}^N \log p_t (\mathbf{s}_t) & \textrm{NLL loss} \end{align}$

Inference: sampling or beam search.
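As a rough illustration of the NLL loss above, the following NumPy sketch sums the negative log-probabilities assigned to the ground-truth words. The function name `caption_nll` and the toy probabilities are made up for this example; in NIC the rows of `probs` would come from the LSTM softmax.

```python
import numpy as np

def caption_nll(probs, target_ids):
    """probs: (N, |V|) array whose rows are p_1..p_N.
    target_ids: (N,) indices of the ground-truth words."""
    probs = np.asarray(probs, dtype=float)
    target_ids = np.asarray(target_ids)
    # Pick p_t(s_t) for every step t, then sum the negative logs.
    picked = probs[np.arange(len(target_ids)), target_ids]
    return -np.sum(np.log(picked))

# Toy example: two steps, vocabulary of three words.
probs = np.array([[0.5, 0.25, 0.25],
                  [0.1, 0.8, 0.1]])
loss = caption_nll(probs, [0, 1])  # -(log 0.5 + log 0.8)
```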

Show, Attend and Tell (ICML 2015)

The model receives a single raw image and generates a caption $\mathbf{y}$ encoded as a sequence of $1$-of-$V$ encoded words.

$y = \{ \mathbf{y}_1, \cdots, \mathbf{y}_C \}, \mathbf{y}_i \in \mathbb{R}^V$

where $V$ is the vocabulary size and $C$ is the caption length.

Encoder

Encoder: a CNN (Oxford VGGnet) extracts features from a lower convolutional layer (the 4th convolutional layer before max-pooling, which gives a 14 x 14 x 512 feature map). This yields $L$ (14 x 14 = 196) annotation vectors per image, each a $D$-dimensional (512) feature that corresponds to a different part of the image. $a = \{ \mathbf{a}_1, \cdots, \mathbf{a}_L \}, \mathbf{a}_i \in \mathbb{R}^D$

Decoder

Decoder: LSTM

$\left[\begin{array}{c} \mathbf{i}_t\\ \mathbf{o}_t \\ \mathbf{f}_t \\ \mathbf{g}_t \end{array}\right] = \left[\begin{array}{c} \sigma \\ \sigma \\ \sigma \\ \tanh \end{array}\right] T_{D+m+n, n} \left(\begin{array}{c} \mathbf{E} \mathbf{y}_{t-1} \\ \mathbf{h}_{t-1} \\ \hat{\mathbf{z}_t } \end{array}\right)$

$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_{t}$

$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$

where $T_{s,t}: \mathbb{R}^s \rightarrow \mathbb{R}^t$ denotes an affine transformation, and $\mathbf{i}_t$ , $\mathbf{f}_t$ , $\mathbf{c}_t$ , $\mathbf{o}_t$ , $\mathbf{h}_t$ are the input gate, forget gate, memory, output gate and hidden state of the LSTM, respectively.

The context vector $\hat{\mathbf{z}_t} \in \mathbb{R}^{D}$ is calculated as:

$\begin{align} e_{ti} &{}= \color{green}{f_\textrm{att}}(\mathbf{a}_i, \mathbf{h}_{t-1}) \\ \alpha_{ti} &{}= \frac{\exp(e_{ti})}{\sum_{k=1}^L \exp(e_{tk})} \\ \hat{\mathbf{z}_t} &{}= \phi(\{ \mathbf{a}_i \}, \{ \alpha_i\}) \end{align}$

where $\alpha_i$ is the weight of location $i$, interpreted either as the probability that location $i$ is the right place to focus on, or as the relative importance to give to location $i$ when blending the $\mathbf{a}_i$ ‘s together. “Where” the network looks next, i.e., $\alpha_{ti}$ , depends on the sequence of words that has already been generated, summarized by $\mathbf{h}_{t-1}$ .

The initial memory state $\mathbf{c}_0$ and hidden state $\mathbf{h}_0$ of the LSTM are predicted from the average of the annotation vectors $\mathbf{a}_i, i \in \{1, \cdots, L\}$ , using two separate projections $f_c$ and $f_h$ :

$\begin{align} \mathbf{c}_0 &{}= f_c (\frac{1}{L} \sum_{i=1}^L \mathbf{a}_i) \\ \mathbf{h}_0 &{}= f_h (\frac{1}{L} \sum_{i=1}^L \mathbf{a}_i) \\ \end{align}$

Image source: ^[2]

Output: use a deep output layer that combines the LSTM state, the context vector and the previous word: $p(\mathbf{y}_t \vert \mathbf{a}, \mathbf{y}_1^{t-1}) \propto \exp(\mathbf{L}_O (\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbf{h}_t + \mathbf{L}_z \hat{\mathbf{z}_t}))$ where $\mathbf{L}_O \in \mathbb{R}^{V \times m}, \mathbf{L}_h \in \mathbb{R}^{m \times n}, \mathbf{L}_z \in \mathbb{R}^{m \times D}$ and $\mathbf{E}$ are learnable parameters initialized randomly.

Attention

$f_\textrm{att}$ has two alternatives:

stochastic (hard) attention
deterministic (soft) attention.

Stochastic Hard Attention

Let the location variable $s_t$ denote where the model focuses when generating the $t$-th word. $s_{t,i}$ is a one-hot indicator variable that is set to 1 if the $i$-th of the $L$ locations is the one attended to. The attention locations are treated as latent variables following a Multinoulli distribution parameterized by $\{ \alpha_i \}$ , so the attention location $s_t$ has to be sampled at each time step $t$.

It is computed as:

$\begin{align} p(s_{t,i} = 1 \vert s_{j<t}, \mathbf{a}) &{}= \alpha_{t,i} \\ \hat{\mathbf{z}_t} &{}= \sum_i s_{t,i} \mathbf{a}_i \\ \end{align}$
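A minimal NumPy sketch of the sampling step above, assuming the attention weights $\alpha_{t,i}$ are already available. The helper name `sample_hard_context` and the toy sizes are illustrative only.

```python
import numpy as np

def sample_hard_context(a, alpha, rng):
    """Draw a one-hot location s_t ~ Multinoulli(alpha); return (s_t, z_hat)."""
    i = rng.choice(len(alpha), p=alpha)  # index of the attended location
    s = np.zeros(len(alpha))
    s[i] = 1.0
    return s, s @ a                      # z_hat = sum_i s_{t,i} a_i

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 3))              # L = 4 annotation vectors, D = 3
alpha = np.array([0.1, 0.2, 0.3, 0.4])   # attention weights alpha_{t,i}
s_t, z_hat = sample_hard_context(a, alpha, rng)
```

Because `s_t` is one-hot, `z_hat` is exactly one of the annotation vectors rather than a blend of them.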

The objective function $L_s$ is defined as a variational lower bound on the marginal log-likelihood $\log p(\mathbf{y} \vert \mathbf{a})$ of observing the sequence of words $\mathbf{y}$ given image features $\mathbf{a}$. Let $W$ denote the parameters of the model.

$\begin{align} L_s &{}= \sum_s p(s \vert \mathbf{a}) \log p(\mathbf{y} \vert s, \mathbf{a}) \\ &{}\leq \log \sum_s p(s \vert \mathbf{a}) p(\mathbf{y} \vert s, \mathbf{a})\\ &{}= \log p(\mathbf{y} \vert \mathbf{a}) \end{align}$

With $\tilde{s}_t \sim \textrm{Multinoulli}_L (\{ \alpha_i \})$ , the gradient with respect to $W$ is approximated by Monte Carlo sampling of the attention locations:

$\begin{align} \frac{\partial L_s}{\partial W} &=\sum_s p(s \vert \mathbf{a}) \bigg[ \frac{\partial \log p(\mathbf{y} \vert s, \mathbf{a})}{\partial W} + \log p(\mathbf{y} \vert s, \mathbf{a}) \frac{\partial \log p(s \vert \mathbf{a})}{\partial W} \bigg] \\ &{} \approx \frac{1}{N} \sum_{n=1}^N \bigg[ \frac{\partial \log p(\mathbf{y} \vert \tilde{s}^n, \mathbf{a})}{\partial W} + \log p(\mathbf{y} \vert \tilde{s}^n, \mathbf{a}) \frac{\partial \log p(\tilde{s}^n \vert \mathbf{a})}{\partial W} \bigg] \end{align}$

A moving average baseline is used to reduce the variance of the Monte Carlo estimator. For the $k$-th mini-batch:

$b_k = 0.9 \times b_{k-1} + 0.1 \times \log p(\mathbf{y} \vert \tilde{s}_k, \mathbf{a})$
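The baseline update is an exponential moving average of the sampled log-likelihoods. A small sketch follows, with a hypothetical helper `update_baseline` and made-up log-likelihood values:

```python
def update_baseline(b_prev, log_lik, decay=0.9):
    """b_k = decay * b_{k-1} + (1 - decay) * log p(y | s_k, a)."""
    return decay * b_prev + (1.0 - decay) * log_lik

b = 0.0
for log_lik in [-10.0, -8.0, -9.0]:  # toy per-mini-batch log-likelihoods
    b = update_baseline(b, log_lik)
# b moves 0 -> -1.0 -> -1.7 -> -2.43
```

Starting from zero, the baseline only slowly approaches the typical log-likelihood, which is why the first few updates carry more variance.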

Finally, the entropy term is added:

$\frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^N \bigg[ \frac{\partial \log p(\mathbf{y} \vert \tilde{s}^n, \mathbf{a})}{\partial W} + \color{green}{\lambda_r} (\log p(\mathbf{y} \vert \tilde{s}^n, \mathbf{a}) - b )\frac{\partial \log p(\tilde{s}^n \vert \mathbf{a})}{\partial W} + \color{green}{\lambda_e} \frac{\partial H [\tilde{s}^n]}{\partial W} \bigg]$

where $\lambda_r$ and $\lambda_e$ are hyperparameters that weight the reward term and the entropy term.

This update is equivalent to the REINFORCE learning rule, where the reward for selecting an attention location is a real value proportional to the log-likelihood of the target sentence under the sampled attention rollouts.

Deterministic Soft Attention

Soft attention takes the expectation of the context vector $\hat{\mathbf{z}}_t$ directly:

$\mathbb{E}_{p(s_t \vert a)} [\hat{\mathbf{z}}_t] = \sum_{i=1}^L \alpha_{t,i} \mathbf{a}_i$

Deterministic soft attention can be treated as an approximation to the marginal likelihood over the attention locations.

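Under this view, $\mathbb{E}_{p(s_t \vert a)}[\mathbf{h}_t]$ is equal, to a first-order Taylor approximation, to the hidden state obtained from a single forward pass with the expected context vector $\mathbb{E}_{p(s_t \vert a)}[\hat{\mathbf{z}}_t]$.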

Let $\mathbf{n}_t = \mathbf{L}_O (\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbf{h}_t + \mathbf{L}_z \hat{\mathbf{z}_t})$ , and let $\mathbf{n}_{t,i}$ denote $\mathbf{n}_t$ computed by setting the context vector $\hat{\mathbf{z}}$ to $\mathbf{a}_i$ . The normalized weighted geometric mean (NWGM) of the softmax prediction for the $k$-th word is:

$\begin{align} \textrm{NWGM}[p(y_t=k \vert \mathbf{a})] &{}= \frac{\prod_{i} \exp(n_{t,k,i})^{p(s_{t,i}=1 \vert a)}}{\sum_j \prod_{i} \exp(n_{t,j,i})^{p(s_{t,i}=1 \vert a)}}\\ &{}= \frac{\exp(\mathbb{E}_{p(s_t \vert a)}[n_{t,k}])}{\sum_j \exp(\mathbb{E}_{p(s_t \vert a)}[n_{t,j}])} \end{align}$

This shows that the NWGM of the softmax unit is obtained by applying the softmax to the expectation of the underlying linear projection, $\mathbb{E} [\mathbf{n}_t] = \mathbf{L}_O (\mathbf{E}\mathbf{y}_{t-1} + \mathbf{L}_h \mathbb{E} [\mathbf{h}_t] + \mathbf{L}_z \mathbb{E} [\hat{\mathbf{z}_t}] )$ .
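The identity between the weighted geometric mean and the softmax of the expected logits can be checked numerically. The sketch below uses random logits and attention probabilities; all array names and sizes are chosen only for this illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
L, V = 5, 7
n = rng.normal(size=(L, V))       # n[i, k] plays the role of n_{t,k,i}
p = softmax(rng.normal(size=L))   # p[i] plays the role of p(s_{t,i} = 1 | a)

# Normalized weighted geometric mean of the per-location softmax numerators.
weighted = np.prod(np.exp(n) ** p[:, None], axis=0)  # shape (V,)
nwgm = weighted / weighted.sum()

# Softmax applied to the expected logits E[n_{t,k}] = sum_i p_i n_{t,k,i}.
softmax_of_expectation = softmax(p @ n)
```

Both routes give the same distribution because $\prod_i \exp(n_i)^{p_i} = \exp(\sum_i p_i n_i)$.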

Doubly Stochastic Attention

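Encourage $\sum_{t} \alpha_{ti} \approx 1$, so that every part of the image receives roughly equal attention over the course of generation (by construction, $\sum_{i} \alpha_{ti} = 1$ already holds at each step).
Adopt a gating scalar $\beta_t$ predicted from the previous hidden state $\mathbf{h}_{t-1}$ at each time step $t$: $\begin{align} \phi(\{ \mathbf{a}_i \}, \{ \alpha_i \}) &{}= \beta_t \sum_i^L \alpha_i \mathbf{a}_i \\ \beta_t &{}= \sigma(f_\beta(\mathbf{h}_{t-1})) \end{align}$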
The model is trained end-to-end by minimizing the penalized negative log-likelihood: $L_d = -\log(P(\mathbf{y} \vert \mathbf{x})) + \lambda \sum_i^L (1 - \sum_{t}^C \alpha_{ti})^2$
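The penalty term of $L_d$ can be sketched in a few lines of NumPy. The function name `doubly_stochastic_penalty` and the two toy attention matrices are assumptions made for this example.

```python
import numpy as np

def doubly_stochastic_penalty(alpha, lam=1.0):
    """alpha: (C, L) attention weights; each row sums to 1 over locations.
    Returns lam * sum_i (1 - sum_t alpha[t, i])**2."""
    alpha = np.asarray(alpha, dtype=float)
    per_location = alpha.sum(axis=0)  # total attention each location receives
    return lam * np.sum((1.0 - per_location) ** 2)

# C = L = 4: uniform attention gives each location a total weight of 1.
uniform = np.full((4, 4), 0.25)
# Always attending to location 0 leaves the other locations unattended.
peaked = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))

penalty_uniform = doubly_stochastic_penalty(uniform)  # 0.0
penalty_peaked = doubly_stochastic_penalty(peaked)    # (1-4)^2 + 3 * 1^2 = 12
```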

Training

Both attention variants were trained using SGD with an adaptive learning rate. RMSProp worked best on Flickr8k, whereas Adam performed better on the Flickr30k and MS COCO datasets.
Early stopping on BLEU score, dropout.
MS COCO: < 3 days training on an NVIDIA Titan Black GPU.
Vocabulary size V=10k.
Problems: no public splits on Flickr30k and COCO datasets.
Single model w/o an ensemble.

Semantic Attention (CVPR 2016)

Image captioning methods can be generally divided into two approaches: top-down and bottom-up.

The top-down method starts from the image features and converts them into words end-to-end using RNNs, but it struggles to attend to fine details when describing the image.
The bottom-up method is free to operate on any image resolution but lacks end-to-end formulation.

Architecture

Semantic Attention^[4] extracts both top-down and bottom-up features from an input image: a global visual feature $\mathbf{v}$ is extracted from a classification CNN, and a list of visual attributes or concepts $\{ A_i \}$ is detected using attribute detectors.

$\mathbf{v}$ is only used to initialize the input node $\mathbf{x}_0$ .

$\begin{align} \mathbf{x}_0 &{}= \phi_0 (\mathbf{v}) = \mathbf{W}^{x,v} \mathbf{v}\\ \mathbf{h}_t &{}= \textrm{RNN} (\mathbf{h}_{t-1}, \mathbf{x}_t) \\ Y_t &\sim \mathbf{p}_t = \varphi (\mathbf{h}_t, \{ A_i \}) \\ \mathbf{x}_t &{}= \phi (Y_{t-1}, \{ A_i \}), t>0 \end{align}$

Input attention $\phi$

Both $Y_{t-1}$ and $A_i$ correspond to one-hot entries in the dictionary $\mathcal{Y}$, denoted as $\mathbf{y}_{t-1}$ and $\mathbf{y}^i$ , respectively. Let $\mathbf{E} \in \mathbb{R}^{d \times \vert \mathcal{Y} \vert}$ with $d \ll \vert \mathcal{Y} \vert$ be the word embedding matrix. Each detected attribute $A_i$ is assigned a score based on its relevance to the previously predicted word $Y_{t-1}$ :

$\alpha_t^i \propto \exp (\mathbf{y}_{t-1}^\top\mathbf{E}^\top \mathbf{U}\mathbf{E}\mathbf{y}^i)$

where $\mathbf{U} \in \mathbb{R}^{d \times d}$ is a trainable parameter matrix.

The attention score $\alpha$ measures the attention paid to the different attributes. The weighted sum of the attribute embeddings is added to the input space together with the previous word:

$\mathbf{x}_t = \mathbf{W}^{x, Y} \bigg(\mathbf{E} \mathbf{y}_{t-1} + \textrm{diag} (\mathbf{w}^{x,A}) \sum_{i} \alpha_t^i \mathbf{E} \mathbf{y}^i \bigg)$

where $\mathbf{W}^{x, Y} \in \mathbb{R}^{m \times d}$ is the projection matrix and $\mathbf{w}^{x,A} \in \mathbb{R}^d$ models the relative importance of visual attributes in each dimension of the word space.
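The input attention can be sketched with random parameters as follows. The sizes, the word index `prev_word` and the attribute indices are placeholders; only the two formulas for $\alpha_t^i$ and $\mathbf{x}_t$ come from the text above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
vocab, d, m = 10, 4, 6
E = rng.normal(size=(d, vocab))    # word embedding matrix
U = rng.normal(size=(d, d))        # bilinear relevance matrix
W_xY = rng.normal(size=(m, d))     # projection into the RNN input space
w_xA = rng.normal(size=d)          # per-dimension attribute importance

prev_word = 3                      # index of Y_{t-1} (toy value)
attributes = [1, 5, 7]             # indices of detected attributes A_i (toy)

e_prev = E[:, prev_word]           # E y_{t-1}
e_attr = E[:, attributes]          # columns are E y^i, shape (d, 3)

alpha = softmax(e_prev @ U @ e_attr)              # relevance of each attribute
x_t = W_xY @ (e_prev + w_xA * (e_attr @ alpha))   # diag(w) applied elementwise
```

Multiplying by `w_xA` elementwise is the same as applying $\textrm{diag}(\mathbf{w}^{x,A})$, which avoids building the diagonal matrix.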

Output attention $\varphi$

Similarly, the score $\beta_t^i$ for each attribute $A_i$ is measured w.r.t $\mathbf{h}_t$ :

$\beta_t^i \propto \exp(\mathbf{h}_t^\top \mathbf{V} \sigma(\mathbf{E}\mathbf{y}^i))$

Since the hidden state $\mathbf{h}_t$ is itself the output of the sigmoid activation $\sigma$ in the RNN, the same $\sigma$ is applied to $\mathbf{E}\mathbf{y}^i$ so that both feature vectors pass through the same nonlinear transform before they are compared.

The distribution is generated by a linear transform followed by a softmax normalization:

$\mathbf{p}_t \propto \exp(\mathbf{E}^\top \mathbf{W}^{Y,h} (\mathbf{h}_t + \textrm{diag}(\mathbf{w}^{Y,A}) \sum_i \beta_t^i \sigma (\mathbf{E}\mathbf{y}^i)))$

where $\mathbf{W}^{Y,h} \in \mathbb{R}^{d \times n}$ is the projection matrix and $\mathbf{w}^{Y,A} \in \mathbb{R}^n$ models the relative importance of visual attributes in each dimension of the RNN state space.

Model training

The loss function is defined as the NLL loss combined with regularization terms on attention scores $\{ \alpha_t^i \}$ and $\{ \beta_t^i \}$ :

$\min -\sum_t \log p(Y_t) + g(\mathbf{\alpha}) + g(\mathbf{\beta})$

where $\mathbf{\alpha}$ and $\mathbf{\beta}$ are attention score matrices whose $(t,i)$-th entries are $\alpha_t^i$ and $\beta_t^i$ . The regularization function $g$ is used to enforce the completeness of attention paid to every attribute in $\{ A_i \}$ as well as the sparsity of attention at any particular time step.

$\begin{align} g(\mathbf{\alpha}) &{}= \Vert \mathbf{\alpha} \Vert_{1,p} + \Vert \mathbf{\alpha}^\top \Vert_{q,1}\\ &{}= \big[ \sum_i [ \sum_t \alpha_t^i]^p \big]^{1/p} + \sum_t \big[ \sum_i (\alpha_t^i)^q \big]^{1/q} \end{align}$

where the first term with $p>1$ penalizes excessive attention paid to any single attribute $A_i$ accumulated over the entire sentence, and the second term $0<q<1$ penalizes diverted attention to multiple attributes at any particular time.
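A small sketch of the regularizer, reading $\Vert \mathbf{\alpha} \Vert_{1,p}$ as an $\ell_p$ norm over attributes of the per-attribute totals, and $\Vert \mathbf{\alpha}^\top \Vert_{q,1}$ as a per-step $\ell_q$ quasi-norm summed over time. That reading of the mixed norms, the function name and the default values of `p` and `q` are assumptions made for this example.

```python
import numpy as np

def attention_regularizer(alpha, p=2.0, q=0.5):
    """alpha: (T, K) matrix with alpha[t, i] = alpha_t^i (non-negative)."""
    alpha = np.asarray(alpha, dtype=float)
    # Completeness: l_p norm over attributes of the attention summed over time.
    completeness = np.sum(alpha.sum(axis=0) ** p) ** (1.0 / p)
    # Sparsity: per-step l_q quasi-norm over attributes, summed over time.
    sparsity = np.sum(np.sum(alpha ** q, axis=1) ** (1.0 / q))
    return completeness + sparsity

focused = np.array([[1.0, 0.0],
                    [0.0, 1.0]])   # one attribute per step
diffuse = np.array([[0.5, 0.5],
                    [0.5, 0.5]])   # attention spread at every step

g_focused = attention_regularizer(focused)  # sqrt(2) + 2
g_diffuse = attention_regularizer(diffuse)  # sqrt(2) + 4
```

Both matrices give every attribute the same total attention, so only the sparsity term differs, and the diffuse pattern is penalized more.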

SCA-CNN (CVPR 2017)

Motivation:

Low-layer filters detect low-level visual cues like edges and corners, while higher-level ones extract abstract semantic patterns like objects.
CNN extractors output a hierarchy of visual abstractions, which is spatial, channel-wise, and multi-layer. Previous work only takes into account the spatial characteristics, regardless of the channel-wise and multi-layer information.
SCA-CNN takes full advantage of such three characteristics of CNN features.

Spatial and Channel-wise Attention CNN

Spatial and Channel-wise Attention-based Convolutional Neural Network (SCA-CNN)^[5] applies channel-wise attention and spatial attention at multiple layers.

At the $l$-th layer, the spatial and channel-wise attention weights $\gamma^l$ are a function of the LSTM memory $\mathbf{h}_{t-1} \in \mathbb{R}^d$ and the input CNN features $\mathbf{V}^l$, where $d$ is the dimension of the hidden state. SCA-CNN modulates $\mathbf{V}^l$ using the attention weights $\gamma^l$ as follows:

$\begin{align} \mathbf{V}^l &{}= \textrm{CNN} (\mathbf{X}^{l-1}) \\ \gamma^l &{}= \Phi(\mathbf{h}_{t-1}, \mathbf{V}^l)\\ \mathbf{X}^l &{}= f(\mathbf{V}^l, \gamma^l) \end{align}$

where $\mathbf{X}^l$ is the modulated feature, $\Phi(\cdot)$ is the spatial and channel-wise attention function, and $f(\cdot)$ is a linear weighting function that modulates the CNN features with the attention weights. The output of the last layer $L$ is fed to the decoder:

$\begin{align} \mathbf{h}_t &{}= \textrm{LSTM}(\mathbf{h}_{t-1}, \mathbf{X}^L, y_{t-1})\\ y_t &\sim p_t = \textrm{softmax}(\mathbf{h}_t, y_{t-1}) \end{align}$

The spatial attention weights $\alpha^l$ and channel-wise attention weights $\beta^l$ are learned separately:

$\begin{align} \alpha^l &{}= \mathbf{\Phi}_s (\mathbf{h}_{t-1}, \mathbf{V}^l) \\ \beta^l &{}= \mathbf{\Phi}_c (\mathbf{h}_{t-1}, \mathbf{V}^l) \end{align}$

where $\mathbf{\Phi}_s$ and $\mathbf{\Phi}_c$ represent the spatial and channel-wise models respectively, with a cost of $\mathcal{O}(W^lH^lk)$ for spatial attention and $\mathcal{O}(C^lk)$ for channel-wise attention. $W$, $H$, $C$ and $k$ denote the width, height, number of channels and mapping dimension.

Spatial Attention

The CNN features $\mathbf{V}=[\mathbf{v}_1, \mathbf{v}_2, \cdots ,\mathbf{v}_m]$ are flattened along the width and height, where $\mathbf{v}_i \in \mathbb{R}^C$ and $m=W \cdot H$. $\mathbf{v}_i$ is the visual feature of the $i$-th location. Given the previous hidden state $\mathbf{h}_{t-1}$ , a single fully-connected layer followed by a softmax generates the attention distribution $\alpha$ over the image regions.

$\begin{align} \mathbf{a} &{}= \tanh \big( (\mathbf{W}_s \mathbf{V} + b_s) \oplus \mathbf{W}_{hs}\mathbf{h}_{t-1} \big) \\ \alpha &{}= \textrm{softmax}(\mathbf{W}_i \mathbf{a} + b_i) \end{align}$

where $\mathbf{W}_s \in \mathbb{R}^{k \times C}$ , $\mathbf{W}_{hs} \in \mathbb{R}^{k \times d}$ and $\mathbf{W}_i \in \mathbb{R}^{k}$ are trainable parameters that map both inputs to the same dimension $k$, $\oplus$ denotes the addition of a vector to each column of a matrix, and $b_s \in \mathbb{R}^k, b_i \in \mathbb{R}$ are biases.
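A shape-level sketch of the spatial attention above, with random parameters. All sizes are toy values, and NumPy broadcasting stands in for the $\oplus$ operator.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
C, m, k, d = 8, 6, 5, 4            # channels, locations (W*H), mapping dim, LSTM dim
V = rng.normal(size=(C, m))        # column i is the feature v_i of location i
h_prev = rng.normal(size=d)
W_s, b_s = rng.normal(size=(k, C)), rng.normal(size=k)
W_hs = rng.normal(size=(k, d))
W_i, b_i = rng.normal(size=k), 0.1

# (W_s V + b_s) is (k, m); W_hs h is (k,) and is added to every column.
a = np.tanh(W_s @ V + b_s[:, None] + (W_hs @ h_prev)[:, None])
alpha = softmax(W_i @ a + b_i)     # one weight per location, shape (m,)
```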

Channel-wise Attention

Each CNN filter is a pattern detector on images, and each channel of a feature map in CNN is a response activation of the corresponding convolutional filter. Applying channel-wise attention mechanisms can be treated as selecting semantic attributes.

Firstly, the CNN features $\mathbf{V}$ are reshaped to $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2,\cdots, \mathbf{u}_C]$, where $\mathbf{u}_i \in \mathbb{R}^{W \times H}$ represents the $i$-th channel of the feature map $\mathbf{V}$ and $C$ is the number of channels. Mean-pooling is applied to each channel to obtain the channel feature $\mathbf{v}$:

$\mathbf{v} = [v_1, v_2, \cdots, v_C ], \quad \mathbf{v} \in \mathbb{R}^C$

where scalar $v_i$ is the mean of vector $\mathbf{u}_i$ , which represents the $i$-th channel features.

The channel-wise attention model $\Phi_c$ can be defined as:

$\begin{align} \mathbf{b} &{}= \tanh \big( (\mathbf{W}_c \otimes \mathbf{v} + b_c) \oplus \mathbf{W}_{hc} \mathbf{h}_{t-1} \big) \\ \beta &{}= \textrm{softmax} (\mathbf{W}^\prime_i \mathbf{b} + b^\prime_i) \end{align}$

where $\mathbf{W}_c \in \mathbb{R}^k, \mathbf{W}_{hc} \in \mathbb{R}^{k \times d}, \mathbf{W}^\prime_i \in \mathbb{R}^k$ are trainable parameters, $\otimes$ represents the outer product of vectors, biases $b_c \in \mathbb{R}^k, b^\prime_i \in \mathbb{R}$ .
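The channel-wise branch can be sketched the same way. The feature map layout `(W, H, C)`, the sizes and the final channel-wise multiplication are assumptions made for this illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
W, H, C, k, d = 3, 2, 4, 5, 6
fmap = rng.normal(size=(W, H, C))   # feature map, channels last
h_prev = rng.normal(size=d)
W_c, b_c = rng.normal(size=k), rng.normal(size=k)
W_hc = rng.normal(size=(k, d))
W_i, b_i = rng.normal(size=k), 0.0

v = fmap.mean(axis=(0, 1))          # mean-pool each channel, shape (C,)

# Outer product W_c (x) v is (k, C); the hidden-state term is added per column.
b = np.tanh(np.outer(W_c, v) + b_c[:, None] + (W_hc @ h_prev)[:, None])
beta = softmax(W_i @ b + b_i)       # one weight per channel, shape (C,)

modulated = fmap * beta             # scale every channel by its weight
```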

Channel-Spatial

Apply channel-wise attention first, then compute spatial attention on the channel-weighted features, to obtain the modulated feature map $\mathbf{X}$.

$\begin{align} \beta &{} = \Phi_c (\mathbf{h}_{t-1}, \mathbf{V}) \\ \alpha &{}= \Phi_s (\mathbf{h}_{t-1}, f_c (\mathbf{V}, \beta)) \\ \mathbf{X} &{}= f(\mathbf{V}, \alpha, \beta) \end{align}$

where $f_c(\cdot)$ is a channel-wise multiplication for feature map channels and corresponding channel weights.

Adaptive Attention (CVPR 2017)

Motivation:

Most attention-based methods force visual attention to be active for each generated word. However, not all words have corresponding visual signals.
Decoders require little visual information to predict words like “the”/“of”. Other words can also be predicted reliably from the language model alone, e.g., “sign” after “behind a red stop” or “phone” following “talk on a cell”.

Adaptive attention with a “visual sentinel”^[6] is proposed to decide when to rely on the visual signals and when to just rely on the language model.

Spatial Attention

A spatial attention module (fig. (b)) is used to compute the context vector $\mathbf{c}_t$ as:

$\mathbf{c}_t = g(\mathbf{V}, \mathbf{h}_t)$

where $g$ is the attention function, $\mathbf{V}=[\mathbf{v}_1, \cdots, \mathbf{v}_k], \mathbf{v}_i \in \mathbb{R}^d$ is the $d$-dimensional spatial image feature, $\mathbf{h}_t \in \mathbb{R}^d$ is the hidden state of RNN at time $t$.

(a) Soft attention (b) Spatial attention

As in fig. (b), given the spatial image feature $\mathbf{V} \in \mathbb{R}^{d \times k}$ and $\mathbf{h}_t$ , the context vector can be computed as:

$\begin{align} \mathbf{z}_t &{}= \mathbf{w}^\top_h \tanh \big(\mathbf{W}_v \mathbf{V} + (\mathbf{W}_g \mathbf{h}_t) \mathbb{I}^\top \big) \\ \mathbf{\alpha}_t &{}= \textrm{softmax} (\mathbf{z}_t)\\ \mathbf{c}_t &{}= \sum_{i=1}^k \alpha_{ti} \mathbf{v}_{i} \\ \log p(y_t \vert y_1, \cdots, y_{t-1}, \mathbf{I}) &{}= f(\mathbf{h}_t, \mathbf{c}_t) \end{align}$

where $\mathbb{I} \in \mathbb{R}^k$ is a vector with all elements set to 1, and $\mathbf{W}_v,\mathbf{W}_g \in \mathbb{R}^{k \times d}$ , $\mathbf{w}_h \in \mathbb{R}^k$ are trainable parameters. $\mathbf{\alpha}_t \in \mathbb{R}^k$ is the attention weight over the features in $\mathbf{V}$. $\mathbf{I}$ is the input image.

The model uses the current hidden state rather than the previous one to generate the context vector. The context vector can therefore be treated as the residual visual information of the current hidden state $\mathbf{h}_t$ , which reduces the uncertainty of, or complements the information in, the current hidden state for next-word prediction.

Adaptive Attention

The spatial attention above cannot determine when to rely on visual signals and when to rely on the language model. The LSTM is therefore extended with a visual sentinel vector $\mathbf{s}_t$ :

$\begin{align} \mathbf{g}_t &{}= \sigma (\mathbf{W}_x \mathbf{x}_t + \mathbf{W}_h \mathbf{h}_{t-1}) \\ \mathbf{s}_t &{}= \mathbf{g}_t \odot \tanh (\mathbf{m}_t) \\ \hat{\mathbf{c}_t} &{}= \beta_t \mathbf{s}_t + (1- \beta_t) \mathbf{c}_t \end{align}$

where $\mathbf{m}_t$ is the LSTM memory cell, $\mathbf{g}_t$ is the gate applied to it, and the new sentinel gate $\beta_t$ at time $t$ controls the trade-off between the image information and the decoder memory.

The new sentinel gate $\beta_t$ is computed as:

$\begin{align} \hat{\mathbf{\alpha}}_t &{}= \textrm{softmax}([ \mathbf{z}_t ; \mathbf{w}_h^\top \tanh \big(\mathbf{W}_s \mathbf{s}_t + \mathbf{W}_g \mathbf{h}_t \big) ]) \\ \beta_t &{}= \hat{\mathbf{\alpha}}_t [k+1] \end{align}$

where $\hat{\mathbf{\alpha}}_t \in \mathbb{R}^{k+1}$ is the attention distribution over both the spatial image features and the visual sentinel vector, and its last element serves as the gate value $\beta_t$ .
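The following NumPy sketch puts the spatial attention and the sentinel gate together, using random parameters and toy sizes. The sentinel `s` is drawn at random here instead of being computed from the LSTM memory.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d, k = 4, 3
V = rng.normal(size=(d, k))        # columns are the spatial features v_i
h = rng.normal(size=d)             # current hidden state h_t
s = rng.normal(size=d)             # visual sentinel s_t (random stand-in)
W_v, W_g, W_s = (rng.normal(size=(k, d)) for _ in range(3))
w_h = rng.normal(size=k)

# Spatial attention: (W_g h) 1^T repeats W_g h in every column.
z = w_h @ np.tanh(W_v @ V + (W_g @ h)[:, None])   # shape (k,)
alpha = softmax(z)
c = V @ alpha                                      # context vector c_t

# Sentinel gate: append the sentinel score and renormalize over k + 1 entries.
sentinel_score = w_h @ np.tanh(W_s @ s + W_g @ h)
alpha_hat = softmax(np.append(z, sentinel_score))
beta = alpha_hat[-1]
c_hat = beta * s + (1.0 - beta) * c
```

Since $(1-\beta_t)\alpha_{ti}$ equals the $i$-th entry of $\hat{\mathbf{\alpha}}_t$, `c_hat` is also the $\hat{\mathbf{\alpha}}_t$-weighted sum of the $k$ features and the sentinel.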

The probability over vocabulary at time $t$ is:

$\mathbf{p}_t = \textrm{softmax} \big( \mathbf{W}_p (\hat{\mathbf{c}}_t + \mathbf{h}_t) \big)$

Semantic Compositional Networks (CVPR 2017)

Motivation: LSTM-based generation is quite limited: it only uses semantic concepts through soft attention or initialization at the first step.

Semantic Compositional Network

Semantic Compositional Networks (SCN)^[7] detect the semantic concepts, i.e., tags, from each input image. It uses the $K$ most common words in the training captions to determine the vocabulary of tags, including most frequent nouns, verbs, or adjectives.

The tag detection can be cast as a multi-label classification task. Given $N$ training examples, $\mathbf{y}_i = [ y_{i1}, \cdots, y_{iK}] \in \{ 0,1 \}^K$ is the label’s dummy encoding of $i$-th image, wherein 1 and 0 indicate annotation or not respectively. Let $\mathbf{v}_i$ and $\mathbf{s}_i$ be the image feature vector and semantic feature vector of the $i$-th image, the cost function is:

$-\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K \big( y_{ik} \log s_{ik} + (1- y_{ik}) \log (1-s_{ik}) \big)$

where $\mathbf{s}_{i} = \sigma \big( \textrm{MLP}(\mathbf{v}_i) \big)$ is a $K$-dimensional vector with $\mathbf{s}_i = [ s_{i1},\cdots,s_{iK} ]$ .
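A minimal sketch of the tag detection objective, written with a leading minus sign so that it is a quantity to minimize. The function name `tag_loss`, the clipping constant and the toy inputs are assumptions made for this example.

```python
import numpy as np

def tag_loss(s, y, eps=1e-12):
    """Multi-label cross entropy averaged over images.
    s: (N, K) predicted tag probabilities, y: (N, K) binary labels."""
    s = np.clip(np.asarray(s, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    y = np.asarray(y, dtype=float)
    per_image = np.sum(y * np.log(s) + (1.0 - y) * np.log(1.0 - s), axis=1)
    return -np.mean(per_image)

y = np.array([[1, 0]])        # one image: tag 0 present, tag 1 absent
s = np.array([[0.9, 0.2]])    # predicted probabilities
loss = tag_loss(s, y)         # -(log 0.9 + log 0.8)
```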

SCN-RNN

SCN injects the tag-dependent matrices:

$\begin{align} \tilde{\mathbf{x}}_{t-1} &{}= \mathbf{W}_b \mathbf{s} \odot \mathbf{W}_c \mathbf{x}_{t-1} \\ \tilde{\mathbf{h}}_{t-1} &{}= \mathbf{U}_b \mathbf{s} \odot \mathbf{U}_c \mathbf{h}_{t-1}\\ \mathbf{z} &{}= \mathbb{I}(t=1) \cdot \mathbf{C}\mathbf{v}\\ \mathbf{h}_t &{}= \sigma ( \mathbf{W}_a \tilde{\mathbf{x}}_{t-1} + \mathbf{U}_a \tilde{\mathbf{h}}_{t-1} + \mathbf{z} ) \end{align}$

where $\mathbf{x}_t$ is the $t$-th word in the generated caption, $\mathbf{x}_0$ is defined as the ‘BOS’ token, and $\odot$ is the Hadamard product. The trainable parameters are $\{ \mathbf{W}_a, \mathbf{U}_a \} \in \mathbb{R}^{d \times d^\prime}$ and $\{ \mathbf{W}_b, \mathbf{U}_b \} \in \mathbb{R}^{d^\prime \times K}$ , where $d$ is the hidden dimension and $d^\prime$ is the number of factors. $\mathbf{W}_a$ and $\mathbf{W}_c$ are shared among all captions, capturing common linguistic patterns; $\mathbf{W}_b \mathbf{s}$ accounts for the semantic aspects of the image captured by $\mathbf{s}$.
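One SCN-RNN step can be sketched as below with random parameters. The sizes, the word-embedding dimension `n_x` and the choice `z = 0` (a step with $t > 1$) are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, d_f, K, n_x = 4, 3, 5, 6        # hidden size, factors, tags, word-vector size
W_a, U_a = rng.normal(size=(d, d_f)), rng.normal(size=(d, d_f))
W_b, U_b = rng.normal(size=(d_f, K)), rng.normal(size=(d_f, K))
W_c, U_c = rng.normal(size=(d_f, n_x)), rng.normal(size=(d_f, d))

s = rng.uniform(size=K)            # tag probabilities of the image
x_prev = rng.normal(size=n_x)      # embedding of the previous word
h_prev = rng.normal(size=d)
z = np.zeros(d)                    # image term, non-zero only at t = 1

x_tilde = (W_b @ s) * (W_c @ x_prev)   # tag-modulated input
h_tilde = (U_b @ s) * (U_c @ h_prev)   # tag-modulated hidden state
h_t = sigmoid(W_a @ x_tilde + U_a @ h_tilde + z)

# Equivalent view: a tag-dependent weight matrix W(s) = W_a diag(W_b s) W_c.
W_of_s = W_a @ np.diag(W_b @ s) @ W_c
```

The factored form never materializes `W_of_s`; it is built here only to show that both views give the same projection of `x_prev`.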

SCN-LSTM

SCN-RNN can be generalized using LSTM units:

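$\begin{align} \mathbf{i}_t &{}= \sigma (\mathbf{W}_{ia} \tilde{\mathbf{x}}_{i, t-1} + \mathbf{U}_{ia} \tilde{\mathbf{h}}_{i, t-1} + \mathbf{z}) \\ \mathbf{f}_t &{}= \sigma (\mathbf{W}_{fa} \tilde{\mathbf{x}}_{f, t-1} + \mathbf{U}_{fa} \tilde{\mathbf{h}}_{f, t-1} + \mathbf{z}) \\ \mathbf{o}_t &{}= \sigma (\mathbf{W}_{oa} \tilde{\mathbf{x}}_{o, t-1} + \mathbf{U}_{oa} \tilde{\mathbf{h}}_{o, t-1} + \mathbf{z}) \\ \tilde{\mathbf{c}}_t &{}= \color{red}{\sigma} (\mathbf{W}_{ca} \tilde{\mathbf{x}}_{c, t-1} + \mathbf{U}_{ca} \tilde{\mathbf{h}}_{c, t-1} + \mathbf{z}) \\ \mathbf{c}_t &{}= \mathbf{i}_t \odot \tilde{\mathbf{c}}_t + \mathbf{f}_t \odot \mathbf{c}_{t-1} \\ \mathbf{h}_{t} &{}= \mathbf{o}_t \odot \tanh (\mathbf{c}_t) \end{align}$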

where $\mathbf{z}=\mathbb{I} (t=1) \cdot \mathbf{C}\mathbf{v}$ . For $\star = i,f,o,c$, we define

$\begin{align} \tilde{\mathbf{x}}_{\star, t-1} &{}= \mathbf{W}_{\star b }\mathbf{s} \odot \mathbf{W}_{\star c}\mathbf{x}_{t-1} \\ \tilde{\mathbf{h}}_{\star, t-1} &{}= \mathbf{U}_{\star b }\mathbf{s} \odot \mathbf{U}_{\star c}\mathbf{h}_{t-1} \end{align}$

Training

Given image $\mathbf{I}$ and corresponding caption $\mathbf{X}$, the objective function is defined as:

$\log p(\mathbf{X} \vert \mathbf{I}) = \sum_{t=1}^T \log p(\mathbf{x}_t \vert \mathbf{x}_0, \cdots, \mathbf{x}_{t-1}, \mathbf{v}, \mathbf{s})$

Averaged objectives among all (image, caption) pairs are used during training.

Up-Down Attention (CVPR 2018)

Up-Down Attention^[8] combines a bottom-up mechanism (based on Faster R-CNN^[9]) with a top-down attention mechanism, so that attention is computed at the level of objects and other salient image regions. The bottom-up mechanism proposes a set of salient image regions, with each region represented by a pooled convolutional feature vector from Faster R-CNN with a ResNet-101^[10] backbone, whereas the top-down mechanism uses non-visual or task-specific context to predict an attention distribution over these regions.

As shown in the left figure, the input regions correspond to a uniform grid of equally sized and shaped neural receptive fields, irrespective of the content of the image. In contrast, the right focuses on the objects and salient image regions for attention.

Bottom-Up attention

All regions whose class detection probability exceeds a confidence threshold are selected^[9]. For each selected region $i$, $\mathbf{v}_i$ is defined as the mean-pooled convolutional feature from this region.

Decoder

A top-down attention LSTM followed by a language model (LM) LSTM is used to generate captions.

Attention LSTM

Let the superscript denote the layer number, i.e., $\mathbf{h}^1$ indicates the hidden state of the first LSTM. The top-down attention LSTM receives the concatenation of the previous output of the LM LSTM $\mathbf{h}^2_{t-1}$ , the mean-pooled image feature $\bar{\mathbf{v}}=\frac{1}{k}\sum_i \mathbf{v}_i$ and the embedding of the previously generated word $\mathbf{W}_e \Pi_t$ , where $\mathbf{W}_e \in \mathbb{R}^{D \times \vert V \vert}$ is the word embedding matrix for vocabulary $V$, and $\Pi_t$ is the one-hot encoding of the input word at timestep $t$.

$\mathbf{x}_t^1 = [ \mathbf{h}_{t-1}^2, \bar{\mathbf{v}}, \mathbf{W}_e \Pi_t ]$

Given the output $\mathbf{h}_t^1$ of the attention LSTM, a normalized attention weight $\alpha_{i,t}$ is computed for each of the $k$ image features $\mathbf{v}_i$ at each time step $t$:

$\begin{align} a_{i,t} &{}= \mathbf{w}_a^\top \tanh \big( \mathbf{W}_{va}\mathbf{v}_i + \mathbf{W}_{ha} \mathbf{h}_t^1 \big) \\ \pmb{\alpha}_t &{}= \textrm{softmax} (\mathbf{a}_t) \\ \hat{\mathbf{v}}_t &{}= \sum_{i=1}^k \alpha_{i,t} \mathbf{v}_i \end{align}$

where $\hat{\mathbf{v}}_t$ is the input to language LSTM, $\mathbf{W}_{va} \in \mathbb{R}^{H \times V}, \mathbf{W}_{ha} \in \mathbb{R}^{H \times M}, \mathbf{w}_{a} \in \mathbb{R}^H$ are learnable parameters.
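The top-down attention over region features can be sketched as follows. The number of regions, the feature size and the random parameters are placeholders; in the model the rows of `v` would be Faster R-CNN region features.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
k, feat_dim, M, H = 5, 6, 4, 3     # regions, feature size, LSTM size, attention size
v = rng.normal(size=(k, feat_dim)) # row i is the region feature v_i
h1 = rng.normal(size=M)            # attention LSTM output h_t^1
W_va = rng.normal(size=(H, feat_dim))
W_ha = rng.normal(size=(H, M))
w_a = rng.normal(size=H)

# a_{i,t} = w_a^T tanh(W_va v_i + W_ha h_t^1), computed for all regions at once.
a = np.tanh(v @ W_va.T + W_ha @ h1) @ w_a   # shape (k,)
alpha = softmax(a)
v_hat = alpha @ v                           # attended feature fed to the LM LSTM
```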

Language LSTM

The input to the LM LSTM is the concatenation of the attended image feature and the attention LSTM output:

$\mathbf{x}_t^2 = [ \hat{\mathbf{v}}_t, \mathbf{h}_t^1 ]$

For a caption sequence $y_{1:T}=(y_1, \cdots, y_T)$ , the conditional distribution over words at time step $t$ is:

$p(y_t \vert y_{1:t-1}) = \textrm{softmax} (\mathbf{W}_p \mathbf{h}_t^2 + \mathbf{b}_p )$

where $\mathbf{W}_p \in \mathbb{R}^{\vert V \vert \times M}$ and $\mathbf{b}_p \in \mathbb{R}^{\vert V \vert}$ are learnable weights and biases.

The probability of generated captions is:

$p(y_{1:T}) = \prod_{t=1}^T p(y_t \vert y_{1: t-1})$

Objective

Cross-entropy
Given the ground truth sequence $y_{1:T}^*$ , the cross entropy loss is:
$\mathcal{L}_{ce} (\theta) = - \sum_{t=1}^T \log \big( p_\theta(y_t^* \vert y_{1:t-1}^* ) \big)$
Negative expected score
$\mathcal{L}_r (\theta) = - \mathbb{E}_{y_{1:T} \sim p_\theta}[r(y_{1:T})]$
where $r$ is the score function (e.g., CIDEr).

Stylized Image Captioning

StyleNet (CVPR 2017)

Motivation:

Previous works on image captioning all generate the factual description of the image content while overlooking the style of generated captions. Stylized descriptions can greatly enrich the expressibility and attractiveness of the caption.
Application: people always struggle to come up with an attractive title when uploading images to a social media platform. Stylized captioning can be a helpful solution.

Factored LSTM

StyleNet^[11] proposed the Factored LSTM to memorize the language style pattern, by factorizing the parameters $\mathbf{W}_x \in \mathbb{R}^{M \times N}$ in standard LSTMs into three matrices $\mathbf{U}_x \in \mathbb{R}^{M \times E}, \mathbf{S}_x \in \mathbb{R}^{E \times E}, \mathbf{V}_x \in \mathbb{R}^{E \times N}$ . It retains the weight parameters of the recurrent connections $\mathbf{W}_{ih}, \mathbf{W}_{fh}, \mathbf{W}_{oh}, \mathbf{W}_{ch}$ , which capture the long-span syntactic dependencies of the text.

The Factored LSTM is defined as:

$\begin{align} \mathbf{i}_t &{}= \sigma (\mathbf{U}_{ix}\mathbf{S}_{ix}\mathbf{V}_{ix} \mathbf{x}_t + \mathbf{W}_{ih}\mathbf{h}_{t-1}) \\ \mathbf{f}_t &{}= \sigma (\mathbf{U}_{fx}\mathbf{S}_{fx}\mathbf{V}_{fx} \mathbf{x}_t + \mathbf{W}_{fh}\mathbf{h}_{t-1}) \\ \mathbf{o}_t &{}= \sigma (\mathbf{U}_{ox}\mathbf{S}_{ox}\mathbf{V}_{ox} \mathbf{x}_t + \mathbf{W}_{oh}\mathbf{h}_{t-1}) \\ \tilde{\mathbf{c}}_t &{}= \tanh (\mathbf{U}_{cx}\mathbf{S}_{cx}\mathbf{V}_{cx} \mathbf{x}_t + \mathbf{W}_{ch}\mathbf{h}_{t-1}) \\ \mathbf{c}_t &{}= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_{t}\\ \mathbf{h}_t &{}= \mathbf{o}_t \odot \mathbf{c}_t \\ \mathbf{p}_{t+1} &{}= \textrm{softmax} (\mathbf{C}\mathbf{h}_t) \end{align}$

where $\{ \mathbf{U}, \mathbf{V}, \mathbf{W} \}$ are shared among different styles, whereas $\mathbf{S}$ is style-specific.
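The factorization can be illustrated with a single input projection. In this sketch the shared factors `U` and `V` are reused across styles while only the small matrix `S` changes; the style names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, E = 6, 5, 3                  # output size, input size, factor size
U = rng.normal(size=(M, E))        # shared across styles
V = rng.normal(size=(E, N))        # shared across styles
S = {                              # one E x E matrix per style
    "factual": rng.normal(size=(E, E)),
    "humorous": rng.normal(size=(E, E)),
}

def input_projection(x, style):
    """Compute (U S_style V) x without forming the full M x N matrix."""
    return U @ (S[style] @ (V @ x))

x_t = rng.normal(size=N)
factual_out = input_projection(x_t, "factual")
humorous_out = input_projection(x_t, "humorous")

# Switching style only swaps E*E parameters instead of M*N.
style_params, full_params = E * E, M * N
```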

Training StyleNet

Two tasks:

First, train the factored LSTM model to generate factual captions given paired images.
Then train the factored LSTM as a language model on a stylized language corpus, updating only the style-specific matrix $\mathbf{S}$.

SemStyle (CVPR 2018)

SemStyle^[12] combines a term generator, which produces an ordered sequence of semantic terms describing the image, with a language generator trained on styled text data.

Semantic term

Given a sentence $\mathbf{w} = \{ w_1, w_2, \cdots, w_n \}, w_i \in \mathcal{V}^\text{in}$ , a set of rules is defined to obtain the ordered semantic terms $\mathbf{x} = \{ x_1, x_2, \cdots, x_M \}, x_i \in \mathcal{V}^\text{term}$ . The rules are:

Filtering non-semantic words.
Lemmatization and tagging using spaCy.
Verb abstraction. Use semantic role labeling tool SEMAFOR to annotate frames and reduce frame vocabulary.

Term generator

Use CNN+GRU to generate semantic terms collected above. The greedy search decoding is used to recover the term sequence from the conditional probabilities. Given input image $I$,

$x_{i+1} = \arg\max_{j \in \mathcal{V}^\text{term}} P(x_{i+1}=j \vert I, x_i, \cdots, x_1)$

where $x_1$ is the ‘BOS’ token.
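Greedy decoding can be sketched with a stand-in for the CNN+GRU. The function `toy_step_probs`, the end-of-sequence token and the four-term vocabulary are all hypothetical; only the arg-max rule comes from the text above.

```python
import numpy as np

BOS, EOS = 0, 1  # EOS is an assumption; the text only defines the BOS token

def greedy_decode(step_probs, max_len=10):
    """Repeatedly pick the most probable next term given the terms so far."""
    terms = [BOS]
    for _ in range(max_len):
        probs = step_probs(terms)     # stands in for P(x_{i+1} | I, x_i..x_1)
        nxt = int(np.argmax(probs))
        terms.append(nxt)
        if nxt == EOS:
            break
    return terms

def toy_step_probs(terms):
    """Hypothetical model: a fixed transition table over 4 terms."""
    most_likely_next = {0: 2, 2: 3, 3: 1}
    probs = np.full(4, 0.1)
    probs[most_likely_next[terms[-1]]] = 0.7
    return probs

decoded = greedy_decode(toy_step_probs)  # [0, 2, 3, 1]
```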

Language generator

A bi-GRU is used to encode the semantic terms $x$, and the forward and backward hidden states are concatenated as outputs: $\mathbf{h}_{(\text{enc}, i)} = [\mathbf{h}_{(\text{fw},i)}; \mathbf{h}_{(\text{bw},i)}]$ . The last state is used to initialize the first hidden state of the decoder: $\mathbf{h}_{(\text{dec}, 0)} = \mathbf{h}_{(\text{enc}, M)}$ .

The context vector at step $t$ is computed with bi-linear attention:

$\begin{align} v_{t,i} &{}= \mathbf{h}_{\text{enc},i}^\top \mathbf{W}_a \mathbf{h}_{\text{dec},t} \\ a_{t,i} &{}= \frac{\exp (v_{t,i})}{\sum_{j=1}^M \exp (v_{t,j})}\\ \mathbf{c}_t &{}= \sum_{i=1}^M a_{t,i} \mathbf{h}_{\text{enc},i} \end{align}$

The output uses an MLP with a softmax non-linearity:

$\begin{align} \mathbf{h}^\text{out}_{t} &{}= \mathbf{W}^\text{out}[\mathbf{c}_t, \mathbf{h}^\text{dec}_{t}] + \mathbf{b}^\text{out}\\ p(y_t =k \vert \mathbf{x}) &{}= \frac{\exp(h^\text{out}_{t,k})}{ \sum_{j=1}^{\vert \mathcal{V}^\text{out} \vert} \exp(h^\text{out}_{t,j})} \end{align}$

Training

Train the term generator on factual descriptions, using mean categorical cross entropy over semantic terms:
$\mathcal{L} = -\frac{1}{M} \sum_{i=1}^M \log p(x_i = \hat{x}_i \vert I,\hat{x}_{i-1}, \cdots,\hat{x}_{1} )$
Train the language generator on both styled and descriptive sentences.

“Factual” or “Emotional” (ECCV 2018)

Style-factual LSTM

Two sets of matrices are used in the style-factual LSTM:

$\begin{align} i_t &{}= \sigma \big( (g_{xt} S_{xi} + (1- g_{xt})W_{xi})x_t + (g_{ht} S_{hi} + (1-g_{ht})W_{hi})h_{t-1} + b_i \big) \\ f_t &{}= \sigma \big( (g_{xt} S_{xf} + (1- g_{xt})W_{xf})x_t + (g_{ht} S_{hf} + (1-g_{ht})W_{hf})h_{t-1} + b_f \big) \\ o_t &{}= \sigma \big( (g_{xt} S_{xo} + (1- g_{xt})W_{xo})x_t + (g_{ht} S_{ho} + (1-g_{ht})W_{ho})h_{t-1} + b_o \big) \\ \tilde{c}_t &{}= \phi \big( (g_{xt} S_{xc} + (1- g_{xt})W_{xc})x_t + (g_{ht} S_{hc} + (1-g_{ht})W_{hc})h_{t-1} + b_c \big) \\ c_t &{}= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\ h_t &{}= o_t \odot \phi(c_t) \end{align}$

where the gates $g_{xt}$ and $g_{ht}$ control whether the next word is predicted from the factual matrices $W_x$, $W_h$ (gate $\approx 0$) or from the style-related matrices $S_x$, $S_h$ (gate $\approx 1$, a styled word) ^[13].
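
To make the role of the gates concrete, the sketch below computes the shared pre-activation of one style-factual LSTM gate. It is an illustration only: the function name `style_factual_preactivation` and the toy shapes are assumptions, and $g_{xt}$, $g_{ht}$ are treated as scalars.

```python
import numpy as np

def style_factual_preactivation(x_t, h_prev, W_x, S_x, W_h, S_h, b, g_x, g_h):
    """Pre-activation of one gate: mixed input weights, mixed recurrent weights, bias."""
    W_x_mix = g_x * S_x + (1.0 - g_x) * W_x   # input-side weights
    W_h_mix = g_h * S_h + (1.0 - g_h) * W_h   # recurrent weights
    return W_x_mix @ x_t + W_h_mix @ h_prev + b

# Toy shapes (assumed): 3-dimensional input, 4-dimensional hidden state.
rng = np.random.default_rng(0)
x_t, h_prev, b = rng.normal(size=3), rng.normal(size=4), rng.normal(size=4)
W_x, S_x = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
W_h, S_h = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

factual = style_factual_preactivation(x_t, h_prev, W_x, S_x, W_h, S_h, b, 0.0, 0.0)
styled = style_factual_preactivation(x_t, h_prev, W_x, S_x, W_h, S_h, b, 1.0, 1.0)
```

With both gates at 0 the expression reduces to a standard LSTM using only $W$; with both at 1 it uses only $S$.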

Training

Two stages:

At the first stage, fix $g_{xt}=g_{ht}=0$ and freeze the style-related matrices $S_x$ and $S_h$ . The model is trained on paired factual captioning datasets with MLE loss.
At the second stage, train the model on paired stylized captioning datasets, updating $S_x$ and $S_h$ of the style-factual LSTM while keeping $W_x, W_h$ fixed. The loss for this stage is designed as: $\begin{align} \mathbb{KL}(P_s^t \Vert P_r^t) &{}= \sum_{w \in W} P_s^t(w) \log \frac{P_s^t(w)}{P_r^t(w)} \\ g_{ip}^t &{}= P_s^t \cdot P_r^t \\ \mathcal{L} &{}= \sum_{t=1}^T -(1- g_{ip}^t) \log P_s^t (y_t) + \alpha \cdot \sum_{t=1}^T g_{ip}^t \mathbb{KL}(P_s^t \Vert P_r^t) \end{align}$ where $P_s^t$ and $P_r^t$ are the word probability distributions predicted at step $t$ by the model being trained and by the reference model, and $g_{ip}^t$ is their inner product, which measures the similarity between $P_s^t$ and $P_r^t$ . The term $g_{ip}^t \rightarrow 0$ when $P_s^t$ assigns a high probability to a stylized word, so the MLE term dominates; when $g_{ip}^t$ is large, the KL term dominates instead.
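
The second-stage loss can be sketched directly from its three equations. This is a minimal NumPy version under stated assumptions: the function name `adaptive_style_loss` is hypothetical, each row of `P_s` and `P_r` is a distribution over the vocabulary, and `eps` is a small constant added only to keep the logarithms finite.

```python
import numpy as np

def adaptive_style_loss(P_s, P_r, targets, alpha=1.0, eps=1e-12):
    """P_s, P_r: (T, V) per-step word distributions; targets: (T,) word indices y_t."""
    g_ip = np.sum(P_s * P_r, axis=1)                       # inner product per step
    kl = np.sum(P_s * (np.log(P_s + eps) - np.log(P_r + eps)), axis=1)
    nll = -np.log(P_s[np.arange(len(targets)), targets] + eps)
    return np.sum((1.0 - g_ip) * nll) + alpha * np.sum(g_ip * kl)

# One step, two words: identical distributions make the KL term vanish.
P = np.array([[0.5, 0.5]])
loss_same = adaptive_style_loss(P, P, np.array([0]))
```

In this toy case $g_{ip} = 0.5$, so the loss is half of the usual negative log-likelihood $\log 2$.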

Adversarial Training

Show, Adapt and Tell (ICCV 2017)

In the source domain, a set $\mathcal{P} = \{ (\mathbf{x}^n, \hat{\mathbf{y}}^n) \}_n$ of paired images $\mathbf{x}^n$ and ground-truth sentences $\hat{\mathbf{y}}^n$ is given. In the target domain, two separate sets are given: a set of example images $\chi = \{ \mathbf{x}^n \}_n$ and a set of example sentences $\hat{\mathcal{Y}} = \{ \hat{\mathbf{y}}^n \}_n$ .

Captioner as an Agent

The captioner, which uses the standard CNN-RNN architecture, is treated as an agent. At time $t$, the captioner takes an action, i.e., a word $y_t$ , according to a stochastic policy $\pi_\theta (y_t \vert \mathbf{x}, \mathbf{y}_{t-1})$ . The total per-word loss $J(\theta)$ is minimized:

$\begin{align} J(\theta) &{}= \sum_{n=1}^N \sum_{t=1}^{T_n} \mathcal{L}(\pi_\theta(\hat{y}_t^n \vert \mathbf{x}^n, \hat{\mathbf{y}}_{t-1}^n))\\ \mathcal{L}(\pi_\theta(\hat{y}_t^n \vert \mathbf{x}^n, \hat{\mathbf{y}}_{t-1}^n)) &{}= - \log \pi_\theta (\hat{y}_t^n \vert \mathbf{x}^n, \hat{\mathbf{y}}_{t-1}^n) \end{align}$

where $N$ is the number of images, $T_n$ is the length of the sentence $\hat{\mathbf{y}}^n$ , $\mathcal{L}$ indicates the cross-entropy loss. $\hat{\mathbf{y}}_{t-1}^n$ and $\hat{y}_t^n$ are ground truth partial sentence and word, respectively.

The state-action function $Q((\mathbf{x}, \mathbf{y}_{t-1}), y_t) = \mathbb{E}_{\mathbf{y}_{(t+1):T}} [R( [ \mathbf{y}_{t-1}, y_t, \mathbf{y}_{(t+1):T} ] \vert \mathbf{x}, \mathcal{Y}, \mathcal{P} )]$

The objective function:

$\begin{align} J(\theta) &{}= \sum_{n=1}^N J_n (\theta) \\ J_n(\theta) &{}= \sum_{t=1}^{T_n} \mathbb{E}_{\mathbf{y}_t^n} [ \pi_\theta (y_t^n \vert \mathbf{x}^n, \mathbf{y}_{t-1}^n) Q \big((\mathbf{x}^n, \mathbf{y}_{t-1}^n), y_t^n \big) ] \end{align}$

Since the action space of $\mathbf{y}_t$ is huge, $M$ sentences $\{ \mathbf{y}^m \}_m$ are generated and the expectation is replaced with the mean:

$\begin{align} J_n(\theta) &\simeq \frac{1}{M} \sum_{m=1}^M J_{n,m}(\theta) \\ J_{n,m} (\theta) &{}= \sum_{t=1}^{T_m} \pi_\theta (y_t^m \vert \mathbf{x}, \mathbf{y}_{t-1}^m) Q \big( (\mathbf{x}, \mathbf{y}_{t-1}^m), y_t^m \big) \end{align}$

The policy gradient is:

$\begin{align} \nabla_\theta J_{n,m} (\theta) &{}= \sum_{t=1}^{T_m} \nabla_\theta \pi_\theta (y_t^m \vert \mathbf{x}, \mathbf{y}_{t-1}^m) Q \big( (\mathbf{x}, \mathbf{y}_{t-1}^m), y_t^m \big) \\ &{}= \sum_{t=1}^{T_m} \pi_\theta (y_t^m \vert \mathbf{x}, \mathbf{y}_{t-1}^m) \nabla_\theta \log \pi_\theta (y_t^m \vert \mathbf{x}, \mathbf{y}_{t-1}^m) Q((\mathbf{x}, \mathbf{y}_{t-1}^m), y_t^m) \\ \nabla_\theta J(\theta) &\simeq \frac{1}{M} \sum_{n=1}^N \sum_{m=1}^M \nabla_\theta J_{n,m}(\theta) \end{align}$

Monte Carlo rollouts are used to approximate the expectation in the Q function:

$Q((\mathbf{x}, \mathbf{y}_{t-1}), y_t) \simeq \frac{1}{K} \sum_{k=1}^K R(\big[ \mathbf{y}_{t-1}, y_t, \mathbf{y}_{(t+1):T_k}^k \big] \vert \mathbf{x}, \mathcal{Y}, \mathcal{P})$

where $\{\mathbf{y}_{(t+1):T_k}^k \}$ are generated future words, and $K$ complete sentences are sampled with policy $\pi_\theta$ .
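
The rollout estimate can be sketched in a few lines of plain Python. Everything here besides the averaging itself is a hypothetical stand-in: `toy_rollout` plays the role of sampling future words from $\pi_\theta$, and `toy_reward` plays the role of the critic-based reward $R$.

```python
import random

def estimate_q(prefix, action, rollout, reward, num_rollouts=8):
    """Monte Carlo estimate of Q((x, y_{t-1}), y_t): mean reward of K completions."""
    partial = prefix + [action]
    total = 0.0
    for _ in range(num_rollouts):
        future = rollout(partial)            # y_{(t+1):T_k}, sampled from the policy
        total += reward(partial + future)    # R([y_{t-1}, y_t, y_{(t+1):T_k}])
    return total / num_rollouts

# Hypothetical stand-ins for the captioner and the reward.
rng = random.Random(0)
VOCAB = ["a", "cat", "dog", "sits"]

def toy_rollout(partial, max_len=5):
    # Fill the sentence up to max_len words with uniformly sampled words.
    return [rng.choice(VOCAB) for _ in range(max_len - len(partial))]

def toy_reward(sentence):
    # Fraction of words equal to "cat".
    return sentence.count("cat") / len(sentence)

q_cat = estimate_q(["a"], "cat", toy_rollout, toy_reward)
```

Each rollout completes the partial sentence and scores the whole sentence, so the estimate rewards a word for the sentences it tends to lead to, not only for the word itself.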

Critics

Domain critic

The domain critic (DC) model consists of an encoder and a classifier. A sentence $\mathbf{y}$ is encoded using TextCNNs with highway connections, then passed to an MLP followed by a softmax to generate the probability $C_d (l \vert \mathbf{y})$ , where $l \in$ {source, target, generated}.

Training DC: the goal is to classify a sentence into source, target, and generated data.

$\mathcal{L}_d (\phi) = - \sum_{n=1}^N \log C_d (l^n \vert \mathbf{y}^n; \phi)$

Multi-modal critic (MC) classifies $(\mathbf{x}, \mathbf{y})$ as “paired”, “unpaired”, or “generated” data. The model is:

$\begin{align} \mathbf{c} &{}= \textrm{LSTM}(\mathbf{y}) \\ f &{}= \tanh(\mathbf{W}_x \mathbf{x} + \mathbf{b}_x) \odot \tanh (\mathbf{W}_c \mathbf{c} + \mathbf{b}_c) \\ C_m &{}= \textrm{softmax} ( \mathbf{W}_m f + \mathbf{b}_m) \end{align}$

where $\mathbf{W}_x, \mathbf{b}_x, \mathbf{W}_c, \mathbf{b}_c,\mathbf{W}_m, \mathbf{b}_m$ are learnable parameters, $\odot$ denotes the element-wise product, and $C_m$ gives the probabilities over three classes: paired, unpaired, and generated data. $C_m$ indicates how relevant a generated caption $\mathbf{y}$ is to an image $\mathbf{x}$.
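
A minimal NumPy sketch of the multi-modal critic's fusion and classification steps is given below. It assumes the sentence encoding $\mathbf{c}$ has already been produced by the LSTM; the function name `multimodal_critic`, the toy dimensions, and the class order of the output are assumptions for illustration.

```python
import numpy as np

def multimodal_critic(x, c, W_x, b_x, W_c, b_c, W_m, b_m):
    """Fuse image feature x and sentence encoding c, then classify into 3 classes."""
    f = np.tanh(W_x @ x + b_x) * np.tanh(W_c @ c + b_c)   # element-wise product
    logits = W_m @ f + b_m
    logits = logits - logits.max()                        # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()   # assumed order: paired, unpaired, generated

# Toy sizes (assumed): 6-d image feature, 5-d sentence encoding, 4-d fused space.
rng = np.random.default_rng(0)
x, c = rng.normal(size=6), rng.normal(size=5)
W_x, b_x = rng.normal(size=(4, 6)), np.zeros(4)
W_c, b_c = rng.normal(size=(4, 5)), np.zeros(4)
W_m, b_m = rng.normal(size=(3, 4)), np.zeros(3)
probs = multimodal_critic(x, c, W_x, b_x, W_c, b_c, W_m, b_m)
```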

Training MC: the goal is to classify the image-sentence pair into paired, unpaired, and generated data.

$\mathcal{L}_m (\eta) = - \sum_{n=1}^N \log C_m (l^n \vert \mathbf{x}^n, \mathbf{y}^n; \eta)$

Sentence reward

The sentence reward is the product of the two critics' probabilities: $R(\mathbf{y} \vert \cdot) = C_d(\text{target}\vert \cdot) \cdot C_m(\textrm{paired} \vert \cdot)$

Training algorithm

Require: captioner $\pi_\theta$ , domain critic $C_d$ , multi-modal critic $C_m$ , an empty set of generated sentences $\mathcal{Y}_{\pi_\theta}$ , and an empty set of paired image-generated-sentence data $\mathcal{P}_\textrm{gen}$ .

Input: sentences $\hat{\mathcal{Y}}_\textrm{src}$ , image-sentence pairs $\mathcal{P}_\text{src}$ , unpaired data $\acute{\mathcal{P}}_\textrm{src}$ in source domain; sentences $\hat{\mathcal{Y}}_\textrm{tgt}$ , images $\chi_\textrm{tgt}$ in target domain.

Pretrain $\pi_\theta$ on $\mathcal{P}_\text{src}$ ;

while $\theta$ has not converged do:
1. for $i=0, \cdots, N_c$ do
  1. $\mathcal{Y}_{\pi_\theta} \leftarrow \{ \mathbf{y} \}$ , where $\mathbf{y} \sim \pi_\theta (\cdot \vert \mathbf{x})$ and $\mathbf{x} \sim \chi_\textrm{tgt}$ ;
  2. Compute $g_d = \nabla_\phi \mathcal{L}_d (\phi)$ ;
  3. Adam update of $\phi$ for $C_d$ using $g_d$ ;
  4. $\mathcal{Y}_{\pi_\theta} \leftarrow \{ \mathbf{y} \}$ , where $\mathbf{y} \sim \pi_\theta (\cdot \vert \mathbf{x})$ and $\mathbf{x} \sim \chi_\textrm{src}$ ;
  5. $\mathcal{P}_\text{gen} \leftarrow \{ (\mathbf{x}, \mathbf{y}) \}$ ;
  6. Compute $g_m = \nabla_\eta \mathcal{L}_m (\eta)$ ;
  7. Adam update of $\eta$ for $C_m$ using $g_m$ ;
2. for $i=0, \cdots, N_g$ do
  1. $\mathcal{Y}_{\pi_\theta} \leftarrow \{ \mathbf{y} \}$ , where $\mathbf{y} \sim \pi_\theta (\cdot \vert \mathbf{x})$ and $\mathbf{x} \sim \chi_\textrm{tgt}$ ;
  2. $\mathcal{P}_\text{gen} \leftarrow \{ \mathbf{x}, \mathbf{y}\}$ ;
  3. for $t=1, \cdots, T$ do
    1. Compute $Q((\mathbf{x}, \mathbf{y}_{t-1}), y_t)$ with Monte Carlo rollouts;
  4. Compute $g_\theta = \nabla_\theta J(\theta)$ ;
  5. Adam update of $\theta$ using $g_\theta$ .
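
The alternating structure of the algorithm above, with critic updates first and captioner updates second, can be sketched as a plain control-flow skeleton. The callables, the step counts, and the convergence test are all hypothetical placeholders; the actual sampling, gradient computation, and Adam updates are left abstract.

```python
def alternating_schedule(update_critics, update_captioner, converged,
                         critic_steps, captioner_steps):
    """Alternate critic and captioner updates until the convergence test passes."""
    rounds = 0
    while not converged(rounds):
        for _ in range(critic_steps):       # inner loop 1: update C_d and C_m
            update_critics()
        for _ in range(captioner_steps):    # inner loop 2: policy-gradient update of pi_theta
            update_captioner()
        rounds += 1
    return rounds

# Count how often each update would run in a toy setting.
counts = {"critics": 0, "captioner": 0}

def bump(name):
    counts[name] += 1

rounds = alternating_schedule(
    update_critics=lambda: bump("critics"),
    update_captioner=lambda: bump("captioner"),
    converged=lambda r: r >= 3,             # placeholder stopping rule
    critic_steps=5,
    captioner_steps=1,
)
```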

Poetry generation (ACM MM 2018)

^[15]

Reinforcement Learning

Self-Critical Sequence Training (CVPR 2017)

Motivation:

Teacher forcing creates a mismatch between training and testing, known as exposure bias, which results in error accumulation during generation at test time.
Models are trained with cross-entropy loss, whereas they are evaluated at test time with discrete, non-differentiable NLP metrics such as BLEU, ROUGE, METEOR, and CIDEr.
Ideally, sequence models should be trained to avoid exposure bias and directly optimize metrics for the task at hand.

Policy Gradient

Reinforcement learning (RL) algorithms such as REINFORCE and actor-critic can be used to directly optimize NLP metrics and address the exposure bias issue. The LSTM can be treated as an agent that interacts with an external environment (state: words and image features, action: predicted word, done: “EOS”). The policy network $p_\theta$ produces an action, the prediction of the next word. After each action, the agent updates its internal state (the cells and hidden states of the LSTM) until it generates the EOS token. The reward $r$ is an NLP metric, such as the CIDEr score of the generated sentence computed against the ground-truth sequences. The goal is to minimize the negative expected reward:

$\mathcal{L} (\theta) = - \mathbb{E}_{w^s \sim p_\theta} [r(w^s)]$

where $w^s = (w_1^s, \cdots, w_T^s)$ and $w_t^s$ is the word sampled from the model at time step $t$. In practice, $\mathcal{L} (\theta)$ is typically estimated with a single sample from $p_\theta$ :

$\mathcal{L} (\theta) \approx -r (w^s),\quad w^s \sim p_\theta$

Policy gradient with REINFORCE

REINFORCE is based on the observation that the expected gradient of a non-differentiable reward function can be computed as:

$\nabla_\theta \mathcal{L}(\theta) = - \mathbb{E}_{w^s \sim p_\theta} [r(w^s)\nabla_\theta \log p_\theta (w^s)]$

In practice, the expected gradient is approximated with a single Monte Carlo sample $w^s = (w_1^s,\cdots,w_T^s)$ from $p_\theta$ for each training example in the minibatch:

$\nabla_\theta \mathcal{L}(\theta) \approx -r (w^s) \nabla_\theta \log p_\theta (w^s)$

REINFORCE with Baseline

To reduce the variance of the gradient estimate, a reference reward, or baseline, $b$ is subtracted from the reward:

$\nabla_\theta \mathcal{L}(\theta) = - \mathbb{E}_{w^s \sim p_\theta} [(r(w^s) - b) \nabla_\theta \log p_\theta (w^s)]$

The baseline can be an arbitrary function, as long as it does not depend on the action $w^s$ , because:

$\begin{align} \mathbb{E}_{w^s \sim p_\theta} [b \nabla_\theta \log p_\theta (w^s)] &{}= b \sum_{w^s} \nabla_\theta p_\theta (w^s) \\ &{}= b \nabla_\theta \sum_{w^s} p_\theta (w^s)\\ &{}= b \nabla_\theta 1 = 0 \end{align}$

This shows that the baseline does not change the expected gradient but can reduce the variance.
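
The claim can be checked numerically on a tiny example. The sketch below assumes a one-step softmax policy over three actions, so that the expectation can be computed exactly by enumeration; the function name and the numbers are illustrative only.

```python
import numpy as np

def expected_reinforce_gradient(logits, rewards, baseline):
    """Exact -E[(r - b) * grad log p] w.r.t. the logits of a one-step softmax policy."""
    p = np.exp(logits - logits.max())
    p = p / p.sum()
    grad = np.zeros_like(p)
    for action, prob in enumerate(p):
        grad_log_p = -p.copy()
        grad_log_p[action] += 1.0     # d log p(action) / d logits = one_hot - p
        grad += prob * (rewards[action] - baseline) * grad_log_p
    return -grad                      # sign follows the loss L = -E[r]

logits = np.array([0.3, -1.2, 0.8])
rewards = np.array([1.0, 0.0, 2.5])
grad_no_baseline = expected_reinforce_gradient(logits, rewards, baseline=0.0)
grad_with_baseline = expected_reinforce_gradient(logits, rewards, baseline=1.7)
```

Both calls return the same vector up to floating-point error, because the baseline term sums to $b(\mathbf{p} - \mathbf{p}) = 0$ under the expectation.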

For each training case, it can be approximated with a single sample $w^s \sim p_\theta$ as:

$\nabla_\theta \mathcal{L}(\theta) \approx - (r(w^s) -b) \nabla_\theta \log p_\theta (w^s)$

Note that this is still valid if $b$ is a function of $\theta$ or $t$.

The gradient is:

$\begin{align} \nabla_\theta \mathcal{L}(\theta) &{}= \sum_{t=1}^T \frac{\partial \mathcal{L}(\theta)}{\partial s_t} \frac{\partial s_t}{\partial \theta}\\ \frac{\partial \mathcal{L}(\theta)}{\partial s_t} &{}\approx (r(w^s)-b) (p_\theta (w_t \vert h_t) - 1_{w_t^s}) \end{align}$

where $s_t$ is the input to the softmax function and $1_{w_t^s}$ is the one-hot vector of the sampled word $w_t^s$.
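
The closed-form gradient with respect to the softmax input can be verified with finite differences. The sketch below treats $-(r(w^s) - b) \log p_\theta(w_t^s \vert h_t)$ as a function of $s_t$ alone; the function names and test values are assumptions for illustration.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def surrogate_loss(s, word, advantage):
    """-(r - b) * log p(word), viewed as a function of the softmax input s."""
    return -advantage * np.log(softmax(s)[word])

def analytic_gradient(s, word, advantage):
    """(r - b) * (p - one_hot(word)), the closed form from the text."""
    one_hot = np.zeros_like(s)
    one_hot[word] = 1.0
    return advantage * (softmax(s) - one_hot)

def numeric_gradient(s, word, advantage, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(s)
    for k in range(len(s)):
        step = np.zeros_like(s)
        step[k] = eps
        grad[k] = (surrogate_loss(s + step, word, advantage)
                   - surrogate_loss(s - step, word, advantage)) / (2 * eps)
    return grad

s = np.array([0.5, -0.3, 1.1, 0.0])
g_analytic = analytic_gradient(s, word=2, advantage=0.7)
g_numeric = numeric_gradient(s, word=2, advantage=0.7)
```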

Self-Critical Sequence Training (SCST)

Self-Critical Sequence Training ^[16] uses the reward obtained by the current model under its test-time inference algorithm as the baseline in REINFORCE. The gradient with respect to the softmax input at time step $t$ becomes:

$\frac{\partial \mathcal{L}(\theta)}{\partial s_t} \approx (r(w^s)- \color{green}{r(\hat{w})}) (p_\theta (w_t \vert h_t) - 1_{w_t^s})$

where $r(\hat{w})$ is the reward obtained by the current model at test time.

SCST directly optimizes the true, sequence-level, evaluation metric, encouraging train/test time consistency.
SCST avoids the usual scenario of having to learn a (context-dependent) estimate of expected future rewards as a baseline.
In practice, it has much lower variance and is more effective on mini-batches using SGD.
It avoids the training problems with actor-critic methods, where the actor is trained on value functions estimated by a critic rather than actual rewards.

The baseline sequence $\hat{w}$ is obtained with greedy decoding:

$\hat{w}_t = \arg\max_{w_t} p(w_t \vert h_t)$
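
A minimal sketch of the resulting training signal is shown below, in plain Python. It assumes the per-word log-probabilities of the sampled caption and the two rewards have already been computed elsewhere; `greedy_word` and `scst_surrogate_loss` are hypothetical helper names. Differentiating the surrogate with respect to $\theta$, with both rewards treated as constants, gives the single-sample gradient estimate with the greedy reward as baseline.

```python
def greedy_word(probs):
    """Index of the most probable word: argmax_w p(w | h_t)."""
    return max(range(len(probs)), key=probs.__getitem__)

def scst_surrogate_loss(sample_log_probs, sample_reward, greedy_reward):
    """-(r(w^s) - r(w_hat)) * log p(w^s), with log p(w^s) summed over time steps."""
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_log_probs)

best = greedy_word([0.1, 0.7, 0.2])
loss_better = scst_surrogate_loss([-0.5, -1.5], sample_reward=1.2, greedy_reward=1.0)
loss_worse = scst_surrogate_loss([-0.5, -1.5], sample_reward=0.8, greedy_reward=1.0)
```

A sampled caption that beats the greedy one gets a positive loss coefficient on its negative log-probability, so minimizing the loss raises its probability; a sample that scores worse is pushed down.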

RL with Embedding Reward (CVPR 2017)

This work^[17] uses the actor-critic algorithm with a visual-semantic embedding reward for image captioning. The policy and value networks jointly determine the next best word at each time step. The former provides local guidance by predicting the confidence of the next word, whereas the latter provides global, lookahead guidance by evaluating the reward value of the current state.

Policy Network

The policy network $p_\pi$ consists of a standard CNN-RNN encoder-decoder architecture, with the whole vocabulary as its (very large) action space.

Value Network

The value function $v^p$ is predicted by a value network $v_\theta$ .

$\begin{align} v^p (s) &= \mathbb{E} [r \vert s_t =s, a_{t,\cdots, T} \sim p] \\ v_\theta (s) &\approx v^p(s) \end{align}$

where $s_t = \{ \mathbf{I}, w_1, \cdots, w_t \}$ .

As in the figure, the value network consists of a CNN, an RNN, and an MLP, where CNN encodes the raw image $\mathbf{I}$, RNN encodes the semantic information of partially generated sentence $\{ w_1,\cdots, w_t \}$ . The concatenated representation is projected to a scalar reward from $s_t$ using MLP.

Visual-Semantic Embedding Reward

Given an image with feature $\mathbf{v}^*$ , the reward of a generated sentence $\hat{S}$ is defined as the similarity between $\hat{S}$ and $\mathbf{v}^*$ in the embedding space:

$r = \frac{f_e (\mathbf{v}^*) \cdot \mathbf{h}^\prime_T (\hat{S})}{\Vert f_e (\mathbf{v}^*) \Vert \Vert \mathbf{h}^\prime_T (\hat{S}) \Vert}$
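
As a small illustration, the reward can be computed as a cosine similarity once both sides are embedded. The sketch assumes its two inputs are already $f_e(\mathbf{v}^*)$ and $\mathbf{h}^\prime_T(\hat{S})$, and it normalizes by the norms of these embedded vectors; the function name `embedding_reward` is hypothetical.

```python
import numpy as np

def embedding_reward(image_embedding, sentence_embedding):
    """Cosine similarity between the embedded image and the embedded sentence."""
    num = float(image_embedding @ sentence_embedding)
    den = float(np.linalg.norm(image_embedding) * np.linalg.norm(sentence_embedding))
    return num / den

r_aligned = embedding_reward(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
r_orthogonal = embedding_reward(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
```

Parallel embeddings give a reward close to 1, and orthogonal embeddings give 0, independent of vector length.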

The bidirectional ranking loss is defined as:

$\mathcal{L}_e = \sum_\mathbf{v} \sum_{S^-} \max (0, \beta - f_e (\mathbf{v})\cdot \mathbf{h}^\prime_T (S) + f_e (\mathbf{v})\cdot \mathbf{h}^\prime_T (S^-)) + \sum_{S}\sum_{\mathbf{v}^-} \max (0, \beta - \mathbf{h}^\prime_T (S) \cdot f_e (\mathbf{v}) + \mathbf{h}^\prime_T (S) \cdot f_e (\mathbf{v}^-) )$

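where $f_e$ maps an image feature into the embedding space, $\mathbf{h}^\prime_T (S)$ is the last hidden state of the RNN that encodes sentence $S$, $\beta$ is a margin chosen by cross-validation, $(\mathbf{v}, S)$ is a ground-truth image-sentence pair, $S^-$ is a negative description for the image corresponding to $\mathbf{v}$, and vice versa for $\mathbf{v}^-$.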
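
The ranking loss can be sketched for a small batch as follows. This is an illustration under an explicit assumption that the source does not state: every non-matching sentence (or image) in the batch is used as a negative. The function name `bidirectional_ranking_loss` and the default margin are also assumptions.

```python
import numpy as np

def bidirectional_ranking_loss(img_emb, sent_emb, beta=0.2):
    """img_emb[i] = f_e(v_i), sent_emb[i] = h'_T(S_i); row i of each is a true pair."""
    scores = img_emb @ sent_emb.T                 # scores[i, j] = f_e(v_i) . h'_T(S_j)
    pos = np.diag(scores)                         # scores of the ground-truth pairs
    # Image i against negative sentences j != i.
    cost_s = np.maximum(0.0, beta - pos[:, None] + scores)
    # Sentence j against negative images i != j.
    cost_v = np.maximum(0.0, beta - pos[None, :] + scores)
    np.fill_diagonal(cost_s, 0.0)                 # true pairs are not negatives
    np.fill_diagonal(cost_v, 0.0)
    return cost_s.sum() + cost_v.sum()

# Well-separated pairs incur no loss; indistinguishable pairs pay the margin each time.
loss_separated = bidirectional_ranking_loss(np.eye(2), np.eye(2))
same = np.array([[1.0, 0.0], [1.0, 0.0]])
loss_collapsed = bidirectional_ranking_loss(same, same)
```

In the collapsed case every negative scores as high as the true pair, so each of the four hinge terms contributes the margin 0.2.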

Training

Two steps:

Train the policy network $p_\pi$ with cross-entropy loss;
Train $p_\pi$ and $v_\theta$ jointly using reinforcement learning and curriculum learning, with the value network $v_\theta$ serving as a moving baseline: $\begin{align} \nabla_\pi J &{}\approx \sum_{t=1}^T \nabla_\pi \log p_\pi (a_t \vert s_t) (r - v_\theta (s_t)) \\ \nabla_\theta J &{}= \nabla_\theta v_\theta (s_t) (r - v_\theta (s_t)) \end{align}$

References

1.Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3156-3164. ↩
2.Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A.C., Salakhutdinov, R., Zemel, R.S., & Bengio, Y. (2015). Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention. ICML. ↩
3.Karpathy, A., & Li, F. (2015). Deep visual-semantic alignments for generating image descriptions. CVPR. ↩
4.You, Q., Jin, H., Wang, Z., Fang, C., & Luo, J. (2016). Image Captioning with Semantic Attention. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4651-4659. ↩
5.Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., & Chua, T. (2017). SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6298-6306. ↩
6.Lu, J., Xiong, C., Parikh, D., & Socher, R. (2017). Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3242-3250. ↩
7.Gan, Z., Gan, C., He, X., Pu, Y., Tran, K., Gao, J., Carin, L., & Deng, L. (2017). Semantic Compositional Networks for Visual Captioning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1141-1150. ↩
8.Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2017). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6077-6086. ↩
9.Ren, S., He, K., Girshick, R.B., & Sun, J. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 1137-1149. ↩
10.He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778. ↩
11.Gan, C., Gan, Z., He, X., Gao, J., & Deng, L. (2017). StyleNet: Generating Attractive Visual Captions with Styles. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 955-964. ↩
12.Mathews, A.P., Xie, L., & He, X. (2018). SemStyle: Learning to Generate Stylised Image Captions Using Unaligned Text. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8591-8600. ↩
13.Chen, T., Zhang, Z., You, Q., Fang, C., Wang, Z., Jin, H., & Luo, J. (2018). "Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention. ECCV. ↩
14.Chen, T., Liao, Y., Chuang, C., Hsu, W.T., Fu, J., & Sun, M. (2017). Show, Adapt, and Tell: Adversarial Training of Cross-Domain Image Captioner. 2017 IEEE International Conference on Computer Vision (ICCV), 521-530. ↩
15.Liu, B., Fu, J., Kato, M.P., & Yoshikawa, M. (2018). Beyond Narrative Description: Generating Poetry from Images by Multi-Adversarial Training. ArXiv, abs/1804.08473. ↩
16.Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., & Goel, V. (2016). Self-Critical Sequence Training for Image Captioning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1179-1195. ↩
17.Ren, Z., Wang, X., Zhang, N., Lv, X., & Li, L. (2017). Deep Reinforcement Learning-Based Image Captioning with Embedding Reward. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1151-1159. ↩