The attention mechanism has been widely applied in natural language processing and computer vision tasks.

# Attention mechanism

## Background: what is WRONG with seq2seq?

Encoder-decoder architecture:
Encode a source sentence into a fixed-length vector from which a decoder generates a translation.[3]

• Seq2seq models encode an input sentence of variable length $(x_1,...,x_T)$ into a fixed-length vector representation $c$ (a.k.a. the sentence embedding, or “thought” vector) by applying one LSTM to read the input sentence, one timestep at a time. The representation vector $c$ is expected to capture the meaning of the source sentence well.

• Then decode the vector representation $c$ into the target sentence $(y_1,...,y_{T'})$ with another LSTM, whose initial hidden state is the last hidden state of the encoder (i.e. the representation of the input sentence, $c$) [1].

Each decoding step can be written as

$$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$$

where $g$ is an RNN that outputs the probability of $y_t$, and $s_t$ is the hidden state of the RNN. $c$ is the fixed-length context vector for the input sentence.

Drawbacks: the fixed-length context vector $c$ is incapable of remembering long sentences [2]; it forgets the earlier parts of the sequence while processing the later parts. Sutskever et al. (2014) [1] proposed a trick: merely reversing the order of the source sentences (but not the target sentences) benefits MT.

The basic encoder-decoder architecture compresses all the necessary information of a source sentence into a fixed-length vector. This may be incapable of coping with long sentences, especially those longer than the sentences in the training corpus [3]. The performance of the basic encoder-decoder drops rapidly as the length of the input sentence increases [4].

Thus the attention mechanism was proposed to tackle this problem.

## Attention mechanism

### NMT by jointly learning to align and translate (ICLR 2015)

#### Encoder

Bi-RNNs: obtain the annotation $h_j$ for each word $x_j$ by concatenating the forward and backward hidden states:

$$h_j = \left[\overrightarrow{h}_j^\top; \overleftarrow{h}_j^\top\right]^\top$$

#### Decoder

The conditional probability is:

$$p(y_i \mid y_1, \dots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i)$$

where $s_i$ is the RNN hidden state for time $i$:

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$

Unlike the basic encoder-decoder architecture, the probability of each output word is conditioned on a distinct context vector $c_i$ for each target word $y_i$: a weighted sum of the encoder annotations,

$$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j, \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$$

where $e_{ij} = \text{score}(s_{i-1}, h_j)$ is an alignment model which scores how well the input at position $j$ and the output at position $i$ match [3]. Here $\text{score}$ is a simple FF-NN layer:

$$\text{score}(s_{i-1}, h_j) = \mathbf{v}_a^\top \tanh\left(\mathbf{W}_a\, [s_{i-1}; h_j]\right)$$

where $\mathbf{v}_a$ and $\mathbf{W}_a$ are learnable.
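Below is a minimal PyTorch sketch of this additive attention step for one decoder timestep, computing the scores $e_{ij}$, the weights $\alpha_{ij}$, and the context vector $c_i$. The module name `AdditiveAttention` and all dimensions are illustrative, not from [3]:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style attention: score(s, h) = v_a^T tanh(W_a [s; h])."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_a = nn.Linear(dec_dim + enc_dim, attn_dim, bias=False)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, annotations):
        # s_prev: (batch, dec_dim); annotations h_j: (batch, T, enc_dim)
        T = annotations.size(1)
        s_rep = s_prev.unsqueeze(1).expand(-1, T, -1)              # repeat s_{i-1} for each j
        e = self.v_a(torch.tanh(self.W_a(torch.cat([s_rep, annotations], dim=-1))))
        alpha = torch.softmax(e.squeeze(-1), dim=-1)               # alignment weights alpha_ij
        c = torch.bmm(alpha.unsqueeze(1), annotations).squeeze(1)  # context vector c_i
        return c, alpha

attn = AdditiveAttention(dec_dim=8, enc_dim=6, attn_dim=10)
c, alpha = attn(torch.randn(2, 8), torch.randn(2, 5, 6))
print(c.shape, alpha.shape)  # torch.Size([2, 6]) torch.Size([2, 5])
```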

## Attention zoo

**Content-based attention**

| Name | Alignment score function |
| --- | --- |
| Concat (additive, FF-NN) [5] | $\text{score}(\mathbf{h}_t, \overline{\mathbf{h}}_s) = \mathbf{v}_a^\top \tanh(\mathbf{W}_a [\mathbf{h}_t; \overline{\mathbf{h}}_s])$ |
| General [5] | $\text{score}(\mathbf{h}_t, \overline{\mathbf{h}}_s) = \mathbf{h}_t^\top \mathbf{W}_a \overline{\mathbf{h}}_s$ |
| Dot-product [5] | $\text{score}(\mathbf{h}_t, \overline{\mathbf{h}}_s) = \mathbf{h}_t^\top \overline{\mathbf{h}}_s$ |
| Scaled dot-product [6] | $\text{score}(\mathbf{h}_t, \overline{\mathbf{h}}_s) = \dfrac{\mathbf{h}_t^\top \overline{\mathbf{h}}_s}{\sqrt{d}}$, where $d$ is the dimension of the source hidden states |
| Self-attention [7] | Replace the target sequence with the input sequence itself; can be combined with any of the score functions above |

**Location-based attention**

| Name | Alignment score function |
| --- | --- |
| Location-based [5] | $\alpha_{ij} = \text{softmax}(\mathbf{W}_a \mathbf{h}_{t=j})$ |
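The content-based variants differ only in how the score is computed. A compact sketch of the dot-product family (the concat/additive form was sketched above; shapes are illustrative):

```python
import math
import torch

def score_dot(h_t, h_s):
    # Dot-product [5]
    return h_t @ h_s

def score_scaled_dot(h_t, h_s):
    # Scaled dot-product [6]: divide by sqrt(d)
    return (h_t @ h_s) / math.sqrt(h_s.size(-1))

def score_general(h_t, h_s, W_a):
    # General [5]: h_t^T W_a h_s
    return h_t @ (W_a @ h_s)

d = 4
h_t, h_s, W_a = torch.randn(d), torch.randn(d), torch.randn(d, d)
print(score_dot(h_t, h_s), score_scaled_dot(h_t, h_s), score_general(h_t, h_s, W_a))
```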

## Transformer

### Background

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence. Sequence-aligned RNNs, in contrast, preclude parallelization.

### Model architecture

Architecture: stacked self-attention + position-wise fully connected feed-forward layers (with residual connections + layer normalization)

#### Encoder

Each Transformer encoder layer applies one multi-head attention sublayer followed by one position-wise feed-forward sublayer, adopting the residual connection and layer normalization trick: $\text{LayerNorm}(x + \text{Sublayer}(x))$

Layer Normalization:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance of $x$ over the feature dimension, and $\gamma$ and $\beta$ are learnable affine transform parameters.
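As a quick sanity check of this formula, the manual computation below matches `torch.nn.LayerNorm` (`ln.weight` and `ln.bias` play the roles of $\gamma$ and $\beta$; the $\epsilon$ is PyTorch's default):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 5, 8)                       # (batch, seq, d_model)
ln = nn.LayerNorm(8)

mu = x.mean(dim=-1, keepdim=True)              # mean over the feature dimension
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = ln.weight * (x - mu) / torch.sqrt(var + ln.eps) + ln.bias

assert torch.allclose(manual, ln(x), atol=1e-5)
```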

• $N = 6$ stacked Transformer layers
• Output dimension $d_{model} = 512$

#### Decoder

• Same as the encoder, N = 6 identical stacked layers

• Difference
The first multi-head attention sublayer is masked to prevent positions from attending to subsequent positions, ensuring that the prediction at position $i$ depends only on the known outputs at positions less than $i$, regardless of the future (a sketch of the mask follows).
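One common way to realize this mask (a sketch, not the reference implementation): set the logits of future positions to $-\infty$ before the softmax, so their attention weights become exactly zero:

```python
import torch

def causal_mask(n):
    # True above the diagonal marks "future" positions to be blocked
    return torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

scores = torch.randn(4, 4)                          # raw attention logits
scores = scores.masked_fill(causal_mask(4), float("-inf"))
weights = torch.softmax(scores, dim=-1)             # row i attends only to positions <= i
print(weights)                                      # upper triangle is all zeros
```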

The Transformer regards the encoded representation of the input sequence as a set of key-value pairs $(K, V)$, of size $n$ (the input sequence length). In the MT context, the encoder hidden states serve as the $(K, V)$ pairs. In the decoder, the previous output serves as a query (of size $m$).

#### Scaled dot-product attention

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $d_k$ is the dimension of the keys.

In the conventional attention view, the softmax term produces the alignment weights $\alpha_{ij}$, and the output is the weighted sum of the values, just like $c_i = \sum_j \alpha_{ij} h_j$ above.

In practice, dot-product attention is faster and more space-efficient than additive attention (one FF layer) [6].

More precisely, for the input sequence $x = (x_1, \cdots, x_n)$, dot-product self-attention outputs a new sequence $z = (z_1,\cdots, z_n)$ of the same length, with $z_i \in \mathbb{R}^{d_z}$:

$$z_i = \sum_{j=1}^{n} \alpha_{ij} \left(x_j W^V\right), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}, \qquad e_{ij} = \frac{(x_i W^Q)(x_j W^K)^\top}{\sqrt{d_z}}$$
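A direct translation of the formula into a PyTorch function (a sketch; batching and masking kept minimal):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (..., m, d_k); K: (..., n, d_k); V: (..., n, d_v)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))  # (..., m, n)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    alpha = torch.softmax(scores, dim=-1)                     # alignment weights
    return alpha @ V                                          # (..., m, d_v)

Q, K, V = torch.randn(2, 3, 8), torch.randn(2, 5, 8), torch.randn(2, 5, 16)
print(scaled_dot_product_attention(Q, K, V).shape)            # torch.Size([2, 3, 16])
```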

Multi-head: “linearly project the $Q$, $K$ and $V$ $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively” [6]. Then concatenate all $h$ of the $d_v$-dimensional outputs and use a linear layer to project them to the final representation.

Multi-head attention allows the model to “jointly attend to information from different representation subspaces at different positions” [6]:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O, \qquad \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, $W_i^O \in \mathbb{R}^{hd_{v} \times d_{model}}$
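A sketch of a multi-head wrapper around the `scaled_dot_product_attention` function above, with $d_k = d_v = d_{model}/h$ as in [6] (class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # the 4 dense layers: W^Q, W^K, W^V and the output projection W^O
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def _split(self, x):
        # (batch, seq, d_model) -> (batch, h, seq, d_k)
        b, n, _ = x.shape
        return x.view(b, n, self.h, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        q, k, v = self._split(self.W_q(Q)), self._split(self.W_k(K)), self._split(self.W_v(V))
        z = scaled_dot_product_attention(q, k, v, mask)        # (batch, h, seq_q, d_k)
        b, _, n, _ = z.shape
        z = z.transpose(1, 2).contiguous().view(b, n, self.h * self.d_k)  # concat heads
        return self.W_o(z)

mha = MultiHeadAttention()
x = torch.randn(2, 10, 512)
print(mha(x, x, x).shape)  # torch.Size([2, 10, 512]); Q = K = V gives self-attention
```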

Transformer attention:

• Encoder-decoder attention mimics the conventional attention mechanism: $Q$ comes from the previous decoder layer; $K$ and $V$ come from the encoder output. This allows every position in the decoder to attend over all positions in the input sequence.
• Encoder self-attention: $K = V = Q$, i.e. the output of the previous layer. Each position in the encoder can attend to all positions in the previous layer of the encoder.
• Decoder self-attention: each position in the decoder can attend to all positions in the decoder up to and including that position.

### Positional encoding

Drawback: self-attention cannot capture the order information of input sequences.

Positional embeddings can be learned or pre-fixed [8].

RNN solution: RNNs inherently model sequential information, but preclude parallelization.

• Residual connections help propagate position information to higher layers.

#### Absolute Positional Encoding

Transformer solution: use sinusoidal timing signal as positional encoding (PE).

$$\text{PE}_{(\text{pos}, 2i)} = \sin\!\left(\frac{\text{pos}}{10000^{2i/d_{model}}}\right), \qquad \text{PE}_{(\text{pos}, 2i+1)} = \cos\!\left(\frac{\text{pos}}{10000^{2i/d_{model}}}\right)$$

where $\text{pos}$ is the position in the sentence and $i$ is the index along the embedding dimension. The assumption is that this allows the model to learn to attend by relative positions, since for any fixed offset $k$, $\text{PE}_{\text{pos}+k}$ can be represented as a linear function of $\text{PE}_{\text{pos}}$ [6].

Usage: before the stacked encoder/decoder, take the sum of the PE and the input embeddings.
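A sketch of building the sinusoidal PE table and adding it to the input embeddings (dimensions illustrative):

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1).float()    # (max_len, 1)
    # 10000^(-2i/d_model) for the even indices
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)                  # odd dimensions
    return pe

x = torch.randn(2, 50, 512)                             # input embeddings (batch, seq, d_model)
x = x + sinusoidal_pe(50, 512)                          # usage: sum of PE and embeddings
```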

#### Relative Positional Representation (RPR)

• Relation-aware self-attn
Consider the pairwise relationships between input elements, which can be seen as a labeled, directed fully-connected graph. Let $a_{ij}^V,a_{ij}^K \in \mathbb{R}^{d_a}$ represent the edge between input elements $x_i$ and $x_j$.

Then add the pairwise information to the sublayer output:

$$z_i = \sum_{j=1}^{n} \alpha_{ij} \left(x_j W^V + a_{ij}^V\right), \qquad e_{ij} = \frac{(x_i W^Q)(x_j W^K + a_{ij}^K)^\top}{\sqrt{d_z}}$$

• Clipped RPR
$k$ denotes the maximum relative position. Relative position information beyond $k$ is clipped to the maximum value, which generalizes to sequence lengths unseen during training [14]. In other words, RPR only considers context in a fixed window of size $2k+1$: $k$ elements on the l.h.s., $k$ elements on the r.h.s., and the element itself:

$$a_{ij}^K = w^K_{\text{clip}(j-i,\,k)}, \qquad a_{ij}^V = w^V_{\text{clip}(j-i,\,k)}, \qquad \text{clip}(x, k) = \max(-k, \min(k, x))$$

where the RPR embeddings $w^K = (w_{-k}^K, \cdots, w_k^K)$ and $w^V = (w_{-k}^V, \cdots, w_k^V)$, with each $w_i^K, w_i^V \in \mathbb{R}^{d_a}$, are learnable.
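A sketch of the clipped relative-position lookup: indices $j - i$ are clipped to $[-k, k]$ and shifted by $k$ so they index rows of a $(2k+1) \times d_a$ table (in a real model `w_K` would be an `nn.Parameter`):

```python
import torch

def rpr_embeddings(n, k, w):
    # w: (2k+1, d_a) table holding w_{-k}, ..., w_{k}
    pos = torch.arange(n)
    rel = pos.unsqueeze(0) - pos.unsqueeze(1)   # rel[i, j] = j - i
    rel = rel.clamp(-k, k) + k                  # clip to the window, shift to [0, 2k]
    return w[rel]                               # a_ij: (n, n, d_a)

k, d_a = 3, 8
w_K = torch.randn(2 * k + 1, d_a)               # learnable in practice
a_K = rpr_embeddings(n=10, k=k, w=w_K)
print(a_K.shape)                                # torch.Size([10, 10, 8])
```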

Trainable parameter count (sanity-checked below):

• MADPA: $\overbrace{4 \times \big(d_{model} \times d_{model} + d_{model} \big)}^\text{4 dense layers}$
• MADPA with RPR: $\underbrace{4 \times \big(d_{model} \times d_{model} + d_{model} \big)}_\text{4 dense layers} + \underbrace{\color{red}{2 \times (\text{seq\_len}^2 \times d_k )}}_\text{2 RPR matrices}$ (note: with clipping, only $(2k+1) \times d_a$ entries per RPR matrix are independent parameters [14])
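The 4-dense-layer count can be checked against `torch.nn.MultiheadAttention`, which packs $W^Q$, $W^K$, $W^V$ into a single input projection plus an output projection, all with biases:

```python
import torch.nn as nn

d_model = 512
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=8)
n_params = sum(p.numel() for p in mha.parameters())
assert n_params == 4 * (d_model * d_model + d_model)   # = 1,050,624
```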

• My PyTorch implementation

• Tensorflow implementation: [16]

Thought: the current attention mechanism is single-round and one-dimensional (along the sequence dimension).

# References

1. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. NIPS (pp. 3104-3112).
2. Weng, L. (2018, Jun 24). Attention? Attention! [Blog post]. Retrieved from https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
3. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473.
4. Cho, K., van Merrienboer, B., Bahdanau, D., & Bengio, Y. (2014). On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. SSST@EMNLP.
5. Luong, T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. EMNLP.
6. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. NIPS.
7. Cheng, J., Dong, L., & Lapata, M. (2016). Long Short-Term Memory-Networks for Machine Reading. EMNLP.
8. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE.
9. Raffel, C., & Ellis, D. P. (2015). Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems. CoRR, abs/1512.08756.

14. Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-Attention with Relative Position Representations. arXiv preprint arXiv:1803.02155.

16.