This is an introduction to variants of the Transformer.[1]

Relevant notes of mine (CYK):

# Transformer

The details of the Transformer are explained in previous blogs. The schema of the Transformer is shown in the following figure.

• Architecture
• Decoding

# Vanilla Transformer

In practice, it is infeasible to process the entire corpus as a single context sequence, due to limited resources.

Vanilla Transformer (Al-Rfou et al., 2019)[2] splits the entire corpus into shorter segments and trains within each segment. This leads to the context fragmentation problem, since all contextual information from previous segments is ignored.

As in the above fig., information never flows across segments.

• Evaluation

During evaluation, for each output step the segment shifts right by only one position, which hurts decoding efficiency and speed.

# Relative Positional Representation (RPR)

• Relation-aware self-attn
Consider the pairwise relationships between input elements, which can be seen as a labeled, directed fully-connected graph. Let $a_{ij}^V,a_{ij}^K \in \mathbb{R}^{d_a}$ represent the edge between input elements $x_i$ and $x_j$.

Then add the pairwise information to the sublayer output:
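The elided formulas can be reconstructed from Shaw et al.[5] as follows (a hedged reconstruction, using this section's notation):

$$z_i = \sum_{j=1}^{n} \alpha_{ij} \left( x_j W^V + a_{ij}^V \right), \qquad e_{ij} = \frac{x_i W^Q \left( x_j W^K + a_{ij}^K \right)^\top}{\sqrt{d_z}}$$

where $\alpha_{ij} = \exp e_{ij} / \sum_{l=1}^{n} \exp e_{il}$ is the softmax-normalized attention weight; the edge vectors $a_{ij}^K$ and $a_{ij}^V$ are simply added to the key and value of position $j$.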

• Clip RPR
$k$ denotes the maximum relative position. Relative position information beyond $k$ is clipped to the maximum value, which generalizes to sequence lengths unseen during training.[5] In other words, RPR only considers context within a fixed window of size $2k+1$: $k$ elements on the l.h.s., $k$ elements on the r.h.s., and the element itself.

where the RPR tables $w^K = (w_{-k}^K, \cdots, w_k^K) \in \mathbb{R}^{(2k+1) \times d_a}$ and $w^V = (w_{-k}^V, \cdots, w_k^V) \in \mathbb{R}^{(2k+1) \times d_a}$ are learnable.
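The clipping-and-lookup step can be sketched in plain Python (a minimal sketch; the function name is mine, not from [5]):

```python
def clipped_rpr_index(i, j, k):
    # Relative distance j - i, clipped to [-k, k], then shifted by +k
    # so it can index a (2k+1)-row embedding table w^K or w^V.
    return max(-k, min(k, j - i)) + k

# With k = 2, positions beyond the window share the edge embedding:
assert clipped_rpr_index(0, 0, 2) == 2   # distance 0 -> middle row
assert clipped_rpr_index(0, 5, 2) == 4   # clipped to +k
assert clipped_rpr_index(5, 0, 2) == 0   # clipped to -k
```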

Trainable param number:

• MADPA: $\overbrace{4 \times \big(\text{d_model} \times \text{d_model} + \text{d_model} \big)}^\text{4 dense layers}$
• MADPA with RPR: $\underbrace{4 \times \big(\text{d_model} \times \text{d_model} + \text{d_model} \big)}_\text{4 dense layers} + \underbrace{\color{red}{2 \times \big( (2k+1) \times d_a \big)}}_\text{2 RPR tables}$

• My PyTorch implementation

• Tensorflow implementation: [7]
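For illustration, a minimal single-head sketch of RPR attention in PyTorch (my own simplification, not the linked implementations; no batching or masking):

```python
import torch
import torch.nn.functional as F

def rpr_attention(x, wq, wk, wv, a_k, a_v, k):
    """Single-head relation-aware self-attention (sketch of Shaw et al. [5]).

    x: (L, d) inputs; wq/wk/wv: (d, d) projections;
    a_k/a_v: (2k+1, d) learnable RPR tables w^K / w^V.
    """
    L, d = x.shape
    q, key, v = x @ wq, x @ wk, x @ wv
    # matrix of clipped relative distances j - i, shifted into [0, 2k]
    idx = torch.arange(L)[None, :] - torch.arange(L)[:, None]
    idx = idx.clamp(-k, k) + k                       # (L, L)
    rk, rv = a_k[idx], a_v[idx]                      # (L, L, d)
    # e_ij = q_i (k_j + a_ij^K)^T / sqrt(d)
    scores = (q @ key.T + torch.einsum('id,ijd->ij', q, rk)) / d ** 0.5
    alpha = F.softmax(scores, dim=-1)
    # z_i = sum_j alpha_ij (v_j + a_ij^V)
    return alpha @ v + torch.einsum('ij,ijd->id', alpha, rv)
```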

# Transformer-XL

Transformer-XL[3] is capable of learning long-term dependencies across the context fragments of vanilla Transformers. It mainly employs a segment-level recurrence mechanism and a relative positional encoding scheme.

## Segment-level recurrence

During training, Transformer-XL consumes both the current and the previous segment, leveraging a recurrence mechanism at the segment level.

Let two consecutive segments of length $L$ be $\pmb{s}_\tau = [x_{\tau,1}, \cdots, x_{\tau,L}]$ and $\pmb{s}_{\tau+1}=[x_{\tau+1,1}, \cdots, x_{\tau+1,L}]$. Denote the $d$-dimensional hidden state of the $n$-th layer for the $\tau$-th segment $\pmb{s}_\tau$ by $\pmb{h}_\tau^n \in \mathbb{R}^{L \times d}$.
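The elided recurrence can be reconstructed following [3], where $\mathrm{SG}(\cdot)$ denotes stop-gradient and $[\cdot \circ \cdot]$ concatenation along the length dimension:

$$\tilde{\pmb{h}}_{\tau+1}^{n-1} = \big[ \mathrm{SG}(\pmb{h}_{\tau}^{n-1}) \circ \pmb{h}_{\tau+1}^{n-1} \big]$$

$$\pmb{q}_{\tau+1}^{n},\ \pmb{k}_{\tau+1}^{n},\ \pmb{v}_{\tau+1}^{n} = \pmb{h}_{\tau+1}^{n-1} \pmb{W}_q^\top,\ \tilde{\pmb{h}}_{\tau+1}^{n-1} \pmb{W}_k^\top,\ \tilde{\pmb{h}}_{\tau+1}^{n-1} \pmb{W}_v^\top$$

$$\pmb{h}_{\tau+1}^{n} = \text{Transformer-Layer}\big(\pmb{q}_{\tau+1}^{n}, \pmb{k}_{\tau+1}^{n}, \pmb{v}_{\tau+1}^{n}\big)$$

Note that only the query is computed from the current segment alone; keys and values also see the cached previous segment.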

Thus, the recurrent dependency between $\pmb{h}_{\tau+1}^n$ and $\pmb{h}_{\tau}^{n-1}$ shifts one layer vertically and one segment horizontally, unlike the same-layer recurrence in RNNs. As a result, the longest possible dependency length grows linearly in the number of layers times the segment length, i.e., $O(N \times L)$.

• Evaluation
During evaluation, the representations of previous segments can be reused, which is much faster than the vanilla Transformer (as in the fig. below).
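The caching of previous-segment states can be sketched as follows (my own minimal sketch, mirroring the $\mathrm{SG}(\cdot)$ stop-gradient of [3] with `detach()`; the function name and `mem_len` parameter are assumptions):

```python
import torch

def update_memory(prev_mem, hidden, mem_len):
    """Cache hidden states of one layer for the next segment (sketch).

    prev_mem: (M, d) cached states or None; hidden: (L, d) current states.
    Gradients are stopped with detach(), so backprop never crosses segments.
    """
    if prev_mem is None:
        return hidden.detach()[-mem_len:]
    # append the new states, keep only the most recent mem_len positions
    return torch.cat([prev_mem, hidden], dim=0).detach()[-mem_len:]
```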

## Positional Encoding

### Absolute Positional Encoding

• Problems: using the same absolute positional encoding $\pmb{U}_{1:L}$ for all segments makes it impossible to distinguish the same position in different segments, i.e., $x_{\tau,j}$ versus $x_{\tau+1,j}$ for any $j=1, \cdots, L$.[1]
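The elided decomposition of the absolute-PE attention score between query $i$ and key $j$ can be reconstructed following [3]:

$$\pmb{A}_{i,j}^{\text{abs}} = \underbrace{\pmb{E}_{x_i}^\top \pmb{W}_q^\top \pmb{W}_k \pmb{E}_{x_j}}_{(a)} + \underbrace{\pmb{E}_{x_i}^\top \pmb{W}_q^\top \pmb{W}_k \pmb{U}_j}_{(b)} + \underbrace{\pmb{U}_i^\top \pmb{W}_q^\top \pmb{W}_k \pmb{E}_{x_j}}_{(c)} + \underbrace{\pmb{U}_i^\top \pmb{W}_q^\top \pmb{W}_k \pmb{U}_j}_{(d)}$$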

Here,

• (a) captures content-based addressing, i.e., how much attention the word in row $i$ pays to the word in column $j$, regardless of position.
• (b) captures a content-dependent positional bias, representing how much the word in row $i$ should attend to position $j$.
• (c) defines a global content bias, denoting how much position $i$ should attend to the word in position $j$.
• (d) denotes a global positional bias, i.e., the soft attention that position $i$ should pay to position $j$.

### Relative positional encoding

Solution: use relative positional encoding. Conceptually, positional encoding (PE) gives a temporal clue or bias about how information should be gathered, i.e., where to attend.[3] It is sufficient to know the relative distance between each key vector $k_{\tau,j}$ and the query itself $q_{\tau,i}$, i.e., $i-j$.

Replacement:

1. replace all absolute PEs $U_j$ in (b) and (d) with the relative counterpart $\color{cyan}{R_{i-j}}$, a sinusoid encoding matrix without learnable weights.
2. replace the query term $\color{red}{U_{i,\bullet}^\top W_q^\top}$ with a trainable parameter $\color{blue}{u \in \mathbb{R}^d}$ and, similarly, $\color{blue}{v \in \mathbb{R}^d}$ in (d). Since the query vector is the same for all query positions, the bias toward words at various positions should be identical regardless of the query position.
3. substitute the weight of the key vector with two matrices $\color{Salmon}{W_{k,E}}$ and $\color{ForestGreen}{W_{k,R}}$ respectively, to produce the $\color{Salmon}{\text{content-based}}$ and $\color{Green}{\text{location-based}}$ key vectors.
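Applying these three replacements, the score becomes (reconstructed following [3]):

$$\pmb{A}_{i,j}^{\text{rel}} = \underbrace{\pmb{E}_{x_i}^\top \pmb{W}_q^\top \color{Salmon}{\pmb{W}_{k,E}} \pmb{E}_{x_j}}_{(a)} + \underbrace{\pmb{E}_{x_i}^\top \pmb{W}_q^\top \color{ForestGreen}{\pmb{W}_{k,R}} \color{cyan}{\pmb{R}_{i-j}}}_{(b)} + \underbrace{\color{blue}{u}^\top \color{Salmon}{\pmb{W}_{k,E}} \pmb{E}_{x_j}}_{(c)} + \underbrace{\color{blue}{v}^\top \color{ForestGreen}{\pmb{W}_{k,R}} \color{cyan}{\pmb{R}_{i-j}}}_{(d)}$$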

Thus,

• (b) captures content-dependent positional bias
• (c) denotes the global content bias
• (d) represents the global positional bias

The PyTorch implementation:
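As a minimal sketch of the score computation (my own simplification; a faithful implementation also applies the "relative shift" re-indexing trick of [3], which is omitted here):

```python
import torch

def rel_attn_score(E, R, Wq, WkE, WkR, u, v):
    """Sketch of the Transformer-XL attention score A^rel [3].

    E: (L, d) content embeddings; R: (L, d) sinusoid relative encodings,
    row t standing in for R_{i-j} at distance t (a simplification);
    Wq, WkE, WkR: (d, d) projections; u, v: (d,) global biases.
    """
    q = E @ Wq
    kE, kR = E @ WkE, R @ WkR
    ac = (q + u) @ kE.T   # terms (a) + (c): content-based addressing + bias
    bd = (q + v) @ kR.T   # terms (b) + (d), before the relative shift
    return ac + bd
```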

#### Comparison with Shaw et al. (2018)

Relative positional representation (RPR) (Shaw et al., 2018) leverages only relative positional embeddings, throwing away the sinusoid hard encodings. The RPR term $\color{red}{a_{ij}^K}$ introduces trainable parameters. See my attention blog [6] for more details.

• The terms in the numerator correspond to terms (a) and (b) of the relative PE in Transformer-XL. Clearly, RPR lacks the (c) and (d) terms.

# R-Transformer

• Argument: multi-head attention only learns global dependencies, ignoring the inherent local structures of the sequence.[4]

## LocalRNN

R-Transformer[4] employs a LocalRNN to model local structures, focusing only on short-term dependencies within a local short sequence of length $M$: $x_{t-M+1}, x_{t-M+2}, \cdots, x_t$. The last hidden state $h_t$ is the representation of this local short sequence of fixed length $M$.

• LocalRNNs pad $(M-1)$ positions before the start of a sequence.
• R-Transformers do not use any position embeddings.
• Here, the LocalRNN resembles a 1-D ConvNet, except that the operation applied to each window is an RNN rather than a convolution; the convolution op completely ignores the sequential order of positions within the local window.

Image source:[4]

Given a sequence of length $m$: $x_1, x_2, x_3, \cdots, x_m$ and window size $k=4$, the LocalRNN encodes each segmented short sub-sequence as:

In implementation,

1. first pad the sequence on the left with (kernel size - 1) all-zero embeddings;
2. then segment it into sub-sequences of window size $k$, shifting right by one position per time step. (See the above diagram.)
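The two steps above can be sketched in PyTorch via `Tensor.unfold` (a minimal sketch; the function name is mine, not from [4]):

```python
import torch

def local_windows(x, k):
    """Pad and segment a sequence into sliding windows (R-Transformer sketch).

    x: (L, d) sequence; returns (L, k, d), one length-k window per position,
    where window t covers original positions t-k+1 .. t (zeros before t=0).
    """
    L, d = x.shape
    padded = torch.cat([x.new_zeros(k - 1, d), x], dim=0)   # left zero-padding
    # unfold yields (L, d, k); transpose to (L, k, d) windows
    return padded.unfold(0, k, 1).transpose(1, 2)
```

Each window would then be fed through a shared RNN, keeping only its last hidden state as that position's local representation.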

For the $i$-th layer ($i \in \{1,2,\cdots,N\}$),
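the elided layer equations can be reconstructed following [4]: each layer stacks LocalRNN, multi-head attention, and a feed-forward network, each with a residual connection and layer normalization:

$$\hat{\pmb{x}}^{i} = \text{LayerNorm}\big(\text{LocalRNN}(\pmb{x}^{i-1}) + \pmb{x}^{i-1}\big)$$

$$\bar{\pmb{x}}^{i} = \text{LayerNorm}\big(\text{MultiHeadAttn}(\hat{\pmb{x}}^{i}) + \hat{\pmb{x}}^{i}\big)$$

$$\pmb{x}^{i} = \text{LayerNorm}\big(\text{FFN}(\bar{\pmb{x}}^{i}) + \bar{\pmb{x}}^{i}\big)$$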

# References

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
2. Al-Rfou, R., Choe, D., Constant, N., Guo, M., & Jones, L. (2019, July). Character-level language modeling with deeper self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, pp. 3159-3166).
3. Dai, Z., Yang, Z., Yang, Y., Cohen, W. W., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
4. Wang, Z., Ma, Y., Liu, Z., & Tang, J. (2019). R-Transformer: Recurrent Neural Network Enhanced Transformer. arXiv preprint arXiv:1907.05572.
5. Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
6.
7.