Machine reading comprehension aims to answer questions given a passage or document.

Symbol matching models

Frame-Semantic parsing

Frame-semantic parsing identifies predicates and their arguments, i.e. “who did what to whom”.

Word Distance

Sum the distances of every word in $q$ to their nearest aligned word in $d$

Teaching Machines to Read and Comprehend

Deep LSTM Reader

In NMT, deep LSTMs have shown a remarkable ability to embed long sequences into a vector representation, which contains enough information to generate a full translation in another language.^[2]

The Deep LSTM Reader feeds the document one word at a time into a deep LSTM encoder, followed by a delimiter and then the query, or the query first and then the document ($d \oplus ||| \oplus q$, or $q \oplus ||| \oplus d$). The network predicts which token in the document answers the query.

Attentive Reader

Limitations of the Deep LSTM Reader:

fixed width hidden vector

Solution: the Attentive Reader employs a finer grained token level attention mechanism, where the tokens are embedded given their entire future and past context in the input documents.

Attentive Reader encodes the document $d$ and the query $q$ with two separate 1-layer bi-LSTMs.^[2]

When encoding the query $q$, the encoding $u$ of a query with length $|q|$ is the concatenation of the final forward and backward outputs:

$u = \overrightarrow{y_q}(|q|) || \overleftarrow{y_q}(1)$

When encoding the document $d$, each token at position $t$ is:

$y_d(t) = \overrightarrow{y_d}(t) || \overleftarrow{y_d}(t)$

The representation $r$ of $d$ is a weighted sum of these output vectors. The weights can be interpreted as the degree to which the network attends to a particular token in the document $d$ when answering the query:

$m(t) = \text{tanh}(W_{ym} y_d(t) + W_{um}u)$ $s(t) \propto \text{exp}(w^T_{ms} m(t))$ $r = y_d s$

Finally, the joint document and query embedding is:

$g^{AR}(d,q) = \text{tanh}(W_{rg}r + W_{ug}u)$
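
The attention step of the Attentive Reader can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the bi-LSTM encodings are replaced by random arrays, all sizes are made up, and the code assumes the usual formulation in which the query term sits inside the tanh, $m(t) = \text{tanh}(W_{ym} y_d(t) + W_{um} u)$.

```python
import numpy as np

def attentive_reader(y_d, u, W_ym, W_um, w_ms, W_rg, W_ug):
    """y_d: (T, h) document token encodings; u: (h,) query encoding."""
    m = np.tanh(y_d @ W_ym.T + W_um @ u)   # (T, k): m(t) for every token t
    scores = m @ w_ms                       # (T,): w_ms^T m(t)
    s = np.exp(scores - scores.max())       # subtract the max for numerical stability
    s = s / s.sum()                         # normalised attention weights s(t)
    r = y_d.T @ s                           # (h,): attention-weighted sum of token encodings
    g = np.tanh(W_rg @ r + W_ug @ u)        # joint document/query embedding
    return s, r, g

# Random stand-ins for the encoder outputs and parameters (illustrative sizes only).
rng = np.random.default_rng(0)
T, h, k, out = 6, 4, 3, 5
s, r, g = attentive_reader(
    rng.normal(size=(T, h)), rng.normal(size=h),
    rng.normal(size=(k, h)), rng.normal(size=(k, h)), rng.normal(size=k),
    rng.normal(size=(out, h)), rng.normal(size=(out, h)),
)
print(s.round(3), s.sum())
```

The weights `s` form a distribution over document tokens, so `r` stays on the same scale as a single token encoding regardless of document length.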

Impatient Reader

The Attentive Reader focuses on the passages of a context document that are most likely to inform the answer to the query.

Impatient Reader can reread from the document as each query token is read.^[2]

At each token $i$ of the query $q$, the model computes the document representation vector $r(i)$ with the bidirectional embedding $y_q(i) = \overrightarrow{y_q}(i) || \overleftarrow{y_q}(i)$ :

$m(i, t) = \text{tanh}(W_{dm}y_d(t) + W_{rm} r(i-1) + W_{qm} y_q(i)), \quad 1 \leq i \leq |q|$ $s(i,t) \propto \text{exp}(w_{ms}^T m(i,t))$ $r(0)= r_0, \quad r(i) = y_d^T s(i) + \text{tanh}(W_{rr}r(i-1)), \quad 1 \leq i \leq |q|$

The attention mechanism allows the model to recurrently accumulate information from the document as it sees each query token, ultimately outputting a final joint document-query representation for answer prediction:

$g^{IR}(d,q) = \text{tanh}(W_{rg}r(|q|) + W_{qg} u)$
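
The recurrence over query tokens can be sketched as a plain loop. This is an illustrative NumPy sketch under assumed shapes: the encodings `y_d`, `y_q`, `u` and the parameter dictionary `p` are random placeholders, and the key names simply mirror the symbols in the equations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def impatient_reader(y_d, y_q, u, p):
    """y_d: (T, h) document encodings, y_q: (Q, h) query encodings, u: (h,)."""
    r = p["r0"]                                          # r(0)
    for i in range(y_q.shape[0]):                        # one re-read per query token
        m = np.tanh(y_d @ p["W_dm"].T + p["W_rm"] @ r + p["W_qm"] @ y_q[i])  # (T, k)
        s = softmax(m @ p["w_ms"])                       # attention over document tokens
        r = y_d.T @ s + np.tanh(p["W_rr"] @ r)           # r(i) built from r(i-1)
    return np.tanh(p["W_rg"] @ r + p["W_qg"] @ u)        # g^{IR}(d, q)

rng = np.random.default_rng(0)
T, Q, h, k, out = 7, 3, 4, 5, 6                          # illustrative sizes only
params = {
    "r0": np.zeros(h),
    "W_dm": rng.normal(size=(k, h)), "W_rm": rng.normal(size=(k, h)),
    "W_qm": rng.normal(size=(k, h)), "w_ms": rng.normal(size=k),
    "W_rr": rng.normal(size=(h, h)),
    "W_rg": rng.normal(size=(out, h)), "W_qg": rng.normal(size=(out, h)),
}
g_ir = impatient_reader(rng.normal(size=(T, h)), rng.normal(size=(Q, h)),
                        rng.normal(size=h), params)
print(g_ir.shape)
```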

Attention Sum Reader

For cloze-style QA. ^[6]

Compute the vector embedding for the query. $g(\pmb{q}) = \overrightarrow{g_{|\pmb{q}|}}(\pmb{q}) || \overleftarrow{g_1}(\pmb{q})$
Compute the vector embedding of each individual word in the context of the whole document. The word embedding is a look-up table $V$. $f_i(\pmb{d}) = \overrightarrow{f_i} (\pmb{d}) || \overleftarrow{f_i}(\pmb{d})$
Dot product between the question embedding and the contextual embedding. Select the most likely answer.
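
The three steps above can be sketched as follows. The encoders are replaced by precomputed arrays, and the function name `attention_sum` is hypothetical. The code also sums the attention over repeated occurrences of the same word, which is the aggregation step that gives the Attention Sum Reader its name.

```python
import numpy as np

def attention_sum(doc_tokens, f_d, g_q):
    """doc_tokens: list of T words; f_d: (T, h) contextual embeddings; g_q: (h,)."""
    scores = f_d @ g_q                       # dot product with the query embedding
    s = np.exp(scores - scores.max())
    s = s / s.sum()                          # softmax over document positions
    probs = {}
    for tok, p in zip(doc_tokens, s):        # sum attention over repeated words
        probs[tok] = probs.get(tok, 0.0) + float(p)
    return probs

# Toy input: identical embeddings give uniform attention, so the repeated word wins.
tokens = ["alice", "bob", "alice"]
probs = attention_sum(tokens, np.zeros((3, 2)), np.ones(2))
answer = max(probs, key=probs.get)
print(answer, probs)
```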

EpiReader

Pointer Nets

Problems:

A conventional seq2seq architecture can only apply a softmax distribution over a fixed-size output dictionary. It cannot handle problems where the size of the output dictionary equals the length of the input sequence, which varies from example to example.^[8]

$p(\mathcal{C}^{\mathcal{P}} \vert \mathcal{P}; \theta) = \prod_{i=1}^{m(\mathcal{P})} p(C_i \vert C_1, \cdots, C_{i-1}, \mathcal{P}; \theta)$

where $\mathcal{P}=\{ P_1, \cdots, P_n \}$ is a sequence of $n$ vectors and $\mathcal{C}^{\mathcal{P}} = \{ C_1, \cdots, C_{m(\mathcal{P})} \}$ is a sequence of $m(\mathcal{P})$ indices.

The parameters are learnt by maximizing the conditional probabilities of the training set:

$\theta^* = \arg\max_\theta \sum_{\mathcal{P}, \mathcal{C}^{\mathcal{P}}} \log p(\mathcal{C}^{\mathcal{P}} \vert \mathcal{P}; \theta)$

Solution: Pointer Net.

Applies the attention mechanism:
$u_j^i = v^T \tanh (W_1 e_j + W_2 d_i) \quad j \in (1,\cdots,n)$ $p(C_i \vert C_1, \cdots, C_{i-1}, \mathcal{P}) = \text{softmax}(u^i)$
where softmax normalizes the vector $u^i$ (of length $n$) to be an output distribution over the dictionary of inputs. And $v$, $W_1$ , $W_2$ are learnable parameters of the output model.

Here, we do not blend the encoder state $e_j$ to propagate extra information to the decoder. Instead, we use $u_j^i$ as pointers to the input elements.

Ptr Nets can be seen as an application of content-based attention mechanisms.
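
A small NumPy sketch of the pointer distribution makes the key property visible: the output has one entry per input element, so the same parameters work for any input length $n$. The encoder and decoder states here are random placeholders, and the function name is an assumption for illustration.

```python
import numpy as np

def pointer_distribution(E, d_i, W1, W2, v):
    """E: (n, h) encoder states e_j; d_i: (h,) decoder state at step i."""
    u = np.tanh(E @ W1.T + W2 @ d_i) @ v     # (n,): u_j^i = v^T tanh(W1 e_j + W2 d_i)
    p = np.exp(u - u.max())
    return p / p.sum()                        # softmax over the n input positions

rng = np.random.default_rng(0)
h, k = 4, 3                                   # illustrative sizes only
W1, W2, v = rng.normal(size=(k, h)), rng.normal(size=(k, h)), rng.normal(size=k)
d_i = rng.normal(size=h)

# The same parameters handle inputs of different lengths.
p_short = pointer_distribution(rng.normal(size=(3, h)), d_i, W1, W2, v)
p_long = pointer_distribution(rng.normal(size=(7, h)), d_i, W1, W2, v)
print(p_short.shape, p_long.shape)
```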

EpiReader

Extractor: Pointer Nets

Use bi-RNNs to encode the passage, $f(\theta_T, \pmb{T})$, and the question, $g(\theta_Q, \pmb{Q})$, where $\theta_T$ and $\theta_Q$ represent the parameters of the text and question encoders, and $\pmb{T} \in \mathbb{R}^{D \times N}$ and $\pmb{Q} \in \mathbb{R}^{D \times N_Q}$ are matrix representations of the text and question (comprising $N$ and $N_Q$ words respectively). The question encoding concatenates the last hidden states of the forward and backward GRUs, denoted $g(\pmb{Q}) \in \mathbb{R}^{2d}$.
Take the inner product of text and question representations, followed by a softmax. The probability that the $i$-th word in text $\tau$ answers $\mathcal{Q}$ :
$s_i \propto \exp (f(\pmb{t}_i) \cdot g(\pmb{Q}))$
Compute the total probability that word $w$ is the correct answer:
$P(w \vert \tau, \mathcal{Q}) = \sum_{i: t_i=w} s_i$
The extractor takes the $K$ highest word probabilities together with the corresponding $K$ most probable answer words $\{\hat{a}_1,\cdots,\hat{a}_K \}$.

Reasoner

Insert the answer candidates into the question sequence $\mathcal{Q}$ at the placeholder location, which forms $K$ hypotheses $\{ \mathcal{H}_1, \cdots, \mathcal{H}_K \}$.
For each hypothesis and each sentence of the text: $\pmb{S}_i \in \mathbb{R}^{D \times |\mathcal{S}_i|}$ whose columns are embedding vectors for each word of sentence $\mathcal{S}_i$ , $\pmb{H}_k \in \mathbb{R}^{D \times |\mathcal{H}_k|}$ whose columns are the embedding vectors for each word in the hypothesis $\mathcal{H}_k$
Augment $\pmb{S}_i$ with word-matching features $\pmb{M} \in \mathbb{R}^{2 \times |\mathcal{S}_i|}$ . The first row is the inner product of each word embedding in the sentence with the candidate answer embedding; the second row is the maximum inner product of each sentence word embedding with any word embedding in the question.
Then the augmented $\pmb{S}_i$ and $\pmb{H}_k$ are fed into two different ConvNets, with filters $\pmb{F}^S \in \mathbb{R}^{(D+2) \times m}$ and $\pmb{F}^H \in \mathbb{R}^{D \times m}$ , where $m$ is the filter width. After ReLU and maxpooling op, we can obtain the representations of the text sentence and the hypothesis: $\pmb{r}_{\mathcal{S}_i} \in \mathbb{R}^{N_F}$ , $\pmb{r}_{\mathcal{H}_k} \in \mathbb{R}^{N_F}$ , where $N_F$ is the number of filters.
Then compute a scalar similarity score between the two representations using a bilinear form:
$\zeta = \pmb{r}_{\mathcal{S}_i}^T \pmb{R} \pmb{r}_{\mathcal{H}_k}$
where $\pmb{R} \in \mathbb{R}^{N_F \times N_F}$ is a trainable parameter.
Concat the similarity score with the sentence and hypothesis representations to get: $\pmb{x}_{ik} = [\zeta; \pmb{r}_{\mathcal{S}_i}; \pmb{r}_{\mathcal{H}_k}]^T$
Pass $\pmb{x}_{ik}$ to a GRU, and the final hidden state is given to an FC layer, followed by a softmax op.

Finally, combine the output of the Reasoner and the Extractor at the same time when minimizing the loss function. (See the original paper^[9] for details)

Bi-Directional Attention Flow (BiDAF)

Highway Networks

A plain feedforward NN consists of $L$ layers, where the $l^{th}$ layer $(l \in \{ 1,2,\cdots,L\})$ applies a non-linear transformation $H$ (with parameters $\pmb{W_{H,l}}$) to its input $\pmb{x}$ to produce the output $\pmb{y}$. Omitting the layer index: $\pmb{y} = H(\pmb{x}, \pmb{W_H})$

$H$ is usually an affine transformation followed by a non-linear activation function.

Highway Network:
- Additionally define $T$ as the transform gate and $C$ as the carry gate. Intuitively, these gates express how much of the output is produced by transforming the input and how much by carrying it through unchanged. $\pmb{y} = \underbrace{H(\pmb{x}, \pmb{W_H})}_\text{FFNN output} \cdot \underbrace{T(\pmb{x}, \pmb{W_T})}_\text{transform gate} + \pmb{x} \cdot \underbrace{C(\pmb{x}, \pmb{W_C})}_\text{carry gate}$
- For simplicity we set $C = 1 - T$, giving
  $\pmb{y} = H(\pmb{x}, \pmb{W_H}) \cdot T(\pmb{x}, \pmb{W_T}) + \pmb{x} \cdot ( 1 - T(\pmb{x}, \pmb{W_T}) )$
- In particular,
  $\pmb{y}=\left\{ \begin{array}{ll} \pmb{x} \quad \text{if } T(\pmb{x}, \pmb{W_T}) = \pmb{0}, \\ H(\pmb{x}, \pmb{W_H}) \quad \text{if } T(\pmb{x}, \pmb{W_T}) = \pmb{1} \end{array} \right.$
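
A single highway layer with $C = 1 - T$ can be sketched in NumPy as below. The choice of ReLU for $H$ and a sigmoid for $T$ is an assumption for illustration; the bias values in the usage lines are picked only to push the gate towards its two extreme cases.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One highway layer; input and output must share the same dimension."""
    H = np.maximum(0.0, W_H @ x + b_H)     # transformed input (ReLU assumed for H)
    T = sigmoid(W_T @ x + b_T)             # transform gate, elementwise in (0, 1)
    return H * T + x * (1.0 - T)           # carry gate is C = 1 - T

x = np.array([1.0, -2.0, 3.0])
W_H, b_H = np.eye(3), np.zeros(3)
W_T = np.zeros((3, 3))                     # gate depends only on its bias here

y_carry = highway_layer(x, W_H, b_H, W_T, np.full(3, -50.0))     # T close to 0: y is x
y_transform = highway_layer(x, W_H, b_H, W_T, np.full(3, 50.0))  # T close to 1: y is H(x)
print(y_carry, y_transform)
```

Initialising the gate bias to a negative value, as in the first call, biases the layer towards carrying its input, which is one common way to make deep stacks easier to train.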

BiDAF

Problems:

Previous models summarized the context paragraph into a fixed-size vector, which could lead to information loss.
Solution: the attention is computed at each time step, and the attended vector at each time step, along with the representations from previous layers, is allowed to flow through to the subsequent modeling layer.

Char embedding layer

Let $\pmb{x}_1, \cdots, \pmb{x}_T$ and $\pmb{q}_1, \cdots, \pmb{q}_J$ represent the words in the input context paragraph and query. Use TextCNNs to encode the char-level inputs, followed by a max-pooling over the entire width to obtain a fixed-size vector for each word.

Word embedding layer

Use pretrained GloVe word embeddings.

Then concatenate the char and word embedding vectors and feed them into a 2-layer Highway Network. The outputs are $\pmb{X} \in \mathbb{R}^{d \times T}$ for the context, and $\pmb{Q} \in \mathbb{R}^{d \times J}$ for the query.

Contextual embedding layer

Use bi-LSTMs to encode the context and query representations, concatenating the outputs of the two directions at each time step. We obtain $\pmb{H} \in \mathbb{R}^{2d \times T}$ from the context word vectors $\pmb{X}$, and $\pmb{U} \in \mathbb{R}^{2d \times J}$ from the query word vectors $\pmb{Q}$.

The first three layers are used to extract features from the query and context at different levels of granularity, akin to the multi-stage feature computation of CNNs in computer vision.

Attention flow layer

Inputs: the context $\pmb{H}$ and the query $\pmb{U}$.
Outputs: query-aware vector representation of context words, $\pmb{G}$, along with previous contextual embedding
Similarity matrix $\pmb{S} \in \mathbb{R}^{T \times J}$ between the contextual embeddings of the context($\pmb{H}$) and the query ($\pmb{U}$), where $\pmb{S}_{tj}$ indicates the similarity between the $t$-th context word and $j$-th query word: $\pmb{S}_{tj} = \alpha(\pmb{H}_{:t}, \pmb{U}_{:j}) \in \mathbb{R}$ $\alpha(\pmb{h},\pmb{u}) = \pmb{w}_{(\pmb{S})}^T [\pmb{h};\pmb{u};\pmb{h} \odot \pmb{u}]$ where $\alpha$ is a trainable scalar function that encodes the similarity between its input vectors, $\pmb{H}_{:t}$ is $t$-th column vector of $\pmb{H}$ and $\pmb{U}_{:j}$ is $j$-th column vector of $\pmb{U}$.

Then use $\pmb{S}$ to obtain the attentions and the attended vectors in both directions.

Context-to-query Attention: context-to-query (C2Q) attention signifies which query words are most relevant to each context word. Let $\pmb{a}_t \in \mathbb{R}^J$ represent the attention weights on the query words by the $t$-th context word, with $\sum_j \pmb{a}_{tj} = 1$ for each $t$. The attention weight: $\pmb{a}_t = \text{softmax}(\pmb{S}_{t:}) \in \mathbb{R}^J$ Each attended query vector: $\tilde{\pmb{U}}_{:t} = \sum_j \pmb{a}_{tj} \pmb{U}_{:j}$ Here $\tilde{\pmb{U}}$ is a $2d$-by-$T$ matrix.
Query-to-context Attention: query-to-context (Q2C) attention signifies which context words have the closest similarity to one of the query words and are hence critical for answering. The attention weights on the context words: $\pmb{b} = \text{softmax}(\max_{col} (\pmb{S})) \in \mathbb{R}^T$ where the maximum function ( $\max_{col}$ ) is performed across the columns, i.e. over the query words for each context word.
The attended context vector is $\tilde{\pmb{h}} = \sum_t \pmb{b}_t \pmb{H}_{:t} \in \mathbb{R}^{2d}$. This vector is tiled $T$ times across the columns, giving $\tilde{\pmb{H}} \in \mathbb{R}^{2d \times T}$.

Finally, concatenate the contextual embeddings and attention vectors:

$\pmb{G}_{:t} = \beta(\pmb{H}_{:t}, \tilde{\pmb{U}}_{:t}, \tilde{\pmb{H}}_{:t}) \in \mathbb{R}^{d_{G}}$

where $\pmb{G}_{:t}$ is the $t$-th column vector and $\beta$ is a trainable vector function that fuses its three input vectors. In the experiments, $\beta(\pmb{h}, \tilde{\pmb{u}}, \tilde{\pmb{h}}) = [\pmb{h}; \tilde{\pmb{u}}; \pmb{h} \odot \tilde{\pmb{u}}; \pmb{h} \odot \tilde{\pmb{h}} ] \in \mathbb{R}^{8d}$, i.e. $d_G = 8d$ and $\pmb{G} \in \mathbb{R}^{8d \times T}$.
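
The whole attention flow layer fits in a few lines of NumPy. The sketch below assumes column-major layouts matching the notation ($\pmb{H}$ is $2d \times T$, $\pmb{U}$ is $2d \times J$), uses random inputs, and splits the similarity weight vector into three parts so that $\pmb{S}$ can be computed without materialising every concatenated triple. The function name `attention_flow` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_flow(H, U, w_s):
    """H: (2d, T) context, U: (2d, J) query, w_s: (6d,) similarity weights."""
    n, T = H.shape
    w_h, w_u, w_hu = w_s[:n], w_s[n:2 * n], w_s[2 * n:]
    # S[t, j] = w_h.h_t + w_u.u_j + w_hu.(h_t * u_j), i.e. w_s^T [h; u; h*u]
    S = (w_h @ H)[:, None] + (w_u @ U)[None, :] + (H * w_hu[:, None]).T @ U
    A = softmax(S, axis=1)                      # C2Q: row t is a_t over query words
    U_tilde = U @ A.T                           # (2d, T) attended query vectors
    b = softmax(S.max(axis=1))                  # Q2C: weights over context words
    H_tilde = np.tile((H @ b)[:, None], (1, T)) # attended context vector, tiled T times
    G = np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=0)
    return S, G

rng = np.random.default_rng(0)
n, T, J = 4, 5, 3                               # n plays the role of 2d
H, U, w_s = rng.normal(size=(n, T)), rng.normal(size=(n, J)), rng.normal(size=3 * n)
S, G = attention_flow(H, U, w_s)
print(S.shape, G.shape)
```

Each column of `G` has four blocks of size $2d$, so its height is four times that of `H`.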

Modeling layer

Use bi-LSTMs to encode, obtaining a matrix $\pmb{M} \in \mathbb{R}^{2d \times T}$

Output layer

Application-specific
For QA-tasks, find the sub-phrase of the paragraph to answer the query. We obtain the start index over the entire paragraph by:
$\pmb{p}^1 = \text{softmax}(\pmb{w}^T_{(p^1)} [\pmb{G};\pmb{M}] )$
For the end index of the answer phrase, we pass $\pmb{M}$ into another bi-LSTM and obtain $\pmb{M}^2 \in \mathbb{R}^{2d \times T}$
$\pmb{p}^2 = \text{softmax}(\pmb{w}^T_{(p^2)} [\pmb{G};\pmb{M}^2] )$
Training: minimize the sum of the negative log probabilities of the true start and end indices by the predicted distributions, averaged over all examples:
$L(\theta) = -\frac{1}{N} \sum_{i=1}^N \big[ \log(\pmb{p}_{y_i^1}^1) + \log(\pmb{p}_{y_i^2}^2) \big]$
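
As a worked example of this objective, the sketch below computes the averaged negative log-likelihood of the gold start and end indices from two batches of predicted distributions. The numbers are made up, the indices are 0-based, and the function name `span_nll` is an assumption.

```python
import numpy as np

def span_nll(p_start, p_end, y_start, y_end):
    """p_start, p_end: (N, T) predicted distributions; y_start, y_end: (N,) gold indices."""
    rows = np.arange(len(y_start))
    log_lik = np.log(p_start[rows, y_start]) + np.log(p_end[rows, y_end])
    return -log_lik.mean()                 # average over the N examples

p_start = np.array([[0.5, 0.5], [0.25, 0.75]])
p_end = np.array([[0.5, 0.5], [0.5, 0.5]])
loss = span_nll(p_start, p_end, np.array([0, 1]), np.array([1, 0]))
print(loss)
```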

Match-LSTM and Answer pointer

Match-LSTM

It was originally proposed for recognizing textual entailment (RTE). In RTE, given two sentences, a premise and a hypothesis, predict whether the premise entails the hypothesis.
Match-LSTMs go through the hypothesis sequentially. At each position of the hypothesis, an attention mechanism is applied to obtain a weighted vector representation of the premise. This weighted vector is combined with the current token representation of the hypothesis, then fed to an LSTM.
In this way, match-LSTMs sequentially aggregate the matching of the attention-weighted premise to each token of the hypothesis.

Architecture

Given the passage matrix $\pmb{P} \in \mathbb{R}^{d \times P}$ and the question matrix $\pmb{Q} \in \mathbb{R}^{d \times Q}$, where $P$ and $Q$ are the lengths (# of tokens) of the passage and question, and $d$ is the dimension of the word embeddings.
The answer is a sequence of integers $\pmb{a} = (a_1,a_2,\cdots)$, where each $a_i$ is an integer between 1 and $P$ indicating a position in the passage. Alternatively, select only the start and end indices from the input passage, represented as $\pmb{a} = (a_s, a_e)$, where $a_s$ and $a_e$ are integers between 1 and $P$.
Overall, given $\{ \pmb{P}_n, \pmb{Q}_n, \pmb{a}_n \}_{n=1}^N$ .
Goal: identify a subsequence from the passage as the answer to the question.

LSTM Preprocessing layer

In order to incorporate contextual information into the representation of each token, apply a standard one-directional LSTM to process the passage and the question separately. $\pmb{H}^p = \overrightarrow{\text{LSTM}}(\pmb{P})$ $\pmb{H}^q = \overrightarrow{\text{LSTM}}(\pmb{Q})$ The outputs $\pmb{H}^p \in \mathbb{R}^{l \times P}$ and $\pmb{H}^q \in \mathbb{R}^{l \times Q}$ are hidden representations of the passage and the question, where $l$ is the hidden dimension.

Match-LSTM layer

Apply the match-LSTM model by going sequentially through the passage, obtaining a weighted representation of the question at each position.
At position $i$ of the passage, it first uses the standard word-by-word attention mechanism to obtain attention weight $\overrightarrow{\alpha}_i \in \mathbb{R}^{Q}$ :
$\overrightarrow{\pmb{G}}_i = \tanh \big(\pmb{W}^q \pmb{H}^q + (\pmb{W}^p \pmb{h}_i^p + \pmb{W}^r \overrightarrow{\pmb{h}_{i-1}^r} + \pmb{b}^p ) \otimes \pmb{e}_Q \big)$ $\overrightarrow{\alpha}_i = \text{softmax} (\pmb{w}^T \overrightarrow{\pmb{G}}_i + b \otimes \pmb{e}_Q)$
where $\pmb{W}^q$, $\pmb{W}^p$, $\pmb{W}^r \in \mathbb{R}^{l \times l}$, $\pmb{b}^p, \pmb{w} \in \mathbb{R}^l$ and $b \in \mathbb{R}$ are learnable, and $\overrightarrow{\pmb{h}_{i-1}^r} \in \mathbb{R}^l$ is the hidden vector of the one-directional match-LSTM at the previous position. The outer product ( $\cdot \otimes \pmb{e}_Q$ ) generates a matrix or row vector by repeating the vector or scalar on the left $Q$ times.
Then combine the weighted vector with the original representation:
$\overrightarrow{\pmb{z}}_i = \begin{bmatrix} \pmb{h}_i^p \\ \pmb{H}^q \overrightarrow{\alpha}_i^T \end{bmatrix}$

The vector $\overrightarrow{\pmb{z}}_i$ is fed to a one-directional LSTM, so-called match-LSTM:

$\overrightarrow{\pmb{h}}_i^r = \overrightarrow{\text{LSTM}}(\overrightarrow{\pmb{z}}_i, \overrightarrow{\pmb{h}}_{i-1}^r)$

where $\overrightarrow{\pmb{h}}_i^r \in \mathbb{R}^l$

Further apply a match-LSTM in the reverse direction, which conditions on the hidden state of the following position:
$\overleftarrow{\pmb{G}}_i = \tanh \big(\pmb{W}^q \pmb{H}^q + (\pmb{W}^p \pmb{h}_i^p + \pmb{W}^r \overleftarrow{\pmb{h}_{i+1}^r} + \pmb{b}^p ) \otimes \pmb{e}_Q \big)$ $\overleftarrow{\alpha}_i = \text{softmax} (\pmb{w}^T \overleftarrow{\pmb{G}}_i + b \otimes \pmb{e}_Q)$
Let $\overrightarrow{\pmb{H}^r} \in \mathbb{R}^{l \times P}$ represent the hidden states $[\overrightarrow{\pmb{h}^r_1}, \overrightarrow{\pmb{h}^r_2}, \cdots, \overrightarrow{\pmb{h}^r_P}]$ and $\overleftarrow{\pmb{H}^r} \in \mathbb{R}^{l \times P}$ represent $[\overleftarrow{\pmb{h}^r_1}, \overleftarrow{\pmb{h}^r_2}, \cdots, \overleftarrow{\pmb{h}^r_P}]$.
Define $\pmb{H}^r \in \mathbb{R}^{2l \times P}$ as the concatenation:
$\pmb{H}^r = \begin{bmatrix} \overrightarrow{\pmb{H}^r} \\ \overleftarrow{\pmb{H}^r} \end{bmatrix}$
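
One forward attention step of the match-LSTM layer (the computation of $\overrightarrow{\alpha}_i$ and $\overrightarrow{\pmb{z}}_i$) can be sketched as below. The LSTM update itself is omitted, all inputs are random placeholders, and the $\otimes \pmb{e}_Q$ repetition is realised by NumPy broadcasting. The function name is hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def match_attention_step(H_q, h_p, h_r_prev, W_q, W_p, W_r, b_p, w, b):
    """H_q: (l, Q) question states; h_p, h_r_prev: (l,) passage and match-LSTM states."""
    shared = W_p @ h_p + W_r @ h_r_prev + b_p    # (l,): same for every question word
    G = np.tanh(W_q @ H_q + shared[:, None])     # (l, Q): broadcasting repeats it Q times
    alpha = softmax(w @ G + b)                   # (Q,): attention over question words
    z = np.concatenate([h_p, H_q @ alpha])       # (2l,): input to the match-LSTM
    return alpha, z

rng = np.random.default_rng(0)
l, Q = 4, 6                                      # illustrative sizes only
H_q, h_p, h_r_prev = rng.normal(size=(l, Q)), rng.normal(size=l), np.zeros(l)
W_q, W_p, W_r = (rng.normal(size=(l, l)) for _ in range(3))
alpha, z = match_attention_step(H_q, h_p, h_r_prev, W_q, W_p, W_r,
                                rng.normal(size=l), rng.normal(size=l), 0.1)
print(alpha.round(3), z.shape)
```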

Answer pointer layer

The sequence model

Compute the attention weight vector $\beta_k \in \mathbb{R}^{(P+1)}$ : $\pmb{F}_k = \tanh (\pmb{V} \tilde{H}^r + (\pmb{W}^a \pmb{h}_{k-1}^a + \pmb{b}^a) \otimes \pmb{e}_{(P+1)})$ $\beta_k = \text{softmax}(\pmb{v}^T \pmb{F}_k + \pmb{c} \otimes \pmb{e}_{(P+1)})$ where $\tilde{H}^r \in \mathbb{R}^{2l \times (P+1)}$ is the concatenation of $\pmb{H}^r$ with a zero vector, defined as $\tilde{H}^r = [\pmb{H}^r; \pmb{0}]$

$\pmb{h}_k^a = \overrightarrow{\text{LSTM}} (\tilde{\pmb{H}}^r \beta_k^T, \pmb{h}_{k-1}^a)$

Then model the probability of generating the answer sequence as:

$p(\pmb{a} \vert \pmb{H}^r) = \prod_k p(a_k \vert a_1, a_2, \cdots, a_{k-1}, \pmb{H}^r)$ $p(a_k = j \vert a_1, a_2, \cdots, a_{k-1}, \pmb{H}^r) = \beta_{k,j}$

Minimize the loss: $J(\theta) = -\sum_{n=1}^N \log p(\pmb{a}_n \vert \pmb{P}_n, \pmb{Q}_n)$

The boundary model

Predict the start and end index from input sequences. The probability is modeled as: $p(\pmb{a} \vert \pmb{H}^r) = p(a_s \vert \pmb{H}^r) p(a_e \vert a_s, \pmb{H}^r)$

Gated self-matching networks

Firstly, apply bi-RNNs to process the question and passage separately; then match the question and passage with gated attention-based RNNs, obtaining question-aware representation for the passage. On top of that, apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation, which is then fed to the output layer to predict the boundary of the answer span.

Question and passage encoder

Given question $\mathcal{Q} = \{ w_t^Q\}_{t=1}^m$ and passage $\mathcal{P} = \{ w_t^P\}_{t=1}^n$ .
Concatenate the respective word-level embeddings ( $\{ e_t^Q\}_{t=1}^m$ and $\{ e_t^P\}_{t=1}^n$ ) and char-level embeddings ( $\{ c_t^Q\}_{t=1}^m$ and $\{ c_t^P\}_{t=1}^n$ ). The char-level embedding of a token is generated from the final hidden states of a bi-directional RNN run over its characters, which helps to handle OOV words.
Then use a bi-RNN to produce the new representation of all words in the question and passage respectively: $u_t^Q = \text{BiRNN}_Q (u_{t-1}^Q, [e_t^Q, c_t^Q])$ $u_t^P = \text{BiRNN}_P (u_{t-1}^P, [e_t^P, c_t^P])$

Gated attention-based RNNs

Incorporate an additional gate to determine the importance of information in the passage regarding a question.
Rocktäschel et al.(2015)^[15] proposed generating sentence-pair representation $\{ v_t^P \}_{t=1}^n$ via soft-alignment of words in the question and passage:
$v_t^P = \text{RNN}(v_{t-1}^P, c_t)$
where $c_t = \text{att}(u^Q, [u_t^P, v_{t-1}^P])$ is an attention-pooling vector of the whole question $u^Q$:
$s_j^t = v^T \tanh (W_u^Q u_j^Q + W_u^P u_t^P + W_v^P v_{t-1}^P)$ $a_i^t = \frac{\exp (s_i^t)}{\sum_{j=1}^m \exp(s_j^t)}$ $c_t = \sum_{i=1}^m a_i^t u_i^Q$
Match-LSTM(Wang and Jiang, 2016) takes $u_t^P$ as an additional input into the recurrent network:
$v_t^P = \text{RNN}(v_{t-1}^P, [u_t^P, c_t])$
To determine the importance of passage parts and attend to the ones relevant to the question, add another gate to the input $[u_t^P, c_t]$ of the RNN:
$g_t = \text{sigmoid} (W_g [u_t^P, c_t])$ $[u_t^P, c_t]^* = g_t \odot [u_t^P, c_t]$
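
The attention pooling and the extra gate can be sketched together in NumPy. The recurrent update is left out, the states are random stand-ins, and the function names `attention_pool` and `gated_input` are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(U_Q, u_t, v_prev, W_uQ, W_uP, W_vP, v):
    """U_Q: (m, h) question states, one row per word; returns c_t of shape (h,)."""
    s = np.tanh(U_Q @ W_uQ.T + W_uP @ u_t + W_vP @ v_prev) @ v   # (m,) scores s_j^t
    a = softmax(s)                                                # weights a_i^t
    return a @ U_Q                                                # c_t = sum_i a_i^t u_i^Q

def gated_input(u_t, c_t, W_g):
    """Scale every component of [u_t, c_t] by a sigmoid gate computed from it."""
    x = np.concatenate([u_t, c_t])
    g = 1.0 / (1.0 + np.exp(-(W_g @ x)))
    return g * x

rng = np.random.default_rng(0)
m, h, k = 5, 4, 3                                   # illustrative sizes only
U_Q, u_t, v_prev = rng.normal(size=(m, h)), rng.normal(size=h), np.zeros(h)
c_t = attention_pool(U_Q, u_t, v_prev, rng.normal(size=(k, h)),
                     rng.normal(size=(k, h)), rng.normal(size=(k, h)), rng.normal(size=k))
half_open = gated_input(u_t, c_t, np.zeros((2 * h, 2 * h)))   # zero weights: gate is 0.5
print(c_t.shape, half_open.shape)
```

Because the gate is computed from both the passage word and its attended question context, it can suppress passage positions that are irrelevant to the question before they reach the RNN.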

Self-matching attention

Match the question-aware passage representation against itself.
$h_t^P = \text{BiRNN}(h_{t-1}^P, [v_t^P, c_t])$
where $c_t=\text{att}(v^P, v_t^P)$ is an attention pooling vector of the whole passage $v^P$:
$s_j^t = v^T \tanh(W_v^P v_j^P + W_v^{\tilde{P}}v_t^P)$ $a_i^t = \frac{\exp(s_i^t)}{\sum_{j=1}^n \exp(s_j^t)}$ $c_t = \sum_{i=1}^n a_i^t v_i^P$
An additional gate as in gated attention-based RNNs is applied to $[v_t^P, c_t]$ to adaptively control the input of RNNs.

Output layer

Use pointer net to select the start position ($p^1$) and end position ($p^2$) from the passage: $s_j^t = v^T \tanh (W_h^P h_j^P + W_h^a h_{t-1}^a)$ $a_i^t = \frac{\exp(s_i^t)}{\sum_{j=1}^n \exp(s_j^t)}$ $p^t = \arg \max (a_1^t, \cdots, a_n^t)$
Utilize the question vector $r^Q$ as the initial state of the answer RNN: $r^Q = \text{att}(u^Q, V_r^Q)$

Attention-over-Attention Reader

Contextual embedding for document $\mathcal{D}$ and query $\mathcal{Q}$ using bi-GRUs: $h_{doc} \in \mathbb{R}^{|\mathcal{D}| \times 2d}$, $h_{query} \in \mathbb{R}^{|\mathcal{Q}| \times 2d}$
$e(x) = W_e \cdot x, \text{where } x \in \mathcal{D}, \mathcal{Q}$ $\overrightarrow{h_s(x)} = \overrightarrow{\text{GRU}}(e(x))$ $\overleftarrow{h_s(x)} = \overleftarrow{\text{GRU}}(e(x))$ $h_s(x) = [\overrightarrow{h_s(x)}; \overleftarrow{h_s(x)} ]$
Pair-wise matching score:
Given the $i$-th word of the document and the $j$-th word of the query, we compute a matching score by dot product, forming a matrix $M \in \mathbb{R}^{|\mathcal{D}| \times |\mathcal{Q}|}$, where the entry in the $i$-th row and $j$-th column is $M(i,j)$:
$M(i,j) = h_{\text{doc}}(i)^T \cdot h_{\text{query}}(j)$
Individual document-level attentions
Apply a column-wise softmax function to get the distribution of each column, where each column is an individual document-level attention considering a single query word. Let $\alpha(t) \in \mathbb{R}^{|\mathcal{D}|}$ be the query-to-document attention at time $t$:
$\alpha(t) = \text{softmax} (M(1,t), \cdots, M(|\mathcal{D}|,t))$ $\alpha = [\alpha(1), \alpha(2), \cdots, \alpha(|\mathcal{Q}|)]$

Attention-over-Attention ^[7]
1. First, for each document word at time $t$, compute the “importance” distribution on the query, indicating which query words are most important given a single document word.
2. Apply a row-wise softmax function to the pair-wise matching matrix $M$ to get query-level attentions. The document-to-query attention $\beta(t) \in \mathbb{R}^{|\mathcal{Q}|}$ is: $\beta(t) = \text{softmax}\big(M(t,1), \cdots, M(t, |\mathcal{Q}|)\big)$
3. Average these attentions over all document positions to get one weight per query word: $\beta = \frac{1}{|\mathcal{D}|} \sum_{t=1}^{|\mathcal{D}|} \beta(t)$
4. Calculate the dot product of $\alpha$ and $\beta$ to get the attended document-level attention: $s = \alpha^T \beta$
Predictions
The final output is mapped to the vocabulary space $V$, rather than document-level attention $|\mathcal{D}|$:
$p(w \vert \mathcal{D}, \mathcal{Q}) = \sum_{i \in I(w, \mathcal{D})} s_i, \quad w \in V$
where $I(w, \mathcal{D})$ indicates the positions at which word $w$ appears in the document $\mathcal{D}$.
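
The full attention-over-attention computation, from the matching matrix to the word-level prediction, can be sketched in NumPy. The bi-GRU outputs are replaced by random arrays and the function name is an assumption. With `alpha` stored as a $|\mathcal{D}| \times |\mathcal{Q}|$ array, the product written $\alpha^T \beta$ above becomes `alpha @ beta`.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_over_attention(h_doc, h_query, doc_tokens):
    """h_doc: (|D|, 2d), h_query: (|Q|, 2d), doc_tokens: list of |D| words."""
    M = h_doc @ h_query.T                    # (|D|, |Q|) pair-wise matching scores
    alpha = softmax(M, axis=0)               # column-wise: one document attention per query word
    beta = softmax(M, axis=1).mean(axis=0)   # row-wise, then averaged over document positions
    s = alpha @ beta                         # (|D|,) attended document-level attention
    probs = {}
    for tok, s_i in zip(doc_tokens, s):      # sum over all positions where the word occurs
        probs[tok] = probs.get(tok, 0.0) + float(s_i)
    return s, probs

rng = np.random.default_rng(0)
tokens = ["mary", "went", "to", "mary", "home"]
s, probs = attention_over_attention(rng.normal(size=(5, 4)), rng.normal(size=(3, 4)), tokens)
print(max(probs, key=probs.get))
```

Since every column of `alpha` and the vector `beta` each sum to one, `s` is itself a distribution over document positions.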

The training objective is to maximize the log-likelihood of the correct answer:
$\mathcal{L} = \sum_i \log{(p(x))}, x \in \mathcal{A}$

R-Net

Overview:

First, the question $Q$ and passage $P$ are processed by bi-RNNs separately.
Then, match the $Q$ and $P$ with gated attention-based RNNs, obtaining question-aware representation for the passage $P$
Apply self-matching attention to aggregate evidence from the whole passage and refine the passage representation.
Feed into the output layer to predict the boundary of the answer span.

Question and passage encoder

Consider a question $Q = \{ w_t^Q \}_{t=1}^m$ and a passage $P=\{ w_t^P \}^n_{t=1}$ .

First convert words to word-level embeddings $\{ e_t^Q\}_{t=1}^m$ and $\{ e_t^P \}_{t=1}^n$ and char-level embeddings $\{ c_t^Q\}_{t=1}^m$ and $\{ c_t^P \}_{t=1}^n$ (generated from the final hidden states of bi-RNNs, which helps with OOV tokens).
Then use a bi-RNN to encode the question and passage respectively: $u_t^Q = \text{bi-RNN}_Q (u_{t-1}^Q, [e_t^Q, c_t^Q])$ $u_t^P = \text{bi-RNN}_P (u_{t-1}^P, [e_t^P, c_t^P])$

Gated attention-based RNNs

Given question representation $\{u_t^Q\}_{t=1}^m$ and passage representation $\{ u_t^P \}_{t=1}^n$ .

Generate sentence-pair representation $\{v_t^P\}_{t=1}^n$ with soft-alignment of words in the question and passage:^[4] $v_t^P = \text{RNN} (v_{t-1}^P, c_t)$ where $c_t = \text{att}(u^Q, [u_t^P, v_{t-1}^P])$ is an attention-pooling vector of the whole question $(u^Q)$: $s_j^t = v^T \text{tanh}(W_u^Q u_j^Q + W_u^P u_t^P + W_v^P v_{t-1}^P)$ $a_i^t = \frac{\exp(s_i^t)}{\sum_{j=1}^m \exp{(s_j^t)} }$ $c_t = \sum_{i=1}^m a_i^t u_i^Q$

Each passage representation $v_t^P$ dynamically incorporates aggregated matching information from the whole question.

match-LSTM^[5]. Take $u_t^P$ as an additional input into the RNNs: $\pmb{v_t^P} = \text{RNN}(v_{t-1}^P, [u_t^P, c_t])$ To determine the importance of passage parts and attend to the ones relevant to the question, add another gate $g_t$ to the input $([u_t^P, c_t])$ of RNN: $g_t = \text{sigmoid}(W_g [u_t^P, c_t])$ $[u_t^P, c_t]^* = g_t \odot [u_t^P, c_t]$

Self-matching attention

Given the question-aware passage representation $\{ v_t^P \}_{t=1}^n$, one problem is that it has very limited knowledge of the surrounding context.

Solution: match the question-aware passage representation against itself.

$\pmb{h_t^P} = \text{bi-RNN}(h_{t-1}^P, [v_t^P, c_t])$

where $c_t = \text{att}(v^P, v_t^P)$ is an attention-pooling vector of the whole passage $(v^P)$:

$s_j^t = v^T \text{tanh}(W_v^P v_j^P + W_v^{\tilde{P}} v_t^P)$ $a_i^t = \frac{\exp{(s_i^t)}}{\sum_{j=1}^n \exp{(s_j^t)}}$ $c_t = \sum_{i=1}^n a_i^t v_i^P$

An additional gate as in gated attention-based RNNs is applied to $[v_t^P, c_t]$ to adaptively control the input of RNNs.

Output layer

Given the passage representation $\{ h_t^P \}_{t=1}^n$

Use pointer networks to predict the start and the end position of the answer.
Attention mechanism is utilized as the pointer to select the start position $(p^1)$ and end position $(p^2)$:
$s_j^t = v^T \text{tanh}(W_h^P h_j^P + W_h^a h_{t-1}^a)$ $a_i^t = \frac{\exp{(s_i^t)}}{\sum_{j=1}^n \exp{(s_j^t)}}$ $p^t = \arg\max (a_1^t, \cdots, a_n^t)$
Here $h_{t-1}^a$ represents the last hidden state of the answer RNN (pointer net).

The input of the answer RNN is the attention-pooling vector:

$c_t= \sum_{i=1}^n a_i^t h_i^P$ $h_t^a = \text{RNN}(h_{t-1}^a, c_t)$

When predicting the start position, $h_{t-1}^a$ represents the initial hidden state of the answer RNN. We use the question vector $r^Q$ as the initial state of the answer RNN. $r^Q = \text{att}(u^Q, V_r^Q)$ is an attention-pooling vector of the question based on the parameter $V_r^Q$ :

$s_j = v^T \text{tanh}(W_u^Q u_j^Q + W_v^Q V_r^Q)$ $a_i = \frac{\exp{(s_i)}}{\sum_{j=1}^m \exp{(s_j)}}$ $r^Q = \sum_{i=1}^m a_i u_i^Q$

Loss: the sum of negative log probabilities of the label start and end position by the predicted distributions.

Reasoning Network (ReasoNet)

ReasoNet mimics the inference process of human readers by introducing a termination state in the inference with reinforcement learning. The state can decide whether to continue the inference to the next turn after digesting intermediate information, or to terminate the whole inference when it concludes that existing information is sufficient to yield an answer.

The stochastic inference process can be seen as a POMDP. The state sequence $s_{1:T}$ is controlled by an RNN sequence model. The ReasoNet performs an answer action $a_T$ at the $T$-th step, which implies that the termination gate variables are $t_{1:T} = (t_1=0, t_2=0, \cdots, t_{T-1}=0, t_T=1)$.
The ReasoNet learns a stochastic policy $\pi((t_t, a_t) \vert s_t; \theta)$ with parameters $\theta$ to get a distribution over termination actions (continue reading or stop) and over answer actions if the model decides to stop at the current step.

The expected reward for an instance is:
$J(\theta) = \mathbb{E}_{\pi(t_{1:T}, a_T; \theta)} \big[ \sum_{t=1}^T r_t \big]$
- The reward can only be received at the final termination step, when an answer action $a_T$ is performed.
  - $J$ can be maximized by directly applying gradient-based optimization methods:
    $\nabla_\theta J(\theta) =\mathbb{E}_{\pi(t_{1:T}, a_T; \theta)} \big[ \nabla_\theta \log \pi(t_{1:T}, a_T; \theta) r_T \big]$
  - Motivated by the REINFORCE algorithm, we compute $\nabla_\theta J(\theta)$ with a baseline:
    $\mathbb{E}_{\pi(t_{1:T}, a_T; \theta)} \big[ \nabla_\theta \log \pi(t_{1:T}, a_T; \theta) r_T \big] = \sum_{(t_{1:T}, a_T) \in \mathbb{A}} \pi(t_{1:T}, a_T; \theta) \big[ \nabla_\theta \log \pi(t_{1:T}, a_T; \theta) (r_T - b_T) \big]$
    where $b_T = \mathbb{E}_\pi[r_T]$ and can be updated via an online moving average: $b_T = \lambda b_T + (1- \lambda) r_T$
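
A Monte Carlo version of this estimator can be sketched as follows. The sketch assumes that the per-episode gradients of $\log \pi$ are already available (in practice they come from an autodiff framework), and that the moving average blends the old baseline with the newly observed reward. Both function names are hypothetical.

```python
import numpy as np

def reinforce_gradient(grad_log_probs, rewards, baseline):
    """grad_log_probs: (N, P), one row of d log pi / d theta per sampled episode.
    rewards: (N,) terminal rewards r_T. Returns a gradient estimate of shape (P,)."""
    advantages = rewards - baseline                      # r_T - b_T
    return (grad_log_probs * advantages[:, None]).mean(axis=0)

def update_baseline(baseline, reward, lam=0.9):
    """Online moving average of the reward (assumed form of the baseline update)."""
    return lam * baseline + (1.0 - lam) * reward

# Two sampled episodes with made-up gradients: one answered correctly, one not.
grads = np.array([[1.0, 0.0], [0.0, 1.0]])
rewards = np.array([1.0, 0.0])
grad_estimate = reinforce_gradient(grads, rewards, baseline=0.5)
new_baseline = update_baseline(0.5, 1.0, lam=0.9)
print(grad_estimate, new_baseline)
```

Subtracting the baseline leaves the expected gradient unchanged but reduces its variance: here the successful episode pushes its parameters up and the failed one pushes its parameters down by the same amount.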

Cross-passage answer verification

Compute the question-aware representation for each passage. Employ a Pointer network to predict the start and end position of the answer in the module of answer boundary prediction.
Meanwhile, with the answer content module, we estimate whether each word should be included in the answer.
In the answer verification module, each answer candidate can attend to the other answer candidates to collect supportive information and compute one score for each candidate to indicate whether it is correct or not according to the verification.

Question and passage modeling

Encoding: map each word into the vector space by concatenating the word embedding and sum of its char-embeddings. Then employ bi-LSTM to encode the question $\pmb{Q}$ and passages $\{ \pmb{P}_i\}$ :
$\pmb{u}_t^Q = \text{biLSTM}_Q(\pmb{u}_{t-1}^Q, [\pmb{e}_t^Q, \pmb{c}_t^Q])$ $\pmb{u}_t^{P_i} = \text{biLSTM}_P(\pmb{u}_{t-1}^{P_i}, [\pmb{e}_t^{P_i}, \pmb{c}_t^{P_i}])$
where $\pmb{e}_t^Q$ , $\pmb{c}_t^Q$ are word-level and char-level embeddings of the $t$-th word.
Q-P Matching: use the attention flow layer to conduct Q-P matching in two directions. The similarity between the $t$-th word in the question and $k$-th word in passage $i$ is:
$\pmb{S}_{t,k} = {\pmb{u}_t^Q}^T \cdot \pmb{u}_k^{P_i}$

Then the context-to-question and question-to-context attentions are applied as in BiDAF above to obtain the question-aware passage representation $\{ \pmb{\tilde{u}}_t^{P_i}\}$.

The match output:

$\pmb{v}_t^{P_i} = \text{BiLSTM}_M (\pmb{v}_{t-1}^{P_i}, \pmb{\tilde{u}}_t^{P_i})$

Answer boundary prediction

Employ Pointer net to compute the probability of each word to be the start or end position of the span:
$g_k^t = {\pmb{w}_1^{\alpha}}^T \tanh(\pmb{W}_2^\alpha [\pmb{v}_k^P, \pmb{h}_{t-1}^a])$ $\alpha_k^t = \frac{ \exp (g_k^t)}{\sum_{j=1}^{\pmb{|P|}} \exp(g_j^t)}$ $\pmb{c}_t = \sum_{k=1}^{|\pmb{P}|} \alpha_k^t \pmb{v}_k^P$ $\pmb{h}_t^a = \text{LSTM}(\pmb{h}_{t-1}^a, \pmb{c}_t)$
The probability of $k$-th word in the passage to be the start and end position of the answer is obtained as $\alpha_k^1$ and $\alpha_k^2$
Minimize the negative log probabilities of the true start and end indices: $\mathcal{L}_{\text{boundaries}} = -\frac{1}{N} \sum_{i=1}^N (\log \alpha_{y_i^1}^1 + \log \alpha_{y_i^2}^2)$ where $N$ is the # of samples in the dataset and $y_i^1$, $y_i^2$ are the gold start and end positions.

Answer content modeling

We predict whether each word should be included in the context of the answer. The content probability of the $k$-th word is computed as:
$p_k^c = \text{sigmoid}({\pmb{w}_1^c}^T \text{ReLU}(\pmb{W}_2^c \pmb{v}_k ^{P_i}))$
Words within the answer span are labeled 1 and all others 0. The loss is the averaged cross entropy:
$\mathcal{L}_{\text{content}} = -\frac{1}{N} \frac{1}{|P|} \sum_{i=1}^N \sum_{k=1}^{|P|} [y_k^c \log p_k^c + (1-y_k^c)\log(1-p_k^c)]$
The content probabilities provide another view to measure the quality of the answer in addition to the boundary. Moreover, with these probabilities, we can represent the answer from passage $i$ as a weighted sum of all word embeddings:
$\pmb{r}^{A_i} = \frac{1}{|\pmb{P}_i|} \sum_{k=1}^{|\pmb{P}_i|} p_k^c [\pmb{e}_k^{P_i}, \pmb{c}_k^{P_i}]$
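
A small sketch of the content loss and the weighted answer representation for a single passage. It assumes the content probabilities $p_k^c$ are already computed (the sigmoid/ReLU scoring layer is omitted), and the function names are illustrative only.

```python
import math


def content_loss(probs, labels):
    """Averaged binary cross entropy over the words of one passage."""
    terms = [
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(probs, labels)
    ]
    return -sum(terms) / len(terms)


def answer_representation(probs, embeddings):
    """r^A = (1 / |P|) * sum_k p_k^c * embedding_k, computed per dimension."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [
        sum(p * emb[d] for p, emb in zip(probs, embeddings)) / n
        for d in range(dim)
    ]


# Two-word passage: the first word is inside the answer span, the second is not.
loss = content_loss([0.5, 0.5], [1, 0])
r_a = answer_representation([1.0, 0.5], [[2.0, 0.0], [0.0, 4.0]])
```

Words with a low content probability contribute little to `r_a`, so the representation is dominated by the words the model believes belong to the answer.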

Cross-passage answer verification

The boundary and content models focus on modeling within a single passage, with little consideration of cross-passage information.
Given the representations of the answer candidates from all passages $\{ \pmb{r}^{A_i}\}$ , each answer candidate attends to the other candidates to collect supportive information via an attention mechanism: $s_{i,j} = \left\{ \begin{array}{ll} 0 & \text{ if } i=j \\ {\pmb{r}^{A_i}}^T \cdot \pmb{r}^{A_j} & \text{ otherwise} \end{array} \right.$

$\alpha_{i,j} = \frac{\exp(s_{i,j})}{\sum_{k=1}^n \exp(s_{i,k})}$ $\tilde{\pmb{r}}^{A_i} =\sum_{j=1}^n \alpha_{i,j} \pmb{r}^{A_j}$

Here $\tilde{\pmb{r}}^{A_i}$ is the verification information collected from the other passages, weighted by attention. It is then passed, together with the original $\pmb{r}^{A_i}$, to a fully connected layer:

$g_i^v = {\pmb{w}^v}^T [\pmb{r}^{A_i}, \tilde{\pmb{r}}^{A_i}, \pmb{r}^{A_i} \odot \tilde{\pmb{r}}^{A_i} ]$

Then normalize the score:

$p_i^v = \frac{\exp(g_i^v)}{\sum_{j=1}^n \exp(g_j^v)}$

The loss function:

$\mathcal{L}_{\text{verify}} = -\frac{1}{N} \sum_{i=1}^N \log p_{y_i^v}^v$

where $y_i^v$ is the index of the correct answer in all the answer candidates of the $i$-th instance.
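
The verification step for one instance can be sketched as below. The sketch follows the formulas exactly as written in these notes (including $s_{i,i} = 0$ inside the softmax); the weight vector `w` and the function names are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def verification_probs(reps, w):
    """reps[i] is r^{A_i}; w has length 3 * dim. Returns p_i^v for each candidate."""
    n, dim = len(reps), len(reps[0])
    scores = []
    for i in range(n):
        # s_{i,j}: dot product with the other candidates, 0 on the diagonal.
        s = [0.0 if i == j else dot(reps[i], reps[j]) for j in range(n)]
        alpha = softmax(s)
        # r~^{A_i}: attention-weighted sum of candidate representations.
        r_tilde = [sum(alpha[j] * reps[j][d] for j in range(n)) for d in range(dim)]
        # Features [r, r~, r * r~] fed to the linear scoring layer.
        feats = reps[i] + r_tilde + [a * b for a, b in zip(reps[i], r_tilde)]
        scores.append(dot(w, feats))
    return softmax(scores)


def verification_loss(probs, gold_index):
    """Negative log probability of the correct candidate for one instance."""
    return -math.log(probs[gold_index])
```

When all candidates have the same representation, every candidate receives the same probability, which matches the intuition that verification only helps when candidates differ in how well they support each other.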

Joint training

The joint objective function:

$\mathcal{L} = \mathcal{L}_{\text{boundary}} + \mathcal{L}_{\text{content}} + \mathcal{L}_{\text{verify}}$

QANet

Previous models relied on recurrent neural networks, which slows down both training and inference. QANet^[17] instead uses only convolutions and self-attention to speed up training.

QANet architecture

Input embedding layer

Concatenate the pretrained word embedding $x_w$ and char embedding $x_c$ : $[x_w; x_c] \in \pmb{R}^{p_1+p_2}$ . A two-layer highway network is then applied on top of this representation.
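
For readers unfamiliar with highway networks, here is a minimal sketch of a single highway layer, $y = t \odot H(x) + (1 - t) \odot x$ with a sigmoid gate $t$. The ReLU transform, the parameter names, and the toy values are assumptions for illustration; the notes do not specify these details for QANet.

```python
import math


def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer: y = t * H(x) + (1 - t) * x, applied element-wise."""

    def affine(W, b):
        return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

    h = [max(0.0, z) for z in affine(W_h, b_h)]                 # candidate transform (ReLU, assumed)
    t = [1.0 / (1.0 + math.exp(-z)) for z in affine(W_t, b_t)]  # transform gate in (0, 1)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


# Toy parameters: zero weights, so H(x) = ReLU(b_h) and the gate is exactly 0.5.
zeros = [[0.0, 0.0], [0.0, 0.0]]
y = highway_layer([3.0, -1.0], zeros, [1.0, 1.0], zeros, [0.0, 0.0])
```

A two-layer highway network simply applies such a layer twice, each time with its own parameters. A gate near 0 passes the input through unchanged, while a gate near 1 replaces it with the transformed value.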

Embedding encoder layer

A stack of the building block: [convolution-layer $\times$ # + self-attention layer + feed-forward layer]

Context-query attention layer

First, compute the similarity matrix $S \in \pmb{R}^{n \times m}$ between each pair of words from context $C$ and query $Q$, where $n$ and $m$ are the context and query lengths. Then normalize each row of $S$ with the softmax function to obtain a matrix $\bar{S}$. The context-to-query attention is computed as: $A = \bar{S} \cdot Q^T \in \pmb{R}^{n \times d}$.
The similarity function is the trilinear function:
$f(q,c) = W_0 [q,c, q \odot c]$
where $\odot$ is the element-wise multiplication and $W_0$ is a trainable variable.

The query-to-context attention is:

$B = \bar{S} \cdot {\overline{\overline{S}}}^T \cdot C^T$

where $\overline{\overline{S}}$ is the column-normalized matrix of $S$, obtained by applying the softmax function along each column.
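
The two attention directions can be sketched in plain Python as follows. Two assumptions are made here: the context and query are stored as lists of row vectors (so `C` and `Q` in the code play the role of $C^T$ and $Q^T$), and $B$ is formed with matrix products $\bar{S} \cdot \overline{\overline{S}}^T \cdot C^T$, as the shapes $(n \times m)(m \times n)(n \times d)$ require. The helper names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def context_query_attention(C, Q, w0):
    """C: n context vectors, Q: m query vectors (dimension d), w0: 3d trilinear weights."""
    # Trilinear similarity f(q, c) = w0 . [q, c, q * c] for every (context, query) pair.
    S = [
        [sum(w * v for w, v in zip(w0, q + c + [a * b for a, b in zip(q, c)])) for q in Q]
        for c in C
    ]
    S_row = [softmax(row) for row in S]                # softmax over each row (n x m)
    S_col_T = [softmax(list(col)) for col in zip(*S)]  # softmax over each column, stored transposed (m x n)
    A = matmul(S_row, Q)                               # context-to-query attention (n x d)
    B = matmul(matmul(S_row, S_col_T), C)              # query-to-context attention (n x d)
    return A, B


# Toy input: n = 2 context words, m = 3 query words, d = 2; zero weights give uniform attention.
C = [[1.0, 0.0], [3.0, 2.0]]
Q = [[2.0, 4.0], [4.0, 0.0], [0.0, 2.0]]
A, B = context_query_attention(C, Q, [0.0] * 6)
```

With zero trilinear weights every similarity is 0, so each row of `A` is the mean query vector and each row of `B` is the mean context vector, which makes the shapes and normalization directions easy to check by hand.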

Model encoder layer

The input at each position is $[c,a, c \odot a, c \odot b]$, where $a$ and $b$ are rows of the attention matrices $A$ and $B$, respectively.

Output layer

Predict the probability of each position in the context being the start or end of an answer span:
$p^1 = \text{softmax}(W_1[M_0;M_1])$ $p^2 = \text{softmax}(W_2[M_0;M_2])$
where $W_1$ , $W_2$ are two trainable variables and $M_0$ , $M_1$ , $M_2$ are the outputs of the three model encoders, from bottom to top.
Loss function
$\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} [\log (p^1_{y_i^1}) + \log (p^2_{y_i^2})]$

Tricks

Data augmentation with back-translation

References

1. Weston, J., Chopra, S., & Bordes, A. (2014). Memory Networks. CoRR, abs/1410.3916.
2. Hermann, K.M., Kociský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., & Blunsom, P. (2015). Teaching Machines to Read and Comprehend. NIPS.
3. Kadlec, R., Schmid, M., Bajgar, O., & Kleindienst, J. (2016). Text Understanding with the Attention Sum Reader Network. CoRR, abs/1603.01547.
4. Rocktäschel, T., Grefenstette, E., Hermann, K.M., Kociský, T., & Blunsom, P. (2016). Reasoning about Entailment with Neural Attention. CoRR, abs/1509.06664.
5. Wang, S., & Jiang, J. (2016). Learning Natural Language Inference with LSTM. HLT-NAACL.
6. Kadlec, R., Schmid, M., Bajgar, O., & Kleindienst, J. (2016). Text Understanding with the Attention Sum Reader Network. CoRR, abs/1603.01547.
7. Cui, Y., Chen, Z., Wei, S., Wang, S., & Liu, T. (2017). Attention-over-Attention Neural Networks for Reading Comprehension. ACL.
8. Vinyals, O., Fortunato, M., & Jaitly, N. (2015). Pointer Networks. NIPS.
9. Trischler, A., Ye, Z., Yuan, X., Bachman, P., Sordoni, A., & Suleman, K. (2016). Natural Language Comprehension with the EpiReader. EMNLP.
10. Seo, M.J., Kembhavi, A., Farhadi, A., & Hajishirzi, H. (2017). Bidirectional Attention Flow for Machine Comprehension. CoRR, abs/1611.01603.
11. Wang, S., & Jiang, J. (2017). Machine Comprehension Using Match-LSTM and Answer Pointer. CoRR, abs/1608.07905.
12. Wang, W., Yang, N., Wei, F., Chang, B., & Zhou, M. (2017). R-NET: Machine Reading Comprehension with Self-Matching Networks. Natural Language Computing Group, Microsoft Research Asia, Beijing, China, Tech. Rep. 5.
13. Shen, Y., Huang, P., Gao, J., & Chen, W. (2016). ReasoNet: Learning to Stop Reading in Machine Comprehension. CoCo@NIPS.
14. Wang, W., Yang, N., Wei, F., Chang, B., & Zhou, M. (2017). Gated Self-Matching Networks for Reading Comprehension and Question Answering. ACL.
15. Rocktäschel, T., Grefenstette, E., Hermann, K.M., Kociský, T., & Blunsom, P. (2016). Reasoning about Entailment with Neural Attention. CoRR, abs/1509.06664.
16. Wang, Y., Liu, K., Liu, J., He, W., Lyu, Y., Wu, H., Li, S., & Wang, H. (2018). Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification. ACL.
17. Yu, A.W., Dohan, D., Luong, M., Zhao, R., Chen, K., Norouzi, M., & Le, Q.V. (2018). QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. CoRR, abs/1804.09541.