Machine reading comprehension aims to answer questions given a passage or document.

# Symbol matching models

## Frame-Semantic parsing

Frame-semantic parsing identifies predicates and their arguments, i.e. “who did what to whom”.

## Word Distance

Sum the distances from every word in $q$ to its nearest aligned word in $d$.
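A minimal Python sketch of one plausible reading of this baseline, assuming exact word matches for alignment and a capped per-word penalty; the function name, the `@placeholder` token, and the exact distance definition are illustrative assumptions, not the original implementation.

```python
def word_distance(doc_tokens, query_tokens, candidate, max_penalty=8):
    """Score a candidate answer; lower is better."""
    best = float("inf")
    for anchor, tok in enumerate(doc_tokens):
        if tok != candidate:
            continue  # anchor the query placeholder on each occurrence of the candidate
        score = 0
        for q_tok in query_tokens:
            if q_tok == "@placeholder":
                continue  # the blank itself is aligned to the candidate
            positions = [i for i, d in enumerate(doc_tokens) if d == q_tok]
            if positions:
                # distance (in the document) from the anchor to the nearest match, capped
                score += min(min(abs(i - anchor) for i in positions), max_penalty)
            else:
                score += max_penalty
        best = min(best, score)
    return best

# Usage: pick the candidate with the smallest summed distance.
# answer = min(candidates, key=lambda c: word_distance(doc_tokens, query_tokens, c))
```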

# Teaching Machines to Read and Comprehend

In NMT, deep LSTMs have shown a remarkable ability to embed long sequences into a vector representation, which contains enough information to generate a full translation in another language.[2]

The Deep LSTM Reader feeds the document into a deep LSTM encoder one word at a time, followed by a delimiter and then the query ($d \oplus ||| \oplus q$), or the query first ($q \oplus ||| \oplus d$). The network then predicts which token in the document answers the query.

Limitations of the Deep LSTM Reader:

• The entire document and query must be squeezed into a single fixed-width hidden vector.
• Solution: the Attentive Reader employs a finer-grained, token-level attention mechanism, where the tokens are embedded given their entire future and past context in the input document.

Attentive Reader encodes the document $d$ and the query $q$ with two separate 1-layer bi-LSTMs.[2]

When encoding the query $q$, the encoding $u$ of a query with length $|q|$ is the concatenation of the final forward and backward outputs:
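A reconstruction of the missing equation, following the bidirectional notation used below for the Impatient Reader:

$$u = \overrightarrow{y_q}(|q|) \,||\, \overleftarrow{y_q}(1)$$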

When encoding the document $d$, each token at position $t$ is:
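Reconstructed in the same notation, the token encoding concatenates the forward and backward outputs:

$$y_d(t) = \overrightarrow{y_d}(t) \,||\, \overleftarrow{y_d}(t)$$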

The representation $r$ of $d$ is a weighted sum of these output vectors. The weights can be interpreted as the degree to which the network attends to a particular token in the document $d$ when answering the query:
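A sketch of the attention and weighted sum as given in [2] (weight names follow the paper; treat as a reconstruction):

$$m(t) = \tanh\!\left(\pmb{W}_{ym}\, y_d(t) + \pmb{W}_{um}\, u\right), \qquad s(t) \propto \exp\!\left(\pmb{w}_{ms}^\top m(t)\right), \qquad r = y_d\, s$$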

Finally, the joint document and query embedding is:
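Reconstructed from [2]:

$$g^{AR}(d, q) = \tanh\!\left(\pmb{W}_{rg}\, r + \pmb{W}_{ug}\, u\right)$$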

The Attentive Reader focuses on the passages of a context document that are most likely to inform the answer to the query.

The Impatient Reader can reread from the document as each query token is read.[2]

At each token $i$ of the query $q$, the model computes the document representation vector $r(i)$ with the bidirectional embedding $y_q(i) = \overrightarrow{y_q}(i) || \overleftarrow{y_q}(i)$:
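A sketch of the recurrent attention as given in [2] (a reconstruction; weight names follow the paper):

$$m(i, t) = \tanh\!\left(\pmb{W}_{dm}\, y_d(t) + \pmb{W}_{rm}\, r(i-1) + \pmb{W}_{qm}\, y_q(i)\right), \quad 1 \le i \le |q|$$

$$s(i, t) \propto \exp\!\left(\pmb{w}_{ms}^\top m(i, t)\right), \qquad r(0) = \pmb{r}_0, \qquad r(i) = y_d^\top s(i) + \tanh\!\left(\pmb{W}_{rr}\, r(i-1)\right)$$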

The attention mechanism allows the model to recurrently accumulate information from the document as it sees each query token, ultimately outputting a final joint document-query representation for answer prediction.

• The Attention Sum Reader, for cloze-style QA. [6]

1. Compute the vector embedding for the query.
2. Compute the vector embedding of each individual word in the context of the whole document; the underlying word embedding is a look-up table $V$.
3. Take the dot product between the question embedding and each word's contextual embedding, then select the most likely answer.
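The pointer-sum step these three steps describe can be written compactly (a reconstruction of the formula from [6], where $f_i(d)$ is the contextual embedding of the $i$-th document word, $g(q)$ the query embedding, and $I(w, d)$ the set of positions where $w$ occurs in $d$):

$$s_i \propto \exp\!\left(f_i(d) \cdot g(q)\right), \qquad P(w \mid q, d) = \sum_{i \in I(w, d)} s_i$$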

## Pointer Nets

Problems:

• The conventional seq2seq architecture can only apply a softmax distribution over a fixed-size output dictionary. It cannot handle problems where the size of the output dictionary equals the length of the input sequence.[8]
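The chain-rule factorization being referred to (reconstructed from [8]):

$$p(\mathcal{C}^{\mathcal{P}} \mid \mathcal{P}; \theta) = \prod_{i=1}^{m(\mathcal{P})} p_\theta(C_i \mid C_1, \cdots, C_{i-1}, \mathcal{P}; \theta)$$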

where $\mathcal{P}=\{ P_1, \cdots, P_n \}$ is a sequence of $n$ vectors and $\mathcal{C}^{\mathcal{P}} = \{ C_1, \cdots, C_{m(\mathcal{P})} \}$ is a sequence of $m(\mathcal{P})$ indices.

The parameters are learnt by maximizing the conditional probabilities of the training set:
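Reconstructed from [8]:

$$\theta^* = \underset{\theta}{\arg\max} \sum_{\mathcal{P},\, \mathcal{C}^{\mathcal{P}}} \log p(\mathcal{C}^{\mathcal{P}} \mid \mathcal{P}; \theta)$$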

Solution: Pointer Net.

• Applies the attention mechanism:
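A reconstruction of the Ptr-Net attention from [8], where $e_j$ are the encoder hidden states and $d_i$ the decoder hidden states:

$$u_j^i = v^\top \tanh\!\left(W_1 e_j + W_2 d_i\right), \quad j \in (1, \cdots, n), \qquad p(C_i \mid C_1, \cdots, C_{i-1}, \mathcal{P}) = \mathrm{softmax}(u^i)$$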

where softmax normalizes the vector $u^i$ (of length $n$) into an output distribution over the dictionary of inputs, and $v$, $W_1$, $W_2$ are learnable parameters of the output model.

Here, we do not blend the encoder state $e_j$ to propagate extra information to the decoder. Instead, we use $u_j^i$ as pointers to the input elements.

• Ptr Nets can be seen as an application of content-based attention mechanisms.

### Extractor: Pointer Nets

1. Use bi-RNNs to encode the passage $f(\theta_T, \pmb{T})$ and the question $g(\theta_Q, \pmb{Q})$, where $\theta_T$ and $\theta_Q$ represent the parameters of the text and question encoders, and $\pmb{T} \in \mathbb{R}^{D \times N}$ and $\pmb{Q} \in \mathbb{R}^{D \times N_Q}$ are matrix representations of the text and question (comprising $N$ and $N_Q$ words, respectively). For the question, concatenate the last hidden states of the forward and backward GRUs, denoted $g(\pmb{Q}) \in \mathbb{R}^{2d}$.
2. Take the inner product of text and question representations, followed by a softmax. The probability that the $i$-th word in text $\tau$ answers $\mathcal{Q}$:
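A sketch of the missing formula, reconstructed from [9] (here $f(\pmb{t}_i)$ denotes the bi-RNN encoding of the $i$-th text word):

$$s_i \propto \exp\!\left(f(\pmb{t}_i) \cdot g(\pmb{Q})\right)$$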

3. Compute the total probability that word $w$ is the correct answer:
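Reconstructed from [9], summing over the positions $I(w, \tau)$ where $w$ occurs in the text:

$$P(w \mid \tau, \mathcal{Q}) = \sum_{i \in I(w, \tau)} s_i$$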

4. The Extractor takes the $K$ highest word probabilities, yielding the $K$ most probable answer words $\{\hat{a}_1,\cdots,\hat{a}_K \}$.

### Reasoner

1. Insert the answer candidates into the question sequence $\mathcal{Q}$ at the placeholder location, which forms $K$ hypotheses $\{ \mathcal{H}_1, \cdots, \mathcal{H}_K \}$
2. For each hypothesis and each sentence of the text, form the matrices $\pmb{S}_i \in \mathbb{R}^{D \times |\mathcal{S}_i|}$, whose columns are the embedding vectors of the words of sentence $\mathcal{S}_i$, and $\pmb{H}_k \in \mathbb{R}^{D \times |\mathcal{H}_k|}$, whose columns are the embedding vectors of the words of hypothesis $\mathcal{H}_k$.
3. Augment $\pmb{S}_i$ with word-matching features $\pmb{M} \in \mathbb{R}^{2 \times |\mathcal{S}_i|}$. The first row is the inner product of each word embedding in the sentence with the candidate answer embedding; the second row is the maximum inner product of each sentence word embedding with any word embedding in the question.
4. The augmented $\pmb{S}_i$ and $\pmb{H}_k$ are then fed into two different ConvNets, with filters $\pmb{F}^S \in \mathbb{R}^{(D+2) \times m}$ and $\pmb{F}^H \in \mathbb{R}^{D \times m}$, where $m$ is the filter width. After a ReLU and a max-pooling operation, we obtain representations of the text sentence and the hypothesis: $\pmb{r}_{\mathcal{S}_i} \in \mathbb{R}^{N_F}$, $\pmb{r}_{\mathcal{H}_k} \in \mathbb{R}^{N_F}$, where $N_F$ is the number of filters.
5. Then compute a scalar similarity score using a bilinear form:
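A reconstruction of the bilinear score (consistent with the scalar $\zeta$ used in step 6):

$$\zeta = \pmb{r}_{\mathcal{S}_i}^\top \pmb{R}\, \pmb{r}_{\mathcal{H}_k}$$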

where $\pmb{R} \in \mathbb{R}^{N_F \times N_F}$ is a trainable parameter.

6. Concatenate the similarity score with the sentence and hypothesis representations to get $\pmb{x}_{ik} = [\zeta; \pmb{r}_{\mathcal{S}_i}; \pmb{r}_{\mathcal{H}_k}]^T$.

7. Pass $\pmb{x}_{ik}$ to a GRU, and the final hidden state is given to an FC layer, followed by a softmax op.

Finally, the outputs of the Reasoner and the Extractor are combined when minimizing the loss function (see the original paper [9] for details).

# Bi-Directional Attention Flow (BiDAF)

## Highway Networks

• A plain feedforward NN consists of $L$ layers, where the $l^{th}$ layer ($l \in \{ 1,2,\cdots,L \}$) applies a non-linear transformation $H$ (with parameters $\pmb{W}_{H,l}$) to its input $\pmb{x}$ to produce the output $\pmb{y}$.

$H$ is usually an affine transformation followed by a non-linear activation function.

• Highway Network:
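A reconstruction of the equation being referenced (the standard Highway Networks formulation; weight names assumed). A plain layer computes $\pmb{y} = H(\pmb{x}, \pmb{W_H})$; a highway layer instead computes

$$\pmb{y} = H(\pmb{x}, \pmb{W_H}) \cdot T(\pmb{x}, \pmb{W_T}) + \pmb{x} \cdot C(\pmb{x}, \pmb{W_C})$$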

• Additionally define $T$ as the transform gate and $C$ as the carry gate. Intuitively, these gates express how much of the output is produced by transforming the input and how much by carrying it through.
• For simplicity we set $C = 1 - T$, giving
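Reconstructed under the same assumptions, substituting $C = 1 - T$:

$$\pmb{y} = H(\pmb{x}, \pmb{W_H}) \cdot T(\pmb{x}, \pmb{W_T}) + \pmb{x} \cdot \left(1 - T(\pmb{x}, \pmb{W_T})\right)$$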

• In particular,
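Reconstructed under the same assumptions, the two limiting cases are:

$$\pmb{y} = \begin{cases} \pmb{x}, & \text{if } T(\pmb{x}, \pmb{W_T}) = \pmb{0} \\ H(\pmb{x}, \pmb{W_H}), & \text{if } T(\pmb{x}, \pmb{W_T}) = \pmb{1} \end{cases}$$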

## BiDAF

Problems:

• Previous models summarized the context paragraph into a fixed-size vector, which could lead to information loss.

• Solution: the attention is computed at each time step, and the attended vector at each time step, along with the representations from previous layers, is allowed to flow through to the subsequent modeling layer.

### Char embedding layer

Let $\pmb{x}_1, \cdots, \pmb{x}_T$ and $\pmb{q}_1, \cdots, \pmb{q}_J$ represent the words in the input context paragraph and query. Use TextCNNs to encode the char-level inputs, followed by a max-pooling over the entire width to obtain a fixed-size vector for each word.

### Word embedding layer

Apply pretrained GloVe word embeddings.

Then concatenate the char and word embedding vectors and feed them into a 2-layer Highway Network. The outputs are $\pmb{X} \in \mathbb{R}^{d \times T}$ for the context and $\pmb{Q} \in \mathbb{R}^{d \times J}$ for the query.

### Contextual embedding layer

Use bi-LSTMs to encode the context and query representations, concatenating the outputs of the two directions at each position. We obtain $\pmb{H} \in \mathbb{R}^{2d \times T}$ from the context word vectors $\pmb{X}$ and $\pmb{U} \in \mathbb{R}^{2d \times J}$ from the query word vectors $\pmb{Q}$.

The first three layers extract features from the query and the context at different levels of granularity, akin to the multi-stage feature computation of CNNs in computer vision.

### Attention flow layer

• Inputs: the context $\pmb{H}$ and the query $\pmb{U}$.
• Outputs: the query-aware vector representations of the context words, $\pmb{G}$, along with the contextual embeddings from the previous layer.
• Similarity matrix $\pmb{S} \in \mathbb{R}^{T \times J}$ between the contextual embeddings of the context ($\pmb{H}$) and the query ($\pmb{U}$), where $\pmb{S}_{tj} = \alpha(\pmb{H}_{:t}, \pmb{U}_{:j})$ indicates the similarity between the $t$-th context word and the $j$-th query word, $\alpha$ is a trainable scalar function that encodes the similarity between its input vectors, $\pmb{H}_{:t}$ is the $t$-th column vector of $\pmb{H}$, and $\pmb{U}_{:j}$ is the $j$-th column vector of $\pmb{U}$.

Then use $\pmb{S}$ to obtain the attentions and the attended vectors in both directions.
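A sketch of the two attention directions, reconstructed from [10] (notation follows the paper):

• Context-to-query: $\pmb{a}_t = \mathrm{softmax}(\pmb{S}_{t:}) \in \mathbb{R}^J$ and $\tilde{\pmb{U}}_{:t} = \sum_j \pmb{a}_{tj} \pmb{U}_{:j}$.
• Query-to-context: $\pmb{b} = \mathrm{softmax}(\max_{\mathrm{col}}(\pmb{S})) \in \mathbb{R}^T$ and $\tilde{\pmb{h}} = \sum_t \pmb{b}_t \pmb{H}_{:t}$, tiled $T$ times across the columns to form $\tilde{\pmb{H}}$.
• Finally, $\pmb{G}_{:t} = [\pmb{H}_{:t}; \tilde{\pmb{U}}_{:t}; \pmb{H}_{:t} \circ \tilde{\pmb{U}}_{:t}; \pmb{H}_{:t} \circ \tilde{\pmb{H}}_{:t}] \in \mathbb{R}^{8d}$.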

### Output layer

• Predict the probability of each position in the context being the start and end of an answer span
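A sketch consistent with the description below; this matches the QANet-style output layer [17] and should be treated as a reconstruction:

$$p^{start} = \mathrm{softmax}\!\left(W_1 [M_0; M_1]\right), \qquad p^{end} = \mathrm{softmax}\!\left(W_2 [M_0; M_2]\right)$$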

where $W_1$, $W_2$ are two trainable variables and $M_0$, $M_1$, $M_2$ are the outputs of the three model encoders, from bottom to top.

• Loss function
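A sketch of the standard span-extraction loss used by these models (a reconstruction; $y_i^{start}$ and $y_i^{end}$ are the gold start and end positions of example $i$):

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ \log p^{start}_{y_i^{start}} + \log p^{end}_{y_i^{end}} \right]$$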

## Tricks

• Data augmentation with back-translation

# References

1. Weston, J., Chopra, S., & Bordes, A. (2014). Memory Networks. arXiv preprint arXiv:1410.3916.
2. Hermann, K.M., Kociský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., & Blunsom, P. (2015). Teaching Machines to Read and Comprehend. NIPS.
3. Kadlec, R., Schmid, M., Bajgar, O., & Kleindienst, J. (2016). Text Understanding with the Attention Sum Reader Network. CoRR, abs/1603.01547.
4. Rocktäschel, T., Grefenstette, E., Hermann, K.M., Kociský, T., & Blunsom, P. (2016). Reasoning about Entailment with Neural Attention. CoRR, abs/1509.06664.
5. Wang, S., & Jiang, J. (2016). Learning Natural Language Inference with LSTM. HLT-NAACL.
6. Kadlec, R., Schmid, M., Bajgar, O., & Kleindienst, J. (2016). Text Understanding with the Attention Sum Reader Network. CoRR, abs/1603.01547.
7. Cui, Y., Chen, Z., Wei, S., Wang, S., & Liu, T. (2017). Attention-over-Attention Neural Networks for Reading Comprehension. ACL.
8. Vinyals, O., Fortunato, M., & Jaitly, N. (2015). Pointer Networks. NIPS.
9. Trischler, A., Ye, Z., Yuan, X., Bachman, P., Sordoni, A., & Suleman, K. (2016). Natural Language Comprehension with the EpiReader. EMNLP.
10. Seo, M.J., Kembhavi, A., Farhadi, A., & Hajishirzi, H. (2017). Bidirectional Attention Flow for Machine Comprehension. CoRR, abs/1611.01603.
11. Wang, S., & Jiang, J. (2017). Machine Comprehension Using Match-LSTM and Answer Pointer. CoRR, abs/1608.07905.
12. Wang, W., Yang, N., Wei, F., Chang, B., & Zhou, M. (2017). R-NET: Machine Reading Comprehension with Self-Matching Networks. Natural Lang. Comput. Group, Microsoft Res. Asia, Beijing, China, Tech. Rep, 5.
13. Shen, Y., Huang, P., Gao, J., & Chen, W. (2016). ReasoNet: Learning to Stop Reading in Machine Comprehension. CoCo@NIPS.
14. Wang, W., Yang, N., Wei, F., Chang, B., & Zhou, M. (2017). Gated Self-Matching Networks for Reading Comprehension and Question Answering. ACL.
15. Rocktäschel, T., Grefenstette, E., Hermann, K.M., Kociský, T., & Blunsom, P. (2016). Reasoning about Entailment with Neural Attention. CoRR, abs/1509.06664.
16. Wang, Y., Liu, K., Liu, J., He, W., Lyu, Y., Wu, H., Li, S., & Wang, H. (2018). Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification. ACL.
17. Yu, A.W., Dohan, D., Luong, M., Zhao, R., Chen, K., Norouzi, M., & Le, Q.V. (2018). QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. CoRR, abs/1804.09541.