A Co-Matching Model for Multi-choice Reading Comprehension
Introduction
Enabling machines to understand natural language text is arguably the ultimate goal of natural language processing, and the task of machine reading comprehension is an intermediate step towards this goal. Recently, a new multi-choice machine comprehension dataset called RACE was released, extracted from middle and high school English examinations in China. Figure [tbl:example] shows an example passage and two related questions from RACE. The key difference between RACE and previously released machine comprehension datasets (e.g., the CNN/Daily Mail dataset and SQuAD) is that the answers in RACE often cannot be directly extracted from the given passages, as illustrated by the two example questions (Q1 & Q2) in Figure [tbl:example]. Answering these questions is therefore more challenging and requires more inference.
Previous approaches to machine comprehension are usually based on pairwise sequence matching, where either the passage is matched against the sequence that concatenates both the question and a candidate answer , or the passage is matched against the question alone followed by a second step of selecting an answer using the matching result of the first step . However, these approaches may not be suitable for multi-choice reading comprehension since questions and answers are often equally important. Matching the passage only against the question may not be meaningful and may lead to loss of information from the original passage, as we can see from the first example question in Figure [tbl:example]. On the other hand, concatenating the question and the answer into a single sequence for matching may not work, either, due to the loss of interaction information between a question and an answer. As illustrated by Q2 in Figure [tbl:example], the model may need to recognize what “he” and “it” in candidate answer (c) refer to in the question, in order to select (c) as the correct answer. This observation of the RACE dataset shows that we face a new challenge of matching sequence triplets (i.e., passage, question and answer) instead of pairwise matching.
| Passage: My father wasn’t a king, he was a taxi driver, but I am a prince-Prince Renato II, of the country Pontinha, an island fort on Funchal harbour. In 1903, the king of Portugal sold the land to a wealthy British family, the Blandys, who make Madeira wine. Fourteen years ago the family decided to sell it for just EUR25,000, but nobody wanted to buy it either. I met Blandy at a party and he asked if I’d like to buy the island. Of course I said yes, but I had no money-I was just an art teacher. I tried to find some business partners, who all thought I was crazy. So I sold some of my possessions, put my savings together and bought it. Of course, my family and my friends-all thought I was mad ... If I want to have a national flag, it could be blue today, red tomorrow. ... My family sometimes drops by, and other people come every day because the country is free for tourists to visit ... | |
|---|---|
| Q1: Which statement of the following is true? | Q2: How did the author get the island? |
| a. The author made his living by driving. | a. It was a present from Blandy. |
| b. The author’s wife supported to buy the island. | b. The king sold it to him. |
| c. Blue and red are the main colors of his national flag. | c. He bought it from Blandy. |
| d. People can travel around the island free of charge. | d. He inherited from his father. |
In this paper, we propose a new model to match a question-answer pair to a given passage. Our co-matching approach explicitly treats the question and the candidate answer as two sequences and jointly matches them to the given passage. Specifically, for each position in the passage, we compute two attention-weighted vectors, where one is from the question and the other from the candidate answer. Then, two matching representations are constructed: the first one matches the passage with the question while the second one matches the passage with the candidate answer. These two newly constructed matching representations together form a co-matching state. Intuitively, it encodes the locational information of the question and the candidate answer matched to a specific context of the passage. Finally, we apply a hierarchical LSTM over the sequence of co-matching states at different positions of the passage. Information is aggregated from word-level to sentence-level and then from sentence-level to document-level. In this way, our model can better deal with questions that require evidence scattered across different sentences of the passage. Our model improves over the state-of-the-art model by 3 percentage points on the RACE dataset. Our code will be released at https://github.com/shuohangwang/comatch.
Model
For the task of multi-choice reading comprehension, the machine is given a passage, a question and a set of candidate answers. The goal is to select the correct answer from the candidates. Let us use $`\mathbf{P}\in \mathbb{R}^{d\times P}`$, $`\mathbf{Q}\in \mathbb{R}^{d\times Q}`$ and $`\mathbf{A}\in \mathbb{R}^{d \times A}`$ to represent the passage, the question and a candidate answer, respectively, where each word in each sequence is represented by an embedding vector. $`d`$ is the dimensionality of the embeddings, and $`P`$, $`Q`$, and $`A`$ are the lengths of these sequences.
Overall our model works as follows. For each candidate answer, our model constructs a vector that represents the matching of $`\mathbf{P}`$ with both $`\mathbf{Q}`$ and $`\mathbf{A}`$. The vectors of all candidate answers are then used for answer selection. Because we simultaneously match $`\mathbf{P}`$ with $`\mathbf{Q}`$ and $`\mathbf{A}`$, we call this a co-matching model. In Section 4.1 we introduce the word-level co-matching mechanism. Then in Section 4.2 we introduce a hierarchical aggregation process. Finally in Section 4.3 we present the objective function. An overview of our co-matching model is shown in Figure 1.
Co-matching
The co-matching part of our model aims to match the passage with the question and the candidate answer at the word-level. Inspired by some previous work , we first use bi-directional LSTMs to pre-process the sequences as follows:
\begin{eqnarray}
\nonumber
&\mathbf{H}^{\text{p}} = \text{Bi-LSTM}(\mathbf{P}), \mathbf{H}^{\text{q}} = \text{Bi-LSTM}(\mathbf{Q}), \\
&\mathbf{H}^{\text{a}} = \text{Bi-LSTM}(\mathbf{A}),
\label{eqn:pre}
\end{eqnarray}
where $`\mathbf{H}^{\text{p}}\in \mathbb{R}^{l\times P}`$, $`\mathbf{H}^{\text{q}}\in \mathbb{R}^{l\times Q}`$ and $`\mathbf{H}^{\text{a}} \in \mathbb{R}^{l\times A}`$ are the sequences of hidden states generated by the bi-directional LSTMs. We then make use of the attention mechanism to match each state in the passage to an aggregated representation of the question and the candidate answer. The attention vectors are computed as follows:
\begin{eqnarray}
\nonumber
\mathbf{G}^{\text{q}} & = & \text{SoftMax}\left( (\mathbf{W}^{\text{g}} \mathbf{H}^{\text{q}}+\mathbf{b}^\text{g}\otimes \mathbf{e}_Q)^\text{T} \mathbf{H}^{\text{p}} \right), \\
\nonumber
\mathbf{G}^{\text{a}} & = & \text{SoftMax}\left( (\mathbf{W}^{\text{g}} \mathbf{H}^{\text{a}}+\mathbf{b}^\text{g}\otimes \mathbf{e}_A)^\text{T} \mathbf{H}^{\text{p}} \right), \\
\nonumber
\overline{\mathbf{H}}^{\text{q}} & = & \mathbf{H}^{\text{q}}\mathbf{G}^{\text{q}}, \\
\overline{\mathbf{H}}^{\text{a}} & = & \mathbf{H}^{\text{a}}\mathbf{G}^{\text{a}},
\label{eqn:alpha}
\end{eqnarray}
where $`\mathbf{W}^{\text{g}}\in \mathbb{R}^{l\times l}`$ and $`\mathbf{b}^{\text{g}}\in \mathbb{R}^{l}`$ are the parameters to learn. $`\mathbf{e}_Q\in \mathbb{R}^{Q}`$ is a vector of all 1s, used to repeat the bias vector into a matrix ($`\mathbf{e}_A\in \mathbb{R}^{A}`$ is defined analogously). $`\mathbf{G}^{\text{q}}\in \mathbb{R}^{Q\times P}`$ and $`\mathbf{G}^{\text{a}}\in \mathbb{R}^{A\times P}`$ are the attention weights assigned to the different hidden states in the question and the candidate answer sequences, respectively. $`\overline{\mathbf{H}}^{\text{q}}\in \mathbb{R}^{l\times P}`$ is the weighted sum of all the question hidden states; it represents how the question aligns to each hidden state in the passage. Similarly, $`\overline{\mathbf{H}}^{\text{a}}\in \mathbb{R}^{l\times P}`$ is the weighted sum of all the candidate answer hidden states. Finally we can co-match the passage states with the question and the candidate answer as follows:
\begin{eqnarray}
\nonumber
\mathbf{M}^{\text{q}} &=& \text{ReLU}\left( \mathbf{W}^{\text{m}} \begin{bmatrix} \overline{\mathbf{H}}^{\text{q}}\ominus \mathbf{H}^{\text{p}}\\ \overline{\mathbf{H}}^{\text{q}}\otimes \mathbf{H}^{\text{p}} \end{bmatrix} + \mathbf{b}^{\text{m}}\right), \\
\nonumber
\mathbf{M}^{\text{a}} &=& \text{ReLU}\left( \mathbf{W}^{\text{m}} \begin{bmatrix} \overline{\mathbf{H}}^{\text{a}}\ominus \mathbf{H}^{\text{p}}\\ \overline{\mathbf{H}}^{\text{a}}\otimes \mathbf{H}^{\text{p}} \end{bmatrix} + \mathbf{b}^{\text{m}}\right), \\
\mathbf{C} &=& \begin{bmatrix}\mathbf{M}^{\text{q}} \\ \mathbf{M}^{\text{a}} \end{bmatrix},
%\;\;\; \mathbf{c}_k =\begin{bmatrix}\mathbf{m}^{\text{q}}_k \\ \mathbf{m}^{\text{a}}_k \end{bmatrix}
\label{eqn:match}
\end{eqnarray}
where $`\mathbf{W}^{\text{m}}\in \mathbb{R}^{l\times 2l}`$ and $`\mathbf{b}^{\text{m}}\in \mathbb{R}^{l}`$ are the parameters to learn. $`\begin{bmatrix} \cdot \\ \cdot \end{bmatrix}`$ is the column-wise concatenation of two matrices, and $`\cdot\ominus \cdot`$ and $`\cdot\otimes \cdot`$ are the element-wise subtraction and multiplication between two matrices, which are used to build better matching representations. $`\mathbf{M}^{\text{q}}\in \mathbb{R}^{l\times P}`$ represents the matching between the hidden states of the passage and the corresponding attention-weighted representations of the question. Similarly, we match the passage with the candidate answer and represent the matching results using $`\mathbf{M}^{\text{a}}\in \mathbb{R}^{l\times P}`$. Finally, $`\mathbf{C}\in \mathbb{R}^{2l\times P}`$ is the concatenation of $`\mathbf{M}^{\text{q}}`$ and $`\mathbf{M}^{\text{a}}`$ and represents how each passage state is matched with the question and the candidate answer. We refer to $`\mathbf{c}\in \mathbb{R}^{2l}`$, a single column of $`\mathbf{C}`$, as a co-matching state: it concurrently matches a passage state with both the question and the candidate answer.
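The attention and co-matching computations above can be sketched in NumPy. This is a minimal illustration of the tensor shapes and operations only: the Bi-LSTM outputs and the parameters $`\mathbf{W}^{\text{g}}, \mathbf{b}^{\text{g}}, \mathbf{W}^{\text{m}}, \mathbf{b}^{\text{m}}`$ are random stand-ins rather than trained values, and the toy sizes are chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

l, P, Q, A = 4, 6, 3, 2  # hidden size and sequence lengths (toy values)

# Stand-ins for the Bi-LSTM outputs H^p, H^q, H^a (each l x length).
Hp = rng.standard_normal((l, P))
Hq = rng.standard_normal((l, Q))
Ha = rng.standard_normal((l, A))

# Randomly initialised stand-ins for the learned parameters.
Wg = rng.standard_normal((l, l))
bg = rng.standard_normal((l, 1))
Wm = rng.standard_normal((l, 2 * l))
bm = rng.standard_normal((l, 1))

# Attention weights G^q (Q x P) and G^a (A x P); NumPy broadcasting of the
# bias column plays the role of b^g (x) e_Q / b^g (x) e_A.
Gq = softmax((Wg @ Hq + bg).T @ Hp, axis=0)
Ga = softmax((Wg @ Ha + bg).T @ Hp, axis=0)

# Attention-weighted question/answer vectors, one per passage position.
Hq_bar = Hq @ Gq  # l x P
Ha_bar = Ha @ Ga  # l x P

def match(H_bar, Hp):
    """Element-wise subtraction and multiplication, then a ReLU projection."""
    feats = np.vstack([H_bar - Hp, H_bar * Hp])  # 2l x P
    return np.maximum(0.0, Wm @ feats + bm)      # l x P

Mq = match(Hq_bar, Hp)
Ma = match(Ha_bar, Hp)
C = np.vstack([Mq, Ma])  # co-matching states, 2l x P
```

Each column of `C` is one co-matching state $`\mathbf{c}\in \mathbb{R}^{2l}`$; each column of `Gq` sums to 1, i.e., it is a distribution over question positions for one passage position.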
Hierarchical Aggregation
In order to capture the sentence structure of the passage, we further modify the model presented earlier and build a hierarchical LSTM on top of the co-matching states. Specifically, we first split the passage into sentences and we use $`\mathbf{P}_1,\mathbf{P}_2, \ldots, \mathbf{P}_N`$ to represent these sentences, where $`N`$ is the number of sentences in the passage. For each triplet $`\{\mathbf{P}_n, \mathbf{Q}, \mathbf{A}\}, n \in [1, N]`$, we can get the co-matching states $`\mathbf{C}_n`$ through Eqn. ([eqn:pre]-[eqn:match]). Then we build a bi-directional LSTM followed by max pooling on top of the co-matching states of each sentence as follows:
\begin{eqnarray}
\mathbf{h}^{\text{s}}_n & = & \text{MaxPooling}\left( \text{Bi-LSTM} \left( \mathbf{C}_n \right) \right),
\end{eqnarray}
where the function $`\text{MaxPooling}(\cdot)`$ is the row-wise max pooling operation. $`\mathbf{h}^{\text{s}}_n \in \mathbb{R}^l, n\in [1,N]`$ is the sentence-level aggregation of the co-matching states. All these representations will be further integrated by another Bi-LSTM to get the final triplet matching representation.
\begin{eqnarray}
\nonumber
\mathbf{H}^{\text{s}} &=& [\mathbf{h}^s_1;\mathbf{h}^s_2; \ldots; \mathbf{h}^s_N], \\
\mathbf{h}^{\text{t}} & = & \text{MaxPooling}\left( \text{Bi-LSTM} \left( \mathbf{H}^{\text{s}} \right) \right),
\label{eqn:rep}
\end{eqnarray}
where $`\mathbf{H}^{\text{s}}\in \mathbb{R}^{l\times N}`$ is the concatenation of all the sentence-level representations and it is the input of a higher level LSTM. $`\mathbf{h}^{\text{t}}\in \mathbb{R}^l`$ is the final output of the matching between the sequences of the passage, the question and the candidate answer.
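The two-level aggregation can be sketched as follows. Here `bilstm_stub` is a hypothetical placeholder (a random linear map) standing in for a trained Bi-LSTM, and the co-matching matrices are random; only the pooling and stacking structure is faithful to the model.

```python
import numpy as np

rng = np.random.default_rng(1)
l, N = 4, 3            # hidden size, number of sentences (toy values)
lengths = [5, 7, 4]    # toy sentence lengths P_1..P_N

# One co-matching matrix C_n (2l x P_n) per sentence; random stand-ins here.
Cs = [rng.standard_normal((2 * l, Pn)) for Pn in lengths]

def bilstm_stub(X, out_dim):
    """Placeholder for a Bi-LSTM: a random linear map that keeps the
    sequence length. In the real model this is a trained recurrent layer."""
    W = rng.standard_normal((out_dim, X.shape[0])) / np.sqrt(X.shape[0])
    return W @ X

# Sentence level: process each sentence's co-matching states, then
# row-wise max pooling gives one vector h_n^s per sentence.
hs = [bilstm_stub(Cn, l).max(axis=1) for Cn in Cs]
Hs = np.stack(hs, axis=1)  # l x N

# Document level: a second pass over the sentence vectors, pooled again.
ht = bilstm_stub(Hs, l).max(axis=1)  # final triplet representation in R^l
```

Max pooling over the sequence axis makes the sentence and document representations length-invariant, which is what lets evidence from sentences of different lengths be combined.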
Objective function
For each candidate answer $`\mathbf{A}_i`$, we can build its matching representation $`\mathbf{h}^t_i \in \mathbb{R}^l`$ with the question and the passage through Eqn. ([eqn:rep]). Our loss function is computed as follows:
\begin{equation}
L(\mathbf{A}_i |\mathbf{P},\mathbf{Q}) = -\log \frac{\exp (\mathbf{w}^T \mathbf{h}^t_i)}{\sum_{j=1}^4 \exp (\mathbf{w}^T \mathbf{h}^t_j)},
\end{equation}
where $`\mathbf{w}\in \mathbb{R}^l`$ is a parameter to learn.
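The objective is a standard softmax cross-entropy over the four candidates' scores $`\mathbf{w}^T \mathbf{h}^t_i`$. A minimal sketch, with random stand-ins for the four matching vectors and for $`\mathbf{w}`$, and computed via a log-sum-exp for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(2)
l = 4  # toy hidden size

# Stand-ins: final matching vectors h_i^t for the four candidates, and w.
H = rng.standard_normal((4, l))
w = rng.standard_normal(l)

scores = H @ w  # one scalar score per candidate answer
m = scores.max()
log_probs = scores - (m + np.log(np.exp(scores - m).sum()))

i = 0                 # suppose candidate 0 is the correct answer
loss = -log_probs[i]  # the negative log-likelihood L(A_i | P, Q)
```

At prediction time, the model simply returns the candidate with the highest score, i.e. `scores.argmax()`.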