Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification
Our Approach
Our multi-passage MRC model is mainly composed of three modules: answer boundary prediction, answer content modeling and answer verification. First of all, we need to model the question and passages. Following previous work, we compute the question-aware representation for each passage. Based on this representation, we employ a Pointer Network to predict the start and end positions of the answer in the answer boundary prediction module. At the same time, with the answer content model, we estimate whether each word should be included in the answer and thus obtain the answer representations. Next, in the answer verification module, each answer candidate attends to the other answer candidates to collect supportive information, and we compute a score for each candidate that indicates whether it is correct according to this verification. The final answer is determined not only by the boundary but also by the answer content and its verification score.
Question and Passage Modeling
Given a question $`\matr{Q}`$ and a set of passages $`\{\matr{P}_i\}`$ retrieved by search engines, our task is to find the best concise answer to the question. First, we formally present the details of modeling the question and passages.
Encoding
We first map each word into the vector space by concatenating its word embedding with the sum of its character embeddings. Then we employ bi-directional LSTMs (BiLSTM) to encode the question $`\matr{Q}`$ and passages $`\{\matr{P}_i\}`$ as follows:
\begin{align}
{\vec{u}}_t^Q & = \textrm{BiLSTM}_Q({\vec{u}}_{t-1}^Q, [{\vec{e}}_t^Q, {\vec{c}}_t^Q]) \\
{\vec{u}}_t^{P_i} & = \textrm{BiLSTM}_P({\vec{u}}_{t-1}^{P_i}, [{\vec{e}}_t^{P_i}, {\vec{c}}_t^{P_i}])
\end{align}
where $`{\vec{e}}_t^Q`$, $`{\vec{c}}_t^Q`$, $`{\vec{e}}_t^{P_i}`$, $`{\vec{c}}_t^{P_i}`$ are the word-level and character-level embeddings of the $`t^{th}`$ word. $`{\vec{u}}_t^Q`$ and $`{\vec{u}}_t^{P_i}`$ are the encoding vectors of the $`t^{th}`$ words in $`\matr{Q}`$ and $`\matr{P}_i`$ respectively. Unlike previous work that simply concatenates all the passages, we process the passages independently at the encoding and matching steps.
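As a concrete illustration of the encoding step, the sketch below concatenates word-level and character-level embeddings and runs them through a toy bidirectional recurrent encoder in NumPy. The tanh cell stands in for the LSTM, and all dimensions and random weights are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of the encoding step: input vectors are [e_t, c_t] and a
# toy bidirectional tanh RNN (standing in for BiLSTM) produces u_t.
import numpy as np

rng = np.random.default_rng(0)

def rnn(inputs, W_x, W_h):
    """Run a simple tanh RNN over a sequence, returning all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.stack(states)

def bi_rnn(inputs, params_fwd, params_bwd):
    """Bidirectional encoding: concatenate forward and backward states."""
    fwd = rnn(inputs, *params_fwd)
    bwd = rnn(inputs[::-1], *params_bwd)[::-1]
    return np.concatenate([fwd, bwd], axis=-1)

seq_len, d_word, d_char, d_hid = 5, 8, 4, 6   # illustrative sizes
word_emb = rng.normal(size=(seq_len, d_word))  # e_t
char_emb = rng.normal(size=(seq_len, d_char))  # c_t (summed char embeddings)
x = np.concatenate([word_emb, char_emb], axis=-1)  # [e_t, c_t]

params = lambda: (rng.normal(size=(d_hid, d_word + d_char)) * 0.1,
                  rng.normal(size=(d_hid, d_hid)) * 0.1)
u = bi_rnn(x, params(), params())  # encoding vectors u_t
print(u.shape)                     # seq_len x 2*d_hid
```

The same encoder (with separate parameters) would be applied to the question and to each passage independently.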
Q-P Matching
One essential step in MRC is to match the question with passages so that important information can be highlighted. We use the Attention Flow Layer to conduct the Q-P matching in two directions. We simplify the similarity matrix $`\matr{S} \in \mathbb{R}^{|\matr{Q}| \times |\matr{P}_i|}`$ between the question and passage $`i`$: the similarity between the $`t^{th}`$ word in the question and the $`k^{th}`$ word in passage $`i`$ is computed as:
\begin{equation}
\matr{S}_{t,k} = {{\vec{u}}_t^Q}^{\intercal} \cdot {\vec{u}}_k^{P_i}
\end{equation}
Then the context-to-question attention and question-to-context attention are applied strictly following the original Attention Flow Layer to obtain the question-aware passage representation $`\{{\vec{\tilde{u}}}_t^{P_i}\}`$. We omit the details here due to space limitations. Next, another BiLSTM is applied to fuse the contextual information and obtain a new representation for each word in the passage, which we regard as the match output:
\begin{equation}
{\vec{v}}_t^{P_i} = \textrm{BiLSTM}_M({\vec{v}}_{t-1}^{P_i}, {\vec{\tilde{u}}}_t^{P_i})
\end{equation}
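To make the matching step concrete, the following NumPy sketch computes the simplified dot-product similarity matrix and the context-to-question direction of the attention; the question-to-context direction and the fusion BiLSTM are omitted for brevity, and all shapes are illustrative assumptions.

```python
# Sketch of the simplified Q-P matching: S_{t,k} is a plain dot product of
# encodings, followed by BiDAF-style context-to-question attention.
import numpy as np

rng = np.random.default_rng(1)
d = 6
u_q = rng.normal(size=(4, d))   # question encodings u_t^Q
u_p = rng.normal(size=(7, d))   # passage encodings u_k^{P_i}

S = u_q @ u_p.T                 # S_{t,k} = u_t^Q . u_k^{P_i}, shape (|Q|, |P_i|)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Context-to-question attention: each passage word attends over question words.
a = softmax(S.T, axis=-1)       # (|P_i|, |Q|)
q_aware = a @ u_q               # question-aware passage representation
print(S.shape, q_aware.shape)
```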
Based on the passage representations, we introduce the three main modules of our model.
Answer Boundary Prediction
To extract the answer span from passages, mainstream studies try to locate the boundary of the answer; this is called the boundary model. Following previous work, we employ a Pointer Network to compute the probability of each word being the start or end position of the span:
\begin{align}
g_k^t &= {\vec{w}_{1}^{a}}^{\intercal} \tanh ( \matr{W}_{2}^a [\vec{v}_k^{P}, \vec{h}_{t-1}^{a}] ) \\
{\alpha}_k^t &= \textrm{exp} (g_k^t) / \sum\nolimits_{j=1}^{|\matr{P}|} \textrm{exp} (g_j^t) \\
\vec{c}_t &= \sum\nolimits_{k=1}^{|\matr{P}|} {\alpha}_k^t \vec{v}_k^{P} \\
\vec{h}_t^a &= \textrm{LSTM} (\vec{h}_{t-1}^a, \vec{c}_t)
\end{align}
By utilizing the attention weights, the probability of the $`k^{th}`$ word in the passage being the start or end position of the answer is obtained as $`{\alpha}_k^1`$ and $`{\alpha}_k^2`$ respectively. It should be noted that the Pointer Network is applied to the concatenation of all passages, denoted as $`\textrm{P}`$, so that the probabilities are comparable across passages. This boundary model can be trained by minimizing the negative log probabilities of the true start and end indices:
\begin{equation}
\mathcal{L}_{boundary} = - \frac{1}{N} \sum_{i=1}^N (\log {\alpha}_{y_i^{1}}^1 + \log {\alpha}_{y_i^{2}}^2)
\end{equation}
where $`N`$ is the number of samples in the dataset and $`y_i^{1}`$, $`y_i^{2}`$ are the gold start and end positions.
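One attention step of the Pointer Network, plus the boundary loss for a single sample, can be sketched as follows in NumPy. The LSTM state update is stubbed out (the second step simply reuses the previous state) since only the attention distributions matter here; sizes, weights and the gold span are illustrative assumptions.

```python
# Sketch of a Pointer Network step: scores g, weights alpha, context c,
# and the negative log-likelihood of a gold (start, end) pair.
import numpy as np

rng = np.random.default_rng(2)
n_words, d_v, d_h = 10, 6, 5
v = rng.normal(size=(n_words, d_v))   # match outputs v_k^P (passages concatenated)
h_prev = rng.normal(size=d_h)         # previous pointer-LSTM state h_{t-1}^a
w1 = rng.normal(size=d_h)
W2 = rng.normal(size=(d_h, d_v + d_h))

def pointer_step(v, h_prev):
    g = np.array([w1 @ np.tanh(W2 @ np.concatenate([vk, h_prev])) for vk in v])
    alpha = np.exp(g - g.max())
    alpha /= alpha.sum()              # softmax over all passage words
    c = alpha @ v                     # attention-pooled context vector
    return alpha, c

alpha_start, c1 = pointer_step(v, h_prev)
# In the full model, an LSTM maps (h_prev, c1) to the next state h_t^a;
# here the second step reuses h_prev for simplicity.
alpha_end, _ = pointer_step(v, h_prev)

y_start, y_end = 2, 4                 # illustrative gold span
loss_boundary = -(np.log(alpha_start[y_start]) + np.log(alpha_end[y_end]))
print(float(loss_boundary) > 0)
```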
Answer Content Modeling
Previous work employs the boundary model to find the text span with the maximum boundary score as the final answer. However, in our context, besides locating the answer candidates, we also need to model their meanings in order to conduct the verification. An intuitive method is to compute the representation of the answer candidates separately after extracting them, but such a model would be hard to train end-to-end. Here, we propose a novel method that obtains the representation of the answer candidates based on probabilities.
Specifically, we change the output layer of the classic MRC model. Besides predicting the boundary probabilities for the words in the passages, we also predict whether each word should be included in the content of the answer. The content probability of the $`k^{th}`$ word is computed as:
\begin{align}
p_k^c &= \textrm{sigmoid} ({\vec{w}_{1}^{c}}^{\intercal} \textrm{ReLU} (\matr{W}_{2}^c \vec{v}_k^{P_i}) )
\end{align}
Training this content model is also quite intuitive. We transform the boundary labels into a continuous segment, which means the words within the answer span will be labeled as 1 and other words will be labeled as 0. In this way, we define the loss function as the averaged cross entropy:
\begin{equation}
\begin{split}
\mathcal{L}_{content} = & - \frac{1}{N} \frac{1}{|\textrm{P}|} \sum_{i=1}^N \sum_{k=1}^{|\textrm{P}|} [ y_k^c\log p_{k}^c \\
& + (1-y_k^c)\log (1 - p_{k}^c)]
\end{split}
\end{equation}
The content probabilities provide another view to measure the quality of the answer in addition to the boundary. Moreover, with these probabilities, we can represent the answer from passage $`i`$ as a weighted sum of all the word embeddings in this passage:
\begin{align}
\vec{r}^{A_i} = \frac{1}{|\matr{P}_{i}|}\sum\nolimits_{k=1}^{|\matr{P}_{i}|} p_k^c [{\vec{e}}_k^{P_i}, {\vec{c}}_k^{P_i}]
\end{align}
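The content probability, its cross-entropy loss, and the content-weighted answer representation can be sketched together as follows in NumPy; the layer sizes, random weights and the gold span are illustrative assumptions.

```python
# Sketch of the content model: p_k^c = sigmoid(w1 . ReLU(W2 v_k)), the
# averaged cross-entropy against 0/1 span labels, and the answer
# representation as a content-weighted average of [e_k, c_k] embeddings.
import numpy as np

rng = np.random.default_rng(3)
n_words, d_v, d_h, d_emb = 8, 6, 5, 10
v = rng.normal(size=(n_words, d_v))       # word representations v_k^{P_i}
w1 = rng.normal(size=d_h)
W2 = rng.normal(size=(d_h, d_v))

hidden = np.maximum(v @ W2.T, 0.0)        # ReLU(W2 v_k), shape (n_words, d_h)
p_c = 1.0 / (1.0 + np.exp(-hidden @ w1))  # content probabilities p_k^c

# Boundary labels turned into continuous 0/1 content labels (gold span [2, 5))
y_c = np.zeros(n_words)
y_c[2:5] = 1.0
loss_content = -np.mean(y_c * np.log(p_c) + (1 - y_c) * np.log(1 - p_c))

# Answer representation: content-weighted average of the input embeddings
emb = rng.normal(size=(n_words, d_emb))   # concatenated [e_k, c_k]
r_a = (p_c @ emb) / n_words               # r^{A_i}
print(p_c.shape, r_a.shape)
```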
Cross-Passage Answer Verification
The boundary model and the content model focus on extracting and modeling the answer within a single passage respectively, with little consideration of cross-passage information. However, as has been discussed in prior work, there could be multiple answer candidates from different passages, and some of them may mislead the MRC model into making an incorrect prediction. It is necessary to aggregate the information from different passages and choose the best answer from those candidates. Therefore, we propose a method that enables the answer candidates to exchange information and verify each other through a cross-passage answer verification process.
Given the representations of the answer candidates from all passages $`\{\vec{r}^{A_{i}}\}`$, each answer candidate attends to the other candidates to collect supportive information via an attention mechanism:
\begin{align}
s_{i, j} &=
\begin{cases}
0, & \text{if } i=j, \\
{\vec{r}^{A_i}}^{\intercal} \cdot \vec{r}^{A_j}, & \text{otherwise}
\end{cases}
\\
{\alpha}_{i, j} &= \textrm{exp} (s_{i, j}) / \sum\nolimits_{k=1}^{n} \textrm{exp} (s_{i, k}) \\
\vec{\tilde{r}}^{A_i} &= \sum\nolimits_{j=1}^{n} {\alpha}_{i, j}\vec{r}^{A_j}
\end{align}
Here $`\vec{\tilde{r}}^{A_{i}}`$ is the collected verification information from other passages based on the attention weights. Then we pass it together with the original representation $`\vec{r}^{A_{i}}`$ to a fully connected layer:
\begin{align}
g_{i}^v &= {\vec{w}^v}^{\intercal} [\vec{r}^{A_i}, \vec{\tilde{r}}^{A_i}, \vec{r}^{A_i} \odot \vec{\tilde{r}}^{A_i} ]
\end{align}
We further normalize these scores over all passages to get the verification score for answer candidate $`A_i`$:
\begin{equation}
p_i^v = \textrm{exp} (g_i^v) / \sum\nolimits_{j=1}^{n} \textrm{exp} (g_j^v)
\end{equation}
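The whole verification step can be sketched compactly in NumPy: self-similarity is masked to zero exactly as in the case-split above, each candidate attends to the others, the attended vector is fused with the original via a linear layer, and the scores are softmax-normalized across candidates. The number of candidates and all weights are illustrative assumptions.

```python
# Sketch of cross-passage answer verification: attention among candidate
# representations followed by a normalized verification score p_i^v.
import numpy as np

rng = np.random.default_rng(4)
n, d = 5, 6
r = rng.normal(size=(n, d))               # candidate representations r^{A_i}

S = r @ r.T                               # s_{i,j} = r^{A_i} . r^{A_j}
np.fill_diagonal(S, 0.0)                  # s_{i,i} = 0, as in the equation
e = np.exp(S - S.max(axis=1, keepdims=True))
alpha = e / e.sum(axis=1, keepdims=True)  # attention weights alpha_{i,j}
r_tilde = alpha @ r                       # collected verification info

# Fully connected layer over [r, r_tilde, r * r_tilde]
w_v = rng.normal(size=3 * d)
g = np.concatenate([r, r_tilde, r * r_tilde], axis=1) @ w_v
p_v = np.exp(g - g.max())
p_v /= p_v.sum()                          # verification scores p_i^v
print(p_v.shape, float(p_v.sum()))
```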
In order to train this verification model, we take the answer from the gold passage as the gold answer. The loss function can then be formulated as the negative log probability of the correct answer:
\begin{equation}
\mathcal{L}_{verify} = - \frac{1}{N} \sum_{i=1}^N \log p_{y_i^v}^{v}
\end{equation}
where $`y_i^v`$ is the index of the correct answer among all the answer candidates of the $`i^{th}`$ instance.
Joint Training and Prediction
As described above, we define three objectives for the reading comprehension model over multiple passages: 1. finding the boundary of the answer; 2. predicting whether each word should be included in the content; 3. selecting the best answer via cross-passage answer verification. According to our design, these three tasks can share the same embedding, encoding and matching layers. Therefore, we propose to train them together as multi-task learning. The joint objective function is formulated as follows:
\begin{equation}
\mathcal{L} = \mathcal{L}_{boundary} + \beta_{1} \mathcal{L}_{content} + \beta_{2} \mathcal{L}_{verify}
\end{equation}
where $`\beta_1`$ and $`\beta_2`$ are two hyper-parameters that control the weights of those tasks.
When predicting the final answer, we take the boundary score, content score and verification score into consideration. We first extract the answer candidate $`A_i`$ that has the maximum boundary score from each passage $`i`$. This boundary score is computed as the product of the start and end probabilities of the answer span. Then, for each answer candidate $`A_i`$, we average the content probabilities of all its words as the content score of $`A_i`$, and we predict the verification score for $`A_i`$ using the verification model. The final answer is then selected from all the answer candidates according to the product of these three scores.
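Putting the three scores together, the final selection reduces to an argmax over per-candidate products. The probabilities below are made-up numbers purely for illustration.

```python
# Sketch of final answer selection: combine boundary (start x end),
# content (mean content probability) and verification scores by product.
import numpy as np

candidates = [
    # (start_prob, end_prob, mean_content_prob, verification_prob)
    (0.30, 0.25, 0.60, 0.20),
    (0.40, 0.35, 0.70, 0.55),
    (0.20, 0.15, 0.50, 0.25),
]
scores = [s * e * c * v for s, e, c, v in candidates]
best = int(np.argmax(scores))
print(best)  # -> 1: the second candidate has the highest combined score
```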
Conclusion
In this paper, we propose an end-to-end framework to tackle the multi-passage MRC task. We creatively design three different modules in our model, which can find the answer boundary, model the answer content and conduct cross-passage answer verification respectively. All three modules can be trained with different forms of the answer labels, and training them jointly provides further improvement. The experimental results demonstrate that our model outperforms the baseline models by a large margin and achieves state-of-the-art performance on two challenging datasets, both of which are designed for MRC on real web data.