Contextualized Word Representations for Reading Comprehension

Base model

We build upon , who proposed the model. Given word-embedding inputs $`q_1,\dots,q_m`$ for the question and $`p_1,\dots,p_n`$ for the passage, each of dimension $`{d_w}`$, the model consists of the following components:

Passage-independent question representation

The question is encoded via a BiLSTM, $`\{v_1,\dots,v_m\}=\textsc{B\normalfont{i}LSTM}(q_1,\dots,q_m)`$, and the resulting hidden states are summarized via attention: $`q^{indep}=\sum_{j=1}^{m} \alpha_{j} v_j\in\mathbb{R}^{2d_h}`$. The attention coefficients $`\alpha`$ are normalized logits, $`\{\alpha_1,\dots,\alpha_m\}=\textsc{\normalfont{softmax}}(s_1,\dots,s_m)`$, where $`s_j=w_q^T \cdot \textsc{FF}(v_j)`$ for a parameter vector $`w_q\in\mathbb{R}^{d_f}`$ and a single-layer feed-forward network $`\textsc{FF}(\cdot)`$.
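As a rough sketch of this component (a NumPy stand-in in which random vectors replace the actual BiLSTM states, and all dimensions and the tanh nonlinearity are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical toy sizes: m question tokens, BiLSTM hidden size d_h
# (so each state is 2*d_h-dimensional), feed-forward output size d_f.
m, d_h, d_f = 5, 4, 3
V = rng.standard_normal((m, 2 * d_h))        # stand-ins for BiLSTM states v_1..v_m

W_ff = rng.standard_normal((d_f, 2 * d_h))   # single-layer feed-forward FF(.)
b_ff = rng.standard_normal(d_f)
w_q = rng.standard_normal(d_f)               # parameter vector w_q

s = np.array([w_q @ np.tanh(W_ff @ v + b_ff) for v in V])  # logits s_j
alpha = softmax(s)                                          # attention weights
q_indep = alpha @ V                                         # sum_j alpha_j * v_j
```

The result is a single vector summarizing the question independently of the passage, reused at every passage position below.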

Passage-aligned question representations

For each passage position $`i`$, the question is summarized via attention over its word embeddings: $`q_{i}^{align}=\sum_{j=1}^{m}\beta_{ij} q_j\in\mathbb{R}^{d_w}`$. The coefficients $`\beta_i`$ are produced by normalizing the logits $`\{s_{i1},\dots,s_{im}\}`$, where $`s_{ij}=\textsc{FF}(q_j)^T \cdot \textsc{FF}(p_i)`$.
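This alignment can be sketched in NumPy as a single batched matrix product (dimensions are toy assumptions; whether the two feed-forward networks share parameters is not specified in this excerpt, so a shared one is used here for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

m, n, d_w, d_f = 4, 6, 5, 3
Q = rng.standard_normal((m, d_w))  # question word embeddings q_1..q_m
P = rng.standard_normal((n, d_w))  # passage word embeddings p_1..p_n

W = rng.standard_normal((d_f, d_w))
b = rng.standard_normal(d_f)
ff = lambda X: np.tanh(X @ W.T + b)  # shared single-layer FF (an assumption)

S = ff(P) @ ff(Q).T          # s_ij = FF(q_j)^T . FF(p_i), shape (n, m)
B = softmax(S)               # each row beta_i normalizes over question positions
Q_align = B @ Q              # q_i^align for every passage position, shape (n, d_w)
```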

Augmented passage token representations

Each passage word-embedding $`p_i`$ is concatenated with its corresponding $`q_i^{align}`$ and with the passage-independent $`q^{indep}`$ to produce $`p_i^*=[p_i;q_i^{align};q^{indep}]`$, and a BiLSTM is run over the resulting vectors: $`\{h_1,\dots,h_n\}=\textsc{B\normalfont{i}LSTM}(p_1^*,\dots,p_n^*)`$.
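The concatenation step amounts to broadcasting the single $`q^{indep}`$ vector across all positions (a NumPy sketch with illustrative dimensions; the subsequent BiLSTM is omitted):

```python
import numpy as np

rng = np.random.default_rng(2)

n, d_w, d_h = 6, 5, 4
P = rng.standard_normal((n, d_w))        # passage embeddings p_i
Q_align = rng.standard_normal((n, d_w))  # q_i^align, one per passage position
q_indep = rng.standard_normal(2 * d_h)   # single passage-independent summary

# p_i* = [p_i ; q_i^align ; q^indep]; q_indep is tiled to every position
P_star = np.concatenate([P, Q_align, np.tile(q_indep, (n, 1))], axis=1)
# a BiLSTM over P_star would then produce the states h_1..h_n
```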

Span representations

A candidate answer span $`a=(l,r)`$ with $`l \leq r`$ is represented as the concatenation of the corresponding augmented passage representations: $`h_a^* = [h_l;h_r]`$. In order to avoid quadratic runtime, only spans up to length 30 are considered.
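Enumerating only bounded-length spans keeps the candidate set linear in passage length. A minimal sketch (toy length cap of 3 instead of the paper's 30, random stand-ins for the augmented states):

```python
import numpy as np

rng = np.random.default_rng(3)

n, d, max_len = 8, 4, 3          # toy sizes; the paper caps span length at 30
H = rng.standard_normal((n, d))  # stand-ins for augmented states h_1..h_n

# all spans (l, r) with l <= r and length at most max_len
spans = [(l, r) for l in range(n) for r in range(l, min(l + max_len, n))]
span_reps = np.stack([np.concatenate([H[l], H[r]]) for l, r in spans])  # [h_l; h_r]
```

With the cap, the number of candidates is at most `n * max_len` rather than the quadratic `n * (n + 1) / 2`.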

Prediction layer

Finally, each span representation $`h_a^*`$ is transformed to a logit $`s_{a} = w_c^T \cdot \textsc{FF}(h_a^*)`$ for a parameter vector $`w_c\in\mathbb{R}^{d_f}`$, and these logits are normalized to produce a distribution over spans. Learning is performed by maximizing the log-likelihood of the correct answer span.
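The scoring and training objective can be sketched as follows (dimensions, the tanh feed-forward, and the gold-span index are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k, d_rep, d_f = 21, 8, 5                  # k candidate spans of size d_rep
span_reps = rng.standard_normal((k, d_rep))

W = rng.standard_normal((d_f, d_rep))     # single-layer FF(.)
b = rng.standard_normal(d_f)
w_c = rng.standard_normal(d_f)            # parameter vector w_c

logits = np.tanh(span_reps @ W.T + b) @ w_c  # s_a = w_c . FF(h_a*)
p = softmax(logits)                          # distribution over candidate spans
gold = 7                                     # hypothetical index of the correct span
loss = -np.log(p[gold])                      # negative log-likelihood to minimize
```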

Acknowledgements

We thank the anonymous reviewers for their constructive comments. This work was supported by the Israel Science Foundation, grant 942/16, and by the Yandex Initiative in Machine Learning.

Introduction

Reading comprehension (RC) is a high-level task in natural language understanding that requires reading a document and answering questions about its content. RC has attracted substantial attention over the last few years with the advent of large annotated datasets, computing resources, and neural network models and optimization procedures.

Reading comprehension models must invariably represent word tokens contextually, as a function of their encompassing sequence (document or question). The vast majority of RC systems encode contextualized representations of words in both the document and question as hidden states of bidirectional RNNs, and focus model design and capacity around question-document interaction, carrying out calculations where information from both is available.

Analysis of current RC models has shown that models tend to react to simple word-matching between the question and document, as well as benefit from explicitly providing matching information in model inputs. In this work, we hypothesize that the still-relatively-small size of RC datasets drives this behavior, which leads to models that make limited use of context when representing word tokens.

To illustrate this idea, we take a model that carries out only basic question-document interaction and prepend to it a module that produces token embeddings by explicitly gating between contextual and non-contextual representations (for both the document and question). This simple addition already places the model’s performance on par with recent work, and allows us to demonstrate the importance of context.
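One generic form such a gate can take (a sketch under the assumption of a standard element-wise sigmoid gate; the paper's exact parameterization is not given in this excerpt) mixes the two representations per dimension:

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 6
w_emb = rng.standard_normal(d)  # non-contextual (word-level) representation
c_emb = rng.standard_normal(d)  # contextual representation (e.g. an RNN state)

W_g = rng.standard_normal((d, 2 * d))  # gate parameters (illustrative)
b_g = rng.standard_normal(d)

g = sigmoid(W_g @ np.concatenate([w_emb, c_emb]) + b_g)  # element-wise gate in (0, 1)
token_rep = g * c_emb + (1.0 - g) * w_emb                # gated mixture of the two
```

The gate values expose, per dimension, how much the model relies on context versus the standalone word embedding.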

Motivated by these findings, we turn to a semi-supervised setting in which we leverage a language model, pre-trained on large amounts of data, as a sequence encoder which forcibly facilitates context utilization. We find that model performance substantially improves, reaching accuracy comparable to state-of-the-art on the competitive dataset, showing that contextual word representations captured by the language model are beneficial for reading comprehension. 1


  1. Our complete code base is available at http://github.com/shimisalant/CWR↩︎