Tag-Enhanced Tree-Structured Neural Networks for Implicit Discourse Relation Classification
Our Method
This section details the models we use for implicit discourse relation classification. Given two textual arguments without explicit connectives, the task is to classify the discourse relation between them. It can be divided into two parts: 1) modeling the semantics of the two arguments; 2) classifying the relation based on those semantics. Our main contribution lies in the semantic modeling part, for which we describe two types of tree-structured neural networks and then illustrate how constituent tags can be leveraged to enhance both models. We also briefly introduce the relation classifier and the training procedure of our model. The overall architecture of our system is illustrated in the accompanying figure.
Modeling the Arguments with Tree-Structured Neural Networks
In a typical tree-structured neural network, given a parse tree of the text, the semantic representations of smaller text units are recursively composed to compute the representations of larger text spans, and finally a representation of the whole text (e.g., a sentence). In this work, we construct our models based on the constituency parse tree, as illustrated in the accompanying figure. Following previous convention, we convert the general parse tree, whose branching factor may be arbitrary, into a binary tree so that we only need to consider a left and a right child at each step. The following Tree-LSTM and Tree-GRU models can then be used to obtain a vector representation of each argument.
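The binarization step can be sketched as follows. This is an illustrative left-branching conversion, not the authors' code; the `label*` naming for intermediate nodes is an assumption for the example:

```python
# Illustrative sketch of binarizing an n-ary constituency tree.
# A tree is either a token string (leaf) or a (label, children) pair.

def binarize(tree):
    """Left-branch so every internal node has at most two children."""
    if isinstance(tree, str):          # leaf: a word token
        return tree
    label, children = tree
    children = [binarize(c) for c in children]
    # Fold the leftmost pair under an intermediate "label*" node
    # until only two children remain.
    while len(children) > 2:
        children = [(label + "*", children[:2])] + children[2:]
    return (label, children)

# Example: a flat 4-child NP becomes a left-branching binary tree.
tree = binarize(("NP", ["the", "big", "red", "dog"]))
```

After this conversion, every composition step in the models below only has to combine a left and a right child.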
Tree-LSTM Model.
In a standard sequential LSTM model, the LSTM unit is repeated at each step, taking the word at the current step and the previous output as its input, updating its memory cell, and outputting a new hidden vector. In the Tree-LSTM model, a similar LSTM unit is applied to each node of the tree in a bottom-up manner. Since each internal node in the binary parse tree has two children, the Tree-LSTM unit has to consider information from two preceding nodes, as opposed to the single preceding node in the sequential LSTM model. Each Tree-LSTM unit (indexed by $`j`$) contains an input gate $`i_j`$, a forget gate $`f_j`$ 1 and an output gate $`o_j`$. The computation at node $`j`$ is as follows:
\begin{align}
i_j &= \sigma \left( W^{ \left( i \right)} x_j + U^{\left( i \right)} \left[ h_j^L, h_j^R \right] \right) \\
f_j &= \sigma \left( W^{ \left( f \right)} x_j + U^{\left( f \right)} \left[ h_j^L, h_j^R \right] \right) \\
o_j &= \sigma \left( W^{ \left( o \right)} x_j + U^{\left( o \right)} \left[ h_j^L, h_j^R \right] \right) \\
u_j &= \tanh \left( W^{ \left( u \right)} x_j + U^{\left( u \right)} \left[ h_j^L, h_j^R \right] \right) \\
c_j &= i_j \odot u_j + f_j \odot c_j^L + f_j \odot c_j^R \\
h_j &= o_j \odot \tanh \left(c_j\right)
\end{align}
where $`x_j`$ is the embedded word input at the current node $`j`$, $`\sigma`$ denotes the logistic sigmoid function and $`\odot`$ denotes element-wise multiplication. $`h_j^L`$, $`h_j^R`$ are the output hidden vectors of the left and right children, and $`c_j^L`$, $`c_j^R`$ are their memory cell states, respectively. To save space, we leave out all the bias terms in affine transformations, and the same holds for the other affine transformations in this paper.
Intuitively, $`u_j`$ can be regarded as a summary of the inputs at the current node, which is then filtered by $`i_j`$. The memories from the left and right children are modulated by $`f_j`$ and then composed with the new input to form the new memory $`c_j`$. Finally, part of the information in memory $`c_j`$ is exposed by $`o_j`$ to generate the output vector $`h_j`$ for the current step. Note also that only leaf nodes in the constituency tree have words as their input, so $`x_j`$ is set to a zero vector at all other nodes.
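As a concrete sketch of the equations above (not the authors' implementation), a single Tree-LSTM composition step with one shared forget gate can be written in NumPy as follows; the dimensionality and random initialization are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # hidden/input dimensionality (illustrative)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One (W, U) pair per gate i, f, o and candidate u; biases omitted,
# matching the paper's equations.
P = {g: (rng.normal(0.0, 0.1, (D, D)), rng.normal(0.0, 0.1, (D, 2 * D)))
     for g in "ifou"}

def tree_lstm_unit(x, hL, cL, hR, cR):
    """Compose the left/right children's (h, c) states with input x at node j."""
    h2 = np.concatenate([hL, hR])                # [h_j^L, h_j^R]
    i = sigmoid(P["i"][0] @ x + P["i"][1] @ h2)  # input gate
    f = sigmoid(P["f"][0] @ x + P["f"][1] @ h2)  # single shared forget gate
    o = sigmoid(P["o"][0] @ x + P["o"][1] @ h2)  # output gate
    u = np.tanh(P["u"][0] @ x + P["u"][1] @ h2)  # candidate summary
    c = i * u + f * cL + f * cR                  # new memory cell
    h = o * np.tanh(c)                           # new hidden state
    return h, c
```

At a leaf, `x` would be the word embedding and the child states zero vectors; at internal nodes, `x` is a zero vector and the child states come from the two subtrees.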
Tree-GRU Model.
Similar to Tree-LSTM, the Tree-GRU model extends the sequential GRU model to tree structures. The only difference between Tree-GRU and Tree-LSTM is how they modulate the flow of information inside the unit. Specifically, the Tree-GRU unit removes the separate memory cell and uses only two gates to simulate the reset and update procedures in information gathering. The computation in each Tree-GRU unit is the following:
\begin{align}
r_j &= \sigma \left( W^{\left( r \right)} x_j + U^{\left( r \right)} \left[ h_j^L, h_j^R \right] \right) \\
z_j &= \sigma \left( W^{\left( z \right)} x_j + U^{\left( z \right)} \left[ h_j^L, h_j^R \right] \right) \\
\tilde{h}_j &= \tanh \left( W^{\left( h \right)} x_j + U^{\left( h \right)} \left[ h_j^L \odot r_j, h_j^R \odot r_j \right] \right) \\
h_j &= z_j \odot \tilde{h}_j + \left( 1 - z_j \right) \odot \left( h_j^{L} + h_j^{R} \right)
\end{align}
where $`r_j`$ is the reset gate and $`z_j`$ is the update gate. The reset gate allows the network to forget previously computed representations, while the update gate decides the degree to which the hidden state is updated. There is no memory cell in Tree-GRU; only $`h_j^L`$ and $`h_j^R`$ serve as the hidden states from the left and right children.
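The Tree-GRU step can be sketched in the same style (again an illustrative example, not the authors' code); note how the reset gate scales the children's states inside the candidate, and how the update gate interpolates between the candidate and the sum of the children:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4  # hidden/input dimensionality (illustrative)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One (W, U) pair per gate r, z and candidate h; biases omitted.
Q = {g: (rng.normal(0.0, 0.1, (D, D)), rng.normal(0.0, 0.1, (D, 2 * D)))
     for g in "rzh"}

def tree_gru_unit(x, hL, hR):
    """Compose the children's hidden states with input x at node j."""
    h2 = np.concatenate([hL, hR])
    r = sigmoid(Q["r"][0] @ x + Q["r"][1] @ h2)   # reset gate
    z = sigmoid(Q["z"][0] @ x + Q["z"][1] @ h2)   # update gate
    h_tilde = np.tanh(Q["h"][0] @ x
                      + Q["h"][1] @ np.concatenate([hL * r, hR * r]))
    # No separate memory cell: interpolate candidate and children's states.
    return z * h_tilde + (1.0 - z) * (hL + hR)
```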
Controlling the Semantic Composition with Constituent Tags
The constituent tag in a parse tree describes the grammatical role of its corresponding constituent in context. The tagset includes several types of constituent tags: clause-level tags (e.g., SBAR, SINV, SQ), phrase-level tags (e.g., NP, VP, PP) and word-level tags (e.g., NN, VB, JJ). These constituent tags interact closely with the semantics and can in some cases provide decisive signals about the importance of a constituent. For example, in most cases constituents with the PP (prepositional phrase) tag are less important than those with the VP (verb phrase) tag. We therefore argue that these tags are worth considering when composing semantics in tree-structured neural networks.
One way to leverage such tags is to use tag-specific composition functions, but that would lead to a large number of parameters, and since some tags are very sparse, it would be hard to train their corresponding parameters sufficiently. To address this problem, we propose to use tag embeddings and dynamically control the composition process via the gates in our models.
Gates in Tree-LSTM and Tree-GRU units control the flow of information and thus determine how the semantics from child nodes are composed into a new representation. Furthermore, these gates are computed dynamically from the inputs at each step. It is therefore natural to incorporate the tag embeddings into the computation of these gates. Based on this idea, we propose the Tag-Enhanced Tree-LSTM model, where the input, forget and output gates in each unit are calculated as follows:
\begin{align}
i_j &= \sigma \left( W^{ \left( i \right)} x_j + M^{\left( i \right)} t_j + U^{\left( i \right)} \left[ h_j^L, h_j^R \right]\right) \\
f_j &= \sigma \left( W^{ \left( f \right)} x_j + M^{\left( f \right)} t_j + U^{\left( f \right)} \left[ h_j^L, h_j^R \right]\right) \\
o_j &= \sigma \left( W^{ \left( o \right)} x_j + M^{\left( o \right)} t_j + U^{\left( o \right)} \left[ h_j^L, h_j^R \right]\right)
\end{align}
Similarly, we can have the Tag-Enhanced Tree-GRU model with new reset and update gates:
\begin{align}
r_j &= \sigma \left( W^{\left( r \right)} x_j + M^{\left( r \right)} t_j + U^{\left( r \right)} \left[ h_j^L, h_j^R \right]\right) \\
z_j &= \sigma \left( W^{\left( z \right)} x_j + M^{\left( z \right)} t_j + U^{\left( z \right)} \left[ h_j^L, h_j^R \right] \right)
\end{align}
where $`t_j`$ is the embedding of the constituent tag at the current node (indexed by $`j`$).
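The tag-enhanced gating amounts to adding one extra term, $`M t_j`$, to each gate's pre-activation. A minimal sketch of one such gate, with an assumed tagset size and tag-embedding dimension chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
D, T = 4, 3   # hidden size and tag-embedding size (illustrative)
NUM_TAGS = 8  # size of the constituent tagset (illustrative)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W = rng.normal(0.0, 0.1, (D, D))       # input projection
M = rng.normal(0.0, 0.1, (D, T))       # new tag projection
U = rng.normal(0.0, 0.1, (D, 2 * D))   # children projection
tag_emb = rng.normal(0.0, 0.1, (NUM_TAGS, T))  # one learned vector per tag

def tag_enhanced_gate(x, tag_id, hL, hR):
    """A gate whose pre-activation also includes the node's tag embedding."""
    t = tag_emb[tag_id]
    return sigmoid(W @ x + M @ t + U @ np.concatenate([hL, hR]))
```

Because all tags share the same projection `M` and differ only in their (low-dimensional) embeddings, sparse tags need far fewer dedicated parameters than tag-specific composition functions would require.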
Relation Classification and Training
In our work, the two arguments are encoded with the same network in order to reduce the number of parameters. This yields a vector representation for each argument, denoted $`r_1`$ for argument 1 and $`r_2`$ for argument 2. Supposing there are $`n`$ relation types in total, the predicted probability distribution $`\hat{y}\in\mathbb{R}^n`$ is calculated as:
\begin{equation}
\hat{y} = \mathrm{softmax}\left(W^{\left( \hat{y} \right)}\left[ r_1, r_2\right] + b^{\left( \hat{y} \right)}\right)
\end{equation}
To train our model, the training objective $`J`$ is defined as the cross-entropy loss with $`L2`$ regularization:
\begin{align}
E\left(\hat{y}^{(k)}, y^{(k)}\right) &= - \sum_{j=1}^{n} y^{(k)}_j \log \hat{y}^{(k)}_j \\
J\left( \theta \right) &= \frac{1}{N} \sum_{k=1}^{N} E\left(\hat{y}^{(k)}, y^{(k)}\right) + \frac{\lambda}{2} {\|\theta\|}^2
\end{align}
where $`\hat{y}^{(k)}`$ is the predicted probability distribution for the $`k`$-th training sample, $`y^{(k)}`$ is the one-hot representation of its gold label, and $`N`$ is the number of training samples.
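The classifier and per-sample loss can be sketched as follows. This is an illustrative example rather than the authors' code; sizes and initialization are arbitrary, and the one-hot dot product in the cross-entropy reduces to picking out the gold class's log-probability:

```python
import numpy as np

rng = np.random.default_rng(3)
D, n = 4, 3  # argument-vector size and number of relation types (illustrative)

Wy = rng.normal(0.0, 0.1, (n, 2 * D))  # classifier weights
by = np.zeros(n)                       # classifier bias

def softmax(v):
    e = np.exp(v - v.max())  # shift by the max for numerical stability
    return e / e.sum()

def predict(r1, r2):
    """Probability distribution over the n relation types."""
    return softmax(Wy @ np.concatenate([r1, r2]) + by)

def cross_entropy(y_hat, gold):
    """One-hot y selects the gold class's negative log-probability."""
    return -np.log(y_hat[gold])

# Two (randomly drawn) argument vectors stand in for the tree encoders' output.
y_hat = predict(rng.normal(size=D), rng.normal(size=D))
```

In training, the loss would be averaged over all samples and an $`L2`$ penalty on the parameters added, as in the objective above.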
---
The original binary Tree-LSTM formulation contains separate forget gates for the two child nodes, but we find that a single forget gate performs better in our task. ↩︎