Adversarial Connective-exploiting Networks for Implicit Discourse Relation Classification

Lianhui Qin¹, Zhisong Zhang¹, Hai Zhao¹, Zhiting Hu², Eric P. Xing²
¹Shanghai Jiao Tong University, ²Carnegie Mellon University
{qinlianhui, zzs2011}@sjtu.edu.cn, zhaohai@cs.sjtu.edu.cn, {zhitingh, epxing}@cs.cmu.edu

Abstract

Implicit discourse relation classification is of great challenge due to the lack of connectives as strong linguistic cues, which motivates the use of annotated implicit connectives to improve the recognition. We propose a feature imitation framework in which an implicit relation network is driven to learn from another neural network with access to connectives, and thus encouraged to extract similarly salient features for accurate classification. We develop an adversarial model to enable an adaptive imitation scheme through competition between the implicit network and a rival feature discriminator. Our method effectively transfers the discriminability of connectives to the implicit features, and achieves state-of-the-art performance on the PDTB benchmark.

1 Introduction

Discourse relations connect linguistic units such as clauses and sentences to form coherent semantics. Identification of discourse relations can benefit a variety of downstream applications including question answering (Liakata et al., 2013), machine translation (Li et al., 2014), text summarization (Gerani et al., 2014), and so forth.

Connectives (e.g., but, so, etc.) are among the most critical linguistic cues for identifying discourse relations. When explicit connectives are present in the text, a simple frequency-based mapping is sufficient to achieve over 85% classification accuracy (Xue et al., 2016). In contrast, implicit discourse relation recognition has long been seen as a challenging problem, with the best accuracy so far still lower than 50%. In the implicit case, discourse relations are not lexicalized by connectives, but must be inferred from the relevant sentences (i.e., arguments). For example, the following two adjacent sentences Arg1 and Arg2 imply the relation Cause (i.e., Arg2 is the cause of Arg1).

[Arg1]: Never mind.
[Arg2]: You already know the answer.
[Implicit connective]: Because
[Discourse relation]: Cause

Various attempts have been made to directly infer underlying relations by modeling the semantics of the arguments, ranging from feature-based methods (Lin et al., 2009; Pitler et al., 2009) to very recent end-to-end neural models (Chen et al., 2016a; Qin et al., 2016c). Despite impressive performance, the absence of strong explicit connective cues makes the inference extremely hard and has hindered further improvement. In fact, even human annotators make use of connectives to aid relation annotation. For instance, the popular Penn Discourse Treebank (PDTB) benchmark data (Prasad et al., 2008) was annotated by first manually inserting a connective expression (i.e., an implicit connective, as shown in the above example), and then determining the abstract relation by combining both the implicit connective and the contextual semantics.

Therefore, the huge performance gap between explicit and implicit parsing (namely, 85% vs 50%), as well as the human annotation practice, strongly motivates incorporating connective information to guide the reasoning process.
This paper aims to advance implicit parsing by making use of the annotated implicit connectives available in training data. Little recent work has explored such a combination. Zhou et al. (2010) developed a two-step approach that first predicts implicit connectives, whose sense is then disambiguated to obtain the relation. However, the pipeline approach usually suffers from error propagation, and the method itself relies on hand-crafted features which do not necessarily generalize well. Other research has leveraged explicit connective examples for data augmentation (Rutherford and Xue, 2015; Braud and Denis, 2015; Ji et al., 2015; Braud and Denis, 2016). Our work is orthogonal and complementary to this line.

In this paper, we propose a novel neural method that incorporates implicit connectives in a principled adversarial framework. We use deep neural models for relation classification, guided by the intuition that sentence arguments integrated with connectives enable highly discriminative neural features for accurate relation inference, and that an ideal implicit relation classifier, even without access to connectives, should mimic the connective-augmented reasoning behavior by extracting similarly salient features. We therefore set up a secondary network in addition to the implicit relation classifier, built upon connective-augmented inputs and serving as a feature learning model for the implicit classifier to emulate.

Methodologically, however, feature imitation in our problem is challenging due to the semantic gap induced by adding the connective cues. It is necessary to develop an adaptive scheme to flexibly drive learning and transfer discriminability. We devise a novel adversarial approach which enables a self-calibrated imitation mechanism. Specifically, we build a discriminator which distinguishes between the features produced by the two counterpart networks. The implicit relation network is then trained to correctly classify relations and simultaneously to fool the discriminator, resulting in an adversarial framework. The adversarial mechanism has been an emerging method in various contexts, especially image generation (Goodfellow et al., 2014) and domain adaptation (Ganin et al., 2016; Chen et al., 2016c). Our adversarial framework is unique in addressing neural feature emulation between two models. Moreover, to the best of our knowledge, this is the first adversarial approach in the context of discourse parsing. Compared to previous connective-exploiting work (Zhou et al., 2010; Xu et al., 2012), our method provides a new integration paradigm and an end-to-end procedure that avoids inefficient feature engineering and error propagation.

We evaluate our method on the PDTB 2.0 benchmark in a variety of experimental settings. The proposed adversarial model greatly improves over standalone neural networks and previous best-performing approaches. We also demonstrate that our implicit recognition network successfully imitates and extracts crucial hidden representations.

We begin by briefly reviewing related work in section 2. Section 3 presents the proposed adversarial model. Section 4 shows substantially improved experimental results over previous methods. Section 5 discusses extensions and future work.
2 Related Work

2.1 Implicit Discourse Relation Recognition

There has been a surge of interest in implicit discourse parsing since the release of PDTB (Prasad et al., 2008), the first large discourse corpus distinguishing implicit examples from explicit ones. A large body of work has focused on direct classification based on the observed sentences, including structured methods with linguistically-informed features (Lin et al., 2009; Pitler et al., 2009; Zhou et al., 2010), end-to-end neural models (Qin et al., 2016b,c; Chen et al., 2016a; Liu and Li, 2016), and combined approaches (Ji and Eisenstein, 2015; Ji et al., 2016). However, the lack of connective cues makes learning purely from contextual semantics highly challenging.

Prior work has attempted to leverage connective information. Zhou et al. (2010) also incorporate implicit connectives, but in a pipeline manner, by first predicting the implicit connective with a language model and then determining the discourse relation accordingly. Instead of treating implicit connectives as intermediate prediction targets, which can suffer from error propagation, we use the connectives to induce highly discriminative features that guide the learning of an implicit network, serving as an adaptive regularization mechanism for enhanced robustness and generalization. Our framework is also end-to-end, avoiding costly feature engineering. Another notable line aims at adapting explicit examples for data synthesis (Biran and McKeown, 2013; Rutherford and Xue, 2015; Braud and Denis, 2015; Ji et al., 2015), multi-task learning (Lan et al., 2013; Liu et al., 2016), and word representation (Braud and Denis, 2016). Our work is orthogonal and complementary to these methods, as we use the implicit connectives that have been annotated for implicit examples.

Figure 1: Architecture of the proposed method. The framework contains three main components: 1) an implicit relation network i-CNN over raw sentence arguments, 2) a connective-augmented relation network a-CNN whose inputs are augmented with implicit connectives, and 3) a discriminator distinguishing between the features produced by the two networks. The features are fed to the final classifier for relation classification. The discriminator and i-CNN form an adversarial pair for feature imitation. At test time, the implicit network i-CNN with the classifier is used for prediction.

2.2 Adversarial Networks

The adversarial method has achieved impressive success in deep generative modeling (Goodfellow et al., 2014) and domain adaptation (Ganin et al., 2016). Generative adversarial nets (Goodfellow et al., 2014) learn to produce realistic images through competition between an image generator and a real/fake discriminator. Professor forcing (Lamb et al., 2016) applies a similar idea to improve long-term generation of a recurrent neural language model. Other approaches (Chen et al., 2016b; Hu et al., 2017) extend the framework for controllable image/text generation. Li et al. (2015) and Salimans et al. (2016) propose feature matching, which trains generators to match the statistics of real/fake examples. Their features are extracted by the discriminator rather than by the classifier networks as in our case.
Our work differs from the above since we consider the context of discriminative modeling. Adversarial domain adaptation forces a neural network to learn domain-invariant features using a classifier that distinguishes the domain of the network's input data based on the hidden feature. Our adversarial framework is distinct in that, besides the implicit relation network, we construct a second neural network serving as a teacher model for feature emulation.

To the best of our knowledge, this is the first work to employ the idea of adversarial learning in the context of discourse parsing. We propose a novel connective-exploiting scheme based on feature imitation, and to this end derive a new adversarial framework, achieving substantial performance gains over existing methods. The proposed approach is generally applicable to other tasks for utilizing any indicative side information. We give more discussion in section 5.

3 Adversarial Method

Discourse connectives are key indicators of discourse relations. In the annotation procedure of the PDTB implicit relation benchmark, annotators inserted implicit connective expressions between adjacent sentences to lexicalize abstract relations and help with final decisions. Our model aims at making full use of the provided implicit connectives at training time to regulate learning of the implicit relation recognizer, encouraging extraction of highly discriminative semantics from raw arguments, and improving generalization at test time. Our method provides a novel adversarial framework that leverages connective information in a flexible, adaptive manner, and is efficiently trained end-to-end through standard back-propagation.

The basic idea of the proposed approach is simple. We want our implicit relation recognizer, which predicts the underlying relation of sentence arguments without a discourse connective, to have prediction behaviors close to those of a connective-augmented relation recognizer, which is provided with a discourse connective in addition to the arguments. The connective-augmented recognizer is analogous to an annotator aided by connectives, as in the human annotation process, and the implicit recognizer is improved by learning from such an "informed" annotator. Specifically, we want the latent features extracted by the two models to match as closely as possible, which explicitly transfers the discriminability of the connective-augmented representations to the implicit ones.

To this end, instead of manually selecting a closeness metric, we take advantage of the adversarial framework by constructing a two-player zero-sum game between the implicit recognizer and a rival discriminator. The discriminator attempts to distinguish between the features extracted by the two relation models, while the implicit relation model is trained to maximize the accuracy on implicit data and at the same time to confuse the discriminator.

In the following we first present the overall architecture of the proposed approach (section 3.1), then develop the training procedure (section 3.2). The components are realized as deep (convolutional) neural networks, with detailed modeling choices discussed in section 3.3.

3.1 Model Architecture

Let $(x, y)$ be a pair of input and output of implicit relation classification, where $x = (x_1, x_2)$ is a pair of sentence arguments and $y$ is the underlying discourse relation.
Each training example also includes an annotated implicit connective $c$ that best expresses the relation. Figure 1 shows the architecture of our framework.

The neural model for implicit relation classification (i-CNN in the figure) extracts a latent representation from the arguments, denoted as $H_I(x_1, x_2)$, and feeds the feature into a classifier $C$ for the final prediction $C(H_I(x_1, x_2))$. For ease of notation, we will also use $H_I(x)$ to denote the latent feature on data $x$.

The second relation network (a-CNN) takes as input the sentence arguments along with an implicit connective, inducing the connective-augmented representation $H_A(x_1, x_2, c)$ and yielding the relation prediction $C(H_A(x_1, x_2, c))$. Note that the same final classifier $C$ is used for both networks, so that the feature representations of the two networks are ensured to lie within the same semantic space, enabling the feature emulation presented shortly.

We further pair the implicit network with a rival discriminator $D$ to form our adversarial game. The discriminator's role is to differentiate between the reasoning behaviors of the implicit network i-CNN and the augmented network a-CNN. Specifically, $D$ is a binary classifier that takes as input a latent feature $H$ derived from either i-CNN or a-CNN given appropriate data (where implicit connectives are missing or present, respectively). The output $D(H)$ estimates the probability that $H$ comes from the connective-augmented a-CNN rather than from i-CNN.

3.2 Training Procedure

The system is trained through an alternating optimization procedure that updates the components in an interleaved manner. In this section, we first present the training objective for each component, and then give the overall training algorithm.

Let $\theta_D$ denote the parameters of the discriminator. The training objective of $D$ is straightforward, i.e., to maximize the probability of correctly distinguishing the input features:

$$\max_{\theta_D} \mathcal{L}_D = \mathbb{E}_{(x,c,y)\sim \text{data}}\Big[\log D\big(H_A(x,c);\theta_D\big) + \log\big(1 - D(H_I(x);\theta_D)\big)\Big], \tag{1}$$

where $\mathbb{E}_{(x,c,y)\sim \text{data}}[\cdot]$ denotes the expectation with respect to the data distribution.

We denote the parameters of the implicit network i-CNN and the classifier $C$ as $\theta_I$ and $\theta_C$, respectively. The model is then trained to (a) correctly classify relations in the training data and (b) produce salient features close to the connective-augmented ones. The first objective is fulfilled by minimizing the usual cross-entropy loss:

$$\mathcal{L}_{I,C}(\theta_I, \theta_C) = \mathbb{E}_{(x,y)\sim \text{data}}\Big[J\big(C(H_I(x;\theta_I);\theta_C),\, y\big)\Big], \tag{2}$$

where $J(p, y) = -\sum_k \mathbb{1}(y = k)\log p_k$ is the cross-entropy loss between the predictive distribution $p$ and the ground-truth label $y$. We achieve objective (b) by minimizing the discriminator's chance of correctly telling apart the features:

$$\mathcal{L}_I(\theta_I) = \mathbb{E}_{x\sim \text{data}}\Big[\log\big(1 - D(H_I(x;\theta_I))\big)\Big]. \tag{3}$$

The parameters of the augmented network a-CNN, denoted as $\theta_A$, can be learned by simply fitting to the data, i.e., minimizing the cross-entropy loss as follows:

$$\mathcal{L}_A(\theta_A) = \mathbb{E}_{(x,c,y)\sim \text{data}}\Big[J\big(C(H_A(x,c;\theta_A)),\, y\big)\Big]. \tag{4}$$
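To make the objectives concrete, the sketch below writes Eqs. (1)-(4) as loss functions. This is a minimal illustration assuming PyTorch (the authors implemented their system in TensorFlow); `disc`, `clf`, `h_a`, and `h_i` are hypothetical stand-ins for $D$, $C$, $H_A(x,c)$, and $H_I(x)$, not names from the paper.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, h_a, h_i):
    # Eq. (1), negated so it can be minimized by gradient descent:
    # D should score connective-augmented features high, implicit ones low.
    eps = 1e-8
    return -(torch.log(disc(h_a) + eps).mean()
             + torch.log(1.0 - disc(h_i) + eps).mean())

def classification_loss(clf, h, y):
    # Eqs. (2)/(4): cross-entropy J(C(H), y); clf returns logits and
    # F.cross_entropy applies the log-softmax internally.
    return F.cross_entropy(clf(h), y)

def adversarial_loss(disc, h_i):
    # Eq. (3): minimized by the implicit network so as to fool D.
    eps = 1e-8
    return torch.log(1.0 - disc(h_i) + eps).mean()
```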
Algorithm 1 Adversarial Model for Implicit Recognition
Input: training data $\{(x, c, y)_n\}$
Parameters: $\lambda_1, \lambda_2$ -- balancing parameters
1: Initialize $\{\theta_I, \theta_C\}$ and $\{\theta_A\}$ by minimizing Eq.(2) and Eq.(4), respectively
2: repeat
3:   Train the discriminator through Eq.(1)
4:   Train the relation models through Eq.(5)
5: until convergence
Output: adversarially enhanced implicit relation network i-CNN with classifier $C$ for prediction

As mentioned above, here we use the same classifier $C$ as for the implicit network, forcing a unified feature space for both networks. We combine the above objectives Eqs.(2)-(4) of the relation classifiers and minimize the joint loss:

$$\min_{\theta_I, \theta_A, \theta_C} \mathcal{L}_{I,A,C} = \mathcal{L}_{I,C}(\theta_I, \theta_C) + \lambda_1 \mathcal{L}_I(\theta_I) + \lambda_2 \mathcal{L}_A(\theta_A), \tag{5}$$

where $\lambda_1$ and $\lambda_2$ are two balancing parameters calibrating the weights of the classification losses and the feature-regulating loss. In practice, we pre-train the implicit and augmented networks independently by minimizing Eq.(2) and Eq.(4), respectively. In the adversarial training process, we found that setting $\lambda_2 = 0$ gives stable convergence. That is, the connective-augmented features are fixed after the pre-training stage.

Algorithm 1 summarizes the training procedure, where we interleave the optimization of Eq.(1) and Eq.(5) at each iteration. More practical details are provided in section 4. We instantiate all modules as neural networks (section 3.3), which are differentiable, and perform the optimization efficiently through standard stochastic gradient descent and back-propagation.

Through Eq.(1) and Eq.(3), the discriminator and the implicit relation network follow a minimax competition, which drives both to improve until the implicit feature representations are close to the connective-augmented latent representations, encouraging the implicit network to extract highly discriminative features from raw sentence arguments for relation classification. Alternatively, we can view Eq.(3) as an adaptive regularization on the implicit model which, compared to pre-fixed regularizers such as $\ell_2$-regularization, provides a more flexible, self-calibrated mechanism to improve generalization ability.
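The interleaved updates of Algorithm 1 can be sketched as follows, reusing the loss helpers above. The AdaGrad optimizer, learning rate 0.001, and $\lambda_1 = 0.1$, $\lambda_2 = 0$ follow Section 4.1; the module names (`i_cnn`, `a_cnn`, `disc`, `clf`), the data loader, and the exact batching are illustrative assumptions, not the authors' code.

```python
# A sketch of one training epoch of Algorithm 1, assuming PyTorch and
# a `loader` yielding (x1, x2, conn, y) mini-batches.
import torch

lambda1, lambda2 = 0.1, 0.0   # settings reported in Section 4.1

opt_d   = torch.optim.Adagrad(disc.parameters(), lr=0.001)
opt_rel = torch.optim.Adagrad(list(i_cnn.parameters()) + list(clf.parameters()),
                              lr=0.001)

for x1, x2, conn, y in loader:
    # -- step 3: update the discriminator via Eq. (1) --
    with torch.no_grad():            # features only; no gradients to the CNNs
        h_a = a_cnn(x1, x2, conn)    # a-CNN is frozen after pre-training
        h_i = i_cnn(x1, x2)
    opt_d.zero_grad()
    discriminator_loss(disc, h_a, h_i).backward()
    opt_d.step()

    # -- step 4: update the relation models via Eq. (5) --
    opt_rel.zero_grad()
    h_i = i_cnn(x1, x2)
    loss = (classification_loss(clf, h_i, y)
            + lambda1 * adversarial_loss(disc, h_i))
    # lambda2 = 0: the a-CNN term of Eq. (5) drops out entirely
    loss.backward()
    opt_rel.step()
```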
3.3 Component Structures

We have presented our adversarial framework for implicit relation classification. We now discuss the model realization of each component. All components of the framework are parameterized with neural networks. The distinct roles of the modules in the framework lead to different modeling choices.

Relation Classification Networks Figure 2 illustrates the structure of the implicit relation network i-CNN. We use a convolutional network, a common architectural choice for discourse parsing. The network takes as input the word vectors of the tokens in each sentence argument, and maps each argument to intermediate features through a shared convolutional layer. The resulting representations are then concatenated and fed into a max pooling layer to select the most salient features as the final representation. The final classifier $C$ is a simple fully-connected layer followed by a softmax classifier.

Figure 2: Neural structure of i-CNN. Two sets of convolutional filters are shown, with the corresponding features in red and blue, respectively. The weights of the filters on the two input arguments are tied.

The connective-augmented network a-CNN has a structure similar to i-CNN, with the implicit connective appended to the second sentence as input. The key difference from i-CNN is that here we adopt average k-max pooling, which takes the average of the top-k maximum values in each pooling window. The reason is to prevent the network from solely selecting the connective-induced features (which are typically the most salient ones), as would happen with max pooling, and instead force it to also attend to contextual features derived from the arguments. This yields more homogeneous output features across the two networks, and thus facilitates feature imitation. In all the experiments we fixed k = 2.
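A minimal sketch of this pooling operation, assuming PyTorch and global pooling over the time axis of the convolutional feature maps (the paper does not spell out the window layout, so treating the whole sequence as one window is our assumption):

```python
import torch

def avg_kmax_pool(feats: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Average the top-k activations along the time axis for each filter,
    so no single (e.g., connective-induced) position dominates the feature.
    `feats` is a (batch, channels, time) tensor of conv feature maps."""
    topk, _ = feats.topk(k, dim=-1)   # (batch, channels, k)
    return topk.mean(dim=-1)          # (batch, channels)

# Plain max pooling, as in i-CNN, is the k = 1 special case:
# avg_kmax_pool(feats, k=1) equals feats.max(dim=-1).values
```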
Discriminator The discriminator is a binary classifier that identifies the correct source of an input feature vector. To make it a strong rival to the feature-imitating network (i-CNN), we model the discriminator as a multi-layer perceptron (MLP) enhanced with a gating mechanism for efficient information flow (Srivastava et al., 2015; Qin et al., 2016c), as shown in Figure 3.

Figure 3: Neural structure of the discriminator D.

4 Experiments

We demonstrate the effectiveness of our approach both quantitatively and qualitatively with extensive experiments. We evaluate prediction performance on the PDTB benchmark in different settings. Our method substantially improves over a diverse set of previous models, especially in the practical multi-class classification task. We perform in-depth analysis of the model behaviors, and show that our adversarial framework successfully enables the implicit relation model to imitate and learn discriminative features.

4.1 Experiment Setup

We use PDTB 2.0¹, one of the largest manually annotated discourse relation corpora. The dataset contains 16,224 implicit relation instances in total, with three levels of senses: Level-1 Class, Level-2 Type, and Level-3 Subtypes. The first level consists of four major relation Classes: COMPARISON, CONTINGENCY, EXPANSION, and TEMPORAL. The second level contains 16 Types.

To make extensive comparison with prior work on implicit discourse relation classification, we evaluate in two popular experimental settings: 1) multi-class classification for 2nd-level types (Lin et al., 2009; Ji and Eisenstein, 2015), and 2) one-versus-others binary classification for 1st-level classes (Pitler et al., 2009). We describe the detailed configurations in the following respective sections. We focus our analysis on the multi-class classification setting, which is the most realistic in practice and serves as a building block for a complete discourse parser, such as those for the CoNLL-2015 and 2016 shared tasks (Xue et al., 2015, 2016).

¹ http://www.seas.upenn.edu/~pdtb/

Model Training We provide detailed model and training configurations in the supplementary materials, and only mention a few of them here. Throughout the experiments, i-CNN and a-CNN contain 3 sets of convolutional filters, with the filter sizes selected on the dev set. The final single-layer classifier C contains 512 neurons. The discriminator D consists of 4 fully-connected layers, with 2 gated pathways from layer 1 to layer 3 and layer 4 (Figure 3).

For adversarial model training, it is critical to keep a balance between the progress of the two players. We use a simple strategy which at each iteration optimizes the discriminator and the implicit relation network on a randomly-sampled mini-batch. We found this sufficient to stabilize training. The neural parameters are trained using AdaGrad (Duchi et al., 2011) with an initial learning rate of 0.001. For the balancing parameters in Eq.(5), we set λ1 = 0.1 and λ2 = 0. That is, after the initialization stage the weights of the connective-augmented network a-CNN are fixed. This has been shown capable of giving stable and good predictive performance for our system.
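A rough sketch of such a gated discriminator follows, assuming PyTorch and 1024-unit layers as in the appendix. The paper specifies only "4 fully-connected layers, with 2 gated pathways from layer 1 to layer 3 and layer 4"; the exact gate wiring below (highway-style interpolation with the layer-1 representation) is our assumption.

```python
import torch
import torch.nn as nn

class GatedDiscriminator(nn.Module):
    """Binary classifier over feature vectors; output is P(H comes from a-CNN)."""
    def __init__(self, in_dim: int, hid: int = 1024):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(in_dim, hid), nn.Linear(hid, hid)
        self.fc3, self.fc4 = nn.Linear(hid, hid), nn.Linear(hid, hid)
        self.gate3 = nn.Linear(hid, hid)  # gate for the layer-1 -> layer-3 path
        self.gate4 = nn.Linear(hid, hid)  # gate for the layer-1 -> layer-4 path
        self.out = nn.Linear(hid, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        h1 = torch.tanh(self.fc1(h))
        h2 = torch.tanh(self.fc2(h1))
        g3 = torch.sigmoid(self.gate3(h1))
        h3 = g3 * torch.tanh(self.fc3(h2)) + (1 - g3) * h1  # gated skip from layer 1
        g4 = torch.sigmoid(self.gate4(h1))
        h4 = g4 * torch.tanh(self.fc4(h3)) + (1 - g4) * h1  # gated skip from layer 1
        return torch.sigmoid(self.out(h4))
```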
4.2 Implicit Relation Classification

We mainly focus on the general multi-class classification problem in two alternative settings adopted in prior work, showing the superiority of our model over the previous state of the art. We perform in-depth comparison with carefully designed baselines, providing empirical insights into the working mechanism of the proposed framework. For broader comparison we also report performance in the one-versus-all setting.

| # | Model | PDTB-Lin | PDTB-Ji |
|---|-------|----------|---------|
| 1 | Word-vector | 34.07 | 36.86 |
| 2 | CNN | 43.12 | 44.51 |
| 3 | Ensemble | 42.17 | 44.27 |
| 4 | Multi-task | 43.73 | 44.75 |
| 5 | ℓ2-reg | 44.12 | 45.33 |
| 6 | Lin et al. (2009) | 40.20 | - |
| 7 | Lin et al. (2009) + Brown clusters | - | 40.66 |
| 8 | Ji and Eisenstein (2015) | - | 44.59 |
| 9 | Qin et al. (2016a) | 43.81 | 45.04 |
| 10 | Ours | 44.65 | 46.23 |

Table 1: Accuracy (%) on the test sets of the PDTB-Lin and PDTB-Ji settings for multi-class classification. Please see the text for more details.

Multi-class Classifications

We first adopt the standard PDTB splitting convention following Lin et al. (2009), denoted PDTB-Lin, where sections 2-21, 22, and 23 are used as the training, dev, and test sets, respectively. The 11 most frequent relation types are selected for the task. During training, instances with more than one annotated relation type are treated as multiple instances, each carrying one of the annotations. At test time, a prediction matching any of the gold types is considered correct. The test set contains 766 examples. Please refer to Lin et al. (2009) for more details.

An alternative, slightly different multi-class setting is used by Ji and Eisenstein (2015), denoted PDTB-Ji, where sections 2-20, 0-1, and 21-22 are used as the training, dev, and test sets, respectively. The resulting test set contains 1039 examples. We also evaluate in this setting for thorough comparison.

Table 1 shows the classification accuracy in both settings. Our model (Row 10) achieves state-of-the-art performance, greatly outperforming previous methods (Rows 6-9) with various modeling paradigms, including the linguistic feature-based model (Lin et al., 2009), pure neural methods (Qin et al., 2016c), and a combined approach (Ji and Eisenstein, 2015).

To obtain better insight into the working mechanism of our method, we further compare with a set of carefully selected baselines, shown in Rows 1-5. 1) "Word-vector" sums over the word vectors for the sentence representation, showing the base effect of word embeddings. 2) "CNN" is a standalone convolutional net with the exact same architecture as our implicit relation network. Our model trained within the proposed framework provides a significant improvement, showing the benefit of utilizing implicit connectives at training time. 3) "Ensemble" has the same neural architecture as the proposed framework, except that the input of a-CNN is not augmented with implicit connectives; this is essentially an ensemble of two implicit recognition networks. The method performs even worse than the single CNN model, further confirming the necessity of exploiting connective information. 4) "Multi-task" is the convolutional net augmented with an additional task of simultaneously predicting the implicit connective from the network features. As a straightforward way of incorporating connectives, the method slightly improves over the standalone CNN while falling behind our approach by a large margin, indicating that our proposed feature imitation is a more effective scheme for making use of implicit connectives. 5) Lastly, "ℓ2-reg" also implements feature mimicking, by imposing an ℓ2 distance penalty between the implicit relation features and the connective-augmented features. This simple model obtains improvements over previous best-performing systems in both settings, further validating the idea of imitation. However, in contrast to the fixed ℓ2 regularization, our adversarial framework provides an adaptive mechanism, which is more flexible and performs better, as shown in the table.

One-versus-all Classifications

We also report the results of four one-versus-all binary classifications for further comparison with prior work. We follow the conventional experimental setting (Pitler et al., 2009), selecting sections 2-20, 21-22, and 0-1 as the training, dev, and test sets. More detailed data statistics are provided in the supplementary materials.

Figure 4: (Best viewed in color.) Test-set performance of the three components over training epochs. The relation networks a-CNN and i-CNN are measured with multi-class classification accuracy (with and without implicit connectives, respectively), while the discriminator is evaluated with binary classification accuracy. Top: the PDTB-Lin setting (Lin et al., 2009), where the first 8 epochs are the initialization stage (thus the discriminator is fixed and not shown); Bottom: the PDTB-Ji setting (Ji and Eisenstein, 2015), where the first 3 epochs are the initialization.

Figure 5: (Best viewed in color.) Visualizations of the hidden features extracted by the implicit relation network i-CNN (blue) and the connective-augmented relation network a-CNN (orange), in the multi-class classification setting (Lin et al., 2009). (a) The two networks trained without the adversary (with shared classifier); (b) the two networks trained within our framework, at epoch 10; (c) features at epoch 20. The implicit relation network successfully imitates the connective-augmented features through the adversarial game. Visualization is conducted with the t-SNE algorithm (Maaten and Hinton, 2008).

| Model | COMP. | CONT. | EXP. | TEMP. |
|-------|-------|-------|------|-------|
| Pitler et al. (2009) | 21.96 | 47.13 | - | 16.76 |
| Qin et al. (2016c) | 41.55 | 57.32 | 71.50 | 35.43 |
| Zhang et al. (2016) | 35.88 | 50.56 | 71.48 | 29.54 |
| Zhou et al. (2010) | 31.79 | 47.16 | 70.11 | 20.30 |
| Liu and Li (2016) | 36.70 | 54.48 | 70.43 | 38.84 |
| Chen et al. (2016a) | 40.17 | 54.76 | - | 31.32 |
| Ours | 40.87 | 54.56 | 72.38 | 36.20 |
Table 2: Comparison of F1 scores (%) for binary classification. Please see the supplementary materials for more details of the experimental setting.

Following previous work, the F1 scores are reported in Table 2. Our method outperforms most of the prior systems on all the tasks. We achieve state-of-the-art performance in recognition of the Expansion relation, and obtain scores comparable with the best-performing methods on each of the other relations. Notably, our feature imitation scheme greatly improves over Zhou et al. (2010), which leverages implicit connectives as an intermediate prediction task. This provides additional evidence for the effectiveness of our approach.

4.3 Qualitative Analysis

We now take a closer look at the modeling behavior of our framework, investigating the progress of the adversarial game during training as well as the effects of feature imitation.

Figure 4 shows the training progress of the different components. The a-CNN network maintains high predictive accuracy since implicit connectives are given, showing the importance of connective cues. The rise-and-fall patterns in the accuracy of the discriminator clearly show its competition with the implicit relation network i-CNN as training goes on. In the first few iterations the accuracy of the discriminator quickly increases to over 0.9, while at the late stage it drops to around 0.6, showing that the discriminator is getting confused by i-CNN (an accuracy of 0.5 indicates full confusion). The i-CNN network keeps improving in terms of implicit relation classification accuracy, as it gradually fits the data while simultaneously learning increasingly discriminative features by mimicking a-CNN. The system exhibits similar learning patterns in the two different settings, showing the stability of the training strategy.

Finally, we visualize the output feature vectors of i-CNN and a-CNN using the t-SNE method (Maaten and Hinton, 2008) in Figure 5. Without feature imitation, the features extracted by the two networks are clearly separated (Figure 5(a)). In contrast, as shown in Figures 5(b)-(c), the feature vectors become increasingly mixed as training proceeds. Thus our framework has successfully driven i-CNN to induce representations similar to those of a-CNN, even though connectives are not present.
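A visualization in the style of Figure 5 can be reproduced roughly as follows. This sketch assumes scikit-learn's t-SNE and matplotlib rather than the authors' exact tooling, and uses random placeholder arrays where collected i-CNN and a-CNN features would go.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholders: in practice, stack the 512-dim output features of each
# network over a batch of test examples.
h_i_feats = np.random.randn(200, 512)   # i-CNN features (illustrative)
h_a_feats = np.random.randn(200, 512)   # a-CNN features (illustrative)

# Embed both feature sets jointly so the 2-D coordinates are comparable.
emb = TSNE(n_components=2, random_state=0).fit_transform(
    np.vstack([h_i_feats, h_a_feats]))

n = len(h_i_feats)
plt.scatter(emb[:n, 0], emb[:n, 1], s=8, label="i-CNN")
plt.scatter(emb[n:, 0], emb[n:, 1], s=8, label="a-CNN")
plt.legend()
plt.show()
```

With successful imitation, the two scatter clouds should overlap increasingly over training, as in Figures 5(b)-(c).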
5 Discussions

We have developed an adversarial neural framework that encourages an implicit relation network to extract highly discriminative features by mimicking a connective-augmented network. Our method achieved state-of-the-art performance for implicit discourse relation classification. Beyond implicit connective examples, our model can naturally exploit the abundant explicit connective data to further improve discourse parsing.

The proposed adversarial feature imitation scheme is also generally applicable in other contexts for incorporating indicative side information available at training time for enhanced inference. Our framework shares a similar spirit with the iterative knowledge distillation method (Hu et al., 2016a,b), which trains a "student" network to mimic the classification behavior of a knowledge-informed "teacher" network. Our approach encourages imitation at the feature level instead of the final prediction level. This allows our approach to apply to regression tasks and, more interestingly, to contexts in which the student and teacher networks have different prediction outputs, e.g., performing different tasks, while transferring knowledge between each other can be beneficial. Moreover, our adversarial mechanism provides an adaptive metric to measure and drive the imitation procedure.

References

Or Biran and Kathleen McKeown. 2013. Aggregated word pair features for implicit discourse relation disambiguation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Sofia, Bulgaria, pages 69-73.

Chloé Braud and Pascal Denis. 2015. Comparing word representations for implicit discourse relation classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 2201-2211.

Chloé Braud and Pascal Denis. 2016. Learning connective-based word representations for implicit discourse relation identification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pages 203-213.

Jifan Chen, Qi Zhang, Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016a. Implicit discourse relation detection via a deep architecture with gated relevance network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany, pages 1726-1735.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016b. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems. pages 2172-2180.

Xilun Chen, Ben Athiwaratkun, Yu Sun, Kilian Weinberger, and Claire Cardie. 2016c. Adversarial deep averaging networks for cross-lingual sentiment classification. arXiv preprint arXiv:1606.01614.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(Jul):2121-2159.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17(59):1-35.

Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T. Ng, and Bita Nejat. 2014. Abstractive summarization of product reviews using discourse structure. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pages 1602-1613.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. pages 2672-2680.

Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric P. Xing. 2016a. Harnessing deep neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL). Berlin, Germany, pages 2410-2420.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Controllable text generation. arXiv preprint arXiv:1703.00955.

Zhiting Hu, Zichao Yang, Ruslan Salakhutdinov, and Eric P. Xing. 2016b. Deep neural networks with massive learned knowledge.
In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP). Austin, USA.

Yangfeng Ji and Jacob Eisenstein. 2015. One vector is not enough: Entity-augmented distributed semantics for discourse relations. Transactions of the Association for Computational Linguistics (TACL) 3:329-344.

Yangfeng Ji, Gholamreza Haffari, and Jacob Eisenstein. 2016. A latent variable recurrent neural network for discourse-driven language models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). San Diego, California, pages 332-342.

Yangfeng Ji, Gongbo Zhang, and Jacob Eisenstein. 2015. Closing the gap: Domain adaptation from explicit to implicit discourse relations. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, pages 2219-2224.

Alex M. Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems. pages 4601-4609.

Man Lan, Yu Xu, and Zhengyu Niu. 2013. Leveraging synthetic discourse data via multi-task learning for implicit discourse relation recognition. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Sofia, Bulgaria, pages 476-485.

Junyi Jessy Li, Marine Carpuat, and Ani Nenkova. 2014. Assessing the discourse factors that influence the quality of machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, Maryland, pages 283-288.

Yujia Li, Kevin Swersky, and Richard S. Zemel. 2015. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML). Lille, France, pages 1718-1727.

Maria Liakata, Simon Dobnik, Shyamasree Saha, Colin Batchelor, and Dietrich Rebholz-Schuhmann. 2013. A discourse-driven content model for summarising scientific articles evaluated in a complex question answering task. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Seattle, Washington, USA, pages 747-757.

Ziheng Lin, Min-Yen Kan, and Hwee Tou Ng. 2009. Recognizing implicit discourse relations in the Penn Discourse Treebank. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Singapore, pages 343-351.

Yang Liu and Sujian Li. 2016. Recognizing implicit discourse relations via repeated reading: Neural networks with multi-level attention. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pages 1224-1233.

Yang Liu, Sujian Li, Xiaodong Zhang, and Zhifang Sui. 2016. Implicit discourse relation classification via multi-task neural networks. arXiv preprint arXiv:1603.02776.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9:2579-2605.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. pages 3111-3119.

Emily Pitler, Annie Louis, and Ani Nenkova. 2009.
Automatic sense prediction for implicit discourse relations in text. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing (ACL-IJCNLP). Suntec, Singapore, pages 683-691.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008. The Penn Discourse TreeBank 2.0. In LREC. pages 2961-2968.

Lianhui Qin, Zhisong Zhang, and Hai Zhao. 2016a. Implicit discourse relation recognition with context-aware character-enhanced embeddings. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan, pages 1914-1924.

Lianhui Qin, Zhisong Zhang, and Hai Zhao. 2016b. Shallow discourse parsing using convolutional neural network. In Proceedings of the CoNLL-16 shared task. Berlin, Germany, pages 70-77.

Lianhui Qin, Zhisong Zhang, and Hai Zhao. 2016c. A stacking gated neural architecture for implicit discourse relation classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pages 2263-2270.

Attapol Rutherford and Nianwen Xue. 2015. Improving the inference of implicit discourse relations via classifying explicit discourse connectives. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies. Denver, Colorado, pages 799-808.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training GANs. In Advances in Neural Information Processing Systems. pages 2226-2234.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. arXiv preprint arXiv:1505.00387.

Yu Xu, Man Lan, Yue Lu, Zheng Yu Niu, and Chew Lim Tan. 2012. Connective prediction using machine learning for implicit discourse relation classification. In The 2012 International Joint Conference on Neural Networks (IJCNN). IEEE, pages 1-8.

Nianwen Xue, Hwee Tou Ng, Sameer Pradhan, Rashmi Prasad, Christopher Bryant, and Attapol Rutherford. 2015. The CoNLL-2015 shared task on shallow discourse parsing. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task (CoNLL). Beijing, China, pages 1-16.

Nianwen Xue, Hwee Tou Ng, Sameer Pradhan, Bonnie Webber, Attapol Rutherford, Chuan Wang, and Hongmin Wang. 2016. The CoNLL-2016 shared task on shallow discourse parsing. In Proceedings of the Twentieth Conference on Computational Natural Language Learning - Shared Task (CoNLL). Berlin, Germany, pages 1-19.

Biao Zhang, Deyi Xiong, Jinsong Su, Qun Liu, Rongrong Ji, Hong Duan, and Min Zhang. 2016. Variational neural discourse relation recognizer. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas, pages 382-391.

Zhi-Min Zhou, Yu Xu, Zheng-Yu Niu, Man Lan, Jian Su, and Chew Lim Tan. 2010. Predicting discourse connectives for implicit discourse relation recognition. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010). Beijing, China, pages 1507-1514.
Appendix A Model Architectures and Training Configurations

In this section we provide the detailed architecture configuration of each component used in the experiments.

• Table 3 lists the filter configurations of the convolutional layer in i-CNN and a-CNN for the different tasks, tuned on the dev sets. As described in section 3.3 of the paper, the convolutional layer is followed by a max pooling layer in i-CNN, and by an average k-max pooling layer with k = 2 in a-CNN.
• The final single-layer classifier C contains 512 neurons and uses tanh as the activation function.
• The discriminator D consists of 4 fully-connected layers, with 2 gated pathways from layer 1 to layer 3 and layer 4 (see Figure 3 in the paper). The size of each layer is set to 1024 and is fixed in all the experiments.
• We set the dimension of the input word vectors to 300 and initialize them with pre-trained word2vec (Mikolov et al., 2013). The maximum length of a sentence argument is set to 80, with truncation or zero-padding applied when necessary.

| Task | Filter sizes | Filter number |
|------|--------------|---------------|
| PDTB-Lin | 2, 4, 8 | 3 × 256 |
| PDTB-Ji | 2, 5, 10 | 3 × 256 |
| One-vs-all | 2, 5, 10 | 3 × 1024 |

Table 3: The convolutional architectures of i-CNN and a-CNN for the different tasks (section 4). For example, in PDTB-Lin we use 3 sets of filters of sizes 2, 4, and 8, respectively, with 256 filters per set.

All experiments were performed on a Linux machine with eight 4.0GHz CPU cores and 32GB RAM. We implemented the neural networks with TensorFlow², a popular deep learning platform.

²https://www.tensorflow.org

Appendix B One-vs-all Classifications

Table 4 lists the statistics of the data.

| Relation | Train | Dev | Test |
|----------|-------|-----|------|
| Comparison | 1942/1942 | 197/986 | 152/894 |
| Contingency | 3342/3342 | 295/888 | 279/767 |
| Expansion | 7004/7004 | 671/512 | 574/472 |
| Temporal | 760/760 | 64/1119 | 85/961 |

Table 4: Distributions of positive and negative instances in the train/dev/test sets of the four binary relation classification tasks.
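As a concrete reading of Table 3 and the word-vector settings above, the sketch below builds the i-CNN feature extractor for the PDTB-Lin configuration (filter sizes 2/4/8, 256 filters each, tied weights across the two arguments). It assumes PyTorch; the pool-then-concatenate ordering and the ReLU activation are our assumptions where the paper is not explicit.

```python
import torch
import torch.nn as nn

class ICNN(nn.Module):
    """Convolutional feature extractor over two argument sequences."""
    def __init__(self, emb_dim: int = 300, n_filters: int = 256,
                 sizes=(2, 4, 8)):
        super().__init__()
        # One Conv1d per filter size, shared across both arguments.
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, w) for w in sizes])

    def forward(self, arg1: torch.Tensor, arg2: torch.Tensor) -> torch.Tensor:
        # arg1, arg2: (batch, emb_dim, max_len) word-vector sequences,
        # zero-padded or truncated to max_len = 80.
        feats = []
        for arg in (arg1, arg2):              # tied convolution weights
            for conv in self.convs:
                # Max-pool each feature map over time (k = 1 pooling).
                feats.append(torch.relu(conv(arg)).max(dim=-1).values)
        return torch.cat(feats, dim=-1)       # concatenated pooled features
```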
