Learning Semantic Script Knowledge with Event Embeddings


Authors: Ashutosh Modi, Ivan Titov

Ashutosh Modi (ashutosh@coli.uni-sb.de), Saarland University, Saarbrücken, Germany
Ivan Titov (titov@uva.nl), University of Amsterdam, Amsterdam, the Netherlands

Accepted at the workshop track of the International Conference on Learning Representations (ICLR), 2014.

Abstract

Induction of common sense knowledge about prototypical sequences of events has recently received much attention (e.g., Chambers & Jurafsky (2008); Regneri et al. (2010)). Instead of inducing this knowledge in the form of graphs, as in much of the previous work, in our method distributed representations of event realizations are computed based on distributed representations of predicates and their arguments, and these representations are then used to predict prototypical event orderings. The parameters of the compositional process for computing the event representations and the ranking component of the model are jointly estimated from texts. We show that this approach yields a substantial boost in ordering performance with respect to previous methods.

1. Introduction

It is generally believed that natural language understanding systems would benefit from incorporating common-sense knowledge about prototypical sequences of events and their participants. Early work focused on structured representations of this knowledge (called scripts (Schank & Abelson, 1977)) and on manual construction of script knowledge bases. However, these approaches do not scale to complex domains (Mueller, 1998; Gordon, 2001). More recently, automatic induction of script knowledge from text has started to attract attention: these methods exploit either natural texts (Chambers & Jurafsky, 2008; 2009) or crowdsourced data (Regneri et al., 2010), and consequently do not require expensive expert annotation. Given a text corpus, they extract structured representations (i.e. graphs), for example chains (Chambers & Jurafsky, 2008) or more general directed acyclic graphs (Regneri et al., 2010). These graphs are scenario-specific; nodes in them correspond to events (and are associated with sets of potential event mentions), and arcs encode the temporal precedence relation. The graphs can then inform NLP applications (e.g., question answering) by indicating whether one event is likely to precede or succeed another.

In this work we advocate constructing a statistical model that is capable of "answering" at least some of the questions these graphs can be used to answer, but without explicitly representing the knowledge as a graph. In our method, distributed representations (i.e. vectors of real numbers) of event realizations are computed based on distributed representations of predicates and their arguments, and the event representations are then used in a ranker to predict the expected ordering of events. Both the parameters of the compositional process for computing the event representation and the ranking component of the model are estimated from data.

To get an intuition for why the embedding approach may be attractive, consider a situation where a prototypical ordering of the events "the bus disembarked passengers" and "the bus drove away" needs to be predicted. An approach based on the frequency of predicate pairs (Chambers & Jurafsky, 2008) is unlikely to make the right prediction, as driving usually precedes disembarking. Similarly, an approach which treats the whole predicate-argument structure as an atomic unit (Regneri et al., 2010) will probably fail as well, as such a sparse model is unlikely to be effectively learnable even from large amounts of data.
However, our embedding method would be expected to capture relevant features of the verb frames, namely the transitive use of the predicate "disembark" and the effect of the particle "away", and these features will then be used by the ranking component to make the correct prediction.

In previous work on learning inference rules (Berant et al., 2011), it has been shown that enforcing transitivity constraints on the inference rules results in significantly improved performance. The same is true for the event ordering task, as scripts have largely linear structure, and observing that a ≺ b and b ≺ c is likely to imply a ≺ c. Interestingly, in our approach we implicitly learn a model which satisfies transitivity constraints, without the need for any explicit global optimization on a graph.

The approach is evaluated on the crowdsourced dataset of Regneri et al. (2010), and we demonstrate that using our model results in a 13.5% absolute improvement in F1 on event ordering with respect to their graph induction method (84% vs. 71%).

2. Model

In this section we describe the model we use for computing event representations as well as the ranking component of our model.

2.1. Event Representation

Learning and exploiting distributed word representations (i.e. vectors of real values, also known as embeddings) has been shown to be beneficial in many NLP applications (Bengio et al., 2001; Turian et al., 2010; Collobert et al., 2011). These representations encode semantic and syntactic properties of a word and are normally learned in the language modeling setting (i.e. learned to be predictive of local word context), though they can also be specialized by learning in the context of other NLP applications such as PoS tagging or semantic role labeling (Collobert et al., 2011). More recently, the area of distributional compositional semantics has started to emerge (Baroni & Zamparelli, 2011; Socher et al., 2012); these works focus on inducing representations of phrases by learning a compositional model. Such a model computes the representation of a phrase by starting with embeddings of the individual words in the phrase; this composition process is often recursive and guided by some form of syntactic structure.

Algorithm 1: Learning Algorithm

    Notation:
      w: ranking weight vector
      E_k: k-th sequence of events in temporal order
      t_k: array of model scores for events in E_k
      gamma: fixed global margin for ranking

    LearnWeights():
      for epoch = 1 to T:
        for k = 1 to K:                    [over event sequences]
          for i = 1 to |E_k|:              [over events in the sequence]
            compute embedding x_{e_i} for event e_i
            calculate score s_{e_i} = w^T x_{e_i}
          collect scores in t_k = [s_{e_1}, ..., s_{e_i}, ...]
          error = RankingError(t_k)
          back-propagate error
          update all embedding parameters and w

    RankingError(t_k):
      err = 0
      for rank = 1, ..., l:
        for rankBefore = 1, ..., rank:
          if t_k[rankBefore] - t_k[rank] < gamma:
            err = err + 1
        for rankAfter = rank + 1, ..., l:
          if t_k[rank] - t_k[rankAfter] < gamma:
            err = err + 1
      return err

In our work, we use a simple compositional model for representing the semantics of a verb frame (i.e. the predicate and its arguments). The model is shown in Figure 1.

Figure 1. Computation of an event representation ("the bus disembarked passengers"): the argument embeddings a_1 = C(bus) and a_2 = C(passenger) and the predicate embedding p = C(disembark) are linearly transformed (by T and R), combined into a hidden layer h, and mapped to the event embedding x.

Each word w_i in the vocabulary is mapped to a real vector based on the corresponding lemma (the embedding function C). The hidden layer is computed by summing linearly transformed predicate and argument embeddings and passing the result through the logistic sigmoid function.[1]
We use different transformation matrices for arguments and predicates, T and R respectively. The event representation x is then obtained by applying another linear transform (matrix A) followed by another application of the sigmoid function.

[1] Only syntactic heads of arguments are used in this work. If an argument is "a coffee maker", we will use only the word "maker".

These event representations are learned in the context of event ranking: the transformation parameters as well as the representations of words are forced to be predictive of the temporal order of events.

Table 1. Results on the crowdsourced data for the verb-frequency baseline (BL), the verb-only embedding model (EEverb), Regneri et al. (2010) (MSA), Frermann et al. (2014) (BS) and the full model (EE).

Precision (%):

| Scenario   | BL   | EEverb | MSA  | BS   | EE   |
|------------|------|--------|------|------|------|
| Bus        | 70.1 | 81.9   | 80.0 | 76.0 | 85.1 |
| Coffee     | 70.1 | 73.7   | 70.0 | 68.0 | 69.5 |
| Fastfood   | 69.9 | 81.0   | 53.0 | 97.0 | 90.0 |
| Ret. Food  | 74.0 | 94.1   | 48.0 | 87.0 | 92.4 |
| Iron       | 73.4 | 80.1   | 78.0 | 87.0 | 86.9 |
| Microwave  | 72.6 | 79.2   | 47.0 | 91.0 | 82.9 |
| Scr. Eggs  | 72.7 | 71.4   | 67.0 | 77.0 | 80.7 |
| Shower     | 62.2 | 76.2   | 48.0 | 85.0 | 80.0 |
| Telephone  | 67.6 | 87.8   | 83.0 | 92.0 | 87.5 |
| Vending    | 66.4 | 87.3   | 84.0 | 90.0 | 84.2 |
| Average    | 69.9 | 81.3   | 65.8 | 85.0 | 83.9 |

Recall (%):

| Scenario   | BL   | EEverb | MSA  | BS   | EE   |
|------------|------|--------|------|------|------|
| Bus        | 71.3 | 75.8   | 80.0 | 76.0 | 91.9 |
| Coffee     | 72.6 | 75.1   | 78.0 | 57.0 | 71.0 |
| Fastfood   | 65.1 | 79.1   | 81.0 | 65.0 | 87.9 |
| Ret. Food  | 68.6 | 91.4   | 75.0 | 72.0 | 89.7 |
| Iron       | 67.3 | 69.8   | 72.0 | 69.0 | 80.2 |
| Microwave  | 63.4 | 62.8   | 83.0 | 74.0 | 90.3 |
| Scr. Eggs  | 68.0 | 67.7   | 64.0 | 59.0 | 76.9 |
| Shower     | 62.5 | 80.0   | 82.0 | 84.0 | 84.3 |
| Telephone  | 62.8 | 87.9   | 86.0 | 87.0 | 89.0 |
| Vending    | 60.6 | 87.6   | 85.0 | 74.0 | 81.9 |
| Average    | 66.2 | 77.2   | 78.6 | 71.7 | 84.3 |

F1 (%):

| Scenario   | BL   | EEverb | MSA  | BS   | EE   |
|------------|------|--------|------|------|------|
| Bus        | 70.7 | 78.8   | 80.0 | 76.0 | 88.4 |
| Coffee     | 71.3 | 74.4   | 74.0 | 62.0 | 70.2 |
| Fastfood   | 67.4 | 80.0   | 64.0 | 78.0 | 88.9 |
| Ret. Food  | 71.0 | 92.8   | 58.0 | 79.0 | 91.0 |
| Iron       | 70.2 | 69.8   | 75.0 | 77.0 | 83.4 |
| Microwave  | 67.7 | 70.0   | 60.0 | 82.0 | 86.4 |
| Scr. Eggs  | 70.3 | 69.5   | 66.0 | 67.0 | 78.7 |
| Shower     | 62.3 | 78.1   | 61.0 | 85.0 | 82.1 |
| Telephone  | 65.1 | 87.8   | 84.0 | 89.0 | 88.2 |
| Vending    | 63.3 | 84.9   | 84.0 | 81.0 | 88.2 |
| Average    | 68.0 | 79.1   | 70.6 | 77.6 | 84.1 |
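As a concrete illustration, the forward pass just described (hidden layer h = sigmoid(R p + Σ_i T a_i), event embedding x = sigmoid(A h)) can be sketched in numpy. The dimensions, the random parameter values, and the three-word vocabulary below are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: word embeddings of size d, hidden layer of size h_dim,
# event embedding of size m. Parameters are randomly initialized here; in the
# paper they are learned (and word vectors are initialized from SENNA).
d, h_dim, m = 4, 5, 3
rng = np.random.default_rng(0)

# Hypothetical embedding table C: lemma -> vector (only syntactic heads used).
C = {w: rng.normal(size=d) for w in ["bus", "disembark", "passenger"]}

T = rng.normal(size=(h_dim, d))  # shared transform for argument embeddings
R = rng.normal(size=(h_dim, d))  # transform for the predicate embedding
A = rng.normal(size=(m, h_dim))  # output transform to the event embedding

def event_embedding(predicate, arguments):
    """x = sigmoid(A . sigmoid(R p + sum_i T a_i))."""
    z = R @ C[predicate]
    for a in arguments:
        z += T @ C[a]
    h = sigmoid(z)
    return sigmoid(A @ h)

x = event_embedding("disembark", ["bus", "passenger"])
print(x.shape)  # (3,)
```

Note that the same matrix T is applied to every argument, so the model stays compact even with many frames.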
However, one important characteristic of neural network embeddings is that they can be induced in a multitasking scenario, and consequently can be learned to be predictive of different types of contexts, providing a general framework both for inducing different aspects of the (semantic) properties of events and for exploiting the same representations in different applications.

2.2. Learning to Order

The task of learning the stereotyped order of events naturally corresponds to the standard ranking setting. Here we assume that we are provided with sequences of events, and our goal is to capture this order. We discuss how we obtain this learning material in the next section. We learn a linear ranker (characterized by a vector w) which takes an event representation and returns a ranking score. Events are then ordered according to the score to yield the model prediction. Note that during the learning stage we estimate not only w but also the event representation parameters, i.e. the matrices T, R and A, and the word embedding function C.

Note that by casting the event ordering task as a global ranking problem we ensure that the model implicitly exploits the transitivity of the temporal relation, a property which is crucial for successful learning from a finite amount of data, as we argued in the introduction and will confirm in our experiments.

We use an online ranking algorithm based on the Perceptron Rank (PRank, (Crammer & Singer, 2001)), or, more accurately, its large-margin extension. One crucial difference, though, is that the error is computed not only with respect to w but is also propagated back through the structure of the neural network. The learning procedure is sketched in Algorithm 1. Additionally, we use a Gaussian prior on the weights, regularizing both the embedding parameters and the vector w. We initialize the word representations using the SENNA embeddings (Collobert et al., 2011).[2]
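The margin-based ranking error used during learning can be sketched as follows. This is a simplified version of the RankingError routine in Algorithm 1: it counts each violated event pair once, with scores listed in gold temporal order and higher scores meaning earlier events:

```python
def ranking_error(scores, gamma=1.0):
    """Count margin violations in a gold-ordered event sequence.

    `scores` holds the model scores for events in their gold temporal
    order; every earlier event should out-score every later event by at
    least the global margin `gamma`, otherwise the pair counts as an error.
    """
    err = 0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            if scores[i] - scores[j] < gamma:
                err += 1
    return err

# A sequence whose scores decrease by at least gamma incurs no error;
# a fully reversed sequence violates every pair.
print(ranking_error([3.0, 1.5, 0.0], gamma=1.0))  # 0
print(ranking_error([0.0, 1.5, 3.0], gamma=1.0))  # 3
```

In the full procedure this count is the error signal that is back-propagated to update both w and the embedding parameters.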
3. Experiments

We evaluate our approach on the crowdsourced data collected for script induction by Regneri et al. (2010), though, in principle, the method is applicable in the arguably more general setting of Chambers & Jurafsky (2008).

3.1. Data and task

Regneri et al. (2010) collected short textual descriptions (called event sequence descriptions, ESDs) of various types of human activities (e.g., going to a restaurant, ironing clothes) using crowdsourcing (Amazon Mechanical Turk); this dataset was also complemented by descriptions provided in the OMICS corpus (Gupta & Kochenderfer, 2004). The datasets are fairly small, containing 30 ESDs per activity type on average (we will refer to different activities as scenarios), but the collection can easily be extended given the low cost of crowdsourcing. The ESDs are written in a bullet-point style, and the annotators were asked to follow the temporal order in writing. Consider an example ESD for the scenario "prepare coffee":

{go to coffee maker} → {fill water in coffee maker} → {place the filter in holder} → {place coffee in filter} → {place holder in coffee maker} → {turn on coffee maker}

Though individual ESDs may seem simple, the learning task is challenging because of the limited amount of training data, variability in the vocabulary used, optionality of events (e.g., going to the coffee machine may not be mentioned in an ESD), different granularity of events, and variability in the ordering (e.g., coffee may be put in a filter before placing it in a coffee maker).

[2] When we kept the word representations fixed to the SENNA embeddings and learned only the matrices T, R and A, we obtained similar results (a 0.3% difference in the average F1 score).

Unlike our work, Regneri et al. (2010) rely on WordNet to provide extra signal when using the Multiple Sequence Alignment (MSA) algorithm.
As in their work, each description was preprocessed to extract a predicate and the heads of the argument noun phrases to be used in the model.

The methods are evaluated on human-annotated scenario-specific test sets: the goal is to classify event pairs as appearing in a given stereotypical order or not (Regneri et al., 2010).[3] The model was estimated as explained in Section 2.2, with the order of events in the ESDs treated as the gold standard. We used 4 held-out scenarios to choose model parameters; no scenario-specific tuning was performed, and the 10 test scripts were not used for model selection. When testing, we predicted that an event pair (e1, e2) is in the stereotypical order (e1 ≺ e2) if the ranking score for e1 exceeded the ranking score for e2.

3.2. Results and discussion

In our experiments, we compared our event embedding model (EE) against a baseline (BL) and two previous systems (MSA and BS). MSA is the system of Regneri et al. (2010); BS is the hierarchical Bayesian system of Frermann et al. (2014). BL chooses the order of events based on the preferred order of the corresponding verbs in the training set: (e1, e2) is predicted to be in the stereotypical order if the number of times the corresponding verbs v1 and v2 appear in this order in the training ESDs exceeds the number of times they appear in the opposite order (not necessarily at adjacent positions); a coin is tossed to break ties (or if v1 and v2 are the same verb). We also compare to the version of our model which uses only verbs (EEverb). Note that EEverb is conceptually very similar to BL, as it essentially induces an ordering over verbs. However, this ordering can benefit from the implicit transitivity assumption used in EEverb (and EE), as we discussed in the introduction. The results are presented in Table 1.
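A minimal sketch of the BL baseline, assuming training sequences are given as lists of verbs in temporal order (the function names and toy data are illustrative, not from the paper):

```python
from collections import Counter
import random

def train_bl(sequences):
    """Count how often verb v1 precedes verb v2 across training sequences,
    at any distance (not necessarily at adjacent positions)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq)):
            for j in range(i + 1, len(seq)):
                counts[(seq[i], seq[j])] += 1
    return counts

def predict_bl(counts, v1, v2, rng=random.Random(0)):
    """Return True if (v1, v2) is predicted to be in stereotypical order.

    Ties (including identical verbs, where both counts are equal) are
    broken by a coin toss, as in the baseline."""
    before, after = counts[(v1, v2)], counts[(v2, v1)]
    if before == after:
        return rng.random() < 0.5
    return before > after

seqs = [["enter", "pay", "leave"], ["enter", "order", "pay", "leave"]]
counts = train_bl(seqs)
print(predict_bl(counts, "enter", "pay"))   # True
print(predict_bl(counts, "leave", "enter")) # False
```

Unlike EEverb, this counting approach cannot generalize through transitivity: it only knows about verb pairs it has actually observed.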
The first observation is that the full model improves substantially over the baseline and the previous methods (MSA and BS): a 13.5% and 6.5% improvement in F1 over MSA and BS respectively. This improvement is largely due to an increase in recall, while precision is not negatively affected. We also observe a substantial improvement in all metrics from using transitivity, as seen by comparing the results of BL and EEverb (an 11.3% improvement in F1). This simple approach already outperforms the pipelined MSA system. These results seem to support our hypothesis in the introduction that inducing graph representations from scripts may not be an optimal strategy from a practical perspective.

[3] The unseen event pairs do not come from the same ESDs, making the task harder: the events may not be in any temporal relation. This is also the reason for using the F1 score rather than accuracy, both in Regneri et al. (2010) and in our work.

References

Baroni, Marco and Zamparelli, Roberto. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of EMNLP, 2011.

Bengio, Yoshua, Ducharme, Réjean, and Vincent, Pascal. A neural probabilistic language model. In Proceedings of NIPS, 2001.

Berant, Jonathan, Dagan, Ido, and Goldberger, Jacob. Global learning of typed entailment rules. In Proceedings of ACL, 2011.

Chambers, Nathanael and Jurafsky, Dan. Unsupervised learning of narrative schemas and their participants. In Proceedings of ACL, 2009.

Chambers, Nathanael and Jurafsky, Daniel. Unsupervised learning of narrative event chains. In Proceedings of ACL, 2008.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.

Crammer, Koby and Singer, Yoram. Pranking with ranking. In Proceedings of NIPS, 2001.
Frermann, Lea, Titov, Ivan, and Pinkal, Manfred. A hierarchical Bayesian model for unsupervised induction of script knowledge. In Proceedings of EACL, Gothenburg, Sweden, 2014.

Gordon, Andrew. Browsing image collections with representations of common-sense activities. JASIST, 52(11), 2001.

Gupta, Rakesh and Kochenderfer, Mykel J. Common sense data acquisition for indoor mobile robots. In Proceedings of AAAI, 2004.

Mueller, Erik T. Natural Language Processing with ThoughtTreasure. Signiform, 1998.

Regneri, Michaela, Koller, Alexander, and Pinkal, Manfred. Learning script knowledge with web experiments. In Proceedings of ACL, 2010.

Schank, R. C. and Abelson, R. P. Scripts, Plans, Goals, and Understanding. Lawrence Erlbaum Associates, Potomac, Maryland, 1977.

Socher, Richard, Huval, Brody, Manning, Christopher D., and Ng, Andrew Y. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP, 2012.

Turian, Joseph, Ratinov, Lev, and Bengio, Yoshua. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL, 2010.
