Modeling Semantic Expectation: Using Script Knowledge for Referent Prediction
Ashutosh Modi (1,3), Ivan Titov (2,4), Vera Demberg (1,3), Asad Sayeed (1,3), Manfred Pinkal (1,3)
1 {ashutosh,vera,asayeed,pinkal}@coli.uni-saarland.de  2 titov@uva.nl
3 Universität des Saarlandes, Germany  4 ILLC, University of Amsterdam, the Netherlands

Abstract

Recent research in psycholinguistics has provided increasing evidence that humans predict upcoming content. Prediction also affects perception and might be a key to robustness in human language processing. In this paper, we investigate the factors that affect human prediction by building a computational model that can predict upcoming discourse referents based on linguistic knowledge alone vs. linguistic knowledge jointly with common-sense knowledge in the form of scripts. We find that script knowledge significantly improves model estimates of human predictions. In a second study, we test the highly controversial hypothesis that predictability influences referring expression type but do not find evidence for such an effect.

1 Introduction

Being able to anticipate upcoming content is a core property of human language processing (Kutas et al., 2011; Kuperberg and Jaeger, 2016) that has received a lot of attention in the psycholinguistic literature in recent years. Expectations about upcoming words help humans comprehend language in noisy settings and deal with ungrammatical input. In this paper, we use a computational model to address the question of how different layers of knowledge (linguistic knowledge as well as common-sense knowledge) influence human anticipation.

Here we focus our attention on semantic predictions of discourse referents for upcoming noun phrases. This task is particularly interesting because it allows us to separate the semantic task of anticipating an intended referent and the processing of the actual surface form. For example, in the context of "I ordered a medium sirloin steak with fries. Later, the waiter brought ...", there is a strong expectation of a specific discourse referent, i.e., the referent introduced by the object NP of the preceding sentence, while the possible referring expression could be either "the steak I had ordered", "the steak", "our food", or "it". Existing models of human prediction are usually formulated using the information-theoretic concept of surprisal. In recent work, however, surprisal is usually not computed for DRs, which represent the relevant semantic unit, but for the surface form of the referring expressions, even though there is an increasing amount of literature suggesting that human expectations at different levels of representation have separable effects on prediction and, as a consequence, that the modelling of only one level (the linguistic surface form) is insufficient (Kuperberg and Jaeger, 2016; Kuperberg, 2016; Zarcone et al., 2016). The present model addresses this shortcoming by explicitly modelling and representing common-sense knowledge and conceptually separating the semantic (discourse referent) and the surface level (referring expression) expectations.
Our discourse referent prediction task is related to the NLP task of coreference resolution, but it substantially differs from that task in the following ways: 1) we use only the incrementally available left context, while coreference resolution uses the full text; 2) coreference resolution tries to identify the DR for a given target NP in context, while we look at the expectations of DRs based only on the context before the target NP is seen.

The distinction between referent prediction and prediction of referring expressions also allows us to study a closely related question in natural language generation: the choice of a type of referring expression based on the predictability of the DR that is intended by the speaker. This part of our work is inspired by a referent guessing experiment by Tily and Piantadosi (2009), who showed that highly predictable referents were more likely to be realized with a pronoun than unpredictable referents, which were more likely to be realized using a full NP. The effect they observe is consistent with a Gricean point of view, or the principle of uniform information density (see Section 5.1). However, Tily and Piantadosi do not provide a computational model for estimating referent predictability. Also, they do not include selectional preference or common-sense knowledge effects in their analysis.

We believe that script knowledge, i.e., common-sense knowledge about everyday event sequences, represents a good starting point for modelling conversational anticipation. This type of common-sense knowledge includes temporal structure, which is particularly relevant for anticipation in continuous language processing. Furthermore, our approach can build on progress that has been made in recent years in methods for acquiring large-scale script knowledge; see Section 1.1. Our hypothesis is that script knowledge may be a significant factor in human anticipation of discourse referents. Explicitly modelling this knowledge will thus allow us to produce more human-like predictions.

Script knowledge enables our model to generate anticipations about discourse referents that have already been mentioned in the text, as well as anticipations about textually new discourse referents which have been activated due to script knowledge. By modelling event sequences and event participants, our model captures many more long-range dependencies than normal language models are able to. As an example, consider the following two alternative text passages:

We got seated, and had to wait for 20 minutes. Then, the waiter brought the ...
We ordered, and had to wait for 20 minutes. Then, the waiter brought the ...

Preferred candidate referents for the object position of "the waiter brought the ..." are instances of the food, menu, or bill participant types. In the context of the alternative preceding sentences, there is a strong expectation of instances of a menu and a food participant, respectively.

This paper represents foundational research investigating human language processing. However, it also has the potential for application in assistant technology and embodied agents. The goal is to achieve human-level language comprehension in realistic settings, and in particular to achieve robustness in the face of errors or noise. Explicitly modelling expectations that are driven by common-sense knowledge is an important step in this direction.
In order to be able to investigate the influence of script knowledge on discourse referent expectations, we use a corpus that contains frequent reference to script knowledge and provides annotations for coreference information, script events and participants (Section 2). In Section 3, we present a large-scale experiment for empirically assessing human expectations on upcoming referents, which allows us to quantify at what points in a text humans have very clear anticipations vs. when they do not. Our goal is to model human expectations, even if they turn out to be incorrect in a specific instance. The experiment was conducted via Mechanical Turk and follows the methodology of Tily and Piantadosi (2009). In Section 4, we describe our computational model that represents script knowledge. The model is trained on the gold standard annotations of the corpus, because we assume that human comprehenders usually will have an analysis of the preceding discourse which closely corresponds to the gold standard. We compare the prediction accuracy of this model to human predictions, as well as to two baseline models, in Section 4.3. One of them uses only structural linguistic features for predicting referents; the other uses general script-independent selectional preference features. In Section 5, we test whether surprisal (as estimated from human guesses vs. computational models) can predict the type of referring expression used in the original texts in the corpus (pronoun vs. full referring expression). This experiment also has wider implications with respect to the on-going discussion of whether the referring expression choice is dependent on predictability, as predicted by the uniform information density hypothesis.

(I)^1_{P_bather} [decided]_{E_wash} to take a (bath)^2_{P_bath} yesterday afternoon after working out. Once (I)^1 got back home, (I)^1 [walked]_{E_enter_bathroom} to (my)^1 (bathroom)^3_{P_bathroom} and first quickly scrubbed the (bathroom tub)^4_{P_bathtub} by [turning on]_{E_turn_water_on} the (water)^5_{P_water} and rinsing (it)^4 clean with a rag. After (I)^1 finished, (I)^1 [plugged]_{E_close_drain} the (tub)^4 and began [filling]_{E_fill_water} (it)^4 with warm (water)^5 set at about 98 (degrees)^6_{P_temperature}.

Figure 1: An excerpt from a story in the InScript corpus. The referring expressions are in parentheses, and the corresponding discourse referent label is given by the superscript; referring expressions of the same discourse referent share a superscript number. Script-relevant events are in square brackets, with the event type indicated by the corresponding subscript.

The contributions of this paper consist of:
• a large dataset of human expectations, in a variety of texts related to every-day activities.
• an implementation of the conceptual distinction between the semantic level of referent prediction and the type of a referring expression.
• a computational model which significantly improves modelling of human anticipations.
• showing that script knowledge is a significant factor in human expectations.
• testing the hypothesis of Tily and Piantadosi that the choice of the type of referring expression (pronoun or full NP) depends on the predictability of the referent.
1.1 Scripts

Scripts represent knowledge about typical event sequences (Schank and Abelson, 1977), for example the sequence of events happening when eating at a restaurant. Script knowledge thereby includes events like order, bring and eat, as well as participants of those events, e.g., menu, waiter, food, guest. Existing methods for acquiring script knowledge are based on extracting narrative chains from text (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009; Jans et al., 2012; Pichotta and Mooney, 2014; Rudinger et al., 2015; Modi, 2016; Ahrendt and Demberg, 2016) or on eliciting script knowledge via crowdsourcing on Mechanical Turk (Regneri et al., 2010; Frermann et al., 2014; Modi and Titov, 2014).

Modelling anticipated events and participants is motivated by evidence showing that event representations in humans contain information not only about the current event, but also about previous and future states; that is, humans generate anticipations about event sequences during normal language comprehension (Schütz-Bosbach and Prinz, 2007). Script knowledge representations have been shown to be useful in NLP applications for ambiguity resolution during reference resolution (Rahman and Ng, 2012).

2 Data: The InScript Corpus

Ordinary texts, including narratives, encode script structure in a way that is at once too complex and too implicit to enable a systematic study of script-based expectation. They contain interleaved references to many different scripts, and they usually refer to single scripts in a point-wise fashion only, relying on the ability of the reader to infer the full event chain using their background knowledge.

We use the InScript corpus (Modi et al., 2016) to study the predictive effect of script knowledge. InScript is a crowdsourced corpus of simple narrative texts. Participants were asked to write about a specific activity (e.g., a restaurant visit, a bus ride, or a grocery shopping event) which they personally experienced, and they were instructed to tell the story as if explaining the activity to a child. This resulted in stories that are centered around a specific scenario and that explicitly mention mundane details. Thus, they generally realize longer event chains associated with a single script, which makes them particularly appropriate for our purpose.

The InScript corpus is labelled with event-type, participant-type, and coreference information. Full verbs are labeled with event type information, and heads of all noun phrases with participant types, using scenario-specific lists of event types (such as enter bathroom, close drain and fill water for the "taking a bath" scenario) and participant types (such as bather, water and bathtub). On average, each template offers a choice of 20 event types and 18 participant types.

(I)^1 decided to take a (bath)^2 yesterday afternoon after working out. Once (I)^1 got back home, (I)^1 walked to (my)^1 (bathroom)^3 and first quickly scrubbed the (bathroom tub)^4 by turning on the (water)^5 and rinsing (it)^4 clean with a rag. After (I)^1 finished, (I)^1 plugged XXXXXX

Figure 2: An illustration of the Mechanical Turk experiment for the referent cloze task. Workers are supposed to guess the upcoming referent (indicated by XXXXXX above). They can either choose from the previously activated referents, or they can write something new.
Figure 3: Responses of workers for the story in Fig. 2. Workers guessed two already activated discourse referents (DRs): 14 workers chose DR_4 (P_bathtub) and 1 chose DR_1 (P_bather). 5 workers chose the "new" option and wrote different lexical variants of "bathtub drain", a new DR corresponding to the participant type "the drain".

The InScript corpus consists of 910 stories addressing 10 scenarios (about 90 stories per scenario). The corpus has 200,000 words, 12,000 verb instances with event labels, and 44,000 head nouns with participant instances. Modi et al. (2016) report an inter-annotator agreement of 0.64 for event types and 0.77 for participant types (Fleiss' kappa).

We use the gold-standard event- and participant-type annotation to study the influence of script knowledge on the expectation of discourse referents. In addition, InScript provides coreference annotation, which makes it possible to keep track of the mentioned discourse referents at each point in the story. We use this information in the computational model of DR prediction and in the DR guessing experiment described in the next section. An example of an annotated InScript story is shown in Figure 1.

3 Referent Cloze Task

We use the InScript corpus to develop computational models for the prediction of discourse referents (DRs) and to evaluate their prediction accuracy. This can be done by testing how often our models manage to reproduce the original discourse referent (cf. also the "narrative cloze" task of Chambers and Jurafsky (2008), which tests whether a verb together with a role can be correctly guessed by a model). However, we do not only want to predict the "correct" DRs in a text but also to model human expectation of DRs in context. To empirically assess human expectation, we created an additional database of crowdsourced human predictions of discourse referents in context using Amazon Mechanical Turk. The design of our experiment closely resembles the guessing game of Tily and Piantadosi (2009) but extends it in a substantial way.

Workers had to read stories of the InScript corpus (available at http://www.sfb1102.uni-saarland.de/?page_id=2582) and guess upcoming participants: for each target NP, workers were shown the story up to this NP, excluding the NP itself, and they were asked to guess the next person or object most likely to be referred to. In case they decided in favour of a discourse referent already mentioned, they had to choose among the available discourse referents by clicking an NP in the preceding text, i.e., some noun with a specific, coreference-indicating color; see Figure 2. Otherwise, they would click the "New" button and would in turn be asked to give a short description of the new person or object they expected to be mentioned. The percentage of guesses that agree with the actually referred entity was taken as a basis for estimating the surprisal. The experiment was done for all stories of the test set: 182 stories (20%) of the InScript corpus, evenly taken from all scenarios. Since our focus is on the effect of script knowledge, we only considered those NPs as targets that are direct dependents of script-related events. Guessing started from the third sentence only, in order to ensure that a minimum of context information was available.
To keep the complexity of the context manageable, we restricted guessing to a maximum of 30 targets and skipped the rest of the story (this applied to 12% of the stories). We collected 20 guesses per NP for 3346 noun phrase instances, which amounts to a total of around 67K guesses. Workers selected a context NP in 68% of cases and "New" in 32% of cases.

Our leading hypothesis is that script knowledge substantially influences human expectation of discourse referents. The guessing experiment provides a basis to estimate human expectation of already mentioned DRs (the number of clicks on the respective NPs in the text). However, we expect that script knowledge has a particularly strong influence in the case of first mentions. Once a script is evoked in a text, we assume that the full script structure, including all participants, is activated and available to the reader.

Tily and Piantadosi (2009) are interested in second mentions only and therefore do not make use of the worker-generated noun phrases classified as "New". To study the effect of activated but not explicitly mentioned participants, we carried out a subsequent annotation step on the worker-generated noun phrases classified as "New". We presented annotators with these noun phrases in their contexts (with co-referring NPs marked by color, as in the M-Turk experiment) and, in addition, displayed all participant types of the relevant script (i.e., the script associated with the text in the InScript corpus). Annotators did not see the "correct" target NP. We asked annotators to either (1) select the participant type instantiated by the NP (if any), (2) label the NP as unrelated to the script, or (3) link the NP to an overt antecedent in the text, in the case that the NP is actually a second mention that had been erroneously labeled as new by the worker. Option (1) provides a basis for a fine-grained estimation of first-mention DRs. Option (3), which we added when we noticed the considerable number of overlooked antecedents, serves as a correction of the results of the M-Turk experiment. Out of the 22K annotated "New" cases, 39% were identified as second mentions, 55% were linked to a participant type, and 6% were classified as really novel.
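As a concrete illustration of how the guess data can be turned into probability and surprisal estimates (the same unsmoothed relative frequencies are later used for the human perplexity figures in Section 4.3), here is a minimal sketch. The function name and the input format for the guesses are our own illustrative assumptions, not part of the released data.

```python
import math
from collections import Counter

def surprisal_from_guesses(guesses, realized_dr):
    """Estimate the surprisal of the discourse referent that was
    actually realized in the story, from the worker guesses collected
    for one target NP (20 guesses per NP in our experiment).

    guesses: list of referent labels chosen by workers,
             e.g. ["DR_4", "DR_4", "new:drain", "DR_1", ...]
    realized_dr: label of the referent used in the original story.
    """
    counts = Counter(guesses)
    p = counts[realized_dr] / len(guesses)   # unsmoothed relative frequency
    return -math.log(p) if p > 0 else float("inf")

# Example mirroring Figure 3: 14 workers chose DR_4, 5 proposed a new
# "drain" referent, and 1 chose DR_1.
guesses = ["DR_4"] * 14 + ["new:drain"] * 5 + ["DR_1"]
print(surprisal_from_guesses(guesses, "DR_4"))   # -log(14/20), about 0.357
```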
4 Referent Prediction Model

In this section, we describe the model we use to predict upcoming discourse referents (DRs).

4.1 Model

Our model should not only assign probabilities to DRs already explicitly introduced in the preceding text fragment (e.g., "bath" or "bathroom" for the cloze task in Figure 2) but also reserve some probability mass for 'new' DRs, i.e., DRs activated via the script context or completely novel ones not belonging to the script. In principle, different variants of the activation mechanism must be distinguished. For many participant types, a single participant belonging to a specific semantic class is expected (referred to with "the bathtub" or "the soap"). In contrast, the "towel" participant type may activate a set of objects, elements of which can then be referred to with "a towel" or "another towel". The "bath means" participant type may even activate a group of DRs belonging to different semantic classes (e.g., bubble bath and salts). Since it is not feasible to enumerate all potential participants, for 'new' DRs we only predict their participant type ("bath means" in our example).

In other words, the number of categories in our model is equal to the number of previously introduced DRs, plus the number of participant types of the script, plus 1, reserved for a new DR not corresponding to any script participant (e.g., cellphone). In what follows, we slightly abuse the terminology and refer to all these categories as discourse referents.

Unlike standard co-reference models, which predict co-reference chains relying on the entire document, our model is incremental; that is, when predicting a discourse referent d^(t) at a given position t, it can look only at the history h^(t) (i.e., the preceding part of the document), excluding the referring expression (RE) for the predicted DR. We also assume that past REs are correctly resolved and assigned to correct participant types (PTs). Typical NLP applications use automatic coreference resolution systems, but since we want to model human behavior, this might be inappropriate, since an automated system would underestimate human performance. This may be a strong assumption, but for the reasons explained above, we use gold standard past REs.

We use the following log-linear model ("softmax regression"):

p(d^(t) = d | h^(t)) = exp(w^T f(d, h^(t))) / Σ_{d'} exp(w^T f(d', h^(t))),

where f is the feature function we will discuss in the following subsection, w are model parameters, and the summation in the denominator is over the set of categories described above.

Some of the features included in f are a function of the predicate syntactically governing the unobservable target RE (corresponding to the DR being predicted). However, in our incremental setting, the predicate is not available in the history h^(t) for subject NPs. In this case, we use an additional probabilistic model, which estimates the probability of the predicate v given the context h^(t), and marginalize out its predictions:

p(d^(t) = d | h^(t)) = Σ_v p(v | h^(t)) exp(w^T f(d, h^(t), v)) / Σ_{d'} exp(w^T f(d', h^(t), v)).

The predicate probabilities p(v | h^(t)) are computed based on the sequence of preceding predicates (i.e., ignoring any other words) using a recurrent neural network language model estimated on our training set (we used the RNNLM toolkit (Mikolov et al., 2011; Mikolov et al., 2010) with default settings). The expression f(d, h^(t), v) denotes the feature function computed for the referent d, given the history composed of h^(t) and the predicate v.
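To make the shape of the model concrete, here is a minimal sketch of the softmax scoring and of the marginalization over unseen predicates, assuming feature vectors are precomputed with numpy; the function and variable names are illustrative, not from the original implementation.

```python
import numpy as np

def referent_distribution(F, w):
    """Softmax regression over candidate categories (previously
    introduced DRs, one category per script participant type, and one
    'novel' category).

    F: array of shape (num_candidates, num_features); row k is f(d_k, h_t).
    w: learned parameter vector of shape (num_features,).
    """
    scores = F @ w
    scores -= scores.max()              # for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def referent_distribution_marginalized(feature_fn, candidates, verbs, p_verbs, w):
    """Subject position: the governing predicate is not yet observed,
    so we average the softmax predictions under p(v | h_t), here assumed
    to come from an RNN language model over the predicate sequence."""
    p = np.zeros(len(candidates))
    for v, p_v in zip(verbs, p_verbs):
        F = np.array([feature_fn(d, v) for d in candidates])
        p += p_v * referent_distribution(F, w)
    return p
```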
4.2 Features

Our features encode properties of a DR as well as characterize its compatibility with the context. We face two challenges when designing our features. First, although the sizes of our datasets are respectable from the script annotation perspective, they are too small to learn a richly parameterized model. For many of our features, we address this challenge by using external word embeddings (300-dimensional word embeddings estimated on Wikipedia with the skip-gram model of Mikolov et al. (2013): https://code.google.com/p/word2vec/) and associate parameters with some simple similarity measures computed using these embeddings. Consequently, there are only a few dozen parameters which need to be estimated from scenario-specific data. Second, in order to test our hypothesis that script information is beneficial for the DR prediction task, we need to disentangle the influence of script information from general linguistic knowledge. We address this by carefully splitting the features apart, even if it prevents us from modeling some interplay between the sources of information. We describe both classes of features below; see also the summary in Table 1.

Table 1: Summary of feature types.

Feature                  Type
Recency                  Shallow Linguistic
Frequency                Shallow Linguistic
Grammatical function     Shallow Linguistic
Previous subject         Shallow Linguistic
Previous object          Shallow Linguistic
Previous RE type         Shallow Linguistic
Selectional preferences  Linguistic
Participant type fit     Script
Predicate schemas        Script

4.2.1 Shallow Linguistic Features

These features are based on Tily and Piantadosi (2009). In addition, we consider a selectional preference feature.

Recency feature. This feature captures the distance l_t(d) between the position t and the last occurrence of the candidate DR d. As a distance measure, we use the number of sentences from the last mention and exponentiate this number to make the dependence more extreme; only very recent DRs will receive a noticeable weight: exp(−l_t(d)). This feature is set to 0 for new DRs.

Frequency. The frequency feature indicates the number of times the candidate discourse referent d has been mentioned so far. We do not perform any bucketing.

Grammatical function. This feature encodes the dependency relation assigned to the head word of the last mention of the DR, or a special "none" label if the DR is new.

Previous subject indicator. This binary feature indicates whether the candidate DR d is coreferential with the subject of the previous verbal predicate.

Previous object indicator. The same, but for the object position.

Previous RE type. This three-valued feature indicates whether the previous mention of the candidate DR d is a pronoun, a non-pronominal noun phrase, or has never been observed before.
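The shallow features are straightforward to compute from the mention history; the sketch below shows one possible encoding (the data structures and names are our own assumptions for illustration).

```python
import math

def shallow_features(dr, sent_idx, mentions, prev_subject, prev_object):
    """Shallow linguistic features for candidate DR `dr` at a target
    position in sentence `sent_idx`.

    mentions: dict mapping a DR to the list of its previous mentions,
              each a dict with keys 'sentence', 'deprel', 'is_pronoun'.
    prev_subject / prev_object: DR filling the subject/object slot of
              the previous verbal predicate (or None).
    """
    history = mentions.get(dr, [])
    if not history:                      # new DR
        return {"recency": 0.0, "frequency": 0, "gram_fun": "none",
                "prev_subj": 0, "prev_obj": 0, "prev_re_type": "unseen"}
    last = history[-1]
    return {
        # exp(-distance in sentences): only very recent DRs get weight
        "recency": math.exp(-(sent_idx - last["sentence"])),
        "frequency": len(history),       # raw count, no bucketing
        "gram_fun": last["deprel"],
        "prev_subj": int(dr == prev_subject),
        "prev_obj": int(dr == prev_object),
        "prev_re_type": "pronoun" if last["is_pronoun"] else "full_np",
    }
```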
4.2.2 Selectional Preferences Feature

The selectional preference feature captures how well the candidate DR d fits a given syntactic position r of a given verbal predicate v. It is computed as the cosine similarity sim_cos(x_d, x_{v,r}) of a vector-space representation of the DR, x_d, and a structured vector-space representation of the predicate, x_{v,r}. The similarities are calculated using a Distributional Memory approach similar to that of Baroni and Lenci (2010). Their structured vector-space representation has been shown to work well on tasks that evaluate correlation with human thematic fit estimates (Baroni and Lenci, 2010; Baroni et al., 2014; Sayeed et al., 2016) and is thus suited to our task.

The representation x_d is computed as an average of the head-word representations of all the previous mentions of DR d, where the word vectors are obtained from the TypeDM model of Baroni and Lenci (2010). This is a count-based, third-order co-occurrence tensor whose indices are a word w_0, a second word w_1, and a complex syntactic relation r, which is used as a stand-in for a semantic link. The values for each (w_0, r, w_1) cell of the tensor are the local mutual information (LMI) estimates obtained from a dependency-parsed combination of large corpora (ukWaC, BNC, and Wikipedia).

Our procedure has some differences from that of Baroni and Lenci. For example, for estimating the fit of an alternative new DR (in other words, x_d based on no previous mentions), we use an average over the head words of all REs in the training set, a "null referent". x_{v,r} is calculated as the average of the top 20 (by LMI) r-fillers for v in TypeDM; in other words, the prototypical instrument of "rub" may be represented by summing vectors like towel, soap, eraser, coin, and so on. If the predicate has not yet been encountered (as for subject positions), scores for all scenario-relevant verbs are emitted for marginalization.

4.2.3 Script Features

In this section, we describe features which rely on script information. Our goal will be to show that such common-sense information is beneficial in performing DR prediction. We consider only two script features.

Participant type fit. This feature characterizes how well the participant type (PT) of the candidate DR d fits a specific syntactic role r of the governing predicate v; it can be regarded as a generalization of the selectional preference feature to participant types, and also as its specialisation to the considered scenario. Given the candidate DR d, its participant type p, and the syntactic relation r, we collect all the predicates in the training set which have the participant type p in the position r. The embedding x_{p,r} of the DR's participant type is given by the average embedding of these predicates. The feature is computed as the dot product of x_{p,r} and the word embedding of the predicate v.
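Both similarity-based features reduce to simple vector operations over averaged embeddings. A minimal sketch under those assumptions (the lookup tables and names are illustrative):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def selectional_preference(dr_head_vecs, top_filler_vecs):
    """sim_cos(x_d, x_{v,r}): x_d averages the head-word vectors of the
    DR's previous mentions (or of all training REs for the 'null
    referent'); x_{v,r} averages the top-20 (by LMI) r-fillers of v
    taken from TypeDM."""
    x_d = np.mean(dr_head_vecs, axis=0)
    x_vr = np.mean(top_filler_vecs, axis=0)
    return cosine(x_d, x_vr)

def participant_type_fit(pt_predicate_vecs, predicate_vec):
    """Dot product of x_{p,r} (average embedding of training predicates
    that take participant type p in role r) with the embedding of the
    current predicate v."""
    x_pr = np.mean(pt_predicate_vecs, axis=0)
    return float(x_pr @ predicate_vec)
```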
(I)^1 decided to take a (bath)^2 yesterday afternoon after working out. (I)^1 was getting ready to go out and needed to get cleaned before (I)^1 went, so (I)^1 decided to take a (bath)^2. (I)^1 filled the (bathtub)^3 with warm (water)^4 and added some (bubble bath)^5. (I)^1 got undressed and stepped into the (water)^4. (I)^1 grabbed the (soap)^6 and rubbed it on (my)^1 (body)^7 and rinsed XXXXXX

Figure 4: An example of the referent cloze task. As in the Mechanical Turk experiment (Figure 2), our referent prediction model is asked to guess the upcoming DR.

Predicate schemas. The following feature captures a specific aspect of knowledge about prototypical sequences of events. This knowledge is called predicate schemas in the recent co-reference modeling work of Peng et al. (2015). In predicate schemas, the goal is to model pairs of events such that if a DR d participated in the first event (in a specific role), it is likely to participate in the second event (again, in a specific role). For example, in the restaurant scenario, if one observes a phrase "John ordered", one is likely to see "John waited" somewhere later in the document. The specific arguments are not that important (whether it is John or some other DR); what is important is that the argument is reused across the predicates. This would correspond to the rule X-subject-of-order → X-subject-of-eat. (In this work, we limit ourselves to rules where the syntactic function is the same on both sides of the rule. In other words, we can, in principle, encode the pattern "X pushed Y → X apologized" but not the pattern "X pushed Y → Y cried".) Unlike the previous work, our dataset is small, so we cannot induce these rules directly, as there would be very few rules and the model would not generalize well enough to new data. Instead, we again encode this intuition using similarities in the real-valued embedding space.

Recall that our goal is to compute a feature φ(d, h^(t)) indicating how likely a potential DR d is to follow, given the history h^(t). For example, imagine that the model is asked to predict the DR marked by XXXXXX in Figure 4. Predicate-schema rules can only yield previously introduced DRs, so the score φ(d, h^(t)) = 0 for any new DR d. Let us use "soap" as an example of a previously introduced DR and see how the feature is computed. In order to choose which inference rules can be applied to yield "soap", we can inspect Figure 4. There are only two preceding predicates which have the DR "soap" as their object (rubbed and grabbed), resulting in two potential rules: X-object-of-grabbed → X-object-of-rinsed and X-object-of-rubbed → X-object-of-rinsed. We define the score φ(d, h^(t)) as the average of the rule scores. More formally, we can write

φ(d, h^(t)) = (1 / |N(d, h^(t))|) Σ_{(u,v,r) ∈ N(d, h^(t))} ψ(u, v, r),   (1)

where ψ(u, v, r) is the score for a rule X-r-of-u → X-r-of-v, N(d, h^(t)) is the set of applicable rules, and |N(d, h^(t))| denotes its cardinality. (In all our experiments, rather than considering all potential predicates in the history to instantiate rules, we take into account only the 2 preceding verbs. In other words, u and v can be interleaved by at most one verb, and |N(d, h^(t))| is in {0, 1, 2}.) We define φ(d, h^(t)) as 0 when the set of applicable rules is empty (i.e., |N(d, h^(t))| = 0).

We define the scoring function ψ(u, v, r) as a linear function of a joint embedding x_{u,v} of the verbs u and v:

ψ(u, v, r) = α_r^T x_{u,v}.

The two remaining questions are (1) how to define the joint embeddings x_{u,v}, and (2) how to estimate the parameter vector α_r. The joint embedding of two predicates, x_{u,v}, can, in principle, be any composition function of the embeddings of u and v, for example their sum or component-wise product. Inspired by Bordes et al. (2013), we use the difference between the word embeddings:

ψ(u, v, r) = α_r^T (x_u − x_v),

where x_u and x_v are external embeddings of the corresponding verbs. Encoding the succession relation as a translation in the embedding space has one desirable property: the scoring function will be largely agnostic to the morphological form of the predicates. For example, the difference between the embeddings of "rinsed" and "rubbed" is very similar to that of "rinse" and "rub" (Botha and Blunsom, 2014), so the corresponding rules will receive similar scores.

Now we can rewrite equation (1) as

φ(d, h^(t)) = α_{r(h^(t))}^T [ Σ_{(u,v,r) ∈ N(d, h^(t))} (x_u − x_v) ] / |N(d, h^(t))|,   (2)

where r(h^(t)) denotes the syntactic function corresponding to the DR being predicted (object in our example). As for the parameter vector α_r, there are again a number of potential ways in which it can be estimated. For example, one can train a discriminative classifier to estimate the parameters.
However, we opted for a simpler approach: we set it equal to the empirical estimate of the expected feature vector x_{u,v} on the training set (this essentially corresponds to using the Naive Bayes model with the simplistic assumption that the score differences are normally distributed with spherical covariance matrices):

α_r = (1 / D_r) Σ_{l,t} δ_r(r(h^(l,t))) Σ_{(u,v,r') ∈ N(d^(l,t), h^(l,t))} (x_u − x_v),   (3)

where l refers to a document in the training set, t is (as before) a position in the document, and h^(l,t) and d^(l,t) are the history and the correct DR for this position, respectively. The term δ_r(r') is the Kronecker delta, which equals 1 if r = r' and 0 otherwise. D_r is the total number of rules for the syntactic function r in the training set:

D_r = Σ_{l,t} δ_r(r(h^(l,t))) × |N(d^(l,t), h^(l,t))|.

Let us illustrate the computation with an example. Imagine that our training set consists of the document in Figure 1, and the trained model is used to predict the upcoming DR in our referent cloze example (Figure 4). The training document includes the pair X-object-of-scrubbed → X-object-of-rinsing, so the corresponding term (x_scrubbed − x_rinsing) participates in the summation (3) for α_obj. As we rely on external embeddings, which encode semantic similarities between lexical items, the dot product of this term and (x_rubbed − x_rinsed) will be high. (The score would have been even higher had the predicate been in the morphological form "rinsing" rather than "rinsed"; however, the embeddings of "rinsing" and "rinsed" are still sufficiently close to each other for our argument to hold.) Consequently, φ(d, h^(t)) is expected to be positive for d = "soap", thus predicting "soap" as the likely forthcoming DR.

Unfortunately, there are other terms (x_u − x_v), both in expression (3) for α_obj and in expression (2) for φ(d, h^(t)), which may be irrelevant to the current prediction, such as X-object-of-plugged → X-object-of-filling from Figure 1, and which may not even encode any valid regularities, such as X-object-of-got → X-object-of-scrubbed (again from Figure 1). This may suggest that our feature will be too contaminated with noise to be informative for making predictions. However, recall that independent random vectors in high dimensions are almost orthogonal and, assuming they are bounded, their dot products are close to zero. Consequently, the products of the relevant ("non-random") terms, in our example (x_scrubbed − x_rinsing) and (x_rubbed − x_rinsed), are likely to overcome the ("random") noise. As we will see in the ablation studies, the predicate-schema feature is indeed predictive of a DR and contributes to the performance of the full model.
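Putting equations (1)–(3) together, the feature and its parameter estimate are small averaging operations over embedding differences. A minimal sketch, assuming a word-to-vector lookup and rule pairs extracted from the two preceding verbs (all names are illustrative):

```python
import numpy as np

def predicate_schema_feature(alpha_r, rule_pairs, embed):
    """phi(d, h_t) from equation (2): the average score of applicable
    rules X-r-of-u -> X-r-of-v, with psi(u, v, r) = alpha_r . (x_u - x_v).

    rule_pairs: list of (u, v) predicate pairs from the history whose
                r-argument is the candidate DR; with the two-preceding-
                verbs restriction this list has 0, 1 or 2 elements.
    embed: dict mapping a word to its external embedding vector.
    """
    if not rule_pairs:
        return 0.0
    diffs = [embed[u] - embed[v] for (u, v) in rule_pairs]
    return float(alpha_r @ np.mean(diffs, axis=0))

def estimate_alpha_r(gold_rule_pairs, embed):
    """Empirical estimate of alpha_r (equation 3): the average of
    (x_u - x_v) over all gold rules with syntactic function r observed
    in the training documents."""
    diffs = [embed[u] - embed[v] for (u, v) in gold_rule_pairs]
    return np.mean(diffs, axis=0)
```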
4.3 Experiments

We would like to test whether our model can produce accurate predictions and whether the model's guesses correlate well with human predictions for the referent cloze task. In order to be able to evaluate the effect of script knowledge on referent predictability, we compare three models: our full Script model uses all of the features introduced in Section 4.2; the Linguistic model relies only on the 'linguistic features' but not the script-specific ones; and the Base model includes only the shallow linguistic features. The Base model differs from the Linguistic model in that it does not model selectional preferences. Table 2 summarizes the features used in the different models.

Table 2: Summary of model features.

Model       Feature Types                                          Features
Base        Shallow Linguistic Features                            Recency, Frequency, Grammatical function, Previous subject, Previous object
Linguistic  Shallow Linguistic Features + Linguistic Feature       as above + Selectional Preferences
Script      Shallow Linguistic + Linguistic + Script Features      as above + Participant type fit, Predicate schemas

The data set was randomly divided into training (70%), development (10%; 91 stories from 10 scenarios), and test (20%; 182 stories from 10 scenarios) sets. The feature weights were learned using L-BFGS (Byrd et al., 1995) to optimize the log-likelihood.

Evaluation against original referents. We calculated the percentage of correct DR predictions; see Table 3 for the averages across the 10 scenarios. We can see that the task appears hard even for humans: their average performance reaches only 73% accuracy. As expected, the Base model is the weakest system (an accuracy of 31%). Modeling selectional preferences yields an extra 18% in accuracy (Linguistic model). The key finding is that the incorporation of script knowledge increases the accuracy by a further 13%, although still far behind human performance (62% vs. 73%). Besides accuracy, we use perplexity, which we computed not only for all our models but also for human predictions. This was possible as each task was solved by multiple humans; we used unsmoothed normalized guess frequencies as the probabilities. As we can see from Table 3, the perplexity scores are consistent with the accuracies: the Script model again outperforms the other methods, and, as expected, all the models are weaker than humans.

Table 3: Accuracies (in %) and perplexities for the different models and scenarios. The Script model substantially outperforms the Linguistic and Base models (p < 0.001, significance tested with McNemar's test (Everitt, 1992)). As expected, the human prediction model outperforms the Script model (p < 0.001, McNemar's test).

                                 Human          Script Model   Linguistic Model  Base (Tily) Model
Scenario                         Acc.   Perp.   Acc.   Perp.   Acc.   Perp.      Acc.   Perp.
Grocery Shopping                 74.80  2.13    68.17  3.16    53.85  6.54       32.89  24.48
Repairing a flat bicycle tyre    78.34  2.72    62.09  3.89    51.26  6.38       29.24  19.08
Riding a public bus              72.19  2.28    64.57  3.67    52.65  6.34       32.78  23.39
Getting a haircut                71.06  2.45    58.82  3.79    42.82  7.11       28.70  15.40
Planting a tree                  71.86  2.46    59.32  4.25    47.80  7.31       28.14  24.28
Borrowing book from library      77.49  1.93    64.07  3.55    43.29  8.40       33.33  20.26
Taking bath                      81.29  1.84    67.42  3.14    61.29  4.33       43.23  16.33
Going on a train                 70.79  2.39    58.73  4.20    47.62  7.68       30.16  35.11
Baking a cake                    76.43  2.16    61.79  5.11    46.40  9.16       24.07  23.67
Flying in an airplane            62.04  3.08    61.31  4.01    48.18  7.27       30.90  30.18
Average                          73.63  2.34    62.63  3.88    49.52  7.05       31.34  23.22

As we used two sets of script features, capturing different aspects of script knowledge, we performed extra ablation studies (Table 4). The experiments confirm that both feature sets were beneficial.

Table 4: Accuracies and perplexities from the ablation experiments.

Model                                    Accuracy  Perplexity
Linguistic Model                         49.52     7.05
Linguistic Model + Predicate Schemas     55.44     5.88
Linguistic Model + Participant type fit  58.88     4.29
Full Script Model (both features)        62.63     3.88
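For reference, the perplexity figures in Tables 3 and 4 can be computed from the probability each predictor assigned to the referent actually realized at each target position; a short sketch (the function name is our own):

```python
import math

def perplexity(probs_of_realized):
    """Perplexity over the test positions, where probs_of_realized[i]
    is the probability the model (or the normalized human guess
    distribution) assigned to the referent actually realized at
    position i."""
    avg_log = sum(math.log(p) for p in probs_of_realized) / len(probs_of_realized)
    return math.exp(-avg_log)
```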
Evaluation against human expectations. In the previous subsection, we demonstrated that the incorporation of selectional preferences and, perhaps more interestingly, the integration of automatically acquired script knowledge lead to improved accuracy in predicting discourse referents. Now we turn to another question raised in the introduction: does the incorporation of this knowledge make our predictions more human-like? In other words, are we able to accurately estimate human expectations? This includes not only being sufficiently accurate but also making the same kind of incorrect predictions. In this evaluation, we therefore use the human guesses collected during the referent cloze task as our target, and calculate the relative accuracy of each computational model.

As can be seen in Figure 5, the Script model, at approximately 53% accuracy, is much more accurate in predicting human guesses than the Linguistic model and the Base model. We can also observe that the margin between the Script model and the Linguistic model is much larger in this evaluation than that between the Base model and the Linguistic model. This indicates that the model which has access to script knowledge is much more similar to human prediction behavior in terms of top guesses than the script-agnostic models.

Figure 5: Average relative accuracies of the different models w.r.t. human predictions (Script: 52.9%, Linguistic: 38.4%, Base: 34.52%).

Now we would like to assess whether our predictions are similar to the human ones as distributions, rather than only yielding similar top predictions. In order to compare the distributions, we use the Jensen-Shannon divergence (JSD), a symmetrized version of the Kullback-Leibler divergence. Intuitively, JSD measures the distance between two probability distributions; a smaller JSD value is indicative of more similar distributions. Figure 6 shows that the probability distributions resulting from the Script model are more similar to the human predictions than those of the Linguistic and Base models.

Figure 6: Average Jensen-Shannon divergence between human predictions and the models (Script: 0.5, Linguistic: 0.57, Base: 0.66).

In these experiments, we have shown that script knowledge improves predictions of upcoming referents and that the Script model is the best among our models at approximating human referent predictions.
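The JSD between a model's predicted distribution and the normalized human guess distribution over the same candidate set can be computed as follows (a standard base-2 formulation; pairing the two distributions per target NP is our assumption about the setup):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; assumes q > 0
    wherever p > 0 (guaranteed below, since q is a mixture with p)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jensen_shannon(p, q):
    """Symmetrized, smoothed KL divergence between two distributions
    over the same candidate referents (model vs. human guesses)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Example: model vs. human distribution over three candidates.
print(jensen_shannon([0.7, 0.2, 0.1], [0.5, 0.25, 0.25]))
```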
The present study hence contributes to the debate on whether UID affects referring expression choice.

5 Referring Expression Type Prediction Model (RE Model)

Using the referent prediction models, we next attempt to replicate Tily and Piantadosi's finding that the choice of the type of referring expression (pronoun or full NP) depends in part on the predictability of the referent.

5.1 Uniform Information Density Hypothesis

The uniform information density (UID) hypothesis suggests that speakers tend to convey information at a uniform rate (Jaeger, 2010). Applied to the choice of referring expression type, it would predict that a highly predictable referent should be encoded using a short code (here: a pronoun), while an unpredictable referent should be encoded using a longer form (here: a full NP). Information density is measured using the information-theoretic measure of the surprisal S of a message m_i:

S(m_i) = −log P(m_i | context).

UID has been very successful in explaining a variety of linguistic phenomena; see Jaeger et al. (2016). There is, however, controversy about whether UID affects pronominalization. Tily and Piantadosi (2009) report evidence that writers are more likely to refer using a pronoun or proper name when the referent is easy to guess, and use a full NP when readers have less certainty about the upcoming referent; see also Arnold (2001). But other experiments (using highly controlled stimuli) have failed to find an effect of predictability on pronominalization (Stevenson et al., 1994; Fukumura and van Gompel, 2010; Rohde and Kehler, 2014).

5.2 A Model of Referring Expression Choice

Our goal is to determine whether referent predictability (quantified in terms of surprisal) is correlated with the type of referring expression used in the text. Here we focus on the distinction between pronouns and full noun phrases. Our data also contains a small percentage (ca. 1%) of proper names (like "John"). Due to this small class size and earlier findings that proper nouns behave much like pronouns (Tily and Piantadosi, 2009), we combined pronouns and proper names into a single class of short encodings.

For the referring expression type prediction task, we estimate the surprisal of the referent from each of our computational models from Section 4, as well as from the human cloze task. The surprisal of an upcoming discourse referent d^(t) based on the previous context h^(t) is thereby estimated as:

S(d^(t)) = −log p(d^(t) | h^(t)).

In order to determine whether referent predictability has an effect on referring expression type over and above other factors that are known to affect the choice of referring expression, we train a logistic regression model with referring expression type as the response variable and discourse referent predictability, as well as a large set of other linguistic factors (based on Tily and Piantadosi, 2009), as explanatory variables. The model is defined as follows:

p(n^(t) = n | d^(t), h^(t)) = exp(v^T g(n, d^(t), h^(t))) / Σ_{n'} exp(v^T g(n', d^(t), h^(t))),

where d^(t) and h^(t) are defined as before, g is the feature function, and v is the vector of model parameters. The summation in the denominator is over NP types (full NP vs. pronoun/proper noun).

5.3 RE Model Experiments

We ran four different logistic regression models. These models all contained exactly the same set of linguistic predictors but differed in the estimates used for referent type surprisal and residual entropy. One logistic regression model used surprisal estimates based on the human referent cloze task, while the three other models used estimates based on the three computational models (Base, Linguistic and Script). For our experiment, we are interested in the choice of referring expression type for those occurrences of references where a "real choice" is possible. We therefore exclude from the analysis reported below all first mentions as well as all first and second person pronouns (because there is no optionality in how to refer to the first or second person). This subset contains 1345 data points.
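Since the response is binary (full NP vs. pronoun/proper name), the model in Section 5.2 reduces to standard binary logistic regression. A sketch of the probability computation with surprisal as one predictor (the feature layout, values, and names are our own illustration, not the exact predictor coding of the study):

```python
import numpy as np

def p_short_form(g, v):
    """p(pronoun or proper name | d_t, h_t), with full NP as the base
    class. g is the predictor vector for one mention (intercept term
    g[0] = 1, then recency, frequency, pastSubj/pastObj indicators, ...,
    surprisal, residualEntropy); v is the fitted coefficient vector."""
    return 1.0 / (1.0 + np.exp(-float(v @ g)))

# Hypothetical example: a recent subject mention with low surprisal
# should yield a high pronoun probability under a positive recency
# coefficient.
g = np.array([1.0, 0.9, 3.0, 1.0, 0.0, 0.4])     # hypothetical predictors
v = np.array([-3.4, 1.3, 0.1, 2.9, 0.4, -0.04])  # hypothetical coefficients
print(p_short_form(g, v))                        # about 0.72
```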
5.4 Results

The results of all four logistic regression models are shown in Table 5.

Table 5: Coefficients obtained from the regression analysis for the different models. Two NP types were considered, full NP and pronoun/proper noun, with full NP as the base class. Significance: '***' < 0.001, '**' < 0.01, '*' < 0.05, '.' < 0.1.

Estimates:
Predictor        Human   Script  Linguistic  Base
(Intercept)      -3.400  -3.418  -3.245      -3.061
recency           1.322   1.322   1.324       1.322
frequency         0.097   0.103   0.112       0.114
pastObj           0.407   0.396   0.423       0.395
pastSubj         -0.967  -0.973  -0.909      -0.926
pastExpPronoun    1.603   1.619   1.616       1.602
depTypeSubj       2.939   2.942   2.656       2.417
depTypeObj        1.199   1.227   0.977       0.705
surprisal        -0.040  -0.006   0.002      -0.131
residualEntropy  -0.009   0.023  -0.141      -0.128

Standard errors:
Predictor        Human   Script  Linguistic  Base
(Intercept)      0.244   0.279   0.321       0.791
recency          0.095   0.095   0.096       0.097
frequency        0.098   0.097   0.098       0.102
pastObj          0.293   0.294   0.295       0.300
pastSubj         0.559   0.564   0.562       0.565
pastExpPronoun   0.210   0.207   0.208       0.245
depTypeSubj      0.299   0.347   0.429       1.113
depTypeObj       0.248   0.306   0.389       1.109
surprisal        0.099   0.097   0.117       0.387
residualEntropy  0.088   0.128   0.168       0.258

p-values, Pr(>|z|):
Predictor        Human         Script        Linguistic    Base
(Intercept)      <2e-16 ***    <2e-16 ***    <2e-16 ***    0.00011 ***
recency          <2e-16 ***    <2e-16 ***    <2e-16 ***    <2e-16 ***
frequency        0.317         0.289         0.251         0.262
pastObj          0.165         0.178         0.151         0.189
pastSubj         0.0838 .      0.0846 .      0.106         0.101
pastExpPronoun   2.19e-14 ***  5.48e-15 ***  7.59e-15 ***  6.11e-11 ***
depTypeSubj      <2e-16 ***    <2e-16 ***    5.68e-10 ***  0.02994 *
depTypeObj       1.35e-06 ***  6.05e-05 ***  0.0119 *      0.525
surprisal        0.684         0.951         0.988         0.735
residualEntropy  0.916         0.859         0.401         0.619

We first take a look at the results for the linguistic features. While there is a bit of variability in the exact coefficient estimates between the models (this is simply due to small correlations between these predictors and the predictors for surprisal), the effect of all of these features is largely consistent across models. For instance, the positive coefficient for the recency feature means that when a previous mention happened very recently, the referring expression is more likely to be a pronoun (and not a full NP).

The coefficients for the surprisal estimates of the different models are, however, not significantly different from zero. Model comparison shows that they do not improve model fit. We also used the estimated models to predict referring expression type on new data and again found that surprisal estimates from the models did not improve prediction accuracy. This effect even holds for our human cloze data. Hence, it cannot be interpreted as a problem with the models: even human predictability estimates are, for this dataset, not predictive of referring expression type.

We also calculated regression models for the full dataset, including first and second person pronouns as well as first mentions (3346 data points). The results for the full dataset are fully consistent with the findings shown in Table 5: there was no significant effect of surprisal on referring expression type.

This result contrasts with the findings of Tily and Piantadosi (2009), who reported a significant effect of surprisal on RE type for their data. In order to replicate their setting as closely as possible, we also included residualEntropy as a predictor in our model (see the last predictor in Table 5); however, this did not change the results.

6 Discussion and Future Work

Our study on incrementally predicting discourse referents showed that script knowledge is a highly important factor in determining human discourse expectations. Crucially, the computational modelling approach allowed us to tease apart the different factors that affect human prediction, as we cannot manipulate these factors in humans directly (by asking them to "switch off" their common-sense knowledge). By modelling common-sense knowledge in terms of event sequences and event participants, our model captures many more long-range dependencies than normal language models. The script knowledge is automatically induced by our model from crowdsourced scenario-specific text collections.

In a second study, we set out to test the hypothesis that uniform information density affects referring expression type.
This question is highly controversial in the literature: while Tily and Piantadosi (2009) find a significant effect of surprisal on referring expression type in a corpus study very similar to ours, other studies that use a more tightly controlled experimental approach have not found an effect of predictability on RE type (Stevenson et al., 1994; Fukumura and van Gompel, 2010; Rohde and Kehler, 2014). The present study, while replicating the setting of T&P exactly in terms of features and analysis, did not find support for a UID effect on RE type. The difference between T&P's 2009 results and ours could be due to the different corpora and text sorts that were used; specifically, we would expect that larger predictability effects might be observable at script boundaries, rather than within a script, as is the case in our stories.

A next step in moving our participant prediction model towards NLP applications would be to replicate our modelling results on automatic text-to-script mapping instead of the gold-standard data used here (in order to approximate the human level of processing). Furthermore, we aim to move to more complex text types that include reference to several scripts. We plan to consider the recently published ROC Stories corpus (Mostafazadeh et al., 2016), a large crowdsourced collection of topically unrestricted short and simple narratives, as a basis for these next steps in our research.

Acknowledgments

We thank the editors and the anonymous reviewers for their insightful suggestions. We would like to thank Florian Pusse for helping with the Amazon Mechanical Turk experiment. We would also like to thank Simon Ostermann and Tatjana Anikina for helping with the InScript corpus. This research was partially supported by the German Research Foundation (DFG) as part of SFB 1102 'Information Density and Linguistic Encoding', by the European Research Council (ERC) as part of ERC Starting Grant BroadSem (#678254), by the Dutch National Science Foundation as part of NWO VIDI 639.022.518, and by the DFG once again as part of the MMCI Cluster of Excellence (EXC 284).

References

Simon Ahrendt and Vera Demberg. 2016. Improving event prediction by representing script participants. In Proceedings of NAACL-HLT.

Jennifer E. Arnold. 2001. The effect of thematic roles on pronoun use and frequency of reference continuation. Discourse Processes, 31(2):137–162.

Marco Baroni and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Proceedings of NIPS.

Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In Proceedings of ICML.

Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. 1995. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208.

Nathanael Chambers and Daniel Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants.
In Proceedings of ACL.

Brian S. Everitt. 1992. The Analysis of Contingency Tables. CRC Press.

Lea Frermann, Ivan Titov, and Manfred Pinkal. 2014. A hierarchical Bayesian model for unsupervised induction of script knowledge. In Proceedings of EACL.

Kumiko Fukumura and Roger P. G. van Gompel. 2010. Choosing anaphoric expressions: Do people take into account likelihood of reference? Journal of Memory and Language, 62(1):52–66.

T. Florian Jaeger, Esteban Buz, Eva M. Fernandez, and Helen S. Cairns. 2016. Signal reduction and linguistic encoding. Handbook of Psycholinguistics. Wiley-Blackwell.

T. Florian Jaeger. 2010. Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology, 61(1):23–62.

Bram Jans, Steven Bethard, Ivan Vulić, and Marie Francine Moens. 2012. Skip n-grams and ranking functions for predicting script events. In Proceedings of EACL.

Gina R. Kuperberg and T. Florian Jaeger. 2016. What do we mean by prediction in language comprehension? Language, Cognition and Neuroscience, 31(1):32–59.

Gina R. Kuperberg. 2016. Separate streams or probabilistic inference? What the N400 can tell us about the comprehension of events. Language, Cognition and Neuroscience, 31(5):602–616.

Marta Kutas, Katherine A. DeLong, and Nathaniel J. Smith. 2011. A look around at what lies ahead: Prediction and predictability in language processing. In Predictions in the Brain: Using Our Past to Generate a Future.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of Interspeech.

Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukar Burget, and Jan Cernocky. 2011. RNNLM - recurrent neural network language modeling toolkit. In Proceedings of the 2011 ASRU Workshop.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.

Ashutosh Modi and Ivan Titov. 2014. Inducing neural models of script knowledge. In Proceedings of CoNLL.

Ashutosh Modi, Tatjana Anikina, Simon Ostermann, and Manfred Pinkal. 2016. InScript: Narrative texts annotated with script information. In Proceedings of LREC.

Ashutosh Modi. 2016. Event embeddings for semantic script modeling. In Proceedings of CoNLL.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of NAACL.

Haoruo Peng, Daniel Khashabi, and Dan Roth. 2015. Solving hard coreference problems. In Proceedings of NAACL.

Karl Pichotta and Raymond J. Mooney. 2014. Statistical script learning with multi-argument events. In Proceedings of EACL.

Altaf Rahman and Vincent Ng. 2012. Resolving complex cases of definite pronouns: the Winograd schema challenge. In Proceedings of EMNLP.

Michaela Regneri, Alexander Koller, and Manfred Pinkal. 2010. Learning script knowledge with web experiments. In Proceedings of ACL.

Hannah Rohde and Andrew Kehler. 2014. Grammatical and information-structural influences on pronoun production. Language, Cognition and Neuroscience, 29(8):912–927.

Rachel Rudinger, Vera Demberg, Ashutosh Modi, Benjamin Van Durme, and Manfred Pinkal. 2015. Learning to predict script events from domain-specific text.
In Proceedings of the International Conference on Lexical and Computational Semantics (*SEM 2015).

Asad Sayeed, Clayton Greenberg, and Vera Demberg. 2016. Thematic fit evaluation: an aspect of selectional preferences. In Proceedings of the Workshop on Evaluating Vector Space Representations for NLP (RepEval 2016).

Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals, and Understanding. Lawrence Erlbaum Associates, Potomac, Maryland.

Simone Schütz-Bosbach and Wolfgang Prinz. 2007. Prospective coding in event representation. Cognitive Processing, 8(2):93–102.

Rosemary J. Stevenson, Rosalind A. Crawley, and David Kleinman. 1994. Thematic roles, focus and the representation of events. Language and Cognitive Processes, 9(4):519–548.

Harry Tily and Steven Piantadosi. 2009. Refer efficiently: Use less informative expressions for more predictable meanings. In Proceedings of the Workshop on the Production of Referring Expressions: Bridging the Gap between Computational and Empirical Approaches to Reference.

Alessandra Zarcone, Marten van Schijndel, Jorrig Vogels, and Vera Demberg. 2016. Salience and attention in surprisal-based accounts of language processing. Frontiers in Psychology, 7:844.