
Experimental Support for a Categorical Compositional Distributional Model of Meaning

Edward Grefenstette
University of Oxford, Department of Computer Science
Wolfson Building, Parks Road, Oxford OX1 3QD, UK
edward.grefenstette@cs.ox.ac.uk

Mehrnoosh Sadrzadeh
University of Oxford, Department of Computer Science
Wolfson Building, Parks Road, Oxford OX1 3QD, UK
mehrs@cs.ox.ac.uk

Abstract

Modelling compositional meaning for sentences using empirical distributional methods has been a challenge for computational linguists. We implement the abstract categorical model of Coecke et al. (2010) using data from the BNC and evaluate it. The implementation is based on unsupervised learning of matrices for relational words and applying them to the vectors of their arguments. The evaluation is based on the word disambiguation task developed by Mitchell and Lapata (2008) for intransitive sentences, and on a similar new experiment designed for transitive sentences. Our model matches the results of its competitors in the first experiment, and betters them in the second. The general improvement in results with increase in syntactic complexity showcases the compositional power of our model.

1 Introduction

As competent language speakers, we humans can almost trivially make sense of sentences we've never seen or heard before. We are naturally good at understanding ambiguous words given a context, and at forming the meaning of a sentence from the meaning of its parts. But while human beings seem comfortable doing this, machines fail to deliver. Search engines such as Google either fall back on bag-of-words models, ignoring syntax and lexical relations, or exploit superficial models of lexical semantics to retrieve pages with terms related to those in the query (Manning et al., 2008). However, such models fail to shine when it comes to processing the semantics of phrases and sentences.
Discovering the process of meaning assignment in natural language is among the most challenging and foundational questions of linguistics and computer science. The findings thereof will increase our understanding of cognition and intelligence and shall assist in applications to automating language-related tasks such as document search.

Compositional type-logical approaches (Montague, 1974; Lambek, 2008) and distributional models of lexical semantics (Schutze, 1998; Firth, 1957) have provided two partial orthogonal solutions to the question. Compositional formal semantic models stem from classical ideas from mathematical logic, mainly Frege's principle that the meaning of a sentence is a function of the meaning of its parts (Frege, 1892). Distributional models are more recent and can be related to Wittgenstein's later philosophy of 'meaning is use', whereby meanings of words can be determined from their context (Wittgenstein, 1953). The logical models relate to well known and robust logical formalisms, hence offering a scalable theory of meaning which can be used to reason inferentially. The distributional models have found their way into real-world applications such as thesaurus extraction (Grefenstette, 1994; Curran, 2004) or automated essay marking (Landauer, 1997), and have connections to semantically motivated information retrieval (Manning et al., 2008). This two-sortedness of the defining properties of meaning, 'logical form' versus 'contextual use', has left the quest for 'what is the foundational structure of meaning?' even more of a challenge.

Recently, Coecke et al. (2010) used high-level cross-disciplinary techniques from logic, category theory, and physics to bring the above two approaches together. They developed a unified mathematical framework whereby a sentence vector is by definition a function of the Kronecker product of its word vectors.
A concrete instantiation of this theory was exemplified on a toy hand-crafted corpus by Grefenstette et al. (2011). In this paper we implement it by training the model over the entire BNC. The highlight of our implementation is that words with relational types, such as verbs, adjectives, and adverbs, are matrices that act on their arguments. We provide a general algorithm for building (or indeed learning) these matrices from the corpus.

The implementation is evaluated against the task provided by Mitchell and Lapata (2008) for disambiguating intransitive verbs, as well as a similar new experiment for transitive verbs. Our model improves on the best method evaluated in Mitchell and Lapata (2008) and offers promising results for the transitive case, demonstrating its scalability in comparison to that of other models. But we still feel there is need for a different class of experiments to showcase the merits of compositionality in a statistically significant manner. Our work shows that the categorical compositional distributional model of meaning permits a practical implementation and that this opens the way to the production of large-scale compositional models.

2 Two Orthogonal Semantic Models

Formal Semantics To compute the meaning of a sentence consisting of n words, the meanings of these words must interact with one another. In formal semantics, this interaction is represented as a function derived from the grammatical structure of the sentence, while the meanings of words are amorphous objects of the domain: no distinction is made between words that have the same type. Such models consist of a pairing of syntactic interpretation rules (in the form of a grammar) with semantic interpretation rules, as exemplified by the simple model presented in Figure 1.
The parse of a sentence such as "cats like milk" typically produces its semantic interpretation by substituting semantic representations for their grammatical constituents and applying β-reduction where needed. Such a derivation is shown in Figure 2.

Syntactic Analysis        Semantic Interpretation
S → NP VP                 |VP|(|NP|)
NP → cats, milk, etc.     |cats|, |milk|, ...
VP → Vt NP                |Vt|(|NP|)
Vt → like, hug, etc.      λyx.|like|(x, y), ...

Figure 1: A simple model of formal semantics.

[Parse tree: λyx.|like|(x, y) combines with |milk| to give λx.|like|(x, |milk|), which combines with |cats| to give |like|(|cats|, |milk|).]

Figure 2: A parse tree showing a semantic derivation.

This methodology is used to translate sentences of natural language into logical formulae, and then use computer-aided automation tools to reason about them (Alshawi, 1992). One major drawback is that such analysis can only deal with truth or falsity as the meaning of a sentence: it says nothing about the closeness in meaning or topic of expressions beyond their truth-conditions and the models that satisfy them, and hence does not perform well on language tasks such as search. Furthermore, an underlying domain of objects and a valuation function must be provided, as with any logic, leaving open the question of how we might learn the meaning of language using such a model, rather than just use it.

Distributional Models Distributional models of semantics, on the other hand, dismiss the interaction between syntactically linked words and are solely concerned with lexical semantics. Word meaning is obtained empirically by examining the contexts[1] in which a word appears, and equating the meaning of a word with the distribution of contexts it shares. The intuition is that context of use is what we appeal to in learning the meaning of a word, and that words that frequently have the same sort of context in common are likely to be semantically related.
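Returning to the formal-semantic derivation of Figure 2, the substitution and β-reduction steps can be sketched in a few lines. This is a minimal illustration, not part of the paper's implementation; the toy denotation of |like| is invented for the example.

```python
# A hypothetical truth-conditional model: |like| holds only of (cats, milk).
LIKE_FACTS = {("cats", "milk")}  # assumed toy denotation, not from the paper

# |like| = lambda y. lambda x. like(x, y): combine with the object first,
# then the subject, mirroring the parse "cats (like milk)" of Figure 2.
like = lambda y: (lambda x: (x, y) in LIKE_FACTS)

vp = like("milk")       # beta-reduction: lambda x. like(x, milk)
sentence = vp("cats")   # beta-reduction: like(cats, milk)
print(sentence)         # -> True
```

As the text notes, such a model can only return truth values; it says nothing about closeness in meaning.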
For instance, beer and sherry are both drinks, alcoholic, and often cause a hangover. We expect these facts to be reflected in a sufficiently large corpus: the words 'beer' and 'sherry' occur within the context of identifying words such as 'drink', 'alcoholic' and 'hangover' more frequently than they occur with other content words.

[1] E.g. words which appear in the same sentence or n-word window, or words which hold particular grammatical or dependency relations to the word being learned.

Such context distributions can be encoded as vectors in a high-dimensional space with contexts as basis vectors. For any word vector, the scalar weight c^word_i associated with each context basis vector n_i is a function of the number of times the word has appeared in that context. Semantic vectors (c^word_1, c^word_2, ..., c^word_n) are also denoted by sums of such weight/basis-vector pairs:

$$\overrightarrow{word} = \sum_i c^{word}_i \vec{n}_i$$

Learning a semantic vector is just learning its basis weights from the corpus. This setting offers geometric means to reason about semantic similarity (e.g. via the cosine measure or k-means clustering), as discussed in Widdows (2005).

The principal drawback of such models is their non-compositional nature: they ignore grammatical structure and logical words, and hence cannot compute the meanings of phrases and sentences in the same efficient way that they do for words. Common operations discussed in (Mitchell and Lapata, 2008) such as vector addition (+) and component-wise multiplication (⊙, cf. § 4 for details) are commutative: if vw = v + w or vw = v ⊙ w, then vw = wv, leading to unwelcome equalities such as

$$\overrightarrow{\text{the dog bit the man}} = \overrightarrow{\text{the man bit the dog}}$$

Non-commutative operations, such as the Kronecker product (cf.
§ 4 for definition), can take word order into account (Smolensky, 1990), or even some more complex syntactic relations, as described in Clark and Pulman (2007). However, the dimensionality of sentence vectors produced in this manner differs for sentences of different lengths, barring all sentences from being compared in the same vector space, and grows exponentially with sentence length, hence quickly becoming computationally intractable.

3 A Hybrid Logico-Distributional Model

Whereas semantic compositional mechanisms for set-theoretic constructions are well understood, there are no obvious corresponding methods for vector spaces. To solve this problem, Coecke et al. (2010) use the abstract setting of category theory to turn the grammatical structure of a sentence into a morphism compatible with the higher-level logical structure of vector spaces.

One pragmatic consequence of this abstract idea is as follows. In distributional models, there is a meaning vector for each word, e.g. cats, like, and milk. The logical recipe tells us to apply the meaning of the verb to the meanings of the subject and object. But how can a vector apply to other vectors? The solution proposed above implies that one needs to have different levels of meaning for words with different types. This is similar to logical models, where verbs are relations and nouns are atomic sets. So verb vectors should be built differently from noun vectors, for instance as matrices.

The general information as to which words should be matrices and which words atomic vectors is in fact encoded in the type-logical representation of the grammatical structure of the sentence. This is the linear map with word vectors as input and sentence vectors as output. Hence, at least theoretically, one should be able to build sentence vectors and compare their synonymity in exactly the same way as one measures word synonymity.
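The word-level side of this picture, context-count vectors compared by the cosine measure, can be sketched with a toy example. The context basis and all co-occurrence counts below are invented for illustration.

```python
import numpy as np

# Toy context-count vectors over the basis (drink, alcoholic, hangover),
# with invented counts; cosine similarity as discussed in Widdows (2005).
beer   = np.array([42.0, 31.0, 19.0])   # hypothetical corpus counts
sherry = np.array([38.0, 27.0, 11.0])
wood   = np.array([2.0, 0.0, 0.0])

def cosine(v, w):
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

print(cosine(beer, sherry))  # high: 'beer' and 'sherry' share context distributions
print(cosine(beer, wood))    # lower: 'wood' rarely shares these contexts
```

Words with similar context distributions end up with nearly parallel vectors, so 'beer' comes out closer to 'sherry' than to 'wood'.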
Pregroup Grammars The aforementioned linear maps turn out to be the grammatical reductions of a type-logic called a Lambek pregroup grammar (Lambek, 2008).[2] Pregroups and vector spaces share the same high-level mathematical structure, referred to as a compact closed category; for a proof and details of this claim see Coecke et al. (2010), and for a friendly introduction to category theory, see Coecke and Paquette (2011). One consequence of this parity is that the grammatical reductions of a pregroup grammar can be directly transformed into linear maps that act on vectors.

In a nutshell, pregroup types are either atomic or compound. Atomic types can be simple (e.g. n for noun phrases, s for statements) or left/right superscripted, referred to as adjoint types (e.g. n^r and n^l). An example of a compound type is that of a verb, n^r s n^l. The superscripted types express that the verb is a relation with two arguments of type n, which have to occur to the right and to the left of it, and that it outputs an argument of type s.

[2] The usage of pregroup types is not essential; the types of any other logic, for instance CCG, can be used, but should be translated into the language of pregroups.

A transitive sentence has types as shown in Figure 3. Each type n cancels out with its right adjoint n^r from the right and its left adjoint n^l from the left; mathematically speaking, these mean[3]

n^l n ≤ 1   and   n n^r ≤ 1

Here 1 is the unit of concatenation: 1n = n1 = n. The corresponding grammatical reduction of a transitive sentence is n (n^r s n^l) n ≤ 1 s 1 = s. Each such reduction can be depicted as a wire diagram; the diagram of a transitive sentence is shown in Figure 3.

Cats    like         milk.
n       n^r s n^l    n

Figure 3: The pregroup types and reduction diagram for a transitive sentence.

Syntax-guided Semantic Composition According to Coecke et al.
(2010), and based on a general completeness theorem between compact categories, wire diagrams, and vector spaces, the meaning of sentences can be canonically reduced to linear-algebraic formulae. The following is the meaning vector of our transitive sentence:

$$\overrightarrow{\text{cats like milk}} = (f)\left(\overrightarrow{cats} \otimes \overrightarrow{like} \otimes \overrightarrow{milk}\right) \quad (\mathrm{I})$$

Here f is the linear map that encodes the grammatical structure. The categorical morphism corresponding to it is denoted by the tensor product of three components: ε_V ⊗ 1_S ⊗ ε_W, where V and W are the subject and object spaces, S is the sentence space, the ε's are the cups, and 1_S is the straight line in the diagram. The cups stand for taking inner products, which when done with the basis vectors imitate substitution. The straight line stands for the identity map that does nothing. By the rules of the category, equation (I) reduces to the following linear-algebraic formula with lower dimensions, hence avoiding the dimensional explosion problem for Kronecker products:

$$\sum_{itj} c_{itj}\, \langle \overrightarrow{cats} \mid \vec{v}_i \rangle\; \vec{s}_t\; \langle \vec{w}_j \mid \overrightarrow{milk} \rangle \;\in\; S \quad (\mathrm{II})$$

Here v_i and w_j are basis vectors of V and W. The inner product ⟨cats | v_i⟩ substitutes the weights of cats into the first argument place of the verb (similarly for the object and the second argument place). s_t is a basis vector of the sentence space S in which meanings of sentences live, regardless of their grammatical structure. The degree of synonymity of sentences is obtained by taking the cosine measure of their vectors. S is an abstract space: it needs to be instantiated to provide concrete meanings and synonymity measures.

[3] The relation ≤ is the partial order of the pregroup. It corresponds to implication ⇒ in a logical reading thereof. If these inequalities are replaced by equalities, i.e. if n^l n = 1 = n n^r, then the pregroup collapses into a group in which n^l = n^r.
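The pregroup cancellations n^l n ≤ 1 and n n^r ≤ 1 can be mechanised in a few lines. This is an illustrative sketch, not part of the paper's implementation; "n.r" and "n.l" stand in for the superscripted adjoints n^r and n^l.

```python
# Greedily apply the cancellations x x^r -> 1 and x^l x -> 1 to a
# sequence of simple pregroup types.
def reduce_types(types):
    """Repeatedly cancel adjacent pairs (x, x.r) and (x.l, x)."""
    changed = True
    while changed:
        changed = False
        for i in range(len(types) - 1):
            a, b = types[i], types[i + 1]
            if b == a + ".r" or a == b + ".l":
                types = types[:i] + types[i + 2:]
                changed = True
                break
    return types

# "Cats like milk": n (n.r s n.l) n reduces to s, as in Figure 3.
print(reduce_types(["n", "n.r", "s", "n.l", "n"]))  # -> ['s']
```

A sequence that reduces to `['s']` is a grammatical sentence under the pregroup reading.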
For instance, a truth-theoretic model is obtained by taking the sentence space S to be the 2-dimensional space with basis vectors |1⟩ (True) and |0⟩ (False).

4 Building Matrices for Relational Words

In this section we present a general scheme for building matrices for relational words. Recall that given a vector space A with basis {n_i}_i, the Kronecker product of two vectors v = Σ_i c^a_i n_i and w = Σ_i c^b_i n_i is defined as follows:

$$\vec{v} \otimes \vec{w} = \sum_{ij} c^a_i c^b_j\, (\vec{n}_i \otimes \vec{n}_j)$$

where (n_i ⊗ n_j) is just the pairing of the basis of A, i.e. (n_i, n_j). Kronecker product vectors belong to the tensor product of A with itself, A ⊗ A; hence if A has dimension r, these will be of dimensionality r × r. The point-wise multiplication of these vectors is defined as follows:

$$\vec{v} \odot \vec{w} = \sum_i c^a_i c^b_i\, \vec{n}_i$$

The intuition behind having a matrix for a relational word is that any relation R on sets X and Y, i.e. R ⊆ X × Y, can be represented as a matrix, namely one that has as row-bases x ∈ X and as column-bases y ∈ Y, with weight c_xy = 1 where (x, y) ∈ R and 0 otherwise. In a distributional setting, the weights, which are natural or real numbers, will represent more: 'the extent according to which x and y are related'. This can be determined in different ways.

Suppose X is the set of animals, and 'chase' is a relation on it: chase ⊆ X × X. Take x = 'dog' and y = 'cat': with our type-logical glasses on, the obvious choice would be to take c_xy to be the number of times 'dog' has chased 'cat', i.e. the number of times the sentence 'the dog chases the cat' has appeared in the corpus. But in the distributional setting, this method will be too syntactic and dismissive of the actual meaning of 'cat' and 'dog'.
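The two operations defined above behave as follows on a toy 2-dimensional space (the vectors are invented): the Kronecker product is order-sensitive and lands in the r × r space, while point-wise multiplication is commutative and stays in dimension r.

```python
import numpy as np

v = np.array([1.0, 2.0])
w = np.array([3.0, 4.0])

kron_vw = np.kron(v, w)   # [1*3, 1*4, 2*3, 2*4] = [3, 4, 6, 8]
kron_wv = np.kron(w, v)   # [3, 6, 4, 8]: word order matters
pointwise = v * w         # [3, 8]: v ⊙ w == w ⊙ v

print(kron_vw, kron_wv, pointwise)
```

This is exactly the asymmetry exploited later: ⊗ keeps subject and object apart, while ⊙ mixes two vectors of the same space.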
If instead the corpus contains the sentence 'the hound hunted the wild cat', c_xy will be 0, restricting us to only assign meaning to sentences that have directly appeared in the corpus. We propose to instead use a level of abstraction, by taking words such as verbs to be distributions over the semantic information in the vectors of their context words, rather than over the context words themselves.

Start with an r-dimensional vector space N with basis {n_i}_i, in which meaning vectors of atomic words, such as nouns, live. The basis vectors of N are in principle all the words from the corpus; however, in practice, and following Mitchell and Lapata (2008), we had to restrict these to a subset of the most frequently occurring words. These basis vectors are not restricted to nouns: they can as well be verbs, adjectives, and adverbs, so that we can define the meaning of a noun in all possible contexts, as is usual in context-based models, and not only in the context of other nouns. Note that basis words with relational types are treated as pure lexical items rather than as semantic objects represented as matrices. In short, we count how many times a noun has occurred close to words of other syntactic types such as 'elect' and 'scientific', rather than counting how many times it has occurred close to their corresponding matrices: it is the lexical tokens that form the context, not their meanings.

Each relational word P with grammatical type π and m adjoint types α_1, α_2, ..., α_m is encoded as an (r × ... × r) matrix with m dimensions. Since our vector space N has a fixed basis, each such matrix is represented in vector form as follows:

$$\vec{P} = \sum_{ij\cdots\zeta} c_{ij\cdots\zeta}\,(\vec{n}_i \otimes \vec{n}_j \otimes \cdots \otimes \vec{n}_\zeta)$$

This vector lives in the m-fold tensor space N ⊗ N ⊗ ... ⊗ N. Each c_{ij...ζ} is computed according to the procedure described in Figure 4.
1) Consider a sequence of words containing a relational word 'P' and its arguments w_1, w_2, ..., w_m, occurring in the same order as described in P's grammatical type π. Refer to these sequences as 'P'-relations. Suppose there are k of them.
2) Retrieve the vector w_l of each argument w_l.
3) Suppose w_1 has weight c^1_i on basis vector n_i, w_2 has weight c^2_j on basis vector n_j, ..., and w_m has weight c^m_ζ on basis vector n_ζ. Multiply these weights: c^1_i × c^2_j × ... × c^m_ζ.
4) Repeat the above steps for all the k 'P'-relations, and sum[a] the corresponding weights:

$$c_{ij\cdots\zeta} = \sum_k \left( c^1_i \times c^2_j \times \cdots \times c^m_\zeta \right)_k$$

[a] We also experimented with multiplication, but the sparsity of noun vectors resulted in most verb matrices being empty.

Figure 4: Procedure for learning weights for matrices of words 'P' with relational types π of m arguments.

Linear algebraically, this procedure corresponds to computing the following:

$$\vec{P} = \sum_k \left( \vec{w}_1 \otimes \vec{w}_2 \otimes \cdots \otimes \vec{w}_m \right)_k$$

Type-logical examples of relational words are verbs, adjectives, and adverbs. A transitive verb is represented as a 2-dimensional matrix, since its type is n^r s n^l with two adjoint types n^r and n^l. The corresponding vector of this matrix is

$$\overrightarrow{verb} = \sum_{ij} c_{ij}\,(\vec{n}_i \otimes \vec{n}_j)$$

The weight c_ij corresponding to basis vector n_i ⊗ n_j is the extent according to which words that have co-occurred with n_i have been the subject of the verb and words that have co-occurred with n_j have been the object of the verb. This example computation is demonstrated in Figure 5.

1) Consider phrases containing 'verb', its subject w_1 and object w_2. Suppose there are k of them.
2) Retrieve the vectors w_1 and w_2.
3) Suppose w_1 has weight c^1_i on n_i and w_2 has weight c^2_j on n_j. Multiply these weights: c^1_i × c^2_j.
4) Repeat the above steps for all k 'verb'-relations and sum the corresponding weights: Σ_k (c^1_i × c^2_j)_k.

Figure 5: Procedure for learning weights for matrices of transitive verbs.

Linear algebraically, we are computing

$$\overrightarrow{verb} = \sum_k \left( \vec{w}_1 \otimes \vec{w}_2 \right)_k$$

As an example, consider the verb 'show' and suppose there are two 'show'-relations in the corpus:

s1 = table show result
s2 = map show location

The vector of 'show' is

$$\overrightarrow{show} = \overrightarrow{table} \otimes \overrightarrow{result} + \overrightarrow{map} \otimes \overrightarrow{location}$$

Consider an N space with four basis vectors 'far', 'room', 'scientific', and 'elect'. The TF/IDF-weighted values for the vectors of the above four nouns (built from the BNC) are as shown in Table 1.

i  n_i         table  map  result  location
1  far          6.6   5.6   7       5.9
2  room        27     7.4   0.99    7.3
3  scientific   0     5.4  13       6.1
4  elect        0     0     4.2     0

Table 1: Sample weights for selected noun vectors.

Part of the matrix of 'show' is presented in Table 2.

            far     room   scientific  elect
far          79.24  47.41  119.96      27.72
room        232.66  80.75  396.14     113.2
scientific   32.94  31.86   32.94       0
elect         0      0       0          0

Table 2: Sample semantic matrix for 'show'.

As a sample computation, the weight c_11 for basis vector (1, 1), i.e. (far, far), is computed by multiplying the weights of 'table' and 'result' on 'far', i.e. 6.6 × 7, multiplying the weights of 'map' and 'location' on 'far', i.e. 5.6 × 5.9, then adding these, 46.2 + 33.04, and obtaining the total weight 79.24.

The same method is applied to build matrices for ditransitive verbs, which will have 3 dimensions, and adjectives and adverbs, which will be of 1 dimension each.

5 Computing Sentence Vectors

The meanings of sentences are vectors computed by taking the variables of the categorical prescription of meaning (the linear map f obtained from the grammatical reduction of the sentence) to be determined by the matrices of the relational words.
For instance, the meaning of the transitive sentence 'sub verb obj' is:

$$\overrightarrow{\text{sub verb obj}} = \sum_{itj} \langle \overrightarrow{sub} \mid \vec{v}_i \rangle\, \langle \vec{w}_j \mid \overrightarrow{obj} \rangle\, c_{itj}\, \vec{s}_t$$

We take V := W := N and S = N ⊗ N; then Σ_itj c_itj s_t is determined by the matrix of the verb, i.e. we substitute for it Σ_ij c_ij (n_i ⊗ n_j).[4] Hence the sentence vector becomes:

$$\sum_{ij} \langle \overrightarrow{sub} \mid \vec{n}_i \rangle\, \langle \vec{n}_j \mid \overrightarrow{obj} \rangle\, c_{ij}\,(\vec{n}_i \otimes \vec{n}_j) = \sum_{ij} c^{sub}_i c^{obj}_j c_{ij}\,(\vec{n}_i \otimes \vec{n}_j)$$

This can be decomposed into the point-wise multiplication of two vectors as follows:

$$\left( \sum_{ij} c^{sub}_i c^{obj}_j\,(\vec{n}_i \otimes \vec{n}_j) \right) \odot \left( \sum_{ij} c_{ij}\,(\vec{n}_i \otimes \vec{n}_j) \right)$$

[4] Note that by doing so we are also reducing the verb space from N ⊗ (N ⊗ N) ⊗ N to N ⊗ N, since for our construction we only need tuples of the form n_i ⊗ n_i ⊗ n_j ⊗ n_j, which are isomorphic to pairs (n_i ⊗ n_j).

The left argument is the Kronecker product of the subject and object vectors, and the right argument is the vector of the verb, so we obtain

$$\left( \overrightarrow{sub} \otimes \overrightarrow{obj} \right) \odot \overrightarrow{verb}$$

Since ⊙ is commutative, this provides us with a distributional version of the type-logical meaning of the sentence: the point-wise multiplication of the meaning of the verb with the Kronecker product of its subject and object:

$$\overrightarrow{\text{sub verb obj}} = \overrightarrow{verb} \odot \left( \overrightarrow{sub} \otimes \overrightarrow{obj} \right)$$

This mathematical operation can be informally described as a structured 'mixing' of the information of the subject and object, followed by its being 'filtered' through the information of the verb applied to them, in order to produce the information of the sentence.

In the transitive case, S = N ⊗ N, hence s_t = n_i ⊗ n_j. More generally, the vector space corresponding to the abstract sentence space S is the concrete tensor space (N ⊗ ... ⊗ N), for m the dimension of the matrix of the 'verb'.
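Using the noun vectors of Table 1, the verb-matrix construction of § 4 and the composition just derived can be reproduced numerically; this is a sketch in numpy with S = N ⊗ N.

```python
import numpy as np

# Noun vectors from Table 1, over the basis (far, room, scientific, elect).
table_v    = np.array([6.6, 27.0, 0.0, 0.0])
map_v      = np.array([5.6, 7.4, 5.4, 0.0])
result_v   = np.array([7.0, 0.99, 13.0, 4.2])
location_v = np.array([5.9, 7.3, 6.1, 0.0])

# Verb matrix (Figure 5): sum over the two 'show'-relations of subj ⊗ obj.
show = np.outer(table_v, result_v) + np.outer(map_v, location_v)
print(round(show[0, 0], 2))   # -> 79.24, the sample weight computed in the text

# Sentence meaning: verb ⊙ (sub ⊗ obj), with S = N ⊗ N.
def sentence(sub, verb, obj):
    return verb * np.outer(sub, obj)   # point-wise multiplication

def cosine(v, w):
    v, w = v.ravel(), w.ravel()
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

s1 = sentence(table_v, show, result_v)    # "table show result"
s2 = sentence(map_v, show, location_v)    # "map show location"
print(cosine(s1, s2))                     # degree of synonymy of the two sentences
```

Note that the r × r sentence space never has to be materialised in the full implementation; as the text explains, the computation reduces to point-wise multiplications and summations.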
As we have seen above, in practice we do not need to build this tensor space, as the computations thereof reduce to point-wise multiplications and summations.

Similar computations yield the meanings of sentences with adjectives and adverbs. For instance, for the meaning of a transitive sentence with a modified subject and a modified verb we have

$$\overrightarrow{\text{adj sub verb obj adv}} = \left( \overrightarrow{adv} \odot \overrightarrow{verb} \right) \odot \left( \left( \overrightarrow{adj} \odot \overrightarrow{sub} \right) \otimes \overrightarrow{obj} \right)$$

After building vectors for sentences, we can compare their meanings and measure their degree of synonymy by taking their cosine measure.

6 Evaluation

Evaluating such a framework is no easy task. What to evaluate depends heavily on what sort of application a practical instantiation of the model is geared towards. In (Grefenstette et al., 2011), it is suggested that the simplified model we presented and expanded here could be evaluated in the same way as lexical semantic models, measuring compositionally built sentence vectors against a benchmark dataset such as that provided by Mitchell and Lapata (2008). In this section, we briefly describe the evaluation of our model against this dataset. Following this, we present a new evaluation task extending the experimental methodology of Mitchell and Lapata (2008) to transitive verb-centric sentences, and compare our model to those discussed by Mitchell and Lapata (2008) within this new experiment.

First Dataset Description The first experiment, described in detail by Mitchell and Lapata (2008), evaluates how well compositional models disambiguate ambiguous words given the context of a potentially disambiguating noun. Each entry of the dataset provides a noun, a target verb and a landmark verb (both intransitive). The noun must be composed with both verbs to produce short phrase vectors, the similarity of which is measured by the candidate model.
Also provided with each entry is a classification ("High" or "Low") indicating whether or not the verbs are indeed semantically close within the context of the noun, as well as an evaluator-set similarity score between 1 and 7 (along with an evaluator identifier), where 1 is low similarity and 7 is high.

Evaluation Methodology Candidate models provide a similarity score for each entry. The scores of high similarity entries and low similarity entries are averaged to produce a mean High score and a mean Low score for the model. The correlation of the model's similarity judgements with the human judgements is also calculated using Spearman's ρ, a metric which is deemed to be more scrupulous, and ultimately that by which models should be ranked, by Mitchell and Lapata (2008). The mean for each model is on a [0, 1] scale, except for UpperBound, which is on the same [1, 7] scale the annotators used. The ρ scores are on a [−1, 1] scale. It is assumed that inter-annotator agreement provides the theoretical maximum ρ for any model for this experiment. The cosine measure of the verb vectors, ignoring the noun, is taken to be the baseline (no composition).

Other Models The other models we compare ours to are those evaluated by Mitchell and Lapata (2008). We provide a selection of the results from that paper for the worst (Add) and best[5] (Multiply) performing models, as well as the previous second-best performing model (Kintsch). The additive and multiplicative models are simply applications of vector addition and component-wise multiplication. We invite the reader to consult (Mitchell and Lapata, 2008) for the description of Kintsch's additive model and its parametric choices.
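Spearman's ρ, the ranking metric used throughout the evaluation, is the Pearson correlation of rank values. A self-contained sketch, with invented model scores and annotator judgements:

```python
import numpy as np

def rankdata(x):
    # Rank values 1..n, with tied values receiving their average rank.
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        mask = x == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman_rho(a, b):
    # Pearson correlation of the two rank vectors.
    ra, rb = rankdata(a), rankdata(b)
    ra -= ra.mean(); rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

model  = [0.84, 0.79, 0.42, 0.28, 0.59]   # hypothetical model similarities
humans = [6.2, 5.8, 2.1, 1.9, 4.0]        # hypothetical annotator scores
print(spearman_rho(model, humans))        # -> 1.0: the rankings agree perfectly
```

Because only ranks matter, a model's absolute scores can be renormalised freely without changing its ρ, which is one reason the paper treats ρ as more rigorous than the High/Low means.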
Model Parameters To provide the most accurate comparison with the existing multiplicative model, and exploiting the aforementioned feature that the categorical model can be built "on top of" existing lexical distributional models, we used the parameters described by Mitchell and Lapata (2008) to reproduce the vectors evaluated in the original experiment as our noun vectors. All vectors were built from a lemmatised version of the BNC. The noun basis was the 2000 most common context words; basis weights were the probability of a context word given the target word divided by the overall probability of the context word. Intransitive verb function-vectors were trained using the procedure presented in § 4. Since the dataset only contains intransitive verbs and nouns, we used S = N. The cosine measure of vectors was used as a similarity metric.

First Experiment Results In Table 3 we present the comparison of the selected models. Our categorical model performs significantly better than the existing second-place model (Kintsch) and obtains a ρ quasi-identical to the multiplicative model, indicating significant correlation with the annotator scores. There is not a large difference between the mean High score and the mean Low score, but the distribution in Figure 6 shows that our model makes a non-negligible distinction between high similarity phrases and low similarity phrases, despite the absolute scores not differing by more than a few percentiles.

[5] The multiplicative model presented here is what is qualified as best in (Mitchell and Lapata, 2008). However, they also present a slightly better performing (ρ = 0.19) model which is a combination of their multiplicative model and a weighted additive model. The difference in ρ is qualified as "not statistically significant" in the original paper, and furthermore the mixed model requires parametric optimisation, hence was not evaluated against the entire test set.
For these reasons, we chose not to include it in the comparison.

Model        High  Low   ρ
Baseline     0.27  0.26  0.08
Add          0.59  0.59  0.04
Kintsch      0.47  0.45  0.09
Multiply     0.42  0.28  0.17
Categorical  0.84  0.79  0.17
UpperBound   4.94  3.25  0.40

Table 3: Selected model means for High and Low similarity items and correlation coefficients with human judgements, first experiment (Mitchell and Lapata, 2008). p < 0.05 for each ρ.

Figure 6: Distribution of predicted similarities for the categorical distributional model on High and Low similarity items.

Second Dataset Description The second dataset[6], developed by the authors, follows the format of the (Mitchell and Lapata, 2008) dataset used for the first experiment, with the exception that the target and landmark verbs are transitive, and an object noun is provided in addition to the subject noun, hence forming a small transitive sentence. The dataset comprises 200 entries consisting of sentence pairs (hence a total of 400 sentences), constructed by following the procedure outlined in § 4 of (Mitchell and Lapata, 2008), using transitive verbs from CELEX[7]. For examples of these sentences, see Table 4. The dataset was split into four sections of 100 entries each, with a guaranteed 50% exclusive overlap with exactly two other sections. Each section was given to a group of evaluators, with a total of 25, who were asked to form simple transitive sentence pairs from the verbs, subject and object provided in each entry; for instance 'the table showed the result' from 'table show result'. The evaluators were then asked to rate the semantic similarity of each verb pair within the context of those sentences, and offer a score between 1 and 7 for each entry.

[6] http://www.cs.ox.ac.uk/activities/CompDistMeaning/GS2011data.txt
[7] http://celex.mpi.nl/
Each entry was given an arbitrary classification of HIGH or LOW by the authors, for the purpose of calculating mean high/low scores for each model. For example, the first two pairs in Table 4 were classified as HIGH, and the second two pairs as LOW.

Sentence 1         Sentence 2
table show result  table express result
map show location  map picture location
table show result  table picture result
map show location  map express location

Table 4: Example entries from the transitive dataset without annotator scores, second experiment.

Evaluation Methodology. The evaluation methodology for the second experiment was identical to that of the first, as are the scales for means and scores. Here also, Spearman's ρ is deemed a more rigorous way of determining how well a model tracks difference in meaning. This is both because of the imprecise nature of the classification of verb pairs as HIGH or LOW, and because the objective similarity scores produced by a model that distinguishes sentences of different meaning from those of similar meaning can be renormalised in practice. Therefore the delta between the HIGH mean and the LOW mean cannot serve as a definite indication of the practical applicability (or lack thereof) of semantic models; the means are provided only to aid comparison with the results of the first experiment.

Model Parameters. As in the first experiment, the lexical vectors from (Mitchell and Lapata, 2008) were used for the other models evaluated (additive, multiplicative and baseline)[8] and for the noun vectors of our categorical model. Transitive verb vectors were trained as described in § 4 with S = N ⊗ N.

Second Experiment Results. The results for the models evaluated against the second dataset are presented in Table 5.

[8] Kintsch was not evaluated as it required optimising model parameters against a held-out segment of the test set, and we could not replicate the methodology of Mitchell and Lapata (2008) with full confidence.
Model        High   Low    ρ
Baseline     0.47   0.44   0.16
Add          0.90   0.90   0.05
Multiply     0.67   0.59   0.17
Categorical  0.73   0.72   0.21
UpperBound   4.80   2.49   0.62

Table 5: Selected model means for High and Low similarity items and correlation coefficients with human judgements, second experiment. p < 0.05 for each ρ.

We observe a significant (p < 0.05) improvement in the alignment of our categorical model with the human judgements, from 0.17 to 0.21. The additive model continues to make little distinction between senses of the verb during composition, and the multiplicative model's alignment does not change, but becomes statistically indistinguishable from the non-compositional baseline model.

Once again we note that the high/low means are not very indicative of model performance, as the difference between the high mean and the low mean of the categorical model is much smaller than that of both the baseline model and the multiplicative model, despite better alignment with annotator judgements.

7 Discussion

In this paper, we described an implementation of the categorical model of meaning (Coecke et al., 2010), which combines the formal logical and the empirical distributional frameworks into a unified semantic model. The implementation is based on building matrices for words with relational types (adjectives, verbs), and vectors for words with atomic types (nouns), based on data from the BNC. We then showed how to apply verbs to their subjects and objects, in order to compute the meaning of intransitive and transitive sentences.

Other work uses matrices to model meaning (Baroni and Zamparelli, 2010; Guevara, 2010), but only for adjective-noun phrases. Our approach easily applies to such compositions, as well as to sentences containing combinations of adjectives, nouns, verbs, and adverbs. The other key difference is that they learn their matrices in a top-down fashion, i.e.
by regression from the composite adjective-noun context vectors, whereas our model is bottom-up: it learns sentence/phrase meaning compositionally from the vectors of the components of the composites. Finally, very similar functions, for example a verb with argument alternations such as 'break' in 'Y breaks' and 'X breaks Y', are not treated as unrelated. The matrix of the intransitive 'break' uses the corpus-observed information about the subjects of 'break', including that of 'Y'; similarly, the matrix of the transitive 'break' uses information about its subjects and objects, including that of 'X' and 'Y'. We leave a thorough study of these phenomena, which fall under providing a modular representation of passive-active similarities, to future work.

We evaluated our model in two ways: first against the word disambiguation task of Mitchell and Lapata (2008) for intransitive verbs, and then against a similar new experiment for transitive verbs, which we developed.

Our findings in the first experiment show that the categorical method performs on par with the leading existing approaches. This should not surprise us, given that the context is so small that our method becomes similar to the multiplicative model of Mitchell and Lapata (2008). However, our approach is sensitive to grammatical structure, which led us to develop a second experiment taking this into account and differentiating our model from those with commutative composition operations.

The second experiment's results deliver the expected qualitative difference between models, with our categorical model outperforming the others and showing an increase in alignment with human judgements that correlates with the increase in sentence complexity.
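The bottom-up learning and composition discussed above can be illustrated concretely. The following is a minimal numpy sketch of one natural reading of this scheme, not the authors' code (the actual training procedure is given in § 4, not reproduced here): a transitive verb's matrix is accumulated as a sum of outer products of its corpus-attested subject and object vectors, and a sentence "subj verb obj" in S = N ⊗ N is taken as the pointwise product of that matrix with the subject-object outer product:

```python
# Hypothetical sketch of bottom-up verb-matrix learning and transitive
# composition; helper names and the pointwise composition are our own
# illustrative reading, not the paper's verbatim procedure.
import numpy as np

def train_transitive_verb(occurrences):
    """occurrences: list of (subject_vector, object_vector) pairs
    observed with this verb in the corpus."""
    return sum(np.outer(s, o) for s, o in occurrences)

def transitive_sentence(subj, verb_matrix, obj):
    # Compose in S = N (x) N: elementwise product with subj (x) obj.
    return verb_matrix * np.outer(subj, obj)

def similarity(s1, s2):
    a, b = s1.ravel(), s2.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Under this reading, two verbs with different typical subjects and objects yield different matrices, so sentences built from them can diverge even when the nouns are identical; this is the grammatical sensitivity the second experiment is designed to detect.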
We use this second evaluation principally to show that there is a strong case for the development of more complex experiments measuring not only the disambiguating qualities of compositional models, but also their syntactic sensitivity, which is not directly measured in the existing experiments.

These results show that the high-level categorical distributional model, uniting empirical data with logical form, can be implemented just like any other concrete model. Furthermore, it shows better results in experiments involving higher syntactic complexity. This is just the tip of the iceberg: the mathematics underlying the implementation ensures that it uniformly scales to larger, more complicated sentences, and enables it to compare the synonymity of sentences that are of different grammatical structure.

8 Future Work

Treatment of function words such as 'that' and 'who', as well as logical words such as quantifiers and conjunctives, is left to future work. This will build alongside the general guidelines of Coecke et al. (2010) and concrete insights from the work of Widdows (2005). It is not yet entirely clear how existing set-theoretic approaches, for example that of discourse representation and generalised quantifiers, apply to our setting. Preliminary work on the integration of the two has been presented by Preller (2007) and more recently also by Preller and Sadrzadeh (2009).

As mentioned by one of the reviewers, our pregroup approach to grammar flattens the sentence representation, in that the verb is applied to its subject and object at the same time; whereas in other approaches such as CCG, it is first applied to the object to produce a verb phrase, then applied to the subject to produce the sentence. The advantages and disadvantages of this method, and comparisons with other systems, in particular CCG, constitute ongoing work.

9 Acknowledgements

We wish to thank P. Blunsom, S. Clark, B. Coecke, S.
Pulman, and the anonymous EMNLP reviewers for discussions and comments. Support from EPSRC grant EP/F042728/1 is gratefully acknowledged by M. Sadrzadeh.

References

H. Alshawi (ed.). 1992. The Core Language Engine. MIT Press.

M. Baroni and R. Zamparelli. 2010. Nouns are vectors, adjectives are matrices. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

S. Clark and S. Pulman. 2007. Combining Symbolic and Distributional Models of Meaning. Proceedings of the AAAI Spring Symposium on Quantum Interaction. AAAI Press.

B. Coecke and E. Paquette. 2011. Categories for the Practicing Physicist. New Structures for Physics, 167–271. B. Coecke (ed.). Lecture Notes in Physics 813. Springer.

B. Coecke, M. Sadrzadeh and S. Clark. 2010. Mathematical Foundations for a Compositional Distributional Model of Meaning. Lambek Festschrift. Linguistic Analysis 36, 345–384. J. van Benthem, M. Moortgat and W. Buszkowski (eds.).

J. Curran. 2004. From Distributional to Semantic Similarity. PhD Thesis, University of Edinburgh.

K. Erk and S. Padó. 2004. A Structured Vector Space Model for Word Meaning in Context. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 897–906.

G. Frege. 1892. Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische Kritik 100.

J. R. Firth. 1957. A synopsis of linguistic theory 1930–1955. Studies in Linguistic Analysis.

E. Grefenstette, M. Sadrzadeh, S. Clark, B. Coecke and S. Pulman. 2011. Concrete Compositional Sentence Spaces for a Compositional Distributional Model of Meaning. International Conference on Computational Semantics (IWCS'11). Oxford.

G. Grefenstette. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer.

E. Guevara. 2010. A Regression Model of Adjective-Noun Compositionality in Distributional Semantics. Proceedings of the ACL GEMS Workshop.

Z. S. Harris. 1966.
A Cycling Cancellation-Automaton for Sentence Well-Formedness. International Computation Centre Bulletin 5, 69–94.

R. Hudson. 1984. Word Grammar. Blackwell.

J. Lambek. 2008. From Word to Sentence. Polimetrica, Milan.

T. Landauer and S. Dumais. 2008. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review.

C. D. Manning, P. Raghavan and H. Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

J. Mitchell and M. Lapata. 2008. Vector-based models of semantic composition. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, 236–244.

R. Montague. 1974. English as a formal language. Formal Philosophy, 189–223.

J. Nivre. 2003. An efficient algorithm for projective dependency parsing. Proceedings of the 8th International Workshop on Parsing Technologies (IWPT).

A. Preller. 2007. Towards Discourse Representation via Pregroup Grammars. Journal of Logic, Language and Information 16, 173–194.

A. Preller and M. Sadrzadeh. Semantic Vector Models and Functional Models for Pregroup Grammars. Journal of Logic, Language and Information. DOI: 10.1007/s10849-011-9132-2. To appear.

J. Saffran, E. Newport and R. Aslin. 1999. Word Segmentation: The role of distributional cues. Journal of Memory and Language 35, 606–621.

H. Schütze. 1998. Automatic Word Sense Discrimination. Computational Linguistics 24, 97–123.

P. Smolensky. 1990. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence 46, 159–216.

M. Steedman. 2000. The Syntactic Process. MIT Press.

D. Widdows. 2005. Geometry and Meaning. University of Chicago Press.

L. Wittgenstein. 1953. Philosophical Investigations. Blackwell.
