Mathematical Foundations for a Compositional Distributional Model of Meaning

Bob Coecke∗, Mehrnoosh Sadrzadeh∗, Stephen Clark†
coecke, mehrs@comlab.ox.ac.uk – stephen.clark@cl.cam.ac.uk
∗ Oxford University Computing Laboratory
† University of Cambridge Computer Laboratory

Abstract

We propose a mathematical framework for a unification of the distributional theory of meaning in terms of vector space models, and a compositional theory for grammatical types, for which we rely on the algebra of Pregroups, introduced by Lambek. This mathematical framework enables us to compute the meaning of a well-typed sentence from the meanings of its constituents. Concretely, the type reductions of Pregroups are 'lifted' to morphisms in a category, a procedure that transforms meanings of constituents into a meaning of the (well-typed) whole. Importantly, meanings of whole sentences live in a single space, independent of the grammatical structure of the sentence. Hence the inner product can be used to compare meanings of arbitrary sentences, as it is for comparing the meanings of words in the distributional model. The mathematical structure we employ admits a purely diagrammatic calculus which exposes how the information flows between the words in a sentence in order to make up the meaning of the whole sentence. A variation of our 'categorical model', which involves constraining the scalars of the vector spaces to the semiring of Booleans, results in a Montague-style Boolean-valued semantics.

1 Introduction

The symbolic [13] and distributional [36] theories of meaning are somewhat orthogonal, with competing pros and cons: the former is compositional but only qualitative, the latter is quantitative but non-compositional. For a discussion of these two competing paradigms in Natural Language Processing see [15].
Following [39] in the context of Cognitive Science, where a similar problem exists between the connectionist and symbolic models of mind, [6] argued for the use of the tensor product of vector spaces, pairing the vectors of meaning with their roles. In this paper we will also use tensor spaces and pair vectors with their grammatical types, but in a way which overcomes some of the shortcomings of [6]. One shortcoming is that, since inner products can only be computed between vectors which live in the same space, sentences can only be compared if they have the same grammatical structure. In this paper we provide a procedure to compute the meaning of any sentence as a vector within a single space. A second problem is the lack of a method to compute the vectors representing the grammatical types; the procedure presented here does not require such vectors.

The use of Pregroups for analysing the structure of natural languages is a recent development by Lambek [19] and builds on his original Lambek (or Syntactic) calculus [18], where types are used to analyse the syntax of natural languages in a simple equational algebraic setting. Pregroups have been used to analyse the syntax of a range of different languages, from English and French to Polish and Persian [32], and many more; for further references see [23, 21]. But what is particularly interesting about Pregroups, and motivates their use here, is that they share a common structure with vector spaces and tensor products when one passes to a category-theoretic perspective. Both the category of vector spaces, linear maps and the tensor product, as well as Pregroups, are examples of so-called compact closed categories. Concretely, Pregroups are posetal instances of the categorical logic of vector spaces, where juxtaposition of types corresponds to the monoidal tensor of the monoidal category.
The mathematical structure within which we compute the meaning of sentences will be a compact closed category which combines the two structures above. The meanings of words are vectors in vector spaces, their grammatical roles are types in a Pregroup, and the tensor product of vector spaces paired with the Pregroup composition is used for the composition of (meaning, type) pairs. Type-checking is now an essential fragment of the overall categorical logic, and the reduction scheme used to verify grammatical correctness of sentences will not only provide a statement on the well-typedness of a sentence, but will also assign a vector in a vector space to each sentence. Hence we obtain a theory with both Pregroup analysis and vector space models as constituents, but which is inherently compositional and assigns a meaning to a sentence given the meanings of its words. The vectors s representing the meanings of sentences all live in the same meaning space S. Hence we can compare the meanings of any two sentences s, t ∈ S by computing their inner product ⟨s | t⟩.

Compact closed categories admit a beautiful, purely diagrammatic calculus that greatly simplifies the meaning computations. They also provide reduction diagrams for typing sentences; these allow for a high-level comparison of the grammatical patterns of sentences in different languages [33]. This diagrammatic structure, for the case of vector spaces, was recently exploited by Abramsky and the first author to expose the flows of information within quantum information protocols [1, 7, 9]. Here, the diagrams will expose the flow of information between the words that make up a sentence, in order to produce the meaning of the whole sentence. Note that the connection between linguistics and physics was also identified by Lambek himself [22].
Interestingly, a Montague-style Boolean-valued semantics emerges as a simplified variant of our setting, by restricting the vectors to range over B = {0, 1}, where sentences are simply true or false. Theoretically, this is nothing but the passage from the category of vector spaces to the category of relations, as described in [8]. In the same spirit, one can let vectors range over N or Q and obtain degrees or probabilities of meaning. As a final remark: in this paper we only set up our general mathematical framework and leave a practical implementation for future work.

2 Two 'camps' within computational linguistics

We briefly present the two domains of Computational Linguistics which provide the linguistic background for this paper, and refer the reader to the literature for more details.

2.1 Vector space models of meaning

The key idea behind vector space models of meaning [36] can be summed up by Firth's oft-quoted dictum that "you shall know a word by the company it keeps". The basic idea is that the meaning of a word can be determined by the words which appear in its contexts, where a context can be a simple n-word window, or the argument slots of grammatical relations, such as the direct object of the verb eat. Intuitively, cat and dog have similar meanings (in some sense) because cats and dogs sleep, run, walk; cats and dogs can be bought, cleaned, stroked; cats and dogs can be small, big, furry. This intuition is reflected in text because cat and dog appear as the subject of sleep, run, walk; as the direct object of bought, cleaned, stroked; and as the modifiee of small, big, furry. Meanings of words can thus be represented as vectors in a high-dimensional "meaning space", in which the orthogonal basis vectors are labelled by context words.
To give a simple example, if the basis vectors correspond to eat, sleep, and run, and the word dog has eat in its context 6 times (in some text), sleep 5 times, and run 7 times, then the vector for dog in this space is (6, 5, 7). The advantage of representing meanings in this way is that the vector space gives us a notion of distance between words, so that the inner product (or some other measure) can be used to determine how close in meaning one word is to another. Computational models along these lines have been built using large vector spaces (tens of thousands of context words/basis vectors) and large bodies of text (up to a billion words in some experiments). Experiments in constructing thesauri using these methods have been relatively successful. For example, the top 10 most similar nouns to introduction, according to the system of [11], are launch, implementation, advent, addition, adoption, arrival, absence, inclusion, creation.

The other main approach to representing lexical semantics is through an ontology or semantic network, typically created manually by lexicographers or domain experts. The advantages of vector-based representations over hand-built ontologies are that:

• they are created objectively and automatically from text;
• they allow the representation of gradations of meaning;
• they relate well to experimental evidence indicating that the human cognitive system is sensitive to distributional information [34, 40].

Vector-based models of word meaning have been fruitfully applied to many language processing tasks. Examples include lexicon acquisition [16, 26], word sense discrimination and disambiguation [36, 28], text segmentation [5], language modelling [2], and notably document retrieval [35].
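The toy example above can be sketched in a few lines. The vector for dog is the one from the text; the vectors for cat and car are invented here purely for illustration:

```python
import math

# Toy meaning space with basis (eat, sleep, run), as in the example above.
# The vectors for "cat" and "car" are invented for illustration.
dog = [6, 5, 7]
cat = [5, 6, 6]
car = [0, 1, 8]   # hypothetical: co-occurs mostly with "run"

def cosine(v, w):
    """Cosine similarity: inner product normalized by vector lengths."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = lambda u: math.sqrt(sum(a * a for a in u))
    return dot / (norm(v) * norm(w))

print(cosine(dog, cat))   # close to 1: similar distributions
print(cosine(dog, car))   # noticeably smaller
```

In a realistic system the raw counts would be weighted and the spaces would have tens of thousands of dimensions, but the similarity computation is exactly this.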
Within cognitive science, vector-based models have been successful in simulating a wide variety of semantic processing tasks, ranging from semantic priming [27, 24, 29] to episodic memory [17] and text comprehension [24, 14, 25]. Moreover, the cosine similarities obtained within vector-based models have been shown to correlate substantially with human similarity judgements [29] and word association norms [12, 17]. (In practice, the co-occurrence counts are typically weighted in some way to reflect how informative the contextual element is with respect to the meaning of the target word.)

2.2 The algebra of Pregroups as a type-categorial logic

We provide a brief overview of the algebra of Pregroups from the existing literature and refer the reader to [19, 20, 21, 4] for more details.

A partially ordered monoid (P, ≤, ·, 1) is a partially ordered set equipped with a monoid multiplication − · − with unit 1, where for p, q, r ∈ P, if p ≤ q then r · p ≤ r · q and p · r ≤ q · r. A Pregroup (P, ≤, ·, 1, (−)^l, (−)^r) is a partially ordered monoid in which each element p ∈ P has a left adjoint p^l and a right adjoint p^r, i.e. the following hold:

p^l · p ≤ 1 ≤ p · p^l    and    p · p^r ≤ 1 ≤ p^r · p.

Some properties of interest in a Pregroup are:

• Adjoints are unique.
• Adjoints are order-reversing: p ≤ q ⟹ q^r ≤ p^r and q^l ≤ p^l.
• The unit is self-adjoint: 1^l = 1 = 1^r.
• Adjoints reverse multiplication: (p · q)^r = q^r · p^r and (p · q)^l = q^l · p^l.
• Opposite adjoints annihilate each other: (p^l)^r = p = (p^r)^l.
• Same-side adjoints iterate: p^ll p^l ≤ 1 ≤ p^r p^rr, p^lll p^ll ≤ 1 ≤ p^rr p^rrr, ....

An example of a Pregroup from arithmetic is the set of all unbounded monotone maps on the integers, f : Z → Z.
In this Pregroup, function composition is the monoid multiplication and the identity map is its unit; the underlying order on the integers lifts pointwise to an order on the maps, whose Galois adjoints are their Pregroup adjoints, defined by:

f^l(x) = min { y ∈ Z | x ≤ f(y) }        f^r(x) = max { y ∈ Z | f(y) ≤ x }

Recall that a Lambek Calculus (P, ≤, ·, 1, /, \) is also a partially ordered monoid, but there it is the monoid multiplication that has a right adjoint −\− and a left adjoint −/−. Roughly speaking, the passage from the Lambek Calculus to Pregroups can be thought of as replacing the two adjoints of the monoid multiplication with the two adjoints of the elements. One can define a translation between a Lambek Calculus and a Pregroup by sending (p\q) to (p^r · q) and (p/q) to (p · q^l), and, via the lambda calculus correspondence of the former, think of the adjoint types of a Pregroup as function arguments.

Pregroups formalize the grammar of natural languages in the same way as type-categorial logics do. One starts by fixing a set of basic grammatical roles and a partial ordering between them, and then freely generates a Pregroup over these types; the existence of this free construction has been proved. In this paper we present two examples from English: positive and negative transitive sentences (by a negative sentence we mean one with a negation operator, such as not, and by a positive sentence one without), for which we fix the following basic types:

n : noun
s : declarative statement
j : infinitive of the verb
σ : glueing type

Compound types are formed from these by taking adjoints and juxtaposition. A type (basic or compound) is assigned to each word of the dictionary. We stipulate that if the juxtaposition of the types of the words within a sentence reduces to the basic type s, then the sentence is grammatical. It has been shown that this reduction procedure is decidable.
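The arithmetic example of monotone maps on the integers can be sketched as follows. The particular map f(y) = 2y and the bounded search window are illustrative assumptions; for genuinely unbounded maps one must argue separately that the min and max exist:

```python
# The Pregroup of monotone maps on the integers (a sketch).
# Galois adjoints, as in the text:
#   f^l(x) = min { y | x <= f(y) },   f^r(x) = max { y | f(y) <= x }.
# f(y) = 2y is an illustrative choice; the bounded window is an assumption.

def f(y):
    return 2 * y

def f_left(x, lo=-1000, hi=1000):
    return min(y for y in range(lo, hi) if x <= f(y))

def f_right(x, lo=-1000, hi=1000):
    return max(y for y in range(lo, hi) if f(y) <= x)

# The adjunction inequalities f^l(f(x)) <= x <= f^r(f(x)) hold pointwise:
for x in range(-5, 6):
    assert f_left(f(x)) <= x <= f_right(f(x))
```

For f(y) = 2y these adjoints come out as the ceiling and floor of x/2, e.g. f_left(3) == 2 and f_right(3) == 1.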
In what follows we use an arrow → for ≤ and drop the · between juxtaposed types. The example sentence "John likes Mary" has the following type assignment (the brackets are only for clarity of exposition and are not part of the mathematical presentation):

John    likes        Mary
 n      (n^r s n^l)   n

and it is grammatical because of the following reduction:

n (n^r s n^l) n  →  1 s n^l n  →  1 s 1  →  s

Reductions are depicted diagrammatically; in the diagram for the reduction above, under-links connect n with n^r and n^l with n, while the s wire remains free. Reduction diagrams depict the grammatical structure of sentences in one dimension, as opposed to the two-dimensional trees of type-categorial logics. This feature becomes useful in applications such as comparing the grammatical patterns of different languages; for some examples see [33].

We type the negation of the above sentence as follows:

John   does           not            like          Mary
 n     (n^r s j^l σ)  (σ^r j j^l σ)  (σ^r j n^l)    n

which is grammatical because of the following reduction:

n (n^r s j^l σ) (σ^r j j^l σ) (σ^r j n^l) n  →  s

and is likewise depicted diagrammatically by under-links pairing each type with its adjoint.

The types used here for "does" and "not" are not the original ones, e.g. as suggested in [21], but are rather obtained from the procedure later introduced in [30]. The difference between the two lies in the use of the glueing types; once these are deleted from the above, the original types are recovered. The motivation for introducing these glueing types is their crucial role in the development of a discourse semantics for Pregroups [30]. Our motivation, as will be demonstrated in section 4, is that they allow information to flow and be acted upon within the sentence, and as such assist in constructing the meaning of the whole sentence.
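The reductions above can be checked mechanically. The sketch below writes a type p^l as "p_l" and p^r as "p_r", and greedily contracts adjacent pairs of the form p^(k) p^(k+1); this suffices for the examples in this section but is not a complete Pregroup parser:

```python
# Sketch of a Pregroup reduction checker. A type p^l is written "p_l",
# p^r as "p_r", p^ll as "p_ll", and so on. Greedy contraction of adjacent
# pairs (p, k)(p, k+1) -> 1 handles the examples here, but this is not a
# complete Pregroup parser.

def parse(tok):
    base, _, adj = tok.partition("_")
    deg = 0 if not adj else (-len(adj) if adj[0] == "l" else len(adj))
    return (base, deg)

def reduces_to(types, target="s"):
    seq = [parse(t) for t in types]
    changed = True
    while changed:
        changed = False
        for i in range(len(seq) - 1):
            (b1, d1), (b2, d2) = seq[i], seq[i + 1]
            if b1 == b2 and d2 == d1 + 1:
                del seq[i:i + 2]     # epsilon map: contract the pair
                changed = True
                break
    return seq == [parse(target)]

# "John likes Mary":  n (n^r s n^l) n  ->  s
print(reduces_to(["n", "n_r", "s", "n_l", "n"]))   # True
```

Writing "o" for the glueing type σ, the negative sentence reduces as well: reduces_to(["n", "n_r", "s", "j_l", "o", "o_r", "j", "j_l", "o", "o_r", "j", "n_l", "n"]) also returns True.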
Interestingly, we have come to realize that these new types can also be obtained by translating into Pregroup notation the types of the same words from a type-categorial logic approach, up to the replacement of the intermediate n's with σ's.

3 Modeling a language in a concrete category

Our mathematical model of language will be category-theoretic. Category theory is usually not conceived as the most accessible part of mathematics, so let us briefly state why this passage is essential. The reader may consult the category theory tutorial [10], which covers the background on the kinds of categories that are relevant here. The survey of graphical languages for monoidal categories [38] may also be useful; note that Selinger refers to 'non-commutative' compact closed categories as (both left and right) planar autonomous categories. So why do we use categories?

1. The passage from {true, false}-valuations (as in Montague semantics) to quantitative meaning spaces requires a mathematical structure that can store this additional information, but which at the same time retains the compositional structure. Concrete monoidal categories do exactly that:
   • the axiomatic structure, in particular the monoidal tensor, captures compositionality;
   • the concrete objects and corresponding morphisms enable the encoding of the particular model of meaning one uses, here vector spaces.

2. The structural morphisms of the particular categories that we consider, compact closed categories, will be the basic building blocks for constructing the morphisms that represent the 'from-meaning-of-words-to-meaning-of-a-sentence' process.

3. Even in a purely syntactic setting, the lifting to categories will allow us to reason about the grammatical structures of different sentences as first-class citizens of the formalism. This will enable us to provide more than just a yes-no answer about whether or not a phrase is grammatical. As such, the categorical setting will, for instance, allow us to distinguish and reason about ambiguities in grammatical sentences, where different grammatical structures give rise to different meaning interpretations.

We first briefly recall the basic notions of the theory of monoidal categories, before explaining in more detail what we mean by this 'from-meaning-of-words-to-meaning-of-a-sentence' process.

3.1 Monoidal categories

Here we consider the non-symmetric case of a compact closed category, non-degenerate Pregroups being examples of essentially non-commutative compact closed categories. The formal definition of monoidal categories is somewhat involved, but it admits an intuitive operational interpretation and an elegant, purely diagrammatic calculus. A (strict) monoidal category C requires the following data and axioms:

• a family |C| of objects;
  – for each ordered pair of objects (A, B) a corresponding set C(A, B) of morphisms; it is convenient to abbreviate f ∈ C(A, B) by f : A → B;
  – for each ordered triple of objects (A, B, C), each f : A → B, and g : B → C, there is a sequential composite g ◦ f : A → C; we moreover require that (h ◦ g) ◦ f = h ◦ (g ◦ f);
  – for each object A there is an identity morphism 1_A : A → A; for f : A → B we moreover require that f ◦ 1_A = f and 1_B ◦ f = f;

• for each ordered pair of objects (A, B) a composite object A ⊗ B; we moreover require that

  (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C);    (1)

• a unit object I which satisfies

  I ⊗ A = A = A ⊗ I;    (2)

• for each ordered pair of morphisms (f : A → C, g : B → D) a parallel composite f ⊗ g : A ⊗ B → C ⊗ D; we moreover require bifunctoriality, i.e.

  (g_1 ⊗ g_2) ◦ (f_1 ⊗ f_2) = (g_1 ◦ f_1) ⊗ (g_2 ◦ f_2).    (3)

There is a very intuitive operational interpretation of monoidal categories. We think of the objects as types of systems.
We think of a morphism f : A → B as a process which takes a system of type A as input and provides a system of type B as output, i.e. given any state ψ of the system of type A, it produces a state f(ψ) of the system of type B. Composition of morphisms is sequential application of processes. The compound type A ⊗ B represents joint systems. We think of I as the trivial system, which can be read as either 'nothing' or 'unspecified'. More on this intuitive interpretation can be found in [8, 10].

Morphisms ψ : I → A are called elements of A. At first this might seem to be a double use of terminology: if A were a set, then x ∈ A would be an element, rather than some function x : I → A. However, one easily sees that elements x ∈ A are in bijective correspondence with functions x : I → A, provided one takes I to be a singleton set. The same holds for vectors v ∈ V, where V is a vector space, and linear maps v : R → V. In this paper we take the liberty of jumping between these two representations of a vector v ∈ V when using them to represent meanings.

In the standard definition of monoidal categories the 'strict' equality of eqs. (1, 2) is not required; rather, one requires a natural isomorphism between (A ⊗ B) ⊗ C and A ⊗ (B ⊗ C). We assume strictness in order to avoid coherence conditions. This simplification is justified by the fact that each monoidal category is categorically equivalent to a strict one, obtained by imposing appropriate congruences. Moreover, the graphical language which we introduce below represents (free) strict monoidal categories. This issue is discussed in detail in [10].
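Bifunctoriality, eq. (3), can be checked concretely in the category of vector spaces, where sequential composition is matrix multiplication and the parallel composite is the Kronecker product; the dimensions below are arbitrary:

```python
import numpy as np

# In FVect, morphisms are matrices: sequential composition is the matrix
# product and the parallel composite (tensor) is the Kronecker product.
rng = np.random.default_rng(0)
f1 = rng.random((3, 2)); g1 = rng.random((4, 3))   # f1: A1 -> B1, g1: B1 -> C1
f2 = rng.random((2, 2)); g2 = rng.random((5, 2))   # f2: A2 -> B2, g2: B2 -> C2

lhs = np.kron(g1, g2) @ np.kron(f1, f2)   # (g1 (x) g2) o (f1 (x) f2)
rhs = np.kron(g1 @ f1, g2 @ f2)           # (g1 o f1) (x) (g2 o f2)
assert np.allclose(lhs, rhs)              # bifunctoriality, eq. (3)
```

In the graphical calculus this equality is a tautology: both sides are drawn as the same picture of two side-by-side wire paths.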
What is particularly interesting about these monoidal categories is that they admit a graphical calculus in the following sense [38]: an equational statement between morphisms in a monoidal category is provable from the axioms of monoidal categories if and only if it is derivable in the graphical language. This fact holds not only for ordinary monoidal categories, but also for many kinds with additional structure, including the compact closed categories that we consider here.

Graphical language for monoidal categories. In the graphical calculus for monoidal categories we depict morphisms by boxes, with incoming and outgoing wires labelled by the corresponding types; sequential composition is depicted by connecting matching outputs and inputs, and parallel composition by placing boxes side by side. For example, the morphisms

1_A    f    g ◦ f    1_A ⊗ 1_B    f ⊗ 1_C    f ⊗ g    (f ⊗ g) ◦ h

are each depicted, in a top-down fashion, as configurations of labelled boxes and wires. When morphisms are represented in this manner by boxes, eq. (3) comes for free [10]! The unit object I is represented by 'no wire'; for example, ψ : I → A is depicted as a triangle with one outgoing wire, π : A → I as a triangle with one incoming wire, and π ◦ ψ : I → I as a diagram with no loose wires at all.

3.2 The 'from-meaning-of-words-to-meaning-of-a-sentence' process

Monoidal categories are widely used to represent processes between systems of varying types, e.g. data types in computer programs. The process which is central to this paper is the one which takes the meanings of words as its input and produces the meaning of a sentence as output, within a fixed type S (Sentence) that allows the representation of the meanings of all well-typed sentences. Diagrammatically, we represent it as a box labelled 'process depending on grammatical structure', taking triangles labelled word 1, word 2, ..., word n, with wires typed A, B, ..., Z, to a single triangle labelled 'sentence' with a wire of type S, where all triangles represent meanings, both of words and of sentences.
For example, the triangle labelled 'word 1' represents the meaning of word 1, which is of grammatical type A, and the triangle labelled 'sentence' represents the meaning of the whole sentence. The concatenation (word 1) · ... · (word n) is the sentence itself, which is of grammatical type A ⊗ ... ⊗ Z, and the way in which the list of meanings of words of types A, B, ..., Z becomes the meaning of a sentence within the fixed type S is mediated by the grammatical structure. The concrete manner in which grammatical structure performs this role will be explained below. The method will exploit the common mathematical structure which vector spaces (used to assign meanings to words in a language) and Pregroups (used to assign grammatical structure to sentences) share, namely compact closure.

3.3 Compact closed categories

A monoidal category is compact closed if for each object A there are also objects A^r and A^l, and morphisms

η^l : I → A ⊗ A^l        ε^l : A^l ⊗ A → I
η^r : I → A^r ⊗ A        ε^r : A ⊗ A^r → I

which satisfy:

(1_A ⊗ ε^l) ◦ (η^l ⊗ 1_A) = 1_A        (ε^r ⊗ 1_A) ◦ (1_A ⊗ η^r) = 1_A
(ε^l ⊗ 1_{A^l}) ◦ (1_{A^l} ⊗ η^l) = 1_{A^l}        (1_{A^r} ⊗ ε^r) ◦ (η^r ⊗ 1_{A^r}) = 1_{A^r}

Compact closed categories are in a sense orthogonal to cartesian categories, such as the category of sets and functions with the cartesian product as the monoidal structure. Diagrammatically, in a cartesian category the triangles representing meanings of type A ⊗ B can always be decomposed into a triangle representing a meaning of type A and a triangle representing a meaning of type B; in the non-cartesian case such a decomposition need not exist. But if we consider a verb, then its grammatical type is n^r s n^l, that is, of the form N ⊗ S ⊗ N within the realm of monoidal categories.
Clearly, to compute the meaning of the whole sentence, the meaning of the verb will need to interact with the meanings of both the subject and the object, so it cannot be decomposed into three disconnected entities. In this graphical language, the topology (i.e. being connected or not) represents when interaction occurs. In other words, 'connectedness' encodes 'correlations'. That we cannot always decompose triangles representing meanings of type A ⊗ B in compact closed categories can be seen immediately in the graphical calculus of compact closed categories, which explicitly introduces wires between different types; these wires will mediate flows of information between the words in a sentence. A fully worked-out example of sentences of this type is given in section 4.1.

Graphical language for compact closed categories. When depicting the morphisms η^l, ε^l, η^r, ε^r (read in a top-down fashion) as bent wires, 'caps' for the etas and 'cups' for the epsilons, rather than as triangles, the axioms of compact closure boil down to 'yanking wires': a wire bent by a cap and then a cup straightens to an identity wire.

Vector spaces, linear maps and the tensor product as a compact closed category. Let FVect be the category which has vector spaces over the base field R as objects, linear maps as morphisms, and the vector space tensor product as the monoidal tensor. In this category the tensor is commutative, i.e. V ⊗ W ≅ W ⊗ V, and left and right adjoints coincide, i.e. V^l = V^r, so we denote either by V∗; moreover the adjoint is the identity on objects, i.e. V∗ = V. To simplify the presentation we assume that each vector space comes with an inner product, that is, it is an inner-product space. For vector space models of meaning this is always the case, since we consider a fixed basis, and a fixed basis canonically induces an inner product.
The reader can verify that compact closure arises, given a vector space V with basis {e_i}_i, by setting V^l = V^r = V,

η^l = η^r : R → V ⊗ V :: 1 ↦ Σ_i e_i ⊗ e_i        (4)

and

ε^l = ε^r : V ⊗ V → R :: Σ_ij c_ij v_i ⊗ w_j ↦ Σ_ij c_ij ⟨v_i | w_j⟩.        (5)

In equation (5), ε^l = ε^r is the inner product extended by linearity to the whole tensor product, where the weighted sum Σ_ij c_ij v_i ⊗ w_j denotes a typical vector in a tensor space V ⊗ W and the c_ij enumerate all possible weights for the tensored pairs v_i ⊗ w_j. Recall that if {e_i}_i is a basis for V and {e′_i}_i is a basis for W, then {e_i ⊗ e′_j}_ij is a basis for V ⊗ W. In the basis {e_i ⊗ e_j}_ij for V ⊗ V, the linear map ε^l = ε^r : V ⊗ V → R has as its matrix the row vector with entry 1 for the basis vectors e_i ⊗ e_i and entry 0 for the basis vectors e_i ⊗ e_j with i ≠ j. The matrix of η^l = η^r is the column vector obtained by transposition. If in the definition of ε^l = ε^r we apply the restriction that v_i = w_i = e_i, which we may do since ε^l = ε^r is a linear map, then it simplifies to

ε^l = ε^r : V ⊗ V → R :: Σ_ij c_ij e_i ⊗ e_j ↦ Σ_i c_ii.

A Pregroup as a compact closed category. A Pregroup is an example of a posetal category, that is, a category which is also a poset. For a category this means that between any two objects there is at most one morphism. In the case that this morphism is of type A → B we write A ≤ B, and in the case it is of type B → A we write B ≤ A. The reader can then verify that the axioms of a category guarantee that the relation ≤ on |C| is indeed a partial order. Conversely, any partially ordered set (P, ≤) is a category: for 'objects' p, q, r ∈ P we take the hom-set [p ≤ q] to be the singleton {p ≤ q} whenever p ≤ q, and empty otherwise.
If p ≤ q and q ≤ r, we define p ≤ r to be the composite of the 'morphisms' p ≤ q and q ≤ r. A partially ordered monoid is a monoidal category with the monoid multiplication as the tensor on objects: whenever p ≤ r and q ≤ z we have p · q ≤ r · z by monotonicity of the monoid multiplication, and we define this to be the tensor of the 'morphisms' [p ≤ r] and [q ≤ z]. Bifunctoriality, as well as any other equational statement between morphisms in posetal categories, is trivially satisfied, since there can be only one morphism between any two objects. Finally, each Pregroup is a compact closed category for

η^l = [1 ≤ p · p^l]        ε^l = [p^l · p ≤ 1]
η^r = [1 ≤ p^r · p]        ε^r = [p · p^r ≤ 1]

and so the required equations are again trivially satisfied. Diagrammatically, the under-links representing the type reductions in a Pregroup grammar are exactly the 'cups' of the compact closed structure. The symbolic counterpart of the reduction diagram of a sentence with a transitive verb, typed n (n^r s n^l) n, is the following morphism:

ε^r_n ⊗ 1_s ⊗ ε^l_n : n ⊗ n^r ⊗ s ⊗ n^l ⊗ n → s.

3.4 Categories representing both grammar and meaning

We have described two aspects of natural language which admit mathematical presentations:

1. vector spaces can be used to assign meanings to words in a language;
2. Pregroups can be used to assign grammatical structure to sentences.

When we organize these vector spaces as a monoidal category by also considering linear maps, and tensor products both of vector spaces and of linear maps, then these two mathematical objects share common structure, namely compact closure.
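In FVect, the eta and epsilon maps of eqs. (4) and (5) have concrete matrices, and the 'yanking' axioms of compact closure can be verified numerically. This is a sketch with an arbitrary dimension d:

```python
import numpy as np

d = 3
I = np.eye(d)

# eta : R -> V (x) V sends 1 to sum_i e_i (x) e_i  (a d^2-component column);
# eps : V (x) V -> R sends e_i (x) e_j to <e_i|e_j>  (a d^2-component row).
# Both are reshapings of the identity matrix.
eta = I.reshape(d * d, 1)
eps = I.reshape(1, d * d)

# One of the yanking axioms, (1_V (x) eps) o (eta (x) 1_V) = 1_V:
lhs = np.kron(I, eps) @ np.kron(eta, I)
assert np.allclose(lhs, I)
```

The assertion is exactly the statement that bending a wire up with a cap and back down with a cup straightens it into the identity wire.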
We can think of these two compact closed structures as two structures that we can project out of a language, where P is the free Pregroup generated from the basic types of the natural language: the language maps to FVect on the meaning side and to P on the grammar side.

We aim for a mathematical structure that unifies both of these aspects of language, that is, one in which the compositional structure of Pregroups lifts to the level of assigning meaning to sentences and their constituents, or, dually, in which the structure of assigning meaning to words comes with a mechanism that enables us to compute the meaning of a sentence. The compact closed structure of FVect alone is too degenerate for this purpose, since A^l = A^r = A. Moreover, there are canonical isomorphisms V ⊗ W → W ⊗ V, which translate to posetal categories as a · b = b · a, and in general we should not be able to exchange words in a sentence without altering its meaning. Therefore we have to refine the types so as to retain the full grammatical content obtained from the Pregroup analysis. There is an easy way of doing this: rather than objects in FVect, we consider objects in the product category FVect × P, through which the two projections π_m (meaning, to FVect) and π_g (grammar, to P) factor.

Explicitly, FVect × P is the category which has pairs (V, a), with V a vector space and a ∈ P a grammatical type, as objects, and pairs

(f : V → W, p ≤ q)

as morphisms, which we also write as (f, ≤) : (V, p) → (W, q). Note that if p ≰ q then there are no morphisms of type (V, p) → (W, q). It is easy to verify that the compact closed structures of FVect and P lift componentwise to one on FVect × P. The structural morphisms in this new category are now:

(η^l, ≤) : (R, 1) → (V ⊗ V, p · p^l)        (η^r, ≤) : (R, 1) → (V ⊗ V, p^r · p)
(ε^l, ≤) : (V ⊗ V, p^l · p) → (R, 1)        (ε^r, ≤) : (V ⊗ V, p · p^r) → (R, 1)

3.5 Meaning of a sentence as a morphism in FVect × P

Definition 3.1.
W e r efer to an object ( W, p ) of Fv ect × P as a meaning space . This consists of a vector space W in which the meaning of a wor d lives − → w ∈ W and the grammatical type p of the wor d. 15 Definition 3.2. W e define the vector − − − − − − → w 1 · · · w n of the meaning of a string of words w 1 · · · w n to be − − − − − − → w 1 · · · w n := f ( − → w 1 ⊗ · · · ⊗ − → w n ) wher e for ( W i , p i ) meaning space of the wor d w i , the linear map f is built by substituting each p i in [ p 1 · · · p n ≤ x ] with W i . Thus for α = [ p 1 · · · p n → x ] a morphism in P and f = α [ p i \ W i ] a linear map in Fv ect , the following is a morphism in Fv ect × P : ( W 1 ⊗ · · · ⊗ W n , p 1 · · · p n ) ( f , ≤ ) - ( X , x ) W e call f the ‘from-meaning-of-words-to-meaning-of-a-sentence’ map. According to this formal definition, the procedure of assigning meaning to a string of words can be roughly described as follo ws: 1. Assign a grammatical type p i to each w ord w i of the string, apply the axioms and rules of the Pregroup grammar to reduce these types to a simpler type p 1 · · · p n → x . If the string of words is a sentence, then the reduced type x should be the basic grammatical type s of a sentence 4 . 2. Assign a vector space to each word of the sentence based on its syntactic type assignment. For the purpose of this paper , we prefer to be flexible with the manner in which these vector spaces are b uilt, e.g. the vector spaces of the words with basic types like noun may be atomic and built according to the usual rules of the distributional model; the vector spaces of the words with compound types like v erbs are tensor spaces. 3. Consider the vector of the meaning of each word in the spaces built above, take their tensor , and apply to it the diagram of the syntactic reduction of the string, according to the meaning spaces of each word. This will provide us with the meaning of the string. 
3.6 Comparison with the connectionist proposal

Following the connectionist proposal [39], Pulman and the third author argued for the use of tensor products in developing a compositional distributional model of meaning [6]. They suggested that to implement this idea in linguistics one can, for example, traverse the parse tree of a sentence and tensor the vectors of the meanings of words with the vectors of their grammatical roles:

$(\overrightarrow{John} \otimes \overrightarrow{subj}) \otimes \overrightarrow{likes} \otimes (\overrightarrow{Mary} \otimes \overrightarrow{obj})$

[Footnote 4: By Lambek's switching lemma [19], the epsilon maps suffice for the grammatical reductions, so $x$ already occurs in the type of one of the words in the string.]

This vector in the tensor product space is then regarded as the meaning of the sentence "John likes Mary." The tensors $(\overrightarrow{John} \otimes \overrightarrow{subj})$ and $(\overrightarrow{Mary} \otimes \overrightarrow{obj})$ above are pure tensors, and can thus be considered as pairs of vectors, i.e. $(\overrightarrow{John}, \overrightarrow{subj})$ and $(\overrightarrow{Mary}, \overrightarrow{obj})$. These are pairs of the meaning of a word and its grammatical role, and are almost the same as the pairs considered in our approach, namely the meaning spaces of the words. A minor difference is that, in the above, the grammatical role is a genuine vector, whereas in our approach it remains a grammatical type; if needed, our approach can easily be adapted to represent types in a vector space as well. A more conceptual difference is that the above does not assign a grammatical type to the verb, i.e. it treats $\overrightarrow{likes}$ as a single vector, whereas in our approach the vector of the verb itself lives in a tensor space.

4 Computing the meaning of example sentences

In what follows we use the steps above to assign meaning to positive and negative transitive sentences (see footnote 5).

4.1 Positive Transitive Sentence

A positive sentence with a transitive verb has the Pregroup type $n\, (n^r s\, n^l)\, n$.
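The type reduction itself is easy to mechanize. Below is a minimal pregroup reducer (our own illustrative sketch, not code from the paper): a simple type is a pair (base, z) where z counts iterated adjoints, so $x^l = (x, -1)$, $x = (x, 0)$, $x^r = (x, +1)$, and adjacent pairs $(b, z)(b, z+1)$ contract to $1$, covering both $x^l \cdot x \to 1$ and $x \cdot x^r \to 1$.

```python
# A stack-based pregroup reducer: scan left to right, cancelling an
# incoming type against the top of the stack whenever they form an
# adjacent adjoint pair (b, z)(b, z+1).
def reduce_types(types):
    stack = []
    for t in types:
        if stack and stack[-1][0] == t[0] and t[1] == stack[-1][1] + 1:
            stack.pop()              # contract the adjoint pair to 1
        else:
            stack.append(t)
    return stack

# "John likes Mary": n . (n^r s n^l) . n
sentence_type = [('n', 0), ('n', 1), ('s', 0), ('n', -1), ('n', 0)]
print(reduce_types(sentence_type))   # reduces to the sentence type s
```

This greedy stack scan suffices here because, by the switching lemma mentioned above, the reductions we need use only epsilon (contraction) steps.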
We assume that the meaning spaces of the subject and the object of the sentence are atomic, given as $(V, n)$ and $(W, n)$, and that the meaning space of the verb is compound, given as $(T, n^r s\, n^l)$ with $T = V \otimes S \otimes W$. The 'from-meaning-of-words-to-meaning-of-a-sentence' linear map $f$ realizes the following structural morphism in $FVect \times P$:

$(V \otimes T \otimes W,\ n\, (n^r s\, n^l)\, n) \xrightarrow{(f,\, \le)} (S, s)$

and arises from a syntactic reduction map; in this case we obtain:

$f = \epsilon_V \otimes 1_S \otimes \epsilon_W : V \otimes (V \otimes S \otimes W) \otimes W \to S.$

[Footnote 5: For the negative example, we use the idea and treatment of previous work [31], in that we use eta maps to interpret the logical meaning of "does" and "not", but we extend the details of the calculations, the diagrammatic representations, and the corresponding comparisons.]

Noting the isomorphisms $V \otimes S \otimes W \cong V \otimes W \otimes S \cong V^* \otimes W^* \otimes S$, obtained from the commutativity of the tensor in $FVect$ and the fact that $V^* = V$ and $W^* = W$ there, together with the universal property of the tensor with respect to the product, we can think of the meaning space $V \otimes S \otimes W$ of a verb as the function space $V \times W \to S$. So the meaning vector of a transitive verb can be thought of as a function that takes a subject from $V$ and an object from $W$ as input, and outputs a sentence in $S$.

[Diagram: the linear map $f$ in the graphical calculus.]

The matrix of $f$ has $\dim(V)^2 \times \dim(S) \times \dim(W)^2$ columns and $\dim(S)$ rows, and its entries are either $0$ or $1$. Applying it to the vectors of the meanings of the words, i.e. computing $f(\vec{v} \otimes \vec{\Psi} \otimes \vec{w}) \in S$ for $\vec{v} \otimes \vec{\Psi} \otimes \vec{w} \in V \otimes T \otimes W$, we obtain, diagrammatically:

[Diagram: $f$ applied to $\vec{v} \otimes \vec{\Psi} \otimes \vec{w}$.]

This map can be expressed in terms of the inner-product as follows.
Consider the general vector in the tensor space representing the type of the verb,

$\vec{\Psi} = \sum_{ijk} c_{ijk}\, \vec{v}_i \otimes \vec{s}_j \otimes \vec{w}_k \ \in\ V \otimes S \otimes W;$

then

$f(\vec{v} \otimes \vec{\Psi} \otimes \vec{w}) = (\epsilon_V \otimes 1_S \otimes \epsilon_W)(\vec{v} \otimes \vec{\Psi} \otimes \vec{w}) = \sum_{ijk} c_{ijk}\, \langle \vec{v} \mid \vec{v}_i \rangle\, \vec{s}_j\, \langle \vec{w}_k \mid \vec{w} \rangle = \sum_j \Big( \sum_{ik} c_{ijk}\, \langle \vec{v} \mid \vec{v}_i \rangle \langle \vec{w}_k \mid \vec{w} \rangle \Big)\, \vec{s}_j.$

This vector is the meaning of the sentence of type $n\, (n^r s\, n^l)\, n$; it takes as given the meanings of its constituents $\vec{v} \in V$, $\vec{\Psi} \in T$ and $\vec{w} \in W$, obtained from data by some suitable method. Note that, in Dirac notation, $f(\vec{v} \otimes \vec{\Psi} \otimes \vec{w})$ is written as:

$\big( \langle \epsilon^r_V | \otimes 1_S \otimes \langle \epsilon^l_W | \big)\big( \vec{v} \otimes \vec{\Psi} \otimes \vec{w} \big).$

Also, the diagrammatic calculus tells us that the triangles representing $\vec{v}$ and $\vec{w}$ can be slid along the 'cups' and reversed, becoming the corresponding Dirac bras, or in vector space terms, the corresponding functionals in the dual space. This simplifies the expression that we need to compute to:

$\big( \langle \vec{v} | \otimes 1_S \otimes \langle \vec{w} | \big)\, | \vec{\Psi} \rangle$

As mentioned in the introduction, our focus in this paper is not on how to practically exploit the mathematical framework, which would require substantial further research, but on exposing the mechanisms which govern it. To show that this particular computation (i.e. the 'from-meaning-of-words-to-meaning-of-a-sentence' process) does indeed produce a vector which captures the meaning of a sentence, we explicitly compute $f(\vec{v} \otimes \vec{\Psi} \otimes \vec{w})$ for some simple examples, with the intention of giving the reader some insight into the underlying mechanisms and how the approach relates to existing frameworks.

Example 1. One-Dimensional Truth-Theoretic Meaning. Consider the sentence

John likes Mary.  (6)

We encode this sentence as follows: we have $\overrightarrow{John} \in V$, $\overrightarrow{likes} \in T$ and $\overrightarrow{Mary} \in W$, where we take $V$ to be the vector space spanned by men and $W$ the vector space spanned by women.
In terms of context vectors this means that each word is its own and only context vector, which is of course far too simple an idealisation for practical purposes. We will conveniently assume that all men are referred to as male, using indices to distinguish them: $m_i$. Thus the set of vectors $\{\vec{m}_i\}_i$ spans $V$. Similarly, every woman is referred to as female and distinguished by an index $j$, and the set of vectors $\{\vec{f}_j\}_j$ spans $W$. Let us assume that John in sentence (6) is $m_3$ and that Mary is $f_4$.

If we are only interested in the truth or falsity of a sentence, we have two choices in creating the sentence space $S$: it can be spanned by two basis vectors $|0\rangle$ and $|1\rangle$, representing the truth values false and true, or by just a single vector $\vec{1}$, which we identify with true; the origin $\vec{0}$ is then identified with false (we use Dirac notation for the basis to distinguish between the origin $\vec{0}$ and the basis vector $|0\rangle$). The latter approach might feel a little unintuitive, but it enables us to establish a convenient connection with the relational Montague-style models of meaning, which we present in the last section of the paper.

The transitive verb likes is encoded as the superposition

$\overrightarrow{likes} = \sum_{ij} \vec{m}_i \otimes \overrightarrow{likes}_{ij} \otimes \vec{f}_j$

where $\overrightarrow{likes}_{ij} = \vec{1}$ if $m_i$ likes $f_j$, and $\overrightarrow{likes}_{ij} = \vec{0}$ otherwise. Of course, in practice, the vector we have constructed here by hand would be obtained automatically from data by some suitable method. Finally, we obtain:

$f\big( \vec{m}_3 \otimes \overrightarrow{likes} \otimes \vec{f}_4 \big) = \sum_{ij} \langle \vec{m}_3 \mid \vec{m}_i \rangle\, \overrightarrow{likes}_{ij}\, \langle \vec{f}_j \mid \vec{f}_4 \rangle = \sum_{ij} \delta_{3i}\, \overrightarrow{likes}_{ij}\, \delta_{j4} = \overrightarrow{likes}_{34} = \begin{cases} \vec{1} & m_3 \text{ likes } f_4 \\ \vec{0} & \text{otherwise} \end{cases}$

So we indeed obtain the correct truth-value meaning of our sentence. We are not restricted to truth-value meanings; on the contrary, we can also have, for example, degrees of meaning, as shown in Section 5.

Example 1b.
Two-Dimensional Truth-Theoretic Meaning. It would be more intuitive to assume that the sentence space $S$ is spanned by two vectors, $|0\rangle$ and $|1\rangle$, standing for false and true respectively. In this case the computation of the meaning map proceeds in exactly the same way as in the one-dimensional case. The only difference is that when the sentence "John likes Mary" is false, the vector $\overrightarrow{likes}_{ij}$ takes the value $|0\rangle$ rather than the origin $\vec{0}$, and when it is true it takes the value $|1\rangle$ rather than $\vec{1}$.

4.2 Negative Transitive Sentence

The types of a sentence with negation and a transitive verb, for example "John does not like Mary", are:

$n\ (n^r s\, j^l \sigma)\ (\sigma^r j\, j^l \sigma)\ (\sigma^r j\, n^l)\ n$

As in the positive case, we assume that the meaning spaces of the subject and the object are atomic: $(V, n)$ and $(W, n)$. The meaning space of the auxiliary verb is $(V \otimes S \otimes J \otimes V,\ n^r s\, j^l \sigma)$, that of the negation particle is $(V \otimes J \otimes J \otimes V,\ \sigma^r j\, j^l \sigma)$, and that of the verb is $(V \otimes J \otimes W,\ \sigma^r j\, n^l)$. The 'from-meaning-of-words-to-meaning-of-a-sentence' linear map $f$ is:

$f = (1_S \otimes \epsilon_J \otimes \epsilon_J) \circ (\epsilon_V \otimes 1_S \otimes 1_{J^*} \otimes \epsilon_V \otimes 1_J \otimes 1_{J^*} \otimes \epsilon_V \otimes 1_J \otimes \epsilon_W) :$
$V \otimes (V^* \otimes S \otimes J^* \otimes V) \otimes (V^* \otimes J \otimes J^* \otimes V) \otimes (V^* \otimes J \otimes W^*) \otimes W \to S$

[Diagram: the map $f$ for the negative sentence in the graphical calculus.]

When applied to the meaning vectors of the words one obtains $f(\vec{v} \otimes \overrightarrow{does} \otimes \overrightarrow{not} \otimes \vec{\Psi} \otimes \vec{w})$, depicted as:

[Diagram: $f$ applied to $\vec{v} \otimes \overrightarrow{does} \otimes \overrightarrow{not} \otimes \vec{\Psi} \otimes \vec{w}$.]

where $\overrightarrow{does}$ and $\overrightarrow{not}$ are the vectors corresponding to the meanings of "does" and "not". Since these are logical function words, we may decide to assign meaning to them manually, without consulting a corpus. For instance, for does we set $S = J$ and

$\overrightarrow{does} = \sum_{ij} \vec{e}_i \otimes \vec{e}_j \otimes \vec{e}_j \otimes \vec{e}_i \ \in\ V \otimes J \otimes J \otimes V.$
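Concretely, this vector can be checked with a small numpy computation (our own illustrative sketch; the dimensions and the test vector $v$ are arbitrary choices): the components of $\overrightarrow{does}$ are $\delta_{il}\delta_{jk}$, and contracting a vector into its first leg just reproduces that vector on the last leg, paired with an identity 'wire' on $J$.

```python
import numpy as np

dim_v, dim_j = 2, 2
I_V, I_J = np.eye(dim_v), np.eye(dim_j)

# does = sum_ij e_i (x) e_j (x) e_j (x) e_i in V (x) J (x) J (x) V:
# its component at position (i, j, k, l) is delta_{jk} * delta_{il}.
does = np.einsum('il,jk->ijkl', I_V, I_J)

# Feeding an arbitrary subject vector into the first leg yields that same
# vector on the last leg, times the identity on J: "does" only copies.
v = np.array([0.3, 0.7])
flowed = np.einsum('i,ijkl->jkl', v, does)
# flowed[j, k, l] = delta_{jk} * v[l]
assert np.allclose(flowed, np.einsum('jk,l->jkl', I_J, v))
print("does acts as an identity wire")
```

This is the numerical counterpart of the observation that $\overrightarrow{does}$ is built from eta maps alone.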
As explained in Section 3.1, vectors in $V \otimes J \otimes J \otimes V$ can also be presented as linear maps of type $\mathbb{R} \to V \otimes J \otimes J \otimes V$, and in the case of does we have:

$\overrightarrow{does} \simeq (1_V \otimes \eta_J \otimes 1_V) \circ \eta_V : \mathbb{R} \to V \otimes J \otimes J \otimes V$

which shows that we relied only on structural morphisms. As we demonstrate in the examples below, by relying only on $\eta$-maps, does acts very much as an 'identity' with respect to the flow of information between the words in a sentence.

This can be formalized in a more mathematical manner. There is a well-known bijective correspondence between linear maps of type $V \to W$ and vectors in $V \otimes W$: given a linear map $f : V \to W$, the corresponding vector is

$\Psi_f = \sum_i \vec{e}_i \otimes f(\vec{e}_i)$

where $\{\vec{e}_i\}_i$ is a basis for $V$.

[Diagram: the map $f : V \to W$ and its corresponding state $\Psi_f$ in $V \otimes W$.]

Taking this linear map to be the identity on $V$, we obtain $\eta_V$. The trick for implementing not is to take this linear map to be the matrix representing logical not. Concretely, while the matrix of the identity is

$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$

the matrix of logical not is

$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$

In Dirac notation, the vector corresponding to the identity is $|00\rangle + |11\rangle$, while the vector corresponding to logical not is $|01\rangle + |10\rangle$. So while we have

$\overrightarrow{does} = \sum_i \vec{e}_i \otimes (|00\rangle + |11\rangle) \otimes \vec{e}_i \ \in\ V \otimes J \otimes J \otimes V,$

we set

$\overrightarrow{not} = \sum_i \vec{e}_i \otimes (|01\rangle + |10\rangle) \otimes \vec{e}_i \ \in\ V \otimes J \otimes J \otimes V.$

[Diagram: $\overrightarrow{does}$ and $\overrightarrow{not}$ in the graphical calculus.]

Substituting all of this into $f(\vec{v} \otimes \overrightarrow{does} \otimes \overrightarrow{not} \otimes \vec{\Psi} \otimes \vec{w})$, the diagrammatic calculus of compact closed categories lets us straighten the wires and simplify the resulting diagram (7): the configuration of cups and caps around the not-box encodes its transpose, and the matrix of not is its own transpose. In the language of vectors and linear maps, the left hand side of eq. (7) becomes:

$\Big( \epsilon_V \otimes \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \otimes \epsilon_W \Big) \big( \vec{v} \otimes \vec{\Psi} \otimes \vec{w} \big).$
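The map-state correspondence and the action of the not-matrix can be verified in a few lines of numpy (a sketch of our own; the truth-value assignment for the toy sentence is an arbitrary choice): since $\Psi_f$ has components $\Psi_f[i,j] = f[j,i]$, the state of the identity is $|00\rangle + |11\rangle$ and the state of not is $|01\rangle + |10\rangle$, and negating a sentence meaning amounts to applying the not-matrix to the positive meaning.

```python
import numpy as np

NOT = np.array([[0., 1.],
                [1., 0.]])           # matrix of logical not: swaps |0> and |1>
ket0, ket1 = np.array([1., 0.]), np.array([0., 1.])

# Psi_f = sum_i e_i (x) f(e_i) has components Psi_f[i, j] = f[j, i], so
# flattening the transpose gives the corresponding state vector:
assert np.array_equal(np.eye(2).T.reshape(4), np.array([1., 0., 0., 1.]))  # |00>+|11>
assert np.array_equal(NOT.T.reshape(4),       np.array([0., 1., 1., 0.]))  # |01>+|10>

# Eq. (7) in miniature: negation applies NOT to the positive meaning.
likes_34 = ket1                      # toy value: "John likes Mary" is true
print(NOT @ likes_34)                # |0>: the negated sentence is false
```

The self-transposedness of NOT, used in the diagrammatic simplification above, is visible here as `NOT.T` equalling `NOT`.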
Note that the above pictures are very similar to those encountered in [7, 9], which describe quantum informatic protocols such as quantum teleportation and entanglement swapping. There, the morphisms $\eta$ and $\epsilon$ encode Bell states and the corresponding measurement projectors.

Example 2. Negative Truth-Theoretic Meaning. The meaning of the sentence

John does not like Mary

is calculated as follows. We assume that the vector spaces $S = J$ are spanned by the two vectors of Example 1b,

$|1\rangle = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \quad \text{and} \quad |0\rangle = \begin{pmatrix} 1 \\ 0 \end{pmatrix},$

where $|1\rangle$ stands for true and $|0\rangle$ for false. The vector spaces $V$ and $W$ are as in the positive case above, and the vector of like is as before:

$\overrightarrow{like} = \sum_{ij} \vec{m}_i \otimes \overrightarrow{like}_{ij} \otimes \vec{f}_j \quad \text{for} \quad \overrightarrow{like}_{ij} = \begin{cases} |1\rangle & m_i \text{ likes } f_j \\ |0\rangle & \text{otherwise.} \end{cases}$

Setting $N = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, we obtain:

$(\epsilon_V \otimes N \otimes \epsilon_W)\big( \vec{m}_3 \otimes \overrightarrow{likes} \otimes \vec{f}_4 \big) = \sum_{ij} \langle \vec{m}_3 \mid \vec{m}_i \rangle\, N(\overrightarrow{likes}_{ij})\, \langle \vec{f}_j \mid \vec{f}_4 \rangle = \sum_{ij} \delta_{3i}\, N(\overrightarrow{likes}_{ij})\, \delta_{j4} = N(\overrightarrow{likes}_{34}) = \begin{cases} |1\rangle & \overrightarrow{like}_{34} = |0\rangle \\ |0\rangle & \overrightarrow{like}_{34} = |1\rangle \end{cases} = \begin{cases} |1\rangle & m_3 \text{ does not like } f_4 \\ |0\rangle & \text{otherwise.} \end{cases}$

That is, the meaning of "John does not like Mary" is true exactly when $\overrightarrow{like}_{34}$ is false, i.e. when the meaning of "John likes Mary" is false.

For those readers who are suspicious of our graphical reasoning, here is the full-blown symbolic computation.
Abbreviating $|10\rangle + |01\rangle$ to $\mathbf{n}$ and $|00\rangle + |11\rangle$ to $\mathbf{d}$, and setting $f = h \circ g$ with

$h = 1_J \otimes \epsilon_J \otimes \epsilon_J \quad \text{and} \quad g = \epsilon_V \otimes 1_J \otimes 1_J \otimes \epsilon_V \otimes 1_J \otimes 1_J \otimes \epsilon_V \otimes 1_J \otimes \epsilon_W,$

we have:

$f\Big( \vec{m}_3 \otimes \sum_l \vec{m}_l \otimes \mathbf{d} \otimes \vec{m}_l \otimes \sum_k \vec{m}_k \otimes \mathbf{n} \otimes \vec{m}_k \otimes \sum_{ij} \vec{m}_i \otimes \overrightarrow{like}_{ij} \otimes \vec{f}_j \otimes \vec{f}_4 \Big)$
$= h\Big( \sum_{ijkl} \langle \vec{m}_3 \mid \vec{m}_l \rangle\, \mathbf{d}\, \langle \vec{m}_l \mid \vec{m}_k \rangle\, \mathbf{n}\, \langle \vec{m}_k \mid \vec{m}_i \rangle\, \overrightarrow{like}_{ij}\, \langle \vec{f}_j \mid \vec{f}_4 \rangle \Big)$
$= h\Big( \sum_{ijkl} \delta_{3l}\, \mathbf{d}\, \delta_{lk}\, \mathbf{n}\, \delta_{ki}\, \overrightarrow{like}_{ij}\, \delta_{j4} \Big)$
$= h\big( \mathbf{d} \otimes \mathbf{n} \otimes \overrightarrow{like}_{34} \big)$
$= h\big( (|00\rangle + |11\rangle) \otimes (|10\rangle + |01\rangle) \otimes \overrightarrow{like}_{34} \big)$
$= h\big( |0010\,\overrightarrow{like}_{34}\rangle + |0001\,\overrightarrow{like}_{34}\rangle + |1110\,\overrightarrow{like}_{34}\rangle + |1101\,\overrightarrow{like}_{34}\rangle \big)$
$= |0\rangle \langle 0|1\rangle \langle 0 \mid \overrightarrow{like}_{34}\rangle + |0\rangle \langle 0|0\rangle \langle 1 \mid \overrightarrow{like}_{34}\rangle + |1\rangle \langle 1|1\rangle \langle 0 \mid \overrightarrow{like}_{34}\rangle + |1\rangle \langle 1|0\rangle \langle 1 \mid \overrightarrow{like}_{34}\rangle$
$= |0\rangle \langle 1 \mid \overrightarrow{like}_{34}\rangle + |1\rangle \langle 0 \mid \overrightarrow{like}_{34}\rangle$
$= \begin{cases} |1\rangle & \overrightarrow{like}_{34} = |0\rangle \\ |0\rangle & \overrightarrow{like}_{34} = |1\rangle \end{cases}$

5 Comparing meanings of sentences

One of the advantages of our approach to compositional meaning is that the meanings of all sentences are vectors in one and the same space, so we can use the inner product to compare them. This measure has been widely used as a degree of similarity between meanings of words in distributional approaches to meaning [36]. Here we extend it to strings of words as follows.

Definition 5.1. Two strings of words $w_1 \cdots w_k$ and $w'_1 \cdots w'_l$ have degree of similarity $m$ iff their Pregroup reductions result in the same grammatical type [Footnote 6: If one wishes to do so, meanings of phrases that do not have the same grammatical type can also be compared, but only after transferring them to a common dummy space.]
and we have

$\frac{1}{N \times N'} \Big\langle f\big( \vec{w}_1 \otimes \cdots \otimes \vec{w}_k \big) \ \Big|\ f'\big( \vec{w}'_1 \otimes \cdots \otimes \vec{w}'_l \big) \Big\rangle = m$

for

$N = \big| f( \vec{w}_1 \otimes \cdots \otimes \vec{w}_k ) \big| \qquad N' = \big| f'( \vec{w}'_1 \otimes \cdots \otimes \vec{w}'_l ) \big|$

where $|\vec{v}|$ is the norm of $\vec{v}$, that is, $|\vec{v}|^2 = \langle \vec{v} \mid \vec{v} \rangle$, and $f$, $f'$ are the meaning maps defined according to Definition 3.2.

We use this tool to compare meanings of positive sentences to each other, meanings of negative sentences to each other, and, more interestingly, meanings of positive sentences to negative ones. For example, we compare the meaning of "John likes Mary" to "John loves Mary", the meaning of "John does not like Mary" to "John does not love Mary", and also the meanings of the latter two sentences to "John likes Mary" and "John loves Mary". To make the examples more interesting, we assume that "likes" has degrees of both "love" and "hate".

Example 3. Hierarchical Meaning. Similarly to before, we have:

$\overrightarrow{loves} = \sum_{ij} \vec{m}_i \otimes \overrightarrow{loves}_{ij} \otimes \vec{f}_j \qquad \overrightarrow{hates} = \sum_{ij} \vec{m}_i \otimes \overrightarrow{hates}_{ij} \otimes \vec{f}_j$

where $\overrightarrow{loves}_{ij} = |1\rangle$ if $m_i$ loves $f_j$ and $|0\rangle$ otherwise, and $\overrightarrow{hates}_{ij} = |1\rangle$ if $m_i$ hates $f_j$ and $|0\rangle$ otherwise. Define likes to have degrees of love and hate as follows:

$\overrightarrow{likes} = \frac{3}{4}\overrightarrow{loves} + \frac{1}{4}\overrightarrow{hates} = \sum_{ij} \vec{m}_i \otimes \Big( \frac{3}{4}\overrightarrow{loves}_{ij} + \frac{1}{4}\overrightarrow{hates}_{ij} \Big) \otimes \vec{f}_j$

The meaning of our example sentence is thus obtained as follows:

$f\big( \vec{m}_3 \otimes \overrightarrow{likes} \otimes \vec{f}_4 \big) = f\Big( \vec{m}_3 \otimes \big( \tfrac{3}{4}\overrightarrow{loves} + \tfrac{1}{4}\overrightarrow{hates} \big) \otimes \vec{f}_4 \Big) = \sum_{ij} \langle \vec{m}_3 \mid \vec{m}_i \rangle \Big( \tfrac{3}{4}\overrightarrow{loves}_{ij} + \tfrac{1}{4}\overrightarrow{hates}_{ij} \Big) \langle \vec{f}_j \mid \vec{f}_4 \rangle = \sum_{ij} \delta_{3i} \Big( \tfrac{3}{4}\overrightarrow{loves}_{ij} + \tfrac{1}{4}\overrightarrow{hates}_{ij} \Big) \delta_{j4} = \frac{3}{4}\overrightarrow{loves}_{34} + \frac{1}{4}\overrightarrow{hates}_{34}$

Example 4. Negative Hierarchical Meaning.
To obtain the meaning of "John does not like Mary" in this case, one inserts $\frac{3}{4}\overrightarrow{loves}_{ij} + \frac{1}{4}\overrightarrow{hates}_{ij}$ for $\overrightarrow{likes}_{ij}$ in the calculations, obtaining:

$h\Big( \mathbf{d} \otimes \mathbf{n} \otimes \big( \tfrac{3}{4}\overrightarrow{loves}_{34} + \tfrac{1}{4}\overrightarrow{hates}_{34} \big) \Big) = \frac{1}{4}\overrightarrow{loves}_{34} + \frac{3}{4}\overrightarrow{hates}_{34}$

That is, the meaning of "John does not like Mary" is the vector obtained from the meaning of "John likes Mary" by swapping the basis vectors.

Example 5. Degree of similarity of positive sentences. The meanings of the distinct verbs loves, likes and hates in the different sentences propagate through the reduction mechanism and reveal themselves when computing inner products between sentences in the sentence space. For instance, the sentences "John loves Mary" and "John likes Mary" have a degree of similarity of $3/4$, calculated as follows:

$\Big\langle f\big( \vec{m}_3 \otimes \overrightarrow{loves} \otimes \vec{f}_4 \big) \ \Big|\ f\big( \vec{m}_3 \otimes \overrightarrow{likes} \otimes \vec{f}_4 \big) \Big\rangle = \Big\langle \overrightarrow{loves}_{34} \ \Big|\ \overrightarrow{likes}_{34} \Big\rangle$

Expanding the definition of $\overrightarrow{likes}_{34}$, we obtain:

$\Big\langle \overrightarrow{loves}_{34} \ \Big|\ \tfrac{3}{4}\overrightarrow{loves}_{34} + \tfrac{1}{4}\overrightarrow{hates}_{34} \Big\rangle = \frac{3}{4} \Big\langle \overrightarrow{loves}_{34} \Big| \overrightarrow{loves}_{34} \Big\rangle + \frac{1}{4} \Big\langle \overrightarrow{loves}_{34} \Big| \overrightarrow{hates}_{34} \Big\rangle$

and since $\overrightarrow{loves}_{34}$ and $\overrightarrow{hates}_{34}$ are always orthogonal (if one is $|1\rangle$ then the other is $|0\rangle$), we have that

$\Big\langle \overrightarrow{John\ loves\ Mary} \ \Big|\ \overrightarrow{John\ likes\ Mary} \Big\rangle = \frac{3}{4} \big| \overrightarrow{loves}_{34} \big|^2$

Hence the degree of similarity of these sentences is $3/4$. A similar calculation provides the following degrees of similarity; for notational simplicity we drop the squared norms from now on, i.e. we implicitly normalize the meaning vectors:

$\Big\langle \overrightarrow{John\ hates\ Mary} \ \Big|\ \overrightarrow{John\ likes\ Mary} \Big\rangle = \Big\langle f\big( \vec{m}_3 \otimes \overrightarrow{hates} \otimes \vec{f}_4 \big) \Big| f\big( \vec{m}_3 \otimes \overrightarrow{likes} \otimes \vec{f}_4 \big) \Big\rangle = \frac{1}{4}$

$\Big\langle \overrightarrow{John\ loves\ Mary} \ \Big|\ \overrightarrow{John\ hates\ Mary} \Big\rangle = \Big\langle f\big( \vec{m}_3 \otimes \overrightarrow{loves} \otimes \vec{f}_4 \big) \Big| f\big( \vec{m}_3 \otimes \overrightarrow{hates} \otimes \vec{f}_4 \big) \Big\rangle = 0.$

Example 6.
Degree of similarity of negative sentences. In the negative case, the meaning of the composition of the auxiliary and the negation marker ("does not"), applied to the meaning of the verb, propagates through the computations and determines the cases of the inner product. For instance, the sentences "John does not love Mary" and "John does not like Mary" have a degree of similarity of $3/4$, calculated as follows:

$\Big\langle \overrightarrow{John\ does\ not\ love\ Mary} \ \Big|\ \overrightarrow{John\ does\ not\ like\ Mary} \Big\rangle$
$= \Big\langle f\big( \vec{m}_3 \otimes \overrightarrow{does} \otimes \overrightarrow{not} \otimes \overrightarrow{love} \otimes \vec{f}_4 \big) \Big| f\big( \vec{m}_3 \otimes \overrightarrow{does} \otimes \overrightarrow{not} \otimes \overrightarrow{like} \otimes \vec{f}_4 \big) \Big\rangle$
$= \Big\langle \overrightarrow{hates}_{34} \Big| \tfrac{1}{4}\overrightarrow{loves}_{34} + \tfrac{3}{4}\overrightarrow{hates}_{34} \Big\rangle = \frac{1}{4} \Big\langle \overrightarrow{hates}_{34} \Big| \overrightarrow{loves}_{34} \Big\rangle + \frac{3}{4} \Big\langle \overrightarrow{hates}_{34} \Big| \overrightarrow{hates}_{34} \Big\rangle = \frac{3}{4}$

Example 7. Degree of similarity of positive and negative sentences. Here we compare the meanings of positive and negative sentences. This is perhaps of special interest to linguists of distributional meaning, since these sentences do not have the same grammatical structure; that we can compare them shows that our approach does not limit us to comparing meanings of sentences with the same grammatical structure. We have:

$\Big\langle f\big( \vec{m}_3 \otimes \overrightarrow{does} \otimes \overrightarrow{not} \otimes \overrightarrow{like} \otimes \vec{f}_4 \big) \Big| f\big( \vec{m}_3 \otimes \overrightarrow{loves} \otimes \vec{f}_4 \big) \Big\rangle = \frac{1}{4}$

$\Big\langle f\big( \vec{m}_3 \otimes \overrightarrow{does} \otimes \overrightarrow{not} \otimes \overrightarrow{like} \otimes \vec{f}_4 \big) \Big| f\big( \vec{m}_3 \otimes \overrightarrow{hates} \otimes \vec{f}_4 \big) \Big\rangle = \frac{3}{4}$

The following is the most interesting case:

$\Big\langle f\big( \vec{m}_3 \otimes \overrightarrow{does} \otimes \overrightarrow{not} \otimes \overrightarrow{like} \otimes \vec{f}_4 \big) \Big| f\big( \vec{m}_3 \otimes \overrightarrow{likes} \otimes \vec{f}_4 \big) \Big\rangle$
$= \Big\langle \tfrac{1}{4}\overrightarrow{loves}_{34} + \tfrac{3}{4}\overrightarrow{hates}_{34} \Big| \tfrac{3}{4}\overrightarrow{loves}_{34} + \tfrac{1}{4}\overrightarrow{hates}_{34} \Big\rangle$
$= \Big( \tfrac{1}{4} \times \tfrac{3}{4} \Big) \Big\langle \overrightarrow{loves}_{34} \Big| \overrightarrow{loves}_{34} \Big\rangle + \Big( \tfrac{3}{4} \times \tfrac{1}{4} \Big) \Big\langle \overrightarrow{hates}_{34} \Big| \overrightarrow{hates}_{34} \Big\rangle = \Big( \tfrac{1}{4} \times \tfrac{3}{4} \Big) + \Big( \tfrac{3}{4} \times \tfrac{1}{4} \Big) = \frac{3}{8}$

This value might seem counter-intuitive, since one expects "like" and "does not like" to have zero intersection in their meanings.
This would indeed be the case had we used our original truth-value definitions. But since we have set "like" to have degrees of "love" and "hate", their intersection is no longer $0$.

Using the same method, one can form and compare meanings of many different types of sentences. In a full-blown vector space model, automatically extracted from large amounts of text, we would obtain 'imperfect' vector representations for words rather than the 'ideal' ones presented here; but the mechanism by which the meanings of words propagate to the meanings of sentences remains the same.

6 Relations vs Vectors for Montague-style semantics

When fixing a basis for each vector space, we can think of $FVect$ as a category whose morphisms are matrices expressed in this basis. These matrices have real numbers as entries. It turns out that if we consider matrices with entries not in $(\mathbb{R}, +, \times)$ but in any other semiring (see footnote 7) $(R, +, \times)$, we again obtain a compact closed category. This semiring does not have to be a field; it can, for example, be the positive reals $(\mathbb{R}^+, +, \times)$, the positive integers $(\mathbb{N}, +, \times)$, or even the Booleans $(\mathbb{B}, \vee, \wedge)$.

In the case of $(\mathbb{B}, \vee, \wedge)$ we obtain an isomorphic copy of the category $FRel$ of finite sets and relations, with the cartesian product as tensor, as follows. Let $X$ be a set whose elements we have enumerated as $X = \{x_i \mid 1 \le i \le |X|\}$. Each element can be seen as a column vector with a $1$ in the row equal to its index and $0$ in all other rows. Let $Y = \{y_j \mid 1 \le j \le |Y|\}$ be another enumerated set. A relation $r \subseteq X \times Y$ is represented by an $|X| \times |Y|$ matrix whose entry in the $i$-th column and $j$-th row is $1$ iff $(x_i, y_j) \in r$, and $0$ otherwise. The composite $s \circ r$ of relations $r \subseteq X \times Y$ and $s \subseteq Y \times Z$ is $\{(x, z) \mid \exists y \in Y : (x, y) \in r,\ (y, z) \in s\}$. The reader can verify that this composition induces matrix multiplication of the corresponding matrices.
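This correspondence is easy to check numerically. The sketch below (our own; the two toy relations are arbitrary) encodes relations as 0/1 matrices in the convention just described and verifies that Boolean-semiring matrix multiplication agrees with the set-theoretic composite.

```python
import numpy as np

# Relations as 0/1 matrices: entry at (row j, column i) is 1 iff
# (x_i, y_j) is in the relation, as in the text.
r = np.array([[1, 0],                # r = {(x0,y0), (x0,y1), (x1,y1)}
              [1, 1]])
s = np.array([[0, 1],                # s = {(y1,z0), (y0,z1)}
              [1, 0]])

# Relational composition is matrix multiplication over the Boolean
# semiring (OR for +, AND for x); with 0/1 integers we multiply and
# then threshold back to 0/1.
s_after_r = (s @ r > 0).astype(int)
print(s_after_r)                     # matrix of s o r

# Cross-check against the set-theoretic definition of s o r.
X, Y, Z = range(2), range(2), range(2)
pairs = {(i, k) for i in X for k in Z
         if any(r[j, i] and s[k, j] for j in Y)}
print(sorted(pairs))                 # {(x0,z0), (x0,z1), (x1,z0)}
```

The thresholding step is what replaces the field addition of $FVect$ by the Boolean $\vee$; everything else in the matrix calculus is unchanged.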
Interestingly, in the world of relations (but not of functions) there is a notion of superposition [8]. The relations of type $r \subseteq \{*\} \times X$ (in matrix terms, all column vectors with entries $0$ and $1$) are in bijective correspondence with the subsets of $X$ via the correspondence $r \mapsto \{x \in X \mid (*, x) \in r\}$. Each such subset can be seen as the superposition of the elements it contains. The inner product of two subsets is $0$ if they are disjoint and $1$ if they have a non-empty intersection, so we can think of two disjoint sets as being orthogonal.

[Footnote 7: A semiring is a set together with two operations, addition and multiplication, for which we have a distributive law but neither additive nor multiplicative inverses. Having an addition and a multiplication of this kind suffices for a matrix calculus.]

Since the abstract nature of our procedure for assigning meaning to sentences did not depend on the particular choice of $FVect$, we can now repeat it in the following situation:

[Diagram: language maps to $FRel \times P$, which projects via $\pi_m$ to $FRel$ (meaning) and via $\pi_g$ to $P$ (grammar).]

In $FRel \times P$ we recover a Montague-style Boolean semantics. The vector spaces in this setting are encodings of sets of individuals and of relations over these sets. Inner products take intersections of the sets, and eta maps produce new relations by connecting pairs that are not necessarily adjacent.

In all our examples so far, the vector spaces of subject and object were essentially sets encoded in a vector space framework: each possible male subject was a basis vector of the space of males, and similarly for the female objects. That is why the meaning in these examples was truth-theoretic. We now repeat the calculations of Example 1 in the relational setting of $FRel \times P$.

Example 1 revisited. Consider the singleton set $\{*\}$; we take it to play the role of the sentence space $S$.
We assume that its two subsets, $\{*\}$ and $\emptyset$, identify true and false respectively. We now have sets $V$, $W$ and $T = V \times \{*\} \times W$, with $V := \{m_i\}_i$, $W := \{f_j\}_j$ and $likes \subseteq T$, such that:

$likes := \{(m_i, *, f_j) \mid m_i \text{ likes } f_j\} = \bigcup_{ij} \{m_i\} \times *_{ij} \times \{f_j\}$

where $*_{ij}$ is either $\{*\}$ or $\emptyset$. So we obtain:

$f\big( \{m_3\} \times likes \times \{f_4\} \big) = \bigcup_{ij} \big( \{m_3\} \cap \{m_i\} \big) \times *_{ij} \times \big( \{f_j\} \cap \{f_4\} \big) = *_{34}.$

7 Future Work

This paper aims to lay a mathematical foundation for the new field of compositional distributional models of meaning, in the realm of computational and mathematical linguistics, with applications to language processing, information retrieval, artificial intelligence and, in a conceptual way, the philosophy of language. This is just the beginning, and there is much more to do on both the practical and the theoretical side. Here are some examples:

• On the logical side, our "not" matrix works by swapping basis vectors and is thus essentially two-dimensional. Developing a canonical matrix of negation, one that works uniformly for any dimension of the meaning spaces, constitutes future work. The proposal of [41] of using projection onto the orthogonal subspace might be an option.

• A similar problem arises for the meanings of other logical words, such as "and", "or" and "if-then". We therefore need to develop a general logical setting on top of our meaning category $FVect \times P$. One subtlety here is that the operations that first come to mind, i.e. vector sum and product, do not correspond to the logical connectives of disjunction and conjunction (since, for example, they are not fully distributive). However, the more relaxed setting of vector spaces also enables us to encode words such as "but", whose meaning depends on the context and thus does not have a unique logical counterpart.

• Formalizing the connection with Montague semantics is another future direction.
Our above ideas can be generalized by proving a representation theorem for $FVect \times P$ over the semiring of Booleans with respect to the category $FRel$ of sets and relations. It would then be interesting to see how the so-called 'non-logical' axioms of Montague manifest themselves at that level, e.g. as adjoints to substitution, in order to recover quantifiers.

• Along similar semantic lines, it would be good to have a Curry-Howard-like correspondence between non-commutative compact closed categories, bicompact linear logic [4], and a version of the lambda calculus. This would enable us to automatically obtain computations for the meaning and type assignments of our categorical setting.

• Our categorical axiomatics is flexible enough to accommodate mixed states [37], so in principle we are able to study their linguistic significance and, for instance, implement the proposals of [3].

• Finally, and perhaps most importantly, the mathematical setting needs to be implemented and evaluated by running experiments on real corpus data. The efficiency and complexity of our approach then become an issue and need to be investigated, along with optimization techniques.

Acknowledgements

Support from EPSRC Advanced Research Fellowship EP/D072786/1 and European Commission grant EC-FP6-STREP 033763 for Bob Coecke, EPSRC Postdoctoral Fellowship EP/F042728/1 for Mehrnoosh Sadrzadeh, and EPSRC grant EP/E035698/1 for Stephen Clark is gratefully acknowledged. We thank Keith Van Rijsbergen, Stephen Pulman, and Edward Grefenstette for discussions, and Mirella Lapata for providing relevant references on vector space models of meaning.

References

[1] S. Abramsky and B. Coecke. A categorical semantics of quantum protocols. In Proceedings of the 19th Annual IEEE Symposium on Logic in Computer Science, pages 415-425. IEEE Computer Science Press, 2004. arXiv:quant-ph/0402130.

[2] Jerome R. Bellegarda. Exploiting latent semantic information in statistical language modeling.
Proceedings of the IEEE, 88(8):1279-1296, 2000.

[3] P. Bruza and D. Widdows. Quantum information dynamics and open world science. In Proceedings of the AAAI Spring Symposium on Quantum Interaction. AAAI Press, 2007.

[4] W. Buszkowski. Lambek grammars based on pregroups. Logical Aspects of Computational Linguistics, 2001.

[5] Freddy Choi, Peter Wiemer-Hastings, and Johanna Moore. Latent Semantic Analysis for text segmentation. In Proceedings of the EMNLP Conference, pages 109-117, 2001.

[6] S. Clark and S. Pulman. Combining symbolic and distributional models of meaning. In Proceedings of the AAAI Spring Symposium on Quantum Interaction. AAAI Press, 2007.

[7] B. Coecke. Kindergarten quantum mechanics: lecture notes. In A. Khrennikov, editor, Quantum Theory: Reconsiderations of the Foundations III, pages 81-98. AIP Press, 2005. arXiv:quant-ph/0510032.

[8] B. Coecke. Introducing categories to the practicing physicist. In G. Sica, editor, What is Category Theory?, volume 30 of Advanced Studies in Mathematics and Logic, pages 45-74. Polimetrica Publishing, 2006.

[9] B. Coecke. Quantum picturalism. Contemporary Physics, 51:59-83, 2010. arXiv:0908.1787.

[10] B. Coecke and E. O. Paquette. Categories for the practicing physicist. In B. Coecke, editor, New Structures for Physics, Lecture Notes in Physics, pages 167-271. Springer, 2010.

[11] James R. Curran. From Distributional to Semantic Similarity. PhD thesis, University of Edinburgh, 2004.

[12] G. Denhière and B. Lemaire. A computational model of children's semantic memory. In Proceedings of the 26th Annual Meeting of the Cognitive Science Society, pages 297-302, Chicago, IL, 2004.

[13] D. R. Dowty, R. E. Wall, and S. Peters. Introduction to Montague Semantics. Dordrecht, 1981.

[14] Peter W. Foltz, Walter Kintsch, and Thomas K. Landauer. The measurement of textual coherence with latent semantic analysis. Discourse Processes, 15:285-307, 1998.

[15] G. Gazdar.
Paradigm merger in natural language processing. In R. Milner and I. Wand, editors, Computing Tomorrow: Future Research Directions in Computer Science, pages 88-109. Cambridge University Press, 1996.

[16] Gregory Grefenstette. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, 1994.

[17] Thomas L. Griffiths, Mark Steyvers, and Joshua B. Tenenbaum. Topics in semantic representation. Psychological Review, 114(2):211-244, 2007.

[18] J. Lambek. The mathematics of sentence structure. American Mathematical Monthly, 65, 1958.

[19] J. Lambek. Type grammar revisited. Logical Aspects of Computational Linguistics, 1582, 1999.

[20] J. Lambek. Iterated Galois connections in arithmetic and linguistics. Galois Connections and Applications, Mathematics and its Applications, 565, 2004.

[21] J. Lambek. From Word to Sentence. Polimetrica, 2008.

[22] J. Lambek. Compact monoidal categories from linguistics to physics. In B. Coecke, editor, New Structures for Physics, Lecture Notes in Physics, pages 451-469. Springer, 2010.

[23] J. Lambek and C. Casadio, editors. Computational Algebraic Approaches to Natural Language. Polimetrica, Milan, 2006.

[24] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2):211-240, 1997.

[25] Michael D. Lee, Brandon Pincombe, and Matthew Welsh. An empirical evaluation of models of text document similarity. In B. G. Bara, L. W. Barsalou, and M. Bucciarelli, editors, Proceedings of the 27th Annual Conference of the Cognitive Science Society, pages 1254-1259, Mahwah, NJ, 2005. Erlbaum.

[26] Dekang Lin. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 768-774, 1998.

[27] K. Lund and C.
Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments & Computers, 28:203-208, 1996.

[28] Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. Finding predominant senses in untagged text. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 280-287, Barcelona, Spain, 2004.

[29] Scott McDonald. Environmental Determinants of Lexical Processing Effort. PhD thesis, University of Edinburgh, 2000.

[30] A. Preller. Towards discourse representation via pregroup grammars. Journal of Logic, Language and Information, 2007.

[31] A. Preller and M. Sadrzadeh. Bell states and negative sentences in the distributed model of meaning. In B. Coecke, P. Panangaden, and P. Selinger, editors, Electronic Notes in Theoretical Computer Science, Proceedings of the 6th QPL Workshop on Quantum Physics and Logic. University of Oxford, 2010.

[32] M. Sadrzadeh. Pregroup analysis of Persian sentences. In C. Casadio and J. Lambek, editors, Computational Algebraic Approaches to Natural Language. Polimetrica, 2006.

[33] M. Sadrzadeh. High level quantum structures in linguistics and multi agent systems. In P. Bruza, W. Lawless, and J. van Rijsbergen, editors, Proceedings of the AAAI Spring Symposium on Quantum Interaction. Stanford University, 2007.

[34] J. R. Saffran, E. L. Newport, and R. N. Aslin. Word segmentation: the role of distributional cues. Journal of Memory and Language, 35:606-621, 1996.

[35] G. Salton, A. Wong, and C. Yang. A vector-space model for information retrieval. Journal of the American Society for Information Science, 18:613-620, 1975.

[36] H. Schuetze. Automatic word sense discrimination. Computational Linguistics, 24(1):97-123, 1998.

[37] P. Selinger. Dagger compact closed categories and completely positive maps. Electronic Notes in Theoretical Computer Science, 170:139-163, 2007.

[38] P. Selinger.
A survey of graphical languages for monoidal categories. In B. Coecke, editor, New Structures for Physics, Lecture Notes in Physics, pages 275-337. Springer, 2010.

[39] P. Smolensky and G. Legendre. The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar. Vol. I: Cognitive Architecture; Vol. II: Linguistic and Philosophical Implications. MIT Press, 2005.

[40] D. P. Spence and K. C. Owens. Lexical co-occurrence and association strength. Journal of Psycholinguistic Research, 19:317-330, 1990.

[41] D. Widdows. Orthogonal negation in vector spaces for modelling word-meanings and document retrieval. In 41st Annual Meeting of the Association for Computational Linguistics, Japan, 2003.