Acquiring Word-Meaning Mappings for Natural Language Interfaces

Journal of Artificial Intelligence Research 18 (2003) 1-44. Submitted 5/02; published 1/03.

Cynthia A. Thompson (cindi@cs.utah.edu)
School of Computing, University of Utah, Salt Lake City, UT 84112-3320

Raymond J. Mooney (mooney@cs.utexas.edu)
Department of Computer Sciences, University of Texas, Austin, TX 78712-1188

Abstract

This paper focuses on a system, WOLFIE (WOrd Learning From Interpreted Examples), that acquires a semantic lexicon from a corpus of sentences paired with semantic representations. The lexicon learned consists of phrases paired with meaning representations. WOLFIE is part of an integrated system that learns to transform sentences into representations such as logical database queries. Experimental results are presented demonstrating WOLFIE's ability to learn useful lexicons for a database interface in four different natural languages. The usefulness of the lexicons learned by WOLFIE is compared to that of lexicons acquired by a similar system, with results favorable to WOLFIE. A second set of experiments demonstrates WOLFIE's ability to scale to larger and more difficult, albeit artificially generated, corpora. In natural language acquisition, it is difficult to gather the annotated data needed for supervised learning; however, unannotated data is fairly plentiful. Active learning methods attempt to select for annotation and training only the most informative examples, and therefore are potentially very useful in natural language applications. However, most results to date for active learning have only considered standard classification tasks. To reduce annotation effort while maintaining accuracy, we apply active learning to semantic lexicons. We show that active learning can significantly reduce the number of annotated examples required to achieve a given level of performance.

1. Introduction and Overview

A long-standing goal for the field of artificial intelligence is to enable computer understanding of human languages. Much progress has been made toward this goal, but much also remains to be done. Before artificial intelligence systems can meet this goal, they first need the ability to parse sentences, or transform them into a representation that is more easily manipulated by computers. Several knowledge sources are required for parsing, such as a grammar, lexicon, and parsing mechanism. Natural language processing (NLP) researchers have traditionally attempted to build these knowledge sources by hand, often resulting in brittle, inefficient systems that take a significant effort to build. Our goal here is to overcome this "knowledge acquisition bottleneck" by applying methods from machine learning. We develop and apply methods from empirical or corpus-based NLP to learn semantic lexicons, and from active learning to reduce the annotation effort required to learn them.

The semantic lexicon is one NLP component that is typically challenging and time consuming to construct and update by hand.
Our notion of semantic lexicon, formally defined in Section 3, is that of a list of phrase-meaning pairs, where the meaning representation is determined by the language understanding task at hand, and where we take a compositional view of sentence meaning (Partee, Meulen, & Wall, 1990). This paper describes a system, WOLFIE (WOrd Learning From Interpreted Examples), that acquires a semantic lexicon of phrase-meaning pairs from a corpus of sentences paired with semantic representations. The goal is to automate lexicon construction for an integrated NLP system that acquires both semantic lexicons and parsers for natural language interfaces from a single training set of annotated sentences.

Although many others (Sébillot, Bouillon, & Fabre, 2000; Riloff & Jones, 1999; Siskind, 1996; Hastings, 1996; Grefenstette, 1994; Brent, 1991) have presented systems for learning information about lexical semantics, we present here a system for learning lexicons of phrase-meaning pairs. Further, our work is unique in its combination of several features, though prior work has included some of these aspects. First, its output can be used by a system, CHILL (Zelle & Mooney, 1996; Zelle, 1995), that learns to parse sentences into semantic representations. Second, it uses a fairly straightforward batch, greedy, heuristic learning algorithm that requires only a small number of examples to generalize well. Third, it is easily extendible to new representation formalisms. Fourth, it requires no prior knowledge, although it can exploit an initial lexicon if provided. Finally, it simplifies the learning problem by making several assumptions about the training data, as described further in Section 3.2.

We test WOLFIE's ability to acquire a semantic lexicon for a natural language interface to a geographical database using a corpus of queries collected from human subjects and annotated with their logical form. In this test, WOLFIE is integrated with CHILL, which learns parsers but requires a semantic lexicon (previously built manually). The results demonstrate that the final acquired parser performs nearly as accurately at answering novel questions when using a learned lexicon as when using a hand-built lexicon. WOLFIE is also compared to an alternative lexicon acquisition system developed by Siskind (1996), demonstrating superior performance on this task. Finally, the corpus is translated into Spanish, Japanese, and Turkish, and experiments are conducted demonstrating an ability to learn successful lexicons and parsers for a variety of languages. A second set of experiments demonstrates WOLFIE's ability to scale to larger and more difficult, albeit artificially generated, corpora. Overall, the results demonstrate a robust ability to acquire accurate lexicons directly usable for semantic parsing.

With such an integrated system, the task of building a semantic parser for a new domain is simplified. A single representative corpus of sentence-representation pairs allows the acquisition of both a semantic lexicon and a parser that generalizes well to novel sentences. While building an annotated corpus is arguably less work than building an entire NLP system, it is still not a simple task. Redundancies and errors may occur in the training data.
A goal should be to also minimize the annotation effort, yet still achieve a reasonable level of generalization performance. In the case of natural language, there is frequently a large amount of unannotated text available. We would like to automatically, but intelligently, choose which of the available sentences to annotate. We do this here using a technique called active learning. Active learning is a research area in machine learning that features systems that automatically select the most informative examples for annotation and training (Cohn, Atlas, & Ladner, 1994). The primary goal of active learning is to reduce the number of examples that the system is trained on, thereby reducing the example annotation cost, while maintaining the accuracy of the acquired information. To demonstrate the usefulness of our active learning techniques, we compared the accuracy of parsers and lexicons learned using examples chosen by active learning for lexicon acquisition to those learned using randomly chosen examples, finding that active learning saved significant annotation cost. This savings is demonstrated in the geography query domain.

In summary, this paper provides a new statement of the lexicon acquisition problem and demonstrates a machine learning technique for solving this problem. Next, by combining this with previous research, we show that an entire natural language interface can be acquired from one training corpus. Further, we demonstrate the application of active learning techniques to minimize the number of sentences to annotate as training input for the integrated learning system.

The remainder of the paper is organized as follows. Section 2 gives more background information on CHILL and introduces Siskind's lexicon acquisition system, which we will compare to WOLFIE in Section 5. Sections 3 and 4 formally define the learning problem and describe the WOLFIE algorithm in detail. In Section 5 we present and discuss experiments evaluating WOLFIE's performance in learning lexicons in a database query domain and for an artificial corpus. Next, Section 6 describes and evaluates our use of active learning techniques for WOLFIE. Sections 7 and 8 discuss related research and future directions, respectively. Finally, Section 9 summarizes our research and results.

2. Background

In this section we give an overview of CHILL, the system that our research adds to. We also describe Jeff Siskind's lexicon acquisition system.

2.1 CHILL

The output produced by WOLFIE can be used to assist a larger language acquisition system; in particular, it is currently used as part of the input to a parser acquisition system called CHILL (Constructive Heuristics Induction for Language Learning). CHILL uses inductive logic programming (Muggleton, 1992; Lavrač & Džeroski, 1994) to learn a deterministic shift-reduce parser (Tomita, 1986) written in Prolog. The input to CHILL is a corpus of sentences paired with semantic representations, the same input required by WOLFIE. The parser learned is capable of mapping the sentences into their correct representations, as well as generalizing well to novel sentences.
In this paper, we limit our discussion to CHILL's ability to acquire parsers that map natural language questions directly into Prolog queries that can be executed to produce an answer (Zelle & Mooney, 1996). Following are two sample queries for a database on U.S. geography, paired with their corresponding Prolog query:

What is the capital of the state with the biggest population?
answer(C, (capital(S,C), largest(P, (state(S), population(S,P))))).

What state is Texarkana located in?
answer(S, (state(S), eq(C,cityid(texarkana,_)), loc(C,S))).

[Figure 1: The Integrated System. Prolog training examples are input to WOLFIE; the learned lexicon, together with the training examples, is input to CHILL, which produces the final parser.]

CHILL treats parser induction as the problem of learning rules to control the actions of a shift-reduce parser. During parsing, the current context is maintained in a stack and a buffer containing the remaining input. When parsing is complete, the stack contains the representation of the input sentence. There are three types of operators that the parser uses to construct logical queries. One is the introduction onto the stack of a predicate needed in the sentence representation due to a phrase's appearance at the front of the input buffer. These operators require a semantic lexicon as background knowledge. For details on this and the other two parsing operators, see Zelle and Mooney (1996). By using WOLFIE, the lexicon is provided automatically. Figure 1 illustrates the complete system.
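To make the lexicon's role concrete, the following is a minimal illustrative sketch of the predicate-introduction idea. It is our own simplification, not CHILL itself (which learns when to apply its operators via induced control rules); the lexicon contents and function name are hypothetical:

```python
# Illustrative sketch: a phrase-meaning lexicon driving the predicate
# INTRODUCE operator of a simplified shift-reduce semantic parser.
LEXICON = {
    ("capital",): "capital(_,_)",
    ("state",): "state(_)",
    ("biggest",): "largest(_,_)",
    ("population",): "population(_,_)",
}

def introduce_predicates(tokens, max_phrase_len=2):
    """Scan the buffer left to right; whenever the phrase at the front of
    the buffer has a lexicon entry, push its meaning onto the stack."""
    stack, buffer = [], list(tokens)
    while buffer:
        for n in range(max_phrase_len, 0, -1):
            phrase = tuple(buffer[:n])
            if phrase in LEXICON:
                stack.append(LEXICON[phrase])   # INTRODUCE operator
                del buffer[:n]
                break
        else:
            buffer.pop(0)                       # word with empty meaning
    return stack

print(introduce_predicates(
    "what is the capital of the state with the biggest population".split()))
# -> ['capital(_,_)', 'state(_)', 'largest(_,_)', 'population(_,_)']
```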
His system takes an incremental approach to acquiring a lexicon. Learning proceeds in two stages. The first stage learns which symbols in the representation are to be used in the final "conceptual expression" that represents the meaning of a word, by using a version-space approach. The second stage learns how these symbols are put together to form the final representation. For example, when learning the meaning of the word "raise", the algorithm may learn the set {CAUSE, GO, UP} during the first stage and put them together to form the expression CAUSE(x, GO(y, UP)) during the second stage.

Siskind (1996) shows the effectiveness of his approach on a series of artificial corpora. The system handles noise, lexical ambiguity, referential uncertainty, and very large corpora, but the usefulness of the lexicons learned is only compared to the "correct," artificial lexicon. The goal of the experiments presented there was to evaluate the correctness and completeness of learned lexicons. Earlier work (Siskind, 1992) also evaluated versions of his technique on a quite small corpus of real English and Japanese sentences. We extend that evaluation to a demonstration of the system's usefulness in performing real-world natural language processing tasks, using a larger corpus of real sentences.

3. The Lexicon Acquisition Problem

Although in the end our goal is to acquire an entire natural language interface, we currently divide the task into two parts, the lexicon acquisition component and the parser acquisition component. In this section, we discuss the problem of acquiring semantic lexicons that assist parsing and the acquisition of parsers. The training input consists of natural language sentences paired with their meaning representations. From these pairs we extract a lexicon consisting of phrases paired with their meaning representations. Some training pairs were given in the previous section, and a sample lexicon is shown in Figure 2.

("capital", capital(_,_)), ("state", state(_)),
("biggest", largest(_,_)), ("in", loc(_,_)),
("highest point", high_point(_,_)), ("long", len(_,_)),
("through", traverse(_,_)), ("capital", capital(_)),
("has", loc(_,_))

Figure 2: Sample Semantic Lexicon

3.1 Formal Definition

To present the learning problem more formally, some definitions are needed. While in the following we use the terms "string" and "substring," these extend straightforwardly to natural language sentences and phrases, respectively. We also refer to labeled trees, making the assumption that the semantic meanings of interest can be represented as such. Most common representations can be recast as labeled trees or forests, and our formalism extends easily to the latter.

Definition: Let Σ_V, Σ_E be finite alphabets of vertex labels and edge labels, respectively. Let V be a finite nonempty set of vertices, l a total function l : V → Σ_V, E a set of unordered pairs of distinct vertices called edges, and a a total function a : E → Σ_E. G = (V, l, E, a) is a labeled graph.

Definition: A labeled tree is a connected, acyclic labeled graph.

[Figure 3: Labeled Trees and Interpretations. The labeled tree t1 (vertices 1-8) for the string s1 = "The girl ate the pasta with the cheese.", together with the interpretation f1: f1("girl") = 2, f1("ate") = 1, f1("pasta") = 3, f1("the cheese") = 7.]

Figure 3 shows the labeled tree t1 (with vertices 1-8) on the left, with associated vertex and edge labels on the right. The function l is: {(1, ingest), (2, person), (3, food), (4, female), (5, child), (6, pasta), (7, food), (8, cheese)}. (We omit enumeration of the edge-labeling function a, but it could be given in a similar manner; for example, ((1,2), agent) is an element of a.) The tree t1 is a semantic representation of the sentence s1: "The girl ate the pasta with the cheese." Using a conceptual dependency (Schank, 1975) representation in Prolog list form, the meaning is:

[ingest, agent:[person, sex:female, age:child],
 patient:[food, type:pasta, accomp:[food, type:cheese]]]
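These definitions transcribe directly into data structures. The sketch below is our own illustration (not code from the paper) of the tree t1 and interpretation f1 of Figure 3, with edge labels read off the conceptual dependency representation:

```python
# Labeled trees per the definitions above: l maps vertices to vertex
# labels, a maps (parent, child) edges to edge labels.
from dataclasses import dataclass

@dataclass
class LabeledTree:
    vertex_labels: dict   # l : V -> Sigma_V, e.g. {1: "ingest", ...}
    edge_labels: dict     # a : E -> Sigma_E, keyed by (parent, child)
    root: int = 1

# The tree t1 for s1 = "The girl ate the pasta with the cheese."
t1 = LabeledTree(
    vertex_labels={1: "ingest", 2: "person", 3: "food", 4: "female",
                   5: "child", 6: "pasta", 7: "food", 8: "cheese"},
    edge_labels={(1, 2): "agent", (1, 3): "patient", (2, 4): "sex",
                 (2, 5): "age", (3, 6): "type", (3, 7): "accomp",
                 (7, 8): "type"},
)

# The interpretation f1: a one-to-one map from non-overlapping substrings
# of s1 into the vertices of t1, with the root in its range.
f1 = {"girl": 2, "ate": 1, "pasta": 3, "the cheese": 7}
```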
Definition: A u-v path in a graph G is a finite alternating sequence of vertices and edges of G, in which no vertex is repeated, that begins with vertex u and ends with vertex v, and in which each edge in the sequence connects the vertex that precedes it in the sequence to the vertex that follows it in the sequence.

Definition: A directed, labeled tree T = (V, l, E, a) is a labeled tree whose edges consist of ordered pairs of vertices, with a distinguished vertex r, called the root, with the property that for every v ∈ V, there is a directed r-v path in T, and such that the underlying undirected unlabeled graph induced by (V, E) is a connected, acyclic graph.

Definition: An interpretation f from a finite string s to a directed, labeled tree t is a one-to-one function mapping a subset s′ of the substrings of s, such that no two strings in s′ overlap, into the vertices of t, such that the root of t is in the range of f.

The interpretation provides information about what parts of the meaning of a sentence originate from which of its phrases. In Figure 3, we show an interpretation, f1, of s1 to t1. Note that "with" is not in the domain of f1; since s′ is a subset of the substrings of s, some words in s are allowed to have no meaning. Because we disallow overlapping substrings in the domain, "cheese" and "the cheese" could not both map to vertices in t1.

Definition: Given an interpretation f of string s to tree t, and an element p of the domain of f, the meaning of p relative to s, t, f is the connected subgraph of t whose vertices include f(p) and all its descendants except any other vertices in the range of f and their descendants.

Meanings in this sense concern the "lowest level" of phrasal meanings, occurring at the terminal nodes of a semantic grammar, namely the entries in the semantic lexicon. The grammar can then be used to construct the meanings of longer phrases and entire sentences. This is our motivation for the previously stated constraint that the root must be included in the range of f: we want all vertices in the sentence representation to be included in the meaning of some phrase. Note that the meaning of p is also a directed tree with f(p) as its root.

[Figure 4: Meanings. Under f1, "girl" means [person, sex:female, age:child], "ate" means [ingest], "pasta" means [food, type:pasta], and "the cheese" means [food, type:cheese].]

Figure 4 shows the meanings of each phrase in the domain of the interpretation function f1 shown in Figure 3. We show only the labels on the vertices and edges for readability.

Definition: Given a finite set STF of triples <s1, t1, f1>, ..., <sn, tn, fn>, where each si is a finite string, each ti is a directed, labeled tree, and each fi is an interpretation function from si to ti, let the language L_STF = {p1, ..., pk} of STF be the union of all substrings that occur in the domain of some fi. (We consider two substrings to be the same string if they contain the same characters in the same order, irrespective of their positions within the larger string in which they occur.) For each pj ∈ L_STF, the meaning set of pj, denoted M_STF(pj) (we omit the subscript on M when the set STF is obvious from context), is the set of all meanings of pj relative to si, ti, fi for some <si, ti, fi> ∈ STF. We consider two meanings to be the same if they are isomorphic trees, taking labels into account.
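The meaning of a phrase can be computed exactly as the definition reads: start at f(p) and collect descendants, stopping at any other vertex in the range of f. A small sketch, reusing the t1 and f1 objects from the previous listing:

```python
# Compute the meaning of a phrase relative to (s, t, f): the subtree rooted
# at f(p), excluding other vertices in the range of f and their descendants.
def children(tree, v):
    return [m for (n, m) in tree.edge_labels if n == v]

def meaning(tree, f, phrase):
    blocked = {v for p, v in f.items() if p != phrase}
    verts, frontier = set(), [f[phrase]]
    while frontier:
        v = frontier.pop()
        verts.add(v)
        frontier.extend(c for c in children(tree, v) if c not in blocked)
    return {v: tree.vertex_labels[v] for v in verts}

print(meaning(t1, f1, "girl"))        # {2: 'person', 4: 'female', 5: 'child'}
print(meaning(t1, f1, "the cheese"))  # {7: 'food', 8: 'cheese'}
```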
For example, given sentence s2: "The man ate the cheese," the labeled tree t2 pictured in Figure 5, and f2 defined as f2("ate") = 1, f2("man") = 2, f2("the cheese") = 3, the meaning set of "the cheese" with respect to STF = {<s1, t1, f1>, <s2, t2, f2>} is {[food, type:cheese]}. This is just one meaning, even though f1 and f2 map "the cheese" to different vertices in the two trees, because the subgraphs denoting the meaning of "the cheese" for the two functions are isomorphic.

[Figure 5: A Second Tree. The labeled tree t2 (vertices 1-6) for the string s2 = "The man ate the cheese."]

Definition: Given a finite set STF of triples <s1, t1, f1>, ..., <sn, tn, fn>, where each si is a finite string, each ti is a directed, labeled tree, and each fi is an interpretation function from si to ti, the covering lexicon expressed by STF is {(p, m) : p ∈ L_STF, m ∈ M(p)}.

The covering lexicon L expressed by STF = {<s1, t1, f1>, <s2, t2, f2>} is:

{("girl", [person, sex:female, age:child]),
 ("man", [person, sex:male, age:adult]),
 ("ate", [ingest]),
 ("pasta", [food, type:pasta]),
 ("the cheese", [food, type:cheese])}.

The idea of a covering lexicon is that it provides, for each string (sentence) si, a meaning for some of the phrases in that sentence. Further, these meanings are trees whose labeled vertices together include each of the labeled vertices in the tree ti representing the meaning of si, with no vertices duplicated, and containing no vertices not in ti. Edge labels may or may not be included, since the idea is that some of them are due to syntax, which the parser will provide; those edges capturing lexical semantics are in the lexicon. Note that because we only include in the covering lexicon phrases (substrings) that are in the domains of the fi's, words with the empty tree as meaning are not included in the covering lexicon. Note also that we will in general use "phrase" to mean substrings of sentences, whether they consist of one word or more than one. Finally, the strings in the covering lexicon may contain overlapping words even though those in the domain of an individual interpretation function must not, since those overlapping words could have occurred in different sentences.

Finally, we are ready to define the learning problem at hand.

The Lexicon Acquisition Problem:
Given: a multiset of strings S = {s1, ..., sn} and a multiset of labeled trees T = {t1, ..., tn},
Find: a multiset of interpretation functions, F = {f1, ..., fn}, such that the cardinality of the covering lexicon expressed by STF = {<s1, t1, f1>, ..., <sn, tn, fn>} is minimized.
If such a set is found, we say we have found a minimal set of interpretations (or a minimal covering lexicon).

Less formally, a learner is presented with a multiset of sentences (S) paired with their meanings (T); the goal of learning is to find the smallest lexicon consistent with this data. This lexicon is the paired listing of all phrases occurring in the domain of some fi ∈ F (where F is the multiset of interpretation functions found) with each of the elements in their meaning sets. The motivation for finding a lexicon of minimal size is the usual bias towards simplicity of representation and generalization beyond the training data.
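Under the same illustrative encoding, the covering lexicon is the set of (phrase, meaning) pairs over all triples, with isomorphic meanings collapsed. The sketch below (ours, not the paper's code) approximates isomorphism by serializing each meaning tree canonically, sorting children by edge label, which suffices for these small examples:

```python
# Covering lexicon from (string, tree, interpretation) triples.
def canonical(tree, f, phrase):
    m = meaning(tree, f, phrase)          # from the previous sketch
    def serialize(v):
        kids = sorted((tree.edge_labels[(v, c)], serialize(c))
                      for c in children(tree, v) if c in m)
        return (tree.vertex_labels[v], tuple(kids))
    return serialize(f[phrase])

def covering_lexicon(triples):
    lexicon = set()
    for _, tree, f in triples:
        for phrase in f:
            lexicon.add((phrase, canonical(tree, f, phrase)))
    return lexicon

# The second tree and interpretation, as in Figure 5.
t2 = LabeledTree(
    vertex_labels={1: "ingest", 2: "person", 3: "food",
                   4: "male", 5: "adult", 6: "cheese"},
    edge_labels={(1, 2): "agent", (1, 3): "patient", (2, 4): "sex",
                 (2, 5): "age", (3, 6): "type"},
)
f2 = {"ate": 1, "man": 2, "the cheese": 3}

lex = covering_lexicon([("s1", t1, f1), ("s2", t2, f2)])
print(len(lex))   # 5 entries, matching the covering lexicon L above:
                  # "the cheese" (and "ate") appear once, since their
                  # meaning subtrees serialize identically.
```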
While this definition allows for phrases of any length, we will usually want to limit the length of phrases to be considered for inclusion in the domain of the interpretation functions, for efficiency purposes. Once we determine a set of interpretation functions for a set of strings and trees, there is only one unique covering lexicon expressed by STF. However, this might not be the only set of interpretation functions possible, and it may not result in the lexicon with smallest cardinality. For example, the covering lexicon given with the previous example is not a minimal covering lexicon. For the two sentences given, we could find minimal, though rather degenerate, lexicons such as:

{("girl", [ingest, agent:[person, sex:female, age:child],
           patient:[food, type:pasta, accomp:[food, type:cheese]]]),
 ("man", [ingest, agent:[person, sex:male, age:adult],
          patient:[food, type:cheese]])}

This type of lexicon becomes less likely as the size of the corpus grows.

3.2 Implications of the Definition

This definition of the lexicon acquisition problem differs from those given by other authors, including Riloff and Jones (1999), Siskind (1996), Manning (1993), and Brent (1991), as further discussed in Section 7. Our definition of the problem makes some assumptions about the training input. First, by making f a function instead of a relation, the definition assumes that the meaning for each phrase in a sentence appears once in the representation of that sentence: the single-use assumption. Second, by making f one-to-one, it assumes exclusivity: that each vertex in a sentence's representation is due to only one phrase in the sentence. Third, it assumes that a phrase's meaning is a connected subgraph of a sentence's representation, not a more distributed representation: the connectedness assumption. While the first assumption may not hold for some representation languages, it does not present a problem in the domains we have considered. The second and third assumptions are perhaps less problematic with respect to general language use.

Our definition also assumes compositionality: that the meaning of a sentence is derived from the meanings of the phrases it contains (in addition, perhaps, to some "connecting" information specific to the representation at hand), but is not derived from external sources such as noise. In other words, all the vertices of a sentence's representation are included within the meaning of some word or phrase in that sentence. This assumption is similar to the linking rules of Jackendoff (1990), and has been used in previous work on grammar and language acquisition (e.g., Haas & Jayaraman, 1997; Siskind, 1996; in fact, all of the above assumptions except for single-use were made by Siskind (1996), as detailed in Section 7). While there is some debate in the linguistics community about the ability of compositional techniques to handle all phenomena (Fillmore, 1988; Goldberg, 1995), making this assumption simplifies the learning process and works reasonably for the domains of interest here. Also, since we allow multi-word phrases in the lexicon (e.g., ("kick the bucket", die(_))), one objection to compositionality can be addressed.

This definition also allows training input in which:

1. Words and phrases have multiple meanings; that is, homonymy might occur in the lexicon.

2. Several phrases map to the same meaning; that is, synonymy might occur in the lexicon.
3. Some words in a sentence do not map to any meanings, leaving them unused in the assignment of words to meanings. (These words may, however, serve as cues to a parser on how to assemble sentence meanings from word meanings.)

4. Phrases of contiguous words map to parts of a sentence's meaning representation.

Of particular note is lexical ambiguity (1 above). Note that we could have also derived an ambiguous lexicon such as:

{("girl", [person, sex:female, age:child]),
 ("ate", [ingest]),
 ("ate", [ingest, agent:[person, sex:male, age:adult]]),
 ("pasta", [food, type:pasta]),
 ("the cheese", [food, type:cheese])}

from our sample corpus. In this lexicon, "ate" is an ambiguous word. The earlier example minimizes ambiguity, resulting in an alternative, more intuitively pleasing lexicon. While our problem definition first minimizes the number of entries in the lexicon, our learning algorithm will also exploit a preference for minimizing ambiguity. Also note that our definition allows training input in which sentences themselves are ambiguous (paired with more than one meaning), since a given sentence in S (a multiset) might appear multiple times, each time paired with a different meaning. In fact, the training data that we consider in Section 5 does have some ambiguous sentences.

Our definition of the lexicon acquisition problem does not fit cleanly into the traditional definition of learning for classification. Each training example contains a sentence and its semantic parse, and we are trying to extract semantic information about some of the phrases in that sentence. So each example potentially contains information about multiple target concepts (phrases), and we are trying to pick out the relevant "features," or vertices of the representation, corresponding to the correct meaning of each phrase. Of course, our assumptions of single-use, exclusivity, connectedness, and compositionality impose additional constraints. In addition to this "multiple examples in one" learning scenario, we do not have access to negative examples, nor can we derive any implicit negatives, because of the possibility of ambiguous and synonymous phrases. In some ways the problem is related to clustering, which is also capable of learning multiple, potentially non-disjoint categories. However, it is not clear how a clustering system could be made to learn the phrase-meaning mappings needed for parsing. Finally, current systems that learn multiple concepts commonly use examples for other concepts as negative examples of the concept currently being learned. The implicit assumption made by doing this is that concepts are disjoint, an unwarranted assumption in the presence of synonymy.

4. The WOLFIE Algorithm and an Example

In this section, we first discuss some issues we considered in the design of our algorithm, then describe it fully in Section 4.2.

4.1 Solving the Lexicon Acquisition Problem

A first attempt to solve the Lexicon Acquisition Problem might be to examine all interpretation functions across the corpus, then choose the one(s) with minimal lexicon size. The number of possible interpretation functions for a given input pair is dependent on both the size of the sentence and its representation.
In a sentence with w words, there are $\Theta(w^2)$ possible phrases, not a particular challenge. However, the number of possible interpretation functions grows extremely quickly with the size of the input. For a sentence with p phrases and an associated tree with n vertices, the number of possible interpretation functions is:

$$c!\,(n-1)! \sum_{i=1}^{c} \frac{1}{(i-1)!\,(n-i)!\,(c-i)!} \qquad (1)$$

where c = min(p, n). The derivation of the above formula is as follows. We must choose which phrases to use in the domain of f, and we can choose one phrase, or two, or any number up to min(p, n) (if n < p we can only assign n phrases since f is one-to-one); there are

$$\binom{p}{i} = \frac{p!}{i!\,(p-i)!}$$

such choices, where i is the number of phrases chosen. But we can also permute these phrases, so that the "order" in which they are assigned to the vertices is different; there are $i!$ such permutations. We must also choose which vertices to include in the range of the interpretation function. We have to choose the root each time, so if we are choosing i vertices, we have n − 1 choose i − 1 vertices left after choosing the root:

$$\binom{n-1}{i-1} = \frac{(n-1)!}{(i-1)!\,(n-i)!}.$$

The full number of possible interpretation functions is then:

$$\sum_{i=1}^{\min(p,n)} \frac{p!}{i!\,(p-i)!} \times i! \times \frac{(n-1)!}{(i-1)!\,(n-i)!},$$

which simplifies to Equation 1. When n = p, the largest term of this equation is c! = p!, which grows at least exponentially with p, so in general the number of interpretation functions is too large to allow enumeration. Therefore, finding a lexicon by examining all interpretations across the corpus, then choosing the lexicon(s) of minimum size, is clearly not tractable.
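As a quick sanity check of this analysis (our own, assuming p ≤ n so that c = p), the derived sum and the closed form of Equation 1 can be computed directly and compared:

```python
# Count interpretation functions two ways: the derivation's sum and the
# closed form of Equation 1. Assumes p <= n, so c = min(p, n) = p.
from math import factorial as fact

def count_derived(p, n):
    return sum(fact(p) // (fact(i) * fact(p - i)) * fact(i)
               * fact(n - 1) // (fact(i - 1) * fact(n - i))
               for i in range(1, min(p, n) + 1))

def count_eq1(p, n):
    c = min(p, n)
    return sum(fact(c) * fact(n - 1)
               // (fact(i - 1) * fact(n - i) * fact(c - i))
               for i in range(1, c + 1))

for p, n in [(3, 3), (4, 6), (8, 8)]:
    assert count_derived(p, n) == count_eq1(p, n)
    print(p, n, count_eq1(p, n))
# Already at p = n = 8 the count exceeds a million interpretation functions.
```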
Instead of finding all interpretations, one could find a set of candidate meanings for each phrase, from which the final meaning(s) for that phrase could be chosen in a way that minimizes lexicon size. One way to find candidate meanings is to fracture the meanings of sentences in which a phrase appears. Siskind (1993) defined fracturing (he also calls it the Unlink* operation) over terms such that the result includes all subterms of an expression plus ⊥. In our representation formalism, this corresponds to finding all possible connected subgraphs of a meaning, and adding the empty graph. Like the interpretation-function technique just discussed, fracturing would also lead to an exponential blowup in the number of candidate meanings for a phrase. A lower bound on the number of connected subgraphs for a full binary tree with n vertices is obtained by noting that any subset of the (n + 1)/2 leaves may be deleted while still maintaining connectivity of the remaining tree. Thus, counting all of the ways that leaves can be deleted gives us a lower bound of $2^{(n+1)/2}$ fractures. (Thanks to Dan Hirshberg for help with this analysis.) This does not completely rule out fracturing as part of a technique for lexicon learning, since trees do not tend to get very large, and indeed Siskind uses it in many of his systems, with other constraints to help control the search. However, we wish to avoid any chance of exponential blowup to preserve the generality of our approach for other tasks.

Another option is to force CHILL to essentially induce a lexicon on its own. In this model, we would provide to CHILL an ambiguous lexicon in which each phrase is paired with every fracture of every sentence in which it appears. CHILL would then have to decide which set of fractures leads to the correct parse for each training sentence, and would only include those in a final learned parser-lexicon combination. Thus the search would again become exponential. Furthermore, even with small representations, it would likely lead to a system with poor generalization ability. While some of Siskind's work (e.g., Siskind, 1992) took syntactic constraints into account and did not encounter such difficulties, those versions did not handle lexical ambiguity.

If we could efficiently find some good candidates, a standard induction algorithm could then attempt to use them as a source of training examples for each phrase. However, any attempt to use the list of candidate meanings of one phrase as negative examples for another phrase would be flawed. The learner could not know in advance which phrases are possibly synonymous, and thus which phrase lists to use as negative examples of other phrase meanings. Also, many representation components would be present in the lists of more than one phrase. This is a source of conflicting evidence for a learner, even without the presence of synonymy. Since only positive examples are available, one might think of using most-specific conjunctive learning, or finding the intersection of all the representations for each phrase, as proposed by Anderson (1977). However, the meanings of an ambiguous phrase are disjunctive, and this intersection would be empty. A similar difficulty would be expected with the positive-only compression of Muggleton (1995).

4.2 Our Solution: WOLFIE

The above analysis leads us to believe that the Lexicon Acquisition Problem is computationally intractable: we can neither perform an efficient search for the best lexicon nor use a standard induction algorithm. Therefore, we have implemented WOLFIE (the code is available upon request from the first author), outlined in Figure 6, which finds an approximate solution to the Lexicon Acquisition Problem.

For each phrase p (of at most two words):
    1.1) Collect the training examples in which p appears.
    1.2) Calculate LICS from (sampled) pairs of these examples' representations.
    1.3) For each l in the LICS, add (p, l) to the set of candidate lexicon entries.
Until the input representations are covered, or no candidate lexicon entries remain, do:
    2.1) Add the best (phrase, meaning) pair from the candidate entries to the lexicon.
    2.2) Update candidate meanings of phrases in the same sentences as the phrase just learned.
Return the lexicon of learned (phrase, meaning) pairs.

Figure 6: WOLFIE Algorithm Overview

Our approach is to generate a set of candidate lexicon entries, from which the final learned lexicon is derived by greedily choosing the "best" lexicon item at each point, in the hopes of finding a final (minimal) covering lexicon. We do not actually learn interpretation functions, so do not guarantee that we will find a covering lexicon (though, of course, interpretation functions are not the only way to guarantee a covering lexicon; see Siskind (1993) for an alternative).
Even if we were to search for interpretation functions, using a greedy search would also not guarantee covering the input, and of course it also does not guarantee that a minimal lexicon is found. However, we will later present experimental results demonstrating that our greedy approach performs well.

WOLFIE first derives an initial set of candidate meanings for each phrase. The algorithm for generating candidates, LICS, attempts to find a "maximally common" meaning for each phrase, which biases toward both finding a small lexicon by covering many vertices of a tree at once, and finding a lexicon that actually does cover the input. Second, WOLFIE chooses final lexicon entries from this candidate set, one at a time, updating the candidate set as it goes, taking into account our assumptions of single-use, connectedness, and exclusivity. The basic scheme for choosing entries from the candidate set is to maximize the prediction of meanings given phrases, but also to find general meanings. This adds a tension between LICS, which cover many vertices, and generality, which biases towards fewer vertices. However, generality, like LICS, helps lead to a small lexicon, since a general meaning will more likely apply widely across a corpus.

Let us explain the algorithm in further detail by way of an example, using Spanish instead of English to illustrate the difficulty somewhat more clearly. Consider the following corpus:

1. ¿Cuál es el capital del estado con la población más grande?
   answer(C, (capital(S,C), largest(P, (state(S), population(S,P))))).

2. ¿Cuál es la punta más alta del estado con la area más grande?
   answer(P, (high_point(S,P), largest(A, (state(S), area(S,A))))).

3. ¿En que estado se encuentra Texarkana?
   answer(S, (state(S), eq(C,cityid(texarkana,_)), loc(C,S))).

4. ¿Qué capital es la más grande?
   answer(A, largest(A, capital(A))).

5. ¿Qué es la area de los estados unitos?
   answer(A, (area(C,A), eq(C,countryid(usa)))).

6. ¿Cuál es la población de un estado que bordean a Utah?
   answer(P, (population(S,P), state(S), next_to(S,M), eq(M,stateid(utah)))).

7. ¿Qué es la punta más alta del estado con la capital Madison?
   answer(C, (high_point(B,C), loc(C,B), state(B), capital(B,A), eq(A,cityid(madison,_)))).

The sentence representations here are slightly different than the tree representations given in the problem definition, with the main difference being the addition of existentially quantified variables shared between some leaves of a representation tree. As mentioned in Section 2.1, the representations are Prolog queries to a database. Given such a query, we can create a tree that conforms to our formalism, but with this addition of quantified variables. An example is shown in Figure 7 for the representation of the third sentence.

[Figure 7: Tree with Variables. The tree for answer(S, (state(S), eq(C,cityid(texarkana,_)), loc(C,S))), with each vertex a predicate name and arity and each edge labeled with an argument position.]

Each vertex is a predicate name and its arity, in the Prolog style, e.g., state/1, with quantified variables at some of the leaves.
For each outgoing edge (n, m) of a vertex n, the edge is labeled with the argument position filled by the subtree rooted by m. If there is no edge labeled with a given argument position, the argument is a free variable. Each vertex labeled with a variable (which can occur only at leaves) is an existentially quantified variable whose scope is the entire tree (or query). The learned lexicon, however, does not need to maintain the identity between variables across distinct lexical entries. Another representation difference is that we strip the answer predicate from the input to our learner (the predicate is omitted because CHILL initializes the parse stack with the answer predicate, and thus no word has to be mapped to it), thus allowing a forest of directed trees as input rather than a single tree. The definition of the problem easily extends such that the root of each tree in the forest must be in the domain of some interpretation function. Evaluation of our system using this representation is given in Section 5.1; evaluation using a representation without variables or forests is presented in Section 5.2. We previously (Thompson, 1995) presented results demonstrating learning representations of a different form, that of a case-role representation (Fillmore, 1968) augmented with Conceptual Dependency (Schank, 1975) information. This last representation conforms directly to our problem definition.

Now, continuing with the example of solving the Lexicon Acquisition Problem for this corpus, let us also assume for simplification, although it is not required, that sentences are stripped of phrases that we know have empty meanings (e.g., "qué", "es", "con", and "la"). We will similarly assume that it is known that some phrases refer directly to given database constants (e.g., location names), and remove those phrases and their meaning from the training input.

4.2.1 Candidate Generation Phase

Initial candidate meanings for a phrase are produced by computing the maximally common substructure(s) between sampled pairs of representations of sentences that contain it. We derive common substructure by computing the Largest Isomorphic Connected Subgraphs (LICS) of two labeled trees, taking labels into account in the isomorphism. The analogous Largest Common Subgraph problem (Garey & Johnson, 1979) is solvable in polynomial time if, as we assume, both inputs are trees and if K, the number of edges to include, is given. Thus, we start with K set equal to the largest number of edges in the two trees being compared, test for common subgraph(s), and iterate down to K = 1, stopping when one or more subgraphs are found for a given K.

For the Prolog query representation, the algorithm is complicated a bit by variables. Therefore, we use LICS with an addition similar to computing the Least General Generalization of first-order clauses (Plotkin, 1970). The LGG of two sets of literals is the least general set of literals that subsumes both sets of literals. We add to this by allowing that, when a term in the argument of a literal is a conjunction, the algorithm tries all orderings in its matching of the terms in the conjunction. Overall, our algorithm for finding the LICS between two trees in the Prolog representation first finds the common labeled edges and vertices as usual in LICS, but treats all variables as equivalent.
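The full LICS-plus-LGG computation is too involved for a short example, but its variable-blind first step can be pictured with a toy stand-in (our own simplification): flatten each query to a bag of predicate symbols and intersect the bags. The real algorithm additionally preserves tree structure and shared variables, which is what distinguishes a candidate such as (state(S), loc(_,S)) from two unrelated literals:

```python
# Toy stand-in for the variable-blind step of candidate generation: with
# all variables treated as equivalent, common structure collapses to a
# multiset intersection of predicate symbols.
from collections import Counter

def symbols(literals):
    return Counter(f"{name}/{arity}" for name, arity in literals)

# Sentence 1: answer(C,(capital(S,C),largest(P,(state(S),population(S,P))))).
rep1 = [("capital", 2), ("largest", 2), ("state", 1), ("population", 2)]
# Sentence 2: answer(P,(high_point(S,P),largest(A,(state(S),area(S,A))))).
rep2 = [("high_point", 2), ("largest", 2), ("state", 1), ("area", 2)]

print(symbols(rep1) & symbols(rep2))
# Counter({'largest/2': 1, 'state/1': 1}): the symbols underlying the
# common meaning answer(_, largest(_, state(_))) found by the full algorithm.
```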
Then, it computes the Least General Generalization, with conjunction taken into account, of the resulting trees as converted back into literals. For example, given the two trees

answer(C, (largest(P, (state(S), population(S,P))), capital(S,C))).
answer(P, (high_point(S,P), largest(A, (state(S), area(S,A))))).

the common meaning is answer(_, largest(_, state(_))). Note that the LICS of two trees may not be unique: there may be multiple common subtrees that both contain the same number of edges; in this case LICS returns multiple answers.

The sets of initial candidate meanings for some of the phrases in the sample corpus are shown in Table 1.

Phrase         LICS                                From Sentences
"capital"      largest(_,_)                        1,4
               capital(_,_)                        1,7
               state(_)                            1,7
"grande"       largest(_,state(_))                 1,2
               largest(_,_)                        1,4; 2,4
"estado"       largest(_,state(_))                 1,2
               state(_)                            1,3; 1,7; 2,3; 2,6; 2,7; 3,6; 6,7
               (population(S,_), state(S))         1,6
               capital(_,_)                        1,7
               high_point(_,_)                     2,7
               (state(S), loc(_,S))                3,7
"punta más"    high_point(_,_)                     2,7
               state(_)                            2,7
"encuentra"    (state(S), loc(_,S))                3

Table 1: Sample Candidate Lexical Entries and their Derivation

While in this example we show the LICS for all pairs that a phrase appears in, in the actual algorithm we randomly sample a subset for efficiency reasons, as in Golem (Muggleton & Feng, 1990). For phrases appearing in only one sentence (e.g., "encuentra"), the entire sentence representation (excluding the database constant given as background knowledge) is used as an initial candidate meaning. Such candidates are typically generalized in step 2.2 of the algorithm to only the correct portion of the representation before they are added to the lexicon; we will see an example of this below.

4.2.2 Adding to the Final Lexicon

After deriving initial candidates, the greedy search begins. The heuristic used to evaluate candidates attempts to help assure that a small but covering lexicon is learned. The heuristic first looks at the weighted sum of two components, where p is the phrase and m its candidate meaning:

1. $P(m|p) \times P(p|m) \times P(m) = P(p) \times P(m|p)^2$
2. The generality of m

Ties in this value are broken by preferring less ambiguous phrases (those with fewer current meanings) and shorter phrases. The first component is analogous to the cluster evaluation heuristic used by Cobweb (Fisher, 1987), which measures the utility of clusters based on attribute-value pairs and categories instead of meanings and phrases. The probabilities are estimated from the training data and then updated as learning progresses to account for phrases and meanings already covered. We will see how this updating works as we continue through our example of the algorithm. The goal of this part of the heuristic is to maximize the probability of predicting the correct meaning for a randomly sampled phrase. The equality holds by Bayes' theorem. Looking at the right side, $P(m|p)^2$ is the expected probability that meaning m is correctly guessed for a given phrase p. This assumes a strategy of probability matching, in which a meaning m is chosen for p with probability P(m|p) and is correct with the same probability.
The other term, P(p), biases the component by how common the phrase is. Interpreting the left side of the equation, the first term biases towards lexicons with low ambiguity, the second towards low synonymy, and the third towards frequent meanings.

The second component of the heuristic, generality, is computed as the negation of the number of vertices in the meaning's tree structure, and helps prefer smaller, more general meanings. For example, in the candidate set above, if all else were equal, the generality portion of the heuristic would prefer state(_), with generality value −1, over largest(_,state(_)) and (state(S), loc(_,S)), each with generality value −2, as the meaning of "estado". Learning a meaning with fewer terms helps evenly distribute the vertices in a sentence's representation among the meanings of the phrases in that sentence, and thus leads to a lexicon that is more likely to be correct. To see this, we note that some pairs of words tend to frequently co-occur ("grande" and "estado" in our example), and so their joint representation (meaning) is likely to be in the set of candidate meanings for both words. By preferring a more general meaning, we easily ignore these incorrect joint meanings.

In this example and all experiments, we use a weight of 10 for the first component of the heuristic, and a weight of 1 for the second. The first component has smaller absolute values and is therefore given a higher weight. Modulo this consideration, results are not overly sensitive to the weights, and automatically setting them using cross-validation on the training set (Kohavi & John, 1995) had little effect on overall performance.

In Table 2 we illustrate the calculation of the heuristic measure for some of the above fourteen pairs, and its value for all. The calculation shows the sum of multiplying 10 by the first component of the heuristic and multiplying 1 by the second component. The first component is simplified as follows:

$$P(p) \times P(m|p)^2 = \frac{|p|}{t} \times \frac{|m \cap p|^2}{|p|^2} \approx \frac{|m \cap p|^2}{|p|},$$

where |p| is the number of times phrase p appears in the corpus, t is the initial number of candidate phrases, and |m ∩ p| is the number of times that meaning m is paired with phrase p. We can ignore t since the number of phrases in the corpus is the same for each pair, and it has no effect on the ranking.

Candidate Lexicon Entry                        Heuristic Value
("capital", largest(_,_))                      10(2²/3) + 1(−1) = 12.33
("capital", capital(_,_))                      12.33
("capital", state(_))                          12.33
("grande", largest(_,state(_)))                10(2²/3) + 1(−2) = 11.33
("grande", largest(_,_))                       29
("estado", largest(_,state(_)))                10(2²/5) + 1(−2) = 6
("estado", state(_))                           10(5²/5) + 1(−1) = 49
("estado", (population(S,_), state(S)))        6
("estado", capital(_,_))                       7
("estado", high_point(_,_))                    7
("estado", (state(S), loc(_,S)))               6
("punta más", high_point(_,_))                 19
("punta más", state(_))                        10(2²/2) + 1(−1) = 19
("encuentra", (state(S), loc(_,S)))            10(1²/1) + 1(−2) = 8

Table 2: Heuristic Value of Sample Candidate Lexical Entries

The highest scoring pair is ("estado", state(_)), so it is added to the lexicon.
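The arithmetic in Table 2 is easy to reproduce; below is a sketch of the scoring function with the paper's weights of 10 and 1 (the function name is ours, and the counts are read off Table 1):

```python
# Heuristic from Section 4.2.2: w1 * |m ∩ p|^2 / |p| + w2 * (-(vertices in m)).
def score(times_paired, times_phrase, vertices, w1=10, w2=1):
    return w1 * times_paired ** 2 / times_phrase - w2 * vertices

print(score(5, 5, 1))   # ("estado", state(_))               -> 49.0
print(score(2, 3, 1))   # ("capital", largest(_,_))          -> 12.33...
print(score(2, 3, 2))   # ("grande", largest(_,state(_)))    -> 11.33...
print(score(3, 3, 1))   # ("grande", largest(_,_))           -> 29.0
print(score(1, 1, 2))   # ("encuentra", (state(S),loc(_,S))) -> 8.0
```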
Next is the candidate generalization step (2.2), described algorithmically in Figure 8. One of the key ideas of the algorithm is that each phrase-meaning choice can constrain the candidate meanings of phrases yet to be learned. Given the assumption that each portion of the representation is due to at most one phrase in the sentence (exclusivity), once part of a representation is covered, no other phrase in the sentence can be paired with that meaning (at least for that sentence). Therefore, in step 2.2 the candidate meanings for words in the same sentences as the word just learned are generalized to exclude the representation just learned.

Given: a learned phrase-meaning pair (l, g).
For all sentence-representation pairs containing l and g, mark them as covered.
For each candidate phrase-meaning pair (p, m):
    If p occurs in some training pairs with (l, g), then
        if the vertices of m intersect the vertices of g, then
            if all occurrences of m are now covered, then
                remove (p, m) from the set of candidate pairs;
            else
                adjust the heuristic value of (p, m) as needed to account for
                    newly covered nodes of the training representations;
                generalize m to remove covered nodes, obtaining m', and
                    calculate the heuristic value of the new candidate pair (p, m').
If no candidate meanings remain for an uncovered phrase, then
    derive new LICS from uncovered representations and calculate their heuristic values.

Figure 8: The Candidate Generalization Phase

We use an operation analogous to set difference to find the remaining uncovered vertices of a representation when generalizing meanings to eliminate covered vertices from candidate pairs. For example, if the meaning largest(_,_) were learned for a phrase in sentence 2, the meaning left behind would be a forest consisting of the trees high_point(S,_) and (state(S), area(S,_)). Also, if the generalization results in an empty tree, new LICS are calculated. In our example, since state(_) is covered in sentences 1, 2, 3, 6, and 7, the candidates for several other words in those sentences are generalized. For example, the meaning (state(S), loc(_,S)) for "encuentra" is generalized to loc(_,_), with a new heuristic value of 10(1²/1) + 1(−1) = 9. Also, our single-use assumption allows us to remove all candidate pairs containing "estado" from the set of candidate meanings, since the learned pair covers all occurrences of "estado" in that set.

Note that the pairwise matchings used to generate candidate items, together with this updating of the candidate set, enable multiple meanings to be learned for ambiguous phrases, and make the algorithm less sensitive to the initial rate of sampling for LICS. For example, note that "capital" is ambiguous in this data set, though its ambiguity is an artifact of the way that the query language was designed, and one does not ordinarily think of it as an ambiguous word.
However, both meanings will be learned: the second pair added to the final lexicon is ("grande", largest(_,_)), which causes a generalization to the empty meaning for the first candidate entry in Table 2, and since no new LICS from sentence 4 can be generated, its entire remaining meaning is added to the candidate meaning set for both "capital" and "más."

Subsequently, the greedy search continues until the resulting lexicon covers the training corpus, or until no candidate phrase meanings remain. In rare cases, learning errors occur that leave some portions of representations uncovered. In our example, the following lexicon is learned:

("estado", state(_)), ("grande", largest(_)),
("area", area(_)), ("punta", high_point(_,_)),
("población", population(_,_)), ("capital", capital(_,_)),
("encuentra", loc(_,_)), ("alta", loc(_,_)),
("bordean", next_to(_)), ("capital", capital(_)).

In the next section, we discuss the ability of WOLFIE to learn lexicons that are useful for parsers and parser acquisition.

5. Evaluation of WOLFIE

The following two sections discuss experiments testing WOLFIE's success in learning lexicons for both real and artificial corpora, comparing it in several cases to a previously developed lexicon learning system.

5.1 A Database Query Application

This section describes our experimental results on a database query application. The first corpus discussed contains 250 questions about U.S. geography, each paired with the Prolog query that extracts the answer to the question from a database. This domain was originally chosen due to the availability of a hand-built natural language interface, Geobase, to a database containing about 800 facts. Geobase was supplied with Turbo Prolog 2.0 (Borland International, 1988), and designed specifically for this domain. The questions in the corpus were collected by asking undergraduate students to generate English questions for this database, though they were given only cursory knowledge of the database without being given a chance to use it. To broaden the test, we had the same 250 sentences translated into Spanish, Turkish, and Japanese. The Japanese translations are in word-segmented Roman orthography. Translated questions were paired with the appropriate logical queries from the English corpus.

To evaluate the learned lexicons, we measured their utility as background knowledge for CHILL. This is performed by choosing a random set of 25 test examples and then learning lexicons and parsers from increasingly larger subsets of the remaining 225 examples (increasing by 50 examples each time). After training, the test examples are parsed using the learned parser. We then submit the resulting queries to the database, compare the answers to those generated by submitting the correct representation to the database, and record the percentage of correct (matching) answers. By using the difficult "gold standard" of retrieving a correct answer, we avoid measures of partial accuracy that we believe do not adequately measure final utility. We repeated this process for ten different random training and test sets and evaluated performance differences using a two-tailed, paired t-test with a significance level of p ≤ 0.05.
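As a sketch of this evaluation protocol (our own; parse and execute are hypothetical stand-ins for the learned parser and the database query engine):

```python
# "Gold standard" metric: a parsed query counts as correct only if
# executing it returns the same answer as the annotated query.
def exact_answer_accuracy(test_pairs, parse, execute):
    correct = sum(execute(parse(sentence)) == execute(gold_query)
                  for sentence, gold_query in test_pairs)
    return 100.0 * correct / len(test_pairs)
```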
We compared our system to an incremental (on-line) lexicon learner developed by Siskind (1996). To make a more equitable comparison to our batch algorithm, we ran his in a "simulated" batch mode, by repeatedly presenting the corpus 500 times, analogous to running 500 epochs to train a neural network. While this does not actually add new kinds of data over which to learn, it allows his algorithm to perform inter-sentential inference in both directions over the corpus instead of just one. Our point here is to compare accuracy over the same size training corpus, a metric not optimized for by Siskind. We are not worried about the difference in execution time here,¹⁰ and the lexicons learned when running Siskind's system in incremental mode (presenting the corpus a single time) resulted in substantially lower performance in preliminary experiments with this data. We also removed Wolfie's ability to learn phrases of more than one word, since the current version of Siskind's system does not have this ability.

10. The CPU times of the two systems are not directly comparable since one is written in Prolog and the other in Lisp. However, the learning time of the two systems is approximately the same if Siskind's system is run in incremental mode: just a few seconds with 225 training examples.

Finally, we made comparisons to the parsers learned by Chill when using a hand-coded lexicon as background knowledge. In this and similar applications, there are many terms, such as state and city names, whose meanings can be automatically extracted from the database. Therefore, all tests below were run with such names given to the learner as an initial lexicon; this is helpful but not required. Section 5.2 gives results for a different task with no such initial lexicon. However, unless otherwise noted, for all tests within this section (5.1) we did not strip sentences of phrases known to have empty meanings, unlike in the example of Section 4.

5.1.1 Comparisons Using English

The first experiment was a comparison on the original English corpus. Figure 9 shows learning curves for Chill when using the lexicons learned by Wolfie (CHILL+Wolfie) and by Siskind's system (CHILL+Siskind). The uppermost curve (CHILL+handbuilt) shows Chill's performance when given the hand-built lexicon. CHILL-testlex shows the performance when words that never appear in the training data (e.g., are only in the test sentences) are deleted from the hand-built lexicon (since a learning algorithm has no chance of learning these). Finally, the horizontal line shows the performance of the Geobase benchmark.

Figure 9: Accuracy on English Geography Corpus (accuracy vs. training examples for CHILL+handbuilt, CHILL-testlex, CHILL+Wolfie, CHILL+Siskind, and Geobase)

The results show that a lexicon learned by Wolfie led to parsers that were almost as accurate as those generated using a hand-built lexicon. The best accuracy is achieved by parsers using the hand-built lexicon, followed by the hand-built lexicon with words only in the test set removed, followed by Wolfie, followed by Siskind's system. All the systems do as well as or better than Geobase by the time they reach 125 training examples.
The differences between Wolfie and Siskind's system are statistically significant at all training example sizes. These results show that Wolfie can learn lexicons that support the learning of successful parsers, and that are better from this perspective than those learned by a competing system. Also, comparing to the CHILL-testlex curve, we see that most of the drop in accuracy from a hand-built lexicon is due to words in the test set that the system has not seen during training. In fact, none of the differences between CHILL+Wolfie and CHILL-testlex are statistically significant.

One of the implicit hypotheses of our problem definition is that coverage of the training data implies a good lexicon. The results show a coverage of 100% of the 225 training examples for Wolfie versus 94.4% for Siskind. In addition, the lexicons learned by Siskind's system were more ambiguous and larger than those learned by Wolfie. Wolfie's lexicons had an average of 1.1 meanings per word and an average size of 56.5 entries (after 225 training examples), versus 1.7 meanings per word and 154.8 entries in Siskind's lexicons. For comparison, the hand-built lexicon had 1.2 meanings per word and 88 entries. These differences, summarized in Table 3, undoubtedly contribute to the final performance differences.

    Lexicon      Coverage   Ambiguity   Entries
    hand-built   100%       1.2          88
    Wolfie       100%       1.1          56.5
    Siskind      94.4%      1.7         154.8

Table 3: Lexicon Comparison
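A minimal sketch of how the statistics in Table 3 can be computed, assuming a lexicon is a list of (phrase, meaning) pairs and a hypothetical covers(lexicon, example) test for whether an example's representation is covered; these names are illustrative, not the actual measurement code.

    from collections import defaultdict

    def lexicon_stats(lexicon, corpus, covers):
        senses = defaultdict(set)
        for phrase, meaning in lexicon:
            senses[phrase].add(meaning)
        return {
            "entries": len(lexicon),
            # average number of meanings per word
            "ambiguity": sum(map(len, senses.values())) / len(senses),
            # fraction of training examples whose representation is covered
            "coverage": sum(covers(lexicon, ex) for ex in corpus) / len(corpus),
        }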
5.1.2 Performance for Other Natural Languages

Next, we examined the performance of the two systems on the Spanish version of the corpus. Figure 10 shows the results. The differences between using Wolfie's and Siskind's learned lexicons for Chill are again statistically significant at all training set sizes. We also again show the performance with hand-built lexicons, both with and without phrases present only in the testing set. The performance compared to the hand-built lexicon with test-set phrases removed is still competitive, with the difference being significant only at 225 examples.

Figure 10: Accuracy on Spanish (Span-CHILL+handbuilt, Span-CHILL-testlex, Span-CHILL+Wolfie, Span-CHILL+Siskind)

Figure 11 shows the accuracy of learned parsers with Wolfie's learned lexicons for all four languages. The performance differences among the four languages are quite small, demonstrating that our methods are not language dependent.

Figure 11: Accuracy on All Four Languages (English, Spanish, Japanese, Turkish)

5.1.3 A Larger Corpus

Next, we present results on a larger, more diverse corpus from the geography domain, where the additional sentences were collected from computer science undergraduates in an introductory AI course. The set of questions in the smaller corpus was collected from students in a German class, with no special instructions on the complexity of queries desired. The AI students tended to ask more complex and diverse queries: their task was to give five interesting questions and the associated logical form for a homework assignment, though again they did not have direct access to the database. They were requested to give at least one sentence whose representation included a predicate containing embedded predicates, for example largest(S, state(S)), and we asked for variety in their sentences. There were 221 new sentences, for a total of 471 (including the original 250 sentences).

For these experiments, we split the data into 425 training sentences and 46 test sentences, for 10 random splits, then trained Wolfie and then Chill as before. Our goal was to see whether Wolfie was still effective for this more difficult corpus, since there were approximately 40 novel words in the new sentences. Therefore, we tested against the performance of Chill with an extended hand-built lexicon. For this test, we stripped sentences of phrases known to have empty meanings, as in the example of Section 4.2. Again, we did not use phrases of more than one word, since these do not seem to make a significant difference in this domain. For these results, we compare Wolfie's lexicons for Chill against hand-built lexicons without phrases that only appear in the test set.

Figure 12 shows the resulting learning curves. The differences between Chill using the hand-built and learned lexicons are statistically significant at 175, 225, 325, and 425 examples (four out of the nine data points). The more mixed results here indicate both the difficulty of the domain and the more variable vocabulary. However, the improvement of machine learning methods over the Geobase hand-built interface is much more dramatic for this corpus.

Figure 12: Accuracy on the Larger Geography Corpus (CHILL with Wolfie's lexicons vs. Geobase)

5.1.4 LICS versus Fracturing

One component of the algorithm not yet evaluated explicitly is the candidate generation method. As mentioned in Section 4.1, we could use fractures of the representations of sentences in which a phrase appears to generate the candidate meanings for that phrase, instead of LICS. We used this approach and compared it to the previously described method of using the largest isomorphic connected subgraphs of sampled pairs of representations as candidate meanings. To attempt a fairer comparison, we also sampled representations for fracturing, using the same number of source representations as the number of pairs sampled for LICS.

The accuracy of Chill when using the resulting learned lexicons as background knowledge is shown in Figure 13. Using fracturing (fractWOLFIE) shows little or no advantage; none of the differences between the two systems are statistically significant. In addition, the number of initial candidate lexicon entries from which to choose is much larger for fracturing than for our LICS method, as shown in Figure 14.

Figure 13: Fracturing vs. LICS: Accuracy (fractWOLFIE vs. WOLFIE)
This is true even though we sampled the same number of representations as pairs for LICS, because an arbitrary representation has more fractures than an arbitrary pair has LICS. Finally, Wolfie's learning time when using fracturing is greater than that when using LICS, as shown in Figure 15, where the CPU time is shown in seconds.

Figure 14: Fracturing vs. LICS: Number of Candidates

Figure 15: Fracturing vs. LICS: Learning Time (seconds)

In summary, these differences show the utility of LICS as a method for generating candidates: a more thorough method does not result in better performance, and also results in longer learning times. One could claim that we are handicapping fracturing, since we are only sampling representations for fracturing. This may indeed help the accuracy, but the learning time and the number of candidates would likely suffer even further. In a domain with larger representations, the differences in learning time would be even more dramatic.

5.2 Artificial Data

The previous section showed that Wolfie successfully learns lexicons for a natural corpus and a realistic task. However, this demonstrates success on only a relatively small corpus and with one representation formalism. We now show that our algorithm scales up well with more lexicon items to learn, more ambiguity, and more synonymy. These factors are difficult to control when using real data as input. Also, there are no large corpora available that are annotated with semantic parses. We therefore present experimental results on an artificial corpus. In this corpus, both the sentences and their representations are completely artificial, and the sentence representation is a variable-free representation, as suggested by the work of Jackendoff (1990) and others.

For each corpus discussed below, a random lexicon mapping words to simulated meanings was first constructed.¹¹ This original lexicon was then used to generate a corpus of random utterances, each paired with a meaning representation. After using this corpus as input to Wolfie,¹² the learned lexicon was compared to the original lexicon, and weighted precision and weighted recall of the learned lexicon were measured. Precision measures the percentage of the lexicon entries (i.e., word-meaning pairs) learned by the system that are correct. Recall measures the percentage of the lexicon entries in the hand-built lexicon that are correctly learned by the system:

    precision = (# correct pairs) / (# pairs learned)
    recall = (# correct pairs) / (# pairs in hand-built lexicon)

To get weighted precision and recall measures, we then weight the result for each pair by the word's frequency in the entire corpus (not just the training corpus). This models how likely we are to have learned the correct meaning for an arbitrarily chosen word in the corpus.

11. Thanks to Jeff Siskind for the initial corpus generation software, which we enhanced for these tests.
12. In these tests, we allowed Wolfie to learn phrases of up to length two.

We generated several lexicons and associated corpora, varying the ambiguity rate (number of meanings per word) and synonymy rate (number of words per meaning), as in Siskind (1996).
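On one natural reading of the weighting just described, each pair contributes its word's whole-corpus frequency rather than a unit count. A minimal sketch with hypothetical names, where lexicons are sets of (word, meaning) pairs and word_freq gives each word's frequency in the entire corpus:

    def weighted_pr(learned, reference, word_freq):
        def weight(pairs):
            return sum(word_freq[w] for w, _ in pairs)
        correct = weight(learned & reference)       # frequency mass of correct pairs
        return correct / weight(learned), correct / weight(reference)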
Meaning representations were generated using a set of "conceptual symbols" that combined to form the meaning for each word. The number of conceptual symbols used in each lexicon will be noted when we describe each corpus below. In each lexicon, 47.5% of the senses were variable-free, to simulate noun-like meanings, and 47.5% contained from one to three variables to denote open argument positions, to simulate verb-like meanings. The remaining 5% of the words had the empty meaning, to simulate function words. In addition, the functors in each meaning could have a depth of up to two and an arity of up to two. An example noun-like meaning is f23(f2(f14)), and a verb-like meaning f10(A,f15(B)); the conceptual symbols in this example are f23, f2, f14, f10, and f15. By using these multi-level meaning representations we demonstrate the learning of more complex representations than those in the geography database domain: none of the hand-built meanings for phrases in that lexicon had functors embedded in arguments.

We used a grammar to generate utterances and their meanings from each original lexicon, with terminal categories selected using a distribution based on Zipf's Law (Zipf, 1949). Under Zipf's Law, the occurrence frequency of a word is inversely proportional to its rank by occurrence (a small sampling sketch appears at the end of this discussion).

We started with a baseline corpus generated from a lexicon of 100 words using 25 conceptual symbols and no ambiguity or synonymy; 1949 sentence-meaning pairs were generated. We split this into five training sets of 1700 sentences each. Figure 16 shows the weighted precision and recall curves for this initial test. This demonstrates good scalability to a slightly larger corpus and lexicon than that of the U.S. geography query domain.

Figure 16: Baseline Artificial Corpus (weighted precision and recall vs. training examples)

A second corpus was generated from a second lexicon, also of 100 words using 25 conceptual symbols, but increasing the ambiguity to 1.25 meanings per word. This time, 1937 pairs were generated and the corpus split into five sets of 1700 training examples each. Weighted precision at 1650 examples drops to 65.4% from the previous level of 99.3%, and weighted recall to 58.9% from 99.3%. The full learning curve is shown in Figure 17. A quick comparison to Siskind's performance on this corpus confirmed that his system achieved comparable performance, showing that, with current methods, this is close to the best performance we are able to obtain on this more difficult corpus. One possible explanation for the smaller performance difference between the two systems on this corpus versus the geography domain is that in this domain the correct meaning for a word is not necessarily the most "general," in terms of number of vertices, of all its candidate meanings. Therefore, the generality portion of the heuristic may negatively influence the performance of Wolfie in this domain.

Figure 17: A More Ambiguous Artificial Corpus (weighted precision and recall vs. training examples)
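As noted above, terminal categories during corpus generation were selected using a Zipf distribution. A small illustrative sketch of such sampling, with frequency proportional to inverse rank; this is hypothetical code, not the actual generation software:

    import random

    def zipf_choice(words, rng=random):
        # words are assumed sorted by rank, most frequent first
        weights = [1.0 / r for r in range(1, len(words) + 1)]
        return rng.choices(words, weights=weights, k=1)[0]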
Finally, we show the change in performance with increasing ambiguity and increasing synonymy, holding the number of words and conceptual symbols constant. Figure 18 shows the weighted precision and recall with 1050 training examples for increasing levels of ambiguity, holding the synonymy level constant. Figure 19 shows the results at increasing levels of synonymy, holding ambiguity constant. Increasing the level of synonymy does not affect the results as much as increasing the level of ambiguity, which is as we expected. Holding the corpus size constant but increasing the number of competing meanings for a word increases the number of candidate meanings created by Wolfie while decreasing the amount of evidence available for each meaning (e.g., the first component of the heuristic measure), and makes the learning task more difficult. On the other hand, increasing the level of synonymy does not have the potential to mislead the learner.

Figure 18: Increasing the Level of Ambiguity (precision and recall vs. meanings per word)

Figure 19: Increasing the Level of Synonymy (precision and recall vs. words per meaning)

The number of training examples required to reach a certain level of accuracy is also informative. In Table 4, we show the point at which a standard precision of 75% was first reached for each level of ambiguity. Note, however, that we only measured accuracy after each set of 100 training examples, so the numbers in the table are approximate.

    Ambiguity Level    Number of Examples
    1.0                 150
    1.25                450
    2.0                1450

Table 4: Number of Examples to Reach 75% Precision

We performed a second test of scalability on two corpora generated from lexicons an order of magnitude larger than those in the above tests. In these tests, we used a lexicon containing 1000 words and 250 conceptual symbols. We generated both a corpus with no ambiguity, and one from a lexicon with ambiguity and synonymy similar to that found in the WordNet database (Beckwith, Fellbaum, Gross, & Miller, 1991); the ambiguity there is approximately 1.68 meanings per word and the synonymy 1.3 words per meaning. These corpora contained 9904 (no ambiguity) and 9948 examples, respectively, and we split the data into five sets of 9000 training examples each. For the easier large corpus, the maximum average of weighted precision and recall was 85.6%, at 8100 training examples, while for the harder corpus the maximum average was 63.1%, at 8600 training examples.

6. Active Learning

As indicated in the previous sections, we have built an integrated system for language acquisition that is flexible and useful. However, a major difficulty remains: the construction of training corpora. Though annotating sentences is still arguably less work than building an entire system by hand, the annotation task is also time-consuming and error-prone. Further, the training pairs often contain redundant information. We would like to minimize the amount of annotation required while still maintaining good generalization accuracy. To do this, we turned to methods in active learning.
Active learning is a research area in machine learning featuring systems that automatically select the most informative examples for annotation and training (Angluin, 1988; Seung, Opper, & Sompolinsky, 1992), rather than relying on a benevolent teacher or random sampling. The primary goal of active learning is to reduce the number of examples that the system is trained on, while maintaining the accuracy of the acquired information. Active learning systems may construct their own examples, request certain types of examples, or determine which of a set of unsupervised examples are most usefully labeled. The last approach, selective sampling (Cohn et al., 1994), is particularly attractive in natural language learning, since there is an abundance of text, and we would like to annotate only the most informative sentences. For many language learning tasks, annotation is particularly time-consuming since it requires specifying a complex output rather than just a category label, so reducing the number of training examples required can greatly increase the utility of learning. In this section, we explore the use of active learning, specifically selective sampling, for lexicon acquisition, and demonstrate that with active learning, fewer examples are required to achieve the same accuracy obtained by training on randomly chosen examples.

The basic algorithm for selective sampling is relatively simple. Learning begins with a small pool of annotated examples and a large pool of unannotated examples, and the learner attempts to choose the most informative additional examples for annotation. Existing work in the area has emphasized two approaches: certainty-based methods (Lewis & Catlett, 1994) and committee-based methods (McCallum & Nigam, 1998; Freund, Seung, Shamir, & Tishby, 1997; Liere & Tadepalli, 1997; Dagan & Engelson, 1995; Cohn et al., 1994); we focus here on the former. In the certainty-based paradigm, a system is trained on a small number of annotated examples to learn an initial classifier. Next, the system examines unannotated examples, and attaches certainties to the predicted annotation of those examples. The k examples with the lowest certainties are then presented to the user for annotation and retraining. Many methods for attaching certainties have been used, but they typically attempt to estimate the probability that a classifier consistent with the prior training data will classify a new example correctly.

    Apply the learner to n bootstrap examples, creating a classifier.
    Until no examples remain or the annotator is unwilling to label more examples, do:
        Use the most recently learned classifier to annotate each unlabeled instance.
        Find the k instances with the lowest annotation certainty.
        Annotate these instances.
        Train the learner on the bootstrap examples and all examples annotated so far.

Figure 20: Selective Sampling Algorithm
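A minimal sketch of the loop in Figure 20, assuming hypothetical train(examples) returning a model, model.certainty(x) returning a confidence score, and annotate(x) returning a labeled example; this is an illustration, not the system's actual code.

    def selective_sampling(bootstrap, unlabeled, train, annotate, k=10):
        labeled = list(bootstrap)
        model = train(labeled)
        while unlabeled:
            unlabeled.sort(key=model.certainty)          # least certain first
            batch, unlabeled = unlabeled[:k], unlabeled[k:]
            labeled.extend(annotate(x) for x in batch)   # user labels the batch
            model = train(labeled)                       # retrain on all labels so far
        return model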
Figure 20 presents abstract pseudocode for certainty-based selective sampling. In an ideal situation, the batch size, k, would be set to one to make the most intelligent decisions in future choices, but for efficiency reasons in retraining batch learning algorithms, it is frequently set higher. Results on a number of classification tasks have demonstrated that this general approach is effective in reducing the need for labeled examples (see citations above).

Applying certainty-based sample selection to Wolfie requires determining the certainty of a complete annotation of a potential new training example, despite the fact that individual learned lexical entries and parsing operators perform only part of the overall annotation task. Therefore, our general approach is to compute certainties for pieces of an example, in our case phrases, and combine these to obtain an overall certainty for an example. Since lexicon entries contain no explicit uncertainty parameters, we used Wolfie's heuristic measure to estimate uncertainty. To choose the sentences to be annotated in each round, we first bootstrapped an initial lexicon from a small corpus, keeping track of the heuristic values of the learned items. Then, for each unannotated sentence, we took an average of the heuristic values of the lexicon entries learned for phrases in that sentence, giving a value of zero to unknown words but eliminating from consideration any words that we assume are known in advance, such as database constants. Thus, longer sentences with only a few known phrases have a lower certainty than shorter sentences with the same number of known phrases; this is desirable, since longer sentences will be more informative from a lexicon learning point of view. The sentences with the lowest values were chosen for annotation, added to the bootstrap corpus, and a new lexicon learned. Our technique is summarized in Figure 21.

    Learn a lexicon with the examples annotated so far.
    1) For each phrase in an unannotated sentence:
        If it has entries in the learned lexicon then
            its certainty is the average of the heuristic values of those entries.
        Else, if it is a one-word phrase then
            its certainty is zero.
    2) To rank sentences, use:
        (total certainty of phrases from step 1) / (# of phrases counted in step 1)

Figure 21: Active Learning for Wolfie
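A minimal sketch of the ranking in Figure 21, assuming lexicon maps a phrase to a list of (meaning, heuristic value) entries and known contains words (such as database constants) assumed known in advance; the names are hypothetical.

    def sentence_certainty(phrases, lexicon, known):
        scores = []
        for p in phrases:
            if p in known:
                continue                        # assumed known; excluded from scoring
            entries = lexicon.get(p)
            if entries:                         # average heuristic value of entries
                scores.append(sum(v for _, v in entries) / len(entries))
            elif len(p.split()) == 1:           # unknown one-word phrase
                scores.append(0.0)
        return sum(scores) / len(scores) if scores else 0.0

    # Sentences with the lowest certainty are chosen for annotation.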
To evaluate our technique, we compared active learning to learning from randomly selected examples, again measuring the effectiveness of learned lexicons as background knowledge for Chill. We again used the (smaller) U.S. Geography corpus, as in the original Wolfie tests, using the lexicons as background knowledge during parser acquisition (and using the same examples for parser acquisition). For each trial in the following experiments, we first randomly divide the data into a training and test set. Then, n = 25 bootstrap examples are randomly selected from the training examples, and in each step of active learning, the least certain k = 10 of the remaining training examples are selected and added to the training set. The result of learning on this set is evaluated after each step. The accuracy of the resulting learned parsers was compared to the accuracy of those learned using randomly chosen examples to learn lexicons and parsers, as in Section 5; in other words, we can think of the k examples in each round as being chosen randomly.

Figure 22 shows the accuracy on unseen data of parsers learned using the lexicons learned by Wolfie when examples are chosen randomly and actively. There is an annotation savings of around 50 examples by using active learning: the maximum accuracy is reached after 175 examples, versus 225 with random examples. The advantage of using active learning is clear from the beginning, though the differences between the two curves are only statistically significant at 175 training examples. Since we are learning both lexicons and parsers, but only choosing examples based on Wolfie's certainty measures, the boost could be improved even further if Chill had a say in the examples chosen. See Thompson, Califf, and Mooney (1999) for a description of active learning for Chill.

Figure 22: Using Lexicon Certainty for Active Learning (WOLF+active vs. WOLFIE vs. Geobase)

7. Related Work

In this section, we divide the previous research on related topics into the areas of lexicon acquisition and active learning.

7.1 Lexicon Acquisition

Work on automated lexicon and language acquisition dates back to Siklossy (1972), who demonstrated a system that learned transformation patterns from logic back to natural language. As already noted, the most closely related work is that of Jeff Siskind, which we described briefly in Section 2 and whose system we ran comparisons to in Section 5. Our definition of the learning problem can be compared to his "mapping problem" (Siskind, 1993). That formulation differs from ours in several respects. First, his sentence representations are terms instead of trees. However, as shown in Figure 7, terms can also be represented as trees that conform to our formalism with some minor additions. Next, his notion of interpretation does involve a type of tree, but carries the entire representation of a sentence up to the root. Also, it is not clear how he would handle quantified variables in the representation of sentences. Skolemization is possible, but then generalization across sentences would require special handling. We make the single-use assumption and he does not. Another difference is our bias towards a minimal number of lexicon entries, while he attempts to find a monosemous lexicon. His later work (Siskind, 2000) relaxes this to allow ambiguity and noise, but still biases towards minimizing ambiguity. However, his formal definition does not explicitly allow lexical ambiguity, but handles it in a heuristic manner. This, though, may lead to more robustness than our method in the face of noise. Finally, our definition allows phrasal lexicon entries.

Siskind's work on this topic has explored many different variations along a continuum, from using many constraints but requiring more time to incorporate each new example (Siskind, 1993), to using few constraints but requiring more training data (Siskind, 1996). Thus, perhaps his earlier systems would have been able to learn the lexicons of Section 5 more quickly; but crucially, those systems did not allow lexical ambiguity, and thus also may not have learned as accurate a lexicon.
More detailed comparisons to such versions of the system are outside the scope of this paper. Our goal with Wolfie is to learn a possibly ambiguous lexicon from as few examples as possible, and we thus made comparisons along this dimension alone. Siskind's approach, like ours, takes into account constraints between word meanings that are justified by the exclusivity and compositionality assumptions. His approach is somewhat more general in that it handles noise and referential uncertainty (uncertainty about the meaning of a sentence, and thus multiple possible candidates), while ours is specialized for applications where the meaning (or meanings) is known. The experimental results in Section 5 demonstrate the advantage of our method for such an application. He has demonstrated his system to be capable of learning reasonably accurate lexicons from large, ambiguous, and noisy artificial corpora, but this accuracy is only assured if the learning algorithm converges, which did not occur for our smaller corpus in the experiments we ran. Also, as already noted, his system operates in an incremental or on-line fashion, discarding each sentence as it processes it, while ours is batch. In addition, his search for word meanings proceeds in two stages, as discussed in Section 2.2. By using common substructures, we combine these two stages in Wolfie. Both systems do have greedy aspects, ours in the choice of the next best lexical entry, his in the choice to discard utterances as noise or create a homonymous lexical entry. Finally, his system does not compute statistical correlations between words and their possible meanings, while ours does.

Besides Siskind's work, there are others who approach the problem from a cognitive perspective. For example, De Marcken (1994) also uses child language learning as a motivation, but approaches the segmentation problem instead of the learning of semantics. For training input, he uses a flat list of tokens for semantic representations, but does not segment sentences into words. He uses a variant of expectation-maximization (Dempster, Laird, & Rubin, 1977), together with a form of parsing and dictionary matching techniques, to segment the sentences and associate the segments with their most likely meaning. On the Childes corpus, the algorithm achieves very high precision, but recall is not provided. Others taking the cognitive approach demonstrate language understanding by the ability to carry out some task, such as parsing. For example, Nenov and Dyer (1994) describe a neural network model to map between visual and verbal-motor commands, and Colunga and Gasser (1998) use neural network modeling techniques for learning spatial concepts. Feldman and his colleagues at Berkeley (Feldman, Lakoff, & Shastri, 1995) are actively pursuing cognitive models of the acquisition of semantic concepts. Another Berkeley effort, the system by Regier (1996), is given examples of pictures paired with natural language descriptions that apply to the picture, and learns to judge whether a new sentence is true of a given picture. Similar work by Suppes, Liang, and Böttner (1991) uses robots to demonstrate lexicon learning.
A robot is trained on cognitive and perceptual concepts and their associated actions, and learns to execute simple commands. Along similar lines, Tishby and Gorin (1994) have a system that learns associations between words and actions, but they use a statistical framework to learn these associations, and do not handle structured representations. Similarly, Oates, Eyler-Walker, and Cohen (1999) discuss the acquisition of lexical hierarchies and their associated meaning as defined by the sensory environment of a robot.

The problem of automatic construction of translation lexicons (Smadja, McKeown, & Hatzivassiloglou, 1996; Melamed, 1995; Wu & Xia, 1995; Kumano & Hirakawa, 1994; Catizone, Russell, & Warwick, 1993; Gale & Church, 1991; Brown et al., 1990) has a definition similar to our own. While most of these methods also compute association scores between pairs (in their case, word-word pairs) and use a greedy algorithm to choose the best translation(s) for each word, they do not take advantage of the constraints between pairs. One exception is Melamed (2000); however, his approach does not allow for phrases in the lexicon or for synonymy within one text segment, while ours does. Also, Yamazaki, Pazzani, and Merz (1995) learn both translation rules and semantic hierarchies from parsed parallel sentences in Japanese and English. Of course, the main difference between this body of work and this paper is that we map words to semantic structures, not to other words.

As mentioned in the introduction, there is also a large body of work on learning lexical semantics using problem formulations different from our own. For example, Collins and Singer (1999), Riloff and Jones (1999), Roark and Charniak (1998), and Schneider (1998) define semantic lexicons as a grouping of words into semantic categories, and in the latter case, add relational information. The result is typically applied as a semantic lexicon for information extraction or entity tagging. Pedersen and Chen (1995) describe a method for acquiring syntactic and semantic features of an unknown word, assuming access to an initial concept hierarchy, but they give no experimental results. Many systems (Fukumoto & Tsujii, 1995; Haruno, 1995; Johnston, Boguraev, & Pustejovsky, 1995; Webster & Marcus, 1995) focus only on acquisition of verbs or nouns, rather than all types of words. Also, the authors just named either do not experimentally evaluate their systems, or do not show the usefulness of the learned lexicons for a specific application.

Several authors (Rooth, Riezler, Prescher, Carroll, & Beil, 1999; Collins, 1997; Ribas, 1994; Manning, 1993; Resnik, 1993; Brent, 1991) discuss the acquisition of subcategorization information for verbs, and others describe work on learning selectional restrictions (Manning, 1993; Brent, 1991). Both of these are different from the information required for mapping to a semantic representation, but could be useful as a source of information to further constrain the search. Li (1998) further expands on the subcategorization work by inducing clustering information. Finally, several systems (Knight, 1996; Hastings, 1996; Russell, 1993) learn new words from context, assuming that a large initial lexicon and parsing system are already available.
Another related body of work is grammar acquisition, especially those areas that tightly integrate the grammar with a lexicon, such as with Categorial Grammars (Retore & Bonato, 2001; Dudau-Sofronie, Tellier, & Tommasi, 2001; Watkinson & Manandhar, 1999). The theory of Categorial Grammar also has ties with lexical semantics, but these semantics have not often been used for inference in support of high-level tasks such as database retrieval. While learning syntax and semantics together is arguably a more difficult task, the aforementioned work has not been evaluated on large corpora, presumably primarily due to the difficulty of annotation.

7.2 Active Learning

With respect to additional active learning techniques, Cohn et al. (1994) were among the first to discuss certainty-based active learning methods in detail. They focus on a neural network approach to active learning in a version space of concepts. Only a few of the researchers applying machine learning to natural language processing have utilized active learning (Hwa, 2001; Schohn & Cohn, 2000; Tong & Koller, 2000; Thompson et al., 1999; Argamon-Engelson & Dagan, 1999; Liere & Tadepalli, 1997; Lewis & Catlett, 1994), and the majority of these have addressed classification tasks such as part-of-speech tagging and text categorization. For example, Liere and Tadepalli (1997) apply active learning with committees to the problem of text categorization. They show improvements with active learning similar to those that we obtain, but use a committee of Winnow-based learners on a traditional classification task. Argamon-Engelson and Dagan (1999) also apply committee-based learning to part-of-speech tagging. In their work, a committee of hidden Markov models is used to select examples for annotation. Lewis and Catlett (1994) use heterogeneous certainty-based methods, in which a simple classifier is used to select examples that are then annotated and presented to a more powerful classifier. However, many language learning tasks require annotating natural language text with a complex output, such as a parse tree, semantic representation, or filled template. The application of active learning to tasks requiring such complex outputs has not been well studied, the exceptions being Hwa (2001), Soderland (1999), and Thompson et al. (1999). The latter two include work on active learning applied to information extraction, and Thompson et al. (1999) includes work on active learning for semantic parsing. Hwa (2001) describes an interesting method for evaluating a statistical parser's uncertainty, as applied to syntactic parsing.

8. Future Work

Although Wolfie's current greedy search method has performed quite well, a better search heuristic or alternative search strategy could result in improvements. We should also more thoroughly evaluate Wolfie's ability to learn long phrases, as we restricted this ability in the evaluations here. Another issue is robustness in the face of noise. The current algorithm is not guaranteed to learn a correct lexicon even in a noise-free corpus. The addition of noise complicates an analysis of the circumstances in which mistakes are likely to happen. Further theoretical and empirical analysis of these issues is warranted.
Referential uncertainty could be handled, with an increase in complexity, by forming LICS from more pairs of representations with which a phrase appears, but not between alternative representations of the same sentence. Then, once a pair is added to the lexicon, for each sentence containing that word, representations can be eliminated if they do not contain the learned meaning, provided another representation does contain it (thus allowing for lexical ambiguity). We plan to flesh this out and evaluate the results.

A different avenue of exploration is to apply Wolfie to a corpus of sentences paired with the more common query language, SQL. Such corpora should be easily constructible by recording queries submitted to existing SQL applications along with their English forms, or by translating existing lists of SQL queries into English (presumably an easier direction to translate). The fact that the same training data can be used to learn both a semantic lexicon and a parser also helps limit the overall burden of constructing a complete natural language interface.

With respect to active learning, experiments on additional corpora are needed to test the ability of our approach to reduce annotation costs in a variety of domains. It would also be interesting to explore active learning for other natural language processing problems such as syntactic parsing, word-sense disambiguation, and machine translation. Our current results have involved a certainty-based approach; however, proponents of committee-based approaches have convincing arguments for their theoretical advantages. Our initial attempts at adapting committee-based approaches to our systems were not very successful; however, additional research on this topic is indicated. One critical problem is obtaining diverse committees that properly sample the version space (Cohn et al., 1994).

9. Conclusions

Acquiring a semantic lexicon from a corpus of sentences labeled with representations of their meaning is an important problem that has not been widely studied. We present both a formalism of the learning problem and a greedy algorithm to find an approximate solution to it. Wolfie demonstrates that a fairly simple, greedy, symbolic learning algorithm performs well on this task and obtains performance superior to a previous lexicon acquisition system on a corpus of geography queries. Our results also demonstrate that our methods extend to a variety of natural languages besides English, and that they scale fairly well to larger, more difficult corpora.

Active learning is a new area of machine learning that has been almost exclusively applied to classification tasks. We have demonstrated its successful application to more complex natural language mappings from phrases to semantic meanings, supporting the acquisition of lexicons and parsers. The wealth of unannotated natural language data, along with the difficulty of annotating such data, makes selective sampling a potentially invaluable technique for natural language learning. Our results on realistic corpora indicate that example annotation savings as high as 22% can be achieved by employing active sample selection using only simple certainty measures for predictions on unannotated data.
Improved sample selection methods and applications to other important language problems hold the promise of continued progress in using machine learning to construct effective natural language processing systems.

Most experiments in corpus-based natural language processing have presented results on some subtask of natural language, and there are few results on whether the learned subsystems can be successfully integrated to build a complete NLP system. The experiments presented in this paper demonstrated how two learning systems, Wolfie and Chill, were successfully integrated to learn a complete NLP system for parsing database queries into executable logical form given only a single corpus of annotated queries, and further demonstrated the potential of active learning to reduce the annotation effort for learning for NLP.

Acknowledgments

We would like to thank Jeff Siskind for providing us with his software, and for all his help in adapting it for use with our corpus. Thanks also to Agapito Sustaita, Esra Erdem, and Marshall Mayberry for their translation efforts, and to the three anonymous reviewers for their comments, which helped improve the paper. This research was supported by the National Science Foundation under grants IRI-9310819 and IRI-9704943.

References

Anderson, J. R. (1977). Induction of augmented transition networks. Cognitive Science, 1, 125-157.

Angluin, D. (1988). Queries and concept learning. Machine Learning, 2, 319-342.

Argamon-Engelson, S., & Dagan, I. (1999). Committee-based sample selection for probabilistic classifiers. Journal of Artificial Intelligence Research, 11, 335-360.

Beckwith, R., Fellbaum, C., Gross, D., & Miller, G. (1991). WordNet: A lexical database organized on psycholinguistic principles. In Zernik, U. (Ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pp. 211-232. Lawrence Erlbaum, Hillsdale, NJ.

Borland International (1988). Turbo Prolog 2.0 Reference Guide. Borland International, Scotts Valley, CA.

Brent, M. (1991). Automatic acquisition of subcategorization frames from untagged text. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL-91), pp. 209-214.

Brown, P., et al. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2), 79-85.

Catizone, R., Russell, G., & Warwick, S. (1993). Deriving translation data from bilingual texts. In Proceedings of the First International Lexical Acquisition Workshop.

Cohn, D., Atlas, L., & Ladner, R. (1994). Improving generalization with active learning. Machine Learning, 15(2), 201-221.

Collins, M., & Singer, Y. (1999). Unsupervised models for named entity classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99), University of Maryland.

Collins, M. J. (1997). Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL-97), pp. 16-23.

Colunga, E., & Gasser, M. (1998). Linguistic relativity and word acquisition: a computational approach. In Proceedings of the Twenty-First Annual Conference of the Cognitive Science Society, pp. 244-249.
Dagan, I., & Engelson, S. P. (1995). Committee-based sampling for training probabilistic classifiers. In Proceedings of the Twelfth International Conference on Machine Learning (ICML-95), pp. 150-157, San Francisco, CA. Morgan Kaufmann.

De Marcken, C. (1994). The acquisition of a lexicon from paired phoneme sequences and semantic representations. In Lecture Notes in Computer Science, Vol. 862, pp. 66-77. Springer-Verlag.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1-38.

Dudau-Sofronie, Tellier, & Tommasi (2001). Learning categorial grammars from semantic types. In Proceedings of the 13th Amsterdam Colloquium, pp. 79-84.

Feldman, J., Lakoff, G., & Shastri, L. (1995). The neural theory of language project. http://www.icsi.berkeley.edu/ntl. International Computer Science Institute, University of California, Berkeley, CA.

Fillmore, C. (1968). The case for case. In Bach, E., & Harms, R. T. (Eds.), Universals in Linguistic Theory. Holt, Reinhart and Winston, New York.

Fillmore, C. (1988). The mechanisms of "Construction Grammar". In Axmaker, S., Jaisser, A., & Singmeister, H. (Eds.), Proceedings of the Fourteenth Annual Meeting of the Berkeley Linguistics Society, pp. 35-55, Berkeley, CA.

Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2, 139-172.

Freund, Y., Seung, H. S., Shamir, E., & Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning, 28, 133-168.

Fukumoto, F., & Tsujii, J. (1995). Representation and acquisition of verbal polysemy. In Papers from the 1995 AAAI Symposium on the Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity, pp. 39-44, Stanford, CA.

Gale, W., & Church, K. (1991). Identifying word correspondences in parallel texts. In Proceedings of the Fourth DARPA Speech and Natural Language Workshop.

Garey, M., & Johnson, D. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, New York, NY.

Goldberg, A. (1995). Constructions: A Construction Grammar Approach to Argument Structure. The University of Chicago Press.

Grefenstette, G. (1994). Sextant: Extracting semantics from raw text, implementation details. Integrated Computer-Aided Engineering, 6(4).

Haas, J., & Jayaraman, B. (1997). From context-free to definite-clause grammars: a type-theoretic approach. Journal of Logic Programming, 30(1), 1-23.

Haruno, M. (1995). A case frame learning method for Japanese polysemous verbs. In Papers from the 1995 AAAI Symposium on the Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity, pp. 45-50, Stanford, CA.

Hastings, P. (1996). Implications of an automatic lexical acquisition mechanism. In Wermter, S., Riloff, E., & Scheler, C. (Eds.), Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing. Springer-Verlag, Berlin.

Hwa, R. (2001). On minimizing training corpus for parser acquisition. In Proceedings of the Fifth Computational Natural Language Learning Workshop.

Jackendoff, R. (1990). Semantic Structures. The MIT Press, Cambridge, MA.
Johnston, M., Boguraev, B., & Pustejovsky, J. (1995). The acquisition and interpretation of complex nominals. In Papers from the 1995 AAAI Symposium on the Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity, pp. 69-74, Stanford, CA.

Knight, K. (1996). Learning word meanings by instruction. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pp. 447-454, Portland, OR.

Kohavi, R., & John, G. (1995). Automatic parameter selection by minimizing estimated error. In Proceedings of the Twelfth International Conference on Machine Learning (ICML-95), pp. 304-312, Tahoe City, CA.

Kumano, A., & Hirakawa, H. (1994). Building an MT dictionary from parallel texts based on linguistic and statistical information. In Proceedings of the Fifteenth International Conference on Computational Linguistics, pp. 76-81.

Lavrač, N., & Džeroski, S. (1994). Inductive Logic Programming: Techniques and Applications. Ellis Horwood.

Lewis, D. D., & Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the Eleventh International Conference on Machine Learning (ICML-94), pp. 148-156, San Francisco, CA. Morgan Kaufmann.

Li, H. (1998). A probabilistic approach to lexical semantic knowledge acquisition and structural disambiguation. Ph.D. thesis, University of Tokyo.

Liere, R., & Tadepalli, P. (1997). Active learning with committees for text categorization. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97), pp. 591-596, Providence, RI.

Manning, C. D. (1993). Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL-93), pp. 235-242, Columbus, OH.

McCallum, A. K., & Nigam, K. (1998). Employing EM and pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML-98), pp. 350-358, Madison, WI. Morgan Kaufmann.

Melamed, I. D. (1995). Automatic evaluation and uniform filter cascades for inducing n-best translation lexicons. In Proceedings of the Third Workshop on Very Large Corpora.

Melamed, I. D. (2000). Models of translational equivalence among words. Computational Linguistics, 26(2), 221-249.

Muggleton, S. (Ed.). (1992). Inductive Logic Programming. Academic Press, New York, NY.

Muggleton, S. (1995). Inverse entailment and Progol. New Generation Computing Journal, 13, 245-286.

Muggleton, S., & Feng, C. (1990). Efficient induction of logic programs. In Proceedings of the First Conference on Algorithmic Learning Theory, Tokyo, Japan. Ohmsha.

Nenov, V. I., & Dyer, M. G. (1994). Perceptually grounded language learning: Part 2 - DETE: A neural/procedural model. Connection Science, 6(1), 3-41.

Oates, T., Eyler-Walker, Z., & Cohen, P. (1999). Using syntax to learn semantics: an experiment in language acquisition with a mobile robot. Tech. rep. 99-35, University of Massachusetts, Computer Science Department.

Partee, B., Meulen, A., & Wall, R. (1990). Mathematical Methods in Linguistics. Kluwer Academic Publishers, Dordrecht, The Netherlands.
Pedersen, T., & Chen, W. (1995). Lexical acquisition via constraint solving. In Papers from the 1995 AAAI Symposium on the Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity, pp. 118-122, Stanford, CA.

Plotkin, G. D. (1970). A note on inductive generalization. In Meltzer, B., & Michie, D. (Eds.), Machine Intelligence (Vol. 5). Elsevier North-Holland, New York.

Rayner, M., Hugosson, A., & Hagert, G. (1988). Using a logic grammar to learn a lexicon. Tech. rep. R88001, Swedish Institute of Computer Science.

Regier, T. (1996). The Human Semantic Potential: Spatial Language and Constrained Connectionism. MIT Press.

Resnik, P. (1993). Selection and Information: A Class-Based Approach to Lexical Relationships. Ph.D. thesis, University of Pennsylvania, CIS Department.

Retore, C., & Bonato, R. (2001). Learning rigid Lambek grammars and minimalist grammars from structured sentences. In Proceedings of the Third Learning Language in Logic Workshop, Strasbourg, France.

Ribas, F. (1994). An experiment on learning appropriate selectional restrictions from a parsed corpus. In Proceedings of the Fifteenth International Conference on Computational Linguistics, pp. 769-774.

Riloff, E., & Jones, R. (1999). Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), pp. 1044-1049, Orlando, FL.

Roark, B., & Charniak, E. (1998). Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and COLING-98 (ACL/COLING-98), pp. 1110-1116.

Rooth, M., Riezler, S., Prescher, D., Carroll, G., & Beil, F. (1999). Inducing a semantically annotated lexicon via EM-based clustering. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 104-111.

Russell, D. (1993). Language Acquisition in a Unification-Based Grammar Processing System Using a Real World Knowledge Base. Ph.D. thesis, University of Illinois, Urbana, IL.

Schank, R. C. (1975). Conceptual Information Processing. North-Holland, Oxford.

Schneider, R. (1998). A lexically-intensive algorithm for domain-specific knowledge acquisition. In Proceedings of the Joint Conference on New Methods in Language Processing and Computational Natural Language Learning, pp. 19-28.

Schohn, G., & Cohn, D. (2000). Less is more: Active learning with support vector machines. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000), pp. 839-846, Stanford, CA.

Sébillot, P., Bouillon, P., & Fabre, C. (2000). Inductive logic programming for corpus-based acquisition of semantic lexicons. In Proceedings of the 2nd Learning Language in Logic (LLL) Workshop, Lisbon, Portugal.

Seung, H. S., Opper, M., & Sompolinsky, H. (1992). Query by committee. In Proceedings of the ACM Workshop on Computational Learning Theory, Pittsburgh, PA.

Siklossy, L. (1972). Natural language learning by computer. In Simon, H. A., & Siklossy, L. (Eds.), Representation and Meaning: Experiments with Information Processing Systems. Prentice Hall, Englewood Cliffs, NJ.
Siskind, J. M. (1992). Naive Physics, Event Perception, Lexical Semantics and Language Acquisition. Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA.
Siskind, J. M. (1993). Lexical acquisition as constraint satisfaction. Tech. rep. IRCS-93-41, University of Pennsylvania.
Siskind, J. M. (1996). A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61(1), 39–91.
Siskind, J. M. (2000). Learning word-to-meaning mappings. In Broeder, P., & Murre, J. (Eds.), Models of Language Acquisition: Inductive and Deductive Approaches. Oxford University Press.
Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1), 1–38.
Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, 34, 233–272.
Suppes, P., Liang, L., & Böttner, M. (1991). Complexity issues in robotic machine learning of natural language. In Lam, L., & Naroditsky, V. (Eds.), Modeling Complex Phenomena, Proceedings of the 3rd Woodward Conference, pp. 102–127. Springer-Verlag.
Thompson, C. A. (1995). Acquisition of a lexicon from semantic representations of sentences. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95), pp. 335–337, Cambridge, MA.
Thompson, C. A., Califf, M. E., & Mooney, R. J. (1999). Active learning for natural language parsing and information extraction. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), pp. 406–414, Bled, Slovenia.
Tishby, N., & Gorin, A. (1994). Algebraic learning of statistical associations for language acquisition. Computer Speech and Language, 8, 51–78.
Tomita, M. (1986). Efficient Parsing for Natural Language. Kluwer Academic Publishers, Boston.
Tong, S., & Koller, D. (2000). Support vector machine active learning with applications to text classification. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000), pp. 999–1006, Stanford, CA.
Watkinson, S., & Manandhar, S. (1999). Unsupervised lexical learning with categorial grammars using the LLL corpus. In Learning Language in Logic (LLL) Workshop, Bled, Slovenia.
Webster, M., & Marcus, M. (1989). Automatic acquisition of the lexical semantics of verbs from sentence frames. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL-89), pp. 177–184.
Wu, D., & Xia, X. (1995). Large-scale automatic extraction of an English-Chinese translation lexicon. Machine Translation, 9(3-4), 285–313.
Yamazaki, T., Pazzani, M., & Merz, C. (1995). Learning hierarchies from ambiguous natural language data. In Proceedings of the Twelfth International Conference on Machine Learning (ICML-95), pp. 575–583, San Francisco, CA. Morgan Kaufmann.
Zelle, J. M. (1995). Using Inductive Logic Programming to Automate the Construction of Natural Language Parsers. Ph.D. thesis, Department of Computer Sciences, University of Texas, Austin, TX. Also appears as Artificial Intelligence Laboratory Technical Report AI 96-249.
Zelle, J. M., & Mooney, R. J. (1996). Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pp. 1050–1055, Portland, OR.
Zipf, G. (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley, New York, NY.
