A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations
Author: Peter D. Turney (National Research Council of Canada, Institute for Information Technology)
National Research Council of Canada, Institute for Information Technology
M50 Montreal Road, Ottawa, Ontario, Canada K1A 0R6
peter.turney@nrc-cnrc.gc.ca

Abstract

Recognizing analogies, synonyms, antonyms, and associations appear to be four distinct tasks, requiring distinct NLP algorithms. In the past, the four tasks have been treated independently, using a wide variety of algorithms. These four semantic classes, however, are a tiny sample of the full range of semantic phenomena, and we cannot afford to create ad hoc algorithms for each semantic phenomenon; we need to seek a unified approach. We propose to subsume a broad range of phenomena under analogies. To limit the scope of this paper, we restrict our attention to the subsumption of synonyms, antonyms, and associations. We introduce a supervised corpus-based machine learning algorithm for classifying analogous word pairs, and we show that it can solve multiple-choice SAT analogy questions, TOEFL synonym questions, ESL synonym-antonym questions, and similar-associated-both questions from cognitive psychology.

1 Introduction

A pair of words (petrify:stone) is analogous to another pair (vaporize:gas) when the semantic relations between the words in the first pair are highly similar to the relations in the second pair. Two words (levied and imposed) are synonymous in a context (levied a tax) when they can be interchanged (imposed a tax); they are antonymous when they have opposite meanings (black and white); and they are associated when they tend to co-occur (doctor and hospital).

On the surface, it appears that these are four distinct semantic classes, requiring distinct NLP algorithms, but we propose a uniform approach to all four.
We subsume synonyms, antonyms, and associations under analogies. In essence, we say that X and Y are antonyms when the pair X:Y is analogous to the pair black:white, X and Y are synonyms when they are analogous to the pair levied:imposed, and X and Y are associated when they are analogous to the pair doctor:hospital.

There is past work on recognizing analogies (Reitman, 1965), synonyms (Landauer and Dumais, 1997), antonyms (Lin et al., 2003), and associations (Lesk, 1969), but each of these four tasks has been examined separately, in isolation from the others. As far as we know, the algorithm proposed here is the first attempt to deal with all four tasks using a uniform approach. We believe that it is important to seek NLP algorithms that can handle a broad range of semantic phenomena, because developing a specialized algorithm for each phenomenon is a very inefficient research strategy.

It might seem that a lexicon, such as WordNet (Fellbaum, 1998), contains all the information we need to handle these four tasks. However, we prefer to take a corpus-based approach to semantics. Veale (2004) used WordNet to answer 374 multiple-choice SAT analogy questions, achieving an accuracy of 43%, but the best corpus-based approach attains an accuracy of 56% (Turney, 2006). Another reason to prefer a corpus-based approach to a lexicon-based approach is that the former requires less human labour, and thus it is easier to extend to other languages.

In Section 2, we describe our algorithm for recognizing analogies. We use a standard supervised machine learning approach, with feature vectors based on the frequencies of patterns in a large corpus. We use a support vector machine (SVM) to learn how to classify the feature vectors (Platt, 1998; Witten and Frank, 1999).

Section 3 presents four sets of experiments.
We apply our algorithm for recognizing analogies to multiple-choice analogy questions from the SAT college entrance test, multiple-choice synonym questions from the TOEFL (test of English as a foreign language), ESL (English as a second language) practice questions for distinguishing synonyms and antonyms, and a set of word pairs that are labeled similar, associated, and both, developed for experiments in cognitive psychology.

We discuss the results of the experiments in Section 4. The accuracy of the algorithm is competitive with other systems, but the strength of the algorithm is that it is able to handle all four tasks, with no tuning of the learning parameters to the particular task. It performs well, although it is competing against specialized algorithms, developed for single tasks.

Related work is examined in Section 5 and limitations and future work are considered in Section 6. We conclude in Section 7.

2 Classifying Analogous Word Pairs

An analogy, A:B::C:D, asserts that A is to B as C is to D; for example, traffic:street::water:riverbed asserts that traffic is to street as water is to riverbed; that is, the semantic relations between traffic and street are highly similar to the semantic relations between water and riverbed. We may view the task of recognizing word analogies as a problem of classifying word pairs (see Table 1).

  Word pair          Class label
  carpenter:wood     artisan:material
  mason:stone        artisan:material
  potter:clay        artisan:material
  glassblower:glass  artisan:material
  traffic:street     entity:carrier
  water:riverbed     entity:carrier
  packets:network    entity:carrier
  gossip:grapevine   entity:carrier

Table 1: Examples of how the task of recognizing word analogies may be viewed as a problem of classifying word pairs.

We approach this as a standard classification problem for supervised machine learning.
The algorithm takes as input a training set of word pairs with class labels and a testing set of word pairs without labels. Each word pair is represented as a vector in a feature space and a supervised learning algorithm is used to classify the feature vectors. The elements in the feature vectors are based on the frequencies of automatically defined patterns in a large corpus. The output of the algorithm is an assignment of labels to the word pairs in the testing set. For some of the experiments, we select a unique label for each word pair; for other experiments, we assign probabilities to each possible label for each word pair.

For a given word pair, such as mason:stone, the first step is to generate morphological variations, such as masons:stones. In the following experiments, we use morpha (morphological analyzer) and morphg (morphological generator) for morphological processing (Minnen et al., 2001).[1]

The second step is to search in a large corpus for all phrases of the following form:

  "[0 to 1 words] X [0 to 3 words] Y [0 to 1 words]"

In this template, X:Y consists of morphological variations of the given word pair, in either order; for example, mason:stone, stone:mason, masons:stones, and so on. A typical phrase for mason:stone would be "the mason cut the stone with". We then normalize all of the phrases that are found, by using morpha to remove suffixes.

The template we use here is similar to Turney (2006), but we have added extra context words before the X and after the Y. Our morphological processing also differs from Turney (2006). In the following experiments, we search in a corpus of 5 × 10^10 words (about 280 GB of plain text), consisting of web pages gathered by a web crawler.[2]
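As an illustrative sketch only (not the actual Wumpus query used in the paper), the phrase template above can be written as a regular expression; the helper name `template_regex` is our own, and in the real algorithm X and Y would also range over morphological variants of the pair, in either order.

```python
import re

def template_regex(x, y):
    """Regex for the template '[0 to 1 words] X [0 to 3 words] Y
    [0 to 1 words]', with words taken to be whitespace-separated tokens."""
    w = r"\S+"
    return re.compile(
        rf"(?:{w} )?{re.escape(x)}(?: {w}){{0,3}} {re.escape(y)}(?: {w})?"
    )

# The paper's example phrase for mason:stone fits the template.
assert template_regex("mason", "stone").search("the mason cut the stone with")
# More than 3 words between X and Y does not fit.
assert not template_regex("mason", "stone").search("mason one two three four stone")
```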
To retrieve phrases from the corpus, we use Wumpus (Büttcher and Clarke, 2005), an efficient search engine for passage retrieval from large corpora.[3]

The next step is to generate patterns from all of the phrases that were found for all of the input word pairs (from both the training and testing sets). To generate patterns from a phrase, we replace the given word pairs with variables, X and Y, and we replace the remaining words with a wild card symbol (an asterisk) or leave them as they are. For example, the phrase "the mason cut the stone with" yields the patterns "the X cut * Y with", "* X * the Y *", and so on. If a phrase contains n words, then it yields 2^(n−2) patterns.

Each pattern corresponds to a feature in the feature vectors that we will generate. Since a typical input set of word pairs yields millions of patterns, we need to use feature selection, to reduce the number of patterns to a manageable quantity. For each pattern, we count the number of input word pairs that generated the pattern. For example, "* X cut * Y *" is generated by both mason:stone and carpenter:wood. We then sort the patterns in descending order of the number of word pairs that generated them. If there are N input word pairs (and thus N feature vectors, including both the training and testing sets), then we select the top kN patterns and drop the remainder. In the following experiments, k is set to 20. The algorithm is not sensitive to the precise value of k.

The reasoning behind the feature selection algorithm is that shared patterns make more useful features than rare patterns.

[1] http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
[2] The corpus was collected by Charles Clarke, University of Waterloo. We can provide copies on request.
[3] http://www.wumpus-search.org/
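To make the pattern-generation step concrete, here is a minimal sketch (our own illustration; the function name is not from the paper): each context word is either kept or replaced by "*", which yields the 2^(n−2) patterns for an n-word phrase.

```python
from itertools import product

def patterns_from_phrase(phrase, x, y):
    """Replace the pair words with variables X and Y; every other word
    is either left as-is or replaced by the wildcard '*'."""
    words = phrase.split()
    context = [i for i, w in enumerate(words) if w not in (x, y)]
    patterns = []
    for mask in product((False, True), repeat=len(context)):
        wild = {i for i, flag in zip(context, mask) if flag}
        out = []
        for i, w in enumerate(words):
            if w == x:
                out.append("X")
            elif w == y:
                out.append("Y")
            else:
                out.append("*" if i in wild else w)
        patterns.append(" ".join(out))
    return patterns

pats = patterns_from_phrase("the mason cut the stone with", "mason", "stone")
# 6 words, 2 of them the pair -> 2**(6-2) = 16 patterns
assert len(pats) == 16
assert "the X cut * Y with" in pats
assert "* X * * Y *" in pats
```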
The number of features (kN) depends on the number of word pairs (N), because, if we have more feature vectors, then we need more features to distinguish them. Turney (2006) also selects patterns based on the number of pairs that generate them, but the number of selected patterns is a constant (8000), independent of the number of input word pairs.

The next step is to generate feature vectors, one vector for each input word pair. Each of the N feature vectors has kN elements, one element for each selected pattern. The value of an element in a vector is given by the logarithm of the frequency in the corpus of the corresponding pattern for the given word pair. For example, suppose the given pair is mason:stone and the pattern is "* X cut * Y *". We look at the normalized phrases that we collected for mason:stone and we count how many match this pattern. If f phrases match the pattern, then the value of this element in the feature vector is log(f + 1) (we add 1 because log(0) is undefined). Each feature vector is then normalized to unit length. The normalization ensures that features in vectors for high-frequency word pairs (traffic:street) are comparable to features in vectors for low-frequency word pairs (water:riverbed).

Now that we have a feature vector for each input word pair, we can apply a standard supervised learning algorithm. In the following experiments, we use a sequential minimal optimization (SMO) support vector machine (SVM) with a radial basis function (RBF) kernel (Platt, 1998), as implemented in Weka (Waikato Environment for Knowledge Analysis) (Witten and Frank, 1999).[4] The algorithm generates probability estimates for each class by fitting logistic regression models to the outputs of the SVM. We disable the normalization option in Weka, since the vectors are already normalized to unit length.
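The vector construction just described can be sketched as follows; this is a minimal illustration under our own naming, where `selected_patterns` stands for the top kN patterns surviving feature selection.

```python
import math

def feature_vector(match_counts, selected_patterns):
    """match_counts maps a pattern to f, the number of collected phrases
    for this word pair that match it. Each element is log(f + 1); the
    vector is then normalized to unit length."""
    v = [math.log(match_counts.get(p, 0) + 1.0) for p in selected_patterns]
    norm = math.sqrt(sum(e * e for e in v))
    return [e / norm for e in v] if norm > 0.0 else v

vec = feature_vector({"* X cut * Y *": 3, "the X * Y *": 1},
                     ["* X cut * Y *", "the X * Y *", "* X and Y *"])
assert abs(sum(e * e for e in vec) - 1.0) < 1e-9  # unit length
assert vec[2] == 0.0  # unmatched pattern: log(0 + 1) = 0
```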
We chose the SMO RBF algorithm because it is fast, robust, and it easily handles large numbers of features.

For convenience, we will refer to the above algorithm as PairClass. In the following experiments, PairClass is applied to each of the four problems with no adjustments or tuning to the specific problems. Some work is required to fit each problem into the general framework of PairClass (supervised classification of word pairs) but the core algorithm is the same in each case.

3 Experiments

This section presents four sets of experiments, with analogies, synonyms, antonyms, and associations. We explain how each task is treated as a problem of classifying analogous word pairs, we give the experimental results, and we discuss past work on each of the four tasks.

3.1 SAT Analogies

In this section, we apply PairClass to the task of recognizing analogies. To evaluate the performance, we use a set of 374 multiple-choice questions from the SAT college entrance exam. Table 2 shows a typical question. The target pair is called the stem. The task is to select the choice pair that is most analogous to the stem pair.

  Stem:     mason:stone
  Choices:  (a) teacher:chalk
            (b) carpenter:wood
            (c) soldier:gun
            (d) photograph:camera
            (e) book:word
  Solution: (b) carpenter:wood

Table 2: An example of a question from the 374 SAT analogy questions.

The problem of recognizing word analogies was first attempted with a system called Argus (Reitman, 1965), using a small hand-built semantic network with a spreading activation algorithm. Turney et al. (2003) used a combination of 13 independent modules. Veale (2004) used a spreading activation algorithm with WordNet (in effect, treating WordNet as a semantic network). Turney (2006) used a corpus-based algorithm.

[4] http://www.cs.waikato.ac.nz/ml/weka/
We may view Table 2 as a binary classification problem, in which mason:stone and carpenter:wood are positive examples and the remaining word pairs are negative examples. The difficulty is that the labels of the choice pairs must be hidden from the learning algorithm. That is, the training set consists of one positive example (the stem pair) and the testing set consists of five unlabeled examples (the five choice pairs). To make this task more tractable, we randomly choose a stem pair from one of the 373 other SAT analogy questions, and we assume that this new stem pair is a negative example, as shown in Table 3.

  Word pair          Train or test  Class label
  mason:stone        train          positive
  tutor:pupil        train          negative
  teacher:chalk      test           hidden
  carpenter:wood     test           hidden
  soldier:gun        test           hidden
  photograph:camera  test           hidden
  book:word          test           hidden

Table 3: How to fit a SAT analogy question into the framework of supervised pair classification.

To answer the SAT question, we use PairClass to estimate the probability that each testing example is positive, and we guess the testing example with the highest probability. Learning from a training set with only one positive example and one negative example is difficult, since the learned model can be highly unstable. To increase the stability, we repeat the learning process 10 times, using a different randomly chosen negative training example each time. For each testing word pair, the 10 probability estimates are averaged together. This is a form of bagging (Breiman, 1996).

PairClass attains an accuracy of 52.1%. For comparison, the ACL Wiki lists 12 previously published results with the 374 SAT analogy questions.[5] Only 2 of the 12 algorithms have higher accuracy. The best previous result is an accuracy of 56.1% (Turney, 2006). Random guessing would yield an accuracy of 20%. The average senior high school student achieves 57% correct (Turney, 2006).

[5] For more information, see SAT Analogy Questions (State of the art) at http://aclweb.org/aclwiki/.

3.2 TOEFL Synonyms

Now we apply PairClass to the task of recognizing synonyms, using a set of 80 multiple-choice synonym questions from the TOEFL (test of English as a foreign language). A sample question is shown in Table 4. The task is to select the choice word that is most similar in meaning to the stem word.

  Stem:     levied
  Choices:  (a) imposed
            (b) believed
            (c) requested
            (d) correlated
  Solution: (a) imposed

Table 4: An example of a question from the 80 TOEFL questions.

Synonymy can be viewed as a high degree of semantic similarity. The most common way to measure semantic similarity is to measure the distance between words in WordNet (Resnik, 1995; Jiang and Conrath, 1997; Hirst and St-Onge, 1998). Corpus-based measures of word similarity are also common (Lesk, 1969; Landauer and Dumais, 1997; Turney, 2001).

We may view Table 4 as a binary classification problem, in which the pair levied:imposed is a positive example of the class synonymous and the other possible pairings are negative examples, as shown in Table 5.

  Word pair          Class label
  levied:imposed     positive
  levied:believed    negative
  levied:requested   negative
  levied:correlated  negative

Table 5: How to fit a TOEFL question into the framework of supervised pair classification.

The 80 TOEFL questions yield 320 (80 × 4) word pairs, 80 labeled positive and 240 labeled negative. We apply PairClass to the word pairs using ten-fold cross-validation. In each random fold, 90% of the pairs are used for training and 10% are used for testing. For each fold, the model that is learned from the training set is used to assign probabilities to the pairs in the testing set. With ten separate folds, the ten non-overlapping testing sets cover the whole dataset.
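The bagging procedure for a single SAT question can be sketched as below; `train_and_score` is a hypothetical stand-in (not part of the paper) for training the SVM on one positive and one negative example and returning a probability-scoring function over feature vectors.

```python
import random

def answer_sat(stem_vec, choice_vecs, other_stem_vecs, train_and_score,
               runs=10, seed=0):
    """Train `runs` models, each on the stem (positive) plus one randomly
    chosen stem from another question (negative); average the probability
    estimates and return the index of the best-scoring choice."""
    rng = random.Random(seed)
    totals = [0.0] * len(choice_vecs)
    for _ in range(runs):
        negative = rng.choice(other_stem_vecs)
        score = train_and_score([stem_vec], [negative])
        for i, choice in enumerate(choice_vecs):
            totals[i] += score(choice)
    averages = [t / runs for t in totals]
    return averages.index(max(averages))
```

For instance, plugging in a toy scorer that rates a choice by its dot product with the positive example picks the choice closest to the stem; the real algorithm uses the SVM's logistic probability estimates instead.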
Our guess for each TOEFL question is the choice with the highest probability of being positive, when paired with the corresponding stem.

PairClass attains an accuracy of 76.2%. For comparison, the ACL Wiki lists 15 previously published results with the 80 TOEFL synonym questions.[6] Of the 15 algorithms, 8 have higher accuracy and 7 have lower. The best previous result is an accuracy of 97.5% (Turney et al., 2003), obtained using a hybrid of four different algorithms. Random guessing would yield an accuracy of 25%. The average foreign applicant to a US university achieves 64.5% correct (Landauer and Dumais, 1997).

3.3 Synonyms and Antonyms

The task of classifying word pairs as either synonyms or antonyms readily fits into the framework of supervised classification of word pairs. Table 6 shows some examples from a set of 136 ESL (English as a second language) practice questions that we collected from various ESL websites.

  Word pair                  Class label
  galling:irksome            synonyms
  yield:bend                 synonyms
  naive:callow               synonyms
  advise:suggest             synonyms
  dissimilarity:resemblance  antonyms
  commend:denounce           antonyms
  expose:camouflage          antonyms
  unveil:veil                antonyms

Table 6: Examples of synonyms and antonyms from 136 ESL practice questions.

Lin et al. (2003) distinguish synonyms from antonyms using two patterns, "from X to Y" and "either X or Y". When X and Y are antonyms, they occasionally appear in a large corpus in one of these two patterns, but it is very rare for synonyms to appear in these patterns. Our approach is similar to Lin et al. (2003), but we do not rely on hand-coded patterns; instead, PairClass patterns are generated automatically.

Using ten-fold cross-validation, PairClass attains an accuracy of 75.0%.
Always guessing the majority class would result in an accuracy of 65.4%. The average human score is unknown and there are no previous results for comparison.

[6] For more information, see TOEFL Synonym Questions (State of the art) at http://aclweb.org/aclwiki/.

3.4 Similar, Associated, and Both

A common criticism of corpus-based measures of word similarity (as opposed to lexicon-based measures) is that they are merely detecting associations (co-occurrences), rather than actual semantic similarity (Lund et al., 1995). To address this criticism, Lund et al. (1995) evaluated their algorithm for measuring word similarity with word pairs that were labeled similar, associated, or both. These labeled pairs were originally created for cognitive psychology experiments with human subjects (Chiarello et al., 1990). Table 7 shows some examples from this collection of 144 word pairs (48 pairs in each of the three classes).

  Word pair     Class label
  table:bed     similar
  music:art     similar
  hair:fur      similar
  house:cabin   similar
  cradle:baby   associated
  mug:beer      associated
  camel:hump    associated
  cheese:mouse  associated
  ale:beer      both
  uncle:aunt    both
  pepper:salt   both
  frown:smile   both

Table 7: Examples of word pairs labeled similar, associated, or both.

Lund et al. (1995) did not measure the accuracy of their algorithm on this three-class classification problem. Instead, following standard practice in cognitive psychology, they showed that their algorithm's similarity scores for the 144 word pairs were correlated with the response times of human subjects in priming tests. In a typical priming test, a human subject reads a priming word (cradle) and is then asked to complete a partial word (complete bab as baby). The time required to perform the task is taken to indicate the strength of the cognitive link between the two words (cradle and baby).
Using ten-fold cross-validation, PairClass attains an accuracy of 77.1% on the 144 word pairs. Since the three classes are of equal size, guessing the majority class and random guessing both yield an accuracy of 33.3%. The average human score is unknown and there are no previous results for comparison.

4 Discussion

The four experiments are summarized in Tables 8 and 9. For the first two experiments, where there are previous results, PairClass is not the best, but it performs competitively. For the second two experiments, PairClass performs significantly above the baselines. However, the strength of this approach is not its performance on any one task, but the range of tasks it can handle. As far as we know, this is the first time a standard supervised learning algorithm has been applied to any of these four problems.

The advantage of being able to cast these problems in the framework of standard supervised learning problems is that we can now exploit the huge literature on supervised learning. Past work on these problems has required implicitly coding our knowledge of the nature of the task into the structure of the algorithm. For example, the structure of the algorithm for latent semantic analysis (LSA) implicitly contains a theory of synonymy (Landauer and Dumais, 1997). The problem with this approach is that it can be very difficult to work out how to modify the algorithm if it does not behave the way we want. On the other hand, with a supervised learning algorithm, we can put our knowledge into the labeling of the feature vectors, instead of putting it directly into the algorithm. This makes it easier to guide the system to the desired behaviour.
With our approach to the SAT analogy questions, we are blurring the line between supervised and unsupervised learning, since the training set for a given SAT question consists of a single real positive example (and a single "virtual" or "simulated" negative example). In effect, a single example (mason:stone) becomes sui generis; it constitutes a class of its own. It may be possible to apply the machinery of supervised learning to other problems that apparently call for unsupervised learning (for example, clustering or measuring similarity), by using this sui generis device.

5 Related Work

One of the first papers using supervised machine learning to classify word pairs was Rosario and Hearst's (2001) paper on classifying noun-modifier pairs in the medical domain. For example, the noun-modifier expression brain biopsy was classified as Procedure. Rosario and Hearst (2001) constructed feature vectors for each noun-modifier pair using MeSH (Medical Subject Headings) and UMLS (Unified Medical Language System) as lexical resources. They then trained a neural network to distinguish 13 classes of semantic relations, such as Cause, Location, Measure, and Instrument. Nastase and Szpakowicz (2003) explored a similar approach to classifying general-domain noun-modifier pairs, using WordNet and Roget's Thesaurus as lexical resources.

Turney and Littman (2005) used corpus-based features for classifying noun-modifier pairs. Their features were based on 128 hand-coded patterns. They used a nearest-neighbour learning algorithm to classify general-domain noun-modifier pairs into 30 different classes of semantic relations. Turney (2006) later addressed the same problem using 8000 automatically generated patterns.

One of the tasks in SemEval 2007 was the classification of semantic relations between nominals (Girju et al., 2007).
The problem is to classify semantic relations between nouns and noun compounds in the context of a sentence. The task attracted 14 teams who created 15 systems, all of which used supervised machine learning with features that were lexicon-based, corpus-based, or both.

  Experiment                     Vectors          Features             Classes
  SAT Analogies                  2,244 (374 × 6)  44,880 (2,244 × 20)  374
  TOEFL Synonyms                 320 (80 × 4)     6,400 (320 × 20)     2
  Synonyms and Antonyms          136              2,720 (136 × 20)     2
  Similar, Associated, and Both  144              2,880 (144 × 20)     3

Table 8: Summary of the four tasks. See Section 3 for explanations.

  Experiment                     Accuracy  Best previous  Human    Baseline  Rank
  SAT Analogies                  52.1%     56.1%          57.0%    20.0%     2 higher out of 12
  TOEFL Synonyms                 76.2%     97.5%          64.5%    25.0%     8 higher out of 15
  Synonyms and Antonyms          75.0%     none           unknown  65.4%     none
  Similar, Associated, and Both  77.1%     none           unknown  33.3%     none

Table 9: Summary of experimental results. See Section 3 for explanations.

PairClass is most similar to the algorithm of Turney (2006), but it differs in the following ways:

• PairClass does not use a lexicon to find synonyms for the input word pairs. One of our goals in this paper is to show that a pure corpus-based algorithm can handle synonyms without a lexicon. This considerably simplifies the algorithm.

• PairClass uses a support vector machine (SVM) instead of a nearest neighbour (NN) learning algorithm.

• PairClass does not use the singular value decomposition (SVD) to smooth the feature vectors. It has been our experience that SVD is not necessary with SVMs.

• PairClass generates probability estimates, whereas Turney (2006) uses a cosine measure of similarity. Probability estimates can be readily used in further downstream processing, but cosines are less useful.

• The automatically generated patterns in PairClass are slightly more general than the patterns of Turney (2006).
• The morphological processing in PairClass (Minnen et al., 2001) is more sophisticated than in Turney (2006).

However, we believe that the main contribution of this paper is not PairClass itself, but the extension of supervised word pair classification beyond the classification of noun-modifier pairs and semantic relations between nominals, to analogies, synonyms, antonyms, and associations. As far as we know, this has not been done before.

6 Limitations and Future Work

The main limitation of PairClass is the need for a large corpus. Phrases that contain a pair of words tend to be more rare than phrases that contain either of the members of the pair, thus a large corpus is needed to ensure that sufficient numbers of phrases are found for each input word pair. The size of the corpus has a cost in terms of disk space and processing time. In the future, as hardware improves, this will become less of an issue, but there may be ways to improve the algorithm, so that a smaller corpus is sufficient.

Another area for future work is to apply PairClass to more tasks. WordNet includes more than a dozen semantic relations (e.g., synonyms, hyponyms, hypernyms, meronyms, holonyms, and antonyms). PairClass should be applicable to all of these relations. Other potential applications include any task that involves semantic relations, such as word sense disambiguation, information retrieval, information extraction, and metaphor interpretation.

7 Conclusion

In this paper, we have described a uniform approach to analogies, synonyms, antonyms, and associations, in which all of these phenomena are subsumed by analogies. We view the problem of recognizing analogies as the classification of semantic relations between words. We believe that most of our lexical knowledge is relational, not attributional.
That is, meaning is largely about relations among words, rather than properties of individual words, considered in isolation. For example, consider the knowledge encoded in WordNet: much of the knowledge in WordNet is embedded in the graph structure that connects words.

Analogies of the form A:B::C:D are called proportional analogies. These types of lower-level analogies may be contrasted with higher-level analogies, such as the analogy between the solar system and Rutherford's model of the atom (Falkenhainer et al., 1989), which are sometimes called conceptual analogies. We believe that the difference between these two types is largely a matter of complexity. A higher-level analogy is composed of many lower-level analogies. Progress with algorithms for processing lower-level analogies will eventually contribute to algorithms for higher-level analogies.

The idea of subsuming a broad range of semantic phenomena under analogies has been suggested by many researchers. Minsky (1986) wrote, "How do we ever understand anything? Almost always, I think, by using one or another kind of analogy." Hofstadter (2007) claimed, "all meaning comes from analogies." In NLP, analogical algorithms have been applied to machine translation (Lepage and Denoual, 2005), morphology (Lepage, 1998), and semantic relations (Turney and Littman, 2005). Analogy provides a framework that has the potential to unify the field of semantics. This paper is a small step towards that goal.

Acknowledgements

Thanks to Joel Martin and the anonymous reviewers of Coling 2008 for their helpful comments.

References

[Breiman1996] Breiman, Leo. 1996. Bagging predictors. Machine Learning, 24(2):123–140.

[Büttcher and Clarke2005] Büttcher, Stefan and Charles Clarke. 2005. Efficiency vs. effectiveness in terabyte-scale information retrieval.
In Proceedings of the 14th Text REtrieval Conference (TREC 2005), Gaithersburg, MD.

[Chiarello et al.1990] Chiarello, Christine, Curt Burgess, Lorie Richards, and Alma Pollock. 1990. Semantic and associative priming in the cerebral hemispheres: Some words do, some words don't ... sometimes, some places. Brain and Language, 38:75–104.

[Falkenhainer et al.1989] Falkenhainer, Brian, Kenneth D. Forbus, and Dedre Gentner. 1989. The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41(1):1–63.

[Fellbaum1998] Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

[Girju et al.2007] Girju, Roxana, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, and Deniz Yuret. 2007. SemEval-2007 task 04: Classification of semantic relations between nominals. In SemEval 2007, pages 13–18, Prague, Czech Republic.

[Hirst and St-Onge1998] Hirst, Graeme and David St-Onge. 1998. Lexical chains as representations of context for the detection and correction of malapropisms. In Fellbaum, Christiane, editor, WordNet: An Electronic Lexical Database, pages 305–332. MIT Press.

[Hofstadter2007] Hofstadter, Douglas. 2007. I Am a Strange Loop. Basic Books.

[Jiang and Conrath1997] Jiang, Jay J. and David W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In ROCLING X, pages 19–33, Taipei, Taiwan.

[Landauer and Dumais1997] Landauer, Thomas K. and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240.

[Lepage and Denoual2005] Lepage, Yves and Etienne Denoual. 2005. Purest ever example-based machine translation: Detailed presentation and assessment. Machine Translation, 19(3):251–282.

[Lepage1998] Lepage, Yves. 1998.
Solvin g analogies on words: An algo rithm. I n Pr oceed ings of the 36th Annua l Conference of th e Association for Computa - tional Linguistics , pages 728 –735. [Lesk196 9] Le sk, Michael E. 196 9. W ord -word asso- ciations in docum ent retriev al system s. American Documenta tion , 20 (1):27 –38. [Lin et al.200 3] Lin , Dek ang, Shaojun Zhao, Lijuan Qin, and Ming Zhou. 2 003. Identify ing synonyms among distributionally similar words. In IJCAI -03 , pages 1492 –1493 . [Lund et al.199 5] Lu nd, Ke vin, Curt Burgess, and Ruth Ann Atch ley . 199 5. Semantic an d ass o ciativ e priming in h igh-dim ensional semantic space. In Pr o- ceedings of the 17th Ann ual Confer ence o f the Cog- nitive Scien c e So ciety , pages 660–66 5. [Minnen et al.200 1] Min nen, Guid o, John Carroll, and Darren Pearce. 2001 . Applied mo rpho logical pro- cessing of En glish. Natural Language Engineering , 7(3):2 07–22 3. [Minsky198 6] Minsky , Marvin. 1986. The Society of Mind . Simon & Schuster , New Y ork, NY . [Nastase and Szpakowicz2003] Nastase, V ivi and Stan Szpakowicz. 2003. Ex ploring nou n-mod ifier se- mantic relations. In F ifth In ternationa l W o rkshop o n Computation al S emantics (IWCS-5) , pages 285–30 1, T ilburg, Th e Netherlands. [Platt1998] Platt, John C. 1998. Fast training of support vector machines using sequen tial minimal optimiza- tion. In Ad vances in K ernel Methods: Sup port V ec- tor Learning , page s 185– 208. MIT Press Camb ridge, MA, USA. [Reitman196 5] Reitman , W alter R. 196 5. Cognition and Th ought: An I nformation Pr ocessing App r oach . John W iley and Sons, Ne w Y or k, NY . [Resnik199 5] Resnik, Philip. 1995. Using information content to ev alua te semantic similar ity in a taxon- omy . In IJCAI-9 5 , pages 448–453, San Mateo , CA. Morgan Kau fmann. [Rosario and Hearst200 1] Rosar io, Barbara and Marti Hearst. 2001 . Classifying the semantic relations in noun- compou nds via a domain-specific lexical h ier- archy . 
In EMNLP-0 1 , pages 82–90. [T urney and Littman2005 ] T urney , Peter D. and Michael L. Littman. 2005. Corp us-based learn ing of analo gies and sema ntic relations. Machine Learning , 60(1– 3):251 –278. [T urney et al.2003] T urney , Peter D., M ichael L. Littman, Jeffre y Bigha m, an d V ictor Shn ayder . 2003. Com bining in depend ent modules to solve multiple-ch oice synonym an d analo gy pr oblems. In RANLP-03 , pages 482–48 9, B o rovets, Bulgaria. [T urney2 001] T urn ey , Peter D. 2 001. Mining the W eb for synonym s: PMI- IR versus LSA on TOEFL. In Pr oceeding s of the T welfth Eur o p ean Con fer ence on Machine Learning , pag es 49 1–502 , Berlin. Sp ringer . [T urney2 006] T urn ey , Peter D. 2006. Similarity of semantic relation s. Computation al Linguistics , 32(3) :379–4 16. [V eale20 04] V eale, T ony . 200 4. W o rdNet sits the SA T: A k nowledge-based ap proach to lexical an alogy . In Pr oceeding s of the 16th Eur opea n Conference on Artificial Intelligence (ECAI 2004) , pages 60 6–61 2, V alencia, Spain. [W itten an d Frank1999] W itten, Ian H . and Eibe Frank. 1999. Data Mining : P ractical Machine Learning T o o ls and T e chn iq ues with Java Implementa tio ns . Morgan Kaufmann, San Francisco.