A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations
Author: Peter D. Turney (National Research Council of Canada, Institute for Information Technology)
National Research Council of Canada, Institute for Information Technology
M50 Montreal Road, Ottawa, Ontario, Canada K1A 0R6
peter.turney@nrc-cnrc.gc.ca

Abstract

Recognizing analogies, synonyms, antonyms, and associations appear to be four distinct tasks, requiring distinct NLP algorithms. In the past, the four tasks have been treated independently, using a wide variety of algorithms. These four semantic classes, however, are a tiny sample of the full range of semantic phenomena, and we cannot afford to create ad hoc algorithms for each semantic phenomenon; we need to seek a unified approach. We propose to subsume a broad range of phenomena under analogies. To limit the scope of this paper, we restrict our attention to the subsumption of synonyms, antonyms, and associations. We introduce a supervised corpus-based machine learning algorithm for classifying analogous word pairs, and we show that it can solve multiple-choice SAT analogy questions, TOEFL synonym questions, ESL synonym-antonym questions, and similar-associated-both questions from cognitive psychology.

1 Introduction

A pair of words (petrify:stone) is analogous to another pair (vaporize:gas) when the semantic relations between the words in the first pair are highly similar to the relations in the second pair. Two words (levied and imposed) are synonymous in a context (levied a tax) when they can be interchanged (imposed a tax); they are antonymous when they have opposite meanings (black and white); and they are associated when they tend to co-occur (doctor and hospital).

On the surface, it appears that these are four distinct semantic classes, requiring distinct NLP algorithms, but we propose a uniform approach to all four.
We subsume synonyms, antonyms, and associations under analogies. In essence, we say that X and Y are antonyms when the pair X:Y is analogous to the pair black:white, X and Y are synonyms when they are analogous to the pair levied:imposed, and X and Y are associated when they are analogous to the pair doctor:hospital.

There is past work on recognizing analogies (Reitman, 1965), synonyms (Landauer and Dumais, 1997), antonyms (Lin et al., 2003), and associations (Lesk, 1969), but each of these four tasks has been examined separately, in isolation from the others. As far as we know, the algorithm proposed here is the first attempt to deal with all four tasks using a uniform approach. We believe that it is important to seek NLP algorithms that can handle a broad range of semantic phenomena, because developing a specialized algorithm for each phenomenon is a very inefficient research strategy.

It might seem that a lexicon, such as WordNet (Fellbaum, 1998), contains all the information we need to handle these four tasks. However, we prefer to take a corpus-based approach to semantics. Veale (2004) used WordNet to answer 374 multiple-choice SAT analogy questions, achieving an accuracy of 43%, but the best corpus-based approach attains an accuracy of 56% (Turney, 2006). Another reason to prefer a corpus-based approach to a lexicon-based approach is that the former requires less human labour, and thus it is easier to extend to other languages.

In Section 2, we describe our algorithm for recognizing analogies. We use a standard supervised machine learning approach, with feature vectors based on the frequencies of patterns in a large corpus. We use a support vector machine (SVM) to learn how to classify the feature vectors (Platt, 1998; Witten and Frank, 1999).

Section 3 presents four sets of experiments.
We apply our algorithm for recognizing analogies to multiple-choice analogy questions from the SAT college entrance test, multiple-choice synonym questions from the TOEFL (test of English as a foreign language), ESL (English as a second language) practice questions for distinguishing synonyms and antonyms, and a set of word pairs that are labeled similar, associated, and both, developed for experiments in cognitive psychology.

We discuss the results of the experiments in Section 4. The accuracy of the algorithm is competitive with other systems, but the strength of the algorithm is that it is able to handle all four tasks, with no tuning of the learning parameters to the particular task. It performs well, although it is competing against specialized algorithms, developed for single tasks.

Related work is examined in Section 5 and limitations and future work are considered in Section 6. We conclude in Section 7.

2 Classifying Analogous Word Pairs

An analogy, A:B::C:D, asserts that A is to B as C is to D; for example, traffic:street::water:riverbed asserts that traffic is to street as water is to riverbed; that is, the semantic relations between traffic and street are highly similar to the semantic relations between water and riverbed. We may view the task of recognizing word analogies as a problem of classifying word pairs (see Table 1).

  Word pair          Class label
  carpenter:wood     artisan:material
  mason:stone        artisan:material
  potter:clay        artisan:material
  glassblower:glass  artisan:material
  traffic:street     entity:carrier
  water:riverbed     entity:carrier
  packets:network    entity:carrier
  gossip:grapevine   entity:carrier

Table 1: Examples of how the task of recognizing word analogies may be viewed as a problem of classifying word pairs.

We approach this as a standard classification problem for supervised machine learning.
The algorithm takes as input a training set of word pairs with class labels and a testing set of word pairs without labels. Each word pair is represented as a vector in a feature space and a supervised learning algorithm is used to classify the feature vectors. The elements in the feature vectors are based on the frequencies of automatically defined patterns in a large corpus. The output of the algorithm is an assignment of labels to the word pairs in the testing set. For some of the experiments, we select a unique label for each word pair; for other experiments, we assign probabilities to each possible label for each word pair.

For a given word pair, such as mason:stone, the first step is to generate morphological variations, such as masons:stones. In the following experiments, we use morpha (morphological analyzer) and morphg (morphological generator) for morphological processing (Minnen et al., 2001).[1]

The second step is to search in a large corpus for all phrases of the following form:

  "[0 to 1 words] X [0 to 3 words] Y [0 to 1 words]"

In this template, X:Y consists of morphological variations of the given word pair, in either order; for example, mason:stone, stone:mason, masons:stones, and so on. A typical phrase for mason:stone would be "the mason cut the stone with". We then normalize all of the phrases that are found, by using morpha to remove suffixes.

The template we use here is similar to Turney (2006), but we have added extra context words before the X and after the Y. Our morphological processing also differs from Turney (2006). In the following experiments, we search in a corpus of 5 × 10^10 words (about 280 GB of plain text), consisting of web pages gathered by a web crawler.[2]
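As an illustrative sketch only (not the actual Wumpus query used in the paper), the phrase template above can be written as a regular expression; the helper name `template_regex` is our own, and in the real algorithm X and Y would also range over morphological variants of the pair, in either order.

```python
import re

def template_regex(x, y):
    """Regex for the template '[0 to 1 words] X [0 to 3 words] Y
    [0 to 1 words]', with words taken to be whitespace-separated tokens."""
    w = r"\S+"
    return re.compile(
        rf"(?:{w} )?{re.escape(x)}(?: {w}){{0,3}} {re.escape(y)}(?: {w})?"
    )

# The paper's example phrase for mason:stone fits the template.
assert template_regex("mason", "stone").search("the mason cut the stone with")
# More than 3 words between X and Y does not fit.
assert not template_regex("mason", "stone").search("mason one two three four stone")
```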
To retrieve phrases from the corpus, we use Wumpus (Büttcher and Clarke, 2005), an efficient search engine for passage retrieval from large corpora.[3]

The next step is to generate patterns from all of the phrases that were found for all of the input word pairs (from both the training and testing sets). To generate patterns from a phrase, we replace the given word pairs with variables, X and Y, and we replace the remaining words with a wild card symbol (an asterisk) or leave them as they are. For example, the phrase "the mason cut the stone with" yields the patterns "the X cut * Y with", "* X * the Y *", and so on. If a phrase contains n words, then it yields 2^(n−2) patterns.

Each pattern corresponds to a feature in the feature vectors that we will generate. Since a typical input set of word pairs yields millions of patterns, we need to use feature selection, to reduce the number of patterns to a manageable quantity. For each pattern, we count the number of input word pairs that generated the pattern. For example, "* X cut * Y *" is generated by both mason:stone and carpenter:wood. We then sort the patterns in descending order of the number of word pairs that generated them. If there are N input word pairs (and thus N feature vectors, including both the training and testing sets), then we select the top kN patterns and drop the remainder. In the following experiments, k is set to 20. The algorithm is not sensitive to the precise value of k.

The reasoning behind the feature selection algorithm is that shared patterns make more useful features than rare patterns.

[1] http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
[2] The corpus was collected by Charles Clarke, University of Waterloo. We can provide copies on request.
[3] http://www.wumpus-search.org/
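To make the pattern-generation step concrete, here is a minimal sketch (our own illustration; the function name is not from the paper): each context word is either kept or replaced by "*", which yields the 2^(n−2) patterns for an n-word phrase.

```python
from itertools import product

def patterns_from_phrase(phrase, x, y):
    """Replace the pair words with variables X and Y; every other word
    is either left as-is or replaced by the wildcard '*'."""
    words = phrase.split()
    context = [i for i, w in enumerate(words) if w not in (x, y)]
    patterns = []
    for mask in product((False, True), repeat=len(context)):
        wild = {i for i, flag in zip(context, mask) if flag}
        out = []
        for i, w in enumerate(words):
            if w == x:
                out.append("X")
            elif w == y:
                out.append("Y")
            else:
                out.append("*" if i in wild else w)
        patterns.append(" ".join(out))
    return patterns

pats = patterns_from_phrase("the mason cut the stone with", "mason", "stone")
# 6 words, 2 of them the pair -> 2**(6-2) = 16 patterns
assert len(pats) == 16
assert "the X cut * Y with" in pats
assert "* X * * Y *" in pats
```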
The number of features (kN) depends on the number of word pairs (N), because, if we have more feature vectors, then we need more features to distinguish them. Turney (2006) also selects patterns based on the number of pairs that generate them, but the number of selected patterns is a constant (8000), independent of the number of input word pairs.

The next step is to generate feature vectors, one vector for each input word pair. Each of the N feature vectors has kN elements, one element for each selected pattern. The value of an element in a vector is given by the logarithm of the frequency in the corpus of the corresponding pattern for the given word pair. For example, suppose the given pair is mason:stone and the pattern is "* X cut * Y *". We look at the normalized phrases that we collected for mason:stone and we count how many match this pattern. If f phrases match the pattern, then the value of this element in the feature vector is log(f + 1) (we add 1 because log(0) is undefined). Each feature vector is then normalized to unit length. The normalization ensures that features in vectors for high-frequency word pairs (traffic:street) are comparable to features in vectors for low-frequency word pairs (water:riverbed).

Now that we have a feature vector for each input word pair, we can apply a standard supervised learning algorithm. In the following experiments, we use a sequential minimal optimization (SMO) support vector machine (SVM) with a radial basis function (RBF) kernel (Platt, 1998), as implemented in Weka (Waikato Environment for Knowledge Analysis) (Witten and Frank, 1999).[4] The algorithm generates probability estimates for each class by fitting logistic regression models to the outputs of the SVM. We disable the normalization option in Weka, since the vectors are already normalized to unit length.
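The vector construction just described can be sketched as follows; this is a minimal illustration under our own naming, where `selected_patterns` stands for the top kN patterns surviving feature selection.

```python
import math

def feature_vector(match_counts, selected_patterns):
    """match_counts maps a pattern to f, the number of collected phrases
    for this word pair that match it. Each element is log(f + 1); the
    vector is then normalized to unit length."""
    v = [math.log(match_counts.get(p, 0) + 1.0) for p in selected_patterns]
    norm = math.sqrt(sum(e * e for e in v))
    return [e / norm for e in v] if norm > 0.0 else v

vec = feature_vector({"* X cut * Y *": 3, "the X * Y *": 1},
                     ["* X cut * Y *", "the X * Y *", "* X and Y *"])
assert abs(sum(e * e for e in vec) - 1.0) < 1e-9  # unit length
assert vec[2] == 0.0  # unmatched pattern: log(0 + 1) = 0
```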
We chose the SMO RBF algorithm because it is fast, robust, and it easily handles large numbers of features.

For convenience, we will refer to the above algorithm as PairClass. In the following experiments, PairClass is applied to each of the four problems with no adjustments or tuning to the specific problems. Some work is required to fit each problem into the general framework of PairClass (supervised classification of word pairs) but the core algorithm is the same in each case.

3 Experiments

This section presents four sets of experiments, with analogies, synonyms, antonyms, and associations. We explain how each task is treated as a problem of classifying analogous word pairs, we give the experimental results, and we discuss past work on each of the four tasks.

3.1 SAT Analogies

In this section, we apply PairClass to the task of recognizing analogies. To evaluate the performance, we use a set of 374 multiple-choice questions from the SAT college entrance exam. Table 2 shows a typical question. The target pair is called the stem. The task is to select the choice pair that is most analogous to the stem pair.

  Stem:     mason:stone
  Choices:  (a) teacher:chalk
            (b) carpenter:wood
            (c) soldier:gun
            (d) photograph:camera
            (e) book:word
  Solution: (b) carpenter:wood

Table 2: An example of a question from the 374 SAT analogy questions.

The problem of recognizing word analogies was first attempted with a system called Argus (Reitman, 1965), using a small hand-built semantic network with a spreading activation algorithm. Turney et al. (2003) used a combination of 13 independent modules. Veale (2004) used a spreading activation algorithm with WordNet (in effect, treating WordNet as a semantic network). Turney (2006) used a corpus-based algorithm.

[4] http://www.cs.waikato.ac.nz/ml/weka/
We may view Table 2 as a binary classification problem, in which mason:stone and carpenter:wood are positive examples and the remaining word pairs are negative examples. The difficulty is that the labels of the choice pairs must be hidden from the learning algorithm. That is, the training set consists of one positive example (the stem pair) and the testing set consists of five unlabeled examples (the five choice pairs). To make this task more tractable, we randomly choose a stem pair from one of the 373 other SAT analogy questions, and we assume that this new stem pair is a negative example, as shown in Table 3.

  Word pair          Train or test  Class label
  mason:stone        train          positive
  tutor:pupil        train          negative
  teacher:chalk      test           hidden
  carpenter:wood     test           hidden
  soldier:gun        test           hidden
  photograph:camera  test           hidden
  book:word          test           hidden

Table 3: How to fit a SAT analogy question into the framework of supervised pair classification.

To answer the SAT question, we use PairClass to estimate the probability that each testing example is positive, and we guess the testing example with the highest probability. Learning from a training set with only one positive example and one negative example is difficult, since the learned model can be highly unstable. To increase the stability, we repeat the learning process 10 times, using a different randomly chosen negative training example each time. For each testing word pair, the 10 probability estimates are averaged together. This is a form of bagging (Breiman, 1996).

PairClass attains an accuracy of 52.1%. For comparison, the ACL Wiki lists 12 previously published results with the 374 SAT analogy questions.[5] Only 2 of the 12 algorithms have higher accuracy. The best previous result is an accuracy of 56.1% (Turney, 2006). Random guessing would yield an accuracy of 20%. The average senior high school student achieves 57% correct (Turney, 2006).

[5] For more information, see SAT Analogy Questions (State of the art) at http://aclweb.org/aclwiki/.

3.2 TOEFL Synonyms

Now we apply PairClass to the task of recognizing synonyms, using a set of 80 multiple-choice synonym questions from the TOEFL (test of English as a foreign language). A sample question is shown in Table 4. The task is to select the choice word that is most similar in meaning to the stem word.

  Stem:     levied
  Choices:  (a) imposed
            (b) believed
            (c) requested
            (d) correlated
  Solution: (a) imposed

Table 4: An example of a question from the 80 TOEFL questions.

Synonymy can be viewed as a high degree of semantic similarity. The most common way to measure semantic similarity is to measure the distance between words in WordNet (Resnik, 1995; Jiang and Conrath, 1997; Hirst and St-Onge, 1998). Corpus-based measures of word similarity are also common (Lesk, 1969; Landauer and Dumais, 1997; Turney, 2001).

We may view Table 4 as a binary classification problem, in which the pair levied:imposed is a positive example of the class synonymous and the other possible pairings are negative examples, as shown in Table 5.

  Word pair          Class label
  levied:imposed     positive
  levied:believed    negative
  levied:requested   negative
  levied:correlated  negative

Table 5: How to fit a TOEFL question into the framework of supervised pair classification.

The 80 TOEFL questions yield 320 (80 × 4) word pairs, 80 labeled positive and 240 labeled negative. We apply PairClass to the word pairs using ten-fold cross-validation. In each random fold, 90% of the pairs are used for training and 10% are used for testing. For each fold, the model that is learned from the training set is used to assign probabilities to the pairs in the testing set. With ten separate folds, the ten non-overlapping testing sets cover the whole dataset.
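The bagging procedure for a single SAT question can be sketched as below; `train_and_score` is a hypothetical stand-in (not part of the paper) for training the SVM on one positive and one negative example and returning a probability-scoring function over feature vectors.

```python
import random

def answer_sat(stem_vec, choice_vecs, other_stem_vecs, train_and_score,
               runs=10, seed=0):
    """Train `runs` models, each on the stem (positive) plus one randomly
    chosen stem from another question (negative); average the probability
    estimates and return the index of the best-scoring choice."""
    rng = random.Random(seed)
    totals = [0.0] * len(choice_vecs)
    for _ in range(runs):
        negative = rng.choice(other_stem_vecs)
        score = train_and_score([stem_vec], [negative])
        for i, choice in enumerate(choice_vecs):
            totals[i] += score(choice)
    averages = [t / runs for t in totals]
    return averages.index(max(averages))
```

For instance, plugging in a toy scorer that rates a choice by its dot product with the positive example picks the choice closest to the stem; the real algorithm uses the SVM's logistic probability estimates instead.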
Our guess for each TOEFL question is the choice with the highest probability of being positive, when paired with the corresponding stem.

PairClass attains an accuracy of 76.2%. For comparison, the ACL Wiki lists 15 previously published results with the 80 TOEFL synonym questions.[6] Of the 15 algorithms, 8 have higher accuracy and 7 have lower. The best previous result is an accuracy of 97.5% (Turney et al., 2003), obtained using a hybrid of four different algorithms. Random guessing would yield an accuracy of 25%. The average foreign applicant to a US university achieves 64.5% correct (Landauer and Dumais, 1997).

3.3 Synonyms and Antonyms

The task of classifying word pairs as either synonyms or antonyms readily fits into the framework of supervised classification of word pairs. Table 6 shows some examples from a set of 136 ESL (English as a second language) practice questions that we collected from various ESL websites.

  Word pair                  Class label
  galling:irksome            synonyms
  yield:bend                 synonyms
  naive:callow               synonyms
  advise:suggest             synonyms
  dissimilarity:resemblance  antonyms
  commend:denounce           antonyms
  expose:camouflage          antonyms
  unveil:veil                antonyms

Table 6: Examples of synonyms and antonyms from 136 ESL practice questions.

Lin et al. (2003) distinguish synonyms from antonyms using two patterns, "from X to Y" and "either X or Y". When X and Y are antonyms, they occasionally appear in a large corpus in one of these two patterns, but it is very rare for synonyms to appear in these patterns. Our approach is similar to Lin et al. (2003), but we do not rely on hand-coded patterns; instead, PairClass patterns are generated automatically.

Using ten-fold cross-validation, PairClass attains an accuracy of 75.0%.
Always guessing the majority class would result in an accuracy of 65.4%. The average human score is unknown and there are no previous results for comparison.

[6] For more information, see TOEFL Synonym Questions (State of the art) at http://aclweb.org/aclwiki/.

3.4 Similar, Associated, and Both

A common criticism of corpus-based measures of word similarity (as opposed to lexicon-based measures) is that they are merely detecting associations (co-occurrences), rather than actual semantic similarity (Lund et al., 1995). To address this criticism, Lund et al. (1995) evaluated their algorithm for measuring word similarity with word pairs that were labeled similar, associated, or both. These labeled pairs were originally created for cognitive psychology experiments with human subjects (Chiarello et al., 1990). Table 7 shows some examples from this collection of 144 word pairs (48 pairs in each of the three classes).

  Word pair     Class label
  table:bed     similar
  music:art     similar
  hair:fur      similar
  house:cabin   similar
  cradle:baby   associated
  mug:beer      associated
  camel:hump    associated
  cheese:mouse  associated
  ale:beer      both
  uncle:aunt    both
  pepper:salt   both
  frown:smile   both

Table 7: Examples of word pairs labeled similar, associated, or both.

Lund et al. (1995) did not measure the accuracy of their algorithm on this three-class classification problem. Instead, following standard practice in cognitive psychology, they showed that their algorithm's similarity scores for the 144 word pairs were correlated with the response times of human subjects in priming tests. In a typical priming test, a human subject reads a priming word (cradle) and is then asked to complete a partial word (complete bab as baby). The time required to perform the task is taken to indicate the strength of the cognitive link between the two words (cradle and baby).
Using ten-fold cross-validation, PairClass attains an accuracy of 77.1% on the 144 word pairs. Since the three classes are of equal size, guessing the majority class and random guessing both yield an accuracy of 33.3%. The average human score is unknown and there are no previous results for comparison.

4 Discussion

The four experiments are summarized in Tables 8 and 9. For the first two experiments, where there are previous results, PairClass is not the best, but it performs competitively. For the second two experiments, PairClass performs significantly above the baselines. However, the strength of this approach is not its performance on any one task, but the range of tasks it can handle. As far as we know, this is the first time a standard supervised learning algorithm has been applied to any of these four problems.

The advantage of being able to cast these problems in the framework of standard supervised learning problems is that we can now exploit the huge literature on supervised learning. Past work on these problems has required implicitly coding our knowledge of the nature of the task into the structure of the algorithm. For example, the structure of the algorithm for latent semantic analysis (LSA) implicitly contains a theory of synonymy (Landauer and Dumais, 1997). The problem with this approach is that it can be very difficult to work out how to modify the algorithm if it does not behave the way we want. On the other hand, with a supervised learning algorithm, we can put our knowledge into the labeling of the feature vectors, instead of putting it directly into the algorithm. This makes it easier to guide the system to the desired behaviour.
With our approach to the SAT analogy questions, we are blurring the line between supervised and unsupervised learning, since the training set for a given SAT question consists of a single real positive example (and a single "virtual" or "simulated" negative example). In effect, a single example (mason:stone) becomes sui generis; it constitutes a class of its own. It may be possible to apply the machinery of supervised learning to other problems that apparently call for unsupervised learning (for example, clustering or measuring similarity), by using this sui generis device.

5 Related Work

One of the first papers using supervised machine learning to classify word pairs was Rosario and Hearst's (2001) paper on classifying noun-modifier pairs in the medical domain. For example, the noun-modifier expression brain biopsy was classified as Procedure. Rosario and Hearst (2001) constructed feature vectors for each noun-modifier pair using MeSH (Medical Subject Headings) and UMLS (Unified Medical Language System) as lexical resources. They then trained a neural network to distinguish 13 classes of semantic relations, such as Cause, Location, Measure, and Instrument. Nastase and Szpakowicz (2003) explored a similar approach to classifying general-domain noun-modifier pairs, using WordNet and Roget's Thesaurus as lexical resources.

Turney and Littman (2005) used corpus-based features for classifying noun-modifier pairs. Their features were based on 128 hand-coded patterns. They used a nearest-neighbour learning algorithm to classify general-domain noun-modifier pairs into 30 different classes of semantic relations. Turney (2006) later addressed the same problem using 8000 automatically generated patterns.

One of the tasks in SemEval 2007 was the classification of semantic relations between nominals (Girju et al., 2007).
The problem is to classify semantic relations between nouns and noun compounds in the context of a sentence. The task attracted 14 teams who created 15 systems, all of which used supervised machine learning with features that were lexicon-based, corpus-based, or both.

  Experiment                     Vectors          Features             Classes
  SAT Analogies                  2,244 (374 × 6)  44,880 (2,244 × 20)  374
  TOEFL Synonyms                 320 (80 × 4)     6,400 (320 × 20)     2
  Synonyms and Antonyms          136              2,720 (136 × 20)     2
  Similar, Associated, and Both  144              2,880 (144 × 20)     3

Table 8: Summary of the four tasks. See Section 3 for explanations.

  Experiment                     Accuracy  Best previous  Human    Baseline  Rank
  SAT Analogies                  52.1%     56.1%          57.0%    20.0%     2 higher out of 12
  TOEFL Synonyms                 76.2%     97.5%          64.5%    25.0%     8 higher out of 15
  Synonyms and Antonyms          75.0%     none           unknown  65.4%     none
  Similar, Associated, and Both  77.1%     none           unknown  33.3%     none

Table 9: Summary of experimental results. See Section 3 for explanations.

PairClass is most similar to the algorithm of Turney (2006), but it differs in the following ways:

• PairClass does not use a lexicon to find synonyms for the input word pairs. One of our goals in this paper is to show that a pure corpus-based algorithm can handle synonyms without a lexicon. This considerably simplifies the algorithm.

• PairClass uses a support vector machine (SVM) instead of a nearest neighbour (NN) learning algorithm.

• PairClass does not use the singular value decomposition (SVD) to smooth the feature vectors. It has been our experience that SVD is not necessary with SVMs.

• PairClass generates probability estimates, whereas Turney (2006) uses a cosine measure of similarity. Probability estimates can be readily used in further downstream processing, but cosines are less useful.

• The automatically generated patterns in PairClass are slightly more general than the patterns of Turney (2006).
• The morphological processing in PairClass (Minnen et al., 2001) is more sophisticated than in Turney (2006).

However, we believe that the main contribution of this paper is not PairClass itself, but the extension of supervised word pair classification beyond the classification of noun-modifier pairs and semantic relations between nominals, to analogies, synonyms, antonyms, and associations. As far as we know, this has not been done before.

6 Limitations and Future Work

The main limitation of PairClass is the need for a large corpus. Phrases that contain a pair of words tend to be more rare than phrases that contain either of the members of the pair, thus a large corpus is needed to ensure that sufficient numbers of phrases are found for each input word pair. The size of the corpus has a cost in terms of disk space and processing time. In the future, as hardware improves, this will become less of an issue, but there may be ways to improve the algorithm, so that a smaller corpus is sufficient.

Another area for future work is to apply PairClass to more tasks. WordNet includes more than a dozen semantic relations (e.g., synonyms, hyponyms, hypernyms, meronyms, holonyms, and antonyms). PairClass should be applicable to all of these relations. Other potential applications include any task that involves semantic relations, such as word sense disambiguation, information retrieval, information extraction, and metaphor interpretation.

7 Conclusion

In this paper, we have described a uniform approach to analogies, synonyms, antonyms, and associations, in which all of these phenomena are subsumed by analogies. We view the problem of recognizing analogies as the classification of semantic relations between words. We believe that most of our lexical knowledge is relational, not attributional.
That is, meaning is largely about relations among words, rather than properties of individual words, considered in isolation. For example, consider the knowledge encoded in WordNet: much of the knowledge in WordNet is embedded in the graph structure that connects words.

Analogies of the form A:B::C:D are called proportional analogies. These types of lower-level analogies may be contrasted with higher-level analogies, such as the analogy between the solar system and Rutherford's model of the atom (Falkenhainer et al., 1989), which are sometimes called conceptual analogies. We believe that the difference between these two types is largely a matter of complexity. A higher-level analogy is composed of many lower-level analogies. Progress with algorithms for processing lower-level analogies will eventually contribute to algorithms for higher-level analogies.

The idea of subsuming a broad range of semantic phenomena under analogies has been suggested by many researchers. Minsky (1986) wrote, "How do we ever understand anything? Almost always, I think, by using one or another kind of analogy." Hofstadter (2007) claimed, "all meaning comes from analogies." In NLP, analogical algorithms have been applied to machine translation (Lepage and Denoual, 2005), morphology (Lepage, 1998), and semantic relations (Turney and Littman, 2005). Analogy provides a framework that has the potential to unify the field of semantics. This paper is a small step towards that goal.

Acknowledgements

Thanks to Joel Martin and the anonymous reviewers of Coling 2008 for their helpful comments.

References

[Breiman1996] Breiman, Leo. 1996. Bagging predictors. Machine Learning, 24(2):123–140.

[Büttcher and Clarke2005] Büttcher, Stefan and Charles Clarke. 2005. Efficiency vs. effectiveness in terabyte-scale information retrieval.
In Proceedings of the 14th Text REtrieval Conference (TREC 2005), Gaithersburg, MD.

[Chiarello et al.1990] Chiarello, Christine, Curt Burgess, Lorie Richards, and Alma Pollock. 1990. Semantic and associative priming in the cerebral hemispheres: Some words do, some words don't ... sometimes, some places. Brain and Language, 38:75–104.

[Falkenhainer et al.1989] Falkenhainer, Brian, Kenneth D. Forbus, and Dedre Gentner. 1989. The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41(1):1–63.

[Fellbaum1998] Fellbaum, Christiane, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

[Girju et al.2007] Girju, Roxana, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, and Deniz Yuret. 2007. SemEval-2007 task 04: Classification of semantic relations between nominals. In SemEval 2007, pages 13–18, Prague, Czech Republic.

[Hirst and St-Onge1998] Hirst, Graeme and David St-Onge. 1998. Lexical chains as representations of context for the detection and correction of malapropisms. In Fellbaum, Christiane, editor, WordNet: An Electronic Lexical Database, pages 305–332. MIT Press.

[Hofstadter2007] Hofstadter, Douglas. 2007. I Am a Strange Loop. Basic Books.

[Jiang and Conrath1997] Jiang, Jay J. and David W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In ROCLING X, pages 19–33, Taipei, Taiwan.

[Landauer and Dumais1997] Landauer, Thomas K. and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240.

[Lepage and Denoual2005] Lepage, Yves and Etienne Denoual. 2005. Purest ever example-based machine translation: Detailed presentation and assessment. Machine Translation, 19(3):251–282.

[Lepage1998] Lepage, Yves. 1998.
Solvin g analogies on words: An algo rithm. I n Pr oceed ings of the 36th Annua l Conference of th e Association for Computa - tional Linguistics , pages 728 –735. [Lesk196 9] Le sk, Michael E. 196 9. W ord -word asso- ciations in docum ent retriev al system s. American Documenta tion , 20 (1):27 –38. [Lin et al.200 3] Lin , Dek ang, Shaojun Zhao, Lijuan Qin, and Ming Zhou. 2 003. Identify ing synonyms among distributionally similar words. In IJCAI -03 , pages 1492 –1493 . [Lund et al.199 5] Lu nd, Ke vin, Curt Burgess, and Ruth Ann Atch ley . 199 5. Semantic an d ass o ciativ e priming in h igh-dim ensional semantic space. In Pr o- ceedings of the 17th Ann ual Confer ence o f the Cog- nitive Scien c e So ciety , pages 660–66 5. [Minnen et al.200 1] Min nen, Guid o, John Carroll, and Darren Pearce. 2001 . Applied mo rpho logical pro- cessing of En glish. Natural Language Engineering , 7(3):2 07–22 3. [Minsky198 6] Minsky , Marvin. 1986. The Society of Mind . Simon & Schuster , New Y ork, NY . [Nastase and Szpakowicz2003] Nastase, V ivi and Stan Szpakowicz. 2003. Ex ploring nou n-mod ifier se- mantic relations. In F ifth In ternationa l W o rkshop o n Computation al S emantics (IWCS-5) , pages 285–30 1, T ilburg, Th e Netherlands. [Platt1998] Platt, John C. 1998. Fast training of support vector machines using sequen tial minimal optimiza- tion. In Ad vances in K ernel Methods: Sup port V ec- tor Learning , page s 185– 208. MIT Press Camb ridge, MA, USA. [Reitman196 5] Reitman , W alter R. 196 5. Cognition and Th ought: An I nformation Pr ocessing App r oach . John W iley and Sons, Ne w Y or k, NY . [Resnik199 5] Resnik, Philip. 1995. Using information content to ev alua te semantic similar ity in a taxon- omy . In IJCAI-9 5 , pages 448–453, San Mateo , CA. Morgan Kau fmann. [Rosario and Hearst200 1] Rosar io, Barbara and Marti Hearst. 2001 . Classifying the semantic relations in noun- compou nds via a domain-specific lexical h ier- archy . 
In EMNLP-0 1 , pages 82–90. [T urney and Littman2005 ] T urney , Peter D. and Michael L. Littman. 2005. Corp us-based learn ing of analo gies and sema ntic relations. Machine Learning , 60(1– 3):251 –278. [T urney et al.2003] T urney , Peter D., M ichael L. Littman, Jeffre y Bigha m, an d V ictor Shn ayder . 2003. Com bining in depend ent modules to solve multiple-ch oice synonym an d analo gy pr oblems. In RANLP-03 , pages 482–48 9, B o rovets, Bulgaria. [T urney2 001] T urn ey , Peter D. 2 001. Mining the W eb for synonym s: PMI- IR versus LSA on TOEFL. In Pr oceeding s of the T welfth Eur o p ean Con fer ence on Machine Learning , pag es 49 1–502 , Berlin. Sp ringer . [T urney2 006] T urn ey , Peter D. 2006. Similarity of semantic relation s. Computation al Linguistics , 32(3) :379–4 16. [V eale20 04] V eale, T ony . 200 4. W o rdNet sits the SA T: A k nowledge-based ap proach to lexical an alogy . In Pr oceeding s of the 16th Eur opea n Conference on Artificial Intelligence (ECAI 2004) , pages 60 6–61 2, V alencia, Spain. [W itten an d Frank1999] W itten, Ian H . and Eibe Frank. 1999. Data Mining : P ractical Machine Learning T o o ls and T e chn iq ues with Java Implementa tio ns . Morgan Kaufmann, San Francisco.