Determining the Unithood of Word Sequences using Mutual Information and Independence Measure

Wilson Wong, Wei Liu and Mohammed Bennamoun
School of Computer Science and Software Engineering
University of Western Australia
Crawley WA 6009
{wilson,wei,bennamou}@csse.uwa.edu.au

Abstract

Most works related to unithood were conducted as part of a larger effort for the determination of termhood. Consequently, the number of independent studies that examine the notion of unithood and produce dedicated techniques for measuring it is extremely small. We propose a new approach, independent of any influence of termhood, that provides dedicated measures to gather linguistic evidence from parsed text and statistical evidence from the Google search engine for the measurement of unithood. Our evaluations revealed a precision and recall of 98.68% and 91.82% respectively, with an accuracy of 95.42%, in measuring the unithood of 1005 test cases.

1 Introduction

Terms and the tasks related to their treatment are an integral part of many applications that deal with natural language text, such as large-scale search engines, automatic thesaurus construction, machine translation and ontology learning, for purposes ranging from indexing to cluster analysis. With the increasing reliance on huge text sources such as the World Wide Web as input, the need for automated means of managing domain-specific terms rises. The relevance and importance of terms have prompted dedicated research interest. Various names such as automatic term recognition, term extraction and terminology mining have been given to encompass the tasks related to the treatment of terms. Term extraction is the process of extracting lexical units from text and filtering them for the purpose of identifying terms which characterise certain domains of interest. This process involves the determination of two important factors, namely unithood and termhood. Unithood concerns whether or not a sequence of words should be combined to form a more stable lexical unit, while termhood measures the degree to which these stable lexical units are related to domain-specific concepts. While unithood is only relevant to complex terms (i.e. multi-word terms), termhood concerns both simple terms (i.e. single-word terms) and complex terms.

Most research in automatic term recognition was conducted solely to study and develop techniques for measuring termhood, while only a small number of studies exist on unithood. Unfortunately, rather than considering the measurement of unithood as an important prerequisite, these researchers merely treat it as part of a larger scoring and filtering mechanism for determining termhood. Such a perception is clearly reflected in the words of Kit (2002): "...we can see that the unithood is actually subsumed, in general, by the termhood.". Consequently, the significance of unithood measures has been overshadowed by the larger notion of termhood. As such, progress and innovation in this small sub-field of automatic term recognition have been minimal. Most of the existing techniques for measuring unithood employ conventional measures such as mutual information and log-likelihood, and rely simply on the occurrence and co-occurrence frequencies from domain corpora as the source of evidence.
In this paper, we propose the separation of unithood measurement from the determination of termhood. From here on, we will consider unithood measurement as an important prerequisite, rather than a subsumption, to the determination of termhood. We present a new dedicated approach for determining the unithood of word sequences by employing the Google search engine as the source of statistical evidence, and measures inspired by mutual information (Church and Hanks (1990)) and Cvalue (Frantzi (1997)). The use of the World Wide Web to replace the conventional use of static corpora eliminates issues related to portability to other domains, and the size of text necessary to induce the required statistical evidence. Besides, this dedicated approach to determining the unithood of word sequences will prove invaluable to other areas in natural language processing such as noun-phrase chunking and named-entity recognition. Our evaluations revealed a precision and a recall of 98.68% and 91.82% respectively, with an accuracy of 95.42%, in measuring the unithood of 1005 test cases.

In Section 2, we briefly review the existing techniques for measuring unithood. In Section 3, we present our new approach, the measures involved and the justification behind every aspect of the measures. In Section 4, we summarise some findings from our evaluations. We discuss in Section 5 why our new approach is applicable to other tasks in natural language processing such as named-entity recognition. Finally, we conclude this paper with an outlook to future work in Section 6.

2 Related Works

Prior to measuring unithood, term candidates must be extracted. There are two common approaches for extracting term candidates. The first requires the corpus to be tagged or parsed, and a filter is then employed to extract words or phrases satisfying some linguistic patterns. There are two types of filters for extracting from tagged corpora, namely open and closed. Filters that are too restrictive (i.e. closed), relying on a small set of allowable parts-of-speech, will produce high precision but poor recall (Frantzi and Ananiadou (1997)). On the other hand, filters that are too liberal (i.e. open), allowing parts-of-speech such as prepositions and adjectives, will have the opposite effect. Most of the existing approaches rely on regular expressions and part-of-speech tags to accept or reject sequences of n-grams as term candidates. For example, Frantzi and Ananiadou (1997) employ the Brill tagger to tag the raw corpus with part-of-speech and later extract n-grams that fulfill the pattern (Noun|Adjective)+ Noun. Bourigault and Jacquemin (1999) utilise SYLEX, a part-of-speech tagger, to tag the raw corpus. The part-of-speech tags are utilised to extract maximal-length noun phrases, which are later recursively decomposed into heads and modifiers. On the other extreme, Dagan and Church (1994) accept only sequences of Noun+. The second type of extraction approach works on raw corpus using a set of heuristics. This type of approach, which does not rely on part-of-speech tags, is quite rare. Such approaches have to make use of textual surface constraints to approximate the boundaries of term candidates.
One such constraint is the use of a stopword list to obtain the boundaries of stopwords for inferring the boundaries of candidates. A selection list of allowable prepositions can also be employed to enforce constraints on the tokens between units.

The filters for extracting term candidates make use of only local, surface-level information, namely the part-of-speech tags. More evidence is required to establish the dependence between the constituents of each term candidate to ensure strong unithood. Such evidence is usually statistical in nature, in the form of co-occurrences of the constituents in the corpus. Accordingly, the unithood of the term candidates can be determined either as a separate step or as part of the extraction process. From our review of the literature, only an extremely small number of researchers have actually discussed and presented measures for unithood. The lack of extensive research and techniques related specifically to unithood is reaffirmed by Kit (2002). According to the author, "...its measure (if there is one) indicates how likely it is that a term candidate is an atomic text unit."

Two of the most common measures of unithood are pointwise mutual information (MI) (Church and Hanks (1990)) and the log-likelihood ratio (Dunning (1994)). In mutual information, the co-occurrence frequencies of the constituents of complex terms are utilised to measure their dependency. The mutual information for two words a and b is defined as:

$$MI(a, b) = \log_2 \frac{p(a, b)}{p(a)\,p(b)} \quad (1)$$

where p(a) and p(b) are the probabilities of occurrence of a and b. Many measures that apply statistical techniques assuming a strict normal distribution and independence between word occurrences do not fare well. For handling extremely uncommon words or small-sized corpora, the log-likelihood ratio delivers the best precision (Kurz and Xu (2002); Franz (1997)). The log-likelihood ratio attempts to quantify how much more likely one pair of words is to occur compared to the others. Despite its potential, "How to apply this statistic measure to quantify structural dependency of a word sequence remains an interesting issue to explore." (Kit (2002)).

Frantzi (1997) proposed a measure known as Cvalue for extracting complex terms. The measure is based upon the claim that a substring of a term candidate is a candidate itself given that it demonstrates adequate independence from the longer version it appears in. For example, "E. coli food poisoning", "E. coli" and "food poisoning" are acceptable as valid complex term candidates. However, "E. coli food" is not. Therefore, some measures are required to gauge the strength of word combinations to decide whether two word sequences should be merged or not. Given a word sequence a to be examined for unithood, the Cvalue is defined as:

$$Cvalue(a) = \begin{cases} \log_2 |a| \cdot f_a & \text{if } |a| = g \\ \log_2 |a| \cdot \left(f_a - \frac{1}{|L_a|} \sum_{l \in L_a} f_l\right) & \text{otherwise} \end{cases} \quad (2)$$

where |a| is the number of words in a, L_a is the set of longer term candidates that contain a, g is the longest n-gram considered, f_a is the frequency of occurrence of a, and a ∉ L_a. While certain researchers (Kit (2002)) consider Cvalue a termhood measure, others (Nakagawa and Mori (2002)) accept it as a measure for unithood.
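To make these two related measures concrete, the following is a minimal Python sketch of Equations 1 and 2, assuming frequencies have already been counted from a corpus; the function names, the frequency-table format and the guard for an empty L_a are our own illustrative choices, not part of the original works.

```python
import math

def mutual_information(p_ab, p_a, p_b):
    """Equation 1: pointwise mutual information, log2(p(a,b) / (p(a) * p(b)))."""
    return math.log2(p_ab / (p_a * p_b))

def cvalue(a, freq, longer_candidates, g):
    """Equation 2: Cvalue of a candidate `a` (a tuple of words).

    freq              -- mapping from word-sequence tuple to corpus frequency f
    longer_candidates -- L_a, the longer candidates that contain `a`
    g                 -- length of the longest n-gram considered
    """
    if len(a) == g:
        return math.log2(len(a)) * freq[a]
    # average frequency of the longer candidates that contain `a`
    nested = (sum(freq[l] for l in longer_candidates) / len(longer_candidates)
              if longer_candidates else 0.0)
    return math.log2(len(a)) * (freq[a] - nested)

# e.g. cvalue(("food", "poisoning"), freq, [("E.", "coli", "food", "poisoning")], g=4)
# rewards "food poisoning" if it occurs well beyond its longer host term.
```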
One can observe that longer candidates tend to gain higher weights due to the inclusion of log_2|a| in Equation 2. In addition, the weights computed using Equation 2 are purely dependent on the frequency of a.

3 A New Approach for Unithood Measurement

Our new approach for measuring the unithood of word sequences consists of two parts. Firstly, a list of word sequences is extracted using purely linguistic techniques. Secondly, the word sequences are examined and the related statistical evidence is gathered to assist in determining their mutual information and independence.

3.1 Extracting Word Sequences

Existing techniques for extracting word sequences have relied on part-of-speech information and filters in the form of pattern matching (e.g. regular expressions). Since the head-modifier principle is important for our technique, we employ both part-of-speech information and dependency relations for extracting term candidates. The filter is implemented as a head-driven left-right filter (Wong (2005)) that feeds on the output of the Stanford Parser (Klein and Manning (2003)), which is an implementation of an unlexicalised probabilistic context-free grammar (PCFG) and lexical dependency parser. The head-driven filter begins by identifying a list of head nouns from the output of the Stanford Parser. As the name suggests, the filter begins from the head and proceeds to the left and later to the right in an attempt to identify maximal-length noun phrases according to the head-modifier information. During the process, the filter will append or prepend any immediate modifier of the current head which is a noun (except possessive nouns), an adjective or a foreign word. Each noun phrase or segment of a noun phrase identified using the head-driven filter is known as a potential term candidate, a_i ∈ A, where i is the word offset produced by the Stanford Parser (i.e. the "offset" column in Figure 1). Figure 1 shows the output of the Stanford Parser for the sentence "They're living longer with HIV in the brain, explains Kathy Kopnisky of the NIH's National Institute of Mental Health, which is spending about millions investigating neuroAIDS.". Note that the words are lemmatised to obtain the root form. The head nouns are marked with squares in the figure. For example, the head "Institute" is modified by "NIH's", "National" and "of". Since we do not allow modifiers of the type "possessive" and "preposition", we obtain "National Institute" as shown in Figure 2. Figure 2 shows the head-driven filter at work for some of the head nouns identified from the "modifiee" column of the output in Figure 1. After the head-driven filter has identified potential term candidates using the heads, the remaining nouns from the "word" column in Figure 1 which are not part of any potential term candidate are included in A.

Figure 1: The output from the Stanford Parser. The tokens in the "modifiee" column marked with squares are head nouns, and the corresponding tokens along the same rows in the "word" column are the modifiers. The first column, "offset", is subsequently represented using the variable i.

Figure 2: An example of our head-driven left-right filter at work. The tokens highlighted with a darker tone are the head nouns. The underlined tokens are the modifiers identified as a result of the left-first and right-later movement of the filter. In the first segment, the head noun is "Health" while the first token to the left, "Mental", is the corresponding modifier. In the second segment of the figure, "Institute" is the head noun and "National" is the modifier. Note that the results from the first and second segments are actually part of a longer noun phrase. Due to the restriction on accepting prepositions by the head-driven filter, "of" is omitted from the output of the first segment of the figure.
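As an illustration of this extraction step, below is a minimal sketch of a head-driven left-right filter. The row format (offset, lemma, tag, modifiee offset), the tag sets and the contiguity assumption are simplifications of ours; the actual filter (Wong (2005)) operates on the richer Stanford Parser output shown in Figure 1.

```python
# Tags treated as valid modifiers: nouns, adjectives and foreign words.
ALLOWED = {"NN", "NNS", "NNP", "NNPS", "JJ", "FW"}

def extract_candidates(rows, head_offsets):
    """Expand each head noun to the left first and then to the right over
    immediate modifiers with allowed tags, yielding potential term candidates.
    Each row is (offset, lemma, tag, modifiee_offset)."""
    by_offset = {r[0]: r for r in rows}

    def modifies_head(i, h):
        return (i in by_offset and by_offset[i][3] == h
                and by_offset[i][2] in ALLOWED)

    candidates = []
    for h in head_offsets:
        span = [h]
        i = h - 1
        while modifies_head(i, h):      # left-first
            span.insert(0, i)
            i -= 1
        i = h + 1
        while modifies_head(i, h):      # right-later
            span.append(i)
            i += 1
        candidates.append(" ".join(by_offset[j][1] for j in span))
    return candidates

# For the sentence in Figure 1, the head "Institute" with its allowed modifier
# "National" would yield "National Institute", since possessives ("NIH's") and
# prepositions ("of") are excluded.
```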
3.2 Determining the Unithood of Word Sequences

In the following step, we examine the unithood of all pairs of potential term candidates (a_x, a_y) ∈ A with a_x and a_y located immediately next to each other (i.e. x + 1 = y), or separated by a preposition or the coordinating conjunction "and" (i.e. x + 2 = y). Obviously, a_x has to appear before a_y in the sentence or, in other words, x < y for all pairs, where x and y are the word offsets produced by the Stanford Parser. Formally, given that s = a_x b a_y, where b is any preposition, the conjunction "and" or an empty string, the problem is to determine whether to accept s as an independent lexical unit (i.e. a term candidate) or to leave a_x and a_y as separate units. In order to decide on the merge, we need adequate evidence that s will form a stable unit and hence a better term candidate than a_x and a_y separated. It is worth mentioning that the size (i.e. number of words) of a_x and a_y is not limited to 1. For example, we can have a_x = "National Institutes", b = "of" and a_y = "Allergy and Infectious Diseases". In addition, the size of a_x and a_y should have no effect on the determination of their unithood.

The most suitable existing measure for gathering evidence about the dependency between two words is mutual information. Based on conventional practice, the frequency of occurrence of each element in W = {s, a_x, a_y} is normalised using the sum of the frequencies of all occurrences in W. Since data sparseness in a local corpus may lead to poor estimation of mutual information, we innovatively employ the page count returned by the Google search engine for calculating the dependency of the elements in W instead. We treat the World Wide Web as a large general corpus and the Google search engine as a gateway for accessing the documents in that corpus. Our choice of Google to obtain the page count was motivated merely by its extensive coverage. In fact, it is possible to employ any search engine on the World Wide Web for this research. Each element in W is formulated as a search query and submitted to the Google search engine. The page count returned is utilised for calculating the mutual information. In addition, we apply a multiplier within the range [e^{-1}, 1], inspired by TF-IDF, to offset overly common terms, especially in three-word terms such as "Institute of Science". Formally, for each w ∈ W, we define the weight as:

$$p(w) = \frac{n_w}{\sum_{v \in W} n_v} \; e^{-\frac{n_w}{\sum_{v \in W} n_v}} \quad (3)$$

where n_w is the page count (i.e. number of documents) returned by the Google search engine containing w ∈ W. We only take into consideration the number of documents that contain the word sequences (i.e. the page count) due to the difficulty of obtaining the actual frequency of occurrence of word sequences from Google's search results.
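A minimal sketch of the weighting in Equation 3 follows, assuming the page counts for the elements of W = {s, a_x, a_y} have already been retrieved from the search engine; the function name and the dictionary format are our own illustrative choices.

```python
import math

def weights(page_counts):
    """Equation 3: compute p(w) for each w in W = {s, a_x, a_y}.

    page_counts -- mapping from query string to its page count n_w;
                   assumes at least one non-zero count.
    The multiplier e^(-x), with x = n_w / sum(n_v) in [0, 1], stays within
    [1/e, 1] and damps the weight of overly common terms."""
    total = sum(page_counts.values())
    return {w: (n / total) * math.exp(-n / total)
            for w, n in page_counts.items()}
```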
Next, the mutual information between the two units a_x and a_y is defined as:

$$MI(a_x, a_y) = \frac{p(s)}{p(a_x)\,p(a_y)} \quad (4)$$

If the occurrence of s approaches 0 (i.e. a_x and a_y are rarely seen together as a unit with b), then MI(a_x, a_y) reduces to 0. If the occurrence of s approaches that of the co-occurrence of a_x and a_y, MI(a_x, a_y) will approach 1. A high p(s) indicates stronger coupling of a_x and a_y or, even more so, that the existence of a_x and a_y is purely due to the existence of s. A high MI(a_x, a_y) implies an increase in the unithood of the two units a_x and a_y. Following this, a_x and a_y will have poor unithood in their individual forms if they are not merged into s to form a stronger unit. This mutual information measure is necessary for distinguishing phrases such as "Asia and Europe" (i.e. low mutual information) from "U.S. Food and Drug Administration" (i.e. high mutual information) when prepositions and conjunctions are involved.

Nonetheless, the units a_x and a_y may still be capable of forming valid compound units even though their mutual information is relatively low. Low mutual information can be attributed to the high individual occurrences of a_x and a_y due to their extremely common usage. For example, a_x = "Institute" and a_y = "Ophthalmology" yield high occurrences relative to s = "Institute of Ophthalmology". This does not mean that "Institute of Ophthalmology" is not a valid unit. To handle such cases, where MI(a_x, a_y) is mediocre due to the commonness of a_x and a_y, we employ another measure of independence. In such situations, we will still accept s as a valid unit if it can be demonstrated that the extremely high independence of the individual units a_x and a_y is the cause behind the low MI(a_x, a_y). For this purpose, we modify the Cvalue described in Equation 2 to accommodate the use of page counts rather than frequencies. In addition, we remove the multiplier log_2|a| because the number of words in a_x and a_y does not play a role in determining their independence from s. Consequently, we define the measure of Independence (ID) of a_x and a_y from s as:

$$ID(a_x, s) = \begin{cases} \log_{10}(n_{a_x} - n_s) & \text{if } n_{a_x} > n_s \\ 0 & \text{otherwise} \end{cases} \quad (5)$$

$$ID(a_y, s) = \begin{cases} \log_{10}(n_{a_y} - n_s) & \text{if } n_{a_y} > n_s \\ 0 & \text{otherwise} \end{cases} \quad (6)$$

where n_{a_x}, n_{a_y} and n_s are the Google page counts for the units a_x, a_y and s, respectively. As the lexical unit a_x occurs more often than its longer counterpart s, its independence ID(a_x, s) grows. Only when the number of occurrences of a_x is less than that of s does its independence from s become ID(a_x, s) = 0. This means that we will not be able to witness a_x without encountering s. The same can be said about the measure of independence for a_y, ID(a_y, s). In short, extremely high independence of a_x and a_y relative to s will be reflected through high ID(a_x, s) and ID(a_y, s).
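The two statistical measures can be sketched as follows, a minimal transcription of Equations 4 to 6 in which the argument names are ours:

```python
import math

def mi(p_s, p_ax, p_ay):
    """Equation 4: web-based mutual information between a_x and a_y,
    computed over the weights p(w) from Equation 3."""
    return p_s / (p_ax * p_ay)

def independence(n_unit, n_s):
    """Equations 5 and 6: independence ID(unit, s) of a_x or a_y from s,
    computed from raw Google page counts."""
    return math.log10(n_unit - n_s) if n_unit > n_s else 0.0
```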
Consequently, the decision to merge a_x and a_y to form s depends on both the mutual information between a_x and a_y, namely MI(a_x, a_y), and the independence of a_x and a_y from s, namely ID(a_x, s) and ID(a_y, s). This decision is organised into a Boolean function known as Unithood (UH), which we define as:

$$UH(a_x, a_y) = \begin{cases} 1 & \text{if } \big(MI(a_x, a_y) > MI^+\big) \lor \big(MI^+ \geq MI(a_x, a_y) \geq MI^- \,\land\, ID(a_x, s) \geq ID_T \,\land\, ID(a_y, s) \geq ID_T \,\land\, IDR^+ \geq IDR(a_x, a_y) \geq IDR^-\big) \\ 0 & \text{otherwise} \end{cases} \quad (7)$$

where IDR(a_x, a_y) = ID(a_x, s) / ID(a_y, s). IDR helps to ensure that pairs with mediocre mutual information not only have a_x and a_y with high independence but are also equally independent before mergings are performed. The unithood function in Equation 7 summarises the relationship between mutual information and the independence measure. UH(a_x, a_y) simply states that the two lexical units a_x and a_y can only be merged in two cases:

• If a_x and a_y have extremely high mutual information (i.e. higher than a certain threshold MI+); or

• If a_x and a_y achieve average mutual information (i.e. within the acceptable range between the two thresholds MI+ and MI−) together with extremely high independence from s (i.e. higher than the threshold ID_T). To ensure that both units have equally high independence, their ratio of independence, IDR, has to fall within the range between IDR− and IDR+.

The thresholds for MI(a_x, a_y), ID(a_x, s), ID(a_y, s) and IDR(a_x, a_y) are decided empirically through our evaluations:

• MI+ = 0.9
• MI− = 0.02
• ID_T = 6
• IDR+ = 1.35
• IDR− = 0.93

A general guideline for determining the appropriate threshold values is provided in the next section. Finally, the word sequence s = a_x b a_y will be accepted as a stable lexical unit (i.e. term candidate) if and only if UH(a_x, a_y) = 1.
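Putting the pieces together, the decision rule of Equation 7 with the empirically chosen thresholds reads as the following sketch; the variable names are ours:

```python
MI_PLUS, MI_MINUS = 0.9, 0.02      # MI+ and MI-
ID_T = 6                           # independence threshold
IDR_PLUS, IDR_MINUS = 1.35, 0.93   # bounds on the independence ratio

def unithood(mi_xy, id_x, id_y):
    """Equation 7: 1 if a_x and a_y should be merged into s, else 0."""
    if mi_xy > MI_PLUS:                          # extremely high MI
        return 1
    if MI_PLUS >= mi_xy >= MI_MINUS:             # mediocre MI ...
        if id_x >= ID_T and id_y >= ID_T:        # ... both highly independent
            idr = id_x / id_y                    # safe: id_y >= ID_T > 0
            if IDR_PLUS >= idr >= IDR_MINUS:     # ... and equally independent
                return 1
    return 0
```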
4 Evaluations and Discussions

Table 1: Contingency table constructed using the actual and ideal results for computing precision, recall, accuracy and F-score. Ideal results are used as a reference for evaluation. Actual results are the actual output from our new approach.

For this evaluation, we employ 300 news articles from Reuters in the health domain gathered between October 2006 and January 2007. These 300 articles are fed into the Stanford Parser, whose output is then used by our head-driven left-right filter to extract word sequences in the form of nouns and noun phrases. Pairs of word sequences (i.e. a_x and a_y) located immediately next to each other, or separated by a preposition or the conjunction "and" in the same sentence, are measured for their unithood. Based on the UH(a_x, a_y) of the pairs, the decisions on whether to merge or not are made automatically. These decisions are known as the actual results. At the same time, we inspect the same list manually to decide on the merging of all the pairs. These decisions are known as the ideal results. Using the 300 news articles, we managed to obtain 1005 pairs of words to be tested for unithood. The actual and ideal results are organised into a contingency table as shown in Table 1 to identify the true and false positives, and the true and false negatives. Using the results in Table 1, we obtained a precision of 98.68%, a recall of 91.82% and an F-score of 90.61%. As for accuracy, our new measures for unithood scored 95.42%. This shows that our new measure has very good precision and a relatively low recall due to the high number of false negatives.

Figure 3: A snapshot of some samples of false positives and false negatives taken from our evaluations. Row 66 is an example of a merged pair which is not supposed to be combined (i.e. a false positive) while the remaining rows are false negatives. Each row has two lexical units a_x (column 2) and a_y (column 4) to be examined for their mutual information (column 8) and independence from s (columns 5 and 6) to determine if they should be merged into s (column 10). The decision to merge (column 9) is made based on Equation 7.

Firstly, we realised that the high false negative rate is explained by our more conservative definition of the thresholds, namely ID_T, IDR+ and IDR−. We discovered that about 90% of the false negatives fall within the range of MI+ and MI− (i.e. mediocre mutual information). Such pairs have the opportunity to be merged if they demonstrate adequate independence from s. Unfortunately, most of the independence values (ID) of either one or both members of the pairs failed to satisfy our independence thresholds. For example, referring to rows 72, 120 and 567 as shown in Figure 3, one will notice that they have mediocre mutual information as defined by MI+ = 0.9 and MI− = 0.02. In the case of row 72, both a_x and a_y have an independence lower than the threshold ID_T = 6. This resulted in the decision not to merge them. The same happened to row 120. In row 567, only a_y has an independence lower than ID_T and, at the same time, their IDR is well above the upper limit IDR+. The remaining 10% of the false negatives are simply due to the extremely low mutual information MI(a_x, a_y) of the pairs. Take for example pair 171 in Figure 3, where the mutual information is only 0.002, which is way below MI−. Secondly, due to the small number of false positives, not much of a conclusion can be drawn. From our analysis of the results, most of the false positives are due to their high mutual information (i.e. MI(a_x, a_y) above MI+). Pair 66 in Figure 3 is an example which is incorrectly merged due to a mutual information value higher than MI+.

From our discussion above, one would realise that both the recall and the precision can be improved by adjusting the various thresholds. Most of the time, the improvement of one comes at the expense of the other. For example, we can improve the recall by lowering ID_T and broadening the range between IDR+ and IDR− at the expense of precision. In other words, more pairs will have ID values exceeding the threshold ID_T, and more pairs will fall within the range of acceptable IDR. In this case, we will lower the number of false negatives and hence achieve higher recall. Similarly, we can improve the precision by increasing MI+. In this case, an increasing number of pairs will have mediocre mutual information (i.e. within the range between MI+ and MI−). Consequently, the number of false positives will be reduced when such pairs with mediocre mutual information are not merged, due to their inability to satisfy the additional constraints in the form of ID_T, IDR+ and IDR−. Due to the lack of existing dedicated techniques for measuring unithood, we were unable to perform a comparative study.
Nonetheless, the high accuracy and F-score obtained in our evaluation, and our analysis of the false positives and false negatives, revealed the potential of our new measures in terms of high precision and recall, portability across domains, and configurability of performance.

5 Applicability to Named-Entity Recognition

Named-entity recognition is one of the important tasks in information extraction. It involves the identification of noun phrases, or more specifically proper names, from free text, and their classification into one of many categories such as persons, geographical locations and companies. A typical named-entity recogniser performs part-of-speech tagging, and relies on patterns, heuristics and dictionaries to identify proper names. The use of machine learning and other probabilistic methods such as support vector machines (Mayfield et al. (2003)) and hidden Markov models (Mittal et al. (1999)) has also gained popularity. Most of these existing techniques for named-entity recognition work well when dealing with single-word names or sequences of nouns. In the face of more complex named entities that consist of other parts-of-speech, especially prepositions, these techniques perform poorly. For example, most existing techniques would have to rely on heuristics and dictionaries to differentiate between "Barrow in Furness" and "countries in Asia". Unfortunately, there are many problems related to the use of heuristics and dictionaries. For one, the maintenance of such dictionaries is costly and difficult. Questions about the reliability of stop-word boundaries, word capitalisation and punctuation arise regarding the use of heuristics and linguistic cues. For example, how can named-entity recognisers rely on capitalised words in the case of "Barrow in Furness" versus "Perth in Australia"? Besides prepositions, the conjunction "and" in named entities poses a similar challenge. According to Osenova and Kolkovska (2002), "...problem arises when named-entity is a phrase, comprising a conjunction...". In such cases, the named-entity recogniser only recognises part of the named entity. For example, in the case of "Centre for Disease Control and Prevention", only "Centre for Disease Control" will be extracted. Other researchers (Mani et al. (1996)) have taken the default step of simply grouping name segments separated by prepositions or conjunctions into longer names.

Our new approach of deciding whether two word sequences are to be merged or not is highly applicable to many areas in natural language processing, especially named-entity recognition. The absence of any predefined resources in our approach will solve all the problems highlighted in the previous paragraph. Using our UH(a_x, a_y) function, a named-entity recogniser can easily determine whether or not parts of proper names should be merged together, without ever relying on unreliable heuristics, domain-restricted patterns and dictionaries.

6 Conclusion and Future Work

Many researchers inappropriately assume that termhood subsumes unithood. In this paper, we highlighted the significance of unithood and argued that its measurement should be given equal attention by researchers in automatic term recognition.
The potential of unithood measurement can be extended to other areas in natural language processing such as noun-phrase chunking and named-entity recognition. We proposed a new approach that provides dedicated measures specialised in measuring unithood. The first measure employs mutual information MI(a_x, a_y) to capture the interdependence of the existence of a_x and a_y, and allows us to determine whether a_x and a_y are better off separated or not in order to produce a stronger unit. The second measure, defined as Independence (ID), was inspired by Cvalue and is meant to provide additional evidence in the determination of unithood. These two measures are combined into a Boolean function defined as Unithood (UH) that decides whether a_x and a_y should be combined to form s. Our evaluations revealed a precision and recall of 98.68% and 91.82% respectively, with an accuracy of 95.42%, in measuring the unithood of 1005 test cases. Due to the lack of existing dedicated techniques for measuring unithood, we were unable to perform a comparative study. Nonetheless, the excellent evaluation results, together with the real-world text employed in our evaluation, demonstrate the strengths of our new approach with regard to the determination of unithood. One of the future works that we plan to undertake is to increase the size of our test set to further establish the advantages of our new approach demonstrated through the current evaluations. We also plan to reduce the number of thresholds and, at the same time, find ways to automatically optimise them.

Acknowledgement

This research was supported by the Australian Endeavour International Postgraduate Research Scholarship, and the Research Grant 2006 by the University of Western Australia.

References

D. Bourigault and C. Jacquemin. 1999. Term extraction + term clustering: An integrated platform for computer-aided terminology. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL). Bergen.

K. Church and P. Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

I. Dagan and K. Church. 1994. Termight: Identifying and translating technical terminology. In Proceedings of the 4th Conference on Applied Natural Language Processing. Germany.

T. Dunning. 1994. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.

K. Frantzi. 1997. Incorporating context information for the extraction of terms. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics. Spain.

K. Frantzi and S. Ananiadou. 1997. Automatic term recognition using contextual cues. In Proceedings of the IJCAI Workshop on Multilinguality in Software Industry: the AI Contribution. Japan.

A. Franz. 1997. Independence assumptions considered harmful. In Proceedings of the 8th Conference of the European Chapter of the Association for Computational Linguistics. Madrid, Spain.

C. Kit. 2002. Corpus tools for retrieving and deriving termhood evidence. In Proceedings of the 5th East Asia Forum of Terminology. Haikou, China.

D. Klein and C. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Meeting of the Association for Computational Linguistics.
D. Kurz and F. Xu. 2002. Text mining for the extraction of domain relevant terms and term collocations. In Proceedings of the International Workshop on Computational Approaches to Collocations. Vienna.

I. Mani, R. Macmillan, S. Luperfoy, E. Lusher, and S. Laskowski. 1996. Identifying unknown proper names in newswire text. In Proceedings of the ACL SIG Workshop on Acquisition of Lexical Knowledge from Text.

J. Mayfield, P. McNamee, and C. Piatko. 2003. Named entity recognition using hundreds of thousands of features. In Proceedings of the Conference on Computational Natural Language Learning.

V. Mittal, S. Baluja, and R. Sukthankar. 1999. Applying machine learning for high-performance named-entity extraction. In Proceedings of the Conference of the Pacific Association for Computational Linguistics.

H. Nakagawa and T. Mori. 2002. A simple but powerful automatic term extraction method. In Proceedings of the International Conference on Computational Linguistics (COLING).

P. Osenova and S. Kolkovska. 2002. Combining the named-entity recognition task and NP chunking strategy for robust pre-processing. In Proceedings of the Workshop on Linguistic Theories and Treebanks. Sozopol, Bulgaria.

W. Wong. 2005. Practical Approach to Knowledge-based Question Answering with Natural Language Understanding and Advanced Reasoning. Master's thesis, National Technical University College of Malaysia.