A Practitioner's Guide to Evaluating Entity Resolution Results
Entity resolution (ER) is the task of identifying records belonging to the same entity (e.g. individual, group) across one or multiple databases. Ironically, it has multiple names: deduplication and record linkage, among others. In this paper we survey metrics used to evaluate ER results in order to iteratively improve performance and guarantee sufficient quality prior to deployment.
Authors: Matt Barnes
A Practitioner's Guide to Evaluating Entity Resolution Results

Matt Barnes (mbarnes1@cs.cmu.edu)
School of Computer Science, Carnegie Mellon University
October, 2014

1. Introduction

Entity resolution (ER) is the task of identifying records belonging to the same entity (e.g. individual, group) across one or multiple databases. Ironically, it has multiple names: deduplication and record linkage, among others. In this paper we survey metrics used to evaluate ER results in order to iteratively improve performance and guarantee sufficient quality prior to deployment. Some of these metrics are borrowed from the multi-class classification and clustering domains, though some key differences distinguish entity resolution from general clustering. Menestrina et al. empirically showed that rankings from these metrics often conflict with each other, which is our primary motivation for studying them [1]. This paper provides practitioners the basic knowledge to begin evaluating their entity resolution results.

2. Problem Statement

Our notation follows that of [1]. Consider an input set of records $I = \{a, b, c, d, e\}$, where $a$, $b$, $c$, $d$, and $e$ are unique records. Let $R = \{\langle a, b, d \rangle, \langle c, e \rangle\}$ denote an entity resolution clustering output, where $\langle \cdot \rangle$ denotes a cluster. Let $S$ be the true clustering, referred to as the "gold standard." The goal of any entity resolution metric is to measure the error (or similarity) of $R$ compared to the gold standard $S$.

3. Pairwise Metrics

Pairwise metrics consider every pair of records as samples for evaluating performance. Let $Pairs(R)$ denote all the intra-cluster pairs in the clustering $R$. In our example, $Pairs(R) = \{(a, b), (a, d), (b, d), (c, e)\}$. Confusingly, some studies treat pairs only as those where a direct match was made, not matches made through transitive relations [2]. For example, [2] would exclude $(a, d)$ if the matches leading to $R$ were $a \approx b$, $b \approx d$, and $c \approx e$, where $\approx$ denotes a match. We choose the former definition because it is independent of the underlying matching process – it depends only on the final entity resolution results.

Unlike many machine learning classification tasks, we never consider non-matches (i.e. inter-cluster pairs) in entity resolution metrics [3]. In conventional clustering tasks, the number of clusters is constant or sub-linear with respect to the number of records $n$ [4]. However, the number of clusters is $O(n)$ in conventional ER tasks. So although the number of intra-cluster pairs is $O(n)$ (e.g. true positives), the number of inter-cluster pairs (e.g. true negatives) is $O(n^2)$. To illustrate, consider our original example with 5 records and 2 clusters: there are 4 intra-cluster pairs and 6 inter-cluster pairs. Now compare this to a larger database with 50 records and 20 clusters, all of comparable size to those in the original example. There will be approximately 40 intra-cluster pairs but nearly 1200 inter-cluster pairs. Thus, metrics that use inter-cluster pairs (e.g. False Positive Rate) improve quadratically with the number of records in the database and provide overly optimistic results for large databases.

3.1. Pairwise Precision, Recall, and F1

Using $Pairs$ as the samples, the pairwise precision and recall metrics follow conventional machine learning definitions. The harmonic mean of these metrics gives the most frequently used entity resolution metric, pairwise $F_1$. All these metrics are bounded on $[0, 1]$.

(1) $PairPrecision(R, S) = \frac{|Pairs(R) \cap Pairs(S)|}{|Pairs(R)|}$

(2) $PairRecall(R, S) = \frac{|Pairs(R) \cap Pairs(S)|}{|Pairs(S)|}$

(3) $PairF_1(R, S) = \frac{2 \cdot PairPrecision(R, S) \cdot PairRecall(R, S)}{PairPrecision(R, S) + PairRecall(R, S)}$

The benefit of pairwise metrics is their intuitive interpretation. Pairwise precision is the percentage of matches in the predicted clustering that are correct. Pairwise recall is the percentage of matches in the true clustering that are also in the predicted clustering. Unfortunately, pairwise metrics may convey overly optimistic results, depending on the use case. For example, in many entity resolution tasks the end user only cares about the final entity – not the records it comprises. Mismatching two singleton entities has an insignificant impact on pairwise metrics compared to incorrectly joining or splitting two large clusters.
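To make equations (1)-(3) concrete before moving on to cluster metrics, here is a minimal Python sketch on the running example. The representation of a clustering as a list of record-identifier sets and the gold standard $S$ below are our own illustrative assumptions; the paper specifies $R$ but never a particular $S$.

```python
from itertools import combinations

def pairs(clustering):
    """Pairs(.): all intra-cluster record pairs, including transitive matches."""
    return {frozenset(p) for cluster in clustering for p in combinations(cluster, 2)}

def pairwise_metrics(R, S):
    """Equations (1)-(3): pairwise precision, recall, and F1."""
    pr, ps = pairs(R), pairs(S)
    correct = len(pr & ps)
    precision = correct / len(pr)
    recall = correct / len(ps)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

R = [{"a", "b", "d"}, {"c", "e"}]   # the paper's example output
S = [{"a", "b"}, {"c", "d", "e"}]   # a hypothetical gold standard
print(sorted(map(sorted, pairs(R))))  # [['a', 'b'], ['a', 'd'], ['b', 'd'], ['c', 'e']]
print(pairwise_metrics(R, S))         # (0.5, 0.5, 0.5)
```

Note that only intra-cluster pairs appear anywhere in the computation; the $O(n^2)$ inter-cluster pairs discussed above are never touched.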
4. Cluster Metrics

Like the pairwise metrics, all the cluster metrics discussed here are bounded on $[0, 1]$, a convenient property when comparing across datasets and for setting quality standards.

4.1. Cluster Precision, Recall, and F1

Cluster-level metrics attempt to capture a more holistic understanding of the final entities. At the extreme opposite of pairwise metrics, cluster-level precision [5] and recall [6] consider exact cluster matches. Mathematically, cluster precision and recall are defined as $\frac{|R \cap S|}{|R|}$ and $\frac{|R \cap S|}{|S|}$, respectively. Now, mismatching two singleton entities has the same impact as mismatching two larger clusters. Obviously, this metric has the opposite drawback – even one corrupted match in a cluster will cause the entire cluster to mismatch due to the use of exact comparisons. Thus, this metric is rarely used in favor of its successor: closest cluster precision, recall, and F1.

4.2. Closest Cluster Precision, Recall, and F1

Closest cluster metrics correct for the previous cluster-level drawbacks by incorporating a notion of cluster similarity [7]. Using the Jaccard similarity coefficient $J(r, s) = \frac{|r \cap s|}{|r \cup s|}$ to capture cluster similarity, the precision and recall can be expressed as

(4) $ccPrecision(R, S) = \frac{\sum_{r \in R} \max_{s \in S} J(r, s)}{|R|}$

(5) $ccRecall(R, S) = \frac{\sum_{s \in S} \max_{r \in R} J(s, r)}{|S|}$

where $r$ and $s$ are clusters in $R$ and $S$, respectively. This metric, like many of the ones following, attempts to balance the tradeoffs of the pairwise and exact cluster metrics.

4.3. Purity and K

Cluster purity was first proposed in 1998 [8] and later extended to Average Cluster Purity (ACP) and Average Author Purity (AAP), archaically referred to as Average Speaker Purity [9]. The ACP and AAP are defined as

(6) $ACP = \frac{1}{N} \sum_{r \in R} \sum_{s \in S} \frac{|r \cap s|^2}{|r|}$

(7) $AAP = \frac{1}{N} \sum_{r \in R} \sum_{s \in S} \frac{|r \cap s|^2}{|s|}$

where $N$ is the total number of records. The K measure is then defined as the geometric mean of these values, $K = \sqrt{AAP \cdot ACP}$. In many applications only a single purity metric is evaluated, usually something comparable to ACP. For example, [10] considers the dominant class in each cluster by defining purity as $p = \frac{1}{N} \sum_{r \in R} \max_{s \in S} |r \cap s|$. The use of this single metric is misleading and shows only one half of the precision/recall coin. As an extreme example, setting $|R| = N$ (i.e. each record in its own cluster) would achieve a perfect $p = 1.0$, yet is clearly far from ideal.
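Continuing with the same hypothetical $R$ and $S$, here is a brief sketch of the closest cluster metrics (equations (4)-(5)) and of ACP, AAP, and K (equations (6)-(7)); the function names are our own.

```python
def jaccard(r, s):
    """Jaccard similarity J(r, s) between two clusters."""
    return len(r & s) / len(r | s)

def closest_cluster(R, S):
    """Equations (4)-(5): average best-Jaccard score in each direction."""
    precision = sum(max(jaccard(r, s) for s in S) for r in R) / len(R)
    recall = sum(max(jaccard(s, r) for r in R) for s in S) / len(S)
    return precision, recall

def purity_k(R, S):
    """Equations (6)-(7) and K, the geometric mean of ACP and AAP."""
    N = sum(len(r) for r in R)
    acp = sum(len(r & s) ** 2 / len(r) for r in R for s in S) / N
    aap = sum(len(r & s) ** 2 / len(s) for r in R for s in S) / N
    return acp, aap, (acp * aap) ** 0.5

R = [{"a", "b", "d"}, {"c", "e"}]
S = [{"a", "b"}, {"c", "d", "e"}]   # hypothetical gold standard, as above
print(closest_cluster(R, S))   # (0.666..., 0.666...)
print(purity_k(R, S))          # ACP = AAP = K ~ 0.733 on this example
```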
4.4. Homogeneity, Completeness, and V-Measure

Homogeneity and completeness are entropy-based metrics, somewhat analogous to precision and recall, respectively [11]. A cluster in $R$ has perfect homogeneity if all its records belong to the same cluster in $S$. Conversely, a cluster in $S$ has perfect completeness if all its records belong to the same cluster in $R$. Entropy $H$ and its conditional variant are defined as

(8) $H(S) = -\frac{1}{N} \sum_{s \in S} \sum_{r \in R} |r \cap s| \log \frac{\sum_{r \in R} |r \cap s|}{N}$

(9) $H(S|R) = -\frac{1}{N} \sum_{r \in R} \sum_{s \in S} |r \cap s| \log \frac{|r \cap s|}{\sum_{s' \in S} |r \cap s'|}$

where $N$ is the total number of records. Using these entropies, homogeneity and completeness are defined as:

(10) $Homogeneity(R, S) = \begin{cases} 1 & \text{if } H(S) = 0 \\ 1 - \frac{H(S|R)}{H(S)} & \text{otherwise} \end{cases}$

(11) $Completeness(R, S) = \begin{cases} 1 & \text{if } H(R) = 0 \\ 1 - \frac{H(R|S)}{H(R)} & \text{otherwise} \end{cases}$

V-Measure is defined analogously to the $F_1$ metric as the harmonic mean of homogeneity and completeness.

(12) $V_\beta = \frac{(1 + \beta^2) \cdot Homogeneity(R, S) \cdot Completeness(R, S)}{\beta^2 \cdot Homogeneity(R, S) + Completeness(R, S)}$

where $\beta$ is a user-defined parameter, usually set to $\beta = 1$ as in the $F_1$ metric. Completeness is weighted more heavily if $\beta > 1$ and homogeneity is weighted more heavily if $\beta < 1$. Some sources use $\beta$ instead of $\beta^2$ weighting; we chose the latter due to its popularity.

4.5. Other Metrics

The natural language processing community uses several other entity resolution metrics, which are rarely used in machine learning and database applications [12]. We refer the reader to MUC-6 [13], $B^3$ $F_1$ [14], and CEAF [15].

5. Edit Distance Metrics

Edit distance metrics can be thought of similarly to string edit distance functions. They are a measure of the information lost and gained while modifying $R$ to $S$. Unfortunately, they do not have the convenient $[0, 1]$ bound and are thus difficult to relate to any notion of a 'good' score.

5.1. Variation of Information

VI [16] can conveniently be expressed with the previous conditional entropy metrics [11].

(13) $VI(R, S) = H(S|R) + H(R|S)$

An important property of VI is that it does not directly depend on $N$, only on the sizes of the clusters. Thus, it is acceptable to add records from new clusters to a database while continuously measuring VI performance.
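The entropy-based metrics of Sections 4.4 and 5.1 share the conditional entropies $H(S|R)$ and $H(R|S)$, so a single sketch can compute homogeneity, completeness, V-measure, and VI together. The function names are ours, and we assume the natural logarithm; the base rescales VI but cancels out of equations (10)-(12).

```python
from math import log

def cond_entropy(A, B):
    """H(B|A), equation (9): uncertainty about B's clusters within each cluster of A."""
    N = sum(len(a) for a in A)
    return -sum(len(a & b) / N * log(len(a & b) / len(a))
                for a in A for b in B if a & b)

def entropy(A, N):
    """H(A), equation (8), simplified using sum_r |r ∩ s| = |s|."""
    return -sum(len(a) / N * log(len(a) / N) for a in A)

def v_measure(R, S, beta=1.0):
    """Equations (10)-(12): homogeneity, completeness, and V-measure."""
    N = sum(len(r) for r in R)
    h = 1.0 if entropy(S, N) == 0 else 1 - cond_entropy(R, S) / entropy(S, N)
    c = 1.0 if entropy(R, N) == 0 else 1 - cond_entropy(S, R) / entropy(R, N)
    return h, c, (1 + beta**2) * h * c / (beta**2 * h + c)

def variation_of_information(R, S):
    """Equation (13): VI(R, S) = H(S|R) + H(R|S)."""
    return cond_entropy(R, S) + cond_entropy(S, R)

R = [{"a", "b", "d"}, {"c", "e"}]
S = [{"a", "b"}, {"c", "d", "e"}]        # hypothetical gold standard, as above
print(v_measure(R, S))                   # approximately (0.433, 0.433, 0.433)
print(variation_of_information(R, S))    # approximately 0.764
```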
5.2. Generalized Merge Distance

Generalized Merge Distance (GMD) is perhaps the most comprehensive metric, in the sense that it can be used to directly calculate several other metrics [1]. $GMD(R, S)$ is the minimum legal path cost of converting $R$ to $S$, where the costs of splitting and merging sets of records are user-defined, operation-order-independent functions. Many such functions exist, such as $f(x, y) = k$, $f(x, y) = kxy$, and $f(x, y) = k_1 + k_2 xy$, where $x$ and $y$ are the sizes of the record sets to split or merge and $k$ is a constant. We refer the reader to [17] for background on operation-order-independent functions. Menestrina et al. not only show that $GMD(R, S)$ can be computed in linear time, but explicitly show how pairwise precision, recall, $F_1$, and VI can be computed using specific cost functions. Depending on the choice of cost functions, GMD is likely dependent on $N$ (the cost functions used in the VI formulation are one exception) and difficult to compare across datasets of different sizes.
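The following is a sketch of how such a linear-time computation can proceed, in the spirit of (but not copied from) the Slice algorithm of [1]: walk over each cluster of $R$, split it into the pieces destined for different clusters of $S$, then charge merge costs as those pieces are assembled. It assumes $R$ and $S$ partition exactly the same set of records, and `split_cost` and `merge_cost` must be operation-order-independent for the result to equal the minimum path cost.

```python
def gmd(R, S, split_cost, merge_cost):
    """GMD(R, S): total cost of converting R into S via splits and merges."""
    where = {rec: j for j, s in enumerate(S) for rec in s}  # record -> cluster index in S
    assembled = [0] * len(S)   # how much of each S cluster has been built so far
    cost = 0.0
    for r in R:
        # Partition r by the destination cluster of each of its records.
        parts = {}
        for rec in r:
            parts[where[rec]] = parts.get(where[rec], 0) + 1
        # Split r, peeling off one part at a time.
        remaining = len(r)
        for j, p in list(parts.items())[:-1]:
            cost += split_cost(p, remaining - p)
            remaining -= p
        # Merge each part into the S cluster it belongs to.
        for j, p in parts.items():
            if assembled[j]:
                cost += merge_cost(p, assembled[j])
            assembled[j] += p
    return cost

R = [{"a", "b", "d"}, {"c", "e"}]
S = [{"a", "b"}, {"c", "d", "e"}]   # hypothetical gold standard, as above
# Unit costs simply count operations: split off {d}, then merge it into {c, e}.
print(gmd(R, S, lambda x, y: 1, lambda x, y: 1))   # 2.0
# Per [1], specific cost functions recover other metrics; for example, with
# split cost x*y and zero merge cost, 1 - GMD / |Pairs(R)| is pairwise precision.
print(1 - gmd(R, S, lambda x, y: x * y, lambda x, y: 0) / 4)   # 0.5, as in Section 3.1
```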
5.3. Conclusion

Simple examples show that a promising pairwise metric may have poor cluster-level performance [2]. More rigorous analysis shows this is not only possible, but common across a range of applications [1]. At an absolute minimum, we recommend evaluating with pairwise F1 because of its simplicity and popularity. We also recommend the use of a cluster metric and Generalized Merge Distance – which can conveniently be configured to calculate VI and pairwise F1 in linear time.

All the metrics discussed herein rely on the availability of a "gold standard" $S$. In practice, human-labeled results rarely number beyond several thousand samples. On large datasets, a relative gold standard may be obtained by foregoing blocking efficiency and running an exhaustive ER algorithm on the entire database [1]. We note, however, that doing so on databases larger than even 10,000 records is infeasible for some algorithms [7]. Further, an exhaustive approach is still only an approximation and carries no guarantees relative to the true clustering. A need exists for semi- and unsupervised evaluation metrics. Some metrics exist for a very specific subset of circumstances, but for the majority of applications the general research problem is still open [18].

References

[1] D. Menestrina, S. E. Whang, and H. Garcia-Molina, "Evaluating entity resolution results," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 208-219, 2010.
[2] M. Michelson and S. A. Macskassy, "Record linkage measures in an entity centric world," in Proceedings of the 4th Workshop on Evaluation Methods for Machine Learning, 2009.
[3] P. Christen and K. Goiser, "Quality and complexity measures for data linkage and deduplication," in Quality Measures in Data Mining, pp. 127-151, Springer, 2007.
[4] L. Getoor and A. Machanavajjhala, "Entity resolution: theory, practice & open challenges," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2018-2019, 2012.
[5] J. Huang, S. Ertekin, and C. L. Giles, "Efficient name disambiguation for large-scale databases," in Knowledge Discovery in Databases: PKDD 2006, pp. 536-544, Springer, 2006.
[6] B. Wellner, A. McCallum, F. Peng, and M. Hay, "An integrated, conditional model of information extraction and coreference with application to citation matching," in Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 593-601, AUAI Press, 2004.
[7] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom, "Swoosh: a generic approach to entity resolution," The VLDB Journal, vol. 18, no. 1, pp. 255-276, 2009.
[8] A. Solomonoff, A. Mielke, M. Schmidt, and H. Gish, "Clustering speakers by their voices," in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 757-760, IEEE, 1998.
[9] J. Ajmera, H. Bourlard, and I. Lapidot, "Improved unknown-multiple speaker clustering using HMM," tech. rep., 2002.
[10] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, vol. 1, Cambridge University Press, 2008.
[11] A. Rosenberg and J. Hirschberg, "V-Measure: A conditional entropy-based external cluster evaluation measure," in EMNLP-CoNLL, vol. 7, pp. 410-420, 2007.
[12] H. Maidasani, G. Namata, B. Huang, and L. Getoor, "Entity resolution evaluation measures," 2012.
[13] M. Vilain, J. Burger, J. Aberdeen, D. Connolly, and L. Hirschman, "A model-theoretic coreference scoring scheme," in Proceedings of the 6th Conference on Message Understanding, pp. 45-52, Association for Computational Linguistics, 1995.
[14] A. Bagga and B. Baldwin, "Algorithms for scoring coreference chains," in The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, vol. 1, pp. 563-566, 1998.
[15] X. Luo, "On coreference resolution performance metrics," in Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 25-32, Association for Computational Linguistics, 2005.
[16] M. Meilă, "Comparing clusterings by the variation of information," in Learning Theory and Kernel Machines, pp. 173-187, Springer, 2003.
[17] M. Hosszú, "On the functional equation F(x+y,z)+F(x,y)=F(x,y+z)+F(y,z)," Periodica Mathematica Hungarica, vol. 1, no. 3, pp. 213-216, 1971.
[18] W. E. Winkler, "Overview of record linkage and current research directions," Bureau of the Census, 2006.