A Practitioner's Guide to Evaluating Entity Resolution Results
Entity resolution (ER) is the task of identifying records belonging to the same entity (e.g. individual, group) across one or multiple databases. Ironically, it has multiple names: deduplication and record linkage, among others. In this paper we survey metrics used to evaluate ER results in order to iteratively improve performance and guarantee sufficient quality prior to deployment.
Authors: Matt Barnes
A Practitioner's Guide to Evaluating Entity Resolution Results

Matt Barnes (mbarnes1@cs.cmu.edu)
School of Computer Science, Carnegie Mellon University
October, 2014

1. Introduction

Entity resolution (ER) is the task of identifying records belonging to the same entity (e.g. individual, group) across one or multiple databases. Ironically, it has multiple names: deduplication and record linkage, among others. In this paper we survey metrics used to evaluate ER results in order to iteratively improve performance and guarantee sufficient quality prior to deployment. Some of these metrics are borrowed from the multi-class classification and clustering domains, though some key differences distinguish entity resolution from general clustering. Menestrina et al. empirically showed that rankings from these metrics often conflict with each other, which is our primary motivation for studying them [1]. This paper provides practitioners the basic knowledge to begin evaluating their entity resolution results.

2. Problem Statement

Our notation follows that of [1]. Consider an input set of records $I = \{a, b, c, d, e\}$, where $a$, $b$, $c$, $d$, and $e$ are unique records. Let $R = \{\langle a, b, d \rangle, \langle c, e \rangle\}$ denote an entity resolution clustering output, where $\langle \cdot \rangle$ denotes a cluster. Let $S$ be the true clustering, referred to as the "gold standard." The goal of any entity resolution metric is to measure the error (or similarity) of $R$ compared to the gold standard $S$.

3. Pairwise Metrics

Pairwise metrics consider every pair of records as samples for evaluating performance. Let $Pairs(R)$ denote all the intra-cluster pairs in the clustering $R$. In our example, $Pairs(R) = \{(a, b), (a, d), (b, d), (c, e)\}$. Confusingly, some studies treat pairs only as those where a direct match was made, not matches made through transitive relations [2]. For example, [2] would exclude $(a, d)$ if the matches leading to $R$ were $a \approx b$, $b \approx d$, and $c \approx e$, where $\approx$ denotes a match. We choose the former definition because it is independent of the underlying matching process – it depends only on the final entity resolution results.

Unlike many machine learning classification tasks, we never consider non-matches (i.e. inter-cluster pairs) in entity resolution metrics [3]. In conventional clustering tasks, the number of clusters is constant or sub-linear with respect to the number of records $n$ [4]. However, the number of clusters is $O(n)$ in conventional ER tasks. So although the number of intra-cluster pairs is $O(n)$ (e.g. true positives), the number of inter-cluster pairs (e.g. true negatives) is $O(n^2)$. To illustrate, consider our original example with 5 records and 2 clusters: there are 4 intra-cluster pairs and 6 inter-cluster pairs. Now compare this to a larger database with 50 records and 20 clusters, all of comparable size to those in the original example. There will be approximately 40 intra-cluster pairs but nearly 1200 inter-cluster pairs. Thus, metrics that use inter-cluster pairs (e.g. False Positive Rate) improve quadratically with the number of records in the database and provide overly optimistic results for large databases.

3.1. Pairwise Precision, Recall, and F1

Using $Pairs$ as the samples, the pairwise precision and recall metrics follow conventional machine learning definitions. The harmonic mean of these metrics gives the most frequently used entity resolution metric, pairwise $F_1$. All these metrics are bounded on $[0, 1]$.

(1) $PairPrecision(R, S) = \frac{|Pairs(R) \cap Pairs(S)|}{|Pairs(R)|}$

(2) $PairRecall(R, S) = \frac{|Pairs(R) \cap Pairs(S)|}{|Pairs(S)|}$

(3) $PairF_1(R, S) = \frac{2 \cdot PairPrecision(R, S) \cdot PairRecall(R, S)}{PairPrecision(R, S) + PairRecall(R, S)}$

The benefit of pairwise metrics is their intuitive interpretation. Pairwise precision is the percentage of matches in the predicted clustering that are correct. Pairwise recall is the percentage of matches in the true clustering that are also in the predicted clustering. Unfortunately, pairwise metrics may convey overly optimistic results, depending on the use case. For example, in many entity resolution tasks the end user only cares about the final entity – not the records it comprises. Mismatching two singleton entities has an insignificant impact on pairwise metrics compared to incorrectly joining or splitting two large clusters.
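To make equations (1)-(3) concrete before moving on to cluster metrics, here is a minimal Python sketch on the running example. The representation of a clustering as a list of record-identifier sets and the gold standard $S$ below are our own illustrative assumptions; the paper specifies $R$ but never a particular $S$.

```python
from itertools import combinations

def pairs(clustering):
    """Pairs(.): all intra-cluster record pairs, including transitive matches."""
    return {frozenset(p) for cluster in clustering for p in combinations(cluster, 2)}

def pairwise_metrics(R, S):
    """Equations (1)-(3): pairwise precision, recall, and F1."""
    pr, ps = pairs(R), pairs(S)
    correct = len(pr & ps)
    precision = correct / len(pr)
    recall = correct / len(ps)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

R = [{"a", "b", "d"}, {"c", "e"}]   # the paper's example output
S = [{"a", "b"}, {"c", "d", "e"}]   # a hypothetical gold standard
print(sorted(map(sorted, pairs(R))))  # [['a', 'b'], ['a', 'd'], ['b', 'd'], ['c', 'e']]
print(pairwise_metrics(R, S))         # (0.5, 0.5, 0.5)
```

Note that only intra-cluster pairs appear anywhere in the computation; the $O(n^2)$ inter-cluster pairs discussed above are never touched.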
4. Cluster Metrics

Like the pairwise metrics, all the cluster metrics discussed here are bounded on $[0, 1]$, a convenient property when comparing across datasets and for setting quality standards.

4.1. Cluster Precision, Recall, and F1

Cluster-level metrics attempt to capture a more holistic understanding of the final entities. At the extreme opposite of pairwise metrics, cluster-level precision [5] and recall [6] consider exact cluster matches. Mathematically, cluster precision and recall are defined as $\frac{|R \cap S|}{|R|}$ and $\frac{|R \cap S|}{|S|}$, respectively. Now, mismatching two singleton entities has the same impact as mismatching two larger clusters. Obviously, this metric has the opposite drawback – even one corrupted match in a cluster will cause the entire cluster to mismatch due to the use of exact comparisons. Thus, this metric is rarely used in favor of its successor: closest cluster precision, recall, and F1.

4.2. Closest Cluster Precision, Recall, and F1

Closest cluster metrics correct for the previous cluster-level drawbacks by incorporating a notion of cluster similarity [7]. Using the Jaccard similarity coefficient $J(r, s) = \frac{|r \cap s|}{|r \cup s|}$ to capture cluster similarity, the precision and recall can be expressed as

(4) $ccPrecision(R, S) = \frac{\sum_{r \in R} \max_{s \in S} J(r, s)}{|R|}$

(5) $ccRecall(R, S) = \frac{\sum_{s \in S} \max_{r \in R} J(s, r)}{|S|}$

where $r$ and $s$ are clusters in $R$ and $S$, respectively. This metric, like many of the ones following, attempts to balance the tradeoffs of the pairwise and exact cluster metrics.

4.3. Purity and K

Cluster purity was first proposed in 1998 [8] and later extended to Average Cluster Purity (ACP) and Average Author Purity (AAP), archaically referred to as Average Speaker Purity [9]. The ACP and AAP are defined as

(6) $ACP = \frac{1}{N} \sum_{r \in R} \sum_{s \in S} \frac{|r \cap s|^2}{|r|}$

(7) $AAP = \frac{1}{N} \sum_{r \in R} \sum_{s \in S} \frac{|r \cap s|^2}{|s|}$

where $N$ is the total number of records. The K measure is then defined as the geometric mean of these values, $K = \sqrt{AAP \cdot ACP}$. In many applications only a single purity metric is evaluated, usually something comparable to ACP. For example, [10] considers the dominant class in each cluster by defining purity as $p = \frac{1}{N} \sum_{r \in R} \max_{s \in S} |r \cap s|$. The use of this single metric is misleading and shows only one half of the precision/recall coin. As an extreme example, setting $|R| = N$ (i.e. each record in its own cluster) would achieve a perfect $p = 1.0$, yet is clearly far from ideal.
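Continuing with the same hypothetical $R$ and $S$, here is a brief sketch of the closest cluster metrics (equations (4)-(5)) and of ACP, AAP, and K (equations (6)-(7)); the function names are our own.

```python
def jaccard(r, s):
    """Jaccard similarity J(r, s) between two clusters."""
    return len(r & s) / len(r | s)

def closest_cluster(R, S):
    """Equations (4)-(5): average best-Jaccard score in each direction."""
    precision = sum(max(jaccard(r, s) for s in S) for r in R) / len(R)
    recall = sum(max(jaccard(s, r) for r in R) for s in S) / len(S)
    return precision, recall

def purity_k(R, S):
    """Equations (6)-(7) and K, the geometric mean of ACP and AAP."""
    N = sum(len(r) for r in R)
    acp = sum(len(r & s) ** 2 / len(r) for r in R for s in S) / N
    aap = sum(len(r & s) ** 2 / len(s) for r in R for s in S) / N
    return acp, aap, (acp * aap) ** 0.5

R = [{"a", "b", "d"}, {"c", "e"}]
S = [{"a", "b"}, {"c", "d", "e"}]   # hypothetical gold standard, as above
print(closest_cluster(R, S))   # (0.666..., 0.666...)
print(purity_k(R, S))          # ACP = AAP = K ~ 0.733 on this example
```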
4.4. Homogeneity, Completeness, and V-Measure

Homogeneity and completeness are entropy-based metrics, somewhat analogous to precision and recall, respectively [11]. A cluster in $R$ has perfect homogeneity if all its records belong to the same cluster in $S$. Conversely, a cluster in $S$ has perfect completeness if all its records belong to the same cluster in $R$. Entropy $H$ and its conditional variant are defined as

(8) $H(S) = -\frac{1}{N} \sum_{s \in S} \sum_{r \in R} |r \cap s| \log \frac{\sum_{r \in R} |r \cap s|}{N}$

(9) $H(S|R) = -\frac{1}{N} \sum_{r \in R} \sum_{s \in S} |r \cap s| \log \frac{|r \cap s|}{\sum_{s' \in S} |r \cap s'|}$

where $N$ is the total number of records. Using these entropies, homogeneity and completeness are defined as:

(10) $Homogeneity(R, S) = \begin{cases} 1 & \text{if } H(S) = 0 \\ 1 - \frac{H(S|R)}{H(S)} & \text{otherwise} \end{cases}$

(11) $Completeness(R, S) = \begin{cases} 1 & \text{if } H(R) = 0 \\ 1 - \frac{H(R|S)}{H(R)} & \text{otherwise} \end{cases}$

V-Measure is defined analogously to the $F_1$ metric as the harmonic mean of homogeneity and completeness.

(12) $V_\beta = \frac{(1 + \beta^2) \cdot Homogeneity(R, S) \cdot Completeness(R, S)}{\beta^2 \cdot Homogeneity(R, S) + Completeness(R, S)}$

where $\beta$ is a user-defined parameter, usually set to $\beta = 1$ as in the $F_1$ metric. Completeness is weighted more heavily if $\beta > 1$ and homogeneity is weighted more heavily if $\beta < 1$. Some sources use $\beta$ instead of $\beta^2$ weighting; we chose the latter due to its popularity.

4.5. Other Metrics

The natural language processing community uses several other entity resolution metrics, which are rarely used in machine learning and database applications [12]. We refer the reader to MUC-6 [13], $B^3$ $F_1$ [14], and CEAF [15].

5. Edit Distance Metrics

Edit distance metrics can be thought of similarly to string edit distance functions. They are a measure of the information lost and gained while modifying $R$ to $S$. Unfortunately, they do not have the convenient $[0, 1]$ bound and are thus difficult to relate to any notion of a 'good' score.

5.1. Variation of Information

VI [16] can conveniently be expressed with the previous conditional entropy metrics [11].

(13) $VI(R, S) = H(S|R) + H(R|S)$

An important property of VI is that it does not directly depend on $N$, only on the sizes of the clusters. Thus, it is acceptable to add records from new clusters to a database while continuously measuring VI performance.
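The entropy-based metrics of Sections 4.4 and 5.1 share the conditional entropies $H(S|R)$ and $H(R|S)$, so a single sketch can compute homogeneity, completeness, V-measure, and VI together. The function names are ours, and we assume the natural logarithm; the base rescales VI but cancels out of equations (10)-(12).

```python
from math import log

def cond_entropy(A, B):
    """H(B|A), equation (9): uncertainty about B's clusters within each cluster of A."""
    N = sum(len(a) for a in A)
    return -sum(len(a & b) / N * log(len(a & b) / len(a))
                for a in A for b in B if a & b)

def entropy(A, N):
    """H(A), equation (8), simplified using sum_r |r ∩ s| = |s|."""
    return -sum(len(a) / N * log(len(a) / N) for a in A)

def v_measure(R, S, beta=1.0):
    """Equations (10)-(12): homogeneity, completeness, and V-measure."""
    N = sum(len(r) for r in R)
    h = 1.0 if entropy(S, N) == 0 else 1 - cond_entropy(R, S) / entropy(S, N)
    c = 1.0 if entropy(R, N) == 0 else 1 - cond_entropy(S, R) / entropy(R, N)
    return h, c, (1 + beta**2) * h * c / (beta**2 * h + c)

def variation_of_information(R, S):
    """Equation (13): VI(R, S) = H(S|R) + H(R|S)."""
    return cond_entropy(R, S) + cond_entropy(S, R)

R = [{"a", "b", "d"}, {"c", "e"}]
S = [{"a", "b"}, {"c", "d", "e"}]        # hypothetical gold standard, as above
print(v_measure(R, S))                   # approximately (0.433, 0.433, 0.433)
print(variation_of_information(R, S))    # approximately 0.764
```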
5.2. Generalized Merge Distance

Generalized Merge Distance (GMD) is perhaps the most comprehensive metric, in the sense that it can be used to directly calculate several other metrics [1]. $GMD(R, S)$ is the minimum legal path cost of converting $R$ to $S$, where the costs of splitting and merging sets of records are user-defined, operation-order-independent functions. Many such functions exist, such as $f(x, y) = k$, $f(x, y) = kxy$, and $f(x, y) = k_1 + k_2 xy$, where $x$ and $y$ are the sizes of the record sets to split or merge and $k$ is a constant. We refer the reader to [17] for background on operation-order-independent functions. Menestrina et al. not only show that $GMD(R, S)$ can be computed in linear time, but explicitly show how pairwise precision, recall, $F_1$, and VI can be computed using specific cost functions. Depending on the choice of cost functions, GMD is likely dependent on $N$ (the cost functions used in the VI formulation are one exception) and difficult to compare across datasets of different sizes.
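The following is a sketch of how such a linear-time computation can proceed, in the spirit of (but not copied from) the Slice algorithm of [1]: walk over each cluster of $R$, split it into the pieces destined for different clusters of $S$, then charge merge costs as those pieces are assembled. It assumes $R$ and $S$ partition exactly the same set of records, and `split_cost` and `merge_cost` must be operation-order-independent for the result to equal the minimum path cost.

```python
def gmd(R, S, split_cost, merge_cost):
    """GMD(R, S): total cost of converting R into S via splits and merges."""
    where = {rec: j for j, s in enumerate(S) for rec in s}  # record -> cluster index in S
    assembled = [0] * len(S)   # how much of each S cluster has been built so far
    cost = 0.0
    for r in R:
        # Partition r by the destination cluster of each of its records.
        parts = {}
        for rec in r:
            parts[where[rec]] = parts.get(where[rec], 0) + 1
        # Split r, peeling off one part at a time.
        remaining = len(r)
        for j, p in list(parts.items())[:-1]:
            cost += split_cost(p, remaining - p)
            remaining -= p
        # Merge each part into the S cluster it belongs to.
        for j, p in parts.items():
            if assembled[j]:
                cost += merge_cost(p, assembled[j])
            assembled[j] += p
    return cost

R = [{"a", "b", "d"}, {"c", "e"}]
S = [{"a", "b"}, {"c", "d", "e"}]   # hypothetical gold standard, as above
# Unit costs simply count operations: split off {d}, then merge it into {c, e}.
print(gmd(R, S, lambda x, y: 1, lambda x, y: 1))   # 2.0
# Per [1], specific cost functions recover other metrics; for example, with
# split cost x*y and zero merge cost, 1 - GMD / |Pairs(R)| is pairwise precision.
print(1 - gmd(R, S, lambda x, y: x * y, lambda x, y: 0) / 4)   # 0.5, as in Section 3.1
```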
5.3. Conclusion

Simple examples show that a promising pairwise metric may have poor cluster-level performance [2]. More rigorous analysis shows this is not only possible, but common across a range of applications [1]. At an absolute minimum, we recommend evaluating with pairwise F1 because of its simplicity and popularity. We also recommend the use of a cluster metric and Generalized Merge Distance – which can conveniently be configured to calculate VI and pairwise F1 in linear time.

All the metrics discussed herein rely on the availability of a "gold standard" $S$. In practice, human-labeled results rarely number beyond several thousand samples. On large datasets, a relative gold standard may be obtained by foregoing blocking efficiency and running an exhaustive ER algorithm on the entire database [1]. We note, however, that doing so on databases larger than even 10,000 records is infeasible for some algorithms [7]. Further, an exhaustive approach is still only an approximation and carries no guarantees relative to the true clustering. A need exists for semi- and unsupervised evaluation metrics. Some metrics exist for a very specific subset of circumstances, but for the majority of applications the general research problem is still open [18].

References

[1] D. Menestrina, S. E. Whang, and H. Garcia-Molina, "Evaluating entity resolution results," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 208-219, 2010.
[2] M. Michelson and S. A. Macskassy, "Record linkage measures in an entity centric world," in Proceedings of the 4th Workshop on Evaluation Methods for Machine Learning, 2009.
[3] P. Christen and K. Goiser, "Quality and complexity measures for data linkage and deduplication," in Quality Measures in Data Mining, pp. 127-151, Springer, 2007.
[4] L. Getoor and A. Machanavajjhala, "Entity resolution: theory, practice & open challenges," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2018-2019, 2012.
[5] J. Huang, S. Ertekin, and C. L. Giles, "Efficient name disambiguation for large-scale databases," in Knowledge Discovery in Databases: PKDD 2006, pp. 536-544, Springer, 2006.
[6] B. Wellner, A. McCallum, F. Peng, and M. Hay, "An integrated, conditional model of information extraction and coreference with application to citation matching," in Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 593-601, AUAI Press, 2004.
[7] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom, "Swoosh: a generic approach to entity resolution," The VLDB Journal, vol. 18, no. 1, pp. 255-276, 2009.
[8] A. Solomonoff, A. Mielke, M. Schmidt, and H. Gish, "Clustering speakers by their voices," in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 757-760, IEEE, 1998.
[9] J. Ajmera, H. Bourlard, and I. Lapidot, "Improved unknown-multiple speaker clustering using HMM," tech. rep., 2002.
[10] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, vol. 1, Cambridge University Press, 2008.
[11] A. Rosenberg and J. Hirschberg, "V-Measure: A conditional entropy-based external cluster evaluation measure," in EMNLP-CoNLL, vol. 7, pp. 410-420, 2007.
[12] H. Maidasani, G. Namata, B. Huang, and L. Getoor, "Entity resolution evaluation measures," 2012.
[13] M. Vilain, J. Burger, J. Aberdeen, D. Connolly, and L. Hirschman, "A model-theoretic coreference scoring scheme," in Proceedings of the 6th Conference on Message Understanding, pp. 45-52, Association for Computational Linguistics, 1995.
[14] A. Bagga and B. Baldwin, "Algorithms for scoring coreference chains," in The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, vol. 1, pp. 563-566, 1998.
[15] X. Luo, "On coreference resolution performance metrics," in Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 25-32, Association for Computational Linguistics, 2005.
[16] M. Meilă, "Comparing clusterings by the variation of information," in Learning Theory and Kernel Machines, pp. 173-187, Springer, 2003.
[17] M. Hosszú, "On the functional equation F(x+y,z)+F(x,y)=F(x,y+z)+F(y,z)," Periodica Mathematica Hungarica, vol. 1, no. 3, pp. 213-216, 1971.
[18] W. E. Winkler, "Overview of record linkage and current research directions," Bureau of the Census, 2006.