Principled Graph Matching Algorithms for Integrating Multiple Data Sources
Authors: Duo Zhang, Benjamin I. P. Rubinstein, Jim Gemmell
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. X, X 2014

Abstract—This paper explores combinatorial optimization for problems of max-weight graph matching on multi-partite graphs, which arise in integrating multiple data sources. Entity resolution—the data integration problem of performing noisy joins on structured data—typically proceeds by first hashing each record into zero or more blocks, scoring pairs of records that are co-blocked for similarity, and then matching pairs of sufficient similarity. In the most common case of matching two sources, it is often desirable for the final matching to be one-to-one (a record may be matched with at most one other); members of the database and statistical record linkage communities accomplish such matchings in the final stage by weighted bipartite graph matching on similarity scores. Such matchings are intuitively appealing: they leverage a natural global property of many real-world entity stores—that of being nearly deduped—and are known to provide significant improvements to precision and recall. Unfortunately, unlike the bipartite case, exact max-weight matching on multi-partite graphs is known to be NP-hard. Our two-fold algorithmic contributions approximate multi-partite max-weight matching: our first algorithm borrows optimization techniques common to Bayesian probabilistic inference; our second is a greedy approximation algorithm. In addition to a theoretical guarantee on the latter, we present comparisons on a real-world ER problem from Bing significantly larger than typically found in the literature, on publication data, and on a series of synthetic problems.
Our results quantify significant improvements due to exploiting multiple sources, which are made possible by global one-to-one constraints linking otherwise independent matching sub-problems. We also discover that our algorithms are complementary: one being much more robust under noise, and the other being simple to implement and very fast to run.

Index Terms—Data integration, weighted graph matching, message-passing algorithms

Research performed by the authors while at Microsoft Research.
• D. Zhang is with Twitter, Inc., USA.
• B. Rubinstein is with the University of Melbourne, Australia.
• J. Gemmell is with Trōv, USA.
E-mail: dzhang@twitter.com, ben@bipr.net, jim.gemmell@gmail.com

1 INTRODUCTION

It has long been recognized—and explicitly discussed recently [13]—that many real-world entity stores are naturally free of duplicates. Were they to have replicate entities, crowd-sourced sites such as Wikipedia would have edits applied to one copy and not others. Sites that rely on ratings such as Netflix and Yelp would suffer diluted recommendation value by having ratings split over multiple instantiations of the same product or business page. And customers of online retailers such as Amazon would miss out on lower prices, options on new/used condition, or shipping arrangements offered by sellers surfaced on duplicate pages. Many publishers have natural incentives that drive them to deduplicate, or maintain uniqueness, in their databases.

Dating back to the late 80s in the statistical record linkage community and more recently in database research [13], [14], [16], [17], [28], [32], a number of entity resolution (ER) systems have successfully employed one-to-one graph matching for leveraging this natural lack of duplicates. Initially the benefit of this approach was taken for granted, but in preliminary recent work [13] significant improvements to precision and recall due to this approach have been quantified. The reasons are intuitively clear: data noise or deficiencies of the scoring function can lead to poor scores, which can negatively affect ER accuracy; graph matching, however, corresponds to imposing a global one-to-one constraint which effectively smoothes local noise. This kind of bipartite one-to-one matching for combining a pair of data sources is both well known to many in the community and widely applicable across data integration problems, as it can be used to augment numerous existing systems—e.g., [3], [5], [8], [10]–[12], [18], [20], [21], [30], [31], [35], [37], [38]—as a generic stage following data preparation, blocking, and scoring.

Another common principle of data integration is that of improved user utility from fusing multiple sources of data. But while many papers have circled around the problem, very little is known about extending one-to-one graph matching to practical multi-partite matching for ER. In the theory community exact max-weight matching is known to be NP-hard [9], and in the statistics community expectation maximization has recently been applied to approximate the solution successfully for ER, but inefficiently, with exponential computational requirements [28].

In this paper we propose principled approaches for approximating multi-partite weighted graph matching. Our first approach is based on message-passing algorithms typically used for inference on probabilistic graphical models but used here for combinatorial optimization. Through a series of non-trivial approximations we derive an approach more efficient than the leading statistical record linkage work of Sadinle et al. [28]. Our second approach extends the well-known greedy approximation to bipartite max-weight matching to the multi-partite case. While less sophisticated than our message-passing algorithm, the greedy approach enjoys an easily-implementable design and a worst-case 2-approximation competitive ratio (cf. Theorem 4.1).

Fig. 1. Multi-source ER is central to (left) Bing's, (middle) Google's, and (right) Yahoo!'s movies verticals for implementing entity actions such as "buy", "rent", "watch", "buy ticket" and surfacing meta-data, showtimes and reviews. In each case data on the same movie is integrated from multiple noisy sources.

The ability to leverage one-to-one constraints when performing multi-source data integration is of great economic value: the systems studied here have made significant impact in production use, for example within several Bing verticals and the Xbox TV service, driving critical customer-facing features (cf. e.g., Figure 1). We demonstrate on data taken from these real-world services that our approaches enjoy significantly improved precision/recall over the state-of-the-art unconstrained approach, and that the addition of sources yields further improvements due to global constraints. This experimental study is of atypical value owing to its unusually large scale: compared to the largest of the four datasets used in the recent well-cited evaluation study of Köpcke et al. [22], our problem is three orders of magnitude larger.1 We conduct a second experimental comparison on a smaller publication dataset to demonstrate generality. Finally we explore the robustness of our approaches to varying degrees of edge weight noise via synthetic data.

In summary, our main contributions are:
1) A principled factor-graph message-passing algorithm for generic, globally-constrained multi-source data integration;
2) An efficient greedy approach that comes with a sharp, worst-case guarantee (cf.
Theorem 4.1);
3) A counterexample to the common misconception that sequential bipartite matching is a sufficient approach to joint multi-partite matching (cf. Example 3.3);
4) Experimental comparisons on a very large real-world dataset demonstrating that generic one-to-one matching leverages naturally-deduplicated data to significantly improve precision/recall;
5) Validation that our new approaches can appropriately leverage information from multiple sources to improve ER precision and recall—further supported by a second smaller real-world experiment on publication data; and
6) A synthetic-data comparison under which our message-passing approach enjoys superior robustness to noise over the faster greedy approach.

1. While their largest problem contains 1.7 × 10^8 pairs, ours measures in at 1.6 × 10^11 between just two of our sources. For data sources measuring in the low thousands, as in their other benchmark problems and as is typical in many papers, purely crowd-sourcing ER systems such as [23] could be used for mere tens of dollars.

2 RELATED WORK

Numerous past works have studied entity resolution [3], [18], [20], [21], [31], [38], statistical record linkage [8], [12], [35], [37] and deduplication [5], [10], [11], [30]. There has been little previous work investigating one-to-one bipartite resolutions and almost no results on constrained multi-source resolution.

Some works have looked at constraints in general [1], [6], [32] but they do not focus on the one-to-one constraint. Su et al. do rely on the rarity of duplicates in a given web source [33] but use it only to generate negative examples. Guo et al. studied a record linkage approach based on uniqueness constraints on entity attribute values. In many domains, however, this constraint does not always hold.
For example, in the movie domain, each movie entity could have multiple actor names as its attribute. Jaro produced early work [17] on linking census data records using a linear program formulation that enforces a global one-to-one constraint; however, no justification or evaluation of the formulation is offered, and the method is only defined for resolution over two sources. Similarly for more recent work in databases, such as in conflating web tables without duplicates [16] and the notion of exclusivity constraints [14]. We are the first to test and quantify the use of such a global constraint on entities and apply it in a systematic way, building on earlier bipartite work [13].

A more recent study [28] examines the record linkage problem in the multiple-sources setting. The authors use a probabilistic framework to estimate the probability of whether a pair of entities is a match. However, the complexity of their algorithm is O(B_m n^m), where m is the number of sources, n is the number of instances, and B_m is the m-th Bell number, which is exponential in m. Such prohibitive computational complexity prevents the principled approach from being practical. In our message-passing algorithm, the optimization approach has far better complexity, and works well on real entity resolution problems with millions of entities; at the same time our approach is also developed in a principled fashion. Our message-passing algorithm is the first principled, tractable approach to constrained multi-source ER.

In the bipartite setting, maximum-weight matching is now well understood, with exact algorithms able to solve the problem in polynomial time [25]. When it comes to multi-partite maximum-weight matching, the problem becomes extremely hard. In particular, Crama and Spieksma [9] proved that a special case of the tripartite matching problem is NP-hard, implying that the general multi-partite max-weight matching problem is itself NP-hard.
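The Bell-number factor B_m cited above grows super-exponentially in the number of sources. A small sketch (our own illustration, not code from the paper) via the Bell triangle recurrence makes this concrete: even m = 10 sources already contribute a factor of 115975.

```python
def bell_numbers(count):
    """First `count` Bell numbers B_0, B_1, ... via the Bell triangle:
    each row starts with the last entry of the previous row, and each
    subsequent entry adds the entry above it."""
    bells, row = [], [1]
    for _ in range(count):
        bells.append(row[0])
        nxt = [row[-1]]
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return bells

# B_0 .. B_10: growth that dominates the O(B_m n^m) bound of [28].
assert bell_numbers(11) == [1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975]
```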
The algorithms presented in this paper are both approximations: one greedy approach that is fast to run, and one principled approach that empirically yields higher total weights and improved precision and recall in the presence of noise. Our greedy approach is endowed with a competitive ratio generalizing the well-known guarantee for the bipartite case.

Previously, Bayesian machine learning methods have been applied to entity resolution [4], [10], [27], [36]; however, our goal here is not to perform inference with these methods but rather optimization. Along these lines, in recent years message-passing algorithms have been used for the maximum-weight matching problem in bipartite graphs [2]. There the authors proved that the max-product algorithm converges to the desirable optimum point in finding the maximum-weight matching for a bipartite graph, even in the presence of loops. In our study, we have designed a message-passing algorithm targeting the maximum-weight matching problem in multi-partite graphs. In a specific case of our problem, i.e., maximum-weight matching in a bipartite graph, our algorithm works as effectively as exact methods. Another recent work [29] studied the weighted-matching problem in general graphs, whose problem definition differs from our own. It shows that max-product converges to the correct answer if the linear programming relaxation of the weighted-matching problem is tight. Compared to this work, we pay special attention to the application to multi-source entity resolution, tune the general message-passing algorithms specifically for our model, and perform large-scale data integration experiments on real data.

3 THE MPEM PROBLEM

We begin by formalizing the generic data integration problem on multiple sources. Let D_1, D_2, ..., D_m be m databases, each of which contains representations of a finite number of entities along with a special null entity φ.
The database sizes need not be equal.

Definition 3.1: Given D_1, D_2, ..., D_m, the Multi-Partite Entity Resolution problem is to identify an unknown target relation or resolution R ⊆ D_1 × D_2 × ··· × D_m, given some information about R, such as pairwise scores between entities, examples of matching and non-matching pairs of entities, etc.

For example, for databases D_1 = {e_1, φ} and D_2 = {e'_1, e'_2, φ}, the possible mappings are D_1 × D_2 = {e_1 ↔ e'_1, e_1 ↔ e'_2, e_1 ↔ φ, φ ↔ e'_1, φ ↔ e'_2, φ ↔ φ}. Here, φ appearing in the i-th component of a mapping r means that r does not involve any entity from source i. A resolution R, which represents the true matchings among entities, is some specific subset of all possible mappings among entities in the m databases. A global one-to-one constraint on multi-partite entity resolution asserts that the target R is pairwise one-to-one: for each non-null entity x_i ∈ D_i and all sources j ≠ i, there exists at most one entity x_j ∈ D_j such that x_i and x_j are involved together in one tuple of R.

Definition 3.2: A Multi-Partite Entity Matching (MPEM) problem is a multi-partite entity resolution problem with the global one-to-one constraint.

We previously showed experimentally [13] that leveraging the one-to-one constraint when performing entity resolution across two sources yields significantly improved precision and recall—a fact long known anecdotally in the databases and statistical record linkage communities. For example, if we know The Lord of the Rings I in IMDB matches with The Lord of the Rings I in Netflix, then the one-to-one property precludes the possibility of this IMDB movie resolving with The Lord of the Rings II in Netflix. The present paper focuses on the MPEM problem and methods for exploiting the global constraint when resolving across multiple sources.
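For concreteness, the pairwise one-to-one property of Definition 3.2 can be checked mechanically. The following minimal sketch (our own illustration; the function name and the use of None for φ are assumptions, not from the paper) validates a candidate resolution:

```python
def is_one_to_one(resolution, m):
    """Check the pairwise one-to-one property of Definition 3.2.

    `resolution` is a list of m-tuples, one slot per source, with None
    standing in for the null entity phi. For every ordered pair of
    sources (i, j), each non-null entity of source i may co-occur with
    at most one entity of source j.
    """
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            partner = {}
            for t in resolution:
                if t[i] is not None and t[j] is not None:
                    if partner.setdefault(t[i], t[j]) != t[j]:
                        return False
    return True

# e1 <-> e'1 with e'2 left unmatched satisfies the constraint ...
assert is_one_to_one([("e1", "e'1"), (None, "e'2")], 2)
# ... but e1 resolving with both entities of source 2 violates it.
assert not is_one_to_one([("e1", "e'1"), ("e1", "e'2")], 2)
```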
In particular, we are interested in global methods which perform entity resolution across all sources simultaneously. In the experimental section, we will show that matching multiple sources together achieves superior resolution as compared to matching two sources individually (equivalently, pairwise matching of multiple sources).

3.1 The Duplicate-Free Assumption

As discussed in Section 1, many real-world data sources are naturally de-duplicated due to various socio-economic incentives. There we argued, by citing significant online data sources such as Wikipedia, Amazon, Netflix, and Yelp, that
• Crowd-sourced site duplicates will diverge, with edits applied to one copy and not the other;
• Sites relying on ratings suffer from duplicates, as recommendations are made on too little data; and
• Attributes will be split by duplicates, for example alternate prices or shipping options of products.

Indeed, in our past work we quantified that the data used in this paper, coming (raw and untouched) from major online movie sources, are largely duplicate free, with estimated levels of 0.1% duplicates. Our publication dataset is similar in terms of duplicates.

Finally, consider matching and merging multiple sources. Rather than taking a union of all sources and then performing matching by deduplication on the union, by instead (1) deduplicating each source and then (2) matching, we may focus on the heterogeneous data characteristics of the sources individually. Recognizing the importance of data variability aids matching accuracy when scaling to many sources [24].

3.2 Why MPEM Requires Global Methods

One may naively attempt one of two natural approaches for the MPEM problem: (1) threshold the overall scores to produce many-to-many resolutions; or (2) 1-to-1 resolve the sources in a sequential manner by iteratively performing two-source entity resolution. The disadvantages of the first have been explored previously [13].
The order of sequential bipartite matching may adversely affect overall results: if we first resolve two poor quality sources, then the poor resolution may propagate to degrade subsequent resolutions. We could undo earlier matchings in possibly cascading fashion, but this would essentially reduce to global multi-partite matching. The next example demonstrates the inferiority of sequential matching.

Example 3.3: Take a weighted tri-partite graph with nodes a_1, b_1, c_1 (source 1), a_2, b_2, c_2 (source 2), and a_3, b_3, c_3 (source 3). The true matching is {{a_1, a_2, a_3}, {b_1, b_2, b_3}, {c_1, c_2, c_3}}. Each pair of sources is completely connected, with weights: 0.6 for all pairs between sources 1 and 2, except 0.5 between {a_1, a_2} and 1 between {c_1, c_2}; 1 for records truly matching between source 1 or 2 and source 3; 0.1 for records not truly matching between source 1 or 2 and source 3. When bipartite matching between sources 1 and 2 and then source 3 (preserving transitivity in the final matching), sequential matching achieves weight 6.4 while global matching achieves 8.1.

Sequential matching can produce very inferior overall matchings due to being locked into poor early decisions, i.e., local optima; this is compounded by poor quality data. In contrast, global matching looks ahead. As a result we focus on global approaches.

4 ALGORITHMS

We now develop two strategies for solving MPEM problems. The first is to approximately maximize the total weight of the matching by co-opting optimization algorithms popular in Bayesian inference. The second approach is to iteratively select the best matching for each entity locally while not violating the one-to-one constraint, yielding a simple greedy approach.

4.1 Message-Passing Algorithm

As discussed in Section 2, exact multi-partite maximum-weight matching is NP-hard. Therefore we develop an approximate yet principled maximum-weight strategy for MPEM problems.
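Returning to Example 3.3, its sequential and global totals can be checked by brute force. The sketch below (our own illustration, restricted to full transitive matchings) reproduces both figures:

```python
from itertools import permutations

# Example 3.3: indices 0, 1, 2 stand for entities a, b, c in each source.
w12 = [[0.5, 0.6, 0.6],
       [0.6, 0.6, 0.6],
       [0.6, 0.6, 1.0]]

def w_to_3(x, y):
    # Weight between source 1 or 2 and source 3: 1 for true matches.
    return 1.0 if x == y else 0.1

idx = range(3)

# Global: brute force over all transitive tri-partite matchings, where
# p maps source 1 to source 2 and q maps source 1 to source 3.
global_best = max(
    sum(w12[i][p[i]] + w_to_3(i, q[i]) + w_to_3(p[i], q[i]) for i in idx)
    for p in permutations(idx) for q in permutations(idx))

# Sequential: the optimal 1-2 bipartite matching first ...
p12 = max(permutations(idx), key=lambda p: sum(w12[i][p[i]] for i in idx))
# ... then optimally attach source 3 to the resulting 1-2 cliques.
attach = max(
    sum(w_to_3(i, q[i]) + w_to_3(p12[i], q[i]) for i in idx)
    for q in permutations(idx))
sequential = sum(w12[i][p12[i]] for i in idx) + attach

assert abs(global_best - 8.1) < 1e-9   # global matching weight
assert abs(sequential - 6.4) < 1e-9    # sequential matching weight
```

The sequential stage gets locked into the 1-2 matching {a_1, b_2}, {b_1, a_2}, {c_1, c_2} (weight 2.2), which then forfeits the high-weight edges into source 3.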
Without loss of generality, we use tri-partite matching from time to time for illustrative purposes.

We make use of the max-sum algorithm that drives maximum a posteriori probabilistic inference on Bayesian graphical models [19]. Bipartite factor graphs model random variables and a joint likelihood on them via variable nodes and factor nodes, respectively. Factor nodes correspond to a factorization of the unnormalized joint likelihood, and are connected to variable nodes iff the corresponding function depends on the corresponding variable. Max-sum maximizes the sum of these functions (i.e., MAP inference in log space), and decomposes via dynamic programming to an iterative message-passing algorithm on the graph. In this paper we formulate the MPEM problem as a maximization of a sum of functions over variables, naturally reducing to max-sum over a graph.

4.1.1 Factor Graph Model for the MPEM Problem

The MPEM problem can be represented by a graphical model, shown in part in Figure 2 for tri-partite graph matching. In this model, C_{ijk} represents a variable node, where i, j, k are three entities from three different sources. We use x_{ijk} to denote the variable associated with graph node C_{ijk}. This Boolean-valued variable selects whether the entities are matched or not. If x_{ijk} is equal to 1, all three entities are matched; otherwise, at least two of these three entities are not matched. Since we will not always find a matching entity from every source, we use φ to indicate the case of no matching entity in a particular source. For example, if variable x_{i,j,φ} is equal to 1, then i and j are matched without a matching entity from the third source.

Fig. 2. Graphical model for the MPEM problem.

Figure 3 shows all the factor nodes connected with a variable node C_{ijk}.
There are two types of factor nodes: (1) a similarity factor node S_{ijk}, which represents the total weight from matching i, j, k; and (2) three constraint factor nodes E^1_i, E^2_j, and E^3_k, which represent one-to-one constraints on entity i from source 1, entity j from source 2, and entity k from source 3, respectively. The similarity factor node connects to only a single variable node, but the constraint factor nodes connect with multiple variable nodes.

Fig. 3. Variable node's neighboring factor nodes: C_{ijk} connects to S_{ijk}, E^1_i, E^2_j, and E^3_k.

The concrete definitions of the factor nodes are shown in Figure 4. The definition of the similarity function is easy: if a variable node is selected, the function's value is the total weight between all pairs of entities within the variable node; otherwise, the similarity function is 0. Each constraint E is defined on a "plane" of variables which are related to one specific entity from a specific source. For example, as shown in Figure 2, the factor E^1_i is defined on all the variables related to entity i in source 1. It evaluates to 0 iff the sum of the values of all the variables in the plane is equal to 1, which means exactly one variable can be selected as a matching from the plane. Otherwise, the function evaluates to negative infinity, penalizing the violation of the global one-to-one constraint. Thus these factor nodes serve as hard constraints enforcing global one-to-one matching. Note that there is no factor E^1_φ defined on the variables related to entity φ from a source, because we allow the entity φ to match with multiple different entities from other sources.

$$S_{ijk}(x_{ijk}) = \begin{cases} s_{ij} + s_{jk} + s_{ik}, & x_{ijk} = 1 \\ 0, & \text{otherwise} \end{cases}$$

$$E^1_i(x_{i11}, \ldots, x_{i1\phi}, \ldots, x_{ijk}, \ldots, x_{i\phi\phi}) = \begin{cases} 0, & x_{i11} + \cdots + x_{i1\phi} + \cdots + x_{ijk} + \cdots + x_{i\phi\phi} = 1 \\ -\infty, & \text{otherwise} \end{cases}$$

$$E^2_j(x_{1j1}, \ldots, x_{1j\phi}, \ldots, x_{ijk}, \ldots, x_{\phi j\phi}) = \begin{cases} 0, & x_{1j1} + \cdots + x_{1j\phi} + \cdots + x_{ijk} + \cdots + x_{\phi j\phi} = 1 \\ -\infty, & \text{otherwise} \end{cases}$$

$$E^3_k(x_{11k}, \ldots, x_{1\phi k}, \ldots, x_{ijk}, \ldots, x_{\phi\phi k}) = \begin{cases} 0, & x_{11k} + \cdots + x_{1\phi k} + \cdots + x_{ijk} + \cdots + x_{\phi\phi k} = 1 \\ -\infty, & \text{otherwise} \end{cases}$$

Fig. 4. Definitions of factor nodes' potential functions displayed in Figure 3.

Assuming all functions defined in Figure 4 are in log space, their sum corresponds to the objective function

$$f = \sum_{i,j,k} S_{ijk}(x_{ijk}) + \sum_{s=1}^{3} \sum_{t=1}^{N_s} E^s_t, \qquad (1)$$

where N_s is the number of entities in source s. Maximizing this sum is equivalent to one-to-one max-weight matching, since the constraint factors are barrier functions. The optimizing assignment is the configuration of all variable nodes, either 0 or 1, which maximizes the objective function as desired.

4.1.2 Optimization Approach

The general max-sum algorithm iteratively passes messages in a factor graph by Eq. (2) and Eq. (3) below. In the equations, n(C) represents the neighboring factor nodes of a variable node C, and n(f) represents the neighboring variable nodes of a factor node f. μ_{f→C}(x) defines messages passing from a factor node f to a variable node C, while μ_{C→f}(x) defines messages in the reverse direction. Each message is a function of the corresponding variable x. For example, if x is a binary variable, there are in total two messages passing in one direction: μ_{f→C}(0) and μ_{f→C}(1).

$$\mu_{f \to C}(x) = \max_{x_1, \ldots, x_n} \left[ f(x, x_1, \ldots, x_n) + \sum_{C_i \in n(f) \setminus \{C\}} \mu_{C_i \to f}(x_i) \right] \qquad (2)$$

$$\mu_{C \to f}(x) = \sum_{h \in n(C) \setminus \{f\}} \mu_{h \to C}(x) \qquad (3)$$

As it stands, this solution is computationally expensive. Suppose there are m sources, each with an average of n entities; then in each iteration there are in total 4mn^m messages to be updated: only messages between variable nodes and constraint factor nodes need to be updated.
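To see the barrier behavior of Figure 4 concretely, the following sketch (our own illustration; the 0.8 pairwise scores are assumptions, not from the paper) evaluates the objective of Eq. (1) on a toy tri-partite instance. An assignment that selects one non-null entity's plane more than once drives the objective to negative infinity:

```python
import math
from itertools import product

# Toy tri-partite instance: one real entity (1) per source, plus phi (None).
entities = [1, None]

def pair_score(a, b):
    # Assumed pairwise similarity s: 0.8 between real entities, 0 with phi.
    return 0.8 if (a is not None and b is not None) else 0.0

def similarity(i, j, k, selected):
    # S_ijk of Figure 4: total pairwise weight if the triple is selected.
    return (pair_score(i, j) + pair_score(j, k) + pair_score(i, k)) if selected else 0.0

def constraint(plane):
    # E of Figure 4: 0 iff exactly one variable in the plane is selected.
    return 0.0 if sum(plane) == 1 else -math.inf

def objective(x):
    # Eq. (1): similarity factors plus a constraint barrier for every
    # non-phi entity's plane in every source.
    f = sum(similarity(i, j, k, x[i, j, k])
            for i, j, k in product(entities, repeat=3))
    for s in range(3):
        f += constraint([x[t] for t in product(entities, repeat=3) if t[s] == 1])
    return f

zeros = {t: 0 for t in product(entities, repeat=3)}
valid = {**zeros, (1, 1, 1): 1}                        # match all three
invalid = {**zeros, (1, 1, None): 1, (1, None, 1): 1}  # source-1 entity reused

assert abs(objective(valid) - 2.4) < 1e-9  # 3 pairs at 0.8 each
assert objective(invalid) == -math.inf     # barrier fires
```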
There are a total of n^m variable nodes and each has m constraint factor nodes connecting with it. There are 4 messages between each factor node and the variable node (2 messages in each direction), so there are in total 4m messages per variable node.

Affinity Propagation. Instead of calculating this many messages, we employ the idea of Affinity Propagation [15]. The basic intuition is: instead of passing two messages in each direction, either μ_{f→C}(x) or μ_{C→f}(x), we only need to pass the difference between these two messages, i.e., μ_{f→C}(1) − μ_{f→C}(0) and μ_{C→f}(1) − μ_{C→f}(0); at the end of the max-sum algorithm, all we really need to know is which configuration of a variable x (0 or 1) yields the larger result.

The concrete calculation, using tri-partite matching to illustrate, is as follows. Let

$$\beta^1_{ijk} = \mu_{C_{ijk} \to E^1_i}(1) - \mu_{C_{ijk} \to E^1_i}(0), \qquad \alpha^1_{ijk} = \mu_{E^1_i \to C_{ijk}}(1) - \mu_{E^1_i \to C_{ijk}}(0).$$

Since

$$\mu_{C_{ijk} \to E^1_i}(0) = \mu_{E^2_j \to C_{ijk}}(0) + \mu_{E^3_k \to C_{ijk}}(0) + \mu_{S_{ijk} \to C_{ijk}}(0)$$
$$\mu_{C_{ijk} \to E^1_i}(1) = \mu_{E^2_j \to C_{ijk}}(1) + \mu_{E^3_k \to C_{ijk}}(1) + \mu_{S_{ijk} \to C_{ijk}}(1),$$

the difference β^1_{ijk} between these two messages can be calculated by:

$$\beta^1_{ijk} = \alpha^2_{ijk} + \alpha^3_{ijk} + S_{ijk}(1). \qquad (4)$$

In the other direction, since

$$\mu_{E^1_i \to C_{ijk}}(1) = \max_{x_{i11}, \ldots, x_{i\phi\phi} \setminus \{x_{ijk}\}} \left[ E^1_i(x_{ijk} = 1, x_{i11}, \ldots, x_{i\phi\phi}) + \sum_{a_2 a_3 \neq jk} \mu_{C_{ia_2a_3} \to E^1_i}(x_{ia_2a_3}) \right],$$

in order to attain the max value for the message, all the variables except x_{ijk} in E^1_i must be zero because of the constraint function. Therefore

$$\mu_{E^1_i \to C_{ijk}}(1) = \sum_{a_2 a_3 \neq jk} \mu_{C_{ia_2a_3} \to E^1_i}(0).$$

Similarly, since

$$\mu_{E^1_i \to C_{ijk}}(0) = \max_{x_{i11}, \ldots, x_{i\phi\phi} \setminus \{x_{ijk}\}} \left[ E^1_i(x_{ijk} = 0, x_{i11}, \ldots, x_{i\phi\phi}) + \sum_{a_2 a_3 \neq jk} \mu_{C_{ia_2a_3} \to E^1_i}(x_{ia_2a_3}) \right],$$

to attain the max message, exactly one of the variables other than x_{ijk} in E^1_i must be 1, i.e.,

$$\mu_{E^1_i \to C_{ijk}}(0) = \max_{b_2 b_3 \neq jk} \left[ \mu_{C_{ib_2b_3} \to E^1_i}(1) + \sum_{a_2 a_3 \neq jk,\; a_2 a_3 \neq b_2 b_3} \mu_{C_{ia_2a_3} \to E^1_i}(0) \right].$$

Therefore, subtracting μ_{E^1_i→C_{ijk}}(0) from μ_{E^1_i→C_{ijk}}(1), we obtain the update formula for α^1_{ijk}:

$$\alpha^1_{ijk} = \min_{b_2 b_3 \neq jk} \left\{ -\beta^1_{ib_2b_3} \right\}. \qquad (5)$$

If i = φ, since there is no constraint function E^1_φ, both β^1_{ijk} and α^1_{ijk} are equal to 0.

The update rules of Eq.'s (4) and (5) generalize easily to m-partite matching:

$$\beta^i_{x_1 x_2 \ldots x_m} = \sum_{j \neq i} \alpha^j_{x_1 x_2 \ldots x_m} + S_{x_1 x_2 \ldots x_m}(1) \qquad (6)$$

$$\alpha^i_{x_1 x_2 \ldots x_m} = \min_{\ldots b_{i-1} b_{i+1} \ldots \neq \ldots x_{i-1} x_{i+1} \ldots} \left\{ -\beta^i_{\ldots b_{i-1} x_i b_{i+1} \ldots} \right\}. \qquad (7)$$

Achieving exponentially fewer messages. By message passing with α, β instead of μ, the total number of message updates per iteration is reduced by half, down to 2mn^m, comprising mn^m each of α and β messages. This number is still exponential in m. However, if we look at Eq. (5) for α messages, the formula can be interpreted as Eq. (8) below, where j*k* is the minimizer of Eq. (5). This shows that for a fixed entity i in source 1 there are actually only two α values across all combinations of j and k, and only the minimizing combination j*k* attains the different value, which is the second minimum (we call the second minimum the exception value for α^1_{ijk}, and the minimum the normal value). In other words, there are actually only 2mn of the α messages instead of mn^m.

$$\alpha^1_{ijk} = \begin{cases} -\beta^1_{ij^*k^*}, & \text{for all } jk \neq j^*k^* \\ \operatorname{second\,min}_{b_2 b_3} \left\{ -\beta^1_{ib_2b_3} \right\}, & \text{for } jk = j^*k^* \end{cases} \qquad (8)$$

Using this observation, we replace the β messages in Eq. (5) with α messages using Eq. (4), to yield:

$$\alpha^1_{ijk} = -\max_{b_2 b_3 \neq jk} \left\{ \alpha^2_{ib_2b_3} + \alpha^3_{ib_2b_3} + S_{ib_2b_3}(1) \right\}. \qquad (9)$$

We can similarly derive α update functions for the other sources.
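The two-value structure of Eq. (8) is easy to sketch in isolation (our own illustration, flattening one entity's plane of −β values into a list): all leave-one-out minima of Eq. (5) can be read off from just the minimum and second minimum.

```python
def alpha_plane(neg_beta):
    """Leave-one-out minima via Eq. (8): every slot takes the global
    minimum (the "normal value") except the minimizer slot, which takes
    the second minimum (the "exception value")."""
    order = sorted(range(len(neg_beta)), key=neg_beta.__getitem__)
    lo, normal, exception = order[0], neg_beta[order[0]], neg_beta[order[1]]
    return [exception if s == lo else normal for s in range(len(neg_beta))]

neg_beta = [0.3, -1.2, 0.7, -0.4]  # -beta values over one entity's plane
fast = alpha_plane(neg_beta)
# Naive Eq. (5): an explicit leave-one-out minimum per slot.
naive = [min(v for t, v in enumerate(neg_beta) if t != s)
         for s in range(len(neg_beta))]
assert fast == naive  # [-1.2, -0.4, -1.2, -1.2]
```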
As shown in the following sections, the final resolution depends only on α messages; therefore we calculate only α messages and keep updating them iteratively until convergence or until a maximum number of iterations is reached.

4.1.3 Stepwise Approximation

In Eq. (9) we need to find the optimizer, which in the general case costs at least O(m^2 n^{m−1}) per α message, because we need to consider all combinations of b_2 and b_3 in the similarity function S_{ib_2b_3}. To reduce this complexity, we use a stepwise approach (operating like Gibbs sampling) to find the optimizer. The basic idea is: suppose we want to compute the optimum combination j*k* in Eq. (9); we first fix k_1 and find the optimum j_1 which optimizes the right-hand side of Eq. (9) (in the general case, we fix candidate entities in all sources except one). Also, since k_1 is fixed, the similarity s_{ik_1} between i and k_1 contributes a constant to S_{ib_2k_1}(1) and need not be considered, as shown in Eq. (10):

$$j_1 = \operatorname{argmax}_{b_2} \left\{ \alpha^2_{ib_2k_1} + \alpha^3_{ib_2k_1} + S_{ib_2k_1}(1) \right\} = \operatorname{argmax}_{b_2} \left\{ \alpha^2_{ib_2k_1} + \alpha^3_{ib_2k_1} + s_{ib_2} + s_{b_2k_1} \right\} \qquad (10)$$

From Eq. (10), we see that this step requires only O(mn) computation, where m is the number of sources (m = 3 in the current example), because the similarities among the fixed candidate entities need not be recomputed. After we compute j_1, we fix it and compute the optimizer k_2 for the third source. We continue this stepwise update until the optimum value no longer changes. Since the optimum value always increases, this stepwise computation is guaranteed to converge. In the general case the computation is similar, and the complexity is reduced to O(mnT), where T is the number of steps. In practice, we want the starting combination to be good enough so that the number of steps needed to converge is small.
So, we always select those entities which have high similarities with entity i as the initial combination. Also, to avoid getting stuck in a local optimum, we begin from several different starting points in search of the global optimum. We select the top two most similar entities to entity i from each source, and then choose L combinations among them as the starting points. Therefore, the complexity of finding the optimum becomes O(mnTL). The second optimum is then chosen among the outputs of each of the T steps within the L starting points, requiring another O(TL) computation.

4.1.4 Final Selection

We iteratively update all α messages round by round until only a small portion (e.g., less than 1%) of the α's are changing. In the final selection step, the general max-sum algorithm assigns the value of each variable by the following formula:

$$x_C = \operatorname{argmax}_x \sum_{h \in n(C)} \mu_{h \to C}(x)$$

The difference between the two configurations of x_{C_{i_1,\ldots,i_m}} is:

$$Q(x_C) = \sum_{h \in n(C)} \mu_{h \to C}(1) - \sum_{h \in n(C)} \mu_{h \to C}(0) = \sum_{j=1}^{m} \alpha^j_{i_1,\ldots,i_m} + S_{i_1,\ldots,i_m}(1)$$

This means the final selection depends only on α messages. Therefore, in our algorithm the configuration of the values is:

$$x_{i_1,\ldots,i_m} = \begin{cases} 1, & \sum_{j=1}^{m} \alpha^j_{i_1,\ldots,i_m} + S_{i_1,\ldots,i_m}(1) \ge 0 \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$

While we employ heuristics to improve the efficiency of this step, we omit the details as they are technical and follow the general approach laid out above.

4.1.5 Complexity Analysis

Suppose there are m sources, each having n entities. The general max-sum algorithm of Eq.'s (2) and (3) needs to compute at least O(mn^m) messages per iteration; moreover, by Eq. (2), each message itself requires at least O(mn^{m−1}) computation. The cost is significant when n ≈ 10^6 and m > 2.

The complexity of our algorithm can be decomposed into two parts. During message passing, for each iteration we update O(nm) α messages.
For each α message, we first need to find the top-2 candidates in the other m − 1 sources, which requires O((m − 1)n) time. Then we find the optimum starting from L combinations within those candidates, which requires O(mnTL) time. Sorting the candidate array requires O(mn log(mn)) time. A favorable property of message-passing algorithms is that the computation of messages can be parallelized; we leave this for future work.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. X, X 2014

4.2 Greedy Algorithm

The greedy approach to bipartite max-weight matching is simple and efficient, so we explore its natural extension to multi-partite graphs.

Algorithm. The algorithm first sorts all pairs of entities by their scores, discarding those falling below a threshold θ. It then steps through the remaining pairs in order. When a pair x_i, y_j is encountered with both entities so far unmatched, the pair is matched in the resolution: the two entities form a new clique {x_i, y_j}. If either of x_i, y_j is already matched with other entities in a clique, we examine every entity in the two cliques (a single entity is regarded as a singleton clique). If all the entities derive from different sources, the two cliques are merged into a larger clique, and the merged clique is included in the resolution. Otherwise, if any two entities from the two cliques come from the same source, the current pair x_i, y_j is disregarded, as the merged clique would violate the one-to-one constraint.

Complexity analysis. The time complexity of the sorting step is O(n^2 m^2 (log n + log m)), where m is the number of sources and n is the average number of entities per source. The time complexity of the selection step is O(n^2 m^3), as we must examine O(n^2 m^2) edges, with each examination involving O(m) time to check whether adding the edge would violate the one-to-one constraint.
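The clique-merging procedure above can be sketched compactly. In this illustration, nodes are `(source, entity)` pairs and edges are `(score, node_u, node_v)` tuples; the representation of cliques as shared sets is our implementation choice, not prescribed by the paper:

```python
def greedy_match(edges, theta):
    """Greedy multi-partite matching sketch: scan edges by decreasing
    score, merging two cliques only when the union still contains at
    most one entity per source (the one-to-one constraint)."""
    clique = {}    # node -> the set object holding its current clique
    accepted = []
    for score, u, v in sorted(edges, key=lambda e: -e[0]):
        if score < theta:
            break                    # remaining scores fall below threshold
        cu = clique.get(u) or {u}    # singleton clique if unmatched
        cv = clique.get(v) or {v}
        if cu is cv:
            continue                 # u and v already resolved together
        if {s for s, _ in cu} & {s for s, _ in cv}:
            continue                 # merge would duplicate a source
        merged = cu | cv
        for node in merged:
            clique[node] = merged
        accepted.append((u, v, score))
    return clique, accepted
```

The source-overlap check on each candidate merge is the O(m) examination referred to in the complexity analysis.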
In practice the greedy approach is extremely fast.

Approximation guarantee. In addition to being fast and easy to implement, the bipartite greedy algorithm is well known to be a 2-approximation of exact max-weight matching on bipartite graphs. We generalize this competitive ratio to our new multi-partite greedy algorithm via duality.

Theorem 4.1: On any weighted m-partite graph, the total weight of the greedy max-weight multi-partite graph matching is at least half that of the max-weight matching.

Proof: The primal program for multi-partite exact max-weight matching is below. Note we have pairs of sources indexed by u, v, with corresponding inter-source edge set E_{uv} connecting nodes of D_u and D_v.

$$p^\star = \max_{\Delta} \sum_{uv} \sum_{ij \in E_{uv}} s_{ij} \delta_{ij} \tag{12}$$
$$\text{s.t.} \quad \sum_{j : (ij) \in E_{uv}} \delta_{ij} \le 1, \quad \forall u, v, \ \forall i \in D_u$$
$$\Delta \succeq 0$$

The primal variables select the edges of the matching, the objective function measures the total selected weight, and the constraints enforce the one-to-one property and that selections are either made or not (and cannot be negative; in sum, ensuring the optimal solution rests on valid Boolean values). The Lagrangian function corresponds to the following, where we introduce dual variables per constraint so as to bring the constraints into the objective, thereby forming an unconstrained optimization:

$$L(\Delta, \lambda, \nu) = \sum_{uv} \sum_{ij \in E_{uv}} s_{ij} \delta_{ij} + \sum_{uv} \sum_{i \in D_u} \lambda_{uvi} \left(1 - \sum_{j : (ij) \in E_{uv}} \delta_{ij}\right) + \sum_{uv} \sum_{ij \in E_{uv}} \nu_{ij} \delta_{ij}.$$

Inequality constraints require non-negative duals so that constraint violations are penalized appropriately. The Lagrangian dual function then corresponds to maximizing the Lagrangian over the primal variables:

$$g(\lambda, \nu) = \sup_{\Delta} L(\Delta, \lambda, \nu) = \sum_{uv} \sum_{i \in D_u} \lambda_{uvi} + \sup_{\Delta} \sum_{uv} \sum_{ij \in E_{uv}} \left(s_{ij} + \nu_{ij} - \lambda_{uvi} - \lambda_{uvj}\right) \delta_{ij}$$
$$= \begin{cases} \lambda \cdot \mathbf{1}, & \text{if } s_{ij} + \nu_{ij} - \lambda_{uvi} - \lambda_{uvj} \le 0, \ \forall uv, \ \forall ij \in E_{uv} \\ \infty, & \text{otherwise} \end{cases}$$

Next we form the dual LP, which minimizes the dual function subject to the non-negativity constraints on the dual variables.
This is the dual LP to our original primal. Dropping the superfluous dual variables ν:

$$\min_{\lambda} \sum_{uv} \sum_{i \in D_u} \lambda_{uvi} \tag{13}$$
$$\text{s.t.} \quad s_{ij} \le \lambda_{uvi} + \lambda_{uvj}, \quad \forall uv, \ \forall ij \in E_{uv}$$
$$\lambda \succeq 0$$

Finally, we may note that the greedily matched weights form a feasible solution for the dual LP. Moreover, the value of the dual LP under this feasible solution is twice the greedy total weight. By weak duality, this feasible solution bounds the optimal primal value, proving the result.

The same argument as in the bipartite case demonstrates that this bound is in fact sharp: there exist pathological weighted graphs for which greedy achieves only half the max weight. In real-world matching, graphs are sparse and their weights are relatively well behaved, so in practice greedy usually achieves much better than this worst-case approximation, as demonstrated in our experiments.

5 EXPERIMENTAL METHODOLOGY

We present our experimental results on both real-world and synthetic data. Our movie dataset aims to demonstrate the performance of our algorithms in a real, very large-scale application; the publication dataset adds further support to our conclusions on additional real-world data; and the synthetic data is used to stress-test multi-partite matching in more difficult settings.

ZHANG ET AL.: PRINCIPLED GRAPH MATCHING ALGORITHMS FOR INTEGRATING MULTIPLE DATA SOURCES

TABLE 1
Experimental movies data

Source    Entities    Source      Entities    Source     Entities
AMG       305,743     Flixster    140,881     IMDB       526,570
iTunes    12,571      MSN         104,385     Netflix    75,521

5.1 Movie Datasets

For our main real-world data, we obtained movie metadata from six popular online movie sources used in the Bing movie vertical (cf. Figure 1) via a combination of site crawling and API access. All sources have a large number of entities: as shown in Table 1, the smallest has over 12,000, while the largest has over half a million.
As noted in Section 1, while we present results on only one real-world movie dataset, the data is three orders of magnitude larger than typical real-world benchmarks in the literature and comes from a real production system [22]. Every movie obtained had a unique ID within each source and a title attribute. However, other attributes were not universal, and some sources tended to have more metadata than others. For example, IMDB generally has more metadata than Netflix, with more alternative titles, alternative runtimes, and a more comprehensive cast list. No smaller source was a strict subset of a larger one: for example, there are thousands of movies on Netflix that cannot be found on AMG, even though AMG is much larger.

We base similarity measurement on the following feature scores:
• Title: an exact match yields a perfect score; otherwise, the fraction of normalized words in common (slightly discounted). The best score among all available titles is used.
• Release year: the absolute difference in release years, up to a maximum of 30.
• Runtime: the absolute difference in runtime, up to a maximum of 60 minutes.
• Cast: a count of the number of matching cast members, up to a maximum of five. Matching only part of a name yields a fractional score.
• Directors: names are compared as for cast; however, the number matching is divided by the length of the shorter director list.

Although the feature-level scores could be improved (e.g., by weighting title words by TF-IDF, by performing inexact word matches, or by understanding that the omission of "part 4" may not matter while the omission of "bonus material" is significant), our focus is on entity matching using given scores, and these functions are adequate for obtaining high accuracy, as will be seen later. After scoring features, we used regularized logistic regression to learn weights for combining the feature scores into a single score for each entity pair.
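As a concrete illustration of the title feature above, a minimal sketch follows. The word normalization and the 0.9 discount factor are our illustrative assumptions, not the paper's exact choices:

```python
import re

def title_score(a, b, discount=0.9):
    """Illustrative title feature: an exact match of normalized word
    sets scores 1.0; otherwise a discounted fraction of words in
    common (Jaccard here; the paper's exact discounting may differ)."""
    norm = lambda t: set(re.sub(r"[^a-z0-9 ]", " ", t.lower()).split())
    wa, wb = norm(a), norm(b)
    if wa == wb:
        return 1.0
    if not wa or not wb:
        return 0.0
    return discount * len(wa & wb) / len(wa | wb)
```

In the paper's pipeline, per-feature scores like this one are then combined into a single pairwise score by regularized logistic regression.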
In order to train the logistic regression model and also to evaluate our entity matching algorithms, we gathered human-labeled truth sets of truly matching movies. For each pair of movie sources, we randomly sampled hundreds of movies from one source and then asked human judges to find matching movies in the other source. If a match exists, the two movies are labeled as a matching pair in the truth set. Otherwise, the movie is labeled as non-matching with respect to the other source, and any algorithm that assigns it a match from that source incurs a false positive. We also employ standard blocking techniques [21] to avoid scoring all possible pairs of movies. In the movie datasets, two movies belong to the same block if they share at least one normalized non-stopword in their titles.

TABLE 2
Experimental publications data

Source     DBLP     ACM      Scholar
Entities   2,615    2,290    60,292

5.2 Publications Datasets

We follow the same experimental protocol on a second real-world publications dataset. The data collects publication records from three sources, as detailed in Table 2. Each record has the title, authors, venue, and year of one publication.

5.3 Performance Metrics on Real-World Data

We evaluate the performance of both the Greedy and the Message Passing algorithms via precision and recall. For any pair of sources, suppose R̂ is the output matching from our algorithms. Let R+ and R− denote the matchings and non-matchings in our truth set for these two sources, respectively. Then we calculate the following standard statistics:

$$TP = |\hat{R} \cap R^+|$$
$$FN = |R^+ \setminus \hat{R}|$$
$$FP = |\hat{R} \cap R^-| + \left|\left\{(x, y) \in \hat{R} : \exists z \ne y, (x, z) \in R^+\right\}\right| + \left|\left\{(x, y) \in \hat{R} : \exists z \ne x, (z, y) \in R^+\right\}\right|$$

Precision is then calculated as TP / (TP + FP), and recall as TP / (TP + FN). By varying the threshold θ we produce a sequence of precision-recall pairs, yielding a PR curve.
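The statistics above can be computed directly from sets of matched pairs; a minimal sketch, assuming (as in the truth-set construction) that each entity appears in at most one truly matching pair:

```python
def evaluate(matched, R_plus, R_minus):
    """Precision/recall per Section 5.3. A pair is a false positive if
    it is labeled non-matching, or if either endpoint is truly matched
    to a *different* entity (the two set terms in the FP formula).
    Assumes R_plus is one-to-one, so each x (and y) appears once."""
    tp = len(matched & R_plus)
    fn = len(R_plus - matched)
    truth_of = dict(R_plus)                     # x -> its true y
    truth_rev = {y: x for x, y in R_plus}       # y -> its true x
    fp = len(matched & R_minus)
    fp += sum(1 for x, y in matched if x in truth_of and truth_of[x] != y)
    fp += sum(1 for x, y in matched if y in truth_rev and truth_rev[y] != x)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Note that, mirroring the formula, a pair wrong on both endpoints contributes to both of the last two FP terms.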
5.4 Synthetic Data Generation

To thoroughly compare our two approaches on general MPEM problems, we design and carry out several experiments on synthetic data with different difficulty levels. This data is carefully produced by Algorithm 1 to simulate the general entity matching problem.

Algorithm 1 Synthetic Data Generation
Require: n entities in each of m sources; k features per entity
1: for i = 1 to n do
2:   generate k values F_1, F_2, ..., F_k ~ Unif[0, 1] as the true features of the current entity
3:   for j = 1 to m do
4:     for t = 1 to k do
5:       sample the feature value f_t in f_{j:i} from N(F_t, σ^2)
6: Compute scores between entities across sources:
7: for i_1 = 1 to m − 1 do
8:   for i_2 = i_1 + 1 to m do
9:     for j_1 = 1 to n do
10:      for j_2 = 1 to n do
11:        score entities j_1, j_2 from sources i_1, i_2 using the values f_{i_1:j_1}, f_{i_2:j_2}

We randomly generate features of entities instead of randomly generating scores of entity pairs. This corresponds to the real world, where each source may add noise to the true metadata about an entity. In the synthetic case, we start with given true data and add random noise. An important variable in the algorithm is the variance σ^2. As σ increases, even the same entity may have very different feature values in different sources, so the entity matching problem gets harder. In our experiments, we vary σ through 0.02, 0.04, 0.06, 0.08, 0.1, and 0.2 to create datasets with different difficulty levels. We set the number of entities n = 100, the number of sources m = 3, and the number of features k = 5. We use the normalized inner product of two entities' feature vectors as their similarity score.
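Algorithm 1 translates directly into code; a self-contained sketch (the seeding and data layout are our choices):

```python
import random

def generate(n, m, k, sigma, seed=0):
    """Algorithm 1 sketch: draw k true features per entity from
    Unif[0,1], add N(0, sigma^2) noise per source, and score every
    cross-source pair by normalized inner product (cosine)."""
    rng = random.Random(seed)
    true = [[rng.random() for _ in range(k)] for _ in range(n)]
    obs = [[[rng.gauss(f, sigma) for f in true[i]] for i in range(n)]
           for _ in range(m)]                    # obs[source][entity]

    def sim(u, v):                               # normalized inner product
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)

    scores = {}
    for i1 in range(m - 1):                      # all source pairs
        for i2 in range(i1 + 1, m):
            for j1 in range(n):
                for j2 in range(n):
                    scores[(i1, j1, i2, j2)] = sim(obs[i1][j1], obs[i2][j2])
    return obs, scores
```

As in the paper's experiments, increasing `sigma` makes noisy copies of the same entity drift apart and the matching problem harder.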
5.5 Performance Metrics on Synthetic Data

Since all entities and sources are symmetric, we evaluate the performance of the different approaches on the whole dataset. For a dataset with m sources and n entities, there are a total of nm(m − 1)/2 true matchings. If an algorithm outputs T entity matchings, of which C are correct, then precision and recall are computed as:

$$Precision = \frac{C}{T}, \qquad Recall = \frac{2C}{nm(m-1)}$$

As another measure we also use F-1, the harmonic mean of precision and recall, defined as 2 · precision · recall / (precision + recall).

5.6 Unconstrained ManyMany Baseline

As a baseline, we employ a generic unconstrained ER approach, denoted MANYMANY, that is common to many previous works [5], [7], [20], [21], [26], [30], [31], [34], [38]. Given a score function f : D_i × D_j → R on sources i, j and a tunable threshold θ ∈ R, MANYMANY simply resolves all tuples with score exceeding the threshold. Compared to our message-passing and greedy algorithms described in Section 4, MANYMANY allows an entity in one data source to be matched with multiple entities in another data source. Because of this property, it is also important to notice that adding more data sources will not improve MANYMANY's resolution results on any pair of sources, i.e., its resolution results on a pair of sources are invariant to any other sources being matched.

6 RESULTS

We now present the results of our experiments.

6.1 Results on Movie Data

The real-world movie dataset contains millions of movie entities, on which both the message-passing and greedy approaches are sufficiently scalable to operate.
This section's experiments are designed primarily to answer the following questions:
1) Whether multi-partite matching achieves better precision and recall than simple bipartite matching, i.e., whether additional sources improve statistical performance;
2) Whether the one-to-one global constraint improves the resolution results on real movie data;
3) Whether the message-passing-based approach achieves higher total weight than the heuristic greedy approach; and
4) How the two approaches perform on multi-partite matching of real-world movie data.

6.1.1 Multipartite vs. Bipartite

We examine the effectiveness of multi-partite matching by adding sources to the basic bipartite matching problem. Specifically, we use two movie sources (msnmovie and flixster in our experiments) as the target source pair, and perform bipartite matching on these two sources to obtain baseline performance. We then add another movie source (IMDB), perform multi-partite matching on these three sources, and compare the matching results on the target two sources against the baseline. Finally, we perform multi-partite matching on all six movie sources and record the results on the target two sources. Seven groups of results are shown together in Figure 5. We use diamonds to plot the results of the greedy approach, dots for the message-passing approach, and plus symbols for MANYMANY. Results for different numbers of sources are plotted in different colors. The PR curve is generated by varying the threshold θ on scores.

Fig. 5. Performance of resolution on increasingly many movie sources.
The algorithms enjoy comparably high performance, which improves with additional sources. Specifically, given a threshold, we regard all entity pairs whose similarity falls below the threshold as having similarity 0 and as never being matched. We then perform entity resolution based on the new similarity matrix and plot the resulting precision and recall as a point on the PR curve. We range the threshold from 0.51 through 0.96 in increments of 0.03.

From the results, we first see that when comparing the three approaches on two sources, both message passing and greedy are much better than MANYMANY. This demonstrates the effectiveness of adding the global one-to-one constraint to entity resolution on real movie data. It is also important to notice that without the one-to-one constraint, MANYMANY is not affected at all when new sources are added, while for the approaches with the global one-to-one constraint, the resolution improves significantly with an increasing number of sources, particularly in going from 2 to 3. This further shows the importance of the one-to-one constraint in multi-partite entity resolution: it facilitates appropriate linking of the individual pairwise matching problems.

In addition, for both the message-passing and greedy approaches, the PR curve using three sources for matching is higher than that using only two. This means that with the additional source IMDB, which has the largest number of entities among all the sources, the resolution of msnmovie and flixster is much improved. The matching results on all six sources are also higher than on two sources, but only slightly higher than on three.
This implies that the other three movie sources do not provide much more information beyond IMDB, likely because IMDB is a very broad source for movies and is already informative enough, with good-quality metadata, to help match msnmovie and flixster. For any number of sources, the precision and recall of greedy is very similar to that of message passing and very close to 1.0. This may be because our feature scores and similarity functions are already quite good for movie data. To determine whether the two approaches are fundamentally similar, or whether greedy was merely benefiting from this dataset's low feature noise, we designed the synthetic experiments described in the following sections. Since the experiments on real movie data have already shown the advantage of these two approaches over MANYMANY, we do not include MANYMANY in the synthetic experiments.

6.1.2 Total Weight

Next we compare the total weight of the matchings output by the message-passing and greedy approaches. The total weight of a matching is calculated as the sum of the similarity scores of each matched pair. For example, if a movie x from IMDB matches movie y in msnmovie and movie z in netflix, then the total weight is sim(x, y) + sim(x, z) + sim(y, z). Since the message-passing approach goes to additional effort to better approximate the maximum-weight matching, we expect its total weight to be higher than greedy's.

In Figure 6, we show the comparative results when performing multi-partite matching on two, three, and six sources, with the threshold set to 0.51. For two sources, since exact maximum-weight bipartite matching is tractable, we compute the true maximum weight and display it as "MaxWeight". For comparison, we use the weight of the message-passing approach as the reference, and plot the values of the greedy and max-weight approaches relative to this reference.
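The total-weight computation described above sums all pairwise similarities within each resolved clique; a minimal sketch (pairwise scores keyed by frozensets, our representation):

```python
from itertools import combinations

def clique_weight(clique, sim):
    """Total weight contributed by one resolved clique: the sum of
    similarity scores over all pairs of its members."""
    return sum(sim[frozenset(p)] for p in combinations(clique, 2))
```

Summing this over all cliques gives a matching's total weight, the quantity compared in Figure 6.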
From the figure, we can see that on all the datasets, the message-passing approach achieves higher total weight than the greedy approach. Also, as the number of sources increases, the difference between the total weights of the message-passing and greedy approaches grows. For example, on six sources the total weight of the message-passing approach is more than 10% higher than greedy's. In addition, in the two-source comparison, the message-passing approach (total weight 76,404) achieves almost the same total weight as the maximum-weight matching (total weight 76,409), while the greedy approach's total weight is a little lower (76,141). This suggests that our message-passing approach approximates the maximum-weight matching well. On the other hand, the result also implies that the movie matching data is far from pathological, since the total weight of greedy is very close to that of the maximum-weight matching on two sources, in stark contrast to the 2-approximation worst case.

Fig. 6. Comparing weights, with the true max weight for 2 sources. To compare across varying numbers of sources, we plot weights relative to Message Passing.

Fig. 7. Precision-recall comparison of Message Passing and Greedy matching on synthetic data with varying amounts of score noise.
Fig. 8. F1 scores of Message Passing and Greedy matching on synthetic data with varying amounts of score noise.

Fig. 9. Performance of resolution on increasingly many publication sources.

6.2 Results on Publication Data

We further explore our approaches with the publication dataset of Section 5.2, for which results are shown in Figure 9. We experiment on matching DBLP vs. Scholar, and on matching DBLP vs. ACM vs. Scholar, comparing the results of the bipartite and tripartite matchings against truth labels on DBLP-Scholar. The results lead to conclusions similar to those drawn from the movie data: (1) globally-constrained matching with an additional source improves the accuracy of matching DBLP-Scholar alone; and (2) the one-to-one constrained approaches perform better than the unconstrained MANYMANY approach.

6.3 Results on Synthetic Data

As discussed in Section 5, we created synthetic datasets with different levels of feature noise by varying the parameter σ. Our primary aim with the synthetic data is to see whether greedy is truly competitive with message passing, or whether it only achieves similar precision/recall in the low-feature-noise regime of the relatively well-curated movie data. Our experiments use 3 sources with 100 entities each. We use precision, recall, and F-1 score as the evaluation metrics for comparing the two approaches.

Figure 7 shows the PR curves of the two approaches on the different datasets. Here, σ = 0.02 represents the easiest dataset (i.e., generated with minimal noise) and σ = 0.
2 represents the noisiest dataset (i.e., generated with significant noise). We use a different color to present the results for each data quality level and different point markers for the different approaches. From the figure, we see that on the easiest dataset (σ = 0.02), both greedy and message passing work exceptionally well, reminiscent of the movie matching problem. As the datasets get noisier, both approaches experience degraded results, but message passing degrades far less than greedy. As shown in the figure, on the datasets with σ = 0.04, 0.06, 0.08, 0.1, message passing performs far better than greedy. Finally, when the data reaches a very noisy level (σ = 0.2), both approaches perform equally poorly.

In Figure 8, we show the F-1 values of the two approaches over a range of thresholds. We can see very clearly the contrast between the two approaches as the data becomes increasingly noisy. Here colors again denote the noise level, while the line type denotes message passing (solid) or greedy (dashed). For example, the gap between the solid and dashed lines is very small at the top of the figure, where the data is relatively clean. However, as σ increases the gap also increases, reaching an apex on the green curves (σ = 0.06).

Fig. 10. Convergence test: (a) change of α messages; (b) change of total weight.

Later, as σ increases even more, the gap becomes smaller but still exists. At last, when σ = 0.
2, the gap becomes very small, meaning the two approaches perform almost the same under severe noise. This is to be expected: no method can perform well under extreme noise.

In sum, from the experimental results on the synthetic data, we conclude that on the datasets with the least feature noise, both the greedy and message-passing approaches perform very well, while on the noisiest datasets both perform poorly. When the feature noise lies between these two extremes, message passing is much more robust than greedy.

6.4 Convergence and Scalability

The complexity of our message-passing approach depends on the number of iterations the algorithm needs to converge. In this section, we examine the convergence of the algorithm and empirically compare the efficiency of the message-passing approach with that of the greedy approach.

In the convergence experiment, we use 6 data sources, each with 1,000 entities, giving 6,000 α messages in total. We vary the threshold (as in Section 6.1.1) from 0.3 to 0.8 in increments of 0.1 to filter the candidate matchings. A higher threshold leaves fewer edges in the weighted graph, so the total weight of the final matching is lower and fewer iterations are needed to converge. Figure 10(a) shows how many iterations the message-passing approach needs to converge for the different thresholds, and Figure 10(b) shows the total weight after each iteration of message passing. Both graphs show that with 6 sources and 1,000 entities, the message-passing approach converges quickly: for all thresholds, it converges within 50 iterations. This indicates that the factor T in the complexity analysis of the message-passing approach is much smaller than the total number of entities n.
We also conducted an experiment to empirically compare the efficiency of the message-passing and greedy approaches. There is no doubt that the message-passing approach is much slower than the greedy approach; the purpose of the experiment is to see how well the message-passing approach scales as the number of entities and the number of sources increase. The experiment was performed on a computer with an Intel Core i5 M560 CPU at 2.67 GHz and 4 GB of memory.

Fig. 11. Running time comparison: (a) increasing entities; (b) increasing sources.

Figure 11(a) shows the time cost comparison between the two approaches when we fix the number of sources at 6 and increase the number of entities per source from 200 to 1,000, and in Figure 11(b) we fix the number of entities per source at 600 and increase the number of sources from 3 to 7. We can see that the greedy approach scales well. While the time cost of the message-passing approach increases much faster, its time complexity is still acceptable for solving real problems. In practice, it takes 1 to 2 hours for message passing to find the matching on the real, large-scale movie dataset, which has 6 sources, some containing around half a million movie entities.

7 CONCLUSIONS

In this paper, we have studied the multi-partite matching problem for integration across multiple data sources. We have proposed a sophisticated factor-graph message-passing algorithm and a greedy approach for solving the problem in the presence of one-to-one constraints, motivated by real-world socio-economic properties that drive data sources to be naturally deduplicated.
We provided a competitive-ratio analysis of the latter approach, and conducted comparisons of the message-passing and greedy approaches on a very large real-world Bing movie dataset, a smaller publications dataset, and synthetic data. Our experimental results demonstrate that with additional sources, the precision and recall of entity resolution improve; that leveraging the global constraint improves resolution; and that message passing, while slower to run, is much more robust to noisy data than the greedy approach.

For future work, implementing a parallelized version of the message-passing approach is an interesting direction to follow. Another open area is the formation of theoretical connections between the surrogate objective of weight maximization and the end goal of high precision and recall.

REFERENCES

[1] A. Arasu, C. Ré, and D. Suciu. Large-scale deduplication with constraints using Dedupalog. In ICDE'09, pages 952–963, 2009.
[2] M. Bayati, D. Shah, and M. Sharma. Max-product for maximum weight matching: convergence, correctness and LP duality. IEEE Trans. Info. Theory, 54(3):1241–1251, Mar. 2008.
[3] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. Swoosh: A generic approach to entity resolution. VLDB J., 18(1):255–276, 2009.
[4] I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. In SIAM Conf. Data Mining, 2006.
[5] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD'03, pages 39–48, 2003.
[6] S. Chaudhuri, A. Das Sarma, V. Ganti, and R. Kaushik. Leveraging aggregate constraints for deduplication. In SIGMOD'07, pages 437–448, 2007.
[7] Z. Chen, D. V. Kalashnikov, and S. Mehrotra. Exploiting context analysis for combining multiple entity resolution systems.
In SIGMOD'09, pages 207–218, 2009.
[8] P. Christen. Automatic training example selection for scalable unsupervised record linkage. In PAKDD'08, pages 511–518, 2008.
[9] Y. Crama and F. C. Spieksma. Approximation algorithms for three-dimensional assignment problems with triangle inequalities. European J. Op. Res., 60(3):273–279, 1992.
[10] A. Culotta and A. McCallum. Joint deduplication of multiple record types in relational data. In CIKM'05, pages 257–258, 2005.
[11] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. KDE, 19(1):1–16, 2007.
[12] I. P. Fellegi and A. B. Sunter. A theory of record linkage. J. Amer. Stat. Assoc., 64(328):1183–1210, 1969.
[13] J. Gemmell, B. I. P. Rubinstein, and A. K. Chandra. Improving entity resolution with global constraints. Technical Report MSR-TR-2011-100, Microsoft Research, 2011.
[14] L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. 5(12):2018–2019, 2012.
[15] I. E. Givoni and B. J. Frey. A binary variable model for affinity propagation. Neural Computation, 21(6):1589–1600, 2009.
[16] R. Gupta and S. Sarawagi. Answering table augmentation queries from unstructured lists on the web. 2(1):289–300, 2009.
[17] M. A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Amer. Stat. Assoc., 84(406):414–420, 1989.
[18] A. Kannan, I. E. Givoni, R. Agrawal, and A. Fuxman. Matching unstructured product offers to structured product specifications. In KDD'11, pages 404–412, 2011.
[19] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[20] H. Köpcke and E. Rahm. Training selection for tuning entity matching. In Proc. Int. Work. Qual. DBs Manag. Uncert. Data, pages 3–12, 2008.
[21] H. Köpcke and E. Rahm. Frameworks for entity matching: A comparison. Data & Know. Eng.
, 69(2):197–210, 2010. [22] H. K ¨ opcke, A. Thor , and E . Rahm. Evaluation of entity resolutio n approaches on real-world match p robl ems. PVLDB , 3(1):484–493, 2010. [23] A . Marcus , E. W u, D. Karger , S. Madd en, and R. Miller . Human-powered sorts and joins. PVLDB , 5(1):13–24, 2011. [24] S . Negahb an, B. I. P . Rubinstein, and J. Gemmell. Scal ing multiple-sourc e ent it y resolution using statistically efficient transfer learning. In CIKM’12 , pages 2224–2228, 2012. [25] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimiza- tion . Dover , 1998. [26] J. C. Pinheiro and D. X. Sun. Me thods for linking and mining massive het erogeneous databases. In KDD’98 , pages 309–313, 1998. [27] P . Ravikumar and W . Cohen . A h ierar chical graphical model for record linkage. In UAI’04) , pages 454–461, 2004. [28] M . Sadinle, R. Hall, and S. E. Fien b erg. Approaches t o m ultiple recor d linkage. In Proc. 57th World Cong. ISI , 2011. Invited paper htt p://www .cs.cmu.edu/ ∼ rjhall/ISIpaperfinal.pdf. [29] S . Sanghavi, D. Malioutov , and A. W illsky . Linear pro- gramming analysis of loopy belief p ropagation for weighted matching. In NIPS’07 , 2008. [30] S . Sarawagi and A. Bhamidipaty . Interactive deduplication using active learning. In KDD’02 , pages 269–278, 2002. [31] A . Segev and A . Chatterje e. A framework for object matching in federated d atabases and its implementation. Int. J. Coop. Info. Sys. , 5(1):73–99, 1996. [32] W . Shen, X. L i, and A. Doan. Constraint-based entity matching. In AAAI’05 , pages 862–867, 2005. [33] W . Su, J. W ang, and F . H. Lochovsky . Record matching ove r query results from multiple web databases. IEE E T rans. KDE , 22:578–589, 2010. [34] S . T ejada, C. A. Knoblock, and S. Mint on. L e arning domain- independen t string transformation weights for h igh accuracy object iden tification. In KDD’02 , pages 350–359, 2002. [35] W . E. W inkler . Advanced met hods for record linkage. In Proc. Sect. Surv . Res. Meth., Amer . 
Duo Zhang is a software engineer on the Ads team at Twitter. Before joining Twitter, he received his Ph.D. from the University of Illinois at Urbana-Champaign. He was a research intern at Microsoft, IBM, and Facebook during his Ph.D. program. Dr. Zhang has published numerous research papers in text mining, information retrieval, databases, and social networking. He has also served on PCs and as a reviewer at major computer science conferences and journals including SIGKDD, ACL, and TIST.

Benjamin I. P. Rubinstein is Senior Lecturer in CIS at the University of Melbourne, Australia, and holds a PhD from UC Berkeley. He actively researches in statistical machine learning, databases, and security & privacy. Rubinstein has served on PCs and organised workshops at major conferences in these areas including ICML, SIGMOD, and CCS. Previously he has worked in the research divisions of Microsoft, Google, Yahoo!, and Intel (all in the US), and at IBM Research Australia. Most notably, as a Researcher at MSR Silicon Valley, Ben helped ship production systems for entity resolution in Bing and the Xbox 360.

Jim Gemmell is CTO of startup Trōv and holds PhD and M.Math degrees. Dr. Gemmell is a world leader in the field of life-logging, and is author of the popular book Your Life, Uploaded. He has numerous publications in a wide range of areas including life-logging, multimedia, networking, video-conferencing, and databases. Dr. Gemmell was previously Senior Researcher at Microsoft Research, where he made leading contributions to major products including Bing, Xbox 360, and MyLifeBits.