
Supervised Negative Binomial Classifier for Probabilistic Record Linkage

Harish Kashyap K, MCG, Mysore, India (harish.k.kashyap@gmail.com)
Kiran Byadarhaly, MCG, Mysore, India (bkiranv@yahoo.com)
Saumya Shah, MCG, Mysore, India (saumya.shah@voyagenius.ai)

Abstract: Motivated by the need to link records across various databases, we propose a novel graphical-model-based classifier that uses a mixture of Poisson distributions with latent variables. The idea is to derive insight into each pair of hypothesis records that match by inferring the underlying latent rate of error using Bayesian modeling techniques. The novel approach of using gamma priors for learning the latent variables along with supervised labels is unique and allows for active learning. The naive assumption of independence of the fields is made deliberately, to propose a generalized theory for this class of problems, and not to undermine the hierarchical dependencies that could be present in different scenarios. This classifier is able to work with sparse and streaming data. The application to record linkage meets several challenges: sparsity, data streams, and the varying nature of the data-sets.

I. INTRODUCTION

Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate analytic results and wrong business decisions. Poor data across businesses and the government cost the U.S. economy $3.1 trillion a year, according to a report by InsightSquared in 2012 [4]. In health-care domains, keeping track of patients' health information is vital, and these data-sets reside in multiple data sources. All these records are critical to diagnose a disease or prescribe medicine, and inaccurate or incorrect data may threaten patient safety [5].
Massive amounts of disparate data sources have to be integrated and matched to support data analyses that can be highly beneficial to businesses, governments, and academia. Record linkage is a process which aims to solve the task of merging records from different sources that refer to the same entity, a task that only gets harder if they don't share a unique identifier. The area of record linkage poses many challenges, such as on-going linkage; storing and handling changing data; handling different linkage scenarios; and accommodating ever-increasing data-sets [1]. All of these issues make the record linkage problem very challenging and critical, and efficient algorithms are essential to address it [3].

Traditionally, record linkage consists of two main steps: blocking and matching. In the blocking step, records that potentially match are grouped into the same block. Subsequently, in the matching step, records that have been blocked together are examined to identify those that match. Matching is implemented using either a distance function, which compares the respective field values of a record pair against specified distance thresholds, or a rule-based approach, e.g., "if the surnames and the zip codes match, then classify the record pair as matching" [2]. The standard ways of solving such problems have been to use probabilistic models, alternative statistical models, searching and blocking, and comparison and decision models [8]. One such method is the classical and popular Fellegi-Sunter method, which estimates normalized frequency values and uses weighted combinations of them [6]. This poses a problem, as the underlying string error rates vary in a non-linear fashion, and these weighting schemes may therefore not capture them efficiently. In addition, these linear weighting techniques are mostly ad-hoc and not based on a strong pattern-recognition theory.
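The two-step blocking-and-matching pipeline described above can be sketched as follows. The record fields, blocking key, and matching rule here are illustrative assumptions, not from the paper:

```python
# Sketch of traditional record linkage: blocking (group candidates by a
# cheap key) followed by rule-based matching. Fields are illustrative.
from collections import defaultdict

def block(records, key=lambda r: r["zip"]):
    """Group records by a blocking key so only records sharing a block
    are compared, avoiding the quadratic all-pairs comparison."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    return blocks

def rule_match(a, b):
    """Rule-based matcher: surname and zip code must both agree."""
    return a["surname"] == b["surname"] and a["zip"] == b["zip"]

records_a = [{"surname": "Smith", "zip": "570001"}]
records_b = [{"surname": "Smith", "zip": "570001"},
             {"surname": "Smyth", "zip": "570002"}]

blocks = block(records_a + records_b)
matches = [(a, b) for blk in blocks.values()
           for i, a in enumerate(blk) for b in blk[i + 1:]
           if rule_match(a, b)]
print(len(matches))  # 1
```

Only the two "Smith"/570001 records are ever compared; the "Smyth" record lands in a different block.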
Hence, automated linkage processes are one way of ensuring consistency of results and scalability of service. We propose a robust solution that models the probability of the matching records as a Poisson distribution, learned using a probabilistic-graphical-model-based classifier. Individual features have latent error rates that are unique to them; for example, errors in name pairs and errors in address pairs will likely have different error rates. We can learn such error rates by modeling them as gamma distributions, which are conjugate priors to the Poisson likelihoods. The negative binomial distribution arises when the mixing distribution of the Poisson rate is a gamma distribution. The negative binomial distribution could be useful in various areas beyond record linkage where the distributions of the features are Poisson; one example of such an application is RNA sequencing [7]. This model allows the error to vary over time and produces better parameter estimates as the training data size increases, which is especially useful when streaming data or only sparse data is available for certain features.

II. PROBABILISTIC RECORD LINKAGE

The Fellegi-Sunter method for probabilistic record linkage calculates linkage weights, which are estimated by observing all the agreements and disagreements of the data values of the variables that match. This weight corresponds to the probability that the two records refer to the same entity.
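The claim above, that mixing the Poisson rate with a gamma distribution yields a negative binomial, can be checked numerically: integrating the Poisson likelihood against a gamma prior reproduces the negative binomial mass function. A minimal sketch with illustrative parameters:

```python
# Numerical check: a gamma-mixed Poisson has a negative binomial
# marginal. Parameters (alpha, beta) are illustrative; beta is the
# gamma rate parameter.
import math

def nb_pmf(x, alpha, beta):
    """Negative binomial marginal Gamma(alpha+x) * beta**alpha /
    (Gamma(alpha) * x! * (1+beta)**(alpha+x)), done in log space."""
    return math.exp(math.lgamma(alpha + x) - math.lgamma(alpha)
                    - math.lgamma(x + 1)
                    + alpha * math.log(beta)
                    - (alpha + x) * math.log(1.0 + beta))

def gamma_poisson_mixture(x, alpha, beta, steps=100000, upper=40.0):
    """Riemann-sum approximation of
    integral Poisson(x | t) * Gamma(t | alpha, beta) dt."""
    h = upper / steps
    total = 0.0
    for i in range(1, steps + 1):
        t = i * h
        poisson = math.exp(-t) * t ** x / math.factorial(x)
        gamma_pdf = (beta ** alpha * t ** (alpha - 1)
                     * math.exp(-beta * t) / math.gamma(alpha))
        total += poisson * gamma_pdf * h
    return total

alpha, beta = 3.0, 2.0  # illustrative prior parameters
for x in range(5):
    print(x, nb_pmf(x, alpha, beta), gamma_poisson_mixture(x, alpha, beta))
```

The two columns agree to several decimal places, confirming the mixture identity used throughout the paper.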
Given two records (R_a, R_b), with a set of n common matching variables given by

R_a → [F(a_1), F(a_2), ..., F(a_n)]   (1)
R_b → [F(b_1), F(b_2), ..., F(b_n)]   (2)

the comparisons between the two records can be obtained by applying a distance function, such as edit distance, to each pair of matching variables, and can be accumulated in a comparison vector

α_c = {α_c1, α_c2, ..., α_cn}   (3)

The binary comparison vector is calculated as

α_ck = 1 if F(a_k) = F(b_k), and 0 otherwise.

The basic idea in the Fellegi-Sunter method is to model the comparison vector as arising from two distributions: one that corresponds to true matching pairs and one that corresponds to true non-matching pairs. For any observed comparison vector α_c in Λ, the space of all comparisons, the conditional probability of observing α_c given that the pair is a match is m(α_c) = P(α_c | (a, b) ∈ M), and the conditional probability of observing α_c given that the pair is a non-match is u(α_c) = P(α_c | (a, b) ∈ U). Here M and U are the set of matches and the set of non-matches, respectively. The weight for each record pair is given as p_ab = m(α_c) / u(α_c) [10] [11]. Once the probabilities are estimated, the decision rule for the Fellegi-Sunter method is: if the weight of a record pair p_ab > T_λ, then it is a match, and if p_ab < T_τ, then it is a non-match.

A. Motivation

The standard mathematical technique of probabilistic record linkage in the state of the art still relies on likelihood ratios and weights that are ad-hoc and cannot be probability measures. In addition, the Fellegi-Sunter method calculates conditional probabilities by assuming a model which is very sensitive to the original distribution in the overall database.
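For concreteness, the comparison vector and Fellegi-Sunter likelihood-ratio weight can be sketched as follows. The per-field m and u probabilities are assumed known here (in practice they are estimated, e.g. via EM), and all field names and values are illustrative:

```python
# Sketch of the Fellegi-Sunter comparison vector and weight p_ab =
# m(alpha)/u(alpha) under per-field independence. m and u are assumed.
def comparison_vector(rec_a, rec_b, fields):
    """Binary agreement vector: 1 where the field values agree exactly."""
    return [1 if rec_a[f] == rec_b[f] else 0 for f in fields]

def fs_weight(alpha, m, u):
    """Likelihood ratio: agreeing fields contribute m_k/u_k,
    disagreeing fields contribute (1-m_k)/(1-u_k)."""
    w = 1.0
    for a_k, m_k, u_k in zip(alpha, m, u):
        w *= (m_k / u_k) if a_k else ((1 - m_k) / (1 - u_k))
    return w

fields = ["name", "address"]
m = [0.95, 0.90]   # P(field agrees | true match), assumed
u = [0.01, 0.05]   # P(field agrees | true non-match), assumed
rec_a = {"name": "zagat grill", "address": "12 main st"}
rec_b = {"name": "zagat grill", "address": "12 main st"}

alpha = comparison_vector(rec_a, rec_b, fields)
print(alpha, fs_weight(alpha, m, u))  # [1, 1] and a large weight
```

Full agreement yields a weight of (0.95/0.01)(0.90/0.05) = 1710, far above any plausible match threshold, while full disagreement yields a weight well below 1.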
In 2014, Toan Ong showed good results in extending existing record linkage methods to handle missing field values in an efficient and effective manner [14]. Supervised learning paradigms have used labeled comparison vectors, but these suffer from the need to regularly update the training data [15]. A Bayesian approach to graphical record linkage has been proposed which overcomes many obstacles encountered by previous approaches. This unsupervised algorithm is well suited to unlabeled data; however, leveraging the vast amounts of labels when they are available is necessary. This begs the need for a truly probabilistic linkage theory and algorithm.

Our algorithm is a truly probabilistic Poisson-gamma model, of the kind popularly used in Bayesian statistics. We have applied this Poisson-gamma model to learn the latent error rates and hence propose a Bayesian scheme for record linkage. In the big-data era, the velocity of data updates is often high, quickly making previous linkage results obsolete. A true learning framework that can incrementally and efficiently update linkage results when data updates arrive is essential [13]. The advantage of this method is that it allows for updating only the parameters, in an active-learning paradigm, instead of repeatedly training over the entire large data-set, as in the majority of state-of-the-art probabilistic record linkage algorithms. This means that older data-sets can be thrown away; only the new data needs to be trained on and the parameters updated. In a big-data setup this is immensely useful.

B. Error Distribution

Given a pair of data-sets, the edit distances of pairs of strings within those data-sets that match are relatively small compared to the lengths of the matching string pairs.
On the other hand, the edit distances of pairs of records are quite large compared to the lengths of the records in the case of a non-match. The total string lengths are large compared to the error rate, and hence the error counts are modeled as Poisson-distributed. These Poisson-distributed errors can suffer from uncertainty around the underlying error rate θ_i for the field F_i, which can be considered a latent variable. The error distributions of the matches and non-matches are shown in Fig. 1.

Fig. 1. Error distribution of the matches and non-matches.

Figures 2 and 3 show the distributions of errors over the name and address columns, which demonstrate that the error rates can differ across identifiers (features).

Fig. 2. Error distribution of the name variable.
Fig. 3. Error distribution of the address variable.

C. Algorithm

• Using training data, infer the parameters of the latent error rate, which is modeled as a gamma distribution for each feature.
• Infer the posterior predictive distribution over the test samples as a negative binomial distribution for the respective features.
• Compute the joint probability of the features, and use the training data to determine the optimum threshold for classification.
• Validate on test data.

This algorithm is called the Negative Binomial Classifier and is further explained in the next section.

III. SUPERVISED NEGATIVE BINOMIAL CLASSIFIER

The errors, though distributed as Poisson, can suffer from uncertainty around the underlying error rate θ_i for the field F_i. Hence we consider the rate itself to be a probability distribution, and choose the conjugate gamma distribution as the prior for the latent variable θ_i. Conjugate priors are chosen for mathematical convenience [9].
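The error counts modeled here are string edit distances between field values. A minimal sketch of the standard Levenshtein dynamic program (an illustrative helper, not code from the paper):

```python
def edit_distance(s, t):
    """Levenshtein distance via the standard dynamic program:
    d[i][j] = minimum edits to turn s[:i] into t[:j]."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(s)][len(t)]

# Matching pairs have small distances relative to string length;
# non-matching pairs have large ones.
print(edit_distance("zagat grill", "zagat grille"))  # 1
print(edit_distance("zagat grill", "fodors cafe"))
```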
The naive independence assumption is made for generality of the theory and is by no means the only formulation the theory admits. The conjugacy property enables learning of the underlying error rate and allows updating the latent variables as new data becomes available. Inference of the latent variables constitutes the learning; prediction uses the predictive distribution of the posterior probability, given the gamma prior and the Poisson likelihood.

A. Poisson Distribution

The Poisson distribution is a convenient distribution to model the errors X with rate θ. The probability for an individual pair of records (a_i, b_j), which is a single observation, is given by

P(X | θ) = θ^x e^{−θ} / x!   (4)

B. Latent rate of error θ

The latent variable θ is given the conjugate prior distribution, which for the Poisson likelihood is known to be the gamma distribution. The error rate for a field or a finite record pair is distributed as

p(θ | y) = γ(α, β)   (5)

The gamma distribution is a natural fit as a conjugate prior to the Poisson distribution [9]. α and β are parameters estimated from the ground truth. There are many techniques, such as maximum-likelihood estimation and the method of moments, to determine the parameters α, β [12].

C. Estimation of α, β

We use the method of moments to determine the parameters of the γ distribution for a given feature F_a. The marginal mean and variance of the error counts are

E(X_i) = α / β   (6)

Var(X_i) = α / β + α / β²   (7)

Substituting E(X_i),

Var(X_i) = E(X_i)(1 + 1/β) = (α / β)(1 + 1/β)   (8)
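The moment equations above invert to β = E / (Var − E) and α = βE; this inversion step is implied rather than stated in the text. A sketch with illustrative counts:

```python
# Method-of-moments estimate of the gamma prior (alpha, beta) from
# E = alpha/beta and Var = E * (1 + 1/beta). Requires overdispersion
# (Var > E); the sample counts below are illustrative.
def gamma_mom(counts):
    """Estimate (alpha, beta) of the gamma prior from error counts."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n
    if var <= mean:
        raise ValueError("counts are not overdispersed relative to Poisson")
    beta = mean / (var - mean)
    return beta * mean, beta

# Per-field edit-distance counts for matching pairs (illustrative):
alpha, beta = gamma_mom([0, 1, 0, 2, 1, 3, 0, 1, 5, 2])
print(alpha, beta)  # 3.0 2.0
```

Here the sample mean is 1.5 and the sample variance 2.25, so β = 1.5 / 0.75 = 2 and α = 3.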
D. Application to Record Linkage

The conjugate prior, along with the likelihood, allows us to derive the posterior predictive distribution for a single pair of features x_i = (F(a_i), F(b_i)) corresponding to the record pair (R_a, R_b), given by

P(F_a_i, F_b_i) = P((F_a_i, F_b_i) | θ_i) P(θ_i) / P(θ_i | (F_a_i, F_b_i))   (9)

Writing this in terms of the variable x_i, we get

P(x_i) = Poisson(x_i | θ_i) Gamma(θ_i | α, β) / Gamma(θ_i | α + x_i, 1 + β)   (10)

The parameter θ_i, which is the rate of the Poisson distribution, is gamma-distributed:

p(θ_i) ∼ γ(α, β)   (11)

Carrying out the algebra,

P(x_i) = Γ(α + x_i) β^α / (Γ(α) x_i! (1 + β)^{α + x_i})   (12)

This has a known form, the negative binomial distribution, which is the posterior predictive distribution given the previously determined parameters:

P(x_i) = C(α + x_i − 1, x_i) (β / (β + 1))^α (1 / (β + 1))^{x_i}   (13)

x_i ∼ Neg-Bin(α, β)   (14)

E. Fuzzy Matching

To compare the probability of the record pairs (R_a, R_b), note that R_a and R_b could contain features that are not identical in coverage, such as a zip code present in one and absent in the other. We only look at the n fields that are common between the records. We now apply the generalized theory. This involves learning different error rates for each data-set. The assumption is that the error rate can vary for an individual pair of records around an underlying overall error rate. Note that the latent parameters of the error distribution are computed only from the matching records and then applied to the non-matching records. The probability of the pair is the joint over the features:

P[(F_a_1, F_b_1), (F_a_2, F_b_2), (F_a_3, F_b_3), ..., (F_a_n, F_b_n)]   (15)

Assuming the features are independent,

= P(F_a_1, F_b_1) × P(F_a_2, F_b_2) × ...
× P(F_a_n, F_b_n)

= ∏_{i=1}^{n} P(F_a_i, F_b_i)
= ∏_{i=1}^{n} C(α_i + x_i − 1, x_i) (β_i / (β_i + 1))^{α_i} (1 / (β_i + 1))^{x_i}   (16)

where α_i, β_i are the parameters of the gamma distribution for feature i, and x_i is the variable for which the predictive distribution applies. Expanding,

= C(α_1 + x_1 − 1, x_1) C(α_2 + x_2 − 1, x_2) ... C(α_n + x_n − 1, x_n)
  × (β_1 / (β_1 + 1))^{α_1} (β_2 / (β_2 + 1))^{α_2} ... (β_n / (β_n + 1))^{α_n}
  × (1 / (β_1 + 1))^{x_1} (1 / (β_2 + 1))^{x_2} ... (1 / (β_n + 1))^{x_n}

= ∏_{i=1}^{n} Neg-Bin(α_i, β_i, x_{i,j})   (17)

Assuming a pair of data items belongs to a class c_k, given k classes,

H = ∏_{i=1}^{n} Neg-Bin(α_i, β_i, x_{i,j}, c_k)   (18)

Therefore the best class that the data fits is given by

H_k = ∏_{i=1}^{n} Neg-Bin(α_i, β_i, x_{i,j}, c_k) > θ,   k ∈ (1, k)   (19)

with variables x_{i,j} and threshold θ. The matching records can be chosen as those whose probability crosses the threshold θ. The above formulation should be useful in any scenario where the features individually fit a Poisson distribution. The advantage of the Bayesian modeling approach is the robustness that the negative binomial distribution offers, due to the inherent conjugacy of the latent error rate. Not only is this approach generative in nature, it can also learn from the ground truth. It therefore combines the advantages of supervised learning and generative modeling.

F. Active Learning

The parameters at each new data-set can be updated, based on the fitness test, as

α_i = α_i + Y_i   (20)
β_i = β_i + X_i   (21)

where Y_i is the event that occurred given X_i. Hence,

θ_i | (Y_i, X_i) ∼ Gamma(α_i + Y_i, β_i + X_i)

This can be represented as the graphical model shown in Fig. 1. The classifier provides better estimates based on all past data that has been learned. Also, it may not be necessary to learn the parameters at every step; they can be learned in chunks, depending on the application.
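The update rule above is the standard conjugate gamma-Poisson form: a chunk of new pairs with total error count Y over X observations moves Gamma(α, β) to Gamma(α + Y, β + X). A sketch with illustrative counts:

```python
# Sketch of the conjugate active-learning update: a chunk of newly
# labeled pairs updates the gamma posterior in closed form, so no
# retraining over old data is needed. Prior values and counts are
# illustrative.
def update(alpha, beta, new_counts):
    """Gamma(alpha, beta) -> Gamma(alpha + total errors,
    beta + number of new observations)."""
    return alpha + sum(new_counts), beta + len(new_counts)

alpha, beta = 3.0, 2.0        # parameters learned from earlier data
chunk = [1, 0, 2, 1]          # error counts from newly arriving pairs
alpha, beta = update(alpha, beta, chunk)
print(alpha, beta)            # 7.0 6.0
print(alpha / beta)           # posterior mean of the latent error rate
```

Because the update only touches α and β, older data-sets can indeed be discarded once absorbed, as the text argues.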
In such an instance, a chunk of new data can be added to the stream and the underlying α and β parameters recomputed. This provides better estimates for the arriving test data in the stream.

IV. RESULTS

The algorithm was applied to the record linkage problem on restaurant data-sets: tables of names and addresses provided by the restaurant review companies Zagat and Fodor's. This is real data, containing real string values and real errors, downloaded from the RIDDLE data repository. There are a total of 191 records in the data-set, of which 60% are unique. The ground truth consists of false pairs along with matching pairs. This is a small data-set, with only 113 true matches and 71 false matches. The performance of record linkage, as well as of general classification algorithms, suffers greatly with insufficient data. The data is divided into training and testing sets, with 70% used for training and the rest used to evaluate the performance of the algorithm.

A. Choosing the Threshold

The parameters of the negative binomial distribution for each pair of features are learned from the matching records. For aggressive error biasing, the distribution of errors for the non-matching pairs could be fit as a Gaussian. The matches were randomized, with no criterion to filter the feature set on; this helps absorb random errors that could occur during the filtering operation, as errors can happen at any string position. The log-likelihood of a record pair being a match is then calculated, using the trained negative binomial distribution, for all the records in the training data. The confusion matrix, as well as the ROC and precision-recall curves, are then produced using the available labels on the training data.
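The scoring step above can be sketched as a per-feature negative binomial log-likelihood summed under independence, followed by a simple threshold sweep on labeled training pairs. Parameters, counts, and labels below are illustrative, not the paper's data:

```python
# Sketch: joint negative binomial log-likelihood per record pair, then a
# threshold sweep over observed scores maximizing training accuracy.
import math

def nb_log_pmf(x, alpha, beta):
    """log of Gamma(alpha+x) beta^alpha / (Gamma(alpha) x! (1+beta)^(alpha+x))."""
    return (math.lgamma(alpha + x) - math.lgamma(alpha)
            - math.lgamma(x + 1)
            + alpha * math.log(beta)
            - (alpha + x) * math.log(1.0 + beta))

def score(pair_counts, params):
    """Joint log-likelihood over independent features."""
    return sum(nb_log_pmf(x, a, b) for x, (a, b) in zip(pair_counts, params))

params = [(3.0, 2.0), (2.0, 1.5)]   # per-feature (alpha, beta), assumed
train = [([0, 1], 1), ([1, 0], 1),  # (edit-distance counts, label)
         ([8, 9], 0), ([7, 11], 0)]

scores = [(score(c, params), y) for c, y in train]
best_t, best_acc = max(
    ((t, sum((s > t) == bool(y) for s, y in scores) / len(scores))
     for t, _ in scores),
    key=lambda p: p[1])
print(best_t, best_acc)  # accuracy 1.0 on this toy data
```

On real data the threshold would instead be read off the ROC or precision-recall curve, as the paper describes.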
The ROC curve and the precision-recall curve, shown in figures 4 and 5, are then used to estimate the optimum threshold for deciding whether a pair of records is a match on new data. One of the main advantages of this method, which cannot be articulated enough, is its ability to update parameters as and when new data is available. Say a new pair of records needs to be matched: the likelihood of this pair is found using the negative binomial classifier and, based on the threshold chosen, the pair is deemed either a match or a mismatch. If it turns out to be a match, it can then further be used to update the parameters of the gamma distribution. The new data further solidifies the ability of the system. Another core advantage of this model is that the system does not need to be retrained over and over again, and is averse to over-fitting due to its Bayesian nature.

Fig. 4. The AUC on the training data.
Fig. 5. Precision-recall curve on the training data.

As figures 4 and 5 show, the model fits the training data very well, evidenced by the high values of the AUC as well as precision and recall. But the utility of a model can only be judged by its performance on unseen test data. The performance on the test data is as follows:

AUC: 94.23%
Precision: 90.9%
Recall: 76.92%
Accuracy: 86.2%

Even with the very small number of matching records used to train the model, it performs quite well on the test data. The very high AUC score shows that the negative binomial model performs extremely well as a classifier. The precision of about 91% also shows that a high percentage of the pairs the model declares to be matches are actual matches, i.e., few non-matches are classified as matches (low false positives).
This gives confidence that if the model declares a pair of records to be a match, there is a high probability that they actually are. The overall accuracy of the model shows that 86% of the record pairs are classified properly as matches or non-matches. The threshold has been chosen to focus more on precision than recall, so as to reduce the number of times the model declares a non-match to be a match; in doing so, we risk classifying some true matches as non-matches. But due to the active-learning framework of the model, newly arriving data is not only classified as match or non-match but is also used to improve the model. Thus, with a larger data-set, we can achieve higher values of precision, recall, and overall accuracy, even though a trade-off between them remains.

V. CONCLUSION AND FUTURE WORK

This paper presents a classifier that is based on Bayesian principles, is robust, and can work with small data-sets. The theory allows for active learning, where additional data can be used to update the latent gamma distribution parameters. It provides a framework for researchers to explore other distributions, as long as the underlying latent variables have conjugate priors. The excellent AUC score and accuracy are encouraging given the small size of the data. The negative binomial classifier can also be applied to other areas, such as RNA sequencing and other decay processes.
The model can also be made hierarchical by assuming that the parameters of the gamma distribution (α, β) follow some other probability distribution, which would make them hyper-priors to the original Poisson error distribution. Researchers in record linkage can explore variants of the theory that include hierarchical arrangements of features, where dependencies such as zip code and street names can be set to further refine the model. Also, the target variable can be chosen as a Dirichlet mixing distribution to account for imbalanced data-sets.

REFERENCES

[1] J. H. Boyd, S. M. Randall, A. M. Ferrante, J. K. Bauer, A. P. Brown and J. B. Semmens, "Technical challenges of providing record linkage services for research," BMC Medical Informatics and Decision Making, 14(1):23, March 2014.
[2] D. Karapiperis, A. Gkoulalas-Divanis and V. S. Verykios, "Summarization Algorithms for Record Linkage," EDBT, 2018.
[3] A. A. Mamun, R. Aseltine and S. Rajasekaran, "Efficient Record Linkage Algorithms Using Complete Linkage Clustering," PLoS ONE, 11(4): e0154446, 2016.
[4] I. F. Ilyas and X. Chu, "Trends in Cleaning Relational Data: Consistency and Deduplication," Foundations and Trends in Databases, Vol. 5, No. 4, pp. 281-393, 2015.
[5] K. Kerr, T. Norris and R. Stockdale, "Data quality information and decision making: a healthcare case study," in Proceedings of the 18th Australasian Conference on Information Systems Doctoral Consortium, page 57, 2007.
[6] I. P. Fellegi and A. B. Sunter, "A theory for record linkage," Journal of the American Statistical Association, 64(328):1183-1210, 1969.
[7] K. Dong, H. Zhao, T. Tong and X. Wan, "NBLDA: negative binomial linear discriminant analysis for RNA-seq data," BMC Bioinformatics, 17(1):369, 2016.
[8] L. Gu, R. Baxter, D. Vickers and C. Rainsford, "Record linkage: Current practice and future directions," Technical report, CSIRO Mathematical and Information Sciences, 2003.
[9] A. Gelman, J. Carlin, H. Stern and D. Rubin, Bayesian Data Analysis, Second Edition. Chapman & Hall/CRC Texts in Statistical Science, 2003.
[10] B. S. McVeigh and J. S. Murray, "Practical Bayesian Inference for Record Linkage," Technical report, Carnegie Mellon University, 2017.
[11] S. Sharp, "Deterministic and probabilistic record linkage," Alternative Sources Branch, National Records of Scotland.
[12] T. P. Minka, "Estimating a gamma distribution," Microsoft Research, Cambridge, UK, Tech. Rep., 2002.
[13] A. Gruenheid, X. L. Dong and D. Srivastava, "Incremental Record Linkage," Proc. VLDB Endow., Vol. 7, No. 9, pp. 697-708, May 2014.
[14] T. C. Ong, M. V. Mannino, L. M. Schilling and M. G. Kahn, "Improving record linkage performance in the presence of missing linkage data," Journal of Biomedical Informatics, Vol. 52, pp. 43-54, December 2014.
[15] A. M. Hurwitz, "Record linkage sharing using labeled comparison vectors and a machine learning domain classification trainer," US Patent US9576248B2.
