Predicting missing links via correlation between nodes

Hao Liao
Department of Physics, University of Fribourg, Fribourg, Switzerland CH-1700

An Zeng
Department of Physics, University of Fribourg, Fribourg, Switzerland CH-1700
Email: an.zeng@unifr.ch

Yi-Cheng Zhang
Department of Physics, University of Fribourg, Fribourg, Switzerland CH-1700
Web Sciences Center, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, People's Republic of China

Abstract: As a fundamental problem in many different fields, link prediction aims to estimate the likelihood that a link exists between two nodes based on the observed information. Since this problem is related to many applications, ranging from uncovering missing data to predicting the evolution of networks, link prediction has been intensively investigated recently and many methods have been proposed so far. The essential challenge of link prediction is to estimate the similarity between nodes. Most of the existing methods are based on the common neighbor index and its variants. In this paper, we propose to calculate the similarity between nodes by the correlation coefficient. This method is found to be very effective when applied to calculate similarity based on high order paths. We finally fuse the correlation-based method with the resource allocation method, and find that the combined method can substantially outperform the existing methods, especially in sparse networks.

I. INTRODUCTION

The object of much scientific research is prediction. For instance, understanding the mechanism of epidemic spreading can help us to predict the future coverage of a certain virus [1], and the mechanistic model for the citation dynamics of individual papers can be applied to predict the future evolution of scientific publications [2].
While mathematical models and prediction techniques are sufficiently mature for some systems, reliable prediction approaches are still unavailable in most systems. One may identify their chaotic nature as the major difficulty, yet the lack of understanding of the underlying principles may indeed be the main obstacle. Besides the prediction of collective behavior, prediction at the microscopic level, such as the well-known link prediction challenge in complex networks, has also attracted a lot of attention.

Link prediction is a very important problem that aims at estimating the likelihood of the existence of a link between two nodes [3], [4]. Solving this problem can not only help us complete the missing data in biological networks such as protein-protein interaction networks and metabolic networks [5], [6], but also enable us to predict the evolution of social networks [7]–[9]. In fact, link prediction is also closely connected to other problems such as recommendation [10] and spurious link detection [11]. A sound link prediction method will help to design more efficient recommendation algorithms that filter out irrelevant information for online users [12]. Moreover, a link prediction method can also be applied to analyze the reliability of existing links and accordingly identify noisy connections in networks. Progress in this field will therefore push forward research in other fields. Accordingly, the problem of missing link prediction has been intensively studied by researchers from different backgrounds, and many methods applied to different fields have been proposed [13]–[16]. For a review, see ref. [17].

The basic assumption for link prediction is that two nodes are more likely to have a link if they are similar to each other. Therefore, the essential problem for link prediction is how to calculate the similarity between nodes accurately.
One of the most straightforward methods is called common neighbor, which measures the similarity between two individuals by directly counting the number of common neighboring nodes [18]. However, this method has a serious shortcoming, as it strongly favors large degree nodes. To solve this problem, many variants, such as the Jaccard index [19] and the Salton index [20], have been proposed to remove this tendency. In addition, some other methods, including the Katz index [21], SimRank [22], the hierarchical random graph [6] and the stochastic block model [23], [24], are also very effective in estimating the similarity of nodes. However, these methods are based on global algorithms that can be prohibitive to use for large-scale systems.

In this paper, we argue that the similarity between nodes can be calculated based on a completely different type of method, called correlation. In its broad definition, correlation refers to any class of statistical relationships involving dependence between two or more random variables. In our case, we refer to the Pearson correlation [25] between the attribute vectors of nodes, which can come from the adjacency matrix or from higher powers of it. In link prediction, one of the biggest challenges is data sparsity: the data at hand is often too sparse to extract valuable similarity information with the simple common neighbor method or its variants. One possible solution has been discussed in ref. [26], in which longer paths (i.e. paths with length larger than 2) are applied to measure the similarity of nodes. However, such high order information includes much noise, so that the similarity matrix is indeed denser but the similarities are not satisfactorily accurate, which leads to a poor outcome of predicted links.
In our simulation, we find that the correlation-based method is very effective when applied to calculate similarity based on high order paths. We finally fuse the new method with the resource allocation method [27], and find that the combined method can substantially outperform the existing methods, especially in sparse networks.

II. RELATED WORKS

To begin our analysis, we first briefly describe the link prediction problem and review some representative methods. Consider an unweighted undirected simple network $G(V, E)$, where $V$ is the set of nodes and $E$ is the set of links. Multiple links and self-connections are not allowed. For each pair of nodes $x, y$ belonging to $V$, we calculate a score $s_{xy}$ which measures the likelihood for nodes $x$ and $y$ to have a link between them. Since $G$ is undirected, the score is supposed to be symmetric, i.e. $s_{xy} = s_{yx}$. All the nonexistent links are sorted in decreasing order according to their scores, and the links on the top are most likely to exist. There are many different ways to calculate the score $s_{xy}$, and the most common and straightforward way is to calculate the similarity between nodes $x$ and $y$. Generally speaking, two nodes are considered to be similar if they have some important topological features in common. A review of these similarity indices can be found in ref. [17]. In this paper, we compare the prediction accuracies of four typical similarity indices: Common Neighbor (CN), Jaccard, Resource Allocation (RA) and Local Path (LP). Their definitions and relevant motivations are introduced as follows:

(i) Common Neighbor (CN). Two nodes $x$ and $y$ are more likely to form a link if they have many common neighbors. Let $\Gamma(x)$ denote the set of neighbors of node $x$. The simplest measure of the neighborhood overlap can be directly calculated as

$$ s^{CN}_{xy} = |\Gamma(x) \cap \Gamma(y)| \tag{1} $$

which is the actual aggregation method used by most websites.
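As a minimal sketch of Eq. (1), assuming the network is stored as a dense 0/1 NumPy adjacency matrix, the common-neighbor count of every node pair can be read off the product of the adjacency matrix with itself, because each length-2 path x-z-y corresponds to one common neighbor z. The function name `common_neighbors` and the 5-node toy graph are illustrative choices and do not come from the paper.

```python
import numpy as np


def common_neighbors(adj):
    """Return the matrix of CN scores, s_xy = |Gamma(x) & Gamma(y)|."""
    adj = np.asarray(adj, dtype=int)
    scores = adj @ adj            # (A^2)_xy counts the length-2 paths x - z - y
    np.fill_diagonal(scores, 0)   # a node paired with itself is not a candidate link
    return scores


# Hypothetical toy graph used only for illustration.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
A = np.zeros((5, 5), dtype=int)
for u, v in edges:
    A[u, v] = A[v, u] = 1

S_cn = common_neighbors(A)

# Cross-check one pair against the set-based definition.
neighbors = [set(np.flatnonzero(A[i]).tolist()) for i in range(5)]
overlap_0_3 = len(neighbors[0] & neighbors[3])
```

In this toy graph nodes 0 and 3 share the neighbors 1 and 2, so both the matrix entry and the set intersection give 2. For large sparse networks a sparse matrix representation would probably be preferable to the dense array assumed here.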
The drawback of CN is that it favors nodes with large degree. It is obvious that $s^{CN}_{xy} = (A^2)_{xy}$, where $A$ is the adjacency matrix, in which $A_{xy} = 1$ if $x$ and $y$ are directly connected and $A_{xy} = 0$ otherwise. Note that $(A^2)_{xy}$ is also the number of different paths with length 2 connecting $x$ and $y$. Newman [17] used this quantity in the study of collaboration networks, showing the correlation between the number of common neighbors and the probability that two scientists will collaborate in the future. Therefore, we here select CN as the representative of all CN-based measures. Although CN consumes little time and performs relatively well among many local indices, its accuracy cannot catch up with the measures based on global information, because the information it uses is insufficient. One typical example of such a global measure is the Katz index [21].

(ii) Jaccard coefficient (Jaccard). This index was proposed by Jaccard over a hundred years ago and is a traditional similarity measure in the literature. It is defined as

$$ s^{Jaccard}_{xy} = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}. \tag{2} $$

The motivation of this index is that the pure common neighbor count strongly favors large degree nodes: it is easier for large degree nodes to form common neighbors with other nodes. The denominator removes the tendency for high degree nodes to have high similarity with other nodes. Note that there are many other ways to remove the bias of CN towards large degree nodes, such as the cosine index, the Sorensen index, the Hub Promoted index and so on [17].

(iii) Resource Allocation (RA). This index is motivated by the resource allocation dynamics on complex systems.
Consider a pair of nodes, $x$ and $y$, which are not directly connected. The index assumes that node $x$ needs to send some resource to $y$, with their common neighbors playing the role of transmitters. In the simplest case, we assume that each transmitter has a unit of resource and will equally distribute it to all its neighbors. The similarity between $x$ and $y$ is defined as the amount of resource $y$ receives from $x$:

$$ s^{RA}_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}, \tag{3} $$

where $k_z$ is the degree of node $z$. It is obvious that this measure is symmetric, namely $s_{xy} = s_{yx}$. A similar index is the Adamic-Adar (AA) index, which simply replaces $k_z$ in the above equation by $\log k_z$. Although resulting from different motivations, the AA index and the RA index have very similar forms. Indeed, they both depress the contribution of the high-degree common neighbors. The AA index takes the form $(\log k_z)^{-1}$ while the RA index takes the form $k_z^{-1}$. The difference is insignificant when the degree $k_z$ is small, while it is considerable when $k_z$ is large. So the RA index punishes the high degree common neighbors more heavily than AA. A previous study showed that RA performs the best among all the common-neighbor-based methods in the USAir network, the NetScience network, the Power Grid network, etc.

(iv) Local Path index (LP). This index was introduced in ref. [26]. It takes local paths into consideration, with a wider horizon than CN. In matrix form, it is given by

$$ S^{LP} = A^2 + \epsilon A^3 \tag{4} $$

where $\epsilon$ is a free parameter and $A$ is the adjacency matrix of the network. Clearly, this measure degenerates to CN when $\epsilon = 0$. If $x$ and $y$ are not directly connected, $(A^3)_{xy}$ is equal to the number of different paths with length 3 connecting $x$ and $y$.
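The three indices of Eqs. (2) to (4) can be sketched in matrix form as follows. This is only an illustration under stated assumptions: the network is a dense 0/1 NumPy adjacency matrix, the function names and the toy graph are hypothetical, and the diagonal entries of the returned matrices are not meaningful because a node paired with itself is never a candidate link.

```python
import numpy as np


def jaccard(adj):
    """Eq. (2): |intersection| / |union| of the two neighbor sets."""
    adj = np.asarray(adj, dtype=float)
    inter = adj @ adj                             # common-neighbor counts
    deg = adj.sum(axis=1)
    union = deg[:, None] + deg[None, :] - inter   # |G(x)| + |G(y)| - |G(x) & G(y)|
    out = np.zeros_like(inter)
    np.divide(inter, union, out=out, where=union > 0)
    return out


def resource_allocation(adj):
    """Eq. (3): sum of 1/k_z over the common neighbors z."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_deg = np.zeros_like(deg)
    np.divide(1.0, deg, out=inv_deg, where=deg > 0)
    # (adj * inv_deg)[x, z] = A_xz / k_z, so the product sums over transmitters z.
    return (adj * inv_deg) @ adj


def local_path(adj, eps=0.01):
    """Eq. (4): A^2 + eps * A^3."""
    adj = np.asarray(adj, dtype=float)
    a2 = adj @ adj
    return a2 + eps * (a2 @ adj)


# Hypothetical toy graph used only for illustration.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
A = np.zeros((5, 5))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

S_jac = jaccard(A)
S_ra = resource_allocation(A)
S_lp = local_path(A, eps=0.01)
```

For the unconnected pair (0, 3) in this toy graph, the two common neighbors both have degree 3, so the RA score is 1/3 + 1/3; the neighbor sets have a union of size 3, so the Jaccard score is 2/3 as well.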
This index can be extended to account for higher-order paths, as $S^{LP(m)} = A^2 + \epsilon A^3 + \epsilon^2 A^4 + \dots + \epsilon^{m-2} A^m$, where $m > 2$ is the maximal order. When $m$ equals the number of nodes in the network, LP is equivalent to the well-known Katz index [21], which takes all paths in the network into account. The computational complexity of this index in an uncorrelated network is $O(N \langle k \rangle^m)$, where $N$ is the number of nodes and $\langle k \rangle$ the average degree. This grows fast with increasing $m$ and, for large $m$, will exceed the complexity of calculating the Katz index (around $O(N^3)$). Experimental results show that the optimal $m$ is positively correlated with the average shortest distance of the network.

[Figure 1: four panels, (a) Dolphin, (b) NetScience, (c) E-mail, (d) TAP; x-axis: ratio of training set; y-axis: AUC; curves: Correlation and $A^3$ similarity.]
Fig. 1. (colour online) AUC of the $s^{Corr}_{xy}$ and $A^3$ methods as a function of $1 - p$ (the fraction of links in the training set) in four real networks. The error bars are obtained based on 10 independent realizations.

III. SIMILARITY BASED ON CORRELATION BETWEEN NODES

The above methods, though effective in link prediction, all measure the similarity between nodes based on common neighbor information. In this paper, we propose to calculate node similarities based on the correlation between nodes. In fact, a similar idea has been applied to design a ranking algorithm for the reputation of online users [28]. Given a vector $v_x$ ($v_y$) describing the features of a node $x$ ($y$), we calculate the similarity between these two nodes based on the Pearson correlation coefficient of $v_x$ and $v_y$.
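A minimal sketch of this idea, assuming the feature vector of node x is the x-th column of a power of the adjacency matrix (the choice adopted in this paper), computes the Pearson coefficient for all node pairs at once. The function name `correlation_similarity`, the toy graph, and the convention of assigning zero similarity to a node whose feature vector is constant are assumptions of this sketch, not part of the paper.

```python
import numpy as np


def correlation_similarity(adj, m=2):
    """Pearson correlation between the columns of A^m, for every node pair."""
    adj = np.asarray(adj, dtype=float)
    features = np.linalg.matrix_power(adj, m)       # column x is the vector v_x
    centered = features - features.mean(axis=0, keepdims=True)
    std = features.std(axis=0)                      # population standard deviation
    cov = centered.T @ centered / features.shape[0]
    denom = np.outer(std, std)
    corr = np.zeros_like(cov)
    # Assumption: pairs involving a constant feature vector get similarity 0.
    np.divide(cov, denom, out=corr, where=denom > 0)
    return corr


# Hypothetical toy graph used only for illustration.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
A = np.zeros((5, 5))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

S_corr = correlation_similarity(A, m=2)
```

When no feature vector is constant, as in this toy graph, the result agrees with `np.corrcoef(features, rowvar=False)`; the explicit version is shown so that each term can be compared with the definition of the Pearson coefficient.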
Mathematically, the correlation-based similarity reads

$$ s^{Corr}_{xy} = \frac{1}{N} \sum_{l=1}^{N} \frac{(v_{xl} - \bar{v}_x)(v_{yl} - \bar{v}_y)}{\sigma_{v_x} \sigma_{v_y}} \tag{5} $$

where $\bar{v}_x$ and $\sigma_{v_x}$ are respectively the mean and standard deviation of vector $v_x$. As discussed above, $v_x$ should be an attribute vector for node $x$. One simple way would be to directly set $v_{xl} = A_{xl}$. In this paper, we go beyond the adjacency matrix and take $A^m$ into consideration ($m$ can be larger than 1), so that we set $v_x$ as the corresponding column of $A^m$.

[Figure 2: four panels, (a) Dolphin, (b) NetScience, (c) E-mail, (d) TAP; x-axis: $\epsilon$; y-axis: AUC; curves: HCR and LP.]
Fig. 2. (colour online) The dependence of the AUC of the HCR method on $\epsilon$ in four real networks. The results of the LP method are shown for comparison. In the LP method, the parameter is chosen as $\epsilon = 0.01$, which is shown to be the optimal parameter for this method according to ref. [26]. The error bars in this figure are obtained based on 10 independent realizations.

IV. DATA AND METRICS

To evaluate the effectiveness of the above methods, we consider nine empirical networks including both social and nonsocial networks:
(i) Dolphin: a dolphin friendship network, which is an undirected social network of frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand [30].
(ii) Jazz: a music collaboration network obtained from The Red Hot Jazz Archive digital database. It includes 198 bands that performed between 1912 and 1940, with most of the bands active from 1920 to 1940 [31].
(iii) C. elegans: the neural network of the nematode worm C. elegans, in which an edge joins two neurons if they are connected by either a synapse or a gap junction [32].
(iv) USAIR: the US air transportation network [33], which contains 332 airports and 2126 airlines.
(v) Netscience: a coauthorship network between scientists who are publishing on the topic of network science [34]. This network contains 1589 scientists, 128 of whom are isolated. Here we do not consider those isolated nodes. The network consists of 268 connected components, and the size of the largest connected component is only 379, so its connectivity is poor.
(vi) Email: an email communication network [35].
(vii) TAP: a yeast protein binding network generated by tandem affinity purification experiments [36].
(viii) Power Grid: the electrical power grid of the western US [32], with nodes representing generators, transformers and substations, and links corresponding to the high-voltage transmission lines between them. This network contains 4941 nodes, and they are well connected.
(ix) HEP: a collaboration network of high energy physicists, namely the scientists posting preprints on the high-energy theory archive at www.arxiv.org from 1995 to 1999 [34].

We only take into account the giant component of these networks. This is because for a pair of nodes located in two disconnected components, the score $s_{xy}$ will be zero according to CN and its variants. Table I shows the basic statistics of the giant components of those networks.

For each of the real networks, the observed links $M$ are randomly divided into two parts: the training set $M^T$ is treated as known information, while the probe set $M^P$ is used for verifying the prediction accuracy, and no information in it is permitted to be used for prediction. $M^T$ plus $M^P$ is the whole data set. Note that, each time before moving a link to the probe set, we first check whether this removal would make the training network disconnected.
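The division into training and probe sets can be sketched as below. The text does not spell out what happens when a removal would disconnect the training network; this sketch assumes that such a link simply stays in the training set and the next random link is tried. The function names, the breadth-first connectivity test and the toy graph are all illustrative assumptions.

```python
import random
from collections import deque


def is_connected(nodes, edges):
    """Breadth-first search check that `edges` connect all of `nodes`."""
    adjacency = {u: set() for u in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(nodes)


def split_links(edges, probe_fraction=0.1, seed=0):
    """Move random links to the probe set while keeping the training network connected."""
    rng = random.Random(seed)
    nodes = {u for edge in edges for u in edge}
    training = set(edges)
    probe = set()
    target = int(round(probe_fraction * len(edges)))
    candidates = list(edges)
    rng.shuffle(candidates)
    for edge in candidates:
        if len(probe) >= target:
            break
        training.remove(edge)
        if is_connected(nodes, training):
            probe.add(edge)
        else:
            training.add(edge)   # assumed rule: a link whose removal disconnects stays
    return training, probe


# Hypothetical toy graph; the link (3, 4) is a bridge and can never be removed.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
training, probe = split_links(edges, probe_fraction=0.34, seed=1)
```

A link rejected once never needs to be retried, because removing further links cannot turn a bridge back into a removable link. Re-running a full connectivity search for every candidate is the simplest option but would likely be slow on the larger networks considered here.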
Usually, the training set contains 90% of the links and the probe set consists of the remaining 10%. In this paper, we employ a standard metric, the area under the receiver operating characteristic curve (AUC) [37], to measure the accuracy of the prediction. AUC can be interpreted as the probability that a randomly chosen missing link from $M^P$ is given a higher score than a randomly chosen nonexistent link. In practice, we usually calculate the score of each non-observed link instead of producing the ordered list, since the latter task is more time-consuming. AUC then requires $n$ independent comparisons. Each time, we randomly choose a missing link and a nonexistent link and compare their scores. If, among the $n$ comparisons, the missing link has a higher score $n_1$ times and the two links have the same score $n_2$ times, the final AUC is calculated as $AUC = (n_1 + 0.5\, n_2)/n$. If all the scores are drawn from an independent and identical distribution, then AUC should be around 0.5. A higher AUC corresponds to a more accurate prediction.

TABLE I
Structural properties of the different real networks. Structural properties include network size ($N$), edge number ($E$), degree heterogeneity ($H = \langle k^2 \rangle / \langle k \rangle^2$), degree assortativity ($r$) [34], clustering coefficient ($\langle C \rangle$) [29] and average shortest path length ($\langle d \rangle$).

Network      N      E       H       r        ⟨C⟩     ⟨d⟩
Dolphin      62     159     1.327   -0.044   0.259   3.357
Jazz         198    2742    1.395   0.020    0.618   2.235
C.elegans    297    2148    1.801   -0.163   0.292   2.455
USAIR        332    2126    3.464   -0.208   0.749   2.46
Netscience   379    914     1.663   -0.082   0.741   6.042
Email        1133   5451    1.942   0.078    0.220   3.606
TAP          1373   6833    1.644   0.579    0.529   5.224
PowerGrid    4941   6594    1.450   0.004    0.080   18.989
HEP          5835   13815   1.926   0.185    0.506   7.026

V. RESULTS

We first compare the performance of our method on four representative data sets: Dolphin, Netscience, Email and TAP. The detailed values on the other data sets are reported in Table II. In fact, we observe in our simulation that $s^{Corr}_{xy}$ by itself cannot outperform traditional similarity measures such as CN, Jaccard, RA and LP in link prediction. However, we find that $s^{Corr}_{xy}$ works well in extracting the node similarity information from high order paths (i.e. paths with length larger than 2). In order to show this, we set $m = 2$ in $s^{Corr}_{xy}$ and compare it with $s_{xy} = (A^3)_{xy}$, which directly uses the number of paths with length 3 to measure the similarity between nodes. Note that the $A^3$ method has already been combined with the common neighbor index in the LP method to solve the data sparsity problem.

We investigate the effect of data sparsity on the $s^{Corr}_{xy}$ and $A^3$ methods. To this end, we move a fraction $p$ of all links to the probe set and use the remaining fraction $1 - p$ of the links as the training set. A larger $p$ corresponds to sparser known information about the real network. The AUC results of both methods under different $p$ are presented in Fig. 1. One can see that, in all networks considered, the AUC of $s^{Corr}_{xy}$ is significantly higher than that of $A^3$. Interestingly, the advantage of $s^{Corr}_{xy}$ over $A^3$ generally becomes larger when the fraction of links in the training set is smaller. These results are important, since high order paths are usually applied precisely to solve the data sparsity problem: a better use of the high order path information can solve the data sparsity problem more effectively.

Inspired by the results above, we propose to combine the $s^{Corr}_{xy}$ method with one of the traditional similarity methods to achieve higher accuracy in link prediction.
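The AUC values discussed here follow the sampling procedure of Section IV, $AUC = (n_1 + 0.5\, n_2)/n$. A minimal sketch of that estimator is given below, assuming the scores are stored in a square NumPy array and the probe and nonexistent links are given as lists of index pairs. The function name `sampled_auc`, its default arguments and the toy score matrices are hypothetical.

```python
import numpy as np


def sampled_auc(scores, probe_links, nonexistent_links, n=10000, seed=0):
    """Estimate AUC from n random (probe link, nonexistent link) comparisons."""
    rng = np.random.default_rng(seed)
    probe = np.array([scores[x, y] for x, y in probe_links])
    absent = np.array([scores[x, y] for x, y in nonexistent_links])
    p = rng.choice(probe, size=n)     # scores of randomly drawn probe links
    q = rng.choice(absent, size=n)    # scores of randomly drawn nonexistent links
    n1 = np.count_nonzero(p > q)      # probe link scored strictly higher
    n2 = np.count_nonzero(p == q)     # ties count one half
    return (n1 + 0.5 * n2) / n


# Two hypothetical score matrices for a 3-node example.
informative = np.array([[0.0, 3.0, 1.0],
                        [3.0, 0.0, 1.0],
                        [1.0, 1.0, 0.0]])
uninformative = np.ones((3, 3))

auc_good = sampled_auc(informative, [(0, 1)], [(0, 2), (1, 2)])
auc_flat = sampled_auc(uninformative, [(0, 1)], [(0, 2), (1, 2)])
```

In the first case the probe link always scores higher, so every comparison counts towards $n_1$; in the second case every comparison is a tie, which gives the baseline value of one half.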
As the RA method is one of the most efficient among the variants of the CN method, we adopt it to design the new method. As the new method is a Hybrid of the Correlation method and the Resource allocation method, we refer to it as the HCR method in this paper. The formula of the HCR method reads

$$ s^{HCR}_{xy} = s^{RA}_{xy} + \epsilon\, s^{Corr}_{xy}(m = 2), \tag{6} $$

where $\epsilon$ is a tunable parameter. In fact, $s^{RA}_{xy}$ already enjoys a high prediction accuracy in dense data, and $s^{Corr}_{xy}$ can accurately predict missing links with very sparse information. Therefore, the HCR method is a very general link prediction method which is supposed to work in both dense and sparse networks.

To validate the HCR method, we study the dependence of its AUC on $\epsilon$ in four real networks in Fig. 2. One can see that the prediction results are substantially improved once $\epsilon$ is larger than 0, which indicates that the $s^{Corr}_{xy}$ method can indeed improve the $s^{RA}_{xy}$ method (corresponding to $\epsilon = 0$). Moreover, as the LP method was proposed to solve the data sparsity problem as well, we compare the HCR method with it in Fig. 2. One can see that, when $\epsilon > 0$, the HCR method can outperform the LP method, indicating that the data sparsity problem is better addressed by the HCR method. This result is reasonable, as we already observed above that $s^{Corr}_{xy}$ can outperform $A^3$ in sparse networks.

The results on the other networks are reported in Table II. One can see as well that the HCR method has a higher AUC than the other methods in all the networks considered. Among these networks, the power grid is a very sparse network. The similarity indices based on local information, such as CN, RA and Jaccard, all have a low AUC there, and different normalizations of CN make little difference. However, once semi-local information is taken into account, the LP method improves the AUC by about 10% (from 0.624 to 0.689).
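The hybrid score of Eq. (6) and the ranking of candidate links can be sketched as follows. The sketch assumes a dense 0/1 NumPy adjacency matrix and the same column-of-$A^m$ feature vectors as in Section III; the function names, the treatment of constant feature vectors (similarity 0) and the toy graph are illustrative assumptions rather than the implementation used for the reported results.

```python
import numpy as np


def resource_allocation(adj):
    """RA score: sum of 1/k_z over the common neighbors z."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_deg = np.divide(1.0, deg, out=np.zeros_like(deg), where=deg > 0)
    return (adj * inv_deg) @ adj


def correlation_similarity(adj, m=2):
    """Pearson correlation between the columns of A^m."""
    feats = np.linalg.matrix_power(np.asarray(adj, dtype=float), m)
    centered = feats - feats.mean(axis=0)
    norms = np.sqrt((centered ** 2).sum(axis=0))
    denom = np.outer(norms, norms)
    # Assumption: a constant feature vector yields similarity 0.
    return np.divide(centered.T @ centered, denom,
                     out=np.zeros_like(denom), where=denom > 0)


def hcr_scores(adj, eps=0.01, m=2):
    """Eq. (6): RA score plus eps times the correlation score."""
    return resource_allocation(adj) + eps * correlation_similarity(adj, m)


def rank_candidates(scores, adj):
    """Sort the unconnected node pairs by decreasing score."""
    n = adj.shape[0]
    pairs = [(x, y) for x in range(n) for y in range(x + 1, n) if adj[x, y] == 0]
    return sorted(pairs, key=lambda pair: scores[pair], reverse=True)


# Hypothetical toy graph used only for illustration.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
A = np.zeros((5, 5))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

S_hcr = hcr_scores(A, eps=0.01)
ranking = rank_candidates(S_hcr, A)
```

Because the correlation term lies between -1 and 1, a small $\epsilon$ mostly acts as a tie-breaker among pairs with equal RA scores and as the only signal for pairs without common neighbors, which is where sparse networks would be expected to benefit.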
Interestingly, the HCR method performs even better than LP and improves the AUC of the local indices by about 23% (from 0.624 to 0.767).

TABLE II
Comparison of the accuracy of different algorithms, quantified by AUC, for each real network considered. The training set contains 90% of the known links. Each number is obtained by averaging over 10 implementations with independently random divisions of training set and probe set. We set the parameters $\epsilon = 10^{-2}$ in LP and $\epsilon = 10^{-2}$ in HCR. The highest accuracy in each row is marked with an asterisk.

Network      CN      RA      Jaccard   LP      HCR
Dolphin      0.803   0.806   0.802     0.829   0.846*
Jazz         0.955   0.971   0.958     0.947   0.973*
C. elegans   0.846   0.867   0.811     0.866   0.881*
USAIR        0.954   0.972   0.915     0.952   0.974*
Netsci       0.981   0.985   0.981     0.989   0.992*
Email        0.858   0.859   0.856     0.919   0.922*
TAP          0.954   0.954   0.956     0.967   0.977*
PowerGrid    0.624   0.624   0.624     0.689   0.767*
HEP          0.941   0.943   0.942     0.961   0.965*

A similar phenomenon can be seen in the Email network as well. On the other hand, when the network is dense, such as the Jazz and USAir networks, the LP method cannot outperform the CN method, as the information from high order paths in this case is too noisy. In contrast, the HCR method has a higher AUC than CN and the other methods, indicating that HCR can make better use of the high order path information than LP.

In link prediction, it is generally difficult to predict the missing links of nodes with small degree. This is known as the "cold-start" problem [17]. In the literature, it has already been shown that the cold-start problem can be well addressed by changing the denominator in the CN method [40].
More specifically, the prediction accuracy for small degree nodes can be largely improved when a larger score is given to node pairs with small degree. However, in sparse networks the cold-start problem cannot be effectively solved in this way. In other words, the AUC cannot be substantially increased by just changing the denominator in the CN method. In fact, the essential difficulty of the cold-start problem is that the available information for the small degree nodes is too limited for the algorithms to accurately predict their missing links. LP and HCR can address the cold-start problem by incorporating more information from high order paths. In order to show this, we pick the nodes with degree smaller than $k$, and report the prediction accuracy (AUC) of the probe set links between them in Fig. 3, where we compare the CN, LP and HCR methods. As expected, one can see that AUC generally increases with $k$, indicating that the links connecting small degree nodes are indeed more difficult to predict correctly. Moreover, it is clear that the LP method results in a higher AUC than CN, and that the HCR method significantly improves the AUC of small degree nodes further. Therefore, we conclude that HCR is more effective in solving the cold-start problem than the LP method. In Fig. 3, we consider four real networks. Even though the results are generally the same, we observe that the advantage of the HCR method is bigger in sparser networks.

[Figure 3: four panels, (a) Dolphin, (b) NetScience, (c) E-mail, (d) TAP; x-axis: degree; y-axis: AUC; curves: HCR, LP and CN.]
Fig. 3. (colour online) The AUC of the probe set links connecting nodes with degree sum smaller than $k$ when different link prediction algorithms are applied. In the LP method, the parameter is chosen as $\epsilon = 0.01$. In the HCR method, the parameter is chosen as $\epsilon = 0.01$. The error bars are obtained based on 10 independent realizations.

In principle, one can extend the current HCR method to deal with even higher order paths, and the modified HCR method reads

$$ s^{HCR}_{xy} = s^{RA}_{xy} + \epsilon \sum_{m} s^{Corr}_{xy}(m). \tag{7} $$

The results show that AUC can be slightly improved with higher order paths. We also consider an extension of the LP method,

$$ S^{LP} = A^2 + \sum_{m} \epsilon^{m} A^{m}. \tag{8} $$

However, the AUC of LP decreases when $m > 3$. These results again support that HCR is more effective than LP in extracting similarity information, especially when the original information contains noise.

VI. CONCLUSION AND DISCUSSION

In this paper, we employ the Pearson correlation coefficient to measure the similarity between nodes, and accordingly apply it to predicting the future links. We find that, though the correlation method cannot outperform the common neighbor index and its variants in link prediction, it is very efficient in extracting similarity information from high order paths. This is because the Pearson correlation coefficient is generally more robust to noise than the traditional indices based on common neighbors. We further combine the correlation method and the resource allocation method, and find that the combined method can outperform the existing link prediction methods, especially when little information is available from the observed network. We compare the new method with one existing method that was intended to solve the data sparsity problem, and the results show that our method has higher accuracy.

Many issues still remain open. Our work implies that the Pearson correlation coefficient is more resistant to noisy information than the other methods.
An interesting extension would be to investigate the link prediction problem in a noisy environment, i.e. when the observed network contains some noisy links. One can compare the correlation-based method with the other methods, and systematically study the robustness of these methods to noise. The Pearson correlation opens a new direction for measuring the similarity between nodes. In fact, there are other correlation coefficients, such as the Spearman [38] and Kendall's tau [39] coefficients. A detailed study of their performance in link prediction would be another interesting extension.

Our results also show that the high order paths in networks contain valuable information to characterize node similarity. This information is especially important for sparse networks. A similar study has already been conducted in recommender systems, where semi-local diffusion is found to significantly improve the recommendation accuracy [41]. However, if such information is not used properly, too much noise will be involved and may jeopardize the prediction accuracy [42]. Therefore, a link prediction method that is tolerant of noise is very important. In this paper, we present a possible method to solve this problem. There are other possible ways to address it, such as only taking into account the salient high order paths. Related methods call for investigation in the future.

ACKNOWLEDGMENT

This work was partially supported by the EU FP7 Grant 611272 (project GROWTHCOM) and by the Swiss National Science Foundation (grant no. 200020-143272).

REFERENCES

[1] D. Brockmann and D. Helbing. The hidden geometry of complex, network-driven contagion phenomena. Science, 342(1337):1337–1342, 2013.
[2] D. Wang, C. Song, and A.-L. Barabási. Quantifying long-term scientific impact. Science, 342(6154):127–132, 2013.
[3] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.
[4] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In CIKM, pages 556–559, 2003.
[5] S. Redner. Teasing out the missing links. Nature, 453, 2008.
[6] A. Clauset, C. Moore, and M. E. J. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453, 2008.
[7] L. A. Adamic and E. Adar. Friends and neighbors on the Web. Social Networks, 25(3):211–230, 2003.
[8] D. Wang, D. Pedreschi, C. Song, F. Giannotti, and A. Barabasi. Human mobility, social ties, and link prediction. In KDD, pages 1100–1108, 2011.
[9] A. De, N. Ganguly, and S. Chakrabarti. Discriminative link prediction using local links, node features and community structure. In ICDM, pages 1009–1014, 2013.
[10] Z. Zhang, C. Liu, and T. Zhou. Solving the cold-start problem in recommender systems with social tags. Europhysics Letters, 92(2), 2010.
[11] A. Zeng and G. Cimini. Removing spurious interactions in complex networks. Physical Review E, 85, 036101, 2012.
[12] J. Zhang, X. Kong, and P. S. Yu. Predicting social links for new users across aligned heterogeneous social networks. In ICDM, pages 1289–1294, 2013.
[13] R. N. Lichtenwalter, J. T. Lussier, and N. V. Chawla. New perspectives and methods in link prediction. In SIGKDD Conference, pages 243–252, Washington, DC, USA, 2010. ACM.
[14] Z. Lu, B. Savas, W. Tang, and I. Dhillon. Supervised link prediction using multiple sources. In ICDM, pages 923–928, 2010.
[15] D. Shin, S. Si, and I. S. Dhillon. Multi-scale link prediction. In CIKM, pages 215–224, 2012.
[16] L. Lu and T. Zhou. Role of Weak Ties in Link Prediction of Complex Networks. CIKM, 2009.
[17] L. Lu and T. Zhou. Link prediction in complex networks: A survey. Physica A, 390:1150–1170, 2011.
[18] F. Lorrain and H. C. White. Structural equivalence of individuals in social networks. J. Math. Sociol., 1, 49, 1971.
[19] P. Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547–579, 1901.
[20] G. Salton, A. Wong, and S. C. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613, Washington, DC, USA, 1975. ACM.
[21] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, Mar. 1953.
[22] G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. In KDD, pages 538–543, Washington, DC, USA, 2002. ACM.
[23] B. Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83, 016107, 2011.
[24] J. J. Whang, P. Rai, and I. S. Dhillon. Stochastic Blockmodel with Cluster Overlap, Relevance Selection, and Similarity-Based Smoothing. In ICDM, pages 817–826, 2013.
[25] L. Lin. A concordance correlation coefficient to evaluate reproducibility. Biometrika, 45:255–268, 1989.
[26] L. Lu, C. H. Jin, and T. Zhou. Similarity index based on local paths for link prediction of complex networks. Physical Review E, 80, 046122, 2009.
[27] L. Lu, T. Zhou, and Y. C. Zhang. Predicting missing links via local information. The European Physical Journal B-Condensed Matter and Complex Systems, 71(4):623–630, 2009.
[28] H. Liao, A. Zeng, R. Xiao, D. B. Chen, and Z. M. Ren. Ranking Reputation and Quality in Online Rating Systems. PLoS ONE, 9(5): e97146, 2014.
[29] M. E. J. Newman. Assortative mixing in networks. Phys. Rev. Lett., 89, 208701, 2002.
[30] D. Lusseau and M. E. J. Newman. Identifying the role that individual animals play in their social network. Behav. Ecol. Sociobiol., 54, 396, 2003.
[31] P. M. Gleiser and L. Danon. Community structure in JAZZ. Adv. Complex Syst., 6, 565, 2003.
[32] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393, 440, 1998.
[33] http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm.
[34] M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 74, 036104, 2006.
[35] R. Guimera, L. Danon, A. Diaz-Guilera, F. Giralt, and A. Arenas. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 68, 065103, 2003.
[36] A. C. C. Gavin et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141, 2002.
[37] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29–36, 1982.
[38] C. Spearman. The proof and measurement of association between two things. Amer. J. Psychol., 15:72–101, 1904.
[39] M. Kendall. A new measure of rank correlation. Biometrika, 30:81–93, 1938.
[40] Y. X. Zhu, L. Y. Lu, Q. M. Zhang, and T. Zhou. Uncovering missing links with cold ends. Physica A, 391, 5769, 2012.
[41] W. Zeng, A. Zeng, and M. S. Shang. Information Filtering in Sparse Online Systems: Recommendation via Semi-Local Diffusion. PLoS ONE, 8(11): e79354, 2013.
[42] T. Zhou, R. Q. Su, R. R. Liu, L. L. Jiang, B. H. Wang, et al. Accurate and diverse recommendations via eliminating redundant correlations. New J. Phys., 11, 123008, 2009.
