Link prediction for partially observed networks
Link prediction is one of the fundamental problems in network analysis. In many applications, notably in genetics, a partially observed network may not contain any negative examples of absent edges, which creates a difficulty for many existing superv…
Authors: Yunpeng Zhao, Elizaveta Levina, Ji Zhu
Yunpeng Zhao (1), Elizaveta Levina (2), and Ji Zhu (2)
(1) Department of Statistics, George Mason University, Fairfax, VA 22030
(2) Department of Statistics, University of Michigan, Ann Arbor, MI 48109

Abstract

Link prediction is one of the fundamental problems in network analysis. In many applications, notably in genetics, a partially observed network may not contain any negative examples of absent edges, which creates a difficulty for many existing supervised learning approaches. We develop a new method which treats the observed network as a sample of the true network with different sampling rates for positive and negative examples. We obtain a relative ranking of potential links by their probabilities, utilizing information on node covariates as well as on network topology. Empirically, the method performs well under many settings, including when the observed network is sparse. We apply the method to a protein-protein interaction network and a school friendship network.

1 Introduction

A variety of data in many different fields can be described by networks. Examples include friendship and social networks, food webs, protein-protein interaction and gene regulatory networks, the World Wide Web, and many others. One of the fundamental problems in network science is link prediction, where the goal is to predict the existence of a link between two nodes based on observed links between other nodes as well as additional information about the nodes (node covariates) when available (see [17], [16] and [7] for recent reviews). Link prediction has wide applications. For example, recommendation of new friends or connections for members is an important service in online social networks such as Facebook.
In biological networks, such as protein-protein interaction and gene regulatory networks, it is usually time-consuming and expensive to test the existence of links by comprehensive experiments; link prediction in these biological networks can provide specific targets for future experiments.

There are two different settings under which the link prediction problem is commonly studied. In the first setting, a snapshot of the network at time $t$, or a sequence of snapshots at times $1, \ldots, t$, is used to predict new links that are likely to appear in the near future (at time $t + 1$). In the second setting, the network is treated as static but not fully observed, and the task is to fill in the missing links in such a partially observed network. These two tasks are related in practice, since a network evolving over time can also be partially observed and a missing link is more likely to emerge in the future. From the analysis point of view, however, these settings are quite different; in this paper, we focus on the partially observed setting and do not consider networks evolving over time.

There are several types of methods for the link prediction problem in the literature. The first class of methods consists of unsupervised approaches based on various types of node similarities. These methods assign a similarity score $s(i, j)$ to each pair of nodes $i$ and $j$, and higher similarity scores are assumed to imply higher probabilities of a link. Similarities can be based either on node attributes or solely on the network structure, such as the number of common neighbors; the latter are known as structural similarities.
Typical choices of structural similarity measures include local indices based on common neighbors, such as the Jaccard index [16] or the Adamic-Adar index [1], and global indices based on the ensemble of all paths, such as the Katz index [14] and the Leicht-Holme-Newman index [15]. Comprehensive reviews of such similarity measures can be found in [16] and [17].

Another class of approaches to link prediction includes supervised learning methods that use both network structures and node attributes. These methods treat link prediction as a binary classification problem, where the responses are $\{1, 0\}$, indicating whether there exists a link for a pair, and the predictors are covariates for each pair, which are constructed from node attributes. A number of popular supervised learning methods have been applied to the link prediction problem. For example, [2] and [4] use the support vector machine with pairwise kernels, and [8] compares the performance of several supervised learning methods. Other supervised methods use probabilistic models for incomplete networks to do link prediction, for example, the hierarchical structure models [6], latent space models [11], latent variable models [10, 18], and stochastic relational models [21].

Our approach falls in the supervised learning category, in the sense that we make use of both the node similarities and observed links. However, one difficulty in treating link prediction as a straightforward classification problem is the lack of certainty about the negative and positive examples. This is particularly true for negative examples (absent edges). In biological networks in particular, there may be no certain negative examples at all [3].
For instance, in a protein-protein interaction network, an absent edge may not mean that there is no interaction between the two proteins; instead, it may indicate that the experiment to test that interaction has not been done, or that it did not have enough sensitivity to detect the interaction. Positive examples could sometimes also be spurious; for example, high-throughput experiments can yield a large number of false positive protein-protein interactions [19].

Here we propose a new link prediction method that allows for the presence of both false positive and false negative examples. More formally, we assume that the network we observe is the true network with independent observation errors, i.e., with some true edges missing and other edges recorded erroneously. The error rates for both kinds of errors are assumed unknown, and in fact cannot be estimated under this framework. However, we can provide rankings of potential links in order of their estimated probabilities, for node pairs with observed links as well as for node pairs with no observed links. These relative rankings, rather than absolute probabilities of edges, are sufficient in many applications. For example, pairs of proteins without observed interactions that rank highly could be given priority in subsequent experiments. To obtain these rankings, we utilize node covariates when available, and/or network topology based on observed links.

The rest of the paper is organized as follows. In Section 2, we specify our (rather minimal) model assumptions for the network and the edge errors. We propose link ranking criteria for both directed and undirected networks in Section 3. The algorithms used to optimize these criteria are discussed in Section 4. In Section 5 we compare performance of the proposed criteria to other link prediction methods on simulated networks.
In Section 6, we apply our methods to link prediction in a protein-protein interaction network and a school friendship network. Section 7 concludes with a summary and discussion of future directions.

2 The network model

A network with $n$ nodes (vertices) can be represented by an $n \times n$ adjacency matrix $A = [A_{ij}]$, where $A_{ij} = 1$ if there is an edge from $i$ to $j$, and 0 otherwise. We will consider the link prediction problem for both undirected and directed networks. Therefore $A$ can be either symmetric (for undirected networks) or asymmetric (for directed networks).

In our framework, we distinguish between the adjacency matrix of the true underlying network, $A^{\mathrm{True}}$, and its observed version $A$. We assume that each $A^{\mathrm{True}}_{ij}$ follows a Bernoulli distribution with $P(A^{\mathrm{True}}_{ij} = 1) = P_{ij}$. Given the true network, we assume that the observed network is generated by

$$P(A_{ij} = 1 \mid A^{\mathrm{True}}_{ij} = 1) = \alpha, \qquad P(A_{ij} = 0 \mid A^{\mathrm{True}}_{ij} = 0) = \beta,$$

where $\alpha$ and $\beta$ are the probabilities of correctly recording a true edge and an absent edge, respectively. Note that we assume that this probability is constant and does not depend on $i$, $j$, or $P_{ij}$. Then we have

$$\tilde{P}_{ij} \overset{\mathrm{def}}{=} P(A_{ij} = 1) = (\alpha + \beta - 1) P_{ij} + (1 - \beta). \tag{1}$$

If the values of $\alpha$, $\beta$ and $P_{ij}$ were known, then the probabilities of true edges conditional on the observed adjacency matrix could have been estimated as

$$P(A^{\mathrm{True}}_{ij} = 1 \mid A_{ij} = 1) = \frac{\alpha P_{ij}}{\tilde{P}_{ij}}, \tag{2}$$

$$P(A^{\mathrm{True}}_{ij} = 1 \mid A_{ij} = 0) = \frac{(1 - \alpha) P_{ij}}{1 - \tilde{P}_{ij}}. \tag{3}$$

It is easy to check that both (2) and (3) are monotone increasing functions of $P_{ij}$. Taking (1) into account implies that they are also increasing functions of $\tilde{P}_{ij}$ as long as $\alpha + \beta > 1$. This gives us a crucial observation: if the goal is to obtain relative rankings of potential links, it is sufficient to estimate $\tilde{P}_{ij}$, and it is not necessary to know $\alpha$, $\beta$ and $P_{ij}$.
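The monotonicity argument above can be checked numerically. The following is a minimal sketch, not part of the paper: the function names and the values of `alpha`, `beta` and `p` are illustrative assumptions. It evaluates (1), (2) and (3) on a grid of true edge probabilities.

```python
import numpy as np


def observed_edge_prob(p, alpha, beta):
    """Equation (1): P(A_ij = 1) under the observation-error model."""
    return (alpha + beta - 1.0) * p + (1.0 - beta)


def posterior_true_edge(p, alpha, beta):
    """Equations (2) and (3): probability of a true edge given that
    the edge was observed, and given that it was not observed."""
    p_tilde = observed_edge_prob(p, alpha, beta)
    given_observed = alpha * p / p_tilde
    given_absent = (1.0 - alpha) * p / (1.0 - p_tilde)
    return given_observed, given_absent


# Illustrative values only (assumptions, not taken from the paper).
alpha, beta = 0.7, 0.95
p = np.linspace(0.05, 0.95, 10)  # hypothetical true edge probabilities

p_tilde = observed_edge_prob(p, alpha, beta)
post_obs, post_abs = posterior_true_edge(p, alpha, beta)

# Sorting pairs by p_tilde gives the same order as sorting by either posterior.
same_order_obs = np.array_equal(np.argsort(p_tilde), np.argsort(post_obs))
same_order_abs = np.array_equal(np.argsort(p_tilde), np.argsort(post_abs))
```

With $\alpha + \beta > 1$, as in these values, both flags come out true, which is the sense in which estimating $\tilde{P}_{ij}$ is enough for ranking.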
An important special case in this setting is $\beta = 1$. Then all the observed links are true positives, and we only need to provide a ranking for node pairs without observed links. This can be applied in recommender systems, for example, for recommending possible new friends in a social network. Another special case is when $\alpha = 1$, which corresponds to all absent edges being true negatives. This setting can be used to frame the problem of investigating reliability of observed links, for example, in a gene regulatory network inferred from high-throughput gene expression data. An estimate of $[\tilde{P}_{ij}]$ provides rankings for both these special cases and the general problem, and thus we focus on estimating $\tilde{P}_{ij}$ for the rest of the paper.

3 Link prediction criteria

In this section, we propose criteria for estimating the probabilities of edges in the observed network, $\tilde{P}_{ij}$, for both directed and undirected networks. The criteria rely on a symmetric matrix $W = [W_{ii'}]$ with $0 \le W_{ii'} \le 1$, which describes the similarity between nodes $i$ and $i'$. The similarity matrix $W$ can be obtained from different sources, including node information, network topology, or a combination of the two. We will discuss choices of $W$ later in this section.

3.1 Link prediction for directed networks

First we consider directed networks. The key assumption we make is that if two pairs of nodes are similar to each other, the probabilities of links within these two pairs are also similar. Specifically, in Figure 1, $P_{ij}$ and $P_{i'j'}$ are assumed close in value if node $i$ is similar to node $i'$ and node $j$ is similar to node $j'$. For directed networks, we measure similarity of node pairs $(i, i')$ and $(j, j')$ by the product $W_{ii'} W_{jj'}$ (see Figure 1), which implies two pairs are similar only if both pairs of endpoints are similar.

Figure 1: Pair similarity for directed networks
This assumption should not be confused with a different assumption made by many unsupervised link prediction methods, which assume that a link is more likely to exist between similar nodes, an assumption applicable to networks with assortative mixing. Assortative networks are common; a typical example is a social network, where people commonly tend to be friends with those of similar age, income level, race, etc. However, there are also networks with disassortative mixing, in which the assumption that similar pairs are more likely to be connected is no longer valid; for example, predators do not typically feed on each other in a food web. Our assumption, in contrast, is equally plausible for both assortative and disassortative networks, as well as more general settings, as it does not assume anything about the relationship between $P_{ij}$ and $W_{ij}$.

Motivated by this assumption of similar probabilities of links for similar node pairs, we propose to estimate $\tilde{P}_{ij} = E(A_{ij})$ by

$$\hat{f} = \arg\min_{f} \; \frac{1}{n^2} \sum_{i,j=1}^{n} (A_{ij} - f_{ij})^2 + \frac{\lambda}{n^4} \sum_{i,i',j,j'=1}^{n} W_{ii'} W_{jj'} (f_{ij} - f_{i'j'})^2, \tag{4}$$

where $f$ is a real-valued $n \times n$ matrix, and $\lambda$ is a tuning parameter. The first term is the usual squared error loss connecting the parameters with the observed network. The minimizer of its population version, i.e., $E(A_{ij} - f_{ij})^2$, is $\tilde{P}_{ij}$. The second term enforces our key assumption, penalizing the difference between $f_{ij}$ and $f_{i'j'}$ more if two node pairs $(i, i')$ and $(j, j')$ are similar.

The choice of the squared error loss is not crucial, and other commonly used loss functions could be considered instead, for example, the hinge loss or the negative log-likelihood. The main reason for choosing the squared error loss is computational efficiency, since it makes (4) a quadratic problem; see Section 4 for more details.
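To make criterion (4) concrete, here is a minimal sketch of evaluating its objective for a given candidate $f$. It is an illustration under stated assumptions, not the optimization algorithm of Section 4: the function names and the random test data are hypothetical. The fast version relies on the symmetry of $W$ to expand the four-index penalty as $2 \sum_{ij} d_i d_j f_{ij}^2 - 2 \sum_{ij} f_{ij} (W f W)_{ij}$, where $d_i = \sum_{i'} W_{ii'}$.

```python
import numpy as np


def full_sum_objective(A, f, W, lam):
    """Objective of criterion (4), using the matrix form of the penalty.
    Assumes W is symmetric, as stated in Section 3."""
    n = A.shape[0]
    loss = np.sum((A - f) ** 2) / n**2
    d = W.sum(axis=1)  # row sums of the similarity matrix
    penalty = 2.0 * np.sum(np.outer(d, d) * f**2) - 2.0 * np.sum(f * (W @ f @ W))
    return loss + lam * penalty / n**4


def full_sum_objective_naive(A, f, W, lam):
    """Same objective via the explicit sum over (i, j, i', j').
    Uses O(n^4) memory, so it is only suitable for small checks."""
    n = A.shape[0]
    loss = np.sum((A - f) ** 2) / n**2
    diff = f[:, :, None, None] - f[None, None, :, :]  # f_ij - f_i'j'
    weights = np.einsum("ik,jl->ijkl", W, W)          # W_ii' * W_jj'
    return loss + lam * np.sum(weights * diff**2) / n**4


# Hypothetical small example (random data, not from the paper).
rng = np.random.default_rng(0)
n = 6
A = (rng.random((n, n)) < 0.3).astype(float)  # observed directed network
X = rng.random((n, n))
W = (X + X.T) / 2.0                           # symmetric, entries in [0, 1]
f = rng.random((n, n))                        # candidate estimate

fast_value = full_sum_objective(A, f, W, lam=0.5)
naive_value = full_sum_objective_naive(A, f, W, lam=0.5)
```

The two functions should agree up to floating-point error. Because the objective is quadratic in $f$, its gradient is linear in $f$, which is what makes the squared error loss computationally convenient.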
In some applications, we may have additional information about true positive and negative examples, i.e., some $A_{ij}$'s may be known to be true 1's and true 0's, while others may be uncertain. This could happen, for example, when validation experiments have been conducted on a subset of a gene or protein network inferred from expression data. If such information is available, it makes sense to use it, and we can then modify criterion (4) as follows:

$$\arg\min_{f} \; \frac{1}{\sum_{i,j=1}^{n} E_{ij}} \sum_{i,j=1}^{n} E_{ij} (A_{ij} - f_{ij})^2 + \frac{\lambda}{n^4} \sum_{i,i',j,j'=1}^{n} W_{ii'} W_{jj'} (f_{ij} - f_{i'j'})^2, \tag{5}$$

where $E_{ij} = 1$ if it is known that $A_{ij} = A^{\mathrm{True}}_{ij}$, and 0 otherwise. This is similar to a semi-supervised criterion proposed in [13]. However, [13] did not consider the uncertainty in positive and negative examples, nor did they consider the undirected case which we discuss next. Since (5) only involves a partial sum of the loss function terms, we will refer to (5) as the partial-sum criterion and (4) as the full-sum criterion for the rest of the paper.

3.2 Link prediction for undirected networks

For undirected networks, our key assumption that $P_{ij}$ and $P_{i'j'}$ are close if two pairs $(i, i')$ and $(j, j')$ are similar needs to take into account that the direction no longer matters; thus the pairs are similar if either $i$ is similar to $i'$ and $j$ is similar to $j'$, or if $i$ is similar to $j'$ and $j$ is similar to $i'$ (see Figure 2). Thus we need a new pair similarity measure that combines $W_{ii'} W_{jj'}$ and $W_{ij'} W_{ji'}$.

Figure 2: Pair similarity for undirected networks

There are multiple options; for example, two natural combinations are

$$S_1 = W_{ii'} W_{jj'} + W_{ij'} W_{ji'}$$

and

$$S_2 = \max(W_{ii'} W_{jj'}, \; W_{ij'} W_{ji'}).$$

Empirically, we found that $S_2$ performs better than $S_1$ for a range of real and simulated networks. The reason for this can be easily illustrated on the stochastic block model.
The stochastic block model is a commonly used model for networks with communities, where the probability of a link only depends on the community labels of its two endpoints. Specifically, given community labels $c = \{c_1, \ldots, c_n\}$, the $A^{\mathrm{True}}_{ij}$'s are independent Bernoulli random variables with

$$P_{ij} = S_{c_i c_j}, \tag{6}$$

where $S = [S_{ab}]$ is a $K \times K$ symmetric matrix, and $K$ is the number of communities in the network. Suppose we have the best similarity measure we can possibly hope to have based on the truth, $W_{ij} = I(c_i = c_j)$, where $I$ is the indicator function. In that case, (6) implies $P_{ij} = P_{i'j'}$ if $\max(W_{ii'} W_{jj'}, W_{ij'} W_{ji'}) = 1$, whereas the sum of the weights would be misleading. Using $S_2$ as the measure of pair similarity, we propose estimating $\tilde{P}_{ij}$ for undirected networks by arg min f 1 n 2 n X i