A Statistical Modelling Approach to Detecting Community in Networks

A. A. Ickowicz
CSIRO Computational Informatics, North Ryde 1670 NSW, Australia

Abstract

There has been considerable recent interest in algorithms for finding communities in networks: groups of vertices within which connections are dense (frequent), but between which connections are sparser (rare). Most of the current literature advocates a heuristic approach to the removal of edges (i.e., removing the links that are less significant according to a well-designed function). In this article, we investigate a technique for uncovering latent communities using a new modelling approach, based on how information spreads within a network. It will prove to be easy to use, robust and scalable. It makes supplementary information related to the network/community structure (different communications, consecutive observations) easier to integrate. We demonstrate the efficiency of our approach with some illustrative real-world applications, such as the famous Zachary karate club or the Amazon political books buyers network.

I. INTRODUCTION

In the continuing flurry of research activity within physics and mathematics on the properties of networks, a particular recent focus has been the analysis of communities within networks. In the simplest case, a network or graph can be represented as a set of points, or vertices, joined in pairs by lines, or edges. Many networks, it is found, are inhomogeneous, consisting not of an undifferentiated mass of vertices, but of distinct groups. Within these groups there are many edges between vertices, but between groups there are fewer edges, producing a structure like that sketched in Fig. 1. The ability to find communities within large networks in some automated fashion could be of considerable use.
Communities in a web graph, for instance, might correspond to sets of web sites dealing with related topics (Adamic and Glance, 2005), while communities within a karate club are simply kids going to the same schools (see Zachary, 1977). In many settings where we observe networks of interactions there are natural groupings of nodes, so that pairs of nodes in the same group tend to interact more than pairs of nodes that belong to different groups. If nodes are people, they may belong to the same club, or be of the same ethnicity or profession. In the case of trade unions, for example, individuals with similar jobs are more likely to interact, just as scientific researchers collaborate and publish together (Newman, 2001). This tendency of individuals to interact with others who have similar characteristics is called homophily, and is quite pervasive. In many cases, however, the underlying structure that influences network interactions is of interest but is not directly observable. In such cases we can only infer which nodes should be grouped together by observing their interaction patterns.

FIG. 1 Example of network.

A. Heuristic Approaches

A first class of algorithms for performing such an inference is mainly heuristic in nature. The Betweenness algorithm proposed by (Newman and Girvan, 2003) is a hierarchical decomposition process where edges are removed in decreasing order of their edge betweenness scores (i.e. the number of shortest paths that pass through a given edge). This is motivated by the fact that edges connecting different groups are more likely to be contained in multiple shortest paths, simply because in many cases they are the only option to go from one group to another.
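The edge betweenness score that drives this removal order can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a standard Brandes-style accumulation on an unweighted, undirected graph, and the function name `edge_betweenness` and the toy graph (two triangles joined by a single edge) are assumptions chosen for illustration.

```python
from collections import deque


def edge_betweenness(adj):
    """Count, for each undirected edge, the shortest paths passing through it.

    adj maps each vertex to the set of its neighbours (unweighted graph).
    When several shortest paths tie, each receives a fractional share.
    """
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path counts back from the farthest vertices.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += credit
                delta[v] += credit
    # Every unordered pair of endpoints was visited from both ends.
    return {edge: total / 2.0 for edge, total in score.items()}


# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = edge_betweenness(two_triangles)
bridge = max(scores, key=scores.get)  # the first edge a removal step would cut
```

In this toy graph the edge joining the two triangles carries every path between the groups, so it receives the highest score and would be removed first.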
Another hierarchical approach is proposed by (Clauset et al., 2004); it is bottom-up rather than top-down, in contrast to the previously cited algorithm. It tries to optimize a quality function called modularity in a greedy manner. Initially, every vertex belongs to a separate community, and communities are merged iteratively such that each merge is locally optimal (i.e. yields the largest increase in the current value of modularity). (Newman, 2006) proposed a top-down hierarchical approach, still based on the modularity function but this time introducing eigenvectors. In each step, the graph is split into two parts in such a way that the separation itself yields a significant increase in the modularity. The split is determined by evaluating the leading eigenvector of the so-called modularity matrix, and there is also a stopping condition which prevents tightly connected groups from being split further.

Random walks have been used by (Pons and Latapy, 2005) and (A. Tabrizi et al., 2013). The general idea is that if you perform random walks on the graph, the walks are more likely to stay within the same community because there are only a few edges that lead outside a given community. Pons' walktrap runs short random walks (generally of 3 to 5 steps) and uses the results of these random walks to merge separate communities in a bottom-up manner. The latter authors proposed a top-down algorithm mixing both random walks and the modularity function that can reveal inherent clusters of a graph more accurately than other nearly-linear approaches, which are mainly bottom-up.

These methods generally yield good results but can suffer from computational complexity and sensitivity to the resolution limit, and they make no modelling assumption. Therefore, no convergence result has been derived for them, which makes their use uncertain.

B. Modelling Approaches

In addition to all this work on heuristic approaches, modelling-based approaches have also been considered by various authors (Bickel et al., 2013; Karrer and Newman, 2011), through latent variable models. The main model, known as the stochastic blockmodel, assumes that the network connections are explainable by a latent discrete class variable associated with each node. For this model, consistency has been shown for profile likelihood maximization (Bickel and Chen, 2009), a spectral-clustering based method (Rohe et al., 2011), and other methods (see for example Chen et al., 2011), under varying assumptions on the sparsity of the network and the number of classes. Other works, including the Infinite Relational Model (IRM) and the Infinite Hidden Relational Model (Kemp et al., 2006; Xu et al., 2006), allow a potentially infinite number of clusters. The Mixed Membership Stochastic Block Model (MMSB) (Airoldi et al., 2006) increases the expressiveness of the latent class models by allowing mixed membership, associating each object with a distribution over clusters. (Palla et al., 2012) puts forward the idea of using an infinite latent attribute model.

These results suggest that the model has reasonable statistical properties, and empirical experiments suggest that efficient approximate methods may suffice to find the parameter estimates. Two main drawbacks can be identified in these approaches. First, the optimization procedure becomes more and more complex as the dimension of the latent space increases. Secondly, despite its claim to be a modelling approach, our view is that it models only half of the problem, namely the communities of vertices. No mention is made of how the vertices connect, i.e. how we end up with a specific adjacency matrix.

C. Contribution

In this paper, we investigate a technique for uncovering latent communities using a new modelling approach that addresses the two issues pointed out above. Unlike many authors, who consider the model only at the vertex level, we also consider a modelling approach for the information. Indeed, we consider that the observed network is not only the realization of a random variable, but essentially the result of the spread of many information impulses across the network. Using a 1-dimensional latent model for the vertices together with the spreading model, we infer the values of the latent variables using MCMC. In order to speed up the MCMC process, we also propose a parallel-tempering-like strategy in Section V. Our approach is easy to scale to large networks and suffers less from the curse of dimensionality. Supplementary information related to the network (different communications, consecutive observations) is also easy to integrate.

The paper is organised as follows. First, we briefly recall the stochastic blockmodels and introduce the notation used in the rest of the paper. Then we present our contribution in the modelling section, detailing the major differences with the approaches considered in the literature so far. The estimation procedure is presented in Section IV. We extend our approach with an additional algorithm in Section V, and finally provide some illustrative applications with real-world data.

II. PRELIMINARIES

A. Stochastic blockmodel

We consider a class of latent variable models which we describe as follows. Let $Z = (Z_1, \ldots, Z_n)$ be latent random variables corresponding to vertices $1, \ldots, n$. In the original stochastic blockmodel, they take values in $[K] = \{1, \ldots, K\}$. $\pi$ is usually defined as a distribution on $[K]$, and $H$ as a symmetric matrix in $[0, 1]^{K \times K}$.
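To make this notation concrete, the sketch below draws the latent classes and an adjacency matrix from a stochastic blockmodel. It assumes that the entry of $H$ for classes $a$ and $b$ is read as the probability of an edge between a vertex of class $a$ and a vertex of class $b$, which is the usual interpretation for this model. The function name `sample_blockmodel`, the seed and the numerical values are hypothetical, and classes are indexed from 0 rather than 1 for convenience.

```python
import random


def sample_blockmodel(n, pi, H, seed=0):
    """Draw latent classes Z and a symmetric 0-1 adjacency matrix A.

    pi is a distribution over K classes (indexed 0..K-1 here) and H is a
    symmetric K x K matrix of edge probabilities (assumed interpretation).
    """
    rng = random.Random(seed)
    K = len(pi)
    # Each vertex receives its class independently according to pi.
    Z = rng.choices(range(K), weights=pi, k=n)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # One Bernoulli draw per unordered pair keeps A symmetric.
            if rng.random() < H[Z[i]][Z[j]]:
                A[i][j] = A[j][i] = 1
    return Z, A


# Illustrative values: two classes, dense within and sparse between.
Z, A = sample_blockmodel(n=6, pi=[0.5, 0.5], H=[[0.9, 0.1], [0.1, 0.9]])
```

Only pairs with $i < j$ are drawn, so the diagonal of $A$ stays at zero and no self-loops are created.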
The complete graph model (CGM) for $(Z, A)$, where $A$ is the $n \times n$ symmetric 0-1 adjacency matrix of a graph, is defined by its distribution,

$$f(Z, A) = \left( \prod_{i}^{n} \pi(Z_i) \right) \left( \prod_{i}^{n} \prod_{j=i+1}^{n} H(Z_i, Z_j)^{A_{ij}} \, (1 - H(Z_i, Z_j))^{1 - A_{ij}} \right), \tag{1}$$

where we may interpret $H(Z_i, Z_j)$ as $p(\text{edge} \mid Z_i, Z_j)$, and $\pi(a)$ as $p(Z_i = a)$ for $a = 1, \ldots, K$.

The graph model (GM) is defined by a distribution $g : \{0, 1\}^{n \times n} \to [0, 1]$, which satisfies $g(A) = p(A; H, \pi)$ and is given by

$$g(A) = \sum_{z \in [K]^n} f(z, A) \tag{2}$$

B. More general models

The stochastic blockmodel is a special case of a more general latent variable model (see (Hoff et al., 2002)). In this model, the elements of $Z$ take values in a general space $\mathcal{Z}$ rather than $[K]$, $\pi$ is a distribution on $\mathcal{Z}$, and $H$ is replaced by a symmetric map $h : \mathcal{Z} \times \mathcal{Z} \to [0, 1]$. The CGM defines a density for $(Z, A)$ with respect to an appropriate reference measure, and the GM satisfies the identity

$$\frac{g(A; \theta)}{g_0(A)} = E_{\theta_0}\left[ \frac{f(Z, A; \theta)}{f_0(Z, A)} \,\middle|\, A \right] \tag{3}$$

III. MODELLING

The contribution of our article is described in this section. In our work, we focus on a one-dimensional latent space to describe the community structure, which is simpler and therefore less computationally demanding. We will demonstrate that, using an appropriate latent space and link function, we can perform as well as multi-dimensional models.

A. Latent Variable

Our idea is to consider that the latent variable $Z_i$ of vertex $i$ evolves in the continuous circular space $\Theta = [0, 2\pi]$. The latent variable $\theta_i$ represents the location of vertex $i$ in the latent space. We make the modelling assumption that two vertices in the same community have close values of $\theta$ in the latent space. Then, if we consider the previous modelling, we define

$$H(Z_i, Z_j) = h(\theta_i - \theta_j) \tag{4}$$

where $h()$ stands for a link function defined below.

a. Link function h()

The main constraint on $h$ is that it should map $[0, 2\pi]$ onto a Euclidean space $E$. For simplicity, we have chosen $E = [0, 1]$. Theoretically, we would like two vertices with opposite $\theta$ values to have a very small probability of being in the same community. Therefore, we considered the following function,

$$h(y) = \frac{\cos(y) + 1}{2}, \quad \forall y \in [0, 2\pi] \tag{5}$$

B. Modelling the information spread

In the recent literature, the cells of the adjacency matrix are essentially considered as realizations of random variables (or of a function of random variables). In this paper we aim to introduce an interpretation (and subsequently a model) of the adjacency matrix $A$ that is closer to real life. We consider that an edge is built between two vertices if a piece of information is shared between them, e.g. during a phone call or a meeting. We call this "share of information" an impulse. Depending on the kind of information, the impulse does not necessarily involve two vertices from the same community. However, we make the assumption that most of the time it does (and this does not look like an unusual assumption from a statistical point of view).

In this article, we define two kinds of impulses: sequential impulses, where a piece of information is spread from one vertex to another sequentially (like successive phone calls), and instantaneous impulses, where the information is spread to several vertices at the same time (meetings, dinners...). Let $I$ be a shared piece of information (or impulse); $I$ is an $n \times n$ matrix whose elements belong to $\{0, 1\}$,

$$I_{ij} = \begin{cases} 1 & \text{if vertices } i \text{ and } j \text{ exchange an information,} \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$

We assume that the adjacency matrix $A$ is observed after the exchange of several impulses.

b. Sequential impulse

If we denote by $N$ the number of pieces of information and by $T_i$ their respective propagation lengths,

$$A_I = \sum_{i}^{N} \sum_{t}^{T_i} I_{it} \tag{7}$$

In the case of sequential impulses, $I$ does not have to be symmetric, and in our mind it is designed to model directed graphs. However, this is not compulsory, and symmetric undirected graphs may also be created.

c. Instantaneous impulse

In the case of instantaneous impulses, $I$ is not restricted to only two vertices sharing information; instead we make the assumption that all the vertices are given the information at the same time. Therefore,

$$A_I = \sum_{i}^{N} I_i \tag{8}$$

These impulses can also be considered as community-based impulses, and thus identifying the propagation of the information helps in identifying the communities. It is also worth noting that the kind of impulse suitable for the modelling depends on the data at hand. Finally, depending on the output type,

$$A = \begin{cases} A_I & \text{if } A \in \mathbb{N}^{n \times n} \\ \min(A_I, 1) & \text{if } A \in \{0, 1\}^{n \times n} \end{cases} \tag{9}$$

IV. ESTIMATION

Our latent model allows for some dimension reduction, but for large networks the number of parameters ($\theta$) to be estimated is still very large, and the estimation is therefore unlikely to be achievable through any optimization method in a reasonable amount of time. For small networks, the Bayesian paradigm can be used, and the posterior distribution of the $\theta$'s can be expressed as

$$\pi(\theta_{1:n} \mid A, h) \propto p(A \mid \theta_{1:n}, h) \times \pi_0(\theta_{1:n}) \tag{10}$$

The use of an appropriate Monte Carlo simulation with an appropriately designed likelihood can then help us estimate the $\theta$'s. For larger networks, though, the total run length needed would be too long for our technique to be a real breakthrough. We therefore decided to sub-sample the network using the modelling described in Section III, and to apply the Bayesian approach to this sample.

A. Sub-sampling from the network using A

The idea is that the natural spread of a piece of information indicates the community links.
Of course, the spread can also reach vertices that do not belong to the community, but most likely they will. Then, if we simulate the spread of several pieces of information $I$, we should be able to estimate the latent variables of the vertices reached relative to one another. The steps are then the following,

• Simulate the spread of an information $I$. Let $\sigma(i)$ be the list of vertices reached;
• For each of the vertices in $\sigma(i)$, calculate $\theta$ using Eq. 11.

$$\pi(\theta_{\sigma(i)} \mid A_{\sigma(i)}, h) \propto p(A_{\sigma(i)} \mid \theta_{\sigma(i)}, h) \times \pi_0(\theta_{\sigma(i)}) \tag{11}$$

where $A_{\sigma(i)}$ is the restriction of $A$ to the vertices in $\sigma(i)$.

d. Number of simulated I

There is no upper limit to the number of $I$ we can simulate. The only rule is, as usual, "the more the better".

e. Size of each simulated I

This very much depends on the data at hand. Given that each $I$ is supposed to spread along the community, the size should be big enough to allow $I$ to spread to the entire community, but not so big that it spreads outside it.

B. Calculate the likelihood

In Eq. 11, the likelihood can be approximated according to the following equations,

$$p(A_{\sigma(i)} \mid \theta_{\sigma(i)}, h) = \prod_{j,k \in \sigma(i),\; j \ldots}$$
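The sub-sampling steps of Section IV.A, together with the link function of Eq. 5, can be sketched as follows. The spreading rule used here (the impulse hops to a uniformly chosen neighbour for a fixed number of steps) is an assumption made purely for illustration, since the text only states that the spread of an information $I$ is simulated. The names `link`, `simulate_impulse` and `restrict`, the toy graph and the seed are likewise hypothetical.

```python
import math
import random


def link(y):
    """Link function of Eq. 5: h(y) = (cos(y) + 1) / 2, with values in [0, 1].

    cos is even and 2*pi-periodic, so neither the sign nor the wrap-around
    of the difference theta_i - theta_j matters.
    """
    return (math.cos(y) + 1.0) / 2.0


def simulate_impulse(A, start, length, rng):
    """Return sigma, the list of vertices reached by one simulated impulse.

    Assumed rule: at each step the impulse moves to a neighbour of the
    current vertex chosen uniformly at random among the observed edges.
    """
    reached = [start]
    current = start
    for _ in range(length):
        neighbours = [j for j, a in enumerate(A[current]) if a and j != current]
        if not neighbours:
            break
        current = rng.choice(neighbours)
        if current not in reached:
            reached.append(current)
    return reached


def restrict(A, sigma):
    """Restriction of A to the vertices listed in sigma."""
    return [[A[i][j] for j in sigma] for i in sigma]


# Hypothetical toy network: two triangles joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

rng = random.Random(1)
sigma = simulate_impulse(A, start=0, length=3, rng=rng)
A_sub = restrict(A, sigma)  # the matrix on which Eq. 11 would be evaluated
```

With this link function, two vertices at the same location obtain $h = 1$ and two vertices at opposite locations ($\theta_i - \theta_j = \pi$) obtain $h = 0$, which matches the constraint stated for $h$. The `length` argument plays the role of the size of each simulated $I$.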
