Revisiting Role Discovery in Networks: From Node to Edge Roles

Nesreen K. Ahmed, Intel Labs, nesreen.k.ahmed@intel.com
Ryan A. Rossi, Palo Alto Research Center, rrossi@parc.com
Theodore L. Willke, Intel Labs, ted.willke@intel.com
Rong Zhou, Palo Alto Research Center, rzhou@parc.com

ABSTRACT

Previous work in network analysis has focused on modeling the mixed memberships of node roles in the graph, but not the roles of edges. We introduce the edge role discovery problem and present a generalizable framework for learning and extracting edge roles from arbitrary graphs automatically. Furthermore, while existing node-centric role models have mainly focused on simple degree and egonet features, this work also explores graphlet features for role discovery. In addition, we develop an approach for automatically learning and extracting important and useful edge features from an arbitrary graph. The experimental results demonstrate the utility of edge roles for network analysis tasks on a variety of graphs from various problem domains.

1. INTRODUCTION

In the traditional graph-based sense, roles represent node-level connectivity patterns such as star-center and star-edge nodes, near-cliques, or nodes that act as bridges to different regions of the graph. Intuitively, two nodes belong to the same role if they are "similar" in the sense of graph structure. Our proposed research will broaden the framework for defining, discovering, and learning network roles, by drastically increasing the degree of usefulness of the information embedded within rich graphs. Recently, role discovery has become increasingly important for a variety of applications and problem domains [8, 17, 5, 4, 30, 24, 37], including descriptive network modeling [31], classification [16], anomaly detection [31], and exploratory analysis (see [30] for other applications).
Despite the wide variety of practical applications and the importance of role discovery, existing work has focused only on discovering node roles (e.g., see [4, 6, 11, 27]). We posit that discovering the roles of edges may be fundamentally more important, and better able to capture, represent, and summarize the key behavioral roles in the network than existing methods, which have been limited to learning only the roles of nodes in the graph. For instance, a person with malicious intent may appear normal by maintaining the vast majority of relationships and communications with individuals that play normal roles in society. In this situation, techniques that reveal the role semantics of nodes would have difficulty detecting such malicious behavior, since most edges are normal. However, modeling the roles (functional semantics, intent) of individual edges (relationships, communications) in the rich graph would improve our ability to identify, detect, and predict this type of malicious activity, since we are modeling it directly. Nevertheless, existing work also has many other limitations which significantly reduce the practical utility of such methods in real-world networks. One such example is that existing work has been limited to mainly simple degree and egonet features [16, 31]; see [30] for other possibilities. Instead, we leverage higher-order network motifs (induced subgraphs) of size k ∈ {3, 4, ...} computed from [1, 2], as well as other graph parameters such as the largest clique in a node (or edge) neighborhood, the triangle core number, and the neighborhood chromatic number, among other efficient and highly discriminative graph features.

The main contributions are as follows:

• Edge role discovery: This work introduces the problem of edge role discovery and proposes a computational framework for learning and modeling edge roles in both static and dynamic networks.
• Higher-order latent space model: We introduce a higher-order latent role model that leverages higher-order network features for learning and modeling node and edge roles. We also introduce graphlet-based roles and propose feature and role learning techniques.

• Efficient and scalable: All proposed algorithms are parallelized. Moreover, the feature and role learning and inference algorithms are linear in the number of edges.

2. HIGHER-ORDER EDGE ROLE MODEL

This section introduces our higher-order edge role model and a generalizable framework for computing edge roles based on higher-order network features.

2.1 Initial Higher-order Network Features

Existing role discovery methods use simple degree-based features [16]. In this work, we use graphlet methods [36, 25, 1] for computing higher-order network features based on induced subgraph patterns (instead of simply edge and node patterns) for discovering better and more meaningful roles. Following the idea of feature-based roles [30], we systematically discover an edge-based feature representation [36, 25]. As initial features, we use a recent parallel graphlet decomposition framework [1] to compute a variety of edge-based graphlet features of size k = {3, 4, ...}. Using these initial features, more representative, explainable, and novel features can be discovered; see the relational feature learning template given in [30].

Table 1: Summary of Bregman divergences and update rules.

        φ(y)      ∇²φ(y)   D_φ(x ‖ x′)             Update
  Fro.  y²/2      1        (x − x′)²/2             v_jk = Σ_{i=1}^m x_ij^(k) u_ik / Σ_{i=1}^m u_ik u_ik
  KL    y log y   1/y      x log(x/x′) − x + x′    v_jk = Σ_{i=1}^m (x_ij^(k) u_ik)/x′_ij / Σ_{i=1}^m (u_ik u_ik)/x′_ij
  IS    −log y    1/y²     x/x′ − log(x/x′) − 1    v_jk = Σ_{i=1}^m (x_ij^(k) u_ik)/x′_ij² / Σ_{i=1}^m (u_ik u_ik)/x′_ij²
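The simplest of these higher-order edge features can be sketched directly from adjacency sets. The following is an illustrative pure-Python version restricted to the two size-3 graphlets (triangles and wedges) incident to an edge; the full framework instead uses the parallel graphlet decomposition of [1] for k ∈ {3, 4, ...}. The function name `edge_graphlet_features` is ours, not the paper's.

```python
# Sketch: per-edge counts of two size-3 graphlets (triangles and wedges)
# on an undirected edge list. Illustrative only; handles k = 3.
from collections import defaultdict

def edge_graphlet_features(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    feats = {}
    for u, v in edges:
        tri = len(adj[u] & adj[v])          # triangles containing edge (u, v)
        # wedges (paths of length 2) containing (u, v): neighbors of either
        # endpoint, minus the other endpoint, minus shared (triangle) nodes
        wedge = (len(adj[u]) - 1) + (len(adj[v]) - 1) - 2 * tri
        feats[(u, v)] = (tri, wedge)
    return feats

# Example: a triangle {a, b, c} plus a pendant edge (c, d).
feats = edge_graphlet_features([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
```

The pendant edge (c, d) lies in no triangle but in two wedges (d-c-a and d-c-b), while (a, b) lies in one triangle and no wedge.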
As an aside, graphlets offer a way to generalize many existing feature learning systems (including those that have been used for learning and extracting roles). We can generalize the above by introducing a generic k-vertex graphlet operator that returns counts and other statistics for any k-vertex induced subgraph where k > 2.

2.2 Edge Feature Representation Learning

Learning important and practical representations automatically is useful for many machine learning applications beyond role discovery, such as anomaly detection, classification, and descriptive modeling/exploratory analysis. These methods greatly reduce the engineering effort while also revealing important latent features that lead to better predictive performance and power of generalization. This section introduces a generalizable, flexible, and extremely efficient edge feature learning and inference framework capable of automatically learning a representative set of edge features. The proposed framework and the algorithms that arise from it naturally support arbitrary graphs, including undirected, directed, and/or bipartite networks. More importantly, our approach also handles attributed graphs in a natural way; these typically consist of a graph G and a set of arbitrary edge and/or node attributes. The attributes typically represent intrinsic edge and node information such as age, location, gender, political views, or the textual content of communication between individuals, among other possibilities. For edge feature learning and extraction, we introduce the notion of an edge neighbor. Intuitively, given an edge e_i = (v, u) ∈ E, let e_j = (a, b) be an edge neighbor of e_i iff a = v, a = u, b = v, or b = u. Informally, e_j is a neighbor of e_i if e_j and e_i share a vertex. This definition can easily be extended to incorporate further h-distant neighbors.
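The edge-neighbor relation above can be sketched directly from its definition. This is a quadratic-time illustration, not the paper's implementation (which would index edges by endpoint):

```python
# Sketch of the edge-neighbor relation: e_j = (a, b) is a neighbor of
# e_i = (v, u) iff the two edges share an endpoint.
def edge_neighbors(edges):
    nbrs = {e: [] for e in edges}
    for i, (v, u) in enumerate(edges):
        for j, (a, b) in enumerate(edges):
            if i != j and {v, u} & {a, b}:   # share at least one vertex
                nbrs[(v, u)].append((a, b))
    return nbrs

nbrs = edge_neighbors([(1, 2), (2, 3), (4, 5)])
```

Here (1, 2) and (2, 3) are mutual neighbors through vertex 2, while (4, 5) has no neighbors.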
The relational operators used to search the space of possible neighbor features at the current and previously learned feature layers include mean, sum, product, min, max, variance, L1, L2, and, more generally, any (parameterized) similarity function, including positive semidefinite functions such as the Radial Basis Function (RBF) kernel K(x_i, x_j) = exp(−‖x_i − x_j‖² / 2σ²), polynomial similarity functions of the form K(x_i, x_j) = (a · ⟨x_i, x_j⟩ + c)^d, and the sigmoid neural network kernel K(x_i, x_j) = tanh(a · ⟨x_i, x_j⟩ + c), among others. The flexibility and generalizability of the proposed approach is a key advantage and contribution, i.e., many components are naturally interchangeable, and thus our approach is not restricted to the relational operators mentioned above, but can easily leverage other application- or problem-domain-specific (relational) operators [fn. 1].

Features are searched from the selected feature subspaces, and the candidate features that are actually computed are pruned to ensure that the learned set is as small and representative as possible while capturing novel and useful properties. We define a feature graph G_f where the nodes represent the features learned thus far among all current feature layers, as well as any initial features that were not immediately pruned for being redundant or non-informative w.r.t. the objective function [fn. 2], whereas the edges encode dependencies between the features. Further, the edges are weighted by a computed similarity/correlation (or distance/disagreement) measure: as W_ij → 1, the two features f_i and f_j are considered extremely similar (and thus possibly redundant), whereas W_ij → 0 implies that f_i and f_j are significantly different. In this work we use log-binning disagreement, though we have also used Pearson correlation, among others.
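A single feature-layer step with the simple relational operators can be sketched as follows. The RBF kernel is included as one example of a parameterized similarity usable in place of the simple operators; `relational_layer` is an illustrative name, not the paper's API.

```python
# Sketch: given a base feature value x[e] per edge and each edge's neighbor
# list, apply relational operators (sum, mean, max) over the neighbors to
# derive new candidate edge features.
import math

def relational_layer(x, nbrs):
    new = {}
    for e, ns in nbrs.items():
        vals = [x[n] for n in ns] or [0.0]   # edges with no neighbors get 0
        new[e] = {"sum": sum(vals), "mean": sum(vals) / len(vals), "max": max(vals)}
    return new

def rbf(xi, xj, sigma=1.0):
    # RBF kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / 2 sigma^2), scalar case
    return math.exp(-(xi - xj) ** 2 / (2 * sigma ** 2))

x = {(1, 2): 1.0, (2, 3): 3.0, (3, 4): 5.0}
nbrs = {(1, 2): [(2, 3)], (2, 3): [(1, 2), (3, 4)], (3, 4): [(2, 3)]}
layer1 = relational_layer(x, nbrs)
```

Repeating this step on the newly derived values yields deeper feature layers, exactly as in the layered search described above.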
After constructing the weighted feature graph, our approach has two main steps: pruning noisy edges between features, and removing redundant features (i.e., nodes in G_f). To remove spurious relationships in the feature graph, we use a simple adaptive sparsification technique. The technique uses a threshold γ which is adapted automatically at each iteration of the search procedure. Once the spurious edges have been removed from the feature graph, we then prune entire features that are found to be redundant. In other words, we discard vertices (i.e., features) from the feature graph that offer no discriminatory power. This can be performed by partitioning the feature graph in some fashion (e.g., into connected components). Note that once a feature is added to the set of representative features, it cannot be pruned. However, all features are scored at each iteration, including the representative features, and redundant features are discarded. If a feature is found to closely resemble one of the representative features, we prune it and keep only the representative feature, as it is more primitive (discovered in a previous iteration). Our approach searches the space of features until one of the following stopping criteria is met: (i) the previous iteration was unfruitful in discovering any novel features, or (ii) the maximum number of iterations, which may be defined by the user, is exceeded. It is worth mentioning that we could have used an arbitrary node feature learning approach such as the one described in [30].
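A single iteration of the pruning step can be sketched as follows, under simplifying assumptions: a fixed threshold gamma (the paper adapts it per iteration), similarity weights in [0, 1] where values near 1 mean redundancy, and connected components as the partitioning scheme. Names such as `prune_features` are illustrative.

```python
# Sketch: drop weak edges of the feature graph, then keep one representative
# feature per connected component of the remaining (redundancy) edges.
# Features listed earlier are treated as more primitive and kept.
def prune_features(weights, features, gamma):
    # weights: dict mapping frozenset({fi, fj}) -> similarity in [0, 1]
    adj = {f: set() for f in features}
    for pair, w in weights.items():
        if w >= gamma:                     # keep only strong (redundancy) links
            fi, fj = tuple(pair)
            adj[fi].add(fj)
            adj[fj].add(fi)
    seen, keep = set(), []
    for f in features:
        if f not in seen:
            keep.append(f)                 # representative of its component
            stack = [f]                    # mark the whole component as seen
            while stack:
                g = stack.pop()
                if g not in seen:
                    seen.add(g)
                    stack.extend(adj[g])
    return keep

features = ["deg", "tri", "tri_copy", "wedge"]
weights = {frozenset({"tri", "tri_copy"}): 0.98, frozenset({"deg", "wedge"}): 0.3}
kept = prune_features(weights, features, gamma=0.9)
```

Here "tri_copy" is discarded as redundant with "tri", while the weak deg-wedge link is sparsified away, so both features survive.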
For instance, given a node feature matrix Z ∈ R^{n×f} (from one such approach), we can easily derive edge features from it (which can then be used for learning edge roles) by applying one or more operators over the edges as follows: given an edge e_k = (v_i, v_j) ∈ E with endpoints v_i and v_j, one can simply combine the feature values z_i and z_j of v_i and v_j in some way, e.g., x_k = z_i + z_j, where x_k is the resulting edge feature value for e_k.

2.3 Learning Latent Higher-order Edge Roles

Let X = [x_ij] ∈ R^{m×f} be a matrix with m rows representing edges and f columns representing arbitrary features [fn. 3]. More formally, given X ∈ R^{m×f}, the edge role discovery optimization problem is to find U ∈ R^{m×r} and V ∈ R^{f×r}, where r ≪ min(m, f), such that the product of the two lower-rank matrices U and V^T minimizes the divergence between X and X′ = UV^T. Intuitively, U ∈ R^{m×r} represents the latent role mixed-memberships of the edges, whereas V ∈ R^{f×r} represents the contributions of the features with respect to each of the roles. Each row u_i^T ∈ R^r of U can be interpreted as a low-dimensional rank-r embedding of the i-th edge in X. Alternatively, each row v_j^T ∈ R^r of V represents an r-dimensional role embedding of the j-th feature in X in the same low rank-r dimensional space. Also, u_k ∈ R^m is the k-th column of U, representing a "latent feature", and similarly v_k ∈ R^f is the k-th column of V.

For the higher-order latent network model, we solve:

  arg min_{(U,V) ∈ C} { D_φ(X ‖ UV^T) + R(U, V) }    (1)

where D_φ(X ‖ UV^T) is an arbitrary Bregman divergence [9] between X and UV^T. Furthermore, the optimization problem in (1) imposes hard constraints C on U and V, such as the non-negativity constraints U, V ≥ 0, and R(U, V) is a regularization penalty. In this work, we mainly focus on solving D_φ(X ‖ UV^T) under non-negativity constraints:

  arg min_{U ≥ 0, V ≥ 0} { D_φ(X ‖ UV^T) + R(U, V) }    (2)

Given the edge feature matrix X ∈ R^{m×f}, the edge role discovery problem is to find U ∈ R^{m×r} and V ∈ R^{f×r} such that

  X ≈ X′ = UV^T    (3)

To measure the quality of our edge mixed-membership model, we use Bregman divergences:

  Σ_ij D_φ(x_ij ‖ x′_ij) = Σ_ij [ φ(x_ij) − φ(x′_ij) − ℓ(x_ij, x′_ij) ]

where φ is a univariate smooth convex function, ℓ(x_ij, x′_ij) = ∇φ(x′_ij)(x_ij − x′_ij), and ∇^p φ(x) is the p-th order derivative operator of φ at x. Furthermore, let X − UV^T = X^(k) − u_k v_k^T denote the residual term in the approximation (3), where X^(k) is the k-residual matrix defined as:

  X^(k) = X − Σ_{h ≠ k} u_h v_h^T    (4)
        = X − UV^T + u_k v_k^T,  for k = 1, ..., r    (5)

We use a fast scalar block coordinate descent approach that easily generalizes to heterogeneous networks [32]. The approach considers a single element of U and V as a block in the block coordinate descent framework. Replacing φ(y) with the corresponding expression from Table 1 gives rise to a fast algorithm for each Bregman divergence. Table 1 gives the updates for the Frobenius norm (Fro.), KL-divergence (KL), and Itakura-Saito divergence (IS). Note that Beta divergence and many others are also easily adapted to our higher-order network modeling framework.

[fn. 1] See [15] for more details.
[fn. 2] In general, the objective function can be either unsupervised or supervised.
[fn. 3] For instance, the columns of X represent arbitrary features such as graph topology features, non-relational features/attributes, and relational neighbor features, among other possibilities.
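The Frobenius update from Table 1 combined with the k-residual of Eqs. (4)-(5) can be sketched in a few lines of pure Python. This is an illustrative serial version (the paper's solver is a parallel scalar block coordinate descent); non-negativity is enforced by clamping, and `frobenius_bcd` is our name for the sketch.

```python
# Minimal sketch of rank-r factorization X ~ U V^T under the Frobenius
# divergence, cycling over components k and applying the Table 1 update
# v_jk = sum_i X^(k)_ij u_ik / sum_i u_ik^2 (and symmetrically for u_ik).
import random

def frobenius_bcd(X, r, iters=50, seed=1):
    rng = random.Random(seed)
    m, f = len(X), len(X[0])
    U = [[rng.random() + 0.1 for _ in range(r)] for _ in range(m)]
    V = [[rng.random() + 0.1 for _ in range(r)] for _ in range(f)]
    for _ in range(iters):
        for k in range(r):
            # k-residual: X^(k) = X - U V^T + u_k v_k^T   (Eqs. 4-5)
            R = [[X[i][j] - sum(U[i][h] * V[j][h] for h in range(r))
                  + U[i][k] * V[j][k] for j in range(f)] for i in range(m)]
            den_u = sum(U[i][k] * U[i][k] for i in range(m)) or 1.0
            for j in range(f):
                V[j][k] = max(0.0, sum(R[i][j] * U[i][k] for i in range(m)) / den_u)
            den_v = sum(V[j][k] * V[j][k] for j in range(f)) or 1.0
            for i in range(m):
                U[i][k] = max(0.0, sum(R[i][j] * V[j][k] for j in range(f)) / den_v)
    return U, V

# Toy rank-1 matrix: X = a b^T with a = (1, 2, 3), b = (1, 2).
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
U, V = frobenius_bcd(X, r=1)
err = sum((X[i][j] - sum(U[i][k] * V[j][k] for k in range(1))) ** 2
          for i in range(3) for j in range(2))
```

On this rank-1 example the alternating updates recover the factorization essentially exactly within a few sweeps.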
2.4 Model Selection

In this section, we introduce our approach for learning the appropriate model given an arbitrary graph. The approach leverages the Minimum Description Length (MDL) principle [14, 29] for automatically selecting the "best" higher-order network model. The MDL principle is a practical formalization of Kolmogorov complexity [22]. More formally, the approach finds the model M* = (U_r, V_r) that leads to the best compression by solving:

  M* = arg min_{M ∈ M} L(M) + L(X | M)    (6)

where M is the model space, M* is the model given by solving the above minimization problem, and L(M) is the number of bits required to encode M using a code Ω, which we refer to as the description length of M with respect to Ω. Recall that MDL requires a lossless encoding. Therefore, to reconstruct X exactly from M = (U_r, V_r), we must explicitly encode the error E such that X = U_r V_r^T + E. Hence, the total compressed size of M = (U_r, V_r) with M ∈ M is simply L(X, M) = L(M) + L(E). Given an arbitrary model M = (U_r, V_r) ∈ M, the description length decomposes into:

• the bits required to describe the model, and
• the cost of describing the approximation errors X − X_r, where X_r = U_r V_r^T is the rank-r approximation of X, with

  U_r = [u_1 u_2 ⋯ u_r] ∈ R^{m×r}    (7)
  V_r = [v_1 v_2 ⋯ v_r] ∈ R^{f×r}    (8)

The model M* is the model M ∈ M that minimizes the total description length: the model description cost plus the cost of correcting the errors of our model. Let |U| and |V| denote the number of nonzeros in U and V, respectively. Then the model description cost of M is κ(|U| + |V|), where κ is the number of bits per value. In particular, if U and V are dense, the model description cost is simply κr(m + f), where m and f are the number of edges and features, respectively.
Assuming the errors are non-uniformly distributed, one possibility is to use the KL divergence (see Table 1) for the error description cost [fn. 4]. The cost of correcting a single element of the approximation is D_φ(x ‖ x′) = x log(x/x′) − x + x′ (assuming KL-divergence), and thus the total reconstruction cost is:

  D_φ(X ‖ X′) = Σ_ij [ X_ij log(X_ij / X′_ij) − X_ij + X′_ij ]    (9)

where X′ = UV^T ∈ R^{m×f}. Other possibilities are given in Table 1. The above assumes a particular representation scheme for encoding the models and data. Recall that the optimal code assigns −log₂ p_i bits to encode a message [34]. Lloyd-Max quantization [26, 23] with Huffman codes [18, 35] is used to compress the model and data [28, 7]. Notice that we require only the length of the description under the above encoding scheme, and thus we do not need to materialize the codes themselves. This leads to the improved model description cost κ̄(|U| + |V|), where κ̄ is the mean number of bits required to encode each value [fn. 5]. In general, our higher-order network modeling framework can easily leverage other model selection techniques such as AIC [3] and BIC [33].

[fn. 4] The representation cost of correcting approximation errors.
[fn. 5] Note that log₂(m) quantization bins are used.

3. DYNAMIC EDGE ROLE MODEL

This section introduces the dynamic edge role mixed-membership model (DERM) and proposes a computational framework for computing edge roles in dynamic networks.

3.1 Dynamic Graph Model & Representation

We are given a graph stream G = (V, E), where E = {e_1, ..., e_m} is an ordered set of edges such that τ(e_1) ≤ τ(e_2) ≤ ⋯ ≤ τ(e_m). Note that τ(e_i) is the edge time for e_i ∈ E (which may be the edge activation time or arrival time, among other possibilities). Intuitively, E is an unbounded edge stream where edges arrive continuously over time.
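The MDL criterion above (model bits plus KL error bits, Eq. (9)) can be sketched as follows. The candidate models below are hypothetical hand-built factorizations, the quantization and Huffman coding steps are omitted, and `kappa` is an illustrative bits-per-value constant.

```python
# Sketch: total description length = kappa * (nnz(U) + nnz(V)) model bits
# plus the KL reconstruction cost of Eq. (9); the model with the smallest
# total wins.
import math

def kl_error_bits(X, Xp):
    eps = 1e-12   # guard against log(0) for zero entries
    return sum(x * math.log((x + eps) / (xp + eps)) - x + xp
               for row_x, row_p in zip(X, Xp) for x, xp in zip(row_x, row_p))

def nnz(M):
    return sum(1 for row in M for v in row if v != 0.0)

def description_length(X, U, V, kappa=16):
    r = len(U[0])
    Xp = [[sum(U[i][k] * V[j][k] for k in range(r)) for j in range(len(V))]
          for i in range(len(U))]
    return kappa * (nnz(U) + nnz(V)) + kl_error_bits(X, Xp)

X = [[1.0, 2.0], [2.0, 4.0]]
# Two hypothetical zero-error candidates: a rank-1 and a rank-2 model.
U1, V1 = [[1.0], [2.0]], [[1.0], [2.0]]                 # U1 V1^T == X
U2, V2 = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [2.0, 4.0]]
best = min([(U1, V1), (U2, V2)], key=lambda M: description_length(X, *M))
```

Both candidates reconstruct X exactly, so the cheaper-to-describe rank-1 model is selected, mirroring the valley behavior discussed for Figure 2.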
From this edge stream, we derive a dynamic network G = {G_t}_{t=1}^T, where G_t = (V, E_t) represents a snapshot graph at time t. Note that time t is actually a discrete time interval [a, b), where a and b are the start and end times, respectively. Therefore, E_t = {e_i ∈ E | a ≤ τ(e_i) < b} and E = E_1 ∪ E_2 ∪ ⋯ ∪ E_T.

3.2 Dynamic Edge Role Learning

We start by learning a time series of features automatically. Let G_{1:k} = (V, E_{1:k}) be the initial dynamic training graph, where E_{1:k} = E_1 ∪ ⋯ ∪ E_k and k is the number of snapshot graphs used for learning the initial set of (representative) dynamic features. Given {G_t}_{t=1}^T and G_{1:k} = (V, E_{1:k}), the proposed approach automatically learns a set of features F = {f_1, f_2, ..., f_d}, where each f_i ∈ F represents a feature definition learned from G_{1:k}. Given the role definitions V ∈ R^{r×d} learned from a subset of past temporal graphs, we then estimate the edge role memberships {U_t}_{t=1}^T for each {G_t}_{t=1}^T (and any future graph snapshots G_{t+1}, ..., G_{t+p}), where U_t ∈ R^{m×r} is an edge-by-role membership matrix. The dynamic edge role model is selected using the approach proposed in Section 2.4.

Time-scale Learning: This section briefly introduces the problem of automatically learning an appropriate time-scale and proposes a few techniques. The time-scale learning problem can be formulated as an optimization problem whose optimal solution minimizes an objective function. Naturally, the objective function encodes the error of models learned using a particular time-scale s (e.g., 1 minute, 1 hour). Thus, solving the optimization problem identifies the time-scale s whose models lead to the least error.
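The snapshot construction of Section 3.1 can be sketched directly from the definition E_t = {e_i ∈ E | a ≤ τ(e_i) < b}, here with equal-width half-open windows (one choice of time-scale s):

```python
# Sketch: partition a timestamped edge stream into T snapshot graphs,
# snapshot t covering the half-open interval [t0 + t*window, t0 + (t+1)*window).
def snapshots(stream, t0, window, T):
    # stream: iterable of (u, v, tau) with tau the edge time
    snaps = [[] for _ in range(T)]
    for u, v, tau in stream:
        t = int((tau - t0) // window)
        if 0 <= t < T:
            snaps[t].append((u, v))
    return snaps

stream = [(1, 2, 0.0), (2, 3, 0.5), (1, 3, 1.2), (3, 4, 2.7)]
snaps = snapshots(stream, t0=0.0, window=1.0, T=3)
```

Re-running this with different `window` values and comparing the resulting model errors is one simple way to instantiate the time-scale learning problem described above.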
Updating Features and Role Definitions: To prevent the features and role definitions from becoming stale and meaningless over time (due to temporal/concept drift as the network and its attributes/properties evolve), we use the following approach: the loss (or another measure) is computed and tracked over time, and when it becomes too large (from either the features or the roles), we re-compute the feature definitions F and the role definitions. Both the feature and role definitions can also be learned in the background, and can obviously be computed in parallel. The edge role framework is flexible enough to accommodate other types of approaches as well, and thus is not limited to the simple approach above (which is a key advantage of this work).

4. EXPERIMENTS

[Figure 1: Higher-order role discovery shows strong scaling as we increase the number of processing units (speedup vs. processing units, for fb-forum and ia-escorts).]

This section investigates the scalability and effectiveness of the higher-order latent space modeling framework.

Scalability: We investigate the scalability of the parallel framework for modeling higher-order latent edge roles. To evaluate the effectiveness of the parallel modeling framework, we measure the speedup, defined simply as S_p = T_1 / T_p, where T_1 is the execution time of the sequential algorithm and T_p is the execution time of the parallel algorithm with p processing units. Overall, the methods show strong scaling (see Figure 1). Similar results were observed for other networks. As an aside, the experiments in Figure 1 used a 4-processor Intel Xeon E5-4627 v2 3.3GHz CPU.

Higher-order Model Selection: MDL is used to automatically learn the appropriate edge role model. In Figure 2, the description length (in bits) is minimized at r = 18. Intuitively, too many roles increases the model description cost, whereas too few roles increases the cost of describing the errors.
In addition, Figure 3 shows the runtime of our approach. Furthermore, Figure 5 demonstrates the impact on the learning time, the number of novel features discovered, and their sparsity, as the tolerance (ε) and bin size (α) vary.

[Figure 2: Description length (×10⁴ bits) vs. number of latent roles, with confidence bounds and the optimum marked. In the example shown, the valley identifies the correct number of latent roles.]

[Figure 3: The running time of our approach. The x-axis is time in seconds and the y-axis is the log description cost. The curve is the average over 50 experiments and the dotted lines represent three standard deviations. The result reported above is from a laptop with a single core.]

Modeling Dynamic Networks: In this section, we investigate the Enron email communication networks using the Dynamic Edge Role Mixed-membership Model (DERM). The Enron email data consists of 151 Enron employees who sent 50.5k emails to other Enron employees. We processed all email communications spanning over 3 years, discarded the textual content of the emails, and used only the edges representing a directed email communication (from one employee to another). The email communications span 05/11/1999 to 06/21/2002. For learning edge roles (and a set of representative edge features), we leverage the first year of emails. Note that other work such as dMMSB [12] uses email communications from 2001 only, which corresponds to the time period in which the Enron scandal was revealed (October 2001). We instead study a much more difficult problem: given only past data, can we actually uncover and detect the key events leading up to the downfall of Enron?
A dynamic network {G_t}_{t=1}^T is constructed from the remaining email communications (approximately 2 years), where each snapshot graph G_t, t = 1, ..., T represents a month of communications. Interestingly, we learn a dynamic node role mixed-membership model with 5 latent roles, which is exactly the number of latent node roles learned by dMMSB [12]. However, we learn a dynamic edge role mixed-membership model with 18 roles. Evolving edge and node mixed-memberships from the Enron email communication network are shown in Figure 4. The sets of edges and nodes visualized in Figure 4 are selected using the difference entropy rank (see Eq. (10) below) and correspond to the edges and nodes with the largest difference entropy rank d. The first role in Figure 4 represents inactivity (dark blue). For identifying anomalies, we use the difference entropy rank, defined as:

  d = max_{t ∈ T} H(u_t) − min_{t ∈ T} H(u_t)    (10)

where H(u_t) = −u_t · log(u_t) and u_t is the r-dimensional mixed-membership vector of an edge (or node) at time t.

[Figure 4: Temporal changes in the edge and node mixed-membership vectors from the Enron email communication network: (a) evolving edge role mixed-memberships; (b) evolving node role mixed-memberships. The horizontal axis of each subplot is time, whereas the vertical axes represent the components of each mixed-membership vector. Roles are represented by different colors.]

Using the difference entropy rank, we are able to reveal important communications between key players involved in the Enron scandal, such as Kenneth Lay, Jeffrey Skilling, and Louise Kitchen. Notice that when node roles are used for identifying dynamic anomalies in the graph, we are only provided with potentially malicious employees, whereas using edge roles naturally allows us not only to detect the key malicious individuals involved, but also the important relationships between them, which can be used for further analysis, among other possibilities.

Exploratory Analysis: Figure 6 visualizes the node and edge roles learned for ca-netscience. While our higher-order latent space model learns a stochastic r-dimensional vector for each edge (and/or node) representing its individual role memberships, Figure 6 assigns a single role to each link and node for simplicity. In particular, given an edge e_i ∈ E (or node) and its mixed-membership row vector u_i, we assign e_i the role with maximum likelihood, k* ← arg max_k u_ik. The higher-order edge and node roles in Figure 6 are clearly meaningful. For instance, the red edge role represents a type of bridge relationship, as shown in Figure 6.

Figure 5: Impact on the learning time, number of features, and their sparsity, as the tolerance (ε, rows) and bin size (α, columns) vary.

(a) Learning time:
  ε \ α   0.5    0.6    0.7    0.8    0.9
  0.01    1.48   0.95   0.57   0.47   0.41
  0.05    1.03   0.55   0.48   0.46   0.45
  0.1     0.72   0.57   0.54   0.51   0.48
  0.2     0.78   0.58   0.55   0.52   0.49
  0.5     0.58   0.56   0.54   0.6    0.56

(b) Number of features discovered:
  ε \ α   0.5    0.6    0.7    0.8    0.9
  0.01    327    149    81     46     26
  0.05    168    73     48     31     18
  0.1     111    53     42     26     18
  0.2     94     49     36     24     18
  0.5     39     33     30     21     16

(c) Sparsity of features:
  ε \ α   0.5    0.6    0.7    0.8    0.9
  0.01    0.151  0.158  0.136  0.097  0.077
  0.05    0.23   0.209  0.169  0.111  0.084
  0.1     0.235  0.23   0.186  0.133  0.084
  0.2     0.24   0.223  0.222  0.143  0.084
  0.5     0.319  0.276  0.242  0.158  0.094

[Figure 6: Edge and node roles for ca-netscience. Link color represents the edge role and node color indicates the corresponding node role.]

Sparse Graph Feature Learning: Recall that the proposed feature learning approach attempts to learn "sparse graph features" to improve learning and efficiency, especially space-efficiency.
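The difference entropy rank of Eq. (10) and the maximum-likelihood role assignment used for Figure 6 can both be sketched in a few lines:

```python
# Sketch: Eq. (10) on a time series of mixed-membership vectors, plus the
# hard assignment k* = argmax_k u_k used for visualization.
import math

def entropy(u):
    # H(u) = -sum_k u_k log u_k  (0 log 0 taken as 0)
    return -sum(p * math.log(p) for p in u if p > 0)

def difference_entropy_rank(memberships):
    # memberships: list of membership vectors u_t, one per time step
    hs = [entropy(u) for u in memberships]
    return max(hs) - min(hs)

def hard_role(u):
    return max(range(len(u)), key=lambda k: u[k])

stable = [[1.0, 0.0, 0.0]] * 3                    # always one pure role
shifting = [[1.0, 0.0, 0.0], [1/3, 1/3, 1/3]]     # pure -> maximally mixed
```

An edge whose memberships stay pure scores d = 0, while one that drifts from a pure role to a uniform mixture scores d = log r, which is why high-d edges surface the anomalous relationships discussed above.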
This section investigates the effectiveness of our sparse graph feature learning approach. Results are presented in Table 2. In all cases, our approach learns a highly compressed representation of the graph, requiring only a fraction of the space of current (node-based) approaches. Moreover, the density of the edge and node feature representations learned by our approach lies in [0.164, 0.318] for edges and [0.162, 0.334] for nodes (see ρ(X) and ρ(Z) in Table 2), making them up to 6x more space-efficient than other approaches. While existing feature learning approaches for graphs are unable to learn higher-order graph features (and are thus impractical for higher-order network analysis and modeling), they also have another fundamental disadvantage: they return dense features. Learning space-efficient features is critical, especially for large networks. For instance, on extremely large networks, storing even a small number of edge (or node) features quickly becomes impractical. Despite the importance of learning sparse graph features, existing work has ignored this problem, as most approaches stem from Statistical Relational Learning (SRL) [13] and have been designed for extremely small graphs. Moreover, nearly all existing methods focus on node features [10, 19, 21, 20], whereas we focus on both, and primarily on learning novel and important edge feature representations from large massive networks.

Table 2: Higher-order sparse graph feature learning for latent node and edge network modeling. Recall that f is the number of features, L is the number of layers, and ρ(X) is the sparsity of the feature matrix. Edge values are listed first; node values are in parentheses.
  graph       f            L       ρ(X) (ρ(Z))
  socfb-MIT   2080 (912)   8 (9)   0.318 (0.334)
  yahoo-msg   1488 (405)   7 (7)   0.164 (0.181)
  enron       843 (109)    5 (4)   0.312 (0.320)
  Facebook    1033 (136)   7 (5)   0.187 (0.162)
  bio-DD21    379 (723)    6 (6)   0.215 (0.260)

Computational Complexity: Recall that m is the number of edges, n the number of nodes, f the number of features, and r the number of latent roles. The total time complexity of the higher-order latent space model is O(f(m + nr)); thus, the runtime is linear in the number of edges. The time complexity decomposes into the following main parts. Feature learning takes O(f(m + nf)). Model learning takes O(mrf) in the worst case, which arises when U and V are completely dense. The quantization and Huffman coding terms are very small and therefore ignored. Latent role learning using scalar element-wise coordinate descent has a worst-case complexity of O(mfr) per iteration, which arises when X is completely dense. However, when X is sparse, it takes O(|X| r) per iteration, where |X| is the number of nonzeros in X ∈ R^{m×f}. In addition, we compute the initial set of graphlet-based features using the efficient parallel algorithm in [2]. Note that this algorithm computes the counts of a few graphlets and obtains the others directly in constant time. This takes O(Δ|S_u| + |S_v| + |T_e|) for any given edge e_i = (v, u), where Δ is the maximum degree of any vertex, S_v and S_u are the sets of wedge nodes, and T_e is the set of triangles incident to edge e_i.

5. CONCLUSION

This work introduced the notion of edge roles and proposed a higher-order latent space network model for edge role discovery. To the best of our knowledge, this work is the first to explore higher-order graphlet-based features for role discovery.
Moreover, these features are counts of various induced subgraphs of arbitrary size; they were used directly for role discovery as well as given as input to a graph representation learning approach to learn more discriminative features from these initial features. Furthermore, feature-based edge roles have many important properties and can be used for graph similarity, node and edge similarity queries, visualization, anomaly detection, classification, and link prediction, among many other tasks. Our edge role discovery framework also naturally supports large-scale attributed networks.

6. REFERENCES

[1] N. K. Ahmed, J. Neville, R. A. Rossi, and N. Duffield. Efficient graphlet counting for large networks. In ICDM, page 10, 2015.
[2] N. K. Ahmed, J. Neville, R. A. Rossi, N. Duffield, and T. L. Willke. Graphlet decomposition: Framework, algorithms, and applications. Knowledge and Information Systems (KAIS), pages 1–32, 2016.
[3] H. Akaike. A new look at the statistical model identification. Transactions on Automatic Control, 19(6):716–723, 1974.
[4] C. Anderson, S. Wasserman, and K. Faust. Building stochastic blockmodels. Social Networks, 14(1):137–161, 1992.
[5] P. Arabie, S. Boorman, and P. Levitt. Constructing blockmodels: How and why. Journal of Mathematical Psychology, 17(1):21–63, 1978.
[6] V. Batagelj, A. Mrvar, A. Ferligoj, and P. Doreian. Generalized blockmodeling with Pajek. Metodoloski zvezki, 1:455–467, 2004.
[7] W. R. Bennett. Spectra of quantized signals. Bell System Technical Journal, 27(3):446–472, 1948.
[8] S. Borgatti, M. Everett, and J. Johnson. Analyzing Social Networks. Sage Publications, 2013.
[9] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Math. and Mathematical Physics, 7(3):200–217, 1967.
[10] J. Davis, I. M. Ong, J. Struyf, E. S. Burnside, D. Page, and V. S. Costa. Change of representation for statistical relational learning. In IJCAI, pages 2719–2726, 2007.
[11] P. Doreian, V. Batagelj, and A. Ferligoj. Generalized Blockmodeling, volume 25. Cambridge University Press, 2005.
[12] W. Fu, L. Song, and E. P. Xing. Dynamic mixed membership blockmodel for evolving networks. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 329–336, 2009.
[13] L. Getoor and B. Taskar, editors. Introduction to Statistical Relational Learning. MIT Press, 2007.
[14] P. D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
[15] I. Guyon, M. Nikravesh, S. Gunn, and L. A. Zadeh. Feature Extraction: Foundations and Applications. Springer, 2008.
[16] K. Henderson, B. Gallagher, T. Eliassi-Rad, H. Tong, S. Basu, L. Akoglu, D. Koutra, C. Faloutsos, and L. Li. RolX: Structural role extraction & mining in large graphs. In SIGKDD, pages 1231–1239, 2012.
[17] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
[18] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
[19] S. Kok and P. Domingos. Statistical predicate invention. In ICML, pages 433–440. ACM, 2007.
[20] N. Landwehr, K. Kersting, and L. De Raedt. nFOIL: Integrating naïve Bayes and FOIL. In AAAI, pages 795–800, 2005.
[21] N. Landwehr, A. Passerini, L. De Raedt, and P. Frasconi. kFOIL: Learning simple relational kernels. In AAAI, volume 6, pages 389–394, 2006.
[22] M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer Science & Business Media, 2009.
[23] S. Lloyd. Least squares quantization in PCM. Transactions on Information Theory, 28(2):129–137, 1982.
[24] F. Lorrain and H. White. Structural equivalence of individuals in social networks. Journal of Mathematical Sociology, 1(1):49–80, 1971.
[25] D. Marcus and Y. Shavitt. RAGE: a rapid graphlet enumerator for large networks. Computer Networks, 56(2):810–819, 2012.
[26] J. Max. Quantizing for minimum distortion. Transactions on Information Theory, 6(1):7–12, 1960.
[27] K. Nowicki and T. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077–1087, 2001.
[28] B. Oliver, J. Pierce, and C. E. Shannon. The philosophy of PCM. Proceedings of the IRE, 36(11):1324–1331, 1948.
[29] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
[30] R. A. Rossi and N. K. Ahmed. Role discovery in networks. TKDE, 27(4):1112–1131, 2015.
[31] R. A. Rossi, B. Gallagher, J. Neville, and K. Henderson. Modeling dynamic behavior in large evolving graphs. In WSDM, pages 667–676, 2013.
[32] R. A. Rossi and R. Zhou. Parallel collective factorization for modeling large heterogeneous networks. In Social Network Analysis and Mining, page 30, 2016.
[33] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
[34] C. E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:623–656, 1948.
[35] J. van Leeuwen. On the construction of Huffman trees. In ICALP, pages 382–410, 1976.
[36] S. Wernicke and F. Rasche. FANMOD: a tool for fast network motif detection. Bioinformatics, 22(9):1152–1153, 2006.
[37] D. White and K. Reitz. Graph and semigroup homomorphisms on networks of relations. Social Networks, 5(2):193–234, 1983.