A Spectral Framework for Anomalous Subgraph Detection
Authors: Benjamin A. Miller, Michelle S. Beard, Patrick J. Wolfe, and Nadya T. Bliss
IN SUBMISSION TO THE IEEE, 2014

Benjamin A. Miller, Member, IEEE, Michelle S. Beard, Patrick J. Wolfe, Senior Member, IEEE, and Nadya T. Bliss, Senior Member, IEEE

Abstract—A wide variety of application domains are concerned with data consisting of entities and their relationships or connections, formally represented as graphs. Within these diverse application areas, a common problem of interest is the detection of a subset of entities whose connectivity is anomalous with respect to the rest of the data. While the detection of such anomalous subgraphs has received a substantial amount of attention, no application-agnostic framework exists for analysis of signal detectability in graph-based data. In this paper, we describe a framework that enables such analysis using the principal eigenspace of a graph's residuals matrix, commonly called the modularity matrix in community detection. Leveraging this analytical tool, we show that the framework has a natural power metric in the spectral norm of the anomalous subgraph's adjacency matrix (signal power) and of the background graph's residuals matrix (noise power). We propose several algorithms based on spectral properties of the residuals matrix, with more computationally expensive techniques providing greater detection power. Detection and identification performance are presented for a number of signal and noise models, including clusters and bipartite foregrounds embedded into simple random backgrounds, as well as graphs with community structure and realistic degree distributions. The trends observed verify intuition gleaned from other signal processing areas, such as greater detection power when the signal is embedded within a less active portion of the background.
We demonstrate the utility of the proposed techniques in detecting small, highly anomalous subgraphs in real graphs derived from Internet traffic and product co-purchases.

Index Terms—Graph theory, signal detection theory, spectral analysis, residuals analysis, principal components analysis

I. INTRODUCTION

In numerous applications, the data of interest consist of entities and the relationships between them. In social network analysis, for example, the data are connections between individuals, such as who knows whom personally, who is in the same organization, or who is connected on a social networking website. In computer networks, we are often interested in which computers communicate with one another. In the natural sciences, we may want to know which chemicals interact in a reaction. Across these varied domains, data regarding connections, relationships and interactions between discrete entities enhance situational awareness and diversify analysis by incorporating additional contextual information. When working with relational data, it is common to formally represent the relationships as a graph.

This work is sponsored by the Assistant Secretary of Defense for Research & Engineering under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government. B. A. Miller is with Lincoln Laboratory, Massachusetts Institute of Technology, Lexington, MA, 02420 USA (e-mail: bamiller@ll.mit.edu). M. S. Beard is with Charles Stark Draper Laboratory, Cambridge, MA, 02139 USA (email: mbeard@draper.com). P. J. Wolfe is with the Department of Statistical Science, University College London, London WC1E 6BT UK (e-mail: p.wolfe@ucl.ac.uk). N. T. Bliss is with Arizona State University, Tempe, AZ, 85287 USA (e-mail: nadya.bliss@asu.edu).
A graph G = (V, E) is a pair of sets: a set of vertices, V, comprising the entities, and a set of edges, E, denoting relationships between them. Graph theory provides an abstract mathematical object that has been applied in all of the above contexts. Indeed, graphs have been used to model protein interactions [1] and to represent communication between computers [2]. Graphs—commonly called networks in practice—are used extensively in social network analysis, with many graph algorithms focused on detection of communities [3], [4] and influential figures [5]. As a data structure, graphs have long been utilized by signal processing practitioners. Analysis of graphs derived from radio frequency or image data is common, as a graph structure can help classify similar measurements (see, e.g., [6]). Recent research has also defined traditional digital signal processing kernels—such as filtering and Fourier transforms—for signals that propagate along edges in a graph [7], [8].

When the graph comprises the data itself, rather than a means of storage, significant complications arise. Graphs are discrete, combinatorial structures, and, thus, they lack the convenient mathematical context of Euclidean vector spaces. The ability to perform linear transformations and the analytical tractability of working with Gaussian noise are not available in general when working with relational data. Deriving an optimal detector for a small signal subgraph buried within a large network, then, becomes potentially intractable, as it may require the solution to an NP-hard problem. Despite these complications, it is desirable to understand notions of detectability of small subgraphs embedded within a large background. The ability to detect small signals in these contexts would be useful in many domains, from the detection of malicious traffic in a computer network to the discovery of threatening activity in a social network.
Recent work in this area has considered subgraph detection from a variety of perspectives. Work has been done on detection of specific target subgraphs in random backgrounds [9], with special attention paid in the computer science and statistics communities to planted cliques [10], [11] and planted clusters [12], [13]. Other work assumes common substructures over the graph, and detects anomalies based on deviations from the "normative pattern" via methods such as minimum description length [14] or analysis of the graph Laplacian [15]. Techniques such as threat propagation [16], [17] and vertex nomination [18] consider a cue vertex as a knowledge prior, giving an initial indication of which vertices are of interest, the objective then being to find the remainder of the subgraph. Community detection in graphs is a widely studied related problem [19], where the communities in the graph are sometimes cast as deviations from a null hypothesis in which the graph has no community structure [20].

The objective of the present contribution is to develop a broadly applicable detection framework for graph-based data. To apply in these varied domains, this framework should be independent of the specific application. We focus specifically on the uncued anomalous subgraph detection problem, where the goal is to detect the presence of a subgraph that is a statistical outlier without a "tip" vertex provided as a cue. As graphs of interest are often extremely large, the framework should have favorable scaling properties as the number of vertices and edges grows. To gain insight into properties that influence subgraph detectability, the framework will ideally have a natural metric for signal and noise power, to enable discussion of quantities like signal-to-noise ratio that are intrinsic to signal processing applications. In this paper, we present a spectral framework to address the uncued subgraph detection problem.
This framework is based on a regression-style analysis of residuals, in which an observed random graph is compared to its expected value to find outliers. We analyze the graph in the space of the principal eigenvectors of its residuals matrix, which offers two advantages: it allows us to use results from spectral graph theory to elucidate the notion of subgraph detectability, and it works within a linear algebraic framework with which many signal processing researchers are familiar. Within this framework, the spectral norm provides a good metric for signal and noise power, as we demonstrate analytically and empirically. This framework also enables the development of algorithms that work in a low-dimensional space to detect small anomalies, several of which are discussed in this paper.

The remainder of this paper is organized as follows. In Section II, we formally define the subgraph detection problem. Section III provides a brief summary of related work on subgraph detection and graph residuals analysis. Section IV details our proposed subgraph detection framework. In Section V, we outline several algorithms for anomaly detection within the framework. Section VI presents detection results for several simulated datasets, and in Section VII we demonstrate these techniques on real datasets. Finally, in Section VIII, we summarize and discuss open problems and ongoing work.

II. PROBLEM MODEL

A. Definitions and Notation

In the subgraph detection problem, the observation is a graph G = (V, E). We will denote the sizes of the vertex and edge sets as N = |V| and M = |E|, respectively. A subgraph G_S = (V_S, E_S) of G is a graph in which V_S ⊂ V and E_S ⊂ E ∩ (V_S × V_S), where the Cartesian product V × V is the set of all possible edges in a graph with vertex set V. In this paper, we consider graphs whose edges are unweighted and undirected.
We will allow the possibility of self-loops, meaning an edge may connect a vertex to itself. Since edges have no weight, two graphs will be combined via their union. The union of two graphs, G_1 = (V_1, E_1) and G_2 = (V_2, E_2), is defined as G_1 ∪ G_2 = (V_1 ∪ V_2, E_1 ∪ E_2).

Working in a spectral framework, we will make use of matrix representations for graphs. The adjacency matrix A = {a_ij} of G is a binary N × N matrix. Each row and column is associated with a vertex in V. This implies an arbitrary ordering of the vertices with integers from 1 to N, and we will denote the ith vertex v_i. Then a_ij is 1 if there is an edge connecting v_i and v_j, and is 0 otherwise. Similarly, let A_S = {s_ij} be the adjacency matrix for the signal subgraph. Since we consider undirected graphs, A and A_S are symmetric.

Matrix norms will also be used in the discussion of signal and noise power. Unless otherwise noted, the matrix norm will be the spectral norm, i.e., the induced L2 norm,

||A|| = max_{||x||_2 = 1} ||Ax||_2,   (1)

which is equivalent to the absolute value of the largest-magnitude eigenvalue of the matrix.

Our framework is focused on detection of signals within a random background. The analysis presented in this paper is based on the assumption of Bernoulli random graphs, where the presence of an edge between v_i and v_j is a Bernoulli random variable with expected value p_ij. Note that the edge probabilities may be different for all pairs of vertices. Since the presence of each edge is a Bernoulli random variable, the expected value of A is given by P = {p_ij}. We refer to P as the probability matrix of the graph.

Another important notion when dealing with graphs is degree. A vertex's degree is the number of edges adjacent to the vertex. The observed degree of vertex v_i will be denoted k_i, and its expected degree is denoted E[k_i] = d_i. Note that k_i = Σ_{j=1}^{N} a_ij and d_i = Σ_{j=1}^{N} p_ij.¹
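These definitions map directly onto matrix computations. The following sketch (an illustration, not code from the paper) builds the adjacency matrix of a small undirected graph, computes the observed degrees k_i = Σ_j a_ij, and evaluates the spectral norm (1) as the largest-magnitude eigenvalue of the symmetric matrix A:

```python
import numpy as np

def adjacency_matrix(n, edges):
    """Binary adjacency matrix of an undirected graph on vertices 0..n-1."""
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i, j] = 1
        A[j, i] = 1  # undirected: A is symmetric
    return A

# A 4-vertex path graph: 0 - 1 - 2 - 3
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])

k = A.sum(axis=1)                                    # observed degrees k_i
spectral_norm = np.abs(np.linalg.eigvalsh(A)).max()  # ||A|| for symmetric A
```

For a symmetric matrix the induced L2 norm coincides with the largest eigenvalue magnitude, which is why `eigvalsh` suffices here.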
The vectors of the observed and expected degrees will be denoted k and d, respectively. The volume of the graph, Vol(G), is the sum of the degrees over all vertices.

B. The Subgraph Detection Problem

In some cases, the observed graph G will consist of only typical background activity. This is the "noise only" scenario. In other cases, most of G exhibits typical behavior, but a small subgraph has an anomalous topology. This is the "signal-plus-noise" scenario. In this case, the noise graph, denoted G_N = (V_N, E_N), and the signal subgraph, G_S = (V_S, E_S), are combined via union. The objective, given the observation G, is to discriminate between the two scenarios. Formally, we want to resolve the following binary hypothesis test:

H_0: G = G_N
H_1: G = G_N ∪ G_S.   (2)

Thus, we have the classical signal detection problem: under the null hypothesis H_0, the observation is purely noise, while under the alternative hypothesis H_1, a signal is also present. Here G_N and G_S are both random graphs, with G_N drawn from the noise distribution and G_S drawn from the signal distribution. We will only consider cases in which the vertex set of the signal subgraph is a subset of the vertices in the background, i.e., V_S ⊂ V_N = V.

¹ Using this convention, a self-loop only increases a vertex's degree by 1.

III. RELATED WORK

While there are many flavors of subgraph detection research, not all of them work under the same assumptions as in this paper. For example, we consider a variety of noise models, which may not have the "normative pattern" required to use techniques based on common subgraphs [14], [15]. Research into anomaly detection in dynamic graphs by Priebe et al. [21] uses the history of a node's neighborhood to detect anomalous behavior, but this would not apply in the case of static graphs, which is the focus of this work.
As our interest is in uncued techniques, we operate in a different context from the work in [16]–[18]. These methods are complementary to the techniques outlined in this paper, as a set of outlier vertices could be used to seed a cued algorithm and do further exploration.

Previous work has considered optimal detection in the same context we consider in this paper, though in a restricted setting. In [9], the authors consider the detection of a specific foreground embedded (via union) into a large graph in which each possible edge occurs with equal probability (i.e., the random graph model of Erdős and Rényi). In this setting, the likelihood ratio can be written in closed form, as demonstrated by the following theorem.

Theorem 1 (Mifflin et al. [9]). Let G denote the random graph where each possible edge occurs with equal probability p, and let H denote the target graph. The likelihood ratio of an observed graph J is

Λ_H(J) = X_H(J) / E[X_H(G)].   (3)

Here X_H(·) denotes the number of occurrences of H in the graph.

The applicability of this result, therefore, requires a tractable way to count all subgraphs of the observation J that are isomorphic with the target. This is NP-hard in general [22], although there may be feasible methods to accomplish this for certain targets within sparse backgrounds.

While the previous example requires a complicated procedure, detection of random subgraphs embedded into random backgrounds may be an even harder problem. Take, for example, the detection problem where the background and foreground are both Erdős–Rényi, i.e., when the null and alternative hypotheses are given by

H_0: each pair of vertices shares an edge with probability p
H_1: an N_S-vertex subgraph was embedded whose edges were generated with probability p_S.   (4)

In this situation, we can derive an optimal detection statistic.

Theorem 2.
For an observed graph G = (V, E), let X be a subset of V of size N_S, and let E_X ⊂ E be the set of all edges existing between the vertices in X. The likelihood ratio for resolving the hypothesis test in (4) is given by

(N choose N_S)^{-1} ((1 − p̂)/(1 − p))^{(N_S choose 2)} Σ_{X ⊂ V, |X| = N_S} [ p̂(1 − p) / (p(1 − p̂)) ]^{|E_X|},   (5)

where p̂ = p + p_S − p·p_S.

A proof of Theorem 2 is provided in Appendix A. Even in this relatively simple scenario, computing the likelihood ratio in (5) requires, at least, knowing how many N_S-vertex induced subgraphs contain each possible number of edges. In [12], it is shown that some computable tests asymptotically achieve the information-theoretic bound for dense backgrounds, but there are no known polynomial-time algorithms that achieve the bound in a sparse graph [13]. For more complicated models, calculating the optimal detection statistic is likely to be even more difficult.

The subgraph detection framework presented in this paper is based on graph residuals analysis. The residuals of a random graph are the difference between the observed graph and its expected value.² For a random graph G, we analyze its residuals matrix

B := A − E[A].   (6)

In the area of community detection, a widely used quantity to evaluate the quality of separation of a graph into communities is modularity, defined in [20]. The modularity of a partition C = {C_1, ..., C_n} is defined as

Q = Σ_{i=1}^{n} (e_ii − a_i^2),   (7)

where the C_i are disjoint subsets of V covering the entire set, e_ii is the proportion of edges entirely within C_i, and a_i is the proportion of edge connections in C_i, i.e.,

a_i = Σ_{j=1}^{n} e_ij,   (8)

with e_ij denoting half the proportion of edges between C_i and C_j for i ≠ j (halved to prevent counting each such edge in both e_ij and e_ji).
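Equations (7)–(8) can be computed directly from an adjacency matrix and a vertex-to-community assignment. The sketch below is an illustrative implementation (not from the paper), using the half-edge convention for e_ij just described:

```python
import numpy as np

def modularity(A, communities):
    """Modularity Q = sum_i (e_ii - a_i^2) of a partition, as in (7)-(8).

    `communities` maps each vertex index to a community label 0..n-1.
    """
    M = A.sum() / 2                      # number of edges
    n = max(communities) + 1
    e = np.zeros((n, n))
    rows, cols = np.nonzero(np.triu(A, 1))
    for i, j in zip(rows, cols):
        ci, cj = communities[i], communities[j]
        if ci == cj:
            e[ci, ci] += 1.0 / M
        else:                            # half an edge to each off-diagonal cell
            e[ci, cj] += 0.5 / M
            e[cj, ci] += 0.5 / M
    a = e.sum(axis=1)                    # a_i = sum_j e_ij
    return float(np.sum(np.diag(e) - a ** 2))

# Two triangles joined by a single bridge edge: a clean two-community graph
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
Q = modularity(A, [0, 0, 0, 1, 1, 1])
```

For the two-triangle example, e_00 = e_11 = 3/7 and a_0 = a_1 = 1/2, so Q = 6/7 − 1/2 ≈ 0.357, reflecting a strong community split.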
Note that a_i^2 is the expected proportion of edges within C_i if the edges were randomly rewired (i.e., the degree of each vertex is preserved, but edges are cut and reconnected at random). Indeed, if the edge proportions are the only thing maintained in the rewiring, the fraction of edges from any community that connect to a vertex in C_i will be a_i. Thus, the proportion of the total edges from C_i to C_j will be a_i a_j. Taken as an analysis of deviations from an expected topology, modularity is a residuals-based quantity.

In the community detection literature, numerous algorithms exist to maximize Q for a given number of communities. In [3], an algorithm is proposed by casting modularity maximization as optimization of a vector with respect to a matrix. The modularity matrix B is given as the observed minus the expected adjacency matrices, i.e., a matrix of the form in (6). To divide the graph into two partitions in which modularity is maximized, we can solve

ŝ = argmax_{s ∈ {−1, 1}^N} s^T (A − (1/Vol(G)) k k^T) s,   (9)

and declare the vertices corresponding to the positive entries of ŝ to be in one community, with the negative entries indicating the other. This technique will optimize Q for a partition into two communities. Since this is a hard combinatorial problem, it is suggested that the principal eigenvector of

B = A − (1/Vol(G)) k k^T   (10)

be computed—thereby relaxing the problem into the real numbers—with the same strategy of discriminating based on the sign of eigenvector components used to divide the graph into communities. This is an example of a community detection algorithm based on spectral properties of a graph, which have inspired a significant amount of work in the detection of communities [3], [23]–[25] and global anomalies [2], [26], [27].

² This is distinct, it should be noted, from the notion of residual networks when computing network flow [22].
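The relaxation in (9)–(10) can be sketched in a few lines. Because the expected-value term k k^T / Vol(G) is rank one, the product Bx can be formed as Ax − k (k^T x)/Vol(G) without ever constructing the dense matrix, so with a sparse A each matrix-vector product costs O(M + N). The implementation below is an illustrative sketch, not the paper's code; the small example forms B densely, while `modularity_matvec` shows the implicit product a large-scale iterative eigensolver would use:

```python
import numpy as np

def modularity_matvec(A, x):
    """Compute B x = A x - k (k^T x) / Vol(G) without forming B densely."""
    k = np.asarray(A.sum(axis=1), dtype=float).ravel()
    return A @ x - k * (k @ x) / k.sum()

def spectral_bipartition(A):
    """Relaxed modularity maximization: split vertices by the sign of the
    principal eigenvector of B = A - k k^T / Vol(G)."""
    k = A.sum(axis=1).astype(float)
    B = A - np.outer(k, k) / k.sum()
    vals, vecs = np.linalg.eigh(B)
    return vecs[:, np.argmax(vals)] >= 0   # boolean community labels

# Two triangles joined by one edge: the sign split recovers the triangles
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
parts = spectral_bipartition(A)
```

The sign of the principal eigenvector assigns each triangle to its own community, matching the combinatorial optimum of (9) for this graph.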
In this paper, we leverage these same properties within a novel framework for detection of small subgraphs whose behavior is distinct from background activity.

IV. DETECTION FRAMEWORK

A. Framework Overview

The subgraph detection framework we propose is based on the analysis of graph residuals, as expressed by (6). We may be given E[A], or it may be estimated from the observed data. This is similar to analysis of variance in linear regression: we compare the observed data to its expectation, and if the deviations from the expected value are not consistent with variations due to noise, then this may indicate the presence of a signal (in this case an anomalous subgraph). To reduce the dimensionality of the problem, this framework deals with a graph's spectral properties. Using the principal components of the residuals matrix, we can consider a graph in the linear subspace in which its residuals are largest. For some established models, there is also theory regarding the eigenvalues and eigenvectors of these matrices [28]. This technique is used in community detection, and is similar to models in which each vertex has a position in a latent Euclidean space (see, e.g., [29]). The presence of certain anomalous subgraphs will alter the projection of a graph into this Euclidean residuals space. Working within this space, we can compute test statistics and, from these, resolve the hypothesis test (2). While these will not be optimal detection statistics as in Theorems 1 and 2, this framework can be applied to a wide variety of random graph models, is computationally tractable, and, as we demonstrate in subsequent sections, is quite useful for resolving the subgraph detection problem in a variety of scenarios.

We use the modularity matrix from (10) as our baseline residuals model. This has several advantages.
First, the "given expected degree" model has been well studied, and we know properties of its eigenvalues and eigenvectors [30]. Second, the model's expected-value term is low-rank, which allows computation of the eigenvectors of B without forming a dense N × N matrix (as noted in [3] and described in [31]). This makes the model computationally tractable for large graphs, where algorithms more expensive than O(M) can be prohibitive. The model also has a simple fitting procedure: estimating the expected degree as the observed degree is, in fact, the maximum likelihood estimator for the version of this model in which each possible edge is a Poisson random variable [32]. For small edge probabilities, this is a good approximation for Bernoulli random variables. Finally, this model has demonstrated utility for modeling inter-community behavior; i.e., the probability of connections between vertices in different communities seems to follow such a model (the reason that observed degree was added as a covariate in [33]).

B. Power Metrics

As mentioned previously, one important aspect of a signal processing framework is a metric for signal and noise power. This provides a quantity that enables an intuitive assessment of the detectability of a signal in a given background. Again, vector signals with Gaussian noise provide an intuitive metric based on vector norms, while such quantities are less clear in the context of random graphs.

There are several intuitive quantities that could be used for signal or noise power in the context of random graphs. One natural choice would be the number of edges, or perhaps average degree. It seems intuitive that a signal graph with a large number of edges would be easier to detect, and that greater variance in the number of edges in the background would make this more difficult.
A related linear algebraic quantity would be the Frobenius norm of the residuals matrix, i.e., the square root of the sum of the squared residuals over all ordered pairs of vertices. This would consider each edge probability separately, emphasizing the presence of less-likely edges. These metrics, however, have a few shortcomings. In both cases, the signal power measurement will be exactly the same for any subgraph with the same number of edges. Consider two different trees: a path, in which each edge can be traversed while visiting each vertex exactly once; and a star, where one vertex is connected to all others. Both will have N_S − 1 edges, and a squared Frobenius norm of 2(N_S − 1). The star, however, is much more concentrated on one vertex, and this will cause it to stand out more in the eigenspace (it is also much less likely to occur by chance if edges are randomly placed). The power metric we use should provide an indication of a subgraph's likelihood to stand apart from the background in the eigenspace, since this is the space in which we consider the data.

Working within a spectral framework, the spectral norm defined in (1) provides a natural power metric. Using ||A − E[A]|| as a metric for noise power, and ||A_S|| as a metric for signal power, we can determine the detectability of a subgraph in the principal eigenspace. To see this, we first define a new matrix, Â = {â_ij}, which is the adjacency matrix of Ĝ = (V, E_S \ E), i.e., the edges of the anomaly that do not appear in the background. For deterministic foreground graphs, if s_ij is 1, then â_ij is a random variable whose value is 1 with probability 1 − p_ij and 0 with probability p_ij. For a random Bernoulli foreground, if E[s_ij] = q_ij, then â_ij is 1 with probability q_ij(1 − p_ij). Thus, when the subgraph is embedded within vertices where the interaction level is low, E[Â] ≈ E[A_S].
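The path/star contrast above is easy to verify numerically: both trees on N_S vertices have N_S − 1 edges (and therefore the same squared Frobenius norm), yet the star's spectral norm is sqrt(N_S − 1) while the path's stays below 2. A small illustrative check:

```python
import numpy as np

def path_adjacency(n):
    """Path graph: 0 - 1 - ... - (n-1)."""
    A = np.zeros((n, n), dtype=int)
    for i in range(n - 1):
        A[i, i + 1] = A[i + 1, i] = 1
    return A

def star_adjacency(n):
    """Star graph: hub vertex 0 connected to all others."""
    A = np.zeros((n, n), dtype=int)
    A[0, 1:] = A[1:, 0] = 1
    return A

def spectral_norm(A):
    return float(np.abs(np.linalg.eigvalsh(A)).max())

n = 15
# Same edge count (n - 1), very different spectral concentration:
path_norm = spectral_norm(path_adjacency(n))   # 2 cos(pi / (n + 1)) < 2
star_norm = spectral_norm(star_adjacency(n))   # sqrt(n - 1)
```

The star's energy concentrates on the hub, which is exactly the behavior the spectral norm rewards and the edge-count and Frobenius metrics miss.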
For convenience, we will also denote a partition of the residuals matrix as

B = [ B_S      B_SN ]
    [ B_SN^T   B_N  ],   (11)

where the rows and columns have been permuted so that the subgraph vertices are those with the smallest indices. The submatrix B_S is the background residuals within the subgraph vertices, B_SN is the residuals occurring between the subgraph and the rest of the graph, and B_N includes only the residuals within the complement of the subgraph vertices.

If the spectral norm of the signal subgraph is sufficiently large with respect to the background power, the subgraph will dominate the principal eigenvector of the residuals matrix. This is captured in the following theorem, a proof of which is provided in Appendix B.

Theorem 3. Let B be the residuals matrix of a graph drawn from an arbitrary Bernoulli graph process, and Â be the adjacency matrix of the subgraph that does not include edges in the background graph. If u is the unit eigenvector associated with the largest positive eigenvalue of B + Â (the residuals matrix after embedding), then, assuming ||Â|| > ||B_N|| + ||B_S||, the components of u associated with the signal vertices, denoted u_S, are bounded below as ||u_S||_2^2 ≥ 1 − ε, where

ε = O( [ (||Â|| − ||B_N||)(||B_S|| + ||B_SN||) + ||B_SN||^2 ] / [ (||Â|| − ||B_N||)^2 + ||B_SN||^2 ] ).   (12)

Consider the implication of Theorem 3 for a fixed background, when embedding on a fixed subset of vertices. The theorem states that as the difference between the signal power and the power of the noise among the non-signal vertices (||Â|| − ||B_N||) becomes much larger than the noise power involving subgraph vertices (||B_S|| + ||B_SN||), the principal eigenvector will become concentrated on the foreground vertices. A few aspects of this theorem confirm intuition from other signal processing areas.
First, if there is significant noise activity within the subgraph vertices, then ||Â|| may be significantly smaller than ||A_S||, and ||B_S|| may be relatively large. This means that a signal placed in strong noise will be difficult to detect, which is always the case in detection problems. Also, the bound in the theorem shows that if a relatively strong subgraph is embedded where there is typically very little activity, and where there is relatively little interaction with the remainder of the graph (i.e., small ||B_S|| and ||B_SN||), the subgraph will be much easier to detect. Put in traditional signal processing language, the signal will be much easier to detect when it is less correlated with the noise. Working within this framework, we see the same properties of the interaction between signal and noise that affect detectability in domains like radar and communications.

An empirical example is provided in Fig. 1. In this case, a 4096-vertex Erdős–Rényi graph (see Section VI-A1) is generated, with a 15-vertex subgraph with 90% edge probability embedded. The horizontal axis is 1 − δ_min, where δ_min is the expression in (35) in Appendix B. The bound holds for all cases considered, and the empirical results often are an order of magnitude below the maximum for both the higher and lower edge probabilities (p = 4 × 10^−4 and p = 1 × 10^−6, respectively). Only when a case is considered where there is no background connectivity within the subgraph vertices is the bound approached more closely.

Fig. 1. Empirical comparison to the bound in Theorem 3 (log–log plot of 1 − ||u_S||_2^2 versus 1 − δ_min, with curves for high edge probability, low edge probability, and B_S = 0). The bound holds for each case in this scenario with a 4096-vertex random background and a 15-vertex dense signal subgraph, though it is only tight for cases where B_S = 0.

V. DETECTION ALGORITHMS

For relatively large subgraph anomalies, a simple "energy detector" based on the spectral norm of the residuals matrix will provide good detection performance. It is desirable, however, to be able to detect much smaller subgraphs, which may not stand out in the principal eigenvector. A few techniques have been developed within this framework to detect subtler anomalies [34]–[37], which we outline in this section.

A. Chi-Squared Statistic in Principal Components

The first algorithm is based on the symmetry of the projection of B into its two principal components. This will enable the detection of subgraphs that do not stand out in the first eigenvector. We have empirically observed for several random graph models that, when projecting the residuals into their principal two components, the result is rather radially symmetric. For sparse graphs, the entries in the principal eigenvectors resemble a Laplace distribution, as shown on the left in Fig. 2, which is consistent with behavior observed in sparse Erdős–Rényi graphs. The righthand plot in Fig. 2 demonstrates the symmetry of the residuals in the top two eigenvectors.

Fig. 2. Distributions of vertex components in principal eigenvectors: a histogram of components in the first eigenvector (left), with a comparison to a Laplace distribution, and a scatterplot (right) in the principal 2-dimensional subspace, demonstrating its radial symmetry.

When an anomaly is embedded within the graph, as previously discussed, the subgraph vertices will stand apart from the background. Therefore, we compute a statistic that is based on symmetry in this space to detect the presence of an anomaly. The detection statistic is a chi-squared statistic based on a 2 × 2 contingency table, where the table contains the number of vertices projected into each quadrant of the two-dimensional space. (That is, the number of rows of [u_1, u_2], where u_1 and u_2 are (column) eigenvectors of B, that fall into each quadrant.) This yields a 2 × 2 matrix O = {o_ij} of the observed numbers of points in each section. From the observation, we compute the expected number of points under the assumption of independence, Ō = {ō_ij}, where

ō_ij = (o_i1 + o_i2)(o_1j + o_2j) / N.   (13)

The chi-squared statistic is then calculated as

χ^2([u_1 u_2]) = Σ_i Σ_j (o_ij − ō_ij)^2 / ō_ij,   (14)

and, to favor radial symmetry, we maximize the statistic over rotation in the plane, computing

χ^2_max = max_θ χ^2( [u_1 u_2] R_θ ),   R_θ = [ cos θ  −sin θ ; sin θ  cos θ ].   (15)

The statistic χ^2_max is used to detect an anomalous subgraph.

When the spectral norm is a reliable detection statistic, thresholding along the principal eigenvector is often an effective method to identify the vertices that are exhibiting anomalous behavior. Working in multiple dimensions, while it enables the detection of smaller subgraphs, makes the process of identification more complicated. In this setting, we use a method based on k-means clustering to identify the subgraph vertices. Within the 2-dimensional space, we compute a small number of clusters, and declare the smallest cluster with at least a minimum number of vertices to be the signal subgraph.

B. Eigenvector L1 Norms

It is also desirable to detect signal subgraphs that do not stand out in the principal two components of the residuals matrix, and extending the algorithm of Section V-A to an arbitrary number of dimensions may not be feasible.
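The quadrant-count statistic of Section V-A, (13)–(15), can be sketched as follows. This is an illustrative implementation, not the paper's code; the quadrant-boundary convention and the discrete grid of rotation angles are implementation choices the text does not fix:

```python
import numpy as np

def chi2_statistic(U2):
    """Chi-squared test of quadrant symmetry for an N x 2 projection (13)-(14)."""
    x, y = U2[:, 0], U2[:, 1]
    N = U2.shape[0]
    O = np.array([[np.sum((x >= 0) & (y >= 0)), np.sum((x < 0) & (y >= 0))],
                  [np.sum((x >= 0) & (y < 0)),  np.sum((x < 0) & (y < 0))]],
                 dtype=float)
    row = O.sum(axis=1, keepdims=True)
    col = O.sum(axis=0, keepdims=True)
    E = row * col / N                    # expected counts under independence
    return float(np.sum((O - E) ** 2 / np.where(E > 0, E, 1.0)))

def chi2_max(U2, n_angles=90):
    """Maximize the statistic over in-plane rotations, as in (15)."""
    best = 0.0
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])
        best = max(best, chi2_statistic(U2 @ R))
    return best
```

Rotating before counting matters: a cloud of points concentrated along a 45-degree diagonal yields a large statistic directly, while the same cloud aligned with an axis is invisible to the raw quadrant counts until a rotation exposes it.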
One method to detect such anomalies relies on the subgraph being separable in the space of a single eigenvector. As mentioned previously, the entries in the eigenvectors of the background alone resemble numbers drawn from a Laplace distribution. Thus, if a subgraph stands out in a single eigenvector, that eigenvector will have a substantially smaller L1 norm

Fig. 3. An example of using eigenvector L1 norms for subgraph detection. When a small, dense subgraph is embedded into a background with a skewed degree distribution, the L1 norm of one of the eigenvectors of the residuals matrix becomes much smaller than usual, as shown on the left. Under the null hypothesis, the largest negative deviation from the mean will resemble a Gumbel distribution, plotted on the right.

than for the background alone. The L1 norm of a vector x, ‖x‖_1 = \sum_i |x_i|, is much smaller when the vector is concentrated on a small subset of entries, provided that it is unit-normalized in an L2 sense. For this reason, the L1 norm serves as a proxy for sparsity in applications such as compressed sensing [38].

The following algorithm enables detection when an eigenvector is concentrated on the vertices of the subgraph. This will occur when, for example, a dense subgraph is embedded on relatively low-degree vertices, as discussed in Appendix C. We compute the eigenvectors corresponding to the m largest eigenvalues. By measuring cases with no embedding present, we obtain the mean μ_i and standard deviation σ_i for the L1 norm of the ith eigenvector. For each of the eigenvectors u_i, 1 ≤ i ≤ m, we subtract the mean and normalize by the standard deviation.
The smallest (i.e., most negative) value is then used as a test statistic, since we are interested in cases where the norm is small. The test statistic is given by

    -\min_{1 \le i \le m} \frac{\|u_i\|_1 - \mu_i}{\sigma_i}.    (16)

An example demonstrating this method is provided in Fig. 3. The example uses a 4096-vertex graph with a skewed degree distribution (using the CL model described in Section VI-A2), with a 15-vertex subgraph of average degree 10.5 randomly embedded into the background. The analysis is run on the 100 eigenvectors associated with the largest positive eigenvalues. While the L1 norms of most eigenvectors fall within 3 standard deviations of the mean for their index, the L1 norm of the 6th eigenvector is over 10 standard deviations below the mean, which is extremely unlikely to occur under the null hypothesis. Under the null hypothesis, the test statistic (16) will resemble a Gumbel distribution (commonly used to model extreme values), as shown in the plot on the right. When an embedding creates a deviation as large as that in the lefthand plot, the statistic takes on a value much larger than the maximum under normal circumstances.

The occurrence of tightly connected subgraphs highly aligned with eigenvectors was documented independently in [39], and a similar anomaly detection method using eigenvector kurtosis appears in [40]. Here, we use this phenomenon to find subgraphs whose internal connectivity is much larger than the expectation, given the background model. When an anomaly is detected according to (16), the corresponding eigenvector is thresholded to determine the subgraph vertices.

C. Sparse Principal Component Analysis

While analysis of eigenvector L1 norms enables the detection of some subgraphs that do not separate in the principal components of the residuals space, this technique has some shortcomings.
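The L1-norm procedure of Section V-B, culminating in statistic (16), can be sketched as follows. This is a minimal dense-matrix illustration, not the paper's implementation: the null-model means and standard deviations are estimated by Monte Carlo over background-only graphs, and all function names are our own.

```python
import numpy as np

def modularity_matrix(A):
    """Residuals matrix B = A - k k^T / (2|E|) for an undirected graph
    with adjacency matrix A and degree vector k."""
    k = A.sum(axis=1)
    return A - np.outer(k, k) / k.sum()

def top_eigvec_l1(B, m):
    """L1 norms of the (unit-L2) eigenvectors for the m largest
    eigenvalues of the symmetric matrix B."""
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:m]
    return np.abs(vecs[:, order]).sum(axis=0)

def l1_statistic(B, mu, sigma, m):
    """Test statistic (16): the largest downward deviation, in standard
    deviations, of the eigenvector L1 norms from their null means."""
    return -np.min((top_eigvec_l1(B, m) - mu) / sigma)
```

A dense embedded subgraph concentrates one eigenvector on a few vertices, so its L1 norm drops many standard deviations below the null mean and the statistic spikes.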
In particular, as consecutive eigenvalues get closer together, the directions of the associated eigenvectors become unstable. We therefore cannot rely on the test statistic being sufficiently changed by an eigenvector pointing in the direction of the subgraph. There is, however, a similar technique that enables the detection of small subgraphs with large residuals. Rather than first computing the eigenvectors of the residuals matrix and then finding an eigenvector with a small L1 norm, we can find a vector that is nearly an eigenvector and whose L1 norm is constrained. This technique is known as sparse principal component analysis (sparse PCA) [41]. The method has been used in the statistics literature to find high variance in the space of a limited number of variables. We utilize it here for a similar goal: to find large residuals in the space of a small number of vertices.

The problem is formulated as follows. The goal is to find a vector that is projected substantially onto itself by the residuals matrix, but with few nonzero components. Put formally, the objective is to solve

    \hat{x} = \arg\max_{\|x\|_2 = 1} x^T B x    (17)
    subject to \|x\|_0 \le N_S,

where ‖·‖_0 denotes the L0 quasi-norm (the number of nonzero components in a vector). This, however, is an integer programming problem, and is NP-hard. We therefore use a relaxation with an L1 constraint, recast as a penalized optimization:

    \hat{x} = \arg\max_{\|x\|_2 = 1} x^T B x - \lambda \|x\|_1.    (18)

This problem is still not in an easily solvable form, due to the quadratic equality constraint. We use an additional relaxation, following the method of [41], to achieve a semidefinite program that can be solved using well-documented techniques:

    \hat{X} = \arg\max_{X \in \mathcal{S}_n} \operatorname{tr}(BX) - \lambda \mathbf{1}^T |X| \mathbf{1}    (19)
    subject to \operatorname{tr}(X) = 1,

where tr(·) denotes the matrix trace and \mathcal{S}_n is the set of positive semidefinite matrices in R^{n×n}.
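To illustrate the cardinality-constrained objective (17), a simple truncated power iteration, which hard-thresholds each power-iteration step to its k largest-magnitude entries, serves as a cheap stand-in. This is emphatically not the DSPCA semidefinite relaxation (19) used in the paper; it is only a sketch of what a sparse near-eigenvector looks like, and the uniform starting vector and function name are our own choices.

```python
import numpy as np

def truncated_power_iteration(B, k, n_iter=100):
    """Approximately maximize x^T B x over unit vectors with at most
    k nonzeros, by hard-thresholding a power iteration each step.
    (A stand-in for the SDP (19), not the DSPCA method of [41].)"""
    n = B.shape[0]
    # Uniform start for determinism; random restarts would be more robust.
    x = np.full(n, 1.0 / np.sqrt(n))
    for _ in range(n_iter):
        y = B @ x
        # Keep only the k largest-magnitude coordinates.
        keep = np.argsort(np.abs(y))[-k:]
        z = np.zeros_like(y)
        z[keep] = y[keep]
        nz = np.linalg.norm(z)
        if nz == 0:
            break
        x = z / nz
    return x
```

For a residuals matrix with a small, dense block, the returned vector concentrates on the block, and its L1 norm (the detection statistic of Section V-C) is roughly √k, far below the diffuse value near √n.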
The principal eigenvector of \hat{X}, denoted \hat{x}, is then returned (and should be sparse, given the constraints). The subgraph detection statistic is ‖\hat{x}‖_1. If no small subgraph has sufficiently large residuals, the vector should be relatively diffuse, and have a relatively large L1 norm. For vertex identification, the sparse principal component is thresholded, and the vertices corresponding to components of the vector greater than the threshold are declared to be part of the anomalous subgraph.

One drawback of this technique is its computational complexity. As mentioned in the introduction, one goal of this work is to develop techniques that scale to very large graphs. The algorithms described in Sections V-A and V-B rely on a partial eigendecomposition. Using the Lanczos method for computing m eigenvectors and eigenvalues of a matrix, and leveraging sparseness of the graphs, this requires a running time of O((Mm + Nm² + m³)h), where h is the number of restarts in the algorithm [42]. Thus, if the number of eigenvectors to compute is fixed, this algorithm scales linearly in the number of edges in its per-restart running time. Sparse PCA, as described in [41], has a running time of O(N⁴√(log N)/ε), where ε controls the accuracy of the solution. This implies that sparse PCA will not scale to extremely large datasets without additional optimization, which is a problem for future work. We present results using this technique to demonstrate the feasibility of detecting exceptionally small anomalies using the framework outlined in this paper.

VI. SIMULATION RESULTS

A. Noise Models

There are many models for random graphs, with varying degrees of complexity. In this section we outline three random models that will be used for background noise in our experiments.

1) Erdős–Rényi (ER) Random Graphs: The simplest random graph model was proposed by Erdős and Rényi in [43].
In this model, given a vertex set V and a number p ∈ (0, 1), an edge occurs between any two vertices in V with probability p. In matrix form, p_ij = p for all i and j. This model is subsumed by the model for a random graph with a given expected degree sequence assumed by equation (9), where, in this case, all vertices have the same expected degree.

2) Chung–Lu (CL) Random Graphs: The "given expected degree" model has been studied extensively by Chung and Lu [30]. Similarly to the dynamic preferential attachment model of [44], in this model the probability of two nodes sharing a connection increases with their popularity. Formally, each vertex v_i is given an expected degree d_i, and the probability of vertices v_i and v_j sharing an edge is given by p_ij = d_i d_j / \sum_{\ell=1}^{|V|} d_\ell, yielding a rank-1 probability matrix

    P = \frac{1}{\sum_{i=1}^{|V|} d_i} dd^T.    (20)

Using the observed degree as the expected degree (shown to be an approximately asymptotically unbiased estimator in [45]), the standard formulation of the modularity matrix (10) perfectly fits this model for background behavior.

3) R-MAT Stochastic Kronecker Graphs: To include a slightly more complicated model, we also consider the Recursive Matrix (R-MAT) stochastic Kronecker graph [46]. In this model, a base probability matrix

    P_b = \begin{bmatrix} a & b \\ c & d \end{bmatrix}    (21)

is given, where a, b, c and d are nonnegative values that sum to 1, and edge probabilities are defined by the n-fold Kronecker product of P_b, denoted \hat{P} = \{\hat{p}_{ij}\} = \bigotimes_{i=1}^n P_b. This results in matrices with 2^n vertices.

Fig. 4. Sparsity patterns for background graphs: an R-MAT graph (left), a Chung–Lu graph (center) and an Erdős–Rényi graph (right).

The graph is generated by an iterative method where one edge is added at each iteration with probabilities defined by \hat{P}. If the total number of iterations is t, the edge probabilities are given by

    p_{ij} = 1 - (1 - \hat{p}_{ij})^t.    (22)
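The probability-matrix constructions (20)–(22) can be sketched directly. The function names below are our own illustrative choices; the sketch assumes the dense matrices fit in memory, which is only reasonable for small n.

```python
import numpy as np

def rmat_edge_probs(n, t, a=0.5, b=0.125, c=0.125, d=0.25):
    """Edge probabilities (22) for an R-MAT graph on 2^n vertices:
    the n-fold Kronecker power of the base matrix (21), converted
    from per-iteration placement probabilities over t iterations."""
    Pb = np.array([[a, b], [c, d]])
    P_hat = np.array([[1.0]])
    for _ in range(n):
        P_hat = np.kron(P_hat, Pb)
    return 1.0 - (1.0 - P_hat) ** t

def chung_lu_probs(deg):
    """Rank-1 Chung-Lu probability matrix (20) from an expected
    degree sequence deg (d_i d_j / sum_l d_l)."""
    deg = np.asarray(deg, dtype=float)
    return np.outer(deg, deg) / deg.sum()
```

As in Section VI-C, a matched CL background can be built by feeding the row sums of the R-MAT probability matrix back in as the expected degree sequence; by construction, the CL matrix then has the same total expected volume.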
If the base probability matrix has rank 1, this generator produces graphs with a structure similar to the CL model. As shown in [46], however, this model creates graphs with mild community structure, thereby presenting a more challenging noisy background for our subgraph detection framework.

These three models represent varying degrees of complexity for the detection framework. The ER model is overspecified by the given expected degree model used in the modularity matrix, the CL model matches the formula exactly, and the R-MAT model is mismatched due to its mild community structure. In the simulations in Section VI-C, the R-MAT graphs are generated using a base probability matrix with a = 0.5, b = c = 0.125, and d = 0.25, and the algorithm is run for 12N iterations, resulting in an average degree of approximately 12. The graph is unweighted, and made undirected via the "clip-and-flip" procedure of [46], i.e., the edges below the main diagonal in the adjacency matrix are removed, and those above the main diagonal are made undirected. For CL backgrounds, the expected degree sequence is defined by the edge probabilities of the R-MAT background, i.e., d_i = \sum_{j=1}^{|V|} p_ij, where p_ij is defined as in (22). The ER backgrounds use an edge probability that yields the same average degree as the more complicated models. Example sparsity patterns of the adjacency matrices, each with 1024 vertices, are shown in Fig. 4. Note the moderate community structure in the R-MAT graph. While the CL graph has vertices of varying degree, it does not have the same structure as the R-MAT graph. One particularly visible difference is the lack of connections between low-degree and high-degree vertices in the R-MAT graph, seen in the upper-right and lower-left corners of the matrix. Both of these graphs contain more variation than the ER graph, whose uniform randomness can be seen in its sparsity pattern.

B.
Signal Subgraph

Two random graph models are used for the anomalous signal subgraph. In one case, an ER graph with probability parameter p_S is generated and combined with randomly selected vertices from the background. Here, the expected adjacency matrix is an N_S × N_S matrix in which every entry is p_S, and thus has spectral norm p_S N_S. The second subgraph we consider is a random bipartite graph, where the vertex set is split into two subsets and no edge can occur between vertices in the same subset. Letting N_1 and N_2 be the numbers of vertices in each subset, there are N_1 N_2 possible edges between the two vertex subsets, and, as in the ER subgraph case, each of these possible edges is generated with equal probability p_S. For the bipartite subgraph, the expected adjacency matrix has the form

    E[A_S] = \begin{bmatrix} 0_{N_1 \times N_1} & p_S 1_{N_1 \times N_2} \\ p_S 1_{N_2 \times N_1} & 0_{N_2 \times N_2} \end{bmatrix},    (23)

which has spectral norm p_S \sqrt{N_1 N_2}. This subgraph provides a signal whose average degree does not equal its spectral norm (unless N_1 = N_2), demonstrating that the spectral norm is a more appropriate power metric.

C. Monte Carlo Simulations

The results in this section detail the outcomes of several 10,000-trial Monte Carlo simulations. In each simulation, a background graph is generated, and may or may not have a signal subgraph embedded on a subset of its vertices. The subgraph may be a 15-vertex cluster or a bipartite graph with N_1 = 12 and N_2 = 25. Test statistics outlined in Section V are computed on the resulting graph, creating several empirical distributions that can be used to discriminate between H_0 and H_1. Residuals matrices are formed using either the exact expected value³, or a rank-1 approximation based on the observed degrees, as in (9). The expected degree sequence from the R-MAT model is used for CL backgrounds, and ER backgrounds use the same average degree.
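The signal-power claims of the Signal Subgraph discussion above, spectral norm p_S N_S for the ER foreground and p_S √(N_1 N_2) for the bipartite foreground of (23), can be verified numerically. The helper below is our own illustrative construction.

```python
import numpy as np

def expected_bipartite_adjacency(n1, n2, p_s):
    """Expected adjacency matrix (23) of the random bipartite
    foreground: p_s on the off-diagonal blocks, zero elsewhere."""
    A = np.zeros((n1 + n2, n1 + n2))
    A[:n1, n1:] = p_s
    A[n1:, :n1] = p_s
    return A
```

With N_1 = 12, N_2 = 25 (the bipartite foreground used in the Monte Carlo experiments) and average degree matched to an ER foreground, the bipartite matrix has the larger spectral norm whenever N_1 ≠ N_2, which is why the spectral norm, not average degree, is the natural power metric.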
For R-MAT and CL backgrounds, we consider cases where the foreground vertices are selected uniformly at random from all background vertices, and cases where they are randomly selected from the set of vertices with expected degree less than 5. For 4096-vertex graphs, ER backgrounds always achieved near-perfect detection performance. Identification and detection performance for CL and R-MAT backgrounds are summarized in Fig. 5.

A few phenomena in the results confirm our intuition. First, note that CL backgrounds have extremely similar performance whether the expected value term is given or estimated. This is because the observed degree is a good estimate for the expected degree, and the small embedding has a minimal effect on the expected value term, as shown in Appendix D. (The small but noticeable difference when using a bipartite foreground emphasizes the impact of the number of subgraph vertices.) The R-MAT backgrounds have much more substantial performance differences, due to the model mismatch. In fact, when the true expected value is given, performance is better than with the CL background. This is likely due to the lower variance in the noise, caused by smaller connection probabilities among low-degree vertices. Detection performance improves going from the spectral norm statistic to the chi-squared statistic, and improves further when analyzing the eigenvector L1 norms. Also, when the subgraph is embedded only on vertices with expected degree less than 5, performance significantly increases for L1 norm analysis,

³Due to time and memory constraints, a rank-100 approximation for the R-MAT expected value was used instead of the true probability matrix.
Fig. 5. A summary of detection and identification performance. The equal error rate (EER) for each background and foreground is shown as the average degree increases from 6 to 15. Results are shown for cluster subgraphs (solid line) and bipartite subgraphs (dashed line), for R-MAT graphs with the true expected value ( ), R-MAT graphs with an estimated rank-1 expected value (+), CL graphs with given expected degrees (◦) and CL graphs using observed degrees (×). Performance improves as the test statistic goes from the spectral norm (left column), to the chi-squared statistic (center column), to the largest deviation in L1 norm (right column). Detection performance with the L1 norm-based statistic improves when the subgraph is embedded on low-degree vertices (second row), rather than choosing the vertices uniformly at random (first row).
The same performance trends typically hold for the vertex identification algorithms (uniform random embedding in the third row, degree-biased embedding in the fourth row), shown here in terms of precision at a 35% recall rate. The non-monotone behavior using L1 norms is caused by a cluster of larger eigenvalues in the R-MAT background, which, as discussed in Appendix C, makes detection more difficult with this method.

Fig. 6. Histogram of eigenvalues from an R-MAT matrix using an estimated rank-1 expected value. The two clusters of larger eigenvalues are responsible for the non-monotonic behavior in the L1-norm statistic shown in Fig. 5.

while it degrades for the other statistics (since the embedding is likely to be more orthogonal to the principal eigenvectors). Note also that, for the spectral norm and chi-squared statistics, the bipartite embedding is more detectable than the cluster with the same average degree, since the bipartite foreground has a higher spectral norm. This does not hold for the L1 norm statistic, since the cluster embedding, while less powerful, is concentrated on a smaller subset of vertices, making it more detectable using this statistic.

One interesting aspect of the L1 norm technique is its non-monotonic behavior when using the estimated rank-1 expected value. In both detection and identification, performance improves as the subgraphs increase in size up to a certain point, after which performance degrades and then improves again. This is due to clustering of eigenvalues caused by the model mismatch, as shown in Fig. 6. The figure presents a histogram of eigenvalues for the R-MAT graph minus the estimated rank-1 expected value matrix, A - \frac{1}{\mathbf{1}^T k} kk^T. (The vertical axis is the average number of eigenvalues that fell into a given bin over the 10,000 Monte Carlo trials.)
Most of the eigenvalues are below 12, while there is always one that is over 16 and 11 in the cluster spanning approximately 12 to 15. Since, as discussed in Appendix C, having eigenvalues that are close together hinders performance with this method, performance improves when the subgraph can be localized in an eigenvector as its eigenvalue approaches 12, but detection is more difficult around 13. Using the true expected value instead of the rank-1 approximation does not yield this behavior, since there is no model mismatch. The mismatch between R-MAT and the rank-1 expected value also causes the slight degradation in performance using the chi-squared statistic before it rapidly improves. This may be because the embedded subgraph actually improves the symmetry of the projection by balancing out the mismatch, before finally overpowering it.

The identification results on the bottom half of the figure follow similar trends, with one notable exception. Performance is shown in terms of precision at a 35% recall rate (precision is emphasized since the foreground vertex set is much smaller than the background). While the k-means-based identification method (center column, using 3 clusters and a subgraph threshold of 5 vertices) typically improves performance over thresholding of the principal eigenvector (first column) for

Fig. 7. Detection and identification results using sparse PCA.
In both an Erdős–Rényi background (top row) and an R-MAT background (bottom row), sparse PCA significantly outperforms the other algorithms. Similar performance gaps are seen in detection performance (left column) and identification (right column).

cases where precision is relatively low, it actually hinders performance in cases where precision is high. This shows that a subgraph that separates well along the first eigenvector will not necessarily be equally detectable via k-means, possibly due to spreading in the second dimension.

Since sparse PCA has a much greater computational burden, we carried out a more limited set of experiments on smaller graphs. In each trial, a 512-vertex background graph is generated according to either an R-MAT or ER model. The R-MAT graphs use the same probability matrix as in the previous experiment, and the ER graphs have equal expected volume. In each case, we use an estimated rank-1 expected value, and use the DSPCA software package [47] to solve (19). Detection and identification performance are shown in Fig. 7. These results demonstrate the detection of a 7-vertex, 80% dense subgraph in the R-MAT background or a 5-vertex, 85% dense subgraph in an ER background. Sparse PCA yields markedly superior performance to the three methods used in Fig. 5. By using this more costly technique, much smaller, subtler anomalies can be detected, using the same principles as the less expensive algorithms.

VII. RESULTS ON APPLICATION DATA

Two network datasets were downloaded from the Stanford Network Analysis Project (SNAP) large graph dataset collection (available at http://snap.stanford.edu/data). One dataset consists of product co-purchase records on amazon.com, where each of the 548,552 vertices represents a product, and a directed edge from vertex i to vertex j denotes that when product i is purchased, product j is frequently also purchased [48].
The other dataset has 1,696,415 vertices, representing nodes on the Internet, taken from autonomous system traceroutes in 2005 [49]. The edges in this graph are undirected and represent communication links between nodes. In both cases, the 150 eigenvectors corresponding to the largest positive eigenvalues of the residuals matrices were computed, and subgraphs were analyzed that align with eigenvectors with small L1 norms.

Fig. 8. Eigenvector L1 norms in application datasets: an amazon.com product co-purchase network (left) and an autonomous system network (right).

In the amazon.com co-purchase network, edges are directed, and each vertex has at most 5 outward edges. We use the symmetrized modularity matrix introduced in [50] as a residuals matrix. As shown on the left in Fig. 8, many of the eigenvectors have small L1 norms, due to frequent co-purchase of small, relatively isolated sets of products. We consider the 2 smallest L1 norms, corresponding to the 23rd and 135th largest eigenvalues. These eigenvectors are concentrated, respectively, on a 53-vertex subgraph with all possible internal edges (265) and a 44-vertex subgraph with 215 of its 220 possible internal edges. Neither subgraph has any outgoing edges, and both have fewer than 20 incoming edges. To compare this to the graph as a whole, we took 1 million samples of comparable size by performing random walks on the graph. Of all 53-vertex samples, only 609 have average internal degree greater than 4.5, and of those, none has fewer than 20 external edges. Similarly, among the random samples with 44 vertices, 108 have average internal degree greater than 4.4 and fewer than 40 external edges.
Each of these 108 samples, however, is primarily outside of the 150-dimensional space spanned by the computed eigenvectors: an indicator vector for the sample vertices in each case is nearly in the null space of the matrix of eigenvectors. Thus, both of these subgraphs are anomalous with respect to random samples of similar size, when considering portions of the graph that are well represented in the computed subspace.

The eigenvector L1 norms in the autonomous system graph generally follow a trend, getting larger as the eigenvalues get smaller (indices increasing). The two vectors highlighted in the figure (the 10th and 94th) were considered for further investigation, since they have the largest local deviations. The 10th eigenvector is aligned with a 70-vertex subgraph with over 99% of its possible edges, and the 94th eigenvector is aligned with a 28-vertex subgraph with over 81% of its possible edges. These subgraphs consist primarily of high-degree vertices, with average external degrees of about 957 and 577 for the 70- and 28-vertex subgraphs, respectively. We took 1 million random samples from among the subgraph of vertices with degree greater than 500, with sizes commensurate with the number of vertices in the detected subgraph within the high-degree vertex set (68 of 70 and 17 of 28). Among the three 68-vertex samples with density greater than 80%, all share at least 55 vertices with the detected subgraph. Of the 17-vertex samples, 713 are at least 75% dense and have fewer than 16,000 external edges (the detected 17-vertex subset is 93% dense and has about 12,500 external edges). Of these 713 samples, all are significantly aligned with eigenvectors 10 and 18, both of which also have extremely small L1 norms, as shown in the figure. Thus, the only subgraphs among the samples with similar densities and external degrees are detectable through analysis of eigenvector L1 norms.

VIII.
CONCLUSION

In this paper, we present a spectral framework for the uncued detection of small anomalous signals within large, noisy background graphs. The framework is based on analysis of graph residuals in their principal eigenspace. We propose the spectral norm as a power metric, and several algorithms are outlined, with varying degrees of complexity. In simulation, we demonstrate the utility of the algorithms for detection and identification of two foregrounds within three background models, with the more computationally complex methods providing better detection performance. In two real networks, subgraphs detected via one of the algorithms are shown to be anomalous with respect to random samples of the background.

The framework presented in this paper demonstrates the utility of considering the anomalous subgraph detection problem in a signal processing context. There are myriad avenues of investigation from this point. Recent work has focused on extending this framework to time-varying graphs [51], [52] and attributed graphs [53]. Non-spectral statistics have also been of interest, in particular for detecting anomalously sparse (rather than anomalously dense) subgraphs [54], though this complicates the analysis, since embedding the signal involves subtracting edges rather than adding them. Another interesting area is detection using supervised learning based on subgraph features, as in [55]. Performance bounds in spectral detection of cliques and communities have recently been studied [11], [56], as have computational limits of detection [57], [58]. Also, while the presented framework relies on analysis of residuals, considering normalized residuals may improve detection for subgraphs whose edges are extremely unlikely [30], [59].
This analysis, however, may be intractable for more complicated graph models, since it requires normalizing each observed vertex pair and may not allow the computational tricks mentioned in Section IV-A. As the detection of anomalous behavior in relational datasets continues to be a problem of interest, the field of signal processing for graphs will continue to pose a rich set of challenges for the research community.

APPENDIX A
PROOF OF THEOREM 2

Under H_0, the hypothesis that the observed graph was generated by an Erdős–Rényi process, the likelihood of the observed graph is given by

    L(G; H_0, p) = p^{|E|} (1-p)^{\binom{N}{2} - |E|}.    (24)

Under the alternative hypothesis, an N_S-vertex subset was selected uniformly at random to serve as the subgraph. Suppose that V_S ⊂ V, |V_S| = N_S, was chosen as the subset. Each pair of vertices within V_S still has probability p of sharing an edge due to background activity. If there is no edge in the background, however, an edge will be added with probability p_S. Thus, the probability of an edge occurring between a given pair of vertices both in V_S is

    \hat{p} = p + (1-p) p_S = p + p_S - p \cdot p_S.    (25)

All other vertex pairs still have probability p of sharing an edge. Therefore, we have

    L(G; H_1, p, V_S, p_S) = \hat{p}^{|E_S|} (1-\hat{p})^{\binom{N_S}{2} - |E_S|} \cdot p^{|E| - |E_S|} (1-p)^{\binom{N}{2} - |E| - \binom{N_S}{2} + |E_S|}.    (26)

Note that \binom{N}{2} - |E| - \binom{N_S}{2} + |E_S| is the number of "non-edges" that are not within the subgraph vertices. Since only one vertex subset is chosen for the signal embedding, the likelihood of G under the alternative hypothesis is

    \sum_{V_S \subset V, |V_S| = N_S} L(G; H_1, p, V_S, p_S) \Pr[V_S \text{ is chosen}].    (27)
Each of the \binom{N}{N_S} possible subsets is equally likely, so the likelihood ratio is

    \binom{N}{N_S}^{-1} \frac{\sum_{V_S \subset V, |V_S| = N_S} L(G; H_1, p, V_S, p_S)}{L(G; H_0, p)}    (28)

or, equivalently,

    \binom{N}{N_S}^{-1} \sum_{V_S \subset V, |V_S| = N_S} \frac{L(G; H_1, p, V_S, p_S)}{L(G; H_0, p)}.    (29)

The ratio in (29) can be further simplified as

    \frac{L(G; H_1, p, V_S, p_S)}{L(G; H_0, p)} = \left( \frac{1-\hat{p}}{1-p} \right)^{\binom{N_S}{2}} \left( \frac{\hat{p}(1-p)}{p(1-\hat{p})} \right)^{|E_S|}.    (30)

Replacing the ratio in (29) with the expression in (30), and moving the non-subgraph-dependent portion outside of the summation, yields the expression in (5). This completes the proof.

APPENDIX B
PROOF OF THEOREM 3

Let u_1 be the (unit-normalized) principal eigenvector of \hat{A}. Since u is the eigenvector corresponding to the largest eigenvalue of B + \hat{A}, we have

    u_1^T (B + \hat{A}) u_1 = \|\hat{A}\| + u_1^T B u_1 \le u^T (B + \hat{A}) u.    (31)

Since u_1 only has nonzero entries in rows corresponding to subgraph vertices, we can bound this quantity below by \|\hat{A}\| - \|B_S\|. The vector u can be decomposed as u = u_S + u_B, where the only nonzero components of u_S correspond to the signal subgraph vertices and u_B may only be nonzero in the rows corresponding to V \ V_S. Let δ = \|u_S\|_2^2. Since u has unit L2 norm, and u_S and u_B are orthogonal, we have 0 ≤ δ ≤ 1 and \|u_B\|_2^2 = 1 - δ. The largest eigenvalue of the residuals matrix is then given by

    u^T (\hat{A} + B) u = u_S^T \hat{A} u_S + 2 u_S^T \hat{A} u_B + u_B^T \hat{A} u_B + u_S^T B u_S + 2 u_S^T B u_B + u_B^T B u_B.    (32)

Both terms that include \hat{A} u_B are zero, since \hat{A} is only nonzero within the subgraph vertices. To get an upper bound for this quantity, we bound each term in (32), yielding

    u^T (\hat{A} + B) u \le \delta \|\hat{A}\| + \delta \|B_S\| + 2\sqrt{\delta(1-\delta)} \|B_{SN}\| + (1-\delta) \|B_N\|.    (33)

For convenience, let α = \|\hat{A}\| + \|B_S\| - \|B_N\|, β = 2\|B_{SN}\|, and γ = \|B_N\| + \|B_S\| - \|\hat{A}\|.
Combining the upper bound in (33) with the lower bound yields

\alpha \delta + \beta \sqrt{\delta (1 - \delta)} + \gamma \ge 0.  (34)

We can verify that, for \beta \ge 0 and -\alpha \le \gamma < 0, the expression in (34) achieves equality at the lesser of the two roots of the parabola obtained by squaring the expression. Therefore, (34) holds whenever

\delta \ge \frac{\beta^2 - 2 \alpha \gamma - \sqrt{\beta^4 - 4 \beta^2 \gamma (\alpha + \gamma)}}{2 (\alpha^2 + \beta^2)}.  (35)

Using the triangle inequality to remove the radical in (35) and substituting the matrix norms back into the equation yields the bound in (12). This completes the proof.

APPENDIX C
CONCENTRATION OF EIGENVECTORS ON SUBGRAPH VERTICES

Here we provide an example of an embedding on which a single eigenvector will be concentrated. Consider a subgraph \hat{A} that is regular, i.e., in which each vertex has the same degree d_S. Such a subgraph will have spectral norm \|\hat{A}\| = d_S, and the principal eigenvector will be a vector in which all components are the same. Let x be a unit-normalized indicator vector for the subgraph, i.e., a vector whose ith component is 1/\sqrt{N_S} if i corresponds to a subgraph vertex and 0 otherwise. Further consider x^T (B + \hat{A}) x and x^T (B + \hat{A})^2 x. We have

x^T (B + \hat{A}) x = d_S + x^T B x = d_S + X,  (36)

where

X = \frac{1}{N_S} \sum_{i, j \in V_S} (a_{ij} - p_{ij})  (37)

is a random variable whose mean is 0 and variance is less than \frac{2}{N_S^2} E[|E \cap (V_S \times V_S)|], that is, the expected fraction of possible edges between the subgraph vertices that exist in the background. If the embedding occurs on vertices where the expected connectivity is low, then X will likely be very small. We also have

x^T (B + \hat{A})^2 x = d_S^2 + 2 d_S X + x^T B^2 x = d_S^2 + 2 d_S X + Y.  (38)
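The two claims about a regular subgraph — that \|\hat{A}\| = d_S and that the principal eigenvector is constant — can be verified on a small example. The sketch below uses the 6-vertex cycle (which is 2-regular) as an illustrative choice; any regular connected subgraph would behave the same way.

```python
import numpy as np

# A d_S-regular subgraph: the 6-vertex cycle (d_S = 2)
N_s, d_s = 6, 2
A_hat = np.zeros((N_s, N_s))
for i in range(N_s):
    j = (i + 1) % N_s
    A_hat[i, j] = A_hat[j, i] = 1.0

vals, vecs = np.linalg.eigh(A_hat)       # eigenvalues in ascending order
print(np.isclose(vals[-1], d_s))         # spectral norm equals d_S
v = vecs[:, -1]
print(np.allclose(v, v[0]))              # principal eigenvector is constant
```

The Perron eigenvalue of a connected d_S-regular graph is d_S with the all-ones direction as its eigenvector, which is exactly what the check confirms.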
Note that Y = \|B x\|_2^2 = x^T B^2 x, which can be rewritten as

x^T B^2 x = \frac{1}{N_S} \sum_{i=1}^{N} \left( \sum_{j \in V_S} (a_{ij} - p_{ij}) \right)^2 = \frac{1}{N_S} \sum_{i=1}^{N} \sum_{j, k \in V_S} (a_{ij} a_{ik} - a_{ij} p_{ik} - a_{ik} p_{ij} + p_{ij} p_{ik}).  (39)

For j \ne k, the expectation of the summand is 0. Considering only j = k, we have

E[x^T B^2 x] = \frac{1}{N_S} \sum_{i=1}^{N} \sum_{j \in V_S} p_{ij} (1 - p_{ij})  (40)
             < \frac{1}{N_S} \sum_{i=1}^{N} \sum_{j \in V_S} p_{ij},  (41)

where the upper bound is the average expected degree of the subgraph vertices before the embedding occurs. Again, if the subgraph is embedded on vertices with low expected degree, this quantity is likely to be small.

Let U \Lambda U^T = B + \hat{A} be the eigendecomposition of the residuals matrix, with \lambda_i denoting the ith eigenvalue (\lambda_i \ge \lambda_j for i < j), and let z = U^T x. We have

\left( x^T (B + \hat{A}) x \right)^2 = \left( \sum_{i=1}^{N} \lambda_i z_i^2 \right)^2 = (d_S + X)^2 = d_S^2 + 2 d_S X + X^2  (42)

and

x^T (B + \hat{A})^2 x = \sum_{i=1}^{N} \lambda_i^2 z_i^2 = d_S^2 + 2 d_S X + Y.  (43)

If the quantities in (42) and (43) were the same, then x would be an eigenvector of B + \hat{A}. Since their difference is very small (i.e., assuming Y and X^2 are small, as they are in expectation), x may be highly correlated with a single eigenvector. That is, for some i, z_i^2 may be quite large, so that x concentrates most of its magnitude on the ith eigenvector.

Let \lambda_m be the eigenvalue closest to d_S + X, and let \delta = d_S + X - \lambda_m. Then we have

d_S + X = \lambda_m + \delta = \sum_{i=1}^{m-1} \lambda_i z_i^2 + \lambda_m z_m^2 + \sum_{i=m+1}^{N} \lambda_i z_i^2.  (44)

For i \ne m, let \Delta_i = \lambda_i - \lambda_m. For convenience, define the following substitutions:

a = \sum_{i=1}^{m-1} z_i^2  (45)
b = z_m^2  (46)
c = \sum_{i=m+1}^{N} z_i^2  (47)
\varepsilon_1^+ = \frac{\sum_{i=1}^{m-1} \Delta_i z_i^2}{a}  (48)
\varepsilon_1^- = \frac{\sum_{i=m+1}^{N} \Delta_i z_i^2}{c}.  (49)

Thus, \lambda_m + \varepsilon_1^+ and \lambda_m + \varepsilon_1^- are convex combinations of the eigenvalues greater than \lambda_m and less than \lambda_m, respectively.
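The spectral-coefficient identities (42) and (43) — that \sum_i \lambda_i z_i^2 = d_S + X and \sum_i \lambda_i^2 z_i^2 = d_S^2 + 2 d_S X + Y — hold exactly and can be confirmed on a concrete instance. The sketch below uses an illustrative embedding (an Erdős–Rényi background with a 4-regular circulant foreground; the parameters are assumptions for the example, not from the paper's experiments).

```python
import numpy as np

rng = np.random.default_rng(0)
N, N_s, p, d_s = 200, 10, 0.02, 4

# Erdos-Renyi background and its residuals B = A - E[A]
A = np.triu((rng.random((N, N)) < p).astype(float), 1)
A = A + A.T
B = A - p * (np.ones((N, N)) - np.eye(N))

# A d_S-regular foreground on the first N_s vertices (circulant, steps 1, 2)
A_hat = np.zeros((N, N))
for i in range(N_s):
    for step in (1, 2):
        j = (i + step) % N_s
        A_hat[i, j] = A_hat[j, i] = 1.0

x = np.zeros(N)
x[:N_s] = 1 / np.sqrt(N_s)               # unit indicator vector of V_S

lam, U = np.linalg.eigh(B + A_hat)
z2 = (U.T @ x) ** 2                      # squared coefficients z_i^2
X = x @ B @ x                            # noise term from (37)
Y = np.linalg.norm(B @ x) ** 2           # Y = x^T B^2 x

print(np.isclose(np.sum(lam * z2), d_s + X))                      # (42)
print(np.isclose(np.sum(lam**2 * z2), d_s**2 + 2*d_s*X + Y))      # (43)
```

Both checks succeed for any realization, since they are algebraic identities once \hat{A} x = d_S x holds for the regular foreground.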
We can then express (44) as

d_S + X = a (\lambda_m + \varepsilon_1^+) + b \lambda_m + c (\lambda_m + \varepsilon_1^-).  (50)

Similarly, letting

\varepsilon_2^+ = \frac{\sum_{i=1}^{m-1} \Delta_i^2 z_i^2}{a}  (51)

and

\varepsilon_2^- = \frac{\sum_{i=m+1}^{N} \Delta_i^2 z_i^2}{c},  (52)

(43) can be rewritten as

d_S^2 + 2 X d_S + Y = a (\lambda_m^2 + 2 \varepsilon_1^+ \lambda_m + \varepsilon_2^+) + b \lambda_m^2 + c (\lambda_m^2 + 2 \varepsilon_1^- \lambda_m + \varepsilon_2^-).  (53)

Combining equations (50) and (53) and performing some algebraic manipulation yields the system of equations

\delta = a \varepsilon_1^+ + c \varepsilon_1^-  (54)
\delta^2 + Y - X^2 = a \varepsilon_2^+ + c \varepsilon_2^-  (55)
1 = a + b + c,  (56)

which, solving for b, gives us

b = 1 - \frac{(\delta^2 + Y - X^2)(\varepsilon_1^+ - \varepsilon_1^-) - \delta (\varepsilon_2^+ - \varepsilon_2^-)}{\varepsilon_1^+ \varepsilon_2^- - \varepsilon_1^- \varepsilon_2^+} > 1 - \frac{\delta^2 + Y - X^2}{\min(\varepsilon_2^+, \varepsilon_2^-)} - \frac{|\delta|}{\min(\varepsilon_1^+, -\varepsilon_1^-)}.  (57)

If the eigenvalues around \lambda_m are spread far apart, then \varepsilon_1^+, -\varepsilon_1^-, \varepsilon_2^+ and \varepsilon_2^- will be relatively large, the fractions in (57) will be small, and x will be heavily concentrated on a single eigenvector. This is supported by the empirical results in Section VI, where embedding clusters onto vertices with low expected degree yields separation in a single eigenvector.

APPENDIX D
CHANGE IN MODULARITY DUE TO SUBGRAPH EMBEDDING

When using observed degree to estimate expected degree, the difference in the expected-value terms caused by the signal is as follows. If no embedding occurs, the estimated expected value is \|k\|_1^{-1} k k^T, where k is the observed degree vector resulting from the background noise. If an anomalous subgraph is embedded into the background, the degree vector is changed by \hat{k} = \hat{A} \mathbf{1}. Since \hat{A} consists of only edges within the subgraph that do not appear due to noise, the degree vector after embedding is k + \hat{k}, and the volume is \|k + \hat{k}\|_1 = \|k\|_1 + \|\hat{k}\|_1.
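Returning to the system (54)–(56) from Appendix C: since a \varepsilon_1^+ + c \varepsilon_1^- = \sum_{i \ne m} \Delta_i z_i^2 and a \varepsilon_2^+ + c \varepsilon_2^- = \sum_{i \ne m} \Delta_i^2 z_i^2, both identities can be verified as plain sums over the non-principal coefficients. The sketch below does so on an illustrative instance (an Erdős–Rényi background with a regular circulant foreground; parameters are assumptions for the example).

```python
import numpy as np

rng = np.random.default_rng(1)
N, N_s, p, d_s = 200, 10, 0.02, 4

A = np.triu((rng.random((N, N)) < p).astype(float), 1)
A = A + A.T
B = A - p * (np.ones((N, N)) - np.eye(N))

A_hat = np.zeros((N, N))                 # d_S-regular circulant foreground
for i in range(N_s):
    for step in (1, 2):
        j = (i + step) % N_s
        A_hat[i, j] = A_hat[j, i] = 1.0

x = np.zeros(N)
x[:N_s] = 1 / np.sqrt(N_s)
lam, U = np.linalg.eigh(B + A_hat)
z2 = (U.T @ x) ** 2
X = x @ B @ x
Y = np.linalg.norm(B @ x) ** 2

m = np.argmin(np.abs(lam - (d_s + X)))   # eigenvalue closest to d_S + X
delta = d_s + X - lam[m]
Delta = lam - lam[m]                     # Delta[m] = 0, so the full sums
                                         # below equal the sums over i != m
print(np.isclose(np.sum(Delta * z2), delta))                    # (54)
print(np.isclose(np.sum(Delta**2 * z2), delta**2 + Y - X**2))   # (55)
print(np.isclose(np.sum(z2), 1.0))                              # (56)
```

All three are exact identities of the eigendecomposition, so they hold for any seed; only the magnitude of the concentration b = z_m^2 depends on the realization.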
Thus, the difference between the modularity matrix with estimated expected degrees under H_0 and H_1 is

\Delta K = \frac{k k^T}{\|k\|_1} - \frac{(k + \hat{k})(k + \hat{k})^T}{\|k\|_1 + \|\hat{k}\|_1} = \frac{\|\hat{k}\|_1 k k^T - \|k\|_1 (k \hat{k}^T + \hat{k} k^T + \hat{k} \hat{k}^T)}{\|k\|_1 (\|k\|_1 + \|\hat{k}\|_1)}.  (58)

To bound the strength of \Delta K, we bound the spectral norm of each summand in the numerator of (58) and ignore the \|\hat{k}\|_1 in the denominator, yielding

\|\Delta K\| \le \frac{\|\hat{k}\|_1 \|k\|_2^2 + 2 \|k\|_1 \|k\|_2 \|\hat{k}\|_2 + \|k\|_1 \|\hat{k}\|_2^2}{\|k\|_1^2}.  (59)

To show that, under certain conditions, this quantity grows more slowly than the signal strength, we will show that \|\Delta K\| / \|\hat{A}\| is o(1), i.e., that

\frac{\|\hat{k}\|_1 \|k\|_2^2 + 2 \|k\|_1 \|k\|_2 \|\hat{k}\|_2 + \|k\|_1 \|\hat{k}\|_2^2}{\|k\|_1^2 \|\hat{A}\|} \to 0.  (60)

Since N_S \ll N, we will ignore the \|k\|_1 \|\hat{k}\|_2^2 term, as the other terms will dominate it. Thus, we must bound

\frac{\|\hat{k}\|_1 \|k\|_2^2 + 2 \|k\|_1 \|k\|_2 \|\hat{k}\|_2}{\|k\|_1^2 \|\hat{A}\|} = \frac{2 \|k\|_2}{\|k\|_1} \cdot \frac{\|\hat{k}\|_2}{\|\hat{A}\|} + \left( \frac{\|k\|_2}{\|k\|_1} \right)^2 \frac{\|\hat{k}\|_1}{\|\hat{A}\|}.  (61)

In many applications, the graphs of interest have degree sequences that follow a power law; i.e., the number of vertices with degree i is approximately \alpha i^{-\beta} for constants \alpha, \beta > 0. Using this model, we can analyze the ratio of the L_1 and L_2 norms in graphs with a realistic growth pattern. Let k_{\max} be the largest degree in the graph. Then the squares of the L_1 and L_2 norms of k can be approximated as

\|k\|_1^2 \approx \left( \sum_{i=1}^{k_{\max}} i \cdot \alpha i^{-\beta} \right)^2 = \left( \sum_{i=1}^{k_{\max}} \alpha i^{1-\beta} \right)^2  (62)

and

\|k\|_2^2 \approx \sum_{i=1}^{k_{\max}} i^2 \cdot \alpha i^{-\beta} = \sum_{i=1}^{k_{\max}} \alpha i^{2-\beta},  (63)

respectively.
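Both lines of (58) and the spectral-norm bound (59) can be checked directly. The sketch below builds an illustrative background degree vector and a clique foreground (sizes and the edge probability are assumptions for the example), computes \Delta K from each form of (58), and verifies the bound.

```python
import numpy as np

rng = np.random.default_rng(2)
N, N_s = 100, 8

A = np.triu((rng.random((N, N)) < 0.05).astype(float), 1)
A = A + A.T
k = A.sum(axis=1)                        # background degree vector

A_hat = np.zeros((N, N))                 # clique foreground on first N_s vertices
A_hat[:N_s, :N_s] = 1 - np.eye(N_s)
k_hat = A_hat @ np.ones(N)               # degree change k_hat = A_hat * 1

v1, vh1 = k.sum(), k_hat.sum()           # ||k||_1 and ||k_hat||_1

# Delta K from the first form in (58)...
dK = np.outer(k, k) / v1 - np.outer(k + k_hat, k + k_hat) / (v1 + vh1)
# ...and from the combined-fraction second form
num = vh1 * np.outer(k, k) - v1 * (np.outer(k, k_hat)
                                   + np.outer(k_hat, k)
                                   + np.outer(k_hat, k_hat))
dK2 = num / (v1 * (v1 + vh1))
print(np.allclose(dK, dK2))              # the two forms of (58) agree

# Spectral-norm bound (59)
k2, kh2 = np.linalg.norm(k), np.linalg.norm(k_hat)
bound = (vh1 * k2**2 + 2 * v1 * k2 * kh2 + v1 * kh2**2) / v1**2
print(np.linalg.norm(dK, 2) <= bound)    # the bound holds
```

Note that, unlike the paper's model, this toy foreground may overlap background edges; the agreement of the two forms and the triangle-inequality bound are algebraic facts that hold regardless.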
Their ratio is then approximated, assuming \beta is not exactly 1 or 2, as

\left( \frac{\|k\|_2}{\|k\|_1} \right)^2 \approx \frac{\sum_{i=1}^{k_{\max}} \alpha i^{2-\beta}}{\left( \sum_{i=1}^{k_{\max}} \alpha i^{1-\beta} \right)^2} < \frac{1}{\alpha} \cdot \frac{\int_1^{k_{\max}+1} x^{2-\beta} \, dx}{\left( \int_2^{k_{\max}} x^{1-\beta} \, dx \right)^2}  (64)
= \frac{1}{\alpha} \cdot \frac{\frac{1}{3-\beta} \left[ (k_{\max}+1)^{3-\beta} - 1 \right]}{\left( \frac{1}{2-\beta} \left[ k_{\max}^{2-\beta} - 2^{2-\beta} \right] \right)^2}
= \frac{(2-\beta)^2}{\alpha (3-\beta)} \cdot \frac{(k_{\max}+1)^{3-\beta} - 1}{k_{\max}^{4-2\beta} - 2 (2 k_{\max})^{2-\beta} + 2^{4-2\beta}}.

In practice, \beta is typically greater than 1 and less than 3 (see, e.g., [60]), so the constant (2-\beta)^2 / (3-\beta) will be positive. As k_{\max} increases, the ratio on the right will tend to k_{\max}^{\beta-1}. If we let the maximum degree increase, however, \alpha should be allowed to increase as well, since this controls the number of vertices with a given degree. Assume k_{\max} is a degree that will probably not occur in the graph. Specifically, for a small, constant threshold t, let k_{\max} = \inf \{ i \mid \alpha i^{-\beta} < t \}. Since this means that

\alpha (k_{\max} - 1)^{-\beta} \ge t,  (65)

we have

\frac{1}{\alpha} k_{\max}^{\beta-1} \le \frac{1}{\alpha} \left( \sqrt[\beta]{\alpha/t} + 1 \right)^{\beta-1} = O(\alpha^{-1/\beta}).  (66)

Using the approximation in (64), the ratio of the L_2 and L_1 norms of k is approximately O(\alpha^{-1/(2\beta)}). To bound the term dependent on the subgraph, we have

\frac{\|\hat{k}\|_2}{\|\hat{A}\|} = \frac{\sqrt{\mathbf{1}^T \hat{A}^2 \mathbf{1}}}{\|\hat{A}\|} \le \frac{\sqrt{N_S \|\hat{A}\|^2}}{\|\hat{A}\|} = \sqrt{N_S}.  (67)

This upper bound can be achieved if the subgraph is a clique or a star. Noting that \|\hat{k}\|_1 \le \sqrt{N_S} \|\hat{k}\|_2, we substitute (66) and (67) into (61) to obtain

\frac{\|\Delta K\|}{\|\hat{A}\|} \approx O\left( \sqrt{N_S / \sqrt[\beta]{\alpha}} + N_S / \sqrt[\beta]{\alpha} \right) = O\left( N_S / \sqrt[\beta]{\alpha} \right),  (68)

meaning that \|\Delta K\| is o(\|\hat{A}\|) if N_S is o(\sqrt[\beta]{\alpha}). Using (65) as a lower bound for \alpha, this implies that \|\Delta K\| / \|\hat{A}\| will vanish as the graph grows if N_S grows more slowly than k_{\max}.

ACKNOWLEDGMENT
The authors would like to thank Dr. B. Johnson and the Lincoln Laboratory Technology Office for supporting this work, and R. Bond, Dr. J. Ward, and D. Martinez for their managerial support. We would also like to thank N.
Singh for his early work on the method in Section V-C. Finally, we would like to thank Dr. R. S. Caceres, Dr. R. J. Crouser, Prof. A. O. Hero III, Dr. S. Kelley, Dr. A. Reuther, Dr. M. C. Schmidt, Dr. M. M. Wolf, and the anonymous referees for many helpful comments on this paper.

REFERENCES
[1] D. Bu, Y. Zhao, L. Cai, H. Xue, X. Zhu, H. Lu, J. Zhang, S. Sun, L. Ling, N. Zhang, G. Li, and R. Chen, "Topological structure analysis of the protein–protein interaction network in budding yeast," Nucleic Acids Research, vol. 31, no. 9, pp. 2443–2450, 2003.
[2] T. Idé and H. Kashima, "Eigenspace-based anomaly detection in computer systems," in Proc. ACM Int. Conf. Knowledge Discovery and Data Mining, 2004, pp. 440–449.
[3] M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices," Phys. Rev. E, vol. 74, no. 3, 2006.
[4] K. S. Xu and A. O. Hero III, "Dynamic stochastic blockmodels for time-evolving social networks," IEEE J. Sel. Topics Signal Process., 2014, to appear, preprint available:
[5] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," J. ACM, vol. 46, no. 5, pp. 604–632, September 1999.
[6] K. Chen, C. Huo, Z. Zhou, and H. Lu, "Unsupervised change detection in SAR image using graph cuts," in IEEE Int. Geoscience and Remote Sensing Symp., vol. 3, July 2008, pp. 1162–1165.
[7] A. Sandryhaila and J. M. F. Moura, "Discrete signal processing on graphs," IEEE Trans. Signal Process., vol. 61, pp. 1644–1656, April 2013.
[8] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Processing Mag., vol. 30, pp. 83–98, May 2013.
[9] T. L. Mifflin, C. Boner, G. A. Godfrey, and J.
Skokan, "A random graph model for terrorist transactions," in Proc. IEEE Aerospace Conf., 2004, pp. 3258–3264.
[10] N. Alon, M. Krivelevich, and B. Sudakov, "Finding a large hidden clique in a random graph," in Proc. ACM-SIAM Symp. Discrete Algorithms, 1998, pp. 594–598.
[11] R. R. Nadakuditi, "On hard limits of eigen-analysis based planted clique detection," in Proc. IEEE Statistical Signal Process. Workshop, 2012, pp. 129–132.
[12] E. Arias-Castro and N. Verzelen, "Community detection in random networks," 2013, preprint: arXiv:1302.7099.
[13] N. Verzelen and E. Arias-Castro, "Community detection in sparse random networks," 2013, preprint:
[14] W. Eberle and L. Holder, "Anomaly detection in data represented as graphs," Intelligent Data Analysis, vol. 11, no. 6, pp. 663–689, December 2007.
[15] D. B. Skillicorn, "Detecting anomalies in graphs," in Proc. IEEE Intelligence and Security Informatics, 2007, pp. 209–216.
[16] S. T. Smith, S. Philips, and E. K. Kao, "Harmonic space-time threat propagation for graph detection," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process., 2012, pp. 3933–3936.
[17] S. T. Smith, K. D. Senne, S. Philips, E. K. Kao, G. Bernstein, and S. Philips, "Bayesian discovery of threat networks," IEEE Trans. Signal Process., 2014, to be published.
[18] G. A. Coppersmith and C. E. Priebe, "Vertex nomination via content and context," 2012, preprint: arXiv:1201.4118v1.
[19] S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, pp. 75–174, February 2010.
[20] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Phys. Rev. E, vol. 69, no. 2, 2004.
[21] C. E. Priebe, J. M. Conroy, D. J. Marchette, and Y. Park, "Scan statistics on Enron graphs," Computational & Mathematical Organization Theory, vol. 11, no. 3, pp. 229–247, 2005.
[22] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. MIT Press, 1990.
[23] J. Ruan and W. Zhang, "An efficient spectral algorithm for network community discovery and its applications to biological and social networks," in Proc. IEEE Int. Conf. Data Mining, 2007, pp. 643–648.
[24] S. White and P. Smyth, "A spectral clustering approach to finding communities in graphs," in Proc. SIAM Int. Conf. Data Mining, 2005.
[25] D. Fasino and F. Tudisco, "An algebraic analysis of the graph modularity," 2013, preprint: arXiv:1310.3031.
[26] Q. Ding and E. D. Kolaczyk, "A compressed PCA subspace method for anomaly detection in high-dimensional data," IEEE Trans. Inf. Theory, vol. 59, November 2013.
[27] S. Hirose, K. Yamanishi, T. Nakata, and R. Fujimaki, "Network anomaly detection based on eigen equation compression," in Proc. ACM Int. Conf. Knowledge Discovery and Data Mining, 2009, pp. 1185–1193.
[28] F. R. K. Chung, Spectral Graph Theory. American Mathematical Society, 1997.
[29] S. J. Young and E. R. Scheinerman, "Random dot product graph models for social networks," in Algorithms and Models for the Web-Graph, ser. LNCS, A. Bonato and F. R. K. Chung, Eds. Springer, 2007, vol. 4863, pp. 138–149.
[30] F. Chung, L. Lu, and V. Vu, "The spectra of random graphs with given expected degrees," Proc. Nat. Acad. Sci. USA, vol. 100, no. 11, pp. 6313–6318, 2003.
[31] B. A. Miller, N. Arcolano, M. S. Beard, J. Kepner, M. C. Schmidt, N. T. Bliss, and P. J. Wolfe, "A scalable signal processing architecture for massive graph analysis," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process., 2012, pp. 5329–5332.
[32] P. O. Perry and P. J. Wolfe, "Null models for network data," 2012, preprint: arXiv:1201.5871v1.
[33] D. S. Choi, P. J. Wolfe, and E. M. Airoldi, "Stochastic blockmodels with growing number of classes," Biometrika, vol. 99, no. 2, pp. 273–284, 2012.
[34] B. A. Miller, N. T. Bliss, and P. J. Wolfe, "Toward signal processing theory for graphs and non-Euclidean data," in Proc.
IEEE Int. Conf. Acoust., Speech and Signal Process., 2010, pp. 5414–5417.
[35] ——, "Subgraph detection using eigenvector L1 norms," in Advances in Neural Inform. Process. Syst. 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., 2010, pp. 1633–1641.
[36] N. Singh, B. A. Miller, N. T. Bliss, and P. J. Wolfe, "Anomalous subgraph detection via sparse principal component analysis," in Proc. IEEE Statistical Signal Process. Workshop, 2011, pp. 485–488.
[37] B. A. Miller, N. T. Bliss, P. J. Wolfe, and M. S. Beard, "Detection theory for graphs," Lincoln Laboratory J., vol. 20, no. 1, 2013.
[38] D. Donoho, "Compressed sensing," IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[39] B. A. Prakash, A. Sridharan, M. Seshadri, S. Machiraju, and C. Faloutsos, "EigenSpokes: Surprising patterns and scalable community chipping in large graphs," in Advances in Knowledge Discovery and Data Mining, ser. LNCS, M. J. Zaki, J. X. Yu, B. Ravindran, and V. Pudi, Eds. Springer, 2010, vol. 6119, ch. 14, pp. 435–448.
[40] L. Wu, X. Wu, A. Lu, and Z.-H. Zhou, "A spectral approach to detecting subtle anomalies in graphs," J. Intelligent Inform. Syst., vol. 41, no. 2, pp. 313–337, 2013.
[41] A. d'Aspremont, L. E. Ghaoui, M. I. Jordan, and G. R. G. Lanckriet, "A direct formulation for sparse PCA using semidefinite programming," SIAM Review, vol. 49, no. 3, pp. 434–448, 2007.
[42] R. Lehoucq and D. Sorensen, "Implicitly restarted Lanczos method," in Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst, Eds. Philadelphia: SIAM, 2000, ch. 4.5.
[43] P. Erdős and A. Rényi, "On random graphs," Publicationes Mathematicae Debrecen, vol. 6, pp. 290–297, 1959.
[44] A. Barabási and R. Albert, "Emergence of scaling in random networks," Science, vol. 286, no. 5439, pp. 509–512, 1999.
[45] N. Arcolano, K.
Ni, B. A. Miller, N. T. Bliss, and P. J. Wolfe, "Moments of parameter estimates for Chung–Lu random graph models," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process., 2012, pp. 3961–3964.
[46] D. Chakrabarti, Y. Zhan, and C. Faloutsos, "R-MAT: A recursive model for graph mining," in Proc. SIAM Int. Conf. Data Mining, 2004, pp. 442–446.
[47] R. Luss, A. d'Aspremont, and L. E. Ghaoui, "DSPCA: Sparse PCA using semidefinite programming," December 2008, version 0.6. [Online]. Available: http://www.di.ens.fr/~aspremon/DSPCA.html
[48] J. Leskovec, L. A. Adamic, and B. A. Huberman, "The dynamics of viral marketing," ACM Trans. Web, vol. 1, pp. 1–39, May 2007.
[49] J. Leskovec, J. Kleinberg, and C. Faloutsos, "Graphs over time: Densification laws, shrinking diameters and possible explanations," in Proc. Int. Conf. Knowledge Discovery and Data Mining, 2005, pp. 177–187.
[50] E. A. Leicht and M. E. J. Newman, "Community structure in directed networks," Phys. Rev. Lett., vol. 100, pp. 118703-(1–4), Mar 2008.
[51] B. A. Miller, M. S. Beard, and N. T. Bliss, "Matched filtering for subgraph detection in dynamic networks," in Proc. IEEE Statistical Signal Process. Workshop, 2011, pp. 509–512.
[52] B. A. Miller and N. T. Bliss, "Toward matched filter optimization for subgraph detection in dynamic networks," in Proc. IEEE Statistical Signal Process. Workshop, 2012, pp. 113–116.
[53] B. A. Miller, N. Arcolano, and N. T. Bliss, "Efficient anomaly detection in dynamic, attributed graphs," in Proc. IEEE Intelligence and Security Informatics, 2013, pp. 179–184.
[54] B. A. Miller, L. H. Stephens, and N. T. Bliss, "Goodness-of-fit statistics for anomaly detection in Chung–Lu random graphs," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process., 2012, pp. 3265–3268.
[55] S. Pan and X. Zhu, "Graph classification with imbalanced class distributions and noise," in Proc.
Int. Joint Conf. Artificial Intell., 2013, pp. 1586–1592.
[56] R. R. Nadakuditi and M. E. J. Newman, "Graph spectra and the detectability of community structure in networks," Phys. Rev. Lett., vol. 108, no. 18, pp. 188701-1–5, 2012.
[57] Q. Berthet and P. Rigollet, "Complexity theoretic lower bounds for sparse principal component detection," in Conf. Learning Theory, ser. JMLR W&CP, S. Shalev-Shwartz and I. Steinwart, Eds., 2013, vol. 30, pp. 1046–1066.
[58] Y. Chen and J. Xu, "Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices," 2014, preprint:
[59] R. R. Nadakuditi and M. E. J. Newman, "Spectra of random graphs with arbitrary expected degrees," Phys. Rev. E, vol. 87, no. 1, pp. 012803-1–12, 2013.
[60] M. Faloutsos, P. Faloutsos, and C. Faloutsos, "On power-law relationships of the Internet topology," in Proc. SIGCOMM, 1999.