Spectral goodness of fit for network models
We introduce a new statistic, 'spectral goodness of fit' (SGOF) to measure how well a network model explains the structure of an observed network. SGOF provides an absolute measure of fit, analogous to the standard R-squared in linear regression. Add…
Authors: Jesse Shore, Benjamin Lubin
Spectral goodness of fit for network models

Jesse Shore (a, ∗), Benjamin Lubin (a)
(a) Boston University School of Management

Abstract

We introduce a new statistic, 'spectral goodness of fit' (SGOF), to measure how well a network model explains the structure of an observed network. SGOF provides an absolute measure of fit, analogous to the standard $R^2$ in linear regression. Additionally, as it takes advantage of the properties of the spectrum of the graph Laplacian, it is suitable for comparing network models of diverse functional forms, including both fitted statistical models and algorithmic generative models of networks. After introducing, defining, and providing guidance for interpreting SGOF, we illustrate the properties of the statistic with a number of examples and comparisons to existing techniques. We show that such a spectral approach to assessing model fit fills gaps left by earlier methods and can be widely applied.

1. Introduction

Models of network structure play several important roles in contemporary science. Parametric statistical models of network structure and dynamics allow inferences to be made about dependencies among network ties, network position, and nodal and dyadic covariates (Frank and Strauss, 1986; Anderson et al., 1992; Snijders, 2001; Schweinberger and Snijders, 2003; Handcock, 2003; Doreian et al., 2005; Hunter and Handcock, 2006; Steglich et al., 2010). Algorithmic generative models illustrate how complex macroscopic structure can arise from simple and often local rules (Watts and Strogatz, 1998; Vázquez, 2003; Saramäki and Kaski, 2004). Despite the importance and diversity of research within both the model-based inference and generative algorithms categories, one aspect of network model-based research that has been relatively slow to develop is that of assessing goodness of fit, or how well a given model describes the empirical data being modeled.
Moreover, the methods that are commonly used to assess fit within one type of model may be uncommon or unavailable in another, making it difficult to integrate research techniques and results across scholarly communities.

∗ jccs@bu.edu. October 30, 2018.

The purpose of this paper is therefore to define a new measure of goodness of fit that substantially fills the gaps left by current methods. In particular, leveraging the features of the spectrum of the graph Laplacian, we define a new goodness of fit statistic that measures the percent improvement a network model makes over a null model in explaining the structure in the observed data. As such, we provide a goodness of fit measure that can be applied across modeling techniques and which provides an absolute measure of goodness of fit for the model to the observed network data.

1.1. Existing methods

Commonly used existing methods for assessing goodness of fit can be roughly classified into two groups: one based on comparing structural statistics from networks simulated from a fitted model to structural statistics from the observed network (Hunter et al., 2008a; Schweinberger, 2012), and the other based on a model's likelihood function, exemplified by the Akaike Information Criterion (Hunter et al., 2008a).

1.1.1. Structural-statistics comparisons

The most commonly used method of assessing goodness of fit (GOF) is the structural statistics approach, which is implemented in software for estimating Exponential Random Graph Models (ERGMs) as well as dynamic actor-oriented models (also known as 'Siena' models). Although not done in a hypothesis testing framework, important algorithmic models (e.g. Watts and Strogatz, 1998) have also been described in terms of how well the algorithm reproduces the subgraph statistics in observed networks.

In this approach, after fitting a model, it is necessary to generate a large number of simulated networks based on that model.
At that point comparisons can be made between the observed and the simulated networks. The modeler might ask if the observed number of closed triads (or distribution of closed triads over the nodes) could have been drawn from the distribution defined by the simulated networks, or if the observed degree distribution could have been drawn from the distribution of degree distributions in the simulated networks, or any number of other questions of fit between statistics describing the observed and simulated networks. If the structures in the observed network are very unlikely to have been generated by the fitted model, the modeler can reject the hypothesis that the model fits well.

The subgraph-statistical approach has many advantages. By specifying different structural statistics to compare, the approach can be readily adapted to different specific questions of model fit. For example, one researcher may have a theoretical reason to emphasize the length of geodesics, while another may focus on triadic closure. The results of such an analysis are also easy to interpret and lend themselves to graphical representation and inspection (as in Hunter et al. (2008a)).

On the other hand, this method also has limitations. Even if the theoretical focus of a given researcher is on a single structural issue, say, modeling geodesics, the overall fit of the model to the whole network is still important. A model that accurately reproduces the distribution of geodesics but does not reproduce the overall structure of the network is probably inferior to one that captures the geodesic distribution and the overall structure simultaneously. The difficulty in the subgraph-statistical approach is that it is not clear how to measure the overall structure of the network, except in terms of a list of its statistics.
This approach necessarily decomposes the goodness of fit of a whole model into multiple goodness of fit tests on specific features of the model. Theoretically, this is problematic; practically, the validity of the goodness of fit assessment depends heavily on which statistics are specified by the researcher for examination. In a sense, in order to construct a valid goodness of fit test, the researcher is required to know a priori what the important statistics are for a given observed network; this is sometimes a nonsensical requirement, as goodness of fit tests are often undertaken exactly because the researcher does not know whether a given set of statistics (those described by the model parameters) is a good description of a network. The pragmatic solution is to use a commonly accepted set of statistics (Hunter et al. (2008a) provides a good argument for one such set), but the possibility remains that important aspects of structure are not considered in such a goodness of fit test.

Additionally, assessing model fit in terms of subgraph statistics does not provide a means of selecting between two models that are both rejected or both not rejected: it provides neither a relative nor an absolute measure of fit by which such a comparison could be made. Finally, it is difficult to compare published results from different studies when they do not report the same subgraph statistical tests or analysis.

1.1.2. Akaike Information Criterion

Likelihood-based approaches, exemplified by the Akaike Information Criterion (AIC) (available, for example, to users of the ergm package in R (Handcock et al., 2014; Hunter et al., 2008b)), fill some of the gaps left by hypothesis tests on structural statistics. The AIC is a well-known tool for model choice that provides a relative measure of goodness of fit. There are several limitations of the AIC as well.
First, many models do not have a well-defined AIC, including ERGMs that are conditioned on having the exact number of edges present in the observed network, as well as models of networks that were not estimated from a statistical model at all (cases that we consider in more detail below).

Second, the AIC measures goodness of fit of all model parameters to all data, which may not always be what is desired. There are sometimes cases when a researcher wants to know if some model could have generated the observed pattern of ties alone, rather than whether the model could have jointly generated the ties and nodal and dyadic covariates. To briefly cite an example we discuss below, in specifying a model with a homophily parameter, the researcher may want to know how well the model explains the pattern of ties, rather than how well the model describes the homophily. AIC provides information on the latter, but not the former.

Third, as with the structural-statistics approach to which it is related, one cannot know if there are omitted variables that would have improved the fit of the model. While the AIC can compare the relative quality of two models in certain senses, it cannot say whether either model is any good in an absolute sense.

1.2. Spectral Goodness of Fit

Given the tools already available to network modelers, a desirable measure of goodness of fit would have the following properties:

• it would provide an absolute (not relative) measure of goodness of fit
• it would not require the modeler to know the true model or which structural statistics are important in the observed network
• it would allow comparison of a wide range of models, including those without likelihood functions or even without statistical parameterizations

In other words, it would have properties analogous to the $R^2$ used in standard linear regression. Here, we propose such a statistic: spectral goodness of fit (SGOF).
Throughout the rest of this article we make several assumptions. We consider only undirected networks explicitly, although we discuss extensions to directed networks in the final section, below. Additionally, in proposing to assess goodness of fit, we assume that a researcher has data on an observed network and has fit (or otherwise chosen) a model of network structure to that data. We do not make any assumptions about the functional form of that model or even whether the model is parametric at all, but we do assume that the researcher can generate simulated networks based on the fitted model.

1.3. Computer Code

We have made computer code for calculating SGOF and visualizing the results of the analysis available as an R package, spectralGOF (available at http://people.bu.edu/jccs).

2. The spectrum of the graph Laplacian

2.1. Definitions and notation

Networks are frequently represented as square adjacency matrices (which we will denote by $A$), such that if there is a link from node $u$ to node $v$, then $A_{uv} > 0$. For the purposes of this article, we are considering only undirected networks, so $A_{uv} = A_{vu}$, $\forall u \, \forall v$. The Laplacian matrix is a transformation of the adjacency matrix given by $L = D - A$, where $D$ is the 'degree matrix,' containing the row sums of $A$ on its diagonal and zeros elsewhere.

The spectrum of $L$ is the ordered multiset of eigenvalues, $\lambda$, such that $0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$. There is one Laplacian eigenvalue (hereafter, for brevity, 'eigenvalues' and 'spectrum' always refer to the eigenvalues of the Laplacian) equal to zero for every connected component in the network (Brouwer and Haemers, 2011). Therefore, $\lambda_1$ is always 0. The sum of all eigenvalues is equal to the total weight of all edges in the network:

\[
\sum_{i=1}^{n} \lambda_i = \sum_{u=1}^{n} \sum_{v=1}^{n} A_{uv} \tag{1}
\]

2.2. The spectrum of the Laplacian as a representation of network structure

The spectrum is a "graph invariant," meaning that if two networks are isomorphic, then they have the same spectrum. (Isomorphic networks have the same structure: they could be represented by the same adjacency matrix after permuting the rows and columns and disregarding any "labels" or names of the nodes.) The spectrum is also a compact representation of a great deal of structural information, and spectral techniques (sometimes including analysis of both the spectrum and its associated eigenvectors) have thus been used extensively to characterize the structure of complex networks (Pothen et al., 1990; Newman, 2006) and to compare and recognize complex objects in other applications such as facial recognition in computer vision (Turk and Pentland, 1991; Belkin and Niyogi, 2003).

The properties of the Laplacian spectrum have been studied extensively (see Mohar and Alavi, 1991; Brouwer and Haemers, 2011; Chung, 1997, for relatively accessible mathematical overviews) and a full treatment is well beyond the scope of this article. However, to provide context for our definition of the spectral goodness of fit statistic, we do provide some basic intuition for the connection between the spectrum and network structure in the following paragraphs.

As we have already noted, the number of components is reflected in the spectrum by the number of zeros. The magnitude of the smallest non-zero eigenvalue is related to the minimum number of ties (how much total weight) that would have to be cut (that is, removed from the network) to divide the network into two disconnected components and is known as the "algebraic connectivity" of a network (Fiedler, 1973). The magnitudes of the next smallest eigenvalues represent the relative modularity of the next-most macroscopic community structure of a network. Donetti et al.
(2006) illustrate this logic as follows. Imagine a network comprising four totally disconnected components. Its spectrum would contain four eigenvalues equal to zero. If we perturb this network by connecting the components with a small number of ties (Cvetković et al., 1997), such that they are no longer disconnected, then rather than having one eigenvalue equal to zero for each component, we would have one small eigenvalue for each modular cluster (Donetti et al., 2006), one of which would be zero (as there would be one component, and thus one eigenvalue equal to zero). The more weight that was added between the components, the larger the eigenvalues would become.

The sizes of successively larger eigenvalues provide information on successively finer divisions of the network into smaller sub-communities. In general, a common interpretation of the magnitudes of eigenvalues of the Laplacian is one of correspondence to the relative weight removed by a series of minimum cuts of the network (for more detail, see, e.g. Bollobás and Nikiforov, 2004). The largest eigenvalue therefore contains information about the number of ties incident to the single most highly connected node (Schur, 1923; Brouwer and Haemers, 2011).

2.3. Normalizing the spectrum

The shape of the spectrum describes how the total tie strength in a given network is structured relative to other networks with the same total amount of tie strength (density). Given this, in the definition of the spectral goodness of fit (SGOF) statistic below, we normalize all spectra to sum to unity. As equation 1 indicates, the sizes of the eigenvalues are sensitive to the density of the network.

More specifically, given an adjacency matrix, $A$, let us denote by $\hat{A}$ a normalized version of $A$.

\[
\hat{A} = \frac{A}{\sum_{u,v} A_{uv}} \tag{2}
\]

Likewise, as $\lambda$ is the vector of eigenvalues of $A$, let $\hat{\lambda}$ denote the vector of eigenvalues of $\hat{A}$, which can also be calculated by normalizing $\lambda$.
\[
\hat{\lambda} = \frac{\lambda}{\sum_{i} \lambda_i} \tag{3}
\]

An increase in the density of $A$ that does not result in changes to $\hat{A}$ (i.e., multiplying all entries in $A$ by a non-zero scalar constant) also does not change $\hat{\lambda}$. In other words, such a change only alters the size and not the shape of the spectrum. On the other hand, an increase in the density of $A$ that does result in changes to $\hat{A}$ (i.e., adding new ties or increasing the strength of certain ties and not others) both increases the sizes of $\lambda$ and changes its shape: it results in a changed $\hat{\lambda}$ as well.

3. Spectral Goodness of Fit

3.1. Spectral distance

Given the structural information contained in the spectrum, the Euclidean distance between two spectra is frequently used as a measure of the structural similarity of two matrices (Cvetković, 2012). The Euclidean spectral distance (ESD) can be written as $\lVert \hat{\lambda}^{A} - \hat{\lambda}^{B} \rVert$, where the normalized full spectra of graphs $A$ and $B$ are given by $\hat{\lambda}^{A}$ and $\hat{\lambda}^{B}$, and the double bars denote the vector norm.

We wish to apply this notion of distance to our network models, but such models do not themselves have spectra. However, if networks can be simulated from or otherwise generated by the model, spectra for these networks can be calculated. It is the distance between these spectra and the observed spectrum that we will consider. If we have, say, $N_{sim} = 1000$ simulated networks, we can calculate the mean spectral distance between the simulated networks and the observed network, as well as other distributional statistics, such as the 5th and 95th percentiles of the spectral distance between simulations and the observed network.

Formally, after normalizing the spectra as above, let us call the absolute value of the difference between the $i$th observed eigenvalue and the $i$th eigenvalue from the $k$th simulated network an 'error.'

\[
e_i = \left| \hat{\lambda}^{obs}_i - \hat{\lambda}^{sim_k}_i \right| \tag{4}
\]

In this context then, ESD is the square root of the sum of squared errors.
\[
\mathrm{ESD}_{obs,sim_k} = \left\lVert \hat{\lambda}^{obs} - \hat{\lambda}^{sim_k} \right\rVert = \sqrt{\sum_{i} (e_i)^2} \tag{5}
\]

The mean Euclidean spectral distance, $\overline{\mathrm{ESD}}$, is then defined as the arithmetic mean of the ESDs from each of the individual simulated networks.

\[
\overline{\mathrm{ESD}}_{obs,sim} = \frac{1}{N_{sim}} \sum_{k=1}^{N_{sim}} \mathrm{ESD}_{obs,sim_k} \tag{6}
\]

3.2. Definition of null model

For network models we propose that goodness of fit be measured as an improvement in fit relative to a naive null model. It is therefore necessary to calculate the errors under the naive model and the fitted model for some number of simulated networks.

The natural null model for dichotomous networks is the density-only model, also known as the Bernoulli model or Erdős-Rényi model, simulations from which are random networks with the same expected density as the observed network. For the remainder of this article, we adopt the density-only model as a null model, but we note that any other model could be substituted in its place.

One situation where the Erdős-Rényi model would not be appropriate as a null model is the case where the measurement of the observed network was by means of a survey instrument that specified the number of alters each respondent was to nominate ('name five people you discuss important matters with'). In this case a degree-regular random graph (one in which each node has the same degree) would be the appropriate null model. Likewise, if the observed data is weighted, the null model should also be weighted. In general, the null model should be the maximum entropy model generating networks in the same class as the observed data.

3.3. Definition of SGOF

To calculate the Spectral Goodness of Fit (SGOF), we simply divide the mean Euclidean spectral distance under the fitted model by the mean Euclidean spectral distance under the null model, and subtract the result from one.
\[
\mathrm{SGOF} = 1 - \frac{\overline{\mathrm{ESD}}_{obs,fitted}}{\overline{\mathrm{ESD}}_{obs,null}} \tag{7}
\]

Additionally, given that models of networks imply a probability distribution of networks generated from the model, it is advisable to report SGOF calculated using the 5th and 95th percentile results for ESD under the fitted model. Below, we report these in parentheses after the SGOF calculated using the mean as in equation 7. This confidence interval provides an indication of the dispersion of goodness of fit inherent in a fitted model.

Although the mean SGOF of the null model is defined to be zero, it is advisable to report the 5th and 95th percentile results for the null model as well. The reason for this is that the width of this 90% confidence interval provides useful information for interpreting the SGOF of fitted models. If an observed network is not highly structured, the 90% confidence interval for the null model's SGOF will be very wide, extending, say, from −0.5 to 0.5, reflecting the fact that the observed network is not far from random. For observed networks with a great deal of structure, the 90% confidence interval for the null model's SGOF will be narrow, extending for example only from −0.001 to 0.001.

3.4. Interpretation of SGOF

The SGOF measures the amount of observed structure explained by a fitted model, expressed as a percent improvement over a null model, where structure means deviation from randomness. The observed spectrum will be distant from the spectrum of the null model inasmuch as the observed network has structure that is non-random. The SGOF is thus a summary measure of the percent of the observed structure that is explained by the fitted model.

3.4.1. Bounds for SGOF

Like $R^2$, SGOF is bounded above by one, when the fitted model exactly describes the structural data. Likewise, an SGOF of zero means no improvement over the null model.
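The definitions in equations 4 through 7 can be sketched in Python as follows. This is a minimal illustration using NumPy, not the spectralGOF R package: the function names (`normalized_spectrum`, `esd`, `bernoulli_graph`, `sgof`) and the arguments `n_null` and `seed` are hypothetical, the null model is assumed to be the density-only model, and the simulated networks from the fitted model are assumed to be supplied by the caller as adjacency matrices.

```python
import numpy as np

def normalized_spectrum(A):
    """Laplacian eigenvalues of a symmetric adjacency matrix, scaled to sum to one."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A       # L = D - A
    lam = np.linalg.eigvalsh(L)          # real eigenvalues in ascending order
    total = lam.sum()
    if total <= 0:
        raise ValueError("network has no ties; spectrum cannot be normalized")
    return lam / total

def esd(A_obs, A_sim):
    """Euclidean spectral distance between two networks (equation 5)."""
    return np.linalg.norm(normalized_spectrum(A_obs) - normalized_spectrum(A_sim))

def bernoulli_graph(n, density, rng):
    """One undirected draw from the density-only null model (assumed null)."""
    upper = np.triu(rng.random((n, n)) < density, k=1)
    return (upper | upper.T).astype(float)

def sgof(A_obs, fitted_sims, n_null=200, seed=0):
    """Return SGOF from mean ESD plus values from the 95th and 5th ESD percentiles."""
    rng = np.random.default_rng(seed)
    A_obs = np.asarray(A_obs, dtype=float)
    n = A_obs.shape[0]
    density = A_obs.sum() / (n * (n - 1))            # ties per ordered pair
    null_esd = np.array([esd(A_obs, bernoulli_graph(n, density, rng))
                         for _ in range(n_null)])
    fit_esd = np.array([esd(A_obs, S) for S in fitted_sims])
    mean_null = null_esd.mean()
    point = 1.0 - fit_esd.mean() / mean_null          # equation 7
    # A large distance means a poor fit, so the 95th ESD percentile
    # gives the lower SGOF value and the 5th gives the upper one.
    lower = 1.0 - np.percentile(fit_esd, 95) / mean_null
    upper = 1.0 - np.percentile(fit_esd, 5) / mean_null
    return point, lower, upper

# Usage: a 20-node star graph, with relabeled copies standing in for
# simulations from a (hypothetical) perfectly fitting model.
n = 20
star = np.zeros((n, n))
star[0, 1:] = 1
star[1:, 0] = 1
order = np.random.default_rng(1).permutation(n)
print(sgof(star, [star[np.ix_(order, order)]], n_null=50))
```

Because relabeling the nodes leaves the spectrum unchanged, the simulated spectra in this usage example coincide with the observed one up to floating-point error, which is the situation in which SGOF reaches its upper bound.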
Finally, as with $R^2$, SGOF can be unboundedly negative (footnote 3) if the spectrum of the fitted model is more distant from the observed spectrum than is the spectrum of the null model. If the SGOF is negative, it is therefore evidence that the null model (an Erdős-Rényi random graph) is a better approximation of the observed network than the fitted model under consideration. This is likely to occur in cases where the observed network is not highly structured (and thus very similar to the null model), and the fitted model is (incorrectly) highly structured. If the observed network is not structured, then while $\overline{\mathrm{ESD}}_{obs,fitted} > 0$, $\overline{\mathrm{ESD}}_{obs,null} \to 0$ and by equation 7, $\mathrm{SGOF} \to -\infty$. For ordinary cases involving an observed network that contains structure to be explained and sensible model specifications, however, SGOF will fall between zero and one.

4. Applications and comparisons to existing methods

In this section, we illustrate the spectral goodness of fit method with several examples chosen to highlight its strengths and weaknesses with respect to existing methods.

4.1. Comparison with structural statistics: E. coli

It is frequently the case that a researcher does not ever discover the 'true' model underlying the formation of an observed network, but rather is only able to approximate the truth with several theoretically plausible candidate models. In such cases it is useful to have quantitative evidence about model goodness of fit to help adjudicate the decision. Structural statistical tests can sometimes play this role, but as mentioned above, it may also be the case that all models under consideration are rejected (or supported) by the test, and more information is therefore needed. This example considers such a situation by comparing two specifications of a model of the degree distribution of the E. coli genetic regulatory network (Shen-Orr et al., 2002), both in the ERGM framework.
Footnote 3: In normal practice, however, the fitted model for $R^2$ is an ordinary least squares linear regression with a free intercept parameter; in this typical case, $R^2$ is bounded below by zero.

Table 1: Comparison of Spectral Goodness of Fit to structural hypothesis testing for the E. coli genetic regulatory network. [Observed and simulated network visualizations not reproduced.]

Model                                                SGOF (5th, 95th percentile)   Struc. h-test
---------------------------------------------------  ----------------------------  -------------
Null model                                           0 (-0.02, 0.025)              reject
Geom. weighted degree (curved exponential family)    0.242 (0.167, 0.33)           reject
Geom. weighted degree                                -0.014 (-0.033, 0.007)        reject

Using the ergm package in R, after fitting the models, we assessed their goodness of fit in the manner described by Hunter et al. (2008a), using the gof function with its default settings. This goodness of fit routine assesses the probability that the distributions of degree, transitive closure and mean geodesic lengths over the nodes in the observed network could have been generated by the fitted model. Results from the gof analysis show that both of the proposed model specifications produce distributions of structural statistics that diverge from the observed values. Accordingly, the p-values for the goodness of fit diagnostics (not shown) indicate rejection of the models. Table 1 indicates this and gives values for the SGOF for these models, along with small network visualizations for reference.

Although all the models are rejected by structural hypothesis tests, there are marked differences in how well these models fit. Specifically, the "curved exponential family" version of the model (for more detail, see Hunter and Handcock, 2006) provides a much better fit to the data than the other model without the curved exponential family specification. In fact, at -0.014, the SGOF of the latter model indicates that it is no better than the null model as an overall description of the structure of the observed data.
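As a worked reading of these values, and assuming only the definition of SGOF in equation 7, each reported SGOF can be converted back into a ratio of mean spectral distances:

```latex
\[
\frac{\overline{\mathrm{ESD}}_{obs,fitted}}{\overline{\mathrm{ESD}}_{obs,null}} = 1 - \mathrm{SGOF}
\]
\[
\text{curved exponential family: } 1 - 0.242 = 0.758,
\qquad
\text{without it: } 1 - (-0.014) = 1.014
\]
```

That is, spectra simulated from the curved specification lie on average about 24% closer to the observed spectrum than null-model spectra do, while spectra simulated from the other specification lie about 1.4% farther away.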
The simple lesson here is that goodness of fit based on structural statistics cannot quantitatively distinguish between similar models when all of the models are either accepted or rejected. Visual inspection of the graphical output can often help in this regard, but it is not hard to come up with examples where it cannot. In these cases it would be good to have an absolute or relative measure of fit to provide a means of model choice. The AIC is thus a more comparable measure of goodness of fit to the SGOF we propose here, and the following examples make the comparison explicit.

4.2. Comparison with AIC: Star graph

The next example considers a 100-node star graph constructed by hand to serve as an imaginary observed network. In addition to the network ties, there is an observed attribute, indicated by the color of the nodes in the visualization. The attribute values have been measured by our imaginary researcher, but they were not part of the process that generated the network ties. For this example, we compare the SGOF to AIC from fitted models in the ergm package (Table 2).

Table 2: Comparison of Spectral Goodness of Fit to AIC for a star graph. [Observed and simulated network visualizations not reproduced.]

Model                 SGOF (5th, 95th percentile)   AIC
--------------------  ----------------------------  --------
Null model            0 (-0.01, 0.014)              972.59
Red node homophily    0.007 (-0.005, 0.025)         939.83
99-star tendency      1 (1, 1)                      2322.63
2-star tendency       1 (1, 1)                      708.97

After the null model, the next model is one fitted with a term for homophily among red nodes in addition to the density term. The visualization shows that such a model produces a pattern of ties that is very similar to the null model, but a greater proportion of ties among red nodes, similar to the observed network. It is here that one major difference between SGOF and AIC can be seen. The SGOF indicates negligible improvement over the null model because the pattern of ties is only a negligible improvement over the null model. Meanwhile the AIC shows a substantial improvement, from 972.59 to 939.83, because the parameters of the fitted model, including a (spurious by construction) homophily effect, have a higher likelihood than the parameters of the null model, even after accounting for the number of parameters with Akaike's formula. The AIC is sensitive to how well the model's parameters fit the data as a whole, including non-structural data.

The third and fourth models are both ERGMs fit to the data with a $k$-star parameter (tendency toward nodes with degree $k$) in addition to the density parameter, but they differ in how the $k$-star parameter is specified. The first of the two parameterizes the network with a tendency toward 99-stars, while the second of the two parameterizes the network with a tendency toward two-stars. Note that the $k$-stars are induced subgraphs, so although there are no nodes with degree two, there are $\binom{99}{2} = 4851$ two-stars, each centered on the same node, while there is only one 99-star in the observed network.

Both of these models produce simulated networks that are star graphs just like the observed network. Accordingly, the SGOF for both of these models is 1: a perfect fit. According to the AIC, however, the two models are dramatically different: the 99-star model is much worse than the null model, with an AIC of 2322.63, while the 2-star model is clearly the best fit of all, with an AIC of 708.97. Unlike the SGOF, the AIC cannot indicate whether any given fit is good in an absolute sense.

In practice the AIC and the SGOF are complementary in that they provide answers to different modeling questions. A researcher may wish to know how well a model fits in terms of both structural effects and nodal or dyadic covariates, or on the other hand, assess the parsimony of the model. In these cases, the AIC is required.
On the other hand, the researcher may wish to know how well a model that includes both structural effects and nodal and dyadic covariates explains the observed structure, or assess the absolute goodness of fit of a model of structure. In these cases the SGOF is required.

4.3. Second comparison to AIC: Faux Mesa High

The previous example of a star graph was artificially constructed to illustrate the differences between AIC and SGOF. In this subsection, we give an example of a more typical social network using the "Faux Mesa High" data set of Hunter et al. (2008a), adapted from the Add Health surveys (Harris and Udry, 2008). Similar to the star-graph example, above, after the null model we fit an ERGM using only homophily effects on the observed covariates, which describe Race, Sex and Grade of the respondents. We go on to fit a model using only the "Geometrically Weighted Degree" (GWD) of Hunter and Handcock (2006) (which is a flexible approach to modeling degree distributions), followed by a model with both the GWD and homophily effects. The final model differs in type: we consider the preferential attachment model of Barabási and Albert (1999). Visualizations of the networks created by these models, as well as their AIC and SGOF statistics are shown in Table 3.

Table 3: Comparison of Spectral Goodness of Fit to AIC for Faux Mesa High. [Observed and simulated network visualizations not reproduced.]

Model                                 SGOF (5th, 95th percentile)   AIC
------------------------------------  ----------------------------  ---------
Null Model                            0 (-0.196, 0.21)              2287.742
Homophily on Race, Sex, Grade Only    0.221 (-0.002, 0.474)         1890.922
GWD Only                              0.268 (-0.045, 0.545)         2245.181
GWD and homophily                     0.501 (0.259, 0.682)          1853.656
Preferential attachment               0.467 (0.16, 0.666)           undefined

In this example, the homophily on the three covariates makes significant improvements in both SGOF and AIC, because unlike the star graph, there is almost certainly a real homophily effect in the original data.
Likewise, both SGOF and AIC indicate that the model with both GWD and homophily is superior to the models with just one of those two types of effects.

The lessons from Faux Mesa High are, however, otherwise consistent with those from the star graph. AIC indicates that the homophily-only model is superior to the GWD-only model. However, from the point of view of generating a pattern of ties alone, the SGOF indicates that the GWD-only model is superior to the homophily-only model. Again, the AIC measures the relative quality of fit of the model as a whole to the data as a whole, while the SGOF measures the absolute quality of the fit of the model to the structure manifest in the observed network ties.

Finally, we consider a model that is not only outside the exponential random graph family but is algorithmic in nature rather than statistical: the Barabási-Albert preferential attachment model (Barabási and Albert, 1999), as implemented in the igraph package (Csardi and Nepusz, 2006). As we use it here, there is no likelihood function and thus no AIC associated with this last model. The preferential attachment model is based on a generative algorithm with fixed parameters and does not have a likelihood function that could be meaningfully compared to those from fitted ERGMs.

The SGOF is defined, however, as it is for any model that generates networks with the same number of nodes as the observed network, regardless of conditions put on the sample space or how (or whether) the model was estimated. As such, the SGOF makes it possible to compare models that cannot be compared on the basis of the AIC or other likelihood-based methods.

4.3.1. Visualization of SGOF

As with other statistical methods, a fuller qualitative understanding of the SGOF can be gained through visualization.
Figure 1 plots spectral fits for the "GWD and Homophily" and the "Preferential attachment" models from Table 3, using the plotSGOFerrors function in the spectralGOF package. Each panel of the figure is a visualization of spectral error based on three spectra: the observed spectrum; the simulated null model spectrum whose Euclidean distance from the observed spectrum is closest to the mean such distance among null simulations; and the simulated fitted model spectrum whose Euclidean distance from the observed spectrum is closest to the mean such distance among fitted-model simulations. The first and the second are the same in both panels and are plotted as points. The fitted model spectrum is not plotted as points, but rather indicated by colored bars as follows.

When the fitted model's spectrum lies between the null and the observed spectra, the fitted model has improved the fit. The distance between the null and the fitted spectrum is error that has been "explained" and is indicated in light green. The error that still remains (error that is present under both the null and the fitted models) is indicated in blue.

There are also parts of the plots where the fitted and null spectra are on opposite sides of the observed spectrum. In these cases, the fitted model has "explained" the error between the null and the observed, but introduced new error on the other side of the observed spectrum. This new error is indicated in red.

Turning to the specific models in Figure 1, we see that the two fits differ considerably. In general, the spectrum of the fitted ERGM (left) lies between the observed spectrum and the null spectrum, indicating that the observed network is more structured (farther from random) than are networks simulated from the fitted ERGM. In contrast, portions of the spectrum of the preferential attachment model (right) are more distant from the null spectrum than is the observed spectrum.
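The coloring rules described above can be sketched as a per-eigenvalue decomposition. The following Python sketch is an illustration only and is not the plotSGOFerrors implementation; the function name decompose_errors and the toy spectra are hypothetical. The text does not describe the case in which the fitted spectrum lies on the same side of the observed spectrum as the null spectrum but farther away, so the sketch assumes that the null error then counts as remaining and the excess counts as new error.

```python
def decompose_errors(observed, null, fitted):
    """Split spectral error into (explained, remaining, new) totals.

    Each argument is an equal-length list of eigenvalues, ordered the
    same way. The totals correspond to the green, blue and red areas.
    """
    explained = remaining = new = 0.0
    for o, n, f in zip(observed, null, fitted):
        null_err = n - o  # signed error under the null model
        fit_err = f - o   # signed error under the fitted model
        if null_err * fit_err >= 0:
            # Fitted and null lie on the same side of the observed value.
            if abs(fit_err) <= abs(null_err):
                # Fitted lies between null and observed: partial improvement.
                explained += abs(null_err) - abs(fit_err)
                remaining += abs(fit_err)
            else:
                # ASSUMPTION: case not covered in the text. The null error
                # is treated as remaining and the excess as new error.
                remaining += abs(null_err)
                new += abs(fit_err) - abs(null_err)
        else:
            # Opposite sides: the null error is fully explained, and the
            # overshoot past the observed value is new error.
            explained += abs(null_err)
            new += abs(fit_err)
    return explained, remaining, new


# Hypothetical three-eigenvalue spectra, for illustration only.
explained, remaining, new = decompose_errors(
    observed=[0.0, 1.0, 2.0],
    null=[0.4, 1.5, 2.0],
    fitted=[0.1, 0.8, 2.3],
)
print(explained, remaining, new)
```

A model with a large explained total can still have a modest net fit if its new-error total is also large.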
The preferential attachment model has explained more error than the ERGM (represented by more green area in its visualization), but it has also introduced structure not present in the observed network, producing more new error (more red area in the visualization) and resulting in a lower net SGOF.

[Figure 1, two panels: "GWD+homophily" (left) and "Preferential attachment" (right). Axes: Index (λ_n to λ_0) and Normalized Laplacian Eigenvalue. Legend: Remaining Errors, Explained Errors, New errors, Observed Spectrum, Null Spectrum.]

Figure 1: Illustration of spectral qualities of the two best fitted models in Table 3. Green and red indicate improvement and worsening of model fit, respectively, resulting from the change from the null to the fitted model. Blue indicates error from the null model that is left unexplained.

4.4. SGOF as an objective function: Collaborations among jazz musicians

There are sometimes cases when one wishes to implement algorithmic models that do not have an intrinsic means of fitting to observed data. In such cases, SGOF can be useful as an objective function in an exploration of the algorithm's parameter space. To illustrate this type of application, we consider the network of jazz collaborations described by Gleiser and Danon (2003).

One theoretically plausible algorithmic model of how collaboration networks are formed is that of Saramäki and Kaski (2004). In this model, one assumes some network exists at $t_0$ to initialize the model. At subsequent time points, new individuals arrive and form ties to those already present by means of short random walks from a randomly selected node serving as the point of entry into the network. For musicians, the idea would be that after collaborating with some initial partner, one is likely to get to know one's partner's partners, and so on.
In addition to being theoretically plausible, this algorithm generates networks with skewed degree distributions and local clustering, as we observe in the jazz collaborations data set.

To assess the fit of this model, one must first find the best values for the model's parameters, which we will do by appeal to SGOF. In implementing the algorithm, we left two key parameters to be fitted. The first is the mean number of edges to add with each new node added to the network. The second is how many steps in a random walk a new node would take before forming new relationships to existing members of the network. We then generated 100 simulated networks using each combination of parameters, and calculated the SGOF for each pair of parameter values. The results of this process are shown in Figure 2, and indicate that the best fit occurs when the average number of edges added per node is 9 and the random walk distance is a single step. Thus we can use SGOF not only as a diagnostic tool, but also as a means of identifying the parameter settings that are optimal under this criterion.

5. Future Extensions

5.1. Hypothesis testing

We have presented SGOF as a goodness of fit statistic, analogous to $R^2$. Using spectral distances, it is also possible to construct one- and two-sample hypothesis tests for the purposes of formal rejection of certain models in favor of others. Space does not permit a full discussion of how such tests would be constructed; however, the authors will present this material in a separate manuscript.

5.2. Directed graphs

While the properties of the Laplacian spectrum of undirected graphs have been widely studied and applied, the spectral properties of directed graphs are less well-established.
The present paper has therefore focused on undirected, possibly weighted, networks to establish the SGOF, but further work should consider the different properties of directed graphs. For now, we limit ourselves to the following remarks.

[Figure 2: heat map. Axes: mean edges added per node (5 to 15) and number of random walk steps (1 to 9). Color scale: SGOF from 0.30 to 0.70.]

Figure 2: SGOF for different combinations of parameter values for an algorithm based on Saramäki and Kaski (2004) fitted to the network of jazz collaborations described in Gleiser and Danon (2003).

The Laplacian matrix for directed networks has been defined differently from that of undirected networks. In particular, Chung (2005) defines the Laplacian of directed networks as follows. First, given the adjacency matrix, $A$, calculate a matrix, $P$, such that

$$P(i,j) = \frac{A_{ij}}{\sum_k A_{ik}}. \tag{8}$$

Then, treating $P$ as the transition matrix of a Markov chain, calculate the Perron vector, $\phi$, which is the all-positive left eigenvector of $P$ corresponding to the stationary distribution of the Markov chain (for strongly connected graphs). Define $\Phi$ as the matrix with $\phi$ on the diagonal and zeros elsewhere, and $I$ in the standard way as the identity matrix. Finally, the Laplacian for directed graphs is defined as

$$L = I - \frac{\Phi^{1/2} P \Phi^{-1/2} + \Phi^{-1/2} P^{T} \Phi^{1/2}}{2}. \tag{9}$$

One feature of this definition is that $L$ is symmetric (undirected) and therefore has real-valued eigenvalues. Future work should consider the properties of this matrix from the point of view of goodness of fit, but also consider alternative transformations of the adjacency matrix for spectral analysis.

5.3. Statistical properties of Laplacian eigenvalues

Under certain density conditions, the distribution of eigenvalues of the null model follows the "semicircle law" (Wigner, 1955; Chung et al., 2003), but these conditions are restrictive enough that we have chosen to calculate the null errors in the SGOF by simulation rather than by reference to the semicircle law.

The statistical properties (e.g. consistency and efficiency) of the eigenvalues of ensembles of networks other than the null model depend on the details of the model from which they are generated, and it is not clear a priori what can be said about the statistical properties of the SGOF for fitted models in general. As with the null model, the distribution of eigenvalues from certain narrowly defined models has been studied (Farkas et al., 2001; Bolla, 2004; Zhang et al., 2014). It is not yet clear from the present body of research, however, what can be said about the statistical properties of the SGOF in the general case.

Since we cannot derive the statistical properties of the SGOF analytically, in order to provide one practical point of reference, we have conducted a simulation-based exploration of the properties of 100-node density-only models, under a range of densities. These simulations support the following tentative conclusions. The means of individual eigenvalues are stable across sample sizes (where sample size refers to the number of simulated networks from which the mean spectrum is calculated). The standard deviations of individual eigenvalues from Erdős-Rényi random graphs are asymptotically consistent, but biased downwards for small numbers of simulated networks. Likewise, the 5th and 95th percentiles of individual eigenvalues are asymptotically consistent, but biased toward the median for small samples of simulated networks.
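As one practical illustration of how simulated spectra might be turned into a point estimate with 5th and 95th percentiles, the following Python sketch may be useful. It assumes, for illustration only, that SGOF is computed as one minus the ratio of the mean Euclidean spectral distance under the fitted model to the mean Euclidean spectral distance under the null model; the formal definition is the one given earlier in the paper. The function names and the toy spectra are hypothetical, and the spectra are assumed to have been computed already.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length spectra."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def percentile(sorted_values, q):
    """Linearly interpolated quantile q (between 0 and 1) of an ascending list."""
    position = q * (len(sorted_values) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def sgof_summary(observed, null_spectra, fitted_spectra):
    """Return (mean SGOF, 5th percentile, 95th percentile).

    ASSUMPTION: each fitted simulation i contributes
    1 - d(observed, fitted_i) / mean_j d(observed, null_j).
    """
    null_distances = [euclidean_distance(observed, s) for s in null_spectra]
    mean_null = sum(null_distances) / len(null_distances)
    per_simulation = sorted(
        1 - euclidean_distance(observed, s) / mean_null for s in fitted_spectra
    )
    mean_sgof = sum(per_simulation) / len(per_simulation)
    return (
        mean_sgof,
        percentile(per_simulation, 0.05),
        percentile(per_simulation, 0.95),
    )


# Hypothetical two-eigenvalue spectra, for illustration only.
observed = [0.0, 0.0]
null_spectra = [[3.0, 4.0], [6.0, 8.0]]    # distances 5 and 10
fitted_spectra = [[0.0, 3.0], [0.0, 6.0]]  # distances 3 and 6
mean_sgof, p05, p95 = sgof_summary(observed, null_spectra, fitted_spectra)
print(round(mean_sgof, 3), round(p05, 3), round(p95, 3))
```

With only a handful of simulated spectra, percentile estimates from a calculation of this kind will tend to be too narrow, which is the small-sample bias noted above.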
Given the above, we recommend using 100 simulations of the null model to calculate standard errors or quantiles of the distribution of SGOF for exploratory modeling and at least 1000 simulations for published results. Furthermore, we strongly recommend examining the distribution of spectra simulated from fitted models to establish that sufficient sample sizes have been obtained when calculating the SGOF. Future work should seek to derive more general conclusions about the statistical properties of spectral distances for network models.

6. Conclusion

We have proposed a new measure of goodness of fit for network models based on the spectrum of the graph Laplacian: "spectral goodness of fit" (SGOF), and provided code with which SGOF can be easily implemented. The properties of SGOF fill gaps left by the current set of goodness of fit indicators, making it complementary to existing methods.

Table 4 summarizes the properties of each approach to goodness of fit. Analogous to the standard $R^2$, the SGOF statistic measures the percent improvement in network structure explained over a null model. By measuring fit relative to fixed reference points, SGOF can be said to provide an "absolute" measure of goodness of fit.

Table 4: Summary of properties of goodness of fit measures

Property                            Struct. stats    AIC    SGOF
Absolute Measure of GOF             -                -      Yes
Relative Measure of GOF             -                Yes    Yes
Sensitive to structure only         Yes              -      Yes
Hypothesis test of model fit        Yes              -      -
Sensitive to model specification    -                Yes    -
Requires Likelihood Function        -                Yes    -

Prior methods had provided relative measures of fit (AIC), and hypothesis testing of fit for specific subgraph statistics, but until now there was no absolute measure of fit for network structure as a whole. Ultimately, however, we see SGOF as playing a complementary role to existing techniques.
For example, when a research question concerns a specific structural tendency (say, to transitive closure), one should use both structural statistics and SGOF (and even AIC if applicable, to assess model parsimony).

In addition to providing an absolute measure of fit, the SGOF allows the comparison of models fit by diverse means and of diverse functional forms. We hope that the ability to compare fit among dissimilar models will facilitate building on and refining prior work, as well as greater engagement with research models and results from outside of any given researcher's own methodological tradition.

References Cited

Carolyn J Anderson, Stanley Wasserman, and Katherine Faust. Building stochastic blockmodels. Social Networks, 14(1):137–161, 1992.
Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
Marianna Bolla. Distribution of the eigenvalues of random block-matrices. Linear Algebra and its Applications, 377:219–240, 2004.
Béla Bollobás and Vladimir Nikiforov. Graphs and Hermitian matrices: eigenvalue interlacing. Discrete Mathematics, 289(1):119–127, 2004.
Andries E Brouwer and Willem H Haemers. Spectra of graphs. Springer, 2011.
Fan Chung. Laplacians and the Cheeger inequality for directed graphs. Annals of Combinatorics, 9(1):1–19, 2005.
Fan Chung, Linyuan Lu, and Van Vu. Spectra of random graphs with given expected degrees. Proceedings of the National Academy of Sciences, 100(11):6313–6318, 2003.
Fan RK Chung. Spectral Graph Theory, volume 92. American Mathematical Soc., 1997.
Gabor Csardi and Tamas Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems:1695, 2006. URL http://igraph.sf.net.
Dragoš Cvetković. Spectral recognition of graphs. The Yugoslav Journal of Operations Research, 22(2), 2012.
Dragoš M Cvetković, Peter Rowlinson, and Slobodan Simić. Eigenspaces of graphs. Number 66. Cambridge University Press, 1997.
Luca Donetti, Franco Neri, and Miguel A Muñoz. Optimal network topologies: Expanders, cages, Ramanujan graphs, entangled networks and all that. Journal of Statistical Mechanics: Theory and Experiment, 2006(08):P08007, 2006.
Patrick Doreian, Vladimir Batagelj, and Anuska Ferligoj. Generalized blockmodeling. Structural analysis in the social sciences, number 25. Cambridge University Press, 2005.
Illés J. Farkas, Imre Derényi, Albert-László Barabási, and Tamás Vicsek. Spectra of "real-world" graphs: Beyond the semicircle law. Physical Review E, 64(2):026704, 2001.
Miroslav Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(2):298–305, 1973.
Ove Frank and David Strauss. Markov graphs. Journal of the American Statistical Association, 81(395):832–842, 1986.
Pablo M Gleiser and Leon Danon. Community structure in jazz. Advances in Complex Systems, 6(04):565–573, 2003.
Mark S Handcock. Assessing degeneracy in statistical models of social networks. In Journal of the American Statistical Association. Citeseer, 2003.
Mark S. Handcock, David R. Hunter, Carter T. Butts, Steven M. Goodreau, Pavel N. Krivitsky, and Martina Morris. ergm: Fit, Simulate and Diagnose Exponential-Family Models for Networks. The Statnet Project (http://www.statnet.org), 2014. URL CRAN.R-project.org/package=ergm. R package version 3.1.2.
Kathleen Mullan Harris and J Richard Udry. National Longitudinal Study of Adolescent Health (Add Health), 1994-2008. Chapel Hill, NC: Carolina Population Center, University of North Carolina-Chapel Hill/Ann Arbor, MI: Inter-university Consortium for Political and Social Research, 2008.
David R Hunter and Mark S Handcock. Inference in curved exponential family models for networks. Journal of Computational and Graphical Statistics, 15(3), 2006.
David R Hunter, Steven M Goodreau, and Mark S Handcock. Goodness of fit of social network models. Journal of the American Statistical Association, 103(481), 2008a.
David R. Hunter, Mark S. Handcock, Carter T. Butts, Steven M. Goodreau, and Martina Morris. ergm: A package to fit, simulate and diagnose exponential-family models for networks. Journal of Statistical Software, 24(3):1–29, 2008b.
Bojan Mohar and Y Alavi. The Laplacian spectrum of graphs. Graph Theory, Combinatorics, and Applications, 2:871–898, 1991.
Mark EJ Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.
Alex Pothen, Horst D Simon, and Kang-Pu Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11(3):430–452, 1990.
Jari Saramäki and Kimmo Kaski. Scale-free networks generated by random walkers. Physica A: Statistical Mechanics and its Applications, 341:80–86, 2004.
Issai Schur. Über eine Klasse von Mittelbildungen mit Anwendungen auf die Determinantentheorie. Sitzungsberichte der Berliner Mathematischen Gesellschaft, 22:9–20, 1923.
Michael Schweinberger. Statistical modelling of network panel data: Goodness of fit. British Journal of Mathematical and Statistical Psychology, 65(2):263–281, 2012.
Michael Schweinberger and Tom AB Snijders. Settings in social networks: A measurement model. Sociological Methodology, 33(1):307–341, 2003.
Shai S Shen-Orr, Ron Milo, Shmoolik Mangan, and Uri Alon. Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics, 31(1):64–68, 2002.
Tom AB Snijders. The statistical evaluation of social network dynamics. Sociological Methodology, 31(1):361–395, 2001.
Christian Steglich, Tom AB Snijders, and Michael Pearson. Dynamic networks and behavior: Separating selection from influence. Sociological Methodology, 40(1):329–393, 2010.
Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
Alexei Vázquez. Growing network with local rules: Preferential attachment, clustering hierarchy, and degree correlations. Physical Review E, 67(5):056104, 2003.
Duncan J Watts and Steven H Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.
Eugene P Wigner. Characteristic vectors of bordered matrices with infinite dimensions. Annals of Mathematics, pages 548–564, 1955.
Xiao Zhang, Raj Rao Nadakuditi, and Mark EJ Newman. Spectra of random graphs with community structure and arbitrary degrees. Physical Review E, 89(4):042816, 2014.