Testing goodness-of-fit of random graph models
Random graphs are matrices with independent 0-1 elements with probabilities determined by a small number of parameters. One of the oldest models is the Rasch model, where the odds are ratios of positive numbers scaling the rows and columns. Later Pers…
Authors: Villő Csiszár, Péter Hussami, János Komlós
Villő Csiszár 1, Péter Hussami 2, János Komlós 3, Tamás F. Móri 1,5, Lídia Rejtő 2,4, Gábor Tusnády 2

submitted: May 4, 2012, revised: November 8, 2012

Abstract

Random graphs are matrices with independent 0-1 elements with probabilities determined by a small number of parameters. One of the oldest models is the Rasch model, where the odds are ratios of positive numbers scaling the rows and columns. Later Persi Diaconis with his coworkers rediscovered the model for symmetric matrices and called the model beta. Here we give goodness-of-fit tests for the model and extend the model to a version of the block model introduced by Holland, Laskey, and Leinhardt.

1 Introduction

Let $n$ be a positive integer, $1 \le i, j \le n$, and let $\varepsilon(i,j)$ be independent random variables such that $\varepsilon(i,j) = \varepsilon(j,i)$ and $\varepsilon(i,i) = 0$; furthermore

\[ P(\varepsilon(i,j) = 1) = p_{i,j} = p + p_i + p_j, \qquad 1 \le i < j \le n, \tag{1} \]

where the sum of the $p_i$-s is zero. The least squares estimate $\hat{p}$ of $p$ is the average of the epsilons, and the least squares estimate of $p_i$ is the average of the differences $\varepsilon(i,j) - \hat{p}$. The modification of the model for non-symmetric matrices is straightforward, and in that case the statistical inference is practically a two-way analysis of variance. This is perhaps the simplest random graph model, but it shares an inconvenient property of many other random graph models: it is hard to ensure that the edge probabilities remain in the interval $(0, 1)$.
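As an illustration of the least squares estimates for model (1), the following is a minimal Python sketch, not the authors' code. The function name `fit_additive_model` and the list-of-lists adjacency input are assumptions made for the example. It follows the description in the text literally: $\hat{p}$ is the average of $\varepsilon(i,j)$ over pairs $i > j$, and $\hat{p}_i$ is the average of $\varepsilon(i,j) - \hat{p}$ over $j \ne i$.

```python
def fit_additive_model(adj):
    """Estimate p and the p_i of model (1) as described in the text.

    adj is a symmetric 0/1 adjacency matrix (list of lists) with zero diagonal.
    """
    n = len(adj)
    num_pairs = n * (n - 1) // 2
    # p_hat: average of epsilon(i, j) over all pairs i > j.
    p_hat = sum(adj[i][j] for i in range(n) for j in range(i)) / num_pairs
    # p_i_hat: average over j != i of the differences epsilon(i, j) - p_hat.
    p_i_hat = [
        sum(adj[i][j] - p_hat for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    return p_hat, p_i_hat


# Hypothetical example: a star on 4 vertices with centre 0.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
p_hat, p_i_hat = fit_additive_model(star)
```

The estimated $\hat{p}_i$ sum to zero by construction, as the model requires. Solving the constrained normal equations exactly gives the divisor $n - 2$ rather than $n - 1$ for $\hat{p}_i$, a difference that vanishes for large $n$. In either form the fitted values $\hat{p} + \hat{p}_i + \hat{p}_j$ are not constrained to $(0, 1)$.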
1 Eötvös Loránd University, Budapest, Hungary
2 Alfréd Rényi Mathematical Institute of the Hungarian Academy of Sciences, Budapest, Hungary
3 Rutgers University, Department of Mathematics, New Brunswick, New Jersey, USA
4 University of Delaware, Statistics Program, FREC, CANR, Newark, Delaware, USA
5 Tamás F. Móri's research was supported by OTKA grant 12574

If we use the odds

\[ r_{i,j} = \frac{p_{i,j}}{1 - p_{i,j}} \tag{2} \]

instead of the probabilities, then it is enough to ensure the positivity of the $r_{i,j}$-s. This is the case in the model introduced by George Rasch [31]. Historically the odds were defined as the ratios of scaling factors for rows and columns, but we prefer the multiplicative form

\[ r_{i,j} = \beta_i \gamma_j \tag{3} \]

for the non-symmetric case and

\[ r_{i,j} = \beta_i \beta_j \tag{4} \]

for the symmetric case. Statistical investigation of the model started with Andersen [1] (see also [21, 30, 33]), and later Persi Diaconis with his coworkers rediscovered the model and introduced the name beta-model after its parameter. The model has many attractive properties (see [2, 4, 5, 6, 8, 28]):

– degree sequences are sufficient statistics;
– the model covers practically all possible expected degree sequences;
– the conditional distribution of the graphs, given a prescribed degree sequence, is uniform on the set of all graphs with the given degree sequence.

Statistical inference emerged from the Gaussian distribution and was later extended to random variables in Euclidean spaces, but statistical inference on discrete structures is rather sparse ([7, 15, 16, 19, 26]). Mathematical investigation of graphs has its own history. Nowadays, instead of graphs, we speak of networks ([27]), where the most investigated model is the stochastic block model introduced by Holland, Laskey, and Leinhardt ([18]).
Here the vertices are labeled by small numbers or colors, and edge probabilities depend only on the labels ([3, 17]). With an eye on preferential attachment, where degree sequences follow a scale-free power law, the block model was criticized because it offers only moderate flexibility in its degree sequences. Chung, Lu, and Vu [14] introduced a model with independent vertices; Chaudhuri, Chung, and Tsiatas ([10]) introduced the planted partition model (see also [25]). Karrer and Newman [20] proposed another extension of the block model. A natural extension of these models is the unification of the beta and block models:

\[ r_{i,j} = b(i, c(j))\, b(j, c(i)), \tag{5} \]

where $b(\cdot,\cdot)$ is a positive matrix with $n$ rows and $k$ columns, and $c(i)$ is the label of the $i$-th vertex, i.e. an integer between 1 and $k$. We call this model the k-beta model. The estimation of the labels in block models is possible by the spectral method ([32]). It is generally believed that the eigenvectors and eigenvalues of the matrix $\varepsilon(i,j)$ tell everything about the structure of the graph ([10, 12, 13, 22, 23, 24]), while there are many attempts to provide more flexible models ([9, 29]).

2 Goodness-of-fit

We cannot test edge-independence on a single graph. While an i.i.d. sample is common in statistical inference, in the case of graphs the sample generally means a single copy of a graph. Perhaps the number one question in statistical inference is the following. Let

\[ p_1, \dots, p_n \tag{6} \]

be an arbitrary given sequence of probabilities, and let

\[ \varepsilon_1, \dots, \varepsilon_n \tag{7} \]

be independent 0-1 variables such that $P(\varepsilon_i = 1) = p_i$. Can we test the model? A randomized answer is the following. Let

\[ u_1, \dots, u_n \tag{8} \]

be independent and uniformly distributed in $(0, 1)$. Then

\[ x_i = p_i u_i \varepsilon_i + (1 - \varepsilon_i)\bigl(p_i + (1 - p_i) u_i\bigr), \qquad i = 1, \dots, n, \tag{9} \]

are independent and uniformly distributed in $(0, 1)$, which we can test. Another, more practical solution is to order the pairs $(p_i, \varepsilon_i)$ according to the $p_i$-s in increasing order and compare the partial sums of the $p_i$-s and the $\varepsilon_i$-s. Or we can clump them into small blocks and again compare the sums.

All these possibilities hold for graphs with estimated edge probabilities. Let us partition the edges of the complete graph according to the blocks formed with respect to the edge probabilities. In each portion the edge probabilities are close to each other, whence the $\varepsilon_{i,j}$-s corresponding to that portion behave like a pure random graph, which we can again test, e.g. by their sums on subsets of vertices.

Blitzstein and Diaconis ([6, 11]) propose the following general procedure for testing the beta model. Let us choose any graph statistic and determine it on our graph. Let us generate as many graphs as we can with the same degree sequence as the investigated graph, according to the uniform distribution, and let us calculate the chosen statistic on each. If the value for the sample graph lies inside the range of the generated numbers, we accept the beta model; otherwise we reject it. One can ask: does the choice of the statistic have any effect on the power of the test?

We have found by computer simulations that graphs generated by the beta model have only one eigenvalue proportional to $n$; all the others are of order $\sqrt{n}$. We think that this is a characteristic property of beta graphs. One wonders: if

– the beta model covers all possible degree sequences, and
– the conditional distribution is uniform over graphs sharing the same degree sequence,

then how is it possible that a graph behaves differently from typical graphs generated by the beta model? Of course there are graphs having many large eigenvalues. But where do they come from, once the beta model can generate all graphs?
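The randomized transform (9) from earlier in this section can be sketched in a few lines of Python. This is an illustration, not the authors' implementation: the function names `randomized_uniforms` and `ks_statistic` are hypothetical, and the Kolmogorov-Smirnov distance to the uniform distribution is only one assumed choice for the final test, since the text does not name a specific one.

```python
import random


def randomized_uniforms(probs, outcomes, rng):
    """Apply transform (9): map each pair (p_i, eps_i) to a value x_i in [0, 1].

    If eps_i = 1 the value falls in [0, p_i); if eps_i = 0 it falls in [p_i, 1].
    Under the model P(eps_i = 1) = p_i the x_i are i.i.d. uniform.
    """
    xs = []
    for p, eps in zip(probs, outcomes):
        u = rng.random()  # auxiliary uniform variable u_i of (8)
        xs.append(p * u * eps + (1 - eps) * (p + (1 - p) * u))
    return xs


def ks_statistic(xs):
    """Kolmogorov-Smirnov distance between the sample and Uniform(0, 1)."""
    xs = sorted(xs)
    n = len(xs)
    return max(max((k + 1) / n - x, x - k / n) for k, x in enumerate(xs))
```

A small Kolmogorov-Smirnov distance is consistent with the hypothesized probabilities, while a large one indicates misfit. Because the auxiliary variables $u_i$ are random, repeated runs on the same data give slightly different values, which is the price of the randomized answer.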
A possible resolution of the catch is the following. Let us form a meta graph from the graphs sharing the same degree sequence. Let us say that neighborhood in this meta graph is given by one single swap: if we have four vertices A, B, C, D in a graph such that AC and BD are edges but AD and BC are not, then by exchanging existence and non-existence among these edges we form a new graph with the same degree sequence. The degree of a graph in this meta graph goes parallel with the second largest eigenvalue: typical beta model graphs have minimal degree, and any increase in their degree results in a more complicated eigenvalue structure. Perhaps the degree in the meta graph is the most characteristic statistic for the beta model.

3 The k-beta model

The maximum likelihood equations for the parameters $b(\cdot,\cdot)$ in (5) say that the expected values of the degrees inside every subgraph with a given pair of labels should be the same as in the given graph. This is the case when the labels are known. With unknown labels we can form a two-level optimization: for each label set, first determine the parameters $b(\cdot,\cdot)$, then change a small number of labels and repeat the calculation of the parameters. But the procedure is slow even for graphs of moderate size. Spectral methods available for block models fail for coloring k-beta models because the model loses the well-pronounced checkerboard character of block models. It is ANOVA that offers an applicable algorithm. For any set $C$ of labels $c(\cdot)$ let us calculate the statistic

\[ Q(C) = \sum_{i=2}^{n} \sum_{j=1}^{i-1} \bigl( \varepsilon(i,j) - u(c(i), c(j)) - v(i, c(j)) - v(j, c(i)) \bigr)^2, \tag{10} \]

where

\[ u(s,t) = \frac{\sum_{c(i)=s} \sum_{c(j)=t} \varepsilon(i,j)}{\sum_{c(i)=s} \sum_{c(j)=t} 1}, \tag{11} \]

and

\[ v(i,t) = \frac{\sum_{c(j)=t} \bigl( \varepsilon(i,j) - u(c(i), t) \bigr)}{\sum_{c(j)=t} 1}. \tag{12} \]

$Q(C)$ is the sum of the two-way ANOVA sums of squares calculated independently for the subgraphs defined by pairs of labels. Starting from a uniform random set $C$ of labels on the vertices and perturbing a small number of labels in each step, a simple greedy optimization results in a good set of labels, which is close to the original (true) labels.

For evaluating the character of a random graph we use the number

\[ \exp\left( - \frac{\sum_{i=2}^{n} \sum_{j=1}^{i-1} \bigl( p(i,j) \log p(i,j) + (1 - p(i,j)) \log(1 - p(i,j)) \bigr)}{n(n-1)/2} \right). \tag{13} \]

We call it the delogarithmed average entropy, or DAE. This is a number between 1 and 2. If it is close to 1, the graph is almost deterministic: the probabilities are close to 0 or 1. In checkerboard block models this means that empty and full subgraphs are amalgamated together. If DAE is close to 2, then the graph has no structure at all. DAE depends on the edge density, too. The above tendency is valid for edge density $1/2$; for other edge densities the cut point is closer to 1. According to our experience, if DAE is smaller than 1.9 while the edge density is one half, then we are able to reconstruct the original labels. For these graphs the number of non-trivial eigenvalues is $2k - 1$, thus the spectrum determines the number of different labels.

The k-beta model has a sister model

\[ r_{i,j} = \sum_{s=1}^{k} b(i,s)\, b(j,s), \tag{14} \]

which we call the small odds rank model. Strictly speaking we ought to redefine the diagonal of the odds matrix, but perhaps the name is permissible without doing so. The maximum likelihood estimation of the parameters in small odds rank models is straightforward, and the block structure is detectable in the estimated parameters. Actually the block model is in the intersection of the k-beta and small odds rank models; thus if there is any block structure in the graph, it is detectable even when fitting the k-beta model to the graph.
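The delogarithmed average entropy (13) defined above is straightforward to compute from a matrix of edge probabilities. The sketch below is illustrative only: the function name `dae`, the helper `_xlogx`, the list-of-lists input, and the convention $0 \log 0 = 0$ for probabilities equal to 0 or 1 are assumptions of the example, not part of the paper.

```python
import math


def _xlogx(x):
    """Return x * log(x), using the (assumed) convention 0 * log(0) = 0."""
    return x * math.log(x) if x > 0.0 else 0.0


def dae(prob):
    """Delogarithmed average entropy (13) of a symmetric probability matrix.

    Only the entries below the diagonal (i > j) are used.
    """
    n = len(prob)
    total = 0.0
    for i in range(1, n):
        for j in range(i):
            p = prob[i][j]
            total += _xlogx(p) + _xlogx(1.0 - p)
    return math.exp(-total / (n * (n - 1) / 2))
```

With every edge probability equal to $1/2$ the function returns 2, and with all probabilities equal to 0 or 1 it returns 1, matching the two extremes described in the text.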
But if there is no block structure and we try to use ANOVA coloring for a small odds rank graph, then the algorithm is no longer stable: it ends in a different local minimum in each run.

References

[1] E. B. Andersen, Sufficient statistics and latent trait models, Psychometrika 42, 1 (1977), 69–81.
[2] P. J. Bickel and A. Chen, A nonparametric view of network models and Newman-Girvan and other modularities, Proceedings of the National Academy of Sciences 106, 50 (2009), 21068–21073.
[3] P. Bickel, D. Choi, X. Chang, and H. Zhang, Asymptotic normality of maximum likelihood and its variational approximation for stochastic block models, Preprint available at arXiv:1207.0865v1 [math.ST] 4 Jul 2012, submitted to the Annals of Statistics.
[4] A. Barvinok and J. A. Hartigan, An asymptotic formula for the number of non-negative integer matrices with prescribed row and column sums, Preprint, available at http://arxiv.org/abs/0910.2477, (2009).
[5] A. Barvinok and J. A. Hartigan, The number of graphs and a random graph with a given degree sequence, Preprint, available at http://arxiv.org/abs/1003.0356, (2010).
[6] J. Blitzstein and P. Diaconis, A sequential importance sampling algorithm for generating random graphs with prescribed degrees, Journal of Internet Mathematics, 6 (2010), 489–522.
[7] M. Bolla and G. Tusnády, Spectra and optimal partitions of weighted graphs, Discrete Mathematics 128, (1994), 1–20.
[8] S. Chatterjee, P. Diaconis and A. Sly, Random graphs with a given degree sequence, Preprint, available at arXiv:1005.1136v3 [math.PR], (2010), submitted to the Annals of Statistics.
[9] S. Chatterjee and P. Diaconis, Estimating and understanding exponential random graph models, Preprint available at arXiv:1102.2650v3 [math.PR] 6 Apr 2011.
[10] K. Chaudhuri, F. Chung, and A. Tsiatas, Spectral clustering of graphs with general degrees in the extended planted partition model, Journal of Machine Learning Research 1 (2012) 1–23.
[11] Y. Chen, P. Diaconis, S. P. Holmes, and J. S. Liu, Sequential Monte Carlo methods for statistical analysis of tables, Journal of the American Statistical Association 100:469 (2005) 109–120.
[12] F. Chung, Spectral graph theory, American Mathematical Society, Providence, RI 1997.
[13] F. Chung and L. Lu, Complex graphs and networks, American Mathematical Society, Boston, Massachusetts, 2006.
[14] F. Chung, L. Lu, and V. Vu, Spectra of random graphs with given expected degrees, Proceedings of the National Academy of Sciences U.S.A. 27 (2003), 6313–6318.
[15] V. Csiszár, L. Rejtő and G. Tusnády, Statistical inference on random structures, Horizon of Combinatorics, (eds. Győri, E. et al.), (2008), 37–67.
[16] V. Csiszár, P. Hussami, J. Komlós, T. Móri, L. Rejtő, and G. Tusnády, When the degree sequence is a sufficient statistic, Acta Mathematica Hungarica 134 (2011) 45–53.
[17] C. J. Flynn and P. O. Perry, Consistent biclustering, Preprint available at arXiv:1206.6927v1 [stat.ME] 29 Jun 2012.
[18] P. Holland, K. B. Laskey, and S. Leinhardt, Stochastic blockmodels: some first steps, Journal of the American Statistical Association, 76(373) (1981), 33–50.
[19] P. Hussami, Statistical inference on random graphs, PhD thesis, (Central European University, Budapest, Hungary) (2010).
[20] B. Karrer and M. E. J. Newman, Stochastic block models and community structure in networks, Phys. Rev. E 83 (2011), 016107.
[21] J. M. Linacre, Predicting responses from Rasch measures, Journal of Applied Measurement 11 (2010).
[22] L. Lovász, Very large graphs, Current Developments of Mathematica 2008, 67–128.
[23] L. Lu and X. Peng, Spectra of edge-independent random graphs, Preprint available at arXiv:1204.6207v1 [math.CO] 27 Apr 2012.
[24] R. R. Nadakuditi and M. E. J. Newman, Spectra of random graphs with arbitrary expected degrees, Preprint available at arXiv:1208.1275v1 [cs.SI] 6 Aug 2012.
[25] E. Mossel, J. Neeman, and A. Sly, Reconstruction and estimation in the planted partition model, Preprint, available at arXiv:1202.1499v4 [math.PR] 22 Aug 2012.
[26] T. Nepusz, L. Négyessy, G. Tusnády, and F. Bazsó, Dynamic System and its Applications 18 (2009) 335–362.
[27] M. Newman, A.-L. Barabási and D. Watts, The structure and dynamics of networks, (Princeton studies in complexity), Princeton University Press, 2007.
[28] M. Ogawa, H. Hara, and A. Takemura, Graver basis for an undirected graph and its application to testing the beta model of random graphs, Ann. Ist. Stat. Mat. 2012.
[29] G. Palla, L. Lovász and T. Vicsek, Multifractal network generator, Proceedings of the National Academy of Sciences 107, 17 (2010), 7641–7645.
[30] I. Ponicny, Nonparametric goodness-of-fit tests for the Rasch model, Psychometrika 66 (2001), 437–460.
[31] G. Rasch, Probabilistic models for some intelligence and attainment tests, Copenhagen: Danmarks Paedagogiske Institut, 1960.
[32] K. Rohe, S. Chatterjee, and B. Yu, Spectral clustering and high-dimensional stochastic block model, Annals of Statistics 39 (2011) 1878–1915.
[33] N. Verhelst, Testing the unidimensionality assumption of the Rasch model, Measurement and Research Department Reports 2002.