Moment-based parameter estimation in binomial random intersection graph models

Binomial random intersection graphs can be used as parsimonious statistical models of large and sparse networks, with one parameter for the average degree and another for transitivity, the tendency of neighbours of a node to be connected. This paper …

Authors: Joona Karjalainen, Lasse Leskelä

Joona Karjalainen and Lasse Leskelä
Aalto University, Espoo, Finland
math.aalto.fi/en/people/joona.karjalainen
math.aalto.fi/~lleskela/

Abstract. Binomial random intersection graphs can be used as parsimonious statistical models of large and sparse networks, with one parameter for the average degree and another for transitivity, the tendency of neighbours of a node to be connected. This paper discusses the estimation of these parameters from a single observed instance of the graph, using moment estimators based on observed degrees and frequencies of 2-stars and triangles. The observed data set is assumed to be a subgraph induced by a set of $n_0$ nodes sampled from the full set of $n$ nodes. We prove the consistency of the proposed estimators by showing that the relative estimation error is small with high probability for $n_0 \gg n^{2/3} \gg 1$. As a byproduct, our analysis confirms that the empirical transitivity coefficient of the graph is with high probability close to the theoretical clustering coefficient of the model.

Keywords: statistical network model, network motif, model fitting, moment estimator, sparse graph, two-mode network, overlapping communities

1 Introduction

Random intersection graphs are statistical network models with overlapping communities. In general, an intersection graph on a set of $n$ nodes is defined by assigning each node $i$ a set of attributes $V_i$, and then connecting those node pairs $\{i, j\}$ for which the intersection $V_i \cap V_j$ is nonempty. When the assignment of attributes is random, we obtain a random undirected graph. By construction, this graph has a natural tendency to contain strongly connected communities because any set of nodes $W_k = \{i : V_i \ni k\}$ affiliated with attribute $k$ forms a clique.

The simplest nontrivial model is the binomial random intersection graph $G = G(n, m, p)$ introduced in [13], having $n$ nodes and $m$ attributes, where any particular attribute $k$ is assigned to a node $i$ with probability $p$, independently of other node–attribute pairs. A statistical model of a large and sparse network with nontrivial clustering properties is obtained when $n$ is large, $m \sim \beta n$ and $p \sim \gamma n^{-1}$ for some constants $\beta$ and $\gamma$. In this case the limiting model can be parameterised by its mean degree $\lambda = \beta\gamma^2$ and attribute intensity $\mu = \beta\gamma$. By extending the model with random node weights, we obtain a statistical network model which is rich enough to admit heavy tails and nontrivial clustering properties [4, 5, 8, 10]. Such models can also be generalised to the directed case [6]. An important feature of this class of models is the analytical tractability related to component sizes [3, 15] and percolation dynamics [1, 7].

In this paper we discuss the estimation of the model parameters based on a single observed instance of a subgraph induced by a set of $n_0$ nodes. We introduce moment estimators for $\lambda$ and $\mu$ based on observed frequencies of 2-stars and triangles, and describe how these can be computed in time proportional to the product of the maximum degree and the number of observed nodes. We also prove that the statistical network model under study has a nontrivial empirical transitivity coefficient which can be approximated by a simple parametric formula in terms of $\mu$.

The majority of classical literature on the statistical estimation of network models concerns exponential random graph models [19], whereas most of the recent works are focused on stochastic block models [2] and stochastic Kronecker graphs [9].
For binomial random intersection graphs with $m \ll n$, it has been shown [17] that the underlying attribute assignment can in principle be learned using maximum likelihood estimation. To the best of our knowledge, the current paper is the first to discuss parameter estimation in random intersection graphs where $m$ is of the same order as $n$.

The rest of the paper is organised as follows. In Section 2 we describe the model and its key assumptions. Section 3 summarises the main results. Section 4 describes numerical simulation experiments for the performance of the estimators. The proofs of the main results are given in Section 5, and Section 6 concludes the paper.

2 Model description

2.1 Binomial random intersection graph

The object of study is an undirected random graph $G = G(n, m, p)$ on node set $\{1, 2, \dots, n\}$ with adjacency matrix having diagonal entries $A(i, i) = 0$ and off-diagonal entries

\[ A(i, j) = \min\left( \sum_{k=1}^{m} B(i, k) B(j, k),\ 1 \right), \]

where $B(i, k)$ are independent $\{0, 1\}$-valued random integers with mean $p$, indexed by $i = 1, \dots, n$ and $k = 1, \dots, m$. The matrix $B$ represents a random assignment of $m$ attributes to $n$ nodes, both labeled using positive integers, so that $B(i, k) = 1$ when attribute $k$ is assigned to node $i$. The set of attributes assigned to node $i$ is denoted by $V_i = \{k : B(i, k) = 1\}$. Then a node pair $\{i, j\}$ is connected in $G$ if and only if the intersection $V_i \cap V_j$ is nonempty.

2.2 Sparse and balanced parameter regimes

We obtain a large and sparse random graph model by considering a sequence of graphs $G(n, m, p)$ with parameters $(n, m, p) = (n_\nu, m_\nu, p_\nu)$ indexed by a scale parameter $\nu \in \{1, 2, \dots\}$ such that¹ $n \gg 1$ and $p \ll m^{-1/2}$ as $\nu \to \infty$.
In this case a pair of nodes $\{i, j\}$ is connected with probability

\[ P(ij \in E(G)) = 1 - (1 - p^2)^m \sim mp^2, \]

and the expected degree of a node $i$ is given by

\[ E \deg_G(i) = (n - 1) P(ij \in E(G)) \sim nmp^2. \tag{2.1} \]

In particular, we obtain a large random graph with a finite limiting mean degree $\lambda \in (0, \infty)$ when we assume that

\[ n \gg 1, \qquad mp^2 \sim \lambda n^{-1}. \tag{2.2} \]

This will be called the sparse parameter regime with mean degree $\lambda$.

The most interesting model with nontrivial clustering properties is obtained when we also assume that $p \sim \mu m^{-1}$ for some constant $\mu \in (0, \infty)$. In this case the full set of conditions is equivalent to

\[ n \gg 1, \qquad m \sim (\mu^2/\lambda)\, n, \qquad p \sim (\lambda/\mu)\, n^{-1}, \tag{2.3} \]

and will be called the balanced sparse parameter regime with mean degree $\lambda$ and attribute intensity $\mu$.

2.3 Induced subgraph sampling

Assume that we have observed the subgraph $G^{(n_0)}$ of $G$ induced by a set $V^{(n_0)}$ of $n_0$ nodes sampled from the full set of $n$ nodes, so that $E(G^{(n_0)})$ consists of node pairs $\{i, j\} \in E(G)$ such that $i \in V^{(n_0)}$ and $j \in V^{(n_0)}$. The sampling mechanism used to generate $V^{(n_0)}$ is assumed to be stochastically independent of $G$. In particular, any nonrandom selection of $V^{(n_0)}$ fits this framework. On the other hand, several other natural sampling mechanisms [14] are ruled out by this assumption, although we believe that several of the results in this paper can be generalised to a wider context.

In what follows, we shall assume that the size of the observed subgraph satisfies $n^\alpha \ll n_0 \le n$ for some $\alpha \in (0, 1)$. An important special case with $n_0 = n$ amounts to observing the full graph $G$.

¹ For number sequences $f = f_\nu$ and $g = g_\nu$ indexed by integers $\nu \ge 1$, we denote $f \sim g$ if $f_\nu / g_\nu \to 1$ and $f \ll g$ if $f_\nu / g_\nu \to 0$ as $\nu \to \infty$. The scale parameter is usually omitted.

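The model of Section 2 can be simulated in a few lines. The following Python sketch is only an illustration under stated assumptions: nodes are labelled 0, ..., n − 1 instead of 1, ..., n, the graph is stored as a list of adjacency sets, the asymptotic relations in (2.3) are replaced by equalities with $m$ rounded to an integer, and the function names `balanced_sparse_parameters`, `sample_rig` and `induced_subgraph` are hypothetical.

```python
import random
from itertools import combinations

def balanced_sparse_parameters(n, lam, mu):
    """Pick (m, p) with m = (mu^2/lam) n (rounded) and p = (lam/mu)/n, cf. (2.3)."""
    m = max(1, round(mu * mu / lam * n))
    p = min(1.0, lam / (mu * n))
    return m, p

def sample_rig(n, m, p, rng):
    """Sample G(n, m, p) and return a list of adjacency sets."""
    adj = [set() for _ in range(n)]
    for _ in range(m):
        # Nodes holding the current attribute; each node independently with probability p.
        holders = [i for i in range(n) if rng.random() < p]
        # All nodes sharing an attribute are pairwise connected (they form a clique).
        for i, j in combinations(holders, 2):
            adj[i].add(j)
            adj[j].add(i)
    return adj

def induced_subgraph(adj, nodes):
    """Subgraph induced by `nodes`, relabelled 0, ..., len(nodes) - 1."""
    index = {v: a for a, v in enumerate(nodes)}
    return [{index[w] for w in adj[v] if w in index} for v in nodes]

rng = random.Random(1)
n, lam, mu = 400, 9.0, 3.0
m, p = balanced_sparse_parameters(n, lam, mu)
adj = sample_rig(n, m, p, rng)
sub = induced_subgraph(adj, list(range(200)))  # a nonrandom choice of n0 = 200 nodes
```

Choosing the first 200 nodes is a nonrandom selection, which is independent of $G$ and hence fits the sampling framework of Section 2.3.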
3 Main results

3.1 Estimation of mean degree

Consider a random intersection graph $G = G(n, m, p)$ in a sparse parameter regime (2.2) with mean degree $\lambda \in (0, \infty)$, and assume that we have observed a subgraph $G^{(n_0)}$ of $G$ induced by a set of nodes $V^{(n_0)}$ of size $n_0$, as described in Section 2.3. Then a natural estimator of $\lambda$ is the normalised average degree

\[ \hat\lambda(G^{(n_0)}) = \frac{n}{n_0^2} \sum_{i \in V^{(n_0)}} \deg_{G^{(n_0)}}(i). \tag{3.1} \]

This estimator is asymptotically unbiased because by (2.1),

\[ E \hat\lambda(G^{(n_0)}) = \frac{n}{n_0} (n_0 - 1) P(ij \in E(G)) \sim \lambda. \]

The following result provides a sufficient condition for the consistency of the estimator of the mean degree $\lambda$, i.e., $\hat\lambda \to \lambda$ in probability as $n \to \infty$.

Theorem 3.1. For a random intersection graph $G = G(n, m, p)$ in a sparse parameter regime (2.2), the estimator of $\lambda$ defined by (3.1) is consistent when $n_0 \gg n^{1/2}$. Moreover, $\hat\lambda(G^{(n_0)}) = \lambda + O_p(n^{1/2}/n_0)$ for $m \gg n_0^2/n \gg 1$.

3.2 Transitivity coefficient

For a random or nonrandom graph $G$ with maximum degree at least two, the transitivity coefficient (a.k.a. global clustering coefficient [12, 16]) is defined by

\[ t(G) = \frac{3 N_{K_3}(G)}{N_{S_2}(G)} \tag{3.2} \]

and the model transitivity coefficient by

\[ \tau(G) = \frac{3 E N_{K_3}(G)}{E N_{S_2}(G)}, \]

where $N_{K_3}(G)$ is the number of triangles² and $N_{S_2}(G)$ is the number of 2-stars³ in $G$. The above definitions are motivated by noting that

\[ t(G) = P_G( I_2 I_3 \in E(G) \mid I_1 I_2 \in E(G),\ I_1 I_3 \in E(G) ), \]
\[ \tau(G) = P( I_2 I_3 \in E(G) \mid I_1 I_2 \in E(G),\ I_1 I_3 \in E(G) ), \]

for an ordered 3-tuple of distinct nodes $(I_1, I_2, I_3)$ selected uniformly at random and independently of $G$, where $P_G$ refers to conditional probability given an observed realisation of $G$. The model transitivity coefficient $\tau(G)$ is a nonrandom quantity which depends on the random graph model $G$ only via its probability distribution, and is often easier to analyse than its empirical counterpart. Although $\tau(G) \ne E\, t(G)$ in general, it is widely believed that $\tau(G)$ is a good approximation of $t(G)$ in large and sparse graphs [4, 8]. The following result confirms this in the context of binomial random intersection graphs.

² Subgraphs isomorphic to the graph $K_3$ with $V(K_3) = \{1, 2, 3\}$ and $E(K_3) = \{12, 13, 23\}$.
³ Subgraphs isomorphic to the graph $S_2$ with $V(S_2) = \{1, 2, 3\}$ and $E(S_2) = \{12, 13\}$.

Theorem 3.2. Consider a random intersection graph $G = G(n, m, p)$ in a balanced sparse parameter regime (2.3). If $n_0 \gg n^{2/3}$, then

\[ t(G^{(n_0)}) = \frac{1}{1 + \mu} + o_p(1). \tag{3.3} \]

It has been observed (with a slightly different parameterisation) in [8] that the model transitivity coefficient of the random intersection graph $G = G(n, m, p)$ satisfies

\[ \tau(G) = \begin{cases} 1 + o(1), & p \ll m^{-1}, \\ \frac{1}{1 + \mu} + o(1), & p \sim \mu m^{-1}, \\ o(1), & m^{-1} \ll p \ll m^{-1/2}, \end{cases} \]

and only depends on $n$ via the scale parameter. Hence, as a consequence of Theorem 3.2, it follows that $t(G) = \tau(G) + o_p(1)$ for large random intersection graphs $G = G(n, m, p)$ in the balanced sparse parameter regime (2.3).

3.3 Estimation of attribute intensity

Consider a random intersection graph $G = G(n, m, p)$ in a balanced sparse parameter regime (2.3) with mean degree $\lambda \in (0, \infty)$ and attribute intensity $\mu \in (0, \infty)$, and assume that we have observed a subgraph $G^{(n_0)}$ of $G$ induced by a set of nodes $V^{(n_0)}$ of size $n_0$, as described in Section 2.3. We will now introduce two estimators for the attribute intensity $\mu$.

The first estimator of $\mu$ is motivated by the connection between the empirical and model transitivity coefficients established in Theorem 3.2.
By ignoring the error term in (3.3), plugging the observed subgraph $G^{(n_0)}$ into the definition of the transitivity coefficient (3.2), and solving for $\mu$, we obtain an estimator

\[ \hat\mu_1(G^{(n_0)}) = \frac{N_{S_2}(G^{(n_0)})}{3 N_{K_3}(G^{(n_0)})} - 1. \tag{3.4} \]

An alternative estimator of $\mu$ is given by

\[ \hat\mu_2(G^{(n_0)}) = \left( \frac{n_0 N_{S_2}(G^{(n_0)})}{2 N_{K_2}(G^{(n_0)})^2} - 1 \right)^{-1}, \tag{3.5} \]

where $N_{K_2}(G^{(n_0)}) = |E(G^{(n_0)})|$. A heuristic derivation of the above formula is as follows. For a random intersection graph $G$ in the balanced sparse parameter regime (2.3), the expected number of 2-stars in $G^{(n_0)}$ is asymptotically (see Section 5)

\[ E N_{S_2}(G^{(n_0)}) \sim 3 \binom{n_0}{3} (mp^3 + m^2 p^4) \sim \tfrac{1}{2} n_0^3 \mu^3 (1 + \mu) m^{-2} \]

and the expectation of $N_{K_2}(G^{(n_0)}) = |E(G^{(n_0)})|$ is asymptotically

\[ E N_{K_2}(G^{(n_0)}) \sim \binom{n_0}{2} mp^2 \sim \tfrac{1}{2} n_0^2 \mu^2 m^{-1}. \]

Hence

\[ \frac{E N_{S_2}(G^{(n_0)})}{( E N_{K_2}(G^{(n_0)}) )^2} \sim \frac{2}{n_0} (1 + \mu^{-1}), \]

so by omitting the expectations above and solving for $\mu$ we obtain (3.5).

The following result confirms that both of the above heuristic derivations yield consistent estimators for the attribute intensity when the observed subgraph is large enough.

Theorem 3.3. For a random intersection graph $G = G(n, m, p)$ in a balanced sparse parameter regime (2.3), the estimators of $\mu$ defined by (3.4) and (3.5) are consistent when $n_0 \gg n^{2/3}$.

3.4 Computational complexity of the estimators

The evaluation of the estimator $\hat\lambda$ given by (3.1) requires computing the degrees of the nodes in the observed subgraph $G^{(n_0)}$. This can be done in $O(n_0 d_{\max})$ time, where $d_{\max}$ denotes the maximum degree of $G^{(n_0)}$. Evaluating the estimator $\hat\mu_1$ given by (3.4) requires counting the number of triangles in $G^{(n_0)}$, which is a nontrivial task for very large graphs.
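The estimators (3.1), (3.4) and (3.5) translate directly into code. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: the observed subgraph is assumed to be given as a list of adjacency sets with nodes labelled 0, ..., n0 − 1, the names `subgraph_counts` and `estimate_parameters` are hypothetical, triangles are counted by intersecting neighbour sets, and `None` is returned when an estimator is undefined (no triangles, no edges, or a zero denominator).

```python
from math import comb

def subgraph_counts(adj):
    """Return (N_K2, N_S2, N_K3): numbers of links, 2-stars and triangles."""
    degrees = [len(neighbours) for neighbours in adj]
    n_k2 = sum(degrees) // 2                  # every link is counted from both ends
    n_s2 = sum(comb(d, 2) for d in degrees)   # 2-stars centred at each node
    n_k3 = 0
    for i, neighbours in enumerate(adj):
        for j in neighbours:
            if j > i:
                # Common neighbours k > j, so each triangle i < j < k is counted once.
                n_k3 += sum(1 for k in neighbours & adj[j] if k > j)
    return n_k2, n_s2, n_k3

def estimate_parameters(adj, n):
    """Return (lambda_hat, mu1_hat, mu2_hat) following (3.1), (3.4) and (3.5)."""
    n0 = len(adj)
    n_k2, n_s2, n_k3 = subgraph_counts(adj)
    lam_hat = n / n0**2 * (2 * n_k2)          # the sum of degrees equals 2 N_K2
    mu1_hat = n_s2 / (3 * n_k3) - 1 if n_k3 > 0 else None
    mu2_hat = None
    if n_k2 > 0:
        ratio = n0 * n_s2 / (2 * n_k2**2)
        if ratio != 1:
            mu2_hat = 1 / (ratio - 1)
    return lam_hat, mu1_hat, mu2_hat

# A triangle {0, 1, 2} with a pendant node 3 attached to node 2.
toy = [{1, 2}, {0, 2}, {0, 1, 3}, {2}]
counts = subgraph_counts(toy)                 # (4, 5, 1)
estimates = estimate_parameters(toy, n=4)
```

On such a tiny graph the estimates are of course meaningless (here $\hat\mu_2$ is even negative); the example only checks the bookkeeping of the counts.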
A naive triangle-counting algorithm requires $O(n_0^3)$ time, a listing method can accomplish the task in $O(n_0 d_{\max}^2)$ time, and there also exist various more advanced algorithms [18].

The estimator $\hat\mu_2$ given by (3.5) can be computed without counting triangles. In fact, computing $\hat\mu_2$ only requires evaluating the degrees of the nodes in $G^{(n_0)}$. Namely, with the help of the formulas

\[ N_{K_2}(G^{(n_0)}) = \frac{1}{2} \sum_{i \in V^{(n_0)}} \deg_{G^{(n_0)}}(i) \quad \text{and} \quad N_{S_2}(G^{(n_0)}) = \sum_{i \in V^{(n_0)}} \binom{\deg_{G^{(n_0)}}(i)}{2}, \]

one can verify that

\[ \hat\mu_2(G^{(n_0)}) = \left( \frac{a_2 - a_1}{a_1^2} - 1 \right)^{-1}, \]

where $a_k = n_0^{-1} \sum_{i \in V^{(n_0)}} \deg_{G^{(n_0)}}(i)^k$ denotes the $k$-th moment of the empirical degree distribution of $G^{(n_0)}$.

We conclude that the parameters $(\lambda, \mu)$ of the random intersection graph $G = G(n, m, p)$ in the balanced sparse parameter regime (2.3) can be consistently estimated in $O(n_0 d_{\max})$ time using the estimators $\hat\lambda$ and $\hat\mu_2$.

4 Numerical experiments

In this section we study the non-asymptotic behaviour of the parameter estimators $\hat\lambda$ (3.1), $\hat\mu_1$ (3.4), and $\hat\mu_2$ (3.5) using simulated data.

In the first experiment, a random intersection graph was generated for each $n = 50, 70, \dots, 1000$, using parameter values $(\lambda = 9, \mu = 3)$ and $(\lambda = 2, \mu = 0.5)$. All of the data was used for estimation, i.e., $n_0 = n$.

[Figure 1, two panels with horizontal axis $n$ and curves $\hat\lambda$, $\hat\mu_1$, $\hat\mu_2$: (a) $\lambda = 9$, $\mu = 3$; (b) $\lambda = 2$, $\mu = 0.5$.]
Fig. 1: Simulated values of the estimators $\hat\lambda$, $\hat\mu_1$, and $\hat\mu_2$ with $n_0 = n$. The solid curves show the theoretical values of the estimators when the feature counts $N_*(G^{(n)})$ are replaced by their expected values.

[Figure 2, two histograms with horizontal axes $\hat\mu_1$ and $\hat\mu_2$.]
Fig. 2: 1000 simulated values of $\hat\mu_1$ and $\hat\mu_2$ with $\lambda = 9$, $\mu = 3$, and $n_0 = n = 750$.

Figure 1 shows the computed estimates $\hat\lambda$, $\hat\mu_1$, and $\hat\mu_2$ for each $n$. For comparison, the theoretical values of these estimators are also shown when the counts of links, 2-stars, and triangles are replaced by their expected values in (3.1), (3.4), and (3.5). With $(\lambda = 9, \mu = 3)$, the parameter $\mu$ is generally underestimated by $\hat\mu_1$ and overestimated by $\hat\mu_2$. The errors in $\hat\mu_1$ appear to be dominated by the bias, whereas the errors in $\hat\mu_2$ are mostly due to variance. With $(\lambda = 2, \mu = 0.5)$, the simulated graphs are sparser. The differences between the two estimators of $\mu$ are small, and the relative error of $\hat\lambda$ appears to have increased. The discontinuities of the theoretical values of $\hat\lambda$ are due to the rounding of the numbers of attributes $m$.

In the second experiment, 1000 random intersection graphs were simulated with $n_0 = n = 750$ and $(\lambda = 9, \mu = 3)$. Histograms of the estimates of $\mu$ are shown in Figure 2. The bias is visible in both $\hat\mu_1$ and $\hat\mu_2$, and the variance of $\hat\mu_2$ is larger than that of $\hat\mu_1$. However, the difference in accuracy is counterbalanced by the fact that $\hat\mu_1$ requires counting the triangles.

5 Proofs

5.1 Covering densities of subgraphs

Denote by $\mathrm{Pow}(\Omega)$ the collection of all subsets of $\Omega$. For $\mathcal{A}, \mathcal{B} \subset \mathrm{Pow}(\Omega)$ we denote $\mathcal{A} \Subset \mathcal{B}$ and say that $\mathcal{B}$ is a covering family of $\mathcal{A}$, if for every $A \in \mathcal{A}$ there exists $B \in \mathcal{B}$ such that $A \subset B$. A covering family $\mathcal{B}$ of $\mathcal{A}$ is called minimal if for any $B \in \mathcal{B}$,
(i) the family obtained by removing $B$ from $\mathcal{B}$ is not a covering family of $\mathcal{A}$, and
(ii) the family obtained by replacing $B$ by a strict subset of $B$ is not a covering family of $\mathcal{A}$.

For a graph $R = (V(R), E(R))$, we denote by $\mathrm{MCF}(R)$ the set of minimal covering families of $E(R)$. Note that all members of a minimal covering family have size at least two.
For a family of subsets $\mathcal{C} = \{C_1, \dots, C_t\}$ consisting of $t$ distinct sets, we denote $|\mathcal{C}| = t$ and $\|\mathcal{C}\| = \sum_{s=1}^{t} |C_s|$. The notation $R \subset G$ means that $R$ is a subgraph of $G$. The following result is similar in spirit to [13, Theorem 3], but focused on subgraph frequencies instead of appearance thresholds.

Theorem 5.1. If $mp^2 \ll 1$, then for any finite graph $R$ not depending on the scale parameter,

\[ P(G \supset R) \sim \sum_{\mathcal{C} \in \mathrm{MCF}(R)} m^{|\mathcal{C}|} p^{\|\mathcal{C}\|}. \]

The proof of Theorem 5.1 is based on two auxiliary results which are presented first.

Lemma 5.2. For any intersection graph $G$ on $\{1, \dots, n\}$ generated by attribute sets $\mathcal{V} = \{V_1, \dots, V_n\}$ and any graph $R$ with $V(R) \subset V(G)$, the following are equivalent:
(i) $R \subset G$.
(ii) $E(R) \Subset \mathcal{V}$.
(iii) There exists a family $\mathcal{C} \in \mathrm{MCF}(R)$ such that $E(R) \Subset \mathcal{C} \Subset \mathcal{V}$.

Proof. (i) $\iff$ (ii). Observe that a node pair $e \in \binom{V}{2}$ satisfies $e \in E(G)$ if and only if $e \subset V_j$ for some $V_j \in \mathcal{V}$. Hence $E(R) \subset E(G)$ if and only if for every $e \in E(R)$ there exists $V_j \in \mathcal{V}$ such that $e \subset V_j$, or equivalently, $E(R) \Subset \mathcal{V}$.

(ii) $\implies$ (iii). If $E(R) \Subset \mathcal{V}$, define $C_j = V_j \cap V(R)$. Then $\mathcal{C} = \{C_1, \dots, C_m\}$ is a covering family of $E(R)$. Next, test whether $\mathcal{C}$ still remains a covering family of $E(R)$ if one of its members is removed. If yes, remove the member of $\mathcal{C}$ with the highest label. Repeat this procedure until we obtain a covering family $\mathcal{C}'$ of $E(R)$ for which no member can be removed. Then test whether some $C \in \mathcal{C}'$ can be replaced by a strict subset of $C$. If yes, do this replacement, and repeat this procedure until we obtain a covering family $\mathcal{C}''$ of $E(R)$ for which no member can be shrunk in this way. This mechanism implies that $\mathcal{C}''$ is a minimal covering family of $E(R)$, for which $E(R) \Subset \mathcal{C}'' \Subset \mathcal{V}$.

(iii) $\implies$ (ii). Follows immediately from the transitivity of $\Subset$.

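For small graphs $R$, the set $\mathrm{MCF}(R)$ and the exponents appearing in Theorem 5.1 can be enumerated by brute force. The Python sketch below is an illustration only, with hypothetical function names. It relies on two observations that follow from the definition of minimality: members of a minimal covering family are subsets of $V(R)$ of size at least two, and a minimal family has at most $|E(R)|$ members because by (i) every member covers some edge that no other member covers. Condition (ii) is checked by deleting one element at a time, which suffices because any family that covers after a larger shrinkage also covers after a one-element shrinkage.

```python
from itertools import combinations

def covers(family, edges):
    """True if every edge is contained in some member of the family."""
    return all(any(e <= member for member in family) for e in edges)

def is_minimal(family, edges):
    """Check conditions (i) and (ii) for a covering family of the edges."""
    for member in family:
        rest = family - {member}
        if covers(rest, edges):                       # (i): member can be removed
            return False
        for v in member:                              # (ii): member can be shrunk
            if covers(rest | {member - {v}}, edges):
                return False
    return True

def minimal_covering_families(edge_list):
    """Enumerate MCF(R) for a small graph R given by its list of edges."""
    edges = [frozenset(e) for e in edge_list]
    nodes = sorted(set().union(*edges))
    candidates = [frozenset(c)
                  for size in range(2, len(nodes) + 1)
                  for c in combinations(nodes, size)]
    families = []
    for t in range(1, len(edges) + 1):
        for members in combinations(candidates, t):
            family = frozenset(members)
            if covers(family, edges) and is_minimal(family, edges):
                families.append(family)
    return families

def density_terms(families):
    """Sorted exponent pairs (|C|, ||C||) of the terms m^|C| p^||C||."""
    return sorted((len(f), sum(len(c) for c in f)) for f in families)

triangle = minimal_covering_families([(1, 2), (1, 3), (2, 3)])
path3 = minimal_covering_families([(1, 2), (2, 3), (3, 4)])
cycle4 = minimal_covering_families([(1, 2), (2, 3), (3, 4), (1, 4)])
```

For the triangle this yields the exponent pairs (1, 3) and (3, 6), that is, the terms $mp^3$ and $m^3 p^6$. The search is exponential in the size of $R$ and is meant only for graphs with a handful of nodes.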
Lemma 5.3. If $mp^2 \ll 1$, then for any scale-independent finite collection $\mathcal{C} = \{C_1, \dots, C_t\}$ of finite subsets of $\{1, 2, \dots\}$ of size at least 2, the probability that the family of attribute sets $\mathcal{V} = \{V_1, \dots, V_n\}$ of $G = G(n, m, p)$ is a covering family of $\mathcal{C}$ satisfies

\[ P(\mathcal{V} \Supset \mathcal{C}) \sim m^{|\mathcal{C}|} p^{\|\mathcal{C}\|}. \]

Proof. For $s = 1, \dots, t$, denote by $N_s = \sum_{j=1}^{m} 1(V_j \supset C_s)$ the number of attribute sets covering $C_s$. Note that $N_s$ follows a binomial distribution with parameters $m$ and $p^{|C_s|}$. Because $|C_s| \ge 2$, it follows that the mean of $N_s$ satisfies $mp^{|C_s|} \le mp^2 \ll 1$. Using elementary computations related to the binomial distribution (see e.g. [13, Lemmas 1, 2]) it follows that the random integers $N_1, \dots, N_t$ are asymptotically independent with $P(N_s \ge 1) \sim mp^{|C_s|}$, so that

\[ P(\mathcal{V} \Supset \mathcal{C}) = P(N_1 \ge 1, \dots, N_t \ge 1) \sim \prod_{s=1}^{t} P(N_s \ge 1) \sim m^{t} p^{\sum_{s=1}^{t} |C_s|}. \]

Proof (of Theorem 5.1). By Lemma 5.2, we see that

\[ P(G \supset R) = P\Big( \bigcup_{\mathcal{C} \in \mathrm{MCF}(R)} \{ \mathcal{V} \Supset \mathcal{C} \} \Big). \]

Bonferroni's inequalities hence imply $U_1 - U_2 \le P(G \supset R) \le U_1$, where

\[ U_1 = \sum_{\mathcal{C} \in \mathrm{MCF}(R)} P(\mathcal{V} \Supset \mathcal{C}) \quad \text{and} \quad U_2 = \sum_{\mathcal{C}, \mathcal{D}} P(\mathcal{V} \Supset \mathcal{C},\ \mathcal{V} \Supset \mathcal{D}), \]

and the latter sum is taken over all unordered pairs of distinct minimal covering families $\mathcal{C}, \mathcal{D} \in \mathrm{MCF}(R)$. Note that by Lemma 5.3,

\[ U_1 \sim \sum_{\mathcal{C} \in \mathrm{MCF}(R)} m^{|\mathcal{C}|} p^{\|\mathcal{C}\|}, \]

so to complete the proof it suffices to verify that $U_2 \ll U_1$.

Fix some minimal covering families $\mathcal{C} = \{C_1, \dots, C_s\}$ and $\mathcal{D} = \{D_1, \dots, D_t\}$ of $E(R)$ such that $\mathcal{C} \ne \mathcal{D}$. Then either $\mathcal{C}$ has a member such that $C_i \notin \mathcal{D}$, or $\mathcal{D}$ has a member such that $D_j \notin \mathcal{C}$. In the former case $\mathcal{C} \cup \mathcal{D} \supset \{C_i, D_1, \dots, D_t\}$, so that by Lemma 5.3,

\[ P(\mathcal{V} \Supset \mathcal{C} \cup \mathcal{D}) \le P(\mathcal{V} \Supset \{C_i, D_1, \dots, D_t\}) \sim m^{t+1} p^{|C_i| + \sum_{j=1}^{t} |D_j|} \sim mp^{|C_i|} P(\mathcal{V} \Supset \mathcal{D}). \]
Because $\mathcal{C}$ is a minimal covering family, $|C_i| \ge 2$, and $mp^{|C_i|} \le mp^2 \ll 1$, and hence $P(\mathcal{V} \Supset \mathcal{C} \cup \mathcal{D}) \ll P(\mathcal{V} \Supset \mathcal{D})$. In the latter case where $\mathcal{D}$ has a member such that $D_j \notin \mathcal{C}$, a similar reasoning shows that $P(\mathcal{V} \Supset \mathcal{C} \cup \mathcal{D}) \ll P(\mathcal{V} \Supset \mathcal{C})$. We may hence conclude that

\[ P(\mathcal{V} \Supset \mathcal{C},\ \mathcal{V} \Supset \mathcal{D}) = P(\mathcal{V} \Supset \mathcal{C} \cup \mathcal{D}) \ll P(\mathcal{V} \Supset \mathcal{C}) + P(\mathcal{V} \Supset \mathcal{D}) \]

for all distinct $\mathcal{C}, \mathcal{D} \in \mathrm{MCF}(R)$. Therefore, the proof is completed by

\[ U_2 \ll \sum_{\mathcal{C}, \mathcal{D}} \big( P(\mathcal{V} \Supset \mathcal{C}) + P(\mathcal{V} \Supset \mathcal{D}) \big) \le 2\, |\mathrm{MCF}(R)|\, U_1. \]

5.2 Covering densities of certain subgraphs

In order to bound the variances of subgraph counts we will use the covering densities of (partially) overlapping pairs of 2-stars and triangles. Figure 3 displays the graphs obtained as a union of two partially overlapping triangles. Figure 4 displays the graphs produced by overlapping 2-stars.

Fig. 3: Graphs obtained as unions of overlapping triangles.

Fig. 4: Graphs obtained as unions of overlapping 2-stars.

According to Theorem 5.1, the covering densities of subgraphs may be computed from their minimal covering families. For a triangle $R$ with $V(R) = \{1, 2, 3\}$ and $E(R) = \{12, 13, 23\}$, the minimal covering families are⁴ $\{123\}$ and $\{12, 13, 23\}$. The minimal covering families of a 3-path $R$ with $V(R) = \{1, 2, 3, 4\}$ and $E(R) = \{12, 23, 34\}$ are given by $\{1234\}$, $\{12, 234\}$, $\{123, 34\}$, and $\{12, 23, 34\}$.

⁴ For clarity, we write 12 and 123 as shorthands of the sets $\{1, 2\}$ and $\{1, 2, 3\}$.

The covering densities of stars are found as follows. Fix $r \ge 1$, and let $R$ be the $r$-star such that $V(R) = \{1, 2, \dots, r + 1\}$ and $E(R) = \{\{1, r + 1\}, \{2, r + 1\}, \dots, \{r, r + 1\}\}$. The minimal covering families of $R$ are of the form $\mathcal{C} = \{S \cup \{r + 1\} : S \in \mathcal{S}\}$, where $\mathcal{S}$ is a partition of the leaf set $\{1, \dots, r\}$ into nonempty subsets. For any such $\mathcal{C}$ we have $|\mathcal{C}| = |\mathcal{S}|$ and $\|\mathcal{C}\| = r + |\mathcal{S}|$.
Hence

\[ P(G \supset r\text{-star}) \sim \sum_{k=1}^{r} {r \brace k}\, m^k p^{k+r}, \]

where ${r \brace k}$ equals the number of partitions of $\{1, \dots, r\}$ into $k$ nonempty sets. These coefficients are known as Stirling numbers of the second kind [11] and can be computed via ${r \brace k} = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} j^r$. Hence,

\[ P(G \supset r\text{-star}) \sim \begin{cases} mp^3 + m^2 p^4, & r = 2, \\ mp^4 + 3 m^2 p^5 + m^3 p^6, & r = 3, \\ mp^5 + 7 m^2 p^6 + 6 m^3 p^7 + m^4 p^8, & r = 4. \end{cases} \]

Table 1 summarises approximate covering densities of overlapping pairs of 2-stars and triangles. The table is computed by first listing all minimal covering families of the associated subgraphs, as shown in Table 2. We also use the following observations (for $p \ll m^{-1/2} \ll 1$) to cancel some of the redundant terms in the expressions.

4-path: $m^2 p^7 \ll m^2 p^6$ and $m^3 p^8 \ll m^3 p^7$
4-cycle: $m^2 p^6 \ll mp^4$
3-pan: $m^3 p^7 \ll m^2 p^5$
Diamond: $m^2 p^6 \ll mp^4$ and $m^4 p^9 \ll m^3 p^7$
Butterfly: $m^5 p^{12} \ll m^5 p^{11}$, $m^4 p^{11} \ll m^3 p^8 \ll m^2 p^6$, $m^3 p^{10} \le m^3 p^8 \ll m^2 p^6$

R          |V(R)|  |E(R)|  Appr. density ($p \ll m^{-1/2} \ll 1$)                Appr. density ($p \sim \mu m^{-1}$)
1-star     2       1       $mp^2$                                                $\mu^2 m^{-1}$
2-star     3       2       $mp^3 + m^2 p^4$                                      $(1 + \mu)\mu^3 m^{-2}$
3-cycle    3       3       $mp^3 + m^3 p^6$                                      $\mu^3 m^{-2}$
3-star     4       3       $mp^4 + 3 m^2 p^5 + m^3 p^6$                          $(1 + 3\mu + \mu^2)\mu^4 m^{-3}$
3-path     4       3       $mp^4 + 2 m^2 p^5 + m^3 p^6$                          $(1 + 2\mu + \mu^2)\mu^4 m^{-3}$
4-cycle    4       4       $mp^4 + 4 m^3 p^7 + m^4 p^8$                          $\mu^4 m^{-3}$
3-pan      4       4       $mp^4 + m^2 p^5 + m^4 p^8$                            $(1 + \mu)\mu^4 m^{-3}$
Diamond    4       5       $mp^4 + 2 m^3 p^7 + m^5 p^{10}$                       $\mu^4 m^{-3}$
4-star     5       4       $mp^5 + 7 m^2 p^6 + 6 m^3 p^7 + m^4 p^8$              $(1 + 7\mu + 6\mu^2 + \mu^3)\mu^5 m^{-4}$
4-path     5       4       $mp^5 + 3 m^2 p^6 + 3 m^3 p^7 + m^4 p^8$              $(1 + 3\mu + 3\mu^2 + \mu^3)\mu^5 m^{-4}$
Chair      5       4       $mp^5 + 4 m^2 p^6 + 4 m^3 p^7 + m^4 p^8$              $(1 + 4\mu + 4\mu^2 + \mu^3)\mu^5 m^{-4}$
Butterfly  5       6       $mp^5 + m^2 p^6 + 2 m^4 p^9 + 4 m^5 p^{11} + m^6 p^{12}$  $(1 + \mu)\mu^5 m^{-4}$

Table 1: Approximate densities of some subgraphs.

5.3 Proofs of Theorems 3.1, 3.2, and 3.3

Proof (of Theorem 3.1).
Denote $\hat\lambda = \hat\lambda(G^{(n_0)})$ and $\hat N = N_{K_2}(G^{(n_0)})$. Then the variance of $\hat\lambda$ is given by

\[ \mathrm{Var}(\hat\lambda) = \frac{4 n^2}{n_0^4} \mathrm{Var}(\hat N). \tag{5.1} \]

By writing

\[ \hat N = \sum_{e \in \binom{[n_0]}{2}} 1(G \supset e) \quad \text{and} \quad \hat N^2 = \sum_{e \in \binom{[n_0]}{2}} \sum_{e' \in \binom{[n_0]}{2}} 1(G \supset e)\, 1(G \supset e'), \]

we find that

\[ E \hat N = \binom{n_0}{2} P(G \supset K_2) \]

and

\[ E \hat N^2 = \binom{n_0}{2} P(G \supset K_2) + 2 (n_0 - 2) \binom{n_0}{2} P(G \supset S_2) + \binom{n_0}{2} \binom{n_0 - 2}{2} P(G \supset K_2)^2. \]

Because the last term above is bounded by

\[ \binom{n_0}{2} \binom{n_0 - 2}{2} P(G \supset K_2)^2 \le \binom{n_0}{2}^2 P(G \supset K_2)^2 = (E \hat N)^2, \]

it follows that

\[ \mathrm{Var}(\hat N) \le \binom{n_0}{2} P(G \supset K_2) + 2 (n_0 - 2) \binom{n_0}{2} P(G \supset S_2) = (1 + o(1)) \tfrac{1}{2} n_0^2\, mp^2 + (1 + o(1))\, n_0^3 (mp^3 + m^2 p^4). \]

Hence by (5.1),

\[ \mathrm{Var}(\hat\lambda) = O(n_0^{-2} n^2 mp^2) + O(n_0^{-1} n^2 mp^3) + O(n_0^{-1} n^2 m^2 p^4), \]

and by noting that $n^2 mp^2 \sim \lambda n$, $n^2 mp^3 = m^{-1/2} n^{1/2} (nmp^2)^{3/2} \sim \lambda^{3/2} m^{-1/2} n^{1/2}$ and $n^2 m^2 p^4 = (nmp^2)^2 \sim \lambda^2$, we find that

\[ \mathrm{Var}(\hat\lambda) = O\big( n_0^{-2} n + m^{-1/2} n_0^{-1} n^{1/2} + n_0^{-1} \big) = O\big( n_0^{-2} n + m^{-1/2} n_0^{-1} n^{1/2} \big), \]

where the last equality is true because $n_0^{-2} n \ge n_0^{-1}$. The claim now follows by Chebyshev's inequality.

Table 2: Minimal covering families of the subgraphs in Fig. 3 and Fig. 4 (stars excluded). Each family $\mathcal{C}$ is followed by the pair $(|\mathcal{C}|, \|\mathcal{C}\|)$.

3-cycle: {123} (1, 3); {12, 13, 23} (3, 6)
3-path: {1234} (1, 4); {123, 34} (2, 5); {234, 12} (2, 5); {12, 23, 34} (3, 6)
3-pan: {1234} (1, 4); {123, 34} (2, 5); {134, 12, 23} (3, 7); {234, 12, 13} (3, 7); {12, 13, 23, 34} (4, 8)
4-cycle: {1234} (1, 4); {123, 134} (2, 6); {124, 234} (2, 6); {123, 14, 34} (3, 7); {124, 23, 34} (3, 7); {134, 12, 23} (3, 7); {234, 12, 14} (3, 7); {12, 14, 23, 34} (4, 8)
Diamond: {1234} (1, 4); {123, 234} (2, 6); {123, 24, 34} (3, 7); {234, 12, 13} (3, 7); {124, 134, 23} (3, 8); {124, 13, 23, 34} (4, 9); {134, 12, 23, 24} (4, 9); {12, 13, 23, 24, 34} (5, 10)
Chair: {12345} (1, 5); {1234, 45} (2, 6); {1345, 23} (2, 6); {2345, 13} (2, 6); {123, 345} (2, 6); {123, 34, 45} (3, 7); {134, 23, 45} (3, 7); {234, 13, 45} (3, 7); {345, 13, 23} (3, 7); {13, 23, 34, 45} (4, 8)
4-path: {12345} (1, 5); {1234, 45} (2, 6); {2345, 12} (2, 6); {123, 345} (2, 6); {1245, 234} (2, 7); {123, 34, 45} (3, 7); {234, 12, 45} (3, 7); {345, 12, 23} (3, 7); {1245, 23, 34} (3, 8); {12, 23, 34, 45} (4, 8)
Butterfly: {12345} (1, 5); {123, 345} (2, 6); {1234, 35, 45} (3, 8); {1235, 34, 45} (3, 8); {1345, 12, 23} (3, 8); {2345, 12, 13} (3, 8); {1245, 134, 235} (3, 10); {1245, 135, 234} (3, 10); {123, 34, 35, 45} (4, 9); {345, 12, 13, 23} (4, 9); {1245, 134, 23, 25} (4, 11); {1245, 235, 13, 34} (4, 11); {1245, 135, 23, 34} (4, 11); {1245, 234, 13, 35} (4, 11); {134, 12, 23, 35, 45} (5, 11); {135, 12, 23, 34, 45} (5, 11); {234, 12, 13, 35, 45} (5, 11); {235, 12, 13, 34, 45} (5, 11); {1245, 13, 23, 34, 35} (5, 12); {12, 13, 23, 34, 35, 45} (6, 12)

Proof (of Theorem 3.2 and Theorem 3.3). The variances of $N_{S_2}$ and $N_{K_3}$ can be bounded from above in the same way that the variance of $N_{K_2}$ was bounded in the proof of Theorem 3.1. The overlapping subgraphs contributing to the variance of $N_{K_3}$ are those shown in Fig. 3. According to Table 1, the contribution of these subgraphs is $O(n_0^{|V(R)|} m^{-|V(R)| + 1})$ for $|V(R)| = 3, 4, 5$, and the nonoverlapping triangles contribute $O(n_0^6 m^{-5})$. Since $E N_{K_3}$ is of the order $n_0^3 m^{-2}$, it follows that $\mathrm{Var}(N_{K_3} / E N_{K_3}) = o(1)$ for $n_0 \gg n^{2/3}$.

The same line of proof works for $N_{S_2}$, i.e., we note that the subgraphs appearing in $\mathrm{Var}(N_{S_2})$ are those shown in Fig. 4 and their contributions to the variance are listed in Table 1. Again, it follows that $\mathrm{Var}(N_{S_2} / E N_{S_2}) = o(1)$ for $n_0 \gg n^{2/3}$. Hence we may conclude using Chebyshev's inequality that

\[ N_{K_3}(G^{(n_0)}) = (1 + o_p(1))\, E N_{K_3}(G^{(n_0)}) = (1 + o_p(1)) \binom{n_0}{3} \mu^3 m^{-2}, \]
\[ N_{S_2}(G^{(n_0)}) = (1 + o_p(1))\, E N_{S_2}(G^{(n_0)}) = (1 + o_p(1))\, 3 \binom{n_0}{3} (1 + \mu) \mu^3 m^{-2}, \]

and the claim of Theorem 3.2 follows. Further, in the proof of Theorem 3.1 we found that

\[ N_{K_2}(G^{(n_0)}) = (1 + o_p(1))\, E N_{K_2}(G^{(n_0)}) = (1 + o_p(1)) \binom{n_0}{2} \mu^2 m^{-1}. \]
Hence the claims of Theorem 3.3 follow from the above expressions combined with the continuous mapping theorem.

6 Conclusions

In this paper we discussed the estimation of parameters for a large random intersection graph model in a balanced sparse parameter regime characterised by mean degree $\lambda$ and attribute intensity $\mu$, based on a single observed instance of a subgraph induced by a set of $n_0$ nodes. We introduced moment estimators for $\lambda$ and $\mu$ based on observed frequencies of 2-stars and triangles, and described how the estimators can be computed in time proportional to the product of the maximum degree and the number of observed nodes. We also proved that in this parameter regime the statistical network model under study has a nontrivial empirical transitivity coefficient which can be approximated by a simple parametric formula in terms of $\mu$.

For simplicity, our analysis was restricted to binomial undirected random intersection graph models, and the statistical sampling scheme was restricted to induced subgraph sampling, independent of the graph structure. Extension of the obtained results to general directed random intersection graph models with general sampling schemes is left for further study and forms a part of our ongoing work.

Acknowledgments. Part of this work has been financially supported by the Emil Aaltonen Foundation, Finland. We thank Mindaugas Bloznelis for helpful discussions, and the two anonymous reviewers for helpful comments.

References

1. Ball, F.G., Sirl, D.J., Trapman, P.: Epidemics on random intersection graphs. Ann. Appl. Probab. 24(3), 1081–1128 (2014), http://dx.doi.org/10.1214/13-AAP942
2. Bickel, P.J., Chen, A., Levina, E.: The method of moments and degree distributions for network models. Ann. Statist. 39(5), 2280–2301 (2011), http://dx.doi.org/10.1214/11-AOS904
3. Bloznelis, M.: The largest component in an inhomogeneous random intersection graph with clustering. Electron. J. Combin. 17(1) (2010)
4. Bloznelis, M.: Degree and clustering coefficient in sparse random intersection graphs. Ann. Appl. Probab. 23(3), 1254–1289 (2013), http://dx.doi.org/10.1214/12-AAP874
5. Bloznelis, M., Kurauskas, V.: Clustering function: another view on clustering coefficient. Journal of Complex Networks (2015)
6. Bloznelis, M., Leskelä, L.: Diclique clustering in a directed random graph. In: Bonato, A., Graham, F.C., Prałat, P. (eds.) Algorithms and Models for the Web Graph. pp. 22–33. Springer International Publishing, Cham (2016)
7. Britton, T., Deijfen, M., Lagerås, A.N., Lindholm, M.: Epidemics on random graphs with tunable clustering. J. Appl. Probab. 45(3), 743–756 (2008), http://dx.doi.org/10.1239/jap/1222441827
8. Deijfen, M., Kets, W.: Random intersection graphs with tunable degree distribution and clustering. Probab. Eng. Inform. Sc. 23(4), 661–674 (2009), http://dx.doi.org/10.1017/S0269964809990064
9. Gleich, D.F., Owen, A.B.: Moment-based estimation of stochastic Kronecker graph parameters. Internet Math. 8(3), 232–256 (2012), http://projecteuclid.org/euclid.im/1345581012
10. Godehardt, E., Jaworski, J.: Two models of random intersection graphs and their applications. Electronic Notes in Discrete Mathematics 10, 129–132 (2001)
11. Graham, R.L., Knuth, D.E., Patashnik, O.: Concrete Mathematics. Addison-Wesley (1994)
12. van der Hofstad, R.: Random Graphs and Complex Networks - Vol. I. Cambridge University Press (2017)
13. Karoński, M., Scheinerman, E.R., Singer-Cohen, K.B.: On random intersection graphs: The subgraph problem. Combin. Probab. Comput. 8(1-2), 131–159 (1999), http://dx.doi.org/10.1017/S0963548398003459
14. Kolaczyk, E.D.: Statistical Analysis of Network Data. Springer (2009)
15. Lagerås, A.N., Lindholm, M.: A note on the component structure in random intersection graphs with tunable clustering. Electron. J. Combin. 15(1), Note 10, 8 (2008), http://www.combinatorics.org/Volume_15/Abstracts/v15i1n10.html
16. Newman, M.E.J.: The structure and function of complex networks. SIAM Review 45(2), 167–256 (2003), http://dx.doi.org/10.1137/S003614450342480
17. Nikoletseas, S., Raptopoulos, C., Spirakis, P.G.: Maximum cliques in graphs with small intersection number and random intersection graphs. In: International Symposium on Mathematical Foundations of Computer Science. pp. 728–739. Springer (2012)
18. Tsourakakis, C.E.: Fast counting of triangles in large real networks without counting: Algorithms and laws. In: 2008 Eighth IEEE International Conference on Data Mining. pp. 608–617 (Dec 2008)
19. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Cambridge University Press (1994)
