Testing for Characteristics of Attribute Linked Infinite Networks based on Small Samples
Authors: Koushiki Sarkar, Diganta Mukherjee
Indian Statistical Institute, Kolkata
October 25, 2021

Abstract

The objective of this paper is to study the characteristics (geometric and otherwise) of very large attribute based undirected networks. Real-world networks are often very large and fast evolving. Their analysis and understanding present a great challenge. An attribute based network is a graph in which the edges depend on certain properties of the vertices on which they are incident. In the context of a social network, the existence of a link between two individuals may depend on certain attributes of the two of them. We use the Lovász type sampling strategy of observing a certain random process on a graph "locally", i.e., in the neighbourhood of a node, and deriving information about "global" properties of the graph. The corresponding adjacency matrix is our primary object of interest. We study the efficiency of recently proposed sampling strategies, modified to our setup, to estimate the degree distribution, centrality measures, planarity etc. The limiting distributions are derived using recently developed probabilistic techniques for random matrices, and hence we devise relevant test statistics and confidence intervals for different parameters / hypotheses of interest. We hope that our work will be useful for social and computer scientists for designing sampling strategies and computational algorithms appropriate to their respective domains of inquiry. Extensive simulation studies are done to empirically verify the probabilistic statements made in the paper.

1 Introduction

1.1 Need for Sampling and Common Strategies

Real-world networks are often very large and fast evolving. Their analysis and understanding present a great challenge.
In the past few years, a number of different techniques have been proposed for sampling large networks to allow for their faster and more efficient analysis. Several studies on network sampling analyze the match between the original networks and their sampled variants [1, 2, 3], as well as comparing the performance of different sampling techniques [4, 5, 6]. Sampling techniques can be roughly divided into two categories: random selection and network exploration techniques. In the first category, nodes or links are included in the sample uniformly at random or proportional to some particular characteristic like degree. In the second category, the sample is constructed by retrieving a neighbourhood of a randomly selected seed node using different strategies like breadth-first search, random walk and forest-fire. On this basis, the following algorithms have been proposed in the literature.

• random node selection [4] (RNS), where the sample consists of nodes selected uniformly at random and all their mutual links
• random node selection by degree [4] (RND), where the nodes are selected randomly with probability proportional to their degrees and all their mutual links are included in the sample
• random link selection [4] (RLS), where the sample consists of links selected uniformly at random
• random link selection with subgraph induction [7] (RLI), where the sample consists of links selected uniformly at random and any additional links between their endpoints
• random walk sampling [4] (RWS), where the random walk is simulated on the network, starting at a randomly selected seed node
• forest-fire sampling [4] (FFS).
Here, a broad neighbourhood of a randomly selected seed node is retrieved by partial breadth-first search
• random walk sampling with subgraph induction (RWI) and
• forest-fire sampling with subgraph induction (FFI).

Recall the Lovász sampling strategy of observing a certain random process on a graph "locally", i.e., in the neighbourhood of a node, and deriving information about "global" properties of the graph. For example, what do you know about a graph based on observing the returns of a random walk to a given node? Almost all sampling strategies use this philosophy or some close variant. We also aim to use this as an ingredient in our sampling strategy and derive a test for such returns.

1.2 Why Attribute based Network?

An attribute based network is a graph in which the edges depend on certain properties of the vertices on which they are incident. In the context of a social network, the existence of a link between two individuals may depend on certain attributes of the two of them, for example their geographic location or socioeconomic status. We work with the underlying assumption that similar people connect to each other with higher probability. In the context of a social or a neural network, the connections between individual vertices depend on certain intrinsic qualities of the vertices themselves. It makes sense to consider the connection probabilities as a function of the vertex attributes. In earlier work (Sarkar, Ray and Mukherjee, 2015) we have shown that in the context of predictive modeling, attribute based networks are indeed worthwhile to study.

1.3 Plan of the Paper

The objective of this paper is to study the characteristics (geometric and otherwise) of very large attribute based undirected networks. The corresponding adjacency matrix is our primary object of interest.
We study the efficiency of recently proposed sampling strategies, modified to our setup, to estimate the degree distribution, centrality measures, planarity etc. The limiting distributions are derived using recently developed probabilistic techniques for random matrices, and hence we devise relevant test statistics and confidence intervals for different parameters / hypotheses of interest. We hope that our work will be useful for social and computer scientists for designing sampling strategies and computational algorithms appropriate to their respective domains of inquiry. Extensive simulation studies are done to empirically verify the probabilistic statements made in the paper.

2 Preliminaries

Let the network be represented by a simple undirected graph G = (V, E), where V denotes the set of nodes (n = |V|) and E is the set of links (m = |E|). The goal of network sampling is to create a sampled network G' = (V', E'), where V' ⊂ V, E' ⊂ E and n' = |V'| << n, m' = |E'| << m. The sample G' is obtained in two steps. In the first step, nodes or links are sampled using a particular strategy like random selection or network exploration sampling. In the second step, the sampled nodes and links are retrieved from the original network.

2.1 Notions of Centrality

A network can be characterized by various notions of centrality, whose relevance and utility are context-specific. A complex network with a heterogeneous topology might not have the same optimality properties for a single measure of centrality throughout the graph. However, for a sampling based approach without any prior idea of the graph type, it may be difficult to know which centrality measure is best suited for the study of the graph. If we operate under the simplified assumptions about the attribute-based network, then our graph structure is simplified.
Particularly, if the attribute random variables {X_i} take values in a finite set, then the set of possible connection probabilities is also finite. Then we can break a graph into different classes which are expected to have similar behavior.

Here we intend to develop a sampling analogue for finding the eigenvector centrality, which is the solution of the eigenvector equation Ax = λx. By the Perron-Frobenius theorem, for a nonnegative irreducible matrix the largest eigenvalue is real, simple and has a strictly positive eigenvector, so we only require the largest eigenvalue.

A possible generalization of eigenvector centrality as well as degree centrality is the Katz centrality. It measures the number of all nodes that can be connected via a path to the vertex in question, while the contributions of distant nodes are devalued. It is mathematically written as

x_i = Σ_k Σ_j α^k (A^k)_{ji}.

Katz centrality can be viewed as a variant of eigenvector centrality. Another form of Katz centrality is

x_i = α Σ_{j=1}^N a_ij (x_j + 1).

Compared to the expression of eigenvector centrality, x_j is replaced by x_j + 1. It is shown that the principal eigenvector (associated with the largest eigenvalue λ of A, the adjacency matrix) is the limit of Katz centrality as α approaches 1/λ from below.

2.2 Assumptions on the Network

Basic assumptions: We denote by X_i the attribute of the class i, where i is assumed to take only finitely many values. In a population, there can be infinitely many people with the same attribute. Let us consider all of them members of the class i. Call it c_i.

• Looking at the degree distribution is not very meaningful, as even if we know that the degrees are distributed by a power law or normally, we still do not know what the degree should be for an individual node.
• The degree needs to be a specific property of a node for us to meaningfully select a node.
In the context of social networks, it makes sense to consider vertices appended with attributes. An edge or connection can be considered to be dependent on the attributes of the involved nodes.

• We consider the accessory variable X_i appended to each vertex i.
• Call the indicator function δ_ij = 1 if i and j are connected, and 0 otherwise.
• We need to look at p_ij = P(δ_ij = 1 | X_i, X_j). For an Erdős-Rényi graph, this is the unconditional probability, the same for all i, j.
• We make some assumptions on p_ij. Even if i and j belong to the same class and hence share the same attributes, p_ij ≠ 1. p_ij is assumed to be bounded away from 1 and 0.
• Fix an i. Consider argmax_i Σ_j P(δ_ij = 1 | X_i, X_j). This can be approximated by argmax_i E(d | X_i), where d is the degree.
• The {X_n} sequence, if stochastic, is assumed to form a Markov random field. We have a sort of dependence structure within a neighbourhood; distant nodes can be assumed to be more or less independent. In particular, if {X_n} is finite, then p_ij also takes finitely many values.

3 Probabilistic Formulation

Instead of the adjacency matrix, we can consider the matrix P = ((p_ij)), the matrix of the probabilities. If we consider a random motion on the graph, we consider p_ij to be a transition probability on the graph. A possible notion of centrality in this context is whether the vertex is recurrent or transient. A recurrent vertex indicates that there are multiple paths leading back to the vertex. This is somewhat analogous to betweenness centrality. We need, however, to note that in a large graph modeled on a social network, which would be mostly sparse with intermediate densely connected cliques, the actual betweenness for all but a few vertices would be vanishingly small. On a global scale these vertices may not be important; but their local influence cannot be dismissed.
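As a concrete sketch of this formulation, the following builds a probability matrix P = ((p_ij)) that depends only on the vertex attributes and is bounded away from 0 and 1, and then row-normalises it into a transition matrix for the random motion just described. The attribute set, the distance and the decreasing function used here are assumptions made for this illustration only, not the paper's specification.

```python
import numpy as np

# Illustrative sketch: connection probabilities p_ij that depend only on the
# vertex attributes, bounded away from 0 and 1, viewed as transition weights
# for a random motion on the graph. Attribute values, the metric and the
# decreasing function are assumptions for this example only.
rng = np.random.default_rng(0)

attrs = rng.integers(0, 3, size=8)      # attribute X_i from a finite set {0, 1, 2}
eps = 0.05                              # keeps every p_ij inside [eps, 1 - eps]

def p_connect(xi, xj):
    d = abs(xi - xj)                    # a simple metric on attribute values
    return eps + (1 - 2 * eps) * (1 - min(1, d / 2))  # decreasing in d

n = len(attrs)
P = np.array([[0.0 if i == j else p_connect(attrs[i], attrs[j])
               for j in range(n)] for i in range(n)])

# Row-normalise P to obtain the transition matrix of the random walk.
T = P / P.sum(axis=1, keepdims=True)
```

Since the attributes take only three values, P has only finitely many distinct entries, matching the observation that p_ij takes finitely many values when {X_n} is finite.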
We look at an irreducible aperiodic subset of the graph. If the motion is considered to be Markov, then, noting that transience and recurrence are solidarity properties, we attempt to verify that using our model. If d is a metric defined on the σ-field generated by the random variables {X_n}, then consider p_ij = f(d(X_i, X_j)), where f is a decreasing function of d and bounded in [0, 1]. An easy example is d'(x, y) = 1 − min(1, d(x, y)). Again, if {X_n} is finite (or countable), d only takes finitely (countably) many values and consequently the set of values of p is also finite (countable).

If {X_n} is known, then P is also completely known, as is P^k. In principle, we can also calculate whether Σ_n f_ii^(n) < 1, where f_ii^(n) = Pr(T_i = n) is the probability that the first return time T_i to vertex i equals n. If this holds, then the vertex is transient, else recurrent.

3.1 Degree Distribution

We are estimating p_ij = P(δ_ij = 1 | X_i, X_j). Let P = ((p_ij)) and E = ((δ_ij)), symmetric with δ_ii = 0 for all i. So δ_ij ~ Bernoulli(p_ij); symbolically, E ~ Bernoulli(P). Let the degree be d_i = Σ_j δ_ij. What is the distribution of δ_ij | d_i; should it be of "hypergeometric" type? So can we use iterated expectation in the following way? First condition on the row total d_i to use the "hypergeometric type" calculation for (δ_ij | X_i, X_j). Then take expectation over d_i ~ F(· | X_i) following from the ERG model. If we can show that the distribution of d_i is of "binomial type", then for fixed |V| = n we can do the calculation and then take the limit n → ∞.

Assuming {X_i} to be non-stochastic, if the degree of the i-th vertex is d_i, we have d_i = Σ_{j=1, j≠i}^n δ_ij, where δ_ij is the indicator variable which is 1 if there is a connection between the i-th and the j-th vertices. If p_ij is the connection probability of the i-th and the j-th vertex, then p_ij = f(x_i, x_j) is completely known.
Consider the degree proportion d_i/(n−1) = (1/(n−1)) Σ_{j=1, j≠i}^n δ_ij. We also assume that the connections depend entirely on the two involved vertices and not on other factors, so the δ_ij's are independent. The distribution of d_i can be explicitly obtained by the results from Wang's paper [2].

Proposition: By an easy application of the Lyapunov condition for the central limit theorem for independent but not identically distributed random variables, we have the large sample distribution

(Σ_{j≠i} (δ_ij − p_ij)) / √(Σ_{j≠i} p_ij(1 − p_ij)) → N(0, 1) in law,

as long as p_ij is bounded away from 0 and 1. The left hand side is a standardised degree, a scaled degree density d'_i/√n.

Also note that under this structure,

Cov(d_i, d_k) = Cov(Σ_{j≠i} δ_ij, Σ_{l≠k} δ_kl) = Σ_{j≠i} Σ_{l≠k} Cov(δ_ij, δ_kl).

Under the condition of independence, the only common term is δ_ik, so this equals Cov(δ_ik, δ_ik) = Var(δ_ik) = p_ik(1 − p_ik). Consider the vector d = (d_1, d_2, ..., d_n). Then Cov(d_i/√n, d_k/√n) = p_ik(1 − p_ik)/n → 0 asymptotically. Moreover, Var(d_i) = Var(Σ_{j≠i} δ_ij) = Σ_{j≠i} Var(δ_ij) = Σ_{j≠i} p_ij(1 − p_ij) under independence.

Then centrality can be tackled via an eigendecomposition E = QΛQ'. Limiting arguments are not easily available, but see the references below. The precision for the estimation of d_i is 1/Σ_{j≠i} p_ij(1 − p_ij). {I cannot use this to formulate the sampling argument}

3.2 Planarity etc.

One important question: 1 − P(planarity) ≤ P(K_5) + P(K_{3,3}); we need to compute these. So if P(K_5) ≤ α_5 and P(K_{3,3}) ≤ α_{3,3} such that α_5 + α_{3,3} ≤ α, then we have one definition of "planar with (1 − α) confidence."

3.2.1 Limiting Distribution of the Adjacency Matrix

The concept of a limiting distribution of the adjacency matrix as n → ∞ will be very helpful for us.
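The normal approximation for the standardised degree derived in Section 3.1 can be checked empirically. The sketch below simulates d_i = Σ_j δ_ij with independent δ_ij ~ Bernoulli(p_ij) and standardises it; the particular p_ij values, bounded away from 0 and 1, are arbitrary choices made for illustration.

```python
import numpy as np

# Empirical check of the Lyapunov CLT for the degree d_i = sum_j delta_ij:
# (d_i - sum_j p_ij) / sqrt(sum_j p_ij (1 - p_ij)) should be close to N(0, 1).
# The p_ij values are arbitrary, bounded away from 0 and 1 (illustration only).
rng = np.random.default_rng(1)
n, reps = 2000, 5000

p = rng.uniform(0.1, 0.9, size=n)       # p_ij for a fixed vertex i, j = 1..n
mu = p.sum()                            # E(d_i)
sigma = np.sqrt((p * (1 - p)).sum())    # sqrt(Var(d_i))

z = np.empty(reps)
for r in range(reps):
    delta = rng.random(n) < p           # delta_ij ~ Bernoulli(p_ij), independent
    z[r] = (delta.sum() - mu) / sigma   # standardised degree

# z should have sample mean near 0 and sample standard deviation near 1.
```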
In this context we recall the following.

Bose and Sen (2008) deal with real symmetric matrices. If λ is an eigenvalue of multiplicity m of an n × n matrix A_n, then the empirical spectral measure puts mass m/n at λ. Note that if the entries of A_n are random, then this is a random probability measure. If λ_1, λ_2, ..., λ_n are all the eigenvalues, then the empirical spectral distribution function (ESD) F_{A_n} of A_n is given by

F_{A_n}(x) = (1/n) Σ_{i=1}^n I(λ_i ≤ x).

Let {A_n} be a sequence of square matrices with the corresponding ESDs {F_{A_n}}. The Limiting Spectral Distribution (or measure) (LSD) of the sequence is defined as the weak limit of the sequence {F_{A_n}}, if it exists. If the {A_n} are random, the limit is in the "almost sure" or "in probability" sense.

The relevant example for us is the Wigner matrix. In its simplest form, a Wigner matrix W_n of order n is an n × n symmetric matrix whose entries on and above the diagonal are i.i.d. random variables with zero mean and variance one. Denoting those i.i.d. random variables by {x_ij : 1 ≤ i ≤ j}, we can visualize the Wigner matrix as

W_n =
[ x_11   x_12   x_13   ...   x_1(n-1)   x_1n ]
[ x_12   x_22   x_23   ...   x_2(n-1)   x_2n ]
[ ...                                        ]
[ x_1n   x_2n   x_3n   ...   x_(n-1)n   x_nn ]

The semicircular law W arises as the LSD of n^(−1/2) W_n. It has the density function

p_W(s) = (1/(2π)) √(4 − s²) if |s| ≤ 2, and 0 otherwise.

All its odd moments are zero. The even moments are given by β_{2k}(W) = (2k)! / (k! (k+1)!).

Theorem: Let {w_ij : 1 ≤ i ≤ j, j ≥ 1} be a double sequence of independent random variables with E(w_ij) = 0 for all i ≤ j and E(w_ij²) = 1, which are either (i) uniformly bounded or (ii) identically distributed. Let W_n be an n × n Wigner matrix with the entries w_ij. Then with probability one, F_{n^(−1/2) W_n} converges weakly to the semicircular law.

We will use this result...
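As a quick numerical illustration of the theorem, the sketch below generates a Gaussian Wigner matrix and compares the even moments of the ESD of n^(−1/2) W_n with the Catalan numbers β_{2k} = (2k)!/(k!(k+1)!). The matrix size and the Gaussian entries are choices made for this example only.

```python
import numpy as np
from math import factorial

# Check the semicircular law through its moments: the 2k-th moment of the ESD
# of n^{-1/2} W_n should approach (2k)! / (k! (k+1)!), i.e. 1, 2, 5, ...
rng = np.random.default_rng(2)
n = 1000

W = rng.standard_normal((n, n))
W = np.triu(W) + np.triu(W, 1).T            # symmetric, iid on and above diagonal
eig = np.linalg.eigvalsh(W / np.sqrt(n))    # spectrum of n^{-1/2} W_n

for k in (1, 2, 3):
    emp = (eig ** (2 * k)).mean()           # empirical 2k-th moment of the ESD
    cat = factorial(2 * k) // (factorial(k) * factorial(k + 1))
    print(k, emp, cat)                      # empirical moment vs. Catalan number
```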
Using Bose and Sen, 2008: the eigenvalues of K_5 are (4, −1, −1, −1, −1) and those of K_{3,3} are (3, 0, 0, 0, 0, −3). If we are able to work out the ESD for ((δ_ij − p_ij)/√(p_ij(1 − p_ij))), and if we can probabilistically bound it (i) above by 4, then K_5 is ruled out; (ii) below by −3, then K_{3,3} is ruled out. Then we have a test for planarity.

4 Sampling Strategy

If we use attribute information for nodes, then one approach could be as follows:

1. Use a Dirichlet process on attributes (as in Sethuraman, 1994). Prior sampling for computing the posterior distribution.

2. Now we can use either a Markov random field model (as in Wainwright and Jordan, 2008) or an exponential random graph model (as in Christakis et al., 2011) for the edges: (δ_ij | X_i, X_j). Now d_i = Σ_j δ_ij | X_i, X_j, so we use a sample estimate for d_i, namely d̂_i (use a Lovász type strategy). Now use EM to estimate δ_ij | d̂_i, X_i, X_j for all j (to set this up as in SSSD, 2014).¹

3. Now we extrapolate for |V| → ∞; this is where the limit theorems will need to be formulated. The target is to establish weak laws and central limit theorems, with assumptions on similarity as n → ∞. Approximate a finite basis with dimension k(n) s.t. k(n)/n → 0 as n → ∞. In fact, the target is k(N)/N → 0 with sample size N. So we need to bound two approximations.

Lovász type sampling strategy:
• Start with any vertex, call it 1; the observation available is {X_1}.
• Crawl all connections (neighbours) of 1; the observations available are {d_1, X_i for all neighbours i}.
• Randomly select some of the neighbours of 1. Crawl all connections of them ...
• After these two layers, we will have data on {d_i, X_i} for i belonging to sampled vertices of these two layers (say the first N) and {X_j} for connections j of them (say N+1 to N+M). This can be visualised in terms of the data structure given below.
Here the first N rows & columns will be completely known. For the next M rows & columns, X_j will be known and some of the δ_ij's will be known (loops back). From this data the analysis will begin.

4.1 Algorithm

Assume X_n to be a discrete valued random variable, say taking values from the set S = {s_1, s_2, ..., s_k}. There are multiple iid copies of vertices with the attribute value s_i. Since by our assumption the behavior of a vertex is completely determined by its attribute value, the centrality of all vertices with the same attribute value should be the same. Thus p_ij also takes finitely many values, and it is enough to look at the matrix of class-level probabilities p_kl = P(δ_ij = 1 | X_i = s_k, X_j = s_l) for all pairs of classes k, l.

We attempt to apply a strategy similar to Lovász. We randomly sample a vertex and consider its depth-2 neighbourhood. From the proportions of its connections to vertices with different attribute values, we get an estimate of p_ij. If there exist m, n such that p_mn is not estimated from the sample, but p_mr and p_nr are for some r, we note that d being a metric we have the triangle inequality d(s_m, s_n) ≤ d(s_m, s_r) + d(s_r, s_n). So we have p_mn ≥ p_mr + p_nr, which provides a lower bound for p_mn. We can have p_mn ≥ sup_r {p_mr + p_nr}. So an iterative updating of the lower bound may be done. If in our final sample it is still not estimated, we can take p_mn ~ U(sup_r {p_mr + p_nr}, 1).

A. Sampling Algorithm:

Step 1. Select at random n vertices from all the vertices in the graph.
Step 2. Consider the proportion of vertices that are from the i-th class, i = 1, ..., k. Call it s⁰_i.
Step 3.
For the connection probability of elements of class i and class j, use the first measure:

p̂⁰_ij = (# connections present among elements of class i and j) / (total possible connections from i to j) = n_ij / (s⁰_i s⁰_j n²).

Step 4. For each of the vertices chosen at Step 1, say the selected vertex is from class k and its neighbours are from classes α_{1k}, ..., α_{n_k k}. Pick one of these neighbours, say the i-th one, at random with probability p_{k α_{ik}}.

Step 5. Using the information from the randomly chosen neighbours, repeat Step 2 to get s¹_i, and calculate p̂¹_ij similarly.

Step 6. Calculate q_ij = β p̂⁰_ij + (1 − β) p̂¹_ij, where β ∈ (0, 1). Report it as the probability.

Step 7. If there exist m, n such that p_mn is not estimated from the sample, but p_mr and p_nr are for some r, we note that d being a metric we have the triangle inequality d(s_m, s_n) ≤ d(s_m, s_r) + d(s_r, s_n). So we have p_mn ≥ p_mr + p_nr, which provides a lower bound for p_mn. We can have p_mn ≥ sup_r {p_mr + p_nr}. We take q_mn ~ U(sup_r {p_mr + p_nr}, 1).

B. Generation of the Adjacency Matrix from our Sampling Scheme:

Assume that the graph size is unknown. First assign the vertex i to a class C_i by generating the random variable X from the discrete distribution of the standardized s_i. Then, if vertex i is from C_k and vertex j is from C_r, then ((a_ij)) ~ Bernoulli(q_kr). Note:

Prob(a_ij = 1) = Σ_{k,r} Prob(a_ij = 1, i ∈ C_k, j ∈ C_r)
= Σ_{k,r} Prob(a_ij = 1 | i ∈ C_k, j ∈ C_r) Prob(i ∈ C_k, j ∈ C_r)
= Σ_{k,r} Prob(a_ij = 1 | i ∈ C_k, j ∈ C_r) Prob(i ∈ C_k) Prob(j ∈ C_r)
= Σ_{k,r} q_kr s'_k s'_r.

¹ See Saad, Basar et al. for a different but interesting application of such a technique in modelling sharing of information for more efficient estimation.
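The estimator of Steps 1-6 and the generation scheme of part B can be sketched together as follows. The class-level matrix Q and the random class labels are illustrative assumptions, and a second uniform vertex sample stands in for the neighbour layer of Steps 4-5, which the actual algorithm obtains by crawling.

```python
import numpy as np

# Sketch of Steps 1-6 and of part B. Q, the class labels and the second
# uniform sample are assumptions made for this illustration only.
rng = np.random.default_rng(3)
k, N = 3, 600

Q = np.array([[0.6, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.7]])          # true q_kr (assumed, symmetric)
cls = rng.integers(0, k, size=N)         # class C_i of each vertex

# Part B: generate a_ij ~ Bernoulli(q_{C_i C_j}), symmetric, zero diagonal.
A = (rng.random((N, N)) < Q[cls[:, None], cls[None, :]]).astype(int)
A = np.triu(A, 1)
A = A + A.T

def estimate_p(idx):
    # Steps 2-3: connection frequencies between classes within the sample.
    p_hat = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            vi = idx[cls[idx] == i]
            vj = idx[cls[idx] == j]
            total = len(vi) * len(vj) - (len(vi) if i == j else 0)
            if total > 0:                # ordered pairs between the two classes
                p_hat[i, j] = A[np.ix_(vi, vj)].sum() / total
    return p_hat

p0 = estimate_p(rng.choice(N, 300, replace=False))   # Steps 1-3
p1 = estimate_p(rng.choice(N, 300, replace=False))   # Steps 4-5 (simplified)
beta = 0.5
q = beta * p0 + (1 - beta) * p1                      # Step 6
```

With moderately large samples the combined estimate q is already close to Q; class pairs never observed in either sample would be handled by the triangle-inequality lower bound of Step 7.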
C. The resulting data structure:

The above sampling scheme will give rise to data which is a finite subgraph of the original graph, in the following structure:

        X_1    X_2    X_3    ...   X_k    ...   → d
  1     0      δ_12   δ_13   ...   δ_1k   ...   d_1
  2     δ_12   0      δ_23   ...   δ_2k   ...   d_2
  3     δ_13   δ_23   0      ...   δ_3k   ...   d_3
  ...
  k     δ_1k   δ_2k   δ_3k   ...   0      ...   d_k
  ...

5 Results

• Simulation results
• Comparison with existing methods (degree centrality etc.)
• Test results for planarity etc.

6 Discussion and Conclusions

6.1: On Infinite Graph Spectra:

In Bose and Sen [1], the questions about the spectral decomposition of infinite dimensional matrices are tackled, with results derived for the Wigner matrix. However, the realised adjacency matrix for our model does not have an iid structure of rows, as the value in the (i, j)-th position depends entirely on the class of the i-th element and the j-th element. Submatrices of the form given below are Wigner (i.e., the rows are generated from an iid process), and for the individual blocks we can obtain the limiting spectral distribution (LSD), which is the limit of F_{A_n}(x) = n^(−1) Σ I(λ_i ≤ x):

A_n = the block whose rows are indexed by the vertices c_{i1}, c_{i2}, c_{i3}, ... of class i and whose columns are indexed by the vertices c_{j1}, c_{j2}, c_{j3}, ... of class j.

This, while it can give an idea about the large sample centrality of the classes, seems to fail to generalize to give the overall graph spectrum. Our overall graph has finitely many blocks of the above form, with each block of infinite size. If we can prove the result for an adjacency matrix with two classes, then we can extend the result to finitely many blocks.

6.2: Weighted Graph

If instead of considering the actual adjacency matrix we consider the weighted graph adjacency matrix, where link weights are the probabilities of connection between the two vertices, the underlying weighted graph is connected.
However, with such a notion of weighted graph, the degree of a vertex of an infinite graph, d_i = Σ_j p_ij, is always going to be infinite: the vertex connects to all other vertices with some nonzero probability, so d_i is a sum of infinitely many nonzero values p_ij (the graph being infinite) and hence diverges to infinity.

Another issue with such a setup would be that all vertices of the same class should theoretically have the same centrality, as connections depend only on class properties and not on the individual vertices themselves. Thus, we may end up characterizing an "influential" group of people rather than identifying any one individual, which makes sense from a marketing / SNA perspective. If our attributes are fine enough, then the number of classes will be high with low class size for most, allowing us to zero in on one influential person compared to the rest. The notion of p_ij bounded away from zero arises from the possibility that we have incomplete information about the attributes, so we cannot say with certainty who will connect to or avoid whom.

6.3: Spectrum of the Weighted Graph

In such a case, we may again note that any finite truncated subgraph has repeated rows (and, by symmetry, repeated columns), and finding the eigenvalues of any finite dimensional matrix with repeated rows is equivalent to finding the eigenvalues of a transformed lower dimensional matrix, as follows. In general, if I is a set of rows which are identical, then let v_I be the vector which is 1/√|I| on the coordinates in I and 0 elsewhere. The v_I are orthonormal; complete them to an orthonormal basis by adding vectors w_j. Then A will annihilate the w_j and will take Span(v_I) to itself. The matrix of the endomorphism of Span(v_I) will have entries that look like √(|I||J|) a_IJ, with i ∈ I and j ∈ J.
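This reduction can be verified numerically: a block-constant weight matrix with k classes has the same nonzero eigenvalues as the k × k matrix with entries √(|I||J|) q_IJ. The class sizes and the weight matrix below are illustrative assumptions, chosen positive definite so that the nonzero eigenvalues are exactly the k largest ones.

```python
import numpy as np

# Numerical check of the repeated-rows reduction: the n x n block-constant
# matrix P and the k x k matrix M with entries sqrt(|I| |J|) q_IJ share their
# nonzero spectrum; the remaining n - k eigenvalues of P vanish.
sizes = [50, 30, 20]                     # class sizes |I| (assumed)
k = len(sizes)

Q = np.array([[0.6, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.7]])          # class-level weights q_IJ (assumed PD)

cls = np.repeat(np.arange(k), sizes)
P = Q[cls[:, None], cls[None, :]]        # n x n matrix with rows repeated by class

M = np.sqrt(np.outer(sizes, sizes)) * Q  # reduced k x k matrix

spec = np.linalg.eigvalsh(P)             # eigenvalues of P, ascending
print(np.allclose(spec[-k:], np.linalg.eigvalsh(M)))   # prints True
print(np.abs(spec[:-k]).max())           # remaining eigenvalues are ~0
```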
So it suffices to compute the eigenvalues of this finite matrix, and the rest are 0. For every finite subgraph of the matrix, say of dimension n × n, calculating the subgraph spectrum is thus equivalent to calculating the spectrum of a smaller transformed matrix; i.e., if we have k classes, we can simply compute the eigenvalues of the finite matrix of dimension k × k.

References

1. A. Bose and A. Sen, Another look at the moment method for large dimensional random matrices, Electronic Journal of Probability, 2008, http://ejp.ejpecp.org/article/download/501/706.
2. Y. H. Wang (Concordia University), On the Number of Successes in Independent Trials.
3. C. Hubler, H. P. Kriegel, K. Borgwardt, Z. Ghahramani, Metropolis algorithms for representative subgraph sampling, in: Proceedings of the 8th International Conference on Data Mining, IEEE, 2008, pp. 283-292.
4. H. Sethu, X. Chu, A new algorithm for extracting a small representative subgraph from a very large graph, e-print arXiv:1207.4825.
5. N. Blagus, L. Subelj, G. Weiss, M. Bajec, Sampling promotes community structure in social and information networks, Physica A 432 (2015) 206-215.
6. J. Leskovec, C. Faloutsos, Sampling from large graphs, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2006, pp. 631-636.
7. S. H. Lee, P. J. Kim, H. Jeong, Statistical properties of sampled networks, Phys. Rev. E 73 (1) (2006) 016102.
8. N. Blagus, L. Subelj, M. Bajec, Assessing the effectiveness of real-world network simplification, Physica A 413 (2014) 134-146.
9. N. Ahmed, J. Neville, R. R. Kompella, Network sampling via edge-based node selection with graph induction, Tech. rep., Purdue University (2011).
10. I. Benjamini, L.
Lovász, Global Information from Local Observation, Proc. 43rd Ann. Symp. on Found. of Comp. Sci. (2002), pp. 701-710.
11. Sourabh Bhattacharya, Diganta Mukherjee, Sutanoy Dasgupta and Soumendu Sundar Mukherjee, "A Model for Social Networks", mimeo, ISI, December 2014.
12. N. Christakis, J. Fowler, G. W. Imbens and K. Kalyanaraman (2011), An Empirical Model for Strategic Network Formation, Working Paper, National Bureau of Economic Research, Cambridge, MA.
13. W. Saad, Z. Han, M. Debbah, A. Hjorungnes, T. Basar, Coalitional game theory for communication networks, IEEE Signal Processing Magazine 26 (5), pp. 77-97.
14. Koushiki Sarkar, Abhishek Ray and Diganta Mukherjee, "Impact of Social Network on Financial Decisions", forthcoming in Studies in Microeconomics, 2015.
15. J. Sethuraman, A constructive definition of Dirichlet priors, Statistica Sinica, 4 (1994), pp. 639-650.
16. M. J. Wainwright and M. I. Jordan (2008), Graphical models, exponential families, and variational inference, Foundations and Trends in Machine Learning, Vol. 1, Numbers 1-2, pp. 1-305, December 2008.