Testing for Characteristics of Attribute Linked Infinite Networks based on Small Samples
Authors: Koushiki Sarkar, Diganta Mukherjee
Indian Statistical Institute, Kolkata
October 25, 2021

Abstract

The objective of this paper is to study the characteristics (geometric and otherwise) of very large attribute based undirected networks. Real-world networks are often very large and fast evolving. Their analysis and understanding present a great challenge. An attribute based network is a graph in which the edges depend on certain properties of the vertices on which they are incident. In the context of a social network, the existence of a link between two individuals may depend on certain attributes of the two of them. We use the Lovász type sampling strategy of observing a certain random process on a graph "locally", i.e., in the neighbourhood of a node, and deriving information about "global" properties of the graph. The corresponding adjacency matrix is our primary object of interest. We study the efficiency of recently proposed sampling strategies, modified to our setup, to estimate the degree distribution, centrality measures, planarity etc. The limiting distributions are derived using recently developed probabilistic techniques for random matrices, and hence we devise relevant test statistics and confidence intervals for different parameters / hypotheses of interest. We hope that our work will be useful for social and computer scientists for designing sampling strategies and computational algorithms appropriate to their respective domains of inquiry. Extensive simulation studies are done to empirically verify the probabilistic statements made in the paper.

1 Introduction

1.1 Need for Sampling and Common Strategies

Real-world networks are often very large and fast evolving. Their analysis and understanding present a great challenge.
In the past few years, a number of different techniques have been proposed for sampling large networks to allow for their faster and more efficient analysis. Several studies on network sampling analyze the match between the original networks and their sampled variants [1, 2, 3], as well as comparing the performance of different sampling techniques [4, 5, 6]. Sampling techniques can be roughly divided into two categories: random selection and network exploration techniques. In the first category, nodes or links are included in the sample uniformly at random or proportional to some particular characteristic like degree. In the second category, the sample is constructed by retrieving a neighbourhood of a randomly selected seed node using different strategies like breadth-first search, random walk and forest-fire. On this basis, the following algorithms have been proposed in the literature.

• random node selection [4] (RNS), where the sample consists of nodes selected uniformly at random and all their mutual links
• random node selection by degree [4] (RND), where the nodes are selected randomly with probability proportional to their degrees and all their mutual links are included in the sample
• random link selection [4] (RLS), where the sample consists of links selected uniformly at random
• random link selection with subgraph induction [7] (RLI), where the sample consists of links selected uniformly at random and any additional links between their endpoints
• random walk sampling [4] (RWS), where the random walk is simulated on the network, starting at a randomly selected seed node
• forest-fire sampling [4] (FFS).
Here, a broad neighbourhood of a randomly selected seed node is retrieved by partial breadth-first search
• random walk sampling with subgraph induction (RWI) and
• forest-fire sampling with subgraph induction (FFI).

Recall the Lovász sampling strategy of observing a certain random process on a graph "locally", i.e., in the neighbourhood of a node, and deriving information about "global" properties of the graph. For example, what do you know about a graph based on observing the returns of a random walk to a given node? Almost all sampling strategies use this philosophy or some close variant. We also aim to use this as an ingredient in our sampling strategy and derive a test for such returns.

1.2 Why Attribute based Network?

An attribute based network is a graph in which the edges depend on certain properties of the vertices on which they are incident. In the context of a social network, the existence of a link between two individuals may depend on certain attributes of the two of them, for example their geographic location or socioeconomic status. We work with the underlying assumption that similar people connect to each other with higher probability. In the context of a social or a neural network, the connections between individual vertices depend on certain intrinsic qualities of the vertices themselves. It makes sense to consider the connection probabilities as a function of the vertex attributes. In earlier work (Sarkar, Ray and Mukherjee, 2015) we have shown that in the context of predictive modeling, attribute based networks are indeed worthwhile to study.

1.3 Plan of the Paper

The objective of this paper is to study the characteristics (geometric and otherwise) of very large attribute based undirected networks. The corresponding adjacency matrix is our primary object of interest.
We study the efficiency of recently proposed sampling strategies, modified to our setup, to estimate the degree distribution, centrality measures, planarity etc. The limiting distributions are derived using recently developed probabilistic techniques for random matrices, and hence we devise relevant test statistics and confidence intervals for different parameters / hypotheses of interest. We hope that our work will be useful for social and computer scientists for designing sampling strategies and computational algorithms appropriate to their respective domains of inquiry. Extensive simulation studies are done to empirically verify the probabilistic statements made in the paper.

2 Preliminaries

Let the network be represented by a simple undirected graph G = (V, E), where V denotes the set of nodes (n = |V|) and E is the set of links (m = |E|). The goal of network sampling is to create a sampled network G' = (V', E'), where V' ⊂ V, E' ⊂ E and n' = |V'| << n, m' = |E'| << m. The sample G' is obtained in two steps. In the first step, nodes or links are sampled using a particular strategy like random selection or network exploration sampling. In the second step, the sampled nodes and links are retrieved from the original network.

2.1 Notions of Centrality

A network can be characterized by various notions of centrality, whose relevance and utility are context-specific. A complex network with a heterogeneous topology might not have the same optimality properties for a single measure of centrality throughout the graph. However, for a sampling based approach without any prior idea of the graph type, it may be difficult to know which centrality measure is best suited for the study of the graph. If we operate under the simplified assumptions about the attribute-based network, then our graph structure is simplified.
Particularly, if the attribute random variables {X_i} take values in a finite set, then the set of possible connection probabilities is also finite. Then we can break a graph into different classes which are expected to have similar behavior.

Here we intend to develop a sampling analogue for finding the eigenvector centrality, which is the solution of the eigenvector equation Ax = λx. By the Perron-Frobenius theorem, for a nonnegative irreducible matrix the largest eigenvalue is real, simple and has a strictly positive eigenvector, so we only require the largest eigenvalue.

A possible generalization of eigenvector centrality as well as degree centrality is the Katz centrality. It measures the number of all nodes that can be connected via a path to the vertex in question, while the contributions of distant nodes are devalued. It is mathematically written as

x_i = Σ_k Σ_j α^k (A^k)_{ji}.

Katz centrality can be viewed as a variant of eigenvector centrality. Another form of Katz centrality is

x_i = α Σ_{j=1}^N a_ij (x_j + 1).

Compared to the expression of eigenvector centrality, x_j is replaced by x_j + 1. It is shown that the principal eigenvector (associated with the largest eigenvalue λ of A, the adjacency matrix) is the limit of Katz centrality as α approaches 1/λ from below.

2.2 Assumptions on the Network

Basic assumptions: We denote by X_i the attribute of the class i, where i is assumed to take only finitely many values. In a population, there can be infinitely many people with the same attribute. Let us consider all of them members of the class i. Call it c_i.

• Looking at the degree distribution is not very meaningful, as even if we know that the degrees are distributed by a power law or normally, we still do not know what the degree should be for an individual node.
• The degree needs to be a specific property of a node for us to meaningfully select a node.
In the context of social networks, it makes sense to consider vertices appended with attributes. An edge or connection can be considered to be dependent on the attributes of the involved nodes.

• We consider the accessory variable X_i appended to each vertex i.
• Call the indicator function δ_ij = 1 if i and j are connected, and 0 otherwise.
• We need to look at p_ij = P(δ_ij = 1 | X_i, X_j). For an Erdős-Rényi graph, this is the unconditional probability, the same for all i, j.
• We make some assumptions on p_ij. Even if i and j belong to the same class and hence share the same attributes, p_ij ≠ 1. p_ij is assumed to be bounded away from 1 and 0.
• Fix an i. Consider argmax_i Σ_j P(δ_ij = 1 | X_i, X_j). This can be approximated by argmax_i E(d | X_i), where d is the degree.
• The {X_n} sequence, if stochastic, is assumed to form a Markov random field. We have a sort of dependence structure within a neighbourhood; distant nodes can be assumed to be more or less independent. In particular, if {X_n} is finite, then p_ij also takes finitely many values.

3 Probabilistic Formulation

Instead of the adjacency matrix, we can consider the matrix P = ((p_ij)), the matrix of the probabilities. If we consider a random motion on the graph, we consider p_ij to be a transition probability on the graph. A possible notion of centrality in this context is whether the vertex is recurrent or transient. A recurrent vertex indicates that there are multiple paths leading back to the vertex. This is somewhat analogous to betweenness centrality. We need, however, to note that in a large graph modeled on a social network, which would be mostly sparse with intermediate densely connected cliques, the actual betweenness for all but a few vertices would be vanishingly small. On a global scale these vertices may not be important; but their local influence cannot be dismissed.
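As a concrete sketch of this formulation, the following builds a probability matrix P = ((p_ij)) that depends only on the vertex attributes and is bounded away from 0 and 1, and then row-normalises it into a transition matrix for the random motion just described. The attribute set, the distance and the decreasing function used here are assumptions made for this illustration only, not the paper's specification.

```python
import numpy as np

# Illustrative sketch: connection probabilities p_ij that depend only on the
# vertex attributes, bounded away from 0 and 1, viewed as transition weights
# for a random motion on the graph. Attribute values, the metric and the
# decreasing function are assumptions for this example only.
rng = np.random.default_rng(0)

attrs = rng.integers(0, 3, size=8)      # attribute X_i from a finite set {0, 1, 2}
eps = 0.05                              # keeps every p_ij inside [eps, 1 - eps]

def p_connect(xi, xj):
    d = abs(xi - xj)                    # a simple metric on attribute values
    return eps + (1 - 2 * eps) * (1 - min(1, d / 2))  # decreasing in d

n = len(attrs)
P = np.array([[0.0 if i == j else p_connect(attrs[i], attrs[j])
               for j in range(n)] for i in range(n)])

# Row-normalise P to obtain the transition matrix of the random walk.
T = P / P.sum(axis=1, keepdims=True)
```

Since the attributes take only three values, P has only finitely many distinct entries, matching the observation that p_ij takes finitely many values when {X_n} is finite.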
We look at an irreducible aperiodic subset of the graph. If the motion is considered to be Markov, then, noting that transience and recurrence are solidarity properties, we attempt to verify that using our model. If d is a metric defined on the σ-field generated by the random variables {X_n}, then consider p_ij = f(d(X_i, X_j)), where f is a decreasing function of d and bounded in [0, 1]. An easy example is d'(x, y) = 1 − min(1, d(x, y)). Again, if {X_n} is finite (or countable), d only takes finitely (countably) many values and consequently the set of values of p is also finite (countable).

If {X_n} is known, then P is also completely known, as is P^k. In principle, we can also calculate whether Σ_n f_ii^(n) < 1, where f_ii^(n) = Pr(T_i = n) is the probability that the first return time T_i to vertex i equals n. If this holds, then the vertex is transient, else recurrent.

3.1 Degree Distribution

We are estimating p_ij = P(δ_ij = 1 | X_i, X_j). Let P = ((p_ij)) and E = ((δ_ij)), symmetric with δ_ii = 0 for all i. So δ_ij ~ Bernoulli(p_ij); symbolically, E ~ Bernoulli(P). Let the degree be d_i = Σ_j δ_ij. What is the distribution of δ_ij | d_i; should it be of "hypergeometric" type? So can we use iterated expectation in the following way? First condition on the row total d_i to use the "hypergeometric type" calculation for (δ_ij | X_i, X_j). Then take expectation over d_i ~ F(· | X_i) following from the ERG model. If we can show that the distribution of d_i is of "binomial type", then for fixed |V| = n we can do the calculation and then take the limit n → ∞.

Assuming {X_i} to be non-stochastic, if the degree of the i-th vertex is d_i, we have d_i = Σ_{j=1, j≠i}^n δ_ij, where δ_ij is the indicator variable which is 1 if there is a connection between the i-th and the j-th vertices. If p_ij is the connection probability of the i-th and the j-th vertex, then p_ij = f(x_i, x_j) is completely known.
Consider the degree proportion d_i/(n−1) = (1/(n−1)) Σ_{j=1, j≠i}^n δ_ij. We also assume that the connections depend entirely on the two involved vertices and not on other factors, so the δ_ij's are independent. The distribution of d_i can be explicitly obtained by the results from Wang's paper [2].

Proposition: By an easy application of the Lyapunov condition for the central limit theorem for independent but not identically distributed random variables, we have the large sample distribution

(Σ_{j≠i} (δ_ij − p_ij)) / √(Σ_{j≠i} p_ij(1 − p_ij)) → N(0, 1) in law,

as long as p_ij is bounded away from 0 and 1. The left hand side is a standardised degree, a scaled degree density d'_i/√n.

Also note that under this structure,

Cov(d_i, d_k) = Cov(Σ_{j≠i} δ_ij, Σ_{l≠k} δ_kl) = Σ_{j≠i} Σ_{l≠k} Cov(δ_ij, δ_kl).

Under the condition of independence, the only common term is δ_ik, so this equals Cov(δ_ik, δ_ik) = Var(δ_ik) = p_ik(1 − p_ik). Consider the vector d = (d_1, d_2, ..., d_n). Then Cov(d_i/√n, d_k/√n) = p_ik(1 − p_ik)/n → 0 asymptotically. Moreover, Var(d_i) = Var(Σ_{j≠i} δ_ij) = Σ_{j≠i} Var(δ_ij) = Σ_{j≠i} p_ij(1 − p_ij) under independence.

Then centrality can be tackled via an eigendecomposition E = QΛQ'. Limiting arguments are not easily available, but see the references below. The precision for the estimation of d_i is 1/Σ_{j≠i} p_ij(1 − p_ij). {I cannot use this to formulate the sampling argument}

3.2 Planarity etc.

One important question: 1 − P(planarity) ≤ P(K_5) + P(K_{3,3}); we need to compute these. So if P(K_5) ≤ α_5 and P(K_{3,3}) ≤ α_{3,3} such that α_5 + α_{3,3} ≤ α, then we have one definition of "planar with (1 − α) confidence."

3.2.1 Limiting Distribution of the Adjacency Matrix

The concept of a limiting distribution of the adjacency matrix as n → ∞ will be very helpful for us.
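The normal approximation for the standardised degree derived in Section 3.1 can be checked empirically. The sketch below simulates d_i = Σ_j δ_ij with independent δ_ij ~ Bernoulli(p_ij) and standardises it; the particular p_ij values, bounded away from 0 and 1, are arbitrary choices made for illustration.

```python
import numpy as np

# Empirical check of the Lyapunov CLT for the degree d_i = sum_j delta_ij:
# (d_i - sum_j p_ij) / sqrt(sum_j p_ij (1 - p_ij)) should be close to N(0, 1).
# The p_ij values are arbitrary, bounded away from 0 and 1 (illustration only).
rng = np.random.default_rng(1)
n, reps = 2000, 5000

p = rng.uniform(0.1, 0.9, size=n)       # p_ij for a fixed vertex i, j = 1..n
mu = p.sum()                            # E(d_i)
sigma = np.sqrt((p * (1 - p)).sum())    # sqrt(Var(d_i))

z = np.empty(reps)
for r in range(reps):
    delta = rng.random(n) < p           # delta_ij ~ Bernoulli(p_ij), independent
    z[r] = (delta.sum() - mu) / sigma   # standardised degree

# z should have sample mean near 0 and sample standard deviation near 1.
```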
In this context we recall the following.

Bose and Sen (2008) deal with real symmetric matrices. If λ is an eigenvalue of multiplicity m of an n × n matrix A_n, then the empirical spectral measure puts mass m/n at λ. Note that if the entries of A_n are random, then this is a random probability measure. If λ_1, λ_2, ..., λ_n are all the eigenvalues, then the empirical spectral distribution function (ESD) F_{A_n} of A_n is given by

F_{A_n}(x) = (1/n) Σ_{i=1}^n I(λ_i ≤ x).

Let {A_n} be a sequence of square matrices with the corresponding ESDs {F_{A_n}}. The Limiting Spectral Distribution (or measure) (LSD) of the sequence is defined as the weak limit of the sequence {F_{A_n}}, if it exists. If the {A_n} are random, the limit is in the "almost sure" or "in probability" sense.

The relevant example for us is the Wigner matrix. In its simplest form, a Wigner matrix W_n of order n is an n × n symmetric matrix whose entries on and above the diagonal are i.i.d. random variables with zero mean and variance one. Denoting those i.i.d. random variables by {x_ij : 1 ≤ i ≤ j}, we can visualize the Wigner matrix as

W_n =
[ x_11   x_12   x_13   ...   x_1(n-1)   x_1n ]
[ x_12   x_22   x_23   ...   x_2(n-1)   x_2n ]
[ ...                                        ]
[ x_1n   x_2n   x_3n   ...   x_(n-1)n   x_nn ]

The semicircular law W arises as the LSD of n^(−1/2) W_n. It has the density function

p_W(s) = (1/(2π)) √(4 − s²) if |s| ≤ 2, and 0 otherwise.

All its odd moments are zero. The even moments are given by β_{2k}(W) = (2k)! / (k! (k+1)!).

Theorem: Let {w_ij : 1 ≤ i ≤ j, j ≥ 1} be a double sequence of independent random variables with E(w_ij) = 0 for all i ≤ j and E(w_ij²) = 1, which are either (i) uniformly bounded or (ii) identically distributed. Let W_n be an n × n Wigner matrix with the entries w_ij. Then with probability one, F_{n^(−1/2) W_n} converges weakly to the semicircular law.

We will use this result...
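As a quick numerical illustration of the theorem, the sketch below generates a Gaussian Wigner matrix and compares the even moments of the ESD of n^(−1/2) W_n with the Catalan numbers β_{2k} = (2k)!/(k!(k+1)!). The matrix size and the Gaussian entries are choices made for this example only.

```python
import numpy as np
from math import factorial

# Check the semicircular law through its moments: the 2k-th moment of the ESD
# of n^{-1/2} W_n should approach (2k)! / (k! (k+1)!), i.e. 1, 2, 5, ...
rng = np.random.default_rng(2)
n = 1000

W = rng.standard_normal((n, n))
W = np.triu(W) + np.triu(W, 1).T            # symmetric, iid on and above diagonal
eig = np.linalg.eigvalsh(W / np.sqrt(n))    # spectrum of n^{-1/2} W_n

for k in (1, 2, 3):
    emp = (eig ** (2 * k)).mean()           # empirical 2k-th moment of the ESD
    cat = factorial(2 * k) // (factorial(k) * factorial(k + 1))
    print(k, emp, cat)                      # empirical moment vs. Catalan number
```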
Using Bose and Sen, 2008: the eigenvalues of K_5 are (4, −1, −1, −1, −1) and those of K_{3,3} are (3, 0, 0, 0, 0, −3). If we are able to work out the ESD for ((δ_ij − p_ij)/√(p_ij(1 − p_ij))), and if we can probabilistically bound it (i) above by 4, then K_5 is ruled out; (ii) below by −3, then K_{3,3} is ruled out. Then we have a test for planarity.

4 Sampling Strategy

If we use attribute information for nodes, then one approach could be as follows:

1. Use a Dirichlet process on attributes (as in Sethuraman, 1994). Prior sampling for computing the posterior distribution.

2. Now we can use either a Markov random field model (as in Wainwright and Jordan, 2008) or an exponential random graph model (as in Christakis et al., 2011) for the edges: (δ_ij | X_i, X_j). Now d_i = Σ_j δ_ij | X_i, X_j, so we use a sample estimate for d_i, namely d̂_i (use a Lovász type strategy). Now use EM to estimate δ_ij | d̂_i, X_i, X_j for all j (to set this up as in SSSD, 2014).¹

3. Now we extrapolate for |V| → ∞; this is where the limit theorems will need to be formulated. The target is to establish weak laws and central limit theorems, with assumptions on similarity as n → ∞. Approximate a finite basis with dimension k(n) s.t. k(n)/n → 0 as n → ∞. In fact, the target is k(N)/N → 0 with sample size N. So we need to bound two approximations.

Lovász type sampling strategy:
• Start with any vertex, call it 1; the observation available is {X_1}.
• Crawl all connections (neighbours) of 1; the observations available are {d_1, X_i for all neighbours i}.
• Randomly select some of the neighbours of 1. Crawl all connections of them ...
• After these two layers, we will have data on {d_i, X_i} for i belonging to sampled vertices of these two layers (say the first N) and {X_j} for connections j of them (say N+1 to N+M). This can be visualised in terms of the data structure given below.
Here the first N rows & columns will be completely known. For the next M rows & columns, X_j will be known and some of the δ_ij's will be known (loops back). From this data the analysis will begin.

4.1 Algorithm

Assume X_n to be a discrete valued random variable, say taking values from the set S = {s_1, s_2, ..., s_k}. There are multiple iid copies of vertices with the attribute value s_i. Since by our assumption the behavior of a vertex is completely determined by its attribute value, the centrality of all vertices with the same attribute value should be the same. Thus p_ij also takes finitely many values, and it is enough to look at the matrix of class-level probabilities p_kl = P(δ_ij = 1 | X_i = s_k, X_j = s_l) for all pairs of classes k, l.

We attempt to apply a strategy similar to Lovász. We randomly sample a vertex and consider its depth-2 neighbourhood. From the proportions of its connections to vertices with different attribute values, we get an estimate of p_ij. If there exist m, n such that p_mn is not estimated from the sample, but p_mr and p_nr are for some r, we note that d being a metric we have the triangle inequality d(s_m, s_n) ≤ d(s_m, s_r) + d(s_r, s_n). So we have p_mn ≥ p_mr + p_nr, which provides a lower bound for p_mn. We can have p_mn ≥ sup_r {p_mr + p_nr}. So an iterative updating of the lower bound may be done. If in our final sample it is still not estimated, we can take p_mn ~ U(sup_r {p_mr + p_nr}, 1).

A. Sampling Algorithm:

Step 1. Select at random n vertices from all the vertices in the graph.
Step 2. Consider the proportion of vertices that are from the i-th class, i = 1, ..., k. Call it s⁰_i.
Step 3.
For the connection probability of elements of class i and class j, use the first measure:

p̂⁰_ij = (# connections present among elements of class i and j) / (total possible connections from i to j) = n_ij / (s⁰_i s⁰_j n²).

Step 4. For each of the vertices chosen at Step 1, say the selected vertex is from class k and its neighbours are from classes α_{1k}, ..., α_{n_k k}. Pick one of these neighbours, say the i-th one, at random with probability p_{k α_{ik}}.

Step 5. Using the information from the randomly chosen neighbours, repeat Step 2 to get s¹_i, and calculate p̂¹_ij similarly.

Step 6. Calculate q_ij = β p̂⁰_ij + (1 − β) p̂¹_ij, where β ∈ (0, 1). Report it as the probability.

Step 7. If there exist m, n such that p_mn is not estimated from the sample, but p_mr and p_nr are for some r, we note that d being a metric we have the triangle inequality d(s_m, s_n) ≤ d(s_m, s_r) + d(s_r, s_n). So we have p_mn ≥ p_mr + p_nr, which provides a lower bound for p_mn. We can have p_mn ≥ sup_r {p_mr + p_nr}. We take q_mn ~ U(sup_r {p_mr + p_nr}, 1).

B. Generation of the Adjacency Matrix from our Sampling Scheme:

Assume that the graph size is unknown. First assign the vertex i to a class C_i by generating the random variable X from the discrete distribution of the standardized s_i. Then, if vertex i is from C_k and vertex j is from C_r, then ((a_ij)) ~ Bernoulli(q_kr). Note:

Prob(a_ij = 1) = Σ_{k,r} Prob(a_ij = 1, i ∈ C_k, j ∈ C_r)
= Σ_{k,r} Prob(a_ij = 1 | i ∈ C_k, j ∈ C_r) Prob(i ∈ C_k, j ∈ C_r)
= Σ_{k,r} Prob(a_ij = 1 | i ∈ C_k, j ∈ C_r) Prob(i ∈ C_k) Prob(j ∈ C_r)
= Σ_{k,r} q_kr s'_k s'_r.

¹ See Saad, Basar et al. for a different but interesting application of such a technique in modelling sharing of information for more efficient estimation.
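The estimator of Steps 1-6 and the generation scheme of part B can be sketched together as follows. The class-level matrix Q and the random class labels are illustrative assumptions, and a second uniform vertex sample stands in for the neighbour layer of Steps 4-5, which the actual algorithm obtains by crawling.

```python
import numpy as np

# Sketch of Steps 1-6 and of part B. Q, the class labels and the second
# uniform sample are assumptions made for this illustration only.
rng = np.random.default_rng(3)
k, N = 3, 600

Q = np.array([[0.6, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.7]])          # true q_kr (assumed, symmetric)
cls = rng.integers(0, k, size=N)         # class C_i of each vertex

# Part B: generate a_ij ~ Bernoulli(q_{C_i C_j}), symmetric, zero diagonal.
A = (rng.random((N, N)) < Q[cls[:, None], cls[None, :]]).astype(int)
A = np.triu(A, 1)
A = A + A.T

def estimate_p(idx):
    # Steps 2-3: connection frequencies between classes within the sample.
    p_hat = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            vi = idx[cls[idx] == i]
            vj = idx[cls[idx] == j]
            total = len(vi) * len(vj) - (len(vi) if i == j else 0)
            if total > 0:                # ordered pairs between the two classes
                p_hat[i, j] = A[np.ix_(vi, vj)].sum() / total
    return p_hat

p0 = estimate_p(rng.choice(N, 300, replace=False))   # Steps 1-3
p1 = estimate_p(rng.choice(N, 300, replace=False))   # Steps 4-5 (simplified)
beta = 0.5
q = beta * p0 + (1 - beta) * p1                      # Step 6
```

With moderately large samples the combined estimate q is already close to Q; class pairs never observed in either sample would be handled by the triangle-inequality lower bound of Step 7.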
C. The resulting data structure:

The above sampling scheme will give rise to data which is a finite subgraph of the original graph, in the following structure:

        X_1    X_2    X_3    ...   X_k    ...   → d
  1     0      δ_12   δ_13   ...   δ_1k   ...   d_1
  2     δ_12   0      δ_23   ...   δ_2k   ...   d_2
  3     δ_13   δ_23   0      ...   δ_3k   ...   d_3
  ...
  k     δ_1k   δ_2k   δ_3k   ...   0      ...   d_k
  ...

5 Results

• Simulation results
• Comparison with existing methods (degree centrality etc.)
• Test results for planarity etc.

6 Discussion and Conclusions

6.1: On Infinite Graph Spectra:

In Bose and Sen [1], the questions about the spectral decomposition of infinite dimensional matrices are tackled, with results derived for the Wigner matrix. However, the realised adjacency matrix for our model does not have an iid structure of rows, as the value in the (i, j)-th position depends entirely on the class of the i-th element and the j-th element. Submatrices of the form given below are Wigner (i.e., the rows are generated from an iid process), and for the individual blocks we can obtain the limiting spectral distribution (LSD), which is the limit of F_{A_n}(x) = n^(−1) Σ I(λ_i ≤ x):

A_n = the block whose rows are indexed by the vertices c_{i1}, c_{i2}, c_{i3}, ... of class i and whose columns are indexed by the vertices c_{j1}, c_{j2}, c_{j3}, ... of class j.

This, while it can give an idea about the large sample centrality of the classes, seems to fail to generalize to give the overall graph spectrum. Our overall graph has finitely many blocks of the above form, with each block of infinite size. If we can prove the result for an adjacency matrix with two classes, then we can extend the result to finitely many blocks.

6.2: Weighted Graph

If instead of considering the actual adjacency matrix we consider the weighted graph adjacency matrix, where link weights are the probabilities of connection between the two vertices, the underlying weighted graph is connected.
However, with such a notion of weighted graph, the degree of a vertex of an infinite graph, d_i = Σ_j p_ij, is always going to be infinite: the vertex connects to all other vertices with some nonzero probability, so d_i is a sum of infinitely many nonzero values p_ij (the graph being infinite) and hence diverges to infinity.

Another issue with such a setup would be that all vertices of the same class should theoretically have the same centrality, as connections depend only on class properties and not on the individual vertices themselves. Thus, we may end up characterizing an "influential" group of people rather than identifying any one individual, which makes sense from a marketing / SNA perspective. If our attributes are fine enough, then the number of classes will be high with low class size for most, allowing us to zero in on one influential person compared to the rest. The notion of p_ij bounded away from zero arises from the possibility that we have incomplete information about the attributes, so we cannot say with certainty who will connect to or avoid whom.

6.3: Spectrum of the Weighted Graph

In such a case, we may again note that any finite truncated subgraph has repeated rows (and, by symmetry, repeated columns), and finding the eigenvalues of any finite dimensional matrix with repeated rows is equivalent to finding the eigenvalues of a transformed lower dimensional matrix, as follows. In general, if I is a set of rows which are identical, then let v_I be the vector which is 1/√|I| on the coordinates in I and 0 elsewhere. The v_I are orthonormal; complete them to an orthonormal basis by adding vectors w_j. Then A will annihilate the w_j and will take Span(v_I) to itself. The matrix of the endomorphism of Span(v_I) will have entries that look like √(|I||J|) a_IJ, with i ∈ I and j ∈ J.
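This reduction can be verified numerically: a block-constant weight matrix with k classes has the same nonzero eigenvalues as the k × k matrix with entries √(|I||J|) q_IJ. The class sizes and the weight matrix below are illustrative assumptions, chosen positive definite so that the nonzero eigenvalues are exactly the k largest ones.

```python
import numpy as np

# Numerical check of the repeated-rows reduction: the n x n block-constant
# matrix P and the k x k matrix M with entries sqrt(|I| |J|) q_IJ share their
# nonzero spectrum; the remaining n - k eigenvalues of P vanish.
sizes = [50, 30, 20]                     # class sizes |I| (assumed)
k = len(sizes)

Q = np.array([[0.6, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.7]])          # class-level weights q_IJ (assumed PD)

cls = np.repeat(np.arange(k), sizes)
P = Q[cls[:, None], cls[None, :]]        # n x n matrix with rows repeated by class

M = np.sqrt(np.outer(sizes, sizes)) * Q  # reduced k x k matrix

spec = np.linalg.eigvalsh(P)             # eigenvalues of P, ascending
print(np.allclose(spec[-k:], np.linalg.eigvalsh(M)))   # prints True
print(np.abs(spec[:-k]).max())           # remaining eigenvalues are ~0
```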
So it suffices to compute the eigenvalues of this finite matrix, and the rest are 0. For every finite subgraph of the matrix, say of dimension n × n, calculating the subgraph spectrum is thus equivalent to calculating the spectrum of a smaller transformed matrix; i.e., if we have k classes, we can simply compute the eigenvalues of the finite matrix of dimension k × k.

References

1. A. Bose and A. Sen, Another look at the moment method for large dimensional random matrices, Electronic Journal of Probability, 2008, http://ejp.ejpecp.org/article/download/501/706.
2. Y. H. Wang (Concordia University), On the Number of Successes in Independent Trials.
3. C. Hubler, H. P. Kriegel, K. Borgwardt, Z. Ghahramani, Metropolis algorithms for representative subgraph sampling, in: Proceedings of the 8th International Conference on Data Mining, IEEE, 2008, pp. 283-292.
4. H. Sethu, X. Chu, A new algorithm for extracting a small representative subgraph from a very large graph, e-print arXiv:1207.4825.
5. N. Blagus, L. Subelj, G. Weiss, M. Bajec, Sampling promotes community structure in social and information networks, Physica A 432 (2015) 206-215.
6. J. Leskovec, C. Faloutsos, Sampling from large graphs, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2006, pp. 631-636.
7. S. H. Lee, P. J. Kim, H. Jeong, Statistical properties of sampled networks, Phys. Rev. E 73 (1) (2006) 016102.
8. N. Blagus, L. Subelj, M. Bajec, Assessing the effectiveness of real-world network simplification, Physica A 413 (2014) 134-146.
9. N. Ahmed, J. Neville, R. R. Kompella, Network sampling via edge-based node selection with graph induction, Tech. rep., Purdue University (2011).
10. I. Benjamini, L.
Lovász, Global Information from Local Observation, Proc. 43rd Ann. Symp. on Found. of Comp. Sci. (2002), pp. 701-710.
11. Sourabh Bhattacharya, Diganta Mukherjee, Sutanoy Dasgupta and Soumendu Sundar Mukherjee, "A Model for Social Networks", mimeo, ISI, December 2014.
12. N. Christakis, J. Fowler, G. W. Imbens and K. Kalyanaraman (2011), An Empirical Model for Strategic Network Formation, Working Paper, National Bureau of Economic Research, Cambridge, MA.
13. W. Saad, Z. Han, M. Debbah, A. Hjorungnes, T. Basar, Coalitional game theory for communication networks, IEEE Signal Processing Magazine 26 (5), pp. 77-97.
14. Koushiki Sarkar, Abhishek Ray and Diganta Mukherjee, "Impact of Social Network on Financial Decisions", forthcoming in Studies in Microeconomics, 2015.
15. J. Sethuraman, A constructive definition of Dirichlet priors, Statistica Sinica, 4 (1994), pp. 639-650.
16. M. J. Wainwright and M. I. Jordan (2008), Graphical models, exponential families, and variational inference, Foundations and Trends in Machine Learning, Vol. 1, Numbers 1-2, pp. 1-305, December 2008.