Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks

Our objective is to sample the node set of a large unknown graph via crawling, to accurately estimate a given metric of interest. We design a random walk on an appropriately defined weighted graph that achieves high efficiency by preferentially crawl…

Authors: M. Kurant, M. Gjoka, C. T. Butts

Maciej Kurant, Minas Gjoka, Carter T. Butts, Athina Markopoulou
University of California, Irvine
{mkurant, mgjoka, buttsc, athina}@uci.edu

ABSTRACT

Our objective is to sample the node set of a large unknown graph via crawling, to accurately estimate a given metric of interest. We design a random walk on an appropriately defined weighted graph that achieves high efficiency by preferentially crawling those nodes and edges that convey greater information regarding the target metric. Our approach begins by employing the theory of stratification to find optimal node weights, for a given estimation problem, under an independence sampler. While optimal under independence sampling, these weights may be impractical under graph crawling due to constraints arising from the structure of the graph. Therefore, the edge weights for our random walk should be chosen so as to lead to an equilibrium distribution that strikes a balance between approximating the optimal weights under an independence sampler and achieving fast convergence. We propose a heuristic approach (stratified weighted random walk, or S-WRW) that achieves this goal, while using only limited information about the graph structure and the node properties. We evaluate our technique in simulation, and experimentally, by collecting a sample of Facebook college users. We show that S-WRW requires 13-15 times fewer samples than the simple re-weighted random walk (RW) to achieve the same estimation accuracy for a range of metrics.

1. INTRODUCTION

Many types of online networks, such as online social networks (OSNs), Peer-to-Peer (P2P) networks, or the World Wide Web (WWW), are measured and studied today via sampling techniques. This is due to several reasons.
First, such graphs are typically too large to measure in their entirety, and it is desirable to be able to study them based on a small but representative sample. Second, the information pertaining to these networks is often hard to obtain. For example, OSN service providers have access to all information in their user base, but rarely make this information publicly available.

There are many ways a graph can be sampled, e.g., by sampling nodes, edges, paths, or other substructures [23, 27]. Depending on our measurement goal, the elements with different properties may have different importance and should be sampled with a different probability. For example, Fig. 1(a) depicts the world's population, with residents of China (1.3B people) represented by blue nodes, of the Vatican (800 people) by black nodes, and all other nationalities represented by white nodes. Assume that we want to compare the median income in China and the Vatican. Taking a uniform sample of size 100 from the entire world's population is ineffective, because most of the samples will come from countries other than China and the Vatican. Even restricting our sample to the union of China and the Vatican will not help much, as our sample is unlikely to include any Vatican resident. In contrast, uniformly sampling 50 Chinese and 50 Vatican residents would be much more accurate with the same sampling budget.

* This is an extended version of a paper with the same title presented at SIGMETRICS'11. This work was supported by SNF grant PBELP2-130871, Switzerland, and by the NSF CDI Award 1028394, USA.

This type of problem has been widely studied in the statistical and survey sampling literature. A commonly used approach is stratified sampling [12, 28, 34], where nodes (e.g., people) are partitioned into a set of non-overlapping categories (or strata).
The objective is then to decide how many independent draws to take from each category, so as to minimize the uncertainty of the resulting measurement. This effect can be achieved in expectation by a weighted independence sampler (WIS) with appropriately chosen sampling probabilities $\pi^{\mathrm{WIS}}$. In our example, WIS samples Vatican residents with much higher probabilities than Chinese ones, and completely avoids the rest of the world, as illustrated in Fig. 1(b).

However, WIS, like every independence sampler, requires a sampling frame, i.e., a list of all elements we can sample from (e.g., a list of all Facebook users). This information is typically not available in today's online networks. A feasible alternative is crawling (also known as exploration or link-trace sampling). It is a graph sampling technique in which we can see the neighbors of already sampled users and make a decision on which users to visit next.

In this paper, we study how to perform stratified sampling through graph crawling. We illustrate the key idea and some of the challenges in Fig. 1. Fig. 1(c) depicts a social network that connects the world's population. A simple random walk (RW) visits every node with frequency proportional to its degree, which is reflected by the node size. In this particular example, for simplicity of illustration, all nodes have the same degree, equal to 3. As a result, RW is equivalent to the uniform sample of the world's population, and faces exactly the same problem of wasting resources by sampling all nodes with the same probability.

We address these problems by appropriately setting the edge weights and then performing a random walk on the weighted graph, which we refer to as weighted random walk (WRW). One goal in setting the weights is to mimic the WIS-optimal sampling probabilities $\pi^{\mathrm{WIS}}$ shown in Fig. 1(b).
However, such a WRW might perform poorly due to potentially slow mixing. In our example, it will not even converge because the underlying weighted graph is disconnected, as shown in Fig. 1(d). Therefore, the edge weights under WRW (which determine the equilibrium distribution $\pi^{\mathrm{WRW}}$) should be chosen in a way that strikes a balance between the optimality of $\pi^{\mathrm{WIS}}$ and fast convergence.

[Figure 1 panels: (a) Population; (b) WIS weights $\pi^{\mathrm{WIS}}$; (c) Social graph $G$; (d) $\pi^{\mathrm{WIS}}$ applied to WRW; (e) S-WRW weights.]

Figure 1: Illustrative example. Our goal is to compare the blue and black subpopulations (e.g., with respect to their median income) in population (a). The optimal independence sampler, WIS (b), over-samples the black nodes, under-samples the blue nodes, and completely skips the white nodes. A naive crawling approach, RW (c), samples many irrelevant white nodes. A WRW that enforces WIS-optimal probabilities may result in poor or no convergence (d). S-WRW (e) strikes a balance between the optimality of WIS and fast convergence.

We propose Stratified Weighted Random Walk (S-WRW), a practical heuristic that effectively strikes such a balance. We refer to our approach as "walking on the graph with a magnifying glass", because S-WRW over-samples more relevant parts of the graph and under-samples less relevant ones. In our example, S-WRW results in the graph presented in Fig. 1(e). The only information required by S-WRW is the categories of the neighbors of every visited node, which is typically available in crawlable online networks, such as Facebook. S-WRW uses two natural and easy-to-interpret parameters, namely: (i) $\tilde{f}_\ominus$, which controls the fraction of samples from irrelevant categories, and (ii) $\gamma$, which is the maximal resolution of our magnifying glass, with respect to the largest relevant category.

The main contributions of this paper are the following.
• We propose to improve the efficiency of crawling-based graph sampling methods by performing a stratified weighted random walk that takes into account not only the graph structure but also the node properties that are relevant to the measurement goal.
• We design and evaluate S-WRW, a practical heuristic that sets the edge weights and operates with limited information.
• As a case study, we apply S-WRW to sample Facebook and estimate the sizes of colleges. We show that S-WRW requires 13-15 times fewer samples than a simple random walk for the same estimation accuracy.

The outline of the rest of the paper is as follows. Section 2 summarizes the most popular graph sampling techniques, including sampling by exploration. Section 3 presents classical stratified sampling. Section 4 combines stratified sampling with graph exploration, presenting a unified WRW approach that takes into account both network structure and node properties; various trade-offs and practical issues are discussed and an efficient heuristic (S-WRW) is proposed based on the insights. Section 5 presents simulation results. Section 6 presents an implementation of S-WRW for the problem of estimating the college friendship graph on Facebook. Section 7 presents related work. Section 8 concludes the paper.

2. SAMPLING TECHNIQUES

2.1 Notation

We consider an undirected, static$^{1}$ graph $G = (V, E)$, with $N = |V|$ nodes and $|E|$ edges. For a node $v \in V$, denote by $\deg(v)$ its degree, and by $\mathcal{N}(v) \subset V$ the list of neighbors of $v$. A graph $G$ can be weighted. We denote by $w(u, v)$ the weight of edge $\{u, v\} \in E$, and by

$$w(u) = \sum_{v \in \mathcal{N}(u)} w(u, v) \qquad (1)$$

the weight of node $u \in V$. For any set of nodes $A \subseteq V$, we define its volume $\mathrm{vol}(A)$ and weight $w(A)$, respectively, as

$$\mathrm{vol}(A) = \sum_{v \in A} \deg(v) \quad \text{and} \quad w(A) = \sum_{v \in A} w(v).$$
(2)

We will often use

$$f_A = \frac{|A|}{|V|} \quad \text{and} \quad f^{\mathrm{vol}}_A = \frac{\mathrm{vol}(A)}{\mathrm{vol}(V)} \qquad (3)$$

to denote the relative size of $A$ in terms of the number of nodes and the volumes, respectively.

Sampling. We collect a sample $S \subseteq V$ of $n = |S|$ nodes. $S$ may contain multiple copies of the same node, i.e., the sampling is with replacement.

In this section, we briefly review the techniques for sampling nodes from graph $G$. We also present the weighted random walk (WRW), which is the basic building block for our approach.

$^{1}$ Sampling dynamic graphs is currently an active research area [35, 40, 42], but out of the scope of this paper.

2.2 Independence Sampling

Uniform Independence Sampling (UIS) samples the nodes directly from the set $V$, with replacement, uniformly and independently at random, i.e., with probability

$$\pi^{\mathrm{UIS}}(v) = \frac{1}{N} \quad \text{for every } v \in V. \qquad (4)$$

Weighted Independence Sampling (WIS) is a weighted version of UIS. WIS samples the nodes directly from the set $V$, with replacement, independently at random, but with probabilities proportional to node weights $w(v)$:

$$\pi^{\mathrm{WIS}}(v) = \frac{w(v)}{\sum_{u \in V} w(u)}. \qquad (5)$$

In general, UIS and WIS are not possible in online networks because of the lack of a sampling frame. For example, the list of all user IDs may not be publicly available, or the user ID space may be too sparsely allocated. Nevertheless, we present them as a baseline for comparison with the random walks.

2.3 Sampling via Crawling

In contrast to independence sampling, the crawling techniques are possible in many online networks, and are therefore the main focus of this paper.

Simple Random Walk (RW) [29] selects the next-hop node $v$ uniformly at random among the neighbors of the current node $u$. In a connected and aperiodic graph, the probability of being at the particular node $v$ converges to the stationary distribution

$$\pi^{\mathrm{RW}}(v) = \frac{\deg(v)}{2 \cdot |E|}.$$
(6)

Metropolis-Hastings Random Walk (MHRW) is an application of the Metropolis-Hastings algorithm [30] that modifies the transition probabilities to converge to a desired stationary distribution. For example, we can achieve the uniform stationary distribution

$$\pi^{\mathrm{MHRW}}(v) = \frac{1}{N} \qquad (7)$$

by randomly selecting a neighbor $v$ of the current node $u$ and moving there with probability $\min\left(1, \frac{\deg(u)}{\deg(v)}\right)$. However, it was shown in [17, 35] that RW (after re-weighting, as in Section 2.4) outperforms MHRW for most applications. We therefore restrict our attention to comparing against RW.

Weighted Random Walk (WRW) is RW on a weighted graph [4]. At node $u$, WRW chooses the edge $\{u, v\}$ to follow with probability $P_{u,v}$ proportional to the weight $w(u, v) \geq 0$ of this edge, i.e.,

$$P_{u,v} = \frac{w(u, v)}{\sum_{v' \in \mathcal{N}(u)} w(u, v')}. \qquad (8)$$

The stationary distribution of WRW is:

$$\pi^{\mathrm{WRW}}(v) = \frac{w(v)}{\sum_{u \in V} w(u)}. \qquad (9)$$

WRW is the basic building block of our design. In the next sections, we show how to choose weights for a specific estimation problem.

Graph Traversals (BFS, DFS, RDS, ...) is a family of crawling techniques where no node is sampled more than once. Because traversals introduce a generally unknown bias (see Sec. 7), we do not consider them in this paper.

2.4 Correcting the bias

RW, WRW, and WIS all produce biased (nonuniform) node samples. But their bias is known and therefore can be corrected by an appropriate re-weighting of the measured values. This can be done using the Hansen-Hurwitz estimator [19], as first shown in [39, 41] for random walks and also used in [35]. Let every node $v \in V$ carry a value $x(v)$. We can estimate the population total $x_{\mathrm{tot}} = \sum_{v} x(v)$ by

$$\hat{x}_{\mathrm{tot}} = \frac{1}{n} \sum_{v \in S} \frac{x(v)}{\pi(v)}, \qquad (10)$$

where $\pi(v)$ is the sampling probability of node $v$ in the stationary distribution.
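Equations (8)-(10) can be checked by hand on a small example. The following is a minimal sketch, assuming a hypothetical four-node weighted graph and hypothetical node values `x` (neither comes from the paper). It builds the WRW transition probabilities of Eq. (8), confirms with exact rational arithmetic that the distribution of Eq. (9) is left unchanged by one step of the walk, and confirms that the expectation of a single Hansen-Hurwitz term $x(v)/\pi(v)$ under $\pi$ equals the population total.

```python
from fractions import Fraction

# Hypothetical toy graph: undirected weighted edges {u, v} -> w(u, v).
EDGE_WEIGHTS = {
    ("a", "b"): 1,
    ("b", "c"): 2,
    ("c", "a"): 3,
    ("c", "d"): 4,
}


def build_adjacency(edge_weights):
    """Return {u: {v: w(u, v)}}, storing each undirected edge in both directions."""
    adj = {}
    for (u, v), w in edge_weights.items():
        adj.setdefault(u, {})[v] = Fraction(w)
        adj.setdefault(v, {})[u] = Fraction(w)
    return adj


def node_weights(adj):
    """Eq. (1): w(u) is the sum of the weights of the edges incident on u."""
    return {u: sum(nbrs.values()) for u, nbrs in adj.items()}


def transition_probabilities(adj):
    """Eq. (8): P[u][v] = w(u, v) / sum of w(u, v') over neighbors v' of u."""
    w = node_weights(adj)
    return {u: {v: wuv / w[u] for v, wuv in nbrs.items()} for u, nbrs in adj.items()}


def stationary_distribution(adj):
    """Eq. (9): pi(v) = w(v) / sum of w(u) over all nodes u."""
    w = node_weights(adj)
    total = sum(w.values())
    return {v: wv / total for v, wv in w.items()}


def one_step(pi, P):
    """Push a distribution through one WRW step: (pi P)(v) = sum_u pi(u) P[u][v]."""
    out = {v: Fraction(0) for v in pi}
    for u, row in P.items():
        for v, p in row.items():
            out[v] += pi[u] * p
    return out


adj = build_adjacency(EDGE_WEIGHTS)
P = transition_probabilities(adj)
pi = stationary_distribution(adj)

# Hypothetical node values x(v). Under pi, the expectation of x(v) / pi(v)
# is sum_v pi(v) * x(v) / pi(v), i.e. the population total of Eq. (10).
x = {"a": 10, "b": 20, "c": 30, "d": 40}
expected_hh_term = sum(pi[v] * Fraction(x[v]) / pi[v] for v in pi)
```

Here `one_step(pi, P)` returns `pi` itself, which is what stationarity means, and `expected_hh_term` equals `sum(x.values())`. The check concerns the equilibrium only; it says nothing about how quickly a finite walk approaches it.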
In practice, we usually know $\pi(v)$, and thus $\hat{x}_{\mathrm{tot}}$, only up to a constant, i.e., we know the (non-normalized) weights $w(v)$. This problem disappears when we estimate the population mean $x_{\mathrm{av}} = \sum_{v} x(v) / N$ as

$$\hat{x}_{\mathrm{av}} = \frac{\sum_{v \in S} \frac{x(v)}{\pi(v)}}{\sum_{v \in S} \frac{1}{\pi(v)}} = \frac{\sum_{v \in S} \frac{x(v)}{w(v)}}{\sum_{v \in S} \frac{1}{w(v)}}. \qquad (11)$$

For example, for $x(v) = 1$ if $\deg(v) = k$ (and $x(v) = 0$ otherwise), $\hat{x}_{\mathrm{av}}(k)$ estimates the node degree distribution in $G$.

All the results in this paper are presented after this re-weighting step, whenever necessary.

3. STRATIFIED SAMPLING

In Sec. 1, we argued that in order to compare the median income of residents of China and the Vatican we should take 50 random samples from each of these two countries, rather than taking 100 UIS samples from China and the Vatican together (or, even worse, from the world's population).

This problem naturally arises in the field of survey sampling. The most common solution is stratified sampling [12, 28, 34], where nodes $V$ are partitioned into a set $\mathcal{C}$ of non-overlapping node categories (or "strata"), with $\bigcup_{C \in \mathcal{C}} C = V$. Next, we select uniformly at random $n_i$ nodes from category $C_i$. We are free to choose the allocation $(n_1, n_2, \ldots, n_{|\mathcal{C}|})$, as long as we respect the total budget of samples $n = \sum_i n_i$.

Under proportional allocation [28] (or "prop") we use $n_i \propto |C_i|$, i.e.,

$$n^{\mathrm{prop}}_i = |C_i| \cdot n / N. \qquad (12)$$

Another possibility is to do an optimal allocation (or "opt") that minimizes the variance $\mathbb{V}$ of our estimator for the specific problem of interest. For example, assume that every node $v \in V$ carries a value $x(v)$, and we may want to estimate the mean of $x$ in various scenarios, as discussed below.
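As a quick worked illustration of Eq. (12), the sketch below applies proportional allocation to the China/Vatican example, using the population sizes quoted in Sec. 1 and a budget of 100 samples restricted to the union of the two populations. The function and variable names are illustrative only.

```python
def proportional_allocation(sizes, n):
    """Eq. (12): n_i = |C_i| * n / N, where N is the total population size."""
    total = sum(sizes.values())
    return {name: size * n / total for name, size in sizes.items()}


# Sizes quoted in Sec. 1 (1.3B and 800 people).
sizes = {"China": 1_300_000_000, "Vatican": 800}
budget = 100

prop = proportional_allocation(sizes, budget)

# The equal split argued for in Sec. 1: 50 samples from each population.
equal = {name: budget / len(sizes) for name in sizes}
```

Under proportional allocation the expected number of Vatican residents in the sample, `prop["Vatican"]`, is far below one, so a comparison between the two populations is effectively impossible, whereas the equal split guarantees 50 samples on each side.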
3.1 Examples of Stratified Sampling Problems

3.1.1 Estimating the mean across the entire V

A classic application of stratification is to better estimate the population mean $\mu$, given several groups (strata) of different properties (e.g., variances). Given $n_i$ samples from category $C_i$, we can estimate the mean $\mu_i = \frac{1}{|C_i|} \sum_{v \in C_i} x(v)$ over category $C_i$ by

$$\hat{\mu}_i = \frac{1}{n_i} \sum_{v \in S \cap C_i} x(v) \quad \text{with} \quad \mathbb{V}(\hat{\mu}_i) = \frac{\sigma_i^2}{n_i}, \qquad (13)$$

where $\mathbb{V}(\hat{\mu}_i)$ is the variance of this estimator and $\sigma_i^2$ is the variance of population $C_i$. We can estimate the population mean $\mu$ by a weighted average over all $\hat{\mu}_i$ [28], i.e.,

$$\hat{\mu} = \sum_i \frac{|C_i|}{N} \cdot \hat{\mu}_i \quad \text{with} \quad \mathbb{V}(\hat{\mu}) = \sum_i \frac{|C_i|^2 \cdot \sigma_i^2}{N^2 \cdot n_i}.$$

Under proportional allocation (Eq. (12)), this boils down to $\mathbb{V}(\hat{\mu}^{\mathrm{prop}}) = \frac{1}{N \cdot n} \sum_i |C_i| \cdot \sigma_i^2$. However, we can apply Lagrange multipliers to find that $\mathbb{V}(\hat{\mu})$ is minimized when

$$n^{\mathrm{opt}}_i = \frac{|C_i| \cdot \sigma_i}{\sum_j |C_j| \cdot \sigma_j} \cdot n. \qquad (14)$$

This solution is sometimes called "Neyman allocation" [34]. This gives us the variance under optimal allocation $\mathbb{V}(\hat{\mu}^{\mathrm{opt}}) = \frac{1}{N^2 \cdot n} \left( \sum_i |C_i| \cdot \sigma_i \right)^2$.

The variances $\mathbb{V}(\hat{\mu}^{\mathrm{prop}})$ and $\mathbb{V}(\hat{\mu}^{\mathrm{opt}})$ are measures of the performance of proportional and optimal allocation, respectively. In order to make their practical interpretation easier, we also show how these variances translate into sample lengths. We define as gain $\alpha$ of 'opt' over 'prop' the number of times 'prop' must be longer than 'opt' in order to achieve the same variance:

$$\text{gain } \alpha = \frac{n^{\mathrm{prop}}}{n^{\mathrm{opt}}}, \quad \text{subject to } \mathbb{V}^{\mathrm{prop}} = \mathbb{V}^{\mathrm{opt}}.$$

In that case, the gain is

$$\alpha = N \cdot \frac{\sum_i |C_i| \cdot \sigma_i^2}{\left( \sum_i |C_i| \cdot \sigma_i \right)^2} \quad (\geq 1). \qquad (15)$$

Notice that this gain does not depend on the sample budget $n$. The gain is one of the main metrics we will use in the evaluation sections to assess the efficiency of our technique compared to the random walk.
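The relationship between Eqs. (12), (14) and (15) can be verified numerically. The sketch below is a minimal example with two hypothetical strata (sizes 900 and 100, standard deviations 1 and 10); these numbers are assumptions chosen for illustration, not data from the paper. Because both variances scale as $1/n$, the ratio of the two variances at equal budget should coincide with the gain of Eq. (15).

```python
import math


def proportional_allocation(sizes, n):
    """Eq. (12): n_i = |C_i| * n / N."""
    total = sum(sizes)
    return [s * n / total for s in sizes]


def neyman_allocation(sizes, sigmas, n):
    """Eq. (14): n_i is proportional to |C_i| * sigma_i."""
    denom = sum(s * sg for s, sg in zip(sizes, sigmas))
    return [s * sg / denom * n for s, sg in zip(sizes, sigmas)]


def variance_of_mean(sizes, sigmas, alloc):
    """V(mu_hat) = sum_i |C_i|^2 * sigma_i^2 / (N^2 * n_i)."""
    total = sum(sizes)
    return sum(
        s ** 2 * sg ** 2 / (total ** 2 * ni)
        for s, sg, ni in zip(sizes, sigmas, alloc)
    )


def gain(sizes, sigmas):
    """Eq. (15): alpha = N * sum_i |C_i| sigma_i^2 / (sum_i |C_i| sigma_i)^2."""
    total = sum(sizes)
    num = total * sum(s * sg ** 2 for s, sg in zip(sizes, sigmas))
    den = sum(s * sg for s, sg in zip(sizes, sigmas)) ** 2
    return num / den


# Hypothetical strata: a large homogeneous one and a small, highly variable one.
sizes = [900, 100]
sigmas = [1.0, 10.0]
n = 200

prop = proportional_allocation(sizes, n)
opt = neyman_allocation(sizes, sigmas, n)
v_prop = variance_of_mean(sizes, sigmas, prop)
v_opt = variance_of_mean(sizes, sigmas, opt)
alpha = gain(sizes, sigmas)
```

With these numbers the optimal allocation moves roughly half of the budget to the small stratum, and `v_prop / v_opt` matches `alpha` (about 3). Setting all standard deviations equal makes `gain` return 1, in line with the remark that Neyman allocation then reduces to proportional allocation.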
3.1.2 Highest precision for all categories

If we are equally interested in each category, we might want the same (highest possible) precision of estimating $\mu_i$ for all categories $C_i$. In this case, the metric to minimize is

$$\mathbb{V}_{\max} = \max_i \{ \mathbb{V}(\hat{\mu}_i) \} = \max_i \left\{ \frac{\sigma_i^2}{n_i} \right\}.$$

Under proportional allocation, this translates to $\mathbb{V}^{\mathrm{prop}}_{\max} = \frac{N}{n} \max_i \frac{\sigma_i^2}{|C_i|}$. But the optimal $n_i$, which makes $\mathbb{V}(\hat{\mu}_i)$ equal for all $i$, is

$$n^{\mathrm{opt}}_i = \frac{\sigma_i^2}{\sum_j \sigma_j^2} \cdot n. \qquad (16)$$

Consequently, $\mathbb{V}^{\mathrm{opt}}_{\max} = \frac{\sum_i \sigma_i^2}{n}$, which leads to gain

$$\alpha = \frac{\max_i \left\{ \frac{N}{|C_i|} \sigma_i^2 \right\}}{\sum_i \sigma_i^2} \quad (\geq 1). \qquad (17)$$

3.1.3 Smallest sum of variances across categories

Even if we are interested in all categories, an alternative objective is to maximize the average precision of category pair comparisons (see Sec. 5A.13 in [12]), which is equivalent to minimizing the sum

$$\mathbb{V}_{\Sigma} = \sum_i \mathbb{V}(\hat{\mu}_i) = \sum_i \frac{\sigma_i^2}{n_i}.$$

In this case, proportional allocation achieves $\mathbb{V}^{\mathrm{prop}}_{\Sigma} = \frac{N}{n} \sum_i \frac{\sigma_i^2}{|C_i|}$, while, using Lagrange multipliers, we get

$$n^{\mathrm{opt}}_i = \frac{\sigma_i}{\sum_j \sigma_j} \cdot n \quad \text{and} \quad \mathbb{V}^{\mathrm{opt}}_{\Sigma} = \frac{\left( \sum_i \sigma_i \right)^2}{n}, \qquad (18)$$

which leads to gain

$$\alpha = \frac{\sum_i \frac{N}{|C_i|} \sigma_i^2}{\left( \sum_i \sigma_i \right)^2} \quad (\geq 1). \qquad (19)$$

3.1.4 Relative sizes of node categories

Stratified sampling assumes that we know the sizes $|C_i|$ of node categories. In some applications, however, these sizes are unknown and among the values we need to estimate as well (e.g., by using UIS or WIS). We show in Appendix C (for $|\mathcal{C}| = 2$) that the optimal sample allocation and the corresponding gain $\alpha$ of WIS over UIS are respectively

$$n^{\mathrm{WIS}}_i = \frac{1}{|\mathcal{C}|} \cdot n \quad \text{and} \quad \alpha = \frac{N^2}{4 |C_1| \cdot |C_2|}. \qquad (20)$$

3.1.5 Irrelevant category $C_\ominus$ (aggregated)

In many practical cases, we may want to measure some (but not all) node categories. E.g., in Fig. 1, we are interested in blue and black nodes, but not in white ones.
Similarly, in our Facebook study in Section 6 we are only interested in self-declared college students, who account for only 3.5% of all users. We group all categories not covered by our measurement objective as a single irrelevant category $C_\ominus \in \mathcal{C}$, and we set $n^{\mathrm{opt}}_\ominus = 0$. In contrast, $n^{\mathrm{prop}}_\ominus = |C_\ominus| \cdot n / N$. As a result, under 'opt' we have $N / (N - |C_\ominus|)$ times more useful samples than under 'prop'. Now, if we allocate optimally all these useful samples between the relevant categories $\mathcal{C} \setminus \{C_\ominus\}$, the gain $\alpha$ becomes

$$\alpha = \frac{N}{N - |C_\ominus|} \cdot \alpha(\mathcal{C} \setminus \{C_\ominus\}), \qquad (21)$$

where $\alpha(\mathcal{C} \setminus \{C_\ominus\})$ is the gain (15), (17), (19) or (20), depending on the metric, calculated only within categories $\mathcal{C} \setminus \{C_\ominus\}$. In other words, gain $\alpha$ is now composed of two factors: (i) gain in avoiding irrelevant categories, and (ii) gain in optimal allocation of samples among the relevant categories.

3.1.6 Practical Guideline

Let us look at the optimal weights in the above scenarios, when all $\sigma_i = \sigma$ are the same. This is a reasonable working assumption in many practical settings, since we typically do not have prior estimates of $\sigma_i$. With this simplification, Eq. (14) becomes

$$n^{\mathrm{opt}}_i = \frac{|C_i|}{N} \cdot n = n^{\mathrm{prop}}_i.$$

In contrast, Eq. (16), Eq. (18) and Eq. (20) get simplified to

$$n^{\mathrm{opt}}_i = \frac{1}{|\mathcal{C}|} \cdot n.$$

In conclusion, if we are interested in comparing the node categories with respect to some properties (e.g., average node degree, category size), rather than estimating a property across the entire population, we should take an equal number of samples from every relevant category.

4. EDGE WEIGHT SETTING UNDER WRW

In the previous section, we studied the optimal sample allocation under (independence) stratified sampling. However, independence node sampling is typically impossible in large online graphs, while crawling the graph is a natural, available exploration primitive.
In this section, we show how to perform a weighted random walk (WRW) which approximates the stratified sampling of the previous section. We can formulate the general problem as follows:

Given a measurement objective, error metric and sampling budget $|S| = n$, set the edge weights in graph $G$ such that the WRW measurement error is minimized.

Although we are able to solve this problem analytically for some specific and fully known topologies, it is not obvious how to address it in general, especially under a limited knowledge of $G$. Instead, in this paper, we propose S-WRW, a heuristic to set the edge weights. S-WRW starts from a solution optimal under WIS, and takes into account practical issues that arise in graph exploration. Once the weights are set, we simply perform WRW as described in Section 2.3 and collect samples.

4.1 Preliminaries

4.1.1 Category-level granularity

One can think of the problem in two levels of granularity: the original graph $G = (V, E)$ and the category graph $G_{\mathcal{C}} = (\mathcal{C}, E_{\mathcal{C}})$. In $G_{\mathcal{C}}$, nodes represent categories, and every undirected edge $\{C_1, C_2\} \in E_{\mathcal{C}}$ represents the corresponding non-empty set of edges $E_{C_1, C_2} \subset E$ in the original graph $G$, i.e.,

$$E_{C_1, C_2} = \{ \{u, v\} \in E : u \in C_1 \text{ and } v \in C_2 \} \neq \emptyset.$$

In our approach, we move from the finer granularity of $G$ to the coarser granularity of $G_{\mathcal{C}}$. This means that we are interested in collecting, say, $n_i$ samples from category $C_i$, but we do not control how these $n_i$ nodes are collected (i.e., with what individual sampling probabilities).

The rationale for that simplification is twofold. From a theoretical point of view, categories are exactly the properties of interest in the estimation problems we consider. From a practical point of view, it is relatively easy to obtain or infer information about categories, as we show, e.g., in Sec. 4.2.1.
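To make the two levels of granularity concrete, the sketch below derives a category graph from a node-level edge list. It is only an illustration under stated assumptions: the five-node path graph, the category labels and the function name `category_graph` are hypothetical, and intra-category edges are kept under a single-element key, a choice the text does not prescribe.

```python
from collections import defaultdict


def category_graph(edges, category):
    """Group the edges of G by the (unordered) pair of endpoint categories.

    Returns {frozenset({C1, C2}): set of edges {u, v} with u in C1, v in C2}.
    Only non-empty edge sets appear, mirroring the definition of E_C.
    Intra-category edges end up under the single-element key frozenset({C}).
    """
    groups = defaultdict(set)
    for u, v in edges:
        key = frozenset((category[u], category[v]))
        groups[key].add(frozenset((u, v)))
    return dict(groups)


# Hypothetical path graph 1 - 2 - 3 - 4 - 5, with a white node in the middle.
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
category = {1: "red", 2: "red", 3: "white", 4: "green", 5: "green"}

gc = category_graph(edges, category)
```

In this toy example `gc` has no `{"red", "green"}` key: the two relevant categories are joined only through the white category, which is the kind of structure that makes a walk restricted to relevant nodes fail to converge.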
4.1.2 Stratification in expectation

Ideally, we would like to enforce strictly stratified sampling. However, when we use crawling instead of independence sampling, sampling exactly $n_i$ nodes from category $C_i$ (and no other nodes) is possible only by discarding observations. It is thus more natural to frame the problem in terms of the probability mass placed on each category in equilibrium. This can be achieved by making the weight $w(C_i)$ of each category proportional to the desired number $n_i$ of samples, i.e.,

$$w(C_i) \propto n_i. \qquad (22)$$

As a result, we draw $n_i$ samples from $C_i$ in expectation.

Figure 2: Overview of our approach.
• Main guideline (to be modified): Set the edge weights in category $C_i$ to $w^{\mathrm{WIS}}(C_i) / \mathrm{vol}(C_i)$.
• Step 1: Estimation of Category Volumes. Estimate $\mathrm{vol}(C_i)$ with a pilot RW estimator $\widehat{\mathrm{vol}}(C_i)$ as in Eq. (35).
• Step 2: Category Weights Optimal Under WIS. For the given measurement objective, calculate $w^{\mathrm{WIS}}(C_i)$ as in Sec. 3.
• Step 3: Include Irrelevant Categories. Modify $w^{\mathrm{WIS}}(C_i)$. Parameter $\tilde{f}_\ominus$: desired fraction of irrelevant nodes.
• Step 4: Tiny and Unknown Categories. Modify $\widehat{\mathrm{vol}}(C_i)$. Parameter $\gamma$: maximal resolution.
• Step 5: Edge Conflict Resolution. Set the weights of inter-category edges to Eq. (28).
• WRW sample: Use transition probabilities proportional to edge weights (Sec. 2.3).
• Correct for the bias: Apply formulas from Sec. 2.4.
• Final result.

4.1.3 Main guideline

As the main guideline, S-WRW tries to realize the category weights $w^{\mathrm{WIS}}(C_i)$ that are optimal under WIS. There are many edge weight settings in $G$ that achieve $w^{\mathrm{WIS}}(C_i)$. In our implementation, we observe that $\mathrm{vol}(C_i)$ counts the number of edges incident on nodes of $C_i$. Consequently, if for every category $C_i$ we set in $G$ the weights of all edges incident on nodes in $C_i$ to

$$w_e(C_i) = \frac{w^{\mathrm{WIS}}(C_i)}{\mathrm{vol}(C_i)}, \qquad (23)$$

then the weights $w^{\mathrm{WIS}}(C_i)$ are achieved.
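The claim behind Eq. (23) can be checked directly on a toy graph. The sketch below is an illustration under explicit assumptions: the graph (a triangle in category "A" and a single edge in category "B"), the equal target weights and all function names are hypothetical. The toy graph deliberately contains no inter-category edges, since on such an edge the two endpoints would request different weights; that case is the subject of Sec. 4.2.5.

```python
from fractions import Fraction


def volumes(edges, category):
    """vol(C) = sum of deg(v) over v in C; each edge adds 1 per endpoint."""
    vol = {}
    for u, v in edges:
        for node in (u, v):
            c = category[node]
            vol[c] = vol.get(c, 0) + 1
    return vol


def edge_weights_eq23(edges, category, target):
    """Eq. (23): every edge incident on category C gets target[C] / vol(C)."""
    vol = volumes(edges, category)
    weights = {}
    for u, v in edges:
        if category[u] != category[v]:
            # Assumption of this sketch: inter-category edges are not handled.
            raise ValueError("inter-category edge: endpoint weights conflict")
        c = category[u]
        weights[(u, v)] = Fraction(target[c]) / vol[c]
    return weights


def category_weights(weights, category):
    """w(C) = sum over v in C of w(v); each edge adds its weight per endpoint."""
    total = {}
    for (u, v), w in weights.items():
        for node in (u, v):
            c = category[node]
            total[c] = total.get(c, 0) + w
    return total


# Hypothetical toy graph: a triangle in category A, a single edge in category B.
edges = [(1, 2), (2, 3), (3, 1), (4, 5)]
category = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
target = {"A": 1, "B": 1}  # equal category weights, as in Sec. 3.1.6

w = edge_weights_eq23(edges, category, target)
achieved = category_weights(w, category)
```

Here `achieved` reproduces `target` exactly, with the single edge of "B" receiving three times the weight of each triangle edge. The example also shows the price: with no inter-category edges the weighted graph is disconnected, so a walk could never move between the two categories.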
This simple observation is central to the S-WRW heuristic (see footnote 2).

In order to apply Eq. (23), we first have to calculate or estimate its terms $\mathrm{vol}(C_i)$ and $w^{\mathrm{WIS}}(C_i)$.$^{3}$ Below, we show how to do it in Steps 1 and 2, respectively. Next, in Steps 3-5, we show how to modify these terms to account for practical problems arising mainly from the underlying graph structure.

4.2 Our practical solution: S-WRW

4.2.1 Step 1: Estimation of Category Volumes

In general, we have no prior information about $G$ or $G_{\mathcal{C}}$. Fortunately, it is easy and inexpensive to estimate the relative category volumes $f^{\mathrm{vol}}_i$, which is the first piece of information we need in Eq. (23) (see footnote 3). Indeed, it is enough to run a relatively short pilot RW, and plug the collected sample $S$ into Eq. (35) derived in Appendix B, as follows

$$\hat{f}^{\mathrm{vol}}_i = \frac{1}{n} \sum_{u \in S} \left( \frac{1}{\deg(u)} \sum_{v \in \mathcal{N}(u)} 1_{\{v \in C_i\}} \right).$$

4.2.2 Step 2: Category Weights Optimal Under WIS

In order to find the optimal WIS category weights $w^{\mathrm{WIS}}(C_i)$ in Eq. (23), we first calculate $n^{\mathrm{opt}}_i$ as shown, under various scenarios, in Sec. 3. Next, we plug the resulting $n^{\mathrm{opt}}_i$ into Eq. (22), e.g., by setting $w^{\mathrm{WIS}}(C_i) = n^{\mathrm{opt}}_i$.

$^{2}$ There exist many other edge weight assignments that lead to $w^{\mathrm{WIS}}(C_i)$. Eq. (23) has the advantage of distributing the weights evenly across all $\mathrm{vol}(C_i)$ edges.
$^{3}$ In fact, we need to know $w_e(C_i)$ in Eq. (23) only up to a constant factor, because these factors cancel out in the calculation of transition probabilities of WRW in Eq. (8). Consequently, the same applies to $\mathrm{vol}(C_i)$ and $w^{\mathrm{WIS}}(C_i)$.

4.2.3 Step 3: Irrelevant Categories

[Figure 3 panels: (a) WIS: $w_1 > 0$, $w_2 = 0$; WRW: $w_1 = 0$, $w_2 > 0$. (b) WIS: $w_1 = 190\, w_2$; WRW: $w_1 \cong 60\, w_2$ for $n = 50$, $w_1 \cong 100\, w_2$ for $n = 500$, $w_1 = 190\, w_2$ for $n \to \infty$.]

Figure 3: Optimal edge weights: WIS vs WRW.
The objective is to compare the sizes of red (dark) and green (light) categories.

Problem: Potentially poor or no convergence. Consider the toy example in Fig. 3(a). We are interested in finding the relative sizes of red (dark) and green (light) categories. The white node in the middle is irrelevant for our measurement objective. Due to symmetry, we distinguish between two types of edges with weights $w_1$ and $w_2$. Under WIS, Eq. (20) gives us the optimal weights $w_1 > 0$ and $w_2 = 0$, i.e., WIS samples every non-white node with the same probability and never samples the white one. However, under WRW with these weights, relevant nodes get disconnected into two components and WRW does not converge. We observed a similar problem in Fig. 1.

Guideline: Occasionally visit irrelevant nodes. We show in Appendix D that the optimal WRW weights in Fig. 3(a) are $w_1 = 0$ and $w_2 > 0$. In that case, half of the samples are due to visits in the white (irrelevant) node. In other words, WRW may benefit from allocating small weight $w(C_\ominus) > 0$ to category $C_\ominus$ that groups all (if any) categories irrelevant to our estimation. The intuition is that irrelevant nodes may not contribute to estimation but may be needed for connectivity or fast mixing.

Implementation in S-WRW. In S-WRW, we achieve this goal by replacing $w^{\mathrm{WIS}}(C_i)$ with

$$\tilde{w}^{\mathrm{WIS}}(C_i) = \begin{cases} w^{\mathrm{WIS}}(C_i) & \text{if } C_i \neq C_\ominus \\ \tilde{f}_\ominus \cdot \sum_{C \neq C_\ominus} w^{\mathrm{WIS}}(C) & \text{if } C_i = C_\ominus. \end{cases} \qquad (24)$$

The parameter $0 \leq \tilde{f}_\ominus \ll 1$ controls the desired fraction of visits in $C_\ominus$.

4.2.4 Step 4: Tiny and Unknown Categories

Problem: "black holes". Every optical system has a fundamental magnification limit due to diffraction and our "graph magnifying glass" is no exception. Consider the toy graph in Fig. 3(b): it consists of a big clique $C_{\mathrm{big}}$ of 20 red nodes with edge weights $w_2$, and a green category $C_{\mathrm{tiny}}$ with two nodes only and edge weights $w_1$. In Sec.
3.1.4, we saw that WIS optimally estimates the relative sizes of red and green categories for $w(C_{\mathrm{big}}) = w(C_{\mathrm{tiny}})$, i.e., for $w_1 = 190\, w_2$. However, for such large values of $w_1$, the two green nodes behave as a sink (or a "black hole") for a WRW of finite length, thus increasing the variance of the category size estimation.

Guideline: limit edge weights. In other words, although WIS suggests over-sampling small categories, WRW should "under-over-sample" very small categories to avoid black holes. For example, in Fig. 3(b) $w_1 \simeq 60\, w_2$ ($\ll 190\, w_2$) is optimal for WRW of length $n = 50$ (simulation results).

Implementation in S-WRW. In S-WRW, we achieve this goal by replacing $\mathrm{vol}(C_i)$ in Eq. (23) with

$$\widetilde{\mathrm{vol}}(C) = \max \left\{ \widehat{\mathrm{vol}}(C),\ \mathrm{vol}_{\min} \right\}, \quad \text{where} \qquad (25)$$

$$\mathrm{vol}_{\min} = \frac{1}{\gamma} \cdot \max_{C \neq C_\ominus} \left\{ \widehat{\mathrm{vol}}(C) \right\}. \qquad (26)$$

Moreover, this formulation takes care of every category $C$ that was not discovered by the pilot RW in Sec. 4.2.1, by setting $\widetilde{\mathrm{vol}}(C) = \mathrm{vol}_{\min}$.

4.2.5 Step 5: Edge Conflict Resolution

Problem: Conflicting desired edge weights. With the above modifications, our target edge weights defined in Eq. (23) can be rewritten as

$$\tilde{w}_e(C_i) = \frac{\tilde{w}^{\mathrm{WIS}}(C_i)}{\widetilde{\mathrm{vol}}(C_i)}. \qquad (27)$$

We can directly set the weight $w(u, v) = \tilde{w}_e(C(u)) = \tilde{w}_e(C(v))$ for every intra-category edge $\{u, v\}$. However, for every inter-category edge, we usually have "conflicting" weights $\tilde{w}_e(C(u)) \neq \tilde{w}_e(C(v))$ desired at the two ends of the edge.

Guideline: prefer inter-category edges. There are several possible edge weight assignments that achieve the desired category node weights. High weights on intra-category edges and small weights on inter-category edges result in WRW staying in small categories $C_{\mathrm{tiny}}$ for a long time.
In order to improve the mixing time, we should do exactly the opposite, i.e., assign relatively high weights to inter-category edges (connecting relevant categories). As a result, WRW will enter $C_{\mathrm{tiny}}$ more often, but will stay there for a short time. This intuition is motivated by Monte Carlo variance reduction techniques such as the use of antithetic variates [15], which seek to induce negative correlation between consecutive draws so as to reduce the variance of the resulting estimator.

Implementation in S-WRW. We choose to assign an edge weight $\tilde{w}_e$ that is in between these two values $\tilde{w}_e(C(u))$ and $\tilde{w}_e(C(v))$. We considered several candidate assignments. We may take the arithmetic or geometric mean of the conflicting weights, which we denote by $w^{\mathrm{ar}}(u, v)$ and $w^{\mathrm{ge}}(u, v)$, respectively. We may also use the maximum of the two values, $w^{\max}(u, v)$, which should improve mixing according to the discussion above. However, $w^{\max}(u, v)$ alone would also add high weight to irrelevant nodes $C_\ominus$ (possibly far beyond $\tilde{f}_\ominus$). To avoid this undesired effect, we distinguish between the two cases by defining a hybrid solution:

$$w^{\mathrm{hy}}(u, v) = \begin{cases} w^{\mathrm{ge}}(u, v) & \text{if } C_\ominus \in \{ C(u), C(v) \} \\ w^{\max}(u, v) & \text{otherwise.} \end{cases} \qquad (28)$$

This hybrid edge assignment was the one we found to work best in practice (see Section 6).

4.3 Discussion

4.3.1 Information needed about the neighbors

In the pilot RW (Sec. 4.2.1) as well as in the main WRW, we assume that by sampling a node $v$ we also learn the category (but not degree) of each of its neighbors $u \in \mathcal{N}(v)$. Fortunately, such information is often available in online graphs at no additional cost, especially when scraping HTML pages (as we do). For example, when sampling colleges in Facebook (Sec.
6), we use the college membership information of all of $v$'s neighbors, which, in Facebook, is available at $v$ together with the friends list.

4.3.2 Cost of pilot RW

The pilot RW volume estimator described in Sec. 4.2.1 considers the categories not only of the sampled nodes, but also of their neighbors. As a result, it achieves high efficiency, as we show in simulations (Sec. 5.3.1) and Facebook measurements (Sec. 6.1). Given that, and the high robustness of S-WRW to estimation errors (see Sec. 5.3.5), the pilot RW should be only a small fraction of the later WRW (e.g., 6.5% in our Facebook measurements in Sec. 6).

4.3.3 Setting the parameters

S-WRW sets the edge weights trying to achieve roughly $w_{WIS}(C_i)$ as the main goal. We slightly shape $w_{WIS}(C_i)$ to avoid black holes and improve mixing, which is controlled by two natural and easy-to-interpret parameters, $\tilde{f}_\ominus$ and $\gamma$.

Irrelevant nodes visits $\tilde{f}_\ominus$. The parameter $0 \le \tilde{f}_\ominus \ll 1$ controls the desired fraction of visits in $C_\ominus$. When setting $\tilde{f}_\ominus$, we should exploit the information provided by the pilot crawl. If the relevant categories appear poorly interconnected and often separated by irrelevant nodes, we should set $\tilde{f}_\ominus$ relatively high. We have seen an extreme case in Fig. 3(a), with disconnected relevant categories and optimal $\tilde{f}_\ominus = 0.5$. In contrast, when the relevant categories are strongly interconnected, we should use a much smaller $\tilde{f}_\ominus$. However, because we can never be sure that the graph induced on relevant nodes is connected, we recommend always using $\tilde{f}_\ominus > 0$. For example, when measuring Facebook in Sec. 6, we set $\tilde{f}_\ominus = 1\%$.

Maximal resolution $\gamma$. The parameter $\gamma \ge 1$ can be interpreted as the maximal resolution of our "graph magnifying glass", with respect to the largest relevant category $C_{big}$.
S-WRW will typically sample well all categories that are less than $\gamma$ times smaller than $C_{big}$; all categories smaller than that are relatively undersampled (see Sec. 6.2.4). In the extreme case, for $\gamma \to \infty$, S-WRW tries to cover every category, no matter how small, which may cause the "black hole" problem discussed in Sec. 4.2.4. In the other extreme, for $\gamma = 1$ (and identical $w_{WIS}(C_i)$ for all categories, including $C_\ominus$), S-WRW reduces to RW. We recommend always setting $1 < \gamma < \infty$. Ideally, we know $|C_{smallest}|$, the smallest category size that is still relevant to us. In that case we should set $\gamma = |C_{big}| / |C_{smallest}|$.⁴ For example, in Sec. 6 the categories are US colleges; we set $\gamma = 1000$, because colleges with size smaller than 1/1000th of the largest one (i.e., with a few tens of students) seem irrelevant to our measurement. As another rule of thumb, we should try to set smaller $\gamma$ for relatively small sample sizes and in graphs with tight community structure (see Sec. 5.3.5).

⁴ Strictly speaking, $\gamma$ is related to volumes $\mathrm{vol}(C_i)$ rather than sizes $|C_i|$. They are equivalent when category volume is proportional to its size, which is often the case, and is the central assumption in the "scale-up method" [9].

4.3.4 Conservative approach

Note that a reasonable setting of these parameters (i.e., $\tilde{f}_\ominus > 0$ and $1 < \gamma < \infty$, and any conflict resolution discussed in the paper) increases the weights of large categories (including $C_\ominus$) and decreases the weight of small categories, compared to $w_{WIS}(C_i)$. This makes S-WRW allocate category weights between the two extremes: RW and WIS. Consequently, S-WRW can be considered conservative (with respect to WIS).

4.3.5 S-WRW is unbiased

It is also important to note that because the collected WRW sample is eventually corrected with the actual sampling weights as described in Sec.
2.4, the S-WRW estimation process is unbiased, regardless of the choice of weights (so long as convergence is attained). In contrast, suboptimal weights (e.g., due to estimation error of $\hat{f}^{vol}_C$) can increase WRW mixing time and/or the variance of the resulting estimator. However, our simulations and empirical experiments on Facebook (see Sec. 5 and 6) show that S-WRW is very robust to a suboptimal choice of weights.

5. SIMULATION RESULTS

The gain of our approach compared to RW comes from two main factors. First, S-WRW avoids, to a large extent or completely, the nodes in $C_\ominus$ that are irrelevant to our measurement. This fact alone can bring an arbitrarily large improvement ($\frac{N}{N - |C_\ominus|}$ under WIS), especially when $C_\ominus$ is large compared to $N$. We demonstrate this in the Facebook measurements in Section 6. Second, we can better allocate samples among the relevant categories. This factor is observable in our Facebook measurements as well, but it is more difficult to evaluate due to the lack of ground truth therein. In this section, we evaluate the optimal allocation gain in a controlled simulation and we demonstrate some key insights.

5.1 Setup

We consider a graph $G$ with 101K nodes and 505.5K edges organized in two densely (and randomly) connected communities⁵ as shown in Fig. 4(h). The nodes in $G$ are partitioned into two node categories: $C_{tiny}$ with 1K nodes (dark red), and $C_{big}$ with 100K nodes (light yellow). We consider two extreme scenarios of such a partition. The 'random' scenario is purely random, as shown in Fig. 4(a). In contrast, under 'clustered', categories $C_{tiny}$ and $C_{big}$ coincide with the existing communities in $G$, as shown in Fig. 4(h). It is arguably the worst-case scenario for graph sampling by exploration. We fix the edge weights of all internal edges in $C_{big}$ to 1.
All the remaining edges, i.e., all edges incident on nodes in category $C_{tiny}$, have weight $w$ each, where $w \ge 1$ is a parameter. Note that this is equivalent to setting $\tilde{w}_e(C_{big}) = 1$, $\tilde{w}_e(C_{tiny}) = w$, and 'max' or 'hybrid' conflict resolution.

5.2 Measurement objective and error metric

We are mainly interested in measuring the relative sizes $f_{tiny}$ and $f_{big}$ of categories $C_{tiny}$ and $C_{big}$, respectively. We use Normalized Root Mean Square Error (NRMSE) to assess the estimation error, defined as [37]:

$$\mathrm{NRMSE}(\hat{x}) = \frac{\sqrt{E\left[(\hat{x} - x)^2\right]}}{x}, \tag{29}$$

where $x$ is the real value and $\hat{x}$ is the estimated one.

⁵ The term "community" refers to cluster and is defined purely based on topology. The term "category" is a property of a node and is independent of topology.

[Figure 4, panels (a)-(n). Vertical axes: $\mathrm{NRMSE}(\hat{f}_{tiny})$, $\mathrm{NRMSE}(\hat{f}^{vol}_{tiny})$, $P[C_{tiny}\text{ visited}]$, gain $\alpha$. Horizontal axes: weight $w$ (equivalent to $\gamma$), sample length $n$. Curve labels: "$\gamma = 5$", "optimal".]

Figure 4: RW and S-WRW under two scenarios: Random (a-g) and Clustered (h-n). In (b,i), we show error of two volume estimators: naive Eq.(32) (dotted) and neighbor-based Eq.(35) (plain). Next, we show error of size estimator as a function of $n$ (c,j) and $w$ (d,g,k,n); in the latter, UIS and RW correspond to WIS and S-WRW for $w = 1$. In (e,l), we show the empirical probability that S-WRW visits $C_{tiny}$ at least once. Finally, (f,m) is gain $\alpha$ of S-WRW over RW under the optimal choice of $w$ (plain), and for fixed $\gamma = w = 5$ (dashed).

5.3 Results

5.3.1 Estimating volumes is usually cheap

The first step in S-WRW is obtaining category volume estimates $\hat{f}^{vol}_i$. We achieve it by running a short pilot RW and applying the estimator Eq.(35).
We show $\mathrm{NRMSE}(\hat{f}^{vol}_{tiny})$ as plain curves in Fig. 4(b). This estimator takes advantage of the knowledge of the categories of the neighboring nodes, which makes it much more efficient than the naive estimator Eq.(32) shown by dashed curves. Moreover, the advantage of Eq.(35) over Eq.(32) grows with the graph density and the skewness of its degree distribution (not shown here).

Note that under 'random', RW and WIS (with the sampling probabilities of RW) are almost equally efficient. However, on the other extreme, i.e., under the 'clustered' scenario, the performance of RW becomes much worse and the advantage of Eq.(35) over Eq.(32) diminishes. This is because essentially all friends of a node from category $C_i$ are in $C_i$ too, which reduces formula Eq.(35) to Eq.(32). Nevertheless, we show later in Sec. 5.3.5 that even severalfold volume estimation errors are unlikely to affect the results significantly.

5.3.2 Visiting the tiny category

Fig. 4(e,l) presents the empirical probability $P[C_{tiny}\text{ visited}]$ that our walk visits at least one node from $C_{tiny}$. Of course, this probability grows with the sample length. However, the choice of weight $w$ also helps. Indeed, WRW with $w > 1$ is more likely to visit $C_{tiny}$ than RW ($w = 1$, bottom line). This demonstrates the first advantage of introducing edge weights and WRW.

5.3.3 Optimal w and γ

Let us now focus on the estimation error as a function of $w$, shown in Fig. 4(d,k). Interestingly, this error does not drop monotonically with $w$ but follows a 'U'-shaped function with a clear optimal value $w_{opt}$.

Under WIS, we have $w_{opt} \simeq 100$, which confirms our findings in Sec. 3.1.4. Indeed, according to Eq.(20), we need the same number of samples from the two categories, and thus $w_{WIS}(C_{tiny}) = w_{WIS}(C_{big})$ (by Eq.(22)).
By plugging this and $\mathrm{vol}(C_{big}) = 100 \cdot \mathrm{vol}(C_{tiny})$ into Eq.(23), we finally obtain the WIS-optimal edge weights in $C_{tiny}$, i.e., $w_{opt} = w_e(C_{tiny}) = 100 \cdot w_e(C_{big}) = 100$.⁶

In contrast, WRW is optimized for $w < 100$. For the sample length $n = 500$ as in Fig. 4(d,k), the error is minimized already for $w_{opt} \simeq 20$ and increases for higher weights. This demonstrates the "black hole" effect discussed in Sec. 4.2.4. It is much more pronounced in the 'clustered' scenario, confirming our intuition that black holes become a problem only in the presence of relatively isolated, tight communities. Of course, the black hole effect diminishes with the sample length $n$ (and completely vanishes for $n \to \infty$), which can be observed in Fig. 4(g,n), especially in (n).

In other words, the optimal assignment of edge weights (in relevant categories) under WRW lies somewhere between RW (all weights equal) and WIS. In S-WRW, we control it by parameter $\gamma$. In this example, we have $\gamma \equiv w$ for $\gamma \le 100$. Indeed, by combining Eq.(23), Eq.(25), Eq.(26), and $w_{WIS}(C_{tiny}) = w_{WIS}(C_{big})$, we obtain

$$w = \frac{w}{1} = \frac{w_e(C_{tiny})}{w_e(C_{big})} = \frac{w_{WIS}(C_{tiny}) / \widetilde{\mathrm{vol}}(C_{tiny})}{w_{WIS}(C_{big}) / \widetilde{\mathrm{vol}}(C_{big})} = \frac{\widetilde{\mathrm{vol}}(C_{big})}{\widetilde{\mathrm{vol}}(C_{tiny})} = \frac{\mathrm{vol}(C_{big})}{\frac{1}{\gamma}\,\mathrm{vol}(C_{big})} = \gamma.$$

Consequently, the optimal setting of $\gamma$ is the same as $w_{opt}$ discussed above.

⁶ For simplicity, we ignored in this calculation the conflicts on the 500 edges between $C_{big}$ and $C_{tiny}$.

5.3.4 Gain α

The gain $\alpha$ of WIS over UIS is given by Eq.(20). In this case, we have $\alpha = (101\text{K})^2 \cdot (4 \cdot 1\text{K} \cdot 100\text{K})^{-1} \simeq 25$. Indeed, WIS with $n = 500$ samples shown in Fig. 4(d) achieves $\mathrm{NRMSE} \simeq 0.1$, which is the same as UIS with about $\alpha = 25$ times more samples (see Fig. 4(c)).

This gain due to stratification is smaller for sampling by exploration: a 500-hop-long WRW with $w \simeq 20$ yields the same error $\mathrm{NRMSE} \simeq 0.3$ as a 2000-hop-long RW.
This means that WRW reduces the sampling cost by a factor of $\alpha \simeq 4$. Fig. 4(f) shows that this gain does not vary much with the sampling length. Under 'clustered', both RW and WRW perform much worse. Nevertheless, Fig. 4(m) shows that also in this scenario WRW may significantly reduce the sampling cost, especially for longer samples.

It is worth noting that WRW can sometimes significantly outperform UIS. This is the case in Fig. 4(d), where UIS is equivalent to WIS with $w = 1$. Because no walk can mix faster than UIS (which is independent and thus has perfect mixing), improving the mixing time alone [5,10,37,38] cannot achieve the potential gains of stratification, in general.

So far we focused on the smaller set $C_{tiny}$ only. When estimating the size of $C_{big}$, all errors are much smaller, but we observe similar gain $\alpha$.

5.3.5 Robustness to γ and volume estimation

The gain $\alpha$ shown above is calculated for the optimal choice of $w$, or, equivalently, $\gamma$. Of course, in practice it might be impossible to obtain this value. Fortunately, S-WRW is relatively robust to the choice of parameters. The dashed lines in Fig. 4(f,m) are calculated for $\gamma$ fixed to $\gamma = 5$, rather than optimized. Note that this value is often drastically smaller than the optimal one (e.g., $w_{opt} \simeq 50$ for $n = 5000$). Nevertheless, although the performance somewhat drops, S-WRW still reduces the sampling cost about three-fold.

This observation also addresses potential concerns one might have regarding the category volume estimation error (see Sec. 4.2.1). Indeed, setting $\gamma = 5$ means that every category $C_i$ with volume estimated at $\widehat{\mathrm{vol}}(C_i) \le \frac{1}{5}\mathrm{vol}(C_{big})$ is treated the same. In Fig. 4(f), the volume of $C_{tiny}$ would have to be overestimated by more than 20 times in order to affect the edge weight setting and thus the results. We have seen in Sec.
5.3.1 that this is very unlikely, even under the smallest sample lengths and the most adversarial scenarios.

5.4 Summary

WRW brings two types of benefits: (i) avoiding irrelevant nodes $C_\ominus$ and (ii) carefully allocating samples between relevant categories of different sizes. Even when $C_\ominus = \emptyset$, WRW can still reduce the sampling cost by 75%. This second benefit is more difficult to achieve when the categories form strong and tight communities, which leads to the "black hole" effect. We should then choose smaller, more conservative values of $\gamma$ in S-WRW, which translate into smaller $w$ in our example. In contrast, under a looser community structure this problem disappears and WRW is closer to WIS.

6. IMPLEMENTATION IN FACEBOOK

As a concrete application, we apply S-WRW to measure the Facebook social graph, which is our motivating and canonical example. We also note that it is undirected and can be considered a static graph, for all practical purposes in this study.⁷ In Facebook, every user may declare himself or herself a member of a college⁸ he or she attends. This membership information is publicly available by default and allows us to answer some interesting questions. For example, how do the college networks (or "colleges" for short) compare with respect to their sizes? What is the college-to-college friendship graph? In order to answer these questions, we have to collect many college user samples, preferably evenly distributed between colleges. This is the main goal of this section.

6.1 Measurement Setup

By default, every Facebook user can see the basic information on any other user, including the name, photo, and a list of friends together with their college memberships (if any). We developed a high-performance multi-threaded crawler to explore Facebook's social graph by scraping this web interface.
To make an informed decision for the parameters of S-WRW, we first ran a short pilot RW (see Sec. 4.2.1) with a total of 65K samples (which is only 6.5% of the length of the main S-WRW sample). Although our pilot walk visited only 2000 colleges, it estimated the relative volumes $f^{vol}_i$ for about 9500 colleges discovered among friends of sampled users, as discussed in Sec. 4.3.2. In Fig. 6(a), we show that the neighbor-based estimator Eq.(35) greatly outperforms the naive estimator Eq.(32). These volumes cover several decades. Because colleges with only a few tens of users are not of interest to us, we set the maximal resolution to $\gamma = 1000$ (see the discussion in Sec. 4.3.3). Finally, because the college students looked very well interconnected in our pilot RW, we set the desired fraction of irrelevant nodes to a small number, $\tilde{f}_\ominus = 1\%$.

In the main measurement phase, we collected three S-WRW crawls, each with a different edge weight conflict resolution (hybrid, geometric, and arithmetic), and one simple RW crawl as a baseline comparison (Table 1). For each crawl type we collected 1 million unique users. Some of them are sampled multiple times (at no additional bandwidth cost), which results in the higher total number of samples in the second row of Table 1. Our crawls were performed on Oct. 16-19, 2010, and are available at [1].

⁷ The Facebook characteristics do change, but on time scales much longer than the 3-day duration of our crawls. Web sites such as Facebook statistics, Alexa, etc. show that the number of Facebook users is growing at a rate of 0.1-0.2% per day.

⁸ There also exist categories other than colleges, namely "work" and "high school". Facebook requires a valid category-specific email for verification.
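The parameter choices in this setup ($\gamma$ and the hybrid, geometric, and arithmetic conflict-resolution rules) enter the edge weights through Eq.(25)-(28). The following is a minimal Python sketch of those four equations, not the crawler used in the measurements. The function names, the `IRRELEVANT` label, and the toy volumes are hypothetical and chosen only for illustration; the WIS weights are simply set to 1 for every category, and the toy run uses $\gamma = 100$ rather than the $\gamma = 1000$ of the measurements, to keep the numbers readable.

```python
import math

IRRELEVANT = "C_irrelevant"  # hypothetical label standing in for the irrelevant category


def clamp_volumes(vol_hat, categories, gamma):
    """Eq.(25)-(26): floor every estimated volume at vol_min.

    vol_hat    -- pilot estimates {category: estimated volume}
    categories -- all categories of interest, including undiscovered ones
    gamma      -- maximal resolution, gamma >= 1
    """
    # Eq.(26): the maximum is taken over relevant categories only.
    vol_min = max(v for c, v in vol_hat.items() if c != IRRELEVANT) / gamma
    # Eq.(25): categories missed by the pilot walk end up at exactly vol_min.
    return {c: max(vol_hat.get(c, 0.0), vol_min) for c in categories}


def category_edge_weights(w_wis, vol_tilde):
    """Eq.(27): target edge weight of each category."""
    return {c: w_wis[c] / vol_tilde[c] for c in vol_tilde}


def resolve_conflict(we_u, we_v, cat_u, cat_v, rule="hybrid"):
    """Weight of edge {u, v} given the target weights at its two ends."""
    if rule == "arithmetic":
        return (we_u + we_v) / 2.0
    if rule == "geometric":
        return math.sqrt(we_u * we_v)
    if rule == "max":
        return max(we_u, we_v)
    if rule == "hybrid":
        # Eq.(28): geometric mean if either end is irrelevant, maximum otherwise.
        if IRRELEVANT in (cat_u, cat_v):
            return math.sqrt(we_u * we_v)
        return max(we_u, we_v)
    raise ValueError(f"unknown rule: {rule}")


# Toy numbers, not taken from the measurements.
vol_hat = {"A": 1000.0, "B": 0.5, IRRELEVANT: 50000.0}
vol_tilde = clamp_volumes(vol_hat, ["A", "B", "C", IRRELEVANT], gamma=100)
w_e = category_edge_weights({c: 1.0 for c in vol_tilde}, vol_tilde)
```

In this toy run the small category "B" and the undiscovered category "C" are both raised to the same floor, so they receive the same edge weight; this is the sense in which $\gamma$ caps the resolution of the "magnifying glass".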
n =4K n =40K n =20K R W R W S-WR W, geometric S-WR W, geometric S-WR W, arithmetic S-WR W, arithmetic S-WR W, hybrid S-WR W, hybrid R W R W R W geometric geometric geometric arithmetic arithmetic arithmetic hybrid hybrid hybrid Relative size b f i Number of samples n i Average NRMSE ( b f i ) Ranked colleges Number of samples n × 10 − 6 × 10 − 6 × 10 − 6 (a) (b) (c) (d) (e) (f ) Figure 5: 5331 college s discov ered and ranked by R W. (a) Estimated rel ati ve college size s b f i . (b) Absolute n umber of user samples p e r college. (c-e) 25 estimate s of size b f i for three different colleges and sample lengths n . (f ) A verage NRMSE of colle ge size estimation. Results i n (a,b,f ) a re bi nned. R W S-WR W Hybrid Geometric Ar ithmetic Unique samples 1,000K 1,000K 1,000K 1,000K T otal samples 1,016K 1,263K 1,228K 1,237K College samples 9% 86% 79% 58% Unique Colleges 5,331 9,014 8,994 10,439 T able 1: Overview of col lected F aceb o ok datasets. 6.2 Results: R W vs. S-WRW 6.2.1 A voiding irrelevant cate gories Only 9% of the R W’s samples come from colleges, which means that the v ast ma jorit y of sampling effort is wasted. In contrast, th e S-WR W crawls ac hiev ed 6-10 bet t er effi- ciency , collecting 86% (hybrid), 79% (geometric) and 58% (arithmetic) samples from colleges. Note t h at these v alues are significantly low er than the target 99% suggested by our choi ce of ˜ f ⊖ = 1%, and th at S-WR W hybrid reaches the highest num ber. This is in agreemen t with our discussion in Sec. 4.2.5 . Finally , we also note that S-WR W craw ls disco v- ered 1 . 6 − 1 . 9 times more un ique colleges than R W. It migh t seem surprising th at R W samples colleges in 9% of cases while only 3.5% of F aceb ook users b elong t o colleges. This can b e explained by lo oking at the last row s of T able 1. Indeed, the college users have on ave rage three times more F acebo ok friends than av erage users, and th erefore they at- tract R W approximately th ree times more often. 
6.2.2 Stratification

The advantage of S-WRW over RW does not lie exclusively in avoiding the nodes in the irrelevant category $C_\ominus$. S-WRW can also over-sample small categories (here colleges) at the cost of under-sampling large ones (which are very well sampled anyway). This feature becomes important especially when the category sizes differ significantly, which is the case in Facebook. Indeed, Fig. 5(a) shows that college sizes exhibit great heterogeneity. For a fair comparison, we only include the 5,331 colleges discovered by RW. (In fact, this filtering actually gives preference to RW. S-WRW crawls discovered many more colleges that we do not show in this figure.) The college sizes span more than two orders of magnitude and follow a heavily skewed distribution (not shown here).

Fig. 5(b) confirms that S-WRW successfully oversamples the small colleges. Indeed, the number of S-WRW samples per college is almost constant (roughly around 100). In contrast, the number of RW samples closely follows the college size, which results in dramatic 100-fold differences between RW and S-WRW for smaller colleges.

6.2.3 College size estimation

With more samples per college, we naturally expect a better estimation accuracy under S-WRW. We demonstrate it for three colleges of different sizes (in terms of the number of Facebook users): MIT (large), Caltech (medium), and Eindhoven University of Technology (small). Each boxplot in Fig. 5(c-e) is generated based on 25 independent college size estimates $\hat{f}_i$ that come from walks of length $n$ = 4K (left), 20K (middle), and 40K (right) samples each. For the three studied colleges, RW fails to produce reliable estimates in all cases except for MIT (largest college) under the two longest crawls. Similar results hold for the overwhelming majority of middle-sized and small colleges.
The underlying reason is the very small number of samples collected by RW in these colleges, averaging below 1 sample per walk. In contrast, the three S-WRW crawls typically contain 5-50 times more samples than RW (in agreement with Fig. 5(b)), and produce much more reliable estimates.

Finally, we aggregate the results over all colleges and compute the gain $\alpha$ of S-WRW over RW. We calculate the error $\mathrm{NRMSE}(\hat{f}_i)$ by taking as our "ground truth" $f_i$ the grand average of $\hat{f}_i$ values over all samples collected via all full-length walks and crawl types. Fig. 5(f) presents $\mathrm{NRMSE}(\hat{f}_i)$ averaged over all 5,331 colleges discovered by RW, as a function of walk length $n$. As expected, for all crawl types the error decreases with $n$. However, there is a consistent large gap between RW and all three versions of S-WRW. RW needs 13-15 times more samples than S-WRW in order to achieve the same error.

6.2.4 The effect of the choice of γ

Recall that in all the S-WRW results described above, we used the resolution $\gamma = 1000$. In order to check how sensitive the results are to the choice of this parameter, we also tried a (shorter) S-WRW run with $\gamma = 100$, i.e., ten times smaller. In Fig. 6(b), we see that the number of samples collected in the smallest colleges is smaller under $\gamma = 100$ than under $\gamma = 1000$. In fact, the two curves diverge for colleges about 100 times smaller than the biggest college, i.e., exactly at the maximal resolution $\gamma = 100$.

In any case, both settings of $\gamma$ perform orders of magnitude better than RW of the same length.

[Figure 6, panels (a)-(b). Axes: relative volume $\hat{f}^{vol}_i$ versus $\mathrm{NRMSE}(\hat{f}^{vol}_i)$; relative size $\hat{f}_i$ versus number of samples $n_i$ (logarithmic scales). Curves: neighbor, naive; pilot RW, S-WRW with $\gamma = 100$, S-WRW with $\gamma = 1000$.]

Figure 6: Facebook: Pilot RW and other walks of the same length $n$ = 65K. (a) The performance of the neighbor-based volume estimator Eq.(35) (plain line) and the naive one Eq.(32) (dashed line). As 'ground truth' we used $f^{vol}_i$ calculated for all 4 × 1M collected samples. (b) The effect of the choice of $\gamma$.

6.3 Summary

Only about 3.5% of 500M Facebook users are college members. There are more than 10K colleges and they vary greatly in size, ranging from 50 (or fewer) to 50K members (we aggregate students, alumni and staff). In this setting, state-of-the-art sampling methods such as RW are bound to perform poorly. Indeed, UIS, i.e., an idealized version of RW, with as many as 1M samples will collect only one sample from a size-500 college, on average. Even if we could magically sample directly only from colleges, we would typically collect fewer than 30 samples per size-500 college.

S-WRW solves these problems. We showed that S-WRW of the same length typically collects about 100 samples per size-500 college. As a result, S-WRW outperforms RW by $\alpha$ = 13-15 times, or $\alpha$ = 12-14 times if we also consider the 6.5% overhead from the initial pilot RW. This huge gain can be decomposed into two factors, say $\alpha = \alpha_1 \cdot \alpha_2$, as we proposed in Eq.(21). Factor $\alpha_1 \simeq 8$ can be attributed to an about 8 times higher fraction of college samples in S-WRW compared to RW. Factor $\alpha_2 \simeq 1.5$ is due to over-sampling smaller networks, i.e., to applying stratified sampling.

Another important observation is that S-WRW is robust to the way we resolve target edge weight conflicts in Sec. 4.2.5. The differences between the three S-WRW implementations are minor; it is the application of Eq.(27) that brings most of the benefit.

7. RELATED WORK

Graph Sampling by Exploration. Early crawling of P2P, OSN and WWW typically used graph traversals, mainly BFS [3,31–33,43] and its variants.
However, incomplete BFS introduces bias towards high-degree nodes that is unknown and thus impossible to correct in general graphs [2,8,17,25,26]. Later studies followed a more principled approach based on random walks (RW) [4,29]. The Metropolis-Hastings RW (MHRW) [16,30] removes the bias during the walk; it has been used to sample P2P networks [35,40] and OSNs [17]. Alternatively, we can use RW, whose bias is known and can be corrected for [20,39], thus leading to a re-weighted RW [17,35]. RW was also used to sample the Web [21], P2P networks [18,35,40], OSNs [17,24,33,36], and other large graphs [27]. It was empirically shown in [17,35] that RW outperforms MHRW in measurement accuracy. Therefore, RW can be considered the state of the art.

Random walks have also been used to sample dynamic graphs [35,40,42], which are outside the scope of this paper.

Fast Mixing Markov Chains. The mixing time of a random walk determines the efficiency of the sampling. On the practical side, the mixing time of RW in many OSNs was found to be larger than commonly believed [33]. Multiple dependent random walks [37] have been used to sample disconnected and loosely connected graphs. Random walks with jumps have been used to sample large graphs in [5,38] and in [27]. All the above methods treat all nodes with equal importance, which is orthogonal to our technique.

On the theoretical side, in [10], the authors propose a method to set edge weights that achieve the fastest mixing WRW for a given target stationary distribution. This technique, although related, is not applicable in our context. First, [10] requires the knowledge of the graph, which makes it inapplicable to $G$, yet possibly feasible in $G^C$ (after estimating some limited information about $G^C$ as in Sec. 4.2.1).
In the latter case, however, even given a perfect knowledge of $G^C$, [10] often assigns weight 0 to some self-loops, which likely makes the underlying graph $G$ disconnected. Finally, and most importantly, [10] takes a target stationary distribution as input. By taking $w_{WIS}$, we will face exactly the same problems of potentially poor convergence (Sec. 4.2.3) and "black holes" (Sec. 4.2.4) as we addressed by S-WRW.

Stratified Sampling. Our approach builds on stratified sampling [34], a widely used technique in statistics; see [12,28] for a good introduction.

A related work in a different networking problem is [14], where threshold sampling is used to vary sampling probabilities of network traffic flows and estimate their volume.

Weighted Random Walks for Sampling. Random walks on graphs with weighted edges, or equivalently reversible Markov chains [4,29], are well studied and heavily used in Markov Chain Monte Carlo simulations [16] to sample a state space with a specified probability distribution. However, to the best of our knowledge, WRWs have not been designed explicitly for measurements of real online systems.

In the context of sampling OSNs, the closest works are [5,38]. Technically speaking, they use WRW. But they set as their only objective the minimization of the mixing time, which makes them orthogonal and complementary to our approach, as we discussed above.

Very recent applications of weighted random walks in online social networks include [6,7]. [7] uses WRW in the context of link prediction. The authors employ supervised learning techniques to set the edge weights, with the goal of increasing the probability of visiting nodes that are more likely to receive new links. [6] introduces WRW-based methods to generate samples of nodes that are internally well-connected but also approximately uniform over the population.
In both these papers, WRW is used to predict/extract something from a known graph. In contrast, we use WRW to estimate features of an unknown graph.

In the context of World Wide Web crawling, focused crawling techniques [11,13] have been introduced to follow web pages of specified interest and to avoid the irrelevant pages. This is achieved by performing a BFS type of sample, except that instead of a FIFO queue they use a priority queue weighted by the page relevancy. In our context, such an approach suffers from the same problems as regular BFS: (i) collected samples strongly depend on the starting point, and (ii) we are not able to unbias the sample.

8. CONCLUSION

We introduced Stratified Weighted Random Walk (S-WRW), an efficient way to sample large, static, undirected graphs via crawling and using minimal information. S-WRW performs a weighted random walk on the graph with weights determined by the estimation problem. We apply our approach to measure the Facebook social graph, and we show that S-WRW greatly outperforms the state-of-the-art sampling technique, namely the simple re-weighted random walk.

There are several directions for future work. First, S-WRW is currently an intuitive and efficient heuristic; in future work, we plan to investigate the optimal solution to problems identified in this paper and compare against or improve S-WRW. Second, it may be possible to combine these ideas with existing orthogonal techniques, some of which have been reviewed in Related Work, to further improve performance. Finally, we are interested in extending our techniques to dynamic graphs and non-stratified data.

9. REFERENCES

[1] Weighted Random Walks of the Facebook social graph: http://odysseas.calit2.uci.edu/research/, 2011.
[2] D. Achlioptas, A. Clauset, D. Kempe, and C. On the bias of traceroute sampling: or, power-law degree distributions in regular graphs.
Journal of the ACM, 2009.
[3] Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. Analysis of topological characteristics of huge online social networking services. In WWW, pages 835–844, 2007.
[4] D. Aldous and J. A. Fill. Reversible Markov Chains and Random Walks on Graphs. In preparation.
[5] K. Avrachenkov, B. Ribeiro, and D. Towsley. Improving Random Walk Estimation Accuracy with Uniform Restarts. In 7th Workshop on Algorithms and Models for the Web Graph, 2010.
[6] L. Backstrom and J. Kleinberg. Network Bucket Testing. In WWW, 2011.
[7] L. Backstrom and J. Leskovec. Supervised Random Walks: Predicting and Recommending Links in Social Networks. In ACM International Conference on Web Search and Data Mining (WSDM), 2011.
[8] L. Becchetti, C. Castillo, D. Donato, and A. Fazzone. A comparison of sampling techniques for web graph characterization. In LinkKDD, 2006.
[9] H. R. Bernard, T. Hallett, A. Iovita, E. C. Johnsen, R. Lyerla, C. McCarty, M. Mahy, M. J. Salganik, T. Saliuk, O. Scutelniciuc, G. A. Shelley, P. Sirinirund, S. Weir, and D. F. Stroup. Counting hard-to-count populations: the network scale-up method for public health. Sexually Transmitted Infections, 86(Suppl 2):ii11–ii15, Nov. 2010.
[10] S. Boyd, P. Diaconis, and L. Xiao. Fastest mixing Markov chain on a graph. SIAM Review, 46(4):667–689, 2004.
[11] S. Chakrabarti. Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks, 31(11-16):1623–1640, May 1999.
[12] W. G. Cochran. Sampling Techniques, volume 20 of McGraw-Hill Series in Probability and Statistics. Wiley, 1977.
[13] M. Diligenti, F. Coetzee, S. Lawrence, C. Giles, and M. Gori. Focused crawling using context graphs. In Proceedings of the 26th International Conference on Very Large Data Bases, pages 527–534, 2000.
[14] N. Duffield, C. Lund, and M. Thorup.
Learn more, sample less: control of volume and variance in network measurement. IEEE Transactions on Information Theory, 51(5):1756–1775, May 2005.
[15] J. Gentle. Random number generation and Monte Carlo methods. Springer Verlag, 2003.
[16] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall/CRC, 1996.
[17] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. In INFOCOM, 2010.
[18] C. Gkantsidis, M. Mihail, and A. Saberi. Random walks in peer-to-peer networks. In INFOCOM, 2004.
[19] M. Hansen and W. Hurwitz. On the Theory of Sampling from Finite Populations. Annals of Mathematical Statistics, 14(3), 1943.
[20] D. D. Heckathorn. Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations. Social Problems, 44:174–199, 1997.
[21] M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform URL sampling. In WWW, 2000.
[22] M. H. Kalos and P. A. Whitlock. Monte Carlo methods. Volume I: Basics. Wiley, 1986.
[23] E. D. Kolaczyk. Statistical Analysis of Network Data, volume 69 of Springer Series in Statistics. Springer New York, 2009.
[24] B. Krishnamurthy, P. Gill, and M. Arlitt. A few chirps about Twitter. In WOSN, 2008.
[25] M. Kurant, A. Markopoulou, and P. Thiran. On the bias of BFS (Breadth First Search). In ITC, also in arXiv:1004.1729, 2010.
[26] S. H. Lee, P.-J. Kim, and H. Jeong. Statistical properties of Sampled Networks. Phys. Rev. E, 73:16102, 2006.
[27] J. Leskovec and C. Faloutsos. Sampling from large graphs. In KDD, pages 631–636, 2006.
[28] S. Lohr. Sampling: design and analysis. Brooks/Cole, second edition, 2009.
[29] L. Lovász. Random walks on graphs: A survey. Combinatorics, Paul Erdos is Eighty, 2(1):1–46, 1993.
[30] N. Metropolis, A. W. Rosenbluth, M. N.
Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculation by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
[31] A. Mislove, H. S. Koppula, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Growth of the Flickr social network. In WOSN, 2008.
[32] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In IMC, pages 29–42, 2007.
[33] A. Mohaisen, A. Yun, and Y. Kim. Measuring the mixing time of social graphs. IMC, 2010.
[34] J. Neyman. On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection. Journal of the Royal Statistical Society, 97(4):558, 1934.
[35] A. Rasti, M. Torkjazi, R. Rejaie, N. Duffield, W. Willinger, and D. Stutzbach. Respondent-driven sampling for characterizing unstructured overlays. In Infocom Mini-conference, pages 2701–2705, 2009.
[36] A. H. Rasti, M. Torkjazi, R. Rejaie, and D. Stutzbach. Evaluating Sampling Techniques for Large Dynamic Graphs. In Technical Report, volume 1, 2008.
[37] B. Ribeiro and D. Towsley. Estimating and sampling graphs with multidimensional random walks. In IMC, volume 011, 2010.
[38] B. Ribeiro, P. Wang, and D. Towsley. On Estimating Degree Distributions of Directed Graphs through Sampling. UMass Technical Report, 2010.
[39] M. Salganik and D. D. Heckathorn. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology, 34(1):193–240, 2004.
[40] D. Stutzbach, R. Rejaie, N. Duffield, S. Sen, and W. Willinger. On unbiased sampling for unstructured peer-to-peer networks. In IMC, 2006.
[41] E. Volz and D. D. Heckathorn. Probability based estimation theory for respondent driven sampling. Journal of Official Statistics, 24(1):79–97, 2008.
[42] W.
Willinger, R. Rejaie, M. Torkjazi, M. Valafar, and M. Maggioni. OSN Research: Time to Face the Real Challenges. In HotMetrics, 2009.
[43] C. Wilson, B. Boe, A. Sala, K. P. N. Puttaswamy, and B. Y. Zhao. User interactions in social networks and their implications. In EuroSys, 2009.

Appendix A: Achieving Arbitrary Node Weights

Achieving arbitrary node weights by setting the edge weights in a graph $G = (V, E)$ is sometimes impossible. For example, for a graph that is a path consisting of two nodes ($v_1 - v_2$), it is impossible to achieve $w(v_1) \neq w(v_2)$. However, it is always possible to do so if there is a self-loop at each node.

Observation 1. For any undirected graph $G = (V, E)$ with a self-loop $\{v, v\}$ at every node $v \in V$, we can achieve an arbitrary distribution of node weights $w(v) > 0$, $v \in V$, by appropriate choice of edge weights $w(u, v) > 0$, $\{u, v\} \in E$.

Proof. Denote by $w_{\min}$ the smallest of all target node weights $w(v)$. Set $w(u, v) = w_{\min}/N$ for all non-self-loop edges (i.e., where $u \neq v$). Now, for every self-loop $\{v, v\} \in E$ set
\[
w(v, v) = \frac{1}{2}\left( w(v) - \frac{w_{\min}}{N} \cdot (\deg(v) - 2) \right).
\]
It is easy to check that, because there are exactly $\deg(v) - 2$ non-self-loop edges incident on $v$ (the self-loop itself accounts for the remaining 2 in $\deg(v)$ and is counted twice in $w(v)$), every node $v \in V$ will achieve the target weight $w(v)$. Moreover, the definition of $w_{\min}$ guarantees that $w(v, v) > 0$ for every $v \in V$.

Appendix B: Estimating Category Volumes

In this section, we derive efficient estimators $\hat{f}^{\mathrm{vol}}_C$ of the volume ratio $f^{\mathrm{vol}}_C = \frac{\mathrm{vol}(C)}{\mathrm{vol}(V)}$. Recall that $S \subset V$ denotes an independent sample of nodes in $G$, with replacement.

Node sampling
If $S$ is a uniform sample (UIS), then we can write
\[
\hat{f}^{\mathrm{vol}}_C = \frac{\sum_{v \in S} \deg(v) \cdot 1_{\{v \in C\}}}{\sum_{v \in S} \deg(v)}, \tag{30}
\]
which is a straightforward application of the classic ratio estimator [28].
In the more general case, when $S$ is selected using WIS, we have to correct for the linear bias towards nodes of higher weights $w(\cdot)$, as follows:
\[
\hat{f}^{\mathrm{vol}}_C = \frac{\sum_{v \in S} \deg(v) \cdot 1_{\{v \in C\}} / w(v)}{\sum_{v \in S} \deg(v) / w(v)}. \tag{31}
\]
In particular, if $w(v) \sim \deg(v)$, then
\[
\hat{f}^{\mathrm{vol}}_C = \frac{1}{n} \cdot \sum_{v \in S} 1_{\{v \in C\}}. \tag{32}
\]

Star sampling
Another approach is to focus on the set of all neighbors $\mathcal{N}(S)$ of sampled nodes (with repetitions) rather than on $S$ itself, i.e., to use 'star sampling' [23]. The probability that a node $v$ is a neighbor of a node sampled from $V$ by UIS is
\[
\sum_{u \in V} \frac{1}{N} \cdot 1_{\{v \in \mathcal{N}(u)\}} = \frac{\deg(v)}{N}.
\]
Consequently, the nodes in $\mathcal{N}(S)$ are asymptotically equivalent to nodes drawn with probabilities linearly proportional to node degrees. By applying Eq.(32) to $\mathcal{N}(S)$, we obtain (see footnote 9)
\[
\hat{f}^{\mathrm{vol}}_C = \frac{1}{\mathrm{vol}(S)} \sum_{u \in S} \sum_{v \in \mathcal{N}(u)} 1_{\{v \in C\}}, \tag{33}
\]
where we used $|\mathcal{N}(S)| = \sum_{u \in S} \deg(u) = \mathrm{vol}(S)$. In the more general case, when $S$ is selected using WIS, we correct for the linear bias towards nodes of higher weights $w(\cdot)$, as follows:
\[
\hat{f}^{\mathrm{vol}}_C = \frac{1}{\sum_{u \in S} \frac{\deg(u)}{w(u)}} \sum_{u \in S} \left( \frac{1}{w(u)} \sum_{v \in \mathcal{N}(u)} 1_{\{v \in C\}} \right). \tag{34}
\]
In particular, if $w(v) \sim \deg(v)$, then
\[
\hat{f}^{\mathrm{vol}}_C = \frac{1}{n} \sum_{u \in S} \left( \frac{1}{\deg(u)} \sum_{v \in \mathcal{N}(u)} 1_{\{v \in C\}} \right). \tag{35}
\]
Note that for every sampled node $v \in S$, the formulas Eq.(33-35) exploit all the $\deg(v)$ neighbors of $v$, whereas Eq.(30-32) rely on one node per sample only. Not surprisingly, Eq.(33-35) performed much better in all our simulations and implementations.

Footnote 9: As a side note, observe that formula Eq.(33) generalizes the "scale-up method" [9] used in social sciences to estimate the size (here $|C|$) of hidden populations (e.g., of drug addicts). Indeed, if we assume that the average node degree in $V$ is the same as in $C$, then $\mathrm{vol}(C)/\mathrm{vol}(V) = |C|/N$, which reduces Eq.(33) to the core formula of the scale-up method.
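The volume estimators of Appendix B can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `vol_ratio_node` and `vol_ratio_star`, the adjacency-list representation, and the toy graph are all hypothetical, and the graph is assumed to have no self-loops so that $\deg(v)$ equals the length of the adjacency list. Passing no weight function gives the UIS forms, Eq.(30) and Eq.(33); passing $w(v)$ gives the WIS forms, Eq.(31) and Eq.(34).

```python
from typing import Callable, Dict, Hashable, List, Optional, Set

Graph = Dict[Hashable, List[Hashable]]   # adjacency lists of an undirected graph
Weight = Callable[[Hashable], float]     # sampling weight w(v) used by WIS


def vol_ratio_node(sample: List[Hashable], graph: Graph, category: Set[Hashable],
                   weight: Optional[Weight] = None) -> float:
    """Node-sampling estimator, Eq.(31); with weight=None it is Eq.(30)."""
    w = weight if weight is not None else (lambda v: 1.0)
    num = sum(len(graph[v]) / w(v) for v in sample if v in category)
    den = sum(len(graph[v]) / w(v) for v in sample)
    return num / den


def vol_ratio_star(sample: List[Hashable], graph: Graph, category: Set[Hashable],
                   weight: Optional[Weight] = None) -> float:
    """Star-sampling estimator, Eq.(34); with weight=None it is Eq.(33)."""
    w = weight if weight is not None else (lambda v: 1.0)
    # For each sampled node u, count the neighbors of u that fall in the category.
    num = sum(sum(1 for x in graph[u] if x in category) / w(u) for u in sample)
    den = sum(len(graph[u]) / w(u) for u in sample)
    return num / den


# Hypothetical toy graph with edges a-b, a-c, a-d, b-c, so vol(V) = 8.
graph: Graph = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
category = {"a", "d"}                    # vol(C) = 3 + 1 = 4, true ratio 0.5

# Idealized samples: every node once (uniform), or deg(v) times (w(v) ~ deg(v)).
uniform_sample = ["a", "b", "c", "d"]
degree_sample = ["a", "a", "a", "b", "b", "c", "c", "d"]


def degree_weight(v: Hashable) -> float:
    """Weight proportional to degree, which turns Eq.(31)/(34) into Eq.(32)/(35)."""
    return float(len(graph[v]))


if __name__ == "__main__":
    print(vol_ratio_node(uniform_sample, graph, category))
    print(vol_ratio_star(uniform_sample, graph, category))
    print(vol_ratio_node(degree_sample, graph, category, degree_weight))
    print(vol_ratio_star(degree_sample, graph, category, degree_weight))
```

On these idealized samples all four calls recover the true ratio 0.5 (up to floating-point rounding). On a single sampled node the two families differ: the node estimator sees only whether that node is in $C$, while the star estimator averages over all of its neighbors, which is the intuition behind the lower variance reported for Eq.(33-35).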
Appendix C: Relative sizes of node categories

Consider a scenario with only two node categories, i.e., $\mathcal{C} = \{C_1, C_2\}$. Denote $f_1 = |C_1|/N$ and $f_2 = |C_2|/N$. The goal is to estimate $f_1$ and $f_2$ based on the collected sample $S$.

UIS - Uniform independence sampling.
Under UIS, the number $X_1$ of times we select a node from $C_1$ among $n$ attempts follows the binomial distribution $X_1 \sim \mathrm{Binom}(f_1, n)$. Therefore, we can estimate $f_1$ as
\[
\hat{f}^{\mathrm{UIS}}_1 = \frac{X_1}{n} \quad \text{with} \quad V(\hat{f}^{\mathrm{UIS}}_1) = \frac{f_1 f_2}{n}. \tag{36}
\]

WIS - Weighted independence sampling.
In contrast, under WIS, at every iteration the probability $\pi(v)$ of selecting a node $v$ is
\[
\pi(v) =
\begin{cases}
\pi_1 = \dfrac{1}{N} \cdot \dfrac{w_1}{w_1 f_1 + w_2 f_2} & \text{if } v \in C_1, \text{ and} \\[2ex]
\pi_2 = \dfrac{1}{N} \cdot \dfrac{w_2}{w_1 f_1 + w_2 f_2} & \text{if } v \in C_2,
\end{cases}
\]
where $w_1$ and $w_2$ are the weights $w(v)$ of nodes in $C_1$ and $C_2$, respectively. By applying the Hansen-Hurwitz estimator (separately for the numerator and the denominator), we obtain
\[
\begin{aligned}
\hat{f}^{\mathrm{WIS}}_1 = \frac{|\hat{C}_1|}{\hat{N}}
&= \frac{\sum_{v \in S} 1_{\{v \in C_1\}} / \pi(v)}{\sum_{v \in S} 1 / \pi(v)}
= \frac{X_1 / \pi_1}{X_1 / \pi_1 + (n - X_1) / \pi_2} \\
&= \frac{X_1 \cdot \pi_2}{X_1 (\pi_2 - \pi_1) + n \cdot \pi_1}
= \frac{X_1 \cdot w_2}{X_1 (w_2 - w_1) + n \cdot w_1},
\end{aligned} \tag{37}
\]
where $X_1$ is the number of samples taken from $C_1$. Note that to calculate $\hat{f}^{\mathrm{WIS}}_1$ we only need the values $w_1$ and $w_2$, which are set by us and thus known.

Computing the variance of $\hat{f}^{\mathrm{WIS}}_1$ is a bit more challenging. We use a first-order Taylor expansion (the 'Delta method') to approximate it as follows:
\[
\frac{\partial \hat{f}^{\mathrm{WIS}}_1}{\partial X_1} = \frac{n w_1 w_2}{\big( (w_2 - w_1) X_1 + n w_1 \big)^2},
\]
and
\[
V(\hat{f}^{\mathrm{WIS}}_1) \cong \left( \frac{\partial \hat{f}^{\mathrm{WIS}}_1}{\partial X_1} \big( E(X_1) \big) \right)^2 V(X_1) = \ldots = \frac{f_1 f_2}{n w_1 w_2} \cdot (f_1 w_1 + f_2 w_2)^2. \tag{38}
\]
In the above derivation, we used the fact that $E(X_1) = n N f_1 \pi_1$ and $V(X_1) = n N^2 f_1 \pi_1 f_2 \pi_2$. This comes from the fact that $X_1$ actually follows the binomial distribution $X_1 \sim \mathrm{Binom}(N f_1 \pi_1, n)$.

For $w_1 = w_2$, we are back in the UIS case.
But this is not necessarily the optimal choice of weights. Indeed, a quick application of Lagrange multipliers reveals that $V(\hat{f}^{\mathrm{WIS}}_1)$ is minimized when
\[
w_1 f_1 = f_2 w_2. \tag{39}
\]
Moreover, an analogous analysis shows that Eq.(39) minimizes $V(\hat{f}^{\mathrm{WIS}}_2)$ as well. In other words, the estimators of both $f_1$ and $f_2$ have the lowest variance if the total weighted mass of $C_1$ is equal to that of $C_2$. This implies, in expectation, equal allocation of samples between $C_1$ and $C_2$, i.e., $n^{\mathrm{WIS}}_i = \frac{n}{|\mathcal{C}|}$.

Finally, we can use Eq.(36), Eq.(38) and Eq.(39) to calculate the gain $\alpha$ of WIS over UIS:
\[
\alpha = \frac{V(\hat{f}^{\mathrm{UIS}}_1)}{V(\hat{f}^{\mathrm{WIS}}_1)} = \frac{1}{4 f_1 f_2} \quad (\geq 1). \tag{40}
\]
Note that we always have $\alpha \geq 1$, and $\alpha$ grows quickly with growing difference between $f_1$ and $f_2$.

Appendix D: Optimal WRW weights in Fig. 3(a)

Every time WRW visits the white node/category in Fig. 3(a), the next node is chosen uniformly from the red and green categories. We stay in this selected category for $k$ rounds, where $k$ is a geometric random variable with parameter $p = w_2 / (w_1 + w_2) \in [0, 1]$. Next, we come back to the white category, and reiterate the process. So the number $n_{\mathrm{red}}$ of times the red category is sampled is
\[
n_{\mathrm{red}} = \sum_{1}^{\mathrm{Binom}(0.5,\, n_{\mathrm{wh}})} \mathrm{Geom}(p),
\]
where $n_{\mathrm{wh}}$ is the number of visits to the white category. Because the random variables generated by $\mathrm{Binom}(0.5, n_{\mathrm{wh}})$ and $\mathrm{Geom}(p)$ are independent, we can write
\[
\begin{aligned}
E[n_{\mathrm{red}}] &= E[\mathrm{Binom}(0.5, n_{\mathrm{wh}})] \cdot E[\mathrm{Geom}(p)] = 0.5\, n_{\mathrm{wh}} / p, \\
V[n_{\mathrm{red}}] &= E[\mathrm{Binom}]\, V[\mathrm{Geom}] + E^2[\mathrm{Geom}]\, V[\mathrm{Binom}] = \frac{n_{\mathrm{wh}}}{4 p^2} (3 - 2p).
\end{aligned}
\]
A possible unbiased estimator of the relative size $f_{\mathrm{red}}$ of the red category (among relevant categories) is
\[
\hat{f}_{\mathrm{red}} = \frac{n_{\mathrm{red}}}{n_{\mathrm{wh}} / p},
\]
for which we get
\[
\begin{aligned}
E[\hat{f}_{\mathrm{red}}] &= \frac{E[n_{\mathrm{red}}]}{n_{\mathrm{wh}} / p} = \frac{1}{2} \quad \text{(unbiased)}, \\
V[\hat{f}_{\mathrm{red}}] &= \frac{V[n_{\mathrm{red}}]}{(n_{\mathrm{wh}} / p)^2} = \frac{3 - 2p}{4 n_{\mathrm{wh}}}.
\end{aligned}
\]
This variance is expressed as a function of $n_{\mathrm{wh}}$, and not of the total sample length $n$.
However, note that $n_{\mathrm{wh}}$ drops with decreasing $p$. Consequently, the variance $V[\hat{f}_{\mathrm{red}}]$ (expressed as a function of $n_{\mathrm{wh}}$ or of $n$) is minimized for $p = 1$, i.e., for $w_1 = 0$ and $w_2 > 0$ (and $n_{\mathrm{wh}} = n/2$).
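The moments derived in Appendix D can be checked with a small Monte Carlo simulation. The sketch below is a hedged illustration rather than the authors' code: the function names are hypothetical, the geometric variable is assumed to take values in $\{1, 2, \ldots\}$ (the number of rounds up to and including the one that leaves the category), and $p$ is assumed to be strictly positive. Only the process described in the text is simulated (white visit, uniform choice between red and green, geometric stay), not a walk on an actual graph.

```python
import random
from statistics import mean, pvariance
from typing import Tuple


def var_f_red(p: float, n_wh: int) -> float:
    """Closed-form variance of the estimator of f_red: (3 - 2p) / (4 n_wh)."""
    return (3.0 - 2.0 * p) / (4.0 * n_wh)


def geometric(p: float, rng: random.Random) -> int:
    """Number of Bernoulli(p) trials up to and including the first success."""
    k = 1
    while rng.random() >= p:   # failure with probability 1 - p: stay one more round
        k += 1
    return k


def simulate_f_red(p: float, n_wh: int, rng: random.Random) -> float:
    """One run with n_wh visits to the white category; returns n_red / (n_wh / p)."""
    n_red = 0
    for _ in range(n_wh):
        if rng.random() < 0.5:           # red and green are chosen uniformly
            n_red += geometric(p, rng)   # rounds spent in the red category
        # A stay in green is independent of n_red, so it is not simulated here.
    return n_red * p / n_wh


def empirical_moments(p: float, n_wh: int, runs: int, seed: int = 0) -> Tuple[float, float]:
    """Empirical mean and variance of the estimator over independent runs."""
    rng = random.Random(seed)
    estimates = [simulate_f_red(p, n_wh, rng) for _ in range(runs)]
    return mean(estimates), pvariance(estimates)


if __name__ == "__main__":
    for p in (0.25, 0.5, 1.0):
        m, v = empirical_moments(p, n_wh=200, runs=3000, seed=1)
        print(f"p={p}: mean={m:.4f} var={v:.6f} predicted var={var_f_red(p, 200):.6f}")
```

For a fixed $n_{\mathrm{wh}}$ the empirical mean should stay close to $1/2$ and the empirical variance close to $(3 - 2p)/(4 n_{\mathrm{wh}})$, which is smallest at $p = 1$, in line with the conclusion above.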
