On the bias of BFS

Maciej Kurant
School of Computer & Comm. Sciences, EPFL, Lausanne, Switzerland
maciej.kurant@gmail.com

Athina Markopoulou
EECS Dept, University of California, Irvine
athina@uci.edu

Patrick Thiran
School of Computer & Comm. Sciences, EPFL, Lausanne, Switzerland
patrick.thiran@epfl.ch

Abstract: Breadth First Search (BFS) is widely used for measuring large unknown graphs, such as Online Social Networks. It has been empirically observed that an incomplete BFS is biased toward high-degree nodes. In contrast to more studied sampling techniques, such as random walks, the precise bias of BFS has not been characterized to date. In this paper, we quantify the degree bias of BFS sampling. In particular, we calculate the node degree distribution expected to be observed by BFS as a function of the fraction of covered nodes, in a random graph $RG(p_k)$ with a given degree distribution $p_k$. Furthermore, we show that, for $RG(p_k)$, all commonly used graph traversal techniques (BFS, DFS, Forest Fire, and Snowball Sampling) lead to the same bias, and we show how to correct for this bias. To give a broader perspective, we compare this class of exploration techniques to random walks, which are well studied and easier to analyze. Next, we study by simulation the effect of graph properties not captured directly by our model. We find that the bias gets amplified in graphs with strong positive assortativity. Finally, we demonstrate the above results by sampling the Facebook social network, and we provide some practical guidelines for graph sampling.

Index Terms: BFS, Breadth First Search, graph sampling, degree bias, Online Social Networks (OSN).

I. INTRODUCTION

A large body of work in the networking community focuses on topology measurements at various levels, including the Internet, the Web (WWW), peer-to-peer (P2P) and online social networks (OSN).
The size of these networks and other practical restrictions make measuring the entire graph impossible. Instead, researchers typically collect and study a small but "representative" sample.

In this paper, we are particularly interested in sampling networks that naturally allow us to explore the neighbors of a given node (which is the case in WWW, P2P and OSN). A number of graph exploration techniques use this basic operation for sampling. They can be roughly classified in two categories: (a) with replacement (random walks), and (b) without replacement (graph traversal techniques).

In the first category, random walks, nodes can be revisited. This category includes the classic Random Walk (RW) as well as the Metropolis-Hastings Random Walk (MHRW). They are used for sampling of nodes on the Web [1], P2P networks [2]–[4], OSNs [5,6] and large graphs in general [7]. Random walks are well studied [8] and result in samples that have either no bias (MHRW) or a known bias (RW) that can be corrected for. Random walks are not the focus of this paper, but are discussed as a baseline for comparison.

In the second category, graph traversal techniques, each node is visited exactly once (if we let the process run until completion). These methods vary in the order in which they visit the nodes; examples include BFS, Depth-First Search (DFS), Forest Fire (FF) and Snowball Sampling (SBS). Graph traversals, especially BFS, are very popular and widely used for sampling large networks, e.g., WWW [9] or OSNs [10]–[12]. One reason is that BFS is well known (a textbook technique) and easy to understand. Another reason is that an (incomplete) BFS collects a full view (all nodes and edges) of some particular region in the graph, which is sometimes believed to be representative of the entire graph. For example, a BFS sample of a lattice is a (smaller) lattice.

[Figure 1. Vertical axis: $\langle k^* \rangle$, the expected observed average node degree, with reference levels $\langle k \rangle$ and $\langle k^2 \rangle / \langle k \rangle$. Horizontal axis: $f$, the fraction of sampled nodes, up to 1. Curves: Random Walk (RW); graph traversal techniques (BFS, DFS, Forest Fire, Snowball); Metropolis-Hastings Random Walk (MHRW).]

Fig. 1. Overview of results. We calculate the average node degree $\langle k^* \rangle$ (and the full degree distribution, not shown) expected to be observed by BFS in a random graph $RG(p_k)$ with a given degree distribution $p_k$, as a function of the fraction of sampled nodes $f$. We show RW and MHRW as a reference. $\langle k \rangle$ is the real average node degree, and $\langle k^2 \rangle$ is the real average squared node degree. Observations: (1) For a small sample size, BFS has the same bias as RW; with increasing $f$, the bias decreases; a complete BFS ($f = 1$) is unbiased, as is MHRW (or uniform sampling). (2) All common graph traversal techniques (that do not revisit the same node) lead to the same bias. (3) The shape of the BFS curve depends on the real node degree distribution $p_k$, but it is always monotonically decreasing.

Unfortunately, this intuition often fails. It was observed empirically that BFS introduces a bias towards high-degree nodes [9,13,14]. We also confirmed this fact in a recent measurement of Facebook [5], where our BFS crawler found the average node degree $\langle k_{BFS} \rangle \simeq 324$, while the real value is only $\langle k \rangle \simeq 94$, i.e., about 3.5 times smaller!

Given the popularity of BFS on the one hand, and its bias on the other hand, it is surprising that we still know relatively little about the statistical properties of node sequences returned by BFS. Indeed, sampling without replacement introduces complex dependencies, and no rigorous analytical explanation of the observed biases of BFS was available to date.

Our work is a first step toward understanding the statistical characteristics of incomplete BFS sampling.
In particular, we calculate precisely the node degree distribution expected to be observed by BFS as a function of the fraction of sampled nodes in a random graph $RG(p_k)$ with a given (and arbitrary) degree distribution $p_k$. We accompany this central result with additional related contributions. First, we show that in $RG(p_k)$, BFS is equivalent to other graph traversal techniques, such as Depth First Search (DFS), Snowball Sampling, and Forest Fire (FF). Second, we compare the bias of BFS (and other traversal techniques) to that of random walks. As shown in Fig. 1 and as also formally demonstrated in this paper, in the beginning of the exploration process, BFS exhibits exactly the same bias as the Random Walk (RW). With increasing fraction of sampled nodes $f$, this bias monotonically decreases. When the BFS is complete ($f = 1$), there is no bias, as can also be achieved by the Metropolis-Hastings Random Walk (MHRW). Moreover, given a biased sample, we derive an unbiased estimator of the original node degree distribution. In addition, we use simulation to confirm our analysis and investigate the effect of graph properties, such as assortativity, not captured directly by $RG(p_k)$. We complement it with real-world measurements of the Facebook social network.

Scope. Our theoretical results hold for the random graph model $RG(p_k)$ described in Section IV. We study some extensions of this model in simulations in Section VII. We also restrict our attention to BFS sampling of static graphs.

The outline of the paper is as follows. Section II discusses related work. Section III presents the graph sampling algorithms under study. Section IV presents the random graph model used in this paper. Section V analyzes the expected degree distribution of various graph sampling techniques; in particular, the main results related to BFS are derived in Section V-B.
Section VI shows how to correct for the bias. Section VII presents simulation results. Section VIII demonstrates the above ideas by sampling a real-world network, Facebook, and provides hints for graph sampling in practice. Section IX concludes and outlines future work.

II. RELATED WORK

BFS used in practice. BFS is widely used today for exploring large networks, such as OSNs. The following list provides some examples but is by no means exhaustive. In [10], Ahn et al. used BFS to sample Orkut and MySpace. In [11] and [15], Mislove et al. used BFS to crawl the social graph in four popular OSNs: Flickr, LiveJournal, Orkut, and YouTube. In [12], Wilson et al. measured the social graph and the user interaction graph of Facebook using several BFSs, each BFS constrained in one of the largest 22 regional Facebook networks. In our recent work [5], we have also crawled Facebook using various sampling techniques, including BFS, RW and MHRW.

It has been empirically observed that incomplete BFS and its variants introduce bias towards high-degree nodes [9,13,14]. We also confirmed this in Facebook [5], an observation that in fact inspired this paper.

Analyzing BFS. To the best of our knowledge, the sampling bias of BFS has not been analyzed so far. [16] and [17] are the closest related papers to our methodology. The original paper by Kim [16] analyzes the size of the largest connected component in the classic Erdős-Rényi random graph by essentially applying the configuration model with node degrees chosen from a Poisson distribution. To match the stubs (or 'clones' in [16]) uniformly at random in a tractable way, Kim proposes a "cut-off line" algorithm: he first assigns each stub a random index from $[0, np]$, and next progressively scans this interval. Achlioptas et al.
used this powerful idea in [17] to study the bias of traceroute sampling in random graphs with a given degree distribution. The basic operation in [17] is traceroute (i.e., "discover a path") and is performed from a single node to all other nodes in the graph. The union of the observed paths forms a "BFS-tree", which includes all nodes but misses some edges (e.g., those between nodes at the same depth in the tree). In contrast, the basic operation in the traversal methods presented in our paper is to discover all neighbors of a node, and it is applied to all nodes in increasing distance from the origin. Another important difference is that [17] studies a completed BFS-tree, whereas we study the sampling process when it has visited only a fraction $f < 1$ of nodes; a completed BFS ($f = 1$) is trivial in our case (it has no bias).

There is also a large body of literature on unequal probability sampling without replacement [18]. Although, at first, it seems to be a promising path to follow, to the best of our knowledge, none of the existing results is directly applicable to our problem. This is because, speaking in the terms used later in this paper, the available results either (i) require the knowledge of $q_k(f)$ as an input, or (ii) propose how to calculate $q_k(f)$ for the first two nodes only.

Another recent paper related to BFS bias is [19]. The paper is about Snowball Sampling [20], which is similar to BFS, and proposes a heuristic approach to correct the degree biases in the $i$-th generation of Snowball based on the values measured in generation $i - 1$. The authors show by simulation that this technique performs moderately well, especially when a significant fraction of nodes have been covered.

Random Walks. Simple and metropolized random walks are also used for crawling OSNs [5,6], P2P networks [2]–[4], the web [1] and large graphs in general [7].
Random walks are well studied [8]; their bias is known and can be corrected. Random walks are not the focus of the paper but are used as a baseline for comparison.

III. GRAPH EXPLORATION TECHNIQUES

Let $G = (V, E)$ be a connected graph with the set of vertices $V$ and a set of undirected edges $E$. Initially, $G$ is unknown, except for one (or some limited number of) seed node(s). When sampling through graph exploration, we begin at the seed node, and we recursively visit (one, some or all of) its neighbors. We distinguish two main categories of exploration techniques: with and without replacement.

A. Exploration with replacement (random walks)

Exploration with replacement, or simply a walk, allows revisiting the same node many times. Consider the following classic examples:

1) Random Walk (RW): In this classic sampling technique [8], we start at some seed node. At every iteration, the next-hop node $v$ is chosen uniformly at random among the neighbors of the current node $u$. It is easy to see that RW introduces a linear bias towards nodes of high degree [8].

2) Metropolis-Hastings Random Walk (MHRW): In this technique, as in RW, the next-hop node $w$ is chosen uniformly at random among the neighbors of the current node $u$. However, with a probability that depends on the degrees of $w$ and $u$, MHRW performs a self-loop instead of moving to $w$. More specifically, the probability $P_{u,w}$ of moving from $u$ to $w$ is as follows [21]:

$$P_{u,w} = \begin{cases} \frac{1}{k_u} \cdot \min\left(1, \frac{k_u}{k_w}\right) & \text{if } w \text{ is a neighbor of } u, \\ 1 - \sum_{y \neq u} P_{u,y} & \text{if } w = u, \\ 0 & \text{otherwise,} \end{cases} \tag{1}$$

where $k_v$ is the degree of node $v$. Essentially, MHRW reduces the transitions to high-degree nodes and thus eliminates the degree bias of RW. This property of MHRW was recently exploited in various network sampling contexts [2,3,5,6].
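As a concrete illustration of Eq. (1), the following minimal Python sketch computes one row of the MHRW transition matrix and performs a single MHRW move. It is a sketch under stated assumptions, not the authors' implementation: the graph is assumed to be simple (no self-loops or multi-edges) and stored as a dictionary `adj` mapping each node to a list of its neighbors, and the names `mhrw_row`, `mhrw_step` and the toy graph `star` are hypothetical.

```python
import random


def mhrw_row(adj, u):
    """Return {w: P_{u,w}} for the current node u, following Eq. (1).

    Assumes `adj` maps each node to a list of distinct neighbors
    (simple graph, no self-loops); this representation is illustrative.
    """
    k_u = len(adj[u])
    # Move to neighbor w with probability (1/k_u) * min(1, k_u/k_w).
    row = {w: (1.0 / k_u) * min(1.0, k_u / len(adj[w])) for w in adj[u]}
    # The remaining probability mass is the self-loop at u.
    row[u] = 1.0 - sum(row.values())
    return row


def mhrw_step(adj, u, rng):
    """Perform one MHRW move from u and return the next node (possibly u)."""
    w = rng.choice(adj[u])  # propose a neighbor uniformly at random
    # Accept with probability min(1, k_u/k_w); otherwise stay at u.
    if rng.random() < min(1.0, len(adj[u]) / len(adj[w])):
        return w
    return u


# Toy star graph (hypothetical): hub 0 connected to leaves 1, 2, 3.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
hub_row = mhrw_row(star, 0)
leaf_row = mhrw_row(star, 1)
next_node = mhrw_step(star, 1, random.Random(0))
```

On this star graph, a leaf proposes the hub at every step but accepts the move only with probability 1/3. This rejection is the mechanism that stops the walk from over-visiting the high-degree hub.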
3) Respondent-Driven Sampling (RDS): RDS was proposed and studied in the field of social sciences to penetrate hidden populations, such as that of drug addicts [22,23]. In the network sampling terminology, at each iteration RDS selects randomly exactly $n$ neighbors (typically $n \simeq 3$) of the current node $u$ and schedules them to visit later. RDS visits the nodes in the order they were scheduled. Thus, RDS is a modification of Snowball Sampling (described below) that allows node revisiting (see footnote 1). RDS introduces a degree bias that is known and can be corrected for. This was demonstrated in [23] for the case $n = 1$, which reduces RDS precisely to a Random Walk (RW). This approach was recently tested in [3] on various graph models and unstructured P2P networks.

Footnote 1: In practical RDS surveys in human populations, nodes (people) are not revisited. However, the revisiting assumption is necessary to formally correct for the degree bias [23]. The authors of [23] argue that this approximation is valid if the sample size is relatively small compared to the population size. In this paper we formally confirm this claim.

B. Exploration without replacement (graph traversals)

In contrast, exploration without replacement, or graph traversal, never revisits the same node. At the end of the process, and assuming that the graph is connected, all nodes are visited.

1) Breadth First Search (BFS): BFS is a classic graph traversal algorithm that starts from the seed and progressively explores all neighbors. At each new iteration the earliest explored but not-yet-visited node is selected next. Thus, BFS discovers all nodes within some distance from the seed.

2) Depth First Search (DFS): This technique is similar to BFS, except that at each iteration we select the latest explored but not-yet-visited node. As a result, DFS explores first the nodes that are far away (in the number of hops) from the seed.

3) Forest Fire (FF): FF is a randomized version of BFS, where for every neighbor $v$ of the current node, we flip a coin, with probability of success $p$, to decide if we explore $v$. FF reduces to BFS for $p = 1$. It is possible that this process dies out before it covers all nodes. In this case, in order to make FF comparable with other techniques, we revive the process from a random node already in the sample. Forest Fire is inspired by the graph growing model of the same name proposed in [24] and is used as a graph sampling technique in [7].

4) Snowball Sampling (SBS): Snowball Sampling is a precursor of RDS and a term loosely used for BFS-like traversal techniques. According to a classic definition by Goodman [20], an $n$-name Snowball Sampling is similar to BFS, but at every node $v$, not all $k_v$, but exactly $n$ neighbors are chosen randomly out of all $k_v$ neighbors of $v$. These $n$ neighbors are scheduled to visit, but only if they have not been visited before.

TABLE I. NOTATION SUMMARY. 'Observed' means calculated directly from the sample.

  $G = (V, E)$ : graph $G$ with nodes $V$ and edges $E$
  $k_v$ : degree of node $v$
  $p_k = \frac{1}{|V|} \sum_{v \in V} 1_{\{k_v = k\}}$ : degree distribution in $G$
  $q_k$ : expected observed degree distribution
  $\hat{q}_k$ : observed degree distribution
  $\hat{p}_k$ : estimated original degree distribution in $G$
  $\langle k \rangle = \sum_k k \, p_k$ : average node degree in $G$
  $\langle k^* \rangle = \sum_k k \, q_k$ : expected observed average node degree
  $f$ : fraction of nodes covered by the sample

IV. GRAPH MODEL $RG(p_k)$

A basic and important graph property is the node degree distribution $p_k$, i.e., the fraction of nodes with degree equal to $k$, for all $k \geq 0$.
Depending on the network, the degree distribution can vary, ranging from constant-degree (in regular graphs), through a distribution concentrated around the average value (e.g., in Erdős-Rényi random graphs or in well-balanced P2P networks), to heavily right-skewed distributions with $k$ covering several decades (in WWW, unstructured P2P, Internet at the Autonomous System level, OSNs).

Footnote 2 (on the definition of $p_k$): As we define $p_k$ as a 'fraction', not the 'probability', $p_k$ determines the degree sequence in the graph, and vice versa.

We handle all these cases by assuming that we are given any fixed node degree distribution $p_k$. Other than that, the graph $G$ is completely random. That is, $G$ is drawn uniformly at random from the set of all multigraphs (see footnote 3) with degree distribution $p_k$. We denote this model by $RG(p_k)$.

Footnote 3: A multigraph is a graph that accepts multiple edges and self-loops.

We use a classic technique to generate $RG(p_k)$, called the configuration model [25,26]: each node $v$ is given $k_v$ "stubs" (or "edges-to-be"). Next, all these $\sum_{v \in V} k_v = 2|E|$ stubs are randomly matched in pairs, until all stubs are exhausted (and $|E|$ edges are created). In Fig. 2 (ignore the rectangular interval $[0, 1]$ for now), we present four nodes with their stubs (left) and an example of their random matching (right).

V. ANALYZING THE NODE DEGREE BIAS

In this section, we study the node degree bias observed when the graph exploration techniques of Section III are run on the random graph $RG(p_k)$ of Section IV. In particular, we derive the node degree distribution $q_k$ and the average node degree $\langle k^* \rangle$ expected to be observed, as a function of the original degree distribution $p_k$ and, in the case of BFS, of the fraction of sampled nodes $f$.

A. Exploration with replacement (walks)

We begin by summarizing the relevant results known for walks, in particular for RW and MHRW.
They will serve as a reference point for our main analysis of graph traversals in the next section.

1) Random Walk (RW): Random walks have been widely studied; see [8] for an excellent survey. In any given connected and aperiodic graph, the probability of being at a particular node $v$ converges at equilibrium to the stationary distribution $\pi_v = \frac{k_v}{2|E|}$. Therefore, the expected observed degree distribution $q_k$ is

$$q_k = \sum_v \pi_v \cdot 1_{\{k_v = k\}} = \frac{k}{2|E|} \cdot \sum_v 1_{\{k_v = k\}} = \frac{k}{2|E|} \, p_k |V| = \frac{k \, p_k}{\langle k \rangle}, \tag{2}$$

where $\langle k \rangle$ is the average node degree in $G$. Eq. (2) is essentially similar to the calculation for RDS in [23,27]. As this holds for any fixed (and connected and aperiodic) graph, it is also true for all connected graphs generated by the configuration model. Consequently, the expected observed average node degree is

$$\langle k^* \rangle = \frac{\sum_k k^2 p_k}{\langle k \rangle} = \frac{\langle k^2 \rangle}{\langle k \rangle}, \tag{3}$$

where $\langle k^2 \rangle$ is the average squared node degree in $G$. We show this value $\frac{\langle k^2 \rangle}{\langle k \rangle}$ in Fig. 1.

2) Metropolis-Hastings Random Walk (MHRW): It is easy to show that the transition matrix $P_{u,w}$ shown in Eq. (1) leads to a uniform stationary distribution $\pi_v = \frac{1}{|V|}$ [21], and consequently:

$$q_k = p_k \tag{4}$$

$$\langle k^* \rangle = \sum_k k \cdot p_k = \langle k \rangle. \tag{5}$$

In Fig. 1, we show that MHRW estimates the true mean.

B. Exploration without replacement (Main Result)

In both RW and MHRW the nodes can be revisited. So the state of the system at iteration $i + 1$ depends only on iteration $i$, which makes it possible to analyze these walks as Markov chains. In contrast, graph traversals do not allow for node revisits, which introduces crucial dependencies between all the iterations and significantly complicates the analysis.

To handle these dependencies, we adopt an elegant technique recently introduced in [16] (to study the size of the largest connected component) and extended in [17] (to study the bias of traceroute sampling).
However, our work differs in many aspects from both [16] and [17], which we discuss in detail in the related work (Section II).

1) Exploration without replacement at the stub level: We begin by defining Algorithm 1 (below), a general graph traversal technique that collects a sequence of nodes $S$, without replacements. To be compatible with the configuration model (see Section IV), we are interested in the process at the stub level, where we consider one stub at a time, rather than one node at a time.

An integral part of the algorithm is a queue $Q$ that keeps the discovered, but not-yet-followed stubs. We start the algorithm by adding to $Q$ all the stubs of some initial node $v_1$, and by setting $S = [v_1]$. Next, at every iteration, we pop one stub $a$ from $Q$, and follow it to discover its partner-stub $b$, and $b$'s owner $v(b)$. If node $v(b)$ is not yet discovered, i.e., if $v(b) \notin S$, then we append $v(b)$ to $S$ and we add to $Q$ all other stubs of $v(b)$. More formally:

Algorithm 1 Stub-Level Graph Traversal
   1: S ← [v_1] and Q ← [all stubs of v_1]
   2: while Q is nonempty do
   3:     Pop a from Q
   4:     Discover a's partner b
   5:     if v(b) ∉ S then
   6:         Append v(b) to S
   7:         Add to Q all stubs of v(b) except b
   8:     else
   9:         Remove b from Q
  10:     end if
  11: end while

Depending on the scheduling discipline for the elements in $Q$ (line 3), Algorithm 1 implements BFS (for first-in first-out scheduling), DFS (last-in first-out) or Forest Fire (first-in first-out with randomized stub losses). Line 9 guarantees that the algorithm never traces back an edge, i.e., that the stub $a$ popped from $Q$ in line 3 never belongs to an edge that has already been traversed in the opposite direction.

2) Discovery on the fly: In line 4 of Algorithm 1, we follow stub $a$ to discover its partner $b$. In a fixed graph $G$, this step is deterministic.
In the configuration model $RG(p_k)$, a fixed graph $G$ is obtained by matching all the stubs uniformly at random. Next we can sample this fixed graph and average the result over the space of all the random graphs $RG(p_k)$ constructed in this way. Unfortunately, this quickly leads to complex combinatorial problems. We therefore adopt an alternative and more tractable construction, in which the fixed graph is built iteratively during sampling, by selecting $b$ 'on the fly' (i.e., every time line 4 is executed), uniformly at random from all the unmatched stubs. By the principle of deferred decisions [28], these two approaches are equivalent.

3) Breaking the dependencies: There is still one problem with the 'on the fly' method. It selects stub $b$ uniformly at random from all the unmatched stubs. This introduces dependencies between the stubs and across all the iterations.

We remedy this by implementing the 'on the fly' approach as follows. First, we assign each stub a real-valued index $t$ drawn uniformly at random from the interval $[0, 1]$. Then, every time we process line 4, we pick $b$ as the unmatched stub with the smallest index. We can interpret this as a continuous-time process, where we determine progressively the partners of stubs popped from queue $Q$, by scanning the interval from 'time' $t = 0$ to $t = 1$ in search of unmatched stubs. Because the indices chosen by the stubs are independent from each other, the above trick breaks the dependence between the stubs, which is crucial for making this approach tractable.

In Fig. 2, we present an example execution of Algorithm 1, where line 4 is implemented as described above.

[Figure 2. Three panels (left, center, right), each showing nodes $v_1$, $v_2$, $v_3$, $v_4$ with numbered stubs above an interval from 0 to 1 labeled "time t (index)"; the center panel marks the current time $t$.]

Fig. 2. An illustration of the stub-level, on-the-fly graph exploration without replacements. In this particular example, we show an execution of BFS starting at node $v_1$. Left: Initially, each node $v$ has $k_v$ stubs, where $k_v$ is a given target degree of $v$. Each of these stubs is assigned a real-valued number drawn uniformly at random from the interval $[0, 1]$ shown below the graph. Next, we follow Algorithm 1 with a starting node $v_1$. The numbers next to the stubs of every node $v$ indicate the order in which these stubs are added to the queue $Q$. Center: The state of the system at time $t$. All stubs in $[0, t]$ have already been matched (the indices of matched stubs are set in plain line). All unmatched stubs are distributed uniformly at random on $(t, 1]$. This interval can also contain some (here two) already matched stubs. Right: The final result is a realization of a random graph $G$ with a given node degree sequence (i.e., of the configuration model). $G$ may contain self-loops and multiedges.

4) Expected sampled degree distribution $q_k$: Now we are ready to derive the expected observed degree distribution $q_k$. Recall that all the stub indices are chosen independently and uniformly from $[0, 1]$. A vertex $v$ with degree $k$ is not sampled yet at time $t$ if the indices of all its $k$ stubs are larger than $t$, which happens with probability $(1 - t)^k$. So the probability that $v$ is sampled before time $t$ is $1 - (1 - t)^k$. Therefore, the expected fraction of vertices of degree $k$ sampled before $t$ is

$$f_k(t) = p_k \left(1 - (1 - t)^k\right). \tag{6}$$

By normalizing (6), we obtain the expected observed (sampled) degree distribution at time $t$:

$$q_k(t) = \frac{f_k(t)}{\sum_l f_l(t)} = \frac{p_k \left(1 - (1 - t)^k\right)}{\sum_l p_l \left(1 - (1 - t)^l\right)}. \tag{7}$$

Unfortunately, it is difficult to interpret $q_k(t)$ directly, because $t$ is proportional neither to the number of matched edges nor to the number of discovered nodes.
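Eqs. (6) and (7) are straightforward to evaluate for a given time $t$. The minimal Python sketch below does so; the dictionary representation of $p_k$ (degree mapped to fraction), the function names, and the toy distribution `p_toy` are illustrative assumptions rather than anything prescribed by the analysis.

```python
def sampled_fractions(p, t):
    """Eq. (6): f_k(t) = p_k * (1 - (1 - t)^k) for every degree k in p.

    `p` maps a degree k to the fraction p_k of nodes with that degree
    (an assumed, illustrative representation).
    """
    return {k: p_k * (1.0 - (1.0 - t) ** k) for k, p_k in p.items()}


def observed_distribution(p, t):
    """Eq. (7): normalize f_k(t) to get the expected observed q_k(t)."""
    f_k = sampled_fractions(p, t)
    total = sum(f_k.values())
    if total == 0.0:
        raise ValueError("nothing is sampled at t = 0; choose t in (0, 1]")
    return {k: value / total for k, value in f_k.items()}


# Toy distribution (hypothetical): half the nodes have degree 1, half degree 3.
p_toy = {1: 0.5, 3: 0.5}
q_half = observed_distribution(p_toy, 0.5)  # degree 3 is over-represented
q_full = observed_distribution(p_toy, 1.0)  # coincides with p_toy
```

In this toy case, $q_3(0.5) = 0.4375 / 0.6875 = 7/11$, noticeably above $p_3 = 0.5$, while at $t = 1$ every node has been sampled and the observed distribution coincides with $p_k$.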
Recall that our primary goal is to express $q_k$ as a function of the fraction $f$ of covered nodes. We achieve this by calculating $f(t)$, the expected fraction of nodes, of any degree, visited before time $t$:

$$f(t) = \sum_k f_k(t) = 1 - \sum_k p_k (1 - t)^k. \tag{8}$$

Because $p_k \geq 0$, and $p_k > 0$ for at least one $k > 0$, the term $\sum_k p_k (1 - t)^k$ is continuous and strictly decreasing from 1 to 0 with $t$ growing from 0 to 1. Thus, for $f \in [0, 1]$ there exists a well-defined function $t(f)$ that satisfies Eq. (8), i.e., the inverse of $f(t)$. Although we cannot compute $t(f)$ analytically (except in some special cases such as for $k \leq 4$), it is straightforward to find it numerically. Now, we can rewrite Eq. (7) as

$$q_k(f) = \frac{p_k \left(1 - (1 - t(f))^k\right)}{\sum_l p_l \left(1 - (1 - t(f))^l\right)}, \tag{9}$$

which is the expected observed degree distribution after covering fraction $f$ of nodes of graph $G$.

5) Equivalence of traversal techniques under $RG(p_k)$: An interesting observation is that, under the random graph model $RG(p_k)$, all common traversal techniques (BFS, DFS, FF, SBS, ...) are subject to exactly the same bias. This is because the sampled node sequence $S$ is fully determined by the choice of stub indices on $[0, 1]$, independently of the way we manage the elements in $Q$. This observation applies to the sequence $S$ only; the subgraphs of $G$ that we actually sample by BFS and DFS, for example, might differ significantly.

6) Equivalence to weighted sampling without replacement: Consider a node $v$ with degree $k_v$. The probability that $v$ is discovered before time $t$, given that it has not been discovered before $t_0 \leq t$, is

$$P(v \text{ before time } t \mid v \text{ not before } t_0) = 1 - \left(\frac{1 - t}{1 - t_0}\right)^{k_v}. \tag{10}$$

We now take the derivative $\frac{d}{dt}$ of the above equation, which results in the conditional probability density function $\frac{k_v}{1 - t_0} \left(\frac{1 - t}{1 - t_0}\right)^{k_v - 1}$. Setting $t \to t_0$ (but keeping $t > t_0$) reduces it to $\frac{k_v}{1 - t_0}$, which is the probability density that $v$ is sampled at $t_0$, given that it has not been sampled before. The factor $\frac{1}{1 - t_0}$ is the same for all nodes. This means that at every point in time, out of all nodes that have not yet been selected, the probability of selecting $v$ is proportional to its degree $k_v$. Therefore, this scheme is equivalent to node sampling weighted by degree, without replacements.

7) Equivalence to RW for $f \to 0$: Finally, for $f \to 0$ (and thus $t \to 0$), we have $1 - (1 - t)^k \simeq k t$, and Eq. (7) simplifies to Eq. (2). This means that in the beginning of the sampling process, every traversal technique is equivalent to RW, as shown in Fig. 1 for $f \to 0$.

8) $\langle k^* \rangle$ is decreasing in $f$: Let us denote by $X_i \in V$ the $i$-th selected node. As we have shown above that our procedure is equivalent to node degree weighted sampling without replacements, we can write:

$$P(X_1 = u) = \frac{k_u}{z}$$

$$P(X_2 = w) = \sum_{u \neq w} \frac{k_w}{z - k_u} \cdot \frac{k_u}{z} = \frac{k_w}{z} \cdot \alpha_w,$$

where $z = 2|E|$ and $\alpha_w = \sum_{u \neq w} \frac{k_u}{z - k_u}$. Because for any two nodes $a$ and $b$ we have $\alpha_b - \alpha_a = \frac{z (k_a - k_b)}{(z - k_a)(z - k_b)}$, $\alpha_w$ strictly decreases with growing $k_w$. As a result, $P(X_2)$ is more concentrated around nodes with smaller degrees than is $P(X_1)$, implying that $E[k_{X_2}] < E[k_{X_1}]$. We can use an analogous argument at every iteration $i \leq |V|$, which allows us to say that $E[k_{X_i}] < E[k_{X_{i-1}}]$. In other words, $\langle k^* \rangle(f)$ is a decreasing function of $f$.

A practical consequence is that many short traversals (e.g., BFS-es) are more biased than a long one, with the same total number of samples.

C. Comments on the starting node and graph connectivity

In all exploration techniques, the choice of the starting node $v_1$ can have a strong effect on the first iterations.
For example, if $v_1$ is a low-degree node then the degree distribution $\hat{q}_k$ sampled in the first iterations is naturally biased toward lower degrees. In fixed graphs, this problem is usually addressed by selecting $v_1$ as the last node of an appropriately long "burn-in" run of RW (or MHRW when this technique is used), started at an arbitrary node. In the case of a random graph $RG(p_k)$, the problem is even simpler, because already the second node of RW follows $\pi_v = \frac{k_v}{2|E|}$, which reduces the burn-in period to one hop only.

Another issue is that the configuration model $RG(p_k)$ might result in a graph $G$ that is not connected. In this case, every exploration technique covers only the component $C$ in which it was initiated; consequently, the process described in Section V-B3 stops once $C$ is covered.

D. A convenient interpretation

It might sometimes be convenient to split the exploration techniques, in $RG(p_k)$, into three simple classes, with respect to the node degree bias they experience. These classes can be defined as ways to sample nodes from a pool of all nodes $V$, independently of the actual topology of $G$. MHRW is equivalent to uniform node sampling with replacement. RW is equivalent to degree-weighted node sampling with replacement. Finally, all traversal techniques are equivalent to degree-weighted node sampling without replacement. The above holds strictly for $RG(p_k)$ only, but it can be an insightful interpretation in general.

VI. CORRECTING FOR NODE DEGREE BIAS

In the previous section we derived the expected observed degree distribution $q_k$ as a function of the original degree distribution $p_k$, for three general graph exploration techniques. The distribution $q_k$ is usually biased towards high-degree nodes.
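To make the size of this bias concrete, the minimal Python sketch below evaluates the expected observed average degree for the three classes of techniques: Eq. (5) for MHRW, Eq. (3) for RW, and Eq. (9) for graph traversals, with $t(f)$ obtained by inverting Eq. (8) numerically. Bisection is our choice of root finder here, not a method specified in the paper. The sketch assumes $p_0 = 0$ and $f \in (0, 1]$; the function names and the toy distribution `p_toy` are hypothetical.

```python
def coverage(p, t):
    """Eq. (8): expected fraction of nodes visited before time t."""
    return 1.0 - sum(p_k * (1.0 - t) ** k for k, p_k in p.items())


def time_for_coverage(p, f, tol=1e-12):
    """Invert Eq. (8) by bisection on [0, 1]; coverage is increasing in t."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if coverage(p, mid) < f:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def mean_degree_traversal(p, f):
    """Expected observed average degree under Eq. (9), for f in (0, 1]."""
    t = time_for_coverage(p, f)
    weights = {k: p_k * (1.0 - (1.0 - t) ** k) for k, p_k in p.items()}
    return sum(k * w for k, w in weights.items()) / sum(weights.values())


def mean_degree_rw(p):
    """Eq. (3): <k^2> / <k>."""
    return sum(k * k * p_k for k, p_k in p.items()) / sum(
        k * p_k for k, p_k in p.items()
    )


def mean_degree_mhrw(p):
    """Eq. (5): the true average degree <k>."""
    return sum(k * p_k for k, p_k in p.items())


# Toy distribution (hypothetical): half the nodes have degree 1, half degree 3.
p_toy = {1: 0.5, 3: 0.5}
traversal_curve = [mean_degree_traversal(p_toy, f) for f in (0.1, 0.5, 0.9, 1.0)]
```

For this toy distribution, `mean_degree_mhrw` gives 2 and `mean_degree_rw` gives 2.5. The traversal curve starts near the RW value for small $f$ and decreases to the true mean at $f = 1$.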
In this section, we derive unbiased estimators $\hat{p}_k$ and $\widehat{\langle k \rangle}$ of the original degree distribution $p_k$ and its mean $\langle k \rangle$, respectively. Let $S \subset V$ be a sequence of vertices that we sampled. Based on $S$, we can estimate $q_k$ as

$$\hat{q}_k = \frac{\text{number of nodes in } S \text{ with degree } k}{|S|} \qquad (11)$$

A. Random Walk (RW)

In order to estimate $p_k$ based on $\hat{q}_k$, consider again Eq. (2), which says that $q_k$ is proportional to $k\,p_k$. Therefore, $p_k$ is proportional to $q_k / k$, and $\hat{p}_k$ is proportional to $\hat{q}_k / k$, which allows us to write (similarly to [3,23]):

$$\hat{p}_k = \frac{\hat{q}_k}{k} \cdot \left( \sum_l \frac{\hat{q}_l}{l} \right)^{-1} \qquad (12)$$

where $\sum_l \frac{\hat{q}_l}{l}$ is a normalizing constant. From Eq. (12), we can estimate the average node degree as

$$\widehat{\langle k \rangle} = \sum_k k\,\hat{p}_k = \left( \sum_l \frac{\hat{q}_l}{l} \right)^{-1} = \frac{|S|}{\sum_{v \in S} \frac{1}{k_v}} \qquad (13)$$

B. Metropolis-Hastings Random Walk (MHRW)

In this case, Equations (4) and (5) trivially yield

$$\hat{p}_k = \hat{q}_k, \quad \text{and} \qquad (14)$$

$$\widehat{\langle k \rangle} = \sum_k k\,\hat{p}_k = \sum_k k\,\hat{q}_k. \qquad (15)$$

C. Graph traversal

From Eq. (9) we know that $p_k$ is proportional to $q_k(f) / (1 - (1 - t(f))^k)$. Consequently,

$$\hat{p}_k = \frac{\hat{q}_k}{1 - (1 - t(f))^k} \cdot \left( \sum_l \frac{\hat{q}_l}{1 - (1 - t(f))^l} \right)^{-1} \qquad (16)$$

However, in order to evaluate this expression, we need to evaluate $t(f)$, which in turn requires $p_k$. We can solve this chicken-and-egg problem iteratively if we know the real fraction $f_{real}$ of covered nodes or, equivalently, the graph size $|V|$. First, we evaluate Eq. (16) for some values of $t$ and feed the resulting $\hat{p}_k$'s into Eq. (8) to obtain the corresponding $f$'s. By repeating this process, we can drive the values of $f$ arbitrarily close to $f_{real}$, and thus find the desired $\hat{p}_k$.

In summary, for graph traversal techniques, Eq. (16) shows how to estimate the original degree distribution $p_k$ given that the real graph coverage $f_{real}$ is known, which is often the case in practice. Of course, based on our estimator $\hat{p}_k$, we can calculate the average node degree as $\widehat{\langle k \rangle} = \sum_k k\,\hat{p}_k$.

VII. SIMULATION RESULTS

In this section, we implement and simulate the considered sampling techniques, namely BFS, DFS, FF (with $p = 0.5$), RW and MHRW. The simulations confirm our analytical results. More importantly, in simulations we can study the effect of topological properties, such as assortativity, that are not directly captured by the random graph model $RG(p_k)$.

[Figure 3. Left panel, "Average node degree": x-axis $f$, fraction of covered nodes; y-axis $\langle k^* \rangle$, observed average node degree, with reference levels $\langle k \rangle$ and $\frac{\langle k^2 \rangle}{\langle k \rangle}$. Right panel, "Degree distribution": x-axis $k$, node degree; y-axis $P(k)$. Legend: real, $p_k$; expected, $q_k$; RW, sampled, $\hat{q}_k$; RW, estimate, $\hat{p}_k$; BFS, $f = 0.1$, sampled, $\hat{q}_k(f)$; BFS, $f = 0.1$, estimate, $\hat{p}_k(f)$; BFS, $f = 0.3$, sampled, $\hat{q}_k(f)$; BFS, $f = 0.3$, estimate, $\hat{p}_k(f)$.]

Fig. 3. Comparison of sampling techniques in theory and in simulation. Left: Observed (sampled) average node degree $\langle k^* \rangle$ as a function of the fraction $f$ of sampled nodes, for various sampling techniques. The results are averaged over 1000 graphs with 10000 nodes each, generated by the configuration model with a fixed heavy-tailed degree distribution $p_k$ (shown on the right). Right: Real, expected, and estimated (corrected) degree distributions for selected techniques and values of $f$ (other techniques behave analogously). We obtained analogous results for other degree distributions and graph sizes $|V|$. The term $\langle k \rangle$ is the real average node degree, and $\langle k^2 \rangle$ is the real average squared node degree.

[Figure 4. Left panel, "Average node degree, assortativity $r > 0$"; right panel, "Average node degree, assortativity $r < 0$". In both panels: x-axis $f$, fraction of covered nodes; y-axis $\langle k^* \rangle$, average sampled node degree, with reference levels $\langle k \rangle$ and $\frac{\langle k^2 \rangle}{\langle k \rangle}$.]

Fig. 4. The effect of assortativity $r$ on the results. First, we use the configuration model with the same degree distribution $p_k$ as in Fig. 3 (and the same number of nodes $|V| = 10000$) to generate a graph $G$. Next, we apply the pairwise edge rewiring technique [29] to change the assortativity $r$ of $G$ without changing node degrees. This technique iteratively takes two random edges $\{v_1, w_1\}$ and $\{v_2, w_2\}$, and rewires them as $\{v_1, w_2\}$ and $\{v_2, w_1\}$ only if it brings us closer to the desired value of assortativity $r$. As a result, we obtain graphs with a positive (left) and negative (right) assortativity $r$. Note that, for better readability, we present only the values of $f \in [0, 0.1]$, i.e., a range ten times smaller than in Fig. 3.

A. Estimating Degree Distributions and Average Degree

Fig. 3 verifies all the formulae derived in this paper, for a random graph with a given power-law distribution. The analytical expectations are plotted as thick plain lines in the background, and the averaged simulation results are plotted as thinner lines lying on top of them. We observe an almost perfect match between theory and simulation in estimating the sampled degree distribution $q_k$ (Fig. 3, right) and its mean $\langle k^* \rangle$ (Fig. 3, left). Indeed, all traversal techniques follow the same curve (as predicted in V-B5), which initially coincides with that of RW (see V-B7) and is monotonically decreasing in $f$ (see V-B8). We also show that degree-weighted node sampling without replacement exhibits exactly the same bias (see V-B6). Finally, applying the estimators $\hat{p}_k$ derived in Section VI corrects for the bias of $q_k$.

B. The effect of degree-degree correlations (assortativity $r$)

Depending on the type of network, nodes may tend to connect to similar or different nodes. For example, in most social networks high-degree nodes tend to connect to other high-degree nodes [30]. Such networks are called assortative.
In contrast, biological and technological networks are typically disassortative, i.e., they exhibit significantly more high-degree-to-low-degree connections. This observation can be quantified by calculating the assortativity coefficient $r$ [30], which is the correlation coefficient computed over all edges (i.e., degree-degree pairs) in the graph. Values $r < 0$, $r > 0$ and $r = 0$ indicate disassortative, assortative and purely random graphs, respectively.

For the same initial parameters as in Fig. 3 ($p_k$, $|V|$), we simulated different levels of assortativity. Fig. 4 shows the results.

TABLE II
Facebook measurements - data set overview. $|S|$ and $f$ are the absolute and relative lengths of the collected samples. For more details refer to [5].

         UNI      RW       BFS_28              BFS_1    MHRW
  |S|    982K     2.26M    28 × 81K = 2.26M    1.19M    2.26M
  f      0.44%    1.03%    28 × 0.04%          0.54%    1.03%

Graph assortativity $r$ strongly affects the first iterations of traversal techniques. Indeed, for assortativity $r > 0$ (Fig. 4, left), the degree bias is even stronger than for $r = 0$ (Fig. 3, left). This is because the high-degree nodes are now interconnected more densely than in a purely random graph, and are thus easier to discover by sampling techniques that are inherently biased towards high-degree nodes. Interestingly, Forest Fire is by far the most affected. A possible explanation is that under Forest Fire, low-degree nodes are likely to be completely skipped by the first sampling wave. Not surprisingly, a negative assortativity $r < 0$ has the opposite effect: every high-degree node tends to connect to low-degree nodes, which significantly slows down the discovery of the former.

In contrast, the random walks RW and MHRW are not affected by the changes in assortativity.
This is expected, because their stationary distributions hold for any fixed (connected and aperiodic) graph regardless of its topological properties.

C. Other graph properties

We also attempted to simulate the effect of other basic graph properties, such as clustering or modularity. However, all these properties are interdependent, which makes it difficult to interpret the results. For example, [31] recently described an extension of the configuration model to generate random graphs with a given level of clustering $c$. However, the assortativity $r$ turns out to depend strongly on $c$. Rather than showing preliminary results, we decided to defer them to future work, where we are planning to incorporate some of these additional topological properties in our analytical model.

VIII. REAL LIFE EXAMPLE: SAMPLING OF FACEBOOK

In this section we apply and test the previous ideas in a real-life, large-scale system: the Facebook social graph. With more than 250 million active users, Facebook is currently the largest online social network. Crawling the entire topology of Facebook would require downloading about 50 TB of HTML data [5], which makes sampling a very practical alternative.

A. Data collection

We have implemented a set of crawlers to collect samples of Facebook (FB) according to the UNI, BFS, RW and MHRW techniques. The details of our implementation are described in [5]. The collected data sets are summarized in Table II.

UNI refers to a uniform sample of FB users. It was obtained by uniformly sampling the entire FB userID space and discarding non-allocated userIDs. This is a trivial version of rejection sampling and guarantees a uniform sampling of the existing users, regardless of their actual distribution in the userID space. UNI gives a high-quality estimate of $p_k$ and $\langle k \rangle$, mainly thanks to a large number of samples $|S|$.
Therefore, we use UNI as ground truth for comparison of the various techniques.

We ran two types of BFS crawling. BFS_28 consists of 28 small BFS-es initiated at 28 randomly chosen nodes from UNI, which allowed us to easily parallelize the process. Moreover, at the time of data collection, we (naively) thought that this would reduce the BFS bias. After gaining more insight into the process (which, nota bene, motivated this paper), we collected a single large BFS_1, initiated at a randomly chosen node from UNI. The implementation of RW and MHRW is straightforward.

TABLE III
Facebook measurements - average node degree. Average degree: sampled (row 1), expected (row 2) and corrected (row 3) for various techniques. For each expected and corrected value, we give in parentheses the number of the formula used to compute it.

                                         UNI     RW            BFS_28        BFS_1         MHRW
  $\langle k^* \rangle$ sampled          94.1    338.0         323.9         285.9         95.2
  $\langle k^* \rangle$ expected         -       329.8 (3)     329.1 (9)     328.7 (9)     94.1 (5)
  $\widehat{\langle k \rangle}$ estimated  -     93.9 (13)     85.4 (16)     72.7 (16)     95.2 (15)

B. Results

We present the Facebook sampling results in Table III and in Fig. 5.

The first row of Table III shows the average node degree $\langle k^* \rangle$ observed (sampled) by several techniques. The value sampled by UNI is $\langle k^* \rangle = 94.1$, which we interpret as the real value $\langle k \rangle$. MHRW, as expected, recovers a similar value. In contrast, RW and BFS are both biased towards high degrees by a factor larger than three. The degree bias of RW is the largest. It drops very slightly under the (relatively very short) BFS_28 crawl, which confirms our findings from V-B7. BFS_1, a sample 15 times longer than each BFS in BFS_28, is significantly less biased, which is in agreement with V-B8.
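The RW entry in the last row of Table III is obtained with the estimator of Eq. (13), i.e., the harmonic mean of the sampled degrees. The following is a minimal sketch of that correction on synthetic data, not on the Facebook samples: the degree population is a hypothetical heavy-tailed one, the function name `rw_corrected_mean_degree` is made up for this example, and RW is emulated by degree-weighted sampling with replacement from a pool of nodes, which is the interpretation given in Section V-D for $RG(p_k)$.

```python
import random

def rw_corrected_mean_degree(sampled_degrees):
    """Eq. (13): |S| divided by the sum of 1/k_v over the sample S."""
    return len(sampled_degrees) / sum(1.0 / k for k in sampled_degrees)

rng = random.Random(1)
# Hypothetical population of node degrees (heavy-tailed, every degree >= 1).
population = [min(1000, int(rng.paretovariate(1.5))) for _ in range(20000)]
true_mean = sum(population) / len(population)

# Degree-weighted sampling with replacement: the RW class of Section V-D.
sample = rng.choices(population, weights=population, k=50000)

raw_mean = sum(sample) / len(sample)            # biased towards high degrees
corrected_mean = rw_corrected_mean_degree(sample)

print(f"true mean degree      : {true_mean:.2f}")
print(f"raw sampled mean      : {raw_mean:.2f}")
print(f"corrected (Eq. 13)    : {corrected_mean:.2f}")
```

In this synthetic setting the raw sampled mean is much larger than the true mean, while the corrected value lands close to it, mirroring the relation between the RW entries in the first and last rows of Table III.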
The second row shows the expected sampled average node degrees (i.e., our predictions of the values in the first row), assuming that the underlying Facebook topology is a random graph $RG(p_k)$ with degree distribution $p_k$ equal to that sampled by UNI. As expected, this works very well for RW. However, the values predicted for BFS significantly overshoot the reality. This is because Facebook is not a random graph $RG(p_k)$. For example, Facebook, like most social networks [26], is characterized by a high clustering coefficient $c$. We believe that it is possible to incorporate this fact in our analytical model, e.g., by appropriately stretching the function $f(t)$ in Eq. (8). This is a main goal of our future work.

Finally, in the last row of Table III we apply the estimators developed in Section VI to correct the degree biases of RW and BFS. In the case of RW, the correction works very well. Unfortunately, for the BFS estimator the results are significantly worse, clearly for the reasons discussed in the previous paragraph.

All the above observations hold not only for the average node degree, but also for the entire degree distribution, which is shown in Fig. 5.

[Figure 5. "Degree distributions sampled in Facebook": x-axis $k$, node degree; y-axis $P(k)$.]

Fig. 5. Facebook measurements - degree distribution. Crawlers used: UNI, RW and BFS. All plots are in log-log scale with logarithmic binning of data (we take the average of all points that fall in the same bin). We also correct these distributions, as described in Section VI.

C. Practical recommendations

BFS is strongly biased toward high-degree nodes. It is possible to correct for this bias precisely when the underlying graph is an $RG(p_k)$ (which is not the case in practice). Also, in more realistic graphs, this bias can be corrected reasonably well for a very small sample size (as is the case for BFS_28), where BFS is similar to RW (see Fig. 1). At the other extreme, for very large sampling coverage, the bias of BFS becomes relatively small and can sometimes be neglected (even without additional correction). However, in all other cases, the results become difficult to interpret. In contrast, both RW (equipped with a correction procedure) and MHRW are unbiased, regardless of the actual graph topology. Therefore, we recommend using RW and MHRW (with a slight advantage of RW [3]) as general methods to sample node properties.

In contrast, RW and MHRW are not really useful when sampling non-local graph properties, such as the graph diameter or the average shortest path length. In this case, BFS seems very attractive, because it produces a full view of a particular region in the graph, which is usually a densely connected graph itself, and for which the non-local properties can be easily calculated. However, all such results should be interpreted very carefully, as they may also be strongly affected by the bias of BFS. For example, the graph diameter (usually) drops significantly with growing average node degree of a network.

IX. CONCLUSION AND FUTURE DIRECTIONS

In this paper, we analyzed the bias in estimating node degree when BFS (and other graph traversal techniques that sample nodes without replacement) is used to crawl a large, static, undirected network that is modeled by a random graph with a given, arbitrary degree distribution. We also compared BFS and graph traversal techniques to the well-studied random walks, and we were able to explain many of the similarities and differences that had so far been observed only empirically. To the best of our knowledge, this is a first step towards analyzing the bias of BFS sampling, which is widely used in practice.
In future work, we plan to extend our theoretical framework and study the effect of topological properties other than the degree distribution (such as assortativity, clustering, or community structure) on the bias of BFS and other techniques.

REFERENCES

[1] M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork, "On near-uniform URL sampling," in Proc. of WWW, 2000.
[2] D. Stutzbach, R. Rejaie, N. Duffield, S. Sen, and W. Willinger, "On unbiased sampling for unstructured peer-to-peer networks," in Proc. of IMC, 2006.
[3] A. Rasti, M. Torkjazi, R. Rejaie, N. Duffield, W. Willinger, and D. Stutzbach, "Respondent-driven sampling for characterizing unstructured overlays," in INFOCOM Mini-Conference, April 2009.
[4] C. Gkantsidis, M. Mihail, and A. Saberi, "Random walks in peer-to-peer networks," in Proc. of Infocom, 2004.
[5] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou, "A walk in Facebook: Uniform sampling of users in online social networks," http://arxiv.org/abs/0906.0060, 2009.
[6] B. Krishnamurthy, P. Gill, and M. Arlitt, "A few chirps about Twitter," in Proc. of WOSN, 2008.
[7] J. Leskovec and C. Faloutsos, "Sampling from large graphs," in Proc. of ACM SIGKDD, 2006.
[8] L. Lovasz, "Random walks on graphs. A survey," in Combinatorics, 1993.
[9] M. Najork and J. L. Wiener, "Breadth-first search crawling yields high-quality pages," in Proc. of WWW, 2001.
[10] Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong, "Analysis of Topological Characteristics of Huge Online Social Networking Services," in Proc. of WWW, 2007.
[11] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and S. Bhattacharjee, "Measurement and Analysis of Online Social Networks," in Proc. of IMC, 2007.
[12] C. Wilson, B. Boe, A. Sala, K. P. Puttaswamy, and B. Y. Zhao, "User interactions in social networks and their implications," in Proc. of EuroSys, 2009.
[13] S. H. Lee, P.-J. Kim, and H. Jeong, "Statistical properties of sampled networks," Phys. Rev. E, vol. 73, p. 016102, 2006.
[14] L. Becchetti, C. Castillo, D. Donato, and A. Fazzone, "A comparison of sampling techniques for web graph characterization," in LinkKDD, 2006.
[15] A. Mislove, H. S. Koppula, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, "Growth of the Flickr social network," in Proc. of WOSN, 2008.
[16] J. H. Kim, "Poisson cloning model for random graphs," International Congress of Mathematicians (ICM), 2006 (preprint in 2004).
[17] D. Achlioptas, A. Clauset, D. Kempe, and C. Moore, "On the bias of traceroute sampling: or, power-law degree distributions in regular graphs," in STOC, 2005.
[18] M. Q. Shahbaz, "Sampling with unequal probabilities and without replacement," Ph.D. dissertation.
[19] J. Illenberger, G. Flötteröd, and K. Nage, "An approach to correct bias induced by snowball sampling," Sunbelt Social Networks Conference, 2009.
[20] L. Goodman, "Snowball sampling," Annals of Mathematical Statistics, vol. 32, pp. 148–170, 1961.
[21] W. Gilks, S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice. Chapman and Hall/CRC, 1996.
[22] D. Heckathorn, "Respondent-driven sampling: A new approach to the study of hidden populations," Social Problems, vol. 44, pp. 174–199, 1997.
[23] M. Salganik and D. Heckathorn, "Sampling and estimation in hidden populations using respondent-driven sampling," Sociological Methodology, vol. 34, pp. 193–239, 2004.
[24] J. Leskovec, J. Kleinberg, and C. Faloutsos, "Graphs over time: densification laws, shrinking diameters and possible explanations," in KDD, 2005.
[25] M. Molloy and B. Reed, "A critical point for random graphs with a given degree sequence," pp. 161–179, 1995.
[26] M. E. J. Newman, "The structure and function of complex networks," SIAM Review, vol. 45, pp. 167–256, 2003.
[27] M. E. J. Newman, "Ego-centered networks and the ripple effect," Social Networks, vol. 25, pp. 83–95, 2003.
[28] R. Motwani and P. Raghavan, Randomized Algorithms. Cambridge University Press, 1990.
[29] S. Maslov and K. Sneppen, "Specificity and stability in topology of protein networks," Science, vol. 296, no. 5569, pp. 910–913, May 2002.
[30] M. Newman, "Assortative mixing in networks," in Phys. Rev. Lett. 89, 2002.
[31] M. E. J., "Random graphs with clustering," Phys. Rev. Lett. (in press), 2009.
