Random Walks on Directed Networks: Inference and Respondent-driven Sampling
Respondent driven sampling (RDS) is a method often used to estimate population properties (e.g. sexual risk behavior) in hard-to-reach populations. It combines an effective modified snowball sampling methodology with an estimation procedure that yiel…
Authors: Jens Malmros, Naoki Masuda, Tom Britton
Jens Malmros
Department of Mathematics, Stockholm University, Stockholm, Sweden

Naoki Masuda
Department of Mathematical Informatics, University of Tokyo, Tokyo, Japan

Tom Britton
Department of Mathematics, Stockholm University, Stockholm, Sweden

August 19, 2013

∗ Jens Malmros is Ph.D. student, Department of Mathematics, Division of Mathematical Statistics, Stockholm University, SE-106 91 Stockholm (E-mail: jensm@math.su.se). Naoki Masuda is associate professor, Department of Mathematical Informatics, The University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo 113-8656, Japan (E-mail: masuda@mist.i.u-tokyo.ac.jp). Tom Britton is professor, Department of Mathematics, Division of Mathematical Statistics, Stockholm University, SE-106 91 Stockholm (E-mail: tomb@math.su.se). J.M. was supported by grant no. 2009-5759 from the Swedish Research Council. N.M. was supported by Grants-in-Aid for Scientific Research (No. 23681033) from MEXT, Japan, the Nakajima Foundation, and the Aihara Project, the FIRST program from JSPS, initiated by CSTP, Japan. The authors would like to thank Prof. Fredrik Liljeros and Dr. Xin Lu, Department of Sociology, Stockholm University, for use of the Qruiser dataset.

Abstract

Respondent-driven sampling (RDS) is a method often used to estimate population properties (e.g. sexual risk behavior) in hard-to-reach populations. It combines an effective modified snowball sampling methodology with an estimation procedure that yields unbiased population estimates under the assumption that the sampling process behaves like a random walk on the social network of the population. Current RDS estimation methodology assumes that the social network is undirected, i.e. that all edges are reciprocal. However, empirical social networks in general also have non-reciprocated edges.
To account for this fact, we develop a new estimation method for RDS in the presence of directed edges on the basis of random walks on directed networks. We distinguish directed and undirected edges and consider the possibility that the random walk returns to its current position in two steps through an undirected edge. We derive estimators of the selection probabilities of individuals as a function of the number of outgoing edges of sampled individuals. We evaluate the performance of the proposed estimators on artificial and empirical networks to show that they generally perform better than existing methods. This is in particular the case when the fraction of directed edges in the network is large.

Key words: Hidden population; Social network; Renewal process; Estimated degree; Network model.

1 INTRODUCTION

Random walks on networks are crucial to the understanding of many network processes, and in many applications, random walks serve as either rigorous or approximate tools depending on the amount of information available about networks. A network sampling methodology taking advantage of a random walk approximation is respondent-driven sampling (RDS). The method, first suggested in Heckathorn (1997), is especially suitable for investigating hidden or hard-to-reach populations, such as injecting drug users (IDUs), sex workers, and men who have sex with men (MSM). For such populations, sampling frames are typically unavailable because individuals often suffer from social stigmatization and/or legal difficulties, and conventional sampling methods therefore fail. High demand for valid inference on hidden populations, e.g. on the risk behavior of individuals and the disease prevalence in the population, as well as a lack of competing methods, has made RDS a leading method.
Examples of RDS studies from 2013 include MSM in Nanjing, China (Tang et al., 2013), undocumented Central American immigrants in Houston, Texas (Montealegre et al., 2013), and IDUs in the District of Columbia (Magnus et al., 2013).

At the core of RDS is the notion of a social network that binds the population together. During the sampling process, already sampled individuals use their social relations (edges of the social network) to recruit new individuals in the population into the sample, creating a snowball-like mechanism. Additionally, information on the structure of the network collected during the sampling process facilitates unbiased population estimates given that the actual RDS recruitment process behaves like a random walk on the network (Salganik and Heckathorn, 2004; Volz and Heckathorn, 2008).

In recent years, much RDS research has focused on the sensitivity of current RDS estimators to violations of the assumptions underlying the estimating process. In fact, it has been shown that RDS estimators may be subject to substantial biases and large variances when some assumptions are not valid (Gile and Handcock, 2010; Lu et al., 2012; Wejnert, 2009; Tomas and Gile, 2011; Goel and Salganik, 2010). New RDS estimators have been developed to mitigate this problem (Gile and Handcock, 2011; Gile, 2011; Lu et al., 2013).

Current RDS estimation assumes that the social network of the population is undirected. However, real social networks are at least partially directed in general. The directedness of a network can be quantified by the ratio of the number of non-reciprocal (i.e., directed) edges to the total number of edges in the network (Wasserman and Faust, 1994). This value lies between 0 and 1, and a large value indicates that the network is close to a purely directed network.
Examples of real social networks and of online social networks (including e-mail networks) that have a considerable fraction of non-reciprocal edges are shown in Table 1. For these and other directed social networks, RDS methods assuming an undirected network may be biased.

Table 1: Proportion of directed edges in social networks.

| Category | Network | Proportion of directed edges | Source |
|---|---|---|---|
| Real social networks | High-tech managers | 0.71 | Wasserman and Faust (1994) |
| Real social networks | Dining partners | 0.76 | Moreno et al. (1960) |
| Real social networks | Radio amateurs | 0.59 | Killworth and Bernard (1976) |
| Online social networks | Google+ (Oct 2011) | 0.62 | Gong et al. (2013) |
| Online social networks | Flickr (May 2007) | 0.55 | Gong et al. (2013) |
| Online social networks | LiveJournal (Dec 2006) | 0.26 | Mislove et al. (2007) |
| Online social networks | Twitter (June 2009) | 0.78 | Kwak et al. (2010) |
| Online social networks | University e-mail | 0.77 | Newman et al. (2002) |
| Online social networks | Enron e-mail | 0.85 | Boldi and Vigna (2004); Boldi et al. (2011) |

Motivated by these data, we aim to expand RDS estimation to the case of directed networks. Because the RDS method uses the random walk, a random walk framework for directed networks is a key component to this expansion. This is not a trivial task because the random walk behaves very differently in undirected and directed networks. In particular, the stationary distribution of the random walk is simply proportional to the degree of the vertex in undirected networks (Doyle and Snell, 1984; Lovász, 1993), whereas it is affected by the entire network structure in directed networks (Donato et al., 2004; Langville and Meyer, 2006; Masuda and Ohtsuki, 2009).

In this paper, we first present the commonly available RDS estimation procedures and the basics of random walks on networks in Sections 2 and 3, respectively. Then, we present methods for estimating the stationary distribution from random walks on directed networks and its application to RDS estimation in Section 4.
These methods are then evaluated and compared to existing methods by numerical simulations, which we describe in Section 5. The results from simulations are presented in Section 6. Finally, our findings are discussed in Section 7.

2 RESPONDENT-DRIVEN SAMPLING

In practice, an RDS study begins with the selection of a seed group of individuals from the population. Each seed is given a fixed number of coupons, typically three to five, which are effectively the tickets for participation in the study, to be distributed to other peers in the population. Those who have received a coupon and joined the study (i.e., respondents) are also given coupons to be distributed to other peers that have not obtained a coupon. This procedure is repeated until the desired sample size has been reached. Each respondent is rewarded both for participating in the study and for the participation of those to whom he/she passed coupons, resulting in double incentives for participation. The sampling procedure ensures that the identities of members of the population are not revealed in the recruitment process. For each respondent, the properties of interest (e.g., HIV status), number of neighbors (degree), and the neighbors that the respondent has successfully recruited are recorded.

We approximate the RDS recruitment process by a random walk on the social network. To this end, we assume that (i) respondents recruit peers from their social contacts with uniform probability, (ii) each recruitment consists of only one peer, (iii) sampling is done with replacement, such that a respondent may appear in the sample multiple times, (iv) the degree of respondents is accurately reported, and (v) the population forms a connected network.

Then, if the random walk is in equilibrium with a known stationary distribution $\{\pi_i;\ i = 1, \ldots, N\}$, where $N$ is the population size, we may estimate $p_A$, the fraction of individuals with a property of interest $A$, as (Thompson, 2012)

$$\hat{p}_A = \frac{\sum_{i \in S \cap A} 1/\pi_i}{\sum_{i \in S} 1/\pi_i}, \tag{1}$$

where $S$ is our sample. For undirected networks, the stationary distribution is proportional to the degree (Doyle and Snell, 1984; Lovász, 1993), and Eq. (1) yields the most widely used RDS estimator (Volz and Heckathorn, 2008) given by

$$\hat{p}_A^{VH} = \frac{\sum_{i \in S \cap A} 1/d_i}{\sum_{i \in S} 1/d_i}, \tag{2}$$

where $d_i$ is the degree of node $i$. However, the estimator given by Eq. (2) may be biased for directed networks (Lu et al., 2012, 2013). Therefore, to estimate $p_A$ without bias from an RDS sample on a directed network, we need to accurately calculate Eq. (1). Because the stationary distribution $\{\pi_i\}$ used in Eq. (1) is analytically intractable for most directed networks, we will proceed by deriving estimators of it.

3 RANDOM WALKS ON DIRECTED NETWORKS

We consider a directed, unweighted, aperiodic, and strongly connected network $G$ with $N$ vertices. Let $e_{ij} = 1$ if there is a directed edge from $i$ to $j$ and 0 otherwise. An undirected edge exists between $i$ and $j$ if and only if $e_{ij} = e_{ji} = 1$. We denote the number of undirected, in-directed, and out-directed edges at vertex $i$ by $d_i^{(un)}$, $d_i^{(in)}$, and $d_i^{(out)}$, respectively. We use $D^{(un)}$, $D^{(in)}$, and $D^{(out)}$ to refer to the corresponding random variables if a node is drawn uniformly at random. If we specifically mention that the network is undirected, we obtain $d_i^{(in)} = d_i^{(out)} = 0$, and the degree of vertex $i$ refers to $d_i^{(un)} = d_i$. Otherwise, the degree of vertex $i$ refers to the triplet $(d_i^{(un)}, d_i^{(in)}, d_i^{(out)})$. We refer to $d_i^{(un)} + d_i^{(in)}$ and $d_i^{(un)} + d_i^{(out)}$ as the in-degree and out-degree of vertex $i$, respectively. It should be noted that we may observe for example the out-degree $d_i^{(un)} + d_i^{(out)}$, but not separately the $d_i^{(un)}$ and $d_i^{(out)}$ values.
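The degree notation above can be illustrated with a small Python sketch. It assumes the network is available as a 0/1 adjacency matrix `E` with `E[i][j] == 1` for an edge from `i` to `j`; the function name `degree_triplet` and the three-vertex toy network are hypothetical and serve only as an illustration.

```python
def degree_triplet(E):
    """Return per-vertex lists (d_un, d_in, d_out) for a 0/1 adjacency matrix E.

    E[i][j] == 1 means there is a directed edge from i to j. A reciprocated
    pair (E[i][j] == E[j][i] == 1) is counted as one undirected edge.
    """
    n = len(E)
    d_un, d_in, d_out = [0] * n, [0] * n, [0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if E[i][j] and E[j][i]:
                d_un[i] += 1      # undirected edge between i and j
            elif E[i][j]:
                d_out[i] += 1     # out-directed edge i -> j
            elif E[j][i]:
                d_in[i] += 1      # in-directed edge j -> i
    return d_un, d_in, d_out


# Hypothetical toy network: 0 <-> 1 (undirected), 1 -> 2 and 2 -> 0 (directed).
E = [
    [0, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
d_un, d_in, d_out = degree_triplet(E)
out_degree = [u + o for u, o in zip(d_un, d_out)]  # d_un + d_out
in_degree = [u + i for u, i in zip(d_un, d_in)]    # d_un + d_in
```

In an RDS setting only `out_degree` would typically be reported by respondents; the full triplet is available here only because the toy network is fully known.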
Consider the simple random walk $X = \{X(t);\ t = 0, 1, \ldots\}$ with state space $S = \{1, \ldots, N\}$ on $G$ such that the walker staying at vertex $i$ moves to any of the $d_i^{(un)} + d_i^{(out)}$ neighbors reached by an undirected or out-directed edge with equal probability. We denote the stationary distribution of $X$ by $\{\pi_i;\ i = 1, \ldots, N\}$, where $\pi_i = \lim_{t \to \infty} P(X(t) = i)$. If we sample from the random walk in equilibrium, vertices will be selected with probabilities given by the stationary distribution, and we then refer to $\{\pi_i\}$ as the selection probabilities of the vertices in $G$. For an arbitrary network, we obtain

$$\pi_i = \sum_{j=1}^{N} \frac{e_{ji}}{\sum_{\ell=1}^{N} e_{j\ell}}\, \pi_j = \sum_{j=1}^{N} \frac{e_{ji}}{d_j^{(un)} + d_j^{(out)}}\, \pi_j, \tag{3}$$

where the stationary distribution is fully defined by $\sum_{i=1}^{N} \pi_i = 1$. In undirected networks, we obtain $\pi_i = d_i / \sum_{j=1}^{N} d_j$. In contrast, there is no analytical closed form solution for $\{\pi_i\}$ in directed networks. If a directed network has little assortativity (i.e., degree correlation between adjacent vertices), $\{\pi_i\}$ is often accurately estimated by the normalized in-degree (Lu et al., 2013; Fortunato et al., 2008; Ghoshal and Barabási, 2011) because

$$\pi_i \approx \sum_{j=1}^{N} \frac{e_{ji}}{d_j^{(un)} + d_j^{(out)}}\, \bar{\pi} \propto \sum_{j=1}^{N} e_{ji} = d_i^{(in)} + d_i^{(un)}, \tag{4}$$

where $\bar{\pi}$ is the average selection probability. However, the estimate given by Eq. (4) is often inaccurate in general directed networks (Donato et al., 2004; Masuda and Ohtsuki, 2009). Moreover, since it is much easier for individuals to assess how many people they know (i.e., out-degree) than by how many people they are known (i.e., in-degree), it is common to observe only the out-degree. In this case, Eq. (4) cannot be used with an RDS sample.

4 ESTIMATION OF SELECTION PROBABILITIES FOR DIRECTED NETWORKS

We now derive estimators of the selection probabilities for the random walk on directed networks.
We first derive an estimation scheme when the full degree $(d_i^{(un)}, d_i^{(in)}, d_i^{(out)})$ is observed for all the vertices $i$ visited by the random walk. Then, we extend this estimation to the situation in which only the out-degree $d_i^{(un)} + d_i^{(out)}$ of the visited vertices is observed.

4.1 Estimating Selection Probabilities From Full Degrees

In order to estimate $\{\pi_i\}$, we assume that $X(t_0) = i$ and that $t_0$ is sufficiently large for the stationary distribution to be reached. We evaluate the frequency with which $X$ visits $i$ in the subsequent times. If $X$ leaves $i$ through an undirected edge $e_{i\cdot}^{(un)}$, where $e_{i\cdot}^{(un)}$ is one of the $d_i^{(un)}$ undirected edges owned by $i$, $X$ may return to $i$ after two steps using the same edge and repeat the same type of returns $m$ times in total, perhaps using different undirected edges $e_{i\cdot}^{(un)}$. Then, $X(t_0) = X(t_0 + 2) = \cdots = X(t_0 + 2m) = i$ and $X(t_0 + 2m + 2) = k$ for some $k \neq i$. If $X(t_0 + 2) = i$, the walk first moves from $i$ through an undirected edge to vertex $j$ at $t = t_0 + 1$ and returns to $i$ through the same edge at $t = t_0 + 2$. The probability of this event is given by $d_i^{(un)} / (d_i^{(un)} + d_i^{(out)}) \cdot 1 / (d_j^{(un)} + d_j^{(out)})$. Because the out-degree of vertex $j$, i.e., $d_j^{(un)} + d_j^{(out)}$, is unknown, we approximate $1 / (d_j^{(un)} + d_j^{(out)})$ by $E\left[1 / (\tilde{D}^{(un)} + D^{(out)})\right]$. Here $\tilde{D}^{(un)}$ denotes the undirected degree distribution under the condition that the vertex is reached by following an undirected edge, i.e. a size-biased distribution for the undirected degree, $P(\tilde{D}^{(un)} = d) \propto d\, P(D^{(un)} = d)$ (Newman, 2010). It is also possible to estimate $1 / (d_j^{(un)} + d_j^{(out)})$ by $1 / E(\tilde{D}^{(un)} + D^{(out)})$, but in our simulations this alternative made hardly any difference and, if anything, performed slightly worse.
Thus, we estimate the probability of returning to vertex $i$ after two steps by

$$p_i^{(ret)} = \frac{d_i^{(un)}}{d_i^{(un)} + d_i^{(out)}}\, E\left[\frac{1}{\tilde{D}^{(un)} + D^{(out)}}\right]. \tag{5}$$

When $t \geq t_0 + 2m + 3$, we use Eq. (4) to estimate the probability to visit vertex $i$ at any time as being proportional to $d_i^{(un)} + d_i^{(in)}$, i.e.,

$$p_i^{(vis)} = \frac{d_i^{(un)} + d_i^{(in)}}{N\left(E(D^{(un)}) + E(D^{(in)})\right)}. \tag{6}$$

Under these estimates, the number of returns after two steps to vertex $i$, counting the starting point $X(t_0) = i$ as a return to $i$, is geometrically distributed with expected value $1 / (1 - p_i^{(ret)})$, and the number of steps starting from $t = t_0 + 2m + 2$, counting this step, and ending at the time immediately before visiting $i$ with probability $p_i^{(vis)}$ is geometrically distributed with expected value $1 / p_i^{(vis)}$. We then have a renewal process $\{R_i^n;\ n \geq 1,\ R_i^0 = 0\}$ with the $n$th renewal occurring at random time $R_i^n = \sum_{k=1}^{n} (2 Z_i^k + Y_i^k)$, where $Z_i^n \sim Ge(1 - p_i^{(ret)})$ and $Y_i^n \sim Ge(p_i^{(vis)})$. In Figure 1, the behavior of the process during a renewal period is schematically shown.

Figure 1: Schematic of a renewal period. (a) The walker makes $Z_i^n$ consecutive direct returns to $i$. (b) The walker leaves $i$ without an immediate return, because the walker leaves $i$ by a directed edge or leaves $j$ by another edge. Then, the walker returns to $i$ after $Y_i^n$ steps.

The average time step between consecutive renewal events is equal to $2E(Z_i^n) + E(Y_i^n)$. The average number of visits to $i$ between the two renewal events, with the visit to $i$ at $t = t_0$ included, is equal to $E(Z_i^n)$. Therefore, from renewal theory (see e.g., Resnick, 1992), we obtain an estimate of $\pi_i$ as

$$\pi_i \approx \frac{E(Z_i^n)}{2E(Z_i^n) + E(Y_i^n)} = \frac{\dfrac{1}{1 - p_i^{(ret)}}}{\dfrac{2}{1 - p_i^{(ret)}} + \dfrac{1}{p_i^{(vis)}}} = \frac{p_i^{(vis)}}{2 p_i^{(vis)} + 1 - p_i^{(ret)}}. \tag{7}$$

Because $p_i^{(ret)} = O(1)$ and $p_i^{(vis)} = O(1/N)$, removing higher order terms in Eq. (7) yields

$$\hat{\pi}_i \approx \frac{p_i^{(vis)}}{1 - p_i^{(ret)}} \propto \frac{d_i^{(un)} + d_i^{(in)}}{1 - \dfrac{d_i^{(un)}}{d_i^{(un)} + d_i^{(out)}}\, E\left[\dfrac{1}{\tilde{D}^{(un)} + D^{(out)}}\right]}. \tag{8}$$

The proportionality constant is given by imposing that $\sum_{i=1}^{N} \hat{\pi}_i = 1$. If the network is undirected, we obtain $\hat{\pi}_i \propto d_i^{(un)}$, such that $\hat{\pi}_i$ coincides with the exact solution used in Eq. (2). If the network is fully directed, i.e., there are no reciprocal edges (the fraction of directed edges, denoted $\alpha$ in Section 4.3, equals 1), the estimator is proportional to the in-directed degree $d_i^{(in)}$.

4.2 Estimating Selection Probabilities From Out-degrees

A common situation in RDS is that only the out-degrees (i.e., $d_i^{(un)} + d_i^{(out)}$) of respondents are recorded. Then, the estimator of the selection probabilities given by Eq. (8) cannot be directly used. To cope with this situation, we estimate the number of undirected, in-directed, and out-directed edges from the observed out-degrees and substitute the estimators $(\hat{d}_i^{(un)}, \hat{d}_i^{(in)}, \hat{d}_i^{(out)})$ in Eq. (8).

Assume that we have observed the out-degree $d_i^{(un)} + d_i^{(out)}$ of vertex $i$. We estimate $d_i^{(un)}$ and $d_i^{(out)}$ by their expected proportions of the out-degree, and the in-directed degree by its expectation, as follows:

$$\begin{aligned}
\hat{d}_i^{(un)} &= \frac{E(D^{(un)})}{E(D^{(un)}) + E(D^{(out)})} \left(d_i^{(un)} + d_i^{(out)}\right), \\
\hat{d}_i^{(out)} &= \frac{E(D^{(out)})}{E(D^{(un)}) + E(D^{(out)})} \left(d_i^{(un)} + d_i^{(out)}\right), \\
\hat{d}_i^{(in)} &= E(D^{(in)}).
\end{aligned} \tag{9}$$

The expectations used in Eq. (9) rely on the assumption that we have a random sample from the network, which is not true in this case. A plausible assumption on the sampled degree distributions is that they are size-biased. However, our numerical results suggest that a size-biased distribution for the undirected and/or the in-directed degree makes little difference and, if anything, increases the bias of the selection probability estimators. Therefore, we stay with the estimators given by Eq. (9).
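As a minimal sketch of how Eqs. (8) and (9) could be evaluated, the Python functions below take degree lists and moments as plain inputs. The function names and the argument `mean_inv_out`, which stands in for $E[1/(\tilde{D}^{(un)} + D^{(out)})]$, are assumptions made for illustration; how that expectation and the moments in Eq. (9) are obtained is left outside the sketch.

```python
def renewal_selection_probs(d_un, d_in, d_out, mean_inv_out):
    """Eq. (8): renewal-based estimates, normalized over the supplied vertices.

    mean_inv_out stands in for E[1 / (D~(un) + D(out))] and is assumed given.
    Every vertex needs out-degree > 0, which holds in a strongly connected network.
    """
    weights = []
    for u, i, o in zip(d_un, d_in, d_out):
        p_ret = u / (u + o) * mean_inv_out        # Eq. (5)
        weights.append((u + i) / (1.0 - p_ret))   # right-hand side of Eq. (8)
    total = sum(weights)
    return [w / total for w in weights]


def split_out_degree(out_degree, mean_un, mean_in, mean_out):
    """Eq. (9): estimated (d_un, d_in, d_out) when only out-degrees are observed."""
    share_un = mean_un / (mean_un + mean_out)
    d_un_hat = [share_un * d for d in out_degree]
    d_out_hat = [(1.0 - share_un) * d for d in out_degree]
    d_in_hat = [mean_in for _ in out_degree]
    return d_un_hat, d_in_hat, d_out_hat
```

Two limiting cases give a quick sanity check: with no undirected edges the output is proportional to the in-directed degree, and with only undirected edges it is proportional to the undirected degree, in line with the remarks following Eq. (8).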
When $(\hat{d}_i^{(un)}, \hat{d}_i^{(in)}, \hat{d}_i^{(out)})$ is substituted in Eq. (8) in place of $(d_i^{(un)}, d_i^{(in)}, d_i^{(out)})$, $\hat{d}_i^{(un)} / (\hat{d}_i^{(un)} + \hat{d}_i^{(out)})$ in the denominator is a constant. Therefore, the estimator is proportional to $\hat{d}_i^{(un)} + \hat{d}_i^{(in)}$, i.e., equivalent to Eq. (4) calculated with the estimated degrees.

4.3 Estimating Network Parameters

The estimators of directed degrees in Eq. (9) rely on knowing $E(D^{(un)})$, $E(D^{(in)})$, and $E(D^{(out)})$ separately, which are not estimable from a typical RDS sample, where only the out-degrees $d_i^{(un)} + d_i^{(out)}$ of respondents are recorded. Therefore, we need to extend the estimation procedure to handle these unknown moments. We do so by assuming a model for the network from which we can estimate the required moments.

Specifically, we assume that the observed network is a realization of a directed equivalent of the simple $G(N, p = \lambda/(N-1))$ random graph (Erdős and Rényi, 1960). Given parameters $\alpha \in [0, 1]$ and $\lambda \in [0, N-1]$, each pair of vertices independently forms an edge with probability $\lambda/(N-1)$, which is undirected with probability $(1 - \alpha)$ and directed with probability $\alpha$. When the edge is directed, the direction is selected with equal probability. It follows that $\lambda$ is the expected total degree of a vertex and that $\alpha$ is the fraction of directed edges as $N \to \infty$. If $N$ is large, $D^{(un)}$, $D^{(in)}$, and $D^{(out)}$ approximately follow independent Poisson distributions with parameters $(1 - \alpha)\lambda$, $\alpha\lambda/2$, and $\alpha\lambda/2$, respectively. Therefore, the out-degree $D^{(un)} + D^{(out)}$ and the in-degree $D^{(un)} + D^{(in)}$ are both Poisson distributed with parameter $(2 - \alpha)\lambda/2$. Consequently, if we estimate $\alpha$ and $\lambda$, we can estimate the unknown moments by substituting the estimated $\hat{\alpha}$ and $\hat{\lambda}$ in the moments of the (Poissonian) degree distributions.

To find estimators of $\alpha$ and $\lambda$, we again consider the random walk $X = \{X(t)\}$ on the network.
Assume that $e_{ij} = 1$, $X(t_0) = i$, and $X(t_0 + 1) = j$, for a large $t_0$. If $X(t_0 + 2) = i$, an undirected edge between $i$ and $j$ exists, i.e. $e_{ij} = e_{ji} = 1$, and the random walk leaves vertex $j$ via $e_{ji}$. Because the edge between $i$ and $j$ is either in-directed to $j$ or undirected, the probability that the edge is undirected is equal to the probability that a randomly selected edge among all undirected and in-directed edges is undirected, i.e., $(1 - \alpha)/(1 - \alpha/2)$. If there is an undirected edge between $i$ and $j$ (i.e., $e_{ji} = 1$), the random walk leaves $j$ via $e_{ji}$ with probability $1 / (d_j^{(un)} + d_j^{(out)})$. Thus, the random walk revisits vertex $i$ at $t_0 + 2$ under the directed E-R random graph model with probability

$$\frac{1 - \alpha}{1 - \alpha/2} \cdot \frac{1}{d_j^{(un)} + d_j^{(out)}}. \tag{10}$$

Let $M$ be the number of immediate revisits, which is described above, during $l$ consecutive steps. Then, we have $M = \sum_{k=2}^{l} M_k$, where $M_k = 1$ if a revisit occurs in step $k$ and $M_k = 0$ otherwise. $M_k$ is Bernoulli distributed, $M_k \sim Be\left((1 - \alpha)/(1 - \alpha/2) \cdot 1 / (d_{j_{k-1}}^{(un)} + d_{j_{k-1}}^{(out)})\right)$, where $j_{k-1}$ is the vertex visited in step $k - 1$. We obtain the expected number of immediate revisits as

$$E(M) = \frac{1 - \alpha}{1 - \alpha/2} \sum_{k=1}^{l-1} \frac{1}{d_{j_k}^{(un)} + d_{j_k}^{(out)}}. \tag{11}$$

If $m$ is the observed number of revisits, we set $m = E(M)$ in Eq. (11) to obtain the moment estimator

$$\hat{\alpha} = \frac{m - \sum_{k=1}^{l-1} \left(d_{j_k}^{(un)} + d_{j_k}^{(out)}\right)^{-1}}{m/2 - \sum_{k=1}^{l-1} \left(d_{j_k}^{(un)} + d_{j_k}^{(out)}\right)^{-1}}. \tag{12}$$

If the estimated $\hat{\alpha} < 0$, we force $\hat{\alpha} = 0$.

Given $\hat{\alpha}$, we estimate $\lambda$ as follows. If $\alpha = 0$, the network contains only undirected edges, and the observed out-degree equals the observed undirected degree, which has a size-biased distribution, with $E(\tilde{D}^{(un)}) = \lambda + 1$. If $\alpha = 1$, the network has only directed edges, and the expected observed out-degree equals the expected number of out-directed edges, $\lambda/2$.
By linearly interpolating the expected observed out-degree between $\alpha = 0$ and $\alpha = 1$, and substituting it with the mean sample out-degree $\bar{u}$, we obtain $\bar{u} = \lambda/2 + (1 - \alpha)(1 + \lambda/2)$, which yields an estimator of $\lambda$ as

$$\hat{\lambda} = \frac{\bar{u} + \hat{\alpha} - 1}{1 - \hat{\alpha}/2}. \tag{13}$$

Using $\hat{\alpha}$ and $\hat{\lambda}$, we can estimate the moments of the degree distributions under the random graph model. For example, $E(D^{(un)})$ is estimated by $(1 - \hat{\alpha})\hat{\lambda}$. By substituting the estimated moments in Eqs. (8) and (9), we obtain an estimator of the selection probability of vertex $i$ as

$$\hat{\pi}_i \propto \hat{d}_i^{(un)} + \hat{d}_i^{(in)} = \frac{1 - \hat{\alpha}}{1 - \hat{\alpha}/2} \left(d_i^{(un)} + d_i^{(out)}\right) + \frac{\hat{\alpha}\hat{\lambda}}{2}. \tag{14}$$

When $\alpha = 0$ is assumed known and used in place of $\hat{\alpha}$, the estimator in Eq. (14) is equivalent to that used in Eq. (2). When $\hat{\alpha} = \alpha = 1$, it is proportional to 1, and thus equivalent to the sample mean.

5 SIMULATION SETUP

We numerically examine the accuracy of our estimation schemes on directed Erdős-Rényi graphs, a model of directed power-law networks (i.e., networks with a power-law degree distribution), and a real online MSM social network. We evaluate both the estimated selection probabilities and corresponding estimates of $p_A$. As described in Section 1, real directed social networks show a varying fraction of directed edges, corresponding to a diversity of $\alpha$ values. Therefore, $\alpha$ is varied in the model networks. We also vary $\lambda$ and other network parameters. We study the performance of the estimators described in Section 4 when the full degree is observed and when only the out-degree is observed, and compare the performance of our estimators to existing estimators. We do not consider RDS estimators that are not based on the random walk framework because they fall outside the scope of this study.
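The moment estimators of Section 4.3, Eqs. (12) to (14), can be sketched in a few lines of Python. The function and argument names are hypothetical, and the inputs (the out-degrees along the walk, the revisit count $m$, and the mean sample out-degree $\bar{u}$) are assumed to have been extracted from the sample beforehand. The sketch does not guard against the degenerate case where the denominator of Eq. (12) is zero.

```python
def estimate_alpha(out_degrees_on_path, n_revisits):
    """Eq. (12): moment estimator of alpha, forced to 0 if negative.

    out_degrees_on_path: out-degrees of the vertices visited in steps 1..l-1.
    n_revisits: observed number m of immediate (two-step) revisits.
    """
    s = sum(1.0 / d for d in out_degrees_on_path)
    alpha = (n_revisits - s) / (n_revisits / 2.0 - s)  # undefined if m/2 == s
    return max(alpha, 0.0)


def estimate_lambda(mean_out_degree, alpha):
    """Eq. (13): estimator of lambda from the mean sample out-degree."""
    return (mean_out_degree + alpha - 1.0) / (1.0 - alpha / 2.0)


def selection_weight(out_degree, alpha, lam):
    """Eq. (14): unnormalized selection-probability estimate for one respondent."""
    return (1.0 - alpha) / (1.0 - alpha / 2.0) * out_degree + alpha * lam / 2.0


# Hypothetical walk: eight visited vertices with out-degree 2, two revisits.
alpha_hat = estimate_alpha([2] * 8, 2)
lambda_hat = estimate_lambda(5.0, alpha_hat)
weights = [selection_weight(d, alpha_hat, lambda_hat) for d in (2, 8)]
```

Only the relative sizes of the weights matter when they are plugged into Eq. (1), which is why the sketch leaves them unnormalized.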
5.1 Network Models and Empirical Network

The first model network that we use is a variant of the simple Erdős-Rényi graph with a mixture of undirected and directed edges, as described in Section 4.3. We generate the networks with $\alpha \in \{0.25, 0.5, 0.75\}$ and $\lambda \in \{5, 10, 15\}$. We then extract the largest strongly connected component of the generated network, which has $O(N)$ vertices for all combinations of $\alpha$ and $\lambda$.

The directed Erdős-Rényi networks have Poisson degree distributions with quickly decaying tails. To mimic heavy-tailed degree distributions present in many empirical networks (Newman, 2010), we also use a variant of the power-law network model proposed in (Goh et al., 2001; Chung and Lu, 2002; Chung et al., 2003). The original algorithm for generating undirected power-law networks presented in Goh et al. (2001) is as follows. We fix the number of vertices $N$ and expected degree $E(D)$. Then, we set the weight of vertex $i$ ($1 \leq i \leq N$) to be $w_i = i^{-\tau}$, where $0 \leq \tau \leq 1$ is a parameter that controls the power-law exponent of the degree distribution. Then, we select a pair of vertices $i$ and $j$ ($1 \leq i \neq j \leq N$) with probability proportional to $w_i w_j$. If the two vertices are not yet connected, we connect them by an undirected edge. We repeat the procedure until the network has $E(D)N/2$ edges. The expected degree of vertex $i$ is proportional to $w_i$, and the degree distribution is given by $p(d) \propto d^{-\gamma}$, where $\gamma = 1 + \frac{1}{\tau}$ (Goh et al., 2001).

To generate a power-law network in which undirected and directed edges are mixed with a desired fraction, we extend the algorithm as follows. First, we specify the expected undirected degree $E(D^{(un)})$ and generate an undirected network. Second, we define $w_i^{in} = (\sigma_{in}(i))^{-\tau_{in}}$ ($1 \leq i \leq N$), where $\sigma_{in}$ is a random permutation on $1, \ldots, N$, and $\tau_{in}$ is a parameter that specifies the power-law exponent of the in-directed degree distribution. Similarly, we set $w_i^{out} = (\sigma_{out}(i))^{-\tau_{out}}$ ($1 \leq i \leq N$). Third, we select a pair of vertices with probability proportional to $w_i^{in} w_j^{out}$. If $i \neq j$ and there is not yet a directed edge from $j$ to $i$, we place a directed edge from $j$ to $i$. We repeat the procedure until a total of $E(D^{(in)})N/2$ edges are placed. It should be noted that $E(D^{(in)}) = E(D^{(out)})$. The in-directed degree distribution is given by $p(d_{in}) \propto (d_{in})^{-\gamma_{in}}$, where $\gamma_{in} = 1 + \frac{1}{\tau_{in}}$, and similarly for the out-directed degree distribution. Finally, we superpose the obtained undirected network and directed network to make a single graph. If the combined graph is not strongly connected, we discard it and start over. This network is devoid of degree correlation by construction.

In both network models, we vary the probability of a vertex being assigned property $A$ as proportional to six different combinations of its degree: in-degree, out-degree, undirected degree, in-directed degree, out-directed degree, and directed (in- and out-directed) degree. Formally, if $P(\text{vertex } i \text{ has } A) \propto g(d_i^{(un)}, d_i^{(in)}, d_i^{(out)})$, we let $g$ be equal to $(d_i^{(un)} + d_i^{(in)})$, $(d_i^{(un)} + d_i^{(out)})$, $d_i^{(un)}$, $d_i^{(in)}$, $d_i^{(out)}$, and $(d_i^{(in)} + d_i^{(out)})$, respectively. We refer to these as different ways to allocate property $A$. We also examined the case in which we assigned the property uniformly over all vertices. However, because the performance of the different estimators is almost the same in this case, we do not show the results in the following. For all allocations of $A$, the property is assigned in such a way that the expected proportion of vertices being assigned $A$ is equal to some fixed value $p$.
Because $A$ is stochastically assigned, the actual proportion $p_A$ of vertices with $A$ will vary between realized allocations.

We also evaluate our estimators on an online MSM social network, www.qruiser.com, which is the Nordic region's largest community for lesbian, gay, bisexual, transgender and queer persons (Dec 2005-Jan 2006; Rybski et al., 2009; Lu et al., 2013, 2012). Our dataset consists of 16,082 male homosexual members and forms a strongly connected component. Because members are allowed to add any member to their list of contacts without approval of that member, the resulting network is directed; the fraction of directed edges equals $\alpha = 0.7572$. The in-degree and out-degree distributions are skewed (Lu et al., 2012), and the mean number of edges $\lambda$ is equal to 27.7434. The data set also includes users' profiles, from which we obtain four dichotomous properties on which we evaluate estimators of population proportions: age (born before 1980 or not), county (live in Stockholm or not), civil status (married or unmarried), and profession (employed or unemployed).

5.2 Evaluation of Estimators

We compared the performance of our estimators of the selection probabilities with three other estimators. We refer to our estimator $\{\hat{\pi}_i\}$ obtained from Eq. (8) as $\{\hat{\pi}_i^{(ren)}\}$ (ren stands for renewal). The other estimators are the uniform stationary distribution $\{\hat{\pi}_i^{(uni)}\}$, where $\hat{\pi}_i^{(uni)} = 1/N$ for all $i$, the selection probabilities proportional to the out-degree $\{\hat{\pi}_i^{(outdeg)}\}$, on which Eq. (2) is based, where $\hat{\pi}_i^{(outdeg)} \propto d_i^{(un)} + d_i^{(out)}$, and the stationary distribution obtained from Eq. (4) $\{\hat{\pi}_i^{(indeg)}\}$, i.e., proportional to the in-degree. In the following, we suppress the $\{\}$ notation.
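To make concrete how any of the estimated selection probabilities above enters the proportion estimate of Eq. (1), a minimal Python sketch follows. The function name `estimate_proportion` and the four-respondent sample are hypothetical; the sketch only illustrates the weighting and is not the simulation code used for the results.

```python
def estimate_proportion(has_property, pi_hat):
    """Eq. (1): inverse-probability-weighted estimate of p_A from a sample.

    has_property[k] is True if the k-th sampled vertex has property A.
    pi_hat[k] is its estimated selection probability; any common scaling
    factor cancels, so unnormalized weights are acceptable.
    """
    inv = [1.0 / p for p in pi_hat]
    numerator = sum(w for w, a in zip(inv, has_property) if a)
    return numerator / sum(inv)


# Hypothetical sample of four respondents with their reported out-degrees.
has_A = [True, False, True, False]
out_degree = [1, 2, 2, 4]

# Uniform selection probabilities reduce Eq. (1) to the sample mean.
p_uni = estimate_proportion(has_A, [1.0] * len(has_A))
# Selection probabilities proportional to out-degree give the form of Eq. (2).
p_outdeg = estimate_proportion(has_A, out_degree)
```

Swapping in a different list for `pi_hat` (for example in-degrees, or renewal-based estimates) changes the estimator without changing the weighting logic.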
T o assess the p erformance of an estimator we first calculated the estimated selection probabilities ˆ π i for one of the four estimators and the true stationary distribution π i at all the vertices in the given netw ork. Then, we calculated their total variation distanc e defined by D T V = 1 2 N X i =1 | ˆ π i − π i | (15) (Levin et al., 2009). The stationary distribution π i w as obtained using the p o w er metho d (Langville and Mey er, 2006) with an accuracy of 10 − 10 in terms of the total v ariation distance for the tw o distributions given in the successive t wo steps of the p ow er iteration. F or ˆ π (ren) , we considered three v ariants dep ending on the information a v ail- able from observed degree and kno wledge of the moments of the degree distri- 17 butions. When the full degree ( d (un) i , d (in) i , d (out) i ) is observ ed, we used Eq. (8) to calculate ˆ π (ren) , where E 1 / ( ˜ D (un) + D (out) ) is estimated by the mean of the inv erse sample out-degrees. W e denote the corresp onding estimator with ˆ π (ren) f . d . , where f.d. stands for “full degree”. When only the out-degree is observ ed and the momen ts of the degree distributions are kno wn, w e used Eq. (9). This case is only ev aluated for the directed Erd˝ os-R ´ en yi graphs, and the corresp ond- ing estimator is denoted by ˆ π (ren) α,λ . If only the out-degree is observ ed and the momen ts of the degree distributions are unkno wn, w e used Eqs. (12), (13), and (14), and the estimator is denoted ˆ π (ren) . W e sampled from eac h generated netw ork by means of a random w alk start- ing from a randomly selected vertex. In the random walk, we collect the degree of the visited no des and also c heck whether they hav e property A or not. W e estimated the p opulation prop ortion p A from the sample by replacing π in Eq. 
(1) by either $\hat{\pi}^{(\mathrm{uni})}$, $\hat{\pi}^{(\mathrm{outdeg})}$, $\hat{\pi}^{(\mathrm{indeg})}$, or any of the variants of $\hat{\pi}^{(\mathrm{ren})}$, yielding estimates $\hat{p}^{(\mathrm{uni})}_{A}$, $\hat{p}^{(\mathrm{outdeg})}_{A}$, $\hat{p}^{(\mathrm{indeg})}_{A}$, or $\hat{p}^{(\mathrm{ren})}_{A}$, respectively. The sample size is denoted by $s$.

6 NUMERICAL RESULTS

6.1 Directed Erdős-Rényi Graphs

In Table 2, we show the mean of the total variation distance $D_{TV}$ between the true stationary distribution and $\hat{\pi}^{(\mathrm{uni})}$, $\hat{\pi}^{(\mathrm{outdeg})}$, $\hat{\pi}^{(\mathrm{indeg})}$, and $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$, calculated on the basis of 1000 realizations of the largest strongly connected component of the directed random graph having $N = 1000$ vertices. Because the standard deviation of $D_{TV}$ is similar between the estimators, we show an average over the four estimators. The sample size $s$ used in $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$ is 500. We also tried $s = 200$, which gave similar results. The $D_{TV}$ value of $\hat{\pi}^{(\mathrm{indeg})}$ and $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$ is much smaller than that of $\hat{\pi}^{(\mathrm{uni})}$ and $\hat{\pi}^{(\mathrm{outdeg})}$ for all values of $\alpha$ and $\lambda$. Furthermore, $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$ always gives smaller $D_{TV}$ than $\hat{\pi}^{(\mathrm{indeg})}$, although the two values are similar for many combinations of the parameters.

Table 2: Mean and average s.d. of $D_{TV}$ for the directed random graph when $(d_i^{(\mathrm{un})}, d_i^{(\mathrm{in})}, d_i^{(\mathrm{out})})$ is observed and moments of the degree distributions are known. The lowest $D_{TV}$ value in each row is marked in boldface. We set $N = 1000$.

| $\alpha$ | $\lambda$ | $\hat{\pi}^{(\mathrm{uni})}$ | $\hat{\pi}^{(\mathrm{outdeg})}$ | $\hat{\pi}^{(\mathrm{indeg})}$ | $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$ | s.d. |
|---|---|---|---|---|---|---|
| 0.1 | 5 | 0.185 | 0.074 | 0.042 | **0.041** | 0.004 |
| 0.1 | 10 | 0.131 | 0.045 | 0.017 | **0.016** | 0.002 |
| 0.1 | 15 | 0.106 | 0.036 | **0.010** | **0.010** | 0.001 |
| 0.25 | 5 | 0.203 | 0.134 | 0.077 | **0.075** | 0.005 |
| 0.25 | 10 | 0.140 | 0.081 | 0.031 | **0.030** | 0.002 |
| 0.25 | 15 | 0.112 | 0.063 | **0.019** | **0.019** | 0.002 |
| 0.5 | 5 | 0.247 | 0.225 | 0.138 | **0.133** | 0.009 |
| 0.5 | 10 | 0.160 | 0.136 | 0.056 | **0.055** | 0.004 |
| 0.5 | 15 | 0.126 | 0.105 | 0.034 | **0.033** | 0.002 |
| 0.75 | 5 | 0.303 | 0.319 | 0.207 | **0.201** | 0.014 |
| 0.75 | 10 | 0.188 | 0.201 | 0.090 | **0.088** | 0.005 |
| 0.75 | 15 | 0.144 | 0.156 | **0.055** | **0.055** | 0.003 |

In Table 3, we show the mean and average s.d. of $D_{TV}$ when the out-degree, i.e. $d_i^{(\mathrm{un})} + d_i^{(\mathrm{out})}$, is observed but the individual $d_i^{(\mathrm{un})}$ and $d_i^{(\mathrm{out})}$ values are not. The assumptions underlying the network generation are the same as those for Table 2, and the sample size $s$ is equal to 500. Here we consider two cases. In the first case, the moments of the degree distribution are known, and we use the estimator $\hat{\pi}^{(\mathrm{ren})}_{\alpha,\lambda}$. In the second case, they are not known, and we use $\hat{\pi}^{(\mathrm{ren})}$. Results for $\hat{\pi}^{(\mathrm{indeg})}$ are not shown in Table 3 because in-degree is not observed. Table 3 indicates that $D_{TV}$ for $\hat{\pi}^{(\mathrm{ren})}$ is smaller than that for $\hat{\pi}^{(\mathrm{uni})}$ and $\hat{\pi}^{(\mathrm{outdeg})}$ when $\alpha$ is 0.5 and 0.75. When $\alpha = 0.75$, $\hat{\pi}^{(\mathrm{outdeg})}$ yields the largest $D_{TV}$. For $\alpha = 0.1$ and 0.25, $\hat{\pi}^{(\mathrm{ren})}$ and $\hat{\pi}^{(\mathrm{outdeg})}$ yield similar results. For all parameter values, $\hat{\pi}^{(\mathrm{ren})}_{\alpha,\lambda}$ slightly outperforms $\hat{\pi}^{(\mathrm{ren})}$. We tried $s = 200$ (not shown), which gave similar s.d. for $\hat{\pi}^{(\mathrm{ren})}_{\alpha,\lambda}$, and similarly for $\hat{\pi}^{(\mathrm{ren})}$, except for $\alpha = 0.1$, where, for example, $\lambda = 15$ yielded s.d. values of 0.0039 and 0.0073 for $s = 500$ and $s = 200$, respectively.

Table 3: Mean and average s.d. of $D_{TV}$ for the directed random graph when $d_i^{(\mathrm{un})} + d_i^{(\mathrm{out})}$ is observed. We set $N = 1000$.

| $\alpha$ | $\lambda$ | $\hat{\pi}^{(\mathrm{uni})}$ | $\hat{\pi}^{(\mathrm{outdeg})}$ | $\hat{\pi}^{(\mathrm{ren})}_{\alpha,\lambda}$ | $\hat{\pi}^{(\mathrm{ren})}$ | s.d. |
|---|---|---|---|---|---|---|
| 0.1 | 5 | 0.185 | 0.074 | 0.074 | 0.075 | 0.004 |
| 0.1 | 10 | 0.131 | 0.045 | 0.045 | 0.047 | 0.003 |
| 0.1 | 15 | 0.106 | 0.036 | 0.035 | 0.037 | 0.002 |
| 0.25 | 5 | 0.203 | 0.135 | 0.132 | 0.133 | 0.006 |
| 0.25 | 10 | 0.140 | 0.081 | 0.079 | 0.080 | 0.003 |
| 0.25 | 15 | 0.112 | 0.063 | 0.061 | 0.063 | 0.002 |
| 0.5 | 5 | 0.246 | 0.225 | 0.214 | 0.215 | 0.010 |
| 0.5 | 10 | 0.160 | 0.136 | 0.127 | 0.128 | 0.004 |
| 0.5 | 15 | 0.125 | 0.105 | 0.098 | 0.099 | 0.003 |
| 0.75 | 5 | 0.303 | 0.318 | 0.294 | 0.295 | 0.014 |
| 0.75 | 10 | 0.188 | 0.201 | 0.177 | 0.178 | 0.006 |
| 0.75 | 15 | 0.144 | 0.156 | 0.135 | 0.135 | 0.004 |

To compare estimated $p_A$, we generated 1000 networks for each combination of the parameters $\alpha \in \{0.25, 0.5, 0.75\}$ and $\lambda = 10$. On each of these networks we in turn allocate the property $A$ in each of the six ways described in Section 5.1. The probability of a vertex having $A$ is denoted by $p \in \{0.2, 0.5\}$. For each network and allocation, we simulate a random walk with length $s \in \{200, 500\}$ and calculate the differences between estimated proportions of the population with property $A$ and the actual proportion of vertices with $A$. In Figure 2, results for $\alpha = 0.75$, $p = 0.5$, and $s = 500$ are shown. The six groups of boxplots correspond to the six different ways of allocating $A$ (see Section 5.1). The six boxplots in each group correspond to $\hat{p}^{(\mathrm{ren})}_{A\,\mathrm{f.d.}}$, $\hat{p}^{(\mathrm{indeg})}_{A}$, $\hat{p}^{(\mathrm{ren})}_{A}$, $\hat{p}^{(\mathrm{ren})}_{A\,\alpha,\lambda}$, $\hat{p}^{(\mathrm{outdeg})}_{A}$, and $\hat{p}^{(\mathrm{uni})}_{A}$. We see that the bias of $\hat{p}^{(\mathrm{ren})}_{A\,\mathrm{f.d.}}$ and $\hat{p}^{(\mathrm{indeg})}_{A}$ is small for all allocations, as expected. For the estimators utilizing the out-degree, $\hat{p}^{(\mathrm{ren})}_{A}$, $\hat{p}^{(\mathrm{ren})}_{A\,\alpha,\lambda}$, and $\hat{p}^{(\mathrm{outdeg})}_{A}$, Figure 2 indicates that the choice of how to allocate $A$ has a significant impact on the performance of estimators. When $A$ is allocated proportional to the out-degree (Out-deg. in Fig. 2), $\hat{p}^{(\mathrm{ren})}_{A}$ and $\hat{p}^{(\mathrm{ren})}_{A\,\alpha,\lambda}$ yield the most accurate results, and when $A$ is allocated proportional to the number of directed edges (Dir. in Fig. 2), $\hat{p}^{(\mathrm{outdeg})}_{A}$ is most accurate; this is true for almost all parameter combinations. In general, the bias and variance increase with both $\alpha$ and $p$ for all estimators, and a small $s$ results in an increased variance, as expected. In the Supplementary material, these findings are further illustrated by numerical results with $(\alpha, p, s)$ equal to $(0.5, 0.2, 500)$, $(0.25, 0.5, 500)$, and $(0.75, 0.5, 200)$.

Figure 2: Deviations of estimated $\hat{p}_A$ from the true value in the directed Erdős-Rényi graphs with $N = 1000$, $\alpha = 0.75$, $\lambda = 10$, $p = 0.5$, and $s = 500$. Each group of boxplots corresponds to $\hat{p}^{(\mathrm{ren})}_{A\,\mathrm{f.d.}}$, $\hat{p}^{(\mathrm{indeg})}_{A}$, $\hat{p}^{(\mathrm{ren})}_{A}$, $\hat{p}^{(\mathrm{ren})}_{A\,\alpha,\lambda}$, $\hat{p}^{(\mathrm{outdeg})}_{A}$, and $\hat{p}^{(\mathrm{uni})}_{A}$ for one allocation of the individual property $A$. The abbreviations for the allocations correspond to the function $g$, i.e., In-deg. equals $(d_i^{(\mathrm{un})} + d_i^{(\mathrm{in})})$, Out-deg. $(d_i^{(\mathrm{un})} + d_i^{(\mathrm{out})})$, Undir. $d_i^{(\mathrm{un})}$, In-dir. $d_i^{(\mathrm{in})}$, Out-dir. $d_i^{(\mathrm{out})}$, and Dir. $(d_i^{(\mathrm{in})} + d_i^{(\mathrm{out})})$.

6.2 Networks With Power-law Degree Distributions

To generate power-law networks, we set the expected total number of edges for each node to 16, while we set the expected number of undirected and directed edges equal to $(E(D^{(\mathrm{un})}), E(D^{(\mathrm{in})} + D^{(\mathrm{out})})) = (12, 4)$, $(8, 8)$, and $(4, 12)$. The three cases yield $\alpha = 0.25$, 0.5, and 0.75, respectively. For each combination of the parameters, we generate 1000 networks of size $N = 1000$ and calculate the mean of the $D_{TV}$. We also calculate the s.d., which is of magnitude $10^{-3}$ and therefore not shown. The sample size $s$ is set to 200 and 500.

Figure 3: Average $D_{TV}$ between the true stationary distribution and $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$, $\hat{\pi}^{(\mathrm{indeg})}$, $\hat{\pi}^{(\mathrm{ren})}$, $\hat{\pi}^{(\mathrm{ren})}_{\alpha,\lambda}$, $\hat{\pi}^{(\mathrm{outdeg})}$, and $\hat{\pi}^{(\mathrm{uni})}$ in the power-law networks with $N = 1000$, $\alpha$ equal to a) 0.25, b) 0.5, and c) 0.75, and $s = 500$. In each panel the horizontal axis is $\gamma$ and the vertical axis is $D_{TV}$.

The average $D_{TV}$ values for $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$, $\hat{\pi}^{(\mathrm{indeg})}$, $\hat{\pi}^{(\mathrm{ren})}$, $\hat{\pi}^{(\mathrm{ren})}_{\alpha,\lambda}$, $\hat{\pi}^{(\mathrm{outdeg})}$, and $\hat{\pi}^{(\mathrm{uni})}$ are shown in Figure 3 for various $\alpha$ and $\gamma$ values. Figure 3 suggests that $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$ and $\hat{\pi}^{(\mathrm{indeg})}$ are the most accurate among the estimators, with $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$ being slightly better. When $\alpha = 0.25$ and 0.5, $\hat{\pi}^{(\mathrm{ren})}_{\alpha,\lambda}$ has a lower mean $D_{TV}$ than $\hat{\pi}^{(\mathrm{ren})}$, but this difference is not seen when $\alpha = 0.75$. $\hat{\pi}^{(\mathrm{outdeg})}$ performs better than $\hat{\pi}^{(\mathrm{ren})}$ for all values of $\gamma$ when $\alpha = 0.25$, and the opposite result holds true when $\alpha = 0.75$.

In Figure 4, the results for $\hat{p}^{(\mathrm{ren})}_{A\,\mathrm{f.d.}}$, $\hat{p}^{(\mathrm{indeg})}_{A}$, $\hat{p}^{(\mathrm{ren})}_{A}$, $\hat{p}^{(\mathrm{ren})}_{A\,\alpha,\lambda}$, $\hat{p}^{(\mathrm{outdeg})}_{A}$, and $\hat{p}^{(\mathrm{uni})}_{A}$ when $\gamma = 3$, $E(D^{(\mathrm{un})}) = 4$, $E(D^{(\mathrm{in})} + D^{(\mathrm{out})}) = 12$, $p = 0.2$, and $s = 500$ are shown. The figure indicates that $\hat{p}^{(\mathrm{ren})}_{A\,\mathrm{f.d.}}$ and $\hat{p}^{(\mathrm{indeg})}_{A}$ have small bias across different allocations of $A$. In contrast, the magnitude of the bias of $\hat{p}^{(\mathrm{ren})}_{A}$, $\hat{p}^{(\mathrm{ren})}_{A\,\alpha,\lambda}$, and $\hat{p}^{(\mathrm{outdeg})}_{A}$ depends on the allocation type; $\hat{p}^{(\mathrm{ren})}_{A}$ has the smallest bias when $A$ is allocated proportional to the undirected degree, and $\hat{p}^{(\mathrm{ren})}_{A\,\alpha,\lambda}$ and $\hat{p}^{(\mathrm{outdeg})}_{A}$ when $A$ is allocated proportional to the out-degree. Their relative performance is hard to assess for other allocations. In general, a large fraction of directed edges, small $\gamma$, and large $p$ increase bias and variance, and variance of course decreases with $s$. The Supplementary material contains numerical results for $(\gamma, E(D^{(\mathrm{un})}), E(D^{(\mathrm{in})} + D^{(\mathrm{out})}), p, s) = (4.5, 4, 12, 0.2, 500)$, $(4.5, 4, 12, 0.5, 500)$, $(4.5, 12, 4, 0.5, 500)$, and $(3, 4, 12, 0.2, 200)$ to further support these results.

Figure 4: Deviations of estimated $p_A$ from the true population proportion in the power-law networks for $\gamma = 3$, $E(D^{(\mathrm{un})}) = 4$, $E(D^{(\mathrm{in})} + D^{(\mathrm{out})}) = 12$, $p = 0.2$, and $s = 500$. Each group of boxplots corresponds to $\hat{p}^{(\mathrm{ren})}_{A\,\mathrm{f.d.}}$, $\hat{p}^{(\mathrm{indeg})}_{A}$, $\hat{p}^{(\mathrm{ren})}_{A}$, $\hat{p}^{(\mathrm{ren})}_{A\,\alpha,\lambda}$, $\hat{p}^{(\mathrm{outdeg})}_{A}$, and $\hat{p}^{(\mathrm{uni})}_{A}$, for one allocation of $A$.

6.3 Online MSM Network

For the Qruiser online MSM network, we first evaluate $\hat{\pi}^{(\mathrm{uni})}$, $\hat{\pi}^{(\mathrm{outdeg})}$, $\hat{\pi}^{(\mathrm{indeg})}$, $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$, and $\hat{\pi}^{(\mathrm{ren})}$. The results are shown in Table 4. Note that $\hat{\pi}^{(\mathrm{ren})}_{\alpha,\lambda}$ is not evaluated because $\alpha$ and $\lambda$ are not known beforehand. For $\hat{\pi}^{(\mathrm{uni})}$, $\hat{\pi}^{(\mathrm{outdeg})}$, and $\hat{\pi}^{(\mathrm{indeg})}$, $D_{TV}$ to the true selection probabilities is exactly calculated. For $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$ and $\hat{\pi}^{(\mathrm{ren})}$, we show the mean and s.d. of $D_{TV}$ on the basis of 1000 samples of size 500. We see that $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$ has smaller $D_{TV}$ than $\hat{\pi}^{(\mathrm{indeg})}$, and that the mean $D_{TV}$ of $\hat{\pi}^{(\mathrm{ren})}$ is smaller than that of $\hat{\pi}^{(\mathrm{uni})}$ and $\hat{\pi}^{(\mathrm{outdeg})}$.

Table 4: $D_{TV}$ between the true stationary distribution and $\hat{\pi}^{(\mathrm{uni})}$, $\hat{\pi}^{(\mathrm{outdeg})}$, $\hat{\pi}^{(\mathrm{indeg})}$, $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$, and $\hat{\pi}^{(\mathrm{ren})}$. S.d. is shown in the second row, but only applies to $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$ and $\hat{\pi}^{(\mathrm{ren})}$.

| | $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$ | $\hat{\pi}^{(\mathrm{indeg})}$ | $\hat{\pi}^{(\mathrm{ren})}$ | $\hat{\pi}^{(\mathrm{outdeg})}$ | $\hat{\pi}^{(\mathrm{uni})}$ |
|---|---|---|---|---|---|
| $D_{TV}$ | 0.2198 | 0.2248 | 0.4057 | 0.4290 | 0.4484 |
| s.d. | 0.0004 | - | 0.0048 | - | - |

In Figure 5, we show estimates of the population proportions of the age, county, civil status, and profession properties. The true population proportions are shown by the dashed lines. The sample size is 500. Figure 5 indicates that $\hat{p}^{(\mathrm{ren})}_{A\,\mathrm{f.d.}}$ performs best of all estimators. Among the estimators utilizing $d_i^{(\mathrm{un})} + d_i^{(\mathrm{out})}$, $\hat{p}^{(\mathrm{ren})}_{A}$ has the smallest overall bias. Moreover, the variance of $\hat{p}^{(\mathrm{ren})}_{A}$ is smaller than that of $\hat{p}^{(\mathrm{outdeg})}_{A}$ for all properties, in particular the civil status.

Figure 5: Estimates of population proportions in the Qruiser network for a) age, b) civil status, c) county, and d) profession. Each figure shows $\hat{p}^{(\mathrm{ren})}_{A\,\mathrm{f.d.}}$, $\hat{p}^{(\mathrm{indeg})}_{A}$, $\hat{p}^{(\mathrm{ren})}_{A}$, $\hat{p}^{(\mathrm{outdeg})}_{A}$, and $\hat{p}^{(\mathrm{uni})}_{A}$. The true population proportions are shown by the dashed lines and are equal to 0.77, 0.40, 0.39, and 0.38 for age, civil status, county, and profession, respectively.

7 DISCUSSION AND CONCLUSIONS

We developed statistical procedures for sampling vertices in social networks to account for the empirical fact that social networks generally include non-reciprocal edges. The proposed estimation procedures typically outperformed existing methods that neglect directed edges. Among the scenarios investigated in the present study, the best accuracy of estimation was obtained when undirected, in-directed, and out-directed degree are separately observed for sampled individuals. In the more realistic scenario in which one only knows the sum of undirected and out-directed edges of sampled individuals, all estimation procedures are less precise. Our simulations also showed that estimators of population proportions were highly sensitive to how the property $A$ is allocated in the social network.

If the full directed degree $(d_i^{(\mathrm{un})}, d_i^{(\mathrm{in})}, d_i^{(\mathrm{out})})$ is observed and the moments of the degree distributions are known, our estimator $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$ is compared to $\hat{\pi}^{(\mathrm{indeg})}$.
It can be seen in Tables 2 and 4, and Figure 3 that $\hat{\pi}^{(\mathrm{ren})}_{\mathrm{f.d.}}$ performs slightly better than $\hat{\pi}^{(\mathrm{indeg})}$ in all the studied situations. The corresponding estimated proportions given by $\hat{p}^{(\mathrm{ren})}_{A\,\mathrm{f.d.}}$ and $\hat{p}^{(\mathrm{indeg})}_{A}$ in Figures 2, 4, and 5 are very similar.

If only the out-degree $d_i^{(\mathrm{un})} + d_i^{(\mathrm{out})}$ is observed, we compare $\hat{\pi}^{(\mathrm{ren})}$ and $\hat{\pi}^{(\mathrm{outdeg})}$ (Tables 3 and 4, and Figure 3). We also include $\hat{\pi}^{(\mathrm{ren})}_{\alpha,\lambda}$ in the comparison on the generated networks, and it can be seen that the performance of $\hat{\pi}^{(\mathrm{ren})}_{\alpha,\lambda}$ is only slightly better than that of $\hat{\pi}^{(\mathrm{ren})}$. Our estimator $\hat{\pi}^{(\mathrm{ren})}$ outperforms $\hat{\pi}^{(\mathrm{outdeg})}$ except when the fraction of directed edges $\alpha$ is small (0.1 in Table 3 and 0.25 in Figure 3). This is consistent with the fact that $\hat{\pi}^{(\mathrm{ren})}$ deviates further from $\hat{\pi}^{(\mathrm{outdeg})}$ as $\alpha$ increases (Eq. (14)). Figures 2 and 4 indicate that the results of the estimators $\hat{p}^{(\mathrm{ren})}_{A}$, $\hat{p}^{(\mathrm{ren})}_{A\,\alpha,\lambda}$, and $\hat{p}^{(\mathrm{outdeg})}_{A}$ depend strongly on the allocation of the property $A$. We believe that it is of interest to further study how properties are distributed in empirical social networks.

If $\alpha$ is known, we can estimate $\lambda$ using only the mean sample out-degree in Eq. (13). Although generally difficult, it is possible to assess the fraction of directed edges in the social network of a hidden population through direct methods. In many RDS studies, participants are asked questions that experimenters use to quantify the nature of the relationship between a participant and their recruiter, e.g., friends, acquaintances or strangers (e.g., Ramirez-Valles et al., 2005; Wang et al., 2007; Ma et al., 2007). With these questions, the authors aim to control for non-reciprocated relationships, which could lead to the participant being excluded from the sample.
This type of question is also useful for assessing the directedness of the social network, because the fraction of coupons given by strangers could be a measure of (non-)reciprocity. In Gile et al. (2012), another type of question more directly assessing reciprocation is suggested, e.g. "Do you think that the person to whom you gave a coupon would have given you a coupon if you had not participated in the study first?". Another possible method to estimate $\alpha$ would be to obtain information on the number of revisits $m$ used in Eq. (12). This could be done by asking, for example, "Would you give a coupon to the person who gave you a coupon if he or she had not yet participated in the study?".

The main focus of the present paper was on accounting for directed edges in a social network. There are also other assumptions in existing estimation procedures (including the current one) worth relaxing. For example, the methods typically assume that participants choose coupon recipients uniformly at random among their neighbors in the social network. In reality, they are probably more likely to select closely connected neighbors, which may bias estimators of selection probabilities. Extending the RDS methods by allowing weighted edges warrants future work. It should be noted that our methods allow the two weights on the same undirected edge in the opposite directions to be different, because our framework targets directed networks.

Random walks on directed networks have numerous other applications, including identification of important vertices (Brin and Page, 1998; Langville and Meyer, 2006; Noh and Rieger, 2004; Newman, 2005) and community detection (Rosvall and Bergstrom, 2008). Therefore, we also hope that this work may contribute to an increased understanding in other areas of network research that use random walks on directed networks.
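The stationary distribution of a random walk on a directed network underlies both the evaluation in Section 5.2 and the applications mentioned above. The following minimal sketch, not the authors' implementation, computes it by power iteration and measures the total variation distance of Eq. (15). The function names, the adjacency-matrix convention (`adj[i, j] = 1` for an edge from $i$ to $j$), the uniform starting vector, and the `max_iter` safeguard are assumptions introduced here. The stopping rule follows the $10^{-10}$ tolerance between successive iterates stated in Section 5.2, and convergence presumes a strongly connected, aperiodic network.

```python
import numpy as np


def total_variation(p, q):
    """Total variation distance of Eq. (15): half the L1 distance."""
    return 0.5 * np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)).sum()


def stationary_distribution(adj, tol=1e-10, max_iter=100_000):
    """Power iteration for a simple random walk on a directed network.

    Assumed convention (not from the paper): adj[i, j] = 1 means the walker
    can move from i to j; the walker picks an outgoing edge uniformly.
    """
    adj = np.asarray(adj, dtype=float)
    out_deg = adj.sum(axis=1)
    if np.any(out_deg == 0):
        raise ValueError("every vertex needs at least one outgoing edge")

    transition = adj / out_deg[:, None]               # row-stochastic matrix
    pi = np.full(adj.shape[0], 1.0 / adj.shape[0])    # uniform start (assumption)

    for _ in range(max_iter):
        nxt = pi @ transition
        # Stop when two successive iterates are within tol in total variation.
        if total_variation(nxt, pi) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")
```

Calling `total_variation(estimate, stationary_distribution(adj))` with any normalized vector of estimated selection probabilities then gives the kind of $D_{TV}$ value reported in Tables 2 to 4.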
References

Boldi, P., Rosa, M., Santini, M., and Vigna, S. (2011). Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In Proceedings of the 20th International Conference on World Wide Web, pages 587–596. ACM.

Boldi, P. and Vigna, S. (2004). The WebGraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web, pages 595–602. ACM.

Brin, S. and Page, L. (1998). Anatomy of a large-scale hypertextual web search engine. Proceedings of the Seventh International World Wide Web Conference, pages 107–117.

Chung, F. and Lu, L. Y. (2002). The average distances in random graphs with given expected degrees. Proc. Natl. Acad. Sci. USA, 99:15879–15882.

Chung, F., Lu, L. Y., and Vu, V. (2003). Spectra of random graphs with given expected degrees. Proc. Natl. Acad. Sci. USA, 100:6313–6318.

Donato, D., Laura, L., Leonardi, S., and Millozzi, S. (2004). Large scale properties of the Webgraph. Eur. Phys. J. B, 38:239–243.

Doyle, P. G. and Snell, J. L. (1984). Random Walks and Electric Networks. Math. Asso. Amer.

Erdős, P. and Rényi, A. (1960). On the evolution of random graphs. Publ. Math. Inst. Hungar. Acad. Sci, 5:17–61.

Fortunato, S., Boguñá, M., Flammini, A., and Menczer, F. (2008). Approximating PageRank from in-degree. In Algorithms and Models for the Web-Graph, pages 59–71. Springer.

Ghoshal, G. and Barabási, A. L. (2011). Ranking stability and super-stable nodes in complex networks. Nat. Comm., 2:394.

Gile, K. J. (2011). Improved inference for respondent-driven sampling data with application to HIV prevalence estimation. Journal of the American Statistical Association, 106(493).

Gile, K. J. and Handcock, M. S. (2010). Respondent-driven sampling: An assessment of current methodology. Sociological Methodology, 40(1):285–327.

Gile, K. J. and Handcock, M. S. (2011). Network model-assisted inference from respondent-driven sampling data. arXiv preprint arXiv:1108.0298.

Gile, K. J., Johnston, L. G., and Salganik, M. J. (2012). Diagnostics for respondent-driven sampling. arXiv preprint arXiv:1209.6254.

Goel, S. and Salganik, M. J. (2010). Assessing respondent-driven sampling. Proceedings of the National Academy of Sciences, 107(15):6743–6747.

Goh, K. I., Kahng, B., and Kim, D. (2001). Universal behavior of load distribution in scale-free networks. Phys. Rev. Lett., 87:278701.

Gong, N. Z., Xu, W., and Song, D. (2013). Reciprocity in social networks: Measurements, predictions, and implications. arXiv preprint arXiv:1302.6309.

Heckathorn, D. D. (1997). Respondent-driven sampling: a new approach to the study of hidden populations. Social Problems, pages 174–199.

Killworth, P. D. and Bernard, H. R. (1976). Informant accuracy in social network data. Human Organization, 35(3):269–286.

Kwak, H., Lee, C., Park, H., and Moon, S. (2010). What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, pages 591–600. ACM.

Langville, A. N. and Meyer, C. D. (2006). Google's PageRank and Beyond. Princeton University Press, Princeton.

Levin, D. A., Peres, Y., and Wilmer, E. L. (2009). Markov Chains and Mixing Times. American Mathematical Society.

Lovász, L. (1993). Random walks on graphs: A survey. Bolyai Society Math. Studies, 2:1–46.

Lu, X., Bengtsson, L., Britton, T., Camitz, M., Kim, B. J., Thorson, A., and Liljeros, F. (2012). The sensitivity of respondent-driven sampling. Journal of the Royal Statistical Society: Series A (Statistics in Society), 175(1):191–216.

Lu, X., Malmros, J., Liljeros, F., and Britton, T. (2013). Respondent-driven sampling on directed networks. Electronic Journal of Statistics, 7:292–322.

Ma, X., Zhang, Q., He, X., Sun, W., Yue, H., Chen, S., Raymond, H. F., Li, Y., Xu, M., Du, H., et al. (2007). Trends in prevalence of HIV, syphilis, hepatitis C, hepatitis B, and sexual risk behavior among men who have sex with men: results of 3 consecutive respondent-driven sampling surveys in Beijing, 2004 through 2006. JAIDS Journal of Acquired Immune Deficiency Syndromes, 45(5):581–587.

Magnus, M., Kuo, I., Phillips II, G., Rawls, A., Peterson, J., Montanez, L., West-Ojo, T., Jia, Y., Opoku, J., Kamanu-Elias, N., et al. (2013). Differing HIV risks and prevention needs among men and women injection drug users (IDU) in the District of Columbia. Journal of Urban Health, pages 1–10.

Masuda, N. and Ohtsuki, H. (2009). Evolutionary dynamics and fixation probabilities in directed networks. New J. Phys., 11:033012.

Mislove, A., Marcon, M., Gummadi, K. P., Druschel, P., and Bhattacharjee, B. (2007). Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, pages 29–42. ACM.

Montealegre, J. R., Risser, J. M., Selwyn, B. J., McCurdy, S. A., and Sabin, K. (2013). Effectiveness of respondent driven sampling to recruit undocumented Central American immigrant women in Houston, Texas for an HIV behavioral survey. AIDS and Behavior, 17(2):719–727.

Moreno, J. L. et al. (1960). The Sociometry Reader. Free Press New York.

Newman, M. (2010). Networks: An Introduction. OUP Oxford.

Newman, M. E., Forrest, S., and Balthrop, J. (2002). Email networks and the spread of computer viruses. Physical Review E, 66(3):035101.

Newman, M. E. J. (2005). A measure of betweenness centrality based on random walks. Soc. Netw., 27:39–54.

Noh, J. D. and Rieger, H. (2004). Random walks on complex networks. Phys. Rev. Lett., 92:118701.

Ramirez-Valles, J., Heckathorn, D. D., Vázquez, R., Diaz, R. M., and Campbell, R. T. (2005). From networks to populations: the development and application of respondent-driven sampling among IDUs and Latino gay men. AIDS and Behavior, 9(4):387–402.

Resnick, S. I. (1992). Adventures in Stochastic Processes. Birkhäuser.

Rosvall, M. and Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proc. Natl. Acad. Sci. USA, 105:1118–1123.

Rybski, D., Buldyrev, S. V., Havlin, S., Liljeros, F., and Makse, H. A. (2009). Scaling laws of human interaction activity. Proceedings of the National Academy of Sciences, 106(31):12640–12645.

Salganik, M. J. and Heckathorn, D. D. (2004). Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology, 34(1):193–240.

Tang, W., Huan, X., Mahapatra, T., Tang, S., Li, J., Yan, H., Fu, G., Yang, H., Zhao, J., and Detels, R. (2013). Factors associated with unprotected anal intercourse among men who have sex with men: Results from a respondent driven sampling survey in Nanjing, China, 2008. AIDS and Behavior, pages 1–8.

Thompson, S. K. (2012). Sampling. Wiley.

Tomas, A. and Gile, K. J. (2011). The effect of differential recruitment, non-response and non-recruitment on estimators for respondent-driven sampling. Electronic Journal of Statistics, 5:899–934.

Volz, E. and Heckathorn, D. D. (2008). Probability based estimation theory for respondent driven sampling. Journal of Official Statistics, 24(1):79.

Wang, J., Falck, R. S., Li, L., Rahman, A., and Carlson, R. G. (2007). Respondent-driven sampling in the recruitment of illicit stimulant drug users in a rural setting: Findings and technical issues. Addictive Behaviors, 32(5):924–937.

Wasserman, S. and Faust, K. (1994). Social Network Analysis. Cambridge University Press, New York.

Wejnert, C. (2009). An empirical test of respondent-driven sampling: Point estimates, variance, degree measures, and out-of-equilibrium data. Sociological Methodology, 39(1):73–116.