An identification problem in an urn and ball model with heavy tailed distributions

CHRISTINE FRICKER, FABRICE GUILLEMIN, AND PHILIPPE ROBERT

Abstract. We consider in this paper an urn and ball problem with replacement, where balls have different colors and are drawn uniformly from a unique urn. The numbers of balls with a given color are i.i.d. random variables with a heavy tailed probability distribution, for instance a Pareto or a Weibull distribution. We draw a small fraction $p \ll 1$ of the total number of balls. The basic problem addressed in this paper is to know to which extent we can infer the total number of colors and the distribution of the number of balls with a given color. By means of Le Cam's inequality and the Chen-Stein method, bounds for the total variation norm between the distribution of the number of balls drawn with a given color and the Poisson distribution with the same mean are obtained. We then show that the distribution of the number of balls drawn with a given color has the same tail as that of the original number of balls. We finally establish explicit bounds between the two distributions when each ball is drawn with fixed probability $p$.

1. Introduction

We consider in this paper the following urn and ball scheme with replacement: An urn contains a random number of balls with different colors. We draw a small fraction $p \ll 1$ of the total number of balls. A ball which has been drawn is replaced into the urn. The problem considered in this paper consists of estimating the number of colors together with the distribution of the number of balls with a given color by using information from sampled balls. This problem is motivated by the analysis of packet sampling in the Internet (see Chabchoub et al. [5] for details).

To address the above problem, we analyze the non-normalized distribution of the number of balls drawn with a given color.
More specifically, let $W_j$ (respectively, $W_j^+$) denote the number of colors with a number of sampled balls equal to (respectively, equal to or greater than) $j$. Denoting by $\tilde K$ the number of colors seen when drawing balls, the quantities $W_j/\tilde K$ and $W_j^+/\tilde K$ are equal to the proportions of colors which, at the end of the trial, comprise exactly or at least $j$ balls, respectively.

The numbers of balls with various colors are assumed to be i.i.d. random variables and the number $K$ of colors is large. In addition, the distribution of the number of balls with a given color has a heavy tailed probability distribution of Pareto or Weibull type. Finally, balls are drawn uniformly. This means that for each $i = 1, \ldots, K$, if there are $v_i$ balls with color $i$, the probability of drawing a ball with this color is $v_i/V$, where $V = v_1 + \cdots + v_K$ is the total number of balls in the urn.

Key words and phrases. Chen-Stein method, Pareto distribution, Weibull distribution.

The above model is referred to as the “uniform model”. It will be compared against the case when balls are drawn independently of each other with probability $p$. This model will be referred to as the “probabilistic model”. We show that the results obtained in both models are close to each other when $p$ is very small. But there are some subtle differences between the two models, notably with regard to the achievable accuracy in the inference of original statistics. It turns out that the probabilistic model is simpler to analyze than the uniform model but yields less accurate results. This is due to the fact that we cannot exploit the fact that the number of colors is very large.
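The two sampling schemes and the counts $W_j$ and $W_j^+$ can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: the function names, the parameter values and the truncation `cap` on the color sizes are hypothetical choices made here (the cap merely keeps the per-ball loop short), and the integer ball counts are obtained by rounding down a continuous Pareto variable.

```python
import random
from collections import Counter

def pareto_sizes(K, a, b, rng, cap=10**5):
    """K i.i.d. integer ball counts with (approximately) the Pareto tail (b/x)^a.

    Assumption: sizes are floored to integers and truncated at `cap`.
    """
    # Inverse transform: if U is uniform on (0, 1], b * U**(-1/a) is Pareto(a, b).
    return [min(cap, max(1, int(b * (1.0 - rng.random()) ** (-1.0 / a))))
            for _ in range(K)]

def sample_uniform(sizes, p, rng):
    """Uniform model: round(p*V) draws with replacement, color i with prob. v_i/V."""
    n_draws = round(p * sum(sizes))
    counts = Counter(rng.choices(range(len(sizes)), weights=sizes, k=n_draws))
    return [counts[i] for i in range(len(sizes))]

def sample_probabilistic(sizes, p, rng):
    """Probabilistic model: every ball is kept independently with probability p."""
    return [sum(rng.random() < p for _ in range(v)) for v in sizes]

def W(sampled, j):
    """Number of colors with exactly j sampled balls."""
    return sum(1 for x in sampled if x == j)

def W_plus(sampled, j):
    """Number of colors with at least j sampled balls."""
    return sum(1 for x in sampled if x >= j)

rng = random.Random(0)
sizes = pareto_sizes(K=2000, a=1.5, b=5.0, rng=rng)
uniform = sample_uniform(sizes, p=0.05, rng=rng)
probabilistic = sample_probabilistic(sizes, p=0.05, rng=rng)
print("colors seen (uniform model):      ", W_plus(uniform, 1))
print("colors seen (probabilistic model):", W_plus(probabilistic, 1))
```

In the uniform model the total number of sampled balls is fixed once the sizes are known, whereas in the probabilistic model it is random; this is one visible difference between the two rules.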
One of the main results of the paper concerns the analysis of the validity of the following simple scaling rule: The distribution of the original number $v_i$ of balls with color $i$ could be estimated by that of the random variable $\tilde v_i/p$, where $\tilde v_i$ is the number of sampled balls with color $i$. When each ball is drawn with a fixed probability, it is known that this rule is valid for the tails of the distributions as soon as they are heavy tailed. See Asmussen et al. [3] and Foss and Korshunov [7], where this asymptotic equivalence is proved in a quite general framework. Our main goal here is to get, for $j \ge 2$, an explicit bound on the quantity
\[
\left| \frac{\mathbb{P}(\tilde v \ge j)}{\mathbb{P}(v \ge j/p)} - 1 \right| .
\]
In the context of packet sampling in the Internet, explicit expressions are especially important for the estimation of the sizes of flows in Internet traffic. In this setting the variable $j$ is taken to be large but cannot be too large, so that the event $\{\tilde v = j\}$ occurs sufficiently often to obtain reliable statistics. Hence, the dependence on $j$ should be made explicit. See Chabchoub et al. [5] for a discussion.

The organization of this paper is as follows: The notation and the basic results used in this paper (Le Cam's inequality and the Chen-Stein method) are presented in Section 2. The mean values of the random variables $W_j$ and $W_j^+$ are computed in Section 3. The approximation of the distribution of $W_j^+$ by a Poisson distribution and the validity of the scaling rule are investigated in Section 4. We compare in Section 5 the original distribution of the number of balls with a given color against the rescaled distribution of the number of drawn balls with the same color. Some concluding remarks with regard to sampling are presented in Section 6.

2. Notation and basic results

2.1. Definitions and assumptions. We consider an urn containing $v_i$ balls with color $i$ for $i = 1, \ldots, K$. The quantities $v_i$ are independent random variables with a common heavy tailed distribution. In the following we shall consider two families of heavy tailed distributions for the number $v$ of balls with a given color:

Pareto distributions: The distribution of $v$ is given by
\[
\mathbb{P}(v > x) = (b/x)^a, \quad x \ge b, \tag{1}
\]
with the shape parameter $a > 1$ and the location parameter $b > 0$. The mean of $v$ is $ab/(a-1)$.

Weibull distributions: The distribution of $v$ is given by
\[
\mathbb{P}(v > x) = \exp\left(-(x/\eta)^\beta\right), \quad x \ge 0, \tag{2}
\]
with the skew parameter $\beta \in (0,1)$ and the scale parameter $\eta > 0$. The mean of $v$ is $\frac{\eta}{\beta}\Gamma(1/\beta)$, where $\Gamma$ is the classical Euler Gamma function.

The total number of balls in the urn is $V = \sum_{i=1}^K v_i$. We draw only a fraction $p$ of this total number of balls. Each ball is drawn at random: A ball with color $i$ is drawn with probability $v_i/V$. After drawing the $pV$ balls, we have $\tilde v_i$ balls with color $i$. Of course, only those colors with $\tilde v_i > 0$ can be seen. The quantity $\tilde K = \sum_{i=1}^K \mathbf{1}_{\{\tilde v_i > 0\}}$ is the number of colors seen at the end of a trial.

In the following, we shall be interested in the asymptotic regime when the number of colors $K \to \infty$ while the fraction $p \to 0$. Note that by the law of large numbers, $V \to \infty$ a.s. (the total number of balls in the urn is very large).

The random variables we consider in this paper to infer the original statistics of the number of balls and colors are the variables $W_j$ and $W_j^+$, $j \ge 0$, defined as follows.

Definition 1 (Definition of $W_j$). The random variable $W_j$ is the number of colors with $j$ balls at the end of a trial and is given, for $j \ge 0$, by
\[
W_j = \mathbf{1}_{\{\tilde v_1 = j\}} + \mathbf{1}_{\{\tilde v_2 = j\}} + \cdots + \mathbf{1}_{\{\tilde v_K = j\}},
\]
where $\tilde v_i \ge 0$ is the number of balls drawn with color $i$ (which can be equal to 0).

Definition 2 (Definition of $W_j^+$). The random variable $W_j^+$ is the number of colors with at least $j$ balls at the end of a trial. The random variables $W_j^+$ are formally defined, for $j \ge 0$, by
\[
W_j^+ = \mathbf{1}_{\{\tilde v_1 \ge j\}} + \mathbf{1}_{\{\tilde v_2 \ge j\}} + \cdots + \mathbf{1}_{\{\tilde v_K \ge j\}}.
\]
Note that we have
\[
\forall j \ge 0, \quad W_j^+ = \sum_{\ell \ge j} W_\ell .
\]
The averages of the random variables $W_j$ are in fact the key quantities we shall use in the following to infer the original numbers of balls per color.

2.2. Le Cam's inequality and Chen-Stein method. Le Cam's inequality gives a bound on the distance in total variation between the distribution of a sum of independent and identically distributed (i.i.d.) Bernoulli random variables and the Poisson distribution with the same mean (see Barbour et al. [4]). Note that if $V$ and $W$ are two random variables taking integer values, the distance in total variation between their distributions is defined by
\[
\| \mathbb{P}(W \in \cdot) - \mathbb{P}(V \in \cdot) \|_{tv} \stackrel{\text{def.}}{=} \sup_{A \subset \mathbb{N}} | \mathbb{P}(W \in A) - \mathbb{P}(V \in A) | = \frac{1}{2} \sum_{n \ge 0} | \mathbb{P}(W = n) - \mathbb{P}(V = n) | .
\]

Theorem 1 (Le Cam's Inequality). If the random variable $W = \sum_i I_i$, where the random variables $I_i$ are i.i.d. Bernoulli random variables, then
\[
\| \mathbb{P}(W \in \cdot) - \mathbb{P}(Q_{\mathbb{E}(W)} \in \cdot) \|_{tv} \le \sum_i \mathbb{P}(I_i = 1)^2, \tag{3}
\]
where for $\lambda > 0$, $Q_\lambda$ is a Poisson random variable with mean $\lambda$, that is, for all $n \ge 0$,
\[
\mathbb{P}(Q_\lambda = n) = \frac{\lambda^n}{n!} e^{-\lambda}.
\]

When the random variables $I_i$ appearing in the above theorem are not independent but satisfy a specific condition, referred to as monotonic coupling, it is still possible to obtain a bound on the distance between the distribution of the sum $W = \sum_i I_i$ and the Poisson distribution with mean $\mathbb{E}(W)$.

Definition 3 (Monotonic Coupling). The variables $I_i$ are said to be negatively related when there exist some random variables $U_i$ and $V_i$ such that
(1) $U_i \stackrel{\text{dist.}}{=} W$ and $1 + V_i \stackrel{\text{dist.}}{=} (W \mid I_i = 1)$;
(2) $V_i \le U_i$.
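Le Cam's inequality (3) can be checked numerically in the simplest case, where $W$ is a sum of $n$ i.i.d. Bernoulli variables with parameter $q$, so that $W$ is binomial and the bound equals $nq^2$. The sketch below is an illustration only; the values of `n` and `q` are arbitrary choices made here, and the total variation distance is computed from the half-sum formula given above.

```python
import math

def binomial_pmf(n, q, k):
    """P(W = k) for W ~ Binomial(n, q)."""
    return math.comb(n, k) * q ** k * (1.0 - q) ** (n - k)

def poisson_pmf(lam, k):
    """P(Q_lam = k), computed on the log scale to avoid overflow."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def tv_binomial_poisson(n, q):
    """Total variation distance between Binomial(n, q) and Poisson(n*q)."""
    lam = n * q
    on_support = sum(abs(binomial_pmf(n, q, k) - poisson_pmf(lam, k))
                     for k in range(n + 1))
    # Beyond k = n the binomial puts no mass, so only the Poisson tail remains.
    poisson_tail = max(0.0, 1.0 - sum(poisson_pmf(lam, k) for k in range(n + 1)))
    return 0.5 * (on_support + poisson_tail)

n, q = 200, 0.01
distance = tv_binomial_poisson(n, q)
le_cam_bound = n * q * q          # sum over i of P(I_i = 1)^2
print(f"total variation distance: {distance:.5f}")
print(f"Le Cam bound n*q^2:       {le_cam_bound:.5f}")
```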
The main result of the Chen-Stein method is given by the following theorem (see Barbour et al. [4]).

Theorem 2. If the monotonic coupling condition is satisfied, then the following inequality holds
\[
\| \mathbb{P}(W \in \cdot) - \mathbb{P}(Q_{\mathbb{E}(W)} \in \cdot) \|_{tv} \le 1 - \frac{\mathrm{Var}(W)}{\mathbb{E}(W)} . \tag{4}
\]

When the monotonic coupling condition is satisfied, in order to prove the Poisson approximation, it is sufficient to show that the ratio of the variance to the mean value of $W$ is close to 1; this is a very weak condition to prove in practice.

It should be noted (see [8]) that Relation (4) can be used not only when $\mathbb{E}(W)$ takes bounded values, so that $W$ is approximately a Poisson random variable, but also when $\mathbb{E}(W)$ is large. In this case the Chen-Stein method yields a central limit theorem: If $N$ is a standard normal random variable,
\[
\left\| \mathbb{P}\left( \frac{W - \mathbb{E}(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot \right) - \mathbb{P}(N \in \cdot) \right\|_{tv}
\le \left\| \mathbb{P}\left( \frac{W - \mathbb{E}(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot \right) - \mathbb{P}\left( \frac{Q_{\mathbb{E}(W)} - \mathbb{E}(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot \right) \right\|_{tv}
+ \left\| \mathbb{P}\left( \frac{Q_{\mathbb{E}(W)} - \mathbb{E}(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot \right) - \mathbb{P}(N \in \cdot) \right\|_{tv},
\]
where $\mathrm{Var}(W)$ is the variance of the random variable $W$. By using Relation (4), we have
\[
\left\| \mathbb{P}\left( \frac{W - \mathbb{E}(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot \right) - \mathbb{P}(N \in \cdot) \right\|_{tv}
\le 1 - \frac{\mathrm{Var}(W)}{\mathbb{E}(W)}
+ \left\| \mathbb{P}\left( \frac{Q_{\mathbb{E}(W)} - \mathbb{E}(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot \right) - \mathbb{P}(N \in \cdot) \right\|_{tv} .
\]
If the ratio $\mathbb{E}(W)/\mathrm{Var}(W)$ is close to 1, then the first term in the right hand side of the above relation is negligible. In addition, the classical central limit theorem for Poisson distributions implies that when $\mathbb{E}(W)$ is large, the second term is negligible too. Therefore, we have
\[
W \sim \mathbb{E}(W) + \sqrt{\mathrm{Var}(W)}\, N
\]
with a bound on the error.

3. Computation of mean values

3.1. Bounds for mean values. By using Le Cam's inequality, we can establish the following result for the mean value of the random variables $W_j$.
Proposition 1 (Mean Value of $W_j$). If there are $V$ balls and $K$ colors in the urn, for $j \ge 0$, the mean number $\mathbb{E}(W_j)$ of colors with $j$ balls at the end of a trial satisfies the relation
\[
\left| \frac{\mathbb{E}(W_j)}{K} - Q_j \right| \le \mathbb{E}\left( \min(pv, 1)\, \frac{v}{V} \right), \tag{5}
\]
where $Q$ is the probability distribution defined for $j \ge 0$ by
\[
Q_j = \mathbb{E}\left( \frac{(pv)^j}{j!} e^{-pv} \right),
\]
$p$ is the sampling rate, and $v$ is distributed as the number of balls with a given color.

Proof. We have
\[
\tilde v_i = B^i_1 + B^i_2 + \cdots + B^i_{pV},
\]
where $B^i_\ell$ is equal to one if the $\ell$th ball drawn from the urn has color $i$, which event occurs with probability $v_i/V$, the quantity $V$ being the total number of balls in the urn. Conditionally on the values of the set $\mathcal{F} = \{v_1, \ldots, v_K\}$, the variables $(B^i_\ell, \ell \ge 1)$ are independent Bernoulli variables. For $1 \le i \le K$, Le Cam's Inequality (3) therefore gives the relation
\[
\| \mathbb{P}(\tilde v_i \in \cdot \mid \mathcal{F}) - \mathbb{P}(Q_{pv_i} \in \cdot) \|_{tv} \le p\, \frac{v_i^2}{V},
\]
and Relation (4), which can also be used in this case, yields
\[
\| \mathbb{P}(\tilde v_i \in \cdot \mid \mathcal{F}) - \mathbb{P}(Q_{pv_i} \in \cdot) \|_{tv} \le \frac{v_i}{V} .
\]
By integrating with respect to the variables $v_1, \ldots, v_K$, these two inequalities give the relation
\[
\| \mathbb{P}(\tilde v_i \in \cdot) - Q \|_{tv} \le \mathbb{E}\left( \min(pv, 1)\, \frac{v}{V} \right). \tag{6}
\]
Since $\mathbb{E}(W_j) = \sum_{i=1}^K \mathbb{P}(\tilde v_i = j)$, by summing on $i = 1, \ldots, K$, we obtain
\[
| \mathbb{E}(W_j) - K Q_j | \le K\, \mathbb{E}\left( \min(pv, 1)\, \frac{v}{V} \right),
\]
and the result follows. $\square$

By using the fact that $\mathbb{E}(W_j^+) = \sum_{i=1}^K \mathbb{P}(\tilde v_i \ge j)$, we can deduce from Equation (6) the following result.

Proposition 2 (Mean Value of $W_j^+$). If there are $V$ balls and $K$ colors in the urn, the mean number $\mathbb{E}(W_j^+)$ of colors with at least $j \ge 0$ balls at the end of an arbitrary trial satisfies the relation
\[
\left| \frac{\mathbb{E}(W_j^+)}{K} - \sum_{\ell \ge j} Q_\ell \right| \le \mathbb{E}\left( \min(pv, 1)\, \frac{v}{V} \right), \tag{7}
\]
where the probability distribution $Q$ is defined in Proposition 1.
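The conditional step of the proof of Proposition 1 can be checked exactly for a fixed environment, without any simulation: given the sizes $v_1, \ldots, v_K$, the count $\tilde v_i$ is binomial with parameters $pV$ and $v_i/V$, and its probability of being equal to $j$ differs from the Poisson value by at most $\min(pv_i, 1)\, v_i/V$. The sketch below is an illustration under assumptions made here: the list `sizes` is a hypothetical fixed environment chosen so that $pV$ is an integer, and the averages over colors stand in for the expectations in (5).

```python
import math

def binomial_pmf(n, q, k):
    """P(X = k) for X ~ Binomial(n, q)."""
    return math.comb(n, k) * q ** k * (1.0 - q) ** (n - k)

def poisson_pmf(lam, k):
    """P(Q_lam = k), computed on the log scale."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def conditional_mean_check(sizes, p, j):
    """Return (E(W_j | F)/K, Poisson mixture, bound) for a fixed environment."""
    V, K = sum(sizes), len(sizes)
    n_draws = round(p * V)                       # number of balls drawn
    exact = sum(binomial_pmf(n_draws, v / V, j) for v in sizes) / K
    mixed_poisson = sum(poisson_pmf(p * v, j) for v in sizes) / K
    bound = sum(min(p * v, 1.0) * v / V for v in sizes) / K
    return exact, mixed_poisson, bound

# Hypothetical environment: V = 1000 balls, so p*V = 50 draws for p = 0.05.
sizes = [5] * 60 + [20] * 10 + [100] * 3 + [200]
results = [conditional_mean_check(sizes, p=0.05, j=j) for j in range(6)]
for j, (exact, mixed, bound) in enumerate(results):
    print(f"j={j}: |{exact:.5f} - {mixed:.5f}| <= {bound:.5f}")
```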
We immediately deduce from Propositions 1 and 2 the following corollary by using the fact that $V \ge K$.

Corollary 1 (Asymptotic Mean Values). The relations
\[
\lim_{K \to \infty} \frac{1}{K} \mathbb{E}(W_j) = Q_j \quad \text{and} \quad \lim_{K \to \infty} \frac{1}{K} \mathbb{E}(W_j^+) = \sum_{\ell \ge j} Q_\ell
\]
hold.

Note that if balls are drawn with probability $p$ independently of each other (probabilistic model), we have $\tilde v_i = \sum_{\ell=1}^{v_i} \tilde B^i_\ell$, where the random variables $\tilde B^i_\ell$ are Bernoulli with mean $p$. By adapting the above proofs, we find
\[
\left| \frac{\mathbb{E}(W_j)}{K} - Q_j \right| \le p . \tag{8}
\]

3.2. Asymptotic results for specific probability distributions.

3.2.1. Pareto distributions. Let us first assume that the number of balls of a given color follows a Pareto distribution given by Equation (1). Then, we have the following result when the number of colors goes to infinity.

Proposition 3. If $v$ has a Pareto distribution as in Equation (1), then for all $j > a$, the relations
\[
\lim_{K \to +\infty} \frac{\mathbb{E}(W_{j+1})}{\mathbb{E}(W_j)} = 1 - \frac{a+1}{j+1} + O\left( (pb)^{j-a} \right), \tag{9}
\]
\[
\lim_{K \to +\infty} \frac{\mathbb{E}(W_j)}{K} = a (pb)^a \frac{\Gamma(j-a)}{j!} + O\left( (pb)^j \right), \tag{10}
\]
\[
\lim_{K \to +\infty} \frac{\mathbb{E}(W_j^+)}{K} = (pb)^a \frac{\Gamma(j-a)}{(j-1)!} + O\left( \frac{(pb)^j}{1 - pb} \right) \tag{11}
\]
hold.

Proof. For $j > a$,
\[
Q_j = \mathbb{E}\left( \frac{(pv)^j}{j!} e^{-pv} \right) = \frac{a b^a p^a}{j!} \int_{pb}^{+\infty} u^{j-a-1} e^{-u}\, du
= a (pb)^a \frac{\Gamma(j-a)}{j!} - \frac{a (pb)^j}{j!} \int_0^1 u^{j-a-1} e^{-pbu}\, du . \tag{12}
\]
Therefore, by using the relation $\Gamma(x+1) = x\Gamma(x)$, we get the equivalence
\[
\frac{Q_{j+1}}{Q_j} = \frac{j-a}{j+1} + O\left( (pb)^{j-a} \right),
\]
which gives Equations (9) and (10) by using Corollary 1. For the mean value of $W_j^+$, Equation (12) gives the relation
\[
\lim_{K \to +\infty} \frac{\mathbb{E}(W_j^+)}{K} = a (pb)^a \sum_{n \ge j} \frac{\Gamma(n-a)}{n!} + O\left( \frac{(pb)^j}{1-pb} \right)
= a (pb)^a \sum_{n \ge 0} \frac{\Gamma(n+j-a)\Gamma(n+1)}{\Gamma(j+n+1)} \frac{1^n}{n!} + O\left( \frac{(pb)^j}{1-pb} \right)
= a (pb)^a \frac{\Gamma(j-a)}{j!} F(j-a, 1; j+1; 1) + O\left( \frac{(pb)^j}{1-pb} \right),
\]
where $F(a, b; c; z)$ is the hypergeometric function satisfying
\[
F(a, b; c; 1) = \frac{\Gamma(c)\Gamma(c-a-b)}{\Gamma(c-a)\Gamma(c-b)}
\]
(see Abramowitz and Stegun [1]), and Equation (11) follows. $\square$

The shape parameter $a$ can be estimated via Relation (11) by
\[
a = \lim_{K \to \infty} j \left( 1 - \frac{\mathbb{E}(W_{j+1}^+)}{\mathbb{E}(W_j^+)} \right) + O\left( \frac{(pb)^j}{1-pb} \right) \tag{13}
\]
for all $j > a$. When observing drawn balls, we have in fact only access to the mean number $\mathbb{E}(\tilde K)$ of sampled colors. While this has no impact on the estimation of $a$, this correcting term is important when estimating $b$ from Equation (11). It is straightforward that
\[
\tilde K = \sum_{i=1}^K \mathbf{1}_{\{\tilde v_i > 0\}} = K - W_0
\]
and then, when $K \to \infty$,
\[
\mathbb{E}(\tilde K) \sim K (1 - Q_0) = K \left( 1 - \mathbb{E}(e^{-pv}) \right).
\]
Since
\[
1 - \mathbb{E}(e^{-pv}) = p \int_0^\infty e^{-px}\, \mathbb{P}(v > x)\, dx = 1 - e^{-bp} + (bp)^a \Gamma(1-a, bp), \tag{14}
\]
where $\Gamma(a, x)$ is the incomplete Gamma function defined by $\Gamma(a, x) = \int_x^\infty t^{a-1} e^{-t}\, dt$, we can use the above equations together with Equation (11) in order to estimate $b$ and then $K$. It is also worth noting that, since $\Gamma(1-a, x) \sim x^{1-a}/(a-1)$ as $x \to 0$, we have $1 - \mathbb{E}(e^{-pv}) \sim p\,\mathbb{E}(v) = abp/(a-1)$ when $a > 1$ and $bp \to 0$.

3.2.2. Weibull distributions. We assume in this section that the number of balls with a given color follows a Weibull distribution. In this case, we have the following result, which follows from a simple change of variable and the expansion of $\exp(-x^\beta)$ in power series of $x^\beta$ or of $\exp(-px)$ in power series of $x$; the proof is omitted.

Proposition 4. If $v$ has a Weibull distribution with skew parameter $\beta$ and scale parameter $\eta$, then for $0 < \beta < 1$,
\[
\lim_{K \to +\infty} \frac{\mathbb{E}(W_j)}{K} = \frac{\beta}{j!} \sum_{n=0}^{\infty} \frac{(-1)^n}{(p\eta)^{(n+1)\beta}} \frac{\Gamma((n+1)\beta + j)}{n!} \tag{15}
\]
and for $\beta > 1$,
\[
\lim_{K \to +\infty} \frac{\mathbb{E}(W_j)}{K} = \frac{(p\eta)^j}{j!} \sum_{n=0}^{\infty} \frac{(-p\eta)^n}{n!} \Gamma\left( \frac{n+j}{\beta} + 1 \right). \tag{16}
\]
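Returning to the Pareto case of Section 3.2.1, the expansion (12) can be checked numerically: the exact value of $Q_j$ is an incomplete Gamma integral, and it differs from the leading term $a(pb)^a\Gamma(j-a)/j!$ by a remainder that is at most $a(pb)^j/(j!\,(j-a))$. The sketch below is an illustration only; the Simpson rule, the truncation of the integral at `upper` and the parameter values are choices made here, not part of the paper.

```python
import math

def simpson(f, lo, hi, steps=200_000):
    """Composite Simpson rule on [lo, hi]; `steps` must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * f(lo + k * h)
    return total * h / 3.0

def q_pareto_numeric(j, a, pb, upper=60.0):
    """Q_j from the integral in Equation (12), truncated at `upper`."""
    integral = simpson(lambda u: u ** (j - a - 1.0) * math.exp(-u), pb, upper)
    return a * pb ** a / math.factorial(j) * integral

def q_pareto_leading(j, a, pb):
    """Leading term a (pb)^a Gamma(j - a) / j! of Equation (10)."""
    return a * pb ** a * math.gamma(j - a) / math.factorial(j)

a, pb, j = 1.5, 0.05, 3
exact = q_pareto_numeric(j, a, pb)
leading = q_pareto_leading(j, a, pb)
remainder_bound = a * pb ** j / (math.factorial(j) * (j - a))
ratio = q_pareto_numeric(j + 1, a, pb) / exact   # compare with (j - a)/(j + 1)
print(f"Q_j exact   = {exact:.6e}")
print(f"Q_j leading = {leading:.6e}")
print(f"Q_(j+1)/Q_j = {ratio:.4f}, (j-a)/(j+1) = {(j - a) / (j + 1):.4f}")
```

With these values the ratio $Q_{j+1}/Q_j$ stays within $(pb)^{j-a}$ of $(j-a)/(j+1)$, which is the content of Equation (9).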
Note that the limit of $\mathbb{E}(W_j)/K$ can be written in the form
\[
\frac{1}{j!} \frac{\beta}{(p\eta)^\beta} \int_0^\infty u^{j+\beta-1} e^{-u + t u^\beta}\, du
\]
with $t = -1/(p\eta)^\beta$. The above integral is known in the literature as being of Faxén type and can be expressed by means of the Meijer $G$-function when $\beta$ is a rational number, see Abramowitz and Stegun [1].

Contrary to the case of a Pareto distribution for the initial distribution of balls of a given color, there are no simple relations giving the parameters $\beta$ and $\eta$ from the mean values $\mathbb{E}(W_j)$, $j \ge 1$. In fact, we shall prove in the following that $\mathbb{P}(\tilde v \ge j)$ also has a Weibull tail. This eventually gives a means of identifying the parameters.

4. Poisson approximations

In the previous section, we have established bounds for the mean values of the random variables $W_j$ and $W_j^+$. To obtain more information on their distributions, we intend to use the Chen-Stein method. For a fixed environment (namely fixed values of the quantities $v_i$ for $i = 1, \ldots, K$), these random variables appear as sums of non independent Bernoulli random variables. A preliminary analysis of the Bernoulli random variables appearing in the expression of $W_j$ reveals that it does not seem possible to invoke a monotonic coupling argument.

It is well known (see [4] for details) that the situation is more favorable with the random variables $W_j^+$, and we can specifically prove that if $\mathcal{F}$ is the set $\mathcal{F} = \{v_i, 1 \le i \le K\}$, then the total number $W_j^+$ of colors with at least $j$ balls at the end of the trial satisfies the relation
\[
\left\| \mathbb{P}(W_j^+ \in \cdot \mid \mathcal{F}) - \mathbb{P}(Q_{\mathbb{E}(W_j^+ \mid \mathcal{F})} \in \cdot) \right\|_{tv} \le \mathbb{E}\left( 1 - \frac{\mathrm{Var}(W_j^+ \mid \mathcal{F})}{\mathbb{E}(W_j^+ \mid \mathcal{F})} \right). \tag{17}
\]
Indeed, given the random variables $v_i$, the model is equivalent to a standard urn and ball problem consisting of putting $pV$ balls into $K$ urns, a ball falling into urn $i$ with probability $p_i = v_i/V$.
The number of balls in urn $i$ is the number of balls with color $i$ in the original urn and ball problem. Even in the case when the quantities $p_i$ are different, the variables $I^+_{i,j} \stackrel{\text{def}}{=} \mathbf{1}_{\{\tilde v_i \ge j\}}$ are negatively related so that Theorem 2 can be used. See page 24 and Corollary 2.C.2, page 26, of [4] for a definition and the main inequality in this domain. Chapter 6 of this reference is entirely devoted to related occupancy problems.

The rest of this section is devoted to the estimation of the bound in Equation (17). We first establish the following lemma.

Lemma 1. For a fixed environment $\mathcal{F} = \{v_i, 1 \le i \le K\}$, the distance in total variation between the distribution of $W_j^+$ and the Poisson distribution $Q_{\mathbb{E}(W_j^+ \mid \mathcal{F})}$ satisfies the inequality
\[
\lim_{K \to +\infty} \left\| \mathbb{P}(W_j^+ \in \cdot \mid \mathcal{F}) - \mathbb{P}(Q_{\mathbb{E}(W_j^+ \mid \mathcal{F})} \in \cdot) \right\|_{tv} \le \frac{m_{2,j}(p)}{m_j(p)} + \frac{p}{\mathbb{E}(v)} \frac{m_j'(p)^2}{m_j(p)}, \tag{18}
\]
where $m_j(p)$ and $m_{2,j}(p)$ are the first two moments of the random variable defined by
\[
X_j(p) = \sum_{\ell \ge j} \frac{(pv)^\ell}{\ell!} e^{-pv}, \tag{19}
\]
and the prime sign denotes the derivative with respect to $p$.

Proof. For $\mathcal{F}$ fixed, the number $W_j$ of colors with $j \le pV$ balls at the end of the trial is such that
\[
\mathbb{E}(W_j \mid \mathcal{F}) = \sum_{i=1}^K \binom{pV}{j} \left( \frac{v_i}{V} \right)^j \left( 1 - \frac{v_i}{V} \right)^{pV - j} .
\]
By using the fact that
\[
\frac{1}{V} = \frac{1}{K \mathbb{E}(v)} + o\left( \frac{1}{K} \right) \quad \text{a.s.}
\]
for large $K$, straightforward calculations show that
\[
\mathbb{E}(W_j \mid \mathcal{F}) = \sum_{i=1}^K \frac{(pv_i)^j}{j!} e^{-pv_i} \left( 1 - \frac{j(j-1)}{2pK\mathbb{E}(v)} + \frac{2jv_i - pv_i^2}{2\mathbb{E}(v)K} \right) + o\left( \frac{1}{K} \right)
= \sum_{i=1}^K \left( \frac{(pv_i)^j}{j!} e^{-pv_i} - \frac{p}{2\mathbb{E}(v)K} \frac{d^2}{dp^2} \left( e^{-pv_i} \frac{(pv_i)^j}{j!} \right) \right) + o\left( \frac{1}{K} \right). \tag{20}
\]
By summing up the terms above and by checking that the $o(1/K)$ term remains valid, since the sum can be written as $\sum_{i=1}^K f(v_i) e^{-pv_i}/K^2$, where $f$ is a polynomial, we have for $j \ge 1$ and $0 < p < 1$
\[
\mathbb{E}(W_j^+ \mid \mathcal{F}) = \sum_{\ell \ge j} \mathbb{E}(W_\ell \mid \mathcal{F}) = \sum_{i=1}^K X_{i,j}(p) - \frac{p}{2\mathbb{E}(v)K} \sum_{i=1}^K X_{i,j}''(p) + o\left( \frac{1}{K} \right),
\]
where
\[
X_{i,j}(x) = \sum_{\ell \ge j} \frac{(xv_i)^\ell}{\ell!} e^{-xv_i} .
\]
For the variance, if $I_{i,j}$ is 1 if color $i$ has exactly $j$ balls at the end of the trial and 0 otherwise, then $W_j = \sum_{i=1}^K I_{i,j}$ and, for $j \ne \ell$,
\[
\mathbb{E}(W_j W_\ell \mid \mathcal{F}) = \sum_{1 \le i \ne m \le K} \mathbb{E}(I_{i,j} I_{m,\ell} \mid \mathcal{F})
\]
and
\[
\mathbb{E}(W_j^2 \mid \mathcal{F}) = \mathbb{E}(W_j \mid \mathcal{F}) + \sum_{1 \le i \ne m \le K} \mathbb{E}(I_{i,j} I_{m,j} \mid \mathcal{F}) .
\]
For $j, \ell$ such that $j + \ell \le pV$,
\[
\mathbb{E}(I_{i,j} I_{m,\ell} \mid \mathcal{F}) = \frac{(pV)!}{j!\, \ell!\, (pV - j - \ell)!} \left( \frac{v_i}{V} \right)^j \left( \frac{v_m}{V} \right)^\ell \left( 1 - \frac{v_i + v_m}{V} \right)^{pV - j - \ell} .
\]
The quantity in the right hand side of the above equation can be expanded as
\[
e^{-p(v_i + v_m)} \frac{p^{j+\ell} v_i^j v_m^\ell}{j!\, \ell!} - \frac{p}{2V} e^{-p(v_i + v_m)} \frac{v_i^j v_m^\ell}{j!\, \ell!} c_{i,m}(j, \ell) + o\left( \frac{1}{K} \right),
\]
where
\[
c_{i,m}(j, \ell) = p^{j+\ell-2} (j+\ell)(j+\ell-1) - 2(j+\ell)(v_i + v_m) p^{j+\ell-1} + (v_i + v_m)^2 p^{j+\ell}
\]
is such that
\[
e^{-p(v_i + v_m)} \frac{v_i^j v_m^\ell}{j!\, \ell!} c_{i,m}(j, \ell) = \frac{d^2}{dp^2} \left( e^{-p(v_i + v_m)} \frac{p^{j+\ell} v_i^j v_m^\ell}{j!\, \ell!} \right).
\]
Since
\[
(W_j^+)^2 = \Big( \sum_{\ell \ge j} W_\ell \Big)^2 = \sum_{\ell \ne k \ge j} W_k W_\ell + \sum_{\ell \ge j} W_\ell^2,
\]
we have
\[
\mathbb{E}((W_j^+)^2 \mid \mathcal{F}) - \mathbb{E}(W_j^+ \mid \mathcal{F}) = \sum_{1 \le i \ne m \le K} \sum_{\ell, k \ge j} \mathbb{E}(I_{i,k} I_{m,\ell} \mid \mathcal{F})
= \sum_{1 \le i \ne m \le K} \left( X_{i,j}(p) X_{m,j}(p) - \frac{p}{2\mathbb{E}(v)K} (X_{i,j} X_{m,j})''(p) \right) + o\left( \frac{1}{K} \right),
\]
and
\[
1 - \frac{\mathrm{Var}(W_j^+ \mid \mathcal{F})}{\mathbb{E}(W_j^+ \mid \mathcal{F})} = \frac{\mathbb{E}(W_j^+ \mid \mathcal{F}) - \mathbb{E}((W_j^+)^2 \mid \mathcal{F}) + \mathbb{E}(W_j^+ \mid \mathcal{F})^2}{\mathbb{E}(W_j^+ \mid \mathcal{F})} .
\]
The right-hand side of this equation can be expanded as
\[
\frac{1}{\sum_{i=1}^K X_{i,j} + O(1)} \left[ - \sum_{1 \le i \ne m \le K} X_{i,j}(p) X_{m,j}(p) + \frac{p}{2\mathbb{E}(v)K} \sum_{1 \le i \ne m \le K} (X_{i,j} X_{m,j})''(p) + \left( \sum_{i=1}^K X_{i,j}(p) - \frac{p}{2\mathbb{E}(v)K} \sum_{i=1}^K X_{i,j}''(p) \right)^2 \right] + o\left( \frac{1}{K} \right),
\]
which can be rewritten as
\[
\frac{1}{\sum_{i=1}^K X_{i,j} + O(1)} \left[ \sum_{1 \le i \le K} X_{i,j}^2(p) + \frac{p}{2\mathbb{E}(v)K} \left( \sum_{1 \le i \ne m \le K} (X_{i,j} X_{m,j})''(p) - 2 \sum_{i=1}^K X_{i,j}(p) \sum_{i=1}^K X_{i,j}''(p) \right) \right] + O(1)
\]
using that
\[
\sum_{i \ne m} X_{i,j} X_{m,j} = \Big( \sum_i X_{i,j} \Big)^2 - \sum_i X_{i,j}^2 .
\]
By the law of large numbers, we have that, almost surely,
\[
\lim_{K \to +\infty} \frac{1}{K} \sum_{i=1}^K X_{i,j}^2(p) = \mathbb{E}(X_j^2(p)) = m_{2,j}(p), \qquad
\lim_{K \to +\infty} \frac{1}{K^2} \sum_{i \ne m} (X_{i,j} X_{m,j})''(p) = (m_j^2)''(p),
\]
together with
\[
\lim_{K \to +\infty} \frac{1}{K} \sum_{i=1}^K X_{i,j}(p) = m_j(p) \quad \text{and} \quad \lim_{K \to +\infty} \frac{1}{K} \sum_{i=1}^K X_{i,j}''(p) = m_j''(p) .
\]
Hence, almost surely,
\[
\lim_{K \to \infty} \left( 1 - \frac{\mathrm{Var}(W_j^+ \mid \mathcal{F})}{\mathbb{E}(W_j^+ \mid \mathcal{F})} \right) = \frac{m_{2,j}(p) + p \left[ (m_j^2)''(p)/2 - m_j(p) m_j''(p) \right] / \mathbb{E}(v)}{m_j(p)}
= \frac{m_{2,j}(p) + p\, m_j'(p)^2 / \mathbb{E}(v)}{m_j(p)},
\]
where the last equality uses $(m_j^2)'' = 2 (m_j')^2 + 2 m_j m_j''$, and the result follows. $\square$

To illustrate the fact that the bound in Equation (18) is tight when $p \to 0$ and $v$ has finite moments of any order, let us note that, provided the corresponding moments are finite,
\[
\lim_{p \to 0} \frac{m_j(p)}{p^j} = \frac{\mathbb{E}(v^j)}{j!} . \tag{21}
\]
Moreover,
\[
\lim_{p \to 0} \frac{m_{2,j}(p)}{p^{2j}} = \frac{\mathbb{E}(v^{2j})}{j!^2} \quad \text{and} \quad \lim_{p \to 0} \frac{m_j'(p)}{p^{j-1}} = \frac{\mathbb{E}(v^j)}{(j-1)!} .
\]
Thus, the limit when $K$ tends to $+\infty$ of the bound given by Equation (18) is equivalent to
\[
\frac{j\, p^{j-1}}{(j-1)!} \frac{\mathbb{E}(v^j)}{\mathbb{E}(v)}
\]
when $p$ tends to 0. If $j \ge 2$, this term tends to 0 when $p \to 0$.

By using the above lemma, we are now able to state a limit result for the distribution of the random variables $W_j^+$.

Proposition 5. The inequality
\[
\lim_{K \to +\infty} \sup_{y \in \mathbb{R}} \left| \mathbb{P}\left( \frac{W_j^+ - \mathbb{E}(W_j^+)}{\sqrt{\mathbb{E}(W_j^+)}} \le y \right) - \int_{-\infty}^y \frac{e^{-u^2/2}}{\sqrt{2\pi}}\, du \right| \le \frac{m_{2,j}(p)}{m_j(p)} + \frac{p}{\mathbb{E}(v)} \frac{(m_j'(p))^2}{m_j(p)} \tag{22}
\]
holds.

Thus, for $j \ge 2$ and for small $p$, this gives the following approximation
\[
W_j^+ \sim \mathbb{E}(W_j^+) + \sqrt{\mathbb{E}(W_j^+)}\; G,
\]
where $G$ is a standard normal random variable.
It should be noted nevertheless that Equation (22) is almost a central limit result but, because of the scaling in $1/\sqrt{\mathbb{E}(W_j^+)}$ instead of $1/\sqrt{\mathrm{Var}(W_j^+)}$, the bound in the right hand side is not 0 as $K$ gets large but, according to the proof of Lemma 1, only an upper bound on the distance between $\mathbb{E}(W_j^+)$ and $\mathrm{Var}(W_j^+)$.

Proof. From Lemma 1, we have
\[
\left\| \mathbb{P}\left( \frac{W_j^+ - \mathbb{E}(W_j^+)}{\sqrt{\mathbb{E}(W_j^+)}} \in \cdot \,\Big|\, \mathcal{F} \right) - \mathbb{P}\left( \frac{Q_{\mathbb{E}(W_j^+ \mid \mathcal{F})} - \mathbb{E}(W_j^+ \mid \mathcal{F})}{\sqrt{\mathbb{E}(W_j^+ \mid \mathcal{F})}} \in \cdot \right) \right\|_{tv} \le \frac{m_{2,j}(p)}{m_j(p)} + \frac{p}{\mathbb{E}(v)} \frac{m_j'(p)^2}{m_j(p)} .
\]
From Equation (20), we have that
\[
\lim_{K \to \infty} \frac{1}{K} \mathbb{E}(W_j^+ \mid \mathcal{F}) = \mathbb{E}(X_j(p)) = \sum_{\ell \ge j} Q_\ell = m_j(p),
\]
where the quantities $Q_\ell$ are defined in Proposition 1. In addition, from Corollary 1, $\mathbb{E}(W_j^+) \sim K m_j(p)$ when $K \to +\infty$. The result then follows by applying the central limit theorem for Poisson distributions and by deconditioning with respect to $\mathcal{F}$. $\square$

To conclude this section, let us notice that when balls are drawn with probability $p$ independently of each other, we do not have to condition on the environment and we have
\[
\left\| \mathbb{P}(W_j^+ \in \cdot) - \mathbb{P}(Q_{\mathbb{E}(W_j^+)} \in \cdot) \right\|_{tv} \le \frac{\mathbb{E}\left( \sum_{k=j}^{v} \binom{v}{k} p^k (1-p)^{v-k} \mathbf{1}_{\{v \ge j\}} \right)^2}{\mathbb{E}\left( \binom{v}{j} p^j (1-p)^{v-j} \mathbf{1}_{\{v \ge j\}} \right)} .
\]
It is worth noting that the results are independent of the number of colors and that we do not need to take $K \to \infty$ to obtain a bound for the distance in total variation. In addition, when $\mathbb{E}(W_j)$ becomes large, it is possible to obtain a central limit-type approximation similar to Proposition 5.

5. Comparison with original distributions

5.1. Uniform model. In this section, we compare the distribution of the number $\tilde v$ of balls drawn with a given color with that of the original number $v$ of balls with a given color.
We are in particular interested in giving a sense to the heuristic stating that $v$ and $\tilde v/p$ have distributions close to each other.

Proposition 6. Under the condition that the random variable $v$ has a Weibull or Pareto distribution, we have
\[
\lim_{j \to \infty} \lim_{K \to \infty} \frac{\mathbb{E}(W_j^+)}{K\, \mathbb{P}(v \ge j/p)} = 1 .
\]

Proof. From Corollary 1, we know that $\mathbb{E}(W_j)/K \to Q_j$ when $K \to \infty$. Since
\[
Q_j = \mathbb{E}\left( \frac{(pv)^j}{j!} e^{-pv} \right) = \sum_{\ell=1}^\infty \frac{(p\ell)^j}{j!} e^{-p\ell}\, \mathbb{P}(v = \ell),
\]
we can show that if $v$ has a Weibull or Pareto distribution, then $Q_j \sim \mathbb{P}(v = j/p)/p$ when $j \to \infty$. Indeed, the above sum can be rewritten as
\[
\frac{1}{j!} \sum_{\ell=1}^\infty e^{f_j(\ell)}\, \mathbb{P}(v = \ell),
\]
where $f_j(\ell) = -p\ell + j \log(p\ell)$, which attains its maximum at point $j/p$ with $f_j''(j/p) = -p^2/j$. If the random variable $v$ is Weibull or Pareto and $j/p$ is sufficiently large, then $\mathbb{P}(v = \ell)/\mathbb{P}(v = j/p) - 1 \to 0$ uniformly in $j$ for $\ell$ in the neighborhood of $j/p$. It follows that
\[
Q_j \sim \frac{1}{j!} \mathbb{P}(v = j/p)\, e^{f_j(j/p)} \sum_{\ell = -\infty}^{\infty} e^{-\frac{\ell^2 p^2}{2j}} .
\]
For $a > 0$ converging to 0,
\[
\sum_{\ell = -\infty}^{\infty} e^{-a\ell^2} = \sum_{\ell = -\infty}^{\infty} \int_0^{+\infty} \mathbf{1}_{\{u > a\ell^2\}} e^{-u}\, du \sim 2 \int_0^{+\infty} \sqrt{\frac{u}{a}}\, e^{-u}\, du = \sqrt{2} \int_0^{+\infty} \frac{u^2}{\sqrt{a}} e^{-u^2/2}\, du = \sqrt{\frac{\pi}{a}}
\]
and by Stirling's formula $j! \sim \sqrt{2\pi}\, j^{j + \frac{1}{2}} e^{-j}$ for large $j$, so that, taking $a = p^2/(2j)$ and using $e^{f_j(j/p)} = j^j e^{-j}$, we get $Q_j \sim \mathbb{P}(v = j/p)/p$. It is then easy to deduce that $\sum_{\ell \ge j} Q_\ell \sim \mathbb{P}(v \ge j/p)$ for large $j$. $\square$

The above proposition implies that $\mathbb{P}(\tilde v \ge j) \sim \mathbb{P}(v \ge j/p)$ when the number of colors is large. This means that the tail of the distribution of the random variable $v$ can be obtained by rescaling that of the number $\tilde v$ of sampled balls with a given color. When $v$ has a Pareto distribution, Equation (13) can still be used for large $j$ to estimate the shape parameter $a$.
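A plug-in version of Equation (13) replaces the means $\mathbb{E}(W_j^+)$ by the observed counts. The sketch below is a hedged illustration, not the authors' procedure: it simulates the uniform model with hypothetical parameter values, uses integer sizes obtained by rounding down a continuous Pareto variable, and the helper name `estimate_shape` is introduced here for convenience.

```python
import random
from collections import Counter

def estimate_shape(sampled, j):
    """Plug-in version of Equation (13): j * (1 - W_{j+1}^+ / W_j^+)."""
    w_j = sum(1 for x in sampled if x >= j)
    w_j1 = sum(1 for x in sampled if x >= j + 1)
    if w_j == 0:
        raise ValueError("no color was sampled at least j times")
    return j * (1.0 - w_j1 / w_j)

rng = random.Random(1)
a_true, b, p, K = 1.5, 5.0, 0.01, 100_000     # hypothetical values, p*b = 0.05

# Integer Pareto-like sizes by inverse transform (assumption: floored values).
sizes = [max(1, int(b * (1.0 - rng.random()) ** (-1.0 / a_true))) for _ in range(K)]

# Uniform model: round(p*V) draws with replacement, color i with prob. v_i/V.
n_draws = round(p * sum(sizes))
counts = Counter(rng.choices(range(K), weights=sizes, k=n_draws))
sampled = [counts[i] for i in range(K)]

a_hat = estimate_shape(sampled, j=3)
print(f"true shape a = {a_true}, estimate from W_3^+ and W_4^+: {a_hat:.2f}")
```

The choice of $j$ is a trade-off: a small $j$ keeps many colors in the counts but leaves the $O((pb)^j/(1-pb))$ term of (13) visible, while a large $j$ removes this bias at the price of few observations.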
The probability $1 - \mathbb{E}(e^{-pv})$ of sampling a color and the scale parameter $b$ can also be estimated from the tail by using the expression of that probability as a function of $b$ and $a$ as in Equation (14). The same method applies for Weibull distributions.

5.2. Probabilistic model. From now on, we consider the probabilistic model and we establish stronger results on the distance between $\mathbb{P}(\tilde v \ge j)$ and $\mathbb{P}(v \ge j/p)$, where $\tilde v$ is the number of balls with a given color at the end of a trial. For this sampling mode, it was not possible to prove a result similar to Corollary 1, but the Berry-Esseen theorem [6] can be used to establish a stronger result for the comparison between $\tilde v$ and $v$. In [5], it is specifically proved that if we define the function
\[
h_j(x) = \frac{x^2}{4p^2} \left( \sqrt{1 + 4jp/x^2} - 1 \right)^2
\]
for $x \in \mathbb{R}$ and $j > 0$, then
\[
\left| \mathbb{P}(\tilde v \ge j) - \mathbb{P}\left( v \ge h_j\left( \sqrt{p(1-p)}\, G \right) \vee k \right) \right| \le c\, \mathbb{E}\left( \frac{1}{\sqrt{v}} \mathbf{1}_{\{v \ge j\}} \right),
\]
where $G$ is a standard Gaussian random variable, $a \vee b = \max(a, b)$ for real numbers $a$ and $b$, and $c = 3(p^2 + (1-p)^2)/\sqrt{p(1-p)}$. For small $p$, the constant $c \sim 3/\sqrt{p}$. The above bound is very loose for small $p$ and becomes accurate only for very large values of $j$. This is why we go further in this paper by establishing a tighter bound for the ratio $\mathbb{P}(\tilde v \ge j)/\mathbb{P}(v \ge j/p)$.

Let $(B_n)$ be some sequence of i.i.d. Bernoulli random variables with parameter $p$ and $v$ some independent random variable on $\mathbb{N}$. Take some $\alpha \in (1/2, 1)$. Let $\tilde v = \sum_{\ell=1}^{v} B_\ell$.

Theorem 3. For $\alpha \in (1/2, 1)$, we have for all $j \ge 1$
\[
\frac{\mathbb{P}(\tilde v \ge j)}{\mathbb{P}(v \ge j/p)} = A(j) + B(j),
\]
where
\[
A_1(j) \le A(j) \le A_2(j)
\]
with
\[
A_1(j) = \left[ 1 - \exp\left( - \frac{p}{2\left( 1 + (j/p)^{\alpha - 1} \right)} \left( \frac{j}{p} \right)^{2\alpha - 1} \right) \right] \frac{\mathbb{P}(v \ge j/p + \lfloor (j/p)^\alpha \rfloor + 1)}{\mathbb{P}(v \ge j/p)},
\]
\[
A_2(j) = \frac{\mathbb{P}(v \ge j/p - \lfloor (j/p)^\alpha \rfloor)}{\mathbb{P}(v \ge j/p)},
\]
and where $B(j)$ is a positive quantity such that
\[
B(j) \le e^{- \frac{p}{2(1-p)} \left( \frac{j}{p} \right)^{2\alpha - 1}} \frac{\mathbb{P}(v \ge j)}{\mathbb{P}(v \ge j/p)} .
\]

Proof. We have
\[
\mathbb{P}(\tilde v \ge j) = \mathbb{P}\left( \sum_{\ell=1}^{v} B_\ell \ge j \right) = T_1 + T_2,
\]
where
\[
T_1 = \mathbb{P}\left( \sum_{\ell=1}^{v} B_\ell \ge j,\; j \le v \le j/p - \lfloor (j/p)^\alpha \rfloor - 1 \right), \qquad
T_2 = \mathbb{P}\left( \sum_{\ell=1}^{v} B_\ell \ge j,\; j/p - \lfloor (j/p)^\alpha \rfloor \le v \right).
\]
Let us first recall the following inequality for the sum of independent Bernoulli random variables $B_\ell$, $\ell \ge 1$ [9]: for $x \in [0, 1-p]$,
\[
\mathbb{P}\left( \sum_{\ell=1}^{n} B_\ell - np \ge nx \right) \le e^{- \frac{n x^2}{A(x)}}, \tag{23}
\]
where
\[
A(x) = 2p(1-p) + \frac{2}{3} x (1 - 2p) - \frac{2}{9} x^2 . \tag{24}
\]
It follows that for $j \le v \le j/p$
\[
\mathbb{P}\left( \sum_{\ell=1}^{v} B_\ell \ge j \right) \le e^{- \frac{(j - pv)^2}{v A\left( \frac{j}{v} - p \right)}} .
\]
It is easily checked that the function $v \mapsto v A\left( \frac{j}{v} - p \right)$ is increasing in the interval $[j, j/p]$ and that for all $v \in [j, j/p]$
\[
v A\left( \frac{j}{v} - p \right) \le 2j(1-p) .
\]
Hence, for $v \in [j, j/p]$,
\[
\mathbb{P}\left( \sum_{\ell=1}^{v} B_\ell \ge j \right) \le e^{- \frac{(j - pv)^2}{2j(1-p)}}
\]
and for $v \in [j, j/p - \lfloor (j/p)^\alpha \rfloor - 1]$, since then $j - pv \ge p (j/p)^\alpha$,
\[
\mathbb{P}\left( \sum_{\ell=1}^{v} B_\ell \ge j \right) \le e^{- \frac{p}{2(1-p)} \left( \frac{j}{p} \right)^{2\alpha - 1}} .
\]
This implies that
\[
T_1 \le \mathbb{P}\left( \sum_{\ell=1}^{j/p - \lfloor (j/p)^\alpha \rfloor - 1} B_\ell \ge j \right) \mathbb{P}(v \ge j) \le e^{- \frac{p}{2(1-p)} \left( \frac{j}{p} \right)^{2\alpha - 1}}\, \mathbb{P}(v \ge j) .
\]
For the term $T_2$, we first note that
\[
T_2 \le \mathbb{P}(v \ge j/p - \lfloor (j/p)^\alpha \rfloor) .
\]
Then, we clearly have
\[
T_2 \ge \mathbb{P}\left( \sum_{\ell=1}^{v} B_\ell \ge j,\; j/p + \lfloor (j/p)^\alpha \rfloor + 1 \le v \right)
\]
and then
\[
\frac{T_2}{\mathbb{P}(v \ge j/p)} \ge \mathbb{P}\left( \sum_{\ell=1}^{j/p + \lfloor (j/p)^\alpha \rfloor + 1} B_\ell > j \right) \frac{\mathbb{P}(v \ge j/p + \lfloor (j/p)^\alpha \rfloor + 1)}{\mathbb{P}(v \ge j/p)} .
\]
The Chernoff bound implies, for $v = j/p + \lfloor (j/p)^\alpha \rfloor + 1$,
\[
\mathbb{P}\left(\sum_{\ell=1}^{v} B_\ell \le j\right) \le \exp\left(-\frac{(pv - j)^2}{2pv}\right) \le \exp\left(-\frac{p}{2\left(1 + (j/p)^{\alpha-1}\right)} \left(\frac{j}{p}\right)^{2\alpha-1}\right).
\]
It follows that
\[
\frac{T_2}{\mathbb{P}(v \ge j/p)} \ge \left(1 - \exp\left(-\frac{p}{2\left(1 + (j/p)^{\alpha-1}\right)} \left(\frac{j}{p}\right)^{2\alpha-1}\right)\right) \frac{\mathbb{P}(v \ge j/p + \lfloor (j/p)^\alpha \rfloor + 1)}{\mathbb{P}(v \ge j/p)},
\]
and the proof follows. □

The above result can be applied to specific distributions for $v$, namely Pareto and Weibull distributions, in order to show that the tails of the probability distribution functions of $\tilde v$ and $pv$ are the same. This is the analog of Proposition 6 for the probabilistic model.

Corollary 2. If $v$ has either
(1) a Pareto tail distribution with parameter $a > 1$ such that for $x \ge 0$, $\mathbb{P}(v \ge x) = L(x)\, x^{-a}$, where $L$ is a slowly varying function, i.e., for each $t > 0$, $\lim_{x \to +\infty} L(tx)/L(x) = 1$; or
(2) a Weibull tail distribution with $\beta \in\, ]0, 1/2[$ such that for $x \ge 0$, $\mathbb{P}(v \ge x) = L(x)\, e^{-\delta x^\beta}$ for some $\delta > 0$ and $L$ a slowly varying function,
then
\[
\lim_{j \to +\infty} \left| \frac{\mathbb{P}(\tilde v \ge j)}{\mathbb{P}(v \ge j/p)} - 1 \right| = 0.
\]

Proof. For (1),
\[
\frac{\mathbb{P}(v \ge j)}{\mathbb{P}(v \ge j/p)} = \frac{L(j)}{L(j/p)}\, \frac{j^{-a}}{(j/p)^{-a}} = \frac{L(j)}{L(j/p)}\, p^{-a} \xrightarrow[j \to +\infty]{} p^{-a}
\]
and
\[
\frac{\mathbb{P}(v \ge j/p + \epsilon (j/p)^\alpha)}{\mathbb{P}(v \ge j/p)} = \frac{L\left((j/p)\left(1 + \epsilon (j/p)^{\alpha-1}\right)\right)}{L(j/p)} \left(1 + \epsilon (j/p)^{\alpha-1}\right)^{-a},
\]
which tends to 1 when $j$ tends to $+\infty$. This implies that the quantities $A_1(j)$ and $A_2(j)$ appearing in Theorem 3 tend to 1 and $B(j)$ tends to 0 when $j \to \infty$.
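As a numerical illustration of case (1), the following sketch evaluates $A_1(j)$, $A_2(j)$ and the upper bound on $B(j)$ from Theorem 3. It is only an illustration under stated assumptions: the tail is taken to be exactly Pareto, $\mathbb{P}(v \ge x) = x^{-a}$ for $x \ge 1$ (that is, $L \equiv 1$), and the function names and the parameter values $a = 1.5$, $p = 0.1$, $\alpha = 0.75$ are hypothetical choices that do not come from the paper.

```python
import math


def pareto_tail(x, a):
    """P(v >= x) for an exact Pareto tail x**(-a) on [1, inf) (assumption: L = 1)."""
    return 1.0 if x <= 1.0 else x ** (-a)


def theorem3_bounds(j, p, alpha, a):
    """Return (A1, A2, upper bound on B) of Theorem 3 for the exact Pareto tail."""
    n = j / p                        # reference level j/p
    m = math.floor(n ** alpha)       # window half-width floor((j/p)**alpha)
    base = pareto_tail(n, a)         # P(v >= j/p)
    # Chernoff term appearing in the lower bound A1(j)
    chernoff = math.exp(
        -p / (2.0 * (1.0 + n ** (alpha - 1.0))) * n ** (2.0 * alpha - 1.0)
    )
    a1 = (1.0 - chernoff) * pareto_tail(n + m + 1, a) / base
    a2 = pareto_tail(n - m, a) / base
    # Exponential term appearing in the bound on B(j)
    b_upper = (
        math.exp(-p / (2.0 * (1.0 - p)) * n ** (2.0 * alpha - 1.0))
        * pareto_tail(j, a) / base
    )
    return a1, a2, b_upper


if __name__ == "__main__":
    for j in (10, 10**3, 10**5, 10**7):
        a1, a2, b = theorem3_bounds(j, p=0.1, alpha=0.75, a=1.5)
        print(f"j={j:>8}  A1={a1:.4f}  A2={a2:.4f}  B<={b:.3e}")
```

With these illustrative parameters the bounds are loose for small $j$ and tighten only slowly, which is consistent with the asymptotic nature of the corollary.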
For (2),
\[
\frac{\mathbb{P}(v \ge j)}{\mathbb{P}(v \ge j/p)} = \frac{L(j)}{L(j/p)}\, e^{-\delta j^\beta \left(1 - p^{-\beta}\right)} \xrightarrow[j \to +\infty]{} 0
\]
and it is straightforward that
\[
\frac{\mathbb{P}(v \ge j/p + \epsilon (j/p)^\alpha)}{\mathbb{P}(v \ge j/p)}
= \frac{L\left((j/p)\left(1 + \epsilon (j/p)^{\alpha-1}\right)\right)}{L(j/p)}\, e^{-\delta \left(j/p + \epsilon (j/p)^\alpha\right)^\beta + \delta (j/p)^\beta}
= \frac{L\left((j/p)\left(1 + \epsilon (j/p)^{\alpha-1}\right)\right)}{L(j/p)}\, e^{-\delta \beta \epsilon (j/p)^{\alpha+\beta-1} (1 + o(1))},
\]
which tends to 1 if $\alpha + \beta < 1$. Let $\beta \in\, ]0, 1[$. It is sufficient to find $\alpha \in\, ]1/2, 1[$ such that $\alpha + \beta < 1$. Necessarily $1 - \beta > \alpha > 1/2$, thus $\beta < 1/2$, and for such a $\beta$, such an $\alpha$ exists. □

6. Concluding remarks on sampling and parameter inference

We have established in this paper convergence results for the distribution of the number of balls with a given color under the assumptions that there is a large number of colors in the urn, that the number of balls with a given color has a heavy tailed distribution independent of the color, and that only a small fraction $p$ of the total number of balls is sampled.

We have considered two ball sampling rules. The first one states that the probability of drawing a ball with a given color depends upon the relative contribution of the color to the total number of balls and that a drawn ball is immediately replaced into the urn. With the second rule, each ball is selected with probability $p$ independently of the others. The two rules do not give the same results, even if they coincide when $p \to 0$ (see [5] for details).

From a practical point of view, we have shown that it is possible to identify the original distribution of the number of balls with a given color by using the tail of the distribution of the number of balls with a given color drawn from the urn. A stronger result holds for Pareto distributions when the number of colors is very large (see Proposition 3). This result is robust in practice because it does not rely on the asymptotics of the tail distribution (in Proposition 3, assertions hold for all $j > a$).
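The second sampling rule lends itself to a quick simulation check. The sketch below is not part of the original analysis: it assumes an exact Pareto tail $\mathbb{P}(v \ge k) = k^{-a}$ for integers $k \ge 1$, a sampling probability of the form $p = 1/\texttt{inv\_p}$ with integer `inv_p`, and hypothetical names (`sampled_tail_ratio`, `n_colors`, `seed`). It estimates $\mathbb{P}(\tilde v \ge j)$ by drawing each ball independently with probability $p$ and divides the estimate by the exact value of $\mathbb{P}(v \ge j/p)$.

```python
import random


def sampled_tail_ratio(j, inv_p, a, n_colors, seed=0):
    """Monte Carlo estimate of P(v~ >= j) / P(v >= j/p), with p = 1/inv_p.

    Assumption: v = floor(X) with X classical Pareto on [1, inf), so that
    P(v >= k) = k**(-a) for every integer k >= 1.
    """
    rng = random.Random(seed)
    p = 1.0 / inv_p
    hits = 0
    for _color in range(n_colors):
        v = int(rng.paretovariate(a))      # number of balls of this color
        if v < j:
            continue                       # fewer than j balls cannot give j sampled balls
        # Probabilistic sampling: each ball is kept independently with probability p.
        v_tilde = sum(1 for _ball in range(v) if rng.random() < p)
        if v_tilde >= j:
            hits += 1
    empirical = hits / n_colors            # estimate of P(v~ >= j)
    exact = float(j * inv_p) ** (-a)       # P(v >= j/p), since j/p is an integer
    return empirical / exact


if __name__ == "__main__":
    for j in (5, 10, 20):
        ratio = sampled_tail_ratio(j, inv_p=10, a=1.5, n_colors=400_000, seed=1)
        print(f"j={j:>3}  estimated ratio = {ratio:.3f}")
```

The estimate becomes noisy for large $j$ because few colors reach the threshold, so the number of simulated colors has to grow with $j$ for the ratio to be meaningful.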
The determination of the original number of balls per color is valid when the number of balls follows a unique distribution of Pareto or Weibull type. This could be used in the context of packet sampling in the Internet. In practice, however, the number of packets in flows is in general not described by a unique "nice" distribution, but can only be locally approximated by a series of Pareto distributions (see [2] for a discussion). More sophisticated techniques are then necessary to get the original statistics of flows.

References

[1] M. Abramowitz and I. Stegun, Handbook of mathematical functions, National Bureau of Standards, Applied Mathematics Series 55, 1972.
[2] N. Antunes, Y. Chabchoub, C. Fricker, F. Guillemin, and P. Robert, On the estimation of flow statistics via packet sampling in the Internet, Submitted for publication.
[3] S. Asmussen, C. Klüppelberg, and K. Sigman, Sampling at subexponential times, with queueing application, Stochastic Process. Appl. 79 (1999), 265–286.
[4] A. D. Barbour, Lars Holst, and Svante Janson, Poisson approximation, The Clarendon Press Oxford University Press, New York, 1992, Oxford Science Publications.
[5] Yousra Chabchoub, Christine Fricker, Fabrice Guillemin, and Philippe Robert, Deterministic versus probabilistic packet sampling in the Internet, Proceedings of ITC'20, June 2007.
[6] W. Feller, An introduction to probability theory, Theory and application, vol. 2, Wiley, 1966.
[7] S. Foss and D. Korshunov, Sampling at a random time with a heavy-tailed distribution, Markov Process. Related Fields 6 (2000), no. 4, 543–568.
[8] Philippe Robert, Réseaux et files d'attente: méthodes probabilistes, Mathématiques et Applications, vol. 35, Springer-Verlag, Berlin, October 2000.
[9] A. Siegel, Toward a usable theory of Chernoff bounds for heterogeneous and partially dependent random variables, Paper available at http://cs.nyu.edu/faculty/siegel/HHf.pdf.

(C. Fricker) INRIA-Rocquencourt, RAP project, Domaine de Voluceau, 78153 Le Chesnay, France
E-mail address: Christine.Fricker@inria.fr
URL: http://www-c.inria.fr/twiki/bin/view/RAP/ChristineFricker

(F. Guillemin) Orange Labs, F-22300 Lannion
E-mail address: Fabrice.Guillemin@orange-ftgroup.com

(Ph. Robert) INRIA-Rocquencourt, RAP project, Domaine de Voluceau, 78153 Le Chesnay, France
E-mail address: Philippe.Robert@inria.fr
URL: http://www-rocq.inria.fr/~robert
