An identification problem in an urn and ball model with heavy tailed distributions

We consider in this paper an urn and ball problem with replacement, where balls are with different colors and are drawn uniformly from a unique urn. The numbers of balls with a given color are i.i.d. random variables with a heavy tailed probability d…

Authors: Christine Fricker (INRIA Rocquencourt), Fabrice Guillemin, Philippe Robert (INRIA Rocquencourt)

Abstract. We consider in this paper an urn and ball problem with replacement, where balls have different colors and are drawn uniformly from a single urn. The numbers of balls with a given color are i.i.d. random variables with a heavy tailed probability distribution, for instance a Pareto or a Weibull distribution. We draw a small fraction $p \ll 1$ of the total number of balls. The basic problem addressed in this paper is to know to which extent we can infer the total number of colors and the distribution of the number of balls with a given color. By means of Le Cam's inequality and the Chen-Stein method, bounds for the total variation norm between the distribution of the number of balls drawn with a given color and the Poisson distribution with the same mean are obtained. We then show that the distribution of the number of balls drawn with a given color has the same tail as that of the original number of balls. We finally establish explicit bounds between the two distributions when each ball is drawn with fixed probability $p$.

1. Introduction

We consider in this paper the following urn and ball scheme with replacement: An urn contains a random number of balls with different colors. We draw a small fraction $p \ll 1$ of the total number of balls. A ball which has been drawn is replaced into the urn. The problem considered in this paper consists of estimating the number of colors together with the distribution of the number of balls with a given color by using information from sampled balls. This problem is motivated by the analysis of packet sampling in the Internet (see Chabchoub et al. [5] for details).

To address the above problem, we analyze the non-normalized distribution of the number of balls drawn with a given color.
More specifically, let $W_j$ (respectively, $W_j^+$) denote the number of colors with a number of sampled balls equal to (respectively, equal to or greater than) $j$. Denoting by $\tilde K$ the number of colors seen when drawing balls, the quantities $W_j/\tilde K$ and $W_j^+/\tilde K$ are equal to the proportions of colors which, at the end of the trial, comprise exactly or at least $j$ balls, respectively.

The numbers of balls with various colors are assumed to be i.i.d. random variables and the number $K$ of colors is large. In addition, the distribution of the number of balls with a given color has a heavy tailed probability distribution of Pareto or Weibull type. Finally, balls are drawn uniformly. This means that for each $i = 1, \ldots, K$, if there are $v_i$ balls with color $i$, the probability of drawing a ball with this color is $v_i/V$, where $V = v_1 + \cdots + v_K$ is the total number of balls in the urn.

Key words and phrases. Chen-Stein method, Pareto distribution, Weibull distribution.

The above model is referred to as the "uniform model". It will be compared against the case when balls are drawn independently of each other with probability $p$. This model will be referred to as the probabilistic model. We show that the results obtained in both models are close to each other when $p$ is very small. But there are some subtle differences between the two models, notably with regard to the achievable accuracy in the inference of original statistics. It turns out that the probabilistic model is simpler to analyze than the uniform model but yields less accurate results. This is because, in the probabilistic model, we cannot exploit the fact that the number of colors is very large.
One of the main results of the paper concerns the analysis of the validity of the following simple scaling rule: The distribution of the original number $v_i$ of balls with color $i$ could be estimated by that of the random variable $\tilde v_i/p$, where $\tilde v_i$ is the number of sampled balls with color $i$. When each ball is drawn with a fixed probability, it is known that this rule is valid for the tails of the distributions as soon as they are heavy tailed. See Asmussen et al. [3] and Foss and Korshunov [7], where this asymptotic equivalence is proved in a quite general framework. Our main goal here is to get, for $j \ge 2$, an explicit bound on the quantity
\[
\left| \frac{\mathbb{P}(\tilde v \ge j)}{\mathbb{P}(v \ge j/p)} - 1 \right| .
\]
In the context of packet sampling in the Internet, explicit expressions are especially important for the estimation of the sizes of flows in Internet traffic. In this setting the variable $j$ is taken to be large, but it cannot be too large, so that the event $\{\tilde v = j\}$ occurs sufficiently often to obtain reliable statistics. Hence, the dependence on $j$ should be made explicit. See Chabchoub et al. [5] for a discussion.

The organization of this paper is as follows: The notation and the basic results used in this paper (Le Cam's inequality and the Chen-Stein method) are presented in Section 2. The mean values of the random variables $W_j$ and $W_j^+$ are computed in Section 3. The approximation of the distribution of $W_j^+$ by a Poisson distribution and the validity of the scaling rule are investigated in Section 4. We compare in Section 5 the original distribution of the number of balls with a given color against the rescaled distribution of the number of drawn balls with the same color. Some concluding remarks with regard to sampling are presented in Section 6.

2. Notation and basic results

2.1. Definitions and assumptions.
We consider an urn containing $v_i$ balls with color $i$ for $i = 1, \ldots, K$. The quantities $v_i$ are independent random variables with a common heavy tailed distribution. In the following we shall consider two families of heavy tailed distributions for the number $v$ of balls with a given color:

Pareto distributions: The distribution of $v$ is given by
\[
\mathbb{P}(v > x) = (b/x)^a, \quad x \ge b, \tag{1}
\]
with the shape parameter $a > 1$ and the location parameter $b > 0$. The mean of $v$ is $ab/(a-1)$.

Weibull distributions: The distribution of $v$ is given by
\[
\mathbb{P}(v > x) = \exp\left(-(x/\eta)^\beta\right), \quad x \ge 0, \tag{2}
\]
with the skew parameter $\beta \in (0,1)$ and the scale parameter $\eta > 0$. The mean of $v$ is $\frac{\eta}{\beta}\Gamma(1/\beta)$, where $\Gamma$ is the classical Euler Gamma function.

The total number of balls in the urn is $V = \sum_{i=1}^K v_i$. We draw only a fraction $p$ of this total number of balls. Each ball is drawn at random: a ball with color $i$ is drawn with probability $v_i/V$. After drawing the $pV$ balls, we have $\tilde v_i$ balls with color $i$. Of course, only those colors with $\tilde v_i > 0$ can be seen. The quantity
\[
\tilde K = \sum_{i=1}^K \mathbf{1}_{\{\tilde v_i > 0\}}
\]
is the number of colors seen at the end of a trial.

In the following, we shall be interested in the asymptotic regime when the number of colors $K \to \infty$ while the fraction $p \to 0$. Note that by the law of large numbers, $V \to \infty$ a.s. (the total number of balls in the urn is very large).

The random variables we consider in this paper to infer the original statistics of the number of balls and colors are the variables $W_j$ and $W_j^+$, $j \ge 0$, defined as follows.

Definition 1 (Definition of $W_j$). The random variable $W_j$ is the number of colors with $j$ balls at the end of a trial and is given, for $j \ge 0$, by
\[
W_j = \mathbf{1}_{\{\tilde v_1 = j\}} + \mathbf{1}_{\{\tilde v_2 = j\}} + \cdots + \mathbf{1}_{\{\tilde v_K = j\}},
\]
where $\tilde v_i \ge 0$ is the number of balls drawn with color $i$ (which can be equal to 0).
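The sampling scheme and the counts $W_j$ of Definition 1 can be sketched as a small simulation. This is only an illustration under stated assumptions, not the authors' procedure: the function names, the parameter values, and the rounding of Pareto draws up to whole numbers of balls are all choices made here.

```python
import math
import random
from collections import Counter

def pareto_ball_counts(K, a, b, rng):
    """Draw K i.i.d. ball counts with P(v > x) = (b/x)^a (Equation (1))."""
    # Inverse-transform sampling; rounding up to an integer is an assumption
    # made here so that every color owns a whole number of balls.
    return [math.ceil(b * (1.0 - rng.random()) ** (-1.0 / a)) for _ in range(K)]

def sample_uniform_model(v, p, rng):
    """Draw round(p*V) balls with replacement; color i is hit with prob. v_i/V."""
    n_draws = round(p * sum(v))
    drawn = rng.choices(range(len(v)), weights=v, k=n_draws)
    hits = Counter(drawn)
    # v_tilde[i] is the number of sampled balls with color i (possibly 0).
    return [hits.get(i, 0) for i in range(len(v))]

def color_counts(v_tilde):
    """Return a Counter mapping j to W_j, the number of colors with j sampled balls."""
    return Counter(v_tilde)

rng = random.Random(0)                      # fixed seed, for reproducibility only
v = pareto_ball_counts(K=5000, a=1.5, b=3.0, rng=rng)
v_tilde = sample_uniform_model(v, p=0.05, rng=rng)
W = color_counts(v_tilde)
K_seen = len(v) - W[0]                      # number of colors actually seen
```

Here `K_seen` plays the role of $\tilde K$: it counts the colors with at least one sampled ball, which is all an observer of the sample has access to.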
Definition 2 (Definition of $W_j^+$). The random variable $W_j^+$ is the number of colors with at least $j$ balls at the end of a trial. The random variables $W_j^+$ are formally defined, for $j \ge 0$, by
\[
W_j^+ = \mathbf{1}_{\{\tilde v_1 \ge j\}} + \mathbf{1}_{\{\tilde v_2 \ge j\}} + \cdots + \mathbf{1}_{\{\tilde v_K \ge j\}} .
\]
Note that we have, for all $j \ge 0$,
\[
W_j^+ = \sum_{\ell \ge j} W_\ell .
\]
The averages of the random variables $W_j$ are in fact the key quantities we shall use in the following to infer the original numbers of balls per color.

2.2. Le Cam's inequality and Chen-Stein method.

Le Cam's inequality gives the distance in total variation between the distribution of a sum of independent and identically distributed (i.i.d.) Bernoulli random variables and the Poisson distribution with the same mean (see Barbour et al. [4]). Note that if $V$ and $W$ are two random variables taking integer values, the distance in total variation between their distributions is defined by
\[
\| \mathbb{P}(W \in \cdot) - \mathbb{P}(V \in \cdot) \|_{tv} \stackrel{\text{def.}}{=} \sup_{A \subset \mathbb{N}} | \mathbb{P}(W \in A) - \mathbb{P}(V \in A) | = \frac{1}{2} \sum_{n \ge 0} | \mathbb{P}(W = n) - \mathbb{P}(V = n) | .
\]

Theorem 1 (Le Cam's Inequality). If the random variable $W = \sum_i I_i$, where the random variables $I_i$ are i.i.d. Bernoulli random variables, then
\[
\| \mathbb{P}(W \in \cdot) - \mathbb{P}(Q_{\mathbb{E}(W)} \in \cdot) \|_{tv} \le \sum_i \mathbb{P}(I_i = 1)^2 , \tag{3}
\]
where, for $\lambda > 0$, $Q_\lambda$ is a Poisson random variable with mean $\lambda$, that is, for all $n \ge 0$,
\[
\mathbb{P}(Q_\lambda = n) = \frac{\lambda^n}{n!} e^{-\lambda} .
\]

When the random variables $I_i$ appearing in the above theorem are not independent but satisfy a specific condition, referred to as monotonic coupling, it is still possible to obtain a bound on the distance between the distribution of the sum $W = \sum_i I_i$ and the Poisson distribution with mean $\mathbb{E}(W)$.

Definition 3 (Monotonic Coupling). The variables $I_i$ are said to be negatively related when there exist some random variables $U_i$ and $V_i$ such that
(1) $U_i \stackrel{\text{dist.}}{=} W$ and $1 + V_i \stackrel{\text{dist.}}{=} (W \mid I_i = 1)$;
(2) $V_i \le U_i$.

The main result of the Chen-Stein method is given by the following theorem (see Barbour et al. [4]).

Theorem 2. If the monotonic coupling condition is satisfied, then the following inequality holds:
\[
\| \mathbb{P}(W \in \cdot) - \mathbb{P}(Q_{\mathbb{E}(W)} \in \cdot) \|_{tv} \le 1 - \frac{\mathrm{Var}(W)}{\mathbb{E}(W)} . \tag{4}
\]

When the monotonic coupling condition is satisfied, in order to prove the Poisson approximation, it is sufficient to show that the ratio of the variance to the mean value of $W$ is close to 1; this is a very weak condition to prove in practice.

It should be noted (see [8]) that Relation (4) can be used not only when $\mathbb{E}(W)$ takes bounded values, so that $W$ is approximately a Poisson random variable, but also when $\mathbb{E}(W)$ is large. In this case the Chen-Stein method yields a central limit theorem: If $N$ is a standard normal random variable,
\[
\left\| \mathbb{P}\left( \frac{W - \mathbb{E}(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot \right) - \mathbb{P}(N \in \cdot) \right\|_{tv}
\le \left\| \mathbb{P}\left( \frac{W - \mathbb{E}(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot \right) - \mathbb{P}\left( \frac{Q_{\mathbb{E}(W)} - \mathbb{E}(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot \right) \right\|_{tv}
+ \left\| \mathbb{P}\left( \frac{Q_{\mathbb{E}(W)} - \mathbb{E}(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot \right) - \mathbb{P}(N \in \cdot) \right\|_{tv} ,
\]
where $\mathrm{Var}(W)$ is the variance of the random variable $W$. By using Relation (4), we have
\[
\left\| \mathbb{P}\left( \frac{W - \mathbb{E}(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot \right) - \mathbb{P}(N \in \cdot) \right\|_{tv}
\le 1 - \frac{\mathrm{Var}(W)}{\mathbb{E}(W)}
+ \left\| \mathbb{P}\left( \frac{Q_{\mathbb{E}(W)} - \mathbb{E}(W)}{\sqrt{\mathrm{Var}(W)}} \in \cdot \right) - \mathbb{P}(N \in \cdot) \right\|_{tv} .
\]
If the ratio $\mathbb{E}(W)/\mathrm{Var}(W)$ is close to 1, then the first term in the right hand side of the above relation is negligible. In addition, the classical central limit theorem for Poisson distributions implies that when $\mathbb{E}(W)$ is large, the second term is negligible too. Therefore, we have
\[
W \sim \mathbb{E}(W) + \sqrt{\mathrm{Var}(W)}\, N
\]
with a bound on the error.

3. Computation of mean values

3.1. Bounds for mean values.
By using Le Cam's inequality, we can establish the following result for the mean value of the random variables $W_j$.

Proposition 1 (Mean Value of $W_j$). If there are $V$ balls and $K$ colors in the urn, for $j \ge 0$, the mean number $\mathbb{E}(W_j)$ of colors with $j$ balls at the end of a trial satisfies the relation
\[
\left| \frac{\mathbb{E}(W_j)}{K} - Q_j \right| \le \mathbb{E}\left( \min(pv, 1) \frac{v}{V} \right), \tag{5}
\]
where $Q$ is the probability distribution defined for $j \ge 0$ by
\[
Q_j = \mathbb{E}\left( \frac{(pv)^j}{j!} e^{-pv} \right),
\]
$p$ is the sampling rate, and $v$ is distributed as the number of balls with a given color.

Proof. We have
\[
\tilde v_i = B^i_1 + B^i_2 + \cdots + B^i_{pV},
\]
where $B^i_\ell$ is equal to one if the $\ell$th ball drawn from the urn has color $i$, an event which occurs with probability $v_i/V$, the quantity $V$ being the total number of balls in the urn. Conditionally on the values of the set $\mathcal{F} = \{v_1, \ldots, v_K\}$, the variables $(B^i_\ell, \ell \ge 1)$ are independent Bernoulli variables. For $1 \le i \le K$, Le Cam's Inequality (3) therefore gives the relation
\[
\| \mathbb{P}(\tilde v_i \in \cdot \mid \mathcal{F}) - \mathbb{P}(Q_{pv_i} \in \cdot) \|_{tv} \le p \frac{v_i^2}{V},
\]
and Relation (4), which can also be used in this case, yields
\[
\| \mathbb{P}(\tilde v_i \in \cdot \mid \mathcal{F}) - \mathbb{P}(Q_{pv_i} \in \cdot) \|_{tv} \le \frac{v_i}{V}.
\]
By integrating with respect to the variables $v_1, \ldots, v_K$, these two inequalities give the relation
\[
\| \mathbb{P}(\tilde v_i \in \cdot) - Q \|_{tv} \le \mathbb{E}\left( \min(pv, 1) \frac{v}{V} \right). \tag{6}
\]
Since $\mathbb{E}(W_j) = \sum_{i=1}^K \mathbb{P}(\tilde v_i = j)$, by summing over $i = 1, \ldots, K$, we obtain
\[
| \mathbb{E}(W_j) - K Q_j | \le K\, \mathbb{E}\left( \min(pv, 1) \frac{v}{V} \right),
\]
and the result follows. □

By using the fact that $\mathbb{E}(W_j^+) = \sum_{i=1}^K \mathbb{P}(\tilde v_i \ge j)$, we can deduce from Equation (6) the following result.

Proposition 2 (Mean Value of $W_j^+$).
If there are $V$ balls and $K$ colors in the urn, the mean number $\mathbb{E}(W_j^+)$ of colors with at least $j \ge 0$ balls at the end of an arbitrary trial satisfies the relation
\[
\left| \frac{\mathbb{E}(W_j^+)}{K} - \sum_{\ell \ge j} Q_\ell \right| \le \mathbb{E}\left( \min(pv, 1) \frac{v}{V} \right), \tag{7}
\]
where the probability distribution $Q$ is defined in Proposition 1.

We immediately deduce from Propositions 1 and 2 the following corollary by using the fact that $V \ge K$.

Corollary 1 (Asymptotic Mean Values). The relations
\[
\lim_{K \to \infty} \frac{1}{K} \mathbb{E}(W_j) = Q_j \quad \text{and} \quad \lim_{K \to \infty} \frac{1}{K} \mathbb{E}(W_j^+) = \sum_{\ell \ge j} Q_\ell
\]
hold.

Note that if balls are drawn with probability $p$ independently of each other (probabilistic model), we have $\tilde v_i = \sum_{\ell=1}^{v_i} \tilde B^i_\ell$, where the random variables $\tilde B^i_\ell$ are Bernoulli with mean $p$. By adapting the above proofs, we find
\[
\left| \frac{\mathbb{E}(W_j)}{K} - Q_j \right| \le p. \tag{8}
\]

3.2. Asymptotic results for specific probability distributions.

3.2.1. Pareto distributions.

Let us first assume that the number of balls of a given color follows a Pareto distribution given by Equation (1). Then, we have the following result when the number of colors goes to infinity.

Proposition 3. If $v$ has a Pareto distribution as in Equation (1), then for all $j > a$, the relations
\[
\lim_{K \to +\infty} \frac{\mathbb{E}(W_{j+1})}{\mathbb{E}(W_j)} = 1 - \frac{a+1}{j+1} + O\left( (pb)^{j-a} \right), \tag{9}
\]
\[
\lim_{K \to +\infty} \frac{\mathbb{E}(W_j)}{K} = a (pb)^a \frac{\Gamma(j-a)}{j!} + O\left( (pb)^j \right), \tag{10}
\]
\[
\lim_{K \to +\infty} \frac{\mathbb{E}(W_j^+)}{K} = (pb)^a \frac{\Gamma(j-a)}{(j-1)!} + O\left( \frac{(pb)^j}{1 - pb} \right) \tag{11}
\]
hold.

Proof. For $j > a$,
\[
Q_j = \mathbb{E}\left( \frac{(pv)^j}{j!} e^{-pv} \right) = \frac{a b^a p^a}{j!} \int_{pb}^{+\infty} u^{j-a-1} e^{-u}\, du
= a (pb)^a \frac{\Gamma(j-a)}{j!} - \frac{a (pb)^j}{j!} \int_0^1 u^{j-a-1} e^{-pbu}\, du. \tag{12}
\]
Therefore, by using the relation $\Gamma(x+1) = x\Gamma(x)$, we get the equivalence
\[
\frac{Q_{j+1}}{Q_j} = \frac{j-a}{j+1} + O\left( (pb)^{j-a} \right),
\]
which gives Equations (9) and (10) by using Corollary 1. For the mean value of $W_j^+$, Equation (12) gives the relation
\[
\begin{aligned}
\lim_{K \to +\infty} \frac{\mathbb{E}(W_j^+)}{K}
&= a (pb)^a \sum_{n \ge j} \frac{\Gamma(n-a)}{n!} + O\left( \frac{(pb)^j}{1-pb} \right) \\
&= a (pb)^a \sum_{n \ge 0} \frac{\Gamma(n+j-a)\Gamma(n+1)}{\Gamma(j+n+1)} \frac{1^n}{n!} + O\left( \frac{(pb)^j}{1-pb} \right) \\
&= a (pb)^a \frac{\Gamma(j-a)}{j!} F(j-a, 1; j+1; 1) + O\left( \frac{(pb)^j}{1-pb} \right),
\end{aligned}
\]
where $F(a, b; c; z)$ is the hypergeometric function, which satisfies
\[
F(a, b; c; 1) = \frac{\Gamma(c)\Gamma(c-a-b)}{\Gamma(c-a)\Gamma(c-b)}
\]
(see Abramowitz and Stegun [1]). Here $F(j-a, 1; j+1; 1) = j/a$, and Equation (11) follows. □

The shape parameter $a$ can be estimated via Relation (11) by
\[
a = \lim_{K \to \infty} j \left( 1 - \frac{\mathbb{E}(W_{j+1}^+)}{\mathbb{E}(W_j^+)} \right) + O\left( \frac{(pb)^j}{1-pb} \right) \tag{13}
\]
for all $j > a$. When observing drawn balls, we have in fact access only to the mean number $\mathbb{E}(\tilde K)$ of sampled colors, not to $K$ itself. While this has no impact on the estimation of $a$, since $K$ cancels in the ratio of Equation (13), this correcting term is important when estimating $b$ from Equation (11). It is straightforward that
\[
\tilde K = \sum_{i=1}^K \mathbf{1}_{\{\tilde v_i > 0\}} = K - W_0
\]
and then, when $K \to \infty$,
\[
\mathbb{E}(\tilde K) \sim K (1 - Q_0) = K \left( 1 - \mathbb{E}(e^{-pv}) \right).
\]
Since
\[
1 - \mathbb{E}(e^{-pv}) = p \int_0^\infty e^{-px}\, \mathbb{P}(v > x)\, dx = 1 - e^{-bp} + (bp)^a\, \Gamma(1-a, bp), \tag{14}
\]
where $\Gamma(s, x)$ is the incomplete Gamma function defined by $\Gamma(s, x) = \int_x^\infty t^{s-1} e^{-t}\, dt$, we can use the above equations together with Equation (11) in order to estimate $b$ and then $K$. It is also worth noting that $1 - \mathbb{E}(e^{-pv}) \sim p\, \mathbb{E}(v) = abp/(a-1)$ when $a > 1$ and $bp \to 0$.

3.2.2. Weibull distributions.

We assume in this section that the number of balls with a given color follows a Weibull distribution.
In this case, we have the following result, which follows from a simple change of variable and the expansion of $\exp(-x^\beta)$ in power series of $x^\beta$ or of $\exp(-px)$ in power series of $x$; the proof is omitted.

Proposition 4. If $v$ has a Weibull distribution with skew parameter $\beta$ and scale parameter $\eta$, then for $0 < \beta < 1$,
\[
\lim_{K \to +\infty} \frac{\mathbb{E}(W_j)}{K} = \frac{\beta}{j!} \sum_{n=0}^\infty \frac{(-1)^n}{(p\eta)^{(n+1)\beta}} \frac{\Gamma((n+1)\beta + j)}{n!} \tag{15}
\]
and for $\beta > 1$,
\[
\lim_{K \to +\infty} \frac{\mathbb{E}(W_j)}{K} = \frac{(p\eta)^j}{j!} \sum_{n=0}^\infty \frac{(-p\eta)^n}{n!}\, \Gamma\left( \frac{n+j}{\beta} + 1 \right). \tag{16}
\]

Note that the limit $Q_j$ of $\mathbb{E}(W_j)/K$ can be written in the form
\[
Q_j = \frac{1}{j!} \frac{\beta}{(p\eta)^\beta} \int_0^\infty u^{j+\beta-1} e^{-u + t u^\beta}\, du
\]
with $t = -1/(p\eta)^\beta$. The above integral is known in the literature to be of Faxén's type and can be expressed by means of the Meijer $G$-function when $\beta$ is a rational number; see Abramowitz and Stegun [1].

Contrary to the case of a Pareto distribution for the initial number of balls of a given color, there are no simple relations giving the parameters $\beta$ and $\eta$ from the mean values $\mathbb{E}(W_j)$, $j \ge 1$. In fact, we shall prove in the following that $\mathbb{P}(\tilde v \ge j)$ also has a Weibull tail. This eventually gives a means of identifying the parameters.

4. Poisson approximations

In the previous section, we have established bounds for the mean values of the random variables $W_j$ and $W_j^+$. To obtain more information on their distributions, we intend to use the Chen-Stein method. For a fixed environment (namely fixed values of the quantities $v_i$ for $i = 1, \ldots, K$), these random variables appear as sums of non-independent Bernoulli random variables. A preliminary analysis of the Bernoulli random variables appearing in the expression of $W_j$ suggests that it is not possible to invoke a monotonic coupling argument.
It is well known (see [4] for details) that the situation is more favorable with the random variables $W_j^+$, and we can specifically prove that if $\mathcal{F}$ is the set $\mathcal{F} = \{v_i, 1 \le i \le K\}$, then the total number $W_j^+$ of colors with at least $j$ balls at the end of the trial satisfies the relation
\[
\left\| \mathbb{P}(W_j^+ \in \cdot \mid \mathcal{F}) - \mathbb{P}(Q_{\mathbb{E}(W_j^+ \mid \mathcal{F})} \in \cdot) \right\|_{tv} \le 1 - \frac{\mathrm{Var}(W_j^+ \mid \mathcal{F})}{\mathbb{E}(W_j^+ \mid \mathcal{F})}. \tag{17}
\]
Indeed, given the random variables $v_i$, the model is equivalent to a standard urn and ball problem consisting of putting $pV$ balls into $K$ urns, a ball falling into urn $i$ with probability $p_i = v_i/V$. The number of balls in urn $i$ is the number of sampled balls with color $i$ in the original urn and ball problem. Even in the case when the quantities $p_i$ are different, the variables $I^+_{i,j} \stackrel{\text{def}}{=} \mathbf{1}_{\{\tilde v_i \ge j\}}$ are negatively related, so that Theorem 2 can be used. See Page 24 and Corollary 2.C.2, Page 26 of [4] for a definition and the main inequality in this domain. Chapter 6 of this reference is entirely devoted to related occupancy problems.

The rest of this section is devoted to the estimation of the bound in Equation (17). We first establish the following lemma.

Lemma 1. For a fixed environment $\mathcal{F} = \{v_i, 1 \le i \le K\}$, the distance in total variation between the distribution of $W_j^+$ and the Poisson distribution $Q_{\mathbb{E}(W_j^+ \mid \mathcal{F})}$ satisfies the inequality
\[
\lim_{K \to +\infty} \left\| \mathbb{P}(W_j^+ \in \cdot \mid \mathcal{F}) - \mathbb{P}(Q_{\mathbb{E}(W_j^+ \mid \mathcal{F})} \in \cdot) \right\|_{tv} \le \frac{m_{2,j}(p)}{m_j(p)} + \frac{p}{\mathbb{E}(v)} \frac{m_j'(p)^2}{m_j(p)}, \tag{18}
\]
where $m_j(p)$ and $m_{2,j}(p)$ are the first two moments of the random variable defined by
\[
X_j(p) = \sum_{\ell \ge j} \frac{(pv)^\ell}{\ell!} e^{-pv}, \tag{19}
\]
and the prime sign denotes the derivative with respect to $p$.

Proof.
For $\mathcal{F}$ fixed, the number $W_j$ of colors with $j \le pV$ balls at the end of the trial is such that
\[
\mathbb{E}(W_j \mid \mathcal{F}) = \sum_{i=1}^K \binom{pV}{j} \left( \frac{v_i}{V} \right)^j \left( 1 - \frac{v_i}{V} \right)^{pV-j}.
\]
By using the fact that
\[
\frac{1}{V} = \frac{1}{K \mathbb{E}(v)} + o\left( \frac{1}{K} \right) \quad \text{a.s.}
\]
for large $K$, straightforward calculations show that
\[
\begin{aligned}
\mathbb{E}(W_j \mid \mathcal{F})
&= \sum_{i=1}^K \frac{(pv_i)^j}{j!} e^{-pv_i} \left( 1 - \frac{j(j-1)}{2pK\mathbb{E}(v)} + \frac{2jv_i - pv_i^2}{2\mathbb{E}(v)K} \right) + o\left( \frac{1}{K} \right) \\
&= \sum_{i=1}^K \left( \frac{(pv_i)^j}{j!} e^{-pv_i} - \frac{p}{2\mathbb{E}(v)K} \frac{d^2}{dp^2} \left[ e^{-pv_i} \frac{(pv_i)^j}{j!} \right] \right) + o\left( \frac{1}{K} \right).
\end{aligned} \tag{20}
\]
By summing up the terms above and by checking that the $o(1/K)$ term remains valid, since the sum can be written as $\sum_{i=1}^K f(v_i) e^{-pv_i}/K^2$, where $f$ is a polynomial, we have for $j \ge 1$ and $0 < p < 1$
\[
\mathbb{E}(W_j^+ \mid \mathcal{F}) = \sum_{\ell \ge j} \mathbb{E}(W_\ell \mid \mathcal{F}) = \sum_{i=1}^K X_{i,j}(p) - \frac{p}{2\mathbb{E}(v)K} \sum_{i=1}^K X''_{i,j}(p) + o\left( \frac{1}{K} \right),
\]
where
\[
X_{i,j}(x) = \sum_{\ell \ge j} \frac{(xv_i)^\ell}{\ell!} e^{-xv_i}.
\]
For the variance, if $I_{i,j}$ is 1 if color $i$ has exactly $j$ balls at the end of the trial and 0 otherwise, then $W_j = \sum_{i=1}^K I_{i,j}$ and, for $j \ne \ell$,
\[
\mathbb{E}(W_j W_\ell \mid \mathcal{F}) = \sum_{1 \le i \ne m \le K} \mathbb{E}(I_{i,j} I_{m,\ell} \mid \mathcal{F})
\]
and
\[
\mathbb{E}(W_j^2 \mid \mathcal{F}) = \mathbb{E}(W_j \mid \mathcal{F}) + \sum_{1 \le i \ne m \le K} \mathbb{E}(I_{i,j} I_{m,j} \mid \mathcal{F}).
\]
For $j, \ell$ such that $j + \ell \le pV$,
\[
\mathbb{E}(I_{i,j} I_{m,\ell} \mid \mathcal{F}) = \frac{(pV)!}{j!\, \ell!\, (pV-j-\ell)!} \left( \frac{v_i}{V} \right)^j \left( \frac{v_m}{V} \right)^\ell \left( 1 - \frac{v_i + v_m}{V} \right)^{pV-j-\ell}.
\]
The quantity in the right hand side of the above equation can be expanded as
\[
e^{-p(v_i+v_m)} \frac{p^{j+\ell} v_i^j v_m^\ell}{j!\, \ell!} - \frac{p}{2V} e^{-p(v_i+v_m)} \frac{v_i^j v_m^\ell}{j!\, \ell!}\, c_{i,m}(j,\ell) + o\left( \frac{1}{K} \right),
\]
where
\[
c_{i,m}(j,\ell) = p^{j+\ell-2} (j+\ell)(j+\ell-1) - 2(j+\ell)(v_i+v_m) p^{j+\ell-1} + (v_i+v_m)^2 p^{j+\ell}
\]
is such that
\[
e^{-p(v_i+v_m)} \frac{v_i^j v_m^\ell}{j!\, \ell!}\, c_{i,m}(j,\ell) = \frac{d^2}{dp^2} \left[ e^{-p(v_i+v_m)} \frac{p^{j+\ell} v_i^j v_m^\ell}{j!\, \ell!} \right].
\]
Since
\[
(W_j^+)^2 = \Big( \sum_{\ell \ge j} W_\ell \Big)^2 = \sum_{\ell \ne k \ge j} W_k W_\ell + \sum_{\ell \ge j} W_\ell^2,
\]
we have
\[
\begin{aligned}
\mathbb{E}((W_j^+)^2 \mid \mathcal{F}) - \mathbb{E}(W_j^+ \mid \mathcal{F})
&= \sum_{1 \le i \ne m \le K} \sum_{\ell, k \ge j} \mathbb{E}(I_{i,k} I_{m,\ell} \mid \mathcal{F}) \\
&= \sum_{1 \le i \ne m \le K} \left( X_{i,j}(p) X_{m,j}(p) - \frac{p}{2\mathbb{E}(v)K} (X_{i,j} X_{m,j})''(p) \right) + o\left( \frac{1}{K} \right),
\end{aligned}
\]
and
\[
1 - \frac{\mathrm{Var}(W_j^+ \mid \mathcal{F})}{\mathbb{E}(W_j^+ \mid \mathcal{F})} = \frac{\mathbb{E}(W_j^+ \mid \mathcal{F}) - \mathbb{E}((W_j^+)^2 \mid \mathcal{F}) + \mathbb{E}(W_j^+ \mid \mathcal{F})^2}{\mathbb{E}(W_j^+ \mid \mathcal{F})}.
\]
The right-hand side of this equation can be expanded as
\[
\frac{1}{\sum_{i=1}^K X_{i,j} + O(1)} \left[ - \sum_{1 \le i \ne m \le K} X_{i,j}(p) X_{m,j}(p) + \frac{p}{2\mathbb{E}(v)K} \sum_{1 \le i \ne m \le K} (X_{i,j} X_{m,j})''(p) + \left( \sum_{i=1}^K X_{i,j}(p) - \frac{p}{2\mathbb{E}(v)K} \sum_{i=1}^K X''_{i,j}(p) \right)^2 \right] + o\left( \frac{1}{K} \right),
\]
which can be rewritten as
\[
\frac{1}{\sum_{i=1}^K X_{i,j} + O(1)} \left[ \sum_{1 \le i \le K} X_{i,j}^2(p) + \frac{p}{2\mathbb{E}(v)K} \left( \sum_{1 \le i \ne m \le K} (X_{i,j} X_{m,j})''(p) - 2 \sum_{i=1}^K X_{i,j}(p) \sum_{i=1}^K X''_{i,j}(p) \right) \right] + O(1)
\]
using that
\[
\sum_{i \ne m} X_{i,j} X_{m,j} = \Big( \sum_i X_{i,j} \Big)^2 - \sum_i X_{i,j}^2.
\]
By the law of large numbers, we have that, almost surely,
\[
\lim_{K \to +\infty} \frac{1}{K} \sum_{i=1}^K X_{i,j}^2(p) = \mathbb{E}(X_j^2(p)) = m_{2,j}(p), \qquad
\lim_{K \to +\infty} \frac{1}{K^2} \sum_{i \ne m} (X_{i,j} X_{m,j})''(p) = (m_j^2)''(p),
\]
together with
\[
\lim_{K \to +\infty} \frac{1}{K} \sum_{i=1}^K X_{i,j}(p) = m_j(p) \quad \text{and} \quad \lim_{K \to +\infty} \frac{1}{K} \sum_{i=1}^K X''_{i,j}(p) = m_j''(p).
\]
Hence, almost surely,
\[
\lim_{K \to \infty} \left( 1 - \frac{\mathrm{Var}(W_j^+ \mid \mathcal{F})}{\mathbb{E}(W_j^+ \mid \mathcal{F})} \right)
= \frac{m_{2,j}(p) + p \left[ (m_j^2)''(p)/2 - m_j(p) m_j''(p) \right] / \mathbb{E}(v)}{m_j(p)}
= \frac{m_{2,j}(p) + p\, m_j'(p)^2 / \mathbb{E}(v)}{m_j(p)},
\]
where the last equality uses $(m_j^2)''/2 = (m_j')^2 + m_j m_j''$, and the result follows. □

To illustrate the fact that the bound in Equation (18) is tight when $p \to 0$ and $v$ has finite moments of any order, let us note that, provided the corresponding moments are finite,
\[
\lim_{p \to 0} \frac{m_j(p)}{p^j} = \frac{\mathbb{E}(v^j)}{j!}. \tag{21}
\]
Moreover,
\[
\lim_{p \to 0} \frac{m_{2,j}(p)}{p^{2j}} = \frac{\mathbb{E}(v^{2j})}{j!^2} \quad \text{and} \quad \lim_{p \to 0} \frac{m_j'(p)}{p^{j-1}} = \frac{\mathbb{E}(v^j)}{(j-1)!}.
\]
Thus, the limit when $K$ tends to $+\infty$ of the bound given by Equation (18) is equivalent to
\[
\frac{j\, p^{j-1}}{(j-1)!} \frac{\mathbb{E}(v^j)}{\mathbb{E}(v)}
\]
when $p$ tends to 0. If $j \ge 2$, this term tends to 0 when $p \to 0$.

By using the above lemma, we are now able to state a limit result for the distribution of the random variables $W_j^+$.

Proposition 5. The inequality
\[
\lim_{K \to +\infty} \sup_{y \in \mathbb{R}} \left| \mathbb{P}\left( \frac{W_j^+ - \mathbb{E}(W_j^+)}{\sqrt{\mathbb{E}(W_j^+)}} \le y \right) - \int_{-\infty}^y \frac{e^{-u^2/2}}{\sqrt{2\pi}}\, du \right| \le \frac{m_{2,j}(p)}{m_j(p)} + \frac{p}{\mathbb{E}(v)} \frac{(m_j'(p))^2}{m_j(p)} \tag{22}
\]
holds.

Thus, for $j \ge 2$ and for small $p$, this gives the following approximation
\[
W_j^+ \sim \mathbb{E}(W_j^+) + \sqrt{\mathbb{E}(W_j^+)}\, G,
\]
where $G$ is a standard normal random variable. It should be noted nevertheless that Equation (22) is almost a central limit result, but because of the scaling in $1/\sqrt{\mathbb{E}(W_j^+)}$ instead of $1/\sqrt{\mathrm{Var}(W_j^+)}$, the bound in the right hand side is not 0 as $K$ gets large but, according to the proof of Lemma 1, only an upper bound on the distance between $\mathbb{E}(W_j^+)$ and $\mathrm{Var}(W_j^+)$.

Proof. From Lemma 1, we have
\[
\left\| \mathbb{P}\left( \frac{W_j^+ - \mathbb{E}(W_j^+)}{\sqrt{\mathbb{E}(W_j^+)}} \in \cdot \,\Big|\, \mathcal{F} \right) - \mathbb{P}\left( \frac{Q_{\mathbb{E}(W_j^+ \mid \mathcal{F})} - \mathbb{E}(W_j^+ \mid \mathcal{F})}{\sqrt{\mathbb{E}(W_j^+ \mid \mathcal{F})}} \in \cdot \right) \right\|_{tv} \le \frac{m_{2,j}(p)}{m_j(p)} + \frac{p}{\mathbb{E}(v)} \frac{m_j'(p)^2}{m_j(p)}.
\]
From Equation (20), we have that
\[
\lim_{K \to \infty} \frac{1}{K} \mathbb{E}(W_j^+ \mid \mathcal{F}) = \mathbb{E}(X_j(p)) = \sum_{\ell \ge j} Q_\ell = m_j(p),
\]
where the quantities $Q_\ell$ are defined in Proposition 1. In addition, from Corollary 1, $\mathbb{E}(W_j^+) \sim K m_j(p)$ when $K \to +\infty$. The result then follows by applying the central limit theorem for Poisson distributions and by deconditioning with respect to $\mathcal{F}$.
□

To conclude this section, let us notice that when balls are drawn with probability $p$ independently of each other, we do not have to condition on the environment and we have
\[
\left\| \mathbb{P}(W_j^+ \in \cdot) - \mathbb{P}(Q_{\mathbb{E}(W_j^+)} \in \cdot) \right\|_{tv} \le \frac{\mathbb{E}\left( \left[ \sum_{k=j}^v \binom{v}{k} p^k (1-p)^{v-k} \mathbf{1}_{\{v \ge j\}} \right]^2 \right)}{\mathbb{E}\left( \binom{v}{j} p^j (1-p)^{v-j} \mathbf{1}_{\{v \ge j\}} \right)}.
\]
It is worth noting that the results are independent of the number of colors and that we do not need to take $K \to \infty$ to obtain a bound for the distance in total variation. In addition, when $\mathbb{E}(W_j)$ becomes large, it is possible to obtain a central limit-type approximation similar to Proposition 5.

5. Comparison with original distributions

5.1. Uniform model.

In this section, we compare the distribution of the number $\tilde v$ of balls drawn with a given color with that of the original number $v$ of balls with a given color. We are in particular interested in giving a sense to the heuristic stating that $v$ and $\tilde v/p$ have distributions close to each other.

Proposition 6. Under the condition that the random variable $v$ has a Weibull or Pareto distribution, we have
\[
\lim_{j \to \infty} \lim_{K \to \infty} \frac{\mathbb{E}(W_j^+)}{K\, \mathbb{P}(v \ge j/p)} = 1.
\]

Proof. From Corollary 1, we know that $\mathbb{E}(W_j)/K \to Q_j$ when $K \to \infty$. Since
\[
Q_j = \mathbb{E}\left( \frac{(pv)^j}{j!} e^{-pv} \right) = \sum_{\ell=1}^\infty \frac{(p\ell)^j}{j!} e^{-p\ell}\, \mathbb{P}(v = \ell),
\]
we can show that if $v$ has a Weibull or Pareto distribution, then $Q_j \sim \mathbb{P}(v = j/p)/p$ when $j \to \infty$. Indeed, the above sum can be rewritten as
\[
\frac{1}{j!} \sum_{\ell=1}^\infty e^{f_j(\ell)}\, \mathbb{P}(v = \ell),
\]
where $f_j(\ell) = -p\ell + j \log(p\ell)$, which attains its maximum at the point $j/p$ with $f_j''(j/p) = -p^2/j$. If the random variable $v$ is Weibull or Pareto and $j/p$ is sufficiently large, then $\mathbb{P}(v = \ell)/\mathbb{P}(v = j/p) - 1$ is close to 0, uniformly in $j$, for $\ell$ in the neighborhood of $j/p$. It follows that
\[
Q_j \sim \frac{1}{j!}\, \mathbb{P}(v = j/p)\, e^{f_j(j/p)} \sum_{\ell=-\infty}^\infty e^{-\frac{\ell^2 p^2}{2j}}.
\]
For $a > 0$ converging to 0,
\[
\sum_{\ell=-\infty}^\infty e^{-a\ell^2} = \sum_{\ell=-\infty}^\infty \int_0^{+\infty} \mathbf{1}_{\{u > a\ell^2\}} e^{-u}\, du \sim 2 \int_0^{+\infty} \sqrt{\frac{u}{a}}\, e^{-u}\, du = 2 \int_0^{+\infty} \frac{u^2}{\sqrt{2a}}\, e^{-u^2/2}\, du = \sqrt{\frac{\pi}{a}}
\]
and, by Stirling's formula, $j! \sim \sqrt{2\pi}\, j^{j+1/2} e^{-j}$ for large $j$, so that $Q_j \sim \mathbb{P}(v = j/p)/p$. It is then easy to deduce that $\sum_{\ell \ge j} Q_\ell \sim \mathbb{P}(v \ge j/p)$ for large $j$. □

The above proposition implies that $\mathbb{P}(\tilde v \ge j) \sim \mathbb{P}(v \ge j/p)$ when the number of colors is large. This means that the tail of the distribution of the random variable $v$ can be obtained by rescaling that of the number $\tilde v$ of sampled balls with a given color. When $v$ has a Pareto distribution, Equation (13) can still be used for large $j$ to estimate the shape parameter $a$. The probability $1 - \mathbb{E}(e^{-pv})$ of sampling a color and the parameter $b$ can also be estimated from the tail by using the expression of that probability as a function of $b$ and $a$ as in Equation (14). The same method applies to Weibull distributions.

5.2. Probabilistic model.

From now on, we consider the probabilistic model and we establish stronger results on the distance between $\mathbb{P}(\tilde v \ge j)$ and $\mathbb{P}(v \ge j/p)$, where $\tilde v$ is the number of balls with a given color at the end of a trial. For this sampling mode, it was not possible to prove a result similar to Corollary 1, but the Berry-Esseen theorem [6] can be used to establish a stronger result for the comparison between $\tilde v$ and $v$. In [5], it is specifically proved that if we define the function
\[
h_j(x) = \frac{x^2}{4p^2} \left( \sqrt{1 + \frac{4jp}{x^2}} - 1 \right)^2
\]
for $x \in \mathbb{R}$ and $j > 0$, then
\[
\left| \mathbb{P}(\tilde v \ge j) - \mathbb{P}\left( v \ge h_j\left( \sqrt{p(1-p)}\, G \right) \vee k \right) \right| \le c\, \mathbb{E}\left( \frac{1}{\sqrt{v}} \mathbf{1}_{\{v \ge j\}} \right),
\]
where $G$ is a standard Gaussian random variable, $a \vee b = \max(a, b)$ for real numbers $a$ and $b$, and $c = 3(p^2 + (1-p)^2)/\sqrt{p(1-p)}$. For small $p$, the constant $c \sim 3/\sqrt{p}$.
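To get a feeling for the size of the constant $c$ in the bound quoted from [5], the following minimal sketch evaluates $c$ and the function $h_j$ for a few sampling rates. The function names are hypothetical, and the algebraic identity used as a sanity check (for $x > 0$, $h_j(x)$ solves $p\,h + x\sqrt{h} = j$) is a direct computation made here, not a statement taken from [5].

```python
import math

def h(j, x, p):
    """h_j(x) = x^2/(4 p^2) * (sqrt(1 + 4 j p / x^2) - 1)^2, for x != 0."""
    return x * x / (4.0 * p * p) * (math.sqrt(1.0 + 4.0 * j * p / (x * x)) - 1.0) ** 2

def c(p):
    """Constant c = 3 (p^2 + (1-p)^2) / sqrt(p (1-p)) of the bound from [5]."""
    return 3.0 * (p * p + (1.0 - p) ** 2) / math.sqrt(p * (1.0 - p))

# Compare c with its small-p equivalent 3/sqrt(p) for a few sampling rates.
for p in (0.1, 0.01, 0.001):
    print(p, c(p), 3.0 / math.sqrt(p))

# Sanity check of h_j: for x > 0, v = h_j(x) satisfies p*v + x*sqrt(v) = j.
value = h(20, 0.5, 0.01)
residual = 0.01 * value + 0.5 * math.sqrt(value) - 20
```

For $p = 0.01$ the constant is already roughly 30, so the right-hand side is informative only when $\mathbb{E}(v^{-1/2} \mathbf{1}_{\{v \ge j\}})$ is very small, that is, for large $j$.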
The above bound is very loose for small $p$ and becomes accurate only for very large values of $j$. This is why we go further in this paper by establishing a tighter bound for the ratio $\mathbb{P}(\tilde v \ge j)/\mathbb{P}(v \ge j/p)$.

Let $(B_n)$ be a sequence of i.i.d. Bernoulli random variables with parameter $p$ and $v$ an independent random variable with values in $\mathbb{N}$. Let $\tilde v = \sum_{\ell=1}^v B_\ell$.

Theorem 3. For $\alpha \in (1/2, 1)$, we have for all $j \ge 1$
\[
\frac{\mathbb{P}(\tilde v \ge j)}{\mathbb{P}(v \ge j/p)} = A(j) + B(j),
\]
where $A_1(j) \le A(j) \le A_2(j)$ with
\[
A_1(j) = \left( 1 - \exp\left( - \frac{p}{2 \left( 1 + (j/p)^{\alpha-1} \right)} \left( \frac{j}{p} \right)^{2\alpha-1} \right) \right) \frac{\mathbb{P}(v \ge j/p + \lfloor (j/p)^\alpha \rfloor + 1)}{\mathbb{P}(v \ge j/p)},
\]
\[
A_2(j) = \frac{\mathbb{P}(v \ge j/p - \lfloor (j/p)^\alpha \rfloor)}{\mathbb{P}(v \ge j/p)},
\]
and where $B(j)$ is a positive quantity such that
\[
B(j) \le e^{-\frac{p}{2(1-p)} \left( \frac{j}{p} \right)^{2\alpha-1}} \frac{\mathbb{P}(v \ge j)}{\mathbb{P}(v \ge j/p)}.
\]

Proof. We have
\[
\mathbb{P}(\tilde v \ge j) = \mathbb{P}\left( \sum_{\ell=1}^v B_\ell \ge j \right) = T_1 + T_2,
\]
where
\[
T_1 = \mathbb{P}\left( \sum_{\ell=1}^v B_\ell \ge j,\ j \le v \le j/p - \lfloor (j/p)^\alpha \rfloor - 1 \right), \qquad
T_2 = \mathbb{P}\left( \sum_{\ell=1}^v B_\ell \ge j,\ j/p - \lfloor (j/p)^\alpha \rfloor \le v \right).
\]
Let us first recall the following inequality for the sum of independent Bernoulli random variables $B_\ell$, $\ell \ge 1$ [9]: for $x \in [0, 1-p]$,
\[
\mathbb{P}\left( \sum_{\ell=1}^n B_\ell - np \ge nx \right) \le e^{-\frac{n x^2}{A(x)}}, \tag{23}
\]
where
\[
A(x) = 2p(1-p) + \frac{2}{3} x (1-2p) - \frac{2}{9} x^2. \tag{24}
\]
(This function $A(x)$ is not to be confused with the quantity $A(j)$ of the theorem.) It follows that for $j \le v \le j/p$
\[
\mathbb{P}\left( \sum_{\ell=1}^v B_\ell \ge j \right) \le e^{-\frac{(j-pv)^2}{v A\left( \frac{j}{v} - p \right)}}.
\]
It is easily checked that the function $v \mapsto v A\left( \frac{j}{v} - p \right)$ is increasing in the interval $[j, j/p]$ and that for all $v \in [j, j/p]$
\[
v A\left( \frac{j}{v} - p \right) \le 2j(1-p).
\]
Hence, for $v \in [j, j/p]$,
\[
\mathbb{P}\left( \sum_{\ell=1}^v B_\ell \ge j \right) \le e^{-\frac{(j-pv)^2}{2j(1-p)}}
\]
and for $v \in [j, j/p - \lfloor (j/p)^\alpha \rfloor - 1]$,
\[
\mathbb{P}\left( \sum_{\ell=1}^v B_\ell \ge j \right) \le e^{-\frac{p}{2(1-p)} \left( \frac{j}{p} \right)^{2\alpha-1}}.
\]
This implies that
\[
T_1 \le \mathbb{P}\left( \sum_{\ell=1}^{j/p - \lfloor (j/p)^\alpha \rfloor - 1} B_\ell \ge j \right) \mathbb{P}(v \ge j) \le e^{-\frac{p}{2(1-p)} \left( \frac{j}{p} \right)^{2\alpha-1}}\, \mathbb{P}(v \ge j).
\]
For the term $T_2$, we first note that
\[
T_2 \le \mathbb{P}(v \ge j/p - \lfloor (j/p)^\alpha \rfloor).
\]
Then, we clearly have
\[
T_2 \ge \mathbb{P}\left( \sum_{\ell=1}^v B_\ell \ge j,\ j/p + \lfloor (j/p)^\alpha \rfloor + 1 \le v \right)
\]
and then
\[
\frac{T_2}{\mathbb{P}(v \ge j/p)} \ge \mathbb{P}\left( \sum_{\ell=1}^{j/p + \lfloor (j/p)^\alpha \rfloor + 1} B_\ell > j \right) \frac{\mathbb{P}(v \ge j/p + \lfloor (j/p)^\alpha \rfloor + 1)}{\mathbb{P}(v \ge j/p)}.
\]
The Chernoff bound implies, for $v = j/p + \lfloor (j/p)^\alpha \rfloor + 1$,
\[
\mathbb{P}\left( \sum_{\ell=1}^v B_\ell \le j \right) \le \exp\left( - \frac{(pv-j)^2}{2pv} \right) \le \exp\left( - \frac{p}{2 \left( 1 + (j/p)^{\alpha-1} \right)} \left( \frac{j}{p} \right)^{2\alpha-1} \right).
\]
It follows that
\[
\frac{T_2}{\mathbb{P}(v \ge j/p)} \ge \left( 1 - \exp\left( - \frac{p}{2 \left( 1 + (j/p)^{\alpha-1} \right)} \left( \frac{j}{p} \right)^{2\alpha-1} \right) \right) \frac{\mathbb{P}(v \ge j/p + \lfloor (j/p)^\alpha \rfloor + 1)}{\mathbb{P}(v \ge j/p)},
\]
and the proof follows. □

The above result can be applied to specific distributions for $v$, namely Pareto and Weibull distributions, in order to show that the tails of the probability distribution functions of $\tilde v$ and $pv$ are the same. This is the analog of Proposition 6 for the probabilistic model.

Corollary 2. If $v$ has either
(1) a Pareto tail distribution with parameter $a > 1$ such that for $x \ge 0$,
\[
\mathbb{P}(v \ge x) = L(x) x^{-a},
\]
where $L$ is a slowly varying function, i.e., for each $t > 0$,
\[
\lim_{x \to +\infty} \frac{L(tx)}{L(x)} = 1;
\]
or
(2) a Weibull tail distribution with $\beta \in\, ]0, 1/2[$ such that for $x \ge 0$,
\[
\mathbb{P}(v \ge x) = L(x) e^{-\delta x^\beta}
\]
for some $\delta > 0$ and $L$ a slowly varying function,
then
\[
\lim_{j \to +\infty} \left| \frac{\mathbb{P}(\tilde v \ge j)}{\mathbb{P}(v \ge j/p)} - 1 \right| = 0.
\]

Proof. For (1),
\[
\frac{\mathbb{P}(v \ge j)}{\mathbb{P}(v \ge j/p)} = \frac{L(j)}{L(j/p)} \frac{j^{-a}}{(j/p)^{-a}} = \frac{L(j)}{L(j/p)\, p^a} \xrightarrow[j \to +\infty]{} p^{-a}
\]
and
\[
\frac{\mathbb{P}(v \ge j/p + \epsilon (j/p)^\alpha)}{\mathbb{P}(v \ge j/p)} = \frac{L\left( (j/p)(1 + \epsilon (j/p)^{\alpha-1}) \right)}{L(j/p)} \left( 1 + \epsilon (j/p)^{\alpha-1} \right)^{-a},
\]
which tends to 1 when $j$ tends to $+\infty$.
This implies that the quantities $A_1(j)$ and $A_2(j)$ appearing in Theorem 3 tend to 1 and $B(j)$ tends to 0 when $j \to \infty$.

For (2),
\[
\frac{P(v \geq j)}{P(v \geq j/p)} = \frac{L(j)}{L(j/p)}\, e^{-\delta j^\beta \left(1 - p^{-\beta}\right)} \xrightarrow[j \to +\infty]{} 0
\]
and it is straightforward that
\[
\frac{P\left(v \geq j/p + \epsilon (j/p)^\alpha\right)}{P(v \geq j/p)}
= \frac{L\left((j/p)\left(1 + \epsilon (j/p)^{\alpha-1}\right)\right)}{L(j/p)}\, e^{-\delta \left(j/p + \epsilon (j/p)^\alpha\right)^\beta + \delta (j/p)^\beta}
= \frac{L\left((j/p)\left(1 + \epsilon (j/p)^{\alpha-1}\right)\right)}{L(j/p)}\, e^{-\delta \beta \epsilon (j/p)^{\alpha+\beta-1} (1 + o(1))},
\]
which tends to 1 if $\alpha + \beta < 1$. Let $\beta \in\, ]0, 1[$. It is sufficient to find $\alpha \in\, ]1/2, 1[$ such that $\alpha + \beta < 1$. Necessarily $1 - \beta > \alpha > 1/2$, thus $\beta < 1/2$, and for such a $\beta$, such an $\alpha$ exists. □

6. Concluding remarks on sampling and parameter inference

We have established in this paper convergence results for the distribution of the number of balls with a given color under the assumption that there is a large number of colors in the urn, that the number of balls with a given color has a heavy tailed distribution independent of the color, and that only a small fraction $p$ of the total number of balls is sampled.

We have considered two ball sampling rules. The first one states that the probability of drawing a ball with a given color depends upon the relative contribution of the color to the total number of balls and that a drawn ball is immediately replaced into the urn. With the second rule, each ball is selected with probability $p$ independently of the others. The two rules do not give the same results, even if they coincide when $p \to 0$ (see [5] for details).

From a practical point of view, we have shown that it is possible to identify the original distribution of the number of balls with a given color by using the tail of the distribution of the number of balls with a given color drawn from the urn. A stronger result holds for Pareto distributions when the number of colors is very large (see Proposition 3).
This result is robust in practice because it does not rely on the asymptotics of the tail distribution (in Proposition 3, the assertions hold for all $j > a$).

The determination of the original number of balls per color is valid when the number of balls follows a unique distribution of Pareto or Weibull type. This could be used in the context of packet sampling in the Internet. In practice, however, the number of packets in flows is in general not described by a unique "nice" distribution, but can only be locally approximated by a series of Pareto distributions (see [2] for a discussion). More sophisticated techniques are then necessary to get the original statistics of flows.

References

[1] M. Abramowitz and I. Stegun, Handbook of mathematical functions, National Bureau of Standards, Applied Mathematics Series 55, 1972.
[2] N. Antunes, Y. Chabchoub, C. Fricker, F. Guillemin, and P. Robert, On the estimation of flow statistics via packet sampling in the Internet, Submitted for publication.
[3] S. Asmussen, C. Klüppelberg, and K. Sigman, Sampling at subexponential times, with queueing application, Stochastic Process. Appl. 79 (1999), 265–286.
[4] A. D. Barbour, Lars Holst, and Svante Janson, Poisson approximation, The Clarendon Press Oxford University Press, New York, 1992, Oxford Science Publications.
[5] Yousra Chabchoub, Christine Fricker, Fabrice Guillemin, and Philippe Robert, Deterministic versus probabilistic packet sampling in the Internet, Proceedings of ITC'20, June 2007.
[6] W. Feller, An introduction to probability theory, Theory and application, vol. 2, Wiley, 1966.
[7] S. Foss and D. Korshunov, Sampling at a random time with a heavy-tailed distribution, Markov Process. Related Fields 6 (2000), no. 4, 543–568.
[8] Philippe Robert, Réseaux et files d'attente: méthodes probabilistes, Mathématiques et Applications, vol. 35, Springer-Verlag, Berlin, October 2000.
[9] A. Siegel, Toward a usable theory of Chernoff bounds for heterogeneous and partially dependent random variables, Paper available at http://cs.nyu.edu/faculty/siegel/HHf.pdf.

(C. Fricker) INRIA-Rocquencourt, RAP project, Domaine de Voluceau, 78153 Le Chesnay, France
E-mail address: Christine.Fricker@inria.fr
URL: http://www-c.inria.fr/twiki/bin/view/RAP/ChristineFricker

(F. Guillemin) Orange Labs, F-22300 Lannion
E-mail address: Fabrice.Guillemin@orange-ftgroup.com

(Ph. Robert) INRIA-Rocquencourt, RAP project, Domaine de Voluceau, 78153 Le Chesnay, France
E-mail address: Philippe.Robert@inria.fr
URL: http://www-rocq.inria.fr/~robert
