A New Estimator for the Number of Species in a Population

We consider the classic problem of estimating T, the total number of species in a population, from repeated counts in a simple random sample. We look first at the Chao-Lee estimator: we initially show that such estimator can be obtained by reconcilin…

Authors: L. Cecconi, A. Gandolfi

Submitted to the Annals of Statistics

A NEW ESTIMATOR FOR THE NUMBER OF SPECIES IN A POPULATION

By L. Cecconi,∗ A. Gandolfi∗ and C.C.A. Sastri

University of Firenze and Missouri University of Science and Technology

We consider the classic problem of estimating $T$, the total number of species in a population, from repeated counts in a simple random sample and look first at the Chao-Lee estimator: we initially show that such an estimator can be obtained by reconciling two estimators of the unobserved probability, and then develop a sequence of improvements culminating in a Dirichlet prior Bayesian reinterpretation of the estimation problem. By means of this, we obtain simultaneous estimates of $T$, the normalized interspecies variance $\gamma^2$ and the parameter $\lambda$ of the prior. Several simulations show that our estimation method is more flexible than several known methods we used as comparison; the only limitation, apparently shared by all other methods, seems to be that it cannot deal with the rare cases in which $\gamma^2 > 1$.

1. Introduction. We consider the classic problem of estimating the number $T$ of species in a population, and, subsequently, their distribution, from a simple random sample drawn with replacement. We are interested in the "small sample" regime in which it is likely that not all species have been observed.
Problems of this kind arise in a variety of settings: for example, when sampling fish from a lake or insects in a forest (see, for instance, Shen, Chao and Lin (2003) [47] on how to use estimates of $T$ to predict further sampling, or [7]); or when estimating the size of a particular population (see [6]); or when trying to guess how many letters an alphabet or how many specific groups of words a language contains (see [14]) or how many words a writer knows (see [19]); or, even, when determining how many different coins were minted by an ancient population (Esty [21]). Because of its great interest this has become a classic in probability, and there has been a great number of studies suggesting methods for the estimation of $T$. See, for instance, [8] for a review through 1993, [23] for some further details and Colwell's EstimateS for software implementing a large number of estimators.

∗ Supported by Italian Prin project 2006010252-001.
AMS 2000 subject classifications: Primary 62G05; secondary 62F15.
Keywords and phrases: simple random sample, unobserved species, unobserved probability, point estimation, confidence interval, Dirichlet prior, Bayesian posterior.
imsart-aos ver. 2007/12/10 file: CompletEstimat14.tex date: November 9, 2018

In particular, [8] calls for some development of the Bayesian method for the estimation of $T$, which is the direction that we eventually have taken.

In this paper we start, in fact, by analyzing one well known estimator of $T$, namely the one by Chao and Lee ([13]).
One of our results shows that the estimator can be obtained by reconciling two estimators of the unobserved probability $U$: one being an extended version of Laplace's "add $\lambda$" ([34]) and the other the estimator by Turing and Good ([24]), provided that the normalized interspecies variance $\gamma^2$ is interpreted as the inverse of $\lambda$. Then we proceed by developing simultaneous methods for estimating $T$ and $\lambda$ (or $\gamma^2$, which is the same). By such methods we improve on the original Chao-Lee estimation, but the estimators we obtain are shown by simulations to have some serious defects.

It is for this reason that we perform a more fundamental analysis of the problem by means of a Bayesian approach. This is based on a Dirichlet prior with parameter $\lambda$ on the probabilities of $T$ species (see [33], [32], [25], and [49] for an historical description); the parameter turns out to be the same as the one in Laplace's method. The simultaneous estimation that we develop now takes into account a posterior second moment of the random species probabilities compared to the classical Good-Toulmin estimator for the same quantity (see [27]).

Let us mention that the empirical Bayesian approach used here is different from that of existing results in the literature. The method in [41] is, in fact, limited to uniform species distributions. On the other hand, the general Bayesian approach in Boender and Rinnooy Kan (1987) [4] starts from a prior distribution of $T$ and, conditionally to $T$, a uniform or Dirichlet($\lambda$) prior on the species probability, but then introduces a (level III) prior on $\lambda$ itself (as suggested in [26]) which in turn requires the introduction of a further parameter (Boender and Rinnooy Kan (1987) [4], formulae (10) and (11)), with then no analytical expression for the posteriors.
In the end, this direction seems to include several undetermined choices (the prior on $T$ and the extra parameter at level III) and no simple analytical expression of the estimators.

At the end of the paper we present some numerical tests. Due to the inherent difficulty in finding fully published data for this estimation we resort to simulations and real tests on discovering the size of an alphabet. The tests seem to indicate that the new estimator of $T$ is more flexible than existing ones and thus preferable, in the sense that the performance of all estimators seems to depend greatly on the normalized variance $\gamma^2$, and the new estimator is the only one able to perform rather well for all values of $\gamma^2 \in [0, 1]$. In our method, the only constraint is that $\lambda \ge 1$, which is $\gamma^2 \le 1$, which is imposed in order to ensure convergence of the prior; this, in turn, imposes a mild limitation on the populations to which the method can be applied, since $\gamma^2$ can, for some peculiar population, exceed 1; on the other hand, such populations are likely to be quite unusual and, in addition, all other existing estimators seem also to fail on samples taken from them.
In section 2 we review in detail some known estimation methods of interest in deriving our results; in section 3 we derive some relations between known estimators and our first improvements; in section 4 we develop the Bayesian method and define our final estimator; in section 5 we give estimates of the species probabilities, for both the observed and the unobserved ones; from these, we indicate how to generate confidence intervals for $T$ by means of resampling; finally, in section 6 we present some simulations revealing a rather good performance of our new estimator and also very adequate results of the confidence intervals. All detailed mathematical proofs are deferred to the Appendix.

2. Some known estimators of T and related quantities.

We start with some notation. Assume that the population from which the sample is drawn has a total of $T$ species (which we sometimes will call states) having proportions $p_1, p_2, \dots, p_T$, and that in a sample $x_1, x_2, \dots, x_n$ of size $n$ there are $N$ observed species. For $i = 1, \dots, T$, let $m_i$ be the number of observations of the species $i$ in the sample, so that $\sum_{i=1}^{N} m_i = n$. We assume that the $m_i$'s are given one of the possible orders in which $m_1 \ge m_2 \ge \dots \ge m_N \ge 1$ and $m_i = 0$ for $i = N+1, \dots, T$. Also, for $j = 1, \dots, n$, let $n_j$ be the prevalence of $j$, which is to say the number of species observed exactly $j$ times, so that $\sum_{j=1}^{n} n_j = N$. Next, let $L_n(i) = m_i/n$ be the empirical frequency of species $i$, so that $C = \sum_{i : L_n(i) > 0} p_i$ is the coverage, i.e., the total probability of the observed species, and $U = 1 - C = \sum_{i : L_n(i) = 0} p_i$ is the unobserved probability.

We are interested in the estimation of $T$ from the prevalences. The estimation of $U$ has also been studied intensively (see, for instance, [40] and [38]).
In fact, it is possible to turn the estimation of $U$ into a simplified version of our original problem by assuming that there are $N + 1$ species, the $N$ observed ones and the "new" species with probability $U$; the main issue becomes then the estimation of the probabilities of the various species and especially for the new one. For this and other reasons that we shall see, the estimations of $T$ and $U$ are closely intertwined (even the title of [20] points to this relation).

The first attempt to estimate $U$ can be extracted from Laplace (see [34] and [45]) who suggested an "add-one" estimator: this consists in adding one to the number of observations of each species plus an additional one for the "unobserved" species. In an extended version, which can be named "add $\lambda$", one can add some positive value $\lambda$ to each species' number of observations (including the unobserved one): an estimate of the probability of each observed species $i$ is then

$$\hat p_i = \frac{m_i + \lambda}{\lambda + \sum_{i=1}^{N} (m_i + \lambda)} = \frac{m_i + \lambda}{n + (N+1)\lambda}$$

and the estimate of the unobserved probability becomes

$$\hat U_{L,\lambda} = \frac{\lambda}{n + (N+1)\lambda}.$$

With a seemingly completely different method, Turing and Good (see [24]) proposed another estimator of $U$. Recall that $n_1$ is the number of species observed exactly once and $n$ the size of the sample; then the Turing-Good estimator for $U$ is some minor modification of

$$\hat U_{TG} = \frac{n_1}{n}.$$

A plausible rationale for this estimator is that while for species observed at least twice the empirical frequency is already becoming stable and very likely close to the corresponding probability, species observed only once are likely to be randomly selected representatives of the collection of the yet unobserved species.
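The two estimators of $U$ just described depend on the sample only through $n$, $N$ and the prevalences $n_j$. The following is a minimal sketch, not taken from the paper, of how these quantities could be computed; the function names (`prevalences`, `u_add_lambda`, `u_turing_good`) and the toy sample are illustrative assumptions.

```python
from collections import Counter


def prevalences(sample):
    """Return (n, N, n_j) for an iterable of species labels (hypothetical helper)."""
    m = Counter(sample)          # m_i: number of observations of each observed species
    n = sum(m.values())          # sample size
    N = len(m)                   # number of observed species
    n_j = Counter(m.values())    # n_j: number of species observed exactly j times
    return n, N, n_j


def u_add_lambda(n, N, lam):
    """'Add lambda' estimate of U with N observed species plus one 'new' species."""
    return lam / (n + (N + 1) * lam)


def u_turing_good(n, n_j):
    """Unsmoothed Turing-Good estimate of U, namely n_1 / n."""
    return n_j.get(1, 0) / n


if __name__ == "__main__":
    # Toy sample: letters play the role of species.
    n, N, n_j = prevalences("abracadabra")
    print(n, N, dict(n_j))
    print(u_add_lambda(n, N, lam=1.0), u_turing_good(n, n_j))
```

With $\lambda = 1$ the first function reduces to Laplace's "add-one" rule; the smoothing of the $n_j$'s mentioned for the Turing-Good estimator is deliberately left out of this sketch.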
A more sound mathematical derivation is in Good ([24]), in which also a "smoothing" of the $n_i$'s is proposed.

Other methods to estimate $U$ have been developed, and in particular we refer to [38] for a Bayesian method based on the general class of Gibbs-type priors (see also [46] and the other references in [38] for the definition and properties of such priors). This class contains several known families of priors as particular cases and each such family is based on one or more parameters, which need to be further estimated. In [38], for instance, a maximum likelihood estimator is used. Another recent advance appears in Orlitsky et al. ([45]), in which a quantity is introduced, called attenuation, that measures the effectiveness of the estimation of $U$ as the sample gets larger; the performance of an estimator is compared to the maximum probability of the observed prevalences and asymptotically very good estimators are determined.

We are going to base our work here on a preliminary estimation of $U$. It is conceivable that within the wide class of proposed estimators of $U$ some would improve the results that we get; however, we focus on the unsmoothed Turing-Good estimator since it is more direct and simple, while still allowing us to achieve very satisfactory results.

Getting back to the estimation of $T$, there are several parametric methods based on assuming some structure of the species distribution; for instance,
an estimator devised for the uniform case, in which the probabilities of all species are assumed to be the same, is the Horvitz-Thompson

$$\hat T_{HT} = \frac{N}{1 - U}$$

(see [39] and Bishop, Fienberg and Holland (1975) [3]) and then $U$ can be further estimated, for instance by the unsmoothed Turing-Good method, to get

$$\hat T_{HTTG} = \frac{N}{1 - \hat U_{TG}} = \frac{nN}{n - n_1} \tag{1}$$

see [16] and [5]. Esty [20] improves this estimate by assuming a negative binomial prior with parameter $k$ to get

$$\hat T_{HTTG} = \frac{N}{1 - \hat U_{TG}} + \frac{n \hat U_{TG}}{1 - \hat U_{TG}} \, \frac{1}{k}, \tag{2}$$

then providing some ad hoc guess for $k$ (in some cases, $k = 2$).

As to nonparametric methods, Harris [28], Chao [12] and Chao & Lee [13] have proposed some such estimators, of which the most reliable ones seem to be those proposed in [13]. In our notation these amount to

$$\hat T_{CL}(\hat\gamma) = \frac{N}{1 - \hat U_{TG}} + \frac{n \hat U_{TG}}{1 - \hat U_{TG}} \, \hat\gamma^2, \tag{3}$$

with $\hat\gamma^2$ an estimate, for which Chao & Lee make two proposals, of the normalized variation coefficient of the $p_i$'s. In fact, assume that $p$ is a random variable uniformly distributed on the $T$ population probabilities $p_1, \dots, p_T$; then its average is

$$\bar p = \frac{1}{T} \sum_{k=1}^{T} p_k = \frac{1}{T},$$

and its normalized variation coefficient is

$$\gamma^2 = \frac{\mathrm{Var}(p)}{[E(p)]^2} = T \sum_{k=1}^{T} (p_k - \bar p)^2 = T \sum_{k=1}^{T} p_k^2 - 1. \tag{4}$$

Next, Chao and Lee proceed by using an estimate of Good and Toulmin

$$\sum_{k=1}^{T} p_k^2 \approx \hat V_{GT} = \sum_{j \ge 1} \frac{j(j-1)\, n_j}{n(n-1)} \tag{5}$$

and using one preliminary estimate for $T$, (1) for instance, to obtain

$$\hat\gamma^2 = \max\left\{ \frac{nN}{n - n_1} \sum_{j} \frac{j(j-1)\, n_j}{n(n-1)} - 1, \; 0 \right\}.$$

Note that the work by Chao and Lee can be considered as a further improvement over the results by Esty.
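As an illustration of how (1), (3) and (5) fit together, here is a small sketch of the Chao-Lee computation with the preliminary estimate (1) plugged into $\hat\gamma^2$. The function names are assumptions made for this example; only the formulas come from the text, and only one of the two Chao-Lee proposals for $\hat\gamma^2$ is shown.

```python
def v_good_toulmin(n, n_j):
    """Good-Toulmin estimate (5) of sum_k p_k^2 from the prevalences n_j."""
    return sum(j * (j - 1) * count for j, count in n_j.items()) / (n * (n - 1))


def t_httg(n, N, n1):
    """Horvitz-Thompson estimate with Turing-Good coverage, formula (1)."""
    return n * N / (n - n1)


def chao_lee(n, N, n_j):
    """Return (T_CL, gamma2_hat) following (3), with (1) as preliminary T."""
    n1 = n_j.get(1, 0)
    if n < 2 or n1 >= n:
        raise ValueError("need n >= 2 and at least one species seen more than once")
    u = n1 / n                                   # unsmoothed Turing-Good estimate of U
    gamma2 = max(t_httg(n, N, n1) * v_good_toulmin(n, n_j) - 1.0, 0.0)
    t_cl = N / (1 - u) + n * u / (1 - u) * gamma2
    return t_cl, gamma2


if __name__ == "__main__":
    # Prevalences of the toy sample "abracadabra": n = 11, N = 5.
    print(chao_lee(11, 5, {1: 2, 2: 2, 5: 1}))
```

The guard on `n1 >= n` reflects that (1) is undefined when every observed species is a singleton.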
However, Chao and Lee make a rather direct use of a preliminary guess for $T$ and we think their method is too sensitive to errors in such preliminary evaluation. In the next section we start discussing some possible improvements.

3. Preliminary results on new estimators.

(I) We first consider (3) and (4) as equations in the unknowns $T$ and $\gamma^2$ and search for simultaneous solutions $T \ge N$ and $\gamma^2 \ge 0$. Since in some simple examples the unique solution gives $\gamma^2 < 0$, we consider the solutions $T_1(\hat\gamma_1^2)$ and $\hat\gamma_1^2$ of the problem

$$T = T(\gamma^2) = \frac{N}{1 - \hat U_{TG}} + \frac{n \hat U_{TG}}{1 - \hat U_{TG}} \, \gamma^2 \tag{6}$$

$$\hat\gamma^2 = \arg\inf_{\gamma^2 \ge 0} \left| \gamma^2 - (T \hat V_{GT} - 1) \right|, \tag{7}$$

with $\hat V_{GT}$ as in (5). On letting $u = \hat U_{TG}$ and $v = \hat V_{GT}$ for brevity, the function to minimize becomes $(1 - u + nuv)\gamma^2 + 1 - u - Nv$; note that $(1 - u + nuv) \ge 0$ since $u \le 1$, so that the solutions of (6) and (7) are

$$\hat\gamma_1^2 = \begin{cases} 0 & \text{if } 0 < u \le 1 - Nv \\[4pt] \dfrac{Nv - 1 + u}{1 - u + nuv} = \dfrac{N \hat V_{GT} - 1 + \hat U_{TG}}{1 - \hat U_{TG} + n \hat U_{TG} \hat V_{GT}} & \text{if } 1 - Nv < u \end{cases}$$

and $\hat T_1 = T_1(\hat\gamma_1^2)$. Some tests described in section 6 show that $\hat T_1$ performs better for nonuniform populations than the original Chao-Lee estimate, but has too large a variance.

(II) Next we compare two estimators of $U$, the unsmoothed Turing-Good and the following modified version of the "add $\lambda$": assume the number $T$ of species is known and add $\lambda$ to each of the frequencies of all the $T$ species, not just to that of those arbitrarily labelled through $N + 1$. This would give

$$\hat p_k(\lambda) = \frac{m_k + \lambda}{T\lambda + n} \quad \text{for } k = 1, \dots, N$$

$$\hat p_k(\lambda) = \frac{\lambda}{T\lambda + n} \quad \text{for } k = N+1, \dots, T$$

$$\hat U_\lambda = \frac{(T - N)\lambda}{T\lambda + n}$$

since there are $T - N$ unobserved species.
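The modified "add $\lambda$" rule of (II) can be sketched as follows, assuming $T$ is known. The helper name `add_lambda_probs` and the example counts are illustrative assumptions, not part of the paper.

```python
def add_lambda_probs(m_observed, T, lam):
    """Extended 'add lambda' estimates when the number of species T is known.

    m_observed: counts m_1..m_N of the observed species (all >= 1).
    Returns (p_obs, p_unobs, u_lam): the estimated probabilities of the observed
    species, the common estimate for each unobserved species, and the implied
    estimate of the unobserved probability U.
    """
    n = sum(m_observed)
    N = len(m_observed)
    if T < N:
        raise ValueError("T must be at least the number of observed species")
    denom = T * lam + n
    p_obs = [(m + lam) / denom for m in m_observed]
    p_unobs = lam / denom
    u_lam = (T - N) * lam / denom
    return p_obs, p_unobs, u_lam


if __name__ == "__main__":
    p_obs, p_unobs, u_lam = add_lambda_probs([5, 2, 2, 1, 1], T=8, lam=1.0)
    # The estimates over all T species add up to one.
    print(sum(p_obs) + u_lam)
```

The check that the observed estimates plus $\hat U_\lambda$ sum to one is a quick way to confirm that $\lambda$ has been added to all $T$ species and not only to $N + 1$ of them.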
Now, we can hope to reconcile the extended "add $\lambda$" and the unsmoothed Turing-Good estimators by requiring that they assign the same value to $\hat U$. This amounts to solving

$$\frac{(T - N)\lambda}{T\lambda + n} = \hat U_{TG} = \frac{n_1}{n}. \tag{8}$$

Solving for $T$ we get

$$\hat T_\lambda = \frac{N + n \hat U_{TG}/\lambda}{1 - \hat U_{TG}} = n \, \frac{N + n_1/\lambda}{n - n_1}. \tag{9}$$

Quite surprisingly, we have obtained

Lemma 3.1. The only value of $T$ for which the extended "add $\lambda$" and the Turing-Good estimators of $U$ coincide is the Chao-Lee estimator $T_{CL}(\gamma)$ with $\gamma^2 = 1/\lambda$.

From now on we will assume this equality and mostly refer to the parameter $\lambda$.

(III) The relation found in (II) suggests that (6) can be seen as a first moment estimate:

$$\sum_{k=N+1}^{T} \hat p_k(\lambda) = \hat U_{TG}, \tag{10}$$

so that one can hope to derive $\gamma^2$ from a second moment relation. The form is suggested by (I), considering the meaning of $\hat V_{GT}$:

$$\hat\lambda_2 = \arg\inf_{\lambda \ge 0} \left| \sum_{k=1}^{T} \hat p_k(\lambda)^2 - \hat V_{GT} \right|. \tag{11}$$

The solutions $\hat T_2(\hat\lambda_2)$ and $\hat\lambda_2$ of (10) and (11), together with $\hat\gamma^2 = \hat\lambda_2^{-1}$, give new estimators; although this seems to improve the estimation in some cases, it does appear to have significant flaws, as shown in the simulations reported in tables 1-3.

4. The Bayesian interpretation.

To further improve the above estimate, we need to understand more about the "add $\lambda$" estimator. It turns out, as was probably known already to Laplace, that the probability estimation according to the "add $\lambda$" method is nothing but the average species probability under the Bayesian posterior on probability distributions on $T$ species

$$\Sigma_T = \Big\{ p = (p_1, p_2, \dots, p_T), \; p_i \ge 0, \; \sum_{i=1}^{T} p_i = 1 \Big\},$$

given the sample, with a single parameter Dirichlet prior $\rho_{0,T,\lambda}$, i.e. a prior with density $c \prod_{i=1}^{T} p_i^{\lambda - 1}$ for some constant $c$ and $\lambda \ge 1$.
With likelihood

$$\mu(x) = \prod_{j=1}^{n} p_{x_j} = \prod_{i=1}^{T} p_i^{m_i}$$

the posterior becomes

$$\rho_{n,T,\lambda}(d\mu) = \frac{\mu(x)\, \rho_{0,T,\lambda}(d\mu)}{\int_{\Sigma_T} \mu(x)\, \rho_{0,T,\lambda}(d\mu)} = \frac{1}{Z} \, \mathbf{1}_{\Sigma_T} \prod_{i=1}^{T} p_i^{m_i + \lambda - 1} \, dp_1 \dots dp_T, \tag{12}$$

where

$$Z = \int_{\Sigma_T} p_1^{m_1 + \lambda - 1} \cdots p_N^{m_N + \lambda - 1} \, p_{N+1}^{\lambda - 1} \cdots p_T^{\lambda - 1} \, dp_1 \cdots dp_T$$

(note that the constant terms have been cancelled). By standard integration using the gamma function (see Appendix 1), we find that the average species probability under the posterior is

$$E_{\rho_{n,T,\lambda}}(p_i) = \begin{cases} \dfrac{m_i + \lambda}{T\lambda + n} & \text{if } i = 1, \dots, N \\[6pt] \dfrac{\lambda}{T\lambda + n} & \text{if } i = N+1, \dots, T \end{cases}$$

as claimed. This remark, together with our reconciliation Lemma 3.1 in (II) above, indicates that we are taking a new step in the development which brought us from (1) to (2) and then to (3) by assigning now two other meanings for $\lambda^{-1} = \gamma^2$, namely that of the add constant in a generalized Laplace method and that of the constant in a Dirichlet prior.

The Bayesian interpretation of $\hat p_k$ also suggests a modification of the second moment minimization (11). Recalling that now $\lambda \ge 1$ we have

$$\hat\lambda = \arg\inf_{\lambda \ge 1} |f(\lambda)|$$

with

$$f(\lambda) = \hat V - \sum_{k=1}^{T} E_{\rho_{n,T,\lambda}}(p_k^2) = \frac{\sum_{j>0} j^2 n_j - n}{n(n-1)} - \frac{\sum_{j>0} j^2 n_j + n(2\lambda + 1) + n(\lambda + 1)\dfrac{N\lambda + n_1}{n - n_1}}{\Big[ n \dfrac{N\lambda + n}{n - n_1} + 1 \Big] \Big[ n \dfrac{N\lambda + n}{n - n_1} \Big]}$$

where $T = \hat T_\lambda$ has been taken as in (9), so that $T\lambda = n(N\lambda + n_1)/(n - n_1)$ and $T\lambda + n = n(N\lambda + n)/(n - n_1)$, and the calculation is carried out in Appendix 1 (see (20)). In Appendix 2 we show that the function $f(\lambda)$ has two singularities $\beta_2 < \beta_1 = -\frac{n}{N} < 0$ and two zeros, the interesting one being

$$\lambda_2 = \frac{1 - u - v + uv - uvn}{Nv + u - 1}. \tag{13}$$

The minimization depends on the sign of $f(\lambda)$ for large $\lambda$, which in turn depends on the sign of $(\lambda_2 - \beta_1)$.
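A quick numerical sanity check of (13) can be sketched by building $f(\lambda)$ from the second moment sum (20) of Appendix 1 with $T = \hat T_\lambda$ from (9), and evaluating it at $\lambda_2$. The function names and the toy prevalences are assumptions for illustration; the check is not part of the paper.

```python
def f(lam, n, N, n1, q):
    """f(lambda) = V_hat - sum_k E(p_k^2), using (20) with T = T_lambda from (9).

    q is sum_j j^2 n_j, so that V_hat = (q - n) / (n (n - 1)).
    """
    v = (q - n) / (n * (n - 1))
    T = n * (N + n1 / lam) / (n - n1)            # formula (9)
    second_moment = (q + n * (2 * lam + 1) + T * (lam ** 2 + lam)) / (
        (T * lam + n + 1) * (T * lam + n)
    )                                            # formula (20)
    return v - second_moment


def lambda_2(n, N, n1, q):
    """Candidate zero of f given in (13); requires N*v + u - 1 != 0."""
    u = n1 / n
    v = (q - n) / (n * (n - 1))
    return (1 - u - v + u * v - u * v * n) / (N * v + u - 1)


if __name__ == "__main__":
    # Toy prevalences {1: 2, 2: 2, 5: 1}: n = 11, N = 5, n_1 = 2, q = 35.
    lam2 = lambda_2(11, 5, 2, 35)
    print(lam2, f(lam2, 11, 5, 2, 35))
```

In this toy case $\lambda_2 = 41/55 < 1$, so the constraint $\lambda \ge 1$ is active; the value of $f$ at $\lambda_2$ should vanish up to rounding error.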
Since $f(\lambda)$ is increasing for $\lambda \ge 1$, if the limit for large $\lambda$ is negative, then the only reasonable value we can assign is $\infty$; else there is a real solution for the minimization problem above: note that if $\lambda_2 \le 1$ then we are forced to take $\hat\lambda = 1$. It is thus shown in Appendix 2 that the minimization above yields the estimator

$$\hat\lambda = \begin{cases} 1 & \text{if } \beta_1 < \lambda_2 \text{ and } 1 \ge \lambda_2, \text{ i.e. } \dfrac{2 - v(N+1)}{2 - v + vn} \le u \le 1 - v \\[6pt] \lambda_2 & \text{if } \beta_1 < \lambda_2 \text{ and } \lambda_2 \ge 1, \text{ i.e. } 1 - Nv < u \le \dfrac{2 - v(N+1)}{2 - v + vn} \\[6pt] \infty & \text{if } \lambda_2 \le \beta_1, \text{ i.e. } 0 \le u \le 1 - Nv. \end{cases}$$

From (9) we get the following estimator of $T$:

$$\hat T_{\hat\lambda} = \frac{N + n \hat U_{TG}/\hat\lambda}{1 - \hat U_{TG}} = \begin{cases} n \, \dfrac{N + n_1/(\lambda_2 \vee 1)}{n - n_1} & \text{if } \beta_1 < \lambda_2 \\[6pt] \dfrac{nN}{n - n_1} & \text{if } \lambda_2 < \beta_1 \end{cases}$$

or, alternatively,

$$\hat T_{\hat\lambda} = \begin{cases} \dfrac{N + nu}{1 - u} & \text{if } \dfrac{2 - v(N+1)}{2 - v + vn} \le u \le 1 - v \\[6pt] \dfrac{N - Nv - nu}{1 - u - v + uv - uvn} & \text{if } 1 - Nv \le u \le \dfrac{2 - v(N+1)}{2 - v + vn} \\[6pt] \dfrac{N}{1 - u} & \text{if } 0 \le u \le 1 - Nv. \end{cases}$$

Clearly, $\hat T_{\hat\lambda}$ is not necessarily an integer while $T$ is such, and we round it to the nearest integer. Notice that when $\hat\lambda = \infty$ we get $\hat T_{\hat\lambda} = \hat T_{HTTG}$.

5. Estimate of species distribution and confidence intervals for T.

Since we now have an estimate for both the parameters $T$ and $\lambda$, we can use the posterior average probability of each species as an estimate of the species probabilities. For the observed species, i.e. for $i = 1, \dots, N$, this amounts to

$$\hat p_i = E_{\rho_{n, \hat T_{\hat\lambda}, \hat\lambda}}(p_i) = \frac{m_i + \hat\lambda}{\hat T_{\hat\lambda} \hat\lambda + n} = \frac{(m_i + \hat\lambda)(1 - \hat U)}{n + N \hat\lambda}. \tag{14}$$

This expression is correct also for $\hat\lambda = \infty$, in which case all species are estimated to have probability $(\hat T)^{-1}$.
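The case analysis for $\hat\lambda$ and $\hat T_{\hat\lambda}$ can be summarized in a few lines of code. This is a sketch under the formulas (9) and (13) and the three cases listed above; the function name `estimate_T` and the example prevalences are assumptions, and the rounding to the nearest integer is left to the caller.

```python
import math


def estimate_T(n, N, n_j):
    """Return (T_hat, lambda_hat) following the three cases for lambda_hat.

    n_j maps j to the number of species observed exactly j times.
    T_hat is returned unrounded.
    """
    n1 = n_j.get(1, 0)
    if n < 2 or n1 >= n:
        raise ValueError("need n >= 2 and at least one species seen more than once")
    q = sum(j * j * count for j, count in n_j.items())
    u = n1 / n                                   # unsmoothed Turing-Good estimate
    v = (q - n) / (n * (n - 1))                  # Good-Toulmin estimate (5)
    if u <= 1 - N * v:
        lam = math.inf                           # f stays negative: uniform-like case
    else:
        lam2 = (1 - u - v + u * v - u * v * n) / (N * v + u - 1)   # formula (13)
        lam = max(lam2, 1.0)                     # the prior requires lambda >= 1
    correction = 0.0 if math.isinf(lam) else n * u / lam
    return (N + correction) / (1 - u), lam      # formula (9)


if __name__ == "__main__":
    print(estimate_T(20, 10, {1: 5, 2: 2, 3: 2, 5: 1}))   # lambda_2 slightly above 1
    print(estimate_T(11, 5, {1: 2, 2: 2, 5: 1}))          # lambda_2 < 1, so lambda_hat = 1
    print(estimate_T(20, 10, {1: 4, 2: 3, 3: 2, 4: 1}))   # lambda_hat = infinity
```

The three calls are chosen so that each branch of the case analysis is exercised once; in the last one the estimate coincides with $\hat T_{HTTG}$ from (1).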
Also note that these values are close to the unbiased estimator $m_i/n$ of the probability of the $i$-th species and can be seen as a mixture of the Laplace add-$\lambda$ and Turing-Good estimators, since they are obtained by adding $\lambda$ to the frequency $m_i$ of the $N$ observed species (recall that $n = \sum_{i=1}^{N} m_i$), but only after having assigned the probability $\hat U$ to the event that we will observe a new species; the estimate of each of the $N$ species is then reduced by the factor $1 - \hat U$ to compensate for this and, in fact,

$$(\hat T_{\hat\lambda} - N) \, \frac{\hat\lambda (1 - \hat U)}{n + N \hat\lambda} = \hat U.$$

This is likely to be a sensible way to make the attenuation of the Laplace estimator (see [45]) finite. An alternative description of our estimator is then completed by using the previously estimated value of $\lambda$.

A simple approach for the unobserved species would be to uniformly split the probability $\hat U$ among the $\hat T_{\hat\lambda} - N$ unobserved species, and by the reconciliation method in (8) and (9) this would give

$$\frac{\hat U}{\hat T_{\hat\lambda} - N} = \frac{\hat\lambda}{\hat T_{\hat\lambda} \hat\lambda + n} = \frac{\hat\lambda (1 - \hat U)}{n + N \hat\lambda}.$$

On the other hand, notice that, since one can read (10) as $\sum_{k=1}^{N} \hat p_k(\lambda) = 1 - \hat U_{TG}$, the reconciliation method never used the moments of the $p_i$'s for $i > N$; therefore, we have some freedom in assigning the estimated values of the $p_i$'s for $i > N$. These values can then be estimated by taking into account the meaning of $\lambda^{-1} = \gamma^2$ as normalized species variance, or of some related quantities; we could then assign probabilities to the unobserved species to achieve the estimated normalized variance $\hat\gamma^2$ or to achieve some related equality. For simplicity we will actually focus on $\sum_{k=1}^{T} p_k^2$ and its estimator $\hat V$.
This is a valid approach except when $u < 1 - Nv$, in which case $f(\lambda) < 0$ and $\hat V$ turns out to be too small to be a reasonable estimate of $\sum_{k=1}^{T} p_k^2$; in that case we replace $\hat V$ with $\sum_{k=1}^{\hat T_{\hat\lambda}} E_{\rho_{n, \hat T_{\hat\lambda}, \hat\lambda}}(p_k^2)$. Clearly

$$\sum_{k=1}^{N} \big( E_{\rho_{n,T,\lambda}}(p_k) \big)^2 \le \hat V \vee \sum_{k=1}^{N} E_{\rho_{n,T,\lambda}}(p_k^2)$$

by Jensen's inequality, and thus we require that the estimates $\hat p_k$ of the probabilities of the unobserved species satisfy:

$$\sum_{k=N+1}^{\hat T_{\hat\lambda}} (\hat p_k)^2 = \Big\{ \hat V \vee \sum_{k=1}^{\hat T_{\hat\lambda}} E_{\rho_{n, \hat T_{\hat\lambda}, \hat\lambda}}(p_k^2) \Big\} - \sum_{k=1}^{N} \big( E_{\rho_{n, \hat T_{\hat\lambda}, \hat\lambda}}(p_k) \big)^2 =: \tilde V.$$

We can use any two parameter distribution, such as for instance $p_i = c\,\alpha^{i - N}$ for $i = N+1, \dots, \hat T_{\hat\lambda}$, and insist that

$$\sum_{i=N+1}^{\hat T_{\hat\lambda}} p_i = \hat U_{TG} \tag{15}$$

and

$$\sum_{i=N+1}^{\hat T_{\hat\lambda}} p_i^2 = \tilde V. \tag{16}$$

Solving for $c$ and $\alpha$ gives the estimated unobserved probabilities $\hat p_i = p_i(c, \alpha)$, which are used in the simulations of section 6 below to generate confidence intervals by resampling. It is easily seen that if $T \gg N$ then $\alpha/(1 - \alpha) \approx u/v$ and $c \approx u(1 - \alpha)/\alpha$.

6. Simulations.

In this section we present numerical simulations and tests of the performance of several estimators compared to those we have developed here. Tables 1-4 present the analysis of several populations with increasing values of $\gamma^2$. Tables 5-6 present some real tests based on discovering the number of letters in an alphabet from a long text. In table 7 we compute confidence intervals using a resampling based on the reconstructed species' probabilities as described in section 5 above.
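For the geometric choice $p_i = c\,\alpha^{i-N}$ of section 5, the two constraints (15) and (16) can be solved numerically rather than through large-$T$ approximations. The sketch below uses bisection on $\alpha \in (0, 1]$; it relies on the assumption, not stated in the paper, that the ratio $(\sum_i p_i)^2 / \sum_i p_i^2$ increases with $\alpha$ on that interval. The function name `geometric_tail` and its arguments are hypothetical.

```python
def geometric_tail(u, v_tilde, M, iterations=200):
    """Solve (15)-(16) for p_i = c * alpha**k, k = 1..M, by bisection on alpha.

    u: total unobserved probability, v_tilde: target sum of squares,
    M: number of unobserved species (T_hat - N). Returns (c, alpha, probs).
    """
    target = u * u / v_tilde
    if not 1.0 <= target <= M:
        raise ValueError("no geometric tail with alpha in (0, 1] matches these moments")

    def ratio(a):
        s1 = sum(a ** k for k in range(1, M + 1))
        s2 = sum(a ** (2 * k) for k in range(1, M + 1))
        return s1 * s1 / s2                      # equals (sum p)^2 / sum p^2 for any c

    lo, hi = 1e-9, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    alpha = (lo + hi) / 2
    c = u / sum(alpha ** k for k in range(1, M + 1))   # enforce (15) exactly
    return c, alpha, [c * alpha ** k for k in range(1, M + 1)]


if __name__ == "__main__":
    c, alpha, probs = geometric_tail(u=0.2, v_tilde=0.002, M=50)
    print(c, alpha, sum(probs), sum(p * p for p in probs))
```

Restricting $\alpha$ to $(0, 1]$ is a design choice of this sketch: it makes the unobserved probabilities decreasing in $i$ and removes the ambiguity between $\alpha$ and $1/\alpha$, which give the same ratio.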
The estimators compared in tables 1-6 are $\hat T_1$, $\hat T_2$ and $\hat T_{\hat\lambda}$ defined here, then $\hat T_{HTTG}$ from (1) (labelled $\hat T_{TG}$ in the tables), $\hat T_{CL}$ from (3), the Jackknife estimator with optimal parameter $\hat T_{JK}$ from [9] (// indicates numerical errors due to small denominators), and $\hat T_{+1}$, which is our (or the Chao-Lee) estimator with $\gamma^2 = 1$.

In tables 1-4 each population is generated from $T$ i.i.d. random variables, normalized to sum to 1; the resulting $\gamma^2$ is determined as normalized interspecies variance; 1000 simple random samples of size $n$ are then generated; finally, mean, SD and mean square error are computed for each estimator.

Tables 5 and 6 test the letter content of some passages in English and Italian in order to detect the number of letters in each alphabet. Each table shows the results of taking 1000 samples of about 9000 letters each from the indicated texts.

The conclusion that can be drawn from these tests is that estimator performances are seen to depend on $\gamma^2$, with $\hat T_{\hat\lambda}$ presenting a consistently low value of the MSE as long as $\gamma^2 \in [0, 1]$. Therefore, $\hat T_{\hat\lambda}$ has the flexibility to adapt to the different values of the interspecies variance. In table 1, in fact, $\gamma^2 \approx 0$ and the best estimators turn out to be $\hat T_{HTTG}$ and $\hat T_{CL}$ (in which clearly $\gamma^2$ gets appropriately estimated), but all the estimators defined in the present paper perform equally well. In the less uniform population in table 2, Jackknife and $\hat T_{\hat\lambda}$ show the best performances; and in table 3, where $\gamma^2 \approx 1$, the best estimator turns out to be $\hat T_{+1}$, while $\hat T_{\hat\lambda}$ has only a slightly worse performance. Note that $\hat T_1$ and $\hat T_2$ show a very poor performance in tables 2 and 3. Finally, table 4 shows an extremely skewed population, with $\gamma^2$ very large, for which no estimator works properly.
The reason why $\hat T_{\hat\lambda}$ fails here is that convergence of the prior imposes $\gamma^{-2} = \lambda \ge 1$. Even in the alphabet test the performance of $\hat T_{\hat\lambda}$ turns out to be overall best.

Table 7 shows some simulations about confidence intervals for $T$ based on samples of size $n = 400$ computed from $\hat T_{\hat\lambda}$ by estimating the species probabilities $p_k$ as described in section 5 and then resampling 1000 times from the estimated population distribution. This process is repeated 100 times and table 7 indicates, for the populations of tables 1-3 respectively, the percentage of times the confidence interval hits the true value of $T = 1000$ and the average size of the confidence interval.

The hitting percentage comes out remarkably well, due to the good approximation of the true population distribution by the estimated one.

T = 1000

| Estimator | mean (n=500) | std (n=500) | MSE (n=500) | mean (n=1000) | std (n=1000) | MSE (n=1000) | mean (n=2000) | std (n=2000) | MSE (n=2000) |
|---|---|---|---|---|---|---|---|---|---|
| $\hat T_{TG}$ | 994 | 79 | 79 | 999 | 36 | 36 | 997 | 16 | 16 |
| $\hat T_{CL}$ | 1010 | 86 | 87 | 1009 | 42 | 43 | 1000 | 18 | 18 |
| $\hat T_{JK}$ | 1068 | 96 | 117 | 1223 | 84 | 239 | 1117 | 165 | 203 |
| $\hat T_{+1}$ | 1759 | 157 | 775 | 1580 | 73 | 585 | 1309 | 32 | 311 |
| $\hat T_1$ | 1003 | 82 | 82 | 1005 | 39 | 40 | 1000 | 18 | 18 |
| $\hat T_2$ | 1017 | 83 | 86 | 1010 | 38 | 40 | 1027 | 54 | 60 |
| $\hat T_{\lambda}$ | 1087 | 193 | 212 | 1026 | 60 | 66 | 1002 | 20 | 20 |

Table 1. Uniform population: $p_i$'s $\sim N(0, 1)$, $\gamma^2 \approx 0.009$.

T = 1000

| Estimator | mean (n=500) | std (n=500) | MSE (n=500) | mean (n=1000) | std (n=1000) | MSE (n=1000) | mean (n=2000) | std (n=2000) | MSE (n=2000) |
|---|---|---|---|---|---|---|---|---|---|
| $\hat T_{TG}$ | 781 | 59 | 226 | 816 | 30 | 186 | 858 | 15 | 142 |
| $\hat T_{CL}$ | 808 | 72 | 205 | 847 | 40 | 158 | 893 | 20 | 109 |
| $\hat T_{JK}$ | 962 | 88 | 96 | 1034 | 88 | 94 | 1054 | 763 | 764 |
| $\hat T_{+1}$ | 1342 | 116 | 361 | 1245 | 59 | 252 | 1118 | 30 | 122 |
| $\hat T_1$ | 796 | 64 | 213 | 835 | 35 | 168 | 884 | 19 | 117 |
| $\hat T_2$ | 787 | 59 | 220 | 816 | 30 | 186 | 858 | 15 | 142 |
| $\hat T_{\lambda}$ | 915 | 189 | 207 | 891 | 65 | 127 | 912 | 26 | 92 |

Table 2. Less uniform population: $p_i$'s $\sim U[0, 1]$, $\gamma^2 \approx 0.3317$.

7. APPENDIX 1: The Bayesian approach.
By definition of the gamma and beta functions

$$\Gamma(x) = \int_0^{+\infty} e^{-t} t^{x-1} \, dt, \quad x > 0$$

and

$$\beta(x, y) = \int_0^1 t^{x-1} (1 - t)^{y-1} \, dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)},$$

taking $z = y/(1 - x)$ we get

$$\int_0^{1-x} y^a (1 - x - y)^b \, dy = \int_0^1 (1 - x)^{a+b+1} z^a (1 - z)^b \, dz = (1 - x)^{a+b+1} \frac{\Gamma(a+1)\Gamma(b+1)}{\Gamma(a+b+2)}.$$

Next, let $\rho_{n,T,\lambda}$ be the Bayesian posterior, given a sample with species records $m_1, \dots, m_N$, from a Dirichlet prior with parameter $\lambda$ on

$$Q_T = \Big\{ p = (p_1, \dots, p_{T-1}) : p_k > 0, \; \sum_{k=1}^{T-1} p_k \le 1 \Big\}.$$

T = 1000

| Estimator | mean (n=500) | std (n=500) | MSE (n=500) | mean (n=1000) | std (n=1000) | MSE (n=1000) | mean (n=2000) | std (n=2000) | MSE (n=2000) |
|---|---|---|---|---|---|---|---|---|---|
| $\hat T_{TG}$ | 620 | 43 | 382 | 659 | 24 | 341 | 759 | 16 | 241 |
| $\hat T_{CL}$ | 690 | 65 | 316 | 784 | 46 | 220 | 888 | 30 | 115 |
| $\hat T_{JK}$ | 870 | 119 | 176 | 955 | 432 | 435 | 1027 | 661 | 662 |
| $\hat T_{+1}$ | 1036 | 85 | 92 | 990 | 47 | 48 | 1013 | 30 | 30 |
| $\hat T_1$ | 658 | 52 | 345 | 733 | 33 | 268 | 845 | 23 | 156 |
| $\hat T_2$ | 620 | 43 | 382 | 659 | 24 | 341 | 759 | 16 | 241 |
| $\hat T_{\lambda}$ | 910 | 164 | 187 | 973 | 66 | 71 | 1001 | 42 | 42 |

Table 3. Non-uniform population: $p_i$'s $\sim Exp(1)$, $\gamma^2 \approx 0.9992$.

T = 1000

| Estimator | mean (n=500) | std (n=500) | MSE (n=500) | mean (n=1000) | std (n=1000) | MSE (n=1000) | mean (n=2000) | std (n=2000) | MSE (n=2000) |
|---|---|---|---|---|---|---|---|---|---|
| $\hat T_{TG}$ | 192 | 10 | 808 | 228 | 8 | 772 | 261 | 7 | 738 |
| $\hat T_{CL}$ | 262 | 26 | 737 | 346 | 28 | 654 | 416 | 29 | 583 |
| $\hat T_{JK}$ | 326 | 486 | 830 | // | // | // | // | // | // |
| $\hat T_{+1}$ | 271 | 19 | 729 | 304 | 15 | 696 | 334 | 14 | 666 |
| $\hat T_1$ | 231 | 16 | 768 | 291 | 14 | 708 | 344 | 14 | 656 |
| $\hat T_2$ | 192 | 10 | 808 | 228 | 8 | 772 | 261 | 7 | 738 |
| $\hat T_{\lambda}$ | 271 | 19 | 729 | 304 | 15 | 696 | 334 | 14 | 666 |

Table 4. Extremely skewed population: $p_i$'s $\sim \Gamma(1, 1)$, $\gamma^2 \approx 9.1289$.

T = 26

| Estimator | mean (n=15) | std (n=15) | MSE (n=15) | mean (n=25) | std (n=25) | MSE (n=25) | mean (n=50) | std (n=50) | MSE (n=50) |
|---|---|---|---|---|---|---|---|---|---|
| $\hat T_{TG}$ | 19.8 | 10.3 | 70 | 18.6 | 3.6 | 56 | 19.5 | 2.5 | 43 |
| $\hat T_{CL}$ | 22.7 | 13.3 | 71 | 20.7 | 5.6 | 56 | 21.6 | 4.0 | 37 |
| $\hat T_{JK}$ | 19.4 | 6.9 | 60 | 23.0 | 9.8 | 58 | // | // | // |
| $\hat T_{+1}$ | 33.3 | 20.3 | 109 | 27.0 | 7.0 | 56 | 25.6 | 4.8 | 29 |
| $\hat T_{\lambda}$ | 26.9 | 15.8 | 87 | 23.2 | 7.5 | 60 | 23.4 | 5.3 | 36 |

Table 5. Estimates for the 26 letters English alphabet from samples drawn from [10]; $\gamma^2 \approx 0.7029$ (see [36]).

T = 21

| Estimator | mean (n=15) | std (n=15) | MSE (n=15) | mean (n=25) | std (n=25) | MSE (n=25) | mean (n=50) | std (n=50) | MSE (n=50) |
|---|---|---|---|---|---|---|---|---|---|
| $\hat T_{TG}$ | 16.0 | 6.9 | 62 | 15.7 | 3.4 | 43 | 16.7 | 2.0 | 31 |
| $\hat T_{CL}$ | 18.4 | 9.8 | 71 | 17.7 | 5.4 | 45 | 18.5 | 3.3 | 27 |
| $\hat T_{JK}$ | 16.9 | 6.3 | 51 | 19.6 | 8.2 | 65 | // | // | // |
| $\hat T_{+1}$ | 26.3 | 14.3 | 113 | 23.1 | 6.8 | 48 | 21.5 | 4.0 | 27 |
| $\hat T_{\lambda}$ | 21.5 | 12.2 | 90 | 19.9 | 7.3 | 49 | 19.8 | 4.3 | 29 |

Table 6. Estimates for the 21 letters Italian alphabet from samples drawn from [2]; $\gamma^2 \approx 0.5932$ (see [48]).

Note that $\rho_{n,T,\lambda}$ is invariant under permutation of the $p_k$'s, so it is valid to express any result via a permutation of indices from a proven statement. Therefore, in the following Theorems it is sufficient to prove the results for some index $i$.

Theorem 7.1. For every $\lambda \ge 1$ and for every $i = 1, \dots, T$,

$$E_{\rho_{n,T,\lambda}}(p_i) = \frac{m_i + \lambda}{T\lambda + n}. \tag{17}$$

Proof. For $i \in \{1, \dots, T-1\}$ we have:

$$E_{\rho_{n,T,\lambda}}(p_i) = \int_{Q_T} p_i \, \rho_{n,T,\lambda}(d\mu) = \frac{\int_{Q_T} p_1^{m_1+\lambda-1} \cdots p_i^{m_i+\lambda} \cdots (1 - p_1 - \dots - p_{T-1})^{m_T+\lambda-1} \, dp_1 \dots dp_{T-1}}{\int_{Q_T} p_1^{m_1+\lambda-1} \cdots p_i^{m_i+\lambda-1} \cdots (1 - p_1 - \dots - p_{T-1})^{m_T+\lambda-1} \, dp_1 \dots dp_{T-1}}.$$

For $k = 1, \dots, T$, let

$$s_k = m_k + \lambda - 1, \qquad \hat s_k = s_k + \delta(k, i)$$

where $\delta$ is the Kronecker delta, and for $k = 1, \dots, T-1$ let

$$I(k) = \int_{Q_{k+1}} p_1^{s_1} \cdots p_k^{s_k} (1 - p_1 - \dots - p_k)^{s_T + \dots + s_{k+1} + T - k - 1} \, dp_1 \dots dp_k$$

and

$$G(k) = \frac{\Gamma(s_k + 1)\,\Gamma(s_T + \dots + s_{k+1} + T - k)}{\Gamma(s_T + \dots + s_{k+1} + s_k + T - k + 1)}$$

and let $\hat I(k)$ and $\hat G(k)$ be as the quantities without hat but with $\hat s_k$ replacing $s_k$, so that

$$E_{\rho_{n,T,\lambda}}(p_i) = \frac{\hat I(T-1)}{I(T-1)}.$$
Now we have
\[ I(T-1) = \frac{\Gamma(s_{T-1}+1)\,\Gamma(s_T+1)}{\Gamma(s_T + s_{T-1} + 2)}\, I(T-2) = G(T-1)\, \frac{\Gamma(s_{T-2}+1)\,\Gamma(s_T + s_{T-1} + 2)}{\Gamma(s_T + s_{T-1} + s_{T-2} + 3)}\, I(T-3) = \cdots = \prod_{k=1}^{T-1} G(k) = \frac{\Gamma(s_T+1) \cdots \Gamma(s_1+1)}{\Gamma(s_T + \dots + s_1 + T)} \]
and
\[ \hat I(T-1) = \frac{\Gamma(\hat s_T+1) \cdots \Gamma(\hat s_1+1)}{\Gamma(\hat s_T + \dots + \hat s_1 + T)} = \frac{\Gamma(s_T+1) \cdots \Gamma(s_i+2) \cdots \Gamma(s_1+1)}{\Gamma(s_T + \dots + s_1 + T + 1)}. \]
Therefore,
\[ E_{\rho_{n,T,\lambda}}(p_i) = \frac{\Gamma(s_i+2)\,\Gamma(s_T + \dots + s_1 + T)}{\Gamma(s_i+1)\,\Gamma(s_T + \dots + s_1 + T + 1)} = \frac{s_i + 1}{s_T + \dots + s_1 + T} = \frac{m_i + \lambda}{m_1 + \dots + m_T + T\lambda} = \frac{m_i + \lambda}{T\lambda + n}. \]

It is easily verified that $\sum_{k=1}^{T} \frac{m_k + \lambda}{T\lambda + n} = 1$. Moreover, adding these values over the $T - N$ unobserved species we get an estimate of $U$:
\[ \hat U_{+\lambda} = E_{\rho_{n,T,\lambda}}(U) = E_{\rho_{n,T,\lambda}}\Big( \sum_{m_i = 0} p_i \Big) = \sum_{i=N+1}^{T} E_{\rho_{n,T,\lambda}}(p_i) = \frac{(T-N)\lambda}{T\lambda + n}. \]

| Confidence level | Quantity | Population 1 | Population 2 | Population 3 |
|---|---|---|---|---|
| 90% | fraction of hits | 93% | 92% | 80% |
| 90% | average interval size | 1115 | 821 | 707 |
| 95% | fraction of hits | 95% | 98% | 89% |
| 95% | average interval size | 1225 | 889 | 827 |
| 99% | fraction of hits | 97% | 100% | 98% |
| 99% | average interval size | 1520 | 1064 | 977 |

Table 7. Summary of confidence interval performances at the given confidence level from $\hat T_{\hat\lambda}$ by resampling.

Lemma 7.1. For every $\lambda \ge 1$ and $i, j = 1, \dots, T$ such that $i \ne j$,
\[ E_{\rho_{n,T,\lambda}}(p_i p_j) = \frac{(m_i + \lambda)(m_j + \lambda)}{(T\lambda + n + 1)(T\lambda + n)}. \tag{18} \]

Proof. Following the proof of Theorem 7.1 let, for $k = 1, \dots, T$,
\[ s_k = m_k + \lambda - 1, \qquad \hat s_k = s_k + \delta(i,k) + \delta(j,k), \qquad i \ne j,\ 1 \le i, j \le T-1. \]
Thus
\[ E_{\rho_{n,T,\lambda}}(p_i p_j) = \int_{Q_T} p_i p_j\, \rho_{n,T,\lambda}(d\mu) = \frac{\hat I(T-1)}{I(T-1)}, \]
where
\[ I(T-1) = \frac{\Gamma(s_T+1) \cdots \Gamma(s_1+1)}{\Gamma(s_T + \dots + s_1 + T)}, \qquad \hat I(T-1) = \frac{\Gamma(s_T+1) \cdots \Gamma(s_i+2) \cdots \Gamma(s_j+2) \cdots \Gamma(s_1+1)}{\Gamma(s_T + \dots + s_1 + T + 2)}. \]
Therefore
\[ E_{\rho_{n,T,\lambda}}(p_i p_j) = \frac{\Gamma(s_i+2)\,\Gamma(s_j+2)\,\Gamma(s_T + \dots + s_1 + T)}{\Gamma(s_i+1)\,\Gamma(s_j+1)\,\Gamma(s_T + \dots + s_1 + T + 2)} = \frac{(s_i+1)(s_j+1)}{(\sum_k s_k + T)(\sum_k s_k + T + 1)} = \frac{(m_i + \lambda)(m_j + \lambda)}{(T\lambda + n)(T\lambda + n + 1)}. \]

Lemma 7.2. For every $\lambda \ge 1$ and for every $k = 1, \dots, T$,
\[ E_{\rho_{n,T,\lambda}}(p_k^2) = \frac{(m_k + \lambda + 1)(m_k + \lambda)}{(T\lambda + n + 1)(T\lambda + n)}. \tag{19} \]

Proof. As in Theorem 7.1, for $k = 1, \dots, T$ and $i \in \{1, \dots, T-1\}$ let
\[ s_k = m_k + \lambda - 1, \qquad \hat s_k = s_k + 2\delta(k,i). \]
So,
\[ E_{\rho_{n,T,\lambda}}(p_i^2) = \int_{Q_T} p_i^2\, \rho_{n,T,\lambda}(d\mu) = \frac{\hat I(T-1)}{I(T-1)}, \]
where
\[ I(T-1) = \frac{\Gamma(s_T+1) \cdots \Gamma(s_1+1)}{\Gamma(s_T + \dots + s_1 + T)}, \qquad \hat I(T-1) = \frac{\Gamma(s_T+1) \cdots \Gamma(s_i+3) \cdots \Gamma(s_1+1)}{\Gamma(s_T + \dots + s_1 + T + 2)}. \]
Therefore, for $i = 1, \dots, T-1$,
\[ E_{\rho_{n,T,\lambda}}(p_i^2) = \frac{\Gamma(s_i+3)\,\Gamma(s_T + \dots + s_1 + T)}{\Gamma(s_i+1)\,\Gamma(s_T + \dots + s_1 + T + 2)} = \frac{(s_i+1)(s_i+2)}{(\sum_{k=1}^{T} s_k + T)(\sum_{k=1}^{T} s_k + T + 1)} = \frac{(m_i + \lambda)(m_i + \lambda + 1)}{(T\lambda + n)(T\lambda + n + 1)}. \]

Lemma 7.3. If $q = \sum_{j \ge 0} j^2 n_j = \sum_{k=1}^{T} m_k^2$, we have
\[ \sum_{k=1}^{T} E_{\rho_{n,T,\lambda}}(p_k^2) = \frac{q + n(2\lambda + 1) + T(\lambda^2 + \lambda)}{(T\lambda + n + 1)(T\lambda + n)}. \tag{20} \]

Proof. We have
\[ \sum_{k=1}^{T} E_{\rho_{n,T,\lambda}}(p_k^2) = \sum_{k=1}^{T} \frac{(m_k + \lambda)(m_k + \lambda + 1)}{(T\lambda + n)(T\lambda + n + 1)} = \frac{\sum_k m_k^2 + n(2\lambda + 1) + T(\lambda^2 + \lambda)}{(T\lambda + n + 1)(T\lambda + n)} = \frac{q + n(2\lambda + 1) + T(\lambda^2 + \lambda)}{(T\lambda + n + 1)(T\lambda + n)}. \]

8. APPENDIX 2: Some properties of the function defining $\lambda$.

Let $u = \hat U$ and $v = \hat V$. We now treat $u$ and $v$ as free variables subject to some constraints that are satisfied by the values $\hat U$ and $\hat V$ actually take in our estimation, namely $\hat U = \frac{n_1}{n}$ and $\hat V = \hat V_{GT} = \frac{\sum_{j \ge 1} j(j-1) n_j}{n(n-1)}$. Also let $q = v\,n(n-1) + n = \sum_{j > 0} j^2 n_j$. Then:

Lemma 8.1.
For every sample, $\hat U + \hat V \le 1$.

Proof. Since $q = \sum_{j>0} j^2 n_j$ and $n = \sum_{j>0} j\, n_j$, we have that
\[ \hat U + \hat V = \frac{n_1}{n} + \frac{q - n}{n(n-1)} \le 1 \]
is implied by
\[ -n n_1 + n_1 - q + n^2 = n_1 + \Big( \sum_{j=1}^{N} j\, n_j \Big)^2 - \sum_{j=1}^{N} n_j (j n_1 + j^2) = \sum_{j=1}^{N} j^2 (n_j^2 - n_j) + n_1 + \sum_{j,r = 1, \dots, N,\ j \ne r} j\, n_j\, r\, n_r - n_1 \sum_{j=1}^{N} j\, n_j \ge (n_1^2 - n_1) + n_1 + n_1 \sum_{j=2}^{N} j\, n_j - n_1^2 \ge 0. \tag{21} \]

Lemma 8.2. For every sample, $N + \hat U - 1 - n \hat U \ge 0$.

Proof. By definition of $\hat U$ we have
\[ N + \hat U - 1 - n \hat U = N - n_1 + \frac{n_1}{n} - 1; \]
then either $n = n_1 = N$ and the right-hand side becomes 0, or $N - n_1 \ge 1$ and the relation holds.

Lemma 8.3. For every sample, $(\hat V n N - \hat V N + N - n)\, n = qN - n^2 \ge 0$.

Proof. Expressing $q$, $N$ and $n$ as functions of the $n_j$'s we get
\[ q = \sum_{j>0} j^2 n_j, \qquad N = \sum_{j>0} n_j, \qquad n = \sum_{j>0} j\, n_j. \]
Then
\[ qN - n^2 = \Big( \sum_{j>0} j^2 n_j \Big) \Big( \sum_{k>0} n_k \Big) - \Big( \sum_{j>0} j\, n_j \Big)^2 = \Big( \sum_{j>0} j^2 n_j^2 + \sum_{j>0} \sum_{k \ne j} j^2 n_j n_k \Big) - \Big( \sum_{j>0} j^2 n_j^2 + \sum_{j>0} \sum_{k \ne j} j\, n_j\, k\, n_k \Big) = \sum_{j>0} \sum_{k \ne j} (j^2 - jk)\, n_j n_k = \sum_{j>0} \sum_{j<k} (j - k)^2\, n_j n_k \ge 0. \]

[...] $\lambda > \beta_1$.

Proof. Let
\[ f'(\lambda) = \frac{(1-u)\, g(\lambda)}{(n + N\lambda)(1 - u + n + N\lambda)}. \]
Then
\[ \lim_{\lambda \to \beta_1^{+}} g(\lambda) = n(1-u)^2 (vnN - vN + N - n) > 0 \tag{33} \]
by (26). Note that $g$ satisfies
\[ g'(\lambda) = 2N^2 (N + u - 1 - nu)\,\lambda + 2nN(-1 - n + 2N + u - Nu - Nv + nNv + Nuv - nNuv), \]
with the leading coefficient nonnegative by (25). Therefore, if $\lambda > \beta_1 = -\frac{n}{N}$,
\[ g'(\lambda) > 2nN(1-u)(vnN - vN + N - n) \ge 0, \]
again by (26). Thus $g' > 0$ for all $\lambda > \beta_1$ and, by (33), $g > 0$ for all $\lambda > \beta_1$; since the other factors in $f'$ are also positive, we have that $f' > 0$ for all $\lambda > \beta_1$, as required.
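The closed forms (17) and (20) and the sample inequalities of Lemmas 8.1, 8.2 and 8.3 lend themselves to a quick numerical sanity check. The following is a minimal sketch in Python, not part of the original paper: the function names `sample_statistics`, `lemmas_hold` and `posterior_moments` are hypothetical, the sample is assumed to have size $n \ge 2$, and $\lambda$ is assumed to be an integer or a `Fraction` so that the arithmetic stays exact.

```python
from collections import Counter
from fractions import Fraction
import random


def sample_statistics(sample):
    """Return (n, N, n1, q, U_hat, V_hat) for a list of species labels (n >= 2)."""
    m = Counter(sample)                      # m_k: times species k was observed
    n_j = Counter(m.values())                # n_j: number of species observed j times
    n = sum(j * c for j, c in n_j.items())   # sample size
    N = sum(n_j.values())                    # number of observed species
    n1 = n_j[1]                              # number of singletons
    q = sum(j * j * c for j, c in n_j.items())
    U_hat = Fraction(n1, n)                  # U_hat = n_1 / n
    V_hat = Fraction(q - n, n * (n - 1))     # equals sum_j j(j-1) n_j / (n(n-1))
    return n, N, n1, q, U_hat, V_hat


def lemmas_hold(sample):
    """Check the inequalities of Lemmas 8.1, 8.2 and 8.3 on one sample."""
    n, N, n1, q, U, V = sample_statistics(sample)
    lemma_8_1 = U + V <= 1
    lemma_8_2 = N + U - 1 - n * U >= 0
    lhs = (V * n * N - V * N + N - n) * n
    lemma_8_3 = (lhs == q * N - n * n) and lhs >= 0
    return lemma_8_1, lemma_8_2, lemma_8_3


def posterior_moments(counts, lam):
    """First and second posterior moments from (17) and (19).

    `counts` lists m_k for all T species, zeros included (hypothetical input layout).
    """
    T, n = len(counts), sum(counts)
    d = T * lam + n
    first = [Fraction(m + lam) / d for m in counts]
    second = [Fraction((m + lam) * (m + lam + 1)) / (d * (d + 1)) for m in counts]
    return first, second


if __name__ == "__main__":
    random.seed(0)
    species = list(range(30))
    weights = [1.0 / (k + 1) for k in species]   # an arbitrary skewed population
    for _ in range(200):
        size = random.randint(2, 60)
        sample = random.choices(species, weights=weights, k=size)
        assert all(lemmas_hold(sample))
    print("Lemmas 8.1-8.3 held on all simulated samples")
```

Exact rational arithmetic is used so that the equality in Lemma 8.3 can be tested without floating-point tolerance; a simulation of this kind only illustrates the statements and does not replace the proofs.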
Now there are three possibilities.

1. If $u \le 1 - Nv$, then from (32) and the above Lemma it follows that $f < 0$ for all $\lambda > \beta_1$ and $f$ is increasing, thus $\hat\lambda = \arg\inf_{\lambda \ge 1} |f(\lambda)| = \arg\max_{\lambda \ge 1} f = +\infty$.
2. If $1 - Nv < u$, then from (29) $\lambda_2 \ge 1$ is equivalent to $u \le \frac{2 - v(N+1)}{2 - v + vn}$, in which case $\hat\lambda = \lambda_2$.
3. If $\frac{2 - v(N+1)}{2 - v + vn} < u$, then $\lambda_2 < 1$ and by the Lemma above $\hat\lambda = \arg\inf_{\lambda \ge 1} |f(\lambda)| = \arg\min_{\lambda \ge 1} f = 1$.

The conditions on $u$ and $v$ are translated into those for $\lambda_2$ and $\beta_1$ by direct calculation.

Acknowledgements. This work was done during visits by one of us (CCAS) to the Università di Roma, Tor Vergata; Università di Milano-Bicocca; and Università di Firenze. He takes pleasure in thanking those universities for their warm hospitality and GNAMPA for its support.

References.

[1] Almudevar, A., Bhattacharya, R.N. and Sastri, C.C.A. (2000): Estimating the Probability Mass of Unobserved Support in Random Sampling, J. Stat. Planning and Inference, 91, 91-105.
[2] M. Adamo, La matematica nell'antica Cina, Osiris, Vol. 15 (1968), pp. 175-195.
[3] Bishop, Y. M. M., Fienberg, S. E., Holland, P. W. (1975): Discrete multivariate analysis: theory and practice, Cambridge, MIT Press.
[4] Boender, C. G. E., Rinnooy Kan, A. H. G. (1987): A multinomial Bayesian Approach to the Estimation of Population and Vocabulary Size, Biometrika 74 No. 4, 849-856.
[5] Böhning, D., Schön, D. (2005): Nonparametric maximum likelihood estimation of population size based on the counting distribution, Journal of the Royal Stat. Soc. (C) Appl. Statist. 54, Part 4, 721-737.
[6] Böhning, D., Suppawattanabe, B., Kusolvisitkul, W., Vivatwongkasem, C. (2004): Estimating the number of drug users in Bangkok 2001: A capture-recapture approach using repeated entries in the list, Europ. J. of Epidemiology 19, 1075-1083.
[7] Brose, U., Martinez, M.D., Williams, R. J. (2003): Estimating species richness: sensitivity to sample coverage and insensitivity to spatial patterns, Ecology 84 No. 9, 2364-2377.
[8] Bunge, J., Fitzpatrick, M. (1993): Estimating the number of species: a Review, J. Amer. Stats. Assn. 88 No. 421, 364-373.
[9] Burnham, K.P., Overton, W. S. (1979): Robust estimation of population size when capture probabilities vary among animals, Ecology 60 No. 5, 927-936.
[10] D. Burton, The History of Mathematics: An Introduction, McGraw-Hill, 2003.
[11] Carothers (1993): Estimating the number of species: a Review, J. Amer. Stats. Assn. 88 No. 421, 364-373.
[12] Chao, A. (1984): Nonparametric estimation of the number of classes in a population, Sc. J. of Stat. 11, 265-270.
[13] Chao, A., Lee, S-M. (1992): Estimating the number of classes via sample coverage, J. Amer. Stat. Assn., 87 No. 417, 210-217.
[14] Church, K. W., Gale, W. A. (2006): Enhanced Good-Turing and Cat-Cal: two new methods for estimating probabilities of English bigrams, Preprint.
[15] Colwell: Estimates. Software Freeware. See http://viceroy.eeb.uconn.edu/estimates
[16] Darroch, J.N., Ratcliff (1980): A Note on Capture-Recapture Estimation, Biometrics, 36, 149-153.
[17] Edwards, W.R., Eberhardt, L.L. (1967): Estimating cottontail abundance from live trapping data, J. of Wildlife Manag. 33, 28-39.
[18] Efron, B. (1981): Nonparametric standard errors and confidence intervals, Canadian J. Statist. 9, 139-172.
[19] Efron, B., Thisted, R. (1976): Estimating the number of unseen species: how many words did Shakespeare know?, Biometrika 63, 435-467.
[20] Esty, W.W. (1985): Estimation of the Number of Classes in a Population and the Coverage of a Sample, Mathematical Scientist, 10, 41-50.
[21] Esty, W.W. (1986): The size of a coverage, Numismatic Chronicle, 146, 185-215.
[22] Fisher, R.A., Steven Corbet, A., Williams, C.B. (1943): The relation between the number of species and the number of individuals in a random sample of an animal population, J. An. Ecol., 12 No. 1, 42-58.
[23] Gandolfi, A., Sastri, C.C.A. (2004): Nonparametric Estimations about Species not observed in a Random Sample, Milan J. Math 72, 81-105.
[24] Good, I. J. (1953): The population frequencies of species and the estimation of population parameters, Biometrika 40, 237-266.
[25] Good, I. J. (1965): The estimation of probabilities: an essay on modern Bayesian method, Research Monograph No. 30, MIT Press.
[26] Good, I. J. (1967): A Bayesian significance test for multinomial distributions, J. Roy. Statist. Soc. Ser. B 29, 399-431.
[27] Good, I. J. and Toulmin, G. (1956): The number of new species and the increase in population coverage when a sample is increased, Biometrika 43, 45-63.
[28] Harris, B. (1968): Statistical inference in the classical occupancy problem: unbiased estimation of the number of classes, J. Amer. Stat. Assn. 63, 837-847.
[29] Huang, S-P and Weir, B.S. (2001): Estimating the Total Number of Alleles Using a Sample Coverage Method, Genetics 159, 1365-1373.
[30] Huang, J. (2006): Maximum likelihood estimation of Dirichlet distribution parameters, Manuscript.
[31] Jedynak, B., Khudanpur, S., Yazgan, A. (2005): Estimating Probabilities from Small Samples, 2005 Proceedings of the American Statistical Association, Statistical Computing Section [CD-ROM], Alexandria, VA: American Statistical Association.
[32] Jeffreys, H. (1961): Theory of probability, Clarendon Press, Oxford, Third Edition.
[33] Johnson, W. E. (1932): Probability: the deductive and inductive problems, Mind 49, 409-423.
[34] Laplace (1995): Philosophical essays in Probabilities, Springer Verlag, New York.
[35] Lehmann, E. L. (1983): Theory of point estimation, Wiley ed., New York.
[36] R.E. Lewand, Relative Frequencies of Letters in General English Plain text, Cryptographical Mathematics.
[37] Lewontin, P., Prout, T. (1956): Estimation of the different classes in a population, Biometrics 12, 211-223.
[38] Lijoi, A., Mena, H. R., Prünster, I. (2007): Bayesian nonparametric estimation of the probability of discovering new species, Preprint.
[39] Lindsay, B. G., Roeder, K. (1987): A unified treatment of integer parameter models, J. Am. Statist. Ass. 82, 758-764.
[40] Mao, C.X. (2004): Predicting the conditional probability of discovering a new class, Journal of the American Statistical Association, 99, 1108-1118.
[41] Marchand, J.P. and Schroeck, F.E. (1982): On the Estimation of the Number of Equally Likely Classes in a Population, Communications in Statistics, Part A – Theory and Methods, 11, 1139-1146.
[42] McAllester, D. and Schapire, R.E. (2000): On the Convergence Rate of Good-Turing Estimators, Conference On Computing Learning Theory (COLT), 1-6.
[43] McNeil, D. (1973): Estimating an author's vocabulary, J. Am. Stat. Ass., 68 No. 341, 92-96.
[44] Norris III, J. L., Pollock, K.H. (1998): Non-parametric MLE for Poisson species abundance models allowing for heterogeneity between species, Environ. Ecol. Statist., 5, 391-402.
[45] Orlitsky, A., Santhanam, N. P., Zhang, J. (2003): Always Good Turing: Asymptotically Optimal Probability Estimation, Science, 302 No. 5644, 427-431.
[46] Pitman, J. (2005): Combinatorial stochastic processes, Lecture Notes for the St. Flour Summer School.
[47] Shen, T-J., Chao, A., Lin, C-F. (2003): Predicting the number of new species in further taxonomic sampling, Ecology, 84 No. 3, 798-804.
[48] Simon Singh, Codici e Segreti, 1999.
[49] Zabell, S. L. (1982): W. E. Johnson's "Sufficientness" Postulate, The Annals of Statistics, 10 No. 4, 1090-1099.

Address of L. Cecconi and A. Gandolfi
Dipartimento di Matematica U. Dini, Università di Firenze, Viale Morgagni 67/A, 50134 Firenze, Italy
E-mail: cecconi@math.unifi.it, gandolfi@math.unifi.it

Address of C.C.A. Sastri
Department of Mathematics and Statistics, Missouri University of Science and Technology (Formerly University of Missouri - Rolla), Rolla, MO 65409 USA
E-mail: sastric@mst.edu
