Kendall's tau in high-dimensional genomic parsimony
High-dimensional data models, often with low sample size, abound in many interdisciplinary studies, genomics and large biological systems being most noteworthy. The conventional assumption of multinormality or linearity of regression may not be plaus…
Authors: Pranab K. Sen
IMS Collections
Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh
Vol. 3 (2008) 251–266
© Institute of Mathematical Statistics, 2008
DOI: 10.1214/074921708000000183

Pranab K. Sen¹
University of North Carolina, Chapel Hill

Abstract: High-dimensional data models, often with low sample size, abound in many interdisciplinary studies, genomics and large biological systems being most noteworthy. The conventional assumption of multinormality or linearity of regression may not be plausible for such models which are likely to be statistically complex due to a large number of parameters as well as various underlying restraints. As such, parametric approaches may not be very effective. Anything beyond parametrics, albeit, having increased scope and robustness perspectives, may generally be baffled by the low sample size and hence unable to give reasonable margins of errors. Kendall's tau statistic is exploited in this context with emphasis on dimensional rather than sample size asymptotics. The Chen–Stein theorem has been thoroughly appraised in this study. Applications of these findings in some microarray data models are illustrated.

Contents
1 Introduction
2 An illustrative data model
3 Some HDLSS formulations
4 Dimensional asymptotics and the union intersection test
5 Dimensional asymptotics and Chen–Stein theorem
Acknowledgments
References

∗ Supported in part by the C. C. Boshamer Research Foundation at the University of North Carolina, Chapel Hill.
¹ Departments of Biostatistics and Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599-7420, USA, e-mail: pksen@bios.unc.edu
AMS 2000 subject classifications: Primary 62G10, 62G99; secondary 62P99.
Keywords and phrases: bioinformatics, Chen–Stein theorem, dimensional asymptotics, FDR, multiple hypotheses testing, nonparametrics, permutational invariance, U-statistics.

1. Introduction

The past three decades have witnessed a phenomenal growth of research literature on statistical methods for large dimensional data models. Such models abound in various interdisciplinary fields, especially in the evolving field of genomics and bioinformatics. Knowledge discovery and data mining (KDDM) or statistical learning tools are usually advocated for such high dimensional data models, often on primarily computational or heuristic justifications. The curse of dimensionality is so overwhelming that classical likelihood (principle) based statistical inference tools, baffled with an excessive number of parameters, may not be robust or efficient. Conventional assumptions of multinormality of errors and linearity of regression models may not be generally tenable in such contexts. Moreover, having a large number of coordinate variables, the assumption of their stochastic independence may not be realistic in a majority of cases. On top of that, at least a part of the response variables may be discrete or even purely qualitative in nature; often, the categorical responses may not reveal any (partial) ordering. In that sense, discrete multivariate analysis may appear to be more appropriate than conventional multinormal model based analysis.
Even for multinormal models, the high-dimensionality may demand a far larger sample size in order to implement a full likelihood based asymptotic analysis. That is, we need the conventional $n \gg K$ environment for drawing appropriate statistical conclusions with reasonable precision.

Typically, in such high-dimensional models, one encounters a $K \gg n$ environment, where $K$ is the dimension of the data and $n$ is the sample size. In such high-dimensional low sample size, HDLSS, models, effective dimension reduction may be a challenging statistical task, usually beyond the scope of KDDM. For example, in neuronal spike train models, there are literally tens of thousands of neurons (nerve cells), and in the presence of external stimuli, the spike trains for any observable subset of neurons exhibit a high degree of nonstationarity. Further, recording of such spike trains in a large number of nerve cells may be invasive to the brain functioning due to the destructive nature of recording ([16], Ch. 3). Essentially, we have a very high dimensional counting process. Doubly stochastic Poisson processes have been considered in the literature, albeit without much claim of optimal resolutions. In magnetic resonance imaging, MRI, there could be tens of thousands of microscopic units producing an enormously high dimensional spatial data model. More complexities may arise in case of (functional) fMRI models. For such HDLSS models, parametric asymptotics may not have adequate scope or good statistical interpretation.
The transition from conventional normal theory to nonparametric linear models has been well fortified along with the development of nonparametric or robust statistical methods based on $R$-statistics (ranks), $M$-statistics (maximization) and linear combinations of order statistics or $L$-statistics; see, for example, [8] where other pertinent references have been extensively cited. In a more general setup, nonparametric regression functionals have been formulated wherein the linearity of regression or a specific nonlinear form is not assumed to hold.

In the context of testing monotonicity of nonparametric regression, without assuming a linear or any specific nonlinear form, Ghosal et al. [5] considered suitable $U$-processes based on a locally smoothed Kendall's tau statistic. They provided general asymptotics for such locally smoothed Kendall's tau processes when both the independent and dependent variates are stochastic, and illustrated their effective use in the postulated hypothesis testing problem. Such local versions of Kendall's tau statistics have simple statistical interpretation, albeit, in view of possibly slower rate of convergence, the impact of large sample size is apparent in their analysis. In the contemplated bioinformatics area, as we shall see, the HDLSS scenario calls for alternative approaches, and some of these will be explored in this study.

In a simple regression setup, the Theil–Sen (point as well as interval) estimates of the regression slope based on the Kendall tau statistic [15] have simple forms, and are computationally tractable and statistically robust. Another advantage of the Kendall tau statistic is its adaptability for count data as well as latent-effect models.
Further, a test for the null hypothesis of no regression based on the Kendall tau statistic (being distribution-free under the null hypothesis of invariance) remains valid and efficient for such complex models. Our contemplated models, unlike [5], entail high dimensional data with relatively (and often inadequately) smaller sample size, i.e., the HDLSS ($K \gg n$) environment. As we shall see in the next section, there may not be a genuine temporal pattern. In addition, there may be other complications arising from lack of spatial-compactness, spatial homogeneity and other spatial dependence patterns.

For better motivation, in Section 2, an illustration is made with a microarray data model where HDLSS models typically arise. Section 3 deals with the appropriateness of statistical modeling and analysis based on a pseudo-marginal approach incorporating coordinatewise construction of the Kendall tau statistic, in such $K \gg n$ environments. Section 4 is devoted to the dimensional asymptotics for the Kendall tau process in such HDLSS models where there are two basic problems: (i) group divergence, and (ii) classification of genes into disease and nondisease types. For the first problem, a pseudo-marginal approach based on the Hamming distance has been explored in [18] while in the latter context, multiple hypotheses testing (MHT) problems in HDLSS setups arise in a different perspective and call for some alternative novel tools for valid and efficient statistical appraisals. Motivated by these perspectives in such HDLSS models, some applications of the Chen–Stein [3] theorem in such $K \gg n$ environments are presented in the last section. These generalizations cover both the MHT and the gene-environment interaction testing problems.

2. An illustrative data model

We consider a genomic model arising in microarray data analysis as an illustration. The microarray technology allows simultaneous studies of thousands of genes, $K$, possibly differentially expressed under diverse biological/experimental setups, with only a few, $n$, arrays. We may refer to Lobenhofer et al. [11] where for a set of 1900 genes, arranged in rows, the gene expressions were recorded at 6 time points, with 8 observations at each time point. Thus $1900 = K \gg n = 48$. The gene-expression levels are measured by their color intensity (or luminosity) as a quantitative (nonnegative) variable, either on the $(0, 1)$ or 0–100 percent scale, or (based on the log-scale) on the real line $\Re$. A gene associated (causally or statistically) with a target disease is known as a disease gene, DG, while the others are known as nondisease genes, NDG. Gene expression levels under different environments cast light on plausible gene-environment interactions (or associations) so that if the arrays are properly designed, mapping disease genes may be facilitated with such microarray studies. One of the main issues is identifying differentially expressed genes among thousands of genes, tested simultaneously, across experimental conditions. Typically, for a target disease, there are only a few DG while the NDG comprise the vast majority. A NDG is expected to have a low gene expression level while a DG is expected to have generally higher expression levels. Thus, a natural stochastic ordering of gene expression levels of the DG with varying disease severity is plausible while the NDG expression levels are expected to be stochastically unaffected by such disease level differentials. Microarray data go through a lot of standardization and normalization so that conventional simple models, such as the classical MANOVA models, may rarely be totally adaptable.
If the arrays are indexed by an explanatory or design variate ($t$) that possesses an ordering (not necessarily linear), then the stochastic ordering could be exploited through suitable nonparametric techniques.

The main difficulty in modeling and statistically analyzing microarray data stems from the high dimensionality of the genes compared to the number of arrays. While the different arrays may sometimes be taken to be at least statistically independent, the genes may not. Moreover, not much is known about the spatial topology of the genes or their genetic distances. There is another factor that merits our attention. The gene expression levels for the different genes in an array are neither expected to be stochastically independent nor (marginally) identically distributed. Sans such an i.i.d. clause, standard parametrics typically adaptable for fMRI models (albeit mostly done in a Bayesian coating) may encounter roadblocks for fruitful adaptation in microarray data models. Thus, structurally, such data models are different from those usually encountered in nonparametric functional regression models. For this reason, a pseudo-marginal approach is highlighted here. This approach exploits the marginal nonparametrics fully and renders some useful modeling and analysis convenience.

3. Some HDLSS formulations

Motivated by microarray data models introduced in Section 2, we consider here a set of $n$ arrays (sample observations) where there is a design variate $t_i$ associated with the $i$th array, for $i = 1, \ldots, n$. Without loss of generality, we assume the $t_i$ are ordered, i.e.,

$$t_1 \le t_2 \le \cdots \le t_n, \tag{3.1}$$

with at least one strict inequality. We do not, however, impose any linear or specific parametric ordering of these design variates.
The multisample (ordered alternative) model is a particular case where $n$ can be partitioned into $I$ subsets of sizes $n_1, \ldots, n_I$ such that within each subgroup, the $t_i$ are the same while they are ordered over the $I$ different subsets.

For the $i$th array, corresponding to the $K$ genes (positions), we have a gene expression level denoted by $X_{ik}$, $k = 1, \ldots, K$, so that we have $K$-vectors $X_i = (X_{i1}, \ldots, X_{iK})'$, for $i = 1, \ldots, n$. The joint distribution function of $X_i$ is denoted by $F_i(x)$, $x \in \Re^K$. Further, for the $k$th gene in the $i$th array, i.e., $X_{ik}$, the marginal distribution is denoted by $F_{ik}(x)$, $x \in \Re$, for $k = 1, \ldots, K$; $i = 1, \ldots, n$. For a given $i$, the $F_{ik}$, $k = 1, \ldots, K$ may not be generally the same, and moreover, the $X_{ik}$, $k = 1, \ldots, K$ may not be all stochastically independent. If a gene $k$ is NDG and the $t_i$ reflect the variability of the disease level, then the $F_{ik}$, $i = 1, \ldots, n$ should be the same. On the other hand, for a DG $k$, for $i < i'$, $X_{ik}$ should be stochastically smaller than $X_{i'k}$ in the sense that the $F_{ik}$, $i = 1, \ldots, n$ should have the ordering

$$F_{1k}(x) \ge F_{2k}(x) \ge \cdots \ge F_{nk}(x), \quad \forall x \in \Re. \tag{3.2}$$

Therefore, we could force a characterization of DG and NDG based on the following stochastic ordering: For a NDG $k$, the $F_{ik}$, $i = 1, \ldots, n$ are all the same, this being denoted by the null hypothesis $H_{0k}$, while for a DG $k$, the stochastic ordering in (3.2) holds which we denote by $H_{1k}$, for $k = 1, \ldots, K$. In this marginal formulation, we have a set of $K$ hypotheses corresponding to the $K$ genes, and whatever appropriate test statistic (say $T_{nk}$) we use for testing $H_{0k}$ vs. $H_{1k}$, these statistics may not be, generally, stochastically independent.
The basic problem is therefore to test simultaneously for

$$H_0 = \bigcap_{k=1}^{K} H_{0k} \quad \text{vs} \quad H_1 = \bigcup_{k=1}^{K} H_{1k}, \tag{3.3}$$

without ignoring possible dependence of the test statistics for the component hypotheses testing $H_{0k}$ vs $H_{1k}$, for $k = 1, \ldots, K$. This makes it appealing to follow the general guidelines of the Roy [13] union-intersection principle (UIP), albeit in a marginalization (i.e., adapting a finite union and finite intersection scheme), and thus permitting a more general framework so as to allow simultaneous testing and classification into DG / NDG groups. In a very parametric setup, some order restricted inference problems have been considered by [12]. However, in our setup, such normality based parametric models may not be very appropriate.

Our approach is based on the classical Kendall tau statistics for each of the $K$ genes and the incorporation of these (possibly dependent) marginal statistics in a composite scheme for classification. For the $k$th gene, based on the $n$ observations $X_{ik}$, $i = 1, \ldots, n$, and the tagging variables $t_1, \ldots, t_n$, we define the Kendall tau statistic as (3.4) $T_{nk} = \binom{n}{2}^{-1} \sum_{1 \le i <}$ […] $t_{i''} \ne t_{i'}\}$. For small values of $n$ and given (3.1), one can enumerate $S$ and obtain the exact distribution of $T^{o}_{nk}$ under $H_{0k}$. If $n$ is large, the standardized form of the statistic, i.e., $T^{o}_{nk}/\nu_n$ has closely a standard normal distribution. In our setup, perhaps the exact permutation distribution plays a greater role and this will be illustrated later on.

The behavior of $T^{o}_{nk}$ under alternatives would naturally depend on the stochastic ordering in (3.2) and these statistics will not be exact distribution-free nor possibly have identical marginal laws.
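The coordinatewise construction and its exact permutation law can be sketched in a few lines. This is only an illustration under stated assumptions: the statistic below is the standard sign-based Kendall tau between the design variates and the expression levels of one gene (the display (3.4) is incomplete in this text, so the exact normalization used by the author is an assumption), there are no ties among the expression levels, and the function names `kendall_tau`, `exact_null_distribution` and `right_tail_p_value` are hypothetical.

```python
from itertools import combinations, permutations
from collections import Counter
from fractions import Fraction
from math import factorial


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_tau(t, x):
    """Average of sign(t_j - t_i) * sign(x_j - x_i) over all pairs i < j."""
    n = len(t)
    total = sum(
        sign(t[j] - t[i]) * sign(x[j] - x[i])
        for i, j in combinations(range(n), 2)
    )
    return Fraction(total, n * (n - 1) // 2)


def exact_null_distribution(t):
    """Exact law of tau under H_0k: all n! rank orderings are equally likely."""
    n = len(t)
    counts = Counter(kendall_tau(t, perm) for perm in permutations(range(n)))
    return {v: Fraction(c, factorial(n)) for v, c in sorted(counts.items())}


def right_tail_p_value(t, x):
    """Right-hand sided p-value P_0{T >= observed value}."""
    observed = kendall_tau(t, x)
    null = exact_null_distribution(t)
    return sum(p for v, p in null.items() if v >= observed)
```

For a single gene observed on four ordered arrays, `right_tail_p_value([1, 2, 3, 4], [0.1, 0.5, 0.9, 1.3])` returns `Fraction(1, 24)`, the smallest attainable value for $n = 4$; this makes the discreteness of the null law for small $n$ concrete. Full enumeration is only practical for small $n$.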
Nevertheless, under (3.2), for every $i < i'$, $X_{i'k} - X_{ik}$ has a distribution tilted to the right, so that

$$E\{T^{o}_{nk} \mid H_{1k}\} \ge 0, \quad \forall k = 1, \ldots, K. \tag{3.10}$$

This motivates us to use tests based on the marginal statistics $T^{o}_{nk}$ using the right hand side critical region, or equivalently the right-hand sided $p$-values. Recall that the distribution of each $T^{o}_{nk}$, at least for $n$ not too large, is discrete, but that is not going to be of any particular concern. A greater concern is to incorporate possible stochastic dependence among the $K$ statistics $T^{o}_{nk}$, $k = 1, \ldots, K$ (even under the null hypothesis) and their possible heterogeneity when some of the $H_{1k}$ are true. A basic problem is to formulate suitable multiple hypothesis testing procedures to assess which hypotheses are to be rejected subject to a suitably defined Type I error rate. This is elaborated in the next section.

4. Dimensional asymptotics and the union intersection test

Although independence across microarrays may be assumed, their i.d. structure may be vitiated if the arrays relate to different biological or experimental setups. Moreover, for different genes, the gene expression (marginal) distributions are likely to be different when there is gene-environment interaction. Taking into account such plausible inter-gene stochastic dependence and heterogeneity, we need to prescribe statistical modeling and analysis tools. This will be accomplished through dimensional asymptotics where $K$ is made to increase indefinitely while $n$, being small compared to $K$, may or may not be adequately large.

In view of (3.3), it is tempting to appeal to the union-intersection principle [13], or UIP, to construct suitable test statistics which will cover the genome-wise picture in a reasonable way.
Towards this, we may note that under $H_0$ (i.e., $H_{0k}$, $\forall k$), marginally each $T^{o}_{nk}$ has the same distribution (which does not depend on the underlying $F_{ik}$). Thus, corresponding to any $c$: $-1 \le c \le 1$, the tail probability $P_0\{T^{o}_{nk} > c\}$ is the same for all $k$ and this can be evaluated by using the exact permutation distribution generated by the $n!$ permutations of the $X_{ik}$, $1 \le i \le n$. The UIP then leads to the following union-intersection test, UIT, statistic:

$$T^{*o}_{n} = \max\{T^{o}_{nk} : 1 \le k \le K\}, \tag{4.1}$$

where the test function is given by $\phi(T^{*o}_{n}) = 1$, $\gamma$, or $0$, accordingly as $T^{*o}_{n}$ is $>$, $=$ or $<$ $c$ and $\gamma$: ($0 \le \gamma \le 1$) is so chosen that $E_0\{\phi(T^{*o}_{n})\} = \alpha$, the preassigned level of significance. Note that for $n$ not adequately large, the null distribution of $T^{o}_{nk}$ is essentially discrete and hence this usual randomization test function is aimed to take care of this problem.

The crux of the problem is therefore to determine such a critical level $c_\alpha$. The joint distribution of the $T^{o}_{nk}$, $1 \le k \le K$, even under the null hypothesis $H_0$, depends on the underlying $K$-dimensional distribution $F_i$, and hence, in general will not be distribution-free. Thus, the usual technique of finding out the critical level of $T^{*o}_{n}$ from this joint distribution may be intractable.

One possibility is to incorporate the fact that under $H_0$, the $K$-vectors $X_i$, $i = 1, \ldots, n$, are i.i.d. and hence their joint distribution remains invariant under any permutation of these vectors among themselves. Thereby we can evaluate such critical values by an appeal to the permutation distribution generated by the $n!$ equally likely permutations of the $K$-vectors $\{X_i\}$ among themselves. This permutation law generates the (unconditional null) marginal laws of the $T^{o}_{nk}$, and provides some conditional versions of their joint distributions of various orders.
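The permutation argument for the maximum statistic (4.1) can be sketched as follows. This is a rough illustration and not the author's procedure: it samples random permutations rather than enumerating all $n!$, it uses the standard sign-based Kendall tau as an assumed form of the marginal statistic, and the names `uit_statistic` and `permutation_p_value` are hypothetical. The point it shows is that whole $K$-vectors are permuted together, so whatever dependence exists among the genes within an array is left intact.

```python
import random
from itertools import combinations


def kendall_tau(t, x):
    """Sign-based Kendall tau between design variates t and one gene's levels x."""
    n = len(t)
    s = 0
    for i, j in combinations(range(n), 2):
        s += ((t[j] > t[i]) - (t[j] < t[i])) * ((x[j] > x[i]) - (x[j] < x[i]))
    return s / (n * (n - 1) / 2)


def uit_statistic(t, X):
    """Maximum of the K marginal tau statistics; X is a list of n rows of length K."""
    K = len(X[0])
    return max(kendall_tau(t, [row[k] for row in X]) for k in range(K))


def permutation_p_value(t, X, n_perm=2000, seed=0):
    """Monte Carlo conditional p-value of the maximum statistic."""
    rng = random.Random(seed)
    observed = uit_statistic(t, X)
    rows = list(X)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(n_perm):
        rng.shuffle(rows)  # permute whole K-vectors among the n arrays
        if uit_statistic(t, rows) >= observed - 1e-12:
            hits += 1
    return observed, hits / (n_perm + 1)
```

The returned p-value is conditional on the observed collection of $K$-vectors, which is the sense in which the resulting test is conditionally distribution-free.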
Since this permutation law is a conditional law (given the collection of all these $K$-vectors), the critical values obtained in this manner are themselves stochastic, thus introducing another layer of variation. Nevertheless, it provides a conditionally distribution-free test.

One discouraging feature of this permutation approach is that the permutation invariance does not hold under the alternative hypothesis, and hence critical levels computed from the permutation law involving an observed set of $\{X_i\}$ may be sensitive to the data conformity to the null situation.

If we assume that all the $T^{o}_{nk}$ are stochastically independent, then we have for any $c$, $-1 \le c \le 1$, under $H_0$,

$$P_0\{T^{*o}_{n} \le c\} = [P_0\{T^{o}_{n1} \le c\}]^{K}, \tag{4.2}$$

so that the distribution-free nature of the $T_{nk}$ under the null hypothesis provides the access to the computation of the test function and the critical level. If $n$ is at least moderately large, in view of the asymptotic normality of $T^{o}_{nk}/\nu_n$, the randomization test function may be replaced by a conventional normal theory test function, where for the individual tests, a significance level $\alpha^*$ is so chosen that

$$\alpha = 1 - (1 - \alpha^*)^{K}. \tag{4.3}$$

Generally, if we let $\alpha^* = (\alpha/K)$, then the size of the UIT is $\le \alpha$ no matter whether the $T^{o}_{nk}$ are stochastically independent or not. There is, therefore, a certain amount of conservativeness in this specification.

In passing, we may remark that by the classical asymptotics on Hoeffding's $U$-statistics, any pair $(T^{o}_{nk}, T^{o}_{nq})$, with $k \ne q$, is a bivariate $U$-statistic, for $\alpha^*$ sufficiently small, so using the bivariate extreme statistics results (viz., [19]), we can claim that the events $\{T^{o}_{nk} > c_{\alpha^*}\}$ and $\{T^{o}_{nq} > c_{\alpha^*}\}$ will be asymptotically (as $K \to \infty$) independent so that $P_0\{T^{o}_{nk} > c_{\alpha^*}, T^{o}_{nq} > c_{\alpha^*}\}$ can be well approximated by $[P_0\{T^{o}_{nk} > c_{\alpha^*}\}]^2$.
In a similar manner, the third order probability terms can be handled, and the Bonferroni bound retaining the second and third order probabilities provides a good approximation: $\alpha = K\alpha^* - \binom{K}{2}\alpha^{*2} + \binom{K}{3}\alpha^{*3} + o(\alpha^{*3})$. As a result, $\alpha^* = (\alpha/K)$ provides a good approximation to the level of significance.

Therefore, for the UIT, when $K$ is large, even when the genes are not stochastically independent, letting $\alpha^* = (\alpha/K)$ we may consider the following multiple hypothesis testing scheme:

For a chosen $\alpha^* = K^{-1}\alpha$, obtain the marginal distributional critical level $c_{\alpha^*}$, and reject those $H_{0k}$; $k \in \{1, \ldots, K\}$ for which the corresponding $T^{o}_{nk}$ exceeds $c_{\alpha^*}$.

A randomization test function can be prescribed when $n$ is not adequately large. Thus, the UIT provides a bound on the familywise error rate, FWER.

If we take $\alpha^* \sim \alpha/K$ and $K$ is large, we need to make sure that $n$ is so large that $\nu_n^{-1} c_{\alpha^*} < 1$; this will imply that if we are to use the permutation null distribution of any $T^{o}_{nk}$, being attracted by the permutational central limit theorem, it has a nonzero mass point beyond $c_{\alpha^*}/\nu_n$. If $\nu_n^2 = O(n^{-1})$, as is typically the case, then $c_{\alpha^*} = O(n^{-1/2}\sqrt{-2\log\alpha^*})$ so that $\log K = O(n)$ and this does not appear to be a serious concern in real life applications. For example, if we have three groups of arrays, say within each group there are 5 arrays, the total number of ways of partitioning 15 units into 3 subsets of 5 each is equal to $(15)!/(5!)^3$ and this is so large (756,756) that even if $K$ is as large as 30,000, it would not be a problem. However, for large $K$, the UIT, like the classical likelihood ratio test, will have little power, and hence alternative test procedures need to be explored.
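The arithmetic behind this example can be checked directly. The sketch below is illustrative only: the level $\alpha = 0.05$ is an assumed value not given in the text, the helper names are hypothetical, and treating $1/756{,}756$ as the finest attainable tail probability assumes the most extreme arrangement is unique.

```python
from math import expm1, factorial, log1p


def n_arrangements(group_sizes):
    """Number of ways to split sum(group_sizes) arrays into groups of the given sizes."""
    count = factorial(sum(group_sizes))
    for size in group_sizes:
        count //= factorial(size)
    return count


def bonferroni_level(alpha, K):
    """Per-gene level alpha* = alpha / K."""
    return alpha / K


def independence_level(alpha, K):
    """Per-gene level solving (4.3): alpha = 1 - (1 - alpha*)^K."""
    return -expm1(log1p(-alpha) / K)


arrangements = n_arrangements([5, 5, 5])       # (15)! / (5!)^3 = 756756
finest_p_value = 1 / arrangements              # about 1.32e-6
alpha_star = bonferroni_level(0.05, 30000)     # about 1.67e-6
print(arrangements, finest_p_value < alpha_star)
```

Under these assumptions the permutation law has a mass point below $\alpha/K$, so a marginal rejection at level $\alpha^*$ remains possible with 15 arrays and $K = 30{,}000$. The two per-gene levels differ only slightly, which reflects the mild conservativeness of $\alpha^* = \alpha/K$ relative to (4.3).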
This illustrates the important role of the design of the study and the number of arrays required in trying to include a very large $K$.

Roy's UIT can be adapted by exploring the information contained in the ordered $p$-values. If the $T^{o}_{nk}$ are all stochastically independent (and as they are identically distributed under the null hypothesis $H_0$) then one can adapt Simes' [20] theorem (which is a restatement of the classical Ballot theorem (viz., [9]) introduced some twenty years earlier). If $P_1, \ldots, P_K$ are the $p$-values for the $K$ marginal tests and $P_{K:1} \le \cdots \le P_{K:K}$ are the corresponding order statistics, then assuming that under $H_0$ the $P_k$ have a uniform (0, 1) distribution (i.e., tacitly assuming that the $T^{o}_{nk}/\nu_n$ have a continuous distribution under $H_0$), Simes' theorem asserts that for every $\alpha$: $0 < \alpha < 1$,

$$P\{P_{K:k} > k\alpha/K, \ \forall k = 1, \ldots, K \mid H_0\} = 1 - \alpha. \tag{4.4}$$

Suppose now we define the anti-ranks $S_1, \ldots, S_K$ by letting

$$P_{K:k} = P_{S_k}, \quad k = 1, \ldots, K, \tag{4.5}$$

where again ties among the ranks are neglected under the assumption of continuity of the distribution of the $P_k$. Whereas Simes' theorem provides a test of the overall hypothesis, Hochberg [6] derived a step-up procedure for multiple hypotheses testing based on the following: For every $\alpha \in (0, 1)$,

$$P\{P_{K:k} \ge \alpha/(K - k + 1), \ \forall k = 1, \ldots, K \mid H_0\} = 1 - \alpha. \tag{4.6}$$

Benjamini and Hochberg [2] considered a step-up procedure based on the Simes theorem. Their multiple hypothesis testing procedure is the following:

Reject those null hypotheses $\{H_{0S_k}\}$ for which $P_{S_k} \le k\alpha/K$, $k = 1, \ldots, K$, and accept those null hypotheses in the complementary set.

For some related developments in a parametric setup, we refer to [2], [4], [10], [14] and [21], among others.
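The ordered $p$-value rules above can be sketched as follows. One caveat on the reading: the code implements the usual step-up form of the Benjamini–Hochberg rule (find the largest $k$ with $P_{K:k} \le k\alpha/K$ and reject $H_{0S_1}, \ldots, H_{0S_k}$), which is an assumption about how the rule stated in the text is meant to be applied; the function names and the example $p$-values are hypothetical.

```python
def simes_rejects_global_null(p_values, alpha):
    """Reject the overall H_0 if some ordered p-value satisfies P_{K:k} <= k*alpha/K."""
    K = len(p_values)
    ordered = sorted(p_values)
    return any(p <= k * alpha / K for k, p in enumerate(ordered, start=1))


def benjamini_hochberg(p_values, alpha):
    """Step-up rule: indices of the hypotheses rejected at nominal level alpha."""
    K = len(p_values)
    # anti-ranks S_1, ..., S_K: positions of the p-values in increasing order
    anti_ranks = sorted(range(K), key=lambda k: p_values[k])
    largest_k = 0
    for k, index in enumerate(anti_ranks, start=1):
        if p_values[index] <= k * alpha / K:
            largest_k = k
    return sorted(anti_ranks[:largest_k])


p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, 0.05))          # [0, 1]
print(simes_rejects_global_null(p, 0.05))   # True
```

With discrete null laws, as for the Kendall tau statistics with small $n$, the uniformity assumption behind (4.4) and (4.6) fails, so the sketch should be read as the continuous-case procedure only.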
These developments paved the way for other measures of error rates which are more adaptable in the $K \gg n$ environment. Some of these will be discussed later on.

There are two basic concerns that can be voiced in this respect. The whole setup is based on the assumed uniform distribution of the $P_k$ under the null hypothesis. However, if we look into the statistics $T^{o}_{nk}$ in our setup, we may note that though they have a specified distribution, the latter is a discrete one defined over the interval $(-1, 1)$. Noting that there are a set of discrete mass points, ties among the $T^{o}_{nk}/\nu_n$ (and hence $P_k$) can not be neglected with probability one, and moreover, the $P_k$ will have a set of probability mass points on $[0, 1]$ with non-zero masses. Thus, technically the above probability results are not strictly usable (unless $n$ is indefinitely large, contradicting the $K \gg n$ environment). Secondly, as was stressed earlier, the $T^{o}_{nk}$ across the set of genes are generally not stochastically independent.

Controlling the FWER when $K$ is very large may generally entail undue conservativeness of multiple hypotheses testing schemes. On the other hand, using a level of significance for each marginal hypothesis testing problem may lead to a large FWER. In the context of microarrays suppose that there are $K_1$ disease genes (DG) and $K_0 = K - K_1$ NDG; thus, we have a set of $K_0$ null hypotheses which are true and a complementary set of $K_1$ hypotheses which are not true. Suppose that based on our multiple hypotheses testing procedure, we accept $m_0$ out of $K_0$ true null hypotheses so that the remaining $K_0 - m_0 = m_1$ true null hypotheses are rejected. Similarly, among the $K_1$ not true null hypotheses, $l_0$ are accepted as true and $l_1$ accepted in favor of the alternative. Thus, a totality of $R = m_1 + l_1$ hypotheses are rejected while $K - R$ are accepted.
Mind that though we observe $R$, through our chosen multiple hypotheses testing procedure, individually $m_1$, $l_1$ are not observable; all these $(R, l_1, m_1)$ are stochastic in nature. A natural modification of the FWER, to suit such $K \gg n$ environments, is the per-comparison error rate (PCER) defined as

$$\mathrm{PCER} = E(m_1)/K, \tag{4.7}$$

which is the expected proportion of Type I errors among the $K$ hypotheses. A related measure is the per-family error rate (PFER), defined as

$$\mathrm{PFER} = E(m_1), \tag{4.8}$$

which is the expected total number of Type I errors among the $K$ hypotheses. Obviously, $\mathrm{PFER} = K \cdot \mathrm{PCER}$, and is generally large when $K$ is large (unless the PCER is very small). Moreover,

$$\mathrm{PFER} = E(m_1) = \sum_{r \ge 1} r P\{m_1 = r\} \ge P\{m_1 > 0\}, \tag{4.9}$$

so that $\mathrm{PFER} \ge \mathrm{FWER}$.

If our observed $R = 0$ then no true null hypothesis is rejected and hence there is no false discovery. For $R \ge 1$, the proportion of false discovery is given by $Q = m_1/R$; conventionally, it is taken $Q = 0$ when $R = 0$, so that $Q$ is properly defined for every nonnegative $R$ and $m_1$. However, $Q$ is not observable. Hence, the false discovery rate (FDR) is defined as

$$\mathrm{FDR} = E\{Q\} = \sum_{r \ge 1} P\{R = r\} E\{m_1/R \mid R = r\}. \tag{4.10}$$

Since, conventionally, we have forced $Q = 0$ for $R = 0$, this definition of FDR may produce a negative bias. An alternative definition, known as the pFDR, is defined as

$$\mathrm{pFDR} = E\{Q \mid R > 0\} = \mathrm{FDR}/P\{R \ge 1\}. \tag{4.11}$$

Naturally, $\mathrm{pFDR} \ge \mathrm{FDR}$.

In the formulation of FDR and pFDR it is not necessary to assume that all of the test statistics have continuous distributions under the null hypothesis. If these distributions are all continuous then of course the $p$-values have a uniform (0, 1) distribution under the null hypothesis, and hence, the multiple hypotheses testing schemes discussed earlier can be conveniently adapted. In our setup, each test statistic has marginally the same null distribution, albeit that is discrete. So, it might be necessary, especially when $n$ is not large, to make use of this otherwise completely specified, discrete distribution without assuming a uniform distribution for the associated $p$-values under the null hypothesis.

Fig 1. Comparison of the null distribution with the alternative distribution.

We may simulate the permutation distribution of any marginal test statistics and thereby take into account possible dependence among the gene expressions without assuming any specific pattern. Of course, marginally, each test statistic has the same null distribution. So, if we consider the set $\{T^{o}_{nk} : k = 1, \ldots, K\}$ and define the empirical distribution

$$G_K(t) = K^{-1} \sum_{k=1}^{K} I(T^{o}_{nk} \le t), \quad t \in (-1, 1), \tag{4.12}$$

then $E_0\{G_K(t)\} = G(t)$, $\forall t \in (-1, 1)$, where $G(t)$ is the common marginal distribution of the $T^{o}_{nk}$ under the null hypothesis. The summands in $G_K(t)$ are all bounded variables, nondecreasing in $t \in (-1, 1)$ and $G(t)$ is also nondecreasing and assumes values on $(0, 1)$. Thus, whenever $G_K(t)$ stochastically converges pointwise to $G(t)$, it does so uniformly in $t \in (-1, 1)$. Further $G_K(t) - G(t)$ is a bounded r.v., and hence, if it converges in probability, it converges in the $r$th mean for every $r > 0$. Therefore it might suffice to assume that the dependence pattern satisfies the condition:

$$\mathrm{Var}(G_K(t)) \to 0, \quad \text{as } K \to \infty. \tag{4.13}$$

Then we conclude that $\|G_K(\cdot) - G(\cdot)\| = \sup\{|G_K(t) - G(t)| : t \in (-1, 1)\}$ stochastically converges to 0. Further, (4.13) holds under quite general dependence patterns.

It is naturally tempting to explore weak convergence (invariance principles) results for $\sqrt{K}(G_K(\cdot) - G(\cdot))$ wherein $K$ is taken indefinitely large but not $n$.
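The uniform closeness of $G_K$ to $G$ for large $K$ and fixed small $n$ can be illustrated numerically. The sketch below makes a strong simplifying assumption that the text does not make, namely that the genes are independent, so that (4.13) holds trivially; it also takes the design variates to be strictly increasing and uses the standard sign-based Kendall tau. All function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations, permutations


def tau_from_ranks(ranks):
    """Kendall tau of a rank vector against strictly increasing design variates."""
    n = len(ranks)
    s = sum(1 if ranks[j] > ranks[i] else -1 for i, j in combinations(range(n), 2))
    return s / (n * (n - 1) / 2)


def null_cdf(n):
    """Exact G(t) at its mass points, from all n! equally likely rank orderings."""
    counts = Counter(tau_from_ranks(p) for p in permutations(range(n)))
    total = sum(counts.values())
    cdf, running = {}, 0
    for value in sorted(counts):
        running += counts[value]
        cdf[value] = running / total
    return cdf


def empirical_cdf(stats, points):
    """G_K(t) of (4.12) evaluated at the given points."""
    K = len(stats)
    return {t: sum(s <= t for s in stats) / K for t in points}


def sup_distance(K, n=4, seed=0):
    """sup_t |G_K(t) - G(t)| for K independent null genes on n arrays."""
    rng = random.Random(seed)
    G = null_cdf(n)
    stats = [tau_from_ranks(rng.sample(range(n), n)) for _ in range(K)]
    G_K = empirical_cdf(stats, G.keys())
    return max(abs(G_K[t] - G[t]) for t in G)
```

Because both step functions jump only at the mass points of $G$, taking the maximum over those points gives the supremum over $t \in (-1, 1)$. Under dependence among genes the same computation applies to $G_K$, but the rate at which the distance shrinks is governed by (4.13) and is not illustrated here.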
Since $G(t)$, $t \in (-1, 1)$, is a discrete distribution function with mass points over $(-1, 1)$, the jump-discontinuities of $G(\cdot)$ may vitiate the usual compactness (or tightness) properties possessed in the continuous case, albeit by strengthening (4.13) to

$$\limsup_{K} K\, \mathrm{Var}(G_K(t)) < \infty, \quad \forall t \in (-1, 1), \tag{4.14}$$

pointwise, the asymptotic normality (as $K \to \infty$) follows under quite general dependency conditions. If we have some linear functional of $G_K(\cdot)$ as a test statistic, this weak convergence would have been quite useful in deriving the asymptotic (in $K$) normality of the test statistic under the null hypothesis; (4.14) would have been sufficient in that context. However, in our case, we have some functional of $G_K(\cdot)$ of extremal order statistic type, namely, the extreme quantiles of a set of dependent r.v.s, and hence we may need somewhat different regularity conditions. This perspective is appraised more elaborately in the next section.

5. Dimensional asymptotics and Chen–Stein theorem

In the previous section we have briefly discussed the plausibility of some $K_0$ NDG and $K_1$ DG with $K_0 + K_1 = K$, the total number of genes. Neither $K_1$ nor the DG positions are known and hence we have a dual problem of estimating $K_1$ as well as identifying the positions of these $K_1$ DG's. It is conceivable that the NDG have stochastically smaller expression levels (than the DG) and that the stochastic dependence among the DG may not be insignificant. We intend to incorporate this stochastic dependence structure among the gene expressions in a suitable model. Unfortunately, sans any positional ordering of the $K$ genes, it might be difficult to assume suitable mixing conditions under which central limit theorems may apply.
As for considering alternative limit theorems for dependent sequences, we intend to incorporate the Chen–Stein theorem [3] and its ramifications wherein Poisson approximations for more general dependent sequences are advocated. For our convenience, let us state the Chen–Stein Theorem in a slightly updated version [1].

Theorem 1 (Chen–Stein). Let $\mathcal{I}$ be an index set with elements $i \in \mathcal{I}$ and let $K$ be the cardinality of the set $\mathcal{I}$. For each $i \in \mathcal{I}$ let $Y_i$ be an indicator random variable and let

$$P\{Y_i = 1\} = 1 - P\{Y_i = 0\} = p_{Ki}, \quad i \in \mathcal{I}. \tag{5.1}$$

Let $W = \sum_{i \in \mathcal{I}} Y_i$ be the total number of occurrences of the events $\{Y_i = 1\}$, $i \in \mathcal{I}$, and let $\lambda_K = \sum_{i \in \mathcal{I}} p_{Ki} = E(W)$. For each $i \in \mathcal{I}$, we define a set $\mathcal{J}_i \subset \mathcal{I}$ and its complement $\mathcal{J}_i^c$ as the set of dependence of $i$ and its complement, the set of independence of $i$. Thus, it is tacitly assumed that $Y_i$ is independent of $\{Y_j, j \in \mathcal{J}_i^c\}$, for every $i \in \mathcal{I}$. Further, let

$$b_1 = \sum_{i \in \mathcal{I}} \sum_{j \in \mathcal{J}_i} E(Y_i) E(Y_j) = \sum_{i \in \mathcal{I}} \sum_{j \in \mathcal{J}_i} p_{Ki}\, p_{Kj}, \tag{5.2}$$

$$b_2 = \sum_{i \in \mathcal{I}} \sum_{j (\ne i) \in \mathcal{J}_i} E(Y_i Y_j), \tag{5.3}$$

and

$$b_3 = \sum_{i \in \mathcal{I}} E\bigl| E\bigl( Y_i - E(Y_i) \mid \{Y_j, \forall j \in \mathcal{J}_i^c\} \bigr) \bigr|. \tag{5.4}$$

Finally, let $Z$ be a random variable having Poisson distribution with parameter $E(Z) = \lambda_K$. Then

$$\|\mathcal{L}(W) - \mathcal{L}(Z)\| \le 2(b_1 + b_2 + b_3)\, \frac{1 - e^{-\lambda_K}}{\lambda_K} \le 2(b_1 + b_2 + b_3) \min\{1, \lambda_K^{-1}\}. \tag{5.5}$$

A direct corollary to Theorem 1 is the following:

$$|P\{W = 0\} - e^{-\lambda_K}| \le 2(b_1 + b_2 + b_3) \min\{1, \lambda_K^{-1}\}. \tag{5.6}$$

An interesting feature of this Theorem is the dual control of $\lambda_K$, the expectation, and $b_1$, $b_2$, and $b_3$, the dependence functions. In line with our intended application we consider a natural extension of this result.
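To make the quantities in (5.2)–(5.6) concrete, here is a minimal Python sketch for a toy example that is not taken from the paper: $X_0, \ldots, X_K$ are assumed to be independent Bernoulli($p$) variables and $Y_i = X_{i-1} X_i$, so that $p_{Ki} = p^2$, the dependence set is $\mathcal{J}_i = \{i-1, i, i+1\} \cap \{1, \ldots, K\}$, and $b_3 = 0$. The exact value of $P\{W = 0\}$ (no two consecutive successes) is obtained by a two-state recursion and compared with the bound (5.6) as stated above. All function names are hypothetical.

```python
from math import exp


def chen_stein_terms(K, p):
    """lambda_K, b1, b2, b3 for Y_i = X_{i-1} * X_i, i = 1..K (toy example)."""
    p_i = p * p                      # P{Y_i = 1}
    lam = K * p_i                    # lambda_K = E(W)
    # Size of J_i = {i-1, i, i+1} restricted to {1, ..., K}; J_i contains i.
    sizes = [sum(1 for j in (i - 1, i, i + 1) if 1 <= j <= K)
             for i in range(1, K + 1)]
    b1 = sum(s * p_i * p_i for s in sizes)        # (5.2)
    b2 = sum((s - 1) * p ** 3 for s in sizes)     # (5.3): E(Y_i Y_{i+1}) = p^3
    b3 = 0.0                                      # (5.4): Y_i independent of J_i^c
    return lam, b1, b2, b3


def exact_prob_no_event(K, p):
    """P{W = 0}: no two consecutive successes among X_0, ..., X_K."""
    end0, end1 = 1.0 - p, p          # sequences so far ending in 0 / in 1
    for _ in range(K):
        end0, end1 = (end0 + end1) * (1.0 - p), end0 * p
    return end0 + end1


def bound_5_6(lam, b1, b2, b3):
    """Right-hand side of (5.6)."""
    return 2.0 * (b1 + b2 + b3) * min(1.0, 1.0 / lam)
```

For instance, with $K = 200$ and $p = 0.05$ one has $\lambda_K = 0.5$, and the absolute difference between `exact_prob_no_event(200, 0.05)` and `exp(-0.5)` stays below `bound_5_6(*chen_stein_terms(200, 0.05))`.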
With the same notation as in Theorem 1, we replace the $Y_i$, $i \in \mathcal{I}$, by a sequence of processes $Y_i(t)$, $i \in \mathcal{I}$, $t \in \mathcal{T}$, where $\mathcal{T} = (0, a)$, for some $a > 0$, and assume that for each $i$, $Y_i(t)$ is nondecreasing in $t$ and yet a zero-one valued random variable. Further assume that the sets $\mathcal{J}_i$ do not depend on $t \in \mathcal{T}$. For every $i \in \mathcal{I}$, $t \in \mathcal{T}$, we denote by $p_{Ki}(t) = E(Y_i(t))$, and the corresponding parameters by $\lambda_K(t)$, $b_1(t)$, $b_2(t)$ and $b_3(t)$. Let $W_K = \{W_K(t), t \in \mathcal{T}\}$ be the sum process and, corresponding to $Z$, we introduce a Poisson process $Z_K = \{Z_K(t), t \in \mathcal{T}\}$ whose expectation process is $\lambda_K = \{\lambda_K(t), t \in \mathcal{T}\}$. Then

$$\|\mathcal{L}(W_K) - \mathcal{L}(Z_K)\| \le 2 \sup\Bigl\{ (b_1(t) + b_2(t) + b_3(t))\, \frac{1 - e^{-\lambda_K(t)}}{\lambda_K(t)} : t \in \mathcal{T} \Bigr\}.$$

The proof of this extension is along the lines of Theorem 1 and hence we omit the details. In our study, unless $n$ is large, we may not have a continuous time parameter ($t \in \mathcal{T}$). Thus, we consider an intermediate result that remains applicable for small $n$ as well.

Theorem 2. Consider a set of $M$ discrete time points $-1 \le \tau_1 < \cdots < \tau_M \le 1$ with respective probability masses $\eta_{n1}, \ldots, \eta_{nM}$, where $M$ may depend on $n$. Also, let $\nu_{nj} = \sum_{i \le j} \eta_{ni}$, $j = 1, \ldots, M$. Further, let $Y_i(\tau_j)$, $i = 1, \ldots, K$, $j = 1, \ldots, M$, be an array of zero-one valued random variables where $Y_i(\tau_j)$ is nondecreasing in $\tau_j$ and $E(Y_i(\tau_j)) = \nu_{nj}$, $j = 1, \ldots, M$. Define $W_K = \{W_K(\tau_j), j = 1, \ldots, M\}$ where $W_K(\tau_j) = \sum_{i=1}^{K} Y_i(\tau_j)$ for $j = 1, \ldots, M$. Similarly, let $Z_K = \{Z_K(\tau_j), j = 1, \ldots, M\}$ be a discrete time parameter Poisson process with the drift function $\nu_K = \{\nu_{nj}, j = 1, \ldots, M\}$. Define the parameters $b_{K1}(\tau_j)$, $b_{K2}(\tau_j)$, $b_{K3}(\tau_j)$, $j = 1, \ldots, M$, as in (5.2), (5.3), and (5.4); assume that as $K \to \infty$,

$$\max\Bigl\{ (b_{K1}(\tau_j) + b_{K2}(\tau_j) + b_{K3}(\tau_j))\, \frac{1 - e^{-\nu_{nj}}}{\nu_{nj}} : j \le M \Bigr\} \to 0. \tag{5.7}$$

Then, as $K$ increases indefinitely,

$$\|\mathcal{L}(W_K) - \mathcal{L}(Z_K)\| \to 0. \tag{5.8}$$

Again, being a finite-dimensional version of Theorem 1, this does not need an elaborate proof.

In the present context, under the null hypothesis, all the $T^{o}_{nk}$ have a common distribution over $(-1, 1)$; this is discrete but symmetric about 0, and is completely known (though could be computationally intensive if $n$ is not too small). Let us denote the distinct mass points for $T^{o}_{nk}$ by $-1 = a_1 < a_2 < \cdots < a_L = 1$ and let

$$\tau_j = P_0\{T^{o}_{nk} \ge a_{L-j+1}\}, \quad j = 1, \ldots, L. \tag{5.9}$$

Then $0 \le \tau_1 < \tau_2 < \cdots < \tau_L \le 1$. Also, let us write

$$Y_k(\tau_j) = I(T^{o}_{nk} \ge a_{L-j+1}), \quad j = 1, \ldots, L, \; k = 1, \ldots, K. \tag{5.10}$$

Further, let

$$W_K(\tau_j) = \sum_{k=1}^{K} Y_k(\tau_j), \quad j = 1, \ldots, L. \tag{5.11}$$

Also, let $J = \max\{j : 1 \le j \le L;\ \tau_j \le \eta\}$ for some pre-assigned $\eta > 0$. Basically, we would like to pursue the distributional features of the partial sequence $\{W_K(\tau_j), j \le J\}$, and incorporate Theorem 2. Note that in this way, we avoid the conventional assumption of a continuous null distribution of the coordinate-wise test statistics. Of course, if $n$ is adequately large, the assumption of a uniform distribution of the $p$-values (under the null hypothesis) would be reasonable. For example, if we have a three sample situation with $n_1 = n_2 = n_3 = 4$ then $L = (12)!/(4!)^3 = 34{,}650$ so that we could choose $J = 1$ and use the Poisson approximation. It is also possible to choose $J = 2$ with an appropriate cut-off point and still stick to a FWER around 0.05.
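The construction (5.9)–(5.11) can be sketched in Python for a simplified setting. The sketch assumes the ordinary Kendall's tau computed from $n$ untied observations against a fixed ordering, whose exact null law follows from the number of permutations with a given inversion count; this is an illustrative stand-in and not the multi-sample statistic used in the paper. The function names are hypothetical. Exact rational arithmetic is used so that the tail probabilities $\tau_j$ are free of rounding error.

```python
from fractions import Fraction
from math import comb, factorial


def inversion_counts(n):
    """counts[v] = number of permutations of n items with v inversions."""
    counts = [1]
    for m in range(2, n + 1):
        new = [0] * (len(counts) + m - 1)
        for inv, c in enumerate(counts):
            for extra in range(m):      # the m-th item adds 0..m-1 inversions
                new[inv + extra] += c
        counts = new
    return counts


def null_tail_probabilities(n):
    """Mass points a_1 < ... < a_L of tau and the tails tau_j of (5.9)."""
    counts = inversion_counts(n)
    pairs, total = comb(n, 2), factorial(n)
    # tau = 1 - 2 * inversions / pairs, listed in increasing order.
    points = [Fraction(pairs - 2 * inv, pairs) for inv in range(pairs, -1, -1)]
    tails, cumulative = [], 0
    for inv in range(pairs + 1):        # inv = 0 is the largest mass point
        cumulative += counts[inv]
        tails.append(Fraction(cumulative, total))
    return points, tails


def choose_J(tails, eta):
    """J = max{j : tau_j <= eta}; returns 0 if no tail is small enough."""
    return sum(1 for t in tails if t <= eta)


def exceedance_counts(stats, points, J):
    """W_K(tau_j) of (5.11) for j = 1..J."""
    L = len(points)
    return [sum(1 for t in stats if t >= points[L - j]) for j in range(1, J + 1)]
```

For $n = 4$ this gives $L = 7$ mass points with $\tau_1 = 1/24$ and $\tau_2 = 1/6$. The count quoted in the text for the three-sample case can be checked directly as `factorial(12) // factorial(4) ** 3`, which equals 34650.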
In any case, under alternatives (of stochastic ordering) the distribution of the $T^{o}_{nk}$ will be tilted towards the right, still confined to the interval $(-1, 1)$, and hence, their centering would be shifted to the right of the origin with a negatively skewed distribution.

Corresponding to the known points $\tau_1 < \cdots < \tau_J$, let us consider the partial process $W_K(\tau_j)$, $j = 1, \ldots, J$, as defined above. Also, let us choose a set of nonnegative integers $r_1 \le \cdots \le r_J$ in such a way that

$$P_0\{W_K(\tau_j) > r_j, \text{ for some } j \le J\} = \alpha, \tag{5.12}$$

where $\alpha$ may not be exactly equal to a specified level (such as 0.05) but can be approximated very well through the above Poisson process result. If we let

$$A_j = [W_K(\tau_j) > r_j], \quad j = 1, \ldots, J, \tag{5.13}$$

then (5.12) can be written as $P\{\bigcup_{j \le J} A_j\}$, so that by the Bonferroni inequality,

$$P\Bigl\{\bigcup_{j \le J} A_j\Bigr\} = \sum_{j \le J} P\{A_j\} - \sum_{1 \le j < j' \le J} P\{A_j A_{j'}\} + \cdots, \tag{5.14}$$

where, in view of Theorem 2, each $P\{A_j\}$ is approximated by the Poisson tail probability

$$P\{A_j\} \approx \sum_{r > r_j} \bigl\{ e^{-\nu_{nj}}\, \nu_{nj}^{r} / r! \bigr\}, \quad j \ge 1. \tag{5.15}$$

Further, note that $W_K(\tau_j)$ is a nondecreasing (step) function in $j$ so that using the Markov property and Theorem 2 we may evaluate $P\{A_j A_{j'}\}$. Actually, we write $P\{A_j A_k\} = P\{A_j\} \cdot P\{A_k \mid A_j\}$, for $k > j$, and use Theorem 2 to approximate the conditional probability by $P\{Z_k > r_k \mid Z_j > r_j\}$ where $r_k \ge r_j$, $\forall j < k$. Also, typically terms involving more than 2 events ($A_j$) will be small and can usually be neglected. Nevertheless, even if they are not small, the Markov property embedded in Theorem 2 can be used to provide a good approximation. Alternatively, we may write $P\{\bigcup_{j \le J} A_j\} = 1 - P\{\bigcap_{j \le J} A_j^c\}$ and, using Theorem 2, write $P\{\bigcap_{j \le J} A_j^c\}$ as a $J$-tuple sum over Poisson distributional probabilities. For small $r_j$, $j \le J$, as is typically the case, this computation does not appear to be a formidable task.
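The $J$-tuple sum mentioned at the end of the preceding paragraph can be organized as a short recursion over the independent increments of the approximating Poisson process. The Python sketch below is one possible arrangement, under the assumption that the process has mean function $\nu_1 \le \cdots \le \nu_J$ (for instance $\nu_j = K \tau_j$ under the null hypothesis) and that the thresholds are nondecreasing; the function names are hypothetical.

```python
from math import exp, factorial


def poisson_pmf(x, mu):
    """P{X = x} for X ~ Poisson(mu), intended for small x."""
    return exp(-mu) * mu ** x / factorial(x)


def prob_no_crossing(nu, r):
    """P{Z(tau_j) <= r_j for all j <= J} for a Poisson process Z whose mean
    at tau_j is nu[j-1]; nu and r are assumed nondecreasing."""
    state = {0: 1.0}                 # state[s] = P{Z = s, no crossing so far}
    previous_mean = 0.0
    for nu_j, r_j in zip(nu, r):
        increment_mean = nu_j - previous_mean
        new_state = {}
        for s, prob in state.items():
            # The increment may add at most r_j - s events without crossing.
            for x in range(r_j - s + 1):
                new_state[s + x] = (new_state.get(s + x, 0.0)
                                    + prob * poisson_pmf(x, increment_mean))
        state, previous_mean = new_state, nu_j
    return sum(state.values())


def rejection_level(nu, r):
    """Poisson approximation to the left-hand side of (5.12)."""
    return 1.0 - prob_no_crossing(nu, r)
```

As a small check of the recursion, with means $(0.5, 1.0)$ and thresholds $(1, 1)$ the event reduces to $Z(\tau_2) \le 1$, so `prob_no_crossing([0.5, 1.0], [1, 1])` agrees with $2e^{-1}$. In practice one would scan small integer thresholds and keep the combination whose `rejection_level` is closest to the target $\alpha$ without exceeding it.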
Led by these findings, let us now consider the following testing procedure:

Compute the $W_K(\tau_j)$, $j \le J$, as above. If $W_K(\tau_j) \le m_j$, $\forall j \le J$, accept the null hypothesis that there is no DG. On the other hand, if $W_K(\tau_j)$ is greater than $m_j$ for at least one $j \le J$, then reject the null hypothesis that all the genes are NDG, and proceed to detect those genes $k \in \mathcal{K}$ as DG where

$$\mathcal{K} = \{k \in \{1, \ldots, K\} : Y_k(\tau_j) = 1, \text{ for some } j \le J\}. \tag{5.16}$$

Note that if for some $k$, $Y_k(\tau_j) = 1$ for some $j \le J$, then $Y_k(\tau_{j'}) = 1$, $\forall j' \ge j$. Further, note that $\mathcal{K}$ is a stochastic subset of $\{1, \ldots, K\}$, and $R$ = cardinality of $\mathcal{K}$ is a (nonnegative) integer valued random variable. The overall significance level of this testing procedure is well approximated by the preassigned level $\alpha$. Let us denote the following exclusive events by

$$B_1 = A_1; \quad B_j = A_1^c \cdots A_{j-1}^c A_j, \quad j \le J. \tag{5.17}$$

Then, by definition, $\bigcup_{j \le J} A_j = \bigcup_{j \le J} B_j$. With the same notation as in (4.7)–(4.11), we study the other measures (viz., PCER, PFER, FDR and pFDR). Towards this, we consider the nonnull situation where $K_0$ are NDG and $K_1 = K - K_0$ are DG. To handle the distribution of $R$, the total number of rejections, we let

$$\tau^*_j = (K_0 \tau_j + K_1 \beta_j)/K = \tau_j + (K_1/K)(\beta_j - \tau_j), \quad j \ge 1, \tag{5.18}$$

where

$$\beta_j = K_1^{-1} \sum_{k \in DG} P\{T^{o}_{nk} \ge a_{L-j+1} \mid k \in \{1, \ldots, K\} - K_0\}, \tag{5.19}$$

for $j = 1, \ldots, J$. Note that, by arguments similar to those in Sections 3 and 4, $\beta_j \gg \tau_j$, $\forall j \le J$. We may write

$$E(m_1) = \sum_{j \le J} E(m_1 I(B_j)). \tag{5.20}$$

Next note that the events $B_j$, $j \le J$, depend on the partial process $W_K(\tau_j)$, $j \le J$, and are thereby governed by Theorem 2 with $\nu_{nj} = K \tau^*_j$, $j \le J$. On the other hand, the distribution of $m_1$ is governed by the process $W^{o}_K(\tau_j)$, $j \le J$, where the drift function for $W^{o}_K(\tau_j)$ is $\nu^{o}_{nj} = K_0 \tau_j$, $j \le J$.
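The testing procedure around (5.16) and the mixture tail (5.18) can be sketched in Python as follows. The sketch assumes that the cut-off values $a_{L-j+1}$, $j \le J$, and the thresholds $m_j$ have already been chosen; the function names, the 0-based gene indices, and the numbers in the usage note are illustrative assumptions rather than anything prescribed by the paper.

```python
def detect_dg(stats, cutoffs, thresholds):
    """Apply the procedure around (5.16).

    stats      : observed statistics T_nk, one per gene
    cutoffs    : a_{L-j+1} for j = 1..J (decreasing in j)
    thresholds : m_j for j = 1..J
    Returns (W, reject, detected): the counts W_K(tau_j), the test
    decision, and the 0-based indices of genes flagged as DG.
    """
    # W_K(tau_j) = number of genes with T_nk >= a_{L-j+1}
    counts = [sum(1 for t in stats if t >= c) for c in cutoffs]
    reject = any(w > m for w, m in zip(counts, thresholds))
    if not reject:
        return counts, False, []
    # Y_k(tau_j) = 1 for some j <= J exactly when T_nk >= smallest cutoff.
    final_cutoff = min(cutoffs)
    detected = [k for k, t in enumerate(stats) if t >= final_cutoff]
    return counts, True, detected


def mixture_tail(tau_j, beta_j, K0, K1):
    """tau*_j of (5.18): average tail probability over NDG and DG."""
    return (K0 * tau_j + K1 * beta_j) / (K0 + K1)
```

For example, with statistics `[1.0, 0.2, 2/3, -0.5, 1.0]`, cut-offs `[1.0, 2/3]` and thresholds `[1, 2]`, the counts are `[2, 3]`, the null hypothesis is rejected, and genes 0, 2 and 4 are flagged. With $\tau_j = 0.01$, $\beta_j = 0.5$, $K_0 = 90$ and $K_1 = 10$, both forms in (5.18) give $\tau^*_j = 0.059$.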
Using Theorem 2 and the reproductive property of the Poisson distribution, we may well approximate the (conditional) distribution of $m_1$, given $R$, by a binomial law with parameters $(R, K_0 \tau_j/(K_0 \tau_j + K_1 \beta_j))$ whenever $B_j$ holds. Thus, we are able to provide a good approximation to the PFER by writing

$$E(m_1) = \sum_{j \le J} E\{E(m_1/R \mid R, B_j)\, R\, I(B_j)\}. \tag{5.21}$$

If $J = 1$, the conditional binomial law directly applies and we have the approximation

$$\frac{K_0 \tau_1}{K_0 \tau_1 + K_1 \beta_1} \cdot E(R\, I(R > r_1)) = K_0 \tau_1 P\{R \ge r_1\}, \tag{5.22}$$

where the last step follows from the fact that for a Poisson variable $X$ with parameter $np$, $E(X I(X > r)) = np P\{X \ge r\}$, applied here with $np = K_0 \tau_1 + K_1 \beta_1 = K \tau^*_1$. For $J \ge 2$, we have to apply the conditional binomial law under the sets $B_j$, followed by the distribution of $R$ over the sets $B_j$, and this can be done by repeated quadrature procedures. Numerical studies have thereby good scope. By construction, rejection of the null hypothesis $H_0$ entails that $R > r_1$ and may even be greater than $r_1$ if $B_j$ pertains for some $j \ge 1$. As such, we do not have any problem in applying the original definition of FDR (in (4.10)). We write

$$\mathrm{FDR} = E(Q) = \sum_{j \le J} E(Q I(B_j)) = \sum_{j \le J} E\{E(Q \mid R \in B_j) I(B_j)\} \tag{5.23}$$

and use the conditional binomial law for each term in the right hand side. Detailed numerical study is planned for a future communication.

We conclude this section with some pertinent remarks and observations. First, the use of the Chen–Stein theorem in a multi-state context can be done under fairly mild regularity conditions regarding the dependence of the genes. Secondly, by our choice of the $r_j$, $j \le J$, and allowing possibly $J \ge 1$, we are not only in a position to allow more flexibility in the choice of statistical inference procedures but also to enforce the rejection of null hypothesis under a more structured setup.
This allows us to study the FDR, etc., under more diverse setups. Further, using Kendall's tau statistic for each gene separately, we are in a position to allow heterogeneity of the gene expressions across the $K$ genes in a completely arbitrary manner, while under the null hypothesis, the distribution of the $T^{o}_{nk}$, $k = 1, \ldots, K$, being completely known provides an easy access to the incorporation of the Chen–Stein theorem. Finally, instead of using Kendall's tau statistic (coordinate-wise), it might be attractive to use more general rank statistics [17]. Though the distribution-free aspect holds under the null hypothesis, such distributions are more complex to evaluate and the associated Poisson processes have more complex drift functions. Further, such linear rank statistics involve some design variables which assume more structure on the $F_{ik}$, $k = 1, \ldots, K$, not necessary with the use of Kendall's tau.

Acknowledgments. The author is grateful to the reviewers for their critical reading of the manuscript and most helpful comments. Thanks are also due to Dr. Moonsu Kang and Sunil Suchandran for providing the Figure in the text.

References

[1] Arratia, R., Goldstein, L. and Gordon, L. (1990). Poisson approximation and the Chen–Stein method: Rejoinder. Statist. Sci. 5 432–434. MR1092983
[2] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300. MR1325392
[3] Chen, L. H. Y. (1975). Poisson approximation for dependent trials. Ann. Probab. 3 534–545. MR0428387
[4] Dudoit, S., Shaffer, J. and Boldrick, J. (2003). Multiple hypothesis testing in microarray experiments. Statist. Sci. 18 71–103. MR1997066
[5] Ghosal, S., Sen, A. and van der Vaart, A. W. (2000). Testing monotonicity of regression. Ann. Statist. 28 1054–1081. MR1810919
[6] Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75 800–802. MR0995126
[7] Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Statist. 19 293–325. MR0026294
[8] Jurečková, J. and Sen, P. K. (1996). Robust Statistical Procedures: Asymptotics and Interrelations. Wiley, New York. MR1387346
[9] Karlin, S. (1969). A First Course in Stochastic Processes. Academic Press, New York. MR0208657
[10] Lehmann, E. L. and Romano, J. P. (2005). Generalizations of the familywise error rate. Ann. Statist. 33 1138–1154. MR2195631
[11] Lobenhofer, E. K., Bennett, L., Cable, P. L., Li, L., Bushel, P. R. and Afshari, C. A. (2002). Regulation of DNA replication fork genes by 17β-estradiol. Molecular Endocrinology 16 1219–1229.
[12] Peddada, S., Harris, S., Zajd, J. and Harvey, E. (2005). ORIGEN: Order restricted inference ordered gene expression data. Bioinformatics 21 3933–3934.
[13] Roy, S. N. (1953). A heuristic method of test construction and its use in multivariate analysis. Ann. Math. Statist. 24 220–238. MR0057519
[14] Sarkar, S. K. (2006). False discovery and false nondiscovery rates in single-step multiple testing procedures. Ann. Statist. 34 394–415. MR2275247
[15] Sen, P. K. (1968). Estimates of regression coefficients based on Kendall's tau. J. Amer. Statist. Assoc. 63 1379–1389. MR0258201
[16] Sen, P. K. (2004). Excursions in Biostochastics: Biometry to Biostatistics to Bioinformatics. Institute of Statistical Studies, Academia Sinica, Taipei.
[17] Sen, P. K. (2006). Robust statistical inference for high-dimensional data models with applications to genomics. Austrian J. Statist. 35 197–214.
[18] Sen, P. K., Tsai, M.-T. and Jou, Y.-S. (2007). High-dimension low sample size perspectives in constrained statistical inference: The SARSCoV RNA genome in illustration. J. Amer. Statist. Assoc. 102 686–694. MR2370860
[19] Sibuya, M. (1959). Bivariate extreme statistics. Ann. Inst. Statist. Math. 11 195–210. MR0115241
[20] Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73 751–754. MR0897872
[21] Storey, J. (2007). The optimal discovery procedure: a new approach to simultaneous significance testing. J. Roy. Statist. Soc. Ser. B 69 1–22. MR2323757