Kendall's tau in high-dimensional genomic parsimony

IMS Collections
Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh
Vol. 3 (2008) 251–266
© Institute of Mathematical Statistics, 2008
DOI: 10.1214/074921708000000183

Pranab K. Sen¹
University of North Carolina, Chapel Hill

* Supported in part by the C. C. Boshamer Research Foundation at the University of North Carolina, Chapel Hill.
¹ Departments of Biostatistics and Statistics and Operations Research, University of North Carolina, Chapel Hill, NC 27599-7420, USA, e-mail: pksen@bios.unc.edu

Abstract: High-dimensional data models, often with low sample size, abound in many interdisciplinary studies, genomics and large biological systems being most noteworthy. The conventional assumption of multinormality or linearity of regression may not be plausible for such models, which are likely to be statistically complex due to a large number of parameters as well as various underlying restraints. As such, parametric approaches may not be very effective. Anything beyond parametrics, albeit having increased scope and robustness perspectives, may generally be baffled by the low sample size and hence unable to give reasonable margins of errors. Kendall's tau statistic is exploited in this context with emphasis on dimensional rather than sample size asymptotics. The Chen–Stein theorem has been thoroughly appraised in this study. Applications of these findings in some microarray data models are illustrated.

AMS 2000 subject classifications: Primary 62G10, 62G99; secondary 62P99.
Keywords and phrases: bioinformatics, Chen–Stein theorem, dimensional asymptotics, FDR, multiple hypotheses testing, nonparametrics, permutational invariance, U-statistics.

Contents
1 Introduction
2 An illustrative data model
3 Some HDLSS formulations
4 Dimensional asymptotics and the union intersection test
5 Dimensional asymptotics and Chen–Stein theorem
Acknowledgments
References

1. Introduction

The past three decades have witnessed a phenomenal growth of research literature on statistical methods for large dimensional data models. Such models abound in various interdisciplinary fields, especially in the evolving field of genomics and bioinformatics. Knowledge discovery and data mining (KDDM) or statistical learning tools are usually advocated for such high dimensional data models, often on primarily computational or heuristic justifications. The curse of dimensionality is so overwhelming that classical likelihood (principle) based statistical inference tools, baffled with an excessive number of parameters, may not be robust or efficient. Conventional assumptions of multinormality of errors and linearity of regression models may not be generally tenable in such contexts. Moreover, with a large number of coordinate variables, the assumption of their stochastic independence may not be realistic in a majority of cases. On top of that, at least a part of the response variables may be discrete or even purely qualitative in nature; often, the categorical responses may not reveal any (partial) ordering. In that sense, discrete multivariate analysis may appear to be more appropriate than conventional multinormal model based analysis.
Even for multinormal models, the high dimensionality may demand a far larger sample size in order to implement a full likelihood based asymptotic analysis. That is, we need the conventional $n \gg K$ environment for drawing appropriate statistical conclusions with reasonable precision.

Typically, in such high-dimensional models, one encounters a $K \gg n$ environment, where $K$ is the dimension of the data and $n$ is the sample size. In such high-dimensional low sample size (HDLSS) models, effective dimension reduction may be a challenging statistical task, usually beyond the scope of KDDM. For example, in neuronal spike train models, there are literally tens of thousands of neurons (nerve cells), and in the presence of external stimuli, the spike trains for any observable subset of neurons exhibit a high degree of nonstationarity. Further, recording of such spike trains in a large number of nerve cells may be invasive to the brain functioning due to the destructive nature of recording ([16], Ch. 3). Essentially, we have a very high dimensional counting process. Doubly stochastic Poisson processes have been considered in the literature, albeit without much claim of optimal resolutions. In magnetic resonance imaging (MRI), there could be tens of thousands of microscopic units producing an enormously high dimensional spatial data model. More complexities may arise in the case of functional MRI (fMRI) models. For such HDLSS models, parametric asymptotics may not have adequate scope or good statistical interpretation.
The transition from conventional normal theory to nonparametric linear models has been well fortified along with the development of nonparametric or robust statistical methods based on R-statistics (ranks), M-statistics (maximization) and linear combinations of order statistics or L-statistics; see, for example, [8] where other pertinent references have been extensively cited. In a more general setup, nonparametric regression functionals have been formulated wherein the linearity of regression or a specific nonlinear form is not assumed to hold.

In the context of testing monotonicity of nonparametric regression, without assuming a linear or any specific nonlinear form, Ghosal et al. [5] considered suitable U-processes based on a locally smoothed Kendall's tau statistic. They provided general asymptotics for such locally smoothed Kendall's tau processes when both the independent and dependent variates are stochastic, and illustrated their effective use in the postulated hypothesis testing problem. Such local versions of Kendall's tau statistics have simple statistical interpretation, albeit, in view of a possibly slower rate of convergence, the impact of large sample size is apparent in their analysis. In the contemplated bioinformatics area, as we shall see, the HDLSS scenario calls for alternative approaches, and some of these will be explored in this study.

In a simple regression setup, the Theil–Sen (point as well as interval) estimates of the regression slope based on the Kendall tau statistic [15] have simple forms, and are computationally tractable and statistically robust. Another advantage of the Kendall tau statistic is its adaptability for count data as well as latent-effect models.
Further, a test for the null hypothesis of no regression based on the Kendall tau statistic (being distribution-free under the null hypothesis of invariance) remains valid and efficient for such complex models.

Our contemplated models, unlike [5], entail high dimensional data with relatively (and often inadequately) smaller sample size, i.e., the HDLSS ($K \gg n$) environment. As we shall see in the next section, there may not be a genuine temporal pattern. In addition, there may be other complications arising from lack of spatial compactness, spatial homogeneity and other spatial dependence patterns.

For better motivation, in Section 2, an illustration is made with a microarray data model where HDLSS models typically arise. Section 3 deals with the appropriateness of statistical modeling and analysis based on a pseudo-marginal approach incorporating coordinatewise construction of the Kendall tau statistic, in such $K \gg n$ environments. Section 4 is devoted to the dimensional asymptotics for the Kendall tau process in such HDLSS models where there are two basic problems: (i) group divergence, and (ii) classification of genes into disease and nondisease types. For the first problem, a pseudo-marginal approach based on the Hamming distance has been explored in [18], while in the latter context, multiple hypotheses testing (MHT) problems in HDLSS setups arise in a different perspective and call for some alternative novel tools for valid and efficient statistical appraisals. Motivated by these perspectives in such HDLSS models, some applications of the Chen–Stein [3] theorem in such $K \gg n$ environments are presented in the last section. These generalizations cover both the MHT and the gene-environment interaction testing problems.

2. An illustrative data model

We consider a genomic model arising in microarray data analysis as an illustration. The microarray technology allows simultaneous studies of thousands of genes, $K$, possibly differentially expressed under diverse biological/experimental setups, with only a few, $n$, arrays. We may refer to Lobenhofer et al. [11] where, for a set of 1900 genes, arranged in rows, the gene expressions were recorded at 6 time points, with 8 observations at each time point. Thus $1900 = K \gg n = 48$. The gene-expression levels are measured by their color intensity (or luminosity) as a quantitative (nonnegative) variable, either on the $(0, 1)$ or 0–100 percent scale, or (based on the log-scale) on the real line $\Re$. A gene associated (causally or statistically) with a target disease is known as a disease gene, DG, while the others are known as nondisease genes, NDG. Gene expression levels under different environments cast light on plausible gene-environment interactions (or associations) so that if the arrays are properly designed, mapping disease genes may be facilitated with such microarray studies. One of the main issues is identifying differentially expressed genes among thousands of genes, tested simultaneously, across experimental conditions. Typically, for a target disease, there are only a few DG while the NDG comprise the vast majority. A NDG is expected to have a low gene expression level while a DG is expected to have generally higher expression levels. Thus, a natural stochastic ordering of gene expression levels of the DG with varying disease severity is plausible while the NDG expression levels are expected to be stochastically unaffected by such disease level differentials. Microarray data go through a lot of standardization and normalization so that conventional simple models, such as the classical MANOVA models, may rarely be totally adaptable.
If the arrays are indexed by an explanatory or design variate ($t$) that possesses an ordering (not necessarily linear), then the stochastic ordering could be exploited through suitable nonparametric techniques.

The main difficulty in modeling and statistically analyzing microarray data stems from the high dimensionality of the genes compared to the number of arrays. While the different arrays may sometimes be taken to be at least statistically independent, the genes may not. Moreover, not much is known about the spatial topology of the genes or their genetic distances. There is another factor that merits our attention. The gene expression levels for the different genes in an array are neither expected to be stochastically independent nor (marginally) identically distributed. Sans such an i.i.d. clause, standard parametrics typically adaptable for fMRI models (albeit mostly done in a Bayesian coating) may encounter roadblocks for fruitful adaptation in microarray data models. Thus, structurally, such data models are different from those usually encountered in nonparametric functional regression models. For this reason, a pseudo-marginal approach is highlighted here. This approach exploits the marginal nonparametrics fully and renders some useful modeling and analysis convenience.

3. Some HDLSS formulations

Motivated by microarray data models introduced in Section 2, we consider here a set of $n$ arrays (sample observations) where there is a design variate $t_i$ associated with the $i$th array, for $i = 1, \ldots, n$. Without loss of generality, we assume the $t_i$ are ordered, i.e.,

(3.1)  $t_1 \le t_2 \le \cdots \le t_n$,

with at least one strict inequality. We do not, however, impose any linear or specific parametric ordering of these design variates. The multisample (ordered alternative) model is a particular case where $n$ can be partitioned into $I$ subsets of sizes $n_1, \ldots, n_I$ such that within each subgroup, the $t_i$ are the same while they are ordered over the $I$ different subsets. For the $i$th array, corresponding to the $K$ genes (positions), we have a gene expression level denoted by $X_{ik}$, $k = 1, \ldots, K$, so that we have $K$-vectors $X_i = (X_{i1}, \ldots, X_{iK})'$, for $i = 1, \ldots, n$. The joint distribution function of $X_i$ is denoted by $F_i(x)$, $x \in \Re^K$. Further, for the $k$th gene in the $i$th array, i.e., $X_{ik}$, the marginal distribution is denoted by $F_{ik}(x)$, $x \in \Re$, for $k = 1, \ldots, K$; $i = 1, \ldots, n$. For a given $i$, the $F_{ik}$, $k = 1, \ldots, K$, may not be generally the same, and moreover, the $X_{ik}$, $k = 1, \ldots, K$, may not be all stochastically independent. If a gene $k$ is NDG and the $t_i$ reflect the variability of the disease level, then the $F_{ik}$, $i = 1, \ldots, n$, should be the same. On the other hand, for a DG $k$, for $i < i'$, $X_{ik}$ should be stochastically smaller than $X_{i'k}$ in the sense that the $F_{ik}$, $i = 1, \ldots, n$, should have the ordering

(3.2)  $F_{1k}(x) \ge F_{2k}(x) \ge \cdots \ge F_{nk}(x), \quad \forall x \in \Re$.

Therefore, we could force a characterization of DG and NDG based on the following stochastic ordering: for a NDG $k$, the $F_{ik}$, $i = 1, \ldots, n$, are all the same, this being denoted by the null hypothesis $H_{0k}$, while for a DG $k$, the stochastic ordering in (3.2) holds, which we denote by $H_{1k}$, for $k = 1, \ldots, K$. In this marginal formulation, we have a set of $K$ hypotheses corresponding to the $K$ genes, and whatever appropriate test statistic (say $T_{nk}$) we use for testing $H_{0k}$ vs. $H_{1k}$, these statistics may not be, generally, stochastically independent.
The basic problem is therefore to test simultaneously for

(3.3)  $H_0 = \bigcap_{k=1}^{K} H_{0k}$  vs  $H_1 = \bigcup_{k=1}^{K} H_{1k}$,

without ignoring possible dependence of the test statistics for the component hypotheses testing $H_{0k}$ vs $H_{1k}$, for $k = 1, \ldots, K$. This makes it appealing to follow the general guidelines of the Roy [13] union-intersection principle (UIP), albeit in a marginalization (i.e., adapting a finite union and finite intersection scheme), and thus permitting a more general framework so as to allow simultaneous testing and classification into DG / NDG groups. In a very parametric setup, some order restricted inference problems have been considered by [12]. However, in our setup, such normality based parametric models may not be very appropriate.

Our approach is based on the classical Kendall tau statistics for each of the $K$ genes and the incorporation of these (possibly dependent) marginal statistics in a composite scheme for classification. For the $k$th gene, based on the $n$ observations $X_{ik}$, $i = 1, \ldots, n$, and the tagging variables $t_1, \ldots, t_n$, we define the Kendall tau statistic as

(3.4)  $T_{nk} = \binom{n}{2}^{-1} \sum_{1 \le i}$ [...] $t_{i''} \ne t_{i'}\}$.

For small values of $n$ and given (3.1), one can enumerate $S$ and obtain the exact distribution of $T^{o}_{nk}$ under $H_{0k}$. If $n$ is large, the standardized form of the statistic, i.e., $T^{o}_{nk}/\nu_n$, has closely a standard normal distribution. In our setup, perhaps the exact permutation distribution plays a greater role and this will be illustrated later on.

The behavior of $T^{o}_{nk}$ under alternatives would naturally depend on the stochastic ordering in (3.2) and these statistics will not be exactly distribution-free nor possibly have identical marginal laws.
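As an illustration of the exact permutation distribution mentioned above, the following Python sketch computes a Kendall tau for one gene and its right-tail permutation p-value by full enumeration. It assumes the textbook sign-product form of Kendall's tau with the $\binom{n}{2}$ normalization; the normalization of $T^{o}_{nk}$ used in the paper may differ, and the names `kendall_tau`, `exact_right_tail_pvalue` and the data values are hypothetical.

```python
from itertools import permutations


def sign(v):
    """Return -1, 0 or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_tau(t, x):
    """Textbook Kendall tau between design values t and responses x."""
    n = len(t)
    total = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += sign(t[j] - t[i]) * sign(x[j] - x[i])
    return total / (n * (n - 1) / 2)


def exact_right_tail_pvalue(t, x):
    """Exact P0{T >= observed} over all n! equally likely orderings of x."""
    observed = kendall_tau(t, x)
    at_least = 0
    total = 0
    for perm in permutations(x):
        total += 1
        # Small tolerance guards against floating-point ties.
        if kendall_tau(t, perm) >= observed - 1e-12:
            at_least += 1
    return observed, at_least / total


t = [1, 2, 3, 4, 5]              # hypothetical ordered design variate
x = [0.2, 0.5, 0.4, 0.9, 1.3]    # hypothetical expression levels of one gene
tau, p_value = exact_right_tail_pvalue(t, x)
```

Full enumeration is only feasible for small $n$ (here $5! = 120$ orderings); for larger $n$ one would sample permutations or use the normal approximation discussed in the text.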
Nevertheless, under (3.2), for every $i < i'$, $X_{i'k} - X_{ik}$ has a distribution tilted to the right, so that

(3.10)  $E\{T^{o}_{nk} \mid H_{1k}\} \ge 0, \quad \forall k = 1, \ldots, K$.

This motivates us to use tests based on the marginal statistics $T^{o}_{nk}$ using the right-hand side critical region, or equivalently the right-hand sided p-values. Recall that the distribution of each $T^{o}_{nk}$, at least for $n$ not too large, is discrete, but that is not going to be of any particular concern. A greater concern is to incorporate possible stochastic dependence among the $K$ statistics $T^{o}_{nk}$, $k = 1, \ldots, K$ (even under the null hypothesis) and their possible heterogeneity when some of the $H_{1k}$ are true. A basic problem is to formulate suitable multiple hypothesis testing procedures to assess which hypotheses are to be rejected subject to a suitably defined Type I error rate. This is elaborated in the next section.

4. Dimensional asymptotics and the union intersection test

Although independence across microarrays may be assumed, their identically distributed (i.d.) structure may be vitiated if the arrays relate to different biological or experimental setups. Moreover, for different genes, the gene expression (marginal) distributions are likely to be different when there is gene-environment interaction. Taking into account such plausible inter-gene stochastic dependence and heterogeneity, we need to prescribe statistical modeling and analysis tools. This will be accomplished through dimensional asymptotics where $K$ is made to increase indefinitely while $n$, being small compared to $K$, may or may not be adequately large.

In view of (3.3), it is tempting to appeal to the union-intersection principle [13], or UIP, to construct suitable test statistics which will cover the genome-wise picture in a reasonable way.
Towards this, we may note that under $H_0$ (i.e., $H_{0k}$, $\forall k$), marginally each $T^{o}_{nk}$ has the same distribution (which does not depend on the underlying $F_{ik}$). Thus, corresponding to any $c$: $-1 \le c \le 1$, the tail probability $P_0\{T^{o}_{nk} > c\}$ is the same for all $k$ and this can be evaluated by using the exact permutation distribution generated by the $n!$ permutations of the $X_{ik}$, $1 \le i \le n$. The UIP then leads to the following union-intersection test, UIT, statistic:

(4.1)  $T^{*o}_{n} = \max\{T^{o}_{nk} : 1 \le k \le K\}$,

where the test function is given by $\phi(T^{*o}_{n}) = 1$, $\gamma$, or $0$, accordingly as $T^{*o}_{n}$ is $>$, $=$ or $< c$, and $\gamma$ ($0 \le \gamma \le 1$) is so chosen that $E_0\{\phi(T^{*o}_{n})\} = \alpha$, the preassigned level of significance. Note that for $n$ not adequately large, the null distribution of $T^{o}_{nk}$ is essentially discrete and hence this usual randomization test function is aimed to take care of this problem.

The crux of the problem is therefore to determine such a critical level $c_\alpha$. The joint distribution of the $T^{o}_{nk}$, $1 \le k \le K$, even under the null hypothesis $H_0$, depends on the underlying $K$-dimensional distribution $F_i$, and hence, in general will not be distribution-free. Thus, the usual technique of finding out the critical level of $T^{*o}_{n}$ from this joint distribution may be intractable.

One possibility is to incorporate the fact that under $H_0$, the $K$-vectors $X_i$, $i = 1, \ldots, n$, are i.i.d. and hence their joint distribution remains invariant under any permutation of these vectors among themselves. Thereby we can evaluate such critical values by an appeal to the permutation distribution generated by the $n!$ equally likely permutations of the $K$-vectors $\{X_i\}$ among themselves. This permutation law generates the (unconditional null) marginal laws of the $T^{o}_{nk}$, and provides some conditional versions of their joint distributions of various orders.
Since this permutation law is a conditional law (given the collection of all these $K$-vectors), the critical values obtained in this manner are themselves stochastic, thus introducing another layer of variation. Nevertheless, it provides a conditionally distribution-free test.

One discouraging feature of this permutation approach is that the permutation invariance does not hold under the alternative hypothesis, and hence critical levels computed from the permutation law involving an observed set of $\{X_i\}$ may be sensitive to the data conformity to the null situation.

If we assume that all the $T^{o}_{nk}$ are stochastically independent, then we have for any $c$, $-1 \le c \le 1$, under $H_0$,

(4.2)  $P_0\{T^{*o}_{n} \le c\} = [P_0\{T^{o}_{n1} \le c\}]^{K}$,

so that the distribution-free nature of the $T_{nk}$ under the null hypothesis provides the access to the computation of the test function and the critical level. If $n$ is at least moderately large, in view of the asymptotic normality of $T^{o}_{nk}/\nu_n$, the randomization test function may be replaced by a conventional normal theory test function, where for the individual tests, a significance level $\alpha^*$ is so chosen that

(4.3)  $\alpha = 1 - (1 - \alpha^*)^{K}$.

Generally, if we let $\alpha^* = \alpha/K$, then the size of the UIT is $\le \alpha$ no matter whether the $T^{o}_{nk}$ are stochastically independent or not. There is, therefore, a certain amount of conservativeness in this specification.

In passing, we may remark that by the classical asymptotics on Hoeffding's U-statistics, any pair $(T^{o}_{nk}, T^{o}_{nq})$, with $k \ne q$, is a bivariate U-statistic, for $\alpha^*$ sufficiently small, so using the bivariate extreme statistics results (viz., [19]), we can claim that the events $\{T^{o}_{nk} > c_{\alpha^*}\}$ and $\{T^{o}_{nq} > c_{\alpha^*}\}$ will be asymptotically (as $K \to \infty$) independent so that $P_0\{T^{o}_{nk} > c_{\alpha^*}, T^{o}_{nq} > c_{\alpha^*}\}$ can be well approximated by $[P_0\{T^{o}_{nk} > c_{\alpha^*}\}]^2$.
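To make the relation between (4.3) and the choice $\alpha^* = \alpha/K$ concrete, the following Python sketch solves (4.3) for $\alpha^*$ and compares it with $\alpha/K$. The function names are hypothetical, and the values $\alpha = 0.05$ and $K = 1900$ are chosen only for illustration (the latter echoing the gene count quoted in Section 2).

```python
from math import expm1, log1p


def level_from_43(alpha, K):
    """Solve alpha = 1 - (1 - a)^K for the per-test level a, as in (4.3)."""
    return -expm1(log1p(-alpha) / K)


def level_alpha_over_K(alpha, K):
    """The simpler choice alpha* = alpha / K."""
    return alpha / K


def size_under_independence(alpha_star, K):
    """Overall size 1 - (1 - alpha*)^K when the K tests are independent."""
    return -expm1(K * log1p(-alpha_star))


alpha, K = 0.05, 1900
a_43 = level_from_43(alpha, K)
a_simple = level_alpha_over_K(alpha, K)
size_43 = size_under_independence(a_43, K)
size_simple = size_under_independence(a_simple, K)
```

Under independence the level from (4.3) attains the size $\alpha$ exactly, while $\alpha/K$ is slightly smaller and therefore slightly conservative, which matches the remark on conservativeness in the text.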
In a similar manner, the third order probability terms can be handled, and the Bonferroni bound retaining the second and third order probabilities provides a good approximation: $\alpha = K\alpha^* - \binom{K}{2}\alpha^{*2} + \binom{K}{3}\alpha^{*3} + o(\alpha^{*3})$. As a result, $\alpha^* = \alpha/K$ provides a good approximation to the level of significance. Therefore, for the UIT, when $K$ is large, even when the genes are not stochastically independent, letting $\alpha^* = \alpha/K$ we may consider the following multiple hypothesis testing scheme:

For a chosen $\alpha^* = K^{-1}\alpha$, obtain the marginal distributional critical level $c_{\alpha^*}$, and reject those $H_{0k}$; $k \in \{1, \ldots, K\}$ for which the corresponding $T^{o}_{nk}$ exceeds $c_{\alpha^*}$.

A randomization test function can be prescribed when $n$ is not adequately large. Thus, the UIT provides a bound on the familywise error rate, FWER.

If we take $\alpha^* \sim \alpha/K$ and $K$ is large, we need to make sure that $n$ is so large that $\nu_n^{-1} c_{\alpha^*} < 1$; this will imply that if we are to use the permutation null distribution of any $T^{o}_{nk}$, being attracted by the permutational central limit theorem, it has a nonzero mass point beyond $c_{\alpha^*}/\nu_n$. If $\nu_n^2 = O(n^{-1})$, as is typically the case, then $c_{\alpha^*} = O(n^{-1/2}\sqrt{-2\log\alpha^*})$ so that $\log K = O(n)$ and this does not appear to be a serious concern in real life applications. For example, if we have three groups of arrays, say within each group there are 5 arrays, the total number of ways of partitioning 15 units into 3 subsets of 5 each is equal to $(15)!/(5!)^3$ and this is so large (756,756) that even if $K$ is as large as 30,000, it would not be a problem. However, for large $K$, the UIT, like the classical likelihood ratio test, will have little power, and hence alternative test procedures need to be explored.
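The count quoted in this example can be checked directly. The Python sketch below evaluates $(15)!/(5!)^3$ and compares the smallest attainable permutation p-value with the per-test level $\alpha/K$; the choice $\alpha = 0.05$ and the helper name `partition_count` are assumptions made for illustration.

```python
from math import factorial


def partition_count(group_sizes):
    """Number of ways to split sum(group_sizes) labelled units into groups of the given sizes."""
    count = factorial(sum(group_sizes))
    for size in group_sizes:
        count //= factorial(size)
    return count


n_partitions = partition_count([5, 5, 5])   # three groups of five arrays
smallest_p = 1 / n_partitions               # finest resolution of the permutation p-value
alpha, K = 0.05, 30000                      # alpha is an illustrative choice
per_test_level = alpha / K
resolvable = smallest_p <= per_test_level
```

With these values the smallest attainable p-value lies below $\alpha/K$, so the permutation distribution still has mass beyond the critical level, in line with the remark that $K = 30{,}000$ would not be a problem here.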
This illustrates the important role of the design of the study and the number of arrays required in trying to include a very large $K$.

Roy's UIT can be adapted by exploring the information contained in the ordered p-values. If the $T^{o}_{nk}$ are all stochastically independent (and as they are identically distributed under the null hypothesis $H_0$) then one can adapt Simes' [20] theorem (which is a restatement of the classical Ballot theorem (viz., [9]) introduced some twenty years earlier). If $P_1, \ldots, P_K$ are the p-values for the $K$ marginal tests and $P_{K:1} \le \cdots \le P_{K:K}$ are the corresponding order statistics, then assuming that under $H_0$ the $P_k$ have a uniform $(0, 1)$ distribution (i.e., tacitly assuming that the $T^{o}_{nk}/\nu_n$ have a continuous distribution under $H_0$), Simes' theorem asserts that for every $\alpha$: $0 < \alpha < 1$,

(4.4)  $P\{P_{K:k} > k\alpha/K, \ \forall k = 1, \ldots, K \mid H_0\} = 1 - \alpha$.

Suppose now we define the anti-ranks $S_1, \ldots, S_K$ by letting

(4.5)  $P_{K:k} = P_{S_k}, \quad k = 1, \ldots, K$,

where again ties among the ranks are neglected under the assumption of continuity of the distribution of the $P_k$. Whereas Simes' theorem provides a test of the overall hypothesis, Hochberg [6] derived a step-up procedure for multiple hypotheses testing based on the following: for every $\alpha \in (0, 1)$,

(4.6)  $P\{P_{K:k} \ge \alpha/(K - k + 1), \ \forall k = 1, \ldots, K \mid H_0\} = 1 - \alpha$.

Benjamini and Hochberg [2] considered a step-up procedure based on the Simes theorem. Their multiple hypothesis testing procedure is the following:

Reject those null hypotheses $\{H_{0S_k}\}$ for which $P_{S_k} \le k\alpha/K$, $k = 1, \ldots, K$, and accept those null hypotheses in the complementary set.

For some related developments in a parametric setup, we refer to [2], [4], [10], [14] and [21], among others.
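The three p-value schemes above share the thresholds appearing in (4.4) and (4.6), and a short Python sketch may help to compare them. It follows the usual step-up convention (find the largest $k$ whose ordered p-value meets its threshold and reject the $k$ smallest), which is an assumption about how the procedures are to be read; the function names and the p-values are hypothetical.

```python
def simes_rejects_overall(pvalues, alpha):
    """Overall test: reject H0 if some ordered p-value P_(k) <= k * alpha / K."""
    K = len(pvalues)
    ordered = sorted(pvalues)
    return any(p <= (k + 1) * alpha / K for k, p in enumerate(ordered))


def step_up(pvalues, thresholds):
    """Reject the hypotheses with the k* smallest p-values, where k* is the
    largest k such that P_(k) <= thresholds[k - 1]; returns sorted indices."""
    K = len(pvalues)
    anti_ranks = sorted(range(K), key=lambda i: pvalues[i])  # S_1, ..., S_K
    k_star = 0
    for k in range(1, K + 1):
        if pvalues[anti_ranks[k - 1]] <= thresholds[k - 1]:
            k_star = k
    return sorted(anti_ranks[:k_star])


def hochberg(pvalues, alpha):
    """Step-up with thresholds alpha / (K - k + 1), as in (4.6)."""
    K = len(pvalues)
    return step_up(pvalues, [alpha / (K - k + 1) for k in range(1, K + 1)])


def benjamini_hochberg(pvalues, alpha):
    """Step-up with the Simes thresholds k * alpha / K."""
    K = len(pvalues)
    return step_up(pvalues, [k * alpha / K for k in range(1, K + 1)])


# Hypothetical p-values for K = 10 marginal tests.
p = [0.039, 0.001, 0.041, 0.008, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
overall = simes_rejects_overall(p, 0.05)
rejected_hochberg = hochberg(p, 0.05)
rejected_bh = benjamini_hochberg(p, 0.05)
```

In this example the Benjamini–Hochberg thresholds reject one more hypothesis than the Hochberg thresholds, reflecting the less stringent error criterion they target.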
These developments paved the way for other measures of error rates which are more adaptable in the $K \gg n$ environment. Some of these will be discussed later on.

There are two basic concerns that can be voiced in this respect. The whole setup is based on the assumed uniform distribution of the $P_k$ under the null hypothesis. However, if we look into the statistics $T^{o}_{nk}$ in our setup, we may note that though they have a specified distribution, the latter is a discrete one defined over the interval $(-1, 1)$. Noting that there is a set of discrete mass points, ties among the $T^{o}_{nk}/\nu_n$ (and hence the $P_k$) cannot be neglected with probability one, and moreover, the $P_k$ will have a set of probability mass points on $[0, 1]$ with nonzero masses. Thus, technically the above probability results are not strictly usable (unless $n$ is indefinitely large, contradicting the $K \gg n$ environment). Secondly, as was stressed earlier, the $T^{o}_{nk}$ across the set of genes are generally not stochastically independent.

Controlling the FWER when $K$ is very large may generally entail undue conservativeness of multiple hypotheses testing schemes. On the other hand, using a level of significance for each marginal hypothesis testing problem may lead to a large FWER. In the context of microarrays suppose that there are $K_1$ disease genes (DG) and $K_0 = K - K_1$ NDG; thus, we have a set of $K_0$ null hypotheses which are true and a complementary set of $K_1$ hypotheses which are not true. Suppose that based on our multiple hypotheses testing procedure, we accept $m_0$ out of $K_0$ true null hypotheses so that the remaining $K_0 - m_0 = m_1$ true null hypotheses are rejected. Similarly, among the $K_1$ not true null hypotheses, $l_0$ are accepted as true and $l_1$ accepted in favor of the alternative. Thus, a totality of $R = m_1 + l_1$ hypotheses are rejected while $K - R$ are accepted.
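The bookkeeping of $m_0$, $m_1$, $l_0$, $l_1$ and $R$ can be sketched in Python as follows. The second function averages the pairs $(m_1, R)$ over repeated experiments to give empirical counterparts of the error rates defined in (4.7) to (4.11); these averages merely stand in for the expectations, and the function names and all numerical values are hypothetical.

```python
def classify_counts(is_true_null, rejected):
    """Return (m0, m1, l0, l1, R) for one experiment with known truth."""
    m0 = m1 = l0 = l1 = 0
    for null_true, rej in zip(is_true_null, rejected):
        if null_true and not rej:
            m0 += 1      # true null accepted
        elif null_true and rej:
            m1 += 1      # true null rejected (Type I error)
        elif not rej:
            l0 += 1      # false null accepted
        else:
            l1 += 1      # false null rejected
    return m0, m1, l0, l1, m1 + l1


def empirical_error_rates(outcomes, K):
    """Average (m1, R) pairs over replicates to mimic PCER, PFER, FWER, FDR, pFDR."""
    reps = len(outcomes)
    pfer = sum(m1 for m1, _ in outcomes) / reps
    fwer = sum(1 for m1, _ in outcomes if m1 > 0) / reps
    q = [m1 / R if R > 0 else 0.0 for m1, R in outcomes]   # Q = 0 when R = 0
    fdr = sum(q) / reps
    n_positive = sum(1 for _, R in outcomes if R > 0)
    pfdr = sum(q) / n_positive if n_positive else 0.0
    return {"PCER": pfer / K, "PFER": pfer, "FWER": fwer, "FDR": fdr, "pFDR": pfdr}


truth = [True] * 6 + [False] * 4     # K0 = 6 true nulls, K1 = 4 false nulls
decisions = [False, True, False, False, False, False, True, True, False, True]
counts = classify_counts(truth, decisions)
rates = empirical_error_rates([(0, 0), (1, 4), (0, 3), (2, 5)], K=10)
```

In practice only $R$ is observed, so a summary of this kind is available only in simulations where the truth is known.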
Mind that though we observe $R$, through our chosen multiple hypotheses testing procedure, individually $m_1$, $l_1$ are not observable; all these $(R, l_1, m_1)$ are stochastic in nature.

A natural modification of the FWER, to suit such $K \gg n$ environments, is the per-comparison error rate (PCER) defined as

(4.7)  $\mathrm{PCER} = E(m_1)/K$,

which is the expected proportion of Type I errors among the $K$ hypotheses. A related measure is the per-family error rate (PFER), defined as

(4.8)  $\mathrm{PFER} = E(m_1)$,

which is the expected total number of Type I errors among the $K$ hypotheses. Obviously, $\mathrm{PFER} = K \cdot \mathrm{PCER}$, and is generally large when $K$ is large (unless the PCER is very small). Moreover,

(4.9)  $\mathrm{PFER} = E(m_1) = \sum_{r \ge 1} r P\{m_1 = r\} \ge P\{m_1 > 0\}$,

so that $\mathrm{PFER} \ge \mathrm{FWER}$.

If our observed $R = 0$ then no true null hypothesis is rejected and hence there is no false discovery. For $R \ge 1$, the proportion of false discovery is given by $Q = m_1/R$; conventionally, it is taken $Q = 0$ when $R = 0$, so that $Q$ is properly defined for every nonnegative $R$ and $m_1$. However, $Q$ is not observable. Hence, the false discovery rate (FDR) is defined as

(4.10)  $\mathrm{FDR} = E\{Q\} = \sum_{r \ge 1} P\{R = r\} E\{m_1/R \mid R = r\}$.

Since, conventionally, we have forced $Q = 0$ for $R = 0$, this definition of FDR may produce a negative bias. An alternative definition, known as the pFDR, is defined as

(4.11)  $\mathrm{pFDR} = E\{Q \mid R > 0\} = \mathrm{FDR}/P\{R \ge 1\}$.

Naturally, $\mathrm{pFDR} \ge \mathrm{FDR}$.

In the formulation of FDR and pFDR it is not necessary to assume that all of the test statistics have continuous distributions under the null hypothesis. If these distributions are all continuous then of course the p-values have a uniform $(0, 1)$ distribution under the null hypothesis, and hence, the multiple hypotheses testing schemes discussed earlier can be conveniently adapted. In our setup, each test statistic has marginally the same null distribution, albeit that is discrete. So, it might be necessary, especially when $n$ is not large, to make use of this otherwise completely specified, discrete distribution without assuming a uniform distribution for the associated p-values under the null hypothesis.

Fig 1. Comparison of the null distribution with the alternative distribution.

We may simulate the permutation distribution of any marginal test statistics and thereby take into account possible dependence among the gene expressions without assuming any specific pattern. Of course, marginally, each test statistic has the same null distribution. So, if we consider the set $\{T^{o}_{nk} : k = 1, \ldots, K\}$ and define the empirical distribution

(4.12)  $G_K(t) = K^{-1} \sum_{k=1}^{K} I(T^{o}_{nk} \le t), \quad t \in (-1, 1)$,

then $E_0\{G_K(t)\} = G(t)$, $\forall t \in (-1, 1)$, where $G(t)$ is the common marginal distribution of the $T^{o}_{nk}$ under the null hypothesis. The summands in $G_K(t)$ are all bounded variables, nondecreasing in $t \in (-1, 1)$, and $G(t)$ is also nondecreasing and assumes values on $(0, 1)$. Thus, whenever $G_K(t)$ stochastically converges pointwise to $G(t)$, it does so uniformly in $t \in (-1, 1)$. Further, $G_K(t) - G(t)$ is a bounded r.v., and hence, if it converges in probability, it converges in the $r$th mean for every $r > 0$. Therefore it might suffice to assume that the dependence pattern satisfies the condition:

(4.13)  $\mathrm{Var}(G_K(t)) \to 0$, as $K \to \infty$.

Then we conclude that $\|G_K(\cdot) - G(\cdot)\| = \sup\{|G_K(t) - G(t)| : t \in (-1, 1)\}$ stochastically converges to 0. Further, (4.13) holds under quite general dependence patterns.

It is naturally tempting to explore weak convergence (invariance principles) results for $\sqrt{K}(G_K(\cdot) - G(\cdot))$ wherein $K$ is taken indefinitely large but not $n$. Since
Since $G(t)$, $t \in (-1, 1)$, is a discrete distribution function with mass points over $(-1, 1)$, the jump-discontinuities of $G(\cdot)$ may vitiate the usual compactness (or tightness) properties possessed in the continuous case. Nevertheless, on strengthening (4.13) to

$$\limsup_{K} K\, \mathrm{Var}(G_K(t)) < \infty, \quad \forall t \in (-1, 1), \tag{4.14}$$

pointwise, the asymptotic normality (as $K \to \infty$) follows under quite general dependency conditions. If we had some linear functional of $G_K(\cdot)$ as a test statistic, this weak convergence would have been quite useful in deriving the asymptotic (in $K$) normality of the test statistic under the null hypothesis; (4.14) would have been sufficient in that context. However, in our case, we have a functional of $G_K(\cdot)$ of extremal order statistic type, namely, the extreme quantiles of a set of dependent r.v.s, and hence we may need somewhat different regularity conditions. This perspective is appraised more elaborately in the next section.

5. Dimensional asymptotics and Chen–Stein theorem

In the previous section we have briefly discussed the plausibility of some $K_0$ NDG and $K_1$ DG with $K_0 + K_1 = K$, the total number of genes. Neither $K_1$ nor the DG positions are known, and hence we have a dual problem of estimating $K_1$ as well as identifying the positions of these $K_1$ DG's. It is conceivable that the NDG have stochastically smaller expression levels (than the DG) and that the stochastic dependence among the DG may not be insignificant. We intend to incorporate this stochastic dependence structure among the gene expressions in a suitable model. Unfortunately, sans any positional ordering of the $K$ genes, it might be difficult to assume suitable mixing conditions under which central limit theorems may apply.
As for considering alternative limit theorems for dependent sequences, we intend to incorporate the Chen–Stein theorem [3] and its ramifications, wherein Poisson approximations for more general dependent sequences are advocated. For our convenience, let us state the Chen–Stein theorem in a slightly updated version [1].

Theorem 1. (Chen–Stein): Let $\mathcal{I}$ be an index set with elements $i \in \mathcal{I}$ and let $K$ be the cardinality of the set $\mathcal{I}$. For each $i \in \mathcal{I}$ let $Y_i$ be an indicator random variable and let

$$P\{Y_i = 1\} = 1 - P\{Y_i = 0\} = p_{Ki}, \quad i \in \mathcal{I}. \tag{5.1}$$

Let $W = \sum_{i \in \mathcal{I}} Y_i$ be the total number of occurrences of the events $\{Y_i = 1\}$, $i \in \mathcal{I}$, and let $\lambda_K = \sum_{i \in \mathcal{I}} p_{Ki} = E(W)$. For each $i \in \mathcal{I}$, we define a set $\mathcal{J}_i \subset \mathcal{I}$, the set of dependence of $i$, and its complement $\mathcal{J}_i^c$, the set of independence of $i$. Thus, it is tacitly assumed that $Y_i$ is independent of $\{Y_j, j \in \mathcal{J}_i^c\}$, for every $i \in \mathcal{I}$. Further, let

$$b_1 = \sum_{i \in \mathcal{I}} \sum_{j \in \mathcal{J}_i} E(Y_i) E(Y_j) = \sum_{i \in \mathcal{I}} \sum_{j \in \mathcal{J}_i} p_{Ki}\, p_{Kj}, \tag{5.2}$$

$$b_2 = \sum_{i \in \mathcal{I}} \sum_{j (\ne i) \in \mathcal{J}_i} E(Y_i Y_j), \tag{5.3}$$

and

$$b_3 = \sum_{i \in \mathcal{I}} E\big| E\{ Y_i - E(Y_i) \mid Y_j, \forall j \in \mathcal{J}_i^c \} \big|. \tag{5.4}$$

Finally, let $Z$ be a random variable having Poisson distribution with parameter $E(Z) = \lambda_K$. Then

$$\|\mathcal{L}(W) - \mathcal{L}(Z)\| \le 2(b_1 + b_2 + b_3)\, \frac{1 - e^{-\lambda_K}}{\lambda_K} \le 2(b_1 + b_2 + b_3) \min\{1, \lambda_K^{-1}\}. \tag{5.5}$$

A direct corollary to Theorem 1 is the following:

$$|P\{W = 0\} - e^{-\lambda_K}| \le 2(b_1 + b_2 + b_3) \min\{1, \lambda_K^{-1}\}. \tag{5.6}$$

An interesting feature of this theorem is the dual control of $\lambda_K$, the expectation, and of $b_1$, $b_2$, and $b_3$, the dependence functions. In line with our intended application we consider a natural extension of this result.
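To make the bound (5.5) concrete, the following sketch evaluates it in the simplest case of $K$ independent indicators with a common success probability $p$, where one may take $\mathcal{J}_i = \{i\}$, so that $b_1 = K p^2$ and $b_2 = b_3 = 0$. It assumes, as one reading of the convention in [1], that the norm in (5.5) is the total variation norm $\sum_w |P\{W = w\} - P\{Z = w\}|$; the function names are hypothetical.

```python
import math

def log_binom_pmf(w, K, p):
    """log P{W = w} for W ~ Binomial(K, p), via log-gamma to avoid overflow."""
    return (math.lgamma(K + 1) - math.lgamma(w + 1) - math.lgamma(K - w + 1)
            + w * math.log(p) + (K - w) * math.log1p(-p))

def log_pois_pmf(w, lam):
    """log P{Z = w} for Z ~ Poisson(lam)."""
    return -lam + w * math.log(lam) - math.lgamma(w + 1)

def l1_distance(K, p):
    """sum_w |P{W = w} - P{Z = w}|, including the Poisson mass beyond K."""
    lam = K * p
    dist, pois_mass = 0.0, 0.0
    for w in range(K + 1):
        b = math.exp(log_binom_pmf(w, K, p))
        z = math.exp(log_pois_pmf(w, lam))
        dist += abs(b - z)
        pois_mass += z
    return dist + max(0.0, 1.0 - pois_mass)

def chen_stein_bound(b1, b2, b3, lam):
    """First bound in (5.5): 2 (b1 + b2 + b3) (1 - exp(-lam)) / lam."""
    return 2.0 * (b1 + b2 + b3) * (1.0 - math.exp(-lam)) / lam

K, p = 2000, 0.001            # illustrative values only
lam = K * p
b1 = K * p * p                # J_i = {i}: only the diagonal terms p_i * p_i
bound = chen_stein_bound(b1, 0.0, 0.0, lam)
dist = l1_distance(K, p)
gap_at_zero = abs((1.0 - p) ** K - math.exp(-lam))  # left side of (5.6)
print(f"distance = {dist:.6f}, bound = {bound:.6f}, gap at zero = {gap_at_zero:.6f}")
```

With dependence, $b_2$ and $b_3$ no longer vanish and have to be bounded from the assumed dependence structure; the independent case only shows how $\lambda_K$ and $b_1$ enter the bound.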
With the same notation as in Theorem 1, we replace the $Y_i$, $i \in \mathcal{I}$, by a sequence of processes $Y_i(t)$, $i \in \mathcal{I}$, $t \in T$, where $T = (0, a)$ for some $a > 0$, and assume that for each $i$, $Y_i(t)$ is nondecreasing in $t$ and yet a zero-one valued random variable. Further assume that the sets $\mathcal{J}_i$ do not depend on $t \in T$. For every $i \in \mathcal{I}$, $t \in T$, we denote $p_{Ki}(t) = E(Y_i(t))$, and the corresponding parameters by $\lambda_K(t)$, $b_1(t)$, $b_2(t)$ and $b_3(t)$. Let $W_K = \{W_K(t), t \in T\}$ be the sum process and, corresponding to $Z$, we introduce a Poisson process $Z_K = \{Z_K(t), t \in T\}$ whose expectation process is $\lambda_K = \{\lambda_K(t), t \in T\}$. Then

$$\|\mathcal{L}(W_K) - \mathcal{L}(Z_K)\| \le 2 \sup\Big\{ (b_1(t) + b_2(t) + b_3(t))\, \frac{1 - e^{-\lambda_K(t)}}{\lambda_K(t)} : t \in T \Big\}.$$

The proof of this extension is along the lines of Theorem 1 and hence we omit the details. In our study, unless $n$ is large, we may not have a continuous time parameter ($t \in T$). Thus, we consider an intermediate result that remains applicable for small $n$ as well.

Theorem 2. Consider a set of $M$ discrete time points $-1 \le \tau_1 < \cdots < \tau_M \le 1$ with respective probability masses $\eta_{n1}, \ldots, \eta_{nM}$, where $M$ may depend on $n$. Also, let $\nu_{nj} = \sum_{i \le j} \eta_{ni}$, $j = 1, \ldots, M$. Further, let $Y_i(\tau_j)$, $i = 1, \ldots, K$, $j = 1, \ldots, M$, be an array of zero-one valued random variables where $Y_i(\tau_j)$ is nondecreasing in $\tau_j$ and $E(Y_i(\tau_j)) = \nu_{nj}$, $j = 1, \ldots, M$. Define $W_K = \{W_K(\tau_j), j = 1, \ldots, M\}$ where $W_K(\tau_j) = \sum_{i=1}^{K} Y_i(\tau_j)$ for $j = 1, \ldots, M$. Similarly, let $Z_K = \{Z_K(\tau_j), j = 1, \ldots, M\}$ be a discrete time parameter Poisson process with the drift function $\nu_K = \{\nu_{nj}, j = 1, \ldots, M\}$. Define the parameters $b_{K1}(\tau_j)$, $b_{K2}(\tau_j)$, $b_{K3}(\tau_j)$, $j = 1, \ldots, M$, as in (5.2), (5.3), and (5.4); assume that as $K \to \infty$,

$$\max\Big\{ (b_{K1}(\tau_j) + b_{K2}(\tau_j) + b_{K3}(\tau_j))\, \frac{1 - e^{-\nu_{nj}}}{\nu_{nj}} : j \le M \Big\} \to 0. \tag{5.7}$$

Then, as $K$ increases indefinitely,

$$\|\mathcal{L}(W_K) - \mathcal{L}(Z_K)\| \to 0. \tag{5.8}$$

Again, being a finite-dimensional version of Theorem 1, this does not need an elaborate proof.

In the present context, under the null hypothesis, all the $T^o_{nk}$ have a common distribution over $(-1, 1)$; this is discrete but symmetric about 0, and is completely known (though it could be computationally intensive if $n$ is not too small). Let us denote the distinct mass points for $T^o_{nk}$ by $-1 = a_1 < a_2 < \cdots < a_L = 1$ and let

$$\tau_j = P_0\{T^o_{nk} \ge a_{L-j+1}\}, \quad j = 1, \ldots, L. \tag{5.9}$$

Then $0 \le \tau_1 < \tau_2 < \cdots < \tau_L \le 1$. Also, let us write

$$Y_k(\tau_j) = I(T^o_{nk} \ge a_{L-j+1}), \quad j = 1, \ldots, L, \; k = 1, \ldots, K. \tag{5.10}$$

Further, let

$$W_K(\tau_j) = \sum_{k=1}^{K} Y_k(\tau_j), \quad j = 1, \ldots, L. \tag{5.11}$$

Also, let $J = \max\{j : 1 \le j \le L;\ \tau_j \le \eta\}$ for some pre-assigned $\eta > 0$. Basically, we would like to pursue the distributional features of the partial sequence $\{W_K(\tau_j), j \le J\}$ and incorporate Theorem 2. Note that in this way we avoid the conventional assumption of a continuous null distribution of the coordinate-wise test statistics. Of course, if $n$ is adequately large, the assumption of a uniform distribution of the $p$-values (under the null hypothesis) would be reasonable. For example, if we have a three-sample situation with $n_1 = n_2 = n_3 = 4$, then $L = (12)!/(4!)^3 = 34{,}650$, so that we could choose $J = 1$ and use the Poisson approximation. It is also possible to choose $J = 2$ with an appropriate cut-off point and still stick to a FWER around 0.05.
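As a small worked illustration of (5.9) and of the choice of $J$, the sketch below computes the upper-tail probabilities $\tau_j$ exactly. It assumes the simplest untied setting, which is not the three-sample design of the example above: Kendall's tau between $n$ distinct observations and their index order, for which $T = 1 - 4i/(n(n-1))$ when the ordering has $i$ inversions. The helper names are hypothetical.

```python
from fractions import Fraction
import math

def inversion_counts(n):
    """counts[i] = number of permutations of n items with exactly i inversions."""
    counts = [1]
    for m in range(2, n + 1):
        new = [0] * (len(counts) + m - 1)
        for i, c in enumerate(counts):
            for shift in range(m):  # placing the m-th item adds 0..m-1 inversions
                new[i + shift] += c
        counts = new
    return counts

def upper_tail_probs(n):
    """tau_j = P_0{T >= a_{L-j+1}}, j = 1..L, as exact fractions; see (5.9)."""
    total = math.factorial(n)
    taus, acc = [], 0
    for c in inversion_counts(n):  # i inversions <-> T = 1 - 4 i / (n (n - 1))
        acc += c
        taus.append(Fraction(acc, total))
    return taus

def choose_J(taus, eta):
    """J = max{j : tau_j <= eta}; returns 0 if even tau_1 exceeds eta."""
    return sum(1 for t in taus if t <= eta)

taus = upper_tail_probs(8)                 # n = 8 is an illustrative choice
J = choose_J(taus, Fraction(1, 2000))      # eta = 0.0005, also illustrative
print(len(taus), taus[:3], J)

# Arithmetic of the three-sample example in the text: 12! / (4!)^3.
n_arrangements = math.factorial(12) // math.factorial(4) ** 3
print(n_arrangements)
```

Exact fractions are used so that the comparison $\tau_j \le \eta$ is not affected by rounding. For a multi-sample design the null distribution would have to be enumerated over the group arrangements instead.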
In any case, under alternatives (of stochastic ordering) the distribution of the $T^o_{nk}$ will be tilted towards the right, still confined to the interval $(-1, 1)$, and hence their centering would be shifted to the right of the origin, with a negatively skewed distribution.

Corresponding to the known points $\tau_1 < \cdots < \tau_J$, let us consider the partial process $W_K(\tau_j)$, $j = 1, \ldots, J$, as defined above. Also, let us choose a set of nonnegative integers $r_1 \le \cdots \le r_J$ in such a way that

$$P_0\{W_K(\tau_j) > r_j, \text{ for some } j \le J\} = \alpha, \tag{5.12}$$

where $\alpha$ may not be exactly equal to a specified level (such as 0.05) but can be approximated very well through the above Poisson process result. If we let

$$A_j = [W_K(\tau_j) > r_j], \quad j = 1, \ldots, J, \tag{5.13}$$

then (5.12) can be written as $P\{\bigcup_{j \le J} A_j\}$, so that by the Bonferroni inequality,

$$P\Big\{\bigcup_{j \le J} A_j\Big\} = \sum_{j \le J} P\{A_j\} - \sum_{1 \le j < j' \le J} P\{A_j A_{j'}\} + \cdots,$$

where the individual terms are approximated by Poisson tail probabilities,

$$P\{A_j\} \approx \sum_{r > r_j} \{ e^{-\nu_{nj}} \nu_{nj}^{r} / r! \}, \quad j \ge 1.$$

Further, note that $W_K(\tau_j)$ is a nondecreasing (step) function in $j$, so that using the Markov property and Theorem 2 we may evaluate $P\{A_j A_{j'}\}$. Actually, we write $P\{A_j A_k\} = P\{A_j\} \cdot P\{A_k \mid A_j\}$, for $k > j$, and use Theorem 2 to approximate the conditional probability by $P\{Z_k > r_k \mid Z_j > r_j\}$, where $Z_j = Z_K(\tau_j)$ and $r_k \ge r_j$, $\forall j < k$. Also, typically, terms involving more than two events ($A_j$) will be small and can usually be neglected. Nevertheless, even if they are not small, the Markov property embedded in Theorem 2 can be used to provide a good approximation. Alternatively, we may write $P\{\bigcup_{j \le J} A_j\} = 1 - P\{\bigcap_{j \le J} A_j^c\}$ and, using Theorem 2, write $P\{\bigcap_{j \le J} A_j^c\}$ as a $J$-tuple sum over Poisson distributional probabilities. For small $r_j$, $j \le J$, as is typically the case, this computation does not appear to be a formidable task.
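The $J$-tuple sum over Poisson probabilities mentioned above can be organized as a short recursion. The sketch below is one possible arrangement, not the author's algorithm: it assumes that the approximating process has independent Poisson increments with known nondecreasing means (taken here as $K \tau_j$), and all function names and numerical values are hypothetical.

```python
import math

def pois_pmf(x, mean):
    """P{X = x} for X ~ Poisson(mean); mean = 0 is a point mass at 0."""
    if mean == 0.0:
        return 1.0 if x == 0 else 0.0
    return math.exp(-mean + x * math.log(mean) - math.lgamma(x + 1))

def smallest_threshold(mean, alpha):
    """Smallest r with P{Poisson(mean) > r} <= alpha (the case J = 1)."""
    r, cdf = 0, pois_pmf(0, mean)
    while 1.0 - cdf > alpha:
        r += 1
        cdf += pois_pmf(r, mean)
    return r

def prob_no_crossing(means, r):
    """P{Z_1 <= r_1, ..., Z_J <= r_J}, assuming independent Poisson increments."""
    # state[a] = P{Z_1 <= r_1, ..., Z_{j-1} <= r_{j-1}, Z_j = a}
    state = [pois_pmf(a, means[0]) for a in range(r[0] + 1)]
    for j in range(1, len(means)):
        inc = means[j] - means[j - 1]
        new = [0.0] * (r[j] + 1)
        for a, pa in enumerate(state):
            for b in range(a, r[j] + 1):   # the process cannot decrease
                new[b] += pa * pois_pmf(b - a, inc)
        state = new
    return sum(state)

def crossing_prob(means, r):
    """Approximation to (5.12): P{Z_j > r_j for some j <= J}."""
    return 1.0 - prob_no_crossing(means, r)

def bonferroni_upper(means, r):
    """First-order Bonferroni bound: sum over j of P{Z_j > r_j}."""
    return sum(1.0 - sum(pois_pmf(x, m) for x in range(rj + 1))
               for m, rj in zip(means, r))

K = 2000                               # illustrative number of genes
means = [K / 40320, K / 5040]          # K * tau_1, K * tau_2 (assumed values)
r = [smallest_threshold(m, 0.05) for m in means]
print(r, crossing_prob(means, r), bonferroni_upper(means, r))
```

The thresholds printed here are chosen marginally, so the joint crossing probability can exceed 0.05; in practice one would adjust the $r_j$ until `crossing_prob` is close to the intended $\alpha$.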
Led by these findings, let us now consider the following testing procedure:

Compute the $W_K(\tau_j)$, $j \le J$, as above. If $W_K(\tau_j) \le r_j$, $\forall j \le J$, accept the null hypothesis that there is no DG. On the other hand, if $W_K(\tau_j)$ is greater than $r_j$ for at least one $j \le J$, then reject the null hypothesis that all the genes are NDG, and proceed to detect those genes $k \in \mathcal{K}$ as DG, where

$$\mathcal{K} = \{k \in \{1, \ldots, K\} : Y_k(\tau_j) = 1, \text{ for some } j \le J\}. \tag{5.16}$$

Note that if for some $k$, $Y_k(\tau_j) = 1$ for some $j \le J$, then $Y_k(\tau_{j'}) = 1$, $\forall j' \ge j$. Further, note that $\mathcal{K}$ is a stochastic subset of $\{1, \ldots, K\}$, and $R$ = cardinality of $\mathcal{K}$ is a (nonnegative) integer valued random variable. The overall significance level of this testing procedure is well approximated by the preassigned level $\alpha$. Let us denote the following exclusive events by

$$B_1 = A_1; \quad B_j = A_1^c \cdots A_{j-1}^c A_j, \quad j \le J. \tag{5.17}$$

Then, by definition, $\bigcup_{j \le J} A_j = \bigcup_{j \le J} B_j$. With the same notation as in (4.7)–(4.11), we study the other measures (viz., PCER, PFER, FDR and pFDR). Towards this, we consider the nonnull situation where $K_0$ are NDG and $K_1 = K - K_0$ are DG. To handle the distribution of $R$, the total number of rejections, we let

$$\tau_j^* = (K_0 \tau_j + K_1 \beta_j)/K = \tau_j + (K_1/K)(\beta_j - \tau_j), \quad j \ge 1, \tag{5.18}$$

where

$$\beta_j = K_1^{-1} \sum_{k \in DG} P\{T^o_{nk} \ge a_{L-j+1} \mid k \in \{1, \ldots, K\} - K_0\}, \tag{5.19}$$

for $j = 1, \ldots, J$. Note that, by arguments similar to those in Sections 3 and 4, $\beta_j \gg \tau_j$, $\forall j \le J$. We may write

$$E(m_1) = \sum_{j \le J} E(m_1 I(B_j)). \tag{5.20}$$

Next note that the events $B_j$, $j \le J$, depend on the partial process $W_K(\tau_j)$, $j \le J$, and are thereby governed by Theorem 2 with $\nu_{nj} = K \tau_j^*$, $j \le J$. On the other hand, the distribution of $m_1$ is governed by the process $W^o_K(\tau_j)$, $j \le J$, where the drift function for $W^o_K(\tau_j)$ is $\nu^o_{nj} = K_0 \tau_j$, $j \le J$.
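The decision rule and the detected set (5.16) translate almost directly into code. The following is a minimal sketch under stated assumptions: the per-gene statistics and the cut points $a_L > a_{L-1} > \cdots > a_{L-J+1}$ are supplied as exact fractions, the thresholds are the integers $r_1 \le \cdots \le r_J$ of (5.12), and `run_procedure` is a hypothetical name. The toy data are invented for illustration.

```python
from fractions import Fraction

def run_procedure(stats, cutoffs, thresholds):
    """Apply the testing procedure to per-gene statistics.

    stats      : per-gene values T_k, k = 0..K-1
    cutoffs    : decreasing cut points a_L, a_{L-1}, ..., a_{L-J+1}
    thresholds : nondecreasing integers r_1, ..., r_J
    """
    # W_K(tau_j) of (5.11): number of genes with T_k >= a_{L-j+1}
    W = [sum(1 for t in stats if t >= c) for c in cutoffs]
    reject = any(w > r for w, r in zip(W, thresholds))
    # Y_k(tau_j) is nondecreasing in j, so "for some j <= J" reduces to j = J.
    detected = [k for k, t in enumerate(stats) if t >= cutoffs[-1]] if reject else []
    return W, reject, detected

# Toy example with n = 8 untied observations: mass points are 1 - i/14.
stats = [Fraction(1), Fraction(13, 14), Fraction(0),
         Fraction(-3, 14), Fraction(1), Fraction(5, 14)]
cutoffs = [Fraction(1), Fraction(13, 14)]      # J = 2

W, reject, detected = run_procedure(stats, cutoffs, [1, 2])
print(W, reject, detected, len(detected))      # len(detected) plays the role of R
```

Exact fractions avoid spurious misses when a statistic sits exactly on a cut point, which happens routinely with a discrete null distribution.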
Using Theorem 2 and the reproductive property of the Poisson distribution, we may well approximate the (conditional) distribution of $m_1$, given $R$, by a binomial law with parameters $(R, K_0 \tau_j / (K_0 \tau_j + K_1 \beta_j))$ whenever $B_j$ holds. Thus, we are able to provide a good approximation to the PFER by writing

$$E(m_1) = \sum_{j \le J} E\{ E(m_1/R \mid R, B_j)\, R\, I(B_j) \}. \tag{5.21}$$

If $J = 1$, the conditional binomial law directly applies and we have the approximation

$$\frac{K_0 \tau_1}{K_0 \tau_1 + K_1 \beta_1} \cdot E(R\, I(R > r_1)) = K_0 \tau_1 P\{R \ge r_1\}, \tag{5.22}$$

where the last step follows from the fact that for a Poisson variable $X$ with parameter $np$, $E(X I(X > r)) = np\, P\{X \ge r\}$. For $J \ge 2$, we have to apply the conditional binomial law under the sets $B_j$, followed by the distribution of $R$ over the sets $B_j$, and this can be done by repeated quadrature procedures. There is thus good scope for numerical studies.

By construction, rejection of the null hypothesis $H_0$ entails that $R > r_1$ and may even be greater than $r_1$ if $B_j$ pertains for some $j \ge 1$. As such, we do not have any problem in applying the original definition of FDR (in (4.10)). We write

$$\mathrm{FDR} = E(Q) = \sum_{j \le J} E(Q\, I(B_j)) = \sum_{j \le J} E\{ E(Q \mid R \in B_j)\, I(B_j) \} \tag{5.23}$$

and use the conditional binomial law for each term on the right hand side. A detailed numerical study is planned for a future communication.

We conclude this section with some pertinent remarks and observations. First, the use of the Chen–Stein theorem in a multi-state context can be done under fairly mild regularity conditions regarding the dependence of the genes. Secondly, by our choice of the $r_j$, $j \le J$, and allowing possibly $J \ge 1$, we are not only in a position to allow more flexibility in the choice of statistical inference procedures but also to enforce the rejection of the null hypothesis under a more structured setup.
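For $J = 1$ the approximations above reduce to a few Poisson tail evaluations. The sketch below checks the identity $E(X I(X > r)) = \lambda P\{X \ge r\}$ numerically and then evaluates the PFER approximation (5.22). It also evaluates an FDR approximation obtained by reading (5.23) with $E(m_1/R \mid R) = K_0\tau_1/(K_0\tau_1 + K_1\beta_1)$ under the binomial law; that reading, the function names, and all numerical inputs (in particular $\beta_1 = 0.2$) are assumptions made for illustration.

```python
import math

def pois_pmf(x, mean):
    """P{X = x} for X ~ Poisson(mean), mean > 0."""
    return math.exp(-mean + x * math.log(mean) - math.lgamma(x + 1))

def pois_sf(r, mean):
    """P{X > r} for X ~ Poisson(mean); r = -1 gives 1."""
    return 1.0 - sum(pois_pmf(x, mean) for x in range(r + 1))

def approx_rates_J1(K0, K1, tau1, beta1, r1):
    """Poisson/binomial approximations for J = 1 (a sketch, not exact rates)."""
    mean_R = K0 * tau1 + K1 * beta1               # K * tau*_1, see (5.18)
    pi0 = K0 * tau1 / mean_R                      # binomial success probability
    pfer = K0 * tau1 * pois_sf(r1 - 1, mean_R)    # (5.22): K0 tau1 P{R >= r1}
    fdr = pi0 * pois_sf(r1, mean_R)               # pi0 * P{R > r1}, assumed reading
    return {"mean_R": mean_R, "pi0": pi0, "PFER": pfer, "FDR": fdr}

# Numerical check of E(X I(X > r)) = lam * P{X >= r}; the sum is truncated at 200.
lam, r = 3.0, 2
lhs = sum(x * pois_pmf(x, lam) for x in range(r + 1, 200))
rhs = lam * pois_sf(r - 1, lam)
print(lhs, rhs)

# Hypothetical inputs: 1950 NDG, 50 DG, tau_1 = 1/40320, beta_1 = 0.2, r_1 = 0.
rates = approx_rates_J1(1950, 50, 1.0 / 40320, 0.2, 0)
print(rates)
```

In this toy setting the expected number of rejections is dominated by the DG term $K_1 \beta_1$, so the approximate FDR stays close to the small proportion `pi0`.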
The flexibility in the choice of the $r_j$ and of $J$ allows us to study the FDR, etc., under more diverse setups. Further, using Kendall's tau statistic for each gene separately, we are in a position to allow heterogeneity of the gene expressions across the $K$ genes in a completely arbitrary manner, while under the null hypothesis the distribution of the $T^o_{nk}$, $k = 1, \ldots, K$, being completely known, provides easy access to the incorporation of the Chen–Stein theorem. Finally, instead of using Kendall's tau statistic (coordinate-wise), it might be attractive to use more general rank statistics [17]. Though the distribution-free aspect holds under the null hypothesis, such distributions are more complex to evaluate and the associated Poisson processes have more complex drift functions. Further, such linear rank statistics involve some design variables which assume more structure on the $F_{ik}$, $k = 1, \ldots, K$, not necessary with the use of Kendall's tau.

Acknowledgments. The author is grateful to the reviewers for their critical reading of the manuscript and most helpful comments. Thanks are also due to Dr. Moonsu Kang and Sunil Suchandran for providing the Figure in the text.

References

[1] Arratia, R., Goldstein, L. and Gordon, L. (1990). Poisson approximation and the Chen–Stein method: Rejoinder. Statist. Sci. 5 432–434. MR1092983
[2] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300. MR1325392
[3] Chen, L. H. Y. (1975). Poisson approximation for dependent trials. Ann. Probab. 3 534–545. MR0428387
[4] Dudoit, S., Shaffer, J. and Boldrick, J. (2003). Multiple hypothesis testing in microarray experiments. Statist. Sci. 18 71–103. MR1997066
[5] Ghosal, S., Sen, A. and van der Vaart, A. W. (2000). Testing monotonicity of regression. Ann. Statist. 28 1054–1081. MR1810919
[6] Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75 800–802. MR0995126
[7] Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Statist. 19 293–325. MR0026294
[8] Jurečková, J. and Sen, P. K. (1996). Robust Statistical Procedures: Asymptotics and Interrelations. Wiley, New York. MR1387346
[9] Karlin, S. (1969). A First Course in Stochastic Processes. Academic Press, New York. MR0208657
[10] Lehmann, E. L. and Romano, J. P. (2005). Generalizations of the familywise error rate. Ann. Statist. 33 1138–1154. MR2195631
[11] Lobenhofer, E. K., Bennett, L., Cable, P. L., Li, L., Bushel, P. R. and Afshari, C. A. (2002). Regulation of DNA replication fork genes by 17β-estradiol. Molecular Endocrinology 16 1219–1229.
[12] Peddada, S., Harris, S., Zajd, J. and Harvey, E. (2005). ORIGEN: Order restricted inference ordered gene expression data. Bioinformatics 21 3933–3934.
[13] Roy, S. N. (1953). A heuristic method of test construction and its use in multivariate analysis. Ann. Math. Statist. 24 220–238. MR0057519
[14] Sarkar, S. K. (2006). False discovery and false nondiscovery rates in single-step multiple testing procedures. Ann. Statist. 34 394–415. MR2275247
[15] Sen, P. K. (1968). Estimates of regression coefficients based on Kendall's tau. J. Amer. Statist. Assoc. 63 1379–1389. MR0258201
[16] Sen, P. K. (2004). Excursions in Biostochastics: Biometry to Biostatistics to Bioinformatics. Institute of Statistical Studies, Academia Sinica, Taipei.
[17] Sen, P. K. (2006). Robust statistical inference for high-dimensional data models with applications to genomics. Austrian J. Statist. 35 197–214.
[18] Sen, P. K., Tsai, M.-T. and Jou, Y.-S. (2007). High-dimension low sample size perspectives in constrained statistical inference: The SARSCoV RNA genome in illustration. J. Amer. Statist. Assoc. 102 686–694. MR2370860
[19] Sibuya, M. (1959). Bivariate extreme statistics. Ann. Inst. Statist. Math. 11 195–210. MR0115241
[20] Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73 751–754. MR0897872
[21] Storey, J. (2007). The optimal discovery procedure: a new approach to simultaneous significance testing. J. Roy. Statist. Soc. Ser. B 69 1–22. MR2323757
