On the Permutation Distribution of Independence Tests
One of the most popular classes of tests for independence between two random variables is the general class of rank statistics which are invariant under permutations. This class contains Spearman's coefficient of rank correlation statistic, Fisher-Yate…
Authors: Ehab F. Abd-Elfattah (Ain Shams University, Cairo, Egypt)
Ehab F. Abd-Elfattah
Ain Shams University, Cairo, Egypt
ehab@ASUnet.shams.edu.eg

Abstract

One of the most popular classes of tests for independence between two random variables is the general class of rank statistics that are invariant under permutations. This class contains Spearman's coefficient of rank correlation, the Fisher-Yates statistic, the weighted Mann statistic, and others. Under the null hypothesis of independence these test statistics have a permutation distribution, and normal asymptotic theory is usually used to approximate their p-values. In this note we suggest a saddlepoint approach that is almost exact and needs no extensive simulation to calculate the p-value for this class of tests.

Some key words: Independence tests; Linear rank test; Permutation distribution; Saddlepoint approximation.

1 Introduction

When the factors being studied are not treatments that the investigator can assign to the subjects, but conditions or attributes that are inseparably attached to those subjects, an assumption that needs to be tested is that an association exists between two factors in a population of subjects. Suppose we observe $N$ independent pairs of random variables $(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)$ and wish to test the null hypothesis $H_0$ that the two variables $X_i$ and $Y_i$ are independent for each $i$.

Now rearrange all $N$ pairs of observations according to the magnitude of their first coordinate into the sequence $(X_{d_1}, Y_{d_1}), (X_{d_2}, Y_{d_2}), \ldots, (X_{d_N}, Y_{d_N})$ in such a way that $X_{d_1} < X_{d_2} < \cdots < X_{d_N}$. Then put $R_i$ equal to the rank of $Y_{d_i}$ among the observations $Y_{d_1}, Y_{d_2}, \ldots, Y_{d_N}$. Under the assumption of independence and assuming no ties, all $N!$
orderings $(R_1, \ldots, R_N)$ are equally likely, each with probability $1/N!$. If we are willing to assume that the two factors have a positive association, the $\{R_i\}$ should reveal an upward trend, with large values tending to occur on the right of the sequence and low values on the left. An appropriate test statistic that reflects this idea is

$$D = \sum_{i=1}^{N} (R_i - i)^2 \tag{1}$$

with small values of $D$ indicating significance.

The statistic $D$ is related to the well-known Spearman's coefficient of rank correlation, $S_p$, by $S_p = 1 - 6D / \{N(N^2 - 1)\}$; see Gibbons and Chakraborti (2003). It is also related to the weighted Mann statistic, $D'$, by $D' = \frac{1}{6} N(N^2 - 1) - \frac{1}{2} D$. Expanding (1), $D$ can be written as

$$D = \frac{1}{3} N(N+1)(2N+1) - 2 \sum_{i=1}^{N} i R_i ,$$

which gives the equivalent, simpler statistic

$$V' = \sum_{i=1}^{N} i R_i \tag{2}$$

(Hajek, Sidak and Sen, 1999).

The statistic $V'$ is a member, with $f_N(i) = i$, of a general class of rank statistics whose null distributions are invariant under permutations. This class can be written as

$$S = \sum_{i=1}^{N} f_N(i)\, f_N(R_i) \tag{3}$$

It contains the Fisher-Yates normal score test, with $f_N(i) = E\,U_N^{(i)}$, where $U_N^{(1)} < U_N^{(2)} < \cdots < U_N^{(N)}$ is an ordered sample of $N$ observations from the standard normal distribution; the van der Waerden test statistic, with $f_N(i) = \Phi^{-1}\!\left(\frac{i}{N+1}\right)$, where $\Phi$ is the standard normal distribution function; and the quadrant test statistic, with $f_N(i) = \operatorname{sign}\!\left(i - \frac{N+1}{2}\right)$.

Saddlepoint approximations to randomization distributions were introduced by Daniels (1958) and further developed by Robinson (1982) and Davison and Hinkley (1988). Booth and Butler (1990) showed that various randomization and resampling distributions are the same as certain conditional distributions and that the double saddlepoint approximation attains accuracy comparable to the single saddlepoint approach.
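As an illustrative aside, the relations among $D$, $S_p$, $D'$, $V'$ and the general class (3) stated earlier in this section can be checked numerically. The following is a minimal Python sketch and not part of the original note; the function names and the five-element rank sequence are hypothetical choices made only for illustration.

```python
from statistics import NormalDist

def spearman_family(ranks):
    """Return (D, S_p, D', V') for a tie-free rank sequence R_1..R_N."""
    n = len(ranks)
    d = sum((r - i) ** 2 for i, r in enumerate(ranks, start=1))   # statistic (1)
    v_prime = sum(i * r for i, r in enumerate(ranks, start=1))    # statistic (2)
    s_p = 1 - 6 * d / (n * (n ** 2 - 1))                          # Spearman's coefficient
    d_mann = n * (n ** 2 - 1) / 6 - d / 2                         # weighted Mann statistic
    return d, s_p, d_mann, v_prime

def class_statistic(ranks, score):
    """General class (3): S = sum_i f_N(i) * f_N(R_i) for a score function f_N."""
    n = len(ranks)
    return sum(score(i, n) * score(r, n) for i, r in enumerate(ranks, start=1))

def identity_score(i, n):
    # f_N(i) = i, which turns (3) into V'
    return i

def van_der_waerden_score(i, n):
    # f_N(i) = Phi^{-1}(i / (N + 1))
    return NormalDist().inv_cdf(i / (n + 1))

def quadrant_score(i, n):
    # f_N(i) = sign(i - (N + 1) / 2)
    c = i - (n + 1) / 2
    return (c > 0) - (c < 0)

ranks = [2, 1, 4, 3, 5]          # hypothetical example ranks
n = len(ranks)
d, s_p, d_mann, v_prime = spearman_family(ranks)
d_expanded = n * (n + 1) * (2 * n + 1) / 3 - 2 * v_prime   # expansion of (1)

print("D =", d, " expanded D =", d_expanded)
print("S_p =", s_p, " D' =", d_mann, " V' =", v_prime)
print("S with identity scores =", class_statistic(ranks, identity_score))
print("S with quadrant scores =", class_statistic(ranks, quadrant_score))
print("S with van der Waerden scores =", class_statistic(ranks, van_der_waerden_score))
```

For this rank sequence the direct and expanded forms of $D$ agree, and the class statistic with identity scores coincides with $V'$, which is why a p-value for $V'$ is also a p-value for $D$, $S_p$ and $D'$.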
Recently, Abd-Elfattah and Butler (2007) used the double saddlepoint approximation to calculate p-values and confidence intervals for the class of linear rank two-sample statistics for censored data.

In this note we present a simple, fast and accurate saddlepoint approach, based on the double saddlepoint approximation, that approximates the exact p-value for the class of tests above without any extensive permutation simulation. To apply the double saddlepoint approximation, the following lemma reformulates the class (3) in a simpler, more convenient form.

Lemma 1 The class of statistics (3) can be written in an equivalent form as

$$V = L^T \sum_{i=1}^{N} f_N(i)\, Z_i \tag{4}$$

where $L^T = (f_N(1), f_N(2), \ldots, f_N(N))$, and $Z_1, Z_2, \ldots, Z_N$ are $N \times 1$ vectors of the form $Z_{R_i} = \eta_i$, $i = 1, \ldots, N$, where the $N \times N$ identity matrix is $I_N = (\eta_1, \eta_2, \ldots, \eta_N)$.

Proof. Simple algebra. For example, if $R_1 = 2$ is the arithmetical rank, then $Z_2 = \eta_1$ and $\sum_{i=1}^{N} i Z_i$ has a 2 in its first component for $R_1$.

Section 2 presents the saddlepoint approximation approach. A real data example is given in Section 3, along with a simulation study that shows the performance of the saddlepoint method. The application of the saddlepoint method to the Cuzick (1982) test statistic in the case of interval censoring is discussed in Section 4.

2 Saddlepoint Approximation for Tests of Independence

Under the null hypothesis $H_0$ of independence, the permutation distribution of $V$ places a uniform distribution on the set of $N \times 1$ indicator vectors $\{Z_i\}$. This distribution may be constructed from a corresponding set of i.i.d. $N \times 1$ vectors of $\text{Multinomial}(1, \theta_1, \theta_2, \ldots, \theta_N)$ indicators $\zeta_1, \zeta_2, \ldots, \zeta_N$. The permutation distribution over all one-way designs for which $\sum_{i=1}^{N} Z_i = (1, \ldots, 1)^T$ is constructed from the i.i.d.
Multinomial variables as the conditional distribution

$$Z_1, \ldots, Z_N \overset{D}{=} \zeta_1, \ldots, \zeta_N \;\Big|\; \sum_{i=1}^{N} \zeta_i = (1, \ldots, 1)^T_{N \times 1} .$$

The linear dependence among the components of each vector can be removed by using the $(N-1) \times 1$ vectors $Z_i^-$ and $\zeta_i^-$, the first $N-1$ components of $Z_i$ and $\zeta_i$. Then

$$Z_1^-, \ldots, Z_N^- \overset{D}{=} \zeta_1^-, \ldots, \zeta_N^- \;\Big|\; \sum_{i=1}^{N} \zeta_i^- = (1, \ldots, 1)^T_{(N-1) \times 1}$$

and $V$ can be represented in terms of $\{Z_i^-\}$ as

$$V = L_-^T \sum_{i=1}^{N} f_N(i)\, Z_i^- + Q$$

where $L_-^T = (f_N(1) - f_N(N), \ldots, f_N(N-1) - f_N(N))$ and $Q = f_N(N) \sum_{i=1}^{N} f_N(i)$. If $v_0$ is the observed value of $V$, then the null distribution of $V$ is

$$\Pr\{V \ge v_0\} = \Pr\Big\{ T(\zeta^-) = L_-^T \sum_{i=1}^{N} f_N(i)\, \zeta_i^- + Q \ge v_0 \;\Big|\; \sum_{i=1}^{N} \zeta_i^- = (1, \ldots, 1)^T \Big\} .$$

For any probability vector $\{\theta_1, \theta_2, \ldots, \theta_N\}$ of the Multinomial distribution, the conditional distribution of $T(\zeta_1^-, \zeta_2^-, \ldots, \zeta_N^-)$ is the required permutation distribution, which can be approximated by the double saddlepoint approximation of Skovgaard (1987). The double saddlepoint procedure uses the joint cumulant generating function of $\big(T(\zeta_1^-, \zeta_2^-, \ldots, \zeta_N^-), \sum_{i=1}^{N} \zeta_i^-\big)$, given by $K(s, t) = \log M(s, t)$ where

$$M(s, t) = \prod_{i=1}^{N} \Big( \sum_{j=1}^{N-1} \theta_j \exp(s_j + r_{ij} t) + \theta_N \Big)$$

with $s = (s_1, \ldots, s_{N-1})$ and $r_{ij} = f_N(i)\,(f_N(j) - f_N(N))$. The p-value is then approximated by

$$\Pr(V \ge v_0) \simeq 1 - \Phi(\hat{w}) - \phi(\hat{w}) \left( \frac{1}{\hat{w}} - \frac{1}{\hat{u}} \right)$$

where

$$\hat{w} = \operatorname{sgn}(\hat{t}) \sqrt{ -2 \big\{ K(\hat{s}, \hat{t}) - \hat{s}^T 1_- - v_0 \hat{t} \big\} }, \qquad \hat{u} = \hat{t} \sqrt{ |K''(\hat{s}, \hat{t})| \,/\, |K''_{ss}(0, 0)| } ,$$

and $1_-$ is the $(N-1) \times 1$ vector of ones. In these expressions, $K''$ is the $N \times N$ Hessian matrix and $K''_{ss}$ is its $\partial^2 / \partial s\, \partial s^T$ portion, evaluated at $(0, 0)$.
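As a sanity check on the cumulant generating function $K(s, t)$ defined above, the following minimal Python sketch evaluates it directly and differentiates it numerically at the origin. It is an illustration only, not the author's implementation. The scores $f_N(i) = i$, the uniform probabilities $\theta_j = 1/N$, the sample size $N = 15$, the step size and all function names are assumptions. Because $K$ is a cumulant generating function, its gradient at the origin gives unconditional means: each $\partial K / \partial s_j$ should equal $N \theta_j = 1$, and $\partial K / \partial t + Q$ should equal the null mean of $V$, which is $(\sum_i f_N(i))^2 / N$.

```python
import math

def cgf(s, t, f, theta):
    """K(s, t) = log M(s, t) for scores f and multinomial probabilities theta.

    s has length N - 1; the N-th multinomial category carries no s or t term.
    As in the text, the constant Q is not included in K.
    """
    n = len(f)
    total = 0.0
    for i in range(n):
        inner = theta[n - 1]
        for j in range(n - 1):
            r_ij = f[i] * (f[j] - f[n - 1])
            inner += theta[j] * math.exp(s[j] + r_ij * t)
        total += math.log(inner)
    return total

N = 15                                        # assumed sample size
f = [float(i) for i in range(1, N + 1)]       # assumed scores f_N(i) = i
theta = [1.0 / N] * N                         # assumed uniform probabilities
Q = f[-1] * sum(f)                            # Q = f_N(N) * sum_i f_N(i)
zero = [0.0] * (N - 1)
h = 1e-6                                      # finite-difference step (assumption)

k_origin = cgf(zero, 0.0, f, theta)

# Central differences at (s, t) = (0, 0).
dK_dt = (cgf(zero, h, f, theta) - cgf(zero, -h, f, theta)) / (2 * h)
s_up = [h] + zero[1:]
s_down = [-h] + zero[1:]
dK_ds1 = (cgf(s_up, 0.0, f, theta) - cgf(s_down, 0.0, f, theta)) / (2 * h)

null_mean_V = sum(f) ** 2 / N                 # E(V) under the permutation distribution

print("K(0, 0)          =", k_origin)
print("dK/ds_1 at 0     =", dK_ds1)
print("dK/dt at 0 + Q   =", dK_dt + Q)
print("null mean of V   =", null_mean_V)
```

With these assumed inputs, $K(0, 0)$ is zero up to rounding and the $s$-gradient already matches the conditioning vector of ones at $s = 0$, $t = 0$.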
The saddlepoint $(\hat{s}, \hat{t})$ solves

$$K'_{s_j}(\hat{s}, \hat{t}) = \sum_{i=1}^{N} \frac{\exp(\hat{s}_j + r_{ij} \hat{t})}{\sum_{l=1}^{N-1} \exp(\hat{s}_l + r_{il} \hat{t}) + 1} = 1, \qquad j = 1, \ldots, N-1,$$

$$K'_t(\hat{s}, \hat{t}) = \sum_{i=1}^{N} \frac{\sum_{j=1}^{N-1} r_{ij} \exp(\hat{s}_j + r_{ij} \hat{t})}{\sum_{l=1}^{N-1} \exp(\hat{s}_l + r_{il} \hat{t}) + 1} + Q = v_0 .$$

With $\theta_i = 1/N$, as used in the equations above, the denominator saddlepoint equations (those at $t = 0$) have the explicit solution $\hat{s}_0 = 0$, which simplifies the calculations.

3 Example and Simulation Study

Nayak (1988) gives the failure times of transmissions ($X$) and of transmission pumps ($Y$) on 15 caterpillar tractors, as shown in Table 1.

X  1641  5556  5421  3168  1534  6367  9460  6679  6142  5995  3953  6922  4210  5161  4732
Y   850  1607  2225  3223  3379  3832  3871  4142  4300  4789  6310  6311  6378  6449  6949

Table 1. Failure times of transmissions by Nayak (1988).

To test the independence of the failure times $X$ and $Y$, the test statistic (2) is used with $L = (1, \ldots, N)$ and $Q = L_N \sum_{i=1}^{N} R_i = N^2 (N+1)/2$. The true (simulated) p-value was calculated from $10^6$ permutations of the computed test statistic. The simulated p-value is the proportion of the generated statistics exceeding the observed statistic plus the proportion of those equal to it. The p-value of the saddlepoint approach is compared with the normal p-value calculated using the test statistic $(v' - E(v')) / \sqrt{\operatorname{Var}(v')}$. The true p-value and the saddlepoint approximated p-value were 0.2768 and 0.2763, respectively, while the normal p-value was 0.2693.

A small simulation study was carried out to assess the performance of the saddlepoint method. Consider the general model of dependence

$$X_i = X'_i + \lambda e_i, \qquad Y_i = Y'_i + \lambda e_i, \qquad i = 1, \ldots, N,$$

where all the variables $X'_i$, $Y'_i$ and $e_i$ are mutually independent, their distributions do not depend on $i$, and $\lambda$ is a real non-negative parameter.
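As a check on the Table 1 example above, the normal p-value and a Monte Carlo permutation p-value for statistic (2) can be computed in a few lines of Python. This is a hedged sketch rather than the author's code: it does not implement the saddlepoint approximation, it uses only 50,000 random permutations instead of the $10^6$ reported, and the seed, helper names and closed-form null moments $E(V') = N(N+1)^2/4$ and $\operatorname{Var}(V') = N^2 (N+1)^2 (N-1)/144$ (standard for tie-free ranks) are supplied here as assumptions.

```python
import math
import random

# Table 1: failure times of transmissions (X) and transmission pumps (Y).
X = [1641, 5556, 5421, 3168, 1534, 6367, 9460, 6679,
     6142, 5995, 3953, 6922, 4210, 5161, 4732]
Y = [850, 1607, 2225, 3223, 3379, 3832, 3871, 4142,
     4300, 4789, 6310, 6311, 6378, 6449, 6949]

def ranks_of(values):
    """Ranks 1..N of tie-free values, in the original order."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0] * len(values)
    for position, k in enumerate(order, start=1):
        ranks[k] = position
    return ranks

def v_prime(rx, ry):
    """Statistic (2) written pairwise: the sum of rank(X) * rank(Y) over pairs."""
    return sum(a * b for a, b in zip(rx, ry))

rx, ry = ranks_of(X), ranks_of(Y)
N = len(X)
v0 = v_prime(rx, ry)

# Normal approximation with the null mean and variance of V'.
mean_v = N * (N + 1) ** 2 / 4
var_v = N ** 2 * (N + 1) ** 2 * (N - 1) / 144
z = (v0 - mean_v) / math.sqrt(var_v)
p_normal = 0.5 * math.erfc(z / math.sqrt(2))      # upper tail 1 - Phi(z)

# Monte Carlo permutation p-value: proportion of permuted statistics >= v0.
rng = random.Random(2024)                          # arbitrary seed (assumption)
B = 50_000                                         # fewer than the 10^6 in the text
shuffled = ry[:]
hits = 0
for _ in range(B):
    rng.shuffle(shuffled)
    if v_prime(rx, shuffled) >= v0:
        hits += 1
p_perm = hits / B

print("observed V' =", v0)
print("normal p-value =", round(p_normal, 4))
print("Monte Carlo permutation p-value =", round(p_perm, 4))
```

The observed statistic is $V' = 1006$ and the normal p-value is about 0.269, in line with the 0.2693 reported above. The Monte Carlo estimate should fall near the reported true p-value of 0.2768, up to simulation error of a few thousandths at this number of permutations.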
In this model the null hypothesis $H_0$ of independence is equivalent to $\lambda = 0$, whereas for $\lambda > 0$ the variables $X_i$ and $Y_i$ are dependent. Data sets are generated from this model using Logistic, Extreme value and Uniform distributions for $X'_i$, $Y'_i$ and $e_i$, respectively. For each value of $\lambda = 0.0, 0.5$ and each sample size (10, 20, 30), 1000 data sets are generated and the true, saddlepoint and normal p-values are calculated using the test statistic (2). Table 2 reports three quantities: "Sad. Prop." is the proportion of the 1000 data sets for which the saddlepoint p-value was closer to the true p-value than the normal p-value; "Abs. Err. Sad." is the average absolute error of the saddlepoint p-value from the true p-value; and "Rel. Abs. Err. Sad." is the average relative absolute error of the saddlepoint p-value from the true p-value.

n    λ     Sad. Prop.   Abs. Err. Sad.   Rel. Abs. Err. Sad.
10   0.0   0.944        0.0010           0.0048
     0.5   0.945        0.0010           0.0050
20   0.0   0.957        0.0003           0.0016
     0.5   0.956        0.0003           0.0018
30   0.0   0.943        0.0003           0.0012
     0.5   0.932        0.0003           0.0013

Table 2. Performance under simulation for the dependence test.

The saddlepoint p-value was more accurate than the normal approximation in 94.6% of the cases overall. The average absolute saddlepoint error was about $10^{-3}$ or less, with an average relative error of at most about 0.5% (Table 2). An important consideration in these saddlepoint computations is the difficulty of solving the $N$ saddlepoint equations, which becomes increasingly difficult for large $N$.

4 Discussion

The problem of testing the independence between two variables under random censoring has attracted the attention of many authors; see O'Brien (1978), Wei (1980), Oakes (1982), and Gieser and Randles (1997).
When one of the two variables is under interval censoring, say the first, and the second random variable is observed, Cuzick (1982) presents a linear log-rank test statistic to test the independence of two vectors in the form $\sum_{i=1}^{N} \xi_i R_{2i}$, where $\{\xi_i\}$ are given scores and $\{R_{2i}\}$ are the ranks of the observed values of the second random variable. In the linear form (4), taking $L = (\xi_1, \xi_2, \ldots, \xi_N)$ and $f_N(R_i) = R_{2i}$, the saddlepoint method applies directly. For example, Cuzick gives survival times for 20 patients for the analysis of the relation between hemoglobin at presentation and survival at a medical clinic. The normal p-value using his asymptotic approach was 0.0505, while the true p-value and the saddlepoint p-value are 0.0516 and 0.0512, respectively.

References

[1] E. F. Abd-Elfattah and R. Butler, The weighted log-rank class of permutation tests: P-values and confidence intervals using saddlepoint methods. Biometrika (2007), 94, No. 3, 543-551.
[2] J. G. Booth and R. W. Butler, Randomization distributions and saddlepoint approximations in generalized linear models. Biometrika (1990), 77, 787-796.
[3] J. Cuzick, Rank tests for association with right censored data. Biometrika (1982), 69, No. 2, 351-364.
[4] H. E. Daniels, Discussion of paper by D. R. Cox. Journal of the Royal Statistical Society B (1958), 20, 236-238.
[5] A. C. Davison and D. V. Hinkley, Saddlepoint approximations in resampling methods. Biometrika (1988), 75, No. 3, 417-431.
[6] J. D. Gibbons and S. Chakraborti, Nonparametric statistical inference. 4th edition, Marcel Dekker, New York (2003).
[7] P. W. Gieser and R. H. Randles, A nonparametric test of independence between two vectors. Journal of the American Statistical Association (1997), 92, No. 438, 561-567.
[8] J. Hajek, Z. Sidak and P. K. Sen, Theory of rank tests.
2nd Ed. Academic Press (1999).
[9] T. K. Nayak, Testing equality of conditionally independent exponential distributions. Communications in Statistics: Theory and Methods (1988), 17, 807-820.
[10] D. Oakes, A concordance test for independence in the presence of censoring. Biometrics (1982), 38, 451-455.
[11] P. O'Brien, A nonparametric test for association with censored data. Biometrics (1978), 34, 243-250.
[12] J. Robinson, Saddlepoint approximations for permutation tests and confidence intervals. Journal of the Royal Statistical Society B (1982), 44, No. 1, 91-101.
[13] I. M. Skovgaard, Saddlepoint expansions for conditional distributions. Journal of Applied Probability (1987), 24, 875-87.
[14] L. J. Wei, A generalized Gehan and Gilbert test for paired observations that are subject to arbitrary right censorship. Journal of the American Statistical Association (1980), 75, No. 371, 634-637.