On the Permutation Distribution of Independence Tests

Ehab F. Abd-Elfattah
Ain Shams University, Cairo, Egypt.
ehab@ASUnet.shams.edu.eg

Abstract

One of the most popular classes of tests for independence between two random variables is the general class of rank statistics that are invariant under permutations. This class contains Spearman's coefficient of rank correlation statistic, the Fisher-Yates statistic, the weighted Mann statistic and others. Under the null hypothesis of independence these test statistics have a permutation distribution, and normal asymptotic theory is usually used to approximate the p-values of these tests. In this note we suggest using a saddlepoint approach that is almost exact and needs no extensive simulation to calculate the p-value for this class of tests.

Some key words: Independence tests; Linear rank test; Permutation distribution; Saddlepoint approximation.

1 Introduction

When the factors being studied are not treatments that the investigator can assign to his subjects but conditions or attributes which are inseparably attached to these subjects, an assumption that needs to be tested is that an association exists between two factors in a population of subjects. Let us observe $N$ independent pairs of random variables $(X_1, Y_1), (X_2, Y_2), \dots, (X_N, Y_N)$; we wish to test the null hypothesis $H_0$ that the two variables $X_i$ and $Y_i$ are independent for each $i$.

Now rearrange all $N$ pairs of observations according to the magnitude of their first coordinate into the sequence $(X_{d_1}, Y_{d_1}), (X_{d_2}, Y_{d_2}), \dots, (X_{d_N}, Y_{d_N})$ in such a way that $X_{d_1} < X_{d_2} < \cdots < X_{d_N}$. Then put $R_i$ equal to the rank of $Y_{d_i}$ among the observations $Y_{d_1}, Y_{d_2}, \dots, Y_{d_N}$. Under the assumption of independence and assuming no ties, all $N!$
orderings $(R_1, \dots, R_N)$ are equally likely, each with probability $1/N!$. If we are willing to assume that the two factors have a positive association, the $\{R_i\}$ should reveal an upward trend, with large values tending to occur on the right of the sequence and low values on the left. An appropriate test statistic that reflects this idea is

$$D = \sum_{i=1}^{N} (R_i - i)^2 \tag{1}$$

with small values of $D$ indicating significance.

The statistic $D$ is related to the well-known Spearman's coefficient of rank correlation statistic, $S_p$, through the relation $S_p = 1 - 6D / \{N(N^2 - 1)\}$; see Gibbons and Chakraborti (2003). It is also related to the weighted Mann statistic, $D'$, by $D' = \frac{1}{6} N(N^2 - 1) - \frac{1}{2} D$. Expanding (1), $D$ can be written as

$$D = \frac{1}{3} N(N+1)(2N+1) - 2 \sum_{i=1}^{N} i R_i,$$

which gives the equivalent, simpler statistic

$$V' = \sum_{i=1}^{N} i R_i \tag{2}$$

of Hajek, Sidak and Sen (1999).

The statistic $V'$ belongs to a general class of rank statistics whose null distributions are invariant under permutations. This class can be written as

$$S = \sum_{i=1}^{N} f_N(i)\, f_N(R_i) \tag{3}$$

It contains the Fisher-Yates normal score test, with $f_N(i) = E\,U_N^{(i)}$, where $U_N^{(1)} < U_N^{(2)} < \cdots < U_N^{(N)}$ is an ordered sample of $N$ observations from the standard normal distribution; the van der Waerden test statistic, with $f_N(i) = \Phi^{-1}\!\left(\frac{i}{N+1}\right)$, where $\Phi$ is the standard normal distribution function; and the quadrant test statistic, with $f_N(i) = \mathrm{sign}\!\left(i - \frac{N+1}{2}\right)$.

Saddlepoint approximations to randomization distributions were introduced by Daniels (1958) and further developed by Robinson (1982) and Davison and Hinkley (1988). Booth and Butler (1990) showed that various randomization and resampling distributions are the same as certain conditional distributions and that the double saddlepoint approximation attains accuracy comparable to the single saddlepoint approach.
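
The algebra linking $D$, $S_p$, $D'$ and $V'$ in (1)-(2), and the general form (3), can be checked numerically. The following is a minimal sketch under stated assumptions: the function names and the five-element rank vector are hypothetical and chosen only for illustration, and the Fisher-Yates scores are omitted because they require expected normal order statistics.

```python
from statistics import NormalDist


def spearman_family(ranks):
    """Return (D, V', S_p, D') for ranks R_1..R_N, a permutation of 1..N."""
    n = len(ranks)
    d = sum((r - i) ** 2 for i, r in enumerate(ranks, start=1))   # equation (1)
    v_prime = sum(i * r for i, r in enumerate(ranks, start=1))    # equation (2)
    s_p = 1 - 6 * d / (n * (n**2 - 1))                            # Spearman's coefficient
    d_prime = n * (n**2 - 1) / 6 - d / 2                          # weighted Mann statistic
    return d, v_prime, s_p, d_prime


def linear_rank_statistic(ranks, score):
    """General class (3): S = sum_i f_N(i) * f_N(R_i) for a score function f_N."""
    n = len(ranks)
    return sum(score(i, n) * score(r, n) for i, r in enumerate(ranks, start=1))


def identity_score(i, n):
    """f_N(i) = i, which turns (3) into V'."""
    return i


def van_der_waerden_score(i, n):
    """f_N(i) = Phi^{-1}(i / (N + 1))."""
    return NormalDist().inv_cdf(i / (n + 1))


def quadrant_score(i, n):
    """f_N(i) = sign(i - (N + 1) / 2)."""
    c = i - (n + 1) / 2
    return (c > 0) - (c < 0)


ranks = [2, 1, 4, 3, 5]  # hypothetical ranks, not data from the paper
d, v_prime, s_p, d_prime = spearman_family(ranks)
n = len(ranks)

# The expansion of (1): D = N(N+1)(2N+1)/3 - 2 V'.
assert d == n * (n + 1) * (2 * n + 1) // 3 - 2 * v_prime
print(d, v_prime, s_p, d_prime)
print(linear_rank_statistic(ranks, van_der_waerden_score))
```

Because $D$ is a decreasing linear function of $V'$, small values of $D$ correspond to large values of $V'$, which is why the upper tail of $V'$ is the relevant one for a positive association.
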
Recently, Abd-Elfattah and Butler (2007) used the double saddlepoint approximation to calculate p-values and confidence intervals for the class of linear rank two-sample statistics for censored data.

In this note we present a simple, fast and accurate saddlepoint approach that approximates the exact permutation p-value for the previous class of tests using the double saddlepoint approximation, without any extensive permutation simulation. To use the double saddlepoint approximation, the following lemma reformulates the class (3) in a more convenient form.

Lemma 1 The class of statistics (3) can be written in an equivalent form as

$$V = L^T \sum_{i=1}^{N} f_N(i)\, Z_i \tag{4}$$

where $L^T = (f_N(1), f_N(2), \dots, f_N(N))$, and $Z_1, Z_2, \dots, Z_N$ are $N \times 1$ vectors of the form $Z_{R_i} = \eta_i$, $i = 1, \dots, N$, where the $N \times N$ identity matrix is $I_N = (\eta_1, \eta_2, \dots, \eta_N)$.

Proof. Simple algebra. For example, with arithmetical ranks, if $R_1 = 2$ then $Z_2 = \eta_1$ and $\sum_{i=1}^{N} i Z_i$ has a 2 in its first component, corresponding to $R_1$.

Section 2 presents the saddlepoint approximation approach. A real data example is given in Section 3, along with a simulation study to show the performance of the saddlepoint method. The application of the saddlepoint method to the Cuzick (1982) test statistic in the case of interval censoring is discussed in Section 4.

2 Saddlepoint Approximation for Tests of Independence

Under the null hypothesis $H_0$ of independence, the permutation distribution of $V$ places a uniform distribution on the set of $N \times 1$ indicator vectors $\{Z_i\}$. This distribution may be constructed from a corresponding set of i.i.d. $N \times 1$ vectors of $\mathrm{Multinomial}(1, \theta_1, \theta_2, \dots, \theta_N)$ indicators $\zeta_1, \zeta_2, \dots, \zeta_N$. The permutation distribution over all one-way designs for which $\sum_{i=1}^{N} Z_i = (1, \dots, 1)^T$ is constructed from the i.i.d.
Multinomial variables as the conditional distribution

$$Z_1, \dots, Z_N \overset{D}{=} \zeta_1, \dots, \zeta_N \;\Big|\; \sum_{i=1}^{N} \zeta_i = (1, \dots, 1)^T_{N \times 1}.$$

The dependence in the statistic can be removed by using the $(N-1) \times 1$ vectors $Z_i^-$ and $\zeta_i^-$, the first $N-1$ components of $Z_i$ and $\zeta_i$; then

$$Z_1^-, \dots, Z_N^- \overset{D}{=} \zeta_1^-, \dots, \zeta_N^- \;\Big|\; \sum_{i=1}^{N} \zeta_i^- = (1, \dots, 1)^T_{(N-1) \times 1}$$

and $V$ can be represented in terms of $\{Z_i^-\}$ as

$$V = L_-^T \sum_{i=1}^{N} f_N(i)\, Z_i^- + Q$$

where $L_-^T = (f_N(1) - f_N(N), \dots, f_N(N-1) - f_N(N))$ and $Q = f_N(N) \sum_{i=1}^{N} f_N(i)$. If $v_0$ is the observed value of $V$, then the p-value under the null distribution of $V$ is

$$\Pr\{V \ge v_0\} = \Pr\Big\{ T(\zeta^-) = L_-^T \sum_{i=1}^{N} f_N(i)\, \zeta_i^- + Q \ge v_0 \;\Big|\; \sum_{i=1}^{N} \zeta_i^- = (1, \dots, 1)^T \Big\}.$$

Assuming any probability vector $\{\theta_1, \theta_2, \dots, \theta_N\}$ for the Multinomial distribution, the conditional distribution of $T(\zeta_1^-, \zeta_2^-, \dots, \zeta_N^-)$ is the required permutation distribution, which can be approximated by using the double saddlepoint approximation of Skovgaard (1987). The double saddlepoint procedure uses the joint cumulant generating function of $\big(T(\zeta_1^-, \zeta_2^-, \dots, \zeta_N^-), \sum_{i=1}^{N} \zeta_i^-\big)$, given by $K(s, t) = \log M(s, t)$ where

$$M(s, t) = e^{Qt} \prod_{i=1}^{N} \Big( \sum_{j=1}^{N-1} \theta_j \exp(s_j + r_{ij} t) + \theta_N \Big)$$

with $s = (s_1, \dots, s_{N-1})$ and $r_{ij} = f_N(i)\,(f_N(j) - f_N(N))$; the factor $e^{Qt}$ carries the constant $Q$ in $T$. Then

$$\Pr(V \ge v_0) \simeq 1 - \Phi(\hat w) - \phi(\hat w) \left( \frac{1}{\hat w} - \frac{1}{\hat u} \right)$$

where

$$\hat w = \mathrm{sgn}(\hat t) \sqrt{2 \left[ -\{ K(\hat s, \hat t) - \hat s^T 1_- - v_0 \hat t \} \right]}$$

$$\hat u = \hat t \sqrt{ |K''(\hat s, \hat t)| \,/\, |K''_{ss}(0, 0)| }$$

and $1_-$ is an $(N-1) \times 1$ vector of ones. In these expressions, $K''$ is the $N \times N$ Hessian matrix and $K''_{ss}$ is the $\partial^2 / \partial s\, \partial s^T$ portion at $(0, 0)$.
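
The denominator quantities in $\hat w$ and $\hat u$ have closed forms when the Multinomial probabilities are taken to be uniform. The short derivation below is a sketch under the assumption $\theta_j = 1/N$ for all $j$ (the text allows any probability vector); it shows that the denominator saddlepoint, the solution of $\partial K(s, 0)/\partial s = 1_-$, is $s = 0$, that $K(0,0) = 0$, and that $|K''_{ss}(0,0)| = 1/N$.

```latex
\begin{aligned}
K(s, 0) &= N \log\Big( \tfrac{1}{N} \sum_{j=1}^{N-1} e^{s_j} + \tfrac{1}{N} \Big),
  \qquad K(0, 0) = N \log 1 = 0, \\
\frac{\partial K}{\partial s_j}(0, 0) &= N \cdot \frac{1/N}{1} = 1,
  \qquad j = 1, \dots, N-1, \\
K''_{ss}(0, 0) &= N \Big( \tfrac{1}{N} I_{N-1} - \tfrac{1}{N^2}\, 1_- 1_-^T \Big)
  = I_{N-1} - \tfrac{1}{N}\, 1_- 1_-^T, \\
|K''_{ss}(0, 0)| &= 1 - \tfrac{1}{N}\, 1_-^T 1_- = 1 - \frac{N-1}{N} = \frac{1}{N}.
\end{aligned}
```

Under this assumption the expression for $\hat u$ reduces to $\hat u = \hat t \sqrt{N\, |K''(\hat s, \hat t)|}$, so only the numerator saddlepoint $(\hat s, \hat t)$ has to be found numerically.
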
The saddlepoint $(\hat s, \hat t)$ solves

$$K'_{s_j}(\hat s, \hat t) = \sum_{i=1}^{N} \frac{\exp(\hat s_j + r_{ij} \hat t)}{\sum_{l=1}^{N-1} \exp(\hat s_l + r_{il} \hat t) + 1} = 1, \qquad j = 1, \dots, N-1$$

$$K'_t(\hat s, \hat t) = \sum_{i=1}^{N} \frac{\sum_{j=1}^{N-1} r_{ij} \exp(\hat s_j + r_{ij} \hat t)}{\sum_{l=1}^{N-1} \exp(\hat s_l + r_{il} \hat t) + 1} + Q = v_0.$$

Here $\theta_i = 1/N$ is used, so that the denominator saddlepoint equations have the explicit solution $\hat s_0 = 0$, which simplifies the calculations.

3 Example and Simulation Study

Nayak (1988) gives the failure times of transmissions ($X$) and of transmission pumps ($Y$) on 15 caterpillar tractors, as shown in Table 1.

| X | Y |
|------|------|
| 1641 | 850 |
| 5556 | 1607 |
| 5421 | 2225 |
| 3168 | 3223 |
| 1534 | 3379 |
| 6367 | 3832 |
| 9460 | 3871 |
| 6679 | 4142 |
| 6142 | 4300 |
| 5995 | 4789 |
| 3953 | 6310 |
| 6922 | 6311 |
| 4210 | 6378 |
| 5161 | 6449 |
| 4732 | 6949 |

Table 1. Failure times of transmissions by Nayak (1988).

To test the independence of the failure times $X$ and $Y$, the test statistic (2) is used with $L = (1, \dots, N)$ and $Q = L_N \sum_{i=1}^{N} R_i = N^2 (N+1)/2$. The true (simulated) p-value was calculated using $10^6$ permutations of the computed test statistic. The simulated p-value is then the proportion of such generations exceeding the observed statistic plus the proportion of those equal to it. The p-value of the saddlepoint approach is compared to the normal p-value calculated using the test statistic $(v' - E(v')) / \sqrt{\mathrm{Var}(v')}$. The true p-value and the saddlepoint approximated p-value were 0.2768 and 0.2763, respectively, while the normal p-value was 0.2693.

A small simulation study was carried out to assess the performance of the saddlepoint method. Consider the general model of dependence

$$X_i = X'_i + \lambda e_i, \qquad Y_i = Y'_i + \lambda e_i, \qquad i = 1, \dots, N$$

where all the variables $X'_i$, $Y'_i$ and $e_i$ are mutually independent and their distributions do not depend on $i$, and $\lambda$ is a real non-negative parameter.
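
The Table 1 calculation described above can be sketched in code. This is an illustration under stated assumptions, not the author's implementation: it assumes NumPy and SciPy, uses 100,000 random permutations with a fixed seed instead of the $10^6$ reported, solves the saddlepoint equations with `scipy.optimize.root` (a solver choice made here), and uses the standard permutation mean $N(N+1)^2/4$ and variance $N^2(N+1)^2(N-1)/144$ of $V'$ for the normal approximation. All variable and function names are hypothetical.

```python
import numpy as np
from scipy.optimize import root
from scipy.stats import norm, rankdata

# Table 1 (Nayak, 1988): failure times of transmissions (x) and transmission pumps (y).
x = np.array([1641, 5556, 5421, 3168, 1534, 6367, 9460, 6679,
              6142, 5995, 3953, 6922, 4210, 5161, 4732])
y = np.array([850, 1607, 2225, 3223, 3379, 3832, 3871, 4142,
              4300, 4789, 6310, 6311, 6378, 6449, 6949])
N = len(x)
idx = np.arange(1, N + 1)

# R_i = rank of the Y value paired with the i-th smallest X; v0 is statistic (2).
ranks = rankdata(y[np.argsort(x)]).astype(int)
v0 = int(idx @ ranks)

# Normal approximation from the permutation mean and variance of V'.
mean_v = N * (N + 1) ** 2 / 4
var_v = N**2 * (N + 1) ** 2 * (N - 1) / 144
p_normal = norm.sf((v0 - mean_v) / np.sqrt(var_v))

# Monte Carlo permutation p-value: proportion of permuted statistics >= v0.
rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
perms = rng.permuted(np.tile(idx, (100_000, 1)), axis=1)
p_mc = np.mean(perms @ idx >= v0)

# Double saddlepoint with theta_j = 1/N and scores f_N(i) = i.
f = idx.astype(float)
r = np.outer(f, f[:-1] - f[-1])   # r_ij = f_N(i) (f_N(j) - f_N(N)), shape (N, N-1)
Q = f[-1] * f.sum()


def tilted(z):
    """p_ij = exp(s_j + r_ij t) / (sum_l exp(s_l + r_il t) + 1), with z = (s, t)."""
    e = np.exp(z[:-1][None, :] + r * z[-1])
    den = e.sum(axis=1) + 1.0
    return e / den[:, None], den


def gradient(z):
    """Saddlepoint equations: K'_s - 1 = 0 and K'_t - v0 = 0."""
    p, _ = tilted(z)
    return np.append(p.sum(axis=0) - 1.0, (r * p).sum() + Q - v0)


def hessian(z):
    """N x N Hessian K'' of the joint cumulant generating function."""
    p, _ = tilted(z)
    m = (r * p).sum(axis=1)                      # row-wise mean of r_ij under p_ij
    H = np.empty((N, N))
    H[:-1, :-1] = np.diag(p.sum(axis=0)) - p.T @ p
    H[:-1, -1] = H[-1, :-1] = (p * (r - m[:, None])).sum(axis=0)
    H[-1, -1] = ((r**2 * p).sum(axis=1) - m**2).sum()
    return H


sol = root(gradient, np.zeros(N), jac=hessian)
s_hat, t_hat = sol.x[:-1], sol.x[-1]
_, den = tilted(sol.x)
K_hat = np.log(den / N).sum() + Q * t_hat        # K(s_hat, t_hat)
w = np.sign(t_hat) * np.sqrt(2 * (s_hat.sum() + v0 * t_hat - K_hat))
u = t_hat * np.sqrt(N * np.linalg.det(hessian(sol.x)))   # |K''_ss(0,0)| = 1/N
p_sad = norm.sf(w) - norm.pdf(w) * (1 / w - 1 / u)

print(v0, p_normal, p_mc, p_sad)
```

The values printed can be compared with the p-values reported for these data (0.2768 simulated, 0.2763 saddlepoint, 0.2693 normal). The Monte Carlo figure varies with the seed and the number of permutations, and the formula for `p_sad` is numerically unstable when the observed statistic is very close to its permutation mean, since $\hat w$ and $\hat u$ are then both near zero.
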
In this model the null hypothesis $H_0$ of independence is equivalent to $\lambda = 0$, whereas for $\lambda > 0$ the variables $X_i$ and $Y_i$ are dependent. Data sets are generated from this model using Logistic, Extreme value and Uniform distributions for $X'_i$, $Y'_i$ and $e_i$, respectively. For each value of $\lambda = 0.0, 0.5$ and each sample size (10, 20, 30), 1000 data sets are generated and the true, saddlepoint and normal p-values are calculated using the test statistic (2). In Table 2, "Sad. Prop." is the proportion of the 1000 data sets for which the saddlepoint p-value was closer to the true p-value than the normal p-value, "Abs. Err. Sad." is the average absolute error of the saddlepoint p-value from the true p-value, and "Rel. Abs. Err. Sad." is the average relative absolute error of the saddlepoint p-value from the true p-value.

| n | λ | Sad. Prop. | Abs. Err. Sad. | Rel. Abs. Err. Sad. |
|----|-----|-------|--------|--------|
| 10 | 0.0 | 0.944 | 0.0010 | 0.0048 |
| 10 | 0.5 | 0.945 | 0.0010 | 0.0050 |
| 20 | 0.0 | 0.957 | 0.0003 | 0.0016 |
| 20 | 0.5 | 0.956 | 0.0003 | 0.0018 |
| 30 | 0.0 | 0.943 | 0.0003 | 0.0012 |
| 30 | 0.5 | 0.932 | 0.0003 | 0.0013 |

Table 2. Performance under simulation for the dependence test.

The saddlepoint p-value was more accurate than the normal approximation in 94.6% of the overall cases. The average absolute saddlepoint error was about $10^{-3}$ or less, with an average relative error of at most 0.5% (Table 2). An important consideration in these saddlepoint computations is the difficulty in solving the $N$ saddlepoint equations, which becomes increasingly difficult with large $N$.

4 Discussion

The problem of testing the independence between two variables under random censoring has received the attention of many authors; see O'Brien (1978), Wei (1980), Oakes (1982), and Gieser and Randles (1997).
When one of the two variables, say the first, is subject to interval censoring and the second random variable is observed, Cuzick (1982) presents a linear log-rank test statistic to test the independence of two vectors in the form $\sum_{i=1}^{N} \xi_i R_{2i}$, where $\{\xi_i\}$ are given scores and $\{R_{2i}\}$ are the ranks of the observed values of the second random variable. In the linear form (4), taking $L = (\xi_1, \xi_2, \dots, \xi_N)$ and $f_N(R_i) = R_{2i}$, the saddlepoint method is directly applicable. For example, Cuzick gives survival times for 20 patients for the analysis of the relation between hemoglobin at presentation and survival in a medical clinic. The normal p-value using his asymptotic approach was 0.0505, while the true p-value and the saddlepoint p-value are 0.0516 and 0.0512, respectively.

References

[1] E. F. Abd-Elfattah and R. Butler, The Weighted Log-Rank Class of Permutation Tests: P-values and Confidence Intervals Using Saddlepoint Methods. Biometrika (2007), 94, No. 3, 543-551.
[2] J. G. Booth and R. W. Butler, Randomization distributions and saddlepoint approximations in generalized linear models. Biometrika (1990), 77, 787-796.
[3] J. Cuzick, Rank tests for association with right censored data. Biometrika (1982), 69, No. 2, 351-364.
[4] H. E. Daniels, Discussion of paper by D. R. Cox. Journal of the Royal Statistical Society B (1958), 20, 236-238.
[5] A. C. Davison and D. V. Hinkley, Saddlepoint approximations in resampling methods. Biometrika (1988), 75, No. 3, 417-431.
[6] J. D. Gibbons and S. Chakraborti, Nonparametric statistical inference. 4th edition, Marcel Dekker, New York (2003).
[7] P. W. Gieser and R. H. Randles, A nonparametric test of independence between two vectors. Journal of the American Statistical Association (1997), 92, No. 438, 561-567.
[8] J. Hajek, Z. Sidak and P. K. Sen, Theory of rank tests.
2nd Ed. Academic Press (1999).
[9] T. K. Nayak, Testing equality of conditionally independent exponential distributions. Communications in Statistics: Theory and Methods (1988), 17, 807-820.
[10] D. Oakes, A concordance test for independence in the presence of censoring. Biometrics (1982), 38, 451-455.
[11] P. O'Brien, A nonparametric test for association with censored data. Biometrics (1978), 34, 243-250.
[12] J. Robinson, Saddlepoint approximations for permutation tests and confidence intervals. Journal of the Royal Statistical Society B (1982), 44, No. 1, 91-101.
[13] I. M. Skovgaard, Saddlepoint expansions for conditional distributions. Journal of Applied Probability (1987), 24, 875-87.
[14] L. J. Wei, A generalized Gehan and Gilbert test for paired observations that are subject to arbitrary right censorship. Journal of the American Statistical Association (1980), 75, No. 371, 634-637.
