One of the most popular classes of tests for independence between two random variables is the general class of rank statistics that are invariant under permutations. This class contains Spearman's coefficient of rank correlation, the Fisher-Yates statistic, the weighted Mann statistic and others. Under the null hypothesis of independence these test statistics have a permutation distribution, and normal asymptotic theory is usually used to approximate the p-values for these tests. In this note we suggest a saddlepoint approach that is almost exact and needs no extensive simulation to calculate the p-value for this class of tests.
Deep Dive into On the Permutation Distribution of Independence Tests.
When the factors being studied are not treatments that the investigator can assign to subjects but conditions or attributes that are inseparably attached to these subjects, an assumption that needs to be tested is that an association exists between two factors in a population of subjects. Suppose we observe $N$ independent pairs of random variables $(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)$ and wish to test the null hypothesis $H_0$ that the two variables $X_i$ and $Y_i$ are independent for each $i$.
with small values of D indicating significance.
The statistic $D$ is related to the well-known Spearman coefficient of rank correlation, $S_p$, through the relation $S_p = 1 - 6D/(N(N^2-1))$; see Gibbons and Chakraborti (2003). It is also related to the weighted Mann statistic, $D'$, by
Expanding (1), D can be written as
which gives an equivalent simple statistic
see Hajek, Sidak and Sen (1999).
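The relation between $D$ and Spearman's coefficient can be checked numerically. The sketch below assumes the usual form of the statistic, $D = \sum_i (R_i - Q_i)^2$ with $R_i$ and $Q_i$ the ranks of $X_i$ and $Y_i$, and assumes that the equivalent simple statistic is the sum of rank products; both forms, the function names and the toy data are illustrative assumptions rather than the paper's own notation.

```python
import numpy as np

def ranks(values):
    """Ranks 1..N of a sample, assuming no ties."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=int)
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman_from_d(x, y):
    """Return (D, S_p, V) for paired samples x and y.

    D   : sum of squared rank differences (assumed form of the statistic)
    S_p : Spearman's coefficient via S_p = 1 - 6D / (N (N^2 - 1))
    V   : sum of rank products (assumed equivalent simple statistic)
    """
    r, q = ranks(x), ranks(y)
    n = len(r)
    d = int(np.sum((r - q) ** 2))
    s_p = 1.0 - 6.0 * d / (n * (n ** 2 - 1))
    v = int(np.sum(r * q))
    return d, s_p, v

# Hypothetical toy data, not taken from the paper.
x = np.array([2.1, 0.4, 3.3, 1.7, 5.0])
y = np.array([1.0, 0.2, 2.5, 3.1, 4.4])
d, s_p, v = spearman_from_d(x, y)
print(d, s_p, v)
```

Without ties, expanding the square gives $D = N(N+1)(2N+1)/3 - 2\sum_i R_i Q_i$, so $D$ and the rank-product sum order the permutations in opposite directions and lead to the same test.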
The statistic $V'$ is a member of a general class of rank statistics whose null distributions are invariant under permutations; this class can be written as
which contains the Fisher-Yates normal score test with $f_N(i) = E\,U_N^{(i)}$, where $U_N^{(1)} \le \cdots \le U_N^{(N)}$ is an ordered sample of $N$ observations from the standard normal distribution; the van der Waerden test statistic with $f_N(i) = \Phi^{-1}(i/(N+1))$, where $\Phi$ is the standard normal distribution function; and the quadrant test statistic with $f_N(i) = \operatorname{sign}(i - (N+1)/2)$.

Saddlepoint approximations to randomization distributions were introduced by Daniels (1958) and further developed by Robinson (1982) and Davison and Hinkley (1988). Booth and Butler (1990) showed that various randomization and resampling distributions are the same as certain conditional distributions and that the double saddlepoint approximation attains accuracy comparable to the single saddlepoint approach. Recently, Abd-Elfattah and Butler (2007) used the double saddlepoint approximation to calculate p-values and confidence intervals for the class of linear rank two-sample statistics for censored data.
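The three score functions named above (Fisher-Yates, van der Waerden and quadrant) are easy to tabulate. The following is a minimal sketch: the function names are hypothetical, and the Fisher-Yates scores are estimated by Monte Carlo instead of exact integration, which is a simplification assumed here for brevity.

```python
from statistics import NormalDist
import numpy as np

def van_der_waerden_scores(n):
    """f_N(i) = Phi^{-1}(i / (N + 1)) for i = 1..N."""
    nd = NormalDist()
    return np.array([nd.inv_cdf(i / (n + 1)) for i in range(1, n + 1)])

def quadrant_scores(n):
    """f_N(i) = sign(i - (N + 1) / 2) for i = 1..N."""
    i = np.arange(1, n + 1)
    return np.sign(i - (n + 1) / 2)

def fisher_yates_scores(n, draws=200_000, seed=0):
    """Monte Carlo estimate of E[U_N^{(i)}], the expected i-th order
    statistic of N standard normal observations (approximate, not exact)."""
    rng = np.random.default_rng(seed)
    ordered = np.sort(rng.standard_normal((draws, n)), axis=1)
    return ordered.mean(axis=0)

n = 5
vdw = van_der_waerden_scores(n)
quad = quadrant_scores(n)
fy = fisher_yates_scores(n)
print(vdw, quad, fy)
```

All three score vectors are antisymmetric about the middle rank, which is why the corresponding statistics have mean zero under the null hypothesis.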
In this note we present a simple, fast and accurate saddlepoint approach, requiring no extensive permutation simulation, to approximate the exact p-value for the previous class of tests using the double saddlepoint approximation. To apply the double saddlepoint approximation, the following lemma reformulates the class (3) in a more convenient form.
Lemma. The class of statistics (3) can be written in an equivalent form as
where
Proof. Simple algebra.
For example, if $R_1 = 2$ is the arithmetical rank, then $Z_2 = \eta_1$ and $\sum_{i=1}^{N} i Z_i$ has a 2 in its first component for $R_1$.
Section 2 presents the saddlepoint approximation approach. A real data example is presented in Section 3, along with a simulation study that shows the performance of the saddlepoint method. The application of the saddlepoint method to the Cuzick (1982) test statistic in the case of interval censoring is discussed in Section 4.

The dependence in the statistic can be removed by using the $(N-1) \times 1$ vectors $Z_i^-$ and $\zeta_i^-$, the first $N-1$ components of $Z_i$ and $\zeta_i$; then
and then $V$ can be represented in terms of $\{Z_i^-\}$ as
Assuming any probability vector $\{\theta_1, \theta_2, \ldots, \theta_N\}$ for the multinomial distribution, the conditional distribution of
is the required permutation distribution which can be approximated by using the double saddlepoint approximation of Skovgaard (1987).
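For orientation, the Skovgaard (1987) double saddlepoint approximation is commonly stated in the form below for a continuous statistic. The notation is generic and assumed here, not necessarily the paper's: $K(s,t)$ is the joint cumulant generating function of a conditioning vector $S$ and a statistic $T$, and $s_{\mathrm{obs}}$, $t_{\mathrm{obs}}$ are their observed values. Lattice statistics usually call for a continuity-corrected variant, which is not shown.

```latex
\Pr\left(T \ge t_{\mathrm{obs}} \mid S = s_{\mathrm{obs}}\right)
  \approx 1 - \Phi(\hat w) - \phi(\hat w)\left(\frac{1}{\hat w} - \frac{1}{\hat u}\right),
  \qquad \hat t \neq 0,

\hat w = \operatorname{sgn}(\hat t)\,
  \sqrt{2\left[\left\{K(\hat s_0, 0) - \hat s_0^{T} s_{\mathrm{obs}}\right\}
  - \left\{K(\hat s, \hat t) - \hat s^{T} s_{\mathrm{obs}} - \hat t\, t_{\mathrm{obs}}\right\}\right]},

\hat u = \hat t\,
  \sqrt{\frac{\left|K''(\hat s, \hat t)\right|}{\left|K''_{ss}(\hat s_0, 0)\right|}},

\text{where } (\hat s, \hat t) \text{ solves }
  \frac{\partial K}{\partial s}(\hat s, \hat t) = s_{\mathrm{obs}},\;
  \frac{\partial K}{\partial t}(\hat s, \hat t) = t_{\mathrm{obs}},
\text{ and } \hat s_0 \text{ solves }
  \frac{\partial K}{\partial s}(\hat s_0, 0) = s_{\mathrm{obs}}.
```

Here $\Phi$ and $\phi$ are the standard normal distribution and density functions. The numerator saddlepoint $(\hat s, \hat t)$ generally has to be found numerically, while the denominator saddlepoint $\hat s_0$ involves only the marginal distribution of $S$.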
The p-value is approximated by the double saddlepoint procedure, which uses the joint cumulant generating function for (T
and $\mathbf{1}^-$ is the $(N-1) \times 1$ vector of ones. In these expressions, $K''$ is the $N \times N$ Hessian matrix and $K''_{ss}$ is the $\partial^2/\partial s\,\partial s^T$ portion at $(0, 0)$. The saddlepoint $(\hat s, \hat t)$ solves
Using $\theta_i = 1/N$, the denominator saddlepoint equations have the explicit solution $\hat s_0 = 0$, which simplifies the calculations.

1. Failure times of transmissions by Nayak (1988).
To test the independence of the failure times $X$ and $Y$, the test statistic (2) is used with $L = (1, \ldots, N)$, and
The true (simulated) p-value was calculated using $10^6$ permutations of the computed test statistic. The simulated p-value is the proportion of permuted statistics exceeding the observed statistic plus the proportion equal to it. The saddlepoint p-value is compared with the normal p-value calculated from the standardized statistic $(v' - E(v'))/\sqrt{\operatorname{Var}(v')}$. The true p-value and the saddlepoint p-value were 0.2768 and 0.2763, respectively, while the normal p-value was 0.2693.
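The simulated and normal p-values described above can be sketched as follows for a statistic of the form $V = \sum_i a_i b_{\pi(i)}$, where $\pi$ is a uniformly random permutation under the null hypothesis. The score vectors, the toy ranks and all function names are hypothetical. The normal approximation uses the standard permutation moments $E(V) = N \bar a \bar b$ and $\operatorname{Var}(V) = \sum_i (a_i - \bar a)^2 \sum_j (b_j - \bar b)^2 / (N-1)$. This is not the saddlepoint method itself, only the two benchmarks it is compared against.

```python
import itertools
from statistics import NormalDist
import numpy as np

def null_moments(a, b):
    """Exact permutation mean and variance of V = sum_i a[i] * b[pi(i)]."""
    n = len(a)
    mean = n * a.mean() * b.mean()
    var = np.sum((a - a.mean()) ** 2) * np.sum((b - b.mean()) ** 2) / (n - 1)
    return float(mean), float(var)

def normal_p_value(a, b, observed):
    """Right-tail p-value from the standardized statistic."""
    mean, var = null_moments(a, b)
    z = (observed - mean) / var ** 0.5
    return 1.0 - NormalDist().cdf(z)

def simulated_p_value(a, b, observed, n_perm=20_000, seed=0):
    """Proportion of random permutations with V >= observed."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        if np.dot(a, rng.permutation(b)) >= observed - 1e-9:
            hits += 1
    return hits / n_perm

def exact_p_value(a, b, observed):
    """Full enumeration of all N! permutations; feasible only for small N."""
    values = np.array([np.dot(a, np.array(p)) for p in itertools.permutations(b)])
    return float(np.mean(values >= observed - 1e-9))

# Hypothetical ranks for N = 6 pairs, not the data analysed in the paper.
r = np.array([3, 1, 4, 2, 5, 6], dtype=float)
q = np.array([2, 1, 3, 4, 5, 6], dtype=float)
v_obs = float(np.dot(r, q))

p_exact = exact_p_value(r, q, v_obs)
p_sim = simulated_p_value(r, q, v_obs)
p_norm = normal_p_value(r, q, v_obs)
print(p_exact, p_sim, p_norm)
```

For small $N$ the enumeration gives the true permutation p-value directly; for larger $N$ only simulation or an analytic approximation is practical, which is the gap the saddlepoint approach is meant to fill.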
A small simulation study was carried out to assess the performance of the saddlepoint method. Consider the general model of dependence
where all the variables $X'_i$, $Y'_i$ and $e_i$ are mutually independent and their distributions do not depend on $i$, and $\lambda$ is a real non-negative parameter. In this model the null hypothesis $H_0$ of independence is equivalent to $\lambda = 0$, whereas for $\lambda > 0$ the variables $X_i$ and $Y_i$ are dependent. Data sets are generated from this model using Logistic, Extreme value and U
…(Full text truncated)…