On the Permutation Distribution of Independence Tests


📝 Original Info

  • Title: On the Permutation Distribution of Independence Tests
  • ArXiv ID: 0902.0442
  • Date: 2009-02-04
  • Authors: Researchers from original ArXiv paper

📝 Abstract

One of the most popular classes of tests for independence between two random variables is the general class of rank statistics that are invariant under permutations. This class contains Spearman's coefficient of rank correlation statistic, the Fisher-Yates statistic, the weighted Mann statistic and others. Under the null hypothesis of independence these test statistics have a permutation distribution, and normal asymptotic theory is usually used to approximate the p-values for these tests. In this note we suggest using a saddlepoint approach that is almost exact and needs no extensive simulation to calculate the p-value for this class of tests.


📄 Full Content

When the factors being studied are not treatments that the investigator can assign to subjects but conditions or attributes inseparably attached to those subjects, a hypothesis that needs to be tested is whether an association exists between two factors in a population of subjects. Suppose we observe $N$ independent pairs of random variables $(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)$ and wish to test the null hypothesis $H_0$ that the two variables $X_i$ and $Y_i$ are independent for each $i$.

with small values of D indicating significance.

The statistic $D$ is related to the well-known Spearman coefficient of rank correlation, $S_p$, by the relation $S_p = 1 - 6D/(N(N^2-1))$; see Gibbons and Chakraborti (2003). It is also related to the weighted Mann statistic, $D'$, by

Expanding (1), D can be written as

which gives an equivalent simple statistic

Hajek, Sidak and Sen (1999).
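The relation between $D$ and Spearman's coefficient stated above can be checked numerically. The sketch below assumes that $D$ is the usual sum of squared differences between the ranks of the $X_i$ and the ranks of the $Y_i$, and that there are no ties; the helper names `ranks` and `spearman_from_d` and the sample data are hypothetical.

```python
import numpy as np

def ranks(values):
    """Ranks 1..N of a sequence (assumption: no tied observations)."""
    values = np.asarray(values)
    order = np.argsort(values)
    r = np.empty(len(values), dtype=int)
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman_from_d(x, y):
    """Return (D, S_p), with D taken to be the sum of squared rank differences."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    d = int(np.sum((rx - ry) ** 2))
    return d, 1.0 - 6.0 * d / (n * (n ** 2 - 1))

# Illustrative data only, not taken from the paper
x = [2.1, 0.4, 3.3, 1.7, 5.0]
y = [1.9, 0.8, 2.5, 3.1, 4.4]
d, s_p = spearman_from_d(x, y)

# Cross-check: S_p should equal the Pearson correlation of the two rank vectors
r_check = np.corrcoef(ranks(x), ranks(y))[0, 1]
```

Small values of `d` correspond to values of `s_p` close to 1, which is why small $D$ points toward positive association.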

The statistic $V'$ is equivalent to a general class of rank statistics whose null distributions are invariant under permutations; this class can be written as

which contains the Fisher-Yates normal score test with $f_N(i) = E\,U_N^{(i)}$, where $U_N^{(1)}, \ldots, U_N^{(N)}$ is an ordered sample of $N$ observations from the standard normal distribution; the van der Waerden test statistic, with $f_N(i) = \Phi^{-1}\left(\frac{i}{N+1}\right)$, where $\Phi$ is the standard normal distribution function; and the quadrant test statistic, with $f_N(i) = \operatorname{sign}\left(i - \frac{N+1}{2}\right)$.

Saddlepoint approximations to randomization distributions were introduced by Daniels (1958) and further developed by Robinson (1982) and Davison and Hinkley (1988). Booth and Butler (1990) showed that various randomization and resampling distributions are the same as certain conditional distributions and that the double saddlepoint approximation attains accuracy comparable to the single saddlepoint approach. Recently, Abd-Elfattah and Butler (2007) used the double saddlepoint approximation to calculate p-values and confidence intervals for the class of linear rank two-sample statistics for censored data.
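Two of the score functions named above have closed forms that are easy to tabulate. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the Fisher-Yates scores are omitted because expected normal order statistics have no simple closed form.

```python
from statistics import NormalDist

def van_der_waerden_scores(n):
    """f_N(i) = Phi^{-1}(i / (N + 1)) for i = 1..N."""
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    return [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

def quadrant_scores(n):
    """f_N(i) = sign(i - (N + 1) / 2) for i = 1..N."""
    mid = (n + 1) / 2
    # (i > mid) - (i < mid) evaluates to +1, 0 or -1
    return [(i > mid) - (i < mid) for i in range(1, n + 1)]

vdw = van_der_waerden_scores(5)
quad = quadrant_scores(5)
```

Both score vectors are antisymmetric about the middle rank, so they sum to zero.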

In this note we present a simple, fast and accurate saddlepoint approach, requiring no extensive permutation simulation, to calculate a nearly exact p-value for the previous class of tests using the double saddlepoint approximation. To use the double saddlepoint approximation, the following lemma reformulates the class (3) in a simpler, more convenient form.

The class of statistics (3) can be written in an equivalent form as

where

Proof. Simple algebra.

For example, if $R_1 = 2$ is the arithmetical rank, then $Z_2 = \eta_1$ and $\sum_{i=1}^{N} i Z_i$ has a 2 in its first component for $R_1$.

Section 2 presents the saddlepoint approximation approach. A real data example is illustrated in Section 3, along with a simulation study showing the performance of the saddlepoint method. The application of the saddlepoint method to the Cuzick (1982) test statistic in the case of interval censoring is discussed in Section 4.

The dependence in the statistic can be removed by using the $(N-1) \times 1$ vectors $Z_i^-$ and $\zeta_i^-$, the first $N-1$ components of $Z_i$ and $\zeta_i$; then

and then $V$ can be represented in terms of $\{Z_i^-\}$ as

Assuming any probability vector $\{\theta_1, \theta_2, \ldots, \theta_N\}$ for the Multinomial distribution, the conditional distribution of

is the required permutation distribution which can be approximated by using the double saddlepoint approximation of Skovgaard (1987).
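For reference, the double saddlepoint approximation of Skovgaard (1987) is usually stated for a continuous statistic in the form below. The notation is an assumption made here for illustration: $S$ denotes the conditioning vector with observed value $x$, $T$ the statistic with observed value $y$, and $K(s, t)$ their joint cumulant generating function. Any lattice continuity correction appropriate for rank statistics is not shown.

$$
\hat{P}(T \le y \mid S = x) = \Phi(\hat{w}) + \phi(\hat{w}) \left( \frac{1}{\hat{w}} - \frac{1}{\hat{u}} \right), \qquad \hat{t} \neq 0,
$$

$$
\hat{w} = \operatorname{sgn}(\hat{t}) \sqrt{ 2 \left[ \{ K(\hat{s}_0, 0) - \hat{s}_0^T x \} - \{ K(\hat{s}, \hat{t}) - \hat{s}^T x - \hat{t} y \} \right] }, \qquad
\hat{u} = \hat{t} \sqrt{ \frac{ |K''(\hat{s}, \hat{t})| }{ |K''_{ss}(\hat{s}_0, 0)| } },
$$

where $\Phi$ and $\phi$ are the standard normal distribution and density functions, $(\hat{s}, \hat{t})$ solves $K'(\hat{s}, \hat{t}) = (x, y)$, and $\hat{s}_0$ solves $K'_s(\hat{s}_0, 0) = x$. A right-tail p-value is one minus this quantity.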

The p-value is approximated by the double saddlepoint procedure, which uses the joint cumulant generating function for (T

and $\mathbf{1}^-$ is the $(N-1) \times 1$ vector of ones. In these expressions, $K''$ is the $N \times N$ Hessian matrix and $K''_{ss}$ is the $\partial^2/\partial s\,\partial s^T$ portion at $(0, 0)$. The saddlepoint $(\hat{s}, \hat{t})$ solves

Using $\theta_i = 1/N$, the denominator saddlepoint equations have the explicit solution $\hat{s}_0 = 0$, which simplifies the calculations.

1. Failure times of transmissions by Nayak (1988).

To test the independence of the failure times $X$ and $Y$, the test statistic (2) is used with $L = (1, \ldots, N)$, and

The true (simulated) p-value was calculated using $10^6$ permutations of the computed test statistic. The simulated p-value is the proportion of permuted statistics exceeding the observed value plus the proportion equal to it. The p-value of the saddlepoint approach is compared to the normal p-value calculated using the standardized statistic $(v' - E(v'))/\sqrt{\mathrm{Var}(v')}$. The true p-value and the saddlepoint approximated p-value were 0.2768 and 0.2763, respectively, while the normal p-value was 0.2693.
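The simulated and normal p-values described above can be sketched for a generic linear rank statistic. This is an illustration under stated assumptions, not the paper's computation: the statistic is taken to be the hypothetical form $V = \sum_i a_i b_i$ for two score vectors, the default number of permutations is reduced to 10,000 for speed, and the function names and example scores are invented. The mean and variance are the standard permutation moments of such a sum.

```python
import numpy as np
from math import sqrt
from statistics import NormalDist

def linear_rank_statistic(a, b):
    """V = sum_i a_i * b_i for two score vectors of equal length."""
    return float(np.dot(a, b))

def permutation_p_value(a, b, n_perm=10_000, seed=0):
    """Monte Carlo estimate of P(V >= v_obs) under random re-pairing of scores."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    v_obs = linear_rank_statistic(a, b)
    hits = 0
    for _ in range(n_perm):
        v = linear_rank_statistic(a, rng.permutation(b))
        # count permuted values exceeding or equal to the observed one
        hits += v >= v_obs - 1e-12
    return hits / n_perm

def normal_p_value(a, b):
    """Right-tail p-value from (v - E(v)) / sqrt(Var(v)) under permutation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(a)
    v_obs = linear_rank_statistic(a, b)
    mean = n * a.mean() * b.mean()
    var = np.sum((a - a.mean()) ** 2) * np.sum((b - b.mean()) ** 2) / (n - 1)
    z = (v_obs - mean) / sqrt(var)
    return 1.0 - NormalDist().cdf(z)

# Hypothetical example: perfectly concordant ranks 1..8
a = np.arange(1, 9)
p_perm = permutation_p_value(a, a)
p_norm = normal_p_value(a, a)
```

With larger `n_perm` the Monte Carlo estimate converges to the exact permutation p-value, at a cost that grows linearly in the number of permutations; this is the cost the saddlepoint approach is meant to avoid.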

A small simulation study was carried out to assess the performance of the saddlepoint method. Consider the general model of dependence

where all the variables $X'_i$, $Y'_i$ and $e_i$ are mutually independent and their distributions do not depend on $i$, and $\lambda$ is a real non-negative parameter. In this model the null hypothesis $H_0$ of independence is equivalent to $\lambda = 0$, whereas for $\lambda > 0$ the variables $X_i$ and $Y_i$ are dependent. Data sets are generated from this model using Logistic, Extreme value and U

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
