Conservative Hypothesis Tests and Confidence Intervals using Importance Sampling
Authors: Matthew T. Harrison
Matthew T. Harrison*
Division of Applied Mathematics, Brown University, Providence, RI 02912
email: Firstname Lastname@Brown.edu

Abstract: Importance sampling is a common technique for Monte Carlo approximation, including Monte Carlo approximation of p-values. Here it is shown that a simple correction of the usual importance sampling p-values creates valid p-values, meaning that a hypothesis test created by rejecting the null when the p-value is ≤ α will also have a type I error rate ≤ α. This correction uses the importance weight of the original observation, which gives valuable diagnostic information under the null hypothesis. Using the corrected p-values can be crucial for multiple testing and also in problems where evaluating the accuracy of importance sampling approximations is difficult. Inverting the corrected p-values provides a useful way to create Monte Carlo confidence intervals that maintain the nominal significance level and use only a single Monte Carlo sample. Several applications are described, including accelerated multiple testing for a large neurophysiological dataset and exact conditional inference for a logistic regression model with nuisance parameters.

Keywords and phrases: exact inference, Monte Carlo, multiple testing, p-value, Rasch model, valid.

Contents
1 Introduction
2 Main results
3 Practical considerations
4 Applications
  4.1 Accelerating multiple permutation tests
  4.2 Exact inference for covariate effects in Rasch models
  4.3 Accelerated testing for neurophysiological data
5 Discussion
A Proofs
  A.1 Proof of Theorem 1
  A.2 Proof of Theorem 2
B Proposal distributions and simulation details
C Additional examples
  C.1 Mismatched Gaussians
  C.2 Conditional testing in binary tables
References

* This work was supported in part by the National Science Foundation (NSF) grant DMS-1007593 and also, while the author was in the Department of Statistics at Carnegie Mellon University, by the NSF grant DMS-0240019 and the National Institutes of Health grant NIMH-2RO1MH064537. The author thanks Stuart Geman, Jeffrey Miller, and Lee Newberg for helpful comments. Shigeyoshi Fujisawa and György Buzsáki graciously provided the data for the example in Section 4.3.

imsart-generic ver. 2008/08/29 file: ISpvalue.tex date: November 26, 2024

1. Introduction

Importance sampling is a common technique for Monte Carlo approximation, including Monte Carlo approximation of p-values. Besides its use in situations where efficient direct-sampling algorithms are unavailable, importance sampling can be used to accelerate the approximation of tiny p-values as needed for multiple hypothesis testing. Importance sampling can also be used for Monte Carlo approximation of confidence intervals using a single Monte Carlo sample by inverting a family of hypothesis tests.
The practicality of these two uses of importance sampling (accelerated multiple testing and Monte Carlo approximation of confidence intervals) would seem to be limited by the fact that each places strong requirements on the importance sampling procedure. Multiple testing controls can be sensitive to tiny absolute errors (but large relative errors) in small p-values, which would seem to demand either excessively large Monte Carlo sample sizes or unrealistically accurate importance sampling proposal distributions in order to reduce absolute Monte Carlo approximation errors to tolerable levels. The confidence interval procedure uses a single proposal distribution to approximate probabilities under a large family of target distributions, which again would seem to demand large sample sizes in order to overcome the high importance sampling variability that is frequently encountered when the proposal distribution is not tailored to a specific target distribution. Nevertheless, as shown here, simple corrections of the usual importance sampling p-value approximations can be used to overcome these difficulties, making importance sampling a practical choice for accelerated multiple testing and for constructing Monte Carlo confidence intervals.

The p-value corrections that we introduce require negligible additional computation and still converge to the target p-value, but have the additional property that they are valid p-values, meaning the probability of rejecting a null hypothesis is ≤ α for any specified level α. Hypothesis tests and confidence intervals constructed from the corrected p-value approximations are guaranteed to be conservative, regardless of the Monte Carlo sample size, while still behaving like the target tests and intervals for sufficiently large Monte Carlo sample sizes.
The combination of being both conservative and consistent turns out to be crucial in many applications where the importance sampling variability cannot be adequately controlled with practical amounts of computing, including multiple testing, confidence intervals, and any hypothesis testing situation where the true variability of the importance sampling algorithm is either large or unknown. We demonstrate the practical utility of the correction with several examples below.

Let X denote the original observation (data set). Assume that the null hypothesis specifies a known distribution P for X. For a specified test statistic t, the goal is to compute the p-value p(X) defined by

    p(x) := Pr(t(X) ≥ t(x)),

where the probability is computed under the null hypothesis. Importance sampling can be used to Monte Carlo approximate p(X) if the p-value cannot be determined analytically. Let Y := (Y_1, ..., Y_n) be an independent and identically distributed (i.i.d.) sample from a distribution Q (the proposal distribution) whose support includes the support of P (the target distribution). Then p(X) can be approximated with either

    p̂(X, Y) := (1/n) Σ_{i=1}^n w(Y_i) 1{t(Y_i) ≥ t(X)}

or

    p̃(X, Y) := Σ_{i=1}^n w(Y_i) 1{t(Y_i) ≥ t(X)} / Σ_{j=1}^n w(Y_j),

where 1 is the indicator function and where the importance weights are defined to be w(x) := P(x)/Q(x) in the discrete case, and the ratio of densities in the continuous case. Each of these is a consistent approximation of p(X) as the Monte Carlo sample size increases. The former is unbiased and is especially useful for approximating extremely small p-values. The latter can be evaluated even if the importance weights are only known up to a constant of proportionality.
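In code, the two uncorrected approximations look as follows. This is an illustrative sketch only: the Gaussian null P, the shifted-Gaussian proposal Q, and the identity test statistic are stand-ins chosen for the example, not choices from the paper.

```python
import numpy as np

def is_pvalues(x, y, w, t):
    """Uncorrected importance-sampling p-value approximations.

    x : observed data point; y : array of n draws from the proposal Q;
    w : importance-weight function dP/dQ; t : test statistic.
    Returns (p_hat, p_tilde): the unbiased and the self-normalized estimate.
    """
    wy = w(y)
    hits = (t(y) >= t(x)).astype(float)
    p_hat = np.mean(wy * hits)                # unbiased estimate of p(x)
    p_tilde = np.sum(wy * hits) / np.sum(wy)  # works with unnormalized weights
    return p_hat, p_tilde

# Illustrative setup: null P = N(0,1), proposal Q = N(2,1) chosen to favor
# the upper tail, t(x) = x.
rng = np.random.default_rng(0)
n = 100_000
y = rng.normal(2.0, 1.0, size=n)  # draws from Q
w = lambda v: np.exp(-0.5 * v**2) / np.exp(-0.5 * (v - 2.0)**2)  # dP/dQ
p_hat, p_tilde = is_pvalues(3.0, y, w, t=lambda v: v)
# Both should be near the true tail probability Pr(N(0,1) >= 3) ≈ 0.00135.
```

Because Q pushes samples toward the rejection region, both estimates concentrate near the tiny target probability with far fewer draws than direct sampling would need.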
Note that n is the Monte Carlo sample size; the sample size or dimensionality of X is irrelevant for the developments here. The reader is referred to Liu (2001) for details and references concerning importance sampling and to Lehmann & Romano (2005) for hypothesis testing.

Here we propose the following simple corrections of p̂ and p̃ that make use of the importance weight of the original observation, namely,

    p̂*(X, Y) := (w(X) + Σ_{i=1}^n w(Y_i) 1{t(Y_i) ≥ t(X)}) / (1 + n)

    p̃*(X, Y) := (w(X) + Σ_{i=1}^n w(Y_i) 1{t(Y_i) ≥ t(X)}) / (w(X) + Σ_{j=1}^n w(Y_j)).

We will show that the corrected p-value approximations, while clearly still consistent approximations of the target p-value p(X), are also themselves valid p-values, meaning

    Pr(p̂*(X, Y) ≤ α) ≤ α  and  Pr(p̃*(X, Y) ≤ α) ≤ α

for all α ∈ [0, 1] and n ≥ 0 under the null hypothesis, where the probability is with respect to the joint distribution of the data and the Monte Carlo sample. These simple corrections have far-reaching consequences and enable importance sampling to be successfully used in a variety of situations where it would otherwise fail. Of special note are the ability to properly control for multiple hypothesis tests and the ability to create valid confidence intervals using a single Monte Carlo sample.

2. Main results

The main results are that p̂* and p̃* are valid p-values (Theorems 1 and 2). We generalize the introductory discussion in two directions. First, we allow arbitrary distributions, so the importance weights become the Radon-Nikodym derivative dP/dQ. In the discrete case this simplifies to the ratio of probability mass functions, as in the introduction, and in the continuous case this simplifies to the ratio of probability density functions.
Second, we allow the choice of test statistic to depend on (X, Y_1, ..., Y_n) as long as the choice is invariant to permutations of (X, Y_1, ..., Y_n). We express this mathematically by writing the test statistic, t, as a function of two arguments: the first is the same as before, but the second argument takes the entire sequence (X, Y_1, ..., Y_n), although we require that t is invariant to permutations in the second argument. For example, we may want to transform the sequence (X, Y_1, ..., Y_n) in some way, either before or after applying a test statistic to the individual entries. As long as the transformation procedure is permutation invariant (such as centering and scaling, or converting to ranks), everything is fine. Transformations are often desirable in multiple testing contexts for improving balance (Westfall & Young, 1993).

We begin with the precise notation and assumptions for the theorems. P and Q are probability distributions defined on the same measurable space (S, 𝒮) with P ≪ Q (meaning that sets with positive P probability also have positive Q probability), and w is a fixed, nonnegative version of the Radon-Nikodym derivative dP/dQ. M denotes the set of all (n+1)! permutations π := (π_0, ..., π_n) of (0, ..., n). For π ∈ M and z := (z_0, ..., z_n), we define z^(π) := (z_{π_0}, ..., z_{π_n}). Assume that t : S × S^{n+1} → [−∞, ∞] has the property that t(a, z) = t(a, z^(π)) for all a ∈ S, z ∈ S^{n+1}, and π ∈ M. For z ∈ S^{n+1} define

    p̂*(z) := Σ_{i=0}^n w(z_i) 1{t(z_i, z) ≥ t(z_0, z)} / (n + 1)

    p̃*(z) := Σ_{i=0}^n w(z_i) 1{t(z_i, z) ≥ t(z_0, z)} / Σ_{j=0}^n w(z_j),

where we take 0/0 := 0. Let X have distribution P and let Y_0, Y_1, ..., Y_n be an i.i.d. sample from Q, independent of X. For notational convenience define Z := (Z_0, Z_1, ..., Z_n) by Z_0 := X, Z_1 := Y_1, ..., Z_n := Y_n, so that the corrected p-values are p̂*(Z) and p̃*(Z). Then,

Theorem 1. Pr(p̂*(Z) ≤ α) ≤ α for all α ∈ [0, 1].

Theorem 2. Pr(p̃*(Z) ≤ α) ≤ α for all α ∈ [0, 1].

Proofs are in the Appendix. The theorems do not require any special relationships among P, Q, t, or n. For example, in parametric settings Q does not need to be in the same model class as the null and/or alternative. Validity of the corrected p-values is ensured even for unusual cases such as n = 0 or importance weights with infinite variance. We discuss some practical considerations for choosing Q in the next section (presumably, P, t, and n are dictated by the application and the computational budget).

Validity of the corrected p-values is well known for the special case of direct sampling, i.e., Q ≡ P and w ≡ 1. For Markov chain Monte Carlo (MCMC) approaches, Besag & Clifford (1989) demonstrate how to generate valid p-value approximations using techniques that are unrelated to the ones here.

The theorems continue to hold if each Y_k is chosen from a different Q, say Q_k, as long as the sequence Q_0, Q_1, ..., Q_n is itself i.i.d. from some distribution over proposal distributions. The k-th importance weight is now a fixed version of the Radon-Nikodym derivative dP/dQ_k. This generalization can be useful in practice when each Y_k is generated hierarchically by first choosing an auxiliary variable V_k and then, given V_k, choosing Y_k from some distribution Q_{V_k} that depends on V_k. If the marginal distribution (i.e., Q) of Y_k is not available, one can use importance weights based on the conditional distribution (i.e., Q_{V_k}) of Y_k given V_k. This is equivalent to using random proposal distributions as described above.
The drawback of combining this approach with the p-value corrections is that the importance weight of the original observation is evaluated using dP/dQ_0, where Q_0 is a randomly chosen proposal distribution. This further increases the Monte Carlo randomness already inherent in the p-value approximations. Note that generalizing the theorems to allow for random proposals requires no additional work. If ν is the joint distribution of each (Q_k, Y_k) and ν_Q is the marginal distribution of each Q_k, simply apply the theorems with ν_Q × P in place of P, ν in place of Q, (Q_0, X) in place of X, and (Q_k, Y_k) in place of Y_k. If one uses a regular conditional distribution to define the importance weight [d(ν_Q × P)/dν](Q_k, x), then it will simplify to a version of [dP/dQ_k](x).

Finally, we recall the well-known fact that any valid family of p-values can be inverted in the usual way to give valid confidence intervals. In particular, consider a collection of distributions {P_θ : θ ∈ Θ} and, for each θ ∈ Θ, let p*(θ, Z) be any valid p-value for testing the null hypothesis of P_θ. Fix α ∈ [0, 1] and define the random set

    C_α(Z) := {θ ∈ Θ : p*(θ, Z) > α}.

Then C_α(Z) is a 1 − α confidence set for θ because

    Pr_θ(C_α(Z) ∋ θ) = 1 − Pr_θ(p*(θ, Z) ≤ α) ≥ 1 − α,

where Pr_θ is calculated under the null hypothesis of P_θ. An appealing feature of inverting either p̂* or p̃* is that the same importance samples can (in principle) be used for testing each θ. Only the importance weights (and perhaps the test statistic) vary with θ. Section 4.2 below illustrates this application. The idea of using importance sampling to construct confidence intervals from a single Monte Carlo sample was pointed out in Green (1992).
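As a sketch, the corrected approximations of Section 2 and an empirical check of their validity can be written as follows. The null, proposal, test statistic, and Monte Carlo sizes are all illustrative stand-ins, not choices from the paper.

```python
import numpy as np

def corrected_is_pvalues(x, y, w, t):
    """Corrected importance-sampling p-values p-hat* and p-tilde*.

    The importance weight of the observation x itself enters the numerator
    of both and, for p-tilde*, the denominator as well.
    """
    wy = w(y)
    hits = (t(y) >= t(x)).astype(float)
    num = w(x) + np.sum(wy * hits)
    p_hat_star = num / (1 + len(y))
    p_tilde_star = num / (w(x) + np.sum(wy))
    return p_hat_star, p_tilde_star

# Empirical check of validity under the null (illustrative setup:
# P = N(0,1), Q = N(1,1), t = identity, n = 50 proposal draws each time).
rng = np.random.default_rng(1)
w = lambda v: np.exp(-0.5 * v**2 + 0.5 * (v - 1.0)**2)  # dP/dQ
alpha, reps = 0.05, 4000
rej = 0
for _ in range(reps):
    x = rng.normal(0.0, 1.0)           # X ~ P, so the null is true
    y = rng.normal(1.0, 1.0, size=50)  # Y ~ Q
    p_star, _ = corrected_is_pvalues(x, y, w, t=lambda v: v)
    rej += (p_star <= alpha)
# The observed rejection rate rej/reps should not exceed alpha
# (up to Monte Carlo error), as Theorem 1 guarantees.
```

Note that validity holds here even though n = 50 is tiny; the uncorrected p̂ carries no such guarantee at any fixed n.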
See Bolviken & Skovlund (1996) and Garthwaite & Buckland (1992) for examples of other ways to create Monte Carlo confidence intervals.

3. Practical considerations

It is clear that each of the corrected p-values has the same asymptotic behavior as its uncorrected counterpart, so for sufficiently large n they are essentially equivalent. But in practice, n will often be too small. The concern for the investigator is that the corrected p-values may behave much worse than the uncorrected ones for practical choices of n. In particular, the investigator is concerned about the following situation: the null hypothesis is false; the target p-value, p, is small; the uncorrected approximations, p̂ or p̃, are also small; but the corrected approximations, p̂* or p̃*, are large. Careful choice of Q can lower the chances of this situation.

Each of the corrected p-values can be expressed as an interpolation between its respective uncorrected version and another (non-Monte Carlo) valid p-value:

    p̂*(z) = (1/(n+1)) w(x) + (1 − 1/(n+1)) p̂(z)                                           (1)

    p̃*(z) = w(x)/(w(x) + Σ_{j=1}^n w(y_j)) + (1 − w(x)/(w(x) + Σ_{j=1}^n w(y_j))) p̃(z)    (2)

where we are using the shorthand z := (x, y_1, ..., y_n). In both cases, power suffers when X comes from a distribution in the alternative hypothesis but w(X) is typically large relative to w(Y_1), ..., w(Y_n). Since w(Y_k) always has mean 1, problems might arise for alternatives that tend to make w(X) much larger than 1. This problem can be avoided by choosing Q so that it gives more weight than does P to regions of the sample space that are more typical under alternatives than they are under P. The problem can also be avoided by choosing Q to be similar to P, so that the weights are close to 1 throughout the sample space.
Most proposal distributions are designed with one of these two goals in mind, so the corrected p-values should behave well for well-designed proposal distributions. In practice, however, proposal distributions can be quite bad, and it can be helpful to look more closely at how the proposal affects the power of the corrected p-values.

From (1) we see that p̂* is an interpolation between p̂ and w(x), the latter of which is a valid p-value. Validity of w(X) can be seen either from Theorem 1 with n = 0 or from the simple calculation

    Pr(w(X) ≤ α) = Σ_x 1{P(x) ≤ α Q(x)} P(x) ≤ Σ_x α Q(x) = α,                             (3)

which easily generalizes to more abstract settings. Testing w(X) = P(X)/Q(X) ≤ α is simply a likelihood ratio test (LRT) of P versus Q, although using the critical value of α will give a test of size smaller than α. The test can be strongly conservative, because the bound in (3) can be far from tight. So p̂* is an interpolation between a conservative LRT of P versus Q and an (often liberal) importance sampling Monte Carlo approximation of the target p-value. If the effect of w(X) on p̂* has not disappeared (such as for small n), then p̂* is a p-value for the original null, P, versus a new alternative that is some amalgamation of the original alternative and the proposal distribution Q. If Q is not in the alternative (as it often will not be), this modified alternative is important to keep in mind when interpreting any rejections.

The effect of Q on interpreting rejections is especially important in multiple-testing situations. In these settings, importance sampling is often used to accelerate the approximation of tiny p-values via p̂.
If p̂ ≈ 0, however, then p̂* ≈ w(x)/(n+1), and Q plays a critical role in determining which null hypotheses are rejected after correcting for multiple comparisons. The investigator can take advantage of this by ensuring that a rejection of P in favor of Q is sensible for the problems at hand, and by ensuring that events with small p-values will be heavily weighted by Q.

Turning to (2), we see that p̃* is an interpolation between 1 and p̃. Since p̃ ≤ 1, we always have p̃* ≥ p̃, so that p̃* always leads to a more conservative test than p̃. The degree of conservativeness is controlled by the ratio w(x)/Σ_j w(y_j). If the ratio is small, then the correction ensures validity with little loss of power. If it is large, there may be a substantial loss of power when using the correction. The ratio will be approximately 1/n if P and Q are similar, which is often the design criterion when using p̃.

An important caveat for the main results is that Q is not allowed to depend on the observed value of X. This caveat precludes one of the classical uses of importance sampling, namely, using the observed value of t(X) to design a Q that heavily weights the event {x : t(x) ≥ t(X)}. In many cases, however, since the functional form of t is known, one can a priori design a Q that will heavily weight the event {x : t(x) ≥ t(X)} whenever the p-value would be small. For example, given a family of proposal distributions {Q_ℓ}_ℓ, each of which might be useful for a limited range of observed values of t(X), one can use finite mixture distributions of the form Q = Σ_{ℓ=1}^L λ_ℓ Q_ℓ to obtain more robust performance. For similar reasons, mixture distributions are also useful for creating Monte Carlo confidence intervals in which the same proposal distribution is used for a family of target distributions. The examples in Sections 4.2 and 4.3 have more details, and Hesterberg (1995) contains a more general discussion of the utility of mixture distributions for importance sampling.

Finally, we note that this caveat about Q does not preclude conditional inference. In conditional inference P is the null conditional distribution of X given some appropriate A(X), and all of the analysis takes place after conditioning on A(X). The choice of Q (and also t, for that matter) can thus depend on A(X), but not on additional details about X.

4. Applications

4.1. Accelerating multiple permutation tests

Consider a collection of N datasets. The i-th dataset X_i := (V_i, L_i) contains a sample of values V_i := (V_{i1}, ..., V_{im_i}) and corresponding labels L_i := (L_{i1}, ..., L_{im_i}). The distributions over values and labels are unknown and perhaps unrelated across datasets. We are interested in identifying which datasets show a dependence between the values and the labels. From a multiple hypothesis testing perspective, the i-th null hypothesis is that V_i and L_i are independent, or more generally, that the values of L_i are exchangeable conditioned on the values of V_i. Given a test statistic t(x) := t(v, ℓ), a permutation test p-value for the i-th null hypothesis is given by

    p(X_i) := (1/m_i!) Σ_π 1{t(V_i, (L_i)^(π)) ≥ t(V_i, L_i)},

where the sum is over all permutations π := (π_1, ..., π_{m_i}) of (1, ..., m_i), and where the notation ℓ^(π) := (ℓ_{π_1}, ..., ℓ_{π_{m_i}}) denotes a permuted version of the elements of ℓ using the permutation π.
The beauty of the permutation test is that it converts a large composite null (independence of values and labels) into a simple null (all permutations of the labels are equally likely) by conditioning on the values and the labels but not their pairings. The i-th null distribution, P_i, becomes the uniform distribution over permutations (of the labels). The Bonferroni correction can be used to control the family-wise error rate (FWER), i.e., the probability of even a single false rejection among all N datasets (Lehmann & Romano, 2005). In particular, rejecting null hypothesis i whenever p(X_i) ≤ α/N ensures that the probability of even a single false rejection is no more than α. Although computing p exactly is often prohibitive, Monte Carlo samples from P_i are readily available for approximating p, and importance sampling can be used to accelerate the approximation of tiny p-values. Bonferroni is still sensible as long as the Monte Carlo approximate p-values are valid p-values.

Here is a specific simulation example. Consider N = 10^4 independent datasets, each with m_i = m = 100 real-valued values and corresponding binary labels. In each case there are 40 labels of 1 and 60 labels of 0. In 9990 datasets, the labels and values are independent, in which case the values are i.i.d. standard Cauchy (i.e., with scale parameter 1). In 10 of the datasets, the labels and values are dependent, in which case the values associated to label 0 are i.i.d. standard Cauchy and the values associated to label 1 are i.i.d. standard Cauchy plus 2 (i.e., a standard Cauchy shifted to the right by 2). (This particular example was chosen because standard tests like the two-sample t-test and the Wilcoxon rank-sum test tend to perform poorly.)
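A single dataset of this kind, together with a Monte Carlo permutation p-value using the direct-sampling correction (1 + #hits)/(1 + n), i.e., the special case Q ≡ P and w ≡ 1, can be sketched as follows; the difference-in-medians statistic and the Cauchy setup mirror the simulation described above, but the code itself is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def perm_pvalue(values, labels, t, n_perm=999, rng=None):
    """Monte Carlo permutation p-value with the direct-sampling correction:
    (1 + #{permuted statistics >= observed}) / (1 + n_perm).
    This is the Q = P, w = 1 special case of the corrected p-value."""
    rng = rng or np.random.default_rng()
    t_obs = t(values, labels)
    hits = 0
    for _ in range(n_perm):
        hits += t(values, rng.permutation(labels)) >= t_obs
    return (1 + hits) / (1 + n_perm)

def median_diff(v, l):
    """Difference in medians between label-1 and label-0 values."""
    return np.median(v[l == 1]) - np.median(v[l == 0])

# One dataset drawn under the alternative: 60 standard Cauchy values with
# label 0, 40 standard Cauchy + 2 values with label 1.
rng = np.random.default_rng(2)
v = np.concatenate([rng.standard_cauchy(60), rng.standard_cauchy(40) + 2])
l = np.concatenate([np.zeros(60, int), np.ones(40, int)])
p = perm_pvalue(v, l, median_diff, n_perm=999, rng=rng)
# Under this shifted alternative the p-value is typically small.
```

With only 999 direct permutations, however, the smallest achievable p-value is 1/1000, which is why importance sampling over permutations is needed before a Bonferroni correction at level α/10^4 can ever reject.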
For the alternative hypothesis that label 1 tends to have larger values than label 0, a sensible test statistic is the difference in medians between values associated to label 1 and values associated to label 0, namely,

    t(x) := t(v, ℓ) := median(v_j : ℓ_j = 1) − median(v_j : ℓ_j = 0).

Table 1 shows the Bonferroni-corrected performance of several different Monte Carlo approaches as the Monte Carlo sample size (n) increases. The approximations p̄ and p̄* refer to p̂ and p̂*, respectively, for the special case of Q ≡ P and w ≡ 1, i.e., direct sampling (all permutations are equally likely). Using either p̄ or p̄* would be the standard method of approximating p. The approximations p̂ and p̂* are based on a proposal distribution that prefers permutations that pair large values with label 1 (see the Appendix for details). In each case we reject when the approximate p-value is ≤ α/N. The final approximation q̂ refers to a Wald 1 − α/(2N) upper confidence limit for p̂ using the same importance samples to estimate a standard error. For q̂ we reject whenever q̂ ≤ α/(2N), which, if the confidence limits were exact, would correctly control the FWER at level α.

Table 1
Bonferroni-corrected testing performance for different p-value approximations versus Monte Carlo sample size (n); H_0 is false for 10 out of N = 10^4 tests

           # correct rejections       # incorrect rejections
   n     10^1  10^2  10^3  10^4     10^1  10^2  10^3  10^4
   p̄      10    10    10    10      908    87    12     1
   p̄*      0     0     0     0        0     0     0     0
   p̂       9     8     7     5     7579  1903     2     0
   p̂*      7     6     6     5        0     0     0     0
   q̂       6     5     4     3     4545   585     0     0

Only p̂* works well. The approximations that are not guaranteed to be valid (p̄, p̂, q̂) require excessively large n before the false detection rate drops to acceptable levels.
The cases n = 10^3 and n = 10^4 (even the largest of which was still too small for p̄ to reach the Bonferroni target of zero false rejections) are computationally burdensome, and the situation only worsens as N increases. Furthermore, in a real problem the investigator has no way of determining that n is large enough. The confidence limit procedure similarly failed because the estimated standard errors are too variable.

The valid p-values (p̄*, p̂*) ensure that the Bonferroni correction works regardless of n, but p̄* ≥ 1/(n+1), so it is not useful after a Bonferroni correction unless n is extremely large. The good performance of p̂* requires the combination of importance sampling, which allows tiny p-values to be approximated with small n, and the validity correction introduced in this paper, which allows multiple-testing adjustments to work properly.

4.2. Exact inference for covariate effects in Rasch models

Let X = (X_ij : i = 1, ..., M; j = 1, ..., N) be a binary M × N matrix. Consider the following logistic regression model for X. The X_ij are independent Bernoulli(p_ij) where

    log(p_ij / (1 − p_ij)) := κ + α_i + β_j + θ v_ij                                       (4)

for unknown coefficients κ, α = (α_1, ..., α_M), and β = (β_1, ..., β_N), and known covariates v = (v_ij : i = 1, ..., M; j = 1, ..., N). The special case θ = 0 is called the Rasch model and is commonly used to model the response of M subjects to N binary questions (Rasch, 1961). In this example, we discuss inference about θ when κ, α, β are treated as nuisance parameters.

Consider first the problem of testing the null hypothesis of θ = θ_0 versus the alternative of θ ≠ θ_0.
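The unconditional model in (4) is straightforward to simulate, which is useful for sanity checks and power studies. Below is a minimal sketch; the function name and all parameter values are hypothetical, and this does not implement the conditional sampler discussed next.

```python
import numpy as np

def simulate_model4(kappa, alpha, beta, theta, v, rng):
    """Draw a binary M x N matrix X with independent Bernoulli(p_ij) entries,
    where logit(p_ij) = kappa + alpha_i + beta_j + theta * v_ij, as in (4)."""
    logits = kappa + alpha[:, None] + beta[None, :] + theta * v
    p = 1.0 / (1.0 + np.exp(-logits))
    return (rng.random(p.shape) < p).astype(int)

# A small illustrative draw (all parameter values below are hypothetical):
rng = np.random.default_rng(3)
M, N = 6, 4
v = rng.random((M, N))  # known covariates
x = simulate_model4(0.0, np.zeros(M), np.zeros(N), 2.0, v, rng)
t_obs = (x * v).sum()   # the sufficient statistic t(x) = sum_ij x_ij v_ij
```

Conditioning such draws on their row and column sums is the hard part, which is why a specialized proposal distribution is needed.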
Conditioning on the row and column sums removes the nuisance parameters, and the original composite null hypothesis reduces to the simple (conditional) null hypothesis of X ~ P_{θ_0}, where P_{θ_0} is the conditional distribution of X given the row and column sums for the model in (4) with κ = α_1 = ··· = α_M = β_1 = ··· = β_N = 0 and θ = θ_0. A sensible test statistic is the minimal sufficient statistic for θ, namely,

    t(x) := Σ_{ij} x_ij v_ij.

Since t(X) has power for detecting θ > θ_0 and −t(X) has power for detecting θ < θ_0, it is common to combine upper- and lower-tailed p-values into a single p-value, p±(θ_0, X), defined by

    p+(θ, x) := Pr_θ(t(X) ≥ t(x))
    p−(θ, x) := Pr_θ(−t(X) ≥ −t(x))
    p±(θ, x) := min{1, 2 min{p+(θ, x), p−(θ, x)}}                                          (5)

where Pr_θ uses X ~ P_θ. There are no practical algorithms for computing p±(θ_0, X) nor for direct sampling from P_{θ_0}. The importance sampling algorithm in Chen et al. (2005) (designed for the case θ = 0) can be modified, however, to create a sensible and practical proposal distribution, say Q_θ, for any P_θ. The details of these proposal distributions will be reported elsewhere. Q_θ(x) can be evaluated, but P_θ(x) is known only up to a normalization constant, so we must use p̃ and p̃*. Here we compute two Monte Carlo p-value approximations, one for each of t and −t, and combine them as in (5). Validity of the underlying p-values ensures validity of the combined p-value.

Confidence intervals for θ can be constructed in the usual way by inverting this family of tests. In particular, a 1 − α confidence set for θ is the set of θ_0 for which we do not reject the null hypothesis of θ = θ_0 at level α. An annoying feature of inverting Monte Carlo hypothesis tests is that in most cases a new sample is needed for each θ_0, which increases the computational burden and creates pathologically shaped confidence sets, because the Monte Carlo randomness varies across θ_0 (Bolviken & Skovlund, 1996). Importance sampling suffers from the same problems, unless the same importance samples, and hence the same proposal distribution, are used for testing each θ_0.

To create a common proposal distribution that might work well in practice, we use a mixture of proposals designed for specific θ. In particular, define

    Q := (1/L) Σ_{ℓ=1}^L Q_{θ_ℓ},

where (θ_1, ..., θ_L) are fixed and each Q_{θ_ℓ} is a proposal distribution appropriate for P_{θ_ℓ}, as mentioned earlier. Here we use L = 601 and (θ_1, ..., θ_L) = (−6.00, −5.98, −5.96, ..., 6.00). For any θ, the importance weights are

    w(θ, x) := P_θ(x)/Q(x) = c_θ exp(θ t(x)) / Σ_ℓ Q_{θ_ℓ}(x)

for any binary matrix x with the correct row and column sums, where c_θ is an unknown constant that is not needed for computing p̃ and p̃*. We will use p̃+(θ, z) and p̃−(θ, z) to denote p̃(z) when computed using the importance weights w(θ, ·) and the test statistics +t and −t, respectively, where z := (x, y_1, ..., y_n). Similarly, define p̃+*(θ, z) and p̃−*(θ, z). The upper and lower p-values can be combined via

    p̃± := min{1, 2 min{p̃+, p̃−}}  and  p̃±* := min{1, 2 min{p̃+*, p̃−*}}.

In light of Theorem 2, p̃±*(θ_0, Z) is a valid p-value for testing the null hypothesis that θ = θ_0. Inverting these p-values gives the confidence sets

    C̃_α(z) := {θ ∈ ℝ : p̃±(θ, z) > α}
    C̃_α*(z) := {θ ∈ ℝ : p̃±*(θ, z) > α}.

For fixed z, the approximate p-values are well-behaved functions of θ, so it is straightforward to numerically approximate the confidence sets (which will typically be intervals).
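The mechanics of such a grid inversion are simple. The sketch below uses an illustrative stand-in family of p-values (the exact two-sided p-value for the mean of a single N(θ, 1) observation), not the Rasch proposal distributions, whose details are not given here; the same loop applies to any family of valid p-values.

```python
import numpy as np
from math import erf

def invert_pvalues(pvalue_fn, theta_grid, alpha=0.05):
    """Approximate the confidence set {theta : p*(theta) > alpha} by
    evaluating a family of valid p-values on a grid; returns the retained
    grid points (typically an interval for well-behaved families)."""
    keep = np.array([pvalue_fn(th) > alpha for th in theta_grid])
    return theta_grid[keep]

# Illustrative stand-in p-value family: exact two-sided test for the mean
# of one N(theta, 1) observation x_obs (value chosen arbitrarily).
def two_sided_gauss_p(theta, x_obs=1.3):
    z = abs(x_obs - theta)
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(z / 2**0.5)))  # 2 * (1 - Phi(z))

grid = np.linspace(-3.0, 5.0, 1601)
ci = invert_pvalues(two_sided_gauss_p, grid)
# ci should span roughly x_obs - 1.96 to x_obs + 1.96.
```

When the p-value family is the importance-sampling one, only the weights w(θ, ·) change with θ inside `pvalue_fn`; the Monte Carlo sample is drawn once.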
Table 2 describes the results of a simulation experiment that investigated the coverage properties of $\tilde C_{\alpha}$ and $\tilde C_{\alpha}^{*}$. In that experiment, we took $M = 200$, $N = 10$, and fixed the parameters $\kappa$, $\theta$, $\alpha$, $\beta$ and the covariates $v$ (see the Appendix for details). We set $\theta = 2$ and repeatedly (1000 times) generated datasets and importance samples in order to estimate the true coverage probabilities and the median interval lengths of various confidence intervals. Confidence intervals based on the corrected p-values always maintain a coverage probability of at least $1 - \alpha$ without becoming excessively long, while those based on uncorrected p-values can have much lower coverage probabilities. As the number of importance samples increases, the two confidence intervals begin to agree.

Table 2: Performance of 95% confidence intervals versus Monte Carlo sample size (n).

                      coverage prob (%)         median length
    n                 10      50      100       10      50      100
    $\tilde p$        27.1    77.9    87.9      0.36    1.00    1.10
    $\tilde p_*$      99.3    98.0    96.9      2.10    1.52    1.38

Incidentally, importance sampling seems to be the only currently available method for quickly creating (even approximate) frequentist confidence intervals using conditional inference in this model for a dataset of this size. Exact conditional logistic regression methods (e.g., StataCorp (2009) or Cytel (2010)) are unable to enumerate the state space or generate direct samples with a few gigabytes of memory. And the corresponding MCMC methods (e.g., Zamar et al. (2007)) often fail to adequately sample the state space, even after many hours of computing. Classical, unconditional, asymptotic confidence intervals also behave poorly. For the simulation described in Table 2, 95% Wald intervals only had 82.2% coverage probability with a median length of 1.24.

4.3.
Accelerated testing for neurophysiological data

The example in this section is similar to the example in Section 4.1. It is motivated by a practical multiple testing problem that arises in neurophysiology. Neurophysiologists can routinely record the simultaneous electrical activity of over 100 individual neurons in a behaving animal's brain. Each neuron's spiking electrical activity can be modeled as a temporal point process. Neurophysiologists are interested in detecting various types of spatio-temporal correlations between these point processes, an endeavor which often leads to challenging multiple testing problems. For example, simply testing for non-zero correlation between all pairs of 100 neurons gives $\binom{100}{2}$ different tests, and neurophysiologists are often interested in much more complicated situations than this. This section demonstrates how to use importance sampling to accelerate a Monte Carlo hypothesis test for lagged correlation used in Fujisawa et al. (2008), hereafter FAHB, and then to correctly control for multiple tests via the p-value correction.

Let $X \triangleq (T, N)$ be a marked point process where $T \triangleq (T_1, \dots, T_M)$ are the nondecreasing event times taking values in $\{0, 1, \dots, B-1\}$ and $N \triangleq (N_1, \dots, N_M)$ are marks taking values in $\{1, \dots, C\}$. Let $T^i \triangleq (T^i_1, \dots, T^i_{M_i}) \triangleq (T_k : N_k = i)$ be the sequence of event times with mark $i$, which in this context are the firing times (rounded to the nearest millisecond) of neuron $i$. A neuron can fire at most once per ms, so the times in each $T^i$ are strictly increasing. The dataset used here is described in FAHB and has $C = 117$ neurons and was recorded over a period of $B = 2.812 \times 10^6$ ms (about 47 minutes) from a rat performing a working memory task in a simple maze. There are $M = 763501$ total events. Fix $\Delta = 10$ ms.
For each ordered pair of neurons $(i, j)$, $i \ne j$, FAHB were interested in testing the null hypothesis that the joint conditional distribution of $T^i$ and $T^j$ was uniform given $\lfloor T^i/\Delta \rfloor$ and $\lfloor T^j/\Delta \rfloor$, where $\lfloor a \rfloor$ is the greatest integer $\le a$. Note that $\lfloor T^i/\Delta \rfloor \triangleq (\lfloor T^i_1/\Delta \rfloor, \dots, \lfloor T^i_{M_i}/\Delta \rfloor)$ gives a temporally coarsened version of neuron $i$'s firing times: it only preserves the total number of firings in each length-$\Delta$ window of a fixed partition of time. FAHB used two different test statistics $t^{+}(T^i, T^j)$ and $t^{-}(T^i, T^j)$ defined by
$$t^{+}(T^i, T^j) \triangleq \max_{d = 1, \dots, 4} \sum_{k=1}^{M_i} \sum_{\ell=1}^{M_j} 1\{T^j_\ell - T^i_k = d\}, \qquad t^{-}(T^i, T^j) \triangleq -\min_{d = 1, \dots, 4} \sum_{k=1}^{M_i} \sum_{\ell=1}^{M_j} 1\{T^j_\ell - T^i_k = d\},$$
each with power to detect different alternatives that they hoped to distinguish. (There are some minor technical differences between the tests described here and those reported in FAHB.) Loosely speaking, the tests were designed to detect a particular type of transient, fast-temporal, lagged correlation between neurons (via the test statistic) while ignoring all types of slow-temporal correlations (via the conditioning). Additional motivation, references and details can be found in FAHB.

Testing a specific null and test statistic combination is easy: it is trivial to sample from the null (i.e., conditionally uniform) distribution and to Monte Carlo approximate p-values for any test statistic. Unfortunately, in this case there are $2C(C-1) = 27144$ different hypothesis tests, which leads to a challenging multiple testing problem. Much like the example in Section 4.1, impractically large Monte Carlo sample sizes are needed to successfully use direct sampling (i.e., $Q \equiv P$) in conjunction with a multiple testing correction such as Bonferroni.
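The statistics $t^{+}$ and $t^{-}$ are cheap to compute when spike times are integer milliseconds with at most one spike per ms, since the double sum reduces to membership counting. A minimal sketch (the toy spike trains in the usage are hypothetical, not from the FAHB dataset):

```python
import numpy as np

def lag_counts(ti, tj, max_lag=4):
    # counts[d-1] = number of pairs (k, l) with tj[l] - ti[k] == d, for d = 1..max_lag;
    # with at most one spike per ms this is just membership of ti + d in tj
    tj_set = set(tj)
    return np.array([sum((t + d) in tj_set for t in ti) for d in range(1, max_lag + 1)])

def t_plus(ti, tj):
    return int(lag_counts(ti, tj).max())

def t_minus(ti, tj):
    return int(-lag_counts(ti, tj).min())

# Hypothetical toy spike trains (times in ms):
print(t_plus([0, 10, 20], [2, 12, 21]), t_minus([0, 10, 20], [2, 12, 21]))
```

In the multiple-testing step described next, one would then reject a pair whenever the corrected p-value falls below the Bonferroni cutoff $0.05/(2C(C-1))$.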
FAHB did not directly address this particular multiple testing problem, but they did use overly conservative individual p-values in hopes of reducing the severity of the problem.

Here we observe that importance sampling in conjunction with $\hat p_*$ permits proper and practical use of multiple testing corrections. The importance sampling proposal distributions that we use are mixture distributions much like the one in Section 4.2 (see the Appendix for details). We use a different proposal distribution for each hypothesis test depending on the choice of test statistic and on the neuron pair. Using $n = 100$ and rejecting whenever $\hat p_* \le 0.05/(2C(C-1))$ controls the family-wise error rate (FWER) at level $0.05$. FAHB reported 78 rejections. We can confirm about one third of them using this much more stringent FWER procedure. The qualitative results of FAHB remain the same using this smaller set of rejections, reassuring us that the scientific conclusions of FAHB are not the result of statistical artifacts caused by improper control of multiple hypothesis tests.

5. Discussion

The practical benefits of using either $\hat p_*$ or $\tilde p_*$ are clear. Valid p-values are always crucial for multiple testing adjustments. But even for individual tests, the p-value corrections protect against false rejections resulting from high (and perhaps undiagnosed) variability in the importance sampling approximations. There is almost no computational penalty for using the corrections. And there is little or no loss of power for well-behaved importance sampling algorithms. These advantages extend to confidence intervals constructed by inverting the corrected p-values.

Monte Carlo approximations should always be accompanied by further approximations of the Monte Carlo standard error.
Approximate standard errors give useful diagnostic information, but they can be very misleading. The poor performance of $\hat q$ in Section 4.1 shows that the uncorrected p-value approximations cannot be meaningfully corrected by relying on approximate standard errors. An interesting issue is whether or not the p-value corrections should also be accompanied by approximate standard errors. Further research is warranted on this issue, but at this time, it seems sensible to report both the uncorrected p-value approximations with their approximate standard errors and the corrected p-values with no standard errors. This provides useful information about estimating the target p-value and about Monte Carlo variability, but also permits interpretable hypothesis testing and multiple testing adjustments, without confusing the two issues. One danger that must be avoided is the temptation to use close agreement between the corrected and uncorrected p-values as an indicator of convergence. The Appendix contains additional examples that demonstrate how all four p-value approximations (and their standard errors) can be in close agreement but still far from the truth (toward which they are converging).

Although the p-value corrections are extremely simple, they are not always available using off-the-shelf importance sampling software, because the software may not permit evaluation of the importance weight of the original data. Adding this capability is almost always trivial for the software designer: instead of actually calling random number generators, simply see what values would lead to the original data and evaluate their probability. But investigators may be unwilling or unable to modify existing software. Hopefully, software designers will begin to include this capability in all importance sampling algorithms.
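The "see what values would lead to the original data" recipe can be made concrete. A minimal sketch for a hypothetical sequential proposal over binary vectors: the sampler and the evaluator share one set of per-coordinate probabilities, so the log-probability of the observed data under the proposal, and hence its importance weight, comes essentially for free.

```python
import numpy as np

def sample(q, rng):
    # draw a binary vector from the proposal: bit i is 1 with probability q[i]
    return (rng.random(len(q)) < np.asarray(q)).astype(int)

def logpdf(q, x):
    # instead of calling the random number generator, read off which "draws" would
    # have produced x and accumulate their log-probabilities
    q = np.asarray(q, dtype=float)
    x = np.asarray(x)
    return float(np.sum(np.where(x == 1, np.log(q), np.log1p(-q))))

rng = np.random.default_rng(0)
q = [0.5, 0.25]                       # hypothetical per-bit proposal probabilities
x_observed = [1, 0]                   # the original data
log_q_of_x = logpdf(q, x_observed)    # log Q(X): the piece the correction needs
```

With $\log Q(X)$ in hand, the importance weight of the original observation is $w(X) = P(X)/Q(X)$ (or its self-normalized analogue when $P$ is known only up to a constant).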
We have found that the generic ability to evaluate a proposal distribution at any point (including the observed data) is quite useful for diagnostic purposes, even if one does not expect to make use of the p-value corrections.

The techniques described here are examples of a more general principle of combining observations from both target and proposal distributions in order to improve Monte Carlo importance sampling approximations. Suppose, for example, that our computational budget would allow a small sample from the target and a large sample from the proposal. What is the best way to combine these samples for various types of inference? As we have shown, even a single observation from the target can be used to provide much more than diagnostic information. It can be used, surprisingly, to ensure that (often extremely poor) hypothesis test and confidence interval approximations maintain the nominal significance levels. Perhaps a single observation from the target could also be used to improve approximations of point estimators in some way?

As importance sampling continues to gain prominence as a tool for Monte Carlo approximation, so do the innovations in importance sampling techniques. Most of these will not fit into the classical importance sampling framework needed for the theorems here. It remains future work to investigate the degree to which the original data can inform and correct more sophisticated sequential Monte Carlo methods. But it seems clear that this valuable source of diagnostic information should not be completely ignored.

Appendix A: Proofs

The proofs of both Theorem 1 and Theorem 2 rely on a simple algebraic fact, encapsulated in the next lemma.

Lemma 3. For all $t_0, \dots, t_n \in [-\infty, \infty]$ and all $\alpha, w_0,$
, w n ∈ [0 , ∞ ] , we have n X k =0 w k 1 ( n X i =0 w i 1 { t i ≥ t k } ≤ α ) ≤ α Pr o of. Let H denote the left side of the desired inequality . W e can assume that H > 0, since the statemen t is trivial otherwise. The pairs ( t 0 , w 0 ) , . . . , ( t n , w n ) can be reordered without affecting the v alue of H , so w e can assume that t 0 ≥ · · · ≥ t n . This implies that P n i =0 w i 1 { t i ≥ t k } is increasing in k , and that there exists a k ∗ defined as the largest k for which n X i =0 w i 1 { t i ≥ t k } ≤ α So H = k ∗ X k =0 w k = k ∗ X i =0 w i 1 { t i ≥ t k ∗ } ≤ α A.1. Pr o of of The or em 1 F or any sequence y , ( y 0 , . . . , y n ) and any k ∈ { 0 , . . . , n } , let y k , ( y k , y 1 , . . . , y k − 1 , y 0 , y k +1 , . . . , y n ) whic h is the sequence obtained b y sw apping the 0th element and the k th elemen t in the original sequence y . W e begin with a lemma. Lemma 4. F or any nonne gative, me asur able function f , E f ( Z ) = E 1 n + 1 n X k =0 w ( Y k ) f ( Y k ) ! imsart-generic ver. 2008/08/29 file: ISpvalue.tex date: November 26, 2024 M.T. Harrison/Conservative Hyp othesis T ests and Confidence Intervals 16 Pr o of. Recall that Z i = Y i ∼ Q for i 6 = 0 and that Z 0 = X ∼ P . A change of v ariables from X to Y 0 giv es E f ( Z ) = E w ( Y 0 ) f ( Y ) Since the distribution of Y is inv ariant to permutations, we hav e E w ( Y 0 ) f ( Y ) = E w ( Y k ) f ( Y k ) for eac h k = 0 , . . . , n , whic h means that E f ( Z ) = E w ( Y 0 ) f ( Y ) = 1 n + 1 n X k =0 E w ( Y k ) f ( Y k ) Mo ving the sum inside the expectation completes the pro of. Applying Lemma 4 to the function f ( z ) = 1 b p ∗ ( z ) ≤ α giv es Pr b p ∗ ( Z ) ≤ α = E 1 b p ∗ ( Z ) ≤ α = E 1 n + 1 n X k =0 w ( Y k ) 1 b p ∗ ( Y k ) ≤ α ! = E n X k =0 w ( Y k ) n + 1 1 ( n X i =0 w ( Y i ) n + 1 1 t ( Y i , Y ) ≥ t ( Y k , Y ) ≤ α )! 
The quantity inside the final expectation is always $\le \alpha$, which follows from Lemma 3 by taking $t_\ell \triangleq t(Y_\ell, Y)$ and $w_\ell \triangleq w(Y_\ell)/(n+1)$ for each $\ell = 0, \dots, n$. This completes the proof of Theorem 1.

A.2. Proof of Theorem 2

Let $\Pi$ denote a random permutation chosen uniformly from $\mathcal{M}$ and chosen independently of $Z$. Let $U \triangleq Z^{(\Pi)}$. If $\pi \in \mathcal{M}$ is a fixed permutation, then $\Pi$ and $\Pi^{(\pi)}$ have the same distribution, which means $U$ and $U^{(\pi)}$ have the same distribution. In particular, $U$ and $U^k$ have the same distribution, where the notation $U^k$ is defined in the proof of Theorem 1. We begin with two lemmas, and then the proof of Theorem 2 is nearly identical to that of Theorem 1. We use the convention $0/0 \triangleq 0$.

Lemma 5. Let $P_Z$ and $P_U$ denote the distributions of $Z$ and $U$, respectively, over $S^{n+1}$. Then $P_Z \ll P_U$ and
$$\frac{dP_Z}{dP_U}(u) = \frac{(n+1)\, w(u_0)}{\sum_{j=0}^{n} w(u_j)}$$
almost surely.

Proof. Let $g$ be any nonnegative, measurable function on $S^{n+1}$. We need only show that
$$\mathrm{E}\, g(Z) = \mathrm{E}\Biggl( \frac{(n+1)\, w(U_0)}{\sum_{j=0}^{n} w(U_j)}\, g(U) \Biggr). \tag{6}$$
For any $\pi \in \mathcal{M}$ there is a unique inverse permutation $\pi^{-1} \in \mathcal{M}$ with $\pi^{-1}_{\pi_i} = \pi_{\pi^{-1}_i} = i$, for each $i = 0, \dots, n$. Comparing with the proof of Lemma 4, for any nonnegative, measurable function $f$ we have
$$\mathrm{E} f(Z^{(\pi)}) = \mathrm{E}\bigl( f(Y^{(\pi)})\, w(Y_0) \bigr) = \mathrm{E}\bigl( f(Y)\, w(Y_{\pi^{-1}_0}) \bigr) \tag{7}$$
where the first equality is a change of variables from $Z$ to $Y$ and the second equality follows from the fact that the distribution of $Y$ is permutation invariant, so, in particular, $Y$ and $Y^{(\pi^{-1})}$ have the same distribution. Using (7) gives
$$\mathrm{E} f(U) = \frac{1}{(n+1)!} \sum_{\pi \in \mathcal{M}} \mathrm{E}\bigl( f(U) \mid \Pi = \pi \bigr) = \frac{1}{(n+1)!} \sum_{\pi \in \mathcal{M}} \mathrm{E} f(Z^{(\pi)}) = \frac{1}{(n+1)!} \sum_{\pi \in \mathcal{M}} \mathrm{E}\bigl( f(Y)\, w(Y_{\pi^{-1}_0}) \bigr) = \frac{1}{(n+1)!}$$
$$\sum_{j=0}^{n} \sum_{\pi \in \mathcal{M} :\, \pi^{-1}_0 = j} \mathrm{E}\bigl( f(Y)\, w(Y_j) \bigr) = \frac{1}{(n+1)!} \sum_{j=0}^{n} n!\, \mathrm{E}\bigl( f(Y)\, w(Y_j) \bigr) = \mathrm{E}\Biggl( \frac{\sum_{j=0}^{n} w(Y_j)}{n+1}\, f(Y) \Biggr). \tag{8}$$
Applying (8) to the function $f(u) = (n+1)\, g(u)\, w(u_0) / \sum_{j=0}^{n} w(u_j)$ gives
$$\mathrm{E}\Biggl( \frac{(n+1)\, w(U_0)}{\sum_{j=0}^{n} w(U_j)}\, g(U) \Biggr) = \mathrm{E}\Biggl( \frac{\sum_{j=0}^{n} w(Y_j)}{n+1} \cdot \frac{(n+1)\, w(Y_0)}{\sum_{j=0}^{n} w(Y_j)}\, g(Y) \Biggr) = \mathrm{E}\bigl( w(Y_0)\, g(Y) \bigr) = \mathrm{E}\, g(Z)$$
where the last equality is a change of variables as in (7). This gives (6) and completes the proof.

Lemma 6. For any nonnegative, measurable function $f$,
$$\mathrm{E} f(Z) = \mathrm{E}\Biggl( \sum_{k=0}^{n} \frac{w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, f(U^k) \Biggr).$$

Proof. Changing variables from $Z$ to $U$ and using Lemma 5 gives
$$\mathrm{E} f(Z) = \mathrm{E}\Biggl( \frac{(n+1)\, w(U_0)}{\sum_{j=0}^{n} w(U_j)}\, f(U) \Biggr).$$
Since the distribution of $U$ is invariant to permutations, we have
$$\mathrm{E}\Biggl( \frac{(n+1)\, w(U_0)}{\sum_{j=0}^{n} w(U_j)}\, f(U) \Biggr) = \mathrm{E}\Biggl( \frac{(n+1)\, w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, f(U^k) \Biggr)$$
for each $k = 0, \dots, n$, which means that
$$\mathrm{E} f(Z) = \frac{1}{n+1} \sum_{k=0}^{n} \mathrm{E}\Biggl( \frac{(n+1)\, w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, f(U^k) \Biggr).$$
Moving the sum inside the expectation and cancelling the $(n+1)$'s completes the proof.

Applying Lemma 6 to the function $f(z) = 1\{\tilde p_*(z) \le \alpha\}$ gives
$$\Pr\bigl(\tilde p_*(Z) \le \alpha\bigr) = \mathrm{E}\, 1\{\tilde p_*(Z) \le \alpha\} = \mathrm{E}\Biggl( \sum_{k=0}^{n} \frac{w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, 1\{\tilde p_*(U^k) \le \alpha\} \Biggr) = \mathrm{E}\Biggl( \sum_{k=0}^{n} \frac{w(U_k)}{\sum_{j=0}^{n} w(U_j)}\, 1\Bigl\{ \sum_{i=0}^{n} \frac{w(U_i)}{\sum_{j=0}^{n} w(U_j)}\, 1\bigl\{ t(U_i, U) \ge t(U_k, U) \bigr\} \le \alpha \Bigr\} \Biggr).$$
The quantity inside the final expectation is always $\le \alpha$, which follows from Lemma 3 by taking $t_\ell \triangleq t(U_\ell, U)$ and $w_\ell \triangleq w(U_\ell) / \sum_{j=0}^{n} w(U_j)$ for each $\ell = 0, \dots, n$. This completes the proof of Theorem 2.
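The two corrected p-values certified by Theorems 1 and 2 can be written directly from the displays above. A minimal sketch, assuming entry 0 of each array corresponds to the original observation $X$ and the remaining $n$ entries to the importance samples:

```python
import numpy as np

def p_hat_star(t_vals, w_vals):
    # Theorem 1 form: w_vals are exactly normalized importance weights P/Q; the
    # observation's statistic and weight sit at index 0 (n + 1 entries in all)
    t = np.asarray(t_vals, dtype=float)
    w = np.asarray(w_vals, dtype=float)
    return float(np.sum(w * (t >= t[0])) / len(t))

def p_tilde_star(t_vals, w_vals):
    # Theorem 2 form: self-normalized, so the weights may be known only up to a
    # constant of proportionality
    t = np.asarray(t_vals, dtype=float)
    w = np.asarray(w_vals, dtype=float)
    return float(np.sum(w * (t >= t[0])) / np.sum(w))
```

When $P = Q$ (so every weight equals 1), both reduce to the classical Monte Carlo p-value $(1 + \#\{i \ge 1 : t_i \ge t_0\})/(n+1)$, which is well known to be valid.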
Appendix B: Proposal distributions and simulation details

The simulation example in Section 4.1 uses conditional inference, so the target and proposal distributions for each dataset $i$ (notation suppressed) can depend on the observed values $V \triangleq (V_1, \dots, V_m)$ and the fact that there are $r$ one-labels and $m - r$ zero-labels (but cannot depend on the observed pairing of labels and values). Let $I \triangleq (I_1, \dots, I_m)$ be a permutation that makes $V$ non-increasing, i.e., $V_{I_1} \ge \cdots \ge V_{I_m}$. Choose a random permutation $\Pi \triangleq (\Pi_1, \dots, \Pi_m)$ according to the distribution
$$\Pr(\Pi = \pi) \triangleq \frac{\exp\bigl( \theta \sum_{i=1}^{r} 1\{\pi_i \le r\} \bigr)}{r!\,(m-r)! \sum_{k=0}^{m} \binom{r}{k} \binom{m-r}{r-k} \exp(\theta k)} \qquad (\theta \in \mathbb{R},\ r = 0, \dots, m)$$
where the binomial coefficients $\binom{a}{b} \triangleq a!/(b!(a-b)!)$ are defined to be zero if $a < 0$, $b < 0$, or $a < b$. Leaving $V$ in the original observed order and permuting $L$ so that $L_{I_{\Pi_1}} = \cdots = L_{I_{\Pi_r}} = 1$ and $L_{I_{\Pi_{r+1}}} = \cdots = L_{I_{\Pi_m}} = 0$ gives a random pairing of values with labels. The case $\theta = 0$ is the uniform distribution over permutations, i.e., the target null conditional distribution. We used $\theta = 3$ for the proposal distribution, which assigns higher probability to those permutations that tend to match the label one with larger values.

The simulation example in Section 4.3 also uses conditional inference, so the target and proposal distributions for each pair of neurons $(i, j)$ can depend on the temporally coarsened event times $\lfloor T^i/\Delta \rfloor$ and $\lfloor T^j/\Delta \rfloor$. Define the set of all event times with the same temporal coarsening as neuron $i$ to be
$$\Omega_i \triangleq \bigl\{ u \in \{0, \dots, B-1\}^{M_i} : u_1 < \cdots < u_{M_i},\ \lfloor u_k/\Delta \rfloor = \lfloor T^i_k/\Delta \rfloor\ \forall k \bigr\}.$$
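The Section 4.1 proposal above can be sampled in two stages, because the statistic $k = \sum_{i=1}^{r} 1\{\pi_i \le r\}$ determines the probability: each permutation with statistic $k$ has probability proportional to $\exp(\theta k)$, and there are $\binom{r}{k}\binom{m-r}{r-k}\, r!\,(m-r)!$ of them. A minimal sketch (function and variable names are hypothetical, not from the paper's Matlab code):

```python
import numpy as np
from math import comb, exp

def sample_tilted_permutation(m, r, theta, rng):
    # Stage 1: draw k with P(K = k) proportional to comb(r, k) comb(m-r, r-k) exp(theta k)
    ks = [k for k in range(r + 1) if r - k <= m - r]
    wts = np.array([comb(r, k) * comb(m - r, r - k) * exp(theta * k) for k in ks])
    k = ks[rng.choice(len(ks), p=wts / wts.sum())]
    # Stage 2: given k, the permutation is uniform over those with that statistic
    first = np.concatenate([
        rng.choice(np.arange(1, r + 1), size=k, replace=False),         # values <= r
        rng.choice(np.arange(r + 1, m + 1), size=r - k, replace=False)  # values > r
    ])
    rng.shuffle(first)                                 # order within the first r slots
    rest = np.setdiff1d(np.arange(1, m + 1), first)
    rng.shuffle(rest)                                  # order of the remaining slots
    return np.concatenate([first, rest]).astype(int)

rng = np.random.default_rng(1)
pi = sample_tilted_permutation(6, 3, 3.0, rng)         # theta = 3 as in the text
```

The same normalizing sum gives the proposal probability of any permutation, including the observed one, which is exactly the evaluation needed for the corrected p-values.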
For any $s \in \mathbb{Z}^m$ and $a \in \mathbb{Z}$ define $R_a(s) \triangleq \{k : \lfloor s_k/\Delta \rfloor = a\}$, so that, for example,
$$|\Omega_i| = \prod_a \binom{\Delta}{|R_a(T^i)|}.$$
The target distribution (i.e., the null hypothesis for pair $(i, j)$) is the uniform distribution over $\Omega_i \times \Omega_j$. For any $s \in \Omega_j$, $u \in \Omega_i$, $d \in \mathbb{Z}$, and $\theta \in \mathbb{R}$ define
$$\rho(s, u, d, \theta) \triangleq \frac{\exp\bigl( \theta \sum_{k=1}^{M_j} \sum_{\ell=1}^{M_i} 1\{s_k = u_\ell + d\} \bigr)}{\prod_a \sum_{r=0}^{\Delta} \binom{|R_a(u+d)|}{r} \binom{\Delta - |R_a(u+d)|}{|R_a(s)| - r} \exp(\theta r)}\, 1\{s \in \Omega_j\}$$
which is a probability distribution over $\Omega_j$ for fixed $u$, $d$, and $\theta$. The case $\theta = 0$ is the uniform distribution over $\Omega_j$; the case $\theta > 0$ prefers those $s$ with event times that match those in $u + d$; and the case $\theta < 0$ prefers those $s$ that do not. We used proposal distributions of the form
$$\Pr\bigl(T^i = u,\ T^j = s\bigr) \triangleq \frac{1}{5} \sum_{d=0}^{4} \frac{1\{u \in \Omega_i\}}{|\Omega_i|}\, \rho(s, u, d, \theta_d)$$
which choose event times uniformly for neuron $i$ and then choose times in neuron $j$ that prefer (or avoid) times of a specific lag from those of neuron $i$. For the test statistic $t^{+}$ we used $(\theta_0, \dots, \theta_4) \triangleq (0, .5, .5, .5, .5)$ and for $t^{-}$ we used $(0, -.5, -.5, -.5, -.5)$.

All simulations and analysis were performed with custom software written in Matlab 2010b and executed on a 2.66 GHz iMac with 8GB of RAM using Matlab's default pseudorandom number generator (Mersenne Twister). The parameters used for the logistic regression example in Section 4.2 are κ = -1.
628, β = (0.210, -0.066, 0.576, -0.197, 0.231, 0.184, -0.034, -0.279, -0.396, 0.000) and α = (-0.183, -0.735, -0.144, -0.756, -0.749, -0.226, -0.538, -0.213, -0.118, -0.284, -0.127, -0.632, -0.132, 0.104, 0.000, -0.781, -0.500, -0.498, -0.182, -0.269, -0.077, -0.499, -0.661, -0.780, -0.095, -0.661, -0.478, -0.315, -0.638, -0.225, -0.382, -0.715, -0.085, -0.766, -0.573, -0.629, -0.336, -0.775, -0.461, -0.762, -0.754, -0.082, -0.575, -0.263, 0.098, -0.434, -0.172, -0.109, -0.434, -0.211, -0.757, 0.067, -0.679, -0.601, -0.069, -0.379, -0.098, -0.471, -0.594, -0.830, -0.193, -0.437, -0.415, -0.257, -0.807, -0.551, -0.094, -0.170, -0.741, -0.737, -0.774, -0.859, -0.444, -0.211, -0.144, -0.336, -0.758, -0.235, -0.740, -0.732, -0.768, -0.725, -0.698, -0.671, -0.549, -0.550, -0.649, -0.616, 0.026, -0.164, -0.311, -0.682, -0.655, -0.789, 0.047, -0.160, -0.309, -0.553, -0.701, -0.244, 0.121, -0.696, -0.609, -0.470, -0.793, -0.183, -0.464, 0.116, -0.465, -0.246, -0.712, -0.485, -0.706, -0.109, 0.004, -0.516, -0.181, -0.573, -0.336, -0.034, -0.269, -0.531, -0.568, -0.414, -0.444, -0.507, -0.308, -0.124, -0.442, -0.437, -0.742, -0.842, -0.577, -0.549, -0.213, 0.090, 0.069, -0.409, -0.626, -0.103, -0.107, -0.126, -0.123, -0.761, -0.185, -0.403, -0.655, -0.768, -0.043, -0.692, -0.703, -0.201, 0.028, -0.350, -0.164, -0.713, 0.087, -0.326, -0.187, -0.830, -0.058, -0.118, -0.747, -0.342, -0.541, -0.320, -0.468, -0.452, -0.686, -0.611, -0.846, 0.057, -0.213, 0.066, -0.703, 0.054, -0.072, -0.289, -0.427, -0.609, -0.115, -0.638, -0.803, -0.099, -0.196, -0.152, -0.225, -0.448, -0.476, -0.051, -0.549, -0.052, -0.078, -0.014, -0.361, -0.231, 0.084, -0.423, -0.807, 0.000). The final entries of β and α were constrained to be zero to avoid identifiability problems. The covariates ν are recorded as a 200 × 10 matrix in the ASCII file nu.csv.
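The state-space size $|\Omega_i|$ appearing in the Section 4.3 proposals factorizes over the coarse windows, which makes it cheap to compute on the log scale. A minimal sketch (the toy event times in the test are hypothetical):

```python
import math
from collections import Counter

def log_num_coarsenings(times, delta):
    # |Omega_i| = product over windows a of comb(delta, |R_a|), where |R_a| is the
    # number of event times landing in window a; empty windows contribute a factor 1
    counts = Counter(t // delta for t in times)
    return sum(math.log(math.comb(delta, c)) for c in counts.values())
```

Working with logarithms matters here because $|\Omega_i|$ is astronomically large for realistic spike trains, while each per-window factor $\binom{\Delta}{|R_a|}$ is tiny.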
Appendix C: Additional examples

C.1. Mismatched Gaussians

For the first example, we take $P$ to be Normal$(0, 1)$, $Q$ to be Normal$(\mu, \sigma)$, and use the test statistic $t(x) \triangleq x$. We experiment with a few choices of $\mu$, $\sigma$ and $n$, and in each case compute the cumulative distribution functions (cdfs) and mean-squared errors (MSEs) of the approximations $\hat p$, $\tilde p$, $\hat p_*$, and $\tilde p_*$ of the true p-value $p(X) = 1 - \Phi(X)$, where $\Phi$ is the standard normal cdf. Actually, we do not compute these quantities directly, but approximate them using $10^6$ Monte Carlo samples (each of which uses a new $X$ and new importance samples $Y_1, \dots, Y_n$). Figure 1 and Table 3 show the results. In each case, $\hat p_*$ and $\tilde p_*$ seem to be valid p-values, confirming the theory. There are many cases where $\hat p$ and $\tilde p$ are not valid. The MSEs of all the estimators seem comparable, with none uniformly better than the rest, although for a majority of these examples $\tilde p$ has the smallest MSE. (We use $\min\{\hat p, 1\}$ and $\min\{\hat p_*, 1\}$ instead of $\hat p$ and $\hat p_*$, respectively, for computing MSE.) An important special case among these examples is $\mu = 0$, $\sigma = 0.2$, for which $\hat p$ and $\tilde p$ have MSEs up to 10 times smaller than the MSEs of $\hat p_*$ and $\tilde p_*$, respectively. For this special case, none of the estimators does a good job approximating $p$, even with $n = 10^5$, but $\hat p_*$ and $\tilde p_*$ are conservative, whereas $\hat p$ and $\tilde p$ are strongly liberal (especially for small p-values). In many hypothesis testing contexts, the conservative choice is desirable, despite worse MSE.

C.2. Conditional testing in binary tables

The next example was inspired by Bezáková et al. (2006), which describes a situation where an importance sampling algorithm taken from the literature converges extremely slowly. Let $X$ be a binary $52 \times 102$ matrix with row sums $(51, 1, 1, \dots, 1)$ and column sums $(1, 1, 1, \dots, 1)$.
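The mismatched-Gaussians comparison of Example C.1 above is easy to replicate on a small scale. A minimal sketch for the case $\mu = 0$, $\sigma = 0.2$ (with far fewer repetitions than the $10^6$ used in the text): it tallies how often $\tilde p$ and $\tilde p_*$ fall at or below $\alpha = 0.05$ under the null, i.e., the height of the corresponding cdf at $0.05$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps, alpha = 0.0, 0.2, 100, 2000, 0.05

def log_w(v):
    # log of the Normal(0,1) density minus log of the Normal(mu, sigma) density
    return -0.5 * v ** 2 + 0.5 * ((v - mu) / sigma) ** 2 + np.log(sigma)

rej_tilde = rej_star = 0
for _ in range(reps):
    x = rng.normal()                      # observation from the null P = Normal(0, 1)
    y = rng.normal(mu, sigma, size=n)     # importance samples from Q = Normal(mu, sigma)
    w_y = np.exp(log_w(y))
    p_tilde = np.sum(w_y * (y >= x)) / np.sum(w_y)   # uncorrected (observation excluded)
    z = np.concatenate(([x], y))
    w_z = np.exp(log_w(z))
    p_star = np.sum(w_z * (z >= x)) / np.sum(w_z)    # corrected (observation included)
    rej_tilde += p_tilde <= alpha
    rej_star += p_star <= alpha
# rej_star / reps stays at or below alpha (validity); rej_tilde / reps is far above it
```

This reproduces the qualitative picture of Figure 1 for this $(\mu, \sigma)$ pair: the uncorrected cdf rises far above the diagonal at small arguments, while the corrected cdf stays below it.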
The null hypothesis is that $X$ came from the uniform distribution over the class of all binary matrices with these row and column sums. The alternative hypothesis is that $X$ came from a non-uniform distribution that prefers configurations where the 51 ones in the first row tend to clump near the later columns. For any binary matrix $x$ with these row and column sums, let $\ell_1(x) < \cdots < \ell_{51}(x)$ denote the indices of the columns for which $x$ has ones in the first row. For example, if the first row is

(Figure 1 consists of two 3 × 3 grids of cdf plots, one plot per $(\mu, \sigma)$ pair with $\mu \in \{0, 3, -3\}$ and $\sigma \in \{0.2, 1, 5\}$: panel A is titled "cumulative distribution functions for standard p-values" and panel B "cumulative distribution functions for modified p-values".)

Fig 1. Estimated cumulative distribution functions (cdfs) for the approximate p-values in Example C.1 under the null hypothesis. Each plot corresponds to a specific choice of $\mu$ and $\sigma$ for the proposal distribution. The plots in panel A show the cdfs for $\hat p$ (solid lines) and $\tilde p$ (dashed lines) for each of $n = 10, 1000, 100000$, for a total of six cdfs on each graph. The plots in panel B show $\hat p_*$ (solid lines) and $\tilde p_*$ (dashed lines) for each of $n = 10, 1000, 100000$. (Each cdf was estimated using $10^6$ Monte Carlo experiments, each of which involved a new $X$ and new $Y_1, \dots, Y_n$. Since cdfs take values in $[0, 1]$, the maximum possible variance of any of these estimators is $(1/4)10^{-6}$, which implies that an interval of $\pm 10^{-3}$ around any estimate is at least a 95% confidence interval.)
The cdf of the target p-value, $p$, is the cdf of a Uniform$(0, 1)$ random variable, namely, $F(x) = x$, and is shown with a dotted line along the diagonal of each plot. The different choices of $n$ are not labeled on the plots, but where the curves are distinguishable, larger $n$ will always be closer to the diagonal, since the approximate p-values always converge to the target p-value. A valid p-value has a cdf that lies on or below the diagonal. If the cdf drops strictly below the diagonal, this indicates a loss of power. If the cdf exceeds the diagonal, this indicates that the resulting hypothesis test does not correctly control the type I error. Note that the corrected p-values are always valid, but that the original p-values are often invalid.

$(0, 1, 0, 1, 0, \dots, 1)$, then $(\ell_1, \dots, \ell_{51}) = (2, 4, 6, \dots, 102)$. A suitable test statistic for distinguishing the null from the alternative is
$$t(x) \triangleq \sum_{j=1}^{51} \ell_j(x).$$
Suppose that we observe a matrix $X$ with $\ell_1(X), \dots, \ell_{51}(X)$ = (1, 7, 8, 10, 11, 15, 16, 17, 20, 21, 28, 29, 30, 36, 37, 40, 41, 42, 48, 49, 51, 54, 55, 56, 57, 58, 60, 61, 62, 63, 65, 67, 68, 69, 70, 73, 75, 77, 80, 81, 82, 85, 86, 87, 91, 92, 94, 95, 96, 97, 100), so that $t(X) = 2813$. What is the p-value $p(X)$?

Table 3: Estimated MSEs of the approximate p-values (the smallest in each row was set in bold in the original).

    μ     σ      n        MSE $\hat p$   MSE $\hat p_*$   MSE $\tilde p$   MSE $\tilde p_*$
    0     0.2    10       1.16e-01       2.54e-01         5.12e-02         2.37e-01
    0     0.2    1000     5.98e-02       1.98e-01         2.43e-02         1.75e-01
    0     0.2    100000   3.39e-02       1.52e-01         1.41e-02         1.33e-01
    0     1      10       1.67e-02       1.66e-02         1.67e-02         1.66e-02
    0     1      1000     1.67e-04       1.67e-04         1.67e-04         1.67e-04
    0     1      100000   1.66e-06       1.66e-06         1.66e-06         1.66e-06
    0     5      10       8.58e-02       9.52e-02         6.80e-02         6.37e-02
    0     5      1000     1.38e-03       1.39e-03         4.96e-04         4.96e-04
    0     5      100000   1.45e-05       1.45e-05         4.94e-06         4.94e-06
    3     0.2    10       3.23e-01       3.23e-01         3.31e-01         3.31e-01
    3     0.2    1000     3.17e-01       3.15e-01         3.26e-01         3.26e-01
    3     0.2    100000   3.08e-01       3.07e-01         3.20e-01         3.20e-01
    3     1      10       1.94e-01       1.83e-01         2.53e-01         2.56e-01
    3     1      1000     2.48e-02       2.04e-02         5.18e-02         5.06e-02
    3     1      100000   7.69e-04       6.06e-04         4.76e-03         4.65e-03
    3     5      10       9.62e-02       1.05e-01         8.63e-02         7.92e-02
    3     5      1000     1.62e-03       1.63e-03         5.97e-04         5.98e-04
    3     5      100000   1.72e-05       1.72e-05         5.93e-06         5.93e-06
    -3    0.2    10       3.34e-01       3.42e-01         3.30e-01         3.34e-01
    -3    0.2    1000     3.33e-01       3.49e-01         3.26e-01         3.36e-01
    -3    0.2    100000   3.32e-01       3.57e-01         3.20e-01         3.39e-01
    -3    1      10       2.80e-01       3.99e-01         2.53e-01         3.20e-01
    -3    1      1000     1.05e-01       2.68e-01         5.17e-02         1.44e-01
    -3    1      100000   1.62e-02       3.22e-02         4.79e-03         1.21e-02
    -3    5      10       1.06e-01       1.23e-01         8.60e-02         7.82e-02
    -3    5      1000     1.79e-03       1.81e-03         5.98e-04         5.98e-04
    -3    5      100000   1.88e-05       1.88e-05         5.93e-06         5.93e-06

Chen et al. (2005) suggest a proposal distribution for approximate uniform generation of binary matrices with specified row and column sums. (They actually suggest a variety of proposal distributions. We use the same algorithm that they used in their examples: sampling columns successively and using conditional Poisson sampling weights of $r_i/(n - r_i)$. Note that the basic and more delicate algorithms that they describe are identical for the example here.
Note also that we could have used their algorithm to sample rows successively, in which case the sampling would be exactly uniform because of the symmetry for this particular example.) Applying their algorithm to this problem with $n = 2 \times 10^7$ samples from the proposal distribution gives approximate p-values and standard errors of $\tilde p = 0.068 \pm 0.013$ and the correction $\tilde p_* = 0.068$. (We also have $\hat p = 0.066 \pm 0.014$ and $\hat p_* = 0.066$, although normally they would not be available because the importance weights are known only up to a constant of proportionality. The exact weights are available in this particular example because of the special symmetry.) The estimated squared coefficient of variation (i.e., the sample variance divided by the square of the sample mean) for the importance weights is $\widehat{cv}^2 = 20249$, which is extremely large, but which suggests an effective sample size of around $n/(1 + cv^2) \approx 988$ according to the heuristic in Kong et al. (1994). While much smaller than $2 \times 10^7$, a sample size of 988 would usually ensure reasonable convergence using direct sampling, and the estimated standard errors seem to agree with this heuristic. Furthermore, the solid black line in Figure 2 shows how $\tilde p$ changes with $n$. It seems to have stabilized by the end. Thus, we might be tempted to take $\tilde p = 0.068$ as a reasonable approximation of $p(X)$ and report a "p-value" of 0.068.

(Figure 2, titled "Convergence of Approximate p-values", plots the approximate p-values against the Monte Carlo sample size $n$, up to $2 \times 10^7$.)

Fig 2. Evolution of the approximated p-values with increasing Monte Carlo sample size ($n$) in Example C.2. The dashed black line shows (an excellent approximation of) the target p-value.
value. All of the other lines in the plot will converge to this value as n → ∞, because they are all consistent estimators. The solid black line shows the uncorrected importance sampling p-value approximation, p̃. For a fixed value of the observed test statistic, t(X), this line does not depend on the choice of X. The thin gray lines show the corrected importance sampling p-value approximation, p̃*, for different choices of X, but each with the same value of the observed test statistic, t(X). Unlike p̃, p̃* does depend on the choice of X. The figure shows 50 thin gray lines. The thick gray line is the average of 1000 such thin gray lines (the first 50 of which are shown). The X's for these 1000 were chosen uniformly from the set of X's with the same test statistic. The same sequence of 2 × 10^7 importance samples was used in all cases.

It turns out that 0.068 is not a good approximation of p(X). The special symmetry in this particular example permits exact and efficient uniform sampling: each of the (102 choose 51) choices for the first row is equally likely, and each of the remaining 51! choices for the remainder of the matrix is equally likely. The true p-value is very nearly p(X) = 0.107, as estimated by p̄ using n = 10^7 i.i.d. Monte Carlo samples from P. This is shown as the black dashed line in Figure 2. Despite the fact that p̃ and p̃* are essentially identical for this example, and despite the fact that they do not give a good estimate of the target p-value, p(X), the suggestion here is that reporting p̃* as a p-value correctly preserves the interpretation of a p-value, whereas reporting p̃ as a p-value is misleading. If we were to repeat this test many times, generating new X's sampled from the null (uniform) distribution and new importance samples, we would find that p̃ is much too liberal; there will be too many small p-values.
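The contrast between p̃ and p̃* can be sketched in code. The formulas below are one natural reading of the correction described in this paper's abstract (add the observed point, with its own importance weight, to both the numerator and the denominator of the self-normalized estimator); the paper's formal definitions are authoritative, and this sketch is only illustrative:

```python
def is_pvalues(t_obs, w_obs, stats, weights):
    """Uncorrected self-normalized importance-sampling p-value p_tilde,
    and a corrected p_tilde_star that also counts the observed point
    (t_obs, w_obs) itself.  NOTE: an illustrative reading of the
    correction, not a verbatim transcription of the paper's definitions."""
    hits = sum(w for t, w in zip(stats, weights) if t >= t_obs)
    total = sum(weights)
    p_tilde = hits / total
    p_tilde_star = (w_obs + hits) / (w_obs + total)
    return p_tilde, p_tilde_star

def ess(weights):
    """Kong et al. (1994) heuristic: effective sample size n/(1 + cv^2),
    where cv^2 is the squared coefficient of variation of the weights."""
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / (n - 1)
    return n / (1.0 + var / mean ** 2)

# With equal weights the correction reduces to a Besag-Clifford-style
# (1 + count) / (1 + n) adjustment of the plain Monte Carlo fraction:
p, p_star = is_pvalues(3.0, 1.0, [1.0, 2.0, 3.0, 4.0], [1.0] * 4)
print(p, p_star)    # -> 0.5 0.6
```

When the observed point is rare under the proposal, w_obs is large and p_tilde_star is pulled upward, which is the conservative behavior the correction is designed to produce.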
The correction in p̃* does not have this problem.

Generating 2 × 10^7 samples from Q is computationally demanding, so we do not have a simulation experiment to illustrate the assertion in the previous paragraph. We do, however, illustrate the idea with two less demanding computations. The first approximates the cdfs of p̃ and p̃* for n = 1000, much like the experiments in Example C.1. Results are shown in Figure 3. The uncorrected p-values are extremely liberal: rejecting when p̃ ≤ 0.05 results in a type I error rate of 40%. The corrected p-values are conservative, but valid, correctly preserving the interpretation of a level α test.

[Figure 3 ("cumulative distribution functions"): F(x) versus x, both axes from 0 to 1.]

Fig 3. Cumulative distribution functions under the null hypothesis for p̃ (dashed line) and p̃* (solid line) using n = 1000 in Example C.2. The cdfs are approximated with 10^4 Monte Carlo repetitions. The dotted line is the Uniform(0, 1) cdf. It is visually indistinguishable from the cdf of the target p-value, p(X), although not identical, because of the discreteness of the test statistic.

Our second illustration uses the same 2 × 10^7 importance samples as before, but simply changes the original observation X to a new one with the same value of t. For example, consider another observation of X with ℓ_1(X), …, ℓ_51(X) = (1, 2, 6, 8, 10, 14, 16, 18, 19, 20, 21, 23, 25, 29, 32, 33, 36, 38, 42, 44, 46, 49, 50, 53, 54, 57, 60, 62, 63, 67, 68, 69, 70, 72, 73, 76, 78, 81, 82, 88, 89, 92, 93, 94, 95, 96, 97, 99, 100, 101, 102), which has the same value of the test statistic, t(X) = 2813, and consequently the same target p-value. It also has the same approximate uncorrected p-value, p̃ = 0.068 (using the same importance samples from before).
But it does not have the same corrected p-value. In this case p̃* = 0.3907. The change results from the fact that this new observation of X is much rarer than the previous one under the importance sampling proposal distribution: so rare, in fact, that even n = 2 × 10^7 is insufficient to mitigate the effect that its importance weight has on the corrected p-value. The 50 different thin gray lines in Figure 2 show how p̃* varies with n using 50 different choices of X, all with t(X) = 2813. The thick gray line shows the average of 1000 different choices of X, all with t(X) = 2813. The new X's were chosen uniformly subject to the constraint that t(X) = 2813, which is actually quite efficient in this special case using direct sampling from P followed by rejection sampling.

By paying attention to X, and not just t(X), p̃* can be a valid p-value, even though it is not a good estimator of p(X) using n ≈ 10^7.

This class of importance sampling algorithms is used, for example, in testing goodness of fit in Rasch models and in the statistical analysis of ecological data. Chen et al. (2005) use an occurrence matrix for “Darwin's finch data” to illustrate their importance sampling approach for testing the uniformity of zero-one tables. The observed table, X, is a 13 × 17 binary matrix indicating which of 13 species of finch inhabit which of 17 islands in the Galápagos. The (ordered) row sums of X are (17, 14, 14, 13, 12, 11, 10, 10, 10, 6, 2, 2, 1) and the (ordered) column sums are (11, 10, 10, 10, 10, 9, 9, 9, 8, 8, 7, 4, 4, 4, 3, 3, 3). A scientifically relevant null hypothesis is that X was selected uniformly among all possible occurrence matrices with the same sequence of row and column sums.
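A statistic of the kind used for this example, the average squared number of shared islands over ordered pairs of distinct species, can be computed as follows (toy data below, not the actual finch matrix):

```python
def cooccurrence_statistic(x):
    """Average of the squared co-occurrence counts over ordered pairs of
    distinct rows of a binary matrix x (species by islands): s_ij counts
    the islands shared by species i and j, and the statistic averages
    s_ij^2 over the i != j pairs."""
    m = len(x)
    total = 0
    for i in range(m):
        for j in range(m):
            if i != j:
                s_ij = sum(a * b for a, b in zip(x[i], x[j]))
                total += s_ij ** 2
    return total / (m * (m - 1))

# Toy 3-species, 3-island example; each pair shares exactly one island:
x = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1]]
print(cooccurrence_statistic(x))   # -> 1.0
```

Large values indicate that species tend to share (or systematically avoid sharing) islands more than a uniform table with the same margins would suggest, which is why the statistic is sensitive to competition.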
A scientifically relevant test statistic is

t(x) := (1 / (13 · 12)) ∑_{i≠j} ((x xᵀ)_{ij})²,

where (A)_{ij} denotes entry (i, j) in matrix A, and Aᵀ denotes the transpose of A. The test statistic should be large if there is competition among species. The observed data has t(X) = 53.1. See Chen et al. (2005) for details and references.

Chen et al. (2005) use the importance sampling algorithm described above to generate an approximate p-value for this hypothesis test. They report p-values of p̃ = (4.0 ± 2.8) × 10^-4 using n = 10^4 and p̃ = (3.96 ± 0.36) × 10^-4 using n = 10^6. The error terms are approximate standard errors, estimated from the importance samples. Repeating their analyses, but with new importance samples, gives p̃ = (7.77 ± 7.78) × 10^-4 and p̃* = 13.92 × 10^-4 using n = 10^4, and p̃ = (4.32 ± 0.51) × 10^-4 and p̃* = 4.38 × 10^-4 using n = 10^6. An investigator reporting any of the values of p̃, even with estimates of standard errors, has few guarantees. But reporting either of the values of p̃* is guaranteed to preserve the usual interpretation of a p-value.

References

Besag, J., & Clifford, P. (1989). Generalized Monte Carlo significance tests. Biometrika, 76(4), 633.

Bezakova, I., Sinclair, A., Stefankovic, D., & Vigoda, E. (2006). Negative examples for sequential importance sampling of binary contingency tables. arXiv preprint math/0606650.

Bolviken, E., & Skovlund, E. (1996). Confidence intervals from Monte Carlo tests. Journal of the American Statistical Association, 91(435).

Chen, Y., Diaconis, P., Holmes, S., & Liu, J. (2005). Sequential Monte Carlo methods for statistical analysis of tables. Journal of the American Statistical Association, 100(469), 109–120.

Cytel. (2010). LogXact 9. Cambridge, MA: Cytel Inc.

Fujisawa, S., Amarasingham, A., Harrison, M., & Buzsáki, G. (2008). Behavior-dependent short-term assembly dynamics in the medial prefrontal cortex. Nature Neuroscience, 11(7), 823–833.

Garthwaite, P., & Buckland, S. (1992). Generating Monte Carlo confidence intervals by the Robbins-Monro process. Journal of the Royal Statistical Society, Series C (Applied Statistics), 41(1), 159–171.

Green, P. (1992). Discussion of the paper by Geyer and Thompson. Journal of the Royal Statistical Society, Series B (Methodological), 54(3), 683–684.

Hesterberg, T. (1995). Weighted average importance sampling and defensive mixture distributions. Technometrics, 37(2), 185–194.

Kong, A., Liu, J., & Wong, W. (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 278–288.

Lehmann, E., & Romano, J. P. (2005). Testing statistical hypotheses (3rd ed.). New York: Springer.

Liu, J. (2001). Monte Carlo strategies in scientific computing. Springer.

Rasch, G. (1961). On general laws and the meaning of measurement in psychology. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability: Probability Theory, 321–334.

StataCorp. (2009). Stata statistical software: Release 11. College Station, TX: StataCorp LP.

Westfall, P., & Young, S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. Wiley-Interscience.

Zamar, D., McNeney, B., & Graham, J. (2007). elrm: Software implementing exact-like inference for logistic regression models. Journal of Statistical Software, 21.