On $p$-values

On p -v alues Laurie Davies F aculty of Mathema tics Univ ersity of Duisbur g-Esse n, 45117 E ssen, Ge r many email:laurie.davies@uni-due.de 1. The use and abuse of p -v alues All branches of knowledge which r equire the analysis of data make use of p - v alues. Unfortunately in many cases ‘make use of ’ could b e replace d b y ‘abuse ’, the many rep orts of widespre ad abuse are convincing. In r esp onse The American Statistician published a statement on p -v alues b y the American Statistica l Asso- ciation together with supplement ary materia l consisting of statements by s everal statisticians and philo s ophers ([W asserstein and Lazar, 2 016]). The most detailed of the supplement ary material is [Greenland et al., 20 16]. The authors p oint out that there are many w ays in which any usefulness o f a p -v alue can be inv a lidated. One example is to p erform several exp eriments and repo rt only the o ne with the smallest p -v alue. Problems o f this nature will not b e discussed here. It will be a ssumed tha t the ex per iment is so to sp eak clean and the da ta are so to sp eak high quality . 2. Probability mod els and appro xima tion 2.1. Seman tics. There a re tw o mea ning of the w o rd ‘mo del’ is statistics. The ﬁrst meaning refers to a para metric family of distributio ns . Thus the normal mo del is the family of all normal distributions. This meaning o f the word ‘mo del’ is common in muc h of statistics, in pa r ticular in Bay es ian statistics wher e such models are the ob jects of study . The second mea ning is that of a sing le pr o bability meas ure. In this sense o f the word the N (0 , 1) probability measur e is one mo del, the N (0 , 2) probability meas ure is ano ther mo del. This is the sense in which the word will b e used in this pap er. Mo dels in this sense ar e the a toms so to s pe a k of probability theo ry and hence the basic ob jects of sto chastic mo delling. The meaning of the word ‘mo del’ in the ﬁrst sense is a par ametric family o f mo dels in the second s ense. 1 2 2.2. Appro ximate mo dels. T he authors of [Greenla nd et al., 2016] state (A) .. the distance b etw een the data a nd the mo del prediction is measur ed using a test statistic .. and (B) In log ical terms, the p -v alue tests al l the assumptions a bo ut how the data w ere gener a ted (the entire mo del) ... Although it is never precisely stated it seems tha t the word ‘mo del’ in the ab ove quotations is meant in the ﬁr st s ense, a parametr ic family o f distributions. Whatev er the meaning of the word ‘mo del’ the meaning of the quota tions taken together is clear. The distance betw een the data a nd the model is ba sed on a test s tatistic and the corr esp onding p -v alue measures this distance in a particula r manner. The phrase ‘In logical ter ms’ in the quotatio n (B) sug gests tha t in practice this is not so. Indeed in pra ctice the parametric mo del is accepted and the p -v alue is based on a particula r hypothesis H 0 : µ = µ 0 using a statistic espe cially desig ne d for testing this null hypo thesis, for example a t -test. Suc h a single s ta tistic c a nnot p oss ibly test ‘ al l the ass umptions ab out how the data were generated (the entire mo del)’. A similar attitude is ta ken in [Bir nb aum, 1 9 62]: co nsideration is res tr icted to (C) mo dels whose a dequacy is p ostulated a nd not in question ... the ad- equacy of an y s uch model is t ypically supp orted, more or less ad- equately , by a complex infor ma l synthesis o f previo us exper imental evidence of v arious k inds and theoretical considerations c o ncerning bo th sub j ect-matter and exp er iment al techniques. In cont rast to the word ‘adequa c y’ b e ing applied to a family o f proba bilit y measures it will here b e applied to individual probability meas ur es. Thus the N (0 , 1 ) distribution may or may not b e adequate for a g iven data set. The only sense I ca n make of applying the w ord ‘adequate’ to a parametric family of probability measures is that there are v alues o f the pa rameter for which the individual dis tributions sp eciﬁed b y these pa rameter v a lues are consistent with the data. In g eneral this will be a str ict subset of the para meter space: it is diﬃcult to ima gine a data set for which the N (0 , 1 ) mo del and the N (100 , 10 − 6 ) mo del are b o th adequate or consistent with the da ta. The tw o diﬀerent meanings of the word ‘mo del’ are not just a q ue s tion of no- tation or deﬁnition. They r e ﬂect tw o diﬀerent a pproaches to statistics. This may be seen in [Birnbaum, 1962] where a pa rametric family o f pr o bability measures has 3 to b e a dequate without specifying the a dequacy of an y individual measure. This is necessar y as the L ikeliho o d Principle req uir es the prop ortio nality o f t wo diﬀer- ent densities for all v a lues of the parameter and no t just for the adequate o ne s . A similar problem o ccurs when testing hypo theses. The parametric mo del is declar ed adequate w itho ut sp ecifying the adequa te v alues of the parameter. A hypo thesis H 0 : µ = µ 0 is then tested to see whether µ = µ 0 is consistent with the data . It only makes sense to do this if the adequate par ameter v alues ha ve not been speciﬁed when declaring the whole family to be a deq uate as otherwise the test would b e sup e rﬂuous. More generally a co mmon approa ch is tw o pe rform a sta tistical a nalysis in tw o stages. In the ﬁrst stage one or several parametric models will b e inv estig ated for adequacy , for example b y using a go o dness-of-ﬁt test. Once an adequate model has bee n found it is made the ba sis of the second stage where it is treated as if it were true. T re a ting it as true mea ns among other things ig noring the ﬁrst stag e. If indeed the model is now treated as if true then how we a rrived at this truth is irrele v ant. The following quotation is ta ken fr o m the Chapter 5 of Hub er [Hub er, 2 011] e ntitled ‘Approximate Mo dels’: (D) In the opp osite ca se, if a go o dness-of-ﬁt test do es no t reject a mo del, statisticians hav e b ecome accusto med to acting as if it were true. Of course this is logically inadmissible, even more s o if with McCulla gh and Nelder o ne b elieves that a ll mo dels ar e wrong a priori . Moreov er, trea ting a mo del that has not be en re jected as corr ect can be misleading and dang erous. Perhaps this is the main le s son we have learned fro m robustness. In [Davies, 201 4] models are cons is tent ly treated as a pproximations. The bas ic idea is that a mo del P is an ade q uate appr oximation to a data set x n = ( x 1 , . . . , x n ) if t ypical data X n ( P ) genera ted under P lo ok like x n . Data are gener a ted under single probability distributions P and not under a family of such, that is, a mo del in the ﬁrst sense of the w o rd. This is the reason why sing le pr obability distributions are the basic o b jects of study and not families of dis tributions. The deﬁnition of ‘lo ok like’ will dep end o n the na tur e of the data b eing a nalysed and the mo del. As an ex ample supp ose that the mo del is that of i.i.d. N (0 , 1) random v a riables. Then ‘lo ok like’ can be based on the mean, the v aria nce, the extreme v alues a nd the distance of the empirical distribution function from tha t of the standard N (0 , 1) distribution function. This will b e done explicitly in Sectio n 2.4 4 below. It is worth noting that the concept of adequacy is deﬁned in terms of several statistics and not just one. This is in c o ntrast to the quotation at the b eg inning of Section 3 .2 whe r e it is bas ed on a single statistic. The approach describ ed in [Da vies, 2 014] c an b e r ead as an attempt to replace a t wo sta ge metho dology , EDA follow e d by formal inference, by a sing le stage metho dology whereb y a ll tes ts b ecome missp eciﬁcation tests fr om without, or from a distance. It is an instance of ‘distanced r ationality’ due to D. W. M ¨ uller (see [M¨ uller , 19 74]). Here my trans lation (E) ... distanced rationality . By this we mean a n a ttitude to the given, which is not gov erne d by a ny p o s sible or imputed immanen t laws but which confronts it with dr aft co ns tructs o f the mind in the form of mo dels, hypotheses, w orking hypotheses, deﬁnitions, conclusions, al- ternatives, analogies, so to s pea k fro m a distance, in the ma nner of partial, pr ovisional, approximate knowledge. 2.3. ‘Adequate’ parametric fami l ies. Althoug h the quotation C does not make it explicit it is clear from [Birnbaum, 1962] that Birnbaum is referring to families of parametric mo dels. Thus the Poisson fa mily may b e decla red adequate without sp ecifying any par ticular v a lue of λ which is consistent with the data. As an exam- ple supp ose the pa rametric family is the Poisso n family and that the chi-squared go o dness-o f-ﬁt is used to test adequacy . T he test is typically bas ed o n s o me v ariant of the test statistic (1) k X j =0 ( ˆ p j − p j ( ¯ x n )) 2 p j ( ¯ x n ) where the ˆ p j are the empir ical frequencies, ¯ x n is the mean o f the data and p j ( λ ) = λ j exp( − λ ) /j !. If the v a lue of the test statistic (1) lies b elow a certain level then the Poisson mo del is declared adequa te. Note that (1) do es not s p e c ify any individua l parameter v a lues. Given a dequacy in this sens e the whole par ametric family is then tr a nsp orted to the second stage of formal inference in spite of the fact that the overwhelming ma jority of individual mo dels will not b e consis ten t with the da ta. F or Birnbaum’s argument to work this is essential: ‘t wo lik eliho o d functions, f ( x, θ ) and g ( y , θ ) ar e called the same if they are propo rtional, that is if there e xists a po sitive constant c such that f ( x, θ ) = g ( y , θ ) for all θ ’. 5 In the s econd sense o f the word ‘mo del’, an individual probability distribution, the go o dness-o f-ﬁt pro cedure takes on a diﬀerent form. In the concr ete ca s e o f the Poisson family a g iven Poisson distribution s ay P λ with λ = 2 can b e tested for adequacy us ing (2) k X j =0 ( ˆ p j − p j (2)) 2 p j (2) . The set of λ v a lues for whic h the test statistic lies b elow a critical le vel sp eciﬁes those λ v alues, if any , which are cons istent with the data . This will no t b e the s et of all p os sible v alues. If one in terprets the conce pt of adequa cy for mo dels in the ﬁrst sense using the second sense it can o nly mean that ther e are some para meter v a lues θ for which the single mo del P θ is cons istent with the data. The likelihoo d pr inciple is based on no t spec ifying which v a lues o f θ these are. 2.4. Appro ximation regi ons: an example. A mo del P is an ade q uate a pprox- imation to data x n if typical data sets genera ted under P lo ok like x n . T o make this susceptible to mathematical ana lysis the term ‘look like’ must b e expressible in nu merical quantities. This may not a lwa ys b e p oss ible or ea sy . An animal may b e easily reco gnizable a s a dog but it it not eas y to give this a ma thema tical expressio n. If a model is required whic h giv es data s ets lo ok ing lik e the daily returns of the Standard and Poor’s 500 index it is not clea r how ‘loo k like’ can be deﬁned. In the following it will b e assumed tha t ‘lo ok like’ has a precise ma thema tical expres sion. The following is taken from [Davies, 201 4]. Given a proba bilit y measure P a sam- ple of size n genera ted under P will b e denoted by X n ( P ) = ( X 1 ( P ) , . . . , X n ( P )). Given fur ther a family P of probability measures and a num b er α, 0 < α ≤ 1 , an α approximation regio n for the data x n is deﬁned by (3) A ( x n , α, P ) = { P ∈ P : x n ∈ E n ( P ) } where for ea ch P ∈ P E n ( P ) denotes a subse t of R n such that (4) P ( X n ( P ) ∈ E n ( P )) = α . The choice of the E n ( P ) dep ends on the situation and has in ge ne r al to b e augmented by some form of r egulariza tion, for example: minimum Fisher mo dels, nu m ber of lo cal extre mes , conv exity constraints. These a nd further examples ar e to be found in [Davies, 20 14]. 6 The deﬁnition (3) makes no assumption that the data x n were genera ted under some mo del P 0 ∈ P . The interpretation is that A ( x n , α, P ) sp eciﬁes those mo dels P for which x n ‘lo oks like’ a ‘t ypical s a mple’ X n ( P ) generated under P : typical samples X n ( P ) lie in E n ( P ) so that p oints x n ∈ E n ( P ) lo ok lik e typical sa mples X n ( P ). As an exa mple supp ose P is the family of normal distributio ns N = { ( µ, σ ) : N ( µ, σ 2 ) } . An approximation region can be based o n the mean, the v a riance, out- liers and the distance o f the empirical meas ur e to the mo del N ( µ, σ 2 ) as measured by the K uipe r metric. Mor e pr ecisely put y n = ( x n − µ ) /σ and (5)    T 1 ( y n ) = √ n | mean( y n ) | , T 2 ( y n ) = P n i =1 y 2 i , T 3 ( y n ) = ma x i | y i | , T 4 ( y n ) = d ku ( P ( y n ) , N ( 0 , 1)) , where P ( y n ) is the empirical measur e based on y n . Giv en ˜ α o ne can determine quantiles q 1 ( ˜ α ) , q 21 ( ˜ α ) , q 22 ( ˜ α ) , q 3 ( ˜ α ) , q 4 ( ˜ α ) such that (6)    P ( T 1 ( Y n ) ≤ q 1 ( ˜ α )) = ˜ α, P ( q 21 ( ˜ α ) ≤ T 2 ( Y n ) ≤ q 22 ( ˜ α )) = ˜ α, P ( T 3 ( Y n ) ≤ q 3 ( ˜ α )) = ˜ α, P ( T 4 ( Y n ) ≤ q 4 ( ˜ α )) = ˜ α. where Y n are i.i.d. N (0 , 1 ). The approximation region is then deﬁned by A ( x n , α, R × R + ) = { ( µ, σ ) : T 1 ( y n ) ≤ q 1 ( ˜ α ) , q 21 ( ˜ α ) ≤ T 2 ( y n ) ≤ q 22 ( ˜ α ) , (7) T 3 ( y n ) ≤ q 3 ( ˜ α ) , T 4 ( y n ) ≤ q 4 ( ˜ α ) , y n = ( x n − µ ) /σ } where ˜ α is adjusted so that the region is indeed an α - approximation region. A reasona ble star ting v a lue for ˜ α is (3 + α ) / 4. This will lead to an eﬀective v alue α ∗ > α of α which can b e determined by simulations. A b etter approximation can now b e o bta ined by putting ˜ α = (3 + 2 α − α ∗ ) / 4. F or a nor mal sa mple o f s iz e n = 50 and α = 0 . 9 this leads to ˜ α ≈ 0 . 97 compar ed with the star ting v alue of 0.975. The following data give the quantit y o f copp er in milligrams p er litr e in a s ample of drink ing water ([Da vies, 2 014]): 2 . 16 , 2 . 2 1 , 2 . 15 , 2 . 0 5 , 2 . 06 , 2 . 04 , 1 . 90 , 2 . 03 , 2 . 06 , 2 . 02 , 2 . 06 , 1 . 92 , 2 . 08 , (8) 2 . 05 , 1 . 8 8 , 1 . 99 , 2 . 0 1 , 1 . 86 , 1 . 70 , 1 . 88 , 1 . 99 , 1 . 93 , 2 . 20 , 2 . 02 , 1 . 92 , 2 . 13 , 2 . 13 . The 0.9 a ppr oximation region A ( x n , 0 . 9 , R × R + ) this data set is shown in Figure 1 . An appr oximation region for µ alone can b e obtained by pro jecting A ( x n , α, N ) onto the µ -axis: (9) A ( x n , α, R ) = { µ : ther e ex is ts some σ s .t. ( µ, σ ) ∈ A ( x n , α, R × R + ) } 7 This is equiv alent to pro jecting the approximation re g ion of Figure 1 o nt o the x - axis. The result is the interv al [1 . 945 , 2 . 0 87]. The standa rd 0 . 9 c onﬁdence in terv al for µ based on the t -statistic is the smaller interv al [1 . 9 78 , 2 . 05 4]. If the data really are normally distributed then the standar d co nﬁdence interv al for µ will be smaller than the cor resp onding approximation in terv al. If the data are not normally distr ibuted then the appro ximation interv al can b e smaller, indeed m uch smaller than the conﬁdence in terv al. T his w ill b e discuss ed in Section 2.6 b elow. In (7) the same ˜ α is used for all four functionals . There is no need fo r this. If for example the Kuip er distance is not r e garded as impo rtant as the other features it can b e given less weigh t in terms of a hig her v alue of ˜ α . 2.5. Multipl e p -v alues. The appr oximation region (7) is based on the four statis- tics T i , i = 1 , . . . , 4. F or each parameter pair ( µ, σ ) each o f the statistics T i , i = 1 , 3 , 4 comes with a p -v a lue (10) p i ( µ, σ ) = 1 − P ( T i ( Y n ) ≤ T i ( y n )) , the statistic T 2 comes with the p - v alue (11) p 2 ( µ, σ ) = 2 min( P ( T 2 ( Y n ) ≤ T 2 ( y n )) , 1 − P ( T 2 ( Y n ) ≤ T 2 ( y n ))) where the Y n are i.i.d. N (0 , 1) and y n = ( x n − µ ) /σ ). Thus each par ameter pair ( µ, σ ) co mes with four p -v alues attached p i ( µ, σ ) , i = 1 , . . . 4. It b elongs to the approximation regio n if and only if (12) p ( µ, σ ) = min( p i ( µ, σ ) , i = 1 , . . . , 4) ≥ 1 − α ∗ . As an example the pair (2 . 00 8 , 0 . 110 ) in the a pproximation reg ion o f Figure 1 ha s the p i -v alues (0 . 720 , 0 . 6 83 , 0 . 12 3 , 0 . 967). 1.94 1.96 1.98 2.00 2.02 2.04 2.06 2.08 0.10 0.12 0.14 0.16 Figure 1. The approximation r egion A ( x n , 0 . 9 , R × R + ) for the data (8). 8 The multiple p -v alues as so ciated with eac h pa rameter v alue stand in contrast to the usua l deﬁnition of a p -v alue whic h uses only one s tatistic (see the quo ta tion A). 2.6. Appro ximation and conﬁdence regions . At ﬁr st sight the approximation region (3) can b e in terpreted as a co nﬁdence reg ion. If the data x n were indeed generated under some mo del P 0 ∈ P then b ecause of (4) w e hav e (13) P ( P 0 ∈ A ( X n ( P 0 ) , α, P )) = α for all P 0 ∈ P . Such a n in terpretation howev er causes diﬃculties. Consider the family P = { N ( µ, σ 2 ) : ( µ, σ ) ∈ R × R + } . A standa rd conﬁdence re g ion for the ‘true’ v a lue µ 0 of µ is based on the as sumption tha t there is indeed a ‘true’ v alue µ 0 of µ . That is the data w e re generated under N ( µ 0 , σ 2 ) for some σ . This a ssumption is not chec ked in the formal inference phas e and co nsequently a conﬁdence r e gion for µ 0 is never empty . The int erpretation is tha t it is a measur e precisio n with which µ 0 can b e determined. The a pproximation regio n (9 ) on the other hand is not based o n the as s umption that the data w ere indeed g enerated as i.i.d. N ( µ, σ 2 ) for some ( µ, σ ). It sp eciﬁes those µ - v alues if any for whic h N ( µ, σ 2 ) is an adequate a pproximation to the data for so me σ . Thus if the adequacy reg ion (9) is sma ll this s imply means that there are few v alue s of µ for whic h N ( µ, σ 2 ) is an adequate appr oximation to the data for some σ . It is not a mea sure of pre cision. If o ne imagines the data gradually beco ming less and les s norma l then the reg ion (7) will b eco me sma ller and even tually will b e the empty set. One way of doing this is to gradually increa se one v alue of the s ample un til this v a lue b ecomes inco mpa tible with the fea tur e T 3 of (5). As an ex ample Figure 2 gives the 0.9 approximation r egion for the copp er da ta of (8) but with the smallest obse r v ation of 1.7 being replaced by 1.5 If the 1.7 is r eplaced by 1.267 the 1.95 2.00 2.05 2.10 0.14 0.15 0.16 0.17 0.18 0.19 0.20 Figure 2. The approximation r egion A ( x n , 0 . 9 , R × R + ) for the data (8) but with 1.7 replace d b y 1.5. approximation regio n as calculate has ex actly one p oint (1 . 9812 , 0 . 2 161). 9 Int erpreting (9) as a conﬁdence reg ion leads to complica tions. As the data b e- come less and less like Gaussian data the reg io n bec omes sma lle r and smaller which is int erpreted as a n increase in pr ecision. Th us on this interpretation replacing 1.7 by 1.267 in (8) leads to exact v alues for ( µ, σ ) namely (1 . 9 812 , 0 . 2 161). When the region b ecomes empty this is as if one go es from inﬁnite pre c is ion to no infor mation at a ll. A dis c ussion can b e found in (*) ht tp://andrewge lman.com/20 11/08 /25/ . F rom the p oint of view of appr oximation there is no problem of interpretation. The set of adeq uate pa rameter v alues bec o mes s maller and sma ller and even tually bec omes the empty set, that is, there are no adequate parameter v a lues a t all. 2.7. An empt y approximation regi on. Co nsider the appr oximation region (7) with ˜ α = 0 . 9 75 cor resp onding to α ≈ 0 . 92. Simulations show that for no rmal samples of size 50 the approximation is empt y in ab out 0.7% of the cases. This v alue is based on 5000 simulations. It is m uch s maller than the 8% of the cases where the approximation regio n do es not contain the ( µ, σ ) pair used to gener ate the data. If the a pproximation regio n is empty , that is, the family P co nt ains no mo del which is a n adequa te approximation, there may well b e an int erest in qua nt ifying just how po o r the approximation is. One way of doing this is to determine the smallest v a lue of α , say α ∗ such that the appr oximation regio n is no n-empty . The corres p o nding p -v alue is deﬁned a s p ∗ = 1 − α ∗ which is a measure of the go o dness of the appr oximation: the smalle r the p -v a lue the worse the approximation. F or the approximation regio n (7) it is alwa ys p oss ible to calculate p as the quantiles q can b e calculated. If the quantiles w e r e obtained by sim ulation and the approximation is po or then it ma y not b e p ossible to calcula te the p -v a lues. An alternative is suggested in [Lindsay and Liu, 20 09]. It is based on the idea that it is easier for ther e to b e a n adequate approximation if the s a mple size is small. Samples x ∗ m of size m are drawn for the orig inal sample x n and the approximation reg ion calculated. The measur e of the degree o f approximation is the lar gest v alue of m for which the approximation regio n is no t empty in 5 0% of the ca ses. An example is given in Chapter 3.8 of [Davies, 2014] for a sample of size n = 189. The family of mo dels consider ed was the family of discretized gamma mo dels and the concept of adequacy was based o n the to tal v a riation metric The ﬁt was s o p o or that even for α = 0 . 9999 99 there was no a dequate a pproximation. The size of the smallest 10 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 2 3 4 5 6 7 8 0.2 0.4 0.6 0.8 2 3 4 5 6 7 8 100 200 300 400 Figure 3. Upper panel: the p -v alues (12) plo tted a gainst the o f the larg e st observ atio n for a nor mal sample of s iz e n = 27. Low er panel: the num b er of po ints in the a pproximation region also plot- ted against the size of the larges t observ ation. random subsamples for which there was an adequate approximation in 50% of the cases was a pproximately 40. 2.8. A non- e mpt y approx imation regio n. If the appr oximation region is de- ﬁned b y sta tis tics as in (7) then for each mo del P the p -v alues for each of inequa lities is calculated and the minim um v a lue taken. The ma ximum of these v alues taken ov er the approximation r egion is then a measure of the degree of approximation. Again, the smaller this v a lue the po or er the appr oximation. The statis ticia n can base the dec is ion o n whether to us e the family of mo dels P at leas t in part on the maximum p -v alue. A cut-oﬀ p oint 0.2 implies that there are a dequate mo dels wher e the p -v alues of the functionals inv olved all exceed 0.2. Figure 3 s hows an exa mple o f this. A N (0 , 1) sample of size n = 50 w as g enerated and the larg est observ a tion 2.130 g radually incr eased in steps of 0 .0687 5 . The upp er panel shows the p - v alues as a function of the size of this observ ation, the low er panel the the num b er of points in the approximation region e v aluated over a grid of par ameter v alues. It can b e interpreted as a proxy for the a rea of the reg ion. 11 The qua n tiles o r the p - v alues of the p - v alues ca n b e o btained fr o m simulations. F or a nor mal s ample of size n = 28 the 0.0 0 1, 0.01, 0.05 and 0 .1 quantiles a r e 0 .0713 , 0.210, 0.324 and 0.3 93 re s pe c tively . These v alues ar e ba sed o n 10 000 simulations. The p -v alue based on the normal family o f mo dels is 0.406 with a p -v alue ( p - v alue of the p -v alue) of abo ut 0.1. If the smallest observ ation 1.70 is r emov ed the p -v alue b eco mes 0.83 5 which corr e sp onds to a p - v alue of 0 .96. If the smallest v alue is set to 1 .4 the p - v alue of the p - v alues is a b o ut 0.001. This rais es the ques tions to how bad an approximation can be whilst still basing an a nalysis on the family of distributions. The following quo tation from [Hub er, 201 1] is relev ant in this context. It imme- diately pr ecedes the quotation D (F) If a g o o dness-o f-ﬁt tests rejects a mo del, w e are left with many al- ternative actions. Perhaps we do nothing (and contin ue to live with a mo del cer tiﬁed to b e inaccurate). Perhaps we tinker with a mo del by adding or adjusting a few mor e features (and thereby destroy its conceptual s implicit y). Or we s witch to an entirely diﬀerent mo del, maybe one based on a diﬀeren t theory , or ma y be in the absence of such a theo ry , to a purely phenomenolog ical one. In the opp os ite case, if a go o dness-of-ﬁt test do es not reject a ... 2.9. p -v alues and h yp otheses. Consider the Gaussian family of mo dels and the nu ll h ypo thesis H 0 : µ ≥ µ 0 . The p -v alue deﬁned by the t -s ta tistic (14) pt  √ n mean( x n ) − µ 0 sd( x n ) , n − 1  is often used as a measure as to the exten t that µ 0 is compatible with the data. This deﬁnition is not acceptable from the p o int of view of approximation a s it do es not sp ecify a ny v a lue for σ . As a n example cons ider the copper data (8) Supp os e the legal limit is 2.1 mil- ligrams p er litre and we wish to test the hypo thes is that this is ex ceeded. The Gaussian family of mo dels w ill b e used so tha t with the usual identiﬁcation of the amount of copp er in the water with µ the nu ll h ypo thesis beco mes (15) H 0 : µ ≥ 2 . 1 . The p -v alue a s deﬁned by (14) is 0.0004 36. An equiv alent deﬁnition of a standard p -v a lue is the following. Given the p - v alue p ∗ put α ∗ = 1 − p ∗ . Then α ∗ is the smallest v alue of α such that the conﬁdence 12 region for µ co nt ains µ 0 . This can b e us e d to deﬁne a p -v a lue using the idea o f an approximation r egion. This p -v a lue is deﬁned a s p ∗ = 1 − α ∗ where α ∗ is the smallest v alue of α s uch that the approximation regio n contains a point ( µ 0 , σ ) for s o me σ . This is similar to the deﬁnition of a p -v alue for an empt y approximation region given in Section 2 .7 . If this is done for the w ater data (8) using the approximation region (7 ) the r esulting p -v alue is 0.0 45. Replace no w the smallest v alue 1.7 by 0 .7. The p -v a lue of (14 ) is now 0.01 5 . A t ﬁrst sight this may seem surpr is ing a s the v alue 0.7 is less consis tent with (15) than is 1.7. The reason is that the standard deviation is now 0.2 74 as ag ainst 0.1 16. The p -v alue based on the a pproximation region is 0.00 018. The v alue o f σ is 0.31 0 . The r eason is that the v alue 0.7 is essentially an outlier. This is pick ed up by the statistics T 3 and T 4 but not b y the t -statistic. See the second Huber quotation (F). The outlyingness of 0.7 sho uld ha ve be e n detected in the EDA phase b efor e mo ving on to the formal inference phase. This r a ises the q ues tion of how to react to the outlier. 3. p -v alues and functionals The purp ose of the co pper measurements (8) it to give a p oint estimate of the amount of copp er in the sample o f drinking w ater combined with an interv al of reasona ble v alues. The mea n a nd a co nﬁdence interv al using a nor mal mo del g ive a r easonable solution for this par ticular data sets, but there are problems. One immediate questio n is why the Gaussian family a nd not the Laplace (double exp onential) family? This q uestion dr aws attent ion to the fact that the lo catio n- scale problem is ill-p osed when density ba s ed metho ds, maximum likeliho o d o r Bay es , are used. Some form o f regularizatio n is required. What a re required are ‘bland’ o r ‘hornless’ mo dels (see Section 2, B is for Blandness , o f [T ukey , 1 993] and Chapter 1 .3 .6 of [Davies, 2 014]). In the lo cation-sca le situation one p ossible form of regular iz ation is to use minim um Fishe r informa tion mo dels such as the Gaus sian. Another problem is to relate the parameters of the model to the real world. As the purp ose o f the c o pp er data is to es timate the amount o f c o pp er in the water, simply estimating (in another sense of the w ord estimate) the parameters of a par ametric mo del do es not solve the problem. The parameters must b e connected to the real world. F or the Gaus s ian family this is not a pr oblem as the canonical connection is to identify the lo cation parameter µ with the actual amo unt of co ppe r in the w ater. How ever this fails for the log-normal distribution, another minimum 13 Fisher distr ibution. One ca n still asso cia te the a ctual a mo unt o f co pper with the mean but also with the median. This gives t wo diﬀerent ident iﬁcations for the sa me mo del. The ﬁnal problem is that of outliers . They ar e common in interlab ora to ry tests and any metho d of analysis must be able to dea l with them. Neither the Ga ussian, Laplace o r lo g-norma l achiev e this. The path taken in Chapter 5 of [Davies, 2 014] is to use M -functionals (see Chapters 4 and 5 of [Huber a nd Ronchetti, 2 009]). Given ψ - a nd χ -functions ψ and χ resp ectively and a probability mea sure P over R the M -functional T M is deﬁned by T M ( P ) = ( T L ( P ) , T S ( P ) where T L ( P ) and T S ( P ) solve (16)    R ψ  x − T L ( P )) T S ( P )  dP ( x ) = 0 R χ  x − T L ( P )) T S ( P )  dP ( x ) = 0 The functions ψ and χ can b e s o c hosen so that (i) (16) ha s a unique so lution for all P with a largest atom of less than 0.5 and (ii) the functional T M ( P ) is lo cally uniformly F r´ echet diﬀeren tiable in a K olmogor ov neighbour ho o d o f P see([Davies, 1998] and pag e 54 of [Hamp el et al., 1986]). This g ives stability of anal- ysis with r esp ect to P . The functions used here are (17) ψ ( u, c ) = exp( u/c ) − 1 exp( u/c ) + 1 , χ ( u ) = u 4 − 1 u 4 + 1 where c is a tuning constant set here to 5 . The connection with r eality is achiev ed by ide ntifying the amo unt of copp er with T L ( P ) for any adequate mo del P . the only for m of ade q uacy required is that the mo del P is in a small Ko lmogor ov neightbourho o d o f the empirical distr ibution P n of the data. This s till leav es op en the choice of T M . This will b e discusse d at the end o f the section. Let X n ( P ) denote a sample of size n of i.i.d. rando m v ariable with distr ibution P , and b y q ψ ( · , n, P ) the quant iles o f 1 √ n      n X i =1 ψ  X i ( P ) − T L ( P ) T S ( P )       14 with the corresp onding deﬁnition of q χ ( · , n, P ). The an α -appr oximation reg ion for the functional T M is deﬁned by A ( x n , α, T M ) = n ( T L ( P ) , T S ( P )) : d ko ( P n , P ) < qdk( ˜ α, n ) , (18) √ n     Z ψ  x − T L ( P ) T S ( P )  d P n ( x )     ≤ q ψ ( ˜ α, n, P ) , √ n     Z χ  x − T L ( P ) T S ( P )  d P n ( x )     ≤ q χ ( ˜ α, n, P ) o where ˜ α = (2 + α ) / 3, P n denotes the empirical distribution of the data x n , d ko the Kolmogo rov metric and qdk( · , n ) its quantile function. The choice of ˜ α corr e- sp onds to sp ending (1 − α ) / 3 on ea ch of the three features in the deﬁnition of the approximation regio n. As E  ψ  X i ( P ) − T L ( P ) T S ( P )  = 0 and E ψ  X i ( P ) − T L ( P ) T S ( P )  2 ! = Z ψ  x − T L ( P ) T S ( P )  2 dP ( x ) it follows from the central limit theor em that 1 √ n n X i =1 ψ  X i ( P ) − T L ( P ) T S ( P )  ≍ N 0 , Z ψ  x − T L ( P ) T S ( P )  2 dP ( x ) ! . with the s a me r esult for χ . Thus asymptotica lly (19) q ψ ( ˜ α, n, P ) ≈ qnorm( ˜ α ) s Z ψ  x − T L ( P ) T S ( P )  2 dP ( x ) with the s a me result for χ . As the r andom v ar iables are b ounded the nor ma l a p- proximation is g o o d fo r s mall v alues of n . The requir ement d ko ( P n , P ) < qdk( ˜ α, n ) for ces P into a O (1 / √ n ) Kolmo g orov neighbourho o d of P n . This together with the lo cally uniform F r´ echet diﬀerentiabilit y implies q ψ ( ˜ α, n, P ) ≈ q ψ ( ˜ α, n, P n ) (see pages 10 7 -108 o f[Davies, 2 014]) and together with (19) it leads to the approximate approximation region ˜ A ( x n , α, T M ) = n ( T L ( P ) , T S ( P )) : d ko ( P n , P ) < qdk( ˜ α, n ) (20) √ n     Z ψ  x − T L ( P ) T S ( P )  d P n ( x )     ≤ qnorm( ˜ α ) q V ψ ( P n ) , √ n     Z χ  x − T L ( P ) T S ( P )  d P n ( x )     ≤ qnorm( ˜ α ) q V χ ( P n ) 15 1.94 1.96 1.98 2.00 2.02 2.04 2.06 2.08 0.10 0.12 0.14 0.16 Figure 4. T he 0 . 9 approximation reg io n of the lo catio n a nd sc a le functionals ( T L , T S ) for the co pp e r da ta using the psi- and chi- functions of (17 ) with c = 5 . where V ψ ( P n ) = 1 n n X i =1 ψ  x i − T L ( P ) T S ( P )  2 . This approximation to (18) can be calculated over a gr id of v alues. It is s hown in Figure 4 fo r the copper data with α = 0 . 9 . It may b e compared w ith the a p- proximation region ba sed on the Gaussian distr ibution a s shown in Figure 1 . The approximation reg ion for T L ( P ) is obtained by pro jecting the appr oximation region onto the x -a xis. F or the copp er data with α = 0 . 9 it is [1 . 964 , 2 . 067] compar e d with the standard 0.9-conﬁdence in terv al [1 . 978 , 2 . 0 54] ba sed on the t -s tatistic. The approximation r egion (4) r emains unchanged if the smallest o bserv ation 1.7 is replaced by zero. This is in sharp contrast to the approximation regio n ba sed on the Ga us sian family of mo dels which is empt y in this case. This one example o f stability of analysis deriving fro m the use o f T M : small ch anges in the da ta , her e a single data p oint, lead to only small changes in the result. It was p o inted o ut ab ov e that the lo cation-s cale problem req uires regula rization. The use of the M -functional T M is a reg ula rization o f the pro ce dur e no t the mo dels. Hypo thesis testing as in Section 2.9 can b e done as fo llows. F or the copp er data the n ull hyp o thesis (15) is r e placed by H 0 : T L ( P ) ≥ 2 . 1 . The p -v alue is p ∗ = 1 − α ∗ where α ∗ is the sma llest v alue of α such that (20) contains (2 . 1 , σ ) for s ome σ . Its v alue is p ∗ = 0 . 0 1. The M -functional used here is not the only one. There are many p os sible choices. Whic h one to use is an empirica l question. A member of the co mmittee which pro duced the German DIN s tandard ([DIN, 2 003]) for analys ing water, waste water 16 and sludge r ep orted that in his exp erience the median was better than the mean but worse than the mean after the elimination of o utliers. The ﬁnal decision w as to use Hampe l’s redescending ψ -function (Example 1 on pa g e 1 50 of [Hamp el et al., 19 86]) which can be se e n as a smo oth version of the mean after eliminating outliers. 4. Appro xima tion and prediction 4.1. Prediction. The co ncept of adequate appr oximation c a n b e lo oked at in terms of predictio n. Given a num ber α and bas ed on a mo del P a prediction has to b e made ab out a sample x n . That the pr ediction is bas e d on P mea ns that if the sample were genera ted under P , that is x n = X n ( P ), then the prediction would be corr ect with probability α . In making the prediction is has to b e decided which asp ects of the data a re reg arded a s imp or tant. In the deﬁnitio n of the approximation region (7) the impor tant asp ects are given by the sta tistics T i , i = 1 , . . . , 4. With P = N ( µ, σ 2 ) the co r resp onding predictio n is that a ll the ineq ua lities of (6) will ho ld with y n = ( x n − µ ) /σ replacing Y n . If the prediction is co rrect then the model N ( µ, σ 2 ) is acce pted as an adequa te approximation to the data. 4.2. Jeﬀreys o n p -v alues. The follo wing is often cited as an argument ag ainst the use of p -v alues: (G) .... gives the probability of depa rtures, mea sured in a pa rticular wa y , equal to or grea ter than the o bserved set, a nd the contribution fro m the actual v a lue is nearly alwa ys negligible. What the use of P implies, therefore, is that a h ypo thesis that may be true may b e re jected b e- cause it ha s not predicted observ able results that hav e not o ccurred. This seems to b e a remark a ble pro cedure. On the face o f it, the evi- dence might more r easonably b e taken a s evidence for the h ypothes is , not a g ainst it. (page 385 of [Jeﬀr e ys, 196 1]). Suppo se the hypothes is is that the data follow the N (0 , 1) distribution. What observ able r esults do es this hypothesis predict? It seem po int less to predict a single v alue as such a predictio n w ould be wr ong with probability 1 . The pr ediction mu st be a se t S of v alues with the prediction b eing regarded as correct if the obser v able result x lies in S . Putting S = R re sults in the prediction b eing correc t w ith probability one but this is somewhat v acuous. A non-v acuo us prediction can be obtained by sp ecifying a proba bilit y α and a se t S ( α ) such that the prediction is 17 correct with probability α , P ( X ∈ S ( α )) = α . It is worth y o f note tha t the larger α the more v acuo us the prediction so to sp eak. As a s imple example put α = 0 . 95 and S ( α ) = ( − 1 . 9 6 , 1 . 96) and supp o se that x = 3 . 121 is obser ved. The p - v alue is P ( | X | > 3 . 12 1) = 0 . 0 018 and for this to b e a succ essful prediction would require α = 0 . 99 8 2 rather than the chosen α = 0 . 95. W e now interpret ‘not predicted to o ccur’ in the sense ‘predicted no t to o ccur’ r ather tha n in the sense ‘for getting to predict’. If it were agreed b eforeha nd that a fa ls e prediction would lead to the n ull hypothesis to b e rejected, then this is done b ecause a v alue predicted not to o ccur, namely 3.1 21, did in fact o ccur. This seems an unremark able pr o cedure. How bad the prediction erro r is ca n be mea sured by the α = 0 . 998 2 require d to make the prediction corr ect a nd w hich corr esp onds to a very weak prediction in that it would be co rrect in 99 .8% of the times. 5. p -v alues and choice of cov aria tes in stepwise regression The following is based on [Davies, 2016 a]. Given a data set of size n co nsisting of a dependent v ariable y ( n ) and p ( n ) cov a riates x ( n ) the problem is to decide which if a ny of the cov aria tes to include. The discussion b elow will b e r estricted to the case where p ( n ) is chosen b y step wise regre s sion but the idea can b e extended to considering a ll subsets of the cov ariates as long as p ( n ) is not to o large, say p ( n ) ≤ 2 0 (see [Davies, 20 16b]). It would seem that all pro cedures for choo s ing the cov a riates ar e based o n the standard linear mo del (21) Y ( n ) = X ( n ) β ( n ) + ε ( n ) . The pro cedure to b e describ ed b elow is not based on this mo del. The basic idea is to compare the cov a riates x ( n ) with cov ariates which are simply standar d Ga ussian white no is e. A cov a riate x j is included only if it is signiﬁcantly b etter than white noise. Suppo se that p 0 ≤ n − 2 with indices S 0 hav e already been b een included in the regres s ion a nd that the sum of squar ed residuals is ss 0 . Denote by ss j the sum of squared residuals if the cov ariate x j with j / ∈ S 0 is included. The next ca ndida te for inclusion is that cov ar iate for which s s j is smallest. Including this cov aria te leads to a s um of squa r ed residua ls ss 01 = min j / ∈S 0 ss j . 18 Replace a ll the cov aria tes not in S 0 in their entiret y by standar d Gaussian white noise. Let S S j denote the sum o f sq uared r esiduals if the ra ndom cov ariate cor re- sp onding to x j is included. The inclusion o f the b es t of the random cov ariates leads to a s um of squa r ed residua ls S S 01 = min j / ∈ S 0 S S j . The proba bilit y tha t the b es t random cov ar iate is better than the best o f the actua l cov aria tes is P ( S S 01 < ss 01 ) = 1 − P ( S S 01 ≥ ss 01 ) = 1 − P ( min j / ∈ S 0 S S j ≥ ss 01 ) = 1 − Y j / ∈S 0 P ( S S j ≥ ss 01 ) It has b een shown by Lutz D ¨ um bgen that (22) S S j D = ss 0 (1 − B 1 / 2 , ( n − p 0 − 1) / 2 ) where B a,b denotes a b eta r a ndom v ar iable with pa rameters a and b and distribution function pbeta( · , a, b ). F ro m this it follows that P ( S S j ≥ ss 01 ) = pb eta(1 − ss 01 /ss 0 , 1 / 2 , ( n − p 0 − 1) / 2 ) so that ﬁna lly (23) P ( S S 01 ≤ ss 01 ) = 1 − pb eta(1 − ss 01 /ss 0 , 1 / 2 , ( n − p 0 − 1) / 2 ) p ( n ) − p 0 . This is the p -v a lue for the inclusion o f the next cov ar iate. The simplest pro cedur e is to sp ecify α < 1 and to contin ue the s tep wise selection until the ﬁrst p -v a lue exceeds α . Those cov ariates up to but excluding this last one are the selected ones. The stopping rule is (24) ss 01 > ss 0  1 − q be ta ((1 − α ) 1 / ( p ( n ) − p 0 ) , 1 / 2 , ( n − p 0 − 1) / 2 )  where qb eta( · , a, b ) is the q uantile function o f the beta distribution with pa rameters a and b . The pro cedur e is conceptually and algor ithmically simple. It requires no regular - ization parameter o r cross-v a lidation or an estimate of the error term in (21). It is inv ariant with resp ect to aﬃne changes of unit of the cov ar iates and equiv ariant with resp ect to a p ermutation of the cov ariates. It can b e ex tended to non-linear par a- metric r egressio n, robust regr ession and the Kullback-Leibler discrepa nc y where appropria te. 19 As an exa mple we take the le ukemia data ([Golub et al., 1999] ht tp://www-genome.w i.mit.edu/ c ancer/ which was analy s ed in [Dettling and B¨ uhlmann, 20 03]. These c o nsist of data on n = 72 samples of tissue with with p ( n ) = 3 571 cov ar iates. The dep endent v ar iable y ( n ) is either 0 o r 1 de p ending on whether the patient suﬀers fro m acute lymphobla stic leukemia or acute m y eloid leukemia. The ﬁrst ﬁve genes in order o f inclusion with their a sso ciated p -v alues as deﬁned by (23) ar e a s follows: (25) gene n um ber 1182 1219 2888 1946 2102 p -v alue 0.0000 8.57e-4 3.56e-3 2.54e-1 1.48e-1 According to this r elev ant ge ne s are 11 82, 1219 and 2888 a nd g iven these the re- maining 3568 are no better than random noise. This applies to the gene 1946 but if a simple linear r egressio n is perfor med using this gene a lone its p -v a lue in the linear regres s ion is 7.75e-9 . This is muc h smaller than the 0.25 4 in (25). The p -v alue (23) takes into a ccount the step wise nature of the pr o cedure, in particular that g ene 1946 is the b e s t of the remaining ge ne s once the g enes 118 2, 1219 and 28 8 8 hav e bee n included. A simple linea r r egress io n do e s not take this int o account. The da ta were gather ed in the hop e o f using the gene express ion data to cla ssify the patients. If the classiﬁca tion is based o n g enes 1 1 82, 121 9 and 2888. A simple linear re gressio n results in one miscla ssiﬁcation. In [Dettling and B ¨ uhlmann, 2003] the authors consider ed 4 2 diﬀeren t classiﬁca tion s chemes. O nly tw o of them resulted in a single miscla ssiﬁcation. They used a 1 -nearest- ne ig hbour metho d base d o n 25 and 3 571 genes. F or this pa rticular da ta set the pro cedure descr ib ed ab ove attains the same r esult a nd moreov er speciﬁe s the relev a nt genes. References [DIN, 2003] (2003). DIN 3840 2-45:2003-09 German s tandard m etho ds for the examination of wa- ter, waste water and sludge - General i nformation (group A) - Part 45: In terlab oratory compar- isons for proﬁciency testing of labor atories (A 45). Deutsch es Institut f ¨ ur Normi er ung. [Birnbaum, 1962] Birnbau m, A. (1962). On the f oundations of statistical inference. Journal of the Am eric an Statist ic al Asso ciation , 57:269–326. [Davies, 2014] Dav ies, L. (2014). Data A nalysis and Appr oximate Mo dels . M onographs on Statis- tics and Applied Probability 133. CRC Press. [Davies, 2016a] Da vies, L. (2016a) . Step wi se choice of cov ari ates in high dimensional regressi on. arXiv:1610.05131 [ math.ST]. [Davies, 1995] Dav ies, P . L. (1995). Data features. Statistic a Ne erlandic a , 49:185–245. [Davies, 1998] Dav ies, P . L. (1998) . On lo cally uniformly linearizable high breakdo wn l o cation and scale functionals. Annals of Statistics , 26:1103–1125. 20 [Davies, 2008] Dav ies, P . L. (2008). Approximating data (with discussion). Journal of the Kor ea n Statistic al So ciety , 37:191–240. [Davies, 2016b] Da vies, P . L. (2016b). F unctional choice and non-signiﬁcance regions in regression. arXiv:1605.01936 [ math.ST]. [Dettling and B¨ uhlmann, 2003] Dettling, M. and B¨ uhlmann, P . (2003). Boosting f or tumor clas- siﬁcation w i th gene expressi on data. Bioinformatics , 19(9) :1061–1069 . [Golub et al., 1999] Golub, T. , Slonim, D., P . , T. , Huard, C., Gaasenbeek, M. , Mesirov, J., Coller, H., Loh, M., Downing, J., Cali giuri, M ., Bloomﬁeld, C., and Lander, E. (1999). Molecular clas- siﬁcation of cancer: class dis cov ery and class prediction b y gene expression monitoring. Sci e nc e , 286(15) :531–537. [Greenland et al., 2016] Greenland, S., Senn, S. J., Rothman, K. J. , Carlin, J. B . , Poole, C., Goo dman, S. N., and Al tman, D. G. (2016). Statistical tests, p -v alues, conﬁdence int erv als, and p ow er: A guide to mi sinte rpretations. The Amer ican Statistician, V olume 70. Supplemental material to ‘The ASA’s Statement on p-V alues: Con text, Pro cess and Purpose’. [Hamp el et al . , 1986] Hampel, F. R ., Ronc hetti, E. M., Rousseeu w, P . J., and Stahel, W. A. (1986). R obust Statistic s: The Appr o ach Base d on Inﬂuenc e F unct i ons . Wiley , New Y ork. [Hub er, 2011] Huber, P . J. (2011). D ata Ana lysis . Wiley , New Jersey . [Hub er and R onc hetti, 2009] Huber, P . J. and Ronc hetti, E. M. (20 09). R obust Statistics . Wiley , New Jersey , second edition. [Jeﬀreys, 1961] Jeﬀreys, H. (1961). The ory of Pr ob ability . Oxfor d Classic T exts in the Physical Sciences. O xf ord Universit y press, third edition. [Lindsay and Liu, 2009] Lindsa y , B. and Liu, J. (2009). M odel assessmen t tools for a m o del false wo rld. Stati stic al Science , 24(3):303–31 8. [M ¨ uller, 1974] M¨ uller , D. W. (1974). Thesen zur Di daktik der Mathematik. Math. phys. Semester- b erichte, N.F. , 21:164–1 69. [T ukey , 1993] T ukey , J. W. (1993) . Issues r elev ant to an honest accoun t of data-based inference, partially in the ligh t of Laur i e Davies’s pap er. Princeton U nive rsity , Pr i nceton. [W asserstein and Lazar, 2016 ] W asserstein, R. L. and Lazar, N. A. (2016). The asa’s statemen t on p-v alues: Context , process, and purp ose. The Americ an Statisti c ian,V olume 70 , 70(2).

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment