Rediscovering a little known fact about the t-test and the F-test: Algebraic, Geometric, Distributional and Graphical Considerations
We discuss the role that the null hypothesis should play in the construction of a test statistic used to make a decision about that hypothesis. To construct the test statistic for a point null hypothesis about a binomial proportion, a common recommen…
Authors: Jennifer A. Sinnott, Steven N. MacEachern, Mario Peruggia
Department of Statistics, The Ohio State University, Columbus, Ohio, USA
July 14, 2022

Abstract

We discuss the role that the null hypothesis should play in the construction of a test statistic used to make a decision about that hypothesis. To construct the test statistic for a point null hypothesis about a binomial proportion, a common recommendation is to act as if the null hypothesis is true. We argue that, on the surface, the one-sample $t$-test of a point null hypothesis about a Gaussian population mean does not appear to follow the recommendation. We show how simple algebraic manipulations of the usual $t$-statistic lead to an equivalent test procedure consistent with the recommendation. We provide geometric intuition regarding this equivalence and we consider extensions to testing nested hypotheses in Gaussian linear models. We discuss an application to graphical residual diagnostics where the form of the test statistic makes a practical difference. By examining the formulation of the test statistic from multiple perspectives in this familiar example, we provide simple, concrete illustrations of some important issues that can guide the formulation of effective solutions to more complex statistical problems.

Keywords: Binomial proportion; $F$-test; Nested models; Null hypothesis; Orthogonal sum of squares decomposition; Test statistic

1 Introduction

Among the first procedures taught in an introductory statistics class are hypothesis testing and confidence interval estimation for a proportion (see, e.g., Moore et al. (2012)). For example, students may be given data on the sexes of a sample of $n$ babies born during a certain time period.
They may be asked either to estimate the true proportion $p$ of babies born male and provide a confidence interval, or to test whether the proportion is equal to, for example, 0.5.[1] Typically, for large $n$, the distribution of the sample proportion is approximated by $\hat{p} \overset{\cdot}{\sim} N(p, p(1-p)/n)$, and two slightly different procedures are introduced. For estimation and confidence interval construction, $\hat{p}$ is commonly plugged into the variance formula, and a $100(1-\alpha)\%$ confidence interval is calculated as

$$\hat{p} \pm z_{\alpha/2} \sqrt{\hat{p}(1-\hat{p})/n}. \tag{1}$$

For testing $H_0: p = p_0$ for a pre-specified $p_0$, students are advised to act as though the null were true, and use the null to construct the test statistic. As a result, $p_0$ is plugged into the variance formula, producing the test statistic

$$\frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}. \tag{2}$$

Although many different approaches to both testing and interval estimation have been proposed, and many commonly used statistical software packages allow the user to apply continuity corrections to these formulas to improve the asymptotic approximation (e.g., by setting the argument correct = TRUE in the R function prop.test), in the authors' experience the above methods are still frequently taught for hand calculation in introductory statistics classes of various levels. For instance, Example 10.3.5 in Casella and Berger (2002) discusses precisely two test procedures based on test statistics that use $\hat{p}$ or $p_0$ to estimate the variance, commenting on their relative merits in terms of a comparison of their power functions. For further discussions of procedures used in the one-sample proportion setting, see, e.g., Agresti and Coull (1998) and Yang and Black (2019).

Also among the first procedures taught are estimation and hypothesis testing for the mean $\mu$ of a normal $N(\mu, \sigma^2)$ population with unknown variance $\sigma^2$.
For example, students may be given data on the heights of a random sample of U.S. women and be asked to estimate the true mean height, or test whether it is equal to some specified value. If our data consist of a random sample $Y_1, \ldots, Y_n$ from the $N(\mu, \sigma^2)$ population, $\bar{Y} \sim N(\mu, \sigma^2/n)$, and a confidence interval is constructed analogously to (1), as

$$\bar{Y} \pm t_{n-1,\alpha/2}\, S/\sqrt{n}, \quad \text{where} \quad S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \tag{3}$$

is the sample variance. (This follows from observing that $T := (\bar{Y} - \mu)/(S/\sqrt{n})$ has a $t$ distribution with $n-1$ degrees of freedom, accounting for the replacement of $\sigma$ with $S$.)

[1] There is evidence that this proportion is larger than 0.5 in most of the world (see, e.g., Chao et al. (2019)).

To test $H_0: \mu = \mu_0$ for a pre-specified $\mu_0$, we can, analogously to (2), invoke the null. When $H_0$ holds, we know $\mu = \mu_0$ but still need to estimate $\sigma^2$. Since $\mu$ is known, the most efficient estimator of $\sigma^2$ is:

$$S_0^2 := \frac{1}{n} \sum_{i=1}^{n} (Y_i - \mu_0)^2.$$

Our test statistic would thus be:

$$T_0 := \frac{\bar{Y} - \mu_0}{S_0/\sqrt{n}}.$$

But, of course, people do not use this test statistic! Instead, they construct a statistic that ignores the information that $\mu = \mu_0$ provided by $H_0$, and perform the standard one-sample $t$-test using the test statistic

$$T = \frac{\bar{Y} - \mu_0}{S/\sqrt{n}}.$$

At first glance, one might suspect that using this test statistic would be less efficient than using $T_0$, since its denominator has $n-1$ degrees of freedom rather than $n$.

We are thus led to wonder why information provided by the null is discarded in constructing the one-sample $t$-test. In the remainder of the paper we clarify this question and present a more general perspective that we think will be of interest to colleagues who teach this material as well as those interested in the development and implications of some of our most fundamental statistical tools.
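As a concrete illustration of the two statistics defined in this section, the following minimal Python sketch computes $T$ and $T_0$ side by side. The function name `t_statistics` and the simulated heights are assumptions made purely for illustration; they are not taken from the paper.

```python
import math
import random


def t_statistics(y, mu0):
    """Return (T, T0) for testing H0: mu = mu0 from a list of observations y."""
    n = len(y)
    ybar = sum(y) / n
    # Usual sample variance: deviations from the sample mean, n - 1 degrees of freedom.
    s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    # Null-based variance estimate: deviations from mu0, n degrees of freedom.
    s02 = sum((yi - mu0) ** 2 for yi in y) / n
    t = (ybar - mu0) / math.sqrt(s2 / n)
    t0 = (ybar - mu0) / math.sqrt(s02 / n)
    return t, t0


# Hypothetical data: 12 simulated heights in cm (made-up numbers, not real measurements).
rng = random.Random(2022)
heights = [rng.gauss(163.0, 7.0) for _ in range(12)]
T, T0 = t_statistics(heights, mu0=160.0)
print(f"T = {T:.4f}, T0 = {T0:.4f}")
```

The two values share a sign because they share a numerator, and they differ only through the variance estimate used in the denominator.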
2 Establishing the connection

The connection between the two methods proposed at the end of the previous section can be established from an algebraic and from a geometric point of view. We look at these two approaches separately.

To begin, we note that any intuition that a test based on $T_0$ rather than $T$ could be more efficient is wrong: a tail-area test based on $T_0$ and one based on $T$ produce identical answers. This is because $T$ is a one-to-one, increasing function of $T_0$,

$$T = \frac{\sqrt{n-1}\, T_0}{\sqrt{n - T_0^2}}, \tag{4}$$

over the interval $(-\sqrt{n}, \sqrt{n})$, which is the set of possible values for $T_0$. Specifically, for any fixed $\alpha$, with $0 \le \alpha \le 1$, let $c_\alpha \ge 0$ be the critical value of the size $\alpha$ test based on $T_0$. The rejection region of this test is $R_{T_0} = \{ y = (y_1, \ldots, y_n)^\top : |T_0(y)| \ge c_\alpha \}$. Because the transformation in Equation (4) is monotonic increasing on $[0, \sqrt{n})$, the set $R_T = \{ y = (y_1, \ldots, y_n)^\top : |T(y)| \ge (\sqrt{n-1}\, c_\alpha)/\sqrt{n - c_\alpha^2} \}$ satisfies $R_T = R_{T_0}$. It follows that the test that rejects if and only if $|T(y)| \ge (\sqrt{n-1}\, c_\alpha)/\sqrt{n - c_\alpha^2}$ has the exact same rejection region (in sample space) as the test that rejects when $|T_0(y)| \ge c_\alpha$. The two tests must then have the same size and power function and are therefore equivalent.

As noted by a colleague, a simple way to establish Equation (4) is to recognize that the one sample $t$-test can be derived as a likelihood ratio test that rejects $H_0: \mu = \mu_0$ when the ratio

$$\lambda(Y) = \frac{\sup_{\sigma^2} L(\mu_0, \sigma^2 \mid Y)}{\sup_{\mu, \sigma^2} L(\mu, \sigma^2 \mid Y)}$$

is small or, equivalently, when the ratio of sums of squares under the null and full model,

$$R = \frac{\sum_{j=1}^{n} (Y_j - \mu_0)^2}{\sum_{j=1}^{n} (Y_j - \bar{Y})^2}, \tag{5}$$

is large.
This ratio can be expressed as

$$R = \frac{\sum_{j=1}^{n} (Y_j - \bar{Y})^2 + n(\bar{Y} - \mu_0)^2}{\sum_{j=1}^{n} (Y_j - \bar{Y})^2} = 1 + \frac{T^2}{n-1}$$

or as

$$R = \frac{\sum_{j=1}^{n} (Y_j - \mu_0)^2}{\sum_{j=1}^{n} (Y_j - \mu_0)^2 - n(\bar{Y} - \mu_0)^2} = \frac{1}{1 - T_0^2/n}.$$

The former expression leads to the standard $t$-test based on $T$, while the latter leads to the test based on $T_0$. Equating these two expressions yields the identity of Equation (4).

This relationship between $T$ and $T_0$ is, of course, not new: for example, it arises substantively in Lehmann's approach for demonstrating that the one sample $t$-test is a uniformly most powerful (UMP) unbiased test of $H_0: \mu = \mu_0$ vs. $H_A: \mu \ne \mu_0$ (Lehmann, 1986). The full details of the argument are best left to Lehmann, but, very briefly, for parameters in exponential family distributions, Lehmann's Theorem 1 in Chapter 5 gives a set of conditions about the form of a test statistic in relation to the family's sufficient statistics. When these conditions are satisfied, a test based on the test statistic is UMP unbiased. The set of conditions Lehmann provides is satisfied by $T_0$ rather than $T$, and the UMP unbiasedness of the $t$-test is then established by exhibiting that $T$ is a one-to-one function of $T_0$.

Interestingly, this equivalence does not seem to be widely known (at least based on our informal surveying of several colleagues). This is somewhat surprising. In fact, in addition to appearing in Lehmann's book, the algebraic equivalence of the test statistics is periodically mentioned in the literature (see, e.g., Lefante Jr and Shah (1986); Good (1986); Shah and Lefante Jr (1987); Shah and Krishnamoorthy (1993); LaMotte (1994)). However, we feel that the equivalence is worth revisiting, both in the context of the $t$-test and in the more general setting of nested linear models, where an analogous equivalence holds.
The geometric interpretation of the equivalence, not described in these earlier references, provides an interesting addition to the geometric interpretation of linear models. Moreover, despite the test statistics leading to identical conclusions in the linear models setting, one choice naturally leads a practitioner to consider so-called studentized residuals while the other leads to so-called standardized residuals, and these sets of residuals do have different properties and, when plotted, may lead to different visual interpretations. We expand on these remarks in subsequent sections.

3 The geometric point of view

Interestingly, the equivalence of $T_0$ and $T$ can be understood geometrically because they can both be viewed as trigonometric functions of the same angle, and it is possible to express any trigonometric function in terms of any other trigonometric function, up to sign. To see the geometric relationship, define the vectors $v = (Y_1 - \mu_0, Y_2 - \mu_0, \ldots, Y_n - \mu_0)^\top$ and $\mathbf{1} = (1, 1, \ldots, 1)^\top$. Then, the orthogonal projection of $v$ onto $\mathbf{1}$ is $u = (\bar{Y} - \mu_0)\mathbf{1}$, and the Pythagorean Theorem implies:

$$\|v\|^2 = \|u\|^2 + \|v - u\|^2,$$

i.e.,

$$\sum_{i=1}^{n} (Y_i - \mu_0)^2 = n(\bar{Y} - \mu_0)^2 + \sum_{i=1}^{n} (Y_i - \bar{Y})^2,$$

i.e.,

$$\text{SSTO} = \text{SST} + \text{SSE},$$

where we introduce analysis of variance terminology, with SSTO, SST, and SSE indicating the Sums of Squares for Total, Treatment, and Error, respectively. Thus, if we define $\theta$ to be the angle between $\mathbf{1}$ and $v$, then:

$$T_0^2 = n\, \frac{\text{SST}}{\text{SSTO}} = n \cos^2\theta \quad \text{and} \quad T^2 = (n-1)\, \frac{\text{SST}}{\text{SSE}} = (n-1) \cot^2\theta.$$

A stylized, two-dimensional representation of the essence of these geometric relationships is presented in Figure 1. Using basic trigonometric expressions it is easy to derive the stated algebraic relationship between $T$ and $T_0$. In fact,

$$T^2 = (n-1) \cot^2\theta = (n-1)\, \frac{\cos^2\theta}{\sin^2\theta} = (n-1)\, \frac{\cos^2\theta}{1 - \cos^2\theta}.$$
Substituting $\cos^2\theta = T_0^2/n$ into this expression and taking square roots on both sides (making sure the signs match, as they should) yields Equation (4).

[Figure 1: Geometric representation of the test statistics $T_0$ and $T$. The figure labels the vectors $v$, $u$, and $v - u$, the angle $\theta$, and the side lengths $a$, $b$, $c$, with $a = \|v\| = \sqrt{\text{SSTO}}$, $b = a\cos\theta = \|u\| = \sqrt{\text{SST}}$, $c = a\sin\theta = \|v - u\| = \sqrt{\text{SSE}}$, $T_0^2 = n\,(b^2/a^2) = n\cos^2\theta$, and $T^2 = (n-1)\,(b^2/c^2) = (n-1)\cot^2\theta$.]

4 Extension to linear models

The results presented in the previous sections are not specific to the $t$-test setting. In fact, constructing a test statistic by invoking the null hypothesis and constructing it in the “traditional” way produces equivalent test procedures across a range of linear models. This connection can be established by rewriting the two statistics as functions of different terms in the orthogonal decomposition of the sum of squares.

4.1 Nested models

For instance, consider the standard linear model $Y = X\beta + \varepsilon$, where $Y = (Y_1, \ldots, Y_n)^\top$ is a vector of observations, $X$ is an $n \times p$ design matrix of rank $p < n$, $\beta = (\beta_1, \ldots, \beta_p)^\top$ is a vector of regression parameters, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$ is an error vector with elements $\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)$. Suppose we wish to determine if a specific collection of $p_2$ covariates in $X$ does not significantly contribute to the prediction of $Y$ in the linear model. We can formulate this question as a testing problem in which the null hypothesis states that the $p_2$ regression coefficients for these covariates are all zero. Without loss of generality we can assume that the parameters of interest are the last $p_2 < p$ and rewrite the model as

$$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon,$$

where $X = [X_1 \mid X_2]$ and $\beta = (\beta_1^\top, \beta_2^\top)^\top$, with $\beta_i$ of dimension $p_i$ for $i = 1, 2$, and $p_1 + p_2 = p$. The testing problem concerning the nested model can then be stated as

$$H_0: \beta_2 = 0 \quad \text{vs.} \quad H_A: \beta_2 \ne 0.$$
Both the “traditional” and the “null hypothesis” testing procedures try to quantify the importance of the reduction in error sums of squares that ensues from entertaining the full model rather than the reduced model, but they differ in the comparison yardstick they use. The “traditional” procedure uses a yardstick based on the full model. The “null hypothesis” procedure uses a yardstick based on the reduced model with $\beta_2 = 0$.

Geometrically, the statistics arise from a sequence of projections. Specifically, define:

$$P_1 = X_1 (X_1^\top X_1)^{-1} X_1^\top, \quad Q_1 = I - P_1,$$

and

$$P_{12} = X (X^\top X)^{-1} X^\top, \quad Q_{12} = I - P_{12}.$$

The matrix $P_1$ performs an orthogonal projection onto the space spanned by the columns of the reduced design matrix $X_1$ and the matrix $P_{12}$ performs an orthogonal projection onto the space spanned by the columns of the full design matrix $X$. Under the reduced model, the vector of predicted values is $\hat{Y}_1 = P_1 Y$, the vector of residuals is $r_1 = Y - \hat{Y}_1 = Q_1 Y$, and the residual sum of squares is $\text{SSE}_1 = Y^\top Q_1^\top Q_1 Y = Y^\top Q_1 Y$. Similarly, under the full model, the vector of predicted values is $\hat{Y}_{12} = P_{12} Y$, the vector of residuals is $r = Y - \hat{Y}_{12} = Q_{12} Y$, and the residual sum of squares is $\text{SSE}_{12} = Y^\top Q_{12} Y$. The reduction in sums of squares ensuing from fitting the larger model is given by

$$\text{SS}_{2|1} = \text{SSE}_1 - \text{SSE}_{12} = Y^\top (Q_1 - Q_{12}) Y = Y^\top (P_{12} - P_1) Y.$$

The “traditional” procedure compares $\text{SS}_{2|1}$ to $\text{SSE}_{12}$, the error sum of squares for the full model, while the “null hypothesis” procedure compares $\text{SS}_{2|1}$ to $\text{SSE}_1 = \text{SS}_{2|1} + \text{SSE}_{12}$, the error sum of squares for the reduced model envisioned to hold under the null. After adjusting for the degrees of freedom of the various sums of squares, the resulting test statistics are

$$F_{\text{trad}} = \frac{\text{SS}_{2|1}/p_2}{\text{SSE}_{12}/(n-p)} \quad \text{and} \quad F_{\text{null}} = \frac{\text{SS}_{2|1}/p_2}{\text{SSE}_1/(n-p_1)} = \frac{\text{SS}_{2|1}/p_2}{(\text{SS}_{2|1} + \text{SSE}_{12})/(n-p_1)},$$

respectively.

[Figure 2: Geometric representation of the decomposition of the sums of squares for testing a nested hypothesis in the general linear model. The figure labels the vectors $r_1$, $r_1 - r$, and $r$, the angle $\theta$, and the side lengths $a$, $b$, $c$, with $a = \|r_1\| = \sqrt{\text{SSE}_1}$, $b = a\cos\theta = \|r_1 - r\| = \sqrt{\text{SS}_{2|1}}$, $c = a\sin\theta = \|r\| = \sqrt{\text{SSE}_{12}}$, $F_{\text{null}} = [(n-p_1)/p_2]\,(b^2/a^2) = [(n-p_1)/p_2]\cos^2\theta$, and $F_{\text{trad}} = [(n-p)/p_2]\,(b^2/c^2) = [(n-p)/p_2]\cot^2\theta$.]

4.2 Algebra, geometry, and distributional results

The orthogonal decomposition at play in this setting is analogous to the one presented in Section 2 and is described in a stylized, two-dimensional display in Figure 2, along with the relationships between its various elements. Algebraic and trigonometric manipulations similar to those outlined in Section 2 show that $F_{\text{trad}}$ is a one-to-one, increasing function of $F_{\text{null}}$ over $(0, (n-p_1)/p_2)$, the set of possible values for $F_{\text{null}}$:

$$F_{\text{trad}} = \frac{(n-p)\, F_{\text{null}}}{n - p_1 - p_2 F_{\text{null}}}. \tag{6}$$

Thus, as in the case of the $t$-test, tail-area tests using $F_{\text{trad}}$ and $F_{\text{null}}$ are identical. Note that, when $p = 1$, $p_1 = 0$, and $p_2 = 1$, the relationship between $F_{\text{trad}}$ and $F_{\text{null}}$ given in Equation (6) reduces to the relationship between $T^2$ and $T_0^2$ implied by Equation (4).

The implementation of either test procedure requires knowledge of the distribution of the corresponding test statistic under the null hypothesis. Using the notation introduced in Figure 2, standard distributional results imply that, under the null hypothesis,

$$b^2/\sigma^2 = \text{SS}_{2|1}/\sigma^2 \sim \chi^2_{p_2}, \quad c^2/\sigma^2 = \text{SSE}_{12}/\sigma^2 \sim \chi^2_{n-p},$$

with $b^2$ independent of $c^2$. Then,

$$F_{\text{trad}} = \frac{b^2/p_2}{c^2/(n-p)} \sim F_{p_2,\, n-p},$$

as it is the ratio of two independent chi-square random variables, each divided by its degrees of freedom.
Also,

$$\frac{p_2}{n-p_1}\, F_{\text{null}} = \frac{b^2}{b^2 + c^2} \sim \text{Beta}\!\left(\tfrac{1}{2} p_2,\ \tfrac{1}{2}(n-p)\right),$$

as it is the ratio between a chi-square random variable and the sum of that chi-square random variable and an independent chi-square random variable.

4.3 Does the difference ever matter?

While the test procedures based on $F_{\text{trad}}$ and $F_{\text{null}}$ produce identical inferences, the realized values of the test statistics are different. In this section we consider a situation in which, arguably, it is preferable to work with one of the two statistics rather than the other.

Residual plots are effective graphical devices for assessing the quality of the fit of a linear regression model and for detecting potential outliers. As noted in Section 9.4.1 of Weisberg (2014), a simple test for determining if observation $i$ is an outlier in a regression model that includes $p_1$ predictors is to include an additional predictor which is an indicator of the observation in question (i.e., a 0-1 vector whose only element equal to 1 is the $i$-th one) and to test if the regression coefficient of the indicator is equal to zero. Assuming normal errors for the regression model and letting $p_2 = 1$, it is natural to cast this problem into the framework of Section 4.1 and compare the full model with $p = p_1 + p_2$ predictors (the original predictors and the indicator of observation $i$) and the nested model that omits the indicator variable. Observation $i$ is declared an outlier if the null hypothesis that the coefficient of its indicator variable is zero is rejected.

The traditional statistic for this problem is $F_{\text{trad}}$, which has an $F_{1,\, n-p}$ distribution under the null. The square root of $F_{\text{trad}}$ (with sign matching the sign of the regression residual for observation $i$) is the usual $t$ statistic for outlier detection described by Weisberg (2014).
It is also a quantity known as the studentized residual for observation $i$, a normalized version of the raw residual, $\hat{e}_i$, computed using an estimate of the error variance, $\hat{\sigma}^2_{(i)}$, that omits observation $i$ from the calculation. Conceptually, this point of view is appealing because, if the null hypothesis were violated and observation $i$ were indeed an outlier, its inclusion in the calculation would inflate the estimate of the error variance. As stated in Weisberg (2014), the studentized residual can be expressed as

$$t_i = \frac{\hat{e}_i}{\hat{\sigma}_{(i)} \sqrt{1 - h_{ii}}},$$

where $h_{ii}$ denotes the leverage of observation $i$, given by the $i$-th diagonal element of the projection (or hat) matrix $P_1$ for the reduced model, that is, the model without the indicator variable. (In the full model, which contains the indicator, observation $i$ is fitted exactly, so the corresponding diagonal element of $P_{12}$ equals 1.)

On the other hand, as seen in Section 4.1, the same test could also be performed using the statistic $F_{\text{null}}$. The signed square root of $F_{\text{null}}$ turns out to be what is called the standardized residual for observation $i$, a normalized version of the raw residual, $\hat{e}_i$, computed using an estimate of the error variance, $\hat{\sigma}^2$, that uses all observations, including observation $i$. This would be the natural calculation to perform if one were to assume that the null hypothesis were true. As stated in Weisberg (2014), the standardized residual can be expressed as

$$r_i = \frac{\hat{e}_i}{\hat{\sigma} \sqrt{1 - h_{ii}}},$$

and the deterministic relationship between studentized and standardized residuals is given by

$$t_i = r_i \sqrt{\frac{n-p}{n-p+1-r_i^2}}.$$

This deterministic relationship mirrors, on the square root scale, the deterministic relationship between $F_{\text{trad}}$ and $F_{\text{null}}$. Ultimately, because of the deterministic relationships relating $F_{\text{trad}}$, $F_{\text{null}}$, and the two residual test statistics, an outlier test based on any of these four statistics leads to the same decision.

Residual plots are often used to conduct an exploratory assessment of the fit of the regression model.
In this type of analysis, the plots are scanned visually for the existence of identifiable patterns and idiosyncratic features that might reveal violations of the modeling assumptions. With regard to outlier detection specifically, plots of residuals vs. fitted values are inspected to reveal the presence of unusually large residuals. We argue that, owing to the nonlinearity of the transformation that relates standardized residuals to studentized residuals, a studentized residual plot is better suited than a standardized residual plot to achieve this goal.

We illustrate this point with an example based on a subset of the data on brain and body weights for 100 species of placental mammals reported in Sacher and Staffeldt (1974). Here, for the measurements on the 21 species of primates included in the data set, we consider the simple linear regression of the natural logarithm of brain weight on the natural logarithm of body weight. Standardized and studentized residual plots are presented in the top row of Figure 3. Two species stand out: Homo Sapiens (with large positive residuals) and Gorilla Gorilla (with large negative residuals). Both are flagged as outliers at the 0.05 level with respective p-values of 0.0034 and 0.0301 (unadjusted for multiplicity of comparisons).

The extent to which these two species outlie compared to the other 19 species is clearly different. As evidenced visually in both plots, the residual for Homo Sapiens is further removed from the bulk of the residuals than the residual for Gorilla Gorilla and this impression is more notably accentuated in the studentized residual plot. This is due to the nonlinear relationship between standardized and studentized residuals, which causes the difference in absolute size between the two to increase monotonically as the absolute size of the standardized residual increases from 1 toward its upper bound of $\sqrt{n-p+1}$, where the studentized residual diverges.
In particular, as shown in Figure 4, the size of such difference becomes very noticeable when the absolute value of the standardized residual exceeds a value of about 2.5. In our example, the absolute difference between studentized and standardized residuals is 0.6563 (very noticeable) for Homo Sapiens, 0.2394 (noticeable) for Gorilla Gorilla, and between 0.0011 and 0.0273 (hardly noticeable) for all other species. The displays in the bottom row of Figure 3, being based on $F_{\text{null}}$ and $F_{\text{trad}}$, which are the squared versions of the standardized and studentized residuals, emphasize even more the features just described.

In summary, the displays based on the studentized residuals and on $F_{\text{trad}}$ can focus the analyst's attention on the most extreme cases more effectively than those based on the standardized residuals and on $F_{\text{null}}$.

[Figure 3: Standardized and studentized residuals vs. fitted values for the primates data (top row) and their squared counterparts (bottom row). Panels: standardized residuals, studentized residuals, $F_{\text{null}}$, and $F_{\text{trad}}$, each plotted against fitted values, with the points for Gorilla Gorilla and Homo Sapiens labeled.]

[Figure 4: Differences between studentized and standardized residuals vs. standardized residuals for the primates data. The solid line traces the deterministic relationship linking the plotted quantities. The points for Gorilla Gorilla and Homo Sapiens are labeled.]
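The relationship between the two kinds of residuals discussed in this section can be reproduced with a short Python sketch. It assumes a simple linear regression (so $p_1 = 2$ and $p = 3$), uses the standard leverage formula for that model and the standard case-deletion identity for the error sum of squares, and runs on a small made-up data set with one aberrant point; the primates data are not reproduced here, and the function name `residual_diagnostics` is an illustrative choice.

```python
import math


def residual_diagnostics(x, y):
    """Standardized and studentized residuals for a simple linear regression of y on x."""
    n = len(x)
    p1 = 2  # intercept and slope of the model being diagnosed (no indicator)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in resid)
    sigma_hat = math.sqrt(sse / (n - p1))  # error scale estimated from all n observations
    out = []
    for xi, ei in zip(x, resid):
        h = 1 / n + (xi - xbar) ** 2 / sxx        # leverage in simple linear regression
        r = ei / (sigma_hat * math.sqrt(1 - h))   # standardized residual
        # Error sum of squares with this observation deleted: SSE - e^2 / (1 - h).
        sigma_del = math.sqrt((sse - ei ** 2 / (1 - h)) / (n - p1 - 1))
        t = ei / (sigma_del * math.sqrt(1 - h))   # studentized residual
        out.append((r, t))
    return out


# Hypothetical data (made up for illustration): a near-linear trend with one aberrant point.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0, 7.0, 8.1]
diagnostics = residual_diagnostics(x, y)
for i, (r, t) in enumerate(diagnostics):
    print(f"obs {i}: standardized = {r:7.3f}, studentized = {t:7.3f}")
```

In this sketch the squared columns play the roles of $F_{\text{null}}$ and $F_{\text{trad}}$ for the single-indicator outlier test, so the two columns order the observations identically while differing most in size for the aberrant point.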
5 The Role of the Null Hypothesis in the Construction of a Test Statistic

The fundamental question raised by the examples we presented in this article concerns the role that the null hypothesis should play in the testing paradigm. The null hypothesis is assumed true in order to assess statistical significance, but to what extent should one rely on it to construct the test statistic? When confronted with a new statistical model and a new parameter of interest, it can be something of an art to determine a good choice of test statistic. Three common “automatic” approaches for constructing test statistics from likelihoods privilege the null differently: score tests are typically built under the null; Wald tests are typically built under the alternative; and likelihood ratio tests compare the null and the alternative somewhat equally.

We consider first the case of an i.i.d. sample of size $n$ from $f(x \mid \theta)$, a distribution indexed by a single parameter, $\theta$, and rely on the results and examples presented in Casella and Berger (2002). We denote by $L(\theta \mid X) = f(X \mid \theta)$ the likelihood function.

The score is defined as $S(X \mid \theta) = \frac{d}{d\theta} \log f(X \mid \theta)$. It can be shown that, for all $\theta$, $\mathrm{E}\, S(X \mid \theta) = 0$ and $\mathrm{Var}\, S(X \mid \theta) = I_n(\theta)$, the expected Fisher information in the sample. The point null hypothesis $H_0: \theta = \theta_0$ is tested using the score test statistic $S(X \mid \theta_0)/\sqrt{I_n(\theta_0)}$, which has mean 0 and variance 1 for all $n$, and, under appropriate regularity conditions, converges in distribution under the null to a standard normal as $n$ goes to infinity, enabling the derivation of approximate cut-off values. Equivalently, the test can be based on the square of the score test statistic, which has an asymptotic $\chi^2_1$ distribution.
For $n$ independent Bernoulli($p$) observations yielding $y$ successes, $\hat{p} = y/n$ and the resulting score test statistic for testing $H_0: p = p_0$ is the one given in formula (2). Its squared version is therefore

$$U_S(y, n; p_0) = \frac{(\hat{p} - p_0)^2}{p_0(1 - p_0)/n}.$$

Suppose that, for all $\theta$, $W_n(X)$ is a consistent sequence of estimators of $\theta$, having standard error $S_n(X)$. The Wald statistic for testing $H_0: \theta = \theta_0$ is constructed as $(W_n(X) - \theta_0)/S_n(X)$ and, if asymptotic normality holds, approximate cut-off values can again be derived under the null based on the quantiles of a standard normal. If the square of the Wald statistic is used for testing, approximate cut-offs should be based on the quantiles of a $\chi^2_1$ distribution. Often $W_n(X)$ is taken to be the maximum likelihood estimator of $\theta$, with $S_n(X) = 1/\sqrt{I_n(W_n(X))}$. Upon observing $y$ successes out of $n$ independent Bernoulli($p$) trials, this recipe yields the statistic of formula (2), but with $p_0$ replaced by $\hat{p} = y/n$ in the denominator of that expression. The squared version of the statistic is therefore

$$U_W(y, n; p_0) = \frac{(\hat{p} - p_0)^2}{\hat{p}(1 - \hat{p})/n}.$$

The likelihood ratio test statistic for testing $H_0: \theta = \theta_0$ is defined as

$$\lambda(X) = \frac{L(\theta_0 \mid X)}{\sup_\theta L(\theta \mid X)}.$$

Assuming appropriate regularity conditions, $-2 \log \lambda(X)$ has an asymptotic $\chi^2_1$ distribution under the null that can be used to obtain approximate cut-offs for the test. For the case of $n$ independent Bernoulli($p$) observations, denoting by $y$ the total number of successes, the resulting likelihood ratio test will reject for large values of

$$U_L(y, n; p_0) = -2 \log \frac{p_0^{\,y} (1 - p_0)^{n-y}}{\hat{p}^{\,y} (1 - \hat{p})^{n-y}}.$$
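The three Bernoulli statistics just defined are simple enough to evaluate directly. The Python sketch below computes $U_S$, $U_W$, and $U_L$ for every count $0 < y < n$ and ranks the counts from most to least extreme under each statistic. The function and variable names are illustrative assumptions, and the values $n = 30$ and $p_0 = 1/3$ are chosen to match the setting of Figure 5.

```python
import math


def binomial_test_statistics(y, n, p0):
    """Squared score, squared Wald, and likelihood ratio statistics for H0: p = p0.

    Assumes 0 < y < n so that the Wald statistic and the logarithms are defined.
    """
    p_hat = y / n
    u_s = (p_hat - p0) ** 2 / (p0 * (1 - p0) / n)        # variance evaluated at p0
    u_w = (p_hat - p0) ** 2 / (p_hat * (1 - p_hat) / n)  # variance evaluated at p_hat
    log_lik_null = y * math.log(p0) + (n - y) * math.log(1 - p0)
    log_lik_mle = y * math.log(p_hat) + (n - y) * math.log(1 - p_hat)
    u_l = -2 * (log_lik_null - log_lik_mle)
    return u_s, u_w, u_l


n, p0 = 30, 1 / 3
stats = {y: binomial_test_statistics(y, n, p0) for y in range(1, n)}


def ranking(index):
    """Counts y ordered from most to least extreme under the chosen statistic."""
    return sorted(stats, key=lambda y: stats[y][index], reverse=True)


score_order, wald_order, lr_order = ranking(0), ranking(1), ranking(2)
print("score:", score_order[:6])
print("Wald: ", wald_order[:6])
print("LR:   ", lr_order[:6])
```

Comparing the printed orderings shows which counts each statistic treats as most extreme, and hence which counts would enter a rejection region first as the size of the test grows.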
Engle (1984) defines these three types of tests for the more general situation in which the parameter vector is multidimensional, including the case in which only a subset of the parameters are of inferential interest while the remaining ones are regarded as nuisance parameters. A detailed account of the insightful results presented there is beyond the scope of this article, but an important message is that, quite generally, the three types of tests will behave asymptotically similarly under the null and under local alternatives, although the asymptotic behavior for alternative values away from $\theta_0$ will typically differ.

For finite samples the three statistics may yield different tests. The reason for this is illustrated in Figure 5, which presents scatter plots of the squared score, $U_S$, and squared Wald, $U_W$, statistics against the log-likelihood statistic, $U_L$, and of the squared score statistic, $U_S$, against the squared Wald statistic, $U_W$, for $n = 30$ and $p_0 = 1/3$. While these statistics are, separately, related monotonically for $\hat{p} \le 1/3$ and $\hat{p} > 1/3$, the overall relationships are not monotonic. An examination of the rejection regions for these tests shows that the order in which the total number of successes enters the rejection region (as the size of the tests increases) differs among them. This is a situation in which the choice of which statistic to use matters.

As an example of a multidimensional situation including parameters of inferential interest and nuisance parameters, consider again the problem of testing a nested reduced model against the full model in the Gaussian linear model setting.
There, the likelihood ratio test rejects the null hypothesis that the reduced model holds when the ratio

$$\lambda(Y, X) = \frac{\sup_{\beta_1, \sigma^2} L(\beta_1, \sigma^2 \mid Y, X_1)}{\sup_{\beta, \sigma^2} L(\beta, \sigma^2 \mid Y, X)}$$

is small, or, equivalently, when the ratio $\text{SSE}_1/\text{SSE}_{12}$ of the error sum of squares under the reduced (null) model and the full model is large, ultimately leading to the equivalent tests based on $F_{\text{null}}$ (a multiple of the score statistic as defined in Engle (1984)) and $F_{\text{trad}}$ (a multiple of the Wald statistic as defined in Engle (1984)). This structure of the likelihood ratio test for nested models had already been noticed for the special case presented in Section 2, when discussing the derivation of the $t$-test in its two equivalent forms based on the ratio of Equation (5). Using the multiparameter definitions of the three types of test statistics, their deterministic functional relationships, and considering their asymptotic and finite sample distributions, Engle (1984) shows that the resulting tests are, in this case, equivalent both asymptotically and in finite samples.

6 Discussion

The idea of constructing a test statistic by pretending that the null hypothesis is true is routinely presented as a general guideline when using binomial data for testing the hypothesis that a population proportion is equal to a given value. Yet, this guideline is not followed, at least on the surface, when normal data are used to build the $t$-test for testing the hypothesis that the population mean is equal to a given value.
As we noted in the paper, the t-test is actually equivalent to a procedure based on a test statistic derived by following the guideline, but making the connection requires a little algebra, and is, to our knowledge, not typically made in introductory statistics classes, even at the graduate level. We have also noted that the same considerations presented for the t-test extend to the use of the F-test for testing hypotheses concerning nested linear models with Gaussian errors.

So, we are left to speculate why, in the case of the t-test and of the F-test, the “traditional” procedure is preferred to the “null hypothesis” procedure. If a formal comparison is required, there is no clear distributional advantage of one approach over the other. For the comparison of nested linear models, under the null, the “traditional” procedure requires calculation of the tail area of an F distribution and the “null hypothesis” procedure requires calculation of the tail area of a Beta distribution.

[Figure 5 appears here: pairwise scatter plots of $U_S(y, 30; 1/3)$, $U_W(y, 30; 1/3)$, and $U_L(y, 30; 1/3)$.]

Figure 5: Relationships between the squared score, $U_S$, squared Wald, $U_W$, and log-likelihood, $U_L$, test statistics for the case of independent Bernoulli data with $n = 30$ and $p_0 = 1/3$. The open plotting symbols correspond to values of $y$ such that $\hat{p} \le 1/3$. The solid plotting symbols correspond to values of $y$ such that $\hat{p} > 1/3$. The statistics are not plotted for $y = 0$ and $y = 30$ to avoid cases where the Wald statistic is undefined and the log-likelihood statistic is close to minus infinity.
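The finite-sample disagreement among the score, Wald, and likelihood ratio statistics displayed in Figure 5 can be checked numerically. The sketch below assumes the standard textbook forms of the three statistics for a binomial proportion, with the likelihood ratio statistic taken as $-2 \log \lambda$; the function names are hypothetical and the paper's exact definitions of $U_S$, $U_W$, and $U_L$ may differ.

```python
import math
from itertools import combinations


def score_stat(y, n, p0):
    """Squared score statistic: variance evaluated at the null value p0."""
    p_hat = y / n
    return (p_hat - p0) ** 2 / (p0 * (1 - p0) / n)


def wald_stat(y, n, p0):
    """Squared Wald statistic: variance evaluated at the estimate p_hat."""
    p_hat = y / n
    return (p_hat - p0) ** 2 / (p_hat * (1 - p_hat) / n)


def lr_stat(y, n, p0):
    """Likelihood ratio statistic, taken here as -2 log(lambda) (assumed form)."""
    p_hat = y / n
    return 2 * (y * math.log(p_hat / p0)
                + (n - y) * math.log((1 - p_hat) / (1 - p0)))


def is_monotone(xs, vs):
    """True if vs never decreases when the pairs are sorted by xs."""
    ordered = [v for _, v in sorted(zip(xs, vs))]
    return all(a <= b for a, b in zip(ordered, ordered[1:]))


n, p0 = 30, 1 / 3
ys = range(1, n)  # y = 0 and y = n are left out, as in Figure 5
lower = [y for y in ys if y / n <= p0]
upper = [y for y in ys if y / n > p0]

stats = {"score": score_stat, "Wald": wald_stat, "LR": lr_stat}
groups = {"p_hat <= p0": lower, "p_hat > p0": upper, "all y": list(ys)}

# Compare every pair of statistics within each branch and overall.
for label, group in groups.items():
    for (a, fa), (b, fb) in combinations(stats.items(), 2):
        xs = [fa(y, n, p0) for y in group]
        vs = [fb(y, n, p0) for y in group]
        print(f"{label}: {a} vs {b} monotone = {is_monotone(xs, vs)}")

# Order in which outcomes enter the rejection region as the test size grows.
score_order = sorted(ys, key=lambda y: -score_stat(y, n, p0))
wald_order = sorted(ys, key=lambda y: -wald_stat(y, n, p0))
print("score order:", score_order[:6])
print("Wald order: ", wald_order[:6])
```

With these assumed definitions, an outcome of 26 successes enters the score rejection region before an outcome of 1 success, while the Wald test reverses that order, which is one concrete way in which the choice of statistic matters.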
If a power calculation has to be performed under some alternative, it can be based on the non-central F-distribution for the traditional procedure and on the Type I non-central Beta distribution for the “null hypothesis” procedure, again with no clear advantage of one approach over the other. Similar considerations apply to the case of the t-test.

An appealing aspect of the “traditional” procedures is that the t-statistic $T$ and the F-statistic $F_{\text{trad}}$ are both constructed as ratios of independent quantities. Because, in both cases, the decision rule is based on an assessment of the relative size of the numerator and denominator, it is conceivable that independence may have been a key factor in establishing the tradition, as an informal comparison of independent quantities is easier. Under the null, the denominators of the “null hypothesis” test statistics are more efficient estimators of variability (have more degrees of freedom) than their “traditional” counterparts. However, this gain in efficiency is offset by the dependence between numerator and denominator (see LaMotte (1994) for a related discussion).

In addition to the basic guiding principles, other considerations may be at play when a certain tradition is established of preferring one form of a test procedure over another for a given problem. For the nested model comparison, we already noted one desirable feature exhibited by $F_{\text{trad}}$, namely that its numerator and denominator are independent. Another feature worth noting is that the denominator of $F_{\text{trad}}$ does not depend on the particular reduced model under consideration while the denominator of $F_{\text{null}}$ does. Although this is not much of a computational burden, it is intuitively appealing to be able to use the same yardstick in the denominator when testing different nested models against the same full model.
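The deterministic link between the two forms of the F statistic, and the origin of the F and Beta reference distributions, can be written out in a few lines. The notation below is assumed for illustration and is not taken from the earlier sections: $n$ observations, $p$ regression coefficients in the full model, $p_1 < p$ in the reduced model, $q = p - p_1$, and $F_{\text{null}}$ scaled by the residual degrees of freedom of the reduced model. The paper's own scaling of $F_{\text{null}}$ may differ by a constant.

```latex
\begin{align*}
F_{\text{trad}} &= \frac{(\mathrm{SSE}_1 - \mathrm{SSE}_{12})/q}{\mathrm{SSE}_{12}/(n-p)}, \qquad
F_{\text{null}} = \frac{(\mathrm{SSE}_1 - \mathrm{SSE}_{12})/q}{\mathrm{SSE}_1/(n-p_1)}, \\[4pt]
B &= \frac{\mathrm{SSE}_1 - \mathrm{SSE}_{12}}{\mathrm{SSE}_1}
   = \frac{q\,F_{\text{trad}}}{(n-p) + q\,F_{\text{trad}}}
   = \frac{q}{n-p_1}\,F_{\text{null}}, \\[4pt]
\text{under } H_0:\quad F_{\text{trad}} &\sim F_{q,\,n-p}, \qquad
B \sim \mathrm{Beta}\!\left(\tfrac{q}{2},\, \tfrac{n-p}{2}\right).
\end{align*}
```

Because $B$ is an increasing function of $F_{\text{trad}}$, rejecting for large values of either statistic gives the same test once each is referred to its own null distribution. The denominator of $F_{\text{trad}}$ involves only $\mathrm{SSE}_{12}$, whereas the denominator of $F_{\text{null}}$ changes with the reduced model through $\mathrm{SSE}_1$ and $p_1$.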
Further, the graphical example of Section 4.3 illustrates that when the value of the statistic itself is of interest, rather than the formal testing decision, there may be practical reasons for preferring the use of one statistic over the other.

In Section 5 we reviewed three popular methods for building test statistics (the score, Wald, and likelihood ratio methods), discussing the different emphasis that they place on the null and alternative hypotheses. For all cases examined in this paper, the three methods yield asymptotically equivalent procedures while emphasizing different features of the testing problem. As noted in Engle (1984), this is related to the different metrics used to evaluate discrepancy between the null and the alternative. The Wald test accounts directly for differences in the parameter values, the likelihood ratio test measures differences in the log-likelihoods, and the score test assesses how steep the slope of the log-likelihood is at the null value. While under very general conditions the three methods yield procedures that are asymptotically equivalent, we have noticed that the resulting finite sample tests may differ for independent Bernoulli data. Engle (1984) presents additional examples where finite-sample conclusions might differ, comments on the different insight that the various formulations might bring to bear for specific models, and suggests that potential computational considerations might induce the analyst to opt for one of the tests over the other two.

In sum, while we do not have a conclusive explanation as to why certain traditions have established themselves as the standard of practice for specific problems, we believe that these issues, often overlooked, are worth ruminating on, as they help us better see what considerations lead to the preference of one statistical procedure over another.
Choosing the right test statistic for a particular problem can be somewhat of an art, and understanding the similarities, differences, advantages, and disadvantages of the choice in the simple settings we considered may be helpful when turning to more complicated settings.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grants No. SES-1424481, No. DMS-1613110, and No. SES-1921523.

References

A. Agresti, B. A. Coull (1998). Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician, 52, no. 2, pp. 119–126.

G. Casella, R. Berger (2002). Statistical Inference. Duxbury-Thomson Learning, Second ed.

F. Chao, P. Gerland, A. R. Cook, L. Alkema (2019). Systematic assessment of the sex ratio at birth for all countries and estimation of national imbalances and regional reference levels. Proceedings of the National Academy of Sciences, 116, no. 19, pp. 9303–9311.

R. F. Engle (1984). Chapter 13 Wald, likelihood ratio, and Lagrange multiplier tests in econometrics. Elsevier, vol. 2 of Handbook of Econometrics, pp. 775–826.

I. Good (1986). Comments, conjectures, and conclusions: C258 editorial note on C257 regarding the t-test. Journal of Statistical Computation and Simulation, 25, no. 3-4, pp. 296–297.

L. R. LaMotte (1994). A note on the role of independence in t statistics constructed from linear statistics in regression models. The American Statistician, 48, no. 3, pp. 238–240.

J. J. Lefante Jr, A. K. Shah (1986). C257. A note on the one-sample t-test. Journal of Statistical Computation and Simulation, 25, no. 3-4, pp. 295–296.

E. L. Lehmann (1986). Testing Statistical Hypotheses. John Wiley & Sons.

D. S. Moore, G. P. McCabe, B. A. Craig (2012). Introduction to the Practice of Statistics. WH Freeman New York.

G. A. Sacher, E. F. Staffeldt (1974). Relation of gestation time to brain weight for placental mammals: implications for the theory of vertebrate growth. The American Naturalist, 108, no. 963, pp. 593–615.

A. K. Shah, K. Krishnamoorthy (1993). Testing means using hypothesis-dependent variance estimates. The American Statistician, 47, no. 2, pp. 115–117.

A. K. Shah, J. J. Lefante Jr (1987). C293. A note on using a hypothesis-dependent variance estimate. Journal of Statistical Computation and Simulation, 28, no. 4, pp. 347–349.

S. Weisberg (2014). Applied Linear Regression. Wiley Series in Probability and Statistics. Wiley. URL https://books.google.com/books?id=FHt-AwAAQBAJ.

S. Yang, K. Black (2019). Using the standard Wald confidence interval for a population proportion hypothesis test is a common mistake. Teaching Statistics, 41, no. 2, pp. 65–68.