The Third Way Of Probability & Statistics: Beyond Testing and Estimation To Importance, Relevance, and Skill

There is a third way of implementing probability models and practicing. This is to answer questions put in terms of observables. This eliminates frequentist hypothesis testing and Bayes factors and it also eliminates parameter estimation. The Third W…

Authors: William M. Briggs

Abstract. There is a third way of implementing probability models and practicing. This is to answer questions put in terms of observables. This eliminates frequentist hypothesis testing and Bayes factors and it also eliminates parameter estimation. The Third Way is the logical probability approach, which is to make statements $\Pr(Y \in y \mid X, D, M)$ about observables of interest Y taking values y, given probative data X, past observations (when present) D and some model (possibly deduced) M. Significance and the false idea that probability models show causality are no more, and in their place are importance and relevance. Models are built keeping only information that is relevant and important to a decision maker (and not a statistician). All models are stated in publicly verifiable fashion, as predictions. All models must undergo a verification process before any trust is put into them.

1. Introduction

Testing and estimation, whether in the frequentist or Bayesian implementations, form the backbone of classical analyses. This is unfortunate because the questions answered by testing and estimation are not of interest to anyone except statisticians. A typical question would be, “How is age related to the uncertainty in contracting cancer of the albondigas?” The tactic is to say whether the p-value in a parameterized probability model relating age and cancer was wee, or to say that a Bayes factor for the same parameter has a certain value, or to state an estimate of that unobservable parameter. These are obviously not answers to the question. Error is introduced, and on a breathtaking scale, by supposing the classical statements are the answer. This happens in almost all published instances of statistical analysis.
Disputes in statistics center around Bayesian versus frequentist methods. While there is a wide chasm between the interpretation of probability in these two philosophies, the practical differences are minor. Both methods devote themselves to parameters. Yet parameters are of no direct interest in answering questions about the uncertainty of observables (such as cancer of the albondigas). However, a doleful tradition has developed which says they are. If a p-value is wee, or a Bayes factor large, or a parameter estimate a certain value, uncertainty in the observable is thought to be adequately quantified. But this is not so: no matter how much certainty is present in parameters, the uncertainty in the observable must always be greater. Therefore relying on parameter-certainty as a proxy for observable-certainty is to be over-confident.

Some have noticed this discrepancy, such as Seymour Geisser, see [Geisser, 1993, Geisser, 1984], but knowledge of his work is limited, hampered, no doubt, by lacking a well-articulated replacement for the classical methods. The Third Way proposed below is (it is hoped) such a replacement. To maintain consistency, the interpretation of probability as logic is adopted. This is expounded in, among others, [Jaynes, 2003, Cox, 1961, Keynes, 2004, de Laplace, 1996] as well as philosophically in [Campbell and Franklin, 2004, Franklin, 2001a, Franklin, 2001b]. Those familiar with objective Bayes will recognize many mathematical parallels, e.g. [Williamson, 2010] (although this author does not cross over into full logical probability). The only piece of philosophy necessary to grasp the basics of logical probability is to recognize that all probability is conditional. That is, there is no such thing as unconditional probability of any proposition. [1]

Footnote 1: The skeptical reader is invited to try to discover the probability of a proposition that relies on no evidence.

The gist of the Third Way is simple: answer the questions put to us about the uncertainty in observables. Eliminate all talk of parameters, all notion of testing, and all manner of estimation. If somebody asks, “How is age related to the uncertainty in contracting cancer of the albondigas?”, we tell them.

2. The Third Way

Let past observables be labeled $D = (Y, X)_{\text{old}}$, where Y is the observable in which we want to quantify or explain our uncertainty, and X are the premises or observables assumed probative of Y (the dimensions of each will be obvious in context). Let the premises which lead to a probability model (if one is present) be labeled M. And let $X = X_{\text{new}}$ be the premises or assumed values of new observables. The goal of all probability modeling is this:

$$\Pr(Y \in y \mid X, D, M), \tag{1}$$

where y are values of the observable Y which are of interest to some decision maker. Models should be rare, because most probability is not quantifiable, and we must resist the temptation to force quantification by making up scientific-sounding numbers. But even if we do, (1) can be calculated, as long as we supply the premises which led to our creations. [2]

Although it is obvious, the equation reads, “The probability Y takes the values y given the premises or assumptions X, the past data D, and the model M.” If the model is parameterized and Bayesian philosophy is adopted, (1) is the posterior predictive distribution, and M incorporates those premises or assumptions from which the priors are deduced (see e.g. [Bernardo and Smith, 2000]). If a frequentist philosophy is adopted, there are many difficulties and inconsistencies in interpretation, but I will not discuss them here; the meaning of the equation is plain enough.
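For the parameterized Bayesian case, the way (1) arises can be written out explicitly. The symbol $\theta$ for the model's parameters and the notation $p(\theta \mid D, M)$ for their posterior are introduced here only for illustration and are not part of the original text; the expression also assumes, as is usual, that new values of Y do not depend on D once $\theta$ is given.

```latex
\Pr(Y \in y \mid X, D, M)
  = \int \Pr(Y \in y \mid X, \theta, M)\, p(\theta \mid D, M)\, d\theta
```

The parameters appear only as a variable of integration, so the left-hand side is a statement about observables and stated assumptions alone.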
The key is that no parameters are explicit in (1); the uncertainty in them has been “integrated out.” Only observables and plain assumptions remain. Logical probability would supply premises from which the model M is deduced (there would be no parameters thus no priors).

Equation (1) eliminates, or rather combines, the efforts of testing and estimation into one form. The focus is entirely on observables and the assumptions made and their effect on the uncertainty of not-yet-seen or unknown values of Y. Not-yet-seen values of Y are those unknown or assumed unknown; usually they are as yet unmeasured, e.g. in the future.

A simple example is a die roll in which M = “This is a six-sided object with labels one through six and which when tossed must show only one side.” The model is deduced based on these premises. There is no X probative beyond M and D can be absent or can be a record of previous flips (i.e. X and D are null or are assumed not probative). An application of (1) is $\Pr(Y = 6 \mid X, D, M) = 1/6$. Because the model was deduced, no parameters were ever present.

It is unfortunately rare that models are deduced; most are posited ad hoc. That is, M is usually “I’m using regression”, i.e. an act of will. Model deduction can be accomplished if the measurements of observables are properly accounted for. See Briggs [Briggs, 2016] for details. Since that problem takes us too far afield, I’ll stick with arbitrary or capricious models.

Footnote 2: Readers are asked to express their agreement with this proposition on a scale of -1.74 to $e^{\pi}$.

Suppose we are interested in Y = “First-year college grade point average” of students. Observations X thought probative are the high school grade point average and SAT score. The model M will be ordinary regression. The goal is to produce statements like this:

$$\Pr(Y > 3.8 \mid X_h = 3.5, X_s = 1160, D, M), \tag{2}$$

where the subscript “h” is for high school GPA, and “s” is for SAT. The D are past observations. Since regression uses continuous normal probability, we unfortunately cannot ask about observables like “Y = 4.0” and must restrict our attention to intervals.

At any rate, the main questions of interest are only two: (1) how do changing values of $X_h$ and $X_s$ change the uncertainty in Y in the presence of D, M, and y, and (2) of what value or descriptive power is M? M includes $X_h$ and $X_s$ as components.

There is no notion of “significance” in either of these questions, and no notion of correctness in any instance where the model was not deduced. All probability is conditional, and that means the probabilities given in equations like (2) are correct. If, at some later point, it is decided that $X_s$ is of no interest and the model is updated to reflect this, then the probabilities derived from the new model are also correct. One model may be superior to another, however, but only with respect to decisions made conditional on the models.

There is more work to be done by model builders in the Third Way because values of y must also be chosen, and so must values of X. This implies a model may be useful in some decision contexts and of no use or even harmful in others. There is and should be no default or automatic levels of usefulness. The last thing the field of probability and statistics needs is another magic number a la “significance” with p-values.

The importance of probative observables and assumptions X is thus also a matter of decision and cannot be made automatically. This is plain from equation (1), which encapsulates all we know about Y given our assumptions. It is we who made these assumptions, and we who can change them. Again, the only ultimately true model of Y is that which is deduced from true premises, as in the die example.
Every other model is therefore only useful or not conditional on the premises we assume. Since most models are ad hoc, as regression always is, we can only speak of usefulness. Deduced models are true by definition and thus nothing more need be done with them except make predictions. Deduced models do not even need to be verified.

The idea in the Third Way is, conditional on D and M, to vary X in the range of expected, decisionable, or important values to some decision maker and see how these change the probability of $Y \in y$. If a particular X, as it ranges along the values we choose, does not change the probabilities of $Y \in y$ in any important way, then that X is itself not important. The opposite is also true. Importance is a matter of decision, which varies by decision maker. Importance is not a probability or statistical concept and therefore cannot be ascertained within probability models.

If the probability of $Y \in y$ changes in any way as X does, then that X is relevant to understanding Y, else it is not. Relevance is a probabilistic concept, but as the reader will see, it is almost always present given the assumption that the X is causally related to the Y. If X is known not to be causally related to Y, then X is irrelevant by definition.

There is no hypothesis testing in the frequentist or Bayesian sense (as implied by Bayes factors, for instance). And there is no estimation of parameters. There are only plain, understandable, and verifiable probability statements. These probability statements can and should and must be verified. The Third Way allows communication of model goodness and usefulness in an intelligible, actionable manner. It reduces over-certainty but cannot eliminate it unless models are deduced.

3. Example

There are 100 observations of first-year college students’ grade point averages.
We want to quantify the uncertainty in the GPAs of new students given these observations, and also given information thought probative, in this case high school GPAs and SAT scores. [3] We assume an ad hoc ordinary regression model. If we adopt the Bayesian philosophy, we need priors, and here an assumption of “flat” priors will do. As is well known, in ordinary regression this assumption matches the answers given by frequentist philosophy. But it doesn’t matter. Any premises that give different priors will do. Our purpose here is not (directly) investigating priors, but the uncertainty inherent in Y given the assumptions we make.

Footnote 3: The data is available at http://wmbriggs.com/public/sat.txt.

It turns out

$$\Pr(Y > 3.8 \mid X_h = 3.5, X_s = 1160, D, M_{h,s}) = 0.038. \tag{3}$$

Notice that the dependence of the model on the assumptions has been annotated, as it always should be. If, given D, we insist on M and on the presence of $X_h$ and $X_s$, then this is the final and true answer. Nothing more need be done. The values picked for y, $X_h$, and $X_s$ are those I, and perhaps nobody else, thought important. A different decision maker might pick different values.

But suppose I am interested in the relevance of $X_h$. Its presence is an assumption, a premise, one that I felt important to make. There are several things that can be done. The first is to remove it. That leads to

$$\Pr(Y > 3.8 \mid X_s = 1160, D, M_s) = 0.0075. \tag{4}$$

Notice first that both (3) and (4) are correct. The probability in (3) is about 5 times larger than in (4). This is a measure of relevance and importance, given y = 3.8 and $X_s = 1160$. Importance and relevance, like probability itself, are always conditional on our assumptions.

A second measure of importance is the change in probabilities when $X_h$ is varied. That can be seen in Figure 1.
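As a rough illustration of how numbers like (3), (4), and a relevance curve could be computed, here is a minimal Python sketch. It assumes the standard noninformative prior $p(\beta, \sigma^2) \propto 1/\sigma^2$ as one reading of “flat” priors, under which the posterior predictive distribution of ordinary regression is Student-t. The data are synthetic stand-ins generated inside the script (they are not the sat.txt observations, so the outputs will not match the values quoted above), and the helper names `solve`, `t_cdf`, and `prob_above` are hypothetical.

```python
import math
import random

def solve(A, b):
    """Solve the linear system A v = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * q for a, q in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def t_cdf(t, df, steps=2000):
    """Student-t CDF via Simpson's rule on the density (steps must be even)."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    c /= math.sqrt(df * math.pi)
    pdf = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3

def prob_above(rows, y_old, x_new, y_cut):
    """Pr(Y > y_cut | x_new, D, M); parameters are integrated out, not reported."""
    n, p = len(rows), len(x_new)
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * y for r, y in zip(rows, y_old)) for i in range(p)]
    beta = solve(XtX, Xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    s2 = sum((y - f) ** 2 for y, f in zip(y_old, fitted)) / (n - p)
    leverage = sum(a * q for a, q in zip(x_new, solve(XtX, x_new)))
    loc = sum(b * v for b, v in zip(beta, x_new))
    scale = math.sqrt(s2 * (1.0 + leverage))
    return 1.0 - t_cdf((y_cut - loc) / scale, n - p)

# Synthetic stand-in data (an assumption of this sketch, not the paper's data).
random.seed(1)
n_obs = 100
hgpa = [random.uniform(1.0, 4.0) for _ in range(n_obs)]
sat = [random.uniform(400.0, 1600.0) for _ in range(n_obs)]
cgpa = [0.3 + 0.4 * h + 0.001 * s + random.gauss(0.0, 0.4)
        for h, s in zip(hgpa, sat)]

rows_hs = [[1.0, h, s] for h, s in zip(hgpa, sat)]  # model M_{h,s}
rows_s = [[1.0, s] for s in sat]                    # model M_s

p_hs = prob_above(rows_hs, cgpa, [1.0, 3.5, 1160.0], 3.8)  # analogue of (3)
p_s = prob_above(rows_s, cgpa, [1.0, 1160.0], 3.8)         # analogue of (4)
# Relevance sweep: vary X_h from 1 to 4 in steps of 0.1 with X_s held at 1160.
sweep = [prob_above(rows_hs, cgpa, [1.0, 1.0 + 0.1 * k, 1160.0], 3.8)
         for k in range(31)]
print(p_hs, p_s, min(sweep), max(sweep))
```

Plotting `sweep` against the high school GPA values, with `p_s` drawn as a horizontal line, would give a relevance plot of the kind described in the text. Whether the resulting change matters is left to the decision maker.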
In Figure 1 there is a change from about 0 to 8% over the range of high school GPAs. If high school GPA was not probative of Y > 3.8 given these premises then the graph would be flat, indicating no change. In other words, it would resemble the dashed line, which is (4), the model without high school GPA. Is this “departure” from flatness important? There is no single answer to this question. That entirely depends on the uses to which this model is put. If a decision would be made differently given these varying values of the probability, then high school GPA is important, otherwise it is not. The answer is not a matter of probability or statistics. That the line is not flat is proof, however, that, given M, y, and $X_s$, knowledge of high school GPA is relevant to knowledge of Y.

It cannot be emphasized too strongly that importance and relevance are conditional, just as probability is. A linear function of high school GPA added to a regression model already supplied with high school GPA would be irrelevant. It might be that high school GPA is relevant or important at some levels of SAT, and irrelevant and unimportant at others, and the same is true for the goodness measures of SAT. Information X is not isolated and is related to all the assumptions we made, and these include the other X in the model.

Now add the information $X_w$ = “hours studied a week” to the model and create a relevance plot for it. The (conditional, as always) probability of Y > 3.8 varies little, from 0.038 to about 0.044, a change of only 0.006. Time studying is relevant because the probability differs from the straight line, which is the probability using the model sans time studying. But is it important?
Given the values of high school GPA, SAT, and y, the change from 0.038 to 0.044 is small and therefore many decision makers might conclude that adding time studying provides no benefit to our understanding. Of course, such a small change might be useful to somebody. It is not the statistician’s job to decide, but the decision maker’s.

Figure 1. $\Pr(Y > 3.8 \mid X_h = x, X_s = 1160, D, M_{h,s})$ allowing $X_h$ to vary from 1 to 4 in increments of 0.1. The notation on the figure conditions on E, which is shorthand for all the evidence we have. The dashed line is $\Pr(Y > 3.8 \mid X_s = 1160, D, M_s)$. (Axes: high school GPA against Pr(Y > 3.8 | E).)

Figure 2. $\Pr(Y > 3.8 \mid X_h = 3.5, X_s = 1160, X_w = x, D, M_{h,s,w})$ allowing $X_w$ to vary from 0 to 26. The notation on the figure conditions on E, which is shorthand for all the evidence we have. The dashed line is $\Pr(Y > 3.8 \mid X_h = 3.5, X_s = 1160, D, M_{h,s})$. (Axes: hours spent studying against Pr(Y > 3.8 | E).)

I cheated, because “time studying” isn’t that at all. In fact, it is made-up numbers using R’s rnorm() and abs() functions. It is therefore not surprising that the pertinent probability should vary little. That it does at all is only because of coincidence. By “coincidence” I do not mean “randomness” or “chance”, because I know the determinative cause of these numbers, but happenstance, a known lack of causal connection between the made-up numbers and the observable of interest. We should add information presumed probative into a model only if we have a plausible belief that the information is related (somehow) to the cause of the observable of interest. If this connection is lacking, the information should not be added.
Thus “time studying” should certainly be removed. Understanding causality in probability models is a difficult subject beyond the scope of this paper, except to emphasize that probability models cannot deduce cause (though because of hypothesis testing just about everybody thinks they can; see Briggs [Briggs, 2015] for more details).

The Third Way strategy is to create scenarios that are of direct interest to a decision maker, the person or persons who will use the model. Plots like those above can be made at the values of the probative observables in which the decision maker is interested. There is no one set of right or proper values, except in the trivial sense of excluding values that are, given exterior information, known to be impossible. For instance, given our knowledge of grade points, the value $X_h = -17$ is impossible.

Assessing relevance and importance for large models will not be easy. But who insisted it should be? That classical statistical procedures now make analysis so simple is part of the problem we’re trying to correct.

Table 1. The predictions for $\Pr(Y < 2 \mid X_h = h, X_s = s, D, M)$ and $\Pr(Y > 3 \mid X_h = h, X_s = s, D, M)$ for common values of h and s, and two points of y thought to be of interest. Every X in the model appears in the Table.

        |     Pr(Y < 2 | h, s, D, M)     |      Pr(Y > 3 | h, s, D, M)
h \ s   |  400    800    1200    1600    |  400       800       1200     1600
--------+--------------------------------+-------------------------------------
0.5     |  0.99   0.94   0.760   0.4600  |  3.9e-05   0.00065   0.0093   0.071
1.0     |  0.98   0.89   0.640   0.3400  |  1.4e-04   0.00180   0.0200   0.120
1.5     |  0.95   0.81   0.510   0.2200  |  5.0e-04   0.00500   0.0420   0.190
2.0     |  0.90   0.70   0.370   0.1300  |  1.7e-03   0.01300   0.0820   0.290
2.5     |  0.83   0.57   0.250   0.0730  |  5.0e-03   0.03100   0.1500   0.420
3.0     |  0.73   0.43   0.160   0.0360  |  1.4e-02   0.06600   0.2400   0.550
3.5     |  0.61   0.30   0.089   0.0170  |  3.3e-02   0.13000   0.3700   0.680
4.0     |  0.48   0.20   0.048   0.0074  |  6.9e-02   0.22000   0.5000   0.790
Given the model and old observations, every set of $X_h$ and $X_s$, at some y, produces a prediction. For instance, for future students with $X_h = 3.5$ and $X_s = 1200$ the (conditional) probability that Y > 3.8 is 0.045. In regression, incidentally, we do not have to restrict ourselves to a fixed y, because the model will produce a prediction of every possible value of Y. These predictions can and must be verified. An example of such a report is given in Table 1.

Considerable art and thinking will have to go into presenting predictions from a model loaded with probative X. This may seem like a drawback, but in fact it is a boon. Far too many models are crammed with extraneous “controlling variables” whose usefulness is scarcely ever considered. This approach forces such consideration and encourages leanness; which is to say, models without fat. Lean models are not necessarily small.

The verification strategy is this. A model is built using importance and relevance, as above. It is then released into the wild, as it were, to wait for new data to arise. Every new data point will have a value of $(X_h, X_s, Y)$. These are fed into the model and the prediction for the probability of Y is given (with, say, y = Y). These prediction-observable pairs are then evaluated in relation to a proper score and possibly also with respect to a simpler version of the model, in a move to assess (what is called) model skill. This is called the verification process. It is what happens naturally in, say, engineering and physics, where models are forced to meet reality. See, inter alia, [Gneiting and Raftery, 2007, Murphy, 1991, Murphy and Winkler, 1987, Briggs and Zaretzki, 2008] for details on verification, an incredibly broad and rich subject. All that can be done here is to give a hint about the best tools.
At bottom, the best verification is to assess how well the model performed in the decisions made with it.

Probability leakage is also discoverable [Briggs, 2013]. This is when the model gives non-zero probability to values of Y which are known to be impossible given external evidence, evidence which is usually ignored in the model-building process (regression, for instance, is horribly over-used). For instance, we learn that at this college a GPA of 4 is the maximum. Yet in our model with $X_h = 4$ and $X_s = 1400$ the probability that Y > 4 is 0.105. This is substantial leakage and a guarantee of model weakness.

We can also compare our touted model with a simpler model, which is perhaps a standing competitor or is otherwise natural to consider. In the example, time spent studying was revealed to be faked. But suppose I were to give the data to a statistician and not tell him of the fraud. Time studying sounds plausibly causally related to grade point. The checks of relevance and importance do not excite. Thus it is reasonable to say two models are in contention, $M_{h,s,w}$ and $M_{h,s}$. If, considering whatever proper score we are using, $M_{h,s,w}$ cannot “beat” $M_{h,s}$, $M_{h,s,w}$ is said not to have skill in relation to $M_{h,s}$. This technique is widely used in meteorology (see the references above). In regression, the so-called null model (only with an “intercept”) is always available as a comparator.

A popular (but by no means perfect) proper score is the continuous ranked probability score (CRPS), which is the integrated squared distance between the cumulative distribution function of each prediction and the staircase step-function empirical cumulative distribution of each observable [Gneiting and Raftery, 2007]. Smaller is better. The mean or summed CRPS is presented over multiple observations, but it can also be examined for each observation.
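The two checks just described, CRPS and probability leakage, can be sketched in a few lines of Python. As a simplifying assumption the predictive distribution is taken to be normal with a given mean and standard deviation (the regression predictive in the example would strictly be Student-t), and the closed form used for the normal CRPS is the one given in [Gneiting and Raftery, 2007]. The function names and all numbers below are hypothetical illustrations, not values from the paper.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal, used for the closed form

def crps_normal(mu, sigma, y):
    """Closed-form CRPS of a Normal(mu, sigma) prediction at observed y."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * STD.cdf(z) - 1.0)
                    + 2.0 * STD.pdf(z)
                    - 1.0 / math.sqrt(math.pi))

def crps_numeric(mu, sigma, y, steps=20000):
    """Same score from its definition: integral of (F(x) - step(x >= y))^2."""
    dist = NormalDist(mu, sigma)
    lo = min(mu, y) - 10.0 * sigma
    hi = max(mu, y) + 10.0 * sigma
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h  # midpoint rule
        step = 1.0 if x >= y else 0.0
        total += (dist.cdf(x) - step) ** 2 * h
    return total

def leakage_above(mu, sigma, upper):
    """Predictive probability assigned to values above a known maximum."""
    return 1.0 - NormalDist(mu, sigma).cdf(upper)

# Hypothetical predictive (mean, sd) pairs and the outcomes later observed.
predictions = [(3.0, 0.5), (2.4, 0.4), (3.6, 0.5)]
observed = [3.4, 2.5, 3.1]

scores = [crps_normal(m, s, y) for (m, s), y in zip(predictions, observed)]
mean_crps = sum(scores) / len(scores)

# Leakage: probability placed above a maximum GPA of 4.0 by one prediction.
leak = leakage_above(3.6, 0.5, 4.0)
print(scores, mean_crps, leak)
```

The numeric version is included only to show that the closed form agrees with the definition as an integrated squared distance between the predictive CDF and the step function at the observation.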
Skill is a relative measure of improvement in score, usually $(\text{score}_{\text{partial}} - \text{score}_{\text{full}}) / \text{score}_{\text{partial}}$ when smaller scores are better, comparing a full or larger model with a partial, less complex, or otherwise natural comparator. A perfect skill is 1; values less than 0 are not skillful. Models without skill should not be used. Skill, as is obvious, depends on the score.

The mean CRPS for the “full” model with “time studying” is 0.0734, while for the “partial” model without it is 0.0749. The skill score is 0.02. This shows the full model is superior, but only just barely.

Here is a per-observation analysis of CRPS and skill. The first graph in Fig. 3 shows the CRPS calculated for each observation for the full and partial models; a one-to-one line is over-plotted. The addition of “time studying” does not lead to uniform improvement. The next graph shows histograms of CRPS scores; the partial model is the dashed line. This is useful for a decision maker deciding how valuable either model is. Again, the best score is that which is related directly to the decisions made with the models. Whether CRPS is this score is situation dependent. The last graph shows the skill score for each observation of college GPA. A mixed bag, as far as “time studying” goes. If we didn’t already know, we’d suspect that “time studying” is not adding much to, and is even subtracting from, our knowledge of college GPA.

Figure 3. Top left: the CRPS calculated for each observation for the full and partial models; a one-to-one line is over-plotted. Top right: the histograms of CRPS scores; the partial model is the dashed line. Bottom left: the skill score for each observation. Bottom right: the skill plotted for each observation of college grade point. Skill is had for values greater than 0.

Figure 4. Top left: the skill plots for each observation of high school GPA. Top right: skill for each observation of SAT. Bottom left: skill for each observation of “time studying”.

Fig. 4 shows the skill plots over each X in the model for each observation. Skill is had for values greater than 0. The view that “time studying” is nearly useless is confirmed. The last graph is particularly revealing. As “time studying” moves to either extreme, the skill bifurcates, showing a process that “can’t make up its mind.” If “time studying” were truly valuable, the signal would be coherent.
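The skill calculation can be made concrete with a short Python sketch. It is written on the assumption that smaller scores are better and a perfect score is 0, so that a perfect skill is 1 and a full model that scores worse than its comparator gets a negative skill; under that convention the two mean CRPS values reported in the text give the stated 0.02. The function name `skill` and the per-observation pairs are hypothetical.

```python
def skill(score_full, score_partial):
    """Relative improvement of the full model over the partial model.

    Assumes smaller scores are better and a perfect score is 0, so a
    perfect full model has skill 1 and a worse one has skill below 0.
    """
    return (score_partial - score_full) / score_partial

# Mean CRPS values reported in the text for the full and partial models.
overall = skill(0.0734, 0.0749)

# Hypothetical per-observation CRPS pairs (full, partial), illustration only.
pairs = [(0.05, 0.06), (0.12, 0.10), (0.20, 0.20)]
per_obs = [skill(full, partial) for full, partial in pairs]
n_skillful = sum(1 for s in per_obs if s > 0)

print(round(overall, 2), per_obs, n_skillful)
```

Looking at `per_obs` rather than only `overall` mirrors the per-observation plots: a small positive mean skill can hide many observations where the larger model does worse.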
At this point, conferring with the decision maker, the statistician might drop “time studying” and compare his new “full” model consisting only of high school GPA and SAT with (perhaps) a “partial” model consisting only of an “intercept”.

The verification process can be done on the already-observed data as an initial check on model goodness. The analogy is standard residual analysis. The same weaknesses apply, however, because the temptation to tweak the model to produce better verification measures will prove irresistible. Over-fitting and over-confidence will result, as always, but with an important twist. Since the model will be known and published in its predictive form, outsiders do not have to trust the in-sample verification. They can wait for new data and apply verification on them themselves.

4. Discussion

Once importance or relevance are known, it is a mistake to say that X is linked to, or is associated with, or predicts Y, or, worse, some variant of “When X equals x, Y equals y”. These are versions of a colossal misunderstanding, which is to say X causes Y. It is true that X determines the uncertainty we have in Y, but determines is analogical; it has an ontological and an epistemological sense. Probability is only concerned with the latter usage. The only function probability has is to say how our assumptions X determine epistemologically the uncertainty we have in Y. If we knew X was a cause of Y, we would have no need of probability.

Importance and relevance are replacements for testing and estimation, but not painless ones. The recipient of an analysis is asked to do much more work than is usual in statistics. However, this is the more honest approach. The benefit is that equation (1) answers questions our customers ask us in the form they expect. The probabilities are in plain English and painless to interpret. Everything is stated in terms of observables.
Everything is verifiable. The conditions on which the model relies are made explicit, made bare for all to see and to agree or disagree with. Gone is the idea that there is one “best” model which researchers have somehow discovered and which gives unambiguous results. Gone also is the belief that the statistical analysis has proved a causal relationship. The model is made plain so that all can use it for themselves to verify predictions made with it. Everybody will be able to see for themselves just how useful the model really is.

References

[Bernardo and Smith, 2000] Bernardo, J. M. and Smith, A. F. M. (2000). Bayesian Theory. Wiley, New York.
[Briggs, 2013] Briggs, W. M. (2013). On probability leakage. arxiv.org/abs/1201.3611.
[Briggs, 2015] Briggs, W. M. (2015). The crisis of evidence: Why probability & statistics cannot discover cause. arxiv.org/abs/1507.07244.
[Briggs, 2016] Briggs, W. M. (2016). Logical Probability and Statistics. Awaiting a friendly publisher, New York.
[Briggs and Zaretzki, 2008] Briggs, W. M. and Zaretzki, R. A. (2008). The skill plot: a graphical technique for evaluating continuous diagnostic tests. Biometrics, 64:250–263. (with discussion).
[Campbell and Franklin, 2004] Campbell, S. and Franklin, J. (2004). Randomness and induction. Synthese, 138:79–99.
[Cox, 1961] Cox, R. T. (1961). Algebra of Probable Inference. Johns Hopkins University Press, Baltimore.
[de Laplace, 1996] de Laplace, M. (1996). A Philosophical Essay on Probabilities. Dover, Mineola, NY.
[Franklin, 2001a] Franklin, J. (2001a). Resurrecting logical probability. Erkenntnis, 55:277–305.
[Franklin, 2001b] Franklin, J. (2001b). Theory of Probability: evidence and probability before Pascal. Johns Hopkins, Baltimore.
[Geisser, 1984] Geisser, S. (1984). On prior distributions for binary trials. The American Statistician, 38(4):244–247.
[Geisser, 1993] Geisser, S. (1993). Predictive Inference: An Introduction. Chapman & Hall, New York.
[Gneiting and Raftery, 2007] Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. JASA, 102:359–378.
[Jaynes, 2003] Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press, Cambridge.
[Keynes, 2004] Keynes, J. M. (2004). A Treatise on Probability. Dover Phoenix Editions, Mineola, NY.
[Murphy, 1991] Murphy, A. H. (1991). Forecast verification: its complexity and dimensionality. Monthly Weather Review, 119:1590–1601.
[Murphy and Winkler, 1987] Murphy, A. H. and Winkler, R. L. (1987). A general framework for forecast verification. Monthly Weather Review, 115:1330–1338.
[Williamson, 2010] Williamson, J. (2010). In Defense of Objective Bayesianism. Oxford University Press, Oxford.

E-mail address: matt@wmbriggs.com
URL: wmbriggs.com
