Confidence as Forecast: A Decision-Theoretic Interpretation of Confidence Intervals



Scott Lee
National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention

Abstract

What, if anything, should a frequentist say about a single realized confidence interval (CI) and its chance of having covered the parameter? Jerzy Neyman's original answer was to refuse any nondegenerate probability for coverage ex post and, instead, to "state that the interval covers." In this paper I argue that the usual frequentist machinery already supports a different reading. I treat the coverage event as a Bernoulli random variable, with the nominal level 1 − α as its design-based success probability, and view "confidence" as a probability forecast for that Bernoulli outcome. Using strictly proper scoring rules, I show that 1 − α is the unique optimal constant forecast for coverage, both before and after observing the data, and that it remains optimal post-trial in common unbounded, translation-invariant models with pivot-based CIs. When the design yields a θ-free statistic—such as the relative width of the interval in a finite-window uniform model—the conditional coverage given that statistic provides a nonconstant, design-based refinement of 1 − α that strictly improves predictive performance. Two thought experiments, a Monty Hall–style shell game and the "lost submarine" example of Morey et al. (2016), illustrate how this perspective resolves familiar interpretational puzzles about CIs without appealing to priors or single-case subjective degrees of belief. I conclude with simple "what to do when you see an interval" guidance for applied work and some implications for teaching confidence intervals as tools for forecasting long-run coverage.
Keywords: Confidence intervals, coverage probability, proper scoring rules, probabilistic forecasting, frequentist inference

Disclaimer: The findings and conclusions in this report are those of the author and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

1 Introduction

1.1 Background

If you were being scored on predicting whether a single confidence interval (CI) has covered its target parameter, what number would you report? Jerzy Neyman, the inventor of the CI procedure, suggested proceeding in two steps [15]: first, refusing to assign any probability to coverage itself, since, on the assumption that θ is a fixed constant and not a random variable, coverage becomes fully determined once an interval has been constructed; and second, stating that the constructed interval covers the parameter. Although the latter suggestion is not typically thought of as a forecast per se for individual coverage events, it is essentially just that—by stating that the intervals always cover, we are issuing a constant forecast for P(Cover) at 1, even if we choose not to interpret the forecast subjectively, e.g., as our personal degree of belief in whether coverage occurred. The forecast has a natural frequentist interpretation in that, under repeated sampling, it will be wrong no more than 100α% of the time, a property Neyman sensibly appeals to as a strength of CI theory as a means for controlling the practicing statistician's error rate in making such statements.

Two other critical facts are, of course, also true. First, confidence procedures (CPs) can be constructed from data that carry no information at all about θ, and ex ante coverage probability can appear to change substantially ex post after an interval has been constructed [1, 19], leading to the standard argument that the only way to think about CIs coherently is in terms of their long-run coverage properties.
Second, and perhaps more importantly, for any given interval I, its coverage probability is degenerate in {0, 1}, conditioned on the realized values of its endpoints, as Neyman's point above makes clear. This is the line we most often hear about how to interpret a given CI ex post (after construction): that the interval either does or does not cover the parameter, and that we can make no probabilistic statement about its coverage as a result. The "it either covers, or it doesn't" statement has been a source of confusion, consternation, and frustration for beginning statistics students for a long time, likely since it was first made, and has led to a number of not-quite-satisfying claims about how to interpret CIs in the applied literature. Introductory papers and applied practitioners occasionally claim that constructed intervals retain their nominal coverage probability [9, 8], critiques in response claim that they retain none [13, 4, 7, 17], and still others seem so befuddled by it all that they suggest what intervals are trying to estimate are the interval endpoints themselves (rather than θ) [14]. More often than not, the ensuing debates, which can be rather spirited [12, 18, 10], end with trenches being drawn along philosophical lines, with frequentists defending claims that are sometimes, on their face, rather awkward (e.g., retreating to a von Mises-style infinite hypothetical frequentism without acknowledging that his program, taken seriously, undermines the foundational pillars of Kolmogorov-style mathematical statistics [5]), and Bayesians wondering why the frequentists will not just switch to Bayesianism, put priors over θ, and construct credible intervals instead. In the sections below, I hope to offer a principled frequentist reading of the concept of "confidence" and its associated intervals that resolves some of this confusion.
More precisely, the notion I build toward recasts coverage probability as having three layers, rather than a single one: the first being the event-level degenerate conditional in {0, 1} determining whether an interval covers; the second being the design-level coverage guarantee of 1 − α that averages those conditionals over the randomness in X; and the third, or the notion of "confidence", being a predictive probability, or model-based forecast, of empirical coverage with whatever information the statistician may have at hand. Under this view, we have clear bounds on what we can say about a particular interval ex post (e.g., we might predict that it covers θ with probability 1 − α), and, in some cases, we also have mathematical justification for updating our forecast in light of new evidence, for example, when we see the "trivial" interval [−∞, ∞], where we would very sensibly switch our prediction to 1, since coverage is certain. Separating our probabilistic forecasts from the design-level coverage guarantee generally washes away criticisms of CIs as being uninterpretable (with respect to the coverage event and not the actual value of θ), and, as I show below, it generally respects the frequentist machinery Neyman used to build his theory (even if he might have preferred not to see it that way).

1.2 Paper overview

The remainder of the paper is organized as follows. Section 2 presents a thought experiment showing that design-level coverage probability can in fact be used to guide decision-making based on constructed intervals ex post. Section 3 introduces the relevant standard formalisms for frequentist statistical inference, and it formalizes the notion of "confidence" as a predictive probability, or model-based estimate, of interval coverage that applies both ex ante and ex post.
Section 4 shows how this predictive notion might apply to CI-based inference by revisiting a commonly cited thought experiment designed to disprove the notion that confidence has any meaning ex post, and Section 5 concludes with a recap of the main results from earlier sections; a discussion of pedagogical strategies for teaching both CI theory and its applications; and some suggested potential directions for future research.

2 A thought experiment: "Monty's Hell"

2.1 The setup

An uncommonly truthful street performer is running an interesting variation of a shell game he calls "Monty's Hell". You're pretty good with probability, and you also like to gamble, so you ask him how to play. Here are his instructions:

"Take a look at these 3 plastic cups. Under each one, I have placed a handwritten note indicating a range of dollar amounts—for example, $10 to $20. On a separate sheet of paper that you can't see, I've written down a single dollar amount that falls under one, and only one, of the ranges hidden by the cups. I will shuffle the cups for a while and then let you pick one to overturn. I'll then remove one of the cups you didn't pick from the table and give you the opportunity to switch your original choice. After you make your final choice, if the range on the paper beneath your cup contains the hidden amount on my other sheet, I'll pay you the hidden amount; if not, you'll pay me half of the hidden amount. The game costs $5 to play, and the hidden amount is at least $10—are you in?"

Five dollars seems like a small amount to pay for the chance to win some extra cash, and you like that you'll only pay half of the hidden amount if you lose, so you agree to play a single game. The performer shuffles the cups, and you choose one that you think is good. Without turning the cup over, what is your expected payout?
Since you know nothing about either the ranges under the cups or the hidden value that one of them contains, you effectively will be choosing at random. With an initial buy-in of $5, your expected payout M will be

E[M] = P(S)v − (1 − P(S)) · (1/2)v − 5
     = (1/3)v − (2/3) · (1/2)v − 5
     = (1/3)v − (1/3)v − 5
     = −5,   (1)

where v is the dollar amount of the hidden prize and S is the event that your cup contains the winning range. The expected payout would be the same for any value of the hidden amount, but you appreciate that because it's at least $10, you'll have a 1/3 chance of doing no worse than doubling your original buy-in of $5.

True to his promise, the performer removes one of the two cups you didn't choose, saying that it did not cover the piece of paper with the winning range. Assuming he did not lie, what is your expected payout now?

None of the cups has been overturned, and so you still have no information about the range in which the winning value might lie. You remember the Monty Hall paradox, though, and note that even though a losing cup has been removed from the table, and now either your cup or the remaining cup must contain the winning range, there's still only a 1/3 probability that your cup will win; the remaining 2/3 probability of winning now belongs to the single remaining cup. If you stay with your original choice, the expected payout is the same as it was at the beginning of the game: a $5 loss. If you switch to the remaining cup, however, the expected payout M_switch becomes

E[M_switch] = (1 − P(W))v − P(W) · (1/2)v − 5
            = (2/3)v − (1/3) · (1/2)v − 5
            = (2/3 − 1/6)v − 5
            = 0.5v − 5,   (2)

where W is the event that your original cup wins. You know the minimum value of v is $10, and so your expected payout if you switch is at least $0—in other words, in the long run, if you switch, you will at least break even. Because of this, you switch.

Finally, the street performer lets you turn over your new cup to see its hidden range, and he turns over your original cup so you can see its range, as well.
You see that the range under your cup is $30 to $50, and the range under your original cup was $10 to $29. Has your expected payout changed?

Somewhat counterintuitively, you realize the expected payout is actually still the same, even though you've now seen both of the remaining hidden ranges. By the rules of the game, you now know that the hidden value must lie somewhere between $10 and $50—the performer removed one of the losing cups, effectively pinning the prize amount to the union of the intervals in the ones remaining—but you still do not know whether your cup is the winning one, and so the expected payout, which depends only on your probability of choosing correctly and the associated rewards and penalties, has not changed—given that you switched, it is still no less than $0.

The street performer reveals the hidden amount to be $50, which was included in the range under your chosen cup. Congratulations—you've multiplied your original investment of $5 by 9 and now have $45. A good return by any account!

2.2 The parallels with CIs

This thought experiment was designed as a direct analog to CIs, and there are two main parallels to note. First, as with CIs capturing θ, each cup's winning is decided solely by whether its interval captures the hidden dollar amount. Second, the winning value, like θ, is a fixed-but-unknown constant, with no probability distribution being placed over its value. These two facts fix whether a cup wins or loses, but, as with CIs, we have no way of knowing which. The uncomfortable dissonance here is that from both Bayesian and frequentist simulation-based analyses of the original Monty Hall problem under its standard setup, we know that switching is, indeed, the optimal strategy. Does Neyman's interpretation also lead to success?
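Before turning to that question, the payout analysis in (1) and (2) is easy to check by simulation. The sketch below is my own illustration, not part of the paper's argument; it fixes the hidden amount at a hypothetical v = $50 (the derivation holds for any v, so the choice is arbitrary) and compares the stay and switch strategies over many rounds.

```python
import random

def play(switch: bool, v: float = 50.0, buy_in: float = 5.0) -> float:
    """One round of 'Monty's Hell' with a hypothetical prize value v."""
    cups = [0, 1, 2]
    winner = random.choice(cups)   # cup whose hidden range covers the prize
    choice = random.choice(cups)   # your initial pick
    # The performer removes a non-chosen, non-winning cup.
    removed = next(c for c in cups if c != choice and c != winner)
    if switch:
        choice = next(c for c in cups if c != choice and c != removed)
    return (v if choice == winner else -v / 2) - buy_in

random.seed(1)
n = 200_000
stay = sum(play(False) for _ in range(n)) / n
swap = sum(play(True) for _ in range(n)) / n
print(f"stay:   {stay:+.2f}")   # near -5, matching (1)
print(f"switch: {swap:+.2f}")   # near 0.5 * 50 - 5 = +20, matching (2)
```

The simulated averages land close to the analytic values: staying loses the buy-in on average, while switching earns roughly 0.5v − 5.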
The answer in this case is an emphatic "no": if we follow either of the options he gives us for interpreting CIs after choosing our first cup, we are guaranteed to lose money. The degenerate conditional in {0, 1} would freeze us in our tracks, since we would have no basis for weighing whether to switch, and so we would decide to stay; and stating "this interval covers" would do the same, since we would then have a clear reason not to switch, as we are treating our first cup as the one that wins. In both cases, the long run would keep the chances of winning at 1/3, and we would lose an average of $5 per game, rather than taking advantage of the 2/3 chance offered by the switch. In other words, the two Neyman-style moves—refusing to assign a nondegenerate probability to coverage, or simply declaring that our original cup "covers"—are not only philosophically awkward; they are always outperformed by the strategy that treats the design-level success probability as a forecast.

3 Neyman through the looking-glass

Neyman's interpretation looks at realized CIs from one point of view, but, as we saw informally above, the underlying model offers us another one. At the level of the model, these correspond to two ways of viewing the coverage variable: as a {0, 1}-valued random variable attached to a specific interval, and as its nondegenerate mean 1 − α under the sampling distribution. Both are features of the same probability model. They differ only in the σ-fields with respect to which they are evaluated, not in whether they are "pre-data" or "post-data" in any temporal sense (this particular distinction is more about the choice of reference class than about what the model does or does not allow [6]).
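These two views can be made concrete with a small simulation. The sketch below is my own illustration (the choice of a 95% z-interval for a normal mean with known variance is an assumption made purely for simplicity): each realized coverage indicator is degenerate at 0 or 1, while the indicators average to 1 − α across repeated samples.

```python
import random, statistics

random.seed(0)
theta, sigma, n, z = 2.0, 1.0, 10, 1.96   # hypothetical true mean; known sd

def coverage_indicator() -> int:
    """Z = 1{theta in I(X)} for one 95% z-interval for a normal mean."""
    x = [random.gauss(theta, sigma) for _ in range(n)]
    xbar = statistics.fmean(x)
    half = z * sigma / n ** 0.5
    return int(xbar - half <= theta <= xbar + half)

zs = [coverage_indicator() for _ in range(100_000)]
print(sorted(set(zs)))     # [0, 1]: conditioned on the data, each Z is degenerate
print(sum(zs) / len(zs))   # near 0.95: the design-level mean E[Z] = 1 - alpha
```

Both facts are visible in the same run: every individual indicator is 0 or 1, and their long-run mean is the design-level coverage.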
Below, I begin by revisiting how these two layers of probability are connected by the model governing coverage, and I then formalize the third layer I suggested at the start of the paper: "confidence" as a predictive probability for empirical coverage, conditional on whatever information we happen to have on hand, with proper scoring rules providing guidance on what numbers to report, and when.

3.1 What the model gives us

We start with a standard frequentist model (Ω, F, {P_θ : θ ∈ Θ}), where θ ∈ Θ is a fixed but unknown parameter. The data X is a random element X : (Ω, F, P_θ) → (X, A) with distribution P_θ^X. A (two-sided) 1 − α confidence interval (CI) procedure for θ is a measurable map I : X → I, x ↦ I(x) = [L(x), U(x)], such that, for every fixed θ ∈ Θ,

P_θ(θ ∈ I(X)) = P_θ(L(X) ≤ θ ≤ U(X)) = 1 − α.   (3)

Next, we define the corresponding coverage indicator

Z(X) := 1{θ ∈ I(X)} = 1{L(X) ≤ θ ≤ U(X)},

which is a {0, 1}-valued random variable on (Ω, F, P_θ). In terms of Z, condition (3) is equivalent to

E_θ[Z(X)] = 1 − α,  that is,  Z(X) ∼ Bernoulli(1 − α).   (4)

Under this model, there are two natural ways of viewing the coverage variable Z. At the design level, we work with its unconditional law under P_θ, so that for a single use of the procedure,

P_θ(Z(X) = 1) = 1 − α

is a property of the pair (P_θ^X, I) and does not depend on any particular realized dataset. At the same time, Z can be viewed conditional on the data X. For a fixed θ and any realization x ∈ X such that X(ω) = x, the interval I(X(ω)) = I(x) is determined, and so is the indicator Z(X(ω)) = 1{θ ∈ I(x)} ∈ {0, 1}. In other words, conditioning on the σ-field σ(X) yields the degenerate conditional expectation

E_θ[Z(X) | X] = Z(X) ∈ {0, 1}  a.s.   (5)

The two layers are linked by the usual tower property.
Combining (4) and (5),

E_θ[Z(X)] = E_θ[E_θ[Z(X) | X]] = 1 − α.   (6)

Thus the nondegenerate coverage probability 1 − α lives at the design level, as the unconditional mean of the coverage indicator, while the degenerate {0, 1} values arise upon conditioning on the data X. Both are internal to the same model: they are simply Z viewed with respect to different σ-fields. In particular, they apply to all intervals generated by the procedure, regardless of whether we have already observed the corresponding data or not.

3.2 Forecasting with confidence

3.3.1 Minimizing risk pre-trial

Let I(X) = [L(X), U(X)] be a (1 − α) confidence interval for a fixed parameter value θ, and recall the coverage indicator Z_θ := 1{θ ∈ I(X)}. By construction, P_θ(θ ∈ I(X)) = E_θ[Z_θ] = 1 − α, so Z_θ is a Bernoulli random variable with success probability 1 − α under P_θ. Before observing any data, imagine we would like to issue a probability forecast q ∈ [0, 1] for the event "the interval will cover θ," i.e., for Z_θ = 1. Let S(q, z) be any strictly proper scoring rule for Bernoulli forecasts (for example, the Brier or log score), interpreted as a loss function in the usual sense that a lower score is better [3]. From our setup, the pre-trial expected score is

R_θ(q) := E_θ[S(q, Z_θ)].

Strict propriety of S implies that R_θ(q) is uniquely minimized at

q* = P_θ(Z_θ = 1) = 1 − α.   (7)

At the design level, then, the confidence level 1 − α is exactly the probability forecast for coverage that minimizes expected loss under any strictly proper scoring rule. In this sense, 1 − α is a model-implied predictive probability of coverage for any single use of the procedure. By extension, any alternative constant forecast, such as "the interval covers" (i.e.,
q ≡ 1), has strictly larger expected loss whenever 0 < 1 − α < 1. (Neyman might not have intended his interpretation to be used this way, but for all intents and purposes, "the interval covers" is a constant forecast about coverage, whether we as statisticians are meant to truly "believe" the statement or not.)

3.3.2 Minimizing risk post-trial

After observing X = x, we obtain the realized interval I(x) and realized indicator Z_θ(x) := 1{θ ∈ I(x)}. Suppose now that we choose to condition on some information G derived from the design and the data (for example, the fact that the interval was produced by this procedure, or possibly some coarser features of I(X) such as its length), but not on σ(X) itself. A data-dependent forecast is then a random variable q(X) that is G-measurable. Its conditional expected score is

E_θ[S(q(X), Z_θ) | G].

By strict propriety of S, this conditional risk is minimized when

q*(X) = E_θ[Z_θ | G] = P_θ(θ ∈ I(X) | G).   (8)

In the baseline case where the procedure is such that, for the chosen G,

E_θ[Z_θ | G] = 1 − α  almost surely,   (9)

the risk-minimizing forecast remains q*(X) = 1 − α even after observing I(X): the specific realized interval carries no further information about coverage beyond the design. In that sense, the ex-ante and (design-level) ex-post predictive probabilities coincide. By contrast, if certain features collected in G are known under the design to be associated with different conditional coverage, then P_θ(θ ∈ I(X) | G) ≠ 1 − α, and the optimal predictive probability q*(X) should be adjusted accordingly. At the other extreme, if one takes G = σ(X), the conditional expectation collapses to E_θ[Z_θ | X] = Z_θ ∈ {0, 1}, recovering the oracle's degenerate view discussed in Section 3.1.
This "forecast" is perfect for an oracle that knows whether the interval covers, but a non-omniscient statistician who insists on the slogan "either it covers or it does not" is in a different position: they know only that the true coverage indicator takes values in {0, 1}, but they still must name a single real number q ∈ [0, 1] when scored. Any operational rule Q that implements this stance and only ever reports 0 or 1, without access to Z_θ itself, has strictly larger expected loss than the design-based forecast q ≡ p = 1 − α under any strictly proper Bernoulli scoring rule. Formally, if Z_θ ∼ Bernoulli(p) with p = 1 − α and S is strictly proper, then for every {0, 1}-valued forecast rule Q,

E_θ[S(Q, Z_θ)] > E_θ[S(p, Z_θ)],  0 < p < 1,

because E_θ[S(q, Z_θ)] is uniquely minimized at q = p and p ∉ {0, 1}. In this sense, the vague "forecast" {0, 1} is strictly dominated, in the usual scoring-rule sense, by the sharp design-level forecast 1 − α.

3.3.3 Design-based refinement via θ-free conditional coverage

Now imagine that we have a constructed interval in hand and are wondering whether to update our pre-trial forecast from 1 − α to something else, and, if so, to what. Since θ is fixed and unknown, all we have to work with is what is in the data. Let T(X) be a statistic derived from the data (e.g., interval length, whether the interval hits a known boundary, or other design-specific features), and let G := σ(T(X)) be the σ-algebra it generates. A data-dependent probability forecast is a random variable q(X) that is G-measurable. As before, we take S(q, z) to be a strictly proper scoring rule for Bernoulli forecasts, interpreted as a loss.

Theorem 3.1 (Design-based optimal forecast with a θ-free statistic). Assume there exists a measurable function g : range(T) → [0, 1] such that, for all θ ∈ Θ,

P_θ(Z_θ = 1 | G) = g(T(X))  almost surely.   (10)

Then the forecast rule q*(X) := g(T(X)) uniquely minimizes the conditional expected score

E_θ[S(q(X), Z_θ) | G]

for every θ ∈ Θ and almost every realization of T(X). In particular, for all θ ∈ Θ and all G-measurable q(X),

E_θ[S(q*(X), Z_θ)] ≤ E_θ[S(q(X), Z_θ)],   (11)

with equality for a given θ only if q(X) = q*(X) almost surely under P_θ.

Here, condition (10) is a design-level statement: the conditional coverage given T(X) is the same function of T for all θ. This is a very strong assumption, and in most real-world scenarios it will not apply. Nonetheless, when it does, q*(X) = g(T(X)) will be a uniformly risk-minimizing predictive probability for coverage across the entire parameter space, giving us a data-dependent way to update our forecast based, again, purely on the design (no prior over θ is required, because the forecast minimizes risk for every θ) and in a way justified entirely by risk-minimization under repeated sampling.

3.3.4 Defaulting to the confidence level

From before, we saw that for a fixed θ ∈ Θ, the conditionally optimal forecast given G is always

q*_θ(X) = E_θ[Z_θ | G] = P_θ(Z_θ = 1 | G).

Naturally, though, if there exists no measurable g such that P_θ(Z_θ = 1 | G) = g(T(X)) for all θ ∈ Θ, then there is no single G-measurable rule q(X) that is conditionally optimal for every θ. Any nonconstant q(X) based on T(X) will reduce the expected score for some θ and increase it for others, and choosing among such rules then necessarily requires some additional structure (e.g., a prior on θ) that Neyman's machinery does not provide. By contrast, going back to our view of the confidence level as a predictive probability, among constant forecasts q(X) ≡ q, strict propriety implies that, for each fixed θ, E_θ[S(q, Z_θ)] is uniquely minimized at

q = P_θ(Z_θ = 1) = 1 − α.
So, again, the nominal level 1 − α is the unique constant forecast that minimizes expected loss for every θ under any strictly proper scoring rule. Moreover, because P_θ(Z_θ = 1) = 1 − α for all θ, the same conclusion holds after averaging over any prior π on Θ:

∫ P_θ(Z_θ = 1) π(dθ) = 1 − α  for all priors π.

This integral holds because 1 − α does not depend on θ, and so integrating a constant gives a constant. In this sense, when no θ-free conditional coverage refinement exists, the design-level confidence 1 − α is the natural default predictive probability of coverage for a realized interval: it respects the defining calibration property for all θ and is Bayes-optimal among constant forecasts under every prior on Θ (including, again, the constant forecast implicit in always stating that the interval covers).

3.4 Framework summary

To recap, if we are interested in predicting whether a particular interval covers the parameter, 1 − α is the best we can do ex ante. Ex post, if we have a θ-free statistic that lets us break down the constant 1 − α forecast into finer-grained conditional probabilities that still average to 1 − α, then using those finer pieces strictly improves our prediction under any strictly proper score, unless the finer-grained conditionals are actually all the same, in which case we are effectively back at 1 − α. Whether we have such a θ-free statistic will depend on the underlying model, as I outline briefly below.

3.4.1 Unbounded location–scale models

In standard unbounded, translation-invariant models with pivot-based CIs, interval coverage depends on a pivot, while interval width is ancillary (see the Supplemental Methods for how this works in the case of estimating a normal mean with a t interval).
In such designs, neither the absolute location of the interval nor any θ-free function of its geometry carries additional design-based information about coverage, and the optimal pre- and post-trial forecast for 1{θ ∈ I(X)} under any strictly proper scoring rule is simply the nominal level 1 − α. Many, if not most, real-world estimation problems will fall under this category.

3.4.2 Finite-window designs

In designs where the data are supported on a finite window around θ (as in the uniform "submarine" model I explore in Section 4), the relative width of the interval,

T(X) = length(I(X)) / (length of the window),

is a θ-free statistic whose conditional coverage P_θ(θ ∈ I(X) | T) typically varies with T. In such cases, the design-based optimal forecast is q*(X) = P(θ ∈ I(X) | T(X)), which strictly improves on the constant forecast 1 − α under any strictly proper score whenever the conditional coverage function is nonconstant. These kinds of designs are occasionally used as instructive examples to show that adhering to the design-level coverage probability ex post can lead to apparently incoherent statements (e.g., when a 50% CI ends up covering only, say, 10% of the data support).

3.4.3 Bounded parameter spaces

When θ is known to lie in a compact interval [a, b], the normalized endpoints

L̃ = (L(X) − a) / (b − a),  Ũ = (U(X) − a) / (b − a)

describe the interval's width and position relative to the parameter space itself. In simple symmetric designs, we can again consider the design-based conditional coverage P_θ(θ ∈ I(X) | L̃, Ũ) as a post-trial forecast. A full general theory is beyond the scope of this paper, but these normalized endpoints provide a natural starting point for simulation-based assessment of θ-free refinements.

3.4.4 A rough guide for forecasting

With the foregoing in mind, we can formulate a rough step-by-step process for deciding how to make ex-post forecasts.
Given a confidence procedure and a statistic T(X) derived from the interval, the following rule of thumb applies:

1. Check whether T(X) is θ-free for coverage, in the sense that P_θ(θ ∈ I(X) | T) has the same functional form for all θ;
2. If so, and if this conditional coverage varies with T, use the design-based forecast q*(X) = P(θ ∈ I(X) | T(X)); and
3. If no such θ-free refinement exists, default to the nominal level q(X) ≡ 1 − α.

In the first case mentioned above (standard unbounded, translation-invariant designs), our θ-free statistics carry no more information about coverage than the design itself, and so whether we choose to condition on them or not, our ex-post forecast holds steady at 1 − α. In the other two cases, they may carry extra information and, as we will see in the example below, may help us improve the quality of our predictions. In these situations, i.e., when we have reason to believe T(X) is likely θ-free under our design, we could then obtain usable plug-in conditional coverage probabilities by simulating datasets under a convenient θ or range of θs, tabulating coverage as a function of T, and then using the empirical coverage estimates as a kind of look-up table for ex post forecasts.

4 Returning to the lost submarine

In their paper rebutting what they call the "Fundamental Confidence Fallacy" (FCF), Morey et al. [13] present a simple thought experiment to show the perils of attempting to interpret confidence intervals ex post. In short, the setup involves a surface search ship looking for a lost research sub on the ocean floor. The surface ship needs to drop a line to the sub's hatch in order to rescue its crew, so they use the pattern of bubbles coming off of the sub's hull to estimate the sub's position relative to their own.
The hatch is exactly halfway down the sub's 10-meter length, and the bubbles come off of the sub uniformly at random and in pairs. The authors present a number of confidence procedures for estimating the position of the hatch, but for clarity, I will limit the analysis below to two of them.

4.1 Setting things up

Let θ ∈ R denote the (unknown) horizontal location of the sub relative to the rescue vessel. Conditional on θ, bubble locations are modeled as

X_1, X_2 | θ  i.i.d. ∼ Uniform[θ − 5, θ + 5].

Write X_(1) := min{X_1, X_2}, X_(2) := max{X_1, X_2}, and let

X̄ := (X_1 + X_2)/2,   d := X_(2) − X_(1) = |X_1 − X_2|.

We consider two 50% confidence procedures for θ based on (X_1, X_2) (see Morey et al. for derivations). The first is a nonparametric (NP) procedure that simply returns the interval spanned by the two observations:

I_NP(X_1, X_2) := [X_(1), X_(2)] = [X̄ − d/2, X̄ + d/2].   (12)

The second procedure (UMP) selects, for each realization, the shorter of two intervals that have the same 50% coverage under the uniform model:

I_UMP(X_1, X_2) := [X_(1), X_(2)] if d < 5;  [X_(2) − 5, X_(1) + 5] if d ≥ 5.   (13)

Both I_NP and I_UMP have coverage P_θ(θ ∈ I(X_1, X_2)) = 1/2 for all θ, but I_UMP has strictly smaller expected length. The second of these intervals also coincides with the set of θ values receiving positive likelihood from the pair of bubbles under the model.

4.2 Simulating coverage

To take an empirical look at coverage, I simulated N = 1e5 runs of the experiment, at each point sampling a pair of bubbles from the distribution described above. Under this setup, we get confirmation that both procedures cover θ with the same probability: 49,998 successes out of the 1e5 runs yields 50% coverage, rounded up.
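A minimal re-implementation of this simulation (a sketch of mine; the paper's own code is not shown here, and θ is fixed at an arbitrary value since coverage is the same for all θ) looks like the following:

```python
import random

random.seed(42)
theta, n_sims = 0.0, 100_000   # theta arbitrary: coverage is theta-invariant

def np_and_ump(x1: float, x2: float):
    """NP and UMP 50% intervals from equations (12) and (13)."""
    lo, hi = min(x1, x2), max(x1, x2)
    d = hi - lo
    np_int = (lo, hi)                                # equation (12)
    ump_int = np_int if d < 5 else (hi - 5, lo + 5)  # equation (13)
    return np_int, ump_int

np_hits = ump_hits = 0
for _ in range(n_sims):
    x1 = random.uniform(theta - 5, theta + 5)
    x2 = random.uniform(theta - 5, theta + 5)
    np_int, ump_int = np_and_ump(x1, x2)
    np_hits += np_int[0] <= theta <= np_int[1]
    ump_hits += ump_int[0] <= theta <= ump_int[1]

print(np_hits / n_sims, ump_hits / n_sims)  # both near 0.5
print(np_hits == ump_hits)                  # the procedures cover together
```

The run reproduces both observations from the text: each procedure covers about half the time, and the two coverage events coincide exactly (when d ≥ 5, the UMP interval is contained in the NP interval yet still covers whenever the NP interval does).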
We also get confirmation that the two procedures cover θ at exactly the same times, since the UMP intervals are always either equal to the NP intervals or strictly contained by them, in the latter case effectively dropping the parts of the larger interval that are inconsistent with both bubbles being within ±5 m of the hatch. As with any CI procedure, each particular interval either does or does not contain θ, and so conditioned on their realized values, each will have a degenerate probability in {0, 1} (in the simulation, of course, we know θ, so our coverage prediction is always one of those two values and is perfectly accurate).

4.2.1 Design-level coverage as a constant forecast

Since in the experiment the location of the hatch is unknown, we do not know how many of any particular number of constructed intervals will have included it. If we were interested solely in predicting the underlying coverage events for those intervals, what forecast would get us the best result? We can begin with the two constant forecasts mentioned earlier: the one we get from always stating that the interval covers, q(X) = 1; and the one we get from the design-level coverage guarantee, q(X) = 1 − α = 0.5. Using Brier score loss as our proper scoring rule, the loss for a single forecast q and outcome z ∈ {0, 1} is S(q, z) = (z − q)². From Section 3 above, we now let Z_θ := 1{θ ∈ I(X)} be the coverage indicator for a single interval from either procedure. In the submarine example, Z_θ ∼ Bernoulli(p) with p = 1/2, so the expected Brier loss for a constant forecast q is

    R_θ(q) := E_θ[(Z_θ − q)²] = p(1 − q)² + (1 − p)q².

For p = 1/2, the two constant forecasts above yield

    R_θ(1) = (1/2)(1 − 1)² + (1/2)(1)² = 1/2,

and

    R_θ(1/2) = (1/2)(1 − 1/2)² + (1/2)(1/2)² = 1/4.

This is an obvious result, but in this setting, using the design-level forecast q = 1 − α = 0.5 halves the expected Brier loss relative to always asserting that the interval covers (q = 1). Since the design-level forecast is also the optimal constant forecast, this is to be expected.

4.2.2 Beating the constant forecasts with conditional coverage

The reason Morey et al. present this thought experiment, however, is naturally not to extol the virtues of the confidence level as a constant forecast: indeed, it is primarily to show that, given the underlying model, a constant design-level forecast leads one to make rather awkward statements about coverage ex post. For example, if we use the NP procedure to construct an interval for a sample and notice that, after construction, the interval only spans 25% of the 10-meter length of the hull, should we still say it covered the hatch with 50% probability? In a sense, yes, that is exactly true: as shown in Section 3, the model gives us two layers of probability that apply to each and every interval, one of them being 1 − α. Still, it does sound very strange to say, and it requires us to ignore the actual values (and relative width) of the realized interval, which throws away some information.

Thankfully, because of the setup of the experiment, with a uniform distribution of known width distributed symmetrically around θ, the width of the intervals relative to the support of X is in fact a θ-free statistic, and so we can test it out to see whether it improves our forecast. To see these two things in action (the statistic's θ-freeness and its value for prediction), I reran the simulation above across a range of values for both θ and the support (i.e., the hull width). In total, there were 100 unique configurations, with θ (hatch location) ranging from 0 to 10 in increments of 1, and scale (hull width) ranging from 10 to 110 in increments of 10. As before, each experiment consisted of sampling 1e5 pairs of bubbles from a Uniform distribution with the given midpoint and scale.
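The width-conditional forecast described above can be sketched in code. This is a minimal single-configuration reimplementation, not the paper's own script: the bin count and seed are my choices, and for simplicity the conditional coverage table is fitted and scored on the same simulated sample.

```python
import numpy as np

def brier_scores(theta=0.0, width=10.0, n=100_000, n_bins=50, seed=42):
    """Compare Brier losses for constant vs. width-conditional forecasts of
    NP-interval coverage under X1, X2 ~ Uniform[theta - width/2, theta + width/2]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(theta - width / 2, theta + width / 2, size=(n, 2))
    lo, hi = x.min(axis=1), x.max(axis=1)
    z = ((lo <= theta) & (theta <= hi)).astype(float)  # coverage indicator

    # Constant forecasts: always-covers (q = 1) and nominal (q = 1 - alpha).
    loss_always = np.mean((z - 1.0) ** 2)
    loss_nominal = np.mean((z - 0.5) ** 2)

    # Conditional forecast: empirical coverage given binned relative width.
    t = (hi - lo) / width
    idx = np.clip((t * n_bins).astype(int), 0, n_bins - 1)
    counts = np.maximum(np.bincount(idx, minlength=n_bins), 1)
    cond = np.bincount(idx, weights=z, minlength=n_bins) / counts
    loss_cond = np.mean((z - cond[idx]) ** 2)
    return loss_always, loss_nominal, loss_cond, cond

loss_always, loss_nominal, loss_cond, cond = brier_scores()
print(round(loss_always, 3), loss_nominal, round(loss_cond, 3))
```

The constant losses land at roughly 0.5 and exactly 0.25 (for a binary outcome, (z − 0.5)² is 0.25 regardless of z), while the conditional forecast drops the loss to roughly 0.11, in line with the NP row of Table 1; the table entry near relative width 0.25 also lands close to the roughly 33% conditional coverage reported below for 2.5 m-wide intervals.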
Table 1 shows the Brier scores for using the conditional coverage characteristics of the two estimators above to predict whether they captured the hatch, along with using the two constant forecasts of 1 and 1 − α for the same (the scores for the constant forecasts were the same for both procedures, so they are only listed once in the table). Unsurprisingly, the conditional forecasts beat the constant design-level forecast in both cases, with the Brier score dropping from 0.25 to about 0.11 for the NP interval and to about 0.17 for the UMP interval.

Forecast          Brier score (μ)   Brier score (σ²)
Constant 1        0.500             0.000
Constant 1 − α    0.250             0.000
NP width          0.112             0.000
UMP width         0.165             0.000

Table 1: Brier score means and variances for four different forecasting strategies when using a confidence procedure to estimate the location of the hatch across 100 different simulation configurations (hatch location ranging from 0 to 10 in increments of 1, and hull width ranging from 10 to 110 in increments of 10). "NP width" denotes a forecast from coverage probability conditioned on the nonparametric interval's width relative to the support of X, and "UMP width" denotes the same, but for the uniformly most powerful interval. Variances are shown as empirical evidence that the strategies performed consistently across the different configurations.

The scores are instructive in another way as well. In this setup, the two procedures share the same coverage indicator Z_θ: by definition, whenever the NP interval covers θ, the UMP interval also covers, and vice versa. The difference lies in the θ-free statistic used for forecasting coverage. For the NP procedure, we condition on its relative width D; for the UMP procedure, we condition on its (shorter) relative width W = min{D, 1 − D}, which is a deterministic coarsening of D that folds very wide NP intervals back into shorter UMP intervals. Thus, W is a less informative function of the data than D (it induces a coarser partition of the sample space), and conditional forecasts based on D can only improve, and never worsen, expected loss relative to forecasts based on W under any strictly proper scoring rule. In that sense, the NP interval is actually a better predictor of coverage when we condition on relative width, even though the UMP interval is the better estimator of θ: the latter discards impossible values under the design and has shorter expected length for the same marginal coverage, but in doing so it gives up some of the θ-free, coverage-relevant information carried by the wider NP interval.

Using the same conditional coverage estimates, we can also answer the question posed above about the 2.5 m-wide interval precisely: under simulation, intervals of that width cover θ approximately 33% of the time and not, as Morey et al. gesture toward as a kind of paradox, 50% of the time. Looking at this kind of interval, could we still forecast coverage at 1 − α? Certainly, but it would be an overestimate, and we would do better by issuing the finer-grained conditional forecast instead.

4.2.3 The bit about nesting

Morey et al. also point to nested intervals as problematic for ex-post inference about coverage, observing that if one 50% CI completely contains another, then it is logically impossible for them both to cover θ, each with 50% probability, since there would be no probability mass left to fall in the space between them.
From a frequentist standpoint, though, a natural response would be to note that once we start to reason about coverage from two CIs simultaneously, we are now talking about a composite confidence procedure with design-level coverage defined by their joint distribution and not their individual marginals. To see such reasoning in action, let us pick a new pair of procedures, one being the UMP procedure from above, and the other being what Morey et al. describe as the "sampling distribution" (SD) procedure, defined as x̄ ± (5 − 5/√2). Under simulation, these have similar marginal coverage probabilities, at 49,998/1e5 ≈ 0.500 for the former, and 50,094/1e5 ≈ 0.501 for the latter.

The SD procedure does not overlap with the UMP procedure in the same way as the NP procedure, though, and the joint coverage probability (that at least one of the two intervals covers) under simulation turns out to be a tad higher, at 58,605/1e5 ≈ 0.586. From the start, then, we know there should be no reason to consider the composite procedure as having the same chance of capturing θ as each procedure individually: the design-level coverage probability is nearly nine percentage points higher, presumably because we are casting a wider net on average than with either procedure alone.

What, then, about nesting? Can we use it to our advantage for forecasting? As it turns out, coverage probability does change substantially when the intervals are nested: when the SD intervals are nested inside the UMP intervals, the probability that either of them covers goes up to 0.793, and when the roles are reversed, it drops down below the design-level joint coverage to 0.441 (note that, because the intervals share the same midpoint by design, one is always nested inside the other; all that changes is which one is where). These probabilities are also stable across the simulation configurations, and so they appear to be θ-free. To update our ex-ante forecast from the design-level joint value p_joint ≈ 0.586 to something different ex post, we now have two main choices: use the conditional coverage probabilities P(Cover | SD ⊂ UMP) and P(Cover | UMP ⊂ SD) given above, or use the finer-grained versions of these where we use coverage probability conditioned on the relative width of whichever is the wider of the two intervals (i.e., the outer one).

Forecast           Brier score (μ)   Brier score (σ²)
Constant 1         0.414             0.000
Constant p_joint   0.243             0.000
Nest. Cond.        0.213             0.000
Max width          0.208             0.000

Table 2: Brier score means and variances for four different forecasting strategies when using a joint confidence procedure to estimate the location of the hatch across 100 different simulation configurations (hatch location ranging from 0 to 10 in increments of 1, and hull width ranging from 10 to 110 in increments of 10). "Nest. Cond." denotes a forecast from coverage probability conditioned on which interval contains the other, and "Max width" denotes a forecast from coverage probability conditioned on the relative width of whichever interval contains the other. Variances are shown as empirical evidence that the strategies performed consistently across the different configurations.

Table 2 shows the Brier scores for these forecasts, relative to the outcome that either of the intervals covers. Again, a forecast of always 1 leads to the worst loss, at 0.414; moving to the constant design-level joint forecast q ≡ p_joint ≈ 0.586 lowers that by slightly less than half, to 0.243. Using conditional coverage probabilities under the two nesting conditions lowers it a touch more, to 0.213, and combining that information with each procedure's marginal conditional coverage probability (relative to itself, not to either interval covering) leads to the lowest score, at 0.208.
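The nesting-conditional probabilities quoted above are straightforward to reproduce. The sketch below is my own reimplementation for a single configuration (seed and names are mine); it exploits the fact that both intervals are centered at the sample mean, so nesting reduces to comparing half-widths.

```python
import numpy as np

C = 5 - 5 / np.sqrt(2)   # half-width of the SD procedure, x-bar +/- (5 - 5/sqrt(2))

def joint_nesting_sim(theta=0.0, n=100_000, seed=3):
    """Simulate UMP and SD intervals together and tabulate coverage of the
    composite event (either interval covers) by nesting direction."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(theta - 5, theta + 5, size=(n, 2))
    lo, hi = x.min(axis=1), x.max(axis=1)
    mid, d = (lo + hi) / 2, hi - lo

    ump_half = np.where(d < 5, d, 10 - d) / 2   # UMP half-width, eq. (13)
    sd_in_ump = C < ump_half                    # SD interval nested inside UMP
    ump_cover = np.abs(mid - theta) <= ump_half
    sd_cover = np.abs(mid - theta) <= C
    either = ump_cover | sd_cover

    return (either.mean(),                      # design-level joint coverage
            either[sd_in_ump].mean(),           # P(cover | SD inside UMP)
            either[~sd_in_ump].mean())          # P(cover | UMP inside SD)

p_joint, p_sd_in, p_ump_in = joint_nesting_sim()
print(p_joint, p_sd_in, p_ump_in)   # roughly 0.586, 0.793, 0.441
```

The three outputs match the joint and nesting-conditional coverage probabilities reported in the text, and varying theta leaves them essentially unchanged, consistent with their apparent θ-freeness.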
These results offer some mild empirical evidence in favor of the general principle raised in Section 3 that conditioning on more information leads to a better forecast.

Before leaving the submarine example behind, we can return to the apparent logical paradox raised at the beginning of this subsection about the potentially missing coverage probability between the inner and outer intervals in a nested pair. Across the simulations, the inner SD interval misses while the outer UMP interval covers with an average probability of 0.085, and vice versa (results were again stable across all combinations of θ and scale). These probabilities fill the "gaps" between the inner and outer nested intervals in both cases and are not, thankfully, 0, as Morey et al. had wondered whether they might be. Each is also approximately the difference between either procedure's marginal coverage probability of 0.500 and their joint coverage probability of 0.586, a sensible and welcome result.

5 Discussion

5.1 The utility of confidence as a forecast

Treating "confidence" as a probabilistic forecast gets us a few, hopefully helpful, things for free when using Neyman's machinery for inference. As an interpretation for some number of realized intervals, it gives us a principled way to estimate the number of them that covered, for example by using 1 − α as p to calculate a binomial mean. This benefit most notably applies to the case of a single realized interval, where we can now happily say both that the interval either covers or it does not (conditioned on its endpoints), and that it has some intermediate probability of covering the parameter (conditioned on the group coverage of other intervals like it).
It also allows for a straightforward interpretation of groups of intervals that very plainly would not cover the parameter with probability 1 − α, since we have the mathematical support for updating our forecast based on any coverage-relevant information they may contain. As touched on above, the latter serves as a response to the numerous counterexamples in the literature intended to show that "confidence" has no coherent meaning ex post, and the former serves as formal support for Neyman's original intention for CIs as having estimable coverage properties in conducting repeated (empirical) experiments. Although these two seem the most valuable, treating "confidence" as a forecast also helps keep clear what we mean when we talk about "coverage probability", which can now be seen as both a design-level unconditional probability under the model and an information-relative conditional probability based on whatever we happen to observe when using a confidence procedure.

5.2 Isn't this epistemic?

Part of why Neyman may have resisted a forecasting-based interpretation of ex-post coverage is that it threatens to blur the boundary between his frequentist, procedural, behavioristic view of inference and the then-emerging subjectivist Bayesian program of Ramsey [16] and de Finetti [2]. In particular, the gap between Neyman's error-controlling constructions and de Finetti's slogan that "probability does not exist" (ibid.) was likely a bridge too far. Still, as the preceding sections suggest, when we treat "confidence" as a predictive probability for an underlying coverage event determined by the design, the resulting notion remains both objective and purely frequentist.
Both ex ante and ex post, coverage probabilities are defined as limiting relative frequencies of success in clearly specified reference classes, e.g., intervals produced by this procedure, or intervals of a given relative width W under a given design, and under any strictly proper scoring rule, the risk-minimizing forecast is exactly the relevant conditional coverage probability, whether unconditional (1 − α) or conditioned on a θ-free statistic such as relative width. So long as those conditional probabilities are all equally well-defined under the sampling model, there seems to be no mathematical reason to treat some as more legitimate than others, or to insist that our confidence in coverage must not change, even if the design itself tells us how it should.

Taking a slightly more philosophical tack, we might also note that Neyman's behavioristic program is explicitly designed to control our long-run error rates in making statistical inferences, for example in stating that an interval covers the parameter, or in rejecting a null hypothesis. The only reason we ever make such errors, however, is that we, as agents, do not know whether the claims themselves are true. If we did, as in the familiar stacked-interval plot where θ is taken to be known and we can see exactly which intervals cover, our error rate would be zero, and there would be no substantive need for confidence procedures in the first place. It is precisely our lack of knowledge that makes long-run error control, and hence both Neyman's framework and the error-statistical machinery built more recently around it [11], useful and relevant.
To be sure, though, none of this is to suggest that CI theory itself is subjective in any way, or that the conditional coverage probabilities explored in Sections 3 and 4 are somehow tied to degree of belief, credence, propensity, or any other philosophical interpretation of probability aside from frequentism (again, they are defined purely with respect to limiting relative frequencies under repeated sampling). Mainly, it is to suggest that the idea of "confidence", as Neyman conceived it, was defined relative to an agent's (typically a statistician's) information state when making statements about unobserved events, and could thus be fairly deemed at least information-relative, if not actually epistemic in the subjective sense.

5.3 A note on pedagogy

In light of this paper's analysis, it seems natural to teach students of statistics that a confidence procedure gives rise to three related ways of talking about the single-case coverage event {θ ∈ I(X)}. First, there is the degenerate {0, 1} judgment obtained by conditioning on the realized interval (or equivalently on the full data): either this particular interval covers or it does not. Second, there is the design-level probability 1 − α, obtained by averaging the coverage indicator over the sampling distribution and conditioned only on the fact that the procedure was used at all. Third, there is a predictive probability for coverage conditioned on whatever coarser information we choose to retain about the case at hand (for example, that the interval arose from a particular procedure, or that it has a certain relative width under a given design).
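The three readings can be juxtaposed concretely. The toy sketch below uses the submarine model of Section 4; the seed and names are mine, and the closed-form conditional coverage d/(10 − d) for the NP interval (valid for d ≤ 5) is my own derivation from the uniform design rather than a formula stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(11)
theta = 3.7                          # known only to the simulator
x = rng.uniform(theta - 5, theta + 5, size=2)
lo, hi = x.min(), x.max()

# 1. Degenerate judgment: condition on the realized interval (and known theta).
covers = float(lo <= theta <= hi)

# 2. Design-level probability: the nominal level, 1 - alpha.
nominal = 0.5

# 3. Predictive probability given coarser information (the interval's width),
#    via the theta-free conditional coverage d / (10 - d) for d <= 5.
d = hi - lo
conditional = min(d / (10 - d), 1.0)

print(covers, nominal, round(conditional, 3))
```

The first number is always 0 or 1, the second never moves, and only the third responds to what was actually observed, which is the distinction the pedagogy above is after.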
Since many of the intervals students will encounter early on fall into what I called the first class of designs in Section 3.4 (unbounded, approximately translation-invariant models with pivot-based CIs), it also seems fair to point out that the realized endpoints typically do not carry any design-based information about coverage beyond the nominal level (even though they may be useful for other purposes, such as summarizing uncertainty about θ or inverting hypothesis tests). In such settings, the best forecast for coverage, both before and after seeing the interval, is simply 1 − α.

With this perspective in mind, it may be pedagogically helpful to introduce the general notion of a confidence procedure first, and only then show how it specializes to the canonical introductory intervals, such as the t interval for a normal mean, the Wald interval for a binomial proportion, or nonparametric bootstrap intervals. As Basu and others have emphasized [1], it is even possible to construct confidence procedures based entirely on randomization variables whose distributions are themselves θ-free, without specifying any sampling model for X given θ. His example, together with the uniform-design procedure of Welch [19] and its reimagining in the submarine example of Morey et al. [13], forms a useful constellation of procedures for helping students see that "confidence" is fundamentally about the coverage indicator as a Bernoulli-distributed event under a specified design, rather than primarily about pinning down an approximate value of θ (even though the latter is often the motivating goal in applications).

6 Conclusion

In this paper, I framed the frequentist notion of "confidence" as a probabilistic forecast that can be estimated, updated, and scored relative to the corresponding intervals' underlying coverage events.
A simple thought experiment showed how we might wish to use the design-level unconditional, rather than the degenerate conditional, coverage probability for betting, and the following section developed that notion formally by treating coverage as Bernoulli and "confidence" as a prediction whose quality can be measured with a strictly proper scoring rule. I then showed how this framework helps resolve one well-cited example of the apparent paradoxes inherent to CI-based inference, and ended by discussing the framework's frequentist bona fides, as well as its potential to reframe how we teach CI theory to beginning statistics students.

References

[1] Anirban DasGupta. "Ancillary Statistics, Pivotal Quantities and Confidence Statements". In: Selected Works of Debabrata Basu. Springer, 2010, pp. 327–342.
[2] Bruno de Finetti. "La prévision: ses lois logiques, ses sources subjectives". In: Annales de l'institut Henri Poincaré. Vol. 7. 1. 1937, pp. 1–68.
[3] Tilmann Gneiting and Adrian E. Raftery. "Strictly proper scoring rules, prediction, and estimation". In: Journal of the American Statistical Association 102.477 (2007), pp. 359–378.
[4] Sander Greenland et al. "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations". In: European Journal of Epidemiology 31.4 (2016), pp. 337–350.
[5] Alan Hájek. "Interpretations of probability". In: (2002).
[6] Alan Hájek. "The reference class problem is your problem too". In: Synthese 156.3 (2007), pp. 563–585.
[7] Alexander T. Hawkins and Lauren R. Samuels. "Use of confidence intervals in interpreting nonstatistically significant results". In: JAMA 326.20 (2021), pp. 2068–2069.
[8] Rink Hoekstra et al. "Robust misinterpretation of confidence intervals". In: Psychonomic Bulletin & Review 21.5 (2014), pp. 1157–1164.
[9] Michael E. J. Masson and Geoffrey R. Loftus. "Using confidence intervals for graphically based data interpretation." In: Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale 57.3 (2003), p. 203.
[10] Deborah G. Mayo. "In defense of the Neyman-Pearson theory of confidence intervals". In: Philosophy of Science 48.2 (1981), pp. 269–280.
[11] Deborah G. Mayo and Aris Spanos. "Error statistics". In: Philosophy of Statistics. Elsevier, 2011, pp. 153–198.
[12] Richard D. Morey et al. "Continued misinterpretation of confidence intervals: Response to Miller and Ulrich". In: Psychonomic Bulletin & Review 23.1 (2016), pp. 131–140.
[13] Richard D. Morey et al. "The fallacy of placing confidence in confidence intervals". In: Psychonomic Bulletin & Review 23.1 (2016), pp. 103–123.
[14] Ashley I. Naimi and Brian W. Whitcomb. "Can confidence intervals be interpreted?" In: American Journal of Epidemiology 189.7 (2020), pp. 631–633.
[15] Jerzy Neyman. "Outline of a theory of statistical estimation based on the classical theory of probability". In: Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences 236.767 (1937), pp. 333–380.
[16] Frank P. Ramsey. "Truth and probability". In: Readings in Formal Epistemology: Sourcebook. Springer, 1926, pp. 21–45.
[17] Philip Sedgwick. "Understanding confidence intervals". In: BMJ 349 (2014).
[18] Teddy Seidenfeld. "On after-trial properties of best Neyman-Pearson confidence intervals". In: Philosophy of Science 48.2 (1981), pp. 281–291.
[19] B. L. Welch. "On confidence limits and sufficiency, with particular reference to parameters of location". In: The Annals of Mathematical Statistics 10.1 (1939), pp. 58–69.

Supplemental Methods

1 Proof sketch for θ-free conditional coverage

Fix θ ∈ Θ.
By strict propriety of S in the conditional sense, for any G-measurable q(X),

    E_θ[ S(q(X), Z_θ) | G ]

is minimized (almost surely) when q(X) = E_θ[Z_θ | G] = P_θ(Z_θ = 1 | G). Under assumption (21) in the main text, this conditional expectation equals g(T(X)) for every θ, so q*(X) = g(T(X)) solves the conditional optimization problem simultaneously for all θ. Taking expectations over G yields the unconditional risk inequality, with uniqueness from strict propriety. □
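For the Brier score specifically, the minimization step can be made explicit. This worked instance is my own addition (the sketch above states the general result via strict propriety): since Z_θ is binary, Z_θ² = Z_θ, and expanding the conditional risk gives

```latex
\[
\mathbb{E}_\theta\!\left[ (Z_\theta - q(X))^2 \,\middle|\, \mathcal{G} \right]
= \operatorname{Var}_\theta\!\left( Z_\theta \,\middle|\, \mathcal{G} \right)
+ \left( \mathbb{P}_\theta(Z_\theta = 1 \mid \mathcal{G}) - q(X) \right)^2 .
\]
```

The first term does not depend on q, and the second is uniquely minimized (almost surely) at q(X) = P_θ(Z_θ = 1 | G), recovering q*(X) = g(T(X)) under assumption (21).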
