Statistical Compatibility, Refutational Information, and Acceptability


Authors: Alessandro Rovetta

Alessandro Rovetta
International Committee Against the Misuse of Statistical Significance (ICAMSS), Bovezzo, Italy

Corresponding author: Alessandro Rovetta, International Committee Against the Misuse of Statistical Significance, Bovezzo (BS), 25073, Italy. Email: alessandrorovetta@icamss.com. Phone: +39 39271 12808. ORCID: https://orcid.org/0000-0002-4634-279X

ABSTRACT

This paper develops an interpretive framework for divergence P-values and S-values within a descriptive frequentist perspective. Statistical analysis is framed as operating within idealized worlds defined by a set of assumptions and a target hypothesis, where probabilities describe the behavior of data under the model but do not assign truth values to hypotheses. Within this view, P-values are interpreted as graded indices of compatibility between the observed result and the predictions generated by the assumed model; accordingly, small P-values should not be read as indicating logical impossibility or strict inconsistency of the model itself. Building on this distinction, the paper argues that practical inference requires moving beyond the internal logic of the model toward judgments of overall acceptability, which depend not only on data-model compatibility but also on multiple contextual considerations such as subject-matter knowledge, plausibility of assumptions, data quality, usefulness, and loss, all interpreted through the competence, intentions, perceptions, and moral values of the specific analyst. S-values are therefore interpreted not as evidence against the epistemic status of the model, but as a specific form of refutational information that contributes to the broader body of information used by the analyst to judge whether a model remains acceptable for an intended practical purpose.
The paper also examines the linguistic and conceptual risks associated with the language of incompatibility, distinguishes probability from rarity, and clarifies different notions of surprise, including a possible definition of Shannon-type surprise, to be distinguished from Bayesian belief revision. Overall, the article proposes a more cautious and explicit interpretation of frequentist measures, centered on model-based description, analyst responsibility, and decision acceptability. These considerations mainly pertain to settings with accumulated, multi-source evidence; in contrast, individual studies lacking such evidentiary breadth should primarily aim to provide clear and transparent descriptions of the data-model relationship rather than attempt to resolve questions of practical acceptability.

PREMISES

In this work, I refer exclusively to divergence P-values as defined in the original descriptive frequentist framework and carefully distinguished from decision P-values by Greenland (2023). In line with the notation adopted by Greenland, I denote by A the set of assumptions defining the model. However, I introduce a slight modification for convenience: I use H_i to indicate a specific target hypothesis under evaluation, rather than a set of hypotheses. The formulations presented here are not intended to express definitive judgments or deliver truth evaluations, which generally lie beyond the scope of research inquiry (e.g., Msaouel, 2022); rather, they aim to offer interpretative nuances and perspectives that may prove useful (see Box, 1976) within appropriately specified contexts. The goal is to revisit and expand previous proposals, including my own, to provide clearer and more precise interpretations, particularly in areas where prior formulations may lead to consequential ambiguity or misinterpretation.

1. FREQUENTIST FRAMEWORKS: BASIC CONCEPTS

In frequentist frameworks, statistical analysis is grounded in a set of assumptions A that jointly specify the expected behavior of the phenomenon under study (Greenland, 2023). These assumptions ideally encode all the relevant properties of the data-generating process (DGP), and they are combined with a target hypothesis H_i to define the model A + H_i. In particular, A includes background conditions such as distributional forms of data and errors, degree of dependence, measurement properties, and all structural and causal features that define how data are mechanically produced; the target hypothesis H_i specifies a particular "claim" about the parameter of interest (e.g., the average efficacy of a medical treatment in the defined population is zero). Different target hypotheses correspond to different model specifications; in some cases these define families of admissible models (e.g., composite hypotheses such as 'the average effect is greater than zero'), whereas in others they define fully specified models. For instance, one may consider the null model A + H_0, where H_0 encodes the null hypothesis of zero effect, or the alternative non-null models A + H_{i≠0} (e.g., A + H_1 or A + H_2), where H_{i≠0} encodes the target hypothesis of a specific non-zero effect (indexed by i); each of these defines a distinct, internally coherent probabilistic system. Crucially, these systems should not be interpreted as competing claims endowed with probabilities of being true, as the "traditional" frequentist frameworks do not assign probabilities to hypotheses or models. Instead, each model represents a conditional scenario: if the assumptions A and the specific hypothesis H_i were true, then certain probabilistic statements about the data would follow.
For this reason, it is useful to conceptualize these models as "small fictional worlds" defined by a predefined combination of assumptions and a specific target hypothesis; within that world, one can derive the distribution of possible data and the behavior of statistical procedures (Greenland, 2025). The analyst can thus "enter" each of these worlds and examine what would be expected to occur under those conditions; in the language of Cox and Hinkley (1974), one treats the chosen model A + H_i as provisionally true.

2. ON P-VALUES AND COMPATIBILITY

The P-value is defined as the probability of obtaining, under the specific model A + H_i, a statistical outcome (e.g., a test statistic) at least as extreme as the one observed (Wasserstein and Lazar, 2016). The reference is therefore always to a fully idealized world, not to the hypothesis in isolation. This definition naturally supports an interpretation of the P-value as an index of compatibility, ranging from 0 to 1, between the observed result and the predictions generated within that model; here, compatibility should be understood as a graded measure of agreement, coherence, consonance, or consistency between what is expected under A + H_i (in terms of the sampling distribution of the statistic) and what is actually observed (Greenland et al., 2016; Greenland, 2023; McShane et al., 2024; Hennig, 2024; Rovetta, 2025). P-values "close" to 1 correspond to outcomes that are "typical" under the assumed model, and thus indicate "high" compatibility; as the P-value decreases, the observed result becomes increasingly "atypical" relative to the model's predictions, so that the degree of compatibility correspondingly decreases. However, and this point is crucial in light of the previous discussion, I argue that it is not correct to interpret "small" P-values as showing logical incompatibility, impossibility, or inconsistency with H_0.
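The graded reading just described can be sketched numerically; the following is a minimal illustration assuming a standard-normal test statistic (the z values are purely illustrative, not from the text):

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided P-value for a standard-normal test statistic:
    the probability, under the assumed model A + H_i, of a result
    at least as extreme as the one observed."""
    return math.erfc(abs(z) / math.sqrt(2))

# Compatibility grades smoothly: the further the observed statistic
# lies from what A + H_i makes typical, the smaller the P-value.
for z in (0.1, 1.0, 2.0, 3.0):
    print(f"z = {z:.1f}  ->  p = {two_sided_p(z):.3f}")
```

Note that even the smallest P-value here marks atypicality under the model, not logical impossibility: every z value remains possible within A + H_i.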
The model A + H_0 defines an internally coherent probabilistic system within which all realizations in the support of the distribution are, by construction, perfectly possible. The observed data cannot contradict the model in a strict logical sense, because the model is provisionally assumed as the generating mechanism for the purpose of the calculation (Cox and Hinkley, 1974). Statements about "incompatibility" are conventions with their own merits, but they involve non-negligible interpretative risks in light of the inversion fallacies that pervade frequentist statistics (e.g., Greenland et al., 2016; Rovetta and Mansournia, 2025). Etymologically, 'incompatibility' derives from the Latin 'incompatibilis', denoting the inability of two (or more) things to co-exist or be brought together, a meaning that naturally conveys total conflict rather than a matter of degree (Harper, n.d.). Such an original meaning has been preserved in ordinary language (The Origin of Incompatibility: From Past to Present, n.d.). Even in probability theory and in standard statistical usage, it is more common to reserve 'incompatibility' for genuine structural impossibility: situations in which a model forbids an outcome or pattern outright, so that the corresponding event lies outside the model's support (Glossary of Statistical Terms, n.d.). In the special case of discrete models this often coincides with the model assigning probability zero, and for mutually exclusive (or disjoint) events it means they cannot occur together. The common underlying idea is thus binary: either an event is possible under the model or it is not.

3. ISSUES IN 'INCOMPATIBILITY' LANGUAGE

The temptation to speak of "degrees of incompatibility" often arises because we want a graded scale to replace the older dichotomies of "significant" versus "not significant". But importing graded language into a term that is frequently used dichotomously risks conceptual muddle. Consider the parallel with "insignificance".
Historically, some authors including Neyman (1977) have used graded labels for statistical significance (for example, "highly significant", "approximately significant", or "marginally significant"). Yet, at least at the level of ordinary phrasing, it is less natural to apply an analogous graded vocabulary to the "not significant" side, that is, to "insignificance" and related variants, unless one is prepared to shift the term's meaning away from simple negation and toward a graded quantity. This is a linguistic expectation that can be checked empirically: in a PubMed title-and-abstract corpus analysis covering publications since 1926, reported in the Supplement, the conditional frequency of the examined graded modifiers was about 12.3-fold higher on the "significant/significance" side than on the "insignificance" or "not/non-significant" side. Such a result strongly agrees with the hypothesis that "insignificant" is naturally read as the negation or absence of "significant", not as a quantity with its own continuum. This matters even more when we move across inferential cultures. In Bayesian contexts, one may speak about conflict between prior and likelihood, between model components, or about prior-data conflict and model checking (Evans and Moshonov, 2006). Speaking of incompatibility of hypotheses with data risks conflating distinct ideas: tail-area rarity under a frequentist reference model and posterior predictive misfit in Bayesian checking. These are not the same object, and a single word should not be made to carry them all unless we carefully define it. For these reasons, I argue it is more "appropriate" (less risky) to frame discussions in terms of graded compatibility rather than graded incompatibility. Within a given model, outcomes are compatible insofar as they are possible under the model, but they can be more or less in accord with what the model makes "typical".
The P-value, in this view, can be described as one particular measure of extremeness (relative to a test statistic and a reference distribution) and thus as an indirect indicator of how "unusual" the computed result looks under the assumed model. Moreover, we increasingly see new dichotomies emerging under the banner of "compatibility language". For example, some writings describe parameter values outside a so-called 'compatibility interval' as incompatible, and those inside as compatible (Rovetta and Mansournia, 2025). This reproduces almost verbatim the "inside-outside" dichotomy of confidence intervals, with 'compatible/incompatible' simply replacing 'accepted/rejected' or 'significant/not significant' (e.g., Amrhein et al., 2017). If we then also speak of degrees of incompatibility, we risk creating a conceptual hybrid that is simultaneously graded and dichotomous, depending on rhetorical needs. That is precisely the kind of language instability that invites misinterpretation and overconfident conclusions.

4. ABOUT THE PRACTICAL UTILITY OF THE COMPATIBILITY APPROACH

The practical use of frequentist results requires a step that lies outside those "small fictional worlds". The analyst should exit the provisional abstraction defined by A and H_i, so that the key question is no longer how the data behave within that hypothetical world, but to what extent that hypothetical world should be considered acceptable in light of the broader evidential and decision context (McShane et al., 2024; Rovetta and Mansournia, 2024). A "small" P-value indicates that the observed result would be "rare" under A and H_i; however, this statement alone does not determine whether the model should be refuted in practice.
The transition from model-based evaluation to real-world action requires integrating this information with external considerations, including subject-matter knowledge, causal plausibility, data quality, and the credibility of the underlying assumptions (Good, 1952; Neyman, 1977; Box, 1980; Hennig, 2010; Greenland, 2022; McShane et al., 2024; Greenland, 2025; Rovetta, 2025; Rovetta et al., 2025a). Here one can see that probability is a model-defined numerical quantity (i.e., the long-run relative frequency under repeated realizations of the same experiment under A + H_i), whereas rarity is an interpretive judgment about how unusual that quantity is considered in context. This leads to the notion of 'acceptability' as a decision-oriented concept. Acceptability reflects a judgment that combines data-model compatibility with external evidence and contextual considerations, including usefulness and consequentiality (Good, 1952; Box, 1980; Greenland, 2022; McShane et al., 2024; Rovetta and Mansournia, 2024). Thus, the evaluation of acceptability implicitly involves a loss structure (Greenland, 2021): what are the consequences of rejecting a model that is at least "adequate" given the established goal, and what are the consequences of retaining a model that is "inadequate" given the established goal? In this regard, especially in medical settings where the stakes are extremely high, it is essential to emphasize that there is often more than one model that is at least reasonably compatible with the data, and thus potentially acceptable (Fisher, 1955; Greenland, 2025; Rovetta et al., 2025a). Within the same family of models defined by A, multiple H_i may exhibit high degrees of compatibility with the observed data (e.g., Rovetta et al., 2025a), and even alternative families of assumptions distinct from A may also be consistent with the same data (Greenland, 2025).
It is important to note that the aforementioned "reasonableness" is not a property of models, but a human conclusion influenced by factors including competence, creativity, and moral values (Good, 1952; Fisher, 1955; Greenland, 2012, 2025; Gelman and Hennig, 2017; Rovetta, 2025). For these reasons, from now on, instead of passive expressions such as 'given the assumptions', expressions that emphasize human agency will be used, such as 'according to the assumptions specified by the analyst'.

5. THE ROLE OF REFUTATIONAL INFORMATION

Suppose we observe a sequence of tosses and record the number of heads obtained. We define a model by combining a set of assumptions A, including independence and identical distribution of the tosses, with a target hypothesis H_0 stating that the coin is not biased in favor of heads (i.e., the probability of heads is less than or equal to 1/2; note that this is a composite hypothesis). Within this model, the number of heads in s tosses follows a binomial distribution; in particular, the probability of observing s heads in s tosses is at most 1/2^s (where 1/2^s is the boundary case in which the probability of heads matches exactly that of tails; see also Cole et al., 2021). The P-value for such a one-tailed scenario is thus p = sup Pr(S ≥ s) = sup Pr(S = s) = 1/2^s; this corresponds exactly to the probability of obtaining s heads in s tosses under A and the boundary case within the composite hypothesis H_0 in which the coin is not biased in either direction.
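A minimal sketch of this calculation (the supremum over the composite H_0 is attained at the boundary case θ = 1/2):

```python
from math import comb

def one_sided_p(heads: int, tosses: int, theta: float = 0.5) -> float:
    """Pr(S >= heads) for S ~ Binomial(tosses, theta); theta = 1/2 is
    the boundary case of the composite H0 'the coin is not biased
    toward heads', where the supremum is attained."""
    return sum(comb(tosses, k) * theta**k * (1 - theta)**(tosses - k)
               for k in range(heads, tosses + 1))

# s heads in s tosses: Pr(S >= s) = Pr(S = s) = (1/2)**s
print(one_sided_p(3, 3))    # 0.125
print(one_sided_p(10, 10))  # 0.0009765625
```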
From the inverse transformation, one obtains a base-2 logarithmic relation that defines the S-value, s = −log2 p, namely the amount of information in bits against the acceptability of the composite hypothesis H_0 (the coin is not biased toward heads), conditionally on the working model A defined by the analyst (Rafi and Greenland, 2020; McShane et al., 2024; Rovetta and Mansournia, 2024; Rovetta et al., 2025b). This coin-toss analogy can be naturally extended to evaluate the acceptability of any hypothesis (e.g., the hypothesis of a zero average effect of a medical treatment) under a given set of assumptions (Rafi and Greenland, 2020; Rovetta et al., 2025b). The S-value allows one to express how much refutational information the data provide against the acceptability of a target hypothesis H_i within a specified working model A, compared to the refutational information that s heads in s tosses of a coin provide against the acceptability of the hypothesis 'the coin is not biased toward heads' under the assumption 'the coin toss is fair'.

6. ADVANTAGES OF S-VALUES OVER P-VALUES

A key limitation of the P-value lies in the fact that it does not preserve information on the additive scale (Rafi and Greenland, 2020; Rovetta, 2025). For example, the change from 0.04 to 0.01 represents a substantially larger shift in evidential meaning than the change from 0.94 to 0.91, even though both differences are numerically equal. This asymmetry can be initially appreciated in qualitative terms by considering the multiplicative scale: in the first case, 0.04 is four times larger than 0.01, indicating a "substantial" change in the rarity of the result under the model; in contrast, 0.94 is only about 1.03 times larger than 0.91.
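Translating these P-values into S-values via s = −log2 p makes the additive-scale asymmetry explicit; a minimal sketch:

```python
import math

def s_value(p: float) -> float:
    """Refutational information in bits: s = -log2(p)."""
    return -math.log2(p)

# Equal P-value differences (0.03 in both cases) carry very
# different amounts of refutational information:
print(round(s_value(0.01), 2))  # 6.64 bits
print(round(s_value(0.04), 2))  # 4.64 bits
print(round(s_value(0.91), 2))  # 0.14 bits
print(round(s_value(0.94), 2))  # 0.09 bits

# Differences in bits can be read off directly from the P-value ratio:
print(round(math.log2(0.04 / 0.01), 2))  # 2.0
```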
However, since the multiplicative scale exaggerates differences, a more precise assessment is obtained by translating P-values into S-values, which address the P-value's dependence on location within the unit interval. Under such a transformation, p = 0.01 corresponds to approximately s = −log2 0.01 = 6.64 bits, while p = 0.04 corresponds to about 4.64 bits, yielding a difference of 2 bits (a faster way to compute differences is to exploit the properties of logarithms, so that Δs = log2(0.04/0.01) = 2 bits); by contrast, p = 0.91 corresponds to about 0.14 bits and p = 0.94 to about 0.09 bits, with a difference of only about 0.05 bits. Because it is defined on a logarithmic base-2 scale, equal increments in the S-value correspond to equal increases in refutational information (provided by the data against the acceptability of the targeted hypothesis, given the assumptions fixed by the analyst). For instance, the difference between s = 9 bits and s = 10 bits represents the same additional amount of information (1 bit) as the difference between s = 19 bits and s = 20 bits. An additional advantage of the S-value is that it allows probabilities to be mapped onto a concrete and familiar physical experiment, namely repeated coin tossing (Rafi and Greenland, 2020; McShane et al., 2024; Rovetta, 2025). This provides an intuitive benchmark grounded in everyday experience and helps contextualize the degree of rarity of the observed data in relation to the specific research question and decision context, supporting more informed judgments about the acceptability of the underlying model.

7. THE EPISTEMIC RELATIONSHIP BETWEEN REFUTATIONAL INFORMATION AND ACCEPTABILITY

In some previous work, including my own (e.g., Rovetta et al., 2025b), this refutational information was often described as being directed against the model itself, rather than against its acceptability.
Upon closer inspection, such a formulation is imprecise and risks conveying a misleading message, namely that "atypical" results under A + H_i (i.e., small P-values and thus large S-values) can provide evidence against a model that is, by construction, assumed to be (provisionally) true. Neither the P-value nor the S-value can be interpreted as providing evidence about the truth or falsity of the model in an epistemic sense: the evaluation of acceptability lies outside the "small fictional world" induced by the model, i.e., it lies in the real-world context in which decisions are made. Accordingly, the S-value should not be interpreted as evidence against the model per se; rather, it contributes to informing judgments about how far the model remains acceptable, without being sufficient, on its own, to determine that human decision. A more precise interpretation is that S-values characterize the degree to which the observed data depart from what would be expected under the model assumed by the analyst. The evaluation of whether this departure renders the model "unacceptable" depends on contextual considerations based on the specific scope of the analyst, shaped by their background knowledge and their intentions (Rovetta et al., 2025b). In other words, the S-value does not quantify evidence against the epistemic status of the model assumed for the calculation; rather, it quantifies a specific form of refutational information that becomes part of the broader body of information the analyst uses to judge whether the model remains acceptable for the intended practical purpose. For example, observing 2 heads in 2 tosses is typically regarded as weak refutational information against the acceptability of the hypothesis that the coin is not biased toward heads.
This is because, in ordinary real contexts, such an outcome is not perceived as "rare", even when the plausibility of the model defined by A and H_0 is high, in light of our experience and cognitive perceptions. In other words, when the experiment is conducted under routine conditions, where assumptions such as toss independence are considered reasonable by the analyst, 2 heads in 2 tosses are generally insufficient to motivate refuting the model as an adequate description of the real process. By contrast, observing 10 heads in 10 tosses may be considered "sufficiently" rare under A together with H_0 to prompt a different judgment. In that case, alternative explanations, such as the possibility that the coin is in fact biased, may appear more consistent with the observed data. As a result, the hypothesis H_0 may be deemed unacceptable, in the sense that it is no longer considered a convenient representation, or even idealization, of the real phenomenon under investigation given the analyst's scope. Nonetheless, it is important to stress that overall support for a hypothesis requires a restriction of plausible alternatives (Greenland, 2023; Rovetta et al., 2025); for example, in this case we must exclude the possibility that the person tossing the coin is a magician capable of influencing the outcome at will, even if the coin itself is not biased at all.

8. LOSS ACCEPTABILITY

There is no universal or context-free scale for evaluating the degree of refutational information. Any such evaluation is necessarily conditional on the loss function, that is, on how the consequences of decisions are weighted by the analyst across possible states of the world (Good, 1952; Neyman, 1977; Greenland, 2021, 2022; Rovetta and Mansournia, 2024). The loss function encodes the costs associated with different types of errors as evaluated by the specific analyst, conditionally on their knowledge, moral values, perceptions, and goals.
These costs are not necessarily symmetric and strongly depend on the specific practical context. As a result, the same amount of refutational information may lead to different decisions depending on how severe the consequences of being wrong are judged to be. Consider the following stylized bet: if one wagers that a coin is biased toward heads and this hypothesis turns out to be correct, the gain is 100 dollars; if the hypothesis turns out to be incorrect (i.e., if the coin is not biased toward heads), the loss is 0 dollars. The question is: what number s of heads in s tosses should one observe in order to take the bet? If the ultimate goal is to earn money, the optimal decision is trivial: one should accept the bet regardless of the observed data, since there is no downside cost. Now suppose that losing the bet entails a cost of 5 dollars. In this setting, the personal decision becomes sensitive to the degree of refutational information provided by the data. One might judge that observing 3 heads in 3 tosses, corresponding to a "modest" level of rarity under the assumption of no bias toward heads, is sufficient to justify the risk. However, if the potential loss increases to 100 dollars, matching the possible gain, a much stronger degree of refutational information would typically be required; for instance, one might require observing at least 5 heads in 5 tosses before considering the bet acceptable. This dependence becomes even more pronounced in sensitive settings such as medicine (Greenland, 2021, 2022; Rovetta and Mansournia, 2024). For example, consider a setting in which the causal structure underlying a treatment is clear (e.g., the biological mechanisms are well understood) and the study design is optimal.
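The coin bet above can be sketched as a toy decision rule; the bit thresholds (0, 3, and 5) are the analyst's illustrative judgment calls from the example, not outputs of any formal optimization:

```python
def bits_from_heads(heads: int) -> int:
    """s heads in s tosses carry exactly s bits of refutational
    information against 'the coin is not biased toward heads',
    under the assumed tossing model A."""
    return heads

def take_bet(observed_heads: int, required_bits: float) -> bool:
    """Accept the wager only if the observed refutational
    information meets the analyst's loss-dependent threshold."""
    return bits_from_heads(observed_heads) >= required_bits

print(take_bet(0, 0))  # loss 0 vs gain 100: accept regardless of data
print(take_bet(3, 3))  # loss 5 vs gain 100: 3 bits judged sufficient
print(take_bet(4, 5))  # loss 100 vs gain 100: 4 bits not enough
print(take_bet(5, 5))  # ...but 5 heads in 5 tosses meets the bar
```

The same observed data thus license different actions as the loss structure changes, which is the point of the example.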
When assessing a hypothesis concerning a relatively mild outcome, such as a non-negligible increase in the frequency of headaches, a moderate level of refutational information (e.g., 4 bits) might be considered sufficient. By contrast, when evaluating severe outcomes, such as anaphylaxis, a substantially higher degree of refutational information would typically be required before deciding to refute that costly hypothesis. In this regard, it is essential to recognize that the decision problem becomes even more complex, as it does not reduce to a binary accept-or-reject choice, but instead requires selecting among multiple therapeutic alternatives, whose benefits and side effects may differ across several practical aspects, including dosage, timing of administration, and treatment intervals.

9. SURPRISAL AND SURPRISE

S-values are often referred to as surprisals, a term that is frequently used alongside the more general notion of surprise, especially when illustrated through analogies such as observing s heads in s tosses under the assumption that the coin is not biased toward heads (e.g., Cole et al., 2021; McShane et al., 2024; Rovetta, 2025). While this analogy is useful, it can also generate ambiguity unless the notion of "surprise" is carefully specified. The term "surprise" can suggest a deviation from what is expected, but its meaning varies depending on the framework in which it is used. In everyday language, surprise is typically associated with a psychological reaction to an event perceived as unusual (Cambridge University Press, n.d.). In statistical contexts, however, at least two distinct notions should be clearly distinguished: a Shannon-type (information-theoretic) notion and a Bayesian notion. In the Shannon sense, surprisal is defined purely in terms of probability within a specified model (Cole et al., 2021). An event is "surprising" to the extent that it has "low" probability under the assumed system.
(Adjectival terms such as "surprising" and "low" are placed in quotation marks precisely because these evaluations depend on loss acceptability; they are not universal.) The S-value quantifies how unexpected a certain outcome (e.g., obtaining a test statistic as or more extreme than that observed in the experiment) is within the "small fictional world" induced by the chosen model, which is provisionally assumed true. This does not imply any claim about the truth of that world or require any revision of beliefs. By contrast, in a Bayesian framework, surprise is tied to belief updating (Baldi, 2002). An observation is surprising to the extent that it leads to a revision of prior beliefs, as formalized through the posterior distribution. In this sense, surprise is inherently epistemic: it reflects a change in what is considered plausible after observing the data (even if, in practical terms, usefulness is both a more achievable and more interesting goal; see Box, 1976). A simple example helps clarify this distinction. Suppose a football player widely regarded as poorly skilled performs an extraordinary bicycle kick and scores a remarkable goal. This event may be perceived as "highly surprising" in the sense that it is "rare" given one's expectation, that is, it diverges "a lot" from one's expectation. However, this does not necessarily imply a revision of one's overall judgment about the player's ability: the event may instead be interpreted as an isolated occurrence or a matter of chance, without altering the prior belief that the player is poorly skilled. Analogously, within the frequentist framework, observing a result with a high S-value indicates that a certain outcome is rare under the assumed model, without by itself mandating any revision of beliefs. In this regard, it should be noted that frequentist and Bayesian approaches are therefore not in opposition, but can be better understood as complementary components of a broader inferential framework.
Indeed, decisions about loss acceptability inevitably incorporate prior considerations and beliefs, including potential revisions (Greenland, 2022; Rovetta and Mansournia, 2024); and, as emphasized by Box (1980), model assumptions themselves can be viewed as fixed priors.

10. ANOTHER DEFINITION OF SURPRISE

A distinct interpretation of surprise can be derived from its ordinary usage: an event is surprising if it produces (a sense of) surprise (Cambridge University Press, n.d.). One may speak of a "surprising" hypothesis not because the hypothesis itself is "rare", but because, when evaluated against the observed data, it generates a certain degree of surprise. Operationally, one can fix the observed data and the set of assumptions A, and then vary the hypothesis under consideration (H_1, H_2, ...) to assess how much refutational information each hypothesis produces given A and the data (Msaouel, 2024; Rovetta, 2025b,c). Under this interpretation, the surprisal s becomes a function s = s(H_i; A, D) of the hypothesis H_i given the observed data D and the assumptions in A. Crucially, it does not refer to the rarity of the hypothesis itself (which is assumed 100% true within the model), but to the amount of refutational information (in bits) that the hypothesis yields against its real-world acceptability when confronted with the observed data, holding the assumptions A constant. Within this perspective, the surprise does not arise inside any single model, but emerges from a comparison process. It is experienced at the level of the analyst, who evaluates how each hypothesis fares when confronted with the data. The numerical outputs produced within each model (e.g., the S-value) serve as mere indicators, quantifying the degree of refutational information associated with each hypothesis under the assumptions specified by the analyst; but the practical interpretation of that information remains anchored in the real world, through the overall acceptability evaluation.
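This hypothesis-varying surprisal can be sketched with a coin-style stand-in; the observed data (9 heads in 10 tosses) and the hypothesized success probabilities θ_i are illustrative assumptions, not values from the text:

```python
from math import comb, log2

def p_upper(heads: int, tosses: int, theta: float) -> float:
    """One-sided Pr(S >= heads) for S ~ Binomial(tosses, theta)."""
    return sum(comb(tosses, k) * theta**k * (1 - theta)**(tosses - k)
               for k in range(heads, tosses + 1))

def surprisal(heads: int, tosses: int, theta: float) -> float:
    """s(H_i; A, D): refutational bits against the acceptability of
    the point hypothesis theta, holding data D and assumptions A fixed."""
    return -log2(p_upper(heads, tosses, theta))

# Fixed data D, varying hypothesis H_i: the comparison process at work.
for theta in (0.5, 0.7, 0.9):
    print(f"theta = {theta}: s = {surprisal(9, 10, theta):.2f} bits")
```

Hypotheses that make the observed data typical accrue little refutational information; those that make it atypical accrue much more, which is what the analyst compares when forming the acceptability judgment.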
To illustrate, suppose we observe a marked clinical improvement in a patient following administration of a treatment (the observed data D). The set of assumptions A encodes the idealized data-generating process, including correct outcome measurement, patient adherence, absence of major confounding factors, and a stable clinical condition unrelated to external interventions. Now consider two alternative hypotheses: under H_0, the treatment has zero effect; under H_1, the treatment has a “large” effect. Holding D and A fixed, H_0 becomes “very” surprising when evaluated against the data, whereas H_1 does not (as it is “very” compatible with D). Importantly, the sense of surprise associated with H_0 does not arise because H_0 is intrinsically “rare” or “unlikely”; rather, it emerges from the mismatch between the expectations implied by A together with H_0 and the observed data. In other words, given the assumptions in A and compared to H_1, H_0 generates a higher amount of refutational information against its own real-world acceptability when confronted with the data, despite being treated as provisionally true within the model.

11. SOME CRITICALITIES OF THE SHANNONIAN SURPRISE

It is worth noting that previous works - including my own (e.g., Rovetta, 2025) - have made use of the notion of Shannon-type surprise. However, upon closer consideration, this interpretation can present criticalities. The main concern arises from the well-known inversion fallacy, whereby the probability of an outcome under a hypothesis is mistakenly interpreted as the probability of the hypothesis given the outcome (Greenland et al., 2016; Rovetta and Mansournia, 2025a). The former is a frequentist quantity, defined within the assumed model; the latter is a fully Bayesian construct, which requires the explicit specification of additional prior distributions beyond the standard modeling assumptions (Greenland et al., 2016).
In this light, the concept of Shannonian surprise is perfectly defensible, but its use carries a non-negligible risk of encouraging such misinterpretations - especially in applied fields such as clinical research, where statistical concepts are often communicated to a broader audience. For this reason, although it is important to clarify the proper interpretation of this notion wherever it appears, I do not recommend its routine use in contexts that are particularly prone to highly consequential misunderstanding. Its application may be more appropriate in technically oriented statistical work, where the audience is equipped to distinguish clearly between model-based quantities, decision-making, and epistemic probabilities.

12. FURTHER WARNINGS ON COMPATIBILITY EVALUATIONS

Under the present framework, it is legitimate to refer to the compatibility of hypotheses with the data - especially when comparing multiple target hypotheses while holding the data and the other assumptions fixed in order to inform a real-world acceptability evaluation. In such cases, the analyst may examine how the same observed data align with different hypotheses (e.g., H_0, H_1, H_2, …), thereby assessing their relative compatibility within a common modeling structure - according to the rules imposed by that structure (e.g., the test statistic under the chosen working model). However, this formulation also carries a potential risk of misinterpretation: referring to the ‘compatibility of a hypothesis with the data’ may inadvertently suggest an evaluation of the hypothesis itself, thereby encouraging the inversion fallacy. For this reason, in settings where such confusion is very likely and consequential, it may be preferable to adopt an even more cautious formulation and speak instead of the compatibility of the data with a given hypothesis under A.
This ‘data-model’ compatibility phrasing may better reflect the frequentist logic of the approach: the analyst considers the observed data and evaluates their consistency under each hypothetical model defined by A together with H_i, without attributing probabilistic statements to the hypotheses themselves. This perspective preserves conceptual clarity while reducing the risk of inadvertently importing interpretations that belong to a different inferential framework.

13. ON THE “LEAST CONDITIONAL” APPROACH

As stressed by previous research (e.g., Amrhein et al., 2019; Rafi and Greenland, 2020; Rovetta and Mansournia, 2024), only a minority of empirical investigations are (or should be) conceived with the explicit goal of formal decision-making based on fully specified loss structures. In most research settings - particularly in clinical and epidemiological studies - the primary objective is (or should be) to describe and summarize the relationship between data and hypotheses in a transparent and interpretable manner. This caution is essential to avoid overinterpretations that lead to wasted resources, reduced public trust, and, in sensitive fields such as medicine, harm to the most vulnerable stakeholders (Bann et al., 2024; Ting and Greenland, 2024). In such cases, a less conditional approach is more appropriate (Rafi and Greenland, 2020; Rovetta and Mansournia, 2024). The analysis proceeds by evaluating the observed data under specified assumptions A and hypotheses H, and by reporting measures of compatibility or refutational information that characterize this relationship. The aim is not to deliver definitive judgments about acceptability or to prescribe decisions, but to provide coherent and calibrated summaries that can be meaningfully interpreted and compared across studies. Such summaries are especially valuable in the broader process of evidence synthesis.
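A less conditional report of this kind can be sketched as a compatibility profile: holding the data and the assumptions fixed (here, hypothetically, a normal working model with invented numbers), the analyst computes P- and S-values across a grid of target hypotheses and reports the whole curve rather than a single decision:

```python
import math

def compatibility(h, observed, se):
    """Two-sided P-value (compatibility of the data D with hypothesis H
    under assumptions A) and its S-value, for a normal working model."""
    z = (observed - h) / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return p, -math.log2(p)

observed, se = 10.0, 2.0                       # hypothetical data and variability
grid = [h / 10 for h in range(0, 201)]         # candidate effect sizes 0.0 .. 20.0
profile = {h: compatibility(h, observed, se) for h in grid}

# Hypotheses with p > 0.05 (s below ~4.32 bits) form a 95% compatibility interval
interval = [h for h, (p, s) in profile.items() if p > 0.05]
print(f"95% compatibility interval: [{min(interval)}, {max(interval)}]")
```

The endpoints coincide with the familiar 95% interval (observed ± 1.96 standard errors), but under this reading they merely summarize graded compatibility across the profile; they do not license a binary decision.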
Systematic reviews and meta-analyses rely on the availability of interpretable and comparable statistical outputs across individual studies; accordingly, emphasizing transparent descriptions of data-model relationships is more aligned with the cumulative nature of scientific inference than attempting to embed complex and often uncertain loss structures within single studies (Greenland, 2025; Higgs and Amrhein, 2025; Rovetta et al., 2025b).

REFERENCES

1. Amrhein, V., Korner-Nievergelt, F., & Roth, T. (2017). The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. PeerJ, 5, e3544. https://doi.org/10.7717/peerj.3544
2. Amrhein, V., Trafimow, D., & Greenland, S. (2019). Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don't Expect Replication. The American Statistician, 73, 262–270. https://www.jstor.org/stable/48783711
3. Baldi, P. (2002). A Computational Theory of Surprise. In: Blaum, M., Farrell, P.G., van Tilborg, H.C.A. (eds) Information, Coding and Mathematics. The Springer International Series in Engineering and Computer Science, vol 687. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-3585-7_1
4. Bann, D., Courtin, E., Davies, N. M., & Wright, L. (2024). Dialling back 'impact' claims: researchers should not be compelled to make policy claims based on single studies. International Journal of Epidemiology, 53(1), dyad181. https://doi.org/10.1093/ije/dyad181
5. Box, G.E.P. (1976). Science and Statistics. Journal of the American Statistical Association, 71(356), 791–799. https://doi.org/10.2307/2286841
6. Box, G.E.P. (1980). Sampling and Bayes' Inference in Scientific Modelling and Robustness. Journal of the Royal Statistical Society. Series A (General), 143(4), 383–430. https://doi.org/10.2307/2982063
7. Cambridge University Press. (n.d.). Surprise. In Cambridge Dictionary.
Retrieved March 28, 2026, from https://dictionary.cambridge.org/it/dizionario/inglese/surprise
8. Cole, S. R., Edwards, J. K., & Greenland, S. (2021). Surprise!. American Journal of Epidemiology, 190(2), 191–193. https://doi.org/10.1093/aje/kwaa136
9. Cox, D.R., & Hinkley, D.V. (1974). Theoretical Statistics (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b14832
10. Evans, M., & Moshonov, H. (2006). Checking for prior-data conflict. Bayesian Analysis, 1(4), 893–914. https://doi.org/10.1214/06-BA129
11. Fisher, R. (1955). Statistical Methods and Scientific Induction. Journal of the Royal Statistical Society. Series B (Methodological), 17(1), 69–78. http://www.jstor.org/stable/2983785
12. Glossary of Statistical Terms. (n.d.). SticiGui — UC Berkeley Department of Statistics. Retrieved March 27, 2026, from https://www.stat.berkeley.edu/~stark/SticiGui/Text/gloss.htm
13. Gelman, A., & Hennig, C. (2017). Beyond subjective and objective in statistics. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180(4), 967–1033. https://doi.org/10.1111/rssa.12276
14. Good, I. J. (1952). Rational Decisions. Journal of the Royal Statistical Society. Series B (Methodological), 14(1), 107–114. http://www.jstor.org/stable/2984087
15. Greenland, S. (2012). Transparency and disclosure, neutrality and balance: shared values or just shared words?. Journal of Epidemiology and Community Health, 66(11), 967–970. https://doi.org/10.1136/jech-2011-200459
16. Greenland, S. (2021). Analysis goals, error-cost sensitivity, and analysis hacking: Essential considerations in hypothesis testing and multiple comparisons. Paediatric and Perinatal Epidemiology, 35(1), 8–23. https://doi.org/10.1111/ppe.12711
17. Greenland, S. (2022). The causal foundations of applied probability and statistics. In J. Pearl (Ed.), Probabilistic and causal inference: The works of Judea Pearl (pp. 605–624). Association for Computing Machinery. https://doi.org/10.1145/3501714.3501747
18. Greenland, S. (2023). Divergence versus decision P-values: A distinction worth making in theory and keeping in practice: Or, how divergence P-values measure evidence even when decision P-values do not. Scandinavian Journal of Statistics, 50(1), 54–88. https://doi.org/10.1111/sjos.12625
19. Greenland, S. (2025). Statistical Methods: Basic Concepts, Interpretations, and Cautions. In: Ahrens, W., Pigeot, I. (eds) Handbook of Epidemiology. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6625-3_54-1
20. Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3
21. Harper, D. (n.d.). Etymology of incompatible. Online Etymology Dictionary. Retrieved March 27, 2026, from https://www.etymonline.com/word/incompatible
22. Hennig, C. (2010). Mathematical models and reality: A constructivist perspective. Foundations of Science, 15(1), 29–48. https://doi.org/10.1007/s10699-009-9167-x
23. Hennig, C. (2024). Probability Models in Statistical Data Analysis: Uses, Interpretations, Frequentism-as-Model. In: Sriraman, B. (eds) Handbook of the History and Philosophy of Mathematical Practice. Springer, Cham. https://doi.org/10.1007/978-3-031-40846-5_105
24. Higgs, M. D., & Amrhein, V. (2025). Giving less power to statistical power. Laboratory Animals, 59(6), 714–721. https://doi.org/10.1177/00236772251331680
25. McShane, B. B., Bradlow, E. T., Lynch, J. G., & Meyer, R. J. (2024). “Statistical Significance” and Statistical Reporting: Moving Beyond Binary. Journal of Marketing, 88(3), 1–19. https://doi.org/10.1177/00222429231216910
26. Msaouel, P. (2022). The Big Data Paradox in Clinical Practice. Cancer Investigation, 40(7), 567–576. https://doi.org/10.1080/07357907.2022.2084621
27. Msaouel, P. (2024). Decoding Randomized Controlled Trials: An Information Science Perspective. Harvard Data Science Review, 6(2). https://doi.org/10.1162/99608f92.49e3c961
28. Neyman, J. (1977). Frequentist Probability and Frequentist Statistics. Synthese, 36(1), 97–131. http://www.jstor.org/stable/20115217
29. Rafi, Z., & Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology, 20(1), 244. https://doi.org/10.1186/s12874-020-01105-9
30. Rovetta, A. (2025). S-values and Surprisal Intervals to Replace P-values and Confidence Intervals. REVSTAT-Statistical Journal, 23(4), 491–502. https://doi.org/10.57805/revstat.v23i4.669
31. Rovetta, A., & Mansournia, M. A. (2024). P>0.05 Is Good: The NORD-h Protocol for Several Hypothesis Analysis Based on Known Risks, Costs, and Benefits. Journal of Preventive Medicine and Public Health, 57(6), 511–520. https://doi.org/10.3961/jpmph.24.250
32. Rovetta, A., & Mansournia, M. A. (2025). Common wrong beliefs about statistical testing: Recent trends in biomedical sciences. Current Opinion in Epidemiology and Public Health, 4(3), 23–29. https://doi.org/10.1097/PXH.0000000000000054
33. Rovetta, A., Mansournia, M. A., Stovitz, S. D., Adams, W. M., & Greenland, S. (2025a). Interpreting p values and interval estimates based on practical relevance: guidance for the sports medicine clinician. British Journal of Sports Medicine, 59(24), bjsports-2024-109357. https://doi.org/10.1136/bjsports-2024-109357
34. Rovetta, A., Stovitz, S. D., & Mansournia, M. A. (2025b). Compatibility and S values versus traditional reporting in epidemiological studies. Current Opinion in Epidemiology and Public Health, 4(3), 40–46. https://doi.org/10.1097/PXH.0000000000000057
35. The origin of incompatibility: From past to present. (n.d.). Wordpandit. Retrieved March 27, 2026, from https://wordpandit.com/wpt_vocabulary/the-origin-of-incompatibility-from-past-to-present/
36. Ting, C., & Greenland, S. (2024). Forcing a deterministic frame on probabilistic phenomena: A communication blind spot in media coverage of the “replication crisis”. Science Communication, 46(5), 672–684. https://doi.org/10.1177/10755470241239947
37. Wasserstein, R. L., & Lazar, N. A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
