A General Model Validation and Testing Tool

Kevin Vanslette, Tony Tohme, and Kamal Youcef-Toumi
Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge MA 02139, USA
October 28, 2019

Abstract

We construct and propose the “Bayesian Validation Metric” (BVM) as a general model validation and testing tool. We find the BVM to be capable of representing all of the standard validation metrics (square error, reliability, probability of agreement, frequentist, area, probability density comparison, statistical hypothesis testing, and Bayesian model testing) as special cases and find that it can be used to improve, generalize, or further quantify their uncertainties. Thus, the BVM allows us to assess the similarities and differences between existing validation metrics in a new light. The BVM has the capacity to allow users to invent and select models according to novel validation requirements. We formulate and test a few novel compound validation metrics that improve upon other validation metrics in the literature. Further, we construct the BVM Ratio for the purpose of quantifying model selection under user-defined definitions of agreement in the presence or absence of uncertainty. This construction generalizes the Bayesian model testing framework.

1 Introduction

Pivotal to the scientific and engineering process is the testing of hypotheses and the validation of mathematical and computational models. Without a validation procedure, there is no reason to believe that a model, which has been designed to serve as a convenient representation of physical and/or human-made processes, has fulfilled its functional purpose. Thus, validation is the result of the positive justification of a model’s representation of relevant features in the real world [1, 2, 3].
We are interested in studying the validation of multivariate computational models that represent uncertain situations and/or data. It is understood that complete certainty is a special case of uncertainty. The uncertainty in a model or data set may originate from stochasticity, model parameter and input data uncertainty, measurement uncertainty, or other possible aleatoric or epistemic sources of uncertainty. Each of the following data modeling schemes may include quantifiable amounts of uncertainty (or certainty) that we would like to validate on the basis of a set of validation data: neural networks and AI models, machine learning models, Gaussian Process Regression models [4], polynomial chaos and other surrogate models [5, 6, 7, 8], spatial and time series stochastic models, physics based models (usually solutions to differential equations), engineering based models (which are sufficiently abstracted physics based models), Monte Carlo simulation models [9, 10], and more. Model output uncertainties may be quantified through uncertainty propagation techniques (that may or may not include verification, calibration, and validation) [4, 6, 7, 8, 11, 12, 13, 14, 16, 21, 22]. There exist several validation metrics.
Each metric is designed to compare features of a model-data pair to quantify validation: square error compares the difference in the data and model values in a point to point or interval fashion [23]; the reliability metric [24] and the probability of agreement [25] compare continuous model outputs and data expectation values (the model reliability metric was extended past expectation values in [26]); the frequentist validation metric [27, 28] and statistical hypothesis testing compare data and model test statistics; the area metric compares the cumulative distribution of the model to the estimated cumulative distribution of the data [15, 16, 17, 18, 19, 20]; probability density function (pdf) comparison metrics, such as the KL divergence, represent the “closeness” between pdfs; and Bayesian model testing compares the posterior probability that each model would correctly output the observed data [12, 13, 29, 30, 31, 32, 33]. A detailed review of the majority of these metrics may be found in [34, 35, 36, 37] and the references therein. In particular, [35] is an up to date review that considers many validation metrics in the cases of data and model certainty, data uncertainty and model certainty, and data and model uncertainty. Further, they comment on the importance of model validation while admitting its contextual difficulties because often validation is “strongly application, quantity of interest, and modeler, dependent”. Our article addresses some of these contextual difficulties by taking a bottom up approach.

To assist the comparison of the positive and negative aspects of (most of) the above validation metrics, reference [34] outlines six “desirable validation criteria” that a validation metric might have (they extend [27, 15]). One conclusion from [34] is that none of the available metrics simultaneously satisfies all six desirable validation criteria.
We summarize the most important features of the desirable validation criteria with the following validation criterion:

1. A validation metric should be a statistically quantified quantitative measure (as opposed to a qualitative measure) of the agreement between (general) model predictions and data pairs, in the presence or absence of uncertainty.

The desire for objectivity, “that a metric will produce the same assessment for every analyst independent of their individual preferences” [34], is difficult to satisfy because there are no rules in place to guide a modeler toward selecting one validation metric over another. For this reason the individual might simply choose a metric based on their preferences, or worse, be tempted to base their decision on which validation metric gives them the most favorable evaluation. Given individuals may choose different validation metrics for the same model-data pair, it is possible for individuals to impose accuracy requirements that are incompatible with one another and arrive at different conclusions regarding the validity of the same model-data pair. As the final goal is objectivity, when possible, a map between the accuracy requirements should be constructed such that the validation metrics yield consistent evaluations of model-data validity when applicable.

Further, Liu et al. [34] suggest that there is no agreed upon unified model-data comparison function. Even including the results of this article, we expect this statement to hold as it is extremely difficult to guess the prior information about the utility of a model an analyst may be required to include in the validation of a model-data pair. For instance, given arbitrary data: “What features of the data are relevant to capture with a model?”, “Of these features, are some more relevant than others?”, and “What accuracy is required for the model to be valid?”.
Agreement and validation are ultimately human-made concepts designed for the purpose of expressing that “in general, not every feature or statistic between a model-data pair need to be equal to conserve the utility of the model”. For some model-data pairs, all that may be required is that the model and data averages closely match within uncertainty, while for others, one may require that the model can accurately reproduce the probability distribution of a data set as a whole (as one would do to physically model a noisy measurement device). Given the wide variety of data and the large number of different inferences (and thus models and hypotheses) that one may be interested in drawing from a given data source, i.e. the context of the model-data pair, we do not expect any single set of comparison functions, statistics, or values to be equally relevant and maximally useful for all possible model-data contexts. This, however, does not stop us from quantifying the validity of a model-data pair given any arbitrary comparison function and with any arbitrary definition of agreement.

In this article, we construct and propose the “Bayesian Validation Metric” (BVM) as a general model validation and testing tool. We design the BVM to adhere to the desired validation criterion (1.) by using “four BVM inputs”: the model and data comparison values, the model output and data pdfs, the comparison value function, and the agreement function. The comparison value function is a function of model output and data comparison values that provides the desired quantitative comparison measure, e.g. square difference. Using the model output pdf and the data pdf, the value of the comparison value function is statistically quantified. In turn, the agreement function provides an accept/reject rule and effectively wraps the previous three BVM inputs together to give the BVM.
From this, the BVM outputs “the probability the model and data agree”, where agreement is a user-defined Boolean function that meets, or does not meet, accuracy requirements between model and data comparison values. Thus, the BVM meets the desired validation criterion (1.) for arbitrary comparison value functions, arbitrary definitions of agreement, and in principle for arbitrary data types such as integers, vectors, tensors, strings, pictures, or others.

The BVM can be used to represent all of the aforementioned validation metrics as special cases. This allows us to compare and contrast the validation metrics from the literature in a new light. We find the conditions under which several of the current validation metrics are effectively equal to one another, which improves the objectivity of the current validation procedure. In brief, we find that the frequentist metric (using natural definitions of agreement) is equal to the reliability metric, and that the probabilities from Bayesian model testing are equal to the probabilities of the improved model reliability metric [26] when one demands exact equality of the (uncertain) model-data comparison values. Because probability can represent both certain and uncertain situations, so can the BVM. Thus, these “special case” metrics can be generalized to quantify certain or uncertain cases, and even be combined into more complex validation requirements using the BVM framework. Thus, the BVM provides a standardized framework to improve, generalize, or further quantify these validation metrics.

Figure 1: This is the depiction of a common model validation scenario. The model line is trained on noisy data (not depicted in the figure) and is to be compared to a set of validation data. As both the model line and the data are uncertain in general, any quantitative measure (i.e. the comparison function) between these comparison values inherits this uncertainty.
Thus, any accept/reject rule on the basis of these uncertain comparison function values is uncertain as well. A visual inspection of this graph seems to indicate, up to statistical fluctuation, that the comparison values of the model and data more or less (or probably) agree, but this intuitive measure has yet to be quantified. Graphic adapted from [38].

By constructing the “BVM ratio”, we generalize the Bayesian model testing framework [29]. In the Bayesian model testing framework, one constructs the Bayes ratio to rank models according to the ratio of their posterior probabilities given the data. We show that these posterior probabilities are equal to a special case of the BVM under the definition of agreement that requires these uncertain model outputs and data to match exactly. Thus, nothing prevents us from extending the logic used in the Bayesian model testing framework to our framework, and we construct the BVM ratio for the purpose of model selection under arbitrary definitions of agreement, i.e. for arbitrary validation scenarios.

The remainder of the article continues as follows. In Section 2 we construct the BVM by following our validation criterion. Through some edge cases, we show that the BVM satisfies both the six desirable validation criteria from [34] as well as our validation criterion (1.) in Section 3. In Section 4, we summarize the results derived in Appendix A. There, we incorporate all of the above standard validation metrics as special cases of the BVM, draw relationships between several of the validation metrics, provide improvements and generalizations to these metrics as is suggested by the functional form of the BVM, and construct the BVM ratio. In Section 5 we represent three novel validation metrics using the BVM and compare them to similar metrics from the literature.
Due to the proven capacity of our framework, we consider the BVM to be a general model validation and testing tool.

2 The Bayesian Validation Metric

For the remainder of the article we will use the following notation and language. We will let $\hat{y}$ denote the output of a model, $\hat{Y}$ a set of model outputs, $y$ a data point, and $Y$ a set of data points. The proposition $M$ essentially stands for “the model” or “coming from the model”, and $D$ stands for “the experiment” or “coming from the experiment”. We let $\hat{z}$ and $z$ represent the comparison quantities of interest, which pertain to the model and the data respectively. Further, we let the comparison quantities take general forms, such as multidimensional vectors, functions, or functionals (e.g. output values, expectation values, pdfs, ...), so we can represent any such pair of quantities we may wish to compare between the model and the data. When we refer to “the four BVM inputs” we mean: the comparison values $(\hat{z}, z)$, the model output and data pdf $\rho(\hat{z}, z \mid M, D)$, the comparison value function $f(\hat{z}, z)$, and the agreement function $B = B(f(\hat{z}, z))$. The (denoted) integrals may be integrals or sums depending on the nature of the variable being summed or integrated over, which is to be understood from the discrete or continuous context of the inference at hand. The dot “$\cdot$” represents standard multiplication, which is mainly used to improve aesthetics.

Performing uncertainty propagation through a model results in a model output probability (density) distribution $\rho(\hat{y} \mid M, D)$ that ultimately we would like to validate by comparing it to an uncertain validation data source $\rho(y \mid D)$, to see if they agree (as depicted in Figure 1). The immediate question is, however, “What values do we want to compare and what do we mean by agree?”.
Given the wide variety of data and the large number of different inferences (and thus models and hypotheses) that one may be interested in drawing from a given data source, i.e. the context of the model-data pair, we do not expect any single set of comparison functions to be equally relevant and maximally useful for all possible model-data contexts. In light of this, we instead quantify the validity of a model-data pair given any arbitrary comparison value function and with any arbitrary definition of agreement.

2.1 Derivation

Here we will begin constructing the Bayesian Validation Metric (BVM). To capture the concept of what we might mean by agree, we define $\hat{z}$ and $z$ to agree, $A$, when the Boolean expression, $B$, is true. Both $A$ and $B$ are defined by the modeler and their prior knowledge of the context of the model-data pair. Naturally then, the agreement function $B = B(f(\hat{z}, z)) = B(\hat{z}, z)$ is some function or functional of a comparison value function $f(\hat{z}, z)$. Given the values of $\hat{z}$ and $z$ are known, i.e. certain, we quantify agreement using a probability distribution that assigns certainty,

\[ p(A \mid \hat{z}, M, z, D) = \Theta\big(B(\hat{z}, z)\big). \tag{1} \]

The indicator function $\Theta(B)$ is defined to equal unity if $B$ evaluates to “true” (i.e. “agreeing”) and equal to zero otherwise. Thus in the completely certain case, we are certain as to whether the model and data comparison values agree, or do not agree, as defined by $B$ and the deterministic evaluation of $f(\hat{z}, z)$.^1 We will call $p(A \mid \hat{z}, M, z, D)$ the “agreement kernel”.

Given that in general the comparison values are uncertain, and quantified by $\rho(\hat{z}, z \mid M, D)$, the probability the comparison values agree, as defined by $B$ and $f(\hat{z}, z)$, is equal to,

\[ p(A \mid M, D) = \int_{\hat{z}, z} p(A \mid \hat{z}, M, z, D) \cdot \rho(\hat{z}, z \mid M, D) \, d\hat{z} \, dz, \tag{2} \]

\[ \xrightarrow{\text{ind.}} \int_{\hat{z}, z} \rho(\hat{z} \mid M, D) \cdot \Theta\big(B(\hat{z}, z)\big) \cdot \rho(z \mid D) \, d\hat{z} \, dz, \tag{3} \]

which is a marginalization over the spaces of $(\hat{z}, z)$.^2 Equation (2) is the general form of the Bayesian Validation Metric (BVM). Because $A$ is discrete, the BVM is a probability rather than a probability density and it therefore falls in the range $0 \le p(A \mid M, D) \le 1$. Equation (3) explicitly assumes that the uncertainty in the data is independent of the model, i.e. $\rho(z \mid M, \hat{z}, D) = \rho(z \mid D)$, meaning that the data $D$ does not take $\hat{z}$ or the model $M$ (that it is currently being compared to) as inputs.^3 This is a relatively common scenario so it is stated explicitly.

The BVM may be given a geometric interpretation as the projection of two probability vectors (potentially of unequal length) in a space whose overlap is defined by the agreement kernel; this is an inner product if $B(\hat{z}, z)$ is symmetric in its arguments. The BVM may be computed using any of the well known computational integration methods.

2.2 An identical representation

In some cases, it is useful to work directly with the probability density $\rho(f \mid M, D)$, which quantifies the probability the comparison value function $f(\hat{z}, z)$ takes the value $f$ due to uncertainty in its inputs. This pdf is independent of any user-defined accuracy requirement. We will call this pdf the comparison value probability density, which is equal to,

\[ \rho(f \mid M, D) = \int_{\hat{z}, z} \delta\big(f - f(\hat{z}, z)\big) \cdot \rho(\hat{z}, z \mid M, D) \, d\hat{z} \, dz. \tag{4} \]

This is the net uncertainty propagated through the comparison value function $f(\hat{z}, z)$ from the uncertain model and data comparison values. All of the expectation values that are associated with $f$ may be generated from this pdf. If one imposes an accuracy requirement with a Boolean expression $B = B(f)$ (i.e. defining agreement according to the value of $f$), the resulting accumulated probability is the BVM. That is, the BVM, i.e. equation (2), may equally be expressed as,

\[ p(A \mid M, D) = \int_{f} \rho(f \mid M, D) \cdot \Theta\big(B(f)\big) \, df, \tag{5} \]

which is proven through substitution and marginalization over $f$,

\[ p(A \mid M, D) = \int_{f} \left( \int_{\hat{z}, z} \delta\big(f - f(\hat{z}, z)\big) \cdot \rho(\hat{z}, z \mid M, D) \, d\hat{z} \, dz \right) \cdot \Theta\big(B(f)\big) \, df = \int_{\hat{z}, z} \Theta\big(B(\hat{z}, z)\big) \cdot \rho(\hat{z}, z \mid M, D) \, d\hat{z} \, dz. \tag{6} \]

^1 This binary yet probabilistic definition of agreement turns out to be completely satisfactory for our current purposes. As is briefly discussed later, the sharp boundaries of the indicator function can be smoothed out without employing fuzzy logic by allowing parameters in the Boolean function to themselves be uncertain and marginalized over.

^2 Recall that the propositions in the probability distributions $\rho(z \mid D)$ and $\rho(\hat{z} \mid M, D)$ are completely arbitrary (in some cases requiring propagation from $\rho(y \mid D)$ and $\rho(\hat{y} \mid M, D)$); they could be continuous, discrete (with order), categorical variables (no well defined order, e.g. strings, pictures, ...), or a mix.

^3 In a controls system this may not be the case, as the model may interact with the system of interest. In such a case this constraint may be lifted and one should use (2) instead. The joint probability $\rho(\hat{z}, z \mid M, D)$ can in principle be used to account for the correlations between the model (the controller or reference) and the data (the measured response of the system being controlled) in a controls setting.

2.3 Importance

The BVM allows the user to, in principle, quantify the probability the model and the data agree with one another under arbitrary comparison value functions and with arbitrary definitions of agreement. The BVM can therefore be used to fully quantify the probability of agreement between arbitrary model and data types using novel or existing comparison value functions and definitions of agreement.
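As an illustration of how the four BVM inputs fit together, the following Python sketch estimates equation (3) by Monte Carlo sampling and, in doing so, also realizes the equivalent form of equation (5), since the sampled values of $f$ are draws from the comparison value probability density of equation (4). It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `bvm_monte_carlo`, the Gaussian pdfs, the absolute-difference comparison function, and the tolerance `eps` are all hypothetical choices made so that a closed-form answer is available for comparison.

```python
from math import erf, sqrt

import numpy as np


def bvm_monte_carlo(sample_model, sample_data, compare, agree, n=200_000, seed=0):
    """Monte Carlo estimate of p(A | M, D), assuming independence as in equation (3)."""
    rng = np.random.default_rng(seed)
    z_hat = sample_model(rng, n)      # draws from rho(z_hat | M, D)
    z = sample_data(rng, n)           # draws from rho(z | D)
    f = compare(z_hat, z)             # draws from rho(f | M, D), equation (4)
    return float(np.mean(agree(f)))   # mean of the indicator Theta(B(f)), equation (5)


# Hypothetical inputs, chosen only for illustration.
mu_m, sd_m = 1.0, 0.2   # model comparison value ~ Normal(mu_m, sd_m)
mu_d, sd_d = 1.1, 0.3   # data comparison value  ~ Normal(mu_d, sd_d)
eps = 0.5               # accuracy requirement: agree if |z_hat - z| < eps

p_mc = bvm_monte_carlo(
    sample_model=lambda rng, n: rng.normal(mu_m, sd_m, n),
    sample_data=lambda rng, n: rng.normal(mu_d, sd_d, n),
    compare=lambda z_hat, z: np.abs(z_hat - z),
    agree=lambda f: f < eps,
)


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


# Closed-form check: z_hat - z is Normal(mu_m - mu_d, sqrt(sd_m^2 + sd_d^2)).
d, s = mu_m - mu_d, sqrt(sd_m**2 + sd_d**2)
p_exact = normal_cdf((eps - d) / s) - normal_cdf((-eps - d) / s)

print(f"Monte Carlo BVM: {p_mc:.4f}, closed form: {p_exact:.4f}")
```

Because the samplers, the comparison function, and the Boolean rule are passed in as arguments, any of them can be swapped without touching the estimator, which mirrors the separation between the quantitative measure and the accept/reject rule.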
Thus, the problem of model-data validation may be reduced to the problem of finding the four BVM inputs in any model validation scenario.

2.4 Statistical Responsibility

When using the BVM framework, one should practice statistical responsibility by explicitly stating the definition of agreement that is implemented in the validation procedure. Although the flexibility of the BVM framework is a feature, different validation metrics often have different amounts of tolerance as to what constitutes “agreement”. Agreement according to one metric does not in general imply agreement according to another. Overly tolerant definitions of agreement have little resolution power and can only be used responsibly if a large degree of non-exactness between the model and data is permissible. In principle, the definition of agreement should be just as strict or stricter than it needs to be. By explicitly stating the definition of agreement alongside the BVM value, $p(A \mid M, D) \equiv p(A \mid M, D, B)$, one avoids statistical misrepresentation by not hiding one’s definition of agreement.

3 Meeting the desirable validation criterion

First we will describe how the BVM, equations (2) and (5), precisely matches our validation criterion (1.). As can be seen by equation (4), incorporated into the BVM is a statistically quantified quantitative measure that compares data and model outputs, $\rho(f \mid M, D)$. However, this pdf is in some sense lacking a context pertaining to the model-data pair. Not until an accept/reject rule is imparted on $\rho(f \mid M, D)$ does one define what is meant by agreement in the model-data context. Thus, the BVM only becomes the probability of agreement between the data and the model when the agreement function is also incorporated. The four BVM inputs are therefore adequate to satisfy (1.) as the BVM is a “statistically quantified quantitative measure ($f$) of agreement $p(A \mid M, D)$ between model predictions and data pairs $(\hat{z}, z)$, in the presence or absence of uncertainty $\rho(\hat{z}, z \mid M, D)$”.

There are a few more BVM concepts worth discussing before moving forward. We will show that the BVM is capable of handling general multidimensional model-data comparisons and that there are no conceptual issues when agreement is exact, i.e. $B$ is true iff $\hat{z} = z$, in the certain and uncertain cases. We will then make comments on the sense in which the BVM adheres to the full set of six desirable validation criteria given in [34] by discussing the criteria that are underrepresented in (1.).

3.1 Compound Booleans

Because Boolean operations between Boolean functions result in a Boolean function, the BVM is capable of handling multidimensional model-data comparisons. We will call a Boolean function with this property a “compound Boolean”. A compound Boolean function results from conjunctions (and, $\wedge$) and disjunctions (or, $\vee$) between a set of Boolean functions, e.g.,

\[ B(\{B_i\}) = \big( B_1(\hat{z}, z) \vee B_2(\hat{z}, z) \big) \wedge \big( B_3(\hat{z}, z) \vee B_4(\hat{z}, z) \big) \ldots, \tag{7} \]

where each $B_i(\hat{z}, z) = B_i(f_i(\hat{z}, z))$ may use a different comparison function $f_i(\hat{z}, z)$. Compound Booleans using conjunctions quantify the validity of entire model functions (random fields and/or multidimensional vectors) by assessing agreement between each of the model-data comparison field points simultaneously, i.e. over the comparison points 1 and points 2 and so on. The compound Booleans may be factored into their constituting Boolean functions using the standard product and sum rules of probability theory after being mapped to probabilities with the agreement kernel. One should be wary when defining an and Boolean that, if one of the Booleans is false, then the entire Boolean is false.
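The behavior of conjunctions and disjunctions in equation (7) can be sketched numerically. The example below is a hedged illustration rather than anything taken from the article: the five comparison points, the linear trend, the Gaussian noise levels, and the tolerance `eps` are hypothetical, and the points are drawn independently so that the product rule factorization can be checked directly. A hypothetical looser rule (at least all but one point agree) is included for contrast with the strict conjunction.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_points = 100_000, 5

# Hypothetical uncertain model outputs and data at five comparison points.
x = np.linspace(0.0, 1.0, n_points)
z_hat = 2.0 * x + rng.normal(0.0, 0.10, (n_samples, n_points))        # model draws
z = 2.0 * x + 0.05 + rng.normal(0.0, 0.15, (n_samples, n_points))     # data draws

eps = 0.4
B_i = np.abs(z_hat - z) < eps   # one Boolean per sample and comparison point

p_each = B_i.mean(axis=0)                            # BVM of each B_i on its own
p_and = B_i.all(axis=1).mean()                       # B_1 and B_2 and ... and B_5
p_or = B_i.any(axis=1).mean()                        # B_1 or B_2 or ... or B_5
p_most = (B_i.sum(axis=1) >= n_points - 1).mean()    # hypothetical looser rule

# With independent points the conjunction factors by the product rule.
p_product = float(np.prod(p_each))

print("per-point BVM:", np.round(p_each, 4))
print(f"and: {p_and:.4f}  product of per-point values: {p_product:.4f}")
print(f"at least {n_points - 1} of {n_points}: {p_most:.4f}  or: {p_or:.4f}")
```

The conjunction can never exceed the smallest per-point probability, which is the "all or nothing" effect in numerical form; when the comparison points are correlated, the product shortcut no longer applies and the joint samples must be used as in the `all` line.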
If this strict “all or nothing” validation requirement is not needed then other more flexible definitions of agreement may be instantiated instead (see “BVM Examples”).

3.2 The BVM under the conditions of exact agreement

We can calculate the BVM under the conditions of exact agreement in the completely certain and uncertain cases. Because the BVM is a probability rather than a probability density, the agreement kernel falls in the range $[0, 1]$. Under the conditions of exact agreement, that $B$ is only true when $\hat{z} = z$, the agreement kernel is $\Theta(B) = \Theta(\hat{z} = z) = \delta_{\hat{z}, z}$, which is the Kronecker delta, i.e. it is 0 or 1, but has continuous labels. As it is uncommon to deal with Kronecker deltas having continuous labels under integration, we will show that the BVM gives reasonable results under the condition of exact agreement in the case of complete certainty as well as in the general uncertain case.

Complete certainty and exact agreement

Complete certainty is represented using Dirac delta pdfs over the model and data comparison values. This gives the BVM,

\[ p(A \mid M, D) = \int_{\hat{z}, z} \rho(\hat{z} \mid M, D) \, \Theta\big(\hat{z} = z\big) \, \rho(z \mid D) \, d\hat{z} \, dz = \int_{\hat{z}, z} \delta(\hat{z} - \hat{z}_0) \cdot \delta_{\hat{z}, z} \cdot \delta(z - z_0) \, d\hat{z} \, dz, \tag{8} \]

where we are considering the model-data pair to agree iff the comparison values are exactly equal. Using the sifting property of the Dirac delta function, we find the reasonable result that,

\[ p(A \mid M, D) = \delta_{\hat{z}_0, z_0}, \tag{9} \]

which is equal to unity iff $\hat{z}_0$ and $z_0$, the definite values of $\hat{z}$ and $z$, are equal.

Uncertainty and exact agreement

In the uncertain case under the condition of exact agreement, the BVM is

\[ p(A \mid M, D) = \int_{\hat{z}, z} \rho(\hat{z} \mid M, D) \cdot \delta_{\hat{z}, z} \cdot \rho(z \mid D) \, d\hat{z} \, dz. \tag{10} \]

We will use the following trick to correctly interpret this integral. We will first let $B(\epsilon)$ be true if $z - \epsilon \le \hat{z} \le z + \epsilon$ and then take the limit as $\epsilon \to 0^+$ such that $\lim_{\epsilon \to 0^+} B(\epsilon) \to B$ when appropriate.
With this Boolean expression, the BVM is,

\[ p(A \mid M, D, \epsilon) = \int_{\hat{z}, z} \rho(\hat{z} \mid M, D) \, \Theta\big(z - \epsilon \le \hat{z} \le z + \epsilon\big) \, \rho(z \mid D) \, d\hat{z} \, dz = \int_{z} \rho(z \mid D) \left( \int_{z - \epsilon}^{z + \epsilon} \rho(\hat{z} \mid M, D) \, d\hat{z} \right) dz. \tag{11} \]

In the limit $\epsilon \to 0^+$, the term in the parentheses $\int_{z - \epsilon}^{z + \epsilon} \rho(\hat{z} \mid M, D) \, d\hat{z} \to p(\hat{z} = z \mid M, D) = \rho(\hat{z} = z \mid M, D) \, d\hat{z}$ by the definition of probabilities. This gives,

\[ p(A \mid M, D) = \int_{z} \rho(z \mid D) \big( p(\hat{z} = z \mid M, D) \big) \, dz = \left( \int_{z} \rho(\hat{z} = z \mid M, D) \, \rho(z \mid D) \, dz \right) d\hat{z} \equiv \rho(\hat{z} \equiv z \mid M, D) \, d\hat{z} = p(\hat{z} \equiv z \mid M, D), \tag{12} \]

which is understood to be the sum of the model and the data probabilities that jointly output exactly the same values. We see that the BVM in this case is proportional to $d\hat{z}$, $p(A \mid M, D) \to \rho(\hat{z} \equiv z \mid M, D) \, d\hat{z}$, in the general case of exact agreement, and therefore the BVM goes to zero unless the pdf $\rho(\hat{z} \equiv z \mid M, D) \propto \delta(\ldots)$. Thus, we recover the standard logical result for probability densities, $p(x) = \rho(x) \, dx \to 0$ unless it is offset by $\rho(x) \propto \delta(x)$. This result is easily generalized to the dependent case using $\rho(\hat{z}, z \mid M, D) = \rho(z \mid M, D) \, \rho(\hat{z} \mid z, M, D)$.

The result, equation (12), is no more surprising than (9) in principle. Due to the vast number of possibilities for continuous valued variables, having a pathological definition of exact agreement between continuous variables does not occur in practice. In a computational setting, $dx \to \Delta x$ becomes a finite difference and these infinitely improbable agreement conceptual issues are avoided. The Bayesian model testing framework avoids these issues by evaluating posterior odds ratios, in which case the measures, $d\hat{z}$, drop out.

3.3 Meeting underrepresented validation criteria

In this subsection we will discuss how the BVM also meets the validation criteria found in [34].
This is done by using the derived general and special cases of the BVM for each of the criteria which are underrepresented in (1.).

Perhaps the primary underrepresented criterion from [34] is their second. It states that “..the criteria used for determining whether a model is acceptable or not should not be a part of the metric which is expected to provide a quantitative measurement only.” We argue that the functional form of the BVM presented in equation (5) clearly demonstrates this feature as it factors into $\rho(f \mid M, D)$ and $\Theta(B(f))$. The comparison function $f(\hat{z}, z)$ represents the “objective quantitative measure” from their first criterion that is separate from the accept/reject rule, which is our agreement function $B(f)$; both require definition to ultimately evaluate the validity of a model. We see it as advantageous to quantify the probability the model is accepted or rejected through $B(f)$ due to the uncertainty in the value of $f$, which is the general case, and which gives the BVM as the result. As all of the validation metrics presented in [34] (and more) will be shown to be representable with the BVM, and thus placed on the same footing, we find our language of “comparison function” and “agreement function” to ultimately be more useful than a language that only considers comparison functions (without accept/reject rules) to be the validation metrics.

The third criterion in [34] is that ideally the metric should “degenerate to the value from a deterministic comparison between scalar values when uncertainty is absent”. This is indeed the case, as can be seen in equation (1) or in equations (2) and (5) by utilizing Dirac delta pdfs similar to their application in equation (8).

The fifth desirable validation criterion in [34] states that artificially widening probability distributions should not lead to higher rates of validation.
They find all but the frequentist metric to have this undesired feature; however, we later see that the frequentist metric may be considered a special case of the reliability metric (when reasonable accuracy requirements are imposed), meaning artificial widening can lead to higher rates of validation for more general instances of the frequentist metric. Further, we argue that artificially introducing uncertainty for the express purpose of passing a validation test is indistinguishable from scientific misconduct. If there is objective reason to include more uncertainty into the analysis, or if the circumstance for what constitutes validation has changed due to a change of context, and it happens to improve the rate of validation, so be it. This is a different context, model, or state of uncertainty than was originally proposed, so different rates of acceptance should be expected. Reducing the uncertainty of either the data or the model (the inputs) through additional measurements or changing the model may later prove the model valid or invalid when it may have been initially accepted. Thus, to meet this validation criterion, we simply assume the user is not engaging in scientific misconduct.

Finally, due to the results of the “Compound Booleans” section, their sixth criterion is met. Because the BVM (2) can be used to assess single or multidimensional controllable settings (see footnote number 3), we can assess global function validity (in or out of a controls setting). As they note, “This last feature is critical from the viewpoint of engineering design”.

Thus, the BVM satisfies both our validation criterion and the six desirable validation criteria outlined in [34]. This was accomplished by representing model-data validation as an inference problem using the four BVM inputs.

4 Representing and generalizing the known validation metrics with the BVM

This section is a review of the material found in Appendix A.
The following validation metrics will be represented with the BVM and then improved, generalized, and/or commented on: reliability/probability of agreement, improved reliability metric, frequentist, area metric, pdf comparison metrics, statistical hypothesis testing, and Bayesian model testing.

4.1 Representing the known validation metrics with the BVM

Table 1 outlines the values of the four BVM inputs that result in the BVM representing each of the well known validation metrics as special cases. The following notation is used for the comparison values $(\hat z, z)$. The brackets $\langle \ldots \rangle$ denote expectation values, $\mu$'s denote averaged values, $\hat y, y$ denote single values, $\hat Y, Y$ denote multidimensional (or many valued) values, $F_{\hat y}, F_y$ denote cumulative distribution functions ($F_{\hat y} = \int_{-\infty}^{\hat y} \rho(\hat y' \mid M, D)\, d\hat y'$), $S_{\hat y}, S_y$ denote test statistics, and $[-c_\alpha, c_\alpha]$ denotes the $1 - \alpha$ confidence interval of the data. In the agreement function column, an element listed as $B(f)$ means the creators of the metric intentionally left the definition of agreement unspecified; however, it is natural to assume it is a function of the comparison function $f$.

Table 1 also summarizes some of the similarities and differences between the known validation metrics. In particular, by looking at the validation metrics with the same type of comparison values, i.e. the reliability and frequentist metrics, or the improved reliability metric and Bayesian model testing, we can compare them directly. We see that if one lets the frequentist metric allow for more general input probability distributions and the use of a reasonable agreement function (i.e., $B(f)$ is true if $\lvert f \rvert < \epsilon$), then the frequentist metric is the reliability metric.
Further, in Bayesian model testing, if the agreement function $B(f)$ is loosened to accept $f < \epsilon$, then the pdfs that appear in the Bayesian model testing framework are equal to those of the improved reliability metric. This information improves the objectivity of the current validation procedure because we now have a map between validation metrics that were originally thought to be different.

Table 1: Specification of the four BVM inputs that give the other validation metrics as special cases.

| Metric | Comp. Values | Probs. | Comp. Func. | Agree. Func. |
|---|---|---|---|---|
| BVM | $\hat z,\ z$ | $\rho(\hat z \mid M, D),\ \rho(z \mid D)$ | $f(\hat z, z)$ | $B(f)$ |
| Reliability | $\langle \hat y \rangle,\ \mu_y$ | $\rho(\langle \hat y \rangle \mid M, D),\ \rho(\mu_y \mid D)$ | $\lvert \langle \hat y \rangle - \mu_y \rvert$ | $f < \epsilon$ |
| Imp. Reli. | $\hat Y,\ Y$ | $\rho(\hat Y \mid M, D),\ \rho(Y \mid D)$ | $\lvert \hat Y - Y \rvert$ | $f < \epsilon$ |
| Frequentist | $\langle \hat y \rangle,\ \mu_y$ | $\delta(\langle \hat y \rangle - \langle \hat y \rangle_0),\ \text{Stud. } t$ | $\langle \hat y \rangle - \mu_y$ | $B(f)$ |
| Area | $F_{\hat y},\ F_y$ | $\delta(F_{\hat y} - F^0_{\hat y}),\ \delta(F_y - F^0_y)$ | $\int \lvert F_{\hat y}(y) - F_y(y) \rvert\, dy$ | $B(f)$ |
| Pdf Comp. | $\rho_M,\ \rho_D$ | $\delta(\rho_M - \rho^0_M),\ \delta(\rho_D - \rho^0_D)$ | $G(\rho_D \,\Vert\, \rho_M)$ | $B(f)$ |
| Stat. Hyp. | $S_{\hat y},\ S_y$ | $\rho(S_{\hat y} \mid M = D),\ \rho(S_y \mid D)$ | $S_{\hat y}$ | $S_{\hat y} \in [-c_\alpha, c_\alpha]$ |
| Bayes Model | $\hat Y,\ Y$ | $\rho(\hat Y \mid M, D),\ \rho(Y \mid D)$ | $\lvert \hat Y - Y \rvert$ | $f = 0$ |

The column headings are the four BVM input values: Comparison Values $(\hat z, z)$, Probabilities $(\rho(\hat z \mid M, D), \rho(z \mid D))$, Comparison Function $f = f(\hat z, z)$, and the Boolean Agreement Function $B(f)$. The row headings read: Reliability, Improved Reliability, Frequentist, Area, Pdf Comparison Metrics, Statistical Hypothesis Test, and Bayesian Model Test. The denoted data probability for the average in the frequentist metric, Stud. $t$, is the Student $t$ distribution.

Table 2: BVM representation of the special cases using the $f$'s specified in Table 1.

| Metric | BVM |
|---|---|
| BVM | $\int_f \rho(f \mid M, D) \cdot \Theta(B(f))\, df$ |
| Reliability | $\int_f \rho(f \mid M, D) \cdot \Theta(f < \epsilon)\, df = r$ |
| Imp. Reli. | $\int_f \rho(f \mid M, D) \cdot \Theta(f < \epsilon)\, df = r_i$ |
| Frequentist | $\int_{\mu_y} \rho(\mu_y \mid D) \cdot \Theta(B(f(\langle \hat y \rangle_0, \mu_y)))\, d\mu_y$ |
| Area | $\Theta(B(f(F^0_{\hat y}, F^0_y)))$ |
| Pdf Comp. | $\Theta(B(f(\rho^0_M, \rho^0_D)))$ |
| Stat. Hyp. | $\int_{S_{\hat y}} \rho(S_{\hat y} \mid M = D) \cdot \Theta(S_{\hat y} \in [-c_\alpha, c_\alpha])\, dS_{\hat y} = 1 - \alpha$ |
| Bayes Model | $\int_f \rho(f \mid M, D) \cdot \Theta(f = 0)\, df = p(\hat Y \equiv Y \mid M, D)$ |

Table 2 shows the resulting BVM using the specifications listed in Table 1. The value $r$ is the standard notation for the reliability metric [24] and we use $r_i$ for the improved reliability metric [26]. The BVM represents each of the known validation metrics as a probability of agreement between the model and the data from equation (2). As no agreement function is specified directly for the frequentist and area metrics, the problem is under constrained, so the agreement functions are left as general functions $B(f)$ of the comparison function. Thus, for any chosen agreement function, the BVM quantifies their probability of agreement. The remaining metrics all do specify (or indicate) an agreement function, and thus have specified all of the information required to compute the BVM.

The statistical hypothesis test is perhaps a bit out of place among the validation metrics. First, note that the comparison function for statistical hypothesis testing is not a function of both the data and the model. Further note that the model pdf used for statistical hypothesis testing assumes the null hypothesis is true, which in our language is the assumption that $\rho(S_{\hat y} \mid M, D) = \rho(S_{\hat y} \mid M = D)$, i.e. that the pdf of the model is equal to the pdf of the data. This is out of place because here we are attempting to validate a model, usually with its own quantified pdf, rather than, perhaps irresponsibly, assuming it is equal to the data pdf before validating that to be the case.
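The generic BVM row of Table 2 can be read as the expected value of the agreement indicator, which suggests a simple sampling estimate. The following is a minimal sketch, not the authors' implementation: the function name `bvm_monte_carlo`, the Gaussian input distributions, and the threshold `EPS` are all hypothetical choices made for illustration, and the closed-form check only applies to this Gaussian special case.

```python
import math
import numpy as np

def bvm_monte_carlo(sample_model, sample_data, compare, agree,
                    n_samples=20_000, seed=0):
    """Estimate p(A | M, D) as the mean of the agreement indicator Theta(B(f))."""
    rng = np.random.default_rng(seed)
    z_hat = sample_model(rng, n_samples)   # draws from rho(z_hat | M, D)
    z = sample_data(rng, n_samples)        # draws from rho(z | D)
    f = compare(z_hat, z)                  # comparison function f(z_hat, z)
    return float(np.mean(agree(f)))        # fraction of pairs that "agree"

# Reliability-metric row of Table 1 with assumed (hypothetical) Gaussian inputs.
EPS = 0.3
reliability = bvm_monte_carlo(
    sample_model=lambda rng, n: rng.normal(0.1, 0.05, n),   # model average
    sample_data=lambda rng, n: rng.normal(0.0, 0.10, n),    # data average
    compare=lambda z_hat, z: np.abs(z_hat - z),             # f = |<y_hat> - mu_y|
    agree=lambda f: f < EPS,                                # B(f): f < eps
)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Closed form for this Gaussian case: the difference is N(0.1, 0.05^2 + 0.10^2).
diff_std = math.hypot(0.05, 0.10)
exact = normal_cdf((EPS - 0.1) / diff_std) - normal_cdf((-EPS - 0.1) / diff_std)
print(reliability, exact)
```

Swapping the `compare` and `agree` callables is all that is needed to move between rows of Table 1 that use sampled comparison values, which is the sense in which those metrics are special cases.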
This causes standard statistical hypothesis pitfalls, such as type I errors (rejecting the null hypothesis when it is true) and type II errors (failing to reject the null hypothesis when it is false), to be carried over into the BVM, which is unwanted. Several comments are made in Appendix A.5 on this issue.

A perhaps surprising result is the proposed functional form of the BVM that represents Bayesian model testing, $p(A \mid M, D) = p(\hat Y \equiv Y \mid M, D)$, which is the Bayesian evidence. This is the probability that the uncertain model and data happen to output exactly the same values. Usually what is discussed when reviewing Bayesian model testing is the Bayes posterior odds ratio, i.e. the "Bayes Ratio", $R = \frac{p(M \mid Y)}{p(M' \mid Y)} \propto \frac{p(Y \mid M)}{p(Y \mid M')}$, which tests one model $M$, i.e. for validation, against another model $M'$. However, in validation metric problems, we are first interested in considering the validation of a single model; the ratio is an extra bit of inference. In Appendix A.6 we show that the BVM result of $p(\hat Y \equiv Y \mid M, D)$ is exactly what we mean by $p(Y \mid M)$ in the numerator of the Bayes factor (see footnote 4), which effectively quantifies the validation of a single model against data $Y$, all quantified under uncertainty.

Footnote 4: It should be noted that our notation for $D$ differs from the notation typically used in Bayesian model testing. Their $D$ is equal to our data $Y$, while our $D$ refers to context "as having come from the data or experiment rather than the model".

4.2 Generalizations and Improvements to the known validation metrics with the BVM

The BVM offers several avenues to either generalize or improve many of the metrics. The types of generalizations the BVM offers pertain to generalizing the comparison values, comparison functions, definitions of agreement, and/or generalizing deterministic comparison values and metrics to the uncertain case.
These generalizations are only useful if quantitative statements can be made on their behalf; in such a case, these generalizations are improvements. We will give a brief review of the improvements we found below, but the full discussion is located in Appendix A. By making generalizations or improvements to each of the known validation metrics as implied by the BVM, each metric can be made to satisfy our validation criterion as well as the six desirable validation criteria in [34], due to the results of Section 3.

Appendix A.1 uses the BVM to show that the reliability metric and the improved reliability metric can be generalized to compare values without a unique order, such as strings, in principle. This involves creating an agreement function over sets of values that may be considered to "agree" (such as synonymous sets of strings), rather than over a continuous interval.

Appendix A.2 derives the frequentist validation metric and generalizes it to the case where both the model and data expectation values are uncertain. The frequentist metric assumes the model outputs are known with certainty, which may or may not be true. If a model is stochastic, the model pdfs may be estimated with Monte Carlo or other uncertainty propagation methods that quantify the pdf directly.

Appendix A.3 shows that the area metric may be cast as a special case of the BVM. The area metric involves quantifying the difference between model and data cumulative distributions on a point to point basis; thus, the comparison values $(\hat z, z)$ are cumulative distributions themselves. The comparison values are assumed to be known with complete certainty, which in the case of cumulative distributions of data is often difficult to argue.
Any quantifiable uncertainty in the cumulative distributions may be integrated over, which generalizes the area metric to situations when the model and/or the data cumulative distributions are uncertain. A drawback is that the BVM in these cases may be very computationally intensive and would likely need to be approximated using a random sampling or discretization scheme. A binned pdf metric is put forward to potentially reduce the computational complexity of quantifying this generalized area validation metric. This applies similarly to the pdf comparison metrics in Appendix A.4.

In Appendix A.5 we invent an improved statistical hypothesis test using the BVM that takes into account both model and data pdfs, called the "statistical power BVM". Because in principle we have a model output pdf $\rho(\hat y \mid M, D)$ in model validation problems, we can use it (in place of assuming the null hypothesis is true) to remove the possibility of both type I and type II errors. In the statistical power BVM, the model and the data are defined to agree if their test statistics both lie within one another's confidence intervals (or "confidence sets" as explained in Appendix A.5). The statistical power BVM becomes the product of the statistical powers of the model and data, denoted $p(A \mid M, D) = (1 - \beta_D(\hat\alpha)) \cdot (1 - \beta_M(\alpha))$ in equation (46). Further comments are made about how systematic error (defined as when a test statistic lies outside of its own confidence interval) may be removed. It is concluded that the statistical power BVM has a relatively low resolving power compared to other BVMs. This is because large confidence intervals imply large tolerance intervals for acceptance. For this reason, statistical hypothesis testing should only be used for validation in situations when a high degree of nonexactness between model and data test statistics is permissible and the pdfs have very thin tails.
This BVM does, however, have a greater resolution than the classical hypothesis test, as was proved in the appendix and will be demonstrated in the next section.

Appendix A.6 finds that Bayesian model testing has the highest possible resolving power because the model and the data are defined to agree only if their values are exactly equal. This is the reverse of what was concluded about statistical hypothesis testing. Further in Appendix A.6, we argue that, analogous to the Bayesian model testing framework, nothing prevents us from constructing what we call the BVM factor. The BVM factor is,

$$K(B) = \frac{p(A \mid M, D, B)}{p(A \mid M', D, B)}, \tag{13}$$

which is a ratio of the BVMs of two models under arbitrary definitions of agreement $B$. Using Bayes' theorem, $p(M \mid A, D, B) = \frac{p(A \mid M, D, B)\, P(M \mid D, B)}{P(A \mid D, B)}$, we may further construct the BVM ratio,

$$R(B) = \frac{p(M \mid A, D, B)}{p(M' \mid A, D, B)} = \frac{p(A \mid M, D, B)\, P(M \mid D, B)}{p(A \mid M', D, B)\, P(M' \mid D, B)} = K(B)\, \frac{P(M \mid D, B)}{P(M' \mid D, B)}, \tag{14}$$

for the purpose of comparative model selection under a general definition of agreement $B$. The ratio $\frac{P(M \mid D, B)}{P(M' \mid D, B)}$ is the ratio of the prior probabilities of $M$ and $M'$. Analogous to Bayesian model testing, if there is no reason to suspect that one model is a priori more probable than another, one may let $\frac{P(M \mid D, B)}{P(M' \mid D, B)} = 1$, and then $R(B) \to K(B)$ in value. Thus, using the BVM ratio, we can perform general model validation testing under arbitrary definitions of agreement and with any reasonable set of comparison functions. The BVM ratio therefore generalizes the Bayesian model testing framework. This will be utilized in the next section.

Finally, we add a note about how one may mitigate the sharpness of the indicator function without using fuzzy logic while also allowing close models to be somewhat accepted.
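Equations (13) and (14) reduce to simple arithmetic once the two BVM values are available. The sketch below assumes the BVMs have already been computed by some other means; the function names and the handling of zero denominators (returning infinity or NaN) are illustrative conventions, not something specified in the text.

```python
import math

def bvm_factor(p_agree_m, p_agree_m_prime):
    """BVM factor K(B) of equation (13): ratio of two probabilities of agreement."""
    if p_agree_m_prime == 0.0:
        # Assumed convention: 0/0 is undefined (NaN), x/0 is reported as infinity.
        return math.nan if p_agree_m == 0.0 else math.inf
    return p_agree_m / p_agree_m_prime

def bvm_ratio(p_agree_m, p_agree_m_prime, prior_m=0.5, prior_m_prime=0.5):
    """BVM ratio R(B) of equation (14): K(B) times the ratio of model priors."""
    return bvm_factor(p_agree_m, p_agree_m_prime) * (prior_m / prior_m_prime)

# With equal priors the ratio equals the factor.
print(bvm_ratio(0.6, 0.8))             # equal priors
print(bvm_ratio(0.6, 0.8, 0.2, 0.8))   # prior odds of 1:4 against the first model
```

Keeping the prior odds as an explicit argument makes it visible when a ranking is driven by the data-model agreement and when it is driven by prior belief.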
As we have seen, it is natural to use a threshold Boolean parameter $\epsilon$ to help define the boundary of agreement through $B(f \le \epsilon)$. Such a BVM takes the form,

$$p(A \mid M, D, \epsilon) = \int_f \rho(f \mid M, D) \cdot \Theta(B(f \le \epsilon))\, df, \tag{15}$$

where $\Theta(\ldots)$ instantaneously drops to zero for $f > \epsilon$. One may soften the boundary by allowing $\epsilon$ itself to be an uncertain quantity, which means one allows the definition of agreement to be somewhat uncertain (which can often be reasonably claimed). As an example, let this uncertainty be $\rho(\epsilon) = \lambda \exp(-\lambda(\epsilon - \epsilon_0))$ for $\epsilon \ge \epsilon_0$ and zero otherwise, where $\lambda$ is positive. Marginalizing over $\epsilon$ then gives,

$$p(A \mid M, D) = \int_\epsilon p(A \mid M, D, \epsilon)\, \rho(\epsilon)\, d\epsilon = \int_f \rho(f \mid M, D) \cdot \left[ \Theta(B(f \le \epsilon_0)) + \Theta(B(f > \epsilon_0))\, e^{-\lambda(f - \epsilon_0)} \right] df, \tag{16}$$

which allows some $f$'s to be accepted outside the agreement region defined by $f \le \epsilon_0$, but with an exponentially decaying probability. Other potentially useful $\epsilon$ pdfs include, but are not limited to: negative slope linear, Gaussian, or decaying sigmoidal. None of these $\epsilon$ distributions were needed to obtain the results of the previous sections explicitly; however, these types of assumptions may have been part of the decision process made implicitly by a practitioner while performing model validation.

5 BVM Examples

In this section we invent and quantify three novel validation metrics using the BVM to highlight the conceptual clarity, flexibility, and capacity of our framework.

5.1 The Statistical Power BVM

Here we consider the statistical power BVM proposed in Appendix A.5 and reviewed in the previous section. This metric defines agreement as occurring when both the model and data comparison values are within one another's confidence intervals, simultaneously.
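The mutual-confidence-interval definition of agreement can be sketched by sampling, since for independent model and data draws the joint probability factorizes into the product of the two coverage probabilities. This is an illustrative approximation rather than the calculation of equation (46): intervals are taken as central 95% sample percentiles, the data distribution is assumed to be a Student $t$ with 10 degrees of freedom, location 0, and scale 1.75 (borrowed from the Figure 2 setup), and the three model widths are hypothetical.

```python
import numpy as np

def central_interval(samples, level=0.95):
    """Central interval containing `level` of the samples (percentile based)."""
    tail = 100.0 * (1.0 - level) / 2.0
    low, high = np.percentile(samples, [tail, 100.0 - tail])
    return low, high

def statistical_power_bvm(model_samples, data_samples, level=0.95):
    """Product of P(data inside model interval) and P(model inside data interval)."""
    m_low, m_high = central_interval(model_samples, level)
    d_low, d_high = central_interval(data_samples, level)
    power_data = np.mean((data_samples >= m_low) & (data_samples <= m_high))
    power_model = np.mean((model_samples >= d_low) & (model_samples <= d_high))
    return float(power_data * power_model)

def power_bvm_for_sigma(sigma, n=200_000, seed=0):
    """Normal model N(0, sigma) against the assumed t-distributed data average."""
    rng = np.random.default_rng(seed)
    data = 1.75 * rng.standard_t(df=10, size=n)   # assumed scale parameter of 1.75
    model = rng.normal(0.0, sigma, size=n)
    return statistical_power_bvm(model, data)

# Hypothetical narrow, comparable, and very wide models.
results = {sigma: power_bvm_for_sigma(sigma) for sigma in (0.01, 1.75, 50.0)}
print(results)
```

In this sketch both an overconfident (narrow) model and an overly uncertain (wide) model score lower than a model whose spread is comparable to that of the data, which is the behavior the metric is designed to reward.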
The BVM for this metric is the product of the statistical powers of the model and the data, $p(A \mid M, D) = (1 - \beta_D(\hat\alpha)) \cdot (1 - \beta_M(\alpha))$, which is calculated in equation (46). We contrast this with the standard statistical hypothesis test that, after assuming the model is correct, $M = D$, finds the probability that the model lies within the data's confidence interval to be equal to $1 - \alpha$. In statistical hypothesis testing, one then proceeds to check the actual model output and speculates about type I and II errors. As discussed in the Appendix, we do not assume $M = D$ before validation and therefore type I and II errors are avoided. Rather, we let the statistical power BVM decide whether or not the model is valid. This provides a more informative validation procedure.

Figure 2 depicts a typical statistical hypothesis test scenario that is designed to check the validity of an uncertain model average prediction $\hat\mu$ (in blue) against an uncertain data average prediction $\mu$ (in red). The data's $\mu$ is $t$-distributed the same in each subfigure according to $T(\bar y, n - 1, s) = T(0, 10, 1.75)$, where $(\bar y, n, s)$ are the sample mean, number of collected data points, and the sample standard deviation, respectively. Each row depicts a normal distribution model centered at 0, but with increasing model variance per row.

Figure 2: The shaded regions in Column A depict the 95% confidence interval of each distribution, respectively. Because the data distribution is the same in each figure and because the statistical hypothesis test is independent of the proposed model due to assuming the hypothesis $M = D$, each model is equally valid by that test when it is clear that the model in row 2 is preferable. The shaded regions in Column B depict the statistical power of the distributions: the 95% confidence interval from each distribution is shaded (integrated) in the other's pdf.
The statistical power BVM (denoted $P(A)$ in column B) is calculated for each model and indeed the model in row 2 is found to be preferable as it has the highest probability of agreement.

5.2 The $(\langle \epsilon \rangle, \beta_D)$ BVM

We invent a novel compound Boolean that defines agreement as when the model passes an average error threshold of $\langle \epsilon \rangle$ and a check for probabilistic model representation. The latter is imposed by requiring that $95\% \pm 4\%$ of the uncertain data lies inside the model's $1 - \hat\alpha = 95\%$ confidence interval, i.e. $1 - \beta_D(\hat\alpha) \sim 95\%$. The $\pm 4\%$ tolerance was chosen such that overly uncertain models would be marked as "not agreeing", as they would otherwise be able to guarantee that 100% of the data lie within their overly wide confidence intervals. We call this compound Boolean the $(\langle \epsilon \rangle, \beta_D)$ Boolean. The BVM in this case is

$$p(A \mid M, D, \langle \epsilon \rangle, \beta_D) = \int_{\hat Y, Y} \rho(\hat Y \mid M, D) \cdot \Theta\big(B(\hat Y, Y, \langle \epsilon \rangle, \beta_D)\big) \cdot \rho(Y \mid D)\, d\hat Y\, dY, \tag{17}$$

where the compound Boolean $B(\hat Y, Y, \langle \epsilon \rangle, \beta_D)$ is equal to,

$$B\Big( \frac{1}{N} \sum_i \lvert y_i - \hat y_i \rvert \le \langle \epsilon \rangle \Big) \wedge B\Big( 0.91 \le \frac{1}{N} \sum_i \Theta\big(y_i \in [-c_{\hat\alpha}, c_{\hat\alpha}]_i\big) \le 0.99 \Big), \tag{18}$$

$N$ is the number of data points in $\{y_i\} = Y$, and $[-c_{\hat\alpha}, c_{\hat\alpha}]_i$ is the model's 95% confidence interval at comparison location $x_i$. We treat the model confidence intervals as certain quantities, which can be achieved effectively through enough Monte Carlo (MC) simulation of the model output pdfs (although this stipulation can be removed if needed). Although the mathematical notation for the compound Boolean is a bit cumbersome, it is relatively easy to implement programmatically using if statements. This ease of programming allows the BVM to have a large capacity for representing complex and abstract validation scenarios in practice.
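As a concrete reading of equation (18), the compound Boolean can be written as two `if` checks. This is a minimal sketch under stated assumptions: the function name, the argument layout, and the representation of the model confidence interval as per-point lower and upper bounds (`ci_low`, `ci_high`) are illustrative choices, not the authors' code.

```python
import numpy as np

def eps_beta_boolean(y_hat, y, ci_low, ci_high, eps_mean,
                     frac_low=0.91, frac_high=0.99):
    """Compound Boolean of equation (18) for one model path and one data path."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)

    # First condition: mean absolute deviation must not exceed the threshold.
    mean_abs_error = np.mean(np.abs(y - y_hat))
    if mean_abs_error > eps_mean:
        return False

    # Second condition: the fraction of data inside the model's confidence
    # interval must fall in [frac_low, frac_high] (95% +/- 4% in the text).
    coverage = np.mean((y >= ci_low) & (y <= ci_high))
    if coverage < frac_low or coverage > frac_high:
        return False

    return True

# Hypothetical example: 100 points, 5 of which fall outside the interval [-1, 1].
y_hat = np.zeros(100)
y = np.zeros(100)
y[:5] = 10.0
print(eps_beta_boolean(y_hat, y, -1.0, 1.0, eps_mean=0.6))
print(eps_beta_boolean(y_hat, y, -100.0, 100.0, eps_mean=0.6))
```

The second call shows the purpose of the upper bound on coverage: an interval wide enough to contain all of the data is rejected rather than rewarded.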
Expressing the BVM as an expectation value over $\rho(Y \mid D)\, \rho(\hat Y \mid M, D)$,

$$p(A \mid M, D, \langle \epsilon \rangle, \beta_D) = E\Big[ \Theta\big(B(\hat Y, Y, \langle \epsilon \rangle, \beta_D)\big) \Big] \sim \frac{1}{K} \sum_{k=1}^{K} \Theta\big(B(\hat Y_k, Y_k, \langle \epsilon \rangle, \beta_D)\big), \tag{19}$$

allows one to compute the integral using standard statistical methods like MC. We use MC and $K = 3000$ samples in this toy example.

In Figure 3 we implement the $(\langle \epsilon \rangle, \beta_D)$ Boolean and show that it is able to quantify both average error and a model's probabilistic representation of uncertain data, simultaneously. We consider data that is generated from,

$$y(x) = 1 + x \exp(-\cos(10x)) + \sin(10x) + \epsilon_a(x), \tag{20}$$

where $\epsilon_a(x) \sim \mathcal{N}(0, 0.4^2)$ represents aleatoric stochastic uncertainty due to the inherent randomness of the system. An instance of the aleatoric data $Y$ without measurement uncertainty is depicted in red in Figure 3A) and B). We consider the data to have an additional epistemic measurement uncertainty $\epsilon_e(x) \sim \mathcal{N}(0, 0.2^2)$ that contributes to the probability of whether or not the model agrees with the plotted instance of the aleatoric data $Y$ (see footnote 5). The models plotted in Figure 3A) and Figure 3B) are generated from,

$$\hat y(x; a, b, c, d, f, g) = a + b x \exp(-c \cos(dx)) + f \sin(gx), \tag{21}$$

where $(a, b, c, d, f, g)$ are the model parameters. In Figure 3A), a deterministic model is considered and plotted in blue by treating the model parameters as completely certain numbers $(a, b, c, d, f, g) = (1, 1, 1, 10, 1, 10)$. In Figure 3B), we consider an uncertain model by treating the model parameters as uncertain values drawn from a multivariate Gaussian distribution having averages $(\mu_a, \mu_b, \mu_c, \mu_d, \mu_f, \mu_g) = (1, 1, 1, 10, 1, 10)$ and standard deviations $(\sigma_a, \sigma_b, \sigma_c, \sigma_d, \sigma_f, \sigma_g) = (0.35, 0.3, 0.3, 0.3, 0.3, 0.3)$.
We evaluate the probability that these data and model pairs pass the $\langle \epsilon \rangle$ Boolean versus the $(\langle \epsilon \rangle, \beta_D)$ Boolean by calculating their respective BVMs.

Footnote 5: Our framework also permits investigating other validation scenarios, e.g., if one has many aleatoric data path instances $\{Y\}$ (or enough statistics about the data pdf) that one would like to validate or use for model testing in the aggregate.

Figure 3: The deterministic model plotted in Figure 3A) satisfies the average error validation requirement with $P(A \mid \langle \epsilon \rangle) = 0.99$ when a threshold of $\langle \epsilon \rangle = 0.46$ is used. This result is logical because the aggregate standard deviation of the data, which combines the aleatoric and epistemic uncertainties, is $\sqrt{0.4^2 + 0.2^2} \approx 0.45$. This deterministic model fails to predict the uncertain fluctuations of the data because deterministic models have confidence intervals with zero width. Thus the deterministic model fails to agree according to the $(\langle \epsilon \rangle, \beta_D)$ Boolean and one finds $P(A \mid \langle \epsilon \rangle, \beta_D) = 0$ for any value of $\langle \epsilon \rangle$. The uncertain model depicted in Figure 3B) is able to pass both agreement definitions; however, our choice to evaluate each probable model path against the epistemically uncertain data, rather than just the average model, increases the threshold to about $\langle \epsilon \rangle = 0.9$ before an agreement probability of about $P(A \mid \langle \epsilon \rangle) = 0.96$ is achieved. When instead evaluating $\langle \epsilon \rangle$ against the average model, we obtained similar results to the deterministic model for this Boolean, $\langle \epsilon \rangle \sim 0.46$ and $P(A \mid \langle \epsilon \rangle) = 0.99$. The $(\langle \epsilon \rangle, \beta_D)$ BVM for the uncertain model is $P(A \mid \langle \epsilon \rangle, \beta_D) = 0.93$, because the uncertainty in the data more or less agrees with the confidence interval provided by the model. This model can be tested for agreement against other models or model parameter distributions for their respective definitions of agreement.
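The Monte Carlo estimate of equation (19) for this example can be sketched as follows. Several details are assumptions made only for illustration and are not stated in the text: the comparison locations are taken as 50 evenly spaced points on $[0, 1]$, the model parameters are sampled independently, and the model confidence interval at each location is estimated from the 2.5 and 97.5 percentiles of the sampled model paths. The numbers it prints are therefore not expected to reproduce those reported for Figure 3.

```python
import numpy as np

MEAN = np.array([1.0, 1.0, 1.0, 10.0, 1.0, 10.0])    # (a, b, c, d, f, g) averages
SIGMA = np.array([0.35, 0.3, 0.3, 0.3, 0.3, 0.3])    # their standard deviations

def model_paths(x, params):
    """Evaluate equation (21) for each parameter row; returns shape (K, N)."""
    a, b, c, d, f, g = (params[:, [j]] for j in range(6))
    return a + b * x * np.exp(-c * np.cos(d * x)) + f * np.sin(g * x)

def estimate_bvms(sigma, eps_mean, n_paths=3000, n_points=50, seed=0):
    """Return (P(A | <eps>), P(A | <eps>, beta_D)) estimated with equation (19)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)               # assumed comparison locations

    # One fixed aleatoric data instance, equation (20).
    y_aleatoric = model_paths(x, MEAN[None, :])[0] + rng.normal(0.0, 0.4, n_points)

    # Sampled model paths and their per-location 95% interval.
    params = MEAN + sigma * rng.standard_normal((n_paths, 6))
    y_hat = model_paths(x, params)
    ci_low, ci_high = np.percentile(y_hat, [2.5, 97.5], axis=0)

    # Data paths: the aleatoric instance plus epistemic measurement noise.
    y = y_aleatoric + rng.normal(0.0, 0.2, (n_paths, n_points))

    error_ok = np.mean(np.abs(y - y_hat), axis=1) <= eps_mean
    coverage = np.mean((y >= ci_low) & (y <= ci_high), axis=1)
    coverage_ok = (coverage >= 0.91) & (coverage <= 0.99)
    return float(np.mean(error_ok)), float(np.mean(error_ok & coverage_ok))

print(estimate_bvms(np.zeros(6), eps_mean=0.46))   # deterministic model
print(estimate_bvms(SIGMA, eps_mean=0.9))          # uncertain model
```

For the deterministic model the interval has zero width, so the coverage check can never pass and the compound BVM is zero regardless of the threshold, in line with the discussion of Figure 3A).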
5.3 Exploring the BVM Ratio with the $(\gamma, \epsilon)$ BVM

In this subsection we invent an agreement function to represent the visual inspection an engineer might perform graphically and use the BVM ratio for model selection under this definition of agreement. By quantifying this, a practitioner could visually validate a model without actually looking at the model-data pair, which can be helpful for high dimensional spaces that are beyond human comprehension or visualization. We will proceed by introducing this agreement function and some simple models to test it on. We will then quantify this measure using the BVM in the completely certain and uncertain cases. The main purpose of this example is to explore the BVM ratio while showcasing the conceptual flexibility of the framework.

To quantify something resembling the visual inspection an engineer might make graphically, we use two main criteria. We define the model to be accepted if most of the model and data point pairs lie relatively close to one another and if none of the point pairs deviate too far from one another. We therefore consider a compound Boolean $B(\hat Y, Y, \gamma, \epsilon)$ that is true if a percentage larger than $\gamma\%$ ($\sim 90\%$) of the model output points $\hat Y$ lie within $\epsilon$ of the data $Y$ and 100% of the model output points lie within some multiple $m$ of $\epsilon$ of the data, which rules out obvious model form error. We will call this compound Boolean the $(\gamma, \epsilon)$ Boolean. The values $\gamma\%$, $\epsilon$, and $m$ can be adjusted to the needs of the modeler. It should be noted that the $\epsilon$ in this metric makes point by point evaluations, as opposed to the average $\langle \epsilon \rangle$ Boolean used in the previous example. We will perform the analysis for a variety of $\gamma$ and $\epsilon$ values to explore the limits of the metric.

As an illustration, we will calculate the BVM for two polynomial models of different order that approximate $N$ data points taken from the cosine function $y_i = \cos(x_i)$.
The points are evenly spaced in the range $x_i \in [0, \pi]$. The first model $\hat y^{(1)}_i = M^{(1)}(x_i; a, b, c) = a + b x_i^2 + c x_i^4$ has uncertain parameters $(a, b, c)$ and the second model $\hat y^{(2)}_i = M^{(2)}(x_i; a, b, c, d) = a + b x_i^2 + c x_i^4 + d x_i^6$ has uncertain parameters $(a, b, c, d)$.

To formulate the BVM, we still need to formulate the model and data probability distributions. Because the Boolean expression $B(\hat Y, Y)$ is over the entire model and data functions, the model probability distribution is $p(\hat Y \mid M, D)$ and the data probability distribution is $p(Y \mid D)$. These are joint probabilities over all of the points $(\hat Y = \{\hat y_i\}, Y = \{y_i\})$ that constitute a particular path $(\hat Y, Y)$ of the model or data, respectively. Because both models are linear in the uncertain coefficients $(a, b, c, d)$, there is a one to one correspondence from the set of model parameters $(a, b, c, d)$ to the set of the possible paths $\hat Y_{a,b,c,d}$ (given $N$ is greater than the number of independent coefficients). This makes the uncertainty propagation from the uncertain model parameters to the full joint probability of the points on a path simple, and results in the joint probability of paths being equal to the joint probability of the uncertain input model parameters. For simplicity we will let $p(a, b, c, d) = \mathcal{N}(\mu_a, \sigma_a)\, \mathcal{N}(\mu_b, \sigma_b)\, \mathcal{N}(\mu_c, \sigma_c)\, \mathcal{N}(\mu_d, \sigma_d)$, where $\mathcal{N}(\mu, \sigma)$ is a normal distribution of average $\mu$ and standard deviation $\sigma$, such that,

$$p(\hat Y_{a,b,c,d} \mid M_j) = \frac{1}{Z} \exp\Big( -\frac{(a - \mu_a)^2}{2\sigma_a^2} - \frac{(b - \mu_b)^2}{2\sigma_b^2} - \frac{(c - \mu_c)^2}{2\sigma_c^2} - \frac{(d - \mu_d)^2}{2\sigma_d^2} \Big).$$

Because the problem is well understood, we discretize the integrals rather than estimating them with MC. After discretization, the $(\gamma, \epsilon)$ BVM for each model is,

$$p(A \mid M_j, D, \gamma, \epsilon) = \sum_{\hat Y, Y} p(\hat Y \mid M_j) \cdot \Theta(B(\hat Y, Y, \gamma, \epsilon)) \cdot p(Y \mid D). \tag{22}$$

In principle $\epsilon = (\epsilon_1, \ldots, \epsilon_N)$ is an $N$ dimensional vector where each $\epsilon_i$ may be adjusted to impose more or less stringent agreement conditions on a point to point basis, which may be used to enforce reliability in regions of interest. In our example we let all of the components $\epsilon_i$ be equal. If the standard deviation of each data point $\sim \hat\sigma_i$ (aleatoric and/or measurement uncertainty) in the joint data pdf $p(Y \mid D)$ is much less than $\epsilon_i$ and $N$ is large, one may approximate the BVM as,

$$p(A \mid M_j, D, \gamma, \epsilon) \approx \sum_{\hat Y} p(\hat Y \mid M_j) \cdot \Theta(B(\hat Y, Y_0, \gamma, \epsilon)), \tag{23}$$

which can greatly reduce the number of combinations one must calculate by effectively treating the data as known, deterministic, and equal to $Y_0$. We will use this approximation as it does not take away from the main point of this example.

We will use the following numerics. In the completely certain case, we will let the parameters be the Taylor series coefficients $(a, b, c, d) = (1, -\frac{1}{2!}, \frac{1}{4!}, -\frac{1}{6!})$ ($d = 0$ for model 1), and in the uncertain case we let each coefficient have Gaussian uncertainty centered at its Taylor series coefficient with standard deviations $(\sigma_a, \sigma_b, \sigma_c, \sigma_d) = (0.1, 0.05, 0.005, 0.0005)$ (and where $\sigma_d = 0$ for model 1). We let each model output path have $N = 50$ points and we allow for 20 possible values per parameter $(a, b, c, d)$, which results in $20^3 = 8000$ possible paths for model 1 and $20^4 = 160{,}000$ for model 2. We let $\gamma$ vary between 75% and 100% using an increment of 1% and let $\epsilon$ vary between 0 and 1 using an increment of 0.01. The value of $m$ was chosen to be equal to 5, which imposes that no model path can have points that are more than $5\epsilon$ away while still being considered to agree with the data.
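The discretized sum of equation (23) can be sketched directly. The layout of the parameter grid is not specified in the text, so this sketch assumes 20 evenly spaced values per parameter spanning $\mu \pm 3\sigma$, weighted by normalized Gaussian weights; it also assumes "a percentage larger than $\gamma$" is implemented as "at least $\gamma$" so that $\gamma = 100\%$ remains attainable. The function names are hypothetical.

```python
import itertools
import numpy as np

def gamma_eps_boolean(y_hat, y, gamma, eps, m=5.0):
    """(gamma, eps) Boolean: enough points within eps and every point within m*eps."""
    dev = np.abs(y_hat - y)
    close_enough = np.mean(dev <= eps, axis=-1) >= gamma   # assumed ">=" convention
    no_outliers = np.all(dev <= m * eps, axis=-1)
    return close_enough & no_outliers

def discretized_bvm(x, y0, powers, means, sigmas, gamma, eps, n_grid=20, m=5.0):
    """Equation (23): sum of path probabilities over the paths that agree with y0."""
    axes, weights = [], []
    for mu, sigma in zip(means, sigmas):
        if sigma == 0:                                   # completely certain parameter
            axes.append(np.array([mu], dtype=float))
            weights.append(np.array([1.0]))
        else:                                            # assumed grid: mu +/- 3 sigma
            grid = np.linspace(mu - 3 * sigma, mu + 3 * sigma, n_grid)
            w = np.exp(-0.5 * ((grid - mu) / sigma) ** 2)
            axes.append(grid)
            weights.append(w / w.sum())
    coeffs = np.array(list(itertools.product(*axes)))            # (n_paths, n_params)
    probs = np.prod(np.array(list(itertools.product(*weights))), axis=1)
    basis = np.stack([x ** p for p in powers])                   # (n_params, N)
    paths = coeffs @ basis                                       # (n_paths, N)
    return float(np.sum(probs * gamma_eps_boolean(paths, y0, gamma, eps, m)))

x = np.linspace(0.0, np.pi, 50)
y0 = np.cos(x)
taylor = (1.0, -1.0 / 2.0, 1.0 / 24.0, -1.0 / 720.0)

# Completely certain case at (gamma, eps) = (0.9, 0.5).
bvm_1 = discretized_bvm(x, y0, (0, 2, 4), taylor[:3], (0, 0, 0), 0.9, 0.5)
bvm_2 = discretized_bvm(x, y0, (0, 2, 4, 6), taylor, (0, 0, 0, 0), 0.9, 0.5)
# Uncertain case for model 1 (8000 paths).
bvm_1_uncertain = discretized_bvm(x, y0, (0, 2, 4), taylor[:3],
                                  (0.1, 0.05, 0.005), 0.9, 0.5)
print(bvm_1, bvm_2, bvm_1_uncertain)
```

Because the models are linear in their coefficients, all paths are produced with a single matrix product, which keeps even the 160,000 path case for model 2 inexpensive.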
The BVM probability of agreement values as a function of the Boolean function parameters $(\gamma, \epsilon)$ are plotted in Figure 4 for model 1 and model 2 in the completely certain case:

Figure 4: Completely Certain Case: The BVM probability of agreement between models 1 and 2 and the data is plotted in the space of $(\gamma, \epsilon)$. The results for model 1 (4th order polynomial) are plotted on the left and those for model 2 (6th order polynomial) are plotted on the right. Because here the models are deterministic, the BVM probability of agreement for each $(\gamma, \epsilon)$ pair is either zero or one. As expected, model 2 better fits the data in the space of $(\gamma, \epsilon)$: it has more BVM values equal to one than model 1, as it is overall closer to the cosine function, being the next nonzero order in the Taylor series expansion. Neither model fits the data exactly, as the BVM for both models at $(\gamma = 100\%, \epsilon = 0)$ is zero.

For a single $(\gamma, \epsilon)$ pair, the models' BVM ratio (a priori the models are assumed to be equally likely) is,

$$R(B(\gamma, \epsilon)) = \frac{p(A \mid M_1, D, \gamma, \epsilon)}{p(A \mid M_2, D, \gamma, \epsilon)}, \tag{24}$$

which, because the numerator and denominator are each either 0 or 1 in the deterministic case, gives $R(B(\gamma, \epsilon))$ equal to $1$, $0$, $\infty$, or $\frac{0}{0}$, meaning that the models both agree, model 1 does not agree but model 2 agrees, model 1 agrees but model 2 does not agree, or both models disagree, respectively. Thus, the BVM ratio for a single $(\gamma, \epsilon)$ pair between two deterministic models with completely certain data is not particularly insightful, as they either agree or do not agree as defined by $B$.
As it may not always be clear precisely what values of $(\gamma, \epsilon)$ one should choose to define agreement, one can meaningfully average (marginalize, analogous to (16)) over a viable volume in the space of $(\gamma, \epsilon)$, with $p(\gamma, \epsilon) = \frac{1}{V}$, and arrive at an averaged Boolean BVM ratio,

$$R(B) = \frac{\sum_{\gamma, \epsilon} p(A \mid M_1, D, \gamma, \epsilon)}{\sum_{\gamma, \epsilon} p(A \mid M_2, D, \gamma, \epsilon)} = \frac{N^1_A}{N^2_A}, \tag{25}$$

which is simply a ratio of the number of agreements found for model 1, $N^1_A$, in the selected $(\gamma, \epsilon)$ volume to the number of agreements found for model 2, $N^2_A$, in the same volume. In our deterministic example, $R(B) = \frac{1108}{2364} = 0.4687$, as model 2 better fits the data, as defined by $B$, for the chosen meaningful $(\gamma, \epsilon)$ volume (which is taken to be the whole tested volume in this toy example). The BVM ratio or the averaged Boolean BVM ratio may be used as a guide for selecting models in the deterministic case, given reasonable regions are chosen.

The BVM probability of agreement values as a function of $(\gamma, \epsilon)$ are plotted in Figure 5 for model 1 and model 2 in the uncertain model case:

Figure 5: Uncertain Case: The BVM probability of agreement between models 1 and 2 and the data is plotted in the space of $(\gamma, \epsilon)$. The results for model 1 (4th order polynomial) are plotted on the left and those for model 2 (6th order polynomial) are plotted on the right. Because the model paths are uncertain, the BVM probability of agreement for each $(\gamma, \epsilon)$ pair may take any value from zero to one. As expected, model 2 better fits the data in the space of $(\gamma, \epsilon)$ as its BVM is generally larger than that of model 1; however, the BVM values are about equal in cases of large values of $\gamma$ and $\epsilon$ (the definition of agreement is less stringent and they both "agree") and in the case of demanding absolute equality ($\epsilon = 0$), as neither model fits the data exactly.
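The averaged Boolean BVM ratio of equation (25) is a ratio of sums over the tested $(\gamma, \epsilon)$ grid; with a uniform $p(\gamma, \epsilon) = 1/V$ the volume cancels. A minimal sketch follows, with a hypothetical function name, made-up example grids, and an assumed convention for a zero denominator.

```python
import numpy as np

def averaged_bvm_ratio(bvm_grid_1, bvm_grid_2):
    """Equation (25): ratio of BVM values summed over the same (gamma, eps) grid."""
    total_1 = float(np.sum(bvm_grid_1))
    total_2 = float(np.sum(bvm_grid_2))
    if total_2 == 0.0:
        # Assumed convention: undefined if both sums vanish, infinite otherwise.
        return float("nan") if total_1 == 0.0 else float("inf")
    return total_1 / total_2

# Hypothetical deterministic grids (entries are 0 or 1), rows indexed by gamma
# and columns by eps; in this case the sums are simply counts of agreements.
grid_model_1 = np.array([[1, 0, 0],
                         [1, 1, 0]])
grid_model_2 = np.array([[1, 1, 0],
                         [1, 1, 1]])
print(averaged_bvm_ratio(grid_model_1, grid_model_2))   # 3 agreements vs 5
```

The same function applies unchanged to the uncertain case, where the grid entries are probabilities between zero and one rather than counts.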
The BVM ratios $R(B(\gamma, \epsilon)) = \frac{p(A|M_1, D, \gamma, \epsilon)}{p(A|M_2, D, \gamma, \epsilon)}$ of the uncertain models are plotted as a function of $(\gamma, \epsilon)$ in Figure 6:

Figure 6: This is a plot of the BVM ratios in the uncertain case. Model 2 is generally favored over model 1, as there exist no values greater than one on the plot. The amount by which the BVM ratio favors model 2 over model 1 decreases as the metric becomes less and less stringent (i.e. as $\gamma$ decreases and $\epsilon$ increases). The $\epsilon = 0$ line was removed because neither model agrees with the data exactly.

The averaged Boolean BVM ratio for the uncertain models is,

$$R(B) = \frac{\sum_{\gamma, \epsilon} p(A|M_1, D, \gamma, \epsilon)}{\sum_{\gamma, \epsilon} p(A|M_2, D, \gamma, \epsilon)} = 0.7471, \tag{26}$$

which conforms to the notion that model 2 is, generally speaking, the preferable model, and which may be communicated with this single number. We have given examples of the BVM ratio correctly selecting models according to abstract and new forms of validation.

6 Conclusion

We demonstrated the versatility of the BVM toward expressing and solving model validation problems. The BVM quantifies the probability the model is valid for arbitrary quantifiable definitions of model-data agreement using arbitrary quantified comparison functions of the model-data comparison values. The BVM was shown: to obey all of the desired validation metric criteria [34] (which is a first), to be able to represent all of the standard validation metrics as special cases, to supply improvements and generalizations to those special cases, and to be a tool for quantifying the validity of a model in novel model-data contexts. The latter was demonstrated by the validation metrics we invented and quantified in our examples. Finally, it was shown that one can perform model selection using the BVM ratio.
The BVM model testing framework was shown to generalize the Bayesian model testing framework to arbitrary model-data contexts and with reference to arbitrary comparisons and agreement definitions. That is, the BVM ratio may be used to rank models directly in terms of the relevant model-data validation context. The problem of model-data validation may be reduced to the problem of finding/defining the four BVM inputs: $(\hat{z}, z)$, $\rho(\hat{z}, z|M, D)$, $f(\hat{z}, z)$, and $B(f)$, and computing their BVM value. We find that the BVM is a useful tool for quantifying, expressing, and performing model validation and testing. Our future work involves probabilistically regressing (learning) model parameter distributions that satisfy arbitrary definitions of agreement within the BVM framework.

Acknowledgments

This work was supported by the Center for Complex Engineering Systems (CCES) at King Abdulaziz City for Science and Technology (KACST) and the Massachusetts Institute of Technology (MIT). We would like to thank all of the researchers with the Center for Complex Engineering (CCES), especially Zeyad Al-Awwad, Arwa Alanqari, and Mohammad Alrished. Finally, we would also like to thank Nicholas Carrara and Ariel Caticha.

References

[1] W. L. Oberkampf, T. G. Trucano, and C. Hirsch, 2004, Verification, Validation, and Predictive Capability in Computational Engineering and Physics, Appl. Mech. Rev., 57(3), pp. 345-384.
[2] D. Sornette, A. B. Davis, K. Ide, K. R. Vixie, V. Pisarenko, and J. R. Kamm, 2007, Algorithm for Model Validation: Theory and Applications, Proc. Natl. Acad. Sci. U.S.A., 104(16), pp. 6562-6567.
[3] T. G. Trucano, L. P. Swiler, T. Igusa, W. L. Oberkampf, and M. Pilch, 2006, "Calibration, validation, and sensitivity analysis: What's what", Reliab. Eng. Syst. Saf. 91, pp. 1331-1357.
[4] M. C. Kennedy and A. O'Hagan, Bayesian calibration of computer models, J. R. Stat. Soc.: Ser. B (Stat Methodol.), vol. 63, no. 3, pp. 425-464, 2001.
[5] O. P. Maître, O. M. Knio, Spectral Methods for Uncertainty Quantification. New York: Springer Science+Business Media, 2010.
[6] Adams, B.M., Bauman, L.E., Bohnhoff, W.J., Dalbey, K.R., Ebeida, M.S., Eddy, J.P., Eldred, M.S., Hough, P.D., Hu, K.T., Jakeman, J.D., Stephens, J.A., Swiler, L.P., Vigil, D.M., and Wildey, T.M., "Dakota, A Multilevel Parallel Object-Oriented Framework for Design Optimization, Parameter Estimation, Uncertainty Quantification, and Sensitivity Analysis: Version 6.0 Users Manual," Sandia Technical Report SAND2014-4633, July 2014. Updated November 2015 (Version 6.3).
[7] R. Ghanem, D. Higdon, and H. Owhadi, "The Uncertainty Quantification Toolkit (UQTk)" Handbook of Uncertainty Quantification, Springer, 2016.
[8] B. Debusschere, N. Habib, N. Najm, P. Pébay, O. M. Knio, R. Ghanem, and O. P. Maître, Numerical Challenges in the Use of Polynomial Chaos Representations for Stochastic Processes, SIAM J. Sci. Comput. vol. 26, no. 2, pp. 698-719, 2005.
[9] W. Gilks, S. Richardson, and D. J. Spiegelhalter, Markov chain Monte Carlo in practice. London, UK: Chapman and Hall, 1996.
[10] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, Equation of state calculations by fast computing machines, J. Chem. Phys., vol. 21, no. 6, pp. 1087, 1953.
[11] S. Sankararaman and S. Mahadevan, Integration of model verification, validation, and calibration for uncertainty quantification in engineering systems, Reliab. Eng. Syst. Saf., vol. 138, pp. 194-209, 2015.
[12] S. Sankararaman and S. Mahadevan, Model validation under epistemic uncertainty, Reliab. Eng. Syst. Saf. vol. 96, no. 9, pp. 1232-1241, 2011.
[13] S. Mahadevan and R. Rebba, Validation of reliability computational models using Bayes networks, Reliab. Eng. Syst. Saf. vol. 87, no. 2, pp. 223-232, 2005.
[14] C. Li and S. Mahadevan, Role of calibration, validation, and relevance in multi-level uncertainty integration, Reliab. Eng. Syst. Saf., vol. 148, pp. 32-43, 2016.
[15] S. Ferson, W. L. Oberkampf, and L. Ginzburg, Model validation and predictive capability for the thermal challenge problem, Comput. Methods Appl. Mech. Eng., vol. 197, no. 29, pp. 2408-2430, 2008.
[16] C. Roy and W. Oberkampf, A comprehensive framework for verification, validation, and uncertainty quantification in scientific computing, Comput. Methods Appl. Mech. Engrg., vol. 200, pp. 2131-2144, 2011.
[17] Y. Ling and S. Mahadevan, Quantitative model validation techniques: New insights, Reliab. Eng. Syst. Saf., vol. 111, pp. 217, 2013.
[18] W. Li, W. Chen, Z. Jiang, Z. Lu, and Yu Liu, New validation metrics for models with multiple correlated responses, Reliab. Eng. Syst. Saf., vol. 127, pp. 1-11, 2014.
[19] D. Wu, Z. Lun, Y. Wang, and L. Cheng, Model validation and calibration based on component functions of model output, Reliab. Eng. Syst. Saf., vol. 140, pp. 59-70, 2015.
[20] L. Zhao, Z. Lu, W. Yun, and W. Wang, Validation metric based on Mahalanobis distance for models with multiple correlated responses, Reliab. Eng. Syst. Saf., vol. 159, pp. 80-89, 2017.
[21] M. Stefano and B. Sudret, UQLab user manual - Polynomial chaos expansions. Report UQLab-V0.9-104, Chair of Risk, Safety and Uncertainty Quantification, ETH Zurich, 2015.
[22] M. Parno and A. Davis, "MUQ: MIT Uncertainty Quantification Library", http://muq.mit.edu/home, 2018.
[23] C. Wang and H. G. Matthies, Novel model calibration method via non-probabilistic interval characterization and Bayesian theory, Reliab. Eng. Syst. Saf., vol. 183, pp. 84-92, 2019.
[24] R. Rebba and S. Mahadevan, Computational methods for model reliability assessment, Reliab. Eng. Syst. Saf., vol. 93, no. 8, pp. 1197-1207, 2008.
[25] N. T. Stevens, Assessment and comparison of Continuous Measurement Systems, Ph. D. Thesis, Dept. Statistics, Univ. of Waterloo, Waterloo, Ontario, Canada, 2014.
[26] S. Sankararaman and S. Mahadevan, Assessing the reliability of computational models under uncertainty. In: The 54th AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics, and materials conference; 2013.
[27] W. L. Oberkampf, M. F. Barone, Measures of agreement between computation and experiment: validation metrics, J. Comput. Phys. vol. 217, no. 1, pp. 5-36, 2006.
[28] W. L. Oberkampf, M. F. Barone, Measures of agreement between computation and experiment: validation metrics, AIAA Paper 2004-2626.
[29] D. Sivia and J. Skilling, "Data Analysis A Bayesian Tutorial second edition", Oxford University Press, Oxford, UK, 2006.
[30] B. Placek, Bayesian Detection and Characterization of Extra-Solar Planets Via Photometric Variations, Ph. D. Thesis, Dept. Physics, Univ. at Alb. (S.U.N.Y.), Albany, NY, USA, 2014.
[31] A. E. Gelfand and D. K. Dey, Bayesian model choice: asymptotics and exact calculations, J. R. Stat. Soc. Ser. B (Methodol.), pp. 501-514, 1994.
[32] J. Geweke, Bayesian model comparison and validation, Am. Econ. Rev. 2007;97(2):604.
[33] R. Zhang, S. Mahadevan, Bayesian methodology for reliability model acceptance, Reliab. Eng. Syst. Saf., vol. 80, no. 1, pp. 95-103, 2003.
[34] Y. Liu, W. Chen, P. Arendt, and H. Z. Huang, Toward a better understanding of model validation metrics, Trans ASME J. Mech. Des. vol. 133, no. 7, pp. 071005, 2011.
[35] K. A. Maupin, L. P. Swiler, and N. W. Porter, Validation metrics for deterministic and probabilistic data. Journal of Verification, Validation and Uncertainty Quantification, vol. 3, no. 3, pp. 031002, 2018.
[36] G. Lee, W. Kim, H. Oh, B. D. Youn, and N. H. Kim, Review of statistical model calibration and validation from the perspective of uncertainty structures, Structural and Multidisciplinary Optimization 91, pp. 1-26, 2019.
[37] J. Mullins, Y. Ling, S. Mahadevan, L. Sun, A. Strachan, Separation of aleatory and epistemic uncertainty in probabilistic model validation, Reliab. Eng. Syst. Saf., vol. 147, pp. 49-59, 2016.
[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[39] E. T. Jaynes. In "Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science Vol. II", D. Reidel Publishing Company, Dordrecht-Holland, pp. 175-257, 1976.
[40] E. T. Jaynes, Probability Theory: The Logic of Science, UK Cambridge: Cambridge University Press, 2003.
[41] F. Feroz and M. P. Hobson, Multimodal nested sampling: an efficient and robust alternative to Markov chain Monte Carlo methods for astronomical data analyses, Monthly Notices of the Royal Astronomical Society, vol. 384, no. 2, pp. 449-463, 2008.
[42] A. Caticha, Entropic Inference and the Foundations of Physics (monograph commissioned by the 11th Brazilian Meeting on Bayesian Statistics - EBEB-2012). 2012. URL http://www.albany.edu/physics/ACaticha-EIFP-book.pdf.
[43] K. Knuth, Optimal Data-Based Binning for Histograms, arXiv:physics/0605197v2, https://arxiv.org/abs/physics/0605197v2, 2013.

A Deriving the other validation metrics from the BVM

In the following subsections we will show some of the special cases of the Bayesian validation metric. Subsequent improvements or immediate generalizations of the metrics using (2) are presented when applicable. A detailed review of the majority of these metrics may be found in [34] and the references therein. Tables 1 and 2 in Section 4 outline the results.
A.1 Reliability metric and probability of agreement

There are a few validation metrics related to the reliability metric present in the literature. The reliability metric $r = p(|\langle\hat{y}\rangle - \mu_y| < \epsilon)$ [24] is equal to the probability that the data and the model expectation values are within a tolerance of size $\epsilon$. The "probability of agreement" introduced in [25] is closely related to $r$, but instead expresses the quantity as "the probability the data and the model expectation values agree within a tolerance (or sliding tolerance) of $\epsilon$". The reliability metric was expanded in [26] to account for model outputs and data rather than simply comparing the mean of the model prediction against the mean of the data. The improved reliability metric is equal to,

$$r_i = \int_{-\infty}^{\infty} \rho(Y|D) \int_{Y - \epsilon(Y)}^{Y + \epsilon(Y)} \rho(\hat{Y}|M) \, d\hat{Y} \, dY, \tag{27}$$

where $\rho(\hat{Y}|M)$ and $\rho(Y|D)$ are the full joint probability distributions of the model outputs and the data, respectively. This metric quantifies the probability that the error is less than a value $\epsilon(y)$ on a point to point basis.

The BVM is the reliability metric when the comparison values are $\hat{z} = \langle\hat{y}\rangle$ and $z = \mu_y$, and $B$ takes the form of an inequality, being true if $-\epsilon \leq \langle\hat{y}\rangle - \mu_y \leq \epsilon$. If we would like to use a sliding interval of "tolerance" or "error acceptance", denote it by $[c_-(\mu_y), c_+(\mu_y)]$, where $c_- < c_+$, and the BVM is,

$$p(A|M, D) = \int_{\langle\hat{y}\rangle, \mu_y} \rho(\langle\hat{y}\rangle|M) \, \Theta\Big(c_-(\mu_y) \leq \langle\hat{y}\rangle \leq c_+(\mu_y)\Big) \, \rho(\mu_y|D) \, d\langle\hat{y}\rangle \, d\mu_y = \int_{\mu_y} \int_{\langle\hat{y}\rangle = c_-(\mu_y)}^{c_+(\mu_y)} \rho(\langle\hat{y}\rangle|M) \, \rho(\mu_y|D) \, d\langle\hat{y}\rangle \, d\mu_y = r. \tag{28}$$

This is the reliability metric if $c_\pm = \mu_y \pm \epsilon$ is a constant and symmetric interval about $\mu_y$. The BVM is the improved reliability metric when $\hat{z} = \hat{Y}$, $z = Y$, and when the Boolean $B(\hat{z}, z)$ is true iff $|\hat{y} - y| \leq \epsilon(y)$ for all $\hat{y}, y$ pairs.
This is,

$$p(A|M, D) = \int_{\hat{Y}, Y} \rho(Y|D) \, \rho(\hat{Y}|M) \Big( \prod_{\hat{y}, y \text{ pairs}} \Theta\big(|\hat{y} - y| \leq \epsilon(y)\big) \Big) \, d\hat{Y} \, dY = r_i. \tag{29}$$

The BVM quantifies the probability that the square error (or difference) is less than some $\epsilon$ by considering,

$$p(A|M, D) = \int_{\hat{Y}, Y} \rho(Y|D) \, \Theta\big(|\hat{Y} - Y| \leq \epsilon\big) \, \rho(\hat{Y}|M) \, d\hat{Y} \, dY. \tag{30}$$

Nothing in the BVM requires the variables to be continuous or ordered, so the natural generalization is to let the Boolean expression be true if "The value $\hat{z}$ is in the subset $S(z)$, which is the set of $\hat{z}$'s agreeing with $z$". For example, if the $\hat{z}$'s are strings, $S(z)$ might be the set of words or phrases in $\hat{z}$ that are reasonably synonymous with $z$. Using sets rather than intervals gives the straightforward generalization to accommodate arbitrary data types,

$$p(A|M, D) = \sum_{\hat{z}, z} p(\hat{z}|M) \, \Theta\big(\hat{z} \in S(z)\big) \, p(z|D) = \sum_{z} \sum_{\hat{z} \in S(z)} p(\hat{z}|M) \, p(z|D). \tag{31}$$

A.2 Frequentist validation metric

To include the frequentist validation metric in the BVM, we will have to express the comparison variables $\hat{z}$ and $z$ and their respective probabilities. The result can be replicated by letting $z = \mu_y$ be Student-t distributed and $\hat{z} = \langle\hat{y}\rangle$ have a Dirac delta distribution $\rho(\hat{z}|M, D) = \rho(\langle\hat{y}\rangle|M, D) = \delta(\langle\hat{y}\rangle - \langle\hat{y}\rangle_0)$, where $\langle\hat{y}\rangle_0$ is the known value of the computational model's expected output. Because the frequentist validation metric does not force the modeler to define what is meant by agreement, we represent this freedom by keeping $B(\hat{z}, z)$ general.
This gives,

$$p(A|M, D) = \int_{\hat{z}, z} \rho(\hat{z}|M, D) \, \Theta(B) \, \rho(z|D) \, d\hat{z} \, dz = \int_{\langle\hat{y}\rangle, \mu_y} \delta(\langle\hat{y}\rangle - \langle\hat{y}\rangle_0) \cdot \Theta\big(B(\langle\hat{y}\rangle, \mu_y)\big) \cdot \Bigg( \frac{\Gamma(\frac{\nu + 1}{2})}{\sqrt{\nu\pi} \, \Gamma(\frac{\nu}{2})} \Big[ 1 + \frac{(\mu_y - \bar{y})^2}{\nu s^2 / N} \Big]^{-\frac{\nu + 1}{2}} \Bigg) \, d\langle\hat{y}\rangle \, d\mu_y, \tag{32}$$

where $\bar{y}$ is the population average, $s$ is the population standard deviation, and $\nu$ is the number of degrees of freedom; these are the parameters of the data's Student-t distribution. Making the coordinate transformations $\langle\hat{y}\rangle \rightarrow \tilde{E} = \langle\hat{y}\rangle - \bar{y}$ and $\mu_y \rightarrow E = \langle\hat{y}\rangle - \mu_y$ gives,

$$p(A|M, D) = \int_{\tilde{E}, E} \delta(\tilde{E} - \tilde{E}_0) \cdot \Theta\big(B(\tilde{E}, E)\big) \cdot \Bigg( \frac{\Gamma(\frac{\nu + 1}{2})}{\sqrt{\nu\pi} \, \Gamma(\frac{\nu}{2})} \Big[ 1 + \frac{(\tilde{E} - E)^2}{\nu s^2 / N} \Big]^{-\frac{\nu + 1}{2}} \Bigg) \, d\tilde{E} \, dE = \frac{\Gamma(\frac{\nu + 1}{2})}{\sqrt{\nu\pi} \, \Gamma(\frac{\nu}{2})} \int_{E_0} \Theta\big(B(\tilde{E}_0, E_0)\big) \cdot \Big[ 1 + \frac{(\tilde{E}_0 - E_0)^2}{\nu s^2 / N} \Big]^{-\frac{\nu + 1}{2}} \, dE_0, \tag{33}$$

with $E_0 = \langle\hat{y}\rangle_0 - \mu_y$ and $\tilde{E}_0 = \langle\hat{y}\rangle_0 - \bar{y}$. Given that judgments of agreement in the frequentist validation metric are expected to be made based on the confidence level $1 - \alpha$ that $E_0$ is within the confidence interval, this may be factored into $B(\tilde{E}_0, E_0) \rightarrow B(\tilde{E}_0, E_0, \alpha)$, as well as other user defined terms toward expressing agreement. Equation (33) is thought to be the full BVM representation of the frequentist validation metric.

The immediate generalization to the frequentist validation metric offered by the BVM is to let the model output expectation value $\hat{z} = \langle\hat{y}\rangle$ have some amount of uncertainty. The uncertainty is perhaps Gaussian or Student-t distributed in $\mu_{\langle\hat{y}\rangle}$ (the true model expectation value) due to only having a finite number of Monte Carlo samples and/or uncertainty induced by discretization error. Because $\hat{z} = \mu_{\langle\hat{y}\rangle}$ and $z = \mu_y$ are both uncertain in general, one generalizes to,

$$p(A|M, D, B) = \int_{\mu_{\langle\hat{y}\rangle}, \mu_y} \rho(\mu_{\langle\hat{y}\rangle}|M) \cdot \Theta\big(B(\mu_{\langle\hat{y}\rangle}, \mu_y)\big) \cdot \rho(\mu_y|D) \, d\mu_{\langle\hat{y}\rangle} \, d\mu_y. \tag{34}$$

It is interesting to note the consequence of defining a reasonable Boolean expression of agreement on the BVM representation of the frequentist metric. A natural agreement function $B$ for the metric is one that is true if $E_\mu = |\mu_{\langle\hat{y}\rangle} - \mu_y| \leq \epsilon$. The BVM then gives,

$$p(A|M, D) = \int_{\mu_{\langle\hat{y}\rangle}, \mu_y} p(\mu_{\langle\hat{y}\rangle}|M, D) \cdot \Theta\big(E_\mu \leq \epsilon\big) \cdot p(\mu_y|D) \, d\mu_{\langle\hat{y}\rangle} \, d\mu_y = \int_{\mu_{\langle\hat{y}\rangle}} \int_{\mu_y = \mu_{\langle\hat{y}\rangle} - \epsilon}^{\mu_{\langle\hat{y}\rangle} + \epsilon} p(\mu_{\langle\hat{y}\rangle}|M, D) \, p(\mu_y|D) \, d\mu_{\langle\hat{y}\rangle} \, d\mu_y = r. \tag{35}$$

Thus, the frequentist validation metric is the reliability metric [24] and "probability of agreement" [25] when reasonable accuracy requirements are imposed on the acceptable difference between the expectation values.

A.3 Area and Binned Probability Difference Metric

The BVM is able to represent and generalize the area metric by letting the comparison values $(z, \hat{z})$ be the cdfs to be compared, $(F(y|D), F(\hat{y}|M))$. The Area metric is,

$$d[F(\hat{y}|M), F(y|D)] = \int_{-\infty}^{\infty} \Big| F(\hat{y}|M) - F(y = \hat{y}|D) \Big| \, d\hat{y}. \tag{36}$$

The Area metric may be represented by the BVM in a simple way. First allow the comparison value function to be the Area metric functional,

$$f(z, \hat{z}) \equiv d[F(\hat{y}|M), F(y|D)], \tag{37}$$

and define agreement with a Boolean,

$$p(A|M, D) = \Theta\Big(B\big(d[F(\hat{y}|M), F(y|D)]\big)\Big). \tag{38}$$

Here, the cdfs are treated as completely certain ($\delta(\hat{z} - F(\hat{y}|M))$ and $\delta(z - F(y|D))$ are Dirac delta functionals). The functional form of the Boolean may be decided on given the user's specific validation requirements; however, satisfying some kind of $\epsilon$ threshold, $d[F(\hat{y}|M), F(y|D)] \leq \epsilon$, seems logical. Generalizations to the Area metric may be represented analogously with the BVM.
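As a minimal sketch of Eqs. (36)-(38), the Area metric can be evaluated exactly for empirical (step function) cdfs and then passed through a threshold Boolean. The use of empirical cdfs built from samples, the function names, and the threshold form $d \leq \epsilon$ are assumptions made for illustration only.

```python
import numpy as np

def area_metric(model_samples, data_samples):
    """Integral of |F_model - F_data| for two empirical cdfs (cf. Eq. 36)."""
    m = np.sort(np.asarray(model_samples, dtype=float))
    d = np.sort(np.asarray(data_samples, dtype=float))
    grid = np.sort(np.concatenate([m, d]))
    # Both empirical cdfs are constant on each interval [grid[i], grid[i+1]).
    F_m = np.searchsorted(m, grid[:-1], side="right") / m.size
    F_d = np.searchsorted(d, grid[:-1], side="right") / d.size
    return float(np.sum(np.abs(F_m - F_d) * np.diff(grid)))

def bvm_area(model_samples, data_samples, eps):
    """Deterministic BVM of Eq. (38) with the assumed Boolean d <= eps."""
    return 1.0 if area_metric(model_samples, data_samples) <= eps else 0.0

# Hypothetical samples: the model output is the data shifted by 0.5.
data = np.array([0.0, 1.0, 2.0])
model = data + 0.5
print(area_metric(model, data))      # area between the two step cdfs
print(bvm_area(model, data, 0.25))   # fails the tighter tolerance
print(bvm_area(model, data, 0.75))   # passes the looser tolerance
```

Since the cdfs are treated as certain here, the BVM is either zero or one; the uncertain case would instead average this indicator over draws of the cdfs.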
When the values of the cdfs are uncertain the BVM becomes,

$$p(A|M, D) \leftarrow \int \Theta\Big(B\big(d[F(\hat{y}|M), F(y|D)]\big)\Big) \, \rho\big(F(\hat{y}|M), F(y|D)\big) \, dF(\hat{y}|M) \, dF(y|D), \tag{39}$$

which may pose computational challenges and require random sampling; however, in some cases it may be permissible to treat the model cdf as known. The extension to uncertain cdfs is usually not considered in the literature; however, the theoretical generalization to the uncertain case is apparent in the BVM framework. As an approximation, one may consider discretizing the Area metric by breaking it into $K$ comparison points,

$$d[F(\hat{y}|M), F(y = \hat{y}|D)] \approx \sum_{i=1}^{K} \Big| F(y_i|M) - F(y = y_i|D) \Big|. \tag{40}$$

The uncertainty in the cumulative distribution of the data $\rho(z|D)$ is now a finite joint product pdf $\rho(F(y_1|D), ..., F(y_K|D)|D)$ over possible cdf values in the $K$ bins, constrained by $F(y_i|D) \leq F(y_{i+1}|D)$. This metric may also be represented with the BVM.

Alternatively, one may consider a binned probability difference comparison functional,

$$d_m[p(\hat{y}|M), p(y = \hat{y}|D)] = \sum_{i=1}^{K} \Big| p(y_i|M) - p(y = y_i|D) \Big|. \tag{41}$$

Let $p(z|D) = P(p_{y_1}, ..., p_{y_K}|D, \sum_k p_{y_k} = 1)$ be the uncertainty for the probability estimate in each data bin due to the random process. Given that the pdf of the model is set to a (presumably known) single pdf function $p(\hat{y}|M)$, the BVM is,

$$p(A|M, D) = \int_{\hat{z}, z} \delta\big(\hat{z} - \{p(\hat{y}|M)\}\big) \, \Theta\Big(B\big(d_m[\hat{z}, z]\big)\Big) \, p(z|D) \, d\hat{z} \, dz = \int_{p_{y_1}, ..., p_{y_K}} \Theta\Big(B\big(d_m[\{p(\hat{y}_i|M)\}, \{p(y_i|D)\}]\big)\Big) \, P(p_{y_1}, ..., p_{y_K}|D) \prod_i (dp_{y_i}). \tag{42}$$

Given the form of the distribution $P(p_{y_1}, ..., p_{y_K}|D)$ is well known (i.e. a Dirichlet distribution after $N$ independent observations of $y$), it can be treated as an expectation value,

$$p(A|M, D) = E\Big[ \Theta\Big(B\big(d_m[p(\hat{y}|M), \{p(y_i|D)\}]\big)\Big) \Big]_{\{p(y_i|D)\}}, \tag{43}$$

and estimated with Monte Carlo. When $K$ is large we may face the curse of dimensionality if the probabilities (bin heights) are themselves uncertain.

The number of bins $K$ plays a role similar to $\epsilon$ in that its choice affects the definition/context of agreement represented by the BVM. The binned pdf metric is most informative when $K$ is large because one is checking each region of the pdf for agreement. On the other hand, perhaps when data is limited, there may be instances when having $K = 2$ bins is useful, if for instance one was interested in testing a model for representing binary type probabilities (like pass/fail or positive/negative). Using a favorable agreement in this simple binary context to imply that the model works for otherwise untested $K > 2$ comparisons/contexts/validations is statistical misrepresentation and should be avoided (see Section 2.4). The BVM requires one to explicitly state the comparison values, the comparison value function, and the definition of agreement to compute $P(A|M, D)$ such that confusion may be avoided and model validation comparisons may be justified. Thus, the choice of $K$ ultimately determines the granularity of the definition of agreement being tested. If one uses methods similar to [43] to select $K$, then one is letting the complexity of the data and the number of data points determine the stringency of the definition of agreement.

A.4 Probability Density Function Comparison Metrics

Another way to gauge the agreement between uncertain data and models is through a pdf comparison metric, $G(\rho_D || \rho_M)$ [35].
Examples of $G(\rho_D || \rho_M)$ are the negative relative entropy or KL divergence,

$$D(\rho_D || \rho_M) = \int_y \rho(y|D) \log\Big( \frac{\rho(y|D)}{\rho(\hat{y} = y|M)} \Big) \, dy, \tag{44}$$

the symmetrized KL divergence $S(\rho_D || \rho_M) = D(\rho_D || \rho_M) + D(\rho_M || \rho_D)$, the Jensen-Shannon divergence $D_{JS}(\rho_D || \rho_M)$, the Hellinger metric $H(\rho_D || \rho_M)$, the Fisher information distance $\ell(\rho_D || \rho_M)$, and the Wasserstein distance $W(\rho_D || \rho_M)$. These metrics give a notion of "closeness" between the pdfs that can be used for validation.

The BVM may represent these validation metrics by letting the comparison value function $f(z, \hat{z})$ be the pdf comparison metric,

$$f(z, \hat{z}) \equiv G(\rho_D || \rho_M),$$

where $\hat{z} \equiv \rho_M$ and $z \equiv \rho_D$ are the model and data pdfs, respectively. In the absence of model and data uncertainty (i.e. the pdfs are treated as known functions, e.g. $\rho_D$ is a Gaussian with mean 0 and variance 1), the BVM is simply,

$$p(A|M, D) = \Theta\Big(B[f(z, \hat{z})]\Big) = \Theta\Big(B[G(\rho_D || \rho_M)]\Big),$$

which either meets the specifications for agreement defined by $B$ or does not (e.g. passing a tolerance threshold). Following the structure of the BVM, if there are uncertainties in the functional forms of the pdfs, they may be included in the BVM,

$$p(A|M, D) \leftarrow \int_{\rho_D, \rho_M} \Theta\Big(B[G(\rho_D || \rho_M)]\Big) \, \rho(\rho_D, \rho_M) \, d\rho_D \, d\rho_M.$$

Uncertainties in the data pdfs may come from a lack of data, and uncertainties in the model may come from parametric uncertainty (e.g. a Gaussian pdf model $\rho_{M|\mu}$ with an uncertain mean $\rho(\mu)$).^6 Similar to the Area metric, these metrics may be discretized from pdf to probability comparison metrics, and uncertainty in the pdfs themselves is typically not discussed/quantified in the literature.

^6 One should note that here there is the potential for two different models: one in which $\rho_M \equiv \int \rho_{M|\mu} \, \rho(\mu) \, d\mu$ is marginalized over (in which case the model pdf is "certain" if integrated analytically), and another in which the model pdf is Gaussian, $\rho_{M|\mu}$, but there is model parametric uncertainty of the form $\rho(\mu)$.

A.5 Statistical hypothesis testing

Normally when statistical hypothesis testing is performed, one constructs the pdf of the relevant test statistic of the data $\rho(z|D)$ and then assumes the null hypothesis is true ($M = D$) counterfactually, which is enforced by setting the pdf of the "to be tested population" (the model outputs here) equal to the pdf of the test statistic of the data, $\rho(\hat{z}|M) \rightarrow \rho(\hat{z}|M = D)$. However, in the present case, we are interested in the general modeling case in which one is able to extract a pdf of the outputs of the model, which we would like to test against data before assuming that they are equal. We will first represent the classical statistical hypothesis test using the BVM and then later supply a version that is more relevant to model validation problems.

Classical statistical hypothesis testing

In classical hypothesis testing, one constructs the pdf of a relevant test statistic $S_y \equiv z$ of the data $\rho(z|D)$. The null hypothesis is that the model is equal to the data, which is enforced by setting $\rho(\hat{z}|M) = \rho(\hat{z}|M = D)$. Further, the null hypothesis is not rejected if the test statistic from the model $\hat{z}$ falls within the critical region $[-c_\alpha, c_\alpha]$ that corresponds to the probability $\int_{-c_\alpha}^{c_\alpha} \rho(z|D) \, dz = 1 - \alpha$ of the data. The case of not rejecting the null hypothesis is represented by the Boolean expression $B(\hat{z})$, which is true if $-c_\alpha \leq \hat{z} \leq c_\alpha$, defining "agreement" in this case.
This results in the following BVM,

$$p(A|M = D, D) = \int_{\hat{z}, z} \rho(\hat{z}|M = D) \, \Theta\big(B(\hat{z})\big) \, \rho(z|D) \, d\hat{z} \, dz = \int_{\hat{z}} \rho(\hat{z}|M = D) \, \Theta\big(-c_\alpha \leq \hat{z} \leq c_\alpha\big) \, d\hat{z} = 1 - \alpha, \tag{45}$$

because the model was assumed to be equal to the data counterfactually in the null hypothesis and $B(\hat{z}, z) = B(\hat{z})$ only.

The probability of type I error, i.e. rejecting the counterfactually assumed true null hypothesis when it is actually true, is equal to $\alpha$. It should be noted that this is not equal to the probability of finding the model value outside of the data's confidence interval, because it is unknown whether the null hypothesis (which results in $\rho(\hat{z}|M) \rightarrow \rho(\hat{z}|M = D)$) is actually true; it was merely assumed to be true counterfactually. It should further be noted that with probability $\alpha$ the data's test statistic is outside of its own confidence interval. Thus, the probability $\alpha$ indicates both type I error and a systematic type error that the wrong sort of comparison is being made, i.e. the wrong Boolean expression was chosen, because why would we care if the model is within a certain confidence interval if the data is not even within that interval?^7 The probability of type II error, i.e. that the null hypothesis was not rejected when it is actually false, is equal to $\beta_M(\alpha) = 1 - \int_{-c_\alpha}^{c_\alpha} \rho(\hat{z}|M) \, d\hat{z}$, which in the classical case is difficult to calculate directly because one does not have access to the actual model pdf $p(\hat{z}|M)$ in frequentist probability.

Improved statistical hypothesis testing for validation

For the validation cases we are interested in, both $\rho(\hat{z}|M)$ and $\rho(z|D)$ are quantified, and therefore assuming that $\rho(\hat{z}|M) \rightarrow \rho(\hat{z}|M = D)$ would irresponsibly throw away any information sent through the model.
We therefore offer the improved statistical hypothesis test for validation using the BVM, which uses both the model and data pdfs. We call this BVM the statistical power BVM. For the modified statistical hypothesis test, let the definition of agreement be a compound Boolean expression that is true iff both $-c_\alpha \leq \hat{z} \leq c_\alpha$, that the model test statistic lies in the data's confidence interval, and $-c_{\hat{\alpha}} \leq z \leq c_{\hat{\alpha}}$, that the data statistic lies in the model's confidence interval, which corresponds to the probability $\int_{-c_{\hat{\alpha}}}^{c_{\hat{\alpha}}} \rho(\hat{z}|M) \, d\hat{z} = 1 - \hat{\alpha}$ of the model. By not assuming $M = D$, we remove the possibility that either type I or type II errors can occur; however, systematic errors, that the Boolean expression meant to define agreement is nonsensical, still exist. Thus, while more or less adhering to the type of tests one might perform for model validation using statistical hypothesis testing, we get,

$$p(A|M, D) = \int_{\hat{z}, z} \rho(\hat{z}|M, D) \, \Theta\big(-c_\alpha \leq \hat{z} \leq c_\alpha\big) \, \Theta\big(-c_{\hat{\alpha}} \leq z \leq c_{\hat{\alpha}}\big) \, \rho(z|D) \, d\hat{z} \, dz = \big(1 - \beta_M(\alpha)\big) \cdot \big(1 - \beta_D(\hat{\alpha})\big), \tag{46}$$

which is the probability that both the model and the data lie within one another's confidence intervals. The value $1 - \beta(\alpha)$ is called the statistical power of the test, but here we have access to both the statistical power of the data and the model. The probability the model and data do not agree as defined by $B$ is given by,

$$p(\bar{A}|M, D) = 1 - p(A|M, D) = \beta_D(\hat{\alpha}) + \beta_M(\alpha) - \beta_D(\hat{\alpha}) \, \beta_M(\alpha), \tag{47}$$

which occurs if either $\hat{z}$, $z$, or both are outside of one another's confidence intervals.

^7 Independent of whether or not the null hypothesis is true.
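The product form of Eq. (46) can be sketched numerically under simplifying assumptions. Here the test statistics of the model and the data are assumed to be independent Gaussians, and each confidence interval is taken to be the central interval of its own distribution rather than an interval symmetric about zero; the function names are hypothetical.

```python
from statistics import NormalDist

def central_interval(dist, alpha):
    """Central interval holding probability 1 - alpha (requires 0 < alpha < 1)."""
    return dist.inv_cdf(alpha / 2.0), dist.inv_cdf(1.0 - alpha / 2.0)

def statistical_power_bvm(model, data, alpha, alpha_hat):
    """Probability that each statistic lies in the other's interval (cf. Eq. 46)."""
    lo_d, hi_d = central_interval(data, alpha)        # data's confidence interval
    lo_m, hi_m = central_interval(model, alpha_hat)   # model's confidence interval
    model_in_data = model.cdf(hi_d) - model.cdf(lo_d)  # plays the role of 1 - beta_M(alpha)
    data_in_model = data.cdf(hi_m) - data.cdf(lo_m)    # plays the role of 1 - beta_D(alpha_hat)
    return model_in_data * data_in_model

# Hypothetical test-statistic distributions for the data and two candidate models.
data = NormalDist(mu=0.0, sigma=1.0)
close_model = NormalDist(mu=0.2, sigma=1.0)
far_model = NormalDist(mu=5.0, sigma=1.0)

p_close = statistical_power_bvm(close_model, data, alpha=0.05, alpha_hat=0.05)
p_far = statistical_power_bvm(far_model, data, alpha=0.05, alpha_hat=0.05)
print(p_close, p_far)
```

When the model and data distributions coincide, this sketch returns $(1 - \alpha)(1 - \hat{\alpha})$, which makes explicit that the intervals themselves cap the attainable probability of agreement.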
The probability for systematic error (that $\hat{z}$, $z$, or both are outside of their own confidence intervals) is equal to $\alpha + \hat{\alpha} - \alpha\hat{\alpha}$; however, there is no conceptual issue with setting $\alpha$, $\hat{\alpha}$, or both equal to 0 as long as both distributions do not span the entire range of possible values (as in such a case the model and data would always agree, which means the test has zero resolving power). Setting $\alpha = \hat{\alpha} = 0$ removes the chance of systematic error from the analysis.

Overall, the use of confidence intervals is suboptimal unless both the model and data pdfs are strictly unimodal. Therefore, rather than using a confidence interval, one may use a "confidence set", which we define as the smallest set of $\hat{z}$ (as well as $z$) values whose probability adds to $1 - \hat{\alpha}$ (and similarly $1 - \alpha$). Confidence sets are generated by adding the largest probabilities of $\hat{z}$ ($z$) until a confidence level of $1 - \hat{\alpha}$ ($1 - \alpha$) is met. A confidence interval is only equal to the confidence set if the distribution is unimodal. As the confidence set is the smallest set of values adding up to the confidence level, this set of values is more informative than a confidence interval, which may include many zero or low probability events. Using a confidence set improves the resolving power of the metric further.

The statistical power BVM (46) is more informative than the classical statistical hypothesis test (45) because it utilizes the model pdf while also removing type I, type II, and (optionally) systematic errors from the test; however, its overall resolving power is weak when compared to other metrics. By rewriting the pair of overlapping confidence intervals (or sets) as the set of values in a single "overlap interval" $I = [-c_\alpha, c_\alpha] \cap [-c_{\hat{\alpha}}, c_{\hat{\alpha}}]$, one may see that (46) is a particular case of (31) with $S(y) = I$ for $y \in I$ and being the null set otherwise.
Thus, the statistical power BVM may be seen as a special case of the generalized reliability metric suggested by the BVM in (31) that effectively has large and unvarying tolerance intervals (sets). The use of confidence intervals, as well as confidence sets, as a means to define agreement effectively coarse grains the probabilities, which removes their informative features. The most stringent definition of agreement between the model and the data leads to Bayesian model testing, which we prove in the next section, as it is indeed equal to the generalized model reliability metric with a zero tolerance for differences, $\epsilon = 0$.

A.6 Bayesian model testing

In Bayesian model testing, rather than assuming a particular model is true, one lets the available data determine which model is most likely given that data. Represented probabilistically, out of a set of possible models, Bayesian model testing selects the model with the maximum posterior probability $p(M|Y)$, i.e. the probability of a (previously calibrated [11]) model $M$ given the data set $Y = \{y_i\}$ from data source $D$. As the selection rule for models requires comparing values of $p(M|Y)$, one often constructs the posterior odds ratio,

$$R = \frac{p(M|Y)}{p(M'|Y)}, \tag{48}$$

and selects the model $M$ with the highest value of $R$ relative to a chosen base model $M'$. Using Bayes' Theorem, the posterior odds ratio may be recast as,

$$R = \frac{\rho(Y|M) \, p(M)}{\rho(Y|M') \, p(M')} = K \frac{p(M)}{p(M')}, \tag{49}$$

where $K \equiv \frac{\rho(Y|M)}{\rho(Y|M')}$ is known as the Bayes factor, which is the ratio of the likelihoods of the models. The ratio of the prior model probabilities $\frac{p(M)}{p(M')}$ is the ratio of the belief that model $M$ is true prior to examining the data, relative to $M'$.
If there is reason to believe that one model is more probable than another due to prior (perhaps statistical) knowledge of the system under investigation, then the Bayesian framework allows one to take this into account through the prior model probabilities. As it is common for the $R$ values to be many orders of magnitude greater than one, the prior model probabilities may be overwhelmed by the Bayes factor. In any case, if there is no reason to prefer one model over another, one can let the prior model probabilities be equal a priori.

Given that we are testing some model function $\hat{y} = M(\vec{x}, \vec{\alpha})$ and have access to the forward propagation of data and parameters through a given model, the problem of calculating $R$ may be rerouted to the computation of $K$. Using the potentially multidimensional inputs to our models $(\vec{x}, \vec{\alpha})$, which represent the input data and the model parameters respectively, the Bayes factor is calculated through forward propagation (marginalization) of these inputs through both models,

$$ K = \frac{p(Y \mid M)}{p(Y \mid M')} = \frac{\int_{\vec{x},\vec{\alpha}} \rho(Y \mid \vec{x}, \vec{\alpha}, M)\, \rho(\vec{x}, \vec{\alpha} \mid M)\, d\vec{x}\, d\vec{\alpha}}{\int_{\vec{x}',\vec{\alpha}'} \rho(Y \mid \vec{x}', \vec{\alpha}', M')\, \rho(\vec{x}', \vec{\alpha}' \mid M')\, d\vec{x}'\, d\vec{\alpha}'}. \tag{50} $$

The prior probabilities of the model inputs, $\rho(\vec{x}, \vec{\alpha} \mid M)$, are the input probability distributions used to propagate uncertainty through the model. The probability $\rho(Y \mid \vec{x}, \vec{\alpha}, M)$ is the probability of the data given knowledge of the model function $\hat{y} = M(\vec{x}, \vec{\alpha})$; however, it should be noted that in general the data $Y \neq \hat{Y} = \{\hat{y}_i\}$, that is, the data is not the set of model outputs $\hat{Y}$, as the data $Y$ was collected from the experiment rather than from model outputs. Thus, one must impose the assumptions under which a model is built, i.e. it is built for the purpose of approximating $y_i \approx \hat{y}_i = M(x_i, \vec{\alpha})$, as we will see later.
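The marginalization in (50) can be approximated by simple Monte Carlo: draw parameters from the prior, evaluate the likelihood of the data for each draw, and average. The sketch below is only illustrative and rests on several assumptions not fixed by the text: the inputs $x_i$ are treated as known exactly, the likelihood is an independent Gaussian with known $\sigma$, the priors are uniform, and the two toy models (`linear`, `constant`) and all function names are hypothetical. Practical problems usually need a more efficient sampler than plain prior sampling.

```python
import numpy as np

def log_gaussian_likelihood(y_pred, y_obs, sigma):
    """Log of an independent Gaussian likelihood (assumed form), one value per prior draw."""
    resid = y_pred - y_obs                                  # broadcasts over prior draws
    return (-0.5 * np.sum(resid**2, axis=-1) / sigma**2
            - y_obs.size * np.log(sigma * np.sqrt(2.0 * np.pi)))

def log_evidence(model, prior_samples, x, y_obs, sigma):
    """Monte Carlo estimate of log rho(Y | M): the likelihood averaged over prior draws."""
    y_pred = np.stack([model(x, a) for a in prior_samples])  # (n_draws, n_points)
    log_like = log_gaussian_likelihood(y_pred, y_obs, sigma)
    peak = log_like.max()                                    # log-sum-exp for stability
    return peak + np.log(np.mean(np.exp(log_like - peak)))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
sigma = 0.1
y_obs = 2.0 * x + rng.normal(0.0, sigma, size=x.size)       # synthetic "experiment"

linear = lambda x, a: a * x                                  # toy model M
constant = lambda x, c: np.full_like(x, c)                   # toy base model M'

n_draws = 5000
log_ev_linear = log_evidence(linear, rng.uniform(-5, 5, n_draws), x, y_obs, sigma)
log_ev_constant = log_evidence(constant, rng.uniform(-5, 5, n_draws), x, y_obs, sigma)
log_k = log_ev_linear - log_ev_constant                      # log Bayes factor
```

Because the synthetic data were generated from a line through the origin, the estimated log Bayes factor strongly favors the linear model over the constant one.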
Thus, a more verbose presentation of $K$ is

$$ K = \frac{p(\hat{Y} \equiv Y \mid M)}{p(\hat{Y} \equiv Y \mid M')} = \frac{\rho(\hat{Y} \equiv Y \mid M)}{\rho(\hat{Y} \equiv Y \mid M')} = \frac{\int_{\vec{x},\vec{\alpha}} \rho(\hat{Y} = Y \mid \vec{x}, \vec{\alpha}, M)\, \rho(\vec{x}, \vec{\alpha} \mid M)\, d\vec{x}\, d\vec{\alpha}}{\int_{\vec{x}',\vec{\alpha}'} \rho(\hat{Y} = Y \mid \vec{x}', \vec{\alpha}', M')\, \rho(\vec{x}', \vec{\alpha}' \mid M')\, d\vec{x}'\, d\vec{\alpha}'}, \tag{51} $$

where $p(\hat{Y} \equiv Y \mid M)$ is understood to be the sum of the model and the data probabilities that jointly output the same values, i.e. the probability that any of the possible model and data values turn out to be the same. The form of $\rho(\hat{Y} = Y \mid \vec{x}, \vec{\alpha}, M)$ is usually assumed to be something that works (and even better if it is also simple) [30], like the product of Gaussian distributions,

$$ \rho(\hat{Y} = Y \mid \vec{x}, \vec{\alpha}, M) = \frac{1}{Z} \exp\left( -\frac{1}{2\sigma^2} \sum_i \big( M(x_i; \vec{\alpha}) - y_i \big)^2 \right), \tag{52} $$

where $\sigma^2$ is interpreted as the measurement uncertainty of the data. (The dependence on the data set/experiment $D$ is therefore implied.) One usually computes the integrals in $K$ using various sampling algorithms such as nested sampling [29, 41] or another Markov Chain Monte Carlo technique. As an added bonus, Bayesian model testing has an inbuilt Occam's razor mechanism which penalizes needlessly complex models, i.e. those that would overfit the data by using a large number of uncertain model parameters $\vec{\alpha}$. A clear explanation may be found in [42].

We can represent Bayesian model testing using the Bayesian validation metric. Let the model comparison value $\hat{z}$ be $\hat{Y} = (\hat{y}_1, ..., \hat{y}_N)$, which in principle corresponds to $z = Y = (y_1, ..., y_N)$, a validation set of data. The Boolean expression $B$ is considered to factor into a set of "and" statements over the individual model output and data points, $B(\hat{Y}, Y) = B(\hat{y}_1, y_1) \wedge ... \wedge B(\hat{y}_N, y_N)$, where each $B(\hat{y}_i, y_i)$ is true iff $\hat{y}_i = y_i$ exactly. The Bayesian validation metric in this case is then

$$ p(A \mid M, D) = \int_{\hat{Y},Y} \rho(\hat{Y} \mid M, D)\, \theta\big(B(\hat{Y}, Y)\big)\, \rho(Y \mid D)\, d\hat{Y}\, dY = \int_{\hat{Y},Y} \rho(\hat{Y} \mid M, D)\, \delta_{\hat{Y},Y}\, \rho(Y \mid D)\, d\hat{Y}\, dY. \tag{53} $$

In general, computational models may be described by a model function $\hat{Y} = \vec{\hat{y}} = M(\vec{x}, \vec{\alpha})$, and given that the model pdf was constructed through forward propagation of the uncertainties $\rho(\vec{x}, \vec{\alpha})$, the model output pdf is

$$ \rho(\hat{Y} \mid M, D) = \int_{\vec{x},\vec{\alpha}} \rho(\hat{Y} \mid \vec{x}, \vec{\alpha}, M, D)\, \rho(\vec{x}, \vec{\alpha})\, d\vec{x}\, d\vec{\alpha}. \tag{54} $$

Substituting (54) into (53), using the trick (11) and (12), namely that

$$ \int_{Y-\epsilon}^{Y+\epsilon} \rho(\hat{Y} \mid M, D)\, d\hat{Y} \xrightarrow{\epsilon \to 0^{+}} p(\hat{Y} = Y \mid M, D) = \rho(\hat{Y} = Y \mid M, D)\, d\hat{Y}, $$

and integrating over $Y$, one finds

$$ \begin{aligned} p(A \mid M, D) &= \int_{\vec{x},\vec{\alpha},Y} p(\hat{Y} = Y \mid \vec{x}, \vec{\alpha}, M, D)\, \rho(\vec{x}, \vec{\alpha})\, \rho(Y \mid D)\, d\vec{x}\, d\vec{\alpha}\, dY \\ &= \int_{\vec{x},\vec{\alpha}} p(\hat{Y} \equiv Y \mid \vec{x}, \vec{\alpha}, M, D)\, \rho(\vec{x}, \vec{\alpha})\, d\vec{x}\, d\vec{\alpha} \\ &= \left( \int_{\vec{x},\vec{\alpha}} \rho(\hat{Y} \equiv Y \mid \vec{x}, \vec{\alpha}, M, D)\, \rho(\vec{x}, \vec{\alpha})\, d\vec{x}\, d\vec{\alpha} \right) d\hat{Y} \\ &= \rho(\hat{Y} \equiv Y \mid M, D)\, d\hat{Y}, \end{aligned} \tag{55} $$

which is what is meant by, and is equal to, the probability in the numerator of the Bayes factor (51),

$$ p(A \mid M, D) = \rho(\hat{Y} \equiv Y \mid M, D)\, d\hat{Y} \equiv p(\hat{Y} \equiv Y \mid M). \tag{56} $$

That is, we have shown that Bayesian model testing is a special case of the generalized model reliability metric (31) in the case of exact agreement, i.e. $\epsilon = 0$, as can be seen by investigating (53). Thus, a generalization of Bayesian model testing is to let the definition of agreement have a tolerance $\epsilon > 0$ such that the Bayes factor becomes

$$ K = \frac{p(A \mid M, D)}{p(A \mid M', D)} \longrightarrow K(\epsilon) = \frac{p(A \mid M, D, \epsilon)}{p(A \mid M', D, \epsilon)}, \tag{57} $$

where $p(A \mid M, D, \epsilon)$ is Eq. (28). This derivation suggests that the BVM can be used analogously to Bayesian model testing, except with arbitrary definitions of agreement $\epsilon \to B$; that is, we may construct the BVM factor,

$$ K(B) = \frac{p(A \mid M, D, B)}{p(A \mid M', D, B)}. \tag{58} $$

Using Bayes' theorem, $p(M \mid A, D, B) = \frac{p(A \mid M, D, B)\, p(M \mid D, B)}{p(A \mid D, B)}$, and so we may further construct the BVM ratio,

$$ R(B) = \frac{p(M \mid A, D, B)}{p(M' \mid A, D, B)} = \frac{p(A \mid M, D, B)\, p(M \mid D, B)}{p(A \mid M', D, B)\, p(M' \mid D, B)} = K(B)\, \frac{p(M \mid D, B)}{p(M' \mid D, B)}, \tag{59} $$

for the purpose of model testing under a general definition of agreement $B$, i.e., we can do model selection under any definition of agreement with the BVM ratio. The ratio $\frac{p(M \mid D, B)}{p(M' \mid D, B)}$ is the ratio of the prior probabilities of $M$ and $M'$; again, if there is no reason to suspect that one model is a priori more probable than another, one may let $\frac{p(M \mid D, B)}{p(M' \mid D, B)} = 1$, and then $R(B) \to K(B)$. With the BVM ratio one could in principle compare data and models with different data types to perform model testing or selection.
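To make the BVM factor and BVM ratio concrete, the probability of agreement for a user-defined Boolean $B$ can be estimated by Monte Carlo: draw model outputs from $\rho(\hat{Y} \mid M, D)$ and data from $\rho(Y \mid D)$ independently, and count how often $B(\hat{Y}, Y)$ is true. The sketch below is illustrative only. The Gaussian model and data pdfs, the tolerance-based agreement rule `within_tolerance`, and all function names are assumptions made for the example, not quantities taken from the text.

```python
import numpy as np

def within_tolerance(eps):
    """Hypothetical agreement definition B: every output lies within eps of its datum."""
    def agree(y_hat, y):
        return np.all(np.abs(y_hat - y) <= eps, axis=-1)
    return agree

def agreement_probability(model_samples, data_samples, agree):
    """Monte Carlo estimate of p(A | M, D, B) from independent model and data draws."""
    return float(np.mean(agree(model_samples, data_samples)))

def bvm_factor_and_ratio(p_m, p_m_prime, prior_m=0.5, prior_m_prime=0.5):
    """K(B) and R(B) = K(B) * p(M | D, B) / p(M' | D, B)."""
    k = p_m / p_m_prime
    return k, k * prior_m / prior_m_prime

rng = np.random.default_rng(0)
n = 100_000
data = rng.normal(1.0, 0.1, size=(n, 1))           # assumed data pdf rho(Y | D)
model_m = rng.normal(1.0, 0.1, size=(n, 1))        # assumed output pdf of model M
model_m_prime = rng.normal(1.5, 0.1, size=(n, 1))  # assumed output pdf of base model M'

agree = within_tolerance(eps=0.2)
p_m = agreement_probability(model_m, data, agree)
p_m_prime = agreement_probability(model_m_prime, data, agree)
k_b, r_b = bvm_factor_and_ratio(p_m, p_m_prime)
```

Swapping `within_tolerance` for any other Boolean function of the model and data samples changes the definition of agreement without changing the rest of the calculation, which is the flexibility the BVM ratio is meant to provide.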
