Optimal Data Split Methodology for Model Validation


Authors: Rebecca Morrison, Corey Bryant, Gabriel Terejanu, Kenji Miki, Serge Prudhomme

Abstract—The decision to incorporate cross-validation into validation processes of mathematical models raises an immediate question: how should one partition the data into calibration and validation sets? We answer this question systematically: we present an algorithm to find the optimal partition of the data subject to certain constraints. While doing this, we address two critical issues: 1) that the model be evaluated with respect to predictions of a given quantity of interest and its ability to reproduce the data, and 2) that the model be highly challenged by the validation set, assuming it is properly informed by the calibration set. This framework also relies on the interaction between the experimentalist and/or modeler, who understand the physical system and the limitations of the model; the decision-maker, who understands and can quantify the cost of model failure; and the computational scientists, who strive to determine if the model satisfies both the modeler's and decision-maker's requirements. We also note that our framework is quite general and may be applied to a wide range of problems. Here, we illustrate it through a specific example involving a data reduction model for an ICCD camera from a shock-tube experiment located at the NASA Ames Research Center (ARC).

Index Terms—Model validation, quantity of interest, Bayesian inference

Manuscript received June 30, 2011; revised August 13, 2011. This material is based upon work supported by the Department of Energy [National Nuclear Security Administration] under Award Number [DE-FC52-08NA28615]. All authors are located at the Institute for Computational Engineering and Sciences, The University of Texas at Austin; Austin, TX 78712. E-mail: {rebeccam, cbryant, terejanu, kenji, serge}@ices.utexas.edu.

I. INTRODUCTION

Model validation to assess the credibility of a given model is becoming a necessary activity when making critical decisions based on the results of computer modeling and simulations. As stated in [1], validation requires that the model accurately capture the critical behavior of the system(s), and that uncertainties due to random effects be quantified and correctly propagated through the model.

There are various approaches to model validation; here we explore a procedure based on cross-validation. First, one partitions the data into two sets: the calibration (or training) set and the validation set. Next, the calibration set is used to calibrate the model. Then the calibrated model produces a set of predicted values to be compared with the validation set. A small discrepancy between predicted values and the validation set improves the credibility of the model, while a large discrepancy may invalidate the model. We will present a general framework based on these principles that incorporates our particular goals, along with a detailed cross-validation algorithm.
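As a rough illustration of this partition/calibrate/compare cycle, consider the following minimal sketch. The linear model, the synthetic data, and the particular split are all hypothetical, and ordinary least squares stands in for the Bayesian calibration the paper actually uses (Section II):

```python
import numpy as np

# Hypothetical setup: inputs x and noisy observations y of a linear system.
rng = np.random.default_rng(0)
x = np.linspace(0.5, 10.0, 11)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Step 1: partition the data into calibration and validation sets.
calib_idx = np.array([0, 2, 3, 5, 6, 8, 10])
valid_idx = np.setdiff1d(np.arange(x.size), calib_idx)

# Step 2: calibrate the model on the calibration set (least squares
# here, standing in for the Bayesian update used in the paper).
slope, intercept = np.polyfit(x[calib_idx], y[calib_idx], deg=1)

# Step 3: predict the held-out points and compare with the validation set.
y_pred = slope * x[valid_idx] + intercept
discrepancy = np.max(np.abs(y_pred - y[valid_idx]))
print(f"max |prediction - observation| on validation set: {discrepancy:.3f}")
# A small discrepancy supports the model; a large one may invalidate it.
```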
More specifically, we examine situations in which we want to predict values for which experimental data is not available, referred to here as the prediction scenario. Experiments for this scenario may be impractical or even impossible. Still, as a computational scientist, one wants to predict a certain quantity of interest (QoI) at this scenario and assess the quality of this prediction. Often, the only experimental data available comes from legacy experiments and may be incomplete. Furthermore, this QoI is seldom directly observable from the system, but requires some additional modeling.

Numerous examples of this situation exist. Computational models aimed at predicting the reentry of space vehicles are one such example. Some characteristics of the system can be recreated in sophisticated wind tunnels, but experiments are expensive and may be unreliable. Another example is the maintenance of nuclear stockpiles. Since experiments to assess environmental impact in case of failure are banned, predictive models must be used.

Babuška et al. present a systematic approach to assess predictions of this type using Bayesian inference and what they call a validation pyramid [2]. Scenarios of varying complexity are available that suggest an obvious hierarchy on which to validate the model. In their calibration phase, Bayesian updating is used to condition the model on the observations available at lower levels of the pyramid. The model's predictive ability is then assessed by further conditioning using the validation data at the higher levels. One advantage of their approach is that the prediction metric is directly related to the QoI. This feature is maintained in our work.

In the work described above, the authors employ a single split of the data into calibration and validation scenarios. While they argue that this partition of the data is made clear by the experimental set-up and validation pyramid, it is often the case that all experiments provide an equal amount of information regarding the QoI. To avoid a subjective choice of the calibration set, we determine the set by a more rigorous and quantitative process.

To do this, we propose a cross-validation-inspired methodology to partition the data into calibration and validation sets. In contrast to previous works, we do not immediately choose a single partition as above, nor do we use averages or estimators over multiple splits (see [3]–[5] and references therein). Instead, our approach first considers all possible ways to split the data into disjoint calibration and validation sets which satisfy a chosen set size. Then, by analyzing several splits, we methodically choose what we term the "optimal" split. Once the optimal split is found, we are then able to judge the validity of the model, and whether or not it should be used for predictive purposes.

First, we argue that the model's ability to replicate the observations must be assessed quantitatively. A model incapable of reproducing observations should not be used to predict unobservable quantities of interest. Specifically, in the case of Bayesian updating, the prior information and observed data must be sufficient to produce a satisfactory posterior distribution for the model parameters. Second, we address a fundamental issue of validation. We can never fully validate a model; instead, we can only try to invalidate it. Because of this, when we choose a validation set with which to test the model, it should be the most challenging possible. In other words, we demand that the model perform well on even the most challenging of validation sets; otherwise, we cannot be confident in its prediction of the QoI.

With these concepts in mind, we propose that the optimal split satisfy the following desiderata:

(I) The model is sufficiently informed by the calibration set (and is thus able to reproduce the data).
(II) The validation set challenges the model as much as possible with respect to the quantity of interest.
Using the optimal partition, we are then able to answer whether the model should be used for prediction.

In the current work, we apply our framework to a data reduction model which converts photon counts received by an ICCD camera to radiative intensities [6]. We chose this as an appropriate application because the resulting intensities are later used to make higher-level decisions. Thus, the validity of the model should be thoroughly tested before the model is deemed reliable. Moreover, the uncertainty in such a model should be explicitly evaluated, as possible errors may be propagated through to other predicted quantities.

The paper is organized as follows. In Section II, the general framework is detailed step by step. A concise algorithm is provided. In Section III, the approach is applied to the data reduction model. Finally, in Section IV, shortcomings and future work are discussed.

II. VALIDATION FRAMEWORK

[Fig. 1. The calibration and validation cycle]

Figure 1 illustrates the previously described framework [2], which applies when the quantity of interest is only available as a prediction through the computational model, not through direct observation. There, Bayesian updating is performed on a calibration set, and then a prediction of the QoI is made using the updated model. A subsequent update is performed using a validation set, followed by an additional prediction with the newly updated model. Finally, the two predictions are compared to assess the model's predictive capabilities.

A. Prediction metric

The QoI-driven model assessment developed in [2] requires a metric comparing predictions of the QoI obtained from the calibration and validation sets. In many instances the QoI is determined by a decision-maker, who may not be the computational scientist performing the analysis. Consequently, they must work together to develop a suitable metric for the QoI, as well as a tolerance, which quantifies how consistent the predictions must be (note that we cannot measure accuracy of the predictions because we have no true value). Ideally this metric would measure in the units of the QoI, or provide a relative error, allowing for easy interpretation by decision-makers. Examples of such metrics include absolute error measures and percent error based on a nominal value.

In the following sections of the paper we will denote the metric used to measure the predictive performance of the model by M_Q, highlighting the fact that it is determined by the QoI. Likewise, we will denote the threshold, or tolerance, by M*_Q. We stress that generality is maintained, since the rest of the procedure discussed below is flexible to the particular choice of metric and threshold. What we do require is that the choice of metric be appropriate for the application at hand. For instance, when using Bayesian inference we require that the metric be compatible with probabilistic inputs, since predictions are provided as probability or cumulative density functions of the QoI.
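The paper deliberately leaves the choice of M_Q open. As one concrete possibility compatible with probabilistic predictions (our assumption, not a prescription from the paper), the sketch below measures the area between two empirical CDFs of the QoI, which carries the units of the QoI itself; the sample arrays are fabricated stand-ins for posterior-predictive samples:

```python
import numpy as np

def area_metric(samples_a, samples_b):
    """Area between two empirical CDFs (the 1-Wasserstein distance).

    For equal-size samples this is the mean absolute difference of
    the sorted values, and it has the same units as the QoI.
    """
    a = np.sort(np.asarray(samples_a))
    b = np.sort(np.asarray(samples_b))
    if a.size != b.size:
        raise ValueError("expected equal-size sample arrays")
    return float(np.mean(np.abs(a - b)))

# Fabricated stand-ins for posterior-predictive samples of the QoI
# obtained from the calibration-updated and validation-updated models.
rng = np.random.default_rng(1)
q_calib = rng.normal(loc=10.0, scale=1.0, size=5000)
q_valid = rng.normal(loc=10.4, scale=1.2, size=5000)

m_q = area_metric(q_calib, q_valid)   # the metric M_Q
m_q_star = 0.5                        # tolerance M*_Q, agreed with the decision-maker
print(f"M_Q = {m_q:.3f}, tolerance M*_Q = {m_q_star}")
```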
B. Data reproduction

As discussed briefly in the introduction, one must also ensure that the model is capable of reproducing the observed data. This evaluation has been overlooked, or at least not emphasized, in previous works [2], [6]. Such an evaluation provides confidence that the model and the data are mutually suitable: verifying that the model can reproduce the observables demonstrates that the parameters of the model are adequately informed through the inverse problem.

As in the case of the prediction metric, establishing a performance criterion may require further knowledge of the physical system. The analyst is likely not an expert in the system being modeled and should solicit a modeler's, or experimentalist's, assistance in developing a performance metric and tolerance. Doing so will certify that the correct aspects of the system are being captured sufficiently by the model. We use a similar notation as for the prediction metric: let M_D and M*_D denote the data reproduction metric and threshold, respectively. We note again that these choices are defined based on the model being considered and are not specific to the framework we propose.

C. Choice of calibration size

As with many validation schemes, the selection of the calibration, or training, set size is important. From k-fold to leave-one-out cross-validation, this set size can vary greatly [4], [5]. Here, we do not require a particular choice, as the proposed framework applies whatever this size may be. Indeed, various choices were considered in preparation of this work; however, we do not discuss them further. With that being said, we do recognize that the particular choice could impact the final conclusion and must be made with care. Of particular concern is providing enough data so that all parameters of the model are sufficiently informed by the inverse problem. If the model were to fail with respect to the data metric, the issue may not be the model itself but too small a calibration set size. In this case one should perform further analysis to determine the source of this discrepancy and increase the calibration set size if necessary.

Given N observations, we denote the size of the calibration set by N_C and the size of the validation set by N_V. That is,

$$ N_C + N_V = N. \qquad (1) $$

Note that it could be the case that each of the N observations in fact represents a set of observations if, for instance, repeated experimental measurements are taken at the same conditions.

We do not perform our analysis on a single partitioning of the data, but instead consider all possible partitions of the data respecting (1). The reason for this approach is two-fold. First, it reduces the sensitivity of the final outcome to any particular set of data. Since each data point is equivalent under partitioning, we do not bias the groupings in any subjective way (once we have chosen the calibration set size). Second, we envision applications where it is unclear which experiments relate more closely to the QoI scenario. Considering all admissible partitions can provide insight as to which observations are most influential with respect to the QoI. As an example, consider a case of resonance where the QoI is associated with the resonant behavior of the system. Without knowing a priori the resonance frequency of the system, one cannot say which frequencies will be important for capturing the resonant behavior.
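Enumerating every partition respecting (1) is mechanical; here is a minimal sketch in which integer indices stand in for the data points, with sizes N = 11 and N_C = 7 chosen to be consistent with the application of Section III (where the validation set appears to hold four of the eleven gate widths). The count matches the binomial formula (2) given next:

```python
from itertools import combinations
from math import comb

N, N_C = 11, 7                  # assumed sizes; N_V = N - N_C = 4
observations = list(range(N))   # indices stand in for the data points

# Every (calibration set, validation set) pair respecting N_C + N_V = N.
splits = [(set(c), set(observations) - set(c))
          for c in combinations(observations, N_C)]

assert len(splits) == comb(N, N_C) == 330   # P from equation (2) below
```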
However, the drawback of this approach is evident: we must consider the model performance for all partitions of the data. This yields a combinatorially large number of partitions, whose exact number is given by the binomial formula:

$$ P = \binom{N}{N_C} = \frac{N!}{N_C!\, N_V!}. \qquad (2) $$

We will denote these partitions, or splits, by {s_k}, where k = 1, 2, ..., P. (In the application of Section III, for instance, N = 11 and N_C = 7, so P = 330.) The computational impact becomes even more significant during the next step of the procedure.

D. Inversion for model parameters

For each admissible partition of the data, we solve an inverse problem using the calibration set, of size N_C, as input data. It is not hard to see why solving P inverse problems may be difficult, or even impossible, for complicated models. This is an area for improvement; approximations and alternative approaches to reduce the number of inverse problems will be the subject of future work. As discussed previously, we treat the inverse problem in a probabilistic setting, using Bayesian updating to incorporate the calibration data. As a result, we obtain distributions of model parameters, and these in turn yield distributions for the predicted quantities. At this point it becomes clear that the definition of the metrics will depend on how the inverse problems are solved. Note, of course, that a deterministic approach could also be used.

E. Computation of the metrics

With the solutions obtained from the inverse problems, we are now able to evaluate the model's performance. For each calibration set, we compute the metrics as detailed above. One can then visualize the results on a Cartesian grid where the x and y axes correspond to the metrics M_Q and M_D, respectively, and each point corresponds to a single partition of the data into a calibration and validation set (figure 2).

[Fig. 2. Metrics computed for each data split s_k]

F. Optimal partition

We attempt to invalidate the model using the optimal split determined by (I) and (II). For (I), performance in replicating the observables is measured using the data metric M_D. Thus we only consider splits s_k of the data that satisfy M_D(s_k) < M*_D. If no points lie below this threshold, the model's ability to reproduce the observed data is unsatisfactory, and one must change or improve the model, or the data, or perhaps both. Next, to satisfy (II), we select the partition s* which maximizes the prediction metric,

$$ s^* = \operatorname*{argmax}_{\{s_k \,:\, M_D(s_k) < M_D^*\}} M_Q(s_k). $$

If the prediction metric at this most challenging split exceeds the tolerance, M_Q(s*) > M*_Q, then we must conclude that the model is incapable of predicting the QoI, and has thus been invalidated.
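Steps E and F then reduce to a filter and an argmax over the P splits. In the sketch below the metric values are fabricated at random purely to exercise the selection logic; in practice each pair (M_D(s_k), M_Q(s_k)) comes from solving the inverse problem for split s_k:

```python
import numpy as np

# Fabricated metric values for P splits (illustration only).
rng = np.random.default_rng(2)
P = 330
M_D = rng.uniform(0.0, 2.0, size=P)
M_Q = rng.uniform(0.0, 1.0, size=P)

M_D_star = 1.0   # data-reproduction tolerance M*_D
M_Q_star = 0.8   # prediction tolerance M*_Q

# (I): keep only splits whose calibration set reproduces the data.
admissible = np.flatnonzero(M_D < M_D_star)
if admissible.size == 0:
    print("No split reproduces the data: improve the model and/or the data.")
else:
    # (II): among those, pick the split that most challenges the model.
    s_star = admissible[np.argmax(M_Q[admissible])]
    verdict = "invalidated" if M_Q[s_star] > M_Q_star else "not invalidated"
    print(f"s* = split {s_star}: M_Q = {M_Q[s_star]:.3f} -> model {verdict}")
```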
[…]

C. Analysis of results

Looking at figure 7, we see that the points representing splits are grouped into four regions. First, we note that s* belongs to group (iv), and as the group number increases, the size of the group decreases. Now let us examine why s* is the most challenging of the splits and why these groups have formed. Recall that the data we received was N_Δt, where Δt ranged from 0.5 to 10 microseconds; specifically, Δt = 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, and 10 µs. As one might guess, the most challenging (optimal) data split, s*, contained the four lowest gate widths in the validation set. In other words, the data points "closest" to the QoI scenario were missing from the calibration set. After calibration, the model must first predict the QoI without any of the information contained in the lowest gate widths. Then, when the validation points are used to recalibrate the model, there is a large distance between the resulting predictions (CDFs) for the QoI.

The calibration set for s* is missing 0.5, 0.6, 0.7, and 0.8. The other points in group (iv) are those whose calibration set is missing 0.5, 0.6, and 0.7 but contains 0.8, making them slightly less challenging than s*. Next, group (iii) contains those splits whose calibration set is missing 0.5 and 0.6. Similarly, in group (ii), we see all the splits whose calibration set is missing 0.5. Finally, in group (i), the calibration sets contain 0.5 and some collection of the remaining points.

The grouping of the data suggests a possible way to decrease the number of inverse problems performed: by finding representatives of the groups, we could search for the optimal split without analyzing every one. Moreover, in the case that the model is not invalidated, understanding these groups could be helpful when planning further experiments, perhaps through experimental design.

IV. CONCLUSION

This paper has proposed and demonstrated a systematic framework for assessing a model's predictive performance with respect to a particular quantity of interest. Extending the work of Babuška et al. [2], the ability of the model to reproduce experimental observations was also evaluated. A data reduction model was examined using our framework and ultimately deemed invalid: while the model was capable of reproducing the observations at higher gate widths, it failed based on its performance in predicting the quantity of interest. The analysis was carried out on all partitions of the data respecting a chosen size constraint, which allowed for the determination of an optimal split satisfying the calibration (I) and validation (II) requirements.

Computationally, our approach is prohibitively expensive and requires significant improvement. Methods to reduce the number of inverse problems required are being investigated; among them is the use of mutual information to group observations into sets containing similar data, so that these larger sets require fewer inverse problems. With experimental design, one may be able to choose where to perform subsequent experiments that will challenge the model further, finding a new optimal split. This newly determined split could render the model invalid, or provide increased confidence in the model's predictive capability.
REFERENCES

[1] R. Hills, M. Pilch, K. Dowding, J. Red-Horse, T. Paez, I. Babuška, and R. Tempone, "Validation challenge workshop," Computer Methods in Applied Mechanics and Engineering, vol. 197, no. 29–32, pp. 2375–2380, 2008.
[2] I. Babuška, F. Nobile, and R. Tempone, "A systematic approach to model validation based on Bayesian updates and prediction related rejection criteria," Computer Methods in Applied Mechanics and Engineering, vol. 197, no. 29–32, pp. 2517–2539, 2008, Validation Challenge Workshop.
[3] A. Vehtari and J. Lampinen, "Bayesian model assessment and comparison using cross-validation predictive densities," Neural Computation, vol. 14, pp. 2439–2468, 2002.
[4] S. Arlot and A. Celisse, "A survey of cross-validation procedures for model selection," Statistics Surveys, vol. 4, pp. 40–79, 2010.
[5] F. Alqallaf and P. Gustafson, "On cross-validation of Bayesian models," The Canadian Journal of Statistics, vol. 29, no. 2, pp. 333–340, 2001.
[6] M. Panesi, K. Miki, and S. Prudhomme, "On the validation of a data reduction model with application to shock tube experiments," Computer Methods in Applied Mechanics and Engineering, 2011, under review.
[7] G. Terejanu, R. R. Upadhyay, and K. Miki, "Bayesian experimental design for the active nitridation of graphite by atomic nitrogen," Experimental Thermal and Fluid Science, 2011, under review.
[8] E. E. Prudencio and K. W. Schulz, "The parallel C++ statistical library 'QUESO': Quantification of uncertainty for estimation, simulation and optimization," submitted to IEEE IPDPS, 2011.
[9] S. H. Cheung and J. L. Beck, "New Bayesian updating methodology for model validation and robust predictions of a target system based on hierarchical subsystem tests," Computer Methods in Applied Mechanics and Engineering, 2009, accepted for publication.
[10] J. Larsen and C. Goutte, "On optimal data split for generalization estimation and model selection," in Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, Aug. 1999, pp. 225–234.
