Robustness Quantification for Discriminative Models: a New Robustness Metric and its Application to Dynamic Classifier Selection
Authors: Rodrigo F. L. Lassance, Jasper De Bock
Rodrigo F. L. Lassance¹,²,³   Jasper De Bock¹

¹ Foundations Lab for imprecise probabilities (FLip), Ghent University, Ghent, Belgium
² Statistics Dept., Federal University of São Carlos, São Carlos, São Paulo, Brazil
³ Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, São Paulo, Brazil

Abstract

Among the different possible strategies for evaluating the reliability of individual predictions of classifiers, robustness quantification stands out as a method that evaluates how much uncertainty a classifier could cope with before changing its prediction. However, its applicability is more limited than some of its alternatives, since it requires the use of generative models and restricts the analyses either to specific model architectures or discrete features. In this work, we propose a new robustness metric applicable to any probabilistic discriminative classifier and any type of features. We demonstrate that this new metric is capable of distinguishing between reliable and unreliable predictions, and use this observation to develop new strategies for dynamic classifier selection.

1 INTRODUCTION

Machine learning models possess great predictive capacity, but this capacity comes with a level of unreliability that is hard to assess. From a modeling perspective alone, the majority of the methods used to make high-stakes decisions lack transparency in what their true decision process is, making their users potentially subject to harmful consequences in the process [O'Neil, 2016]. Although there have been attempts to make such black box models more interpretable [Molnar, 2025] or to switch to models that are inherently interpretable instead [Rudin, 2019], these contributions remain insufficient due to reasons that go beyond the models themselves.
After all, a model is only as good as the data that has been provided to it, with the inherent variability of the phenomena and the lack of all meaningful features leading to an unreliability source referred to as aleatoric uncertainty (AU). Moreover, when (i) the data is not sufficient for the model to differentiate pattern from noise to an acceptable degree or (ii) there is a discrepancy between the data used to train the model and the context in which the model will be applied, this leads to another unreliability source: epistemic uncertainty (EU). While completely removing these sources of uncertainty is impossible by definition, progress has been made in evaluating how reliable an individual prediction is in the field of uncertainty quantification. Many different approaches have already been explored [Hüllermeier and Waegeman, 2021], with the attempt of estimating and separating AU and EU among the most popular strategies. If successfully done, this allows decision makers to refrain from trusting the outputs of a model in specific circumstances, avoiding the more undesirable repercussions that could follow. However, properly quantifying EU is rarely feasible from the data used for training alone, since it most often diverges from the true application that a practitioner has in mind. Aside from specific situations, some aspects of EU are fundamentally inaccessible in practice.

A promising alternative to directly estimating EU is robustness quantification. Similarly to uncertainty quantification, earlier work [Detavernier and De Bock, 2025a,b] has demonstrated that robustness quantification can be used to assess the reliability of individual predictions.
Unlike uncertainty metrics, which aim to quantify the amount of uncertainty that influences the decision, robustness metrics aim to quantify how much uncertainty the model could cope with without changing the decision, thereby sidestepping the problem of estimating the amount of uncertainty there actually is. Both types of metrics tend to correlate with the accuracy of the predictions, and their relative performance varies with the context. Robustness metrics tend to perform best exactly in the contexts where uncertainty is difficult to estimate, such as in the presence of distribution shift, model misspecification, or small data regimes.

Compared to other applications of the term "robustness" in machine learning, one of the differentiating aspects of robustness quantification is that it is an instance-based metric, i.e., it is an assessment of how reliable a specific individual prediction is based on the underlying model and the values of the features used as input. It is also different from the notion of adversarial robustness [Muhammad and Bae, 2022], which focuses on perturbing the features and is mainly used in applications with image data. Robustness quantification, on the other hand, comes from a notion of perturbing the joint distributions of the trained model, which is more in line with ideas in imprecise probabilities [Augustin et al., 2014].

Important downsides of robustness quantification, at this point, are that it can only be applied to generative models, is mainly based on epsilon-contamination, and is restricted to either specific model architectures [De Bock et al., 2014, Correia and de Campos, 2019, Correia et al., 2020] or to fully discrete sets of features [Detavernier and De Bock, 2025a]. In this work, we improve on these current limitations by proposing a new robustness metric that is applicable to any probabilistic discriminative classifier (section 2).
Instead of using epsilon-contamination, our metric is based on the constant odds ratio (COR) perturbation, yielding a metric that is not restricted to discrete features. In a first batch of experiments, we use Accuracy Rejection Curves to demonstrate that this metric correlates nicely with accuracy, that it does this better than an alternative competitor, and that it can do this for several model architectures (section 3.1). Our main application uses robustness to perform dynamic selection of classifiers, offering two strategies for choosing which model to use as a predictor given a set of features (section 3.2). Lastly, we provide a discussion that highlights possible avenues for future research (section 4).

2 SETUP

Let Y ∈ 𝒴 be a discrete class variable and X ∈ 𝒳 its vector of features. The features can be purely discrete, purely continuous or a mixture of both. Uncertainty about (Y, X) is expressed by a probability measure P on a suitable sigma algebra 𝒜 of subsets of 𝒴 × 𝒳 (e.g. the power set when 𝒳 is purely discrete, or a product of a power set and Borel sigma algebra if 𝒳 is purely continuous or mixed). We furthermore assume that P is absolutely continuous w.r.t. an adequate base measure μ (e.g., the counting measure when 𝒳 is purely discrete, or the product of a counting and Lebesgue measure when 𝒳 is purely continuous or mixed), which guarantees that P has a density p = dP/dμ w.r.t. this base measure μ. We denote the set of all such probability measures by 𝒫(𝒴, 𝒳).

We consider a classification problem where the goal is to predict the value of Y given x based on P or, equivalently, based on p. The classifier that minimizes the 0-1 loss then predicts the most likely class given the features x, which is given by

    g_p(x) := \arg\max_{y \in \mathcal{Y}} p(y \mid x) = \arg\max_{y \in \mathcal{Y}} \frac{p(y, x)}{p(x)} = \arg\max_{y \in \mathcal{Y}} p(y, x).    (1)
We are interested in quantifying the robustness of this prediction, by determining how stable it is with respect to divergences from the original measure P. To this end, following Detavernier and De Bock [2025a], we consider parametrized perturbations 𝒫_δ around P, with δ ∈ ℝ≥0. We call the prediction g_p(x) robust w.r.t. 𝒫_δ if, for all P′ ∈ 𝒫_δ, the prediction g_{p′}(x) is the same as g_p(x). The robustness r(x) of g_p(x) is then quantified as the largest δ such that g_p(x) is robust w.r.t. 𝒫_δ.

Work on robustness quantification has so far mainly focused on perturbations 𝒫_δ based on epsilon-contamination. One approach consists in applying epsilon-contamination directly to the global model P, resulting in perturbations of the form

    \mathcal{P}_\varepsilon = \{ (1 - \varepsilon) P + \varepsilon Q : Q \in \mathcal{P}(\mathcal{Y}, \mathcal{X}) \}, \quad \varepsilon \in [0, 1].

This leads to simple closed-form expressions for the robustness metric r(x), but it is only meaningful if X is discrete, since it otherwise typically leads to robustness values of zero. Another approach consists in applying epsilon-contamination to the local parameters of specific parametric models such as naive Bayes classifiers [Detavernier and De Bock, 2025a], Sum-Product Networks [Correia and de Campos, 2019] or Generative Forests [Correia et al., 2020]. This local approach remains meaningful for continuous or mixed features as well (at least for Sum-Product Networks and Generative Forests), but is only applicable to specific parametric models and can be computationally expensive.

In this work, we study robustness quantification based on a dissimilarity function d between probability measures. That is, we consider perturbations of the type

    \mathcal{P}_\delta = \{ Q \in \mathcal{P}(\mathcal{Y}, \mathcal{X}) : d(P, Q) \le \delta \}    (2)

with δ ∈ ℝ≥0. In other words, the prediction is robust w.r.t.
𝒫_δ if the prediction g_q(x) is the same as g_p(x) for all Q ∈ 𝒫(𝒴, 𝒳) such that d(P, Q) ≤ δ, and the robustness r(x) of the prediction g_p(x) is the largest δ for which this is the case. To indicate the influence of the choice of d, we will write r_d(x) instead of r(x) when referring to the robustness metric based on a specific dissimilarity function d.

We focus in particular on the distance function d_COR, defined for all P, Q ∈ 𝒫(𝒴, 𝒳) by

    d_{\mathrm{COR}}(P, Q) := \sup_{\substack{A, B \in \mathcal{A} \\ Q(A) > 0,\, P(B) > 0}} 1 - \frac{P(A)\, Q(B)}{Q(A)\, P(B)}.    (3)

Similarly to epsilon-contamination, the perturbation induced by (3) is related to imprecise probabilities. More specifically, Montes et al. [2020] show that the perturbations that correspond to this distance function are Constant Odds Ratio (COR) models [Augustin et al., 2014, Section 4.7.2], which can also be given a behavioral interpretation in terms of gambling [Walley, 1991, Section 2.9.4]. A closely related dissimilarity function is d*_COR, defined for all P, Q ∈ 𝒫(𝒴, 𝒳) by

    d^*_{\mathrm{COR}}(P, Q) := \sup_{\substack{A, B \in \mathcal{A} \\ Q(A) > 0,\, P(B) > 0}} \frac{P(A)\, Q(B)}{Q(A)\, P(B)} = \frac{1}{1 - d_{\mathrm{COR}}(P, Q)}.    (4)

In terms of the densities p and q, this simplifies to

    d^*_{\mathrm{COR}}(P, Q) = \operatorname*{ess\,sup} \frac{q}{p} \cdot \operatorname*{ess\,sup} \frac{p}{q} = \frac{\operatorname*{ess\,sup} (q/p)}{\operatorname*{ess\,inf} (q/p)} = \frac{\operatorname*{ess\,sup} L}{\operatorname*{ess\,inf} L},    (5)

with L = dQ/dP = q/p the likelihood ratio of Q w.r.t. P [Dümbgen et al., 2021]. Since the distance function d_COR and dissimilarity function d*_COR are monotone transformations of each other, the same is true for the resulting robustness metrics: for all x ∈ 𝒳,

    r_{d_{\mathrm{COR}}}(x) = 1 - \frac{1}{r_{d^*_{\mathrm{COR}}}(x)}    (6)

and

    r_{d^*_{\mathrm{COR}}}(x) = \frac{1}{1 - r_{d_{\mathrm{COR}}}(x)}.    (7)

For this reason, we can equivalently work with either of these dissimilarity functions and their corresponding robustness metrics.
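For fully discrete spaces, the density ratio in Equation (5) reduces to an elementwise ratio of probability mass functions, which makes the relation between d_COR and d*_COR easy to check numerically. The sketch below is illustrative only (the helper name `d_cor_star` is ours, not from the paper's codebase):

```python
import numpy as np

def d_cor_star(p, q):
    """d*_COR between two discrete distributions with full support:
    the likelihood ratio L = q/p is an elementwise ratio of pmfs,
    so Equation (5) becomes max(L) / min(L)."""
    L = np.asarray(q, dtype=float) / np.asarray(p, dtype=float)
    return L.max() / L.min()

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
ds = d_cor_star(p, q)      # L = [0.8, 4/3, 1.0], so ds = (4/3) / 0.8
d = 1 - 1 / ds             # d_COR, via the monotone relation in Equation (4)
print(round(ds, 4))        # 1.6667
print(round(d, 4))         # 0.4
```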
We will work with d*_COR and r_{d*_COR} in our proofs, out of convenience, but will mostly focus on d_COR and r_{d_COR} in the remainder of the paper because we find r_{d_COR} more intuitive to interpret.

The following result, the proof of which is available in section 5, provides closed-form expressions for the robustness metrics r_{d*_COR} and r_{d_COR}.

Theorem 1. Consider a measure P in 𝒫(𝒴, 𝒳), its corresponding density p and, for any given features x ∈ 𝒳, the most and second most likely class according to p given x:

    \hat{y}_1 \in \arg\max_{y \in \mathcal{Y}} p(y \mid x) \quad \text{and} \quad \hat{y}_2 \in \arg\max_{y \in \mathcal{Y} \setminus \{\hat{y}_1\}} p(y \mid x).

Then the robustness of the prediction ŷ₁ w.r.t. d*_COR and d_COR is given by, respectively,

    r_{d^*_{\mathrm{COR}}}(x) = \frac{p(\hat{y}_1, x)}{p(\hat{y}_2, x)}    (8)

and

    r_{d_{\mathrm{COR}}}(x) = \frac{p(\hat{y}_1, x) - p(\hat{y}_2, x)}{p(\hat{y}_1, x)}.    (9)

This result has a number of noteworthy consequences.

First, since both metrics have simple closed-form expressions, this makes them particularly easy to evaluate, whereas existing local robustness metrics typically require performing optimization procedures [Correia and de Campos, 2019, Correia et al., 2020].

Second, unlike global uncertainty metrics based on epsilon-contamination, our metrics are meaningful (not automatically zero) regardless of whether the features are discrete, continuous or mixed.

Third, a simple application of Bayes' rule allows us to rewrite the obtained expressions in terms of conditional probabilities:

    r_{d^*_{\mathrm{COR}}}(x) = \frac{p(\hat{y}_1 \mid x)}{p(\hat{y}_2 \mid x)}    (10)

and

    r_{d_{\mathrm{COR}}}(x) = \frac{p(\hat{y}_1 \mid x) - p(\hat{y}_2 \mid x)}{p(\hat{y}_1 \mid x)}.    (11)

This implies that these metrics can be applied to any machine learning architecture that leads to a discriminative model, that is, one for which only the conditional distribution p(·|x) is available.
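As an illustration, expressions (10) and (11) can be computed directly from any classifier's predicted class probabilities, e.g. the output of a scikit-learn `predict_proba` call for a single instance. The sketch below (the helper name `robustness_cor` is ours) also checks the monotone relation of Equation (6):

```python
import numpy as np

def robustness_cor(probs):
    """Closed-form robustness metrics of Theorem 1, computed from the
    conditional class probabilities p(y|x) of a single instance x.
    Returns (r_dCOR, r_dCOR_star)."""
    p_sorted = np.sort(np.asarray(probs, dtype=float))[::-1]
    p1, p2 = p_sorted[0], p_sorted[1]   # most and second most likely class
    r_star = p1 / p2                    # Equation (10)
    r = (p1 - p2) / p1                  # Equation (11)
    return r, r_star

# Example: a three-class prediction.
r, r_star = robustness_cor([0.6, 0.3, 0.1])
print(r)        # (0.6 - 0.3) / 0.6 = 0.5
print(r_star)   # 0.6 / 0.3 = 2.0
# Monotone relation of Equation (6): r = 1 - 1 / r_star.
assert abs(r - (1 - 1 / r_star)) < 1e-12
```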
Fourth, since both expressions are based solely on the (conditional or joint) probability of the first and second most probable class, they are extremely simple to interpret, even without detailed knowledge about the reasoning that led to them. In particular, we personally very much like the interpretation of r_{d_COR} as the probability difference between the first and second most probable class, relative to the absolute probability of the most probable one.

3 EXPERIMENTS

The experiments undertaken in this work have two objectives: (i) demonstrate that our reliability metric(s) are capable of assessing the reliability of the predictions of classifiers (section 3.1) and (ii) offer one possible application in which we use the notion of robustness to develop new Dynamic Classifier Selection methods (section 3.2). Detailed information about all datasets in our experiments is given in Table 1; the sources were the OpenML [Vanschoren et al., 2014], UCI [Realinho et al., 2021] and PMLB [Romano et al., 2021] benchmark repositories. Code reproducing the results can be found at https://github.com/rflassance/rob4discriminative.

3.1 CORRELATION WITH ACCURACY

To assess the quality of our robustness metric(s), we follow Detavernier and De Bock [2025a,b] in using Accuracy Rejection Curves [ARC, Nadeem et al., 2009].

Table 1: Details about the datasets in our experiments.

    Label  Name                  n      #features (%cont.)  Source
    D1     authent               1372   4 (100%)            OpenML
    D2     bank-additional-full  41188  20 (25%)            OpenML
    D3     diabetes              768    8 (25%)             OpenML
    D4     electricity           45312  8 (88%)             OpenML
    D5     gesture               9873   32 (100%)           OpenML
    D6     magic                 19020  10 (100%)           OpenML
    D7     robot                 5456   24 (100%)           OpenML
    D8     segment               2310   19 (84%)            OpenML
    D9     students              4424   36 (19%)            UCI
    D10    texture               5500   40 (100%)           OpenML
    D11    twonorm               7400   20 (100%)           OpenML
    D12    vowel                 990    12 (83%)            OpenML
    D13    waveform_21           5000   21 (100%)           PMLB
    D14    waveform_40           5000   40 (100%)           PMLB
    D15    wdbc                  569    30 (100%)           OpenML
To generate such an ARC, we first order the data according to a specific strategy, which in this case is in order of increasing values of robustness. Next, we evaluate the accuracy of the model on the ordered data, gradually throwing away the first samples (with low robustness, in this case) and recalculating the accuracy for the remaining data, leading to an accuracy curve indexed by the proportion of samples that were removed. Examples of such ARCs, which we analyse in the next section, can be seen in Figure 1. When the ordering criterion is good, the accuracy of the model quickly goes to 1 as we disregard samples. Conversely, when the ordering is not good, the ARC remains closer to the starting accuracy. This way, ARCs provide a visual method for evaluating how well a robustness metric correlates with accuracy.

In sections 3.1.1 and 3.1.2, we perform 15 train-test splits for each dataset analyzed. Then, for every split, we obtain the ARC for the test set and present the average of the ARCs as our result.

3.1.1 Comparison Between Robustness Metrics

Our first experiment compares how our robustness metric, r_{d_COR}, fares when compared to an alternative. Considering that part of the focus of this work is to provide a robustness metric that remains meaningful when the datasets possess continuous features, we turn our attention to another method with a similar capability: a local robustness metric based on Generative Forests [GeF, Correia et al., 2020] that we refer to as r_GeF. This metric is specifically designed for GeFs, a generative version of Random Forests [Breiman, 2001], and epsilon-contaminates the parameters of a Generative Forest to assess robustness. Figure 1 shows the ARC curves of both robustness metrics for two different datasets, as an illustration of two extreme situations that occur throughout the datasets.
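The ARC construction described above can be sketched in a few lines; this is a minimal illustrative version (the paper's actual implementation lives in the linked repository):

```python
import numpy as np

def accuracy_rejection_curve(robustness, correct):
    """Accuracy Rejection Curve: sort samples by increasing robustness,
    then report the accuracy on the remaining data as the least robust
    samples are progressively rejected.

    robustness: robustness score per test sample.
    correct: boolean per sample, True where the prediction was right.
    Returns (rejection_rates, accuracies)."""
    order = np.argsort(robustness)                      # least robust first
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)
    rates, accs = [], []
    for k in range(n):                                  # reject the first k samples
        rates.append(k / n)
        accs.append(correct[k:].mean())
    return np.array(rates), np.array(accs)

# Toy example: mistakes concentrate among low-robustness samples,
# so accuracy rises as those samples are rejected.
rob = np.array([0.1, 0.2, 0.6, 0.8, 0.9])
ok = np.array([False, True, True, True, True])
rates, accs = accuracy_rejection_curve(rob, ok)
print(accs[0])   # 0.8 -> base accuracy, nothing rejected
print(accs[1])   # 1.0 -> perfect after rejecting the least robust sample
```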
In Figure 1a, we note that both robustness metrics provide a good ordering criterion for the data, with r_{d_COR} slightly outperforming r_GeF. In the other extreme, Figure 1b highlights a situation in which r_GeF is doing a poor job at correlating with accuracy, whereas r_{d_COR} remains more consistent. Considering that in both cases r_{d_COR} has the better performance, and that we observed similar behaviour in the other datasets, we disregard r_GeF in further analyses and focus solely on r_{d_COR}.

Figure 1: Comparison between the ARCs of r_GeF and r_{d_COR} for two datasets; (a) waveform_21 dataset (D13), (b) students dataset (D9).

3.1.2 Comparison Between Models

Unlike r_GeF, our metric r_{d_COR} is not restricted to Generative Forests, being applicable to any generative or discriminative model of interest. Hence, we can also make use of ARCs to verify whether our robustness metric correlates with accuracy for different types of models. In our next experiment, we do this for Gradient Boosting (GB), Random Forests (RF), XGBoost [XGB, Chen and Guestrin, 2016] and GeFs. Figure 2 presents the ARCs of these four models on specific datasets. The leftmost value of the ARCs is the base accuracy of the models, indicating that Generative Forests were the least competitive model in both settings.

Figure 2: Comparison between the ARCs of different models for two datasets when ordered by r_{d_COR}; (a) robot dataset (D7), (b) students dataset (D9).

For every model and both datasets, we observe that our robustness metric r_{d_COR} still correlates with accuracy, but that the growth of the ARC varies based on dataset and model. In Figure 2a, the ARCs of the GB and of the XGB are similar at the start, both being superseded by the RF around a rejection rate of 20% (even though the RF started out with a lower base accuracy) and then exhibiting about the same behavior as
RF around 40%, while the GeF only catches up around 70%. As for Figure 2b, the top three models initially perform equally well, with the GB and XGB eventually outperforming the other models around a rejection rate of 70%, while the GeF is again not competitive. These results not only imply that robustness correlates with accuracy, but also that different models might be the top performers at specific rejection rates. Moreover, the GeF underperformed in both cases, leading us to disregard it in further analyses.

One thing for which results like these can be used is to gain an understanding of how well robustness is capable of assessing the reliability of the predictions of different models. Another, however, is to use them to increase performance in terms of accuracy. Most obviously, in a context where we can allow ourselves to reject a percentage of the data, for example because we can label those samples manually, we can use robustness-based ARCs to determine which percentage to reject, and to determine which model performs best on the remaining data. However, it is also possible to use robustness to increase performance in contexts where rejection is not an option. This is what we come to next.

3.2 DYNAMIC SELECTION OF CLASSIFIERS

To demonstrate the potential use of robustness metrics, we propose two robustness-based strategies (RS-D and RS-I) for Dynamic Selection (DS) of classifiers and compare them to other strategies on benchmark datasets. DS is an umbrella term for techniques that combine or choose between multiple base models, contingent on the set of features being used as input [Cruz et al., 2018]. Hence, depending on the values of the new sample, a DS classifier uses different base models as reference to provide a prediction.
Let M1 and M2 be the models with the highest and second highest accuracy on a hold-out validation set and let r1(·) and r2(·) be their robustness metrics (since we focus solely on r_{d_COR}, we drop the subscript d_COR for notational convenience). For both DS classifier strategies, we start by ordering the validation set based on r2(·)/r1(·). For instances that appear towards the beginning of the ordering, the prediction of M1 is more robust than that of M2, and vice versa for instances towards the end of the ordering. The idea is now to only favor M2 over M1 when its robustness is considerably above that of M1, so for instances towards the end of the ordering. Concretely, we choose a threshold t and, for any new instance x_new, we predict the corresponding class y_new using M2 instead of M1 only when r2(x_new)/r1(x_new) > t, and otherwise use M1. The only difference between RS-D and RS-I is the way in which the threshold t is determined.

Figure 3: Robustness-based strategies for DS; (a) RS-D, (b) RS-I. M1 is the model with the best accuracy on the validation set, while M2 is the second best.

Figure 3 provides an illustration of how to derive the threshold for each strategy. On the horizontal axis of both graphs, we depict the percentage of the hold-out validation data for which r2(x)/r1(x) does not exceed the chosen t. The vertical axis depicts the accuracy on the same validation data, either for the DS classifier for this choice of t when applied to all validation data (Figure 3a) or for the individual models when applied to the data that exceeds t (Figure 3b). RS-D (Figure 3a) takes a Direct approach (hence the D). The idea is simply to choose t such that the resulting strategy maximizes the accuracy on the validation set.
This is equivalent to finding the proportion p of the validation set with the lowest robustness difference that yields the highest accuracy when using M1 to predict the class for this part of the data and M2 to predict the rest.

Table 2: Mean accuracy of each method for different datasets. Best accuracy in bold, second best underlined.

    Label     GB       RF       XGB      SB       MCB      KNORA-U  KNORA-E  META-DES  RS-D     RS-I
    D1        0.99742  0.99356  0.99388  0.9971   0.99678  0.99614  0.99678  0.99581   0.99549  0.99678
    D2        0.91598  0.91489  0.91594  0.91569  0.91568  0.9158   0.91573  0.91657   0.91594  0.91567
    D3        0.77126  0.76954  0.75517  0.76264  0.76494  0.76667  0.76667  0.76667   0.76724  0.76149
    D4        0.86772  0.91021  0.88572  0.91021  0.89522  0.89445  0.89647  0.90386   0.90975  0.91033
    D5        0.60909  0.66284  0.65367  0.66284  0.6525   0.6516   0.65744  0.65268   0.6664   0.66451
    D6        0.88019  0.88215  0.88036  0.88089  0.88024  0.88199  0.8819   0.88288   0.88227  0.88106
    D7        0.9974   0.99333  0.9965   0.99683  0.9956   0.99691  0.99683  0.99707   0.9978   0.99699
    D8        0.97406  0.97387  0.97483  0.97426  0.97426  0.97483  0.97483  0.97522   0.97618  0.97483
    D9        0.76581  0.76752  0.77023  0.77013  0.76842  0.77033  0.77033  0.77193   0.77073  0.77073
    D10       0.98079  0.9753   0.98103  0.98047  0.97942  0.982    0.98184  0.98208   0.98257  0.98184
    D11       0.97036  0.9721   0.97006  0.9709   0.97126  0.9733   0.97306  0.9733    0.97234  0.97174
    D12       0.86846  0.95615  0.90022  0.95615  0.91767  0.92662  0.92796  0.92617   0.94183  0.95168
    D13       0.85326  0.85175  0.85885  0.85442  0.85477  0.85823  0.85761  0.85761   0.85539  0.8561
    D14       0.85513  0.8553   0.86143  0.85841  0.85628  0.86019  0.86045  0.85957   0.8577   0.85965
    D15       0.94419  0.94651  0.94729  0.94496  0.94574  0.94496  0.94186  0.94186   0.94109  0.94651
    Beats SB  5/15     4/15     7/15     -        4/15     10/15    9/15     10/15     10/15    11/15

RS-I (Figure 3b), on the other hand, takes a more Indirect approach. The idea here is to compare the ARCs of M1 and M2 for the ordering based on r2(·)/r1(·) (so not based on r1(·) and r2(·), respectively), and to verify at what point (if at all) the ARC of M2 has
the largest accuracy advantage over M1. So we choose t as the point for which, on the data for which r2(x)/r1(x) exceeds t, M2 provides the greatest accuracy gain compared to M1.

Table 2 presents the mean accuracy of each DS technique for the datasets in Table 1. The base models used in the experiments were GB, RF and XGB. For each dataset, 15 different random train-validation-test splits were performed, with respective proportions of 0.7, 0.15 and 0.15. For each random split, the base models were fitted to the training set, with their hyperparameters selected through grid search with 5-fold cross-validation using the scikit-learn package [Pedregosa et al., 2011]. Then, the accuracy on the validation data was used for choosing the single best base model (SB) and for tuning the DS strategies. The competing DS methods (MCB, KNORA-U, KNORA-E and META-DES) can be found in Cruz et al. [2018] and are based on using KNN on the feature space to come up with the optimal model, using the validation set to decide the number of neighbors (maximum of 10). All the competing DS techniques were implemented through the DESlib library [Cruz et al., 2020].

Analysing the results in Table 2, we see that RS-D is the best performing strategy (indicated in bold) most often. Moreover, RS-D featured as one of the two best strategies in the majority of the datasets, along with META-DES. Furthermore, among the datasets in which RS-D wasn't listed as one of the two best strategies, RS-I was among the best in more than half of them. Compared to the Single Best (SB) strategy, RS-I was the alternative that most often outperformed it, even in circumstances where RS-D was not among the two best strategies.
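The RS-D threshold search can be sketched as a scan over the sorted robustness ratios of the validation set. This is an illustrative sketch under our own naming (`rsd_threshold`), not the paper's implementation:

```python
import numpy as np

def rsd_threshold(ratio, correct1, correct2):
    """RS-D (Direct) sketch: pick the threshold t on the robustness ratio
    r2(x)/r1(x) that maximizes validation accuracy of the combined rule
    "predict with M2 when r2/r1 > t, otherwise with M1"."""
    order = np.argsort(ratio)
    ratio = np.asarray(ratio, dtype=float)[order]
    c1 = np.asarray(correct1, dtype=float)[order]
    c2 = np.asarray(correct2, dtype=float)[order]
    n = len(ratio)
    best_t, best_acc = np.inf, c1.mean()        # t = inf: always use M1
    for k in range(n):                          # instances [0, k) -> M1, [k, n) -> M2
        acc = (c1[:k].sum() + c2[k:].sum()) / n
        t = ratio[k - 1] if k > 0 else ratio[0] - 1.0   # any t just below ratio[k]
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Toy validation set: M2 is right exactly where its relative robustness is high.
ratio = np.array([0.5, 0.8, 1.5, 2.0])
c1 = np.array([True, True, False, False])    # M1 correct?
c2 = np.array([False, False, True, True])    # M2 correct?
t, acc = rsd_threshold(ratio, c1, c2)
print(t, acc)   # 0.8 1.0
```

The new-instance rule is then simply: use M2 whenever r2(x_new)/r1(x_new) > t.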
Moreover, in cases where one base model clearly outperforms the others, such as in the electricity (D4) and vowel (D12) datasets, all competing DS strategies based on the feature space underperform when compared to our robustness-based strategies. Figure 4 illustrates this situation in the two plots on the first row, while presenting more typical settings in the second row.

Figure 4: Boxplots of accuracies for each strategy on different datasets; white triangles represent the mean. (a) electricity (D4), (b) vowel (D12), (c) gesture (D5), (d) magic (D6).

3.2.1 Performance Under Label Corruption

While Table 2 shows that robustness can be used to build a competitive DS classifier, it only evaluates settings in which the test set comes from the same distribution as the rest of the data. To evaluate the performance of all strategies in a context of distribution shift, we perform label corruption on the training and validation sets, which works by uniformly changing the labels of the response variable for a proportion ρ of the data. This same technique has been previously explored in Li et al. [2020]. In Table 3, we set ρ = 0.05 and repeat the analyses.

Table 3: Mean accuracy of each method under label corruption (ρ = 0.05). Best accuracy in bold, second best underlined.

    Label     GB       RF       XGB      SB       MCB      KNORA-U  KNORA-E  META-DES  RS-D     RS-I
    D1        0.98712  0.98583  0.99227  0.99259  0.99002  0.99163  0.9913   0.99195   0.99259  0.99291
    D2        0.91474  0.91421  0.91554  0.91501  0.9152   0.91579  0.91607  0.91611   0.91549  0.91514
    D3        0.75747  0.76954  0.75172  0.76092  0.75632  0.76264  0.76379  0.76092   0.76552  0.76149
    D4        0.86471  0.90057  0.88439  0.90057  0.88947  0.89033  0.89175  0.89528   0.90055  0.90062
    D5        0.59879  0.65807  0.64948  0.65601  0.64444  0.64534  0.652    0.64656   0.66023  0.65857
    D6        0.87461  0.88106  0.87655  0.87783  0.87748  0.87835  0.87767  0.87874   0.87993  0.87821
    D7        0.98926  0.96809  0.99373  0.99373  0.98689  0.99365  0.99365  0.99373   0.99292  0.99373
    D8        0.95677  0.91873  0.97003  0.96926  0.95543  0.96619  0.96619  0.96734   0.96849  0.96945
    D9        0.76361  0.76942  0.76882  0.76632  0.76642  0.76862  0.76872  0.76932   0.77043  0.76832
    D10       0.96578  0.87264  0.98136  0.98136  0.9632   0.977    0.97684  0.97571   0.98055  0.98144
    D11       0.96406  0.97138  0.96208  0.96952  0.9664   0.9697   0.96946  0.96982   0.97024  0.96982
    D12       0.85414  0.9387   0.90291  0.93647  0.90559  0.91902  0.91946  0.91678   0.92975  0.93512
    D13       0.8498   0.85362  0.8522   0.85051  0.85024  0.85424  0.85619  0.85317   0.85193  0.85078
    D14       0.8506   0.85353  0.85229  0.84953  0.85211  0.85451  0.85335  0.85442   0.85486  0.8506
    D15       0.94031  0.94496  0.94884  0.94806  0.94186  0.94729  0.94806  0.94729   0.94884  0.94729
    Beats SB  1/15     8/15     6/15     -        3/15     7/15     5/15     6/15      9/15     12/15

RS-D and RS-I remain top performers in the new setting, but this time none of the competing DS strategies (not even META-DES) fares better than choosing the Single Best (SB) base model based on the accuracy on the validation set. Moreover, while RS-D still remains the highest performing strategy, RS-I is the technique that is most often superior to the Single Best. This pattern suggests that RS-I can be seen as the more reliable option, while RS-D is the one with greater chances of offering the best performance. Such behavior remains consistent even when higher levels of label corruption are applied, as demonstrated in Tables 4 and 5 in the supplementary material.

4 DISCUSSION

In this work, we have shown that robustness quantification can be brought to more general settings, being applicable to discriminative models and to settings with continuous features. Still, the choice of the dissimilarity function d of course remains somewhat arbitrary, justifying further studies into what other possible robustness metrics could have such properties.

Section 3.1.1 provides evidence, based on ARCs, that even though our robustness metric does not take the architecture of the predictive model into account, it still outperforms the competing metric.
This situation contrasts with the findings of Detavernier and De Bock [2025a,b], which have shown that local robustness metrics can in fact perform better than global metrics. The reason for this discrepancy is still to be determined, with some possibilities being that these works were limited to naive Bayes classifiers or that they were restricted to discrete features.

Considering the results in section 3.2, the use of robustness has presented superior performance in the context of Dynamic Selection of classifiers, especially when dealing with label corruption. One possible reason is that standard DS techniques make assessments on the entirety of the feature space, which can lead to instability, whereas robustness offers a lower dimensional but rich representation of the data, leading to more stable procedures. Conversely, combining evaluations in the robustness space in a similar manner as that of other DS techniques could also offer improvements, being a potential topic for further research.

Besides robustness, there are other reliability metrics that could also offer a lower dimensional representation of the feature space, such as those used in uncertainty quantification [Hüllermeier and Waegeman, 2021], which could therefore be applied to the same Dynamic Selection tasks. More than arguing for the use of one over the other, we believe that each can bring different aspects of the model and the data to the spotlight, and should thus be tried in combination. In fact, combining robustness and uncertainty in a single metric has been shown to sometimes correlate better with accuracy than using each metric individually [Detavernier and De Bock, 2025b].

Lastly, since the strategies RS-D and RS-I are somewhat complementary to each other, devising a new technique that combines both of them is also a point of interest.
Since this complementarity seems to be related to the presence or absence of a base model with predictive capacities superior to all others, one strategy for choosing between RS-D and RS-I would be to perform a hypothesis test. By performing multiple train-validation splits and obtaining the accuracies of the models for each split, one could test whether the mean accuracy of the best performing model differs significantly from that of all others. If so, this could corroborate using RS-I instead of RS-D.

5 PROOF OF THEOREM 1

Taking into account Equation (6), it clearly suffices to prove that $r_{d^*_{\mathrm{COR}}}(x) = r$, with $r = \frac{p(\hat{y}_1, x)}{p(\hat{y}_2, x)}$.

We first prove that $r_{d^*_{\mathrm{COR}}}(x) \leq r$, by constructing a measure $Q \in \mathbb{P}$ with density $q$ such that $d^*_{\mathrm{COR}}(P, Q) = r$ and $q(\hat{y}_2 \mid x) \geq q(\hat{y}_1 \mid x)$. Let $\pi_1 := P(Y = \hat{y}_1)$, $\pi_2 := P(Y = \hat{y}_2)$,
$$\lambda^- := \frac{\pi_1 + \pi_2}{\pi_1 + r\pi_2} \quad\text{and}\quad \lambda^+ := \frac{r\pi_1 + r\pi_2}{\pi_1 + r\pi_2},$$
and consider the measurable function $L : \mathcal{Y} \times \mathcal{X} \to \mathbb{R}_{>0}$ defined as
$$L(y, x') := \begin{cases} \lambda^- & \text{if } y = \hat{y}_1, \\ \lambda^+ & \text{if } y = \hat{y}_2, \\ 1 & \text{otherwise.} \end{cases}$$
Then
$$\mathbb{E}_P[L] = \lambda^- \pi_1 + \lambda^+ \pi_2 + (1 - \pi_1 - \pi_2) = 1 + \pi_1(\lambda^- - 1) + \pi_2(\lambda^+ - 1) = 1 - \frac{\pi_1 \pi_2 (r - 1)}{\pi_1 + r\pi_2} + \frac{\pi_2 \pi_1 (r - 1)}{\pi_1 + r\pi_2} = 1.$$
This implies that $Q$ defined by $\mathrm{d}Q/\mathrm{d}P = L$ is a probability measure that is absolutely continuous w.r.t. $P$ and has density $q(y, x') = L(y, x')\, p(y, x')$. Moreover, since $\lambda^- \leq 1 \leq \lambda^+$, we have $\operatorname{ess\,inf} L = \lambda^-$ and $\operatorname{ess\,sup} L = \lambda^+$, so it follows from Equation (5) that
$$d^*_{\mathrm{COR}}(P, Q) = \frac{\operatorname{ess\,sup} L}{\operatorname{ess\,inf} L} = \frac{\lambda^+}{\lambda^-} = r.$$
Finally,
$$\frac{q(\hat{y}_2 \mid x)}{q(\hat{y}_1 \mid x)} = \frac{q(\hat{y}_2, x)}{q(\hat{y}_1, x)} = \frac{\lambda^+ p(\hat{y}_2, x)}{\lambda^- p(\hat{y}_1, x)} = \frac{r}{r} = 1,$$
so $q(\hat{y}_2 \mid x) \geq q(\hat{y}_1 \mid x)$, which implies that the prediction $\hat{y}_1$ is not robust w.r.t. $\mathbb{P}_r$ for $d = d^*_{\mathrm{COR}}$ and thus that $r_{d^*_{\mathrm{COR}}}(x) \leq r$.

Next, we prove that $r_{d^*_{\mathrm{COR}}}(x) \geq r$.
Let $Q \in \mathbb{P}$ be such that $d^*_{\mathrm{COR}}(P, Q) < r$, let $q$ be its density and $L = q/p$ the likelihood ratio. Then $q(y, x) = L(y, x)\, p(y, x)$ and, due to Equation (5), $\operatorname{ess\,sup} L / \operatorname{ess\,inf} L < r$. For all $y \neq \hat{y}_1$, since $p(y, x) \leq p(\hat{y}_2, x)$, this implies that
$$\frac{q(\hat{y}_1 \mid x)}{q(y \mid x)} = \frac{q(\hat{y}_1, x)}{q(y, x)} = \frac{p(\hat{y}_1, x)}{p(y, x)} \cdot \frac{L(\hat{y}_1, x)}{L(y, x)} \geq \frac{p(\hat{y}_1, x)}{p(y, x)} \cdot \frac{\operatorname{ess\,inf} L}{\operatorname{ess\,sup} L} > \frac{p(\hat{y}_1, x)}{p(y, x)} \cdot \frac{1}{r} \geq \frac{p(\hat{y}_1, x)}{p(\hat{y}_2, x)} \cdot \frac{1}{r} = 1.$$
So $q(\hat{y}_1 \mid x) > q(y \mid x)$ for all $y \neq \hat{y}_1$, which means that $Q$ predicts the same class $\hat{y}_1$ as $P$. Since this is true for every $Q \in \mathbb{P}$ such that $d^*_{\mathrm{COR}}(P, Q) < r$, we find that the prediction $\hat{y}_1$ is robust w.r.t. $\mathbb{P}_\delta$ for all $\delta < r$ and therefore that $r_{d^*_{\mathrm{COR}}}(x) \geq r$.

Author Contributions

The authors contributed equally to the development of the ideas and the writing of the paper. J. De Bock did the proof of Theorem 1 and R. F. L. Lassance created the code and ran the experiments.

Acknowledgements

This work was carried out with the support of the Coordination for the Improvement of Higher Education Personnel – Brazil (CAPES) – Financing Code 001. We thank Adrián Detavernier and Rafael B. Stern for the fruitful conversations and suggestions on how to make the contributions of robustness quantification more meaningful.

References

Thomas Augustin, Frank P. A. Coolen, Gert de Cooman, and Matthias C. M. Troffaes, editors. Introduction to Imprecise Probabilities. Wiley Series in Probability and Statistics. Wiley-Blackwell, Hoboken, NJ, May 2014.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001. ISSN 0885-6125. doi: 10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system.
In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939785. URL http://doi.acm.org/10.1145/2939672.2939785.

Alvaro H. C. Correia and Cassio P. de Campos. Towards scalable and robust sum-product networks. In Nahla Ben Amor, Benjamin Quost, and Martin Theobald, editors, Scalable Uncertainty Management, pages 409–422, Cham, 2019. Springer International Publishing.

Alvaro H. C. Correia, Robert Peharz, and Cassio de Campos. Towards robust classification with deep generative forests, 2020. arXiv:2007.05721.

Rafael M. O. Cruz, Luiz G. Hafemann, Robert Sabourin, and George D. C. Cavalcanti. DESlib: A dynamic ensemble selection library in Python. Journal of Machine Learning Research, 21(8):1–5, 2020. URL http://jmlr.org/papers/v21/18-144.html.

Rafael M. O. Cruz, Robert Sabourin, and George D. C. Cavalcanti. Dynamic classifier selection: Recent advances and perspectives. Information Fusion, 41:195–216, 2018. ISSN 1566-2535. doi: 10.1016/j.inffus.2017.09.010. URL https://www.sciencedirect.com/science/article/pii/S1566253517304074.

Jasper De Bock, Cassio P. de Campos, and Alessandro Antonucci. Global sensitivity analysis for MAP inference in graphical models. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/f9d3c99bd6cbf2d694266e7760ee1ed6-Paper.pdf.

Adrián Detavernier and Jasper De Bock. Robustness quantification: a new method for assessing the reliability of the predictions of a classifier. In Proceedings of the International Symposium on Imprecise Probabilities: Theories and Applications, volume 290, pages 126–136, 2025a. URL https://proceedings.mlr.
press/v290/detavernier25a.html.

Adrián Detavernier and Jasper De Bock. Robustness and uncertainty: two complementary aspects of the reliability of the predictions of a classifier, 2025b. URL https://arxiv.org/abs/2512.15492.

Lutz Dümbgen, Richard J. Samworth, and Jon A. Wellner. Bounding distributional errors via density ratios. Bernoulli, 27(2), May 2021. ISSN 1350-7265. doi: 10.3150/20-bej1256. URL http://dx.doi.org/10.3150/20-BEJ1256.

Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning, 110(3):457–506, March 2021. ISSN 1573-0565. doi: 10.1007/s10994-021-05946-3. URL https://doi.org/10.1007/s10994-021-05946-3.

Meizhu Li, Shaoguang Huang, Jasper De Bock, Gert de Cooman, and Aleksandra Pižurica. A robust dynamic classifier selection approach for hyperspectral images with imprecise label information. Sensors, 20(18), 2020. ISSN 1424-8220. doi: 10.3390/s20185262. URL https://www.mdpi.com/1424-8220/20/18/5262.

Christoph Molnar. Interpretable Machine Learning. 3rd edition, 2025. ISBN 978-3-911578-03-5. URL https://christophm.github.io/interpretable-ml-book.

Ignacio Montes, Enrique Miranda, and Sébastien Destercke. Unifying neighbourhood and distortion models: part I – new results on old models. International Journal of General Systems, 49(6):602–635, 2020. doi: 10.1080/03081079.2020.1778682.

Awais Muhammad and Sung-Ho Bae. A survey on efficient methods for adversarial robustness. IEEE Access, 10:118815–118830, 2022. doi: 10.1109/ACCESS.2022.3216291.

Malik Sajjad Ahmed Nadeem, Jean-Daniel Zucker, and Blaise Hanczar. Accuracy-rejection curves (ARCs) for comparing classification methods with a reject option.
In Sašo Džeroski, Pierre Geurts, and Juho Rousu, editors, Proceedings of the Third International Workshop on Machine Learning in Systems Biology, volume 8 of Proceedings of Machine Learning Research, pages 65–81, Ljubljana, Slovenia, 05–06 Sep 2009. PMLR. URL https://proceedings.mlr.press/v8/nadeem10a.html.

Cathy O'Neil. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing Group, USA, 2016. ISBN 0553418815.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Valentim Realinho, Mónica Vieira Martins, Jorge Machado, and Luís Baptista. Predict Students' Dropout and Academic Success. UCI Machine Learning Repository, 2021. DOI: https://doi.org/10.24432/C5MC89.

Joseph D. Romano, Trang T. Le, William La Cava, John T. Gregg, Daniel J. Goldberg, Praneel Chakraborty, Natasha L. Ray, Daniel Himmelstein, Weixuan Fu, and Jason H. Moore. PMLB v1.0: an open source dataset collection for benchmarking machine learning methods. arXiv preprint arXiv:2012.00058v2, 2021.

Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, May 2019. ISSN 2522-5839. doi: 10.1038/s42256-019-0048-x. URL https://doi.org/10.1038/s42256-019-0048-x.

Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: networked science in machine learning. SIGKDD Explorations Newsletter, 15(2):49–60, June 2014. ISSN 1931-0145. doi: 10.1145/2641190.2641198. URL https://doi.org/10.1145/2641190.2641198.

Peter Walley.
Statistical Reasoning with Imprecise Probabilities, volume 42 of Monographs on Statistics and Applied Probability. Chapman and Hall, London, 1991.

Robustness Quantification for Discriminative Models (Supplementary Material)

Rodrigo F. L. Lassance    Jasper De Bock

Table 4: Mean accuracy of each method under label corruption (ρ = 0.1). Best accuracy in bold, second best underlined.

Label GB RF XGB SB MCB KNORA-U KNORA-E META-DES RS-D RS-I
D 1 0.98068 0.98261 0.98647 0.98841 0.98583 0.98937 0.98969 0.99002 0.98776 0.98969
D 2 0.91351 0.90522 0.91446 0.91385 0.91166 0.91145 0.9113 0.91284 0.91423 0.91429
D 3 0.74713 0.76954 0.7546 0.75402 0.76609 0.76149 0.7569 0.76207 0.76379 0.75575
D 4 0.85802 0.89204 0.87812 0.89204 0.88107 0.88344 0.88445 0.88689 0.89383 0.89219
D 5 0.58817 0.65092 0.63927 0.64975 0.63437 0.6305 0.64076 0.63711 0.65245 0.65074
D 6 0.87064 0.87786 0.87342 0.87648 0.87416 0.87489 0.8747 0.87564 0.87741 0.87687
D 7 0.98803 0.9676 0.98958 0.98942 0.98551 0.99088 0.99105 0.99162 0.99129 0.99064
D 8 0.94409 0.92065 0.96119 0.96061 0.95293 0.95793 0.95812 0.95869 0.95985 0.96081
D 9 0.75669 0.76622 0.7612 0.7614 0.76221 0.76241 0.76371 0.76411 0.76571 0.76361
D 10 0.95956 0.86772 0.97869 0.97869 0.96053 0.97103 0.97256 0.97119 0.97635 0.97797
D 11 0.9613 0.96874 0.95842 0.9679 0.96202 0.96754 0.9673 0.96748 0.96778 0.9676
D 12 0.82685 0.91991 0.87919 0.92081 0.8868 0.89306 0.89575 0.89083 0.90917 0.9132
D 13 0.84412 0.85237 0.84874 0.84891 0.85166 0.85184 0.85344 0.85122 0.849 0.84971
D 14 0.84669 0.85211 0.85069 0.84927 0.84927 0.85149 0.85246 0.85477 0.85584 0.85078
D 15 0.92248 0.94574 0.93721 0.94186 0.94109 0.94186 0.94419 0.93876 0.94496 0.94264
Beats SB 0/15 8/15 6/15 - 3/15 6/15 7/15 6/15 10/15 12/15

Table 5: Mean accuracy of each method under label corruption (ρ = 0.2). Best accuracy in bold, second best underlined.

Label GB RF XGB SB MCB KNORA-U KNORA-E META-DES RS-D RS-I
D 1 0.94847 0.97359 0.97262 0.97327 0.96844 0.97874 0.97939 0.97617 0.97649 0.9752
D 2 0.91295 0.90548 0.91228 0.91313 0.91017 0.9105 0.91051 0.91242 0.91338 0.91314
D 3 0.71322 0.75287 0.73908 0.74195 0.73046 0.74483 0.74425 0.7431 0.73851 0.73851
D 4 0.84607 0.85835 0.8595 0.85955 0.859 0.86329 0.86388 0.86241 0.86998 0.86176
D 5 0.56644 0.64197 0.62231 0.63941 0.61831 0.61165 0.62434 0.62092 0.64462 0.64017
D 6 0.86403 0.8665 0.86578 0.86529 0.86508 0.86802 0.86847 0.8687 0.86858 0.86559
D 7 0.98266 0.96443 0.9825 0.98063 0.98128 0.98584 0.98689 0.98746 0.98665 0.98396
D 8 0.92277 0.92027 0.9414 0.93929 0.93372 0.94159 0.94467 0.94428 0.94409 0.94121
D 9 0.75028 0.75228 0.74657 0.74937 0.74817 0.75388 0.75589 0.75409 0.75519 0.75258
D 10 0.94649 0.86489 0.96594 0.96594 0.94496 0.95642 0.96287 0.95932 0.96594 0.96626
D 11 0.95014 0.96538 0.94287 0.96538 0.95476 0.95884 0.9586 0.9589 0.96244 0.96436
D 12 0.74989 0.8783 0.82998 0.8783 0.83669 0.83982 0.84743 0.85101 0.87293 0.8774
D 13 0.83604 0.85308 0.83356 0.8474 0.8419 0.84359 0.84634 0.84634 0.84758 0.84838
D 14 0.83595 0.85362 0.83693 0.85175 0.8435 0.8466 0.8498 0.849 0.85255 0.85166
D 15 0.88915 0.92946 0.90775 0.92016 0.9093 0.92481 0.92558 0.92403 0.92946 0.92403
Beats SB 2/15 8/15 5/15 - 1/15 8/15 9/15 8/15 11/15 11/15
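The label-corruption procedure used to produce Tables 3–5 can be sketched as follows. This is an illustrative sketch rather than the authors' exact code: the function name `corrupt_labels` is our own, and we assume that a corrupted label is redrawn uniformly from the classes different from the original one (whether the original class may be redrawn is not specified in the text).

```python
import numpy as np

def corrupt_labels(y, rho, n_classes, rng=None):
    """Uniformly reassign the labels of a proportion rho of the data.

    Illustrative sketch: each selected instance gets a new label drawn
    uniformly from the classes different from its current one.
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y).copy()
    n = len(y)
    # choose the instances to corrupt, without replacement
    idx = rng.choice(n, size=int(round(rho * n)), replace=False)
    for i in idx:
        # redraw uniformly among the other classes
        choices = [c for c in range(n_classes) if c != y[i]]
        y[i] = rng.choice(choices)
    return y
```

For example, with `rho = 0.05` (the setting of Table 3), exactly 5% of the training and validation labels are reassigned before fitting the base models and the DS strategies.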
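The construction in the proof of Theorem 1 can also be checked numerically: for any choice of joint probabilities, the likelihood ratio $L$ built from $\lambda^-$ and $\lambda^+$ integrates to one under $P$, attains the distortion $\lambda^+/\lambda^- = r$, and ties the two candidate classes. The numbers below are arbitrary toy values of our own, not taken from the paper.

```python
# Numerical sanity check of the construction in the proof of Theorem 1.
# Toy joint probabilities p(y, x) for a fixed x: arbitrary illustrative values.
p_y1, p_y2 = 0.30, 0.20        # p(yhat1, x) >= p(yhat2, x)
pi1, pi2 = 0.45, 0.35          # marginals P(Y = yhat1), P(Y = yhat2)
r = p_y1 / p_y2                # the claimed robustness value

lam_minus = (pi1 + pi2) / (pi1 + r * pi2)
lam_plus = (r * pi1 + r * pi2) / (pi1 + r * pi2)

# E_P[L] must equal 1, so that Q with dQ/dP = L is a probability measure.
E_L = lam_minus * pi1 + lam_plus * pi2 + (1 - pi1 - pi2)

# d*_COR(P, Q) = ess sup L / ess inf L = lam_plus / lam_minus must equal r,
# and q(yhat2 | x) / q(yhat1 | x) must equal 1 (the two classes tie under Q).
distortion = lam_plus / lam_minus
tie = (lam_plus * p_y2) / (lam_minus * p_y1)
```

Running this confirms, up to floating-point rounding, the three identities the proof relies on.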