An index of effective number of variables for uncertainty and reliability analysis in model selection problems


Authors: Luca Martino, Eduardo Morgado, Roberto San Millán-Castillo

Abstract

An index of an effective number of variables (ENV) is introduced for model selection in nested models. This is the case, for instance, when we have to decide the order of a polynomial function or the number of bases in a nonlinear regression, choose the number of clusters in a clustering problem, or the number of features in a variable selection application (to name a few examples). It is inspired by the idea of the maximum area under the curve (AUC). The interpretation of the ENV index is identical to that of the effective sample size (ESS) indices concerning a set of samples. The ENV index improves on drawbacks of the elbow detectors described in the literature and introduces different confidence measures for the proposed solution. These novel measures can also be employed jointly with different information criteria, such as the well-known AIC and BIC, or with any other model selection procedure. Comparisons with classical and recent schemes are provided in different experiments involving real datasets. Related Matlab code is given.

Keywords: Model selection, elbow detection, information criterion, Effective Sample Size (ESS), Gini index, uncertainty analysis, variable importance.

1 Introduction

Model selection is a fundamental task of scientific inquiry [1, 2, 3, 4]. It consists of selecting one model among many candidate models, given some observed data. We can distinguish three main frameworks in model selection. The first scenario is when completely different models are compared. The second setting is when several models defined by the same parametric family are considered (namely, the parameters of the model are tuned).
The third setting is related to the previous one but, in this scenario, the family contains models of different complexity, since the number of parameters can grow (i.e., the dimension of the vector of parameters grows, building more complex models). This last case, also known as nested models, is what we address in this work. Examples of model selection in nested models are order selection in polynomial regression or autoregressive schemes, variable selection, clustering, and dimension reduction, just to name a few [5, 6, 7, 8, 9]. Other important applications in signal processing are the estimation of the number of source signals [10] and structured parameter selection [11]. In the selection of nested models, the decision is driven by the so-called bias-variance trade-off¹: we have to choose a compromise between (a) the model performance and (b) the model complexity. Thus, the concept of selecting the best model is, in some sense, related to a proper mathematical definition of a 'good enough' model. Hence, the issue is to properly describe in terms of equations the colloquial expression 'good enough' [12]. In the literature, there are two main families of methods for model selection, which are composed of three main sub-families, as depicted in Figure 1. The first class is formed by resampling methods, where a main sub-family is given by the bootstrap and cross-validation (CV) techniques [13, 14, 15]. For simplicity, we focus here on CV. More specifically, CV is based on splitting the data into training and test sets. The training set is used to fit a model and the test set is used to evaluate it. However, the proportion of data to use in training (and/or in testing) must be chosen by the user and can critically affect the results in terms of penalization of the model complexity.

† Universidad Rey Juan Carlos, Campus de Fuenlabrada, Madrid.
Moreover, the splitting process should be repeated several times, and the performance can be averaged over the runs (which is computationally costly).

Figure 1: Classes of methods for model selection (standard ones and more recent approaches).

The second family is given by the so-called probabilistic statistical measures, which employ different rules for evaluating the different models considering directly the entire dataset (unlike in CV). This family is formed by two main sub-classes: the information criteria [16, 5, 17, 18] and the marginal likelihood approach (a.k.a., Bayesian evidence) used in Bayesian inference [19, 4, 20]. Some famous information criteria are the Akaike information criterion (AIC), which is based on the entropy maximization principle [21], and the Bayesian information criterion (BIC), which is also a bridge to the marginal likelihood approach, since BIC is derived as an approximation of the marginal likelihood [22]. Other very similar or completely equivalent schemes can also be found [23, 24, 25]. All the information criteria (IC) consider the sum of a performance measure and a penalty on the complexity. More specifically, they employ a linear penalization of the model complexity, where a parameter λ represents a positive slope value of this penalty. They differ in the slope λ of this linear penalization term [26] (see, for instance, Table 1). The choice of this slope is justified by different theoretical derivations, each one with several assumptions and approximations. Clearly, the results depend on this choice [26, 27]. Some considerations regarding the consistency of IC can be found in the survey [28]. The last approach is based on the marginal likelihood, which is employed in Bayesian inference for model selection purposes.

¹ A bias-variance trade-off can also appear in scenarios with non-nested models.
The model penalization in the marginal likelihood is induced by the choice of the prior densities [29, 4]. Therefore, in the three main approaches described above, (a) CV, (b) information criteria, and (c) marginal likelihood computation, we always have a degree of freedom (proportion of split data, slope λ, and choice of the prior densities, respectively) that affects the final results of the model selection. For this reason, other recent approaches have also been investigated in the literature, based on geometric considerations: some of them propose an automatic detection of an "elbow" or "knee-point" in a non-increasing curve describing a metric of performance of the model versus its complexity [30, 31, 32, 33]. In [30], the authors provide four different and equivalent geometric derivations showing that: (a) the elbow detectors in [30, 31, 32, 33] are equivalent (providing the same result), and (b) this result can be obtained as the optimization of an information criterion with a specific choice of λ (given in [30]), i.e., this geometric approach can also be expressed as an information criterion. On the other hand, another recent information criterion, called the spectral information criterion (SIC) [27], has also been introduced in the literature: this criterion uses all the possible values of λ, thus it contains the rest of the IC as special cases, and it also returns a confidence measure of the proposed solution. In this work, we extend one of the derivations proposed in [30], designing an index of an effective number of variables (ENV). Note that, throughout the work, we use the words variables, components, features, and/or parameters of a model as synonyms [34, 35]. The resulting index is inspired by the concept of maximum area under the curve (AUC) in receiver operating characteristic (ROC) curves [12, 36] and by the Gini index, widely used in economics [37, 38, 39, 40].
Moreover, the underlying idea and interpretation are exactly like those of the effective sample size (ESS) indices concerning a set of samples [41, 42]. An important property of the ENV index is that it improves on the elbow detectors by removing the dependence on the maximum number of variables K considered in the analysis [30, 31, 32, 33]. Moreover, we introduce measures of the reliability and uncertainty of the proposed results, related to the ENV index and similar to the SIC outputs. They provide quantities associated with how 'safe' the solution is, in terms of the possible information lost by constructing a 'too' parsimonious model. It is important to remark that the novel confidence measures can also be employed with different information criteria, such as AIC and BIC (to name a few), or with other feature importance and model selection schemes [43, 44], not only when the elbow detectors are applied. In order to define these confidence measures, we also introduce some variable importance measures (see also the sensitivity analysis in [45]) that are in some way related to other relevant concepts already proposed in the literature, such as the Shapley values and feature importance ideas, which are currently a hot research topic [46, 47]. The fact that several information criteria and feature importance schemes have been introduced in the literature (depending on the specific context and application) is a clear signal that confidence measures for the results are required. The rest of the work is structured as follows. In Section 2, we define the main notation and recall some background material. The ENV index is derived and analyzed in Section 3. Additional properties of the ENV index are given in Section 4. Furthermore, in the same section, we introduce some confidence measures based on the derivation of the ENV index. In Section 5, we show different numerical experiments. We provide some final conclusions in Section 6.
Additional information is also contained in two appendices.

2 Framework and main notation

2.1 The error curve V(k) as a figure of merit

In many applications in signal processing and machine learning, we desire to infer a vector of parameters θ_k = [θ_1, ..., θ_k]^⊤ of dimension k, given a data vector y = [y_1, ..., y_N]^⊤. A likelihood function p(y|θ_k) is usually available, often derived from a related physical model [20, 19]. In many scenarios, the dimension k is also unknown and must be estimated as well. This is the case in numerous applications, for instance when k represents (a) the number of clusters in a clustering problem, (b) the order of a polynomial in a non-linear regression problem, (c) the number of selected variables in a feature selection problem, or (d) the number of retained dimensions in a dimension reduction problem, to name a few. In all these real-world application problems, a non-increasing error function (i.e., a metric that characterizes the performance of the system, such as a fitting measure),

V(k) : N → R,  k = 0, 1, 2, ..., K,

can be obtained. Note that k = 0 represents the scenario of 'no dependence', or 'absence of model', or the simplest possible model. Namely, V(0) represents the value of the error function corresponding to the simplest model, for instance, a constant model in a regression problem, or a single cluster (for all the data) in a clustering problem. For instance, in polynomial regression, the value V(0) could correspond to the variance of the data, i.e., the mean square error (MSE) in prediction using a constant value θ_0 (equal to the mean of the data) as the model. Generally, more complex models (with more parameters, i.e., the vector θ_k = [θ_0, θ_1, ..., θ_k]^⊤ has a higher dimension k) provide smaller errors in prediction/classification.
We consider the most complex model to have K parameters on top of the simplest model (the simplest case, with only θ_0), i.e., θ_K = [θ_0, θ_1, ..., θ_K]^⊤. Without loss of generality, we consider an integer variable k with an increasing step of one unit. Clearly, more general assumptions could be made. In a variable/feature selection problem, we assume that the order of the variables inside the vector θ_k is well-chosen, i.e., V(k) is built after ranking the variables/features in decreasing order of relevance [8]. Examples of possible choices of V(k) are the following:

• In the literature, when a likelihood function is given, a usual choice is

V(k) = −2 log(ℓ_max), where ℓ_max = max_θ p(y|θ_k),

e.g., as in [17, 22, 21, 20].

• V(k) could be directly the MSE, and/or the mean absolute error (MAE), or transformations of them (such as log MSE, etc.). For instance, in the linear and additive Gaussian noise case, it is possible to show that the choice V(k) = −2 log(ℓ_max) is equivalent to V(k) = N log MSE, where the MSE also represents an estimation of the noise power in the system (e.g., see [48]).

• V(k) can be any other error measure in classification or clustering, for instance. See Section 5.3 for an example.

Assumptions. Generally, V(k) should be a non-increasing error curve, i.e., for any pair of non-negative integers n_1, n_2 such that n_2 > n_1, we have V(n_2) ≤ V(n_1).² Indeed, V(k) is a fitting term that decreases as the complexity of the model (given by the number k of parameters) grows. Therefore, we have V(0) ≥ V(k), ∀k. Note that V(k) plays the same role as the so-called Lorenz curve in the definition of the Gini index [37, 38]. This general approach is similar to the framework used for defining the Shapley values [43, 44]. See Figure 2 for a graphical example.
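As an illustration of how such a curve can be built in practice, the following is a minimal Python sketch (the paper's accompanying code is in Matlab): it fits nested polynomial models of increasing order to hypothetical synthetic data and sets V(k) = N log MSE_k, shifted so that V(K) = 0. The function name `error_curve` and the data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def error_curve(x, y, K):
    """Error curve V(k) = N*log(MSE_k) from nested polynomial fits,
    shifted so that min V(k) = V(K) = 0 (k = 0 is the constant model)."""
    N = len(y)
    V = np.empty(K + 1)
    for k in range(K + 1):
        coeffs = np.polyfit(x, y, deg=k)   # least-squares fit of order k
        mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
        V[k] = N * np.log(mse)
    return V - V[K]   # enforce min V(k) = V(K) = 0

# Hypothetical synthetic data: a cubic signal plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1.0 + 2.0 * x - 3.0 * x**3 + 0.1 * rng.standard_normal(50)
V = error_curve(x, y, K=6)   # non-increasing curve, V[6] = 0
```

Since the models are nested, the training MSE cannot increase with k, so the curve is non-increasing as required by the assumptions above.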
In some applications, the score function V(k) should also be convex (as in Figure 2), i.e., the differences V(k + 1) − V(k) decrease as k increases. This is the case in a variable selection problem if the variables have been ranked correctly. However, this work does not require conditions regarding the concavity of V(k). Finally, just for the sake of simplicity and without loss of generality, we assume that min V(k) = V(K) = 0. Clearly, this condition can always be obtained with a simple subtraction. For instance, given a generic non-increasing function V′(k), we can define V(k) = V′(k) − min V′(k) = V′(k) − V′(K), so that min V(k) = 0.

Figure 2: (a) Example of an error function V(k) where K = 6. (b) Construction with two straight lines and the areas A_1, A_2, and A_3.

² This condition could also be relaxed. We keep it for the sake of simplicity.

2.2 The universal automatic elbow detector

In this section, we briefly recall one of the derivations (all based on geometric arguments) of the universal automatic elbow detector (UAED) given in [30]. Here, we need to recall the derivation most connected to the AUC approach [12, 36]. The underlying idea is to extract geometric information from the curve V(k) by looking for a geometric "elbow" k_e in order to determine the optimal number of components (denoted as k_e ∈ {0, 1, ..., K}) to include in our model. We consider the construction of two straight lines, one passing through the points (0, V(0)) and (k, V(k)), and the other through (k, V(k)) and (K, 0), as shown in Figure 2(b) (where k ∈ {0, 1, ..., K}). Hence, we have a piecewise linear approximation of the curve V(k) with these two straight lines. The goal is to minimize the area under this approximation.
More specifically, the total area under the approximation is the sum of the two triangular areas (A_1 and A_3) and the rectangular area (A_2) in Figure 2(b). Namely, we have

A_1 = k(V(0) − V(k))/2,  A_2 = k V(k),  A_3 = (K − k) V(k)/2,

hence the optimal number of components k_e (location of the "elbow") is defined as

k_e = arg min_k {A_1 + A_2 + A_3}.   (1)

After some algebra, we arrive at the expression

k_e = arg min_k [ V(k) + (V(0)/K) k ],  for k = 1, ..., K,   (2)

where we are assuming V(0) ≠ 0 and K ≠ 0.

Remark. It is important to remark that, since k belongs to a discrete and finite set, solving the optimization above is straightforward.

Remark. If V(k) is convex, k_e is unique. Convexity of V(k) is a sufficient, but not necessary, condition for the uniqueness of the solution.

Remark. If V(k) is not convex, we can have several global minima. For instance, having M different minima, k*_1, k*_2, ..., k*_M, the user can choose the best solution (within the M possible ones) according to specific requirements depending on the application. Here, we suggest picking the most conservative choice, i.e., k_e = max_j k*_j, for j = 1, ..., M.

Relation with the information criteria. The cost function employed by the UAED is

C(k) = V(k) + (V(0)/K) k   (3)
     = V(k) + λ k,   (4)

where the slope of the complexity penalization term is λ = V(0)/K. Therefore, this cost function has exactly the same form as the cost function used in information criteria like BIC and AIC, i.e., with a linear penalization of the model complexity, here selecting λ = V(0)/K. In AIC and/or BIC, we have V(k) = −2 log ℓ_max. Therefore, when V(k) = −2 log ℓ_max, the UAED can be interpreted as an information criterion with the particular choice λ = V(0)/K. Table 1 summarizes this information.
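Since k ranges over a finite set, the UAED rule in Eq. (2) is immediate to implement. A minimal Python sketch (illustrative; the paper's own code is in Matlab), including the conservative tie-breaking suggested in the remark above:

```python
import numpy as np

def uaed_elbow(V):
    """UAED elbow, Eq. (2): k_e = argmin_k V(k) + (V(0)/K)*k, k = 1..K.
    Ties (several global minima) are resolved conservatively by taking
    the largest minimizer, as suggested in the remark above."""
    V = np.asarray(V, dtype=float)
    K = len(V) - 1
    k = np.arange(1, K + 1)
    C = V[1:] + (V[0] / K) * k            # linear penalty with slope V(0)/K
    return int(k[np.isclose(C, C.min())].max())

# Toy curve with a clear elbow at k = 3 (K = 6, V(K) = 0):
V = np.array([12.0, 8.0, 4.0, 0.6, 0.4, 0.2, 0.0])
k_e = uaed_elbow(V)   # -> 3
```

Note that, for a perfectly straight-line curve V(k), the cost C(k) is constant, so every k is a global minimum and the conservative rule returns k_e = K.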
Table 1: Information criteria in the literature, with the corresponding choices of V(k) and λ. Note that N denotes the number of data points and ℓ_max is the maximum value reached by the likelihood function. The ENV index represents the scheme presented in this work.

Information criterion (IC)                   Choice of λ     V(k)
Bayesian-Schwarz [22]                        log N           −2 log ℓ_max
Akaike [21]                                  2               −2 log ℓ_max
Hannan-Quinn [49]                            log(log(N))     −2 log ℓ_max
Universal Automatic Elbow Detector [30]      V(0)/K          any
Spectral IC (SIC) [27]                       all             any
Eff. num. of variables (ENV)                 —               any

Figure 3: (a) We can consider that V(k) is like a sampled curve obtained from sampling (in a signal processing sense) a continuous function V(x), where x ∈ R (shown as a dashed line) is an auxiliary continuous variable. The continuous function V(x) possibly does not exist and can be just a theoretical tool to define the area A_V. (b) In any case, we have access to V(k), k ∈ N, which allows us to obtain the approximation Â ≈ A_V.

3 An index of the effective number of variables (ENV)

The elbow detectors provide very good performance in several different scenarios, as shown in [30]. This proves the strength of the geometric approach compared to other information criteria proposed in the literature. However, all the elbow detectors presented in the literature [30, 31, 32, 33] have a dependence on the maximum number of variables analyzed, i.e., K. This is due to the slope of the linear complexity penalty being λ = V(0)/K. See also Section 5.1 for a numerical example. In some applications, the number K is a maximum value strictly defined by a limitation of the system/model.

Figure 4: Special ideal cases. (a) V(k) reaches zero already at k = 1. Only the first component is relevant, hence the optimal choice is k* = k_e = 1. (b) V(k) is a straight line connecting the points (0, V(0)) and (K, 0).
All variables contribute in the same way to the decay of V(k), hence the optimal choice is k* = k_e = K (in the figure, K = 6). (c) The function V(k) is a straight line passing through the points (0, V(0)) and (k*, 0), that is, V(k) = V(0) − (V(0)/k*) k, so that V(k*) = 0 at some point k* < K. Clearly, the point k* = k_e = 3 is an optimal choice: the first 3 variables have the same contribution to the decay of V(k) and completely explain the drop.

In other frameworks, K could be increased: if a new possible component is added, even with a small impact on the decrease of the error function V(k), since the slope λ = V(0)/K becomes smaller, the elbow detectors could suggest a bigger value as the optimal number of components. On the other hand, if the model selection analysis is performed after reducing the possible total number of components in advance, the elbow detectors would suggest a smaller value as the optimal number of variables.³ Moreover, in several applications, it is also important to obtain a confidence measure for the model selection, jointly with the elbow detection. In the next sections, we try to address these issues.

3.1 Derivation of the ENV index

In this section, we extend the geometric approach employed by the UAED, designing a new index that (a) reduces the dependence on K and (b) helps us to provide confidence measures with respect to the chosen elbow. The main idea is formed by two parts: (Part-1) obtaining a better approximation Â (w.r.t. the derivation of the UAED) of the area under the curve V(k), denoted as A_V and shown in Figure 3(a); (Part-2) normalizing the obtained value Â (considering some ideal scenario). We provide more details below.
(Part-1) Looking at Figure 3, we can observe that a better approximation of the area A_V under the function V(k) can be easily obtained as the sum of K trapezoidal pieces, as shown in Figure 3(b) with red lines:

Â = Σ_{k=0}^{K−1} [V(k) + V(k+1)]/2
  = V(0)/2 + V(1)/2 + V(1)/2 + V(2)/2 + ... + V(K−1)/2 + V(K)/2
  = [V(0) + V(K)]/2 + Σ_{k=1}^{K−1} V(k)
  = V(0)/2 + Σ_{k=1}^{K−1} V(k),   (5)

where, in the last equality, we have used the assumption that V(K) = 0 and that the step of increase on the k axis is 1 (without loss of generality). A graphical example is depicted in Figure 3(b).

(Part-2) In order to design a normalized index of the effective number of components, we need to describe an ideal scenario: let us assume that the entire drop of V(k) is already reached in the first step with a linear decay, i.e., we already have V(1) = 0 and, by assumption, the rest of the values are also zero, V(1) = V(2) = ... = V(K) = 0, as shown in Figure 4(a). In this case, the first variable/component is relevant, whereas the rest of the variables do not give any contribution to the decay of V(k). Therefore, the correct decision for the effective number of components is k* = k_e = 1.⁴ The approximated area under the curve V(k) in this case is Â^(1) = V(0)/2. Thus, we define the index of an effective number of variables (ENV) as:

I_ENV = 0, when V(0) = 0;  I_ENV = Â / Â^(1) = (2/V(0)) Â, when V(0) ≠ 0.   (6)

Note that, when V(0) ≠ 0, we can then write

I_ENV = 1 + 2 Σ_{k=1}^{K−1} V(k)/V(0),  (for V(0) ≠ 0).

³ Clearly, there is also a dependence on V(0) and, more generally, the result depends on the choice of the error curve V(k), as well.
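A direct implementation of the two parts above is immediate; the following Python sketch (illustrative; the paper provides Matlab code) computes the trapezoidal area Â of Eq. (5) and normalizes it by the ideal-case area Â^(1) = V(0)/2, as in Eq. (6):

```python
import numpy as np

def env_index(V):
    """ENV index: trapezoidal area A_hat under V(k), Eq. (5),
    normalized by the ideal one-variable area A_hat^(1) = V(0)/2, Eq. (6)."""
    V = np.asarray(V, dtype=float)
    if V[0] == 0:
        return 0.0
    A_hat = np.sum((V[:-1] + V[1:]) / 2.0)   # Eq. (5), unit steps in k
    return 2.0 * A_hat / V[0]                # Eq. (6); equals Eq. (7)

# Ideal case of Figure 4(a): entire drop at k = 1 -> I_ENV = 1.
I_a = env_index([5.0, 0.0, 0.0, 0.0, 0.0])
# Ideal case of Figure 4(b): straight line from (0, V(0)) to (K, 0) -> I_ENV = K.
I_b = env_index(np.linspace(12.0, 0.0, 7))   # K = 6
```

Since V(K) = 0, the trapezoidal sum collapses to V(0)/2 + Σ_{k=1}^{K−1} V(k), so this is numerically identical to Eq. (7).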
(7)

3.2 Behavior of I_ENV in ideal cases

In this section, we check the behavior of the ENV index in different extreme and ideal cases:

• In the ideal scenario shown above in Figure 4(a), i.e., when V(0) ≠ 0 but we already have V(1) = V(2) = ... = V(K) = 0, the correct decision is k* = 1. Note also that Â = Â^(1). In this case, from (6) or from (7), we get I_ENV = 1, as desired.

⁴ If we suppose a continuous function that, sampled at each step of 1, generated the curve V(k), the result k* = k_e = 1 is optimal for us even if the function decays faster than a linear decrease, but only because we have no access to sampled points between 0 and 1 (i.e., we have no information about the faster decay between 0 and 1); see Figure 4(a). Having more information, i.e., more points between 0 and 1 (hence an increase in k smaller than 1), the optimal result would be k* < 1 (if we have a decay like the dashed line in Figure 4(a)).

• Another ideal scenario is when V(k) is a straight line connecting the points (0, V(0)) and (K, V(K)), as shown in Figure 4(b). In this situation, all the variables contribute in the same way to the decay of V(k) (i.e., each variable has the same influence on the error decrease), so the correct decision is k* = K (all the components are relevant, with the same importance). The area under the curve V(k) in this case is Â^(K) = K V(0)/2. In this case, the equality Â^(K) = A_V also holds. From Eq. (5), we obtain Â = K V(0)/2; therefore, by Eq. (6), the ENV index is I_ENV = (2/V(0)) · K V(0)/2 = K, exactly as expected.

• Another extreme case is when all the components are independent of the output y. In this situation, theoretically, V(k) should be constant (i.e., there is no drop at all), V(k) = V(0) for all k; combined with our assumption V(K) = 0, this yields V(0) = V(1) = ... = V(K) = 0, without losing any generality.
Hence, the correct decision is k* = 0, and the area under the function V(k) is Â^(0) = 0. In this case, Â = 0, so that I_ENV = 0, again as desired.

• More generally, consider that at some integer k* we have V(k*) = 0, and that V(k) is a straight line passing through the points (0, V(0)) and (k*, 0), that is, V(k) = V(0) − (V(0)/k*) k. Then,

I_ENV = 1 + 2 Σ_{k=1}^{k*} [V(0) − (V(0)/k*) k] / V(0)
      = 1 + 2 Σ_{k=1}^{k*} (1 − k/k*)
      = 1 + 2k* − k*(k* + 1)/k*
      = 1 + 2k* − (k* + 1)
      = k*,

exactly as expected and desired. This can also be easily obtained by looking at Figure 4(c): indeed, in this scenario, we can write Â = k* V(0)/2 and the ENV index will be I_ENV = (2/V(0)) · k* V(0)/2 = k*.

Hence, we have an index such that

0 ≤ I_ENV ≤ K.   (8)

Some additional properties and observations are given below and in Appendix A.

4 Interpreting and using the ENV index

In this section, we introduce some properties of the ENV index, whose effects are shown in Section 5.1. An interesting behavior of I_ENV in ideal scenarios is described below and depicted in Figure 5. More generally, we explain how to interpret and use I_ENV.

4.1 First considerations

Property 1. Let us consider a generic non-increasing function V′(k), with k = 0, ..., K, and assume it has a horizontal asymptote, i.e., lim_{k→∞} V′(k) = C. We can always define V(k) = V′(k) − min V′(k), so that V(K) = 0 for each possible K. Note that the ENV index in Eq. (7) depends on K (hence, here, we use the notation I_ENV = I_ENV(K)) but, when K → ∞, we have the stability property,

lim_{K→∞} I_ENV(K) = Ī_ENV,   (9)

i.e., the ENV index converges to a stable fixed value Ī_ENV, which represents a geometric feature of the curve V(k). This is because adding infinitesimal portions of area to A_V virtually does not change the values of A_V itself and of Â; see Figures 3(a) and 3(b).
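The stability property can be checked numerically on the exponential curve used later in Section 5.1; a minimal Python sketch (illustrative; the paper's code is in Matlab), evaluating Eq. (7) for growing K:

```python
import numpy as np

def env_index(V):
    """I_ENV = 1 + 2 * sum_{k=1}^{K-1} V(k)/V(0), Eq. (7) (0 if V(0) = 0)."""
    V = np.asarray(V, dtype=float)
    return 0.0 if V[0] == 0 else 1.0 + 2.0 * np.sum(V[1:-1]) / V[0]

# Stability property: on V'(k) = exp(-0.1 k), shifted so that V(K) = 0,
# the index I_ENV(K) settles to a fixed value as K grows.
for K in (20, 50, 500, 5000):
    k = np.arange(K + 1)
    V = np.exp(-0.1 * k) - np.exp(-0.1 * K)
    print(K, round(env_index(V), 3))   # 13.756, 19.338, 20.016, 20.016
```

The printed values match Table 2 in Section 5.1: I_ENV(K) converges to Ī_ENV ≈ 20.016, an intrinsic geometric property of this curve.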
Generally, this is not the behavior of the elbow detectors [30, 31, 32, 33] in (non-ideal) real scenarios. See the numerical example in Section 5.1.

Property 2. Moreover, another property is that

I_ENV ≥ 1, for V(0) ≠ 0,   (10)

as we can clearly observe from Eq. (7). This is due to the fact that k is a discrete variable with an increasing step of 1, i.e., k = 0, 1, ..., K.⁵

⁵ We have access only to a "sampled" curve V(k) (sampled in the sense of sampling a continuous signal to obtain a discrete signal). We do not know V(x), with x ∈ R, which possibly defines a theoretical area A_V, as shown in Figure 3. If, in a specific problem, we have access to V(x), then I_ENV could also take intermediate values between 0 and 1.

Property 3. V(0) = 0 implies that I_ENV = 0, as also shown in Eq. (6). This is because, if V(0) = 0, we have V(k) = 0 for all k, by assumption. See also Appendix A.

Property 4. I_ENV is invariant to linear scaling of the curve V(k). Namely, a different curve V′(k) = aV(k), with a > 0, has the same value of I_ENV as V(k). Indeed, we have:

I′_ENV = 1 + 2 Σ_{k=1}^{K−1} V′(k)/V′(0) = 1 + 2 Σ_{k=1}^{K−1} aV(k)/(aV(0)) = 1 + 2 Σ_{k=1}^{K−1} V(k)/V(0) = I_ENV.

Choosing the number of components. The derivation of the ENV index is based on extending the UAED derivation, providing a better approximation of the area under the curve V(k). Thus, as an elbow detector, the ENV index can be employed for choosing the number of components in a model selection problem just by defining

k* = ⌊I_ENV⌉,   (11)

where ⌊a⌉ represents the rounding operation, replacing an arbitrary real number a by its nearest integer.

Interpreting I_ENV. Figure 5 shows 4 ideal cases of V(k) with blue solid lines. In all cases, the curve V(k) is composed of two straight lines yielding an elbow at k_e = 14.
A well-designed elbow detector, as in [30, 31], always picks k_e = 14 as the solution, since this is the position of the elbow (hence, this is the correct result for an elbow detector). The ENV index I_ENV is shown with red triangles: it is able to discriminate the 4 different scenarios. The effective number of variables is exactly k_e = 14 in the lowest line and becomes greater and greater as V(k) becomes closer and closer to the red line (which represents the ideal scenario where all components are equally relevant).

4.2 Confidence measures for the decision

The ENV index can be employed to assess the quality of a decision provided by an elbow detector, an information criterion, or another model selection procedure. For instance, as discussed above, the ENV index I_ENV is able to discriminate the 4 different scenarios in Figure 5, and to determine the confidence in the choice of the model with k_e = 14 parameters. In the lowest curve V(k), the choice k_e = 14 is clearly an optimal point, since V(k) already reaches zero at k = 14. However, the confidence in this choice decreases as the blue line becomes closer and closer to the red line. The value I_ENV (depicted with red triangles) becomes more and more distant from the elbow position k_e. The goal is to design some mathematical tools/measures for evaluating the robustness and degree of certainty of the decision (i.e., the model choice). Before introducing some confidence measures that rely on I_ENV, we provide a definition of variable importance based on the same geometric considerations behind the ENV index.

Variable importance. The underlying idea behind the UAED and the ENV index is to associate to each variable/component a measure of importance, defined as a value proportional to its contribution to the decay of V(k), i.e.,

w_k = V(k − 1) − V(k),  k = 1, ..., K.

Hence, w_k represents the importance of the k-th component.
We can normalize these weights, arriving at the following result:

w̄_k = w_k / Σ_{i=1}^{K} w_i = [V(k − 1) − V(k)] / V(0),

where we have used that

Σ_{i=1}^{K} w_i = V(0) − V(K) = V(0),

since V(K) = 0 by assumption. In the case of equally important variables, given when V(k) is a unique straight line as in Figure 4(b) or the red line in Figure 5, we have V(k) = V(0) − (V(0)/K) k, so

w_k = V(k − 1) − V(k) = −(V(0)/K)(k − 1) + (V(0)/K) k = V(0)/K,

which is the same value for all k (as expected) and, as a consequence, w̄_k = 1/K for all k. This definition of variable importance has already been applied with success in [45].

Cumulative importance (CI). We can compute the importance accumulated by the first k (ranked) components,

CI(k) = Σ_{i=1}^{k} w̄_i = (Σ_{i=1}^{k} w_i)/V(0) = [V(0) − V(k)]/V(0) = 1 − V(k)/V(0),

for k = 0, ..., K. Note that 0 ≤ CI(k) ≤ 1, CI(0) = 0, and CI(K) = 1. If the zero value is reached at some k* ≤ K, i.e., V(k*) = 0, we have CI(k*) = 1. This cumulative sum of normalized weights can play the same role as the cumulative sum of normalized weights in SIC (i.e., as a reliability/safety measure), although the computation of the weights follows a completely different procedure [27]. In the case of equally important variables, we have CI(k) = k/K. The CI can be considered an accuracy measure: the closer to 1, the safer and more reliable the decision. Hence, as a consequence, 1 − CI(k) is an uncertainty measure (the amount of importance lost by choosing a model with only the first k components), and we can also define the cumulative uncertainty as

CU(k) = 1 − CI(k) = V(k)/V(0).   (12)

Note that CU resembles the definition of the so-called Sobol indices [50, 51]. We can rewrite the ENV index as a sum of cumulative uncertainties,

I_ENV = 1 + 2 Σ_{k=1}^{K−1} CU(k), for V(0) ≠ 0.
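The weights w̄_k and the cumulative measures CI(k) and CU(k) take only a few lines to compute; a minimal Python sketch (illustrative; the paper's code is in Matlab), assuming V(K) = 0 and V(0) ≠ 0:

```python
import numpy as np

def importance_measures(V):
    """Normalized importances w_bar_k = (V(k-1) - V(k))/V(0), k = 1..K,
    cumulative importance CI(k) = 1 - V(k)/V(0), and cumulative
    uncertainty CU(k) = V(k)/V(0), Eq. (12). Assumes V(K) = 0, V(0) != 0."""
    V = np.asarray(V, dtype=float)
    w_bar = -np.diff(V) / V[0]   # k = 1, ..., K; sums to 1
    CI = 1.0 - V / V[0]          # cumulative importance
    CU = V / V[0]                # cumulative uncertainty, Eq. (12)
    return w_bar, CI, CU

# Equally important variables (straight-line V): w_bar_k = 1/K, CI(k) = k/K.
K = 6
V = 4.0 * (1.0 - np.arange(K + 1) / K)
w_bar, CI, CU = importance_measures(V)
```

On this straight-line curve, the identity I_ENV = 1 + 2 Σ_{k=1}^{K−1} CU(k) of Eq. (13) recovers I_ENV = K, consistent with the ideal case of Figure 4(b).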
(13) In Figure 5, from the b ottom to the upp er curv e, the strengths of the 4 elb ows at k e = 14 are CI( k e ) = 1, 0.87, 0.67, and 0.50, resp ectiv ely . Reliabilit y of the decision. Other indicators of the reliabilit y of the decision can b e designed. Indeed, the v alue I ENV pro vides the effe ctive numb er of v ariables/comp onents in our mo del selection problem. Clearly , the decision to use k < I ENV v ariables is less safe (in terms of loss of information and p erformance) instead of using a mo del with k > I ENV . In this sense, a suitable indicator for the reliabilit y of the decision could b e defined as R D = min  1 , k e I ENV  . (14) 13 Namely , any elb ow p osition k e suc h that k e − I ENV > 0 can b e considered in a safe region (since w e are using more v ariables than the effectiv e num b er of v ariables), i.e., according to the ENV index, more parsimonious mo dels could b e c hosen. An alternativ e to these confidence measures has b een introduced in [27]; see App endix B for some details. 0 1 02 03 04 05 0 k 0 5 10 15 V(k) V(k) Elbow - UAED ENV index Figure 5: Ideal cases of four curves V ( k ) (blue solid lines) where an “elb o w” is well-defined at k e = 14. This is exactly the result that w e obtain with the application of an elbow detector [30, 31, 32, 33] (green circles), whereas the results provided by the ENV index are sho wn by red triangles. Note that, in all cases, I ENV > k e . 5 Numerical exp erimen ts In this section, w e test the ENV index in different frameworks, including exp eriments with artificial data (the first one) and t w o real datasets (the last tw o exp erimen ts). W e will see that I ENV pro vides v ery goo d p erformance and presen ts more robustness with resp ect to other alternatives in the literature and w e sho w the b eha vior of the confidence measures prop osed ab ov e. 
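For concreteness, the confidence measures of Section 4.2 (w̄_k, CI, CU, the identity (13) for I_ENV, and R_D from (14)) can be computed directly from any decreasing curve V(k) with V(K) = 0. The sketch below is in Python rather than the authors' Matlab, and the exponential curve used as input is purely illustrative:

```python
import numpy as np

def env_measures(V, k_e):
    """Confidence measures from a curve V(0), ..., V(K) with V(K) = 0.

    Returns the normalized importances w_bar, the cumulative importance CI,
    the cumulative uncertainty CU, the ENV index via the identity (13),
    and the reliability of a decision k_e via Eq. (14).
    """
    V = np.asarray(V, dtype=float)
    K = len(V) - 1                        # components are k = 1, ..., K
    w = V[:-1] - V[1:]                    # w_k = V(k-1) - V(k)
    w_bar = w / V[0]                      # sum_k w_k = V(0) - V(K) = V(0)
    CU = V / V[0]                         # CU(k) = V(k)/V(0),  Eq. (12)
    CI = 1.0 - CU                         # CI(k) = 1 - V(k)/V(0)
    I_env = 1.0 + 2.0 * np.sum(CU[1:K])   # Eq. (13)
    R_D = min(1.0, k_e / I_env)           # Eq. (14)
    return w_bar, CI, CU, I_env, R_D

# Illustrative decreasing curve (a hypothetical choice, shifted so V(K) = 0).
K = 20
k = np.arange(K + 1)
V = np.exp(-0.1 * k) - np.exp(-0.1 * K)
w_bar, CI, CU, I_env, R_D = env_measures(V, k_e=8)
```

As a sanity check, with equally important variables (V(k) linear and reaching zero at K), every w̄_k equals 1/K and I_ENV = K, consistent with Section 4.2.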
(Footnote: The Matlab code and datasets of the experiments are available at http://www.lucamartino.altervista.org/PUBLIC_ENV_CODE.zip.)

5.1 Synthetic experiment where V(k) is an analytic function

The goal of this example is to show the behavior of the elbow detectors [30, 31, 32, 33] and of the ENV index in a synthetic framework that is not an ideal scenario (unlike Figure 5, where an elbow is well-defined). We consider the function

V'(k) = e^{-0.1k},   k = 0, 1, 2, ..., K.

We consider different values of K. For each value K ∈ {20, 50, 500, 5000}, we define

V(k) = V'(k) - min V'(k) = e^{-0.1k} - e^{-0.1K},

so that V(K) = 0. We apply the elbow detectors [30, 31, 32, 33] and the ENV index, obtaining the results shown in Table 2.

Table 2: Results in the synthetic experiment.

Methods            K = 20    K = 50    K = 500   K = 5000
Elbow detectors    8         16        39        62
ENV index          13.756    19.338    20.016    20.016
CI                 0.58      0.78      0.98      0.99
R_D                0.58      0.83      1         1

Hence, the ENV index converges to the value Ī_ENV = 20.016. This stable value of the ENV index is an intrinsic geometric property of the analyzed curve V(k). Moreover, we can observe that both confidence indices are considerably smaller than 1 when the elbow position is smaller than Ī_ENV = 20.016, and both measures approach 1 as the elbow position grows, as expected. We can also see that the reliability-of-decision index R_D is more optimistic than the cumulative importance (CI).

5.2 Variable selection in a regression problem with real data

In a regression problem, we observe a dataset of N pairs {x_n, y_n}_{n=1}^{N}, where each input vector x_n = [x_{n,1}, ..., x_{n,K}] is formed by K variables, and the outputs y_n are scalar values [35]. We consider the case K ≤ N and assume a linear observation model,

y_n = θ_0 + θ_1 x_{n,1} + θ_2 x_{n,2} + ... + θ_K x_{n,K} + ε_n,   (15)

where ε_n is Gaussian noise with zero mean and variance σ_ε², i.e., ε_n ~ N(ε | 0, σ_ε²). More specifically, in this real dataset [8, 52, 53], there are K = 122 features and N = 1214 data points x_i. The dataset presents two outputs, "arousal" and "valence". In this section, we focus on the first output in the dataset ("arousal"). In this experiment, we can set V(k) = -2 log(ℓ_max), with ℓ_max = max_θ p(y | θ_k) and k ≤ K, after ranking the 122 variables (see [8]), where the likelihood function p(y | θ_k) is induced by Eq. (15). Thus, here we can compare with other information criteria given in the literature, shown in Table 1, and with a standard method based on the computation of p-values [54, 55]. The ENV index returns a value of 12.74. The results provided by each method are given in Table 3: the first line lists the methods, the second line gives the number of suggested variables, and the last line contains the corresponding references. In [8, Section 4-C], the results of that exhaustive analysis suggest that there are 7 very relevant variables (level 1 of [8, Section 4-C]), another 7 relevant variables (level 2), and another 2 variables at a level 3 of importance, hence 16 variables in total among the very relevant, relevant, and important ones. Note that the numbers suggested by SIC-95, SIC-99, BIC, the UAED, and the ENV index are in line with this exhaustive analysis. (Footnote: The numbers 95 and 99 are associated with the sum of some normalized weights (0.95 and 0.99) built by SIC, which delivers a degree of confidence in the decision, as a confidence measure [27]: the closer to 100, the greater the degree of assurance of the decision; see also Appendix B.)

Table 3: Results in the variable selection example, in a regression problem with real data.

Scheme   p-value   AIC    BIC    HQIC   UAED   SIC-95   SIC-99   ENV
k_e      71        44     17     41     11     7        17       13
CI       0.98      0.97   0.92   0.96   0.88   0.84     0.92     0.90
R_D      1         1      1      1      0.86   0.55     1        1
Ref.     [55]      [21]   [22]   [49]   [30]   [27]     [27]     —

ENV seems to provide a good compromise between SIC-95 and, on the opposite side, SIC-99 and BIC. The result provided by ENV is quite close to that of the UAED, which gives k_e = 11 with a high degree of reliability, R_D = 0.86. Note that, in this experiment, the confidence measures provided by SIC show that SIC is more optimistic, suggesting a more parsimonious model: for instance, the choice k_e = 7 is fostered by SIC with 95% reliability, whereas CI = 84% and R_D = 55%. Specifically, the R_D index warns that with k_e = 7 we are missing 45% of the effective number of variables (the ENV index is 12.74).

5.3 Variable selection in a biomedical classification problem with real data

In [56], a feature selection analysis was performed in order to find the most important variables for predicting patients at risk of developing nonalcoholic fatty liver disease among 35 possible features. The authors collected data from 1525 patients who attended the Cardiovascular Risk Unit of Mostoles University Hospital (Madrid, Spain) from 2005 to 2021, and used a random forest (RF) technique as a classifier to yield a ranking of the components of the input vectors (of dimension 35). They found that 4 features were the most relevant according to the ranking and to the medical experts' opinions: (a) insulin resistance, (b) ferritin, (c) serum levels of insulin, and (d) triglycerides. Here, we consider

V(k) = 1 - accuracy-in-class(k),

as a figure of merit (where we set V(0) = 0.5, which represents a completely random binary classification). The curve is obtained after ranking the 35 features [56]-[27, Section 5.5].
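As a sketch of this figure of merit, the snippet below builds V(k) from a hypothetical in-class accuracy curve (made-up numbers, not the real curve from [56]), sets V(0) = 0.5, shifts the curve so that V(K) = 0 (the same convention used in the synthetic experiment, which is an assumption here), and evaluates the ENV index via the identity (13); rounding up mirrors how the value 3.5851 is read as a suggestion of k* = 4 in the text.

```python
import math
import numpy as np

# Hypothetical in-class accuracies after ranking K features
# (illustrative numbers only, not the real curve from [56]).
acc = np.array([0.80, 0.87, 0.90, 0.915, 0.918, 0.919, 0.920, 0.920])
K = len(acc)

# Figure of merit: V(0) = 0.5 (completely random binary classifier),
# V(k) = 1 - accuracy-in-class(k) for k = 1, ..., K.
V = np.concatenate(([0.5], 1.0 - acc))
V = V - V[-1]               # assumed shift so that V(K) = 0

# ENV index via the identity (13), then a suggested integer number of features.
I_env = 1.0 + 2.0 * np.sum(V[1:K] / V[0])
k_star = math.ceil(I_env)   # e.g., an index of 3.5851 would suggest 4 features
```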
Note that with this choice of V(k) we cannot apply BIC, AIC, and other standard information criteria, as shown in Table 1. However, the UAED [30], SIC [27], and ENV can be applied. The results are given in Table 4. ENV gives a value of 3.5851 and thus suggests k* = 4, as the experts in [56]; the UAED and SIC also provide results in line with this solution. In this experiment, the values of the confidence measures provided by SIC and the measures proposed here are very close: for the decision k_e = 3, SIC provides a confidence measure of 0.95, and CI = 0.91 and R_D = 0.84 are in the same line (high values close to 0.95). The same observation can be made for k_e = 2, 4, and 9 (except for the R_D value of 0.56 for k_e = 2, which alerts of a less confident decision, missing almost 50% of the effective number of variables).

Table 4: Results in the variable selection example, in a classification problem with real data.

Scheme   UAED   SIC-90   SIC-95   SIC-99   ENV
k_e      3      2        3        9        4
CI       0.91   0.88     0.91     0.98     0.92
R_D      0.84   0.56     0.84     1        1

6 Conclusions

An index of an effective number of variables (ENV) has been proposed, inspired by the concept of the maximum AUC in ROC curves and by the Gini index. The introduced ENV index removes a dependence found in the elbow detectors designed in the literature, namely the dependence on the maximum number of components K analyzed. We also introduce different measures of the uncertainty and reliability of the proposed solution (related to the ENV index). These novel confidence measures can also be employed jointly with different information criteria, such as the well-known AIC and BIC (where they can be applied, as we have shown in Section 5.2). More generally, it is important to remark that the results obtained by the ENV index can be associated not only with an information criterion, but also with any other model selection procedure.
Indeed, for computing CI we just need knowledge of the curve V(k) and, for R_D, we need the ENV index I_ENV and a decision k_e, which can be obtained by any model selection scheme. Several comparisons with classical and recent schemes are provided in different experiments involving real datasets: we analyze two variable selection problems, one of them regarding a regression analysis and the other one involving a biomedical classification study. Related Matlab code is provided.

Acknowledgment

The work was partially supported by the Young Researchers R&D Project, ref. num. F861 (AUTO-BA-GRAPH), funded by the Community of Madrid and Rey Juan Carlos University; by Agencia Estatal de Investigación (AEI) with project SP-GRAPH, ref. num. PID2019-105032GB-I00; by project POLI-GRAPH, Grant PID2022-136887NB-I00, funded by MCIN/AEI/10.13039/501100011033; and by Programa de Excelencia-Convenio Plurianual entre Comunidad de Madrid y la Universidad Rey Juan Carlos (ref. num. Y158/DF007003/30-06-2020, ref. num. F840).

References

[1] K. Aho, D. Derryberry, and T. Peterson, "Model selection for ecologists: the worldviews of AIC and BIC," Ecology, vol. 95, no. 3, pp. 631–636, 2014.
[2] A. Gupta and S. Das, "On efficient model selection for sparse hard and fuzzy center-based clustering algorithms," Information Sciences, vol. 590, pp. 29–44, 2022.
[3] N. L. Hjort and G. Claeskens, "Frequentist model average estimators," Journal of the American Statistical Association, vol. 98, no. 464, pp. 879–899, 2003.
[4] P. Stoica, X. Shang, and Y. Cheng, "The Monte-Carlo sampling approach to model selection: A primer [lecture notes]," IEEE Signal Processing Magazine, vol. 39, no. 5, pp. 85–92, 2022.
[5] C. Cobos, H. Muñoz-Collazos, R. Urbano-Muñoz, M. Mendoza, E. León, and E. Herrera-Viedma, "Clustering of web search results based on the cuckoo search algorithm and balanced Bayesian information criterion," Information Sciences, vol. 281, pp. 248–264, 2014.
[6] I. Gkioulekas and L. G. Papageorgiou, "Piecewise regression analysis through information criteria using mathematical programming," Expert Systems with Applications, vol. 121, pp. 362–372, 2019.
[7] P. Mukherjee, D. Parkinson, and A. R. Liddle, "A nested sampling algorithm for cosmological model selection," The Astrophysical Journal Letters, vol. 638, no. 2, p. L51, 2006.
[8] R. San Millán-Castillo, L. Martino, E. Morgado, and F. Llorente, "An exhaustive variable selection study for linear models of soundscape emotions: Rankings and Gibbs analysis," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2460–2474, 2022.
[9] Z. Zhu and S. Kay, "On Bayesian exponentially embedded family for model order selection," IEEE Transactions on Signal Processing, vol. 66, no. 4, pp. 933–943, 2017.
[10] S. Beheshti and S. Sedghizadeh, "Number of source signal estimation by the mean squared eigenvalue error," IEEE Transactions on Signal Processing, vol. 66, no. 21, pp. 5694–5704, 2018.
[11] M. Jansen, "Information criteria for structured parameter selection in high-dimensional tree and graph models," Digital Signal Processing, vol. 148, p. 104437, 2024.
[12] C. M. Bishop, "Pattern recognition," Machine Learning, vol. 128, pp. 1–58, 2006.
[13] E. Fong and C. Holmes, "On the marginal likelihood and cross-validation," Biometrika, vol. 107, no. 2, pp. 489–496, 2020.
[14] A. Vehtari, A. Gelman, and J. Gabry, "Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC," Statistics and Computing, vol. 27, no. 5, pp. 1413–1432, 2017.
[15] P. Stoica and Y. Selén, "Cross-validation rules for order estimation," Digital Signal Processing, vol. 14, pp. 355–371, 2004.
[16] T. Ando, "Predictive Bayesian model selection," American Journal of Mathematical and Management Sciences, vol. 31, no. 1-2, pp. 13–38, 2011.
[17] S. Konishi and G. Kitagawa, Information Criteria and Statistical Modeling. Springer Science & Business Media, 2008.
[18] A. Van der Linde, "DIC in variable selection," Statistica Neerlandica, vol. 59, no. 1, pp. 45–56, 2005.
[19] C. P. Robert and G. Casella, Monte Carlo Statistical Methods. Springer, 2004.
[20] F. Llorente, L. Martino, D. Delgado, and J. Lopez-Santiago, "Marginal likelihood computation for model selection and hypothesis testing: an extensive review," SIAM Review (SIREV), vol. 65, no. 1, pp. 3–58, 2023.
[21] D. Spiegelhalter, N. G. Best, B. P. Carlin, and A. V. der Linde, "Bayesian measures of model complexity and fit," J. R. Stat. Soc. B, vol. 64, pp. 583–616, 2002.
[22] G. Schwarz et al., "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[23] D. P. Foster and E. I. George, "The risk inflation criterion for multiple regression," The Annals of Statistics, vol. 22, no. 4, pp. 1947–1975, 1994.
[24] C. L. Mallows, "Some comments on Cp," Technometrics, vol. 15, no. 4, pp. 661–675, 1973.
[25] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465–471, 1978.
[26] A. Mariani, A. Giorgetti, and M. Chiani, "Model order selection based on information theoretic criteria: Design of the penalty," IEEE Transactions on Signal Processing, vol. 63, no. 11, pp. 2779–2789, 2015.
[27] L. Martino, R. S. Millan-Castillo, and E. Morgado, "Spectral information criterion for automatic elbow detection," Expert Systems with Applications, vol. 231, p. 120705, 2023.
[28] J. J. Dziak, D. L. Coffman, S. T. Lanza, R. Li, and L. S. Jermiin, "Sensitivity and specificity of information criteria," Briefings in Bioinformatics, vol. 21, no. 2, pp. 553–565, 2020.
[29] F. Llorente, L. Martino, E. Curbelo, J. Lopez-Santiago, and D. Delgado, "On the safe use of prior densities for Bayesian model selection," WIREs Computational Statistics, p. e1595, 2022.
[30] E. Morgado, L. Martino, and R. S. Millan-Castillo, "Universal and automatic elbow detection for learning the effective number of components in model selection problems," Digital Signal Processing, vol. 140, p. 104103, 2023.
[31] A. J. Onumanyi, D. N. Molokomme, S. J. Isaac, and A. M. Abu-Mahfouz, "AutoElbow: An automatic elbow detection method for estimating the number of clusters in a dataset," Applied Sciences, vol. 12, no. 15, 2022.
[32] J. Zhang, P. Fu, F. Meng, X. Yang, J. Xu, and Y. Cui, "Estimation algorithm for chlorophyll-a concentrations in water from hyperspectral images based on feature derivation and ensemble learning," Ecological Informatics, vol. 71, p. 101783, 2022.
[33] D. Kaplan, "Knee point," 2024, MATLAB Central File Exchange. [Online]. Available: https://www.mathworks.com/matlabcentral/fileexchange/35094-knee-point
[34] R. L. Thorndike, "Who belongs in the family?" Psychometrika, vol. 3, pp. 267–276, 1953.
[35] G. Heinze, C. Wallisch, and D. Dunkler, "Variable selection - a review and recommendations for the practicing statistician," Biometrical Journal, vol. 60, no. 3, pp. 431–449, 2018.
[36] J. A. Hanley and B. J. McNeil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, no. 1, pp. 29–36, 1982.
[37] M. O. Lorenz, "Methods of measuring the concentration of wealth," Publications of the American Statistical Association, vol. 9, no. 70, pp. 209–219, 1905.
[38] L. Ceriani and P. Verme, "The origins of the Gini index: extracts from Variabilità e Mutabilità (1912) by Corrado Gini," The Journal of Economic Inequality, vol. 10, no. 3, pp. 421–443, 2012.
[39] S. Yitzhaki and E. Schechtman, More Than a Dozen Alternative Ways of Spelling Gini. Springer New York, 2013, pp. 11–31.
[40] S. Inoua, "Beware the Gini index! A new inequality measure," preprint arXiv:2110.01741, pp. 1–26, 2021.
[41] L. Martino, V. Elvira, and F. Louzada, "Effective sample size for importance sampling based on discrepancy measures," Signal Processing, vol. 131, pp. 386–401, 2017.
[42] V. Elvira, L. Martino, and C. P. Robert, "Rethinking the Effective Sample Size," International Statistical Review, vol. 90, no. 3, pp. 525–550, 2022.
[43] I. Verdinelli and L. Wasserman, "Feature importance: A closer look at Shapley values and LOCO," preprint arXiv:2303.05981, 2023.
[44] M. K. A. Khan, O. Saarela, and R. Kustra, "A generalized variable importance metric and estimator for black box machine learning models," preprint arXiv:2212.09931, 2023.
[45] J. Vicent Servera, L. Martino, J. Verrelst, J. P. Rivera-Caicedo, and G. Camps-Valls, "Multioutput feature selection for emulation and sensitivity analysis," IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–11, 2024.
[46] D. Watson, J. O'Hara, N. Tax, R. Mudd, and I. Guy, "Explaining predictive uncertainty with information theoretic Shapley values," in Advances in Neural Information Processing Systems, A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36, 2023, pp. 7330–7350.
[47] K. Aas, M. Jullum, and A. Loland, "Explaining individual predictions when features are dependent: More accurate approximations to Shapley values," Artificial Intelligence, vol. 298, p. 103502, 2021.
[48] Wikipedia, "Bayesian information criterion," 2024, see the section 'Gaussian special case'. [Online]. Available: https://en.wikipedia.org/wiki/Bayesian_information_criterion
[49] E. J. Hannan and B. G. Quinn, "The determination of the order of an autoregression," Journal of the Royal Statistical Society, Series B (Methodological), vol. 41, no. 2, pp. 190–195, 1979.
[50] I. M. Sobol, "Sensitivity estimates for nonlinear mathematical models," Mathematical Modelling and Computational Experiments, vol. 4, pp. 407–414, 1993.
[51] T. Klein, A. Lagnoux, P. Rochet, and T. M. N. Nguyen, "Efficient influence functions for Sobol' indices under two designs of experiments," preprint arXiv:2407.15468, 2024.
[52] R. San Millán-Castillo, L. Martino, and E. Morgado, "A variable selection analysis for soundscape emotion modelling using decision tree regression and modern information criteria," IEEE Access, 2024.
[53] J. Fan, M. Thorogood, and P. Pasquier, "Emo-soundscapes: A dataset for soundscape emotion recognition," in 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), 2017, pp. 196–201.
[54] M. Efroymson, "Multiple regression analysis," Mathematical Methods for Digital Computers, pp. 191–203, 1960.
[55] R. R. Hocking, "The analysis and selection of variables in linear regression," Biometrics, pp. 1–49, 1976.
[56] R. García-Carretero, R. Holgado-Cuadrado, and O. Barquero-Pérez, "Assessment of classification models and relevant features on nonalcoholic steatohepatitis using random forest," Entropy, vol. 23, no. 6, 2021.

A Further considerations

In this appendix, we describe different ideal cases in which V(k) reaches zero at a specific k.

Scenario V(0) ≠ 0 and V(1) = 0: If we have a zero at k = 1, i.e., V(1) = 0, by assumption we have V(1) = V(2) = ... = V(K) = 0, as shown in Figure 4(a). In this case, computing Eq. (6) or (7), we obtain I_ENV = 1.

Scenario V(0) ≠ 0, V(1) ≠ 0 and V(2) = 0: By assumption we have V(2) = V(3) = ... = V(K) = 0. Generally, we will have 1 < I_ENV ≤ 2.
The value of I_ENV depends on the type of curve V(k). The equality I_ENV = 2 holds if the decay is linear, i.e., V(k) is a straight line for 0 ≤ k ≤ 2 (see the concept of variable importance in Section 4.2).

Scenario V(0) ≠ 0, V(1) ≠ 0, V(2) ≠ 0, ..., V(k* - 1) ≠ 0, until V(k*) = 0: By assumption we have V(k*) = V(k* + 1) = ... = V(K) = 0, as depicted in Figure 4(c). Generally, we will have 1 < I_ENV ≤ k*. Again, the value of I_ENV depends on the type of curve V(k): the equality I_ENV = k* holds if the decay is linear, i.e., V(k) is a straight line for 0 ≤ k ≤ k* (see the concept of variable importance in Section 4.2).

Recalling also the properties in Section 4.1 and due to the discrete nature of k, we cannot have values 0 < I_ENV < 1, but we can have I_ENV = 0, I_ENV = 1, and any value in the intervals j - 1 ≤ I_ENV ≤ j for all j = 2, ..., K. In summary, the values that the ENV index can take are

I_ENV ∈ {0} ∪ [1, K].

This makes sense since, if there is even a small decrease between k = 0 and k = 1 in V(k), then at least one component deserves to be considered in the model.

B Confidence measure by SIC

A first attempt at providing a confidence measure was given by the recently proposed spectral information criterion (SIC) [27], which returns: (a) a suggestion regarding the position of the elbow k_e (as the other information criteria do); and (b) a degree of certainty in the decision (as a confidence measure). SIC associates its decision with a number that is a sum of some normalized weights (e.g., often 0.95 and 0.99): the closer to 1 (100%), the safer the decision, meaning that we are confident in discarding the K - k_e variables/components in the construction of the model.
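The linear-decay properties of the scenarios above can be checked numerically. The sketch below uses the identity (13) from Section 4.2 in place of Eqs. (6)-(7) (an equivalent rewriting under the assumption V(K) = 0): a curve with V(1) = 0 gives I_ENV = 1, and a linear decay reaching zero at k* gives exactly I_ENV = k*.

```python
import numpy as np

def env_index(V):
    """ENV index via the identity I_ENV = 1 + 2 * sum_{k=1}^{K-1} V(k)/V(0),
    assuming V is given at k = 0, ..., K with V(K) = 0 and V(0) != 0."""
    V = np.asarray(V, dtype=float)
    return 1.0 + 2.0 * np.sum(V[1:-1]) / V[0]

K = 10

# Scenario V(1) = 0: only the first component matters, so I_ENV = 1.
V1 = np.zeros(K + 1)
V1[0] = 3.0
print(env_index(V1))   # 1.0

# Linear decay reaching zero at k* = 4 (equally important variables): I_ENV = k*.
k_star = 4
V2 = np.maximum(0.0, 1.0 - np.arange(K + 1) / k_star)
print(env_index(V2))   # 4.0
```

Note that the result for the linear case does not depend on K, illustrating the removal of the K-dependence mentioned in the conclusions.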
