An Exponential-Polynomial Divergence-based Robust Information Criterion for Linear Panel Data Models and Neural Networks


Authors: Udita Goswami, Shuvashree Mondal

Udita Goswami^a, Shuvashree Mondal*^a

^a Department of Mathematics and Computing, IIT (Indian School of Mines) Dhanbad, 826004, Jharkhand, India

Abstract

Model selection is a cornerstone of statistical inference, where information criteria are widely employed to balance model fit and complexity. However, classical likelihood-based criteria are often highly sensitive to contamination, outliers, and model misspecification. In this paper, we develop a robust alternative based on the Exponential-Polynomial Divergence, a flexible extension of existing divergence measures that enhances adaptability to diverse data irregularities. The proposed Exponential-Polynomial Divergence Information Criterion preserves the objective of approximating the discrepancy between the true model and candidate models while incorporating robustness against anomalous observations. Its theoretical properties are established, and robustness is examined through influence function analysis, demonstrating controlled sensitivity to extreme data points. For practical implementation, a data-driven tuning parameter selection strategy based on generalized score matching is employed, ensuring improved computational stability and efficiency. The effectiveness of the proposed method is demonstrated through extensive simulation studies under varying contamination levels, as well as real data applications involving linear mixed-effects panel data models and neural network-based prediction tasks. The results consistently show improved stability and reliability compared to classical likelihood and density power divergence-based information criteria. The proposed framework thus provides a practical and unified approach for model selection in complex and contaminated data settings.
Keywords: Exponential-polynomial divergence, robust information criterion, model selection, influence function, score matching, panel data models, neural networks

1. Introduction

Information criteria form a fundamental class of tools for model selection, balancing goodness-of-fit and model complexity within an information-theoretic framework. Their construction is rooted in likelihood-based inference, where the objective is to maximize the log-likelihood while accounting for model complexity, leading to an approximation of the expected Kullback–Leibler divergence between the true model and a candidate model. The Akaike Information Criterion (AIC), proposed by Hirotugu Akaike, provides an approximately unbiased estimator of this divergence under correct model specification, whereas the Takeuchi Information Criterion (TIC) extends this idea to settings where the model may be misspecified [1, 2]. Recent work further shows that, for non-normalized models where the likelihood is intractable, information criteria can still be formulated as approximately unbiased estimators of suitable discrepancy measures, thereby enabling principled model selection beyond the classical likelihood framework [3].

Despite their widespread applicability and theoretical appeal, classical information criteria are often unsuitable in the presence of contamination, outliers, or model misspecification, as they rely on likelihood-based objective functions that are inherently non-robust. Even a small fraction of aberrant observations can significantly distort parameter estimates and, consequently, the model selection outcome. As a result, criteria such as AIC may lead to unreliable conclusions when the assumed model deviates from reality.
This limitation is particularly relevant in practice, where data from socio-economic systems, industrial processes, healthcare, and environmental studies are frequently affected by noise, anomalies, and departures from ideal assumptions, highlighting the need for more robust alternatives.

In an effort to address the lack of robustness in likelihood-based model selection, [4] introduced a divergence-based information criterion (DIC) based on a generalized measure of statistical discrepancy, closely linked to the Density Power Divergence (DPD) framework of [5]. The DPD and its subsequent developments [6, 7, 8] provide a robust alternative to likelihood by downweighting the influence of atypical observations. While this marks a clear departure from classical likelihood-based approaches, the practical utility of such divergence-based criteria remains underexplored. Existing studies on DIC, including improvements and applications such as [9], are largely confined to controlled simulations, with limited validation on real-world datasets. This reveals a gap in understanding their empirical performance and highlights the need for more practically viable and robust model selection frameworks.

Motivated by the growing emphasis on robust statistical methodologies, we propose a novel information criterion grounded in the Exponential-Polynomial Divergence (EPD), a flexible divergence family introduced by [10]. The EPD extends the Density Power Divergence by incorporating an exponential-polynomial structure, allowing greater flexibility in handling diverse contamination patterns and improving robustness without significant loss of efficiency, as also demonstrated in recent applications [11].
Building upon this framework, we propose the Exponential-Polynomial Divergence Information Criterion (EPDIC), which leverages the robustness of EPD to enable reliable model selection under contamination and model misspecification, providing a practical alternative to existing likelihood- and DPD-based criteria.

A key contribution of this work is the assessment of the robustness of the proposed EPDIC using the influence function, a standard tool for measuring sensitivity to contamination. By analyzing its influence function, we characterize the local robustness of EPDIC and its resistance to outliers. In particular, the boundedness of the influence function demonstrates that the proposed criterion effectively controls the impact of extreme observations, thereby inheriting strong robustness properties from the underlying divergence structure and ensuring stability under contamination.

The choice of optimal tuning parameters is crucial in divergence-based methods, as it governs the trade-off between robustness and statistical efficiency. Classical approaches, such as the Warwick and Jones criterion [12] and its iterative extension [13], are widely used but often suffer from computational burden and instability, particularly in complex or high-dimensional settings. Recent studies [14] indicate that score-based approaches can provide improved and more reliable performance. Motivated by this, we adopt the generalized score matching (GSM) framework of [15], which builds on the original method of [16]. By avoiding the need for normalization constants and relying on derivatives of log-densities, this approach offers a computationally efficient and stable mechanism for obtaining data-adaptive tuning parameter estimates.
To substantiate the practical relevance of the proposed methodology, we conduct an extensive empirical investigation encompassing both controlled simulations and real-world data applications, which form the principal crux of this study. The simulation framework is carefully designed to assess the finite-sample performance of the proposed exponential-polynomial divergence information criterion across varying levels of contamination, thereby providing a comprehensive understanding of its robustness. Beyond simulations, we examine the efficacy of EPDIC in the context of linear mixed-effects panel data models, where the underlying objective function is constructed via the density power divergence measure to accommodate unobserved heterogeneity and potential contamination. In parallel, we explore its applicability in modern machine learning settings, particularly in neural network-based prediction tasks, where the conventional loss functions are replaced or augmented by divergence-based counterparts to enhance robustness against noisy and corrupted inputs. These applications are novel in the sense that the performance of robust information criteria has seldom been systematically evaluated across such diverse real-world scenarios.

In both simulation and empirical analyses, we provide a thorough comparative assessment of EPDIC against classical likelihood-based criteria and existing divergence-based alternatives, including those derived from the density power divergence, across multiple contamination regimes. The results consistently demonstrate the superior stability and reliability of the proposed criterion in challenging data environments.

The remainder of this paper is organized as follows. In Section 2, we rigorously introduce the Exponential-Polynomial Divergence estimator. Section 3 presents the formulation of the EPDIC along with its theoretical properties.
In Section 4, we investigate the robustness characteristics of the proposed criterion through an influence function analysis. Section 5 is devoted to the selection of optimal tuning parameters using the GSM approach. Section 6 provides an extensive empirical evaluation, including both simulation studies and real data applications. Finally, Section 7 concludes the paper with a summary of findings and potential directions for future research.

2. Exponential-Polynomial Divergence Estimator

The Exponential-Polynomial Divergence (EPD), proposed by [10], represents a unified and generalized class of Bregman divergences, capable of encompassing several well-known divergence measures such as the Density Power Divergence (DPD), Bregman Exponential Divergence (BED), and Kullback-Leibler (KL) divergence as special cases. Under the assumptions originally stated by [17], the Bregman divergence between two density functions g and f is defined as

\[
D_B(g, f) = \int \Big[ B\big(g(x)\big) - B\big(f(x)\big) - \big\{ g(x) - f(x) \big\}\, B'\big(f(x)\big) \Big]\, dx, \tag{1}
\]

where B(·) is a strictly convex function and B'(·) denotes its derivative with respect to its argument. The choice of the convex generating function B determines the specific form of the divergence; different selections of B lead to distinct members within the Bregman divergence family (see, e.g., [18], [19]).

To generalize the existing divergence families, [10] introduced a convex generating function of the form

\[
B(x) = \frac{\beta}{\alpha^2}\big(e^{\alpha x} - 1 - \alpha x\big) + \frac{1-\beta}{\gamma}\big(x^{\gamma+1} - x\big), \qquad \alpha \in \mathbb{R},\ \beta \in [0,1],\ \gamma \ge 0, \tag{2}
\]

where α, β, and γ are tuning parameters that control the contributions of the exponential and polynomial components. The EPD connects several well-known divergences through particular parameter choices:

• When β = 0, the EPD reduces to the Density Power Divergence (DPD) of [5] with parameter γ.
• When β = 1, it coincides with the Bregman Exponential Divergence (BED) of [20] with parameter α.

• For intermediate values 0 < β < 1, it provides a convex combination of BED and DPD.

• Further, when β = 0 and γ → 0, the divergence converges to the Kullback-Leibler divergence [21].

Thus, the EPD offers a smooth continuum of divergence measures linking the exponential-type and power-type divergences through its three-parameter structure.

2.1. Estimation under Independent Non-homogeneous Observations

Consider independent but not identically distributed observations Y_1, Y_2, ..., Y_n, where each Y_i follows a potentially different density g_i and is modeled by a parametric family F_{i,θ} = { f_i(·; θ) | θ ∈ Θ ⊆ R^p }, sharing the same common parameter θ. The goal is to estimate θ robustly by minimizing the average exponential-polynomial divergence between the true and model densities, which, up to additive terms independent of θ, leads to

\[
\frac{1}{n}\sum_{i=1}^{n} D_{EP}\big(g_i, f_i(\cdot\,;\theta)\big)
= \frac{1}{n}\sum_{i=1}^{n}\left[\int \Big\{ B'\big(f_i(y;\theta)\big)\, f_i(y;\theta) - B\big(f_i(y;\theta)\big) \Big\}\, dy - B'\big(f_i(Y_i;\theta)\big)\right], \tag{3}
\]

where D_{EP}(·,·) denotes the divergence corresponding to the generating function B(·) in (2). Since only a single observation Y_i is available from each density g_i, we approximate g_i by the degenerate empirical distribution that places unit mass at Y_i. The resulting empirical objective function for estimation becomes

\[
H_n^{(\alpha,\beta,\gamma)}(\theta) = \frac{1}{n}\sum_{i=1}^{n} V_{\alpha,\beta,\gamma}(Y_i;\theta), \tag{4}
\]

where V_{α,β,γ}(Y_i; θ) represents the samplewise contribution to the divergence and combines exponential and polynomial components as

\[
V_{\alpha,\beta,\gamma}(Y_i;\theta)
= \int \left[ \frac{\beta}{\alpha^2}\Big\{ e^{\alpha f_i(y;\theta)}\big(\alpha f_i(y;\theta) - 1\big) + 1 \Big\} + (1-\beta)\, f_i^{\,1+\gamma}(y;\theta) \right] dy
- \left[ \frac{\beta}{\alpha}\Big(e^{\alpha f_i(Y_i;\theta)} - 1\Big) + \frac{1-\beta}{\gamma}\Big((\gamma+1)\, f_i^{\,\gamma}(Y_i;\theta) - 1\Big) \right]. \tag{5}
\]

Minimizing H_n^{(α,β,γ)}(θ) with respect to θ yields the Minimum Exponential-Polynomial Divergence Estimator (MEPDE).
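As a quick numerical illustration of the special cases listed above, the following sketch (NumPy; the function name and grid are ours, not from the paper) evaluates the generating function B in (2) and checks the DPD reduction at β = 0, the KL limit as γ → 0, and convexity on a grid:

```python
import numpy as np

def B(x, alpha, beta, gamma):
    """Convex generating function of the EPD, Eq. (2):
    B(x) = (beta/alpha^2)(e^{alpha x} - 1 - alpha x)
         + ((1-beta)/gamma)(x^{gamma+1} - x)."""
    exp_part = (beta / alpha**2) * (np.exp(alpha * x) - 1.0 - alpha * x)
    poly_part = ((1.0 - beta) / gamma) * (x**(gamma + 1.0) - x)
    return exp_part + poly_part

x = np.linspace(0.05, 2.0, 200)

# beta = 0: only the polynomial part survives (DPD generator with parameter gamma).
b_dpd = B(x, alpha=1.0, beta=0.0, gamma=0.5)
assert np.allclose(b_dpd, (x**1.5 - x) / 0.5)

# beta = 0, gamma -> 0: (x^{gamma+1} - x)/gamma -> x log x, the KL generator.
b_small_gamma = B(x, alpha=1.0, beta=0.0, gamma=1e-6)
assert np.allclose(b_small_gamma, x * np.log(x), atol=1e-4)

# Convexity check: second differences of B are non-negative on a uniform grid.
for params in [(1.0, 1.0, 0.5), (0.5, 0.3, 0.7), (-1.0, 0.5, 1.0)]:
    vals = B(x, *params)
    assert np.all(np.diff(vals, 2) > -1e-10)
```

Both components are individually convex (the exponential part has second derivative βe^{αx} > 0 and the polynomial part (1−β)(γ+1)x^{γ−1} ≥ 0 for x > 0), which is why any mixture 0 ≤ β ≤ 1 remains a valid Bregman generator.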
Now, differentiating H_n^{(α,β,γ)}(θ) with respect to θ, we obtain the estimating equation

\[
\frac{1}{n}\sum_{i=1}^{n}\left[ u_i(Y_i;\theta)\, B''\big(f_i(Y_i;\theta)\big)\, f_i(Y_i;\theta) - \int u_i(y;\theta)\, B''\big(f_i(y;\theta)\big)\, f_i^{\,2}(y;\theta)\, dy \right] = 0, \tag{6}
\]

where u_i(y;θ) = ∂/∂θ log f_i(y;θ) is the likelihood score function of the i-th sample. Further, we can express the proposed formulation as a weighted likelihood estimating equation, viz.,

\[
\frac{1}{n}\sum_{i=1}^{n}\left[ u_i(Y_i;\theta)\, w\big(f_i(Y_i;\theta)\big) - \int u_i(y;\theta)\, w\big(f_i(y;\theta)\big)\, f_i(y;\theta)\, dy \right] = 0, \tag{7}
\]

where the corresponding weight function is of the form w(t) = β t e^{αt} + (1−β)(1+γ) t^γ. The above estimating equation generalizes the minimum density power divergence estimator (MDPDE) for non-homogeneous observations proposed by [22], while preserving the bias-variance trade-off controlled by (α, β, γ). When β = 0, the MEPDE reduces to the MDPDE, and as (β, γ, α) → (0, 0, 0), it simplifies to the maximum likelihood estimating equation:

\[
\frac{1}{n}\sum_{i=1}^{n} u_i(Y_i;\theta) = 0. \tag{8}
\]

From a functional perspective, the minimum exponential-polynomial divergence functional for independent non-homogeneous data is defined as

\[
T_{\alpha,\beta,\gamma}(G_1,\ldots,G_n) = \arg\min_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^{n} D_{EP}\big(g_i, f_i(\cdot\,;\theta)\big). \tag{9}
\]

Since D_{EP}(·,·) is a valid divergence (non-negative, and equal to zero if and only if g_i = f_i(·;θ)), the functional T_{α,β,γ}(G_1,...,G_n) is Fisher consistent under model identifiability, satisfying

\[
T_{\alpha,\beta,\gamma}(F_{1,\theta_0},\ldots,F_{n,\theta_0}) = \theta_0. \tag{10}
\]

Hence, the proposed MEPDE serves as a natural and robust generalization of the MDPDE for independently but non-identically distributed data, combining the strengths of both exponential and polynomial divergence families within a unified estimation framework.

2.2. Asymptotic Properties

To study the asymptotic behavior of the Minimum Exponential-Polynomial Divergence Estimator (MEPDE) under independent non-homogeneous observations, let

\[
\theta^g = \arg\min_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^{n} D_{EP}\big(g_i, f_i(\cdot\,;\theta)\big)
\]

denote the best-fitting parameter that minimizes the average exponential-polynomial divergence between the true and model densities. Let us introduce a p × p matrix denoted by J_i, whose (k,l)-th entry is defined through the second-order partial derivative with respect to the k-th and l-th components of θ. In addition, we define the quantities K_i and ξ_i as given below, writing u_i ≡ u_i(y;θ^g), f_i ≡ f_i(y;θ^g), and I_i ≡ I_i(y;θ^g) = −∂u_i(y;θ)/∂θ^⊤ |_{θ=θ^g} (the information function) for brevity:

\[
\begin{aligned}
J_i ={}& \beta \int f_i^{2}\, e^{\alpha f_i}\, u_i u_i^\top\, dy
+ (1-\beta)(1+\gamma) \int f_i^{\gamma+1}\, u_i u_i^\top\, dy \\
&+ (1-\beta)(1+\gamma) \int \big[g_i(y) - f_i\big] \big\{ I_i - \gamma\, u_i u_i^\top \big\}\, f_i^{\gamma}\, dy \\
&+ \beta \int \big[g_i(y) - f_i\big] \big\{ I_i - u_i u_i^\top \big\}\, f_i\, e^{\alpha f_i}\, dy \\
&- \alpha\beta \int \big[g_i(y) - f_i\big]\, f_i^{2}\, e^{\alpha f_i}\, u_i u_i^\top\, dy,
\end{aligned} \tag{11}
\]

\[
K_i = \int u_i u_i^\top \big\{ \beta f_i\, e^{\alpha f_i} + (1-\beta)(1+\gamma) f_i^{\gamma} \big\}^2 g_i(y)\, dy - \xi_i \xi_i^\top, \tag{12}
\]

where

\[
\xi_i = \int u_i \big\{ \beta f_i\, e^{\alpha f_i} + (1-\beta)(1+\gamma) f_i^{\gamma} \big\}\, g_i(y)\, dy. \tag{13}
\]

Note that the weight appearing in braces is exactly w(f_i) from (7), so that K_i is the covariance of the samplewise score contribution w(f_i(Y_i))u_i(Y_i).

Regularity Conditions. To establish consistency and asymptotic normality, we assume conditions analogous to those in [22] and [10]:

(R1) The support X = { y : f_i(y;θ) > 0 } is independent of both i and θ, for all i = 1, 2, ..., n, and the true densities g_i are also supported on X.

(R2) There exists an open subset Θ_0 ⊆ Θ containing the best-fitting parameter θ^g such that, for almost all y ∈ X and all θ ∈ Θ_0, the densities f_i(y;θ), i = 1, 2, ..., n, are three times continuously differentiable with respect to θ, and all third-order partial derivatives are continuous in θ.

(R3) For each i = 1, 2, ..., n, and for the generating function B(·) defined in (2), the integrals

\[
\int B'\big(f_i(y;\theta)\big)\, f_i(y;\theta)\, dy \qquad \text{and} \qquad \int B'\big(f_i(y;\theta)\big)\, g_i(y)\, dy
\]

can be differentiated three times with respect to θ, and differentiation can be interchanged with integration.

(R4) For each i = 1, 2, ..., n, the matrix J_i defined in (11) is positive definite. We define the average information matrix

\[
\Psi_n = \frac{1}{n}\sum_{i=1}^{n} J_i.
\]

Assume that the sequence {Ψ_n} converges to a positive definite limit Ψ = lim_{n→∞} Ψ_n, and that λ_0 = λ_min(Ψ) > 0.

(R5) There exist measurable bounding functions M^{(i)}_{jkl}(y) such that, for all θ ∈ Θ_0,

\[
\left| \frac{\partial^3}{\partial\theta_j\, \partial\theta_k\, \partial\theta_l} V_{\alpha,\beta,\gamma}(y;\theta) \right| \le M^{(i)}_{jkl}(y),
\]

where V_{α,β,γ}(·;θ) is defined in (5), and (1/n) Σ_{i=1}^n E_{g_i}[ M^{(i)}_{jkl}(Y_i) ] = O(1) for all j, k, l.

(R6) For all j and k, the following uniform integrability conditions hold for the first and second derivatives of V_{α,β,γ}(Y_i;θ):

\[
\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} E_{g_i}\!\left[ \left|\frac{\partial V_{\alpha,\beta,\gamma}(Y_i;\theta)}{\partial\theta_j}\right| \mathbf{1}\!\left\{ \left|\frac{\partial V_{\alpha,\beta,\gamma}(Y_i;\theta)}{\partial\theta_j}\right| > \epsilon\sqrt{n} \right\} \right] = 0,
\]

\[
\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} E_{g_i}\!\left[ \left| \frac{\partial^2 V_{\alpha,\beta,\gamma}(Y_i;\theta)}{\partial\theta_j\,\partial\theta_k} - E_{g_i}\!\left( \frac{\partial^2 V_{\alpha,\beta,\gamma}(Y_i;\theta)}{\partial\theta_j\,\partial\theta_k} \right) \right| \mathbf{1}\!\left\{ \left| \frac{\partial^2 V_{\alpha,\beta,\gamma}(Y_i;\theta)}{\partial\theta_j\,\partial\theta_k} - E_{g_i}\!\left( \frac{\partial^2 V_{\alpha,\beta,\gamma}(Y_i;\theta)}{\partial\theta_j\,\partial\theta_k} \right) \right| > \epsilon\sqrt{n} \right\} \right] = 0.
\]

(R7) (Lindeberg-Feller condition) For every ϵ > 0,

\[
\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} E_{g_i}\!\left[ \left\| \Omega_n^{-1/2}\, \frac{\partial}{\partial\theta} V_{\alpha,\beta,\gamma}(Y_i;\theta) \right\|^2 \mathbf{1}\!\left\{ \left\| \Omega_n^{-1/2}\, \frac{\partial}{\partial\theta} V_{\alpha,\beta,\gamma}(Y_i;\theta) \right\| > \epsilon\sqrt{n} \right\} \right] = 0,
\]

where Ω_n = (1/n) Σ_{i=1}^n K_i, Ω = lim_{n→∞} Ω_n, and the limiting matrix Ω is positive definite with λ_min(Ω) > 0.

Theorem 1.
Under conditions (R1)–(R7), the following results hold:

1. There exists a consistent sequence of roots θ̂_n^{(α,β,γ)} of the estimating equation (6) such that θ̂_n^{(α,β,γ)} converges in probability to θ^g.

2. The Minimum Exponential-Polynomial Divergence estimator is asymptotically normal, with

\[
\sqrt{n}\,\big(\hat\theta_n^{(\alpha,\beta,\gamma)} - \theta^g\big) \xrightarrow{d} N_p\big(0,\ \Psi^{-1}\Omega\Psi^{-1}\big). \tag{14}
\]

Thus, under standard smoothness and identifiability assumptions, the MEPDE for independent non-homogeneous observations is consistent and asymptotically normal, providing a unified robust inference framework within the exponential-polynomial divergence family.

Remark. The proof of this theorem follows arguments analogous to those used in Theorem 3.1 of [22], with appropriate modifications to incorporate the exponential-polynomial divergence structure.

3. Exponential-Polynomial Divergence Information Criterion

Model selection criteria built upon divergence measures provide a systematic approach to weigh model fit against complexity, especially in situations where data contamination, heavy tails, or structural heterogeneity can distort likelihood-based criteria. Divergence measures quantify how far a proposed model departs from the underlying data-generating distribution, and when the chosen divergence is itself resistant to outlying observations, the resulting information criterion naturally becomes more stable. In this context, the exponential-polynomial divergence (EPD), introduced in Section 2, serves as a versatile foundation for robust model selection. Its three tuning parameters (α, β, γ) allow it to encompass several standard divergences (including the DPD, BED, and KL divergence) while offering additional flexibility to moderate the impact of outliers or model misspecification. Building on this adaptability, we propose a robust information criterion based on the exponential-polynomial divergence, abbreviated as EPDIC.
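To make the robustness behind Theorem 1 concrete, the following minimal sketch (our own illustration, not the authors' code; the tuning values (α, β, γ) = (0.5, 0.3, 0.5) are arbitrary demo choices) computes the empirical EPD objective of (4)-(5) for a homogeneous N(θ, 1) model on a contaminated sample, evaluating the y-integral in (5) by a simple Riemann sum, and compares the resulting MEPDE with the non-robust sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# Contaminated sample: 90 draws from N(0, 1) plus 10 gross outliers at y = 10.
y = np.concatenate([rng.normal(0.0, 1.0, 90), np.full(10, 10.0)])

def epd_objective(theta, y, alpha=0.5, beta=0.3, gamma=0.5):
    """Empirical EPD objective H_n(theta) of Eqs. (4)-(5) for a N(theta, 1)
    model; the y-integral in Eq. (5) is approximated on a uniform grid."""
    grid = np.linspace(theta - 10.0, theta + 10.0, 2001)
    dx = grid[1] - grid[0]
    f_grid = np.exp(-0.5 * (grid - theta) ** 2) / np.sqrt(2 * np.pi)
    integrand = (beta / alpha**2) * (np.exp(alpha * f_grid) * (alpha * f_grid - 1) + 1) \
        + (1 - beta) * f_grid ** (1 + gamma)
    integral = integrand.sum() * dx
    f_y = np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi)
    inner = (beta / alpha) * (np.exp(alpha * f_y) - 1) \
        + ((1 - beta) / gamma) * ((gamma + 1) * f_y**gamma - 1)
    return integral - inner.mean()

# Grid search for the MEPDE (deterministic, avoids local-minimum issues).
thetas = np.linspace(-3.0, 3.0, 121)
mepde = thetas[np.argmin([epd_objective(t, y) for t in thetas])]
mle = y.mean()  # the non-robust fit, dragged toward the outliers

assert abs(mepde) < 0.5  # robust estimate stays near the true mean 0
assert mle > 0.6         # sample mean is pulled upward by the contamination
```

The outliers sit in a region where the model density f is essentially zero, so their contributions to the EPD objective are heavily downweighted, while every observation enters the likelihood (here, the sample mean) with full weight.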
This criterion extends the ideas underlying AIC and later divergence-based methods, including the DIC of [4], but incorporates the enhanced robustness properties of the EPD family. Let θ̂^{(α,β,γ)} = argmin_{θ∈Θ} H^{(α,β,γ)}(θ) represent the Minimum Exponential-Polynomial Divergence Estimator (MEPDE). To construct an information criterion appropriate for EPD, one requires an asymptotically unbiased estimator of the expected overall divergence between the true density and the fitted parametric family, evaluated at θ̂^{(α,β,γ)}. Following the logic of AIC, this involves expanding the population divergence around the true parameter and deriving a model-complexity penalty whose structure depends on the curvature of the EPD functional. The resulting penalty depends explicitly on the robustness parameters (α, β, γ), and it collapses to the usual AIC penalty when these parameters take the values corresponding to the Kullback-Leibler divergence. We now describe the derivation of EPDIC.

In the present setting, the expected discrepancy between the true data-generating density g and the fitted model f_{θ̂} is quantified using the EPD functional

\[
D_{\hat\theta} \equiv H^{(\alpha,\beta,\gamma)}\big(g, f_{\hat\theta}\big) = \int D_{EP}\big(g(y), f_{\hat\theta}(y)\big)\, dy.
\]

Let θ_0 denote the true parameter such that g = f_{θ_0} under correct specification. The criterion we wish to evaluate is the population quantity E[D(θ̂)], which cannot be computed directly. To relate it to observable components, we consider the empirical divergence functional

\[
\widehat{D}(\theta) = H_n^{(\alpha,\beta,\gamma)}(\theta), \qquad \widehat{D}(\hat\theta) = H_n^{(\alpha,\beta,\gamma)}(\hat\theta),
\]

where θ̂ is the minimum exponential-polynomial divergence estimator (MEPDE).
We now employ a Taylor expansion of D(θ) about θ_0,

\[
D(\hat\theta) = D(\theta_0) + (\hat\theta - \theta_0)^\top \left.\frac{\partial D(\theta)}{\partial\theta}\right|_{\theta=\theta_0}
+ \frac{1}{2}(\hat\theta - \theta_0)^\top \left.\frac{\partial^2 D(\theta)}{\partial\theta\,\partial\theta^\top}\right|_{\theta=\theta_0} (\hat\theta - \theta_0)
+ o\big(\|\hat\theta - \theta_0\|^2\big). \tag{15}
\]

Since θ_0 minimizes the population divergence, the first-order term vanishes: ∂D(θ)/∂θ |_{θ=θ_0} = 0. Taking expectations, we get

\[
E[D(\hat\theta)] = E[D(\theta_0)] + \frac{1}{2}\, E\!\left[ (\hat\theta - \theta_0)^\top \left.\frac{\partial^2 D(\theta)}{\partial\theta\,\partial\theta^\top}\right|_{\theta=\theta_0} (\hat\theta - \theta_0) \right] + o(n^{-1}). \tag{16}
\]

Under the regularity conditions established for the EPD framework, the estimator θ̂ satisfies √n (θ̂ − θ_0) →_d N_p(0, Ψ^{-1}ΩΨ^{-1}). Also note that ∂²D(θ)/∂θ∂θ^⊤ |_{θ=θ_0} = Ψ. Substituting this asymptotic behavior into the Taylor expansion gives

\[
E[D(\hat\theta)] = D(\theta_0) + \frac{1}{2}\,\mathrm{tr}\Big( \Psi\, E\big[ (\hat\theta - \theta_0)(\hat\theta - \theta_0)^\top \big] \Big) + o(n^{-1})
= D(\theta_0) + \frac{1}{2n}\,\mathrm{tr}\big( \Omega\Psi^{-1} \big) + o(n^{-1}). \tag{17}
\]

This result is directly analogous to the Takeuchi Information Criterion (TIC) expansion [2]. Finally, replacing the unobservable D(θ_0) by its empirical counterpart H_n^{(α,β,γ)}(θ̂) and multiplying both sides by n yields the exponential-polynomial divergence information criterion

\[
\mathrm{EPDIC} \approx n\, H_n^{(\alpha,\beta,\gamma)}(\hat\theta) + \mathrm{tr}\big( \Omega\Psi^{-1} \big). \tag{18}
\]

The first term in (18) represents the empirical divergence and therefore measures the goodness of fit of the model under the chosen exponential-polynomial divergence. The second term, tr(ΩΨ^{-1}), serves as a model-complexity adjustment, playing the role of the asymptotic bias correction in the same spirit as Takeuchi's information criterion. Hence, EPDIC provides a flexible continuum of model-selection tools that unifies TIC and divergence measures from the exponential-polynomial family.

4. Influence Function of the EPDIC

We now study the robustness properties of the proposed exponential-polynomial divergence information criterion (EPDIC) through its influence function, which quantifies the infinitesimal effect of a small contamination at a point on the value of the criterion. Throughout, we work under the independent non-homogeneous framework described in Section 2. It is worth emphasizing that the proposed EPDIC is defined as

\[
\mathrm{EPDIC} = n\, H_n^{(\alpha,\beta,\gamma)}(\hat\theta) + \mathrm{tr}\big( \Omega\Psi^{-1} \big),
\]

where θ̂ denotes the Minimum Exponential-Polynomial Divergence Estimator (MEPDE). Let G = (G_1, ..., G_n) denote the collection of true distributions, and consider a contaminated version of the i-th component given by

\[
G_{i,\varepsilon} = (1-\varepsilon) G_i + \varepsilon \Delta_y, \tag{19}
\]

where Δ_y is the degenerate distribution at y and ε ↓ 0. The influence function of the EPDIC at the contamination point y is defined as

\[
\mathrm{IF}(y; \mathrm{EPDIC}, G) = \left.\frac{d}{d\varepsilon}\, \mathrm{EPDIC}(G_1, \ldots, G_{i,\varepsilon}, \ldots, G_n)\right|_{\varepsilon=0}. \tag{20}
\]

The empirical divergence admits the functional representation

\[
H(\theta, G) = \frac{1}{n}\sum_{i=1}^{n} \int V_{\alpha,\beta,\gamma}(y;\theta)\, dG_i(y). \tag{21}
\]

Hence,

\[
\mathrm{IF}\big(y;\, n H_n(\hat\theta),\, G\big) = n\left[ V_{\alpha,\beta,\gamma}(y;\theta^g) + \left.\frac{\partial H(\theta, G)}{\partial\theta}\right|_{\theta=\theta^g}^{\top} \mathrm{IF}(y; \hat\theta, G) \right], \tag{22}
\]

where θ^g denotes the population minimizer of the divergence. Since θ^g satisfies

\[
\left.\frac{\partial H(\theta, G)}{\partial\theta}\right|_{\theta=\theta^g} = 0, \tag{23}
\]

we obtain

\[
\mathrm{IF}\big(y;\, n H_n(\hat\theta),\, G\big) = n\, V_{\alpha,\beta,\gamma}(y;\theta^g). \tag{24}
\]

The penalty component tr(ΩΨ^{-1}) depends on the underlying distribution only through second-order moments. Under standard smoothness and moment conditions, its influence function is of order O(1) and is therefore negligible compared to the O(n) divergence contribution. Thus,

\[
\mathrm{IF}\big(y;\, \mathrm{tr}(\Omega\Psi^{-1}),\, G\big) = O(1). \tag{25}
\]

Combining the above results, the influence function of the classical EPDIC is given by

\[
\mathrm{IF}(y; \mathrm{EPDIC}, G) = n\, V_{\alpha,\beta,\gamma}(y;\theta^g) + O(1).
\]
(26)

Since V_{α,β,γ}(y;θ) is bounded for α > 0 and γ > 0, the influence function of the classical EPDIC is bounded, confirming its robustness. In the limiting case (α, β, γ) → (0, 0, 0), the divergence reduces to the negative log-likelihood, yielding an unbounded influence function and recovering the non-robust behavior of likelihood-based criteria such as AIC and TIC.

The robustness properties of information criteria can be assessed through their influence functions (IFs), which quantify the effect of an infinitesimal contamination on the criterion value. Conventional information criteria like AIC are directly constructed from the log-likelihood evaluated at the maximum likelihood estimator. Consequently, their influence functions inherit the unbounded nature of the score and log-likelihood contributions, rendering AIC highly sensitive to outlying observations. In particular, a single gross error can exert an arbitrarily large impact on these criteria, leading to potentially unstable model selection decisions. In contrast, the proposed EPDIC replaces the log-likelihood component with a robust divergence-based objective, leading to bounded score-type contributions. This modification yields a bounded influence function for EPDIC, ensuring that the effect of anomalous observations is controlled. Consequently, EPDIC exhibits reduced gross-error sensitivity and improved local robustness compared to AIC and DIC. This boundedness property makes EPDIC particularly well-suited for model selection in the presence of mild deviations from the assumed model or potential data contamination.

5. Optimal Tuning Parameter

The performance of the Minimum Exponential-Polynomial Divergence Estimator (MEPDE) critically depends on the appropriate choice of the robustness parameters (α, β, γ).
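To see concretely why this choice matters, a small sketch (our illustration; the tuning values are arbitrary) evaluates the weight function w(t) = βte^{αt} + (1−β)(1+γ)t^γ from the weighted estimating equation (7) at a typical point and at a gross outlier of a N(0, 1) model. Near the maximum likelihood limit the outlier keeps essentially full weight, whereas a moderately robust tuning drives its relative weight to nearly zero:

```python
import numpy as np

def w(t, alpha, beta, gamma):
    """Downweighting function w(t) = beta*t*e^{alpha*t} + (1-beta)*(1+gamma)*t^gamma
    attached to the score in the weighted estimating equation (7)."""
    return beta * t * np.exp(alpha * t) + (1 - beta) * (1 + gamma) * t**gamma

# Model densities (N(0,1)) at a "typical" point and at a gross outlier.
f_typical = np.exp(-0.5 * 1.0**2) / np.sqrt(2 * np.pi)  # density at y = 1
f_outlier = np.exp(-0.5 * 8.0**2) / np.sqrt(2 * np.pi)  # density at y = 8

# Near the ML limit (beta = 0, tiny gamma) the outlier keeps ~full weight...
assert w(f_outlier, 0.5, 0.0, 1e-4) / w(f_typical, 0.5, 0.0, 1e-4) > 0.9
# ...whereas a moderately robust tuning makes its relative weight negligible.
assert w(f_outlier, 0.5, 0.3, 0.5) / w(f_typical, 0.5, 0.3, 0.5) < 1e-5
```

Larger γ (and the exponential component for β > 0) thus buys robustness by discounting low-density observations, at some cost in efficiency under the clean model, which is precisely the trade-off the tuning procedure below must balance.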
While these parameters regulate the trade-off between efficiency and robustness, their optimal selection is non-trivial. Classical approaches such as cross-validation or asymptotic mean squared error minimization require either repeated model fitting or explicit evaluation of asymptotic covariance matrices, both of which may be computationally demanding within the exponential-polynomial framework. To address this issue, we adopt the generalized score matching principle introduced by [16], which provides an alternative estimation and model selection mechanism that does not rely on the normalizing constant of the model density. In particular, we exploit the formulation of generalized score matching in Euclidean space for independent observations, as detailed in [15]. This approach offers a computationally attractive and theoretically justified criterion for selecting the optimal tuning parameters of the exponential-polynomial divergence estimator.

Let Y_1, ..., Y_n ∈ R^d be independent observations with true densities g_i, modeled by parametric densities f_i(·;θ). Denote by p_i(y;θ) = f_i(y;θ) the model density. The generalized score matching criterion is based on the Fisher divergence between the true density and the model density, defined as

\[
D_{SM}(g, p) = \frac{1}{n}\sum_{i=1}^{n} E_{g_i}\Big[ \big\| \nabla_i \log g_i(Y_i) - \nabla_i \log p_i(Y_i;\theta) \big\|^2 \Big],
\]

where ∇_i denotes differentiation with respect to Y_i. Under mild regularity conditions, this divergence decomposes into a term independent of θ and a model-dependent component. Consequently, minimizing the Fisher divergence reduces to minimizing the empirical generalized score matching objective

\[
d_{SM}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \rho_{SM,i}(Y_i;\theta), \tag{27}
\]

where

\[
\rho_{SM,i}(Y_i;\theta) = 2\sum_{j=1}^{d} \frac{\partial^2}{\partial y_{ij}^2} \log f_i(Y_i;\theta) + \sum_{j=1}^{d} \left( \frac{\partial}{\partial y_{ij}} \log f_i(Y_i;\theta) \right)^2.
\]
(28)

An important advantage of this formulation is that it depends only on derivatives of the log-density with respect to the data, thereby eliminating the need to evaluate any normalizing constant. This feature is particularly beneficial in divergence-based estimation frameworks.

For each fixed triplet (α, β, γ), let θ̂^{(α,β,γ)} = argmin_θ H_n^{(α,β,γ)}(θ) denote the corresponding MEPDE. Substituting this estimator into the generalized score matching objective (27), we define the tuning-selection criterion

\[
S_n(\alpha, \beta, \gamma) = d_{SM}\big( \hat\theta^{(\alpha,\beta,\gamma)} \big). \tag{29}
\]

The optimal tuning parameters are then obtained as

\[
(\hat\alpha, \hat\beta, \hat\gamma) = \arg\min_{\alpha,\beta,\gamma} S_n(\alpha, \beta, \gamma). \tag{30}
\]

This procedure selects the divergence parameters that yield the smallest Fisher divergence between the fitted model and the underlying data-generating mechanism. Unlike approaches based on asymptotic MSE minimization, this method does not require explicit computation of the asymptotic covariance matrices Ψ and Ω. Moreover, it avoids repeated pilot updates, as required in iterative Warwick and Jones-type procedures [13]. Once the optimal tuning parameters (α̂, β̂, γ̂) are determined via (30), the corresponding MEPDE θ̂* = θ̂^{(α̂,β̂,γ̂)} is used to evaluate the EPDIC, defined in Section 3.

6. Numerical Experiments

This section presents an extensive set of numerical experiments designed to examine the finite-sample performance and practical relevance of the proposed methodologies. Comprehensive simulation studies are conducted to assess the behavior of the estimators and associated information criteria under a variety of data-generating scenarios. In addition, two real-world applications are analyzed to illustrate the practical applicability of the proposed framework.
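For a univariate Gaussian model, the GSM objective (27)-(28) is available in closed form (∂/∂y log f = −(y−μ)/σ² and ∂²/∂y² log f = −1/σ²), and minimizing its empirical version recovers the mean and variance without ever evaluating a normalizing constant. The following sketch (our illustration; function and variable names are ours) verifies this on simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.5, 5000)  # true mean 2, true variance 2.25

def gsm_objective(mu, sigma2, y):
    """Empirical GSM objective (27)-(28) for a univariate N(mu, sigma2) model:
    rho_SM = 2*(-1/sigma2) + ((y - mu)/sigma2)**2, averaged over the sample."""
    return np.mean(2.0 * (-1.0 / sigma2) + ((y - mu) / sigma2) ** 2)

# Coarse grid search over (mu, sigma^2); the minimizer should sit near the
# true values even though no likelihood (and no normalizer) is computed.
mus = np.linspace(0.0, 4.0, 81)
s2s = np.linspace(0.5, 5.0, 91)
vals = np.array([[gsm_objective(m, s, y) for s in s2s] for m in mus])
i, j = np.unravel_index(vals.argmin(), vals.shape)

assert abs(mus[i] - 2.0) < 0.2
assert abs(s2s[j] - 2.25) < 0.4
```

In the tuning procedure above, this same objective would be evaluated at the fitted MEPDE for each candidate (α, β, γ) rather than minimized over the model parameters directly.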
The first application concerns a linear mixed-effects panel data model, while the second focuses on a neural network setting, thereby demonstrating the flexibility and effectiveness of the proposed approach across both classical statistical models and modern machine learning frameworks.

6.1. Simulation Study

An extensive Monte Carlo simulation study is conducted to investigate the robustness of the proposed exponential-polynomial divergence information criteria. All results reported in this subsection are based on 1000 independent Monte Carlo replications. We consider a multiple linear regression model with $p = 5$ covariates, given by
$$
y_i = x_i^{\top} \beta + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{31}
$$
where $x_i = (x_{i1}, \ldots, x_{i5})^{\top}$ denotes the vector of explanatory variables, $\beta$ is the vector of regression coefficients, and $\varepsilon_i$ represents the random error term. The response variable $y_i$ is generated from a Gaussian distribution, ensuring compatibility with the likelihood-based and divergence-based estimation frameworks. The covariate vectors $x_i$ are generated from a multivariate normal distribution with mean vector $\mathbf{0}$ and a covariance matrix $\Sigma$ having an autoregressive structure, that is, $\Sigma_{jk} = \rho^{|j-k|}$, $j, k = 1, \ldots, 5$, with $\rho = 0.5$, allowing for moderate correlation among the covariates. The error terms $\varepsilon_i$ are independently generated from a normal distribution with mean zero and variance $\sigma^2 = 1$. The true regression coefficient vector is fixed as $\beta_0 = (1.5, -1.0, 0.8, 0.5, -0.7)^{\top}$.

The simulation procedure is implemented under a single sample size, namely $n = 150$. To examine the robustness properties of the estimators, two distinct contamination schemes are considered. In the first scheme, contamination is introduced in the error (disturbance) terms $\varepsilon_i$, where a proportion $\delta \in \{0.052, 0.093, 0.134\}$ of the errors is replaced by values generated from a Normal$(10.6, 1)$ distribution. This mechanism induces vertical outliers in the response variable while leaving the design matrix unaffected. In the second scheme, contamination is introduced in the explanatory variables $x_i$. Specifically, a proportion $\delta \in \{0.058, 0.099, 0.140\}$ of the covariate observations is replaced by values drawn from a Shifted-Normal$(45.6, 6.3)$ distribution. This setup generates leverage-type outliers in the design matrix. Taken together, these contamination mechanisms introduce both moderate and severe deviations from the assumed model structure.

Next, parameter estimation is performed using three competing approaches: the classical maximum likelihood estimator (MLE), the density power divergence (DPD) estimator, and the exponential-polynomial divergence (EPD) estimator. These estimators are obtained by minimizing their respective objective functions through a sequential convex programming (SCP) algorithm (see [23]). Within this framework, each nonlinear objective function is locally approximated by a convex surrogate and solved iteratively. At every iteration, a convex subproblem is constructed using first-order derivative information together with an appropriate step-size control to ensure numerical stability and monotonic descent. The parameter vector is updated according to the solution of the convex approximation, and the process continues until convergence. The algorithm is initialized at the ordinary least squares (OLS) estimates, and convergence is declared when the relative change in both the parameter estimates and the objective function value falls below a pre-specified tolerance threshold.

Furthermore, the optimal tuning parameters for both the DPD and EPD estimators are determined using the generalized score matching procedure under each contamination scheme.
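The two contamination schemes above can be reproduced with a short generator. This is a minimal sketch under our own assumptions: the function name `simulate_panel` is illustrative, we read Shifted-Normal$(45.6, 6.3)$ as a normal with mean 45.6 and standard deviation 6.3, and we assume scheme 2 replaces entire covariate rows rather than individual entries.

```python
import numpy as np

def simulate_panel(n=150, delta=0.093, scheme=1, rho=0.5, seed=0):
    """Generate data from model (31) with AR(1)-correlated covariates
    and, optionally, contamination in the errors (scheme 1) or in the
    covariates (scheme 2)."""
    rng = np.random.default_rng(seed)
    p = 5
    beta0 = np.array([1.5, -1.0, 0.8, 0.5, -0.7])
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # Sigma_jk = rho^|j-k|
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    eps = rng.normal(0.0, 1.0, size=n)
    m = int(round(delta * n))                # number of contaminated units
    out = rng.choice(n, size=m, replace=False)
    if scheme == 1:                          # vertical outliers in the errors
        eps[out] = rng.normal(10.6, 1.0, size=m)
    else:                                    # leverage points in the design
        X[out] = rng.normal(45.6, 6.3, size=(m, p))
    y = X @ beta0 + eps
    return X, y, out
```

Calling `simulate_panel(delta=0.093, scheme=1)` then yields a sample with roughly 9.3% vertical outliers, matching the middle contamination level of Scheme 1.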
This data-driven strategy identifies the tuning parameters that achieve an appropriate balance between efficiency and robustness by minimizing an empirical criterion derived from the corresponding score equations. The resulting optimal tuning parameter values under the different contamination scenarios are summarized in Table 1.

Table 1: Optimal tuning parameters for DPD and EPD under different contamination schemes.

| Scheme  | δ     | γ (DPD) | α (EPD) | β (EPD) | γ (EPD) |
|---------|-------|---------|---------|---------|---------|
| Pure    | 0.000 | 0.95    | 0.10    | 0.70    | 0.30    |
| Cont. 1 | 0.052 | 0.25    | 0.10    | 0.60    | 0.70    |
|         | 0.093 | 0.35    | 0.10    | 0.70    | 0.30    |
|         | 0.134 | 0.40    | 0.40    | 0.70    | 0.60    |
| Cont. 2 | 0.058 | 0.15    | 0.10    | 0.70    | 0.90    |
|         | 0.099 | 0.25    | 0.10    | 0.60    | 0.70    |
|         | 0.140 | 0.20    | 0.10    | 0.70    | 0.90    |

Figure 1: Information criteria under Contamination scheme 1.

Employing these optimal tuning parameters, the parameter estimates are ultimately derived using the MLE, DPDE, and EPDE methods, noting that the MLE does not require any tuning parameter. The corresponding numerical results, evaluated across various contamination scenarios, are reported in Table 2.

Based on the estimated parameter values from each competing method, the corresponding information criteria, namely the maximum likelihood information criterion (MLIC), the density power divergence information criterion (DPDIC), and the proposed exponential-polynomial divergence information criterion (EPDIC), are then evaluated under both pure and contaminated data-generating mechanisms. In this unified framework, the criteria are computed by combining the empirical likelihood-, DPD-, and EPD-based objective functions evaluated at their respective estimators with the associated asymptotic penalty terms that account for model complexity. This formulation enables a coherent comparison of likelihood-based and divergence-based model selection strategies.
Figure 1 illustrates the information criterion values for each estimation method under Scheme 1 across varying contamination proportions $\delta$, while Figure 2 displays the corresponding results under Scheme 2. The comparative behavior of MLIC, DPDIC, and EPDIC as $\delta$ increases in each scheme provides a clear assessment of their relative stability and robustness in the presence of model deviations, thereby highlighting the effect of different contamination structures on model selection performance.

Table 2: Estimated values of the parameters under pure and contamination schemes.

| Scheme  | δ     | Parameter | MLE       | DPDE      | EPDE      |
|---------|-------|-----------|-----------|-----------|-----------|
| Pure    | 0.000 | β1 | 1.495108  | 1.494928  | 1.494640  |
|         |       | β2 | -0.995442 | -0.995126 | -0.994971 |
|         |       | β3 | 0.799332  | 0.799579  | 0.799965  |
|         |       | β4 | 0.502114  | 0.502597  | 0.503010  |
|         |       | β5 | -0.701112 | -0.700919 | -0.700425 |
| Cont. 1 | 0.052 | β1 | 1.569221  | 1.520784  | 1.514209  |
|         |       | β2 | -1.018774 | -0.981226 | -0.989350 |
|         |       | β3 | 0.761441  | 0.772438  | 0.785769  |
|         |       | β4 | 0.548332  | 0.520714  | 0.513276  |
|         |       | β5 | -0.731114 | -0.706512 | -0.701868 |
|         | 0.093 | β1 | 1.628114  | 1.526884  | 1.517182  |
|         |       | β2 | -1.061884 | -0.978112 | -0.980029 |
|         |       | β3 | 0.726441  | 0.759331  | 0.762461  |
|         |       | β4 | 0.589114  | 0.530774  | 0.521705  |
|         |       | β5 | -0.768552 | -0.710221 | -0.703512 |
|         | 0.134 | β1 | 1.702441  | 1.539882  | 1.525483  |
|         |       | β2 | -1.112774 | -0.964221 | -0.973818 |
|         |       | β3 | 0.691118  | 0.741115  | 0.757822  |
|         |       | β4 | 0.629441  | 0.542997  | 0.532960  |
|         |       | β5 | -0.804118 | -0.715438 | -0.706917 |
| Cont. 2 | 0.058 | β1 | 1.528114  | 1.502398  | 1.501286  |
|         |       | β2 | -1.006218 | -0.998402 | -0.999017 |
|         |       | β3 | 0.819552  | 0.801669  | 0.800941  |
|         |       | β4 | 0.515008  | 0.508681  | 0.501237  |
|         |       | β5 | -0.721933 | -0.703284 | -0.700739 |
|         | 0.099 | β1 | 1.566227  | 1.511903  | 1.507549  |
|         |       | β2 | -1.023915 | -0.979114 | -0.989718 |
|         |       | β3 | 0.845661  | 0.810492  | 0.808795  |
|         |       | β4 | 0.538104  | 0.512961  | 0.509558  |
|         |       | β5 | -0.748991 | -0.710728 | -0.702381 |
|         | 0.140 | β1 | 1.621332  | 1.517884  | 1.511344  |
|         |       | β2 | -1.056781 | -0.968331 | -0.975795 |
|         |       | β3 | 0.892114  | 0.819612  | 0.813201  |
|         |       | β4 | 0.579442  | 0.526893  | 0.513679  |
|         |       | β5 | -0.789661 | -0.718947 | -0.710308 |

Figure 2: Information criteria under Contamination scheme 2.

Consistent with the patterns displayed in Figure 1 and Figure 2, the rate of increase in the information criterion values as the contamination proportion $\delta$ rises clearly distinguishes the three approaches. Under pure data, all criteria remain stable; however, once contamination is introduced, their trajectories diverge markedly. The MLIC exhibits the steepest escalation with increasing $\delta$ in both Scheme 1 (error contamination) and Scheme 2 (leverage contamination), indicating pronounced sensitivity to model deviations. The rapid growth of MLIC reflects the well-known vulnerability of likelihood-based criteria to both vertical and leverage-type outliers.

In contrast, DPDIC demonstrates a more moderate and controlled increase as contamination intensifies. Although its values do rise with $\delta$, the progression is substantially smoother than that of MLIC, suggesting improved resistance to departures from model assumptions. Among the three, EPDIC shows the smallest incremental change across contamination levels in both schemes. Its comparatively minimal rise as $\delta$ increases highlights a high degree of stability, thereby underscoring the robustness of the exponential-polynomial divergence framework. Overall, the relative magnitudes of increase, largest for MLIC, moderate for DPDIC, and smallest for EPDIC, provide coherent empirical evidence of the superior robustness of the divergence-based information criteria, particularly EPDIC, under different contaminated settings.

6.2. Real Data Analysis

6.2.1. Application: Linear Mixed-Effects Panel Data Models

We consider a linear mixed-effects panel data model of the form
$$
y_{it} = x_{it}^{\top} \beta + z_{it}^{\top} \alpha_i + u_{it}, \qquad i = 1, \ldots, n, \; t = 1, \ldots, m, \tag{32}
$$
with $\alpha_i \sim N_p(\mathbf{0}, \Sigma_\alpha)$ and $u_{it} \sim N(0, \sigma_u^2)$ independently. The observations are assumed to be independent but non-homogeneous across individuals, allowing the covariance structure to vary with $i$ through the individual-specific design matrices.
By stacking the observations over time for each individual, the model can be expressed as
$$
y_i = X_i \beta + \varepsilon_i, \qquad \varepsilon_i \sim N_m(\mathbf{0}, \Omega_i), \qquad \Omega_i = Z_i \Sigma_\alpha Z_i^{\top} + \sigma_u^2 I_m.
$$
Hence, the marginal density of $y_i$ is
$$
f_i(y_i; \theta) = (2\pi)^{-m/2} |\Omega_i|^{-1/2} \exp\!\left\{ -\tfrac{1}{2} (y_i - X_i \beta)^{\top} \Omega_i^{-1} (y_i - X_i \beta) \right\}, \tag{33}
$$
where $\theta = (\beta^{\top}, \mathrm{vec}(\Sigma_\alpha)^{\top}, \sigma_u^2)^{\top}$.

With this general formulation in place, we now illustrate the application of the linear mixed-effects panel model using a real-world dataset that naturally aligns with its structure. In particular, we consider the Panel Data of Individual Wages dataset, a well-known benchmark dataset in panel data econometrics. This dataset is drawn from the Panel Study of Income Dynamics (PSID) and contains longitudinal wage information for 595 individuals observed over the period 1976-1982 in the United States, resulting in 4165 total observations arranged as a balanced panel.

The primary response variable in this study is the logarithm of wage (lwage), which is commonly used in labor economics analyses to stabilize variance and interpret coefficients in percentage terms. The set of explanatory variables includes both continuous and categorical covariates reflecting individual characteristics and employment conditions, such as years of education (ed), years of full-time work experience (exp), weeks worked (wks), union membership (union), marital status (married), industry indicators (bluecol, ind), regional indicators (south, smsa), and demographic attributes including sex and race (black). These variables are incorporated in the model as fixed effects, representing systematic influences on wage levels that are assumed to be common across individuals.
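The marginal log-density (33) can be evaluated directly from the variance components. The sketch below is a minimal implementation using SciPy; the function name and the random-intercept example (one column of ones in $Z_i$, a $1 \times 1$ $\Sigma_\alpha$) are our own illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def marginal_logpdf(y_i, X_i, Z_i, beta, Sigma_alpha, sigma2_u):
    """Log of the marginal density (33) for one unit, with
    Omega_i = Z_i Sigma_alpha Z_i^T + sigma_u^2 I_m."""
    m = y_i.shape[0]
    Omega = Z_i @ Sigma_alpha @ Z_i.T + sigma2_u * np.eye(m)
    return multivariate_normal.logpdf(y_i, mean=X_i @ beta, cov=Omega)

# Random-intercept example: Z_i is a column of ones, Sigma_alpha is 1x1,
# so Omega_i has constant within-unit covariance tau^2 = 0.4.
y_i = np.array([0.5, 1.2])
X_i = np.ones((2, 1))
Z_i = np.ones((2, 1))
lp = marginal_logpdf(y_i, X_i, Z_i, beta=np.array([1.0]),
                     Sigma_alpha=np.array([[0.4]]), sigma2_u=1.0)
```

Building $\Omega_i$ explicitly like this is adequate for small $m$; for long panels one would typically exploit the low-rank-plus-diagonal structure of $\Omega_i$ instead of forming and inverting it densely.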
To account for unobserved heterogeneity that cannot be explained by the observed covariates, such as innate ability, motivation, or persistent individual-specific productivity, we introduce individual-level random effects. These random effects capture subject-specific deviations from the overall wage-covariate relationship and are assumed to follow a normal distribution with mean zero and unknown variance. The inclusion of random effects is particularly appropriate here because the dataset contains repeated observations for each individual over multiple years, enabling the model to separate within-individual temporal variation from between-individual variability.

The wage panel dataset is well-suited to the proposed linear mixed-effects framework, as it provides a balanced longitudinal structure with rich socio-economic covariates and repeated observations for each individual, enabling reliable estimation of both fixed and random effects. To assess multivariate extremeness in the joint distribution of the covariates, we employ the Minimum Covariance Determinant (MCD) estimator. It is a widely used, robust multivariate outlier detection method that seeks the subset of observations with the smallest determinant of the covariance matrix, thereby providing high-breakdown estimates of location and scatter that are not unduly influenced by outliers. The robust Mahalanobis distances derived from the MCD estimate are used to detect multivariate outliers, as recommended in the robust statistics literature (e.g., [24], [25]). The MCD-based diagnostic confirms that, despite some individually extreme values, there are no observations that simultaneously deviate in the multivariate covariate space to an extent requiring exclusion. Therefore, no observations have been removed, and robust estimation procedures are used to mitigate potential influence from atypical observations.
In addition, the dataset is verified to contain no missing values.

We employ MLE, DPDE, and EPDE for estimating the model parameters. From the perspective of exponential-polynomial divergence, the parameter estimation problem reduces to minimizing an empirical divergence criterion. Specifically, we consider the objective function
$$
H_n^{(\alpha,\beta,\gamma)}(\theta) = \frac{1}{n} \sum_{i=1}^{n} V_{\alpha,\beta,\gamma}(y_i; \theta), \tag{34}
$$
where the contribution from the $i$th subject is captured by
$$
V_{\alpha,\beta,\gamma}(y_i; \theta) = \frac{\beta}{\alpha^2} \int_{\mathbb{R}^m} \left[ e^{\alpha f_i(y; \theta)} \left( \alpha f_i(y; \theta) - 1 \right) + 1 \right] dy
+ \frac{1-\beta}{\gamma} \int_{\mathbb{R}^m} \left[ f_i^{\gamma+1}(y; \theta) - f_i(y; \theta) \right] dy
- \left[ \frac{\beta}{\alpha} \left( e^{\alpha f_i(y_i; \theta)} - 1 \right) + \frac{1-\beta}{\gamma} \left( (\gamma+1) f_i(y_i; \theta) - 1 \right) \right]. \tag{35}
$$

To obtain stable starting values for the iterative estimation procedures, the fixed-effect parameters are first initialized using the ordinary least squares (OLS) estimator applied to the pooled regression model. These preliminary estimates serve as initial values for the subsequent divergence-based optimization procedures. Based on these initial estimates, the optimal tuning parameters for the divergence-based estimators are determined using the generalized score matching method. For DPDE, the optimal tuning parameter is obtained as $\gamma = 0.50$, while for EPDE, the optimal set of tuning parameters is found to be $(\alpha, \beta, \gamma) = (0.1, 0.3, 0.3)$. Using these optimal values, the fixed-effects model parameters are estimated.

Subsequently, to identify the most relevant explanatory variables and reduce the computational burden associated with evaluating all possible regression specifications, we incorporate a LASSO-type penalty into the likelihood-, DPD-, and EPD-based objective functions. This penalized estimation step performs variable selection by shrinking the coefficients of less informative covariates toward zero.
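To make the objective (34)-(35) concrete in a case where the integrals are one-dimensional, the following sketch evaluates the EPD objective for a univariate $N(\mu, \sigma^2)$ working model by adaptive quadrature. The function names and the $\pm 10\sigma$ truncation are our own choices; the truncation is harmless here because both integrands vanish in the tails as $f \to 0$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def epd_objective(y, mu, sigma, alpha, beta, gamma):
    """Empirical EPD objective (34) for a N(mu, sigma^2) model.
    The two integrals in (35) do not depend on the data, so they
    are computed once; only the last bracket of (35) is averaged."""
    f = lambda t: norm.pdf(t, mu, sigma)
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    I1, _ = quad(lambda t: np.exp(alpha * f(t)) * (alpha * f(t) - 1) + 1, lo, hi)
    I2, _ = quad(lambda t: f(t) ** (gamma + 1) - f(t), lo, hi)
    fy = f(np.asarray(y))
    inner = (beta / alpha) * (np.exp(alpha * fy) - 1) \
        + (1 - beta) / gamma * ((gamma + 1) * fy - 1)
    return beta / alpha**2 * I1 + (1 - beta) / gamma * I2 - np.mean(inner)

# The objective should be smaller at the true location than far from it,
# using the tuning values (0.1, 0.3, 0.3) reported for the panel application.
rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=200)
h_true = epd_objective(y, 0.0, 1.0, alpha=0.1, beta=0.3, gamma=0.3)
h_far = epd_objective(y, 3.0, 1.0, alpha=0.1, beta=0.3, gamma=0.3)
```

In the $m$-dimensional panel setting the integrals are over $\mathbb{R}^m$ and would instead require Monte Carlo or structured approximations; the one-dimensional case is intended only to show the shape of the criterion.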
The procedure is particularly useful in the present context because calculating the information criteria for the full model with all available covariates would require evaluating an excessively large number of candidate models. The penalized estimation results reveal a consistent subset of covariates that remain relevant across all three objective functions. The variables selected by the LASSO procedure are (bluecol, smsa, married, sex, union, black). Accordingly, the model-selection analysis is restricted to these six covariates.

By examining every non-empty subset of these variables, we obtain $2^6 - 1 = 63$ candidate regression models. For each of these 63 candidate models, the corresponding information criteria, MLIC, DPDIC, and EPDIC, are computed. The models are then ranked in ascending order according to the magnitude of each criterion. From the three ranked lists obtained under MLIC, DPDIC, and EPDIC, the top fifteen candidate models are extracted. These lists are thereupon consolidated to form a combined set of candidate models, within which several models appear repeatedly across the three criteria.

To identify the model specifications that receive the most consistent support across the three criteria, we compute the frequency with which each candidate model appears among the top-ranked models. Based on this frequency, a final set of five leading candidate models is selected. Table 3 presents these models along with their selection frequencies and the corresponding values of EPDIC, DPDIC, and MLIC.

Table 3: Top candidate models based on consolidated information-criterion rankings.

| Model | Freq. | Sel. Freq. | EPDIC | DPDIC | MLIC |
|-------|-------|------------|----------|-----------|----------|
| bluecol, sex, black | 3 | 1.000 | 1435.263 | 229.14054 | 11666.50 |
| smsa, married, sex | 3 | 1.000 | 1892.022 | 370.21159 | 11101.19 |
| bluecol, smsa, sex, union | 2 | 0.667 | 1669.711 | 246.01321 | 21262.38 |
| bluecol, smsa, sex, black | 2 | 0.667 | 1806.677 | 201.14246 | 21313.76 |
| bluecol, smsa, sex, union, black | 2 | 0.667 | 1716.873 | 253.14976 | 21249.34 |

The results indicate that several models receive consistent support across multiple information criteria, suggesting that the identified covariates play a significant role in explaining wage variation. In particular, models involving occupation type (bluecol), regional indicator (smsa), marital status, gender, union membership, and race repeatedly appear among the leading specifications. This consolidated ranking approach provides a robust mechanism for identifying economically meaningful wage determinants while avoiding overfitting and excessive model complexity.

6.2.2. Application: Neural Network Models

Beyond classical parametric and panel-data models, the proposed divergence-based estimation and model-selection framework can be naturally extended to modern machine-learning architectures. In this subsection, we illustrate its applicability through supervised classification using feed-forward neural networks (FFNNs), which are widely employed nonlinear predictive models in statistical learning.

Let $\{(x_i, y_i)\}_{i=1}^{N}$ denote a collection of independent observations, where $x_i \in \mathbb{R}^d$ represents a $d$-dimensional feature vector and $y_i \in \{0, 1\}$ denotes a binary class label indicating the operational state of a machine. In particular,
$$
y_i = \begin{cases} 1, & \text{if a machine failure occurs}, \\ 0, & \text{otherwise}. \end{cases}
$$
To illustrate the proposed framework, we utilize the AI4I 2020 Predictive Maintenance Dataset, a widely used benchmark dataset in predictive maintenance and machine-learning studies.
The dataset contains information on 10,000 industrial machine instances along with several operational and environmental variables that influence machine performance and reliability. The primary objective is to predict whether a machine is likely to fail based on a set of sensor-based covariates. The explanatory variables considered in the analysis include air temperature, process temperature, rotational speed, torque, and tool wear, together with a categorical machine-type indicator reflecting the product quality level. These variables represent key operational characteristics that influence mechanical stress, thermal conditions, and tool degradation in manufacturing environments.

Prior to model estimation, all continuous covariates are standardized to ensure numerical stability and comparable scaling across features. The categorical machine-type variable is encoded using dummy variables. The dataset is verified to contain no missing values.

To examine potential multivariate extremeness among the covariates, we employ the Minimum Covariance Determinant (MCD) estimator. This robust method identifies subsets of observations with minimal covariance determinant and provides high-breakdown estimates of multivariate location and scatter. The corresponding robust Mahalanobis distances are used to detect multivariate outliers. The diagnostic analysis indicates that although some observations exhibit individually extreme sensor values, none deviate substantially in the joint covariate space. Consequently, no observations are removed, and robust estimation procedures are retained to mitigate potential influence from outlying measurements.

Instead of selecting subsets of covariates, model selection in neural networks involves choosing among alternative network architectures. Accordingly, we fix a finite collection of candidate architectures, $\mathcal{A} = \{A_1, A_2, \ldots, A_m\}$, each differing in structural complexity but using all available covariates as inputs. The architectural variation is constructed by altering the number of hidden layers and the number of neurons per layer. Specifically, we consider

- Number of hidden layers $L \in \{1, 2\}$,
- Number of hidden neurons per layer $H \in \{2, 3\}$.

This specification yields a finite set of candidate architectures corresponding to all possible $(L, H)$ combinations. Each architecture employs the same standardized input feature space and a softmax output layer for binary classification. Thus, the candidate models differ only in internal network complexity while retaining identical covariate information.

For a given architecture $A_j$, the predicted class probability for observation $i$ is expressed as
$$
\hat{p}_i = \mathrm{Softmax}\!\left( f_{A_j}(x_i; \theta_j) \right),
$$
where $f_{A_j}(\cdot)$ denotes the feed-forward transformation determined by architecture $A_j$, and $\theta_j$ represents the collection of weights and bias parameters associated with the network. Thus, $\hat{p}_i \in (0, 1)$ represents the predicted probability of machine failure $(y_i = 1)$. The corresponding Bernoulli model implied by the network is given by
$$
f_i(y_i; \theta_j) = \hat{p}_i^{\,y_i} (1 - \hat{p}_i)^{1 - y_i}, \qquad y_i \in \{0, 1\}.
$$
Using the EPD framework, the loss function for neural network estimation is defined as
$$
L^{(\alpha,\beta,\gamma)}(\theta_j) = \frac{1}{N} \sum_{i=1}^{N} V_{\alpha,\beta,\gamma}(y_i; \theta_j), \tag{36}
$$
where the samplewise contribution $V_{\alpha,\beta,\gamma}(\cdot)$ is given by
$$
V_{\alpha,\beta,\gamma}(y_i; \theta_j) = \frac{\beta}{\alpha^2} \sum_{y \in \{0,1\}} \left[ e^{\alpha f_i(y; \theta_j)} \left( \alpha f_i(y; \theta_j) - 1 \right) + 1 \right]
+ \frac{1-\beta}{\gamma} \sum_{y \in \{0,1\}} \left[ f_i^{\gamma+1}(y; \theta_j) - f_i(y; \theta_j) \right]
- \left[ \frac{\beta}{\alpha} \left( e^{\alpha f_i(y_i; \theta_j)} - 1 \right) + \frac{1-\beta}{\gamma} \left( (\gamma+1) f_i(y_i; \theta_j) - 1 \right) \right]. \tag{37}
$$
Since the response is binary, the summations are taken over $y \in \{0, 1\}$, with $f_i(1; \theta_j) = \hat{p}_i$ and $f_i(0; \theta_j) = 1 - \hat{p}_i$.
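Because the response is binary, the sums in (37) are exact and the EPD loss can be written in a few vectorized lines. The sketch below assumes that predicted probabilities $\hat{p}_i$ are already available from some fitted network; the function name is illustrative, not the authors' implementation.

```python
import numpy as np

def epd_loss_binary(y, p_hat, alpha, beta, gamma):
    """EPD loss (36)-(37) for the Bernoulli model f_i(1) = p_hat,
    f_i(0) = 1 - p_hat; the sums over y in {0, 1} are exact."""
    y = np.asarray(y, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    probs = np.stack([1.0 - p_hat, p_hat], axis=1)   # columns: y = 0, y = 1
    term1 = beta / alpha**2 * np.sum(
        np.exp(alpha * probs) * (alpha * probs - 1) + 1, axis=1)
    term2 = (1 - beta) / gamma * np.sum(probs ** (gamma + 1) - probs, axis=1)
    f_obs = np.where(y == 1, p_hat, 1.0 - p_hat)     # f_i(y_i; theta_j)
    inner = beta / alpha * (np.exp(alpha * f_obs) - 1) \
        + (1 - beta) / gamma * ((gamma + 1) * f_obs - 1)
    return float(np.mean(term1 + term2 - inner))

# A well-calibrated predictor attains a smaller loss than a reversed one,
# here with the tuning values (0.1, 0.7, 0.1) reported for this application.
y = np.array([1, 1, 0, 0, 1])
good = epd_loss_binary(y, np.array([0.9, 0.8, 0.1, 0.2, 0.9]), 0.1, 0.7, 0.1)
bad = epd_loss_binary(y, np.array([0.1, 0.2, 0.9, 0.8, 0.1]), 0.1, 0.7, 0.1)
```

Written this way the loss is differentiable in $\hat{p}_i$, so in a deep-learning framework the same expression could serve directly as a training objective in place of cross-entropy.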
The neural network parameters are then estimated by minimizing the above loss function:
$$
\hat{\theta}_j = \arg\min_{\theta_j} L^{(\alpha,\beta,\gamma)}(\theta_j). \tag{38}
$$
The MLE- and DPD-based estimators are obtained analogously by minimizing their respective loss functions over the parameter space.

To obtain stable starting values for the optimization procedure, the network weights are initialized using standard random initialization methods commonly employed in neural network training. Based on these initial values, the optimal tuning parameters for the divergence-based estimators are determined using the generalized score matching method. For the DPD estimator, the optimal tuning parameter is obtained as $\gamma = 0.90$, while for the EPD estimator, the optimal set of tuning parameters is found to be $(\alpha, \beta, \gamma) = (0.1, 0.7, 0.1)$. Using these optimal values, the neural network parameters are estimated for each candidate architecture under the three estimation frameworks, namely MLE, DPDE, and EPDE.

Subsequently, for each architecture in $\mathcal{A}$, the corresponding information criteria, MLIC, DPDIC, and EPDIC, are computed based on the fitted models. The candidate architectures are then ranked in ascending order according to the magnitude of each criterion. From the three ranked lists obtained under MLIC, DPDIC, and EPDIC, the top candidate architectures are extracted and consolidated into a combined set. To identify the architectures that receive the most consistent support across the three criteria, we compute the frequency with which each candidate architecture appears among the top-ranked models. Based on this frequency, a final set of leading neural network architectures is selected. Table 4 presents these architectures together with their selection frequencies and the corresponding values of EPDIC, DPDIC, and MLIC.

Table 4: Top neural network architectures based on consolidated information-criterion rankings.

| Architecture | Freq. | Sel. Freq. | EPDIC | DPDIC | MLIC |
|--------------|-------|------------|-----------|----------|----------|
| A4 (3,3) | 3 | 1.000 | 98853.29  | 330.1793 | 319.0178 |
| A1 (2)   | 2 | 0.667 | 113136.60 | 330.9141 | 180.5058 |
| A2 (3)   | 2 | 0.667 | 114616.58 | 331.4005 | 206.7497 |
| A3 (2,2) | 2 | 0.667 | 104669.26 | 330.2590 | 811.9595 |

The results indicate that several neural network architectures receive consistent support across multiple information criteria. In particular, the architecture with two hidden layers and three neurons in each layer, denoted by $A_4(3,3)$, appears most frequently among the top-ranked models and attains the highest selection frequency. This suggests that moderately deep architectures provide improved flexibility for capturing the nonlinear relationships present in the predictive maintenance dataset. At the same time, simpler architectures such as $A_1(2)$ and $A_2(3)$ also appear among the leading candidates, indicating that parsimonious network structures may still achieve competitive performance under certain criteria. The differences observed across MLIC, DPDIC, and EPDIC reflect the varying emphasis placed on model complexity and robustness by the respective information criteria.

Overall, this consolidated ranking approach provides a robust and systematic mechanism for selecting an appropriate neural network architecture while balancing predictive performance and model complexity. The procedure ensures that the selected architecture is supported consistently across multiple divergence-based information criteria, thereby enhancing the reliability of the model-selection process in predictive maintenance applications.

7. Conclusion

In this paper, we proposed a robust information criterion based on the Exponential-Polynomial Divergence, aimed at addressing the limitations of classical likelihood-based model selection in the presence of contamination and model misspecification.
The proposed EPDIC leverages the flexibility and robustness of the underlying divergence to provide a reliable alternative for model selection across a wide range of settings. We established its theoretical properties and demonstrated, through influence function analysis, that the criterion exhibits desirable robustness, with controlled sensitivity to outliers.

To ensure practical applicability, we incorporated a data-driven tuning-parameter selection mechanism based on generalized score matching, thereby improving stability over existing methods. The effectiveness of the proposed approach was validated through extensive simulation studies and real data applications, including panel data models and neural network-based prediction tasks. The results consistently indicated that EPDIC outperforms classical and existing divergence-based criteria in terms of stability and reliability under contaminated and misspecified scenarios.

Overall, the proposed framework provides a unified and practically viable approach to robust model selection. Future work may explore its extension to more complex high-dimensional models and other structured learning frameworks.

Data availability

The panel dataset used in this study is the Panel Data of Individual Wages from the Panel Study of Income Dynamics (PSID), available online at the R datasets repository: Panel Data of Individual Wages. The AI4I 2020 Predictive Maintenance dataset is obtained from the UCI Machine Learning Repository and can be accessed at: AI4I 2020 Predictive Maintenance.

Declaration of competing interest

The authors confirm that they have no recognized financial conflicts of interest or personal relationships that could have influenced the work presented in this paper.

Funding

This research is not supported by any specific grant from public, commercial, or non-profit funding organizations.

References

[1] H. Akaike, Information theory and an extension of the maximum likelihood principle, in: Selected Papers of Hirotugu Akaike, Springer, 1998, pp. 199-213.
[2] K. Takeuchi, Distribution of information number statistics and criteria for adequacy of models, Mathematical Sciences 153 (1976) 12-18.
[3] T. Matsuda, M. Uehara, A. Hyvärinen, Information criteria for non-normalized models, Journal of Machine Learning Research 22 (158) (2021) 1-33.
[4] K. Mattheou, S. Lee, A. Karagrigoriou, A model selection criterion based on the BHHJ measure of divergence, Journal of Statistical Planning and Inference 139 (2) (2009) 228-235.
[5] A. Basu, I. R. Harris, N. L. Hjort, M. Jones, Robust and efficient estimation by minimising a density power divergence, Biometrika 85 (3) (1998) 549-559.
[6] A. Ghosh, A. Basu, A new family of divergences originating from model adequacy tests and application to robust statistical inference, IEEE Transactions on Information Theory 64 (8) (2018) 5581-5591.
[7] A. Ghosh, S. Majumdar, Ultrahigh-dimensional robust and efficient sparse regression using non-concave penalized density power divergence, IEEE Transactions on Information Theory 66 (12) (2020) 7812-7827.
[8] S. Ray, S. Pal, S. K. Kar, A. Basu, Characterizing the functional density power divergence class, IEEE Transactions on Information Theory 69 (2) (2022) 1141-1146.
[9] P. Mantalos, K. Mattheou, A. Karagrigoriou, An improved divergence information criterion for the determination of the order of an AR process, Communications in Statistics - Simulation and Computation 39 (5) (2010) 865-879.
[10] P. Singh, A. Mandal, A. Basu, Robust inference using the exponential-polynomial divergence, Journal of Statistical Theory and Practice 15 (2) (2021) 29.
[11] B. Kim, S. Lee, Robust estimation for general integer-valued autoregressive models based on the exponential-polynomial divergence, Journal of Statistical Computation and Simulation 94 (6) (2024) 1300-1316.
[12] J. Warwick, M. Jones, Choosing a robustness tuning parameter, Journal of Statistical Computation and Simulation 75 (7) (2005) 581-588.
[13] S. Basak, A. Basu, M. Jones, On the 'optimal' density power divergence tuning parameter, Journal of Applied Statistics 48 (2021) 536-556.
[14] U. Goswami, S. Mondal, Inequality restricted minimum density power divergence estimation in panel count data, Applied Mathematical Modelling (2025) 116371.
[15] J. Xu, J. L. Scealy, A. T. Wood, T. Zou, Generalized score matching, Journal of Multivariate Analysis 210 (2025) 105473.
[16] A. Hyvärinen, P. Dayan, Estimation of non-normalized statistical models by score matching, Journal of Machine Learning Research 6 (2005).
[17] L. M. Bregman, The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, USSR Computational Mathematics and Mathematical Physics 7 (3) (1967) 200-217.
[18] A. Banerjee, S. Merugu, I. S. Dhillon, J. Ghosh, Clustering with Bregman divergences, Journal of Machine Learning Research 6 (Oct) (2005) 1705-1749.
[19] A. Cichocki, S.-i. Amari, Families of alpha-, beta-, and gamma-divergences: Flexible and robust measures of similarities, Entropy 12 (6) (2010) 1532-1568.
[20] F. Nielsen, R. Nock, On the chi square and higher-order chi distances for approximating f-divergences, IEEE Signal Processing Letters 21 (1) (2013) 10-13.
[21] S. Kullback, R. A. Leibler, On information and sufficiency, The Annals of Mathematical Statistics 22 (1) (1951) 79-86.
[22] A. Ghosh, A. Basu, Robust estimation for independent non-homogeneous observations using density power divergence with applications to linear regression, Electronic Journal of Statistics 7 (2013) 2420-2456. doi:10.1214/13-EJS847.
[23] Q. T. Dinh, M. Diehl, Local convergence of sequential convex programming for nonconvex optimization, in: Recent Advances in Optimization and its Applications in Engineering: The 14th Belgian-French-German Conference on Optimization, Springer, 2010, pp. 93-102.
[24] P. J. Rousseeuw, K. V. Driessen, A fast algorithm for the minimum covariance determinant estimator, Technometrics 41 (3) (1999) 212-223.
[25] M. Hubert, P. J. Rousseeuw, S. Verboven, Robust PCA for high-dimensional data, in: R. Dutter, P. Filzmoser, U. Gather, P. J. Rousseeuw (Eds.), Developments in Robust Statistics, Physica-Verlag HD, Heidelberg, 2003, pp. 169-179.
