An Exponential-Polynomial Divergence-based Robust Information Criterion for Linear Panel Data Models and Neural Networks


Authors: Udita Goswami, Shuvashree Mondal

Udita Goswami^a, Shuvashree Mondal*^a

^a Department of Mathematics and Computing, IIT (Indian School of Mines) Dhanbad, 826004, Jharkhand, India

Abstract

Model selection is a cornerstone of statistical inference, where information criteria are widely employed to balance model fit and complexity. However, classical likelihood-based criteria are often highly sensitive to contamination, outliers, and model misspecification. In this paper, we develop a robust alternative based on the Exponential-Polynomial Divergence, a flexible extension of existing divergence measures that enhances adaptability to diverse data irregularities. The proposed Exponential-Polynomial Divergence Information Criterion preserves the objective of approximating the discrepancy between the true model and candidate models while incorporating robustness against anomalous observations. Its theoretical properties are established, and robustness is examined through influence function analysis, demonstrating controlled sensitivity to extreme data points. For practical implementation, a data-driven tuning parameter selection strategy based on generalized score matching is employed, ensuring improved computational stability and efficiency. The effectiveness of the proposed method is demonstrated through extensive simulation studies under varying contamination levels, as well as real data applications involving linear mixed-effects panel data models and neural network-based prediction tasks. The results consistently show improved stability and reliability compared to classical likelihood and density power divergence-based information criteria. The proposed framework thus provides a practical and unified approach for model selection in complex and contaminated data settings.
Keywords: Exponential-polynomial divergence, robust information criterion, model selection, influence function, score matching, panel data models, neural networks

1. Introduction

Information criteria form a fundamental class of tools for model selection, balancing goodness-of-fit and model complexity within an information-theoretic framework. Their construction is rooted in likelihood-based inference, where the objective is to maximize the log-likelihood while accounting for model complexity, leading to an approximation of the expected Kullback–Leibler divergence between the true model and a candidate model. The Akaike Information Criterion (AIC), proposed by Hirotugu Akaike, provides an approximately unbiased estimator of this divergence under correct model specification, whereas the Takeuchi Information Criterion (TIC) extends this idea to settings where the model may be misspecified [1, 2]. Recent work further shows that, for non-normalized models where the likelihood is intractable, information criteria can still be formulated as approximately unbiased estimators of suitable discrepancy measures, thereby enabling principled model selection beyond the classical likelihood framework [3].

Despite their widespread applicability and theoretical appeal, classical information criteria are often unsuitable in the presence of contamination, outliers, or model misspecification, as they rely on likelihood-based objective functions that are inherently non-robust. Even a small fraction of aberrant observations can significantly distort parameter estimates and, consequently, the model selection outcome. As a result, criteria such as AIC may lead to unreliable conclusions when the assumed model deviates from reality.
This limitation is particularly relevant in practice, where data from socio-economic systems, industrial processes, healthcare, and environmental studies are frequently affected by noise, anomalies, and departures from ideal assumptions, highlighting the need for more robust alternatives.

In an effort to address the lack of robustness in likelihood-based model selection, [4] introduced a divergence-based information criterion (DIC) based on a generalized measure of statistical discrepancy, closely linked to the Density Power Divergence (DPD) framework of [5]. The DPD and its subsequent developments [6, 7, 8] provide a robust alternative to likelihood by downweighting the influence of atypical observations. While this marks a clear departure from classical likelihood-based approaches, the practical utility of such divergence-based criteria remains underexplored. Existing studies on DIC, including improvements and applications such as [9], are largely confined to controlled simulations, with limited validation on real-world datasets. This reveals a gap in understanding their empirical performance and highlights the need for more practically viable and robust model selection frameworks.

Motivated by the growing emphasis on robust statistical methodologies, we propose a novel information criterion grounded in the Exponential-Polynomial Divergence (EPD), a flexible divergence family introduced by [10]. The EPD extends the Density Power Divergence by incorporating an exponential-polynomial structure, allowing greater flexibility in handling diverse contamination patterns and improving robustness without significant loss of efficiency, as also demonstrated in recent applications [11].
Building upon this framework, we propose the Exponential-Polynomial Divergence Information Criterion (EPDIC), which leverages the robustness of EPD to enable reliable model selection under contamination and model misspecification, providing a practical alternative to existing likelihood- and DPD-based criteria.

A key contribution of this work is the assessment of the robustness of the proposed EPDIC using the influence function, a standard tool for measuring sensitivity to contamination. By analyzing its influence function, we characterize the local robustness of EPDIC and its resistance to outliers. In particular, the boundedness of the influence function demonstrates that the proposed criterion effectively controls the impact of extreme observations, thereby inheriting strong robustness properties from the underlying divergence structure and ensuring stability under contamination.

The choice of optimal tuning parameters is crucial in divergence-based methods, as it governs the trade-off between robustness and statistical efficiency. Classical approaches, such as the Warwick and Jones criterion [12] and its iterative extension [13], are widely used but often suffer from computational burden and instability, particularly in complex or high-dimensional settings. Recent studies [14] indicate that score-based approaches can provide improved and more reliable performance. Motivated by this, we adopt the generalized score matching (GSM) framework of [15], which builds on the original method of [16]. By avoiding the need for normalization constants and relying on derivatives of log-densities, this approach offers a computationally efficient and stable mechanism for obtaining data-adaptive tuning parameter estimates.
To substantiate the practical relevance of the proposed methodology, we conduct an extensive empirical investigation encompassing both controlled simulations and real-world data applications, which form the principal crux of this study. The simulation framework is carefully designed to assess the finite-sample performance of the proposed exponential-polynomial divergence information criterion across varying levels of contamination, thereby providing a comprehensive understanding of its robustness. Beyond simulations, we examine the efficacy of EPDIC in the context of linear mixed-effects panel data models, where the underlying objective function is constructed via the density power divergence measure to accommodate unobserved heterogeneity and potential contamination. In parallel, we explore its applicability in modern machine learning settings, particularly in neural network-based prediction tasks, where the conventional loss functions are replaced or augmented by divergence-based counterparts to enhance robustness against noisy and corrupted inputs. These applications are novel in the sense that the performance of robust information criteria has seldom been systematically evaluated across such diverse real-world scenarios.

In both simulation and empirical analyses, we provide a thorough comparative assessment of EPDIC against classical likelihood-based criteria and existing divergence-based alternatives, including those derived from the density power divergence, across multiple contamination regimes. The results consistently demonstrate the superior stability and reliability of the proposed criterion in challenging data environments.

The remainder of this paper is organized as follows. In Section 2, we rigorously introduce the Exponential-Polynomial Divergence estimator. Section 3 presents the formulation of the EPDIC along with its theoretical properties.
In Section 4, we investigate the robustness characteristics of the proposed criterion through an influence function analysis. Section 5 is devoted to the selection of optimal tuning parameters using the GSM approach. Section 6 provides an extensive empirical evaluation, including both simulation studies and real data applications. Finally, Section 7 concludes the paper with a summary of findings and potential directions for future research.

2. Exponential-Polynomial Divergence Estimator

The Exponential-Polynomial Divergence (EPD), proposed by [10], represents a unified and generalized class of Bregman divergences, capable of encompassing several well-known divergence measures such as the Density Power Divergence (DPD), Bregman Exponential Divergence (BED), and Kullback-Leibler (KL) divergence as special cases. Under the assumptions originally stated by [17], the Bregman divergence between two density functions g and f is defined as

\[
D_B(g, f) = \int \Big[ B\big(g(x)\big) - B\big(f(x)\big) - \big\{ g(x) - f(x) \big\}\, B'\big(f(x)\big) \Big]\, dx, \tag{1}
\]

where B(·) is a strictly convex function and B'(·) denotes its derivative with respect to its argument. The choice of the convex generating function B determines the specific form of the divergence; different selections of B lead to distinct members within the Bregman divergence family (see, e.g., [18], [19]).

To generalize the existing divergence families, [10] introduced a convex generating function of the form

\[
B(x) = \frac{\beta}{\alpha^2}\big(e^{\alpha x} - 1 - \alpha x\big) + \frac{1-\beta}{\gamma}\big(x^{\gamma+1} - x\big), \qquad \alpha \in \mathbb{R},\ \beta \in [0,1],\ \gamma \ge 0, \tag{2}
\]

where α, β, and γ are tuning parameters that control the contributions of the exponential and polynomial components. The EPD connects several well-known divergences through particular parameter choices:

• When β = 0, the EPD reduces to the Density Power Divergence (DPD) of [5] with parameter γ.
• When β = 1, it coincides with the Bregman Exponential Divergence (BED) of [20] with parameter α.

• For intermediate values 0 < β < 1, it provides a convex combination of BED and DPD.

• Further, when β = 0 and γ → 0, the divergence converges to the Kullback-Leibler divergence [21].

Thus, the EPD offers a smooth continuum of divergence measures linking the exponential-type and power-type divergences through its three-parameter structure.

2.1. Estimation under Independent Non-homogeneous Observations

Consider independent but not identically distributed observations Y_1, Y_2, ..., Y_n, where each Y_i follows a potentially different density g_i and is modeled by a parametric family F_{i,θ} = { f_i(·; θ) | θ ∈ Θ ⊆ R^p }, sharing the same common parameter θ. The goal is to estimate θ robustly by minimizing the average exponential-polynomial divergence between the true and model densities, which, up to additive terms independent of θ, leads to

\[
\frac{1}{n}\sum_{i=1}^{n} D_{EP}\big(g_i, f_i(\cdot\,;\theta)\big)
= \frac{1}{n}\sum_{i=1}^{n}\left[\int \Big\{ B'\big(f_i(y;\theta)\big)\, f_i(y;\theta) - B\big(f_i(y;\theta)\big) \Big\}\, dy - B'\big(f_i(Y_i;\theta)\big)\right], \tag{3}
\]

where D_{EP}(·,·) denotes the divergence corresponding to the generating function B(·) in (2). Since only a single observation Y_i is available from each density g_i, we approximate g_i by the degenerate empirical distribution that places unit mass at Y_i. The resulting empirical objective function for estimation becomes

\[
H_n^{(\alpha,\beta,\gamma)}(\theta) = \frac{1}{n}\sum_{i=1}^{n} V_{\alpha,\beta,\gamma}(Y_i;\theta), \tag{4}
\]

where V_{α,β,γ}(Y_i; θ) represents the samplewise contribution to the divergence and combines exponential and polynomial components as

\[
V_{\alpha,\beta,\gamma}(Y_i;\theta)
= \int \left[ \frac{\beta}{\alpha^2}\Big\{ e^{\alpha f_i(y;\theta)}\big(\alpha f_i(y;\theta) - 1\big) + 1 \Big\} + (1-\beta)\, f_i^{\,1+\gamma}(y;\theta) \right] dy
- \left[ \frac{\beta}{\alpha}\Big(e^{\alpha f_i(Y_i;\theta)} - 1\Big) + \frac{1-\beta}{\gamma}\Big((\gamma+1)\, f_i^{\,\gamma}(Y_i;\theta) - 1\Big) \right]. \tag{5}
\]

Minimizing H_n^{(α,β,γ)}(θ) with respect to θ yields the Minimum Exponential-Polynomial Divergence Estimator (MEPDE).
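As a quick numerical illustration of the special cases listed above, the following sketch (NumPy; the function name and grid are ours, not from the paper) evaluates the generating function B in (2) and checks the DPD reduction at β = 0, the KL limit as γ → 0, and convexity on a grid:

```python
import numpy as np

def B(x, alpha, beta, gamma):
    """Convex generating function of the EPD, Eq. (2):
    B(x) = (beta/alpha^2)(e^{alpha x} - 1 - alpha x)
         + ((1-beta)/gamma)(x^{gamma+1} - x)."""
    exp_part = (beta / alpha**2) * (np.exp(alpha * x) - 1.0 - alpha * x)
    poly_part = ((1.0 - beta) / gamma) * (x**(gamma + 1.0) - x)
    return exp_part + poly_part

x = np.linspace(0.05, 2.0, 200)

# beta = 0: only the polynomial part survives (DPD generator with parameter gamma).
b_dpd = B(x, alpha=1.0, beta=0.0, gamma=0.5)
assert np.allclose(b_dpd, (x**1.5 - x) / 0.5)

# beta = 0, gamma -> 0: (x^{gamma+1} - x)/gamma -> x log x, the KL generator.
b_small_gamma = B(x, alpha=1.0, beta=0.0, gamma=1e-6)
assert np.allclose(b_small_gamma, x * np.log(x), atol=1e-4)

# Convexity check: second differences of B are non-negative on a uniform grid.
for params in [(1.0, 1.0, 0.5), (0.5, 0.3, 0.7), (-1.0, 0.5, 1.0)]:
    vals = B(x, *params)
    assert np.all(np.diff(vals, 2) > -1e-10)
```

Both components are individually convex (the exponential part has second derivative βe^{αx} > 0 and the polynomial part (1−β)(γ+1)x^{γ−1} ≥ 0 for x > 0), which is why any mixture 0 ≤ β ≤ 1 remains a valid Bregman generator.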
Now, differentiating H_n^{(α,β,γ)}(θ) with respect to θ, we obtain the estimating equation

\[
\frac{1}{n}\sum_{i=1}^{n}\left[ u_i(Y_i;\theta)\, B''\big(f_i(Y_i;\theta)\big)\, f_i(Y_i;\theta) - \int u_i(y;\theta)\, B''\big(f_i(y;\theta)\big)\, f_i^{\,2}(y;\theta)\, dy \right] = 0, \tag{6}
\]

where u_i(y;θ) = ∂/∂θ log f_i(y;θ) is the likelihood score function of the i-th sample. Further, we can express the proposed formulation as a weighted likelihood estimating equation, viz.,

\[
\frac{1}{n}\sum_{i=1}^{n}\left[ u_i(Y_i;\theta)\, w\big(f_i(Y_i;\theta)\big) - \int u_i(y;\theta)\, w\big(f_i(y;\theta)\big)\, f_i(y;\theta)\, dy \right] = 0, \tag{7}
\]

where the corresponding weight function is of the form w(t) = β t e^{αt} + (1−β)(1+γ) t^γ. The above estimating equation generalizes the minimum density power divergence estimator (MDPDE) for non-homogeneous observations proposed by [22], while preserving the bias-variance trade-off controlled by (α, β, γ). When β = 0, the MEPDE reduces to the MDPDE, and as (β, γ, α) → (0, 0, 0), it simplifies to the maximum likelihood estimating equation:

\[
\frac{1}{n}\sum_{i=1}^{n} u_i(Y_i;\theta) = 0. \tag{8}
\]

From a functional perspective, the minimum exponential-polynomial divergence functional for independent non-homogeneous data is defined as

\[
T_{\alpha,\beta,\gamma}(G_1,\ldots,G_n) = \arg\min_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^{n} D_{EP}\big(g_i, f_i(\cdot\,;\theta)\big). \tag{9}
\]

Since D_{EP}(·,·) is a valid divergence (non-negative, and equal to zero if and only if g_i = f_i(·;θ)), the functional T_{α,β,γ}(G_1,...,G_n) is Fisher consistent under model identifiability, satisfying

\[
T_{\alpha,\beta,\gamma}(F_{1,\theta_0},\ldots,F_{n,\theta_0}) = \theta_0. \tag{10}
\]

Hence, the proposed MEPDE serves as a natural and robust generalization of the MDPDE for independently but non-identically distributed data, combining the strengths of both exponential and polynomial divergence families within a unified estimation framework.

2.2. Asymptotic Properties

To study the asymptotic behavior of the Minimum Exponential-Polynomial Divergence Estimator (MEPDE) under independent non-homogeneous observations, let

\[
\theta^g = \arg\min_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^{n} D_{EP}\big(g_i, f_i(\cdot\,;\theta)\big)
\]

denote the best-fitting parameter that minimizes the average exponential-polynomial divergence between the true and model densities. Let us introduce a p × p matrix denoted by J_i, whose (k,l)-th entry is defined through the second-order partial derivative with respect to the k-th and l-th components of θ. In addition, we define the quantities K_i and ξ_i as given below, writing u_i ≡ u_i(y;θ^g), f_i ≡ f_i(y;θ^g), and I_i ≡ I_i(y;θ^g) = −∂u_i(y;θ)/∂θ^⊤ |_{θ=θ^g} (the information function) for brevity:

\[
\begin{aligned}
J_i ={}& \beta \int f_i^{2}\, e^{\alpha f_i}\, u_i u_i^\top\, dy
+ (1-\beta)(1+\gamma) \int f_i^{\gamma+1}\, u_i u_i^\top\, dy \\
&+ (1-\beta)(1+\gamma) \int \big[g_i(y) - f_i\big] \big\{ I_i - \gamma\, u_i u_i^\top \big\}\, f_i^{\gamma}\, dy \\
&+ \beta \int \big[g_i(y) - f_i\big] \big\{ I_i - u_i u_i^\top \big\}\, f_i\, e^{\alpha f_i}\, dy \\
&- \alpha\beta \int \big[g_i(y) - f_i\big]\, f_i^{2}\, e^{\alpha f_i}\, u_i u_i^\top\, dy,
\end{aligned} \tag{11}
\]

\[
K_i = \int u_i u_i^\top \big\{ \beta f_i\, e^{\alpha f_i} + (1-\beta)(1+\gamma) f_i^{\gamma} \big\}^2 g_i(y)\, dy - \xi_i \xi_i^\top, \tag{12}
\]

where

\[
\xi_i = \int u_i \big\{ \beta f_i\, e^{\alpha f_i} + (1-\beta)(1+\gamma) f_i^{\gamma} \big\}\, g_i(y)\, dy. \tag{13}
\]

Note that the weight appearing in braces is exactly w(f_i) from (7), so that K_i is the covariance of the samplewise score contribution w(f_i(Y_i))u_i(Y_i).

Regularity Conditions. To establish consistency and asymptotic normality, we assume conditions analogous to those in [22] and [10]:

(R1) The support X = { y : f_i(y;θ) > 0 } is independent of both i and θ, for all i = 1, 2, ..., n, and the true densities g_i are also supported on X.

(R2) There exists an open subset Θ_0 ⊆ Θ containing the best-fitting parameter θ^g such that, for almost all y ∈ X and all θ ∈ Θ_0, the densities f_i(y;θ), i = 1, 2, ..., n, are three times continuously differentiable with respect to θ, and all third-order partial derivatives are continuous in θ.

(R3) For each i = 1, 2, ..., n, and for the generating function B(·) defined in (2), the integrals

\[
\int B'\big(f_i(y;\theta)\big)\, f_i(y;\theta)\, dy \qquad \text{and} \qquad \int B'\big(f_i(y;\theta)\big)\, g_i(y)\, dy
\]

can be differentiated three times with respect to θ, and differentiation can be interchanged with integration.

(R4) For each i = 1, 2, ..., n, the matrix J_i defined in (11) is positive definite. We define the average information matrix

\[
\Psi_n = \frac{1}{n}\sum_{i=1}^{n} J_i.
\]

Assume that the sequence {Ψ_n} converges to a positive definite limit Ψ = lim_{n→∞} Ψ_n, and that λ_0 = λ_min(Ψ) > 0.

(R5) There exist measurable bounding functions M^{(i)}_{jkl}(y) such that, for all θ ∈ Θ_0,

\[
\left| \frac{\partial^3}{\partial\theta_j\, \partial\theta_k\, \partial\theta_l} V_{\alpha,\beta,\gamma}(y;\theta) \right| \le M^{(i)}_{jkl}(y),
\]

where V_{α,β,γ}(·;θ) is defined in (5), and (1/n) Σ_{i=1}^n E_{g_i}[ M^{(i)}_{jkl}(Y_i) ] = O(1) for all j, k, l.

(R6) For all j and k, the following uniform integrability conditions hold for the first and second derivatives of V_{α,β,γ}(Y_i;θ):

\[
\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} E_{g_i}\!\left[ \left|\frac{\partial V_{\alpha,\beta,\gamma}(Y_i;\theta)}{\partial\theta_j}\right| \mathbf{1}\!\left\{ \left|\frac{\partial V_{\alpha,\beta,\gamma}(Y_i;\theta)}{\partial\theta_j}\right| > \epsilon\sqrt{n} \right\} \right] = 0,
\]

\[
\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} E_{g_i}\!\left[ \left| \frac{\partial^2 V_{\alpha,\beta,\gamma}(Y_i;\theta)}{\partial\theta_j\,\partial\theta_k} - E_{g_i}\!\left( \frac{\partial^2 V_{\alpha,\beta,\gamma}(Y_i;\theta)}{\partial\theta_j\,\partial\theta_k} \right) \right| \mathbf{1}\!\left\{ \left| \frac{\partial^2 V_{\alpha,\beta,\gamma}(Y_i;\theta)}{\partial\theta_j\,\partial\theta_k} - E_{g_i}\!\left( \frac{\partial^2 V_{\alpha,\beta,\gamma}(Y_i;\theta)}{\partial\theta_j\,\partial\theta_k} \right) \right| > \epsilon\sqrt{n} \right\} \right] = 0.
\]

(R7) (Lindeberg-Feller condition) For every ϵ > 0,

\[
\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} E_{g_i}\!\left[ \left\| \Omega_n^{-1/2}\, \frac{\partial}{\partial\theta} V_{\alpha,\beta,\gamma}(Y_i;\theta) \right\|^2 \mathbf{1}\!\left\{ \left\| \Omega_n^{-1/2}\, \frac{\partial}{\partial\theta} V_{\alpha,\beta,\gamma}(Y_i;\theta) \right\| > \epsilon\sqrt{n} \right\} \right] = 0,
\]

where Ω_n = (1/n) Σ_{i=1}^n K_i, Ω = lim_{n→∞} Ω_n, and the limiting matrix Ω is positive definite with λ_min(Ω) > 0.

Theorem 1.
Under conditions (R1)–(R7), the following results hold:

1. There exists a consistent sequence of roots θ̂_n^{(α,β,γ)} of the estimating equation (6) such that θ̂_n^{(α,β,γ)} converges in probability to θ^g.

2. The Minimum Exponential-Polynomial Divergence estimator is asymptotically normal, with

\[
\sqrt{n}\,\big(\hat\theta_n^{(\alpha,\beta,\gamma)} - \theta^g\big) \xrightarrow{d} N_p\big(0,\ \Psi^{-1}\Omega\Psi^{-1}\big). \tag{14}
\]

Thus, under standard smoothness and identifiability assumptions, the MEPDE for independent non-homogeneous observations is consistent and asymptotically normal, providing a unified robust inference framework within the exponential-polynomial divergence family.

Remark. The proof of this theorem follows arguments analogous to those used in Theorem 3.1 of [22], with appropriate modifications to incorporate the exponential-polynomial divergence structure.

3. Exponential-Polynomial Divergence Information Criterion

Model selection criteria built upon divergence measures provide a systematic approach to weigh model fit against complexity, especially in situations where data contamination, heavy tails, or structural heterogeneity can distort likelihood-based criteria. Divergence measures quantify how far a proposed model departs from the underlying data-generating distribution, and when the chosen divergence is itself resistant to outlying observations, the resulting information criterion naturally becomes more stable. In this context, the exponential-polynomial divergence (EPD), introduced in Section 2, serves as a versatile foundation for robust model selection. Its three tuning parameters (α, β, γ) allow it to encompass several standard divergences (including the DPD, BED, and KL divergence) while offering additional flexibility to moderate the impact of outliers or model misspecification. Building on this adaptability, we propose a robust information criterion based on the exponential-polynomial divergence, abbreviated as EPDIC.
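To make the robustness behind Theorem 1 concrete, the following minimal sketch (our own illustration, not the authors' code; the tuning values (α, β, γ) = (0.5, 0.3, 0.5) are arbitrary demo choices) computes the empirical EPD objective of (4)-(5) for a homogeneous N(θ, 1) model on a contaminated sample, evaluating the y-integral in (5) by a simple Riemann sum, and compares the resulting MEPDE with the non-robust sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# Contaminated sample: 90 draws from N(0, 1) plus 10 gross outliers at y = 10.
y = np.concatenate([rng.normal(0.0, 1.0, 90), np.full(10, 10.0)])

def epd_objective(theta, y, alpha=0.5, beta=0.3, gamma=0.5):
    """Empirical EPD objective H_n(theta) of Eqs. (4)-(5) for a N(theta, 1)
    model; the y-integral in Eq. (5) is approximated on a uniform grid."""
    grid = np.linspace(theta - 10.0, theta + 10.0, 2001)
    dx = grid[1] - grid[0]
    f_grid = np.exp(-0.5 * (grid - theta) ** 2) / np.sqrt(2 * np.pi)
    integrand = (beta / alpha**2) * (np.exp(alpha * f_grid) * (alpha * f_grid - 1) + 1) \
        + (1 - beta) * f_grid ** (1 + gamma)
    integral = integrand.sum() * dx
    f_y = np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi)
    inner = (beta / alpha) * (np.exp(alpha * f_y) - 1) \
        + ((1 - beta) / gamma) * ((gamma + 1) * f_y**gamma - 1)
    return integral - inner.mean()

# Grid search for the MEPDE (deterministic, avoids local-minimum issues).
thetas = np.linspace(-3.0, 3.0, 121)
mepde = thetas[np.argmin([epd_objective(t, y) for t in thetas])]
mle = y.mean()  # the non-robust fit, dragged toward the outliers

assert abs(mepde) < 0.5  # robust estimate stays near the true mean 0
assert mle > 0.6         # sample mean is pulled upward by the contamination
```

The outliers sit in a region where the model density f is essentially zero, so their contributions to the EPD objective are heavily downweighted, while every observation enters the likelihood (here, the sample mean) with full weight.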
This criterion extends the ideas underlying AIC and later divergence-based methods, including the DIC of [4], but incorporates the enhanced robustness properties of the EPD family. Let θ̂^{(α,β,γ)} = argmin_{θ∈Θ} H^{(α,β,γ)}(θ) represent the Minimum Exponential-Polynomial Divergence Estimator (MEPDE). To construct an information criterion appropriate for EPD, one requires an asymptotically unbiased estimator of the expected overall divergence between the true density and the fitted parametric family, evaluated at θ̂^{(α,β,γ)}. Following the logic of AIC, this involves expanding the population divergence around the true parameter and deriving a model-complexity penalty whose structure depends on the curvature of the EPD functional. The resulting penalty depends explicitly on the robustness parameters (α, β, γ), and it collapses to the usual AIC penalty when these parameters take the values corresponding to the Kullback-Leibler divergence. We now describe the derivation of EPDIC.

In the present setting, the expected discrepancy between the true data-generating density g and the fitted model f_{θ̂} is quantified using the EPD functional

\[
D_{\hat\theta} \equiv H^{(\alpha,\beta,\gamma)}\big(g, f_{\hat\theta}\big) = \int D_{EP}\big(g(y), f_{\hat\theta}(y)\big)\, dy.
\]

Let θ_0 denote the true parameter such that g = f_{θ_0} under correct specification. The criterion we wish to evaluate is the population quantity E[D(θ̂)], which cannot be computed directly. To relate it to observable components, we consider the empirical divergence functional

\[
\widehat{D}(\theta) = H_n^{(\alpha,\beta,\gamma)}(\theta), \qquad \widehat{D}(\hat\theta) = H_n^{(\alpha,\beta,\gamma)}(\hat\theta),
\]

where θ̂ is the minimum exponential-polynomial divergence estimator (MEPDE).
We now employ a Taylor expansion of D(θ) about θ_0,

\[
D(\hat\theta) = D(\theta_0) + (\hat\theta - \theta_0)^\top \left.\frac{\partial D(\theta)}{\partial\theta}\right|_{\theta=\theta_0}
+ \frac{1}{2}(\hat\theta - \theta_0)^\top \left.\frac{\partial^2 D(\theta)}{\partial\theta\,\partial\theta^\top}\right|_{\theta=\theta_0} (\hat\theta - \theta_0)
+ o\big(\|\hat\theta - \theta_0\|^2\big). \tag{15}
\]

Since θ_0 minimizes the population divergence, the first-order term vanishes: ∂D(θ)/∂θ |_{θ=θ_0} = 0. Taking expectations, we get

\[
E[D(\hat\theta)] = E[D(\theta_0)] + \frac{1}{2}\, E\!\left[ (\hat\theta - \theta_0)^\top \left.\frac{\partial^2 D(\theta)}{\partial\theta\,\partial\theta^\top}\right|_{\theta=\theta_0} (\hat\theta - \theta_0) \right] + o(n^{-1}). \tag{16}
\]

Under the regularity conditions established for the EPD framework, the estimator θ̂ satisfies √n (θ̂ − θ_0) →_d N_p(0, Ψ^{-1}ΩΨ^{-1}). Also note that ∂²D(θ)/∂θ∂θ^⊤ |_{θ=θ_0} = Ψ. Substituting this asymptotic behavior into the Taylor expansion gives

\[
E[D(\hat\theta)] = D(\theta_0) + \frac{1}{2}\,\mathrm{tr}\Big( \Psi\, E\big[ (\hat\theta - \theta_0)(\hat\theta - \theta_0)^\top \big] \Big) + o(n^{-1})
= D(\theta_0) + \frac{1}{2n}\,\mathrm{tr}\big( \Omega\Psi^{-1} \big) + o(n^{-1}). \tag{17}
\]

This result is directly analogous to the Takeuchi Information Criterion (TIC) expansion [2]. Finally, replacing the unobservable D(θ_0) by its empirical counterpart H_n^{(α,β,γ)}(θ̂) and multiplying both sides by n yields the exponential-polynomial divergence information criterion

\[
\mathrm{EPDIC} \approx n\, H_n^{(\alpha,\beta,\gamma)}(\hat\theta) + \mathrm{tr}\big( \Omega\Psi^{-1} \big). \tag{18}
\]

The first term in (18) represents the empirical divergence and therefore measures the goodness of fit of the model under the chosen exponential-polynomial divergence. The second term, tr(ΩΨ^{-1}), serves as a model-complexity adjustment, playing the role of the asymptotic bias correction in the same spirit as Takeuchi's information criterion. Hence, EPDIC provides a flexible continuum of model-selection tools that unifies TIC and divergence measures from the exponential-polynomial family.

4. Influence Function of the EPDIC

We now study the robustness properties of the proposed exponential-polynomial divergence information criterion (EPDIC) through its influence function, which quantifies the infinitesimal effect of a small contamination at a point on the value of the criterion. Throughout, we work under the independent non-homogeneous framework described in Section 2. It is worth emphasizing that the proposed EPDIC is defined as

\[
\mathrm{EPDIC} = n\, H_n^{(\alpha,\beta,\gamma)}(\hat\theta) + \mathrm{tr}\big( \Omega\Psi^{-1} \big),
\]

where θ̂ denotes the Minimum Exponential-Polynomial Divergence Estimator (MEPDE). Let G = (G_1, ..., G_n) denote the collection of true distributions, and consider a contaminated version of the i-th component given by

\[
G_{i,\varepsilon} = (1-\varepsilon) G_i + \varepsilon \Delta_y, \tag{19}
\]

where Δ_y is the degenerate distribution at y and ε ↓ 0. The influence function of the EPDIC at the contamination point y is defined as

\[
\mathrm{IF}(y; \mathrm{EPDIC}, G) = \left.\frac{d}{d\varepsilon}\, \mathrm{EPDIC}(G_1, \ldots, G_{i,\varepsilon}, \ldots, G_n)\right|_{\varepsilon=0}. \tag{20}
\]

The empirical divergence admits the functional representation

\[
H(\theta, G) = \frac{1}{n}\sum_{i=1}^{n} \int V_{\alpha,\beta,\gamma}(y;\theta)\, dG_i(y). \tag{21}
\]

Hence,

\[
\mathrm{IF}\big(y;\, n H_n(\hat\theta),\, G\big) = n\left[ V_{\alpha,\beta,\gamma}(y;\theta^g) + \left.\frac{\partial H(\theta, G)}{\partial\theta}\right|_{\theta=\theta^g}^{\top} \mathrm{IF}(y; \hat\theta, G) \right], \tag{22}
\]

where θ^g denotes the population minimizer of the divergence. Since θ^g satisfies

\[
\left.\frac{\partial H(\theta, G)}{\partial\theta}\right|_{\theta=\theta^g} = 0, \tag{23}
\]

we obtain

\[
\mathrm{IF}\big(y;\, n H_n(\hat\theta),\, G\big) = n\, V_{\alpha,\beta,\gamma}(y;\theta^g). \tag{24}
\]

The penalty component tr(ΩΨ^{-1}) depends on the underlying distribution only through second-order moments. Under standard smoothness and moment conditions, its influence function is of order O(1) and is therefore negligible compared to the O(n) divergence contribution. Thus,

\[
\mathrm{IF}\big(y;\, \mathrm{tr}(\Omega\Psi^{-1}),\, G\big) = O(1). \tag{25}
\]

Combining the above results, the influence function of the classical EPDIC is given by

\[
\mathrm{IF}(y; \mathrm{EPDIC}, G) = n\, V_{\alpha,\beta,\gamma}(y;\theta^g) + O(1).
\]
(26)

Since V_{α,β,γ}(y;θ) is bounded for α > 0 and γ > 0, the influence function of the classical EPDIC is bounded, confirming its robustness. In the limiting case (α, β, γ) → (0, 0, 0), the divergence reduces to the negative log-likelihood, yielding an unbounded influence function and recovering the non-robust behavior of likelihood-based criteria such as AIC and TIC.

The robustness properties of information criteria can be assessed through their influence functions (IFs), which quantify the effect of an infinitesimal contamination on the criterion value. Conventional information criteria like AIC are directly constructed from the log-likelihood evaluated at the maximum likelihood estimator. Consequently, their influence functions inherit the unbounded nature of the score and log-likelihood contributions, rendering AIC highly sensitive to outlying observations. In particular, a single gross error can exert an arbitrarily large impact on these criteria, leading to potentially unstable model selection decisions. In contrast, the proposed EPDIC replaces the log-likelihood component with a robust divergence-based objective, leading to bounded score-type contributions. This modification yields a bounded influence function for EPDIC, ensuring that the effect of anomalous observations is controlled. Consequently, EPDIC exhibits reduced gross-error sensitivity and improved local robustness compared to AIC and DIC. This boundedness property makes EPDIC particularly well-suited for model selection in the presence of mild deviations from the assumed model or potential data contamination.

5. Optimal Tuning Parameter

The performance of the Minimum Exponential-Polynomial Divergence Estimator (MEPDE) critically depends on the appropriate choice of the robustness parameters (α, β, γ).
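To see concretely why this choice matters, a small sketch (our illustration; the tuning values are arbitrary) evaluates the weight function w(t) = βte^{αt} + (1−β)(1+γ)t^γ from the weighted estimating equation (7) at a typical point and at a gross outlier of a N(0, 1) model. Near the maximum likelihood limit the outlier keeps essentially full weight, whereas a moderately robust tuning drives its relative weight to nearly zero:

```python
import numpy as np

def w(t, alpha, beta, gamma):
    """Downweighting function w(t) = beta*t*e^{alpha*t} + (1-beta)*(1+gamma)*t^gamma
    attached to the score in the weighted estimating equation (7)."""
    return beta * t * np.exp(alpha * t) + (1 - beta) * (1 + gamma) * t**gamma

# Model densities (N(0,1)) at a "typical" point and at a gross outlier.
f_typical = np.exp(-0.5 * 1.0**2) / np.sqrt(2 * np.pi)  # density at y = 1
f_outlier = np.exp(-0.5 * 8.0**2) / np.sqrt(2 * np.pi)  # density at y = 8

# Near the ML limit (beta = 0, tiny gamma) the outlier keeps ~full weight...
assert w(f_outlier, 0.5, 0.0, 1e-4) / w(f_typical, 0.5, 0.0, 1e-4) > 0.9
# ...whereas a moderately robust tuning makes its relative weight negligible.
assert w(f_outlier, 0.5, 0.3, 0.5) / w(f_typical, 0.5, 0.3, 0.5) < 1e-5
```

Larger γ (and the exponential component for β > 0) thus buys robustness by discounting low-density observations, at some cost in efficiency under the clean model, which is precisely the trade-off the tuning procedure below must balance.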
While these parameters regulate the trade-off between efficiency and robustness, their optimal selection is non-trivial. Classical approaches such as cross-validation or asymptotic mean squared error minimization require either repeated model fitting or explicit evaluation of asymptotic covariance matrices, both of which may be computationally demanding within the exponential-polynomial framework. To address this issue, we adopt the generalized score matching principle introduced by [16], which provides an alternative estimation and model selection mechanism that does not rely on the normalizing constant of the model density. In particular, we exploit the formulation of generalized score matching in Euclidean space for independent observations, as detailed in [15]. This approach offers a computationally attractive and theoretically justified criterion for selecting the optimal tuning parameters of the exponential-polynomial divergence estimator.

Let Y_1, ..., Y_n ∈ R^d be independent observations with true densities g_i, modeled by parametric densities f_i(·;θ). Denote by p_i(y;θ) = f_i(y;θ) the model density. The generalized score matching criterion is based on the Fisher divergence between the true density and the model density, defined as

\[
D_{SM}(g, p) = \frac{1}{n}\sum_{i=1}^{n} E_{g_i}\Big[ \big\| \nabla_i \log g_i(Y_i) - \nabla_i \log p_i(Y_i;\theta) \big\|^2 \Big],
\]

where ∇_i denotes differentiation with respect to Y_i. Under mild regularity conditions, this divergence decomposes into a term independent of θ and a model-dependent component. Consequently, minimizing the Fisher divergence reduces to minimizing the empirical generalized score matching objective

\[
d_{SM}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \rho_{SM,i}(Y_i;\theta), \tag{27}
\]

where

\[
\rho_{SM,i}(Y_i;\theta) = 2\sum_{j=1}^{d} \frac{\partial^2}{\partial y_{ij}^2} \log f_i(Y_i;\theta) + \sum_{j=1}^{d} \left( \frac{\partial}{\partial y_{ij}} \log f_i(Y_i;\theta) \right)^2.
\]
(28)

An important advantage of this formulation is that it depends only on derivatives of the log-density with respect to the data, thereby eliminating the need to evaluate any normalizing constant. This feature is particularly beneficial in divergence-based estimation frameworks.

For each fixed triplet (α, β, γ), let θ̂^{(α,β,γ)} = argmin_θ H_n^{(α,β,γ)}(θ) denote the corresponding MEPDE. Substituting this estimator into the generalized score matching objective (27), we define the tuning-selection criterion

\[
S_n(\alpha, \beta, \gamma) = d_{SM}\big( \hat\theta^{(\alpha,\beta,\gamma)} \big). \tag{29}
\]

The optimal tuning parameters are then obtained as

\[
(\hat\alpha, \hat\beta, \hat\gamma) = \arg\min_{\alpha,\beta,\gamma} S_n(\alpha, \beta, \gamma). \tag{30}
\]

This procedure selects the divergence parameters that yield the smallest Fisher divergence between the fitted model and the underlying data-generating mechanism. Unlike approaches based on asymptotic MSE minimization, this method does not require explicit computation of the asymptotic covariance matrices Ψ and Ω. Moreover, it avoids repeated pilot updates, as required in iterative Warwick and Jones-type procedures [13]. Once the optimal tuning parameters (α̂, β̂, γ̂) are determined via (30), the corresponding MEPDE θ̂* = θ̂^{(α̂,β̂,γ̂)} is used to evaluate the EPDIC, defined in Section 3.

6. Numerical Experiments

This section presents an extensive set of numerical experiments designed to examine the finite-sample performance and practical relevance of the proposed methodologies. Comprehensive simulation studies are conducted to assess the behavior of the estimators and associated information criteria under a variety of data-generating scenarios. In addition, two real-world applications are analyzed to illustrate the practical applicability of the proposed framework.
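For a univariate Gaussian model, the GSM objective (27)-(28) is available in closed form (∂/∂y log f = −(y−μ)/σ² and ∂²/∂y² log f = −1/σ²), and minimizing its empirical version recovers the mean and variance without ever evaluating a normalizing constant. The following sketch (our illustration; function and variable names are ours) verifies this on simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.5, 5000)  # true mean 2, true variance 2.25

def gsm_objective(mu, sigma2, y):
    """Empirical GSM objective (27)-(28) for a univariate N(mu, sigma2) model:
    rho_SM = 2*(-1/sigma2) + ((y - mu)/sigma2)**2, averaged over the sample."""
    return np.mean(2.0 * (-1.0 / sigma2) + ((y - mu) / sigma2) ** 2)

# Coarse grid search over (mu, sigma^2); the minimizer should sit near the
# true values even though no likelihood (and no normalizer) is computed.
mus = np.linspace(0.0, 4.0, 81)
s2s = np.linspace(0.5, 5.0, 91)
vals = np.array([[gsm_objective(m, s, y) for s in s2s] for m in mus])
i, j = np.unravel_index(vals.argmin(), vals.shape)

assert abs(mus[i] - 2.0) < 0.2
assert abs(s2s[j] - 2.25) < 0.4
```

In the tuning procedure above, this same objective would be evaluated at the fitted MEPDE for each candidate (α, β, γ) rather than minimized over the model parameters directly.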
The first application concerns a linear mixed-effects panel data model, while the second focuses on a neural network setting, thereby demonstrating the flexibility and effectiveness of the proposed approach across both classical statistical models and modern machine learning frameworks.

6.1. Simulation Study

An extensive Monte Carlo simulation study is conducted to investigate the robustness of the proposed exponential-polynomial divergence information criteria. All results reported in this subsection are based on 1000 independent Monte Carlo replications. We consider a multiple linear regression model with $p = 5$ covariates, given by
$$
y_i = x_i^{\top} \beta + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{31}
$$
where $x_i = (x_{i1}, \ldots, x_{i5})^{\top}$ denotes the vector of explanatory variables, $\beta$ is the vector of regression coefficients, and $\varepsilon_i$ represents the random error term. The response variable $y_i$ is generated from a Gaussian distribution, ensuring compatibility with the likelihood-based and divergence-based estimation frameworks. The covariate vectors $x_i$ are generated from a multivariate normal distribution with mean vector $\mathbf{0}$ and a covariance matrix $\Sigma$ having an autoregressive structure, that is, $\Sigma_{jk} = \rho^{|j-k|}$, $j, k = 1, \ldots, 5$, with $\rho = 0.5$, allowing for moderate correlation among the covariates. The error terms $\varepsilon_i$ are independently generated from a normal distribution with mean zero and variance $\sigma^2 = 1$. The true regression coefficient vector is fixed as $\beta_0 = (1.5, -1.0, 0.8, 0.5, -0.7)^{\top}$.

The simulation procedure is implemented under a single sample size, namely $n = 150$. To examine the robustness properties of the estimators, two distinct contamination schemes are considered. In the first scheme, contamination is introduced in the error (disturbance) terms $\varepsilon_i$, where a proportion $\delta \in \{0.052, 0.093, 0.134\}$ of the errors is replaced by values generated from a Normal$(10.6, 1)$ distribution. This mechanism induces vertical outliers in the response variable while leaving the design matrix unaffected. In the second scheme, contamination is introduced in the explanatory variables $x_i$. Specifically, a proportion $\delta \in \{0.058, 0.099, 0.140\}$ of the covariate observations is replaced by values drawn from a Shifted-Normal$(45.6, 6.3)$ distribution. This setup generates leverage-type outliers in the design matrix. Taken together, these contamination mechanisms introduce both moderate and severe deviations from the assumed model structure.

Next, parameter estimation is performed using three competing approaches: the classical maximum likelihood estimator (MLE), the density power divergence (DPD) estimator, and the exponential-polynomial divergence (EPD) estimator. These estimators are obtained by minimizing their respective objective functions through a sequential convex programming (SCP) algorithm (see [23]). Within this framework, each nonlinear objective function is locally approximated by a convex surrogate and solved iteratively. At every iteration, a convex subproblem is constructed using first-order derivative information together with an appropriate step-size control to ensure numerical stability and monotonic descent. The parameter vector is updated according to the solution of the convex approximation, and the process continues until convergence. The algorithm is initialized at the ordinary least squares (OLS) estimates, and convergence is declared when the relative change in both the parameter estimates and the objective function value falls below a pre-specified tolerance threshold.

Furthermore, the optimal tuning parameters for both the DPD and EPD estimators are determined using the generalized score matching procedure under each contamination scheme.
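The two contamination schemes above can be reproduced with a short generator. This is a minimal sketch under our own assumptions: the function name `simulate_panel` is illustrative, we read Shifted-Normal$(45.6, 6.3)$ as a normal with mean 45.6 and standard deviation 6.3, and we assume scheme 2 replaces entire covariate rows rather than individual entries.

```python
import numpy as np

def simulate_panel(n=150, delta=0.093, scheme=1, rho=0.5, seed=0):
    """Generate data from model (31) with AR(1)-correlated covariates
    and, optionally, contamination in the errors (scheme 1) or in the
    covariates (scheme 2)."""
    rng = np.random.default_rng(seed)
    p = 5
    beta0 = np.array([1.5, -1.0, 0.8, 0.5, -0.7])
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # Sigma_jk = rho^|j-k|
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    eps = rng.normal(0.0, 1.0, size=n)
    m = int(round(delta * n))                # number of contaminated units
    out = rng.choice(n, size=m, replace=False)
    if scheme == 1:                          # vertical outliers in the errors
        eps[out] = rng.normal(10.6, 1.0, size=m)
    else:                                    # leverage points in the design
        X[out] = rng.normal(45.6, 6.3, size=(m, p))
    y = X @ beta0 + eps
    return X, y, out
```

Calling `simulate_panel(delta=0.093, scheme=1)` then yields a sample with roughly 9.3% vertical outliers, matching the middle contamination level of Scheme 1.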
This data-driven strategy identifies the tuning parameters that achieve an appropriate balance between efficiency and robustness by minimizing an empirical criterion derived from the corresponding score equations. The resulting optimal tuning parameter values under the different contamination scenarios are summarized in Table 1.

Table 1: Optimal tuning parameters for DPD and EPD under different contamination schemes.

| Scheme  | δ     | γ (DPD) | α (EPD) | β (EPD) | γ (EPD) |
|---------|-------|---------|---------|---------|---------|
| Pure    | 0.000 | 0.95    | 0.10    | 0.70    | 0.30    |
| Cont. 1 | 0.052 | 0.25    | 0.10    | 0.60    | 0.70    |
|         | 0.093 | 0.35    | 0.10    | 0.70    | 0.30    |
|         | 0.134 | 0.40    | 0.40    | 0.70    | 0.60    |
| Cont. 2 | 0.058 | 0.15    | 0.10    | 0.70    | 0.90    |
|         | 0.099 | 0.25    | 0.10    | 0.60    | 0.70    |
|         | 0.140 | 0.20    | 0.10    | 0.70    | 0.90    |

Figure 1: Information criteria under Contamination scheme 1.

Employing these optimal tuning parameters, the parameter estimates are ultimately derived using the MLE, DPDE, and EPDE methods, noting that the MLE does not require any tuning parameter. The corresponding numerical results, evaluated across various contamination scenarios, are reported in Table 2.

Based on the estimated parameter values from each competing method, the corresponding information criteria, namely the maximum likelihood information criterion (MLIC), the density power divergence information criterion (DPDIC), and the proposed exponential-polynomial divergence information criterion (EPDIC), are then evaluated under both pure and contaminated data-generating mechanisms. In this unified framework, the criteria are computed by combining the empirical likelihood-, DPD-, and EPD-based objective functions evaluated at their respective estimators with the associated asymptotic penalty terms that account for model complexity. This formulation enables a coherent comparison of likelihood-based and divergence-based model selection strategies.
Figure 1 illustrates the information criterion values for each estimation method under Scheme 1 across varying contamination proportions $\delta$, while Figure 2 displays the corresponding results under Scheme 2. The comparative behavior of MLIC, DPDIC, and EPDIC as $\delta$ increases in each scheme provides a clear assessment of their relative stability and robustness in the presence of model deviations, thereby highlighting the effect of different contamination structures on model selection performance.

Table 2: Estimated values of the parameters under pure and contamination schemes.

| Scheme  | δ     | Parameter | MLE       | DPDE      | EPDE      |
|---------|-------|-----------|-----------|-----------|-----------|
| Pure    | 0.000 | β1 | 1.495108  | 1.494928  | 1.494640  |
|         |       | β2 | -0.995442 | -0.995126 | -0.994971 |
|         |       | β3 | 0.799332  | 0.799579  | 0.799965  |
|         |       | β4 | 0.502114  | 0.502597  | 0.503010  |
|         |       | β5 | -0.701112 | -0.700919 | -0.700425 |
| Cont. 1 | 0.052 | β1 | 1.569221  | 1.520784  | 1.514209  |
|         |       | β2 | -1.018774 | -0.981226 | -0.989350 |
|         |       | β3 | 0.761441  | 0.772438  | 0.785769  |
|         |       | β4 | 0.548332  | 0.520714  | 0.513276  |
|         |       | β5 | -0.731114 | -0.706512 | -0.701868 |
|         | 0.093 | β1 | 1.628114  | 1.526884  | 1.517182  |
|         |       | β2 | -1.061884 | -0.978112 | -0.980029 |
|         |       | β3 | 0.726441  | 0.759331  | 0.762461  |
|         |       | β4 | 0.589114  | 0.530774  | 0.521705  |
|         |       | β5 | -0.768552 | -0.710221 | -0.703512 |
|         | 0.134 | β1 | 1.702441  | 1.539882  | 1.525483  |
|         |       | β2 | -1.112774 | -0.964221 | -0.973818 |
|         |       | β3 | 0.691118  | 0.741115  | 0.757822  |
|         |       | β4 | 0.629441  | 0.542997  | 0.532960  |
|         |       | β5 | -0.804118 | -0.715438 | -0.706917 |
| Cont. 2 | 0.058 | β1 | 1.528114  | 1.502398  | 1.501286  |
|         |       | β2 | -1.006218 | -0.998402 | -0.999017 |
|         |       | β3 | 0.819552  | 0.801669  | 0.800941  |
|         |       | β4 | 0.515008  | 0.508681  | 0.501237  |
|         |       | β5 | -0.721933 | -0.703284 | -0.700739 |
|         | 0.099 | β1 | 1.566227  | 1.511903  | 1.507549  |
|         |       | β2 | -1.023915 | -0.979114 | -0.989718 |
|         |       | β3 | 0.845661  | 0.810492  | 0.808795  |
|         |       | β4 | 0.538104  | 0.512961  | 0.509558  |
|         |       | β5 | -0.748991 | -0.710728 | -0.702381 |
|         | 0.140 | β1 | 1.621332  | 1.517884  | 1.511344  |
|         |       | β2 | -1.056781 | -0.968331 | -0.975795 |
|         |       | β3 | 0.892114  | 0.819612  | 0.813201  |
|         |       | β4 | 0.579442  | 0.526893  | 0.513679  |
|         |       | β5 | -0.789661 | -0.718947 | -0.710308 |

Figure 2: Information criteria under Contamination scheme 2.

Consistent with the patterns displayed in Figure 1 and Figure 2, the rate of increase in the information criterion values as the contamination proportion $\delta$ rises clearly distinguishes the three approaches. Under pure data, all criteria remain stable; however, once contamination is introduced, their trajectories diverge markedly. The MLIC exhibits the steepest escalation with increasing $\delta$ in both Scheme 1 (error contamination) and Scheme 2 (leverage contamination), indicating pronounced sensitivity to model deviations. The rapid growth of MLIC reflects the well-known vulnerability of likelihood-based criteria to both vertical and leverage-type outliers.

In contrast, DPDIC demonstrates a more moderate and controlled increase as contamination intensifies. Although its values do rise with $\delta$, the progression is substantially smoother than that of MLIC, suggesting improved resistance to departures from model assumptions. Among the three, EPDIC shows the smallest incremental change across contamination levels in both schemes. Its comparatively minimal rise as $\delta$ increases highlights a high degree of stability, thereby underscoring the robustness of the exponential-polynomial divergence framework. Overall, the relative magnitudes of increase, largest for MLIC, moderate for DPDIC, and smallest for EPDIC, provide coherent empirical evidence of the superior robustness of the divergence-based information criteria, particularly EPDIC, under different contaminated settings.

6.2. Real Data Analysis

6.2.1. Application: Linear Mixed-Effects Panel Data Models

We consider a linear mixed-effects panel data model of the form
$$
y_{it} = x_{it}^{\top} \beta + z_{it}^{\top} \alpha_i + u_{it}, \qquad i = 1, \ldots, n, \; t = 1, \ldots, m, \tag{32}
$$
with $\alpha_i \sim N_p(\mathbf{0}, \Sigma_\alpha)$ and $u_{it} \sim N(0, \sigma_u^2)$ independently. The observations are assumed to be independent but non-homogeneous across individuals, allowing the covariance structure to vary with $i$ through the individual-specific design matrices.
By stacking the observations over time for each individual, the model can be expressed as
$$
y_i = X_i \beta + \varepsilon_i, \qquad \varepsilon_i \sim N_m(\mathbf{0}, \Omega_i), \qquad \Omega_i = Z_i \Sigma_\alpha Z_i^{\top} + \sigma_u^2 I_m.
$$
Hence, the marginal density of $y_i$ is
$$
f_i(y_i; \theta) = (2\pi)^{-m/2} |\Omega_i|^{-1/2} \exp\!\left\{ -\tfrac{1}{2} (y_i - X_i \beta)^{\top} \Omega_i^{-1} (y_i - X_i \beta) \right\}, \tag{33}
$$
where $\theta = (\beta^{\top}, \mathrm{vec}(\Sigma_\alpha)^{\top}, \sigma_u^2)^{\top}$.

With this general formulation in place, we now illustrate the application of the linear mixed-effects panel model using a real-world dataset that naturally aligns with its structure. In particular, we consider the Panel Data of Individual Wages dataset, a well-known benchmark dataset in panel data econometrics. This dataset is drawn from the Panel Study of Income Dynamics (PSID) and contains longitudinal wage information for 595 individuals observed over the period 1976-1982 in the United States, resulting in 4165 total observations arranged as a balanced panel.

The primary response variable in this study is the logarithm of wage (lwage), which is commonly used in labor economics analyses to stabilize variance and interpret coefficients in percentage terms. The set of explanatory variables includes both continuous and categorical covariates reflecting individual characteristics and employment conditions, such as years of education (ed), years of full-time work experience (exp), weeks worked (wks), union membership (union), marital status (married), industry indicators (bluecol, ind), regional indicators (south, smsa), and demographic attributes including sex and race (black). These variables are incorporated in the model as fixed effects, representing systematic influences on wage levels that are assumed to be common across individuals.
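The marginal log-density (33) can be evaluated directly from the variance components. The sketch below is a minimal implementation using SciPy; the function name and the random-intercept example (one column of ones in $Z_i$, a $1 \times 1$ $\Sigma_\alpha$) are our own illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def marginal_logpdf(y_i, X_i, Z_i, beta, Sigma_alpha, sigma2_u):
    """Log of the marginal density (33) for one unit, with
    Omega_i = Z_i Sigma_alpha Z_i^T + sigma_u^2 I_m."""
    m = y_i.shape[0]
    Omega = Z_i @ Sigma_alpha @ Z_i.T + sigma2_u * np.eye(m)
    return multivariate_normal.logpdf(y_i, mean=X_i @ beta, cov=Omega)

# Random-intercept example: Z_i is a column of ones, Sigma_alpha is 1x1,
# so Omega_i has constant within-unit covariance tau^2 = 0.4.
y_i = np.array([0.5, 1.2])
X_i = np.ones((2, 1))
Z_i = np.ones((2, 1))
lp = marginal_logpdf(y_i, X_i, Z_i, beta=np.array([1.0]),
                     Sigma_alpha=np.array([[0.4]]), sigma2_u=1.0)
```

Building $\Omega_i$ explicitly like this is adequate for small $m$; for long panels one would typically exploit the low-rank-plus-diagonal structure of $\Omega_i$ instead of forming and inverting it densely.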
To account for unobserved heterogeneity that cannot be explained by the observed covariates, such as innate ability, motivation, or persistent individual-specific productivity, we introduce individual-level random effects. These random effects capture subject-specific deviations from the overall wage-covariate relationship and are assumed to follow a normal distribution with mean zero and unknown variance. The inclusion of random effects is particularly appropriate here because the dataset contains repeated observations for each individual over multiple years, enabling the model to separate within-individual temporal variation from between-individual variability.

The wage panel dataset is well-suited to the proposed linear mixed-effects framework, as it provides a balanced longitudinal structure with rich socio-economic covariates and repeated observations for each individual, enabling reliable estimation of both fixed and random effects. To assess multivariate extremeness in the joint distribution of the covariates, we employ the Minimum Covariance Determinant (MCD) estimator. It is a widely used, robust multivariate outlier detection method that seeks the subset of observations with the smallest determinant of the covariance matrix, thereby providing high-breakdown estimates of location and scatter that are not unduly influenced by outliers. The robust Mahalanobis distances derived from the MCD estimate are used to detect multivariate outliers, as recommended in the robust statistics literature (e.g., [24], [25]). The MCD-based diagnostic confirms that, despite some individually extreme values, there are no observations that simultaneously deviate in the multivariate covariate space to an extent requiring exclusion. Therefore, no observations have been removed, and robust estimation procedures are used to mitigate potential influence from atypical observations.
In addition, the dataset is verified to contain no missing values.

We employ MLE, DPDE, and EPDE for estimating the model parameters. From the perspective of exponential-polynomial divergence, the parameter estimation problem reduces to minimizing an empirical divergence criterion. Specifically, we consider the objective function
$$
H_n^{(\alpha,\beta,\gamma)}(\theta) = \frac{1}{n} \sum_{i=1}^{n} V_{\alpha,\beta,\gamma}(y_i; \theta), \tag{34}
$$
where the contribution from the $i$th subject is captured by
$$
V_{\alpha,\beta,\gamma}(y_i; \theta) = \frac{\beta}{\alpha^2} \int_{\mathbb{R}^m} \left[ e^{\alpha f_i(y; \theta)} \left( \alpha f_i(y; \theta) - 1 \right) + 1 \right] dy
+ \frac{1-\beta}{\gamma} \int_{\mathbb{R}^m} \left[ f_i^{\gamma+1}(y; \theta) - f_i(y; \theta) \right] dy
- \left[ \frac{\beta}{\alpha} \left( e^{\alpha f_i(y_i; \theta)} - 1 \right) + \frac{1-\beta}{\gamma} \left( (\gamma+1) f_i(y_i; \theta) - 1 \right) \right]. \tag{35}
$$

To obtain stable starting values for the iterative estimation procedures, the fixed-effect parameters are first initialized using the ordinary least squares (OLS) estimator applied to the pooled regression model. These preliminary estimates serve as initial values for the subsequent divergence-based optimization procedures. Based on these initial estimates, the optimal tuning parameters for the divergence-based estimators are determined using the generalized score matching method. For DPDE, the optimal tuning parameter is obtained as $\gamma = 0.50$, while for EPDE, the optimal set of tuning parameters is found to be $(\alpha, \beta, \gamma) = (0.1, 0.3, 0.3)$. Using these optimal values, the fixed-effects model parameters are estimated.

Subsequently, to identify the most relevant explanatory variables and reduce the computational burden associated with evaluating all possible regression specifications, we incorporate a LASSO-type penalty into the likelihood-, DPD-, and EPD-based objective functions. This penalized estimation step performs variable selection by shrinking the coefficients of less informative covariates toward zero.
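To make the objective (34)-(35) concrete in a case where the integrals are one-dimensional, the following sketch evaluates the EPD objective for a univariate $N(\mu, \sigma^2)$ working model by adaptive quadrature. The function names and the $\pm 10\sigma$ truncation are our own choices; the truncation is harmless here because both integrands vanish in the tails as $f \to 0$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def epd_objective(y, mu, sigma, alpha, beta, gamma):
    """Empirical EPD objective (34) for a N(mu, sigma^2) model.
    The two integrals in (35) do not depend on the data, so they
    are computed once; only the last bracket of (35) is averaged."""
    f = lambda t: norm.pdf(t, mu, sigma)
    lo, hi = mu - 10 * sigma, mu + 10 * sigma
    I1, _ = quad(lambda t: np.exp(alpha * f(t)) * (alpha * f(t) - 1) + 1, lo, hi)
    I2, _ = quad(lambda t: f(t) ** (gamma + 1) - f(t), lo, hi)
    fy = f(np.asarray(y))
    inner = (beta / alpha) * (np.exp(alpha * fy) - 1) \
        + (1 - beta) / gamma * ((gamma + 1) * fy - 1)
    return beta / alpha**2 * I1 + (1 - beta) / gamma * I2 - np.mean(inner)

# The objective should be smaller at the true location than far from it,
# using the tuning values (0.1, 0.3, 0.3) reported for the panel application.
rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=200)
h_true = epd_objective(y, 0.0, 1.0, alpha=0.1, beta=0.3, gamma=0.3)
h_far = epd_objective(y, 3.0, 1.0, alpha=0.1, beta=0.3, gamma=0.3)
```

In the $m$-dimensional panel setting the integrals are over $\mathbb{R}^m$ and would instead require Monte Carlo or structured approximations; the one-dimensional case is intended only to show the shape of the criterion.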
The procedure is particularly useful in the present context because calculating the information criteria for the full model with all available covariates would require evaluating an excessively large number of candidate models. The penalized estimation results reveal a consistent subset of covariates that remain relevant across all three objective functions. The variables selected by the LASSO procedure are (bluecol, smsa, married, sex, union, black). Accordingly, the model-selection analysis is restricted to these six covariates.

By examining every non-empty subset of these variables, we obtain $2^6 - 1 = 63$ candidate regression models. For each of these 63 candidate models, the corresponding information criteria, MLIC, DPDIC, and EPDIC, are computed. The models are then ranked in ascending order according to the magnitude of each criterion. From the three ranked lists obtained under MLIC, DPDIC, and EPDIC, the top fifteen candidate models are extracted. These lists are thereupon consolidated to form a combined set of candidate models, within which several models appear repeatedly across the three criteria.

To identify the model specifications that receive the most consistent support across the three criteria, we compute the frequency with which each candidate model appears among the top-ranked models. Based on this frequency, a final set of five leading candidate models is selected. Table 3 presents these models along with their selection frequencies and the corresponding values of EPDIC, DPDIC, and MLIC.

Table 3: Top candidate models based on consolidated information-criterion rankings.

| Model | Freq. | Sel. Freq. | EPDIC | DPDIC | MLIC |
|-------|-------|------------|----------|-----------|----------|
| bluecol, sex, black | 3 | 1.000 | 1435.263 | 229.14054 | 11666.50 |
| smsa, married, sex | 3 | 1.000 | 1892.022 | 370.21159 | 11101.19 |
| bluecol, smsa, sex, union | 2 | 0.667 | 1669.711 | 246.01321 | 21262.38 |
| bluecol, smsa, sex, black | 2 | 0.667 | 1806.677 | 201.14246 | 21313.76 |
| bluecol, smsa, sex, union, black | 2 | 0.667 | 1716.873 | 253.14976 | 21249.34 |

The results indicate that several models receive consistent support across multiple information criteria, suggesting that the identified covariates play a significant role in explaining wage variation. In particular, models involving occupation type (bluecol), regional indicator (smsa), marital status, gender, union membership, and race repeatedly appear among the leading specifications. This consolidated ranking approach provides a robust mechanism for identifying economically meaningful wage determinants while avoiding overfitting and excessive model complexity.

6.2.2. Application: Neural Network Models

Beyond classical parametric and panel-data models, the proposed divergence-based estimation and model-selection framework can be naturally extended to modern machine-learning architectures. In this subsection, we illustrate its applicability through supervised classification using feed-forward neural networks (FFNNs), which are widely employed nonlinear predictive models in statistical learning.

Let $\{(x_i, y_i)\}_{i=1}^{N}$ denote a collection of independent observations, where $x_i \in \mathbb{R}^d$ represents a $d$-dimensional feature vector and $y_i \in \{0, 1\}$ denotes a binary class label indicating the operational state of a machine. In particular,
$$
y_i = \begin{cases} 1, & \text{if a machine failure occurs}, \\ 0, & \text{otherwise}. \end{cases}
$$
To illustrate the proposed framework, we utilize the AI4I 2020 Predictive Maintenance Dataset, a widely used benchmark dataset in predictive maintenance and machine-learning studies.
The dataset contains information on 10,000 industrial machine instances along with several operational and environmental variables that influence machine performance and reliability. The primary objective is to predict whether a machine is likely to fail based on a set of sensor-based covariates. The explanatory variables considered in the analysis include air temperature, process temperature, rotational speed, torque, and tool wear, together with a categorical machine-type indicator reflecting the product quality level. These variables represent key operational characteristics that influence mechanical stress, thermal conditions, and tool degradation in manufacturing environments.

Prior to model estimation, all continuous covariates are standardized to ensure numerical stability and comparable scaling across features. The categorical machine-type variable is encoded using dummy variables. The dataset is verified to contain no missing values.

To examine potential multivariate extremeness among the covariates, we employ the Minimum Covariance Determinant (MCD) estimator. This robust method identifies subsets of observations with minimal covariance determinant and provides high-breakdown estimates of multivariate location and scatter. The corresponding robust Mahalanobis distances are used to detect multivariate outliers. The diagnostic analysis indicates that although some observations exhibit individually extreme sensor values, none deviate substantially in the joint covariate space. Consequently, no observations are removed, and robust estimation procedures are retained to mitigate potential influence from outlying measurements.

Instead of selecting subsets of covariates, model selection in neural networks involves choosing among alternative network architectures. Accordingly, we fix a finite collection of candidate architectures, $\mathcal{A} = \{A_1, A_2, \ldots, A_m\}$, each differing in structural complexity but using all available covariates as inputs. The architectural variation is constructed by altering the number of hidden layers and the number of neurons per layer. Specifically, we consider

- Number of hidden layers $L \in \{1, 2\}$,
- Number of hidden neurons per layer $H \in \{2, 3\}$.

This specification yields a finite set of candidate architectures corresponding to all possible $(L, H)$ combinations. Each architecture employs the same standardized input feature space and a softmax output layer for binary classification. Thus, the candidate models differ only in internal network complexity while retaining identical covariate information.

For a given architecture $A_j$, the predicted class probability for observation $i$ is expressed as
$$
\hat{p}_i = \mathrm{Softmax}\!\left( f_{A_j}(x_i; \theta_j) \right),
$$
where $f_{A_j}(\cdot)$ denotes the feed-forward transformation determined by architecture $A_j$, and $\theta_j$ represents the collection of weights and bias parameters associated with the network. Thus, $\hat{p}_i \in (0, 1)$ represents the predicted probability of machine failure $(y_i = 1)$. The corresponding Bernoulli model implied by the network is given by
$$
f_i(y_i; \theta_j) = \hat{p}_i^{\,y_i} (1 - \hat{p}_i)^{1 - y_i}, \qquad y_i \in \{0, 1\}.
$$
Using the EPD framework, the loss function for neural network estimation is defined as
$$
L^{(\alpha,\beta,\gamma)}(\theta_j) = \frac{1}{N} \sum_{i=1}^{N} V_{\alpha,\beta,\gamma}(y_i; \theta_j), \tag{36}
$$
where the samplewise contribution $V_{\alpha,\beta,\gamma}(\cdot)$ is given by
$$
V_{\alpha,\beta,\gamma}(y_i; \theta_j) = \frac{\beta}{\alpha^2} \sum_{y \in \{0,1\}} \left[ e^{\alpha f_i(y; \theta_j)} \left( \alpha f_i(y; \theta_j) - 1 \right) + 1 \right]
+ \frac{1-\beta}{\gamma} \sum_{y \in \{0,1\}} \left[ f_i^{\gamma+1}(y; \theta_j) - f_i(y; \theta_j) \right]
- \left[ \frac{\beta}{\alpha} \left( e^{\alpha f_i(y_i; \theta_j)} - 1 \right) + \frac{1-\beta}{\gamma} \left( (\gamma+1) f_i(y_i; \theta_j) - 1 \right) \right]. \tag{37}
$$
Since the response is binary, the summations are taken over $y \in \{0, 1\}$, with $f_i(1; \theta_j) = \hat{p}_i$ and $f_i(0; \theta_j) = 1 - \hat{p}_i$.
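Because the response is binary, the sums in (37) are exact and the EPD loss can be written in a few vectorized lines. The sketch below assumes that predicted probabilities $\hat{p}_i$ are already available from some fitted network; the function name is illustrative, not the authors' implementation.

```python
import numpy as np

def epd_loss_binary(y, p_hat, alpha, beta, gamma):
    """EPD loss (36)-(37) for the Bernoulli model f_i(1) = p_hat,
    f_i(0) = 1 - p_hat; the sums over y in {0, 1} are exact."""
    y = np.asarray(y, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    probs = np.stack([1.0 - p_hat, p_hat], axis=1)   # columns: y = 0, y = 1
    term1 = beta / alpha**2 * np.sum(
        np.exp(alpha * probs) * (alpha * probs - 1) + 1, axis=1)
    term2 = (1 - beta) / gamma * np.sum(probs ** (gamma + 1) - probs, axis=1)
    f_obs = np.where(y == 1, p_hat, 1.0 - p_hat)     # f_i(y_i; theta_j)
    inner = beta / alpha * (np.exp(alpha * f_obs) - 1) \
        + (1 - beta) / gamma * ((gamma + 1) * f_obs - 1)
    return float(np.mean(term1 + term2 - inner))

# A well-calibrated predictor attains a smaller loss than a reversed one,
# here with the tuning values (0.1, 0.7, 0.1) reported for this application.
y = np.array([1, 1, 0, 0, 1])
good = epd_loss_binary(y, np.array([0.9, 0.8, 0.1, 0.2, 0.9]), 0.1, 0.7, 0.1)
bad = epd_loss_binary(y, np.array([0.1, 0.2, 0.9, 0.8, 0.1]), 0.1, 0.7, 0.1)
```

Written this way the loss is differentiable in $\hat{p}_i$, so in a deep-learning framework the same expression could serve directly as a training objective in place of cross-entropy.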
The neural network parameters are then estimated by minimizing the above loss function:
$$
\hat{\theta}_j = \arg\min_{\theta_j} L^{(\alpha,\beta,\gamma)}(\theta_j). \tag{38}
$$
The MLE- and DPD-based estimators are obtained analogously by minimizing their respective loss functions over the parameter space.

To obtain stable starting values for the optimization procedure, the network weights are initialized using standard random initialization methods commonly employed in neural network training. Based on these initial values, the optimal tuning parameters for the divergence-based estimators are determined using the generalized score matching method. For the DPD estimator, the optimal tuning parameter is obtained as $\gamma = 0.90$, while for the EPD estimator, the optimal set of tuning parameters is found to be $(\alpha, \beta, \gamma) = (0.1, 0.7, 0.1)$. Using these optimal values, the neural network parameters are estimated for each candidate architecture under the three estimation frameworks, namely MLE, DPDE, and EPDE.

Subsequently, for each architecture in $\mathcal{A}$, the corresponding information criteria, MLIC, DPDIC, and EPDIC, are computed based on the fitted models. The candidate architectures are then ranked in ascending order according to the magnitude of each criterion. From the three ranked lists obtained under MLIC, DPDIC, and EPDIC, the top candidate architectures are extracted and consolidated into a combined set. To identify the architectures that receive the most consistent support across the three criteria, we compute the frequency with which each candidate architecture appears among the top-ranked models. Based on this frequency, a final set of leading neural network architectures is selected. Table 4 presents these architectures together with their selection frequencies and the corresponding values of EPDIC, DPDIC, and MLIC.

Table 4: Top neural network architectures based on consolidated information-criterion rankings.

| Architecture | Freq. | Sel. Freq. | EPDIC | DPDIC | MLIC |
|--------------|-------|------------|-----------|----------|----------|
| A4 (3,3) | 3 | 1.000 | 98853.29  | 330.1793 | 319.0178 |
| A1 (2)   | 2 | 0.667 | 113136.60 | 330.9141 | 180.5058 |
| A2 (3)   | 2 | 0.667 | 114616.58 | 331.4005 | 206.7497 |
| A3 (2,2) | 2 | 0.667 | 104669.26 | 330.2590 | 811.9595 |

The results indicate that several neural network architectures receive consistent support across multiple information criteria. In particular, the architecture with two hidden layers and three neurons in each layer, denoted by $A_4(3,3)$, appears most frequently among the top-ranked models and attains the highest selection frequency. This suggests that moderately deep architectures provide improved flexibility for capturing the nonlinear relationships present in the predictive maintenance dataset. At the same time, simpler architectures such as $A_1(2)$ and $A_2(3)$ also appear among the leading candidates, indicating that parsimonious network structures may still achieve competitive performance under certain criteria. The differences observed across MLIC, DPDIC, and EPDIC reflect the varying emphasis placed on model complexity and robustness by the respective information criteria.

Overall, this consolidated ranking approach provides a robust and systematic mechanism for selecting an appropriate neural network architecture while balancing predictive performance and model complexity. The procedure ensures that the selected architecture is supported consistently across multiple divergence-based information criteria, thereby enhancing the reliability of the model-selection process in predictive maintenance applications.

7. Conclusion

In this paper, we proposed a robust information criterion based on the Exponential-Polynomial Divergence, aimed at addressing the limitations of classical likelihood-based model selection in the presence of contamination and model misspecification.
The proposed EPDIC leverages the flexibility and robustness of the underlying divergence to provide a reliable alternative for model selection across a wide range of settings. We established its theoretical properties and demonstrated, through influence function analysis, that the criterion exhibits desirable robustness, with controlled sensitivity to outliers.

To ensure practical applicability, we incorporated a data-driven tuning-parameter selection mechanism based on generalized score matching, thereby improving stability over existing methods. The effectiveness of the proposed approach was validated through extensive simulation studies and real data applications, including panel data models and neural network-based prediction tasks. The results consistently indicated that EPDIC outperforms classical and existing divergence-based criteria in terms of stability and reliability under contaminated and misspecified scenarios.

Overall, the proposed framework provides a unified and practically viable approach to robust model selection. Future work may explore its extension to more complex high-dimensional models and other structured learning frameworks.

Data availability

The panel dataset used in this study is the Panel Data of Individual Wages from the Panel Study of Income Dynamics (PSID), available online at the R datasets repository: Panel Data of Individual Wages. The AI4I 2020 Predictive Maintenance dataset is obtained from the UCI Machine Learning Repository and can be accessed at: AI4I 2020 Predictive Maintenance.

Declaration of competing interest

The authors confirm that they have no recognized financial conflicts of interest or personal relationships that could have influenced the work presented in this paper.

Funding

This research is not supported by any specific grant from public, commercial, or non-profit funding organizations.

References

[1] H. Akaike, Information theory and an extension of the maximum likelihood principle, in: Selected Papers of Hirotugu Akaike, Springer, 1998, pp. 199-213.
[2] K. Takeuchi, Distribution of information number statistics and criteria for adequacy of models, Mathematical Sciences 153 (1976) 12-18.
[3] T. Matsuda, M. Uehara, A. Hyvärinen, Information criteria for non-normalized models, Journal of Machine Learning Research 22 (158) (2021) 1-33.
[4] K. Mattheou, S. Lee, A. Karagrigoriou, A model selection criterion based on the BHHJ measure of divergence, Journal of Statistical Planning and Inference 139 (2) (2009) 228-235.
[5] A. Basu, I. R. Harris, N. L. Hjort, M. Jones, Robust and efficient estimation by minimising a density power divergence, Biometrika 85 (3) (1998) 549-559.
[6] A. Ghosh, A. Basu, A new family of divergences originating from model adequacy tests and application to robust statistical inference, IEEE Transactions on Information Theory 64 (8) (2018) 5581-5591.
[7] A. Ghosh, S. Majumdar, Ultrahigh-dimensional robust and efficient sparse regression using non-concave penalized density power divergence, IEEE Transactions on Information Theory 66 (12) (2020) 7812-7827.
[8] S. Ray, S. Pal, S. K. Kar, A. Basu, Characterizing the functional density power divergence class, IEEE Transactions on Information Theory 69 (2) (2022) 1141-1146.
[9] P. Mantalos, K. Mattheou, A. Karagrigoriou, An improved divergence information criterion for the determination of the order of an AR process, Communications in Statistics - Simulation and Computation 39 (5) (2010) 865-879.
[10] P. Singh, A. Mandal, A. Basu, Robust inference using the exponential-polynomial divergence, Journal of Statistical Theory and Practice 15 (2) (2021) 29.
[11] B. Kim, S. Lee, Robust estimation for general integer-valued autoregressive models based on the exponential-polynomial divergence, Journal of Statistical Computation and Simulation 94 (6) (2024) 1300-1316.
[12] J. Warwick, M. Jones, Choosing a robustness tuning parameter, Journal of Statistical Computation and Simulation 75 (7) (2005) 581-588.
[13] S. Basak, A. Basu, M. Jones, On the 'optimal' density power divergence tuning parameter, Journal of Applied Statistics 48 (2021) 536-556.
[14] U. Goswami, S. Mondal, Inequality restricted minimum density power divergence estimation in panel count data, Applied Mathematical Modelling (2025) 116371.
[15] J. Xu, J. L. Scealy, A. T. Wood, T. Zou, Generalized score matching, Journal of Multivariate Analysis 210 (2025) 105473.
[16] A. Hyvärinen, P. Dayan, Estimation of non-normalized statistical models by score matching, Journal of Machine Learning Research 6 (2005).
[17] L. M. Bregman, The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, USSR Computational Mathematics and Mathematical Physics 7 (3) (1967) 200-217.
[18] A. Banerjee, S. Merugu, I. S. Dhillon, J. Ghosh, Clustering with Bregman divergences, Journal of Machine Learning Research 6 (Oct) (2005) 1705-1749.
[19] A. Cichocki, S.-i. Amari, Families of alpha-, beta-, and gamma-divergences: Flexible and robust measures of similarities, Entropy 12 (6) (2010) 1532-1568.
[20] F. Nielsen, R. Nock, On the chi square and higher-order chi distances for approximating f-divergences, IEEE Signal Processing Letters 21 (1) (2013) 10-13.
[21] S. Kullback, R. A. Leibler, On information and sufficiency, The Annals of Mathematical Statistics 22 (1) (1951) 79-86.
[22] A. Ghosh, A. Basu, Robust estimation for independent non-homogeneous observations using density power divergence with applications to linear regression, Electronic Journal of Statistics 7 (2013) 2420-2456. doi:10.1214/13-EJS847.
[23] Q. T. Dinh, M. Diehl, Local convergence of sequential convex programming for nonconvex optimization, in: Recent Advances in Optimization and its Applications in Engineering: The 14th Belgian-French-German Conference on Optimization, Springer, 2010, pp. 93-102.
[24] P. J. Rousseeuw, K. V. Driessen, A fast algorithm for the minimum covariance determinant estimator, Technometrics 41 (3) (1999) 212-223.
[25] M. Hubert, P. J. Rousseeuw, S. Verboven, Robust PCA for high-dimensional data, in: R. Dutter, P. Filzmoser, U. Gather, P. J. Rousseeuw (Eds.), Developments in Robust Statistics, Physica-Verlag HD, Heidelberg, 2003, pp. 169-179.
