Reformulating van Rijsbergen's $F_β$ metric for weighted binary cross-entropy


Authors: Satesh Ramdhani

Contributing author: satesh.ramdhani@gmail.com

Abstract

The separation of performance metrics from gradient-based loss functions may not always give optimal results and may miss vital aggregate information. This paper investigates incorporating a performance metric alongside differentiable loss functions to inform training outcomes. The goal is to guide model performance and interpretation by assuming statistical distributions on this performance metric for dynamic weighting. The focus is on van Rijsbergen's $F_\beta$ metric, a popular choice for gauging classification performance. Through distributional assumptions on the $F_\beta$, an intermediary link can be established to the standard binary cross-entropy via dynamic penalty weights. First, the $F_\beta$ metric is reformulated to facilitate assuming statistical distributions, with accompanying proofs for the cumulative density function. These probabilities are used within a knee curve algorithm to find an optimal $\beta$, or $\beta_{opt}$. This $\beta_{opt}$ is used as a weight, or penalty, in the proposed weighted binary cross-entropy. Experimentation on publicly available data, along with benchmark analysis, mostly yields better and interpretable results as compared to the baseline for both imbalanced and balanced classes. For example, for the IMDB text data with known labeling errors, a 14% boost in $F_1$ score is shown. The results also reveal commonalities between the penalty model families derived in this paper and the suitability of recall-centric or precision-centric parameters used in the optimization. The flexibility of this methodology can enhance interpretation.

Keywords: Performance metrics, Metrics, F-Beta Metric, Penalty Optimization, C.J.
van Rijsbergen, Information Retrieval, Weighted Cross-Entropy, Binary Cross-Entropy, Text Retrieval

1 Acronym List

$F_\beta$: F-Beta Metric
$\beta_{opt}$: Optimal $\beta$ from Algorithm 1
$M^1_\beta$: Model 1, U & IU from (5.1)
$M^2_{(\lambda,\sigma^2)}$: Model 2, Ga & IE from (5.2)
PV: Pressure Vessel Design
$T_s$: Thickness of the pressure vessel shell
$T_h$: Thickness of the pressure vessel head
UST: Underground Storage Tank
$f_C$: Equation for the volume of a cylindrical UST (E2)
$f_{CH}$: Equation for the volume of a cylindrical UST with hemispherical end-caps (E3)
$f_{ED}$: Equation for the volume of an ellipsoidal UST (E4)
$f_{EDH}$: Equation for the volume of an ellipsoidal UST with hemi-ellipsoidal end-caps (E5)
CvE: Cylindrical UST versus ellipsoidal UST
CHvEH: Cylindrical UST with hemispherical end-caps versus ellipsoidal UST with hemi-ellipsoidal end-caps
UCI: UCI Machine Learning Repository

2 Introduction

Data imbalance is a known and widespread real-world issue that affects performance metrics for a variety of learning problems (e.g., image detection and segmentation, text categorization and classification). Approaches to mitigate this issue generally fall into three categories: adjusting the neural network architecture (including multiple models or ensembles, as in Fujino et al. 2008), adjusting the loss function used for training, or adjusting the data (i.e., collecting more data, or leveraging sampling techniques like Chawla et al. 2002 and Hasanin et al. 2019). This research looks at adjusting the loss function, with a focus on incorporating the $F_\beta$ performance metric. The interconnection between performance metric and loss function is crucial for understanding both model behavior and the inherent nature of a specific dataset. This connection has already been approached from the angle of thresholding (a post-model step), as in Lipton et al.
(2014), or of developing a problem-specific metric, as Ho and Wookey (2019), Li et al. (2019), and Oksuz et al. (2018) did for real-world mislabeling costs, dynamic weighting for easy negative samples, and object detection, respectively. This paper takes a uniquely different and novel approach where statistical distributions act as an intermediary to connect the $F_\beta$ metric to the binary cross-entropy through dynamic penalty weights.

First, the derivation of the $F_\beta$ metric from van Rijsbergen's effectiveness score, $E$, is revisited to prove a limiting case of $F_1$ in section 4. This result supports the default case for the main algorithm in section 6.

Second, the $F_\beta$ metric is reformulated into a multiplicative form by assuming two independent random variables. Then parametric statistical distributions are assumed for these random variables. In particular, the Uniform and Inverse Uniform (U & IU) case and the Gaussian and Inverse Exponential (Ga & IE) case are proposed. The idea behind U & IU is that no known insight is assumed about the surface of the $F_\beta$ cumulative density function (CDF). The Ga & IE case, by contrast, provides the practitioner more flexibility in encoding insight about this CDF surface. This leads to a more interpretable performance metric that is configurable to the data without having to create a new problem-specific metric (or loss function).

Third, for both distributional cases, the CDF, or $\Pr(F_\beta)$, shown in section 5 facilitates finding an optimal $\beta$ through a knee curve algorithm in section 6.1. This algorithm gets the best $\beta$ from a monotonic knee curve given precision and recall; it is the value where the curve levels off. The $\beta_{opt}$ surface for different parameter settings, examined in section 6.3, suggests a slightly more recall-centric penalty. This is discussed further in section 7.
Finally, a weighted binary cross-entropy loss function based on $\beta_{opt}$ is proposed in section 6.2. This loss methodology is applied to three data categories: image, text, and tabular/structured data. For contextual data (i.e., image and text), model performance for $F_1$ improves, and the best result occurs for the text data that contains (known) labeling errors. The structured/tabular or non-contextual data does not show significant $F_1$ improvement, but provides an important result: when considering neural embedding architectures for training, the type (or category) of data matters.

3 Related Work

Logistic regression models are among the most fundamental statistically based classifiers. Jansche (2005) provides a training procedure that uses a sigmoid approximation to maximize the $F_\beta$ on this class of classifiers. When comparing the surface plots of the likelihood from Jansche with those from section 5 (a similar but not equivalent comparison), a comparable rate of change can be seen for both surfaces with respect to their respective parameters. This is an important similarity because this paper's procedure applies distributional assumptions to provide dynamic penalties to a well-known binary cross-entropy loss. Also, implementation of this paper's methodology is straightforward because it avoids the need to provide updated partial derivatives for the loss function. Furthermore, Jansche alludes to (future work that considers) a general method to optimize several operating points simultaneously, which is a fundamental and indirect assertion in this paper. The sigmoid approximation is also used by Fujino et al. (2008) in the multi-label setting for text categorization. In their framework, multiple binary classifiers are trained per category and combined with weights estimated to maximize micro- or macro-averaged $F_1$ scores.
Similarly, Aurelio et al. (2022) propose a methodology for performance metric learning that uses a metric approximation (i.e., AUC, $F_1$) derived from the confusion matrix. The back-propagation error term involves the first derivative, followed by the application of gradient descent. This method provides an alternative means of integrating performance metrics with gradient-based learning. However, there are cases where the back-propagation term proposed by Aurelio et al. may pose issues. For instance, when considering equation 13 from Aurelio et al. in conjunction with batch training and severe imbalance, there could be a division-by-zero error if a batch with only the zero label appears. Moreover, Aurelio et al. test several metrics for their method: $F_1$, G-mean, AG-mean, and AUC. But the G-mean, AG-mean, and AUC, based on the confusion matrix approximation, can be derived as functions of $F_1$. This suggests that $F_\beta$ is more flexible than G-mean, AG-mean, and AUC. In other words, $\beta$ is unique to $F_\beta$ yet generalizes across the other metrics when equal to 1. In fact, for class imbalance, the AUC metric (an average over many thresholds) and the G-mean (a geometric mean) are less stringent and more generous in accuracy reporting compared to the $F_1$. This is the reason all results in this paper are reported using the $F_1$ score.

Surrogate loss functions, which attempt to mimic certain aspects of the $F_\beta$, are another related area. For example, sigmoidF1 from Bénédict et al. (2021) creates smooth versions of the entries of the confusion matrix, which are used to create a differentiable loss function that imitates the $F_1$. This smooth differentiability is another application of a sigmoid approximation similar to Jansche. Lee et al.
(2021) formulate a surrogate loss by adjusting the cross-entropy loss such that its gradient matches the gradient of a smooth version of the $F_\beta$.

In terms of metric creation or variation on the $F_\beta$, Ho and Wookey (2019), Li et al. (2019), Oksuz et al. (2020), and Yan et al. (2022) are highlighted. The Real World Weight Cross Entropy (RWWCE) loss function from Ho and Wookey is a metric similar in spirit to Oksuz et al. The idea is to set (not train or tune) cost-related weights based on the dataset and the main problem, by introducing costs (i.e., financial costs) that reflect the real world. RWWCE affects both the positive and negative labels by tying each to its own real-world cost implication. The dice loss from Li et al. proposes a dynamic weight adjustment to address the dominating effect of easy-negative examples. The formulation is based on the $F_\beta$, using a smoothing parameter and a focal adaptation from Lin et al. (2017). A ranking loss based on the Localisation Recall Precision (LRP) metric of Oksuz et al. (2018) is developed by Oksuz et al. (2020) for object detection. They propose an averaged LRP alongside a ranking loss function for not only classification but also localisation of objects in images. This provides a balance between both positive and negative samples. Along a similar theme, Yan et al. (2022) explore a discriminative loss function that aims to maximize the expected $F_\beta$ directly for speech mispronunciation. Their loss function is based on the $F_\beta$ (comparing human assessors and the model prediction) weighted by a probability distribution (i.e., a normal distribution) for that score. The final objective function is a weighted average between their loss function and the ordinary cross-entropy.
When considering the components of performance metrics, precision and recall are often the primary focus. Mohit et al. (2012) and Tian et al. (2022) propose two different loss functions that are both recall-oriented. Mohit et al. adjust the hinge loss by adding a recall-based cost (and penalty) into the objective function. As they note, favoring recall over precision results in a substantial boost to recall and $F_1$. By leveraging the concept of inverse frequency weighting (i.e., a sampling-based technique), Tian et al. adjust the cross-entropy to reflect an inverse weighting on false negatives per class. They state that their loss function sits between regular and inverse frequency weighted cross-entropy by balancing the excessive false positives introduced by constantly up-weighting minority classes. When they consider a similar loss function using precision, that loss function shows irregular behavior. These findings are insightful because this paper's $\beta_{opt}$ surface, as seen in section 6.3, is more recall-centric, with the added benefit of being able to incorporate precision weighting through the assumed probability surface.

4 Background

The $F_\beta$ measure comes directly from van Rijsbergen's effectiveness score, $E$, for information retrieval (chapter 7 in Rijsbergen 1979). For the theory on the six conditions supporting $E$ as a measure, refer to Rijsbergen. This paper highlights two of these conditions. First, $E$ guides the practitioner's ability to quantify effectiveness at any point $(r, p)$, where $r$ and $p$ are recall and precision, as compared to some other point. Second, precision and recall contribute effects to $E$ independently. As stated by Rijsbergen, for a constant $r$ (or $p$), the difference in $E$ from any set of varying points of $p$ (or $r$) cannot be removed by changing the constant.
These conditions suggest equivalence relations and imply a common effectiveness (CE) curve based on precision and recall (definition 3 in Rijsbergen 1979). They also motivate the rationale for using statistical distributions to understand the CE curve. Van Rijsbergen's effectiveness measure is given in (1):

$$E = 1 - \frac{1}{\alpha \frac{1}{p} + (1-\alpha)\frac{1}{r}}, \qquad (1)$$

where $\alpha = \frac{1}{\beta^2+1}$. Sasaki (2007) gives the details on deriving $F_\beta = \frac{(\beta^2+1)pr}{\beta^2 p + r}$ from (1) with $\beta = \frac{r}{p}$ and by solving $\frac{\partial E}{\partial r} = \frac{\partial E}{\partial p}$. The $\beta$ parameter is intended to give the practitioner control by assigning $\beta$ times more importance to recall than to precision. Using the derivation steps from Sasaki, a general form of $F_\beta$ for any derivative can be shown as (2):

$$F^n_\beta(p, r) = \frac{\left(\beta^{2^{2-n}}+1\right)pr}{\beta^{2^{2-n}}\,p + r}, \qquad (2)$$

where $n$ pertains to $\frac{\partial^n E}{\partial r^n} = \frac{\partial^n E}{\partial p^n}$, resulting in $\alpha_n = \frac{1}{\beta^{2^{2-n}}+1}$. Note that $n > 0$ and $n \neq 2$; the proof is found in Appendix A. For $n = 2$ the equation reduces to the equality $p = r$, implying $\beta = 1$. Using (2), it can be seen that $\lim_{n\to\infty} F^n_\beta = F^1_1 = \frac{2pr}{p+r}$, which is the form most commonly used in the literature. The reason for showing this limiting case is to provide a justification for fixing $\beta = 1$ (instead of claiming equal importance for $r$ and $p$) in the default case of any algorithm, in particular the algorithm in section 6.

5 Reformulating the F-Beta to leverage statistical distributions

CE for neural networks is seen when different network weights give different precision and recall yet result in similar performance scores. CE also provides a basis for this paper's use of $\beta$ from the $F_\beta$ measure to guide training through penalties, in lieu of an explicit loss (or surrogate loss) function. In fact, Vashishtha et al.
(2022) use the F-score as part of a preprocessing step for feature selection prior to their ensemble model (EM-PCA then ELM) for fault diagnosis. They show significant performance improvement in their approach, which adds supporting evidence for this paper's use of the $F_\beta$ as a loss penalty for feature selection via gradient-based learning.

The first step is to reformulate (2) for $n = 1$. This makes assuming statistical distributions easier. Consider the following reformulation through multiplicative decomposition in (3), which assumes $X_1$ and $X_2$ to be independent random variables:

$$F_\beta = X_1 X_2, \qquad (3)$$

where $X_1 = r' + \beta'$ and $X_2 = (\beta'' + r)^{-1}$, with $r' = pr$, $\beta' = \beta^2 pr$, and $\beta'' = \beta^2 p$. $X_1$ indirectly captures imbalance in the model prediction from the underlying data. If precision and recall are on opposite ends of the $[0, 1]$ scale, then $X_1$ will reflect this, while maintaining continuity when precision and recall are directionally consistent. $X_2$ can be thought of as a weighting scheme that appears recall-centric with a precision-based penalty. For instance, for both high (or both low) precision and recall, the weighting is consistent with intuition. However, when precision and recall are on opposite ends of the $[0, 1]$ scale, the weighting sways toward the aggregate with the lower score. Two use cases are considered for (3): $X_1$ and $X_2$ follow U & IU, respectively, and $X_1$ and $X_2$ follow Ga & IE, respectively.

5.1 Case 1: Uniform and Inverse Uniform

The thought behind U & IU is to apply a (flat) equal distribution for both $X_1$ and $X_2$. These assumed distributions are applied to $\beta'$ and $\beta''$ as follows. Let

$$\beta' \sim U(0, \beta^*), \qquad \beta'' \sim U(0, \beta^*),$$

then

$$X_1 \sim U(r', r' + \beta^*) \qquad (4)$$

and

$$X_2 \sim IU\left(\frac{1}{r+\beta^*}, \frac{1}{r}\right), \qquad (5)$$

where $r, r' \in [0, 1]$ and $\beta^* > 0$.
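Equations (2) and (3) are easy to sanity-check numerically. The sketch below is a throwaway illustration (the helper name `f_beta_general` is assumed, and the exponent in (2) is read as $2^{2-n}$, which reduces to the familiar $\beta^2$ at $n = 1$); it verifies the multiplicative decomposition and the limiting case from section 4:

```python
import random

def f_beta_general(p, r, beta, n=1):
    """General form F^n_beta from eqn (2); n = 1 recovers the usual F_beta."""
    w = beta ** (2.0 ** (2 - n))  # assumed exponent 2^(2-n): n = 1 gives beta^2
    return (w + 1.0) * p * r / (w * p + r)

random.seed(7)
for _ in range(5):
    p, r, beta = (random.uniform(0.05, 1.0) for _ in range(3))
    # Decomposition (3): F_beta = X1 * X2 with X1 = r' + beta', X2 = 1/(beta'' + r)
    x1 = p * r + beta**2 * p * r   # r' + beta', where r' = pr and beta' = beta^2 pr
    x2 = 1.0 / (beta**2 * p + r)   # (beta'' + r)^(-1), where beta'' = beta^2 p
    assert abs(x1 * x2 - f_beta_general(p, r, beta)) < 1e-12
    # Limiting case from section 4: F^n_beta approaches F_1 = 2pr/(p+r) as n grows
    assert abs(f_beta_general(p, r, beta, n=40) - 2*p*r/(p + r)) < 1e-6
print("eqn (2)/(3) checks pass")
```

The decomposition holds exactly, since $(r' + \beta')(\beta'' + r)^{-1} = \frac{(\beta^2+1)pr}{\beta^2 p + r}$ algebraically.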
Note that for both distributions there is only one $\beta^*$ chosen, and this value replaces the need for an explicit form that includes $\beta$ as a parameter. This is for convenience, and also reflects that $\beta'$ and $\beta''$ differ only by a factor of $r$; allowing $\beta^*$ to vary broadly (it becomes $\beta_{max}$ in section 6) is enough to balance this convenience tradeoff. Next is to derive the joint distribution, which will be used in section 6. It can be shown (the proof is in Appendix B) that the joint distribution is:

$$\Pr(F_\beta) = \Pr(F_\beta \le z) = \mathbb{1}\left\{z \le p \;\&\; \tfrac{(r+\beta^*)z}{r'+\beta^*} \le 1\right\} \times \frac{z\left(r+\beta^* - \frac{r'}{z}\right)^2}{2(\beta^*)^2}$$
$$+\; \mathbb{1}\left\{z > p \;\&\; \tfrac{(r+\beta^*)z}{r'+\beta^*} > 1\right\} \times \left[\frac{rz - r'}{\beta^*} + \frac{1}{(\beta^*)^2}\left(\left(r+\beta^* - \frac{r'+\beta^*}{2z}\right)\left(r'+\beta^* - rz\right) + \frac{r\left(rz - (r'+\beta^*)\right)}{2}\right)\right]$$
$$+\; \mathbb{1}\left\{z > p \;\&\; \tfrac{(r+\beta^*)z}{r'+\beta^*} \le 1\right\} \times \left[\frac{rz - r'}{\beta^*} + \frac{1}{(\beta^*)^2}\left(\left(r+\beta^* - \frac{r'+\beta^*}{2z}\right)\left(r'+\beta^* - rz\right) + \frac{r\left(rz - (r'+\beta^*)\right)}{2}\right) - \frac{1}{(\beta^*)^2}\left((r+\beta^*)(r'+\beta^*) - \frac{(r'+\beta^*)^2}{2z} - \frac{(r+\beta^*)^2 z}{2}\right)\right] \qquad (6)$$

To understand this flat mixture, consider Figure 1, the CDF surface over a grid of precision and recall where $\beta^* \in [8, 16]$. (Note: the blue and red heat coloring is from the CDF and highlights curvature and/or rate of change.) For a lower $z$ value of 0.4, Figure 1a shows that $\beta^* = 8$ has a faster rate of change as compared to $\beta^* = 16$. The same conclusion is apparent in Figure 1b, which is for a higher $z$ value of 0.8. For both figures, more curvature is seen for lower $\beta^*$ values. This suggests that a larger $\beta^*$ value smooths the surface and is a better candidate for $\beta_{max}$ in the algorithm in section 6.

5.2 Case 2: Gaussian and Inverse Exponential

A more informed distributional approach for $X_1$ and $X_2$ considers Ga & IE, respectively. The reason to use the Gaussian distribution for $X_1$ is to allow bell-shaped variability around a fixed $r'$ that is based on $\beta'$ and ultimately $\beta$.
The weighting of $X_1$ by $X_2$ uses the Inverse Exponential distribution because, with selections of the rate parameter $\lambda$, the distribution can shift mass from left to right as well as appear uniformly distributed around $r$. This provides practitioners enough flexibility for experimenting with different weights. The following shows the assumptions for $\beta'$ and $\beta''$. Let

$$\beta' \sim Ga(0, \sigma^2), \qquad \beta'' \sim \text{Exponential}(\lambda),$$

then

$$X_1 \sim Ga(r', \sigma^2) \qquad (7)$$

and

$$X_2 \sim IE(\lambda; r), \qquad (8)$$

where $r$ in (8) is the location shift by recall from the definition of $X_2$ in (3), and $\sigma^2$ is the variability captured by $\beta'$.

[Figure 1: Probability Mass Surface: U & IU for precision versus recall, and $\beta^* \in [8, 16]$. Panels: (a) $P(F_\beta < 0.4)$, (b) $P(F_\beta < 0.8)$. The cumulative probability is computed for low (0.4) and high (0.8) values.]

Using both (7) and (8), the distribution for (3) is now split around $z = 0$ as follows:

$$\Pr(F_\beta) = \Pr(F_\beta \le z) = \mathbb{1}_{z>0} \left[\Phi(rz; r', \sigma^2) + \exp\left(\lambda r + \frac{\left(\frac{\lambda\sigma^2}{z}\right)^2 - 2r'\frac{\lambda\sigma^2}{z}}{2\sigma^2}\right)\left(1 - \Phi\left(rz;\; r' - \frac{\lambda\sigma^2}{z},\; \sigma^2\right)\right)\right]$$
$$+\; \mathbb{1}_{z=0}\,\Phi(0; r', \sigma^2) \;+\; \mathbb{1}_{z<0}\left[\Phi(rz; r', \sigma^2) - \exp\left(\lambda r + \frac{\left(\frac{\lambda\sigma^2}{z}\right)^2 - 2r'\frac{\lambda\sigma^2}{z}}{2\sigma^2}\right)\Phi\left(rz;\; r' - \frac{\lambda\sigma^2}{z},\; \sigma^2\right)\right], \qquad (9)$$

where $\Phi(x; \mu, \sigma^2)$ denotes the Gaussian (normal) CDF at the value $x$ for mean $\mu$ and variance $\sigma^2$. (Refer to Appendix C for the proof.) Similar to before, the focus is on the indicator $\mathbb{1}_{z \ge 0}$ as defined in (9).

[Figure 2: Probability Mass Surface: Ga & IE for precision versus recall, $\beta = 16$, $\lambda \in [0.5, 2.0]$, and $\sigma^2 \in [0.5, 2.0]$. Panels: (a) $P(F_\beta < 0.4)$, $\lambda = 0.5$; (b) $P(F_\beta < 0.8)$, $\lambda = 0.5$; (c) $P(F_\beta < 0.4)$, $\lambda = 2.0$; (d) $P(F_\beta < 0.8)$, $\lambda = 2.0$. The cumulative probability is computed for low (0.4) and high (0.8) values.]

Since this distributional mixture has more flexibility due to more parameters, Figure 2 highlights this when $\lambda \in [0.5, 2.0]$ and $\sigma^2 \in [0.5, 2.0]$. The probabilities are computed again at a lower $z$ value, 0.4, and at a higher $z$ value, 0.8, for comparison. For a fixed $\lambda$, varying $\sigma^2$ impacts the curvature of the surface, with higher $\sigma^2$ values producing a flattening effect. Figure 2a shows this distinctly. Conversely, as $\lambda$ increases with a fixed $\sigma^2$, the rate at which the surface changes is very apparent. This can be seen by juxtaposing Figures 2c and 2a, or Figures 2d and 2b, and noticing that the increase in $\lambda$ produces a clear increase in the rate of change. These observations match the intuition that $\sigma^2$ is linked to the shape of the bell curve, and $\lambda$ is linked to a rate of change. It also serves as a basis of intuition behind the algorithm in section 6. That is, a faster rate of change along with a curved (and/or smoother) surface would provide loss penalties that adapt quickly per batch using the aggregated information from precision and recall.

6 Knee algorithm and Weighted Cross-Entropy

6.1 Knee algorithm to find optimal β values

Now that the probabilities $\Pr(F_\beta \le z)$ for some $z \in [0, 1]$ are established in sections 5.1 and 5.2, the goal is to use them to get an optimal $\beta$ value, $\beta_{opt}$. There are a couple of things to consider. First, because $\beta$ is grouped into $\beta'$ and $\beta''$ with distributional assumptions, using maximum likelihood estimation (MLE) is not particularly suitable here. Also, $\beta_{max}$, $\sigma^2$, and $\lambda$ from (4), (5), and (8) are set in advance and do not need to be estimated. Second, the observed data is only one data point per training batch, namely precision and recall.
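Before these probabilities are used, both CDFs from section 5 can be checked by simulation. The following is a hedged sketch (function names `cdf_uiu` and `cdf_gaie` are assumed, not from the paper) that evaluates the closed forms of (6) and (9) and compares them against Monte Carlo draws of $X_1 X_2$:

```python
import math, random

def phi(x, mu, var):
    """Gaussian CDF with mean mu and variance var."""
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2 * var)))

def cdf_uiu(z, p, r, bs):
    """Pr(F_beta <= z) under U & IU, following eqn (6); bs is beta*."""
    rp, A, R = p * r, p * r + bs, r + bs          # r', r'+beta*, r+beta*
    if z <= 0 or z * R <= rp:                      # z(r+beta*) below r': mass 0
        return 0.0
    if z * r >= A:                                 # zr above r'+beta*: mass 1
        return 1.0
    if z <= p and z * R / A <= 1:                  # first indicator case
        return z * (R - rp / z) ** 2 / (2 * bs ** 2)
    second = (r*z - rp)/bs + ((R - A/(2*z))*(A - r*z) + r*(r*z - A)/2) / bs**2
    if z * R / A > 1:                              # second indicator case
        return second
    return second - (R*A - A**2/(2*z) - R**2*z/2) / bs**2  # third case

def cdf_gaie(z, p, r, lam, var):
    """Pr(F_beta <= z) under Ga & IE, following eqn (9)."""
    rp = p * r
    if z == 0:
        return phi(0, rp, var)
    shift = lam * var / z
    e = math.exp(lam * r + (shift**2 - 2 * rp * shift) / (2 * var))
    if z > 0:
        return phi(r*z, rp, var) + e * (1 - phi(r*z, rp - shift, var))
    return phi(r*z, rp, var) - e * phi(r*z, rp - shift, var)

# Monte Carlo agreement check for both mixtures
random.seed(0)
N = 50_000
# U & IU: X1 ~ U(r', r'+b*), X2 = 1/Y with Y ~ U(r, r+b*); p = r = 0.5, b* = 1
mc_uiu = sum(random.uniform(0.25, 1.25) / random.uniform(0.5, 1.5) <= 0.7
             for _ in range(N)) / N
# Ga & IE: X1 ~ N(r', var), X2 = 1/(T + r) with T ~ Exp(lam); p=0.6, r=0.4
mc_gaie = sum(random.gauss(0.24, 0.5) / (random.expovariate(1.0) + 0.4) <= 0.8
              for _ in range(N)) / N
print(round(cdf_uiu(0.7, 0.5, 0.5, 1.0), 3), round(mc_uiu, 3))
print(round(cdf_gaie(0.8, 0.6, 0.4, 1.0, 0.25), 3), round(mc_gaie, 3))
```

With these settings, the closed forms and the Monte Carlo frequencies agree to about two decimal places, which is the kind of spot check that makes the piecewise indicators in (6) and the $z$-split in (9) easier to trust in practice.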
Given this, and the natural bend of the $F_\beta$ function, a knee algorithm is applicable. From Satopaa et al. (2011), the knee of a curve is associated with good operating points in a system, right before the performance levels off. This removes the need for complex system-specific analysis. Furthermore, they provide a definition of curvature that supports their method being application independent, an important property for this paper. Algorithm 1 implements (and slightly alters) the Kneedle algorithm from Satopaa et al. to detect the knee in the $F_\beta$ curve. Refer to Algorithm 1 for the formal pseudocode. A brief explanation in plain words is as follows:

1. For any training batch, compute precision ($p$) and recall ($r$). Then, with a predefined $\beta_{max}$ value, set $n$ equally spaced values $b_{s_i}$ up to $\beta_{max}$, and use section 5 to compute $p_{s_i} = \Pr(F_\beta \le z \mid \beta = b_{s_i}, p = p, r = r)$. (This replaces step 1 from Satopaa et al.) Let $D_s$ represent this smooth curve as $D_s = \{(b_{s_i}, p_{s_i}) \in \mathbb{R}^2 \mid b_{s_i}, p_{s_i} \ge 0\}$ for $i = 1, \ldots, n$.
2. When $r < p$, convert to a knee by taking the difference of the probabilities from the maximum; that is, $p_{s_i} = \max(p_s) - p_{s_i}$ for $i = 1, \ldots, n$. This is necessary because of the formulation of the $F_\beta$ metric.
3. Normalize the points to a unit square and call these $b_{sn}$ and $p_{sn}$.
4. Take the difference of the points and label that $b_d$ and $p_d$.
5. Find the candidate knee points by getting all local maxima, labeled $b_{lmx}$ and $p_{lmx}$.
6. Take the average of $p_{lmx}$; this will be $\beta_{opt}$. (This simplifies Satopaa et al.)

6.2 Proposed Weighted Binary Cross-Entropy

The weighted binary cross-entropy loss is primarily focused on the imbalanced use case where a minority class exists. This paper posits that from the shuffling of data observations, as is frequently done while training, relevant aggregate information is available to use from the batch.
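The six knee-detection steps above can be sketched in code. This is a hypothetical simplification (the name `beta_opt` and the `cdf` callback are assumptions, not the paper's implementation), where `cdf(z, beta)` stands in for either closed form from section 5:

```python
def beta_opt(p, r, cdf, beta_max=16.0, n=300):
    """Knee-based beta search (steps 1-6), a sketch of Algorithm 1.

    cdf(z, beta) should return Pr(F_beta <= z) from section 5.1 or 5.2.
    """
    # Step 1: smooth curve of probabilities over n equally spaced beta values
    b_s = [beta_max * (i + 1) / n for i in range(n)]
    def f_beta(b):  # eqn (2) with n = 1
        return (b * b + 1) * p * r / (b * b * p + r)
    p_s = [cdf(f_beta(b), b) for b in b_s]
    # Step 2: when r < p, flip the curve into a knee
    if r < p:
        m = max(p_s)
        p_s = [m - v for v in p_s]
    # Step 3: normalize both axes to the unit square
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    b_sn, p_sn = norm(b_s), norm(p_s)
    # Step 4: difference curve
    p_d = [pv - bv for pv, bv in zip(p_sn, b_sn)]
    # Step 5: candidate knees are strict local maxima of the difference curve
    p_lmx = [p_d[i] for i in range(1, n - 1)
             if p_d[i - 1] < p_d[i] > p_d[i + 1]]
    # Step 6: average the local maxima; default to 1 as per section 4
    return sum(p_lmx) / len(p_lmx) if p_lmx else 1.0

# Example: a toy monotone "cdf" produces a knee; a flat one falls to the default
print(beta_opt(0.3, 0.6, lambda z, b: z))
print(beta_opt(0.3, 0.6, lambda z, b: 0.5))
```

Per batch, the returned value feeds the weight $(1 + \beta_{opt}^2)$ used by the loss in section 6.2.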
For instance, say a fixed minority class observation, $y$, is grouped among different batches of the majority class. The interaction effect of $y$ among these randomly varying training batches is often overlooked. It is this interaction that can be inferred through the precision and recall aggregates, and then transferred as a penalty to the loss function via $\beta_{opt}$ in a probabilistic way.

Algorithm 1 Calculate $\beta_{opt}$
Require: $n > 0 \wedge \beta_{max} > 0$. Ensure: $\beta_{opt} > 0$.
1: Compute $p$ and $r$ from the training batch.
2: Initialize $b_{sn}$, $b_d$, $b_{lmx}$, $p_{sn}$, $p_d$, $p_{lmx}$, $p_s$ to empty arrays.
3: $b_s \Leftarrow [b_{s_1}, \ldots, b_{s_n}]$ where $b_{s_n} = \beta_{max}$
4: for ($i = 0$ to $n$) do
5: $\quad z \Leftarrow F^{n=1}_{\beta=b_{s_i}}(p, r)$ using eqn (2)
6: $\quad p_{s_i} \Leftarrow \Pr(F_\beta \le z \mid \beta = b_{s_i}, p = p, r = r)$ using section 5.1 or 5.2
7: end for
8: if $r < p$ then
9: $\quad p_{max} = \max(p_s)$
10: $\quad$ for ($i = 0$ to $n$) do
11: $\qquad p_{s_i} \Leftarrow p_{max} - p_{s_i}$
12: $\quad$ end for
13: end if
14: $b_{max} = \max(b_s)$, $p_{max} = \max(p_s)$, $b_{min} = \min(b_s)$, $p_{min} = \min(p_s)$
15: for ($i = 0$ to $n$) do
16: $\quad b_{sn_i} \Leftarrow \frac{b_{s_i} - b_{min}}{b_{max} - b_{min}}$
17: $\quad p_{sn_i} \Leftarrow \frac{p_{s_i} - p_{min}}{p_{max} - p_{min}}$
18: $\quad b_{d_i} \Leftarrow b_{sn_i}$
19: $\quad p_{d_i} \Leftarrow p_{sn_i} - b_{sn_i}$
20: $\quad$ if ($i \ge 1$) $\wedge$ ($i < n$) then
21: $\qquad$ if ($p_{d_{i-1}} < p_{d_i}$) $\wedge$ ($p_{d_{i+1}} < p_{d_i}$) then
22: $\qquad\quad p_{lmx_i} \Leftarrow p_{d_i}$
23: $\qquad\quad b_{lmx_i} \Leftarrow b_{d_i}$
24: $\qquad$ end if
25: $\quad$ end if
26: end for
27: if $p_{lmx}$ is a non-empty array then
28: $\quad \beta_{opt} = \text{mean}(p_{lmx})$
29: else
30: $\quad \beta_{opt} = 1$ as per section 4
31: end if

By using Algorithm 1 to get $\beta_{opt}$, the proposed loss is:

$$L\left(f(x;\theta) \mid \beta^2_{opt}, x\right) = -\sum_i \left\{ y_i \log\left(f_i(x;\theta)\right) + (1-y_i)\log\left(1-f_i(x;\theta)\right) \times \left[ \left(1+\beta^2_{opt}\right)\mathbb{1}\left\{[1-f_i(x;\theta)] \le 0.5\right\} + \frac{\mathbb{1}\left\{[1-f_i(x;\theta)] > 0.5\right\}}{1+\beta^2_{opt}} \right] \right\}, \qquad (10)$$

where the function $f_i(x;\theta)$ is the $i$-th element of the prediction of a neural network using the inputs $x$ and training weights $\theta$, and $y_i$ is the $i$-th element of the true target label. When considering the majority class, or $y_i = 0$ for $i = \{1, \ldots, m\}$, the loss is weighted by $(1+\beta^2_{opt})$. Therefore, for correctly predicted observations, the loss is reduced by $(1+\beta^2_{opt})$; when incorrectly predicted, the loss is magnified by the same amount. For the minority class, or $y_i = 1$ for $i = \{1, \ldots, n-m\}$, the loss is unchanged. This is intentional because under imbalanced data there are far fewer such observations, and computing precision and recall leads to numerical instability or frequent edge cases for Algorithm 1.

[Figure 3: $\beta_{opt}$ surface: (a) U & IU where $\beta_{max} \in [8, 16]$; (b) and (c) Ga & IE for fixed $\beta_{max} = 16$, with $\lambda = 0.5$ and $\lambda = 2.0$ respectively. Note that for Algorithm 1, $n = 300$ equally spaced points are used.]

6.3 Understanding the β_opt Surface and Weighted Cross-Entropy

Figure 3 highlights the surface generated from Algorithm 1, leveraging the probabilities from section 5. First, the U & IU mixture, or Figure 3a, suggests that the shape of the surface remains relatively similar even when doubling $\beta_{max}$. This is an important point toward fixing $\beta_{max} = 16$ for the Ga & IE. Based on Figure 3a, the U & IU mixture penalizes more on the outskirts of recall, while the immediate penalties arise on the diagonal of the unit square. This suggests that precision and recall estimates from training cause immediate penalties when they are on opposite ends of the $[0, 1]$ range, as well as on the diagonal when these values start to even out. For the Ga & IE mixture, Figures 3b and 3c show some conclusions similar to the U & IU mixture, along with additional insights. For a lower rate of $\lambda = 0.5$, or Figure 3b, diagonal spikes for $\sigma^2 \in [0.5, 2.0]$ as well as a precision-centric penalty for higher $\sigma^2$ (i.e., $\sigma^2 = 2$) are seen. For a higher rate of $\lambda = 2.0$, or Figure 3c, a similar diagonal is retained for $\sigma^2 \in [0.5, 2.0]$ as in Figure 3b. Furthermore, for increasing values of $\sigma^2$, the penalty evolves from precision-centric to a vertical separation on the unit grid at around a precision of 0.4. The overall interpretation is the following: for a lower $\lambda$, increasing $\sigma^2$ creates a slightly more precision-based penalty, while for a higher $\lambda$, increasing $\sigma^2$ causes the penalty to become more balanced between recall and precision. The choice of these parameters is problem specific, but provides the practitioner flexibility in determining the best selection for their use case. On a separate note, a spiky surface is obvious in Figure 3, which is partially explained by the default setting in the algorithm. This is a strong sign of immediate and configurable penalties.

7 Datasets and Experimentation

7.1 Datasets

The origins of the $F_\beta$ metric come from text retrieval, so it is important to verify this method across different categories of data. In particular, image data from CIFAR-10, text data from IMDB movie sentiments (Maas et al. 2011), and structured/tabular data from the Census Income Dataset (Dua and Graff 2019) are tested. For each experiment, the primary label (i.e., label 1) is either imbalanced or forced to be imbalanced to reflect real-world scenarios. Because CIFAR-10 contains multiple image labels, the airplane label is the primary label and all others are combined. This yields a 10% class imbalance. The IMDB movie sentiment reviews (positive/negative text) are not imbalanced; the positive sentiments in the training data are reduced to 1K randomly sampled sentiments, yielding a 7.4% imbalance (Table 1).
The Census Income tabular data contains 14 input features (i.e., age, work class, education, occupation, etc.), with 5 numerical and 9 categorical features. The binary labels are greater than 50K salary (label 1) and less than 50K salary (label 0). By default, greater than 50K salary is already imbalanced at 6.2%. The training and validation dataset sizes for each data category are as follows: for CIFAR-10, 50K training and 10K validation; for IMDB, 13.5K training and 25K validation; and for the Census data, 200K training and 100K validation.

In terms of class imbalance, this paper considers a proportion of label 1 under 10% to be significantly imbalanced, and between 10% and 25% to be moderately imbalanced. As some heuristic rationale: at a 10% imbalance, even a model with perfect recall (100%) would require a precision of $p = \frac{1}{3}$ to reach $F_1 = 0.5$. In practical examples, this scenario can occur with weakly discriminative features. Therefore, this paper seeks to test this algorithm in scenarios that would need improved precision. A variety of imbalanced and balanced scenarios will be tested in this paper.

Two real-life use cases related to cylindrical tanks are also considered, providing a physical domain to test Algorithm 1. Chauhan et al. (2022) developed an arithmetic optimizer with the slime mould algorithm, and Chauhan et al. (2023) developed an evolutionary algorithm with the slime mould algorithm; both algorithms focus on global parameter optimization. They tested these algorithms on several benchmark problems, one of which is called the pressure vessel design. The problem is a constrained parameter optimization (i.e., material thickness and cylinder dimensions) for minimizing a cost function. This paper focuses on using the HAOASMA algorithm by Chauhan et al.
in a simulation to convert the problem into a binary classification. The second use case is derived from Underground Storage Tanks (UST) and is also inspired by Chauhan et al.'s pressure vessel problem. The physical shape (i.e., the cylindrical shape) of USTs is similar to the pressure vessel design. USTs are used to store petroleum products, chemicals, and other hazardous materials underground. These structures could deform underground and possibly explain a false positive leak. Ramdhani (2016) and Ramdhani et al. (2018) explored parameter optimization of UST dimensions changing from cylindrical to ellipsoidal. The observed data are vertical (underground) height measurements, which can contain uniformly distributed error. Ramdhani et al. used these measurements and the volumetric equations (E2), (E3), (E4), and (E5) - derived from a cross-sectional view - to develop a methodology to estimate tank dimensions and test whether the shape has deformed. The cross-sectional view can be seen in Figure E2.

The conversion of both of these real-life use cases into a classification involves establishing a baseline set of parameters to simulate data for label 0. Varying these parameters allows simulation of data for label 1. For the pressure vessel design, the baseline parameters from HAOASMA are Ts = 1.8048, Th = 0.0939, R = 13.8360, and L = 123.2019. To convert this to a classification, the thickness parameters Ts and Th are changed from the baseline while the R and L dimensions are drawn from a normal distribution. Using values for Ts, Th, R, and L, the cost function is computed via (D1). These cost values, concatenated with the R and L arrays, serve as the input to a neural network classifier. Label 1 reflects simulated data using Ts and Th that are changed from the baseline; these variations are Ts = 1.7887 and Th ∈ {0.0313, 0.2817}. Label 0 uses the HAOASMA baseline values Ts = 1.8048 and Th = 0.0939. Appendix D provides the equations, the distributional plots in Figure D1, and a detailed explanation of the simulation procedure (Algorithm 2). For the UST problem, Ramdhani used a measurement error model with one error on the height measurement and another on the volume computation. The same model is used to simulate data in this paper. The baseline is a cylinder, and the variations to vertical and horizontal axes a and b represent a cylinder deformed into an ellipse. Using r, L, h, a, and b along with (E2) and (E4) or (E3) and (E5), the volume is computed. These volumes, concatenated with noisy height measurements, are the inputs to a neural network classifier. Label 1 reflects simulated data using the variations to the baseline cylinder; these variations are a ∈ {3.2, 3.8} and b ∈ {5.0, 4.2105}. Label 0 is the baseline cylinder with radius r = 4 and length L = 32. Refer to Appendix E for a detailed explanation of the simulation (Algorithm 3) along with comparison plots and volume equations.

7.2 Model Networks

7.2.1 Image Network

For the CIFAR-10 image dataset, ResNet (He et al. 2016) version 1 is applied. The number of layers for ResNet is 20, which upon initial experimentation is adequate for speed and generalization in this case. The Adam optimizer is used with a learning rate of 1e-3 for a total of 30 epochs. No learning rate schedule is used, because the number of epochs is intentionally kept low in order to validate faster training via the proposed loss algorithm. The training batch size is 32. Modest data augmentation is done - random horizontal and vertical shifts of 10%, and horizontal and vertical flips.

7.2.2 Text Network

For the IMDB movie sentiments, a Transformer block (which applies self-attention, Vaswani et al. 2017) is used.
The token embedding size is 32, and the Transformer has 2 attention heads and a hidden layer size of 32, including dropout rates of 10%. A pooling layer and the two layers that follow - a dense ReLU-activated layer of size 20 and a dense sigmoid layer of size 1 - give the final output probability. As for preprocessing, a vocabulary size of 20K and a maximum sequence length of 200 are used. The training batch size is 32.

7.2.3 Structured/Tabular Network

For the Census Income Dataset, a standard encoder embedding paradigm Schmidhuber (2015) is used. Specifically, all categorical features with an embedding size of 64 are concatenated, then the numerical features are concatenated to this embedding vector. Afterwards, a 25% dropout layer and the two layers that follow - a fully connected dense layer with GELU activation of size 64 and a sigmoid-activated layer of size 1 - provide the final output probability. The training batch size is 256.

1 https://www.cs.toronto.edu/~kriz/cifar.html

7.2.4 UST/Vessel Network

For the real-life use cases on simulated data, the model network is simple because of the minimal number of features. The network is a sequential set of dense layers of sizes 20, 10, and 1. The last layer of size 1 has a sigmoid activation to give the final output probability. Additionally, a dropout of 10% is added after both middle layers. The training batch size is 128.

7.3 Experimental Results

The results in Table 1 compare the use of the loss function (10) by different models based on U & IU and Ga & IE to a baseline case of ordinary cross-entropy. All results shown in this table are computed on the validation datasets for each data category above (see above for the dataset sizes). For ease of presentation, M_1^β is Model 1: U & IU from (5.1), and M_2^(λ,σ²) is Model 2: Ga & IE from (5.2).
The sup erscripts β and ( λ, σ 2 ) are the parameters being explored. M B is the baseline or the same mo del netw ork that is trained using ordinary cross-en tropy . 7.3.1 Image Results F or the image netw ork, T able 1 shows mo dest improv ement ov er the base- line under the M β 1 for a mo derately sized β = 8. This suggests that image data trains b etter under constan t p enalties on the outskirts of the unit square to ward the im balance of high precision and low recall. High precision and low recall imply image confusion betw een classes in the feature em b edding space. In fact, this can lead to large implications as in Grush ( 2015 ). Algorithms like DeepInsp ect Tian et al. ( 2020 ) help to detect confusion and bias errors to iso- late misc lassified images leading to r ep air b ase d training algorithms such as Tian ( 2020 ) and Zhang et al. ( 2021 ). But Qian et al. ( 2021 ) empirically shows that suc h repair or de-biasing algorithms can b e inaccurate with one fixed-seed training run. The imp ortance of the M β 1 result is now evident b ecause M β 1 quic kly p enalizes the netw ork in a wa y that inherently mirrors algorithms lik e DeepInsp ect’s confusion/bias detection without the need for repair algorithms. 7.3.2 T ext Results The training results for the text net work b y far show the most improv emen t with a nearly 14% b o ost in the F 1 score ov er the baseline for the M ( λ,σ 2 ) 2 mo del. Not only is the p erformance notable, the mo del parameter selections are consisten t – the parameters mov e in the same direction. In other words, giv en the parameters λ = 0 . 5 and σ 2 = 0 . 5, the training shows impro ve- men t ov er the baseline and this improv emen t contin ues in the same direction when λ = 0 . 01 and σ 2 = 0 . 01. 
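All of these comparisons rest on penalty-weighted cross-entropy. As a generic illustration only - not a reproduction of the paper's exact loss (10) - a class-weighted binary cross-entropy where the positive-class term is scaled by a penalty weight β can be written as:

```python
import numpy as np

def weighted_bce(y, p, beta=1.0, eps=1e-7):
    """Binary cross-entropy with the positive-class term scaled by beta.
    A generic sketch of penalty weighting; NOT the paper's exact loss (10)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(beta * y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1.0, 0.0, 1.0, 0.0])
p = np.array([0.9, 0.2, 0.4, 0.1])

# beta = 1 reduces to ordinary binary cross-entropy; beta > 1 (e.g., the
# constant beta = 8 of M_1^8) penalizes missed positives more heavily.
assert weighted_bce(y, p, beta=8.0) > weighted_bce(y, p, beta=1.0)
```

In the paper's method, a dynamically selected β_opt per batch would play the role of the fixed `beta` shown here.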
This is similar to Section 7.3.1 because, first, the architecture is generalizing better (seen by the F1 score) for label confusion (i.e., language context) and, second, it adjusts for the intentionally configured imbalance and incorrect labeling (a known issue for this dataset). The incorrect labeling in the IMDB dataset is shown to be non-negligible - upwards of 2-3% - by Klie et al. (2022) and Northcutt et al. (2021). In particular, Northcutt et al. show that small increases in label errors often cause a destabilizing effect on machine learning models, for which the confident learning methodology is developed to detect them. Klie et al. analyze 18 methods (including confident learning) for Automated Error Detection (AED) and show the importance of AED for data cleaning. In close proximity to the AED methodology, another paradigm is Robust Training with Label Noise. Song et al. (2022) provide an exhaustive survey ranging from robust architectures (i.e., noise adaptation layers) and robust regularization (i.e., explicit and implicit regularization) to robust loss (i.e., loss correction, re-weighting, etc.) and sample selection. It is in this context that the M_2^(λ,σ²) framework sits between AED and Robust Training with Label Noise on this IMDB dataset, which is known to have errors. M_2^(λ,σ²) serves two purposes: (1) as a robust loss through the β_opt re-weighting on the batch and (2) as a means to detect and down-weight possible label errors.

7.3.3 Structured/Tabular Results

The results for the structured/tabular network do not show any F1 improvement over the baseline, nor any indication of possible improvement through the extra parameter variations. From Table 1, the best performing model for this dataset (not the baseline) is M_2^(λ,σ²) where λ = 2.0 and σ² = 0.5.
The interpretation of this parameter configuration suggests that training tabular data is very susceptible to both low precision and low recall, hence the high penalty in that area of the unit square in Figure 3. Despite embedding categories and numeric features into a richer vector space, the non-contextual nature of tabular data may not necessarily be best trained through these architectures. Furthermore, Sun et al. (2019) apply a two-dimensional embedding (i.e., simulating an image) to this Census dataset, and the results show that a decision tree (i.e., XGBoost) performs similarly. It is worth mentioning that Sun et al. present these results with an accuracy measure (not F1), which is misleading since the data is naturally imbalanced. However, a similar general conclusion is given by Borisov et al. (2021) for tabular data - decision trees have faster training time and generally comparable accuracy compared with embedding-based architectures. These results are unsurprising because, as stated by Wen et al. (2022), tabular data is not contextually driven data like images or language, which contain position-related correlations. It is reassuring that, after Wen et al. apply a causally aware GAN to the Census data, the resulting F1 score (0.509) is similar to the baseline result in Table 1 (0.5193). Because of these results, there is an important finding: the type of data - in particular contextual data, which is the basis for the creation of the Fβ metric - plays a significant role when using the metric alongside a loss function. This hypothesis is studied further on the benchmark data in Section 7.4.

7.3.4 UST/Pressure Vessel Results

The results for the simulation of real-life use cases can be found in Table 2.
In the UST case, it is evident that this methodology outperforms the baseline cross-entropy in determining a shape change from a cylinder to an ellipse. For example, in the easier scenario for CvE (a = 3.2), the M_2^(λ,σ²) model family appears to be better. However, among the extra variations, M_1^8 and M_2^(0.01,0.01) perform the same. This trend is also observed in the results for the image and text data presented in Table 1. The interpretation is that a slightly more recall-centric penalty may be optimal for this scenario. Interestingly, for the easier CHvEH scenario (a = 3.2), the M_2^(λ,σ²) model family also appears to be better, and the extra variations M_1^32 and M_2^(5.0,5.0) perform the same. These variations mirror CvE but in the other direction, suggesting that a balanced or slightly more precision-centric penalty is optimal. In the difficult scenario (a = 3.8), both CvE and CHvEH are closely aligned with the M_2^(λ,σ²) model family; for CHvEH the best performer is the M_1^32 variation. Overall, there is between 12% and 28% improvement over the baseline or standard cross-entropy for this simulation. Regarding the PV data, for the easier scenario (Th = 0.0313), the M_1^β family appears to be better, with the M_2^(λ,σ²) model family not far behind. In the difficult scenario (Th = 0.2817) there is no improvement over the baseline cross-entropy, but the best performing model family is M_2^(λ,σ²). The reason is likely the significant overlap in distribution seen in Figure D1. These results are impactful because the commonality between model families begins to surface. For the easier scenario, a more recall-centric penalty turns out to be better, while in the difficult scenario, a balanced or slightly precision-centric penalty is more effective. This finding is intuitive.
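The pressure-vessel simulation behind the PV rows of Table 2 can be sketched as below. The cost formula is the standard pressure-vessel design cost from the benchmark literature, assumed here as a stand-in for the paper's equation (D1), and the normal-distribution spreads for R and L are illustrative guesses, not values from the paper:

```python
import numpy as np

def pv_cost(ts, th, r, l):
    """Standard pressure-vessel design cost from the benchmark literature
    (assumed as a stand-in for the paper's equation (D1))."""
    return (0.6224 * ts * r * l + 1.7781 * th * r ** 2
            + 3.1661 * ts ** 2 * l + 19.84 * ts ** 2 * r)

def simulate_pv(n, ts, th, label, seed=0):
    """Draw R and L around the HAOASMA baseline and attach a class label."""
    rng = np.random.default_rng(seed)
    r = rng.normal(13.8360, 0.5, n)    # spread is illustrative
    l = rng.normal(123.2019, 2.0, n)   # spread is illustrative
    cost = pv_cost(ts, th, r, l)
    X = np.column_stack([cost, r, l])  # cost concatenated with R and L arrays
    y = np.full(n, label)
    return X, y

# Label 0: baseline thicknesses; label 1: a varied thickness pair.
# 900 + 300 = 1200 samples with a 25% label 1 proportion, as in Table 2.
X0, y0 = simulate_pv(900, ts=1.8048, th=0.0939, label=0, seed=1)
X1, y1 = simulate_pv(300, ts=1.7887, th=0.0313, label=1, seed=2)
X, y = np.vstack([X0, X1]), np.concatenate([y0, y1])
```

The resulting (cost, R, L) rows and labels then feed the dense UST/vessel network of Section 7.2.4.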
Table 1: Best F1 score for M_1^β and M_2^(λ,σ²) over 30 epochs. M_B is the baseline; the next five columns are the parameter variations of Section 6.3; the last four are extra variations.

Dataset      | M_B    | M_1^16 | M_2^(0.5,0.5) | M_2^(0.5,2.0) | M_2^(2.0,0.5) | M_2^(2.0,2.0) | M_1^8  | M_1^32 | M_2^(0.01,0.01) | M_2^(5.0,5.0)
Image 1      | 0.8161 | 0.8261 | 0.8085 | 0.8193 | 0.8232 | 0.8257 | 0.8266 | 0.8068 | 0.8087 | 0.8178
Text 2       | 0.6749 | 0.6393 | 0.7175 | 0.6170 | 0.6547 | 0.6673 | 0.7236 | 0.5460 | 0.7666 | 0.7364
Structured 3 | 0.5193 | 0.4170 | 0.3917 | 0.4126 | 0.4635 | 0.3930 | 0.3824 | 0.3511 | 0.3890 | 0.4516

1 The image dataset is CIFAR-10. The airplane label versus the remaining labels is the binary label basis, giving a training data imbalance of 10%. Training data size is 50K and validation is 10K.
2 The text dataset for NLP is the IMDB movie sentiment with binary label of positive/negative sentiment. The vocabulary size is 20K and the maximum review length is 200. The training set is imbalanced by choosing only 1K positive sentiments, which yields an imbalance of 7.4%. The training data size is 13.5K and validation is 25K.
3 The structured or tabular dataset is the Census Income Dataset from the UCI repository. The labels are greater than or less than 50K salary. The data is already imbalanced at a rate of 6.2% for >50K. The training data size is 200K and validation is 100K.

Table 2: Best F1 score for M_1^β and M_2^(λ,σ²) over 30 epochs (UST and pressure vessel simulations).

Dataset 0 | M_B    | M_1^16 | M_2^(0.5,0.5) | M_2^(0.5,2.0) | M_2^(2.0,0.5) | M_2^(2.0,2.0) | M_1^8  | M_1^32 | M_2^(0.01,0.01) | M_2^(5.0,5.0)
CvE 1     | 0.9691 | 0.9228 | 0.9983 | 0.9915 | 0.9898 | 0.9565 | 0.9966 | 0.9915 | 0.9966 | 0.9673
CvE 2     | 0.3169 | 0.3147 | 0.3351 | 0.3469 | 0.3296 | 0.3333 | 0.3224 | 0.3401 | 0.3362 | 0.3573
CHvEH 3   | 0.9831 | 0.9813 | 0.9813 | 0.9898 | 0.9915 | 0.9831 | 0.9726 | 0.9882 | 0.9831 | 0.9882
CHvEH 4   | 0.2891 | 0.3345 | 0.3427 | 0.3515 | 0.3262 | 0.3159 | 0.3636 | 0.3701 | 0.3395 | 0.3425
PV 5      | 0.9967 | 0.9992 | 0.9983 | 0.9483 | 0.9831 | 0.9967 | 0.9967 | 0.9891 | 0.9727 | 0.9958
PV 6      | 0.7515 | 0.4552 | 0.5057 | 0.4893 | 0.4722 | 0.5248 | 0.4934 | 0.4861 | 0.4675 | 0.5161

0 The simulations for UST (Underground Storage Tanks) are for the cylinder versus ellipse (CvE) or the cylinder with hemispherical end-caps versus ellipsoid with hemi-ellipsoidal end-caps (CHvEH). For PV, the pressure vessel, the simulation varies the thickness of the surface and head. Refer to Appendices D and E for details. The label 1 proportion is 25% and the total training and testing size is 1200.
1 Label 0 with r = 4 and L = 32 versus label 1 with a = 3.2 and b = 5.0.
2 Label 0 with r = 4 and L = 32 versus label 1 with a = 3.8 and b = 4.2105.
3 Label 0 with r = 4 and L = 32 versus label 1 with a = 3.2 and b = 5.0.
4 Label 0 with r = 4 and L = 32 versus label 1 with a = 3.8 and b = 4.2105.
5 Label 0 with Ts = 1.8048 and Th = 0.0939 versus label 1 with Ts = 1.7887 and Th = 0.0313.
6 Label 0 with Ts = 1.8048 and Th = 0.0939 versus label 1 with Ts = 1.7887 and Th = 0.2817.

7.4 Further Experimentation: Benchmark Analysis

Following the benchmark analysis from Aurelio et al., a similar approach is taken for the image, text, and tabular data. This expands the analysis from Table 1 to provide a more detailed and comprehensive view across various well-known datasets. The results can be found in Tables 3, 4, and 5.
The footnotes in these tables are explained as follows: the breakdown of train and test data sizes, the proportion of label 1, the labeling convention for label 1 versus label 0 (if multiple labels exist), and the location of the data, if necessary. For example, "label 9 vs all" means that label 9 is label 1 and everything else is marked as label 0. Detailed explanations, links, and training details for all the datasets are provided in the footnotes of each table. At a high level, for images, CIFAR-10, CIFAR-100, and Fashion MNIST are analyzed. For text, AG's News Corpus, Reuters Corpus Volume 1, Hate Speech, and the Stanford Sentiment Treebank are analyzed. For the tabular data, 10 classical datasets from the UCI repository are analyzed. Finally, the same model networks from Section 7.3 are used.

7.4.1 Image Results

Comparing the CIFAR-10 result in Table 1 versus Table 3, the model family changes from M_1^β to M_2^(λ,σ²). The interpretation remains consistent: a recall-centric penalty is favored. The CIFAR-100 examples, with an imbalance of 1%, follow a similar recall-centric penalty for M_1^16 under the label convention 9 vs all. However, under the labeling 39 vs all, a more precision-centric penalty is preferred. This illustrates the problem-specific nature of selecting a model family and parameters, showcasing the flexibility of this paper's methodology. Notably, there is a 14% increase in the F1 score for CIFAR-100 under the 39 vs all label convention. Fashion MNIST favors M_2^(λ,σ²) with a more precision-centric penalty. The most intriguing result is that, among all the extra variations, M_2^(5.0,5.0) is the most frequent top performer, which is a more balanced penalty. This suggests that M_2^(5.0,5.0) could be a starting point of exploration given the balanced nature of the penalty distribution.
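The one-vs-all relabeling used throughout these benchmarks (e.g., 9 vs all or 39 vs all on CIFAR-100) is a one-liner; `one_vs_all` is an illustrative helper name, not code from the paper:

```python
import numpy as np

def one_vs_all(labels, positive):
    """Relabel a multi-class vector for binary classification:
    `positive` becomes label 1 and everything else becomes label 0
    (e.g., "9 vs all" on CIFAR-100 fine labels)."""
    return (np.asarray(labels) == positive).astype(int)

y = np.array([3, 9, 1, 9, 0])
print(one_vs_all(y, positive=9))  # [0 1 0 1 0]
```

For rare classes such as a single CIFAR-100 label, this convention directly produces the roughly 1% imbalance reported above.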
7.4.2 Text Results

Referring to Table 4, for AG's News Corpus and Reuters Corpus Volume 1 under the labeling crude vs all, the M_2^(λ,σ²) model family, particularly M_2^(0.5,2.0), is preferred. These parameter selections suggest a slightly more precision-centric penalty. For Reuters Corpus Volume 1 with the labeling trade vs all and the Stanford Sentiment Treebank, there is no observed improvement. In the case of the Hate Speech data, a more distinctive context, there is roughly a 4% boost under the M_2^(2.0,2.0) model. This parameter selection is also a balanced penalty between recall and precision. Overall, similar to the image benchmark conclusion, M_2^(5.0,5.0) is a frequent top performer in the extra-variation set of parameters. This insight of balanced penalty selection also holds for contextual text data.

Table 3: Best F1 score for M_1^β and M_2^(λ,σ²) over 30 epochs (image benchmarks).

Dataset 0       | M_B    | M_1^16 | M_2^(0.5,0.5) | M_2^(0.5,2.0) | M_2^(2.0,0.5) | M_2^(2.0,2.0) | M_1^8  | M_1^32 | M_2^(0.01,0.01) | M_2^(5.0,5.0)
CIFAR-10 1      | 0.9216 | 0.9088 | 0.9204 | 0.9196 | 0.9263 | 0.9122 | 0.9194 | 0.9119 | 0.9268 | 0.9173
CIFAR-100 2     | 0.7345 | 0.7804 | 0.7273 | 0.6941 | 0.7594 | 0.7692 | 0.7501 | 0.7167 | 0.7314 | 0.7683
CIFAR-100 2     | 0.6021 | 0.6592 | 0.6778 | 0.6871 | 0.6381 | 0.6818 | 0.6509 | 0.6351 | 0.6702 | 0.6704
Fashion MNIST 3 | 0.8651 | 0.8638 | 0.8663 | 0.8462 | 0.8651 | 0.8593 | 0.8544 | 0.8558 | 0.8638 | 0.8672
Fashion MNIST 3 | 0.9627 | 0.9621 | 0.9656 | 0.9656 | 0.9675 | 0.9648 | 0.9615 | 0.9648 | 0.9641 | 0.9681

0 All the datasets are easily found in the Keras repository: https://keras.io/api/datasets/.
1 Train/test 50K/10K, label 1 10%, labeling is 1 vs all.
2 Train/test 50K/10K & 50K/10K, label 1 1% & 1%, labeling is 9 vs all & 39 vs all.
3 Train/test 50K/10K & 50K/10K, label 1 10% & 10%, labeling is 0 vs all & 9 vs all.

Table 4: Best F1 score for M_1^β and M_2^(λ,σ²) over 30 epochs (text benchmarks).

Dataset 0 | M_B    | M_1^16 | M_2^(0.5,0.5) | M_2^(0.5,2.0) | M_2^(2.0,0.5) | M_2^(2.0,2.0) | M_1^8  | M_1^32 | M_2^(0.01,0.01) | M_2^(5.0,5.0)
ag news 1 | 0.9632 | 0.9474 | 0.9632 | 0.9639 | 0.9626 | 0.9624 | 0.9553 | 0.9404 | 0.9634 | 0.9655
rcv1 2    | 0.9333 | 0.9298 | 0.9396 | 0.9461 | 0.9211 | 0.9451 | 0.9316 | 0.9270 | 0.9356 | 0.9501
rcv1 2    | 0.9324 | 0.9200 | 0.9251 | 0.9189 | 0.9178 | 0.9189 | 0.9127 | 0.9139 | 0.9251 | 0.9054
hate 3    | 0.8671 | 0.8304 | 0.8741 | 0.9045 | 0.8621 | 0.9046 | 0.8383 | 0.7669 | 0.8655 | 0.8868
sst 4     | 0.8175 | 0.7619 | 0.7955 | 0.7909 | 0.8071 | 0.8001 | 0.7727 | 0.7494 | 0.8018 | 0.8004

0 Datasets are found in the Hugging Face repository. The base-url is https://huggingface.co/datasets.
1 Train/test 90K/30K, label 1 25%, labeling is 3 vs all; AG's News Corpus data found at base-url/ag_news.
2 Train/test 5485/2189 & 5485/2189, label 1 4.61% & 4.57%, labeling is crude vs all & trade vs all; Reuters Corpus Volume 1 data found at base-url/yangwang825/reuters-21578.
3 Train/test 8027/2676, label 1 11%, labeling is 1 vs 0; Hate Speech data found at base-url/hate_speech18.
4 Train/test 67K/872, label 1 55%, labeling is 1 vs 0; Stanford Sentiment Treebank found at base-url/sst2.

7.4.3 Structured/Tabular Results

The tabular or structured benchmark results in Table 5 show that this paper's methodology outperforms the baseline for all but one dataset (the breast cancer dataset). A key insight is that, for the parameter variations from Section 6.3 and the extra variations, a more recall-centric penalty is preferred. In particular, the M_1^β and M_2^(2.0,0.5) model families are favored for the datasets iono, pima, vehicle, glass, vowel, yeast, and abalone. The remaining datasets - seg and sat - show modest improvement for the balanced penalty, or the M_2^(0.5,2.0) model. Compared to the Census results in Table 1, it appears that feature distinctiveness plays a major part for tabular data. This paper defines feature distinctiveness as a neural network learning better discriminative features with respect to the dependent variable. This conclusion arises from the more recall-centric penalty showing up in the results, suggesting that for tabular or structured data, the network should focus on learning strong discriminative features to enhance recall. This result underscores the hypothesis of this paper that the type of data, particularly contextual data, matters for a metric-based penalty, and it further supports the flexibility of this Fβ penalty methodology.

Table 5: Best F1 score for M_1^β and M_2^(λ,σ²) over 30 epochs (tabular benchmarks).

Dataset 0  | M_B    | M_1^16 | M_2^(0.5,0.5) | M_2^(0.5,2.0) | M_2^(2.0,0.5) | M_2^(2.0,2.0) | M_1^8  | M_1^32 | M_2^(0.01,0.01) | M_2^(5.0,5.0)
iono 1     | 0.7845 | 0.8364 | 0.8068 | 0.8161 | 0.8092 | 0.8256 | 0.8205 | 0.8742 | 0.8114 | 0.7845
pima 2     | 0.5253 | 0.4407 | 0.4109 | 0.3645 | 0.5454 | 0.5088 | 0.5124 | 0.2711 | 0.5058 | 0.5208
breast 3   | 0.9416 | 0.7985 | 0.6464 | 0.8633 | 0.7934 | 0.9387 | 0.7832 | 0.8239 | 0.7589 | 0.7832
vehicle 4  | 0.3942 | 0.4423 | 0.4000 | 0.3363 | 0.4507 | 0.3470 | 0.2105 | 0.3247 | 0.3103 | 0.3333
seg 5      | 0.6798 | 0.5099 | 0.6078 | 0.6987 | 0.5571 | 0.5295 | 0.3915 | 0.3130 | 0.6645 | 0.3247
glass 6    | 0.8695 | 0.7200 | 0.7407 | 0.7826 | 0.9473 | 0.6250 | 0.9473 | 0.7000 | 0.8333 | 0.9523
sat 7      | 0.5511 | 0.1674 | 0.3274 | 0.5571 | 0.4963 | 0.1313 | 0.2375 | 0.1714 | 0.5849 | 0.3779
vowel 8    | 0.2752 | 0.3439 | 0.3076 | 0.2926 | 0.3103 | 0.2434 | 0.2464 | 0.1851 | 0.1647 | 0.2979
yeast 9    | 0.5491 | 0.8717 | 0.7500 | 0.5079 | 0.2185 | 0.2010 | 0.6046 | 0.5084 | 0.6857 | 0.2105
abalone 10 | 0.9723 | 0.9723 | 0.9723 | 0.9723 | 0.9723 | 0.9723 | 0.9723 | 0.9765 | 0.9723 | 0.9723

0 Data URLs: UCI-url https://archive.ics.uci.edu/datasets/ or R-url https://github.com/cran/mlbench/tree/master/data/.
1 Train/test 235/116, label 1 34%, Ionosphere data found at UCI-url.
2 Train/test 514/254, label 1 35%, Pima Indians Diabetes data found at R-url.
3 Train/test 381/188, label 1 38%, Breast Cancer Wisconsin data found at UCI-url.
4 Train/test 566/280, label 1 27%, labeling is opel vs all, Vehicle data found at R-url.
5 Train/test 210/2100, label 1 14%, labeling is brickface vs all, Segmentation data found at UCI-url.
6 Train/test 143/71, label 1 13%, labeling is 7 vs all, Glass data found at R-url.
7 Train/test 4308/1004, label 1 9%, labeling is 4 vs all, Satellite data found at UCI-url.
8 Train/test 663/327, label 1 9%, labeling is hYd vs all, Vowel data found at R-url.
9 Train/test 344/170, label 1 9%, labeling is CYT vs ME2, Yeast data found at UCI-url.
10 Train/test 489/242, label 1 6%, labeling is 18 vs 9, Abalone data found at UCI-url.

8 Conclusion

This paper proposes a weighted cross-entropy based on van Rijsbergen's Fβ measure.
By assuming statistical distributions as an intermediary, an optimal β can be found, which is then used as a penalty weighting in the loss function. This approach is convenient since van Rijsbergen defines β as a weighting parameter between recall and precision. Training guided by the Fβ metric hypothesizes that the interaction of the many combinations between the minority and majority classes carries information that can help in three ways. First, as in Vashishtha et al., it can improve feature selection. Second, model training can generalize better. Lastly, overall performance may improve. Results from Table 1 show that this methodology helps achieve better F1 scores in some cases, with the added benefit of parameter interpretation from M_1^β and M_2^(λ,σ²). Furthermore, when considering results from real-life use cases as in Table 2, commonalities between model families start to surface: parameter selections that yield recall-centric penalties can be observed for both M_1^β and M_2^(λ,σ²). The analyses from this paper provide the following insights: (1) the balanced penalty distribution is a good starting point for the M_2^(λ,σ²) model family; (2) feature distinctiveness impacts parameter selections for both model families; (3) non-contextual data, such as tabular or structured data, seems to benefit from a recall-centric penalty; (4) M_1^β may be better for image data, and M_2^(λ,σ²) for text; and (5) contextual data is better positioned for embedding architectures than non-contextual data, except when the tabular data can be mapped to contextual data or the features are discriminative. These points show that Fβ as a performance metric can be integrated alongside a loss function through penalty weights by using statistical distributions.

References

Aurelio, Y.
S., de Almeida, G. M., de Castro, C. L., and Braga, A. P. (2022). Cost-sensitive learning based on performance metric for imbalanced data. Neural Processing Letters, 54(4), 3097-3114.

Chauhan, S., Vashishtha, G., and Kumar, A. (2022). A symbiosis of arithmetic optimizer with slime mould algorithm for improving global optimization and conventional design problem. The Journal of Supercomputing, 78(5), 6234-6274.

Chauhan, S., and Vashishtha, G. (2023). A synergy of an evolutionary algorithm with slime mould algorithm through series and parallel construction for improving global optimization and conventional design problem. Engineering Applications of Artificial Intelligence, 118, 105650.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.

Fujino, A., Isozaki, H., and Suzuki, J. (2008). Multi-label text categorization with model combination based on F1-score maximization. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II.

Hasanin, T., Khoshgoftaar, T. M., Leevy, J. L., and Seliya, N. (2019). Examining characteristics of predictive models with imbalanced big data. Journal of Big Data, 6(1), 1-21.

Oksuz, K., Cam, B. C., Akbas, E., and Kalkan, S. (2018). Localization recall precision (LRP): A new performance metric for object detection. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 504-519).

Li, X., Sun, X., Meng, Y., Liang, J., Wu, F., and Li, J. (2019). Dice loss for data-imbalanced NLP tasks. arXiv preprint.

Ho, Y., and Wookey, S. (2019). The real-world-weight cross-entropy loss function: Modeling the costs of mislabeling. IEEE Access, 8, 4806-4813.

Lipton, Z. C., Elkan, C., and Narayanaswamy, B. (2014). Thresholding classifiers to maximize F1 score.
arXiv preprint.

Bénédict, G., Koops, V., Odijk, D., and de Rijke, M. (2021). sigmoidF1: A smooth F1 score surrogate loss for multilabel classification. arXiv preprint.

Borisov, V., Leemann, T., Seßler, K., Haug, J., Pawelczyk, M., and Kasneci, G. (2021). Deep neural networks and tabular data: A survey. arXiv preprint.

Dua, D., and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Dudewicz, E. J., and Mishra, S. (1988). Modern mathematical statistics. John Wiley & Sons, Inc.

Grush, L. (2015). Google engineer apologizes after Photos app tags two black people as gorillas. The Verge, 1.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).

Hogg, R. V., and Craig, A. T. (1995). Introduction to mathematical statistics (5th edition). Englewood Cliffs, New Jersey.

Jansche, M. (2005, October). Maximum expected F-measure training of logistic regression models. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (pp. 692-699).

Klie, J. C., Webber, B., and Gurevych, I. (2022). Annotation error detection: Analyzing the past and present for a more coherent future. arXiv preprint.

Lee, N., Yang, H., and Yoo, H. (2021). A surrogate loss function for optimization of F_β score in binary classification with imbalanced data. arXiv preprint arXiv:2104.01459.

Lin, T. Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2980-2988).

Maas, A., Daly, R. E., Pham, P.
T., Huang, D., Ng, A. Y., and P otts, C., (June 2011) Learning word v ectors for sentimen t analysis. In Pr o c e e dings of the 49th annual me eting of the asso ciation for c omputational linguistics: Human language te chnolo gies (pp. 142-150). Mohit, B., Schneider, N., Bhowmic k, R., Oflazer, K., and Smith, N. A. (2012, April). Recall-orien ted learning of named entities in Arabic Wikip edia. In Pr o c e e dings of the 13th Confer enc e of the Eur op e an Chapter of the Asso ciation for Computational Linguistics (pp. 162-173). R eformulating van R ijsb er gen ’s F β metric for weighte d binary cr oss-entr opy 29 Northcutt, C. G., Athaly e, A., and Mueller, J. (2021). Perv asive lab el errors in test sets destabilize mac hine learning b enchmarks. arXiv pr eprint Oksuz, K., Cam, B. C., Akbas, E., and Kalk an, S. (2020). A ranking- based, balanced loss function unifying classification and lo calisation in ob ject detection. A dvanc es in Neur al Information Pr o c essing Systems, 33, 15534-15545. Qian, S., Pham, V. H., Lutellier, T., Hu, Z., Kim, J., T an, L., ... and Shah, S. (2021). Are my deep learning systems fair? An empirical study of fixed- seed training. A dvanc es in Neur al Information Pr o c essing Systems, 34, 30211-30227. Ramdhani, S. (2016). Some con tributions to underground storage tank calibration models, leak detection and shap e deformation (Do ctoral dissertation, The Univ ersity of T exas at San An tonio). Ramdhani, S., T ripathi, R., Keating, J., and Balakrishnan, N. (2018). Underground storage tanks (UST): A closer inv estigation statistical implications to changing the shap e of a UST. Communic ations in Statistics-Simulation and Computation , 47(9), 2612-2623. Sandler, M., Ho ward, A., Zhu, M., Zhmogino v, A., and Chen, L. C. (2018). Mobilenetv2: Inv erted residuals and linear b ottlenec ks. In Pr o c e e dings of the IEEE c onfer enc e on c omputer vision and p attern r e c o gnition (pp. 4510-4520). 
Sasaki, Y., The T ruth of the F-Measure, University of Manchester T e chnic al R ep ort, 2007. Satopaa, V., Albrech t, J., Irwin, D., and Raghav an, B., Finding a kne e d le in a ha ystack: Detecting knee p oin ts in system b eha vior, In 2011 31st inter- national c onfer enc e on distribute d c omputing systems workshops (pp. 166-171). IEEE Sc hmidhuber, J. (2015). Deep learning in neural net w orks: An o v erview. Neur al networks, 61, 85-117. Song, H., Kim, M., Park, D., Shin, Y., and Lee, J. G. (2022). Learning from noisy lab els with deep neural netw orks: A surv ey . IEEE T r ansactions on Neur al Networks and L e arning Systems. Sun, B., Y ang, L., Zhang, W., Lin, M., Dong, P ., Y oung, C., and Dong, J. (2019). Sup ertml: Two-dimensional word embedding for the precognition on structured tabular data. In Pr o c e e dings of the IEEE/CVF Confer enc e on Computer Vision and Pattern R e c o gnition Workshops (pp. 0-0). 30 R eformulating van R ijsb er gen ’s F β metric for weighte d binary cr oss-entr opy Tian, J., Mithun, N. C., Seymour, Z., Chiu, H. P ., and Kira, Z. (2022, May). Striking the Right Balance: Recall Loss for Seman tic Segmen tation. In 2022 International Confer enc e on R ob otics and Automation (ICRA) (pp. 5063-5069). IEEE. Tian, Y., Zhong, Z., Ordonez, V., Kaiser, G., and Ra y , B. (2020, June). T est- ing dnn image classifiers for confusion & bias errors. In Pr o c e e dings of the ACM/IEEE 42nd International Confer enc e on Softwar e Engine ering (pp. 1122-1134). Tian, Y. (2020, Nov ember). Repairing confusion and bias errors for DNN- based image classifiers. In Pr o c e e dings of the 28th A CM Joint Me eting on Eur op e an Softwar e Engine ering Confer enc e and Symp osium on the F oundations of Softwar e Engine ering (pp. 1699-1700). V an Rijsb ergen, C. J., Information retriev al 2nd, Newton MA, 1979. V ashishtha, G., and Kumar, R. (2022). 
Pelton wheel bucket fault diagnosis using improved Shannon entropy and expectation maximization principal component analysis. Journal of Vibration Engineering & Technologies, 1-15.

Vashishtha, G., and Kumar, R. (2022). Unsupervised learning model of sparse filtering enhanced using Wasserstein distance for intelligent fault diagnosis. Journal of Vibration Engineering & Technologies, 1-18.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Yan, B. C., Wang, H. W., Jiang, S. W. F., Chao, F. A., and Chen, B. (2022, July). Maximum F1-score training for end-to-end mispronunciation detection and diagnosis of L2 English speech. In 2022 IEEE International Conference on Multimedia and Expo (ICME) (pp. 1-5). IEEE.

Zhang, X., Zhai, J., Ma, S., and Shen, C. (2021, May). AUTOTRAINER: An automatic DNN training problem detection and repair system. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (pp. 359-371). IEEE.

Wen, B., Cao, Y., Yang, F., Subbalakshmi, K., and Chandramouli, R. (2022, March). Causal-TGAN: Modeling tabular data using causally-aware GAN. In ICLR Workshop on Deep Generative Models for Highly Structured Data.

Appendix A General form of F-Beta: n-th derivative

The derivation pattern using Sasaki's (2007) steps is straightforward for any partial derivative after the first derivative. To set the stage, a few equations are listed.

– From (1), it can easily be shown that $\frac{1}{\alpha\frac{1}{p} + (1-\alpha)\frac{1}{r}} = \frac{pr}{\alpha r + (1-\alpha)p}$.
– Keeping the notation similar to Sasaki (2007), let $g = \alpha r + (1-\alpha)p$; then $\frac{\partial g}{\partial r} = \alpha$ and $\frac{\partial g}{\partial p} = 1-\alpha$.
– Taking the first derivative of (1) via the chain rule yields:
$$\frac{\partial E}{\partial r} = \frac{-pg + pr\frac{\partial g}{\partial r}}{g^2} \quad \text{and} \quad \frac{\partial E}{\partial p} = \frac{-rg + pr\frac{\partial g}{\partial p}}{g^2}.$$
– After simplifying, $\frac{\partial E}{\partial r} = \frac{-(1-\alpha)p^2}{g^2}$ and $\frac{\partial E}{\partial p} = \frac{-\alpha r^2}{g^2}$.

After setting $\frac{\partial^n E}{\partial r^n} = \frac{\partial^n E}{\partial p^n}$ for $n = 1$, it is easy to see that $(1-\alpha)p^2 = \alpha r^2$, and using $\beta = \frac{r}{p}$ yields the $\alpha$ that pertains to the original $F_\beta$ measure, or (2) with $n = 1$. With the same steps, for $n = 2$ the equality becomes $2(1-\alpha)p^2\alpha = 2\alpha r^2(1-\alpha)$, or $p = r$, implying $\beta = 1$. With each successive differentiation where $n > 2$, the pattern is as follows: $c\,\alpha^{n-2}p^2 = c\,(1-\alpha)^{n-2}r^2$, where $c$ is the same constant on both sides. Using $r = \beta p$ will then give the generalized equality
$$\alpha_n = \frac{1}{\beta^{-\frac{2}{n-2}} + 1}.$$

Appendix B Case 1: Joint Probability Distribution for U and IU

To prove (6) it is sufficient to set up both integrals and explain the bounds; the computation itself is straightforward. From the probability
$$\Pr(F_\beta) = \Pr(F_\beta \le z) = \Pr(X_1 X_2 \le z),$$
it is clear that the domain is $\left[\frac{r'}{r+\beta^*}, \frac{r'+\beta^*}{r}\right]$ based on (4) and (5). With a slight rearrangement, we can say the following:
$$\Pr(F_\beta) = \Pr(X_1 X_2 \le z) = \int_{r'}^{r'+\beta^*} \Pr\left(x_2 \le \frac{z}{x_1}\right) f_{x_1}\, dx_1,$$
where $f_{x_1}$ is the probability density of $X_1$ and $\Pr\left(x_2 \le \frac{z}{x_1}\right)$ is the cumulative distribution for $X_2$. These are quite common and can be found online, or in Dudewicz and Mishra (1988) or Hogg and Craig (1995). The bounds come from (4). Using these bounds, notice that for $x_2 \le \frac{z}{x_1}$ to exist, we need $\frac{z}{x_1} \ge \frac{1}{r+\beta^*}$ and $\frac{z}{x_1} \le \frac{1}{r}$. This results in the range $rz \le x_1 \le (r+\beta^*)z$. Recall that $x_1$ exists in the range $r' \le x_1 \le r'+\beta^*$. From both intervals on $x_1$, define condition 1 as $rz \le r'$, or $z \le p$ (since $r' = pr$), and condition 2 as $(r+\beta^*)z \le r'+\beta^*$, or $\frac{(r+\beta^*)z}{r'+\beta^*} \le 1$.
We need to consider separately the following scenarios: conditions 1 and 2 both true, conditions 1 and 2 both false, and condition 1 false with condition 2 true. The scenario of condition 1 being true and condition 2 being false does not occur.

Proof. For $z \le p$ and $\frac{(r+\beta^*)z}{r'+\beta^*} \le 1$ we get the following:
$$\int_{r'}^{r'+\beta^*} \Pr\left(x_2 \le \frac{z}{x_1}\right) f_{x_1}\,dx_1 = \frac{1}{\beta^*}\int_{r'}^{(r+\beta^*)z} \Pr\left(x_2 \le \frac{z}{x_1}\right) dx_1 = \frac{1}{(\beta^*)^2}\int_{r'}^{(r+\beta^*)z}\left(r+\beta^* - \frac{x_1}{z}\right)dx_1 = \frac{z}{2(\beta^*)^2}\left(r+\beta^* - \frac{r'}{z}\right)^2.$$

For $z > p$ and $\frac{(r+\beta^*)z}{r'+\beta^*} > 1$ we get the following:
$$\int_{r'}^{r'+\beta^*} \Pr\left(x_2 \le \frac{z}{x_1}\right) f_{x_1}\,dx_1 = \frac{1}{\beta^*}\int_{r'}^{rz} dx_1 + \frac{1}{\beta^*}\int_{rz}^{r'+\beta^*}\Pr\left(x_2 \le \frac{z}{x_1}\right)dx_1$$
$$= \frac{rz - r'}{\beta^*} + \frac{1}{(\beta^*)^2}\left[\left(r+\beta^* - \frac{r'+\beta^*}{2z}\right)\left(r'+\beta^* - rz\right) + \frac{r}{2}\left(rz - (r'+\beta^*)\right)\right].$$

For $z > p$ and $\frac{(r+\beta^*)z}{r'+\beta^*} \le 1$ we get the following:
$$\int_{r'}^{r'+\beta^*} \Pr\left(x_2 \le \frac{z}{x_1}\right) f_{x_1}\,dx_1 = \frac{1}{\beta^*}\int_{r'}^{rz} dx_1 + \frac{1}{\beta^*}\int_{rz}^{r'+\beta^*}\Pr\left(x_2 \le \frac{z}{x_1}\right)dx_1 - \frac{1}{\beta^*}\int_{(r+\beta^*)z}^{r'+\beta^*}\Pr\left(x_2 \le \frac{z}{x_1}\right)dx_1$$
$$= \frac{rz - r'}{\beta^*} + \frac{1}{(\beta^*)^2}\left[\left(r+\beta^* - \frac{r'+\beta^*}{2z}\right)\left(r'+\beta^* - rz\right) + \frac{r}{2}\left(rz - (r'+\beta^*)\right)\right] - \frac{1}{(\beta^*)^2}\left[(r+\beta^*)(r'+\beta^*) - \frac{(r'+\beta^*)^2}{2z} - \frac{(r+\beta^*)^2 z}{2}\right].$$

For the scenario $z \le p$ and $\frac{(r+\beta^*)z}{r'+\beta^*} > 1$, we need to show that it never occurs. By rearranging condition 2, and recalling $r' = pr$, we can get $r(z-p) > \beta^*(1-z)$. If $z \le 1$, then $r(z-p) \le 0$ by $z \le p$; since $\beta^* > 0$, $r(z-p) > \beta^*(1-z)$ never occurs. If $z > 1$, then it implies $p > 1$ since $z \le p$. This never occurs since $p \in [0,1]$.
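The first branch above can be checked numerically by drawing $X_1$ uniformly on $[r', r'+\beta^*]$ and $X_2$ as the reciprocal of a uniform draw on $[r, r+\beta^*]$. The parameter values below are illustrative choices (not taken from the paper) that satisfy both $z \le p$ and $(r+\beta^*)z \le r'+\beta^*$:

```python
import numpy as np

# Monte Carlo sanity check of the first branch of the Case 1 CDF, i.e.
# Pr(X1*X2 <= z) = z/(2 beta^2) * (r + beta - r'/z)^2  for z <= p and
# (r + beta)z <= r' + beta. Parameter values are illustrative only.
rng = np.random.default_rng(0)
p, r, beta = 0.8, 0.5, 1.0      # precision, recall, beta* (illustrative)
rp = p * r                      # r' = p*r in the paper's notation
z = 0.6                         # z <= p and (r + beta)*z <= r' + beta

x1 = rng.uniform(rp, rp + beta, 500_000)       # X1 ~ U[r', r' + beta*]
x2 = 1.0 / rng.uniform(r, r + beta, 500_000)   # X2 = 1/X, X ~ U[r, r + beta*]
empirical = np.mean(x1 * x2 <= z)

closed_form = z / (2 * beta**2) * (r + beta - rp / z) ** 2
print(abs(empirical - closed_form) < 0.005)  # True (within Monte Carlo error)
```

With these values the closed form evaluates to $5/24 \approx 0.2083$, which the empirical frequency matches to within sampling error.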
□

Appendix C Case 2: Joint Probability Distribution for Ga and IE

The derivation of (9) is similar to Case 1 in that the integral will be broken into pieces and the probability distribution proof will be used again. As before, using the same rearrangement, we can say the following:
$$\Pr(F_\beta) = \Pr(X_1 X_2 \le z) = \int_{-\infty}^{+\infty}\Pr\left(x_2 \le \frac{z}{x_1}\right) f_{x_1}\,dx_1.$$
Before moving forward, $X_2$'s marginal distribution, or (8), will be given. If $\beta'' \sim \text{Exponential}(\lambda)$, then the CDF for $X = \beta'' + r$ is:
$$F(x) = \Pr(X \le x) = \Pr(\beta'' + r \le x) = \Pr(\beta'' \le x - r) = 1 - \exp(-\lambda(x-r)), \quad \forall x \ge r.$$
With the transformation $Y = g(X) = \frac{1}{X}$, we have $g^{-1}(y) = \frac{1}{y}$, so $\beta'' = \frac{1-ry}{y}$. Then, since $g$ is strictly decreasing,
$$F_Y(y) = 1 - F_X(g^{-1}(y)) = \exp\left(-\lambda\left(\frac{1-ry}{y}\right)\right).$$
Now, we can see that $X_2$ has the distribution $F(x_2) = \exp\left(-\lambda\left(\frac{1-rx_2}{x_2}\right)\right)$, where $x_2 \in [0, \frac{1}{r}]$. Using this property we complete the proof.

Proof. For $z > 0$:
$$\int_{-\infty}^{rz} f_{x_1}\,dx_1 + \int_{rz}^{+\infty}\exp\left\{-\lambda\left(\frac{1 - r\frac{z}{x_1}}{\frac{z}{x_1}}\right)\right\} f_{x_1}\,dx_1$$
$$= \Phi(rz; r', \sigma^2) + \frac{1}{\sqrt{2\pi\sigma^2}}\int_{rz}^{+\infty}\exp\left\{-\lambda\left(\frac{1 - r\frac{z}{x_1}}{\frac{z}{x_1}}\right)\right\}\exp\left\{-\frac{(x_1 - r')^2}{2\sigma^2}\right\} dx_1$$
$$= \Phi(rz; r', \sigma^2) + \frac{1}{\sqrt{2\pi\sigma^2}}\int_{rz}^{+\infty}\exp\left\{-\frac{x_1^2 - 2r'x_1 + (r')^2}{2\sigma^2} - \frac{\lambda x_1}{z} + \lambda r\right\} dx_1$$
$$= \Phi(rz; r', \sigma^2) + \frac{1}{\sqrt{2\pi\sigma^2}}\int_{rz}^{+\infty}\exp\left\{-\frac{\left(x_1 - \left(r' - \frac{\lambda\sigma^2}{z}\right)\right)^2}{2\sigma^2} + \frac{2\lambda r\sigma^2 + \left(\frac{\lambda\sigma^2}{z}\right)^2 - \frac{2r'\lambda\sigma^2}{z}}{2\sigma^2}\right\} dx_1$$
$$= \Phi(rz; r', \sigma^2) + \exp\left\{\frac{2\lambda r\sigma^2 + \left(\frac{\lambda\sigma^2}{z}\right)^2 - \frac{2r'\lambda\sigma^2}{z}}{2\sigma^2}\right\}\left[1 - \Phi\left(rz;\, r' - \frac{\lambda\sigma^2}{z},\, \sigma^2\right)\right].$$
To be clear, the bounds of the integral around $rz$ arise from the directionally based inequality on $\frac{z}{x_1}$, in particular $\frac{z}{x_1} > \frac{1}{r}$.

For $z = 0$: it can be seen that the entire mass is summarized by the Gaussian distribution, or $X_1$, since $X_2$ is non-negative.
$$\int_{-\infty}^{0} f_{x_1}\,dx_1 = \Phi(0; r', \sigma^2).$$

For $z < 0$: this is a bit different because, though the Inverse Exponential redistributes the mass of the Gaussian as before, its non-negativity needs to be adjusted for. Consider the interval $[-\infty, 0]$ that represents the domain for this (probability) mass. Let $p_z$ be the probability mass of interest. Next, define $b_1$ to be the mass from $[-\infty, rz]$, $b_2$ to be the mass from $[rz, 0]$, and $b_3$ as the mass for $[-\infty, 0]$. Notice for $b_1$ and $b_2$, the separation of the integral is similar to before but with different bounds. So we have the following:
$$b_1 = \int_{-\infty}^{rz}\exp\left\{-\lambda\left(\frac{1 - r\frac{z}{x_1}}{\frac{z}{x_1}}\right)\right\} f_{x_1}\,dx_1, \quad b_2 = \int_{rz}^{0} f_{x_1}\,dx_1, \quad b_3 = \Phi(0; r', \sigma^2).$$
By $X_2$'s redistribution, the mass for the negative values is $p_z = b_3 - b_1 - b_2$ for $z < 0$. The proof is now simplified to solving the $p_z$ expression, and by using some of the results from the $z > 0$ case we have the following:
$$\Phi(0; r', \sigma^2) - \int_{-\infty}^{rz}\exp\left\{-\lambda\left(\frac{1 - r\frac{z}{x_1}}{\frac{z}{x_1}}\right)\right\} f_{x_1}\,dx_1 - \int_{rz}^{0} f_{x_1}\,dx_1$$
$$= \Phi(0; r', \sigma^2) - \exp\left\{\frac{2\lambda r\sigma^2 + \left(\frac{\lambda\sigma^2}{z}\right)^2 - \frac{2r'\lambda\sigma^2}{z}}{2\sigma^2}\right\}\Phi\left(rz;\, r' - \frac{\lambda\sigma^2}{z},\, \sigma^2\right) - \left[\Phi(0; r', \sigma^2) - \Phi(rz; r', \sigma^2)\right]$$
$$= \Phi(rz; r', \sigma^2) - \exp\left\{\frac{2\lambda r\sigma^2 + \left(\frac{\lambda\sigma^2}{z}\right)^2 - \frac{2r'\lambda\sigma^2}{z}}{2\sigma^2}\right\}\Phi\left(rz;\, r' - \frac{\lambda\sigma^2}{z},\, \sigma^2\right). \quad \square$$

Appendix D Pressure Vessel Design

This is borrowed from Chauhan et al. (2022). See figure 7 from Chauhan et al. for the structural design of the pressure vessel, which looks similar to the Underground Storage Tanks discussed earlier.

D.1 Problem Statement

The pressure vessel design objective is to minimize total cost, which includes material, forming, and welding. The design variables are the thickness of the shell ($T_s$), the thickness of the head ($T_h$), the inner radius ($R$), and the length of the cylinder ($L$). The mathematical formulation is found in (D1).
$$X = [x_1, x_2, x_3, x_4] = [T_s, T_h, R, L]$$
$$f(X) = 0.6224\,x_1 x_3 x_4 + 1.7781\,x_2 x_3^2 + 3.1661\,x_1^2 x_4 + 19.84\,x_1^2 x_3 \quad \text{(D1)}$$

For this paper, Chauhan et al.'s HAOASMA algorithm results will be used as the baseline for the parameters. These parameter results serve as the best minimum result. To be specific, $T_s = 1.8048$, $T_h = 0.0939$, $R = 13.8360$, and $L = 123.2019$. The next section will provide a couple of variations to convert this problem into a classification problem.

D.2 Varying Design Parameter Plots

The simulation can be carried out using Algorithm 2. Two simulated realizations from this algorithm can be seen in Figure D1. The left figure has values $T^v_s = 1.7887$ and $T^v_h = 0.0313$, and the right figure has $T^v_s = 1.7887$ and $T^v_h = 0.2817$; the superscript $v$ stands for variation. The variations are intended to reflect two scenarios: the first is a clear separation between distributions (the left figure), hence an easier classification; the second has significant overlap (the right figure), or a tougher classification.

Fig. D1: Comparison of a simulated realization using Algorithm 2 and variations to the thickness parameter. Left: $T^v_s = 1.7887$ and $T^v_h = 0.0313$; Right: $T^v_s = 1.7887$ and $T^v_h = 0.2817$.

Algorithm 2 Simulation of Pressure Vessel Data for Classification

Require: $s > 0 \wedge i \in (0,1)$ and $T^v_s > 0$ and $T^v_h > 0$, where $i$ is the imbalance, $s$ is the size of the data set, and $T^v_s$ and $T^v_h$ are parameter variations from the HAOASMA baseline.
1: Set $T^b_s = 1.8048$, $T^b_h = 0.0939$, $R^b = 13.8360$, and $L^b = 123.2019$, where superscript $b$ is baseline.
2: Compute data label sizes based on the imbalance $i$ as $s_0 = \lfloor s \times (1-i) \rfloor$ and $s_1 = s - s_0$, where the subscripts 0 and 1 stand for data sizes for label 0 and label 1.
3: Initialize $t^b_s$ and $t^b_h$ as size-$s_0$ arrays with values $T^b_s$ and $T^b_h$ respectively.
4: Draw $r^b$ and $l^b$ arrays of size $s_0$ from the normal distributions $N(\mu = R^b, \sigma^2 = 1)$ and $N(\mu = L^b, \sigma^2 = 1)$ respectively.
5: Concatenate the column vectors $t^b_s$, $t^b_h$, $r^b$, and $l^b$ together and assign this array to variable $X_0$.
6: Apply (D1) to each row of $X_0$ to yield a cost value and assign this array to variable $Y_0$.
7: Initialize $l_0$ as a label-0 array of size $s_0$ with the value 0.
8: Concatenate the column vectors $r^b$, $l^b$, $Y_0$, and $l_0$ together and assign this array to variable $F_0$.
9: Initialize $t^v_s$ and $t^v_h$ as size-$s_1$ arrays with values $T^v_s$ and $T^v_h$ respectively.
10: Draw $r^b$ and $l^b$ arrays of size $s_1$ from the normal distributions $N(\mu = R^b, \sigma^2 = 1)$ and $N(\mu = L^b, \sigma^2 = 1)$ respectively.
11: Concatenate the column vectors $t^v_s$, $t^v_h$, $r^b$, and $l^b$ together and assign this array to variable $X_1$.
12: Apply (D1) to each row of $X_1$ to yield a cost value and assign this array to variable $Y_1$.
13: Initialize $l_1$ as a label-1 array of size $s_1$ with the value 1.
14: Concatenate the column vectors $r^b$, $l^b$, $Y_1$, and $l_1$ together and assign this array to variable $F_1$.
15: Concatenate or stack the arrays $F_0$ and $F_1$, which yields a total of $s$ rows with the last column being the labels for classification.

Appendix E Underground Storage Tank (UST)

This section is borrowed from Ramdhani (2016), where all equations, derivations, and further explorations can be found.

E.1 Problem Statement

The UST problem deals with estimating tank dimensions by using only vertical height measurements. It is also possible that this cylindrical UST has hemispherical endcaps appended on the ends, which will also contain volume.
The equations for the volume based on cross-sectional measurements for the tanks with the cylindrical, cylindrical with hemispherical endcaps, ellipsoidal, and ellipsoidal with hemi-ellipsoidal endcaps shapes are given in (E2), (E3), (E4), and (E5), respectively.

The equation for the cylindrical shape is:
$$f_C(r, L, h) = L\left(r^2\cos^{-1}\left(\frac{r-h}{r}\right) - (r-h)\sqrt{2rh - h^2}\right) \quad \text{(E2)}$$

If one were to add hemispherical endcaps to the cylinder ends, the subsequent volume would be:
$$f_{CH}(r, L, h) = L\left(r^2\cos^{-1}\left(\frac{r-h}{r}\right) - (r-h)\sqrt{2rh - h^2}\right) + \frac{\pi h^2}{3}(3r - h) \quad \text{(E3)}$$

The equation for the elliptical shape for a deformed cylinder is:
$$f_{ED}(a, b, L, h) = L\left(ab\cos^{-1}\left(\frac{a-h}{\sqrt{a^2 + (h^2 - 2ha)\left(1 - \frac{b^2}{a^2}\right)}}\right) - b(a-h)\sqrt{1 - \left(1 - \frac{h}{a}\right)^2}\right) \quad \text{(E4)}$$

If one were to add hemispherical endcaps to the cylinder, which deform to hemi-ellipsoidal endcaps, the subsequent volume would be:
$$f_{EDH}(a, b, L, h) = L\left(ab\cos^{-1}\left(\frac{a-h}{\sqrt{a^2 + (h^2 - 2ha)\left(1 - \frac{b^2}{a^2}\right)}}\right) - b(a-h)\sqrt{1 - \left(1 - \frac{h}{a}\right)^2}\right) + \frac{2\pi a^3}{3} + \frac{\pi(a-h)}{3}\left(\frac{hb^2}{a}\right)\left(\frac{h-2a}{a}\right) - \frac{2\pi a^3(a-h)}{3\sqrt{a^2 + (h^2 - 2ha)\left(1 - \frac{b^2}{a^2}\right)}} \quad \text{(E5)}$$

E.2 Varying Tank Dimension

The parameters used in Algorithm 3 are borrowed from Ramdhani (2016). The baseline is the cylindrical case with radius $r = 4$ and length $L = 32$, and the parameter variations are on the $a$ and $b$ for an ellipse. Ramdhani used a measurement-error-based model for simulation, which is also used here; the measurement errors are on the heights $h$. Similar to the pressure vessel, we consider an easy and a tough simulation scenario for classification. This is seen in Figure E3, where the left figure is easier to distinguish between cylinder and ellipse than the right figure.
The same interpretation can be seen for the endcap-based equations, shown in Figure E4.

Fig. E2: Left: cylindrical UST with a radius ($r$) of 4 and total max height of 8. Right: elliptical UST (deformed) with vertical axis ($a$) of 3.6 and max height of 7.2.

Fig. E3: Comparison of one simulated realization of volume versus height measurements using Algorithm 3 for equation (E2) versus (E4). Left: $a = 3.2$, $b = 5.0$ versus $r = 4$; Right: $a = 3.8$, $b = 4.2105$ versus $r = 4$.

Algorithm 3 Simulation of Tank Dimension Data for Classification

Require: $s > 0 \wedge i \in (0,1)$ and $a > 0$ and $b > 0$, where $i$ is the imbalance, $s$ is the size of the data set, and $a$ and $b$ are the parameters for the vertical and horizontal axes of an ellipse.
1: Set $r = 4$, $L = 32$ as the baseline parameters.
2: Compute data label sizes based on imbalance $i$ as $s_0 = \lfloor s \times (1-i) \rfloor$ and $s_1 = s - s_0$, where the subscripts 0 and 1 stand for data sizes for label 0 and label 1.
3: Initialize noise arrays $\epsilon_0 \sim N(0, 2)$ and $\gamma_0 \sim U(-0.05, 0.05)$, both of size $s_0$.
4: Draw an array of heights $h_0 \sim U(1, 2 \times r - 1)$ of size $s_0$ for the vertical height of a cylinder.
5: Compute variable $h'_0 = h_0 + \gamma_0$.
6: Initialize $l_0$ as a label-0 array of size $s_0$ with the value 0.
7: Compute the volume from either (E2) or (E3) using $h_0$, $r$, $L$ and assign this to variable $X_0$.
8: Assign variable $Y_0 = X_0 + \epsilon_0$.
9: Concatenate the column arrays $Y_0$, $h'_0$, and $l_0$ and assign this to $F_0$.
10: Initialize noise arrays $\epsilon_1 \sim N(0, 2)$ and $\gamma_1 \sim U(-0.05, 0.05)$, both of size $s_1$.
11: Draw an array of heights $h_1 \sim U(1, 2 \times a - 1)$ of size $s_1$ for the vertical height of an ellipse.
12: Compute variable $h'_1 = h_1 + \gamma_1$.
13: Initialize $l_1$ as a label-1 array of size $s_1$ with the value 1.
14: Compute the volume from either (E4) or (E5) using $h_1$, $a$, $b$, $L$ and assign this to variable $X_1$.
15: Assign variable $Y_1 = X_1 + \epsilon_1$.
16: Concatenate the column arrays $Y_1$, $h'_1$, and $l_1$ and assign this to $F_1$.
17: Concatenate or stack the arrays $F_0$ and $F_1$, which yields a total of $s$ rows with the last column being the labels for classification.

Fig. E4: Comparison of one simulated realization of volume versus height measurements using Algorithm 3 for equation (E3) versus (E5). Left: $a = 3.2$, $b = 5.0$ versus $r = 4$; Right: $a = 3.8$, $b = 4.2105$ versus $r = 4$.
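Algorithm 3 can be sketched in a few lines of NumPy. The following is a minimal illustration using only the endcap-free equations (E2) and (E4); it assumes the $N(0, 2)$ volume noise is parameterized by its variance, and the function and variable names are ours, not the paper's:

```python
import numpy as np

def cylinder_volume(r, L, h):
    """Partial volume of a horizontal cylindrical tank at fill height h, equation (E2)."""
    return L * (r**2 * np.arccos((r - h) / r) - (r - h) * np.sqrt(2 * r * h - h**2))

def ellipse_volume(a, b, L, h):
    """Partial volume of the deformed (elliptical) tank at fill height h, equation (E4)."""
    c = a - h
    d = np.sqrt(a**2 + (h**2 - 2 * h * a) * (1 - b**2 / a**2))
    return L * (a * b * np.arccos(c / d) - b * c * np.sqrt(1 - (1 - h / a)**2))

def simulate_tank_data(s, i, a, b, seed=0):
    """Sketch of Algorithm 3: features (noisy volume, noisy height) and
    labels (0 = cylinder, 1 = ellipse) with size s and imbalance i."""
    rng = np.random.default_rng(seed)
    r, L = 4.0, 32.0                                  # cylindrical baseline
    s0 = int(s * (1 - i)); s1 = s - s0                # label sizes
    h0 = rng.uniform(1, 2 * r - 1, s0)                # cylinder heights
    y0 = cylinder_volume(r, L, h0) + rng.normal(0, np.sqrt(2), s0)
    h1 = rng.uniform(1, 2 * a - 1, s1)                # ellipse heights
    y1 = ellipse_volume(a, b, L, h1) + rng.normal(0, np.sqrt(2), s1)
    heights = np.r_[h0, h1] + rng.uniform(-0.05, 0.05, s)  # measurement error
    X = np.column_stack([np.r_[y0, y1], heights])
    y = np.r_[np.zeros(s0), np.ones(s1)]
    return X, y

X, y = simulate_tank_data(s=1000, i=0.25, a=3.2, b=5.0)
print(X.shape, int(y.sum()))  # (1000, 2) 250
```

As a quick sanity check, at half fill ($h = r$) equation (E2) gives exactly half the full cylinder volume, $\pi r^2 L / 2$.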