Enhancing classification accuracy through chaos


Authors: Panos Stinis

Advanced Computing, Mathematics and Data Division, Pacific Northwest National Laboratory, Richland WA 99354

Abstract

We propose a novel approach which exploits chaos to enhance classification accuracy. Specifically, the available data that need to be classified are treated as vectors that are first lifted into a higher-dimensional space and then used as initial conditions for the evolution of a chaotic dynamical system for a prescribed temporal interval. The evolved state of the dynamical system is then fed to a trainable softmax classifier which outputs the probabilities of the various classes. As proof-of-concept, we use samples of randomly perturbed orthogonal vectors of moderate dimension (2 to 20), with a corresponding number of classes equal to the vector dimension, and show how our approach can both significantly accelerate the training process and improve the classification accuracy compared to a standard softmax classifier which operates on the original vectors, as well as a softmax classifier which only lifts the vectors to a higher-dimensional space without evolving them. We also provide an explanation for the improved performance of the chaos-enhanced classifier.

Introduction

Classification is one of the two fundamental supervised machine learning (ML) tasks [1, 2], the other being regression [3, 4]. It is a vast research topic with applications ranging from health [5] to finance [6] to science [7, 8, 9]. The objective of ML-based classification is to train a ML tool to assign the correct class (label) to each member of a dataset. Usually, ML-based classifiers assign, for each member of a dataset, a probability for the member to belong to each of the available classes.
Due to the pervasive nature of classification, many algorithms have been developed for this task and an exhaustive comparison is beyond the scope of the current work. Instead, we will reduce the ML-classification problem to its bare essentials and offer a novel approach which can both accelerate training and improve the training/testing accuracy. The approach we will discuss can be applied directly to the raw input data or to the latent space representation of the data provided by an existing algorithm (before the final linear output layer which connects to a softmax classifier that assigns probabilities to the various classes).

Specifically, we consider a dataset consisting of randomly perturbed orthogonal vectors of a prescribed dimension. The correct class for each of the perturbed orthogonal vectors is the direction corresponding to the unperturbed vector. The baseline model we will use consists of a trainable softmax classifier to which we feed the randomly perturbed orthogonal vectors. The novel approach we propose consists of two stages: (i) lifting the vector to a higher dimension, and (ii) identifying the lifted vector as the initial state of a chaotic dynamical system and evolving it for a prescribed temporal interval before it is processed by the trainable softmax classifier. Our purpose in this work is to disentangle the effect that the two stages can have on the classification training time and accuracy, so that the observed improvement over the baseline model, which can be significant, can be explained.

Section 1 presents the various constructions including the baseline classifier model, the lifting to higher dimensions and the chaos-enhanced model.
Section 2 contains numerical results for the various constructions along with an explanation of the improved performance of the chaos-enhanced classifier, while Section 3 offers a discussion of the proposed approach and directions for future work.

1 Chaos-enhanced classification

Consider a collection $V_N = [v_1, v_2, \ldots, v_N]$ of $N$ vectors of dimension $m$, each being a random perturbation of a canonical basis vector in $\mathbb{R}^m$. The canonical basis in $m$ dimensions consists of the vectors
$$e_i = [\underbrace{0, \ldots, 0}_{i-1}, 1, \underbrace{0, \ldots, 0}_{m-i}]^T, \quad \text{for } i = 1, \ldots, m,$$
with all 0s except for the 1 at the $i$-th coordinate. For symmetry, we will assume that the $m$ canonical basis vectors are equally represented in the collection, say with $N_{class}$ vectors each, so $N = m \times N_{class}$. Each of the $N$ vectors is formed by adding samples from independent, normally distributed random variables with mean zero and prescribed variance to all the coordinates. Thus, each of the $N$ vectors in $V_N$ can be written as $v_{nk} = e_{i_n k} + \epsilon_{nk}$, for $n = 1, \ldots, N$ and $k = 1, \ldots, m$, where $\epsilon_{nk} \sim \mathcal{N}(0, \sigma)$ with $\sigma$ prescribed and $i_n$ is the index corresponding to the canonical vector that we use to build the $n$-th sample in the collection.

Baseline model. The baseline model consists of a trainable softmax classifier to which each of the $N$ vectors is fed. Since we consider the $N$ vectors coming from random perturbations of the canonical basis vectors in $\mathbb{R}^m$, we consider $m$ classes, one for each of the $m$ orthogonal directions in $\mathbb{R}^m$. As a result, the trainable weights of the linear output layer of the softmax classifier are the elements of an $m \times m$ matrix $W$, denoted by $w_{ij}$ for $i, j = 1, \ldots, m$. For a vector $v_n$ from the collection of $N$ vectors, the softmax classifier takes as input the logits $z_{ni} = \sum_{j=1}^{m} W_{ij} v_{nj}$ for $i = 1, \ldots, m$, and outputs the $m$ probabilities
$$p_{ni} = \frac{e^{z_{ni}}}{\sum_{l=1}^{m} e^{z_{nl}}}, \quad \text{for } i = 1, \ldots, m, \qquad (1)$$
that the vector $v_n$ belongs to each of the $m$ classes.

We employ cross-entropy as the loss function to train the elements of $W$ and use the Adam optimizer for training. The cross-entropy loss for the $N$ vectors is defined as
$$L_N = -\frac{1}{N} \sum_{n=1}^{N} \sum_{l=1}^{m} \chi_{nl} \log p_{nl}, \qquad (2)$$
where $\chi_{nl} = 1$ if $l = i_n$ and 0 otherwise, with $i_n$ the true class of the $n$-th sample, and $p_{nl}$ is provided by (1). The derivative of the cross-entropy loss function with respect to an element of the weight matrix $W$ is given by
$$\frac{\partial L_N}{\partial w_{ij}} = -\frac{1}{N} \sum_{n=1}^{N} [\chi_{ni} - p_{ni}] v_{nj}, \quad \text{for } i, j = 1, \ldots, m. \qquad (3)$$
Since the true classes of the vectors in $V_N$ are along the directions of the canonical basis vectors in $\mathbb{R}^m$, the training of the baseline model aims to suppress the random perturbations along the $m$ directions (as can be seen by (3)).

Algorithm 1 Baseline classification algorithm
Input: Number of classes $m$, data $V_N$ (vectors of dimension $m$) where $N = m \times N_{class}$ ($N_{class}$ orthogonal vectors from each of the $m$ classes perturbed with $\mathcal{N}(0, \sigma)$ noise, $\sigma$ is user-prescribed), number of training data $N_{train}$ and test data $N_{test}$ with $N_{train} + N_{test} = N$, class labels $\chi$ for the vectors in $V_N$, mini-batch size $N_b$, initial learning rate for Adam optimizer $\eta$, number of training epochs $N_{epochs}$
1: Training:
2: Initialization of the $m \times m$ weight matrix $W$
3: for each training epoch $n = 1, 2, \ldots, N_{epochs}$ do
4: Pick a mini-batch from the training data $V_{N_{train}}$
5: Assign class probabilities to the mini-batch using a softmax classifier
6: Use the Adam optimizer with cross-entropy loss function to update the weight matrix $W$
7: Repeat until the epoch is completed
8: end for

Lifting-enhanced model. As we have already discussed, the proposed approach can be divided into two stages: (i) lifting each of the $N$ vectors to a higher dimension and (ii) evolving the lifted vector using a chaotic dynamical system.
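For concreteness, the baseline quantities (1)-(3) can be sketched in a few lines of NumPy. This is our illustrative code, not the implementation used for the experiments; the function names are ours, and we loop over the batch for clarity rather than vectorizing.

```python
import numpy as np

def softmax_probs(W, v):
    """Probabilities (1): p_i = exp(z_i) / sum_l exp(z_l), with logits z = W v."""
    z = W @ v
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_and_grad(W, V, labels):
    """Average cross-entropy loss (2) and its gradient (3) over a batch."""
    N = len(V)
    loss = 0.0
    grad = np.zeros_like(W)
    for n, v in enumerate(V):
        p = softmax_probs(W, v)
        chi = np.zeros(W.shape[0])
        chi[labels[n]] = 1.0             # one-hot true class chi_{nl}
        loss += -np.log(p[labels[n]])
        grad += -np.outer(chi - p, v)    # (3): -(chi_{ni} - p_{ni}) v_{nj}
    return loss / N, grad / N
```

The same two functions apply unchanged to the lifted models below, since only the width of $W$ and the input vectors change.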
The first stage can be considered in isolation from the second one, since it can be beneficial for classification to lift, within reason, the dimensionality of the vectors to be classified [10, 11, 12]. There are various methods to lift a vector to higher dimensions, e.g., zero-padding, polynomial feature mapping, tensor product. Each of the $N$ vectors of dimension $m$ is lifted to a vector of dimension $m_{lift}$. As a result of the lifting process, the linear output layer weight matrix $W^{lift}$ has dimensions $m \times m_{lift}$. For the $n$-th original (unlifted) vector $v_n = [v_{n1}, \ldots, v_{nm}]^T$, the corresponding lifted vector, denoted by $v^{lift}_n$, is defined as
$$v^{lift}_n = [\eta_1, \eta_2, \eta_3, v_{n1}, \ldots, v_{nm}, \eta_4, \ldots, \eta_{m_{lift}-m}]^T, \qquad (4)$$
where $\eta_1, \ldots, \eta_{m_{lift}-m}$ are random linear combinations of the $m$ coordinates. Specifically, $\eta_j = \sum_{l=1}^{m} \epsilon_l v_{nl}$ where $\epsilon_l \sim \mathcal{N}(0, \frac{1}{m_{lift}})$. The specific choice of the lifting construction is due to the structure of the Lorenz 96 model that is used to evolve the lifted vector, namely the need to enforce periodicity (see (10) below). If the constraints from the Lorenz 96 system were not there, we could have picked a simpler lifting construction, e.g.,
$$v^{lift}_n = [v_{n1}, \ldots, v_{nm}, \eta_1, \ldots, \eta_{m_{lift}-m}]^T. \qquad (5)$$
We note that the chosen lifting approach is not optimized, but it does imbue the added coordinates with randomly distorted information from the $m$ coordinates of the original vector. Also, the lifting of each vector to one which preserves the $m$ coordinates of the original vector but adds (small) random perturbations in the remaining coordinates can be understood as a random regularization mechanism.
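A minimal NumPy sketch of the lifting (4) (ours, for illustration): we assume a fresh set of coefficients $\epsilon_l$ is drawn for each added coordinate $\eta_j$, and we read $\mathcal{N}(0, 1/m_{lift})$ as mean and variance, i.e., standard deviation $1/\sqrt{m_{lift}}$.

```python
import numpy as np

def lift(v, m_lift, rng):
    """Lift v from dimension m to m_lift per (4): three random linear
    combinations eta_1..eta_3, then the original coordinates, then the rest."""
    m = v.size
    n_extra = m_lift - m                 # number of added coordinates
    # eta_j = sum_l eps_l v_l, eps_l ~ N(0, 1/m_lift), drawn fresh for each j
    eps = rng.normal(0.0, 1.0 / np.sqrt(m_lift), size=(n_extra, m))
    eta = eps @ v
    return np.concatenate([eta[:3], v, eta[3:]])
```

Note the original $m$ coordinates are preserved in positions 4 through $m+3$ of the lifted vector, matching (4).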
This is because the weight matrix $W^{lift}$ will be trained to map from $\mathbb{R}^{m_{lift}}$ to the $m$ classes, thus it will be trained to suppress random perturbations in the $m_{lift} - m$ dimensions (in addition to learning to suppress random perturbations in the $m$ coordinates like the baseline model). Improving training accuracy through adding random perturbations to data is a well-known practice for neural networks [13, 14, 15, 16]. When we present the numerical results (Section 2), we provide a more quantitative explanation as to how the lifting to a higher-dimensional space aids the classification task.

Due to the lifting of the vectors to $m_{lift}$ dimensions, the logits are $z^{lift}_{ni} = \sum_{j=1}^{m_{lift}} W^{lift}_{ij} v^{lift}_{nj}$ for $i = 1, \ldots, m$, and the softmax outputs the $m$ probabilities
$$p^{lift}_{ni} = \frac{e^{z^{lift}_{ni}}}{\sum_{l=1}^{m} e^{z^{lift}_{nl}}}, \quad \text{for } i = 1, \ldots, m, \qquad (6)$$
that the vector $v^{lift}_n$ belongs to each of the $m$ classes. The loss function is given by
$$L^{lift}_N = -\frac{1}{N} \sum_{n=1}^{N} \sum_{l=1}^{m} \chi^{lift}_{nl} \log p^{lift}_{nl}, \qquad (7)$$
where $\chi^{lift}_{nl} = \chi_{nl}$, i.e., the class of each lifted vector remains the same, and $p^{lift}_{nl}$ is provided by (6). The derivative of the cross-entropy loss function $L^{lift}_N$ with respect to an element of the weight matrix $W^{lift}$ is given by
$$\frac{\partial L^{lift}_N}{\partial w^{lift}_{ij}} = -\frac{1}{N} \sum_{n=1}^{N} [\chi^{lift}_{ni} - p^{lift}_{ni}] v^{lift}_{nj}, \quad \text{for } i = 1, \ldots, m \text{ and } j = 1, \ldots, m_{lift}. \qquad (8)$$

Algorithm 2 Lifting-enhanced classification algorithm
Input: Number of classes $m$, data $V_N$ (vectors of dimension $m$) where $N = m \times N_{class}$ ($N_{class}$ orthogonal vectors from each of the $m$ classes perturbed with $\mathcal{N}(0, \sigma)$ noise, $\sigma$ is user-prescribed), number of training data $N_{train}$ and test data $N_{test}$ with $N_{train} + N_{test} = N$, class labels $\chi$ for the vectors in $V_N$, lifting dimension $m_{lift}$ ($m_{lift} > m$), mini-batch size $N_b$, initial learning rate for Adam optimizer $\eta$, number of training epochs $N_{epochs}$
1: Lifting:
2: Create the $m_{lift}$-dimensional lifted data $V^{lift}_N$ using (4)
3: Training:
4: Initialization of the $m \times m_{lift}$ weight matrix $W^{lift}$
5: for each training epoch $n = 1, 2, \ldots, N_{epochs}$ do
6: Pick a mini-batch from the training data $V^{lift}_{N_{train}}$
7: Assign class probabilities to the mini-batch using a softmax classifier
8: Use the Adam optimizer with cross-entropy loss function to update the weight matrix $W^{lift}$
9: Repeat until the epoch is completed
10: end for

Chaos-enhanced model. While lifting to higher dimensions can improve classification accuracy, an even more significant improvement can be achieved if we further employ chaos. Specifically, each lifted vector, which is a point in an $m_{lift}$-dimensional space, can be identified as the initial state of a chaotic dynamical system and evolved for a prescribed temporal interval. The obvious question is why such a process should aid the classification accuracy. Intuitively, a chaotic dynamical system increases the distance of states that are initially close. In our case, the vectors resulting from the lifting process already contain randomness in the $m_{lift} - m$ dimensions. However, a controlled evolution under a chaotic dynamical system can pull those vectors apart, in essence homogenizing the randomness in all $m_{lift}$ dimensions.
This results in a more efficient use of the high dimensionality of the lifting space (this statement will become more concrete quantitatively in Section 2).

For our numerical experiments we have employed the Lorenz 96 model [17]. The model is defined as
$$\frac{dx_i}{dt} = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F, \quad \text{for } i = 1, \ldots, K, \qquad (9)$$
with
$$x_{-1} = x_{K-1}, \quad x_0 = x_K, \quad x_{K+1} = x_1, \quad K \geq 4, \qquad (10)$$
and $F$ is a forcing term. Based on the literature, we have chosen the value $F = 8$ for all the numerical experiments, although a more detailed study of the effect of $F$ on the results will be interesting [18]. Also, because of the lower bound $K \geq 4$, the 3 periodicity conditions in (10) and the lifting construction, in our numerical experiments we consider lifting spaces of dimension $m_{lift} \geq m + 6$.

An important question to ask is about the length of the temporal interval $T$ for which one should evolve the chaotic dynamical system so that it is beneficial for the classification accuracy [19]. Based on dynamical systems theory, a Lyapunov time is the time it takes for the deviation between two initial conditions to grow by a factor of $e$. Thus, a few Lyapunov times is a good indicator of a chaotic system's predictability. Based on Lyapunov time estimates for the Lorenz 96 model (around 0.6 units of time for the regimes studied in the literature [18, 20]), we chose $T = 2$ units of time for our numerical experiments (except where noted).

We denote the $n$-th lifted and chaos-evolved data vector as $v^{lift,chaos}_n$. As a result, the logits are $z^{lift,chaos}_{ni} = \sum_{j=1}^{m_{lift}} W^{lift}_{ij} v^{lift,chaos}_{nj}$ for $i = 1, \ldots, m$, and the softmax outputs the $m$ probabilities
$$p^{lift,chaos}_{ni} = \frac{e^{z^{lift,chaos}_{ni}}}{\sum_{l=1}^{m} e^{z^{lift,chaos}_{nl}}}, \quad \text{for } i = 1, \ldots, m, \qquad (11)$$
that the vector $v^{lift,chaos}_n$ belongs to each of the $m$ classes.
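The Lorenz 96 evolution (9)-(10) with the classical 4th-order Runge-Kutta scheme used in Section 2 can be sketched as follows (our illustrative NumPy code, not the paper's implementation; the periodic conditions (10) are implemented with `np.roll`):

```python
import numpy as np

def lorenz96_rhs(x, F=8.0):
    """Right-hand side of (9); np.roll enforces the periodic conditions (10)."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def evolve(x0, T=2.0, dt=1e-2, F=8.0):
    """Evolve an initial state for T time units with classical RK4."""
    x = x0.astype(float).copy()
    for _ in range(int(round(T / dt))):
        k1 = lorenz96_rhs(x, F)
        k2 = lorenz96_rhs(x + 0.5 * dt * k1, F)
        k3 = lorenz96_rhs(x + 0.5 * dt * k2, F)
        k4 = lorenz96_rhs(x + dt * k3, F)
        x += dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x
```

A quick sanity check of the chaotic behavior: the state $x_i = F$ for all $i$ is a fixed point of (9), and a small perturbation of it is pulled away by the dynamics.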
The loss function is given by
$$L^{lift,chaos}_N = -\frac{1}{N} \sum_{n=1}^{N} \sum_{l=1}^{m} \chi^{lift,chaos}_{nl} \log p^{lift,chaos}_{nl}, \qquad (12)$$
where $\chi^{lift,chaos}_{nl} = \chi_{nl}$, i.e., the class of each lifted and chaos-evolved vector remains the same, and $p^{lift,chaos}_{nl}$ is provided by (11). The derivative of the cross-entropy loss function $L^{lift,chaos}_N$ with respect to an element of the weight matrix $W^{lift,chaos}$ is given by
$$\frac{\partial L^{lift,chaos}_N}{\partial w^{lift,chaos}_{ij}} = -\frac{1}{N} \sum_{n=1}^{N} [\chi^{lift,chaos}_{ni} - p^{lift,chaos}_{ni}] v^{lift,chaos}_{nj}, \qquad (13)$$
for $i = 1, \ldots, m$ and $j = 1, \ldots, m_{lift}$.

Algorithm 3 Lifting- and chaos-enhanced classification algorithm
Input: Number of classes $m$, data $V_N$ (vectors of dimension $m$) where $N = m \times N_{class}$ ($N_{class}$ orthogonal vectors from each of the $m$ classes perturbed with $\mathcal{N}(0, \sigma)$ noise, $\sigma$ is user-prescribed), number of training data $N_{train}$ and test data $N_{test}$ with $N_{train} + N_{test} = N$, class labels $\chi$ for the vectors in $V_N$, lifting dimension $m_{lift}$ ($m_{lift} \geq 7$), dimension of Lorenz 96 model $K$ ($K = m_{lift} - 3$), forcing magnitude $F$, temporal interval for chaotic evolution $T$, timestep for chaotic evolution $\eta_{chaos}$, mini-batch size $N_b$, initial learning rate for Adam optimizer $\eta$, number of training epochs $N_{epochs}$
1: Lifting:
2: Create the $m_{lift}$-dimensional lifted data $V^{lift}_N$ using (4)
3: Chaotic evolution:
4: Identify each datapoint in $V^{lift}_N$ as an initial condition for Lorenz 96 and evolve for $T$ units of time with timestep $\eta_{chaos}$ using (9)-(10) to create the data $V^{lift,chaos}_N$
5: Training:
6: Initialization of the $m \times m_{lift}$ weight matrix $W^{lift,chaos}$
7: for each training epoch $n = 1, 2, \ldots, N_{epochs}$ do
8: Pick a mini-batch from the training data $V^{lift,chaos}_{N_{train}}$
9: Assign class probabilities to the mini-batch using a softmax classifier
10: Use the Adam optimizer with cross-entropy loss function to update the weight matrix $W^{lift,chaos}$
11: Repeat until the epoch is completed
12: end for

2 Results

2.1 Numerical experiments setup

We have conducted numerical experiments to compare the baseline model, the lifting-enhanced model, and the lifting- and chaos-enhanced model for a moderate number of dimensions (classes) $m = 2, \ldots, 20$. Because we want to have an equal number of samples from each class, the total number of samples varies for the different values of $m$. We chose $N_{class} = 20$ samples from each class, so the number of samples for each value of $m$ is equal to $20 \times m$. The variance of the random perturbations of the orthogonal vectors was set to $\sigma = 10^{-4}$ for all the experiments. Also, we split the samples equally between those used for training and those used for testing. For the Adam optimizer, we set the mini-batch size to $N_b = 10$ and the learning rate to $\eta = 10^{-3}$. To the cross-entropy loss of the baseline model, the lifting-enhanced model and the lifting- and chaos-enhanced model we added, as a regularizer, the square of the $L_2$ norm of the weight matrix elements multiplied by $10^{-3}$. The Lorenz 96 model was evolved with the standard 4th-order Runge-Kutta scheme and $\eta_{chaos} = 10^{-2}$. All the calculations were performed in double precision.

Since we do not yet have a good way to decide on the optimal value of the lifting dimension, we conducted experiments where, for each value of $m$, we searched for the optimal lifting dimension $m_{lift}$ between the values $m + 6$ and 50. The optimal value of $m_{lift}$ was decided based on an accuracy metric evaluated on the test data.
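The dataset just described can be generated as follows (our illustrative sketch; `make_dataset` is our name, and we read $\sigma = 10^{-4}$ as the variance of the perturbations, i.e., standard deviation $\sqrt{\sigma}$):

```python
import numpy as np

def make_dataset(m, n_class=20, sigma=1e-4, seed=0):
    """N = m * n_class perturbed canonical basis vectors with labels,
    shuffled and split equally into training and test sets."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for i in range(m):
        e = np.zeros(m); e[i] = 1.0          # canonical basis vector e_i
        for _ in range(n_class):
            X.append(e + rng.normal(0.0, np.sqrt(sigma), size=m))
            y.append(i)
    X, y = np.array(X), np.array(y)
    perm = rng.permutation(len(X))
    X, y = X[perm], y[perm]
    half = len(X) // 2                       # equal train/test split
    return (X[:half], y[:half]), (X[half:], y[half:])
```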
We chose the value of $m_{lift}$ that resulted in the highest test accuracy during training. In our experiments, the optimal value of $m_{lift}$ fluctuated due to the stochasticity of the algorithm (random perturbations of the orthogonal vectors, randomness of the values of the components in the lifted dimensions and randomness of the Adam optimizer). A more detailed study of the optimal $m_{lift}$ will be presented elsewhere.

There are different metrics for the accuracy of the predicted class probabilities. The proportion accuracy metric is given by
$$accuracy_{proportion} = \frac{1}{N} \sum_{n=1}^{N} \delta_{i_{max} i_n}, \qquad (14)$$
where $i_{max} = \arg\max(p_{n1}, p_{n2}, \ldots, p_{nm})$ is the predicted class for the $n$-th sample and $i_n$ is the actual class. For the proportion accuracy metric to be 1, it is only required that the predicted class probability for the correct class is larger than the predicted class probabilities for the rest of the classes. This metric is not stringent enough to distinguish between the various models. Instead, we have opted for the following accuracy metric
$$accuracy_{alignment} = \frac{1}{N} \sum_{n=1}^{N} \chi_n \cdot p_n = \frac{1}{N} \sum_{n=1}^{N} \sum_{l=1}^{m} \chi_{nl} p_{nl} = \frac{1}{N} \sum_{n=1}^{N} p_{n i_n}, \qquad (15)$$
where $\chi_{nl} = \delta_{l i_n}$ (recall that $i_n$ is the correct class for the $n$-th sample) and $p_{nl}$ is the probability for class $l$ for the $n$-th sample predicted by one of the 3 models we compared (we note that $N$ in (15) stands for $N_{train}$ or $N_{test}$ depending on whether we measure training accuracy or testing accuracy). We see from (15) that to achieve 100% accuracy, the predicted probability from the model for the correct label for each sample must be 1. Another way to interpret the metric in (15) is as the confidence of the model in the predicted class labels, since it measures the alignment between the predicted class probabilities vector and the correct class vector.
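The two metrics (14) and (15) can be sketched as follows (our illustrative NumPy code; `P` holds the predicted class probabilities, one row per sample):

```python
import numpy as np

def proportion_accuracy(P, labels):
    """(14): fraction of samples whose argmax class matches the true class."""
    return np.mean(np.argmax(P, axis=1) == labels)

def alignment_accuracy(P, labels):
    """(15): mean predicted probability assigned to the correct class."""
    return np.mean(P[np.arange(len(labels)), labels])
```

The gap between the two is exactly the confidence effect discussed in Section 2.4: a model can be right on every sample (proportion = 1) while assigning the correct class only modest probability (alignment well below 1).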
Unless otherwise stated, the accuracy used in the figures is the alignment accuracy metric (although we also utilize the proportion metric in Section 2.4 to discuss robustness).

2.2 Observations from results

Figures 1-3 show the evolution of the loss (cross-entropy) and the accuracy with epochs for the training and testing for the 3 models when $m = 2, 10, 20$. Recall that we chose the optimal value of $m_{lift}$ based on the test accuracy. The results lead to various observations about the performance of the 3 models.

First, while the lifting alone can improve performance over the baseline model, there is a dramatic acceleration of the decrease rate of the loss function and an equally dramatic acceleration of the convergence of the predicted accuracy to 1 (100%) for the lifting- and chaos-enhanced model. This model can converge to perfect accuracy within a few epochs, while the corresponding loss is decreased to very small values. In fact, we had to put a lower threshold for the loss at $10^{-10}$ and terminate the algorithm when it was crossed, because the Adam optimizer sometimes became unstable when the next epoch decreased the loss value from $10^{-10}$ to a value close to machine precision ($10^{-16}$) for our double precision calculations.

Second, the lifting- and chaos-enhanced model can reach perfect accuracy within a few hundred epochs (we allowed no more than 500 during our experiments), while the lifting-enhanced model, with the exception of $m = 2$, requires an order of magnitude more epochs to reach similar accuracy (if at all). For $m = 2$, the baseline model needs 20000 epochs to reach similar accuracy, while for $m = 10$ and $m = 20$ it is not able to do so.
Third, while the loss value decreases monotonically for the baseline and lifting-enhanced models for all values of $m$, for the lifting- and chaos-enhanced model for $m = 10$ and $m = 20$, the training and test loss values initially go through a phase of increase before they start decreasing. It is not accidental that during the same phase, the accuracy of the lifting- and chaos-enhanced model is lower than that of the lifting-enhanced model. This behavior is intriguing and needs to be investigated more, since the lifting- and chaos-enhanced model behaves in a way similar to some evolutionary models which perform excursions in the solution landscape before honing in on an accurate one (called the steppingstone principle [21]). Of course, in the present model, there is no exploration during the optimization phase, but it is possible that the chaotic evolution which precedes the optimization step creates an appropriate warping of the data that favors this kind of behavior.

Figure 1: $m = 2$. Comparison of the baseline model, the lifting-enhanced model, and the lifting- and chaos-enhanced model (with the optimal lifting dimension). (a) Evolution of loss (cross-entropy) with epochs. (b) Evolution of accuracy with epochs.
Figure 2: $m = 10$. Comparison of the baseline model, the lifting-enhanced model, and the lifting- and chaos-enhanced model (with the optimal lifting dimension). (a) Evolution of loss (cross-entropy) with epochs. (b) Evolution of accuracy with epochs.

Figure 3: $m = 20$. Comparison of the baseline model, the lifting-enhanced model, and the lifting- and chaos-enhanced model (with the optimal lifting dimension). (a) Evolution of loss (cross-entropy) with epochs. (b) Evolution of accuracy with epochs.

Fourth, we observe in Fig. 3 that the test loss value for the lifting- and chaos-enhanced model is significantly larger than the training loss value, yet the test accuracy does not suffer significantly (it plateaus at 98.5%). This is likely due to an outlier which results in a rather small correct class probability prediction and gets penalized logarithmically by the loss function but only linearly by the accuracy metric.
The reason for the outlier could be tied to the temporal interval $T$ of chaotic evolution being too long for some samples, so that their class label may be mistakenly identified if they venture into the region corresponding to a different class (see also the results below for $T = 1.5$ instead of $T = 2$, where this phenomenon does not occur). We have included this experiment on purpose to emphasize that these results do not involve any hyperparameter optimization; this is something that we will comment on briefly now and address with a more elaborate study appearing in future work.

Specifically, in all the experiments presented in Figs. 1-3, we have fixed the temporal interval of the chaotic evolution to $T = 2$ and the learning rate of the Adam optimizer to $\eta = 10^{-3}$. Both choices were based on prior experience but they were not optimized in the usual manner. To show that the results of the lifting- and chaos-enhanced model can improve even more, we will start by examining the sensitivity of the accuracy to the lifting dimension $m_{lift}$ for different values of $T$ and $\eta$.

Figure 4 shows the evolution of training and testing accuracy with the lifting dimension for the lifting- and chaos-enhanced model for two different values of the Adam optimizer learning rate $\eta$, namely $\eta = 10^{-3}$ and $\eta = 5 \times 10^{-4}$. We keep the maximum number of training epochs equal to 500 in both cases. We observe that while the training accuracy is equally good for the two learning rates, the testing accuracy is higher when $\eta = 5 \times 10^{-4}$.

Figure 4: $m = 20$. Evolution of training and testing accuracy with the lifting dimension for the lifting- and chaos-enhanced model for two different values of the Adam optimizer learning rate $\eta$.

We continue with the examination of the effect of the chaotic evolution temporal interval value $T$ on the accuracy for different lifting dimensions. Fig. 5(a) shows the evolution of the training and testing accuracy with lifting dimension for two values of the temporal interval, namely $T = 1.5$ and $T = 2$. We observe that $T = 1.5$ leads consistently to higher testing accuracy than $T = 2$, implying better generalization capability. Also, Fig. 5(b) shows that for $T = 1.5$, the testing loss tracks closely the training loss, in contrast to the results for $T = 2$.

Figure 5: $m = 20$. (a) Evolution of training and testing accuracy with the lifting dimension for the lifting- and chaos-enhanced model for two different values of the chaotic evolution temporal interval $T$. (b) Evolution of training and testing loss with epochs for the optimal lifting dimension of the lifting- and chaos-enhanced model for two different values of the chaotic evolution temporal interval $T$.

2.3 Reason for improvement over the baseline model

The improvement in classification accuracy that is evident for the lifting-enhanced model, but especially for the lifting- and chaos-enhanced model, warrants an explanation.
Each row of the weight matrix is a vector in $m_{lift}$-dimensional space and each row corresponds to one class. In Fig. 6 we plot the ratio of the standard deviation over the mean of the absolute value of the weight matrix row components for each row (class) for $m = 20$. This metric shows how spread out the values of the weights are in each row of the weight matrix. From Fig. 6 we see that the ratio is much lower for the lifting-enhanced model and the lifting- and chaos-enhanced model than for the baseline model. This means that the components across each row of the weight matrix are much more similar for these two models than for the baseline model. Also, the ratio values are within a small range across rows (classes).

A plausible explanation for this phenomenon is that the 2 models which operate in the $m_{lift}$-dimensional space converge through training to rows of the weight matrix (1 per class) which reside close to the corners of $m_{lift}$-dimensional hypercubes. In this way, the components of each row may have different signs but similar absolute value. As a result, the rows of the weight matrix (seen as vectors) spread apart in the $m_{lift}$-dimensional space, which facilitates classification since it is easier for them to be distinct from one another.

Figure 6: $m = 20$. Ratio of the standard deviation over the mean of the absolute value of the components of each row of the weight matrix for the 3 models. Note that each row of the weight matrix corresponds to a class.

We plot in Fig. 7 the value of the weights across a row of the weight matrix for different rows (recall that each row corresponds to one class). We plot this both for the baseline model and the lifting- and chaos-enhanced model.
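The spread diagnostic plotted in Fig. 6 is straightforward to compute; a sketch (ours), assuming the trained weight matrix is stored as a NumPy array with one row per class:

```python
import numpy as np

def row_spread(W):
    """Per-row ratio std(|w|) / mean(|w|) of the weight matrix
    (one row per class), as plotted in Fig. 6."""
    A = np.abs(W)
    return A.std(axis=1) / A.mean(axis=1)
```

A row whose entries all have the same absolute value (a hypercube corner, up to scaling) gives a ratio of exactly 0, while a row concentrated on a single coordinate gives a large ratio.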
The difference between Fig. 7(a) and Fig. 7(b) is striking. For the baseline model, the weights corresponding to each class concentrate strongly in one element of the row (one coordinate in the $m$-dimensional space). This is understandable since the data we want to classify are perturbations of the canonical basis in $\mathbb{R}^m$ and the distribution of the weight values across a row reflects this structure. As expected, the weight matrix is very efficient in identifying data from this particular class, assuming they are not heavily perturbed. On the other hand, for the lifting- and chaos-enhanced model, the weights corresponding to each row (class) fluctuate across the row (the coordinates in the $m_{lift}$-dimensional space) and thus can handle data which are more strongly perturbed. This result suggests an increased robustness to noise in the data by the lifting- and chaos-enhanced model.

Figure 7: $m = 20$. Components of rows of the weight matrix (each row corresponds to a class) for the baseline model and the lifting- and chaos-enhanced model. To avoid clutter we have plotted only the weights for rows (classes) 1, 5, 10, 15 and 20. (a) Baseline model. (b) Lifting- and chaos-enhanced model. Note that the optimal $m_{lift}$ determined during training is 40 and that is why there are 40 elements (coordinates) of the weight matrix in each row.
2.4 Robustness

We can provide a first insight into the increased robustness of the lifting- and chaos-enhanced model, compared to the baseline and lifting-enhanced models, by employing the proportion accuracy metric in (14) in addition to the alignment accuracy metric in (15). Fig. 8(a) shows the evolution of the testing proportion metric for the baseline, lifting-enhanced and lifting- and chaos-enhanced models (for two chaotic-evolution temporal interval values). The proportion metric behaves similarly for all the models in terms of limiting value and rate of convergence. However, the picture changes drastically when we include the alignment metric (as shown in Fig. 8(b)). Specifically, while for the lifting- and chaos-enhanced model the alignment metric evolution tracks the proportion metric evolution, for the baseline and lifting-enhanced models it does not. In addition, the plateau reached by the alignment metric for the baseline and lifting-enhanced models is significantly below 1 (100%). This means that even though the baseline and lifting-enhanced models can predict 100% of the correct labels for the test data (proportion metric equal to 1), the corresponding confidence in the predictions is lacking. Moreover, if one stops training the baseline and lifting-enhanced models when the proportion metric reaches 1, which happens after 130 epochs for the baseline model and 244 epochs for the lifting-enhanced model, the corresponding confidence in the predictions, given by the alignment metric, is rather low: 36% for the baseline model and 77% for the lifting-enhanced model. The confidence in the predictions of the baseline model suffers due to the lack of alignment of the predicted class probability vectors with the correct label vectors.
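The contrast between the two metrics can be sketched as follows, under the assumption that the proportion metric (14) is the fraction of correct argmax predictions and the alignment metric (15) measures the average probability mass placed on the correct class; the exact definitions in the paper may differ:

```python
import numpy as np

def proportion_accuracy(probs, labels):
    """Fraction of samples whose predicted class (the argmax of the
    class-probability vector) matches the true label."""
    return np.mean(np.argmax(probs, axis=1) == labels)

def alignment_accuracy(probs, labels):
    """Mean probability mass assigned to the correct class, i.e. the
    average dot product of the predicted probability vector with the
    one-hot label vector."""
    return np.mean(probs[np.arange(len(labels)), labels])

# A model can reach 1.0 on the proportion metric while its alignment
# metric stays well below 1, signaling low confidence:
probs = np.array([[0.4, 0.3, 0.3],
                  [0.5, 0.3, 0.2]])
labels = np.array([0, 0])
print(proportion_accuracy(probs, labels))  # 1.0
print(alignment_accuracy(probs, labels))   # 0.45
```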
We attribute this fragility to the presence of noise in the data, which makes the robust prediction of correct labels more difficult (recall the sharp concentration of the weight-matrix rows on a single class shown in Fig. 7(a)). The same argument applies, to a lesser degree, to the lifting-enhanced model, which is more robust to noise but still not as robust as the lifting- and chaos-enhanced model. A more detailed study explaining the agreement between the proportion and alignment metrics for the lifting- and chaos-enhanced model will appear elsewhere.

Figure 8: m = 20. (a) Proportion accuracy metric evolution with epochs for the baseline, lifting-enhanced and lifting- and chaos-enhanced models (for two different chaotic-evolution temporal interval values, T = 1.5 and T = 2). (b) Proportion and alignment accuracy metric evolution with epochs for the same models and temporal interval values.

2.5 Complexity

Finally, we comment on the computational complexity of the baseline model, the lifting-enhanced model and the lifting- and chaos-enhanced model. Table 1 presents the computational times in seconds for the three models for m = 2, 10, 20 (our calculations were run on an Apple M4 processor).
We note that while for the baseline model we perform training in dimension m for 20000 epochs, for the lifting-enhanced and the lifting- and chaos-enhanced models we train over a collection of lifting dimensions in the range (m + 6, 50). The computational times reflect the total time needed to train for all the allowed lifting dimensions. In addition, we have put an upper limit on the number of training epochs: 2000 epochs per lifting-dimension value for the lifting-enhanced model and 500 for the lifting- and chaos-enhanced model. "Upper limit" means that the optimizer may not complete the allowed number of epochs; this can happen because the Adam optimizer was terminated when the loss value fell below 10^-10, to avoid instability due to very small numbers. For m = 10, 20, the lifting- and chaos-enhanced model requires more time to sweep across all the allowed lifting-dimension cases and pick the optimal one than a single run of the baseline model. However, for the values of m reported in Table 1, the baseline model accuracy does not improve even if we let it train for a time comparable to that needed for the training of the lifting- and chaos-enhanced model. In addition, the testing accuracy of the lifting- and chaos-enhanced model is 100% for m = 2, 10 and 98.61% for m = 20, while the testing accuracy of the baseline model is 99.94% for m = 2, 95% for m = 10, and 90% for m = 20. Moreover, the robustness of the performance of the lifting- and chaos-enhanced model with respect to the lifting-dimension value is a promising sign that one may not need to sweep a large number of lifting dimensions to find one which leads to high accuracy. With this in mind, the computational complexity of the lifting- and chaos-enhanced model becomes competitive with the baseline model while offering consistently superior accuracy.
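The sweep over lifting dimensions, with an epoch cap and a loss-based early stop, could be organized along the following lines. This is only a schematic sketch: `train_model` and `evaluate` are hypothetical stand-ins, since the actual training loop is not shown here.

```python
def sweep_lifting_dims(m, max_epochs, train_model, evaluate):
    """Pick the lifting dimension in the range (m + 6, 50) that gives
    the best test accuracy.  Training for each candidate dimension is
    capped at max_epochs and, as described in the text, Adam would be
    stopped early once the loss falls below 1e-10 (that logic is
    assumed to live inside train_model)."""
    best_dim, best_acc = None, float("-inf")
    for m_lift in range(m + 6, 51):
        model = train_model(m_lift, max_epochs=max_epochs, loss_tol=1e-10)
        acc = evaluate(model)
        if acc > best_acc:
            best_dim, best_acc = m_lift, acc
    return best_dim, best_acc

# Dummy stand-ins, purely for illustration: "training" returns the
# dimension itself and "accuracy" peaks at dimension 40.
best_dim, _ = sweep_lifting_dims(
    20, 500,
    train_model=lambda d, max_epochs, loss_tol: d,
    evaluate=lambda d: -abs(d - 40))
print(best_dim)  # 40
```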
m     Baseline   Lift    Lift + Chaos
2     0.28s      0.36s   0.25s
10    13s        110s    29s
20    151s       570s    170s

Table 1: Computing times for the three models (in seconds). The baseline model trains for 20000 epochs; the lifting-enhanced model trains for up to 2000 epochs (per lifting-dimension value) and the lifting- and chaos-enhanced model for up to 500 epochs (per lifting-dimension value). For the lifting-enhanced and the lifting- and chaos-enhanced models, the computing time is the total time required to sweep through the lifting dimensions (m + 6, 50) to determine the optimal one.

3 Discussion and future work

We have presented a two-stage approach to enhance classification accuracy, in which the input vector is first lifted to a higher dimension and then evolved for a fixed temporal interval through a chaotic system. We observe that the lifting to higher dimensions, and especially the chaotic evolution, can significantly accelerate the training process and, in addition, increase the achieved accuracy compared to a baseline model. The rationale behind the combination of lifting to higher dimensions and chaotic evolution is to explore the available space in higher dimensions in order to pull apart the clusters of points corresponding to the various classes, but to do so in a controlled manner. Specifically, we can consider the evolution by a chaotic system as taking the cluster of points in the higher-dimensional space corresponding to a class and blowing it up so that it occupies a larger volume. This, in turn, aids the classifier by making it easier to distinguish each class from the others. We have found that the lifting and chaotic evolution lead to a much more uniform distribution of weight values within a row and also across rows (classes) compared to the baseline model.
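A minimal sketch of the two-stage transform described above, assuming zero padding for the lifting map and the Lorenz-96 system [17, 18] integrated with RK4 for the chaotic evolution (the paper's actual lifting map and integrator may differ):

```python
import numpy as np

def lorenz96_rhs(x, F=8.0):
    """Lorenz-96 right-hand side: dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F,
    with periodic indexing."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def lift_and_evolve(v, m_lift, T=2.0, dt=0.01):
    """Stage 1: lift v to m_lift dimensions by zero padding.
    Stage 2: use the lifted vector as the initial condition of the
    Lorenz-96 system and integrate (RK4) for a temporal interval T.
    The evolved state is what gets fed to the trainable softmax
    classifier."""
    x = np.zeros(m_lift)
    x[:len(v)] = v
    for _ in range(int(T / dt)):
        k1 = lorenz96_rhs(x)
        k2 = lorenz96_rhs(x + 0.5 * dt * k1)
        k3 = lorenz96_rhs(x + 0.5 * dt * k2)
        k4 = lorenz96_rhs(x + dt * k3)
        x = x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

# The evolved states, not the raw vectors, become the classifier inputs,
# e.g. for a (perturbed) canonical basis vector in R^20 lifted to 40 dims:
z = lift_and_evolve(np.eye(20)[0], m_lift=40, T=2.0)
```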
While these observations provide some insight into the improved performance of the proposed approach, a more thorough analysis based on the dynamics of training should be undertaken. Specifically, we need to understand how the lifting and chaotic evolution affect the gradient of the loss function. The results presented here indicate that even though one can obtain a dramatic increase in classification performance without hyperparameter optimization, the benefits to be reaped can increase even more through a careful analysis of the main elements of the proposed approach. These include the specific construction of the lifting to higher dimensions, the choice of the chaotic system used to evolve the lifted vector, as well as the temporal interval of the chaotic evolution. In addition, the use of more efficient stochastic optimizers than the standard Adam optimizer should also be considered. Also, we need to investigate further the apparent increase in robustness to noise in the data exhibited by the lifting- and chaos-enhanced model compared to the baseline model. Furthermore, it will be interesting to explore connections with physical reservoir computing approaches, e.g., [22, 23, 24]. Finally, the results presented here are only proof-of-concept. A much more substantial test will be to see whether the proposed approach can improve classification accuracy for challenging datasets, e.g., standard benchmarks from image processing [25].

Acknowledgements

I would like to thank Shady Ahmed and Saad Qadeer for useful discussions and comments. This work is partially supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research program under the "LEADS: LEarning-Accelerated Domain Science" project (Project No. 85462). Pacific Northwest National Laboratory is a multi-program national laboratory operated for the U.S.
Department of Energy by Battelle Memorial Institute under Contract No. DE-AC05-76RL01830.

References

[1] Sotiris B. Kotsiantis, Ioannis D. Zaharakis, and Panayiotis E. Pintelas. Machine learning: a review of classification and combining techniques. Artificial Intelligence Review, 26(3):159–190, 2006.

[2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[3] Stéphane Lathuilière, Pablo Mesejo, Xavier Alameda-Pineda, and Radu Horaud. A comprehensive analysis of deep regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(9):2065–2081, 2019.

[4] George Em Karniadakis, Ioannis G. Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021.

[5] Qi An, Saifur Rahman, Jingwen Zhou, and James Jin Kang. A comprehensive review on machine learning in healthcare industry: classification, restrictions, opportunities and challenges. Sensors, 23(9):4178, 2023.

[6] Eduardo A. Gerlein, Martin McGinnity, Ammar Belatreche, and Sonya Coleman. Evaluating machine learning classification for financial trading: An empirical approach. Expert Systems with Applications, 54:193–207, 2016.

[7] Adi L. Tarca, Vincent J. Carey, Xue-wen Chen, Roberto Romero, and Sorin Drăghici. Machine learning and its applications to biology. PLoS Computational Biology, 3(6):e116, 2007.

[8] Giuseppe Carleo, Ignacio Cirac, Kyle Cranmer, Laurent Daudet, Maria Schuld, Naftali Tishby, Leslie Vogt-Maranto, and Lenka Zdeborová. Machine learning and the physical sciences. Reviews of Modern Physics, 91(4):045002, 2019.

[9] Kamal Choudhary, Brian DeCost, Chi Chen, Anubhav Jain, Francesca Tavazza, Ryan Cohn, Cheol Woo Park, Alok Choudhary, Ankit Agrawal, Simon J. L. Billinge, et al. Recent advances and applications of deep learning methods in materials science.
npj Computational Materials, 8(1):59, 2022.

[10] Peter J. Bickel and Elizaveta Levina. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.

[11] Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.

[12] Bissan Ghaddar and Joe Naoum-Sawaya. High dimensional data classification and feature selection using support vector machines. European Journal of Operational Research, 265(3):993–1004, 2018.

[13] Kento Nishi, Yi Ding, Alex Rich, and Tobias Hollerer. Augmentation strategies for learning with noisy labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8022–8031, 2021.

[14] Caimin An, Fabing Duan, François Chapeau-Blondeau, and Derek Abbott. Exploring the Bayes-oriented noise injection approach in neural networks. Physics Letters A, page 130804, 2025.

[15] Albert Kjøller Jacobsen, Johanna Marie Gegenfurtner, and Georgios Arvanitidis. Staying on the manifold: Geometry-aware noise injection. arXiv preprint arXiv:2509.20201, 2025.

[16] Panos Stinis. Enforcing constraints for time series prediction in supervised, unsupervised and reinforcement learning. In Proceedings of the AAAI 2020 Spring Symposium on Combining Artificial Intelligence and Machine Learning with Physical Sciences. CEUR Workshop Proceedings, 2020.

[17] Edward N. Lorenz. Predictability: A problem partly solved. In Proc. Seminar on Predictability, volume 1, pages 1–18. Reading, 1996.

[18] Alireza Karimi and Mark R. Paul. Extensive chaos in the Lorenz-96 model. Chaos: An Interdisciplinary Journal of Nonlinear Science, 20(4), 2010.

[19] Juan C. Vallejo and Miguel A. F. Sanjuán. Predictability of Chaotic Dynamics.
Springer, 2017.

[20] Julien Brajard, Alberto Carrassi, Marc Bocquet, and Laurent Bertino. Combining data assimilation and machine learning to emulate a dynamical model from sparse and noisy observations: A case study with the Lorenz 96 model. Journal of Computational Science, 44:101171, 2020.

[21] Joel Lehman and Kenneth O. Stanley. Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation, 19(2):189–223, 2011.

[22] Chrisantha Fernando and Sampsa Sojakka. Pattern recognition in a bucket. In European Conference on Artificial Life, pages 588–597. Springer, 2003.

[23] Gouhei Tanaka, Toshiyuki Yamane, Jean Benoit Héroux, Ryosho Nakane, Naoki Kanazawa, Seiji Takeda, Hidetoshi Numata, Daiju Nakano, and Akira Hirose. Recent advances in physical reservoir computing: A review. Neural Networks, 115:100–123, 2019.

[24] Giulia Marcucci, Davide Pierangeli, and Claudio Conti. Theory of neuromorphic computing by waves: machine learning by rogue waves, dispersive shocks, and solitons. Physical Review Letters, 125(9):093901, 2020.

[25] Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
