Deep Tabular Representation Corrector

JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020

Hangting Ye, Peng Wang, Wei Fan, Xiaozhuang Song, He Zhao, Member, IEEE, Dandan Guo*, Yi Chang*, Senior Member, IEEE

Abstract—Tabular data play an important role in diverse real-world fields, such as healthcare, engineering, and finance. The recent success of deep learning has fostered many tabular learning methods based on deep networks (e.g., Transformer, ResNet). Generally, existing deep tabular machine learning methods follow one of two paradigms, i.e., in-learning and pre-learning. In-learning methods train networks from scratch or impose extra constraints to regulate the representations, which nonetheless entails training multiple tasks simultaneously and makes learning more difficult, while pre-learning methods design several pretext tasks for pre-training and then conduct task-specific fine-tuning, which requires much extra training effort guided by prior knowledge. In this paper, we introduce a novel deep Tabular Representation Corrector, TRC, to enhance any trained deep tabular model's representations without altering its parameters, in a model-agnostic manner. Specifically, targeting the representation shift and representation redundancy that hinder prediction, we propose two tasks: (i) Tabular Representation Re-estimation, which trains a shift estimator to calculate the inherent shift of tabular representations and subsequently mitigate it, thereby re-estimating the representations, and (ii) Tabular Space Mapping, which transforms the above re-estimated representations into a light-embedding vector space via a coordinate estimator while preserving crucial predictive information to minimize redundancy. The two tasks jointly enhance the representations of deep tabular models without touching the original models, thus enjoying high efficiency.
Finally, we conduct extensive experiments on state-of-the-art deep tabular machine learning models coupled with TRC on various tabular benchmarks, which show consistent superiority.

Index Terms—Tabular data, deep neural networks, tabular representation learning.

Hangting Ye, Peng Wang, Dandan Guo, and Yi Chang are with the School of Artificial Intelligence, Jilin University, China. Yi Chang is also with the Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Ministry of Education, China, and the International Center of Future Science, Jilin University, China. E-mail: yeht22@mails.jlu.edu.cn, pwang23@mails.jlu.edu.cn, guodandan@jlu.edu.cn, yichang@jlu.edu.cn. Wei Fan is with the School of Computer Science, University of Auckland, New Zealand. E-mail: wei.fan@auckland.ac.nz. Xiaozhuang Song is with the Chinese University of Hong Kong, Shenzhen, China. E-mail: xiaozhuangsong1@link.cuhk.edu.cn. He Zhao is with CSIRO's Data61 and Monash University, Australia. E-mail: he.zhao@data61.csiro.au. *Dandan Guo and Yi Chang are the corresponding authors. Our code is available at https://github.com/HangtingYe/TRC.

I. INTRODUCTION

Tabular data, typically represented as tables with rows standing for the data samples and columns standing for the heterogeneous feature variables (e.g., categorical and numerical features), is a fundamental data type across diverse fields, including healthcare [1], finance [2], engineering [3], and psychology [4]. Unlike perceptual data (e.g., images and text), tabular data lacks prior structural information [5], making it essential to discover relationships within the data without relying on preconceived structures. Over the course of several decades, tabular machine learning has garnered significant attention from researchers, transitioning from traditional machine learning techniques (e.g., linear regression [6], logistic regression [7]) to more advanced tree-based approaches (e.g.,
XGBoost [8], CatBoost [9]). Recently, witnessing the great success of deep learning in many domains such as computer vision [10], [11] and natural language processing [12], researchers have observed that there is still a large gap in the effective usage of deep learning methods in the tabular domain [5], [13]. Research on deep tabular machine learning currently encompasses two main directions: in-learning and pre-learning styles. For the in-learning style, typically neural network backbones with different architectures are trained from scratch, or additional constraints are applied to regularize the representations. For example, AutoInt [14] transforms features to embeddings and applies a series of attention-based transformations to the embeddings based on Transformer [15]. FT-Transformer [13] further improves AutoInt through better token embeddings. SNN [16] proposes an MLP-like model with neuron activations aiming for zero mean and unit variance, fostering an effective regularization for learning complex representations. Kadra et al. [17] further regularize the MLP with multiple modern regularization techniques. Recently, PTaRL [18] constructs a new projection space and projects data samples into this space, facilitating the learning of disentangled representations with constraints tailored for tabular data. However, these techniques are limited to learning from scratch or training multiple loss functions simultaneously, leading to difficulties in effectively optimizing the target loss function. Another novel line of work explores the direction of pre-learning for the tabular domain. This line of work designs a set of pretext tasks that are challenging but highly relevant to the objective tasks, and optimizes the parameters of the backbone models via pre-training on these pretext tasks. For example, TabNet [19] and VIME [20] try to recover corrupted inputs with an auto-encoding loss.
SCARF [21] utilizes a contrastive loss similar to SimCLR [22] between the sample and its corrupted counterpart. SubTab [23] incorporates a combination of both. However, these pretext tasks are often inspired by some prior knowledge and need much extra training effort. Our paper introduces a novel deep Tabular Representation Corrector, TRC, to enhance the representations of any trained deep tabular backbone model without altering its parameters. Unlike existing approaches that directly intervene in the learning process of backbones (in-learning) or rely on prior knowledge to design pretext tasks for pre-training (pre-learning), TRC provides a cost-effective, parameter-efficient technique to enhance representations for deep tabular models to improve performance in a model-agnostic way. Besides, with TRC, we aim to solve two inherent issues hindering prediction: (i) Representation Shift, where inherent noise within the observations can lead the representations learned by deep tabular models to deviate from the ideal optimal representations, and (ii) Representation Redundancy, where deep tabular models may incorporate redundant information in the representation space, thereby resulting in inaccurate prediction. To overcome the aforementioned challenges, we propose two tasks: Tabular Representation Re-estimation and Tabular Space Mapping. Specifically, for Tabular Representation Re-estimation, we fit an estimator to compute and eliminate the latent shift within each representation of the trained tabular model, thereby re-estimating the representations. For Tabular Space Mapping, we transform the re-estimated representations into a newly defined space characterized by a set of embedding vectors to condense the information. Furthermore, additional strategies are proposed to preserve the critical knowledge for prediction.
The redundant information is removed by reducing the information capacity while preserving critical information necessary for accurate prediction. The contributions of this paper include:

• We propose a novel tabular representation corrector, TRC, for deep tabular machine learning, which enhances the learned representations of any trained deep tabular backbone without altering its parameters in a model-agnostic manner.

• We propose two tasks in TRC, i.e., Tabular Representation Re-estimation and Tabular Space Mapping, to overcome the issues of Representation Shift and Representation Redundancy, respectively, which hinder tabular predictions.

• We conduct extensive experiments on state-of-the-art (SOTA) deep tabular models coupled with TRC on various tabular benchmarks. The comprehensive results, along with analysis and visualizations, demonstrate the effectiveness of TRC, even in challenging scenarios involving missing values and reduced training samples.

II. RELATED WORK

Inspired by the success of deep learning in CV [10] and NLP [12], numerous deep learning methods have been proposed for the tabular domain to accomplish prediction tasks. Research on deep tabular machine learning currently encompasses two main directions: in-learning and pre-learning styles.

A. In-learning for Tabular Data

In the context of the in-learning style, neural networks with various architectures are typically trained from scratch, or extra constraints are applied to better regulate learned representations [14], [13], [24], [16], [17], [25], [18]. Among these approaches, AutoInt [14] transforms input features into embeddings, which are then processed through a series of attention-based transformations. Building on this, FT-Transformer [13] improves upon AutoInt by enhancing token embeddings, using element-wise multiplication for numerical features and element-wise lookup tables for categorical features.
Similarly, ResNet for tabular data [13] demonstrates significant performance improvements. DCN2 [24] introduces a feature-crossing module that combines linear layers and multiplication operations with an MLP-like structure, enhancing its ability to capture feature interactions. SNN [16] proposes neuron activations designed to maintain zero mean and unit variance, promoting better regularization. Kadra et al. [17] further regularize the MLP by searching for the optimal combination of multiple regularization techniques for each dataset using a joint optimization over the decision on which regularizers to apply. TANGOS [25] proposes a new regularization technique for tabular data, encouraging neurons to focus on sparse, non-overlapping input features. More recently, PTaRL [18] constructs a new projection space consisting of prototypes and projects data samples into this space, facilitating the learning of disentangled representations with constraints tailored for tabular data. However, these techniques are limited to learning from scratch or training multiple loss functions simultaneously, leading to difficulties in effectively optimizing the target loss function.

B. Pre-learning for Tabular Data

Pre-learning methods design several pretext tasks for pre-training and then conduct task-specific fine-tuning [20], [21], [23], [26], [27]. For instance, VIME [20] introduces a novel pretext task that estimates mask vectors from corrupted tabular data alongside a reconstruction task for self-supervised learning. SCARF [21] employs a contrastive loss akin to SimCLR [22], contrasting samples with their corrupted versions by corrupting a random subset of features. SubTab [23] divides the input features into multiple subsets and then incorporates a combination of both reconstruction and contrastive losses.
SAINT [26] enhances this by integrating a Transformer with row-wise attention to capture inter-sample interactions and contrastive information during pre-training. In addition, Rubachev et al. [27] indicate that pre-training an MLP with several self-supervised learning objectives can achieve promising results. However, these pretext tasks are often inspired by some prior knowledge and need much extra training effort. Recently, alternative approaches have leveraged external information outside the target dataset during pre-training to enhance deep learning for tabular data. TransTab [28] incorporates feature name information into a Transformer to achieve cross-table learning. XTab [29] pre-trains a Transformer on a variety of datasets and aims to enhance tabular deep learning on a previously unseen dataset. These methods need additional information outside the target dataset and also increase the time complexity, so we do not consider them closely related baselines, as their primary goal, motivation, and methodology are different. Different from existing approaches that directly intervene in the learning process of deep tabular models (in-learning), or rely on prior knowledge or additional information outside the target dataset to design pretext tasks for pre-training, thus increasing time overhead (pre-learning), the tabular representation corrector provides a cost-effective, parameter-efficient technique to enhance representations for deep tabular models.

III. PROBLEM FORMULATIONS

Notation. Denote the $i$-th sample as $(x_i, y_i)$, where $x_i = (x_i^{(\mathrm{num})}, x_i^{(\mathrm{cat})}) \in \mathcal{X}$ represents numerical and categorical features, respectively, and $y_i \in \mathcal{Y}$ is the corresponding label. A real-world tabular dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ is a collection of $N$ data samples.
We denote the training set as $\mathcal{D}_{train}$, the validation set for early stopping as $\mathcal{D}_{val}$, and the test set for the final evaluation as $\mathcal{D}_{test}$. We consider deep learning for supervised tabular prediction tasks: regression $\mathcal{Y} = \mathbb{R}$, binary classification $\mathcal{Y} = \{0, 1\}$, and multiclass classification $\mathcal{Y} = \{1, \ldots, C\}$. The goal is to obtain an accurate deep tabular model $F(\cdot; \theta): \mathcal{X} \rightarrow \mathcal{Y}$ trained on $\mathcal{D}_{train}$ that minimizes the expected loss $\mathbb{E}[\mathcal{L}(F(x; \theta), y)]$. Here, $\mathcal{L}$ denotes a task-specific loss function, typically the mean squared error for regression and the cross-entropy loss for classification.

In-learning and Pre-learning. Recall that the deep tabular model $F(\cdot; \theta)$ usually includes the backbone $G_f(\cdot; \theta_f)$ parameterized with $\theta_f$ and the prediction head $G_h(\cdot; \theta_h)$ parameterized with $\theta_h$. Standard in-learning methods aim to minimize the following loss function:

$$\min_{\theta_f, \theta_h} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(G_h(z_i; \theta_h), y_i), \quad z_i = G_f(x_i; \theta_f), \quad (1)$$

where $N$ is the size of $\mathcal{D}_{train}$ and $z_i$ denotes the representation extracted by $G_f(x_i; \theta_f)$. In some cases, additional constraints may be optimized simultaneously. The trained $G_h(G_f(\cdot; \theta_f); \theta_h)$ is then used for prediction. For pre-learning methods, pretext tasks $\mathcal{D}^s = \{(x_i^s, y_i^s)\}_{i=1}^{N}$, such as reconstructing original features from corrupted ones, are designed to pre-train the backbone $G_f(\cdot; \theta_f)$:

$$\min_{\theta_f, \theta_h^s} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(G_h^s(z_i^s; \theta_h^s), y_i^s), \quad z_i^s = G_f(x_i^s; \theta_f), \quad (2)$$

where $N$ is the size of $\mathcal{D}^s$ and $G_h^s$ is used to predict $y_i^s$ based on $z_i^s$. After pre-training, the backbone $G_f(\cdot; \theta_f)$ can be used to extract better data representations. A new head $G_h(\cdot; \theta_h)$ is then attached to the trained backbone $G_f(\cdot; \theta_f)$ to predict the true labels, and either $G_h(\cdot; \theta_h)$ or both $G_h(\cdot; \theta_h)$ and $G_f(\cdot; \theta_f)$ are further optimized (as in Eq. 1).

Tabular Representation Corrector (TRC).
In this paper, we consider the problem of enhancing the representation $z = G_f(x; \theta_f)$ produced by a trained deep tabular backbone while freezing all parameters of $G_f(\cdot; \theta_f)$. To achieve this, TRC introduces a learnable correction module $G_p(\cdot; \theta_p)$, which takes the frozen representation $z$ as input and outputs an improved representation $G_p(z; \theta_p)$. A new head $G_h(\cdot; \theta_h)$ is then attached to $G_p(\cdot; \theta_p)$ for label prediction. During training, only $G_p(\cdot; \theta_p)$ and $G_h(\cdot; \theta_h)$ are updated. TRC applies regardless of whether the backbone $G_f(\cdot; \theta_f)$ was trained in an in-learning or pre-learning manner.

IV. METHODOLOGY

A. Motivation

In the context of deep tabular machine learning, the inherent heterogeneity of features poses significant challenges to achieving satisfactory performance using deep models. Existing approaches learn tabular representations by directly intervening in the learning process of deep tabular models (in-learning) or designing pretext tasks for pre-training based on prior knowledge or additional information outside the target dataset (pre-learning). We propose a novel deep Tabular Representation Corrector (TRC) that provides a cost-effective, parameter-efficient technique to enhance the representations of trained deep tabular models. Specifically, TRC addresses two inherent representation issues hindering predictions. (i) Representation Shift. As stated by [30], [5], there is a natural possibility of noise in the collection process due to the heterogeneous nature of tabular data. In this work, we claim that such inherent but unobserved noise within the original observations can lead the representations extracted by deep tabular models to deviate from the ideal optimal representations (called representation shift in this paper), thereby decreasing model performance.
To verify this point, we perturb the original features with noise as a form of data augmentation, leading to the deteriorated performance of existing deep tabular models, as illustrated in Fig. 1. Moreover, to further substantiate the impact of data noise on deep tabular models, we systematically increase the noise ratio and observe a consistent decline in model performance. Therefore, given well-trained deep tabular models learned from the original observed tabular dataset, we assume that they might be limited in extracting optimal representations due to the existence of inherent noise in tabular data. (ii) Representation Redundancy. To measure the complexity of the representation space, we perform singular value decomposition (SVD) on the representations $Z = \{z_i\}_{i=1}^{N} \in \mathbb{R}^{N \times D}$ to obtain $Z = U \Sigma V^\intercal$ and introduce the singular value entropy (SVE) as follows:

Definition 1. Singular Value Entropy. The singular value entropy (SVE) is defined as the entropy of the normalized singular values:

$$\mathrm{SVE} = -\sum_{i=1}^{D} \frac{\sigma_i}{\sum_{j=1}^{D} \sigma_j} \log \frac{\sigma_i}{\sum_{j=1}^{D} \sigma_j}, \quad (3)$$

where $N$ is the number of samples, $D$ is the dimensionality, $U$ and $V$ denote the left and right singular vector matrices, respectively, and $\Sigma$ denotes the diagonal singular value matrix with entries $\{\sigma_i\}_{i=1}^{D}$.

The singular value spectrum is widely considered to be related to generalization performance [31], [32], [33]. SVE measures the flatness of the singular value distribution [34]. Greater SVE values in the latent space indicate a more comprehensive capture of data structure and a higher level of information content within the latent space. However, it is observed in Fig. 2 that existing deep tabular models with higher SVE do not necessarily yield better performance; rather, the models with better performance often exhibit relatively lower SVE. For each setting, we conduct multiple experiments with random initialization.
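The SVE of Definition 1 follows directly from the singular values of the representation matrix. A minimal NumPy sketch (the function name and toy matrices below are our own illustration, not the paper's code):

```python
import numpy as np

def singular_value_entropy(Z: np.ndarray) -> float:
    """Singular value entropy (Eq. 3): the Shannon entropy of the
    normalized singular-value spectrum of an N x D matrix Z."""
    sigma = np.linalg.svd(Z, compute_uv=False)   # singular values, descending
    p = sigma / sigma.sum()                      # normalize to a distribution
    p = p[p > 1e-12]                             # drop numerically-zero values
    return float(-(p * np.log(p)).sum())

# A flat spectrum maximizes SVE; a rank-1 (redundant) matrix minimizes it.
Z_flat = np.eye(4)                        # four equal singular values
Z_low = np.outer(np.ones(4), np.ones(4))  # one dominant direction
print(singular_value_entropy(Z_flat))     # ≈ 1.3863 = log(4)
print(singular_value_entropy(Z_low))      # ≈ 0
```

In the paper's terms, lower SVE corresponds to a less flat spectrum, i.e., representations concentrated in fewer directions.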
This observation suggests that, due to feature heterogeneity, deep tabular models may incorporate redundant information in the representation space, thereby affecting accurate prediction.

Targeting the above two inherent representation issues, we propose two tasks, i.e., (i) Tabular Representation Re-estimation, which involves training a shift estimator to calculate and mitigate the inherent shift in the representation $z = G_f(x; \theta_f)$ extracted by a trained deep tabular backbone, thereby re-estimating $z$, and (ii) Tabular Space Mapping, which transforms the re-estimated representation into a light-embedding vector space via a coordinate estimator, aiming to preserve critical predictive information while reducing redundancy. These two tasks jointly enhance the frozen representation $z$ without altering $G_f(\cdot; \theta_f)$, thus enjoying high efficiency. During inference, for each test sample (notably, we do not add any noise to the test dataset), the representation produced by the fixed $G_f(\cdot; \theta_f)$ is sequentially processed by the shift estimator and the coordinate estimator to obtain a calibrated representation. Note that TRC, as a general representation learning framework, is model-agnostic such that it can be coupled with any trained deep tabular backbone $G_f(\cdot; \theta_f)$ to learn better representations. In the following, we elaborate on the details of TRC. Fig. 3 gives an overview of the framework.

Fig. 1: The performance of deep tabular models with varying noise levels in observations, on (a) the CA dataset (RMSE), (b) the CO dataset (RMSE), (c) the AD dataset (accuracy), and (d) the GE dataset (accuracy). FT-Transformer, MLP, and DCN2 indicate different deep tabular models. For regression tasks, lower RMSE is better, and for classification tasks, higher accuracy is better.

Fig. 2: The relation between model performance and the corresponding SVE values of representations, for (a) FT-Transformer on CA, (b) MLP on CA, (c) FT-Transformer on AD, and (d) MLP on AD. For each subfigure, we conduct experiments on multiple random seeds. We find that deep tabular models with higher SVE often yield lower performance.

B. Tabular Representation Re-estimation

Due to feature heterogeneity, deep tabular models are susceptible to the influence of various types of noise, as depicted in Fig. 1, which often leads to undesirable shifts in the models' latent space. This causes the learned representations (sub-optimal representations) to deviate from the ideal optimal ones. Directly intervening in the learning process of deep tabular models or designing appropriate pretext tasks is challenging due to the unknown optimal representations. To alleviate this issue, we propose a task via representation re-estimation to approximate and mitigate the inherent shift in the sub-optimal representations of the trained deep tabular models.
This process involves the following stages: (i) Approximated Optimal Representations Searching, which identifies a subset of representations containing the least shift information towards the ideal optimal representations; (ii) Inherent Shift Learning, where varying types of random signals are manually injected into the above selected approximated optimal representations to train a robust shift estimator capable of modeling the inherent shift present in sub-optimal representations; and (iii) Re-estimation of Sub-optimal Representations, which applies the learned shift estimator to remove the estimated shift from each sub-optimal representation, thereby re-estimating the representations.

Approximated Optimal Representations Searching. By estimating the shift information present in each sub-optimal representation, we can infer the overall shift information within the whole set of representations of deep tabular models. To achieve this, we propose searching for a subset of samples with relatively optimal representations (approximated optimal representations) and then manually introducing shift information into these representations to simulate representations with shift, i.e., simulated sub-optimal representations. With the simulated sub-optimal representations and their corresponding shift, we can fit a shift estimator to compute the overall shift information, which is introduced in the subsequent stage. To obtain the set of samples with approximated optimal representations, we first introduce an assumption as follows:

Assumption 1. If we possess a deep tabular model $F(\cdot; \theta) = G_h(G_f(\cdot; \theta_f); \theta_h)$ trained on $\mathcal{D}_{train}$, then there exists a set of samples $\mathcal{D}_o \subseteq \mathcal{D}_{val}$ with the lowest gradient norms among $\mathcal{D}_{val}$.
$\forall (x, y) \in \mathcal{D}_o$, the following condition holds:

$$\mathrm{Rank}\left( \|\nabla_{\theta_f} \mathcal{L}(F(x; \theta), y)\|_p^q \right) / |\mathcal{D}_{val}| \leq \tau, \quad (4)$$

where $\mathrm{Rank}(\cdot)$ denotes the ranking (integer) in ascending order within $\mathcal{D}_{val}$, $\|\cdot\|_p^q$ denotes the $q$-th power of the $L_p$ norm, $\mathcal{L}$ denotes the supervised loss function, and $\tau$ denotes the threshold that defines $\mathcal{D}_o$.

Fig. 3: The framework of Tabular Representation Corrector (TRC). Subfigure (a) illustrates the overall TRC framework, subfigure (b) presents the training process for the shift estimator of TRC, and subfigure (c) presents the training process for the coordinate estimator of TRC. Here, $z_i = G_f(x_i; \theta_f)$ is the output of any trained backbone; $z_i$ is enhanced via the two tasks. In subfigure (b), the explicitly perturbed samples are only used for training the shift estimator. During the test stage, we feed the representations extracted by the existing trained deep tabular model, which are from the test dataset (notably, we do not add any noise to the test dataset), into the shift estimator followed by the coordinate estimator to achieve the calibrated representations of the test samples.

After the trained backbone $G_f(\cdot; \theta_f)$ is obtained, we proceed to compute the gradient norms for each sample across $\mathcal{D}_{val}$.
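The ranking-and-threshold selection of Eq. 4 can be made concrete with per-sample gradient norms. The sketch below is our own toy stand-in: a linear model whose per-sample gradient has a closed form, with $p = 2$, $q = 1$, and a larger $\tau$ than the paper's default of 0.01 so the selected subset is clearly visible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy validation set and an imperfectly "trained" linear model w (illustrative:
# in TRC the gradient is taken w.r.t. the deep backbone parameters theta_f).
X_val = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y_val = X_val @ w_true + 0.1 * rng.normal(size=200)
w = w_true + 0.05 * rng.normal(size=5)

# Per-sample gradient of the squared loss w.r.t. w is 2*(w.x - y)*x,
# so its L2 norm is 2 * |residual| * ||x||.
residual = X_val @ w - y_val
grad_norms = 2.0 * np.abs(residual) * np.linalg.norm(X_val, axis=1)

# Eq. 4: keep samples whose ascending rank of gradient norm falls within
# the lowest tau fraction of the validation set.
tau = 0.05
ranks = np.argsort(np.argsort(grad_norms)) + 1     # 1-based ascending ranks
D_o_idx = np.where(ranks / len(X_val) <= tau)[0]   # indices forming D_o
print(len(D_o_idx))                                # 10 = lowest 5% of 200
```

The selected indices are exactly the samples whose representations TRC treats as approximated optimal.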
The searching process is conducted on $\mathcal{D}_{val}$ to prevent overfitting and reduce the time cost. Based on Assumption 1, we assume that the representation set corresponding to $\mathcal{D}_o$, $Z_o = \{z_k = G_f(x_k; \theta_f) \mid (x_k, y_k) \in \mathcal{D}_o, k \in \{1, 2, \ldots, K\}\}$, where $K$ denotes its size (the threshold $\tau$ is set to 0.01 by default), contains those representations with the least shift information towards the ideal optimal representations; thus $Z_o$ can be treated as the approximated optimal representations.

Inherent Shift Learning. After acquiring the approximated optimal representations $Z_o = \{z_k = G_f(x_k; \theta_f) \mid (x_k, y_k) \in \mathcal{D}_o, k \in \{1, 2, \ldots, K\}\}$, we guide the training process of the shift estimator $\phi$ by incorporating different types of shift signals into the approximated optimal representations $Z_o$ to generate simulated sub-optimal representations $\tilde{Z}_o = \{\tilde{z}_{km}\}_{k,m=1}^{K,M}$ and the corresponding shift information $\Delta = \{\Delta_{km}\}_{k,m=1}^{K,M}$, as follows:

$$\min \mathcal{L}_{shift} = \min_{\theta_\phi} \frac{1}{KM} \sum_{k,m=1}^{K,M} \|\phi(\tilde{z}_{km}; \theta_\phi) - \Delta_{km}\|_2^2, \quad (5)$$

$$\tilde{z}_{km} = G_f(\tilde{x}_{km}; \theta_f) = G_f(m_{km} \odot x_k + (1 - m_{km}) \odot \epsilon_k; \theta_f), \quad (6)$$

$$\Delta_{km} = \tilde{z}_{km} - z_k, \quad (7)$$

where $m_{km} \in \{0, 1\}^d$ is a feature mask vector with $d$ the number of features, $\epsilon_k \in \mathbb{R}^d$ is the noise added to the observation $x_k$ corresponding to $z_k$, and $M$ is the number of perturbations. Given the feature heterogeneity of tabular data, we propose generating $\epsilon_k$ by sampling each feature element from the empirical distribution of the corresponding column in $\mathcal{D}_{train}$, following previous works [20], [23], [35]. The shift estimator $\phi$ can be trained under the supervision of the simulated sub-optimal representations $\tilde{Z}_o$ and their corresponding shift information $\Delta$. In addition, during training, we also incorporate the approximated optimal representations $Z_o$ and their corresponding shift (i.e., $0$) as supervision, to prevent excessive shifts in the output for well-optimized representations.
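Eqs. 5–7 can be sketched end to end with toy stand-ins: a fixed random linear map plays the frozen backbone $G_f$, noise is drawn from each column's empirical distribution, and instead of an SGD-trained network we fit a linear shift estimator $\phi$ in closed form by least squares (all shapes, names, and the linearity of $\phi$ are our own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

X_train = rng.normal(size=(500, 8))   # D_train: source of empirical column noise
X_o = rng.normal(size=(50, 8))        # observations behind Z_o (K = 50)
W_f = rng.normal(size=(8, 8))         # frozen linear stand-in for G_f
Z_o = X_o @ W_f                       # approximated optimal representations

K, M, d = len(X_o), 4, X_o.shape[1]
Z_tilde, Delta = [], []
for k in range(K):
    eta_k = rng.uniform(0.5, 0.9)     # per-sample keep probability
    for _ in range(M):                # M distinct perturbations (Eq. 6)
        m_km = (rng.uniform(size=d) < eta_k).astype(float)  # 1 = keep feature
        # epsilon_k: each entry drawn from that column's empirical distribution
        eps_k = X_train[rng.integers(len(X_train), size=d), np.arange(d)]
        z_tilde = (m_km * X_o[k] + (1.0 - m_km) * eps_k) @ W_f
        Z_tilde.append(z_tilde)
        Delta.append(z_tilde - Z_o[k])                      # shift target (Eq. 7)
Z_tilde, Delta = np.array(Z_tilde), np.array(Delta)

# Eq. 5 for a linear phi, solved in closed form by least squares.
W_phi, *_ = np.linalg.lstsq(Z_tilde, Delta, rcond=None)

# Eq. 8 preview: Phi(z) = z - phi(z) pulls perturbed representations
# back towards their clean counterparts.
Z_re = Z_tilde - Z_tilde @ W_phi
Z_clean = np.repeat(Z_o, M, axis=0)
print(np.linalg.norm(Z_re - Z_clean) < np.linalg.norm(Z_tilde - Z_clean))  # True
```

Because the zero map lies in the least-squares hypothesis class, the fitted estimator can never increase the residual on its training perturbations, which is why the re-estimated representations land closer to the clean ones.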
Since the potential shift within the representations of real data samples may be diverse, the shift estimator $\phi$ would struggle to infer shift from a single type of simulated sub-optimal representation. To address this, we diversify the types of simulated sub-optimal representations by applying $M$ distinct perturbations to the observation $x_k$ to simulate $M$ different sub-optimal representations. Specifically, for $x_k$, we generate a feature mask vector $m_{km}$, where each entry of the vector is set to 1 with probability $\eta_k$ and to 0 otherwise. We then perturb $x_k$ by $m_{km} \odot x_k + (1 - m_{km}) \odot \epsilon_k$, where $m_{ki} \neq m_{kj}$ if $i \neq j$. In addition, for each $x_k$, $\eta_k$ is sampled from a uniform distribution to diversify the mask vectors between different samples. By introducing such varying types of shift signals to $Z_o$, the shift estimator $\phi$ has the potential to acquire information about the form of the real representation shift.

Re-estimation of Sub-optimal Representations. Given the learned shift estimator $\phi$, we can remove the latent shift $\phi(z_i; \theta_\phi)$ present in the sub-optimal representation $z_i = G_f(x_i; \theta_f)$ of any real data sample $x_i$ to achieve representation re-estimation:

$$\Phi(z_i; \theta_\phi) = z_i - \phi(z_i; \theta_\phi), \quad (8)$$

where $\Phi(\cdot; \theta_\phi)$ is the re-estimation function. Through the re-estimation function, we refine the representation space of the trained backbone $G_f(\cdot; \theta_f)$ by removing the latent shift without altering any of the backbone's parameters. Despite the shift removal achieved through this process, the problem of representation redundancy still exists. In the subsequent section, we introduce another task designed to address the issue of representation redundancy.

C. Tabular Space Mapping

The experimental results depicted in Fig.
2 reveal that, in the realm of deep tabular machine learning models, elevated SVE does not consistently correlate with superior performance. This observation may be attributed to the presence of redundant information. To address this issue, we introduce a technique designed to map representations into a newly defined light embedding space. This space is characterized by clearer and more concise information, aiming to alleviate the influence of redundant information on prediction accuracy while preserving essential information crucial for prediction tasks. Additionally, two strategies are applied to regulate the light embedding space: (i) Space Compression, which compresses and transforms the complex knowledge from the original space into a compact space, and (ii) Critical Knowledge Preserving, which retains the key information for prediction. Since the inherent representation shift information may also spread into the light embedding space, we suggest applying this task after the previous one. First, we give the formal definition of the light embedding space as follows:

Definition 2. Light Embedding Space (LE-Space). Given a collection of embedding vectors $\mathcal{B} = \{\beta_t\}_{t=1}^{T} \in \mathbb{R}^{T \times D}$, where $T$ denotes the number of embeddings and $\beta_t$ represents the $t$-th embedding vector in $D$-dimensional space, we define the LE-Space as the vector space spanned by the set $\mathcal{B}$. Formally, any representation in the LE-Space can be expressed as a weighted linear combination of the embedding vectors, given by:

$$r^{\mathcal{B}} = \sum_{t=1}^{T} r_t \beta_t, \quad (9)$$

where $r = \{r_t\}_{t=1}^{T} \in \mathbb{R}^T$ denotes the coordinates. This formulation ensures that each representation in the LE-Space captures the essential characteristics encoded by the embedding vectors $\mathcal{B}$.

Space Compression.
Since the trained backbone G_f(·; θ_f) inherently captures the interactions among features, and since, after applying the previous task, the original sub-optimal representation z_i has been re-estimated as Φ(z_i; θ_ϕ) (Eq. 8) to mitigate shift information, it is reasonable to predict r_i for each sample based on the re-estimated representation Φ(z_i; θ_ϕ). To obtain the target representation r_i B within the LE-Space, we use a shared coordinate estimator s(·; θ_s) with learnable θ_s to calculate the corresponding r_i:

r_i = s(Φ(z_i; θ_ϕ); θ_s). (10)

The embeddings are randomly initialized and learnable. By controlling the size T of the LE-Space to be significantly smaller than the dimensionality D of the representation Φ(z_i; θ_ϕ), we compress and transfer the information from the original space into the LE-Space.

Critical Knowledge Preserving. Directly mapping the representations could result in a collapse of LE-Space learning, where some of the embedding vectors in the LE-Space become similar, diminishing the space's representational capacity. To diversify these embedding vectors, we define an orthogonality loss that encourages the cosine similarity of any pair of embedding vectors to be close to 0:

min L_orth = min_B ( ∥A∥_1 / ∥A∥_2^2 + (∥A∥_1 − T)^2 ), (11)

where A ∈ [0, 1]^{T×T}, A_ij = |cos(β_i, β_j)|, and T is the number of embeddings. The first term ∥A∥_1 / ∥A∥_2^2 encourages sparsity in A, meaning that each element A_ij = |cos(β_i, β_j)| should be close to 0 (indicating orthogonality between β_i and β_j) or 1. The second term promotes ∥A∥_1 → T; given that A_ii = |cos(β_i, β_i)| = 1 for all i ∈ {1, 2, ..., T}, this implies that the off-diagonal elements of A should be close to 0.
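The orthogonality loss above can be written down directly. The following NumPy sketch (our illustration, not the authors' code) computes Eq. 11 from the embedding matrix B:

```python
import numpy as np

def orthogonality_loss(B):
    """Orthogonality loss of Eq. (11) for embedding vectors B of shape (T, D).

    A_ij = |cos(beta_i, beta_j)|; the loss is ||A||_1 / ||A||_2^2 plus
    (||A||_1 - T)^2, which together push the off-diagonal cosine
    similarities toward 0 while the diagonal stays at 1.
    """
    T = B.shape[0]
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)  # row-normalize embeddings
    A = np.abs(Bn @ Bn.T)                              # |cosine similarity| matrix
    l1 = A.sum()                                       # entry-wise L1 norm of A
    l2_sq = (A ** 2).sum()                             # squared Frobenius norm of A
    return l1 / l2_sq + (l1 - T) ** 2
```

For a perfectly orthogonal set of embeddings, A is the identity, so the loss reaches its minimum of 1; collapsed (collinear) embeddings drive both terms up.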
Furthermore, to maintain the key information for accurate prediction, we transfer the label information into the explicitly defined embedding vectors of the LE-Space by:

min L_pred = min_{θ_s, θ_ϕ, B, θ_h} (1/N) Σ_{i=1}^N L(G_h(r_i B; θ_h), y_i), (12)

where r_i = s(Φ(z_i; θ_ϕ); θ_s) (Eq. 10) and G_h(·; θ_h) is the re-initialized projection head. In the new space, the information is compressed compared to the original space, as the number T of embedding vectors is significantly smaller than the dimensionality D of the representation Φ(z_i; θ_ϕ), while the essential information crucial for prediction is preserved. As a result, the new space contains less redundant information than the original space. A detailed discussion is provided in Appendix J.

D. Framework Overview

Altogether, TRC requires learning the parameters θ_ϕ, B, θ_s, and θ_h. During each training iteration, TRC first minimizes L_shift to update θ_ϕ. Subsequently, TRC minimizes the combined objective L_orth + L_pred to update θ_ϕ, B, θ_s, and θ_h. These two optimization processes, which correspond to Tabular Representation Re-estimation and Tabular Space Mapping respectively, are iteratively alternated to refine the representations of a deep tabular model without modifying the backbone's parameters θ_f, in a model-agnostic manner. We provide the training process of TRC in Algorithm 1. For clarity, we also summarize the key notations used throughout the paper in Table VIII of Appendix A, including the symbols, their definitions, and dimensionalities where applicable. Specifically, TRC first identifies a set of samples D_o with the lowest gradient norms and obtains the corresponding representations Z_o as the approximated optimal representations. We then manually generate the simulated sub-optimal representations ˜Z_o and the corresponding shift information ∆ based on Z_o.
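As an illustration of this selection step, the sketch below (function name is ours) picks the representations whose precomputed per-sample gradient norms fall into the lowest τ fraction:

```python
import numpy as np

def select_optimal_representations(Z_val, grad_norms, tau=0.01):
    """Treat the representations with the lowest per-sample gradient norms
    as approximated optimal representations Z_o.

    Z_val: (N, D) representations from the trained backbone on D_val.
    grad_norms: (N,) precomputed ||grad_{theta_f} L(F(x), y)|| per sample.
    tau: fraction of samples to keep (threshold from the paper).
    """
    n_keep = max(1, int(round(tau * len(grad_norms))))
    idx = np.argsort(grad_norms)[:n_keep]  # indices of lowest gradient norms
    return Z_val[idx], idx
```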
˜Z_o and ∆ are used to supervise the training process of the shift estimator ϕ(·; θ_ϕ) via L_shift (Eq. 5). For a real data sample x_i, TRC removes the latent shift ϕ(z_i; θ_ϕ) from the representation z_i to re-estimate it as Φ(z_i; θ_ϕ) = z_i − ϕ(z_i; θ_ϕ) (Eq. 8), thereby alleviating the issue of representation shift. After that, the coordinate estimator s(·; θ_s) calculates the coordinates r_i (Eq. 10) to transform Φ(z_i; θ_ϕ) into r_i B within the LE-Space, compressing the information. In addition, TRC diversifies the embedding vectors B via L_orth (Eq. 11) and optimizes L_pred (Eq. 12) to preserve the critical knowledge for prediction. Therefore, TRC alleviates the issue of representation redundancy.

Algorithm 1 TRC algorithm workflow.
Input: datasets D_train, D_val, trained F(·; θ) = G_h(G_f(·; θ_f); θ_h), shift estimator ϕ(·; θ_ϕ), re-estimation function Φ(·; θ_ϕ), coordinate estimator s(·; θ_s), embedding vectors B, number of batches B;
1: Compute ∥∇_{θ_f} L(F(x; θ), y)∥_p^q over D_val and identify Z_o based on Assumption 1;
2: Generate simulated sub-optimal representations ˜Z_o and the corresponding shift information ∆;
3: Re-initialize the parameters of G_h(·; θ_h);
4: while θ_ϕ, B, θ_s, and θ_h have not converged do
5:   Compute L_shift (Eq. 5);
6:   Update θ_ϕ by minimizing L_shift through gradient descent;
7:   for i ← 1 to B do
8:     Sample a minibatch;
9:     Obtain the representation z = G_f(x; θ_f);
10:    Re-estimate the representation by Φ(z; θ_ϕ);
11:    Calculate r by s(Φ(z; θ_ϕ); θ_s);
12:    Transform the representation Φ(z; θ_ϕ) to rB;
13:    Compute L_orth (Eq. 11) and L_pred (Eq. 12);
14:    Update θ_ϕ, B, θ_s, and θ_h by minimizing L_orth + L_pred through gradient descent.
15:   end for
16: end while
Output: ϕ(·; θ_ϕ), s(·; θ_s), B, G_h(·; θ_h).
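Steps 9–12 of Algorithm 1 amount to the following forward computation. This NumPy sketch is our illustration: the callable shift estimator, the linear-plus-softmax coordinate estimator W_s, and the embeddings B stand in for the learned modules.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def trc_forward(z, shift_estimator, W_s, B):
    """Re-estimation and space mapping for one minibatch of representations.

    z: (batch, D) backbone representations G_f(x).
    shift_estimator: callable z -> estimated shift phi(z), shape (batch, D).
    W_s: (D, T) weights of the linear coordinate estimator (softmax output).
    B: (T, D) learnable embedding vectors spanning the LE-Space.
    """
    z_hat = z - shift_estimator(z)  # Eq. 8: re-estimated representation Phi(z)
    r = softmax(z_hat @ W_s)        # Eq. 10: coordinates over the T embeddings
    return r @ B                    # Eq. 9: representation rB in the LE-Space
```

The output lives in the span of B, so its effective dimensionality is bounded by T, which is how the compression in Space Compression arises.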
During inference, we feed the representations extracted by the existing trained deep tabular backbone from the test dataset (notably, we do not add any noise to the test data) into the shift estimator followed by the coordinate estimator to obtain the calibrated representations of test samples. Note that TRC, as a general representation learning framework, is model-agnostic: it can be coupled with any trained deep tabular backbone G_f(·; θ_f) to learn better representations.

V. EXPERIMENT & ANALYSIS

A. Experimental Setup

Datasets. We consider a variety of tabular datasets covering regression, binary, and multi-class classification. Specifically, the datasets include: Combined Cycle Power Plant (CO) [36], Diamonds (DI) [37], Qsar Fish Toxicity (QS) [38], California Housing (CA) [39], Pol (PO) [40], Superconductivity Data (SU) [41], Adult (AD) [42], Australian (AU) [43], Gesture Phase (GE) [44], Year (YE) [45], and Covertype (COV) [46]. The dataset properties are summarized in Table I.

TABLE I: Tabular data properties. "Objects" indicates the dataset size. RMSE denotes Root Mean Square Error for regression; Accuracy is used for binary and multiclass classification.

             CO     DI     QS    CA     PO     SU     AD     AU    GE    YE      COV
Objects      9568   53940  908   20640  15000  21263  48842  690   9873  515345  581012
Numerical    4      6      6     8      48     81     6      14    32    90      54
Categorical  0      3      0     0      0      0      8      0     0     0       0
Classes      -      -      -     -      -      -      2      2     5     -       7
Metric       RMSE   RMSE   RMSE  RMSE   RMSE   RMSE   Acc.   Acc.  Acc.  RMSE    Acc.

To handle categorical features, we adopt an integer encoding scheme, where each category within a column is uniquely mapped to an integer that indexes the embedding in the lookup table. We maintain consistent embedding dimensions for all categorical features. For numerical features, we apply column-wise normalization. In regression tasks, we also apply normalization to the labels.
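A minimal sketch of this preprocessing (helper names are ours; we assume standardization for the column-wise normalization, which the text does not pin down):

```python
import numpy as np

def encode_categorical(col):
    """Integer-encode one categorical column: each unique category is
    mapped to an index used to look up its embedding."""
    cats = {c: i for i, c in enumerate(dict.fromkeys(col))}
    return np.array([cats[c] for c in col]), cats

def normalize_columns(X, eps=1e-8):
    """Column-wise standardization of numerical features (our assumed
    variant of the paper's column-wise normalization)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / (sigma + eps)
```

For regression, the same standardization would be applied to the label column, with the inverse transform used when reporting RMSE.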
To ensure fair comparisons, we adhere to identical preprocessing procedures and use the quantile transformation following [13] for each dataset. We split each dataset into training, validation, and test sets with a ratio of 6:2:2. The datasets are collected from the public OpenML-AutoML Benchmark (https://openml.github.io/automlbenchmark/) and the UCI repository (https://archive.ics.uci.edu/datasets). Following previous studies [13], we use Root Mean Squared Error (RMSE) (lower is better) to evaluate regression tasks and Accuracy (higher is better) to evaluate binary and multiclass classification tasks.

Baseline Deep Tabular Models. As TRC is a model-agnostic paradigm, we include 13 mainstream deep tabular models, covering both in-learning and pre-learning paradigms, to test TRC's applicability and effectiveness on different predictors with diverse architectures. The baseline models include: MLP [47], DCN2 [24], SNN [16], ResNet [10], AutoInt [14], TANGOS [25], FT-Transformer [13], PTaRL [18], SCARF [21], SAINT [26], and VIME [20]. In addition, the authors of [27] show that pre-training MLP with several self-supervised learning objectives can achieve promising results; we therefore also include two of these objectives as additional baseline methods. We implement all deep tabular baseline models' hyperparameters following their original papers. Detailed descriptions of these models are as follows:

In-learning Style.
• MLP [47]. The Multilayer Perceptron (MLP) is a feedforward neural network consisting of multiple layers of interconnected nodes, commonly used as a baseline model for supervised learning tasks in the tabular domain.
• DCN2 [24]. This method consists of an MLP-like module and a feature crossing module (a combination of linear layers and multiplications).
• SNN [16]. An MLP-like architecture with the SELU activation that maintains zero mean and unit variance, enabling the training of deeper models.
• ResNet [10]. The key innovation is the use of residual connections, also known as skip connections or shortcut connections. These connections enable the effective training of deep neural networks, which was previously challenging due to the vanishing gradient problem. In this paper, we use the ResNet for tabular data [13].
• AutoInt [14]. This method transforms features into embeddings and applies a series of attention-based transformations to the embeddings based on the Transformer [15].
• TANGOS [25]. This method proposes a new regularization technique for tabular data, encouraging neurons to focus on sparse, non-overlapping input features. We adopt MLP as the base model for TANGOS.
• FT-Transformer [13]. This method is introduced by Gorishniy et al. [13] to further improve AutoInt through better token embeddings.
• PTaRL [18]. This method constructs a new projection space and uses Optimal Transport [48] to project data samples into this space, enabling the learning of disentangled representations with constraints designed for tabular data. We adopt FT-Transformer as the base model for PTaRL.

TABLE II: Results of overall performance. "i" denotes in-learning and "p" denotes pre-learning for baseline models. The best results are highlighted in bold. The improvement of TRC over baselines is statistically significant at the 95% confidence level.

                             CO ↓     DI ↓       QS ↓   CA ↓   PO ↓    SU ↓    AD ↑   AU ↑   GE ↑
MLP (i)                      3.961    564.373    0.875  0.505  10.892  10.379  0.856  0.870  0.578
 + TRC                       3.899    558.330    0.825  0.502  10.593  10.326  0.858  0.891  0.580
DCN2 (i)                     4.016    591.958    1.027  0.495  8.805   10.674  0.856  0.841  0.564
 + TRC                       3.858    574.453    0.879  0.493  8.790   10.584  0.857  0.862  0.584
SNN (i)                      11.789   1530.293   1.018  0.896  18.517  25.498  0.847  0.848  0.550
 + TRC                       6.880    693.369    0.870  0.699  11.202  15.537  0.847  0.877  0.573
ResNet (i)                   3.982    606.282    0.872  0.517  10.812  11.163  0.847  0.870  0.587
 + TRC                       3.919    615.736    0.845  0.508  10.205  10.720  0.849  0.877  0.597
AutoInt (i)                  4.020    562.169    0.860  0.487  6.450   11.193  0.854  0.848  0.598
 + TRC                       3.960    557.423    0.829  0.477  5.885   10.826  0.858  0.891  0.606
TANGOS (i)                   4.687    965.169    0.945  0.573  9.401   10.849  0.840  0.855  0.569
 + TRC                       4.478    886.904    0.914  0.538  9.271   10.520  0.839  0.882  0.590
FT-Transformer (i)           3.709    551.190    0.823  0.469  2.919   10.410  0.858  0.848  0.611
 + TRC                       3.648    543.653    0.780  0.462  2.728   10.223  0.862  0.884  0.624
PTaRL (i)                    3.668    550.582    0.813  0.447  2.533   10.199  0.872  0.879  0.583
 + TRC                       3.423    543.671    0.775  0.427  2.412   9.724   0.889  0.890  0.619
SCARF (p)                    3.856    579.610    0.863  0.520  8.310   10.324  0.858  0.862  0.589
 + TRC                       3.847    577.803    0.808  0.520  8.151   10.355  0.859  0.884  0.589
SAINT (p)                    4.022    597.207    0.827  0.508  4.415   13.095  0.857  0.870  0.549
 + TRC                       3.903    557.951    0.818  0.491  4.335   12.443  0.858  0.884  0.595
VIME (p)                     5.218    945.238    1.018  0.679  10.914  15.645  0.768  0.812  0.473
 + TRC                       4.371    612.454    0.950  0.645  7.489   15.028  0.846  0.884  0.479
MLP w/ SSL-Rec (p)           3.920    595.378    0.872  0.498  10.052  10.408  0.855  0.862  0.578
 + TRC                       3.859    581.511    0.825  0.499  9.796   10.335  0.856  0.884  0.595
MLP w/ SSL-Contrastive (p)   3.956    570.540    0.872  0.506  13.305  10.478  0.855  0.862  0.576
 + TRC                       3.923    562.671    0.825  0.502  12.803  10.436  0.856  0.884  0.587

Fig. 4: Performance improvement of TRC over various backbone models on large-scale datasets: (a) RMSE on the YE dataset ↓; (b) Accuracy on the COV dataset ↑.

Pre-learning Style.
• SCARF [21]. This method extends the SimCLR framework to tabular data by using contrastive learning.
• SAINT [26]. This method adopts attention over both rows and columns and includes an enhanced embedding method. Contrastive learning is used during pretraining.
• VIME [20].
This method utilizes self- and semi-supervised learning by recovering corrupted inputs for tabular data.
• MLP w/ SSL-Rec [27]. This method aims to reconstruct the original input given the corrupted input (the reconstruction loss is computed over all columns).
• MLP w/ SSL-Contrastive [27]. This method uses the InfoNCE loss, treating corrupted inputs as positives for the original inputs and the rest of the batch as negatives.

Implementation Details. Our TRC is a model-agnostic method that aims to enhance the representations of any deep tabular model F(·; θ) without altering its internal architecture. To achieve this, TRC comprises two tasks, i.e., Tabular Representation Re-estimation and Tabular Space Mapping. In the first task, for the norm ∥·∥_p^q, we set q = 1 and

TABLE III: Analysis of the effects of different components of TRC. The best results are highlighted in bold, and the second best results are underscored.

Backbone   TR   SC   DE   CO ↓    DI ↓      QS ↓   CA ↓   PO ↓    SU ↓    AD ↑   AU ↑   GE ↑
MLP        -    -    -    3.961   564.373   0.875  0.505  10.892  10.379  0.856  0.870  0.578
           ✓    -    -    3.916   563.649   0.868  0.502  10.605  10.328  0.858  0.870  0.571
           -    ✓    -    3.920   564.141   0.845  0.505  10.631  10.349  0.858  0.862  0.568
           ✓    ✓    -    3.908   558.520   0.852  0.502  10.602  10.341  0.858  0.870  0.575
           -    ✓    ✓    3.922   565.129   0.841  0.505  10.622  10.363  0.858  0.877  0.577
           ✓    ✓    ✓    3.899   558.330   0.825  0.502  10.593  10.326  0.858  0.891  0.580

TABLE IV: Performance of backbone w/ TRC and deeper backbone. The depths of the backbone layers are increased only for the "deeper backbone".
                        CO ↓    DI ↓      QS ↓   CA ↓   PO ↓    SU ↓    AD ↑   AU ↑   GE ↑
MLP                     3.961   564.373   0.875  0.505  10.892  10.379  0.856  0.870  0.578
deeper MLP              3.942   562.888   0.871  0.510  11.583  10.471  0.856  0.862  0.572
MLP + TRC               3.899   558.330   0.825  0.502  10.593  10.326  0.858  0.891  0.580
FT-Transformer          3.709   551.190   0.823  0.469  2.919   10.410  0.858  0.848  0.611
deeper FT-Transformer   3.704   559.363   0.816  0.474  2.890   10.442  0.851  0.856  0.613
FT-Transformer + TRC    3.648   543.653   0.780  0.462  2.728   10.223  0.862  0.884  0.624

p = 1. The threshold τ that defines the representations Z_o is set to 0.01. The shift estimator ϕ, which calculates the potential shifts present in the representations, is a simple 2-layer fully-connected MLP. We generate ϵ_k by sampling each feature element from the empirical distribution of the corresponding column in D_train, following previous works [20], [23], [35]. For the mask vector m_km, each entry is set to 1 with probability η_k and to 0 otherwise, where η_k is sampled from the uniform distribution [0.1, 0.3]. The number of perturbations M for constructing the simulated sub-optimal representations is set to 3, unless otherwise specified. For the second task, the coordinate estimator s, which calculates the coefficients for the embedding vectors, is a simple linear layer with a softmax activation function. The default number of embeddings T is set to 10, unless otherwise specified. The batch size is set to 128, and AdamW is used as the optimizer with a learning rate of 1e-4 and a weight decay of 1e-5. A patience of 10 is used for early stopping. We implement all deep tabular backbone models' hyperparameters following their original papers. All experiments were conducted on Ubuntu 20.04.4 LTS with an Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz, a single NVIDIA A40 48GB GPU, and 512GB of RAM. The framework is implemented with Python 3.11.5 and PyTorch 2.0.2.
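With these hyperparameters, the construction of the simulated sub-optimal inputs can be sketched as follows (our illustration; ϵ is drawn column-wise from the empirical distribution of the training data, η_k ~ U[0.1, 0.3], and M = 3, as stated above):

```python
import numpy as np

def simulate_perturbations(x_k, X_train, M=3, eta_range=(0.1, 0.3), seed=0):
    """Generate M distinct masked perturbations of one observation x_k.

    Follows the rule in the text: each entry of the mask m_km is 1 with
    probability eta_k (drawn once per sample from a uniform range), the
    perturbed view is m * x_k + (1 - m) * eps, and eps is drawn
    column-wise from the empirical distribution of X_train.
    """
    rng = np.random.default_rng(seed)
    n, d = X_train.shape
    eta_k = rng.uniform(*eta_range)                 # per-sample mask probability
    views = []
    for _ in range(M):
        m = (rng.random(d) < eta_k).astype(float)   # mask vector m_km
        eps = X_train[rng.integers(0, n, size=d), np.arange(d)]  # empirical noise
        views.append(m * x_k + (1.0 - m) * eps)
    return np.stack(views)                          # shape (M, d)
```

Note that this sketch does not explicitly enforce the constraint m_ki ≠ m_kj between the M masks; with continuous η_k and independent draws, identical masks are merely unlikely.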
To reduce the effect of randomness, the reported performance is averaged over 10 independent runs.

B. Main Results

Overall Performance. Table II shows the performance comparison of 13 different backbones and their TRC-enhanced versions over 9 real-world tabular datasets. From the table, we observe that TRC achieves consistent improvements over the baseline deep models in all settings, including both regression and classification scenarios. On average, TRC achieves a relative performance improvement of approximately 5.1% across all scenarios. Specifically, for regression, TRC realizes a relative performance improvement of about 6.5% across all datasets, while for classification, TRC achieves a relative performance improvement of approximately 2.4% across all datasets. Even in cases where TRC does not outperform the baseline, its performance remains comparable. Additionally, TRC is able to enhance performance even over the best-performing baseline model for each dataset. Moreover, we conduct the Wilcoxon signed-rank test (with α = 0.05) [49] to measure the significance of the improvement. The results show that in all settings, the improvement of TRC over the deep tabular models is statistically significant at the 95% confidence level (p-value = 6.07e−9). This demonstrates the superior adaptability and generalization ability of TRC across different models and tasks. We further include experiments to verify the scalability of TRC on large-scale tabular datasets, as illustrated in Fig. 4. The results show that TRC can still enhance the performance of backbone models on large-scale datasets.

Ablation Study and Comparison with Deeper Backbone Models. We further conduct an ablation study to demonstrate the effectiveness of the key components of TRC.
Specifically, we denote the task of tabular representation re-estimation as "TR", and for the task of tabular space mapping, we denote the process of space compression as "SC" and the strategy for diversifying the embedding vectors as "DE". Given that the strategy for diversifying the embedding vectors must be applied after the process of space compression, we conduct a comparative analysis between the backbone integrated with TRC and its five variants. The results presented in Table III show that the joint effect of all of these components is crucial to achieving consistently good performance, underscoring the effectiveness of TRC. More results are detailed in Appendix B. Besides, TRC is appended to the output of the backbone G_f(·; θ_f) while keeping its parameters fixed. We compare the performance of the backbone coupled with TRC against the backbone with deeper layers (3 additional layers). As shown in Table IV, increasing the number of layers in the backbone model does not necessarily lead to improved performance; in some cases, it even results in performance

TABLE V: Performance comparison between various fine-tuning paradigms and TRC. "i" denotes in-learning and "p" denotes pre-learning for baseline models.
                             CO ↓    DI ↓      QS ↓   AU ↑
MLP (i)                      3.961   564.373   0.875  0.870
 +Linear Head                3.952   563.945   0.870  0.870
 +Linear Head & Backbone     3.937   563.304   0.880  0.862
 +MLP Head                   3.950   562.076   0.862  0.870
 +MLP Head & Backbone        3.934   562.427   0.857  0.870
 +LoRA                       3.941   563.314   0.875  0.870
 + TRC                       3.899   558.330   0.825  0.891
DCN2 (i)                     4.016   591.958   1.027  0.841
 +Linear Head                4.173   589.030   0.958  0.843
 +Linear Head & Backbone     4.009   592.881   1.029  0.845
 +MLP Head                   4.077   586.749   0.906  0.847
 +MLP Head & Backbone        4.088   591.743   0.910  0.842
 +LoRA                       4.001   591.903   1.021  0.849
 + TRC                       3.858   574.453   0.879  0.862
FT-Transformer (i)           3.709   551.190   0.823  0.848
 +Linear Head                3.695   552.416   0.828  0.828
 +Linear Head & Backbone     3.710   554.259   0.816  0.870
 +MLP Head                   3.698   549.648   0.816  0.832
 +MLP Head & Backbone        3.706   556.086   0.802  0.870
 +LoRA                       3.683   550.241   0.823  0.862
 + TRC                       3.648   543.653   0.780  0.884
SCARF (p)                    3.856   579.610   0.863  0.862
 +Linear Head                3.870   578.389   0.842  0.867
 +Linear Head & Backbone     3.871   579.757   0.856  0.870
 +MLP Head                   3.864   578.565   0.850  0.862
 +MLP Head & Backbone        3.872   604.720   0.851  0.862
 +LoRA                       3.874   589.828   0.863  0.870
 + TRC                       3.847   577.803   0.808  0.884

degradation. Furthermore, the comparison between the deeper backbone and the backbone augmented with TRC highlights that the well-designed tasks can significantly enhance model representations, which is not simply achieved by increasing model depth. More results are provided in Appendix C.

Comparison with Fine-tuning Paradigm. We also compare TRC with several fine-tuning methods.
Specifically, given any trained backbone model, we compare TRC against five different fine-tuning strategies in Table V: (i) Linear Head: we attach a new linear head to the trained backbone and train only this new head; (ii) Linear Head & Backbone: we use the same architecture as strategy (i), but train both the new head and the whole backbone; (iii) MLP Head: we replace the linear head of strategy (i) with a two-layer MLP as the new head, again training only this new head; (iv) MLP Head & Backbone: we use the same architecture as strategy (iii), but train both the new head and the whole backbone; (v) LoRA (Low-Rank Adaptation): we use the parameter-efficient fine-tuning strategy LoRA, which injects trainable low-rank matrices alongside the frozen backbone parameters to adapt models with minimal additional parameters. In practice, we inject low-rank matrices into all linear layers of the backbone and freeze the original weights. For all strategies, we set the learning rate to 1e-4, which matches the learning rate used for TRC. The results show that none of these fine-tuning strategies, including LoRA, achieves performance comparable to TRC. The possible reason is that only training the new head or fine-tuning the backbone with the conventional supervised

Fig. 5: Comparison of the L_1 norm of the shift information estimated by TRC on data samples with different gradients: (a) MLP on the CO dataset; (b) ResNet on the CO dataset; (c) MLP on the CA dataset; (d) ResNet on the CA dataset. For each data sample, the gradient is computed by ∥∇_{θ_f} L(x, y; θ)∥_1, where θ denotes the parameters of the trained deep tabular model.
We partition the L_1 norm of the gradients into 10 equally spaced intervals from low to high and count the number of samples in each interval.

training loss (already used by the trained backbone) cannot provide additional useful information for the representation. Different from fine-tuning methods, TRC is a representation correction method tailor-designed to solve the inherent representation issues of deep tabular backbones in a post-hoc and model-agnostic way, without altering the backbone parameters. Specifically, we build a shift estimator to calculate the inherent shift of tabular representations and design a space mapping method to alleviate the influence of redundant information. To this end, we further design two unsupervised losses (see Eq. 5 and Eq. 11).

C. Further Analysis

TRC enables the learning of the inherent shift by representation re-estimation. Fig. 5 illustrates the estimated shift information within the representations of trained deep tabular backbones. For the representations of the backbone G_f(·; θ_f), those with larger gradients are associated with greater shift information learned by TRC. Through the task of tabular representation re-estimation, the shift information contained in sub-optimal representations can be gradually eliminated.

TRC reduces the redundant information by space mapping. Subfigure (a) of Fig. 6 shows the representation SVE of the deep tabular backbone G_f(·; θ_f) w/ and w/o TRC, and subfigure (b) shows the corresponding performance; the same applies to subfigures (c) and (d). We find that with TRC, the SVE value of the representations decreases and the performance improves. This demonstrates that TRC reduces the redundant
information via tabular space mapping to enhance the model performance.

Fig. 6: SVE and performance changes of backbone models w/ and w/o TRC on different datasets. Subfigures (a) and (c) show the Singular Value Entropy (SVE) (Eq. 3) of the representation space w/ and w/o TRC: (a) SVE on the CO dataset; (c) SVE on the GE dataset. Subfigures (b) and (d) show the corresponding performance: (b) RMSE (↓) on the CO dataset; (d) Accuracy (↑) on the GE dataset.

TABLE VI: The influence of using a small portion (1%) of the validation set for training. "Train" indicates that only training data are used to train the model; "Val" indicates that an additional small portion of validation data is used for training.

                     CO ↓             DI ↓                 AU ↑
                 Train    Val     Train      Val       Train   Val
MLP              3.961    3.968   564.373    565.616   0.870   0.870
 + TRC           3.902    3.899   559.020    558.330   0.891   0.891
FT-Transformer   3.709    3.726   551.190    550.228   0.848   0.848
 + TRC           3.647    3.648   544.570    543.653   0.877   0.884

The performance improvement comes from the effectiveness of TRC, rather than from additional training data. Only a small portion (1%) of the validation data is used to train our shift estimator, and these additional data are not used in the remaining learning or test process. We provide additional experimental results in Table VI to show whether the improvements come from additional training data or from the proposed modifications. Here, we consider two settings. One setting uses a portion of the training data, rather than the additional validation data, to train the shift estimator, making the comparison with existing baselines fair.
Another setting additionally adds a small portion (1%) of the validation data into the training set when training the baseline models. As listed in Table VI, ours still outperforms the corresponding baseline even without using additional validation data. In addition, the baseline shows similar performance whether or not the additional data are used, and TRC likewise demonstrates consistent performance regardless of the use of additional data. Therefore, we conclude that the performance improvement comes from the effectiveness of TRC, rather than from additional training data.

The performance of TRC under missing values and reduced training data. We systematically increase the proportion of missing values in features and reduce the proportion of training samples separately. The results in Fig. 7 indicate that as the dataset's missing values escalate or the training sample size diminishes, model training becomes more difficult, which is reflected in a decline in model performance. Nevertheless, by leveraging TRC, we are able to consistently enhance the capabilities of the deep tabular model, thereby alleviating the negative impact of data incompleteness and scarcity. More results can be found in Appendix D.

Fig. 7: The results of TRC training with missing values and fewer samples. The left two subfigures indicate the scenarios with missing values, while the right two indicate the scenarios where the number of training samples is reduced: (a) RMSE on the DI dataset ↓; (b) Accuracy on the AU dataset ↑; (c) RMSE on the DI dataset ↓; (d) Accuracy on the AU dataset ↑.

Visualization.
The TSNE visualization of learned representations in Fig. 8 demonstrates that TRC can well calibrate representations. Even with Tabular Representation Re-estimation alone, the representations are more organized than those of the baseline and fine-tuning approaches, as shown in subfigure (c). We also visualize the learned embedding vectors in subfigures (d) and (h). The embedding vectors are learnable, and we encourage them to be orthogonal through the orthogonality loss (Eq. 11), as orthogonality is known to be helpful in improving discriminative ability [50], [51]. Our findings reveal that these embedding vectors are extensively distributed across the latent space, contributing to the effective modeling of representations.

Fig. 8: The TSNE visualization of learned representations of the deep tabular model: (a)/(e) baseline, (b)/(f) w/ FT (fine-tuning), (c)/(g) w/ Task 1 (Tabular Representation Re-estimation), and (d)/(h) w/ Task 1, 2 (Tabular Representation Re-estimation and Tabular Space Mapping, i.e., TRC). Here, the backbone model is ResNet. The first row represents the CA dataset (regression), while the second row represents the GE dataset (classification). Different colors indicate different labels. Marker "x" indicates embedding vectors.

Fig. 9: The comparison of parameter counts between TRC and different deep tabular models on (a) the DI dataset and (b) the AD dataset. The parameters reported for TRC do not include those of the backbone models.
In the regression tasks, which involve continuous labels and require fine-grained information for discriminative representations, applying TRC leads to a marked improvement in the ability to differentiate between various continuous labels. Consequently, the arrangement of these representations becomes more structured and organized. In the classification tasks, after applying TRC, representations are separated into several non-overlapping regions, where representations with similar labels are clustered together, demonstrating the method's efficacy in enhancing the separability of the data points.

Parameters analysis and computational efficiency. The proposed TRC provides a parameter-efficient, cost-effective technique to enhance the representations of deep tabular backbones without altering any of their parameters. As illustrated in Fig. 9, the parameter count of TRC is significantly lower than that of the deep tabular models, particularly the FT-Transformer, which inherently has a large number of parameters. More results are provided in Appendix E. In addition, since TRC does not need to retrain the deep tabular backbone, the training time of TRC is noticeably lower than that of the deep tabular model in most cases. The inference cost of TRC is also much smaller than that of the deep tabular backbones, so it does not introduce high inference overhead to the existing backbones. Details of training and inference time costs are provided in Table XI and Table XII in Appendix F.

The Threshold of Selected Approximated Optimal Representations τ. To verify the robustness of TRC to violations of the proposed assumption and to the choice of the threshold τ, we compare TRC against two variants: (i) TRC (Random), which randomly selects samples, and (ii) TRC (Highest), which selects the samples with the highest gradient norms. We provide these comparisons across a wide range of thresholds τ in Fig. 10. Here, the backbone model is DCN2.
Additional results are provided in Fig. 21 in Appendix I. The results show that TRC benefits from selecting samples with the lowest gradient norms, as hypothesized. When the assumption is violated, e.g., by selecting random samples or those with the highest gradient norms, TRC still improves the performance of the backbone model, albeit to a lesser extent.

Fig. 10: Comparison of different approximated-optimal-representation selection strategies (DCN2, DCN2+TRC (Lowest, ours), DCN2+TRC (Random), DCN2+TRC (Highest)) across a wide range of the threshold τ on the CO dataset (RMSE, ↓).

Fig. 11: The performance of TRC (DCN2 vs. DCN2+TRC) with a varying number of embedding vectors T on the CO dataset (RMSE, ↓).

When the threshold τ is relatively low (e.g., less than 0.3), the performance of TRC is stable. As τ increases further, samples with high gradient norms are also selected and the corresponding performance degrades slightly, but TRC can still enhance the performance of the backbone model. Overall, TRC is robust to the proposed assumption and to the choice of the threshold τ. We suggest selecting the samples with the lowest gradient norms using a conservative threshold (less than 0.3), which balances performance gain and computational cost.

Additional Experiments. A representative case of the sensitivity analysis on the number of embedding vectors is provided in Fig. 11. Additional results are presented in Fig. 22 of Appendix I, and a more in-depth discussion is provided in Appendix J. Additional sensitivity analyses w.r.t.
the weight of the loss function L_orth and the perturbing times for constructing simulated sub-optimal representations in Inherent Shift Learning are provided in Appendix I. The comparison between TRC and tree-based methods is detailed in Table VII. We also provide visualizations of the learned coefficients r (Eq. 10) for the embedding vectors in Appendix G. A discussion of how well the shift computed by the shift estimator models the real inherent shift is provided in Appendix K, and a discussion of the limitations of Assumption 1 is presented in Appendix L.

Table VII: Comparison with tree-based methods.

                    DI ↓     QS ↓   PO ↓   AU ↑
FT-Transformer+TRC  543.653  0.780  2.728  0.884
CatBoost [9]        546.827  0.817  4.641  0.855
XGBoost [8]         559.680  0.820  4.936  0.862

VI. CONCLUSION

In this paper, we introduce a novel deep Tabular Representation Corrector, TRC, to enhance trained tabular models through two specific tasks without altering their parameters, in a model-agnostic way. Specifically, TRC solves two inherent representation issues, i.e., representation shift and representation redundancy. We propose two tasks: (i) Tabular Representation Re-estimation, which trains a shift estimator to calculate the inherent shift of tabular representations and subsequently mitigate it, thereby re-estimating the representations; and (ii) Tabular Space Mapping, which transforms the re-estimated representations into a light-embedding vector space while preserving crucial predictive information to minimize redundancy. The empirical results on various real-world tasks demonstrate the effectiveness of TRC for tabular deep learning. Our work can shed some light on developing better algorithms for similar tasks.

REFERENCES

[1] M. Hernandez, G. Epelde, A. Alberdi, R. Cilla, and D. Rankin, "Synthetic data generation for tabular health records: A systematic review," Neurocomputing, vol. 493, pp. 28–45, 2022.
[2] S. A. Assefa, D.
Dervovic, M. Mahfouz, R. E. Tillman, P. Reddy, and M. Veloso, "Generating synthetic data in finance: opportunities, challenges and pitfalls," Proceedings of the First ACM International Conference on AI in Finance, pp. 1–8, 2020.
[3] H. Ye, Z. Liu, X. Shen, W. Cao, S. Zheng, X. Gui, H. Zhang, Y. Chang, and J. Bian, "UADB: Unsupervised anomaly detection booster," in 2023 IEEE 39th International Conference on Data Engineering (ICDE). IEEE, 2023, pp. 2593–2606.
[4] C. J. Urban and K. M. Gates, "Deep learning: A primer for psychologists," Psychological Methods, vol. 26, no. 6, p. 743, 2021.
[5] V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, and G. Kasneci, "Deep neural networks and tabular data: A survey," IEEE Transactions on Neural Networks and Learning Systems, 2022.
[6] X. Su, X. Yan, and C.-L. Tsai, "Linear regression," Wiley Interdisciplinary Reviews: Computational Statistics, vol. 4, no. 3, pp. 275–294, 2012.
[7] R. E. Wright, Logistic regression. American Psychological Association, 1995.
[8] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.
[9] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, "CatBoost: unbiased boosting with categorical features," Advances in Neural Information Processing Systems, vol. 31, 2018.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[11] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[13] Y. Gorishniy, I. Rubachev, V.
Khrulkov, and A. Babenko, "Revisiting deep learning models for tabular data," Advances in Neural Information Processing Systems, vol. 34, pp. 18932–18943, 2021.
[14] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, and J. Tang, "AutoInt: Automatic feature interaction learning via self-attentive neural networks," Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161–1170, 2019.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[16] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, "Self-normalizing neural networks," Advances in Neural Information Processing Systems, vol. 30, 2017.
[17] A. Kadra, M. Lindauer, F. Hutter, and J. Grabocka, "Well-tuned simple nets excel on tabular datasets," Advances in Neural Information Processing Systems, vol. 34, pp. 23928–23941, 2021.
[18] H. Ye, W. Fan, X. Song, S. Zheng, H. Zhao, D. Guo, and Y. Chang, "PTaRL: Prototype-based tabular representation learning via space calibration," in The Twelfth International Conference on Learning Representations, 2024.
[19] S. Ö. Arik and T. Pfister, "TabNet: Attentive interpretable tabular learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 8, 2021, pp. 6679–6687.
[20] J. Yoon, Y. Zhang, J. Jordon, and M. van der Schaar, "VIME: Extending the success of self- and semi-supervised learning to tabular domain," Advances in Neural Information Processing Systems, vol. 33, pp. 11033–11043, 2020.
[21] D. Bahri, H. Jiang, Y. Tay, and D. Metzler, "SCARF: Self-supervised contrastive learning using random feature corruption," in International Conference on Learning Representations, 2022.
[22] T. Chen, S. Kornblith, M.
Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in International Conference on Machine Learning. PMLR, 2020, pp. 1597–1607.
[23] T. Ucar, E. Hajiramezanali, and L. Edwards, "SubTab: Subsetting features of tabular data for self-supervised representation learning," Advances in Neural Information Processing Systems, vol. 34, pp. 18853–18865, 2021.
[24] R. Wang, R. Shivanna, D. Cheng, S. Jain, D. Lin, L. Hong, and E. Chi, "DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems," Proceedings of the Web Conference 2021, pp. 1785–1797, 2021.
[25] A. Jeffares, T. Liu, J. Crabbé, F. Imrie, and M. van der Schaar, "TANGOS: Regularizing tabular neural networks through gradient orthogonalization and specialization," in The Eleventh International Conference on Learning Representations, 2022.
[26] G. Somepalli, A. Schwarzschild, M. Goldblum, C. B. Bruss, and T. Goldstein, "SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training," in NeurIPS 2022 First Table Representation Workshop, 2022. [Online]. Available: https://openreview.net/forum?id=FiyUTAy4sB8
[27] I. Rubachev, A. Alekberov, Y. Gorishniy, and A. Babenko, "Revisiting pretraining objectives for tabular deep learning," arXiv preprint arXiv:2207.03208, 2022.
[28] Z. Wang and J. Sun, "TransTab: Learning transferable tabular transformers across tables," Advances in Neural Information Processing Systems, vol. 35, pp. 2902–2915, 2022.
[29] B. Zhu, X. Shi, N. Erickson, M. Li, G. Karypis, and M. Shoaran, "XTab: Cross-table pretraining for tabular transformers," arXiv preprint arXiv:2305.06090, 2023.
[30] C.-H. Chang, J. Yoon, S. Ö. Arik, M. Udell, and T. Pfister, "Data-efficient and interpretable tabular anomaly detection," in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 190–201.
[31] S. Oymak, Z. Fabian, M. Li, and M. Soltanolkotabi, "Generalization guarantees for neural networks via harnessing the low-rank structure of the Jacobian," arXiv preprint, 2019.
[32] X. Chen, S. Wang, M. Long, and J. Wang, "Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation," in International Conference on Machine Learning. PMLR, 2019, pp. 1081–1090.
[33] Y. Xue, K. Whitecross, and B. Mirzasoleiman, "Investigating why contrastive learning benefits robustness against label noise," in International Conference on Machine Learning. PMLR, 2022, pp. 24851–24871.
[34] H. Chen, J. Wang, A. Shah, R. Tao, H. Wei, X. Xie, M. Sugiyama, and B. Raj, "Understanding and mitigating the label noise in pre-training on downstream tasks," in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=TjhUtloBZU
[35] J. Nam, J. Tack, K. Lee, H. Lee, and J. Shin, "STUNT: Few-shot tabular learning with self-generated tasks from unlabeled tables," in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=xlsjehDvlY
[36] D. D. V. M. Chary, "Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods," SSRN Electronic Journal, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:240033947
[37] K.-Y. Chen, P.-H. Chiang, H.-R. Chou, T.-W. Chen, and T.-H. Chang, "Trompt: Towards a better deep neural network for tabular data," in International Conference on Machine Learning. PMLR, 2023, pp. 4392–4434.
[38] M. Cassotti, D. Ballabio, R. Todeschini, and V. Consonni, "A similarity-based QSAR model for predicting acute toxicity towards the fathead minnow (Pimephales promelas)," SAR and QSAR in Environmental Research, vol. 26, no. 6, p. 521, 2015. [Online].
Available: https://api.semanticscholar.org/CorpusID:11120429
[39] R. K. Pace and R. Barry, "Sparse spatial autoregressions," Statistics & Probability Letters, vol. 33, no. 3, pp. 291–297, 1997.
[40] L. Grinsztajn, E. Oyallon, and G. Varoquaux, "Why do tree-based models still outperform deep learning on typical tabular data?" in Neural Information Processing Systems, 2022. [Online]. Available: https://api.semanticscholar.org/CorpusID:252697222
[41] K. Hamidieh, "A data-driven statistical model for predicting the critical temperature of a superconductor," Computational Materials Science, 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:55069173
[42] R. Kohavi et al., "Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid," KDD, vol. 96, pp. 202–207, 1996.
[43] B. Schäfl, L. Gruber, A. Bitto-Nemling, and S. Hochreiter, "Hopular: Modern Hopfield networks for tabular data," in NeurIPS 2023 Second Table Representation Learning Workshop, 2023.
[44] R. C. Madeo, C. A. Lima, and S. M. Peres, "Gesture unit segmentation using support vector machines: segmenting gestures from rest positions," in Proceedings of the 28th Annual ACM Symposium on Applied Computing, 2013, pp. 46–52.
[45] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere, "The million song dataset," in Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.
[46] J. A. Blackard and D. J. Dean, "Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables," Computers and Electronics in Agriculture, vol. 24, no. 3, pp. 131–151, 1999.
[47] H. Taud and J. Mas, "Multilayer perceptron (MLP)," Geomatic Approaches for Modeling Land Change Scenarios, pp. 451–455, 2018.
[48] G. Peyré, M. Cuturi et al.,
"Computational optimal transport," Center for Research in Economics and Statistics Working Papers, pp. 2017–86, 2017.
[49] R. F. Woolson, "Wilcoxon signed-rank test," Wiley Encyclopedia of Clinical Trials, pp. 1–3, 2007.
[50] S.-A. Liu, Y. Zhang, Z. Qiu, H. Xie, Y. Zhang, and T. Yao, "Learning orthogonal prototypes for generalized few-shot semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11319–11328.
[51] Y. Jiang, B. Aragam, and V. Veitch, "Uncovering meanings of embeddings via partial orthogonality," Advances in Neural Information Processing Systems, vol. 36, 2024.
[52] O. Roy and M. Vetterli, "The effective rank: A measure of effective dimensionality," in 2007 15th European Signal Processing Conference. IEEE, 2007, pp. 606–610.
[53] V. Papyan, X. Han, and D. L. Donoho, "Prevalence of neural collapse during the terminal phase of deep learning training," Proceedings of the National Academy of Sciences, vol. 117, no. 40, pp. 24652–24663, 2020.
[54] Y. Yang, S. Chen, X. Li, L. Xie, Z. Lin, and D. Tao, "Inducing neural collapse in imbalanced learning: Do we really need a learnable classifier at the end of deep neural network?" Advances in Neural Information Processing Systems, vol. 35, pp. 37991–38002, 2022.

APPENDIX

A. Nomenclature

For clarity, we summarize the key notations used throughout the paper in Table VIII, including the symbols, their definitions, and dimensionalities where applicable.

Table VIII: Notation.

Symbol — Description
x, y — Observed sample and the corresponding label.
N — Number of observed samples in a dataset.
D_train, D_val, D_test — Training set, validation set, test set.
F(·; θ) — Deep tabular model.
G_f(·; θ_f) — Existing deep tabular backbone.
G_h(·; θ_h) — Prediction head.
z — Representation extracted by G_f(·; θ_f).
D — Dimensionality of z.
L — Loss function.
Z = {z_i}_{i=1}^N ∈ R^{N×D} — Representations of N samples.
SVE — Singular Value Entropy.
Z_o — Approximated optimal representations.
ϕ — Shift estimator.
Z̃_o — Simulated sub-optimal representations derived from Z_o.
Δ — The shift corresponding to Z̃_o.
Φ(·; θ_ϕ) — Re-estimation function.
B = {β_t}_{t=1}^T ∈ R^{T×D} — Embedding vectors.
T — Number of embedding vectors.
β_t — t-th embedding vector.
LE-Space — Light Embedding Space, consisting of the embedding vectors.
s(·; θ_s) — Coordinate estimator.
r — Coordinate of a representation in the LE-Space.
A ∈ [0, 1]^{T×T} — Cosine similarity between embedding vectors.

B. Ablation Study

We further conduct an ablation study to demonstrate the effectiveness of the key components of TRC. Specifically, we denote the task of tabular representation re-estimation as "TR"; for the task of tabular space mapping, we denote the process of space compression as "SC" and the strategy for diversifying the embedding vectors as "DE". Given that the strategy for diversifying the embedding vectors must be applied after the process of space compression, we conduct a comparative analysis between the backbone integrated with TRC and its five variants in Table IX.

C. Comparison with Deeper Backbone

Besides, TRC is appended to the output of the backbone G_f(·; θ_f) while keeping its parameters fixed. We compare the performance of the backbone coupled with TRC against the backbone with deeper layers (three additional layers) in Table X. Increasing the number of layers in the backbone model does not necessarily lead to improved performance; in some cases, it even results in performance degradation. Furthermore, the comparison between the deeper backbone and the backbone augmented with TRC highlights that the well-designed tasks can significantly enhance model representations, which is not simply achieved by increasing model depth.

D.
Training with Missing Values and Fewer Samples

TRC continues to enhance the performance of deep tabular models even under challenging conditions where training samples contain missing values and the overall number of training samples is reduced. We systematically increase the proportion of missing values in the features and reduce the proportion of training samples separately.

Table IX: Analysis of the effects of different components of TRC. The best results are highlighted in bold, and the second-best results are underscored. This table serves as an extension of Table III in the manuscript.

TR SC DE   CA↓    CO↓     DI↓       QS↓    SU↓     PO↓     AD↑    AU↑    GE↑
MLP        0.505  3.961   564.373   0.875  10.379  10.892  0.856  0.87   0.578
✓          0.502  3.916   563.649   0.868  10.328  10.605  0.858  0.87   0.571
✓          0.505  3.92    564.141   0.845  10.349  10.631  0.858  0.862  0.568
✓ ✓        0.502  3.908   558.52    0.852  10.341  10.602  0.858  0.87   0.575
✓ ✓        0.505  3.922   565.129   0.841  10.363  10.622  0.858  0.877  0.577
✓ ✓ ✓      0.502  3.899   558.33    0.825  10.326  10.593  0.858  0.891  0.58
DCN2       0.495  4.016   591.958   1.027  10.674  8.805   0.856  0.841  0.564
✓          0.493  3.912   579.461   0.941  10.521  8.795   0.858  0.848  0.575
✓          0.498  4.016   580.454   0.86   10.634  8.757   0.858  0.848  0.565
✓ ✓        0.495  3.844   574.712   0.884  10.727  8.782   0.858  0.841  0.563
✓ ✓        0.496  4.023   580.247   0.853  10.634  8.764   0.858  0.855  0.567
✓ ✓ ✓      0.493  3.858   574.453   0.879  10.584  8.79    0.857  0.862  0.584
SNN        0.896  11.789  1530.293  1.018  25.498  18.517  0.847  0.848  0.55
✓          0.88   11.369  1226.754  0.921  25.265  15.86   0.847  0.855  0.572
✓          0.748  5.701   783.015   0.889  17.738  12.001  0.847  0.833  0.543
✓ ✓        0.743  6.889   696.473   0.886  15.512  11.422  0.847  0.862  0.561
✓ ✓        0.742  7.306   760.66    0.891  17.711  12.002  0.847  0.855  0.546
✓ ✓ ✓      0.699  6.88    693.369   0.87   15.537  11.202  0.847  0.877  0.573
ResNet     0.517  3.982   606.282   0.872  11.163  10.812  0.847  0.87   0.587
✓          0.512  3.925   597.975   0.866  10.94   10.405  0.849  0.862  0.58
✓          0.513  3.96    612.993   0.83   10.944  10.134  0.849  0.87   0.576
✓ ✓        0.512  3.922   606.482   0.85   10.755  10.15   0.85   0.862  0.578
✓ ✓        0.513  3.96    617.098   0.841  10.951  10.137  0.849  0.877  0.586
✓ ✓ ✓      0.508  3.919   615.736   0.845  10.72   10.205  0.849  0.877  0.597
AutoInt    0.487  4.02    562.169   0.86   11.193  6.45    0.854  0.848  0.598
✓          0.481  4.018   562.677   0.849  11.089  6.19    0.858  0.87   0.605
✓          0.48   4.001   559.535   0.835  10.86   5.694   0.857  0.855  0.591
✓ ✓        0.477  3.971   555.787   0.828  10.972  5.88    0.856  0.891  0.59
✓ ✓        0.48   4.003   559.341   0.842  10.856  5.685   0.856  0.87   0.595
✓ ✓ ✓      0.477  3.96    557.423   0.829  10.826  5.885   0.858  0.891  0.606
FT-Transformer 0.469 3.709 551.19   0.823  10.41   2.919   0.858  0.848  0.611
✓          0.463  3.685   549.075   0.819  10.295  2.71    0.861  0.862  0.61
✓          0.464  3.668   549.684   0.812  10.338  2.759   0.861  0.855  0.62
✓ ✓        0.462  3.659   543.504   0.806  10.272  2.76    0.862  0.862  0.622
✓ ✓        0.464  3.668   549.923   0.797  10.342  2.763   0.861  0.855  0.618
✓ ✓ ✓      0.462  3.648   543.653   0.78   10.223  2.728   0.862  0.884  0.624
SCARF      3.856  579.61  0.863     0.52   7.292   10.24   0.858  0.87   0.593
✓          3.839  585.659 0.864     0.52   7.584   10.296  0.858  0.884  0.584
✓          3.841  588.05  0.829     0.521  7.485   10.235  0.858  0.877  0.586
✓ ✓        3.827  571.684 0.837     0.52   7.212   10.292  0.858  0.884  0.587
✓ ✓        3.841  572.493 0.835     0.523  7.325   10.201  0.859  0.87   0.584
✓ ✓ ✓      3.847  577.803 0.808     0.52   7.531   10.355  0.859  0.884  0.589
SAINT      0.508  4.022   597.207   0.827  13.095  4.415   0.857  0.87   0.549
✓          0.491  3.911   559.944   0.822  12.868  4.403   0.857  0.862  0.579
✓          0.495  3.869   565.402   0.81   12.813  4.358   0.857  0.87   0.581
✓ ✓        0.492  3.911   556.744   0.823  12.85   4.329   0.856  0.884  0.586
✓ ✓        0.495  3.869   565.393   0.821  12.81   4.365   0.857  0.87   0.584
✓ ✓ ✓      0.491  3.903   557.951   0.818  12.443  4.335   0.858  0.884  0.595
VIME       0.679  5.218   945.238   1.018  15.645  10.914  0.768  0.812  0.473
✓          0.643  4.378   707.749   0.952  15.486  7.365   0.847  0.87   0.473
✓          0.636  4.174   629.678   0.96   15.537  6.861   0.846  0.87   0.472
✓ ✓        0.645  4.342   638.677   0.957  15.073  7.443   0.847  0.877  0.472
✓ ✓        0.648  4.378   644.659   0.949  15.544  7.799   0.846  0.877  0.476
✓ ✓ ✓      0.645  4.371   612.454   0.95   15.028  7.489   0.846  0.884  0.479

Table X: Performance of the backbone w/ TRC and the deeper backbone. The depths of the backbone layers are increased only for the "deeper backbone". This table serves as an extension of Table IV in the manuscript.

                      CO↓     DI↓       QS↓    CA↓    PO↓     SU↓     AD↑    AU↑    GE↑
MLP                   3.961   564.373   0.875  0.505  10.892  10.379  0.856  0.870  0.578
deeper MLP            3.922   562.888   0.851  0.510  11.583  10.271  0.856  0.862  0.572
MLP+TRC               3.899   558.330   0.825  0.502  10.593  10.326  0.858  0.891  0.580
DCN2                  4.016   591.958   1.027  0.495  8.805   10.674  0.856  0.841  0.564
deeper DCN2           3.872   575.765   0.904  0.486  9.165   10.675  0.858  0.862  0.575
DCN2+TRC              3.858   574.453   0.879  0.493  8.790   10.584  0.857  0.862  0.584
SNN                   11.789  1530.293  1.018  0.896  18.517  25.498  0.847  0.848  0.550
deeper SNN            10.614  1042.509  1.121  0.851  14.460  15.039  0.847  0.833  0.553
SNN+TRC               6.880   693.369   0.870  0.699  11.202  15.537  0.847  0.877  0.573
ResNet                3.982   606.282   0.872  0.517  10.812  11.163  0.847  0.870  0.587
deeper ResNet         3.943   657.823   0.847  0.512  10.628  11.008  0.850  0.862  0.600
ResNet+TRC            3.919   615.736   0.845  0.508  10.205  10.720  0.849  0.877  0.597
AutoInt               4.020   562.169   0.860  0.487  6.450   11.193  0.854  0.848  0.598
deeper AutoInt        3.899   562.383   0.815  0.488  5.201   11.265  0.857  0.862  0.577
AutoInt+TRC           3.960   557.423   0.829  0.477  5.885   10.826  0.858  0.891  0.606
FT-Transformer        3.709   551.190   0.823  0.469  2.919   10.410  0.858  0.848  0.611
deeper FT-Transformer 3.604   559.363   0.786  0.464  2.850   10.142  0.860  0.870  0.623
FT-Transformer+TRC    3.648   543.653   0.780  0.462  2.728   10.223  0.862  0.884  0.624
SAINT                 4.022   597.207   0.827  0.508  4.415   13.095  0.857  0.870  0.549
deeper SAINT          3.912   585.623   0.806  0.501  41.349  34.681  0.860  0.877  0.572
SAINT+TRC             3.903   557.951   0.818  0.491  4.335   12.443  0.858  0.884  0.595
VIME                  5.218   945.238   1.018  0.679  10.914  15.645  0.768  0.812  0.473
deeper VIME           4.348   690.491   0.969  0.651  5.984   15.332  0.843  0.877  0.436
VIME+TRC              4.371   612.454   0.950  0.645  7.489   15.028  0.846  0.884  0.479
The results in Fig. 12 and Fig. 13 indicate that as the proportion of missing values escalates or the training sample size diminishes, the difficulty of model training intensifies, which is reflected in a decline in model performance. Nevertheless, by leveraging TRC, we are able to consistently enhance the capabilities of the deep tabular model, thereby alleviating the negative impact of data incompleteness and scarcity.

Fig. 12: The results of TRC training with missing values, for DCN2, FT-Transformer, MLP, and ResNet on the DI and SU datasets (RMSE vs. ratio of missing data). This figure serves as an extension of Fig. 7 in the manuscript.

E. Parameter Analysis

The proposed TRC provides a parameter-efficient, cost-effective technique to enhance the representations of deep tabular backbones without altering any of their parameters. As shown in Fig. 14, the parameter count of TRC is significantly lower than that of the deep tabular models, particularly for the FT-Transformer model, which inherently has a large number of parameters.
Fig. 13: The results of TRC training with fewer samples, for DCN2, FT-Transformer, MLP, and ResNet on the DI and SU datasets (RMSE vs. ratio of sampled dataset). This figure serves as an extension of Fig. 7 in the manuscript.

F. Computational Efficiency

We provide training and inference time costs in Table XI and Table XII. All experiments were conducted on the Ubuntu 20.04.4 LTS operating system with an Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz, a single NVIDIA A40 48GB GPU, and 512GB of RAM. Since TRC does not need to retrain the deep tabular backbone, its training time is noticeably lower than that of the deep tabular model in most cases. In addition, the inference cost of TRC is also much smaller than that of the deep tabular backbones, so it will not introduce high inference overhead to the existing backbones. The relatively higher inference times for both the backbones and TRC observed on the YE and COV datasets are primarily due to their large sizes, as shown in Table I of the manuscript.
Fig. 14: The comparison of parameter counts between TRC and deep tabular models (DCN2, FT-Transformer, MLP, ResNet, SNN) on the (a) CA, (b) CO, (c) DI, (d) QS, (e) SU, (f) PO, (g) AD, and (h) GE datasets. This figure serves as an extension of Fig. 9 in the manuscript.

Table XI: Time cost (seconds) per training epoch on the full training set. The training time for TRC does not include the training cost of the backbone model.

                CO     DI     QS     CA     PO     SU     AD     AU     GE     YE      COV
MLP             0.129  0.719  0.013  0.267  0.189  0.285  0.566  0.011  0.123  9.259   9.442
TRC             0.171  0.878  0.020  0.348  0.248  0.372  0.688  0.017  0.167  13.086  13.818
ResNet          0.188  1.092  0.020  0.399  0.282  0.424  0.864  0.016  0.184  15.285  13.948
TRC             0.168  0.880  0.021  0.354  0.249  0.374  0.698  0.018  0.167  13.629  12.909
FT-Transformer  0.424  2.348  0.045  0.917  0.685  1.397  1.857  0.038  0.431  68.678  52.186
TRC             0.174  0.891  0.027  0.356  0.259  0.382  0.705  0.024  0.174  13.200  13.194
SCARF           8.972  3.368  0.043  1.435  0.873  1.331  2.485  0.034  0.593  22.253  22.083
TRC             4.882  1.145  0.032  0.452  0.317  0.474  0.908  0.027  0.216  12.974  13.113
VIME            0.619  2.974  0.272  1.061  1.197  2.169  2.972  0.261  0.776  43.997  36.498
TRC             0.270  1.332  0.037  0.512  0.398  0.582  1.099  0.028  0.260  11.700  13.356

Table XII: Inference time cost (seconds) on the full test set. The inference time for TRC does not include the inference time of the backbone model.
                CO     DI     QS     CA     PO     SU     AD     AU     GE     YE     COV
MLP             0.007  0.053  0.003  0.015  0.012  0.013  0.081  0.003  0.009  0.177  0.397
TRC             0.006  0.032  0.002  0.013  0.011  0.011  0.053  0.002  0.006  0.151  0.297
ResNet          0.016  0.111  0.006  0.042  0.030  0.034  0.147  0.006  0.020  0.433  0.970
TRC             0.005  0.033  0.002  0.015  0.010  0.012  0.043  0.002  0.006  0.154  0.341
FT-Transformer  0.042  0.275  0.012  0.130  0.131  0.180  0.512  0.012  0.083  2.704  4.600
TRC             0.005  0.030  0.001  0.014  0.010  0.012  0.051  0.001  0.007  0.142  0.318
SCARF           0.007  0.051  0.003  0.016  0.012  0.014  0.080  0.003  0.009  0.175  0.413
TRC             0.006  0.034  0.002  0.014  0.009  0.012  0.047  0.002  0.007  0.137  0.337
VIME            0.008  0.073  0.004  0.019  0.015  0.015  0.149  0.004  0.011  0.222  0.480
TRC             0.005  0.030  0.002  0.013  0.010  0.010  0.046  0.002  0.006  0.138  0.268

G. Visualization

We provide the t-SNE visualization of coefficients for the embedding vectors in Fig. 15, the visualization of the similarity matrix of the embedding vectors in Fig. 16, and the heatmap of coefficients for the embedding vectors in Fig. 17.

Fig. 15: The t-SNE visualization of coefficients for the embedding vectors, for AutoInt, DCN2, FT-Transformer, MLP, ResNet, SAINT, SNN, and VIME on the AU and GE datasets.

Fig. 16: The visualization of the similarity matrix of the embedding vectors on FT-Transformer, for the (a) CA, (b) CO, (c) SU, and (d) AD datasets.
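The coefficients visualized in Fig. 15 and Fig. 17 are produced by the learned coordinate estimator s(·; θ_s). As a purely illustrative sketch (not the paper's trained estimator, whose form is not reproduced here), one minimal closed form that yields non-negative coefficients summing to one is a softmax over cosine similarities to the embedding vectors:

```python
import numpy as np

def le_space_coordinates(z, B):
    """Map a representation z into the LE-Space spanned by the
    embedding vectors B (rows beta_t).

    Hypothetical minimal stand-in for the coordinate estimator
    s(.; theta_s): softmax over cosine similarities, giving
    coefficients r >= 0 with sum(r) == 1; the light-embedding
    encoding is then r @ B. The paper's estimator is a trained
    network, not this closed form.
    """
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    z_n = z / np.linalg.norm(z)
    sims = B_n @ z_n                  # cosine similarity to each beta_t
    e = np.exp(sims - sims.max())     # numerically stable softmax
    r = e / e.sum()                   # coordinates in the LE-Space
    return r, r @ B                   # (coefficients, encoded vector)
```

The heatmaps in Fig. 17 then correspond to plotting r across samples of different classes; a representation aligned with one embedding vector concentrates its mass on that coefficient.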
Fig. 17: The visualization of coefficients for the embedding vectors, for AutoInt, DCN2, FT-Transformer, MLP, ResNet, SAINT, SNN, and VIME on the AU and GE datasets.

H. The Relationship between SVE and the Number of Embedding Vectors

We show the relationship between SVE and the number of embedding vectors in Fig. 18.

I. Sensitivity Analysis

We provide the sensitivity analyses for the weight of the loss function L_orth, the perturbing times for constructing simulated sub-optimal representations in Inherent Shift Learning, the ratio of selected optimal representations, and the number of embedding vectors in Fig. 19, Fig. 20, Fig. 21, and Fig. 22.
Fig. 18: Increasing the number of embedding vectors leads to larger SVE values of representations (RMSE or accuracy together with SVE for DCN2+TRC, FT-Transformer+TRC, MLP+TRC, ResNet+TRC, and SAINT+TRC; panels: CA, CO, AD, and GE datasets).
Fig. 19: The performance of TRC with varying weights of the orthogonality loss (FT-Transformer vs. FT-Transformer+TRC; panels: CA and CO datasets, RMSE ↓; AD and GE datasets, accuracy ↑).

Fig. 20: The performance of TRC with varying numbers of perturbations of the observation in Inherent Shift Learning (FT-Transformer vs. FT-Transformer+TRC; panels: CA and CO datasets, RMSE ↓; AD and GE datasets, accuracy ↑).
Fig. 21: The performance of TRC with a varying proportion τ of selected optimal representations over all validation-set samples (DCN2 and DCN2+TRC with lowest- (ours), random-, and highest-gradient selection; panels: DI and SU datasets, RMSE ↓; AU and GE datasets, accuracy ↑). This figure serves as an extension of Fig. 10 in the manuscript.

Fig. 22: The performance of TRC with a varying number of embedding vectors T (DCN2 vs. DCN2+TRC; panels: DI and SU datasets, RMSE ↓; AU and GE datasets, accuracy ↑).
This figure serves as an extension of Fig. 11 in the manuscript.

J. Theoretical Analysis and Discussion

Definition 3 (Effective Rank). The effective rank [52] of a matrix Z, denoted erank(Z), is defined as

    erank(Z) = exp(SVE(Z)),    (13)

where SVE is the entropy of the normalized singular values, as defined in Definition 1. The effective rank can be considered a real-valued extension of the rank: while the usual rank counts the number of non-zero singular values, the effective rank measures how the energy is distributed across the singular directions.

To understand the difference between rank and effective rank intuitively, consider a two-dimensional Gaussian random vector with highly correlated components. Its covariance matrix has rank two, but the corresponding Gaussian distribution concentrates most of its energy along the direction of one singular vector. In such a case, the singular value entropy (SVE) approaches zero, and the effective rank is only slightly greater than one.

Lemma 1. Let Z = {z_i}_{i=1}^N ∈ R^{N×D} denote the representations, where N is the number of samples and D is the dimensionality. Then we have

    SVE(Z) = log erank(Z) ≤ log rank(Z),    (14)

which is given by [52].

Proposition 1. Consider a classification task with C classes. Let Z = {z_i}_{i=1}^N ∈ R^{N×D} and Y ∈ R^{N×C} denote the representations and the one-hot label matrix, where N is the number of samples, D is the dimensionality, and C ≤ D. When the model is trained to near-zero loss, the SVE of Z satisfies SVE(Z) ≤ log C.

Proof. Assuming a balanced dataset, when the model is trained to near-zero loss, the phenomenon of neural collapse typically occurs [53], [54]. This implies that all representations within a class converge to their within-class mean. Specifically, for class c, every representation z_i belonging to class c satisfies z_i → w_c, where w_c ∈ R^D is the class center.
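As a hedged illustration of Definition 3 and Lemma 1 (a NumPy sketch with our own helper names, not code from the paper), the correlated two-dimensional Gaussian example above can be checked numerically:

```python
import numpy as np

def svd_entropy(Z):
    """SVE: Shannon entropy of the normalized singular value
    spectrum of Z (Definition 1)."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                      # zero singular values contribute 0
    return float(-(p * np.log(p)).sum())

def effective_rank(Z):
    """erank(Z) = exp(SVE(Z)) (Definition 3, Eq. 13)."""
    return float(np.exp(svd_entropy(Z)))

# Two highly correlated components: the rank is 2, but almost all
# energy lies along one singular direction, so erank is barely above 1.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))
Z = np.hstack([x, x + 0.05 * rng.normal(size=(1000, 1))])

assert np.linalg.matrix_rank(Z) == 2
assert 1.0 < effective_rank(Z) < 1.5
# Lemma 1 (Eq. 14): SVE(Z) <= log rank(Z)
assert svd_entropy(Z) <= np.log(np.linalg.matrix_rank(Z)) + 1e-9
```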
Thus the representation matrix Z can be written as

    Z = Y W,    (15)

where W = {w_c}_{c=1}^C ∈ R^{C×D} contains the class centers as rows. From the properties of matrix multiplication, the rank of Z is upper bounded by

    rank(Z) ≤ min(rank(Y), rank(W)) ≤ min(C, D) = C.    (16)

Then, by Lemma 1, we have

    SVE(Z) ≤ log rank(Z) ≤ log C.    (17)

Discussion 1: SVE interpretation in TRC. TRC focuses on deep learning for supervised tabular prediction tasks, i.e., classification and regression. The goal is to model the conditional distribution P(y | x), where x and y denote the features and the corresponding label. As discussed in Section III of the manuscript, the typical architecture for such tasks consists of a backbone network that extracts latent representations, followed by a linear prediction head. Take classification as an example. Ideally, the learned representations should be linearly separable: representations of the same class should cluster together, and different classes should be well separated. Let Z = {z_i}_{i=1}^N ∈ R^{N×D} denote the representations, where N is the number of samples, D is the dimensionality, and D ≤ N. As shown in Lemma 1 in Appendix J, we have SVE(Z) ≤ log rank(Z) ≤ log D, where SVE denotes the singular value entropy. Proposition 1 in Appendix J establishes that for well-learned representations that are informative for the prediction task, the energy tends to concentrate in a few dominant singular directions; as a result, the SVE is upper bounded by log C, where C is the number of classes and C ≤ D. We assume that trained deep tabular backbones have learned redundant information in addition to information useful for prediction. This is supported by our empirical results in Fig. 2 and Fig. 6 of the manuscript, where higher SVE often corresponds to worse performance. In addition, we can observe from Fig.
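The proof of Proposition 1 can be verified numerically with idealized collapsed representations (a synthetic sketch; the one-hot Y, random class centers W, and the dimensions below are our own choices, not the paper's experimental setup):

```python
import numpy as np

# Under neural collapse, Z = Y @ W with one-hot Y and class centers W,
# so rank(Z) <= C and, by Lemma 1, SVE(Z) <= log C (Eqs. 15-17).
rng = np.random.default_rng(1)
N, D, C = 600, 32, 4
labels = rng.integers(0, C, size=N)
Y = np.eye(C)[labels]                  # one-hot label matrix, N x C
W = rng.normal(size=(C, D))            # class centers as rows, C x D
Z = Y @ W                              # collapsed representations

s = np.linalg.svd(Z, compute_uv=False)
p = s[s > 1e-10 * s.max()]             # keep non-zero singular values
p = p / p.sum()
sve = float(-(p * np.log(p)).sum())

assert np.linalg.matrix_rank(Z) <= C
assert sve <= np.log(C) + 1e-9
```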
8 of the manuscript that with redundant information (higher SVE), the representations tend to be more entangled and less linearly separable.

Our method, TRC, leverages this insight to improve learned representations in a post-hoc, model-agnostic fashion. It reduces SVE while maintaining predictive performance by jointly minimizing the supervised loss (Eq. 12) and the entropy of the singular value spectrum. This avoids modifying the original backbone and retains its capacity to learn complex feature interactions. Importantly, we do not advocate extreme compression: excessively low SVE can discard task-relevant information and cause underfitting. Our results show that moderate SVE reduction strikes a good balance between compactness and predictive utility (Fig. 6 and Fig. 8 of the manuscript).

Proposition 2. Let Z* ∈ R^{N×D} denote the representations obtained by Tabular Space Mapping in TRC, where N is the number of samples, D is the dimensionality, and T is the number of embedding vectors with T ≤ D. Then we have SVE(Z*) ≤ log T.

Proof. Since any representation computed by Eq. 9 in Tabular Space Mapping of the manuscript can be viewed as a weighted linear combination of the embedding vectors in B, Z* can be written as

    Z* = R B,    (18)

where R = {r_i}_{i=1}^N ∈ R^{N×T} denotes the weights across all samples and B ∈ R^{T×D} denotes the embedding vectors. From the properties of matrix multiplication, the rank of Z* is upper bounded by

    rank(Z*) ≤ min(rank(R), rank(B)) ≤ min(T, T) = T.    (19)

Then, by Lemma 1, we have

    SVE(Z*) ≤ log rank(Z*) ≤ log T.    (20)

Discussion 2: The choice of the number of embedding vectors T in TRC. Let Z = {z_i}_{i=1}^N ∈ R^{N×D} and Z* ∈ R^{N×D} denote the representations obtained by the backbone and by Tabular Space Mapping in TRC, respectively, where N is the number of samples and D is the dimensionality. As
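Proposition 2 admits the same kind of numerical sanity check (a sketch with random coefficients R and embedding vectors B; these are stand-ins for the quantities TRC actually learns):

```python
import numpy as np

# Any Z* produced by Tabular Space Mapping is a linear combination of
# T embedding vectors, Z* = R @ B, so its rank (and hence its SVE, by
# Lemma 1) is capped by T (Eqs. 18-20).
rng = np.random.default_rng(2)
N, D, T = 500, 64, 8
R = rng.random(size=(N, T))            # per-sample coefficients
R = R / R.sum(axis=1, keepdims=True)   # normalized mixture weights
B = rng.normal(size=(T, D))            # T embedding vectors (LE-Space)

Z_star = R @ B
s = np.linalg.svd(Z_star, compute_uv=False)
p = s[s > 1e-10 * s.max()]             # keep non-zero singular values
p = p / p.sum()

assert np.linalg.matrix_rank(Z_star) <= T
assert float(-(p * np.log(p)).sum()) <= np.log(T) + 1e-9
```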
discussed in Discussion 1, deep tabular backbones often exhibit high SVE, indicating that the learned representations Z contain redundancy in addition to task-relevant information. We propose to jointly reduce the SVE of Z* and minimize the supervised loss in Eq. 12 of the manuscript so as to retain the information critical for prediction. This encourages the energy of the representations to concentrate along the dominant singular directions that are most informative for the task. As shown in Proposition 2, the SVE of Z* is upper bounded by log T, where T is the number of embedding vectors in LE-Space. Therefore, the SVE can be reduced by decreasing the number of embedding vectors T. To make the SVE of Z* in TRC effectively smaller than that of Z from the backbone model, one can calculate the SVE of Z on the training set and set T < exp(SVE(Z)). Notably, overly low SVE values may lead to underfitting by discarding useful task-relevant information. Hence, we recommend choosing T in a moderate range (e.g., [5, exp(SVE(Z))]) to maintain a good balance between compression and information preservation.

The sensitivity analysis w.r.t. T is presented in Fig. 11 of the manuscript and Fig. 22 of Appendix I. Overall, TRC is robust across a wide range of T and consistently improves the performance of the backbone models for all values of T. As suggested, when T is within the range [5, exp(SVE(Z))] (e.g., exp(SVE(Z)) = 22 on the CO dataset, 29 on the DI dataset, 94 on the SU dataset, 81 on the AU dataset, and 101 on the GE dataset), the performance of TRC is stable. As T increases further, we observe a slight degradation in performance, yet TRC still enhances the performance of the backbone model. This aligns with our claim that higher SVE may introduce redundant information that hinders prediction.

K.
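The recipe above — compute SVE(Z) on the training set and pick T below exp(SVE(Z)) — can be sketched as follows (`suggest_T_range` and `t_min` are our own names, not identifiers from the paper):

```python
import numpy as np

def suggest_T_range(Z_train, t_min=5):
    """Return a moderate range [t_min, floor(exp(SVE(Z)))] for the
    number of embedding vectors T, following Discussion 2."""
    s = np.linalg.svd(Z_train, compute_uv=False)
    p = s[s > 1e-10 * s.max()]         # keep non-zero singular values
    p = p / p.sum()
    sve = float(-(p * np.log(p)).sum())
    t_max = int(np.floor(np.exp(sve)))
    return t_min, max(t_min, t_max)

# Example with synthetic backbone representations of dimensionality 50.
# By Lemma 1, exp(SVE(Z)) <= rank(Z) <= 50, so the upper end of the
# range never exceeds D.
rng = np.random.default_rng(3)
Z_train = rng.normal(size=(1000, 50))
lo, hi = suggest_T_range(Z_train)
assert lo == 5 and 5 <= hi <= 50
```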
How Well the Simulated Shift Is Expected to Model the "Inherent" Shift

We include an experiment to evaluate how well the simulated shift, computed by the shift estimator, models the real inherent shift. The results are provided in Table XIII. Let the representations extracted by the trained backbone G_f(·; θ_f) be Z = {z_i}_{i=1}^N ∈ R^{N×D}, where z_i = G_f(x_i; θ_f) ∈ R^D corresponds to sample x_i and N is the number of samples. The true inherent shift is defined as the discrepancy between a sub-optimal representation and the corresponding optimal representation. However, since the true optimal representations are not accessible in practice, directly measuring the inherent shift is challenging. To approximate it, we propose the following experimental setup: we train the same backbone architecture G_f(·; θ_f) in two different ways, (i) heavy training with the full number of epochs, producing representations Z_heavy = {z_i^heavy}_{i=1}^N, and (ii) light training with only half the number of epochs, yielding Z_light = {z_i^light}_{i=1}^N. We assume that, compared to the representations Z_light from light training, the representations Z_heavy from heavy training are closer to the optimal ones. Hence, we treat Z_heavy as the optimal representations and Z_light as the sub-optimal ones. Notably, this assumption is adopted solely for the purpose of this evaluation, in the absence of ground truth for the inherent shift.

TABLE XIII: The L2 distance between sub-optimal representations and optimal representations without and with the shift estimator. "w/o Shift Estimator" corresponds to the distance between z_i^light and z_i^heavy; "w/ Shift Estimator" corresponds to the distance between ẑ_i and z_i^heavy.
Backbone          Discrepancy (↓)        AD     CA     CO     GE
MLP               w/o Shift Estimator    0.647  0.487  0.450  0.464
MLP               w/ Shift Estimator     0.578  0.395  0.246  0.434
FT-Transformer    w/o Shift Estimator    2.217  2.460  2.352  3.870
FT-Transformer    w/ Shift Estimator     1.068  0.887  0.820  1.443

Under this setup, the true inherent shift for each z_i^light is defined as z_i^light − z_i^heavy. We apply the proposed TRC to Z_light and obtain the learned shift estimator φ(·; θ_φ). Next, we apply the shift estimator φ(·; θ_φ) to estimate the inherent shift for each z_i^light and obtain the re-estimated representation ẑ_i = z_i^light − φ(z_i^light; θ_φ). Intuitively, if the estimated shift φ(·; θ_φ) models the true inherent shift z_i^light − z_i^heavy well, ẑ_i should be closer to z_i^heavy. We then compare the average distance between ẑ_i and z_i^heavy against the average distance between z_i^light and z_i^heavy. Our results show that the re-estimated representations ẑ_i are significantly closer to the optimal representations z_i^heavy than the original sub-optimal ones z_i^light. This demonstrates that the shift estimator learned from artificial corruption has the ability to model the inherent shift in the representation space.

L. Limitations of Assumption 1

We acknowledge that the assumption may not always hold. In particular, as shown in Table XIV, when the backbone model is severely underfitting and far from convergence (e.g., trained for only 5% of the total epochs), the performance of TRC based on Assumption 1 (lowest-gradient selection) becomes similar to that of random or highest-gradient selection. This suggests that, under severe backbone underfitting, low gradient norms may not indicate the representations with the least shift.
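The evaluation protocol can be sketched with a toy stand-in for the learned shift estimator φ (here an oracle that recovers 80% of an injected synthetic shift — purely illustrative, not the trained network or the paper's data):

```python
import numpy as np

def mean_l2(A, B):
    """Average per-sample L2 distance between two sets of representations."""
    return float(np.linalg.norm(A - B, axis=1).mean())

rng = np.random.default_rng(4)
Z_heavy = rng.normal(size=(200, 16))         # stand-in for "optimal" reps
shift = 0.5 * rng.normal(size=(200, 16))     # stand-in for the inherent shift
Z_light = Z_heavy + shift                    # "sub-optimal" representations

# Hypothetical estimator phi: recovers 80% of the true shift.
phi = lambda Z: 0.8 * (Z - Z_heavy)
Z_hat = Z_light - phi(Z_light)               # re-estimated representations

# "w/ Shift Estimator" should beat "w/o Shift Estimator", as in Table XIII.
assert mean_l2(Z_hat, Z_heavy) < mean_l2(Z_light, Z_heavy)
```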
However, when the number of training epochs increases (e.g., trained for 50% of the total epochs), TRC based on Assumption 1 enhances the backbone performance more than random or highest-gradient selection, which illustrates the effectiveness of Assumption 1.

TABLE XIV: The performance of TRC on underfitted backbones.

Dataset   Backbone training epochs   DCN2    +TRC (Lowest, ours)   +TRC (Random)   +TRC (Highest)
CO ↓      5%                         4.570   4.498                 4.512           4.503
CO ↓      50%                        4.384   4.165                 4.285           4.325
GE ↑      5%                         0.508   0.527                 0.525           0.524
GE ↑      50%                        0.543   0.561                 0.550           0.548

M. Broader Impacts

Tabular data are extensively utilized across various domains, including healthcare, finance, engineering, and psychology. Despite this wide-ranging application, much research in deep learning has predominantly focused on other data modalities such as images, text, and time series. Our paper aims to bridge this gap by introducing a novel deep Tabular Representation Corrector, TRC, to enhance trained deep tabular models through two tasks. Progress in this direction has the potential to facilitate the construction of multi-modal pipelines for problems where only one part of the input is tabular, while other components involve images, text, and other deep-learning-friendly data types. These pipelines can then be trained end-to-end through gradient optimization across all modalities. Such integration enables the fusion of insights from tabular data, such as demographics and genomics, with information from images, text, and time series.
