Imputation-free Learning of Tabular Data with Missing Values using Incremental Feature Partitions in Transformer
Tabular data sets with varying missing values are prepared for machine learning using an arbitrary imputation strategy. Synthetic values generated by imputation models often raise concerns regarding data quality and the reliability of data-driven outcomes. To address these concerns, this article proposes an imputation-free incremental attention learning (IFIAL) method for tabular data with missing values. A pair of attention masks is derived and retrofitted to a transformer to directly process tabular data without imputing or initializing missing values. The proposed method incrementally learns partitions of overlapping and fixed-size feature sets to enhance the performance of the transformer. The average classification performance rank order across 17 diverse tabular data sets highlights the superiority of IFIAL over 11 state-of-the-art learning methods with or without missing value imputations. Additional experiments corroborate the robustness of IFIAL to varying types and proportions of missing data, demonstrating its superiority over methods that rely on explicit imputations. A feature partition size equal to one-half the original feature space yields the best trade-off between computational efficiency and predictive performance. IFIAL is one of the first solutions that enable deep attention models to learn directly from tabular data, eliminating the need to impute missing values. The source code for this paper is publicly available.
💡 Research Summary
The paper tackles a pervasive problem in tabular machine learning: how to handle missing values without resorting to imputation, which can introduce bias, degrade data quality, and increase computational cost. The authors propose Imputation‑Free Incremental Attention Learning (IFIAL), a novel framework that enables a transformer‑based classifier to operate directly on data containing missing entries. The key idea is to exclude missing values only during the attention computation, leaving the raw input untouched. Two binary attention masks are constructed: a column mask (M₁) that assigns –∞ to columns corresponding to missing features, thereby turning their soft‑max scores to zero, and a row mask (M₂) that zeroes out rows belonging to missing features after the soft‑max. This double‑masking guarantees that any query‑key interaction involving a missing feature contributes nothing to the attention output, while observed values interact normally.
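The double-masking scheme described above can be sketched for a single sample as follows. This is a minimal NumPy illustration, not the authors' implementation: per-feature query/key/value matrices and the mask names M₁/M₂ follow the summary, while the exact tensor layout and helper names are assumptions.

```python
import numpy as np

def masked_attention(Q, K, V, missing):
    """Double-masked scaled dot-product attention for one sample.

    Q, K, V : (d, h) arrays of per-feature query/key/value vectors.
    missing : boolean array of length d, True where a feature is missing.
    Assumes at least one feature per sample is observed.
    """
    d, h = Q.shape
    scores = Q @ K.T / np.sqrt(h)               # (d, d) attention logits
    # M1 (column mask): -inf on columns of missing features,
    # so their softmax scores become exactly zero.
    scores[:, missing] = -np.inf
    # Numerically stable softmax over keys (axis=1).
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # M2 (row mask): zero out rows of missing features after the softmax,
    # so a missing query contributes nothing to the output.
    weights[missing, :] = 0.0
    return weights @ V, weights
```

With both masks applied, every query-key pair that touches a missing feature carries zero weight, while the rows for observed features still form proper probability distributions over the observed keys.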
To keep the transformer’s quadratic memory complexity tractable for high‑dimensional tables, the authors introduce an incremental learning scheme based on overlapping fixed‑size feature partitions. After sorting features by ascending missing‑rate, the feature set is divided into partitions of size k with an overlap of ⌈k/2⌉ features between consecutive partitions. The number of partitions P follows the formula
P = 1 + ⌈(d – k) / (k – ⌈k/2⌉)⌉,
where d is the total number of features. Each partition is fed to a Feature‑Tokenized Transformer (FTT): feature names and categorical values are embedded using a pretrained language model, while numerical values are linearly projected. The FTT is first trained on the initial partition (P₁); subsequent partitions fine‑tune the previously learned model, progressively incorporating more features while re‑using the overlapping region to preserve learned representations. Because each training step only sees k features, the attention matrix size is O(k²), dramatically reducing GPU memory usage compared with a full‑feature transformer (O(d²)).
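The partitioning rule can be made concrete with a short sketch. The partition count matches the formula above; the exact index arithmetic (clamping the last window so it ends at feature d−1) is an illustrative assumption rather than the paper's code.

```python
import math

def feature_partitions(d, k):
    """Overlapping fixed-size partitions of d features.

    Features are assumed to be pre-sorted by ascending missing rate.
    Consecutive partitions share ceil(k/2) features, so
    P = 1 + ceil((d - k) / (k - ceil(k/2))).
    """
    overlap = math.ceil(k / 2)
    stride = k - overlap
    if d <= k:                       # one partition covers everything
        return [list(range(d))]
    P = 1 + math.ceil((d - k) / stride)
    parts = []
    for i in range(P):
        start = min(i * stride, d - k)   # clamp the last window to the end
        parts.append(list(range(start, start + k)))
    return parts
```

For example, d = 10 and k = 4 give overlap 2, stride 2, and P = 1 + ⌈6/2⌉ = 4 partitions, each of size 4, jointly covering all ten features with two shared features between neighbours.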
The authors evaluate IFIAL on 17 heterogeneous tabular datasets spanning healthcare, finance, and biology. They simulate four missingness mechanisms—MCAR, MAR, MNAR, and a mixed scenario—at rates ranging from 10% to 70%. Eleven baselines are compared, including traditional ML (Random Forest, XGBoost, LightGBM), classic imputation methods (MICE, missForest), deep imputation models (GAIN, Diffputer, VAE-based, denoising autoencoders), and recent incremental-learning transformers (Mamba, TabTransformer). Across all settings, IFIAL achieves the best average rank, often outperforming baselines by a substantial margin, especially when missing rates are high. A partition size equal to half the total feature count (k = d/2) yields the best trade-off between predictive performance and computational cost; smaller partitions save memory but hurt accuracy, while larger partitions risk out-of-memory errors.
Ablation studies confirm that both masks are essential: removing M₁ or M₂ degrades performance to levels comparable with naive imputation. Visualizations of attention maps show that missing features receive zero attention weight, preventing them from influencing the learned representation. Feature‑importance analyses further demonstrate that IFIAL’s importance scores align with those derived from fully observed data, whereas imputation‑based methods can distort importance due to synthetic values.
Limitations are acknowledged. The sequential nature of partition training may make the final model sensitive to the order of partitions; the authors did not explore optimal ordering strategies. Extremely high‑cardinality categorical features may still pose embedding challenges, and the current implementation does not support parallel training of partitions.
Future work is outlined: developing algorithms to automatically determine partition order and size, integrating dynamic partitioning, and combining the mask mechanism with a learned missing‑value encoder that could capture informative patterns in the missingness itself. Overall, IFIAL represents a significant step toward trustworthy, efficient deep learning on imperfect tabular data, eliminating the need for any form of explicit imputation.