LassoFlexNet: Flexible Neural Architecture for Tabular Data
Despite their dominance in vision and language, deep neural networks often underperform relative to tree-based models on tabular data. To bridge this gap, we incorporate five key inductive biases into deep learning: robustness to irrelevant features, axis alignment, localized irregularities, feature heterogeneity, and training stability. We propose LassoFlexNet, an architecture that evaluates the linear and nonlinear marginal contribution of each input via Per-Feature Embeddings, and sparsely selects relevant variables using a Tied Group Lasso mechanism. Because these components introduce optimization challenges that destabilize standard proximal methods, we develop a Sequential Hierarchical Proximal Adaptive Gradient optimizer with exponential moving averages (EMA) to ensure stable convergence. Across 52 datasets from three benchmarks, LassoFlexNet matches or outperforms leading tree-based models, achieving up to a 10% relative gain, while maintaining Lasso-like interpretability. We substantiate these empirical results with ablation studies and theoretical proofs confirming the architecture's enhanced expressivity and structural breaking of undesired rotational invariance.
💡 Research Summary
LassoFlexNet is a novel deep learning architecture specifically designed for tabular data, addressing the well-known performance gap between neural networks and tree-based models. The authors identify five inductive biases crucial for tabular tasks: robustness to irrelevant features, axis alignment, capacity for localized irregularities, handling heterogeneous feature types, and training stability. To embed these biases, LassoFlexNet introduces three tightly coupled components.

First, each raw feature is transformed by a Piecewise Linear Encoding (PLE) that discretizes numerical inputs into intervals reminiscent of decision-tree splits, preserving the original coordinate axes and breaking rotational invariance. The encoded vectors are then passed through independent Per-Feature Embedding (PFE) blocks (lightweight ResNet-style networks followed by parameter-free batch normalization), producing rich, non-linear representations while keeping each feature's marginal contribution identifiable.

Second, a linear skip connection operates on the mean-pooled embeddings rather than the raw inputs. This connection is regularized with a Tied Group Lasso, which assigns a single scalar coefficient β_i to the entire embedding group of feature i. Consequently, feature selection is performed on the learned non-linear embeddings, yielding interpretable β_i magnitudes that directly reflect feature importance.

Third, the embeddings are fed into an MLP-Mixer module that captures cross-feature interactions. A curriculum scalar τ (< 1 early in training) scales the Mixer output, ensuring that the Lasso-driven linear path converges before the Mixer dominates, thereby preventing premature over-fitting.
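The PLE and Tied Group Lasso mechanics above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the paper's implementation: the function names, the choice of bin edges, and the exact soft-threshold form are assumptions.

```python
import numpy as np

def piecewise_linear_encode(x, edges):
    """Encode a scalar feature into a PLE vector (hypothetical sketch).

    edges: sorted bin boundaries b_0 < b_1 < ... < b_T (e.g. empirical
    quantiles). Component t is clip((x - b_t) / (b_{t+1} - b_t), 0, 1):
    bins fully below x saturate at 1, the bin containing x is fractional,
    and later bins are 0 -- an axis-aligned, decision-tree-like encoding.
    """
    lo, hi = edges[:-1], edges[1:]
    return np.clip((x - lo) / (hi - lo), 0.0, 1.0)

def tied_group_soft_threshold(beta, lam):
    """Lasso soft-thresholding of the per-feature scalars beta_i.

    A single scalar beta_i gates feature i's entire embedding group, so a
    beta_i driven exactly to zero removes that feature from the linear path.
    """
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)
```

For instance, `piecewise_linear_encode(1.5, np.array([0., 1., 2., 3.]))` yields `[1.0, 0.5, 0.0]`: the first bin is fully crossed, the second is half-crossed, the third untouched.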
Training such a composite system poses optimization challenges because the Lasso constraint is non‑smooth and hierarchical. The paper replaces the original Hierarchical Prox‑Gradient (HPG) scheme of LassoNet with a Sequential Hierarchical Proximal (Seq‑Hier‑Prox) algorithm. The procedure first updates β using Adam’s adaptive learning rates and a soft‑thresholding operator, then projects the first‑layer weights W(1) onto a feasible set defined by the newly obtained β. This decoupling guarantees that feature selection drives the subsequent weight updates rather than being overwhelmed by the much larger neural‑network parameter space. To further stabilize stochastic training, exponential moving averages (EMA) are applied directly to the parameters before the proximal step, yielding smoother estimates and mitigating variance spikes that commonly cause divergence in sparse networks.
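One Seq-Hier-Prox-style update can be sketched as follows. This is a loose NumPy caricature under stated assumptions: a plain gradient step stands in for Adam's adaptive rates, the EMA is applied to β alone rather than to all parameters, and the feasible set is guessed to be a LassoNet-style constraint capping each row of W(1) by M·|β_j|. All names and defaults are hypothetical.

```python
import numpy as np

def seq_hier_prox_step(beta, W1, grad_beta, ema,
                       lr=0.1, lam=0.5, M=10.0, decay=0.9):
    """One sequential update: beta first, then project W1 against the new beta.

    Sketch only -- the paper pairs this with Adam moments and EMA over the
    parameters; here plain SGD and an EMA over beta stand in for both.
    """
    beta = beta - lr * grad_beta                  # (1) gradient step on beta
    ema = decay * ema + (1.0 - decay) * beta      # (2) EMA-smoothed estimate
    # (3) proximal (soft-threshold) step enforcing sparsity on the smoothed beta
    beta = np.sign(ema) * np.maximum(np.abs(ema) - lr * lam, 0.0)
    # (4) project first-layer weights onto the feasible set defined by the new
    # beta: if beta_j == 0, feature j's first-layer weights are zeroed entirely
    cap = M * np.abs(beta)[:, None]
    W1 = np.clip(W1, -cap, cap)
    return beta, W1, ema
```

Because the projection runs after the proximal step, a feature whose β_j is thresholded to zero also has its first-layer weights zeroed, so selection decisions propagate into the network rather than being drowned out by its much larger parameter space.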
The authors provide theoretical guarantees: they prove that PLE eliminates rotational invariance, that the Tied Group Lasso inherits ℓ1 regularization's robustness to irrelevant features, and that the EMA-enhanced proximal updates converge under standard assumptions. Empirically, LassoFlexNet is evaluated on 52 datasets drawn from three recent tabular benchmarks (OpenML-CC18, AutoML-Benchmark, and Kaggle Tabular). Across these tasks, it matches or surpasses state-of-the-art tree ensembles such as XGBoost, CatBoost, and LightGBM, achieving up to a 10% relative improvement on several datasets. Ablation studies isolate the contribution of each component: removing PLE degrades axis-alignment and localized learning; omitting the Tied Group Lasso eliminates sparsity and interpretability; replacing Seq-Hier-Prox-Adam-EMA with the original HPG leads to unstable training and poorer performance.
In summary, LassoFlexNet integrates per‑feature non‑linear embeddings, group‑wise Lasso selection, and a carefully engineered optimizer to bridge the representational and optimization gaps between deep networks and decision trees. It delivers competitive predictive accuracy, maintains Lasso‑like feature importance scores, and offers a principled framework for future extensions such as pre‑training on massive tabular corpora or automated hyper‑parameter tuning.