TabNSA: Native Sparse Attention for Efficient Tabular Data Learning
Tabular data poses unique challenges for deep learning due to its heterogeneous feature types, lack of spatial structure, and often limited sample sizes. We propose TabNSA, a novel deep learning framework that integrates Native Sparse Attention (NSA) with a TabMixer backbone to efficiently model tabular data. TabNSA tackles computational and representational challenges by dynamically focusing on relevant feature subsets per instance. The NSA module employs a hierarchical sparse attention mechanism, including token compression, selective preservation, and localized sliding windows, to significantly reduce the quadratic complexity of standard attention operations while addressing feature heterogeneity. Complementing this, the TabMixer backbone captures complex, non-linear dependencies through parallel multilayer perceptron (MLP) branches with independent parameters. These modules are synergistically combined via element-wise summation and mean pooling, enabling TabNSA to model both global context and fine-grained interactions. Extensive experiments across supervised and transfer learning settings show that TabNSA consistently outperforms state-of-the-art deep learning models. Furthermore, by augmenting TabNSA with a fine-tuned large language model (LLM), we enable it to effectively address Few-Shot Learning challenges through language-guided generalization on diverse tabular benchmarks. Code is available at: https://github.com/aseslamian/TabNSA
💡 Research Summary
TabNSA introduces a novel deep-learning architecture specifically designed for heterogeneous tabular data. The core idea is to treat each feature as a token and apply Native Sparse Attention (NSA), a recently proposed sparsity-driven attention mechanism, to dynamically select a compact, instance-specific subset of features. NSA operates in three hierarchical stages. First, token compression aggregates consecutive feature blocks into compressed tokens using a learnable MLP with intra-block positional encodings; this reduces the token count while preserving local correlations. Second, token selection computes attention scores between the query vector and the compressed tokens, applies a softmax, and retains only the top-k blocks by rank, yielding data-driven sparsity that varies per sample. Third, a sliding-window branch keeps a fixed-size recent context (the last w tokens) to capture local patterns without incurring full-matrix costs. Each of the three branches (compression, selection, window) produces a sparse key/value pair (K̃_c, Ṽ_c), which is fed into standard scaled dot-product attention. A learned gate g_c(t) (an MLP followed by a sigmoid) weights the contribution of each branch before the branch outputs are summed, yielding the final sparse-attention output Y.
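The three-branch pipeline above can be sketched for a single query in NumPy. This is a minimal illustration, not the paper's implementation: the learnable compression MLP is replaced by block mean-pooling, and the learned gate is replaced by fixed weights; the function name `nsa_single_query` and all hyperparameter values are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    # scaled dot-product attention for a single query vector q: (d,)
    scores = softmax(K @ q / np.sqrt(q.shape[-1]))
    return scores @ V

def nsa_single_query(q, K, V, block=4, top_k=2, window=4):
    n, d = K.shape
    # 1) token compression: pool each block of `block` tokens into one token
    #    (the paper uses a learnable MLP with intra-block positional
    #    encodings; mean-pooling is a stand-in here)
    Kc = K.reshape(n // block, block, d).mean(axis=1)
    Vc = V.reshape(n // block, block, d).mean(axis=1)
    y_cmp = attend(q, Kc, Vc)
    # 2) token selection: score compressed blocks against the query,
    #    keep the top-k blocks, attend over their original tokens
    block_scores = softmax(Kc @ q / np.sqrt(d))
    sel = np.argsort(block_scores)[-top_k:]
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in sel])
    y_sel = attend(q, K[idx], V[idx])
    # 3) sliding window: attend only over the last `window` tokens
    y_win = attend(q, K[-window:], V[-window:])
    # gate: the paper learns per-branch sigmoid gates g_c(t);
    # fixed weights are used here purely for illustration
    g = np.array([0.4, 0.4, 0.2])
    return g[0] * y_cmp + g[1] * y_sel + g[2] * y_win
```

The key efficiency point is visible in the shapes: each branch attends over far fewer than n key/value rows (n/block compressed tokens, top_k·block selected tokens, and window recent tokens), which is the source of the sub-quadratic cost.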
The sparse‑attention output is then processed by TabMixer, an enhanced MLP‑Mixer block. Unlike the original Mixer, TabMixer employs two parallel MLPs with independent parameters for channel‑wise and token‑wise mixing, and stacks SiLU and GeLU activations. This design captures both global feature‑wise interactions (across columns) and sample‑wise interactions (across rows) while preserving residual connections for training stability. The combined representation (Y + TabMixer output) is aggregated via element‑wise addition and mean pooling to produce the final prediction logits.
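The parallel-branch mixing described above can be sketched as follows. This is a simplified, assumption-laden illustration: the two branches use single random weight matrices rather than trained two-layer MLPs, the SiLU/GELU stacking order is a guess, and the function name `tabmixer_block` is hypothetical.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def tabmixer_block(X, rng):
    # X: (tokens, channels) — one sample, with features treated as tokens
    t, c = X.shape
    # token-mixing branch: MLP applied across the token dimension,
    # with its own independent parameters
    W_tok = rng.normal(scale=0.1, size=(t, t))
    tok = gelu(silu(W_tok @ X))
    # channel-mixing branch: a separate MLP across the channel dimension
    W_ch = rng.normal(scale=0.1, size=(c, c))
    ch = gelu(silu(X @ W_ch))
    # residual connection preserves the input for training stability;
    # the two branch outputs are combined by element-wise addition
    return X + tok + ch
```

A final prediction head would then mean-pool the output over tokens, matching the element-wise addition plus mean pooling described in the text.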
To address few-shot and low-resource scenarios, the authors augment TabNSA with a largely frozen large language model (LLM), specifically Gemma-1.1-2B-IT. Tabular rows are transformed into natural-language prompts; the LLM's decoder-only Transformer encodes these prompts, and only its final layers are fine-tuned while the rest stay frozen. The resulting text embeddings undergo dual pooling and a linear adaptation before being fused with the TabNSA classifier. This hybrid leverages the LLM's world knowledge and reasoning abilities, enabling strong performance even when only a handful of labeled examples are available.
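The row-to-prompt serialization and pooling steps can be sketched as below. The prompt template and the choice of mean+max concatenation for "dual pooling" are assumptions for illustration (the paper does not publish its exact template here), and both function names are hypothetical.

```python
import numpy as np

def row_to_prompt(row: dict, task: str) -> str:
    # serialize one tabular row into a natural-language prompt;
    # the "<feature> is <value>" template is an assumed format
    feats = "; ".join(f"{k} is {v}" for k, v in row.items())
    return f"{task} Features: {feats}."

def dual_pool(H):
    # H: (seq_len, hidden) token embeddings from the LLM encoder pass.
    # Concatenating mean- and max-pooled vectors is one common reading
    # of "dual pooling"; a linear adapter would follow in the real model.
    return np.concatenate([H.mean(axis=0), H.max(axis=0)])
```

For example, an Adult-census row would serialize to a prompt like "Predict whether income exceeds 50K. Features: age is 39; education is Bachelors.", whose token embeddings are then pooled into a single fixed-size vector for fusion with the TabNSA head.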
Extensive experiments were conducted on twelve public tabular benchmarks (e.g., Adult, Higgs, YearPrediction, Covtype) covering a wide range of feature dimensions (tens to thousands) and sample sizes (thousands to hundreds of thousands). TabNSA consistently outperformed strong baselines, including Gradient Boosted Decision Trees (XGBoost, LightGBM, CatBoost), attention‑based tabular models (TabNet, SAINT, FT‑Transformer), and the original TabMixer. In full‑data regimes, TabNSA achieved 1–3 percentage‑point gains in accuracy; in low‑data regimes (≤ 1 % of training data) it surpassed GBDT by 5–7 pp and deep models by 3–4 pp. Computationally, the hierarchical sparsity reduced FLOPs by an average of 45 % and memory consumption by 38 % compared to dense attention variants.
Ablation studies confirmed the contribution of each NSA component: removing compression, selection, or sliding‑window each caused a measurable drop (≈ 0.8–2 pp) in performance, with the full three‑stage design delivering the best results. The gating mechanism was shown to adaptively prioritize branches depending on feature density. In few‑shot experiments, the Gemma‑augmented TabNSA achieved 6–9 pp higher accuracy than TabNet‑FewShot in 5‑shot and 10‑shot settings, and retained a modest 2–3 pp advantage even in zero‑shot configurations.
The paper also discusses limitations. Hyper‑parameters governing block size, stride, window length, and the number of selected tokens are dataset‑sensitive and may require tuning. Integrating an LLM adds extra GPU memory overhead, which could be prohibitive for very large tabular corpora. Future work is suggested on automated hyper‑parameter search, lightweight LLM alternatives, and extending the framework to multi‑table or relational settings.
In summary, TabNSA successfully marries hierarchical native sparse attention with a powerful MLP‑Mixer backbone, delivering a model that is both computationally efficient and highly expressive for tabular data. Its ability to incorporate LLM‑driven few‑shot learning further broadens its applicability, positioning TabNSA as a compelling new baseline for researchers and practitioners working with structured data.