CausalTAD: Injecting Causal Knowledge into Large Language Models for Tabular Anomaly Detection

Notice: This research summary and analysis were automatically generated using AI technology. For definitive details, please refer to the original arXiv source.

Detecting anomalies in tabular data is critical for many real-world applications, such as credit card fraud detection. With the rapid advancements in large language models (LLMs), state-of-the-art performance in tabular anomaly detection has been achieved by converting tabular data into text and fine-tuning LLMs. However, these methods randomly order columns during conversion, without considering the causal relationships between them, which are crucial for accurately detecting anomalies. In this paper, we present CausalTAD, a method that injects causal knowledge into LLMs for tabular anomaly detection. We first identify the causal relationships between columns and reorder the columns to align with these relationships. This reordering can be modeled as a linear ordering problem. Since each column contributes differently to the causal relationships, we further propose a reweighting strategy that assigns different weights to different columns to enhance this effect. Experiments across more than 30 datasets demonstrate that our method consistently outperforms the current state-of-the-art methods. The code for CausalTAD is available at https://github.com/350234/CausalTAD.


💡 Research Summary

The paper introduces CausalTAD, a novel framework that enhances large language model (LLM)‑based anomaly detection on tabular data by explicitly incorporating causal knowledge among columns. Existing LLM approaches such as AnoLLM serialize a table into a textual sequence with a random column order, then fine‑tune the LLM to model conditional probabilities of each column given its predecessors. This random ordering ignores the causal dependencies that often exist between features (e.g., salary depends on education and job description), leading to distorted conditional probabilities and sub‑optimal anomaly scores.
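The serialization step described above can be illustrated with a minimal sketch. The function and column names below are hypothetical; the actual fine-tuned LLM and its scoring are not shown, only how a row becomes a textual sequence whose column order matters:

```python
# Minimal sketch (assumed names): serialize a tabular row into text in a
# given column order, as LLM-based detectors like AnoLLM do. The anomaly
# score would then be the negative log-likelihood the fine-tuned LLM
# assigns to this string (not computed here).
def serialize_row(row: dict, column_order: list) -> str:
    parts = [f"column {c} is value {row[c]}" for c in column_order]
    return ", ".join(parts)

row = {"education": "PhD", "job": "engineer", "salary": 90000}

# A random order ignores causal structure ...
random_order = ["salary", "education", "job"]
# ... while a causal order places causes (education, job) before
# their effect (salary), so conditional probabilities line up.
causal_order = ["education", "job", "salary"]

print(serialize_row(row, causal_order))
```

Under the causal order, the LLM conditions `salary` on its likely causes, which is exactly the distortion the random order introduces.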

CausalTAD addresses this limitation through two main components: (1) causal‑driven column ordering and (2) causal‑aware column reweighting.
Causal discovery and factor extraction: The authors adapt the COAT framework, originally designed for unstructured text, to handle mixed‑type tabular data. Each training sample is serialized as “column c₁ is value x₁, column c₂ is value x₂, …”. An LLM processes these sentences to propose high‑level latent factors (e.g., “Compensation”) that are described by one or more columns (salary, benefits). Using annotated factor values, a factor‑value matrix is built, and standard causal discovery algorithms (PC, LiNGAM, FCI) are applied to infer a directed causal graph among factors, complete with edge weights.
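A toy sketch of the intermediate data structure may help: columns are grouped under LLM-proposed factors, and annotated factor values form a samples-by-factors matrix that a causal discovery algorithm (e.g., PC) consumes. All names and values below are illustrative assumptions, not the paper's actual datasets:

```python
# Hypothetical factor extraction result: each high-level factor proposed
# by the LLM is described by one or more original columns.
factor_columns = {
    "Education": ["degree", "years_of_study"],
    "Occupation": ["job_title"],
    "Compensation": ["salary", "benefits"],
}

# Annotated factor values for four samples (toy numbers), one row per
# sample and one column per factor, in the order listed above.
factor_value_matrix = [
    [1.0, 2.0, 3.1],
    [2.0, 1.0, 4.0],
    [3.0, 3.0, 6.2],
    [1.0, 2.0, 3.0],
]

# A standard causal discovery algorithm (PC, LiNGAM, or FCI) would be
# run on this matrix to infer a weighted directed graph among factors,
# e.g. with the causal-learn library (not executed here):
# from causallearn.search.ConstraintBased.PC import pc
# graph = pc(np.array(factor_value_matrix))

n_samples, n_factors = len(factor_value_matrix), len(factor_columns)
print(n_samples, n_factors)
```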

Projection to column‑level preferences: Since the causal graph lives at the factor level, the authors project it onto columns. For any pair of columns (cᵢ, cⱼ), the preference strength w(cᵢ→cⱼ) is defined as the sum of absolute edge weights across all factor pairs where cᵢ participates in the source factor and cⱼ in the target factor. This yields a |C|×|C| preference matrix W that may contain cycles because multiple factors can involve the same columns.
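The projection rule above can be sketched directly: each factor-level edge weight is distributed, in absolute value, over all (source column, target column) pairs. The factor names and weights here are toy assumptions:

```python
# Sketch of projecting factor-level causal edges down to a column-level
# preference matrix W: w(c_i -> c_j) sums |edge weight| over all factor
# pairs where c_i belongs to the source factor and c_j to the target.
factor_columns = {
    "Education": ["degree"],
    "Compensation": ["salary", "benefits"],
}
# Directed factor edges with weights from causal discovery (toy value).
factor_edges = {("Education", "Compensation"): 0.8}

columns = ["degree", "salary", "benefits"]
W = {(ci, cj): 0.0 for ci in columns for cj in columns}
for (f_src, f_dst), weight in factor_edges.items():
    for ci in factor_columns[f_src]:
        for cj in factor_columns[f_dst]:
            W[(ci, cj)] += abs(weight)

print(W[("degree", "salary")])   # preference for degree before salary
```

Because a column can appear in several factors on both sides of different edges, W can contain cycles, which is why an exact topological sort is not available and a linear ordering formulation is needed.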

Linear Ordering Problem (LOP): The goal is to find a permutation π of columns that maximizes the total satisfied preference weight: maximize Σ_{i,j} w(cᵢ→cⱼ)·𝟙[π(cᵢ) < π(cⱼ)], where 𝟙[π(cᵢ) < π(cⱼ)] indicates that column cᵢ precedes cⱼ under π. In other words, the chosen ordering should agree with as much of the (possibly cyclic) preference matrix W as possible.
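For a handful of columns the objective can be checked by brute force. This sketch enumerates all permutations and picks the one satisfying the most preference weight; a real solver would be needed at scale, and the weights below are illustrative assumptions:

```python
from itertools import permutations

# Toy column-level preference weights, including a weak reverse edge
# (salary -> education) that makes the preferences cyclic.
W = {
    ("education", "job"): 0.5,
    ("education", "salary"): 0.7,
    ("job", "salary"): 0.9,
    ("salary", "education"): 0.1,
}

def satisfied_weight(order):
    """Total weight of preferences whose direction agrees with `order`."""
    pos = {c: i for i, c in enumerate(order)}
    return sum(w for (ci, cj), w in W.items() if pos[ci] < pos[cj])

columns = ["education", "job", "salary"]
best = max(permutations(columns), key=satisfied_weight)
print(best)  # -> ('education', 'job', 'salary')
```

The optimal ordering sacrifices the weak reverse preference (0.1) to satisfy the three stronger forward ones, which is exactly the trade-off the LOP formalizes when W contains cycles.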

