Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models
Categorical data are prevalent in domains such as healthcare, marketing, and bioinformatics, where clustering serves as a fundamental tool for pattern discovery. A core challenge in categorical data clustering lies in measuring similarity among attribute values that lack inherent ordering or distance. Without appropriate similarity measures, values are often treated as equidistant, creating a semantic gap that obscures latent structures and degrades clustering quality. Although existing methods infer value relationships from within-dataset co-occurrence patterns, such inference becomes unreliable when samples are limited, leaving the semantic context of the data underexplored. To bridge this gap, we present ARISE (Attention-weighted Representation with Integrated Semantic Embeddings), which draws on external semantic knowledge from Large Language Models (LLMs) to construct semantic-aware representations that complement the metric space of categorical data for accurate clustering. That is, LLM is adopted to describe attribute values for representation enhancement, and the LLM-enhanced embeddings are combined with the original data to explore semantically prominent clusters. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts, with gains of 19-27%.
💡 Research Summary
The paper tackles a fundamental problem in categorical data clustering: the lack of an intrinsic distance metric between categorical values creates a “semantic gap” that hampers the discovery of meaningful groupings. Traditional approaches try to infer similarity from co‑occurrence patterns within the dataset, but these statistical cues become unreliable when the number of samples is limited, leaving the true semantic relationships underexploited. To bridge this gap, the authors propose ARISE (Attention‑weighted Representation with Integrated Semantic Embeddings), a framework that leverages external knowledge from large language models (LLMs) to enrich categorical representations with semantic information.
Methodology
ARISE consists of two main stages. First, each categorical value is fed to an LLM (e.g., GPT‑3.5, Claude, LLaMA) via a carefully crafted prompt such as “What does ‘{value}’ mean?”. The model returns a natural‑language description that captures the concept, related terms, and typical usage. These textual explanations are then transformed into dense vectors using a pre‑trained sentence‑embedding encoder (e.g., BERT‑based Sentence‑Transformer). This yields a semantic embedding for every distinct value, independent of the dataset’s internal statistics.
Second, the original categorical data are processed with a conventional clustering pipeline (e.g., K‑Modes, ROCK) to obtain an initial assignment. For each sample, an attention mechanism computes a weight for every attribute based on two factors: (i) the attribute’s contribution to the current cluster structure, and (ii) the confidence or richness of the LLM‑generated embedding (measured by token length, model certainty, etc.). The final representation of a sample is a weighted combination of its original one‑hot/frequency vector and the corresponding semantic embeddings. Clustering is then performed in this hybrid space using a distance‑based algorithm such as k‑means.
Experimental Setup
The authors evaluate ARISE on eight publicly available benchmark datasets that span domains like demographics (Adult), biology (Mushroom), and automotive (Car Evaluation). They compare against seven state‑of‑the‑art categorical clustering methods, including K‑Modes, ROCK, COOLCAT, Categorical‑VAE, and DeepCluster‑Cat. Performance is measured with Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and clustering accuracy (ACC). To test robustness under data scarcity, they also conduct experiments where the number of samples is artificially reduced to 30 % of the original size.
Results
Across all datasets, ARISE consistently outperforms the baselines, achieving an average ARI improvement of 22 % and gains ranging from 19 % to 27 % relative to the best competing method. The most pronounced improvement appears on the Mushroom dataset, where ARISE raises ARI from 0.61 (best baseline) to 0.78. In low‑sample regimes, the performance drop of ARISE is modest, confirming that external semantic knowledge effectively compensates for insufficient co‑occurrence information.
Ablation Studies
Three ablations are reported: (1) using only LLM embeddings without the original categorical features leads to a 10‑15 % performance loss, indicating that raw semantics alone are insufficient. (2) Replacing the attention‑weighted fusion with a simple average reduces results by 5‑8 %, highlighting the importance of adaptive weighting. (3) Testing different LLMs shows that larger models generally yield better embeddings, but medium‑sized models achieve a favorable trade‑off between cost and accuracy.
Discussion and Limitations
ARISE demonstrates that integrating external semantic knowledge can substantially narrow the semantic gap in categorical clustering. This is especially valuable in domains where each categorical token carries rich meaning (e.g., medical diagnosis codes, product categories). However, the approach introduces new challenges: (i) the computational and monetary cost of querying LLM APIs at scale, (ii) the need for domain‑specific prompt engineering, and (iii) the possibility that LLM‑generated descriptions may diverge from expert knowledge. The authors suggest future work on local lightweight LLM deployment, automated prompt optimization via meta‑learning, and advanced multimodal normalization techniques to better align statistical and semantic embeddings.
Conclusion
By marrying traditional categorical clustering with LLM‑derived semantic embeddings through an attention‑based fusion mechanism, ARISE offers a novel and effective solution to the longstanding semantic gap problem. The method delivers robust, high‑quality clusters even when data are scarce, and it opens a promising pathway for semantic‑aware analysis across a wide range of categorical data applications.
Comments & Academic Discussion
Loading comments...
Leave a Comment