Topic Modeling with Fine-tuning LLMs and Bag of Sentences
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Large language models (LLMs) are increasingly used for topic modeling, outperforming classical topic models such as LDA. Commonly, pre-trained LLM encoders such as BERT are used out-of-the-box despite the fact that fine-tuning is known to improve LLMs considerably. The challenge lies in obtaining a suitable labeled dataset for fine-tuning. In this paper, we build on the recent idea of using bags of sentences as the elementary unit for computing topics. Based on this idea, we derive an approach called FT-Topic to perform unsupervised fine-tuning, relying primarily on two steps for constructing a training dataset in an automatic fashion. First, a heuristic method identifies pairs of sentence groups that are assumed to belong either to the same topic or to different topics. Second, we remove sentence pairs that are likely labeled incorrectly. The resulting dataset is then used to fine-tune an encoder LLM, which can be leveraged by any topic modeling approach that uses embeddings. In this work, we demonstrate its effectiveness by deriving a novel state-of-the-art topic modeling method called SenClu. The method achieves fast inference through an expectation-maximization algorithm and hard assignments of sentence groups to a single topic, while allowing users to encode prior knowledge about the topic-document distribution. Code is available at https://github.com/JohnTailor/FT-Topic


💡 Research Summary

The paper introduces FT‑Topic, a novel unsupervised fine‑tuning framework for large language model (LLM) encoders specifically tailored to topic modeling. The authors adopt “bag of sentences” (BoS) as the elementary analysis unit, arguing that a short sequence of consecutive sentences typically contains enough semantic context to be assigned to a single topic while remaining small enough to avoid the fragmentation problems of word‑level bag‑of‑words models.

FT‑Topic builds a training set automatically in two stages. First, a heuristic generates triplets (anchor A, positive P, negative N) from raw documents: for each sentence group gᵢ, the next and previous groups in the same document become positives (assuming local topic continuity), while a randomly sampled group from a different document serves as a negative. This yields a large set of (A,P,N) triples without any human annotation.
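The triplet-generation heuristic above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function name, the fixed group size, and the sampling details are assumptions.

```python
import random

def build_triplets(documents, group_size=3):
    """Build (anchor, positive, negative) triplets from raw documents.

    `documents` is a list of documents, each a list of sentences.
    Consecutive sentences are merged into groups of `group_size`; the
    previous and next groups in the same document serve as positives
    (assuming local topic continuity), while a randomly sampled group
    from a different document serves as a negative.
    """
    # Split each document into consecutive sentence groups.
    grouped = [
        [" ".join(doc[i:i + group_size]) for i in range(0, len(doc), group_size)]
        for doc in documents
    ]
    triplets = []
    for d, groups in enumerate(grouped):
        other_docs = [k for k in range(len(grouped)) if k != d and grouped[k]]
        if not other_docs:
            continue
        for i, anchor in enumerate(groups):
            # Previous and next groups in the same document are positives.
            for j in (i - 1, i + 1):
                if 0 <= j < len(groups):
                    negative = random.choice(grouped[random.choice(other_docs)])
                    triplets.append((anchor, groups[j], negative))
    return triplets
```

Each anchor contributes up to two triplets (one per adjacent neighbor), so the triplet set grows linearly with corpus size and requires no human annotation.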

Second, the authors prune low‑quality triples using a pre‑trained, non‑fine‑tuned sentence encoder (e.g., Sentence‑BERT). They compute Euclidean distances between embeddings: triples with unusually large anchor‑positive distances are removed (fraction f_pos), and triples where the difference between anchor‑positive and anchor‑negative distances is small are also removed (fraction f_tri). This step mitigates two common errors: (i) treating groups from different topics as positives, and (ii) treating groups from the same topic as negatives.
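The pruning step can be approximated as below. The `encode` callback stands in for a pre-trained sentence encoder's embedding function; the two-stage filtering mirrors the description above, but the exact thresholds and tie-breaking are assumptions.

```python
import numpy as np

def prune_triplets(triplets, encode, f_pos=0.1, f_tri=0.1):
    """Filter likely mislabeled triplets using a pre-trained encoder.

    `encode` maps a list of texts to an (n, d) embedding array (e.g., a
    Sentence-BERT model's encode function). Drops the fraction `f_pos`
    of triplets with the largest anchor-positive distance, and the
    fraction `f_tri` with the smallest gap between anchor-negative and
    anchor-positive distances.
    """
    anchors = encode([t[0] for t in triplets])
    positives = encode([t[1] for t in triplets])
    negatives = encode([t[2] for t in triplets])

    d_ap = np.linalg.norm(anchors - positives, axis=1)  # anchor-positive
    d_an = np.linalg.norm(anchors - negatives, axis=1)  # anchor-negative
    gap = d_an - d_ap  # small gap: negative looks as close as the positive

    n = len(triplets)
    keep = np.ones(n, dtype=bool)
    # Remove the f_pos fraction with the largest anchor-positive distance.
    keep[np.argsort(d_ap)[int(np.ceil((1 - f_pos) * n)):]] = False
    # Remove the f_tri fraction with the smallest distance gap.
    keep[np.argsort(gap)[:int(np.ceil(f_tri * n))]] = False
    return [t for t, k in zip(triplets, keep) if k]
```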

The cleaned triplet set is then used to fine‑tune the LLM encoder with a triplet loss L(A,P,N)=max(‖v_A−v_P‖₂−‖v_A−v_N‖₂+m,0), where m is a margin (set to 0.16). Hyper‑parameters such as the number of negatives (n_neg = 2) and training epochs (ep = 4) follow the defaults of the Sentence‑Transformer library. The result is an encoder whose embedding space clusters sentence groups belonging to the same latent topic and pushes apart groups from different topics.
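The objective itself is compact. Below is a minimal NumPy sketch of the triplet margin loss with m = 0.16 as reported above; in practice the fine-tuning is run through the Sentence-Transformers library rather than a hand-rolled function like this.

```python
import numpy as np

def triplet_loss(v_a, v_p, v_n, margin=0.16):
    """Mean triplet margin loss over a batch.

    Implements L(A, P, N) = max(||v_A - v_P||_2 - ||v_A - v_N||_2 + m, 0),
    averaged over rows. Inputs are (n, d) embedding arrays for anchors,
    positives, and negatives.
    """
    d_ap = np.linalg.norm(v_a - v_p, axis=1)
    d_an = np.linalg.norm(v_a - v_n, axis=1)
    return float(np.maximum(d_ap - d_an + margin, 0.0).mean())
```

The loss is zero whenever the negative is already farther from the anchor than the positive by at least the margin, so gradient updates focus on the ambiguous triplets that survived pruning.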

With the fine‑tuned encoder in hand, the authors propose SenClu, a new topic model that mirrors the classic aspect model (Hofmann, 2001) but replaces individual words with sentence groups. The generative process assumes conditional independence of a group g and a document d given a latent topic t: p(g,d)=p(d)·∑ₜ p(g|t)p(t|d). Inference proceeds via an EM‑style algorithm: the E‑step clusters sentence‑group embeddings using a K‑Means‑like hard assignment, while the M‑step updates document‑topic priors. An “annealing” schedule gradually sharpens the priors, yielding deterministic, single‑topic assignments for each group. This hard‑assignment scheme dramatically speeds up inference compared with variational auto‑encoders or deep neural topic models, yet retains high topic coherence.
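The E-step/M-step loop can be sketched as below. This is a rough stand-in for SenClu, not its published implementation: the centroid-based E-step, the log-prior weighting, and the annealing schedule are all illustrative assumptions.

```python
import numpy as np

def senclu_sketch(emb, doc_ids, n_topics=5, n_iter=20, seed=0, anneal=1.5):
    """EM-style hard-assignment topic model over sentence-group embeddings.

    `emb` is an (n, d) array of sentence-group embeddings; `doc_ids[i]`
    is the document index of group i. The E-step hard-assigns each group
    to the topic with the best score (embedding proximity plus a
    document-topic log-prior whose influence grows each iteration,
    mimicking an annealing schedule); the M-step updates topic centroids
    and document-topic priors p(t|d).
    """
    rng = np.random.default_rng(seed)
    n, _ = emb.shape
    n_docs = int(max(doc_ids)) + 1
    centroids = emb[rng.choice(n, n_topics, replace=False)].copy()
    prior = np.full((n_docs, n_topics), 1.0 / n_topics)  # p(t | d)
    weight = 0.0  # influence of the prior, sharpened over iterations
    for _ in range(n_iter):
        # E-step: hard-assign each group to its single best topic.
        dist = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=2)
        score = -dist + weight * np.log(prior[doc_ids] + 1e-12)
        z = score.argmax(axis=1)
        # M-step: recompute centroids and document-topic priors.
        for t in range(n_topics):
            if np.any(z == t):
                centroids[t] = emb[z == t].mean(axis=0)
        for d in range(n_docs):
            counts = np.bincount(z[doc_ids == d], minlength=n_topics)
            prior[d] = (counts + 1e-3) / (counts.sum() + n_topics * 1e-3)
        weight = anneal * weight + 0.1  # annealing: prior gradually dominates
    return z, centroids, prior
```

Because each group is assigned to exactly one topic, the E-step is a single argmax per group, which is why this style of inference is so much cheaper than variational or neural alternatives.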

The authors evaluate FT‑Topic + SenClu on several public corpora (20 Newsgroups, Reuters‑21578, Wiki‑10K, Amazon reviews) against strong baselines: LDA, BERTopic, Top2Vec, CTM, and guided LDA. They report standard coherence metrics (UMass, UCI, NPMI), coverage measures (average topics per document, topic diversity), and downstream task performance (document classification F1, retrieval precision). Across the board, FT‑Topic + SenClu improves coherence by 5–12 % (notably a 0.07 NPMI boost) and reduces the average number of topics per document, confirming the benefit of hard assignments. Downstream tasks also see modest gains (2–4 % F1 increase).
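For reference, one common way to compute the NPMI coherence reported above is from document-level co-occurrence counts. Implementations differ in smoothing and windowing, so treat this as one standard formulation rather than the paper's exact evaluation code.

```python
import numpy as np
from itertools import combinations

def topic_npmi(topic_words, docs):
    """Mean NPMI over all word pairs of a topic.

    NPMI(w1, w2) = log(p(w1, w2) / (p(w1) p(w2))) / -log p(w1, w2),
    with probabilities estimated as document frequencies over `docs`
    (a list of tokenized documents). Pairs that never co-occur get the
    conventional floor of -1.
    """
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1 = sum(w1 in d for d in doc_sets) / n
        p2 = sum(w2 in d for d in doc_sets) / n
        p12 = sum(w1 in d and w2 in d for d in doc_sets) / n
        if p12 == 0 or p1 == 0 or p2 == 0:
            scores.append(-1.0)
            continue
        scores.append(np.log(p12 / (p1 * p2)) / -np.log(p12))
    return float(np.mean(scores))
```

NPMI ranges from -1 (words never co-occur) to 1 (words always co-occur), which puts the reported 0.07 absolute improvement in context.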

Key strengths of the approach include: (1) fully unsupervised generation of high‑quality fine‑tuning data, eliminating the need for costly manual labeling; (2) a modular pipeline where the fine‑tuned encoder can be reused by any embedding‑based topic model; (3) fast, interpretable inference thanks to the EM‑based hard clustering; and (4) flexibility to inject user‑defined priors on document‑topic distributions.

Limitations are acknowledged. The assumption that adjacent sentence groups share a topic may break down in documents with rapid topic shifts (e.g., news articles), potentially introducing noisy triplets. The pruning fractions f_pos and f_tri are dataset‑specific and currently set manually; an adaptive or learned selection could improve robustness. Moreover, SenClu currently enforces a single‑topic assignment per group, which may be restrictive for highly polysemous texts.

Future work outlined by the authors includes extending the model to soft, multi‑topic assignments (e.g., via Bayesian mixture models), leveraging document structure (headings, sections) for more informed positive/negative sampling, and automating the selection of pruning thresholds through meta‑learning or reinforcement learning. Experiments with newer encoders (MiniLM, RoBERTa‑large) and scaling to massive corpora are also planned.

In summary, FT‑Topic provides a practical, effective solution for unsupervised fine‑tuning of LLM encoders for topic modeling, and SenClu demonstrates that such fine‑tuned embeddings can be harnessed to achieve state‑of‑the‑art topic coherence, coverage, and downstream utility while maintaining computational efficiency. This work bridges the gap between powerful pre‑trained language models and the specific demands of topic discovery, offering a reusable tool for both research and real‑world text analytics.
