Summaries as Centroids for Interpretable and Scalable Text Clustering

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key idea, summary-as-centroid, retains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes. The method is LLM-optional: k-NLPmeans uses lightweight, deterministic summarizers, enabling offline, low-cost, and stable operation; k-LLMmeans is a drop-in upgrade that uses an LLM for summaries under a fixed per-iteration budget whose cost does not grow with dataset size. We also present a mini-batch extension for real-time clustering of streaming text. Across diverse datasets, embedding models, and summarization strategies, our approach consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clustering, without extensive LLM calls. Finally, we provide a case study on sequential text streams and release a StackExchange-derived benchmark for evaluating streaming text clustering.


💡 Research Summary

Paper Overview
The authors propose two variants of k‑means for text clustering—k‑NLPmeans and k‑LLMmeans—that replace the numeric centroid update with a textual summary of the cluster. The core idea, “summary‑as‑centroid,” keeps the standard k‑means assignment step in embedding space but periodically substitutes the mean vector with the embedding of a human‑readable summary. This yields interpretable prototypes while preserving the k‑means objective between summary steps.

Methodology

  1. Standard k‑means Recap – Documents are embedded into a d‑dimensional space, and k‑means iteratively assigns points to the nearest centroid and recomputes centroids as the arithmetic mean of assigned vectors.
  2. Summary Step – Every l iterations, instead of computing the mean, the algorithm gathers all documents assigned to a cluster Cj and generates a short textual prototype f(Cj). The prototype is re‑embedded with the same encoder, producing a new centroid µj = Embedding(f(Cj)).
  3. Two Summarizer Families
    • k‑NLPmeans: Uses deterministic, lightweight extractive summarizers (centroid‑based sentence ranking, TextRank, LSA‑style SVD). No LLM calls are required, making the method fully offline and cheap.
    • k‑LLMmeans: Calls a large language model (LLM) with a prompt that includes a small, representative sample of cluster documents (selected via k‑means++ sampling). The LLM returns a concise summary; the embedding of this summary becomes the new centroid. The number of LLM calls per summarization step is bounded by the number of clusters k, independent of dataset size.
  4. Mini‑Batch Extension – To handle large corpora and streaming data, the authors embed the summary step into the mini‑batch k‑means update rule. Each incoming batch updates centroids incrementally, and after a fixed number of batches a summary step is performed. This yields a low‑memory, online clustering algorithm that still produces readable centroids.
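The steps above can be condensed into a short sketch. The NumPy code below is illustrative, not the authors' implementation: `embed` stands in for a real sentence encoder, initialization uses a deterministic farthest-point heuristic in place of k-means++, and the summary step uses the simplest centroid-based ranking (pick the document closest to the cluster mean and re-embed it) as the k-NLPmeans summarizer:

```python
import numpy as np

def assign(X, centroids):
    # Standard k-means assignment: nearest centroid by squared Euclidean distance.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def summary_as_centroid(X, docs, k, embed, n_iters=30, summary_every=10):
    """k-NLPmeans sketch: k-means whose centroid is periodically replaced
    by the embedding of an extractive 'summary' of the cluster."""
    # Deterministic farthest-point initialization (stand-in for k-means++).
    centroids = [X[0]]
    while len(centroids) < k:
        dmin = np.min([((X - c) ** 2).sum(-1) for c in centroids], axis=0)
        centroids.append(X[dmin.argmax()])
    centroids = np.array(centroids, dtype=float)

    labels = np.zeros(len(X), dtype=int)
    for t in range(1, n_iters + 1):
        labels = assign(X, centroids)
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue  # empty cluster: keep the old centroid
            mean_vec = X[members].mean(axis=0)
            if t % summary_every == 0:
                # Summary step: centroid-based ranking picks the document
                # nearest the mean; its embedding becomes the new centroid.
                best = members[((X[members] - mean_vec) ** 2).sum(-1).argmin()]
                centroids[j] = embed(docs[best])
            else:
                centroids[j] = mean_vec  # ordinary numeric k-means update
    return labels, centroids
```

Swapping `embed(docs[best])` for an LLM call on a k-means++-sampled subset of `docs[members]` gives the k-LLMmeans variant, with at most k calls per summary step.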

Theoretical Properties

  • The algorithm retains the original k‑means objective between summary steps, guaranteeing monotonic reduction of within‑cluster sum‑of‑squares during numeric updates.
  • Summaries act as semantic “prototypes” that can redirect the optimization trajectory, reducing sensitivity to random initialization.
  • If a summary is poor, the method gracefully falls back to vanilla k‑means behavior, preserving convergence guarantees.
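The monotone-descent claim for the numeric phase is just the classical Lloyd property, and it is easy to check empirically. A minimal sketch on toy data (plain k-means iterations, as run between summary steps):

```python
import numpy as np

def lloyd_inertia_trace(X, centroids, n_iters=10):
    """Run plain k-means (Lloyd) updates and record the within-cluster
    sum-of-squares (inertia) after each numeric update."""
    trace = []
    for _ in range(n_iters):
        # Assignment step: nearest centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Update step: arithmetic mean of assigned points.
        for j in range(len(centroids)):
            pts = X[labels == j]
            if len(pts):
                centroids[j] = pts.mean(axis=0)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        trace.append(d.min(axis=1).sum())  # inertia after the update
    return trace
```

Each entry of the trace is non-increasing; a summary step may perturb this value, after which the numeric updates resume their descent from the new centroids.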

Experimental Setup

  • Datasets: Four diverse benchmarks—Bank77 (online banking queries), CLINC (intent classification), GoEmo (emotion detection), and MASSIVE (multilingual intent).
  • Embeddings: DistilBERT, e5‑large, Sentence‑BERT, and OpenAI’s text‑embedding‑3‑small.
  • Summarizers: For k‑NLPmeans – TextRank, centroid‑based, LSA; for k‑LLMmeans – GPT‑3.5‑turbo, GPT‑4o, Llama‑3.3, Claude‑3.7, DeepSeek‑V3.
  • Protocols: 120 centroid‑update iterations; two summarization schedules (single step at iteration 60, and five steps every 20 iterations). The prompt for each dataset is a simple “write a single question that represents the cluster” instruction.
  • Metrics: Clustering accuracy (ACC), precision/recall/F1, and human evaluation of summary readability.
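The two summarization schedules in the protocol can be expressed as a small helper; this is an illustrative reading of the setup (iterations numbered 1..120), not code from the paper:

```python
def is_summary_iteration(t, schedule="periodic", total_iters=120):
    """Return True when iteration t should run a summary step instead of
    the numeric centroid update.
    'single'  : one summary step at the midpoint (iteration 60 of 120);
    'periodic': a summary step every 20 iterations, excluding the final
                iteration (5 steps over 120)."""
    if schedule == "single":
        return t == total_iters // 2
    if schedule == "periodic":
        return t % 20 == 0 and t < total_iters
    return False
```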

Key Results

  1. Accuracy Gains – Even a single summarization step improves ACC by 2–5 % over vanilla k‑means across all embedding‑summarizer combinations. Five summarization steps add another 1–2 % improvement.
  2. LLM Cost Efficiency – k‑LLMmeans requires only k LLM calls per summarization step, independent of the number of documents. This yields comparable performance to recent LLM‑heavy clustering pipelines that make millions of calls, but at a fraction of the cost.
  3. Interpretability – Human judges rate k‑NLPmeans summaries at 4.2/5 and k‑LLMmeans at 4.6/5, confirming that the textual centroids convey cluster semantics clearly.
  4. Streaming Performance – The mini‑batch versions achieve a 70 % reduction in memory usage while maintaining within 1–2 % of the static version’s accuracy, demonstrating suitability for real‑time applications such as social‑media monitoring or live customer‑support logs.
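The streaming results rest on the standard mini-batch k-means update rule (per-center running means with learning rate 1/count) plus a periodic summary hook. A minimal sketch, with the summary hook left as an optional callback since the real version would call a summarizer and re-embed its output:

```python
import numpy as np

def minibatch_step(batch, centroids, counts):
    """One mini-batch k-means update: assign the batch, then move each
    centroid toward its batch members with per-center learning rate 1/count,
    so each centroid tracks the running mean of its assigned points."""
    d = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = d.argmin(axis=1)
    for x, j in zip(batch, labels):
        counts[j] += 1
        eta = 1.0 / counts[j]
        centroids[j] = (1 - eta) * centroids[j] + eta * x
    return labels

def stream_cluster(batches, init_centroids, summarize_every=5, resummarize=None):
    """Low-memory streaming loop: only centroids and counts are kept in
    memory. Every `summarize_every` batches, an optional hook may replace
    centroids with embeddings of textual prototypes."""
    centroids = init_centroids.astype(float).copy()
    counts = np.zeros(len(centroids), dtype=int)
    for i, batch in enumerate(batches, 1):
        minibatch_step(batch, centroids, counts)
        if resummarize is not None and i % summarize_every == 0:
            centroids = resummarize(centroids)
    return centroids
```

Memory stays fixed at O(k·d) regardless of stream length, which is the source of the reported savings over storing the full corpus.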

Contributions

  • Introduction of a simple yet powerful “summary‑as‑centroid” mechanism that makes k‑means interpretable without sacrificing its convergence properties.
  • An LLM‑optional design: a fully offline variant (k‑NLPmeans) and a cost‑controlled LLM‑augmented variant (k‑LLMmeans).
  • Extension to mini‑batch and streaming settings, enabling real‑time clustering with human‑readable prototypes.
  • Release of a new StackExchange‑derived benchmark for evaluating streaming text clustering.

Limitations & Future Work

  • The approach relies on the quality of the summarizer; domain‑specific jargon may degrade extractive methods, while LLMs can still hallucinate or produce overly generic summaries.
  • For extremely large clusters (hundreds of thousands of documents), even the sampling step can become a bottleneck; more sophisticated representative‑selection strategies are needed.
  • Future directions include adaptive summarization frequency based on cluster drift, ensemble summarization (combining extractive and generative outputs), multimodal extensions (incorporating images or audio), and automated prompt optimization to further reduce LLM cost while preserving summary fidelity.

Overall Assessment
The paper delivers a pragmatic solution to a long‑standing tension in text clustering: the trade‑off between algorithmic efficiency and human interpretability. By embedding a lightweight summarization step into the classic k‑means loop, the authors achieve measurable accuracy improvements, dramatically lower LLM usage, and real‑time applicability—all while providing readable cluster prototypes. The method is straightforward to implement, compatible with any existing embedding model, and opens a clear path for future research on interpretable, scalable unsupervised learning for textual data.

