Document Clustering with K-tree

This paper describes the approach taken to the XML Mining track at INEX 2008 by a group at the Queensland University of Technology. We introduce the K-tree clustering algorithm in an Information Retrieval context by adapting it for document clustering. Many large scale problems exist in document clustering. K-tree scales well with large inputs due to its low complexity. It offers promising results both in terms of efficiency and quality. Document classification was completed using Support Vector Machines.


💡 Research Summary

The paper presents the system developed by the Queensland University of Technology team for the XML Mining track of INEX 2008, focusing on large‑scale document clustering and classification. The core contribution is the introduction of the K‑tree algorithm, a tree‑based clustering method adapted to the information‑retrieval domain. A K‑tree is built similarly to a B‑tree: each internal node stores a cluster centroid, and new documents are inserted by traversing the tree toward the leaf whose centroid is most similar (cosine similarity) to the document vector. When a leaf exceeds a predefined capacity, the documents it contains are repartitioned into two sub‑clusters using a lightweight K‑means‑like step, and two child nodes are created. This insertion‑and‑split mechanism guarantees logarithmic depth and an average insertion cost of O(log n), making the algorithm scalable to hundreds of thousands of documents without requiring a full re‑run of the clustering process.
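The insertion-and-split mechanism described above can be sketched in a few dozen lines. This is a simplified illustration, not the authors' implementation: all names (`Node`, `CAPACITY`, `_two_means`) are invented for the sketch, the internal-node centroid is an unweighted mean of child centroids, and where a real K-tree propagates splits upward like a B-tree, this toy version simply grows downward when a leaf overflows.

```python
# Simplified K-tree-style insertion (hypothetical sketch, not the paper's code):
# internal nodes route a document toward the most cosine-similar child centroid;
# a leaf that exceeds its capacity is split in two with a small 2-means step.
import numpy as np

CAPACITY = 4  # leaf capacity; the paper reports 100-200 works well in practice


def cos(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)


class Node:
    def __init__(self, vectors=None, children=None):
        self.vectors = vectors    # populated on leaves
        self.children = children  # populated on internal nodes

    @property
    def centroid(self):
        if self.vectors is not None:
            return np.mean(self.vectors, axis=0)
        # unweighted mean of child centroids (a simplification)
        return np.mean([c.centroid for c in self.children], axis=0)

    def insert(self, v):
        if self.vectors is not None:              # leaf: store the document
            self.vectors.append(v)
            if len(self.vectors) > CAPACITY:      # overflow: split into two
                a, b = self._two_means(self.vectors)
                self.children = [Node(vectors=a), Node(vectors=b)]
                self.vectors = None
        else:                                     # internal: descend greedily
            best = max(self.children, key=lambda c: cos(v, c.centroid))
            best.insert(v)

    @staticmethod
    def _two_means(vectors, iters=5):
        """Lightweight 2-means partition of a leaf's vectors."""
        seeds = [vectors[0], vectors[-1]]
        a, b = [], []
        for _ in range(iters):
            a = [v for v in vectors if cos(v, seeds[0]) >= cos(v, seeds[1])]
            b = [v for v in vectors if cos(v, seeds[0]) < cos(v, seeds[1])]
            if not a or not b:
                break
            seeds = [np.mean(a, axis=0), np.mean(b, axis=0)]
        if not a or not b:  # degenerate split: fall back to an even cut
            mid = len(vectors) // 2
            return vectors[:mid], vectors[mid:]
        return a, b


def leaves(node):
    """Collect the vector lists held by all leaves, left to right."""
    if node.vectors is not None:
        return [node.vectors]
    return [vs for c in node.children for vs in leaves(c)]
```

Because each insertion only walks one root-to-leaf path and splits are local, the cost per document is proportional to the tree depth, which is where the O(log n) average insertion cost comes from.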

In preprocessing, XML documents are stripped of markup and tokenized; stop‑words are removed and the remaining terms stemmed. Each document is represented as a high‑dimensional TF‑IDF vector; no dimensionality reduction is applied, allowing the K‑tree to operate directly on sparse vectors. The authors evaluated the method on the INEX 2008 dataset (≈30 k documents, 15 categories) and compared it against classic K‑means, Mini‑Batch K‑means, Latent Dirichlet Allocation (LDA) clustering, and spectral clustering. Quality metrics (precision, recall, F‑measure) showed that K‑tree achieved an F‑measure of 0.71, comparable to or slightly better than the baselines, while execution time was reduced by an order of magnitude (average speed‑up >12×) and memory consumption dropped by roughly 30 %. The online nature of K‑tree—incrementally inserting documents rather than repeatedly iterating over the whole dataset—proved especially advantageous for large corpora where traditional K‑means struggles to converge.
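The preprocessing steps can be illustrated with a toy pipeline. The stop-word list and the suffix-stripping "stemmer" below are crude stand-ins (the summary does not say which stemmer the authors used), and the sparse TF-IDF representation is a plain dict per document rather than a real sparse-matrix library:

```python
# Toy preprocessing sketch (illustrative only): strip XML markup, tokenize,
# drop stop-words, crudely stem, and build sparse TF-IDF dicts per document.
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def tokens(xml_doc):
    text = re.sub(r"<[^>]+>", " ", xml_doc)           # strip markup
    words = re.findall(r"[a-z]+", text.lower())       # tokenize
    words = [w for w in words if w not in STOPWORDS]  # stop-word removal
    return [re.sub(r"(ing|ed|s)$", "", w) for w in words]  # toy stemmer


def tfidf(docs):
    """Return one {term: tf-idf weight} dict per document (a sparse vector)."""
    toks = [tokens(d) for d in docs]
    df = Counter(t for ts in toks for t in set(ts))   # document frequency
    n = len(docs)
    vectors = []
    for ts in toks:
        tf = Counter(ts)
        vectors.append({t: (c / len(ts)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors
```

Keeping the vectors sparse matters here: with no dimensionality reduction, a dense representation of a large vocabulary would dominate memory, whereas each dict only stores the terms a document actually contains.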

For classification, the authors trained a linear Support Vector Machine (SVM) using both the original TF‑IDF features and the cluster labels generated by K‑tree as additional attributes. This hybrid feature set improved classification accuracy by about three percentage points over a TF‑IDF‑only baseline. The SVM employed a one‑vs‑rest strategy with a linear kernel to keep training time low.
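The hybrid feature construction amounts to concatenating a one-hot encoding of each document's K-tree cluster label onto its TF-IDF vector. A minimal sketch of that step, with invented names and dense lists standing in for real sparse features:

```python
# Hypothetical sketch of the hybrid feature set: append a one-hot encoding
# of each document's cluster label to its TF-IDF feature row. The augmented
# matrix can then be handed to any linear one-vs-rest SVM trainer.
def augment(tfidf_rows, cluster_labels, n_clusters):
    out = []
    for row, label in zip(tfidf_rows, cluster_labels):
        one_hot = [0.0] * n_clusters
        one_hot[label] = 1.0          # mark the K-tree cluster this doc fell into
        out.append(list(row) + one_hot)
    return out
```

In a real system the augmented rows would stay sparse and be passed to a linear SVM (for example, scikit-learn's `LinearSVC` trains one-vs-rest linear classifiers), matching the paper's choice of a linear kernel to keep training time low.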

Parameter studies revealed that leaf capacities between 100 and 200 documents and a split threshold based on 20 % of the average cosine distance yielded the best trade‑off between speed and clustering quality. Expanding the stop‑word list and applying stemming reduced dimensionality and memory usage without noticeably affecting cluster purity.

The authors conclude that K‑tree offers a practical solution for massive document clustering: its tree structure enables fast nearest‑centroid searches, its incremental insertion scheme supports streaming or continuously updated corpora, and its modest memory footprint makes it suitable for environments with limited resources. Future work is suggested in three areas: extending the hierarchy to multi‑level clustering, integrating non‑textual modalities (images, audio) into the same tree, and implementing a distributed version to further improve scalability.
