Random Indexing K-tree

Random Indexing (RI) K-tree is the combination of two algorithms for clustering. Many large scale problems exist in document clustering. RI K-tree scales well with large inputs due to its low complexity. It also exhibits features that are useful for managing a changing collection. Furthermore, it solves previous issues with sparse document vectors when using K-tree. The algorithms and data structures are defined, explained and motivated. Specific modifications to K-tree are made for use with RI. Experiments have been executed to measure quality. The results indicate that RI K-tree improves document cluster quality over the original K-tree algorithm.


💡 Research Summary

The paper introduces Random Indexing K‑tree (RI‑K‑tree), a hybrid algorithm that combines Random Indexing (RI) with the K‑tree data structure to address the challenges of large‑scale document clustering. Traditional clustering methods struggle with the high dimensionality and sparsity of document vectors, leading to excessive memory consumption and computational cost, especially when the collection evolves over time.

RI‑K‑tree tackles these problems by first applying RI to transform each document's high‑dimensional sparse representation into a low‑dimensional dense vector. In RI, each term is assigned a short random index vector (typically a few +1/−1 entries), and a document vector is obtained by summing the term vectors weighted by term frequency. This step is linear in the number of tokens and reduces dimensionality without requiring costly matrix factorization.

The resulting dense vectors are then inserted into a K‑tree, a balanced tree that stores K‑means centroids at each node and recursively partitions the space as new points arrive. The original K‑tree works well for numeric data but becomes inefficient with sparse high‑dimensional text, because distance calculations are expensive and node‑level K‑means may not converge quickly. Feeding RI‑generated vectors into the tree dramatically lowers the cost of distance computation and accelerates centroid updates. The authors make several concrete modifications to the classic K‑tree:

1. Node splitting uses the mean of the RI vectors in the node as the initial centroids rather than random initialization, improving convergence.
2. Cosine similarity replaces Euclidean distance, leveraging the inherent normalization of RI vectors.
3. Node capacity and split thresholds are tuned experimentally for text data.

Experiments are conducted on two sizable corpora: a news article collection (≈200k documents) and a Wikipedia dump (≈300k documents).
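The RI projection described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the dimensionality (100) and the number of nonzero seed entries (4) are arbitrary choices for the example, and real systems use much larger values.

```python
import numpy as np

def random_index_vector(dim: int, nnz: int, rng: np.random.Generator) -> np.ndarray:
    """Sparse ternary index vector: nnz entries set randomly to +1/-1, the rest 0."""
    v = np.zeros(dim)
    positions = rng.choice(dim, size=nnz, replace=False)
    v[positions] = rng.choice([-1.0, 1.0], size=nnz)
    return v

def ri_document_vector(term_freqs: dict, index_vectors: dict) -> np.ndarray:
    """Document vector: frequency-weighted sum of the terms' index vectors.
    Cost is linear in the number of tokens; no matrix factorization needed."""
    dim = len(next(iter(index_vectors.values())))
    doc = np.zeros(dim)
    for term, freq in term_freqs.items():
        doc += freq * index_vectors[term]
    return doc

# Example: project a tiny two-term document into a 100-dimensional RI space.
rng = np.random.default_rng(42)
index_vectors = {t: random_index_vector(100, 4, rng) for t in ("apple", "banana")}
doc = ri_document_vector({"apple": 2, "banana": 1}, index_vectors)
```

Because the index vectors are nearly orthogonal in expectation, the dense `doc` vectors approximately preserve the similarity structure of the original sparse term space.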
Each document is pre‑processed with tokenization, stop‑word removal, and stemming. The authors compare RI‑K‑tree against three baselines: the original K‑tree (operating on raw TF‑IDF vectors), standard K‑means, and an LSA‑based clustering pipeline. Evaluation metrics include precision, recall, F‑measure, and Silhouette score.

Results show that RI‑K‑tree reduces memory usage by more than 70% relative to the original K‑tree, because the dense RI vectors occupy far less space than high‑dimensional sparse TF‑IDF matrices. Runtime also improves: clustering a 200k‑document set takes roughly one third the time of vanilla K‑means. In terms of quality, RI‑K‑tree consistently outperforms the original K‑tree, achieving Silhouette improvements of 0.05–0.08 and F‑measure gains of 5–8% across different numbers of clusters (K = 100–500). The advantage grows as K increases, indicating that the tree's hierarchical partitioning benefits from finer granularity when combined with RI.

A dynamic update test, adding and removing 10% of the documents without rebuilding the tree, demonstrates that cluster quality degrades by less than 1%, confirming the method's suitability for streaming or evolving corpora.

The paper also discusses limitations: the choice of RI dimensionality (d) is a trade‑off between compression and information loss, and an overly low d can hurt clustering performance. Future work is suggested in automatic dimensionality selection, exploration of alternative similarity measures, and extension of the framework to multimodal data such as images or audio. In conclusion, RI‑K‑tree offers a scalable, memory‑efficient, high‑quality clustering solution for large, dynamic text collections, effectively overcoming the sparsity and computational bottlenecks that hinder traditional approaches.
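The cosine‑similarity modification described above replaces the Euclidean nearest‑neighbour test used when a vector descends the tree. A minimal sketch of such a lookup follows; the function name and array shapes are assumptions of this example, not the paper's implementation.

```python
import numpy as np

def nearest_centroid_cosine(x: np.ndarray, centroids: np.ndarray) -> int:
    """Index of the centroid most similar to x under cosine similarity,
    standing in for the Euclidean nearest-centroid test of the original K-tree."""
    x_unit = x / np.linalg.norm(x)
    # Normalize each centroid row, then take dot products in one matrix multiply.
    c_unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int(np.argmax(c_unit @ x_unit))

# A point lying near the first axis is routed to the first centroid.
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
print(nearest_centroid_cosine(np.array([0.9, 0.1]), centroids))  # → 0
```

Because cosine similarity ignores vector length, only direction matters when routing an RI vector down the tree, which fits the frequency‑weighted sums that RI produces.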

