Online Density-Based Clustering for Real-Time Narrative Evolution Monitoring

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Automated narrative intelligence systems for social media monitoring face significant scalability challenges when relying on batch clustering methods to process continuous data streams. We investigate replacing offline HDBSCAN with online density-based clustering algorithms in a production narrative report generation pipeline that processes large volumes of multilingual social media data. While HDBSCAN effectively discovers hierarchical clusters and handles noise, its batch-only nature requires full retraining for each time window, limiting scalability and real-time adaptability. We evaluate online clustering methods with respect to cluster quality, computational efficiency, memory footprint, and integration with downstream narrative extraction. Our evaluation combines standard clustering metrics, narrative-specific measures, and human validation of cluster correctness to assess both structural quality and semantic interpretability. Experiments using sliding-window simulations on historical data from the Ukrainian information space reveal trade-offs between temporal stability and narrative coherence, with DenStream achieving the strongest overall performance. These findings bridge the gap between batch-oriented clustering approaches and the streaming requirements of large-scale narrative monitoring systems.


💡 Research Summary

The paper addresses a critical bottleneck in large‑scale narrative intelligence systems that monitor social media streams: the reliance on batch‑only clustering, specifically HDBSCAN, which must reload and recluster the entire dataset for each time window. While HDBSCAN excels at discovering hierarchical, variable‑density clusters and handling noise, its O(N log N) computational cost and high memory footprint make it impractical for real‑time, high‑volume multilingual streams such as those from the Ukrainian information space.

To overcome these limitations, the authors evaluate three online density‑based clustering algorithms—DBSTREAM, DenStream, and TextClust—implemented via the River library. They embed all documents using a multilingual MiniLM‑v2 model, reduce dimensionality with UMAP, and then feed the reduced vectors into each clustering method. The experimental protocol mimics a production pipeline: six days of historical data (≈ 69 k documents) are used for initialization, followed by incremental updates on a target day (≈ 11 k documents). This sliding‑window simulation reflects realistic streaming conditions where only recent context is available before processing new data.
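The warm-up-then-update protocol described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: `ToyOnlineClusterer` is a hypothetical leader-style stand-in (it is not DenStream), the `learn_one` / `predict_one` method names only mirror the convention used by streaming libraries such as River, and the 2-D points stand in for UMAP-reduced embeddings.

```python
import math


class ToyOnlineClusterer:
    """Hypothetical stand-in for an online clusterer (NOT DenStream).

    A point joins the nearest cluster if it lies within `radius` of that
    cluster's centroid; otherwise it starts a new cluster.
    """

    def __init__(self, radius):
        self.radius = radius
        self.centroids = []  # running mean of each cluster
        self.counts = []     # number of points absorbed by each cluster

    def _nearest(self, x):
        best, best_dist = None, float("inf")
        for i, c in enumerate(self.centroids):
            d = math.dist(x, c)
            if d < best_dist:
                best, best_dist = i, d
        return best, best_dist

    def learn_one(self, x):
        i, d = self._nearest(x)
        if i is None or d > self.radius:
            self.centroids.append(list(x))
            self.counts.append(1)
            return
        # Incremental mean update: no past points need to be stored.
        self.counts[i] += 1
        n = self.counts[i]
        self.centroids[i] = [c + (v - c) / n for c, v in zip(self.centroids[i], x)]

    def predict_one(self, x):
        i, d = self._nearest(x)
        return i if i is not None and d <= self.radius else -1


def run_window(model, warmup_days, target_day):
    """Initialise on historical days, update on the target day, then label it."""
    for day in warmup_days:          # initialisation phase (six days in the paper)
        for x in day:
            model.learn_one(x)
    for x in target_day:             # incremental update, no full retraining
        model.learn_one(x)
    return [model.predict_one(x) for x in target_day]


# Illustrative data only: two warm-up "days" and a target day that
# introduces a point far from every existing cluster.
warmup = [[(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)], [(0.0, 0.1), (5.1, 5.0)]]
target = [(0.05, 0.05), (5.0, 5.1), (10.0, 0.0)]

model = ToyOnlineClusterer(radius=1.0)
labels = run_window(model, warmup, target)
print(labels)
```

The point of the sketch is the control flow: the model state carries over from the warm-up days, so the target day costs only one pass over its own documents. Swapping in a real online density-based clusterer would change the model class but not `run_window`.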

Evaluation combines standard clustering metrics (Silhouette Score, Davies‑Bouldin Index) with narrative‑specific measures (Narrative Distinctness, Contingency, Variance) that assess semantic separability and intra‑cluster cohesion of discovered storylines. Additionally, a human assessment involving three domain experts rates each generated narrative cluster as correct or incorrect, yielding inter‑annotator agreement (Cohen’s κ, Krippendorff’s α) and acceptance accuracy.
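For readers unfamiliar with the two standard metrics, the following is a small pure-Python sketch of their textbook definitions. It assumes Euclidean distance and hard cluster labels; the distance metric and feature space the authors actually used are not stated in this summary, and the four sample points are made up for illustration.

```python
import math
from collections import defaultdict


def group_by_label(points, labels):
    clusters = defaultdict(list)
    for p, l in zip(points, labels):
        clusters[l].append(p)
    return clusters


def mean_dist(p, others):
    return sum(math.dist(p, q) for q in others) / len(others)


def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all points; higher is better."""
    clusters = group_by_label(points, labels)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(mean_dist(p, m) for other, m in clusters.items() if other != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def davies_bouldin_index(points, labels):
    """Mean over clusters of the worst (S_i + S_j) / d(c_i, c_j); lower is better."""
    clusters = group_by_label(points, labels)
    centroids = {l: [sum(c) / len(m) for c in zip(*m)] for l, m in clusters.items()}
    scatter = {l: mean_dist(centroids[l], m) for l, m in clusters.items()}
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


# Two tight, well-separated clusters (illustrative data).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
labels = [0, 0, 1, 1]
sil = silhouette_score(points, labels)
dbi = davies_bouldin_index(points, labels)
print(round(sil, 3), round(dbi, 3))
```

Both functions need at least two clusters. The pairwise distance loop is quadratic in the number of points, so on tens of thousands of documents a vectorised library implementation would be the practical choice.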

Results show that DenStream outperforms both the batch baseline and the other online methods on most measures. DenStream achieves a Silhouette Score of 0.685 (vs. 0.592 for HDBSCAN) and a Davies‑Bouldin Index of 0.453 (vs. 0.550 for HDBSCAN; lower is better), indicating superior cluster cohesion and separation. Narrative Distinctness reaches 0.319, slightly below HDBSCAN’s 0.352 but well above DBSTREAM’s 0.266. Operationally, DenStream processes the daily batch in 3.56 seconds of training and 1.7 seconds of prediction. Training is far faster than HDBSCAN’s 13 seconds, while prediction is slightly slower than HDBSCAN’s 1.3 seconds. DenStream also generates only 303 clusters compared with HDBSCAN’s 1,063, reducing over‑segmentation.

Human evaluation corroborates these quantitative findings. DenStream attains an acceptance accuracy of 84 % and the highest inter‑annotator agreement (κ = 0.83, α = 0.83), suggesting that its clusters are not only statistically sound but also more interpretable and reliable for analysts. DBSTREAM performs poorly on traditional metrics (Silhouette 0.327, DBI 1.220) and yields lower human acceptance (81 %). TextClust is mentioned but not fully evaluated in this study.
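As a reminder of what the agreement figures measure, here is a minimal sketch of Cohen's κ for correct/incorrect ratings. Cohen's κ is defined for two annotators, so with three experts some aggregation is needed; averaging over annotator pairs below is an assumption, not a procedure confirmed by this summary, and the ratings are invented for illustration.

```python
from itertools import combinations


def cohen_kappa(r1, r2):
    """Chance-corrected agreement between two annotators' label lists."""
    n = len(r1)
    observed = sum(x == y for x, y in zip(r1, r2)) / n
    categories = set(r1) | set(r2)
    # Agreement expected if both annotators labelled independently
    # with their own marginal label frequencies.
    expected = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(ratings):
    """Average Cohen's kappa over all annotator pairs (assumed aggregation)."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical ratings: 1 = cluster judged correct, 0 = incorrect.
annotator_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
annotator_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
annotator_c = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

kappa_ab = cohen_kappa(annotator_a, annotator_b)
kappa_mean = mean_pairwise_kappa([annotator_a, annotator_b, annotator_c])
print(round(kappa_ab, 3), round(kappa_mean, 3))
```

Note how annotators A and B agree on 9 of 10 clusters, yet κ is well below 0.9: when most clusters are rated correct, a large share of raw agreement is expected by chance, which is why κ and α are reported alongside acceptance accuracy.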

The authors also analyze implementation nuances in the River library, highlighting the sensitivity of decay parameters and micro‑cluster lifecycle management. Improper decay settings can either retain stale clusters, hampering concept‑drift adaptation, or discard useful clusters prematurely. They propose automated tuning and monitoring of these hyper‑parameters for production deployment.
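The sensitivity to decay can be made concrete with the fading function from the original DenStream formulation, f(t) = 2^(−λt), where λ is the decay factor, μ the core weight threshold, and β the potential-cluster factor. The sketch below only evaluates those formulas; how River's parameters map onto λ, β, and μ is an assumption to verify against the library, and the numeric values are illustrative.

```python
import math


def faded_weight(age, decay):
    """Weight of one point `age` time units after arrival: 2 ** (-decay * age)."""
    return 2.0 ** (-decay * age)


def half_life(decay):
    """Time after which a point's weight has dropped to one half."""
    return 1.0 / decay


def steady_state_weight(decay, points_per_step=1.0):
    """Limit of the total stream weight under a constant arrival rate.

    Geometric series: v * sum_{k>=0} 2^(-decay*k) = v / (1 - 2^(-decay)).
    """
    return points_per_step / (1.0 - 2.0 ** (-decay))


def min_check_interval(decay, beta, mu):
    """T_p = ceil((1/decay) * log2(beta*mu / (beta*mu - 1))).

    Shortest time in which a potential micro-cluster that receives no new
    points can fade below the beta*mu threshold. Requires beta*mu > 1.
    """
    bm = beta * mu
    if bm <= 1.0:
        raise ValueError("beta * mu must exceed 1")
    return math.ceil(math.log2(bm / (bm - 1.0)) / decay)


# Illustrative values only: compare a slow and a fast decay setting.
for decay in (0.001, 0.25):
    print(
        decay,
        half_life(decay),
        round(steady_state_weight(decay), 1),
        min_check_interval(decay, beta=0.5, mu=4.0),
    )
```

A small λ gives a long half-life, so stale micro-clusters keep enough weight to survive and adaptation to concept drift slows down. A large λ shrinks the half-life and the attainable weight, so clusters that are only briefly quiet can fall below the threshold and be pruned.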

In conclusion, the study demonstrates that DenStream is a viable online replacement for HDBSCAN in narrative monitoring pipelines, delivering better cluster quality, reduced computational and memory demands, and higher analyst confidence. This work bridges the gap between batch‑oriented clustering and the streaming requirements of modern intelligence systems, offering a concrete evaluation framework and practical guidance for organizations that need real‑time detection of emerging narratives, misinformation campaigns, or geopolitical discourse.

