Discovering Emerging Topics in Social Streams via Link Anomaly Detection
Detection of emerging topics is now receiving renewed interest, motivated by the rapid growth of social networks. Conventional term-frequency-based approaches may not be appropriate in this context, because the information exchanged includes not only text but also images, URLs, and videos. We focus on the social aspects of these networks, that is, the links between users that are generated dynamically, intentionally or unintentionally, through replies, mentions, and retweets. We propose a probability model of the mentioning behaviour of a social network user, and propose to detect the emergence of a new topic from the anomaly measured through the model. We combine the proposed mention anomaly score with a recently proposed change-point detection technique based on the Sequentially Discounting Normalized Maximum Likelihood (SDNML), or with Kleinberg’s burst model. Aggregating anomaly scores from hundreds of users, we show that we can detect emerging topics based only on the reply/mention relationships in social network posts. We demonstrate our technique on a number of real data sets we gathered from Twitter. The experiments show that the proposed mention-anomaly-based approaches can detect new topics at least as early as the conventional term-frequency-based approach, and sometimes much earlier when the keyword is ill-defined.
💡 Research Summary
The paper addresses the problem of detecting emerging topics in social media streams, focusing on the link structure formed by user mentions rather than relying solely on textual content. Traditional topic detection methods depend on term frequencies, TF‑IDF, or probabilistic topic models, which become less effective when posts contain images, videos, or URLs, or when keywords are ambiguous. The authors propose a probabilistic model of mentioning behavior that captures two aspects of each post: (i) the number of mentions (k) and (ii) the set of mentioned users (V). The number of mentions is modeled with a geometric distribution parameterized by θ, with a Beta(α,β) prior, yielding a closed‑form predictive distribution for k. The selection of mentioned users is treated as a multinomial distribution with probabilities πᵥ. To avoid zero‑probability issues for users not seen in the training window, a Chinese Restaurant Process (CRP) prior is employed: it reserves probability mass proportional to the concentration parameter γ for “new” users and distributes the remaining mass among observed users in proportion to how often each has been mentioned.
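The closed form mentioned above follows from Beta-geometric conjugacy. The derivation below is a sketch under two assumptions that the summary does not state: the geometric distribution is taken over k = 0, 1, 2, ..., and the training window T is summarized by n (the number of posts) and m (the total number of mentions in them). The CRP line uses the standard CRP predictive rule, with m_v denoting the number of times user v was mentioned in T.

```latex
\begin{align*}
% Assumed parameterization: geometric over k = 0, 1, 2, ...
P(k \mid \theta) &= \theta\,(1-\theta)^{k},
  \qquad \theta \sim \mathrm{Beta}(\alpha, \beta) \\
% Likelihood of n posts with m mentions in total is \theta^{n} (1-\theta)^{m}
p(\theta \mid T) &= \mathrm{Beta}(\theta;\; n+\alpha,\; m+\beta) \\
% Predictive distribution: integrate \theta out against the posterior
P(k \mid T) &= \int_{0}^{1} \theta\,(1-\theta)^{k}\, p(\theta \mid T)\, d\theta
  = \frac{B(n+\alpha+1,\; m+\beta+k)}{B(n+\alpha,\; m+\beta)} \\
% Standard CRP predictive rule with concentration \gamma
P(v \mid T) &=
  \begin{cases}
    \dfrac{m_v}{m+\gamma} & \text{if } m_v \ge 1, \\[2ex]
    \dfrac{\gamma}{m+\gamma} & \text{otherwise.}
  \end{cases}
\end{align*}
```

Here B denotes the Beta function. With an empty window (n = m = 0) and α = β = 1, the predictive probability of a post with no mentions is B(2, 1)/B(1, 1) = 1/2.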
For each new post x = (t, u, k, V) from user u at time t, the model computes an anomaly score s(x) = –log P(k|T) – Σ_{v∈V}log P(v|T), where T denotes the recent training window (30 days in the experiments). This score quantifies how unlikely the post’s mentioning pattern is compared to the user’s historical behavior.
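The score can be sketched in a few lines of Python. This is an illustration rather than the authors' implementation: it assumes a geometric distribution over k = 0, 1, 2, ..., the standard CRP predictive rule, and a per-user history stored as a list of mention lists. The function name `anomaly_score` and the default hyperparameter values are hypothetical.

```python
from collections import Counter
from math import lgamma, log


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def anomaly_score(mentioned, history, alpha=1.0, beta=1.0, gamma=1.0):
    """Return -log P(k|T) - sum over v of log P(v|T) for one post.

    mentioned: user IDs mentioned in the new post (k = len(mentioned)).
    history:   one list of mentioned user IDs per post in the window T.
    alpha, beta, gamma: illustrative hyperparameters (assumed values).
    """
    k = len(mentioned)
    n = len(history)                      # posts in the training window
    counts = Counter(v for post in history for v in post)
    m = sum(counts.values())              # total mentions in the window

    # Beta-geometric predictive probability of seeing k mentions.
    log_p_k = log_beta(n + alpha + 1, m + beta + k) - log_beta(n + alpha, m + beta)

    # CRP predictive probability of each mentioned user.
    log_p_v = 0.0
    for v in mentioned:
        m_v = counts.get(v, 0)
        log_p_v += log((m_v if m_v > 0 else gamma) / (m + gamma))

    return -log_p_k - log_p_v
```

A user who usually mentions "a" gets a lower score for mentioning "a" again than for mentioning a never-seen account, which is the behavior the model is designed to capture.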
The per‑post scores are aggregated over a fixed time window τ (e.g., one minute) to produce a time series s′_j = (1/τ) Σ_{t_i∈window_j} s(x_i). To detect abrupt changes in this series, the authors adopt a two‑layer Sequentially Discounting Normalized Maximum Likelihood (SDNML) coding scheme. In the first layer, an autoregressive (AR) model is fitted to s′_j, and the SDNML density p_SDNML(x_j | x^{j‑1}) of each observation given its past is computed. Its negative logarithm (the log‑loss) is smoothed over a window κ to obtain an intermediate score y_j. The second layer repeats the SDNML procedure on y_j, yielding a final change‑point score Score(y_j).
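The aggregation step alone can be sketched as follows. The sketch covers only the windowed sum, not the SDNML layers, and it assumes timestamps in seconds and that windows without posts contribute a score of zero; neither detail is specified in the summary. The name `aggregate_scores` is hypothetical.

```python
from collections import defaultdict


def aggregate_scores(posts, tau):
    """Aggregate per-post anomaly scores into windows of length tau.

    posts: iterable of (timestamp_seconds, score) pairs.
    tau:   window length in seconds.
    Returns a list with (1/tau) * (sum of scores) for each consecutive
    window, from the earliest to the latest window that contains a post.
    Windows with no posts are assumed to have an aggregate of 0.0.
    """
    totals = defaultdict(float)
    for t, s in posts:
        totals[int(t // tau)] += s        # index of the window holding t
    if not totals:
        return []
    first, last = min(totals), max(totals)
    return [totals.get(j, 0.0) / tau for j in range(first, last + 1)]
```

For example, with one-minute windows, posts at 0 s and 30 s fall into the same window, while a post at 130 s falls into the third window, leaving the second window empty.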
Because the distribution of Score(y_j) evolves over time, a Dynamic Threshold Optimization (DTO) algorithm is applied. DTO maintains a histogram of recent scores, updates it with discounting, and selects a threshold η(j) such that the tail probability beyond η(j) does not exceed a predefined false‑alarm rate ρ (set to 0.05). An alarm is raised whenever Score(y_j) ≥ η(j).
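A simplified sketch of such a discounted-histogram threshold is shown below. It follows the description above rather than the exact DTO update rule: the score range [lo, hi], the number of bins, the discount rate, the uniform initial histogram, and the class name `DynamicThreshold` are all assumptions made for illustration.

```python
class DynamicThreshold:
    """Discounted histogram of scores with a tail-probability threshold."""

    def __init__(self, lo, hi, n_bins=20, rho=0.05, discount=0.005):
        self.lo, self.hi, self.n_bins = lo, hi, n_bins
        self.rho = rho                    # target false-alarm rate
        self.discount = discount          # weight given to the newest score
        self.hist = [1.0 / n_bins] * n_bins   # start from a uniform histogram

    def _bin(self, score):
        """Index of the bin containing score, clamped to the valid range."""
        frac = (score - self.lo) / (self.hi - self.lo)
        return min(self.n_bins - 1, max(0, int(frac * self.n_bins)))

    def threshold(self):
        """Upper edge of the first bin where cumulative mass >= 1 - rho."""
        width = (self.hi - self.lo) / self.n_bins
        cum = 0.0
        for h, mass in enumerate(self.hist):
            cum += mass
            if cum >= 1.0 - self.rho:
                return self.lo + (h + 1) * width
        return self.hi

    def update(self, score):
        """Test score against the current threshold, then learn from it."""
        alarm = score >= self.threshold()
        # Decay old mass and add the new observation; total mass stays 1.
        self.hist = [(1.0 - self.discount) * q for q in self.hist]
        self.hist[self._bin(score)] += self.discount
        return alarm
```

Because old mass decays geometrically, the threshold tracks the recent score distribution: after a long run of low scores it settles just above them, so a sudden high score raises an alarm.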
The method was evaluated on four real Twitter datasets collected via the collaborative “Togetter” service: “Job hunting” (200 participants), “YouTube” (160), “NASA” (90), and “BBC” (47). For each dataset, the authors compared their mention‑anomaly pipeline against a baseline that monitors the frequency of a manually selected keyword related to the topic, applying the same DTO but without SDNML (keyword frequencies are sparse, making SDNML ineffective). Additionally, a two‑state version of Kleinberg’s burst detection model was run on both the mention‑anomaly scores and the keyword frequencies.
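For reference, a two-state version of Kleinberg's burst model over inter-arrival gaps can be sketched as below. How the paper converts anomaly scores or keyword counts into events is not described in this summary, so the input here is simply a list of gaps between consecutive events. The rate ratio `s`, the transition-cost parameter `gamma`, and the function name are illustrative choices.

```python
from math import log


def two_state_bursts(gaps, s=2.0, gamma=1.0):
    """Label each inter-arrival gap with state 0 (base) or 1 (burst).

    Each state emits gaps from an exponential density; the burst state
    has a rate s times the base rate n/T. Moving from state 0 to state 1
    costs gamma * ln(n); moving back down is free. The minimum-cost
    state sequence is found with the Viterbi algorithm.
    """
    n, total = len(gaps), sum(gaps)
    rates = [n / total, s * n / total]
    up_cost = gamma * log(n)

    def emit(state, x):
        # Negative log of the exponential density rate * exp(-rate * x).
        return rates[state] * x - log(rates[state])

    cost = [0.0, float("inf")]            # the automaton starts in state 0
    back = []                             # back[t][j]: best predecessor of j
    for x in gaps:
        new_cost, ptr = [], []
        for j in (0, 1):
            cands = [cost[i] + (up_cost if j > i else 0.0) for i in (0, 1)]
            best = 0 if cands[0] <= cands[1] else 1
            new_cost.append(cands[best] + emit(j, x))
            ptr.append(best)
        cost = new_cost
        back.append(ptr)

    # Trace the optimal path back from the cheaper final state.
    state = 0 if cost[0] <= cost[1] else 1
    path = [state]
    for ptr in reversed(back[1:]):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

A run of short gaps surrounded by long ones is labeled as a burst only when the saving in emission cost exceeds the one-off cost of entering the burst state, which is what suppresses isolated short gaps.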
Results show that the aggregated mention‑anomaly scores rise sharply at the onset of a new topic, often earlier than the keyword‑frequency baseline. In the “NASA” and “BBC” cases—where the relevant keywords are ambiguous or the content is primarily non‑textual—the proposed approach detected the topic 10–30 minutes before the keyword method. Precision, recall, and F‑measure of the mention‑based burst detection also outperformed the keyword‑based counterpart. The experiments confirm that monitoring the dynamics of user mentions provides a robust early‑warning signal for emerging topics, especially when textual cues are weak or noisy.
Key contributions of the paper are: (1) a probabilistic model of user mentioning behavior that yields per‑post anomaly scores; (2) an aggregation and change‑point detection framework that combines SDNML coding with adaptive thresholding (DTO); (3) empirical evidence that link‑based anomaly detection can match or surpass traditional term‑frequency methods in real‑world social streams.
Limitations include sensitivity to the CRP hyperparameter γ for low‑activity users, and the fact that only the count and identity of mentions are modeled—semantic content of the mentions or higher‑order network structures are ignored. Future work may extend the model to incorporate textual or visual features of the mentioned content, apply graph neural networks to capture evolving mention networks, and explore distributed implementations for large‑scale streaming environments.