Empirical analysis of collective human behavior for extraordinary events in blogosphere

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

To uncover the underlying mechanisms of collective human dynamics, we survey more than 1.8 billion blog entries and observe the statistical properties of word appearances. We focus on words that show dynamic growth and decay with a tendency to diverge on a certain day. After careful pretreatment and fitting, we find that power laws generally approximate the functional forms of growth and decay, with exponent values between -0.1 and -2.5. We also observe news words whose frequencies increase suddenly and then decay following power laws. To explain these dynamics, we propose a simple model of posting blogs involving a keyword, and check its validity directly against the data. The model suggests that bloggers not only respond to the latest number of blogs but also feel deadline pressure from the divergence day. Our empirical results can be used to predict the number of blogs in advance and to estimate the time needed to return to the normal fluctuation level.


💡 Research Summary

This paper presents a large‑scale empirical investigation of collective human behavior in the Japanese blogosphere, focusing on the temporal dynamics of “peak words” – keywords whose frequency rises sharply around a specific calendar date and then decays. The authors collected 1.8 billion blog entries from 15 million blogs over a four‑year period (Nov 1 2006 – Oct 31 2010) using the “Kuchikomi@kakaricho” API, which includes a built‑in spam filter (middle level) to exclude automatically generated advertising posts that account for roughly 40 % of the Japanese blogosphere. Japanese text was tokenized with the MeCab morphological analyzer, and multi‑word expressions (e.g., “April‑Fool”) were added to the dictionary to treat them as single tokens.

Data preprocessing
Two systematic biases were identified and corrected. First, a circadian activity pattern showed an artificial spike at exactly 00:00, likely due to a timestamp error. By examining 10 000 random bloggers, the authors found that activity is lowest around 04:00 and proposed shifting the daily boundary to 05:00. The corrected daily count x'_j(t) for word j on day t is obtained by a linear combination of the original count and the following day’s count (weight w = 0). Second, non‑stationary fluctuations in the total number of blog posts (e.g., a sudden drop in February 2007 caused by search‑engine maintenance) were removed by normalizing each word’s daily count by the mean total daily traffic over the whole observation period. The resulting normalized frequency x_j(t) can be interpreted as the probability that a random blog contains word j on day t.
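The two corrections can be illustrated with a minimal Python sketch. It is not the authors' procedure: it assumes raw post timestamps are available, so the 05:00 day boundary is applied by shifting each timestamp directly rather than through the weighted combination of daily counts mentioned above, and it normalizes by the number of posts on the same shifted day, which is one reading of "the probability that a random blog contains word j on day t". The names `shifted_day` and `daily_fraction`, and the `(timestamp, tokens)` input layout, are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

DAY_START_HOUR = 5  # day boundary proposed in the summary (05:00)


def shifted_day(ts, start_hour=DAY_START_HOUR):
    """Return the calendar date a post belongs to when days start at start_hour."""
    return (ts - timedelta(hours=start_hour)).date()


def daily_fraction(posts, word):
    """Fraction of posts per shifted day that contain `word`.

    posts: iterable of (datetime, set_of_tokens) pairs (hypothetical layout).
    Returns {date: fraction}; dividing by the same-day total is an assumption,
    not the normalization formula of the paper.
    """
    totals = Counter()
    hits = Counter()
    for ts, tokens in posts:
        day = shifted_day(ts)
        totals[day] += 1
        if word in tokens:
            hits[day] += 1
    return {day: hits[day] / totals[day] for day in totals}


posts = [
    (datetime(2010, 4, 1, 12, 0), {"april-fool"}),
    (datetime(2010, 4, 2, 1, 30), {"april-fool"}),  # late night, counted as April 1
    (datetime(2010, 4, 2, 4, 59), {"lunch"}),       # still April 1 under the 05:00 rule
    (datetime(2010, 4, 2, 5, 0), {"lunch"}),        # first post of April 2
]
fractions = daily_fraction(posts, "april-fool")
```

With this boundary, posts written after midnight but before 05:00 are attributed to the previous day, which matches the observation that activity is lowest around 04:00.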

Peak‑word selection
The authors defined three categories of peak words: (1) Event – names of 14 public holidays and 16 major annual events (e.g., “Marine Day”), (2) Date – 365 expressions of month‑day (e.g., “May 9th”), and (3) News – words that appear abruptly after a sudden event such as an earthquake, the name of a deceased public figure, or a Nobel laureate. For each candidate, the authors identified the day tc with the maximal frequency and called the monotonic increase before tc the “fore‑slope” and the monotonic decrease after tc the “after‑slope”.
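The definitions of tc, fore‑slope, and after‑slope can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `find_peak_and_slopes` is hypothetical, "monotonic" is read as strictly monotonic, and ties for the maximum are resolved by taking the earliest day.

```python
def find_peak_and_slopes(series):
    """Locate the peak day and the monotonic runs around it.

    series: list of daily frequencies for one word.
    Returns (tc, (fore_start, tc), (tc, after_end)) as index pairs.
    """
    # Peak day: index of the maximal frequency (earliest one if tied).
    tc = max(range(len(series)), key=series.__getitem__)

    # Fore-slope: walk backwards while the series keeps strictly increasing.
    start = tc
    while start > 0 and series[start - 1] < series[start]:
        start -= 1

    # After-slope: walk forwards while the series keeps strictly decreasing.
    end = tc
    while end < len(series) - 1 and series[end + 1] < series[end]:
        end += 1

    return tc, (start, tc), (tc, end)


counts = [1, 1, 2, 4, 9, 5, 3, 2, 2, 1]
tc, fore, after = find_peak_and_slopes(counts)
```

For real data with day-to-day noise, a strict monotonicity test would cut the slopes short; some smoothing or tolerance would likely be needed, and the summary does not say how the authors handled this.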

Model fitting and statistical validation
Two functional forms were fitted to each slope: a power‑law decay
(x_j(t)-\bar{x}_j = A_j |t_c - t|^{-\alpha_j})
and an exponential decay
(x_j(t)-\bar{x}_j = B_j \exp
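As a rough illustration of the power‑law form, the parameters A_j and α_j can be estimated by ordinary least squares on log‑transformed data. This is a minimal sketch, not the fitting method used in the paper; the function name `fit_power_law` is hypothetical, and the inputs are assumed to be the lags |t_c - t| and the excess frequencies x_j(t) - x̄_j on one slope.

```python
import math


def fit_power_law(lags, excess):
    """Fit excess = A * lag**(-alpha) by a straight line in log-log space.

    lags:   values of |t_c - t| (points with lag <= 0 are skipped)
    excess: values of x_j(t) - mean level (points with excess <= 0 are skipped)
    Returns (A, alpha).
    """
    pts = [(math.log(d), math.log(y))
           for d, y in zip(lags, excess) if d > 0 and y > 0]
    if len(pts) < 2:
        raise ValueError("need at least two positive points to fit a line")

    n = len(pts)
    mean_x = sum(px for px, _ in pts) / n
    mean_y = sum(py for _, py in pts) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in pts)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in pts)
    if sxx == 0:
        raise ValueError("all lags are identical; slope is undefined")

    slope = sxy / sxx                          # slope of log(excess) vs log(lag)
    alpha = -slope                             # model: log y = log A - alpha * log lag
    A = math.exp(mean_y - slope * mean_x)      # intercept gives log A
    return A, alpha


# Synthetic check: data generated with A = 3 and alpha = 1.5.
lags = [1, 2, 4, 8, 16]
excess = [3.0 * d ** -1.5 for d in lags]
A, alpha = fit_power_law(lags, excess)
```

Log‑log regression is simple but gives the noisy tail of a slope as much weight as its head, so it should be read as a first approximation only.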

