The nature and origin of heavy tails in retweet activity
Modern social media platforms facilitate the rapid spread of information online. Modelling phenomena such as social contagion and information diffusion are contingent upon a detailed understanding of the information-sharing processes. In Twitter, an …
Authors: Peter Mathews, Lewis Mitchell, Giang T. Nguyen, Nigel G. Bean
Peter Mathews, The University of Adelaide, Adelaide, Australia, peter.mathews@adelaide.edu.au
Lewis Mitchell, The University of Adelaide, Adelaide, Australia, lewis.mitchell@adelaide.edu.au
Giang T. Nguyen, The University of Adelaide, Adelaide, Australia, giang.nguyen@adelaide.edu.au
Nigel G. Bean, The University of Adelaide, Adelaide, Australia, nigel.bean@adelaide.edu.au

ABSTRACT
Modern social media platforms facilitate the rapid spread of information online. Modelling phenomena such as social contagion and information diffusion is contingent upon a detailed understanding of the information-sharing processes. In Twitter, an important aspect of this occurs with retweets, where users rebroadcast the tweets of other users. To improve our understanding of how these distributions arise, we analyse the distribution of retweet times. We show that a power law with exponential cutoff provides a better fit than the power laws previously suggested. We explain this fit through the burstiness of human behaviour and the priorities individuals place on different tasks.

Keywords
Retweet; Twitter; Power law; Power law with exponential cutoff

1. INTRODUCTION
Twitter is one of the most popular social media sites, with twitter.com being the 10th most visited website in the world [2]. The main method of interaction on Twitter is by users changing their status, known as a tweet. Other users can interact with this tweet in several ways, including favouriting or retweeting the tweet. The temporal dynamics of how information spreads on Twitter through retweets provide an excellent case study for understanding information propagation.

There exists a large body of work on cascades in social media systems, e.g. [14, 21, 3, 15, 16]. These focus on the size and volume aspects of a cascade, often using statistics or machine learning techniques to predict the final cascade size based on various features.
However, the temporal component underlying this phenomenon is relatively poorly understood. There also exist generative behavioural models for human dynamics and large-scale collective phenomena using stochastic models, e.g. [4]. The link between the two may not always be clear.

To appear in MSM 2017: 8th International Workshop on Modelling Social Media: Machine Learning and AI for Modelling and Analysing Social Media, 2017 International World Wide Web Conference (WWW'17) Workshop, April 2017, Perth, Australia.

This work builds upon both research topics, using social behavioural models to explain the temporal component of information cascades in a social media system. We make the following key new contributions:

• Showing that a power law with exponential cutoff is a better fit for the distribution of retweet times than a power law.
• Providing an explanation of the origin of the power law with exponential cutoff for the retweet time distribution.

The remainder of the paper is structured as follows: In Section 2 we review prior work on Twitter dynamics and causes of power laws. In Section 3 we introduce the dataset and analyse the best way to fit a distribution. In Section 4 we explain the underlying processes which lead to the distribution. In Section 5 we summarise our findings and discuss possible extensions to this work.

2. RELATED WORK
2.1 Twitter dynamics
There exists a significant amount of related work about modelling Twitter dynamics. However, although other authors have touched on the subject, the distribution of retweet times has not been analysed in detail previously.

Crane and Sornette [10] studied the response of a social system after endogenous and exogenous bursts of activity. They found that after the initial peak, activity declines as a power law distribution.

Zhao et al. [22] looked at the reaction time for retweets from an initial tweet.
They plotted the retweet times up to 15 hours after the initial tweet and concluded that the linear trend on logarithmic axes suggests a power law decay.

Lu et al. [16] developed a method to model the lifetime number of retweets from an originating source. They found the distribution to be a power law with exponent in the range 0.6 to 0.7. They proposed that the “probability of being forwarded is proportional to the product of preferential attachment and transmissibility”.

Wu et al. [21] performed an extensive analysis of the production, flow and consumption of information on Twitter. They found that different content types exhibit dramatically different characteristic lifespans.

There has also been much work on predicting cascade size in social media structures based on various factors. Kupavskii et al. [14] predicted the size of the cascade based on the initial spread using machine learning techniques. Bakshy et al. [3] looked at the possibility of achieving cascades through social media structures from ordinary influencers.

Bild et al. [8] showed that lifetime tweet counts follow a type-II discrete Weibull distribution. They showed that the tweet rate distribution is asymptotically power law but exhibits a lognormal cutoff over finite sample intervals. They also showed that the intertweet interval distribution for a single user is power law with exponential cutoff.

Doerr et al. [12] showed that many processes governing online information spread have a log-normal distribution. The authors questioned the applicability of fitting power law distributions to temporal behavioural data. They argued that the low exponents found in temporal data militate against preferential attachment. They also argued that while preferential attachment provides an explanation for scale-free degree distribution, it does not provide insight into propagation time distributions.
Based on this, they claimed that there does not exist a theoretical model able to explain the observed traces of online human behaviour.

A clear shortcoming of this paper is that it only considered preferential attachment as the cause of power laws. Although preferential attachment is a common and well-known mechanism for the generation of power laws, it is certainly not the only mechanism.

User interest in topics has a tendency to decay exponentially over time [11, 15]. This will form a component of our model in Section 4, where we consider the user interest in tweets after a period of time.

2.2 Causes of power law
Power laws occur frequently in nature and man-made systems. Examples of phenomena that can be modelled well by power laws include frequencies of words in most languages, sizes of earthquakes, intensity of wars, severity of terrorist attacks, sightings of bird species and many others [9].

Mitzenmacher [18] and Newman [19] identified 14 causes of power laws, both natural and man-made. In particular we note:

• Growth by preferential attachment, where new entities attach to existing entities proportional to their current size [6].
• The inter-event time distribution for a single event type where behaviour is a consequence of a decision-based queuing process [20].

As these two causes of power law are the most relevant to our work, we discuss them in more detail.

Growth by preferential attachment
In preferential attachment, new entities attach to existing entities proportional to their current size. In Pólya's urn model [17], where balls are added to urns with probability proportional to the number of balls in the urn, it can be shown that the number of balls per urn is distributed as a power law. Power laws by preferential attachment occur frequently in nature and in human sciences. Cities tend to grow proportional to their current size [13].
Networks have a tendency to grow by attaching new nodes to nodes that already have a large number of connections [5].

Power law due to decision-based queuing process
Barabási [4] showed that the bursty nature of human behaviour can be explained by a decision-based queuing process, which was further explained by Vázquez et al. [20]. Consecutive actions from a single user, such as the inter-event times between emails sent, have a tendency to be power law distributed. This is different to the exponential distribution that would occur if human activity were modelled as a Poisson process. Barabási showed that the timings of five human activity patterns (email and letter based communications, web browsing, library visits and stock trading) followed non-Poisson statistics. When humans execute tasks based on some perceived priority, the waiting time between tasks is heavy-tailed.

3. RETWEET TIME ANALYSIS
3.1 Overview of retweet rates
We define the retweet rate as the number of retweets per unit time occurring for a particular seed tweet. A tweet tends to have the highest retweet rate shortly after it is posted, with the retweet rate slowly decaying over time. We consider the distribution of times until retweets occur and look to determine the most appropriate model to represent this distribution.

Analysing retweet rate decay gives an insight into the longevity of interest in topics being tweeted. Retweets indicate interest in the tweet by a user, so a seed tweet with a slow retweet decay rate suggests that the topic of the tweet has longevity. To illustrate the problem, we first look at specific examples of retweet time distributions.

Figure 1 shows examples of retweet counts with constant-width bins from six sample seed tweets by Donald Trump (Twitter: @realDonaldTrump) in February 2016. As can be seen, the retweet counts decay from their starting levels with some amount of noise.
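The binned retweet rate defined in this section can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `retweet_rates` and `log_spaced_edges` are hypothetical, bins are assumed to be half-open intervals, and retweet times are assumed to be seconds since the seed tweet. Passing constant-width edges gives counts per unit time as in Figure 1, while passing logarithmically spaced edges gives the kind of binning used for a log-log plot.

```python
import bisect
import math


def retweet_rates(times, edges):
    """Retweets per unit time in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for t in times:
        i = bisect.bisect_right(edges, t) - 1  # index of the bin containing t
        if 0 <= i < len(counts):               # ignore times outside the edges
            counts[i] += 1
    # Divide each count by its bin width to turn it into a rate.
    return [c / (edges[i + 1] - edges[i]) for i, c in enumerate(counts)]


def log_spaced_edges(t_min, t_max, n_bins):
    """Bin edges that have constant width on a logarithmic time axis."""
    step = (math.log(t_max) - math.log(t_min)) / n_bins
    return [math.exp(math.log(t_min) + i * step) for i in range(n_bins + 1)]


# Hypothetical retweet times in seconds after the seed tweet.
times = [1.0, 2.0, 3.0, 15.0, 50.0]
rates = retweet_rates(times, [0.0, 10.0, 100.0])
edges = log_spaced_edges(1.0, 100.0, 2)
```

Dividing by the bin width matters for the log-spaced case: without it, wider late bins would collect more retweets simply because they cover more time, which would flatten the apparent decay.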
To illustrate why this distribution might be considered a power law, we choose constant-width bins on a log scale and again plot the log of the retweet rate against the log of time. This gives Figure 2. As can be seen, the graphs are roughly linear, suggesting that in this region the retweet rate is well modelled by a power law.

If we look at the same data set over a longer period, up to 24 hours, and plot the retweet rate on a log-log plot, we get Figure 3. This shows visually that the line is no longer straight, so a power law does not appear to continue to fit the data. As we shall demonstrate in Section 3.4, this phenomenon tends to occur for the vast majority of seed tweets. As is shown on the graph, a power law with exponential cutoff is a better fit to the data.

In the rest of this section we quantify these claims and show that the power law with exponential cutoff is indeed a better fit than the power law.

3.2 Collection methodology
We monitored and collected tweets from the 100 Twitter users with the most followers [1] using the Twitter REST API. We chose these Twitter users as their tweets are retweeted more frequently, providing more dense data. In total, we obtained the times of retweets from 1676 seed tweets in April 2016. We exclude any tweet that was deleted shortly after being tweeted, as this causes a truncated data set. We also exclude any tweet that has fewer than 100 retweets, as it is less meaningful to fit a curve to a sparse data set.

The Twitter REST API allows us to query the details of the 100 most recent retweets from a given tweet. Twitter imposes a rate limit of 15 such queries per 15 minutes, allowing an average of one hundred retweets to be collected per minute. In order to avoid hitting this rate limit, we stop the collection of any retweet set that has a retweet rate greater than 60 retweets per minute, an average of one per second. All remaining retweet times form our dataset to be analysed.
This collection methodology gives us the complete retweet cascade for all tweets in our dataset. After removing the data which did not fit our criteria we are left with 808 seed tweets, which had a mean of 307.7 retweets and a median of 197 retweets. There were 34 seed tweets with over 1000 retweets.

Figure 1: Retweet count histograms showing the first three hours after the initial tweet. The rate of retweets tends to decay over time. [Six panels, Tweet A to Tweet F; x-axis: Time (s); y-axis: Retweet count.]

Figure 2: Retweet rate log-log plots over the first 3 hours since the initial tweet. The linear relationship suggests a power law holds within this region. [Six panels, Tweet A to Tweet F; x-axis: log(Time (s)); y-axis: log(Retweets per second).]

We note that as our dataset is only from the subsection of the Twitter population with a high number of followers, we can only make conclusions about information propagation from these users.

3.3 Fitting a power law
We fit a power law to each of our 808 retweet data sets using maximum likelihood estimation.
We choose maximum likelihood estimation to conduct the fit as it is more accurate than logarithmic binning [7]. A power law has density function

$$p(x) = C x^{-\alpha}, \qquad (1)$$

where $\alpha > 0$ and $C > 0$ is a normalising constant which depends on $\alpha$.

We calculate the Kolmogorov-Smirnov statistic to determine how accurately our empirical distribution matches the theoretical distribution. For a theoretical distribution $F(x)$ and an empirical CDF $S(x)$, the Kolmogorov-Smirnov statistic $D$ is defined by

$$D = \sup_x |F(x) - S(x)|. \qquad (2)$$

Figure 3: Retweet rate log-log plots over the first 24 hours since the initial tweet. A curve of a power law with exponential cutoff is fitted, showing the faster than linear decay. [Six panels, Tweet A to Tweet F; x-axis: log(Time (s)); y-axis: log(Retweets per second).]

The histogram of the KS statistic for each dataset is shown in Figure 4. The mean KS value is 0.07454 with standard deviation 0.02966. As can be seen, the KS statistic values are centered around this mean and mostly fall between 0.05 and 0.10.

Figure 4: Histogram of KS statistics for a power law fit to the distribution of retweet times. The mean KS value is 0.07454 with standard deviation 0.02966. [x-axis: KS statistic; y-axis: Frequency.]

3.4 Fitting a power law with exponential cutoff
We also fit a power law with exponential cutoff to each of our 808 retweet time data sets using maximum likelihood estimation. A power law with exponential cutoff has the density function

$$p(x) = A x^{-b} e^{-cx}, \qquad (3)$$

with $A, b, c > 0$ and where $A$ is a normalising constant.
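The fitting and goodness-of-fit steps built on equations (1) to (3) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the text does not give the support of the densities, so the sketch assumes a bounded support [x_min, x_max]; the golden-section search and the trapezoidal normalisation are choices made here for a dependency-free example; all function names are hypothetical; and the data are synthetic draws, not retweet times.

```python
import math
import random


def power_law_cdf(x, alpha, x_min, x_max):
    """CDF of p(x) = C x^(-alpha) on the assumed support [x_min, x_max]."""
    e = 1.0 - alpha
    if abs(e) < 1e-9:  # alpha = 1: the integral of 1/x is a logarithm
        return math.log(x / x_min) / math.log(x_max / x_min)
    return (x ** e - x_min ** e) / (x_max ** e - x_min ** e)


def power_law_loglik(data, alpha, x_min, x_max):
    """Log-likelihood of equation (1); C follows from alpha and the support."""
    e = 1.0 - alpha
    if abs(e) < 1e-9:
        log_c = -math.log(math.log(x_max / x_min))
    else:
        log_c = math.log(e / (x_max ** e - x_min ** e))
    return len(data) * log_c - alpha * sum(math.log(x) for x in data)


def fit_power_law(data, x_min, x_max, lo=0.01, hi=5.0, tol=1e-6):
    """Maximise the concave log-likelihood over alpha by golden-section search."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        if (power_law_loglik(data, left, x_min, x_max)
                > power_law_loglik(data, right, x_min, x_max)):
            hi = right  # the maximum lies in [lo, right]
        else:
            lo = left   # the maximum lies in [left, hi]
    return (lo + hi) / 2.0


def cutoff_loglik(data, b, c, x_min, x_max, steps=2000):
    """Log-likelihood of equation (3); A is found by trapezoidal integration."""
    lo, hi = math.log(x_min), math.log(x_max)
    h = (hi - lo) / steps

    def g(u):  # integrand after substituting x = exp(u), dx = exp(u) du
        return math.exp((1.0 - b) * u - c * math.exp(u))

    inner = sum(g(lo + i * h) for i in range(1, steps))
    integral = h * (0.5 * g(lo) + inner + 0.5 * g(hi))  # equals 1 / A
    return (-len(data) * math.log(integral)
            - b * sum(math.log(x) for x in data) - c * sum(data))


def ks_statistic(data, cdf):
    """Equation (2): largest gap between the model CDF and the empirical CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, abs(f - i / n), abs(f - (i + 1) / n))
    return d


def sample_power_law(n, alpha, x_min, x_max, seed=0):
    """Synthetic data by inverse-CDF sampling (assumes alpha != 1)."""
    rng = random.Random(seed)
    e = 1.0 - alpha
    span = x_max ** e - x_min ** e
    return [(x_min ** e + rng.random() * span) ** (1.0 / e) for _ in range(n)]


data = sample_power_law(5000, 0.7, 1.0, 1000.0, seed=1)
alpha_hat = fit_power_law(data, 1.0, 1000.0)
ks = ks_statistic(data, lambda x: power_law_cdf(x, alpha_hat, 1.0, 1000.0))
```

Setting c = 0 in `cutoff_loglik` recovers the plain power-law log-likelihood up to integration error, which is the sense in which the power law is a special case of the cutoff model. Fitting b and c jointly would need a two-parameter optimiser, which is omitted here.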
We calculate the KS statistic for each retweet data set modelled by a power law with exponential cutoff. A histogram of the resultant values is shown in Figure 5.

Figure 5: Histogram of KS statistics for a power law with exponential cutoff fit to the distribution of retweet times. The mean KS statistic is 0.05080 with standard deviation 0.02302. [x-axis: KS statistic; y-axis: Frequency.]

The mean KS statistic is 0.05080 with standard deviation 0.02302. This is lower than the mean KS value of 0.07454 without the exponential cutoff (32% improvement) and demonstrates a clear improvement in the quality of fit.

In order to determine whether the reduction in the mean KS statistic for the power law with exponential cutoff is statistically significant, we conduct a paired t-test on the two sets of data, giving a p-value of $2.26071 \times 10^{-157}$. We therefore reject the null hypothesis that the paired differences have zero mean and conclude that the power law with exponential cutoff has a lower KS statistic.

The set of power law distributions is a special case of the set of power laws with exponential cutoffs. As we have added an extra parameter to our model, the power law with exponential cutoff will always provide at least as good a fit. To measure the relative quality of each model, we thus use the AIC criterion

$$\mathrm{AIC} = 2k - 2\ln(L), \qquad (4)$$

where $k$ is the number of parameters and $L$ is the likelihood function. We wish to minimise the AIC value. Each additional parameter adds 2 to the AIC, so adding a parameter lowers the AIC only if it improves the log-likelihood by more than 1. We consider the log-likelihood scores for the power law and power law with exponential cutoff and observe the increase in log-likelihood score in Figure 6. Some datasets are well modelled by a power law and only show a very small increase in log-likelihood score, while other datasets benefit significantly from adding the cutoff.
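The two comparisons described here, the AIC of equation (4) and the paired t statistic on per-dataset KS values, can be sketched as below. The function names are hypothetical, and the parameter counts are an assumption of this sketch: the power law is counted as k = 1 (the exponent) and the cutoff model as k = 2 (exponent and cutoff rate), with normalising constants not counted. The numbers in the usage lines are made up for illustration and are not taken from the dataset.

```python
import math


def aic(num_params, log_likelihood):
    """Equation (4): AIC = 2k - 2 ln(L); lower is better."""
    return 2 * num_params - 2 * log_likelihood


def cutoff_preferred(loglik_power_law, loglik_cutoff):
    """True when the two-parameter cutoff model has the lower AIC."""
    return aic(2, loglik_cutoff) < aic(1, loglik_power_law)


def paired_t_statistic(xs, ys):
    """t statistic for the null hypothesis that paired differences have zero mean."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)


# Illustrative values only: a log-likelihood gain of 1.5 justifies the extra
# parameter, while a gain of 0.5 does not.
gain_large = cutoff_preferred(-100.0, -98.5)
gain_small = cutoff_preferred(-100.0, -99.5)

# Illustrative KS values for four datasets under each model.
t_stat = paired_t_statistic([0.08, 0.07, 0.09, 0.06], [0.05, 0.05, 0.06, 0.05])
```

Converting the t statistic to a p-value requires the Student t distribution with n - 1 degrees of freedom, which is left out to keep the sketch free of third-party dependencies.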
Figure 6: Histogram of improvement to log-likelihood changing from power law to power law with exponential cutoff. The change of distribution improves the likelihood score by more than 1 in 558 out of 808 tested datasets, 69.1% of cases. The increase in log-likelihood score justifies the additional parameter of the power law with exponential cutoff. [x-axis: Increase in log-likelihood; y-axis: Frequency.]

Changing from a power law to a power law with exponential cutoff improves the log-likelihood score by more than 1 in 558 of 808 tested datasets, 69.1% of cases. It improves the log-likelihood score by a mean value of 4.239. Consequently, adding an exponential cutoff improves the AIC score by a mean value of 2 × 4.239 − 2 = 6.478. We conclude that adding an exponential cutoff to the power law provides a better fit.

4. EXPLANATION OF POWER LAW WITH EXPONENTIAL CUTOFF
A potential cause of the power law in retweet activity is a decision-based queuing process. The action of checking Twitter and deciding whether to retweet is a task prioritised against other daily activities. Consequently, the time between a tweet arriving and a user checking their Twitter account has a power law distribution [4].

A decision-based queuing process is much more relevant to describing human activity on the internet than the more commonly discussed cause of power laws, preferential attachment [12, 6]. Users will implicitly assign priorities to tasks in their lives and execute these tasks according to their internal perceived priorities. This explains the origin of the power law component for the distribution of time until retweets.

The second factor affecting the retweet distribution is the loss of interest in topics over time, which has exponential decay [11, 15]. If the topic of the tweet is less relevant than when it was tweeted, it is less likely that it will be retweeted.
The third and final component that affects the likelihood of a retweet is the proportion of users who decide to retweet. For our explanatory model, we assume that a constant proportion of users who see the tweet at a time when it is still relevant will decide to retweet.

To obtain the overall likelihood of retweet at time $t$, we multiply these three components together:

$$P(\text{Retweet at time } t) = P(\text{Twitter checked at time } t) \times P(\text{Tweet still relevant at time } t) \times P(\text{User will choose to retweet}). \qquad (5)$$

This gives

$$P(\text{Retweet at time } t) = A t^{-b} e^{-ct}. \qquad (6)$$

It is possible that there are alternative explanations for the cause of the power law with exponential cutoff. However, our explanation is simple and explains every component of the phenomenon that we have observed in the empirical data.

5. DISCUSSION AND CONCLUSIONS
The rate of retweets can be well modelled by a power law with exponential cutoff, providing a better fit than a standard power law distribution. The power law component is explained by the time until the user checks their social media, which is governed by a decision-based queuing process. The exponential cutoff is explained by the loss of interest in topics over time.

In this work we analysed retweet times from the 100 Twitter users with the most followers. A natural question is whether similar retweet rate distributions would hold for all other Twitter users.

Future work will analyse how the parameters of the power law and exponential cutoff vary based on author, tweet topic or other factors. This will allow prediction of the propagation rate of the tweet.

We could also look at population-level social questions, e.g. how do the decay parameters vary over the long term? As a society, are we growing more or less engaged with news from social media?
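To make the role of the decay parameters concrete, the three-factor explanation in equations (5) and (6) can be sketched as a product of functions. The function names, the scale constant `k` and all numeric values below are hypothetical, and the factors are unnormalised relative likelihoods rather than probabilities; the only claim is that their product has the form A t^(-b) e^(-ct) with A = k × q.

```python
import math


def p_checked(t, b, k=1.0):
    """Power-law factor: relative likelihood that Twitter is checked at time t."""
    return k * t ** (-b)


def p_relevant(t, c):
    """Exponential factor: the tweet is still relevant at time t."""
    return math.exp(-c * t)


def retweet_likelihood(t, b, c, q, k=1.0):
    """Equation (5): product of the three factors; q is the constant retweet share."""
    return p_checked(t, b, k) * p_relevant(t, c) * q


# Hypothetical parameters: exponent b, cutoff rate c per second, retweet share q.
b, c, q = 0.7, 1e-5, 0.02
one_hour = retweet_likelihood(3600.0, b, c, q)
one_day = retweet_likelihood(86400.0, b, c, q)
```

With these illustrative values the exponential factor is close to 1 after one hour and noticeably below 1 after one day, which mirrors the observation that a pure power law describes the first hours well but overestimates activity at 24 hours.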
As the tweet/retweet mechanism provides a continual source of information propagation data, it is possible to test theories which have been proposed in the social science literature using this experimental environment.

The model that we have produced gives an explanation of the phenomena that govern the spread rate of information online through Twitter. It builds upon previous work on the burstiness of human behaviour to give a better understanding of cascades in a social media information system.

6. ACKNOWLEDGMENTS
PM, LM, and NGB acknowledge the financial support of the Data to Decisions Cooperative Research Centre (D2DCRC). All the authors acknowledge the financial support of the ARC Center of Excellence for Mathematical and Statistical Frontiers (ACEMS).

7. REFERENCES
[1] Twitter Counter, 2008 (accessed March 20, 2016).
[2] Alexa. The top 500 sites on the web, May 2016. http://www.alexa.com/topsites.
[3] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Identifying influencers on Twitter. In Fourth ACM International Conference on Web Search and Data Mining (WSDM), 2011.
[4] A.-L. Barabási. The origin of bursts and heavy tails in human dynamics. Nature, 435:207, 2005.
[5] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[6] A.-L. Barabási, R. Albert, and H. Jeong. Mean field theory for scale-free random networks. Physica A: Statistical Mechanics and its Applications, 272:173–187, Oct. 1999.
[7] H. Bauke. Parameter estimation for power-law distributions by maximum likelihood methods. The European Physical Journal B, 58(2):167–173, 2007.
[8] D. R. Bild, Y. Liu, R. P. Dick, Z. M. Mao, and D. S. Wallach. Aggregate characterization of user behavior in Twitter and analysis of the retweet graph. ACM Trans. Internet Technol., 15(1):4:1–4:24, Mar. 2015.
[9] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Rev., 51(4):661–703, Nov. 2009.
[10] R. Crane and D. Sornette. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences, 105(41):15649–15653, 2008.
[11] Y. Ding and X. Li. Time weight collaborative filtering. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, CIKM '05, pages 485–492, New York, NY, USA, 2005.
[12] C. Doerr, N. Blenn, and P. Van Mieghem. Lognormal infection times of online information spread. PLoS ONE, 8(5):1–6, 05 2013.
[13] Y. M. Ioannides and H. G. Overman. Zipf's law for cities: an empirical examination. Regional Science and Urban Economics, 33(2):127–137, 2003.
[14] A. Kupavskii, L. Ostroumova, A. Umnov, S. Usachev, P. Serdyukov, G. Gusev, and A. Kustarev. Prediction of retweet cascade size over time. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, pages 2335–2338, New York, NY, USA, 2012.
[15] L. Li, L. Zheng, F. Yang, and T. Li. Modeling and broadening temporal user interest in personalized news recommendation. Expert Syst. Appl., 41(7):3168–3177, June 2014.
[16] Y. Lu, P. Zhang, Y. Cao, Y. Hu, and L. Guo. On the frequency distribution of retweets. Procedia Computer Science, 31:747–753, 2014.
[17] H. Mahmoud. Polya Urn Models. Chapman & Hall/CRC, 1st edition, 2008.
[18] M. Mitzenmacher. A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1:226–251, 2004.
[19] M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 2005.
[20] A. Vázquez, J. G. Oliveira, Z. Dezsö, K.-I. Goh, I. Kondor, and A.-L. Barabási. Modeling bursts and heavy tails in human dynamics. Phys. Rev. E, 73:036127, Mar. 2006.
[21] S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who says what to whom on Twitter. In Proceedings of the 20th International Conference on World Wide Web, WWW '11, pages 705–714, New York, NY, USA, 2011.
[22] Q. Zhao, M. A. Erdogdu, H. Y. He, A. Rajaraman, and J. Leskovec. SEISMIC: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 1513–1522, New York, NY, USA, 2015.