On the statistical properties of viral misinformation in online social media
The massive diffusion of online social media allows for the rapid and uncontrolled spreading of conspiracy theories, hoaxes, unsubstantiated claims, and false news. Such an impressive amount of misinformation can influence policy preferences and enco…
Authors: Alessandro Bessi
Alessandro Bessi∗

a University of Southern California, Information Sciences Institute, Marina del Rey, Los Angeles, CA, USA
b IUSS Institute for Advanced Study, Pavia, ITALY
c IMT Institute for Advanced Studies, Lucca, ITALY

Abstract

The massive diffusion of online social media allows for the rapid and uncontrolled spreading of conspiracy theories, hoaxes, unsubstantiated claims, and false news. Such an impressive amount of misinformation can influence policy preferences and encourage behaviors strongly divergent from recommended practices. In this paper, we study the statistical properties of viral misinformation in online social media. By means of methods belonging to Extreme Value Theory, we show that the number of extremely viral posts over time follows a homogeneous Poisson process, and that the interarrival times between such posts are independent and identically distributed, following an exponential distribution. Moreover, we characterize the uncertainty around the rate parameter of the Poisson process through Bayesian methods. Finally, we are able to derive the predictive posterior probability distribution of the number of posts exceeding a certain threshold of shares over a finite interval of time.

Keywords: misinformation, online social media, extreme value theory

1. Introduction

The wide availability of user-provided contents in online social media encourages the aggregation of people around common interests and narratives. The direct path from producers to consumers of contents drives the emergence of a disintermediated environment that is changing the way people become informed, interpret facts, and form their opinions [1, 2, 3, 4, 5].
Unfortunately, such a disintermediation can facilitate the spreading of rumors, hoaxes, fake news, and conspiracy theories, which often arouse naive and awkward social responses on different topics, such as health, environment, national security, and politics [6, 7, 8, 9, 10, 11]. In particular, conspiracy theories simplify causation, reduce the complexity of reality, and are formulated in a way that is able to contain a certain level of uncertainty [12, 13, 14, 15]. However, such an impressive amount of misinformation can influence policy preferences and encourage behaviors strongly divergent from recommended practices.

∗ Corresponding author. Email address: bessi@isi.edu (Alessandro Bessi). Preprint submitted to Elsevier, August 6, 2018.

Since the World Economic Forum listed massive digital misinformation as one of the main threats to our society [16], community-driven [17] and algorithmic-driven [18, 19, 20, 21, 22, 23, 24] solutions have been proposed to counteract the pervasiveness of online misinformation. However, a part of the scientific community is skeptical about the real effectiveness of such solutions. Indeed, the community-driven approach proposed by Facebook, where users can flag false contents to correct the newsfeed algorithm, is controversial, because it raises fears that the free circulation of ideas may be threatened. Moreover, algorithmic-driven approaches may not be effective, since the acceptance of a claim (either substantiated or not) is heavily influenced by social norms and individual cognitive factors [25, 26, 27, 28, 29, 30]. Indeed, recent works point out both the inefficacy of correcting false beliefs and the concrete risk of a backfire effect [31, 32, 33] from the usual and most committed consumers of conspiracy theories. In fact, false beliefs, once adopted by an individual, are rarely corrected [34, 35, 36, 37].
More specifically, both the formation and the revision of beliefs are strongly affected by the communities wherein ideas and facts are debated [38, 39]. Such a phenomenon is emphasized in online social networks, where users process information through a shared system of meaning [40, 41] inside their echo chambers [42, 43, 44, 45], making sense of facts in ways that are often biased toward self-confirmation. Indeed, recent studies show that increasing the exposure of users to unsubstantiated rumors increases their tendency to be credulous [26, 46], and that content-selective exposure is the primary driver of content diffusion and generates the formation of echo chambers [42].

In this work, we study the statistical properties of viral misinformation in online social media. In particular, we apply methods of Extreme Value Theory, a branch of statistics dealing with extreme deviations from the median of probability distributions, to analyze a large dataset of posts published by Facebook pages supporting conspiracy theories and myth narratives. By means of an in-depth statistical analysis of the shares distribution and the application of the Peaks Over Threshold (POT) approach, we show that the number of extremely viral posts (e.g. > 250K shares) over time follows a homogeneous Poisson process, and that the interarrival times between such posts are independent and identically distributed, following an exponential distribution. Further, we characterize the uncertainty around the rate parameter of the Poisson process through Bayesian methods. Finally, we are able to derive the predictive posterior probability distribution of the number of posts exceeding a certain threshold of shares over a finite interval of time.

The relevance of our results is not necessarily limited to the field of computational social science coping with misinformation [47, 48, 49].
Indeed, although the prediction of extremely viral posts and rare events remains a hard task [50, 51], we believe that both our findings and the methodology used herein may be of interest to the broader field of computational social science dealing with forecasting and tracking of viral contents and events [52, 53, 54, 55, 56, 57, 58, 59, 60, 61].

2. Methods

2.1. Ethics Statement

The entire data collection process has been carried out exclusively through the Facebook Graph API, which is publicly available. We used only publicly available data. The pages from which we downloaded data are public Facebook entities.

2.2. Data Collection

We analyzed 328 US public Facebook pages diffusing conspiratorial beliefs, myth narratives, and controversial information, usually lacking supporting evidence and most often contradicting the official news. Such a space of investigation is defined with the same approach as in [26, 42], with the support of different Facebook groups very active in monitoring the conspiracy narratives. For each page, we downloaded all the posts (and their respective metadata) in a timespan of 5 years (Jan 1, 2010 to Dec 31, 2014). The dataset is composed of 345,054 posts. To our knowledge, the dataset is the complete set of conspiracy-like information sources active in the US Facebook scenario up to December 31, 2014.

2.3. Fundamentals of Extreme Value Theory

Extreme value theory (EVT) is a branch of statistics dealing with the extreme deviations from the median of probability distributions. In particular, it aims at assessing the probability of events that are more extreme than any previously observed. Extreme value theory is widely used in many fields of science where power laws play a role in modeling [62], such as structural and geological engineering, finance and risk management, earth sciences, traffic prediction, etc.
In this section, we briefly review some fundamental results of extreme value theory. For extended discussion, proofs, and theorems see [63, 64, 65].

2.3.1. Extreme Value Theory

Suppose $X_1, X_2, \ldots$ are independent and identically distributed (iid) random variables with common cumulative distribution function (cdf) $F$. Let $M_n = \max\{X_1, \ldots, X_n\}$ denote the maximum of the first $n$ random variables (partial maxima) and let $u(F) = \sup\{x : F(x) < 1\}$ denote the upper endpoint of $F$. Since

$$\Pr(M_n \le x) = \Pr(X_1 \le x, \ldots, X_n \le x) = F^n(x),$$

$M_n$ converges almost surely to $u(F)$, whether it is finite or infinite. Extreme value theory seeks norming constants $a_n > 0$, $b_n \in \mathbb{R}$, and some nondegenerate distribution function $G$ such that the cdf of the normalized $M_n$ converges to $G$, i.e.

$$\Pr\left(\frac{M_n - b_n}{a_n} \le x\right) = F^n(a_n x + b_n) \xrightarrow{d} G(x).$$

If this holds for suitable choices of $a_n$ and $b_n$, then we say that $G$ is an extreme value distribution function, and $F$ belongs to the maximum domain of attraction of $G$, i.e. $F \in MDA(G)$. The Extremal Types Theorem characterizes the limit distribution function $G$ as of the type of one of the following three classes:

• Gumbel: $\Lambda(x) = \exp(-\exp(-x)), \quad x \in \mathbb{R}$
• Fréchet: $\Phi_\alpha(x) = \begin{cases} 0, & \text{if } x \le 0 \\ \exp(-x^{-\alpha}), & \text{if } x > 0 \end{cases}$
• Weibull: $\Psi_\alpha(x) = \begin{cases} \exp(-(-x)^{\alpha}), & \text{if } x \le 0 \\ 1, & \text{if } x > 0 \end{cases}$

for some $\alpha > 0$.

The three extreme value distributions can be represented using the generalized extreme value (GEV) distribution (family). Let

$$H_\xi(x) = \begin{cases} \exp\left(-(1 + \xi x)^{-1/\xi}\right), & \text{if } \xi \ne 0 \\ \exp(-\exp(-x)), & \text{if } \xi = 0 \end{cases}$$

where $1 + \xi x > 0$. Then,

• $\xi = \alpha^{-1} > 0 \longleftrightarrow \Phi_\alpha$
• $\xi = -\alpha^{-1} < 0 \longleftrightarrow \Psi_\alpha$
• $\xi = 0 \longleftrightarrow \Lambda$

From a modeling point of view, the three extreme value distributions are very different, especially as far as the behavior of the tails is concerned, i.e. the part of the distribution that is most relevant when dealing with extreme events.
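The correspondence between the GEV family and the three classes can be checked numerically. The following Python sketch evaluates the standardized GEV cdf $H_\xi$ and compares it with the Fréchet and Gumbel forms. The function names (`gev_cdf`, `frechet_cdf`, `gumbel_cdf`, `frechet_tail_ratio`) are illustrative and do not come from the paper; the comparison uses the identity $\Phi_\alpha(y) = H_{1/\alpha}(\alpha (y - 1))$, which follows from substituting $y = 1 + \xi x$ with $\xi = 1/\alpha$.

```python
import math


def gev_cdf(x: float, xi: float) -> float:
    """Standardized GEV cdf H_xi(x), defined where 1 + xi * x > 0."""
    if xi == 0.0:
        return math.exp(-math.exp(-x))  # Gumbel case
    z = 1.0 + xi * x
    if z <= 0.0:
        # Outside the support: below the lower endpoint when xi > 0,
        # above the upper endpoint when xi < 0.
        return 0.0 if xi > 0.0 else 1.0
    return math.exp(-(z ** (-1.0 / xi)))


def frechet_cdf(x: float, alpha: float) -> float:
    """Frechet cdf Phi_alpha(x)."""
    return 0.0 if x <= 0.0 else math.exp(-(x ** (-alpha)))


def gumbel_cdf(x: float) -> float:
    """Gumbel cdf Lambda(x)."""
    return math.exp(-math.exp(-x))


def frechet_tail_ratio(x: float, alpha: float) -> float:
    """Ratio (1 - Phi_alpha(x)) / x**(-alpha), which tends to 1 as x grows."""
    power = x ** (-alpha)
    return -math.expm1(-power) / power
```

For example, `gev_cdf(4.0, 0.5)` and `frechet_cdf(3.0, 2.0)` evaluate the same quantity, $\exp(-1/9)$, and `frechet_tail_ratio` approaches 1 for large arguments, which is the power-law tail behavior of the Fréchet class.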
Here, we focus on the Fréchet case, $\Phi_\alpha$ with $\alpha > 0$. If we consider the tail of $\Phi_\alpha(x)$, a Taylor expansion shows that

$$1 - \Phi_\alpha(x) = 1 - \exp(-x^{-\alpha}) \sim x^{-\alpha}, \quad x \to \infty.$$

Hence, the tail of $\Phi_\alpha$ decays as a power law. Moreover, every distribution function that belongs to $MDA(\Phi_\alpha)$ necessarily has an infinite right endpoint, i.e. it is defined for $x \in [0, \infty)$. It follows that all the distribution functions belonging to $MDA(\Phi_\alpha)$ are appropriate for modeling phenomena with extremely large maxima.

Consider a random variable $X$ with unknown distribution function $G$ and right endpoint $x_G = \sup\{x \in \mathbb{R} : G(x) < 1\}$. The exceedance distribution function of $X$ above a given threshold $t$ is defined as

$$G_t(x) = \Pr(X \le x \mid X > t) = \frac{G(x) - G(t)}{1 - G(t)}, \quad x \ge t.$$

For a large class of distribution functions $G$ and a high threshold $t \to x_G$, $G_t$ can be approximated by a Generalized Pareto Distribution, i.e.

$$G_t(x) \approx GPD(x; \xi, \beta, t) = \begin{cases} 1 - \left(1 + \xi \dfrac{x - t}{\beta}\right)^{-1/\xi}, & \text{if } \xi \ne 0 \\ 1 - \exp\left(-\dfrac{x - t}{\beta}\right), & \text{if } \xi = 0 \end{cases}$$

where $x \ge t$ for $\xi \ge 0$, $t \le x \le t - \beta/\xi$ for $\xi < 0$, $t \in \mathbb{R}$, $\xi \in \mathbb{R}$, and $\beta > 0$. The shape parameter, $\xi$, governs the fatness of the tails, and thus the existence of the moments. The moment of order $p$ of a Generalized Pareto distributed random variable exists if and only if $\xi < 1/p$.

2.3.2. Extreme Value Analysis

Two approaches exist for practical extreme value analysis. The Block Maxima (BM) approach consists in splitting the observation period into a certain number of non-overlapping periods of equal size (e.g. weeks, months, years) and then considering only the maximal value within each period. Such maximal values follow approximately a Generalized Extreme Value (GEV) distribution. The Peaks Over Threshold (POT) approach relies on considering only the values exceeding a certain high threshold.
The probability distribution of those selected observations is approximately a Generalized Pareto Distribution (GPD).

Both approaches have some limitations. The POT approach picks up all relevant high observations, and thus seems to make better use of the available information. Conversely, the BM approach misses some of these high observations and retains some lower observations. However, there may be reasons for using the BM method: the only available information may be block maxima (e.g. daily, weekly, monthly, or yearly maxima), and the BM approach may be preferable when the observations are not exactly iid. Still, while the BM approach may be easier to apply when the block periods appear naturally, some problems arise when this does not happen. In such a case, the choice of the block size for the BM approach may be as difficult as the choice of the threshold for the POT approach. Figure 1 provides a graphical representation of the two approaches.

Figure 1: Block Maxima (BM) vs. Peaks Over Threshold (POT).

3. Results and Discussion

In this paper, we aim at investigating the statistical properties of viral misinformation on Facebook by means of extreme value theory. More specifically, the object of the analysis is the number of times that posts supporting conspiracy theories have been shared (in this study, we assume that each share of a conspiracy post represents the will to support a given conspiracy narrative), which can be considered as a random variable $X$ following a generic distribution function $F$ with support $[0, \infty)$. Such an infinite right endpoint is justified by the fact that users can share a post as many times as they wish.

3.1. Exploratory Data Analysis

Since for each post we know the time of creation, we have a temporally ordered collection of observations.
Such a time series is irregularly spaced, in the sense that it is characterized by varying interarrival times between observations. A common approach to analyze irregularly spaced time series consists in transforming the data into equally spaced observations using interpolation methods, and then applying standard methods for equally spaced data. However, such a transformation can introduce a number of significant and hard-to-quantify biases, especially when the interarrival times between observations are highly irregular. Since in our case the spacing of observations varies from seconds to days, we avoid transforming the data.

Moreover, although observations are temporally ordered, it is difficult to assume some kind of time dependence between the numbers of shares received by posts, i.e. the number of shares received by a post does not affect the number of shares received by the following posts. Rather, even if we can conceive that some external events (e.g. breaking news, top stories, scandals, etc.) can cause an unusual number of posts in a restricted temporal window (clustering), we can safely assume that the number of shares received by each of those posts is independent and identically distributed (iid).

Figure 2 shows the cumulative number of weekly posts (top panel) and the number of weekly posts (bottom panel). Although there appears to be a clear growth trend, which is likely due to the increase of users on Facebook occurring from 2010 to 2014, and some form of seasonality, we are not able to identify any meaningful seasonality pattern. Indeed, we can assume that the activity of this kind of pages is primarily driven by external events, such as breaking news, top stories, scandals, etc.

Figure 2: Weekly posts. Cumulative number of weekly posts (top panel) and number of weekly posts (bottom panel). The solid red line indicates the fitted linear trend.
However, an increase in the number of posts published by pages in a given temporal window may reflect an increase in the users' excitement and activity. We account for such a possible characteristic of the phenomenon under investigation by rescaling raw data by a factor defined as

$$R_i = \frac{w_i}{\max(w)}, \quad i \in \{1, \ldots, 261\},$$

where $w_i$ represents the number of posts published by pages in week $i$. Such a rescaling factor inflates the number of shares of posts published in weeks characterized by an overall low activity. Different rescaling strategies have been considered (e.g. rescaling by the mean or the median), and similar results have been obtained.

Figure 3: Shares time series. Time series of the original number of shares (raw data) and the rescaled one (rescaled data).

Figure 3 shows the time series of the original number of shares (raw data) and the rescaled one (rescaled data). The rescaling procedure should have removed or at least reduced possible clustering phenomena that would have led to a violation of the iid assumption. We check the iid assumption by means of the records plot, a simple and intuitive exploratory tool widely used in extreme value analysis, which exploits the fact that successive records for iid data should become more and more rare as time goes by. Since a record $x_n$ for the random variable $X$ occurs if $x_n > \max\{x_1, \ldots, x_{n-1}\}$, it is intuitive that if data are iid it becomes more difficult to exceed all past observations, and thus the number of records should follow a logarithmic pattern [63]. Records plots in Figure 4 show that the iid assumption for raw data is violated, but still valid for rescaled data, where records are distributed around their expected value and within the 95% confidence intervals.

Figure 4: Records plots.
The iid assumption for raw data is violated, but still valid for rescaled data, where records are distributed around their expected value and within the 95% confidence intervals.

Figure 5 shows the empirical complementary cumulative distribution functions of raw data and rescaled data. The double log scale of the figures highlights the fatness of the right tail in both empirical distributions. Beyond removing any form of dependence in the raw data, we observe that the rescaling procedure slightly exacerbates the power law behavior of the tail without influencing the body of the distribution. Thus, rescaled data will be used for the subsequent analysis.

Figure 5: Empirical Complementary Cumulative Distribution Function. The rescaling procedure slightly exacerbates the power law behavior of the tail without influencing the body of the distribution.

3.2. Statistical Properties of Viral Misinformation

We use the distribution function of rescaled data to characterize the statistical properties of viral misinformation by means of EVT tools. First, we analyze the limit behavior of the Maximum/Sum ratio

$$R_n(p) = \frac{M_n(p)}{S_n(p)}, \quad n \ge 1, \; p > 0,$$

where $S_n(p) = \sum_{i=1}^{n} X_i^p$ and $M_n(p) = \max(X_1^p, \ldots, X_n^p)$. The moment of order $p$ of the distribution exists, i.e. $E[X^p] < \infty$, if and only if $R_n(p)$ converges to zero for $n \to \infty$. Conversely, an erratic limit behavior of $R_n(p)$ indicates the infiniteness of the $p$-th moment of the distribution, i.e. $E[X^p] = \infty$. Figure 6 shows that only the first moment of the distribution exists, i.e. $E[X] < \infty$, whereas moments of order $p \ge 2$ are infinite, i.e. $E[X^p] = \infty$ for $p \ge 2$. Identical results hold for raw data. Our distribution function belongs to the maximum domain of attraction of Fréchet, i.e. $F \in MDA(\Phi_\alpha)$.
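The two diagnostics used in this part of the analysis, the records plot and the Maximum/Sum ratio, are simple to compute. The sketch below is a minimal pure-Python illustration, assuming strictly positive observations in temporal order; the function names are hypothetical and the paper's actual implementation is not known. The expected number of records among $n$ iid continuous observations is the harmonic number $1 + 1/2 + \cdots + 1/n$, which grows logarithmically.

```python
from itertools import accumulate


def max_sum_ratio(xs, p):
    """Return [R_1(p), ..., R_n(p)] with R_n(p) = max(x_i^p) / sum(x_i^p)."""
    powered = [x ** p for x in xs]
    running_max = accumulate(powered, max)   # M_n(p) for n = 1, ..., len(xs)
    running_sum = accumulate(powered)        # S_n(p) for n = 1, ..., len(xs)
    return [m / s for m, s in zip(running_max, running_sum)]


def count_records(xs):
    """Cumulative number of records; x_n is a record if it exceeds all earlier values."""
    counts, best, total = [], float("-inf"), 0
    for x in xs:
        if x > best:
            best, total = x, total + 1
        counts.append(total)
    return counts


def expected_records(n):
    """Expected number of records among n iid continuous observations."""
    return sum(1.0 / k for k in range(1, n + 1))
```

Plotting `max_sum_ratio(shares, p)` against $n$ for several values of $p$ gives the Maximum/Sum ratio plot, and comparing `count_records(shares)` with `expected_records(n)` gives the core of a records plot (without the confidence bands).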
The existence of the first moment of the distribution function allows us to compute a reasonable estimate of the conditional tail mean above a given threshold (in a sample one can compute basically anything, even meaningless quantities). Indeed, by the law of total expectation

$$E[X \mid X > t] = \frac{E[X] - \Pr(X \le t)\, E[X \mid X \le t]}{\Pr(X > t)},$$

where $E[X \mid X \le t]$ is finite since it is bounded from above by $t$, and the finiteness of $E[X]$ implies that the conditional tail mean, $E[X \mid X > t]$, is finite.

Such a measure is known in finance as the expected shortfall of a loss distribution, and it lets us answer questions such as "What is the expected number of shares for a post once it has exceeded the 250K shares threshold?". Indeed, the sample estimate is

$$E[X \mid X > t] \approx \frac{\sum_{i=1}^{N} x_i\, I(x_i > t)}{\sum_{i=1}^{N} I(x_i > t)} = \frac{\sum_{i=1}^{N} x_i\, I(x_i > 250K)}{\sum_{i=1}^{N} I(x_i > 250K)} \approx 467K.$$

Since we showed that the moments of order $p \ge 2$ do not exist, one should prefer the mean absolute deviation over the variance as a measure of dispersion around the conditional tail mean.

Recall that the moment of order $p$ of a Generalized Pareto distributed random variable exists if and only if $\xi < 1/p$, and thus the shape parameter we are going to estimate cannot be smaller than $1/2$. Such an observation has the main implication that we can safely use the maximum likelihood (ML) approach to estimate $\xi$, since the ML estimates are consistent only when $\xi > -1/2$.

Figure 6: Maximum/Sum ratio plot. Only the first moment of the distribution exists, i.e. $E[X] < \infty$, whereas moments of order $p \ge 2$ are infinite, i.e. $E[X^p] = \infty$ for $p \ge 2$.

Since our time series is highly irregularly spaced, with interarrival times ranging between seconds and days, we prefer the Peaks Over Threshold (POT) approach over the Block Maxima (BM) approach to estimate the shape parameter $\xi$.
Indeed, when the block periods used in the BM approach do not appear naturally, the choice of a threshold for the POT approach may be easier.

Before fitting a Generalized Pareto Distribution to the data, we have to identify a feasible threshold. To accomplish such a task, we rely on the Mean Excess Function (MEF). The empirical MEF of a sample of observations $x_1, \ldots, x_n$ is defined as

$$e_n(t) = \frac{\sum_{i=1}^{n} (x_i - t)\, I(x_i > t)}{\sum_{i=1}^{n} I(x_i > t)},$$

that is, the ratio between the sum and the number of the exceedances over the threshold $t$. Figure 7 shows the MEF plot for the rescaled data. We observe that the empirical MEF begins to increase linearly in the threshold at $t \approx 10^4$. Such a behavior characterizes power law distribution functions [66], and thus we choose $t = 10^4$ as threshold.

Figure 7: Empirical Mean Excess Function plot.

Given the heavy-tailed behavior of the rescaled data distribution function, the shape parameter $\xi$ is likely to be positive. The left panel of Figure 8 shows the Pickands plot, based on the nonparametric Pickands estimator for $\xi$, defined as

$$\tilde{\xi}^{(P)}_{\tau,n} = \frac{1}{\log 2} \log \frac{X_{\tau,n} - X_{2\tau,n}}{X_{2\tau,n} - X_{4\tau,n}}, \quad \tau = 1, \ldots, \lfloor n/4 \rfloor,$$

where $X_{\tau,n}$ is the $\tau$-th upper order statistic out of a sample of $n$ observations. The Pickands plot shows a more or less stable behavior of the Pickands estimates for different values of $\tau$, suggesting that the true value of $\xi$ lies in the interval $(0.5, 1)$.

The right panel of Figure 8 shows the Hill plot, based on the nonparametric Hill estimator for $\xi$, defined as

$$\tilde{\xi}^{(H)}_{\tau,n} = \frac{1}{\tau} \sum_{j=1}^{\tau} \left[\ln(X_{j,n}) - \ln(X_{\tau,n})\right],$$

where $X_{j,n}$ is the $j$-th upper order statistic out of a sample of $n$ observations. The Hill plot outperforms the Pickands plot in stability, suggesting a true value of $\xi$ around $0.75$.

Figure 8: Pickands and Hill nonparametric estimators for the shape parameter $\xi$.
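The threshold and shape diagnostics above translate directly into code. The following Python sketch implements the empirical mean excess function and the Pickands and Hill estimators as they are written in this section; the function names and the error handling are illustrative choices, not the author's code. Note that the Pickands estimator is undefined when the selected order statistics are tied, and the Hill estimator requires strictly positive data.

```python
import math


def mean_excess(xs, t):
    """Empirical mean excess e_n(t): average of (x - t) over observations with x > t."""
    excesses = [x - t for x in xs if x > t]
    if not excesses:
        raise ValueError("no observation exceeds the threshold")
    return sum(excesses) / len(excesses)


def pickands(xs, tau):
    """Pickands estimate of xi from the tau-th, 2*tau-th and 4*tau-th upper order statistics."""
    order = sorted(xs, reverse=True)  # order[0] is X_{1,n}, the largest value
    if tau < 1 or 4 * tau > len(order):
        raise ValueError("tau must satisfy 1 <= tau <= n // 4")
    x1, x2, x4 = order[tau - 1], order[2 * tau - 1], order[4 * tau - 1]
    return math.log((x1 - x2) / (x2 - x4)) / math.log(2.0)


def hill(xs, tau):
    """Hill estimate of xi: mean log-excess of the top tau values over X_{tau,n}."""
    order = sorted(xs, reverse=True)
    if tau < 1 or tau > len(order):
        raise ValueError("tau must satisfy 1 <= tau <= n")
    log_threshold = math.log(order[tau - 1])
    return sum(math.log(x) - log_threshold for x in order[:tau]) / tau
```

Evaluating `mean_excess` over a grid of thresholds, and `pickands` or `hill` over a range of `tau`, produces the data behind plots like those in Figures 7 and 8.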
The main implication is that, as already anticipated by the analysis of the Maximum/Sum ratio plot, the true value of $\xi$ is greater than $-1/2$, and thus we can obtain consistent estimates via a maximum likelihood (ML) approach. Table 1 shows ML estimates of $\xi$ and $\beta$ for different thresholds. We observe a stable value of $\tilde{\xi}^{(ML)}$ for increasing values of the threshold. We obtain similar results for raw data (i.e. $\tilde{\xi}^{(ML)} = 0.769\ (0.0198)$ with $t = 2.5K$).

Table 1: Maximum Likelihood estimates, standard errors (in parentheses), and number of exceedances for different thresholds.

| threshold | $\tilde{\xi}^{(ML)}$ (s.e.) | $\tilde{\beta}^{(ML)}$ (s.e.) | # exceedances |
|---|---|---|---|
| 10K | 0.770 (0.0220) | 8,750 (205) | 6,408 |
| 25K | 0.800 (0.0391) | 19,500 (805) | 2,153 |
| 50K | 0.737 (0.059) | 43,170 (2,730) | 884 |
| 100K | 0.746 (0.0869) | 74,380 (6,923) | 399 |
| 150K | 0.726 (0.120) | 120,460 (15,500) | 223 |

3.3. Frequency of Viral Misinformation

The Peaks Over Threshold (POT) method has two main implications: the exceedances over a high threshold follow a Generalized Pareto Distribution, and the number of excesses over time follows a homogeneous Poisson process. In a homogeneous Poisson process the number of events, $N(\theta)$, in a finite interval of time of length $\theta$ follows the Poisson distribution, i.e.

$$\Pr(N(\theta) = n) = \frac{(\lambda\theta)^n}{n!} \exp(-\lambda\theta).$$

Moreover, the interarrival times between events are independent and follow the exponential distribution, i.e.

$$\Pr(\text{interarrival time} > \theta) = \exp(-\lambda\theta).$$

Figure 9: Exponential Quantiles vs. Interarrival Times.

Figure 9 shows that the interarrival times of posts shared more than 750K times (rescaled data) follow approximately an exponential distribution. Moreover, the autocorrelation function (ACF) plot in Figure 10, i.e. a plot showing the similarity between observations as a function of the time lag between them [67], shows that the interarrival times between those posts are uncorrelated, which is consistent with independence and supports the iid hypothesis suggested by the records plot in Figure 4. Similar results approximately hold for raw data when considering posts shared more than 250K times (the bootstrap test of fit for the Generalized Pareto Distribution [68] gives a p-value equal to 0.2). We conclude that the number of extremely viral posts over time follows a homogeneous Poisson process.

Figure 10: Autocorrelogram for Interarrival Times. We find no correlation as a function of the time lag between them.

Such a conclusion allows us to exploit some useful properties of homogeneous Poisson processes to quantify the frequency of rare viral contents on online social media. Indeed, the expected value of the number of events, $N(\theta)$, in a finite interval of time of length $\theta$ is $E[N(\theta)] = \lambda\theta$, where $\lambda > 0$ is known as the rate parameter of the Poisson process. The reciprocal of such a parameter, i.e. $1/\lambda$, is known as the survival parameter of the exponential distribution followed by the interarrival times between the $N(\theta)$ events. Given a sample $z_1, \ldots, z_n$ of interarrival times, the survival parameter is estimated through the sample mean

$$\frac{1}{\lambda} = \frac{\sum_{i=1}^{n} z_i}{n}.$$

Essentially, we can estimate the survival parameter, $1/\lambda$, of the exponential distribution describing the interarrival times between rare events exceeding a certain threshold, and then use the rate parameter, $\lambda$, of the Poisson process to estimate the expected number of events exceeding that threshold in a finite time of length $\theta$. For instance, the survival parameter of the interarrival times distribution of posts exceeding 250K shares (raw data) is $1/\lambda = 18.5$ days. It follows that $\lambda = 1/18.5 = 0.0541$.
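The estimation step just described is short enough to sketch in code. The example below estimates the rate from a list of interarrival times and evaluates the Poisson and exponential formulas of this section. The waiting times in `waits` are hypothetical values chosen only so that their mean is 18.5 days; they are not the paper's data, and the function names are illustrative.

```python
import math


def rate_mle(interarrival_times):
    """Estimate lambda as the reciprocal of the mean interarrival time."""
    return len(interarrival_times) / sum(interarrival_times)


def poisson_pmf(n, lam, theta):
    """Pr(N(theta) = n) for a homogeneous Poisson process with rate lam > 0."""
    mu = lam * theta
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))


def prob_wait_exceeds(lam, theta):
    """Pr(interarrival time > theta) = exp(-lam * theta)."""
    return math.exp(-lam * theta)


# Hypothetical waiting times (in days) between threshold-exceeding posts.
waits = [12.0, 30.5, 7.5, 24.0]      # sample mean = 74 / 4 = 18.5 days
lam = rate_mle(waits)                # estimated rate, in posts per day
expected_in_year = lam * 365.0       # E[N(365)] = lambda * theta
```

With these illustrative numbers, `lam` equals $1/18.5$, and `poisson_pmf(n, lam, 365.0)` gives the probability of observing exactly `n` threshold-exceeding posts in a year under the fitted process.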
Basically, while the survival parameter lets us answer questions such as "What is the mean waiting time between posts exceeding 250K shares?", the rate parameter lets us answer questions such as "What is the expected number of posts exceeding 250K shares in the next 365 days?". Indeed,

$$E[N(\theta)] = \lambda\theta = 0.0541 \times 365 = 19.8 \approx 20.$$

A convenient way to assess the uncertainty around the rate parameter, $\lambda$, consists in using a standard Bayesian probability updating method. Indeed, the conjugate prior distribution for a Poisson distribution is the Gamma distribution, and we can express the prior distribution of $\lambda$ as

$$\Pr(\lambda) = \text{Gamma}(\alpha, \beta).$$

Since the expected value (mean) of a Gamma distribution is $\alpha/\beta$, we may want to choose the hyperparameters, $\alpha$ and $\beta$, of the prior distribution $\Pr(\lambda)$ so that

$$\frac{1}{\lambda} = \frac{\beta}{\alpha} = \frac{\sum_{i=1}^{n} z_i}{n},$$

where $z_1, \ldots, z_n$ are the observed interarrival times. Then, the posterior distribution of the rate parameter is

$$\Pr(\lambda \mid z) = \text{Gamma}\left(\alpha + k, \; \beta + \sum_{i=1}^{k} z_i\right),$$

where $z_1, \ldots, z_k$ represent $k$ new observed interarrival times.

For instance, we may define the prior distribution of the rate parameter of the interarrival times distribution of posts exceeding 250K shares (raw data) as

$$\Pr(\lambda) = \text{Gamma}(\alpha, \beta) = \text{Gamma}\left(n, \sum_{i=1}^{n} z_i\right) = \text{Gamma}(38, 702),$$

with mean equal to $\alpha/\beta = 38/702 = 0.0541$, and variance equal to $\alpha/\beta^2 = 38/702^2 = 7.71 \times 10^{-5}$. Then, if after 60 days we observe a post exceeding the 250K shares threshold, the posterior probability distribution of the rate parameter is

$$\Pr(\lambda \mid z) = \text{Gamma}\left(\alpha + k, \; \beta + \sum_{i=1}^{k} z_i\right) = \text{Gamma}(38 + 1, 702 + 60),$$

with mean equal to $(\alpha + k)/(\beta + \sum_{i=1}^{k} z_i) = (38 + 1)/(702 + 60) = 0.0512$, and variance equal to $(38 + 1)/(702 + 60)^2 = 6.72 \times 10^{-5}$.
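The conjugate update in this example can be reproduced with a few lines of Python. The sketch below uses the Gamma(38, 702) prior and the single 60-day waiting time from the text. The function names are illustrative. The `predictive_pmf` function is an addition of this sketch: it applies the standard Gamma-Poisson mixture result (a negative binomial distribution) as one way to obtain a distribution over counts, which is an assumption here and not necessarily the computation used in the paper.

```python
import math


def gamma_update(alpha, beta, new_waits):
    """Conjugate update: Gamma(alpha, beta) -> Gamma(alpha + k, beta + sum of the k new waits)."""
    return alpha + len(new_waits), beta + sum(new_waits)


def gamma_mean_var(alpha, beta):
    """Mean and variance of a Gamma distribution with shape alpha and rate beta."""
    return alpha / beta, alpha / beta ** 2


def predictive_pmf(n, alpha, beta, theta):
    """Pr(N(theta) = n) for N | lam ~ Poisson(lam * theta) and lam ~ Gamma(alpha, beta).

    Integrating out lam gives a negative binomial distribution (assumed
    here as the count forecast; see the lead-in for the caveat).
    """
    log_p = (
        math.lgamma(alpha + n) - math.lgamma(alpha) - math.lgamma(n + 1)
        + alpha * math.log(beta / (beta + theta))
        + n * math.log(theta / (beta + theta))
    )
    return math.exp(log_p)


prior = (38, 702)                        # Gamma(38, 702) from the example
posterior = gamma_update(*prior, [60])   # one new waiting time of 60 days
post_mean, post_var = gamma_mean_var(*posterior)
```

The posterior parameters are (39, 762), matching the worked example, and the mean of `predictive_pmf` over `n` equals the posterior mean of $\lambda$ times $\theta$.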
Figure 11 shows both the prior and the posterior distributions of the rate parameter in the aforementioned example. After such an update, the expected value of the number of posts exceeding the 250K threshold in the next 365 days is

$$E[N(\theta)] = E[\lambda \mid z]\,\theta = 0.0512 \times 365 = 18.7 \approx 19,$$

and the mean waiting time between posts exceeding 250K shares is $1/0.0512 = 19.5$ days. Moreover, the full probability assessment of the uncertainty around the rate parameter, $\lambda$, allows us to express the predictive posterior probability distribution of the number of posts exceeding a certain threshold over a finite interval of time

$$\Pr(N(\theta) \mid \lambda) = \Pr(\lambda \mid z)\,\theta.$$

Figure 12 shows the posterior predictive probability distribution function of the number of posts exceeding the 250K shares threshold (raw data) in a finite time interval of length 365 days.

Figure 11: Prior and Posterior Distributions of the rate parameter. The grey and red dashed lines indicate, respectively, the mean of the prior and the mean of the posterior distribution of the rate parameter.

Figure 12: Posterior predictive probability distribution. Posterior predictive probability distribution of the number of posts exceeding the 250K shares threshold in the next 365 days. The red dashed line indicates the mean, i.e. 18.7.

3.4. Concluding Remarks

In this paper, we study the statistical properties of viral misinformation in online social media. In particular, we focus our attention on Facebook posts spreading false news, hoaxes and unsubstantiated claims. By means of an Extreme Value Theory approach, we show that the number of extremely viral posts over time follows a homogeneous Poisson process, and that the interarrival times between such posts are independent and identically distributed, following an exponential distribution.
Moreover, we characterize the uncertainty around the rate parameter of the Poisson process through Bayesian methods. Finally, we are able to derive the predictive posterior probability distribution of the number of posts exceeding a certain threshold of shares over a finite interval of time.

The relevance of our results is not necessarily limited to the field of computational social science coping with misinformation. Although the prediction of extremely viral posts (and, more generally, rare events) remains a hard task, we believe that both our findings and the methodology introduced in this paper may be of interest to the broader field of computational social science dealing with forecasting and tracking of viral contents and events, e.g. cybersecurity attacks, terrorist attacks, etc.

Acknowledgements

Special thanks to Geoff Hall and Skepti Forum for providing fundamental support in defining the atlas of Facebook pages disseminating conspiracy theories and myth narratives.

References

[1] J. Brown, A. J. Broderick, N. Lee, Word of mouth communication within online communities: Conceptualizing the online social network, Journal of interactive marketing 21 (3) (2007) 2–20.
[2] R. Kahn, D. Kellner, New media and internet activism: From the 'battle of seattle' to blogging, New media & society 6 (1) (2004) 87–95.
[3] W. Quattrociocchi, R. Conte, E. Lodi, Opinions manipulation: Media, power and gossip, Advances in Complex Systems 14 (04) (2011) 567–586.
[4] W. Quattrociocchi, G. Caldarelli, A. Scala, Opinion dynamics on interacting networks: media competition and social influence, Scientific reports 4.
[5] R. Kumar, M. Mahdian, M. McGlohon, Dynamics of conversations, in: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2010, pp. 553–562.
[6] Eb ola Lessons: Ho w So cial Media Gets Infected, http://www.informationweek.com/software/social/ - ebola- lessons- how- social- media- gets- infected/a/d- id/1307061 (Marc h 2014). 16 [7] The Eb ola Conspiracy Theories, http://www.nytimes.com/2014/10/19/ sunday- review/the- ebola- conspiracy- theories.html (Marc h 2014). [8] The inevitable rise of Eb ola conspiracy theories, http: //www.washingtonpost.com/blogs/wonkblog/wp/2014/10/13/ the- inevitable- rise- of- ebola- conspiracy- theories/ (Marc h 2014). [9] Remem b er jade helm 15, the con trov ersial military exercise? its ov er (Marc h 2014) [cited 29.10.2014]. URL https://www.washingtonpost.com/news/checkpoint/wp/2015/ 09/14/remember- jade- helm- 15- the- controversial- military- exercise- its- over/ [10] 5 m yths surrounding v accines — and the reality (F ebruary 2015). URL http://edition.cnn.com/2015/02/04/us/5- vaccine- myths/ [11] T rumps outrageous claim that thousands of new jersey muslims celebrated the 9/11 attac ks (Nov ember 2015). URL https://www.washingtonpost.com/news/fact- checker/wp/2015/ 11/22/donald- trumps- outrageous- claim- that- thousands- of- new- jersey- muslims- celebrated- th e- 911- attacks/ [12] C. R. Sunstein, A. V ermeule, Conspiracy theories: Causes and cures*, Jour- nal of P olitical Philosophy 17 (2) (2009) 202–227. [13] J. Byford, Conspiracy theories: a critical introduction, P algrav e Macmillan, 2011. [14] G. A. Fine, V. Campion-Vincen t, C. Heath, Rumor mills: The so cial impact of rumor and legend, T ransaction Publishers, 2005. [15] M. A. Hogg, D. L. Bla ylo c k, Extremism and the Psychology of Uncertaint y , V ol. 8, John Wiley & Sons, 2011. [16] L. Ho well, Digital wildfires in a hyperconnected world, WEF Rep ort. [17] News feed fyi: Showing fewer hoaxes (January 2015). URL http://newsroom.fb.com/news/2015/01/ news- feed- fyi- showing- fewer- hoaxes/ [18] V. Qazvinian, E. Rosengren, D. R. Radev, Q. 
Mei, Rumor has it: Identi- fying misinformation in microblogs, in: Pro ceedings of the Conference on Empirical Metho ds in Natural Language Pro cessing, Asso ciation for Com- putational Linguistics, 2011, pp. 1589–1599. [19] G. L. Ciampaglia, P . Shiralk ar, L. M. Ro c ha, J. Bollen, F. Menczer, A. Flammini, Computational fact chec king from knowledge netw orks, PloS one 10 (6) (2015) e0128193. [20] P . Resnic k, S. Carton, S. P ark, Y. Shen, N. Zeffer, Rumorlens: A system for analyzing the impact of rumors and corrections in so cial media, in: Pro c. Computational Journalism Conference, 2014. 17 [21] A. Gupta, P . Kumaraguru, C. Castillo, P . Meier, Tw eetcred: Real- time credibilit y assessment of conten t on t witter, in: Social Informatics, Springer, 2014, pp. 228–243. [22] A. A. AlMansour, L. Brank ovic, C. S. Iliopoulos, A mo del for recalibrat- ing credibility in different contexts and languages-a twitter case study , In- ternational Journal of Digital Information and Wireless Communications (IJDIW C) 4 (1) (2014) 53–62. [23] J. Ratkiewicz, M. Conov er, M. Meiss, B. Gon¸ calv es, A. Flammini, F. Menczer, Detecting and tracking p olitical abuse in so cial media., in: ICWSM, 2011. [24] X. L. Dong, E. Gabrilo vich, K. Murphy , V. Dang, W. Horn, C. Lugaresi, S. Sun, W. Zhang, Knowledge-based trust: Estimating the trust worthiness of w eb sources, Proceedings of the VLDB Endowmen t 8 (9) (2015) 938–949. [25] D. Mo can u, L. Rossi, Q. Zhang, M. Karsai, W. Quattro cio cc hi, Collective atten tion in the age of (mis) information, Computers in Human Behavior 51 (2015) 1198–1204. [26] A. Bessi, M. Coletto, G. A. Da videscu, A. Scala, G. Caldarelli, W. Quat- tro ciocchi, Science vs conspiracy: Collective narratives in the age of misin- formation, PloS one 10 (2) (2015) e0118093. [27] B. Nyhan, J. Reifler, S. Richey , G. L. F reed, Effectiv e messages in v accine promotion: a randomized trial, Pediatrics 133 (4) (2014) e835–e842. [28] M. A. 
Ja v arone, So cial influences in opinion dynamics: the role of con- formit y , Ph ysica A: Statistical Mechanics and its Applications 414 (2014) 19–30. [29] M. A. Ja v arone, G. Armano, Perception of similarity: a model for so cial net work dynamics, Journal of Physics A: Mathematical and Theoretical 46 (45) (2013) 455102. [30] M. A. Jav arone, Netw ork strategies in election campaigns, Journal of Sta- tistical Mec hanics: Theory and Exp eriment 2014 (8) (2014) P08013. [31] A. Bessi, G. Caldarelli, M. Del Vicario, A. Scala, W. Quattro ciocchi, Social determinan ts of conten t selection in the age of (mis) i nformation, in: So cial Informatics, Springer, 2014, pp. 259–268. [32] F. Zollo, A. Bessi, M. Del Vicario, A. Scala, G. Caldarelli, L. Shekht- man, S. Havlin, W. Quattro ciocchi, Debunking in a world of trib es, arXiv preprin t [33] R. Garrett, E. Nisb et, E. Lync h, Undermining the correctiv e effects of media-based p olitical fact chec king? the role of contextual cues and naive theory , Journal of Communication. 18 [34] R. K. Garrett, B. E. W eeks, The promise and p eril of real-time correc- tions to p olitical misp erceptions, in: Proceedings of the 2013 conference on Computer supp orted co op erativ e work, ACM, 2013, pp. 1047–1058. [35] M. L. Meade, H. L. Ro ediger, Explorations in the so cial contagion of mem- ory , Memory & cognition 30 (7) (2002) 995–1009. [36] A. Koriat, M. Goldsmith, A. Pansky , T ow ard a psychology of memory accuracy , Annual review of psychology 51 (1) (2000) 481–537. [37] M. S. Ayers, L. M. Reder, A theoretical review of the misinformation effect: Predictions from an activ ation-based memory mo del, Psyc honomic Bulletin & Review 5 (1) (1998) 1–21. [38] B. Zh u, C. Chen, E. F. Loftus, C. Lin, Q. He, C. Chen, H. Li, R. K. Mo yzis, J. Lessard, Q. 
Dong, Individual differences in false memory from misinfor- mation: P ersonality c haracteristics and their interactions with cognitive abilities, P ersonality and Individual Differences 48 (8) (2010) 889–894. [39] S. J. F renda, R. M. Nichols, E. F. Loftus, Current issues and adv ances in misinformation researc h, Curren t Directions in Psyc hological Science 20 (1) (2011) 20–23. [40] A. Bessi, F. Zollo, M. Del Vicario, A. Scala, G. Caldarelli, W. Quattro- cio cc hi, T rend of narratives in the age of misinformation, PloS one 10 (8) (2015) e0134641. [41] F. Zollo, P . K. No v ak, M. Del Vicario, A. Bessi, I. Mozeti ˇ c, A. Scala, G. Caldarelli, W. Quattro ciocchi, Emotional dynamics in the age of misin- formation, PloS one 10 (9) (2015) e0138740. [42] M. Del Vicario, A. Bessi, F. Zollo, F. P etroni, A. Scala, G. Caldarelli, H. E. Stanley , W. Quattro ciocchi, The spreading of misinformation online, Pro ceedings of the National Academy of Sciences 113 (3) (2016) 554–559. [43] A. Bessi, F. Petroni, M. Del Vicario, F. Zollo, A. Anagnostopoulos, A. Scala, G. Caldarelli, W. Quattro ciocchi, Viral misinformation: The role of homophily and p olarization, in: Proceedings of the 24th International Conference on W orld Wide W eb Companion, In ternational W orld Wide W eb Conferences Steering Committee, 2015, pp. 355–356. [44] A. Bessi, F. Zollo, M. Del Vicario, M. Puliga, A. Scala, G. Caldarelli, B. Uzzi, W. Quattrocio cchi, Users polarization on faceb o ok and youtube, arXiv preprin t [45] A. Bessi, Personalit y traits and echo cham b ers on faceb ook, Computers in Human Beha vior 65 (2016) 319–324. [46] A. Bessi, A. Scala, L. Rossi, Q. Zhang, W. Quattrocio cchi, The econom y of attention in the age of (mis) information, Journal of T rust Management 1 (1) (2014) 1–13. 19 [47] C. Shao, G. L. Ciampaglia, A. Flammini, F. 
Menczer, Hoaxy: A platform for tracking online misinformation, in: Pro ceedings of the 25th In terna- tional Conference Companion on W orld Wide W eb, In ternational W orld Wide W eb Conferences Steering Committee, 2016, pp. 745–750. [48] D. R. Grimes, On the viabilit y of conspiratorial b eliefs, PloS one 11 (1) (2016) e0147905. [49] A. Zubiaga, M. Liak ata, R. Procter, K. Bon tchev a, P . T olmie, T ow ards detecting rumours in so cial media, arXiv preprint [50] M. J. Salganik, P . S. Do dds, D. J. W atts, Experimental study of inequality and unpredictability in an artificial cultural market, science 311 (5762) (2006) 854–856. [51] D. J. W atts, Ev erything is obvious: Ho w common sense fails us, Crown Pub, 2011. [52] J. Cheng, L. Adamic, P . A. Do w, J. M. Kleinberg, J. Lesko v ec, Can cas- cades be predicted?, in: Proceedings of the 23rd international conference on W orld wide web, ACM, 2014, pp. 925–936. [53] A. F riggeri, L. A. Adamic, D. Ec kles, J. Cheng, Rumor cascades., in: ICWSM, 2014. [54] J. Staiano, D. Albanese, et al., Exploring image virality in go ogle plus, in: So cial Computing (So cialCom), 2013 International Conference on, IEEE, 2013, pp. 671–678. [55] T.-A. Hoang, E.-P . Lim, Virality and susceptibility in information diffu- sions., in: ICWSM, 2012. [56] L. Hong, O. Dan, B. D. Davison, Predicting p opular messages in twitter, in: Pro ceedings of the 20th international conference companion on W orld wide w eb, ACM, 2011, pp. 57–58. [57] M. Jenders, G. Kasneci, F. Naumann, Analyzing and predicting viral t weets, in: Proceedings of the 22nd in ternational conference on W orld Wide W eb companion, International W orld Wide W eb Conferences Steer- ing Committee, 2013, pp. 657–664. [58] J. Y ang, S. Counts, Predicting the sp eed, scale, and range of information diffusion in t witter., ICWSM 10 (2010) 355–358. [59] M. Coscia, Average is b oring: How similarity kills a meme’s success, Sci- en tific rep orts 4. [60] L. W eng, F. Menczer, Y.-Y. 
Ahn, Virality prediction and comm unity struc- ture in so cial netw orks, Scientific rep orts 3. 20 [61] J. Zhou, H. Pei, H. W u, Early w arning of human crowds based on query data from baidu map: Analysis based on shanghai stamp ede, arXiv preprint [62] P . Cirillo, N. N. T aleb, On the statistical prop erties and tail risk of violent conflicts, Av ailable at SSRN 2675355. [63] E. Gumbel, Statistics of extremes. 1958, Columbia Univ. press, New Y ork. [64] S. Coles, J. Baw a, L. T renner, P . Dorazio, An in tro duction to statistical mo deling of extreme v alues, V ol. 208, Springer, 2001. [65] P . Em brech ts, C. Kl ¨ upp elberg, T. Mikosc h, Mo delling extremal even ts: for insurance and finance, V ol. 33, Springer Science & Business Media, 2013. [66] P . Cirillo, Are your data really pareto distributed?, Physica A: Statistical Mec hanics and its Applications 392 (23) (2013) 5947–5962. [67] G. E. Box, G. M. Jenkins, G. C. Reinsel, G. M. Ljung, Time series analysis: forecasting and con trol, John Wiley & Sons, 2015. [68] J. A. Villase˜ nor-Alv a, E. Gonz´ alez-Estrada, A b o otstrap go o dness of fit test for the generalized pareto distribution, Computational Statistics & Data Analysis 53 (11) (2009) 3835–3841. 21