On the statistical properties of viral misinformation in online social media

Alessandro Bessi∗

a University of Southern California, Information Sciences Institute, Marina del Rey, Los Angeles, CA, USA
b IUSS Institute for Advanced Study, Pavia, ITALY
c IMT Institute for Advanced Studies, Lucca, ITALY

Abstract

The massive diffusion of online social media allows for the rapid and uncontrolled spreading of conspiracy theories, hoaxes, unsubstantiated claims, and false news. Such an impressive amount of misinformation can influence policy preferences and encourage behaviors strongly divergent from recommended practices. In this paper, we study the statistical properties of viral misinformation in online social media. By means of methods belonging to Extreme Value Theory, we show that the number of extremely viral posts over time follows a homogeneous Poisson process, and that the interarrival times between such posts are independent and identically distributed, following an exponential distribution. Moreover, we characterize the uncertainty around the rate parameter of the Poisson process through Bayesian methods. Finally, we are able to derive the posterior predictive probability distribution of the number of posts exceeding a certain threshold of shares over a finite interval of time.

Keywords: misinformation, online social media, extreme value theory

1. Introduction

The wide availability of user-provided content in online social media encourages the aggregation of people around common interests and narratives. The direct path from producers to consumers of content drives the emergence of a disintermediated environment that is changing the way people become informed, interpret facts, and form their opinions [1, 2, 3, 4, 5].
Unfortunately, such disintermediation can facilitate the spreading of rumors, hoaxes, fake news, and conspiracy theories, which often arouse naive and awkward social responses on different topics, such as health, environment, national security, and politics [6, 7, 8, 9, 10, 11]. In particular, conspiracy theories simplify causation, reduce the complexity of reality, and are formulated in a way that is able to contain a certain level of uncertainty [12, 13, 14, 15]. However, such an impressive amount of misinformation can influence policy preferences and encourage behaviors strongly divergent from recommended practices.

Since the World Economic Forum listed massive digital misinformation as one of the main threats to our society [16], community-driven [17] and algorithmic-driven [18, 19, 20, 21, 22, 23, 24] solutions have been proposed to counteract the pervasiveness of online misinformation. However, a part of the scientific community is skeptical about the real effectiveness of such solutions. Indeed, the community-driven approach proposed by Facebook, where users can flag false content to correct the newsfeed algorithm, is controversial, because it raises fears that the free circulation of ideas may be threatened. Moreover, algorithmic-driven approaches may not be effective, since the acceptance of a claim (either substantiated or not) is heavily influenced by social norms and individual cognitive factors [25, 26, 27, 28, 29, 30]. Indeed, recent works point out both the inefficacy of correcting false beliefs and the concrete risk of a backfire effect [31, 32, 33] from the usual and most committed consumers of conspiracy theories. In fact, false beliefs, once adopted by an individual, are rarely corrected [34, 35, 36, 37].

∗ Corresponding author. Email address: bessi@isi.edu (Alessandro Bessi). Preprint submitted to Elsevier, August 6, 2018.
More specifically, both the formation and the revision of beliefs are strongly affected by the communities wherein ideas and facts are debated [38, 39]. Such a phenomenon is emphasized in online social networks, where users process information through a shared system of meaning [40, 41] inside their echo chambers [42, 43, 44, 45], making sense of facts in ways that are often biased toward self-confirmation. Indeed, recent studies show that increasing the exposure of users to unsubstantiated rumors increases their tendency to be credulous [26, 46], and that content-selective exposure is the primary driver of content diffusion and generates the formation of echo chambers [42].

In this work, we study the statistical properties of viral misinformation in online social media. In particular, we apply methods of Extreme Value Theory, a branch of statistics dealing with extreme deviations from the median of probability distributions, to analyze a large dataset of posts published by Facebook pages supporting conspiracy theories and myth narratives. By means of an in-depth statistical analysis of the shares distribution and the application of the Peaks Over Threshold (POT) approach, we show that the number of extremely viral posts (e.g. > 250K shares) over time follows a homogeneous Poisson process, and that the interarrival times between such posts are independent and identically distributed, following an exponential distribution. Further, we characterize the uncertainty around the rate parameter of the Poisson process through Bayesian methods. Finally, we are able to derive the posterior predictive probability distribution of the number of posts exceeding a certain threshold of shares over a finite interval of time.

The relevance of our results is not necessarily limited to the field of computational social science coping with misinformation [47, 48, 49].
Indeed, although the prediction of extremely viral posts and rare events remains a hard task [50, 51], we believe that both our findings and the methodology used herein may be of interest to the broader field of computational social science dealing with forecasting and tracking of viral content and events [52, 53, 54, 55, 56, 57, 58, 59, 60, 61].

2. Methods

2.1. Ethics Statement

The entire data collection process has been carried out exclusively through the Facebook Graph API, which is publicly available. We used only publicly available data. The pages from which we downloaded data are public Facebook entities.

2.2. Data Collection

We analyzed 328 US public Facebook pages diffusing conspiratorial beliefs, myth narratives, and controversial information, usually lacking supporting evidence and most often contradicting the official news. Such a space of investigation is defined with the same approach as in [26, 42], with the support of different Facebook groups very active in monitoring the conspiracy narratives. For each page, we downloaded all the posts (and their respective metadata) in a timespan of 5 years (Jan 1, 2010 to Dec 31, 2014). The dataset is composed of 345,054 posts. To our knowledge, the dataset is the complete set of conspiracy-like information sources active in the US Facebook scenario up to December 31, 2014.

2.3. Fundamentals of Extreme Value Theory

Extreme value theory (EVT) is a branch of statistics dealing with the extreme deviations from the median of probability distributions. In particular, it aims at assessing the probability of events that are more extreme than any previously observed. Extreme value theory is widely used in many fields of science where power laws play a role in modeling [62], such as structural and geological engineering, finance and risk management, earth sciences, traffic prediction, etc.
In this section, we briefly review some fundamental results of extreme value theory. For extended discussion, proofs, and theorems see [63, 64, 65].

2.3.1. Extreme Value Theory

Suppose $X_1, X_2, \ldots$ are independent and identically distributed (iid) random variables with common cumulative distribution function (cdf) $F$. Let $M_n = \max\{X_1, \ldots, X_n\}$ denote the maximum of the first $n$ random variables (partial maxima) and let $u(F) = \sup\{x : F(x) < 1\}$ denote the upper endpoint of $F$. Since

\[ \Pr(M_n \le x) = \Pr(X_1 \le x, \ldots, X_n \le x) = F^n(x), \]

$M_n$ converges almost surely to $u(F)$, whether it is finite or infinite. Extreme value theory seeks norming constants $a_n > 0$, $b_n \in \mathbb{R}$, and some nondegenerate distribution function $G$ such that the cdf of the normalized $M_n$ converges to $G$, i.e.

\[ \Pr\left( \frac{M_n - b_n}{a_n} \le x \right) = F^n(a_n x + b_n) \xrightarrow{d} G(x). \]

If this holds for suitable choices of $a_n$ and $b_n$, then we say that $G$ is an extreme value distribution function, and $F$ belongs to the maximum domain of attraction of $G$, i.e. $F \in \mathrm{MDA}(G)$. The Extremal Types Theorem characterizes the limit distribution function $G$ as being of the type of one of the following three classes:

• Gumbel: $\Lambda(x) = \exp(-\exp(-x))$, $x \in \mathbb{R}$
• Fréchet: $\Phi_\alpha(x) = 0$ if $x \le 0$, and $\Phi_\alpha(x) = \exp(-x^{-\alpha})$ if $x > 0$
• Weibull: $\Psi_\alpha(x) = \exp(-(-x)^{\alpha})$ if $x \le 0$, and $\Psi_\alpha(x) = 1$ if $x > 0$

for some $\alpha > 0$.

The three extreme value distributions can be represented using the generalized extreme value (GEV) distribution family. Let

\[ H_\xi(x) = \begin{cases} \exp\left( -(1 + \xi x)^{-1/\xi} \right), & \text{if } \xi \ne 0 \\ \exp(-\exp(-x)), & \text{if } \xi = 0 \end{cases} \]

where $1 + \xi x > 0$. Then,

• $\xi = \alpha^{-1} > 0 \leftrightarrow \Phi_\alpha$
• $\xi = -\alpha^{-1} < 0 \leftrightarrow \Psi_\alpha$
• $\xi = 0 \leftrightarrow \Lambda$

From a modeling point of view, the three extreme value distributions are very different, especially as concerns the behavior of the tails, i.e. the part of the distribution most relevant when dealing with extreme events.
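The correspondence between the GEV family and the three classical types can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names are illustrative, and the affine changes of variable $y = \alpha(x - 1)$ (Fréchet) and $y = \alpha(x + 1)$ (Weibull) are one standard way to make "of the same type" explicit.

```python
import math

def gev_cdf(x, xi):
    """Standard GEV cdf H_xi(x); returns 0 or 1 outside the support 1 + xi*x > 0."""
    if xi == 0.0:
        return math.exp(-math.exp(-x))
    z = 1.0 + xi * x
    if z <= 0.0:
        # Left of the support when xi > 0, right of it when xi < 0.
        return 0.0 if xi > 0 else 1.0
    return math.exp(-z ** (-1.0 / xi))

def gumbel_cdf(x):
    return math.exp(-math.exp(-x))

def frechet_cdf(x, alpha):
    return math.exp(-x ** (-alpha)) if x > 0 else 0.0

def weibull_cdf(x, alpha):
    return math.exp(-((-x) ** alpha)) if x <= 0 else 1.0

alpha = 1.5  # illustrative value, not an estimate from the paper
for x in (0.5, 1.0, 3.0, 25.0):
    # Frechet: Phi_alpha(x) = H_{1/alpha}(alpha * (x - 1))
    gap = abs(frechet_cdf(x, alpha) - gev_cdf(alpha * (x - 1.0), 1.0 / alpha))
    print("Frechet", x, gap)
for x in (-3.0, -1.0, -0.25):
    # Weibull: Psi_alpha(x) = H_{-1/alpha}(alpha * (x + 1))
    gap = abs(weibull_cdf(x, alpha) - gev_cdf(alpha * (x + 1.0), -1.0 / alpha))
    print("Weibull", x, gap)
```

The printed gaps should be at the level of floating-point rounding, which is what "same type up to location and scale" means in practice.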
Here, we focus on the Fréchet case, $\Phi_\alpha$ with $\alpha > 0$. If we consider the tail of $\Phi_\alpha(x)$, a Taylor expansion shows that

\[ 1 - \Phi_\alpha(x) = 1 - \exp(-x^{-\alpha}) \sim x^{-\alpha}, \quad x \to \infty. \]

Hence, the tail of $\Phi_\alpha(x)$ decreases as a power law. Moreover, every distribution function that belongs to $\mathrm{MDA}(\Phi_\alpha)$ necessarily has an infinite right endpoint, i.e. it is defined for $x \in [0, \infty)$. It follows that all the distribution functions belonging to $\mathrm{MDA}(\Phi_\alpha)$ are appropriate for modeling phenomena with extremely large maxima.

Consider a random variable $X$ with unknown distribution function $G$ and right endpoint $x_G = \sup\{x \in \mathbb{R} : G(x) < 1\}$. The exceedance distribution function of $X$ above a given threshold $t$ is defined as

\[ G_t(x) = \Pr(X \le x \mid X > t) = \frac{G(x) - G(t)}{1 - G(t)}, \quad x \ge t. \]

For a large class of distribution functions $G$ and a high threshold $t \to x_G$, $G_t$ can be approximated by a Generalized Pareto Distribution, i.e.

\[ G_t(x) \approx \mathrm{GPD}(x; \xi, \beta, t) = \begin{cases} 1 - \left( 1 + \xi \frac{x - t}{\beta} \right)^{-1/\xi}, & \text{if } \xi \ne 0 \\ 1 - \exp\left( -\frac{x - t}{\beta} \right), & \text{if } \xi = 0 \end{cases} \]

where $x \ge t$ for $\xi \ge 0$, $t \le x \le t - \beta/\xi$ for $\xi < 0$, $t \in \mathbb{R}$, $\xi \in \mathbb{R}$, and $\beta > 0$. The shape parameter, $\xi$, governs the fatness of the tails, and thus the existence of the moments. The moment of order $p$ of a Generalized Pareto distributed random variable exists if and only if $\xi < 1/p$.

2.3.2. Extreme Value Analysis

Two approaches exist for practical extreme value analysis. The Block Maxima (BM) approach consists of splitting the observation period into a certain number of non-overlapping periods of equal size (e.g. weeks, months, years) and then considering only the maximal value within each period. Such maximal values follow approximately a Generalized Extreme Value (GEV) distribution. The Peaks Over Threshold (POT) approach relies on considering only the values exceeding a certain high threshold.
The probability distribution of those selected observations is approximately a Generalized Pareto Distribution (GPD).

Both approaches have some limitations. The POT approach picks up all relevant high observations, and thus seems to make better use of the available information. Conversely, the BM approach misses some of these high observations and retains some lower observations. However, there may be reasons for using the BM method: the only available information may be block maxima (e.g. daily, weekly, monthly, or yearly maxima), and the BM approach may be preferable when the observations are not exactly iid. Moreover, while the BM approach may be easier to apply when the block periods appear naturally, some problems arise when this does not happen. In such a case, the choice of the block size for the BM approach may be as difficult as the choice of the threshold for the POT approach. Figure 1 provides a graphical representation of the two approaches.

Figure 1: Block Maxima (BM) vs. Peaks Over Threshold (POT).

3. Results and Discussion

In this paper, we aim at investigating the statistical properties of viral misinformation on Facebook by means of extreme value theory. More specifically, the object of the analysis is the number of times that posts supporting conspiracy theories have been shared (in this study, we assume that each share of a conspiracy post represents the will to support a given conspiracy narrative). This number can be considered as a random variable $X$ following a generic distribution function $F$ with support $[0, \infty)$. Such an infinite right endpoint is justified by the fact that users can share a post as many times as they desire.

3.1. Exploratory Data Analysis

Since for each post we know the time of creation, we have a temporally ordered collection of observations.
Such a time series is irregularly spaced, in the sense that it is characterized by varying interarrival times between observations. A common approach to analyzing irregularly spaced time series consists in transforming the data into equally spaced observations using interpolation methods, and then applying standard methods for equally spaced data. However, such a transformation can introduce a number of significant and hard-to-quantify biases, especially when the interarrival times between observations are highly irregular. Since in our case the spacing of observations varies from seconds to days, we avoid transforming the data.

Moreover, although observations are temporally ordered, it is difficult to assume some kind of time dependence between the numbers of shares received by posts, i.e. the number of shares received by a post does not affect the number of shares received by the following posts. Rather, while we can conceive that some external events (e.g. breaking news, top stories, scandals, etc.) can cause an unusual number of posts in a restricted temporal window (clustering), we can safely assume that the number of shares received by each of those posts is independent and identically distributed (iid).

Figure 2 shows the cumulative number of weekly posts (top panel) and the number of weekly posts (bottom panel). Although there appears to be a clear growth trend, which is likely due to the increase of users on Facebook occurring from 2010 to 2014, and some form of seasonality, we are not able to identify any meaningful seasonality pattern. Indeed, we can assume that the activity of this kind of pages is primarily driven by external events, such as breaking news, top stories, scandals, etc.

Figure 2: Weekly posts. Cumulative number of weekly posts (top panel) and number of weekly posts (bottom panel). The solid red line indicates the fitted linear trend.
However, an increase in the number of posts published by pages in a given temporal window may reflect an increase in the users' excitement and activity. We account for such a possible characteristic of the phenomenon under investigation by rescaling raw data by a factor defined as

\[ R_i = \frac{w_i}{\max(w)}, \quad i \in \{1, \ldots, 261\} \]

where $w_i$ represents the number of posts published by pages in week $i$. Such a rescaling factor inflates the number of shares of posts published in weeks characterized by an overall low activity. Different rescaling strategies have been considered (e.g. rescaling by the mean or the median), and similar results have been obtained.

Figure 3: Shares time series. Time series of the original number of shares (raw data) and the rescaled one (rescaled data).

Figure 3 shows the time series of the original number of shares (raw data) and the rescaled one (rescaled data). The rescaling procedure should have removed, or at least reduced, possible clustering phenomena that would have led to a violation of the iid assumption. We check the iid assumption by means of the records plot, a simple and intuitive exploratory tool widely used in extreme value analysis, which exploits the fact that successive records for iid data should become more and more rare as time goes by. Since a record $x_n$ for the random variable $X$ occurs if $x_n > \max\{x_1, \ldots, x_{n-1}\}$, it is intuitive that if data are iid it becomes more difficult to exceed all past observations, and thus the number of records should follow a logarithmic pattern [63]. Records plots in Figure 4 show that the iid assumption is violated for raw data, but still valid for rescaled data, where records are distributed around their expected value and within the 95% confidence intervals.

Figure 4: Records plots.
The iid assumption for raw data is violated, but still valid for rescaled data, where records are distributed around their expected value and within the 95% confidence intervals.

Figure 5 shows the empirical complementary cumulative distribution functions of raw data and rescaled data. The double log scale of the figures highlights the fatness of the right tail in both empirical distributions. Beyond removing any form of dependence in the raw data, we observe that the rescaling procedure slightly exacerbates the power law behavior of the tail without influencing the body of the distribution. Thus, rescaled data will be used for the subsequent analysis.

Figure 5: Empirical Complementary Cumulative Distribution Function. The rescaling procedure slightly exacerbates the power law behavior of the tail without influencing the body of the distribution.

3.2. Statistical Properties of Viral Misinformation

We use the distribution function of rescaled data to characterize the statistical properties of viral misinformation by means of EVT tools. First, we analyze the limit behavior of the Maximum/Sum ratio

\[ R_n(p) = \frac{M_n(p)}{S_n(p)}, \quad n \ge 1, \; p > 0, \]

where $S_n(p) = \sum_{i=1}^{n} X_i^p$ and $M_n(p) = \max(X_1^p, \ldots, X_n^p)$. The moment of order $p$ of the distribution exists, i.e. $E[X^p] < \infty$, if and only if $R_n(p)$ converges to zero for $n \to \infty$. Conversely, an erratic limit behavior of $R_n(p)$ indicates the infiniteness of the $p$-th moment of the distribution, i.e. $E[X^p] = \infty$. Figure 6 shows that only the first moment of the distribution exists, i.e. $E[X] < \infty$, whereas moments of order $p \ge 2$ are infinite, i.e. $E[X^p] = \infty$ for $p \ge 2$. Identical results hold for raw data. Our distribution function belongs to the maximum domain of attraction of Fréchet, i.e. $F \in \mathrm{MDA}(\Phi_\alpha)$.
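The Maximum/Sum diagnostic described above can be sketched in a few lines. This is only an illustration on synthetic data: the Pareto sample, its tail index (chosen so that $\xi = 0.75$), the seed, and the function names are assumptions made for the example, not the paper's data or code.

```python
import random

def max_sum_ratio(xs, p):
    """Running ratio R_n(p) = max_i x_i^p / sum_i x_i^p for n = 1..len(xs)."""
    running_max, running_sum, ratios = 0.0, 0.0, []
    for x in xs:
        xp = x ** p
        running_max = max(running_max, xp)
        running_sum += xp
        ratios.append(running_max / running_sum)
    return ratios

def pareto_sample(n, alpha, rng):
    """Inverse-transform draws from a Pareto law with tail index alpha and x >= 1."""
    return [(1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]

rng = random.Random(0)
# Hypothetical heavy-tailed sample: alpha = 1/0.75, so the mean is finite
# but the second and higher moments are not.
sample = pareto_sample(200_000, alpha=1.0 / 0.75, rng=rng)
for p in (1, 2, 3):
    ratios = max_sum_ratio(sample, p)
    # Inspect the ratio at a few sample sizes; plotting the full path
    # against n gives the Maximum/Sum ratio plot.
    print(p, [round(ratios[n - 1], 4) for n in (1_000, 10_000, 200_000)])
```

In a sample like this one, the path for $p = 1$ is expected to drift toward zero while the paths for $p \ge 2$ keep jumping whenever a new extreme observation arrives, which is the pattern the text attributes to Figure 6.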
The existence of the first moment of the distribution function allows us to compute a reasonable estimate of the conditional tail mean above a given threshold (in a sample one can compute basically anything, even meaningless quantities). Indeed, by the law of total expectation

\[ E[X \mid X > t] = \frac{E[X] - \Pr(X \le t) \, E[X \mid X \le t]}{\Pr(X > t)}, \]

where $E[X \mid X \le t]$ is finite since it is bounded from above by $t$, and the finiteness of $E[X]$ implies that the conditional tail mean, $E[X \mid X > t]$, is finite.

Such a measure is known in finance as the expected shortfall of a loss distribution, and it lets us answer questions such as "What is the expected number of shares for a post once it has exceeded the 250K shares threshold?". Indeed, the sample estimate is

\[ E[X \mid X > t] = \frac{\sum_{i=1}^{N} x_i I(x_i > t)}{\sum_{i=1}^{N} I(x_i > t)} = \frac{\sum_{i=1}^{N} x_i I(x_i > 250K)}{\sum_{i=1}^{N} I(x_i > 250K)} \approx 467K. \]

Since we showed that the moments of order two and higher do not exist, one should prefer the mean absolute deviation over the variance as a measure of dispersion around the conditional tail mean.

Recall that the moment of order $p$ of a Generalized Pareto distributed random variable exists if and only if $\xi < 1/p$. Since the second moment is infinite, the shape parameter we are going to estimate cannot be smaller than $1/2$. Such an observation has the main implication that we can safely use the maximum likelihood (ML) approach to estimate $\xi$, since the ML estimates are consistent only when $\xi > -1/2$.

Figure 6: Maximum/Sum ratio plot. Only the first moment of the distribution exists, i.e. $E[X] < \infty$, whereas moments of order $p \ge 2$ are infinite, i.e. $E[X^p] = \infty$ for $p \ge 2$.

Since our time series is highly irregularly spaced, with interarrival times ranging between seconds and days, we prefer the Peaks Over Threshold (POT) approach over the Block Maxima (BM) approach to estimate the shape parameter $\xi$.
Indeed, when the block periods used in the BM approach do not appear naturally, the choice of a threshold for the POT approach may be easier.

Before fitting a Generalized Pareto Distribution to the data, we have to identify a feasible threshold. To accomplish such a task, we rely on the Mean Excess Function (MEF). The empirical MEF of a sample of observations $x_1, \ldots, x_n$ is defined as

\[ e_n(t) = \frac{\sum_{i=1}^{n} (x_i - t) \, I(x_i > t)}{\sum_{i=1}^{n} I(x_i > t)}, \]

that is, the ratio between the sum and the number of the exceedances over the threshold $t$. Figure 7 shows the MEF plot for the rescaled data. We observe that the empirical MEF begins to increase linearly in the threshold at $t \approx 10^4$. Such a behavior characterizes power law distribution functions [66], and thus we choose $t = 10^4$ as threshold.

Figure 7: Empirical Mean Excess Function plot.

Given the heavy-tailed behavior of the rescaled data distribution function, the shape parameter $\xi$ is likely to be positive. The left panel of Figure 8 shows the Pickands plot, based on the nonparametric Pickands estimator for $\xi$, defined as

\[ \tilde{\xi}^{(P)}_{\tau,n} = \frac{1}{\log 2} \log \frac{X_{\tau,n} - X_{2\tau,n}}{X_{2\tau,n} - X_{4\tau,n}}, \quad \tau = 1, \ldots, \lfloor n/4 \rfloor \]

where $X_{\tau,n}$ is the $\tau$-th upper order statistic out of a sample of $n$ observations. The Pickands plot shows a more or less stable behavior of the Pickands estimates for different values of $\tau$, suggesting that the true value of $\xi$ lies in the interval $(0.5, 1)$.

The right panel of Figure 8 shows the Hill plot, based on the nonparametric Hill estimator for $\xi$, defined as

\[ \tilde{\xi}^{(H)}_{\tau,n} = \frac{1}{\tau} \sum_{j=1}^{\tau} \ln(X_{j,n}) - \ln(X_{\tau,n}), \]

where $X_{j,n}$ is the $j$-th upper order statistic out of a sample of $n$ observations. The Hill plot outperforms the Pickands plot in stability, suggesting a true value of $\xi$ around $0.75$.

Figure 8: Pickands and Hill nonparametric estimators for the shape parameter $\xi$.
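The empirical mean excess function and the Hill estimator defined above are simple enough to sketch directly. The snippet below is an illustration under stated assumptions: it uses a synthetic Pareto sample with $\xi = 0.75$ rather than the Facebook data, and the function names, seed, and thresholds are hypothetical choices for the example.

```python
import math
import random

def mean_excess(xs, t):
    """Empirical mean excess e_n(t): average of (x - t) over observations with x > t."""
    excesses = [x - t for x in xs if x > t]
    return sum(excesses) / len(excesses) if excesses else float("nan")

def hill_estimator(xs, tau):
    """Hill estimate of xi from the tau largest observations (all must be positive)."""
    top = sorted(xs, reverse=True)[:tau]   # X_{1,n} >= ... >= X_{tau,n}
    log_threshold = math.log(top[-1])      # ln X_{tau,n}
    return sum(math.log(x) - log_threshold for x in top) / tau

rng = random.Random(42)
xi_true = 0.75  # assumed shape for the synthetic sample
sample = [(1.0 - rng.random()) ** (-xi_true) for _ in range(100_000)]

# Hill plot ingredients: the estimate as a function of the number of order statistics.
for tau in (500, 2_000, 8_000):
    print("tau", tau, "xi_hat", round(hill_estimator(sample, tau), 3))

# MEF plot ingredients: the mean excess as a function of the threshold.
for t in (10.0, 20.0, 40.0):
    print("t", t, "e_n(t)", round(mean_excess(sample, t), 1))
```

Plotting the first loop against `tau` gives a Hill plot, and plotting the second against `t` gives an MEF plot; for a power law tail the latter should grow roughly linearly, although it is noisy when the variance is infinite.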
The main implication is that, as already anticipated by the analysis of the Maximum/Sum ratio plot, the true value of $\xi$ is greater than $-1/2$, and thus we can obtain consistent estimates via a maximum likelihood (ML) approach.

Table 1 shows ML estimates of $\xi$ and $\beta$ for different thresholds. We observe a stable value of $\tilde{\xi}^{(ML)}$ for increasing values of the threshold. We obtain similar results for raw data (i.e. $\tilde{\xi}^{(ML)} = 0.769$ $(0.0198)$ with $t = 2.5K$).

Table 1: Maximum Likelihood estimates, standard errors (in parentheses), and number of exceedances for different thresholds.

threshold    xi (ML)           beta (ML)            # exceedances
10K          0.770 (0.0220)    8,750 (205)          6,408
25K          0.800 (0.0391)    19,500 (805)         2,153
50K          0.737 (0.059)     43,170 (2,730)       884
100K         0.746 (0.0869)    74,380 (6,923)       399
150K         0.726 (0.120)     120,460 (15,500)     223

3.3. Frequency of Viral Misinformation

The Peaks Over Threshold (POT) method has two main implications: the exceedances over a high threshold follow a Generalized Pareto Distribution, and the number of excesses over time follows a homogeneous Poisson process. In a homogeneous Poisson process the number of events, $N(\theta)$, in a finite interval of time of length $\theta$ follows the Poisson distribution, i.e.

\[ \Pr(N(\theta) = n) = \frac{(\lambda\theta)^n}{n!} \exp(-\lambda\theta). \]

Moreover, the interarrival times between events are independent and follow the exponential distribution, i.e.

\[ \Pr(\text{interarrival time} > \theta) = \exp(-\lambda\theta). \]

Figure 9: Exponential Quantiles vs. Interarrival Times.

Figure 9 shows that the interarrival times of posts shared more than 750K times (rescaled data) follow approximately an exponential distribution. Moreover, the autocorrelation function (ACF) plot, i.e.
a plot showing the similarity between observations as a function of the time lag between them [67], in Figure 10 shows that the interarrival times between those posts are independent, supporting the iid hypothesis suggested by the records plot in Figure 4. Similar results approximately hold for raw data when considering posts shared more than 250K times (the bootstrap test of fit for the Generalized Pareto Distribution [68] gives a p-value equal to 0.2). We conclude that the number of extremely viral posts over time follows a homogeneous Poisson process.

Figure 10: Autocorrelogram for Interarrival Times. We find no correlation as a function of the time lag between them.

Such a conclusion allows us to exploit some useful properties of homogeneous Poisson processes to quantify the frequency of rare viral content on online social media. Indeed, the expected value of the number of events, $N(\theta)$, in a finite interval of time of length $\theta$ is defined as

\[ E[N(\theta)] = \lambda\theta, \]

where $\lambda > 0$ is known as the rate parameter of the Poisson process. The reciprocal of such a parameter, i.e. $1/\lambda$, is known as the survival parameter of the exponential distribution followed by the interarrival times between the $N(\theta)$ events. Given a sample $z_1, \ldots, z_n$ of interarrival times, the survival parameter is estimated through the sample mean

\[ \frac{1}{\lambda} = \frac{\sum_{i=1}^{n} z_i}{n}. \]

Essentially, we can estimate the survival parameter, $1/\lambda$, of the exponential distribution describing the interarrival times between rare events exceeding a certain threshold, and then use the rate parameter, $\lambda$, of the Poisson process to estimate the expected number of events exceeding that threshold in a finite time of length $\theta$. For instance, the survival parameter of the interarrival times distribution of posts exceeding 250K shares (raw data) is $1/\lambda = 18.5$ days. It follows that $\lambda = 1/18.5 = 0.0541$ posts per day.
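The chain from interarrival times to rate and then to Poisson and exponential probabilities can be sketched as follows. This is a minimal illustration rather than the paper's code: the function names and the 30-day and 20-post queries are hypothetical, and the only number taken from the text is the mean interarrival time of 18.5 days.

```python
import math

def rate_from_gaps(gaps):
    """Rate estimate: reciprocal of the mean interarrival time."""
    return len(gaps) / sum(gaps)

def poisson_pmf(n, lam, theta):
    """Pr(N(theta) = n) for a homogeneous Poisson process with rate lam."""
    mu = lam * theta
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))

def prob_gap_exceeds(theta, lam):
    """Pr(interarrival time > theta) under the exponential law."""
    return math.exp(-lam * theta)

# Hypothetical gaps (in days) whose mean is the 18.5 days reported in the text.
example_gaps = [12.0, 25.0, 18.5]
lam = rate_from_gaps(example_gaps)
horizon_days = 365

print("rate per day:", round(lam, 4))
print("expected posts over the horizon:", round(lam * horizon_days, 1))
print("Pr(no such post within 30 days):", round(prob_gap_exceeds(30, lam), 3))
print("Pr(exactly 20 posts over the horizon):", round(poisson_pmf(20, lam, horizon_days), 3))
```

Working on the log scale with `math.lgamma` keeps the Poisson probability numerically stable for large counts, which matters if the horizon or the rate is increased.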
Basically, while by means of the survival parameter we can answer questions such as "What is the mean waiting time between posts exceeding 250K shares?", through the rate parameter we can answer questions such as "What is the expected number of posts exceeding 250K shares in the next 365 days?". Indeed,

\[ E[N(\theta)] = \lambda\theta = 0.0541 \times 365 = 19.8 \approx 20. \]

A convenient way to assess the uncertainty around the rate parameter, $\lambda$, consists in using a standard Bayesian probability updating method. Indeed, the conjugate prior distribution for the rate of a Poisson distribution is the Gamma distribution, and we can express the prior distribution of $\lambda$ as

\[ \Pr(\lambda) = \mathrm{Gamma}(\alpha, \beta). \]

Since the expected value (mean) of a Gamma distribution is defined as $\alpha/\beta$, we may want to choose the hyperparameters, $\alpha$ and $\beta$, of the prior distribution $\Pr(\lambda)$ so that

\[ \frac{1}{\lambda} = \frac{\beta}{\alpha} = \frac{\sum_{i=1}^{n} z_i}{n}, \]

where $z_1, \ldots, z_n$ are the observed interarrival times. Then, the posterior distribution of the rate parameter is defined as

\[ \Pr(\lambda \mid z) = \mathrm{Gamma}\left( \alpha + k, \; \beta + \sum_{i=1}^{k} z_i \right), \]

where $z_1, \ldots, z_k$ represent $k$ new observed interarrival times.

For instance, we may define the prior distribution of the rate parameter of the interarrival times distribution of posts exceeding 250K shares (raw data) as

\[ \Pr(\lambda) = \mathrm{Gamma}(\alpha, \beta) = \mathrm{Gamma}\left( n, \sum_{i=1}^{n} z_i \right) = \mathrm{Gamma}(38, 702), \]

with mean equal to $\alpha/\beta = 38/702 = 0.0541$, and variance equal to $\alpha/\beta^2 = 38/702^2 = 7.71 \times 10^{-5}$. Then, if after 60 days we observe a post exceeding the 250K shares threshold, the posterior probability distribution of the rate parameter is

\[ \Pr(\lambda \mid z) = \mathrm{Gamma}\left( \alpha + k, \; \beta + \sum_{i=1}^{k} z_i \right) = \mathrm{Gamma}(38 + 1, 702 + 60), \]

with mean equal to $(\alpha + k)/(\beta + \sum_{i=1}^{k} z_i) = (38 + 1)/(702 + 60) = 0.0512$, and variance equal to $(38 + 1)/(702 + 60)^2 = 6.72 \times 10^{-5}$.
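The conjugate update in this example, and the distribution of counts over a horizon that such a Gamma posterior implies, can be sketched as follows. The function names are illustrative. The closed form used in `predictive_pmf` is the standard negative binomial obtained by integrating a Poisson count with mean $\lambda\theta$ over a Gamma$(\alpha, \beta)$ law for $\lambda$; it is stated here as an assumption about how the predictive distribution would be computed, not as the author's implementation.

```python
import math

def gamma_update(alpha, beta, new_gaps):
    """Conjugate update of a Gamma(alpha, beta) law on lambda given new exponential gaps."""
    return alpha + len(new_gaps), beta + sum(new_gaps)

def gamma_mean_var(alpha, beta):
    """Mean and variance of a Gamma(alpha, beta) law in the shape/rate parametrization."""
    return alpha / beta, alpha / beta ** 2

def predictive_pmf(n, alpha, beta, theta):
    """Pr(N(theta) = n) after integrating lambda over Gamma(alpha, beta)."""
    log_p = (math.lgamma(alpha + n) - math.lgamma(alpha) - math.lgamma(n + 1)
             + alpha * math.log(beta / (beta + theta))
             + n * math.log(theta / (beta + theta)))
    return math.exp(log_p)

prior = (38, 702)                          # Gamma(38, 702), as in the text
posterior = gamma_update(*prior, [60])     # one new gap of 60 days
post_mean, post_var = gamma_mean_var(*posterior)

print("posterior parameters:", posterior)
print("posterior mean and variance:", post_mean, post_var)
print("expected posts in 365 days:", post_mean * 365)
print("Pr(at most 10 posts in 365 days):",
      sum(predictive_pmf(n, *posterior, 365) for n in range(11)))
```

Because the predictive distribution carries the uncertainty on $\lambda$ as well as the Poisson noise, its variance is larger than its mean, unlike a plain Poisson distribution with the same expected count.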
Figure 11 shows both the prior and the posterior distributions of the rate parameter in the aforementioned example. After such an update, the expected value of the number of posts exceeding the 250K threshold in the next 365 days is

\[ E[N(\theta)] = E[\lambda \mid z] \, \theta = 0.0512 \times 365 = 18.7 \approx 19, \]

and the mean waiting time between posts exceeding the 250K shares is $1/0.0512 = 19.5$ days.

Figure 11: Prior and Posterior Distributions of the rate parameter. The grey and red dashed lines indicate, respectively, the mean of the prior and the mean of the posterior distribution of the rate parameter.

Figure 12: Posterior predictive probability distribution. Posterior predictive probability distribution of the number of posts exceeding the 250K shares threshold in the next 365 days. The red dashed line indicates the mean, i.e. 18.7.

Moreover, the full probability assessment of the uncertainty around the rate parameter, $\lambda$, allows us to express the posterior predictive probability distribution of the number of posts exceeding a certain threshold over a finite interval of time, obtained by averaging the Poisson distribution of $N(\theta)$ over the posterior distribution of $\lambda$:

\[ \Pr(N(\theta) = n \mid z) = \int_0^{\infty} \Pr(N(\theta) = n \mid \lambda) \, \Pr(\lambda \mid z) \, d\lambda. \]

Figure 12 shows the posterior predictive probability distribution function of the number of posts exceeding the 250K shares threshold (raw data) in a finite time interval of length 365 days.

3.4. Concluding Remarks

In this paper, we study the statistical properties of viral misinformation in online social media. In particular, we focus our attention on Facebook posts spreading false news, hoaxes and unsubstantiated claims. By means of an Extreme Value Theory approach, we show that the number of extremely viral posts over time follows a homogeneous Poisson process, and that the interarrival times between such posts are independent and identically distributed, following an exponential distribution.
Moreover, we characterize the uncertainty around the rate parameter of the Poisson process through Bayesian methods. Finally, we are able to derive the posterior predictive probability distribution of the number of posts exceeding a certain threshold of shares over a finite interval of time.

The relevance of our results is not necessarily limited to the field of computational social science coping with misinformation. Although the prediction of extremely viral posts, and more generally of rare events, remains a hard task, we believe that both our findings and the methodology introduced in this paper may be of interest to the broader field of computational social science dealing with forecasting and tracking of viral content and events, e.g. cybersecurity attacks, terrorist attacks, etc.

Acknowledgements

Special thanks to Geoff Hall and Skepti Forum for providing fundamental support in defining the atlas of Facebook pages disseminating conspiracy theories and myth narratives.

References

[1] J. Brown, A. J. Broderick, N. Lee, Word of mouth communication within online communities: Conceptualizing the online social network, Journal of Interactive Marketing 21 (3) (2007) 2–20.
[2] R. Kahn, D. Kellner, New media and internet activism: From the 'battle of Seattle' to blogging, New Media & Society 6 (1) (2004) 87–95.
[3] W. Quattrociocchi, R. Conte, E. Lodi, Opinions manipulation: Media, power and gossip, Advances in Complex Systems 14 (04) (2011) 567–586.
[4] W. Quattrociocchi, G. Caldarelli, A. Scala, Opinion dynamics on interacting networks: media competition and social influence, Scientific Reports 4.
[5] R. Kumar, M. Mahdian, M. McGlohon, Dynamics of conversations, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2010, pp. 553–562.
[6] Ebola Lessons: How Social Media Gets Infected, http://www.informationweek.com/software/social/-ebola-lessons-how-social-media-gets-infected/a/d-id/1307061 (March 2014).
[7] The Ebola Conspiracy Theories, http://www.nytimes.com/2014/10/19/sunday-review/the-ebola-conspiracy-theories.html (March 2014).
[8] The inevitable rise of Ebola conspiracy theories, http://www.washingtonpost.com/blogs/wonkblog/wp/2014/10/13/the-inevitable-rise-of-ebola-conspiracy-theories/ (March 2014).
[9] Remember jade helm 15, the controversial military exercise? its over (March 2014) [cited 29.10.2014]. URL https://www.washingtonpost.com/news/checkpoint/wp/2015/09/14/remember-jade-helm-15-the-controversial-military-exercise-its-over/
[10] 5 myths surrounding vaccines - and the reality (February 2015). URL http://edition.cnn.com/2015/02/04/us/5-vaccine-myths/
[11] Trumps outrageous claim that thousands of new jersey muslims celebrated the 9/11 attacks (November 2015). URL https://www.washingtonpost.com/news/fact-checker/wp/2015/11/22/donald-trumps-outrageous-claim-that-thousands-of-new-jersey-muslims-celebrated-the-911-attacks/
[12] C. R. Sunstein, A. Vermeule, Conspiracy theories: Causes and cures*, Journal of Political Philosophy 17 (2) (2009) 202–227.
[13] J. Byford, Conspiracy theories: a critical introduction, Palgrave Macmillan, 2011.
[14] G. A. Fine, V. Campion-Vincent, C. Heath, Rumor mills: The social impact of rumor and legend, Transaction Publishers, 2005.
[15] M. A. Hogg, D. L. Blaylock, Extremism and the Psychology of Uncertainty, Vol. 8, John Wiley & Sons, 2011.
[16] L. Howell, Digital wildfires in a hyperconnected world, WEF Report.
[17] News feed fyi: Showing fewer hoaxes (January 2015). URL http://newsroom.fb.com/news/2015/01/news-feed-fyi-showing-fewer-hoaxes/
[18] V. Qazvinian, E. Rosengren, D. R. Radev, Q. Mei, Rumor has it: Identifying misinformation in microblogs, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 1589–1599.
[19] G. L. Ciampaglia, P. Shiralkar, L. M. Rocha, J. Bollen, F. Menczer, A. Flammini, Computational fact checking from knowledge networks, PloS one 10 (6) (2015) e0128193.
[20] P. Resnick, S. Carton, S. Park, Y. Shen, N. Zeffer, Rumorlens: A system for analyzing the impact of rumors and corrections in social media, in: Proc. Computational Journalism Conference, 2014.
[21] A. Gupta, P. Kumaraguru, C. Castillo, P. Meier, Tweetcred: Real-time credibility assessment of content on twitter, in: Social Informatics, Springer, 2014, pp. 228–243.
[22] A. A. AlMansour, L. Brankovic, C. S. Iliopoulos, A model for recalibrating credibility in different contexts and languages-a twitter case study, International Journal of Digital Information and Wireless Communications (IJDIWC) 4 (1) (2014) 53–62.
[23] J. Ratkiewicz, M. Conover, M. Meiss, B. Gonçalves, A. Flammini, F. Menczer, Detecting and tracking political abuse in social media, in: ICWSM, 2011.
[24] X. L. Dong, E. Gabrilovich, K. Murphy, V. Dang, W. Horn, C. Lugaresi, S. Sun, W. Zhang, Knowledge-based trust: Estimating the trustworthiness of web sources, Proceedings of the VLDB Endowment 8 (9) (2015) 938–949.
[25] D. Mocanu, L. Rossi, Q. Zhang, M. Karsai, W. Quattrociocchi, Collective attention in the age of (mis) information, Computers in Human Behavior 51 (2015) 1198–1204.
[26] A. Bessi, M. Coletto, G. A. Davidescu, A. Scala, G. Caldarelli, W. Quattrociocchi, Science vs conspiracy: Collective narratives in the age of misinformation, PloS one 10 (2) (2015) e0118093.
[27] B. Nyhan, J. Reifler, S. Richey, G. L. Freed, Effective messages in vaccine promotion: a randomized trial, Pediatrics 133 (4) (2014) e835–e842.
[28] M. A. Javarone, Social influences in opinion dynamics: the role of conformity, Physica A: Statistical Mechanics and its Applications 414 (2014) 19–30.
[29] M. A. Javarone, G. Armano, Perception of similarity: a model for social network dynamics, Journal of Physics A: Mathematical and Theoretical 46 (45) (2013) 455102.
[30] M. A. Javarone, Network strategies in election campaigns, Journal of Statistical Mechanics: Theory and Experiment 2014 (8) (2014) P08013.
[31] A. Bessi, G. Caldarelli, M. Del Vicario, A. Scala, W. Quattrociocchi, Social determinants of content selection in the age of (mis) information, in: Social Informatics, Springer, 2014, pp. 259–268.
[32] F. Zollo, A. Bessi, M. Del Vicario, A. Scala, G. Caldarelli, L. Shekhtman, S. Havlin, W. Quattrociocchi, Debunking in a world of tribes, arXiv preprint
[33] R. Garrett, E. Nisbet, E. Lynch, Undermining the corrective effects of media-based political fact checking? the role of contextual cues and naive theory, Journal of Communication.
[34] R. K. Garrett, B. E. Weeks, The promise and peril of real-time corrections to political misperceptions, in: Proceedings of the 2013 conference on Computer supported cooperative work, ACM, 2013, pp. 1047–1058.
[35] M. L. Meade, H. L. Roediger, Explorations in the social contagion of memory, Memory & cognition 30 (7) (2002) 995–1009.
[36] A. Koriat, M. Goldsmith, A. Pansky, Toward a psychology of memory accuracy, Annual review of psychology 51 (1) (2000) 481–537.
[37] M. S. Ayers, L. M. Reder, A theoretical review of the misinformation effect: Predictions from an activation-based memory model, Psychonomic Bulletin & Review 5 (1) (1998) 1–21.
[38] B. Zhu, C. Chen, E. F. Loftus, C. Lin, Q. He, C. Chen, H. Li, R. K. Moyzis, J. Lessard, Q. Dong, Individual differences in false memory from misinformation: Personality characteristics and their interactions with cognitive abilities, Personality and Individual Differences 48 (8) (2010) 889–894.
[39] S. J. Frenda, R. M. Nichols, E. F. Loftus, Current issues and advances in misinformation research, Current Directions in Psychological Science 20 (1) (2011) 20–23.
[40] A. Bessi, F. Zollo, M. Del Vicario, A. Scala, G. Caldarelli, W. Quattrociocchi, Trend of narratives in the age of misinformation, PloS one 10 (8) (2015) e0134641.
[41] F. Zollo, P. K. Novak, M. Del Vicario, A. Bessi, I. Mozetič, A. Scala, G. Caldarelli, W. Quattrociocchi, Emotional dynamics in the age of misinformation, PloS one 10 (9) (2015) e0138740.
[42] M. Del Vicario, A. Bessi, F. Zollo, F. Petroni, A. Scala, G. Caldarelli, H. E. Stanley, W. Quattrociocchi, The spreading of misinformation online, Proceedings of the National Academy of Sciences 113 (3) (2016) 554–559.
[43] A. Bessi, F. Petroni, M. Del Vicario, F. Zollo, A. Anagnostopoulos, A. Scala, G. Caldarelli, W. Quattrociocchi, Viral misinformation: The role of homophily and polarization, in: Proceedings of the 24th International Conference on World Wide Web Companion, International World Wide Web Conferences Steering Committee, 2015, pp. 355–356.
[44] A. Bessi, F. Zollo, M. Del Vicario, M. Puliga, A. Scala, G. Caldarelli, B. Uzzi, W. Quattrociocchi, Users polarization on facebook and youtube, arXiv preprint
[45] A. Bessi, Personality traits and echo chambers on facebook, Computers in Human Behavior 65 (2016) 319–324.
[46] A. Bessi, A. Scala, L. Rossi, Q. Zhang, W. Quattrociocchi, The economy of attention in the age of (mis) information, Journal of Trust Management 1 (1) (2014) 1–13.
[47] C. Shao, G. L. Ciampaglia, A. Flammini, F. Menczer, Hoaxy: A platform for tracking online misinformation, in: Proceedings of the 25th International Conference Companion on World Wide Web, International World Wide Web Conferences Steering Committee, 2016, pp. 745–750.
[48] D. R. Grimes, On the viability of conspiratorial beliefs, PloS one 11 (1) (2016) e0147905.
[49] A. Zubiaga, M. Liakata, R. Procter, K. Bontcheva, P. Tolmie, Towards detecting rumours in social media, arXiv preprint
[50] M. J. Salganik, P. S. Dodds, D. J. Watts, Experimental study of inequality and unpredictability in an artificial cultural market, science 311 (5762) (2006) 854–856.
[51] D. J. Watts, Everything is obvious: How common sense fails us, Crown Pub, 2011.
[52] J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, J. Leskovec, Can cascades be predicted?, in: Proceedings of the 23rd international conference on World wide web, ACM, 2014, pp. 925–936.
[53] A. Friggeri, L. A. Adamic, D. Eckles, J. Cheng, Rumor cascades, in: ICWSM, 2014.
[54] J. Staiano, D. Albanese, et al., Exploring image virality in google plus, in: Social Computing (SocialCom), 2013 International Conference on, IEEE, 2013, pp. 671–678.
[55] T.-A. Hoang, E.-P. Lim, Virality and susceptibility in information diffusions, in: ICWSM, 2012.
[56] L. Hong, O. Dan, B. D. Davison, Predicting popular messages in twitter, in: Proceedings of the 20th international conference companion on World wide web, ACM, 2011, pp. 57–58.
[57] M. Jenders, G. Kasneci, F. Naumann, Analyzing and predicting viral tweets, in: Proceedings of the 22nd international conference on World Wide Web companion, International World Wide Web Conferences Steering Committee, 2013, pp. 657–664.
[58] J. Yang, S. Counts, Predicting the speed, scale, and range of information diffusion in twitter, ICWSM 10 (2010) 355–358.
[59] M. Coscia, Average is boring: How similarity kills a meme’s success, Scientific reports 4.
[60] L. Weng, F. Menczer, Y.-Y. Ahn, Virality prediction and community structure in social networks, Scientific reports 3.
[61] J. Zhou, H. Pei, H. Wu, Early warning of human crowds based on query data from baidu map: Analysis based on shanghai stampede, arXiv preprint
[62] P. Cirillo, N. N. Taleb, On the statistical properties and tail risk of violent conflicts, Available at SSRN 2675355.
[63] E. Gumbel, Statistics of extremes. 1958, Columbia Univ. press, New York.
[64] S. Coles, J. Bawa, L. Trenner, P. Dorazio, An introduction to statistical modeling of extreme values, Vol. 208, Springer, 2001.
[65] P. Embrechts, C. Klüppelberg, T. Mikosch, Modelling extremal events: for insurance and finance, Vol. 33, Springer Science & Business Media, 2013.
[66] P. Cirillo, Are your data really pareto distributed?, Physica A: Statistical Mechanics and its Applications 392 (23) (2013) 5947–5962.
[67] G. E. Box, G. M. Jenkins, G. C. Reinsel, G. M. Ljung, Time series analysis: forecasting and control, John Wiley & Sons, 2015.
[68] J. A. Villaseñor-Alva, E. González-Estrada, A bootstrap goodness of fit test for the generalized pareto distribution, Computational Statistics & Data Analysis 53 (11) (2009) 3835–3841.
