Can Cascades be Predicted?

Justin Cheng (Stanford University, jcccf@cs.stanford.edu)
Lada A. Adamic (Facebook, ladamic@fb.com)
P. Alex Dow (Facebook, adow@fb.com)
Jon Kleinberg (Cornell University, kleinber@cs.cornell.edu)
Jure Leskovec (Stanford University, jure@cs.stanford.edu)

ABSTRACT

On many social networking web sites such as Facebook and Twitter, resharing or reposting functionality allows users to share others' content with their own friends or followers. As content is reshared from user to user, large cascades of reshares can form. While a growing body of research has focused on analyzing and characterizing such cascades, a recent, parallel line of work has argued that the future trajectory of a cascade may be inherently unpredictable. In this work, we develop a framework for addressing cascade prediction problems. On a large sample of photo reshare cascades on Facebook, we find strong performance in predicting whether a cascade will continue to grow in the future. We find that the relative growth of a cascade becomes more predictable as we observe more of its reshares, that temporal and structural features are key predictors of cascade size, and that initially, breadth rather than depth in a cascade is a better indicator of larger cascades. This prediction performance is robust in the sense that multiple distinct classes of features all achieve similar performance. We also discover that temporal features are predictive of a cascade's eventual shape. Observing independent cascades of the same content, we find that while these cascades differ greatly in size, we are still able to predict which ends up the largest.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications - Data mining
General Terms: Experimentation, Measurement.
Keywords: Information diffusion, cascade prediction, contagion.

1. INTRODUCTION

The sharing of content through social networks has become an important mechanism by which people discover and consume information online. In certain instances, a photo, link, or other piece of information may get reshared multiple times: a user shares the content with her set of friends, several of these friends share it with their respective sets of friends, and a cascade of resharing can develop, potentially reaching a large number of people. Such cascades have been identified in settings including blogging [1, 13, 21], e-mail [12, 22], product recommendation [20], and social sites such as Facebook and Twitter [9, 18]. A growing body of research has focused on characterizing cascades in these domains, including their structural properties and their content.

In parallel to these investigations, there has been a recent line of work adding notes of caution to the study of cascades. These cautionary notes fall into two main genres: first, that large cascades are rare [11]; and second, that the eventual scope of a cascade may be an inherently unpredictable property [28, 31]. The first concern, that large cascades are rare, is a widespread property that has been observed quantitatively in many systems where information is shared. The second concern is arguably more striking, but also much harder to verify quantitatively: to what extent is the future trajectory of a cascade predictable; and which features, if any, are most useful for this prediction task?

Part of the challenge in approaching this prediction question is that the most direct ways of formulating it do not fully address the two concerns above. Specifically, if we are presented with a short initial portion of a cascade and asked to estimate its final size, then we are faced with a pathological prediction task, since almost all cascades are small.
Alternately, if we radically overrepresent large cascades in our sample, we end up studying an artificial setting that does not resemble how cascades are encountered in practice. A set of recent initial studies has undertaken versions of cascade prediction despite these difficulties [19, 23, 26, 29], but to some extent they are inherent in these problem formulations.

These challenges reinforce the fact that finding a robust way to formulate the problem of cascade prediction remains an open problem. And because it is open, we are missing a way to obtain a deeper, more fundamental understanding of the predictability of cascades. How should we set up the question so that it becomes possible to address these issues directly, and engage more deeply with arguments about whether cascades might, in the end, be inherently unpredictable?

The present work: Cascade growth prediction. In this paper, we propose a new approach to the prediction of cascades, and show that it leads to strong and robust prediction results. We are motivated by a view of cascades as complex dynamic objects that pass through successive stages as they grow. Rather than thinking of a cascade as something whose final endpoint should be predicted from its initial conditions, we think of it as something that should be tracked over time, via a sequence of prediction problems in which we are constantly seeking to estimate the cascade's next stage from its current one.

What would it mean to predict the "next stage" of a cascade? If we think about all cascades that reach size $k$, there is a distribution of eventual sizes that these cascades will reach. Then the distribution of cascade sizes has a median value $f(k) \geq k$. This number $f(k)$ is thus the "typical" final size for cascades that reached size at least $k$. Hence, the most basic way to ask about a cascade's next stage of growth, given that it currently has size $k$, is to ask whether it reaches size $f(k)$.
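The definition of $f(k)$ above can be made concrete with a short sketch. It computes the median final size among cascades that reached at least $k$ reshares and labels each such cascade by whether it grew past that median. The function names and the toy sizes are hypothetical, and the strict comparison against the median is an assumption about how ties are handled.

```python
from statistics import median


def median_final_size(final_sizes, k):
    """Return f(k): the median final size among cascades with at least k reshares."""
    reached = [s for s in final_sizes if s >= k]
    if not reached:
        raise ValueError(f"no cascade reached size {k}")
    return median(reached)


def growth_labels(final_sizes, k):
    """For each cascade with at least k reshares, True if its final size exceeds f(k)."""
    f_k = median_final_size(final_sizes, k)
    return [s > f_k for s in final_sizes if s >= k]


# Hypothetical final cascade sizes (illustrative only, not data from the paper).
sizes = [5, 6, 7, 9, 12, 20, 40, 300, 2, 3]
f5 = median_final_size(sizes, 5)
labels = growth_labels(sizes, 5)
```

With these toy sizes, eight cascades reach size 5 and four of them end up above the median, so the two classes are balanced by construction.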
We therefore propose the following cascade growth prediction problem: given a cascade that currently has size $k$, predict whether it grows beyond the median size $f(k)$. (As we show later, the prediction problem is equivalent to asking: given a cascade of size $k$, will the cascade double its size and reach at least $2k$ nodes?) This implicitly defines a family of prediction problems, one for each $k$. We can thus ask how cascade predictability behaves as we sweep over larger and larger values of $k$. (There are natural variants and generalizations in which we ask about reaching target sizes other than the median $f(k)$.) This problem formulation has a number of strong advantages over standard ways of trying to define cascade prediction. First, it leads to a prediction problem in which the classes are balanced, rather than highly unbalanced. Second, it allows us to ask for the first time how the predictability of a cascade varies over the range of its growth from small to large. Finally, it more closely approximates the real tasks that need to be solved in applications for managing viral content, where many evolving cascades are being monitored, and the question is which are likely to grow significantly as time moves forward.

For studying cascade growth prediction, it is important to work with a system in which the sharing and resharing of information is widespread, the complete trajectories of many cascades (both large and small) are observable, and the same piece of content is shared separately by many people, so that we can begin to control for variation in content. For this purpose, we use a month of complete photo-resharing data from Facebook, which provides a rich ecosystem of shared content exhibiting all of these properties. In this setting, we focus on several categories of questions:

(i) How high an accuracy can we achieve for cascade growth prediction? If we cannot improve on baseline guessing, then this would be evidence for the inherent unpredictability of cascades. But if we can significantly improve on this baseline, then there is a basis for non-trivial prediction. In the latter case, it also becomes important to understand the features that make prediction possible.
(ii) Is growth prediction more tractable on small cascades or large ones? In other words, does the future behavior of a cascade become more or less predictable as the cascade unfolds?
(iii) Beyond just the growth of a cascade, can we predict its "shape", that is, its network structure?

Summary of results. Given the challenges in predicting cascades, we find surprisingly strong performance for the growth prediction problem. Moreover, the performance is robust in the sense that multiple distinct classes of features, including those based on time, graph structure, and properties of the individuals resharing, can achieve accuracies well above the baseline. Cascades whose initial reshares come quickly are more likely to grow significantly; and from a structural point of view, breadth rather than depth in the resharing tree is a better predictor of significant growth.

We investigate the performance of growth prediction as a function of the size of the cascade so far: when we want to predict the growth of a cascade of size $k$, how does our accuracy depend on $k$? It is not a priori clear whether accuracy should increase or decrease as a function of $k$, since for any value of $k$ the challenge is to determine what the cascade will do in the future. Seeing more of the cascade (larger $k$) does not make the problem easier, as it also involves predicting "farther" into the future (i.e., whether the cascade will reach size at least $2k$). We find that accuracy increases with $k$, so that it is possible to achieve better performance on large cascades than small ones.
The features that are most significant for prediction change with $k$ as well, with properties of the content and the original author becoming less important, and temporal features remaining relatively stable.

We also consider a related question: how much of a cascade do we need to see in order to obtain good performance? Specifically, suppose we want to predict the growth of a cascade of size at least $R$, but we are only able to see the first $k < R$ nodes in the cascade. How does prediction performance depend on $k$, and in particular, is there a "sweet spot" where a relatively small value of $k$ gives most of the performance benefits? We find in fact that there is no sweet spot: performance essentially climbs linearly in $k$, all the way up to $k = R$. Perhaps surprisingly, more information about the cascade continues to be useful even up to the full snapshot of size $R$.

In addition to growth, we also study how well we can predict the eventual "shape" of the cascade, using metrics for evaluating tree structures as a numerical measure of the shape. We obtain performance significantly above baseline for this task as well; and perhaps surprisingly, multiple classes of features including temporal ones perform well for this task, despite the fact that the quantity being predicted is a purely structural one.

One of the compelling arguments that originally brought the issue of inherent unpredictability onto the research agenda was a striking experiment by Salganik, Dodds, and Watts, in which they showed that the same piece of content could achieve very different levels of popularity in separate independent settings [28]. Given the richness of our data, we can study a version of this issue here in which we can control for the content being shared by analyzing many cascades all arising from the sharing of the same photo. As in the experiment of Salganik et al., we find that independent resharings of the same photo can generate cascades of very different sizes.
But we also show that this observation can be compatible with prediction: after observing small initial portions of these distinct cascades for the same photo, we are able to predict with strong performance which of the cascades will end up being the largest. In other words, our data shows wide variation in cascades for the same content, but also predictability despite this variation.

Overall, our goal is to set up a framework in which prediction questions for cascades can be carefully analyzed, and our results indicate that there is in fact a rich set of questions here, pointing to important distinctions between different types of features characterizing cascades, and between the essential properties of large and small cascades.

2. RELATED WORK

Many papers have analyzed and cataloged properties of empirically observed information cascades, while others have considered theoretical models of cascade formation in networks. Most relevant to our work are those which focus on predicting the future popularity of a given piece of content. These studies have proposed rich sets of features for prediction, which we discuss later in Section 3.2.

Much prior work aims to predict the volume of aggregate activity: the total number of up-votes on Digg stories [29], total hourly volume of news phrases [34], or total daily hashtag use [23]. At the other end of the spectrum, research has focused on individual user-level prediction tasks: whether a user will retweet a specific tweet [26] or share a specific URL [10]. Rather than attempt to predict aggregate popularity or individual behavior in the next time step, we instead look at whether an information cascade grows over the median size (or doubles in size, as we later show).

Research on communities defined by user interests [3] or hashtag content [27] has also looked at a notion of growth, predicting whether a group will increase in size by a given amount.
Nevertheless, these studies focused on groups of already non-trivial size, and predicted their growth without an explicit internal cascade topology and without tracking predictability over different size classes.

Several papers focus on predictions after having observed a cascade for a given fixed time frame [19, 23, 30]. In contrast, rather than studying specific time slices, we continuously observe the cascade over its entire lifetime and attempt to understand how predictive performance varies as the cascade develops. Moreover, our methodology does not penalize slowly but persistently growing cascades. Thus, we predict the size and the structure after having observed a certain number of initial reshares.

Many studies consider the cascade prediction task as a regression problem [6, 19, 29, 30] or a binary classification problem with large bucket sizes [16, 17, 19]. The danger with these approaches is that they are biased towards studying extremely large but also extremely rare cascades, bypassing the whole issue of the general predictability of cascades. For example, research has specifically focused on content and users that create extremely large cascades, such as popular hashtags [15, 33] and very popular users [9, 14], which has led to criticism that cascades may only be predictable after they have already grown large [31]. While it is useful to understand the dynamics of extremely popular content, such content is also very rare. Thus, we instead seek to understand predictability along a cascade's entire lifetime. We consider cascades that have as few as five reshares, and introduce a classification task which is not skewed towards very large cascades.

3. PREDICTING CASCADE GROWTH

To examine the cascade growth prediction problem, we first define and motivate our experimental setup and the feature sets used, then report our prediction results with respect to different $k$.
3.1 Experimental setup

Mechanics of information passing on Facebook. We focus on content consisting of posts the author has designated as public, meaning that anyone on Facebook is eligible to view it, and we further restrict our attention to content in the form of photos, which comprise the majority of reshare cascades on Facebook [9]. Such posts are then distributed by Facebook's News Feed, typically at first to users who are either friends of the poster or who subscribe to their content, e.g. as followers. Each post is accompanied by a "share" link that allows friends and followers to "reshare" the post with their own friends and followers, thus expanding the set of users exposed to the content. This explicit sharing mechanism creates information cascades, starting with the root node (user or page) that originally created the content, and consisting of all subsequent reshares of that content.

Figure 1 illustrates the process with an example: a node $v_0$ posts a public photo, seen by $v_0$'s friends and followers in their News Feeds. Friends $v_1$ and $v_3$ then share the photo with their own friends. This way the photo propagates over the edges of the Facebook network and creates an information cascade. We represent the cascade graph as $\hat{G}$, and the induced subgraph of all photo sharers, including all friendship or follow links between them, as $G'$. Notice that some users (e.g., $v_5$) are exposed via multiple sources ($v_0$, $v_1$, $v_3$, $v_4$).

Figure 1: An information cascade represented by solid edges on a graph $G$, starting at $v_0$ ($\hat{G}$). Dashed lines indicate friendship edges; the edges between resharers make up the friend subgraph $G'$.

An important issue for our understanding of reshare cascades is the following distinction: content can be produced by users (individual Facebook accounts whose primary audience consists of friends and any subscribers the individual has) and it can also be produced by pages, which correspond to the Facebook accounts of companies, brands, celebrities, and other highly visible public entities. In the common parlance around cascades, reshared content originally produced by a user is often informally viewed as more "organic," developing a following in a more bottom-up way. In contrast, reshared content from pages is seen as more top-down, and generally broadcast via News Feed to a larger set of initial followers. A natural question, and a theme that will run through several analyses in the paper, is to understand if these distinctions carry over to the properties we study here: do user-initiated cascades differ in their predictability and their underlying structure from page-initiated cascades?

Figure 2: The complementary cumulative distribution (CCDF) of cascade size (left) and structural virality measured by using the Wiener index (right).

Dataset description. We sampled our anonymized dataset from photos uploaded to Facebook in June 2013 and observed any reshares occurring within 28 days of initial upload. The dataset only includes photos posted publicly (viewable by anyone), and not deleted during the observation period. Further, we exclude photos with fewer than five reshares, as is required by the prediction tasks described below. We constructed diffusion trees first by taking the explicit cascade, e.g. C clicking "share" on B's "share" of A's photo forms the cascade A → B → C. However, it is possible that user C clicked on user B's share, and then directly reshared from A. Since we want to know how the information actually flowed in the network, we reconstruct the path A → B → C based on click, impression, and friend/follower data [9].

Figure 2 begins to show how photos uploaded by pages generate cascades that differ from those uploaded by users. In our dataset, 81% of cascades are initiated by pages.
Figure 2 shows the cascade size distribution for pages, users, and the two combined. Page cascades are typically larger than user cascades, e.g., 11% of page cascades reach at least 100 reshares, while only 2% of user cascades do, though both follow heavy-tailed distributions. Fitting power-law curves to their tails, we observe power-law exponents $\alpha$ equal to 2.2, 2.1, and 2.1 for user, page, and both, respectively ($x_{\min}$ = 10, 2000, 2000).

In addition to cascade growth, we quantify the shape of a cascade using the Wiener index, defined as the average distance between all pairs of nodes in a cascade. Recent work has proposed the Wiener index as a measure of the structural virality of a cascade [2]. Figure 3 shows examples of cascades with varying Wiener index values. Intuitively, a cascade with low structural virality has most of its distribution following from a small number of hub nodes, while a cascade with high virality will have many long paths. Figure 2 shows the distribution of cascade virality (as measured by Wiener index) in our dataset, which, as we saw with cascade size, follows a heavy-tailed distribution. While user cascades are typically smaller than page cascades in our dataset, they tend to have greater structural virality, supporting the intuition that the structure of user-initiated cascades is richer and deeper than that of page-initiated cascades.

Figure 3: Cascades with a low Wiener index $d$ resemble star graphs, while those with a high index appear more viral (the root is red). Panels: (a) $d = 1.98$, (b) $d = 2.47$, (c) $d = 14.4$.

Defining the cascade growth prediction problem. Our aim in this paper is to study how well cascades can be predicted. Moreover, we are interested in understanding how various aspects of the prediction task affect the predictive performance. There are several formulations of the task.
If we were to define the task as a regression problem, predictions may be skewed towards large cascades, as cascade size follows a heavy-tailed distribution (Figure 2, left). Similarly, if we define it as a classification problem of predicting whether a cascade reaches a specific size, we may end up with unbalanced classes, and an overrepresentation of large cascades. Also, if we simply observe a small initial portion of a cascade and predict its future size, the problem is pathological, as almost all cascades are small. And if we only varied the initial period of observation, the task of predicting whether a cascade reaches a certain size would get easier as we observe more of it.

To remedy these issues, we define a classification task that does not suffer from these deficiencies. We consider a binary classification problem where we observe the first $k$ reshares of a cascade and predict whether the eventual size of a cascade reaches the median size of all the cascades with at least $k$ reshares, $f(k)$. This allows us to study how cascade predictability varies with $k$. As exactly half the cascades reach a size greater than the median by definition, random guessing achieves an accuracy of 50%.

Interestingly, the question of whether the cascade will reach $f(k)$ is equivalent to that of whether a cascade will double in size. This follows directly from the fact that the cascade size distribution follows a power law with exponent $\alpha \approx 2$. Consider a power-law distribution on the interval $(x_{\min}, \infty)$ with a power-law exponent $\alpha \approx 2$. Then the median $f(x)$ of this distribution is $2 \cdot x_{\min}$, as demonstrated by the following calculation:

$$\int_{x_{\min}}^{f(x)} \frac{\alpha - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha} dx = \frac{1}{2} \;\Rightarrow\; f(x) = 2^{\frac{1}{\alpha - 1}} \, x_{\min} = 2 x_{\min}$$

As we examine cascades of size greater than $k = x_{\min}$, the median size of these cascades is thus $2 \cdot k$ from this derivation. In each of our prediction tasks, we observe that this is indeed true.

Methods used for learning.
Our general methodology for the cascade prediction problem will be to represent a cascade by a set of features and then use machine learning classifiers to predict its future size. We used a variety of learning methods, including linear regression, naive Bayes, SVM, decision trees and random forests. However, we primarily report performance of the logistic regression classifier for ease of comparison. In many cases, the performance of most classifiers was similar, although non-linear classifiers such as random forests usually performed slightly better than linear classifiers such as logistic regression. In all cases, we performed 10-fold cross validation and report the classification accuracy, F1 score, and area under the ROC curve (AUC).

3.2 Factors driving cascade growth

We proceed by describing factors that contribute to the growth and spreading of cascades. We group these factors into five classes: properties of the content that is spreading, features of the original poster, features of the resharer, structural features of the cascade, and temporal characteristics of the cascade. Table 1 contains a detailed list of features.

Content features. The first natural factor contributing to the ability of the cascade to spread is the content itself [7]. On Twitter, tweet content, and in particular hashtags, are used to generate content features [23, 30] and to identify topics affecting retweet likelihood [26]. LDA topic models have also been incorporated into these prediction tasks [16], and human raters employed to infer the interestingness of content [5, 26]. In our work, we relied on a linear SVM model, trained using image GIST descriptors and color histogram features, to assign likelihood scores of a photo being a closeup shot, taken indoors or outdoors, synthetically generated (e.g., screenshots or pure text vs. photographs), or containing food, a landmark, person, nature, water, or overlaid text (e.g., a meme).
We also analyzed words in the caption accompanying an image for positive sentiment, negative sentiment, and sociality [17, 25]. Nevertheless, while content features affected the performance of structural and temporal features, we find that they are weak predictors of how widely disseminated a piece of content would become.

Original poster/resharer features. Some prior work focused on features of the root node in a cascade to predict the cascade's evolution, finding that content from highly-connected individuals reaches larger audiences, and thus spreads further. Users with large follower counts on Twitter generated the largest retweet cascades [5]. Separately, features of an author of a tweet were shown to be more important than features of the tweet itself [26]. In many Twitter studies predicting cascade size or popularity, a user's number of followers ranks among the top predictors of popularity, if not as the most important one [5, 23].

Other features of the root node have also been studied, such as the number of prior retweets of a user's posts [5, 16], and how many Twitter lists a user was included in [26]. The number of @-mentions of a Twitter user was used to predict whether, and how soon, a tweet would be retweeted, how many users would directly retweet, and the depth a cascade would reach [33]. Still, [8] found that various measures of a user's popularity are not very correlated with his or her influence.

We capture the intuition behind these factors by defining demographic as well as network features of the original poster, as well as the features of the users who reshared the content so far. We use Facebook's distinction of users (individuals) and pages (entities representing an interest) to further distinguish different origin types, in addition to the influence features mentioned above.

Structural features of the cascade.
Networks provide the substrate through which information spreads, and thus their structure influences the path and reach of the cascade. As illustrated in Figure 1, we generate features from both the graph of the first $k$ reshares ($\hat{G}$), as well as the induced friend subgraph of the first $k$ resharers ($G'$). Whereas the reshare graph $\hat{G}$ describes the actual spread of a cascade, the friend subgraph $G'$ provides information about the social ties between these initial resharers. The social graph $G$ allows us to compute the potential reach of these reshares.

Figure 4: Using logistic regression, we are able to predict with near 80% accuracy whether the size of a cascade will reach the median (10) after observing the first $k = 5$ reshares. (Panels: Accuracy, F1 score, AUC; feature sets shown: Content, Root, Structural, All \ Temporal, Resharer, Temporal, All.)

Previous work considered the network structure of the underlying graph in inferring the virality of content [32], with highly viral items spreading across communities. We use the density of the initial reshare cascade ($\mathit{subgraph}'_k$) and the proximity to the root node ($\mathit{orig\_connections}_k$, $\mathit{did\_leave}$) as proxies for whether an item is spreading primarily within a community or across many. One can also look outside the network between resharers, and count the number of users reachable via all friendship and follow edges of the first $k$ users ($\mathit{border\_nodes}_k$). This relates to the total number of exposed users, and has been demonstrated to be an important feature in predicting Twitter hashtag popularity [23].
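Two of the structural quantities described above, the friendship edges among the first $k$ resharers and the number of users reachable from them, can be sketched as follows. This is a minimal illustration, assuming an undirected friendship graph stored as an adjacency dictionary; the function names, the toy graph, and the choice to exclude the resharers themselves from the border count are assumptions, not the paper's exact feature definitions.

```python
def friend_subgraph_edges(resharers, friends):
    """Count undirected friendship edges among the resharers (induced subgraph)."""
    members = set(resharers)
    edges = {
        frozenset((u, v))
        for u in members
        for v in friends.get(u, ())
        if v in members and v != u
    }
    return len(edges)


def border_nodes(resharers, friends):
    """Count users outside the resharer set adjacent to at least one resharer."""
    members = set(resharers)
    reachable = set()
    for u in members:
        reachable.update(friends.get(u, ()))
    return len(reachable - members)


# Hypothetical friendship graph (illustrative only).
friends = {
    "v0": {"v1", "v3", "v5"},
    "v1": {"v0", "v2", "v5"},
    "v3": {"v0", "v4", "v5"},
    "v2": {"v1"},
    "v4": {"v3"},
    "v5": {"v0", "v1", "v3"},
}
first_k = ["v0", "v1", "v3"]
n_edges = friend_subgraph_edges(first_k, friends)
n_border = border_nodes(first_k, friends)
```

In this toy graph the three resharers share two friendship edges and can expose three further users; a dense subgraph with few border nodes would suggest spread within a single community.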
As we can trace information flow on Facebook exactly, we need not worry about independent entry points influencing a cascade [6, 24]; external influence instead allows us to investigate multiple independent cascades arising from the same content (see Section 5.1).

Temporal features. Properties related to the "speed" of the cascade (e.g., $\mathit{time}_k$) were shown to be the most important features in predicting thread length on Facebook [4], and are a primary mechanism in predicting online content popularity [29]. Moreover, as the speed of diffusion changes over time, this may have a strong effect on the ability of the cascade to continue spreading through the network [33].

We characterize a number of temporal properties of cascade diffusion (see Table 1). In particular, we measure the change in the speed of reshares ($\mathit{time}''_{1..k}$), compare the differences between the speed in the first and second half of the measurement period ($\mathit{time}'_{1..k/2}$, $\mathit{time}'_{k/2..k}$), and quantify the number of users who were exposed to the cascade per time unit ($\mathit{views}'_{1..k-1,k}$).

3.3 Predicting cascade growth

To illustrate the general performance of the features described in the previous section we consider a simple prediction task, where we observe the first 5 reshares of the cascade and want to predict whether it will reach the median cascade size (or equivalently, whether it will double and be reshared at least 10 times). For the experiment we use a set of $N_c$ = 150,572 photos, where each photo was shared at least 5 times. The total number of reshares of these photos was $N_r$ = 9,233,300.

Figure 5: If we observe the first $k$ reshares of a cascade, and want to predict whether the cascade will double in size, our prediction improves as we observe more of it. (Axes: number of reshares observed ($k$) against mean accuracy; series: cascades with $k$ or more total reshares.)

Figure 4 shows logistic regression performance using all features from Table 1.
For this task, random guessing would obtain a performance of 0.5, while our method achieves surprisingly strong performance: classification accuracy of 0.795 and AUC of 0.877. If we relax the task and, instead of predicting above vs. below median size, we predict top vs. bottom quartile (top 25% vs. bottom 25%), the accuracy rises even further to 0.926, and the AUC to 0.976.

Overall, while each feature set is individually significantly better than predicting at random, it is the set of temporal features that outperforms all other individual feature sets, obtaining performance scores within 0.025 of those obtained when using all features. To understand if we could do well without temporal features, we trained a classifier which excluded them and were still able to obtain reasonable performance. This is especially useful when one knows through whom information was passed, but not when it was passed. The lack of reliance on any individual set of features demonstrates that the predictions are robust.

Studied individually, we also find that temporal features generally performed best, followed by structural features. The reshare rate in the second half ($\mathit{time}'_{k/2..k}$) was most predictive, attaining accuracy of 0.73. This was followed by the rate of user views of the original photo, $\mathit{views}'_{0,k}$, and the time elapsed between the original post and fifth reshare, $\mathit{time}_5$ (both 0.72). In fact, $\mathit{time}_{k+1}$ is always more accurate than $\mathit{time}_k$. The most accurate structural features were $\mathit{did\_leave}$ and $\mathit{outdeg}(v_0)$ (both 0.65). We examine individual feature importance in more detail later.

3.4 Predictability and the observation window of size $k$

It is also natural to ask whether cascades get more or less predictable as we observe more of the initial growth of a cascade.
One may think that observing more of the cascade may allow us to extrapolate its future growth better; on the other hand, additional observed reshares may also introduce noise and uncertainty in the future growth of the cascade. Note that the task does not get easier as we observe more of the cascade, as we are predicting whether the cascade will reach size 2k (or equivalently, the median) given that we have seen k reshares so far.

Figure 5 shows that the predictive performance of whether a cascade doubles in size increases as a function of the number of observed reshares k. In other words, it is easier to predict whether a cascade that has reached 25 reshares will get another 25, than to predict whether a cascade that has reached 5 reshares will obtain an additional 5. Thus, the prediction accuracy for larger cascades is above the already high accuracy for smaller values of k. The F1 score and AUC also follow a very similar trend.

Table 1: List of features used for learning. We compute these features given the cascade until the kth reshare.

Content Features
  $\mathit{score}_{food/nature/\ldots}$: The probability of the photo having a specific feature (food, overlaid text, landmark, nature, etc.)
  $\mathit{is\_en}$: Whether the photo was posted by an English-speaking user or page
  $\mathit{has\_caption}$: Whether the photo was posted with a caption
  $\mathit{liwc}_{pos/neg/soc}$: Proportion of words in the caption that expressed positive or negative emotion, or sociality, if English

Root (Original Poster) Features
  $\mathit{views}_{0,k}$: Number of users who saw the original photo until the kth reshare was posted
  $\mathit{orig\_is\_page}$: Whether the original poster is a page
  $\mathit{outdeg}(v_0)$: Friend, subscriber or fan count of the original poster
  $\mathit{age}_0$: Age of the original poster, if a user
  $\mathit{gender}_0$: Gender of the original poster, if a user
  $\mathit{fb\_age}_0$: Time since the original poster registered on Facebook, if a user
  $\mathit{activity}_0$: Average number of days the original poster was active in the past month, if a user

Resharer Features
  $\mathit{views}_{1..k-1,k}$: Number of users who saw the first k − 1 reshares until the kth reshare was posted
  $\mathit{pages}_k$: Number of pages responsible for the first k reshares, including the root, or $\sum_{i=0}^{k} \mathbb{1}\{v_i \text{ is a page}\}$
  $\mathit{friends}^{avg/90p}_k$: Average or 90th percentile friend count of the first k resharers, or $\frac{1}{k} \sum_{i=1}^{k} \mathit{outdeg}_{friends}(v_i)\, \mathbb{1}\{v_i \text{ is a user}\}$
  $\mathit{fans}^{avg/90p}_k$: Average or 90th percentile fan count of the first k resharers, or $\frac{1}{k} \sum_{i=1}^{k} \mathit{outdeg}(v_i)\, \mathbb{1}\{v_i \text{ is a page}\}$
  $\mathit{subscribers}^{avg/90p}_k$: Average or 90th percentile subscriber count of the first k resharers, or $\frac{1}{k} \sum_{i=1}^{k} \mathit{outdeg}_{subscriber}(v_i)\, \mathbb{1}\{v_i \text{ is a user}\}$
  $\mathit{fb\_ages}^{avg/90p}_k$: Average or 90th percentile time since the first k resharers registered on Facebook, or $\frac{1}{k} \sum_{i=1}^{k} \mathit{fb\_age}_i$
  $\mathit{activities}^{avg/90p}_k$: Average number of days the first k resharers were active in July, or $\frac{1}{k} \sum_{i=1}^{k} \mathit{activity}_i$
  $\mathit{ages}^{avg/90p}_k$: Average age of the first k resharers, or $\frac{1}{k} \sum_{i=1}^{k} \mathit{age}_i$
  $\mathit{female}_k$: Number of female users among the first k resharers, or $\sum_{i=1}^{k} \mathbb{1}\{\mathit{gender}_i \text{ is female}\}$

Structural Features
  $\mathit{outdeg}(v_i)$: Connection count (sum of friend, subscriber and fan counts) of the ith resharer (or out-degree of $v_i$ on $G = (V, E)$)
  $\mathit{outdeg}(v'_i)$: Out-degree of the ith reshare on the induced subgraph $G' = (V', E')$ of the first k resharers and the root
  $\mathit{outdeg}(\hat{v}_i)$: Out-degree of the ith reshare on the reshare graph $\hat{G} = (\hat{V}, \hat{E})$ of the first k reshares
  $\mathit{orig\_connections}_k$: Number of first k resharers who are friends with, or fans of the root, or $|\{v_i \mid (v_0, v_i) \in E,\ 1 \le i \le k\}|$
  $\mathit{border\_nodes}_k$: Total number of users or pages reachable from the first k resharers and the root, or $|\{v_i \mid (v_i, v_j) \in E,\ 0 \le i, j \le k\}|$
  $\mathit{border\_edges}_k$: Total number of first-degree connections of the first k resharers and the root, or $|\{(v_i, v_j) \mid (v_i, v_j) \in E,\ 0 \le i, j \le k\}|$
  $\mathit{subgraph}'_k$: Number of edges on the induced subgraph of the first k resharers and the root, or $|\{(v_i, v_j) \mid (v_i, v_j) \in E',\ 0 \le i, j \le k\}|$
  $\mathit{depth}'_k$: Change in tree depth of the first k reshares, or $\min_\beta \sum_{i=1}^{k} (\mathit{depth}_i - \beta i)^2$
  $\mathit{depths}^{avg/90p}_k$: Average or 90th percentile tree depth of the first k reshares, or $\frac{1}{k} \sum_{i=1}^{k} \mathit{depth}_i$
  $\mathit{did\_leave}$: Whether any of the first k reshares are not first-degree connections of the root

Temporal Features
  $\mathit{time}_i$: Time elapsed between the original post and the ith reshare
  $\mathit{time}'_{1..k/2}$: Average time between reshares, for the first k/2 reshares, or $\frac{1}{k/2 - 1} \sum_{i=1}^{k/2 - 1} (\mathit{time}_{i+1} - \mathit{time}_i)$
  $\mathit{time}'_{k/2..k}$: Average time between reshares, for the last k/2 reshares, or $\frac{1}{k/2 - 1} \sum_{i=k/2}^{k-1} (\mathit{time}_{i+1} - \mathit{time}_i)$
  $\mathit{time}''_{1..k}$: Change in the time between reshares of the first k reshares, or $\min_\beta \sum_{i=1}^{k-1} \left( (\mathit{time}_{i+1} - \mathit{time}_i) - \beta i \right)^2$
  $\mathit{views}'_{0,k}$: Number of users who saw the original photo, until the kth reshare was posted, per unit time, or $\mathit{views}_{0,k} / \mathit{time}_k$
  $\mathit{views}'_{1..k-1,k}$: Number of users who saw the first k − 1 reshares, until the kth reshare was posted, per unit time, or $\mathit{views}_{1..k-1,k} / \mathit{time}_k$

Overall, these results demonstrate that observing more of the cascade, while also predicting "farther" into the future, is easier than observing a cascade early in its life and predicting what it will do next (i.e., k = 25 vs. k = 5).

Fixing the minimum cascade size R. In the previous version of the task, cascades are required only to have at least k reshares. Thus, the set of cascades changes with k. Here, we examine a variation of this task, where we compose a dataset of cascades that have at least R reshares. We observe the first k (k ≤ R) reshares of the cascade and aim to predict whether the cascade will grow over the median size (over all cascades of size ≥ R). As we increase k, the task gets easier as we observe more of the cascade and the predicted quantity does not change.

With this task, we find that performance increases linearly with k up to R; that is, there is no "sweet spot" or region of diminishing returns (p < 0.05 using a Harvey-Collier test). For example, the top-most line in Figure 6 shows that when each observed cascade has obtained 100 or more reshares, performance increases linearly as more of the cascade is observed. This demonstrates that more information is always better: the greater the number of observed reshares, the better the prediction.

However, Figure 6 also shows that larger cascades are less predictable than smaller cascades. For example, predicting whether cascades with 1,000 to 2,000 reshares grow large is significantly more difficult than predicting cascades of 100 to 200 reshares. This shows that once one knows that a cascade will grow to be large, knowing the characteristics of the very beginning of its spread is less useful for prediction.

Figure 6: Knowing that a cascade obtains at least R reshares, prediction performance increases linearly with k, k ≤ R. However, differentiating among cascades with large R also becomes more difficult.
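The two task variants described in this section differ only in which cascades are kept and what threshold defines a positive label. The sketch below illustrates that difference under stated assumptions: the function names `doubling_labels` and `fixed_r_labels` are hypothetical, the input is simply a list of final cascade sizes, and "grow over the median" is read as strictly greater than the median. It is an illustration of the setup, not the authors' code.

```python
from statistics import median


def doubling_labels(final_sizes, k):
    """Variant 1: keep cascades with at least k reshares.

    The positive class is "reaches 2k reshares", which the text treats as
    equivalent to reaching the median size of the kept cascades.
    Note that the kept set changes with k.
    """
    eligible = [s for s in final_sizes if s >= k]
    return [s >= 2 * k for s in eligible]


def fixed_r_labels(final_sizes, r):
    """Variant 2: keep cascades with at least r reshares.

    The positive class is "final size above the median of the kept set".
    The labels do not depend on how many reshares k <= r are observed.
    """
    eligible = [s for s in final_sizes if s >= r]
    cutoff = median(eligible)
    return [s > cutoff for s in eligible]


sizes = [3, 5, 7, 10, 12, 30]          # made-up final cascade sizes
print(doubling_labels(sizes, 5))       # [False, False, True, True, True]
print(doubling_labels(sizes, 10))      # [False, False, True]
print(fixed_r_labels(sizes, 5))        # [False, False, False, True, True]
```

In the first variant the prediction target moves with k, so observing more does not automatically make the task easier; in the second the target is fixed and only the amount of observed information changes.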
3.5 Changes in feature importance

We now examine how feature importance changes as more and more of the cascade is observed. In this experiment, we compute the value of the feature after observing the first k reshares and measure the correlation coefficient of the feature value with the log-transformed number of reshares (or cascade size).

Figure 7 shows the results for the five feature types. We summarize the results by the following observations:

• Correlations of averages increase with the number of observations. As we obtain more examples, naturally averages get less noisy, and more predictive (e.g., $\mathit{ages}^{avg}$ and $\mathit{friends}^{avg}$).
• The original post gets less important with increasing k. After observing 100 reshares, it becomes less important that the original post was made by a page ($\mathit{orig\_is\_page}$), or that the original poster had many connections to other users ($\mathit{outdeg}(v_0)$).
• Similarly, the actual content being reshared gets less important with increasing k. The correlations of almost all content features tend to zero as k increases, except for $\mathit{has\_caption}$ and $\mathit{is\_en}$. This can be explained by the fact that cascades of photos with captions have a unimodal distribution, and cascades started by English speakers have a bimodal distribution. Thus, these features become correlated in opposite directions.
• Successful cascades get many views in a short amount of time, and achieve high conversion rates. The number of users who have viewed reshares of a cascade ($\mathit{views}_{1..k-1,k}$) becomes more negatively correlated with cascade size as k increases, suggesting that requiring "fewer tries" to achieve a given number of reshares is a positive indicator of its future success. On the other hand, while requiring fewer views is good, rapid exposure, or reaching many users within a short amount of time, is also a positive predictor ($\mathit{views}'_{1..k-1,k}$).
• Structural connectedness is important, but gets less important over time. Nevertheless, reshare depth remains highly correlated: the deeper a cascade goes, the more likely it is to be long-lasting, as even users "far away" from the original poster still find the content interesting.
• The importance of timing features remains relatively stable. While highly correlated, timing features remain remarkably stable in importance as k increases.

Figure 7: The importance of each feature varies as we observe more of a cascade, as shown by the change in correlation coefficients. [Panels: (a) Content, (b) Root, (c) Resharer, (d) Structural, (e) Temporal; each plots a feature's correlation coefficient against k.]

We note that individual features' logistic regression coefficients empirically follow similar shapes, but have the downside of having interactions with one another. Using either the slope of the best-fit line of the cascade size against the normalized feature value, or individual feature performance, also reveals similar trends. Further, LIWC text content features (positive, negative, and social categories) consistently performed poorly, attaining performance no better than chance, with accuracy between 0.49 and 0.52.

Figure 8: The mean structural virality (Wiener index) increases with cascade size, but is significantly higher for user cascades. [Plot: mean structural virality against cascade size, binned from [1,10) to [10000,1e+05), for pages and users.]

4. PREDICTING CASCADE STRUCTURE

Similar to predicting cascade size, we can also attempt to predict the structure of the cascade. We now turn to examining how structural features of the cascade determine its evolution and spread.
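Since the analyses in this section rely on structural virality, which the Figure 8 caption identifies with the Wiener index, a small worked example may help. The sketch below assumes the common definition of this quantity as the mean shortest-path distance over all pairs of distinct nodes in the cascade tree; the excerpt here does not restate the definition, so treat that as an assumption. The `parents` encoding and the function name `wiener_index` are hypothetical, and the quadratic-time breadth-first search is chosen for clarity rather than speed.

```python
from collections import deque


def wiener_index(parents):
    """Mean pairwise distance between nodes of a connected cascade tree.

    parents[i] is the index of the node that node i reshared from;
    the root (index 0) has parent None. This encoding is an assumption
    made for the example.
    """
    n = len(parents)
    if n < 2:
        return 0.0

    # Build an undirected adjacency list from the parent pointers.
    adjacency = [[] for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent is not None:
            adjacency[child].append(parent)
            adjacency[parent].append(child)

    total = 0
    for source in range(n):
        # Breadth-first search gives hop distances from `source`.
        dist = [-1] * n
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist)

    # `total` counts every ordered pair once; there are n * (n - 1) of them.
    return total / (n * (n - 1))


star = [None, 0, 0, 0]    # three reshares directly from the root
chain = [None, 0, 1, 2]   # each reshare comes from the previous one
print(wiener_index(star))   # 1.5
print(wiener_index(chain))
```

The star (broadcast-like) tree scores lower than the chain of the same size, which matches the intuition that deeper, more branching cascades have higher structural virality.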
4.1 User-started and page-started cascades

Earlier we discussed the notion of structural virality as a measure of how much the structure of a cascade is dominated by a few hub nodes, and we saw that user-initiated cascades have significantly higher structural virality than page-initiated cascades, reflecting their richer graph structure. It is natural to ask how these distinctions vary with the size of the cascade: are large user-initiated cascades more similar to page-initiated ones, e.g., are they driven by popular hub nodes?

Figure 8 shows that the opposite is the case: user and page-initiated cascades remain structurally distinct, with this distinction even increasing with cascade size. Moreover, this difference continues to hold even when controlling for the number of first-degree reshares (directly from the root), suggesting a certain robustness to their richer structure. Because of these structural differences, we handle user and page cascades separately in the analyses that follow.

These distinctions may also help explain a large difference in the predictability of user-initiated vs. page-initiated cascades. We observe that for page cascades accuracy exceeds 80%, while that for user cascades is slightly under 70%. (These results also hold for the F1 score and AUC, with a difference of about 0.1.) The fact that much more of the structure of a page-initiated cascade is typically carried by a small number of hub nodes may suggest why the prediction task is more tractable in this case.

4.2 The initial structure of a cascade influences its eventual size

To understand how structure bears on the future growth of the cascade, we examine how the configuration of the first three reshares (and the root) correlates with the cascade size. In particular, we measure the proportion of cascades starting from each configuration that reach the median size. We do this separately for two different initial poster types: a user, and a page.
We discard "celebrity" users who may have large followings like the most popular pages. Figure 9a shows that as the initial cascade structure becomes shallower, the proportion of cascades that double in size increases.

Figure 9: Shallow initial cascade structures are indicative of larger cascades. In contrast to page-started cascades, where the mean time to the 3rd reshare decreases with decreasing depth of the initial cascade, shallow cascades take a much longer time to form for user-started cascades. For these, the connections of the 1st resharer also significantly impact the time to the 3rd resharer, especially when it receives two reshares before the original receives a second. [Diagrams: initial configurations of the root (0) and the first three reshares (1, 2, 3).]

To examine why this would be the case, we also examined the time needed for the 3rd reshare to happen (Figure 9c). For pages, shallower cascades tend to happen more rapidly, consistent with being initiated by a popular page and achieving a large number of reshares directly from its fans. Interestingly, the configuration having the second and third reshares stemming from the first reshare corresponds to having a first resharer with many connections, and indicates that the initial poster is less popular, be it a page or user (Figure 9d). Curiously, for user-started cascades, the star configuration tends to grow into the largest cascades, but is also the slowest. It also tends to correspond to the first resharer having a low degree, both for page and user roots. One might speculate that this pattern is indicative of the item's appeal to less well-connected users, who also happen to be more likely to reshare. In fact, a median resharer has 35 fewer friends than someone who is active on the site nearly every day. Thus, an item's appeal, rather than the initial network structure, may drive the eventual cascade size in the long run.
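The configuration analysis in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: `initial_shape` and `share_reaching` are hypothetical names, each cascade is encoded as a tuple giving the node (0 for the root) that each of the first three reshares came from, and configurations are grouped by reshare depths only. The paper's Figure 9 may additionally distinguish the order in which reshares arrive, which this sketch ignores.

```python
from collections import defaultdict

# The four unordered shapes a root plus three reshares can take,
# keyed by the sorted depths of the three reshares.
SHAPES = {
    (1, 1, 1): "star",
    (1, 1, 2): "two from root, one from a reshare",
    (1, 2, 2): "one from root, two from that reshare",
    (1, 2, 3): "chain",
}


def initial_shape(parents):
    """Classify the first three reshares of a cascade.

    parents[i - 1] is the node that reshare i came from (0 = root),
    and must be smaller than i. Hypothetical encoding.
    """
    depth = [0]  # depth of the root
    for parent in parents:
        depth.append(depth[parent] + 1)
    return SHAPES[tuple(sorted(depth[1:]))]


def share_reaching(cascades, threshold):
    """Proportion of cascades per initial shape whose final size reaches `threshold`.

    cascades is an iterable of (parents, final_size) pairs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for parents, final_size in cascades:
        shape = initial_shape(parents)
        totals[shape] += 1
        hits[shape] += final_size >= threshold
    return {shape: hits[shape] / totals[shape] for shape in totals}


# Made-up data: (parents of reshares 1..3, final cascade size).
toy = [((0, 0, 0), 20), ((0, 0, 0), 4), ((0, 1, 2), 4), ((0, 1, 2), 5)]
print(share_reaching(toy, threshold=10))  # {'star': 0.5, 'chain': 0.0}
```

Running the same grouping separately on user-started and page-started cascades would mirror the split used in the text.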
4.3 Predicting cascade structure

The observations above naturally lead to the question of whether it is possible to predict future cascade structure. In particular, we aim to distinguish cascades that spread like a virus in a shallow forest fire-like pattern (Figure 3a) and cascades which spread in a long, narrow string-like pattern (Figure 3c). As discussed earlier, this difference is related to the structural virality of a cascade and is quantified by the Wiener index. Here, we observe k = 5 reshares of a cascade and aim to predict whether the final cascade will have a Wiener index above or below the median. We obtain accuracy of 0.725 (F1 = 0.715, AUC = 0.796), while random guessing would, by construction, achieve accuracy of 0.5.

Figure 10: In predicting the largest cascade in clusters of 10 or more cascades of identical photos, we perform significantly above the baseline of 0.1. [Bar charts: accuracy and mean reciprocal rank for the Root, Structural, Resharer, Temporal, and All feature sets, using logistic regression and a conditional inference forest.]

Temporal and structural features are most predictive of structure. For this task we expect structural features to be most important, while we expect temporal features not to be indicative of the cascade structure. However, when we train the model on individual classes of features we surprisingly find that both temporal and structural features are almost equally useful in predicting cascade structure: 0.622 vs. 0.620. Nevertheless, structural features remain individually more accurate (≈ 0.58) and highly correlated (0.161 ≤ |r| ≤ 0.255) with the Wiener index. Individually, one temporal feature, $\mathit{views}'_{1..k-1,k}$, is slightly more accurate (0.602) compared to the best-performing structural feature, $\mathit{outdeg}(\hat{v}_0)$ (0.600), but is significantly less correlated (0.041 vs. −0.255).
The two classes of features nicely complement each other, since when combined, accuracy increases to 0.72.

Cascade structure also becomes more predictable with increasing k. Like for cascade growth prediction, our prediction performance improves as we observe more of the cascade, with accuracy linearly increasing from 0.724 when k is 5 to 0.808 when k is 100. A linear relation also exists in the alternate task where we set the minimum cascade size R to be 100, varying k between 5 and 100.

Changes in feature importance. As we increase k, we find that the structural features become highly correlated with the Wiener index, suggesting that the initial shape of a cascade is a good indicator of its final structure. Rapidly growing cascades also result in final structures that are shallower: temporal features become more strongly correlated with the Wiener index as k increases. Unlike with cascade size, views were generally weakly correlated with structure, while content features had a weak, near-constant effect. Nonetheless, some of these features still provided reasonable performance in the prediction task.

User vs. page-started cascades. In predicting the shape of a cascade, we find that our overall prediction accuracy for pages is slightly higher (0.724) than for users (0.700). While using structural features alone results in a higher prediction accuracy for users (0.643) than for pages (0.601), user and content features are significantly more predictive of cascade structure in the case of pages.

To sum up, we find that predicting the shape of a cascade is not as hard as one might fear. Nevertheless, predicting cascade size is still much easier than predicting cascade shape, though classifiers for either achieve non-trivial performance.

5. PREDICTABILITY & CONTENT

5.1 Controlling for cascade content

In our analyses thus far, we examined cascades of uploads of different photos, and tried to account for content differences by including photo and caption features. However, temporal and structural features may still capture some of the difference in content. Thus, we now study how well we can predict cascade size if we control for the content of the photo itself.

Figure 11: The initial exposure of the uploaded photo and initial reshares serve to differentiate datasets from one another, as can be seen by comparing the correlation coefficients of each feature with the log cascade size. Solid circles indicate significance at $p < 10^{-3}$, and lines through each circle indicate the 95% confidence interval. [Plot: correlation with log cascade size for $\mathit{pages}_5$, $\mathit{orig\_is\_page}$, $\mathit{outdeg}(v_0)$, $\mathit{views}'_{1..4,5}$, $\mathit{views}'_{0,5}$, $\mathit{views}_{1..4,5}$, and $\mathit{views}_{0,5}$; panels: Pages/Users, Language (English, Portuguese), Overlaid Text (Text, None), Categories (Animal, Entertainment, Famous, Food, Health, Politics, Religion).]

We consider identical photos uploaded to Facebook by different users and pages, which is not a rare occurrence. We used an image matching algorithm to identify copies of the same image and place their corresponding cascades into clusters (983 clusters, $N_c$ = 38,073, $N_r$ = 12,755,621).
As one might expect, even the same photo uploaded at different times by different users can fare dramatically differently; a cluster typically consists of a few or even a single cascade with a large number of reshares, and many smaller cascades with few reshares. The average Gini coefficient, a measure of inequality, is 0.787 (σ = 0.104) within clusters. Thus, a natural task is to try to predict the largest cascade within a cluster. For every cluster we select 10 random cascades, placing the accuracy of random guessing at 10%.

As shown in Figure 10, in all cases we significantly outperform the baseline. Using a random forest model, we can identify the most popular cascade nearly half the time (accuracy 0.497); a mean reciprocal rank of 0.662 suggests that this cascade also usually appears among the top two predicted cascades.

In terms of feature importance we notice that the best results are obtained using temporal features, followed by resharer, root node, and structural features. Essentially, if one upload of the photo is initially spreading more rapidly than other uploads of the same photo, that cascade is also likely to grow to be the largest. This points to the importance of landing in the right part of the network at the right time, as the same photo tends to have widely and predictably varying outcomes when uploaded multiple times.

5.2 Feature importance in context

Some features may be more or less important for our prediction tasks in different contexts. Figure 11 shows how several features correlate with log-transformed cascade size when conditioned on one of four different variables: (1) source node type (user vs. page), (2) language (English versus Portuguese, the two most common languages of cascade root nodes in our dataset), (3) whether text is overlaid on a photo (a common feature of recent Internet memes), and (4) content category.
We determine content category by matching entities in photo captions to Wikipedia articles, and in turn articles to seven higher-level categories: animal, entertainment, politics, religion, famous people (excluding religious and political figures), food, and health.

Figure 11 shows that the initial rate of exposure of the uploaded photo is generally more important for page cascades than for user cascades ($\mathit{views}'_{0,5}$). This is likely due to the higher variance in the distribution of the number of followers for a user versus a page. For page cascades in our sample, the median number of followers is 73,855 with a standard deviation of 675,203, while for users at the root of cascades the median number of friends and subscribers is 1,042 with a standard deviation of 26,482. Though rate of exposure to the original photo is more important for pages, we see that rate of exposure to the initial reshares ($\mathit{views}'_{1..4,5}$) is much more important for user cascades.

The number and rate of views also act to differentiate topical categories, with religion having the highest correlation between views and cascade size. Correlation for the rate of views of the uploaded photo is also higher for those with a Portuguese-speaking root node as opposed to an English one. The feature $\mathit{outdeg}(v_0)$ indicates the ability of the root to broadcast content, and we see this playing an important role for page cascades, Portuguese content, photos with text, and religious photos. This indicates that much of the success of these cascades is related to the root nodes being directly connected to large audiences.

In addition to the analysis of Figure 11, we also examined how the features correlate with the structural virality of the final cascades. (Each of the reported correlation coefficient comparisons that follow is significant at $p < 10^{-3}$ using a Fisher transformation.)
Photos relating to food differ significantly from all other categories in that features of the root, such as $\mathit{outdeg}(v_0)$, are less negatively correlated (> −0.18 vs. −0.11), and depth features, such as $\mathit{depths}^{avg}_k$, are less positively correlated (> 0.18 vs. 0.11). This relationship also holds for English compared to Portuguese photos. While users with many friends or followers are more likely to generate cascades of larger size and greater structural virality, pages with many fans create cascades of larger size, although not necessarily greater virality (0.05 vs. −0.01). However, if the initial structure of a cascade is already deep, the final structure of the cascade is likely to have greater structural virality for both user and page-started cascades (> 0.16). A user-started cascade whose initial reshares are viewed more quickly is also more likely to become viral than a page-started one (0.23 vs. 0.06).

6. DISCUSSION & CONCLUSION

This paper examines the problem of predicting the growth of cascades over social networks. Although predictive tasks of similar spirit have been considered in the past, we contribute a novel formulation of the problem which does not suffer from skew biases. Our formulation allows us to study predictability throughout the life of a cascade. We examine not only how the predictability changes as more and more of the cascade is observed (it improves), but also how predictable large cascades are if we only observe them initially (larger cascades are more difficult to predict). While some features, e.g., the average connection count of the first k resharers, have increasing predictive ability with increasing k, others weaken in importance, e.g., the connectivity of the root node. We find that the importance of features depends on properties of the original upload as well: the topics present in the caption, the language of the root node, as well as the content of the photo.
Despite the rich set of results we were able to obtain, there are some limitations to this study. Most importantly, the study was conducted entirely with Facebook data and only with photos. Still, one advantage of this is the scale of the medium; hundreds of millions of photos are uploaded to Facebook every day, and photos, more than other content types, tend to dominate reshares. This also gives us high-fidelity traces of how the photo moves within Facebook's ecosystem, which allows us to precisely overlay the spreading cascade over the social network. Moreover, we are able to identify uploads of the same photo and track them individually. This eliminates the concern of shares being driven by an external entity and only appearing to be spreading over the network. Instead, external drivers benefit our study by creating independent 'experiments' where the same photo gets multiple chances to spread, helping us control for the role of content in some of our experiments. Another disadvantage of our setup is that diffusion within Facebook is driven by the mechanics of the site. The distinction between pages and users is specific to Facebook, as are the mechanisms by which users interact with content, e.g., liking and resharing. Despite these limitations, we believe the results give general insights which will be useful in other settings.

Figure 12: There is considerable overlap in friendship edges (blue) between four independent cascades of the same photo.

The present work only examines each cascade independently from others. Future work should examine interactions between cascades, both between different content competing for the same attention, and between the same content surfacing at different times and in different parts of the network. We found that when the same photo is uploaded at least 10 times, the largest cascade was twice as likely to be among the first 20% of uploads as among the last 20%.
Similarly, for photos uploaded 20 times, the largest cascade was 2.3 times as likely to be among the first 20% as among the last 20%. Figure 12 shows the friendship edges between users participating in different cascades of a single, specific photo. The high connectivity between different cascades demonstrates that users are likely being exposed to the same photo via different cascades, which could be a contributing factor in why earlier uploads of the same photo tend to generate larger cascades than later ones. Between-cascade dynamics like this should provide ample opportunities for further research.

Addressing questions like these will lead to a richer understanding of how information spreads online and pave the way towards better management of socially shared content and applications that can identify trending content in its early stages.

7. REFERENCES

[1] E. Adar, L. Zhang, L. A. Adamic, and R. M. Lukose. Implicit structure and the dynamics of blogspace. In Workshop on the Weblogging Ecosystem, 2004.
[2] A. Anderson, S. Goel, J. Hofman, and D. Watts. The structural virality of online diffusion. Under review.
[3] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: Membership, growth, and evolution. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
[4] L. Backstrom, J. Kleinberg, L. Lee, and C. Danescu-Niculescu-Mizil. Characterizing and curating conversation threads: Expansion, focus, volume, re-entry. In Proc. WSDM, 2013.
[5] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone’s an influencer: quantifying influence on twitter. In Proc. WSDM, 2011.
[6] E. Bakshy, B. Karrer, and L. A. Adamic. Social influence and the diffusion of user-created content. In Proc. EC, 2009.
[7] J. Berger and K. L. Milkman. What makes online content viral. J. Marketing Research, 49(2):192–205, 2012.
[8] M. Cha, H. Haddadi, F. Benevenuto, and P. K. Gummadi. Measuring user influence in twitter: The million follower fallacy. In Proc. ICWSM, 2010.
[9] P. A. Dow, L. A. Adamic, and A. Friggeri. The anatomy of large facebook cascades. In Proc. ICWSM, 2013.
[10] W. Galuba, K. Aberer, D. Chakraborty, Z. Despotovic, and W. Kellerer. Outtweeting the twitterers - predicting information cascades in microblogs. In Proc. OSM, 2010.
[11] S. Goel, D. J. Watts, and D. G. Goldstein. The structure of online diffusion networks. In Proc. EC, 2012.
[12] B. Golub and M. O. Jackson. Using selection bias to explain the observed structure of internet diffusions. Proc. Natl. Acad. Sci., 2010.
[13] D. Gruhl, R. V. Guha, D. Liben-Nowell, and A. Tomkins. Information diffusion through blogspace. In Proc. WWW, 2004.
[14] M. Guerini, J. Staiano, and D. Albanese. Exploring image virality in google plus. Proc. SocialCom, 2013.
[15] T.-A. Hoang and E.-P. Lim. Virality and susceptibility in information diffusions. In Proc. ICWSM, 2012.
[16] L. Hong, O. Dan, and B. D. Davison. Predicting popular messages in twitter. In Proc. WWW Companion, 2011.
[17] M. Jenders, G. Kasneci, and F. Naumann. Analyzing and predicting viral tweets. In Proc. WWW Companion, 2013.
[18] R. Kumar, M. Mahdian, and M. McGlohon. Dynamics of conversations. In Proc. KDD, 2010.
[19] A. Kupavskii, L. Ostroumova, A. Umnov, S. Usachev, P. Serdyukov, G. Gusev, and A. Kustarev. Prediction of retweet cascade size over time. In Proc. CIKM, 2012.
[20] J. Leskovec, L. Adamic, and B. Huberman. The dynamics of viral marketing. ACM Transactions on the Web, 2007.
[21] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Cascading behavior in large blog graphs. In Proc. ICDM, 2007.
[22] D. Liben-Nowell and J. Kleinberg. Tracing information flow on a global scale using Internet chain-letter data. Proc. Natl. Acad. Sci., 2008.
[23] Z. Ma, A. Sun, and G. Cong. On predicting the popularity of newly emerging hashtags in twitter. Journal of the American Society for Information Science and Technology, 2013.
[24] S. A. Myers, C. Zhu, and J. Leskovec. Information diffusion and external influence in networks. In Proc. KDD, 2012.
[25] J. W. Pennebaker, M. E. Francis, and R. J. Booth. Linguistic inquiry and word count: LIWC 2001. 2001.
[26] S. Petrovic, M. Osborne, and V. Lavrenko. RT to win! predicting message propagation in twitter. In Proc. ICWSM, 2011.
[27] D. M. Romero, C. Tan, and J. Ugander. On the interplay between social and topical structure. In Proceedings of the Seventh International Conference on Weblogs and Social Media (ICWSM), 2013.
[28] M. Salganik, P. Dodds, and D. Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 2006.
[29] G. Szabo and B. A. Huberman. Predicting the popularity of online content. Communications of the ACM, 2010.
[30] O. Tsur and A. Rappoport. What’s in a hashtag?: content based prediction of the spread of ideas in microblogging communities. In Proc. WSDM, 2012.
[31] D. J. Watts. Everything is Obvious: How Common Sense Fails Us. Crown, 2012.
[32] L. Weng, F. Menczer, and Y.-Y. Ahn. Virality prediction and community structure in social networks. Sci. Rep., 2013.
[33] J. Yang and S. Counts. Predicting the speed, scale, and range of information diffusion in twitter. In Proc. ICWSM, 2010.
[34] J. Yang and J. Leskovec. Modeling information diffusion in implicit networks. In Proc. ICDM, 2010.
