Using a Model of Social Dynamics to Predict Popularity of News

Kristina Lerman∗
USC Information Sciences Institute, 4676 Admiralty Way, Marina del Rey, CA 90292

Tad Hogg†
HP Labs, 1501 Page Mill Road, Palo Alto, CA 94304, USA

Popularity of content in social media is unequally distributed, with some items receiving a disproportionate share of attention from users. Predicting which newly-submitted items will become popular is critically important for both companies that host social media sites and their users. Accurate and timely prediction would enable the companies to maximize revenue through differential pricing for access to content or ad placement. Prediction would also give consumers an important tool for filtering the ever-growing amount of content. Predicting popularity of content in social media, however, is challenging due to the complex interactions among content quality, how the social media site chooses to highlight content, and influence among users. While these factors make it difficult to predict popularity a priori, we show that stochastic models of user behavior on these sites allow predicting popularity based on early user reactions to new content. By incorporating aspects of the web site design, such models improve on predictions based on simply extrapolating from the early votes. We validate this claim on the social news portal Digg using a previously-developed model of social voting based on the Digg user interface.

I. INTRODUCTION

Success or popularity in social media is not evenly distributed. Instead, a small number of users dominate the activity on the site and receive most of the attention of other users. The popularity of contributed items also shows this extreme diversity. Relatively few of the four billion images on the social photo-sharing site Flickr, for example, are viewed thousands of times, while most of the rest are rarely viewed.
Of the more than 16,000 new stories submitted to the social news portal Digg every day, only a handful go on to become wildly popular, gathering thousands of votes, while most of the remaining stories never receive more than a single vote from the submitter herself. Among thousands of new blog posts every day, only a handful rise above the noise. It is critically important to provide users with tools to help them sift through the vast stream of new content to identify interesting items in a timely manner, or at least those items that will prove to be successful or popular. Accurate and timely prediction will also enable social media companies that host user-generated content to maximize revenue through differential pricing for access to content or ad placement, and encourage greater user loyalty by helping their users quickly find interesting new content.

∗ Electronic address: lerman@isi.edu
† Electronic address: tadhogg@yahoo.com

Success in social media is difficult to predict. Although early and late popularity, which can be measured in terms of the number of views or votes an item generates, are somewhat correlated [7, 22], we know little about what drives success. Is it an item’s inherent quality [2], consumer response to it [5], or some external factors, such as social influence [15–17]? In a landmark study, Salganik et al. [21] addressed this question experimentally by measuring the impact of content quality and social influence on the eventual popularity or success of cultural artifacts. They showed that while quality contributes only weakly to their eventual success, social influence, or knowing about the choices of other people, is responsible for both the inequality and unpredictability of success. In their experiment, Salganik et al. asked users to rate songs they listened to. The users were assigned to different groups. In the control group (independent condition), users were simply presented with lists of songs.
In the other group (social influence condition), users were also shown how many times each song was downloaded by other users. The social influence condition resulted in large inequality in popularity of songs, as measured by the number of times the songs were downloaded. Although a song’s quality, as measured by its popularity in the control group, was positively related to its eventual popularity in the social condition group, the variance in popularity at a given quality was very high, meaning that two songs of similar quality ended up with very different levels of success. Moreover, when users were aware of the choices made by others, popularity was also very unpredictable.

Although Salganik et al.’s study was limited to a small set of songs created by unknown bands, its conclusions about inequality and unpredictability of success appear to apply to cultural artifacts in general and social media production in particular. While this may at first sound discouraging, as we will show in this paper, a model of social dynamics that includes social influence can help make success in social media predictable. Specifically, we claim that modeling the collective behavior of users of a social media site allows us to predict the popularity of items from the users’ early reaction to them.

We investigate the claim empirically using data from the social news portal Digg. Digg allows users to submit and collectively moderate news stories by voting on them. Digg selects a hundred or so stories from the thousands that are submitted daily to feature on its front page. The proprietary promotion algorithm is Digg’s way of making a prediction about which stories are interesting to the community and will accumulate many votes. In previous works, we used the stochastic modeling framework [18] to mathematically describe social dynamics of Digg users [9, 14].
The model, which took into account the user interface and how it affects user behavior, described how the number of votes received by stories changed in time. We showed qualitative agreement between the data and the model, indicating that the features of the Digg user interface we considered can explain the patterns of collective voting. In this paper we use the model to predict whether a newly submitted story will be promoted based on Digg users’ early reaction to it. Moreover, we use the model to predict how popular or successful the story will become, i.e., how many votes it will receive. The stochastic modeling framework is general and can be applied to other social media sites, making prediction of popularity of content on those sites possible.

The paper is organized as follows. In Section II we describe details of Digg. In Section III we summarize the model developed in earlier works. Next, in Section IV we show how this model can predict eventual popularity of newly submitted stories on Digg. We discuss results in Section V and compare against other prediction methods outlined in Section VI.

II. SOCIAL NEWS PORTAL DIGG

With over 3 million registered users, the social news aggregator Digg is one of the more popular news portals on the Web. Digg allows users to submit and rate news stories by voting on, or ‘digging’, them. There are many new submissions every minute, over 16,000 a day. Every day Digg picks about a hundred stories that it deems to be popular and promotes them to the front page. Although the exact promotion mechanism is kept secret and changes occasionally, it appears to take into account the number of votes the story receives and how rapidly it receives them. Digg’s success is fueled in large part by the emergent front page, which is created by the collective decision of its many users.

A. User interface

A newly submitted story goes to the upcoming stories list, where it remains for 24 hours, or until it is promoted to the front page, whichever comes first. Newly submitted stories are displayed as a chronologically ordered list, with the most recently submitted story at the top of the list, 15 stories to a page. To see older stories, a user must navigate to the upcoming stories page 2, 3, etc. Promoted stories (Digg calls them ‘popular’) are also displayed as a list on the front pages, 15 stories to a page, with the most recently promoted story at the top of the list. To see older stories, a user must navigate to front page 2, 3, etc. Figure 1 shows a screenshot of a Digg front page.

FIG. 1: Screenshot of the front page of the social news aggregator Digg.

Digg also allows users to designate friends and track their activities, i.e., see the stories friends recently submitted or voted for. The friends interface is available through the Friends Activity link at the top of any Digg web page (see, for example, Fig. 1). The friend relationship is asymmetric. When user A lists user B as a friend, A can watch the activities of B but not vice versa. We call A the fan of B. A newly submitted story is visible in the upcoming stories list, as well as to the submitter’s fans through the friends interface. With each vote, a story becomes visible to the voter’s fans through the friends interface, which shows the newly submitted stories that the user’s friends voted for.

In addition to these interfaces, Digg also allows users to view the most popular stories from the previous day, week, month, or year. Digg also implements a social filtering feature which recommends stories, including upcoming stories, that were liked by users with a similar voting history. This interface, however, was not available at the time the data for our study was collected.
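The asymmetric friend and fan relationship, and the way each vote widens a story's audience through the friends interface, can be illustrated with a small sketch. The function names and the users A, B and C below are hypothetical, and excluding the submitter and voters from the audience is an assumption of this sketch rather than something Digg documents.

```python
from collections import defaultdict


def build_fans(friend_lists):
    """Invert 'who lists whom as a friend' into 'who is a fan of whom'.

    friend_lists maps a user to the set of users they list as friends.
    If A lists B as a friend, then A is a fan of B (not the reverse).
    """
    fans = defaultdict(set)
    for user, friends in friend_lists.items():
        for friend in friends:
            fans[friend].add(user)
    return fans


def friends_interface_audience(fans, submitter, voters):
    """Users who can see a story through the friends interface.

    Assumption of this sketch: the submitter and the voters themselves
    are excluded, since they have already seen the story.
    """
    audience = set(fans.get(submitter, ()))
    for voter in voters:
        audience |= fans.get(voter, set())
    return audience - {submitter} - set(voters)


# Hypothetical users: A lists B as a friend, C lists A as a friend.
fans = build_fans({"A": {"B"}, "C": {"A"}})
after_submission = friends_interface_audience(fans, "B", [])
after_vote_by_a = friends_interface_audience(fans, "B", ["A"])
```

In this toy network, a story submitted by B is first visible only to B's fan A; once A votes, it becomes visible to A's fan C as well.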
FIG. 2: Dynamics of social voting. (a) Evolution of the number of votes received by two front page stories (story 1 and story 2) in June 2006, plotted against time since submission (hrs). (b) Distribution of popularity (number of stories versus number of votes) of 201 front page stories submitted in June 2006.

B. Inequality of popularity

While a story is in the upcoming stories list, it accrues votes slowly. After it is promoted to the front page, it accumulates votes at a much faster pace. For example, Fig. 2(a) shows the evolution of the number of votes for two stories submitted in June 2006. The point where the slope abruptly increases corresponds to promotion to the front page. As the story ages, accumulation of new votes slows down [24], and after a few days the total number of votes received by a story saturates to some value. This value, which we also call the final number of votes, gives a measure of the story’s success or popularity.

Popularity varies widely from story to story. Figure 2(b) shows the distribution of the final number of votes received by front page stories that were submitted over a period of about two days in June 2006. The distribution is characteristic of ‘inequality of popularity’, since a handful of stories become very popular, accumulating thousands of votes, while most others can only muster a few hundred votes. This distribution applies to front page stories only. Stories that are never promoted to the front page receive very few votes, in many cases just a single vote from the submitter.

While the exact shape of the distribution differs among social media sites, the long tail is a ubiquitous feature [3] of human activity.
It is present in inequality of popularity of cultural artifacts, such as books and music albums [21], and also manifests itself in a variety of online behaviors, including tagging, where a few documents are tagged much more frequently than others, collaborative editing on wikis [13], and general social media usage [23]. While unpredictability of popularity is more difficult to verify than in the controlled experiments of Salganik et al., it is reasonable to assume that a similar set of stories submitted to Digg on another day will end with radically different numbers of votes. In other words, while the distribution of the final number of votes these stories receive will look similar to the distribution in Figure 2(b), the number of votes received by individual stories will be very different in the two realizations.

C. Predictability of popularity

These observations make predicting popularity of social media content difficult. We claim, however, that we can leverage social influence, the very factor responsible for inequality and unpredictability of popularity, to predict the popularity of social media content. Social influence occurs when information about the choices or opinions of others affects a user’s behavior. In Salganik et al.’s experiment, social influence was exerted by showing a user the number of times a particular item was downloaded. This information affected what items users chose to download, ultimately leading to a large disparity in the number of downloads of specific items. On Digg, social influence manifests itself through the friends interface, which shows users the stories their friends chose to vote for. In previous works [9, 14] we have constructed a mathematical model of the dynamics of social voting on Digg that takes social influence into account. We showed that the model explains the evolution of the number of votes received by Digg stories.
In this paper we use the model to predict the popularity of newly submitted stories. Specifically, we use the model to estimate the inherent quality of a new story from the Digg users’ early reaction to it. Next, using this estimate, we predict the story’s final number of votes. In the sections below we summarize the model and validate it on a sample of stories retrieved from Digg.

III. SOCIAL DYNAMICS OF DIGG

The model of the dynamics of social voting on Digg [9, 14] is based on the stochastic processes framework [18], which represents each Digg user as a stochastic process with a small number of states. For users of a social media site, the states correspond to actions such as register for the site, follow a link to a story, vote on the story, befriend another user, and so on. This abstraction captures much of the inherent individual complexity by casting an individual’s decisions as inducing probabilistic transitions between states. The framework allows us to relate aggregate behavior of a group of users, such as voting, to simple descriptions of their individual behavior.

In past work, we used the model of social voting to study how individual stories accumulate votes on Digg. In this paper, we use the model to explain why some stories accumulate many more votes than others. In addition to the model’s explanatory power, we investigate its predictive power. We first describe the data sets we collected for our study and then present an overview of the model developed in [9].

A. Data sets

We collected data by scraping web pages in Digg’s Technology section in May and June 2006. The May data set consists of stories that were submitted to Digg May 25-27, 2006. We followed stories by periodically scraping Digg to determine the number of votes stories received as a function of the time since their submission. We collected at least 4 such observations for each of 2152 stories, submitted by 1212 distinct users.
Of these stories, 510, by 239 distinct users, were promoted to the front page. We followed the promoted stories over a period of several days.

The June data set consists of 201 stories promoted to the front page between June 27-30, 2006. For each story, we collected the names of the first 216 users who voted on the story. In addition, we also collected information about stories that were submitted to Digg between June 30, 2006 and July 1, 2006. From this set, we retained stories that received at least 10 votes, resulting in 159 stories. In October 2009, we updated information about the front page and upcoming stories, using the Digg API to obtain time stamps of the first (up to 216) votes for each story, the total number of votes it received, and for the stories in the upcoming sample, their promotion time, if it exists.

In addition to data about stories, we also extracted a snapshot of the social network of the top-ranked 1020 Digg users (as of June 2006). This data contained the names of each user’s friends and fans. As a reminder, user A’s friends are all the users that A is watching (outgoing links on the social network graph), while A’s fans are all the users watching A’s activity (incoming links). Since the original network did not contain information about all the voters in the June data set, we augmented it in February 2008 by extracting names of friends of more than 15,000 additional users. Many of these users added friends between June 2006 and February 2008. Although Digg does not provide information about the time a new link was created on its web page, it does list these links in reverse chronological order, with the most recent link appearing on top. In addition to a friend’s name, Digg also gives the date the friend joined Digg. By eliminating friends who joined Digg after June 30, 2006, we believe we were able to faithfully reconstruct the fan links for all voters in our data set.
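The join-date filter used to approximate the June 2006 network can be sketched as follows. The record layout (pairs of friend name and join date), the function names, and the example users are hypothetical; only the June 30, 2006 cutoff comes from the text. Note that such a filter cannot detect a link added after the cutoff between two users who had both already joined, which is one reason the reconstruction is approximate.

```python
from collections import Counter
from datetime import date

CUTOFF = date(2006, 6, 30)  # cutoff date stated in the text


def friends_at_cutoff(friend_records, cutoff=CUTOFF):
    """Keep only friends whose Digg join date is on or before the cutoff.

    friend_records is an iterable of (friend_name, friend_join_date).
    """
    return {name for name, joined in friend_records if joined <= cutoff}


def fan_counts(scraped, cutoff=CUTOFF):
    """Count fans per user from scraped friend lists.

    scraped maps a user to that user's list of (friend_name, join_date).
    If `user` lists `friend`, then `user` is one of `friend`'s fans.
    """
    counts = Counter()
    for user, records in scraped.items():
        for friend in friends_at_cutoff(records, cutoff):
            counts[friend] += 1
    return counts


# Hypothetical scrape from February 2008.
scraped = {
    "alice": [("bob", date(2005, 12, 1)), ("dave", date(2007, 3, 15))],
    "carol": [("bob", date(2005, 12, 1))],
}
counts = fan_counts(scraped)
```

Here `dave` joined after the cutoff, so the link from `alice` to `dave` is dropped, while `bob` keeps two fans.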
Note that the fans network in the two data sets was slightly different. In the May data set, we retained the number of fans for the top 1020 users, and assumed that other users had zero fans. In the June data set, we know whom active users (those who voted recently) list as friends, and calculate the number of active fans for each submitter. Both are reasonable interpretations of the number of fans, and the exact meaning of the number of fans should depend on the application.

B. Dynamical model of social voting

When a user visits Digg, she can choose to browse its front pages to see recently promoted stories, upcoming stories pages to see recently submitted stories, or use the friends interface to see the stories her friends have recently submitted or voted for. She can select one of the stories to read, and depending on whether she considers it interesting, vote for it. Alternatively, after perusing Digg’s pages, she may choose to leave it. The user’s environment, the stories she is seeing, is itself changing in time depending on actions of all other users.

At an aggregate level, we focus on how the number of votes a story receives changes over time. The changing state of a story is characterized by three values: the number of votes, $N_{vote}(t)$, the story has received by time $t$ after it was submitted to Digg, the list the story is in at time $t$ (upcoming or front pages), and its location within that list, which we denote by $q$ and $p$ for upcoming and front page lists, respectively.

Stochastic modeling provides a framework for relating users’ individual choices to their aggregate behavior, which is, in turn, related to the changes in the state of a single story. The aggregate user behavior on Digg at a given time has the following components: the number of users who see a story via one of the front pages, one of the upcoming pages, or the friends pages, and the number of users who vote for a story, $N_{vote}$.
In other words, the number of votes a story receives depends on the combination of its visibility and interest, with visibility coming from different parts of the Digg user interface: the friends interface, upcoming and front page lists, and the position within each list. The rate equation for $N_{vote}(t)$ is:

$$\frac{dN_{vote}(t)}{dt} = r \left( \nu_f(t) + \nu_u(t) + \nu_{friends}(t) \right) \qquad (1)$$

where $r$ measures how interesting the story is, i.e., the probability that a user seeing the story will vote on it, and $\nu_f$, $\nu_u$ and $\nu_{friends}$ are the rates at which users find the story via one of the front pages, via one of the upcoming pages, and through the friends interface, respectively.

To solve Eq. 1, we must model the rates at which users find the story through the different parts of the Digg interface. These rates depend on the story’s location in each list (upcoming or front page) and how users navigate to that position in the list. While many details of these behaviors are not readily observable, we are able to estimate the values required for our model from the sample of data obtained from Digg and by making some reasonable assumptions. For example, while we do not know how many users visit Digg each day, we assume that a Digg visitor sees the front page first. The upcoming stories list is less popular than the front page. We model this by assuming that a fraction $c < 1$ of Digg visitors proceed to the upcoming stories pages.

Story position depends on the details of the Digg user interface. Digg splits each story list into groups of 15 stories, with the 15 most recently submitted (promoted) stories on the first upcoming (front) page, the next group of 15 on the second page, and so on. We model this process as decreasing visibility as a function of location, the value of $f_{page}(p)$, through $p$ taking on fractional values. Thus, $p = 1.5$ denotes the position of a story half way down the first page of the list.
Values of $p$ and $q$ grow linearly in time as new stories are promoted to the front page and submitted to the upcoming stories list.

In addition to story position in the list, we need a description of how users navigate to that position. While we do not have data about Digg visitors’ behavior, specifically, how many proceed to page 2, 3 and so on, generally when presented with lists over multiple pages on a web site, successively smaller fractions of users visit later pages in the list. Following [10], we use an inverse Gaussian to model the distribution of the number of pages a user visits before leaving the web site. We model the decreasing visibility of stories as they move down the list on a given page through $p$ and $q$ taking on fractional values in the inverse Gaussian model of user navigation.

When a story is promoted, it becomes visible at the top of the front page list. An accurate model of this process would require us to reverse engineer Digg’s promotion algorithm. Instead, we use a simple threshold to model how a story is promoted to the front page. The threshold model appears to approximate Digg’s promotion algorithm well, and works as follows. Initially the story is visible on the upcoming stories pages. When the number of accumulated votes exceeds a promotion threshold $h$, the story moves to the front page.

Next, we model the story’s visibility through the friends interface. We only consider two components of the friends interface, which allow users to see stories their friends (i) submitted or (ii) voted for in the preceding 48 hours. Fans of the story’s submitter can find the story via the friends interface at any time after submission, regardless of which list it is on. As additional users vote on the story, their fans can also see the story through the friends interface, regardless of the list the story is on.
We model this with $s(t)$, the number of fans of voters on the story by time $t$ who have not yet seen the story. We suppose these users visit Digg daily, and since they are likely to be geographically distributed across all time zones, the rate at which fans discover the story is distributed over the day. A simple model of this behavior takes fans arriving at the friends page independently at a rate $\omega$. As fans read the story, the number of potential voters gets smaller, i.e., $s$ decreases at a rate $\omega s$. At the same time, the number of additional fans who can see the story through the friends interface grows as $\Delta s = a N_{vote}^{-b}$ for each new vote, with $a = 51$ and $b = 0.62$. Combining these models of growth in the expected number of available fans and its decrease as fans return to Digg, we have

$$\frac{ds}{dt} = -\omega s + a N_{vote}^{-b} \frac{dN_{vote}}{dt} \qquad (2)$$

with initial value $s(0)$ equal to the number of fans of the story’s submitter, $S$.

TABLE I: Model parameters. Parameters specifying page view distribution are defined in [9].

parameter                             value
rate general users come to Digg       $\nu = 10$ users/min
fraction viewing upcoming pages       $c = 0.3$
rate a voter’s fans come to Digg      $\omega = 0.002$/min
page view distribution                $\mu = 0.6$, $\lambda = 0.6$
fans per new vote                     $a = 51$, $b = 0.62$
vote promotion threshold              $h = 40$
upcoming stories list location        $k_u = 0.06$ pages/min
front page list location              $k_f = 0.003$ pages/min
story specific parameters
interestingness                       $r$
number of submitter’s fans            $S$

In summary, the rates in Eq. 1 are:

$$\nu_f = \nu \, f_{page}(p(t)) \, \Theta(N_{vote}(t) - h)$$
$$\nu_u = c \, \nu \, f_{page}(q(t)) \, \Theta(h - N_{vote}(t)) \, \Theta(24\,\mathrm{hr} - t)$$
$$\nu_{friends} = \omega \, s(t)$$

where $t$ is time since the story’s submission and $\nu$ is the rate users visit Digg. The first step function in $\nu_f$ and $\nu_u$ indicates that when a story has fewer votes than required for promotion, it is visible in the upcoming stories pages; and when $N_{vote}(t) > h$, the story is visible on the front page.
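To make the coupled equations concrete, the following is a minimal numerical sketch of Eqs. 1 and 2 with the Table I parameters, starting from the submitter's single vote. Several pieces are assumptions of this sketch rather than the authors' implementation: the form of `f_page` (taken here as the inverse Gaussian survival function evaluated at $p - 1$, whereas the actual definition is given in [9]), list positions starting at 1 and growing as $1 + kt$, measuring $p$ from the moment of promotion, and forward-Euler integration with a one-minute step.

```python
import math

# Fixed parameters from Table I (time unit: minutes).
NU = 10.0            # rate general users come to Digg (users/min)
C = 0.3              # fraction viewing upcoming pages
OMEGA = 0.002        # rate a voter's fans come to Digg (1/min)
MU, LAM = 0.6, 0.6   # page view distribution parameters
A, B = 51.0, 0.62    # fans per new vote
H = 40.0             # vote promotion threshold
K_U = 0.06           # upcoming list position growth (pages/min)
K_F = 0.003          # front page list position growth (pages/min)


def f_page(p):
    """ASSUMED visibility of list position p (see [9] for the real one).

    Taken as the probability that a visitor views more than p - 1 pages
    when the number of pages viewed is inverse Gaussian with mean MU and
    shape LAM.
    """
    x = p - 1.0
    if x <= 0.0:
        return 1.0
    alpha = math.sqrt(LAM / (2.0 * x))
    upper = math.erfc(alpha * (x - MU) / MU)
    lower = math.exp(2.0 * LAM / MU) * math.erfc(alpha * (x + MU) / MU)
    return max(0.0, 0.5 * (upper - lower))


def simulate(r, S, minutes=2880, dt=1.0):
    """Forward-Euler integration of Eqs. 1 and 2.

    Returns (times, votes, promotion_time); promotion_time is None if
    the vote count never exceeds the threshold H.
    """
    n, s = 1.0, float(S)      # N_vote(0) = 1 and s(0) = S
    promoted_at = None
    times, votes = [0.0], [n]
    for i in range(int(minutes / dt)):
        t = i * dt
        if promoted_at is None and n > H:
            promoted_at = t
        if promoted_at is not None:
            p = 1.0 + K_F * (t - promoted_at)   # assumed starting position
            nu_f, nu_u = NU * f_page(p), 0.0
        else:
            q = 1.0 + K_U * t                   # assumed starting position
            nu_f = 0.0
            nu_u = C * NU * f_page(q) if t < 24 * 60 else 0.0
        dn = r * (nu_f + nu_u + OMEGA * s) * dt      # Eq. 1
        s += -OMEGA * s * dt + A * n ** (-B) * dn    # Eq. 2
        n += dn
        times.append(t + dt)
        votes.append(n)
    return times, votes, promoted_at


times, votes, promoted_at = simulate(r=0.5, S=10)
```

Calling `simulate` with a small `r` and `S=0` gives a story that stays below the threshold, while larger values of either parameter push it over, which is the qualitative behavior the model is meant to capture.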
The second step function in $\nu_u$ accounts for a story staying in the upcoming queue for at most 24 hours.

We solve Eq. 1 subject to the initial condition $N_{vote}(0) = 1$, because a newly submitted story appears at the top of the upcoming stories queue and starts with a single vote, from the submitter.

C. Model parameters and solutions

As shown in [9], solutions to Eq. 1 agree with the evolution of votes received by actual stories on Digg. The solutions depend on the model parameters, of which only two, the story’s interestingness $r$ and the number of fans of the submitter $S$, change from story to story. We estimated $r$ from the data as the value that minimizes the root-mean-square (RMS) difference between the observed votes and the model predictions. The remaining parameters, given in Table I, are fixed. As described in more detail in [9], some of these parameters, such as the growth in list location, promotion threshold and fans per new vote, were measured directly from the May data set. Other parameters were estimated based on model predictions. The small number of stories in our data set, as well as the approximations made in the model, do not give strong constraints on these parameters. We selected values to give a reasonable match to our observations. These parameters could in principle be measured independently from aggregate behavior with more detailed information on user behavior.

FIG. 3: Story promotion as a function of $S$ (submitter’s fans) and $r$ (interestingness) for stories in the May data set. The $r$ values are shown on a logarithmic scale. The model predicts stories above the curve are promoted to the front page. The points show the $S$ and $r$ values for the stories in our data set: black and gray for stories promoted or not, respectively.

Fig. 3 shows parameters $r$ and $S$ required for a story to reach the front page according to the model, and how that prediction compares to the stories in the May data set. The model’s prediction of whether a story is promoted is correct for 95% of the stories in our data set. For promoted stories, the correlation between $S$ and $r$ is $-0.13$, which is significantly different from zero ($p$-value less than $10^{-4}$ by a randomization test). Thus a story submitted by a poorly connected user (small $S$) tends to need high interest (large $r$) to be promoted to the front page [15].

Parameter $r$ depends on the inherent story quality, which we cannot directly measure from our data. However, our interpretation of $r$ as how ‘interesting’ a story is to users appears to be consistent with treating it as representing intrinsic story quality. Specifically, the model reproduces three general observations about behavior of stories on Digg: (1) slow initial growth in votes while the story is on the upcoming list, as shown in Fig. 2(a); (2) more interesting stories (higher $r$) are promoted to the front page faster and receive more votes than less interesting stories; (3) however, as supported also by observations in [15], better connected users (high $S$) are more successful in getting their less interesting stories (lower $r$) promoted to the front page than poorly-connected users. These observations give us confidence that the model captures the important details of social voting on Digg.

By estimating $r$ from the observed dynamics of social voting, our model allows us to separate story quality from social influence and study how each affects the popularity of stories on Digg. While there are alternative ways to measure the effects of quality and social influence, they may not be feasible for social media applications. Quality, for example, may be measured through controlled experiments, as in [21].
Social influence may be measured through surveys or interviews with participants, which is also not usually practical in social media. An empirically grounded model, on the other hand, allows us to quantitatively characterize the effects of quality and social influence on the popularity of social media content, and deduce the strength of these effects from the observed dynamics of popularity. This leads to an insight that models can be used to predict popularity of content. Specifically, observing the initial stages of voting on Digg, and knowing how users are connected, enables us to use the model of social dynamics to estimate $r$, and then use this value to predict how many votes the story will receive in the long term.

In the sections below we investigate the implications of the model for determining quality of stories submitted to Digg, and also for predicting the number of votes they will receive. Since the stochastic modeling framework on which the approach is based is general, and has been applied to several other systems [8, 18], we conjecture that this approach can also be used to predict popularity of content on other social media sites.

IV. MODEL-BASED PREDICTION

By separating the impact of story quality and social influence on the popularity of stories on Digg, a model of social dynamics enables two novel applications: (1) estimating inherent story quality from the evolution of its observed popularity, and (2) predicting its eventual popularity based on the early reaction of users to the story. We investigate these problems on real-world data extracted from Digg.

A. Estimating story quality

We can estimate how interesting a story is by comparing the model’s solutions to the observed popularity of the story.
We take as story interestingness the value of r that minimizes the RMS difference between the observed number of votes and the number of votes predicted by the model at the end of the data sample or two days after submission, whichever was earlier. For the 510 promoted stories in the May data set, the RMS relative error between the number of votes and the model prediction is 14%, corresponding to an RMS error of 109 votes. For stories not promoted these values are 14% and 1.1 votes, respectively.

The estimated r values of stories in the May data set show that the 510 promoted stories have a wide range of interestingness to users. As shown in Fig. 4, these r values fit well to a lognormal distribution with maximum likelihood estimates of the mean and standard deviation of log(r) equal to −1.67 ± 0.04 and 0.47 ± 0.03, respectively, with the ranges giving the 95% confidence intervals. A randomization test based on the Kolmogorov-Smirnov statistic and accounting for the fact that the distribution parameters are determined from the data [4] shows the r values are consistent with this distribution (p-value 0.35). Table II shows some of the stories with the highest, as well as lowest, estimated r values. Stories with higher r values include those bound to pique curiosity, such as “Lego Aircraft Carrier Complete!” and lists of the “worst” and “coolest”. Among stories with lower r values are more serious stories about science and technology. Unfortunately, it looks like Digg users do not find such stories interesting.

final votes  estimated r  story title
3054         0.71         Lego Aircraft Carrier Complete!
3388         0.70         How to Make a Spider from 5 Crisp Dollar Bills (and Scare Waitresses!)
3125         0.65         Things You Didn’t Know About Your Body
2981         0.63         25 Worst Tech products of all time
2776         0.59         The Coolest Solar Eclipse Photo You Will Ever See...
2748         0.59         14 year old kid becomes millionaire through online scamming
2701         0.58         X-Men: Last Stand Post-Credits Scene?
2327         0.58         18 Days of Reckless Computing
2690         0.58         First Photos of MIT’s $100 Laptop
1310         0.57         Nintendo Puts $250 Price Tag on Wii OFFICIAL
2204         0.54         MacBook vent blocked
2413         0.54         Wii will cost less than $220
397          0.09         Microsoft: “OpenDocument is Too Slow”
364          0.09         AMD aims to take 15% of notebook market this year
278          0.09         New Intel roadmap reveals Conroe L “solo”, mobile plans
300          0.09         Interactive display system knows users by touch
341          0.09         A DNA Database For All U.S. Workers?
540          0.08         Computer Viruses Monitored via Dynamic Worldmap
258          0.08         New Sensor Technology Looks at Molecular ’Fingerprint’
149          0.07         Supreme Court won’t consider Yahoo case
247          0.07         Lambda Table - A high-res tiled LCD table and interaction device
642          0.03         Interactive dining table
1204         0.03         Websites as graphs: Visualizing the DOM Structure of Websites
532          0.02         MIT Technology Review Launches New Micro-documentary Video Series

TABLE II: Selection of stories from the May data set with the highest and lowest r values. For each story, we show the final number of votes it received, its estimated r value, and its title.

The r values for the June data set have a similar lognormal distribution. While broad distributions occur in many web sites [23], using a model of social dynamics allows us to factor out effects of user interface (various components of story visibility) from the overall distribution of story interestingness. Thus we can identify variations in the stories’ inherent interest to users as measured by their inclination to vote on a story they see. These findings indicate that at least part of the inequality in the distribution of final number of votes received by Digg stories (cf. Fig. 2(b)) can be attributed to the inequality of their inherent interest to users.
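The lognormal fit reported above can be illustrated with a minimal sketch. It assumes that log(r) denotes the natural logarithm, which is consistent with the reported mean of −1.67 (a median r of about 0.19) and the range of r values in Table II. The function name `fit_lognormal` is hypothetical, and the sketch covers only the point estimates, not the confidence intervals or the Kolmogorov-Smirnov randomization test.

```python
import math

def fit_lognormal(r_values):
    """Maximum likelihood fit of a lognormal distribution.

    Returns (mu, sigma): the mean and standard deviation of log(r).
    Every value in r_values must be positive.
    """
    logs = [math.log(r) for r in r_values]  # natural logarithm (assumption)
    n = len(logs)
    mu = sum(logs) / n
    # The maximum likelihood variance estimate divides by n, not n - 1.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / n)
    return mu, sigma

# Illustrative input only: the four highest r values listed in Table II,
# not the full set of 510 promoted stories.
mu, sigma = fit_lognormal([0.71, 0.70, 0.65, 0.63])
median_r = math.exp(mu)  # the median of a lognormal distribution is exp(mu)
```

Applying the same function to all estimated r values of a data set would give the two parameters quoted in the text, under the natural-logarithm assumption stated above.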
B. Predicting final number of votes

Rather than estimating r values from the full voting history, we can estimate them from the early voting history of each story. For instance, using just the first 4 observations for each promoted story in the May data set increases the relative error in the votes to 34%. The predicted numbers of votes have 87% correlation with the observed numbers, so early observations provide a strong prediction of the relative ordering of numbers of votes stories will receive, as illustrated in Fig. 5. This corresponds to the predictability of eventual ratings from the early reaction to new content seen on Digg and YouTube [22].

Figure 6 shows predictions for front page stories in the June data set, based on the first 20 votes a story receives and using the model described above, i.e., with parameters determined from the May data set. In this case, the predictions are not as good (correlation between predicted and actual final votes is 0.49, the RMS error is 593, and the linear fit accounts for only 23% of the variance). In both figures, the cluster of points at the extreme left of the plot are promoted stories the model predicts will not be promoted (based on the r estimate from the early votes). Thus their actual final number of votes is considerably larger than the model predicts based on the early votes.

C. Comparing to direct extrapolation

Once a story reaches the front page, its subsequent growth in votes is well-predicted from the number of votes it receives shortly after promotion when accounting for the hourly and daily variation in story submission rate [22]. However, these predictions apply to promoted stories only and do not take into account changes in visibility of a story through growth in the number of fans. We do not have enough data to reproduce the approach of [22], because the first 216 votes often did not cover the one hour after promotion that the approach requires. As a simple comparison, we instead determined the predicted number of votes by extrapolating from the rate at which a story accumulated votes during the first 4 observations. This simpler model, which does not consider the number of fans for the story’s voters, has a lower correlation, 75%, with the observed numbers and a larger RMS error for stories in the May data set. A randomization test comparing these two methods indicates this reduction in performance is statistically significant (p-value less than 5 × 10⁻⁴). Thus, by incorporating the average growth in number of fans, our model provides a better description of how stories accumulate votes than simply extrapolating from early observations while on the upcoming pages. More generally, by estimating the “interestingness” of a story from early votes, we separate the influence of changing visibility in the Digg user interface from the underlying rate at which users will vote on the story if they see it.

FIG. 4: (a) Histogram of estimated r values for the promoted stories in our data set compared with the best fit lognormal distribution. (b) Quantile-quantile plot comparing observed distribution of r values with the lognormal distribution fit.

FIG. 5: Observed number of final votes for promoted stories in the May data set compared to prediction from the model using the first four observations of each story to estimate the story’s r value. The line is the best linear fit, with slope 0.84.

FIG. 6: Observed number of final votes for promoted stories in the June data set compared to prediction from the model using the first 20 votes each story received to estimate the story’s r value. The line is the best linear fit, with slope 0.62.

Although model-based predictions for stories in the June data set are not as good, using the model nevertheless improves on direct extrapolation (which gives correlation 0.44, RMS error 610, and fraction of variance 19%). We find a similar improvement for predicting the final votes for the upcoming stories of the June data set, e.g., correlation 0.47 using the model compared to 0.31 for direct extrapolation.

D. Comparing to social influence only prediction

In [16] we studied the role of social influence in predicting popularity of news stories on Digg. We showed that stories that initially receive many fan votes, i.e., votes from fans of the submitter or previous voters, ultimately go on to accumulate fewer votes than stories that initially receive few fan votes. Although this may at first seem counterintuitive, it is reasonable to expect that a story that is of interest to a narrow community will spread within that community only, while a generally interesting story will spread from many independent sites as users unconnected to previous voters discover it with some small probability and propagate it to their own fans. The study in [16] did not separate effects of story quality or interestingness from social influence, but simply used the strength of social influence as a predictor of whether the story will receive many votes.

FIG. 7: Number of fan votes within the first 10 votes vs final votes received by front page stories in the June data set. The dashed line shows 505 votes.
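As an illustration of the fan-vote definition used here (a vote cast by a fan of the submitter or of any previous voter), the following minimal sketch counts fan votes among the first votes on a story. The function name, the argument layout, and the choice to exclude the submitter’s own vote from the first 10 are assumptions made for this example; they are not taken from [16].

```python
def count_fan_votes(submitter, voters, fans_of, first_n=10):
    """Count fan votes among the first `first_n` votes on a story.

    submitter -- user id of the story's submitter
    voters    -- user ids in the order they voted (submitter excluded)
    fans_of   -- dict mapping a user id to the set of that user's fans
    """
    # Users who could already see the story through the friends interface.
    exposed = set(fans_of.get(submitter, ()))
    fan_votes = 0
    for voter in voters[:first_n]:
        if voter in exposed:
            fan_votes += 1
        # After voting, the story becomes visible to this voter's own fans.
        exposed.update(fans_of.get(voter, ()))
    return fan_votes

# Toy fan network (hypothetical user ids).
fans_of = {"sub": {"a", "b"}, "a": {"c"}, "x": set()}
n_fan_votes = count_fan_votes("sub", ["a", "x", "c", "b", "y"], fans_of)
```

In this toy example the votes by "a", "c", and "b" count as fan votes, while "x" and "y" found the story independently.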
As described in this paper, at the time of submission, a story is only visible on the upcoming stories list and to the submitter’s fans through the friends interface. As users vote on the story, it becomes visible to their own fans through the friends interface. Some of these fans will find the story interesting and vote for it. Although we cannot confirm it, we assume that if a voter is a fan of the previous voters (including the submitter), social influence, exerted via the friends interface, played a role in helping the voter discover the story. Therefore, the strength of social influence is measured in terms of the proportion of initial votes that can be made via the friends interface: those coming from the fans of the submitter and previous voters.

Social influence during the early voting period and the final number of votes a story receives are inversely correlated. Figure 7 shows the number of fan votes within the first 10 votes vs the final number of votes received by the 201 front page stories in the June data set. The plot shows the median number of final votes, with the error bars showing the distribution of votes, with the outliers removed. Despite the wide range of final votes for each value of fan votes, in general, stories that receive relatively few fan votes within the first 10 votes end up becoming very popular, accumulating many hundreds or thousands of votes, while stories that receive many fan votes within the first 10 votes end up with fewer than 500 votes.

We trained a decision tree classifier on front page stories in the June data set to predict whether a story will be successful, i.e., accumulate a large number of votes, based on the strength of social influence during the early stages of voting [16].
Each story was characterized by three attributes: number of fan votes it received within the first 10 votes, number of submitter’s fans, and a boolean attribute indicating whether the story was successful (i.e., received more than 505 votes). This classifier can then be used to predict whether a story will become successful by monitoring its spread through the fan network. As shown in [16], the prediction can be made relatively early, after the first 10 votes.

We compare model-based prediction against the social influence-based classifier described above. We use the classifier to predict whether an upcoming story in the June data set will accumulate more than 505 votes. As argued in [16], that prediction should be made for stories submitted by top users, who tend to have bigger and more active fan networks, which make it more difficult for Digg to determine a story’s general appeal to the rest of its community. There were 39 stories submitted by users who were among the top-ranked 100 users in June 2006. Of these stories, 13 were actually promoted by Digg, and of these only four went on to receive more than 505 votes. The classifier predicted that 14 of the 39 stories will get more than 505 votes, and of these, only three did. The classifier also predicted that 25 stories will accumulate fewer than 505 votes, and 24 of these predictions were correct. In all, the social influence-based classifier correctly predicted the fate of 27 stories. Using the same criterion of success and using only the first 10 votes for prediction, the model-based method predicted that 11 stories will accumulate more than 505 votes, of which 3 did. It also predicted that 28 stories will not reach 505 votes, and 27 of these predictions were correct. In all, the model-based method correctly predicted the fate of 30 stories (30 of 39, versus 27 of 39), a 10% improvement over the social influence-based method.

V. DISCUSSION

There are a number of reasons why predictions for the final number of votes received by June stories were worse than predictions for the stories in the May data set. May data was collected by scraping Digg web pages at regular time intervals. While for over half of the promoted and upcoming stories in the May data set the fourth observation was made about four hours after story submission, for many of the remaining stories the fourth observation was made many hours later. Therefore, prediction was able to exploit longer-term dynamics. The first 20 votes used for prediction in the June data set generally accounted for shorter periods since submission. Another reason for the disparity was that the model was calibrated on the May data set. Using parameters calculated from June data could improve predictions. We could not explore this question due to lack of relevant data. On the other hand, we believe that achieving some prediction accuracy on the June data set demonstrates generalizability of the model. Another difference is that for the May data we used all fans as extracted from Digg, while the number of fans in the June data set is based on users who were active (i.e., voted recently). Both definitions seem reasonable to us, so by comparing the May and June results, we are also comparing the use of these different definitions in the two cases.

The model makes several assumptions and approximations which could reduce accuracy of prediction. First, we treated promotion as an exact threshold. Detailed analysis of June data shows this not to be accurate, as some stories were promoted well before they reached 40 votes. The earlier in its history the story is promoted, the more votes it will receive. While we do not know the exact promotion algorithm Digg uses, we can mitigate this problem by giving bounds on the predicted number of votes, which reflect our uncertainty about the promotion mechanism.
Another modeling simplification we made is to use growth in the expected number of new fans, given by Eq. 2. Since we know how large the fans network is for each voter, we can compute these values more precisely. This will enable us to treat cases when a vote by a highly connected user, such as kevinrose, exposes the story to a large number of users.

Finally, as evidence in Section IV D suggests, prediction may also benefit from a finer grained model of social influence. While model-based prediction outperforms the social influence-only model, we believe that social influence offers valuable evidence about a story’s interest within and outside a community. Monitoring the spread of interest in a story through the fan network will lead to a better estimate of r, which will, in turn, lead to a more accurate prediction of the final number of votes. The value of r could be different for fans vs non-fans. We plan to study these issues in future work.

VI. RELATED WORK

The Social Web provides massive quantities of available data about the behavior of large groups of people. Researchers are using this data to study a variety of topics, including detecting [1, 20] and influencing [6, 12] trends in public opinion, and dynamics of information flow in groups [19, 25].

Several researchers examined the role of social dynamics in explaining and predicting distribution of popularity of online content. Wilkinson [23] found broad distributions of popularity and user activity on many social media sites and showed that these distributions can arise from simple macroscopic dynamical rules. Wu and Huberman [24] constructed a phenomenological model of the dynamics of collective attention on Digg. Their model is parametrized by a single variable that characterizes the rate of decay of interest in a news article.
Rather than characterize evolution of votes received by a single story, they show the model describes the distribution of final votes received by promoted stories. Our model offers an alternative explanation for the distribution of votes. Rather than novelty decay, we argue that the distribution can also be explained by the combination of non-uniform variations in the stories’ inherent interest to users and effects of user interface, specifically decay in visibility as the story moves to subsequent front pages. Such a mechanism can also explain the distribution of popularity of photos on Flickr, which would be difficult to characterize by novelty decay. Crane and Sornette [5] analyzed a large number of videos posted on YouTube and found that collective dynamics was linked to the inherent quality of videos. By looking at how the observed number of votes received by videos changed in time, they could separate high quality videos, whether they were selected by YouTube editors or spontaneously became popular, from junk videos. This study is similar in spirit to our own in exploiting the link between observed popularity and content quality. However, while this study and that of Wu & Huberman aggregated data from tens of thousands of individuals, our method focuses instead on the microscopic dynamics, modeling how individual behavior contributes to the observed popularity of content.

Researchers found statistically significant correlation between early and late popularity of content on Slashdot [11], Digg and YouTube [22]. Specifically, similar to our study, Szabo & Huberman [22] predicted long-term popularity of stories on Digg. Through a large-scale statistical study of stories promoted to the front page, they were able to predict stories’ popularity after 30 days based on their popularity one hour after promotion.
Unlike our work, their study did not specify a mechanism for evolution of popularity, and simply exploited the correlation between early and late story popularity to make the prediction. Our work also differs in that we predict popularity of stories shortly after submission, long before they are promoted.

In [16] we exploited anti-correlation between the number of early fan votes and stories’ eventual popularity on Digg. Specifically, we found that stories that initially received few votes from the fans of submitters and previous voters went on to become much more popular than stories which had many initial votes from fans. Using this correlation, we were able to predict whether stories submitted by well connected users would become popular, i.e., receive more than 505 votes. That work exploited social influence only to make the prediction, and the results were not applicable to stories submitted by poorly connected users which were not quickly discovered by highly connected users. In contrast, the approach described in this paper considers effects of social influence regardless of the connectedness of the submitter, and also accounts for story quality in making a prediction about story popularity.

VII. CONCLUSION

In the vast stream of new user-generated content, only a few items will prove to be popular, attracting a lion’s share of attention, while the rest languish in obscurity. Predicting which items will become popular is exceedingly difficult, even for experts. Research has shown that popularity is weakly related to inherent content quality, and that social influence leads to an uneven distribution of popularity and makes it so difficult to predict. We claim that a model of social dynamics of users on a social media site allows us to quantitatively characterize evolution of popularity of items on that site and study how it is affected by item quality and social influence.
We evaluate this claim by studying the social news aggregator Digg, which allows users to submit and vote on news stories. The number of votes a story accumulates on Digg shows its popularity. In an earlier work we developed a model of social voting on Digg, which describes how the number of votes received by a story changes in time. Knowing how interesting a story is and how connected the submitter is fully determines the evolution of the number of votes the story receives. This leads to an insight that a model can be used to predict a story’s popularity from the initial reaction of users to it. Specifically, we use observations of evolution of the number of votes received by a story shortly after submission to estimate how interesting it is, and then use the model to predict how many votes the story will get after a period of a few days. Model-based prediction outperforms other methods that exploit social influence only, or correlation between early and late votes received by stories. However, results show that we can improve prediction by developing a more fine-grained model that differentiates between how interesting a story is to fans and to the general population.

Acknowledgments

We would like to thank Fetch Technologies for providing the tool to extract data from Web pages. In addition we would like to thank Suradej Intagorn for his help in retrieving data from Digg and Aram Galstyan for useful discussions. This work is supported in part by National Science Foundation under award 0915678.

[1] E. Adar, L. Zhang, L. A. Adamic, and R. M. Lukose. Implicit structure and the dynamics of blogspace. In Workshop on the Weblogging Ecosystem, 13th International World Wide Web Conference, 2004.
[2] N. Agarwal, H. Liu, L. Tang, and P. S. Yu. Identifying the influential bloggers in a community. In WSDM ’08: Proc. of the International Conference on Web Search and Web Data Mining, New York, NY, USA, 2008. ACM.
[3] C. Anderson. The Long Tail: Why the Future of Business is Selling Less of More. Hyperion, 2006.
[4] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. Feb 2009.
[5] R. Crane and D. Sornette. Viral, quality, and junk videos on youtube: Separating content from noise in an information-rich environment. In Proc. of AAAI symposium on Social Information Processing, Menlo Park, CA, 2008. AAAI.
[6] P. Domingos and M. Richardson. Mining the network value of customers. In Proc. of KDD, 2001.
[7] V. Gómez, A. Kaltenbrunner, and V. López. Statistical analysis of the social network and discussion threads in Slashdot. In WWW ’08: Proceeding of the 17th international conference on World Wide Web, pages 645–654, New York, NY, USA, 2008. ACM.
[8] T. Hogg and G. Szabo. Diversity of User Activity and Content Quality in Online Communities. In ICWSM ’10: Proc. of 3rd International Conference on Weblogs and Social Media, 2009.
[9] T. Hogg and K. Lerman. Stochastic models of user-contributory web sites. In ICWSM ’10: Proc. of 3rd International Conference on Weblogs and Social Media, 2009.
[10] B. A. Huberman, P. L. T. Pirolli, J. E. Pitkow, and R. M. Lukose. Strong regularities in World Wide Web surfing. Science, 280:95–97, 1998.
[11] A. Kaltenbrunner, V. Gomez, and V. Lopez. Description and prediction of slashdot activity. In Proc. 5th Latin American Web Congress (LA-WEB 2007), 2007.
[12] D. Kempe, J. Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In KDD ’03: Proc. of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, New York, NY, USA, 2003. ACM.
[13] A. Kittur, E. Chi, B. A. Pendleton, B. Suh, and T. Mytkowicz. Power of the few vs. wisdom of the crowd: Wikipedia and the rise of the bourgeoisie. In Proc. of World Wide Web Conference, 2006.
[14] K. Lerman. Social information processing in social news aggregation. IEEE Internet Computing: special issue on Social Search, 11(6):16–28, 2007.
[15] K. Lerman. Social networks and social information filtering on digg. In Proc. of International Conference on Weblogs and Social Media (ICWSM-07), 2007.
[16] K. Lerman and A. Galstyan. Analysis of social voting patterns on digg. In Proc. of the 1st ACM SIGCOMM Workshop on Online Social Networks, 2008.
[17] K. Lerman and L. Jones. Social browsing on flickr. In Proc. of International Conference on Weblogs and Social Media (ICWSM-07), 2007.
[18] K. Lerman, A. Martinoli, and A. Galstyan. A review of probabilistic macroscopic models for swarm robotic systems. In S. E. and S. W., editors, Swarm Robotics Workshop: State-of-the-art Survey, number 3342 in LNCS, pages 143–152. Springer-Verlag, Berlin Heidelberg, 2005.
[19] J. Leskovec, L. Adamic, and B. Huberman. The dynamics of viral marketing. ACM Transactions on the Web, 1(1), 2007.
[20] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. Vanbriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD ’07: Proc. of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, New York, NY, USA, 2007. ACM.
[21] M. Salganik, P. Dodds, and D. Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311:854, 2006.
[22] G. Szabo and B. A. Huberman. Predicting the popularity of online content. Social Science Research Network Working Paper Series, November 2008.
[23] D. M. Wilkinson. Strong regularities in online peer production. In EC ’08: Proc. of the 9th ACM conference on Electronic commerce, pages 302–309, New York, NY, USA, 2008. ACM.
[24] F. Wu and B. A. Huberman. Novelty and collective attention. Proc. of the National Academy of Sciences, 104(45):17599–17601, November 2007.
[25] F. Wu, B. A. Huberman, L. A. Adamic, and J. R. Tyler. Information flow in social groups. Physica A: Statistical and Theoretical Physics, 337(1-2):327–335, June 2004.
