Information Contagion: an Empirical Study of the Spread of News on Digg and Twitter Social Networks

Kristina Lerman and Rumi Ghosh
USC Information Sciences Institute
Marina del Rey, CA 90292, USA

Abstract

Social networks have emerged as a critical factor in information dissemination, search, marketing, expertise and influence discovery, and potentially an important tool for mobilizing people. Social media has made social networks ubiquitous, and also given researchers access to massive quantities of data for empirical analysis. These data sets offer a rich source of evidence for studying dynamics of individual and group behavior, the structure of networks and global patterns of the flow of information on them. However, in most previous studies, the structure of the underlying networks was not directly visible but had to be inferred from the flow of information from one individual to another. As a result, we do not yet understand dynamics of information spread on networks or how the structure of the network affects it. We address this gap by analyzing data from two popular social news sites. Specifically, we extract social networks of active users on Digg and Twitter, and track how interest in news stories spreads among them. We show that social networks play a crucial role in the spread of information on these sites, and that network structure affects dynamics of information flow.

Introduction

Social scientists have long recognized the importance of social networks in the spread of information (Granovetter 1973) and innovation (Rogers 2003). Modern communications technologies, notably email and more recently social media, have only enhanced the role of networks in marketing (Domingos and Richardson 2001; Kempe, Kleinberg, and Tardos 2003), information dissemination (Wu et al. 2004; Gruhl and Liben-Nowell 2004), search (Adamic and Adar 2005), and expertise discovery (Davitz et al.
2007). The recent DARPA Network Challenge [1] successfully tested the ability of online social networks to mobilize massive ad-hoc teams to solve real-world problems, which could potentially improve disaster response and coordination of relief efforts.

[1] https://networkchallenge.darpa.mil
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In addition to making social networks ubiquitous, social media sites have given researchers access to massive quantities of data for empirical analysis. These data sets offer a rich source of evidence for studying the structure of social networks (Leskovec and Horvitz 2008) and the dynamics of individual (Vázquez et al. 2006) and group behavior (Hogg and Lerman 2009), efficacy of viral product recommendation (Leskovec, Adamic, and Huberman 2006), global properties of the spread of email messages (Wu et al. 2004; Liben-Nowell and Kleinberg 2008) and blog posts (Leskovec et al. 2007b), and identification of influential blogs (Gruhl and Liben-Nowell 2004; Leskovec et al. 2007a). In most of these studies, however, the structure of the underlying network was not visible but had to be inferred from the flow of information from one individual to another. This posed a serious challenge to our efforts to understand how the structure of the network affects dynamics of information spread on it.

Understanding this question is especially critical for the effective use of social media and peer production systems, which often aggregate over activities of, or contributions made by, many people in order to identify trending topics and noteworthy contributions. Most of these sites also highlight activities of a person's social network links.
Since people create links to others who are similar to them, or whose contributions they find interesting, the dynamics of information on a social network may be different from its dynamics within the general population. Separating in-network from out-of-network activity allows us, among other things, to better estimate the inherent quality of the contributions (Crane and Sornette 2008) or predict their future activity (Hogg and Lerman 2010; Lerman and Galstyan 2008). This will in turn allow us to separate high quality contributions from noise.

Social news sites Digg and Twitter offer a unique opportunity to study dynamics of information spread on social networks. Both sites have become important sources of timely information for people. The social news aggregator Digg allows users to submit links to news stories and vote on stories submitted by other users. On the microblogging service Twitter, users tweet short text messages that often contain links to news stories and comment on or retweet messages of others. Both sites enable users to explicitly create links to other users they want to follow. Another important common feature is data transparency, with both sites providing programmatic access to detailed data about story and user activity.

This paper presents an empirical study of the role of social networks in the spread of information on Digg and Twitter. For our study we collected data about popular stories on Digg and Twitter that includes information about who voted or retweeted the story and when. In addition, we extracted the social networks of active users on these sites. These data sets allow us to empirically characterize individual dynamics and network structure, and to map the spread of interest in news stories through the network. First, we empirically characterize the structure of social networks on both sites.
While the number of fans a user has on each site exhibits a long-tail distribution, Digg's social network is denser and more interconnected than Twitter's, as judged by the number of reciprocated links and the network clustering coefficient. We also show that user activity on both sites has a power-law distribution, albeit with different exponents. Next, we study evolution of the number of votes stories receive. We show that user interface affects dynamics of votes, with evolution of Digg stories going through two distinct stages. Nevertheless, the number of votes accumulated by stories on both sites saturates after a period of about a day to a value that reflects their popularity. Next, we study how information spreads through the social network by measuring how the number of in-network votes a story receives, i.e., votes from fans of the submitter or previous voters, changes in time. We show that the structure of the network affects dynamics of information spread, with information reaching nodes faster in the denser network of Digg than in that of Twitter. However, Twitter stories spread farther, as judged by the total number of in-network votes they receive. We conclude with a discussion of implications of the study.

Social News

Social media has become an important channel for people to share information. On Digg, Twitter, Slashdot, Reddit, and Facebook, among others, users post news or links to news stories, discuss them, and share their opinions in real time. Often, these sites are the first to break important news. After the Christmas 2009 failed attempt to blow up a US commercial airliner, Twitter was the first source to report new security measures for international flights (Carr 2010). In addition to news, these sites are being used as a tool to organize people.
For example, in the aftermath of the disputed elections in Iran in June 2009, the opposition movement used Twitter to mobilize the public, organize protests, and inform people about the latest developments, which was all the more vital in the absence of reliable official information sources.

Digg (http://digg.com) is a popular social news aggregator with over 3 million registered users. Digg allows users to submit links to and rate news stories by voting on, or digging, them. There are many new submissions every minute, over 16,000 a day. Digg picks about a hundred stories daily to feature on its front page. Although the exact promotion mechanism is kept secret, it appears to take into account the number of votes a story receives and the rate at which it receives them. Digg's success is largely fueled by the emergent front page, created by the collective decisions of its many users. A newly submitted story goes to the upcoming stories list, where it remains for 24 hours, or until it is promoted to the front page, whichever comes first. Newly submitted stories are displayed in reverse chronological order, with the most recent story at the top of the list, 15 stories to a page. Promoted (or 'popular') stories are also displayed in reverse chronological order on the front pages, 15 stories to a page, with the most recently promoted story at the top of the list. The importance of being promoted has, among other things, spawned a black market [2] which claims the ability to manipulate the voting process.

Digg also allows users to designate friends and track their activities. The friends interface allows users to see the stories friends recently submitted or voted for. The friendship relationship is asymmetric. When user A lists user B as a friend, A can watch the activities of B but not vice versa. We call A the fan of B.
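The direction of the fan relation is easy to get backwards, so a minimal sketch may help. It assumes friend lists are available as a mapping from each user to the set of users they list as friends; the function name `build_fan_network` and the toy users are hypothetical illustrations, not part of the Digg API or of the authors' code.

```python
from collections import defaultdict


def build_fan_network(friends_of):
    """Invert friend lists into fan lists.

    friends_of maps each user to the set of users they list as friends.
    If A lists B as a friend, A watches B, so A is a fan of B.
    """
    fans_of = defaultdict(set)
    for user, friends in friends_of.items():
        for friend in friends:
            fans_of[friend].add(user)
    return dict(fans_of)


# Hypothetical toy data: A lists B; B lists C; C lists A and B.
friends_of = {"A": {"B"}, "B": {"C"}, "C": {"A", "B"}}
fans_of = build_fan_network(friends_of)

# Fans of B are the users who list B as a friend.
print(sorted(fans_of["B"]))
```

Note that B is not a fan of A here even though A is a fan of B, which is the asymmetry described above.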
A newly submitted story is visible in the upcoming stories list, as well as to the submitter's fans through the friends interface. With each vote it also becomes visible to the voter's fans. The friends interface can be accessed by clicking on the Friends Activity tab at the top of any Digg page. In addition, a story submitted or voted on by a user's friends receives a green ribbon on the story's Digg badge, raising its visibility to fans.

We used the Digg API to collect data about 3,553 stories promoted to the front page in June 2009. The data associated with each story contained the story title, story id, link, submitter's name, submission time, list of voters and the time of each vote, and the time the story was promoted to the front page. In addition, we collected the list of voters' friends. From this information, we were able to reconstruct the fan network of Digg users who were active during the sample period.

Twitter (http://twitter.com) is a popular social networking site that allows registered users to post and read short (at most 140 characters) text messages, which may contain URLs to online content, usually shortened by a URL shortening service such as bit.ly or tinyurl. A user can also retweet or comment on another user's post, usually prepending it with a string “RT @x,” where x is a user's name. Posting a link on Twitter is analogous to submitting a new story on Digg, and retweeting the post is analogous to voting for it. Like Digg, Twitter allows users to designate as friends other users whose posts they want to follow. Being a follower on Twitter is equivalent to being a fan on Digg.

Twitter restricts large-scale access to its data to a limited number of entities.
One of these, Tweetmeme (http://tweetmeme.com), aggregates all Twitter posts to determine frequently retweeted URLs, categorizes the stories these URLs point to, and presents them as news stories in a fashion similar to Digg's front page. We collected data from Tweetmeme using specialized page scrapers developed using Fetch Technologies's AgentBuilder tool. For each story, we retrieved the name of the user who posted the link to it, the time it was posted, the number of times the link was retweeted, and details of up to 1000 of the most recent retweets. For each retweet, we extracted the name of the user, the text and time stamp of the retweet. We were limited to the 1000 most recent retweets by the structure of Tweetmeme. We extracted 398 stories from Tweetmeme that were originally posted between June 11, 2009 and July 3, 2009. Of these, 329 stories had fewer than 1000 retweets. Next, we used the Twitter API to download profile information for each user in the data set. The profile included the complete list of the user's friends and followers.

[2] As an example, see http://subvertandprofit.com

Characteristics of User Activity

We define an active user as any user who voted for at least one story on Digg or retweeted at least one story on Twitter. There are 139,409 active Digg and 137,582 active Twitter users in our sample. On Digg, 71,834 active users designated at least one other user as a friend, with a total of 258,220 friend links. Active users on Twitter were connected to 6,200,051 users. From this data, we were able to reconstruct the fan networks of active users, i.e., active users who are watching activities of other users. Figure 1 shows the distribution of the number of active fans and followers per user. Digg's distribution, shown in Fig. 1(a), has a long-tail shape that is common to degree distributions in real-world complex networks (Clauset, Shalizi, and Newman 2009).
Twitter's distribution, shown in Fig. 1(b), has a peak at around 100 followers and a long tail.

As the numbers above suggest, the Digg social network is denser, more tightly knit than the Twitter social network. We measure density by the number of reciprocal friendship links and the modified clustering coefficient. A reciprocal, or mutual, friendship link exists when user A marks B as friend and vice versa. There were 125,219 such links among 279,725 distinct users in the Digg sample and 3,973,892 mutual links among 6,200,051 users in the Twitter sample. Normalizing these counts by the number of all possible mutual links in the network gives us the fraction of mutual links $f_m$. For Digg $f_m = 3.20 \times 10^{-6}$, and for Twitter $f_m = 2.07 \times 10^{-7}$, an order of magnitude smaller.

The clustering coefficient $f_c$ measures the degree to which a node's network neighbors are interlinked. We define the clustering coefficient for directed networks such as those that exist on Digg and Twitter as the fraction of closed triangles that exist out of all possible sets of three nodes, or triples. For simplicity, we define a closed triangle as a cycle of length three that exists when A lists B as a friend, B lists C and C lists A as a friend. There were 166,239 such triangles in the Digg network, giving us the clustering coefficient $f_c = 7.60 \times 10^{-12}$, and 4,566,952 triangles on Twitter, giving the clustering coefficient of $f_c = 1.92 \times 10^{-14}$ that is two orders of magnitude smaller. Due to the size of the networks, we implemented these metrics using Hadoop [3]. We suspect that the differences in density of the two networks are due to their age, since Twitter is a more recent service than Digg. With time, we expect the Twitter network to grow denser (Leskovec, Kleinberg, and Faloutsos 2005) and become as tightly knit as Digg.

Next, we characterize users' voting activity.
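The density figures reported above can be reproduced from the stated counts, which makes the two normalizations concrete. The sketch below is an in-memory illustration only (the study itself used Hadoop). The text does not spell out the denominators, so two assumptions are made here because they reproduce the reported values: mutual links are normalized by the number of unordered user pairs, n(n-1)/2, and triangles by the number of ordered triples of distinct users, n(n-1)(n-2). All function names and the toy edge list are hypothetical.

```python
from collections import defaultdict


def count_mutual_links(edges):
    """Count unordered pairs {a, b} for which both a->b and b->a exist."""
    edge_set = set(edges)
    return sum(1 for (a, b) in edge_set if a < b and (b, a) in edge_set)


def count_directed_triangles(edges):
    """Count cycles a->b->c->a over three distinct users."""
    successors = defaultdict(set)
    for a, b in edges:
        successors[a].add(b)
    hits = 0
    for a in list(successors):
        for b in successors[a]:
            for c in successors.get(b, ()):
                if len({a, b, c}) == 3 and a in successors.get(c, ()):
                    hits += 1
    return hits // 3  # each cycle is found once per starting node


def mutual_link_fraction(mutual_links, n_users):
    """Assumed normalization: all unordered pairs of users."""
    return mutual_links / (n_users * (n_users - 1) // 2)


def clustering_coefficient(triangles, n_users):
    """Assumed normalization: all ordered triples of distinct users."""
    return triangles / (n_users * (n_users - 1) * (n_users - 2))


# Toy edge list (hypothetical): an edge (a, b) means "a lists b as a friend".
toy_edges = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "A")]
toy_mutual = count_mutual_links(toy_edges)
toy_triangles = count_directed_triangles(toy_edges)

# Counts reported in the text for the two samples.
fm_digg = mutual_link_fraction(125_219, 279_725)
fm_twitter = mutual_link_fraction(3_973_892, 6_200_051)
fc_digg = clustering_coefficient(166_239, 279_725)
fc_twitter = clustering_coefficient(4_566_952, 6_200_051)

print(f"{fm_digg:.2e} {fm_twitter:.2e} {fc_digg:.2e} {fc_twitter:.2e}")
```

Under these assumed denominators the four values round to the figures quoted in the text; a different convention (for example, unordered triples) would change the clustering coefficients by a constant factor without affecting the comparison between the two sites.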
The 139,409 active users in the Digg data set cast 3,018,197 votes on 3,553 stories. User activity is not uniform, as shown in the inset of Fig. 1(a). While the majority of users cast fewer than 10 votes, some users voted on thousands of stories over the sample time period. The distribution of the number of retweets per user in the Twitter data set has a similar shape, with the number of retweets per user ranging from 1 to about 100. The difference in slopes in these distributions is likely explained by the level of effort (Wilkinson 2008) required to vote on Digg vs retweet on Twitter.

[3] http://hadoop.apache.org/

Dynamics of Voting

Our data sets contain a complete record of voting on Digg front page stories and frequently retweeted stories on Twitter. From this data we can reconstruct dynamics of voting. In addition to voting history, we also know the active fan network of Digg and Twitter users and use this information to check whether a particular voter is a fan of the submitter or previous voters. We call these in-network votes fan votes. This information allows us to study how interest in the story spreads through the social networks on Digg and Twitter.

Figure 2(a) shows the evolution of the number of votes received by three Digg stories about post-election unrest in Iran in June 2009. While the details of the dynamics differ, the general features of votes evolution are shared by all Digg stories and can be described by a stochastic model of social voting (Hogg and Lerman 2009). While in the upcoming stories queue, a story accumulates votes at some slow rate. The point where the slope abruptly changes corresponds to promotion to the front page. After promotion the story is visible to a large number of people, and the number of votes grows at a faster rate. As the story ages, accumulation of new votes slows down (Wu and Huberman 2007) and finally saturates.
Figure 2(b) shows the evolution of the number of times stories on the same topics were retweeted. The number of retweets grows smoothly until it saturates. It takes about a day for the number of votes/retweets to saturate on both sites.

Distribution of popularity
The total number of times a story was voted for or retweeted reflects its popularity among Digg and Twitter users respectively. The distribution of story popularity on either site, Figure 3, shows the 'inequality of popularity' (Salganik, Dodds, and Watts 2006), with relatively few stories becoming very popular, accruing thousands of votes, while most are much less popular, receiving fewer than 500 votes [4]. The most common number of votes received by a story is around 500 on Digg and 400 on Twitter. These values are well described by a lognormal distribution (shown as the red line in the figure). The log-normal distribution of story popularity is typical of the “heavy-tailed” distributions associated with social production and consumption of content. In a heavy-tailed distribution a small but non-vanishing number of items generate an uncharacteristically large amount of activity. These distributions have been observed in a variety of contexts, including voting on Digg (Wu and Huberman 2007) and Essembly (Hogg and Szabo 2009), edits of Wikipedia articles (Wilkinson 2008), and music downloads (Salganik, Dodds, and Watts 2006). Understanding the origin of such distributions is the next challenge in modeling user activity on social media sites.

[4] This distribution applies to Digg's front page stories only. Stories that are never promoted to the front page receive very few votes, in many cases just a single vote from the submitter.

[Figure 1 plots omitted. Panels: (a) Digg, number of fans per user vs number of users, inset: number of votes per user; (b) Twitter, number of followers per user vs number of users, inset: number of tweets per user. All axes logarithmic.]
Figure 1: Distribution of user activity. (a) Number of active fans per user in the Digg data set vs the number of users with that many fans. Inset shows distribution of voting activity, i.e., number of votes per user vs number of users who cast that many votes. (b) Number of active followers per user in the Twitter data set vs the number of users with that many followers. Inset shows distribution of retweeting activity.

Dynamics of Voting on Networks

At the time of submission, a Digg story is visible on the upcoming stories list and to the submitter's fans through the friends interface. As users vote on the story, it becomes visible to their own fans via the friends interface. Analogous to the spread of a contagious disease (Newman 2002), interest in the story cascades through the social network. When the story is promoted to the front page, it becomes visible to many nonfans, although users are still able to pick out stories their friends liked through the green ribbon on the story's Digg badge. Similarly, a new post on Twitter is visible to the submitter's followers, and every user who retweets the story broadcasts it to her own followers. Although aggregators like Tweetmeme attempt to identify popular stories on Twitter in Digg-like fashion, there is no evidence that they boost their visibility to nonfans.

We can trace the cascade of interest in a story through the underlying social network of Digg (Twitter) by checking whether a new vote (retweet) came from a fan (follower) of any of the previous voters, including the submitter. We call such votes or retweets fan votes, regardless of whether we are talking about Digg or Twitter.
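The fan-vote check described above can be sketched in a few lines. This is an illustrative reading of the definition, not the authors' code: it assumes the voter list is in chronological order and excludes the submitter, and that the fan network is given as a mapping from each user to that user's fans. The function name `label_fan_votes` and the toy users are hypothetical.

```python
def label_fan_votes(submitter, voters, fans_of):
    """Flag each vote as a fan vote (True) or not (False).

    voters:  users in chronological order of voting, excluding the submitter.
    fans_of: maps a user to an iterable of that user's fans (followers).
    A vote is a fan vote if the voter is a fan of the submitter or of
    any earlier voter.
    """
    exposed = set(fans_of.get(submitter, ()))  # users who can see the story via the network
    flags = []
    for voter in voters:
        flags.append(voter in exposed)
        exposed.update(fans_of.get(voter, ()))  # the vote exposes the voter's own fans
    return flags


# Hypothetical toy network: a and b follow the submitter s, c follows a, e follows d.
fans_of = {"s": {"a", "b"}, "a": {"c"}, "d": {"e"}}
voters = ["a", "d", "c", "e", "b"]

flags = label_fan_votes("s", voters, fans_of)
cascade_size = sum(flags)  # total number of fan votes for this story
print(flags, cascade_size)
```

In the toy example, d finds the story outside the network, but d's vote still exposes the story to e, whose later vote counts as a fan vote.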
The cascade (“information contagion” in the title of this article) therefore starts with the story's submitter and grows as the story accrues fan votes.

Researchers have studied information cascades in email chain letters (Wu et al. 2004; Liben-Nowell and Kleinberg 2008) and blog posts (Gruhl and Liben-Nowell 2004; Leskovec et al. 2007b) in order to obtain insights into the structure of the network, identify influential nodes within it, or predict popularity of content (Lerman and Galstyan 2008). Characterizing information cascades is necessary for creating a model of the dynamics of information on networks.

Dynamics and distribution of fan votes
The dashed lines in Figure 2 show how the number of fan votes received by each story grows in time. Their evolution is similar to that of all votes, and growth saturates after a period of about a day. The value at which growth saturates shows the story's range, or how widely it penetrates the social network. Figure 4 shows the distribution of cascade sizes generated by Digg and Twitter stories. These distributions are markedly different from the distribution of story popularity shown in Fig. 3. Although the distribution of network cascades of Digg stories, Fig. 4(a), is slightly asymmetrical, it is best described by a normal distribution with the mean and standard deviation equal to 104.27 and 32.31 votes respectively, not the log-normal distribution in Fig. 3(a). It is also unlike the distribution of cascade sizes in a blog post network, which has a power law distribution (Leskovec et al. 2007b). Remarkably, there are no stories that did not generate a cascade, i.e., which did not receive any fan votes.

The inset in Figure 4(a) shows the distribution of votes from the submitter's fans only. It is also described by a normal function with a mean around 50 votes. A small fraction of stories, fewer than 400, did not have any votes from the submitter's fans.
This indicates that active users who are fans of the submitter are also fans of other voters, i.e., that the social network of active Digg users is dense and highly interlinked. This observation is supported by the finding of a relatively high clustering coefficient of the Digg social network.

[Figure 2 plots omitted. Panels: (a) Digg, votes vs time (hours); (b) Twitter, retweets vs time (hours). Curves: all votes (retweets) and fan votes (retweets) for story1, story2, and story3.]
Figure 2: Dynamics of stories on Digg and Twitter. (a) Total number of votes (diggs) and fan votes received by stories on Digg since submission. (b) Total number of times a story was retweeted and the number of retweets from followers since the first post vs time. The titles of stories on Digg were: story1: “U.S. Government Asks Twitter to Stay Up for #IranElection”, story2: “Western Corporations Helped Censor Iranian Internet”, story3: “Iranian clerics defy ayatollah, join protests.” The titles of retweeted stories were: story1: “US gov asks twitter to stay up”, story2: “Iran Has Built a Censorship Monster with help of west tech”, story3: “Clerics join Iran's anti-government protests - CNN.com.”

The distribution of cascade sizes of Twitter stories is shown in Fig. 4(b). These also appear to be normally distributed, although a substantial number of stories do not spread on the network. This distribution is broader than that of Digg stories, which indicates that stories spread farther on the Twitter network. The distribution of the number of votes cast by the submitter's followers, shown in the inset in Fig. 4(b), is markedly different from Digg.
The vast majority of the stories did not receive any votes from the submitter's followers, indicating that the submitter's and other voters' followers are disjoint. This observation is supported by our finding that the Twitter social network is sparsely interconnected.

Evolution of fan votes
Figure 5 shows how the number of fan votes (size of the cascade), aggregated over all stories, grows during the early stages of voting or retweeting. While there is significant variation in the number of fan votes received by a story, the aggregate exhibits a well-defined trend. The solid black lines show the median cascade size, while thin gray lines show the envelope of the boundary that is one standard deviation from the mean.

The cascade grows steadily with new votes on Digg (Fig. 5(a)), although faster initially, indicating that there are two distinct mechanisms for story visibility on Digg. This is seen more clearly in Fig. 5(b), which shows the probability that the next vote is a fan vote and will increase the size of the cascade. We separate votes cast before promotion from those cast after the story is promoted. Before promotion, this probability is almost constant, at p = 0.74. After promotion, it decays to a lower, but also almost constant value p = 0.3. This is consistent with our hypothesis that before promotion social networks are the primary mechanism for spreading interest in new stories. Although a story is also visible on the upcoming stories list, few users actually discover stories there. With 16,000 daily submissions, a new story is quickly submerged by new submissions and is pushed to page 15 of the upcoming stories list within the first 20 minutes. Few users are likely to navigate that far (Huberman et al. 1998).
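The page-15 figure follows from simple arithmetic on the numbers already given: 16,000 submissions a day and 15 stories to a page. The sketch below assumes, as a simplification not stated in the text, that submissions arrive at a uniform rate throughout the day; the function name `upcoming_page` is hypothetical.

```python
def upcoming_page(minutes_since_submission,
                  submissions_per_day=16_000,
                  stories_per_page=15):
    """Estimate which page of the upcoming list a story sits on.

    Assumes a uniform submission rate, so the number of newer stories
    grows linearly with the time elapsed since submission.
    """
    newer_stories = submissions_per_day / (24 * 60) * minutes_since_submission
    return int(newer_stories // stories_per_page) + 1


# About 11 submissions arrive per minute, so after 20 minutes roughly
# 222 newer stories sit above the story in the list.
page_after_20_min = upcoming_page(20)
print(page_after_20_min)
```

With real traffic the rate varies by time of day, so the exact page will differ, but the order of magnitude is the point: a story leaves the first few pages within minutes.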
Promotion to the front page, which generally happens when a story accrues between 50 and 100 votes, exposes the story to a large and diverse audience, making social networks less of a factor in its spread, since large numbers of Digg users who read front page stories do not befriend others.

The spread of interest in stories through the Twitter network, shown in Figure 5(c), is similar to that on Digg. As on Digg, the median number of fan votes rises steadily during the early stages of voting. However, the rate of growth is nearly constant, indicating there is a single significant mechanism for making stories visible to voters, namely the social network. The probability that the next retweet is from a fan, shown in Fig. 5(d), rises slowly from around p = 0.4 to p = 0.55. This value is lower than the pre-promotion probability of the next fan vote on Digg. The rate of interest spread appears to depend on the density of the network. Initially, Digg stories spread faster through the social network than stories on Twitter, because of Digg's denser network structure, but after promotion they spread much slower as unconnected users see and vote on the stories.

The dashed lines in Fig. 5(a) & (c) show how the median number of votes from the submitter's fans or followers changes with voting. By the time a story accumulates 50 votes on Digg (at which point some of the stories are promoted to the front page), about half of the votes are from the submitter's fans, and another 10 are from fans of prior voters but not the submitter. After a story receives about 100 votes (by which point most of the stories are promoted), the number of votes from the submitter's fans changes very slowly, while the number of fan votes continues to grow. This indicates that the submitter's fans vote for the story during its early stages and that users pay attention to the stories their friends submit.
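The aggregate curves discussed in this section (median cascade size and the probability that the next vote is a fan vote, both as a function of the number of votes cast) could be computed along the following lines. This is a hedged sketch rather than the authors' procedure: it assumes each story is represented as a chronological list of booleans marking fan votes, and it does not separate pre- and post-promotion votes as Fig. 5(b) does. The function names and toy data are hypothetical.

```python
from statistics import median


def median_cascade_size(fan_flags_per_story, max_votes):
    """Median number of fan votes among the first k votes, for k = 1..max_votes."""
    curve = []
    for k in range(1, max_votes + 1):
        sizes = [sum(flags[:k]) for flags in fan_flags_per_story if len(flags) >= k]
        if not sizes:
            break  # no story has this many votes
        curve.append(median(sizes))
    return curve


def prob_next_vote_is_fan(fan_flags_per_story, max_votes):
    """Fraction of stories whose k-th vote is a fan vote, for k = 1..max_votes."""
    probs = []
    for k in range(max_votes):
        kth_votes = [flags[k] for flags in fan_flags_per_story if len(flags) > k]
        if not kth_votes:
            break
        probs.append(sum(kth_votes) / len(kth_votes))
    return probs


# Hypothetical toy data: three stories, four votes each (True marks a fan vote).
stories = [
    [True, True, False, True],
    [True, False, False, False],
    [False, True, True, True],
]

median_curve = median_cascade_size(stories, max_votes=4)
prob_curve = prob_next_vote_is_fan(stories, max_votes=4)
print(median_curve, [round(p, 2) for p in prob_curve])
```

Restricting the input to votes cast before or after a story's promotion time would give the two separate Digg curves; that split is left out here to keep the sketch short.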
On Twitter, the initial votes are from the submitter's fans, but their growth slows significantly later.

[Figure 3 plots omitted. Panels: (a) Digg, final votes (diggs) vs number of stories; (b) Twitter, total retweets (popularity) vs number of stories.]
Figure 3: Distribution of story popularity. (a) Distribution of the total number of votes received by Digg stories, with the line showing a log-normal fit. The plot excludes the 15 stories that received more than 6,000 votes. (b) Distribution of the total number of times stories in the Twitter data set were retweeted, with the line showing a log-normal fit.

[Figure 4 plots omitted. Panels: (a) Digg, number of fan votes vs number of stories; (b) Twitter, number of follower tweets vs number of stories. Each panel has an inset.]
Figure 4: Distribution of story cascade sizes. (a) Histogram of the distribution of the total number of fan votes received by Digg stories (size of the interest cascade). The inset shows the distribution of the number of votes from the submitter's fans. (b) Histogram of the distribution of the total number of retweets from followers. The inset shows the distribution of the number of retweets of a story from the submitter's followers.

Related Work

Several researchers have studied dynamics of information flow on networks; however, empirical studies have produced conflicting results. (Wu et al. 2004) examined patterns of email forwarding within an organization and found that email forwarding chains terminate after an unexpectedly small number of steps. They argued that unlike the spread of a virus on a social network, which is expected to reach many individuals, the flow of information is slowed by decay of similarity among individuals within the social network.
They measured similarity by distance in organizational hierarchy between the two individuals within an organization, or in general, as the number of edges separating two nodes within a graph. Similarly, in a large-scale study of the effectiveness of word-of-mouth product recommendations, (Leskovec, Adamic, and Huberman 2006) found that most recommendation chains terminate after one or two steps. However, the authors noted sensitivity of recommendation to price and category of product, leaving open the question whether social networks are an effective tool for disseminating information, rather than purchasing products. Contrary to these studies, we find that information, such as news, reaches many individuals within a social network. Moreover, the reach of information spread does not seem to depend on similarity between users, at least when similarity is measured by the number of edges between them. On Digg, whose users are highly interconnected, a story does not reach as many fans as on Twitter, where users are less densely connected.

Like Wu et al., (Liben-Nowell and Kleinberg 2008) studied the patterns of forwarding of two popular email petitions. Contrary to their expectations, the forwarding chains produced long narrow, rather than bushy wide, trees. In these studies, however, the structure of the underlying social network was not directly visible but had to be inferred by observing new signatures on the forwarded petitions. This method offers only a partial view of the network and does not identify all edges between individuals that participated in the email chain. If an individual has already forwarded the message, she will not do so again, and an edge between her and the sender will not be observed.
In our study, on the other hand, the networks are extracted independently of data about the spread of information.

[Figure 5: Spread of interest in stories through the network. (a) Median number of fan votes vs. votes, aggregated over all Digg stories in our data set. Dotted lines show the boundary one standard deviation from the mean. Dashed lines show the number of votes from fans of submitter. (b) Probability next vote is from a fan before and after the Digg story is promoted. (c) Median number of retweets from followers vs. all retweets, aggregated over all stories in the Twitter data set. (d) Probability next retweet is from a follower.]

A number of researchers have studied the flow of information and influence in the blogosphere and in a virtual world. (Gruhl and Liben-Nowell 2004) traced topic propagation through blogs and used a model of the spread of epidemics on networks (Newman 2002) to characterize the spread of topics through the blogosphere. (Leskovec et al. 2007b) defined an information cascade as a graph of hyperlinks between blog posts. A cascade starts with a cascade initiator, with other blog posts joining the cascade by linking to the initiator or other members of the cascade. Leskovec et al. found that the distribution of cascade sizes follows a power law. In these studies, the networks were derived from the observed links between blog posts, i.e., from the diffusion of information.
In our study, on the contrary, they were extracted from the sites independently of data about the diffusion of information. (Bakshy, Karrer, and Adamic 2009) traced the spread of influence in a multi-player online game and found that, similar to our findings with social news, influence spreads easily on social networks in virtual worlds. This provides an independent confirmation of the importance of social networks in the dynamics of information flow.

Conclusion

We conducted an empirical analysis of user activity on Digg and Twitter. Though the two sites are vastly different in their functionality and user interface, they are used in strikingly similar ways to spread information. First, on both sites users actively create social networks by designating as friends others whose activities they want to follow. Second, users employ these networks to discover and spread information, including news stories. The mechanism for the spread of information is the same on both sites, namely, users watch their friends' activities (what they tweet or vote for) and by their own tweeting and voting actions they make this information visible to their own fans or followers. In spite of the similarities, there are quantitative differences in the structure and function of social networks on Digg and Twitter. Digg networks are dense and highly interconnected. A story posted on Digg initially spreads quickly through the network, with users who are following the submitter also likely to follow other voters. After the story is promoted to Digg's front page, however, it is exposed to a large number of unconnected users. The spread of the story on the network slows significantly, though the story may still generate a large response from the Digg audience.
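The in-network spread described here can be illustrated with a minimal sketch. It assumes that a vote counts as a fan vote when the voter follows the submitter or any earlier voter, which is our reading of the cascade definition rather than the code used in the study. The function name `count_fan_votes`, the `follows` mapping, and the toy data are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def count_fan_votes(voters: List[str], follows: Dict[str, Set[str]]) -> List[int]:
    """Return the cumulative number of fan votes after each vote.

    voters  -- user ids in chronological order; voters[0] is the submitter.
    follows -- hypothetical mapping: user id -> set of user ids that user follows.

    Assumption: a vote is a "fan vote" if the voter follows the submitter
    or any earlier voter, i.e. the story was visible to them via the network.
    """
    seen: Set[str] = set()       # submitter and everyone who has voted so far
    fan_votes = 0
    cumulative: List[int] = []
    for user in voters:
        # The vote is in-network if the user follows anyone already in `seen`.
        if seen & follows.get(user, set()):
            fan_votes += 1
        seen.add(user)
        cumulative.append(fan_votes)
    return cumulative


# Toy data, made up for illustration (not taken from the Digg or Twitter data sets).
follows = {
    "bob": {"alice"},
    "carol": {"bob"},
    "dave": set(),
    "erin": {"alice", "dave"},
}
voters = ["alice", "bob", "carol", "dave", "erin"]
cumulative = count_fan_votes(voters, follows)
print(cumulative)
```

Plotting the cumulative fan-vote count against the total vote count, aggregated over many stories, would give a curve of the kind summarized in Figure 5(a) and (c). The fraction of in-network votes within a sliding window would approximate the probability that the next vote comes from a fan.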
The Twitter social network is less dense than Digg's, and stories spread through the network slower than Digg stories do initially, but they continue spreading at this rate as the story ages and generally penetrate the network farther than Digg stories.

Understanding characteristics of user activity and the effect social networks have on it will help us make better use of social media and peer production systems. Currently these systems blindly aggregate activities of all users in order to identify high quality contributions. However, since popularity and quality are rarely linked (Salganik, Dodds, and Watts 2006), this method is likely to highlight popular, though trivial, contributions. Separating in-network and out-of-network user activity, however, will lead to a better understanding of social dynamics of peer production systems (Hogg and Lerman 2009; Hogg and Szabo 2009; Lerman and Hogg 2010), which will allow us to better separate high quality contributions from noise (Hogg and Lerman 2010; Crane and Sornette 2008; Lerman and Galstyan 2008).

Acknowledgments

We are grateful to Tad Hogg for valuable insights into data analysis and to Prashant Khanduri for initial analysis of Digg data. This material is based upon work supported by the National Science Foundation under Grant No. 0915678.

References

[Adamic and Adar 2005] Adamic, L. A., and Adar, E. 2005. How to search a social network. Social Networks 27(3):187–203.
[Bakshy, Karrer, and Adamic 2009] Bakshy, E.; Karrer, B.; and Adamic, L. A. 2009. Social influence and the diffusion of user-created content. In EC '09: Proc. 10th ACM conference on Electronic commerce, 325–334.
[Carr 2010] Carr, D. 2010. Why twitter will endure. New York Times.
[Clauset, Shalizi, and Newman 2009] Clauset, A.; Shalizi, C. R.; and Newman, M. E. J. 2009. Power-law distributions in empirical data. SIAM Review 51(4):661+.
[Crane and Sornette 2008] Crane, R., and Sornette, D. 2008. Viral, quality, and junk videos on youtube: Separating content from noise in an information-rich environment. In Proc. AAAI symposium on Social Information Processing.
[Davitz et al. 2007] Davitz, J.; Yu, J.; Basu, S.; Gutelius, D.; and Harris, A. 2007. ilink: Search and routing in social networks. In Proc. Knowledge Discovery and Data Mining Conference (KDD-2007).
[Domingos and Richardson 2001] Domingos, P., and Richardson, M. 2001. Mining the network value of customers. In Proc. KDD.
[Granovetter 1973] Granovetter, M. 1973. The strength of weak ties. The American Journal of Sociology.
[Gruhl and Liben-Nowell 2004] Gruhl, D., and Liben-Nowell, D. 2004. Information diffusion through blogspace. In Proc. Int. World Wide Web Conference (WWW), 491–501.
[Hogg and Lerman 2009] Hogg, T., and Lerman, K. 2009. Stochastic models of user-contributory web sites. In Proc. Int. Conference on Weblogs and Social Media.
[Hogg and Lerman 2010] Hogg, T., and Lerman, K. 2010. Social dynamics of digg. In Proc. Int. Conference on Weblogs and Social Media (ICWSM10).
[Hogg and Szabo 2009] Hogg, T., and Szabo, G. 2009. Diversity of user activity and content quality in online communities. In Proc. Int. Conference on Weblogs and Social Media (ICWSM).
[Huberman et al. 1998] Huberman, B. A.; Pirolli, P. L. T.; Pitkow, J. E.; and Lukose, R. M. 1998. Strong regularities in World Wide Web surfing. Science 280(5360):95–97.
[Kempe, Kleinberg, and Éva Tardos 2003] Kempe, D.; Kleinberg, J.; and Éva Tardos. 2003. Maximizing the spread of influence through a social network. In KDD '03: Proc. 9th Int. Conf. on Knowledge discovery and data mining, 137–146.
[Lerman and Galstyan 2008] Lerman, K., and Galstyan, A. 2008. Analysis of social voting patterns on digg. In Proc. 1st ACM SIGCOMM Workshop on Online Social Networks.
[Lerman and Hogg 2010] Lerman, K., and Hogg, T. 2010. Using a model of social dynamics to predict popularity of online content. In Proc. 19th Int. World Wide Web Conference.
[Leskovec, Adamic, and Huberman 2006] Leskovec, J.; Adamic, L.; and Huberman, B. 2006. The dynamics of viral marketing. In EC '06: Proc. 7th Conf. on Electronic commerce, 228–237.
[Leskovec and Horvitz 2008] Leskovec, J., and Horvitz, E. 2008. Planetary-scale views on a large instant-messaging network. In WWW '08: Proc. 17th Int. World Wide Web Conference, 915–924.
[Leskovec et al. 2007a] Leskovec, J.; Krause, A.; Guestrin, C.; Faloutsos, C.; Vanbriesen, J.; and Glance, N. 2007a. Cost-effective outbreak detection in networks. In KDD '07: Proc. 13th Int. Conf. on Knowledge discovery and data mining, 420–429.
[Leskovec et al. 2007b] Leskovec, J.; McGlohon, M.; Faloutsos, C.; Glance, N.; and Hurst, M. 2007b. Cascading behavior in large blog graphs. In Proc. 7th SIAM Int. Conference on Data Mining (SDM).
[Leskovec, Kleinberg, and Faloutsos 2005] Leskovec, J.; Kleinberg, J.; and Faloutsos, C. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD '05: Proc. 11th Int. Conf. on Knowledge discovery in data mining, 177–187.
[Liben-Nowell and Kleinberg 2008] Liben-Nowell, D., and Kleinberg, J. 2008. Tracing information flow on a global scale using internet chain-letter data. PNAS 105(12):4633–4638.
[Newman 2002] Newman, M. E. J. 2002. Spread of epidemic disease on networks. Physical Review E 66(1):016128+.
[Rogers 2003] Rogers, E. M. 2003. Diffusion of Innovations, 5th Edition. Free Press, 5 edition.
[Salganik, Dodds, and Watts 2006] Salganik, M.; Dodds, P.; and Watts, D. 2006. Experimental study of inequality and unpredictability in an artificial cultural market. Science 311:854.
[Vázquez et al. 2006] Vázquez, A.; Oliveira, J.
G.; Dezsö, Z.; Goh, K.; Kondor, I.; and Barabási, A. 2006. Modeling bursts and heavy tails in human dynamics. Phys. Rev. E 73(3):036127+.
[Wilkinson 2008] Wilkinson, D. M. 2008. Strong regularities in online peer production. In EC '08: Proc. 9th Conf. on Electronic commerce, 302–309.
[Wu and Huberman 2007] Wu, F., and Huberman, B. A. 2007. Novelty and collective attention. PNAS 104(45):17599–17601.
[Wu et al. 2004] Wu, F.; Huberman, B.; Adamic, L.; and Tyler, J. 2004. Information flow in social groups. Physica A.

[Figure: number of votes vs. time (hours) for three stories (story1, story2, story3), showing all votes and fan votes for each.]
