Predicting the popularity of online content
Authors: Gabor Szabo, Bernardo A. Huberman
Gabor Szabo, Social Computing Lab, HP Labs, Palo Alto, CA (gabors@hp.com)
Bernardo A. Huberman, Social Computing Lab, HP Labs, Palo Alto, CA (bernardo.huberman@hp.com)

ABSTRACT

We present a method for accurately predicting the long-time popularity of online content from early measurements of user access. Using two content sharing portals, Youtube and Digg, we show that by modeling the accrual of views and votes on content offered by these services we can predict the long-term dynamics of individual submissions from initial data. In the case of Digg, measuring access to given stories during the first two hours allows us to forecast their popularity 30 days ahead with remarkable accuracy, while downloads of Youtube videos need to be followed for 10 days to attain the same performance. The differing time scales of the predictions are shown to be due to differences in how content is consumed on the two portals: Digg stories quickly become outdated, while Youtube videos are still found long after they are initially submitted to the portal. We show that predictions are more accurate for submissions for which attention decays quickly, whereas predictions for evergreen content will be prone to larger errors.

Keywords

Youtube, Digg, prediction, popularity, videos

1. INTRODUCTION

The ubiquity and inexpensiveness of Web 2.0 services have transformed the landscape of how content is produced and consumed online. Thanks to the web, it is possible for content producers to reach out to audiences of sizes that are inconceivable using conventional channels.
Examples of the services that have made this exchange between producers and consumers possible on a global scale include video, photo, and music sharing, weblogs and wikis, social bookmarking sites, collaborative portals, and news aggregators, where content is submitted, perused, and often rated and discussed by the user community. At the same time, the dwindling cost of producing and sharing content has made the online publication space a highly competitive domain for authors. The ease with which content can now be produced brings to the center the problem of the attention that can be devoted to it. Recently, it has been shown that attention [22] is allocated in a rather asymmetric way, with most content getting some views and downloads, whereas only a few items receive the bulk of the attention. While it is possible to predict the distribution of attention over many items, so far it has been hard to predict the amount that would be devoted over time to given ones. This is the problem we solve in this paper.

Most often, portals rank and categorize content based on its quality and appeal to users. This is especially true of aggregators, where the "wisdom of the crowd" is used to provide collaborative filtering facilities to select and order submissions that are favored by many. One such well-known portal is Digg, where users submit links and short descriptions to content that they have found on the Web, and others vote on them if they find the submission interesting. The articles collecting the most votes are then exhibited on premiere sections across the site, such as the "recently popular submissions" (the main page) and "most popular of the day/week/month/year".
This results in a positive feedback mechanism that leads to a "rich get richer" type of vote accrual for the very popular items, although it is also clear that this pertains to only a very small fraction of the submissions.

As a parallel to Digg, where content is not produced by the submitters themselves but only linked to, we study Youtube, one of the first video sharing portals that lets users upload, describe, and tag their own videos. Viewers can watch, reply to, and leave comments on them. The extent of the online ecosystem that has developed around the videos on Youtube is impressive by any standard, and videos that draw a lot of viewers are prominently exposed on the site, similarly to Digg stories.

The paper is organized as follows. In Section 2 we describe how we collected access data on submissions on Youtube and Digg. Section 3 shows how daily and weekly fluctuations can be observed in Digg, and presents a simple method to eliminate them for the sake of more accurate predictions. In Section 4 we discuss the models used to describe content popularity and how prediction accuracy depends on their choice. There we also point out that the expected growth in popularity of videos on Youtube is markedly different from that on Digg, and we study the reasons for this further in Section 5. In Section 6 we conclude and cite works relevant to this study.

2. SOURCES OF DATA

The formulation of the prediction models relies heavily on observed characteristics of our experimental data, which we describe in this section. The organization of Youtube and Digg is conceptually similar, so we can employ a similar framework to study content popularity after the data has been normalized.
To simplify the terminology, by popularity in the following we will refer to the number of views that a video receives on Youtube, and to the number of votes (diggs) that a story collects on Digg, respectively.

2.1 Youtube

Youtube is the pinnacle of user-created video sharing portals on the Web, with 65,000 new videos uploaded and 100 million downloaded on a daily basis, implying that 60% of all online videos are watched through the portal [11]. Youtube is also the third most frequently accessed site on the Internet based on traffic rank [11, 6, 3]. We started collecting view count time series on 7,146 selected videos daily, beginning April 21, 2008, on videos that appeared in the "recently added" section of the portal on this day. Apart from the list of most recently added videos, the web site also offers listings based on different selection criteria, such as the "featured", "most discussed", and "most viewed" lists, among others. We chose the most recently uploaded list to have an unbiased sample of all videos submitted to the site in the sampling period, not only the most popular ones, and also so that we could have a complete history of the view counts for each video during its lifetime. The Youtube application programming interface [23] gives programmatic access to several of a video's statistics, the view count at a given time being one of them. However, since the view count field of a video does not appear to be updated more often than once a day by Youtube, it is only possible to have a good daily approximation of the number of views. Within a day, however, the API does indicate when the view count was recorded.
It is worth noting that while the overwhelming majority of video views are initiated from the Youtube website itself, videos may be linked from external sources as well (about half of all videos are thought to be linked externally, but only about 3% of the views come from these links [5]). In Section 4, we compare the view counts of videos at given times after their upload. Since in most cases we only have information on the view counts once a day, we use linear interpolation between the nearest measurement points around the time of interest to approximate the view count at the given time.

2.2 Digg

Digg is a Web 2.0 service where registered users can submit links and short descriptions to news, images, or videos they have found interesting on the Web, and which they think should hold interest for the greater general audience, too (90.5% of all uploads were links to news, 9.2% to videos, and only 0.3% to images). Submitted content is placed on the site in a so-called "upcoming" section, which is one click away from the main page of the site. Links to content are provided together with surrogates for the submission (a short description in the case of news, and a thumbnail image for images and videos), which are intended to entice readers to peruse the content. The main purpose of Digg is to act as a massive collaborative filtering tool to select and show the most popular content, and thus registered users can digg submissions they find interesting. This increases the digg count of the submission by one, and submissions that get enough diggs in a relatively short time in the upcoming section will be presented on the front page of Digg, or, using its terminology, they will be promoted to the front page. Having one's submission promoted is a considerable source of pride in the Digg community, and is a main motivator for returning submitters.
The exact algorithm for promotion is not made public to thwart gaming, but it is thought to give preference to upcoming submissions that accumulate diggs quickly enough from diverse neighborhoods of the Digg social network [18]. The social networking feature offered by Digg is extremely important: users may place watch lists on another user by becoming their "fans". Fans are shown updates on the submissions dugg by the users they follow, and thus the social network plays a major role in making upcoming submissions more visible. Very importantly, in this paper we consider only stories that were promoted to the front page, given that we are interested in submissions' popularity among the general user base rather than in niche social networks.

We used the Digg API [8] to retrieve all the diggs made by registered users between July 1, 2007, and December 18, 2007. This data set comprises about 60 million diggs by 850 thousand users in total, cast on approximately 2.7 million submissions (this number also includes all past submissions that received any digg). The number of submissions made in this period was 1,321,903, of which 94,005 (7.1%) were promoted to the front page.

3. DAILY CYCLES

In this section we examine the daily and weekly variations in user activity. Figure 1 shows the hourly rates of digging and story submitting by users, and of upcoming story promotions by Digg, as a function of time for one week, starting August 6, 2007. The difference in the rates may be as much as threefold, and weekends also show lesser activity. Fig. 1 also showcases weekly variations, where weekdays appear about 50% more active than weekends. It is also reasonable to assume that besides daily and weekly cycles, there are seasonal variations as well.
It may also be concluded that Digg users are mostly located in the UTC-5 to UTC-8 time zones, and since the official language of Digg is English, Digg users are mostly from North America.

Depending on the time of day when a submission is made to the portal, stories will differ greatly in the number of initial diggs that they get, as Fig. 2 illustrates. As can be expected, stories submitted at less active periods of the day will initially accrue fewer diggs in the first few hours than stories submitted during peak times. This is a natural consequence of suppressed digging activity during the nightly hours, but it may initially penalize interesting stories that will ultimately become popular. In other words, based on observations made only a few hours after a story has been promoted, we may misinterpret a story's relative interestingness if we do not correct for the variation in daily activity cycles. For instance, a story that gets promoted at 12pm will on average get approximately 400 diggs in the first 2 hours, while it will only get 200 diggs if it is promoted at midnight.

Figure 1: Daily and weekly cycles in the hourly rates of digging activity, story submissions, and story promotions, respectively. To match the different scales, the rates for submissions are multiplied by 10 and those for promotions by 1000. The horizontal axis represents one week from August 6, 2007 (Monday) through August 12, 2007 (Sunday). The tick marks represent midnight of the respective day, Pacific Standard Time.

Since the digging activity varies by time, we introduce the notion of digg time, where we measure time not by wall time (seconds), but by the number of all diggs that users cast on promoted stories.
We choose to count diggs only on promoted stories because this is the section of the portal that we focus on, and most diggs (72%) go to promoted stories anyway. The average number of diggs arriving at promoted stories per hour is 5,478 when calculated over the full data collection period, thus we define one digg hour as the time it takes for this many new diggs to be cast. As seen earlier, during the night this will take about three times longer than during the active daily periods. This transformation allows us to mitigate the dependence of submission popularity on the time of day when it was submitted. When we refer to the age of a submission in digg hours at a given time t, we measure how many diggs were received in the system between the submission of the story and t, and divide by 5,478. A further reason to use digg time instead of absolute time will be given in Section 4.1.

Figure 2: The average number of diggs that stories get after a certain time, shown as a function of the hour that the story was promoted at (PST). Curves from bottom to top correspond to measurements made 2, 4, 8, and 24 hours after promotion, respectively.

4. PREDICTIONS

In this section we show that if we perform a logarithmic transformation on the popularities of submissions, the transformed variables exhibit strong correlations between early and later times, and on this scale the random fluctuations can be expressed as an additive noise term. We use this fact to model and predict the future popularity of individual content, and measure the performance of the predictions. In the following, we call reference time t_r the time at which we intend to predict the popularity of a submission whose age with respect to the upload (promotion) time is t_r.
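The digg-time transformation introduced in Section 3 can be sketched as follows; this is a minimal illustration, not the paper's own code, assuming we hold the sorted timestamps of all diggs cast on promoted stories (the function name and data layout are ours):

```python
import bisect

# Average number of diggs per hour arriving at promoted stories over the
# full collection period, as reported above; one "digg hour" is the time
# it takes for this many new diggs to be cast.
DIGGS_PER_DIGG_HOUR = 5478

def age_in_digg_hours(digg_times, t_submit, t):
    """Age of a submission at wall time t, measured in digg hours.

    digg_times: sorted timestamps of all diggs cast on promoted stories.
    t_submit:   submission (promotion) time of the story.
    """
    # Count the system-wide diggs cast between submission and t ...
    n_diggs = (bisect.bisect_right(digg_times, t)
               - bisect.bisect_left(digg_times, t_submit))
    # ... and divide by the average hourly rate to get digg hours.
    return n_diggs / DIGGS_PER_DIGG_HOUR
```

During the night, when fewer diggs arrive per wall-clock hour, a story ages more slowly in digg hours, which is exactly the correction for the daily cycles described above.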
By indicator time t_i we refer to the point in the life cycle of the submission at which we perform the prediction, or in other words how long we can observe the submission history in order to extrapolate; t_i < t_r.

4.1 Correlations between early and later times

We first consider the question whether the popularity of submissions early on is any predictor of their popularity at a later stage, and if so, what the relationship is. For this, we first plot the popularity counts for submissions at the reference time t_r = 30 days both for Digg (Fig. 3) and Youtube (Fig. 4), versus the popularities measured at the indicator times t_i = 1 digg hour and t_i = 7 days for the two portals, respectively. We choose to measure the popularity of Youtube videos at the end of the 7th day so that the view counts at this time are in the 10^1–10^4 range, and similarly for Digg in this measurement. We logarithmically rescale the horizontal and vertical axes in the figures due to the large variances present among the popularities of different submissions (notice that they span three decades).

Observing the Digg data, one notices that the popularity of about 11% of stories (indicated by lighter color in Fig. 3) grows much more slowly than that of the majority of submissions: by the end of the first hour of their lifetime, they have received most of the diggs that they will ever get. The separation of the two clusters is perceivable until approximately the 7th digg hour, after which it vanishes due to the fact that by that time the digg counts of stories have mostly saturated to their respective maximum values (see Fig. 10 for the average growth of Digg article popularities).
While there is no obvious reason for the presence of clustering, we assume that it arises when the promotion algorithm of Digg misjudges the expected future popularity of stories, and promotes stories from the upcoming phase that will not maintain sustained attention from the users. Users thus lose interest in them much sooner than in stories in the upper cluster. We used k-means clustering with k = 2 and a cosine distance measure to separate the two clusters as shown in Fig. 3 up to the 7th digg hour (after which the clusters are not separable), and we exclusively use the upper cluster for the calculations in the following.

Figure 3: The correlation between digg counts on the 17,097 promoted stories in the dataset that are older than 30 days. A k-means clustering separates 89% of the stories into the upper cluster, while the rest of the stories are shown in lighter color. The bold guide line indicates a linear fit with slope 1 on the upper cluster, with a prefactor of 5.92 (the Pearson correlation coefficient is 0.90). The dashed line marks the y = x line below which no stories can fall.

As a second step, to quantify the strength of the correlations apparent in Figs. 3 and 4, we measured the Pearson correlation coefficients between the popularities at different indicator times and the reference time. The reference time is always chosen as t_r = 30 days (or digg days for Digg) as previously, and the indicator time is varied between 0 and t_r.

Youtube. Fig. 5 shows the Pearson correlation coefficients between the logarithmically transformed popularities, and for comparison also the correlations between the untransformed variables. The PCC is 0.92 after about 5 days; however, the untransformed scale shows weaker linear dependence: at 5 days the PCC is only 0.7, and it consistently stays below the PCC of the logarithmically transformed scale.

Digg. Also in Fig. 5, we plot the PCCs of the log-transformed popularities between the indicator times and the reference time. The PCC is already 0.98 after the 5th digg hour, and it is as strong as 0.993 after the 12th. We also argue here that measuring submission age as digg time leads to stronger correlations: the figure also shows the PCC for the case when the story age is measured as absolute time (dashed line, 17,222 stories), and it is always less than the PCCs taken with digg hours (solid line, 17,097 stories) up to approximately the 12th hour. This is understandable, since this is the time scale of the strongest daily variations (cf. Fig. 1). We do not show the untransformed-scale PCC for Digg submissions measured in digg hours, since it approximately traces the dashed line in the figure, thus also indicating a weaker correlation than the solid line.

Figure 4: The popularities of videos shown at the 30th day after upload, versus their popularity after 7 days. The bold solid line with gradient 1 has been fit to the data, with correlation coefficient R = 0.77 and prefactor 2.13.
4.2 The evolution of submission popularity

The strong linear correlation found between the indicator and reference times of the logarithmically transformed submission popularities suggests that the more popular submissions are in the beginning, the more popular they will also be later on, and the connection can be described by a linear model:

ln N_s(t_2) = ln [r(t_1, t_2) N_s(t_1)] + ξ_s(t_1, t_2)
            = ln r(t_1, t_2) + ln N_s(t_1) + ξ_s(t_1, t_2),   (1)

where N_s(t) is the popularity of submission s at time t (in the case of Digg, time is naturally measured by digg time), and t_1 and t_2 are two arbitrarily chosen points in time, t_2 > t_1. r(t_1, t_2) accounts for the linear relationship found between the log-transformed popularities at different times, and it is independent of s. ξ_s is a noise term drawn from a given distribution with mean 0 that describes the randomness observed in the data. It is important to note that the noise term is additive on the log-scale of popularities, justified by the fact that the strongest correlations were found on this transformed scale. Considering Figures 3 and 4, the popularities at t_2 = t_r also appear to be evenly distributed around the linear fit (taking only the upper cluster in Fig. 3 and considering the natural cutoff y = x in Fig. 4).

We will now show that the variations of the log-popularities around the expected average are distributed approximately normally, with an additive noise. To this end we performed linear regression on the logarithmically transformed data points shown in Figs. 3 and 4, respectively, fixing the slope of the linear regression function to 1 in accordance with Eq. (1).
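With the slope fixed to 1, the least-squares fit reduces to estimating a single intercept: the mean difference of the log popularities. A sketch of this fit and of extracting the residuals (names are ours, not from the paper):

```python
import statistics

def fit_slope1(log_pop_ti, log_pop_tr):
    """Least-squares fit of ln N(t_r) = beta0 + ln N(t_i), slope fixed to 1.

    With unit slope, the maximum-likelihood intercept is simply the mean
    of the log differences (an estimate of ln r(t_i, t_r) in Eq. (1)),
    and the residuals are the centered differences (the xi_s noise term).
    """
    diffs = [b - a for a, b in zip(log_pop_ti, log_pop_tr)]
    beta0 = statistics.mean(diffs)
    residuals = [d - beta0 for d in diffs]
    return beta0, residuals
```

The residuals returned here are what the quantile-quantile plots below test for normality.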
The intercept of the linear fit corresponds to ln r(t_i, t_r) above (t_i = 7 days/1 digg hour, t_r = 30 days), and the ξ_s(t_i, t_r) are given by the residuals of the variables with respect to the best fit.

Figure 5: The Pearson correlation coefficients between the logarithms of the popularities of submissions measured at different times: at the time indicated by the horizontal axis, and on the 30th day. For Youtube, the x-axis is in days. For Digg, it is in hours for the dashed line, and digg hours for the solid line (stronger correlation). For comparison, the dotted line shows the correlation coefficients for the untransformed (non-logarithmic) popularities in Youtube.

We tested the normality of the residuals by plotting the quantiles of their empirical distributions versus the quantiles of the theoretical (normal) distributions in Figs. 6 (Digg) and 7 (Youtube). The residuals show a reasonable match with normal distributions, although we observe in the quantile-quantile plots that the measured distributions of the residuals are slightly right-skewed, which means that content with very high popularity values is overrepresented in comparison to less popular content. This is understandable if we consider that a small fraction of the submissions ends up on the "most popular" and "top" pages of both portals. These are the submissions that are deemed most requested by the portals, and are shown to the users as those that others found most interesting. They stay on frequented and very visible parts of the portals, and naturally attract further diggs/views. In the case of Youtube, one can see that content popularity at the 30th day versus the 7th day as shown in Fig.
4 is bounded from below, due to the fact that the view counts can only grow, and thus the distribution of residuals is also truncated in Fig. 7. We also note that the Jarque-Bera and Lilliefors tests reject residual normality at the 5% significance level for both systems, although the residuals appear to be distributed reasonably close to Gaussians. Moreover, to see whether the homoscedasticity of the residuals necessary for the linear regression holds [their variance being independent of N_s(t_i)], we checked the means and variances of the residuals as a function of N_c(t_i) by subdividing the popularity values into 50 bins, with the result that both the mean and variance are independent of N_c(t_i).

Figure 6: The quantile-quantile plot of the residuals of the linear fit of Fig. 3 to the logarithms of Digg story popularities, as described in the text. The inset shows the frequency distribution of the residuals.

A further justification for the model of Eq. (1) is given in the following. It has been shown that the popularity distribution of Digg stories of a given age follows a lognormal distribution [22] that is the result of a growth mechanism with multiplicative noise, and can be described as

ln N_s(t_2) = ln N_s(t_1) + Σ_{τ=t_1}^{t_2} η(τ),   (2)

where η(·) denotes independent values drawn from a fixed probability distribution, and time is measured in discrete steps. If the difference between t_1 and t_2 is large enough, the distribution of the sum of the η(τ)'s will approximate a normal distribution, according to the central limit theorem. We can thus map the mean of the sum of the η(τ)'s to ln r(t_1, t_2) in Eq.
(1), and find that the two descriptions are equivalent characterizations of the same lognormal growth process.

4.3 Prediction models

We present three models to predict an individual submission's popularity at a future time t_r. The performance of the predictions is measured on the test sets by defining error functions that yield a measure of deviation of the predictions from the observed popularities at t_r, and together with the models we discuss what error measure they are expected to minimize. One model that minimizes a given error function may fare worse under another error measure.

The first prediction model closely parallels the experimental observations shown in the previous section. In the second, we consider a common error measure and formulate the model so that it is optimal with respect to this error function. Lastly, the third prediction method is presented as a comparison, and one that has been used in previous works as an "intuitive" way of modeling popularity growth [15]. Below, we use the ˆx notation to refer to the predicted value of x at t_r.

Figure 7: The quantile-quantile plot of the residuals of the linear fit of Fig. 4 for Youtube.

4.3.1 LN model: linear regression on a logarithmic scale; least-squares absolute error

The linear relationship found for the logarithmically transformed popularities and described by Eq. (1) above suggests that, given the popularity of a submission at a given time, a good estimate for a later time is given by the ordinary least squares estimate, which is the best estimate minimizing the sum of the squared residuals (a consequence of the linear regression with the maximum likelihood method).
However, the linear regression assumes normally distributed residuals, and the lognormal model gives rise to additive Gaussian noise only if the logarithms of the popularities are considered; thus the overall error that is minimized by the linear regression on this scale is

LSE* = Σ_c r_c² = Σ_c [ˆln N_c(t_i, t_r) − ln N_c(t_r)]²,   (3)

where ˆln N_c(t_i, t_r) is the prediction for ln N_c(t_r), calculated as ˆln N_c(t_i, t_r) = β_0(t_i) + ln N_c(t_i), and β_0 is yielded by the maximum likelihood parameter estimator for the intercept of the linear regression with slope 1. The sum in Eq. (3) goes over all content in the training set when estimating the parameters, and over the test set when estimating the error. We, on the other hand, are in practice interested in the error on the linear scale,

LSE = Σ_c [ˆN_c(t_i, t_r) − N_c(t_r)]².   (4)

The residuals, while distributed normally on the logarithmic scale, will not have this property on the untransformed scale, and an inconsistent estimate would result if we used exp[ˆln N_c(t_i, t_r)] as a predictor on the natural (original) scale of popularities [9]. However, fitting least squares regression models to transformed data has been extensively investigated (see Refs. [9, 16, 21]), and in case the transformation of the dependent variable is logarithmic, the best untransformed-scale estimate is

ˆN_s(t_i, t_r) = exp[ln N_s(t_i) + β_0(t_i) + σ_0²/2].   (5)

Here σ_0² = var(r_c) is the consistent estimate for the variance of the residuals on the logarithmic scale. Thus, to estimate the expected popularity of a given submission s at time t_r from measurements at time t_i, we first determine the regression coefficient β_0(t_i) and the variance of the residuals σ_0² from the training set, and apply Eq.
(5) to obtain the expectation on the original scale, using the popularity N_s(t_i) measured for s at t_i.

4.3.2 CS model: constant scaling model; relative squared error

In this section we first define the error function that we wish to minimize, and then present a linear estimator for the predictions. The relative squared error that we use here takes the form of

RSE = Σ_c [(ˆN_c(t_i, t_r) − N_c(t_r)) / N_c(t_r)]² = Σ_c [ˆN_c(t_i, t_r) / N_c(t_r) − 1]².   (6)

This is similar to the commonly used relative standard error,

|(ˆN_c(t_i, t_r) − N_c(t_r)) / N_c(t_r)|,   (7)

except that the absolute value of the relative difference is replaced by a square.

The linear correspondence found between the logarithms of the popularities, up to a normally distributed noise term, suggests that the future expected value ˆN_s(t_i, t_r) for submission s can be expressed as

ˆN_s(t_i, t_r) = α(t_i, t_r) N_s(t_i).   (8)

α(t_i, t_r) is independent of the particular submission s, and only depends on the indicator and reference times. The value that α(t_i, t_r) takes, however, will be contingent on what the error function is, so that the optimal value of α minimizes it. We will minimize RSE on the training set if and only if

0 = ∂RSE / ∂α(t_i, t_r) = 2 Σ_c [N_c(t_i) / N_c(t_r) · α(t_i, t_r) − 1] N_c(t_i) / N_c(t_r).   (9)

Expressing α(t_i, t_r) from the above,

α(t_i, t_r) = Σ_c [N_c(t_i) / N_c(t_r)] / Σ_c [N_c(t_i) / N_c(t_r)]².   (10)

The value of α(t_i, t_r) can be calculated from the training data for any t_i, and further, the prediction for any new submission may be made knowing its age, using this value from the training set together with Eq. (8). If we verified the error on the training set itself, it would be guaranteed that RSE is minimized under the model assumptions of linear scaling.
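A compact sketch of the two estimators: the LN prediction applies the log-scale intercept with the σ_0²/2 correction of Eq. (5), and the CS prediction rescales the early count by the α of Eq. (10). Function names and the list-based data layout are our illustrative assumptions:

```python
import math

def ln_predict(n_ti, beta0, sigma2):
    """LN model, Eq. (5): prediction on the original scale, with the
    sigma^2/2 correction for the log-normally distributed residuals."""
    return math.exp(math.log(n_ti) + beta0 + sigma2 / 2.0)

def cs_alpha(train_ti, train_tr):
    """CS model, Eq. (10): scaling factor that minimizes the relative
    squared error (RSE) on the training set."""
    ratios = [a / b for a, b in zip(train_ti, train_tr)]
    return sum(ratios) / sum(r * r for r in ratios)

def cs_predict(n_ti, alpha):
    """CS model, Eq. (8): constant rescaling of the early popularity."""
    return alpha * n_ti
```

Note that with sigma2 = 0 the LN prediction reduces to the naive back-transformation exp[ln N_s(t_i) + β_0], which, as discussed above, would be an inconsistent estimate when the residual variance is nonzero.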
4.3.3 GP model: growth profile model

For comparison, we consider a third description for predicting future content popularity, which is based on average growth profiles devised from the training set [15]. This assumes in essence that the growth of a submission's popularity in time follows a uniform accrual curve, which is appropriately rescaled to account for the differences between submission interestingnesses. The growth profile is calculated on the training set as the average of the relative popularities of the submissions of a given age t_i, as normalized by the final popularity at the reference time t_r:

P(t_0, t_1) = ⟨N_c(t_0) / N_c(t_1)⟩_c,   (11)

where ⟨·⟩_c takes the mean of its argument over all content in the training set. We assume that the rescaled growth profile approximates the observed popularities well over the whole time axis with an affine transformation, and thus at t_i the rescaling factor Π_s is given by N_s(t_i) = Π_s(t_i, t_r) P(t_i, t_r). The prediction for t_r consists of using Π_s(t_i, t_r) to calculate the future popularity,

ˆN_s(t_r) = Π_s(t_i, t_r) P(t_r, t_r) = Π_s(t_i, t_r) = N_s(t_i) / P(t_i, t_r).   (12)

The growth profiles for Youtube and Digg were measured and are shown in Fig. 10.

Table 1: The partitioning of the collected data into training and test sets. The Digg data is divided by time, while the Youtube videos are chosen randomly for each set, respectively.

            Training set           Test set
  Digg      10,825 stories         6,272 stories
            (7/1/07–9/18/07)       (9/18/07–12/6/07)
  Youtube   3,573 videos           3,573 videos
            randomly selected      randomly selected

4.4 Prediction performance

The performance of the prediction methods will be assessed in this section, using two error functions that are analogous to LSE and RSE, respectively.
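The growth-profile prediction can be sketched in the same style: Eq. (11) averages N_c(t_i)/N_c(t_r) over the training set, and Eq. (12) divides the observed early count by that profile value (function names are ours):

```python
def growth_profile(train_ti, train_tr):
    """Eq. (11): average of N_c(t_i)/N_c(t_r) over the training set,
    i.e. the fraction of the final popularity typically reached by t_i."""
    ratios = [a / b for a, b in zip(train_ti, train_tr)]
    return sum(ratios) / len(ratios)

def gp_predict(n_ti, profile):
    """Eq. (12): rescale the early count by the inverse of the profile."""
    return n_ti / profile
```

For example, if submissions have on average collected half of their 30-day popularity by t_i, the profile is 0.5 and an early count of 10 extrapolates to 20.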
We subdivided the submission time series data into a training set and a test set, on which we benchmarked the different prediction schemes. For Digg, we took all stories submitted during the first half of the data collection period as the training set, and the second half as the test set. The 7,146 Youtube videos that we followed, on the other hand, were submitted around the same time, so instead we randomly selected 50% of these videos for training and used the other half for testing. The sizes of the training and test sets are summarized in Table 1. The parameters defined in the prediction models were found through linear regression (\beta_0 and \sigma_0^2) and sample averaging (\alpha and P), respectively.

For the reference time t_r at which we intend to predict the popularity of submissions, we chose 30 days after the submission time. Since the predictions naturally depend on t_i and on how close we are to the reference time, we performed the parameter estimations in hourly intervals starting from the introduction of each submission.

Analogously to LSE and RSE, we consider the following prediction error measures for one particular submission s:

QSE(s, t_i, t_r) = \left[ \hat N_s(t_i, t_r) - N_s(t_r) \right]^2    (13)

and

QRE(s, t_i, t_r) = \left[ \frac{\hat N_s(t_i, t_r) - N_s(t_r)}{N_s(t_r)} \right]^2 .    (14)

QSE(s, t_i, t_r) is the squared difference between the prediction and the actual popularity for a particular submission s, and QRE is the corresponding relative squared error. We use the same notation for their ensemble averages as well: QSE = \langle QSE(s, t_i, t_r) \rangle_s, where s runs over all submissions in the test set, and similarly QRE = \langle QRE(s, t_i, t_r) \rangle_s. We used the parameters obtained in the training phase to perform the predictions on the test set, and plotted the resulting average error values calculated with the above error measures.
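The per-submission error measures of Eqs. (13) and (14), and their test-set averages, can be sketched as (function names are our own):

```python
import numpy as np

def qse(pred, actual):
    """Per-submission squared error, Eq. (13)."""
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return (pred - actual) ** 2

def qre(pred, actual):
    """Per-submission relative squared error, Eq. (14)."""
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    return ((pred - actual) / actual) ** 2

# Ensemble averages over the test set, as used in Fig. 8:
# QSE = qse(pred, actual).mean();  QRE = qre(pred, actual).mean()
```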
Figure 8 shows QSE and QRE as functions of t_i, together with their respective standard deviations. As earlier, t_i is measured from the time a video appears in the recent list, or from when a story is promoted to the front page of Digg. QSE, the squared error, is indeed smallest for the LN model on Digg stories in the beginning; afterwards the differences between the three models become modest. This is expected, since the LN model optimizes for the LSE objective function, which is equivalent to QSE up to a constant factor. Youtube videos, however, do not show remarkable differences between any of the three models. A further difference between Digg and Youtube is that QSE shows considerable dispersion for Youtube videos over the whole time axis, as can be seen from the large values of the standard deviation (the shaded areas in Fig. 8). This is understandable, however, if we consider that the popularity of Digg news saturates much earlier than that of Youtube videos, as will be studied in more detail in the following section.

Considering further Fig. 8 (b) and (d), we can observe that the relative expected error QRE decreases very rapidly for Digg (after 12 hours it is already negligible), while the predictions converge more slowly to the actual value in the case of Youtube. Here, however, the CS model outperforms the other two on both portals, again as a consequence of fine-tuning the model to minimize the objective function RSE. It is also apparent that the variation of the prediction error among submissions is much smaller than in the case of QSE, and that the standard deviation of QRE is approximately proportional to QRE itself. The explanation for this is that the noise fluctuations around the expected average as described by Eq.
(1) are additive on a logarithmic scale, which means that taking the ratio of a predicted and an actual popularity as in QRE translates into a difference on the logarithmic scale of popularities. The difference of the logarithms is commensurate with the noise term in Eq. (1); it thus stays bounded in QRE, whereas it is amplified multiplicatively in QSE. In conclusion, for relative error measures the CS model should be chosen, while for absolute measures the LN model is a good choice.

5. SATURATION OF THE POPULARITY

[Figure 8: The performance of the different prediction models, measured by two error functions as defined in the text: the absolute squared error QSE [(a) and (c)] and the relative squared error QRE [(b) and (d)], respectively. (a) and (b) show the results for Digg, (c) and (d) for Youtube. The shaded areas indicate one standard deviation of the individual submission errors around the average.]

[Figure 9: The relative squared error shown as a function of the percentage of the final popularity of submissions on day 30. The standard deviations of the errors are indicated by the shaded areas.]

[Figure 10: Average normalized popularities of submissions for Youtube and Digg, normalized by the popularity at day 30. The inset shows the same for the first 48 digg hours of Digg submissions.]
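The argument above, that multiplicative (log-additive) noise keeps QRE bounded while letting QSE grow with the popularity scale, can be illustrated with a small simulation. This is a sketch with invented parameters (the noise width `sigma` and the scales are arbitrary, not fit to the paper's data):

```python
import numpy as np

def errors_at_scale(scale, sigma=0.3, n=100_000, seed=0):
    """Draw log-normally distributed 'actual' popularities around a
    noise-free level `scale`, predict that level, and return the average
    QSE and QRE of the prediction."""
    rng = np.random.default_rng(seed)
    actual = scale * np.exp(sigma * rng.standard_normal(n))
    pred = np.full(n, float(scale))
    avg_qse = np.mean((pred - actual) ** 2)
    avg_qre = np.mean(((pred - actual) / actual) ** 2)
    return avg_qse, avg_qre

# Comparing two popularity magnitudes: QSE grows with the square of the
# scale, while QRE is scale-independent.
qse_lo, qre_lo = errors_at_scale(1e2)
qse_hi, qre_hi = errors_at_scale(1e4)
```

With the same noise realization, `qre_lo` and `qre_hi` coincide, while `qse_hi / qse_lo` equals the squared ratio of the scales, mirroring the dispersion patterns seen in Fig. 8.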
Here we discuss how the trends in the growth of popularity over time differ between Youtube and Digg, and how this generally affects the predictions. As seen in the previous section, the predictions converge to their respective reference values much faster for Digg articles than for Youtube videos, and the explanation can be found when we consider how the popularity of submissions approaches the reference values. In Fig. 9 we show an analogous interpretation of QRE, but instead of plotting the error against time, we plot it as a function of the actual popularity, expressed as the fraction of the reference value N_s(t_r). The plots are averages over all content in the test set, and over times t_i in hourly increments up to t_r. This makes the predictions across Youtube and Digg comparable, since we eliminate the effect of the different time dynamics that the visitors, idiosyncratic to the two portals, impose on content popularity: the popularity of Digg submissions initially grows much faster, but quickly saturates to a constant value, while Youtube videos keep getting views constantly (Fig. 10). As Fig. 9 shows, the average error QRE for Digg articles converges to 0 as we approach the reference time, with variations in the error staying relatively small. On the other hand, the same error measure does not decrease monotonically for Youtube videos until very close to the reference, which means that the growth of the popularity of videos still shows considerable fluctuations near the 30th day, too, when the popularity is already almost as large as the reference value. This fact is further illustrated by Fig. 10, where we show the average normalized popularities for all submissions.
This is calculated by dividing the popularity counts of individual submissions by their reference popularities on day 30, and averaging the resulting normalized functions over all content. An important difference apparent in the figure is that while Digg stories saturate fairly quickly (in about one day) to their respective reference popularities, Youtube videos keep getting views all throughout their lifetime (at least throughout the data collection period, although the trendline is expected to continue almost linearly). The rate at which videos keep getting views may naturally differ among videos: videos that are less popular in the beginning are likely to maintain a slow pace over longer time scales, too. It is thus not surprising that the fluctuations around the average are not suppressed for videos as they age (compare with Fig. 9). We also note that the normalized growth curves shown in Fig. 10 are exactly P(t_i, t_r) of Eq. (11) with t_r = 30 days.

The mechanism that gives rise to these two markedly different behaviors is a consequence of the different ways in which users find content on the two portals. On Digg, articles become obsolete fairly quickly, since they most often refer to breaking news, fleeting Internet fads, or technology-related stories that naturally interest people only for a limited period of time. Videos on Youtube, however, are mostly found through search, since, due to the sheer number of videos uploaded constantly, it is not possible to match Digg's way of giving exposure to each promoted story on a front page (except for featured videos, but here we did not consider those separately).
The faster initial rise of the popularity of videos can be explained by their exposure on the "recently added" tab of Youtube; after they leave that section of the site, the only way to find them is through keyword search or when they are displayed as related videos alongside another video being watched. This explains why the predictions converge faster for Digg stories than for Youtube videos (10% accuracy is reached within about 2 hours on Digg vs. 10 days on Youtube): the popularities of Digg submissions do not change considerably after 2 days.

6. CONCLUSIONS AND RELATED WORK

In this paper we presented a method, and its experimental verification, for predicting the popularity of (user contributed) content very soon after a submission has been made, by measuring its popularity at an early time. A strong linear correlation was found between the logarithmically transformed popularities at early and later times, with the residual noise on this transformed scale being normally distributed. Building on this linear correlation, we presented three models for making predictions about future popularity, and compared their performance on Youtube videos and Digg story submissions. The multiplicative nature of the noise term allowed us to show that the accuracy of the predictions will exhibit a large dispersion around the average if a direct squared error measure is chosen, while if we take relative errors the dispersion is considerably smaller. An important consequence is that, when the error of the prediction is estimated in community portals, absolute error measures should be avoided in favor of relative measures.

We mention two scenarios where predictions of individual content can be used: advertising and content ranking.
If the popularity count is tied to advertising revenue, such as the revenue resulting from advertisement impressions shown beside a video, that revenue may be estimated fairly accurately, since the uncertainty of the relative errors stays acceptable. However, when the popularities of different content are compared to each other, as is commonly done when ranking and presenting the most popular content to users, precisely forecasting the ordering of the top items is expected to be more difficult due to the large dispersion of the popularity count errors.

We based the predictions of future popularities only on values measurable in the present, and did not consider the semantics of popularity or why some submissions become more popular than others. We believe that in the presence of a large user base, predictions can essentially be made from observed early time series, and that semantic analysis of content is more useful when no early clickthrough information is known for the content. Furthermore, we argue for the generality of performing maximum likelihood estimates for the model parameters in light of a large amount of experimental information, since in this case Bayesian inference and maximum likelihood methods yield essentially the same estimates [14].

There are several areas that we could not explore here. It would be interesting to extend the analysis by focusing on different sections of the Web 2.0 portals, such as how the "news & politics" category differs from the "entertainment" section on Youtube, since we expect that news videos reach obsolescence sooner than videos that are recurringly searched for over a long time. It also remains to be seen whether it is possible to forecast a Digg submission's popularity when the diggs come from only a small number of users whose voting history is known, as is the case for stories in the upcoming section of Digg.
In related work, video-on-demand systems and properties of media files on the Web have been studied in detail, statistically characterizing video content in terms of length, rank, and comments [6, 1, 19]. Video characteristics and user access frequencies are studied together when streaming media workload is estimated [11, 7, 13, 24]. User participation and content rating have also been modeled in Digg, with particular emphasis on the social network and the upcoming phase of stories [18]. Activity fluctuations, user commenting behavior prediction, the ensuing social network, and community moderation structure are the focus of studies on Slashdot [15, 12, 17], a portal that is similar in spirit to Digg. The prediction of user clickthrough rates as a function of document and search engine result ranking order overlaps with this paper [4, 2]. While the display ordering of submissions plays a less important role for the predictions presented here, Dupret et al. studied, with a Bayesian network model, the effect of a document's position in a list on its selection probability, which becomes important when static content is predicted [10]; online ad clickthrough rate prediction is also a related area [20].

7. REFERENCES

[1] S. Acharya, B. Smith, and P. Parnes. Characterizing User Access To Videos On The World Wide Web. In Proc. SPIE, 2000.
[2] E. Agichtein, E. Brill, S. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 3–10, New York, NY, USA, 2006. ACM.
[3] Alexa Web Information Service, http://www.alexa.com.
[4] K. Ali and M. Scarr. Robust methodologies for modeling web click distributions. In WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 511–520, New York, NY, USA, 2007. ACM.
[5] M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, and S. Moon. I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system. In IMC '07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pages 1–14, New York, NY, USA, 2007. ACM.
[6] X. Cheng, C. Dale, and J. Liu. Understanding the characteristics of internet short video sharing: Youtube as a case study, 2007, arXiv:0707.3670v1.
[7] M. Chesire, A. Wolman, G. M. Voelker, and H. M. Levy. Measurement and analysis of a streaming-media workload. In USITS '01: Proceedings of the 3rd conference on USENIX Symposium on Internet Technologies and Systems, pages 1–1, Berkeley, CA, USA, 2001. USENIX Association.
[8] Digg application programming interface, http://apidoc.digg.com/.
[9] N. Duan. Smearing estimate: A nonparametric retransformation method. Journal of the American Statistical Association, 78(383):605–610, 1983.
[10] G. Dupret, B. Piwowarski, C. A. Hurtado, and M. Mendoza. A statistical model of query log generation. In SPIRE, pages 217–228, 2006.
[11] P. Gill, M. Arlitt, Z. Li, and A. Mahanti. Youtube traffic characterization: a view from the edge. In IMC '07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pages 15–28, New York, NY, USA, 2007. ACM.
[12] V. Gómez, A. Kaltenbrunner, and V. López. Statistical analysis of the social network and discussion threads in Slashdot. In WWW '08: Proceeding of the 17th international conference on World Wide Web, pages 645–654, New York, NY, USA, 2008. ACM.
[13] M. J. Halvey and M. T. Keane. Exploring social dynamics in online media sharing. In WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 1273–1274, New York, NY, USA, 2007. ACM.
[14] J. Higgins. Bayesian inference and the optimality of maximum likelihood estimation. International Statistical Review, 45:9–11, 1977.
[15] A. Kaltenbrunner, V. Gomez, and V. Lopez. Description and prediction of Slashdot activity. In LA-WEB '07: Proceedings of the 2007 Latin American Web Conference, pages 57–66, Washington, DC, USA, 2007. IEEE Computer Society.
[16] M. Kim and R. C. Hill. The Box-Cox transformation-of-variables in regression. Empirical Economics, 18(2):307–19, 1993.
[17] C. Lampe and P. Resnick. Slash(dot) and burn: distributed moderation in a large online conversation space. In CHI '04: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 543–550, New York, NY, USA, 2004. ACM.
[18] K. Lerman. Social information processing in news aggregation. IEEE Internet Computing: special issue on Social Search, 11(6):16–28, November 2007.
[19] M. Li, M. Claypool, R. Kinicki, and J. Nichols. Characteristics of streaming media stored on the web. ACM Trans. Internet Technol., 5(4):601–626, 2005.
[20] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. In WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 521–530, New York, NY, USA, 2007. ACM.
[21] J. M. Wooldridge. Some alternatives to the Box-Cox regression model. International Economic Review, 33(4):935–55, November 1992.
[22] F. Wu and B. A. Huberman. Novelty and collective attention. Proceedings of the National Academy of Sciences, 104(45):17599–17601, November 2007.
[23] Youtube application programming interface, http://code.google.com/apis/youtube/overview.html.
[24] H. Yu, D. Zheng, B. Y. Zhao, and W. Zheng. Understanding user behavior in large-scale video-on-demand systems. SIGOPS Oper. Syst. Rev., 40(4):333–344, 2006.