Predictability of social interactions
Authors: Kevin S. Xu
Kevin S. Xu ⋆
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA
xukevin@umich.edu

Abstract. The ability to predict social interactions between people has profound applications including targeted marketing and prediction of information diffusion and disease propagation. Previous work has shown that the location of an individual at any given time is highly predictable. This study examines the predictability of social interactions between people to determine whether interaction patterns are similarly predictable. I find that the locations and times of interactions for an individual are highly predictable; however, the other person the individual interacts with is less predictable. Furthermore, I show that knowledge of the locations and times of interactions has almost no effect on the predictability of the other person. Finally I demonstrate that a simple Markov chain model is able to achieve close to the upper bound in terms of predicting the next person with whom a given individual will interact.

Keywords: predictability, social, interaction, human dynamics, entropy

1 Introduction

One of the most important questions in the emerging field of human dynamics concerns predictability: to what extent is human behavior predictable, and how does predictability vary across the population? Recent technological advances have led to the development of wearable human sensors, which are capable of continuously collecting data on an individual's movement, activities, and interactions among other features. These sensors could allow us to make predictions about many aspects of human behavior including social interactions.
The ability to make such predictions has profound applications, ranging from targeted marketing using an individual's social network to predicting how diseases transmitted through human contact propagate over time.

This study is motivated by previous work on the predictability of human mobility. Using cell phone data from 50,000 individuals, Song et al. [5] studied the predictability of individuals' locations using the closest cellular tower each time an individual used his phone. To capture the predictability of an individual's location over time, the authors estimated the entropy rate of the time series of his locations and found that the real uncertainty in an individual's location at any given time is fewer than two locations! Furthermore, there was found to be surprisingly little variability in the estimated entropy rate among the population, suggesting that individuals' locations over time are, in general, highly predictable.

⋆ Current affiliation: 3M Corporate Research Laboratory, St. Paul, MN, USA

The main question behind this study is as follows: to what extent are individuals' social interactions predictable? I investigate the extent to which three aspects of an individual's interactions are predictable: the physical location and time of an interaction and the other person with whom the individual interacts. Similar to Song et al. [5], I estimate the entropy rate to capture predictability. In addition, I use a Markov chain model for social interaction to evaluate actual prediction performance on two real data sets.

2 Methodology

2.1 Entropy rates

To capture the predictability of a time series, I utilize the entropy rate. First we have the notion of entropy, which measures the amount of uncertainty in a random variable.
The entropy of a single random variable $X$ is defined as $H(X) = -\sum_i p(x_i) \log_2 p(x_i)$, where the summation is over all possible outcomes $\{x_i\}$, and $p(x_i)$ denotes the probability of outcome $x_i$. For two random variables $(X, Y)$, we have the notions of joint entropy $H(X, Y)$ and conditional entropy $H(X \mid Y)$. The joint entropy measures the uncertainty associated with both random variables, while the conditional entropy measures the uncertainty in one random variable given that the value of the other random variable has been observed. The joint and conditional entropies are related through the equation $H(X \mid Y) = H(X, Y) - H(Y)$.

The entropy rate was first introduced by Shannon [4] and generalizes the notion of entropy to sequences of dependent random variables. For a stationary stochastic process $X = \{X_i\}$, the entropy rate is defined as

$$H(X) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \lim_{n \to \infty} H(X_n \mid X_1, X_2, \ldots, X_{n-1}) \tag{1}$$

where the first equality holds for all stochastic processes, but the second requires stationarity of the process. The quantity on the right side of (1) leads to the interpretation of entropy rate as the uncertainty in a quantity at time $n$ having observed the complete history. The entropy rate denotes the average per-variable entropy of each random variable in the stochastic process. Joint and conditional entropy rates can similarly be defined. In this study, I use the entropy rate to characterize the average uncertainty of a quantity at any given time.

2.2 Lempel-Ziv complexities

To calculate the entropy of a random variable $X$, one needs to know the probability of each possible outcome $p(x_i)$. When these probabilities are not known, one can estimate the entropy by replacing the probabilities with relative frequencies from observed data.
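As an illustration of the relative-frequency estimate just described, the following minimal sketch computes the entropy, joint entropy, and conditional entropy of observed samples. The function names and the toy data are hypothetical and chosen only for this example; they are not taken from the study.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Plug-in entropy in bits: relative frequencies stand in for p(x_i)."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def joint_entropy(xs, ys):
    """Entropy of the ordered pairs (x, y), i.e. an estimate of H(X, Y)."""
    return entropy(list(zip(xs, ys)))


def conditional_entropy(xs, ys):
    """Estimate of H(X | Y) via the identity H(X | Y) = H(X, Y) - H(Y)."""
    return joint_entropy(xs, ys) - entropy(ys)


if __name__ == "__main__":
    # Toy data (illustrative only): x is fully determined by y.
    xs = ["a", "b", "a", "b"]
    ys = [0, 1, 0, 1]
    print(entropy(xs))
    print(conditional_entropy(xs, ys))
```

Because the toy `xs` is fully determined by `ys`, observing `ys` removes all uncertainty about `xs`, which is the behavior the conditional entropy is meant to capture.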
Estimating the entropy rate of a stochastic process is more involved because the random variables are, in general, dependent on one another. A suitable estimator of the entropy rate for general stationary stochastic processes is the Lempel-Ziv complexity. Similar to Song et al. [5], I use the following Lempel-Ziv complexity to estimate the entropy rate of a time series:

$$\hat{H}(X) = \frac{n \log_2 n}{\sum_i \Lambda_i}, \tag{2}$$

where $n$ denotes the length of the time series, and $\Lambda_i$ denotes the length of the shortest substring starting from time $i$ that has not yet been observed prior to time $i$, i.e. from times $1$ to $i - 1$. It is known that for stationary ergodic processes, $\hat{H}(X)$ converges to the entropy rate $H(X)$ almost surely [3] as $n \to \infty$. To estimate the joint entropy rate, one can extend (2) to two time series, with $\Lambda_i$ denoting the length of the shortest substring of ordered pairs from both time series. This joint Lempel-Ziv complexity $\hat{H}(X, Y)$ also converges to the joint entropy rate as $n \to \infty$ [6]. I obtain an estimate of the conditional entropy rate using the conditional Lempel-Ziv complexity $\hat{H}(X \mid Y) = \hat{H}(X, Y) - \hat{H}(Y)$.

3 Results

I investigate the predictability of social interactions on two data sets. The first is the Reality Mining data [2], which provides location (via nearest cellular tower) and interaction (via Bluetooth proximity) data for 94 individuals at 5-minute intervals over a year. The second is the Friends and Family data [1], which provides only interaction (via Bluetooth proximity) data for 146 individuals at 6-minute intervals over 9 months. Similar to Song et al. [5], I compare the estimated entropy rate $\hat{H}$ (using the Lempel-Ziv complexity) with the estimated entropy rates of an iid sequence with the same marginal probabilities as the observed sequence $\hat{H}_{\mathrm{iid}}$ and an iid sequence of uniformly likely outcomes $\hat{H}_{\mathrm{unif}}$.

I begin with the Reality Mining data. The estimated entropy rates for the locations of interactions are shown in Fig. 1 (left). The mean of $\hat{H}^{\mathrm{loc}}$ is about 1.1, indicating that the actual uncertainty in the location of an interaction is about $2^{1.1} = 2.1$ locations. This is similar to the finding of Song et al. [5] that the estimated entropy rate of an individual's location peaks at about 0.8. Thus I conclude that the locations of an individual's interactions are highly predictable. The estimated entropy rate $\hat{H}^{\mathrm{loc}}$ is much lower than the iid entropy rate $\hat{H}^{\mathrm{loc}}_{\mathrm{iid}}$, indicating that the temporal sequence is highly dependent. The estimated entropy rates for the times between interactions are shown in Fig. 1 (right). Similar to locations, the times of interactions are also highly predictable.

The estimated entropy rates for the person an individual interacts with are shown in Fig. 2 (left). Unlike with locations and times, the mean of $\hat{H}^{\mathrm{pers}}$ is about 3.1, suggesting that the actual uncertainty is about $2^{3.1} = 8.5$ individuals. Thus it appears that the person an individual interacts with is significantly less predictable than the location or time! The estimated entropy rate $\hat{H}^{\mathrm{pers}}$ is still much lower than the iid entropy rate $\hat{H}^{\mathrm{pers}}_{\mathrm{iid}}$, so some temporal dependency is still present in the time series.

[Figure 1: two panels plotting estimated probability density against estimated entropy rate. Left: $\hat{H}^{\mathrm{loc}}$, $\hat{H}^{\mathrm{loc}}_{\mathrm{iid}}$, $\hat{H}^{\mathrm{loc}}_{\mathrm{unif}}$. Right: $\hat{H}^{\mathrm{time}}$, $\hat{H}^{\mathrm{time}}_{\mathrm{iid}}$, $\hat{H}^{\mathrm{time}}_{\mathrm{unif}}$.]

Fig. 1. Distributions of the estimated entropy rate $\hat{H}$, iid entropy rate $\hat{H}_{\mathrm{iid}}$, and uniform entropy rate $\hat{H}_{\mathrm{unif}}$ for locations (left) and times (right) of interactions. The low $\hat{H}^{\mathrm{loc}}$ and $\hat{H}^{\mathrm{time}}$ indicate that locations and times of interactions are highly predictable.

[Figure 2: two panels plotting estimated probability density against estimated entropy rate. Left: $\hat{H}^{\mathrm{pers}}$, $\hat{H}^{\mathrm{pers}}_{\mathrm{iid}}$, $\hat{H}^{\mathrm{pers}}_{\mathrm{unif}}$. Right: $\hat{H}^{\mathrm{pers}}$, $\hat{H}^{\mathrm{pers}}_{\mathrm{loc}}$, $\hat{H}^{\mathrm{pers}}_{\mathrm{time}}$.]

Fig. 2. Left: Distributions of the estimated entropy rate $\hat{H}^{\mathrm{pers}}$, iid entropy rate $\hat{H}^{\mathrm{pers}}_{\mathrm{iid}}$, and uniform entropy rate $\hat{H}^{\mathrm{pers}}_{\mathrm{unif}}$. $\hat{H}^{\mathrm{pers}}$ is much higher than $\hat{H}^{\mathrm{loc}}$ and $\hat{H}^{\mathrm{time}}$ (see Fig. 1), indicating that the person an individual interacts with is much less predictable than location or time. Right: Distributions of the estimated entropy rate $\hat{H}^{\mathrm{pers}}$ and conditional entropy rates given locations $\hat{H}^{\mathrm{pers}}_{\mathrm{loc}}$ and times $\hat{H}^{\mathrm{pers}}_{\mathrm{time}}$. There is little difference between the three distributions, indicating that knowledge of times and locations does not provide any significant benefit in predicting the person an individual interacts with.

Perhaps the person an individual interacts with may be more predictable if one is given the locations or times of interactions. The predictability given this additional information is captured by the conditional entropy rate, which I estimate using the conditional Lempel-Ziv complexity $\hat{H}(X \mid Y) = \hat{H}(X, Y) - \hat{H}(Y)$. Somewhat surprisingly, I find that the estimated conditional entropy rates given locations $\hat{H}^{\mathrm{pers}}_{\mathrm{loc}}$ or times $\hat{H}^{\mathrm{pers}}_{\mathrm{time}}$ do not differ significantly from the estimated unconditional entropy rates, as shown in Fig. 2 (right).¹ Thus I conclude that knowing the locations or times does not add much predictive value when trying to predict the person with whom an individual interacts.
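The Lempel-Ziv estimate in (2) and the conditional estimate used above can be sketched as follows. This is a brute-force illustration under stated assumptions, not the code used in the study: the function names are hypothetical, a match is required to lie entirely within the history before time $i$, and when every substring up to the end of the series has already been seen, $\Lambda_i$ is taken to be one more than the remaining length.

```python
from math import log2


def _contains(history, sub):
    """True if `sub` occurs as a contiguous run inside `history`."""
    m = len(sub)
    return any(history[j:j + m] == sub for j in range(len(history) - m + 1))


def lz_entropy_rate(series):
    """Estimate the entropy rate in bits as n * log2(n) / sum_i Lambda_i."""
    seq = list(series)
    n = len(seq)
    total = 0
    for i in range(n):
        length = 1
        # Grow the substring starting at i until it is new with respect to seq[:i].
        # Assumption: if the end of the series is reached first, Lambda_i = n - i + 1.
        while i + length <= n and _contains(seq[:i], seq[i:i + length]):
            length += 1
        total += length
    return n * log2(n) / total


def lz_conditional_entropy_rate(xs, ys):
    """Estimate H(X | Y) as the joint estimate on ordered pairs minus the estimate for Y."""
    return lz_entropy_rate(list(zip(xs, ys))) - lz_entropy_rate(ys)


if __name__ == "__main__":
    periodic = [0, 1] * 20  # highly regular toy sequence
    print(lz_entropy_rate(periodic))
```

The substring search here is cubic in the series length in the worst case, which is acceptable for a short illustration but would likely need a faster matching structure for year-long series at 5-minute resolution.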
¹ Note that the true conditional entropy rate $H(X \mid Y)$ can never exceed the true unconditional entropy rate $H(X)$, but this is not necessarily true for the estimated conditional and unconditional entropy rates due to finite sample size.

[Figure 3: two panels. Left: directed graph of Markov chain states for a selected individual. Right: estimated probability density against estimated entropy rate for $\hat{H}^{\mathrm{pers}}$ and $\hat{H}^{\mathrm{pers}}_{\mathrm{MC}}$.]

Fig. 3. Left: Graphical representation of Markov chain state transition matrix for a selected individual. Edge width is proportional to transition probability. Right: Distributions of the estimated entropy rate $\hat{H}^{\mathrm{pers}}$ and Markov chain entropy rate $\hat{H}^{\mathrm{pers}}_{\mathrm{MC}}$. The estimated entropy rates of the Markov chains are only slightly higher than the rates of the actual sequences, suggesting that the Markov chain can achieve close to the upper bound for predicting the person an individual will interact with next.

Entropy rates provide only an upper bound on predictability. I now consider the problem of actually modeling the sequence of people an individual interacts with over time. A simple model consists of a Markov chain, which assumes that the person an individual interacts with next depends only on the person she is currently interacting with (or the last person she interacted with, if she is not currently interacting with anyone). I learn a stationary Markov chain for each individual, such as the one pictured in Fig. 3 (left), and estimate the entropy rates of these chains by substituting relative frequencies for probabilities. As shown in Fig. 3 (right), the estimated entropy rates of the Markov chains $\hat{H}^{\mathrm{pers}}_{\mathrm{MC}}$ are only slightly higher² than those of the actual sequences $\hat{H}^{\mathrm{pers}}$. Specifically the mean $\hat{H}^{\mathrm{pers}}_{\mathrm{MC}}$ is 3.2 compared to the mean $\hat{H}^{\mathrm{pers}}$ of 3.1.
This suggests that the Markov chain is able to achieve close to the upper bound for predicting the next person an individual interacts with!

To measure how well the Markov chain works in practice, I learn the model on the first week of data and attempt to predict the most likely (top-1) and 5 most likely (top-5) people an individual will interact with next for each interaction in the second week. I then update the model based on the interactions in the second week and repeat the process until I reach the end of the data. Overall, the predictions from the Markov chain achieved a top-1 accuracy of 19% and top-5 accuracy of 49%. These results, while not spectacular, do appear to be reasonable given that I previously found the uncertainty to be about eight people.

I repeated the previous experiments on the Friends and Family data set. Location data is not available, but all of my findings from the Reality Mining data not involving location hold also for the Friends and Family data. The person an individual interacts with is slightly more predictable, with the mean $\hat{H}^{\mathrm{pers}} = 2.3$ corresponding to uncertainty of about $2^{2.3} = 5.0$ people. The learned Markov chain model achieves mean $\hat{H}^{\mathrm{pers}}_{\mathrm{MC}} = 2.7$, which is again close to the mean $\hat{H}^{\mathrm{pers}}$, although the larger gap compared to the Reality Mining data suggests that the effects of higher-order dependencies are stronger in the Friends and Family data. The predictions from the Markov chain achieved top-1 and top-5 accuracies of 21% and 59%, respectively, which are also higher than in the Reality Mining data and agree with the lower entropy rate of the Friends and Family data.

² The true entropy rate can never exceed the Markov chain entropy rate, but this is not necessarily true for the estimated rates, again, due to finite sample size.
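The Markov chain procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the study: all function names are hypothetical, the data are assumed to arrive as a list of weekly blocks of person identifiers, the state distribution is approximated by the relative frequency of each "from" state, and transitions across block boundaries are ignored.

```python
from collections import Counter, defaultdict
from math import log2


def fit_counts(sequence):
    """Count transitions from each person to the person interacted with next."""
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return counts


def markov_entropy_rate(counts):
    """Entropy rate in bits: sum over states of P(state) * H(next | state),
    with relative frequencies substituted for all probabilities."""
    total = sum(sum(row.values()) for row in counts.values())
    rate = 0.0
    for row in counts.values():
        row_total = sum(row.values())
        row_entropy = -sum((c / row_total) * log2(c / row_total)
                           for c in row.values())
        rate += (row_total / total) * row_entropy
    return rate


def top_k(counts, current, k):
    """The k most frequent successors of `current` (empty list if unseen)."""
    return [person for person, _ in counts.get(current, Counter()).most_common(k)]


def rolling_top_k_accuracy(blocks, k):
    """Learn on the first block, predict every transition in the next block,
    then fold that block into the model and repeat until the data end."""
    counts = fit_counts(blocks[0])
    hits = trials = 0
    for block in blocks[1:]:
        pairs = list(zip(block, block[1:]))
        for current, nxt in pairs:
            hits += nxt in top_k(counts, current, k)
            trials += 1
        for current, nxt in pairs:  # update only after the block is scored
            counts[current][nxt] += 1
    return hits / trials if trials else float("nan")


if __name__ == "__main__":
    weeks = [list("ABABAC"), list("ABACAB")]  # toy identifiers, illustrative only
    print(markov_entropy_rate(fit_counts(weeks[0])))
    print(rolling_top_k_accuracy(weeks, k=1))
```

Scoring each block before updating the counts keeps the evaluation out of sample, which mirrors the week-by-week update described in the text.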
4 Conclusions

This study examined the predictability of social interactions, an important question in the emerging area of human dynamics. My main findings are threefold:

1. The locations and times of interactions for an individual are highly predictable, but not the other person with whom the individual interacts.
2. Even if the locations and times of interactions are known, there is almost no effect on the predictability of the other person.
3. A simple Markov chain model achieves close to the upper bound for predicting the next person with whom an individual will interact.

I believe these findings have several key implications. Being able to predict the next person an individual will interact with could allow for indirect targeted marketing through this person. However, I found that there is significant uncertainty in who the next person is (roughly five to eight people), suggesting that one may need to target a group of people rather than a single person. On a more positive note, the simplicity of the Markov chain model enables us to perform rigorous mathematical analyses that would not be possible with more complicated models. From the findings of this study, I believe that such a model is appropriate for making predictions about dynamic processes over social networks such as information diffusion and disease propagation.

Bibliography

[1] Aharony, N., Pan, W., Ip, C., Khayal, I., Pentland, A.: Social fMRI: Investigating and shaping social mechanisms in the real world. Pervasive Mob. Comput. 7(6), 643–659 (2011)
[2] Eagle, N., Pentland, A.: Reality mining: sensing complex social systems. Pers. Ubiquitous Comput. 10(4), 255–268 (2006)
[3] Kontoyiannis, I., Algoet, P.H., Suhov, Y.M., Wyner, A.J.: Nonparametric entropy estimation for stationary processes and random fields, with applications to English text. IEEE Trans. Inf. Theory 44(3), 1319–1327 (1998)
[4] Shannon, C.E.: A mathematical theory of communication. Bell Sys. Tech. J. 27(3), 379–423 (1948)
[5] Song, C., Qu, Z., Blumm, N., Barabási, A.L.: Limits of predictability in human mobility. Science 327(5968), 1018–21 (2010)
[6] Zozor, S., Ravier, P., Buttelli, O.: On Lempel-Ziv complexity for multidimensional data analysis. Phys. A 345(1-2), 285–302 (2005)