Predictability of social interactions

Kevin S. Xu ⋆

Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA
xukevin@umich.edu

Abstract. The ability to predict social interactions between people has profound applications including targeted marketing and prediction of information diffusion and disease propagation. Previous work has shown that the location of an individual at any given time is highly predictable. This study examines the predictability of social interactions between people to determine whether interaction patterns are similarly predictable. I find that the locations and times of interactions for an individual are highly predictable; however, the other person the individual interacts with is less predictable. Furthermore, I show that knowledge of the locations and times of interactions has almost no effect on the predictability of the other person. Finally, I demonstrate that a simple Markov chain model is able to achieve close to the upper bound in terms of predicting the next person with whom a given individual will interact.

Keywords: predictability, social, interaction, human dynamics, entropy

1 Introduction

One of the most important questions in the emerging field of human dynamics concerns predictability: to what extent is human behavior predictable, and how does predictability vary across the population? Recent technological advances have led to the development of wearable human sensors, which are capable of continuously collecting data on an individual's movement, activities, and interactions among other features. These sensors could allow us to make predictions about many aspects of human behavior including social interactions.
The ability to make such predictions has profound applications, ranging from targeted marketing using an individual's social network to predicting how diseases transmitted through human contact propagate over time.

This study is motivated by previous work on the predictability of human mobility. Using cell phone data from 50,000 individuals, Song et al. [5] studied the predictability of individuals' locations using the closest cellular tower each time an individual used his phone. To capture the predictability of an individual's location over time, the authors estimated the entropy rate of the time series of his locations and found that the real uncertainty in an individual's location at any given time is fewer than two locations! Furthermore, there was found to be surprisingly little variability in the estimated entropy rate among the population, suggesting that individuals' locations over time are, in general, highly predictable.

The main question behind this study is as follows: to what extent are individuals' social interactions predictable? I investigate the extent to which three aspects of an individual's interactions are predictable: the physical location and time of an interaction and the other person with whom the individual interacts. Similar to Song et al. [5], I estimate the entropy rate to capture predictability. In addition, I use a Markov chain model for social interaction to evaluate actual prediction performance on two real data sets.

⋆ Current affiliation: 3M Corporate Research Laboratory, St. Paul, MN, USA

2 Methodology

2.1 Entropy rates

To capture the predictability of a time series, I utilize the entropy rate. First we have the notion of entropy, which measures the amount of uncertainty in a random variable.
The entropy of a single random variable $X$ is defined as $H(X) = -\sum_i p(x_i) \log_2 p(x_i)$, where the summation is over all possible outcomes $\{x_i\}$, and $p(x_i)$ denotes the probability of outcome $x_i$. For two random variables $(X, Y)$, we have the notions of joint entropy $H(X, Y)$ and conditional entropy $H(X | Y)$. The joint entropy measures the uncertainty associated with both random variables, while the conditional entropy measures the uncertainty in one random variable given that the value of the other random variable has been observed. The joint and conditional entropies are related through the equation $H(X | Y) = H(X, Y) - H(Y)$.

The entropy rate was first introduced by Shannon [4] and generalizes the notion of entropy to sequences of dependent random variables. For a stationary stochastic process $X = \{X_i\}$, the entropy rate is defined as

$$H(X) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \lim_{n \to \infty} H(X_n | X_1, X_2, \ldots, X_{n-1}) \tag{1}$$

where the first equality holds for all stochastic processes, but the second requires stationarity of the process. The quantity on the right side of (1) leads to the interpretation of entropy rate as the uncertainty in a quantity at time $n$ having observed the complete history. The entropy rate denotes the average per-variable entropy of each random variable in the stochastic process. Joint and conditional entropy rates can similarly be defined. In this study, I use the entropy rate to characterize the average uncertainty of a quantity at any given time.

2.2 Lempel-Ziv complexities

To calculate the entropy of a random variable $X$, one needs to know the probability of each possible outcome $p(x_i)$. When these probabilities are not known, one can estimate the entropy by replacing the probabilities with relative frequencies from observed data.
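The relative-frequency (plug-in) estimate described above can be sketched in a few lines of Python. This is only an illustration: the function names `plug_in_entropy` and `plug_in_conditional_entropy` are hypothetical and do not come from the paper, and the conditional version simply applies the identity $H(X | Y) = H(X, Y) - H(Y)$ to paired observations.

```python
from collections import Counter
from math import log2


def plug_in_entropy(samples):
    """Entropy in bits, with relative frequencies standing in for p(x_i)."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one observation")
    counts = Counter(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def plug_in_conditional_entropy(xs, ys):
    """Estimate H(X | Y) as H(X, Y) - H(Y) from paired observations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    joint = list(zip(xs, ys))  # each ordered pair is one joint outcome
    return plug_in_entropy(joint) - plug_in_entropy(ys)
```

For example, a sequence that alternates evenly between two outcomes gives an estimate of one bit, and observing a variable that fully determines the other drives the conditional estimate to zero. Note that this treats observations as independent draws, which is why it is not sufficient for the entropy rate of a dependent sequence.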
Estimating the entropy rate of a stochastic process is more involved because the random variables are, in general, dependent on one another. A suitable estimator of the entropy rate for general stationary stochastic processes is the Lempel-Ziv complexity. Similar to Song et al. [5], I use the following Lempel-Ziv complexity to estimate the entropy rate of a time series:

$$\hat{H}(X) = \frac{n \log_2 n}{\sum_i \Lambda_i}, \tag{2}$$

where $n$ denotes the length of the time series, and $\Lambda_i$ denotes the length of the shortest substring starting from time $i$ that has not yet been observed prior to time $i$, i.e. from times $1$ to $i - 1$. It is known that for stationary ergodic processes, $\hat{H}(X)$ converges to the entropy rate $H(X)$ almost surely [3] as $n \to \infty$. To estimate the joint entropy rate, one can extend (2) to two time series, with $\Lambda_i$ denoting the length of the shortest substring of ordered pairs from both time series. This joint Lempel-Ziv complexity $\hat{H}(X, Y)$ also converges to the joint entropy rate as $n \to \infty$ [6]. I obtain an estimate of the conditional entropy rate using the conditional Lempel-Ziv complexity $\hat{H}(X | Y) = \hat{H}(X, Y) - \hat{H}(Y)$.

3 Results

I investigate the predictability of social interactions on two data sets. The first is the Reality Mining data [2], which provides location (via nearest cellular tower) and interaction (via Bluetooth proximity) data for 94 individuals at 5-minute intervals over a year. The second is the Friends and Family data [1], which provides only interaction (via Bluetooth proximity) data for 146 individuals at 6-minute intervals over 9 months. Similar to Song et al. [5], I compare the estimated entropy rate $\hat{H}$ (using the Lempel-Ziv complexity) with the estimated entropy rates of an iid sequence with the same marginal probabilities as the observed sequence, $\hat{H}_{\mathrm{iid}}$, and of an iid sequence of uniformly likely outcomes, $\hat{H}_{\mathrm{unif}}$.

I begin with the Reality Mining data. The estimated entropy rates for the locations of interactions are shown in Fig. 1 (left). The mean of $\hat{H}^{\mathrm{loc}}$ is about 1.1, indicating that the actual uncertainty in the location of an interaction is about $2^{1.1} \approx 2.1$ locations. This is similar to the finding of Song et al. [5] that the estimated entropy rate of an individual's location peaks at about 0.8. Thus I conclude that the locations of an individual's interactions are highly predictable. The estimated entropy rate $\hat{H}^{\mathrm{loc}}$ is much lower than the iid entropy rate $\hat{H}^{\mathrm{loc}}_{\mathrm{iid}}$, indicating that the temporal sequence is highly dependent. The estimated entropy rates for the times between interactions are shown in Fig. 1 (right). Similar to locations, the times of interactions are also highly predictable.

[Figure 1: two panels plotting estimated probability density against estimated entropy rate.]
Fig. 1. Distributions of the estimated entropy rate $\hat{H}$, iid entropy rate $\hat{H}_{\mathrm{iid}}$, and uniform entropy rate $\hat{H}_{\mathrm{unif}}$ for locations (left) and times (right) of interactions. The low $\hat{H}^{\mathrm{loc}}$ and $\hat{H}^{\mathrm{time}}$ indicate that locations and times of interactions are highly predictable.

The estimated entropy rates for the person an individual interacts with are shown in Fig. 2 (left). Unlike with locations and times, the mean of $\hat{H}^{\mathrm{pers}}$ is about 3.1, suggesting that the actual uncertainty is about $2^{3.1} \approx 8.5$ individuals. Thus it appears that the person an individual interacts with is significantly less predictable than the location or time! The estimated entropy rate $\hat{H}^{\mathrm{pers}}$ is still much lower than the iid entropy rate $\hat{H}^{\mathrm{pers}}_{\mathrm{iid}}$, so some temporal dependency is still present in the time series.

[Figure 2: two panels plotting estimated probability density against estimated entropy rate.]
Fig. 2. Left: Distributions of the estimated entropy rate $\hat{H}^{\mathrm{pers}}$, iid entropy rate $\hat{H}^{\mathrm{pers}}_{\mathrm{iid}}$, and uniform entropy rate $\hat{H}^{\mathrm{pers}}_{\mathrm{unif}}$. $\hat{H}^{\mathrm{pers}}$ is much higher than $\hat{H}^{\mathrm{loc}}$ and $\hat{H}^{\mathrm{time}}$ (see Fig. 1), indicating that the person an individual interacts with is much less predictable than location or time. Right: Distributions of the estimated entropy rate $\hat{H}^{\mathrm{pers}}$ and conditional entropy rates given locations $\hat{H}^{\mathrm{pers}}_{\mathrm{loc}}$ and times $\hat{H}^{\mathrm{pers}}_{\mathrm{time}}$. There is little difference between the three distributions, indicating that knowledge of times and locations does not provide any significant benefit in predicting the person an individual interacts with.

Perhaps the person an individual interacts with may be more predictable if one is given the locations or times of interactions. The predictability given this additional information is captured by the conditional entropy rate, which I estimate using the conditional Lempel-Ziv complexity $\hat{H}(X | Y) = \hat{H}(X, Y) - \hat{H}(Y)$. Somewhat surprisingly, I find that the estimated conditional entropy rates given locations $\hat{H}^{\mathrm{pers}}_{\mathrm{loc}}$ or times $\hat{H}^{\mathrm{pers}}_{\mathrm{time}}$ do not differ significantly from the estimated unconditional entropy rates, as shown in Fig. 2 (right)¹. Thus I conclude that knowing the locations or times does not add much predictive value when trying to predict the person with whom an individual interacts.
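The Lempel-Ziv estimate in (2), together with its joint and conditional variants, can be sketched as follows. This is a naive, unoptimized illustration rather than the code used in the study. Two details are assumptions: a substring counts as previously observed only if it lies entirely within times $1$ to $i - 1$, and when the remainder of the series has been seen before, $\Lambda_i$ is set to the remaining length plus one. All function names are hypothetical.

```python
from math import log2


def _occurs(window, history):
    """True if `window` appears as a contiguous run inside `history`."""
    k = len(window)
    return any(history[j:j + k] == window for j in range(len(history) - k + 1))


def shortest_unseen_lengths(seq):
    """Lambda_i for each start i: shortest substring not seen before time i."""
    seq = tuple(seq)
    n = len(seq)
    lengths = []
    for i in range(n):
        history = seq[:i]
        k = 1
        # Grow the window while it still matches something in the history.
        # Assumption: if we run off the end, Lambda_i = remaining length + 1.
        while i + k <= n and _occurs(seq[i:i + k], history):
            k += 1
        lengths.append(k)
    return lengths


def lz_entropy_rate(seq):
    """Estimate (2): n * log2(n) / sum_i Lambda_i, in bits per symbol."""
    n = len(seq)
    if n < 2:
        raise ValueError("need at least two observations")
    return n * log2(n) / sum(shortest_unseen_lengths(seq))


def lz_joint_entropy_rate(xs, ys):
    """Joint estimate: apply (2) to the series of ordered pairs (x_i, y_i)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    return lz_entropy_rate(list(zip(xs, ys)))


def lz_conditional_entropy_rate(xs, ys):
    """Conditional estimate: H^(X | Y) = H^(X, Y) - H^(Y)."""
    return lz_joint_entropy_rate(xs, ys) - lz_entropy_rate(ys)
```

The search is roughly cubic in the series length, so a practical implementation would likely use a suffix structure instead. As the footnote on finite sample size notes, the conditional estimate is a difference of two estimates and can exceed the unconditional one on short series.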
¹ Note that the true conditional entropy rate $H(X | Y)$ can never exceed the true unconditional entropy rate $H(X)$, but this is not necessarily true for the estimated conditional and unconditional entropy rates due to finite sample size.

Entropy rates provide only an upper bound on predictability. I now consider the problem of actually modeling the sequence of people an individual interacts with over time. A simple model consists of a Markov chain, which assumes that the person an individual interacts with next depends only on the person she is currently interacting with (or the last person she interacted with, if she is not currently interacting with anyone). I learn a stationary Markov chain for each individual, such as the one pictured in Fig. 3 (left), and estimate the entropy rates of these chains by substituting relative frequencies for probabilities. As shown in Fig. 3 (right), the estimated entropy rates of the Markov chains $\hat{H}^{\mathrm{pers}}_{\mathrm{MC}}$ are only slightly higher² than those of the actual sequences $\hat{H}^{\mathrm{pers}}$. Specifically, the mean $\hat{H}^{\mathrm{pers}}_{\mathrm{MC}}$ is 3.2 compared to the mean $\hat{H}^{\mathrm{pers}}$ of 3.1. This suggests that the Markov chain is able to achieve close to the upper bound for predicting the next person an individual interacts with!

² The true entropy rate is never higher than the Markov chain entropy rate, but this is not necessarily true for the estimated rates, again, due to finite sample size.

[Figure 3: left panel, a graph of Markov chain states; right panel, estimated probability density against estimated entropy rate.]
Fig. 3. Left: Graphical representation of Markov chain state transition matrix for a selected individual. Edge width is proportional to transition probability. Right: Distributions of the estimated entropy rate $\hat{H}^{\mathrm{pers}}$ and Markov chain entropy rate $\hat{H}^{\mathrm{pers}}_{\mathrm{MC}}$. The estimated entropy rates of the Markov chains are only slightly higher than the rates of the actual sequences, suggesting that the Markov chain can achieve close to the upper bound for predicting the person an individual will interact with next.

To measure how well the Markov chain works in practice, I learn the model on the first week of data and attempt to predict the most likely (top-1) and 5 most likely (top-5) people an individual will interact with next for each interaction in the second week. I then update the model based on the interactions in the second week and repeat the process until I reach the end of the data. Overall, the predictions from the Markov chain achieved a top-1 accuracy of 19% and top-5 accuracy of 49%. These results, while not spectacular, do appear to be reasonable given that I previously found the uncertainty to be about eight people.

I repeated the previous experiments on the Friends and Family data set. Location data is not available, but all of my findings from the Reality Mining data not involving location hold also for the Friends and Family data. The person an individual interacts with is slightly more predictable, with the mean $\hat{H}^{\mathrm{pers}} = 2.3$ corresponding to uncertainty of about $2^{2.3} \approx 5.0$ people. The learned Markov chain model achieves mean $\hat{H}^{\mathrm{pers}}_{\mathrm{MC}} = 2.7$, which is again close to the mean $\hat{H}^{\mathrm{pers}}$, although the larger gap compared to the Reality Mining data suggests that the effects of higher-order dependencies are stronger in the Friends and Family data. The predictions from the Markov chain achieved top-1 and top-5 accuracies of 21% and 59%, respectively, which are also higher than in the Reality Mining data and agree with the lower entropy rate of the Friends and Family data.
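The Markov chain experiments above can be illustrated with a short sketch. It is written under several assumptions that the text does not spell out: the stationary distribution is approximated by how often each person appears as the source of a transition, each week is a non-empty list of person identifiers in time order, and ties among equally frequent successors are broken by insertion order. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def fit_transitions(seq, counts=None):
    """Add first-order transition counts from `seq` to `counts` and return it."""
    if counts is None:
        counts = defaultdict(Counter)
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1
    return counts


def markov_entropy_rate(counts):
    """-sum_i pi_i sum_j P_ij log2 P_ij, with relative frequencies plugged in."""
    total = sum(sum(row.values()) for row in counts.values())
    h = 0.0
    for row in counts.values():
        row_total = sum(row.values())
        for c in row.values():
            # c / total estimates pi_i * P_ij; c / row_total estimates P_ij.
            h -= (c / total) * log2(c / row_total)
    return h


def top_k(counts, current, k):
    """The k most frequent successors of `current` seen so far."""
    return [person for person, _ in counts.get(current, Counter()).most_common(k)]


def weekly_top_k_accuracy(weeks, k):
    """Train on week 1, predict each later week, then fold that week in."""
    counts = fit_transitions(weeks[0])
    hits = trials = 0
    prev_week = weeks[0]
    for week in weeks[1:]:
        current = prev_week[-1]  # last person seen before this week starts
        for nxt in week:
            hits += nxt in top_k(counts, current, k)
            trials += 1
            current = nxt
        # Update the model with this week, including the cross-week transition.
        fit_transitions([prev_week[-1]] + list(week), counts)
        prev_week = week
    return hits / trials
```

Calling `weekly_top_k_accuracy(weeks, 1)` and `weekly_top_k_accuracy(weeks, 5)` on one individual's data would give that person's top-1 and top-5 accuracies under these assumptions; how the study aggregated accuracies across individuals is not stated, so that step is left out.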
4 Conclusions

This study examined the predictability of social interactions, an important question in the emerging area of human dynamics. My main findings are threefold:

1. The locations and times of interactions for an individual are highly predictable, but not the other person with whom the individual interacts.
2. Even if the locations and times of interactions are known, there is almost no effect on the predictability of the other person.
3. A simple Markov chain model achieves close to the upper bound for predicting the next person with whom an individual will interact.

I believe these findings have several key implications. Being able to predict the next person an individual will interact with could allow for indirect targeted marketing through this person. However, I found that there is significant uncertainty in who the next person is (roughly five to eight people), suggesting that one may need to target a group of people rather than a single person. On a more positive note, the simplicity of the Markov chain model enables us to perform rigorous mathematical analyses that would not be possible with more complicated models. From the findings of this study, I believe that such a model is appropriate for making predictions about dynamic processes over social networks such as information diffusion and disease propagation.

Bibliography

[1] Aharony, N., Pan, W., Ip, C., Khayal, I., Pentland, A.: Social fMRI: Investigating and shaping social mechanisms in the real world. Pervasive Mob. Comput. 7(6), 643–659 (2011)
[2] Eagle, N., Pentland, A.: Reality mining: sensing complex social systems. Pers. Ubiquitous Comput. 10(4), 255–268 (2006)
[3] Kontoyiannis, I., Algoet, P.H., Suhov, Y.M., Wyner, A.J.: Nonparametric entropy estimation for stationary processes and random fields, with applications to English text. IEEE Trans. Inf. Theory 44(3), 1319–1327 (1998)
[4] Shannon, C.E.: A mathematical theory of communication. Bell Sys. Tech. J. 27(3), 379–423 (1948)
[5] Song, C., Qu, Z., Blumm, N., Barabási, A.L.: Limits of predictability in human mobility. Science 327(5968), 1018–21 (2010)
[6] Zozor, S., Ravier, P., Buttelli, O.: On Lempel-Ziv complexity for multidimensional data analysis. Phys. A 345(1-2), 285–302 (2005)
