Unveiling Spatial Epidemiology of HIV with Mobile Phone Data
An increasing amount of geo-referenced mobile phone data enables the identification of behavioral patterns, habits and movements of people. With this data, we can extract the knowledge potentially useful for many applications including the one tackle…
Authors: Sanja Brdar, Katarina Gavric, Dubravko Culibrk
Sanja Brdar 1,*,+, Katarina Gavrić 2,+, Dubravko Ćulibrk 1,3, and Vladimir Crnojević 1
1 Faculty of Technical Sciences, University of Novi Sad, Novi Sad, 21000, Serbia
2 Institute of Computer Science, University of Heidelberg, Heidelberg, 69117, Germany
3 Department of Information Engineering and Computer Science, University of Trento, Trento, 38122, Italy
* corresponding author: brdars@uns.ac.rs
+ these authors contributed equally to this work

ABSTRACT

An increasing amount of geo-referenced mobile phone data enables the identification of behavioral patterns, habits and movements of people. With this data, we can extract knowledge potentially useful for many applications, including the one tackled in this study: understanding the spatial variation of epidemics. We explored the datasets collected by a cell phone service provider and linked them to spatial HIV prevalence rates estimated from publicly available surveys. For that purpose, 224 features were extracted from mobility and connectivity traces and related to the level of the HIV epidemic in 50 Ivory Coast departments. By means of regression models, we evaluated the predictive ability of the extracted features. Several models predicted HIV prevalence values that are highly correlated (> 0.7) with actual values. Through contribution analysis we identified key elements that impact the rate of infections. Our findings indicate that night connectivity and activity, the spatial area covered by users and overall migrations are strongly linked to HIV. By visualizing the communication and mobility flows, we strove to explain the spatial structure of epidemics. We discovered that strong ties and hubs in communication and mobility align with HIV hot spots.

Introduction

HIV has a devastating social, demographic, and economic effect on Africa [1,2]. With 3.7% of the population infected [3], Ivory Coast has the highest prevalence rate in West Africa and a generalized epidemic [4,5]. This type of epidemic, in which the disease spreads beyond the risk groups and affects the general population, demands the development of national HIV-prevention plans. Although the prevalence rate appears to have remained relatively stable over the past decade, and is even decreasing due to prevention of mother-to-child transmission, there is still much work to be done to improve the health system and enable a more effective response to HIV. A deeper understanding of the epidemic can help find ways to suppress HIV further, and modern technologies that deal with human mobility phenomena may help respond to that challenge.

Mobile phone communication engendered the era of big data by creating huge amounts of call detail records (CDRs). Cell phone service providers collect these records whenever a phone is used to send a text message or make a call. These records contain the time of the action and the identifiers (IDs) of the sender, the receiver and the cell towers used to communicate. In this way, mobile phones provide approximate spatio-temporal localization of users and create an immense resource for the analysis of human mobility and behavioral patterns [6-8]. In a burst of new applications built on mobile phone data [9], we emphasize those of great practical importance such as urban planning [10], disaster management [11], transportation mode inference [12], traffic engineering [13], deriving poverty indicators [14] and crime prediction [15]. Currently, there is a growing interest in the mining of mobile phone data for epidemiological purposes [16,17]. Such mining can advance research in epidemiology by shedding light on relationships between disease distribution, spread and incidence on one side, and migrations, everyday movements and connectivity of people on the other.
Up to now, only a few studies have used mobile phone data to quantify those relationships based on real disease distribution data. Wesolowski and co-workers explored the impact of human mobility on the spread of malaria [18]. They analyzed CDR data collected by a mobile phone service provider in Kenya over a period of one year and discovered how human mobility patterns contribute to the spread of the disease beyond what would be possible if it were transferred only by insects. Another study, carried out by Martinez et al. [19], investigated the effect of government alerts during the H1N1 flu outbreak in Mexico on the diameter of the mobility of individuals. Bengtsson and co-workers [11] estimated population movements from a cholera outbreak area and suggested using the information obtained for disease surveillance and for prioritizing relief assistance. Those pioneering works usher in the emerging field of digital epidemiology [20]. To the best of our knowledge, the study we describe here is the first attempt to use mobile phone data to explore the complex structure of HIV epidemics.

Significant scientific effort is aimed at identifying the driving factors of HIV spread. The most frequently mentioned are poverty, social instability and violence, high mobility, rapid urbanization and modernization. The differences among these factors could help explain the spatial disparity observed in prevalence rates. Messina et al. examined geographic patterns of HIV prevalence in the Democratic Republic of Congo [21]. They showed that two spatial factors, the prevalence level within a 25 km range and the distance to urban areas, are strongly connected to the risk of HIV infection. The impact of migration on the spread of HIV in South Africa has been studied in [22], where the authors developed a mathematical model to compare the effects of migration and associated risk behavior.
In the early stage of an epidemic, migration impacts HIV progression by linking geographical areas of low and high risk. In the later stage, the impact is mainly through the increase in high-risk sexual behavior. However, the migration in that study was quantified through surveys, in which the participants were questioned about their movement history, and the study included only two migration destinations, limiting both the extent of the study and the quality of the data used. Nowadays, when overwhelming amounts of mobile phone data exist, providing insight into the movements and activity of millions of people over large areas, we can try to utilize them for new studies of the epidemiology of HIV.

In the study described here, we conducted a comprehensive analysis of two data sets offered within the Data for Development (D4D) Challenge [23]. Our research was guided by the following hypothesis: the risk of HIV infection is associated with spatial and behavioral factors that can be detected from the collection of data available. We were particularly interested in tracking population movements and inferring the strength of communication between departments of Ivory Coast with different prevalence rates.

Results

Spatial distribution of HIV

To determine the health status of a population, Demographic and Health Surveys (DHS) periodically organizes surveys to gather relevant data, focusing on specific countries. In our study we used the DHS data collected in Ivory Coast during the 2012 campaign [3]. Based on these measurements, DHS provides estimates of HIV prevalence at the sub-national level, but with a low spatial resolution, determined by 10 administrative regions (Fig. 1 (a)). Estimates of HIV prevalence range from 2.2 to 5.1% and reveal the spatial variability in the distribution of HIV-infected people across the country.
Due to initiatives to examine the spatial heterogeneity of HIV further [24], new methods emerged, aiming to provide HIV estimates at a finer resolution. An approach that employs kernel estimation based on spatial DHS measurements, with an additional adjustment to UNAIDS data, made estimates for 50 departments of Ivory Coast available (see Methods). After redistributing disease frequencies across 50 departments, the HIV prevalence map (Fig. 1 (b)) shows higher spatial variability (from 0.6 to 5.7%) in the disease distribution. We can notice the hot spots of the epidemic: departments severely hit by HIV. The map also enables us to explore links between the connectivity and mobility patterns derived from D4D data and HIV prevalence with increased spatial resolution. Although the quality of HIV estimates (imposed by DHS measurement sampling) at the department level varies from good and moderate to uncertain, the data has the highest spatial resolution currently available for studying the HIV epidemic in Ivory Coast.

Figure 1. (a) HIV prevalence rate by administrative regions, DHS data. (b) HIV prevalence rate by departments for the 15-49 year-old population; estimated values range between 0.6 and 5.7%.

Communication and mobility patterns

Social interactions and mobility mediate the spread of infectious diseases [17,25,26]. When examined in a spatio-temporal context, they can uncover how a disease propagates and ultimately explain the variability in the prevalence distribution. To understand the spatial epidemiology of HIV in Ivory Coast better, we analyzed the collective communication and mobility connections at the level of departments. We estimated pairwise connections among sub-prefectures by measuring communication and mobility flows. To accomplish that, we explored the "antenna-to-antenna data" (SET1) and the "long term individual trajectories" (SET3) D4D datasets [23].

SET1 provided us with insight into the communication flow between each pair of antennas on an hourly basis. The strength of the communication flow is expressed through the number of calls. We assigned each antenna to its corresponding department and then aggregated the number of calls at the department level over a 5-month observation period. SET3 shed light on the mobility of people, providing the geographic location of users while they used their phones to make calls or send messages. Since records in SET3 contain the user ID, the location at sub-prefecture resolution and time stamps indicating when the phone was used, we were able to use them to estimate the location of each user's home. Based on the most frequent location, we assigned each user to his/her home department. Then we counted the user's movements from home to other locations over the entire 5-month observation period and aggregated users' movements at the department level.

In the pairwise communication and mobility matrices obtained in this way, we identified strong ties for each department, which represent links to other departments with a connection strength higher than the average (see Methods). Before searching for the strong ties, we normalized the matrices by the corresponding population sizes. SET1 encompasses 5 million users. We distributed them into departments using population frequencies provided by Afripop data [27], and used the per-department populations obtained to normalize the communication flows. To normalize the migration flows, we used estimates based on the derived home locations of the users to calculate the required population size per department. Each communication or mobility flow was normalized by the corresponding population size of the originating department. The overall flow between two departments was then quantified as the sum of the normalized flows in both directions.
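The normalization and strong-tie steps described above can be sketched as follows. This is a minimal illustration, not the procedure from the Methods section: the function name `strong_ties` is hypothetical, and the threshold is assumed to be the mean of each department's own links to the other departments, since the exact definition of "the average" is given only in Methods.

```python
import numpy as np

def strong_ties(flow, population):
    """Return the symmetric normalized flow matrix and per-department strong ties.

    flow[i, j]    -- calls (or movements) from department i to department j
    population[i] -- population size of department i
    """
    flow = np.asarray(flow, dtype=float)
    population = np.asarray(population, dtype=float)

    # Normalize each flow by the population of its originating department.
    normalized = flow / population[:, None]

    # Overall flow between two departments: sum of both directions.
    overall = normalized + normalized.T
    np.fill_diagonal(overall, 0.0)  # within-department flow is not a tie

    n = overall.shape[0]
    ties = {}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Assumption: "average" means the mean over department i's own links.
        threshold = overall[i, others].mean()
        ties[i] = [j for j in others if overall[i, j] > threshold]
    return overall, ties
```

Called with a 50 x 50 department-level matrix, the function returns a dictionary that maps each department index to the indices of its strong ties.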
This normalization enabled us to eliminate the bias caused by the different population sizes when identifying the strong links.

The strong ties discovered in communication flows are shown in Fig. 2 (a). This visualization further emphasizes the strongest links, and communication hubs emerge. Remarkably, the hubs correspond to HIV hot spots, and we can also notice that larger hubs have higher prevalence rates. Additionally, we visualized the night communication, constrained to the time interval between 1 AM and 5 AM, and obtained a similar structure of the connectivity graph (Fig. 2 (b)).

Figure 2. Strong connectivity ties for (a) overall communication (b) night communication. The hubs are labeled with the corresponding HIV prevalence rate shown in Fig. 1 (b). Link thickness and color, ranging from yellow to red, are proportional to the strength of communication flow.

The links correspond to relative rather than absolute flow, which we obtained by dividing the flow by the maximum value of flow in the set of strong ties. In both graphs we can notice that departments in the northern part of the country have weaker links, and this may explain why they have lower HIV prevalence.

The strong ties discovered in mobility flows (Fig. 3 (a)) have an obvious localized character. They connect departments that are geographically close, but, on a global scale, we can also observe strong migratory pathways. One connects the two largest hubs: the largest city, Abidjan (5.1% prevalence rate), and the capital city, Yamoussoukro (3.1% prevalence rate). From the center of the country we can notice strong pathways to the region in the West (3.6% prevalence rate, Fig. 1 (a)) and the North-central region (4.0% prevalence rate, Fig. 1 (a)). The East-central region, with a prevalence rate of 4.0%, is strongly connected to Abidjan. The map of the mobility flows revealed the pathways that connect regions with higher prevalence.
In addition to the observed general mobility of users, we explored long-term mobility. We measured how long users stay at their destinations and, in our migration analysis, considered only those stays that lasted longer than 3 days. The strong ties discovered in long-term mobility flows are shown in Fig. 3 (b). The connectivity graph obtained reveals how long-term migrations link departments that are further away. Interestingly, Abidjan emerged as the most prominent hub for those migrations. In this light, we can denote this city, with the largest prevalence rate and high connectivity, as a driver of the epidemic in Ivory Coast. As such, Abidjan needs careful monitoring of mobility flows, especially the high-risk longer-term mobilities, in order to prioritize interventions and control the further spread of HIV.

Figure 3. Strong mobility ties discovered through summarizing (a) all mobilities (b) mobilities with 3 days or longer spent at the destination. The hubs are labeled with the corresponding HIV prevalence rates shown in Fig. 1 (b). The link thickness and color, ranging from yellow to red, are proportional to the strength of mobility flow.

Extracted features

For each department of Ivory Coast, numerous features were extracted during the course of the study, with the goal of quantifying behavioral and mobility patterns potentially relevant to the measured HIV prevalence rate. Overall, we extracted 224 different features and grouped them into 4 categories: connectivity, spatial, migration and activity (phone use).

The connectivity features were obtained from the SET1 data. The communication flow in SET1 is expressed through the number of calls and their duration. Using the information on the originating and terminating antenna, for each department, we aggregated its inner, originating, terminating and overall communication.
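As an illustration of this aggregation, the sketch below derives the four quantities from a department-level call matrix. The function name is hypothetical, and two choices are assumptions rather than details given in the text: originating and terminating counts exclude calls that stay inside the department, and overall communication is the sum of the three parts.

```python
import numpy as np

def communication_totals(calls):
    """Aggregate a directed call matrix into per-department totals.

    calls[i, j] -- number of calls originating in department i
                   and terminating in department j
    """
    calls = np.asarray(calls, dtype=float)
    inner = np.diag(calls).copy()            # calls that stay inside the department
    originating = calls.sum(axis=1) - inner  # outgoing calls to other departments
    terminating = calls.sum(axis=0) - inner  # incoming calls from other departments
    overall = inner + originating + terminating
    return {
        "inner": inner,
        "originating": originating,
        "terminating": terminating,
        "overall": overall,
    }
```

Applying the same function to matrices restricted to a given time slot or type of day would yield the slot-specific variants of these features.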
The overall communication was further separated based on the type of day and time of day. We considered two types of days, weekdays and weekends, and used 1-hour time slots (00-01h, 01-02h, ..., 23-24h) and 8-hour time slots (00-08h, 08-16h, 16-24h) to express the time within a day. For each of these discrete intervals, the features related to the number of calls represent the cumulative sum over the whole five-month observation period. Once extracted, they were normalized by the corresponding department population size, estimated based on Afripop data [27] and rescaled to fit the 5 million users monitored in our data set. Features related to the duration of calls represent average values. Overall, 120 connectivity features related to different time slots and types of days were extracted: half describe the number of calls and half describe the average duration of calls.

Spatial, migration and activity features were derived from SET3 data. To craft spatial features we explored the positions and the distribution of locations visited by users. We measured the radius of gyration, the area and the perimeter of the convex hull of users' movements, as well as the diameter of their range [28-30]. The features were derived both for all locations visited by a user and for specific subsets of locations: those visited at night, on weekdays, on weekends, and on weekday and weekend nights. In addition, we calculated the total distance traveled by each user. In total, 25 spatial features were created, representing 95th percentile values across users matched to departments based on their home location. Interestingly, we first considered averaged instead of 95th percentile values for users in the corresponding departments, but predictive models achieve better results when spatial features capture only the top five percent of users; i.e. the patterns of users that cover larger regions through their mobility have higher predictive power for the prevalence of HIV.

To extract migration features we tracked the changes in locations. Every time a user changed department, we added a single migration link from his/her home to the observed department. We summarized all movements into a pairwise migration matrix by iterating this procedure over all users. Besides quantifying all movements, we also identified those where users were away from home for more than a defined number of days (1, 2, ..., 10) to explore longer-term migrations. The features were divided further according to the direction of the mobility into "in" or "out" migration, bringing their total number to 22.

The activity features were extracted similarly to the connectivity features. However, in SET3, we cannot distinguish the direction of communication (in or out), nor do we have the duration of communication. Therefore, we refer to those features simply as activity, since they can count only when and where users were active. As with the connectivity features, we considered two types of days: weekdays and weekends. The time of day was again considered in 1-hour time slots, 8-hour time slots and whole days. The total number of activity features used was 57.

All the features capture the cumulative effect of human connectivity or mobility observed over a five-month period. We focused on this long-term perspective in our feature extraction in order to understand the spatial distribution of HIV prevalence better.

Predictive models

HIV prevalence rates across the departments of Ivory Coast range from 0.6 to 5.7%. Each of the 50 departments was represented with a vector of extracted feature values and the corresponding prevalence rate. In this feature space, we built regression models and evaluated their performance when predicting a department's prevalence rate.
All features were normalized by dividing each feature by its mean value across the whole data set before regression was attempted. Experiments were conducted using two different regression methods: Ridge [31] and support vector regression (SVR) [32]. The regression models were initially built using the four different groups of features separately. In order to select smaller subsets of the most relevant features, both regression methods were subjected to the recursive feature elimination (RFE) method [33]. In the final stage, we considered an ensemble approach, stacked regression [34], through which we fused the 4 heterogeneous feature sets, building a single integrated prediction model.

The prediction of disease levels needs careful evaluation [35] in order to avoid situations in which models built on randomly generated data work comparably well to those created on possibly meaningful data. Therefore, to estimate the predictive capacity of a model, we measured the prediction errors and the correlations between the predicted and actual values for the models built on real data and for the same models built on random data sets, obtained by randomly permuting the values of each feature. Experiments were divided into two parts: the first focused on the 15 departments with good and moderate estimates of HIV prevalence, while in the second we used data for all 50 departments. In Tables 1 and 2, we report the correlation coefficients (ρ) and relative root mean square errors (RRMSE) produced by the models during leave-one-out (LOO) cross-validation, for the two experimental setups (15 and 50 departments). LOO evaluation enabled us to select the best model among those we built. On the subsample of 15 departments, the models built with SVR and RFE perform best. SVR models surpassed Ridge, and reducing the size of the feature set with RFE improved the performance of both, but the SVR method benefited more from the RFE procedure than Ridge.
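The evaluation protocol described above (mean normalization, SVR with RFE, LOO cross-validation, and a permuted-feature baseline) might be sketched with scikit-learn as follows. Several details are assumptions and not taken from the paper: a linear kernel (which RFE needs in order to rank features by coefficient), default SVR hyperparameters, feature selection repeated inside each LOO fold, and RRMSE computed as the RMSE divided by the mean of the actual values. The helper names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.svm import SVR

def loo_scores(X, y, n_features=6):
    """Leave-one-out correlation and relative RMSE for SVR with RFE."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X = X / X.mean(axis=0)  # divide each feature by its mean over the data set

    # Linear-kernel SVR exposes coef_, which RFE uses to rank features.
    model = RFE(SVR(kernel="linear"), n_features_to_select=n_features)
    predicted = cross_val_predict(model, X, y, cv=LeaveOneOut())

    rho = np.corrcoef(predicted, y)[0, 1]
    # Assumed definition: RMSE relative to the mean of the actual values.
    rrmse = np.sqrt(np.mean((predicted - y) ** 2)) / y.mean()
    return rho, rrmse

def permute_features(X, rng):
    """Random baseline: permute the values of every feature independently."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([rng.permutation(column) for column in X.T])
```

Comparing `loo_scores(X, y)` with `loo_scores(permute_features(X, rng), y)` for a seeded `numpy.random.default_rng` generator mirrors the real-versus-random comparison reported in the tables.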
The highest correlation coefficient (0.753) between the predicted and actual values is achieved with SVR on a reduced set of the 6 most relevant spatial features. The lowest error of 0.287 is reached by combining regressors learned on different sets of features. Through the linear combination of the four models, the ensemble approach predicts HIV prevalence values that are well correlated with the actual ones (ρ = 0.710). All models built on the real features outperformed their random counterparts.

Table 1. Evaluation of predictive models on good and moderate HIV estimates - Correlation coefficient (Relative Root Mean Square Error): ρ (RRMSE)

Features                       Ridge            Ridge+RFE        SVR              SVR+RFE
Connectivity features (SET1)   0.624 (0.331)    0.626 (0.331)    0.661 (0.306)    0.669 (0.301)
Spatial features (SET3)        0.639 (0.434)    0.703 (0.376)    0.544 (0.351)    0.753 (0.294)
Migration features (SET3)      0.585 (0.369)    0.585 (0.369)    0.678 (0.307)    0.691 (0.288)
Activity features (SET3)       0.618 (0.339)    0.645 (0.325)    0.633 (0.316)    0.664 (0.302)
Ensemble                       0.610 (0.327)    0.601 (0.327)    0.659 (0.305)    0.710 (0.287)
Best Random                    -0.231 (0.511)   -0.066 (0.480)   -0.065 (0.479)   0.070 (0.441)

The second part of the experiments evaluated the proposed methods and extracted features on the full set of 50 departments, including those with uncertain estimates of HIV. Table 2 reports the results obtained. As expected, the performance declined. Predictions are moderately correlated with actual values. The best result (ρ = 0.627, RRMSE = 0.509) is achieved with the SVR model on a reduced subset of activity features. The ensemble approach that combines four SVR+RFE models results in ρ = 0.518 and RRMSE = 0.514. Still, the models created on randomly permuted features predict HIV with higher errors and without correlation with actual values and, thus, underperform those built on real features.

Table 2. Evaluation of predictive models on all HIV estimates - Correlation coefficient (Relative Root Mean Square Error): ρ (RRMSE)

Features                       Ridge            Ridge+RFE        SVR              SVR+RFE
Connectivity features (SET1)   0.467 (0.556)    0.481 (0.546)    0.501 (0.516)    0.508 (0.514)
Spatial features (SET3)        0.363 (0.540)    0.431 (0.523)    0.310 (0.552)    0.336 (0.545)
Migration features (SET3)      0.269 (0.630)    0.315 (0.613)    0.291 (0.637)    0.375 (0.599)
Activity features (SET3)       0.511 (0.542)    0.542 (0.535)    0.522 (0.537)    0.627 (0.509)
Ensemble                       0.500 (0.527)    0.543 (0.519)    0.535 (0.515)    0.518 (0.514)
Best Random                    0.020 (0.760)    0.202 (0.657)    0.139 (0.630)    0.038 (0.607)

Feature contribution

Once a regression model is built, we can use it to estimate the risk of disease in defined spatial units. Furthermore, we can examine what the model learned from the data. Model explanation techniques [36,37] can unveil black-box predictive models by estimating the contribution of each feature over the whole range of its input values. For example, we can examine how changes in an activity feature affect the HIV prevalence rate predicted by the model. The outcome is a plot of the contribution as a function of feature values. This model-explanation procedure gives us the opportunity to identify the specific features that impact the prevalence rate most and to quantify their contribution. The features identified in this manner can later be continuously measured and leveraged to monitor changes in the HIV prevalence rate and to create early warning signs of a possible increase in the infected population.

To conduct the feature contribution analysis, we used the best model (SVR+RFE) built for each set of features, since the ensemble method is just an additive combination of models built on different sets.
In the analysis we used models built on the subsample of departments (the 15 with good or moderate HIV estimates) and focused on the top 3 features, selected by running the RFE procedure until only 3 features remained. The remaining features have the highest impact on HIV prevalence prediction. For the selected features (f_{t,i}, where t denotes the set of features and i is the index of the feature in that set) we conducted contribution analysis. We calculated the contribution of each feature over the full range from its minimal to its maximal value in m equally spaced points. The contribution analysis included a randomization process that creates two instances as inputs to the regression model. The first instance is a vector in which each feature value is sampled at random from the data set t. The second instance differs in the i-th feature, which is not random but takes the one of the m previously defined values that is currently under analysis. The contribution of the feature is the difference between the outputs of the regression model produced using the first and the second instance as input. Due to the randomization, this procedure is repeated for a defined number of iterations. By averaging the results from all iterations, we obtained the final value of the contribution. In addition to this value, we also report the standard deviation of the values obtained across iterations, which provides information on the stability of the contribution and quantifies complex interactions among features. We created plots (Fig. 4) for 12 features, the top 3 for each of the four data sets, sampled in m = 12 points with contributions calculated through 100 iterations. In addition, the 12 graphs that correspond to features ranked from 4th to 6th place for each data set are provided in the supplement (Fig. S2). All graphs contain points for the mean contribution and error bars whose length is the standard deviation.
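The sampling procedure described above can be sketched as follows. The sketch assumes that `predict` is any fitted regression function accepting a 2-D array, that each feature of the random instance is drawn independently from the observed values of that feature, and that the sign convention is "prediction with the feature fixed minus prediction for the fully random instance", so that positive values correspond to higher predicted prevalence. The function name and the seed handling are illustrative only.

```python
import numpy as np

def feature_contribution(predict, X, i, m=12, n_iter=100, seed=0):
    """Mean and standard deviation of the contribution of feature i.

    predict -- callable mapping an (n_samples, n_features) array to predictions
    X       -- data set from which random instances are sampled
    i       -- index of the feature under analysis
    m       -- number of equally spaced values between the feature's min and max
    n_iter  -- number of random instances drawn per value
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    grid = np.linspace(X[:, i].min(), X[:, i].max(), m)

    means, stds = [], []
    for value in grid:
        # First instance: every feature sampled at random from its own column.
        rows = rng.integers(0, n_samples, size=(n_iter, n_features))
        random_instances = X[rows, np.arange(n_features)]
        # Second instance: identical, except feature i takes the grid value.
        fixed_instances = random_instances.copy()
        fixed_instances[:, i] = value
        # Assumed sign: positive means the value raises the prediction.
        diff = predict(fixed_instances) - predict(random_instances)
        means.append(diff.mean())
        stds.append(diff.std())
    return grid, np.array(means), np.array(stds)
```

Plotting the returned means against the grid, with the standard deviations as error bars, gives a curve of the kind described for Fig. 4.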
In these graphs, red indicates points with feature values that are associated with increased HIV prevalence, and orange indicates feature values that are associated with decreased HIV prevalence. The gray part of a graph denotes the range where the standard deviation range crosses zero, meaning that the contribution is neither strongly positive nor negative.

Figure 4. Feature contribution graphs for 12 features; top 3 features for 4 types of features. Points correspond to the mean contribution and error bars correspond to the standard deviation. Red color indicates strong association with higher HIV prevalence, and orange with lower HIV prevalence.

Contributions of the three connectivity features are presented in Fig. 4 (a), (b) and (c). The top three features represent the communication flow expressed as the number of calls per resident of a department during weekend days in the time slots 01-02 AM, 02-03 AM and 03-04 AM, over a 5-month period. We can notice that the top connectivity features are related to weekend night-time communication and all have a positive slope. A similar graph (Fig. S2) is obtained for the 5th-ranked feature, related to weekday 03-04 AM communication. According to the model, the departments with higher night-time communication have a higher prevalence rate. In further analysis of the contribution plot shown in Fig. 4 (a), values higher than 0.2 can be seen as indicators of behavior increasing the risk of infection and thus critical for HIV. For example, for the department where this feature has the maximum value, the expected HIV prevalence is 0.3 ± 0.15 higher than average. The plots for features ranked 4th and 6th (Fig. S2) refer to the average call duration during the early morning hours (06-07 AM) and contribute to HIV prevalence in a different manner.
Those graphs have a negative slope, indicating that, for departments where people have longer talks early in the morning, we can expect lower HIV prevalence. We can observe this as a social signature [38] and may hypothesize that longer talks early in the morning could be an indicator of emotionally close relationships and lower-risk behavior.

In the contribution analysis of spatial features, area and gyration stand out as features with higher impact. Area is measured over weekdays and gyration over weekday and weekend nights. The model suggests that departments where people, through their overall movements, tend to cover a larger area have a higher HIV prevalence rate (Fig. 4 (d)). This is also confirmed by the 4th-ranked feature, which measures the area covered over weekends (Fig. S2). On the contrary, gyration, a measure of the standard deviation from the mean location, negatively impacts HIV (Fig. 4 (e), (f) and also Fig. S2). It is no surprise that small gyration indicates higher HIV, since it has already been shown in other studies that shorter movements are more likely in denser urban areas [39], and those urban areas are usually more affected by HIV. Interestingly, when the area covered is tracked only during the night hours, the contribution graph has a negative slope, as it does in the case of gyration (see the graph for the 5th-ranked feature, area covered during weekday nights, Fig. S2).

The contributions of the overall in and out migration features are shown in Fig. 4 (g), (h). Both plots indicate that larger migration flows are associated with higher HIV prevalence. We can notice the strong impact of incoming migrations: for the department where this feature has the maximum value, the expected HIV prevalence is 1.0 ± 0.5 higher than the average.
Among the top three features is the one that quantifies the number of out migrations per resident of a department, with the condition of staying for more than 10 days. Its contribution plot, presented in Fig. 4 (i), shows a negative impact. The plots for features ranked between 4th and 6th place (Fig. S2) further show that out migrations with stays longer than one day have a positive slope, and those with stays longer than 5 or 9 days exhibit a negative slope. The contribution analysis of the migration features uncovers an interesting phenomenon: the overall amount of migration is linked to higher HIV prevalence, and this positive slope remains for migrations of up to a few days, but beyond that, the slope becomes negative. The slope changes once the thresholds of 4 days for out migrations and 3 days for in migrations are reached. Thus, the model suggests that the risk comes from shorter stays at host departments and higher dynamics in migrations, while longer stays are associated with lower HIV prevalence.

The contributions of the activity features, expressed through the number of calls and SMSs per resident of a department, are shown in Fig. 4 (j), (k), (l). As with the connectivity features, night-time activity is strongly linked to HIV, and higher activity implies higher prevalence rates. This is also confirmed by the contribution plots for the 4th- and 5th-ranked features, which encompass activity during weekday nights between 1 AM and 2 AM and weekend nights between 4 AM and 5 AM. On the contrary, the feature ranked 6th, which refers to early morning activity (07-08 AM), has a negative slope.

The presented contribution analysis uncovers what the trained models learned from the data. All features work in synergy to provide the prediction of HIV prevalence. Nevertheless, this method helps us to identify the subset of stronger factors.
The resulting plots can be used to create new hypotheses in epidemiology, where disease distribution and spread are concerned, and, subsequently, to quantify the risk of an increase in the prevalence of HIV.

Discussion

The use of mobile phone data, which can unveil patterns of human interactions and mobility, is gaining increased attention in epidemiology. In the study presented here, we placed the mobile phone data in the context of a generalized HIV epidemic. Raw data was processed in the search for patterns that could explain the spatial variation in disease prevalence. We discovered that strong ties and hubs in the communication align with HIV hot spots. Strong ties created by user mobility revealed pathways that connect regions with higher prevalence, and Abidjan, the city most severely affected by HIV, emerged as the center of migrations.

Next, we focused on extracting features related to the connectivity and mobility of users at the level of spatial units (departments) that could be used to predict HIV prevalence. Several regression methods were used to address that task, and the results obtained on a subset of departments for which good estimates of HIV prevalence exist are promising and can lead to the generation of new hypotheses. The initial set of 224 features was reduced using a recursive feature elimination procedure, allowing us to identify the features with the largest impact on prediction. It turned out that night-time connectivity and activity, the spatial area covered by users and overall migrations are strongly linked to HIV prevalence. Models built on spatial features (gyration, area, perimeter of convex hull, diameter and distance) exhibit high predictive power (ρ = 0.753, RRMSE = 0.294).
Future work should include a detailed analysis of the spatio-temporal dynamics of human motion in the context of primary and subsidiary habitats [40], where the first denote frequently visited locations during typical daily activities and the second capture additional travel.

The limitations of our study arise from the spatial and temporal scale of the data. On one side, HIV data is limited by the measurement strategy of DHS, UNAIDS or other relevant entities. The quality and spatial resolution of such data are determined by the sampling design, that is, the frequency and distribution of measurements. The variability in HIV prevalence across Ivory Coast is certainly higher than the one modeled at the department level, but we lacked more precise measurements to account for it better. The time resolution is even coarser: HIV measurement campaigns are organized only once every few years (for Ivory Coast in 2012, 2005 and 2001). Our findings linked aggregated behavioral patterns to HIV prevalence rates, but the discovered correlations do not imply causation. To explore causation, we would need more estimates of changes in HIV prevalence over time. This could soon be overcome by a new device that easily connects to a smartphone [41]. The device performs the ELISA test and discovers disease markers from a tiny drop of blood, taken from a finger, in just 15 minutes. This approach has a high acceptance rate among the population and will enable large-scale screening.

On the other side, the spatial resolution of mobile phone data is restricted by the distribution of the carrier's antennas, and the time resolution is conditioned by users' phone activity (calls or messages). But the major constraint on using mobile phone data is privacy [42]. Besides the mandatory user anonymization, mobile phone data are usually further spatially and/or temporally aggregated, or a part of the information is removed.
For example, the antennas are aggregated at the level of larger geographical units, time is expressed in hourly intervals, and communication graphs at the level of users are detached from any spatial information. In the D4D data sources, mobile phone data sets are temporally aggregated to one-hour time slots, with preserved spatial resolution of 1250 antennas, or spatially restricted to 255 sub-prefectures, but without time aggregation. Even with data aggregation, mobile phone data is still a considerably richer source of information than the HIV estimates, which are available across 50 departments. Only in the case of individual communication graphs (D4D SET4), where spatial information is completely removed, do we lose any chance to link it with the HIV distribution. Those communication graphs, if geographically determined, would be an immense source of information for uncovering the connectivity at a more detailed scale. If such data becomes available in a privacy-acceptable form, further progress in the domain of modeling the spread of communicable diseases [43] will be enabled.

In summary, our study showed how raw real-world data can be used for significant knowledge extraction. We believe that our work, a first attempt to link mobile phone data and HIV epidemiology, lays a foundation for further research into ways to explain the heterogeneity of HIV and build predictive tools aimed at advancing public-health campaigns and decision making for HIV interventions. Together with other "big data" approaches to HIV epidemiology [44] that rely on Twitter data [45] and social networks [46, 47], our work fits well into the wider initiative of digital epidemiology [20].

Methods

Data sources

Population data: We used the data set available on the AfriPop website, www.afripop.org, which contains full details on population distribution, summarized on the country level.
Its authors developed a new high-resolution population distribution data set for Africa and analyzed rural accessibility to population centers. Contemporary population data was combined with detailed satellite-derived settlement extents to map population distribution across Africa at a finer spatial resolution [27].

HIV data: Demographic and Health Surveys (DHS) provide data about the health status of countries. We used data collected in the survey conducted during 2011 and 2012 [3]. This data provides estimates for ten administrative regions of Ivory Coast. The results of the estimation are shown in Fig. 1(a).

D4D data: The mobile phone data sets originate from the Orange service provider in Ivory Coast. They were collected over a five-month period (December 1, 2011 - April 28, 2012) and further processed into four different D4D sets. Two of these were used in our study: SET1 and SET3. SET1 contains the antenna-to-antenna communication traffic flow of five million Orange customers aggregated to hourly intervals. Each record contains the originating and terminating antennas of calls, the number of calls and the overall duration. SET2 observes users in consecutive two-week periods, which do not significantly influence HIV transmission patterns. On the other hand, insight into long-term mobility (a 5-month observation period) is possible through SET3. Spatial resolution in this set is reduced from towers to sub-prefectures (255 spatial units). A record in this set contains the user ID, time stamp and sub-prefecture IDs. Although SET4 provides connectivity at the level of single users and could be very informative for HIV epidemiology, it lacks spatial information. User IDs cannot be related to the IDs in the second or third set, and therefore we were not able to approximate their home locations.

Estimates on HIV prevalence at the level of departments

National estimates on HIV prevalence hide the heterogeneity that exists within the country.
To unveil subnational prevalence rates, a recently proposed method, prevR [48], relies on an estimation function and DHS measurements to generate a surface of HIV prevalence. Estimations are based on Gaussian kernel density functions with adaptive bandwidths. An estimate of HIV prevalence at a spatial point $(x, y)$ is determined by Eq. 1:

$$\mathrm{prev}(x, y) = \sum_{i=1}^{n} \frac{1}{h_i^{2}} \, K\!\left(\frac{d_i}{h_i}\right) \tag{1}$$

where $n$ is the number of samples, $d_i$ is the geometrical distance between sample $i$ and point $(x, y)$, $K$ is the kernel function and $h_i$ is the bandwidth used for sample $i$. Additionally, an indicator of the quality of the estimates was assigned to each department, based on the survey sampling size [49]. Some estimates are very uncertain and should be interpreted with caution. See Supplementary Table S1 for estimated values and quality indicators.

Strong ties identification

Ties among sub-prefectures are expressed by communication and mobility flows. To categorize those connectivity ties as strong or weak, we adopted the approach from [50], where Eq. 2 is used to calculate the strength of ties:

$$s(i) = \frac{c(i)}{\frac{1}{N} \sum_{j=1}^{N} c(j)} \tag{2}$$

where $i$ is the index of a department, $s(i)$ is the strength of tie $i$, and $c(i)$ corresponds to the number of calls or mobilities to department $i$, so that the denominator is the mean of $c$ over all $N$ ties. Ties with $s(i) < 1$ are classified as weak ties, and those with $s(i) \geq 1$ as strong ones.

Ridge regression

Ridge regression is a variant of ordinary multiple linear regression whose goal is to circumvent the problem of instability arising, among other causes, from collinearity of the predictor variables. It works with the original variables and tries to minimize a penalized sum of squares. Like ordinary least squares, ridge regression includes all predictor variables, but typically with smaller coefficients, depending upon the value of the complexity parameter λ.
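To make the role of λ concrete, the following is a minimal sketch rather than the code used in this study. It assumes two centered predictors and no intercept, solves the ridge normal equations (XᵀX + λI)β = Xᵀy in closed form, and picks λ from a small candidate grid by leave-one-out validation. The function names (`ridge_fit`, `loo_mse`, `select_lambda`), the toy data and the candidate grid are illustrative assumptions, not taken from the paper.

```python
def ridge_fit(rows, ys, lam):
    """Solve (X^T X + lam*I) beta = X^T y for exactly two predictors.

    No intercept is fitted: predictors and target are assumed to be centered.
    """
    s11 = sum(a * a for a, _ in rows) + lam
    s22 = sum(b * b for _, b in rows) + lam
    s12 = sum(a * b for a, b in rows)
    b1 = sum(a * y for (a, _), y in zip(rows, ys))
    b2 = sum(b * y for (_, b), y in zip(rows, ys))
    # Determinant of the 2x2 system; it is zero when lam == 0 and the
    # two predictor columns are perfectly collinear.
    det = s11 * s22 - s12 * s12
    return ((s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det)


def loo_mse(rows, ys, lam):
    """Mean squared error when each sample is predicted by a model fitted without it."""
    total = 0.0
    for i in range(len(rows)):
        beta = ridge_fit(rows[:i] + rows[i + 1:], ys[:i] + ys[i + 1:], lam)
        pred = beta[0] * rows[i][0] + beta[1] * rows[i][1]
        total += (ys[i] - pred) ** 2
    return total / len(rows)


def select_lambda(rows, ys, candidates):
    """Return the candidate lambda with the smallest leave-one-out error."""
    return min(candidates, key=lambda lam: loo_mse(rows, ys, lam))


# Toy data (made up): the second predictor duplicates the first,
# which is the extreme case of collinearity.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
rows = [(v, v) for v in x]
ys = [-1.9, -1.2, 0.1, 0.8, 2.2]

candidates = [0.01, 0.1, 1.0, 10.0]
beta_small = ridge_fit(rows, ys, 1.0)    # mild penalty
beta_large = ridge_fit(rows, ys, 100.0)  # strong penalty, stronger shrinkage
best_lam = select_lambda(rows, ys, candidates)
```

With duplicated predictors the unpenalized system (λ = 0) is singular, whereas any λ > 0 yields a unique solution that splits the weight evenly between the two columns; increasing λ shrinks both coefficients toward zero.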
The selection of the ridge parameter λ plays an important role; it multiplies the ridge penalty and thus controls the strength of the shrinkage of coefficients toward zero [31]. The value of λ is estimated through leave-one-out validation.

Support vector regression

Support vector machines (SVM) are a set of supervised learning methods used for classification and regression analysis. A version of SVM for regression analysis is Support Vector Regression (SVR) [32]. SVR searches for the optimal regression function, but allows a tolerance margin (ε), creating a tube around the regression function where errors in predictions on training data are ignored. The method also includes a regularization parameter in the form of a cost parameter (C) that penalizes the training errors outside the tube. In our experiments we used a linear kernel and the default ε = 0.1, while the value of C was estimated through leave-one-out validation.

Recursive Feature Elimination

Recursive Feature Elimination (RFE) is a greedy method for selecting a defined number of features. It starts from the initial set of features and builds a model (in our case SVM or Ridge), assigns a weight to each feature based on the estimates from the predictive model, eliminates the lowest-ranked feature and then recursively repeats this procedure on the remaining set of features until it reaches the desired number of features. The output is a top-ranked feature subset obtained through this recursive procedure [33].

References

1. Buvé, A., Bishikwabo-Nsarhaza, K. & Mutangadura, G. The spread and effect of HIV-1 infection in sub-Saharan Africa. The Lancet 359, 2011–2017 (2002).
2. De Cock, K. M., Mbori-Ngacha, D. & Marum, E. Shadow on the continent: public health and HIV/AIDS in Africa in the 21st century. The Lancet 360, 67–72 (2002).
3. Demographic and health survey Ivory Coast 2011-12. (2013). URL www.dhsprogram.com.
4. Kalipeni, E. & Zulu, L. C.
HIV and AIDS in Africa: a geographic analysis at multiple spatial scales. GeoJournal 77, 505–523 (2012).
5. Joint United Nations Programme on HIV/AIDS - UNAIDS (2013). URL http://www.unaids.org/en/regionscountries/countries/ctedivoire/.
6. Becker, R. et al. Human mobility characterization from cellular network data. Communications of the ACM 56, 74–82 (2013).
7. Candia, J. et al. Uncovering individual and collective human dynamics from mobile phone records. Journal of Physics A: Mathematical and Theoretical 41, 224015 (2008).
8. Wesolowski, A. & Eagle, N. Parameterizing the dynamics of slums. In AAAI Spring Symposium: Artificial Intelligence for Development (2010).
9. Blondel, V. D., Decuyper, A. & Krings, G. A survey of results on mobile phone datasets analysis. arXiv preprint arXiv:1502.03406 (2015).
10. Becker, R. A. et al. A tale of one city: Using cellular network data for urban planning. IEEE Pervasive Computing 10, 18–26 (2011).
11. Bengtsson, L., Lu, X., Thorson, A., Garfield, R. & Von Schreeb, J. Improved response to disasters and outbreaks by tracking population movements with mobile phone network data: a post-earthquake geospatial study in Haiti. PLoS medicine 8, e1001083 (2011).
12. Wang, H., Calabrese, F., Di Lorenzo, G. & Ratti, C. Transportation mode inference from anonymized and aggregated mobile phone call detail records. In Intelligent Transportation Systems (ITSC), 2010 13th International IEEE Conference on, 318–323 (IEEE, 2010).
13. Caceres, N., Romero, L. M., Benitez, F. G. & Del Castillo, J. M. Traffic flow estimation models using cellular phone data. Intelligent Transportation Systems, IEEE Transactions on 13, 1430–1441 (2012).
14. Smith-Clarke, C., Mashhadi, A. & Capra, L. Poverty on the cheap: Estimating poverty maps using aggregated mobile communication networks. In Proceedings of the 32nd annual ACM conference on Human factors in computing systems, 511–520 (ACM, 2014).
15. Bogomolov, A. et al.
Once upon a crime: Towards crime prediction from demographics and mobile data. In Proceedings of the 16th International Conference on Multimodal Interaction, 427–434 (ACM, 2014).
16. Lima, A., De Domenico, M., Pejovic, V. & Musolesi, M. Exploiting cellular data for disease containment and information campaigns strategies in country-wide epidemics. arXiv preprint arXiv:1306.4534 (2013).
17. Tizzoni, M. et al. On the use of human mobility proxies for modeling epidemics. PLoS computational biology 10, e1003716 (2014).
18. Wesolowski, A. et al. Quantifying the impact of human mobility on malaria. Science 338, 267–270 (2012).
19. Frias-Martinez, V., Rubio, A. & Frias-Martinez, E. Measuring the impact of epidemic alerts on human mobility. Pervasive Urban Applications–PURBA (2012).
20. Salathe, M. et al. Digital epidemiology. PLoS computational biology 8, e1002616 (2012).
21. Messina, J. P. et al. Spatial and socio-behavioral patterns of HIV prevalence in the Democratic Republic of Congo. Social Science & Medicine 71, 1428–1435 (2010).
22. Coffee, M., Lurie, M. N. & Garnett, G. P. Modelling the impact of migration on the HIV epidemic in South Africa. AIDS 21, 343–350 (2007).
23. Blondel, V. D. et al. Data for development: the D4D challenge on mobile phone data. arXiv preprint (2012).
24. Identifying populations at greatest risk of infection – geographic hotspots and key populations. UNAIDS reference group on estimates, modeling and projections (2013). URL http://www.epidem.org/resources/.
25. Read, J. M., Eames, K. T. & Edmunds, W. J. Dynamic social networks and the implications for the spread of infectious disease. Journal of The Royal Society Interface 5, 1001–1007 (2008).
26. Belik, V., Geisel, T. & Brockmann, D. Natural human mobility patterns and spatial spread of infectious diseases. Physical Review X 1, 011001 (2011).
27. Linard, C., Gilbert, M., Snow, R. W., Noor, A. M. & Tatem, A. J.
Population distribution, settlement patterns and accessibility across Africa in 2010. PloS one 7, e31743 (2012).
28. Gonzalez, M. C., Hidalgo, C. A. & Barabasi, A.-L. Understanding individual human mobility patterns. Nature 453, 779–782 (2008).
29. Williams, N. E., Thomas, T. A., Dunbar, M., Eagle, N. & Dobra, A. Measures of human mobility using mobile phone records enhanced with GIS data. arXiv preprint arXiv:1408.5420 (2014).
30. Csáji, B. C. et al. Exploring the mobility of mobile phone users. Physica A: Statistical Mechanics and its Applications 392, 1459–1473 (2013).
31. El-Dereny, M. & Rashwan, N. Solving multicollinearity problem using ridge regression models. Int. J. Contemp. Math. Sciences 6, 585–600 (2011).
32. Gunn, S. R. et al. Support vector machines for classification and regression. ISIS technical report 14 (1998).
33. Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Machine learning 46, 389–422 (2002).
34. Breiman, L. Stacked regressions. Machine learning 24, 49–64 (1996).
35. Bodnar, T. & Salathé, M. Validating models for disease detection using Twitter. In Proceedings of the 22nd international conference on World Wide Web companion, 699–702 (International World Wide Web Conferences Steering Committee, 2013).
36. Štrumbelj, E. & Kononenko, I. A general method for visualizing and explaining black-box regression models. In Adaptive and Natural Computing Algorithms, 21–30 (Springer, 2011).
37. Štrumbelj, E. & Kononenko, I. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems 41, 647–665 (2014).
38. Saramäki, J. et al. Persistence of social signatures in human communication. Proceedings of the National Academy of Sciences 111, 942–947 (2014).
39. Noulas, A., Scellato, S., Lambiotte, R., Pontil, M. & Mascolo, C.
A tale of many cities: universal patterns in human urban mobility. PloS one 7, e37027 (2012).
40. Bagrow, J. P. & Lin, Y.-R. Mesoscopic structure and social aspects of human mobility. PloS one 7, e37676 (2012).
41. Laksanasopin, T. et al. A smartphone dongle for diagnosis of infectious diseases at the point of care. Science translational medicine 7, 273re1–273re1 (2015).
42. de Montjoye, Y.-A., Hidalgo, C. A., Verleysen, M. & Blondel, V. D. Unique in the crowd: The privacy bounds of human mobility. Scientific reports 3 (2013).
43. Bian, L. Spatial approaches to modeling dispersion of communicable diseases–a review. Transactions in GIS 17, 1–17 (2013).
44. Young, S. D. A "big data" approach to HIV epidemiology and prevention. Preventive medicine 70, 17–18 (2015).
45. Young, S. D., Rivers, C. & Lewis, B. Methods of using real-time social media technologies for detection and remote monitoring of HIV outcomes. Preventive medicine 63, 112–115 (2014).
46. Young, S. D. Recommended guidelines on using social networking technologies for HIV prevention research. AIDS and Behavior 16, 1743–1745 (2012).
47. Young, S. D. et al. Social networking technologies as an emerging tool for HIV prevention: a cluster randomized trial. Annals of internal medicine 159, 318–324 (2013).
48. Larmarange, J., Vallo, R., Yaro, S., Msellati, P. & Méda, N. Methods for mapping regional trends of HIV prevalence from demographic and health surveys (DHS). CyberGeo: European Journal of Geography (2011).
49. Larmarange, J. & Bendaud, V. HIV estimates at second subnational level from national population-based surveys. AIDS 28, S469–S476 (2014).
50. Phithakkitnukoon, S., Smoreda, Z. & Olivier, P. Socio-geography of human mobility: A study using longitudinal mobile phone data. PloS one 7, e39253 (2012).
Acknowledgements

We would like to thank the operator France Telecom-Orange and the organizers of the Data for Development Challenge for sharing the D4D data sets, as well as Joseph Larmarange from IRD (Institut de recherche pour le développement), France, for providing us with preliminary results on HIV estimates over the 50 departments of Ivory Coast. This work was partly supported by the Serbian Ministry of Education and Science (Project III 43002) and by the European Commission (FP7 InnoSense project, ref. no: 316191).

Author contributions

S.B., K.G., D.Ć. and V.C. designed the research. S.B. and K.G. conducted the experiments. All authors analysed the results and participated in the writing of the manuscript.

Additional information

Competing financial interests: The authors declare no competing financial interests.