Mining of health and disease events on Twitter: validating search protocols within the setting of Indonesia

📝 Abstract

This study seeks to validate a search protocol of ill-health-related terms using Twitter data, which can later be used to understand if, and how, Twitter can reveal information on the current health situation. We extracted conversations related to health and disease postings on Twitter using a set of pre-defined keywords, assessed the prevalence, frequency, and timing of such content in these conversations, and validated how well this search protocol was able to detect relevant disease tweets. The Classification and Regression Trees (CART) algorithm was used to train and test the search protocol's disease and health hits against those identified by our team. The predictions showed good validity, with an AUC above 0.8. Our study shows that monitoring of public sentiment on Twitter can be used as a real-time proxy for health events.

📄 Content

Mining of health and disease events on Twitter: validating search protocols within the setting of Indonesia

Aditya L. Ramadona (a,b,*), Rendra Agusta (c), Sulistyawati (d), Lutfan Lazuardi (e), Anwar D. Cahyono (f), Åsa Holmner (g), Fatwa S.T. Dewi (h), Hari Kusnanto (i), Joacim Rocklöv (a,j)

(a) Department of Public Health and Clinical Medicine, Epidemiology and Global Health, Umeå University, Umeå 90187, Sweden
(b) Center for Environmental Studies, Universitas Gadjah Mada, Yogyakarta 55281, Indonesia
(c) Department of Psychology, Universitas Mercu Buana, Yogyakarta 55753, Indonesia
(d) Department of Public Health, Universitas Ahmad Dahlan, Yogyakarta 55164, Indonesia
(e) Department of Health Policy and Management, Faculty of Medicine, Universitas Gadjah Mada, Yogyakarta 55281, Indonesia
(f) District Health Office, Yogyakarta 55165, Indonesia
(g) Department of Radiation Sciences, Umeå University, Umeå 90187, Sweden
(h) Department of Health Behavior, Environment and Social Medicine, Faculty of Medicine, Universitas Gadjah Mada, Yogyakarta 55281, Indonesia
(i) Department of Community and Family Medicine, Faculty of Medicine, Universitas Gadjah Mada, Yogyakarta 55281, Indonesia
(j) Heidelberg University Medical School, Institute of Public Health, Heidelberg 69120, Germany

Abstract
This study seeks to validate a search protocol of ill-health-related terms using Twitter data, which can later be used to understand if, and how, Twitter can reveal information on the current health situation. We extracted conversations related to health and disease postings on Twitter using a set of pre-defined keywords, assessed the prevalence, frequency, and timing of such content in these conversations, and validated how well this search protocol was able to detect relevant disease tweets. The Classification and Regression Trees (CART) algorithm was used to train and test the search protocol's disease and health hits against those identified by our team. The predictions showed good validity, with an AUC above 0.8. Our study shows that monitoring of public sentiment on Twitter can be used as a real-time proxy for health events.

Keywords: social media analytics; public health surveillance; digital epidemiology.

  1. Introduction
    Twitter is a free social networking and micro-blogging service that enables its users to read and share messages no longer than 140 characters. As of May 2016, 24.34 million Indonesians, or around 10% of the population, were active on Twitter each month [1], sharing news, events, and their personal feelings and experiences, including health-related information. Twitter offers potential for data mining of public information flows [2], and these massive data sources may be exploited for public health monitoring and surveillance purposes [3]. Previous studies have explored the use of Twitter, for example, to track levels of disease activity [4], to predict heart disease mortality [5], and to measure health-related quality of life [6]. However, the validity of Twitter mining protocols in correctly detecting health and disease events remains a methodological challenge for this medium. This study seeks to validate a search protocol of ill-health-related terms using real-time Twitter data, which can later be used to understand if, and how, Twitter can reveal information on the current health situation in Indonesia.
    In this validation study of mining protocols, we: 1) extracted geo-located conversations related to health and disease postings on Twitter using a set of pre-defined keywords, 2) assessed the prevalence, frequency, and timing of such content in these conversations, and 3) validated how well this search protocol was able to detect relevant disease tweets.
  2. Materials and Methods
    We started by developing groups of words and phrases relevant to disease symptoms and health outcomes in order to extract relevant content through the Twitter Search API [7]. Since this study is intended to reveal these relationships in the setting of Indonesia, we considered only tweets in Bahasa, filtered using the following keywords.

“rumah OR sakit OR rawat OR inap OR demam OR panas
-cuaca OR berdarah OR pendarahan OR tombosit OR badan OR muntah OR badan OR tua OR ‘:(’”.

We removed retweets and, as a result, collected only 390 tweets in Bahasa. After removing punctuation, numbers, capitalization, and Bahasa stop-words (e.g., kamu and aja), we obtained 1,632 unique words from these tweets. Many words appeared infrequently, so we considered only the 22 most frequent words (i.e., words that appear at least 10 times, Fig. 1).
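The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names, the "RT @" retweet heuristic, the two-word stop-word set (the text names only kamu and aja as examples), and the toy tweets are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list: the text gives only "kamu" and "aja" as examples.
STOP_WORDS = {"kamu", "aja"}

def is_retweet(text):
    # Assumption: a retweet is recognised by the conventional "RT @" prefix.
    return text.lstrip().lower().startswith("rt @")

def tokenize(text):
    # Lowercase, then keep runs of ASCII letters only,
    # which drops punctuation and numbers; finally drop stop-words.
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def frequent_words(tweets, min_count=10):
    # Count words over non-retweets and keep those seen at least min_count times.
    counts = Counter()
    for tweet in tweets:
        if is_retweet(tweet):
            continue
        counts.update(tokenize(tweet))
    return {w: c for w, c in counts.items() if c >= min_count}

# Toy input (invented for illustration, not data from the study).
sample = [
    "Demam tinggi, badan panas!!",
    "RT @user: demam lagi",
    "kamu demam? 39 derajat",
]
print(frequent_words(sample, min_count=2))  # {'demam': 2}
```

With the threshold left at its default of 10, the same function mirrors the "at least 10 times" cut-off used to select the most frequent words.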

Fig. 1. Words that appear more than 10 times

We developed a model relating the hits of the Twitter search term protocol to our team's classification of tweet content as positive or neutral with respect to health events. We used the rpart package [8], in the R software environment [9], to implement the CART algorithm [10]. Predictors were generated from the 22 most frequent words, i.e. by counting the
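To make the modelling step more concrete, the sketch below shows two of its ingredients in plain Python: the Gini-based search for a single best split, which is the building block that CART applies recursively, and the AUC computed as the probability that a positive tweet scores higher than a negative one. It is a stand-in for illustration only. The study itself used rpart in R, which additionally grows and prunes a full tree; the function names, the two word-count features, and the toy labels and scores here are assumptions, not study data.

```python
def gini(labels):
    # Gini impurity of a list of 0/1 labels.
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels):
    # Return (feature_index, threshold, weighted_impurity) for the single
    # split "feature <= threshold" that minimises weighted Gini impurity.
    n = len(rows)
    best = None
    for j in range(len(rows[0])):
        for thr in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= thr]
            right = [y for r, y in zip(rows, labels) if r[j] > thr]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, thr, score)
    return best

def auc(scores, labels):
    # Share of (positive, negative) pairs ranked correctly; ties count half.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy data: each row holds counts of two hypothetical words in one tweet;
# label 1 means the tweet was judged health-related, 0 means neutral.
rows = [[1, 0], [2, 1], [0, 0], [0, 1], [1, 1], [0, 0]]
labels = [1, 1, 0, 0, 1, 0]
print(best_split(rows, labels))

# Toy predicted scores for the same six tweets.
scores = [0.9, 0.8, 0.3, 0.6, 0.4, 0.1]
print(auc(scores, labels))
```

In this toy example the first word count separates the two classes perfectly, so the best split has zero impurity, while the invented scores misrank one of the nine positive-negative pairs.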

This content is AI-processed based on ArXiv data.
