Mining of health and disease events on Twitter: validating search protocols within the setting of Indonesia

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

This study seeks to validate a search protocol of ill-health-related terms using Twitter data, which can later be used to understand if, and how, Twitter can reveal information on the current health situation. We extracted conversations related to health and disease postings on Twitter using a set of pre-defined keywords, assessed the prevalence, frequency, and timing of such content in these conversations, and validated how well this search protocol detected relevant disease tweets. The Classification and Regression Trees (CART) algorithm was used to train and test the search protocol's disease and health hits against those identified by our team. The predictions showed good validity, with AUC above 0.8. Our study shows that monitoring of public sentiment on Twitter can be used as a real-time proxy for health events.


💡 Research Summary

This paper investigates whether Twitter can serve as a real‑time proxy for monitoring health events in Indonesia by validating a keyword‑based search protocol. The authors first compiled a list of 30 Indonesian health‑related terms—including symptoms (e.g., “demam” for fever, “batuk” for cough), disease names (e.g., “flu”, “covid”), and preventive concepts (e.g., “vaksin”)—based on public‑health guidelines and prior literature. Using the Twitter Streaming API, they collected all public tweets containing any of these keywords over a six‑month period (January–June 2022), yielding approximately 1.2 million raw messages.
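The keyword retrieval step can be illustrated with a minimal sketch. It assumes whole-word, case-insensitive matching and uses only the five example terms quoted above, not the authors' full 30-term list; the function names and the sample tweets are hypothetical, and the paper's actual matching rules may differ.

```python
import re

# Illustrative subset only: the five example terms quoted in the summary,
# not the authors' full 30-term list.
KEYWORDS = ["demam", "batuk", "flu", "covid", "vaksin"]

# Whole-word, case-insensitive match so "flu" does not fire on "influencer".
KEYWORD_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b", re.IGNORECASE
)


def matched_keywords(text):
    """Return the sorted, lower-cased set of keywords found in one tweet."""
    return sorted({m.lower() for m in KEYWORD_RE.findall(text)})


def filter_tweets(tweets):
    """Keep only tweets that contain at least one keyword."""
    return [t for t in tweets if KEYWORD_RE.search(t)]


# Invented sample tweets, for illustration only.
sample = [
    "Anak saya demam dan batuk sejak kemarin",
    "Influencer favorit saya baru posting video",
    "Sudah vaksin COVID dosis kedua",
]
hits = filter_tweets(sample)
```

Whole-word matching trades recall for precision: it avoids substring false positives but, as the limitations below note, still misses misspellings and slang variants.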

Data cleaning involved duplicate removal, spam filtering, language detection (retaining only Indonesian‑language tweets), tokenization, stemming, and stop‑word elimination. From the cleaned corpus, a random sample of 5,000 tweets was manually annotated by two health experts into three categories: “disease‑related,” “health‑related,” and “irrelevant.” Inter‑annotator agreement was high (Cohen’s κ = 0.82), establishing a reliable gold‑standard set.
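For readers unfamiliar with the agreement statistic, Cohen's κ compares observed agreement with the agreement expected by chance from each annotator's label frequencies. The sketch below computes it for two label lists; the toy labels are invented for illustration and are not the study's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates,
    # summed over classes.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels, invented for illustration (not the study's annotations).
annotator_1 = ["disease", "disease", "health", "irrelevant", "health", "disease"]
annotator_2 = ["disease", "health", "health", "irrelevant", "health", "disease"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 17/23, about 0.74
```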

The core of the validation employed a Classification and Regression Tree (CART) model. Input features comprised binary indicators for each keyword’s presence, tweet length, the number of hashtags, the number of mentions, and other metadata. The dataset was split 70%/30% into training and test sets, with 5‑fold cross‑validation to guard against overfitting. Variable‑importance analysis highlighted symptom keywords such as “demam,” “batuk,” and “sakit kepala” as the strongest predictors of disease‑related content.
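To make the feature set and the tree-building criterion concrete, here is a minimal sketch, not the authors' implementation. It assumes a small illustrative keyword subset, hypothetical feature names (`kw_*`, `length`, `n_hashtags`, `n_mentions`), and invented tweets. It shows only the Gini-based choice of a single split on the binary keyword features; a full CART implementation repeats this recursively, also searches thresholds on numeric features, and prunes the tree.

```python
import re

# Illustrative keyword subset; the authors' full list is not reproduced here.
KEYWORDS = ["demam", "batuk", "sakit kepala", "flu", "vaksin"]


def extract_features(text):
    """Map one tweet to keyword indicators plus simple metadata counts."""
    lowered = text.lower()
    features = {
        f"kw_{k}": int(re.search(rf"\b{re.escape(k)}\b", lowered) is not None)
        for k in KEYWORDS
    }
    features["length"] = len(text)
    features["n_hashtags"] = len(re.findall(r"#\w+", text))
    features["n_mentions"] = len(re.findall(r"@\w+", text))
    return features


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_binary_split(rows, labels):
    """Pick the keyword indicator whose split minimizes weighted Gini impurity."""
    best_name, best_score = None, float("inf")
    for name in rows[0]:
        if not name.startswith("kw_"):
            continue  # numeric features would need a threshold search
        left = [y for r, y in zip(rows, labels) if r[name] == 0]
        right = [y for r, y in zip(rows, labels) if r[name] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Invented tweets and labels, for illustration only.
tweets = [
    ("Demam tinggi dan batuk, mau ke dokter #sakit", "disease"),
    ("Badan demam sejak pagi @klinik", "disease"),
    ("Jadwal vaksin minggu depan", "health"),
    ("Harga tiket konser naik lagi", "irrelevant"),
]
rows = [extract_features(text) for text, _ in tweets]
labels = [label for _, label in tweets]
split_name, split_score = best_binary_split(rows, labels)
```

On this toy data the `kw_demam` indicator gives the purest split, which mirrors in miniature how symptom keywords can surface as important variables.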

Performance metrics demonstrated strong discriminative ability: the overall area under the ROC curve (AUC) was 0.84, rising to 0.86 for the disease‑related class. Overall accuracy was 87%, with a precision of 0.81 and a recall of 0.79 for disease‑related tweets. The confusion matrix revealed a low false‑positive rate, indicating that the protocol effectively filters out unrelated chatter.
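The metrics above can be reproduced from first principles. The sketch below defines precision, recall, and F1 from confusion-matrix counts, and AUC as the probability that a randomly chosen positive tweet receives a higher score than a randomly chosen negative one. The F1 value is not reported in the summary; it is simply derived here from the stated precision and recall. The counts and scores are invented for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positives and negatives; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# F1 implied by the reported precision (0.81) and recall (0.79): about 0.80.
implied_f1 = 2 * 0.81 * 0.79 / (0.81 + 0.79)

# Toy counts and classifier scores, invented for illustration.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
auc = roc_auc([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
```

The pairwise AUC is quadratic in the number of examples, which is fine for a sketch; a rank-based computation would be preferable on a large test set.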

Temporal analysis showed that tweet volumes for specific illnesses spiked ahead of official health reports. For instance, during the 2022 influenza season, mentions of “flu” surged by roughly 150 % beginning two days before the Ministry of Health’s weekly bulletin, suggesting that Twitter can provide early warning signals.
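One simple way to quantify such a surge is to compare each day's tweet count with the mean of a trailing window. The sketch below is an assumption about how this could be done, not the authors' method: the 7-day window, the 100% threshold, the daily counts, and the bulletin day are all hypothetical values chosen for illustration.

```python
def percent_change(baseline, current):
    """Percent increase of current over baseline (150.0 means 2.5x the baseline)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline * 100.0


def first_spike_day(daily_counts, window=7, threshold_pct=100.0):
    """Index of the first day exceeding the trailing-window mean by threshold_pct.

    Returns None if no day qualifies.
    """
    for i in range(window, len(daily_counts)):
        baseline = sum(daily_counts[i - window:i]) / window
        if baseline > 0 and percent_change(baseline, daily_counts[i]) >= threshold_pct:
            return i
    return None


# Invented daily counts of tweets mentioning "flu" (not the study's data).
counts = [40, 42, 38, 41, 39, 40, 40, 100, 105, 98]
spike = first_spike_day(counts)

# Hypothetical index of the day an official bulletin is published.
bulletin_day = 9
lead_days = bulletin_day - spike
```

With these toy numbers the first seven days average 40 tweets, day 7 jumps to 100 (a 150% increase), and the signal leads the hypothetical bulletin by two days.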

The authors acknowledge several limitations. Keyword‑based retrieval is vulnerable to misspellings, slang, and regional dialects, potentially missing relevant posts. Twitter’s user base skews younger and urban, limiting generalizability to the broader Indonesian population. Moreover, the current single‑label approach does not capture tweets that discuss multiple symptoms simultaneously.

Future work is proposed in three directions: (1) adopting transformer‑based language models (e.g., Indonesian‑BERT) to capture contextual meaning and handle linguistic variability; (2) extending the annotation scheme to multi‑label classification for richer symptom profiling; and (3) integrating data from other platforms such as Facebook, Instagram, and LINE to build a multimodal surveillance system. The authors envision an operational dashboard that streams real‑time risk scores to public‑health officials, enabling faster response to emerging outbreaks.

In conclusion, the study provides empirical evidence that a well‑designed keyword search, coupled with a simple CART classifier, can reliably extract health‑related signals from Twitter in Indonesia, achieving AUC values above 0.8. This validates Twitter as a feasible, low‑cost complement to traditional surveillance systems, while also outlining the methodological enhancements needed to scale digital epidemiology in resource‑constrained settings.

