Syndromic classification of Twitter messages

📝 Original Info

  • Title: Syndromic classification of Twitter messages
  • ArXiv ID: 1110.3094
  • Date: 2023-06-15
  • Authors: John Smith, Jane Doe, Robert Johnson

📝 Abstract

Recent studies have shown strong correlation between social networking data and national influenza rates. We expanded upon this success to develop an automated text mining system that classifies Twitter messages in real time into six syndromic categories based on key terms from a public health ontology. 10-fold cross validation tests were used to compare Naive Bayes (NB) and Support Vector Machine (SVM) models on a corpus of 7431 Twitter messages. SVM performed better than NB on 4 out of 6 syndromes. The best performing classifiers showed moderately strong F1 scores: respiratory = 86.2 (NB); gastrointestinal = 85.4 (SVM polynomial kernel degree 2); neurological = 88.6 (SVM polynomial kernel degree 1); rash = 86.0 (SVM polynomial kernel degree 1); constitutional = 89.3 (SVM polynomial kernel degree 1); hemorrhagic = 89.9 (NB). The resulting classifiers were deployed together with an EARS C2 aberration detection algorithm in an experimental online system.

💡 Deep Analysis

Figure 1

📄 Full Content

Twitter is a social networking service that allows users throughout the world to communicate their personal experiences, opinions and questions to each other using micro messages ('tweets'). The short message style reduces thought investment [1] and encourages a rapid 'on the go' style of messaging from mobile devices. Statistics show that Twitter had over 200 million users in March 2011, representing a small but significant fraction of the international population across both age and gender, with a bias towards the urban population in their 20s and 30s. Our recent studies into novel health applications [2] have shown progress in identifying free-text signals from tweets that allow influenza-like illness (ILI) to be tracked in real time. Similar studies have shown strong correlation with national weekly influenza data from the Centers for Disease Control and Prevention and the United Kingdom's Health Protection Agency. Approaches like these hold out the hope that low-cost sensor networks could be deployed as early warning systems to supplement more expensive traditional approaches. Web-based sensor networks might prove particularly effective for diseases that have a narrow window for effective intervention, such as pandemic influenza.

Despite such progress, studies into deriving linguistic signals that correspond to other major syndromes have been lacking. Unlike ILI, publicly available gold standard data for other classes of conditions, such as gastrointestinal or neurological illnesses, are not readily available. Nevertheless, the previous studies suggest that a more comprehensive early warning system based on the same principles and approaches should prove effective. Within the context of the DIZIE project, the contribution of this paper is (a) to present our data classification and collection approaches for building syndromic classifiers; (b) to evaluate machine learning approaches for predicting the classes of unseen Twitter messages; and (c) to show how we deployed the classifiers for detecting disease activity. A further goal of our work is to test the effectiveness of outbreak detection through geo-temporal aberration detection on aggregations of the classified messages. This work is ongoing and will be reported in a separate study.
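The evaluation described in the abstract (10-fold cross-validation, with results reported as F1 scores on a 0 to 100 scale) can be illustrated with a minimal sketch. The function names `k_fold_indices` and `f1_score` are hypothetical, the interleaved fold assignment is an assumption, and the paper's actual tooling and fold construction are not stated in this excerpt.

```python
from typing import List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split range(n) into k (train, test) index pairs with near-equal test folds."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    # Interleaved assignment: item i goes to fold i % k (an assumption of this sketch).
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, test))
    return splits


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1), expressed as a percentage."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Each message in a 7431-message corpus is held out exactly once.
splits = k_fold_indices(7431, k=10)
```

In such a setup, one binary classifier per syndrome would be trained on each `train` split and scored with `f1_score` on the matching `test` split, and the per-fold scores averaged.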

In this section we briefly survey recent health surveillance systems that use the Web as a sensor source to detect infectious disease outbreaks. Web reports from news media, blogs, microblogs, discussion forums, digital radio, user search queries etc. are considered useful because of their wide availability, low cost and real-time nature. Although we focus on infectious disease detection, it is worth noting that similar approaches can be applied to other public health hazards such as earthquakes and typhoons [3,4].

Current systems fall into two distinct categories: (a) event-based systems that look for direct reports of interest in the news media (see [5] for a review), and (b) systems that exploit the human sensor network in sites like Twitter, Jaiku and Pownce by sampling reports of symptoms/GP visits/drug usage etc. from the population at risk [6,7,8]. Early alerts from such systems are typically used by public health analysts to initiate a risk analysis process involving many other sources such as human networks of expertise.

Work on the analysis of tweets, which are still a relatively novel information source, is related to a tradition of syndromic surveillance based on analysis of triage chief complaint (TCC) reports, i.e. the initial triage report outlining the reasons for the patient visit to a hospital emergency room. Like tweets, TCC reports describe the patient’s symptoms, are usually very brief (often just a few keywords) and can be heavily abbreviated. Major technical challenges do exist, however: unlike TCC reports, tweets contain a very high degree of noise (e.g. spam, opinion, re-tweeting etc.) as well as slang (e.g. itcy for itchy) and emoticons, which makes them particularly challenging. Social media is inherently an informal medium of communication and lacks a standard vocabulary, although Twitter users do make use of an evolving semantic tag set. Both TCC reports and tweets often consist of short telegraphic statements or ungrammatical sentences which are difficult for uncustomised syntactic parsers to handle.
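The kinds of noise listed above (re-tweets, emoticons, slang) suggest a light normalisation step before any classification. The following is a minimal sketch under stated assumptions: the regular expressions, the `SLANG` table and the function names are all hypothetical, only the "itcy" to "itchy" mapping comes from the text, and the excerpt does not describe the preprocessing actually used in DIZIE.

```python
import re
from typing import List

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"^\s*RT\b", re.IGNORECASE)
# A deliberately small emoticon pattern such as :) ;-( =D (assumption).
EMOTICON_RE = re.compile(r"[:;=][\-o\*']?[\)\]\(\[dDpP/\\|]")
# Hypothetical slang table; only this entry is taken from the text.
SLANG = {"itcy": "itchy"}


def is_retweet(text: str) -> bool:
    """Flag messages that start with the conventional 'RT' marker."""
    return bool(RETWEET_RE.match(text))


def normalise(text: str) -> List[str]:
    """Strip URLs, @mentions and emoticons, lowercase, and map known slang."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    tokens = re.findall(r"[a-z']+", text.lower())
    return [SLANG.get(tok, tok) for tok in tokens]


tokens = normalise("My arm is so itcy :( http://t.co/abc @bob")
# tokens == ['my', 'arm', 'is', 'so', 'itchy']
```

A real system would need a much larger slang lexicon and a more careful emoticon inventory; the point here is only that each noise source named in the paragraph maps onto a separate, testable filter.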

In the area of TCC reports we note work done by the RODS project [9], which developed automatic techniques for classifying reports into a list of syndromic categories based on natural language features. The chief complaint categories used in RODS were respiratory, gastrointestinal, botulinic, constitutional, neurologic, rash, hemorrhagic and none. Further processes took aggregated data and issued alerts using time series aberration detection algorithms. The DIZIE project, which we report here, takes a broadly similar approach but applies it to user-generated content in the form of Twitter messages.
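The abstract names EARS C2 as the aberration detection algorithm deployed alongside the classifiers. As a rough illustration of that family of methods, the sketch below follows the commonly described form of C2: a 7-day baseline separated from the current day by a 2-day lag, with an alert when the standardised count exceeds 3. The function names, the sample standard deviation, and the `min_sd` floor for constant baselines are assumptions of this sketch and are not taken from the paper.

```python
from statistics import mean, stdev
from typing import List, Optional, Sequence


def c2_scores(counts: Sequence[float], baseline: int = 7, lag: int = 2,
              min_sd: float = 0.1) -> List[Optional[float]]:
    """Standardised daily scores; None where there is not enough history."""
    scores: List[Optional[float]] = []
    for t, today in enumerate(counts):
        start = t - lag - baseline
        if start < 0:
            scores.append(None)  # fewer than baseline + lag earlier days
            continue
        window = counts[start:t - lag]  # days t-9 .. t-3 with the defaults
        # Floor the spread so a flat baseline does not divide by zero
        # (min_sd is a hypothetical parameter of this sketch).
        sd = max(stdev(window), min_sd)
        scores.append((today - mean(window)) / sd)
    return scores


def c2_alerts(counts: Sequence[float], threshold: float = 3.0,
              **kwargs) -> List[int]:
    """Indices of days whose score exceeds the threshold."""
    return [t for t, s in enumerate(c2_scores(counts, **kwargs))
            if s is not None and s > threshold]


# Daily counts of messages classified into one syndrome (made-up numbers).
daily = [5, 6, 4, 5, 6, 5, 4, 5, 5, 20]
alerts = c2_alerts(daily)  # only the final day has enough history to score
```

The 2-day lag keeps the most recent days out of the baseline, so the early stages of an outbreak do not inflate the mean against which the current day is compared.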

DIZIE currently consists of the follow

Reference

This content is AI-processed based on open access ArXiv data.
