Identifying Data Noises, User Biases, and System Errors in Geo-tagged Twitter Messages (Tweets)


📝 Original Info

  • Title: Identifying Data Noises, User Biases, and System Errors in Geo-tagged Twitter Messages (Tweets)
  • ArXiv ID: 1712.02433
  • Date: 2017-12-08
  • Authors: Ming-Hsiang Tsou, Hao Zhang, Chin-Te Jung

📝 Abstract

Many social media researchers and data scientists collected geo-tagged tweets to conduct spatial analysis or identify spatiotemporal patterns of filtered messages for specific topics or events. This paper provides a systematic view to illustrate the characteristics (data noises, user biases, and system errors) of geo-tagged tweets from the Twitter Streaming API. First, we found that a small percentage (1%) of active Twitter users can create a large portion (16%) of geo-tagged tweets. Second, there is a significant amount (57.3%) of geo-tagged tweets located outside the Twitter Streaming API's bounding box in San Diego. Third, we can detect spam, bot, cyborg tweets (data noises) by examining the "source" metadata field. The portion of data noises in geo-tagged tweets is significant (29.42% in San Diego, CA and 53.47% in Columbus, OH) in our case study. Finally, the majority of geo-tagged tweets are not created by the generic Twitter apps in Android or iPhone devices, but by other platforms, such as Instagram and Foursquare. We recommend a multi-step procedure to remove these noises for the future research projects utilizing geo-tagged tweets.


📄 Full Content

Identifying Data Noises, User Biases, and System Errors in Geo-tagged Twitter Messages (Tweets)

Ming-Hsiang Tsou, Center for Human Dynamics in the Mobile Age, San Diego State University, 5500 Campanile Drive, San Diego, CA 92182, USA. 01-6195940205, mtsou@mail.sdsu.edu

Hao Zhang, Center for Human Dynamics in the Mobile Age, San Diego State University, 5500 Campanile Drive, San Diego, CA 92182, USA. zhanghaoshogo@gmail.com

Chin-Te Jung, Esri (Beijing) R&D Center, Beijing, China. chinte.jung@gmail.com

ABSTRACT

Many social media researchers and data scientists collected geo-tagged tweets to conduct spatial analysis or identify spatiotemporal patterns of filtered messages for specific topics or events. This paper provides a systematic view to illustrate the characteristics (data noises, user biases, and system errors) of geo-tagged tweets from the Twitter Streaming API. First, we found that a small percentage (1%) of active Twitter users can create a large portion (16%) of geo-tagged tweets. Second, there is a significant amount (57.3%) of geo-tagged tweets located outside the Twitter Streaming API’s bounding box in San Diego. Third, we can detect spam, bot, cyborg tweets (data noises) by examining the “source” metadata field. The portion of data noises in geo-tagged tweets is significant (29.42% in San Diego, CA and 53.47% in Columbus, OH) in our case study. Finally, the majority of geo-tagged tweets are not created by the generic Twitter apps in Android or iPhone devices, but by other platforms, such as Instagram and Foursquare. We recommend a multi-step procedure to remove these noises for the future research projects utilizing geo-tagged tweets.
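The checks listed in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the bounding-box corners, the whitelist of source names, and all helper names are assumptions made for the example. Tweets are assumed to be plain dictionaries with `lon`, `lat`, `source`, and `user` keys, with the source already reduced to its display name.

```python
from collections import Counter
import math

# Hypothetical bounding box (west, south, east, north) in decimal degrees.
# These corner values are illustrative, not the ones used in the paper.
BBOX = (-117.6, 32.5, -116.0, 33.5)

# Hypothetical whitelist of "source" names treated as human-operated apps.
TRUSTED_SOURCES = {
    "Twitter for iPhone",
    "Twitter for Android",
    "Instagram",
    "Foursquare",
}


def in_bbox(lon, lat, bbox=BBOX):
    """Return True if the point lies inside the bounding box (edges included)."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


def is_trusted_source(source):
    """Return True if the tweet's source name is on the whitelist."""
    return source in TRUSTED_SOURCES


def top_user_share(tweets, fraction=0.01):
    """Share of tweets posted by the most active `fraction` of users."""
    if not tweets:
        return 0.0
    counts = Counter(t["user"] for t in tweets)
    n_top = max(1, math.ceil(len(counts) * fraction))
    top_total = sum(c for _, c in counts.most_common(n_top))
    return top_total / len(tweets)


def clean(tweets):
    """Keep tweets that pass both the location check and the source check."""
    return [
        t for t in tweets
        if in_bbox(t["lon"], t["lat"]) and is_trusted_source(t["source"])
    ]
```

A whitelist is only one possible design. A blacklist of known bot or advertising sources would work the same way with the membership test inverted, and either list would need to be built by inspecting the sources present in the collected data.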

Keywords: Social media, Geo-tagged, Data noises, Twitter, Tweets.

1. INTRODUCTION

Twitter is one of the most popular social media platforms in academic research because of its large number of users, comprehensive metadata, openness of messages, and publicly available application programming interfaces (APIs). Tweets are the Twitter messages that users create to express their feelings, events, or activities within 140 characters, including spaces. Geocoding tweets is important for social media analytics because data scientists and researchers need to aggregate social media messages or users into a city, a region, or nearby points of interest (POI) for location-based analysis and regional trend analysis. Currently, the public Twitter APIs provide five types of geocoding sources: (1) geo-tagged coordinates, (2) place check-in location (bounding box), (3) user profile location, (4) time zones, and (5) texts containing locational information (explicit or implicit).

Among the five geocoding methods, tweets with geo-tagged coordinates are the most popular data source in location-based social media research. Geo-tagged tweets have precise latitude and longitude coordinates (decimal degrees) stored in a tweet metadata field called “geo” (a deprecated field name in the APIs) or “coordinates” (the current field name in the APIs). When users turn on the precise location tag function in their Twitter accounts (it is off by default), their tweets are geo-tagged using the GPS or Wi-Fi signals of their mobile devices. Because many users do not enable precise location tags, only around 1% of tweets contain geo-tagged information. The percentage of geo-tagged tweets may vary among topics or keywords. For example, during a wildfire event, 4% or more of the tweets collected with wildfire-related keywords can be geo-tagged [1].

Geo-tagged tweets are valuable data for social media researchers and geographers studying geographic context and spatial association within social media data. One popular method of collecting geo-tagged tweets is to use Twitter’s Streaming API with a predefined bounding box or multiple predefined keywords. Previous work has demonstrated that social media data collected through the Twitter Streaming API (the free version with a 1% sampling rate) are a good sample of Twitter’s Firehose API (a very expensive version providing 100% of the tweet data in Twitter’s servers) [2]. In academia, many researchers have used geo-tagged tweets to conduct spatial analysis and GIS operations in their research projects. For example, the 2013 special issue “Mapping Cyberspace and Social Media” in Cartography and Geographic Information Science [3] includes seven refereed research papers, four of which use geo-tagged tweets as their main data source.
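As a small illustration of the two metadata fields mentioned above, the sketch below pulls a latitude/longitude pair out of a tweet that has already been parsed into a Python dictionary. It assumes the layout of Twitter's v1.1 tweet JSON, where `coordinates` holds a GeoJSON point ordered longitude, latitude, while the deprecated `geo` field is ordered latitude, longitude. The function name `extract_lat_lon` is hypothetical.

```python
def extract_lat_lon(tweet):
    """Return (lat, lon) for a geo-tagged tweet dict, or None if it has no point."""
    point = tweet.get("coordinates")
    if point and point.get("type") == "Point":
        lon, lat = point["coordinates"]  # GeoJSON order: longitude first
        return (lat, lon)

    legacy = tweet.get("geo")
    if legacy and legacy.get("type") == "Point":
        lat, lon = legacy["coordinates"]  # deprecated field: latitude first
        return (lat, lon)

    return None  # not geo-tagged with precise coordinates
```

Mixing up the two axis orders is an easy mistake that silently moves points to the wrong part of the globe, so it may be worth normalizing to a single order at ingestion time.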
There are two general types of Twitter APIs: the Streaming API, which collects real-time feeds of Twitter messages, and the Search API, which collects historical tweets (up to 7 or 9 days before the search date) by specific keywords or user names using database query methods. This paper will only focus on the characteristics

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
