📝 Original Info
- Title: Identifying Data Noises, User Biases, and System Errors in Geo-tagged Twitter Messages (Tweets)
- ArXiv ID: 1712.02433
- Date: 2017-12-08
- Authors: Ming-Hsiang Tsou, Hao Zhang, Chin-Te Jung
📝 Abstract
Many social media researchers and data scientists collect geo-tagged tweets to conduct spatial analysis or to identify spatiotemporal patterns of filtered messages for specific topics or events. This paper provides a systematic view of the characteristics (data noises, user biases, and system errors) of geo-tagged tweets from the Twitter Streaming API. First, we found that a small percentage (1%) of active Twitter users can create a large portion (16%) of geo-tagged tweets. Second, a significant share (57.3%) of geo-tagged tweets is located outside the Twitter Streaming API's bounding box in San Diego. Third, we can detect spam, bot, and cyborg tweets (data noises) by examining the "source" metadata field. The portion of data noises in geo-tagged tweets is significant in our case study (29.42% in San Diego, CA and 53.47% in Columbus, OH). Finally, the majority of geo-tagged tweets are created not by the generic Twitter apps on Android or iPhone devices but by other platforms, such as Instagram and Foursquare. We recommend a multi-step procedure to remove these noises in future research projects that use geo-tagged tweets.
📄 Full Content
Identifying Data Noises, User Biases, and System Errors in
Geo-tagged Twitter Messages (Tweets)
Ming-Hsiang Tsou
Center for Human Dynamics in the
Mobile Age, San Diego State
University, 5500 Campanile Drive,
San Diego, CA 92182, USA.
01-6195940205
mtsou@mail.sdsu.edu
Hao Zhang
Center for Human Dynamics in the
Mobile Age, San Diego State
University, 5500 Campanile Drive,
San Diego, CA 92182, USA.
zhanghaoshogo@gmail.com
Chin-Te Jung
Esri (Beijing) R&D Center,
Beijing, China
chinte.jung@gmail.com
ABSTRACT
Many social media researchers and data scientists collect geo-tagged
tweets to conduct spatial analysis or to identify spatiotemporal
patterns of filtered messages for specific topics or events. This
paper provides a systematic view of the characteristics
(data noises, user biases, and system errors) of geo-tagged tweets
from the Twitter Streaming API. First, we found that a small
percentage (1%) of active Twitter users can create a large portion
(16%) of geo-tagged tweets. Second, a significant share
(57.3%) of geo-tagged tweets is located outside the Twitter Streaming
API’s bounding box in San Diego. Third, we can detect spam, bot,
and cyborg tweets (data noises) by examining the “source” metadata
field. The portion of data noises in geo-tagged tweets is significant
in our case study (29.42% in San Diego, CA and 53.47% in Columbus,
OH). Finally, the majority of geo-tagged tweets are created not
by the generic Twitter apps on Android or iPhone devices
but by other platforms, such as Instagram and Foursquare. We
recommend a multi-step procedure to remove these noises in
future research projects that use geo-tagged tweets.
Keywords
Social media, Geo-tagged, Data noises, Twitter, Tweets.
1. INTRODUCTION
Twitter is one of the most popular social media platforms in
academic research because of its large number of users,
comprehensive metadata, openness of messages, and publicly
available application programming interfaces (APIs). Tweets are
the actual Twitter messages that users create to express their
feelings, events, or activities within 140 characters, including
spaces. Geocoding tweets is important for social media
analytics because data scientists and researchers need to aggregate
social media messages or users into a city, a region, or nearby
points of interest (POIs) for location-based analysis and regional
trend analysis. Currently, the public Twitter APIs provide five
types of geocoding sources: 1. Geo-tagged coordinates, 2. Place
check-in location (bounding box), 3. User profile location,
4. Time zones, and 5. Texts containing locational information
(explicit or implicit information). Among the five geocoding sources,
tweets with geo-tagged coordinates are the most popular data source
in location-based social media research. Geo-tagged tweets have
precise latitude and longitude coordinates (decimal degrees) stored
in a metadata field of tweets, called “geo” (a deprecated field name
in APIs) or “coordinates” (the current field name in APIs). When
users turn on the precise location tag function on their Twitter
accounts (which is off by default), their tweets will be geo-tagged
using GPS or Wi-Fi signals in their mobile devices. Since many
users do not enable precise location tags, only around 1%
of tweets contain geo-tagged information. The percentage of
geo-tagged tweets may vary among topics or keywords.
For example, during a wildfire event, the percentage of geo-tagged
tweets collected with wildfire-related keywords can reach 4% or
higher [1].
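As an illustration of the two metadata fields mentioned above, the following minimal Python sketch reads a precise point location from a tweet that has already been parsed from JSON into a dictionary. It assumes the classic tweet JSON layout, in which “coordinates” holds a GeoJSON point ordered [longitude, latitude] and the deprecated “geo” field holds the same point ordered [latitude, longitude]. The function name `extract_lat_lon` and the sample payloads are hypothetical and are not taken from the paper.

```python
from typing import Optional, Tuple


def extract_lat_lon(tweet: dict) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) for a geo-tagged tweet, or None if no precise point exists."""
    # Current field: GeoJSON point, ordered [longitude, latitude].
    coords = tweet.get("coordinates")
    if coords and coords.get("type") == "Point":
        lon, lat = coords["coordinates"]
        return float(lat), float(lon)
    # Deprecated field: ordered [latitude, longitude].
    geo = tweet.get("geo")
    if geo and geo.get("type") == "Point":
        lat, lon = geo["coordinates"]
        return float(lat), float(lon)
    # Place check-ins, profile locations, etc. are not precise geo-tags.
    return None


# Hypothetical sample payloads (made-up values, trimmed to the relevant fields).
tagged = {"coordinates": {"type": "Point", "coordinates": [-117.07, 32.77]}}
legacy = {"coordinates": None,
          "geo": {"type": "Point", "coordinates": [32.77, -117.07]}}
untagged = {"coordinates": None, "geo": None,
            "place": {"full_name": "San Diego, CA"}}

for sample in (tagged, legacy, untagged):
    print(extract_lat_lon(sample))
```

The axis order differs between the two fields, so normalizing both to a single (lat, lon) convention at ingestion time helps avoid swapped coordinates in later spatial analysis.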
Geo-tagged tweets are valuable data for social media researchers
and geographers who study geographic context and spatial association
within social media data. One popular method of collecting geo-tagged
tweets is to use Twitter’s Streaming API with a
predefined bounding box or multiple predefined keywords.
Previous work has demonstrated that social media data collected
by the Twitter Streaming API (the free version, with a 1% sampling
rate) are a good sample of the data from Twitter’s Firehose API (a
very expensive version that provides 100% of the tweet data on
Twitter servers) [2]. In academia, many researchers have used
geo-tagged tweets to conduct spatial analysis and GIS operations in
their research projects. For example, the 2013 special issue “Mapping
Cyberspace and Social Media” of Cartography and Geographic
Information Science [3] includes seven refereed research papers,
four of which use geo-tagged tweets as their
main data source. In general, there are two types of Twitter APIs:
the Streaming API for collecting real-time feeds of Twitter messages
and the Search API for collecting historical tweets (up to 7 or 9 days
before the search date) with specific keywords or user names through
database query methods. This paper will only focus on the
characteristics
…(Full text truncated)…
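The abstract reports that many tweets returned for a bounding-box request lie outside that box and that the “source” field helps to flag noise. The Python sketch below shows one way such checks could be applied after collection. It is not the authors' multi-step procedure, which is not visible in this excerpt. The box values in `EXAMPLE_BBOX`, the labels in `ALLOWED_SOURCES`, and all function names are illustrative assumptions. The sketch also assumes that “source” arrives as an HTML anchor string, as in the classic tweet JSON.

```python
import re
from typing import Set, Tuple

# (west, south, east, north) in decimal degrees. Made-up values loosely
# around San Diego; NOT the box used in the paper.
EXAMPLE_BBOX: Tuple[float, float, float, float] = (-117.6, 32.5, -116.1, 33.5)

# Placeholder allowlist of client labels; a real study would build this list
# by inspecting the sources observed in its own data.
ALLOWED_SOURCES: Set[str] = {"Twitter for iPhone", "Twitter for Android", "Instagram"}

_TAG = re.compile(r"<[^>]+>")


def in_bbox(lat: float, lon: float, bbox: Tuple[float, float, float, float]) -> bool:
    """True if the point lies inside the box (boxes crossing the antimeridian are not handled)."""
    west, south, east, north = bbox
    return south <= lat <= north and west <= lon <= east


def source_label(source_html: str) -> str:
    """Strip the HTML anchor markup from the 'source' field, leaving the client name."""
    return _TAG.sub("", source_html).strip()


def keep_tweet(tweet: dict,
               bbox: Tuple[float, float, float, float] = EXAMPLE_BBOX,
               allowed: Set[str] = ALLOWED_SOURCES) -> bool:
    """Keep a tweet only if it has a precise point inside bbox and an allowed source."""
    point = tweet.get("coordinates")
    if not point:
        return False
    lon, lat = point["coordinates"]  # GeoJSON order: [longitude, latitude]
    return in_bbox(lat, lon, bbox) and source_label(tweet.get("source", "")) in allowed


# Hypothetical tweet, trimmed to the fields used here.
sample = {
    "coordinates": {"type": "Point", "coordinates": [-117.07, 32.77]},
    "source": '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
}
print(source_label(sample["source"]), keep_tweet(sample))
```

An allowlist is only one design choice. A blocklist of known bot or advertising clients, or a manual review of the most frequent source labels, would fit the same structure.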