Real-time Road Traffic Information Detection Through Social Media
In the current study, a mechanism is proposed to extract traffic-related information, such as congestion and incidents, from textual data on the internet. The current data source is Twitter. Because the data volume is extremely large, automated models are developed to stream, download, and mine the data in real time. Furthermore, if a tweet contains traffic-related information, the models should be able to infer and extract it. Currently, data are collected only for the United States: a total of 120,000 geo-tagged traffic-related tweets and six million geo-tagged non-traffic-related tweets are retrieved, and classification models are trained on them. This data is then used for various kinds of spatial and temporal analysis. A mechanism to calculate the level of traffic congestion, safety, and traffic perception for U.S. cities is proposed. Traffic congestion and safety rankings for the various urban areas are obtained and statistically validated against existing, widely adopted rankings. Traffic perception depicts the attitude and perception of people towards traffic. Traffic-related data, when visualized spatially and temporally, also shows the same pattern as the actual traffic flows for various urban areas. When visualized at the city level, the flow of tweets is clearly similar to the flow of vehicles, and the traffic-related tweets are representative of traffic within the cities. Taken together, the findings show that a significant amount of traffic-related information can be extracted from Twitter and other sources on the internet. Furthermore, Twitter and these data sources are freely available and are not bound by spatial or temporal limitations: wherever there is a user, there is potential for data.
💡 Research Summary
The thesis “Real‑time Road Traffic Information Detection Through Social Media” presents a comprehensive framework for extracting, classifying, and visualizing traffic‑related information from Twitter. The author collected geo‑tagged tweets over a five‑month period (September 2014 – February 2015), amassing roughly 120 000 traffic‑related tweets and 6 million non‑traffic tweets across the United States. Data were streamed in real time via the Twitter API, stored in a relational database, and later processed for both offline and online analyses.
Pre‑processing involved removal of stop‑words, application of a custom traffic dictionary, and TF‑IDF weighting to generate feature vectors. Additional linguistic features such as part‑of‑speech tags and named‑entity recognition were extracted to capture domain‑specific terms (e.g., “accident”, “grid‑lock”). Eight distinct feature sets were constructed, encompassing word frequencies, sentiment scores, temporal weights, and geographic attributes.
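The stop-word removal and TF-IDF weighting described above can be illustrated with a minimal sketch. The stop-word list, the example tweets, and the particular TF-IDF variant (term frequency normalized by tweet length, idf = log(N / df)) are assumptions made for illustration; the summary does not state which variants the thesis used.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the thesis's actual list is not given here.
STOP_WORDS = {"a", "an", "the", "on", "in", "at", "is", "to", "of", "and", "my", "for"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop-words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty tweets
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


# Invented example tweets, not taken from the thesis data set.
tweets = [
    "Accident on I-85 north, traffic is at a standstill",
    "Stuck in traffic for an hour",
    "Great coffee at the new place downtown",
]
vectors = tfidf(tweets)
```

In this toy corpus, a term that appears in only one tweet ("accident") receives a higher weight than one shared by two tweets ("traffic"), which is the effect TF-IDF is meant to produce.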
For classification, several machine‑learning algorithms were evaluated: Naïve Bayes, Support Vector Machines, and Stochastic Gradient Descent. Two families of models were built—one that retained the keyword “traffic” and another that deliberately omitted it to test robustness. The “traffic‑included” model achieved ~92 % accuracy, while the “traffic‑excluded” model still maintained ~85 % accuracy, demonstrating that the system does not rely solely on obvious keywords. A second‑level classifier separated congestion‑related tweets from incident‑related tweets, reaching 88 % and 84 % accuracy respectively.
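As a rough illustration of the "traffic-included" versus "traffic-excluded" comparison, the sketch below trains a small multinomial Naive Bayes classifier twice, once on all words and once with the keyword "traffic" dropped. The class `NaiveBayes`, the `drop` parameter, and the six training tweets are hypothetical; the thesis's actual features and training setup are not reproduced here.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text, drop=()):
    """Lowercase word tokens, skipping any words listed in `drop`."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in drop]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels, drop=()):
        self.drop = set(drop)
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokens(text, self.drop))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        n = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(count / n)  # log prior
            for w in tokens(text, self.drop):
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log(
                        (self.word_counts[label][w] + 1) / (total + len(self.vocab))
                    )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data, for illustration only.
texts = [
    "traffic jam on the highway again",
    "accident blocking two lanes traffic backed up",
    "road closed after crash on the bridge",
    "lovely dinner with friends tonight",
    "watching the game with friends",
    "new album is out tonight",
]
labels = ["traffic", "traffic", "traffic", "other", "other", "other"]

with_keyword = NaiveBayes().fit(texts, labels)
without_keyword = NaiveBayes().fit(texts, labels, drop={"traffic"})

query = "crash on the highway, lanes blocked"
print(with_keyword.predict(query), without_keyword.predict(query))
```

The same class could serve as a stand-in for the second-level classifier by training it on tweets labeled "congestion" and "incident" instead.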
Beyond binary classification, the study derived three city‑level indices:
- Congestion Index – calculated by combining tweet volume with time‑of‑day weighting. When compared to the Texas A&M Transportation Institute (TTI) Travel Time Index, the Twitter‑based index showed a Pearson correlation of 0.78, indicating strong agreement with traditional sensor‑based measures.
- Incident (Safety) Index – built from the frequency of accident‑related terms and sentiment polarity. This index correlated (r ≈ 0.71) with Allstate’s safety rankings, suggesting that social media can reflect relative safety conditions across metropolitan areas.
- Traffic Perception Index – a novel metric that quantifies how commuters feel about traffic conditions. Sentiment analysis scores were normalized and averaged per city, producing a ranking of the top 30 cities that aligns closely with survey‑based perception studies.
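To make the validation step concrete, the following sketch computes a time-weighted congestion score per city and its Pearson correlation with a reference index. The formula (peak-hour tweets weighted 1.5, normalized per 1,000 geo-tagged tweets), the city names, the tweet hours, and the reference values are all placeholders chosen for illustration; they are not the thesis's formula or data, and the resulting correlation says nothing about the reported 0.78.

```python
import math


def hour_weight(hour):
    """Hypothetical time-of-day weight: peak commuting hours count more."""
    return 1.5 if hour in (7, 8, 9, 16, 17, 18) else 1.0


def congestion_index(traffic_tweet_hours, total_tweets):
    """Time-weighted traffic tweets per 1,000 geo-tagged tweets (assumed form)."""
    weighted = sum(hour_weight(h) for h in traffic_tweet_hours)
    return 1000.0 * weighted / total_tweets


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Placeholder data: hour of day of each traffic tweet, and total tweets per city.
cities = {
    "City A": ([8, 8, 17, 12], 2000),
    "City B": ([9, 14], 2000),
    "City C": ([7, 17, 18, 8, 13, 22], 2000),
}
# Placeholder reference values standing in for an external congestion ranking.
reference = {"City A": 1.20, "City B": 1.10, "City C": 1.35}

scores = {name: congestion_index(hours, total) for name, (hours, total) in cities.items()}
names = sorted(cities)
r = pearson([scores[n] for n in names], [reference[n] for n in names])
print(scores, round(r, 3))
```

The same `pearson` helper would apply unchanged to the safety index, with accident-term frequencies in place of weighted tweet counts.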
Topic modeling using Latent Dirichlet Allocation (LDA) identified dominant themes for each city (e.g., “accidents”, “grid‑lock”, “construction”). These topics provide qualitative insight into the primary traffic challenges faced locally.
Visualization was implemented through a Flask‑based web dashboard coupled with Leaflet maps. Real‑time tweet streams were plotted geographically, and temporal patterns were displayed via histograms and peak‑hour analyses. The spatial distribution of tweets mirrored actual vehicle flows, as illustrated by the similarity between tweet density maps and known traffic corridors. Detailed “traffic signatures” for Atlanta were generated for six‑hour intervals, revealing consistent diurnal patterns that match conventional traffic counts.
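A six-hour "traffic signature" of the kind mentioned above could be computed roughly as follows. The function name `six_hour_signature`, the window labels, and the timestamps are hypothetical, and representing a signature as the share of tweets per window is an assumption; the summary does not define the exact quantity plotted.

```python
from collections import Counter
from datetime import datetime

WINDOWS = ["00-06", "06-12", "12-18", "18-24"]


def six_hour_signature(timestamps):
    """Share of tweets falling in each six-hour window of the day."""
    if not timestamps:
        return {label: 0.0 for label in WINDOWS}
    counts = Counter(WINDOWS[ts.hour // 6] for ts in timestamps)
    total = len(timestamps)
    return {label: counts[label] / total for label in WINDOWS}


# Invented timestamps for traffic tweets in one city on one day.
tweet_times = [
    datetime(2014, 10, 1, 7, 45),
    datetime(2014, 10, 1, 8, 10),
    datetime(2014, 10, 1, 13, 30),
    datetime(2014, 10, 1, 17, 5),
    datetime(2014, 10, 1, 17, 50),
    datetime(2014, 10, 1, 23, 15),
]
signature = six_hour_signature(tweet_times)
print(signature)
```

Comparing such per-window shares against traffic counts binned the same way is one simple way to check whether the two diurnal patterns agree.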
The thesis concludes that Twitter, and by extension other social media platforms, constitute a low‑cost, ubiquitous data source capable of supplementing traditional traffic monitoring systems. However, limitations include the relatively low proportion of geo‑tagged tweets, potential sampling bias toward urban areas, and the noisy, subjective nature of user‑generated content. Future work is suggested in three areas: (i) integration of additional platforms (e.g., Instagram, traffic‑specific apps), (ii) deployment of deep‑learning language models (e.g., BERT, GPT) to improve semantic understanding, and (iii) optimization of the real‑time processing pipeline to reduce latency. Overall, the study demonstrates that meaningful, city‑scale traffic metrics can be derived from publicly available textual data, offering a scalable complement to sensor‑heavy infrastructure.