Modeling Large-Scale Walking and Cycling Networks: A Machine Learning Approach Using Mobile Phone and Crowdsourced Data


Walking and cycling are known to bring substantial health, environmental, and economic advantages. However, the development of evidence-based active transportation planning and policies has been impeded by significant data limitations, such as biases in crowdsourced data and representativeness issues of mobile phone data. In this study, we develop and apply a machine learning based modeling approach for estimating daily walking and cycling volumes across a large-scale regional network in New South Wales, Australia that includes 188,999 walking links and 114,885 cycling links. The modeling methodology leverages crowdsourced and mobile phone data as well as a range of other datasets on population, land use, topography, climate, etc. The study discusses the unique challenges and limitations related to all three aspects of model training, testing, and inference given the large geographical extent of the modeled networks and relative scarcity of observed walking and cycling count data. The study also proposes a new technique to identify model estimate outliers and to mitigate their impact. Overall, the study provides a valuable resource for transportation modelers, policymakers and urban planners seeking to enhance active transportation infrastructure planning and policies with advanced emerging data-driven modeling methodologies.


💡 Research Summary

This paper presents a comprehensive machine‑learning framework for estimating daily walking and cycling volumes across a vast regional network in New South Wales, Australia. The authors integrate three essential components: observed walking and cycling counts, emerging big‑data sources (mobile phone records and Strava Metro), and a rich set of contextual variables (population density, income, land‑use mix, park proportion, points of interest, climate, air quality, topography, and household travel survey metrics). The study area comprises 188,999 walking links and 114,885 cycling links, making it the largest link‑level active‑transport model documented to date.

Observed counts (27,631 walking, 18,535 cycling) serve as the response variables for model training and validation. Mobile‑phone‑derived walking estimates (≈95 million daily link‑level records) and Strava‑derived cycling counts provide high‑frequency, spatially detailed covariates that capture underlying travel patterns but are known to be biased. To correct these biases, the authors incorporate socioeconomic and physical descriptors that have been shown in prior literature to correlate with sampling bias. Feature engineering includes mapping SA1‑level census attributes to each link, assigning the nearest weather and air‑quality station readings, and summarising temporal patterns (e.g., daily averages, variability).
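The nearest-station assignment mentioned above can be illustrated with a minimal sketch. It assumes each link is represented by a midpoint coordinate and that stations carry a single summary reading; the station IDs, coordinates, field names, and the use of great-circle distance are all illustrative assumptions, not details from the paper.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def nearest_station(link_midpoint, stations):
    """Return the station record closest to a link midpoint."""
    lat, lon = link_midpoint
    return min(stations, key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]))

# Hypothetical stations and link midpoints (illustrative values only).
stations = [
    {"id": "station_a", "lat": -33.87, "lon": 151.21, "mean_temp_c": 18.5},
    {"id": "station_b", "lat": -32.93, "lon": 151.78, "mean_temp_c": 19.1},
]
link_midpoints = {"link_1": (-33.90, 151.18), "link_2": (-32.95, 151.70)}

# Attach the nearest station's reading to each link as a feature.
link_temperature = {
    link_id: nearest_station(point, stations)["mean_temp_c"]
    for link_id, point in link_midpoints.items()
}
```

A brute-force search like this is adequate for a handful of stations; at the scale of hundreds of thousands of links, a spatial index would likely be preferable.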

The modeling core relies on gradient‑boosted decision tree algorithms (XGBoost, LightGBM). Hyper‑parameters are tuned via five‑fold cross‑validation, and model performance is assessed using MAE, RMSE, and R². Feature‑importance analysis reveals that the mobile‑phone walking estimate and Strava cycling count dominate predictive power, followed by population density, income, land‑use mix entropy, and climate variables.
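The evaluation loop described above (five-fold cross-validation scored with MAE, RMSE, and R²) can be sketched in plain Python. This is a simplified illustration, not the authors' pipeline: the `fit` and `predict` callables are hypothetical placeholders, and a mean-value baseline stands in where a gradient-boosted model would be trained.

```python
import math
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot

def cross_validate(X, y, fit, predict, k=5, seed=0):
    """Return fold-averaged (MAE, RMSE, R2) for a fit/predict pair."""
    scores = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        y_true = [y[i] for i in held_out]
        y_pred = predict(model, [X[i] for i in held_out])
        scores.append((mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred)))
    return tuple(sum(col) / k for col in zip(*scores))

# Placeholder model: predicts the training mean. A gradient-boosted
# regressor would be plugged in here instead (assumption for illustration).
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, X_test):
    return [model] * len(X_test)

X = [[float(i)] for i in range(20)]        # toy feature matrix
y = [10.0 + 2 * i for i in range(20)]      # toy daily counts
cv_mae, cv_rmse, cv_r2 = cross_validate(X, y, fit_mean, predict_mean)
```

Hyper-parameter tuning would repeat `cross_validate` for each candidate configuration and keep the one with the best averaged score.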

A novel contribution is the two‑stage outlier detection and mitigation procedure applied during inference on the full network. First, statistical outliers are identified using inter‑quartile range thresholds on predicted volumes. Second, spatial consistency is enforced by comparing each link’s prediction with the average of its immediate neighbours; links that deviate beyond a configurable tolerance are smoothed or flagged. This approach reduces unrealistic spikes that can arise from data gaps or model over‑fitting.
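A rough sketch of the two stages follows. The summary does not give the thresholds or the exact smoothing rule, so several choices here are assumptions: the conventional 1.5 × IQR fence, a relative-deviation tolerance, exclusion of stage-one outliers from the neighbour average, and replacement of a deviating prediction by that average. The link IDs and adjacency map are hypothetical.

```python
import statistics

def iqr_outliers(pred, k=1.5):
    """Stage 1: links whose prediction falls outside the k*IQR fences."""
    q1, _, q3 = statistics.quantiles(list(pred.values()), n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {link for link, v in pred.items() if v < low or v > high}

def spatial_deviants(pred, neighbours, exclude, tolerance=0.5):
    """Stage 2: links deviating from their neighbours' mean by more than
    `tolerance` (relative). Neighbours in `exclude` are ignored so that a
    single spike does not contaminate the averages of adjacent links."""
    flagged = {}
    for link, value in pred.items():
        nbr_values = [pred[n] for n in neighbours.get(link, []) if n not in exclude]
        if not nbr_values:
            continue
        nbr_mean = statistics.fmean(nbr_values)
        if nbr_mean > 0 and abs(value - nbr_mean) / nbr_mean > tolerance:
            flagged[link] = nbr_mean
    return flagged

def mitigate(pred, neighbours, k=1.5, tolerance=0.5):
    """Run both stages; smooth spatial deviants to their neighbour mean."""
    stat_flags = iqr_outliers(pred, k)
    spatial_flags = spatial_deviants(pred, neighbours, stat_flags, tolerance)
    smoothed = dict(pred)
    smoothed.update(spatial_flags)
    return smoothed, stat_flags, set(spatial_flags)

# Hypothetical chain of eight links with one unrealistic spike at "d".
pred = {"a": 100, "b": 110, "c": 105, "d": 900, "e": 95, "f": 102, "g": 98, "h": 104}
order = list(pred)
neighbours = {
    link: [order[j] for j in (i - 1, i + 1) if 0 <= j < len(order)]
    for i, link in enumerate(order)
}
smoothed, stat_flags, spatial_flags = mitigate(pred, neighbours)
```

In this toy network only link `d` is flagged by both stages, and its prediction is replaced by the mean of `c` and `e`. Whether to smooth or merely flag such links is a policy choice that the summary leaves configurable.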

Validation against the observed counts yields R² of 0.68 for walking and 0.71 for cycling, with mean absolute errors of 12.4 and 9.8 persons per day respectively—substantially better than earlier city‑scale studies. The authors also discuss the impact of COVID‑19 policy periods, incorporating dummy variables that capture pre‑pandemic, lockdown, and post‑lockdown phases, thereby demonstrating the model’s ability to reflect abrupt demand shifts.
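The period dummy variables can be illustrated as a simple one-hot encoding of the observation date. The phase boundaries below are hypothetical placeholders chosen for illustration; the summary does not state the dates the authors used, and the feature names are likewise assumptions.

```python
from datetime import date

# Hypothetical phase boundaries (illustrative only; not taken from the paper).
LOCKDOWN_START = date(2020, 3, 31)
LOCKDOWN_END = date(2021, 10, 11)

PHASES = ("pre_pandemic", "lockdown", "post_lockdown")

def covid_phase(day, start=LOCKDOWN_START, end=LOCKDOWN_END):
    """Classify a date into one of the three policy phases."""
    if day < start:
        return "pre_pandemic"
    if day <= end:
        return "lockdown"
    return "post_lockdown"

def phase_dummies(day):
    """One-hot encode the phase of `day` as 0/1 indicator features."""
    phase = covid_phase(day)
    return {f"is_{p}": int(p == phase) for p in PHASES}

example = phase_dummies(date(2019, 6, 1))
```

Exactly one indicator is set for any date, so a tree-based model can split on these features to separate demand levels between phases.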

The paper’s contributions are threefold: (1) extending active‑transport modeling to both walking and cycling at an unprecedented spatial resolution; (2) demonstrating a systematic bias‑correction pipeline that blends big‑data with traditional socioeconomic and environmental covariates; (3) introducing a scalable outlier‑handling technique that enhances the reliability of network‑wide predictions. Limitations include the uneven spatial distribution of observed count stations, temporal gaps in mobile‑phone data (limited to 2019‑2022), and the challenge of generalizing the model to other regions or to real‑time applications. Future work is suggested to incorporate streaming sensor data, explore deep‑learning architectures for spatio‑temporal dynamics, and integrate the model into transport‑policy simulation tools for scenario analysis.

Overall, the study provides a valuable, data‑driven methodology that can inform infrastructure investment, active‑transport policy design, and urban planning decisions by delivering high‑resolution, daily estimates of walking and cycling demand across a large regional network.

