Massive-STEPS: Massive Semantic Trajectories for Understanding POI Check-ins -- Dataset and Benchmarks
Understanding human mobility through Point-of-Interest (POI) trajectory modeling is increasingly important for applications such as urban planning, personalized services, and generative agent simulation. However, progress in this field is hindered by two key challenges: the over-reliance on older datasets from 2012-2013 and the lack of reproducible, city-level check-in datasets that reflect diverse global regions. To address these gaps, we present Massive-STEPS (Massive Semantic Trajectories for Understanding POI Check-ins), a large-scale, publicly available benchmark dataset built upon the Semantic Trails dataset and enriched with semantic POI metadata. Massive-STEPS spans 15 geographically and culturally diverse cities and features more recent (2017-2018) and longer-duration (24 months) check-in data than prior datasets. We benchmarked a wide range of POI models on Massive-STEPS using both supervised and zero-shot approaches, and evaluated their performance across multiple urban contexts. By releasing Massive-STEPS, we aim to facilitate reproducible and equitable research in human mobility and POI trajectory modeling. The dataset and benchmarking code are available at: https://github.com/cruiseresearchgroup/Massive-STEPS.
💡 Research Summary
The paper introduces Massive‑STEPS, a new benchmark dataset for point‑of‑interest (POI) trajectory modeling that directly addresses two persistent problems in the field: (1) the over‑reliance on outdated check‑in collections from 2012‑2013, and (2) the lack of reproducible, city‑level datasets that capture a diverse set of global regions. Massive‑STEPS is built on the Semantic Trails Dataset (STD), which already applied rigorous cleaning to the Global‑scale Check‑in Dataset (GSCD) to remove duplicated, implausible, or otherwise erroneous check‑ins. The authors augment STD with two additional time slices (2012‑2013 and 2017‑2018), yielding a continuous 24‑month span of real‑world user check‑ins.
Dataset scope and enrichment
The final collection covers 15 cities across Asia, Europe, the Middle East, North America, South America, and Oceania, deliberately including low‑resource locations such as Jakarta, Kuwait City, and Petaling Jaya. For each POI, the dataset provides coordinates, category identifiers, human‑readable category names, venue names, and street addresses by aligning the raw check‑ins with Foursquare’s Open Places repository. To protect privacy, user IDs and venue IDs are ordinally encoded, but the underlying mapping files are released for research use. Table 3 in the paper shows that each city contains on the order of 10⁴–10⁵ check‑ins, with an average trajectory length of roughly three check‑ins and an average inter‑check‑in interval of 2–5 hours, making the data suitable for both short‑term and long‑term mobility analyses.
Benchmark design
The authors evaluate three representative tasks:
- Supervised POI recommendation – given a user’s recent trajectory and historical visits, predict the next set of POIs. Standard ranking metrics (Recall@K, NDCG@K) are used.
- Zero‑shot POI recommendation – leverage large pre‑trained language models (GPT‑3/4, LLaMA, FLAN‑T5) without fine‑tuning, prompting them with city, category, and temporal context to generate recommendations.
- Spatio‑temporal classification & reasoning – treat a check‑in sequence as input and predict the time‑of‑day, activity type, or answer open‑ended “what will the user likely do next?” questions, thereby testing a model’s reasoning capabilities.
A broad spectrum of models is benchmarked: classic collaborative‑filtering and Markov‑Chain baselines, embedding‑based approaches (POI2Vec), recurrent and transformer‑based sequence models (GRU‑4Rec, Transformer‑XL), graph neural networks that encode spatial adjacency (ST‑GCN, Hypergraph NN), and the aforementioned LLMs used in a zero‑shot setting.
Key findings
- Temporal relevance – The inclusion of 2017‑2018 data enables longitudinal studies that were impossible with older datasets; models can be evaluated on how performance evolves across the two-year window.
- Geographic diversity matters – In cities where a single POI category dominates (e.g., New York, Tokyo), traditional baselines achieve high recall (≈0.45). In contrast, cities with a more even distribution of categories (e.g., Bandung, Kuwait City) see a steep drop in baseline performance (≈0.22), confirming the authors’ hypothesis that category heterogeneity makes user behavior less predictable.
- Graph‑based models – By exploiting spatial proximity and co‑visitation patterns, graph neural networks maintain relatively stable performance across both dominant‑category and balanced‑category cities (≈0.30 Recall@10), suggesting that spatial context mitigates the difficulty introduced by category diversity.
- LLM zero‑shot capability – Large language models, even without fine‑tuning, achieve competitive results (≈0.28 Recall@10) on balanced‑category cities, likely because they encode world‑knowledge about venue types and cultural habits. However, performance is highly sensitive to prompt engineering, indicating that systematic prompt optimization is a necessary research direction.
- New insight on “category evenness” – The authors quantify a negative correlation between the entropy of a city’s POI‑category distribution and model accuracy, providing empirical evidence that future POI recommendation systems should incorporate mechanisms (e.g., category‑aware attention, entropy regularization) to handle cities lacking a dominant venue type.
Reproducibility
All raw check‑ins, cleaning scripts, city boundary shapefiles (derived from GeoNames), and the mapping to Foursquare Open Places are released on GitHub together with Docker images and a detailed README. This addresses a common criticism in the mobility literature: many prior works either do not disclose preprocessing steps or rely on datasets that are no longer publicly accessible.
Limitations and future work
The dataset, while newer than most alternatives, still reflects the state of POIs as of 2018; many venues have since closed or changed categories, which may affect downstream evaluation if not accounted for. The anonymization process removes original venue names, limiting research that wishes to exploit textual descriptions directly. The authors suggest extending the benchmark with multimodal signals (photos, user reviews, social media text) and establishing a pipeline for periodic POI updates.
Conclusion
Massive‑STEPS delivers a high‑quality, globally diverse, and richly annotated POI trajectory dataset that supersedes the aging Foursquare‑NYC/Tokyo and GSCD collections. By providing a comprehensive benchmark across supervised, zero‑shot, and reasoning tasks, the paper not only demonstrates the utility of modern LLMs for mobility prediction but also uncovers a previously under‑explored factor—city‑level POI category evenness—that significantly influences model performance. The open release of data, code, and evaluation protocols sets a new standard for reproducible, equitable research in human mobility and POI recommendation.
Comments & Academic Discussion
Loading comments...
Leave a Comment