GPSBench: Do Large Language Models Understand GPS Coordinates?
Large Language Models (LLMs) are increasingly deployed in applications that interact with the physical world, such as navigation, robotics, and mapping, making robust geospatial reasoning a critical capability. Despite this, LLMs’ ability to reason about GPS coordinates and real-world geography remains underexplored. We introduce GPSBench, a dataset of 57,800 samples across 17 tasks for evaluating geospatial reasoning in LLMs, spanning geometric coordinate operations (e.g., distance and bearing computation) and reasoning that integrates coordinates with world knowledge. Focusing on intrinsic model capabilities rather than tool use, we evaluate 14 state-of-the-art LLMs and find that GPS reasoning remains challenging, with substantial variation across tasks: models are generally more reliable at real-world geographic reasoning than at geometric computations. Geographic knowledge degrades hierarchically, with strong country-level performance but weak city-level localization, while robustness to coordinate noise suggests genuine coordinate understanding rather than memorization. We further show that GPS-coordinate augmentation can improve performance on downstream geospatial tasks, and that finetuning induces trade-offs between gains in geometric computation and degradation in world knowledge. Our dataset and reproducible code are available at https://github.com/joey234/gpsbench
💡 Research Summary
The paper introduces GPSBench, a large‑scale benchmark designed to evaluate the intrinsic ability of large language models (LLMs) to reason about GPS coordinates and real‑world geography. Existing spatial benchmarks focus on tabletop or visual tasks and do not test global‑scale, spherical geometry or the integration of coordinate data with world knowledge. GPSBench fills this gap with 57,800 samples spanning 17 tasks, organized into two tracks: a “Pure GPS” track (9 tasks) that requires only mathematical manipulation of coordinates, and an “Applied” track (8 tasks) that couples coordinates with factual geographic information. All data are derived from the GeoNames database, covering 18,196 cities across six continents, and ground‑truth answers are computed using standard geodetic formulas (Haversine distance, forward azimuth, spherical interpolation, L’Huilier’s theorem for polygon area, etc.) or database look‑ups.
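To make the "Pure GPS" track concrete, two of the geodetic formulas named above, Haversine distance and forward azimuth, can be sketched in a few lines of Python. This is an illustrative sketch of the standard spherical-Earth formulas, not the paper's actual ground-truth code; the function names and the mean Earth radius of 6371 km are our choices.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in kilometres,
    using the Haversine formula on a spherical Earth."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def forward_azimuth_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise
    from true north (the 'forward azimuth')."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = (math.cos(p1) * math.sin(p2)
         - math.sin(p1) * math.cos(p2) * math.cos(dlmb))
    return math.degrees(math.atan2(y, x)) % 360.0
```

A benchmark item of the form "What is the distance between London (51.5074, -0.1278) and Paris (48.8566, 2.3522)?" would have its ground truth generated by `haversine_km`, which yields roughly 344 km for that pair.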
The authors evaluate 14 state‑of‑the‑art LLMs, including OpenAI’s GPT‑5.1, GPT‑4.1, GPT‑5‑mini, GPT‑5‑nano, Google’s Gemini‑2.5 (Flash and Pro), Anthropic’s Claude‑4.5‑Haiku, Qwen3 (235B, 30B, 14B, 8B) and Mistral 2 (large and small). Models are tested in a zero‑shot setting with a uniform system prompt and a task‑specific user prompt; no chain‑of‑thought or few‑shot examples are used. For multiple‑choice tasks, accuracy is reported; for numeric tasks, Mean Absolute Percentage Error (MAPE) is transformed to 1‑MAPE so that higher scores always indicate better performance, enabling direct comparison across task types.
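The 1−MAPE transformation described above can be sketched as follows. Note this is our reading of the metric: in particular, clipping scores at zero (so that very large relative errors do not yield negative scores) is an assumption on our part, not something stated in the summary.

```python
def one_minus_mape(predictions, targets):
    """Score numeric answers as 1 - MAPE so that higher is better,
    matching the direction of accuracy on multiple-choice tasks.
    Clipping at 0 for wildly wrong predictions is an assumption."""
    errors = [abs(p - t) / abs(t) for p, t in zip(predictions, targets)]
    mape = sum(errors) / len(errors)
    return max(0.0, 1.0 - mape)
```

With this normalization, a model that predicts 110 km for a true distance of 100 km scores 0.9, and both numeric and multiple-choice tasks land on a common 0-to-1 scale.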
Key findings:
- Geographic knowledge vs. geometric computation – Models perform relatively well on country‑level location identification (often >80 % accuracy) but struggle at city‑level (≈50 % or lower). Pure geometric tasks such as distance, bearing, and polygon area yield modest scores (generally below 60 % accuracy), with especially poor results on spherical interpolation and area calculations.
- Model size and family effects – Larger, more recent models (GPT‑5.1, Gemini‑Flash, Qwen3‑235B) achieve the highest overall scores, while smaller variants (GPT‑5‑nano, Qwen3‑8B) lag considerably across all tasks.
- Robustness to coordinate noise – Adding small random perturbations (±0.01°) to input coordinates causes only minimal performance degradation (<2 % absolute drop), suggesting that models are not merely memorizing textual patterns but have some numeric understanding of coordinates.
- Downstream augmentation – Augmenting existing geographic QA datasets with GPS coordinates as additional inputs improves accuracy by 3–5 %, indicating that coordinate information can complement world‑knowledge reasoning.
- Fine‑tuning trade‑offs – Fine‑tuning on pure GPS computation data boosts numeric task performance by >10 % but simultaneously reduces geographic‑knowledge task accuracy by 5–7 %, revealing a tension between learning precise geodetic calculations and retaining factual place knowledge.
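The coordinate-noise probe behind the robustness finding can be sketched as below. The ±0.01° magnitude comes from the summary; the uniform-noise distribution and the helper's name are our assumptions about how such a perturbation would be implemented.

```python
import random

def perturb_coordinate(lat, lon, eps=0.01, rng=None):
    """Add uniform noise of up to +/- eps degrees to a (lat, lon) pair.
    At eps = 0.01 this moves a point by at most ~1.1 km in latitude,
    small enough that ground-truth answers barely change."""
    rng = rng or random.Random()
    return (lat + rng.uniform(-eps, eps),
            lon + rng.uniform(-eps, eps))

# Re-running a benchmark item on perturbed inputs: if a model's answer
# changes drastically for such a tiny shift, it is likely pattern-matching
# memorized coordinate strings rather than reasoning over the numbers.
noisy_lat, noisy_lon = perturb_coordinate(48.8566, 2.3522, rng=random.Random(0))
```

The <2% absolute performance drop reported under this perturbation is what supports the claim that models read coordinates as numbers rather than as memorized tokens.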
The authors interpret these results as evidence that current LLMs possess limited “survey‑level” spatial cognition: they can recall coarse‑grained location facts but lack robust internal models of spherical geometry and fine‑grained place mapping. The benchmark also uncovers geographic bias (higher performance on Western/North‑American locations) reflecting training data distributions. Consequently, for latency‑sensitive or offline applications such as navigation assistants, robotics, or GIS automation, relying solely on LLMs is insufficient; integration with external GIS tools or dedicated coordinate‑processing modules remains necessary.
GPSBench, along with all code and data, is released publicly (https://github.com/joey234/gpsbench) to enable reproducible research and to spur further work on improving LLMs’ geospatial reasoning capabilities.