Title: OsmT: Bridging OpenStreetMap Queries and Natural Language with Open-source Tag-aware Language Models
ArXiv ID: 2512.04738
Date: 2025-12-04
Authors: Zhuoyue Wan∗, Wentao Hu∗, Chen Jason Zhang∗, Yuanfeng Song†¶, Shuaimin Li‡, Ruiqiang Xiao§, Xiao-Yong Wei∗, Raymond Chi-Wing Wong§. Affiliations: ∗The Hong Kong Polytechnic University; †WeBank, Shenzhen, China; ‡Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; §The Hong Kong University of Science and Technology
📝 Abstract
Bridging natural language and structured query languages is a long-standing challenge in the database community. While recent advances in language models have shown promise in this direction, existing solutions often rely on large-scale closed-source models that suffer from high inference costs, limited transparency, and lack of adaptability for lightweight deployment. In this paper, we present OsmT, an open-source tag-aware language model specifically designed to bridge natural language and Overpass Query Language (OverpassQL), a structured query language for accessing large-scale OpenStreetMap (OSM) data. To enhance the accuracy and structural validity of generated queries, we introduce a Tag Retrieval Augmentation (TRA) mechanism that incorporates contextually relevant tag knowledge into the generation process. This mechanism is designed to capture the hierarchical and relational dependencies present in the OSM database, addressing the topological complexity inherent in geospatial query formulation. In addition, we define a reverse task, OverpassQL-to-Text, which translates structured queries into natural language explanations to support query interpretation and improve user accessibility. We evaluate OsmT on a public benchmark against strong baselines and observe consistent improvements in both query generation and interpretation. Despite using significantly fewer parameters, our model achieves competitive accuracy, demonstrating the effectiveness of open-source pre-trained language models in bridging natural language and structured query languages within schema-rich geospatial environments.
📄 Full Content
OsmT: Bridging OpenStreetMap Queries and Natural Language with Open-source Tag-aware Language Models
Zhuoyue Wan∗, Wentao Hu∗, Chen Jason Zhang∗, Yuanfeng Song†¶, Shuaimin Li‡, Ruiqiang Xiao§, Xiao-Yong Wei∗, Raymond Chi-Wing Wong§
∗The Hong Kong Polytechnic University, Hong Kong, China
†WeBank, Shenzhen, China
‡Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
§The Hong Kong University of Science and Technology, Hong Kong, China
Index Terms: structured query generation, natural language interfaces, Text-to-OverpassQL, OverpassQL-to-Text, language model
I. INTRODUCTION
Structured query languages are essential interfaces for managing and interacting with complex databases. Aligning natural language with structured queries has become a prominent research focus in both academia and industry, motivated by the need to lower expertise barriers and give non-technical users intuitive database access. A broad body of work addresses this challenge [1]–[12]. These studies demonstrate sustained scholarly and practical interest in natural language interfaces for structured data, and such interfaces have been widely adopted across diverse real-world applications, ranging from business analytics to scientific data management. Among the various types of structured data, geospatial databases are particularly critical because of their foundational role in supporting large-scale spatial analysis, complex query execution, and location-based decision-making. These capabilities underpin a wide range of downstream applications, including geospatial knowledge extraction, urban mobility modeling, and spatio-temporal forecasting [13]–[19].
¶ Corresponding author.
[Figure 1: scatter plot. x-axis: Number of Parameters (Billions, Log Scale), ticks from 10^0 to 10^4. y-axis: Average Performance Score (EM, chrF, KVS, TreeS, OQS), ticks from 58 to 66. Models shown: LLaMA-3.1-8B, Qwen2.5-72B, Qwen3-235B, DeepSeek-V3, GPT-4.1, GPT-4o, Claude-4-sonnet, GPT-4, CodeT5-base, CodeT5+, ByT5-small, OverpassT5, OsmT-small, OsmT-base. Legend: OsmT (Ours), Other Models, Unknown Params. Annotation: High Performance, Low Parameters.]
Fig. 1: Model performance vs. parameter size (log scale). Comparison of OsmT with state-of-the-art open-source and closed-source models on the Text-to-OverpassQL task. Average performance is over five metrics.
A prominent example of a geospatial database is OpenStreetMap (OSM), a collaboratively maintained, open-access platform that provides foundational infrastructure for spatial data analysis. OSM supports sophisticated spatial query functionality, including location filtering, proximity searches, and routing, as exemplified by widely used applications such as OsmAnd¹ and Locus Map². Retrieving structured geospatial data from OSM typically relies on the Overpass Query Language (OverpassQL), a domain-specific language designed for fine-grained spatial data extraction through filters, scoped conditions, and recursive constructs. While OverpassQL is highly expressive, it requires users to know OSM’s schema and tagging structure in detail, which creates significant usability barriers for non-expert users.
¹ https://osmand.net/
² https://web.locusmap.app/en/
arXiv:2512.04738v1 [cs.CL] 4 Dec 2025
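To make the three OverpassQL constructs named above concrete, the following minimal Python sketch assembles a query that uses a tag filter, a bounding-box scope, and a recurse-down step. It is an illustration only and not part of OsmT: the helper name `build_overpass_query`, the `amenity=cafe` tag, and the bounding-box coordinates are assumptions chosen for the example, and the query is only built as a string, not sent to any server.

```python
def build_overpass_query(key, value, bbox, timeout=25):
    """Build an OverpassQL query for elements tagged key=value inside bbox.

    Hypothetical helper for illustration. bbox is (south, west, north, east)
    in decimal degrees, which is the order OverpassQL expects.
    """
    south, west, north, east = bbox
    scope = f"({south},{west},{north},{east})"   # scoped condition: bounding box
    tag_filter = f'["{key}"="{value}"]'          # filter: exact tag match
    lines = [
        f"[out:json][timeout:{timeout}];",       # global settings
        "(",                                     # union of the two statements below
        f"  node{tag_filter}{scope};",
        f"  way{tag_filter}{scope};",
        ");",
        "out body;",                             # print the matched elements
        ">;",                                    # recursive construct: recurse down to member nodes
        "out skel qt;",                          # print the geometry skeleton of those members
    ]
    return "\n".join(lines)


# Illustrative bounding box (assumed values, roughly central Hong Kong).
query = build_overpass_query("amenity", "cafe", (22.27, 114.15, 22.29, 114.19))
print(query)
```

Even this small query requires knowing that cafes are tagged `amenity=cafe` and that a bounding box is ordered south, west, north, east, which is the kind of schema and tagging knowledge the paragraph above identifies as a barrier for non-experts.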
To improve