Benchmarking Large Language Models for Geolocating Colonial Virginia Land Grants
Virginia’s seventeenth- and eighteenth-century land patents survive primarily as narrative metes-and-bounds descriptions, limiting spatial analysis. This study systematically evaluates current-generation large language models (LLMs) at converting these prose abstracts into geographically accurate latitude/longitude coordinates within a focused evaluation context. A digitized corpus of 5,471 Virginia patent abstracts (1695–1732) is released, with 43 rigorously verified test cases serving as an initial, geographically focused benchmark. Six OpenAI models spanning three architectures (o-series, GPT-4-class, and GPT-3.5) were tested under two paradigms: direct-to-coordinate prompting and tool-augmented chain-of-thought prompting that invokes external geocoding APIs. Results were compared against a GIS-analyst baseline, the Stanford NER geoparser, the Mordecai-3 neural geoparser, and a county-centroid heuristic. The top single-call model, o3-2025-04-16, achieved a mean error of 23 km (median 14 km), outperforming the median LLM (37.4 km) by 37.5%, the weakest LLM (50.3 km) by 53.5%, and the external baselines by 67% (GIS analyst) and 70% (Stanford NER). A five-call ensemble further reduced error to 19.2 km (median 12.2 km) at minimal additional cost (~USD 0.20 per grant), outperforming the median LLM by 48.7%. A patentee-name redaction ablation increased error only slightly (~7%), indicating reliance on textual landmark and adjacency descriptions rather than memorization. The cost-effective gpt-4o-2024-08-06 maintained a 28 km mean error at USD 1.09 per 1,000 grants, establishing a strong cost-accuracy benchmark. External geocoding tools offered no measurable benefit in this evaluation. These findings demonstrate LLMs’ potential for scalable, accurate, and cost-effective historical georeferencing.
💡 Research Summary
This paper tackles the long‑standing bottleneck in historical geographic information systems (GIS) that stems from colonial Virginia land patents being recorded solely as narrative metes‑and‑bounds descriptions. The authors digitized 5,471 patent abstracts dated 1695‑1732, released the full OCR text on GitHub and Zenodo, and created a rigorously validated benchmark of 43 randomly selected patents with authoritative latitude/longitude points derived from GIS polygons produced by the nonprofit One Shared Story (OSS) and cross‑checked by scholars.
Six contemporary OpenAI large language models (LLMs) were evaluated: three from the “o‑series” (o1‑2024‑11‑12, o2‑2025‑03‑01, o3‑2025‑04‑16), two GPT‑4‑class models (gpt‑4‑turbo‑2024‑07‑01, gpt‑4o‑2024‑08‑06), and one GPT‑3.5 model (gpt‑3.5‑turbo‑2024‑06‑01). Two prompting paradigms were compared. The “direct‑to‑coordinate” approach supplies the abstract and asks the model to output a single latitude/longitude pair in one call. The “tool‑augmented chain‑of‑thought” approach lets the model interleave step‑by‑step reasoning with live calls to external geocoding services (Google Places, Nominatim, etc.). Each model was run both as a single call and as a five‑call ensemble (averaging five independent predictions).
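The five-call ensemble described above averages five independent latitude/longitude predictions for each grant. A minimal sketch of that aggregation step is below; the arithmetic mean is a reasonable aggregator when predictions cluster within a degree or two, and the function name and sample values are illustrative, not code or data from the paper:

```python
def ensemble_mean(predictions):
    """Aggregate independent (lat, lon) predictions by arithmetic mean.

    Adequate at this scale; for widely scattered points, averaging
    3-D unit vectors (or taking a component-wise median) is safer.
    """
    lats, lons = zip(*predictions)
    return sum(lats) / len(lats), sum(lons) / len(lons)

# Five hypothetical single-call predictions for one grant:
runs = [(37.0, -77.0), (37.2, -77.2), (36.8, -76.8),
        (37.1, -77.1), (36.9, -76.9)]
lat, lon = ensemble_mean(runs)
```

A component-wise median would additionally damp the occasional wild outlier prediction, at the cost of discarding information from the other runs.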
Performance was measured primarily by mean and median great‑circle error, supplemented with 95% bootstrap confidence intervals, cumulative‑error curves, and cost‑latency Pareto frontiers. The top performer, o3‑2025‑04‑16, achieved a mean error of 23 km and a median error of 14 km in the single‑call setting, outperforming the median LLM (37.4 km mean error) by 37.5%. The weakest model recorded a 50.3 km mean error, which the top model undercut by 53.5%. A five‑call ensemble reduced mean error to 19.2 km (median 12.2 km) for a marginal additional cost of roughly USD 0.20 per grant.
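The two core metrics above, great-circle error and a bootstrap confidence interval on the mean, can be sketched with a standard haversine formula and a percentile bootstrap. This is an illustrative reimplementation under those standard definitions, not the authors' evaluation code:

```python
import random
from math import radians, sin, cos, asin, sqrt

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine great-circle distance in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (sin(radians(lat2 - lat1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * asin(sqrt(a))

def bootstrap_mean_ci(errors, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean error."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(errors, k=len(errors))) / len(errors)
        for _ in range(n_resamples)
    )
    return (means[int(alpha / 2 * n_resamples)],
            means[int((1 - alpha / 2) * n_resamples) - 1])
```

With only 43 test cases, the bootstrap interval on the mean is wide, which is presumably why the paper reports it alongside the point estimates.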
Cost analysis considered token usage and model pricing. The gpt‑4o‑2024‑08‑06 model emerged as the most cost‑effective, delivering a 28 km mean error at USD 1.09 per 1,000 grants. The tool‑augmented chain‑of‑thought did not improve accuracy; in fact, it added ~2 km of error while incurring extra API charges, suggesting that the LLM’s internal spatial reasoning suffices for this task.
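The per-1,000-grant cost figure reduces to simple token arithmetic: average tokens per call times per-million-token prices. The sketch below shows the shape of that calculation with hypothetical token counts and prices, not the paper's actual numbers:

```python
def cost_per_1000_grants(prompt_tokens, completion_tokens,
                         usd_per_m_input, usd_per_m_output):
    """USD to geolocate 1,000 abstracts at one API call each, given
    average token counts per call and per-million-token prices."""
    per_call = (prompt_tokens / 1e6 * usd_per_m_input
                + completion_tokens / 1e6 * usd_per_m_output)
    return 1000 * per_call

# Hypothetical example: 1,000 input and 100 output tokens per call
# at USD 1.00 / USD 2.00 per million input/output tokens.
total = cost_per_1000_grants(1000, 100, 1.00, 2.00)
```

A five-call ensemble multiplies this figure roughly fivefold, which is consistent with ensemble cost remaining small in absolute terms when the single-call cost is on the order of a dollar per thousand grants.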
An ablation where patentee names were redacted increased error by about 7 %, indicating that models rely heavily on geographic descriptors (rivers, paths, adjacent parcels) rather than memorized personal names. Sensitivity tests varying temperature, token limits, and abstract length showed negligible impact on error, confirming robustness.
Baseline comparisons included a professional GIS analyst workflow, Stanford NER‑based GeoText, the Mordecai‑3 neural geoparser, and a simple county‑centroid heuristic. The GIS analyst produced a 70 km mean error; GeoText and Mordecai‑3 yielded 71 km and 68 km respectively. Thus, LLMs not only surpass traditional geoparsers in accuracy but also dramatically reduce processing time—from hours of manual work to seconds per abstract—achieving orders‑of‑magnitude gains in cost and latency.
The study contributes (1) a publicly available, copyright‑compliant dataset of colonial Virginia land patents; (2) a high‑quality, geographically validated benchmark; (3) the first systematic evaluation of LLMs on historical land‑grant geolocation, covering both direct and tool‑augmented prompting; and (4) a detailed trade‑off analysis of spatial error, monetary cost, and latency. Limitations include the modest size of the test set and focus on a single colony. Future work should expand to other colonies, explore fine‑tuning on domain‑specific corpora, incorporate multilingual models, and develop more sophisticated tool‑integration (e.g., vector‑search for historic place‑name variants). Overall, the findings demonstrate that contemporary LLMs can reliably and affordably automate the conversion of complex, centuries‑old textual land descriptions into usable geographic coordinates, opening new avenues for large‑scale digital history and spatial analysis.