Large Language Models for Geolocation Extraction in Humanitarian Crisis Response
Humanitarian crises demand timely and accurate geographic information to inform effective response efforts. Yet, automated systems that extract locations from text often reproduce existing geographic and socioeconomic biases, leading to uneven visibility of crisis-affected regions. This paper investigates whether Large Language Models (LLMs) can address these geographic disparities in extracting location information from humanitarian documents. We introduce a two-step framework that combines few-shot LLM-based named entity recognition with an agent-based geocoding module that leverages context to resolve ambiguous toponyms. We benchmark our approach against state-of-the-art pretrained and rule-based systems using both accuracy and fairness metrics across geographic and socioeconomic dimensions. Our evaluation uses an extended version of the HumSet dataset with refined literal toponym annotations. Results show that LLM-based methods substantially improve both the precision and fairness of geolocation extraction from humanitarian texts, particularly for underrepresented regions. By bridging advances in LLM reasoning with principles of responsible and inclusive AI, this work contributes to more equitable geospatial data systems for humanitarian response, advancing the goal of leaving no place behind in crisis analytics.
💡 Research Summary
The paper addresses a critical gap in humanitarian crisis response: the extraction of accurate and equitable geolocation information from textual reports. Existing pipelines that rely on traditional named‑entity recognition (NER) models (e.g., SpaCy, RoBERTa) and rule‑based geocoding suffer from systematic geographic and socioeconomic biases, performing well for Western, English‑speaking regions while under‑performing for low‑ and middle‑income countries. To mitigate these disparities, the authors propose a two‑step framework that leverages large language models (LLMs) for both NER and context‑aware geocoding.
Methodology
- Document Pre‑processing and Chunking – Because LLMs have token limits, the authors introduce a dynamic chunking algorithm that selects the most suitable separator (double line break, period, etc.) to split documents into segments whose lengths fall within predefined minimum and maximum bounds. The algorithm also minimizes variance in chunk length, ensuring consistent input sizes for the downstream LLM.
- Few‑Shot LLM‑Based NER – The study evaluates two output formats for the LLM: (a) JSON, which allows longer chunks (1,000–2,000 characters) but returns only a list of literal toponyms without positional information, and (b) Markdown, which embeds delimiters directly in the text (e.g., @@…##) and therefore provides exact character offsets but requires shorter chunks (200–500 characters) to maintain fidelity. Prompt engineering follows a standardized template (see Appendix A) and uses a handful of annotated examples (few‑shot) to guide the model.
- Post‑Processing Alignment – For the JSON format, a dynamic‑programming alignment algorithm matches each extracted toponym to its first valid occurrence after the previous match, allowing skips and handling duplicate mentions. This avoids the errors of a naïve greedy approach, especially in sentences where the same place appears multiple times or where associative mentions (e.g., “the US Embassy”) coexist with literal ones. The Markdown approach generally bypasses this step, but when the output diverges from the input the same DP alignment is applied. Adjacent toponyms separated by commas or conjunctions are merged into a single entity to reflect human annotation conventions.
- Agent‑Based Geocoding – After NER, each tag is processed by a LangChain‑implemented “geolocation agent” that interacts with the GeoNames database via the Pelias API. The agent follows a “Search‑Select‑Finish” loop: (i) it issues a search query (optionally constrained by a country code), (ii) selects the most appropriate GeoNames ID based on LLM‑generated reasoning that incorporates surrounding context, and (iii) terminates with an explicit justification. The agent also tags each toponym as literal or associative, providing richer metadata for downstream analysis. While the agent improves disambiguation for ambiguous or low‑resource places, it inherits any structural biases present in GeoNames.
- Dataset and Evaluation – The authors use the HumSet corpus, focusing on 467 English documents that were previously annotated for toponym extraction. They refine the annotations by comparing the original labels with outputs from their GPT‑4o JSON model, correcting missing or mis‑classified entries via a custom GUI tool. Evaluation includes standard accuracy metrics (Precision, Recall, F1) and a suite of fairness metrics that assess performance across geographic regions (continent, country) and socioeconomic strata (World Bank income groups). Baselines comprise the SpaCy‑RoBERTa NER model and a rule‑based geocoder from prior work.
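The chunking and alignment steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names (`chunk_text`, `align_toponyms`), the separator list, and the simple cursor-based matching (a simplification of the paper's dynamic-programming alignment) are all assumptions based on the description.

```python
def chunk_text(text, max_len=500, separators=("\n\n", ". ")):
    """Split `text` on the first separator whose pieces all fit within
    `max_len`, then greedily re-merge small pieces up to that bound so
    chunk lengths stay roughly even."""
    for sep in separators:
        parts = [p for p in text.split(sep) if p]
        if len(parts) > 1 and all(len(p) <= max_len for p in parts):
            chunks, buf = [], parts[0]
            for part in parts[1:]:
                if len(buf) + len(sep) + len(part) <= max_len:
                    buf += sep + part          # merge small neighbours
                else:
                    chunks.append(buf)
                    buf = part
            chunks.append(buf)
            return chunks
    # Fallback: hard split when no separator yields valid pieces.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def align_toponyms(text, toponyms):
    """Match each extracted toponym to its first occurrence after the
    previous match, skipping names the model may have hallucinated."""
    spans, cursor = [], 0
    for name in toponyms:
        idx = text.find(name, cursor)
        if idx == -1:
            idx = text.find(name)   # allow out-of-order matches
            if idx == -1:
                continue            # drop unmatched extractions
        spans.append((name, idx, idx + len(name)))
        cursor = idx + len(name)
    return spans
```

The cursor-based pass shows why sequential matching matters: when the same place name appears twice (a common case in crisis reports), each JSON-listed mention is assigned to a distinct character span rather than repeatedly to the first occurrence.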
Results
- The LLM‑driven pipeline achieves a 7–9 percentage‑point increase in overall F1 compared to the baseline.
- Fairness analysis shows a substantial reduction in performance gaps: recall for low‑income countries improves by more than 15 percentage points, and the disparity index between high‑ and low‑income groups drops by roughly 30%.
- The JSON output format yields higher overall extraction accuracy due to larger chunk sizes, while the Markdown format provides more precise span annotations but suffers a slight dip in recall for very long documents.
- The agent’s reasoning traces reveal successful disambiguation of ambiguous toponyms (e.g., “Springfield” resolved to the correct state based on contextual cues) and effective handling of associative mentions.
Limitations and Future Work
- Language Scope – The study is limited to English documents; extending the approach to multilingual humanitarian reports (Arabic, French, Spanish, etc.) is essential for broader applicability.
- GeoNames Bias – Since the geocoding agent relies on GeoNames, any inherent Western‑centric bias in that gazetteer propagates to the final coordinates. Incorporating alternative open‑source gazetteers or crowdsourced datasets could mitigate this.
- Prompt Sensitivity – Few‑shot performance varies with prompt phrasing and the selection of exemplar sentences. Systematic ablation of prompt designs would improve reproducibility.
- Computational Cost – Repeated LLM calls and external API queries introduce latency and cost, which may be prohibitive for real‑time crisis monitoring. Future work could explore model distillation, caching strategies, or hybrid pipelines that invoke LLMs only for ambiguous cases.
- Evaluation Scale – The benchmark uses a relatively small subset (467 documents) of the full HumSet corpus. Scaling evaluation to the entire dataset and to other crisis‑specific corpora (e.g., UN OCHA reports) would strengthen claims of generalizability.
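One of the cost mitigations suggested above, caching, can be sketched with a memoized lookup so that repeated mentions of the same toponym do not trigger fresh API calls. This is an illustrative assumption, not the paper's implementation: `geocode`, its return value, and the call counter are placeholders standing in for a real Pelias/GeoNames request.

```python
from functools import lru_cache

api_calls = 0  # counts simulated gazetteer queries

@lru_cache(maxsize=4096)
def geocode(toponym, country_code=None):
    """Memoized lookup: repeated mentions of the same place are served
    from the cache instead of re-querying the gazetteer API."""
    global api_calls
    api_calls += 1
    # Placeholder for a real Pelias/GeoNames request.
    return (toponym, country_code)

geocode("Juba", "SS")
geocode("Juba", "SS")   # cache hit: no second API call
```

Because crisis reports mention the same handful of places many times, even this trivial cache can cut external queries substantially; a hybrid pipeline would additionally route only ambiguous toponyms to the LLM agent.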
Conclusion
The paper demonstrates that large language models, when coupled with a carefully engineered preprocessing, alignment, and agent‑based geocoding workflow, can substantially improve both the accuracy and fairness of location extraction from humanitarian texts. By explicitly measuring and reporting socioeconomic performance disparities, the authors set a valuable precedent for responsible AI in disaster response. The work opens promising avenues for multilingual extensions, bias‑aware gazetteer integration, and efficient deployment in operational humanitarian settings.