MilSCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

As large language models (LLMs) are applied to increasingly longer and more complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing LLMs’ ability to combine tactical and spatial reasoning across multiple sources and to reason over long-horizon, geospatially rich context. The benchmark includes a diverse set of question types across seven categories targeting both factual recall and multi-step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision-language models. Our findings highlight substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario-level long-context planning, and positioning MilSCORE as a challenging testbed for future work.


💡 Research Summary

MilSCORE introduces a novel benchmark designed to evaluate large language models (LLMs) and vision‑language models (VLMs) on long‑context, multimodal geospatial reasoning and planning tasks that mirror real‑world military operations. The dataset is built from an authentic training operation order (OPORD) scenario and includes 50 distinct operational maps, satellite imagery, textual orders, and structured GeoJSON overlays. Expert military analysts authored over 100 multi‑hop questions, each paired with a reference answer, and annotated them with a difficulty tier (single‑hop, single‑source multi‑hop, cross‑source multi‑hop) and one or more of seven spatial‑analysis categories: understanding where, measuring size/shape/distribution, determining relationships, finding optimal locations/paths, detecting and quantifying patterns, making predictions, and unsolvable tasks.
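Based on the annotation scheme described above, a single MilSCORE question record might look like the following sketch. The class and field names are illustrative assumptions, not the released dataset's actual schema; only the tier and category labels come from the summary itself.

```python
from dataclasses import dataclass, field

# Labels taken from the summary; everything else is a hypothetical schema.
TIERS = ("single-hop", "single-source multi-hop", "cross-source multi-hop")
CATEGORIES = (
    "understanding where",
    "measuring size/shape/distribution",
    "determining relationships",
    "finding optimal locations/paths",
    "detecting and quantifying patterns",
    "making predictions",
    "unsolvable",
)

@dataclass
class MilScoreQuestion:
    question: str
    reference_answer: str
    tier: str                                        # one of TIERS
    categories: list = field(default_factory=list)   # one or more of CATEGORIES
    sources: list = field(default_factory=list)      # e.g. map IDs, OPORD annexes

    def __post_init__(self):
        # Validate annotations against the controlled vocabularies.
        assert self.tier in TIERS
        assert all(c in CATEGORIES for c in self.categories)
```

A cross-source multi-hop question would then carry multiple entries in `sources` (say, a map overlay plus an OPORD section), which is what distinguishes Tier 3 items.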

MilSCORE’s construction emphasizes realism: maps are drawn using the open‑source Military Map tool to conform to standard military symbology, and all materials are open‑source, ensuring reproducibility. Questions range from straightforward factual recall to complex planning that requires synthesizing information across maps, textual orders, and structured data while respecting doctrinal constraints such as unit boundaries and phase lines. The benchmark therefore tests not only factual grounding but also the ability to perform multi‑step, tool‑augmented reasoning.

The evaluation protocol employs a lightweight chain‑of‑thought (CoT) agent that can invoke external tools (e.g., PDF readers, spreadsheet parsers, map viewers) for up to ten iterations per query. The agent iteratively gathers evidence, reasons over it, and produces either free‑form text or structured output (e.g., an ordered list of phase lines). An LLM‑based grader then compares the model’s final answer against the expert reference, assigning a discrete score from 0 (completely incorrect) to 3 (completely correct).

Experiments on a 60‑question slice (20 per difficulty tier) were conducted with several state‑of‑the‑art VLMs: GPT‑4o, Claude Sonnet 4.5, Gemini 2.5 Flash, and Claude Haiku. GPT‑4o achieved the highest overall accuracy (58.3%) and excelled on Tier 3 cross‑source multi‑hop questions (75% correct), largely because its terse reasoning style moves quickly to tool calls, staying within the 10‑step budget. In contrast, Claude Sonnet 4.5 and Gemini 2.5 Flash spent many turns on narrative reasoning before invoking tools, often exhausting the iteration limit and yielding “max iterations reached” failures, especially on Tier 3. Claude Haiku performed poorly across all tiers. These results highlight that effective tool use and concise reasoning are critical for success on long‑context, multimodal tasks.
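Aggregating such results per tier is straightforward; the sketch below shows one plausible way to do it. Note the correctness criterion is an assumption: the summary does not restate how the 0-3 grader score maps to "accuracy", so here a question counts as correct only when it receives the top score of 3.

```python
from collections import defaultdict

def per_tier_accuracy(results):
    """Compute per-tier accuracy from (tier, score) pairs.

    Assumes a question is 'correct' only at the top grader score of 3;
    this mapping is illustrative, not the paper's stated criterion.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for tier, score in results:
        totals[tier] += 1
        correct[tier] += int(score == 3)
    return {t: correct[t] / totals[t] for t in totals}
```

With the reported 20-question tiers, a 75% Tier 3 result corresponds to 15 of 20 questions receiving the top score under this criterion.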

The authors identify two major challenges exposed by MilSCORE. First, current LLMs struggle with very long inputs; they must either truncate or inefficiently summarize large documents, leading to loss of essential evidence. Second, hallucination remains a serious issue when models generate geospatial facts not grounded in the provided maps or data. The benchmark therefore serves as a stress test for retrieval‑augmented generation (RAG), dynamic context windows, and better tool‑selection policies.

MilSCORE fills a gap in the geospatial AI evaluation landscape. Prior benchmarks such as GeoImageNet, GeoGLUE, SpatialEval, and GeoBenchX focus on single‑modal or passive reasoning, while newer suites like GeoAnalystBench and CBGB begin to assess agentic capabilities. MilSCORE pushes further by requiring sustained, cross‑modal planning over realistic military documents, making it a valuable testbed for future research on high‑stakes decision‑making, emergency response, and large‑scale infrastructure management. The paper suggests future directions including advanced RAG pipelines, multi‑agent collaboration, and integration with actual military training simulators to close the gap between benchmark performance and operational readiness.

