Neurosymbolic Inference on Foundation Models for Remote Sensing Text-to-Image Retrieval with Complex Queries

Reading time: 5 minutes
...

📝 Original Info

  • Title: Neurosymbolic Inference on Foundation Models for Remote Sensing Text-to-Image Retrieval with Complex Queries
  • ArXiv ID: 2512.14102
  • Date: 2025-12-16
  • Authors: Emanuele Mezzi, Gertjan Burghouts, Maarten Kruithof

📝 Abstract

Text-to-image retrieval in remote sensing (RS) has advanced rapidly with the rise of large vision-language models (LVLMs) tailored for aerial and satellite imagery, culminating in remote sensing large vision-language models (RS-LVLMs). However, limited explainability and poor handling of complex spatial relations remain key challenges for real-world use. To address these issues, we introduce RUNE (Reasoning Using Neurosymbolic Entities), an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability. For scalability, we propose a logic decomposition strategy that operates on conditioned subsets of detected entities, guaranteeing shorter execution time compared to neural approaches. Rather than using foundation models for end-to-end retrieval, we leverage them only to generate FOL expressions, delegating reasoning to a neurosymbolic inference module. For evaluation, we repurpose the DOTA dataset, originally designed for object detection, by augmenting it with more complex queries than in existing benchmarks. We show the LLM's effectiveness in text-to-logic translation and compare RUNE with state-of-the-art RS-LVLMs, demonstrating superior performance. We introduce two metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), which evaluate performance relative to query complexity and image uncertainty. RUNE outperforms joint-embedding models in complex RS retrieval tasks, offering gains in performance, robustness, and explainability. We show RUNE's potential for real-world RS applications through a use case on post-flood satellite image retrieval.
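
The abstract describes the mechanism only at a high level. As a minimal sketch of the idea (not the authors' implementation), the snippet below assumes a hypothetical `translate_to_fol` stand-in for the LLM text-to-logic step and per-image entity detections, and retrieves the images whose detected entities satisfy the query's logic; the class filtering inside the formula only loosely mirrors the paper's logic decomposition over conditioned subsets of detections.

```python
# Sketch of a RUNE-style retrieval loop (hypothetical names, not the paper's code):
# an LLM turns a text query into a logic expression, and a symbolic checker decides,
# per image, whether the detected entities satisfy it.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Entity:
    label: str                              # detector class, e.g. "ship", "harbor"
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

# A "compiled" logic expression is represented as a boolean test over the
# entities detected in a single image.
Formula = Callable[[List[Entity]], bool]

def near(a: Entity, b: Entity, max_dist: float = 50.0) -> bool:
    """Toy spatial predicate: box centers within max_dist pixels of each other."""
    ax, ay = (a.box[0] + a.box[2]) / 2, (a.box[1] + a.box[3]) / 2
    bx, by = (b.box[0] + b.box[2]) / 2, (b.box[1] + b.box[3]) / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 <= max_dist

def translate_to_fol(query: str) -> Formula:
    """Stand-in for the LLM text-to-logic step. In RUNE an LLM produces a
    First-Order Logic expression; here one example is hard-coded for the query
    'an image with at least two ships near a harbor'."""
    def formula(entities: List[Entity]) -> bool:
        # Condition on the classes named in the query so the check only
        # touches the relevant subset of detections.
        ships = [e for e in entities if e.label == "ship"]
        harbors = [e for e in entities if e.label == "harbor"]
        return any(sum(near(s, h) for s in ships) >= 2 for h in harbors)
    return formula

def retrieve(query: str, detections: Dict[str, List[Entity]]) -> List[str]:
    """Return ids of images whose detected entities satisfy the query's logic."""
    formula = translate_to_fol(query)
    return [image_id for image_id, entities in detections.items() if formula(entities)]
```

In the actual system the formula would be produced by prompting an LLM with the text query rather than hand-written, and evaluating it against the detections is delegated to the neurosymbolic inference module.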


📄 Full Content

Neurosymbolic Inference on Foundation Models for Remote Sensing Text-to-Image Retrieval with Complex Queries

Emanuele Mezzi, Vrije Universiteit Amsterdam, The Netherlands, e.mezzi@vu.nl
Gertjan Burghouts, TNO, The Netherlands, gertjan.burghouts@tno.nl
Maarten Kruithof, TNO, The Netherlands, maarten.kruithof@tno.nl

Accepted for publication in ACM Transactions on Spatial Algorithms and Systems (TSAS). arXiv:2512.14102v1 [cs.CV], 16 Dec 2025.

Keywords: Text-to-Image Retrieval · Neurosymbolic Reasoning · Spatial Reasoning · Large Language Models · Large Visual Language Models

1 Introduction

Given their impressive capabilities, foundation models and large vision-language models (LVLMs) are seeing rapidly expanding adoption, with applications spanning nearly every area of scientific research and practical implementation [1, 2, 3, 4]. In recent years, LVLMs have been successfully applied to remote sensing (RS) [5, 6], where GeoAI foundation models can handle a variety of tasks, notably text-to-image retrieval, which enables automated extraction of actionable information from RS data. Models such as RemoteCLIP [7], GeoRSCLIP [8], and RS-LLaVA [9] have shown promising retrieval performance, achieving high recall and ensuring good coverage of relevant images given a textual query. These results highlight that, even in RS, general-purpose LVLMs can deliver strong performance when fine-tuned on data tailored to the task.

However, RS-LVLMs face several key limitations. First, they require large amounts of training data, which is often unavailable in the target domain, motivating the need for models that can generalize without domain-specific training. Second, existing LVLMs rely on joint image-text embeddings, where both the query and the image are projected into a shared embedding space and their similarity is computed. While effective, this approach encodes the detailed semantics of the query only implicitly, limiting both representational power and interpretability. Explainability is critical for real-world applications, where human-understandable reasoning is often required. More importantly, we show that LVLMs struggle with semantically complex queries, especially those involving multiple objects and relationships. Previous evaluations have largely focused on recall (R@k), which provides useful insights into coverage and false negatives (FNs) but fails to capture a model's ability to limit false positives (FPs), an important consideration in operational settings, where manual intervention on every alert is impractical. To address this, we conduct a thorough evaluation of RS-LVLMs on a dataset characterized by queries with complex spatial relations, assessing both recall and precision (P@k) and analyzing robustness across increasing levels of query complexity. Our results reveal that current LVLMs lack precision, robustness, and interpretability. To overcome these challenges, we propose RUNE (Reasoning Using Neurosymbolic Entities), a novel text-to-image retrieval approach that integrates foundation models with logical reasoning. Our met
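
The evaluation discussed above hinges on the distinction between recall (R@k) and precision (P@k). As a small reference sketch following the standard definitions (not code from the paper), the snippet below computes both for a single query, given a ranked list of image ids (e.g. ordered by a joint-embedding model's similarity score) and the ground-truth set of relevant images.

```python
from typing import List, Set, Tuple

def precision_recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> Tuple[float, float]:
    """Standard P@k and R@k for a single query.

    ranked_ids: image ids sorted best-first by the retrieval model's score.
    relevant:   ground-truth set of image ids that actually match the query.
    """
    top_k = ranked_ids[:k]
    hits = sum(1 for image_id in top_k if image_id in relevant)
    precision = hits / k if k else 0.0                 # how many retrieved images are correct
    recall = hits / len(relevant) if relevant else 0.0  # how many correct images were retrieved
    return precision, recall

# Example: 2 of the top-5 results are relevant, out of 4 relevant images in total.
p, r = precision_recall_at_k(["a", "b", "c", "d", "e"], {"b", "d", "x", "y"}, k=5)
# p == 0.4, r == 0.5
```

A model can score well on R@k while still flooding the top of the ranking with false positives, which is why the paper reports P@k alongside recall.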

📸 Image Gallery

cover.png decomposition.png efficiency.png explainability1.png explainability2.png flooded_image.png graph_comparison.png graph_complexity.png image_difficulty.png model_performance_change.png pipeline.png robustness_graphs.png uncertainty_comparison.png

Reference

This content is AI-processed based on open access ArXiv data.
