GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM

GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Socio-economic indicators like regional GDP, population, and education levels, are crucial to shaping policy decisions and fostering sustainable development. This research introduces GeoReg a regression model that integrates diverse data sources, including satellite imagery and web-based geospatial information, to estimate these indicators even for data-scarce regions such as developing countries. Our approach leverages the prior knowledge of large language model to address the scarcity of labeled data, with the language model functioning as a data engineer by extracting informative features to enable effective estimation in few-shot settings. Specifically, our model obtains contextual relationships between data features and the target indicator, categorizing their correlations as positive, negative, mixed, or irrelevant. These features are then fed into the linear estimator with tailored weight constraints for each category. To capture nonlinear patterns, the model also identifies meaningful feature interactions and integrates them, along with nonlinear transformations. Experiments across three countries at different stages of development demonstrate that our model outperforms baselines in estimating socio-economic indicators, even for low-income countries with limited data availability.


💡 Research Summary

GeoReg introduces a novel framework for estimating regional socio‑economic indicators—such as gross regional domestic product (GRDP), population, and education levels—under severe label scarcity. Traditional approaches that fuse satellite imagery, night‑light data, and web‑based geospatial information typically require large numbers of ground‑truth labels to train deep neural networks. In low‑resource settings, especially in developing countries, this requirement is often unmet, leading to over‑fitting, biased predictions, and a lack of interpretability.

The key innovation of GeoReg is to treat a pre‑trained large language model (LLM) as a “data engineer.” First, the authors design a set of 26 modular feature extractors (called modules) that operate on heterogeneous sources: GIS APIs for administrative boundaries, POI databases for distances to airports/ports, VIIRS night‑light imagery for illumination intensity, a pretrained segmentation model for land‑cover classification (e.g., building, road, agriculture), and neighbor‑aggregation functions that summarize surrounding regions. Each module is a deterministic function (e.g., get_area, get_distance_to_nearest_target, get_night_light, count_area, get_aggregate_neighbor_info) that returns a numeric descriptor for a given region.

GeoReg’s learning pipeline consists of two stages. Stage 1 – Knowledge‑driven module categorization. The LLM receives a carefully crafted prompt containing the textual definition of each module, the definition of the target indicator, and a description of four correlation types: Positive, Negative, Mixed, and Irrelevant. Using a Chain‑of‑Thought (CoT) reasoning style, the LLM evaluates the relationship between each module and the indicator, repeats the query five times for self‑consistency, and aggregates the answers by majority vote. The outcome is a categorical label for every module (positive ↔ negative ↔ mixed ↔ irrelevant). This step injects domain knowledge without requiring many labeled examples.

Stage 2 – Correlation‑constrained linear regression with nonlinear augmentation. The modules classified as positive are forced to have strictly positive coefficients, negatives receive strictly negative coefficients, and irrelevant modules are constrained near zero. These constraints are encoded as linear inequality constraints in a standard least‑squares regression solver, effectively turning the prior knowledge from the LLM into an inductive bias that mitigates over‑fitting. To capture complex patterns that a purely linear model cannot express, the authors automatically discover pairwise interactions among modules within each correlation group and augment the feature space with nonlinear transformations (polynomials, logarithms, etc.). The final model thus combines a transparent, weight‑constrained linear core with a modest set of nonlinear interaction terms.

The authors evaluate GeoReg on three countries representing different development stages: South Korea (high‑income), Vietnam (middle‑income), and Cambodia (low‑income). For each country they predict three indicators (GRDP, population, education) using only 5–10 labeled regions (few‑shot regime). Baselines include state‑of‑the‑art multimodal CNN models, unconstrained linear regression, and a version of GeoReg without LLM‑driven categorization (random module selection). GeoReg achieves an average winning rate of 87.2 % across all tasks. Notably, in Cambodia where only five labeled regions are available, GeoReg improves R² by more than 0.12 over the best CNN baseline, demonstrating robustness to extreme data scarcity. Ablation studies show that both the LLM‑derived weight constraints and the nonlinear interaction augmentation contribute substantially to performance gains.

Interpretability is a central benefit. Because each coefficient is tied to a human‑readable module and its sign is enforced by the LLM’s correlation label, policymakers can directly read statements such as “larger administrative area positively correlates with GRDP” or “higher night‑light intensity negatively correlates with education deprivation.” This level of global explanation is absent in typical deep‑learning black‑box models that rely on post‑hoc saliency maps.

The paper acknowledges limitations: (1) the quality of the LLM’s categorization depends on prompt engineering and may suffer from hallucinations; (2) the initial module set still requires domain expertise to define; (3) scalability to hundreds of modules or to fully multimodal LLM‑vision architectures remains to be demonstrated. Future work is suggested on automated prompt optimization, dynamic module generation, and integration with vision‑language models to further reduce human engineering effort.

In summary, GeoReg presents a compelling solution to the “few‑shot socio‑economic estimation” problem by leveraging LLMs for knowledge‑driven feature selection and weight constraint, coupling it with a transparent linear regression enriched by learned nonlinear interactions. The approach delivers both superior predictive accuracy under label scarcity and actionable interpretability, opening a promising path for data‑poor regions to obtain reliable, policy‑relevant economic statistics.


Comments & Academic Discussion

Loading comments...

Leave a Comment