Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets
Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across 20 common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for 48.0% of samples, while South American and African countries are severely under-represented with only 1.8% and 3.8% of images, respectively. We observe a strong correlation between a country's GDP and its representation in the data (ρ = 0.82). Examining non-English subsets for 4 languages from the Re-LAION dataset, we find that representation skews heavily toward countries where these languages are predominantly spoken. Additionally, we find that higher representation does not necessarily translate to greater visual or semantic diversity. Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, we show that while generations appear realistic, they are severely limited in their coverage compared to real-world images.
💡 Research Summary
This paper investigates the geographic composition of large‑scale vision‑language datasets by extracting country information from image captions using large language models (LLMs). The authors argue that captions are a natural source of location cues, yet traditional methods (string matching, NER, geoparsing tools) struggle with ambiguous place names, informal language, and multilingual text. To overcome these limitations, they design a three‑step “extract‑retrieve‑predict” pipeline: an LLM first extracts location mentions from a caption, then retrieves the top‑k candidate locations from the GeoNames gazetteer, and finally predicts the country by feeding the candidates (with their associated country names) back to the LLM as additional context.
The pipeline is evaluated on three curated annotation sets: (1) D_self – 5K real web captions manually labeled with countries; (2) D_geo – 57K synthetic captions where a known GeoNames location is inserted; and (3) D_marginalized – 1.6K sentences from Wikipedia covering ten low-GDP countries. Across all sets, LLM-based methods dramatically outperform baseline approaches. Gemini-2.5-Flash, especially with the extract-retrieve-predict protocol, achieves the highest precision/recall (≈0.98/0.95), confirming that modern LLMs can reason about geography far better than rule-based systems.
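As a hedged sketch of how such scores could be computed (the paper's exact scoring protocol is not reproduced here), one reasonable convention treats an abstaining prediction (`None`, no location found) as no prediction, so it hurts recall but not precision:

```python
# Hypothetical scoring of country predictions against gold labels.
# None means the method abstained (found no location in the caption).

def precision_recall(preds, golds):
    tp = sum(1 for p, g in zip(preds, golds) if p is not None and p == g)
    n_pred = sum(1 for p in preds if p is not None)  # attempted predictions
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / len(golds) if golds else 0.0
    return precision, recall

preds = ["France", "United States", None, "Kenya"]
golds = ["France", "Canada", "Japan", "Kenya"]
print(precision_recall(preds, golds))  # precision 2/3, recall 1/2
```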
Armed with this tool, the authors geo-profile three widely used multimodal datasets: Re-LAION2B-en (the English subset of LAION), DataComp1B, and CC12M. They focus on 20 common visual entities (e.g., house, flag, road) that are globally recognizable. For each entity, they first filter image-caption pairs with an entity-presence classifier to ensure the object actually appears in the image. Then they map the captions to countries using the LLM pipeline. The resulting country-level frequency distributions reveal stark imbalances: the United States, United Kingdom, and Canada together account for 48.0% of all sampled pairs, while South American and African nations contribute only 1.8% and 3.8% respectively. A striking 12 of the top 15 countries are common across all three datasets, and these 15 nations comprise 77.2% of location-specified captions in Re-LAION2B-en.
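Once each filtered caption has been mapped to a country, the frequency distribution is a simple aggregation. A minimal sketch with made-up (caption, country) pairs, dropping captions where no location was found:

```python
from collections import Counter

# Hypothetical (caption, predicted country) pairs after entity filtering
# and the caption-to-country mapping step.
mapped = [
    ("a red house with a flag", "United States"),
    ("country road near Leeds", "United Kingdom"),
    ("maple-lined street in Toronto", "Canada"),
    ("a house by the lake", None),          # no location found -> excluded
    ("colonial house in Boston", "United States"),
]

counts = Counter(country for _, country in mapped if country is not None)
total = sum(counts.values())
for country, n in counts.most_common():
    print(f"{country}: {100 * n / total:.1f}%")
```

Note that percentages here are over *location-specified* captions only; captions with no detectable location are excluded from the denominator, as in the 77.2% figure above.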
Statistical analysis shows a strong positive correlation (Pearson ρ = 0.82) between a country's nominal GDP and its representation in the datasets, suggesting that economic wealth drives the likelihood of images from that region being scraped and retained. When the authors compare the observed frequencies to real-world occurrence rates for the same entities, they find that on average 33.8% of countries are under-represented relative to their true prevalence.
The paper also examines multilingual subsets of Re-LAION (Spanish, Hindi, Greek, Japanese). Here, representation skews heavily toward countries where the target language is dominant. For example, Spanish captions mention South American countries in 26.4% of cases, whereas English captions mention them in only 1.8% of cases. This demonstrates that language-based crawling shapes geographic coverage: each language subset concentrates on the regions where that language is spoken.
Beyond raw counts, the authors assess visual and semantic diversity within each country. Using clustering on image embeddings and topic modeling on captions, they find only a moderate correlation (≈0.45) between the number of samples from a country and its intra‑country diversity. Hence, higher representation does not guarantee richer visual or textual variation.
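The paper's diversity analysis relies on clustering image embeddings and topic-modeling captions; as a hedged stand-in for those heavier tools, one simple per-country diversity proxy is the normalized Shannon entropy of cluster assignments, which is low when a country's samples collapse into a few clusters regardless of how many samples it has:

```python
from collections import Counter
from math import log

def normalized_entropy(cluster_ids):
    """Shannon entropy of cluster assignments, normalized to [0, 1].
    Near 1.0: samples spread evenly across clusters (high diversity).
    Near 0.0: samples concentrated in one cluster (low diversity)."""
    counts = Counter(cluster_ids)
    n, k = len(cluster_ids), len(counts)
    if k <= 1:
        return 0.0
    h = -sum((c / n) * log(c / n) for c in counts.values())
    return h / log(k)  # divide by max possible entropy for k clusters

# Hypothetical cluster assignments for two equally sized country subsets:
print(normalized_entropy([0, 1, 2, 3, 0, 1, 2, 3]))  # evenly spread
print(normalized_entropy([0, 0, 0, 0, 0, 0, 0, 1]))  # concentrated
```

This is exactly the failure mode the ≈0.45 correlation points at: two countries with identical sample counts can differ sharply on such a diversity measure.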
Finally, the impact of these data biases on downstream models is evaluated using Stable Diffusion v1.3 trained on Re‑LAION. The authors generate country‑specific prompts (e.g., “a traditional house in Kenya”) and compare the generated images to real images from the same country using human judgments and CLIP similarity scores. While the generated images appear realistic, they cover a far narrower set of visual concepts, especially for low‑GDP regions, confirming that training‑data skew directly limits generative diversity.
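The CLIP similarity scores mentioned above reduce to cosine similarity between embedding vectors. A self-contained sketch with toy vectors standing in for real CLIP image features (the actual pipeline would embed images with a CLIP model first):

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors, e.g. CLIP features."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d vectors standing in for high-dimensional CLIP embeddings.
gen_img  = [0.9, 0.1, 0.0]   # hypothetical generated-image embedding
real_img = [0.8, 0.2, 0.1]   # hypothetical real-image embedding

print(f"similarity = {cosine_similarity(gen_img, real_img):.3f}")
```

High pairwise similarity alone would not reveal the coverage gap the authors report: a model can score well against the real images it resembles while missing entire visual concepts, which is why the paper pairs similarity scores with human judgments.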
In summary, the study makes four key contributions: (1) a robust LLM‑driven method for extracting country information from noisy web captions; (2) a large‑scale quantitative analysis exposing severe geographic concentration in three major vision‑language datasets, tightly linked to economic wealth and language; (3) evidence that representation volume is not a reliable proxy for intra‑country visual/semantic diversity; and (4) a demonstration that these data‑level biases propagate to text‑to‑image models, curtailing their ability to generate geographically diverse content. The authors advocate for systematic geo‑profiling during dataset construction, intentional balancing of country and language coverage, and the use of LLM‑based geoparsing as a scalable monitoring tool to foster more inclusive multimodal AI systems.