Retrieval-Augmented Generation for Natural Language Art Provenance Searches in the Getty Provenance Index


This research presents a Retrieval-Augmented Generation (RAG) framework for art provenance studies, focusing on the Getty Provenance Index. Provenance research establishes the ownership history of artworks, which is essential for verifying authenticity, supporting restitution and legal claims, and understanding the cultural and historical context of art objects. The process is complicated by fragmented, multilingual archival data that hinders efficient retrieval. Current search portals require precise metadata, limiting exploratory searches. Our method enables natural-language and multilingual searches through semantic retrieval and contextual summarization, reducing dependence on metadata structures. We assess RAG’s capability to retrieve and summarize auction records using a 10,000-record sample from the Getty Provenance Index - German Sales. The results show this approach provides a scalable solution for navigating art market archives, offering a practical tool for historians and cultural heritage professionals conducting historically sensitive research.


💡 Research Summary

This paper presents a domain‑specific Retrieval‑Augmented Generation (RAG) prototype designed to improve provenance research on the Getty Provenance Index (GPI), focusing on the German Sales subset. Provenance research—tracing the ownership history of artworks—is essential for authenticity verification, restitution claims, and historical scholarship, yet it is hampered by fragmented, multilingual archival records and search tools that require precise metadata. The authors propose a two‑stage RAG pipeline that enables natural‑language, multilingual queries, semantic retrieval, and explainable summarisation without relying on rigid metadata fields.

In the retrieval stage, the authors embed each auction catalogue entry (including the original terse description and key metadata such as sale date, auction house, and catalogue number) using OpenAI’s text‑embedding‑3‑large model. The resulting high‑dimensional vectors are indexed with FAISS (using HNSW/IVF configurations) to support fast nearest‑neighbor search across tens of thousands of records. By fusing metadata into the raw text before embedding, the system captures semantic cues that would be missed by keyword matching, handling variations in units, abbreviations, and multilingual terminology.
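The metadata-fusion step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the field names (`sale_date`, `auction_house`, etc.) and the separator format are assumptions, and a toy cosine-similarity search stands in for the text-embedding-3-large vectors and FAISS index used in the paper.

```python
import numpy as np

def fuse_record(record: dict) -> str:
    """Concatenate key metadata with the original catalogue text so a
    single embedding captures both. Field names are hypothetical."""
    parts = [
        f"Sale date: {record.get('sale_date', 'unknown')}",
        f"Auction house: {record.get('auction_house', 'unknown')}",
        f"Catalogue no.: {record.get('catalogue_number', 'unknown')}",
        record.get("description", ""),
    ]
    return " | ".join(p for p in parts if p)

def nearest(query_vec: np.ndarray, index_vecs: np.ndarray, k: int = 3):
    """Toy cosine-similarity nearest-neighbor search; in the actual
    pipeline this role is played by a FAISS (HNSW/IVF) index."""
    sims = index_vecs @ query_vec / (
        np.linalg.norm(index_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return np.argsort(-sims)[:k]
```

Fusing metadata into the text before embedding means a query like "paintings sold at Lepke in 1931" can match on semantics even when the catalogue entry itself never spells out the auction house in the same form.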

The generation stage receives the top‑k retrieved records (k≈10–20) and passes them to OpenAI’s GPT‑4o model, which produces concise, human‑readable summaries that include provenance‑relevant details and a reference to the original record ID. Importantly, the output is framed as an “explainable summary” rather than a definitive answer, allowing scholars to verify the information against the source.
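A generation-stage prompt along these lines could be assembled as below and sent to GPT-4o via a chat-completion call. The instruction wording and record fields are assumptions; the key design point from the paper is that each fact is tied back to a record ID so the summary stays verifiable.

```python
def build_summary_prompt(query: str, records: list[dict]) -> str:
    """Assemble the generation-stage prompt from the top-k retrieved
    records. Field names and instruction text are hypothetical."""
    context = "\n".join(f"[{r['record_id']}] {r['text']}" for r in records)
    return (
        "You are assisting an art-provenance researcher.\n"
        "Summarise only what the records below state; cite each fact "
        "with its record ID in brackets so it can be verified against "
        "the original catalogue entry.\n\n"
        f"Records:\n{context}\n\n"
        f"Question: {query}\nSummary:"
    )
```

Keeping the record IDs in the prompt (and requiring them in the output) is what turns the model's response into an "explainable summary" rather than an unsourced answer.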

Evaluation was conducted on two dataset scales: a 10 000‑record sample and an expanded 100 000‑record sample. A query set of 100 real‑world provenance questions—co‑designed with domain experts and covering both precise and exploratory information needs—was used to compare two system configurations: a “Naïve RAG” baseline and an “Advanced RAG” that incorporates an open‑source reranker (bge‑reranker‑v2‑m3). Human expert judgments on relevance, precision, and recall were collected. Results show that the Advanced RAG improves the average relevance score of the top‑5 results from 0.71 to 0.84, and multilingual queries retrieve relevant German‑language records with >70 % recall. Search latency remains low (≈120 ms for FAISS retrieval) even at 100 k records, while summarisation latency averages 1.8 seconds per query.
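The two-stage "Advanced RAG" flow can be illustrated with the sketch below. In the paper the scorer is the neural cross-encoder bge-reranker-v2-m3; here a simple lexical-overlap function stands in for it so the re-ordering logic is self-contained.

```python
from typing import Callable

def rerank(query: str, candidates: list[dict],
           score: Callable[[str, str], float], top_n: int = 5) -> list[dict]:
    """Re-order first-stage retrieval candidates with a cross-encoder-style
    scorer (bge-reranker-v2-m3 in the paper; `score` is a stand-in)."""
    scored = [(score(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def overlap_score(query: str, text: str) -> float:
    """Toy lexical-overlap scorer standing in for the neural reranker."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)
```

A real reranker scores each (query, record) pair jointly rather than comparing pre-computed vectors, which is why it can lift the top-5 relevance beyond what embedding similarity alone achieves.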

The paper also details the intrinsic challenges of historical auction catalogues: telegraphic, highly condensed entries; inconsistent formatting of dimensions, media, and dates; use of synonyms and abbreviations; and multilingual entity names. By embedding metadata directly into the textual field and employing a reranking step that can optionally filter on metadata, the system offers both “exploratory” and “precision” search modes.
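The optional metadata filtering behind the "precision" mode might look like the following sketch; the exact filter fields and matching rules are assumptions, not the authors' implementation.

```python
def precision_filter(records: list[dict], **constraints) -> list[dict]:
    """'Precision' search mode: keep only records whose metadata matches
    every supplied constraint exactly. Field names are hypothetical;
    omitting all constraints reduces to the 'exploratory' mode."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in constraints.items())
    ]
```

Applied after semantic retrieval (or reranking), such a filter lets a researcher pin down a specific sale while the unconstrained path still supports open-ended exploration.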

Limitations are acknowledged: the current prototype relies on closed‑source LLM and embedding services, raising cost and accessibility concerns; performance with fully open‑source models remains untested; and multimodal extensions (e.g., image‑text integration) are not yet implemented. Future work will focus on building a completely open‑source pipeline, conducting cross‑model benchmarking, adding multimodal capabilities, and refining user‑interface features that enhance explainability and trust.

Overall, the study demonstrates that RAG can substantially lower the barrier to effective provenance research on large, multilingual art‑market archives. By providing natural‑language access, semantic expansion of queries, and transparent, verifiable summaries, the prototype offers a scalable tool for historians, restitution specialists, and legal scholars dealing with sensitive provenance cases, such as artworks looted during the Nazi era. The work positions RAG as a practical, interdisciplinary bridge between AI technologies and humanities scholarship, suggesting a viable path toward more open, efficient, and trustworthy cultural heritage research.

