A Reproducible Workflow for Scraping, Structuring, and Segmenting Legacy Archaeological Artifact Images
This technical note presents a reproducible workflow for converting a legacy archaeological image collection into a structured, segmentation-ready dataset. The case study focuses on the Lower Palaeolithic hand axe and biface collection curated by the Archaeology Data Service (ADS), a dataset that provides thousands of standardised photographs but no mechanism for bulk download or automated processing. To address this, two open-source tools were developed: a web scraping script that retrieves all record pages, extracts associated metadata, and downloads the available images while respecting the ADS Terms of Use and ethical scraping guidelines; and an image processing pipeline that renames files using UUIDs, generates binary masks and bounding boxes through classical computer vision, and stores all derived information in a COCO-compatible JSON file enriched with archaeological metadata. The original images are not redistributed; only derived products such as masks, outlines, and annotations are shared. Together, these components provide a lightweight and reusable approach for transforming web-based archaeological image collections into machine-learning-friendly formats, facilitating downstream analysis and contributing to more reproducible research practices in digital archaeology.
💡 Research Summary
This technical note presents a comprehensive and reproducible workflow designed to transform a legacy, web-based archaeological image collection into a structured, machine-learning-ready dataset. The case study focuses on the “Lower Palaeolithic technology, raw material and population ecology (bifaces)” dataset curated by the Archaeology Data Service (ADS), which contains over 10,000 standardized photographs of handaxes but lacks mechanisms for bulk download or automated processing.
The workflow is built upon two distinct, open-source pipelines. The first is a web scraping framework developed in Python. This script systematically accesses every individual record page in the ADS collection by exploiting stable URL patterns. It extracts all associated archaeological metadata (provenience, raw material, measurements) and downloads the available high-resolution images. Crucially, the scraper incorporates ethical guidelines by checking the site’s robots.txt file, introducing randomized delays between requests to avoid server overload, and strictly adhering to the ADS Terms of Use. The metadata is compiled into a CSV file, facilitating further analysis.
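The scraper's ethical safeguards can be sketched as follows. This is a minimal illustration, not the published script: the base URL and record-ID pattern are placeholders (the real ADS URLs differ), and the HTML parsing and image download steps are omitted.

```python
import csv
import random
import time
from urllib import request, robotparser

# Hypothetical base URL; the actual ADS record-page pattern differs.
BASE_URL = "https://archaeologydataservice.ac.uk"

def record_url(record_id):
    """Build a record-page URL from a numeric ID (pattern is illustrative)."""
    return f"{BASE_URL}/record/{record_id}"

def make_robot_parser(base_url=BASE_URL):
    """Fetch and parse robots.txt so every request can be checked first."""
    rp = robotparser.RobotFileParser()
    rp.set_url(base_url + "/robots.txt")
    rp.read()
    return rp

def polite_fetch(url, rp, min_delay=1.0, max_delay=3.0):
    """Fetch a page only if robots.txt allows it, after a randomized pause."""
    if not rp.can_fetch("*", url):
        return None
    time.sleep(random.uniform(min_delay, max_delay))
    with request.urlopen(url) as resp:
        return resp.read()

def write_metadata_csv(rows, path):
    """Persist scraped metadata (one dict per record) to a CSV file."""
    fields = sorted({key for row in rows for key in row})
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

The randomized delay and the `robots.txt` gate are the two politeness mechanisms the note describes; collecting per-record dicts and flattening them into one CSV mirrors the metadata compilation step.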
The second component is an image processing and segmentation pipeline. This pipeline addresses several key challenges in preparing legacy image sets for computer vision. First, it renames all image files using Universally Unique Identifiers (UUIDs), storing the mapping to original filenames in a separate CSV. This prevents file naming conflicts if datasets from different sources are merged in the future. Second, it generates a JSON file in the widely adopted COCO (Common Objects in Context) format. This file is enriched not only with standard COCO fields (image dimensions, IDs) but also with the archaeological metadata scraped earlier, creating a rich, contextualized dataset. Third, it performs image segmentation using classical computer vision techniques. Assuming one artifact per image on a dark background, the pipeline identifies contours, selects the largest as the artifact, and generates a binary mask and bounding box coordinates for each object, embedding these annotations into the COCO JSON.
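The three pipeline steps above can be sketched in a simplified form. This sketch assumes a grayscale array as input and a fixed brightness threshold; the actual pipeline selects the largest contour (e.g. via OpenCV) rather than taking the whole foreground mask, and the `attributes` field used to carry archaeological metadata is a non-standard COCO extension assumed here for illustration.

```python
import uuid
from pathlib import Path

import numpy as np

def uuid_name(original_path):
    """Collision-proof replacement filename; the caller records the
    new-name -> original-name mapping in a separate CSV."""
    return uuid.uuid4().hex + Path(original_path).suffix

def binary_mask(gray, thresh=40):
    """Foreground = pixels brighter than the dark studio background."""
    return (np.asarray(gray) > thresh).astype(np.uint8)

def bbox_from_mask(mask):
    """COCO-style [x, y, width, height] box around all foreground pixels."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    x0, y0 = int(xs.min()), int(ys.min())
    return [x0, y0, int(xs.max()) - x0 + 1, int(ys.max()) - y0 + 1]

def coco_annotation(image_id, ann_id, mask, metadata):
    """One COCO annotation dict; scraped metadata rides along in a
    non-standard 'attributes' field (an assumption of this sketch)."""
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": 1,  # single category, as the note describes
        "bbox": bbox_from_mask(mask),
        "area": int(mask.sum()),
        "iscrowd": 0,
        "attributes": metadata,
    }
```

Because each image is assumed to contain a single artifact on a dark background, a global threshold followed by a bounding box over the foreground is a reasonable first approximation; the contour-based selection in the real pipeline additionally discards small noise regions.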
A core ethical and legal principle of the workflow is that the original copyrighted images from ADS are not redistributed. Only the derived products—the scripts, CSV metadata, UUID mapping file, COCO annotations, and the generated masks and outlines—are shared openly. This respects repository licensing while enabling reproducible research.
The paper candidly discusses the limitations of the proposed approach. The image segmentation is effective only for the specific photographic setup of the ADS biface collection (single object, dark background). Different collections would require more advanced techniques like background subtraction or deep learning-based segmentation. The web scraper is tailored to the current HTML structure of the ADS website and may break if the site changes. Furthermore, the current COCO implementation supports only a single object category.
Despite these limitations, the workflow provides a pragmatic and reusable template for converting closed, web-accessible archives into open, structured datasets suitable for quantitative analysis. By decoupling the copyrighted original assets from the newly created annotations and metadata, it offers a path for leveraging legacy digital collections in modern computational research without violating terms of use. The release of all code and derived data aims to contribute a concrete piece of infrastructure to the growing field of digital archaeology and promote more reproducible research practices.