Automated Library Data Collection and Custom VLM Training with the Scanford Robot
📝 Abstract
Foundation models (FMs) have unlocked powerful zero-shot capabilities in vision and language, yet their reliance on internet pretraining data leaves them brittle in unstructured, real-world settings. The messy, real-world data encountered during deployment (e.g., occluded or multilingual text) remains massively underrepresented in existing corpora. Robots, as embodied agents, are uniquely positioned to close this gap: they can act in physical environments to collect large-scale, real-world data that enriches FM training with precisely the examples current models lack. We introduce the Robot-Powered Data Flywheel, a framework that transforms robots from FM consumers into data generators. By deploying robots equipped with FMs in the wild, we enable a virtuous cycle: robots perform useful tasks while collecting real-world data that improves both domain-specific adaptation and domain-adjacent generalization. We instantiate this framework with Scanford, a mobile manipulator deployed in the East Asia Library for two weeks. Scanford autonomously scans shelves, identifies books using a vision-language model (VLM), and leverages the library catalog to label images without human annotation. This deployment both aids librarians and produces a dataset to finetune the underlying VLM, improving performance on the domain-specific in-the-wild library setting and on domain-adjacent multilingual OCR benchmarks. Using data collected from 2103 shelves, Scanford improves VLM performance on book identification from 32.0% to 71.8% and boosts domain-adjacent multilingual OCR from 24.8% to 46.6% (English) and 30.8% to 38.0% (Chinese), while saving ~18.7 hours of human time. These results highlight how robot-powered data flywheels can both reduce human effort in real deployments and unlock new pathways for continually adapting FMs to the messiness of reality. More details are at: https://scanford-robot.github.io
📄 Content
Fig. 1. Scanford: a Robot-Powered Data Flywheel system for continual foundation model improvement through in-the-wild deployment. We deploy Scanford in the East Asia Library for two weeks to scan books and assist inventory management, a challenging setting for off-the-shelf foundation models due to multilingual text, degraded labels, and occlusions [Left]. Scanford uses a mobile manipulator to collect pictures of bookshelves and leverages a VLM to identify the books in each image by title and call number. These labels are then compared with a library catalog database to curate a clean, accurate dataset for VLM fine-tuning [Center]. Crucially, the autonomously gathered data improves not only the domain-specific performance on book identification, but also the domain-adjacent generalizability of foundation models (multilingual OCR). Scanford simultaneously (1) saves 18.7 hours of manual scanning, (2) collects real-world book data, and (3) improves the very foundation model it relies on, enhancing its own performance on the library task while also strengthening the model's broader multilingual OCR capabilities [Right].
Recent advances in foundation models have unlocked impressive zero-shot capabilities in vision and language, from optical character recognition (OCR) to image captioning [1], [2]. However, these systems rely heavily on vast amounts of internet data that are clean, curated, and biased toward certain languages and domains. As a result, they often fail on the "final mile" of perception in unstructured environments (e.g., reading nutrition facts on a crumpled wrapper, interpreting graffiti-covered road signs, or identifying book titles on the worn bindings of library books) [3], [4].
The core gap is that the messiness of real-world environments is massively underrepresented in existing pretraining data. Robots, as mobile embodied agents, are uniquely positioned to close this gap: they can autonomously collect large-scale, real-world data directly from the environments where foundation models are ultimately deployed, enriching training corpora with precisely the kinds of real-world examples these models lack.
Our key insight is to transform robots from consumers of foundation models into data generators that drive a Robot-Powered Data Flywheel (RPDF). By deploying robots equipped with foundation models in the wild, we enable them to perform useful tasks while simultaneously collecting domain-representative data that can be used to continually refine those very same foundation models. Crucially, because this data captures domains missing from internet-scale pretraining corpora, it enhances not only domain-specific adaptation but also strengthens the foundation model's broader capabilities in domain-adjacent settings (e.g., reading text in low-resolution or occlusion-heavy images). This creates a virtuous cycle where deployment improves models, and improved models enable more successful deployments.
We instantiate this framework with Scanford, as shown in Fig. 1. Library inventory management is extremely labor-intensive: at the East Asia Library, it takes five librarians approximately nine months to complete a full cataloging, meaning a complete inventory can only be performed once every 10-15 years. Our robotic system scans shelves with an onboard camera and uses a vision-language model (VLM) to identify the books. Unlike standard benchmarks dominated by English, these shelves contain books primarily in Chinese, Japanese, and Korean, languages for which current VLMs, and especially their OCR capabilities, remain underdeveloped [5]. To close this gap, we automatically label scans using the library's catalog.
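The catalog-based auto-labeling step described above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: the function names, the `(title, call_number)` record format, the use of `difflib` string similarity, and the 0.9 threshold are all assumptions made for clarity. The key idea it demonstrates is that a VLM prediction is kept as a training label only when it closely matches a trusted catalog record, and the (cleaner) catalog text is then used as the ground-truth label.

```python
# Hypothetical sketch of catalog-based label curation: keep a VLM's book
# prediction only when it closely matches a library catalog record, then
# use the catalog's own text as the clean fine-tuning label.
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def curate_labels(predictions, catalog, threshold=0.9):
    """Filter VLM predictions against the catalog (all names illustrative).

    predictions: list of (image_id, predicted_title, predicted_call_number)
    catalog:     list of (title, call_number) records from the library database
    Returns (image_id, catalog_title, catalog_call_number) tuples, i.e.
    auto-labeled examples requiring no human annotation.
    """
    curated = []
    for image_id, title, call_no in predictions:
        # Find the catalog record most similar to the prediction.
        best = max(
            catalog,
            key=lambda rec: similarity(title, rec[0]) + similarity(call_no, rec[1]),
        )
        score = (similarity(title, best[0]) + similarity(call_no, best[1])) / 2
        if score >= threshold:
            # Trust the catalog's text over the VLM's raw output.
            curated.append((image_id, best[0], best[1]))
    return curated
```

In this sketch, low-confidence predictions are simply dropped rather than corrected, so the resulting dataset trades coverage for label cleanliness, which matches the paper's emphasis on curating an accurate fine-tuning set without human annotation.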
This content is AI-processed based on ArXiv data.