Safeguarding Old and New Journal Tables for the VO: Status for Extragalactic and Radio Data

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Independent of established data centers, and partly for my own research, since 1989 I have been collecting the tabular data from over 2600 articles concerned with radio sources and extragalactic objects in general. Optical character recognition (OCR) was used to recover tables from 740 papers. Tables from only 41 percent of the 2600 articles are available in the CDS or CATS catalog collections, and only slightly better coverage is estimated for the NED database. This fraction is no better for articles published electronically since 2001. Both object databases (NED, SIMBAD, LEDA) and catalog browsers (VizieR, CATS) need to be consulted to obtain the most complete information on astronomical objects. More human resources at the data centers and better collaboration between authors, referees, editors, publishers, and data centers are required to improve data coverage and accessibility. The current efforts within the Virtual Observatory (VO) project to provide retrieval and analysis tools for different types of published and archival data stored at various sites should be balanced by an equal effort to recover and include large amounts of published data not currently available in this way.


💡 Research Summary

The paper documents a personal effort, ongoing since 1989, to collect and preserve tabular data from more than 2,600 articles dealing with radio sources and extragalactic objects. The author systematically identified papers, extracted tables either directly from digital PDFs or by scanning printed copies, and applied optical character recognition (OCR) to recover tables from 740 papers. OCR was followed by extensive manual verification because of frequent recognition errors and complex table layouts. The resulting datasets were annotated with metadata (author, journal, year, observing frequency, etc.) and, where possible, converted into standard formats such as VOTable or FITS.
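The manual verification step can be narrowed down by first flagging the cells that are most likely to be OCR errors. The following is a minimal sketch of that idea, not the author's actual procedure: the lookalike mapping, the function names, and the sample column are all assumptions made for illustration.

```python
import re

# Characters that OCR commonly confuses with digits.
# This mapping is illustrative (an assumption), not taken from the paper.
OCR_LOOKALIKES = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}

# A plain decimal number with an optional sign.
NUMERIC = re.compile(r"^[+-]?\d+(\.\d+)?$")


def check_cell(cell):
    """Classify one cell of a column that should contain numbers.

    Returns (status, suggestion), where status is
      "ok"      - the cell already parses as a number,
      "suspect" - a lookalike substitution turns it into a number,
      "bad"     - no simple repair found; check the printed page.
    """
    text = cell.strip()
    if NUMERIC.match(text):
        return "ok", text
    repaired = "".join(OCR_LOOKALIKES.get(ch, ch) for ch in text)
    if NUMERIC.match(repaired):
        return "suspect", repaired
    return "bad", None


def review_column(cells):
    """Return (row_index, original, suggestion) for every cell needing review."""
    flagged = []
    for index, cell in enumerate(cells):
        status, suggestion = check_cell(cell)
        if status != "ok":
            flagged.append((index, cell, suggestion))
    return flagged


# Hypothetical flux-density column as it might come out of OCR.
flux_column = ["12.4", "1O.7", "3.l5", "7,2", "-0.8"]
flagged_cells = review_column(flux_column)
```

A suggestion is only a hint for the person checking the table against the printed original; the sketch deliberately never overwrites a value on its own.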

Despite this massive undertaking, tables from only 41 % of the articles are available in the CDS (VizieR) or CATS catalog collections. Coverage in the NASA/IPAC Extragalactic Database (NED) is estimated to be only slightly better, and the fraction is no higher for papers published electronically since 2001. The low registration rate reflects a combination of factors: authors often do not submit their tables in a machine‑readable form, referees and editors lack a mandate to enforce data deposition, and the submission procedures at data centers remain cumbersome. Consequently, researchers who rely on a single database or catalog browser frequently obtain incomplete information about an object and must cross‑query multiple services to assemble a full picture.
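Why a single service is not enough can be illustrated with a small sketch that merges the holdings of several services for one object and lists the records that only one of them contains. The service labels, the object name, and every value below are hypothetical placeholders held in memory; no real database is queried and no real coverage figures are implied.

```python
def merge_measurements(object_name, services):
    """Collect every record of one object across several services.

    `services` maps a service label to a dict of the form
    {object_name: [(quantity, value, reference), ...]}.
    Returns {record: sorted list of service labels that hold it}.
    """
    merged = {}
    for label, holdings in services.items():
        for record in holdings.get(object_name, []):
            merged.setdefault(record, []).append(label)
    return {record: sorted(labels) for record, labels in merged.items()}


def only_in_one(merged):
    """Records held by exactly one service, mapped to that service."""
    return {rec: labels[0] for rec, labels in merged.items() if len(labels) == 1}


# Hypothetical holdings of two services for a made-up source "SRC-1".
services = {
    "object_db": {
        "SRC-1": [("S_1400MHz_mJy", 120.0, "ref-A"), ("redshift", 0.05, "ref-B")],
    },
    "catalog_browser": {
        "SRC-1": [("S_1400MHz_mJy", 120.0, "ref-A"), ("S_408MHz_mJy", 310.0, "ref-C")],
    },
}

merged = merge_measurements("SRC-1", services)
unique = only_in_one(merged)
```

In this toy case each service holds one record the other lacks, so only the merged view contains all three measurements.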

The author argues that this fragmented accessibility undermines scientific reproducibility and hampers the reuse of historic observations. While the Virtual Observatory (VO) project has succeeded in providing powerful tools for querying and analysing data that are already hosted in established archives, it does not address the large body of legacy tables that remain unavailable in any digital repository. The VO’s vision of seamless, integrated access to all astronomical data therefore remains only partially realized.

To close the gap, three concrete recommendations are proposed. First, data centers need additional human resources and funding to scale up OCR post‑processing, manual quality control, and metadata standardisation. Second, a policy framework should be established that obliges authors, reviewers, editors, and publishers to submit tables in a defined, machine‑readable format at the time of publication, with clear standards for column definitions and units. Third, the workflow between data centers and VO services must be automated so that newly deposited tables are instantly indexed and made searchable through VO portals. Implementing these measures would not only increase the fraction of published tables that are publicly accessible but also enhance the overall efficiency of astronomical research by enabling straightforward cross‑comparison of new and archival measurements.
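As an illustration of the second recommendation, a table with explicit column definitions and units could be serialized roughly as follows. This is a minimal sketch using only Python's standard library: it emits the basic VOTable element layout (FIELD definitions followed by TABLEDATA rows) but omits the XML namespace and is not validated against the VOTable schema. The column names, the sample row, and the `build_votable` helper are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_votable(columns, rows):
    """Build a minimal VOTable-style element tree.

    columns: list of dicts with "name" and "datatype", plus optional
             "unit", "arraysize", and "description".
    rows:    list of tuples holding one value per column.
    """
    root = ET.Element("VOTABLE", version="1.3")
    resource = ET.SubElement(root, "RESOURCE")
    table = ET.SubElement(resource, "TABLE")

    # One FIELD per column: this is where names, types, and units are declared.
    for col in columns:
        attrs = {"name": col["name"], "datatype": col["datatype"]}
        for optional in ("unit", "arraysize"):
            if optional in col:
                attrs[optional] = col[optional]
        field = ET.SubElement(table, "FIELD", attrs)
        if "description" in col:
            ET.SubElement(field, "DESCRIPTION").text = col["description"]

    # The data rows, each checked against the declared number of columns.
    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row length does not match the column definitions")
        tr = ET.SubElement(tabledata, "TR")
        for value in row:
            ET.SubElement(tr, "TD").text = str(value)
    return root


# Hypothetical column definitions and a single made-up row.
columns = [
    {"name": "name", "datatype": "char", "arraysize": "*"},
    {"name": "ra", "datatype": "double", "unit": "deg"},
    {"name": "dec", "datatype": "double", "unit": "deg"},
    {"name": "flux_1400", "datatype": "double", "unit": "mJy",
     "description": "Flux density at 1400 MHz"},
]
rows = [("SRC-1", 10.5, -20.25, 120.0)]

root = build_votable(columns, rows)
xml_text = ET.tostring(root, encoding="unicode")
```

Declaring units and descriptions per column is the point of the exercise: a reader or a harvesting service can interpret the numbers without going back to the article text.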

In summary, the paper highlights a substantial, yet largely unaddressed, deficit in the preservation of published astronomical tables. It demonstrates that even with modern OCR technology, the bottleneck lies in institutional practices and resource allocation. By strengthening collaboration among authors, referees, editors, publishers, and data centers, and by aligning VO development with systematic data recovery efforts, the astronomical community can achieve the comprehensive, open‑access data environment envisioned by the Virtual Observatory.

