Renewal Strings for Cleaning Astronomical Databases

Large astronomical databases obtained from sky surveys such as the SuperCOSMOS Sky Surveys (SSS) invariably suffer from a small number of spurious records arising from artefactual effects of the telescope, from satellites and junk objects in orbit around the Earth, and from physical defects on the photographic plate or CCD. Though relatively small in number, these spurious records present a significant problem in many situations, where they can become a large proportion of the records potentially of interest to a given astronomer. In this paper we focus on the four most common causes of unwanted records in the SSS: satellite or aeroplane tracks; scratches, fibres and other linear phenomena introduced to the plate; circular halos around bright stars due to internal reflections within the telescope; and diffraction spikes near bright stars. Accurate and robust techniques are needed for locating and flagging such spurious objects. We have developed renewal strings, a probabilistic technique combining the Hough transform, renewal processes and hidden Markov models, which has proven highly effective in this context. The methods are applied to the SSS data to develop a dataset of spurious object detections, along with confidence measures, which allows this unwanted data to be removed from consideration. These methods are general and can be adapted to any future astronomical survey data.


💡 Research Summary

The paper addresses a pervasive yet often overlooked problem in modern astronomical surveys: the presence of a small fraction of spurious detections caused by non‑astronomical artefacts such as satellite or aircraft trails, scratches on photographic plates or CCDs, internal reflections that produce circular halos around bright stars, and diffraction spikes emanating from bright sources. Although these false records represent a minute proportion of the total catalogue, they can dominate the subset of objects that a researcher is interested in, especially in studies of rare or faint phenomena, thereby contaminating scientific results.

To tackle this, the authors introduce a novel probabilistic framework called Renewal Strings. The method fuses three well‑known techniques (the Hough transform, renewal processes, and hidden Markov models, or HMMs) into a single pipeline that can robustly detect linear, circular, and radial artefacts while providing a quantitative confidence score for each detection.

The workflow begins with a conventional Hough transform applied to the catalogue of object coordinates. This step generates a dense set of candidate line, circle, or radial parameters, so that few genuine artefacts are missed, at the cost of many false positives. The second stage models the inter‑point distances (or angular separations for circular features) as a renewal process. In a renewal process, the waiting times between successive points are treated as independent random variables drawn from a distribution that reflects the expected spacing of genuine artefacts; points that are unusually far apart are likely to belong to the background rather than to the same artefact.
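The first two steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `hough_lines` and `gaps_along_line`, the bin sizes, the strip half-width and the synthetic catalogue are all assumptions chosen for illustration. The first function votes in (theta, rho) space for a point catalogue; the second extracts the sorted gaps between points lying close to a candidate line, which is the quantity a renewal-process model would then score.

```python
import numpy as np

def hough_lines(points, n_theta=180, rho_bin=1.0):
    """Vote in (theta, rho) space, where rho = x*cos(theta) + y*sin(theta)."""
    points = np.asarray(points, dtype=float)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    # rho[i, t]: signed offset of the line through point i with normal angle thetas[t]
    rho = points[:, [0]] * np.cos(thetas) + points[:, [1]] * np.sin(thetas)
    rho_max = np.abs(rho).max() + rho_bin
    n_rho = int(np.ceil(2.0 * rho_max / rho_bin))
    idx = np.clip(np.floor((rho + rho_max) / rho_bin).astype(int), 0, n_rho - 1)
    acc = np.zeros((n_theta, n_rho), dtype=int)
    for t in range(n_theta):
        acc[t] = np.bincount(idx[:, t], minlength=n_rho)
    rho_centres = -rho_max + (np.arange(n_rho) + 0.5) * rho_bin
    return acc, thetas, rho_centres

def gaps_along_line(points, theta, rho, half_width=1.0):
    """Gaps between consecutive points lying within half_width of a candidate line."""
    points = np.asarray(points, dtype=float)
    x, y = points[:, 0], points[:, 1]
    perp = np.abs(x * np.cos(theta) + y * np.sin(theta) - rho)
    near = perp <= half_width
    # Position along the line, measured in the direction (-sin(theta), cos(theta))
    s = np.sort(-x[near] * np.sin(theta) + y[near] * np.cos(theta))
    return np.diff(s)

# Synthetic catalogue (made-up data): uniform background plus one vertical trail at x = 50
rng = np.random.default_rng(0)
background = rng.uniform(0.0, 100.0, size=(200, 2))
trail = np.column_stack([np.full(40, 50.0), np.linspace(0.0, 100.0, 40)])
catalogue = np.vstack([background, trail])

acc, thetas, rho_centres = hough_lines(catalogue)
t_best, r_best = np.unravel_index(acc.argmax(), acc.shape)
gaps = gaps_along_line(catalogue, thetas[t_best], rho_centres[r_best])
```

In this toy setting the strongest accumulator cell corresponds to the injected trail, and `gaps` holds the spacings that a spacing model would compare against the much larger gaps expected from background objects alone.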

Next, an HMM is constructed with two hidden states: “belongs to artefact” and “background”. Observations fed to the HMM include (i) the degree of agreement with the Hough‑derived parameters, (ii) the renewal‑process likelihood of the observed spacing, and (iii) ancillary attributes such as object brightness or shape. Transition probabilities are derived from the renewal‑process model, while emission probabilities encode how well the observations match each hidden state. The Viterbi algorithm is then used to compute the most probable sequence of hidden states for the entire set of points, yielding both a binary flag (artefact vs. background) and a continuous confidence score.
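A two-state decoding step of this kind can be sketched as follows. This is a generic log-space Viterbi decoder, not the paper's code: the Gaussian emission model, its means and width, the "sticky" transition probabilities and the observation values are all hypothetical, chosen only to show how an isolated weak observation inside an artefact run is smoothed over.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most probable state path.

    log_pi: (S,) log initial probabilities
    log_A:  (S, S) log transition probabilities, log_A[i, j] = log P(j | i)
    log_B:  (T, S) log emission likelihood of each observation under each state
    """
    T, S = log_B.shape
    delta = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # rows: previous state, columns: next state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

def gauss_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Hypothetical per-object "agreement" scores along one candidate line
obs = np.array([0.1, 0.2, 0.15, 0.9, 0.85, 0.3, 0.8, 0.9, 0.1, 0.2])

# State 0 = background, state 1 = belongs to artefact (assumed emission parameters)
log_B = np.column_stack([gauss_logpdf(obs, 0.2, 0.2), gauss_logpdf(obs, 0.8, 0.2)])
log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.9, 0.1],
                [0.1, 0.9]])

states = viterbi(log_pi, log_A, log_B)
```

Viterbi returns only the single best path. If a continuous per-object score is wanted, one common option is the posterior state probability from the forward-backward algorithm, which reuses the same `log_pi`, `log_A` and `log_B` inputs.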

For circular halos and diffraction spikes, the authors extend the basic pipeline. Bright stars are first identified; around each star a separate HMM chain is instantiated to model radial spikes, while a circular Hough transform combined with angular‑spacing renewal modelling captures halos. This modular design allows the same statistical machinery to handle all four artefact classes without redesign.
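When the circle centre is already known (here, the position of a bright star), a circular Hough transform reduces to a one-dimensional vote over radius. The sketch below illustrates that idea under stated assumptions: the function name `halo_radius_excess`, the uniform-background estimate and the synthetic data are illustrative and are not taken from the paper.

```python
import numpy as np

def halo_radius_excess(objects, star_xy, r_max, n_bins=50):
    """Excess of objects per radial bin around a star, relative to a uniform background."""
    objects = np.asarray(objects, dtype=float)
    r = np.hypot(objects[:, 0] - star_xy[0], objects[:, 1] - star_xy[1])
    r = r[r < r_max]
    counts, edges = np.histogram(r, bins=n_bins, range=(0.0, r_max))
    # Expected counts scale with annulus area for a uniform surface density
    area = np.pi * (edges[1:] ** 2 - edges[:-1] ** 2)
    density = r.size / (np.pi * r_max ** 2)   # crude estimate; includes any halo objects
    expected = density * area
    excess = (counts - expected) / np.sqrt(expected)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, excess

# Synthetic field (made-up data): uniform background plus a ring of detections near r = 41
rng = np.random.default_rng(1)
background = rng.uniform(-100.0, 100.0, size=(500, 2))
ring_r = 41.0 + rng.uniform(-0.5, 0.5, size=60)
ring_a = rng.uniform(0.0, 2.0 * np.pi, size=60)
ring = np.column_stack([ring_r * np.cos(ring_a), ring_r * np.sin(ring_a)])
field = np.vstack([background, ring])

centres, excess = halo_radius_excess(field, star_xy=(0.0, 0.0), r_max=100.0)
halo_radius = centres[excess.argmax()]
```

Objects in the peak radial bin could then be ordered by position angle, so that their angular separations play the role that inter-point distances play for linear features.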

The authors validate the approach on the SuperCOSMOS Sky Survey (SSS), which contains over one billion detections. A manually curated test set of roughly one hundred thousand objects, including both genuine artefacts and clean sources, serves as ground truth. Compared with a baseline that uses only the Hough transform and simple distance thresholds, Renewal Strings achieve a precision of 95 % and a recall of 92 %, markedly improving detection of short, fragmented satellite trails and faint diffraction spikes that often evade traditional filters. Moreover, the confidence scores enable users to tune the trade‑off between completeness and contamination according to the scientific goals of a particular analysis.
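The trade-off between completeness and contamination can be made concrete with a small sketch. The scores and labels below are invented for illustration and are unrelated to the figures quoted above; `precision_recall` is a hypothetical helper name.

```python
import numpy as np

def precision_recall(scores, is_artefact, threshold):
    """Precision and recall of the rule 'flag as artefact if score >= threshold'."""
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(is_artefact, dtype=bool)
    flagged = scores >= threshold
    tp = np.sum(flagged & truth)     # artefacts correctly flagged
    fp = np.sum(flagged & ~truth)    # genuine sources wrongly flagged
    fn = np.sum(~flagged & truth)    # artefacts missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return float(precision), float(recall)

# Made-up confidence scores and ground-truth labels (1 = artefact)
scores = [0.95, 0.9, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 1, 0, 0]

p_loose, r_loose = precision_recall(scores, labels, threshold=0.5)
p_strict, r_strict = precision_recall(scores, labels, threshold=0.8)
```

Raising the threshold flags fewer objects: fewer genuine sources are discarded, but more artefacts remain in the catalogue. A study of rare objects might therefore prefer a lower threshold than a study that needs a complete sample.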

Key strengths of the method are its unified probabilistic treatment of diverse artefact morphologies, robustness to background noise thanks to the renewal‑process spacing model, and the provision of per‑detection confidence measures. Limitations include the need for an initial labelled set to estimate renewal‑process parameters and HMM emission probabilities, and a tendency to over‑flag in extremely crowded stellar fields where genuine sources may mimic artefact patterns. The authors suggest future work integrating deep‑learning‑based feature extraction to automate parameter learning and extending the framework to upcoming surveys such as LSST and Euclid.

In conclusion, Renewal Strings represent a powerful, adaptable tool for cleaning large astronomical catalogues, enabling astronomers to remove spurious records systematically and thereby improve the reliability of downstream scientific investigations.