CLaSPS: a new methodology for knowledge extraction from complex astronomical datasets

In this paper we present the Clustering-Labels-Score Patterns Spotter (CLaSPS), a new methodology for the determination of correlations among astronomical observables in complex datasets, based on the application of distinct unsupervised clustering techniques. The novelty in CLaSPS is the criterion used for the selection of the optimal clusterings, based on a quantitative measure of the degree of correlation between the cluster memberships and the distribution of a set of observables, the labels, not employed for the clustering. We discuss the applications of CLaSPS to two simple astronomical datasets, both composed of extragalactic sources with photometric observations at different wavelengths from large-area surveys. The first dataset, CSC+, is composed of optical quasars spectroscopically selected in the SDSS data, observed in the X-rays by Chandra and with multi-wavelength observations in the near-infrared, optical and ultraviolet spectral intervals. One of the results of the application of CLaSPS to the CSC+ is the re-identification of a well-known correlation between the α_OX parameter and the near-ultraviolet color, in a subset of CSC+ sources with relatively small values of the near-ultraviolet colors. The other dataset consists of a sample of blazars for which photometric observations in the optical, mid-infrared and near-infrared are available, complemented, for a subset of the sources, by Fermi gamma-ray data. The main results of the application of CLaSPS to this dataset have been the discovery of a strong correlation between the multi-wavelength color distribution of blazars and their optical spectral classification into BL Lacs and Flat Spectrum Radio Quasars, and a peculiar pattern followed by blazars in the WISE mid-infrared color space. This pattern and its physical interpretation have been discussed in detail in other papers by one of the authors.


💡 Research Summary

The paper introduces CLaSPS (Clustering‑Labels‑Score Patterns Spotter), a novel framework designed to uncover hidden correlations in complex, multi‑dimensional astronomical data sets. The method combines multiple unsupervised clustering algorithms with a quantitative “label‑score” metric that evaluates how strongly the resulting clusters are associated with a set of observables (the “labels”) that were deliberately excluded from the clustering process. The workflow proceeds in four stages. First, a suite of clustering techniques (K‑means, Gaussian Mixture Models, DBSCAN, hierarchical clustering, etc.) is applied independently to the same data matrix, producing a variety of candidate partitionings. Second, a separate collection of label variables—typically colors, spectral indices, or high‑energy fluxes—is assembled; these labels are never used to guide the clustering. Third, for each candidate clustering the distribution of each label within each cluster is statistically compared to the global label distribution. The authors implement χ², Kolmogorov‑Smirnov, and information‑gain based scores to quantify the degree of intra‑cluster label homogeneity. A high label‑score indicates that a particular cluster captures a coherent physical regime reflected by the label. Finally, the clustering that yields the highest aggregate label‑score across all labels is selected as the “optimal” partition, thereby automatically identifying the most physically meaningful grouping without any prior bias toward the labels.
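The four stages described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the helper names `kmeans` and `label_score` are hypothetical, only one clustering algorithm (a plain k-means run with several values of k) stands in for the full suite of techniques, the data are synthetic, and Cramér's V (a normalized χ² statistic) is swapped in as a stand-in for the label-score, whose exact definition is not given in this summary.

```python
import random
from collections import Counter


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on a list of tuples; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def label_score(assign, labels):
    """Cramér's V between cluster membership and label classes (assumed stand-in score)."""
    n = len(assign)
    clusters, classes = Counter(assign), Counter(labels)
    joint = Counter(zip(assign, labels))
    chi2 = 0.0
    for c, n_c in clusters.items():
        for l, n_l in classes.items():
            expected = n_c * n_l / n  # count expected if clusters and labels were independent
            chi2 += (joint[(c, l)] - expected) ** 2 / expected
    dof = min(len(clusters), len(classes)) - 1
    return (chi2 / (n * dof)) ** 0.5 if dof > 0 else 0.0


# Synthetic toy data: two blobs in a 2-D feature space, with a label class
# ("A" or "B") that is never passed to the clustering step.
rng = random.Random(42)
features, labels = [], []
for center, label in [((0.0, 0.0), "A"), ((5.0, 5.0), "B")]:
    for _ in range(40):
        features.append((rng.gauss(center[0], 0.5), rng.gauss(center[1], 0.5)))
        labels.append(label)

# Stage 1: candidate clusterings. Stages 3 and 4: score each one, keep the best.
candidates = {k: kmeans(features, k, seed=k) for k in (2, 3, 4)}
scores = {k: label_score(a, labels) for k, a in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, "-> selected k =", best_k)
```

One design caveat of this stand-in score: Cramér's V does not penalize over-splitting, so several candidate clusterings can tie at the maximum, and `max` then simply returns the first. A production criterion would need an explicit rule for such ties.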

To demonstrate the utility of CLaSPS, the authors apply it to two astrophysical data sets. The first, dubbed CSC+, consists of spectroscopically confirmed SDSS quasars with ancillary observations from Chandra (X‑ray), GALEX (UV), and 2MASS (near‑infrared). Using CLaSPS, the well‑known anti‑correlation between the X‑ray‑to‑optical spectral slope α_OX and the near‑UV color (NUV‑g) is recovered. Importantly, the method isolates a sub‑sample of quasars with small NUV colors where the α_OX‑NUV relation is especially tight, confirming earlier suggestions that UV‑bright quasars tend to be relatively X‑ray luminous.
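For readers unfamiliar with α_OX, the sketch below computes it under the conventional definition, assumed here because the summary does not state the paper's exact convention: the two-point spectral index between the monochromatic luminosities at 2500 Å and 2 keV, which reduces to roughly 0.384 times the logarithm of their ratio. The function name `alpha_ox` and the input values are illustrative only.

```python
import math

H_EV_S = 4.135667696e-15      # Planck constant in eV s
C_ANGSTROM_S = 2.99792458e18  # speed of light in Angstrom per second

NU_2KEV = 2000.0 / H_EV_S        # frequency of a 2 keV photon, in Hz
NU_2500 = C_ANGSTROM_S / 2500.0  # frequency at 2500 Angstrom, in Hz


def alpha_ox(l_2kev, l_2500):
    """Two-point spectral index from monochromatic luminosities given in the same units."""
    return math.log10(l_2kev / l_2500) / math.log10(NU_2KEV / NU_2500)


# Normalization factor of the conventional definition.
print(round(1.0 / math.log10(NU_2KEV / NU_2500), 4))
# Illustrative (made-up) luminosities, four decades apart.
print(round(alpha_ox(1e26, 1e30), 3))
```

More negative values of α_OX correspond to sources that are weaker in X-rays relative to their ultraviolet emission.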

The second data set comprises a heterogeneous sample of blazars (both BL Lac objects and flat‑spectrum radio quasars, FSRQs) with photometry from optical surveys, WISE mid‑infrared bands, and 2MASS near‑infrared bands; a subset also has Fermi γ‑ray fluxes. CLaSPS uncovers a strong, statistically significant segregation in multi‑wavelength color space that mirrors the traditional optical classification: BL Lacs and FSRQs occupy distinct regions in the WISE color–color diagram. Moreover, the analysis reveals a peculiar “WISE strip” populated almost exclusively by blazars, a pattern that has been explored in depth in separate publications by one of the authors and linked to differences in synchrotron peak frequencies and external radiation fields.

The authors argue that CLaSPS offers several advantages over conventional post‑hoc correlation searches. By evaluating many clustering solutions in parallel, it reduces dependence on any single algorithm’s assumptions. The label‑score provides an objective, reproducible measure of physical relevance, eliminating subjective choices about which observables to examine. Because labels are not required during clustering, the framework can be readily extended to incorporate new data dimensions (e.g., time‑domain variability, polarization) without redesigning the pipeline. Finally, the method is scalable to the massive data volumes expected from upcoming surveys such as LSST, Euclid, and the SKA, where automated discovery of multi‑wavelength relationships will be essential.

In summary, CLaSPS constitutes a flexible, statistically rigorous tool for knowledge extraction from complex astronomical catalogs. Its successful application to quasar and blazar samples demonstrates its capacity to both recover known astrophysical relations and reveal new, physically interpretable patterns, positioning it as a valuable asset for future big‑data astrophysics.