Finding rare objects and building pure samples: Probabilistic quasar classification from low resolution Gaia spectra
We develop and demonstrate a probabilistic method for classifying rare objects in surveys with the particular goal of building very pure samples. It works by modifying the output probabilities from a classifier so as to accommodate our expectation (priors) concerning the relative frequencies of different classes of objects. We demonstrate our method using the Discrete Source Classifier, a supervised classifier currently based on Support Vector Machines, which we are developing in preparation for the Gaia data analysis. DSC classifies objects using their very low resolution optical spectra. We look in detail at the problem of quasar classification, because identification of a pure quasar sample is necessary to define the Gaia astrometric reference frame. By varying a posterior probability threshold in DSC we can trade off sample completeness and contamination. We show, using our simulated data, that it is possible to achieve a pure sample of quasars (upper limit on contamination of 1 in 40,000) with a completeness of 65% at magnitudes of G=18.5, and 50% at G=20.0, even when quasars have a frequency of only 1 in every 2000 objects. The star sample completeness is simultaneously 99% with a contamination of 0.7%. Including parallax and proper motion in the classifier barely changes the results. We further show that not accounting for class priors in the target population leads to serious misclassifications and poor predictions for sample completeness and contamination.
💡 Research Summary
The paper presents a probabilistic framework for classifying rare astronomical objects in large surveys, with a focus on constructing extremely pure quasar samples for the Gaia mission. The authors start from the Discrete Source Classifier (DSC), a supervised machine‑learning system currently based on Support Vector Machines, which ingests Gaia’s very low‑resolution (R≈20) optical spectra (330–1050 nm) and outputs raw class probabilities for three categories: stars, galaxies, and quasars.
The key methodological advance is the explicit incorporation of class priors that reflect the true frequencies of objects in the target population. In the Gaia context quasars are expected to occur only once in about 2 000 sources (≈0.05 %). The raw DSC probabilities, which are implicitly conditioned on the class distribution of the training set, are therefore re‑weighted by the prior odds using Bayes’ theorem to produce posterior probabilities that are appropriate for the survey’s actual composition.
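The re‑weighting described above can be sketched in a few lines of Python. This is a minimal illustration of the standard Bayes prior correction, not the DSC implementation: the function name `adjust_for_priors`, the class ordering, the balanced training priors, and the galaxy fraction are all assumptions made for the example. Only the quasar fraction of 1 in 2000 comes from the text.

```python
import numpy as np

def adjust_for_priors(probs, train_priors, target_priors):
    """Re-weight classifier probabilities from training priors to target priors.

    probs          : class probabilities as output by the classifier,
                     implicitly conditioned on the training-set class mix
    train_priors   : class fractions in the training set
    target_priors  : class fractions expected in the target population
    """
    probs = np.asarray(probs, dtype=float)
    # Ratio of new to old priors, one weight per class.
    weights = np.asarray(target_priors, dtype=float) / np.asarray(train_priors, dtype=float)
    unnormalised = probs * weights
    # Renormalise so the adjusted probabilities sum to one per object.
    return unnormalised / unnormalised.sum(axis=-1, keepdims=True)

# Hypothetical example; classes ordered as (star, galaxy, quasar).
train_priors = [1 / 3, 1 / 3, 1 / 3]      # assumed balanced training set
target_priors = [0.9895, 0.01, 0.0005]    # quasars 1 in 2000; galaxy fraction assumed
raw = [0.10, 0.05, 0.85]                  # raw output that looks like a confident quasar

posterior = adjust_for_priors(raw, train_priors, target_priors)
print(posterior)
```

With these assumed numbers, an object that the raw classifier rates as 85 % quasar ends up with a quasar posterior below 1 %, because a star with an unusual spectrum is far more likely a priori than a genuine quasar.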
Once posterior probabilities are available, the user can impose a probability threshold to decide which sources are accepted as quasars. Raising the threshold reduces contamination at the cost of completeness, allowing a controlled trade‑off between purity and recall. The authors explore this trade‑off with simulated Gaia data at two magnitudes, G = 18.5 and G = 20.0. At G = 18.5, a threshold of 0.9999 yields a quasar contamination of ≤2.5 × 10⁻⁵ (i.e., fewer than one false quasar in 40 000 candidates) while retaining 65 % of the true quasars. At the fainter magnitude G = 20.0 the same threshold gives 50 % completeness with the same ultra‑low contamination level.
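The threshold trade‑off can be made concrete with a short sketch. The function `completeness_contamination` and the toy posteriors below are invented for illustration and are not taken from the paper; the sketch also assumes that the labelled sample has the same class mix as the target population, otherwise the contamination has to be re‑weighted by the class fractions.

```python
import numpy as np

def completeness_contamination(posterior_quasar, is_quasar, threshold):
    """Completeness and contamination of the sample selected at a given threshold."""
    posterior_quasar = np.asarray(posterior_quasar, dtype=float)
    is_quasar = np.asarray(is_quasar, dtype=bool)

    selected = posterior_quasar >= threshold
    n_selected = selected.sum()
    n_true = is_quasar.sum()
    n_true_selected = (selected & is_quasar).sum()

    # Completeness: fraction of true quasars that make it into the sample.
    completeness = n_true_selected / n_true if n_true else float("nan")
    # Contamination: fraction of the selected sample that is not a quasar.
    contamination = (
        (n_selected - n_true_selected) / n_selected if n_selected else float("nan")
    )
    return completeness, contamination

# Toy data (hypothetical): four true quasars followed by four non-quasars.
post = np.array([0.99995, 0.9999, 0.97, 0.60, 0.70, 0.20, 0.01, 0.001])
truth = np.array([True, True, True, True, False, False, False, False])

for t in (0.5, 0.9, 0.9999):
    comp, cont = completeness_contamination(post, truth, t)
    print(f"threshold={t}: completeness={comp:.2f}, contamination={cont:.2f}")
```

In this toy sample, raising the threshold from 0.5 to 0.9 removes the single contaminant but also drops one true quasar, which is the same purity versus completeness trade that the summary describes.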
The method simultaneously delivers excellent performance for the dominant stellar class: the star sample retains 99 % completeness with only 0.7 % contamination. Adding astrometric information (parallax and proper motion) as extra features to the classifier barely changes the results, which suggests that the low‑resolution spectra already carry most of the discriminative information.
A crucial diagnostic presented in the paper is the impact of neglecting class priors. When the priors of the target population are not accounted for, the classifier implicitly assumes the training‑set class frequencies, which leads to serious misclassifications and poor predictions of sample completeness and contamination. This demonstrates that any realistic search for rare objects must adjust for the true prior distribution of classes.
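A short calculation shows why predictions made without the target priors can be badly wrong. The sketch below is an illustration under stated assumptions, not a result from the paper: `expected_contamination` is a hypothetical helper, the 65 % quasar acceptance rate is borrowed from the completeness quoted above, and the false acceptance rate of 1 in 10 000 for non‑quasars is invented for the example.

```python
def expected_contamination(tpr_quasar, fpr_other, quasar_fraction):
    """Expected contamination of a quasar sample for a given class mix.

    tpr_quasar      : fraction of true quasars that are accepted
    fpr_other       : fraction of non-quasars that are wrongly accepted
    quasar_fraction : fraction of quasars in the population being classified
    """
    true_selected = tpr_quasar * quasar_fraction
    false_selected = fpr_other * (1.0 - quasar_fraction)
    return false_selected / (true_selected + false_selected)

tpr, fpr = 0.65, 1e-4  # fpr is an assumed, illustrative value

# Prediction from a balanced test set (equal numbers of quasars and non-quasars).
balanced = expected_contamination(tpr, fpr, quasar_fraction=0.5)
# Prediction for a population in which quasars are 1 in 2000.
target = expected_contamination(tpr, fpr, quasar_fraction=1 / 2000)

print(f"balanced test set: {balanced:.5f}")
print(f"target population: {target:.3f}")
```

With these assumed rates, the same classifier looks almost perfectly pure on a balanced test set, yet roughly a quarter of its selected sample would be contaminants once quasars are as rare as 1 in 2000.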
The authors argue that the prior‑adjustment step is model‑agnostic: it can be applied to any probabilistic classifier (e.g., neural networks, random forests) as long as the classifier provides calibrated class probabilities. Consequently, the approach is broadly applicable to other rare‑object searches such as supernovae, gravitational‑wave counterparts, or exotic transients, where the scientific payoff depends critically on sample purity.
In summary, the study shows that by re‑weighting classifier outputs with realistic class priors and by carefully selecting a posterior probability threshold, one can construct quasar samples of very high purity (≤1 / 40 000 contaminants) while maintaining useful completeness (50–65 %). This methodology addresses the need for a pure quasar sample to define the Gaia astrometric reference frame and provides a general template for rare‑object classification in forthcoming large‑scale astronomical surveys.