A search for new symbiotic stars in the Milky Way: Using machine learning techniques applied to photometric databases
Symbiotic stars (SySts) are interacting binaries composed of a red giant transferring material to a hot compact star, typically a white dwarf. Although only about 300 systems are confirmed, the Galactic population is estimated at 1.2 × 10³ to 1.5 × 10⁴ systems, indicating that most remain undiscovered. We identify new SySts using a machine-learning approach that combines Gaia DR3, 2MASS, and WISE photometry, parallaxes, and the pseudo-equivalent width of Hα. A Random Forest model was trained on 166 confirmed S-type SySts and 1600 non-symbiotic stars, applying SMOTE to mitigate class imbalance. The model achieved an F1-score of 89% for the symbiotic class. Applied to 2.5 × 10⁶ color-selected sources, it identified 990 candidates with probabilities above 70%. We further refined the sample using physically motivated cuts on effective temperature, surface gravity, metallicity, and SkyMapper photometry, yielding 12 high-confidence candidates. These objects show cool temperatures, low surface gravities, near-solar metallicity, Hα emission, moderate-to-high luminosities, and UV excess consistent with S-type SySts. Validation on recently confirmed systems recovered 92.3% of them, demonstrating the robustness and generalizability of our method.
Research Summary
This paper addresses the long‑standing discrepancy between the relatively small number of confirmed symbiotic stars (≈300) and the much larger Galactic population predicted by theoretical models (1.2 × 10³–1.5 × 10⁴). The authors develop a supervised machine‑learning pipeline specifically tuned to identify new S‑type symbiotic stars (SySts) by exploiting the wealth of photometric and astrometric data now available from Gaia DR3, 2MASS, and WISE, together with the pseudo‑equivalent width of H α derived from Gaia’s low‑resolution XP spectra.
Dataset construction:
The training set consists of 166 spectroscopically confirmed S‑type SySts drawn from the Merc et al. (2024) catalogue and 1 600 non‑symbiotic objects selected to represent the dominant field‑star population. Because many physical parameters (effective temperature, surface gravity, metallicity) are missing for a substantial fraction of the confirmed SySts, the authors deliberately exclude these from the feature matrix and instead use them later as physical filters. Photometric colors are built from intra‑survey band combinations (Gaia G‑BP, BP‑RP; 2MASS J‑H, H‑K; WISE W1‑W2) to minimise the impact of variability. The H α pseudo‑equivalent width (EW_Hα) spans from +0.69 Å (weak absorption) to –18.49 Å (strong emission) and serves as a key discriminant. Parallax limits (0–5.29 mas) are also imposed to restrict the search to plausible Galactic distances.
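As a rough illustration of the feature construction described above, the sketch below builds intra-survey colours from one source's magnitudes and checks the parallax limits. The dictionary keys (`G`, `BP`, `RP`, `J`, `H`, `K`, `W1`, `W2`, `parallax`, `ew_halpha`), the function names, and the example magnitudes are hypothetical and chosen for illustration; they are not the authors' column names or data. Treating the parallax limits as inclusive is likewise an assumption.

```python
# Minimal sketch of intra-survey colour features (hypothetical column names).
# Each colour uses two bands from the SAME survey, as described in the text.
COLOUR_PAIRS = [
    ("G", "BP"), ("BP", "RP"),  # Gaia
    ("J", "H"), ("H", "K"),     # 2MASS
    ("W1", "W2"),               # WISE
]

# Parallax limits quoted in the text, in milliarcseconds.
PARALLAX_MIN_MAS = 0.0
PARALLAX_MAX_MAS = 5.29


def build_features(source):
    """Return colours (first band minus second band) plus EW(H alpha) and parallax."""
    features = {}
    for first, second in COLOUR_PAIRS:
        features[f"{first}-{second}"] = source[first] - source[second]
    features["ew_halpha"] = source["ew_halpha"]
    features["parallax"] = source["parallax"]
    return features


def within_parallax_limits(source):
    """ASSUMPTION: both limits are treated as inclusive."""
    return PARALLAX_MIN_MAS <= source["parallax"] <= PARALLAX_MAX_MAS


# Made-up magnitudes for a single illustrative source (not real data).
example_source = {
    "G": 11.2, "BP": 12.6, "RP": 10.1,
    "J": 8.4, "H": 7.5, "K": 7.2,
    "W1": 7.0, "W2": 7.1,
    "parallax": 0.35, "ew_halpha": -6.0,
}

example_features = build_features(example_source)
```

Keeping each colour within a single survey means both magnitudes come from the same observing programme, which is the reason the summary gives for limiting the impact of variability.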
Feature engineering and class imbalance handling:
To counter the severe class imbalance (SySts ≪ non‑SySts), the authors apply SMOTE (Synthetic Minority Over‑sampling Technique), which generates synthetic SySt instances by interpolating between real minority‑class samples and their nearest minority‑class neighbours in feature space. This step is intended to give the Random Forest classifier a better‑populated decision boundary without simply duplicating the limited real SySt sample.
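The core idea of SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration of the interpolation step only, not the authors' code; the function name `smote_sketch`, the neighbour count `k`, and the toy points are assumptions made for this example. In practice a library implementation such as `imblearn.over_sampling.SMOTE` would typically be used; the summary does not state which implementation the authors chose.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating towards nearest neighbours.

    minority    : list of feature tuples for the minority class (at least 2 points)
    n_synthetic : number of synthetic samples to generate
    k           : number of nearest minority neighbours to choose from (assumed value)
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation fraction in [0, 1)
        # The new point lies on the segment between base and the chosen neighbour.
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy two-feature minority class (illustrative values only).
toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
toy_synthetic = smote_sketch(toy_minority, n_synthetic=10, k=2)
```

Because every synthetic point lies on a line segment between two real minority samples, the oversampled class stays inside the region already spanned by the confirmed systems.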
Model training and evaluation:
A Random Forest with 500 trees is trained using 5‑fold cross‑validation. Performance metrics on the held‑out validation set show an F1‑score of 0.89 for the SySt class, overall accuracy of 0.93, precision of 0.91, and recall of 0.92. To test generalisation, the model is applied to a recent set of 13 newly confirmed SySts; 12 are correctly identified (92.3 % recovery), confirming robustness against unseen data.
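For reference, the metrics quoted above follow from simple counts. The sketch below shows the standard definitions of precision, recall, and F1, and reproduces the 12-of-13 recovery fraction; the confusion counts in the first example are hypothetical, since the summary does not report the underlying confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard per-class metrics from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def recovery_fraction(n_recovered, n_total):
    """Fraction of known objects that the classifier recovers."""
    return n_recovered / n_total


# HYPOTHETICAL counts, only to exercise the formulas (not taken from the paper).
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=10)

# Recovery test quoted in the text: 12 of 13 newly confirmed SySts.
print(f"{100 * recovery_fraction(12, 13):.1f} %")  # prints "92.3 %"
```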
Application to the full candidate pool:
The authors first define a broad colour‑selected sample of ≈2.5 × 10⁶ Gaia sources whose colours fall within the empirically determined S‑type SySt envelope (Table 1). The trained classifier is then run on this pool, yielding 990 objects with a predicted SySt probability ≥ 70 %. To prune this list, the authors impose physically motivated cuts: effective temperature 3 500–4 000 K, surface gravity log g ≈ 0–2, near‑solar metallicity, and a UV excess identified from SkyMapper photometry. After this filtering, 12 high‑confidence candidates remain. All exhibit cool temperatures, low gravities, near‑solar metallicities, H α emission, moderate‑to‑high luminosities (derived from Gaia parallaxes), and UV excesses consistent with a hot white‑dwarf component, matching the canonical S‑type SySt phenomenology.
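The two-stage selection described above (probability threshold, then physical cuts) can be sketched as a simple filter. The field names (`p_syst`, `teff`, `logg`, `feh`, `uv_excess`), the candidate values, and the metallicity tolerance are assumptions for illustration: the summary only says "near-solar" without a numerical range, and it does not specify how the UV excess is quantified, so a boolean flag stands in for that test here.

```python
PROB_THRESHOLD = 0.70            # classifier probability cut quoted in the text
TEFF_RANGE_K = (3500.0, 4000.0)  # effective temperature range quoted in the text
LOGG_RANGE = (0.0, 2.0)          # surface gravity range quoted in the text
MAX_ABS_FEH = 0.5                # ASSUMPTION: "near-solar" is not quantified in the text


def is_high_confidence(cand):
    """Apply the probability cut and the physical cuts to one candidate record.

    ASSUMPTION: range boundaries are treated as inclusive.
    """
    return (
        cand["p_syst"] >= PROB_THRESHOLD
        and TEFF_RANGE_K[0] <= cand["teff"] <= TEFF_RANGE_K[1]
        and LOGG_RANGE[0] <= cand["logg"] <= LOGG_RANGE[1]
        and abs(cand["feh"]) <= MAX_ABS_FEH
        and cand["uv_excess"]  # hypothetical flag standing in for the SkyMapper test
    )


# Made-up candidate records (not real sources).
candidates = [
    {"name": "A", "p_syst": 0.91, "teff": 3700.0, "logg": 1.1, "feh": -0.1, "uv_excess": True},
    {"name": "B", "p_syst": 0.65, "teff": 3650.0, "logg": 0.9, "feh": 0.0, "uv_excess": True},
    {"name": "C", "p_syst": 0.88, "teff": 4500.0, "logg": 1.5, "feh": 0.1, "uv_excess": True},
    {"name": "D", "p_syst": 0.80, "teff": 3900.0, "logg": 0.4, "feh": 0.2, "uv_excess": False},
]

selected = [c["name"] for c in candidates if is_high_confidence(c)]
```

In this toy list only candidate A survives: B fails the probability cut, C is too hot, and D shows no UV excess.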
Discussion and comparison with prior work:
The paper situates its results against earlier machine‑learning searches (e.g., Akras et al. 2019b, Jia et al. 2023, Ball et al. 2025). While those studies reported large candidate lists with low confirmation rates, the present work achieves a markedly higher recovery fraction by integrating domain knowledge (H α emission, physical parameter limits) into the pipeline. The exclusive focus on S‑type systems is justified by their tighter colour locus, which reduces confusion with mimics such as T Tauri or Be stars. However, the authors acknowledge that D‑type and D′‑type SySts, which possess significant dust emission, are excluded; extending the methodology to those subclasses will require mid‑infrared data (e.g., Spitzer, AKARI) and perhaps a separate classifier.
Conclusions and future prospects:
The study demonstrates that a carefully curated, multi‑wavelength feature set combined with SMOTE‑balanced Random Forest classification can reliably isolate S‑type symbiotic stars from millions of Gaia sources. The 12 newly identified high‑confidence candidates provide a valuable target list for spectroscopic follow‑up, which will refine Galactic population estimates and improve our understanding of binary evolution pathways leading to Type Ia supernova progenitors. Future work will aim to (i) incorporate dust‑sensitive mid‑IR bands to capture D‑type systems, (ii) explore deep‑learning architectures that can directly ingest low‑resolution spectra for more nuanced line diagnostics, and (iii) apply the pipeline to upcoming data releases (Gaia DR4, LSST) to continuously update the Galactic SySt census.