Classifying High-Energy Celestial Objects with Machine Learning Methods
Machine learning has been growing in importance since the early 2010s, driven by the increasing accuracy of classification models and by hardware advances that enable faster training on large datasets. In astronomy, tree-based models and simple neural networks have recently garnered attention as a means of classifying celestial objects from photometric data. We apply common tree-based models to assess their performance at discriminating between objects with similar photometric signals: pulsars and black holes. We also train an RNN on a downsampled and normalized version of the raw signal data to examine its potential for real-time object discrimination and classification.
💡 Research Summary
The paper investigates the feasibility of using machine learning (ML) techniques to automatically distinguish between two high‑energy astrophysical objects—black holes and pulsars—based on NuSTAR hard X‑ray observations. The authors retrieve event files (.evt) from the HEASARC archive, each containing photon arrival time, detector pixel, and pulse‑invariant (PI) energy channel. Because of instrument dead time, satellite passages through high‑radiation zones, and other operational constraints, the raw data contain large gaps (~3 000 s) and irregular inter‑arrival times that follow an approximate Poisson distribution with a mean of 0.5 s.
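The arrival process described above can be mimicked with a short simulation. The 0.5 s mean inter-arrival time and the ~3 000 s gap come from the summary; the array size, seed, and gap placement are arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential inter-arrival times with mean 0.5 s approximate the
# Poisson-like photon stream reported for the NuSTAR event files.
inter_arrivals = rng.exponential(scale=0.5, size=10_000)
times = np.cumsum(inter_arrivals)

# Inject an observational gap (~3000 s), e.g. a satellite passage
# through a high-radiation zone or instrument dead time.
gap_start = times[len(times) // 2]
times[times > gap_start] += 3_000.0

# The empirical mean inter-arrival time stays close to 0.5 s.
print(inter_arrivals.mean())
```

Gaps like this are why the authors cannot treat the raw stream as uniformly sampled and must re-bin before windowing.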
To create a uniform dataset, the authors first integrate photon energies over 1‑second intervals, discarding spatial information under the assumption that the two classes have no distinctive spatial signatures. They then segment the continuous streams into 30‑minute windows (1 800 s) with a 30‑second stride, yielding many overlapping samples. From each window they extract ten statistical descriptors: mean, the five-number summary (minimum, 25th percentile, median, 75th percentile, maximum), standard deviation, coefficient of variation, skewness, and kurtosis. This results in a feature matrix of size (samples × 10) plus a binary label (0 = black hole, 1 = pulsar). The dataset is imbalanced (≈3:1 favoring black holes), so class weighting and stratified splitting are employed during training.
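A minimal sketch of the windowing and descriptor extraction, using a synthetic Poisson-binned series in place of the real NuSTAR counts (the helper name `window_features` is hypothetical, not from the paper):

```python
import numpy as np

def window_features(binned, window=1800, stride=30):
    """Extract the ten descriptors from overlapping windows of a
    1-second-binned intensity series (hypothetical helper)."""
    feats = []
    for start in range(0, len(binned) - window + 1, stride):
        w = binned[start:start + window]
        mean, std = w.mean(), w.std()
        q = np.percentile(w, [0, 25, 50, 75, 100])  # five-number summary
        z = (w - mean) / std
        skew = np.mean(z ** 3)   # third standardized moment
        kurt = np.mean(z ** 4)   # fourth standardized moment
        cv = std / mean          # coefficient of variation
        feats.append([mean, *q, std, cv, skew, kurt])
    return np.asarray(feats)

rng = np.random.default_rng(1)
binned = rng.poisson(lam=2.0, size=10 * 1800)  # synthetic stand-in
X = window_features(binned)
print(X.shape)  # (541, 10): overlapping windows x ten descriptors
```

The 30-second stride over 1 800-second windows is what produces many overlapping samples from each observation, at the cost of correlated rows in the feature matrix.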
Four models are evaluated:
- Logistic Regression – Serves as a linear baseline. Trained for 1 000 iterations, it achieves 71.1 % training accuracy and 71 % test accuracy. The confusion matrix shows many black holes mis‑classified as pulsars, reflecting the model’s inability to capture the non‑linear decision boundary required for this problem.
- Random Forest – Implemented with 100 trees, each split considering three random features, and no maximum depth constraint. Training accuracy reaches 100 % (indicative of over‑fitting), while test accuracy is 93.1 %. Feature importance (via Gini impurity) highlights median intensity, the 25th percentile, and mean as the top predictors. However, the average tree depth of 42 makes the model computationally heavy and less interpretable.
- XGBoost – Configured with a learning rate of 0.05, 400 trees, max depth 10, and 80 % subsampling of rows and columns per tree. Training accuracy is 93.8 % and test accuracy 92 %. SHAP analysis mirrors the Random Forest results, confirming that central tendency and tail‑behavior statistics drive classification. The shallower trees reduce over‑fitting relative to the Random Forest.
- Bidirectional LSTM RNN – A ten‑layer network comprising two bidirectional LSTM layers, one standard LSTM, three dense layers, and four dropout layers (30 % or 40 % dropout). Input sequences are the same 1‑second‑binned, z‑score normalized intensities. The model is trained on a 12‑core CPU with 64 GiB RAM for ~11 hours. It reaches 91 % training accuracy but only 69 % test accuracy, indicating severe over‑fitting. The authors attribute this to limited data volume, insufficient hyper‑parameter tuning, and the exclusion of spatial information that could aid feature extraction.
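The baseline and Random Forest configurations above can be sketched with scikit-learn using the reported hyper-parameters; the data here are synthetic stand-ins for the (samples × 10) feature matrix, so the accuracies will not reproduce the paper's figures:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the (samples x 10) feature matrix; the real
# features come from the windowed NuSTAR light curves.
rng = np.random.default_rng(2)
n = 1200
y = (rng.random(n) < 0.25).astype(int)           # ~3:1 imbalance, 1 = pulsar
X = rng.normal(size=(n, 10)) + y[:, None] * 0.8  # some class separation

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Linear baseline: 1000 iterations, class weighting for the imbalance.
logreg = LogisticRegression(max_iter=1000, class_weight="balanced")
logreg.fit(X_tr, y_tr)

# Random forest as reported: 100 trees, 3 random features per split,
# unbounded depth.
forest = RandomForestClassifier(
    n_estimators=100, max_features=3, max_depth=None,
    class_weight="balanced", random_state=0)
forest.fit(X_tr, y_tr)

print(logreg.score(X_te, y_te), forest.score(X_te, y_te))
```

Stratified splitting keeps the ≈3:1 class ratio consistent between the train and test folds, which matters when accuracy is the headline metric.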
Overall, the tree‑based methods outperform the linear baseline and the RNN, achieving >92 % accuracy on a challenging binary classification task where visual inspection of light curves fails to separate the classes. The authors discuss several avenues for improvement: (i) incorporating time‑frequency transformations (e.g., wavelet or Fourier spectra) to enrich features; (ii) applying advanced imbalance‑handling techniques such as SMOTE or cost‑sensitive learning; (iii) developing hybrid CNN‑LSTM architectures that exploit both spatial and temporal patterns; (iv) employing regularization strategies (early stopping, weight decay) and more extensive hyper‑parameter searches for the RNN; and (v) exploring point‑process models that directly model the Poisson‑like arrival times.
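Avenue (i) can be illustrated with a simple Fourier power-spectrum feature extractor; `spectral_features` is a hypothetical helper, and the 0.05 Hz modulation is an arbitrary stand-in for a pulsar's periodic signal:

```python
import numpy as np

def spectral_features(window, dt=1.0, k=5):
    """Top-k Fourier power features for a 1-second-binned window
    (hypothetical extension of the feature set)."""
    detrended = window - window.mean()
    power = np.abs(np.fft.rfft(detrended)) ** 2
    freqs = np.fft.rfftfreq(len(window), d=dt)
    top = np.argsort(power)[::-1][:k]  # k strongest frequency bins
    return freqs[top], power[top]

rng = np.random.default_rng(3)
t = np.arange(1800)
# Pulsar-like window: periodic modulation buried in Poisson noise.
window = rng.poisson(lam=2.0 + np.sin(2 * np.pi * 0.05 * t))
freqs, power = spectral_features(window)
print(freqs[0])  # strongest peak sits at the 0.05 Hz modulation
```

Features like these would let a classifier see periodicity directly, rather than only through summary statistics such as variance or kurtosis.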
In conclusion, the study demonstrates that relatively simple statistical feature extraction combined with modern ensemble classifiers can reliably discriminate between black holes and pulsars in hard X‑ray data, paving the way for automated, real‑time classification pipelines in high‑energy astrophysics. Future work that integrates richer signal processing and deep learning techniques is likely to push performance even higher and enable broader applicability to other astrophysical classification problems.