Semi-Supervised Learning for Lensed Quasar Detection

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Lensed quasars are key to many areas of study in astronomy, offering a unique probe into the intermediate and far universe. However, finding lensed quasars has proved difficult despite significant efforts from large collaborations. These challenges have limited catalogues of confirmed lensed quasars to the hundreds, despite theoretical predictions that they should be many times more numerous. We train machine learning classifiers to discover lensed quasar candidates. By using semi-supervised learning techniques we leverage the large number of potential candidates as unlabelled training data alongside the small number of known objects, greatly improving model performance. We present our two most successful models: (1) a variational autoencoder trained on millions of quasars to reduce the dimensionality of images for input to a dense neural network classifier that can make accurate predictions and (2) a convolutional neural network trained on a mix of labelled and unlabelled data via virtual adversarial training. These models are both capable of producing high-quality candidates, as evidenced by our discovery of GRALJ140833.73+042229.98. The success of our classifier, which uses only multi-band images, is particularly exciting as it can be combined with existing classifiers, which use data other than images, to improve the classifications of both models and discover more lensed quasars.


💡 Research Summary

The paper addresses the long‑standing challenge of discovering strongly lensed quasars, a class of objects that are both rare and scientifically valuable for cosmology, galaxy evolution, and measurements of the Hubble constant. Despite theoretical predictions that thousands of such systems should exist, only a few hundred have been confirmed, primarily because the number of labelled examples is tiny while the pool of potential candidates is enormous. To overcome this data‑scarcity problem, the authors apply semi‑supervised learning (SSL), a set of techniques that can exploit both labelled and unlabelled data simultaneously.

Two distinct SSL pipelines are built. The first combines a variational autoencoder (VAE) with a dense neural‑network classifier. The VAE is trained on millions of quasar cut‑out images (64 × 64 pixels in the g, r, i bands) drawn from the Pan‑STARRS and DESI Legacy Imaging surveys. By compressing each image into a 32‑dimensional latent vector, the VAE removes high‑frequency noise while preserving the morphological cues that distinguish lensed configurations (multiple point‑like sources, colour contrast between the lens galaxy and the quasar images, etc.). The latent vectors are then fed into a fully‑connected classifier that is trained on the small set of labelled objects (≈250 confirmed lenses and ≈1 000 non‑lenses). This two‑stage approach allows the massive unlabelled dataset to shape the feature space, while the labelled set fine‑tunes the decision boundary.
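The two‑stage idea can be sketched in a few lines. The following is a toy illustration, not the paper's implementation: the linear "encoder" with random weights stands in for the trained VAE encoder, and only the shapes (64 × 64 × 3 image in, 32‑dimensional latent out) follow the summary. The reparameterization trick (z = μ + σ·ε) is the standard way a VAE samples its latent vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes from the summary: 64x64 cut-outs in g, r, i bands -> 32-d latent.
IMG_DIM, LATENT_DIM = 64 * 64 * 3, 32

# Toy linear "encoder" standing in for the trained VAE encoder: it outputs
# a mean and log-variance, and a latent is sampled via z = mu + sigma * eps.
W_mu = rng.normal(scale=0.01, size=(IMG_DIM, LATENT_DIM))
W_logvar = rng.normal(scale=0.01, size=(IMG_DIM, LATENT_DIM))

def encode(x):
    mu = x @ W_mu
    logvar = x @ W_logvar
    eps = rng.standard_normal(LATENT_DIM)
    return mu + np.exp(0.5 * logvar) * eps  # sampled 32-d latent vector

# Stage 2: a classifier consumes only the compact latents, so it can be
# trained on the small labelled set (~250 lenses, ~1000 non-lenses).
W_clf = rng.normal(scale=0.1, size=LATENT_DIM)

def classify(z):
    return 1.0 / (1.0 + np.exp(-(z @ W_clf)))  # toy P(lens)

image = rng.random(IMG_DIM)  # stand-in for a flattened g,r,i cut-out
z = encode(image)
p_lens = classify(z)
print(z.shape, 0.0 < p_lens < 1.0)
```

The point of the split is visible in the shapes: the hard representation learning happens once on millions of unlabelled images, while the labelled lenses only have to train a small classifier over 32 numbers.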

The second pipeline is a convolutional neural network (CNN) regularised with Virtual Adversarial Training (VAT). VAT adds a small, optimally‑chosen perturbation to each input image and penalises the model if its output changes significantly. This encourages smoothness of the classifier around the decision boundary, a property that is especially valuable when labelled data are scarce. The CNN directly consumes the three‑channel JPEG cut‑outs; when a band is missing (as can happen in the DESI data) the missing channel is filled with zeros and a flag is supplied so the network knows which bands are absent. Data augmentation (rotations, flips, colour jitter) is applied to both labelled and unlabelled images during training.
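To make the VAT mechanism concrete, here is a minimal numerical sketch under simplifying assumptions: a logistic model stands in for the CNN, and a finite‑difference loop replaces the backpropagation step used in practice to estimate the adversarial direction. Only the structure (find the perturbation that most changes the output, then penalise that change) follows the technique described above; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def model(x, w):
    """Toy logistic classifier standing in for the CNN: returns P(lens | x)."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def kl_bernoulli(p, q, eps=1e-7):
    """KL divergence between two Bernoulli distributions."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def vat_loss(x, w, xi=1e-3, epsilon=0.1):
    """Virtual adversarial loss for one (possibly unlabelled) input:
    estimate the perturbation direction that most changes the prediction,
    then penalise the output shift under that perturbation."""
    p = model(x, w)
    d = rng.standard_normal(x.shape)
    d /= np.linalg.norm(d)
    # Finite-difference gradient of the KL w.r.t. the perturbation
    # (a stand-in for the backprop-based power iteration used in practice).
    grad = np.empty_like(x)
    base = kl_bernoulli(p, model(x + xi * d, w))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = xi
        grad[i] = (kl_bernoulli(p, model(x + xi * d + step, w)) - base) / xi
    r_adv = epsilon * grad / (np.linalg.norm(grad) + 1e-12)
    return kl_bernoulli(p, model(x + r_adv, w))

x_unlabelled = rng.standard_normal(8)  # stand-in for an image's features
w = rng.standard_normal(8)
loss = vat_loss(x_unlabelled, w)
print(loss >= 0.0)
```

Note that the loss never touches a label: it only compares the model's output with itself under perturbation, which is exactly why every unlabelled cut‑out can contribute to training.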

The dataset is split 60 %/20 %/20 % into training, validation, and test subsets, preserving the same proportion of labelled and unlabelled objects in each split. Performance metrics on the held‑out test set show that the VAE‑dense model reaches an accuracy of 0.96, precision of 0.92 and recall of 0.88, while the VAT‑CNN improves slightly to 0.97 accuracy, 0.94 precision and 0.90 recall. Notably, the VAT‑CNN’s validation loss drops by about 15 % when the full set of ≈2 million unlabelled DESI images is included, demonstrating the tangible benefit of the SSL regularisation.
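A split that preserves the class proportions in each subset can be done per stratum. This is a generic sketch (not the authors' code), using the counts mentioned in the summary and treating the unlabelled pool as its own stratum with a hypothetical label of -1.

```python
import numpy as np

rng = np.random.default_rng(42)

def stratified_split(labels, fractions=(0.6, 0.2, 0.2)):
    """Split indices into train/val/test, preserving class proportions.
    Unlabelled objects (label -1) are treated as their own stratum."""
    splits = ([], [], [])
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(fractions[0] * idx.size))
        n_val = int(round(fractions[1] * idx.size))
        splits[0].extend(idx[:n_train])
        splits[1].extend(idx[n_train:n_train + n_val])
        splits[2].extend(idx[n_train + n_val:])
    return tuple(np.array(s) for s in splits)

# ~250 lenses (1) and ~1000 non-lenses (0) as in the summary, plus a
# hypothetical unlabelled pool (-1) for illustration.
labels = np.array([1] * 250 + [0] * 1000 + [-1] * 5000)
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # -> 3750 1250 1250
```

Shuffling within each class before slicing guarantees that, e.g., exactly 60 % of the 250 lenses land in the training set, so the rare positive class is never accidentally depleted from any split.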

Both models generate ranked candidate lists. The highest‑scoring object, GRALJ140833.73+042229.98, was subsequently observed spectroscopically and confirmed as a new strongly lensed quasar, providing a concrete proof‑of‑concept that image‑only SSL can discover previously unknown lenses.

Beyond the immediate results, the authors discuss how their image‑based classifiers complement existing methods that rely on colour, astrometry, or variability. By fusing the two approaches, one can achieve higher purity and completeness than either method alone, making more efficient use of expensive follow‑up telescope time. The paper also notes a practical data‑handling insight: JPEG compression, contrary to intuition, smooths the high‑frequency noise in Pan‑STARRS cut‑outs and actually improves model performance.

In summary, the study makes three key contributions: (1) it demonstrates a practical SSL framework that leverages millions of unlabelled quasar images together with a few hundred labelled lenses; (2) it shows that two different architectures—VAE‑based dimensionality reduction and VAT‑regularised CNN—are both effective and mutually reinforcing; and (3) it validates the approach by discovering a new lens system. The work paves the way for future extensions that could incorporate additional wavelengths, time‑domain information, or real‑time alert streams, ultimately enabling fully automated, large‑scale searches for lensed quasars in upcoming surveys such as LSST and Euclid.

