An updated efficient galaxy morphology classification model based on ConvNeXt encoding with UMAP dimensionality reduction
We present an enhanced unsupervised machine learning (UML) module within our previous \texttt{USmorph} classification framework featuring two components: (1) hierarchical feature extraction via a pre-trained ConvNeXt convolutional neural network (CNN) with transfer learning, and (2) nonlinear manifold learning using Uniform Manifold Approximation and Projection (UMAP) for topology-aware dimensionality reduction. This dual-stage design enables efficient knowledge transfer from large-scale visual datasets while preserving morphological pattern geometry through UMAP’s neighborhood preservation. We apply the upgraded UML on I-band images of 99,806 COSMOS galaxies at redshift $0.2<z<1.2$ (to ensure rest-frame optical morphology) with $I_{\mathrm{mag}}<25$. The predefined cluster number is optimized to 20 (reduced from 50 in the original framework), achieving significant computational savings. The 20 algorithmically identified clusters are merged into five physical morphology types. About 51% of galaxies (50,056) were successfully classified. To assess classification effectiveness, we tested morphological parameters for massive galaxies with $M_{*}>10^{9}~M_{\odot}$. Our classification results align well with galaxy evolution theory. This improved algorithm significantly enhances galaxy morphology classification efficiency, making it suitable for large-scale sky surveys such as those planned with the China Space Station Telescope (CSST).
Research Summary
This paper presents an upgraded unsupervised machine‑learning (UML) pipeline for galaxy morphology classification, building on the previously introduced USmorph framework. The new system combines two state‑of‑the‑art components: (1) hierarchical feature extraction using a pre‑trained ConvNeXt convolutional neural network (CNN) with transfer learning, and (2) nonlinear manifold learning via Uniform Manifold Approximation and Projection (UMAP) for topology‑preserving dimensionality reduction.
Data and preprocessing
The authors work with I‑band images of 99,806 galaxies from the COSMOS field, selected to satisfy 0.2 < z < 1.2 and I < 25 mag; the redshift cut keeps the I band sampling rest‑frame optical light, so the measured morphology is rest‑frame optical. Images are uniformly cropped to 100 × 100 pixels after confirming that 98 % of galaxies have effective radii < 50 pixels. To mitigate noise, a convolutional auto‑encoder (CAE) denoises each cutout. Rotational invariance is achieved through Adaptive Polar Coordinate Transformation (APCT): a rotation of the galaxy in image space becomes a translation along the angular axis in polar space, which matches the approximate translation invariance built into CNNs.
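The rotation-to-translation idea can be illustrated with a plain polar resampling. The sketch below is not the authors' APCT (the adaptive part is not reproduced); the function names, the nearest-neighbour sampling, the synthetic test galaxy, and the odd 101 × 101 image size (chosen so the rotation centre falls exactly on a pixel) are all assumptions made for illustration.

```python
import numpy as np

def to_polar(image, n_radii=50, n_angles=360):
    """Resample a square image onto an (angle, radius) grid centred on the image."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radii = np.linspace(0.0, min(cy, cx), n_radii)
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    theta, r = np.meshgrid(angles, radii, indexing="ij")
    # Nearest-neighbour lookup of the pixel under each (angle, radius) sample.
    ys = np.rint(cy + r * np.sin(theta)).astype(int)
    xs = np.rint(cx + r * np.cos(theta)).astype(int)
    return image[ys, xs]

def best_angular_shift(polar_a, polar_b):
    """Circular shift (in angular bins) of polar_b that best matches polar_a."""
    n_angles = polar_a.shape[0]
    scores = [np.sum(polar_a * np.roll(polar_b, s, axis=0)) for s in range(n_angles)]
    return int(np.argmax(scores))

# Hypothetical test galaxy: an elongated bulge plus an off-centre clump,
# so the image has no rotational symmetry.
yy, xx = np.mgrid[0:101, 0:101]
bulge = np.exp(-(((xx - 50.0) / 12.0) ** 2 + ((yy - 50.0) / 5.0) ** 2) / 2.0)
clump = 0.8 * np.exp(-((xx - 66.0) ** 2 + (yy - 40.0) ** 2) / (2.0 * 4.0 ** 2))
galaxy = bulge + clump

# Rotate by 90 degrees in image space, then compare the two polar images.
rotated = np.rot90(galaxy)
shift = best_angular_shift(to_polar(galaxy), to_polar(rotated))
print("angular shift that re-aligns the rotated galaxy:", shift, "bins")
```

With 360 angular bins, each bin spans one degree, so a 90-degree rotation should show up as a shift of 90 bins along the angular axis rather than as a different-looking image.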
Feature extraction
The denoised, APCT‑processed images are fed into a ConvNeXt model pre‑trained on ImageNet. ConvNeXt is a modern CNN architecture that keeps a ResNet‑style layout while adopting large convolution kernels and layer scale. For each galaxy, the output of its global average‑pooling layer is taken as a 1024‑dimensional feature vector. This deep encoding captures high‑level visual patterns while reusing knowledge learned from large natural‑image datasets.
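As a minimal sketch of what global average pooling does, the NumPy snippet below collapses a feature map into one number per channel. The 1024 channels follow the text; the 3 × 3 spatial grid, the random values, and the name global_average_pool are illustrative assumptions, not the authors' code or a ConvNeXt API.

```python
import numpy as np

def global_average_pool(feature_map):
    """Collapse a (channels, height, width) feature map to one value per channel."""
    return feature_map.mean(axis=(1, 2))

# Stand-in for the last convolutional feature map of one galaxy cutout
# (random numbers here; a real encoder would produce these activations).
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(1024, 3, 3))

embedding = global_average_pool(feature_map)
print(embedding.shape)
```

Because the spatial axes are averaged away, the length of the resulting vector depends only on the channel count, not on the cutout size.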
Dimensionality reduction
The 1024‑D vectors are reduced to a 2‑D embedding using UMAP (n_neighbors = 30, min_dist = 0.1). UMAP preserves local neighbor relations and generally retains more of the global manifold structure than t‑SNE, which matters for keeping subtle morphological distinctions apart. The resulting low‑dimensional space is compact to store and well suited to clustering.
Clustering and morphology grouping
A bagging‑based voting clustering ensemble (combining K‑means, agglomerative, and spectral clustering) is applied to the UMAP embedding. The authors empirically determine that 20 clusters provide an optimal trade‑off between granularity and computational load, a reduction from the 50 clusters used in the original USmorph. These 20 algorithmic groups are subsequently merged into five physically interpretable morphology types (e.g., early‑type, disk‑dominated, irregular, compact, AGN‑dominated) based on visual inspection and the distribution of astrophysical parameters such as Sérsic index, color, stellar mass, and star‑formation rate.
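One way to picture a voting ensemble over several clusterings is sketched below. It assumes that cluster labels from each method are first aligned to a reference labelling and that a galaxy is kept only when enough methods agree; the function names, the min_votes threshold, and the toy label lists are hypothetical and are not taken from the USmorph code.

```python
from itertools import permutations

import numpy as np

def align_labels(reference, labels, n_clusters):
    """Relabel `labels` so that its clusters overlap maximally with `reference`."""
    overlap = np.zeros((n_clusters, n_clusters), dtype=int)
    for lab, ref in zip(labels, reference):
        overlap[lab, ref] += 1
    # Brute force over permutations: fine for this tiny demo only. For 20
    # clusters one would use an assignment solver such as
    # scipy.optimize.linear_sum_assignment instead.
    best = max(
        permutations(range(n_clusters)),
        key=lambda perm: sum(overlap[i, perm[i]] for i in range(n_clusters)),
    )
    return np.array(best)[labels]

def vote(label_sets, n_clusters, min_votes):
    """Majority vote across aligned clusterings; -1 marks 'unclassified'."""
    reference = np.asarray(label_sets[0])
    aligned = [reference] + [
        align_labels(reference, np.asarray(labels), n_clusters)
        for labels in label_sets[1:]
    ]
    aligned = np.stack(aligned)  # shape: (n_methods, n_samples)
    # counts[k, i] = number of methods that put sample i in cluster k
    counts = np.stack([(aligned == k).sum(axis=0) for k in range(n_clusters)])
    winner = counts.argmax(axis=0)
    confident = counts.max(axis=0) >= min_votes
    return np.where(confident, winner, -1)

# Toy labels for six objects from three hypothetical clusterers.
kmeans_labels = [0, 0, 1, 1, 2, 2]
agglo_labels = [1, 1, 2, 2, 0, 0]      # same partition, different label names
spectral_labels = [0, 0, 1, 2, 2, 2]   # disagrees on the fourth object

unanimous = vote([kmeans_labels, agglo_labels, spectral_labels], 3, min_votes=3)
majority = vote([kmeans_labels, agglo_labels, spectral_labels], 3, min_votes=2)
print(unanimous.tolist(), majority.tolist())
```

Requiring unanimity leaves the contested object unclassified, while a two-out-of-three rule assigns it to the majority cluster; the stricter the threshold, the cleaner the clusters and the larger the unclassified fraction.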
Performance evaluation
Out of the full sample, 50,056 galaxies (≈ 51 %) receive a confident classification; the remaining ≈ 49 % are left unclassified, primarily due to low signal‑to‑noise or ambiguous features. For massive galaxies (M* > 10⁹ M⊙), the five classes reproduce expected trends: early‑type systems show high Sérsic index n and red colors, disks occupy the blue cloud, and compact/AGN‑like objects populate the green valley. Comparisons with previous supervised classifications show agreement within a few percent, while the computational cost drops by ~60 % thanks to the reduced cluster count and the efficient ConvNeXt + UMAP pipeline.
Strengths and limitations
Key strengths include: (i) leveraging a powerful, publicly available CNN encoder that eliminates the need for extensive domain‑specific training; (ii) using UMAP to retain morphological topology while drastically shrinking feature dimensionality; (iii) integrating CAE and APCT to improve robustness against noise and rotation; (iv) achieving a scalable solution suitable for upcoming large surveys such as the China Space Station Telescope (CSST). Limitations are the sizable fraction of unclassified galaxies and the stochastic nature of UMAP embeddings, which can lead to slight variations in cluster boundaries across random seeds. The authors suggest future work involving semi‑supervised refinement, graph‑neural‑network‑based manifold regularization, or ensemble UMAP to improve reproducibility and coverage.
Conclusion
The study demonstrates that a ConvNeXt‑based encoder combined with UMAP dimensionality reduction provides an efficient, accurate, and scalable framework for unsupervised galaxy morphology classification. By cutting the number of clusters from 50 to 20 and preserving physical interpretability, the method offers a practical pathway for real‑time morphological tagging in next‑generation sky surveys, thereby facilitating large‑scale studies of galaxy evolution.