Analyzing and visualizing scientific ensemble datasets with high dimensionality and complexity poses significant challenges. Dimensionality reduction techniques and autoencoders are powerful tools for extracting features, but they often struggle with such high-dimensional data. This paper presents an enhanced autoencoder framework that incorporates a clustering loss, based on the soft silhouette score, alongside a contrastive loss to improve the visualization and interpretability of ensemble datasets. First, EfficientNetV2 is used to generate pseudo-labels for the unlabeled portions of the scientific ensemble datasets. By jointly optimizing the reconstruction, clustering, and contrastive objectives, our method encourages similar data points to group together while separating distinct clusters in the latent space. UMAP is subsequently applied to this latent representation to produce 2D projections, which are evaluated using the silhouette score. Multiple types of autoencoders are evaluated and compared based on their ability to extract meaningful features. Experiments on two scientific ensemble datasets - channel structures in soil derived from Markov chain Monte Carlo, and droplet-on-film impact dynamics - show that models incorporating clustering or contrastive loss marginally outperform the baseline approaches.
In recent years, the growth of high-dimensional scientific ensemble datasets has presented both opportunities and challenges for data analysis and visualization [1]. Scientific ensembles, characterized by their complex and multi-dimensional nature, contain valuable insights that can assist in decision-making processes across various domains, from climate modeling to healthcare diagnostics [2]. However, extracting meaningful features from these datasets remains a difficult task due to their complexity.
To make these complex ensemble datasets more understandable, dimensionality reduction techniques can be applied. However, these techniques struggle to uncover the structures in high-dimensional datasets. Thus, we first apply feature extraction through the use of (variational) autoencoders. These models extract the most relevant features from the datasets, after which dimensionality reduction techniques can be used to obtain a more intuitive visualization. Combining these two methods has shown promising results, but we hope to achieve better clustering within this visualization [3].
In this paper, we propose a novel approach for clustering scientific ensemble datasets by combining the strengths of autoencoder-based feature extraction with a dedicated clustering and contrastive loss function. Our method aims to extract relevant features of scientific data while simultaneously encouraging the formation of distinct clusters in the latent space. By jointly optimizing the reconstruction objective during training as well as the cluster separability enforced by either the clustering or contrastive loss, our approach offers a framework for obtaining a more interpretable visualization.
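The joint training objective described above can be summarized as a weighted sum; the notation below (in particular the weight λ) is shorthand introduced here for illustration rather than notation fixed by the paper:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{rec}} \;+\; \lambda \,\mathcal{L}_{\mathrm{cluster}},
\qquad
\mathcal{L}_{\mathrm{cluster}} \in \bigl\{\, 1 - \widetilde{S},\;\; \mathcal{L}_{\mathrm{contrastive}} \,\bigr\}
```

where $\mathcal{L}_{\mathrm{rec}}$ is the autoencoder reconstruction loss (augmented with a KL term in the variational case), $\widetilde{S}$ denotes the mean soft silhouette score over the batch, and $\lambda$ balances reconstruction fidelity against cluster separability.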
We implement a soft silhouette score, a differentiable version of the silhouette score. This score is used as a clustering loss alongside the reconstruction loss of a (variational) autoencoder during training, encouraging the data to form more compact clusters while preserving the original data's information. This clustering loss is also compared to a contrastive loss function. Contrastive loss aims to bring instances of the same class closer together while pushing apart instances of different classes, ensuring that similar data will be grouped in the latent space. This clustering during training can be performed on mostly unlabelled datasets: EfficientNetV2 is first trained on the manually labelled part of the ensemble datasets and then used to generate pseudo-labels for the unlabelled part, making this a semi-supervised problem.
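To make the two loss terms concrete, the sketch below gives one possible centroid-based formulation of the soft silhouette loss and the classic pairwise contrastive loss, in plain NumPy. The exact formulation of Vardakas et al. differs in its details, and in practice these losses would be written in an autodiff framework such as PyTorch so gradients flow into the encoder; the function names and the margin value are our own choices.

```python
import numpy as np

def soft_silhouette_loss(Z, P, eps=1e-8):
    """Centroid-based soft-silhouette clustering loss (sketch).

    Z : (n, d) latent embeddings
    P : (n, k) soft cluster-membership probabilities (rows sum to 1)

    Returns 1 - mean soft silhouette, so minimizing this loss
    pushes towards compact, well-separated clusters.
    """
    # Soft centroids: membership-weighted means of the embeddings.
    weights = P / (P.sum(axis=0, keepdims=True) + eps)            # (n, k)
    mu = weights.T @ Z                                            # (k, d)

    # Distance of every point to every centroid.
    D = np.linalg.norm(Z[:, None, :] - mu[None, :, :], axis=-1)   # (n, k)

    # Per-cluster silhouette: a = distance to own centroid,
    # b = distance to the nearest other centroid.
    n, k = P.shape
    sil = np.zeros((n, k))
    for j in range(k):
        a = D[:, j]
        b = np.min(np.delete(D, j, axis=1), axis=1)
        sil[:, j] = (b - a) / (np.maximum(a, b) + eps)

    # Expected silhouette under the soft assignments.
    s = (P * sil).sum(axis=1)
    return 1.0 - s.mean()

def contrastive_loss(Z, y, margin=1.0):
    """Pairwise contrastive loss: same-label pairs are pulled together
    (squared distance), different-label pairs are pushed apart up to
    the margin."""
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    same = (y[:, None] == y[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)                   # exclude self-pairs
    diff = (y[:, None] != y[None, :]).astype(float)
    pos = same * D ** 2
    neg = diff * np.maximum(0.0, margin - D) ** 2
    n_pairs = len(y) * (len(y) - 1)
    return (pos + neg).sum() / n_pairs
```

On two well-separated point groups with confident soft assignments, the soft silhouette loss is close to 0 and the contrastive loss vanishes; collapsing all points onto each other drives the contrastive loss up, which is the behavior the training objective exploits.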
Once training is finished, we further reduce the latent space to a 2D visualization by applying dimensionality reduction, specifically UMAP. The resulting visualizations are evaluated by their silhouette score and compared to those of similar models trained with and without a clustering or contrastive loss. In our experiments, we used two ensemble datasets: Markov Chain Monte Carlo and Drop Dynamics [4], [5].
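The evaluation metric here is the standard (hard) silhouette coefficient computed on the 2D projection. A minimal NumPy version is sketched below for reference; in practice one would typically call `sklearn.metrics.silhouette_score` on the output of the `umap-learn` package rather than hand-rolling it.

```python
import numpy as np

def silhouette_score_np(X, labels):
    """Mean silhouette coefficient over all points.

    For each point: a = mean distance to other points in its own
    cluster, b = smallest mean distance to any other cluster,
    s = (b - a) / max(a, b). Singleton clusters score 0 by convention.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    uniq = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own <= 1:
            continue                                  # singleton cluster
        a = D[i, own].sum() / (n_own - 1)             # intra-cluster mean
        b = min(D[i, labels == c].mean() for c in uniq if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()
```

Scores lie in [-1, 1]: values near 1 indicate compact, well-separated clusters in the projection, while negative values indicate points placed closer to a foreign cluster than to their own.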
In Section 2, we give a brief overview of related work. Afterwards, we describe the methodology used for this paper in Section 3. Then, we move on to the results and discussion in Section 4. We conclude our findings in Section 5. Finally, we discuss future work in Section 6.
In this section, we briefly describe previous research in the fields of autoencoder-based feature extraction, deep clustering, and contrastive learning.
Autoencoders have become instrumental in the field of feature extraction due to their ability to learn efficient, compressed representations of high-dimensional data. Ardelean et al. propose autoencoders as a feature-extraction method for spike sorting, the process of grouping spikes of distinct neurons into their respective clusters [6]. Autoencoders are also widely used in computer vision. Nayak et al. use a deep autoencoder to help detect brain tumors in medical images [7]. Chen et al. propose a convolutional autoencoder to help detect and analyze lung nodules [8]. Solomon et al. use autoencoders to develop a face verification system [9]. Furthermore, variational autoencoders are also useful in this process: Tian et al. developed the Pyramid-VAE-GAN network to assist in image inpainting [10].
Deep clustering refers to the process of integrating deep learning networks with clustering methods. It helps transform the input data such that clusters form within the latent space [11]. In the paper "Deep clustering using the soft silhouette score: towards compact and well-separated clusters", Vardakas et al. introduce a probabilistic formulation of the silhouette score to complement their autoencoders' reconstruction loss with a clustering loss [12]. They use a Radial Basis Function model as a clustering network to predict the probabilities with which they calculate the soft silhouette score. They show promising results on the EMNIST datasets. Xie et al. propose the Deep Embedded Clustering (DEC) method.
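A clustering network of this kind can be thought of as a head that maps latent embeddings to soft membership probabilities. The RBF-style form below is our own illustrative assumption, not the exact architecture of Vardakas et al.: memberships decay with squared distance to learnable cluster centers and are normalized with a softmax.

```python
import numpy as np

def rbf_cluster_probs(Z, centers, gamma=1.0):
    """Soft cluster-membership probabilities from an RBF-style head.

    Z       : (n, d) latent embeddings
    centers : (k, d) cluster centers (learnable in a real network)
    gamma   : bandwidth; larger values give sharper assignments

    p_ij ∝ exp(-gamma * ||z_i - c_j||^2), normalized over clusters j.
    """
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n, k)
    logits = -gamma * d2
    logits -= logits.max(axis=1, keepdims=True)                 # stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)
```

These probabilities are exactly the `P` that a soft silhouette loss consumes, which is what lets the clustering objective remain differentiable end to end.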