The mixture model is undoubtedly one of the greatest contributions to clustering. For continuous data, Gaussian mixture models are commonly used, and the Expectation-Maximization (EM) algorithm is particularly well suited to estimating their parameters, from which a clustering is inferred. While these models are popular in various domains, including image clustering, they suffer from the curse of dimensionality and from the slow convergence of the EM algorithm. The Classification EM (CEM) algorithm, a classifying version of EM, converges quickly, but dimensionality reduction remains a challenge. Thus we propose in this paper an algorithm that combines the two tasks --data embedding and clustering-- simultaneously rather than sequentially, relying on Principal Component Analysis (PCA) and CEM. We demonstrate the interest of this approach in terms of both clustering and data embedding, and we establish connections with other clustering approaches.
Nowadays, many real-world data sets are high-dimensional. To reduce the dimensionality, a manifold learning technique can be used to map a set of high-dimensional data into a low-dimensional space while preserving the intrinsic structure of the data. Principal Component Analysis (PCA) [19] is the most popular linear approach. For nonlinear cases, a number of more suitable techniques have been proposed, including Multi-Dimensional Scaling (MDS) [26], Isometric Feature Mapping (ISOMAP) [1], Locally Linear Embedding (LLE) [30], Locality Preserving Projections (LPP) [13], and Stochastic Neighbor Embedding (SNE) [15]; we refer the interested reader to [10]. Unfortunately, these nonlinear techniques tend to be extremely sensitive to noise, sample size, the choice of neighborhood, and other parameters; see for instance [12,35,10]. In the context of deep learning, where data embedding is referred to as data representation, an autoencoder can learn a representation (or encoding) for a set of data. If linear activations are used, or if there is only a single sigmoid hidden layer, the optimal solution of an autoencoder is strongly related to PCA [16].
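For reference, the linear embedding computed by PCA can be obtained from a singular value decomposition of the centered data matrix. The following is a minimal sketch; the function name embed_pca is ours, not from any of the cited works:

```python
import numpy as np

def embed_pca(X, d):
    """Project data onto its first d principal components.

    X : (n, p) data matrix; d : target dimension (d <= p).
    Returns the (n, d) embedding and the (p, d) projection matrix.
    """
    Xc = X - X.mean(axis=0)                 # center the data
    # SVD of the centered data: the rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:d].T                            # (p, d) loading matrix
    return Xc @ W, W
```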
In data science, data embedding (DE) is commonly used for visualization purposes, but it can also play a significant role in clustering, where the aim is to divide a dataset into homogeneous clusters. Working in a low-dimensional space can be useful when partitioning data, and a number of approaches are reported in the literature with applications in various fields. A popular strategy is to apply principal component analysis (PCA), which reduces the dimensionality of the data while retaining the most relevant information, followed by any clustering algorithm. Although this sequential approach has been successfully applied in many settings [14], it presents some drawbacks, because the first components do not necessarily capture the clustering structure. Chang [6] discussed a simulated example from a 15-dimensional mixture model with 2 clusters to show the failure of principal components as a method for reducing the dimension of the data before clustering. Under the scheme described in [6], the first 8 variables can be viewed roughly as a block of variables with the same correlations, while the remaining variables form another block. From the simulated continuous data of size 1000 × 15 with 2 clusters, we observe that the plane spanned by the first two principal components does not reveal the two classes (Figure 1, left), whereas the plane spanned by the first and last components discerns them perfectly (Figure 1, right). Therefore, taking into account only the first components to perform clustering is not always effective.
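A small simulation in the spirit of Chang's example can be sketched as follows. The block covariance values below are an illustrative assumption of ours, not the exact parameters of [6]; the point is only that the cluster-separating direction need not align with the leading components:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two blocks of equally correlated variables (an illustrative stand-in
# for the covariance structure of Chang's 15-dimensional example).
p, n = 15, 1000
Sigma = np.eye(p)
Sigma[:8, :8] += 0.7 * (1 - np.eye(8))    # block 1: variables 1-8
Sigma[8:, 8:] += 0.7 * (1 - np.eye(7))    # block 2: variables 9-15

# Two clusters separated along a direction carrying little total variance
mu = np.zeros(p)
mu[-1] = 2.0
z = rng.integers(0, 2, size=n)            # true cluster labels
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X[z == 1] += mu

# Compare the (PC1, PC2) plane with the (PC1, PC15) plane
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
first_two = Xc @ Vt[:2].T                 # tends to hide the clusters
first_last = Xc @ Vt[[0, -1]].T           # can separate them (cf. Figure 1)
```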
In our proposal, we rely on the mixture model approach for its flexibility. We present an approach that simultaneously uses PCA and an Expectation-Maximization-type algorithm (EM) [8,23] that inserts a classification step, referred to as Classification EM (CEM) [5]. The derived algorithm, called CEM-PCA, can be viewed as a regularized dimension reduction method: the regularization benefits the dimension reduction while taking into account the clustering structure to be discovered. In this way, it simultaneously combines the two data tasks, data embedding and clustering. As reported in Figure 2, we observe the interest of such an approach on simulated data. The sub-figure on the left illustrates the clustering obtained by K-means [20] applied to the 15 principal components arising from PCA. The sub-figure on the right represents the clusters obtained by CEM-PCA, which generates a data embedding B. CEM-PCA successfully separates the two classes and achieves perfect accuracy (100%), as opposed to PCA followed by K-means (78%); a sketch of this sequential baseline is given below. The paper is structured as follows: Section II describes related work and the methods to which we compare our approach. Section III presents CEM-PCA and the algorithms it builds on, such as EM and CEM. Section IV covers the optimization of CEM-PCA, the algorithm itself, and its complexity analysis. In Section V we present the experimental evaluation of our approach, including results and comparisons with other state-of-the-art methods. In Section VI we establish connections between CEM-PCA and other state-of-the-art methods, and finally we conclude and discuss potential applications and future directions of the proposed approach.
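For concreteness, the sequential baseline of Figure 2 (PCA followed by K-means) can be sketched as follows, with clustering accuracy computed up to a permutation of the labels via Hungarian matching. The helper name clustering_accuracy is ours, and X, z are taken from the simulation sketch above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy between two labelings (Hungarian matching)."""
    k = max(y_true.max(), y_pred.max()) + 1
    agree = np.zeros((k, k), dtype=int)
    for t, q in zip(y_true, y_pred):
        agree[t, q] += 1
    rows, cols = linear_sum_assignment(-agree)   # maximize agreement
    return agree[rows, cols].sum() / len(y_true)

# Sequential baseline: embed with PCA, then cluster with K-means
Z = PCA(n_components=2).fit_transform(X)
y_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print("PCA + K-means accuracy:", clustering_accuracy(z, y_km))
```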
In this section, we describe several existing methods that are related to our proposed CEM-PCA approach:
• Reduced K-means [21,39] combines PCA for dimension reduction with K-means for clustering.
• EM-GMM [2,29] estimates the parameters of a Gaussian mixture model; the clusters are inferred at convergence. A minimal sketch of the closely related CEM algorithm is given after this list.
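Since CEM [5] is central to our approach, the following is a minimal sketch of its three steps (E, C, M) for a Gaussian mixture with full covariance matrices, written from the standard description of the algorithm. The function name cem_gmm and the covariance regularization constant are our own choices; this is a generic illustration, not the CEM-PCA algorithm of Section IV:

```python
import numpy as np
from scipy.stats import multivariate_normal

def cem_gmm(X, k, n_iter=100, seed=0):
    """Classification EM for a Gaussian mixture (hard assignments)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    z = rng.integers(0, k, size=n)              # random initial partition
    for _ in range(n_iter):
        # M-step: estimate proportions, means, covariances from the partition
        # (empty clusters are not handled in this sketch)
        pis = np.array([(z == j).mean() for j in range(k)])
        mus = np.array([X[z == j].mean(axis=0) for j in range(k)])
        covs = [np.cov(X[z == j].T) + 1e-6 * np.eye(p) for j in range(k)]
        # E-step: log posterior of each cluster, up to an additive constant
        logp = np.column_stack([
            np.log(pis[j]) + multivariate_normal.logpdf(X, mus[j], covs[j])
            for j in range(k)
        ])
        # C-step: hard assignment to the most probable cluster
        z_new = logp.argmax(axis=1)
        if np.array_equal(z_new, z):            # partition stable: converged
            break
        z = z_new
    return z
```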