Median topographic maps for biomedical data sets
📝 Abstract
Median clustering extends popular neural data analysis methods such as the self-organizing map or neural gas to general data structures given by a dissimilarity matrix only. This offers flexible and robust global data inspection methods which are particularly suited for the variety of data that occurs in biomedical domains. In this chapter, we give an overview of median clustering and its properties and extensions, with a particular focus on efficient implementations adapted to large-scale data analysis.
📄 Content
arXiv:0909.0638v1 [cs.LG] 3 Sep 2009

Barbara Hammer¹, Alexander Hasenfuss¹, and Fabrice Rossi²

¹ Clausthal University of Technology, D-38678 Clausthal-Zellerfeld, Germany
² INRIA Rocquencourt, Domaine de Voluceau, Rocquencourt, B.P. 105, 78153 Le Chesnay Cedex, France

1 Introduction

The tremendous growth of electronic information in biological and medical domains has turned automatic data analysis and data inspection tools into a key technology for many application scenarios. Clustering and data visualization constitute a fundamental approach to arranging data in a way understandable by humans. In biomedical domains, prototype-based methods are particularly well suited since they represent data in terms of typical values which can be directly inspected by humans and visualized in the plane if an additional low-dimensional neighborhood structure or embedding is present. Popular methodologies include K-means clustering, the self-organizing map, neural gas, and affinity propagation, which have been applied successfully to various problems in the biomedical domain such as gene expression analysis, inspection of mass spectrometric data, health care, analysis of microarray data, protein sequences, and medical image analysis [1, 37, 36, 41, 44, 53, 54]. Many popular prototype-based clustering algorithms, however, have been derived for Euclidean data embedded in a real vector space.
In biomedical applications, data are diverse, including temporal signals such as EEG and EKG recordings, functional data such as mass spectra, sequential data such as DNA sequences, and complex graph structures such as biological networks. Often, the Euclidean metric is not appropriate to compare such data; rather, a problem-dependent similarity or dissimilarity measure should be used, such as alignment, correlation, graph distances, functional metrics, or general kernels. Various extensions of prototype-based methods towards more general data structures exist, such as extensions for recurrent and recursive data structures, functional versions, or kernelized formulations; see e.g. [27, 26, 25, 7, 24] for an overview. A very general approach relies on a matrix which characterizes the pairwise similarities or dissimilarities of the data. This way, any distance measure or kernel (or a generalization thereof which might violate symmetry, the triangle inequality, or positive definiteness) can be dealt with, including discrete settings which cannot be embedded in Euclidean space, such as alignments of sequences or empirical measurements of pairwise similarities without an explicit underlying metric. Several approaches extend popular clustering algorithms such as K-means or the self-organizing map towards this setting by means of the relational dual formulation or kernelization of the approaches [30, 31, 51, 8, 24]. These methods have the drawback that they partially require specific properties of the dissimilarity matrix (such as positive definiteness), and they represent data in terms of prototypes which are given by (possibly implicit) mixtures of training points, so they cannot easily be interpreted directly. Another general approach leverages mean field annealing techniques [19, 20, 33] as a way to optimize a modified criterion that no longer relies on the use of prototypes.
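To make the relational dual formulation mentioned above concrete, the following sketch follows the standard relational clustering literature rather than a formula stated in this chapter: prototypes are expressed as implicit convex combinations of the training points, and distances to them are computed from the dissimilarity matrix alone. Writing $D$ for the matrix of pairwise dissimilarities $d_{il} = \|x_i - x_l\|^2$ (in the Euclidean case),

$$
w_j = \sum_{i} \alpha_{ji}\, x_i, \qquad \sum_{i} \alpha_{ji} = 1, \qquad
\|x_i - w_j\|^2 = \bigl[D \alpha_j\bigr]_i - \tfrac{1}{2}\, \alpha_j^{\top} D\, \alpha_j .
$$

Since the right-hand side involves only $D$ and the coefficient vectors $\alpha_j$, the update steps of K-means or the self-organizing map can be carried out without any explicit vector representation of the data; this is also why the resulting prototypes, being implicit mixtures, are hard to inspect directly.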
As for the relational and kernel approaches, the main drawback of those solutions is the reduced interpretability. An alternative is offered by a representation of classes by the median or centroid, i.e. prototype locations are restricted to the discrete set given by the training data. This way, the distance of data points from prototypes is well defined. The resulting learning problem is connected to a well-studied optimization problem, the K-median problem: given a set of data points and pairwise dissimilarities, find k points forming centroids and an assignment of the data into k classes such that the average dissimilarity of points to their respective closest centroid is minimized. This problem is NP-hard in general unless the dissimilarities have a special form (e.g. tree metrics), and constant-factor approximations exist for specific settings (e.g. metrics) [10, 6]. The popular K-medoid clustering extends the batch optimization scheme of K-means to this restricted setting of prototypes: it in turn assigns data points to the respective closest prototypes and determines optimum prototypes for these assignments.
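The batch K-medoid scheme just described can be sketched as follows. This is a minimal illustration assuming a symmetric dissimilarity matrix, not the authors' implementation; the function and parameter names are our own.

```python
import numpy as np

def k_medoid(D, k, n_iter=100, seed=0):
    """Batch K-medoid clustering on an (n x n) dissimilarity matrix D.

    Prototypes are restricted to data indices (medoids), so only D is
    needed. Alternates between assigning each point to its closest
    medoid and replacing each medoid by the cluster member with the
    smallest summed dissimilarity to the other members.
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # assignment step: index of the closest medoid for every point
        assign = np.argmin(D[:, medoids], axis=1)
        # update step: best medoid within each cluster
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(assign == j)[0]
            if len(members) == 0:
                continue  # keep the old medoid for an empty cluster
            within = D[np.ix_(members, members)].sum(axis=0)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged: medoids no longer change
        medoids = new_medoids
    # recompute assignments so they match the returned medoids
    assign = np.argmin(D[:, medoids], axis=1)
    return medoids, assign
```

As in batch K-means, each step decreases the quantization cost, so the procedure converges in finitely many iterations, though only to a local optimum of the NP-hard K-median objective.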