Accuracy and Robustness of Clustering Algorithms for Small-Size Applications in Bioinformatics

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

The performance (accuracy and robustness) of several clustering algorithms is studied for linearly dependent random variables in the presence of noise. It turns out that the error percentage increases quickly when the number of observations is smaller than the number of variables, a situation common in experiments with DNA microarrays. Moreover, an *a posteriori* criterion to choose between two discordant clustering algorithms is presented.


💡 Research Summary

The paper investigates how well several popular clustering algorithms perform when the number of observations is smaller than the number of variables, a situation that frequently occurs in bioinformatics experiments such as DNA microarray studies. The authors generate synthetic data sets in which the variables are linearly dependent and contaminated with Gaussian noise at different signal‑to‑noise ratios. By varying the sample size (N) relative to the dimensionality (p) they create three regimes: N < p (severely undersampled), N ≈ p (borderline), and N > p (well‑sampled). For each regime they run 100 independent simulations and evaluate four clustering methods: k‑means, average‑linkage hierarchical clustering, self‑organizing maps (SOM), and an expectation‑maximization (EM) mixture‑model approach. Accuracy is measured as the proportion of correctly recovered class labels; the error rate is its complement.
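The paper's exact simulation protocol is not reproduced here, but the general setup can be sketched in plain Python: generate two classes whose p coordinates are all noisy linear functions of a single latent value (so the variables are linearly dependent), cluster with a basic k‑means, and report the label‑recovery error. All function names and parameter choices below are illustrative assumptions, not the authors' code.

```python
import random

def make_data(n_per_class, p, noise, seed=0):
    """Two classes of points; every coordinate is a noisy copy of one latent value."""
    rng = random.Random(seed)
    points, truth = [], []
    for c, base in enumerate((0.0, 3.0)):
        for _ in range(n_per_class):
            t = base + rng.gauss(0, 1)  # latent class signal
            points.append(tuple(t + rng.gauss(0, noise) for _ in range(p)))
            truth.append(c)
    return points, truth

def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's k-means on tuples of floats."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean distance)
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
                  for p in points]
        # recompute each center as the mean of its members
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def error_rate(labels, truth):
    """Fraction of misassigned points, accounting for label permutation (2 clusters)."""
    err = sum(l != t for l, t in zip(labels, truth)) / len(truth)
    return min(err, 1 - err)
```

For instance, `make_data(10, 100, 2.0)` produces an N = 20, p = 100 data set (the severely undersampled regime), which can then be fed to `kmeans` and scored with `error_rate`.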

The results show a dramatic rise in error when N < p. Distance‑based methods (k‑means and hierarchical clustering) suffer the most, often exceeding 30 % error under severe undersampling. The EM mixture model can remain relatively robust (error below 20 %) only when its assumed covariance structure matches the true data generation process; otherwise its performance deteriorates sharply. SOM, which combines non‑linear dimensionality reduction with clustering, maintains lower error rates (around 25 % for N = 20) but is highly sensitive to learning‑rate and topology parameters. As the sample size approaches or exceeds the number of variables, all algorithms converge to low error (<5 %).

Beyond performance measurement, the authors propose an a‑posteriori decision criterion to resolve conflicts when two algorithms produce discordant clusterings. The criterion combines three components: (1) the within‑cluster mean squared error (MSW), (2) the between‑cluster mean squared distance (MSB), and (3) a stability score obtained by bootstrapping the data and recomputing clusters. A low MSW/MSB ratio together with a high bootstrap reproducibility (≥0.8) signals the preferred clustering solution. The authors validate this rule on a real microarray data set from the GEO repository, where k‑means and SOM disagree; the criterion selects SOM, which aligns better with known biological pathways.
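The ingredients of such a criterion can be sketched as follows, using standard definitions of within/between scatter and a pairwise co‑membership bootstrap score; the paper's exact formulas may differ, and all names here are illustrative.

```python
import random

def _sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def msw_msb(points, labels):
    """Within-cluster MSE (MSW) and between-cluster mean squared distance (MSB)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    means = {l: tuple(sum(c) / len(m) for c in zip(*m)) for l, m in clusters.items()}
    grand = tuple(sum(c) / len(points) for c in zip(*points))
    msw = sum(_sq(p, means[l]) for p, l in zip(points, labels)) / len(points)
    msb = sum(len(m) * _sq(means[l], grand) for l, m in clusters.items()) / len(points)
    return msw, msb

def bootstrap_stability(points, cluster_fn, n_boot=20, seed=0):
    """Average pairwise co-membership agreement between the full-data clustering
    and clusterings of bootstrap resamples (needs at least 2 points)."""
    rng = random.Random(seed)
    base = cluster_fn(points)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(len(points)) for _ in range(len(points))]
        re = cluster_fn([points[i] for i in idx])
        same, total = 0, 0
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                total += 1
                # do both clusterings agree on whether this pair is co-clustered?
                if (base[idx[a]] == base[idx[b]]) == (re[a] == re[b]):
                    same += 1
        scores.append(same / total)
    return sum(scores) / n_boot
```

Under this sketch, the rule from the paper would prefer the clustering with the lower `msw / msb` ratio, provided its `bootstrap_stability` score is at least 0.8.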

In conclusion, the study highlights that clustering in high‑dimensional, low‑sample contexts is intrinsically fragile and that algorithm choice cannot rely solely on popularity or computational speed. Practitioners should first assess the N/p ratio, possibly apply dimensionality reduction (PCA, ICA) or regularization, and then use the proposed a‑posteriori metric to adjudicate between competing clusterings. The paper suggests future work extending the analysis to non‑linear dependencies, non‑Gaussian noise, and deep‑learning‑based clustering frameworks.
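As a concrete illustration of the recommended preprocessing, the leading principal component can be extracted by power iteration and the data projected onto it before clustering. This is a generic PCA sketch (one component only), not the paper's procedure, and the helper names are invented for illustration.

```python
import random

def first_pc(points, iters=100, seed=0):
    """Leading principal direction via power iteration on the covariance,
    computed as X^T (X v) / n without forming the covariance matrix."""
    p = len(points[0])
    mean = [sum(x[j] for x in points) / len(points) for j in range(p)]
    centered = [[x[j] - mean[j] for j in range(p)] for x in points]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(p)]
    for _ in range(iters):
        scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
        w = [sum(s * r[j] for s, r in zip(scores, centered)) / len(points)
             for j in range(p)]
        norm = sum(c * c for c in w) ** 0.5 or 1.0
        v = [c / norm for c in w]  # renormalize each step
    return v, mean

def project_1d(points, v, mean):
    """Coordinates of each (centered) point along the leading direction."""
    return [sum((x[j] - mean[j]) * v[j] for j in range(len(v))) for x in points]
```

Clustering the one‑dimensional scores from `project_1d` (or the first few components, extracted analogously) reduces the effective p before the N/p comparison is made.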

