miRNA and Gene Expression based Cancer Classification using Self-Learning and Co-Training Approaches
miRNA and gene expression profiles have proved useful for classifying cancer samples, and efficient classifiers have recently been sought and developed. A number of attempts to classify cancer samples using miRNA/gene expression profiles are known in the literature. Semi-supervised learning models have recently been adopted in bioinformatics to exploit the huge corpora of publicly available datasets; however, using both labeled and unlabeled sets to train sample classifiers has not previously been considered with gene and miRNA expression data. Moreover, there is a motivation to integrate both miRNA and gene expression for semi-supervised cancer classification, as doing so provides more information on the characteristics of cancer samples. In this paper, two semi-supervised machine learning approaches, namely self-learning and co-training, are adapted to enhance the quality of cancer sample classification. These approaches exploit large public corpora to enrich the training data. In self-learning, miRNA-based and gene-based classifiers are enhanced independently, while in co-training, both miRNA and gene expression profiles are used simultaneously to provide different views of the cancer samples. To our knowledge, this is the first attempt to apply these learning approaches to cancer classification. The approaches were evaluated on breast cancer, hepatocellular carcinoma (HCC), and lung cancer expression sets. Results show up to a 20% improvement in F1-measure over Random Forest and SVM classifiers; co-training also outperforms the Low Density Separation (LDS) approach by around a 25% improvement in F1-measure on breast cancer.
💡 Research Summary
The paper tackles the problem of cancer sample classification using molecular profiling data—specifically micro‑RNA (miRNA) and gene expression—by introducing two semi‑supervised learning strategies that have not previously been applied to this domain: self‑learning and co‑training. The motivation stems from the fact that while miRNA and gene expression signatures are powerful discriminators of tumor type, the number of labeled samples available for training is typically very small due to the high cost of experimental annotation. At the same time, large public repositories contain thousands of unlabeled expression profiles that remain untapped by conventional supervised classifiers such as Random Forests (RF) and Support Vector Machines (SVM). The authors propose to exploit these abundant unlabeled data to enrich the training set, thereby improving classification performance without additional wet‑lab effort.
Self‑learning approach
The authors first construct two independent classifiers: one based on miRNA expression, the other on gene expression. Each classifier is initially trained on the modest set of labeled samples. Then, the trained model is applied to the unlabeled pool; samples for which the classifier predicts a label with high confidence (probability > 0.9) are added to the labeled training set together with their predicted label. The model is retrained on this expanded set, and the process is iterated several times. By using a stringent confidence threshold, the method aims to limit error propagation while gradually increasing the diversity of the training data. Both RF and SVM are evaluated as base learners, and the self‑learning pipeline is applied separately to the miRNA and gene views.
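The loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses scikit-learn's `RandomForestClassifier` as the base learner, the 0.9 confidence threshold mentioned in the summary, and synthetic Gaussian data standing in for real expression profiles (the function name `self_train` and all data shapes are assumptions for illustration).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.9, max_iters=5):
    """Iteratively pseudo-label high-confidence unlabeled samples."""
    X_l, y_l = X_labeled.copy(), y_labeled.copy()
    X_u = X_unlabeled.copy()
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    for _ in range(max_iters):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        confident = proba.max(axis=1) > threshold   # stringent confidence bar
        if not confident.any():
            break                                   # nothing passes; stop early
        # classes_ maps probability columns back to the original labels
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[confident]])      # absorb confident samples
        y_l = np.concatenate([y_l, pseudo])
        X_u = X_u[~confident]
    clf.fit(X_l, y_l)                               # retrain on the final set
    return clf

# Synthetic demo: two well-separated Gaussian "expression" classes,
# only 5 labeled samples per class, the rest treated as unlabeled.
rng = np.random.default_rng(0)
pos = rng.normal(2.0, 1.0, size=(100, 20))
neg = rng.normal(-2.0, 1.0, size=(100, 20))
X_lab = np.vstack([pos[:5], neg[:5]])
y_lab = np.array([1] * 5 + [0] * 5)
X_unl = np.vstack([pos[5:], neg[5:]])
model = self_train(X_lab, y_lab, X_unl)
X_all = np.vstack([pos, neg])
y_all = np.array([1] * 100 + [0] * 100)
print("accuracy:", (model.predict(X_all) == y_all).mean())
```

The same loop works unchanged with an SVM base learner, provided it is configured to emit class probabilities (e.g. `SVC(probability=True)`).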
Co‑training approach
Co‑training leverages the “multi‑view” nature of the data: miRNA expression and gene expression constitute two distinct feature spaces that are assumed to be conditionally independent given the class label. Two classifiers are trained in parallel, each on one view. After each iteration, each classifier selects its most confident predictions from the unlabeled pool and supplies those pseudo‑labeled instances to the other classifier’s training set. This cross‑feeding mechanism allows each view to benefit from the complementary information captured by the other, effectively enlarging the labeled set for both classifiers simultaneously. The authors verify that the two views are sufficiently uncorrelated in practice, satisfying the theoretical assumptions underlying co‑training.
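The cross-feeding mechanism can be sketched as below, under the assumption (stated in the summary) that rows of the two views are aligned, i.e. row i of the miRNA matrix and row i of the gene matrix describe the same sample. The function name `co_train`, the choice of Random Forests for both views, and the per-iteration budget `n_add` are illustrative assumptions, not details from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def co_train(Xa_l, Xb_l, y_l, Xa_u, Xb_u, n_add=10, max_iters=5):
    """Xa_*: view A (e.g. miRNA) features; Xb_*: view B (e.g. gene) features."""
    A_X, A_y = Xa_l.copy(), y_l.copy()      # view-A classifier's training set
    B_X, B_y = Xb_l.copy(), y_l.copy()      # view-B classifier's training set
    idx = np.arange(len(Xa_u))              # indices of still-unlabeled samples
    clf_a = RandomForestClassifier(n_estimators=100, random_state=0)
    clf_b = RandomForestClassifier(n_estimators=100, random_state=1)
    for _ in range(max_iters):
        clf_a.fit(A_X, A_y)
        clf_b.fit(B_X, B_y)
        if len(idx) == 0:
            break
        pa = clf_a.predict_proba(Xa_u[idx])
        pb = clf_b.predict_proba(Xb_u[idx])
        top_a = np.argsort(pa.max(axis=1))[-n_add:]  # A's most confident picks
        top_b = np.argsort(pb.max(axis=1))[-n_add:]
        # Cross-feed: A's pseudo-labels enlarge B's training set, and vice versa
        B_X = np.vstack([B_X, Xb_u[idx[top_a]]])
        B_y = np.concatenate([B_y, clf_a.classes_[pa[top_a].argmax(axis=1)]])
        A_X = np.vstack([A_X, Xa_u[idx[top_b]]])
        A_y = np.concatenate([A_y, clf_b.classes_[pb[top_b].argmax(axis=1)]])
        idx = np.setdiff1d(idx, np.union1d(idx[top_a], idx[top_b]))
    return clf_a, clf_b

# Synthetic two-view demo: both views separate the classes, but with
# different dimensionalities, mimicking miRNA vs. gene feature spaces.
rng = np.random.default_rng(1)
n = 100
y = np.array([1] * n + [0] * n)
view_a = np.vstack([rng.normal(1.5, 1, (n, 15)), rng.normal(-1.5, 1, (n, 15))])
view_b = np.vstack([rng.normal(1.5, 1, (n, 30)), rng.normal(-1.5, 1, (n, 30))])
lab = np.r_[np.arange(5), np.arange(n, n + 5)]       # 5 labeled per class
unl = np.setdiff1d(np.arange(2 * n), lab)
clf_a, clf_b = co_train(view_a[lab], view_b[lab], y[lab],
                        view_a[unl], view_b[unl])
# Combine the two views at prediction time by averaging probabilities
proba = (clf_a.predict_proba(view_a) + clf_b.predict_proba(view_b)) / 2
pred = clf_a.classes_[proba.argmax(axis=1)]
print("accuracy:", (pred == y).mean())
```

Averaging the two views' predicted probabilities at the end is one simple way to fuse them; the key semi-supervised step is the cross-feeding inside the loop.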
Experimental setup
The methods are evaluated on three publicly available cancer expression datasets: breast cancer, hepatocellular carcinoma (HCC), and lung cancer. For each cancer type, the labeled training set consists of only a few dozen samples, while the unlabeled pool contains several thousand profiles. Performance is measured primarily by the F1‑score, which balances precision and recall and is appropriate for the often imbalanced class distributions in cancer datasets. Baseline comparisons include standard supervised RF and SVM models trained only on the labeled data, as well as the Low‑Density Separation (LDS) semi‑supervised technique, which is a state‑of‑the‑art method for leveraging unlabeled data.
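A tiny example illustrates why F1 is preferred over plain accuracy for the imbalanced class distributions mentioned above (the 5% positive rate here is an illustrative assumption, not a figure from the paper):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([1] * 5 + [0] * 95)   # 5% positive class, e.g. rare tumor type
y_pred = np.zeros(100, dtype=int)       # degenerate "always negative" classifier

print(accuracy_score(y_true, y_pred))             # 0.95: looks deceptively good
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0: exposes the failure
```

Because F1 is the harmonic mean of precision and recall on the positive class, a classifier that never finds a positive sample scores zero regardless of how many negatives it gets right.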
Results
Self‑learning yields consistent improvements over the supervised baselines, with F1‑score gains ranging from roughly 12% to 18% across the three cancers. Co‑training provides the most pronounced benefit, especially for the breast cancer dataset, where it outperforms LDS by about 25% in F1‑score. In HCC and lung cancer, co‑training still surpasses both the supervised models and LDS, though the margin is slightly smaller. The authors also conduct ablation studies showing that the confidence threshold and the number of added pseudo‑labeled samples per iteration critically affect performance, confirming the importance of careful parameter tuning.
Key contributions
- Novel application of semi‑supervised learning to miRNA and gene expression–based cancer classification, demonstrating that unlabeled public data can be harnessed effectively.
- Dual‑view co‑training framework that exploits the complementary nature of miRNA and gene expression, achieving larger performance gains than treating each view independently.
- Empirical validation on multiple cancer types, showing that the proposed methods are robust across different biological contexts and data distributions.
- Practical relevance: the approach reduces the need for costly experimental labeling while delivering classifiers that are competitive with, or superior to, existing supervised and semi‑supervised techniques.
Future directions suggested by the authors include extending the multi‑view paradigm to incorporate additional omics layers (e.g., DNA methylation, proteomics), integrating graph‑based semi‑supervised methods to capture sample similarity structures, and exploring deep learning architectures that can jointly learn representations from miRNA and gene expression while still benefiting from unlabeled data.
In summary, the paper provides a compelling demonstration that semi‑supervised strategies—particularly co‑training—can substantially boost the accuracy of cancer classification models built on high‑dimensional molecular profiles, opening a pathway toward more cost‑effective and scalable diagnostic tools in precision oncology.