Evaluating MCC for Low-Frequency Cyberattack Detection in Imbalanced Intrusion Detection Data
In many real-world network environments, several types of cyberattacks occur at very low rates compared to benign traffic, making them difficult for intrusion detection systems (IDS) to detect reliably. Under this imbalance, traditional evaluation metrics such as accuracy often overstate model performance, masking failures on the minority attack classes that matter most in practice. In this paper, we evaluate a set of base and meta classifiers on low-frequency attacks in the CSE-CIC-IDS2017 dataset and compare their reliability in terms of accuracy and Matthews Correlation Coefficient (MCC). The results show that accuracy consistently inflates performance, while MCC provides a more faithful assessment of a classifier’s behavior across both majority and minority classes. Meta-classification methods, such as LogitBoost and AdaBoost, demonstrate more effective minority-class detection when measured by MCC, revealing trends that accuracy fails to capture. These findings establish the need for imbalance-aware evaluation and make MCC a more trustworthy metric for IDS research involving low-frequency cyberattacks.
💡 Research Summary
This paper investigates how the choice of evaluation metric influences the perceived performance of intrusion detection systems (IDS) when dealing with extremely imbalanced network traffic, where low‑frequency cyber‑attacks constitute only a tiny fraction of the overall flow. Using the CSE‑CIC‑IDS2017 dataset, the authors extract three subsets that focus on the rare attacks Heartbleed, Web Attack, and Infiltration. After converting the raw CSV files to ARFF format, they down‑sample the benign class with WEKA’s SpreadSubsample filter, which preserves all attack instances while capping the majority class (at 50 000 benign instances for the Heartbleed subset and 20 000 for the other two). The resulting class distributions are highly skewed: Heartbleed (11 attacks vs. 50 000 benign, 0.022 % attack), Infiltration (36 vs. 20 000, 0.18 %), and Web Attack (2 180 vs. 20 000, 9.8 %). No feature scaling or removal is performed, preserving the original 80+ flow‑based attributes.
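The benign-class capping step can be sketched in Python. This is a simplified stand-in for WEKA’s SpreadSubsample filter (which is actually configured via a distribution-spread parameter); the `spread_subsample` helper and the toy label counts below are illustrative, not the authors’ code:

```python
import random

def spread_subsample(rows, label_key, majority_label, cap, seed=1):
    """Keep every minority instance; randomly cap the majority class at `cap`.
    A simplified stand-in for WEKA's SpreadSubsample filter."""
    majority = [r for r in rows if r[label_key] == majority_label]
    minority = [r for r in rows if r[label_key] != majority_label]
    if len(majority) > cap:
        majority = random.Random(seed).sample(majority, cap)
    return majority + minority

# Toy data shaped like the Infiltration subset: 36 attacks in a sea of benign flows.
rows = ([{"label": "Benign"} for _ in range(30_000)]
        + [{"label": "Infiltration"} for _ in range(36)])
subset = spread_subsample(rows, "label", "Benign", cap=20_000)
print(sum(r["label"] == "Benign" for r in subset))        # 20000
print(sum(r["label"] == "Infiltration" for r in subset))  # 36
```

Keeping every attack instance while shrinking only the benign class retains the rare-attack signal yet still leaves the distributions heavily skewed, which is exactly the regime the paper studies.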
The experimental design compares two families of classifiers: (1) baseline single‑model learners—Logistic Regression, Random Forest, Naïve Bayes, J48 decision tree, and JRip rule learner; and (2) meta‑classifiers that operate on top of base learners, primarily ensemble methods such as AdaBoostM1, LogitBoost, Bagging, RandomSubSpace, and RandomCommittee. All models are evaluated with 10‑fold cross‑validation in WEKA, and the authors report only two metrics: overall accuracy and Matthews Correlation Coefficient (MCC). Accuracy is highlighted as the de facto standard in IDS literature but is known to be misleading under severe class imbalance because it is dominated by true negatives. MCC, by contrast, incorporates true positives, true negatives, false positives, and false negatives, yielding a value between –1 and +1 that remains informative even when the minority class is vanishingly small.
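As a quick illustration of that range, MCC can be computed directly from the four confusion-matrix counts. A minimal sketch (the zero-denominator convention follows common practice for single-class predictions, e.g. scikit-learn’s `matthews_corrcoef`):

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews Correlation Coefficient from the four confusion-matrix counts.
    # By convention, return 0 when a marginal total is zero (e.g. the model
    # never predicts the positive class), where the formula is undefined.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

print(mcc(tp=50, tn=50, fp=0, fn=0))    # 1.0   perfect prediction
print(mcc(tp=0, tn=0, fp=50, fn=50))    # -1.0  perfectly inverted
print(mcc(tp=25, tn=25, fp=25, fn=25))  # 0.0   chance-level
```

Unlike accuracy, every term of the confusion matrix enters both the numerator and the denominator, so a flood of true negatives cannot hide a total failure on the positive class.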
Results across the three attack subsets reveal a consistent pattern. For Infiltration, most classifiers report accuracies above 99.8 % yet many obtain an MCC of 0, indicating that they failed to detect any attack despite appearing perfect on the accuracy scale. In contrast, DecisionTree correctly identified all 36 infiltration instances, achieving an MCC of 0.986 (98.6 %) while its accuracy (99.995 %) is only marginally different from that of the worst‑performing models. This demonstrates that accuracy cannot differentiate between a model that detects attacks and one that predicts every instance as benign.
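The Infiltration numbers make this concrete. In the sketch below, the confusion-matrix counts for the successful detector are an assumption: the paper reports only accuracy and MCC, so the single false positive is chosen to roughly reproduce the reported 0.986 MCC:

```python
import math

def scores(tp, tn, fp, fn):
    """Accuracy and MCC from confusion-matrix counts (MCC = 0 when undefined)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
    return acc, mcc

# Predict everything benign on the Infiltration subset (36 attacks, 20 000 benign):
acc_naive, mcc_naive = scores(tp=0, tn=20_000, fp=0, fn=36)
print(f"{acc_naive:.4f} {mcc_naive:.3f}")  # 0.9982 0.000

# A detector that catches all 36 attacks with one false alarm (hypothetical counts):
acc_det, mcc_det = scores(tp=36, tn=19_999, fp=1, fn=0)
print(f"{acc_det:.5f} {mcc_det:.3f}")      # 0.99995 0.986
```

Both models look essentially identical on the accuracy scale (99.82 % vs. 99.995 %), while MCC separates them by nearly its full range.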
The Web Attack subset, with a higher attack proportion (≈10 %), shows tighter alignment between accuracy and MCC, but discrepancies still exist. Some models with 99.8 % accuracy have MCC values noticeably below 0.97, indicating residual misclassifications of attack traffic. Ensemble meta‑classifiers—Bagging, RandomSubSpace, RandomCommittee—consistently achieve the highest MCC scores (0.97–0.99), suggesting that combining multiple base learners mitigates the bias toward the majority class.
Heartbleed, the most extreme case (11 attacks vs. 50 000 benign), again illustrates the pitfalls of accuracy. Nearly all models reach ~99.97 % accuracy, with several reporting perfect scores. However, MCC ranges from near zero to almost 1.0. Models that fail to flag any Heartbleed instance receive MCC ≈ 0 despite near‑perfect accuracy, while certain meta‑ensembles attain MCC ≈ 1, reflecting near‑perfect detection. This stark divergence underscores MCC’s robustness in reflecting true detection capability under severe imbalance.
To provide an overall ranking, the authors compute the mean MCC for each classifier across the three datasets. The top five by average MCC are: FilteredClassifier (95.53 %), ClassificationViaRegression (94.93 %), JRip (93.80 %), RandomForest (93.23 %), and RandomSubSpace (92.83 %). The first two are meta‑classification approaches, confirming that meta‑classifiers generally dominate performance on low‑frequency attacks. Nevertheless, strong baseline learners such as JRip and RandomForest also appear near the top, indicating that well‑tuned single models can still be competitive.
The discussion reiterates that accuracy is unreliable for heavily imbalanced IDS evaluation, especially for the scarcest attacks (Heartbleed, Infiltration). MCC consistently separates effective from ineffective models, providing a more truthful picture of minority‑class detection. Meta‑classification techniques, particularly boosting and bagging variants, improve MCC scores, likely due to their ability to re‑weight or resample the minority class during training. The authors argue that future IDS research should adopt MCC (or similarly imbalance‑aware metrics) as a primary evaluation criterion, and that ensemble methods should be considered a strong baseline when targeting rare attack detection.
In summary, the paper empirically demonstrates that Matthews Correlation Coefficient is a far more reliable metric than accuracy for assessing IDS performance on low‑frequency cyber‑attacks. It also shows that meta‑ensemble classifiers generally achieve higher MCC, making them preferable for real‑world scenarios where attacks are rare but critically important. These findings provide clear guidance for both researchers and practitioners on metric selection and model design in the context of imbalanced intrusion detection data.