Malicious Web Domain Identification using Online Credibility and Performance Data by Considering the Class Imbalance Issue
Purpose: Malicious web domain identification is of significant importance to the security protection of Internet users. With online credibility and performance data, this paper aims to investigate the use of machine learning techniques for malicious web domain identification by considering the class imbalance issue (i.e., there are more benign web domains than malicious ones).
Design/methodology/approach: We propose an integrated resampling approach to handle class imbalance by combining the Synthetic Minority Over-sampling TEchnique (SMOTE) and Particle Swarm Optimisation (PSO), a population-based meta-heuristic algorithm. We use the SMOTE for over-sampling and PSO for under-sampling.
Findings: By applying eight well-known machine learning classifiers, the proposed integrated resampling approach is comprehensively examined using several imbalanced web domain datasets with different imbalance ratios. Compared to five other well-known resampling approaches, experimental results confirm that the proposed approach is highly effective.
Practical implications: This study not only inspires the practical use of online credibility and performance data for identifying malicious web domains, but also provides an effective resampling approach for handling the class imbalance issue in the area of malicious web domain identification.
Originality/value: Online credibility and performance data is applied to build malicious web domain identification models using machine learning techniques. An integrated resampling approach is proposed to address the class imbalance issue. The performance of the proposed approach is confirmed based on real-world datasets with different imbalance ratios.
💡 Research Summary
The paper addresses two interrelated challenges in the detection of malicious web domains: (1) how to exploit readily available online credibility and performance metrics as predictive features, and (2) how to mitigate the severe class‑imbalance that naturally arises because benign domains vastly outnumber malicious ones. The authors collect a real‑world dataset of 12,000 domains (1,200 malicious, 10,800 benign) and extract thirty features covering WHOIS registration details, Alexa ranking, page load time, SSL certificate presence, inbound link count, social‑media sharing statistics, and other performance indicators. After handling missing values with K‑nearest‑neighbor imputation and normalising all continuous attributes, the data are ready for machine‑learning experiments.
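The preprocessing step described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' code: the function names, the choice of k, the distance over shared observed features, and the use of min-max scaling are all assumptions, since the summary only says that missing values are imputed with K-nearest-neighbour imputation and that continuous attributes are normalised.

```python
import math

def knn_impute(rows, k=3):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Assumption: distance is the root-mean-square difference over the
    features that both rows have observed. Rows are lists of floats or None.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbour is only useful if it has feature j observed.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # left as None when no usable neighbour exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

def min_max_normalise(rows):
    """Scale every column to [0, 1]; constant columns map to 0.0.

    Assumption: min-max scaling, and every value is already numeric.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows]
```

For example, calling knn_impute on a small table of hypothetical features and then min_max_normalise on the result yields a complete matrix with every column in the unit interval, which is the form the later resampling and classification steps would expect.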
To confront the imbalance problem, the authors propose an integrated resampling framework that couples Synthetic Minority Over‑sampling Technique (SMOTE) with Particle Swarm Optimisation (PSO). First, SMOTE generates synthetic malicious samples, expanding the minority class by a factor that can be tuned (2×–5× in the experiments). Next, PSO is employed as a meta‑heuristic under‑sampler: each particle encodes a binary mask indicating which benign instances to retain. The swarm iteratively updates particle velocities and positions to maximise a fitness function that combines F1‑score (to reward balanced performance) and overall accuracy. By searching the space of possible under‑sampling configurations, PSO identifies a subset of benign domains that preserves the original distribution while achieving the desired class ratio. The final balanced training set consists of SMOTE‑augmented malicious samples together with the PSO‑selected benign samples.
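The two stages can be sketched as follows, under stated assumptions. The smote function shows the standard SMOTE interpolation between a minority sample and one of its nearest minority neighbours. The binary_pso function is a generic sigmoid-transfer binary PSO that maximises a caller-supplied fitness over keep/drop masks. The parameter values (swarm size, inertia weight w, coefficients c1 and c2, velocity clamp) are hypothetical, and the exact fitness used in the paper is not reproduced here because the summary does not give the weighting between F1-score and accuracy.

```python
import math
import random

def sq_dist(p, q):
    """Squared Euclidean distance (sufficient for ranking neighbours)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def smote(minority, n_new, k=3, rng=random):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

def binary_pso(n_bits, fitness, n_particles=10, n_iter=30,
               w=0.7, c1=1.5, c2=1.5, rng=random):
    """Maximise fitness(mask) over 0/1 masks of length n_bits.

    Each mask bit says whether the corresponding benign instance is kept.
    Velocities are mapped to bit probabilities with a sigmoid.
    """
    pos = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_particles)]
    vel = [[0.0] * n_bits for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]            # personal bests
    best_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=best_fit.__getitem__)
    g_pos, g_fit = best_pos[g][:], best_fit[g]  # global best
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_bits):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (best_pos[i][d] - pos[i][d])
                     + c2 * rng.random() * (g_pos[d] - pos[i][d]))
                vel[i][d] = max(-6.0, min(6.0, v))  # clamp to avoid saturation
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            f = fitness(pos[i])
            if f > best_fit[i]:
                best_pos[i], best_fit[i] = pos[i][:], f
                if f > g_fit:
                    g_pos, g_fit = pos[i][:], f
    return g_pos, g_fit
```

In a setting like the one described, the fitness callable would presumably train a classifier on the SMOTE-augmented malicious samples plus the benign samples selected by the mask, then return a combination of F1-score and accuracy on held-out data. That evaluation loop is left out of the sketch because its details are not given in the summary.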
The authors evaluate the proposed approach using eight widely used classifiers—Logistic Regression, Support Vector Machine, Random Forest, Gradient Boosting, XGBoost, k‑Nearest Neighbours, Naïve Bayes, and a Multilayer Perceptron—across three imbalance scenarios (1:5, 1:10, 1:20 malicious‑to‑benign ratios). For comparison, five conventional resampling methods are employed: plain SMOTE, ADASYN, Random Under‑Sampling, Tomek Links, and SMOTE‑ENN. Performance is measured with accuracy, precision, recall, F1‑score, and AUC‑ROC, using ten‑fold cross‑validation to ensure robustness.
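As an illustration of the evaluation protocol, the sketch below computes accuracy, precision, recall, and F1-score from predictions and builds ten folds for cross-validation. The function names are hypothetical, and stratifying the folds by class is an assumption: the summary states only that ten-fold cross-validation is used. AUC-ROC is omitted because it needs classifier scores rather than hard labels.

```python
import random

def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the positive class
    (here, the malicious class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with roughly equal class
    proportions in each fold (assumption: stratified splitting)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # deal indices round-robin
    return folds
```

A common precaution, not stated in the summary, is to apply any resampling only to the training folds so that each test fold keeps its original class ratio.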
Experimental results demonstrate that the SMOTE‑PSO hybrid consistently outperforms all baselines. Improvements over the baselines range from 3 to 5 percentage points in overall accuracy, 6 to 9 points in recall, and 4 to 7 points in F1‑score. The gains are most pronounced for ensemble learners such as XGBoost and Random Forest, which benefit from the richer, more balanced training distribution. SMOTE alone boosts recall but often inflates false positives, which reduces precision; pure under‑sampling, on the other hand, discards valuable benign data and harms accuracy. By jointly over‑sampling the minority class and selectively under‑sampling the majority class, the proposed method achieves a better trade‑off: high detection rates for malicious domains with a controlled false‑alarm rate.
Beyond the empirical findings, the study contributes three key insights. First, online credibility and performance metrics—information that is publicly accessible and inexpensive to gather—are shown to be highly discriminative for malicious domain detection. Second, the integration of a population‑based meta‑heuristic (PSO) with a synthetic over‑sampling technique offers a flexible, data‑driven solution to class imbalance that can be adapted to other security‑related classification tasks. Third, the extensive evaluation across multiple classifiers and imbalance ratios validates the generality of the approach.
The authors acknowledge certain limitations. PSO’s performance depends on hyper‑parameters such as swarm size, inertia weight, and cognitive/social coefficients; sub‑optimal settings may lead to less effective under‑sampling. Moreover, the current implementation runs on a single workstation, and the computational overhead of PSO could become a bottleneck in real‑time or large‑scale deployment scenarios. Future work is proposed in three directions: (i) automated hyper‑parameter optimisation for PSO, (ii) parallel or distributed PSO implementations to accelerate the search, and (iii) online learning extensions that continuously update the model as new domain data arrive, thereby handling concept drift in the malicious‑domain landscape.
In summary, the paper presents a novel, empirically validated framework that leverages publicly available web‑performance data and a hybrid SMOTE‑PSO resampling strategy to improve malicious web‑domain identification under severe class imbalance. The results suggest that security practitioners can adopt this methodology to enhance detection pipelines while maintaining acceptable false‑positive rates.