Identifying Malicious Web Domains Using Machine Learning Techniques with Online Credibility and Performance Data
Malicious web domains represent a serious threat to web users’ privacy and security. With so much data about web domains’ popularity and performance freely available on the Internet, this study investigated the performance of well-known machine learning techniques used in conjunction with this type of online data to identify malicious web domains. Two datasets consisting of malware and phishing domains were collected to build and evaluate the machine learning classifiers. Five single classifiers and four ensemble classifiers were applied to distinguish malicious domains from benign ones. In addition, a binary particle swarm optimisation (BPSO) based feature selection method was used to improve the performance of single classifiers. Experimental results show that, based on the web domains’ popularity and performance data features, the examined machine learning techniques can accurately identify malicious domains in different ways. Furthermore, the BPSO-based feature selection procedure is shown to be an effective way to improve the performance of classifiers.
💡 Research Summary
The paper addresses the growing threat of malicious web domains, specifically those used for malware distribution and phishing, by investigating whether readily available online metrics on domain popularity and performance can be leveraged to build accurate machine‑learning classifiers. Two labeled datasets were constructed, one of malware domains and one of phishing domains, with benign domains collected over the same period serving as the negative class. For each domain, roughly thirty quantitative features were extracted from public sources such as Alexa rankings, backlink counts, social‑media mentions, domain age, SSL certificate status, average response time, and uptime ratios. These features are entirely external to the URL string, making the approach attractive for large‑scale, automated monitoring.
Five conventional single‑classifier algorithms were evaluated: Support Vector Machine (SVM), Naïve Bayes (NB), k‑Nearest Neighbors (k‑NN), Decision Tree (DT), and Random Forest (RF). To explore the benefits of ensemble learning, four ensemble strategies were also applied: Bagging, AdaBoost, Stacking, and a voting scheme (hard/soft). All models were tuned via 10‑fold cross‑validation, and performance was measured using Accuracy, Precision, Recall, F1‑Score, and the Area Under the ROC Curve (AUC).
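To make the evaluation measures listed above concrete, the following is a minimal sketch of how they can be computed for a binary malicious/benign task. It is not the paper's evaluation code: the function names, the 0.5 decision threshold, and the eight example labels and scores are all assumptions chosen for illustration. AUC is computed here through its rank interpretation, the probability that a randomly chosen malicious domain receives a higher score than a randomly chosen benign one.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = malicious)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (malicious, benign) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(classification_metrics(y_true, y_pred))
print("auc:", roc_auc(y_true, scores))
```

In a 10‑fold cross‑validation setting, these functions would be applied to the held‑out fold of each split and the per‑fold values averaged.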
Because many of the thirty features are correlated, the authors introduced a binary particle swarm optimisation (BPSO) based feature‑selection stage. In BPSO each particle is a 30‑dimensional binary vector indicating whether a feature is retained. The fitness function combines the validation AUC with a penalty term proportional to the number of selected features, encouraging compact yet discriminative subsets. After 50 generations with a swarm size of 30, the algorithm consistently selected about twelve features (≈40 % of the original set). The most frequently chosen attributes were domain age, backlink count, HTTPS adoption, and average response time—variables that intuitively reflect a site’s legitimacy and operational stability.
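The selection loop described above can be sketched as follows. This is a generic binary PSO with a sigmoid transfer function, not the authors' implementation: the inertia weight, acceleration coefficients, velocity clamp, and penalty weight are assumed values, and `toy_score` with its `USEFUL` index set is a hypothetical stand‑in for the validation AUC of a classifier trained on the selected features. Only the swarm size (30), generation count (50), and dimensionality (30) come from the summary.

```python
import math
import random
from typing import Callable, List, Tuple


def bpso_select(
    score_fn: Callable[[List[int]], float],
    n_features: int,
    swarm_size: int = 30,
    generations: int = 50,
    penalty: float = 0.1,   # assumed weight of the feature-count penalty
    inertia: float = 0.7,   # assumed PSO hyperparameters
    c1: float = 1.5,
    c2: float = 1.5,
    v_max: float = 4.0,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Return the best binary feature mask found and its fitness."""
    rng = random.Random(seed)

    def fitness(mask: List[int]) -> float:
        if not any(mask):  # an empty feature subset is invalid
            return float("-inf")
        # Score minus a penalty proportional to the number of kept features.
        return score_fn(mask) - penalty * sum(mask) / n_features

    positions = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(swarm_size)]
    velocities = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(swarm_size)]
    pbest = [p[:] for p in positions]
    pbest_fit = [fitness(p) for p in positions]
    g = max(range(swarm_size), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(generations):
        for i in range(swarm_size):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (inertia * velocities[i][d]
                     + c1 * r1 * (pbest[i][d] - positions[i][d])
                     + c2 * r2 * (gbest[d] - positions[i][d]))
                v = max(-v_max, min(v_max, v))
                velocities[i][d] = v
                # Sigmoid of the velocity gives the probability that the bit is 1.
                positions[i][d] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0
            fit = fitness(positions[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = positions[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = positions[i][:], fit
    return gbest, gbest_fit


# Hypothetical scoring function: pretends four features carry all the signal.
USEFUL = {0, 3, 7, 11}


def toy_score(mask: List[int]) -> float:
    hits = sum(mask[i] for i in USEFUL)
    return 0.5 + 0.5 * hits / len(USEFUL)


mask, fit = bpso_select(toy_score, n_features=30)
print(sum(mask), "features kept, fitness", round(fit, 4))
```

In practice `score_fn` would train and validate a classifier on the columns where the mask is 1, which makes each fitness evaluation far more expensive than in this toy example. The penalty weight controls the trade‑off between subset size and validation score.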
Experimental results demonstrate that BPSO‑guided feature reduction improves virtually every classifier. For example, SVM’s accuracy rose from 92.3 % to 94.7 % (+2.4 pp) and its AUC increased to 0.96; Random Forest improved from 94.1 % to 95.8 % accuracy. Precision and recall both gained 2–3 percentage points, indicating better balance between false positives and false negatives. Among ensembles, AdaBoost already achieved a strong baseline (95.2 % accuracy) and benefitted modestly from feature selection, reaching 96.1 % accuracy. Overall, the best performing model (AdaBoost with BPSO‑selected features) attained an AUC of 0.97, confirming that popularity and performance metrics alone can discriminate malicious from benign domains with high confidence.
The study also compares its approach to traditional URL‑string‑based detection methods. While many prior works rely on lexical features (e.g., URL length, special characters, token frequency), this research shows that external, domain‑level metrics—often free and continuously updated—are sufficient for robust detection, potentially reducing the need for costly deep‑learning models that ingest raw URL text.
Limitations are acknowledged. The datasets are temporally bounded (2018‑2020) and geographically skewed toward English‑language sites, which may affect generalisability. Moreover, popularity and performance indicators can fluctuate rapidly; the paper does not evaluate real‑time adaptability or concept‑drift handling. Future work is suggested to incorporate streaming data pipelines, online learning algorithms, and periodic re‑evaluation of the feature set to maintain effectiveness against evolving attacker tactics.
In conclusion, the paper provides strong empirical evidence that machine‑learning classifiers built on publicly available domain popularity and performance data can accurately identify malicious web domains. The integration of BPSO for feature selection not only reduces computational overhead but also yields measurable gains in detection performance. This methodology offers a low‑cost, scalable complement to existing URL‑centric security solutions, and it paves the way for more dynamic, data‑driven threat‑intelligence platforms in the cybersecurity ecosystem.