SecureScan: An AI-Driven Multi-Layer Framework for Malware and Phishing Detection Using Logistic Regression and Threat Intelligence Integration
The growing sophistication of modern malware and phishing campaigns has diminished the effectiveness of traditional signature-based intrusion detection systems. This work presents SecureScan, an AI-driven, triple-layer detection framework that integrates logistic regression-based classification, heuristic analysis, and external threat intelligence via the VirusTotal API for comprehensive triage of URLs, file hashes, and binaries. The proposed architecture prioritizes efficiency by filtering known threats through heuristics, classifying uncertain samples using machine learning, and validating borderline cases with third-party intelligence. On benchmark datasets, SecureScan achieves 93.1 percent accuracy with balanced precision (0.87) and recall (0.92), demonstrating strong generalization and reduced overfitting through threshold-based decision calibration. A calibrated threshold and gray-zone logic (0.45-0.55) were introduced to minimize false positives and enhance real-world stability. Experimental results indicate that a lightweight statistical model, when augmented with calibrated verification and external intelligence, can achieve reliability and performance comparable to more complex deep learning systems.
💡 Research Summary
SecureScan is a three-layer, AI-driven detection framework designed to address the growing sophistication of malware and phishing campaigns that render traditional signature-based intrusion detection systems (IDS) increasingly ineffective. The first layer implements fast, deterministic heuristic filtering that examines structural attributes of inputs (domain length, presence of IP-based hostnames, suspicious top-level domains, phishing-related keywords in URL paths, and risky file extensions) to filter out obviously malicious or malformed samples with minimal latency. Samples that pass this gate are transformed into high-dimensional feature vectors: URLs are tokenized into character n-grams (3 to 7 characters) and encoded using TF-IDF, capped at 50,000 features, while files are represented by static metadata (size, entropy, import count, byte-sequence n-grams, etc.).
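As a rough illustration of the first layer, the following sketch checks the structural attributes listed above and extracts the 3 to 7 character n-grams that would feed a TF-IDF encoder. The keyword, TLD, and extension lists, the length limit, and all function names are assumptions chosen for illustration; the summary does not give the actual values.

```python
import ipaddress
from urllib.parse import urlparse

# Illustrative assumptions only: none of these lists or limits come from the paper.
SUSPICIOUS_TLDS = {"zip", "tk", "top", "xyz"}
PHISHING_KEYWORDS = {"login", "verify", "secure", "account", "update"}
RISKY_EXTENSIONS = {".exe", ".scr", ".js", ".vbs", ".bat"}
MAX_DOMAIN_LENGTH = 50


def is_ip_host(host: str) -> bool:
    """Return True when the hostname is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def heuristic_flags(url: str) -> list[str]:
    """Return the names of the structural heuristics that the URL trips."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()
    flags = []
    if len(host) > MAX_DOMAIN_LENGTH:
        flags.append("long_domain")
    if is_ip_host(host):
        flags.append("ip_host")
    elif host.rsplit(".", 1)[-1] in SUSPICIOUS_TLDS:
        flags.append("suspicious_tld")
    if any(word in path for word in PHISHING_KEYWORDS):
        flags.append("phishing_keyword")
    if any(path.endswith(ext) for ext in RISKY_EXTENSIONS):
        flags.append("risky_extension")
    return flags


def char_ngrams(text: str, n_min: int = 3, n_max: int = 7) -> list[str]:
    """Character n-grams of length n_min..n_max, the tokens a TF-IDF step would weight."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]
```

A URL that trips no heuristic would move on to feature extraction; how many flags are needed to reject a sample outright is a policy choice the summary does not specify.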
The second layer employs a logistic regression classifier trained on these vectors. L2 regularization and ten-fold cross-validation are used to limit overfitting, and Platt scaling calibrates the raw model scores into well-behaved probabilities. A thresholding scheme on the calibrated probability partitions predictions into three zones: benign below 0.45, malicious above 0.55, and a gray zone between 0.45 and 0.55. The gray zone is deliberately designed to capture uncertain cases that would otherwise generate false positives or false negatives if handled solely by the statistical model.
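The calibration and zoning step can be sketched in a few lines. The 0.45 and 0.55 bounds are the gray-zone limits quoted in the text; treating everything above the gray zone as malicious, the function names, and the Platt parameters `a` and `b` (which would normally be fitted on held-out data) are assumptions for illustration.

```python
import math


def platt_scale(score: float, a: float, b: float) -> float:
    """Platt scaling: map a raw classifier score to a probability.

    Uses Platt's form P = 1 / (1 + exp(a * score + b)); a and b are
    fitted on held-out data in practice and are hypothetical here.
    """
    return 1.0 / (1.0 + math.exp(a * score + b))


def triage(probability: float, low: float = 0.45, high: float = 0.55) -> str:
    """Assign a calibrated probability to one of three zones."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability < low:
        return "benign"
    if probability > high:
        return "malicious"
    # Anything inside [low, high] is deferred to external verification.
    return "gray"
```

Only samples that land in the `"gray"` zone would incur the cost of an external lookup, which is what keeps the average latency low.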
The third layer resolves gray‑zone predictions by querying the VirusTotal API. For URLs, the /api/v3/urls endpoint is called; for file hashes, /api/v3/files/{hash} is used. The response provides aggregated engine counts, reputation tags, and the timestamp of the latest analysis. A consensus rule is applied: only when both SecureScan’s probabilistic verdict and VirusTotal’s aggregated intelligence agree on maliciousness is the sample finally labeled as malicious; otherwise, it is downgraded to “safe (verified)”. This external validation step adds a dynamic, community‑driven perspective that compensates for the static nature of the logistic model.
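A minimal sketch of the consensus rule follows, operating on an already-fetched VirusTotal report so that it needs no network access. The `last_analysis_stats` field and the unpadded URL-safe base64 URL identifier follow the VirusTotal v3 API. The 0.5 model cut-off inside the gray zone, the `min_engines` threshold, the use of the per-URL report path, and all function names are assumptions rather than details given in the summary.

```python
import base64

VT_BASE = "https://www.virustotal.com/api/v3"


def vt_url_id(url: str) -> str:
    """VirusTotal v3 URL identifier: URL-safe base64 of the URL, padding removed."""
    return base64.urlsafe_b64encode(url.encode()).decode().strip("=")


def vt_endpoint(kind: str, value: str) -> str:
    """Build the report endpoint for a URL or a file hash."""
    if kind == "url":
        return f"{VT_BASE}/urls/{vt_url_id(value)}"
    if kind == "file":
        return f"{VT_BASE}/files/{value}"
    raise ValueError(f"unsupported kind: {kind}")


def vt_says_malicious(report: dict, min_engines: int = 3) -> bool:
    """True when at least min_engines engines flagged the sample (assumed rule)."""
    stats = (
        report.get("data", {})
        .get("attributes", {})
        .get("last_analysis_stats", {})
    )
    return stats.get("malicious", 0) >= min_engines


def final_verdict(model_probability: float, report: dict,
                  threshold: float = 0.5, min_engines: int = 3) -> str:
    """Label a gray-zone sample malicious only if model and VirusTotal agree."""
    model_malicious = model_probability >= threshold
    if model_malicious and vt_says_malicious(report, min_engines):
        return "malicious"
    return "safe (verified)"
```

Under this rule a missing or empty report resolves to "safe (verified)", which favors fewer false positives at the cost of possibly missing threats that VirusTotal has not yet seen.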
The authors assembled a comprehensive dataset of 651,191 labeled instances (223,088 malicious, 428,103 benign) drawn from public repositories such as VirusShare, PE collections, PhishTank, and OpenPhish. After deduplication, normalization, and stratified 80/20 train‑test splitting, they performed ten‑fold cross‑validation during model tuning. Feature engineering included lexical, structural, and metadata attributes, with URL augmentation through realistic directory variations to improve generalization.
Experimental results show that SecureScan achieves 93.1% overall accuracy, a precision of 0.87, and a recall of 0.92. The gray-zone mechanism reduces the false-positive rate by more than 30% compared to a baseline logistic regression without external verification. Inference latency averages 15 ms per sample, supporting real-time deployment in security operations centers (SOCs) or at enterprise gateways. Despite using a lightweight statistical model, the integration of calibrated probabilities and threat intelligence yields performance comparable to more complex deep-learning classifiers while maintaining interpretability through feature-weight inspection.
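The summary reports precision and recall but no combined score. As a worked check, the harmonic mean of the two (the F1 score) can be derived from the reported figures; the value below is arithmetic on those numbers, not a result stated in the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the text above.
f1 = f1_score(0.87, 0.92)
print(round(f1, 3))  # 0.894
```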
The paper discusses limitations, notably the dependence on the VirusTotal API, whose rate limits can degrade performance when access is throttled. Additionally, the model's ability to generalize to novel, heavily obfuscated malware families remains tied to the quality and diversity of the training data. Future work is proposed in three directions: (1) incorporating dynamic behavioral analysis to enrich feature sets, (2) exploring ensemble or meta-learning architectures that combine logistic regression with tree-based or neural models, and (3) implementing a caching layer for VirusTotal responses to mitigate API constraints and further reduce latency.
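The caching layer proposed in item (3) could take many forms. One simple possibility, sketched here as an assumption rather than the authors' design, is an in-memory cache with a time-to-live, keyed by URL or file hash. The class name, the `fetch` callback, and the injectable clock are hypothetical.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[str, tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def put(self, key: str, value: dict) -> None:
        self._store[key] = (self._clock(), value)


def cached_lookup(key: str, fetch: Callable[[str], dict],
                  cache: TTLCache) -> dict:
    """Return a cached report if still fresh, else call fetch and store the result."""
    hit = cache.get(key)
    if hit is not None:
        return hit
    value = fetch(key)
    cache.put(key, value)
    return value
```

The time-to-live trades freshness for API quota: a longer value saves more requests but risks serving a verdict that VirusTotal has since revised.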
In conclusion, SecureScan demonstrates that a thoughtfully engineered, multi‑layer pipeline—combining deterministic heuristics, calibrated logistic regression, and real‑time threat intelligence—can deliver high detection accuracy, low false‑positive rates, and operational efficiency, offering a practical alternative to heavyweight deep‑learning solutions for malware and phishing detection.