CIC-Trap4Phish: A Unified Multi-Format Dataset for Phishing and Quishing Attachment Detection
Phishing attacks represent one of the primary attack methods used by cyber attackers. In many cases, attackers use deceptive emails along with malicious attachments to trick users into giving away sensitive information or installing malware, compromising entire systems. The flexibility of malicious email attachments makes them a preferred vector for attackers, as they can embed harmful content such as malware or malicious URLs inside standard document formats. Although phishing email defenses have improved considerably, attackers continue to abuse attachments, enabling malicious content to bypass security measures. Moreover, another challenge that researchers face in training advanced models is the lack of a unified and comprehensive dataset that covers the most prevalent data types. To address this gap, we generated CIC-Trap4Phish, a multi-format dataset containing both malicious and benign samples across five categories commonly used in phishing campaigns: Microsoft Word documents, Excel spreadsheets, PDF files, HTML pages, and QR code images. For the first four file types, an execution-free static feature pipeline was proposed, designed to capture structural, lexical, and metadata-based indicators without the need to open or execute files. Feature selection was performed using a combination of SHAP analysis and feature importance, yielding compact, discriminative feature subsets for each file type. The selected features were evaluated using lightweight machine learning models, including Random Forest, XGBoost, and Decision Tree. All models demonstrate high detection accuracy across formats. For QR code-based phishing (quishing), two complementary methods were implemented: image-based detection using Convolutional Neural Networks (CNNs) and lexical analysis of decoded URLs using recent lightweight language models.
💡 Research Summary
The paper addresses the growing threat of phishing and “quishing” (QR‑code‑based phishing) attacks that exploit malicious email attachments. Recognizing a critical gap in existing research—namely, the lack of a unified, high‑quality dataset covering the most common attachment types—the authors introduce CIC‑Trap4Phish, a comprehensive benchmark that includes both malicious and benign samples of Microsoft Word documents, Excel spreadsheets, PDF files, HTML pages, and QR‑code images.
Data collection was performed from reputable sources: malicious samples were harvested from Malware Bazaar, PhishTank, the PDFMal2022 repository, the Nazario Phishing Email Corpus, and other trusted feeds, while benign samples were generated or crawled from Google, Wikipedia, and publicly available datasets. The resulting corpus is balanced across the five formats, providing a realistic representation of real‑world phishing campaigns.
For the four “traditional” attachment types (Word, Excel, PDF, HTML), the authors designed an execution‑free static feature extraction pipeline. Features span structural metadata (e.g., number of embedded objects, presence of macros, document revision history), lexical cues (suspicious strings, URL patterns), and content‑level indicators (script tags, JavaScript code, external resource links). Each file type initially yields 30‑40 raw attributes. To reduce dimensionality and improve interpretability, a hybrid feature‑selection approach combines SHAP (Shapley Additive Explanations) with conventional importance metrics, resulting in the top‑10 most discriminative features per format.
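To make the idea of execution-free extraction concrete, the following is a minimal sketch for the HTML case only, using Python's standard `html.parser`. The feature names, the `SUSPICIOUS_WORDS` pattern, and the `extract_html_features` helper are illustrative assumptions, not the feature set or code released with the paper; the point is only that the file is parsed as text and never rendered or executed.

```python
from html.parser import HTMLParser
import re

# Hypothetical keyword pattern, for illustration only; not the paper's feature set.
SUSPICIOUS_WORDS = re.compile(
    r"\b(verify|password|login|urgent|suspended)\b", re.IGNORECASE
)


class StaticHtmlFeatures(HTMLParser):
    """Counts structural indicators while parsing; nothing is rendered or executed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.counts = {
            "script_tags": 0,
            "form_tags": 0,
            "iframe_tags": 0,
            "external_links": 0,
            "password_inputs": 0,
            "suspicious_words": 0,
        }

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "script":
            self.counts["script_tags"] += 1
        elif tag == "form":
            self.counts["form_tags"] += 1
        elif tag == "iframe":
            self.counts["iframe_tags"] += 1
        elif tag == "input" and (attr.get("type") or "").lower() == "password":
            self.counts["password_inputs"] += 1
        # Any href/src/action holding an absolute http(s) URL counts as external.
        for name in ("href", "src", "action"):
            value = attr.get(name) or ""
            if value.lower().startswith(("http://", "https://")):
                self.counts["external_links"] += 1

    def handle_data(self, data):
        # Covers visible text as well as inline script content.
        self.counts["suspicious_words"] += len(SUSPICIOUS_WORDS.findall(data))


def extract_html_features(html_text):
    """Parse the markup as text and return a dict of static feature counts."""
    parser = StaticHtmlFeatures()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Made-up sample page using reserved .invalid hostnames.
sample = """
<html><body>
<form action="https://example.invalid/collect">
  <input type="text" name="user">
  <input type="password" name="pw">
</form>
<script src="https://example.invalid/a.js"></script>
<p>Urgent: verify your password now.</p>
</body></html>
"""
features = extract_html_features(sample)
print(features)
```

A real pipeline for Word, Excel, or PDF files would need format-specific parsers; this sketch covers only the HTML-style indicators named above (script tags, external resource links, suspicious strings).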
These compact feature sets are fed into lightweight classifiers—Random Forest, XGBoost, and Decision Tree. Across 5‑fold cross‑validation and a held‑out test split, all models achieve high performance: average accuracy exceeds 96 % for every format, with F1‑scores typically above 0.94. XGBoost consistently attains the best results, while Random Forest offers a good trade‑off between speed and accuracy. Importantly, because the pipeline relies solely on static, non‑executing features, it can be deployed in high‑throughput email gateways without the security risks or computational overhead associated with dynamic sandbox analysis.
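The summary reports both accuracy and F1, and the two can diverge when one class is rare. As a hedged illustration of how the two metrics are computed, here is a small pure-Python sketch; the `binary_metrics` helper and the toy labels are made up for this example and are not results from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1, treating label 1 as malicious."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy, deliberately imbalanced labels (illustrative only): 95 benign, 5 malicious,
# and a classifier that catches just one of the malicious samples.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On these toy labels accuracy is 0.96 while F1 is only about 0.33, which is why a high accuracy figure is more convincing when it is accompanied by a high F1-score, as reported here.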
QR‑code‑based phishing (quishing) is tackled with two complementary strategies. First, an image‑based Convolutional Neural Network (a ResNet‑18‑style architecture) learns spatial and pattern‑level cues from QR‑code images; trained on over 200 k samples, the CNN reaches >98 % accuracy in distinguishing malicious from benign codes. Second, the decoded URLs embedded in QR codes undergo lexical analysis. Tokens are processed by several lightweight language models—BERT‑Tiny, DeBERTa‑v3, ModernBERT, and DeepSeek‑R1. SHAP analysis reveals that URL length, domain entropy, and suspicious query‑parameter patterns are the strongest predictors of malicious intent. An ensemble of the CNN and language‑model pipelines further improves robustness against image manipulation or URL obfuscation.
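The three predictors highlighted by the SHAP analysis (URL length, domain entropy, suspicious query parameters) can be computed directly from a decoded URL. The sketch below is an assumption-laden illustration using only the Python standard library; it is not the language-model pipeline from the paper, and the `SUSPICIOUS_PARAMS` set and the function names are hypothetical.

```python
import math
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

# Hypothetical parameter names, for illustration only.
SUSPICIOUS_PARAMS = {"redirect", "url", "next", "token", "login", "session"}


def shannon_entropy(text):
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def url_lexical_features(url):
    """Compute simple lexical features from a URL without fetching it."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    params = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "url_length": len(url),
        "domain_entropy": shannon_entropy(host),
        "num_subdomains": max(host.count(".") - 1, 0),
        "num_query_params": len(params),
        "suspicious_params": sum(
            1 for key, _ in params if key.lower() in SUSPICIOUS_PARAMS
        ),
    }


# Made-up URL on a reserved .invalid domain.
url = "https://login.example.invalid/verify?redirect=https%3A%2F%2Fa.invalid&token=abc"
url_features = url_lexical_features(url)
print(url_features)
```

Hand-crafted features like these are cheap to compute and easy to explain, whereas the language models described above learn their own representation from the URL tokens.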
The authors compare their dataset to prior works (e.g., PDFMal2022, EMBER2024, various Office‑document collections) and demonstrate that CIC‑Trap4Phish uniquely combines multiple attachment types, includes QR codes, and provides both raw files and extracted feature vectors. All data, feature‑extraction scripts, and model code are publicly released, fostering reproducibility and enabling the security community to benchmark new detection techniques.
In the discussion, the paper highlights several practical implications: (1) static, lightweight classifiers can be integrated into existing email security appliances with minimal latency; (2) the dual‑modal QR‑code approach mitigates the risk of attackers evading detection by solely altering visual patterns or URL content; (3) the SHAP‑driven feature selection offers explainability, aiding security analysts in understanding why a particular attachment is flagged.
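The summary does not spell out how SHAP values and conventional importance scores are combined during feature selection. One plausible reading, shown here purely as an assumption, is to rank features under each score and keep those with the best average rank. The `select_top_k` helper and all numbers below are made up for illustration; the scores would normally come from a SHAP explainer and a trained tree model.

```python
def rank_positions(scores):
    """Map each feature to its rank (0 = most important) by descending score."""
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: position for position, name in enumerate(ordered)}


def select_top_k(shap_scores, model_importance, k):
    """Keep the k features with the best average rank across both score sets.

    Assumed fusion rule (mean of ranks); ties are broken alphabetically.
    """
    shap_rank = rank_positions(shap_scores)
    importance_rank = rank_positions(model_importance)
    shared = set(shap_rank) & set(importance_rank)
    combined = {
        name: (shap_rank[name] + importance_rank[name]) / 2 for name in shared
    }
    return sorted(shared, key=lambda name: (combined[name], name))[:k]


# Toy scores (illustrative only): mean |SHAP| values and tree-based importances.
toy_shap = {
    "script_tags": 0.30,
    "external_links": 0.25,
    "form_tags": 0.10,
    "iframe_tags": 0.02,
}
toy_importance = {
    "script_tags": 0.20,
    "external_links": 0.35,
    "form_tags": 0.05,
    "iframe_tags": 0.08,
}
selected = select_top_k(toy_shap, toy_importance, k=2)
print(selected)
```

Because every retained feature carries a per-sample SHAP contribution, an analyst can see which indicators pushed a given attachment toward the malicious class, which is the explainability benefit noted in point (3).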
Future work is outlined as follows: extending the dataset with newer file formats (e.g., PowerPoint, Rich Text), incorporating dynamic analysis signals to create hybrid models, leveraging large language models (LLMs) for deeper phishing‑email body analysis, and investigating defenses against emerging QR‑code login hijacking (QRLJacking) techniques.
Overall, CIC‑Trap4Phish represents a significant step toward unified, scalable, and explainable detection of attachment‑based phishing and quishing threats, providing both a valuable research resource and a practical blueprint for real‑world deployment.