PhreshPhish: A Real-World, High-Quality, Large-Scale Phishing Website Dataset and Benchmark

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Phishing remains a pervasive and growing threat, inflicting heavy economic and reputational damage. While machine learning has been effective in real-time detection of phishing attacks, progress is hindered by the lack of large, high-quality datasets and benchmarks. Beyond the poor quality caused by challenges in data collection, existing datasets suffer from leakage and unrealistic base rates, leading to overly optimistic performance results. In this paper, we introduce PhreshPhish, a large-scale, high-quality dataset of phishing websites that addresses these limitations. Compared to existing public datasets, PhreshPhish is substantially larger and of significantly higher quality, as measured by the estimated rate of invalid or mislabeled data points. Additionally, we propose a comprehensive suite of benchmark datasets specifically designed for realistic model evaluation: they minimize leakage, increase task difficulty, enhance dataset diversity, and adjust base rates toward those more likely to be seen in the real world. We train and evaluate multiple solution approaches to provide baseline performance on the benchmark sets. We believe the availability of this dataset and benchmarks will enable realistic, standardized model comparison and foster further advances in phishing detection. The datasets and benchmarks are available on Hugging Face (https://huggingface.co/datasets/phreshphish/phreshphish).


💡 Research Summary

The paper introduces PhreshPhish, a new large‑scale, high‑quality dataset of phishing webpages together with a suite of realistic benchmark splits designed for robust evaluation of phishing detection models. Data were collected over a 17‑month period (July 2024 – December 2025) from public phishing feeds (PhishTank, APWG eCrime Exchange, NetCraft) and from anonymized browsing telemetry of more than six million global users, yielding over 1.2 million phishing URLs and 2.5 million benign URLs. To capture the dynamic, JavaScript‑heavy nature of modern phishing sites, the authors built a distributed Selenium‑based crawling cluster that renders pages in full Chrome instances, varies user‑agents and IP addresses, and scrapes URLs within minutes of their appearance to mitigate ephemerality.
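The per-request variation of user-agents and proxies described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `USER_AGENTS` strings, `build_chrome_args` helper, and proxy addresses are all assumptions; a real deployment would pass these arguments to a Selenium-managed Chrome instance.

```python
import random

# Hypothetical user-agent pool; the actual strings used by the
# crawling cluster are not disclosed in the summary.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/126.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_5) Chrome/126.0",
]

def build_chrome_args(user_agents, proxy_pool, rng=random):
    """Pick a user-agent and proxy for one crawl, since phishing kits
    often cloak themselves from known scanner fingerprints."""
    ua = rng.choice(user_agents)
    proxy = rng.choice(proxy_pool)
    return [
        "--headless=new",
        f"--user-agent={ua}",
        f"--proxy-server={proxy}",
    ]
```

Rotating both the user-agent and the egress IP per request makes it harder for a phishing kit to serve benign decoy content to the crawler while serving the real page to victims.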

Quality control combines automated heuristics (HTTP error codes, CAPTCHA detection, takedown notices) with double‑blind human annotation, achieving an estimated label error rate below 1.2 %, far lower than the 5‑15 % typical of existing public datasets. After cleaning, the HTML is normalized while preserving essential structural elements for downstream models.
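The automated heuristics mentioned above might look roughly like the sketch below. The specific marker phrases and the `is_invalid_capture` function are illustrative assumptions, not taken from the paper; the point is that obviously dead or blocked captures are dropped before human annotation.

```python
import re

# Illustrative markers for pages that are no longer live phishing
# content: takedown/suspension notices and CAPTCHA interstitials.
TAKEDOWN_MARKERS = re.compile(
    r"(site has been suspended|account suspended|page not found)",
    re.IGNORECASE)
CAPTCHA_MARKERS = re.compile(
    r"(recaptcha|hcaptcha|verify you are human)", re.IGNORECASE)

def is_invalid_capture(status_code: int, html: str) -> bool:
    """Return True if a crawled page should be discarded before labeling."""
    if status_code >= 400:          # HTTP error: page is gone
        return True
    if CAPTCHA_MARKERS.search(html):  # challenge page, not the target site
        return True
    if TAKEDOWN_MARKERS.search(html):  # host already took the page down
        return True
    return False
```

Pages that pass these cheap filters would then flow into the double-blind human annotation stage.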

The dataset is split temporally to respect the natural evolution of phishing campaigns. To prevent leakage from identical phishing kits, the authors compute similarity between training and test samples using locality‑sensitive hashing and discard any test instance that is too similar to a training instance. Five benchmark subsets are then derived, each reflecting a different realistic phishing prevalence (0.05 %, 0.1 %, 0.5 %, 1 %, 5 %). Additional filters increase difficulty and diversity by balancing brand representation and domain/TLD distribution, and by including pages built with modern web technologies (React, Angular, GraphQL).
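The near-duplicate filtering step can be sketched with a MinHash estimate of Jaccard similarity. This is a simplified stand-in for the paper's locality-sensitive hashing: the shingle size, signature length, and similarity threshold below are illustrative choices, and the pairwise comparison would be replaced by LSH banding at real dataset scale.

```python
import hashlib

def shingles(text, k=5):
    """Set of k-word shingles of a page's text."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k])
            for i in range(max(1, len(tokens) - k + 1))}

def minhash_signature(shingle_set, num_hashes=64):
    """MinHash signature: per seed, the minimum hash over all shingles."""
    return [
        min(int.from_bytes(
            hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
            "big") for s in shingle_set)
        for seed in range(num_hashes)
    ]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def filter_test_set(train_texts, test_texts, threshold=0.8):
    """Drop any test page too similar to some training page,
    so shared phishing-kit templates cannot leak across the split."""
    train_sigs = [minhash_signature(shingles(t)) for t in train_texts]
    kept = []
    for t in test_texts:
        sig = minhash_signature(shingles(t))
        if all(est_jaccard(sig, ts) < threshold for ts in train_sigs):
            kept.append(t)
    return kept
```

A test page rendered from the same kit as a training page produces a near-identical signature and is discarded; unrelated pages share almost no shingles and survive the filter.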

The authors evaluate seven baseline models, spanning classic URL‑lexical classifiers (Random Forest, LightGBM), HTML‑based CNNs, Transformer‑based encoders, and large language models (GPT‑4‑Turbo). Results show a trade‑off between latency and detection quality: URL‑lexical models achieve sub‑50 ms inference and >99 % precision but low recall; CNNs improve recall at the cost of 200‑300 ms latency; and LLMs attain the highest F1 (≈0.93) but require 2‑3 seconds per page, making them unsuitable for real‑time browser extensions. Importantly, when models trained on older public datasets are evaluated on PhreshPhish benchmarks, performance is over‑estimated by an average of 12 percentage points, highlighting the impact of label noise and data leakage in prior work.
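Why the benchmark's realistic base rates matter for these numbers can be seen with a back-of-envelope calculation (the rates below are illustrative, not the paper's): a detector whose precision looks excellent on a balanced evaluation set collapses at real-world phishing prevalence.

```python
def precision_at_base_rate(tpr, fpr, p):
    """Precision of a detector with true-positive rate `tpr` and
    false-positive rate `fpr` when phishing prevalence is `p`."""
    tp = tpr * p            # expected true positives per page
    fp = fpr * (1 - p)      # expected false positives per page
    return tp / (tp + fp)

# A detector that looks strong on a 50/50 split (tpr=0.95, fpr=0.01)
for p in (0.5, 0.05, 0.0005):
    print(f"prevalence {p:>7}: "
          f"precision {precision_at_base_rate(0.95, 0.01, p):.3f}")
```

At 50 % prevalence this hypothetical detector scores about 0.99 precision, but at a 0.05 % base rate, roughly the lowest benchmark setting, false positives from the overwhelming benign majority dominate and precision falls below 0.05. This is the effect the base-rate-adjusted benchmark subsets are designed to surface.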

The paper also outlines a maintenance pipeline: monthly automated crawling, continuous quality re‑assessment, and versioned releases on Hugging Face. All code, metadata schemas, and the benchmark splits are openly released, enabling the community to conduct fair, reproducible comparisons and to develop detection systems that are truly ready for deployment in real‑world environments.

