A Comprehensive Survey of Data Mining-based Fraud Detection Research

This survey paper categorises, compares, and summarises almost all published technical and review articles on automated fraud detection from the last 10 years. It defines the professional fraudster, formalises the main types and subtypes of known fraud, and presents the nature of data evidence collected within affected industries. Within the business context of mining the data to achieve higher cost savings, this research presents methods and techniques together with their problems. Compared to all related reviews on fraud detection, this survey covers many more technical articles and is the only one, to the best of our knowledge, that proposes alternative data and solutions from related domains.

💡 Research Summary

The paper presents a comprehensive, decade‑long survey of technical and review literature on automated fraud detection that relies on data‑mining techniques. It begins by underscoring the economic and societal costs of fraud and noting that most prior surveys have been either domain‑specific (e.g., credit‑card fraud, insurance fraud) or limited to methodological overviews. To address these gaps, the authors introduce the notion of a “Professional Fraudster,” a construct that quantifies the expertise, organization level, and technological sophistication of fraud actors, thereby providing a common lens for profiling malicious agents across industries.

The taxonomy section systematically classifies fraud into four primary categories—financial fraud, insurance fraud, e‑commerce fraud, and social‑network fraud—and further subdivides each into well‑defined sub‑types such as identity theft, transaction manipulation, claim fabrication, phishing, and spear‑phishing. This hierarchical schema serves two critical purposes: it standardizes labeling practices for supervised learning and it aligns evaluation metrics (precision, recall, AUC) with the specific risk profile of each fraud scenario.
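To make the evaluation metrics named above concrete, the following is a minimal sketch of how precision, recall, and AUC can be computed for a fraud classifier. The labels, scores, and the 0.5 decision threshold are toy values invented for illustration; they are not data from the survey. The sketch also shows why plain accuracy can look healthy on imbalanced data even when half of the fraud cases are missed.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for the positive (fraud = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random fraud case outscores a random
    legitimate case (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, hypothetical data: 8 legitimate transactions (0) and 2 fraudulent ones (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.10, 0.20, 0.15, 0.30, 0.05, 0.60, 0.25, 0.35, 0.90, 0.40]

# Assumed decision threshold of 0.5; in practice this is tuned per fraud scenario.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} auc={auc:.4f}")
```

On this toy set the accuracy is 0.80 while recall is only 0.50, which is one reason a per-scenario choice of metric matters when fraud is rare.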

In the data‑evidence chapter, the authors map the characteristic data sources for each sector. Financial institutions generate high‑frequency transaction logs, geolocation stamps, and device fingerprints (structured data) together with call‑center transcripts and email bodies (unstructured data). Insurance firms rely heavily on claim‑form images, medical records, and diagnostic codes, while e‑commerce platforms produce click‑stream data, product reviews, and payment details that together form a multimodal dataset. Social‑network fraud detection draws on user‑relationship graphs, post content, and hashtag trends. Across all domains, the data are high‑dimensional, severely imbalanced, and often require near‑real‑time processing.

The methodological review spans from classical statistical models (logistic regression, naïve Bayes) through traditional machine‑learning algorithms (random forests, XGBoost) to deep‑learning architectures (LSTM for sequential transaction streams, CNN for image‑based claims, Graph Neural Networks for relational fraud). Each technique is evaluated on predictive performance, interpretability, computational overhead, and suitability for online learning. The authors stress that handling class imbalance—via oversampling methods such as SMOTE/ADASYN, cost‑sensitive learning, or ensemble re‑weighting—is indispensable for achieving reliable detection rates. Moreover, they advocate for post‑hoc explainability tools (LIME, SHAP) to satisfy regulatory requirements and to aid fraud analysts in root‑cause analysis.
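The two imbalance-handling ideas mentioned above, synthetic oversampling and cost-sensitive weighting, can be sketched in a few lines. This is a simplified, SMOTE-style interpolation written from scratch for illustration, not the reference SMOTE or ADASYN implementation, and the function names `smote_like` and `balanced_class_weights`, the sample points, and the parameters `k` and `seed` are all assumptions made for this example.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def smote_like(minority: Sequence[Point], n_new: int, k: int = 2, seed: int = 0) -> List[Point]:
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic: List[Point] = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority points, sorted by distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Cost-sensitive weights inversely proportional to class frequency:
    weight(c) = n_samples / (n_classes * count(c))."""
    counts = Counter(labels)
    n, n_classes = len(labels), len(counts)
    return {c: n / (n_classes * cnt) for c, cnt in counts.items()}


# Hypothetical 2-D feature vectors for three known fraud cases.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
synthetic = smote_like(minority, n_new=5)

# 8 legitimate labels and 2 fraud labels: fraud errors are weighted 4x more heavily.
weights = balanced_class_weights([0] * 8 + [1] * 2)
print(synthetic)
print(weights)
```

Oversampling changes the training data, whereas class weights change the loss; the two are usually treated as alternatives, and either should be applied to the training split only so that evaluation reflects the true class ratio.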

A distinctive contribution of this survey is its exploration of “alternative data” sourced from adjacent domains. Cyber‑security logs, IoT sensor streams, and medical imaging data, although structurally different from traditional fraud datasets, can be transformed through feature‑extraction pipelines and embedding techniques to enrich fraud detection models. Incorporating such heterogeneous data promises earlier detection of novel fraud patterns that would otherwise remain hidden.
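The summary does not say which feature-extraction technique is meant, so the following is only one possible illustration: the hashing trick, which maps tokens from a heterogeneous source such as a security log into a fixed-length count vector that can be appended to ordinary transaction features. The log line, the bucket count, and the function name `hash_features` are hypothetical.

```python
import hashlib
from typing import List, Sequence


def hash_features(tokens: Sequence[str], n_buckets: int = 8) -> List[int]:
    """Map an arbitrary list of tokens to a fixed-length count vector.

    A stable hash (SHA-256) is used instead of Python's built-in hash(),
    which is randomised per process for strings.
    """
    vec = [0] * n_buckets
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "big") % n_buckets
        vec[idx] += 1
    return vec


# Hypothetical security-log line, split into whitespace-separated tokens.
log_line = "FAILED_LOGIN user=alice ip=10.0.0.5 device=unknown"
tokens = log_line.split()
log_vector = hash_features(tokens)

# Hypothetical structured transaction features (amount, hour of day).
transaction_features = [129.99, 23.0]
combined = transaction_features + [float(v) for v in log_vector]
print(combined)
```

The vector length stays constant regardless of vocabulary size, at the cost of occasional collisions between unrelated tokens; learned embeddings are the heavier alternative the summary alludes to.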

The concluding discussion acknowledges persistent challenges: data privacy constraints, the high cost of obtaining reliable ground‑truth labels, and the difficulty of transferring models across jurisdictions or industries. The authors propose a forward‑looking research agenda that includes privacy‑preserving federated learning, reinforcement‑learning‑based fraud simulation environments, advanced multimodal fusion strategies, and interdisciplinary studies that couple technical solutions with policy and legal frameworks. In sum, the survey not only maps the current landscape of data‑mining‑based fraud detection but also charts a clear path toward more robust, explainable, and cross‑domain solutions.

📜 Original Paper Content