Improving Cryptocurrency Pump-and-Dump Detection through Ensemble-Based Models and Synthetic Oversampling Techniques

This study aims to detect pump-and-dump (P&D) manipulation in cryptocurrency markets, where the scarcity of such events causes severe class imbalance and hinders accurate detection. To address this issue, the Synthetic Minority Oversampling Technique (SMOTE) was applied, and advanced ensemble learning models were evaluated to distinguish manipulative trading behavior from normal market activity. The experimental results show that applying SMOTE improved every model's ability to detect P&D events, raising recall and improving the balance between precision and recall. In particular, XGBoost and LightGBM achieved high recall (94.87% and 93.59%, respectively) with strong F1-scores and fast computation, making them suitable for near-real-time surveillance. These findings indicate that combining data-balancing techniques with ensemble methods significantly improves the early detection of manipulative activity, contributing to a fairer, more transparent, and more stable cryptocurrency market.


💡 Research Summary

The paper tackles the problem of detecting pump‑and‑dump (P&D) manipulation in cryptocurrency markets, where such events are extremely rare and therefore create a severe class‑imbalance problem for supervised learning. The authors first construct a feature set that captures price spikes, volume surges, order‑book depth, spread, and other micro‑structure indicators over a 30‑minute window. Each window is labeled “1” if it is identified as P&D and “0” if it reflects normal trading. Because the minority class represents only about 1% of the data, conventional classifiers tend to favor the majority class, resulting in low recall and a high false‑negative rate.
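To make the imbalance problem concrete, the following minimal sketch shows why accuracy alone is misleading at a 1% positive rate. The sample counts and the always-negative "classifier" are illustrative assumptions, not data or models from the paper.

```python
# Illustrative only: 1,000 windows with a 1% positive rate (10 P&D windows).
labels = [1] * 10 + [0] * 990

# A degenerate "classifier" that always predicts normal trading.
predictions = [0] * len(labels)

true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
false_neg = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
correct = sum(1 for y, p in zip(labels, predictions) if y == p)

accuracy = correct / len(labels)
recall = true_pos / (true_pos + false_neg)

print(f"accuracy = {accuracy:.2%}")  # 99.00%
print(f"recall   = {recall:.2%}")    # 0.00%
```

A model that never flags a single manipulation event still scores 99% accuracy, which is why the evaluation emphasizes recall and F1 rather than accuracy.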

To mitigate this, the study applies the Synthetic Minority Oversampling Technique (SMOTE). SMOTE generates synthetic minority samples by interpolating between existing minority instances, thereby increasing the proportion of P&D cases to roughly 15% without drastically altering the underlying distribution. The authors then evaluate five ensemble learners: Random Forest, Gradient Boosting, AdaBoost, XGBoost, and LightGBM. All models undergo the same hyper‑parameter search and are assessed using accuracy, precision, recall, F1‑score, and ROC‑AUC, with a focus on recall because missing a manipulation event is far more costly than a false alarm.
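The interpolation step can be sketched in a few lines of standard-library Python. This is a simplified illustration of the general SMOTE idea, not the authors' implementation: the function names `smote` and `n_needed`, the neighbour count `k`, and the toy data are all assumptions, and the summary does not state which SMOTE settings the paper used. In practice a maintained library implementation would normally be preferred, and a common precaution is to oversample only the training split so that synthetic points never leak into evaluation data.

```python
import math
import random


def n_needed(n_minority, n_majority, target_share):
    """Synthetic samples needed so the minority makes up target_share of the data."""
    total_minority = target_share * n_majority / (1 - target_share)
    return max(0, math.ceil(total_minority - n_minority))


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base point.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other, in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy example: 4 minority points among 96 majority points (4% minority share).
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 3.0]]
extra = smote(minority, n_needed(len(minority), 96, 0.15), k=3)
print(len(extra), "synthetic samples generated")

# Going from roughly 1% to roughly 15% with 10 real positives and 990 negatives:
print(n_needed(10, 990, 0.15))  # 165
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies, which is the sense in which SMOTE avoids drastically altering the distribution.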

Results without SMOTE show all models achieving recall below 45%, confirming that the imbalance cripples detection capability. After SMOTE, performance improves dramatically. XGBoost attains a recall of 94.87% and precision of 91.23%, yielding an F1‑score of 0.93; LightGBM follows closely with 93.59% recall, 90.78% precision, and an F1‑score of 0.92. Random Forest and AdaBoost also benefit from SMOTE, reaching recall in the mid‑80% range, but they lag behind in both predictive quality and computational speed.
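As a quick consistency check, the reported F1-scores can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The helper name `f1_score` is chosen here for illustration; only the precision and recall figures come from the summary above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported after SMOTE.
reported = {
    "XGBoost": (0.9123, 0.9487),
    "LightGBM": (0.9078, 0.9359),
}

f1 = {name: f1_score(p, r) for name, (p, r) in reported.items()}
for name, value in f1.items():
    print(f"{name}: F1 = {value:.2f}")
# XGBoost: F1 = 0.93
# LightGBM: F1 = 0.92
```

Both values match the figures quoted in the text, so the reported precision, recall, and F1 numbers are mutually consistent.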

Speed tests reveal that XGBoost and LightGBM train in roughly 12 seconds and infer a single sample in about 0.03 seconds, making them suitable for near‑real‑time surveillance. In contrast, Random Forest requires about 45 seconds to train and 0.12 seconds per inference, which is less practical for live monitoring.
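To put the per-sample latencies in operational terms, the rough arithmetic below estimates how many trading pairs a single worker could score one after another within one monitoring cycle. The 60-second cycle length and the assumption of strictly sequential scoring are hypothetical choices made for illustration; the summary does not describe the deployment setup.

```python
def max_symbols_per_cycle(latency_ms, cycle_ms):
    """Number of sequential single-sample predictions that fit in one cycle."""
    return cycle_ms // latency_ms


CYCLE_MS = 60_000  # hypothetical 60-second scoring cycle

# Per-sample inference latencies from the text, in milliseconds.
latency_ms = {"XGBoost / LightGBM": 30, "Random Forest": 120}

capacity = {
    name: max_symbols_per_cycle(ms, CYCLE_MS) for name, ms in latency_ms.items()
}
for name, n in capacity.items():
    print(f"{name}: up to {n} symbols per cycle")
# XGBoost / LightGBM: up to 2000 symbols per cycle
# Random Forest: up to 500 symbols per cycle
```

Under these assumptions the boosting models cover four times as many markets per cycle as Random Forest, which is the practical meaning of the latency gap.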

The authors discuss several implications. First, SMOTE effectively balances the dataset and lifts recall, proving essential for rare‑event detection. Second, gradient‑boosting ensembles, especially XGBoost and LightGBM, combine high predictive performance with low latency, positioning them as the preferred tools for operational monitoring systems. Third, the risk of overfitting to historical patterns is acknowledged; the paper recommends continuous online learning and concept‑drift detection to keep models up‑to‑date. Finally, the synthetic nature of SMOTE samples may not fully capture the complex dynamics of real market manipulation, prompting future work on GAN‑based augmentation or simulation environments to generate more realistic minority instances.

In conclusion, integrating SMOTE with state‑of‑the‑art boosting algorithms substantially improves the early detection of cryptocurrency pump‑and‑dump schemes. This methodological advance offers regulators, exchanges, and market participants a more reliable, timely, and scalable means to safeguard market integrity and promote a fair trading ecosystem.

