Leveraging Data Augmentation and Siamese Learning for Predictive Process Monitoring
Predictive Process Monitoring (PPM) enables forecasting future events or outcomes of ongoing business process instances from event logs. However, deep learning PPM approaches are often limited by the small size and low variability of real-world event logs. To address this, we introduce SiamSA-PPM, a novel self-supervised learning framework that combines Siamese learning with Statistical Augmentation for Predictive Process Monitoring. It employs three novel, statistically grounded transformation methods that leverage control-flow semantics and frequent behavioral patterns to generate realistic, semantically valid new trace variants. These augmented views are used within a Siamese learning setup to learn generalizable representations of process prefixes without the need for labeled supervision. Extensive experiments on real-life event logs demonstrate that SiamSA-PPM achieves competitive or superior performance compared to the state of the art in both next-activity and final-outcome prediction tasks. Our results further show that statistical augmentation significantly outperforms random transformations and increases variability in the data, highlighting SiamSA-PPM as a promising direction for training-data enrichment in process prediction.
💡 Research Summary
The paper addresses two fundamental challenges in Predictive Process Monitoring (PPM): the scarcity of labeled event logs and the limited variability of real‑world process traces. To tackle these issues, the authors propose SiamSA‑PPM, a self‑supervised learning framework that combines statistically grounded data augmentation with a Siamese (BYOL‑style) representation learning approach.
Three novel augmentation operators are introduced: StatisticalInsertion, StatisticalDeletion, and StatisticalReplacement. All three rely on frequent control‑flow patterns mined from the original log. First, activities that appear in at least a fraction α of cases are retained. Direct‑follower pairs (B → C) occurring in at least a fraction β of all transitions are then extracted, and for each such pair, frequent intermediate subsequences π (of length ≤ λ_max) that appear in at least a fraction γ of the traces are identified. StatisticalInsertion replaces a direct pair B → C with B → π → C, inserting realistic intermediate steps. StatisticalDeletion does the opposite, collapsing B → π → C back to B → C. StatisticalReplacement targets XOR‑like structures: for a start–end pair (D, E) it collects all observed intermediate variants ρi and swaps one variant for another, preserving the start and end activities while diversifying the middle part. The frequency thresholds (α, β, γ, δ) and λ_max control the semantic validity of the generated traces.
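The mining and insertion steps above can be sketched as follows. Function names, threshold defaults, and the data structures are illustrative assumptions, not the paper's actual implementation:

```python
import random
from collections import Counter

def mine_insertions(log, beta=0.1, gamma=0.1, lam_max=2):
    """Mine frequent direct-follower pairs (B, C) and the frequent
    intermediate subsequences pi observed between them (B -> pi -> C)."""
    pair_counts = Counter()
    total_transitions = 0
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            pair_counts[(a, b)] += 1
            total_transitions += 1
    # Keep pairs covering at least a fraction beta of all transitions.
    frequent_pairs = {p for p, c in pair_counts.items()
                      if c / total_transitions >= beta}
    # Collect short gap-subsequences pi between frequent pairs.
    inter = Counter()
    for trace in log:
        for i, b in enumerate(trace):
            for j in range(i + 2, min(i + 2 + lam_max + 1, len(trace))):
                c = trace[j]
                if (b, c) in frequent_pairs:
                    inter[(b, c, tuple(trace[i + 1:j]))] += 1
    # Keep subsequences appearing in at least a fraction gamma of traces.
    n = len(log)
    return {k: v for k, v in inter.items() if v / n >= gamma}

def statistical_insertion(trace, patterns, rng=random):
    """StatisticalInsertion: replace one direct pair B -> C with B -> pi -> C."""
    candidates = [(i, pi) for i, (b, c) in enumerate(zip(trace, trace[1:]))
                  for (pb, pc, pi) in patterns if (pb, pc) == (b, c) and pi]
    if not candidates:
        return list(trace)
    i, pi = rng.choice(candidates)
    return list(trace[:i + 1]) + list(pi) + list(trace[i + 1:])

log = [["A", "C"], ["A", "B", "C"], ["A", "B", "C"], ["A", "C"]]
patterns = mine_insertions(log)
augmented = statistical_insertion(["A", "C"], patterns)
```

StatisticalDeletion would invert the same lookup (matching B → π → C and emitting B → C), reusing the mined pattern table.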
The augmentation pipeline creates two distinct views of each training prefix by applying two successive random selections from the pool of transformation functions. These views are fed into a BYOL‑style Siamese network: an online encoder and a momentum‑updated target encoder share the same architecture, followed by projection heads. The online network’s output is trained to match the target network’s output using an L2 loss; no negative samples are required, making the method suitable for small, structured datasets typical of process mining.
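A minimal numeric sketch of this training step, assuming linear stand-ins for the encoder-plus-projector stacks and an EMA coefficient τ = 0.99 (both assumptions; the paper's encoders are sequence models):

```python
import numpy as np

rng = np.random.default_rng(0)
W_online = rng.normal(size=(8, 4))   # online encoder + projector, collapsed to one matrix
W_target = W_online.copy()           # target network starts as a copy
tau = 0.99                           # momentum coefficient (assumed value)

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def byol_loss(x1, x2):
    """L2 loss between normalized online and target projections.
    No negative samples are involved."""
    p = l2_normalize(x1 @ W_online)   # online projection of view 1
    z = l2_normalize(x2 @ W_target)   # target projection of view 2 (no gradient in practice)
    return np.mean(np.sum((p - z) ** 2, axis=-1))

def ema_update():
    """Momentum update of the target encoder toward the online encoder."""
    global W_target
    W_target = tau * W_target + (1 - tau) * W_online

x1 = rng.normal(size=(16, 8))              # view 1 of a batch of encoded prefixes
x2 = x1 + 0.01 * rng.normal(size=(16, 8))  # view 2, a slight perturbation
loss = byol_loss(x1, x2)
ema_update()
```

In a real implementation the gradient flows only through the online branch; the target branch is updated exclusively via the EMA step.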
After pre‑training, the encoder is fine‑tuned on downstream PPM tasks. For next‑activity prediction, a softmax classifier is attached to the prefix embedding; for final‑outcome prediction, a binary or multi‑class head is used.
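The next-activity head can be sketched as a softmax classifier over the prefix embedding; the embedding dimension, vocabulary size, and initialization below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_activities = 5                                  # assumed vocabulary size
W_head = rng.normal(size=(4, n_activities)) * 0.1 # classifier weights (fine-tuned)
b_head = np.zeros(n_activities)

def predict_next_activity(prefix_embedding):
    """Distribution over the next activity, given pre-trained prefix embeddings."""
    return softmax(prefix_embedding @ W_head + b_head)

emb = rng.normal(size=(2, 4))   # stand-in for embeddings from the pre-trained encoder
probs = predict_next_activity(emb)
```

The final-outcome head would be identical in shape, with the activity vocabulary replaced by two (or more) outcome classes.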
The authors evaluate SiamSA‑PPM on eight publicly available real‑world logs (e.g., BPI Challenge, Sepsis, Helpdesk). They compare against strong baselines: LSTM, Transformer, and a recent data‑augmentation‑based PPM model. Metrics include accuracy, F1‑score, and AUC. SiamSA‑PPM consistently outperforms baselines, achieving 2–5 percentage‑point gains in most cases. The improvement is especially pronounced for final‑outcome prediction under severe class imbalance, where AUC rises from 0.87 (best baseline) to 0.92. Computationally, the BYOL setup requires modest batch sizes and memory, confirming its practicality for industrial settings.
Ablation studies dissect the contributions of (i) statistical versus random augmentation, (ii) self‑supervised versus fully supervised training, and (iii) sensitivity to the augmentation hyper‑parameters. Statistical augmentation yields a lower KL divergence between the original and augmented trace distributions (0.12 vs. 0.31 for random transformations), indicating higher realism. Even when only 10 % of the data is labeled, self‑supervised pre‑training retains a 3‑percentage‑point accuracy advantage over purely supervised training.
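One way such a KL divergence can be computed is over smoothed unigram activity distributions of the two logs; the unigram choice and smoothing constant below are assumptions, as the paper may use a different trace statistic:

```python
import math
from collections import Counter

def activity_distribution(log, vocab, eps=1e-9):
    """Smoothed relative frequency of each activity across all traces."""
    counts = Counter(a for trace in log for a in trace)
    total = sum(counts.values())
    return {a: (counts[a] + eps) / (total + eps * len(vocab)) for a in vocab}

def kl_divergence(p, q):
    """KL(p || q) over a shared activity vocabulary."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p)

original = [["A", "B", "C"], ["A", "C"]]
augmented = [["A", "B", "C"], ["A", "B", "C"]]
vocab = {"A", "B", "C"}
kl = kl_divergence(activity_distribution(original, vocab),
                   activity_distribution(augmented, vocab))
```

A value near zero means the augmented log preserves the original activity-frequency profile; larger values indicate distributional drift.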
The paper’s main contributions are: (1) three statistically grounded trace transformation techniques that preserve control‑flow semantics while enriching data; (2) a BYOL‑based Siamese self‑supervised framework tailored to process logs; (3) extensive empirical validation on diverse real‑world datasets; and (4) a thorough analysis of performance, efficiency, and robustness.
Limitations include reliance on frequent patterns, which may overlook rare but important process variants, and the need for domain‑specific threshold tuning. Future work is suggested on automated hyper‑parameter optimization (e.g., Bayesian optimization), integration of generative models (GANs, VAEs) to synthesize rare paths, and multi‑task transfer learning to further reduce label dependence.