Dual Pruning and Sorting-Free Overestimation for Average-Utility Sequential Pattern Mining
Numerous efficient algorithms have been developed for high-utility sequential pattern mining (HUSPM) in quantitative sequential databases. HUSPM relates frequency to real-world significance and captures more crucial information than frequent pattern mining. However, high average-utility sequential pattern mining (HAUSPM) is deemed fairer and more valuable than HUSPM, because it accounts for pattern length and therefore provides a reasonable measure for longer patterns. In contrast to retail business analysis, some pattern mining applications, such as cybersecurity or artificial intelligence (AI), often involve much longer sequences, so pruning strategies have a more pronounced impact on efficiency. This paper proposes a novel algorithm named HAUSP-PG, which adopts two complementary strategies to independently process pattern prefixes and remaining sequences, thereby achieving a dual pruning effect. Additionally, the proposed method calculates average-utility upper bounds without requiring item sorting, significantly reducing computational time and memory consumption compared to alternative approaches. Experiments on both real-life and synthetic datasets demonstrate that the proposed algorithm achieves satisfactory performance.
💡 Research Summary
The paper addresses the problem of mining high‑average‑utility sequential patterns (HAUSPs) from quantitative sequential databases, a task that has received relatively little attention compared to high‑utility sequential pattern mining (HUSPM). While HUSPM focuses solely on the absolute utility of a pattern, HAUSPM incorporates the pattern length, thereby providing a fairer measure that balances profit against the amount of material (or time) involved. Existing HAUSP approaches suffer from two major drawbacks: (1) the average‑utility measure is neither anti‑monotonic nor monotonic, which makes it difficult to devise tight upper‑bound (UB) pruning strategies; (2) most UB calculations rely on sorting the items of the remaining sequence and selecting the top‑k utilities, an operation whose cost grows dramatically with sequence length. This is especially problematic for domains such as cybersecurity, AI behavior analysis, or any application that generates very long event streams.
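To make the first drawback concrete, here is a minimal Python sketch of why average utility is neither monotonic nor anti-monotonic. The toy sequence, its utility values, and the function name `average_utility` are illustrative assumptions, not taken from the paper, and the sketch simplifies the measure to a single sequence with one item per position.

```python
from fractions import Fraction

# Hypothetical toy q-sequence: (item, utility) pairs, one item per position.
TOY_SEQUENCE = [("a", 2), ("b", 10), ("c", 1)]


def average_utility(prefix_len, sequence=TOY_SEQUENCE):
    """Average utility of the pattern made of the first `prefix_len` items.

    Simplification (assumption): the pattern's utility is the sum of its
    item utilities in this one sequence, divided by the pattern length.
    """
    total = sum(utility for _, utility in sequence[:prefix_len])
    return Fraction(total, prefix_len)


# Average utility of <a>, <a b>, <a b c>.
values = [average_utility(k) for k in (1, 2, 3)]

# Extending <a> to <a b> raises the value; extending to <a b c> lowers it.
rises_then_falls = values[0] < values[1] > values[2]
```

Because the value can move in either direction as the pattern grows, a miner cannot discard a pattern's extensions just because the pattern itself is below the threshold, which is why an upper bound is needed.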
To overcome these limitations, the authors propose a novel algorithm called HAUSP‑PG (High‑Average‑Utility Sequential Pattern Mining with Pattern‑Growth). HAUSP‑PG integrates three key innovations:
- Dual Pruning Strategy – The algorithm treats the prefix of the currently explored pattern and the remaining suffix (the “remaining sequence”) as two independent pruning fronts.
  - Irrelevant Item Pruning (IIP) works on the prefix: items that cannot contribute enough average utility to meet the threshold are removed immediately, shrinking the candidate set.
  - Look‑Ahead Removing (LAR) works on the remaining sequence: if the maximal possible average utility that could be obtained by extending the pattern with any subset of the remaining items is below the threshold, the whole branch is abandoned.
  Because IIP and LAR operate independently yet complement each other, the search space contracts dramatically, especially for long sequences where many items become irrelevant early on.
- Sorting‑Free Overestimation – Instead of repeatedly sorting the remaining items to compute a tight UB (as done in AUUB, LUBAU, KRTMUUB, etc.), HAUSP‑PG pre‑computes the maximum utility of each item across the whole database. During mining, it estimates the upper bound of average utility by dividing the sum of these maximal utilities in the remaining sequence by the prospective pattern length. This “sorting‑free” UB is still tight enough to prune effectively, but it eliminates the O(n log n) sorting cost and reduces memory overhead because no auxiliary sorted lists need to be maintained.
- Pattern‑Growth Framework with Compact Data Structures – HAUSP‑PG adopts the well‑known pattern‑growth paradigm (similar to PrefixSpan) and uses a compressed list structure akin to UL‑list or seqPro. This allows the algorithm to avoid multiple full scans of the original database; only the relevant projected portions are accessed. The UB values are updated incrementally as the pattern grows, enabling early abandonment of unpromising branches.
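The sorting-free bound and the look-ahead check described in the list can be sketched roughly as follows. This is one possible reading of the summary, not the paper's exact formula: the function names, the toy data format, the inclusion of the prefix utility in the numerator, and the choice of `prefix_length + 1` as the prospective length are all assumptions, and utilities are assumed non-negative.

```python
from typing import Dict, List, Tuple

# Hypothetical toy format: a q-sequence is a list of (item, utility) pairs.
QSequence = List[Tuple[str, int]]


def max_item_utilities(database: List[QSequence]) -> Dict[str, int]:
    """Single pass over the database: largest utility seen for each item."""
    best: Dict[str, int] = {}
    for sequence in database:
        for item, utility in sequence:
            best[item] = max(utility, best.get(item, utility))
    return best


def sorting_free_bound(prefix_utility: int, prefix_length: int,
                       remaining_items: List[str],
                       max_util: Dict[str, int]) -> float:
    """Overestimate the average utility of any extension of the prefix.

    The numerator is optimistic (every remaining item at its database-wide
    maximum); the denominator is the shortest possible extension length.
    No sorting of the remaining items is involved, only one linear sum.
    """
    optimistic_gain = sum(max_util[item] for item in remaining_items)
    return (prefix_utility + optimistic_gain) / (prefix_length + 1)


def can_prune(bound: float, min_average_utility: float) -> bool:
    """Look-ahead style check: abandon the branch if even the bound fails."""
    return bound < min_average_utility


database = [
    [("a", 2), ("b", 10), ("c", 1)],
    [("a", 4), ("c", 3)],
]
max_util = max_item_utilities(database)

# Prefix <a> in the first sequence: utility 2, length 1, remaining items b, c.
bound = sorting_free_bound(2, 1, ["b", "c"], max_util)
```

In this toy case the bound is (2 + 10 + 3) / 2 = 7.5, while the best real extension, <a b>, has average utility 6, so the bound overestimates as intended. A threshold of 8 would let `can_prune` discard the branch, whereas a threshold of 7 would not.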
The authors evaluate HAUSP‑PG on several real‑world datasets (e.g., Kosarak, BMS‑WebView‑1, Retail) and on synthetic datasets with sequence lengths ranging from 200 to 500 and varied utility distributions. Experiments vary the minimum average‑utility threshold from 0.01 to 0.1. Compared with state‑of‑the‑art HAUSP methods such as EHAUSM, HANP‑Miner, HAOP‑Miner, and the earlier two‑phase approaches, HAUSP‑PG consistently achieves:
- Runtime reductions of 45 %–70 % on average, with the gap widening as sequence length increases.
- Memory savings of 30 %–55 % due to the elimination of sorting buffers and the use of compact list structures.
- Higher pruning ratios, especially for long sequences where LAR can cut off large sub‑trees early.
The paper also provides an ablation study showing that each component (dual pruning, sorting‑free UB, compact data structure) contributes significantly to the overall speed‑up. The authors discuss the practical relevance of HAUSP‑PG for domains that generate long event logs, such as intrusion detection systems, AI model behavior tracing, and large‑scale clickstream analysis. They argue that the algorithm’s ability to balance utility against pattern length makes the discovered patterns more actionable and less biased toward trivially long but low‑value sequences.
In conclusion, HAUSP‑PG advances the state of HAUSPM by delivering a scalable, memory‑efficient solution that removes the costly sorting step while exploiting two complementary pruning mechanisms. Future work suggested includes extending the method to streaming/online settings, handling negative utilities (losses), and integrating multi‑objective optimization (e.g., average utility together with average cost or risk). Such extensions would further broaden the applicability of HAUSP‑PG in real‑time security analytics and other domains where long sequential data are the norm.