Probabilistic Frequent Pattern Growth for Itemset Mining in Uncertain Databases (Technical Report)

Frequent itemset mining in uncertain transaction databases semantically and computationally differs from traditional techniques applied on standard (certain) transaction databases. Uncertain transaction databases consist of sets of existentially uncertain items. The uncertainty of items in transactions makes traditional techniques inapplicable. In this paper, we tackle the problem of finding probabilistic frequent itemsets based on possible world semantics. In this context, an itemset X is called frequent if the probability that X occurs in at least minSup transactions is above a given threshold. We make the following contributions: We propose the first probabilistic FP-Growth algorithm (ProFP-Growth) and associated probabilistic FP-Tree (ProFP-Tree), which we use to mine all probabilistic frequent itemsets in uncertain transaction databases without candidate generation. In addition, we propose an efficient technique to compute the support probability distribution of an itemset in linear time using the concept of generating functions. An extensive experimental section evaluates our proposed techniques and shows that our ProFP-Growth approach is significantly faster than the current state-of-the-art algorithm.

💡 Research Summary

The paper addresses the problem of mining frequent itemsets in uncertain transaction databases, where each item in a transaction is associated with an existence probability. Unlike traditional certain databases, the uncertainty requires a possible‑world semantics: an itemset X is considered frequent if the probability that X appears in at least minSup transactions exceeds a user‑defined threshold θ. Existing approaches for this setting rely on candidate generation and repeated scans, leading to high computational cost and memory consumption, especially when the number of items or the degree of uncertainty grows.
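To make the possible‑world definition concrete, the following is a minimal brute‑force sketch, not the paper's method. It assumes that items are independent and that a database is a list of dictionaries mapping each item to its existence probability; the function name `frequentness_by_enumeration` and the toy data are hypothetical. It enumerates every possible world, so it is only usable on tiny inputs, which is exactly the cost that more efficient algorithms avoid.

```python
from itertools import product


def frequentness_by_enumeration(db, itemset, min_sup):
    """P(itemset occurs in at least min_sup transactions), by brute force.

    db: list of dicts mapping item -> existence probability (assumed layout).
    Items are assumed independent of each other.
    """
    itemset = set(itemset)
    # One "slot" per uncertain item occurrence: (transaction index, item, probability).
    slots = [(t, item, p) for t, tx in enumerate(db) for item, p in tx.items()]
    total = 0.0
    # Each possible world decides, for every slot, whether the item exists.
    for world in product([False, True], repeat=len(slots)):
        world_prob = 1.0
        present = [set() for _ in db]
        for exists, (t, item, p) in zip(world, slots):
            world_prob *= p if exists else 1.0 - p
            if exists:
                present[t].add(item)
        # Support of the itemset in this particular (certain) world.
        support = sum(1 for items in present if itemset <= items)
        if support >= min_sup:
            total += world_prob
    return total


# Hypothetical toy database with three transactions.
db = [{"a": 0.9, "b": 0.5}, {"a": 0.6, "b": 1.0}, {"a": 0.8}]
print(frequentness_by_enumeration(db, {"a"}, 2))
```

With 5 uncertain item occurrences the loop visits 2⁵ = 32 worlds; the count doubles with every additional occurrence.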

To overcome these limitations, the authors propose a novel framework called Probabilistic FP‑Growth (ProFP‑Growth) together with a dedicated data structure, the Probabilistic FP‑Tree (ProFP‑Tree). The ProFP‑Tree extends the classic FP‑Tree by storing, for each node, not only the item identifier and its count but also the probability that the item occurs in the corresponding transaction and an “uncertainty list” that records the identifiers of transactions in which the item’s presence is still probabilistic. This design enables a compact representation of the entire uncertain database while preserving all probabilistic information needed for exact support computation.

A second key contribution is an efficient linear‑time algorithm for computing the support probability distribution of any itemset. The authors observe that the support of X can be expressed as a sum of independent Bernoulli variables, one per transaction, each with success probability pₜ(X) (the product of the individual item probabilities in that transaction). By mapping each transaction to a simple generating function (1 − pₜ(X) + pₜ(X)·z) and taking the product over all transactions, the coefficient of zᵏ in the resulting polynomial equals the probability that X appears in exactly k transactions. The algorithm multiplies these polynomials while truncating any term whose degree exceeds minSup, since only the probability of reaching minSup matters for the frequency test. This yields a worst‑case time complexity of O(N·minSup), where N is the number of transactions, which is linear in N for a fixed minSup. Because the ProFP‑Tree can retrieve pₜ(X) directly from its uncertainty lists, the whole process remains linear in the size of the tree.
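The generating‑function product described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the per‑transaction probabilities pₜ(X) are assumed to be given as a plain list, minSup is assumed to be at least 1, and the function name `support_distribution` is hypothetical. As one way of realising the truncation, all probability mass for supports of minSup or more is folded into the last coefficient, so the final entry is directly the probability that X is frequent.

```python
def support_distribution(probs, min_sup):
    """Truncated support distribution of an itemset.

    probs: list of p_t(X), one value per transaction (assumed input format).
    Returns a list c of length min_sup + 1 where
      c[k]       = P(support == k) for k < min_sup
      c[min_sup] = P(support >= min_sup)
    Runs in O(N * min_sup) time for N = len(probs); requires min_sup >= 1.
    """
    coeffs = [1.0] + [0.0] * min_sup  # the constant polynomial 1
    for p in probs:
        # Multiply by (1 - p + p*z), updating in place from high to low degree
        # so that every update still reads the previous step's coefficients.
        coeffs[min_sup] += p * coeffs[min_sup - 1]  # mass at >= min_sup stays there
        for k in range(min_sup - 1, 0, -1):
            coeffs[k] = (1.0 - p) * coeffs[k] + p * coeffs[k - 1]
        coeffs[0] *= 1.0 - p
    return coeffs


# Hypothetical probabilities p_t(X) for three transactions.
dist = support_distribution([0.9, 0.6, 0.8], min_sup=2)
print(dist)
```

For these three probabilities the entries are P(support = 0) = 0.008, P(support = 1) = 0.116, and P(support ≥ 2) = 0.876, up to floating‑point rounding. The itemset is then reported as frequent if the last entry exceeds θ.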

The ProFP‑Growth mining procedure mirrors the classic FP‑Growth depth‑first pattern growth but eliminates the candidate generation phase entirely. Starting from the ProFP‑Tree, the algorithm computes the support distribution for all 1‑itemsets using the generating‑function method, retains those that satisfy the probabilistic frequency threshold, and builds conditional ProFP‑Trees for each frequent item. Recursively, each conditional tree is explored: an item is added to the current prefix, the support distribution of the extended itemset is recomputed, and if the new itemset remains frequent it is emitted; otherwise the branch is pruned. Because pruning happens as soon as the probability falls below θ, the search space shrinks dramatically. Moreover, the compactness of the ProFP‑Tree reduces memory usage compared with candidate‑based approaches.
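The depth‑first growth and pruning logic can be illustrated with a simplified sketch. It deliberately omits the ProFP‑Tree and conditional trees and instead rescans a small in‑memory database, so it shows only the search order and the pruning rule, not the paper's data structure or its performance. The database layout (a list of item‑to‑probability dictionaries), the item independence assumption, and the names `frequentness` and `mine` are assumptions made for this example. Pruning is safe because adding an item to an itemset can never increase its support in any possible world, so the frequentness probability cannot increase either.

```python
def frequentness(probs, min_sup):
    """P(support >= min_sup) via the truncated generating-function product."""
    coeffs = [1.0] + [0.0] * min_sup
    for p in probs:
        coeffs[min_sup] += p * coeffs[min_sup - 1]
        for k in range(min_sup - 1, 0, -1):
            coeffs[k] = (1.0 - p) * coeffs[k] + p * coeffs[k - 1]
        coeffs[0] *= 1.0 - p
    return coeffs[min_sup]


def mine(db, min_sup, theta):
    """Depth-first enumeration of probabilistic frequent itemsets.

    db: list of dicts mapping item -> existence probability (assumed layout).
    Returns a dict mapping each frequent itemset (as a sorted tuple) to its
    frequentness probability.
    """
    items = sorted({item for tx in db for item in tx})
    results = {}

    def grow(prefix, candidates):
        for idx, item in enumerate(candidates):
            itemset = prefix + (item,)
            # p_t(X): product of item probabilities, 0 if any item is missing.
            probs = []
            for tx in db:
                p = 1.0
                for i in itemset:
                    p *= tx.get(i, 0.0)
                probs.append(p)
            prob = frequentness(probs, min_sup)
            if prob > theta:
                results[itemset] = prob
                # Extend only with later items so each itemset is visited once.
                grow(itemset, candidates[idx + 1:])
            # Otherwise the branch is pruned: no superset can be frequent.

    grow((), items)
    return results


# Hypothetical toy database and thresholds.
db = [{"a": 0.9, "b": 0.5}, {"a": 0.6, "b": 1.0}, {"a": 0.8}]
frequent = mine(db, min_sup=2, theta=0.4)
print(frequent)
```

On this toy input the itemsets {a} and {b} pass the threshold, while {a, b} has a frequentness probability of 0.27 and is pruned.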

The experimental evaluation covers both synthetic benchmarks (varying numbers of items, transaction counts, and average item uncertainty) and real‑world datasets (sensor logs, click‑stream data) that naturally exhibit probabilistic presence. The proposed ProFP‑Growth is compared against state‑of‑the‑art uncertain mining algorithms such as UApriori, UH‑Mine, and a recent probabilistic FP‑Growth variant. Results show that ProFP‑Growth achieves 5× to 30× speedups across all settings, with the largest gains observed when the database is large, the item alphabet is extensive, or the average uncertainty is high. Memory consumption is also reduced by 40 %–70 %, thanks to the tree’s compression of shared prefixes and the avoidance of explicit candidate storage. Accuracy is not compromised: all methods return the same set of probabilistic frequent itemsets, but ProFP‑Growth provides exact support probabilities without approximation. Sensitivity analyses confirm that varying minSup or θ does not diminish the performance advantage.

In the discussion, the authors acknowledge that the current model assumes independence among items within a transaction. Extending the framework to capture correlations (e.g., using joint probability tables or copulas) is identified as a promising direction. They also suggest parallel and distributed implementations of the ProFP‑Tree construction and pattern growth to handle massive data streams, as well as incremental update mechanisms for real‑time mining scenarios.

In summary, the paper delivers what the authors present as the first candidate‑free algorithm for probabilistic frequent itemset mining, coupling a novel tree structure that faithfully encodes uncertainty with a generating‑function technique that computes support distributions in linear time. The combined ProFP‑Growth approach dramatically improves both runtime and memory efficiency while preserving exactness, establishing a new baseline for mining in uncertain databases.

📜 Original Paper Content