Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees
The tasks of extracting (top-$K$) Frequent Itemsets (FI’s) and Association Rules (AR’s) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High quality approximations of FI’s and AR’s are sufficient for most practical uses, and a number of recent works explored the application of sampling for fast discovery of approximate solutions to the problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of \emph{Vapnik-Chervonenkis (VC) dimension} to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-$K$) FI’s and AR’s. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a proof that the VC-dimension of this range space is upper bounded by an easy-to-compute characteristic quantity of the dataset which we call the \emph{d-index}: the maximum integer $d$ such that the dataset contains at least $d$ transactions of length at least $d$, none of which is a superset of, or equal to, another. We show that this bound is strict for a large class of datasets.
💡 Research Summary
The paper tackles the classic data‑mining tasks of extracting frequent itemsets (FIs) and association rules (ARs), which traditionally require multiple full scans of the dataset. While exact algorithms are well‑studied, they become prohibitively expensive on large, disk‑resident data. Recent works have explored sampling to obtain high‑quality approximations, but they suffer from overly conservative sample‑size bounds because they must guard against the exponential number of possible itemsets.
To overcome this, the authors introduce a novel application of Vapnik‑Chervonenkis (VC) dimension theory. They model the presence of each itemset in a transaction as a binary indicator function, and consider the collection of all such functions as a range space. A fundamental result in statistical learning theory states that, for a range space with VC‑dimension d, a random sample of size O((d + log (1/δ))/ε²) suffices to approximate the frequencies of all ranges within absolute error ε, with failure probability at most δ.
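Concretely, each itemset induces a range: the set of transactions that contain it. Its exact frequency is the relative measure of that range, and its sample frequency is just the fraction of sampled transactions falling in the range. A minimal sketch of this correspondence (function names are illustrative, not from the paper):

```python
def itemset_range(itemset, transactions):
    """Indices of the transactions containing `itemset` --
    the 'range' that the itemset induces in the range space."""
    s = set(itemset)
    return {i for i, t in enumerate(transactions) if s <= set(t)}

def frequency(itemset, transactions):
    """Exact frequency of `itemset`: the relative size of its range."""
    return len(itemset_range(itemset, transactions)) / len(transactions)
```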
The main theoretical contribution is the definition of the d‑index of a dataset: the largest integer d such that the dataset contains at least d distinct transactions of length at least d, none of which is a subset of another. The authors prove that the VC‑dimension of the range space associated with any dataset is upper‑bounded by its d‑index, and that this bound is tight for a broad class of datasets. Computing the exact d‑index requires multiple passes, but an efficiently computable upper bound can be obtained in a single linear scan using a greedy online algorithm.
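The one-pass upper bound can be sketched as a small streaming procedure: keep (in a min-heap) the lengths of the longest distinct transactions seen so far, and raise the current bound q whenever at least q+1 of them have length at least q+1. This is a hedged reconstruction of the greedy scan described above, not the paper's exact pseudocode; exact duplicates are filtered by hashing, while the subset condition is ignored, which is precisely why the result is only an upper bound on the d-index:

```python
import heapq

def d_index_upper_bound(transactions):
    """One-pass upper bound on the d-index: the largest q such that the
    stream contained at least q distinct transactions of length >= q.
    (The antichain/subset condition is not checked, hence 'upper bound'.)"""
    heap = []     # lengths (> q) of distinct transactions still useful
    seen = set()  # fingerprints of transactions, to skip exact duplicates
    q = 0
    for t in transactions:
        key = frozenset(t)
        if key in seen:
            continue
        seen.add(key)
        if len(key) > q:
            heapq.heappush(heap, len(key))
            if len(heap) > q:
                # q+1 distinct transactions of length >= q+1 have been seen
                q += 1
                # lengths <= q can no longer contribute to a future increase
                while heap and heap[0] <= q:
                    heapq.heappop(heap)
    return q
```

Keeping a fingerprint set makes the duplicate filter exact but memory-heavy; a production version would hash transactions into a compact structure instead.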
Building on this bound, the paper derives concrete sample‑size formulas for several mining problems, both for absolute and relative approximations:
- Frequent itemsets with a minimum support θ – absolute error: O((d + log (1/δ))/ε²); relative error: O((d · log (1/θ) + log (1/δ))/(ε²θ)).
- Top‑K frequent itemsets – analogous bounds that do not require advance knowledge of the unknown frequency f(K) of the K‑th most frequent itemset.
- Association rules with support θ and confidence γ – similar formulas, again linear in d rather than in the number of items |I| or the dataset size.
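These bounds translate into simple closed-form sample sizes. The sketch below plugs in the two FI formulas; the constant c stands for the universal constant of the underlying ε-approximation theorems, whose exact value is not fixed here (0.5 is an illustrative default, not a claim from the paper):

```python
from math import ceil, log

def fi_sample_size_absolute(d, eps, delta, c=0.5):
    """Sample size for an absolute eps-approximation of all itemset
    frequencies: O((d + log(1/delta)) / eps^2)."""
    return ceil((c / eps**2) * (d + log(1 / delta)))

def fi_sample_size_relative(d, theta, eps, delta, c=0.5):
    """Sample size for a relative eps-approximation of frequencies at or
    above the support threshold theta:
    O((d * log(1/theta) + log(1/delta)) / (eps^2 * theta))."""
    return ceil((c / (eps**2 * theta)) * (d * log(1 / theta) + log(1 / delta)))
```

Note that both expressions depend on the data only through d, so a dataset with short or highly redundant transactions gets a small sample regardless of how many items or transactions it contains.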
Table 1 in the paper compares these results with the best previously known bounds, showing consistent improvements: the dependence on |I| disappears, and the dependence on the maximum transaction length Δ is replaced by the usually much smaller d‑index.
The experimental evaluation uses a variety of real‑world datasets (retail, web logs, etc.). The authors sample according to the derived formulas, mine the sample with standard FI/AR algorithms, and then evaluate precision, recall, and runtime against the exact results. The findings confirm that the proposed sample sizes are dramatically smaller than those required by earlier Chernoff‑based analyses, yet they still achieve the prescribed ε‑δ guarantees. Moreover, the approach integrates smoothly into a MapReduce framework, where the sampling phase becomes a lightweight preprocessing step that substantially reduces overall execution time for large‑scale mining.
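The sample-then-mine workflow can be sketched end to end as follows. Naive level-wise enumeration stands in for a real FI miner such as Apriori or FP-growth, and the lowered threshold θ − ε/2 is the standard device for keeping every truly frequent itemset in the output with high probability (function name and parameters are illustrative):

```python
import random
from itertools import combinations

def approximate_frequent_itemsets(transactions, n, theta, eps, seed=0):
    """Mine approximate frequent itemsets from a uniform random sample.

    Draws n transactions with replacement, then reports every itemset
    whose sample frequency reaches theta - eps/2."""
    rng = random.Random(seed)
    sample = [set(transactions[rng.randrange(len(transactions))])
              for _ in range(n)]
    items = sorted({i for t in sample for i in t})
    result = {}
    for k in range(1, len(items) + 1):
        found_at_this_size = False
        for cand in combinations(items, k):
            s = set(cand)
            freq = sum(1 for t in sample if s <= t) / n
            if freq >= theta - eps / 2:
                result[cand] = freq
                found_at_this_size = True
        if not found_at_this_size:
            # anti-monotonicity: no larger itemset can be frequent
            break
    return result
```

Because only the sample is enumerated, the expensive mining step never touches the full dataset; this is the same division of labor the MapReduce integration exploits.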
In conclusion, the work provides the first rigorous characterization of the VC‑dimension of the range space induced by a transactional dataset and leverages this insight to obtain tight, data‑dependent sampling bounds for frequent itemset and association‑rule mining. By tying sample size directly to the intrinsic combinatorial structure of the data (the d‑index), the authors achieve both strong theoretical guarantees and practical efficiency, opening the door for similar VC‑dimension‑based sampling strategies in other data‑mining domains.