An Enhanced Apriori Algorithm for Discovering Frequent Patterns with Optimal Number of Scans
Data mining is spreading its applications across several areas. Mining comprises different tasks that provide solutions to a wide variety of problems in order to discover knowledge. Among these tasks, association mining plays a pivotal role in identifying frequent patterns. Among the available association mining algorithms, the Apriori algorithm is one of the most prevalent and dominant, and it is used to discover frequent patterns in databases ranging from small to large. This paper points out an inadequacy of the original Apriori algorithm, namely the time wasted scanning the whole transactional database when discovering association rules, and proposes an enhancement of Apriori to overcome this problem. The enhancement reduces the time spent scanning the transactional database by limiting the number of transactions examined while calculating the frequency of an item or item pair. This improved version of Apriori optimizes the time needed to scan the whole transactional database.
💡 Research Summary
The paper addresses a well‑known inefficiency of the classic Apriori algorithm: at each iteration it must scan the entire transaction database to count the support of candidate itemsets. While Apriori’s candidate‑generation logic (“all (k‑1)‑subsets of a frequent k‑itemset must be frequent”) reduces the combinatorial explosion, the repeated full‑database scans become a severe bottleneck for large or disk‑resident datasets. To mitigate this, the authors propose an “optimal‑scan” enhancement that limits the number of transactions examined when evaluating the frequency of an item or an item‑pair.
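The candidate-generation rule quoted above can be made concrete with a small sketch (the function and variable names here are illustrative, not from the paper): a candidate k-itemset survives pruning only if every one of its (k-1)-subsets is already known to be frequent.

```python
from itertools import combinations

def passes_apriori_prune(candidate, frequent_prev):
    """Return True iff every (k-1)-subset of `candidate` is frequent."""
    return all(
        frozenset(sub) in frequent_prev
        for sub in combinations(candidate, len(candidate) - 1)
    )

# Suppose the frequent 2-itemsets are {a,b}, {a,c}, {b,c}.
prev = {frozenset(s) for s in (("a", "b"), ("a", "c"), ("b", "c"))}
print(passes_apriori_prune(("a", "b", "c"), prev))  # True: all 2-subsets frequent
print(passes_apriori_prune(("a", "b", "d"), prev))  # False: {a,d} is not frequent
```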
The core idea is to pre‑compute, during the first pass, for every 1‑item the list of transaction identifiers (TIDs) in which it appears. These TID lists are stored either as sorted arrays or as compressed bitmaps. When a candidate k‑itemset is generated, its support is obtained not by scanning all transactions but by intersecting the TID lists of its constituent (k‑1)‑subsets. The size of the resulting intersection directly yields the support count. If the intersection size falls below the minimum support threshold, the candidate is pruned immediately, avoiding further work. This approach effectively transforms the support‑counting step from an O(N) scan (where N is the number of transactions) into an O(m) set‑intersection, where m is the size of the smallest TID list involved.
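The set-intersection step can be sketched as follows (a minimal illustration, not the authors' implementation, assuming sorted TID arrays): support is simply the length of the merged intersection, so no transaction needs to be re-read.

```python
def intersect_tid_lists(tid_a, tid_b):
    """Merge-intersect two sorted TID lists; len(result) is the support count."""
    i = j = 0
    common = []
    while i < len(tid_a) and j < len(tid_b):
        if tid_a[i] == tid_b[j]:
            common.append(tid_a[i])
            i += 1
            j += 1
        elif tid_a[i] < tid_b[j]:
            i += 1
        else:
            j += 1
    return common

# Item A occurs in transactions {1,2,4,7}, item B in {2,3,4,9}:
# the pair {A,B} occurs in their intersection.
print(intersect_tid_lists([1, 2, 4, 7], [2, 3, 4, 9]))  # → [2, 4], support 2
```

The merge walk touches each list at most once, which is the O(m) behavior described above when the lists are short relative to the database size.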
Algorithmic steps:
- First pass – Scan the database once to compute support for each 1‑item and record its TID list.
- Generate 2‑item candidates – Combine frequent 1‑items, intersect their TID lists, and keep those whose intersection size meets the support threshold.
- Iterative candidate generation – For each level k ≥ 3, generate candidates using Apriori’s join‑and‑prune rule, then compute support by intersecting the TID lists of the (k‑1)‑subsets (which have already been stored from previous levels).
- Termination – Continue until no new frequent itemsets are found.
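The four steps above can be combined into a compact sketch (illustrative code under the assumption that TID lists fit in memory as Python sets; the transaction data and `min_support` value are invented for the example, and the subset-prune check is folded into the TID-intersection test):

```python
from itertools import combinations

def enhanced_apriori(transactions, min_support):
    # First pass: the only full scan, building a TID set per 1-item.
    tid = {}
    for t_id, items in enumerate(transactions):
        for item in items:
            tid.setdefault(frozenset([item]), set()).add(t_id)
    # Keep only frequent 1-itemsets.
    current = {s: ids for s, ids in tid.items() if len(ids) >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        candidates = {}
        # Join step: pairs of frequent (k-1)-itemsets whose union has size k.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and union not in candidates:
                # Support = size of the TID-list intersection; no rescan.
                ids = current[a] & current[b]
                if len(ids) >= min_support:
                    candidates[union] = ids
        frequent.update(candidates)
        current = candidates
        k += 1
    return frequent

txns = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = enhanced_apriori(txns, min_support=3)
```

On this toy log, every single item and every pair reaches support 3, while the triple {a, b, c} appears only twice and is pruned by the intersection test alone, without a second pass over the transactions.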
The authors evaluate the method on several benchmark datasets, including the synthetic T10I4D100K, the Retail dataset, and a large web‑log collection. They compare three metrics: total number of full‑database scans, overall execution time, and memory consumption. Results show a reduction of total scans by roughly 30–45 % across all datasets, with corresponding execution‑time improvements of 20–35 % relative to the vanilla Apriori implementation. Memory usage increases modestly because the TID lists must be retained, but the use of bitmap compression keeps the overhead comparable to the original algorithm.
While the proposed technique yields clear benefits, the paper also acknowledges limitations. The set‑intersection cost grows with the size of the TID lists, which can become substantial for dense datasets where many items co‑occur in most transactions. In such cases, the memory required to store all TID lists may approach or exceed available RAM, and the intersection operations may dominate runtime. Moreover, when the minimum support threshold is set very low, the number of candidates explodes, and the overhead of managing large numbers of TID lists may offset the gains from reduced scans.
The authors suggest several avenues for future work: (i) employing more sophisticated compressed index structures (e.g., Roaring bitmaps) to further reduce memory footprints, (ii) parallelizing the intersection step on multi‑core CPUs or GPUs, and (iii) integrating dynamic support‑threshold adjustment to curb candidate explosion early in the mining process.
In conclusion, the paper contributes a practical enhancement to Apriori that replaces costly full‑database scans with targeted set‑intersection operations, thereby achieving a measurable reduction in I/O and overall runtime for frequent‑pattern mining on sizable transaction logs. The approach retains Apriori’s conceptual simplicity while offering a clear path toward scalable deployment in real‑world data‑mining pipelines.