Performance Optimization of MapReduce-based Apriori Algorithm on Hadoop Cluster
Many techniques have been proposed to implement the Apriori algorithm on the MapReduce framework, but only a few have focused on performance improvement. The FPC (Fixed Passes Combined-counting) and DPC (Dynamic Passes Combined-counting) algorithms combine multiple passes of Apriori in a single MapReduce phase to reduce execution time. In this paper, we propose improved MapReduce-based Apriori algorithms, VFPC (Variable Size based Fixed Passes Combined-counting) and ETDPC (Elapsed Time based Dynamic Passes Combined-counting), over FPC and DPC. Further, we optimize the multi-pass phases of these algorithms by skipping the pruning step in some passes, yielding the Optimized-VFPC and Optimized-ETDPC algorithms. Quantitative analysis reveals that the cost of counting the additional un-pruned candidates produced by skipped pruning is outweighed by the reduction in computation cost it brings. Experimental results show that VFPC and ETDPC are more robust and flexible than FPC and DPC, whereas their optimized versions are more efficient in terms of execution time.
💡 Research Summary
The paper addresses the well‑known inefficiency of the Apriori algorithm when implemented on the Hadoop MapReduce platform, where each iteration traditionally requires a separate MapReduce job, leading to high I/O and scheduling overhead. Existing attempts to mitigate this problem—Fixed Passes Combined‑counting (FPC) and Dynamic Passes Combined‑counting (DPC)—reduce the number of jobs by merging several Apriori passes into a single MapReduce phase. However, FPC’s fixed‑pass strategy lacks adaptability to varying data sizes and support thresholds, while DPC’s time‑based merging can still produce imbalanced workloads because it relies only on the current phase’s execution time.
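The core idea behind pass-combining can be illustrated with a minimal sketch (not the paper's actual Hadoop code): instead of launching one MapReduce job per candidate length, a single scan of the transactions counts candidates of several lengths at once, which is what a combined map phase does. All names and data here are illustrative.

```python
def count_combined_passes(transactions, candidate_sets):
    """Count candidates of several lengths in ONE scan of the data,
    mimicking how FPC/DPC fold multiple Apriori passes into a single
    MapReduce phase (one 'map' over each transaction).

    candidate_sets: list of candidate lists, e.g. [C2, C3, C4],
    each candidate being a sorted tuple of items."""
    counts = {}
    for t in transactions:
        items = set(t)
        # A conventional implementation would scan the data once per
        # candidate length; here every length is checked in the same pass.
        for cands in candidate_sets:
            for c in cands:
                if set(c).issubset(items):
                    counts[c] = counts.get(c, 0) + 1
    return counts

# Hypothetical toy data: candidates of lengths 2 and 3 counted together.
transactions = [("a", "b", "c"), ("a", "b"), ("b", "c")]
C2 = [("a", "b"), ("b", "c")]
C3 = [("a", "b", "c")]
combined_counts = count_combined_passes(transactions, [C2, C3])
```

One scan replaces two job launches, which is exactly the I/O and scheduling saving the combined-counting family targets.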
To overcome these limitations, the authors propose two novel algorithms: Variable‑Size based Fixed Passes Combined‑counting (VFPC) and Elapsed‑Time based Dynamic Passes Combined‑counting (ETDPC). VFPC monitors the size of the candidate set generated in each pass; when the candidate count exceeds a predefined threshold, a new MapReduce job is launched. This dynamic control prevents candidate explosion, keeps memory consumption in check, and avoids the inefficiency of overly large map tasks. ETDPC, on the other hand, measures the actual elapsed time of a pass and stops merging further passes once a target time limit (e.g., five minutes) is reached. By reacting to real‑time cluster load and network conditions, ETDPC achieves better load balancing and reduces total runtime.
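The two stopping criteria can be sketched as simple decision rules. This is a schematic reading of the summary above, not the authors' implementation; the cap of 100,000 candidates and the 300-second budget are assumed tuning values for illustration only.

```python
def plan_phase_vfpc(candidate_counts, max_candidates=100_000):
    """VFPC-style planning (sketch): greedily fold consecutive passes
    into one MapReduce phase while the combined candidate count stays
    under a cap, preventing candidate explosion in a single phase.

    candidate_counts: list of (pass_number, num_candidates) pairs.
    Returns the pass numbers merged into the next phase."""
    combined, total = [], 0
    for k, n in candidate_counts:
        if combined and total + n > max_candidates:
            break  # next phase starts here: a new job would be launched
        combined.append(k)
        total += n
    return combined

def should_merge_next_pass_etdpc(elapsed_seconds, time_budget=300.0):
    """ETDPC-style check (sketch): keep merging passes only while the
    phase's measured elapsed time is under a target budget, so the
    decision reacts to actual cluster load rather than a fixed count."""
    return elapsed_seconds < time_budget
```

For example, with candidate counts of 40,000, 50,000, and 30,000 for passes 3-5, `plan_phase_vfpc` merges passes 3 and 4 (90,000 candidates) and defers pass 5 to the next job.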
Both algorithms share a data‑driven decision mechanism for when to split passes, contrasting with the static heuristics of prior work. In addition, the paper introduces an optimization layer—Optimized‑VFPC and Optimized‑ETDPC—where pruning (the step that discards infrequent candidates) is deliberately omitted in selected passes. While pruning reduces the candidate space, it incurs substantial CPU and I/O costs each time it is executed. The authors demonstrate analytically and experimentally that the extra candidates generated by skipping pruning are negligible compared with the savings in computation and I/O, resulting in net performance gains.
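The pruning trade-off is easy to see in a sketch of Apriori's candidate-generation step. The join step below is standard Apriori; the `prune` flag marks the subset-check that Optimized-VFPC/ETDPC skip in selected passes. Skipping it yields a superset of candidates but avoids the subset checks entirely; the example data is hypothetical.

```python
from itertools import combinations

def apriori_gen(frequent_k, prune=True):
    """Generate (k+1)-candidates from frequent k-itemsets (sorted tuples).
    With prune=True, candidates containing an infrequent k-subset are
    discarded (classic Apriori). With prune=False, that CPU-heavy check
    is skipped, as in the Optimized-* variants: more candidates survive
    to the counting phase, but the generation step is much cheaper."""
    frequent = set(frequent_k)
    k = len(next(iter(frequent)))
    candidates = set()
    for a in frequent:                      # join step: union pairs of
        for b in frequent:                  # frequent k-itemsets
            union = tuple(sorted(set(a) | set(b)))
            if len(union) == k + 1:
                candidates.add(union)
    if prune:                               # prune step (optional here)
        candidates = {c for c in candidates
                      if all(tuple(sorted(s)) in frequent
                             for s in combinations(c, k))}
    return candidates

# Hypothetical frequent 2-itemsets:
F2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
with_prune = apriori_gen(F2, prune=True)      # only ('a','b','c') survives
without_prune = apriori_gen(F2, prune=False)  # two extra candidates remain
```

Here skipping the prune step leaves two extra candidates to be counted and rejected later, which matches the paper's argument that the extra counting cost is small relative to the pruning work saved.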
The experimental evaluation uses real transaction logs and synthetic datasets across a range of minimum support values and data volumes. Results show that VFPC and ETDPC achieve 25%–40% lower execution times than FPC/DPC, and their optimized variants provide an additional 10%–15% speedup. The benefits become more pronounced as the dataset grows and the support threshold decreases, confirming the scalability of the variable-size and elapsed-time strategies. Memory consumption and network traffic are also reduced, and the algorithms exhibit near-linear scaling when the number of Hadoop nodes is increased.
In summary, the study contributes a flexible, performance‑oriented framework for Apriori on MapReduce by (1) introducing adaptive pass‑combining criteria based on candidate size or elapsed time, and (2) selectively bypassing pruning to lower per‑iteration overhead. These innovations collectively deliver a more robust and efficient solution for large‑scale frequent itemset mining, offering practical value for practitioners dealing with massive transactional data in distributed environments.