Map/Reduce Design and Implementation of Apriori Algorithm for handling voluminous data-sets
Apriori is one of the key algorithms for generating frequent itemsets. Mining frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items, and it serves as an elementary foundation for supervised learning, encompassing classification and feature-extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most structured data in scientific domains is voluminous, and processing it requires state-of-the-art computing machines; setting up such an infrastructure is expensive. Hence a distributed environment, such as a clustered setup, is employed for tackling such scenarios. Apache Hadoop is one such cluster framework for distributed environments: it distributes voluminous data across the nodes of the cluster. This paper focuses on the map/reduce design and implementation of the Apriori algorithm for structured data analysis.
💡 Research Summary
The paper addresses the challenge of applying the classic Apriori algorithm to massive structured datasets that are common in scientific and industrial domains. Traditional implementations of Apriori are memory‑intensive and become impractical when the number of transactions grows to the scale of gigabytes or terabytes. To overcome these limitations, the authors redesign the algorithm to run on Apache Hadoop’s MapReduce framework, thereby exploiting distributed storage and parallel computation across a cluster of commodity machines.
The authors begin by reviewing related work on parallel and distributed Apriori implementations, noting that earlier approaches based on MPI or early Hadoop versions suffered from excessive network traffic during candidate generation and insufficient use of combiners, leading to I/O bottlenecks. They then present a systematic mapping of each logical step of Apriori—candidate generation, support counting, and pruning—onto the Map, Combine, and Reduce phases of MapReduce. In the first Map phase, each mapper reads a block of transactions from HDFS, emits (item, 1) pairs for every item, and locally aggregates counts. The Reduce phase performs a global sum to compute the support of all 1‑itemsets, after which items meeting the minimum support threshold become the frequent 1‑itemset F₁.
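The first pass described above maps naturally onto the MapReduce model. A minimal Python sketch of that pass is shown below; the paper's actual implementation presumably uses Hadoop's Java API, and the transaction data, item names, and support threshold here are illustrative assumptions:

```python
from collections import Counter

MIN_SUPPORT = 2  # assumed absolute support threshold


def map_phase(transactions):
    """Mapper: emit (item, 1) for every item in the block, with local
    aggregation (the in-mapper equivalent of a Combiner)."""
    local = Counter()
    for txn in transactions:
        for item in txn:
            local[item] += 1
    return list(local.items())


def reduce_phase(mapper_outputs):
    """Reducer: globally sum the partial counts, then keep only items
    meeting the minimum support threshold."""
    totals = Counter()
    for partial in mapper_outputs:
        for item, count in partial:
            totals[item] += count
    return {item: c for item, c in totals.items() if c >= MIN_SUPPORT}


# Two "blocks" of transactions, as two mappers would see them on HDFS.
block1 = [["bread", "milk"], ["bread", "butter"]]
block2 = [["milk", "butter"], ["bread", "milk", "butter"]]
f1 = reduce_phase([map_phase(block1), map_phase(block2)])
```

Here `f1` plays the role of the frequent 1-itemset F₁: every item appearing in at least `MIN_SUPPORT` transactions across all blocks survives the threshold.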
For subsequent iterations, the algorithm uses the frequent (k‑1)‑itemsets Fₖ₋₁ to generate candidate k‑itemsets Cₖ via a self‑join operation performed locally within each mapper. Each mapper then scans its transaction block, checks which candidates are contained in each transaction, and emits (candidate, 1) for those matches. A Combiner aggregates these local counts before they are shuffled across the network, dramatically reducing the volume of data transferred. The Reduce phase aggregates the global counts for each candidate, applies the support threshold, and produces the frequent k‑itemset Fₖ. The process repeats until no new frequent itemsets are found.
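The per-iteration self-join, pruning, and containment check can be sketched as follows. This is a simplified single-process illustration of the logic each mapper would run on its block, not the authors' code; the sample transactions are invented:

```python
from collections import Counter
from itertools import combinations


def gen_candidates(f_prev, k):
    """Self-join frequent (k-1)-itemsets sharing their first k-2 items,
    then prune any candidate with an infrequent (k-1)-subset."""
    prev = set(f_prev)
    items = sorted(prev)
    cands = set()
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i], items[j]
            if a[:k - 2] == b[:k - 2]:  # join condition: shared prefix
                c = tuple(sorted(set(a) | set(b)))
                if len(c) == k and all(s in prev for s in combinations(c, k - 1)):
                    cands.add(c)        # survives Apriori pruning
    return cands


def count_support(candidates, transactions):
    """Mapper-side containment check: emit (candidate, 1) per match."""
    counts = Counter()
    for txn in transactions:
        t = set(txn)
        for c in candidates:
            if t.issuperset(c):
                counts[c] += 1
    return counts


f1 = [("bread",), ("butter",), ("milk",)]
c2 = gen_candidates(f1, 2)
txns = [["bread", "milk"], ["bread", "butter"],
        ["milk", "butter"], ["bread", "milk", "butter"]]
f2 = {c for c, n in count_support(c2, txns).items() if n >= 2}
```

In the distributed version, the `count_support` output is what the Combiner aggregates locally before the shuffle, and the Reducer applies the threshold globally to produce Fₖ.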
Implementation details include careful configuration of HDFS block size to align with mapper input splits, exploitation of data locality so that each node processes the data it stores, and dynamic partitioning of the Reduce workload to avoid skew. The system also leverages Hadoop’s fault‑tolerance mechanisms, automatically retrying failed tasks without user intervention.
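A Hadoop Streaming invocation along these lines would wire such a pass together; the script names, paths, and split-size value are illustrative assumptions, not details from the paper:

```shell
# Align the minimum input split with the HDFS block size (128 MB here)
# so each mapper reads one locally stored block, preserving data locality.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.input.fileinputformat.split.minsize=134217728 \
  -D mapreduce.job.reduces=16 \
  -input    /data/transactions \
  -output   /out/apriori-pass-k \
  -mapper   apriori_mapper.py \
  -combiner apriori_combiner.py \
  -reducer  apriori_reducer.py \
  -file apriori_mapper.py -file apriori_combiner.py -file apriori_reducer.py
```

Failed map or reduce attempts are re-executed automatically by the framework, which is the fault-tolerance behaviour the paragraph above refers to.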
Experimental evaluation is conducted on synthetic and real‑world datasets of 10 GB, 50 GB, and 200 GB, using Hadoop clusters of 4, 8, and 16 nodes. The results demonstrate a substantial speedup compared with a baseline single‑machine Apriori implementation: processing times are reduced by factors ranging from 8× to 12× as data size grows. Moreover, scaling the cluster roughly halves the execution time, indicating near‑linear scalability. The use of combiners cuts network traffic by an average of 65 % and reduces overall job completion time by more than 20 %.
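The scalability claim can be checked with standard speedup and parallel-efficiency arithmetic. The timings below are made-up illustrative numbers showing the ideal "doubling nodes halves runtime" case, not the paper's measurements:

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial one."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, nodes):
    """Speedup per node; 1.0 means perfectly linear scaling."""
    return speedup(t_serial, t_parallel) / nodes


t_serial = 960.0          # hypothetical minutes on one machine
t_8, t_16 = 120.0, 60.0   # hypothetical minutes on 8 and 16 nodes
s8 = speedup(t_serial, t_8)
s16 = speedup(t_serial, t_16)
```

Under these numbers the 16-node run achieves an efficiency of 1.0; the paper's reported 8x to 12x speedups at growing data sizes correspond to efficiencies somewhat below this ideal.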
The discussion acknowledges that while the MapReduce redesign alleviates memory and I/O constraints, the candidate explosion problem remains a concern for very low support thresholds. The authors propose future work such as candidate compression techniques, Bloom‑filter based pre‑filtering, and hybrid execution models that combine Hadoop’s disk‑based processing with in‑memory frameworks like Apache Spark.
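The Bloom-filter pre-filtering mentioned as future work could look roughly like this: a mapper tests each candidate against a compact filter before emitting it, trading a small false-positive rate for reduced shuffle traffic. This is a sketch of the general idea, not the authors' design, and the sizing parameters are arbitrary:

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        # Derive `hashes` independent bit positions from SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[p] for p in self._positions(key))


bf = BloomFilter()
for itemset in [("bread", "milk"), ("milk", "butter")]:
    bf.add(itemset)
present = bf.might_contain(("bread", "milk"))
```

Because a Bloom filter never produces false negatives, pre-filtering with it cannot discard a genuinely frequent candidate, only admit a few infrequent ones that the Reducer's threshold later removes.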
In conclusion, the paper delivers a practical, scalable implementation of Apriori on Hadoop, showing that frequent itemset mining can be performed efficiently on voluminous datasets without the need for expensive high‑performance hardware. The design principles—local candidate generation, aggressive use of combiners, and data‑local processing—are presented as a blueprint that can be adapted to other data‑mining algorithms requiring iterative, distributed computation.