A novel approach for fast mining frequent itemsets use N-list structure based on MapReduce

📝 Abstract

Frequent pattern mining is one of the most significant topics in data mining. In recent years, many algorithms have been proposed for mining frequent itemsets. One of them, the PrePost algorithm, mines frequent itemsets using the N-list data structure. The PrePost algorithm is enhanced by implementing a compact PPC-tree with a general tree. PrePost can find frequent itemsets only with the required pre-order and post-order codes of each node. In this chapter, we improve the PrePost algorithm on the Hadoop platform (HPrepost) using the MapReduce programming model. The main goal of the proposed method is to mine frequent itemsets efficiently, with less running time and memory usage. We conducted experiments to compare the proposed scheme with other algorithms. On dense datasets, which have a large average transaction length, HPrepost is more effective than the other frequent itemset algorithms in terms of execution time and memory usage for all minimum support (min-sup) values. In general, our algorithm outperforms the compared algorithms in runtime and memory usage for small thresholds and large datasets.

📄 Content

A novel approach for fast mining frequent itemsets use N-list structure based on MapReduce

ARKAN A. G. AL-HAMODI1*, SONGFENG LU2
1 Research scholar, School of computer science, Huazhong University of Science and Technology, Wuhan 430074, PRC
2 Associate professor, School of computer science, Huazhong University of Science and Technology, Wuhan 430074, PRC
E-mail: arkan_almalky@yahoo.com, lusongfeng@hust.edu.cn

Keywords: Data mining, Frequent itemsets, N-list, MapReduce

1 Introduction

Frequent pattern mining is one of the most important and popular research areas in association rule mining and data mining [1,2], and finding frequent itemsets has become a hot topic. Most of the proposed frequent itemset algorithms can be grouped into Apriori-based methods and FP-growth-based methods. The Apriori method scans the database repeatedly to find frequent itemsets and generates a large set of candidates [3]. The FP-growth method scans the database twice and mines frequent itemsets without generating candidates [4]. FP-growth uses the FP-tree data structure to store the database and employs a divide-and-conquer strategy to find frequent itemsets, which is much more efficient than the Apriori method.
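To make the cost of the Apriori strategy concrete, the following is a minimal Python sketch of level-wise candidate generation. It is an illustration only, not the implementation evaluated in this chapter; the function name `apriori`, the absolute threshold `min_count`, and the returned scan counter are assumptions introduced here.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Level-wise Apriori sketch: one full database scan per itemset size.

    Returns (frequent itemsets with their counts, number of database scans).
    `min_count` is an absolute support threshold (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Scan 1: count every single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    scans = 1
    k = 2

    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets and keep a
        # candidate only if all of its (k-1)-subsets are frequent.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        if not candidates:
            break

        # Another full scan of the database to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        scans += 1

        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1

    return result, scans
```

Each pass of the loop corresponds to one additional scan of the database, which is the repeated scanning that FP-growth avoids by compressing the database into a tree after two scans.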
For frequent itemset mining, Deng et al. [5,6] proposed two data structures, Node-list and N-list, to reduce mining time and memory usage. Both are based on a prefix tree with encoded nodes, the PPC-tree, and both consume memory because every node must be encoded with its pre-order and post-order. The algorithm based on the N-list is called PrePost. In this chapter, we present a new method, the HPrepost algorithm, which is based on the PPC-tree under the MapReduce framework on the Hadoop platform and aims to mine frequent itemsets more efficiently while reducing running time and memory usage.
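The pre-order and post-order encoding can be illustrated with a small single-machine Python sketch. It assumes the usual description of the PPC-tree and N-list in [6]: items in each transaction are sorted by descending support, each node stores a (pre-order, post-order, count) code, and node X is an ancestor of node Y exactly when X.pre < Y.pre and X.post > Y.post. All names (`Node`, `build_ppc_tree`, `item_nlists`, `intersect`, `support`) are hypothetical, the intersection is a simple quadratic loop rather than an optimized merge, and this is not the HPrepost implementation.

```python
from collections import Counter
from itertools import count


class Node:
    """PPC-tree node: item, count, children, and pre/post-order codes."""

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}
        self.pre = self.post = -1


def build_ppc_tree(transactions, min_count):
    # First scan: item supports; keep frequent items, ranked by descending
    # support (ties broken by item name to keep the sketch deterministic).
    freq = Counter(item for t in transactions for item in set(t))
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    order = {item: r for r, (item, c) in enumerate(ranked) if c >= min_count}

    # Second scan: insert each filtered, sorted transaction into the tree.
    root = Node(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in order), key=order.get):
            node = node.children.setdefault(item, Node(item))
            node.count += 1

    # One traversal assigns the pre-order and post-order codes.
    pre, post = count(), count()

    def visit(node):
        node.pre = next(pre)
        for child in node.children.values():
            visit(child)
        node.post = next(post)

    visit(root)
    return root


def item_nlists(root):
    """N-list of each frequent item: its (pre, post, count) codes by pre-order."""
    nlists = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.item is not None:
            nlists.setdefault(node.item, []).append((node.pre, node.post, node.count))
        stack.extend(node.children.values())
    for codes in nlists.values():
        codes.sort()
    return nlists


def intersect(ancestor_nlist, descendant_nlist):
    """N-list of a 2-itemset: keep the ancestor's code, sum descendant counts."""
    merged = {}
    for apre, apost, _ in ancestor_nlist:
        for dpre, dpost, dcount in descendant_nlist:
            if apre < dpre and apost > dpost:  # ancestor-descendant test
                merged[(apre, apost)] = merged.get((apre, apost), 0) + dcount
    return [(p, q, c) for (p, q), c in sorted(merged.items())]


def support(nlist):
    """Support of an itemset is the sum of the counts in its N-list."""
    return sum(c for _, _, c in nlist)
```

In this sketch the item that is more frequent, and therefore closer to the root, is passed as `ancestor_nlist`. The support of a 2-itemset is obtained from the codes alone, without rescanning the transactions, which is the property that N-list based mining relies on.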

2 Related work

Previously proposed algorithms for mining frequent itemsets fall into three groups: candidate generation, frequent pattern growth, and hybrid approaches. In recent years, three kinds of structure have been proposed for finding frequent itemsets efficiently. The Node-list structure, proposed by Deng et al. [5], is based on the PPC-tree (Pre-order Post-order Code tree). The N-list structure, proposed by Deng et al. [6], requires each node of the PPC-tree to be encoded with its pre-order and post-order. Both structures are based on the prefix tree called the PPC-tree, and both are memory consuming because every node must be encoded with its pre-order and post-order. Tuong Le and Bay Vo [7] proposed NAFCP, a mining algorithm based on the N-list. Bay Vo et al. [8] presented NSFI, an enhanced N-list and Subsume-based algorithm for mining Frequent Itemsets, which uses a hash table to improve the creation of the N-lists associated with 1-itemsets, together with an enhanced N-list intersection algorithm. The new algorithm is more effective at reducing memory usage and mining time. Huynh et al. [9] presented an improved version of NTK, an algorithm for mining top-rank-k frequent patterns. Vo et al. [10] proposed a hybrid algorithm based on PrePost, an improved PrePost algorithm that uses a hash table to enhance the creation of the N-list data structure.

The MapReduce programming framework is a well-known technique for processing massive amounts of data [11,12]. Liao et al. [13] presented MRPrepost, a parallel algorithm adapted for mining big data on the Hadoop platform under MapReduce. The algorithm employs the N-list data structure and improves PrePost by adding a prefix pattern. Thakare et al. [14] proposed an improved PrePost algorithm on the Hadoop platform. The

This content is AI-processed based on ArXiv data.
