Data Partitioning View of Mining Big Data
📝 Abstract
There are two main approaches to mining big data in memory. One is to partition a big dataset into several subsets so that each subset can be mined in memory; global patterns are then obtained by synthesizing the local patterns discovered from these subsets. The other is statistical sampling. This indicates that data partitioning should be an important strategy for mining big data. This paper recalls our work on mining big data with data partitioning and shows some interesting findings among the local patterns discovered from subsets of a dataset.
📄 Content
Data Partitioning View of Mining Big Data
Shichao Zhang
Guangxi Key Lab of Multi-source Information Mining & Security
College of Computer Science and Information Technology
Guangxi Normal University, Guilin, 541004, PR China
zhangsc@mailbox.gxnu.edu.cn
Key Words: Big data; big data mining; data partitioning; statistical sampling
1. Introduction
Big data became a hot research area after Nature, one of the top-tier journals, published a special issue on big data titled “Science in the petabyte (PB) era” [Nature 2008]. Before that, many terms standing for “big” had appeared in papers and articles, such as large (scale), huge, very large, massive and vast. It seems that “big” is a name that is easy to define, accept, understand and propagate. These reports also indicate that big data mining has been widely studied ever since data mining was proposed as a research field. In fact, I had firsthand experience of how difficult it is to identify frequent patterns in a large-scale dataset in memory when I worked at the National University of Singapore in 1998. To attack this issue, we proposed a solution for mining large-scale datasets based on data partitioning [Zhang and Wu 2001]. The main idea is sketched in Figure 1.
Figure 1. Mining a big dataset based on data partitioning
To identify interesting patterns in a big dataset, our approach is as follows. The big dataset is first partitioned into several subsets, taking into account the memory size of the computers. Each subset is then mined in memory. The patterns discovered in this way are referred to as local patterns. Finally, all local patterns are fused to generate global patterns, which are output as the results of mining the big dataset. In practice, this is only an approximate solution to mining the big dataset, although data partitioning does give us an efficient approach. A recent result proves mathematically that mining big data based on data partitioning is feasible [Xu, Zhang, Li 2015]. Therefore, I think data partitioning should be an important research direction in mining big data. In particular, data partitioning brings us some interesting findings among the local patterns discovered from subsets of a dataset, which cannot be learnt with traditional centralized-style mining methods.
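The three steps above (partition, mine locally, fuse) can be sketched roughly as follows. This is only an illustrative sketch, not the synthesis method of the cited papers: the function names, the brute-force local miner, the toy transactions and the size-weighted averaging of local supports are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def partition(transactions, chunk_size):
    """Split the dataset into consecutive subsets of at most chunk_size transactions."""
    return [transactions[i:i + chunk_size]
            for i in range(0, len(transactions), chunk_size)]


def mine_local(subset, min_support, max_len=2):
    """Return {itemset: support} for itemsets frequent within one subset (brute force)."""
    counts = Counter()
    for transaction in subset:
        items = sorted(set(transaction))
        for size in range(1, max_len + 1):
            counts.update(combinations(items, size))
    n = len(subset)
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_support}


def synthesize(local_results, subset_sizes, min_support):
    """Fuse local patterns using a size-weighted average of local supports.

    Assumption for this sketch: a pattern that is not locally frequent in a
    subset contributes 0 there, so the fused value never exceeds the true
    support. This is one reason the result is only approximate.
    """
    total = sum(subset_sizes)
    estimate = Counter()
    for patterns, size in zip(local_results, subset_sizes):
        for itemset, support in patterns.items():
            estimate[itemset] += support * size / total
    return {itemset: s for itemset, s in estimate.items() if s >= min_support}


# Hypothetical toy data: eight transactions, mined in two subsets of four.
transactions = [
    ["a", "b"], ["a", "b", "c"], ["a", "c"], ["a", "b"],
    ["a", "b"], ["b", "c"], ["a", "b", "d"], ["a", "d"],
]
subsets = partition(transactions, chunk_size=4)
local_results = [mine_local(s, min_support=0.5) for s in subsets]
global_patterns = synthesize(local_results, [len(s) for s in subsets], min_support=0.5)
```

In this toy run, itemsets such as ("a", "c") and ("a", "d") are frequent in only one subset each and drop out at the fusion step, while ("a", "b") survives as a global pattern.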
2. What happens after mining segments of big data
After splitting a big dataset and mining its subsets segment by segment, many interesting patterns are found hidden in these subsets; they are referred to as local patterns in this paper. Local patterns that occur in many segments can be synthesized into global patterns, e.g., “Pattern A” in Figure 2, where minsupport = 0.5 for all data subsets. They are very close to the frequent patterns that are discovered directly from the big dataset. However, most local patterns cannot be discovered from the big dataset with traditional centralized-style data mining methods. Some local patterns have high support but are identified in only a few segments, like “Pattern B” in Figure 2. They could be called subspace patterns in general, exceptional patterns for outlier detection, or burst patterns for mining historical big data. Other patterns look like trend patterns for dynamic data mining, e.g., “Pattern C” in Figure 2.
Figure 2. Patterns hidden in data subsets of a big dataset
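As a rough illustration of the three kinds of pattern described above, the sketch below labels a pattern from its per-segment supports. Only minsupport = 0.5 comes from the text; the function name, the other thresholds (`vote_ratio`, `high_support`), the strict-monotonicity test for trends and the support values are hypothetical and are not read off Figure 2.

```python
def classify_pattern(supports, min_support=0.5, vote_ratio=0.5, high_support=0.8):
    """Label a pattern from its per-segment supports (segments in time order).

    Every threshold except min_support = 0.5 is an illustrative assumption.
    """
    n = len(supports)
    frequent_in = sum(s >= min_support for s in supports)
    pairs = list(zip(supports, supports[1:]))
    strictly_monotonic = n > 2 and (
        all(a < b for a, b in pairs) or all(a > b for a, b in pairs)
    )
    if strictly_monotonic:
        return "trend"        # support keeps rising or falling across segments
    if frequent_in / n > vote_ratio:
        return "global"       # frequent in most segments
    if frequent_in > 0 and max(supports) >= high_support:
        return "exceptional"  # high support, but only in a few segments
    return "infrequent"


# Hypothetical supports over four segments (not taken from the paper).
labels = {
    "Pattern A": classify_pattern([0.60, 0.55, 0.70, 0.65]),
    "Pattern B": classify_pattern([0.10, 0.90, 0.05, 0.10]),
    "Pattern C": classify_pattern([0.10, 0.30, 0.50, 0.70]),
}
```

A centralized miner would see only the overall support of each pattern, so a case like the hypothetical “Pattern B” (support 0.9 in one segment, near zero elsewhere) would fall below minsupport and go unreported.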
The above observations show what happens after mining segments of big data. There are indeed some interesting local patterns hidden in segments that cannot be discovered with traditional centralized-style mining methods. Do these interesting local patterns make sense in applications? In about 2003, I introduced my findings on mining big data to a manager of stock data processing at the UTS, Australia. The manager told me that these interesting local patterns are much more significant than traditional frequent patterns in real applications, because it is difficult for people in the industrial community to obtain and master them.
With these findings from our data partitioning view, we started some significant research on mining big data. Main studies include approximate frequent patterns in [Zhang, Zhang and Webb 2003; Zhang and Zhang 2002], dynamic data mining in [Zhang, Zhang and Yan 2003; Zhang, Zhang and Zha