Mining Compressed Repetitive Gapped Sequential Patterns Efficiently

📝 Original Info

  • Title: Mining Compressed Repetitive Gapped Sequential Patterns Efficiently
  • ArXiv ID: 0906.0885
  • Date: 2009-06-04
  • Authors: Yongxin Tong, Li Zhao, Dan Yu, Shilong Ma, Ke Xu

📝 Abstract

Mining frequent sequential patterns from sequence databases has been a central research topic in data mining, and various efficient sequential pattern mining algorithms have been proposed and studied. Recently, in many problem domains (e.g., program execution traces), a novel line of sequential pattern mining research, called mining repetitive gapped sequential patterns, has attracted the attention of many researchers. Because it considers not only the repetition of a sequential pattern across different sequences but also its repetition within a sequence, it is more meaningful than general sequential pattern mining, which only captures occurrences in different sequences. However, the number of repetitive gapped sequential patterns generated by even closed mining algorithms may be too large for users to understand, especially when the support threshold is low. In this paper, we propose and study the problem of compressing repetitive gapped sequential patterns. Inspired by the ideas behind RPglobal, a method for summarizing frequent itemsets, we develop an algorithm, CRGSgrow (Compressing Repetitive Gapped Sequential pattern grow), which includes an efficient pruning strategy, SyncScan, and an efficient representative pattern checking scheme, δ-dominate sequential pattern checking. CRGSgrow is a two-step approach: in the first step, we obtain all closed repetitive sequential patterns as the candidate set of representative repetitive sequential patterns and, at the same time, obtain most of the representative repetitive sequential patterns; in the second step, we spend only a little time finding the remaining representative patterns from the candidate set. An empirical study with both real and synthetic data sets clearly shows that CRGSgrow has good performance.

📄 Full Content

Sequential pattern mining has been a central data mining research topic with broad applications, including classification based on frequent sequential patterns, web log analysis, analysis of frequent sequential patterns in DNA and protein sequences, and API specification mining and API usage mining from open source repositories. Many efficient sequential mining algorithms have been proposed for a variety of real problems, such as general sequential pattern mining [3,17,27], frequent episode mining [16], closed sequential pattern mining [22,25], maximal sequential pattern mining [14], top-k closed sequential pattern mining [21], long sequential pattern mining in noisy environments [26], constraint-based sequential pattern mining [18], frequent partial order mining [19], and periodic pattern mining [28]. In recent years, some studies have focused on a novel sequential pattern mining problem: mining repetitive gapped sequential patterns [7]. A gapped sequential pattern is a sequential pattern that appears in a sequence of a sequence database, possibly with gaps between two successive events. For brevity, we use the term sequential pattern instead of gapped sequential pattern in this paper. Because traditional sequence mining [3,17,27] ignores the possibility that a sequential pattern occurs repeatedly within a single sequence, it is important to discover frequent repetitive sequential patterns by capturing not only repetitive occurrences of sequential patterns in different sequences but also repetitive occurrences within each sequence [7]. A few approaches have been proposed for mining repetitive sequential patterns, but because of the Apriori property they cannot avoid an explosive number of output frequent repetitive sequential patterns, especially when the support threshold is low. Hence, the result set of frequent repetitive sequential patterns is difficult to understand.

To address this challenge, it is natural to reduce the size of the output result set, and two solutions [7,14] have been proposed: mining maximal repetitive sequential patterns and mining closed repetitive sequential patterns. Mining maximal repetitive sequential patterns focuses only on the structural information of repetitive sequential patterns and fails to keep the support information. Mining closed repetitive sequential patterns requires a sub-sequence and its super-sequence to have exactly the same support (see the definitions in Section 2.1), so the number of closed repetitive sequential patterns is still too large to be used easily. In particular, mining closed repetitive sequential patterns may generate more closed patterns than traditional closed sequential pattern mining, since the former considers not only repetition across different sequences but also repetition within each sequence. The following example illustrates this.

Example 1: Table 1 shows a sequence database $SeqDB = \{S_1, S_2\}$. If we compute support as in traditional sequence mining, the support of sequential pattern AB is 2, since AB occurs in each sequence in the database. The support of sequential pattern ABB is also 2, and AB is a sub-pattern of ABB. Therefore, ABB is a closed sequential pattern in the database, and AB is not a closed sequential pattern. However, if we consider the same problem under repetitive sequence mining, the number of non-overlapping occurrences of AB is 4 ($\langle e_1, e_2 \rangle$ and $\langle e_6, e_8 \rangle$ […] and $\langle e_5, e_7 \rangle$ in $S_2$), and the number of non-overlapping occurrences of ABB is 2 (only $\langle e_1, e_2, e_3 \rangle$ in $S_1$ and […] in $S_2$), so both AB and ABB are closed repetitive sequential patterns. This example makes it clear that mining closed repetitive sequential patterns is less effective than traditional closed sequential pattern mining at reducing the number of frequent sequential patterns. Thus, apart from the two extremes of maximal repetitive sequential patterns and closed repetitive sequential patterns, we should find a solution that compresses the result set of repetitive sequential patterns into a smaller number of representative repetitive sequential patterns. We give a motivating example as follows.

Example 2: Table 2 shows a subset of all frequent repetitive sequential patterns on a real dataset from a workflow system, where A, B, C, D are the names of distinct events. The five repetitive sequential patterns are all closed repetitive sequential patterns, since RP can represent the subset of the repetitive sequential patterns while carrying sufficient information and decreasing the size of the subset.
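The two ways of counting support in Example 1 can be compared with a small Python sketch. Everything in it is an illustration rather than the paper's method: the database `SEQ_DB` is hypothetical (Table 1 is not reproduced here) and was chosen only so that the counts match the ones quoted in Example 1, the function names are made up, and the sketch assumes that two occurrences overlap when they use the same event position for the same pattern index. Under that assumption, a greedy leftmost scan is used to collect non-overlapping occurrences.

```python
from typing import Dict, List, Tuple

# Hypothetical database, NOT the paper's Table 1.
SEQ_DB: Dict[str, str] = {"S1": "ABBCDACB", "S2": "ABBCADB"}


def contains(seq: str, pattern: str) -> bool:
    """True if `pattern` is a gapped subsequence of `seq`."""
    it = iter(seq)
    # `event in it` consumes the iterator, so order is respected.
    return all(event in it for event in pattern)


def sequence_support(db: Dict[str, str], pattern: str) -> int:
    """Traditional support: number of sequences containing the pattern."""
    return sum(contains(seq, pattern) for seq in db.values())


def occurrences(seq: str, pattern: str) -> List[Tuple[int, ...]]:
    """Greedy leftmost occurrences of a non-empty pattern (1-based positions).

    Assumption: occurrences must not share an event position at the
    same pattern index.
    """
    instances = [(i,) for i, ev in enumerate(seq, start=1) if ev == pattern[0]]
    for event in pattern[1:]:
        grown: List[Tuple[int, ...]] = []
        last_used = 0  # position already taken for this pattern index
        for inst in instances:
            # Search strictly after both the instance's last position
            # and the position used by the previous instance.
            start = max(last_used, inst[-1])
            idx = seq.find(event, start)  # 0-based index >= start
            if idx == -1:
                break  # later instances end even further to the right
            last_used = idx + 1
            grown.append(inst + (last_used,))
        instances = grown
    return instances


def repetitive_support(db: Dict[str, str], pattern: str) -> int:
    """Total number of non-overlapping occurrences over all sequences."""
    return sum(len(occurrences(seq, pattern)) for seq in db.values())


for p in ("AB", "ABB"):
    print(p, sequence_support(SEQ_DB, p), repetitive_support(SEQ_DB, p))
```

On this hypothetical database, AB and ABB have the same traditional support, so AB is absorbed by ABB, whereas their repetitive supports differ, so ABB does not make AB redundant. That is the effect Example 1 describes.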

Example 2 above shows that we can use a few representative repetitive sequential patterns to compress all frequent repetitive sequential patterns. Recently, the problem of compressing frequent itemsets has been studied [1,23,2

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
