A Novel Gaussian Based Similarity Measure for Clustering Customer Transactions Using Transaction Sequence Vector

📝 Abstract

Clustering transactions in sequence, temporal, and time series databases is receiving considerable attention from database researchers and the software industry. Significant research is devoted to defining and validating new similarity measures for such databases that can accurately and efficiently find the similarity between user transactions and thereby predict user behavior. The distribution of the items present in the transactions contributes greatly to the degree of similarity between them, and this is the key idea of the proposed similarity measure. The main objective of the research is first to design an efficient similarity measure that considers both the distribution of the items in the item set over the entire transaction data set and the commonality of items present in the transactions; neglecting these is the major drawback of the Jaccard, Cosine, and Euclidean similarity measures. We then analyze the worst-case, average-case, and best-case situations. The similarity measure is Gaussian based and preserves the properties of the Gaussian function. It may be used both to cluster and to classify user transactions and to predict user behavior.

📄 Content

Rev. Téc. Ing. Univ. Zulia. Vol. 38, Nº 1, 85 - 96, 2015

A Novel Gaussian Based Similarity Measure for Clustering Customer Transactions Using Transaction Sequence Vector

M.S.B.Phridvi Raj1, Vangipuram Radhakrishna2, C.V.Guru Rao3
1 prudviraj.kits@gmail.com, Department of CSE, Kakatiya Institute of Technology and Science, Warangal, India.
2 radhakrishna_v@vnrvjiet.in, Department of Information Technology, VNR VJIET, Hyderabad, India.
3 Principal and Professor, S.R.Engineering College, Warangal, India.

Keywords: Transaction Sequence vector, similarity measure, cluster, transaction

  1. INTRODUCTION Clustering transactions in sequence databases, temporal databases, and time series databases is receiving considerable attention from database researchers and from the software industry. The importance of clustering comes from the need for decision making, such as classification and prediction. The input to a clustering algorithm in databases is usually a set of user transactions, and the output is a set of clusters of user transactions. An important property of clustering is that all patterns within a cluster share similar properties in some sense, while patterns in different clusters are dissimilar in the corresponding sense. The advantage of clustering with respect to databases is that each user transaction is drawn from a fixed item set whose items do not change frequently; in other words, the item set is static. This eliminates the need to preprocess the transaction dataset. The motivation for this work comes from our previous research (M.S.B.Phridvi Raj et.al; 2014).
    In this paper, we design a similarity measure for clustering user transactions that has the Gaussian property and considers the distribution of each item of the item set over the entire database of transactions. If the transactions arrive as a stream, we can first find the closed frequent item sets and then apply the similarity measure to the final set of transactions.
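
    As a rough illustration of the two ingredients named above, the sketch below computes a per-item distribution over a toy transaction database and evaluates a generic Gaussian kernel. Everything in it is an assumption made for illustration: the toy counts, the choice of mean and population standard deviation as the "distribution" of an item, and the helper names `item_distribution` and `gaussian_closeness`. It is not the similarity measure defined in this paper, only a reminder of the Gaussian behaviour (value 1 at zero difference, decaying symmetrically) that such a measure would inherit.

```python
import math
from statistics import mean, pstdev

# Hypothetical toy database: one row per transaction, one column per item
# of a fixed item set; each value is an invented purchase count.
transactions = [
    [2, 0, 1],
    [1, 0, 3],
    [0, 4, 1],
    [3, 0, 0],
]


def item_distribution(db):
    """Return (mean, population std dev) of each item's count over all rows.

    Assumption: "distribution of an item" is summarised by these two numbers.
    """
    columns = list(zip(*db))  # transpose: one tuple of counts per item
    return [(mean(col), pstdev(col)) for col in columns]


def gaussian_closeness(x, y, sigma):
    """Generic Gaussian kernel exp(-((x - y) / sigma) ** 2).

    Illustrative only; this is not the formula proposed in the paper.
    """
    if sigma == 0:
        # Degenerate spread: treat identical values as fully close.
        return 1.0 if x == y else 0.0
    return math.exp(-(((x - y) / sigma) ** 2))


stats = item_distribution(transactions)
for item_index, (mu, sigma) in enumerate(stats):
    print(f"item {item_index}: mean={mu:.3f}, std={sigma:.3f}")
```

    The kernel is symmetric in its two arguments and bounded between 0 and 1, which is the kind of property the text refers to when it says the measure "has the Gaussian property".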
    In recent years, clustering data streams has gained considerable research focus in academia and industry (Albert Bifet et.al; 2011, Chang Dong Wang et.al; 2013, Chen Ling et.al; 2012, Shi Zhong; 2005). An approach for handling text data streams is discussed in (Yu Bao Liu; 2008). A similarity measure for clustering and classifying text that considers the distribution of words is discussed in (Yung-Shen Lin et.al; 2014), and it greatly helped us in carrying out this work. A tree-based approach for clustering text stream data using the concept of a ternary vector is discussed in (M.S.B.Phridvi Raj et.al; 2013).

  2. PROPOSED MEASURE The idea for the present similarity measure comes from our previous work (Phridvi Raj et.al; 2014, Chintakindi Srinivas et.al; 2014), which considers feature distribution and commonality; these notions also hold for any pair of transactions. In this work we assume each transaction, say Ti, to be a sequence of 2-tuple elements, the first being the count of an item and the latter denoting the presence or absence of that item in the transaction. Table 1 defines the function Ф, which we use as the second element of the 2-tuple representation. We define another function, ∆ (Iik, Ijk), which stores the difference between the counts of item k in transactions Ti and Tj. Tables 1 and 2 define the functions Ф and ∆ for the binary and non-binary transaction-item sets.
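
    The 2-tuple representation described above can be sketched as follows. Since the exact definitions of Ф and ∆ are given in Tables 1 and 2, the versions here are assumptions: the presence flag is taken as 1 when the count is positive and 0 otherwise, and ∆ is taken as the signed difference of counts. The names `to_sequence_vector`, `delta`, and `common_items` are hypothetical helpers, not identifiers from the paper.

```python
from typing import List, Tuple

# One element of a transaction sequence vector: (count, presence flag).
Element = Tuple[int, int]


def to_sequence_vector(counts: List[int]) -> List[Element]:
    """Pair each item count with a presence flag.

    Assumption: the flag is 1 if the item occurs at least once, else 0.
    """
    return [(c, 1 if c > 0 else 0) for c in counts]


def delta(ti: List[Element], tj: List[Element]) -> List[int]:
    """Per-item difference of counts between two transactions.

    Assumption: a signed difference; the paper's tables may define it otherwise.
    """
    return [ci - cj for (ci, _), (cj, _) in zip(ti, tj)]


def common_items(ti: List[Element], tj: List[Element]) -> List[int]:
    """Indices of the items that are present in both transactions."""
    return [
        k
        for k, ((_, pi), (_, pj)) in enumerate(zip(ti, tj))
        if pi == 1 and pj == 1
    ]


# Two invented transactions over a fixed item set of three items.
t1 = to_sequence_vector([2, 0, 1])
t2 = to_sequence_vector([1, 0, 3])
print(t1)
print(delta(t1, t2))
print(common_items(t1, t2))
```

    Keeping the count and the presence flag together lets the same vector serve both the binary case (only the flag matters) and the non-binary case (the count matters as well).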

Table 1. Function definitions Ф and ∆ for transaction item set in bin
