'Improved FCM algorithm for Clustering on Web Usage Mining'

Reading time: 5 minutes

📝 Original Info

  • Title: ‘Improved FCM algorithm for Clustering on Web Usage Mining’
  • ArXiv ID: 1104.1892
  • Date: 2011-04-12
  • Authors: not listed (no author information was provided in the paper)

📝 Abstract

In this paper we present an improved clustering method. The conventional FCM method is very sensitive to the initial center values, places high requirements on the data set, and cannot handle noisy data. The proposed method uses information entropy to initialize the cluster centers and introduces weighting parameters to adjust the locations of the cluster centers and to address the noise problem. Web navigation datasets are sequential in nature; clustering web data means finding groups that share common interests and behavior by analyzing the data collected on web servers, and the improved fuzzy c-means (FCM) algorithm clusters such web data efficiently. Web usage mining is the application of data mining techniques to web log data repositories; it is used to find user access patterns in web access logs. Web data clusters are formed on the MSNBC web navigation dataset.

📄 Full Content

The World Wide Web holds a huge amount of information [1,2], and large datasets are available in databases. One way to extract information from these datasets is to find homogeneous clusters of similar units. The data are usually described as vectors, each component of which corresponds to a variable that can be measured on a different scale (nominal, ordinal, or numeric). Most well-known clustering methods are implemented only for numeric data (e.g., the k-means method) or are too complex for clustering large datasets (such as hierarchical methods based on dissimilarity matrices). Fuzzy clustering is relevant for information retrieval because a document may be relevant to multiple queries; such a document should appear in each corresponding response set, otherwise the user would not be aware of it. Fuzzy clustering therefore seems a natural technique for document categorization. There are two basic methods of fuzzy clustering [4]: one, based on fuzzy c-partitions, is called the fuzzy c-means clustering method; the other, based on fuzzy equivalence relations, is called the fuzzy equivalence clustering method.

Broadly speaking, clustering algorithms [3] can be divided into two types: partitioning and hierarchical. Partitioning algorithms construct a partition of a database D of n objects into a set of k clusters, where k is an input parameter.

Hierarchical algorithms create a decomposition of the database D; they are either agglomerative or divisive. Hierarchical clustering builds a tree of clusters, also known as a dendrogram, in which every cluster node contains child clusters. Agglomerative clustering starts with one-point (singleton) clusters and recursively merges the most appropriate clusters. Divisive clustering starts with one cluster containing all data points and recursively splits it into the most appropriate clusters. The process continues until a stopping criterion is met. There are two main issues in clustering techniques: first, finding the optimal number of clusters in a given dataset, and second, given two sets of clusters, computing a relative measure of goodness between them. For both purposes, a criterion function or validation function is usually applied. The simplest and most widely used cluster optimization function is the sum of squared error [5]. Studies on sum-of-squared-error clustering have focused on the well-known k-means algorithm [6] and its variants.
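To make the sum-of-squared-error criterion concrete, here is a minimal k-means sketch that alternately assigns points to their nearest center and recomputes the centers, reporting the final SSE. This is a generic illustration of the baseline the paper improves on, not code from the paper; the random-point initialization is a simplifying assumption.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: partition the rows of X into k clusters by locally
    minimizing the sum of squared errors (SSE) to the cluster centers."""
    rng = np.random.default_rng(seed)
    # initialize centers from k distinct data points (a common simple choice)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its points;
        # keep the old center if a cluster happens to be empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    sse = ((X - centers[labels]) ** 2).sum()
    return labels, centers, sse
```

Because the objective is only locally minimized, the result depends on the initial centers, which is exactly the sensitivity the FCM improvement discussed later tries to address.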

In conventional clustering, objects that are similar are allocated to the same cluster while objects that differ are put in different clusters; these are hard clusters. In soft clustering, an object may belong to two or more clusters. Clustering is a widely used technique in data mining for discovering patterns in underlying data. Most traditional clustering algorithms are limited in handling datasets that contain categorical attributes; however, datasets with categorical attributes are common in real-life data mining problems. For each pair of documents, a comparison vector is constructed that contains binary features measuring the overlap of highly informative but sparse features between the two documents, along with numeric features. The comparison vector is then aggregated into one value in the unit interval; the aggregation step is performed by taking a weighted average. Because information gain tends to favor features with many possible values over features with fewer possible values, a normalized version of information gain, called gain ratio, is used as the weighting metric. Clustering is of prime importance in data analysis, machine learning, and statistics. It is defined as the process of grouping N items into distinct clusters based on a similarity or distance function. A good clustering technique yields clusters with high inter-cluster distance and low intra-cluster distance [7]. The objective of clustering is to maximize the similarity of the data points within each cluster and to maximize dissimilarity across clusters.
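The weighted-average aggregation described above can be sketched as follows. The function name and the assumption that the weights are per-feature gain ratios normalized by their sum are illustrative choices, not details taken from the paper.

```python
def aggregate(comparison_vector, weights):
    """Collapse a document-pair comparison vector into one score in [0, 1]
    via a weighted average. `weights` would hypothetically be the per-feature
    gain ratios; they need not be pre-normalized, since we divide by their sum."""
    total = sum(weights)
    return sum(v * w for v, w in zip(comparison_vector, weights)) / total
```

For example, `aggregate([1, 0, 1], [2, 1, 1])` gives `(2 + 0 + 1) / 4 = 0.75`, so features with larger weights pull the pair score toward their own match/mismatch value.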

The information gain of a feature is calculated as follows. Assume we have K, the set of class labels (a binary set in our case: document pairs belonging to the same cluster or not), and M_i, the set of feature values for feature i. With this information we can calculate the information entropy of the database; the probabilities are estimated from the relative frequencies in the training set.

The information gain of feature i is then measured by calculating the difference in entropy between the situations with and without information about the values of the feature.

Gain ratio is a normalized version of information gain: it is the information gain divided by the split info, the entropy of the feature values. This is just the entropy of the database restricted to a single feature.

As the FCM algorithm is very sensitive to the number of cluster centers and to their initialization, initializing the cluster centers arbitrarily often introduces significant errors, and can even ...

Reference

This content is AI-processed based on open access ArXiv data.
