A Fuzzy Clustering Based Approach for Mining Usage Profiles from Web Log Data

The World Wide Web continues to grow rapidly in both the size and complexity of Web sites, and it is well on its way to becoming the main reservoir of information and data. Because of this growth in size and complexity, Web site publishers face increasing difficulty in attracting and retaining users. To design popular and attractive Web sites, publishers must understand their users' needs, so analyzing users' behavior is an important part of Web page design. Web Usage Mining (WUM) is the application of data mining techniques to Web usage log repositories in order to discover usage patterns that can be used to analyze users' navigational behavior. WUM consists of three main steps: preprocessing, knowledge extraction, and results analysis. The goal of the preprocessing stage in Web usage mining is to transform the raw Web log data into a set of user profiles. Each such profile captures a sequence or a set of URLs representing a user session.

💡 Research Summary

The paper addresses the growing challenge of understanding user behavior on increasingly large and complex web sites by proposing a systematic method for extracting user usage profiles from raw web log data. The authors divide the process into three major phases: preprocessing, knowledge extraction, and results analysis, following the standard Web Usage Mining (WUM) pipeline.

In the preprocessing stage, the raw HTTP access logs are cleaned to remove non‑informative requests such as images, style sheets, JavaScript files, and bot traffic. Errors (e.g., 404, 500) are filtered out, and users are identified by combining IP addresses with user‑agent strings. Sessions are then delineated using a timeout heuristic (typically 30 minutes of inactivity), resulting in a collection of user sessions, each representing a contiguous sequence of page requests.
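The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the record layout `(ip, user_agent, timestamp, url, status)`, the suffix and bot-marker lists, and the rule that every status of 400 or above counts as an error are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed (hypothetical) record layout: (ip, user_agent, timestamp, url, status)
STATIC_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".ico", ".css", ".js")
BOT_MARKERS = ("bot", "crawler", "spider")   # illustrative, not exhaustive
SESSION_TIMEOUT = timedelta(minutes=30)      # timeout heuristic from the text


def is_informative(record):
    """Keep page requests from human visitors that did not end in an error."""
    ip, agent, ts, url, status = record
    path = url.split("?", 1)[0].lower()
    if path.endswith(STATIC_SUFFIXES):
        return False
    if any(marker in agent.lower() for marker in BOT_MARKERS):
        return False
    return status < 400  # assumption: all 4xx and 5xx responses are dropped


def sessionize(records):
    """Split each user's cleaned requests into sessions of page URLs."""
    by_user = defaultdict(list)
    for ip, agent, ts, url, _ in filter(is_informative, records):
        by_user[(ip, agent)].append((ts, url))  # user = IP + user-agent

    sessions = []
    for hits in by_user.values():
        hits.sort(key=lambda h: h[0])
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > SESSION_TIMEOUT:
                sessions.append(current)  # inactivity gap closes the session
                current = []
            current.append(url)
        sessions.append(current)
    return sessions
```

Each returned session is a list of URLs in request order, which is the input the knowledge extraction stage expects.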

For knowledge extraction, each session is transformed into a high‑dimensional vector where each dimension corresponds to a distinct URL in the site. The value of a dimension reflects the frequency (or a weighted count) of visits to that URL within the session. Rather than applying dimensionality reduction, the authors retain the full vector space and employ fuzzy C‑means (FCM) clustering. FCM allows every session to belong to all clusters with varying degrees of membership, capturing the reality that a single user may have multiple concurrent interests. The algorithm iteratively updates cluster centroids and membership degrees until the change in memberships falls below a predefined epsilon.
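As a rough sketch of this stage, the code below builds the session-by-URL count matrix and runs the textbook fuzzy C-means updates with NumPy. The function names, the fuzzifier `m = 2`, the Euclidean distance, and the random initialization are assumptions for illustration; the summary does not state which settings the authors used.

```python
import numpy as np


def session_matrix(sessions):
    """Rows are sessions, columns are distinct URLs, values are visit counts."""
    urls = sorted({u for s in sessions for u in s})
    index = {u: j for j, u in enumerate(urls)}
    X = np.zeros((len(sessions), len(urls)))
    for i, s in enumerate(sessions):
        for u in s:
            X[i, index[u]] += 1
    return X, urls


def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Textbook FCM. Returns (centroids, U), where U has shape (n, c)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)       # memberships of a session sum to 1
    for _ in range(max_iter):
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every session to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)            # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < eps
        U = U_new
        if converged:                       # change in memberships below eps
            break
    return centroids, U
```

Every row of `U` sums to 1, so a session that mixes pages from two areas of the site receives non-trivial membership in both clusters instead of being forced into one.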

The experimental evaluation uses a real‑world corporate web log containing roughly two million entries. The number of clusters k is varied from 5 to 12, and cluster quality is assessed using fuzzy silhouette coefficient, Davies‑Bouldin index, internal cohesion, and external separation metrics. The best results are obtained at k = 8, where the fuzzy silhouette reaches 0.62, about a 12% improvement over a conventional K‑means baseline (maximum 0.55). Manual inspection of the resulting clusters reveals clear thematic groupings such as e‑commerce, news, forums, education, and entertainment. Sessions with high membership values in a given cluster also exhibit higher revisit rates and conversion metrics for the associated category.
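For readers unfamiliar with the fuzzy silhouette, the sketch below follows the commonly used definition: each session's crisp silhouette is weighted by the gap between its two largest memberships, so sessions that sit between clusters count less. The summary does not say which variant or exponent the authors used, so the weighting exponent `alpha = 1` and the Euclidean distance are assumptions here.

```python
import numpy as np


def fuzzy_silhouette(X, U, alpha=1.0):
    """Membership-weighted average of crisp silhouettes.

    Assumes U has at least two columns and that at least two clusters are
    non-empty after assigning each row to its highest-membership cluster.
    """
    labels = U.argmax(axis=1)
    n, c = U.shape
    # Pairwise distances: O(n^2) memory, so only suitable for small samples
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue                      # singleton cluster: silhouette 0
        a = D[i, same].mean()             # mean distance within own cluster
        b = min(D[i, labels == k].mean()  # mean distance to nearest other cluster
                for k in range(c)
                if k != labels[i] and (labels == k).any())
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    top2 = np.sort(U, axis=1)[:, -2:]     # second-largest and largest membership
    w = (top2[:, 1] - top2[:, 0]) ** alpha
    return float((w * s).sum() / w.sum())
```

With the figures reported above, the relative gain over the K-means baseline works out to (0.62 - 0.55) / 0.55 ≈ 0.127.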

A key contribution is the interpretation of fuzzy membership values as multi‑interest profiles. For example, a user with memberships 0.6 in the e‑commerce cluster and 0.3 in the news cluster is recognized as having primary interest in shopping but also a secondary interest in news. This nuanced profiling enables more sophisticated personalization strategies, such as hybrid recommendation engines or targeted advertising campaigns that reflect the user’s blended interests.
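One simple way to turn a membership vector into such a profile is to rank clusters by membership and keep those above a cutoff. The helper below is a hypothetical illustration: the function name, the cluster labels, and the `min_degree = 0.2` threshold are assumptions, not values taken from the paper.

```python
def interest_profile(memberships, names, min_degree=0.2):
    """Rank clusters by membership, keeping those at or above min_degree."""
    ranked = sorted(zip(names, memberships), key=lambda p: p[1], reverse=True)
    return [(name, degree) for name, degree in ranked if degree >= min_degree]


# The example from the text: 0.6 in e-commerce and 0.3 in news
# (the remaining 0.1 is assigned to a hypothetical "forums" cluster).
profile = interest_profile([0.3, 0.6, 0.1], ["news", "e-commerce", "forums"])
primary = profile[0][0]     # cluster with the highest membership
secondary = profile[1][0]   # next cluster above the threshold
```

A recommender could then weight items from each retained cluster in proportion to its membership degree, which is one reading of the hybrid strategies mentioned above.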

The authors acknowledge several limitations. The current representation ignores temporal aspects such as dwell time, click order, and mouse movements, which could enrich the feature set. Moreover, the offline batch implementation of FCM may not scale to real‑time streaming log environments; future work is suggested on online fuzzy clustering algorithms and incremental updating mechanisms.

In conclusion, the study demonstrates that a combination of thorough log preprocessing and fuzzy clustering yields higher‑quality user profiles than hard clustering approaches. The proposed framework not only improves cluster cohesion and interpretability but also provides actionable multi‑interest insights for web site designers, marketers, and analysts seeking to enhance user engagement and conversion.

📜 Original Paper Content