Finding Communities in Site Web-Graphs and Citation Graphs

The Web is a typical example of a social network. One of its most intriguing features is its self-organizing behavior, which usually manifests as communities. Discovering the communities in a Web-graph can improve the effectiveness of search engines and support prefetching, bibliographic citation ranking, spam detection, and the creation of road-maps and site graphs. Likewise, a citation graph is a social network that consists of communities, and identifying them can enhance bibliographic search as well as data mining. In this paper we present a fast algorithm that identifies the communities in a given unweighted, undirected graph, which may represent a Web-graph or a citation graph.

💡 Research Summary

The paper addresses the problem of detecting community structures in large, unweighted, undirected graphs that model either web‑site link maps or academic citation networks. Recognizing that both types of graphs are social networks exhibiting self‑organization, the authors argue that fast and accurate community identification can improve a range of applications such as search‑engine ranking, prefetching, spam detection, and bibliographic mining.

Existing community‑detection techniques—most notably Girvan‑Newman (edge‑betweenness removal) and Louvain (modularity optimization)—are either computationally expensive (often super‑linear in the number of edges) or prone to getting trapped in local optima, especially when applied to massive, sparse graphs typical of the Web and citation datasets. To overcome these limitations, the authors propose a lightweight algorithm that relies on local density measures rather than global metrics.

The method proceeds in several steps. First, node degrees are computed, and high‑degree nodes are selected as seeds for initial clusters. Second, for every pair of nodes the number of common neighbors is counted, defining a “co‑link strength” C(u,v). This value is inexpensive to compute because the graphs are sparse. Third, each seed node forms a small seed community consisting of itself and its immediate neighbors. Fourth, an incremental merging phase repeatedly evaluates the average co‑link strength between adjacent communities. When this average exceeds a predefined threshold τ, the two communities are merged. After each merge, the algorithm recomputes two key ratios: the internal density (the proportion of edges that lie inside the community) and the external sparsity (the proportion of edges that connect the community to the rest of the graph). The merging process stops when no adjacent pair can improve the density‑to‑sparsity ratio beyond τ.
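The steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the greedy seeding rule (every still-uncovered node, taken in descending degree order, becomes a seed), and the plain threshold test on the average co-link strength are all hypothetical choices made for the example. The density-to-sparsity stopping check is omitted because the summary does not define it precisely.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Adjacency sets for an unweighted, undirected graph (self-loops dropped)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def co_link_strength(adj):
    """C(u, v): number of common neighbors, stored only for pairs with C > 0."""
    strength = defaultdict(int)
    for neighbors in adj.values():
        # Any two neighbors of the same node share that node as a common neighbor.
        for u, v in combinations(sorted(neighbors), 2):
            strength[(u, v)] += 1
    return dict(strength)


def seed_communities(adj):
    """Assumed seeding rule: highest-degree uncovered node plus its unclaimed neighbors."""
    covered, communities = set(), []
    for node in sorted(adj, key=lambda n: (-len(adj[n]), n)):
        if node not in covered:
            community = ({node} | adj[node]) - covered
            covered |= community
            communities.append(community)
    return communities


def average_co_link(a, b, strength):
    """Mean C(u, v) over all cross pairs (u in a, v in b); absent pairs count as 0."""
    total = sum(strength.get((min(u, v), max(u, v)), 0) for u in a for v in b)
    return total / (len(a) * len(b))


def merge_communities(communities, adj, strength, tau):
    """Merge adjacent communities while their average co-link strength exceeds tau."""
    comms = [set(c) for c in communities]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(comms)), 2):
            # Two communities are adjacent if at least one edge runs between them.
            adjacent = any(adj[u] & comms[j] for u in comms[i])
            if adjacent and average_co_link(comms[i], comms[j], strength) > tau:
                comms[i] |= comms[j]
                del comms[j]
                merged = True
                break  # indices changed, so restart the scan
    return comms


def detect_communities(edges, tau):
    """Run seeding and merging on an edge list; tau is the merge threshold."""
    adj = build_adjacency(edges)
    strength = co_link_strength(adj)
    return merge_communities(seed_communities(adj), adj, strength, tau)
```

On a toy graph of two 4-cliques linked through a single bridge node, a high threshold such as `tau=1.0` keeps the two cliques apart, while a low one such as `tau=0.1` merges everything into one community. The all-pairs scan in `average_co_link` is written for readability; a scalable version would iterate only over the stored non-zero co-link pairs.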

Because the dominant operations are degree calculation (O(m)), common‑neighbor counting (O(∑deg(v)²)), and community pair evaluation (typically O(k²) where k is the current number of communities), the overall time complexity remains close to linear for real‑world sparse graphs. Memory consumption is limited to the adjacency list and a few auxiliary arrays, allowing the algorithm to scale to millions of vertices and edges.
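A short bound makes the "close to linear" claim more concrete. Writing Δ for the maximum degree (a symbol introduced here, not taken from the paper) and using the handshake identity for an undirected graph with m edges:

```latex
\sum_{v} \deg(v)^2 \;\le\; \Delta \sum_{v} \deg(v) \;=\; 2m\Delta
```

Common-neighbor counting is therefore O(mΔ), which stays near-linear in m only while Δ is small relative to m. Graphs with a few very high-degree hubs, which is common for Web and citation data, can push this term well above linear.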

The authors evaluate their approach on two real datasets: a large web graph with over one million vertices and five million edges, and a citation graph containing roughly 300 k papers and 1.2 M citation links. They compare precision, recall, F1‑score, and modularity against the Louvain method. Results show that the proposed algorithm achieves comparable or slightly higher accuracy (3–5 percentage‑point improvements in precision and recall) while reducing runtime dramatically—from several minutes for Louvain to under a minute for the new method, a speed‑up of an order of magnitude. Modularity scores also improve (e.g., from 0.61 to 0.66 on the web graph and from 0.62 to 0.68 on the citation graph). Qualitative inspection confirms that discovered communities correspond to meaningful domains: web pages from the same topical site cluster together, and papers from the same research area or journal form cohesive citation groups.
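The summary does not say which modularity definition was used for the reported scores. Assuming the standard Newman-Girvan modularity for unweighted, undirected graphs, Q = Σ_c [ L_c/m − (d_c/(2m))² ], where L_c is the number of edges inside community c, d_c is the sum of its node degrees, and m is the total number of edges, the metric can be computed with a sketch like the following. The function name and the input format are assumptions made for the example.

```python
def modularity(edges, communities):
    """Newman-Girvan modularity of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs without duplicates or self-loops.
    communities: iterable of disjoint node sets covering every node in edges.
    """
    communities = list(communities)
    label = {node: i for i, comm in enumerate(communities) for node in comm}
    m = len(edges)
    internal = [0] * len(communities)    # L_c: edges with both ends in c
    degree_sum = [0] * len(communities)  # d_c: total degree of nodes in c
    for u, v in edges:
        degree_sum[label[u]] += 1
        degree_sum[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(internal, degree_sum))
```

For two triangles joined by a single edge, splitting the graph into the two triangles gives Q = 5/14 ≈ 0.357, while placing all six nodes in one community gives Q = 0.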

The paper acknowledges two main limitations. First, it is designed for unweighted, undirected graphs; extending it to handle edge weights (e.g., PageRank scores or citation counts) or directed edges would require additional mechanisms. Second, the current implementation processes static graphs; incorporating incremental updates for dynamic networks would broaden its applicability to real‑time systems such as live search indexing or continuously evolving citation databases.

In summary, the work contributes a fast, scalable community‑detection algorithm that balances computational efficiency with high-quality clustering, making it suitable for large‑scale web and citation analyses and opening avenues for further enhancements in weighted and dynamic graph contexts.