A Local Clustering Algorithm for Massive Graphs and its Application to Nearly-Linear Time Graph Partitioning

We study the design of local algorithms for massive graphs. A local algorithm is one that finds a solution containing or near a given vertex without looking at the whole graph. We present a local clustering algorithm. Our algorithm finds a good cluster (a subset of vertices whose internal connections are significantly richer than its external connections) near a given vertex. The running time of our algorithm, when it finds a non-empty local cluster, is nearly linear in the size of the cluster it outputs. Our clustering algorithm could be a useful primitive for handling massive graphs, such as social networks and web-graphs. As an application of this clustering algorithm, we present a partitioning algorithm that finds an approximate sparsest cut with nearly optimal balance. Our algorithm takes time nearly linear in the number of edges of the graph. Using the partitioning algorithm of this paper, we have designed a nearly-linear time algorithm for constructing spectral sparsifiers of graphs, which we in turn use in a nearly-linear time algorithm for solving linear systems in symmetric, diagonally-dominant matrices. The linear system solver also leads to a nearly-linear time algorithm for approximating the second-smallest eigenvalue and corresponding eigenvector of the Laplacian matrix of a graph. These other results are presented in two companion papers.


💡 Research Summary

The paper tackles the problem of designing algorithms that operate locally on massive graphs, where “local” means that the algorithm discovers a solution near a given seed vertex without having to inspect the entire graph. The authors introduce a local clustering algorithm that, given a seed vertex s, quickly finds a set of vertices (a cluster) whose internal edge density is substantially higher than its external edge density. The algorithm is based on a modified personalized PageRank (or heat-kernel) diffusion: starting from s, a unit of probability mass is repeatedly “pushed” to neighboring vertices, and each vertex keeps track of two quantities, a residual mass that is still waiting to be pushed and a settled mass that stays put. Once the residual at a vertex falls below a prescribed tolerance ε, no further pushes are performed from that vertex. This push scheme guarantees that the vertices receiving non-negligible mass have total volume comparable to the volume of the output cluster, which yields a running time that is nearly linear in the size of the output cluster.
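The push procedure described above can be sketched as follows. This is a minimal illustration in the style of approximate personalized PageRank, not the paper's own pseudocode. The function name `approximate_push`, the teleport parameter `alpha`, the lazy half-step, and the degree-scaled stopping rule (`eps * degree`) are all assumptions made for the example; the graph is an unweighted adjacency dictionary without self-loops.

```python
from collections import deque


def approximate_push(adj, seed, alpha=0.15, eps=1e-4):
    """Spread one unit of mass from `seed` using local push steps.

    adj   : dict mapping each vertex to a list of its neighbours
    alpha : fraction of residual mass settled per push (assumed parameter)
    eps   : tolerance; a vertex is pushed only while its residual is at
            least eps * degree (assumed, degree-scaled stopping rule)
    Returns (settled, residual) dictionaries.
    """
    settled = {}              # mass that has stopped moving
    residual = {seed: 1.0}    # mass still waiting to be pushed
    queue = deque([seed])

    while queue:
        u = queue.popleft()
        deg_u = len(adj[u])
        r_u = residual.get(u, 0.0)
        if deg_u == 0 or r_u < eps * deg_u:
            continue          # nothing to push from this vertex

        # Settle a fraction alpha of the residual at u.
        settled[u] = settled.get(u, 0.0) + alpha * r_u

        # Lazy step: half of the remainder stays at u, half is spread
        # evenly over the neighbours of u.
        keep = (1.0 - alpha) * r_u / 2.0
        residual[u] = keep
        share = keep / deg_u
        for v in adj[u]:
            old = residual.get(v, 0.0)
            residual[v] = old + share
            threshold = eps * len(adj[v])
            if old < threshold <= residual[v]:
                queue.append(v)   # v has just become pushable

        if residual[u] >= eps * deg_u:
            queue.append(u)       # u still holds enough residual

    return settled, residual


# Toy graph: two triangles joined by the single edge (2, 3).
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
settled, residual = approximate_push(toy, seed=0)
print(sorted(settled.items()))
```

Each push settles at least `alpha * eps` units of mass and the total mass is one, so the loop performs at most `1 / (alpha * eps)` pushes regardless of the size of the graph. That bound is what makes the work depend on the tolerance rather than on the number of vertices.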

To evaluate cluster quality, the algorithm sorts vertices by the amount of settled mass they have received and computes the conductance of every prefix of this ordering. Conductance, defined as the ratio of the weight of edges leaving the set to the smaller of the volume of the set and the volume of its complement, measures how well-separated a set is. The prefix with the smallest conductance is returned as the local cluster. The authors prove that, if a non-empty cluster exists with conductance φ, the algorithm finds a set whose conductance is O(√φ log n) in time Õ(vol(C)/ε), where vol(C) is the volume of the returned cluster and Õ hides polylogarithmic factors.
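The prefix sweep described above can be sketched in a few lines. The example assumes an unweighted adjacency dictionary, so the volume of a set is the sum of its degrees and the cut weight is a count of edges. The function name `sweep_cut` and the `score` dictionary (standing in for the settled mass) are hypothetical. A common variant orders vertices by score divided by degree; the sketch follows the plain ordering stated in the text.

```python
def sweep_cut(adj, score):
    """Return the prefix of the score ordering with the smallest conductance.

    adj   : dict mapping each vertex to a list of its neighbours
    score : dict mapping vertices to their score (e.g. settled mass)
    Returns (prefix, conductance).
    """
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    order = sorted(score, key=lambda v: score[v], reverse=True)

    in_set = set()
    vol = 0                    # volume of the current prefix
    cut = 0                    # number of edges leaving the current prefix
    best_phi = float("inf")
    best_size = 0

    for size, v in enumerate(order, start=1):
        # Edges from v into the prefix stop being cut edges;
        # every other edge at v becomes a cut edge.
        inside = sum(1 for w in adj[v] if w in in_set)
        cut += len(adj[v]) - 2 * inside
        vol += len(adj[v])
        in_set.add(v)

        denom = min(vol, total_vol - vol)
        if denom == 0:
            break              # the prefix now covers the whole volume
        phi = cut / denom
        if phi < best_phi:
            best_phi, best_size = phi, size

    return order[:best_size], best_phi


# Toy graph: two triangles joined by the single edge (2, 3).
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
toy_score = {0: 0.40, 1: 0.30, 2: 0.20, 3: 0.05, 4: 0.03, 5: 0.02}
cluster, phi = sweep_cut(toy, toy_score)
print(cluster, phi)
```

On this toy input the sweep returns the first triangle, {0, 1, 2}, which is separated from the rest by a single edge and has conductance 1/7. Updating the cut and volume incrementally keeps the cost of the sweep proportional to the volume of the scored vertices, apart from the initial sort.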

Beyond the single-seed clustering routine, the paper shows how to turn this primitive into a nearly-linear-time global graph partitioning algorithm. By repeatedly invoking the local clustering subroutine on carefully chosen seeds and recursively cutting the graph, the algorithm produces a cut whose balance lies within a constant factor of the optimal (i.e., each side contains at least a constant fraction of the total volume) and whose conductance is within an O(√log n) factor of the sparsest possible cut. The total work of the partitioning algorithm is Õ(m), where m is the number of edges, which is nearly linear in the size of the input.

The significance of these results extends beyond partitioning. The authors point out that the partitioning algorithm can be used to construct spectral sparsifiers in nearly linear time. A spectral sparsifier preserves the Laplacian quadratic form of the original graph up to a factor of (1±ε) while having only O(n log n/ε²) edges; here ε is the approximation parameter of the sparsifier, not the push tolerance used earlier. With such sparsifiers in hand, they obtain a nearly-linear-time solver for symmetric, diagonally-dominant (SDD) linear systems, a cornerstone of many graph-based computations. The SDD solver, in turn, yields fast algorithms for approximating the second-smallest eigenvalue (the algebraic connectivity) and its associated eigenvector of the graph Laplacian. These downstream applications are described in two companion papers.
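For reference, the standard way to state the sparsifier guarantee mentioned above is given below. The symbols are assumptions introduced for this note: G is the original weighted graph with edge set E and weights w, H is the sparsifier, and L_G and L_H are their Laplacian matrices.

```latex
% Laplacian quadratic form of a weighted graph G = (V, E, w)
x^{\top} L_G \, x \;=\; \sum_{(u,v) \in E} w_{uv}\,(x_u - x_v)^2

% H is a (1 \pm \varepsilon) spectral sparsifier of G if,
% for every vector x \in \mathbb{R}^{n},
(1 - \varepsilon)\, x^{\top} L_G \, x
  \;\le\; x^{\top} L_H \, x
  \;\le\; (1 + \varepsilon)\, x^{\top} L_G \, x
```

Taking x to be the indicator vector of a vertex set shows that every cut of G has approximately the same weight in H, so spectral sparsification is at least as strong as preserving all cuts.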

In summary, the paper makes three major contributions: (1) a provably efficient local clustering algorithm whose running time scales with the size of the output rather than the size of the whole graph; (2) a global partitioning method that uses the local primitive to compute, in nearly linear time, an approximate sparsest cut with nearly optimal balance; and (3) a blueprint for using the partitioner to build spectral sparsifiers and, consequently, nearly-linear-time algorithms for SDD linear systems and Laplacian eigenpair approximation. The work promotes an approach to massive-graph processing in which local computation replaces costly global sweeps, which makes it relevant to social-network analysis, web-graph mining, and large-scale machine-learning pipelines.