A space efficient streaming algorithm for triangle counting using the birthday paradox

We design a space efficient algorithm that approximates the transitivity (global clustering coefficient) and total triangle count with only a single pass through a graph given as a stream of edges. Our procedure is based on the classic probabilistic result, the birthday paradox. When the transitivity is constant and there are more edges than wedges (common properties for social networks), we can prove that our algorithm requires $O(\sqrt{n})$ space ($n$ is the number of vertices) to provide accurate estimates. We run a detailed set of experiments on a variety of real graphs and demonstrate that the memory requirement of the algorithm is a tiny fraction of the graph. For example, even for a graph with 200 million edges, our algorithm stores just 60,000 edges to give accurate results. Being a single pass streaming algorithm, our procedure also maintains a real-time estimate of the transitivity/number of triangles of a graph, by storing a minuscule fraction of edges.


💡 Research Summary

The paper introduces a novel single‑pass streaming algorithm for estimating the global clustering coefficient (also known as transitivity) and the total number of triangles in a graph presented as a sequence of edges. The core idea is to exploit the classic birthday paradox, which states that among on the order of √N items drawn uniformly at random (with replacement) from a set of size N, a collision (the same item drawn twice) occurs with constant probability. By treating each “wedge” (a pair of edges sharing a common vertex) as an item, the algorithm observes collisions among a small random sample of wedges to infer the overall wedge population and, consequently, the triangle count.
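To make the birthday-paradox claim concrete, the following minimal sketch computes the exact collision probability for k uniform draws with replacement from N items. The function name `collision_probability` and the example values are illustrative assumptions, not taken from the paper.

```python
import math


def collision_probability(k: int, n: int) -> float:
    """Exact probability that k uniform draws (with replacement)
    from n distinct items contain at least one repeated item."""
    if k > n:
        return 1.0  # pigeonhole: a repeat is certain
    # P(no collision) = prod_{i=0}^{k-1} (1 - i/n); sum logs for stability.
    log_no_collision = sum(math.log1p(-i / n) for i in range(k))
    return 1.0 - math.exp(log_no_collision)


# Classic setting: 23 people, 365 possible birthdays.
p_classic = collision_probability(23, 365)

# k = sqrt(N): the collision probability is a constant, not vanishing.
p_sqrt = collision_probability(1_000, 1_000_000)

print(f"23 draws from 365:        {p_classic:.3f}")
print(f"1000 draws from 1000000:  {p_sqrt:.3f}")
```

With k = √N the probability stays near 1 − e^(−1/2) however large N becomes, which is why a sample of roughly square-root size is enough to start observing collisions.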

The algorithm proceeds in two stages. First, it maintains a reservoir sample S of size s from the edge stream using standard reservoir sampling; each incoming edge has equal probability of being kept, guaranteeing an unbiased sample without needing to know the total number of edges in advance. The authors set s = Θ(√n), where n is the number of vertices, which yields a memory footprint that scales only with the square root of the graph size.
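The first stage can be sketched with textbook reservoir sampling (Algorithm R). This is an assumption about the general technique named in the summary, not necessarily the exact variant the authors use; the names `reservoir_sample`, `Edge`, and the seed are hypothetical.

```python
import random
from typing import Iterable, List, Optional, Tuple

Edge = Tuple[int, int]


def reservoir_sample(stream: Iterable[Edge], s: int,
                     rng: Optional[random.Random] = None) -> List[Edge]:
    """Keep a uniform sample of at most s edges from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir: List[Edge] = []
    for t, edge in enumerate(stream, start=1):
        if t <= s:
            # Fill phase: the first s edges are always kept.
            reservoir.append(edge)
        else:
            # The t-th edge is kept with probability s/t and, if kept,
            # overwrites a uniformly chosen slot.
            j = rng.randrange(t)  # uniform integer in [0, t)
            if j < s:
                reservoir[j] = edge
    return reservoir


# Usage: sample 10 edges from a synthetic path graph with 1000 edges.
sample = reservoir_sample(((i, i + 1) for i in range(1000)), 10,
                          random.Random(0))
print(sample)
```

After t edges have arrived, every edge seen so far is in the reservoir with probability s/t, so the stream length never needs to be known in advance.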

Second, from the sampled edges in S the algorithm enumerates all possible wedges (pairs of edges that meet at a vertex). Because S is small, the number of wedges generated is also modest (O(s·d), where d is the average degree). The algorithm then counts how many of these wedges close into triangles using the edges present in S. The key observation, derived from the birthday paradox, is that the probability two randomly selected wedges share the same central vertex is roughly 1/√n. Hence, the frequency of “colliding” wedges in the sample provides an unbiased estimator for the total number of wedges w in the full graph. Since the transitivity τ = 3·T/w (T = total triangles), estimating w and the fraction of colliding wedges that are closed yields estimates for both τ and T.
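The relation τ = 3·T/w can be checked on a small graph held fully in memory. The sketch below enumerates wedges and tests whether each one is closed. It is an exact reference computation, not the paper's streaming estimator: run on the sample S alone it would give only raw in-sample counts, which still need rescaling. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, Tuple


def wedge_stats(edges: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (number of wedges, number of closed wedges) of an undirected graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    wedges = 0
    closed = 0
    for center, nbrs in adj.items():
        # Every unordered pair of neighbors of `center` forms one wedge.
        for a, b in combinations(nbrs, 2):
            wedges += 1
            if b in adj[a]:  # the edge (a, b) closes the wedge
                closed += 1
    return wedges, closed


def transitivity_and_triangles(edges) -> Tuple[float, int]:
    """tau = closed wedges / wedges = 3T / w; each triangle has 3 closed wedges."""
    w, closed = wedge_stats(edges)
    tau = closed / w if w else 0.0
    return tau, closed // 3


# A triangle {0, 1, 2} with one pendant edge (2, 3): 5 wedges, 3 of them closed.
tau, triangles = transitivity_and_triangles([(0, 1), (1, 2), (0, 2), (2, 3)])
print(tau, triangles)
```

Given estimates of τ and w, the triangle count follows from the same identity as T = τ·w/3.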

The theoretical analysis assumes that τ is a constant (a realistic condition for many social networks) and that the number of edges m exceeds the number of wedges w, the condition given in the abstract. Under these conditions the authors prove that the estimator’s bias is O(1/√n) and its variance is O(1/n). Using Chernoff and Hoeffding bounds, they show that a sample size of s = Θ(√n/ε²) bounds the relative error by any desired ε with high probability.
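The 1/ε² factor in the sample size is typical of concentration arguments. As an illustration only, the sketch below evaluates the generic Hoeffding bound for estimating a fraction (such as the fraction of closed wedges) to additive error ε with failure probability δ. It is not the paper's proof or its constants, and the function name and the parameter δ are assumptions introduced here.

```python
import math


def hoeffding_sample_size(eps: float, delta: float) -> int:
    """Smallest k with 2 * exp(-2 * k * eps**2) <= delta, i.e. enough
    independent [0, 1]-valued samples for the sample mean to be within
    eps of the true mean with probability at least 1 - delta."""
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("eps and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


k_coarse = hoeffding_sample_size(0.05, 0.01)   # 5% additive error
k_fine = hoeffding_sample_size(0.025, 0.01)    # halving eps ~ quadruples k
print(k_coarse, k_fine)
```

Halving ε roughly quadruples the required number of samples, which matches the ε² in the denominator of the stated sample size.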

Empirical evaluation is performed on a diverse collection of real‑world graphs, including several with more than 200 million edges. The algorithm consistently stores only a tiny fraction of the edges (e.g., ~60 k edges for a 200 M‑edge graph, i.e., about 0.03 % of the input) while achieving average relative errors below 2 % for both transitivity and triangle count. Compared against state‑of‑the‑art wedge‑sampling and graph‑sketching streaming methods, the proposed approach delivers significantly higher accuracy for the same memory budget. Moreover, because the estimator is updated after each processed edge, the method provides a real‑time view of the evolving clustering structure, reacting promptly to sudden changes such as community merges.

The paper highlights several strengths: (1) a single pass makes it suitable for high‑throughput streams; (2) the memory requirement grows only as O(√n), enabling deployment on massive graphs with modest hardware; (3) the birthday‑paradox intuition yields a simple, analytically tractable estimator without complex matrix sketches or multi‑level sampling; (4) reservoir sampling is easy to implement and parallelize. Limitations include reduced effectiveness on extremely sparse graphs where the collision probability among wedges becomes negligible, and a dependence on the randomness of the edge order—biased stream orders could affect the representativeness of the reservoir. The authors suggest future work on weighted reservoir schemes, multi‑sample aggregation, and order‑robust preprocessing to mitigate these issues.

In summary, by translating the birthday paradox into the domain of wedge collisions, the authors present a space‑efficient, accurate, and real‑time streaming algorithm for triangle counting and transitivity estimation. The method’s sublinear memory footprint and strong empirical performance make it a compelling tool for online social‑network analysis, large‑scale graph monitoring, and any application requiring rapid insight into clustering structure without storing the full graph.