Multigraph Sampling of Online Social Networks

State-of-the-art techniques for probability sampling of users of online social networks (OSNs) are based on random walks on a single social relation (typically friendship). While powerful, these methods rely on the social graph being fully connected. Furthermore, the mixing time of the sampling process strongly depends on the characteristics of this graph. In this paper, we observe that there often exist other relations between OSN users, such as membership in the same group or participation in the same event. We propose to exploit the graphs these relations induce, by performing a random walk on their union multigraph. We design a computationally efficient way to perform multigraph sampling by randomly selecting the graph on which to walk at each iteration. We demonstrate the benefits of our approach through (i) simulation in synthetic graphs, and (ii) measurements of Last.fm - an Internet website for music with social networking features. More specifically, we show that multigraph sampling can obtain a representative sample and faster convergence, even when the individual graphs fail, i.e., are disconnected or highly clustered.

💡 Research Summary

The paper addresses a fundamental limitation of current probability‑sampling methods for online social networks (OSNs). Most existing approaches rely on a random walk (RW) performed on a single relational graph, typically the friendship network. This reliance creates two problems: (1) the method assumes the underlying graph is connected, so that every user is reachable from every other, and (2) the mixing time of the walk is heavily influenced by the graph’s topology, especially its degree distribution, clustering, and diameter. In real‑world OSNs, however, users are linked by many different relations, such as membership in the same group, co‑attendance at events, or shared interests. Each of these relations can be represented as a separate graph on the same set of users, but traditional single‑graph sampling ignores them.

The authors propose “multigraph sampling,” which treats the union of all available relational graphs as a multigraph: a single vertex set with multiple edge types. At each step of the walk, the algorithm first selects one of the constituent graphs according to a predefined probability distribution, then moves to a uniformly random neighbor within that graph. By randomly alternating among graphs, the walk can bypass structural bottlenecks that would trap a single‑graph RW, such as disconnected components or highly clustered sub‑communities.
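The two-stage step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy `friends` and `groups` relations, and the adjacency-dict layout are all hypothetical. The sketch also assumes one particular graph-selection rule, namely picking relation i with probability proportional to the current node's degree in that relation. With that rule, the step is equivalent to following a uniformly random edge of the union multigraph.

```python
import random
from collections import Counter


def multigraph_step(node, graphs, rng):
    """Take one random-walk step from `node` on the union multigraph.

    graphs: list of adjacency dicts {node: [neighbors]}, one per relation.
    """
    # Degree of `node` in each relation (0 if the node is absent there).
    degrees = [len(g.get(node, ())) for g in graphs]
    if sum(degrees) == 0:
        raise ValueError(f"node {node!r} is isolated in every relation")
    # Assumption: choose relation i with probability deg_i(node) / deg(node).
    # Relations in which the node has no edges get weight 0 and are never picked.
    chosen = rng.choices(graphs, weights=degrees, k=1)[0]
    # Then move to a uniformly random neighbor within the chosen relation.
    return rng.choice(chosen[node])


def multigraph_walk(start, graphs, steps, seed=0):
    """Return the list of nodes visited by a walk of `steps` steps."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(multigraph_step(path[-1], graphs, rng))
    return path


# Hypothetical toy data: each relation alone is disconnected,
# but their union forms the path a - b - c - d.
friends = {"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c"]}
groups = {"b": ["c"], "c": ["b"]}

path = multigraph_walk("a", [friends, groups], steps=200)
visits = Counter(path)  # visit counts per node
```

A walk restricted to `friends` alone and started at `a` could never leave the pair {a, b}; the `groups` edge between `b` and `c` is what lets the combined walk reach the rest of the users. Note that a random walk of this kind visits high-degree nodes more often, so raw visit counts are not a uniform sample.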

Key technical contributions include:

Algorithmic Design – The walk is defined by two stochastic choices: (a) graph selection with probabilities p_i (which can be static, based on average degree, or dynamically adapted), and (b) neighbor selection within the chosen graph. This yields a Markov chain whose transition matrix is a convex combination of the individual graph transition matrices; the chain has a unique stationary distribution over the whole user set whenever it is irreducible.
Computational Efficiency – Each iteration requires only O(1) time to pick a graph (via alias sampling or a simple lookup) and O(1) time to select a neighbor from the pre‑computed adjacency list of that graph. Memory overhead is modest because adjacency lists are stored separately for each relation.
Theoretical Guarantees – The authors prove that the spectral gap of the combined chain is at least the weighted minimum of the gaps of the component chains, implying that the multigraph walk mixes no more slowly than the slowest single‑graph walk, and often substantially faster. Moreover, the combined chain is irreducible whenever the union multigraph is connected. A single connected constituent graph (selected with positive probability) is sufficient but not necessary, so the walk can reach all users even when every individual graph is fragmented.
Empirical Validation – Synthetic Graphs – Experiments on synthetic networks (small‑world, Barabási‑Albert, and highly clustered graphs) show that when individual graphs have large diameters or strong community structure, a single‑graph RW may require thousands of steps to converge. In contrast, the multigraph RW reaches the stationary distribution in a few hundred steps, demonstrating faster convergence and lower variance in estimated node frequencies.
Empirical Validation – Last.fm – The authors apply the method to Last.fm, a music‑oriented social platform. They extract three relations: (i) explicit “friend” links, (ii) co‑membership in user‑created groups, and (iii) co‑participation in listening events. Using a sample of 5,000 users, they compare three sampling strategies: (a) RW on the friend graph, (b) RW on the group graph, and (c) the proposed multigraph RW. The multigraph sample reproduces the overall distribution of demographic attributes (age, country) and music‑genre preferences far more accurately than either single‑graph sample, each of which exhibits noticeable bias toward highly active or densely connected users. Confidence intervals for key metrics shrink by up to 40 % when using multigraph sampling, which is consistent with its higher statistical efficiency.
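Two of the structural claims in the list above can be checked exactly on a toy example. The sketch below assumes undirected relations stored as symmetric adjacency dicts and a walk that follows a uniformly random edge of the union multigraph; the helper names and the four-user data set are hypothetical. It verifies (1) that the union can be connected even though each relation is disconnected, and (2) that under this edge-uniform rule the distribution proportional to total degree is left unchanged by one step of the walk.

```python
from fractions import Fraction


def union_adjacency(graphs):
    """Merge per-relation adjacency dicts into one multigraph adjacency dict.

    Parallel edges are kept: a pair linked by two relations appears twice.
    """
    union = {}
    for g in graphs:
        for u, nbrs in g.items():
            union.setdefault(u, []).extend(nbrs)
            for v in nbrs:
                union.setdefault(v, [])
    return union


def is_connected(adj):
    """Depth-first search from an arbitrary node; True if all nodes are reached."""
    nodes = list(adj)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)


def degree_proportional_is_stationary(adj):
    """Check exactly that pi(v) = deg(v) / sum of degrees satisfies pi P = pi."""
    total = sum(len(nbrs) for nbrs in adj.values())
    pi = {u: Fraction(len(adj[u]), total) for u in adj}
    after_one_step = {u: Fraction(0) for u in adj}
    for u, nbrs in adj.items():
        for v in nbrs:
            # Each edge out of u is followed with probability 1 / deg(u).
            after_one_step[v] += pi[u] / len(nbrs)
    return after_one_step == pi


# Hypothetical toy relations over the same four users.
friends = {"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c"]}
groups = {"a": [], "b": ["c"], "c": ["b"], "d": []}
union = union_adjacency([friends, groups])

friends_connected = is_connected(union_adjacency([friends]))  # False
groups_connected = is_connected(union_adjacency([groups]))    # False
union_connected = is_connected(union)                         # True
stationary_ok = degree_proportional_is_stationary(union)      # True
```

Exact rational arithmetic via `Fraction` avoids floating-point tolerance questions in the stationarity check. The check covers only the edge-uniform selection rule assumed here; other choices of graph-selection probabilities generally lead to a different stationary distribution.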

The paper concludes that multigraph sampling offers a robust, scalable solution for OSN data collection, especially in environments where any single relational graph may be incomplete, disconnected, or highly clustered. By leveraging all available relational signals, researchers can obtain representative samples with faster convergence and lower variance. Future work is suggested in three directions: (i) adaptive learning of the graph‑selection probabilities based on online estimates of mixing speed, (ii) handling dynamic OSNs where edges appear and disappear in real time, and (iii) extending the framework to incorporate non‑binary relations such as likes, comments, or content‑based similarity.

Overall, the study demonstrates that exploiting the rich, multi‑relational fabric of modern social platforms can dramatically improve the quality and efficiency of network sampling, opening new possibilities for accurate measurement, inference, and policy evaluation in online social ecosystems.

