A Blue Start: A large-scale pairwise and higher-order social network dataset

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large-scale networks have been instrumental in shaping how we think about social systems, and have undergirded many foundational results in mathematical epidemiology, computational social science, and biology. However, many of the social systems through which diseases spread, information disseminates, and individuals interact are inherently mediated through groups, known as higher-order interactions. A gap exists between higher-order models of group formation and spreading processes and the data necessary to validate these mechanisms. Similarly, few datasets bridge the gap between pairwise and higher-order network data. The Bluesky social media platform is an ideal laboratory for observing social ties at scale through its open API. Not only does Bluesky contain pairwise following relationships, but it also contains higher-order social ties known as “starter packs” which are user-curated lists designed to promote social network growth. We introduce “A Blue Start”, a large-scale network dataset comprising 39.7M user accounts, 2.4B pairwise following relationships, and 365.8K groups representing starter packs. This dataset will be an essential resource for the study of higher-order networks.


Research Summary

The paper introduces “A Blue Start,” a large-scale dataset harvested from the Bluesky social media platform that uniquely combines pairwise (follow) relationships with higher-order group interactions called “starter packs.” The authors collected data on 39.7 million user accounts, 2.4 billion directed follow edges, and 365.8 thousand starter packs, each comprising 8–150 users or feeds.

Data acquisition relied on Bluesky’s decentralized identifiers (DIDs) and Personal Data Servers (PDSs). First, the Public Ledger of Credentials (PLC) directory was queried to obtain all DIDs and the locations of PDS instances. Active DIDs were then enumerated via the com.atproto.sync.listRepos endpoint, and each user’s repository was retrieved using com.atproto.sync.getRepo. The collection pipeline used Python’s asyncio and aiohttp with 1,024 concurrent tasks, a per-host connection cap of 64, exponential back-off retries, and rate-limit awareness, resulting in a 0.6% failure rate across 36.5 million repository requests.
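The concurrency and retry pattern described above can be illustrated with a minimal sketch. This is not the authors' code: the names `crawl`, `fetch_with_retry`, and the injected `fetch` coroutine are hypothetical, and the retry parameters are assumptions. In a real crawl, `fetch` would wrap an aiohttp request to com.atproto.sync.getRepo, and the per-host cap could be set with `aiohttp.TCPConnector(limit_per_host=64)`.

```python
import asyncio
import random


async def fetch_with_retry(fetch, did, sem, max_retries=5, base_delay=0.5):
    """Call `fetch(did)` under a shared semaphore, retrying on any exception.

    Returns the fetched value, or None once all attempts have failed.
    `fetch` is a hypothetical coroutine function supplied by the caller.
    """
    for attempt in range(max_retries):
        try:
            # The semaphore bounds how many requests are in flight at once.
            async with sem:
                return await fetch(did)
        except Exception:
            if attempt == max_retries - 1:
                return None
            # Exponential back-off with jitter: the delay doubles per attempt.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            await asyncio.sleep(delay)
    return None


async def crawl(dids, fetch, concurrency=1024, **retry_kwargs):
    """Fetch every DID with bounded concurrency; map DID -> result or None."""
    sem = asyncio.Semaphore(concurrency)
    tasks = (fetch_with_retry(fetch, did, sem, **retry_kwargs) for did in dids)
    results = await asyncio.gather(*tasks)
    return dict(zip(dids, results))
```

The semaphore is released while a task sleeps between retries, so a slow or failing host does not hold a concurrency slot. At the scale reported in the summary, a production crawler would likely stream DIDs through a queue of workers instead of gathering all tasks at once.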

Three core tables were produced: (1) a node table containing DID, creation timestamp, and activity status; (2) a follow table encoding the directed pairwise network; and (3) a starter‑pack table linking pack metadata (creator, creation time, pack ID) with member lists (user DIDs and addition timestamps). Because starter packs can be followed en masse via a “follow‑all” button, they represent genuine higher‑order interactions rather than inferred cliques. The dataset therefore enables direct comparison between the induced higher‑order network (edges implied by pack membership) and the observed pairwise follow network, exposing structural features exclusive to group‑level ties such as overlapping pack memberships without corresponding follow edges.
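As a rough illustration of the comparison between pack-induced ties and observed follows, the sketch below expands each pack into co-membership pairs and reports the pairs with no follow edge in either direction. It assumes a clique expansion of pack membership, which is one possible reading of "edges implied by pack membership"; the function names and the toy data are hypothetical and do not reflect the dataset's actual schema.

```python
from itertools import combinations


def co_membership_pairs(packs):
    """Return all unordered member pairs that share at least one pack.

    `packs` maps a pack ID to an iterable of member IDs (hypothetical layout).
    """
    pairs = set()
    for members in packs.values():
        # Sorting gives each unordered pair a single canonical form (u < v).
        for u, v in combinations(sorted(set(members)), 2):
            pairs.add((u, v))
    return pairs


def unmatched_pairs(packs, follows):
    """Co-membership pairs with no follow edge in either direction.

    `follows` is an iterable of directed (follower, followed) tuples.
    """
    undirected = {tuple(sorted(edge)) for edge in follows}
    return co_membership_pairs(packs) - undirected


# Toy data: two packs that overlap on member "c".
packs = {"pack1": ["a", "b", "c"], "pack2": ["c", "d"]}
follows = {("a", "b"), ("c", "a"), ("d", "e")}
missing = unmatched_pairs(packs, follows)
```

On the toy data, `missing` contains the pairs `("b", "c")` and `("c", "d")`. A pack of 150 members expands to 11,175 pairs, so on the full dataset this comparison would probably need a disk-backed or sparse-matrix approach in place of in-memory sets.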

The authors discuss the novelty of providing both interaction modalities from a single platform, filling a gap left by existing social-media datasets, which are typically limited to pairwise edges. They also note limitations: roughly 59% of PDS addresses could not be queried, which introduces a potential sampling bias; starter packs are size-capped at 150 members, excluding very large communities; and the snapshot reflects the state of the platform as of 18 Oct 2025, omitting subsequent dynamics. Ethical considerations include the exposure of persistent DIDs, prompting the need for anonymization and responsible data-use agreements.

The dataset opens several research avenues. In higher-order network science, the data support multilayer analyses, hypergraph modeling, and community detection that respects both pairwise and group structures. Epidemiological models can differentiate intra-pack transmission from inter-pack spread. Machine-learning applications may predict future pack membership from pairwise features or recommend follows based on higher-order similarity. Communication scholars can examine how starter-pack topics shape information cascades and norm formation. Overall, “A Blue Start” offers a rare, openly accessible resource that bridges the methodological divide between dyadic and group-based social network analysis, with applications across computational social science and complex systems.

