Coding Together at Scale: GitHub as a Collaborative Social Network
GitHub is the most popular repository for open source code. It has more than 3.5 million users, as the company declared in April 2013, and more than 10 million repositories, as of December 2013. It has a publicly accessible API and, since March 2012, it has also published a stream of all the events occurring on public projects. Interactions among GitHub users are complex and take different forms. Developers create and fork repositories, push code, approve code pushed by others, bookmark their favorite projects and follow other developers to keep track of their activities. In this paper we present a characterization of GitHub, as both a social network and a collaborative platform. To the best of our knowledge, this is the first quantitative study about the interactions happening on GitHub. We analyze the logs from the service over 18 months (between March 11, 2012 and September 11, 2013), covering 183.54 million events, and we obtain information about 2.19 million users and 5.68 million repositories, both growing linearly in time. We show that the distributions of the number of contributors per project, watchers per project and followers per user have a power-law-like shape. We analyze social ties and repository-mediated collaboration patterns, and we observe a remarkably low level of reciprocity of the social connections. We also measure the activity of each user in terms of authored events and we observe that very active users do not necessarily have a large number of followers. Finally, we provide a geographic characterization of the centers of activity and we investigate how distance influences collaboration.
💡 Research Summary
This paper presents the first large‑scale quantitative study of GitHub, the world’s most popular open‑source code hosting service, treating it simultaneously as a social network and a collaborative platform. The authors collected the complete public event stream from March 11 2012 to September 11 2013, amounting to 183,540,210 events across 18 event types (e.g., PushEvent, CreateEvent, WatchEvent, IssueCommentEvent). From these events they extracted metadata for 2.19 million users and 5.68 million public repositories, both of which grew linearly over the observation period.
Four principal graphs were constructed: (1) a directed followers graph G_F (671 k nodes, 2.03 M edges) representing “follow” relationships; (2) a bipartite collaborators graph G_C linking users to repositories they have written to, and its user‑projection G⊥C; (3) a bipartite star‑gazers graph G_S derived from Watch events; and (4) a contributors graph G_N based on commit authorship within Push events. All degree distributions (contributors per project, watchers per project, followers per user) exhibit heavy‑tailed, power‑law‑like behavior, confirming that a small minority of projects and users dominate activity.
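To make the user projection of the bipartite collaborators graph concrete, the following is a minimal sketch, not the authors' implementation. It assumes the bipartite graph is available as (user, repository) pairs; the function name `project_users` and the toy data are hypothetical. Two users are linked when they have written to at least one common repository, and the edge weight counts the repositories they share.

```python
from collections import defaultdict
from itertools import combinations


def project_users(user_repo_edges):
    """Project a bipartite user-repository graph onto its user side.

    Returns a dict mapping a sorted (user_a, user_b) pair to the number
    of repositories the two users have in common.
    """
    # Group users by the repository they contributed to.
    members = defaultdict(set)
    for user, repo in user_repo_edges:
        members[repo].add(user)

    # Every pair of users sharing a repository gets one unit of weight.
    weights = defaultdict(int)
    for users in members.values():
        for a, b in combinations(sorted(users), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical toy data, not taken from the paper's dataset.
edges = [
    ("alice", "repo1"), ("bob", "repo1"), ("carol", "repo1"),
    ("alice", "repo2"), ("bob", "repo2"),
]
projection = project_users(edges)
for pair, weight in sorted(projection.items()):
    print(pair, weight)
```

Note that a repository with k collaborators contributes k(k-1)/2 projected edges, so at the scale reported here a few very large projects would dominate the projection; a real analysis would likely need to cap or sample them.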
The followers network is extremely sparse (density ≈ 4.5 × 10⁻⁶), with an average degree of only 3.0. One possible reading is that following on GitHub carries a relatively high "cost" (many notifications). Reciprocity is strikingly low: only 9.6% of follower pairs are mutual, far below the values reported for Twitter (≈22%), Flickr (≈68%) or Yahoo! 360 (≈84%). This suggests that GitHub's follow mechanism functions more as an information-subscription service than as a typical social tie.
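The density and average degree quoted above follow directly from the node and edge counts of the followers graph (671 k nodes, 2.03 M edges). The sketch below reproduces that arithmetic and shows one way to compute reciprocity. The reciprocity definition used here (mutual pairs divided by connected pairs) is an assumption; the summary does not say which variant the authors used, and the toy edge list is hypothetical.

```python
def density(num_nodes, num_edges):
    """Density of a directed graph without self-loops: E / (N * (N - 1))."""
    return num_edges / (num_nodes * (num_nodes - 1))


def average_degree(num_nodes, num_edges):
    """Average out-degree (equal to average in-degree) of a directed graph."""
    return num_edges / num_nodes


def reciprocity(edges):
    """Fraction of connected user pairs that follow each other.

    Assumed definition: mutual pairs / pairs with at least one edge.
    """
    edge_set = {(u, v) for u, v in edges if u != v}
    if not edge_set:
        return 0.0
    connected = {frozenset(e) for e in edge_set}
    mutual = {frozenset((u, v)) for u, v in edge_set if (v, u) in edge_set}
    return len(mutual) / len(connected)


# Counts reported in the summary for the followers graph.
print(density(671_000, 2_030_000))         # about 4.5e-6
print(average_degree(671_000, 2_030_000))  # about 3.0

# Hypothetical toy follow edges (follower, followed).
toy = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")]
print(reciprocity(toy))
```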
Activity versus popularity was examined by correlating the number of authored events per user with their follower count. The relationship is weak; highly active contributors do not automatically attract many followers, and many users with large follower bases contribute relatively little code. This decoupling highlights the platform’s dual nature: reputation can be built through visibility rather than pure code output.
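One common way to quantify such a relationship for heavy-tailed counts is a rank correlation. The sketch below implements Spearman's coefficient with the standard library only. This is an illustrative choice: the summary does not state which correlation measure the authors used, and the per-user numbers are made up.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-user counts: authored events and followers.
events = [5, 120, 3000, 40, 900]
followers = [2, 0, 15, 300, 1]
print(spearman(events, followers))
```

A value near zero on real data would be consistent with the weak relationship described above; values near +1 or -1 would indicate a strong monotonic link.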
Geographic analysis leveraged the optional “location” field in user profiles. Of the 2.19 M users, 345 k supplied a non‑empty location, which the authors geocoded using the MapQuest Open Geocoding API (≈10 % error rate). Mapping these users revealed a global distribution across all continents. Distance analysis showed a negative correlation between geographic separation and collaboration intensity: developers tend to fork, submit pull requests, and commit more frequently with nearby peers. Projects with few collaborators are geographically concentrated, whereas highly collaborative projects have contributors spread worldwide.
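Once locations have been geocoded to latitude and longitude, the distance between two collaborators can be computed with the haversine formula. The sketch below is a generic great-circle calculation, not code from the paper; it assumes coordinates in decimal degrees and a spherical Earth with a mean radius of 6371 km, and the city coordinates are approximate illustrative values.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates, for illustration only.
london = (51.5074, -0.1278)
paris = (48.8566, 2.3522)
print(round(haversine_km(*london, *paris)), "km")
```

Pairwise distances computed this way can then be binned and compared against collaboration counts to examine how interaction varies with separation.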
The authors also investigated the structure of fork trees. Most forks form shallow, narrow trees; only a few “key” repositories generate deep, wide fork cascades. This indicates that while forking is a core mechanism for contribution, the majority of collaborative effort remains focused on a limited set of central projects.
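The notions of "shallow" and "narrow" can be made precise as the depth of a fork tree and the size of its largest level. The following sketch computes both with a level-by-level traversal. It assumes fork relationships are available as (parent, fork) pairs forming a tree; the function name `fork_tree_shape` and the repository names are hypothetical.

```python
from collections import defaultdict


def fork_tree_shape(root, forks):
    """Return (depth, max_width) of the fork tree rooted at `root`.

    `forks` is an iterable of (parent_repo, fork_repo) pairs, assumed to
    be acyclic. Depth counts edges on the longest root-to-leaf path;
    max_width is the largest number of repositories on any one level.
    """
    children = defaultdict(list)
    for parent, child in forks:
        children[parent].append(child)

    depth, max_width = 0, 1
    level = [root]
    while True:
        next_level = [c for repo in level for c in children.get(repo, [])]
        if not next_level:
            break
        depth += 1
        max_width = max(max_width, len(next_level))
        level = next_level
    return depth, max_width


# Hypothetical fork relationships.
forks = [
    ("origin/app", "u1/app"), ("origin/app", "u2/app"),
    ("origin/app", "u3/app"), ("u1/app", "u4/app"),
]
print(fork_tree_shape("origin/app", forks))
```

A repository that was never forked yields depth 0 and width 1, which is the degenerate case of the shallow trees described above.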
Limitations include the temporal bias of the dataset (events prior to March 2012 are absent), exclusion of private repositories (due to API restrictions), and potential inaccuracies in self‑reported location data. Despite these constraints, the study provides a comprehensive baseline of GitHub’s social and collaborative dynamics, complementing prior qualitative work on developer motivations.
The paper concludes by emphasizing GitHub’s unique blend of social networking and software engineering, suggesting future work on longitudinal network evolution, integration of private repository data, and linking network metrics with code quality indicators.