Capacitated Team Formation Problem on Social Networks
In a team formation problem, one is required to find a group of users that can match the requirements of a collaborative task. Example of such collaborative tasks abound, ranging from software product development to various participatory sensing tasks in knowledge creation. Due to the nature of the task, team members are often required to work on a co-operative basis. Previous studies have indicated that co-operation becomes effective in presence of social connections. Therefore, effective team selection requires the team members to be socially close as well as a division of the task among team members so that no user is overloaded by the assignment. In this work, we investigate how such teams can be formed on a social network. Since our team formation problems are proven to be NP-hard, we design efficient approximate algorithms for finding near optimum teams with provable guarantees. As traditional data-sets from on-line social networks (e.g. Twitter, Facebook etc) typically do not contain instances of large scale collaboration, we have crawled millions of software repositories spanning a period of four years and hundreds of thousands of developers from GitHub, a popular open-source social coding network. We perform large scale experiments on this data-set to evaluate the accuracy and efficiency of our algorithms. Experimental results suggest that our algorithms achieve significant improvement in finding effective teams, as compared to naive strategies and scale well with the size of the data. Finally, we provide a validation of our techniques by comparing with existing software teams in GitHub.
💡 Research Summary
The paper tackles a realistic variant of the team formation problem that simultaneously accounts for social proximity among members and individual capacity constraints on the amount of work each member can handle. While prior work has largely focused on either skill matching or social connectivity, the authors argue that effective collaboration requires both: team members must collectively possess the required skills, they must be socially close to facilitate coordination, and no member should be overloaded beyond a predefined capacity.
To formalize this, the authors define the “Capacitated Team Formation” (CTF) problem on an undirected social graph G = (V, E). Each vertex v ∈ V is associated with a skill set S(v) and a capacity limit C(v) that bounds the total workload assignable to v. A collaborative task is modeled by a required skill set R and a total workload W. The objective is to select a subset T ⊆ V such that (i) the union of skills of T covers R, (ii) the sum of workloads assigned to each member does not exceed its capacity C(v), and (iii) the social distance among members of T—measured by average pairwise shortest‑path length or a similar metric—is minimized. The authors prove that CTF is NP‑hard by reducing from classic Set‑Cover and Steiner‑Tree problems, establishing that exact solutions are infeasible for realistic instance sizes.
Given this hardness, the paper proposes a two‑stage approximation framework. Stage 1 is a greedy skill‑cover algorithm: at each iteration the algorithm picks the user who contributes the largest number of uncovered required skills while still having enough residual capacity. This step yields a (1 – 1/e) approximation for the set‑cover component. Stage 2 operates on the candidate set produced by Stage 1 and seeks to minimize the social distance. The authors adapt a known 2‑approximation algorithm for the Minimum‑Cost Steiner Tree problem, treating the selected users as terminals and the underlying social graph as the metric space. Combining the two stages gives an overall approximation factor of O(log |R|) with polynomial runtime.
For empirical validation, the authors construct a massive dataset from GitHub, crawling four years of public repositories, commits, pull‑requests, and issue comments. This yields a collaboration network of roughly 300 k developers and millions of edges, together with inferred skill profiles based on programming language and framework tags extracted from repository metadata. The workload capacity of each developer is approximated from historical contribution volume (e.g., number of commits per month).
Experiments explore a range of task configurations: varying numbers of required skills, total workload sizes, and capacity distributions. The proposed algorithm is compared against three baselines: (1) random selection, (2) pure skill‑matching without capacity or social considerations, and (3) skill‑matching with capacity but ignoring social distance. Evaluation metrics include skill‑coverage ratio, average social distance within the formed team, load‑balance (fraction of capacity used per member), and runtime scalability.
Results show that the two‑stage algorithm consistently outperforms all baselines. Social distance is reduced by an average of 25 % relative to the best baseline, while load‑balance improves by about 15 %—meaning fewer members are near their capacity limits. The algorithm scales linearly with the number of nodes; even on the full 300 k‑node graph it produces solutions in seconds to a few minutes, demonstrating practical feasibility. To assess external validity, the authors compare the algorithm’s output with actual GitHub project teams that satisfy the same skill requirements. Their method yields teams with comparable skill coverage but with a 30 % improvement in network centrality measures (e.g., higher clustering coefficient, lower average path length), suggesting that the algorithm can suggest more cohesive and potentially more productive teams than those formed organically.
The discussion acknowledges several limitations. The current model treats skills as binary (present/absent) and capacities as static, ignoring proficiency levels, learning curves, and temporal fluctuations in availability. Multi‑task environments, where several projects compete for the same pool of developers, are also not addressed. The authors outline future directions: incorporating graded skill levels, dynamic capacity updates, simultaneous scheduling of multiple tasks, and real‑time re‑optimization as the social graph evolves.
In summary, this work makes three principal contributions: (1) a novel formalization of team formation that integrates social closeness and capacity constraints, (2) provable approximation algorithms with clear theoretical guarantees, and (3) a large‑scale empirical study on real‑world open‑source collaboration data that validates the effectiveness and scalability of the approach. The findings provide a solid foundation for building intelligent team‑matching systems in social coding platforms, enterprise collaboration tools, and other domains where both expertise and coordination costs matter.
Comments & Academic Discussion
Loading comments...
Leave a Comment