Joint Inference of Multiple Label Types in Large Networks

Joint Inference of Multiple Label Types in Large Networks
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

We tackle the problem of inferring node labels in a partially labeled graph where each node in the graph has multiple label types and each label type has a large number of possible labels. Our primary example, and the focus of this paper, is the joint inference of label types such as hometown, current city, and employers, for users connected by a social network. Standard label propagation fails to consider the properties of the label types and the interactions between them. Our proposed method, called EdgeExplain, explicitly models these, while still enabling scalable inference under a distributed message-passing architecture. On a billion-node subset of the Facebook social network, EdgeExplain significantly outperforms label propagation for several label types, with lifts of up to 120% for recall@1 and 60% for recall@3.


💡 Research Summary

The paper addresses the problem of inferring multiple label types for nodes in a massive partially‑labeled graph, using the example of Facebook users whose profiles contain fields such as hometown, current city, and employer. Traditional label propagation (LP) treats each label independently and assumes that friends are similar across all dimensions, which fails when each edge is typically formed for a single dominant reason (e.g., shared high school). To overcome this limitation, the authors propose EdgeExplain, a probabilistic model that treats each edge as needing an “explanation” drawn from a limited set of label types.

Formally, for each node u and label type t, binary variables S_utℓ indicate whether u possesses label ℓ in type t. For an edge (u, v), the compatibility score for type t is r(u,v,t)=∑_{ℓ∈L(t)} S_utℓ·S_vℓ, i.e., the count of shared labels of that type. The model then applies a softmax‑like function, implemented with a sigmoid σ(α·∑_t r(u,v,t)+c), to map the total compatibility into a probability that the edge is “explained”. Parameter α controls how much evidence is required to deem an edge explained; a larger α lets a single matching label type suffice. Parameter c captures the probability that an edge is explained by an unknown label type (set to zero in experiments).

The objective is to maximize the joint likelihood P(S_V, S_H) = (1/Z) ∏_{u∼v} σ(α·∑_t r(u,v,t)+c). This is optimized via an iterative gradient‑based message‑passing algorithm: each node receives from its neighbors the current estimates of label probabilities, updates its own distribution to increase the likelihood of explaining its incident edges, and repeats until convergence. Because updates involve only local neighbor information, the method fits naturally into distributed graph‑processing frameworks (e.g., Pregel, GraphX), scaling linearly with the number of edges.

Empirical evaluation uses a billion‑node subset of the Facebook social graph, with five label types (hometown, high school, college, current city, employer) each having millions of possible values. EdgeExplain is compared against standard LP applied independently to each label type. Results show substantial gains: recall@1 improves by up to 120 % and recall@3 by up to 60 % across the label types. Ablation studies reveal that larger α values consistently perform better, confirming that a single matching label type is often sufficient to explain an edge, validating the mutual‑exclusivity assumption (P2). Setting c=0 yields the best performance, indicating that the chosen label types indeed capture most reasons for friendship formation (property P1).

The authors also discuss extensions: incorporating additional node attributes (group memberships, interests), adding new label types, or replacing the sigmoid with more expressive functions to model complex interactions. The inferred “reason” for each edge can be used directly for link prediction, friend recommendation, or targeted advertising, demonstrating practical utility beyond pure label inference.

In summary, EdgeExplain introduces a novel perspective on graph‑based semi‑supervised learning by framing label inference as an edge‑explanation problem. It explicitly models the limited, mostly mutually exclusive reasons behind connections, handles high‑dimensional label spaces, and scales to billions of nodes through distributed message passing. The method outperforms traditional label propagation on real‑world large‑scale data, offering a powerful tool for multi‑attribute inference in social networks and other relational domains.


Comments & Academic Discussion

Loading comments...

Leave a Comment