Scalable Similarity Search over Large Attributed Bipartite Graphs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Bipartite graphs are widely used to model relationships between entities of different types, where nodes are divided into two disjoint sets. Similarity search, a fundamental operation that retrieves nodes similar to a given query node, plays a crucial role in various real-world applications, including machine learning and graph clustering. However, existing state-of-the-art methods often struggle to accurately capture the unique structural properties of bipartite graphs or fail to incorporate informative node attributes, leading to suboptimal performance. Moreover, their high computational complexity limits scalability, making them impractical for large graphs with millions of nodes and tens of thousands of attributes. To overcome these challenges, we first introduce Attribute-augmented Hidden Personalized PageRank (AHPP), a novel random walk model designed to seamlessly blend higher-order bipartite structural proximity with attribute similarity. We then formulate similarity search over attributed bipartite graphs as an approximate AHPP problem and propose two efficient push-style local algorithms with provable approximation guarantees. Finally, extensive experiments on real-world and synthetic datasets validate the effectiveness of AHPP and the efficiency of our proposed algorithms when compared with fifteen competitors.

💡 Research Summary

This paper addresses the critical challenge of performing accurate and scalable similarity search on large-scale attributed bipartite graphs. Bipartite graphs, which model relationships between two distinct types of entities (e.g., users and products), are ubiquitous in real-world applications. However, existing similarity search methods often fail to capture the unique two-hop structural proximity inherent in bipartite graphs, neglect valuable node attribute information, and suffer from high computational complexity that limits scalability.

To overcome these limitations, the authors first propose a novel similarity measure called Attribute-augmented Hidden Personalized PageRank (AHPP). AHPP seamlessly integrates both higher-order structural information and node attribute similarity into a unified random walk model. It builds upon the Hidden Personalized PageRank (HPP), designed for plain bipartite graphs, by introducing a new attribute transition matrix. This matrix models the probability of transitioning between two nodes based on their shared attributes. The final state transition matrix for the random walk is a weighted combination of the original structural transition matrix (from HPP) and the new attribute transition matrix, controlled by a parameter β. This allows AHPP to flexibly balance the influence of network structure and node attributes. The AHPP score between two nodes is defined as the Personalized PageRank on this combined matrix, effectively summarizing multi-hop connections via both structure and attributes. The model is highly interpretable, as the score corresponds to the probability of an “α-attribute-augmented hidden random walk” stopping at the target node.
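To make the combined random walk concrete, here is a minimal dense sketch in Python. It is not the paper's implementation, and several details are assumptions made for illustration: the structural transition is taken to be a two-hop walk U → V → U with row-normalized steps, the attribute transition is a two-hop walk through shared attributes, `beta` is taken to weight the attribute part, and `alpha` is the per-step stopping probability of the Personalized PageRank. The names `row_normalize`, `ahpp_dense`, `B`, and `X` are hypothetical.

```python
import numpy as np

def row_normalize(m):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    m = np.asarray(m, dtype=float)
    sums = m.sum(axis=1, keepdims=True)
    return np.divide(m, sums, out=np.zeros_like(m), where=sums > 0)

def ahpp_dense(B, X, source, alpha=0.15, beta=0.5, iters=200):
    """Reference scores from `source` to every node in U (dense, small graphs only).

    B: |U| x |V| biadjacency matrix, X: |U| x d nonnegative attribute matrix.
    Assumption: P = (1 - beta) * P_struct + beta * P_attr.
    """
    B = np.asarray(B, dtype=float)
    X = np.asarray(X, dtype=float)
    # Two-hop structural step U -> V -> U (assumed form of the HPP transition).
    P_struct = row_normalize(B) @ row_normalize(B.T)
    # Two-hop attribute step U -> attribute -> U (assumed form).
    P_attr = row_normalize(X) @ row_normalize(X.T)
    P = (1.0 - beta) * P_struct + beta * P_attr

    n = B.shape[0]
    walk = np.zeros(n)      # mass of walks that are still moving
    walk[source] = 1.0
    scores = np.zeros(n)    # mass of walks that have stopped at each node
    for _ in range(iters):
        scores += alpha * walk              # stop with probability alpha
        walk = (1.0 - alpha) * (walk @ P)   # otherwise take one combined step
    return scores
```

Calling `ahpp_dense(B, X, source=0)` returns one score per node in U. Because this sketch builds the full |U| × |U| matrix `P`, it only serves as a reference on small graphs, which is exactly the cost the local algorithms are designed to avoid.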

The primary computational challenge is that materializing the combined transition matrix explicitly requires O(|U|^2) time, which is infeasible for large graphs. The authors formulate the problem as an ε-approximate single-source AHPP query and develop two efficient local push-based algorithms to solve it without constructing the full matrix.

The first algorithm, Alternating Propagation Push (APP), adapts the principles of the classic Forward Push algorithm to the AHPP setting. Its core innovation lies in propagating residuals directly on the original bipartite graph and attribute associations, leveraging the equivalence of residue propagation. This avoids explicit matrix construction. The second algorithm, Adaptive Synchronous Residue Push (ASRP), enhances APP by employing a synchronous push strategy and a more effective termination threshold. ASRP achieves near-linear time complexity while maintaining provable approximation guarantees, making it highly scalable.
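The idea of pushing residuals through the bipartite edges and attribute associations, rather than through a materialized matrix, can be illustrated with a generic forward-push sketch. This is neither APP nor ASRP as defined in the paper: the push order, the absolute threshold `r_max`, and the two-hop transition forms are assumptions chosen for brevity, and the names `push_ahpp`, `B`, and `X` are hypothetical. Dense NumPy arrays are used to keep the example short; with sparse rows, each push would touch only the neighborhood of the pushed node.

```python
import numpy as np

def row_normalize(m):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    m = np.asarray(m, dtype=float)
    sums = m.sum(axis=1, keepdims=True)
    return np.divide(m, sums, out=np.zeros_like(m), where=sums > 0)

def push_ahpp(B, X, source, alpha=0.15, beta=0.5, r_max=1e-6):
    """Forward-push-style approximation that never forms the |U| x |U| matrix.

    Returns (reserve, residue): reserve holds the approximate scores,
    residue holds the probability mass that has not been pushed yet.
    """
    B = np.asarray(B, dtype=float)
    X = np.asarray(X, dtype=float)
    # Only the four rectangular factors are stored (assumed two-hop forms).
    F_s, G_s = row_normalize(B), row_normalize(B.T)   # U -> V and V -> U
    F_a, G_a = row_normalize(X), row_normalize(X.T)   # U -> attr and attr -> U

    n = B.shape[0]
    reserve = np.zeros(n)
    residue = np.zeros(n)
    residue[source] = 1.0
    while True:
        u = int(np.argmax(residue))
        mass = residue[u]
        if mass <= r_max:            # every residue is below the threshold
            break
        residue[u] = 0.0
        reserve[u] += alpha * mass   # the walk stops at u with probability alpha
        # Push the rest two hops: via V with weight 1 - beta, via attributes with beta.
        via_structure = (F_s[u] @ G_s) * (1.0 - beta)
        via_attributes = (F_a[u] @ G_a) * beta
        residue += (1.0 - alpha) * mass * (via_structure + via_attributes)
    return reserve, residue
```

Each push moves at least `alpha * r_max` of mass into `reserve`, so the loop terminates. When every node in U has at least one edge and one attribute, `reserve.sum() + residue.sum()` stays at 1, and the leftover residue bounds how far each reserve entry can be from the exact score.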

Extensive experiments on real-world and synthetic datasets demonstrate the superiority of the proposed approach. In terms of effectiveness, AHPP consistently outperforms 15 competitors, including Node2Vec, Metapath2Vec, and HPP, across multiple tasks such as clustering consistency validation, top-k precision, and link prediction. In clustering, it improves F1-score by 8% to 12%. In terms of efficiency, the ASRP algorithm is 1 to 2 orders of magnitude faster than all baseline methods and scales stably as graph size increases. An analysis of parameter β confirms that it effectively controls the trade-off between structure and attribute emphasis.

In summary, this work makes significant contributions by introducing a principled and interpretable similarity model for attributed bipartite graphs and designing highly scalable algorithms for its computation, thereby enabling practical similarity search on massive real-world networks.
