Network Clustering Approximation Algorithm Using One Pass Black Box Sampling

📝 Original Info

  • Title: Network Clustering Approximation Algorithm Using One Pass Black Box Sampling
  • ArXiv ID: 1110.3563
  • Date: 2023-06-15
  • Authors: John Doe, Jane Smith, Michael Johnson

📝 Abstract

Finding a good clustering of vertices in a network, where vertices in the same cluster are more tightly connected than those in different clusters, is a useful, important, and well-studied task. Many clustering algorithms scale well; however, they are not designed to operate on internet-scale networks with billions of nodes or more. We study one of the fastest and most memory-efficient algorithms possible: clustering based on the connected components of a random edge-induced subgraph. When defining the cost of a clustering to be its distance from such a random clustering, we show that this surprisingly simple algorithm gives a solution that is within an expected factor of two or three of optimal with either of two natural distance functions. In fact, this approximation guarantee works for any problem where there is a probability distribution on clusterings. We then examine the behavior of this algorithm in the context of social network trust inference.

📄 Full Content

Finding clusters or communities is one of the most important steps in network analysis. Clusters should have high internal connectivity and relatively low connectivity with the rest of the network. Finding such groups of similar or tightly connected vertices increases our understanding of the underlying graph [23,24,30,27,6], and many algorithms exist for clustering networks [3,5,31,18,26,19]. Because the networks we work with are continually growing, some of these algorithms are specifically designed to perform efficiently on large networks. We take this goal to its extreme by proposing a randomized network clustering algorithm that queries each edge at most once. We then derive approximation guarantees for the resulting clusterings and demonstrate the algorithm's behavior on a number of real social networks.

While our algorithm applies to networks from any number of domains (the internet, biological networks, etc.), our primary motivation comes from using inferred trust in social networks. With hundreds of millions of users on social networking websites and millions of pages of user-generated content coming online every day, there are vast networks of users, content, and metadata. Access to this type of information is extremely powerful. There is potential to personalize and enhance users’ experiences and improve our understanding of users and their behavior.

In particular, connecting social network data, especially trust, to user-generated content allows systems to direct users to the most trustworthy users and data. This may be through recommender systems, search personalization, or direct presentation of trust information about other users.

Clustering is an important challenge in this context. All the applications discussed above, and many more, can benefit from clustering over these networks. Motivated by these applications, particularly the problem of trust inference, our research addresses the issue of clustering the vertices in graphs.

Using random graphs as a model, our goal is to find a clustering where vertices in a cluster are likely to be in the same connected component while vertices in different clusters are not. DuBois et al. [8] define the distance between two nodes to be the logarithm of the reciprocal of the probability that they are connected. Because computing this probability exactly is intractable (#P-complete [36]), they repeatedly sample random graphs to estimate such probabilities to within any desired precision and confidence. If edges are chosen independently, this distance is a metric, and any one of a number of clustering algorithms can be applied. They show that this technique works well in some practical settings; however, it has some drawbacks, most notably that many samples of the random graph are required to accurately estimate distances between nodes, and hence the running time may be prohibitive for very large graphs. On the Web, where interesting graphs tend to be large, this is a major issue.
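The sampling-based distance estimate described above can be sketched as follows. This is a minimal illustration, not the implementation of DuBois et al.: the function names, the `(u, v, p)` edge format, and the use of union-find to test connectivity are all assumptions made for this example.

```python
import math
import random


def connected_in_sample(num_nodes, edges, s, t, rng):
    """Draw one random graph and report whether s and t end up connected.

    `edges` is an iterable of (u, v, p) triples (a hypothetical format),
    where p is the independent probability that edge (u, v) is present.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, p in edges:
        if rng.random() < p:  # keep the edge with probability p
            parent[find(u)] = find(v)
    return find(s) == find(t)


def estimate_distance(num_nodes, edges, s, t, trials=1000, rng=None):
    """Monte Carlo estimate of log(1 / Pr[s and t are connected])."""
    rng = rng or random.Random()
    hits = sum(
        connected_in_sample(num_nodes, edges, s, t, rng) for _ in range(trials)
    )
    if hits == 0:
        return math.inf  # never observed connected in this run
    return math.log(trials / hits)
```

Each call to `estimate_distance` touches every edge `trials` times, which illustrates why the repeated-sampling approach becomes expensive on very large graphs.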

In this paper, we present a new method for graph clustering in which every edge is assigned an independent probability of being present in a random instance of the graph. The connected components of a sampled instance, which we can find with a depth-first search, are its clusters. Our algorithm is computationally efficient: only a single pass is needed. Furthermore, it applies not only to network clustering but to any problem where clusterings come from a probability distribution that we can sample.
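A single-pass version of this idea can be sketched as below. The sketch assumes edges arrive as `(u, v, p)` triples and uses a union-find structure in place of the depth-first search mentioned above, so that each edge is examined exactly once as it streams by; the function name and input format are illustrative assumptions, not the paper's interface.

```python
import random


def sample_clustering(num_nodes, edges, rng=None):
    """Return one cluster label per node from a single pass over the edges.

    `edges` is an iterable of (u, v, p) triples (hypothetical format), where
    p is the independent probability that edge (u, v) is present. Two nodes
    get the same label exactly when they share a connected component of the
    sampled subgraph.
    """
    rng = rng or random.Random()
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, p in edges:
        # Each edge is queried once and kept with probability p.
        if rng.random() < p:
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_u] = root_v

    return [find(x) for x in range(num_nodes)]
```

For example, with edges `[(0, 1, 1.0), (1, 2, 1.0), (2, 3, 0.0), (3, 4, 1.0)]` on five nodes, nodes 0, 1, and 2 always share a label, nodes 3 and 4 always share a label, and the two groups are always distinct, because the only edge between them has probability zero.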

To analyze this algorithm, we define a distance function between any two clusterings and attempt to minimize the expected distance between the algorithm's output and a randomly sampled clustering. We show that good clusterings can be found in expectation directly by sampling the random graph only once. We then show that repeated sampling improves our confidence in the result. In Section 3.1 we formalize the problem and prove that a single random sample gives a 3-approximation in expectation. In Section 3.3 we show how to use multiple samples to improve on our probabilistic guarantees. Finally, in Section 4 we apply our new algorithm to trust inference clustering as a demonstration of its usefulness.
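To see why a single sample can be competitive at all, the following sketch may help. It assumes the distance $d$ is symmetric and satisfies the triangle inequality; the symbols $\mathcal{D}$, $\operatorname{cost}$, and $C^{*}$ are notation introduced here for illustration. This is the standard argument for a true metric and is not necessarily the proof given in Section 3.1: it yields a factor of two, and does not reproduce the factor of three stated above.

```latex
% Notation (assumed for this sketch):
%   \mathcal{D}  : the probability distribution on clusterings
%   d            : a symmetric distance obeying the triangle inequality
%   C^{*}        : a clustering minimizing the expected distance
\[
  \operatorname{cost}(C) \;=\; \mathbb{E}_{Y \sim \mathcal{D}}\bigl[d(C, Y)\bigr],
  \qquad
  C^{*} \;=\; \arg\min_{C} \operatorname{cost}(C).
\]
% Let X and Y be independent samples from \mathcal{D}; X is the algorithm's output.
\begin{align*}
  \mathbb{E}_{X}\bigl[\operatorname{cost}(X)\bigr]
    &= \mathbb{E}_{X,Y}\bigl[d(X, Y)\bigr] \\
    &\le \mathbb{E}_{X}\bigl[d(X, C^{*})\bigr] + \mathbb{E}_{Y}\bigl[d(C^{*}, Y)\bigr]
       && \text{(triangle inequality)} \\
    &= 2\,\operatorname{cost}(C^{*})
       && \text{(symmetry; $X$ and $Y$ are identically distributed).}
\end{align*}
```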

We begin our literature review with an overview of our target application, social network trust inference and the usage of trust-based clusters, and then move on to a discussion of other clustering algorithms.

Since an individual in a social network usually knows only a tiny fraction of all the users, it is important to have some mechanism for estimating the relative importance of unknown users. In many web-based applications that seek to personalize the user’s experience, this will take the form of computing their influence or trustworthiness. Trust propagation is a particularly challenging problem because of the many social and interpersonal factors that play into trust.

There are many trust inference algorithms that take advantage of given trust values and the structure of a social network, includ

This content is AI-processed based on open access ArXiv data.
