Complex networks are at the core of an intense research activity. However, in most cases, intricate and costly measurement procedures are needed to explore their structure. In some cases, these measurements rely on link queries: given two nodes, it is possible to test the existence of a link between them. These tests may be costly, and thus minimizing their number while maximizing the number of discovered links is a key issue. This paper studies this problem: we observe that properties classically observed on real-world complex networks give hints for their efficient measurement; we derive simple principles and several measurement strategies based on this, and experimentally evaluate their efficiency on real-world cases. In order to do so, we introduce methods to evaluate the efficiency of strategies. We also explore the bias that different measurement strategies may induce.
Efficient Measurement of Complex Networks Using Link Queries
Complex networks, modeled as large graphs, are everywhere in science, society, and everyday life. However, it must be clear that most real-world complex networks are not directly available: collecting information on their structure generally relies on intricate and expensive measurement procedures. Conducting such a measurement is often a challenge in itself, and is an important part of the work needed to study a complex network.
In general, complex network measurements consist of a combination of a few simple measurement primitives. In several cases, this primitive consists in testing the existence of a link, which we call a link query: given two nodes u and v, a measurement operation makes it possible to decide whether there is a link between them or not. This simple test may be expensive (in terms of the resources or time it requires, or the load it induces on the network, for instance), and so conducting measurements with as few calls to the measurement primitive as possible is a key issue.
For instance, in online social networks like Facebook or Flickr (http://www.facebook.com/ and http://www.flickr.com/), privacy concerns and reduction of server load often lead to limitations in the queries that one is allowed to perform to explore networks between users. Link queries are however allowed in most cases. Likewise, measurements of real-world social networks often rely on interviews, in which link queries play a central role [1]. In biological networks like protein interactions or gene regulatory networks, link queries also play a key role [2], [3].
In all these contexts, and others, link queries are very expensive: they impose a significant load on the servers running online social network software, and their number is generally bounded; they have a significant cost for interviewers and participants in sociological studies; or they require costly biological experiments, depending on the case.
In this paper, we formalise this problem as follows: given a graph G = (V, E), we want to define strategies (ordered lists of link queries) which lead to the discovery of as many links of the network as possible. In other words, we want to minimize the number of link queries while maximizing the number of observed links, i.e. the number of positive answers to these tests.
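As an illustration of this formalisation, the following Python sketch simulates a measurement on a toy graph. It is only a sketch under stated assumptions: the names `make_oracle`, `run_strategy` and `budget` are hypothetical and do not come from the paper, the hidden edge set stands in for the real network, and the exhaustive enumeration of pairs is a naive baseline rather than one of the strategies studied here.

```python
from itertools import combinations


def make_oracle(edges):
    """Build a link-query function over a hidden undirected edge set.

    The hidden set plays the role of E; a real measurement would replace
    this function with an actual (costly) test on the network.
    """
    hidden = {frozenset(e) for e in edges}

    def link_query(u, v):
        # A link query: does a link exist between u and v?
        return frozenset((u, v)) in hidden

    return link_query


def run_strategy(strategy, link_query, budget=None):
    """Perform the queries of `strategy` in order.

    `strategy` is an ordered list of node pairs. `budget` (an assumed
    parameter) caps the number of queries. Returns the discovered links
    and the number of queries actually performed.
    """
    discovered = []
    used = 0
    for u, v in strategy:
        if budget is not None and used >= budget:
            break
        used += 1
        if link_query(u, v):
            discovered.append((u, v))
    return discovered, used


# Toy example: a path on four nodes, explored by enumerating pairs in order.
nodes = [0, 1, 2, 3]
oracle = make_oracle([(0, 1), (1, 2), (2, 3)])
exhaustive = list(combinations(nodes, 2))
found, used = run_strategy(exhaustive, oracle, budget=4)
print(found, used)
```

With a budget of four queries, this naive ordering finds two of the three links; a better ordering of the same pairs would find all three, which is the kind of gain a measurement strategy aims for.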
In order to do so, we will rely on simple intuitions derived from statistical properties observed on most real-world complex networks, which we discuss in Section II. We then propose several measurement strategies in Section III based on these principles. We also need a way to compare and evaluate measurement strategies, see Section IV. We finally use this to experimentally evaluate proposed strategies in Section V.
Before entering the core of this paper, we give the needed formalism and notations, and discuss related work.
Throughout the paper, we will consider an undirected graph G = (V, E), with n = |V| nodes and m = |E| links. We suppose that all the nodes are known, and focus on link discovery only. In other words, we know V but know nothing about E (although we will make some statistical assumptions in accordance with classical empirical observations in the field, see Section II).
We will denote by N(v) the set of neighbors of v ∈ V: N(v) = {u ∈ V : (u, v) ∈ E}.
A measurement consists of a series of link queries, i.e. tests of the existence of link (u, v) for two nodes u and v in V. At a given stage in such a measurement, one has already discovered a set of links, which we will denote by E′ ⊆ E. The set of extremities of links in E′ will be denoted by V′ ⊂ V. Notice that, although we know V, in general V′ ≠ V. We will also denote by n′ the number of nodes in V′ and by m′ the number of links discovered so far: n′ = |V′| and m′ = |E′|.
Notice that E′, V′, n′, m′ and d′ vary during a measurement; however, the context will make it clear which value we consider.
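To make this notation concrete, here is a minimal Python sketch of the bookkeeping involved in a measurement. The class `MeasurementState` and its attribute names are hypothetical, chosen only to mirror the symbols V, E′, V′, n′ and m′; the paper does not prescribe any particular implementation.

```python
class MeasurementState:
    """Bookkeeping for a measurement in progress (illustrative names)."""

    def __init__(self, nodes):
        self.V = set(nodes)      # V: all nodes are known in advance
        self.E_prime = set()     # E': links discovered so far
        self.V_prime = set()     # V': extremities of discovered links
        self.queries = 0         # number of link queries performed

    def record(self, u, v, exists):
        """Record the answer to the link query on the pair (u, v)."""
        if u not in self.V or v not in self.V or u == v:
            raise ValueError("a query must involve two distinct known nodes")
        self.queries += 1
        if exists:
            # Undirected graph: store each link as an unordered pair.
            self.E_prime.add(frozenset((u, v)))
            self.V_prime.update((u, v))

    @property
    def n_prime(self):
        """n' = |V'|."""
        return len(self.V_prime)

    @property
    def m_prime(self):
        """m' = |E'|."""
        return len(self.E_prime)


# Three queries on a five-node graph, two of which are positive.
state = MeasurementState(range(5))
state.record(0, 1, True)
state.record(0, 2, False)
state.record(1, 3, True)
print(state.n_prime, state.m_prime, state.queries)
```

In this example V′ = {0, 1, 3} while V = {0, 1, 2, 3, 4}, which shows why V′ generally differs from V even though all nodes are known.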
This work belongs to the field of complex network metrology, which has mostly focused on the specific case of the Internet topology until now, see for instance [4]-[10]. This area of research aims mainly at evaluating the relevance of collected complex network samples and of the properties observed on them, and at correcting these observations. Viewing the measurement as the combination of many instances of a simple primitive (link queries, here) which we want to optimize is new, and is an important contribution of this paper.
Another related problem is that of link prediction: given a network in which new links may appear, one wants to predict which new links will appear in the future based on currently existing ones [11], [12]. In this context, authors use properties of the known network to infer probable future links, which is similar to what we do below in the measurement context. The main difference lies in the fact that very little of the network topology is known in our case.
Our goal is to design measurement strategies based on link queries (test of the existence of a link between two given nodes) which will minimize the number of such queries and maximize