In this paper, we generalize a recently introduced Expectation Maximization (EM) method for graphs and apply it to content-based networks. The EM method provides a classification of the nodes of a graph and allows one to infer relations between the different classes. Content-based networks are ideal models for graphs displaying any kind of community and/or multipartite structure. We show both numerically and analytically that the generalized EM method is able to recover the process that led to the generation of such networks. We also investigate the conditions under which our generalized EM method can recover the underlying content-based structure in the presence of randomness in the connections. Two entropies, Sq and Sc, are defined to measure the quality of the node classification and the extent to which the connectivity of a given network is content-based. Sq and Sc are also useful in determining the number of classes for which the classification is optimal.
Classifying items with respect to their properties is a fundamental and very old problem. If the properties are inherent to the objects, the difficulty lies in deciding first how many groups are required and then establishing the discrimination thresholds for each. The matter becomes more complicated when, instead of inherent properties, one tries to classify elements based on their mutual interactions. Such classifications would be very useful for a better understanding of the mechanisms underlying the behavior of systems encountered in scientific disciplines as diverse as Sociology, Biology or Physics [1,2,3,4]. As an example, consider social systems, which are often modeled as networks. The vertices represent individuals and the edges represent interactions between them. These interactions can be of many types: friendship, belonging to the same club or school, working together, etc. In these graphs, it is important to be able to group the nodes into what are commonly known as communities, that is, groups of vertices that share a higher number of connections among themselves than with the rest of the network [5,6,7,8,9] (see also [10] for a recent review). This partition bears information on which persons have a stronger interdependence and may allow one to predict the actors that drive the dynamics of the group as a whole. In Biology, on the other hand, network methods have been used to understand gene regulatory patterns [11]. Here each vertex corresponds to a gene, and an edge contains information on how the protein associated with one gene regulates the synthesis of the protein associated with a second gene. Since regulation of gene activity plays a fundamental role in the functioning of the cell [12], the community structure points towards the different functional subunits (see [13] and references therein). Given the relevance of communities, recent years have seen an increase in the number of techniques proposed to detect them.
To name a few, some are based on the concept of betweenness (number of paths passing through a link) and modularity [8,9,14], others on the synchronization of oscillators [15,16] or on other dynamical systems running on the network [17,18,19], on the detection of overlapping cliques [20], or on the diffusion of random walkers [21,22,23].
Nevertheless, communities are not the only relevant information that can be extracted from networks. It is also possible to search for vertices with similar connection patterns (not necessarily having connections among themselves, as in the case of communities) that are expected to play equivalent functional roles. In the social networks literature, such nodes are referred to as structurally equivalent [24] and have led to an analysis of social networks based on Block Modeling [1,25]. In many types of networks, like those formed by webpages or social actors, the connection between nodes is often due to some intrinsic properties of the nodes, which we will refer to henceforth as their “contents”. Thus it is possible to consider an alternative point of view in which a network structure arises as a result of node contents, leading to the notion of content-based networks [26,27,28,29].
In many cases, network analysis approaches based on communities and those based on some form of node similarity are aimed at answering very different questions. When viewed within the framework of content-based networks, however, these differences disappear, as will be argued below. We will also show that an extension of Newman and Leicht’s Expectation Maximization (EM) method [30] is well suited for uncovering the content-based structure underlying a network, inverting in practice the process that led to its formation.
The organization of the paper is as follows. In Section II, content-based networks are formally introduced. Next, in Section III, we describe our generalization of the EM method to directed graphs. In Section IV, we show how the EM method can be used to solve the inverse problem, namely to recover the underlying content-based structure from a given network. In Section V, we present analytical results regarding the application of the EM method to content-based networks and the recovery of the content-based structure. These results will be complemented with a numerical study in Sections VI and VII. In Section VII, we consider a more realistic situation and ask to what extent an underlying content-based structure can be recovered in the presence of disorder in the connections. Finally, we summarize our results and present the conclusions in Section VIII.
Let us first define content-based networks. Consider a set of nodes $i = 1, 2, \ldots, N$, each of which is assigned a content $x_i \in X = \{1, 2, \ldots, N_x\}$, where $1, 2, \ldots$ are labels for the possible contents. The structure of the connectivity pattern of the associated content-based network is determined by the function $c(x_i, x_j) \in \{0, 1\}$, which is defined for all ordered pairs of contents (
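The construction of a content-based network from node contents can be illustrated with a minimal Python sketch. It rests on two assumptions that are not stated in the text above: a directed edge from node i to node j is taken to be present exactly when c(x_i, x_j) = 1, and self-loops are excluded. The names `build_content_network`, `c_community` and `c_bipartite` are hypothetical and introduced only for illustration.

```python
from typing import Callable, List, Tuple


def build_content_network(
    contents: List[int],
    c: Callable[[int, int], int],
) -> List[Tuple[int, int]]:
    """Return the directed edge list of a content-based network.

    Assumption (not from the source): edge i -> j exists iff
    c(x_i, x_j) == 1, and self-loops (i == j) are skipped.
    Nodes are indexed from 0 here for convenience.
    """
    n = len(contents)
    return [
        (i, j)
        for i in range(n)
        for j in range(n)
        if i != j and c(contents[i], contents[j]) == 1
    ]


def c_community(x: int, y: int) -> int:
    """Illustrative choice: nodes connect only if they share a content."""
    return 1 if x == y else 0


def c_bipartite(x: int, y: int) -> int:
    """Illustrative choice: nodes connect only if their contents differ."""
    return 1 if x != y else 0


# Four nodes with N_x = 2 possible contents.
contents = [1, 1, 2, 2]

community_edges = build_content_network(contents, c_community)
bipartite_edges = build_content_network(contents, c_bipartite)

print(community_edges)
print(len(bipartite_edges))
```

With the first choice of c, the two nodes sharing a content link only to each other, giving a community-like structure. With the second, links run only between nodes of different content, giving a bipartite-like structure. This is the sense in which a single function c can encode either kind of organization.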