This article discusses a latent variable model for inference and prediction of symmetric relational data. The model, based on the idea of the eigenvalue decomposition, represents the relationship between two nodes as the weighted inner product of node-specific vectors of latent characteristics. This “eigenmodel” generalizes other popular latent variable models, such as latent class and distance models: It is shown mathematically that any latent class or distance model has a representation as an eigenmodel, but not vice versa. The practical implications of this are examined in the context of three real datasets, for which the eigenmodel has out-of-sample predictive performance as good as or better than that of the other two models.
Let $\{y_{i,j} : 1 \le i < j \le n\}$ denote data measured on pairs of a set of $n$ objects or nodes. The examples considered in this article include friendships among people, associations among words and interactions among proteins. Such measurements are often represented by a sociomatrix $Y$, which is a symmetric $n \times n$ matrix with an undefined diagonal. One of the goals of relational data analysis is to describe the variation among the entries of $Y$, as well as any potential covariation of $Y$ with observed explanatory variables $X = \{x_{i,j}, 1 \le i < j \le n\}$.
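As a concrete illustration (not part of the original article), a sociomatrix of this form can be stored as a symmetric array whose diagonal is left undefined; the sketch below uses NumPy and made-up binary relations purely to fix notation.

```python
import numpy as np

# Minimal sketch of a sociomatrix Y: a symmetric n x n array built from
# pairwise measurements {y_ij : 1 <= i < j <= n}, with an undefined
# (here NaN) diagonal.  The size n and the binary relations are illustrative.
rng = np.random.default_rng(0)
n = 5

Y = np.full((n, n), np.nan)
for i in range(n):
    for j in range(i + 1, n):
        y_ij = rng.integers(0, 2)   # e.g. a hypothetical friendship indicator
        Y[i, j] = y_ij
        Y[j, i] = y_ij              # symmetry: y_ji = y_ij

print(Y)
```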
To this end, a variety of statistical models have been developed that describe $y_{i,j}$ as some function of node-specific latent variables $u_i$ and $u_j$ and a linear predictor $\beta^T x_{i,j}$. In such formulations, $\{u_1, \ldots, u_n\}$ represent across-node variation in the $y_{i,j}$'s and $\beta$ represents covariation of the $y_{i,j}$'s with the $x_{i,j}$'s. For example, Nowicki and Snijders [2001] present a model in which each node $i$ is assumed to belong to an unobserved latent class $u_i$, and a probability distribution describes the relationships between each pair of classes (see Kemp et al. [2004] and Airoldi et al. [2005] for recent extensions of this approach). Such a model captures stochastic equivalence, a type of pattern often seen in network data in which the nodes can be divided into groups such that members of the same group have similar patterns of relationships.
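To make the latent class idea concrete, the following sketch simulates ties from a simple stochastic blockmodel in the spirit of Nowicki and Snijders [2001]; the number of classes, the between-class tie probabilities and the Bernoulli outcome are illustrative assumptions, not the authors' exact specification.

```python
import numpy as np

# Sketch of a latent class model: each node i has an unobserved class u_i,
# and the probability of a tie between i and j depends only on the pair of
# classes (u_i, u_j) through a symmetric matrix Theta of tie probabilities.
rng = np.random.default_rng(1)
n, K = 20, 3
u = rng.integers(0, K, size=n)               # latent class membership of each node
Theta = rng.uniform(0.05, 0.95, size=(K, K))
Theta = (Theta + Theta.T) / 2                # symmetric between-class tie probabilities

Y = np.full((n, n), np.nan)
for i in range(n):
    for j in range(i + 1, n):
        Y[i, j] = Y[j, i] = rng.binomial(1, Theta[u[i], u[j]])
```

Nodes in the same class draw their ties from the same rows of Theta, which is precisely the stochastic equivalence property described above.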
An alternative approach to representing across-node variation is based on the idea of homophily, in which the relationships between nodes with similar characteristics are stronger than the relationships between nodes having different characteristics. Homophily provides an explanation for data patterns often seen in social networks, such as transitivity (“a friend of a friend is a friend”), balance (“the enemy of my friend is an enemy”) and the existence of cohesive subgroups of nodes.
In order to represent such patterns, Hoff et al. [2002] present a model in which the conditional mean of $y_{i,j}$ is a function of $\beta' x_{i,j} - |u_i - u_j|$, where $\{u_1, \ldots, u_n\}$ are vectors of unobserved, latent characteristics in a Euclidean space. In the context of binary relational data, such a model predicts the existence of more transitive triples, or “triangles,” than would be seen under a random allocation of edges among pairs of nodes. An important assumption of this model is that two nodes with a strong relationship between them are also similar to each other in terms of how they relate to other nodes: A strong relationship between $i$ and $j$ suggests $|u_i - u_j|$ is small, but this further implies that $|u_i - u_k| \approx |u_j - u_k|$, and so nodes $i$ and $j$ are assumed to have similar relationships to other nodes.
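A rough simulation of such a latent distance model is sketched below; the two-dimensional latent space, the single dyadic covariate and the logistic link are assumptions made for illustration and do not reproduce the specific model fit in Hoff et al. [2002].

```python
import numpy as np

# Sketch of a latent distance model: the probability of a tie between i and j
# increases in beta' x_ij - |u_i - u_j|, so nodes that are close in the latent
# space tie more often (homophily).
rng = np.random.default_rng(2)
n, d = 20, 2
u = rng.normal(size=(n, d))                  # latent positions in Euclidean space
beta = 0.5
x = rng.normal(size=(n, n))                  # a hypothetical dyadic covariate
x = (x + x.T) / 2

Y = np.full((n, n), np.nan)
for i in range(n):
    for j in range(i + 1, n):
        eta = beta * x[i, j] - np.linalg.norm(u[i] - u[j])
        Y[i, j] = Y[j, i] = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
```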
The latent class model of Nowicki and Snijders [2001] and the latent distance model of Hoff et al. [2002] are able to identify, respectively, classes of nodes with similar roles and the locational properties of the nodes. These are perhaps the two primary features of interest in social network and relational data analysis. For example, discussion of these concepts makes up more than half of the 734 pages of main text in Wasserman and Faust [1994]. However, a model that can represent one feature may not be able to represent the other: Consider the two graphs in Figure 1. The graph on the left displays a large degree of transitivity, and can be well-represented by the latent distance model with a set of vectors $\{u_1, \ldots, u_n\}$ in two-dimensional space, in which the probability of an edge between $i$ and $j$ is decreasing in $|u_i - u_j|$. In contrast, representation of the graph by a latent class model would require a large number of classes, none of which would be particularly cohesive or distinguishable from the others. The second panel of Figure 1 displays the complementary situation: a graph exhibiting stochastic equivalence, which is naturally represented by a latent class model but poorly by a latent distance model.

Just as any symmetric matrix can be approximated with a subset of its largest eigenvalues and corresponding eigenvectors, the variation in a sociomatrix can be represented by modeling $y_{i,j}$ as a function of $\beta' x_{i,j} + u_i^T \Lambda u_j$, where $\{u_1, \ldots, u_n\}$ are node-specific factors and $\Lambda$ is a diagonal matrix. In this article, we show mathematically and by example how this eigenmodel can represent both stochastic equivalence and homophily in symmetric relational data, and thus is more general than the other two latent variable models.
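The analogy with the eigenvalue decomposition can be made explicit with a small numerical sketch: a symmetric matrix of relational "strengths" is approximated by keeping only its $K$ largest-magnitude eigenvalues and the corresponding eigenvectors, so that entry $(i,j)$ is approximated by $u_i^T \Lambda u_j$. The matrix and the choice $K = 2$ below are arbitrary illustrations, not data from the article.

```python
import numpy as np

# Rank-K approximation of a symmetric matrix via its eigendecomposition,
# mirroring the form m_ij ~ u_i' Lambda u_j used by the eigenmodel.
rng = np.random.default_rng(3)
n, K = 20, 2
M = rng.normal(size=(n, n))
M = (M + M.T) / 2                            # a symmetric matrix of "strengths"

evals, evecs = np.linalg.eigh(M)
idx = np.argsort(np.abs(evals))[::-1][:K]    # K largest-magnitude eigenvalues
U, Lam = evecs[:, idx], np.diag(evals[idx])
M_hat = U @ Lam @ U.T                        # entries approximate u_i' Lambda u_j

print(np.linalg.norm(M - M_hat) / np.linalg.norm(M))  # relative approximation error
```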
The next section motivates the use of latent variable models for relational data, and shows mathematically that the eigenmodel generalizes the latent class and distance models in the sense that it can compactly represent the same network features as these other models, but not vice versa. Section 3 compares the out-of-sample predictive performance of these three models on three different datasets: a social network of 12th graders; a relational dataset on word association counts from the first chapter of Genesis; and a dataset on protein-protein interactions. The first two networks exhibit latent homophily and stochastic equivalence, respectively.