Title: Normalized Mutual Information to evaluate overlapping community finding algorithms
ArXiv ID: 1110.2515
Date: 2015-03-18
Authors: Aaron McDaid and James Curran
📝 Abstract
Given the increasing popularity of algorithms for overlapping clustering, in particular in social network analysis, quantitative measures are needed to measure the accuracy of a method. Given a set of true clusters, and the set of clusters found by an algorithm, these sets of clusters must be compared to see how similar or different the sets are. A normalized measure is desirable in many contexts, for example assigning a value of 0 where the two sets are totally dissimilar, and 1 where they are identical. A measure based on normalized mutual information, [1], has recently become popular. We demonstrate unintuitive behaviour of this measure, and show how this can be corrected by using a more conventional normalization. We compare the results to those of other measures, such as the Omega index [2].
📄 Full Content
A C++ implementation is available online. In a non-overlapping scenario, each node belongs to exactly one cluster. We consider the overlapping case, where a node may belong to many clusters, or indeed to none. Such a set of clusters has been referred to as a cover in the literature, and this is the terminology we will use.
For a good introduction to our problem of comparing covers of overlapping clusters, see [2]. They describe the Rand index, which is defined only for disjoint (non-overlapping) clusters, and then show how to extend it to overlapping clusters. Each pair of nodes is considered and the number of clusters in common between the pair is counted. Even if a typical node is in many clusters, it’s likely that a randomly chosen pair of nodes will have zero clusters in common. These counts are calculated for both covers and the Omega index is defined as the proportion of pairs for which the shared-cluster-count is identical, subject to a correction for chance.
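As a concrete illustration, here is a minimal C++ sketch of the Omega index as just described. The representation (a cover as a vector of node-id sets) and all names (`Cover`, `pairCounts`, `omegaIndex`) are ours for illustration; this is not the authors' published implementation, and the chance correction assumes the standard independence form.

```cpp
#include <iterator>   // std::next
#include <map>
#include <set>
#include <utility>
#include <vector>

using Cover = std::vector<std::set<int>>;

// For every unordered pair of nodes, count how many clusters of the cover
// contain both. Pairs sharing zero clusters stay implicit (absent from map).
std::map<std::pair<int,int>, int> pairCounts(const Cover& cover) {
    std::map<std::pair<int,int>, int> counts;
    for (const auto& cluster : cover)
        for (auto i = cluster.begin(); i != cluster.end(); ++i)
            for (auto j = std::next(i); j != cluster.end(); ++j)
                ++counts[{*i, *j}];
    return counts;
}

double omegaIndex(const Cover& x, const Cover& y, long n) {
    const double N = n * (n - 1) / 2.0;  // total unordered pairs of nodes
    const auto cx = pairCounts(x), cy = pairCounts(y);

    // Observed agreement: pairs whose shared-cluster count is identical in
    // both covers. Pairs absent from both maps agree trivially (0 == 0).
    double agree = 0;
    std::set<std::pair<int,int>> seen;
    for (const auto& [p, c] : cx) {
        auto it = cy.find(p);
        if (it != cy.end() && it->second == c) ++agree;
        seen.insert(p);
    }
    for (const auto& kv : cy) seen.insert(kv.first);
    agree += N - seen.size();

    // Chance correction: histogram the shared-cluster counts in each cover
    // and compute the agreement expected if the covers were independent.
    std::map<int, double> hx, hy;
    for (const auto& [p, c] : cx) ++hx[c];
    for (const auto& [p, c] : cy) ++hy[c];
    hx[0] = N - cx.size();
    hy[0] = N - cy.size();

    double observed = agree / N, expected = 0;
    for (const auto& [c, cnt] : hx) {
        auto it = hy.find(c);
        if (it != hy.end()) expected += cnt * it->second / (N * N);
    }
    return (observed - expected) / (1.0 - expected);  // assumes expected < 1
}
```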
Meila [3] defined a measure based on mutual information for comparing disjoint clusterings. Lancichinetti et al. [1] proposed a measure, also based on mutual information, extended to covers. This measure has become quite popular for comparing community finding algorithms in social network analysis. It is this measure we are primarily concerned with here, and we will refer to it as NMI_LFK after the authors' initials.
We propose to use a different normalization from that used in NMI_LFK, but first we will define the non-normalized measure, which is based very closely on that of NMI_LFK. You may want to compare this to the final section of Lancichinetti et al. [1].
Given two covers, X and Y, we must first see how to measure the similarity between a pair of clusters. X and Y are binary matrices of cluster membership. There are n nodes. The first cover has $K_X$ clusters, and hence X is a $K_X \times n$ matrix; Y is a $K_Y \times n$ matrix. $X_{im}$ tells us whether node m is in cluster i of cover X.
To compare cluster i of the first cover with cluster j of the second, we compare the row vectors $X_i$ and $Y_j$. These are vectors of ones and zeroes denoting which nodes belong to the cluster.
The lack of information between two vectors is defined as the conditional entropy

$$H(X_i|Y_j) = H(X_i, Y_j) - H(Y_j),$$

where the joint entropy is built from the four confusion counts of the two vectors. With $a$ the number of nodes in neither cluster, $b$ the number in $Y_j$ only, $c$ the number in $X_i$ only, and $d$ the number in both,

$$H(X_i, Y_j) = h(a, n) + h(b, n) + h(c, n) + h(d, n),$$

where $h(w, n) = -w \log_2 \frac{w}{n}$. There is an interesting technicality here. Imagine a pair of clusters whose memberships have been assigned randomly. There is a possibility of a small amount of mutual information, even when the two vectors are negatively correlated with each other. In extremis, if the two vectors are near-complements of each other, the mutual information will be very high. We wish to override this and define the mutual information to be zero in this case. This is the constraint in equation (B.14) of [1]. We also use this restriction in our proposal.
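To make these quantities concrete, the following C++ sketch computes $h(w,n)$ and the pairwise conditional entropy from the four confusion counts. The function names are ours, and the branch expressing eq. (B.14) of [1] reflects our reading of that constraint.

```cpp
#include <cmath>

// h(w, n) = -w * log2(w / n), with the convention h(0, n) = 0.
double h(double w, double n) {
    return w > 0.0 ? -w * std::log2(w / n) : 0.0;
}

// H(X_i | Y_j) from the 2x2 confusion counts of two membership vectors:
// a = nodes in neither cluster, b = in Y_j only, c = in X_i only,
// d = in both; a + b + c + d = n.
double clusterConditionalEntropy(double a, double b, double c, double d) {
    const double n = a + b + c + d;
    // Our reading of eq. (B.14) of [1]: if the agreement terms (both/neither)
    // do not outweigh the disagreement terms, the vectors look more like
    // complements than copies, so we declare zero mutual information by
    // returning the unconditional entropy H(X_i).
    if (h(d, n) + h(a, n) < h(b, n) + h(c, n))
        return h(c + d, n) + h(a + b, n);                     // H(X_i)
    const double jointH = h(a, n) + h(b, n) + h(c, n) + h(d, n); // H(X_i, Y_j)
    const double hYj    = h(b + d, n) + h(a + c, n);             // H(Y_j)
    return jointH - hYj;                                      // H(X_i | Y_j)
}
```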
This allows us to compare vectors $X_i$ and $Y_j$, but we want to compare the entire matrices X and Y to each other. We will follow the approximation used by [1] here and match each vector in X to its best match in Y,

$$H(X_i|Y) = \min_{j} H(X_i|Y_j), \qquad (3)$$
then sum across all the vectors in X,

$$H(X|Y) = \sum_{i=1}^{K_X} H(X_i|Y).$$
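A minimal sketch of this step, assuming the pairwise values $H(X_i|Y_j)$ have already been computed (e.g. with clusterConditionalEntropy above) into a $K_X \times K_Y$ matrix condH; the function name is ours:

```cpp
#include <algorithm>
#include <vector>

// H(X|Y) under the best-match approximation of eq. (3): each cluster of X
// is matched to the cluster of Y that explains it best (minimum conditional
// entropy), and the per-cluster minima are summed.
double coverConditionalEntropy(const std::vector<std::vector<double>>& condH) {
    double total = 0.0;
    for (const auto& row : condH)   // one row per cluster X_i
        total += *std::min_element(row.begin(), row.end());
    return total;
}
```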
H(Y|X) is defined in a similar way to H(X|Y), but with the roles reversed. We will also need to define the (unconditional) entropy of a cover,

$$H(X) = \sum_{i=1}^{K_X} H(X_i), \quad \text{where } H(X_i) = h\!\left(\sum_{m=1}^{n} [X_{im} = 1],\, n\right) + h\!\left(\sum_{m=1}^{n} [X_{im} = 0],\, n\right),$$

where $\sum_{m=1}^{n} [X_{im} = 1]$ counts the number of nodes in cluster i, and $\sum_{m=1}^{n} [X_{im} = 0]$ counts the number of nodes not in cluster i.
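In code, the cover entropy needs only the cluster sizes; a short sketch reusing h from above (the function name is ours):

```cpp
#include <vector>

// H(X) = sum_i H(X_i), where cluster i of size w contributes
// h(w, n) + h(n - w, n).
double coverEntropy(const std::vector<int>& clusterSizes, int n) {
    double total = 0.0;
    for (int w : clusterSizes)
        total += h(w, n) + h(n - w, n);
    return total;
}
```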
Fig. 1 gives us an easy way to remember the following useful identities, which apply in any mutual information context:

$$I(X:Y) = H(X) - H(X|Y), \qquad I(X:Y) = H(Y) - H(Y|X), \qquad H(X,Y) = H(X|Y) + I(X:Y) + H(Y|X).$$
The first two equalities give us two definitions of the mutual information, I(X:Y). In theory, these should be identical, but due to the approximation used in eq. (3) they may differ. Therefore, we will use the average of the two.
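Under the assumptions of the sketches above, the averaged quantity is $I(X:Y) = \frac{1}{2}\left[(H(X) - H(X|Y)) + (H(Y) - H(Y|X))\right]$, which in code is a one-liner:

```cpp
// I(X:Y) averaged over its two (approximate) definitions:
// 0.5 * [(H(X) - H(X|Y)) + (H(Y) - H(Y|X))].
double averagedMutualInformation(double hX, double hY,
                                 double hXgivenY, double hYgivenX) {
    return 0.5 * ((hX - hXgivenY) + (hY - hYgivenX));
}
```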
We are now ready to discuss normalization, contrasting the method of [1] with our alternative.
Lancichinetti et al. [1] define their own normalization of the variation of information,