Folks in Folksonomies: Social Link Prediction from Shared Metadata

Rossano Schifanella (1, ∗), Alain Barrat (2, 3), Ciro Cattuto (3), Benjamin Markines (4), Filippo Menczer (3, 4)

(1) Department of Computer Science, University of Turin, Italy
(2) Centre de Physique Théorique (CNRS UMR 6207), Marseille, France
(3) Complex Networks and Systems Laboratory, ISI Foundation, Turin, Italy
(4) School of Informatics and Computing, Indiana University, Bloomington, IN, USA

ABSTRACT
Web 2.0 applications have attracted a considerable amount of attention because their open-ended nature allows users to create light-weight semantic scaffolding to organize and share content. To date, the interplay of the social and semantic components of social media has been only partially explored. Here we focus on Flickr and Last.fm, two social media systems in which we can relate the tagging activity of the users with an explicit representation of their social network. We show that a substantial level of local lexical and topical alignment is observable among users who lie close to each other in the social network. We introduce a null model that preserves user activity while removing local correlations, allowing us to disentangle the actual local alignment between users from statistical effects due to the assortative mixing of user activity and centrality in the social network. This analysis suggests that users with similar topical interests are more likely to be friends, and therefore semantic similarity measures among users based solely on their annotation metadata should be predictive of social links. We test this hypothesis on the Last.fm data set, confirming that the social network constructed from semantic similarity captures actual friendship more accurately than Last.fm's suggestions based on listening patterns.
Categories and Subject Descriptors
H.1.2 [Information Systems]: Models and Principles - Human information processing; H.3.5 [Information Storage and Retrieval]: Online Information Services - Web-based services; H.5.3 [Information Interfaces and Presentation]: Group and Organization Interfaces - Collaborative computing, Web-based interaction

General Terms
Algorithms, Experimentation, Measurement

Keywords
Web 2.0, social media, folksonomies, collaborative tagging, social network, lexical and topical alignment, link prediction, social semantic similarity, Maximum Information Path

∗ Corresponding author. Email: schifane@di.unito.it. A good portion of the work in this paper was carried out while Dr. Schifanella was a visiting scholar at the Center for Complex Networks and Systems Research (CNetS) of the Indiana University School of Informatics and Computing.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WSDM'10, February 4–6, 2010, New York City, New York, USA.
Copyright 2010 ACM 978-1-60558-889-6/10/02 ...$10.00.

1. INTRODUCTION
Social networking systems like Facebook and systems for content organization and sharing such as Flickr and Delicious have created information-rich ecosystems where the cognitive, behavioral and social aspects of a user community are entangled with the underlying technological platform. This opens up new ways to monitor and investigate a variety of processes involving the interaction of users with one another, as well as the interaction of users with the information they process.
Social media supporting tagging [14, 3] are especially interesting in this respect because they stimulate users to provide light-weight semantic annotations in the form of freely chosen terms. Usage patterns of tags can be employed to monitor interest, track user attention, and investigate the co-evolution of social and semantic networks.

While the emergence of conventions and shared conceptualizations has attracted considerable interest [24, 16, 25, 2], little attention has been devoted so far to relating, at the microscopic level, the usage of shared tags with the social links existing between users. The present paper aims at filling this gap. To this end we focus on Flickr and Last.fm because, to our knowledge, they are currently the only popular social media systems in which: (1) a significant fraction of the users provide tag metadata for their content (photographs or songs), and (2) an explicit representation of the social links between users is readily available.

The main question that we address in this study is the following: given two randomly chosen users, how does the alignment of their tag vocabularies relate to their proximity on the social network? That is, does lexical alignment exist between neighboring users, and if so, how does this alignment fade when considering users lying at an increasing distance on the social graph? And if such a relationship indeed exists, does it allow us to predict social links from an analysis of the semantic similarity among users, extracted from their annotations?

Contributions and outline
The main contributions of this paper are summarized as follows:

• In § 4.1 we show that strong correlations exist across several measures of user activity, and characterize the mixing patterns that involve user activity and user centrality in the social network.
• In § 4.2 we develop sound measures of tag overlap.
  We further introduce appropriate null models to disentangle the actual local alignment between users from statistical effects due to the mixing properties of user activity and centrality in the social network. We apply these measures to the Flickr and Last.fm data sets. The resulting analysis shows that, although neither Flickr nor Last.fm supports globally shared tag vocabularies, a substantial level of local lexical (shared tags) and topical (shared groups) alignment is observable among users who are close to each other in the social network. We also find that some observables are more adequate than others to measure lexical and topical alignment, in the sense that they are less sensitive to purely statistical effects.
• In § 5 we ask whether the observed correlations between annotation metadata and social proximity allow us to use semantic similarity between user annotations as a statistical predictor of friendship links. We evaluate a number of semantic similarity measures from the literature, based on Last.fm metadata. We find that when we consider the annotations of the most active users, almost all of the semantic similarity measures considered outperform the neighbor suggestions from the Last.fm system at predicting actual friendship relations. Scalable semantic similarity measures such as Maximum Information Path, proposed by some of the authors, are among those achieving the best predictive performance.

2. RELATED WORK
One of the first quantitative studies on Flickr is presented by Marlow et al. [13], who discuss the heterogeneity of tagging patterns and perform a preliminary analysis of vocabulary overlap between pairs of users. The analysis suggests that users who are linked in the Flickr social network have on average a higher vocabulary overlap, but no assessment is made of biases and other correlations that could be responsible for the reported observation.
The structure and the temporal evolution of the Flickr social network are investigated in several papers [4, 18, 17]. Leskovec et al. [6] place a special emphasis on the local mechanisms driving the microscopic evolution of the network. The role of social contacts in shaping browsing patterns on Flickr has also been explored [5, 26], providing insights into the behavior and activity patterns of Flickr users. Prieur et al. [21] investigate the role of Flickr groups as coordination tools, and explore the relation between the density of the social network and the density of the network of tag co-usage among the group members.

Liben-Nowell and Kleinberg [8] explore several notions of node similarity for link prediction in social networks. In our own prior work we performed a systematic analysis of a broad range of semantic similarity measures based on folksonomies, which can be applied directly to build networks of users, tags, or resources [12, 10, 11]. Here we build upon this evaluation framework.

Li et al. [7] propose a system to discover common user interests and cluster users and their saved URLs by different interest topics. They use a Delicious data set to define implicit links between users based on the similarity of their tags. However, they do not correlate the interest clusters with social connections.

Perhaps the work that most directly relates to our approach is by Santos-Neto et al. [23], who explore the question of whether tag-based or resource-based interest sharing in tagging systems relates to other indicators of social behavior. The authors analyze the CiteULike and Connotea systems, which deal with scientific publications and lack explicit social network components. Therefore they are unable to directly explore social friendship between two users, and instead look at participation in the same discussion group, with mixed results.
They do not find a statistical correlation between the intensity of interest sharing and the collaboration levels. Our present results are both more explicit (we deal with pairs of users rather than groups) and more conclusive. Furthermore we are able to evaluate our interest-based predictions against external suggestions based on independent data, quantifying the applicative value of our findings.

3. DATA SETS
Flickr makes available most of its public data by means of API methods (flickr.com/api). The data used for the present study were obtained by using the public Flickr API to perform a distributed crawl of the content uploaded to Flickr between January 2004 and January 2006. The system was crawled during the first half of 2007. The crawling task was distributed by dividing the above interval of time into work units consisting of smaller time windows, and crawling each time window separately. Each crawler was programmed to issue search queries for every known tag, limited to its temporal window of competence, as well as to issue search queries for photos uploaded in the same interval. As new tags were discovered, they were added to a global table shared by all the crawlers. Separate crawls were performed to explore the Flickr social network (in Flickr jargon, the "contacts" of a given user are her nearest neighbors in the social network, as represented in the system), as well as group membership information.

Overall, the data set we analyzed comprises 241,031 users for whom we have tagging information, and 118,144 users for whom we also have group membership information. Our analysis will focus on two networks. The first one, $G_0$, comprises the Flickr users for whom we have tag, group and contact information. It consists of 118,144 nodes (users) and 2,263,182 edges (contacts between users).
The second network, $G_1$, is obtained by extending $G_0$ to include all the neighbors of its nodes, neighbors for whom we may not have tagging, group membership, or complete contact information. $G_1$ comprises 983,778 nodes and 16,673,476 edges and will be used to check the robustness of analyses involving the distance among users in the Flickr social network.

Similarly, we constructed our Last.fm data set using public API methods (last.fm/api), in particular for collecting neighbor and friend relations. In Last.fm jargon, friends are contacts in the social network, while neighbors are users recommended by the system as potential contacts, based on their music playing histories. Last.fm also allows users to annotate various items (songs, artists, or albums) with tags. However, the Last.fm API does not allow retrieval of the complete user annotation activity. Therefore, with permission, we developed a crawler that extracts all the triples (user, item, tag) and group membership information by visiting and parsing user profile web pages. The crawls took place over a period of a few months in the first half of 2009. The resulting data set comprises 99,405 users, of whom 52,452 are active, i.e., have at least one annotation. The 10,936,545 triples annotate 1,393,559 items with 281,818 tags. The users belong to 66,429 groups. No filters were applied to our data set collections.

4. DATA ANALYSIS
In this section we present an analysis of the data. The very same analysis was carried out for both Flickr and Last.fm data sets. However, due to space limitations, we report below mainly on the results of the Flickr analysis.
Unless otherwise specified, our analysis refers to $G_0$, but we checked that the results do not change for $G_1$. The analysis of Last.fm yielded analogous results, both qualitatively and quantitatively. Therefore we believe our conclusions to be robust.

[Figure 1: Flickr distributions (log-log axes) of (A) the number $k$ of neighbors of a user, (B) the number $n_g$ of groups of which a user is a member, (C) the number $n_t$ of distinct tags per user, and (D) the number $a$ of tag assignments per user.]

Table 1: Averages and fluctuations of Flickr user activity

  Measure of activity x    Average <x>    <x^2>/<x>
  k                        38.3           469
  n_t                      85.3           511.4
  n_g                      32.6           184.6
  a                        690.7          8471.3

4.1 Heterogeneity and correlations
Let us first focus on the activity of users as measured by a number of indicators, and investigate the correlations between these indicators. The activity of a Flickr user has various aspects, among which the most important are uploading photos, tagging them, participating in groups, commenting on other users' photos, and other social networking activities. Fig. 1 displays the probability distributions of the number $k$ of neighbors in the social network (the degree $k$ of a node), and the probability of finding a user with a given number $n_t$ of distinct tags in her vocabulary. The breadth of a user's tag vocabulary can be regarded as a proxy for the breadth of her interests. We also show in Fig. 1 the distribution of the number of groups $n_g$ to which a user belongs, and of the total number $a$ of tag assignments submitted by a user (in this case, a tag used twice by a user is counted twice). More precisely, if $f_u(t)$ is the number of times that a tag $t$ has been used by user $u$, then the total number of tag assignments of user $u$ is $a_u = \sum_t f_u(t)$.
All these distributions are broad, showing that the activity patterns of users are highly heterogeneous. For reference, Table 1 reports the averages and fluctuations of these quantities.

A few comments are in order. First, in our analysis we do not consider one obvious measure of activity, namely the number of photos uploaded by a user. One reason for this is that the number of photos posted by a user is known to be strongly correlated with the number of tags from the same user [13]. More importantly, Flickr is a "narrow folksonomy" in which users tend to tag mostly their own content [28]. Thus, when exploring the similarity of users and relating it to the underlying social network, shared usage of tags and co-membership in groups are natural and more direct indicators of shared interests.

[Figure 2: Average number of distinct tags ($n_t$), of groups ($n_g$), and of tag assignments ($a$) of users having $k$ neighbors in the Flickr social network (log-log axes).]

Another note concerns the comparison with the study by Mislove et al. [18], who reported a smaller average degree for the Flickr social network. This difference is due to the fact that in our study we focus on those users who use both tags and groups. Since only 21% of the users participate in groups [18], this means that here we are focusing on the "active" users. As we will see below, the various activity metrics are correlated with one another, so users who are more active in terms of tags and groups will tend to have more contacts in the Flickr social network, hence the higher average degree we report here. Fig. 1, however, clearly shows that even within this "active" set of users, very broad distributions of activity patterns are observed and no "typical" value of the activity metrics can be defined.
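As a minimal sketch of the two statistics reported in Table 1, the snippet below computes the average $\langle x \rangle$ and the fluctuation ratio $\langle x^2 \rangle / \langle x \rangle$ of an activity measure. The function name `activity_stats` and the toy values are illustrative assumptions, not part of the original analysis.

```python
def activity_stats(values):
    """Return (<x>, <x^2>/<x>) for a list of non-negative activity counts."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(values) / n
    if mean == 0:
        raise ValueError("the ratio is undefined when all values are zero")
    second_moment = sum(x * x for x in values) / n
    return mean, second_moment / mean


# Hypothetical toy sample: most users have few contacts, one has many.
degrees = [1, 1, 2, 2, 4, 50]
mean, ratio = activity_stats(degrees)
print(mean, ratio)
```

When the ratio is much larger than the mean, as for every row of Table 1, the distribution is broad and the average is not a typical value.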
It seems natural to ask whether the different types of activity measures are correlated with one another and with the structure of the social network: are users with more social links also more active in tagging their content, and do they participate in more groups? Fig. 2 shows that this is indeed the case (see also Ref. [13]). Fig. 2 displays the average activity of users with $k$ neighbors in the social network, as measured by the various metrics defined above. For instance,

$$ n_t(k) = \frac{1}{|\{u : k_u = k\}|} \sum_{u : k_u = k} n_t(u) . $$

All types of activity show an increasing trend for increasing values of $k$, although strong fluctuations are present at all values of $k$. The strong fluctuations visible for large $k$ values are due to the small number of highly connected nodes over which the averages are computed. Notably, users with a large number of social contacts but using very few tags and belonging to very few groups can be observed. Despite these important heterogeneities in the behavior of users with the same degree $k$, the data clearly indicate a strong correlation between the different types of activity metrics. The Pearson correlation coefficients are: 0.349 between $k$ and $n_t$, 0.482 between $k$ and $n_g$, 0.268 between $k$ and $a$, 0.429 between $n_t$ and $n_g$, 0.753 between $n_t$ and $a$, and 0.304 between $n_g$ and $a$.

Another important question concerns the correlations between the activity metrics of users who are linked in the Flickr social network. This is a well-known problem in the social sciences, ecology, and epidemiology: a typical pattern, referred to as "assortative mixing," describes the tendency of nodes of a network (here, the users) to link to other nodes with similar properties. This appears intuitive in the context of a social network, where one expects individuals to connect preferentially with other individuals sharing the same interests. Likewise, it is possible to define a "disassortative mixing" pattern if the elements of the network tend to link to nodes that have different properties or attributes. Mixing patterns can be defined with respect to any property of the nodes [19]. In the present case, we characterize the mixing patterns concerning the various activity metrics.

In the case of large-scale networks, the most commonly investigated mixing pattern involves the degree (number of neighbors) of nodes. This type of mixing refers to the likelihood that users with a given number of neighbors connect with users with similar degree. To this end, a commonly used measure is given by the average nearest neighbors degree of a user $u$,

$$ k_{nn}^u = \frac{1}{k_u} \sum_{v \in \mathcal{V}(u)} k_v , $$

where the sum runs over the set $\mathcal{V}(u)$ of nearest neighbors of $u$. To characterize mixing patterns in the degree of nodes, a convenient measure can be built on top of $k_{nn}^u$ by averaging over all nodes $u$ that have a given degree $k$ [20, 27]:

$$ k_{nn}(k) = \frac{1}{|\{u : k_u = k\}|} \sum_{u : k_u = k} k_{nn}^u . \tag{1} $$

In the case of Flickr, each user is endowed with several properties characterizing the user's activity. It is thus interesting to characterize the mixing patterns with respect to all of those properties. To this end, we generalize the average nearest neighbors degree presented above, and define for each user $u$ the following quantities: (i) the average number of tags of its nearest neighbors,

$$ n_{t,nn}^u = \frac{1}{k_u} \sum_{v \in \mathcal{V}(u)} n_t(v) , $$

(ii) the average total number of tag assignments of its nearest neighbors,

$$ a_{nn}^u = \frac{1}{k_u} \sum_{v \in \mathcal{V}(u)} a(v) , $$

and (iii) the average number of groups to which its nearest neighbors belong,

$$ n_{g,nn}^u = \frac{1}{k_u} \sum_{v \in \mathcal{V}(u)} n_g(v) . $$
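The definitions of $k_{nn}^u$ and Eq. 1 can be sketched in a few lines of code. The adjacency-dictionary representation, the function names, and the toy network below are assumptions made for illustration; the sketch also assumes an undirected network in which every neighbor appears as a key of the dictionary.

```python
from collections import defaultdict


def average_neighbor_degree(adjacency):
    """Map each node u with at least one neighbor to k_nn^u,
    the mean degree of its nearest neighbors."""
    degree = {u: len(nbrs) for u, nbrs in adjacency.items()}
    return {
        u: sum(degree[v] for v in nbrs) / degree[u]
        for u, nbrs in adjacency.items()
        if degree[u] > 0
    }


def knn_by_degree(adjacency):
    """Eq. 1: average k_nn^u over all nodes u sharing the same degree k."""
    buckets = defaultdict(list)
    for u, value in average_neighbor_degree(adjacency).items():
        buckets[len(adjacency[u])].append(value)
    return {k: sum(vals) / len(vals) for k, vals in buckets.items()}


# Hypothetical toy network: node "a" linked to b, c, d, plus the edge b-c.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
knn = knn_by_degree(toy)
print(knn)
```

The generalized quantities (i)-(iii) follow the same pattern: replace `degree[v]` in the inner sum with the neighbor's number of distinct tags, tag assignments, or groups.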
To detect the mixing patterns, in complete analogy with the case of $k_{nn}(k)$, we compute the average number of distinct tags of the nearest neighbors for the class of users having $n_t$ distinct tags:

$$ n_{t,nn}(n) = \frac{1}{|\{u : n_t(u) = n\}|} \sum_{u : n_t(u) = n} n_{t,nn}^u , \tag{2} $$

the average total number of tags used by the nearest neighbors for the class of users with $a$ tag assignments:

$$ a_{nn}(a) = \frac{1}{|\{u : a(u) = a\}|} \sum_{u : a(u) = a} a_{nn}^u , \tag{3} $$

and the average number of groups of the nearest neighbors for the class of users who are members of $n_g$ groups:

$$ n_{g,nn}(n) = \frac{1}{|\{u : n_g(u) = n\}|} \sum_{u : n_g(u) = n} n_{g,nn}^u . \tag{4} $$

[Figure 3: (A) Average degree of the nearest neighbors of nodes of degree $k$, computed for the $G_0$ (black circles) and $G_1$ (red crosses) Flickr networks. (B) Average number of groups for the nearest neighbors of nodes belonging to $n_g$ groups. (C) Average number of tags for the nearest neighbors of nodes with $n_t$ tags. (D) Average total number of tag assignments for the nearest neighbors of nodes with $a$ tag assignments. In all cases a clear assortative trend is observed.]

Fig. 3 displays the quantities of Eqs. 1–4 for the Flickr data set. In all cases, a clear assortative trend is visible: the average activity of the neighbors of a user increases with the user's own activity, for all the activity measures we computed. Note that for the degree mixing patterns, the assortative trend is even enhanced when $G_1$ is considered instead of $G_0$. Large fluctuations are observed for large activity values, because of the small number of very active users. We remark that while Mislove et al. [18] had already found an assortative trend with respect to the degree of the social network, Fig.
3 highlights that the activities of socially connected users are correlated at all levels.

4.2 Lexical and topical alignment
In this section we analyze in more detail the similarity of user profiles in relation to their social distance. More precisely, the previous section was devoted to the correlations between the intensity of user activities, as quantified by several metrics. We now focus on the similarity between user profiles as measured by the similarity of their respective tag vocabularies, and by the similarity of the sets of groups they belong to.

As mentioned above, Flickr is a "narrow folksonomy" [28]: tag annotations are provided mostly by the content creator, i.e., the tags associated with a photo are typically provided by the user who posted that photo. Intuitively, the absence of shared content, together with the very personal character of both the content and the tag metadata, makes the Flickr tag vocabulary extremely incoherent across the user community. Conversely, social bookmarking systems like Delicious allow multiple users to annotate the same resource, and one could argue that the browsing experience exposes users to the global tag vocabulary and fosters, at least in principle, imitative or cooperative processes leading to the emergence of global conventions in the user community [14].

In light of the above observations, we do not expect to observe a globally shared tag vocabulary in the Flickr community. A simple test for the existence of such a globally shared vocabulary can be performed by selecting pairs of users at random and measuring the number of tags they share, $n_{st}$.
It turns out that the average number of shared tags is only $\langle n_{st} \rangle \approx 1.6$. The most common case (mode) is in fact the absence of any shared tags; this occurs with probability close to 2/3 among randomly chosen pairs of users.

One can nevertheless expect that a number of mechanisms may lead to local alignment of the user profiles, in terms of shared tags and/or group membership. The presence of a social link, in fact, indicates a priori some degree of shared context between the connected users, who are likely to have some interests in common, or to share some experiences, or who are simply exposed to each other's content and annotations. As an example, Table 2 shows the 12 most frequently used tags for three Flickr users with comparable tagging activity. User A and user B have marked each other as friends, while user C has no connections to either A or B on the Flickr social network. All of these users have globally popular tags in their tag vocabulary. In this example, the neighbors A and B share an interest (expressed by the tag flower) and several of the most frequently used tags (green, red, flower, blue, and yellow).

Table 2: Tags most frequently used by three Flickr users

  User A          User B          User C
  green           flower          japan
  red             green           tokyo
  catchycolors    kitchen         architecture
  flower          red             bw
  blue            blue            setagaya
  yellow          white           reject
  catchcolors     fave            sunset
  travel          detail          subway
  london          closeupfilter   steel
  pink            metal           geometry
  orange          yellow          foundart
  macro           zoo             canvas

Regardless of the mechanism driving this potential local alignment, in the following we want to measure this effect for the case of tag dictionaries and group memberships, and relate it to the distances between users along the social network. This approach is similar to the exploration of topical locality in the Web, where the question is whether pages that are closer to each other in the link graph are more likely to be related to one another [15].
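One way to relate tag overlap to social distance is a breadth-first traversal from a chosen user, averaging the number of shared tags $n_{st}$ over the users found at each distance $d$. The sketch below is an illustrative assumption about how such a measurement could be organized; the function names, the adjacency-dictionary representation, and the toy data are hypothetical.

```python
from collections import defaultdict, deque


def bfs_distances(adjacency, source):
    """Shortest-path distance from source to every reachable user."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def shared_tags_by_distance(adjacency, vocab, source):
    """Average number of shared tags between source and the users
    at each distance d >= 1; users without a vocabulary are skipped."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for v, d in bfs_distances(adjacency, source).items():
        if d == 0 or v not in vocab:
            continue
        totals[d] += len(vocab[source] & vocab[v])
        counts[d] += 1
    return {d: totals[d] / counts[d] for d in counts}


# Hypothetical toy chain a - b - c with small tag vocabularies (sets of tags).
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
vocab = {"a": {"x", "y", "z"}, "b": {"x", "y"}, "c": {"q"}}
profile = shared_tags_by_distance(chain, vocab, "a")
print(profile)
```

Averaging such profiles over many source users gives a distance-dependent curve; any other pairwise similarity can be substituted for the set intersection.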
First, we must define robust measures of vocabulary similarity and group affiliation similarity between two users $u$ and $v$. The first and simplest measures are given by the number of shared tags $n_{st}$ among the tag vocabularies of $u$ and $v$, and by the number of shared groups $n_{sg}$ to which both $u$ and $v$ belong. These measures, however, are not normalized and can be affected by the specific activity patterns of the users: two very active users may have more tags in common than two less active users, just because active users tag more, on average. We therefore consider as well a distributional notion of similarity between the tag vocabularies of $u$ and $v$. Following Ref. [1] we regard the vocabulary of a user as a feature vector whose elements correspond to tags and whose entries are the tag frequencies for that specific user's vocabulary. To compare the tag feature vectors of two users, we use the standard cosine similarity [22]. Denoting by $f_u(t)$ the number of times that tag $t$ has been used by user $u$, the cosine similarity $\sigma_{tags}(u, v)$ is defined as

$$ \sigma_{tags}(u, v) = \frac{\sum_t f_u(t) f_v(t)}{\sqrt{\sum_t f_u(t)^2} \, \sqrt{\sum_t f_v(t)^2}} . \tag{5} $$

This quantity is 0 if $u$ and $v$ have no shared tags, and 1 if they have used exactly the same tags, in the same relative proportions. Because of the normalization factors in the denominator, $\sigma_{tags}(u, v)$ is not directly influenced by the global activity of a user.

[Figure 4: Top: average number of shared tags $\langle n_{st} \rangle$ for two Flickr users as a function of their distance $d$ along the social network. Bottom: average cosine similarity $\langle \sigma_{tags} \rangle$ between the tag vocabularies of two Flickr users as a function of $d$. In both cases data for the same social network with reshuffled tag vocabularies are shown.]

Similarly, we can define the cosine similarity $\sigma_{groups}$ for group memberships.
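The cosine similarity of Eq. 5 can be sketched as follows, representing each vocabulary as a dictionary from tag to frequency. The function name, the dictionary representation, and the convention of returning 0 for an empty vocabulary are assumptions made for this illustration.

```python
import math


def cosine_similarity(f_u, f_v):
    """Eq. 5 for two tag-frequency dicts {tag: count}.
    Returns 0.0 if either vocabulary is empty (an assumed convention)."""
    dot = sum(count * f_v[tag] for tag, count in f_u.items() if tag in f_v)
    norm_u = math.sqrt(sum(c * c for c in f_u.values()))
    norm_v = math.sqrt(sum(c * c for c in f_v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vocabularies: v uses the same tags as u in the same
# proportions but twice as often; w shares no tag with u.
u = {"flower": 2, "green": 1}
v = {"flower": 4, "green": 2}
w = {"tokyo": 3}
sim_uv = cosine_similarity(u, v)
sim_uw = cosine_similarity(u, w)
print(sim_uv, sim_uw)
```

Doubling every frequency of a user leaves the similarity unchanged, which is the sense in which the measure is insensitive to overall activity. Setting every frequency to 1 gives the corresponding measure for group memberships.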
Since a user belongs to a group at most once, the cosine similarity for group memberships reduces to

$$ \sigma_{groups}(u, v) = \frac{\sum_g \delta_u^g \delta_v^g}{\sqrt{n_g(u) \, n_g(v)}} \tag{6} $$

where $\delta_u^g = 1$ if $u$ belongs to group $g$ and 0 otherwise.

To compute averages of these similarities, we randomly chose $N = 2 \times 10^4$ users and explored their neighborhoods in a breadth-first fashion. In order to exclude biases due to this sampling, we also performed an exhaustive investigation of the social network neighborhoods up to distance 2 from each user, obtaining the same results. Moreover we considered the distances along the social network using $G_1$ instead of $G_0$, and again found the same results, showing the robustness of the observed behavior with respect to possible sampling biases due either to the crawl or to considering only users having both tagging activity and group memberships.

Figures 4 and 5 give an indication of how the similarity between users depends on their distance $d$ along the social network, by showing the average number of shared tags, of shared groups, and the corresponding average cosine similarities, of two users as a function of $d$. While the average number of shared tags or groups is quite large for neighbors (respectively close to 20 and to 10), it drops rapidly (exponentially) as $d$ increases, and is close to 0 for $d \geq 4$. Figures 4 and 5 provide a strong indication that a certain degree of alignment between neighbors in the social network is observed both at the lexical level and for group affiliations. However, as soon as the distance between two users on the social network exceeds 2 (that is, they are neither neighbors nor neighbors of neighbors), it becomes highly probable that these users have neither tags nor groups in common. Therefore the alignment is a strongly local effect.

The analysis of the mixing patterns of the social network performed previously leads us to investigate this local alignment in more detail.
This analysis has indeed shown the presence of strong assortativity with respect to the intensity of the users' activity. It could therefore be the case that such assortativity, by a purely statistical effect, yields an "apparent" local alignment between the tag vocabularies of users. For example, even in a hypothetical case of purely random tag assignments, it would be more probable to find tags in common between two large tag vocabularies than between a small one and a large one.

[Figure 5: Top: average number of shared groups $\langle n_{sg} \rangle$ for two Flickr users as a function of their distance $d$ along the social network. Bottom: average cosine similarity of the group affiliation $\langle \sigma_{groups} \rangle$ vs $d$. In both cases data for the same social network with reshuffled group affiliations (preserving the number of groups for each user) are shown.]

In order to discriminate between effects due to actual lexical and group membership similarity and those simply due to the assortativity, it is important to devise a proper null model, i.e., to construct an artificial system that retains the same social structure as the one under study, but lacks any lexical or topical alignment other than the one that may result from statistical effects. This is done by keeping fixed the Flickr social network and its assortativity pattern for the intensity of activity, but destroying socially related lexical or topical alignments by means of a random permutation of tags among themselves and groups among themselves. More precisely, we proceed in the following fashion: (i) we keep the social network unchanged; (ii) we build the global list of tags with their multiplicity, i.e., each tag appears the total number of times it has been used; (iii) for each user $u$ with $n_t$ tags $t_1, t_2, \ldots, t_{n_t}$, with respective frequencies $f_1, f_2, \ldots, f_{n_t}$, we extract $n_t$ distinct tags at random from the global list of tags and assign them to $u$ with frequencies $f_1, f_2, \ldots, f_{n_t}$. This guarantees that the number of distinct tags and the total number of tag assignments for each user is the same as in the original data, and that the distribution of frequencies of tags is left unchanged. Clearly, this null model preserves the assortativity patterns with respect to the amount of activity of users, as each user has exactly the same number of distinct tags and of tag assignments as in the real data. However, correlations between the tag vocabularies are lost, except the ones purely ascribed to statistical effects.

For group membership, we can proceed in a similar way: (i) we build a list of groups with a multiplicity equal to the number of users of each group (i.e., a group appears $n$ times in the list if it has $n$ users); (ii) for each user $u$ belonging to $n_g$ groups, we extract at random $n_g$ (distinct) groups from the list and assign them to $u$. As for the tags, this procedure preserves the number of groups for each user, as well as the statistics of the number of users per group, while destroying correlations between users' group memberships.

[Figure 6: Top: probability distributions of the number of shared tags of two users at distance $d$ on the social network, for $d = 1$ and $d = 2$ (symbols), and for the same network with reshuffled tags (lines). Bottom: same for the distributions of cosine similarities of the tag vocabularies.]

The goal of the null model is to determine the amount of lexical and topical alignment due to spurious activity correlations. Eliminating such spurious correlations is analogous in purpose to the use of inverse document frequency (IDF) in information retrieval.
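The tag-reshuffling steps (i)-(iii) can be sketched as follows. The text does not specify how distinctness is enforced when drawing from the global list; this sketch assumes draws with replacement from the list, rejecting tags already drawn for the same user. The function name, the data layout, and the toy data are hypothetical, and all frequencies are assumed to be positive.

```python
import random


def reshuffle_tags(user_tags, seed=None):
    """Null-model sketch. user_tags maps user -> {tag: frequency}.
    Each user keeps its number of distinct tags and its frequencies,
    but the tags themselves are redrawn from the global list."""
    rng = random.Random(seed)
    # Step (ii): global list in which each tag appears once per assignment.
    global_list = [
        tag
        for freqs in user_tags.values()
        for tag, f in freqs.items()
        for _ in range(f)
    ]
    reshuffled = {}
    for user, freqs in user_tags.items():
        # Step (iii): draw n_t distinct tags, rejecting repeated draws.
        drawn, seen = [], set()
        while len(drawn) < len(freqs):
            tag = rng.choice(global_list)
            if tag not in seen:
                seen.add(tag)
                drawn.append(tag)
        # Reassign the user's original frequencies f_1, ..., f_{n_t}.
        reshuffled[user] = dict(zip(drawn, freqs.values()))
    return reshuffled


# Hypothetical toy data; step (i), the social network, is left untouched.
toy_tags = {
    "A": {"green": 5, "red": 3, "flower": 2},
    "B": {"flower": 4, "green": 1},
    "C": {"tokyo": 6, "japan": 2},
}
shuffled = reshuffle_tags(toy_tags, seed=0)
print(shuffled)
```

Because globally frequent tags occupy more positions in the list, they are drawn more often, so the overall popularity of tags is retained statistically. The group reshuffling can be sketched the same way, with membership sets in place of frequency dictionaries.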
IDF discounts the contribution of globally common terms in assessing the similarity between documents. Such terms are likely to be shared by pairs of documents solely because of their statistical prevalence. Unlike in information retrieval, it is not straightforward to apply this type of discounting in social annotations. One would first need to determine whether to discount tags based on their prevalence among users or among resources. The null model destroys all spurious correlations regardless of their source.

Using the null model, we measured the alignment between users at distance $d$ on the social network in the same way as for the original data. As Figs. 4 (top) and 5 (top) show, the average number of shared tags or of shared groups, as a function of the distance $d$, shows a similar trend to the original (non-reshuffled) data. For neighbors and next-to-nearest neighbors ($d < 3$), the average numbers of shared tags or groups are lower in the null model than in the original data, but still significantly higher than for users at larger distances. The assortative mixing between the amount of activity of neighboring users is therefore enough to yield a strong lexical and topical alignment as simply measured by the number of shared tags or groups. The case of cosine similarity is quite different. As shown in Fig. 4 (bottom), the average cosine similarity is very small in the null model, and does not depend strongly on the distance in the social network. Therefore local lexical alignment is a real effect: friends are more likely to use similar tag patterns. With respect to the topical alignment, a certain, albeit weaker, dependence of $\langle \sigma_{groups} \rangle$ on $d$ is visible in Fig. 5 (bottom).

We also analyzed the distributions of $n_{st}$, $\sigma_{tags}$, $n_{sg}$, and $\sigma_{groups}$ among users at fixed distance $d$, for both the original and the reshuffled data. For brevity we show only the distributions of $n_{st}$ and $\sigma_{tags}$ for $d = 1$ and $d = 2$ in Fig. 6.
The distributions of $n_{st}$ are very similar for the original and the reshuffled data, while for the cosine similarity they are clearly different: a much stronger local alignment occurs in the original data.

Figure 7: Average number of shared tags (top) and average cosine similarity between tag vocabularies (bottom) for pairs of Last.fm users as a function of their social distance. We also show data for the same social network with reshuffled tag vocabularies.

As mentioned earlier, analogous results are found by analyzing our Last.fm data set. For illustration purposes we just show in Fig. 7 the dependencies of local tag alignment measures on social distance. Again cosine similarity is the more robust measure.

Our investigation of the lexical and topical alignment patterns in Flickr and Last.fm therefore reveals the following picture. The various measures of the topical and lexical overlap between users as a function of their distance along the social network clearly point toward a partial local alignment, which persists up to distances 2–3, even if large values can occasionally still be observed at larger distances. Interestingly, if the number of shared tags between users is the only retained measure, a reshuffling of tags and groups between users shows that a large part of the alignment is simply due to the assortative pattern concerning users' amounts of activity. This result highlights the importance of considering appropriate null models to discriminate between purely statistical effects and real lexical or topical alignments. It also shows that correctly normalized similarity measures such as cosine similarity, which factor out the effects of vocabulary size, are more appropriate for such investigations, since they are less affected by the assortativity patterns.

5. PREDICTING SOCIAL LINKS

The analysis in the previous section strongly suggests that users with similar topical interests, as captured by shared tags in particular, are more likely to be neighbors in the social network. Therefore a natural question is whether semantic similarity measures among users based solely on their annotation patterns can be employed as accurate predictors of friendship links. We tested this hypothesis on both our Flickr and Last.fm data sets, because each provides annotation metadata needed to compute similarity as well as a social network to evaluate the accuracy of the predictions. For brevity we focus on reporting the results for the Last.fm data, which are more interesting for two reasons. First, contrary to Flickr, Last.fm is a "broad folksonomy" in which different users can easily annotate the same songs, artists, or albums. This allows us to compute similarity based on shared content as well as shared vocabulary. Second, Last.fm provides neighbor recommendations. Neighbors are users with a similar music taste, based on listening patterns. The neighborhood relation is therefore independent of the explicit friendships established by the users, and provides an obvious gauge against which to evaluate any algorithm to predict social links. Except for the lack of such a comparison measure in Flickr (beyond the random choice baseline), and for not considering similarity measures based on shared items in Flickr, the prediction analysis yields consistent and encouraging results using both data sets.

5.1 Overview of semantic similarity measures

In prior work [12, 10, 11] we evaluated a number of social similarity measures based on folksonomies, i.e., on annotations represented as triples (user, item, tag) where Flickr photos and Last.fm songs are instances of items.
All of these social similarity measures have the desirable property of being symmetric in the sense that they can be directly applied to compute the similarity between two items, two tags, or two users from a folksonomy. Therefore we employ several of these measures here to predict social network links from the similarity among users. We summarize below a few main features of the proposed user similarity measures; for further details and examples see Refs. [10, 11].

We consider two aggregation schemes. In distributional aggregation, we project along one of the dimensions keeping track of frequencies. For example, projecting onto items, a user $u$ is represented as a tag vector whose component $f_u(t)$ is the number of items tagged by $u$ with $t$. Analogously we can project onto tags, representing users as item vectors. Unfortunately distributional aggregation requires that all similarities be recomputed for any change in annotations, leading to quadratic runtime complexity.

In collaborative aggregation, first we pick a feature (tag or item) and for each value of this feature we represent each user as a list of values of the other feature (items or tags). Then we compute a different similarity value between two users according to each of these lists. Finally we aggregate these similarities by voting (summing). For example, for each tag we can compute a similarity value based on item lists. These are then summed across tags to obtain the final similarity. Analogously we can compute similarities from tag lists and sum them across items. Collaborative aggregation has two advantages. First, it can be integrated with collaborative filtering techniques (hence the name) by a judicious definition of conditional probabilities $p(\mathrm{item} \mid \mathrm{tag})$ or $p(\mathrm{tag} \mid \mathrm{item})$. This makes collaborative similarity measures competitive with distributional measures in terms of accuracy [10].
Second, similarities based on collaborative aggregation can be updated incrementally, in linear time. When a triple is added or deleted, only similarities involving the item or tag in that triple need be updated. As a result, collaborative aggregation leads to scalable social similarity measures.

Each aggregation scheme has two variants, depending on whether we project onto/aggregate across tags or items. For each aggregation scheme/variant we consider six measures: cosine, overlap, matching, Dice and Jaccard coefficients, and maximum information path (MIP). Note that distributional cosine with projection onto tag vectors is the $\sigma_{tags}$ measure discussed in the previous section. MIP is a generalization of Lin's similarity [9] to the non-hierarchical triple representation [11]. For example, the distributional version of MIP with aggregation across items is defined as

\[
\sigma^{\mathrm{MIP}}_{\mathrm{items}}(u_1, u_2) = \frac{2 \log\left(\min_{t \in T_1 \cap T_2} p[t]\right)}{\log\left(\min_{t \in T_1} p[t]\right) + \log\left(\min_{t \in T_2} p[t]\right)}
\]

where $T_i$ is the set of tags used by $u_i$ and $p[t]$ is the fraction of users annotating with tag $t$. For aggregation across tags the definition is analogous except that we look at probabilities of shared items. For the collaborative version projecting onto items, say, we would similarly define $\sigma^{\mathrm{MIP}}_{\mathrm{items}}(u_1, u_2; r)$ for each item $r$, replacing $T_i$ by the set $T_i^r$ of tags used by $u_i$ to annotate $r$, and replacing $p[t]$ by a suitably defined $p[t \mid r]$. Finally $\sigma^{\mathrm{MIP}}_{\mathrm{items}}(u_1, u_2) = \sum_r \sigma^{\mathrm{MIP}}_{\mathrm{items}}(u_1, u_2; r)$. Among the measures discussed in Ref. [10] we did not consider mutual information due to its higher computational complexity. In addition to these $6 \times 2 \times 2 = 24$ measures, we also consider for comparison purposes the affinity score provided by Last.fm for the 60 top neighbors of each user. As mentioned earlier, this score is based on similar music taste and computed from listening patterns.
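As an illustration of the distributional measures in this subsection, the following Python sketch computes the distributional MIP between two users from their tag sets, together with the distributional cosine over tag vectors. It is a minimal sketch under stated assumptions, not the implementation used in Refs. [10, 11]: the function names and the `{user: {tag: frequency}}` layout are hypothetical, and returning 0 for users with no shared tags (or when every tag involved is used by all users, so both logarithms in the denominator vanish) is a convention chosen here, since the text does not specify these edge cases.

```python
import math


def tag_probabilities(user_tags):
    """p[t]: fraction of users who annotate with tag t."""
    n_users = len(user_tags)
    counts = {}
    for freqs in user_tags.values():
        for tag in freqs:
            counts[tag] = counts.get(tag, 0) + 1
    return {tag: c / n_users for tag, c in counts.items()}


def mip_similarity(tags1, tags2, p):
    """Distributional MIP between two users given their tags and p[t]."""
    shared = set(tags1) & set(tags2)
    if not shared:
        return 0.0  # assumption: no shared tags means zero similarity
    numerator = 2.0 * math.log(min(p[t] for t in shared))
    denominator = (math.log(min(p[t] for t in tags1))
                   + math.log(min(p[t] for t in tags2)))
    if denominator == 0.0:
        return 0.0  # assumption: all tags are universal, no information
    return numerator / denominator


def cosine_similarity(tags1, tags2):
    """Cosine between two tag-frequency vectors stored as dictionaries."""
    dot = sum(f * tags2.get(t, 0) for t, f in tags1.items())
    norm1 = math.sqrt(sum(f * f for f in tags1.values()))
    norm2 = math.sqrt(sum(f * f for f in tags2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


if __name__ == "__main__":
    users = {
        "u1": {"rock": 2, "indie": 1},
        "u2": {"rock": 1, "shoegaze": 3},
        "u3": {"jazz": 1},
        "u4": {"rock": 5},
    }
    p = tag_probabilities(users)
    print(mip_similarity(users["u1"], users["u2"], p))
    print(cosine_similarity(users["u1"], users["u2"]))
```

In this toy example the only tag shared by `u1` and `u2` is the common tag `rock`, so the MIP score is low: sharing a widely used tag carries little information, which is the intuition behind taking the minimum probability (the most specific shared tag) in the numerator.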
5.2 Methodology

The evaluation consists in selecting a set of pairs of users, computing each similarity measure for each pair, and adding social links between users in decreasing order of their topical similarity: the pairs of users with highest similarity are those we predict to be most likely friends. For each predicted social link, we check the actual social network to see if the prediction is correct. As one decreases the similarity threshold more links are added, leading to more true positives but also more false positives. The best similarity measure is the one that achieves the best ratio of true positive to false positive rate across similarity values, as illustrated by ROC plots and quantified by the area under the ROC curve (AUC).

To sample the pairs of users from our data set, we start by sorting the users by one of three different criteria:

1. Most Active: By number of annotations;
2. Most Connected: By number of friends;
3. Random: By shuffling.

The set $P$ of pairs is then constructed according to the following algorithm:

    repeat:
        pick next u by sorting criterion
        R ← set of 60 neighbors of u
        for each n from R:
            if n is active:
                P ← P ∪ {(u, n)}
    stop when |P| = M

Recall that users are considered active if they have at least one annotation. This is a requirement in order to compute topical similarity. The choice to select pairs among neighbors stems from the goal of comparing the accuracy of topical similarity methods with Last.fm recommendations. Given the sparsity of the social and neighbor networks, comparative evaluation would be impossible without such a sampling. Note that this sampling strategy may bias the evaluation in favor of Last.fm's neighbor recommendations, because if two active neighbors are friends, they are guaranteed to be detected, while two active friends who are not neighbors would be missed by our sampling even if they were detected by our similarity measures.
Therefore our sampling algorithm is a conservative choice in that it does not unfairly help our similarity measures in the evaluation.¹ We experimented with sets of pairs of cardinality $M = 1{,}000$ and $M = 2{,}500$. The results are similar; we report below on evaluations with 1,000 pairs.

¹ At press time Last.fm has released a new API functionality, called Tasteometer, to query the affinity score for arbitrary user pairs. This will allow us to sample users independently of neighborhood relations in future evaluations.

Figure 8: ROC curves comparing the social link predictions of distributional and collaborative MIP with the Last.fm recommendations. Triples can be aggregated across items (left) or tags (right). Users are sampled from the most active (top), the most connected (middle), or at random (bottom).

5.3 Results

The best results are obtained by sampling the most active users. This is not surprising, as the topical similarity measures have more evidence at their disposal from the users' metadata. In Fig. 8 we show ROC plots for the MIP measures, which perform consistently well (among the top 3 measures) in all conditions. While Last.fm neighbor recommendations do perform better than the random baseline, topical similarity is much more accurate than music taste in predicting friends for the most active users. The highest accuracy is achieved by aggregating across items, i.e., representing users as vectors of tags. For the most connected users as well as randomly selected users, the topical similarity measures still perform significantly better than the random baseline, but only marginally better than Last.fm neighbor recommendations.

Let us therefore focus on the most active users to evaluate the predictions of additional measures. Since it is difficult to compare 25 ROC plots, let us summarize our results as follows.
For each of the 24 topical similarity measures, $\sigma$, we compare the area under the ROC curve with that obtained by the Last.fm neighbor recommendations. We measure the relative improvement $\mathrm{AUC}(\sigma) / \mathrm{AUC}(\mathrm{Last.fm}) - 1$. A positive number indicates higher accuracy than Last.fm in the sense of a larger number of true positives for the same number of false positives. Fig. 9 shows that all topical similarity measures outperform the Last.fm neighbor recommendations. The lone exception is distributional item overlap, for which the improvement is not significant. For distributional measures, aggregation across items (focusing on shared tags) yields better predictions. Overall, the best accuracy is achieved by distributional MIP based on shared tags (37% improvement). However, if scalability is important, predictions of comparable accuracy can be obtained by projecting over each tag, and then aggregating the similarities across tags. Collaborative matching yields the best predictions in this case (35% improvement), followed closely by collaborative MIP (33%) and overlap (32%).

Figure 9: Relative improvement in area under ROC curves over Last.fm neighbor recommendations, for the most active users. Triples are aggregated across items (top) or tags (bottom).

In summary, these results confirm that the social network constructed from semantic similarity based on user annotations captures actual friendship more accurately than Last.fm's recommendations based on listening patterns.
This suggests that the Last.fm neighborhood selection could be improved by adopting tag-based similarity measures, especially for active users. The results are qualitatively similar for the other sampling methods, but the differences in accuracy are less significant, with the best predictions outperforming Last.fm by at most 3–4% in AUC for the most connected users and by 1–5% for random users.

6. CONCLUSION AND FUTURE WORK

In this paper we exploited one peculiarity of Flickr and Last.fm, namely the availability of both tagging data and the explicit social links between users, to investigate the interplay of the social and semantic aspects of Web 2.0 applications.

We showed that strong correlations exist between user activity in the social context (user degree centrality and group participation) and the tagging activity of the same user, and that a strong assortative mixing exists in the social network; more active nodes tend to have as neighbors other active nodes. We also found that a local alignment of users' tag vocabularies is clearly visible between nearby users in the social network, even for social tagging systems that lack a notion of globally shared tag vocabulary, such as Flickr.

We investigated the dependence of the number of shared tags and the number of shared groups of two users, as a function of their shortest-path distance on the social network. We introduced a null model and we used it to show that part of the similarity between users who are close on the social network is due to the aforementioned correlations between user activity and user degree centrality in the social network. That is, assortativity and heterogeneity alone can yield a comparatively higher overlap of tag usage and group membership for neighboring users.
In this context, our work highlights the importance of backing up the data analysis with carefully designed null models, which are necessary, as is the case here, to disentangle the actual signal we are looking for from effects arising purely from correlations and mixing properties.

Armed with the null model methodology, we showed that it is possible to define measures of tag vocabulary and group membership overlap that are robust with respect to the above biases. We investigated the average similarity of two users, according to such measures, as a function of the distance in the social network, finding that a clear signal of local lexical and topical alignment can be detected in Flickr and Last.fm.

The observed local alignment between lexical (tag) features on the social network led us to investigate the question of whether topical similarity measures based on social annotations can be applied to the prediction (or recommendation) of friend relations in a social network. Last.fm provided us with an ideal opportunity to explore this question thanks to the simultaneous availability of social link recommendations based on music listening patterns, along with the annotation metadata and social network.

We were able to evaluate the predictive power of a large number of social topical similarity measures from the literature, spanning multiple aggregation/projection schemes. The results were very encouraging; using any of the tested social similarity measures we were able to improve on the accuracy of the social link predictions provided by Last.fm, and the improvements were especially significant for users who are active taggers. Equally encouraging is the fact that accurate predictions are afforded even by incremental measures, pointing to scalable algorithms to compute social link recommendations or improve existing methods.
Among the various measures we evaluated, maximum information path has proven very accurate across aggregation schemes, data sets, and sampling methodologies. When predicting social links between active taggers, MIP is the best measure among those based on distributional aggregation (regardless of whether we aggregate across items or tags), and either the best or a close second among the scalable measures based on collaborative aggregation, across items or tags respectively.

As expected, the Last.fm neighborhood relation seems to be independent of the tagging activity of users; we obtain very close AUC values for both the most active and most connected sampling strategies. Therefore the number of annotations considered does not affect the estimation of user affinity based on listening patterns. Accordingly, the present results suggest that the Last.fm neighborhood recommendation could benefit considerably from social similarity measures, especially for active users.

Our results have important implications for the design of social media. As social networks and social tagging continue to become increasingly popular and integrated in the Web 2.0, our techniques can be directly applied to improve the synergies between social and semantic networks: specifically, to help users find friends with similar topical interests as well as facilitate the formation of topical communities.

We plan to further validate our findings via user studies. We will pursue this direction by integrating a "suggest friend" functionality into GiveALink.org, a social bookmarking system developed by our group at Indiana University for research purposes. On the more theoretical side, future work will consider the present analysis performed longitudinally in time, to move from assessing correlations to assessing causality.
We will investigate whether the activation of a social link induces a local alignment of tags and group membership, or conversely a similarity in interests triggers the creation of a social link. Both processes probably play an important role in different situations, and adding a temporal dimension to the analysis presented here will provide new insight for modeling the structure and evolution of user-driven systems.

Acknowledgments

The authors acknowledge stimulating discussion with A. Baldassarri, A. Capocci, V. Loreto, and V. D. P. Servedio. We are grateful to Flickr and Last.fm for making their data available. This work has been partly supported by the TAGora project (FP6-IST5-34721) funded by the FET program of the European Commission and by the project Social Integration of Semantic Annotation Networks for Web Applications funded by National Science Foundation award IIS-0811994. R. Schifanella was supported by the World Wide Style project (WWS) of the University of Turin.

7. REFERENCES

[1] C. Cattuto, D. Benz, A. Hotho, and G. Stumme. Semantic grounding of tag relatedness in social bookmarking systems. In Proceedings of the 7th International Semantic Web Conference (ISWC08), volume 5318 of LNCS, pages 615–631. Springer-Verlag, 2008.
[2] S. Golder and B. A. Huberman. The structure of collaborative tagging systems. Journal of Information Science, 32(2):198–208, April 2006.
[3] T. Hammond, T. Hannay, B. Lund, and J. Scott. Social Bookmarking Tools (I): A General Review. D-Lib Magazine, 11(4), April 2005.
[4] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 611–617, New York, NY, USA, 2006. ACM Press.
[5] K. Lerman and L. Jones. Social browsing on flickr. In Proceedings of International Conference on Weblogs and Social Media (ICWSM), March 2007. http://arxiv.org/abs/cs.HC/0612047.
[6] J. Leskovec, L. Backstrom, R. Kumar, and A. Tomkins. Microscopic evolution of social networks. In KDD '08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 462–470, New York, NY, USA, 2008. ACM.
[7] X. Li, L. Guo, and Y. E. Zhao. Tag-based social interest discovery. In Proceeding of the 17th Intl. Conf. on World Wide Web (WWW), pages 675–684, New York, NY, USA, 2008. ACM.
[8] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Proc. 12th Intl. Conf. on Information and Knowledge Management (CIKM), pages 556–559, New York, NY, USA, 2003. ACM.
[9] D. Lin. An information-theoretic definition of similarity. In J. W. Shavlik, editor, Proceedings of the Fifteenth International Conference on Machine Learning (ICML), pages 296–304. Morgan Kaufmann, 1998.
[10] B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and G. Stumme. Evaluating similarity measures for emergent semantics of social tagging. In Proc. 18th Intl. World Wide Web Conference (WWW), 2009.
[11] B. Markines and F. Menczer. A scalable, collaborative similarity measure for social annotation systems. In Proc. 20th ACM Conf. on Hypertext and Hypermedia (HT), 2009.
[12] B. Markines, H. Roinestad, and F. Menczer. Efficient assembly of social semantic networks. In Proc. 19th ACM Conf. on Hypertext and Hypermedia (HT), pages 149–156, New York, NY, USA, 2008. ACM.
[13] C. Marlow, M. Naaman, D. Boyd, and M. Davis. Ht06, tagging paper, taxonomy, flickr, academic article, to read. In Proc. 17th ACM Conference on Hypertext and hypermedia (HT), pages 31–40, New York, NY, USA, 2006. ACM Press.
[14] A. Mathes. Folksonomies – Cooperative Classification and Communication Through Shared Metadata, December 2004. http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html.
[15] F. Menczer. Lexical and semantic clustering by Web links. Journal of the American Society for Information Science and Technology, 55(14):1261–1269, 2004.
[16] P. Mika. Ontologies are us: A unified model of social networks and semantics. Web Semantics: Science, Services and Agents on the World Wide Web, 5(1):5–15, 2007.
[17] A. Mislove, H. S. Koppula, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Growth of the flickr social network. In Proceedings of the 1st ACM SIGCOMM Workshop on Social Networks (WOSN'08), August 2008.
[18] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of the 5th ACM/USENIX Internet Measurement Conference (IMC'07), October 2007.
[19] M. E. J. Newman. Mixing patterns in networks. Phys. Rev. E, 67:026126, 2003.
[20] R. Pastor-Satorras, A. Vázquez, and A. Vespignani. Dynamical and correlation properties of the Internet. Phys. Rev. Lett., 87:258701, 2001.
[21] C. Prieur, D. Cardon, J.-S. Beuscart, N. Pissard, and P. Pons. The strength of weak cooperation: A case study on flickr. Technical Report arXiv:0802.2317v1, CoRR, 2008.
[22] G. Salton. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989.
[23] E. Santos-Neto, D. Condon, N. Andrade, A. Iamnitchi, and M. Ripeanu. Individual and social behavior in tagging systems. In C. Cattuto, G. Ruffo, and F. Menczer, editors, Proceedings of the 20th ACM Conference on Hypertext and Hypermedia (HT), pages 183–192, New York, NY, USA, 2009. ACM.
[24] S. Staab, S. Santini, F. Nack, L. Steels, and A. Maedche. Emergent semantics. Intelligent Systems, IEEE [see also IEEE Expert], 17(1):78–86, 2002.
[25] L. Steels. Semiotic dynamics for embodied agents. IEEE Intelligent Systems, 21, 2006.
[26] R. van Zwol. Flickr: Who is looking? In WI '07: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pages 184–190, Washington, DC, USA, 2007. IEEE Computer Society.
[27] A. Vázquez, R. Pastor-Satorras, and A. Vespignani. Large-scale topological and dynamical properties of the Internet. Phys. Rev. E, 65:066130, 2002.
[28] T. V. Wal. Explaining and showing broad and narrow folksonomies, 2005. http://www.personalinfocloud.com/2005/02/explaining_and_.html.
