Leveraging User Diversity to Harvest Knowledge on the Social Web
Social web users are a very diverse group with varying interests, levels of expertise, enthusiasm, and expressiveness. As a result, the quality of the content and of the annotations they create to organize it is also highly variable. While several approaches have been proposed to mine social annotations, for example, to learn folksonomies that reflect how people relate narrower concepts to broader ones, these methods treat all users and the annotations they create uniformly. We propose a framework to automatically identify experts, i.e., knowledgeable users who create high-quality annotations, and use their knowledge to guide folksonomy learning. We evaluate the approach on a large body of social annotations extracted from the photo-sharing site Flickr. We show that using expert knowledge leads to more detailed and accurate folksonomies. Moreover, we show that including annotations from non-expert, or novice, users leads to more comprehensive folksonomies than experts’ knowledge alone.
💡 Research Summary
The paper tackles the problem of learning high‑quality folksonomies from the noisy, heterogeneous annotations that users generate on the social photo‑sharing site Flickr. While prior work (e.g., Plangprasopchok et al.) treated all users as equally reliable, the authors argue that the community is highly diverse: some users are “experts” who create deep, well‑structured personal directories (called saplings) and use precise, domain‑specific terminology, whereas others are “novices” who produce shallow, vague hierarchies. The central hypothesis is that expert annotations can guide the learning process toward more accurate and detailed taxonomies, but that novice contributions are still valuable for achieving broader coverage.
To test this hypothesis the authors first devise a systematic method for automatically identifying expert users. They extract a rich set of quantitative features from both the user level (e.g., number of saplings, number of tag‑relations, entropy‑based balance of sapling sizes, Jensen‑Shannon disparity between tag distributions of a user’s saplings) and the sapling level (e.g., depth‑weighted breadth – “Sapling‑Variety”, normalized entropy balance across levels, number of leaves, conflict count, root‑diversity measured by how concentrated the distribution of child nodes is for a given root concept). The intuition behind root‑diversity is that meaningful high‑level concepts such as “nature” or “Europe” have a few dominant children (high peakedness), whereas vague roots like “misc” or “other stuff” produce a flat distribution of children.
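The entropy-based features above can be illustrated with a short sketch. It is not the authors' implementation: the function names, the choice of base-2 logarithms, and the definition of peakedness as one minus normalized entropy are assumptions made for illustration, since the summary does not give the exact formulas. Inputs are assumed to be non-empty.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def normalized_entropy(counts):
    """Entropy of a count vector divided by its maximum, log2(n).

    1.0 means the counts are perfectly even; values near 0 mean one
    entry dominates. Used here as a stand-in for a "balance" feature.
    """
    total = sum(counts)
    n = len(counts)
    if n < 2 or total == 0:
        return 0.0
    return entropy(c / total for c in counts) / math.log2(n)


def js_divergence(tags_a, tags_b):
    """Jensen-Shannon divergence (in bits) between two tag lists.

    Computed as H(M) - (H(P) + H(Q)) / 2, where M is the average of the
    two tag distributions. 0 means identical, 1 means disjoint.
    """
    ca, cb = Counter(tags_a), Counter(tags_b)
    vocab = sorted(set(ca) | set(cb))
    na, nb = sum(ca.values()), sum(cb.values())
    p = [ca[t] / na for t in vocab]
    q = [cb[t] / nb for t in vocab]
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


def root_peakedness(child_counts):
    """Assumed proxy for root-diversity: 1 - normalized entropy.

    High when a few children dominate (e.g. a root like "nature"),
    low when children are spread evenly (e.g. a root like "misc").
    """
    return 1.0 - normalized_entropy(child_counts)


# Hypothetical child-frequency counts for two root concepts.
focused_root = [97, 1, 1, 1]   # a few dominant children
vague_root = [10, 10, 10, 10]  # flat distribution of children
```

Under these assumptions, `root_peakedness(focused_root)` is much larger than `root_peakedness(vague_root)`, which matches the intuition that meaningful roots have a few dominant children.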
Using a small manually labeled seed set (200 users, 19 consensus experts) they train three classifiers: J48 decision trees, Random Forests, and a linear SVM (LibSVM). They then employ a self‑training loop: users predicted to be experts by any classifier are manually verified and added to the training pool, and the models are retrained. After eight iterations the LibSVM model reaches 100% precision, 88% recall, and a 93% F‑score on the enriched set (315 users, 43 experts). Applying the final model to the full dataset of 7,121 Flickr users yields 66 experts. Feature‑selection analyses (SVM‑based, ReliefF, Information Gain, Chi‑Square) consistently rank “Sapling‑Depth” as the most discriminative feature, followed by “Number of Leaves” and “Sapling‑Balance”, confirming that experts tend to build deep, balanced hierarchies.
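As a quick consistency check on the reported scores, the F‑score can be recomputed from precision and recall. The sketch below assumes the reported value is the balanced F1 measure (the harmonic mean of precision and recall); the function name and the `beta` parameter are illustrative and not taken from the paper.

```python
def f_score(precision, recall, beta=1.0):
    """General F-beta score; beta=1 gives the harmonic mean (F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Reported LibSVM figures: 100% precision, 88% recall.
reported = f_score(1.00, 0.88)
print(f"{reported:.3f}")  # prints 0.936
```

The result, about 0.936, is consistent with the roughly 93% F‑score quoted above.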
Having identified experts, the authors modify the Relational Affinity Propagation (RAP) algorithm, originally proposed for clustering saplings into a common folksonomy. RAP works by constructing an affinity matrix over all sapling nodes, then iteratively selecting exemplars (representative nodes) and assigning other nodes to them, effectively merging multiple personal hierarchies into a single, deeper, bushier tree. In the enhanced version, each sapling is weighted according to the expertise of its creator: expert saplings receive higher weight, novice saplings lower weight. The authors evaluate three configurations: (1) uniform weighting (baseline RAP), (2) expert‑only weighting, and (3) combined weighting with differential emphasis.
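The effect of the three weighting configurations can be sketched with a deliberately simplified stand-in. The code below is not RAP: it replaces affinity propagation with plain weighted voting over parent-child relations, and the user IDs, relations, and weight values are all hypothetical. It only shows how per-creator weights can change which parent wins for a given concept.

```python
from collections import defaultdict


def merge_saplings(saplings, expert_ids, expert_weight=1.0, novice_weight=1.0):
    """Merge personal hierarchies by weighted voting (simplified, not RAP).

    saplings: list of (user_id, [(parent, child), ...]) pairs.
    Each relation votes for `parent` as the parent of `child`, with the
    vote weighted by whether the creator is in `expert_ids`.
    Returns a dict mapping each child to its highest-weighted parent.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for user_id, relations in saplings:
        weight = expert_weight if user_id in expert_ids else novice_weight
        if weight <= 0:
            continue  # a zero weight drops this user's saplings entirely
        for parent, child in relations:
            votes[child][parent] += weight
    return {child: max(parents, key=parents.get)
            for child, parents in votes.items()}


# Hypothetical data: one expert with a deep sapling, two novices with shallow ones.
saplings = [
    ("u1", [("invertebrates", "arachnids"),
            ("arachnids", "spiders"),
            ("spiders", "jumping spider")]),
    ("u2", [("animals", "jumping spider")]),
    ("u3", [("animals", "jumping spider"), ("animals", "ladybug")]),
]
experts = {"u1"}

uniform = merge_saplings(saplings, experts)                       # configuration (1)
expert_only = merge_saplings(saplings, experts, 1.0, 0.0)         # configuration (2)
combined = merge_saplings(saplings, experts, 3.0, 1.0)            # configuration (3)
```

In this toy example, uniform weighting attaches "jumping spider" directly to "animals", expert-only weighting recovers the deeper path but loses "ladybug", and combined weighting keeps both the expert's intermediate concepts and the novices' extra leaf.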
Experimental results show that the expert‑weighted RAP produces folksonomies that are significantly more detailed: it recovers intermediate concepts (e.g., “jumping spider → spiders → arachnids → invertebrates”) that the baseline misses, and the resulting hierarchy aligns better with external taxonomic resources. However, when only expert saplings are used, the overall coverage suffers because many niche or peripheral concepts are absent. Incorporating novice saplings, even with lower weight, dramatically increases the breadth of the learned taxonomy, adding many leaf concepts and improving recall. Thus, the best overall folksonomy balances both sources: expert knowledge drives precision and depth, while novice contributions supply completeness.
The study contributes two major insights to the field of social‑web knowledge mining. First, it demonstrates that the structure of user‑generated annotations alone, without any analysis of textual content, can reliably distinguish high‑quality contributors, enabling scalable expert detection. Second, it provides a principled way to fuse heterogeneous contributions by weighting them according to inferred expertise, yielding folksonomies that are both accurate and comprehensive. The methodology should transfer to other platforms that support hierarchical tagging or collection organization (e.g., YouTube playlists, Instagram album groups, Stack Overflow tag hierarchies). Potential downstream applications include automatic ontology construction, improved semantic search, personalized recommendation, and enhanced data integration across social media ecosystems.