Wikiometrics: A Wikipedia Based Ranking System
📝 Abstract
We present a new concept - Wikiometrics - the derivation of metrics and indicators from Wikipedia. Wikipedia provides an accurate representation of the real world due to its size, structure, editing policy and popularity. We demonstrate an innovative mining methodology, where different elements of Wikipedia - content, structure, editorial actions and reader reviews - are used to rank items in a manner which is by no means inferior to rankings produced by experts or other methods. We test our proposed method by applying it to two real-world ranking problems: top world universities and academic journals. Our proposed rankings were compared to leading and widely accepted benchmarks and were found to be highly correlated with them, with the added advantage that the underlying data is publicly available.
📄 Content
Wikiometrics: A Wikipedia Based Ranking System
Gilad Katz* and Lior Rokach†
* University of California, Berkeley, giladk@berkeley.edu
† Ben-Gurion University of the Negev, liorrk@bgu.ac.il
1. Introduction
Ranking is the process of determining the relative standing of items. It is common in many domains, both scientific and non-scientific. Ranking is often considered a difficult problem because there is no absolute “ground truth” against which the generated rankings can be compared. Nonetheless, many studies have made use of ranking in general and of Wikipedia in particular.
Wikipedia has been used in many scientific fields, including computer science, medicine, physics and sociology. According to [1], the number of Wikipedia-related papers appears to grow with each passing year. Several traits make Wikipedia such a valuable source of information for research:
Size and scope: The English Wikipedia alone has over 4.6 million entries, whereas Encyclopedia Britannica, one of the best-known “regular” encyclopedias, has 40,000. This great difference in scope suggests that Wikipedia covers a multitude of fields and areas of interest that curated encyclopedias do not.
Timely and updated: Because Wikipedia’s open editing policy enables anyone to modify its content, the information it contains is almost always up to date. Case in point: in 2013, a few minutes after the election of the new pope, one of the authors of this study reviewed the relevant Wikipedia entries and found that they had already been updated with the elected pope’s new status.
Tags and metadata: Wikipedia contains multiple types of user-generated content (UGC). Categories, links, redirect pages and infoboxes can all be used to infer the type, attributes and connections of the various entities represented in Wikipedia.
Wisdom of the crowd: Since anyone can contribute content to Wikipedia, it reflects the thoughts, ideas and perceptions of peoples, groups and societies [2]. This enables us to use Wikipedia to measure popularity, importance and influence. In a sense, Wikipedia is “representative of the real world.”
We argue that Wikipedia’s scope and open editing policy render it a representation of the real world. By representation, we mean that the “footprint” of an entity or a concept in Wikipedia is often indicative of its popularity or importance in the real world. We believe that by treating Wikipedia in this way, researchers will be able to use it to address multiple real-world challenges. This change of focus could be significant, as Wikipedia’s most utilized feature is currently its text.
In this study we propose a novel concept, Wikiometrics: the derivation of metrics and indicators from Wikipedia. While entity ranking is often subjective, we argue that Wikipedia represents the “wisdom of the crowd” and can effectively reflect common perceptions. We propose using three Wikipedia features (infobox data, links and page views) and applying them to two of the most widely studied ranking tasks in scientometrics: the ranking of world universities and of academic journals. In both cases we compare our results to those of leading and widely accepted rankings and show that the correlation between our proposed ranking and each of the baselines is similar to the correlation of the baselines among themselves.
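The comparison described above (a proposed ranking against each baseline, and the baselines against each other) can be illustrated with a small sketch. This passage does not state which correlation measure is used, so the Spearman rank correlation below is an assumption. The item names, the page view counts and the baseline scores are invented placeholders, not data from the study, and the function names `ranks` and `spearman` are hypothetical.

```python
from typing import Dict, List


def ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Assign rank 1 to the highest score; tied items share the average rank."""
    ordered: List[str] = sorted(scores, key=lambda k: scores[k], reverse=True)
    result: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of items tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[ordered[k]] = average_rank
        i = j + 1
    return result


def spearman(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    items = sorted(set(a) & set(b))  # compare only items present in both
    if len(items) < 2:
        raise ValueError("need at least two shared items")
    ranks_a = ranks({k: a[k] for k in items})
    ranks_b = ranks({k: b[k] for k in items})
    xs = [ranks_a[k] for k in items]
    ys = [ranks_b[k] for k in items]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all ranks are equal")
    return cov / (var_x * var_y) ** 0.5


# Invented placeholder data (NOT from the study): one Wikipedia-derived
# signal and two hypothetical expert baselines for five items.
page_views = {"A": 900, "B": 700, "C": 650, "D": 300, "E": 100}
baseline_1 = {"A": 98, "B": 91, "C": 93, "D": 70, "E": 60}
baseline_2 = {"A": 95, "B": 94, "C": 90, "D": 65, "E": 72}

proposed_vs_1 = spearman(page_views, baseline_1)
proposed_vs_2 = spearman(page_views, baseline_2)
baselines = spearman(baseline_1, baseline_2)
print(proposed_vs_1, proposed_vs_2, baselines)
```

Under this reading, the claim in the text corresponds to the first two values being in the same range as the third. A single feature is used here only to keep the sketch short; how infobox data, links and page views are actually combined is not specified in this passage.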
Our contribution in this study is twofold. First, we propose a novel approach to a previously unaddressed problem: the ranking of real-world objects. Second, we demonstrate how two underutilized Wikipedia features, infoboxes and page views, can be effectively used to address this challenge.
The remainder of this paper is organized as follows. In Section 2 we review related work while in Sections 3 and 4 we present two case studies and evaluate the performance of the proposed methods. In Section 5 we present our conclusions and future research directions.
2. Related Work
In this section we review four topics. In Sect