📝 Original Info
- Title: Mining Meaning from Wikipedia
- ArXiv ID: 0809.4530
- Date: 2009-05-10
- Authors: Olena Medelyan, David Milne, Catherine Legg, Ian H. Witten
📝 Abstract
Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced.
📄 Full Content
Mining meaning from Wikipedia
OLENA MEDELYAN, DAVID MILNE, CATHERINE LEGG and IAN H. WITTEN
University of Waikato, New Zealand
Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of
researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of
manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied
to a host of tasks.
This article provides a comprehensive description of this work. It focuses on research that extracts and makes
use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad
categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and
information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being
used as is, how it is being improved and adapted, and how it is being combined with other structures to create
entirely new resources. We identify the research groups and individuals involved, and how their work has
developed in the last few years. We provide a comprehensive list of the open-source software they have
produced.
1. INTRODUCTION
Wikipedia requires little introduction or explanation. As everyone knows, it was launched
in 2001 with the goal of building free encyclopedias in all languages. Today it is easily
the largest and most widely used encyclopedia in existence. Wikipedia has become
something of a phenomenon among computer scientists as well as the general public. It
represents a vast investment of freely given manual effort and judgment, and the last few
years have seen a multitude of papers that apply it to a host of different problems. This
paper provides the first comprehensive summary of this research (up to mid-2008), which
we collect under the deliberately vague umbrella of mining meaning from Wikipedia. By
meaning, we encompass everything from concepts, topics, and descriptions to facts,
semantic relations, and ways of organizing information. Mining involves both gathering
meaning into machine-readable structures (such as ontologies), and using it in areas like
information retrieval and natural language processing.
Traditional approaches to mining meaning fall into two broad camps. On one side are
carefully hand-crafted resources, such as thesauri and ontologies. These resources are
generally of high quality, but by necessity are restricted in size and coverage. They rely
on the input of experts, who cannot hope to keep abreast of the incalculable tide of new
discoveries and topics that arise constantly. Even the most extensive manually created
resource—the Cyc ontology, whose hundreds of contributors have toiled for 20 years—
has limited size and patchy coverage [Sowa 2004]. The other extreme is to sacrifice
quality for quantity and obtain knowledge by performing large-scale analysis of
unstructured text. However, human language is rife with inconsistency, and our intuitive
understanding of it cannot be entirely replicated in rules or trends, no matter how much
data they are based upon. Approaches based on statistical inference might emulate human
intelligence for particular purposes and in specific situations, but cracks appear when
generalizing or moving into new domains and tasks.
Wikipedia provides a middle ground between these two camps—quality and
quantity—by offering a rare mix of scale and structure. With two million articles and
thousands of contributors, it dwarfs any other manually created resource by an order of
magnitude in the number of concepts covered, has far greater potential for growth, and
offers a wealth of further useful structural features. It contains around 18 GB of text, and
its extensive network of links, categories and infoboxes provides a variety of explicitly
defined semantics that other corpora lack. One must, however, keep Wikipedia in
perspective. It does not always engender the same level of trust or expectations of quality
as traditional resources, because its contributors are largely unknown and unqualified. It is
also far smaller and less representative of all human language use than the web as a
whole. Nevertheless, Wikipedia has received enthusiastic attention as a promising natural
language and informational resource of unexpected quality and utility. Here we focus on
research that makes use of Wikipedia, and as far as possible leave aside its controversial
nature.
This paper is structured as follows. In the next section we describe Wikipedia’s
creation process and structure, and how it is viewed by computer scientists as anything
from a corpus, taxonomy, thesaurus, or hierarchy of knowledge topics to a full-blown
ont
…(Full text truncated)…
Reference
This content is AI-processed based on ArXiv data.