Toward a new instances of NELL

October 12, 2016

📝 Original Info

Title: Toward a new instances of NELL
ArXiv ID: 1610.03246
Date: 2016-10-12
Authors: Maisa C. Duarte and Pierre Maret

📝 Abstract

We are developing a method to start new instances of NELL in various languages and then to develop NELL's multilingualism. We base our method on our experience with NELL Portuguese and NELL French. This report explains our method and develops some research perspectives.

📄 Full Content

NELL (Never-Ending Language Learner) [1] is a computer system that runs 24 hours a day, 7 days a week. It was started on January 12, 2010 and is intended to run forever, reading the web and gathering more and more facts to grow and populate its own knowledge base.

In short, we can describe the system as follows: NELL’s initial knowledge base (KB) is an ontology defining hundreds of categories (e.g., athlete, sports, sportsTeam, fruit, product, country, city, emotion, etc.) and relations (e.g., athletePlaysForTeam(athlete, sportsTeam), cityLocatedInCountry(city, country)), together with a set of 10 to 15 examples (instances) for each category (e.g., athlete(Kobe Bryant), sportsTeam(LA Lakers), etc.) and for each relation (e.g., athletePlaysForTeam(Kobe Bryant, LA Lakers), cityLocatedInCountry(New York, USA), etc.). Publications that explain NELL in more detail and that support the current project can be found in [1], [4], [5] and [6]. Publication [7] is the most recent one covering the whole system and its current components.
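The seed knowledge base described above can be pictured as a small data structure. The following is only a minimal sketch under our own assumptions: the class name `Ontology`, its fields and its methods are hypothetical and do not reflect NELL's actual storage format. It uses the category and relation names from the text, with a single seed each for brevity where a real instance would supply 10 to 15.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Ontology:
    """Hypothetical container for a seed ontology (not NELL's real data model)."""

    # category name -> seed instances (unary predicates)
    categories: Dict[str, Set[str]] = field(default_factory=dict)
    # relation name -> (domain category, range category)
    signatures: Dict[str, Tuple[str, str]] = field(default_factory=dict)
    # relation name -> seed pairs (binary predicates)
    relations: Dict[str, Set[Tuple[str, str]]] = field(default_factory=dict)

    def add_category(self, name: str, seeds: List[str]) -> None:
        self.categories.setdefault(name, set()).update(seeds)

    def add_relation(self, name: str, domain: str, range_: str,
                     seeds: List[Tuple[str, str]]) -> None:
        # A relation may only connect categories that are already declared.
        for category in (domain, range_):
            if category not in self.categories:
                raise ValueError(f"unknown category: {category}")
        self.signatures[name] = (domain, range_)
        self.relations.setdefault(name, set()).update(seeds)


kb = Ontology()
kb.add_category("athlete", ["Kobe Bryant"])
kb.add_category("sportsTeam", ["LA Lakers"])
kb.add_category("city", ["New York"])
kb.add_category("country", ["USA"])
kb.add_relation("athletePlaysForTeam", "athlete", "sportsTeam",
                [("Kobe Bryant", "LA Lakers")])
kb.add_relation("cityLocatedInCountry", "city", "country",
                [("New York", "USA")])
```

Keeping the relation signature separate from the seed pairs makes it easy to check that every relation refers only to declared categories.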

The standard process of NELL is depicted in Figure 1 in a simple and generic view. As the figure shows, the input is the All-Pairs-Data and the ontology/knowledge base (KB), and the output is the ontology/knowledge base. Each part of the process is explained in more detail in the rest of this document.

The present document describes what is necessary to set up a new NELL instance in a different language. In summary, an ontology and an input are required; both are described in detail in Sections 2, 3 and 4. The process of creating a new NELL instance was first published for the Portuguese version of NELL [2, 3], and a second time for the French version. This document is based on these publications and experiences.

The input of NELL is the web. NELL reads and learns from web pages, and it learns all the time, “forever”. The key idea of NELL is to read and understand better each day. We can imagine a human reading a book on an unknown subject: the more the human reads, the more knowledge he or she is able to extract. NELL performs a similar task: it reads the web many times (each pass is called an “iteration”), which requires a lot of time and resources. From the text, NELL pre-processes a source base called All-Pairs-Data.

An All-Pairs-Data is built from a large corpus and stores all occurrences and co-occurrences between Named Entities (NE) and Textual Patterns (TP) in two views: Categories and Relations.

Categories correspond to learning a unary relation. For example, the Category City can have the unary relation City(New York). Relations correspond to learning a binary relation, for example the Relation LocatedIn(New York, USA). For Categories the system extracts just one instance per predicate (New York), while for Relations it extracts one pair of instances per predicate (New York, USA).

[Figure 1: INPUT (All-Pairs-Data, Ontology/KB), NELL, OUTPUT (Ontology/KB)]

An All-Pairs-Data is created either for both Categories and Relations or just for Categories. For Categories, the All-Pairs-Data consists of all occurrences between a TP and an NE. For Relations, it consists of all occurrences between a pair of NEs and a TP.

Tables 1 and 2 show a simple example of an All-Pairs-Data for a Category (Table 1) and for a Relation (Table 2). These tables give the number of occurrences between an NE and a TP for categories, and between a pair of NEs and a TP for relations. When NELL learns, it performs the calculations needed to discover and count the co-occurrences.
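The two views of an All-Pairs-Data can be illustrated with a simple counting sketch. This is an illustration under our own assumptions, not NELL's implementation: the function `build_all_pairs` and the observation lists are hypothetical, and the counts are invented toy data rather than the contents of Tables 1 and 2. Strings are counted exactly as given, without lowercasing or other normalization.

```python
from collections import Counter

# Toy observations (invented for illustration only).
# Category view: one (NE, TP) pair per occurrence in the corpus.
category_observations = [
    ("New York", "is a very famous city"),
    ("New York", "is a very famous city"),
    ("Paris", "is a very famous city"),
    ("New York", "Located in USA,"),
]

# Relation view: one ((NE1, NE2), TP) pair per occurrence in the corpus.
relation_observations = [
    (("New York", "USA"), "is located in"),
    (("Paris", "France"), "is located in"),
    (("New York", "USA"), "is located in"),
]


def build_all_pairs(observations):
    """Count how often each (entity or entity pair, pattern) key occurs."""
    return Counter(observations)


category_pairs = build_all_pairs(category_observations)
relation_pairs = build_all_pairs(relation_observations)

# Number of times "New York" co-occurs with the pattern "is a very famous city".
ny_city_count = category_pairs[("New York", "is a very famous city")]

# All NEs seen with a given pattern, which is the kind of lookup a learner needs.
entities_for_pattern = sorted(
    ne for (ne, tp) in category_pairs if tp == "is a very famous city"
)
```

A `Counter` keyed by tuples keeps both views in the same shape; only the first element of the key changes from a single NE to a pair of NEs.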

The NE and TP extractions are guided by a part-of-speech tagging process. Any other approach can be applied to extract the NEs and TPs. The only important requirement is that the identified NEs and TPs are not modified. In other words, it is important to keep the strings exactly as they were found in the original text, so that the process stays consistent (for more information about the keys of NELL, see http://www.cs.cmu.edu/ ).

Currently, one of the approaches applied in NELL to create an All-Pairs-Data for Categories is the following: after the NE is found, two TPs are extracted, one on the left of the NE and the other on the right. For example, in “Located in USA, New York is a very famous city”, the NE “New York” yields one TP on the left, “Located in USA,”, and another on the right, “is a very famous city”.

For other languages (basically all languages except English, for now), combinations of between 2 or 3 and 5 grams are saved, for example “is a city”, “is a city located”, “is a city located near”. In French and Portuguese, for Relations, the TP is the text of the sentence between the NE pair; for Categories, the number of grams is between 3 and 5. For English, the number of grams is also at most 5, but different filters are used to find the best TP for an NE. These filters have not yet been developed for other languages.
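The extraction rules above can be sketched with plain string operations. This is a minimal sketch assuming the NE has already been located by some tagger and that tokens are separated by whitespace; the function names `split_context`, `leading_ngrams` and `pattern_between` are hypothetical and are not part of NELL. The only change made to the text is trimming surrounding whitespace; the pattern strings themselves are kept as found.

```python
from typing import List, Optional, Tuple


def split_context(sentence: str, entity: str) -> Optional[Tuple[str, str]]:
    """Return the text to the left and to the right of the first occurrence of entity."""
    start = sentence.find(entity)
    if start < 0:
        return None
    left = sentence[:start].strip()
    right = sentence[start + len(entity):].strip()
    return left, right


def leading_ngrams(context: str, min_n: int = 3, max_n: int = 5) -> List[str]:
    """Candidate category TPs: the first min_n..max_n tokens of a right context."""
    tokens = context.split()
    return [" ".join(tokens[:n])
            for n in range(min_n, max_n + 1)
            if len(tokens) >= n]


def pattern_between(sentence: str, first: str, second: str) -> Optional[str]:
    """Candidate relation TP: the text between two NEs, in order of appearance."""
    i = sentence.find(first)
    if i < 0:
        return None
    j = sentence.find(second, i + len(first))
    if j < 0:
        return None
    return sentence[i + len(first):j].strip()


sentence = "Located in USA, New York is a very famous city"
left_tp, right_tp = split_context(sentence, "New York")
grams = leading_ngrams("is a city located near the coast")
relation_tp = pattern_between("New York is located in USA", "New York", "USA")
```

With the sentence from the text, `left_tp` is “Located in USA,” and `right_tp` is “is a very famous city”. The English-specific filters mentioned above are not modeled here.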

In order to start a new NELL instance, we need to give names of some Categories and


This content is AI-processed based on open access ArXiv data.
