1.5 billion words Arabic Corpus

Reading time: 5 minutes
...

📝 Original Info

  • Title: 1.5 billion words Arabic Corpus
  • ArXiv ID: 1611.04033
  • Date: 2016-11-15
  • Authors: Ibrahim Abu El-khair

📝 Abstract

This study is an attempt to build a contemporary linguistic corpus for the Arabic language. The produced corpus is a text corpus of more than five million newspaper articles. It contains over a billion and a half words in total, of which about three million are unique. The data were collected from newspaper articles in ten major news sources from eight Arabic countries over a period of fourteen years. The corpus was encoded in two encodings, UTF-8 and Windows CP-1256, and marked up with two markup languages, SGML and XML.
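The abstract notes that the corpus is distributed in both UTF-8 and Windows CP-1256 encodings. A minimal sketch of converting an article file between the two, assuming plain-text files (the file names below are hypothetical; the corpus distribution layout is not described in this excerpt):

```python
# Minimal sketch: converting a corpus article file between the two
# encodings mentioned in the abstract. File names are illustrative.
from pathlib import Path

def convert_encoding(src: Path, dst: Path,
                     src_enc: str = "cp1256", dst_enc: str = "utf-8") -> None:
    """Read an Arabic text file in one encoding and rewrite it in another."""
    text = src.read_text(encoding=src_enc)
    dst.write_text(text, encoding=dst_enc)

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    convert_encoding(Path("article_cp1256.txt"), Path("article_utf8.txt"))
```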

📄 Full Content

The efficiency of any information retrieval system depends mainly on the experiments conducted by researchers in the field and by the commercial companies producing these systems. These experiments emulate real-world queries submitted to a system and its responses to them, and they are usually conducted in a closed laboratory environment. The elements of the retrieval process in this type of experiment are controlled by the researchers in order to determine the causes of success or failure and to fix them.

Language corpora are one of the most important elements for information retrieval experiments in particular and for natural language processing in general, because a corpus represents the actual everyday use of the language. Corpus use in retrieval has improved significantly for most languages, especially Latin-based ones; for Arabic, it is still relatively new.

Arabic is the language of the holy Quran. It is used by more than a billion and a half Muslims around the world in their daily rituals, and it is the mother tongue of about two hundred and fifty million people. It is also the official language of twenty-two countries and an official language of non-Arabic countries such as Chad, Eritrea, Mali, and Turkey (Encyclopaedia Britannica Almanac, 2009). Moreover, it has been one of the six official languages of the United Nations (UN, 2015) since 1973 (UN, 1973).

In spite of all of the above, Arabic language corpora are still in need of more research and study, and there is an ongoing need for more of them. The majority of the corpora available now are relatively small or rather expensive. The main purpose of this paper is to produce a new, free corpus: one that is large, representative of the language, drawn from different countries, different writing styles, and more than one source, and distributed over many years. It will be available to researchers in information retrieval, computational linguistics, and natural language processing.

Table one shows some of the previous attempts to create Arabic corpora. It should be noted that this review is limited to textual monolingual corpora, excluding word lists, lexicons, speech corpora, and opinion corpora; all types are reviewed by Zaghouani (2014).

Web scraping, or web copying, programs were used to extract text from the news sources in order to create the corpus. The researcher first tried wget (https://www.gnu.org/software/wget), which is used by the LDC, and the HTTrack site copier (https://www.httrack.com), but both were very slow, so they were not used. Two other programs, Internet Download Manager (https://www.internetdownloadmanager.com) and Cyotek WebCopy (http://www.cyotek.com/cyotek-webcopy), were also eliminated because they stopped working for no apparent reason, in addition to being slow. After several attempts the researcher settled on MetaProducts Offline Explorer Pro and Visual Web Ripper. Both programs were very good at extracting text and eliminating unnecessary objects such as images, videos, JavaScript files, and CSS files, as illustrated by the sketch below.
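The paper relies on off-the-shelf site copiers for this step; purely as an illustration of what that extraction amounts to, here is a stdlib-only Python sketch that fetches one page and keeps only its visible text, discarding scripts and styles. It is not the authors' tooling, and the URL is a placeholder.

```python
# Illustrative stand-in for the site-copying step: fetch one article page
# and keep only its visible text, dropping <script> and <style> content.
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def page_text(url: str) -> str:
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

if __name__ == "__main__":
    print(page_text("https://example.com/news/article.html"))  # placeholder URL
```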

There are many news sources that could be used to create a language corpus. In this paper, the researcher chose ten sources for the corpus, testing several news websites before making the selection. The fame of the website or news source and the number of readers were not the criteria for selection; there were other criteria and technical reasons for selecting the news sources used in building the corpus.

The first criterion was having no overlap with previous Arabic corpora. Table two lists the selected sources for the corpus: each source's name in English and Arabic, its abbreviation, the time period it covers, its country of origin, and its website. Nine newspapers and one news agency from eight countries were selected, as shown in the table. Egypt and Saudi Arabia are each represented by two newspapers, since they were the pioneers of online journalism and have some of the oldest online newspapers in the Arab world.

The coverage period varies from one source to another. The starting time for each news source is basically the time it first appeared online, while the ending date depended on the time of data collection. Some websites allowed harvesting the news archive but not the current news, such as Alyaum from Saudi Arabia and Almasryalyoum from Egypt.
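The columns of table two and the per-source coverage dates amount to a simple per-source record. A hedged sketch of such a record in Python follows; the field names mirror the table columns described above, and the example values are invented for illustration, not taken from the paper's table.

```python
# A record mirroring the columns of table two: source name (English and
# Arabic), abbreviation, country, website, and coverage period. The
# example entry below is hypothetical.
from dataclasses import dataclass
from datetime import date

@dataclass
class NewsSource:
    name_en: str
    name_ar: str
    abbreviation: str
    country: str
    website: str
    coverage_start: date   # roughly when the source first appeared online
    coverage_end: date     # when its data were collected

sources = [
    NewsSource(
        name_en="Example Daily",       # hypothetical entry
        name_ar="صحيفة مثال",
        abbreviation="EXD",
        country="Egypt",
        website="https://example.com",
        coverage_start=date(2002, 1, 1),
        coverage_end=date(2014, 12, 31),
    ),
]
```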

Two tagging schemes were used for the corpus. All articles in the current corpus were tagged with SGML (Standard Generalized Markup Language), which is used in the TREC corpora. The other scheme was XML (Extensible Markup Language) tagging, which is used in the LDC corpora.
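As a rough illustration of what dual tagging looks like, the Python sketch below wraps a single article in a TREC-style SGML record and in an XML record. The element names (DOC, DOCNO, HEADLINE, TEXT) follow common TREC conventions and are assumptions; the paper's exact tag set is not shown in this excerpt.

```python
# Minimal sketch of wrapping one article in TREC-style SGML and in XML.
# Element names are assumed TREC-style conventions, not the paper's
# confirmed schema.
import xml.etree.ElementTree as ET

def to_sgml(doc_id: str, headline: str, body: str) -> str:
    return (
        "<DOC>\n"
        f"<DOCNO>{doc_id}</DOCNO>\n"
        f"<HEADLINE>{headline}</HEADLINE>\n"
        f"<TEXT>\n{body}\n</TEXT>\n"
        "</DOC>"
    )

def to_xml(doc_id: str, headline: str, body: str) -> str:
    doc = ET.Element("DOC")
    ET.SubElement(doc, "DOCNO").text = doc_id
    ET.SubElement(doc, "HEADLINE").text = headline
    ET.SubElement(doc, "TEXT").text = body
    return ET.tostring(doc, encoding="unicode")

if __name__ == "__main__":
    print(to_sgml("RYD_ARB_0000001", "عنوان الخبر", "نص المقال ..."))
    print(to_xml("RYD_ARB_0000001", "عنوان الخبر", "نص المقال ..."))
```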

Each article is given an ID made up of the source abbreviation (table two), the Arabic language abbreviation, and a serial number, e.g. RYD_ARB_0000001.
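A tiny sketch of generating such IDs; the seven-digit zero padding is inferred from the RYD_ARB_0000001 example given above and is otherwise an assumption.

```python
# Build article IDs of the form <SOURCE>_<LANG>_<serial>, matching the
# RYD_ARB_0000001 example; the seven-digit padding is inferred from it.
def article_id(source_abbr: str, serial: int, lang: str = "ARB") -> str:
    return f"{source_abbr}_{lang}_{serial:07d}"

assert article_id("RYD", 1) == "RYD_ARB_0000001"
```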

