Experiments with Different Indexing Techniques for Text Retrieval tasks on Gujarati Language using Bag of Words Approach

📝 Original Info

  • Title: Experiments with Different Indexing Techniques for Text Retrieval tasks on Gujarati Language using Bag of Words Approach
  • ArXiv ID: 2002.01792
  • Date: 2023-07-15
  • Authors: Jyoti Pareek, Hardik Joshi, Krunal Chauhan, Rushikesh Patel

📝 Abstract

This paper presents the results of various experiments carried out to improve text retrieval of Gujarati text documents. Text retrieval involves searching and ranking text documents for a given set of query terms. We have tested various retrieval models that use the bag-of-words approach, a traditional approach still in use today in which a text document is represented as a collection of words. Measures such as frequency count and inverse document frequency are used to identify and rank relevant documents for user queries. Different ranking models have been used to quantify ranking performance using the metric of mean average precision (MAP). Because Gujarati is a morphologically rich language, we have compared techniques such as stop word removal, stemming and frequent case generation against a baseline to measure the improvements in information retrieval tasks. Most of these techniques are language dependent and require the development of language-specific tools. Using a plain unprocessed word index as the baseline, we observed significant improvements in MAP values after applying the different indexing techniques.

📄 Full Content

Information retrieval (IR) deals with the retrieval of unstructured or partially structured text data, especially textual documents. Relevant documents are retrieved in response to a query, which may itself be structured or unstructured. The typical interaction between a user and an IR system can be modeled as follows: the user submits information needs to the system in the form of queries, and the system returns a ranked list of relevant documents that match the queries. The ranked list is ordered so that the most relevant documents are at the top. When the information need is not known in advance and the user query is fired once on indexed data, the task is referred to as ad hoc information retrieval [3].

The need for effective methods of automated indexing and automated IR has grown with the explosion in the number of text documents and the increase in sources of information on the Internet. In the last decade, there has been significant growth in the number of text documents in Indian languages. Researchers have been performing IR tasks in English and European languages for many years through evaluation forums such as TREC [4] and CLEF [5]; efforts are being made to encourage IR tasks for Indian languages through the evaluation forum FIRE [6].

IR evaluation forums and research communities use resources known as test collections [7]. The classic components of a test collection are:

  1. A huge corpus that includes a collection of text documents; a tag “docid” is used to identify each document uniquely.
  2. A set of queries (also referred to as topics); a tag “qid” is used to identify each query uniquely. The query is further classified as T (title), TD (title and description) and TDN (title, description and narration).
  3. A collection of query relevance judgements (also referred to as qrels or relevance judgements), which consists of pairs detailing the matching documents for each query, that is, the gold standard for each query.

Dr. Jyoti Pareek, Hardik Joshi, Krunal Chauhan, Rushikesh Patel

Ad hoc IR can be represented as deriving a ranked list of the most relevant documents from a static collection of documents with regard to a one-time information need in the form of a query. A scoring function (a.k.a. retrieval model) is used to estimate the relevance and rank of each document in the collection with reference to the query. In the bag-of-words approach, the document is treated as a collection of words; semantic information such as co-occurrence of words and linguistic information such as parts of speech are not taken into account.
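The components of a test collection, and the way qrels serve as a gold standard, can be sketched in a few lines of Python. This is only an illustrative sketch: the document ids, texts and judgements are invented placeholders (in Latin script for readability), whitespace tokenization is an assumption, and the function names are hypothetical rather than taken from the paper or from any toolkit. The evaluation function computes mean average precision, the metric named in the abstract.

```python
from collections import Counter

# Hypothetical toy test collection; ids, texts and judgements are invented.
documents = {
    "d1": "cricket match result today",
    "d2": "monsoon rain forecast for gujarat",
    "d3": "cricket team wins match",
}
topics = {"q1": "cricket match"}
# qrels: for each qid, the set of docids judged relevant (the gold standard).
qrels = {"q1": {"d1", "d3"}}


def bag_of_words(text):
    """Represent text as term -> count, discarding word order."""
    return Counter(text.split())


def average_precision(ranked_docids, relevant):
    """Average of precision@k over the ranks k that hold a relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of average precision over all judged queries."""
    total = sum(average_precision(runs.get(qid, []), rel) for qid, rel in qrels.items())
    return total / len(qrels)
```

Here `runs` maps each qid to the ranked list of docids returned by a retrieval model, so the same qrels can be reused to compare different indexing techniques against a baseline.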

In information retrieval, the query itself is represented as a document that may share the same representation as the documents within the collection. Therefore, the relevance of any document can be interpreted as a measure of similarity between two documents (a document belonging to the collection and the query document). In the bag-of-words approach, document relevance is aggregated from the relevance of each query term taken separately. It is usually defined as the sum of each query term’s weights in the query document and the collection. The IR task is to judge how much each query term contributes to the overall relevance of a document in the collection. Documents matching more query terms than the others should therefore be favored. However, large documents have a greater chance of containing a query term, so various statistical measures have been proposed to penalize large documents to some extent.
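The idea of summing per-term weights while penalizing long documents can be illustrated with a minimal TF-IDF style scorer. This is a sketch under stated assumptions, not the weighting used in the paper or in Terrier: dividing the term frequency by the document length is just one simple normalization, and the function name and arguments (`doc_freq`, `num_docs`) are hypothetical.

```python
import math
from collections import Counter


def tfidf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Sum, over query terms, of a length-normalized TF multiplied by IDF.

    doc_freq maps a term to the number of documents containing it;
    num_docs is the size of the collection. Both are assumed inputs.
    """
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    if doc_len == 0:
        return 0.0
    score = 0.0
    for term in query_terms:
        tf = counts[term]
        df = doc_freq.get(term, 0)
        if tf == 0 or df == 0:
            continue  # a term absent from the document contributes nothing
        idf = math.log(num_docs / df)  # rarer terms weigh more
        score += (tf / doc_len) * idf  # dividing by length penalizes long documents
    return score
```

With this scheme, a two-word document containing a query term once scores higher than a ten-word document containing it once, and a document matching two query terms scores higher than the same document matched on only one of them.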

Several retrieval models have been proposed to improve indexing and retrieval tasks. In this paper, we have experimented with a Gujarati document collection using the Terrier tool [8], which implements several widely used retrieval models such as the vector space model (TF-IDF model) [9], probabilistic models like BM25 [10], language models such as the Dirichlet prior [11], information-based approaches [12] and the divergence from randomness framework [13]. These are the state-of-the-art methods for performing ad hoc IR using the bag-of-words approach.
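To make two of these models concrete, the per-term contributions of BM25 and of a Dirichlet-smoothed language model can be written as below. This is a hedged sketch of common textbook formulations, not Terrier's code: the IDF variant, the parameter defaults (`k1=1.2`, `b=0.75`, `mu=2500`) and the function names are assumptions, and Terrier's own implementations may differ in such details. In both cases a document's score is the sum of the per-term values over the query terms.

```python
import math


def bm25_term(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document.

    Uses a non-negative IDF variant; k1 controls term-frequency saturation
    and b controls how strongly long documents are penalized.
    """
    if tf == 0:
        return 0.0
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


def dirichlet_term(tf, doc_len, coll_tf, coll_len, mu=2500.0):
    """Log-probability of one query term under a Dirichlet-smoothed document model.

    coll_tf is the term's frequency in the whole collection and must be > 0;
    coll_len is the total number of tokens in the collection.
    """
    p_coll = coll_tf / coll_len
    return math.log((tf + mu * p_coll) / (doc_len + mu))
```

Both functions reward higher term frequency and, for the same term frequency, give a lower value to a longer document, which is the length penalty described earlier.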

Gujarati is a resource-constrained language. To the best of our knowledge, only a single corpus is available for performing ad hoc IR tasks. An IR task must use gold standard data to evaluate various tools and techniques. Details of the document collection (corpus) and topics (queries) are as follows:

To conduct our experiments, we used the collection of Gujarati text documents made available by the Forum for Information Retrieval Evaluation (FIRE) in 2011 [6]. Details of the Gujarati test collection used for the experiments are given in Table I. The test collection was created from the news articles of the daily newspaper “Gujarat Samachar”, where the articles are included from 2001

Reference

This content is AI-processed based on open access ArXiv data.
