Probability Based Clustering for Document and User Properties


📝 Original Info

  • Title: Probability Based Clustering for Document and User Properties
  • ArXiv ID: 1102.3865
  • Date: 2011-02-21
  • Authors: Thomas Mandl (University of Hildesheim, Germany), Christa Womser-Hacker (University of Hildesheim, Germany)

📝 Abstract

Information Retrieval systems can be improved by exploiting context information such as user and document features. This article presents a model based on overlapping probabilistic or fuzzy clusters for such features. The model is applied within a fusion method which linearly combines several retrieval systems. The fusion is based on weights for the different retrieval systems which are learned by exploiting relevance feedback information. This calculation can be improved by maintaining a model for each document and user cluster. That way, the optimal retrieval system for each document or user type can be identified and applied. The extension presented in this article allows overlapping, probabilistic clusters of features to further refine the process.

💡 Deep Analysis
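
The abstract describes a linear fusion whose weights are kept per document or user cluster, with cluster membership allowed to overlap. The following is a minimal sketch of how such a scheme could be scored, not the authors' formula: it assumes that the effective weight of each retrieval system is the membership-weighted average of the per-cluster weights. The function names, the cluster names `short` and `long`, and all numbers are hypothetical, and the learning of the weights from relevance feedback is left out.

```python
from typing import Dict, List


def effective_weights(memberships: Dict[str, float],
                      cluster_weights: Dict[str, List[float]]) -> List[float]:
    """Mix per-cluster system weights by soft cluster memberships.

    Assumption (not taken from the paper): memberships are normalized to
    sum to 1 and then used as mixing coefficients.
    """
    total = sum(memberships.values())
    if total <= 0:
        raise ValueError("memberships must have a positive sum")
    n_systems = len(next(iter(cluster_weights.values())))
    mixed = [0.0] * n_systems
    for cluster, degree in memberships.items():
        for i, weight in enumerate(cluster_weights[cluster]):
            mixed[i] += (degree / total) * weight
    return mixed


def fused_score(system_scores: List[float], weights: List[float]) -> float:
    """Linear combination of the relevance scores of several systems."""
    return sum(w * s for w, s in zip(weights, system_scores))


# Hypothetical example: two retrieval systems, two overlapping clusters.
cluster_weights = {"short": [0.8, 0.2], "long": [0.3, 0.7]}
memberships = {"short": 0.75, "long": 0.25}  # the document belongs to both
weights = effective_weights(memberships, cluster_weights)
print(weights, fused_score([0.4, 0.9], weights))
```

With hard clusters, one membership is 1 and the rest are 0, so the sketch reduces to picking the weight vector of a single cluster. Fractional memberships interpolate between the cluster models.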

📄 Full Content

In: Ojala, Timo (ed.): Infotech Oulu International Workshop on Information Retrieval (IR 2001). Oulu, Finland. 19.-21.9.2001. pp. 100-107.

Thomas Mandl, Christa Womser-Hacker
University of Hildesheim, Information Science
Marienburger Platz 22 - D-31141 Hildesheim - Germany
{mandl, womser}@uni-hildesheim.de

1. Introduction

The quality of retrieval systems is still an important research issue [14]. Fusion is a technique for improving on the performance of individual information retrieval engines by combining them. This article presents an extension of an adaptive fusion method that allows document properties to be modeled using probabilistic clusters. Section 2 gives a short overview of fusion in machine learning and retrieval. Section 3 discusses the MIMOR model and its fusion strategy. Section 4 presents the extension of the fusion model that adapts retrieval to differences between individual users and between documents.

2. Fusion in Information Retrieval

Fusion methods delegate a task to different systems and integrate the results returned into one final result.
For information retrieval tasks, this means integrating different probabilities for the relevance of a document. These probabilities are calculated by different retrieval systems applying, e.g., specific indexing, linguistic, context or similarity functions. As the TREC experiments have shown, the results of information retrieval systems that perform similarly well often differ: the systems find the same percentage of relevant documents, but the overlap between their results is sometimes low. Fusion therefore seems to be a promising approach and has been applied to text retrieval in many systems. Many studies have shown improvements compared to individual retrieval systems [1, 2, 7, 8, 12, 13].

2.1. Committee Machines in Machine Learning

The idea of fusion in information retrieval corresponds to the concept of committee machines in artificial intelligence. Machine learning has coined the notion of committee machines for the combination of supervised learning algorithms for prognosis or classification. The metaphor suggests that these systems try to work in a way similar to a committee of humans. The following architectures are discussed [4]:

  • Static methods: the input of single experts is not considered
      • Ensemble averaging: the output of different experts is combined linearly with weights assigned to each expert (weighted voting)
      • Boosting: a weak learning algorithm is improved by retraining badly classified examples with another learning method
  • Dynamic methods: the input of experts governs the integration process
      • Mixture of experts: the output of several experts is combined non-linearly
      • Hierarchical mixture of experts: the combination system is organized in a hierarchical manner

None of these fusion methods requires experts that differ in every detail or in their basic algorithms. For example, the results of a neural net usually depend on its random initialization.
The same neural net with different initialization states can be considered as different experts. The same applies to other learning methods with different parameter settings.

2.2. Fusion in Information Retrieval

The application of fusion methods in information retrieval has gained considerable attention, especially since the beginning of TREC. An overview of fusion algorithms is given by McCabe et al. [11]. Simple numeric functions such as the sum, minimum or maximum of individual results have been applied, as well as majority vote and linear combination with a weight assigned to each retrieval system. All methods used so far belong to the static committee machines. Research aims to find out which retrieval or indexing methods should be combined, which committee machine architecture should be used, and which features of collections indicate that fusion might lead to positive results. The ideas from information retrieval and machine learning are integrated in the RankBoost algorithm, which applies boosting to ranked result lists [5].

A different approach toward fusion is presented by the popular internet meta search engines. They have been developed because
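
The static combination functions named in Section 2.2 (sum, minimum, maximum and weighted linear combination) can be sketched as follows. This is a minimal illustration rather than any published system: the `Run` type, the `fuse` and `linear_combination` helpers, the sample scores, and the choice to score a document as 0.0 for a system that did not return it are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Run = Dict[str, float]  # document id -> relevance score from one system


def linear_combination(weights: Sequence[float]) -> Callable[[Sequence[float]], float]:
    """Build a combiner that weights each system's score (static fusion)."""
    def combine(scores: Sequence[float]) -> float:
        return sum(w * s for w, s in zip(weights, scores))
    return combine


def fuse(runs: Sequence[Run],
         combine: Callable[[Sequence[float]], float]) -> List[Tuple[str, float]]:
    """Merge the result lists of several systems into one ranking.

    Assumption: a document missing from a run gets the score 0.0 there.
    Ties are broken by document id so the ranking is deterministic.
    """
    doc_ids = sorted(set().union(*runs))
    fused = [(doc, combine([run.get(doc, 0.0) for run in runs])) for doc in doc_ids]
    return sorted(fused, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical scores from two retrieval systems with little overlap.
runs = [{"d1": 0.9, "d2": 0.4}, {"d2": 0.8, "d3": 0.6}]
for name, combiner in [("sum", sum), ("min", min), ("max", max),
                       ("linear", linear_combination([0.25, 0.75]))]:
    print(name, [doc for doc, _ in fuse(runs, combiner)])
```

Because the combiner is passed in as a function, the same `fuse` routine covers all four static variants. Only the linear combination has parameters that could be tuned, for example from relevance feedback.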

