📝 Original Info
- Title: Probability Based Clustering for Document and User Properties
- ArXiv ID: 1102.3865
- Date: 2011-02-21
- Authors: - Thomas Mandl (University of Hildesheim, Germany) - Christa Womser‑Hacker (University of Hildesheim, Germany)
📝 Abstract
Information Retrieval systems can be improved by exploiting context information such as user and document features. This article presents a model based on overlapping probabilistic or fuzzy clusters for such features. The model is applied within a fusion method which linearly combines several retrieval systems. The fusion is based on weights for the different retrieval systems which are learned by exploiting relevance feedback information. This calculation can be improved by maintaining a model for each document and user cluster. That way, the optimal retrieval system for each document or user type can be identified and applied. The extension presented in this article allows overlapping, probabilistic clusters of features to further refine the process.
📄 Full Content
In: Ojala, Timo (ed.): Infotech Oulu International Workshop on Information Retrieval (IR 2001). Oulu, Finland.
19-21 September 2001, pp. 100-107.
Probability Based Clustering for
Document and User Properties
Thomas Mandl, Christa Womser-Hacker
University of Hildesheim, Information Science
Marienburger Platz 22 - D-31141 Hildesheim - Germany
{mandl, womser}@uni-hildesheim.de
1. Introduction
The quality of retrieval systems is still an important research issue [14]. Fusion is a technique
that improves retrieval performance by combining several individual information retrieval
engines. This article presents an extension of an adaptive fusion method that models document
properties using probabilistic clusters. Section 2 gives a short overview of fusion in machine
learning and retrieval. Section 3 discusses the MIMOR model and its fusion strategy. Section 4
presents the extension of the fusion model, which adapts retrieval to individual user and
document differences.
2. Fusion in Information Retrieval
Fusion methods delegate a task to different systems and integrate the results they return into
one final result. For information retrieval tasks, this means integrating different probabilities
for the relevance of a document. These probabilities are calculated by different retrieval
systems that apply, for example, specific indexing, linguistic, context or similarity functions.
As the TREC experiments have shown, the results of information retrieval systems that
perform similarly well often differ. That is, the systems find the same
percentage of relevant documents, but the overlap between their results is sometimes low.
Fusion therefore seems to be a promising approach and has been applied to text retrieval in
many systems. Many studies have shown improvements compared to individual retrieval
systems [1, 2, 7, 8, 12, 13].
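The observation about equal effectiveness but low overlap can be illustrated with a minimal sketch. The document identifiers, the two result sets, and the helper names `recall` and `overlap` below are hypothetical and not taken from TREC or from this article; the overlap measure used here (Jaccard) is one possible choice among several.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that appear in the retrieved set."""
    return len(retrieved & relevant) / len(relevant)


def overlap(a, b):
    """Jaccard overlap between two result sets (an assumed measure)."""
    return len(a & b) / len(a | b)


# Hypothetical data: 8 relevant documents, two systems that each find 4 of them.
relevant = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}
system_a = {"d1", "d2", "d3", "d4", "x1"}
system_b = {"d4", "d5", "d6", "d7", "x2"}

recall_a = recall(system_a, relevant)                 # 4 of 8 relevant found
recall_b = recall(system_b, relevant)                 # 4 of 8 relevant found
shared = overlap(system_a, system_b)                  # only d4 is shared
recall_union = recall(system_a | system_b, relevant)  # 7 of 8 relevant found
```

In this toy case both systems reach the same recall, yet they share a single document, so their union covers more relevant documents than either system alone. This is the situation in which fusion can pay off.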
2.1. Committee Machines in Machine Learning
The idea of fusion in information retrieval corresponds to the concept of committee machines
in artificial intelligence. Machine learning has coined the notion of committee machines for
the combination of supervised learning algorithms for prognosis or classification. This
metaphor suggests that these systems try to work in a way similar to a committee of humans.
The following architectures are discussed [4]:
• Static methods: the input of single experts is not considered
    • Ensemble averaging: the output of different experts is combined linearly with weights
      assigned to each expert (weighted voting)
    • Boosting: a weak learning algorithm is improved by retraining badly classified
      examples with another learning method
• Dynamic methods: the input of experts governs the integration process
    • Mixture of experts: the output of several experts is combined non-linearly
    • Hierarchical mixture of experts: the combination system is organized in a hierarchical
      manner
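The difference between static and dynamic combination can be sketched as follows. This is a minimal illustration under stated assumptions, not the architecture of any system discussed here: the two constant experts, the fixed weights, and the linear gate parameters are hypothetical, and the softmax gate is the formulation commonly used for mixtures of experts in the machine learning literature.

```python
import math


def ensemble_average(expert_outputs, weights):
    """Static combination: fixed weights that do not depend on the input."""
    total = sum(weights)
    return sum(w * o for w, o in zip(weights, expert_outputs)) / total


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def mixture_of_experts(x, experts, gate_params):
    """Dynamic combination: a gate computes the weights from the input x."""
    gate = softmax([a * x + b for a, b in gate_params])
    return sum(g * expert(x) for g, expert in zip(gate, experts))


# Two hypothetical experts with constant outputs and a hypothetical gate.
experts = [lambda x: 0.2, lambda x: 0.8]
static_result = ensemble_average([e(1.0) for e in experts], [1.0, 3.0])
gate_params = [(-2.0, 0.0), (2.0, 0.0)]
low = mixture_of_experts(-3.0, experts, gate_params)   # gate favours expert 0
high = mixture_of_experts(3.0, experts, gate_params)   # gate favours expert 1
```

With ensemble averaging the result is the same weighted mean for every input, whereas the gated mixture moves from the first expert's output toward the second expert's output as the input changes.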
None of these fusion methods requires experts that differ in every detail or in their
basic algorithms. For example, the results of a neural net usually depend on its random
initialization, so the same neural net with different initialization states can be considered as
different experts. The same applies to other learning methods with different parameter settings.
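The point about random initialization can be made concrete with a small sketch. It assumes the simplest possible neural net, a single threshold unit trained with the perceptron rule on a toy AND problem; the data, the seeds, and the function names are hypothetical and chosen only for illustration.

```python
import random

# Toy training data for logical AND: ((x1, x2), label).
DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def predict(weights, point):
    """Threshold unit with weights = [w1, w2, bias]."""
    activation = weights[0] * point[0] + weights[1] * point[1] + weights[2]
    return 1 if activation > 0 else 0


def train_perceptron(seed, rate=0.1, max_epochs=1000):
    """Train one expert; only the random initialization depends on the seed."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    for _ in range(max_epochs):
        mistakes = 0
        for point, label in DATA:
            error = label - predict(weights, point)
            if error != 0:
                mistakes += 1
                weights[0] += rate * error * point[0]
                weights[1] += rate * error * point[1]
                weights[2] += rate * error
        if mistakes == 0:
            break  # every training example is classified correctly
    return weights


# Same architecture, same data, three different initializations.
experts = [train_perceptron(seed) for seed in (1, 2, 3)]
```

All three units fit the training data, but they end up with different weight vectors and therefore different decision boundaries, so a committee can treat them as distinct experts.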
2.2. Fusion in Information Retrieval
The application of fusion methods in information retrieval has gained considerable attention
especially since the beginning of TREC. An overview of fusion algorithms is given by
McCabe et al. [11]. Simple numeric functions like sum, minimum or maximum of individual
results have been applied as well as majority vote or linear combination with a weight
assigned to each retrieval system. All methods used so far belong to the static committee
machines.
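The static combination functions named above can be sketched in a few lines. The scores, the weights, the vote threshold, and the function names `combine` and `fuse` are hypothetical; treating a document that a system did not retrieve as having score 0 is an assumption of this sketch, not a rule stated in the literature cited here.

```python
def combine(scores, method, weights=None, threshold=0.5):
    """Fuse the scores that several systems assigned to one document."""
    if method == "sum":
        return sum(scores)
    if method == "min":
        return min(scores)
    if method == "max":
        return max(scores)
    if method == "vote":
        # Majority vote: 1.0 if more than half of the systems exceed the threshold.
        votes = sum(1 for s in scores if s > threshold)
        return 1.0 if votes > len(scores) / 2 else 0.0
    if method == "linear":
        return sum(w * s for w, s in zip(weights, scores))
    raise ValueError(f"unknown method: {method}")


def fuse(runs, method, weights=None):
    """Rank documents by fused score; a document a system misses scores 0."""
    docs = set().union(*runs)
    fused = {
        d: combine([run.get(d, 0.0) for run in runs], method, weights)
        for d in docs
    }
    # Sort by descending fused score, breaking ties by document identifier.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical scores from three retrieval systems.
runs = [
    {"d1": 0.9, "d2": 0.4, "d3": 0.1},
    {"d1": 0.2, "d2": 0.7, "d3": 0.6},
    {"d1": 0.3, "d2": 0.6},
]
ranking_sum = fuse(runs, "sum")
ranking_max = fuse(runs, "max")
ranking_linear = fuse(runs, "linear", weights=[0.6, 0.2, 0.2])
```

On this toy data the sum ranks d2 first because two systems score it moderately high, while the maximum and the linear combination, which trusts the first system most, rank d1 first. In every variant the combination rule is fixed in advance and does not depend on the query or document, which is what makes these methods static.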
Research aims at finding out which retrieval or indexing methods should be combined,
which committee machine architecture should be used and which features of collections
indicate that a fusion might lead to positive results. The ideas from information retrieval and
machine learning are integrated in the algorithm RankBoost, which applies boosting to ranked
result lists [5].
A different approach toward fusion is presented by the popular internet meta search
engines. They have been developed because