Feature Weighting for Improving Document Image Retrieval System Performance

Feature weighting is a technique used to approximate the optimal degree of influence of individual features. This paper presents a feature weighting method for a Document Image Retrieval System (DIRS) based on keyword spotting. In this method, we weight each feature using the coefficient of multiple correlation, which can be used to describe the synthesized effect and correlation of each feature. The aim of this paper is to show that feature weighting increases the performance of DIRS. After applying the feature weighting method to DIRS, the average precision is 93.23% and the average recall is 98.66%.

💡 Research Summary

The paper addresses a longstanding problem in document image retrieval systems (DIRS): while many visual features can be extracted from scanned pages (e.g., projection profiles, zoning histograms, block geometry, inter‑character spacing), traditional DIRS treat all of them with equal importance during similarity computation. This uniform treatment ignores the fact that some features are far more discriminative for keyword spotting than others, leading to sub‑optimal precision and recall.
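To make the problem concrete, the following minimal sketch compares an equal-weight Euclidean distance with a per-feature weighted one on toy data. The feature vectors, the weights, and the function name `weighted_distance` are all illustrative assumptions; none of them come from the paper, and the paper's actual similarity measure may differ.

```python
import math


def weighted_distance(a, b, weights=None):
    """Euclidean distance with optional per-feature weights.

    With weights=None every feature counts equally, which mirrors the
    uniform treatment described above.
    """
    if weights is None:
        weights = [1.0] * len(a)
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


# Hypothetical 3-feature vectors: feature 0 is assumed to be discriminative,
# features 1 and 2 are assumed to be noisy.
query = [0.90, 0.10, 0.80]
match = [0.88, 0.60, 0.20]      # same word; only the noisy features differ
non_match = [0.30, 0.15, 0.75]  # different word; noisy features happen to agree

# Assumed weights that favor the discriminative feature.
w = [0.8, 0.1, 0.1]

# Equal weights: the noisy features dominate and the non-match looks closer.
uniform_prefers_non_match = (
    weighted_distance(query, non_match) < weighted_distance(query, match)
)

# Weighted: the discriminative feature dominates and the true match is closer.
weighted_prefers_match = (
    weighted_distance(query, match, w) < weighted_distance(query, non_match, w)
)
```

In this toy case the equal-weight distance ranks the wrong candidate first, while the weighted distance ranks the true match first. The weights here are hand-picked for illustration only.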

To remedy this, the authors propose a feature‑weighting scheme based on the coefficient of multiple correlation (CMC). CMC is a statistical measure that quantifies the combined linear relationship between a set of independent variables (the extracted features) and a dependent variable (the document matching score). A multiple linear regression model is fitted on a training set in which the ground‑truth relevance of document pairs is known; the regression coefficients and the inter‑feature correlation matrix are then used to compute a CMC value for each feature. These CMC values are then normalized to the interval