PCA-Based Relevance Feedback in Document Image Retrieval

Considerable research has been devoted in recent years to relevance feedback as an effective way to improve the performance of information retrieval systems. Relevance feedback is an interactive process in which user judgments refine subsequent retrieval results. In this paper we propose the use of relevance feedback to improve the performance of a document image retrieval system (DIRS), and we compare a variety of strategies for positive and negative feedback. In addition, a feature subspace is extracted and updated during the feedback process using a Principal Component Analysis (PCA) technique, based on the user’s feedback. That is, beyond reducing the dimensionality of the feature space, a proper subspace for each type of feature is obtained during feedback to further improve retrieval accuracy. Experiments show that a DIRS with relevance feedback achieves better performance than the conventional DIRS.


💡 Research Summary

This paper presents an enhanced Document Image Retrieval System (DIRS) that incorporates relevance feedback (RF) and a dynamic Principal Component Analysis (PCA)–based feature subspace adaptation to improve retrieval accuracy and efficiency. The authors first integrate traditional RF, allowing users to label retrieved documents as relevant (positive) or non‑relevant (negative). Three feedback strategies are examined: (i) positive‑only feedback, (ii) combined positive‑and‑negative feedback, and (iii) negative‑only feedback. Experiments reveal that positive feedback yields the greatest boost in precision, but the inclusion of negative feedback helps suppress noise and narrows the search space, leading to a more robust ranking.
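The paper does not spell out the exact query-update rule, but the classic Rocchio formulation illustrates how positive-only, combined, and negative-only strategies differ: the weights β and γ control how strongly relevant and non-relevant examples pull the query vector. The function name and coefficient values below are illustrative, not taken from the paper:

```python
import numpy as np

def rocchio_update(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant examples and away from
    non-relevant ones (classic Rocchio formulation). Setting gamma=0
    gives positive-only feedback; beta=0 gives negative-only feedback."""
    q = alpha * query
    if len(positives) > 0:
        q = q + beta * np.mean(positives, axis=0)
    if len(negatives) > 0:
        q = q - gamma * np.mean(negatives, axis=0)
    return q

# Toy 3-D example: feedback pulls the query toward the relevant documents
# and pushes it away from the non-relevant one.
query = np.array([1.0, 0.0, 0.0])
pos = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]])
neg = np.array([[0.0, 0.0, 1.0]])
updated = rocchio_update(query, pos, neg)
```

The asymmetric weights (β > γ) reflect the paper's finding that positive feedback contributes most of the precision gain, while negative feedback plays a secondary, noise-suppressing role.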

The second contribution is the on‑the‑fly reconstruction of the feature space using PCA during the feedback loop. Conventional DIRS relies on a fixed 93‑dimensional feature vector that encodes shape, layout, text block density, and other visual cues. In the proposed framework, the set of user‑marked samples becomes the training set for PCA: the covariance matrix is computed, eigenvectors are sorted by eigenvalue magnitude, and a subset of principal components that retain a predefined cumulative variance (e.g., 95 %) is retained. This typically reduces the dimensionality to 30–40 components. Dimensionality reduction serves two purposes: it automatically discards noisy, less discriminative dimensions, thereby increasing retrieval precision, and it dramatically lowers the computational cost of distance calculations, improving response time.
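The PCA step described above can be sketched as follows. The 95% cumulative-variance cutoff and the 93-dimensional feature vectors come from the paper; the function names and the random toy data are illustrative:

```python
import numpy as np

def pca_subspace(samples, variance_keep=0.95):
    """Fit PCA on the user-marked feedback samples; keep the smallest
    number of principal components whose cumulative explained variance
    reaches `variance_keep`."""
    mean = samples.mean(axis=0)
    cov = np.cov(samples - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending order
    order = np.argsort(eigvals)[::-1]                # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eigvals = np.clip(eigvals, 0.0, None)            # guard against round-off
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(ratio, variance_keep)) + 1
    return eigvecs[:, :k], mean

def project(x, basis, mean):
    """Project a feature vector into the learned PCA subspace."""
    return (x - mean) @ basis

# Example: 20 feedback samples in the 93-dimensional feature space.
rng = np.random.default_rng(0)
feedback = rng.normal(size=(20, 93))
basis, mean = pca_subspace(feedback, variance_keep=0.95)
reduced_query = project(feedback[0], basis, mean)
```

With only a handful of feedback samples the covariance matrix is rank-deficient, so the retained dimensionality is bounded by the sample count; this is also why the paper flags over-fitting with scarce feedback as a limitation.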

A further refinement is the creation of separate PCA subspaces for distinct feature families (structural, textual density, graphic elements, etc.). Because each family exhibits a different statistical distribution, learning independent subspaces allows the system to capture the most informative directions for each type of cue. During feedback, the influence of a user‑selected document on each subspace is measured; subspaces that receive stronger signals are up‑weighted, while those with weaker influence are down‑weighted. The final similarity score is a weighted sum of distances computed in each subspace, producing a query adaptation that aligns closely with the user’s intent.
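The weighted-sum scoring over per-family subspaces can be sketched as below. The split of the 93-dimensional vector into two families and the weight values are hypothetical, and identity bases stand in for bases actually fitted by PCA on feedback samples:

```python
import numpy as np

def weighted_similarity(query, doc, subspaces, weights):
    """Combine per-family subspace distances into a single score.
    `subspaces` maps a family name to (feature indices, PCA basis, mean);
    `weights` maps a family name to its feedback-derived weight."""
    score = 0.0
    for family, (idx, basis, mean) in subspaces.items():
        q = (query[idx] - mean) @ basis   # project the query's slice
        d = (doc[idx] - mean) @ basis     # project the document's slice
        score += weights[family] * np.linalg.norm(q - d)
    return score

# Hypothetical split of a 93-dim vector into two feature families.
subspaces = {
    "structural": (np.arange(0, 50), np.eye(50), np.zeros(50)),
    "textual":    (np.arange(50, 93), np.eye(43), np.zeros(43)),
}
weights = {"structural": 0.7, "textual": 0.3}  # stronger family up-weighted

query, doc = np.zeros(93), np.zeros(93)
doc[0] = 1.0  # differ only in one structural dimension
score = weighted_similarity(query, doc, subspaces, weights)
```

In the paper's scheme the weights are not fixed but adjusted each round according to how strongly the user-selected documents influence each subspace.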

The experimental protocol uses publicly available handwriting datasets (IAM) together with a proprietary collection of scanned documents. Evaluation metrics include precision, recall, and average precision (AP). Baselines consist of the original DIRS and a version that applies PCA without any feedback. Results show that (1) positive‑only feedback raises average precision from 0.78 to 0.85 (≈9 % relative improvement), (2) combined positive‑and‑negative feedback further improves precision to 0.87 and recall to 0.81, and (3) the PCA‑driven subspace reduction cuts computation time by roughly 45 % while preserving, or even slightly improving, accuracy. The authors attribute these gains to the mitigation of the “curse of dimensionality” and the ability of a small set of feedback examples to generate a stable, discriminative subspace.
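For reference, the average-precision metric used in the evaluation can be computed as the mean of precision@k at each rank where a relevant document appears (the input below is a toy ranked list, not data from the paper):

```python
def average_precision(ranked_relevance):
    """Average precision for one ranked result list: the mean of
    precision@k taken at each rank k holding a relevant document."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant documents at ranks 1 and 3: AP = (1/1 + 2/3) / 2 ≈ 0.833
ap = average_precision([1, 0, 1, 0])
```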

Key contributions are: (i) a systematic integration of relevance feedback into DIRS, enabling interactive, user‑driven refinement; (ii) a dynamic PCA mechanism that simultaneously reduces dimensionality and tailors the feature space to the current feedback context; (iii) a quantitative analysis of the relative impact of positive versus negative feedback, offering practical guidance for system designers. Limitations include potential over‑fitting of PCA when feedback samples are scarce and the computational overhead of updating PCA in real‑time environments. Future work is suggested to explore online Independent Component Analysis (ICA) or deep learning‑based embeddings as alternatives to PCA, aiming to further accelerate adaptation and capture non‑linear feature relationships.