Annotation-Free Reinforcement Learning Query Rewriting via Verifiable Search Reward

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv paper.

Optimizing queries for Retrieval-Augmented Generation (RAG) systems poses a significant challenge, particularly across indices of diverse modalities. We introduce RL-QR, a novel annotation-free reinforcement learning framework for query rewriting that eliminates the need for costly human-annotated data. By leveraging verifiable search rewards derived from index-aligned synthetic queries, RL-QR removes the dependency on human annotation, extending its applicability to varied modalities and index domains. Experimental results demonstrate the framework's robustness, achieving retrieval performance gains of up to 3.9$\times$ on lexical retrievers and 3.5$\times$ on semantic retrievers on the MTEB VIDORE V2 benchmark for unstructured visual documents, along with consistent 5% to 10% improvements on MS MARCO v2.1 and internal industrial datasets.


💡 Research Summary

The paper tackles a central bottleneck in Retrieval‑Augmented Generation (RAG) systems: the quality of the user query that drives the retrieval step. Existing query‑rewriting approaches either rely on manually crafted dictionaries for lexical retrievers or on fine‑tuned dense retrievers, both of which demand costly human annotations and frequent re‑indexing. To eliminate this dependency, the authors propose RL‑QR (Annotation‑Free Reinforcement Learning Query Rewriting via Verifiable Search Reward), a framework that learns a query‑rewriter solely from search‑engine feedback.

RL‑QR consists of two tightly coupled stages. First, an “index‑aligned query synthesis” process uses a large language model (e.g., Qwen3‑VL‑235B‑A22B) prompted to generate question‑answer pairs that are guaranteed to require the source index for the answer. The prompt instructs the model to (1) imagine a scenario that needs the document, (2) formulate a question fitting that scenario, and (3) provide the correct answer. Because the answer can only be retrieved from the designated index, the generated question becomes a reliable proxy for a “ground‑truth” query without any human labeling.
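The synthesis step can be sketched as follows. The prompt wording, the `QUESTION:`/`ANSWER:` response format, and the function names are illustrative assumptions, not the paper's exact implementation:

```python
# Sketch of index-aligned query synthesis (hypothetical prompt and parser;
# the actual LLM interface and prompt text are not specified in the summary).

def build_synthesis_prompt(document_text: str) -> str:
    """Build a prompt asking an LLM for a scenario-grounded QA pair whose
    answer can only be recovered from the given indexed document."""
    return (
        "You are given a document from a search index.\n"
        "1. Imagine a realistic scenario in which a user would need this document.\n"
        "2. Formulate a question fitting that scenario.\n"
        "3. Provide the correct answer, recoverable only from the document.\n"
        "Respond as:\nQUESTION: <question>\nANSWER: <answer>\n\n"
        f"Document:\n{document_text}"
    )

def parse_qa(llm_response: str) -> tuple[str, str]:
    """Extract the (question, answer) pair from the LLM's formatted response."""
    question, answer = "", ""
    for line in llm_response.splitlines():
        if line.startswith("QUESTION:"):
            question = line[len("QUESTION:"):].strip()
        elif line.startswith("ANSWER:"):
            answer = line[len("ANSWER:"):].strip()
    return question, answer
```

Because the question is generated from a specific indexed document, that document serves as the ground-truth retrieval target for the reward described next, with no human labeling required.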

Second, the synthesized queries are fed to a query‑rewriter policy πθ. The policy rewrites each synthetic query into a new form y, which is then submitted to a retrieval engine R (lexical BM25, dense semantic, or multimodal image‑embedding based). The retrieval outcome is evaluated with Normalized Discounted Cumulative Gain (NDCG), which directly measures whether the target document appears at a high rank. This NDCG score serves as the primary, verifiable reward r_retrieval. A secondary penalty r_penalty enforces formatting constraints (e.g., wrapping the rewritten query in tags) and discourages unnecessary token generation. The total reward is a weighted sum λ1·r_retrieval + λ2·r_penalty. Policy updates are performed with a PPO‑style objective that groups tokens, computes advantage estimates per group, and clips probability ratios to stabilize learning.
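A minimal sketch of this reward computation, assuming a single relevant target document per synthetic query (in which case NDCG@k reduces to a reciprocal-log discount of the target's rank); the `<query>` tag format, the length-penalty form, and the λ values are illustrative assumptions:

```python
import math

def retrieval_reward(ranked_doc_ids: list[str], target_doc_id: str, k: int = 10) -> float:
    """NDCG-based reward. With a single relevant target, NDCG@k reduces to
    1 / log2(rank + 1) when the target appears in the top-k, else 0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if doc_id == target_doc_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def format_penalty(rewritten: str, max_tokens: int = 64) -> float:
    """Formatting penalty: -1 if the required wrapping tags are missing,
    plus a small per-token charge for overly long rewrites (assumed form)."""
    penalty = 0.0
    if not (rewritten.startswith("<query>") and rewritten.endswith("</query>")):
        penalty -= 1.0
    n_tokens = len(rewritten.split())
    if n_tokens > max_tokens:
        penalty -= 0.01 * (n_tokens - max_tokens)
    return penalty

def total_reward(ranked, target, rewritten, lam1=1.0, lam2=0.1):
    """Weighted sum lam1 * r_retrieval + lam2 * r_penalty from the text."""
    return lam1 * retrieval_reward(ranked, target) + lam2 * format_penalty(rewritten)
```

For example, a rewrite that places the target document at rank 1 and satisfies the tag format earns the maximum reward of 1.0, while a malformed rewrite is penalized regardless of retrieval success.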

Key technical contributions include: (1) the use of index-aligned synthetic queries to guarantee that the reward is fully computable from the search engine, removing any need for human-annotated relevance judgments; (2) a reward formulation that is agnostic to the underlying index modality, allowing the same learning pipeline to be applied to lexical, dense, and multimodal retrievers; (3) the recommendation to train separate rewriters per index, acknowledging that different retrievers have distinct optimal query characteristics (e.g., term repetition benefits BM25, semantic similarity benefits dense models).
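The grouped, clipped policy update mentioned above can be illustrated with a minimal sketch. Normalizing rewards within a group of rollouts and ε = 0.2 are assumptions in the spirit of PPO/GRPO-style training, not the paper's confirmed hyperparameters:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward by the
    mean and standard deviation of its group (assumed normalization)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

def clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate for one token: take the minimum of the raw
    and ratio-clipped terms so large policy updates are damped."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

Clipping caps how much credit a token can receive when its probability ratio drifts far from 1, which is what stabilizes learning under the noisy search-engine reward.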

Empirical evaluation spans three in-house RAG pipelines (Lexical, Semantic, and Multimodal) and three benchmark datasets: (a) MS MARCO v2.1 (1% test split) for pure text retrieval, (b) MTEB VIDORE V2 for unstructured visual documents, and (c) an internal industrial dataset of unstructured documents. Results show consistent improvements: on MS MARCO, NDCG@3 rises by 5-10%; on VIDORE V2, lexical retrieval performance improves up to 3.9× and semantic retrieval up to 3.5×; similar gains are observed on the internal data. These results demonstrate that RL-QR can substantially boost retrieval performance across modalities without any human-generated query-document relevance pairs.

The authors acknowledge limitations: the quality of synthetic queries depends on the prompting of the LLM; maintaining a separate rewriter per index can be parameter-inefficient; and noisy or biased retrieval engines could corrupt the reward signal. Future work is suggested on automatic quality assessment of synthetic queries, shared-parameter rewriter architectures for multiple indices, and techniques to denoise the reward (e.g., reward shaping, curriculum learning).

In summary, RL‑QR offers a practical, annotation‑free pathway to train query‑rewriting models that are modular, modality‑agnostic, and demonstrably effective in real‑world RAG deployments.

