A Picture of Agentic Search

Notice: This research summary and analysis were generated automatically with AI. For full accuracy, please refer to the original arXiv source.

With automated systems increasingly issuing search queries alongside humans, Information Retrieval (IR) faces a major shift. Yet IR remains human-centred, with systems, evaluation metrics, user models, and datasets designed around human queries and behaviours. Consequently, IR operates under assumptions that no longer hold in practice, with changes to workload volumes, predictability, and querying behaviours. This misalignment affects system performance and optimisation: caching may lose effectiveness, query pre-processing may add overhead without improving results, and standard metrics may mismeasure satisfaction. Without adaptation, retrieval models risk satisfying neither humans nor the emerging user segment of agents. However, datasets capturing agent search behaviour are lacking, which is a critical gap given IR's historical reliance on data-driven evaluation and optimisation. We develop a methodology for collecting all the data produced and consumed by agentic retrieval-augmented systems when answering queries, and we release the Agentic Search Queryset (ASQ) dataset. ASQ contains reasoning-induced queries, retrieved documents, and thoughts for queries in HotpotQA, Researchy Questions, and MS MARCO, for 3 diverse agents and 2 retrieval pipelines. The accompanying toolkit enables ASQ to be extended to new agents, retrievers, and datasets.


💡 Research Summary

The paper “A Picture of Agentic Search” addresses a fundamental shift in information retrieval (IR) caused by the rise of large language model (LLM) agents that automatically generate and issue search queries. Traditional IR research, benchmarks, and evaluation metrics have been built around human‑generated queries, implicitly assuming an entirely organic query stream (a share of human-issued queries α = 1). In practice, agents now produce a synthetic query stream that can be large, fast, and stylistically different from human queries. This mismatch undermines many long‑standing assumptions: caching strategies become less effective, query pre‑processing may add latency without benefit, and classic relevance‑based metrics (e.g., MAP, NDCG) no longer reflect user satisfaction when the “user” is an autonomous agent.

To study this emerging phenomenon, the authors propose a systematic methodology for logging every intermediate step of an agentic Retrieval‑Augmented Generation (RAG) system. An “agentic run” (arun) starts from an initial human question (q₀) and proceeds through a sequence of actions (query formulation, retrieval, reasoning, refinement, answer generation). Each iteration produces a “frame” consisting of the generated sub‑query (q), the ranked list of retrieved document identifiers (R_q), and any natural‑language “thought” or refinement description (D) emitted by the agent. Frames are ordered chronologically to form a “trace” (T_A(q₀) = (S, a)), where S is the ordered list of frames and a is the final answer (which may be empty if the run aborts). The methodology intercepts retrieval calls during the model’s decoding loop, parses agent‑specific control tags with regular expressions, and stores raw data without any post‑processing, preserving incomplete runs for full transparency.
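The frame/trace data model and tag parsing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tag names (`think`, `search`, `answer`) are hypothetical stand-ins for the agent-specific control tags, and the field names are chosen to mirror the summary's notation (q, R_q, D).

```python
import re
from dataclasses import dataclass, field

# Hypothetical control tags; real tag names vary per agent.
TAG_PATTERN = re.compile(r"<(search|think|answer)>(.*?)</\1>", re.DOTALL)

@dataclass
class Frame:
    query: str                      # generated sub-query q
    doc_ids: list[str]              # ranked retrieved document IDs R_q
    thought: str                    # natural-language description D

@dataclass
class Trace:
    q0: str                         # initial human question
    frames: list[Frame] = field(default_factory=list)
    answer: str = ""                # left empty if the run aborts

def parse_decoding_output(text: str) -> dict[str, list[str]]:
    """Group tagged spans in a raw decoding stream by tag name."""
    spans: dict[str, list[str]] = {}
    for tag, body in TAG_PATTERN.findall(text):
        spans.setdefault(tag, []).append(body.strip())
    return spans
```

Because the methodology stores raw data without post-processing, a `Trace` whose run aborted simply keeps its partial `frames` list and an empty `answer`, rather than being discarded.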

Using this pipeline, the authors construct the Agentic Search Queryset (ASQ). ASQ covers three open‑source agents—Search‑R1, AutoRefine, and a third baseline—and two retrieval pipelines (a classic BM25 index and a dense retriever). The agents are run on three well‑known QA collections: HotpotQA (multi‑hop, in‑domain), Researchy Questions (research‑style, out‑of‑domain), and MS MARCO dev (factoid, out‑of‑domain). For each combination, every arun is logged, yielding hundreds of thousands of frames and thousands of traces. The dataset is sharded per trace, with separate TSV files for queries, document IDs, and textual descriptions, enabling selective access while maintaining traceability and chronological order.
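The per-trace sharding could be consumed along these lines. This is a hedged sketch under assumed conventions: the shard filenames (`queries.tsv`, `doc_ids.tsv`, `thoughts.tsv`) and the leading frame-index column are illustrative guesses, not ASQ's documented layout.

```python
import csv
from pathlib import Path

def load_trace(trace_dir: str) -> list[dict]:
    """Reassemble one trace's frames from separate TSV shards,
    joining on a (hypothetical) frame-index first column and
    restoring chronological order."""
    def read_tsv(name: str) -> list[list[str]]:
        with (Path(trace_dir) / name).open(newline="", encoding="utf-8") as f:
            return list(csv.reader(f, delimiter="\t"))

    frames: dict[str, dict] = {}
    def frame(fid: str) -> dict:
        return frames.setdefault(fid, {"query": "", "doc_ids": [], "thought": ""})

    for fid, query in read_tsv("queries.tsv"):
        frame(fid)["query"] = query
    for fid, *doc_ids in read_tsv("doc_ids.tsv"):
        frame(fid)["doc_ids"] = doc_ids      # ranked document identifiers
    for fid, thought in read_tsv("thoughts.tsv"):
        frame(fid)["thought"] = thought

    # Chronological order = ascending frame index.
    return [frames[k] for k in sorted(frames, key=int)]
```

Keeping queries, document IDs, and thoughts in separate shards means a study that only needs queries (e.g., for cache analysis) can skip loading the bulkier document and thought files.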

The authors articulate both intrinsic and extrinsic desiderata for such a dataset. Intrinsically, ASQ guarantees traceability (each frame is linked to its trace and answer), completeness (all actions are recorded, including empty retrievals and early terminations), and diversity (multiple agents, retrievers, query types, and domains). Extrinsically, ASQ supports optimisation (training or fine‑tuning retrievers using the logged synthetic queries), assessability (evaluating retrieval models against agent‑centric relevance), interoperability (easy integration with new agents or retrieval back‑ends), and extensibility (the open‑source toolkit allows researchers to add new agents, corpora, or evaluation protocols without redesigning the data schema).

The paper argues that ASQ fills a critical gap: while synthetic queries have been used for data augmentation or user simulation, they are typically engineered to mimic human logs and thus do not capture the true behaviour of autonomous agents. By providing real agent‑generated traces, ASQ enables a new line of research: redesigning caching policies for high‑frequency synthetic queries, developing predictive models of an agent’s next sub‑query, creating evaluation metrics that account for the reasoning‑retrieval loop, and studying how agents refine or discard retrieved information. Moreover, the “thought” texts attached to each frame offer a rich source of meta‑information for probing the internal reasoning of LLM agents.

In conclusion, “A Picture of Agentic Search” delivers both a methodological framework and a publicly released dataset that together lay the groundwork for agent‑centric IR research. The authors anticipate that ASQ will catalyse work on agent behaviour modelling, multi‑turn retrieval strategies, and human‑agent collaborative search scenarios, ultimately prompting the IR community to rethink core assumptions and develop systems that serve both humans and autonomous agents effectively.

