Agentic Search in the Wild: Intents and Trajectory Dynamics from 14M+ Real Search Requests

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

LLM-powered search agents are increasingly being used for multi-step information-seeking tasks, yet the IR community lacks an empirical understanding of how agentic search sessions unfold and how retrieved evidence is used. This paper presents a large-scale log analysis of agentic search based on 14.44M search requests (3.97M sessions) collected from DeepResearchGym, an open-source search API accessed by external agentic clients. We sessionize the logs, assign session-level intents and step-wise query-reformulation labels using LLM-based annotation, and propose the Context-driven Term Adoption Rate (CTAR) to quantify whether newly introduced query terms are traceable to previously retrieved evidence. Our analyses reveal distinctive behavioral patterns. First, over 90% of multi-turn sessions contain at most ten steps, and 89% of inter-step intervals fall under one minute. Second, behavior varies by intent: fact-seeking sessions exhibit high repetition that increases over time, while sessions requiring reasoning sustain broader exploration. Third, agents reuse evidence across steps: on average, 54% of newly introduced query terms appear in the accumulated evidence context, with contributions from earlier steps beyond the most recent retrieval. The findings suggest that agentic search may benefit from repetition-aware early stopping, intent-adaptive retrieval budgets, and explicit cross-step context tracking. We plan to release the anonymized logs to support future research.


💡 Research Summary

This paper presents the first large‑scale empirical study of how large‑language‑model (LLM) powered search agents behave in real‑world, multi‑step information‑seeking sessions. The authors obtained 14.44 million search requests (spanning six months from June to December 2025) from DeepResearchGym (DRGym), an open‑source, reproducible search API that is accessed by a diverse set of external agentic clients. After cleaning and filtering, the raw stream was segmented into 3.97 million sessions using a hybrid temporal‑semantic sessionization method: for each anonymized client IP, a query is attached to the most semantically continuous active session if the continuity score exceeds a threshold; otherwise a new session is started, with a hard 10‑minute gap rule to prevent overly long idle periods. This approach respects the fast, often parallel request patterns typical of autonomous agents, which differ from the conventional 30‑minute inactivity rule used for human logs.
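The hybrid temporal-semantic sessionization can be sketched as follows. This is a minimal illustration, not the paper's implementation: the continuity score here is a simple token-level Jaccard similarity, and the `CONTINUITY_THRESHOLD` value is an assumption (the paper does not state its threshold in this summary); only the hard 10-minute gap rule is taken from the text.

```python
from dataclasses import dataclass, field

# Hypothetical parameters; the paper's actual threshold and scorer may differ.
CONTINUITY_THRESHOLD = 0.3   # minimum semantic continuity to join a session
MAX_GAP_SECONDS = 600        # hard 10-minute inactivity rule from the paper

def continuity(q1: str, q2: str) -> float:
    """Token-level Jaccard similarity as a stand-in for the paper's semantic score."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

@dataclass
class Session:
    queries: list = field(default_factory=list)
    last_ts: float = 0.0

def sessionize(requests):
    """requests: time-sorted (timestamp, query) pairs for one anonymized client.
    Attach each query to the most semantically continuous active session,
    or start a new session if no active session scores above threshold."""
    sessions: list[Session] = []
    for ts, query in requests:
        best, best_score = None, 0.0
        for s in sessions:
            if ts - s.last_ts > MAX_GAP_SECONDS:
                continue  # session idle too long: never attach to it
            score = continuity(s.queries[-1], query)
            if score > best_score:
                best, best_score = s, score
        if best is not None and best_score >= CONTINUITY_THRESHOLD:
            best.queries.append(query)
            best.last_ts = ts
        else:
            sessions.append(Session(queries=[query], last_ts=ts))
    return sessions
```

Because agents fire requests quickly and sometimes in parallel, attaching by semantic continuity rather than pure recency avoids splitting interleaved tasks into spurious sessions.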

To characterize what each session is trying to achieve, the authors applied an LLM‑as‑a‑judge pipeline that maps sessions onto a standard intent taxonomy (e.g., fact‑seeking, reasoning, exploratory). For step‑wise dynamics, the same LLM‑based annotator labeled each adjacent query pair with one of four reformulation types: addition, deletion, modification, or no change. Both labeling stages were validated against a small human‑annotated sample, achieving over 92% agreement, demonstrating that automated labeling can scale to millions of sessions without sacrificing reliability.
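The four reformulation types can be illustrated with a simple term-set heuristic. Note that the paper uses an LLM annotator, not this rule; the sketch below is only a deterministic approximation of what the labels mean for an adjacent query pair.

```python
def reformulation_type(prev_query: str, curr_query: str) -> str:
    """Heuristic stand-in for the paper's LLM annotator: classify an
    adjacent query pair by comparing lowercase term sets."""
    prev = set(prev_query.lower().split())
    curr = set(curr_query.lower().split())
    added, removed = curr - prev, prev - curr
    if not added and not removed:
        return "no change"      # identical term sets
    if added and removed:
        return "modification"   # some terms swapped out for others
    if added:
        return "addition"       # terms only added
    return "deletion"           # terms only removed
```

For example, going from "llm agents" to "llm search agents" is an addition, while "llm agents" to "llm tools" is a modification.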

A central methodological contribution is the Context‑driven Term Adoption Rate (CTAR). For a given step k, the set of newly introduced query terms Tₖ is extracted. The accumulated evidence context from steps 1 through k−1 (titles, snippets, and bodies of all documents retrieved in previous steps) is concatenated, and lexical n‑gram matching is performed between Tₖ and that context. CTAR at step k is defined as the proportion of new terms with at least one lexical match in the prior context. By averaging CTAR across all steps of a session, the authors obtain a session‑level measure of evidence reuse. Across the entire corpus, the mean CTAR is 0.54, indicating that more than half of the newly added terms can be traced back to previously retrieved evidence. Importantly, contributions are not limited to the immediately preceding step; terms often match evidence from steps three, five, or even earlier, revealing that agents maintain a long‑range memory of retrieved content.
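The CTAR computation can be sketched as below. This is a simplified unigram version (the paper performs n-gram matching), and the handling of steps with no new terms is an assumption: here such steps are simply skipped in the session average.

```python
def new_terms(step_queries: list[str], k: int) -> set[str]:
    """Terms introduced at step k (1-indexed): present in the step-k query,
    absent from all earlier queries in the session."""
    seen: set[str] = set()
    for q in step_queries[:k - 1]:
        seen |= set(q.lower().split())
    return set(step_queries[k - 1].lower().split()) - seen

def ctar_at_step(step_queries, evidence_texts, k):
    """Fraction of step-k new terms with a lexical match in the evidence
    accumulated over steps 1..k-1. Unigram matching only; the paper
    matches n-grams. Returns None when step k introduces no new terms."""
    terms = new_terms(step_queries, k)
    if not terms:
        return None
    context = set(" ".join(evidence_texts[:k - 1]).lower().split())
    return len(terms & context) / len(terms)

def session_ctar(step_queries, evidence_texts) -> float:
    """Session-level CTAR: average over steps 2..n that introduce new terms."""
    scores = [s for k in range(2, len(step_queries) + 1)
              if (s := ctar_at_step(step_queries, evidence_texts, k)) is not None]
    return sum(scores) / len(scores) if scores else 0.0
```

In this framing, a mean CTAR of 0.54 means that, on average, just over half of each step's newly introduced terms already appear somewhere in the evidence retrieved at earlier steps.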

Descriptive analyses uncover several striking patterns. First, 90% of multi‑turn sessions contain ten steps or fewer, and 89% of inter‑step intervals are under one minute, suggesting that agents operate in a rapid feedback loop rather than the slower, deliberative cycles typical of human users. Second, the retrieval depth parameter (num_of_docs) is largely static within a session, implying that agents treat it as a fixed hyper‑parameter rather than dynamically adjusting it based on progress. Third, behavior varies markedly by intent. Fact‑seeking sessions exhibit the highest repetition rate (≈0.68) and show an increasing proportion of identical queries as the session proceeds, indicating a tendency to enter near‑duplicate loops when the search is unproductive. In contrast, reasoning or complex‑information‑need sessions maintain a higher rate of novel term introduction (≈0.42) and display broader exploration throughout the session.
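A session-level repetition rate along these lines can be computed as follows. The exact definition used in the paper is not given in this summary; the version below (share of steps after the first that exactly repeat an earlier query, after lowercasing) is one plausible choice.

```python
def repetition_rate(queries: list[str]) -> float:
    """Fraction of steps after the first that repeat an earlier query
    in the same session (exact match after normalization)."""
    if len(queries) < 2:
        return 0.0
    seen: set[str] = set()
    repeats = 0
    for q in queries:
        key = q.lower().strip()
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / (len(queries) - 1)
```

Under this definition, a fact-seeking session stuck in a near-duplicate loop (e.g., reissuing the same query three times in four steps) scores close to the ≈0.68 regime the paper reports.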

CTAR analysis further differentiates intents: fact‑seeking sessions rely heavily on previously seen evidence (higher CTAR), while reasoning sessions blend older evidence with new terms, reflecting more sophisticated synthesis. The authors also examine query similarity across the entire log, finding a low average cosine similarity (≈0.12) among a random sample of 100k queries, confirming that the DRGym traffic is semantically diverse and not dominated by a few repeated benchmark‑style prompts. Overlap with four public agentic benchmarks (GAIA, FRAMES, HLE, WebWalkerQA) is below 0.4% of sampled queries, reinforcing the claim that the dataset reflects open‑ended, real‑world usage rather than benchmark execution.

Based on these findings, the paper proposes three practical design recommendations for future agentic search systems: (1) Repetition‑aware early stopping – detect high‑frequency query duplication early and terminate the session or trigger a fallback strategy to avoid wasted retrieval budget; (2) Intent‑adaptive retrieval budgeting – allocate more documents or higher ANN search complexity to sessions identified as reasoning‑heavy, while keeping fact‑seeking sessions lean; (3) Cross‑step context tracking – maintain a persistent evidence store that aggregates content from all prior steps, enabling the agent to draw on long‑range context, as evidenced by the substantial CTAR contributions from distant steps. The authors also suggest incorporating CTAR as an evaluation metric for tool‑using agents, complementing existing task‑oriented benchmarks.
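The first recommendation, repetition-aware early stopping, could look like the sketch below. The windowed-duplicate rule and both thresholds (`window`, `max_dup`) are illustrative assumptions, not mechanisms described in the paper.

```python
from collections import Counter

def should_stop(queries: list[str], window: int = 5, max_dup: int = 2) -> bool:
    """Repetition-aware early stopping: signal a halt (or fallback strategy,
    e.g. a query rewrite) once any query recurs more than `max_dup` times
    within the last `window` steps. Thresholds are illustrative only."""
    recent = [q.lower().strip() for q in queries[-window:]]
    counts = Counter(recent)
    return any(c > max_dup for c in counts.values())
```

A controller would call this after every step; on a True signal it can terminate the session early instead of spending further retrieval budget on near-duplicate queries.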

Finally, the authors release the anonymized, cleaned log dataset (session IDs, timestamps, query texts, retrieval parameters, and retrieved document identifiers) on Hugging Face, accompanied by a detailed dataset card describing the anonymization process, residual privacy risks, and usage licenses. This open release is intended to foster reproducibility, enable further behavioral analyses, and support the development of next‑generation agentic retrieval architectures.

In sum, the paper delivers a comprehensive, data‑driven portrait of agentic search in the wild, introduces a novel metric for evidence‑conditioned query evolution, and translates empirical insights into concrete system‑design guidelines, thereby bridging the gap between benchmark performance and real‑world agent behavior.

