IntentRL: Training Proactive User-intent Agents for Open-ended Deep Research via Reinforcement Learning
Deep Research (DR) agents extend Large Language Models (LLMs) beyond parametric knowledge by autonomously retrieving and synthesizing evidence from large web corpora into long-form reports, enabling a long-horizon agentic paradigm. However, unlike real-time conversational assistants, DR is computationally expensive and time-consuming, creating an autonomy-interaction dilemma: high autonomy on ambiguous user queries often leads to prolonged execution with unsatisfactory outcomes. To address this, we propose IntentRL, a framework that trains proactive agents to clarify latent user intents before starting long-horizon research. To overcome the scarcity of open-ended research data, we introduce a scalable pipeline that expands a few seed samples into high-quality dialogue turns via a shallow-to-deep intent refinement graph. We further adopt a two-stage reinforcement learning (RL) strategy: Stage I applies RL on offline dialogues to efficiently learn general user-interaction behavior, while Stage II uses the trained agent and a user simulator for online rollouts to strengthen adaptation to diverse user feedback. Extensive experiments show that IntentRL significantly improves both intent hit rate and downstream task performance, outperforming the built-in clarify modules of closed-source DR agents and proactive LLM baselines.
💡 Research Summary
IntentRL tackles a fundamental problem in emerging Deep Research (DR) agents: the autonomy‑interaction dilemma. DR agents autonomously browse the web, synthesize information, and generate long‑form reports, but when the initial user query is ambiguous, the agent may waste substantial computational resources on irrelevant searches, producing reports that do not meet the user’s true needs. Existing commercial DR systems include a simple clarification step, yet these modules often fail to elicit enough detail, and open‑source research on proactive intent‑driven DR agents is scarce.
The authors propose a reinforcement‑learning (RL) framework that trains a proactive “intent‑mining” agent to ask clarification questions before the costly research phase begins. The core contributions are two‑fold: (1) a scalable data‑construction pipeline that expands a handful of seed queries into thousands of high‑quality dialogue turns, and (2) a two‑stage RL training regime that first learns from offline expert trajectories and then refines the policy through online interaction with an intent‑aware user simulator.
Data construction starts from benchmark queries paired with detailed rubrics. For each query, the authors derive a “shallow intent” by stripping explicit constraints (producing a fuzzy, simplified query) and a “deep intent” by analyzing rubric requirements that go beyond pure retrieval (e.g., preferred analytical perspectives). These intents become nodes in a Clarification Directed Acyclic Graph (C‑DAG). Each node represents a multi‑option clarification question, and directed edges encode logical dependencies (a downstream question is only meaningful after its predecessor is resolved). By depth‑first traversal and branching on option selections, a single seed expands into many coherent dialogue trajectories. The resulting dataset contains 371 intent trajectories and 2,347 dialogue turns, which serve as expert demonstrations.
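The C-DAG expansion described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the node structure, field names, and the toy two-level graph are all assumptions, and the real pipeline operates over LLM-derived intents rather than hand-written strings.

```python
from dataclasses import dataclass, field

@dataclass
class ClarificationNode:
    """One multi-option clarification question in a C-DAG (illustrative sketch)."""
    question: str
    options: list[str]
    # children[i] is the follow-up question unlocked by choosing option i, if any;
    # this encodes the edge dependency: the child is only meaningful once resolved.
    children: dict[int, "ClarificationNode"] = field(default_factory=dict)

def expand_trajectories(node, prefix=()):
    """Depth-first traversal: branching on each option selection turns one
    seed graph into many coherent multi-turn dialogue trajectories."""
    out = []
    for i, option in enumerate(node.options):
        path = prefix + ((node.question, option),)
        child = node.children.get(i)
        if child is None:
            out.append(list(path))          # leaf reached: one complete dialogue
        else:
            out.extend(expand_trajectories(child, path))
    return out

# Toy two-level graph: the scope question gates a follow-up format question.
leaf = ClarificationNode("Preferred report format?", ["survey", "comparison table"])
root = ClarificationNode("Which market scope?", ["US only", "global"], {1: leaf})
print(expand_trajectories(root))  # one seed expands into three dialogues
```

Even this tiny graph yields three distinct trajectories, which is how a few hundred seeds can plausibly scale to thousands of dialogue turns.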
Problem formulation casts the interaction as a Partially Observable Markov Decision Process (POMDP). The hidden state is the user’s latent intent I (shallow + deep). Observations are the dialogue history Hₜ₋₁, actions are the agent’s clarification questions xₜ, and the transition model generates user responses uₜ conditioned on I and the history. The reward function measures the information gain of a question, operationalized as alignment with a turn‑level “target intent set” I*ₜ derived from the current top of the C‑DAG traversal stack.
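One plausible operationalization of the turn-level reward is set overlap between the intents a question elicits and the target set I*ₜ. The function below is a hedged sketch under that assumption; the paper's actual alignment measure may be softer (e.g., LLM-judged or embedding-based) than exact set intersection.

```python
def turn_reward(elicited: set, target: set) -> float:
    """Sketch of the information-gain reward: the fraction of the turn-level
    target intent set I*_t that the agent's question uncovers. Exact-match
    intersection is a simplifying assumption."""
    if not target:
        return 0.0  # nothing left to elicit at this point in the traversal
    return len(elicited & target) / len(target)

# A question that surfaces two of the three currently pending intents.
print(turn_reward({"budget", "region"}, {"budget", "region", "timeframe"}))  # ≈ 0.667
```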
Two‑stage RL uses Group Relative Policy Optimization (GRPO) in both phases. Stage I performs offline RL on the expert trajectories. A hindsight‑driven bootstrapping technique converts the long‑horizon objective into turn‑level rewards by mapping each turn to the set of intents that should be elicited at that point (I*ₜ). This yields a stable base policy that learns to ask questions that maximize the expected information gain.
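The core mechanic of GRPO, common to both stages, is to score a group of rollouts for the same context and normalize each reward against the group's statistics instead of training a value critic. A minimal sketch of that advantage computation:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward by the
    group mean and standard deviation, as in GRPO, removing the need for
    a learned critic."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

# Four sampled clarification turns for one context, scored by intent alignment.
print(grpo_advantages([1.0, 0.5, 0.5, 0.0]))
```

Rollouts above the group mean get positive advantage and are reinforced; identical rewards across the group yield zero advantage, i.e., no gradient signal.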
Stage II addresses distribution shift by introducing an intent‑aware user simulator for online rollouts. The simulator combines rule‑based checks (semantic similarity thresholds to detect redundancy and relevance) with an LLM‑judge that generates natural user responses when the question passes the checks. This hybrid design keeps the action space tractable while providing realistic feedback. The policy is further refined to avoid redundant or irrelevant questions and to adapt to diverse user answer patterns.
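The hybrid gating logic can be sketched as cheap rule-based filters in front of an LLM judge. Everything here is illustrative: the threshold values, the canned refusal strings, and the `sim`/`llm_respond` helpers are assumptions standing in for the paper's semantic-similarity checks and LLM-generated responses.

```python
def toy_sim(a: str, b: str) -> float:
    """Stand-in similarity: word-overlap Jaccard. The real simulator would
    presumably use embedding-based semantic similarity."""
    A = set(a.lower().replace("?", "").split())
    B = set(b.lower().replace("?", "").split())
    return len(A & B) / len(A | B) if A | B else 0.0

def simulate_user_turn(question, history, latent_intents, sim, llm_respond,
                       redundancy_thr=0.9, relevance_thr=0.3):
    """Hybrid user simulator: rule-based checks first, and only questions
    that pass are answered by the (expensive) LLM judge."""
    # Redundancy check: reject questions too similar to anything already asked.
    if any(sim(question, past) >= redundancy_thr for past in history):
        return "You already asked that."
    # Relevance check: reject questions unrelated to every latent intent.
    if max((sim(question, it) for it in latent_intents), default=0.0) < relevance_thr:
        return "That's not relevant to my request."
    return llm_respond(question, latent_intents)  # natural reply from the judge

reply = simulate_user_turn("What budget range?", [], ["budget range"],
                           toy_sim, lambda q, intents: "Around $500.")
print(reply)  # passes both checks, so the LLM judge answers
```

The ordering matters for cost: both rule-based rejections return immediately, so the LLM is only invoked for questions worth answering.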
Experiments compare IntentRL against (a) built‑in clarification modules of closed‑source DR systems, and (b) recent proactive LLM baselines such as Ask‑when‑Needed and ACT. Evaluation metrics include Intent Hit Rate (the proportion of latent intents correctly identified before research) and rubric‑based report quality scores (comprehensiveness, insight, coverage). IntentRL improves Intent Hit Rate by 12–18 percentage points and raises report quality scores by 0.15–0.22 on average. Notably, the benefit scales with the underlying DR model’s capability: more powerful research agents gain larger performance lifts from the clarification stage. Ablation studies show that adding Stage II online exploration yields an extra 5–7% gain over offline‑only training.
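Intent Hit Rate, as described, can be computed as a micro-averaged recall over latent intents. The sketch below assumes exact matching for clarity; the paper's evaluation presumably judges semantic matches rather than string equality.

```python
def intent_hit_rate(elicited_per_dialogue, latent_per_dialogue):
    """Fraction of latent user intents the agent surfaces before the
    research phase, pooled over all evaluation dialogues (micro-average)."""
    hits = total = 0
    for elicited, latent in zip(elicited_per_dialogue, latent_per_dialogue):
        hits += len(set(elicited) & set(latent))
        total += len(set(latent))
    return hits / total if total else 0.0

# One dialogue where clarification surfaced two of four latent intents.
score = intent_hit_rate(
    [["budget", "region"]],
    [["budget", "region", "timeframe", "format"]],
)
print(score)  # 0.5
```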
The paper also discusses limitations. Reward design relies on heuristic alignment with intent sets, which may not capture nuanced user satisfaction. The simulator, while hybrid, cannot fully replicate real human behavior; future work should involve live user studies. Finally, the current pipeline assumes the existence of well‑defined rubrics; extending to domains without explicit rubrics would require automatic rubric generation or alternative intent extraction methods.
In summary, IntentRL demonstrates that proactive, RL‑trained clarification can substantially reduce wasteful computation in open‑ended DR tasks and produce reports that better match user expectations. By providing a systematic method for scaling dialogue data and a robust two‑stage learning scheme, the work offers a practical blueprint for integrating user‑centric interaction into next‑generation autonomous agents.