Large language models are evolving from single-turn responders into tool-using agents capable of sustained reasoning and decision-making for deep research. Prevailing systems adopt a linear plan-search-write-report pipeline, which suffers from error accumulation and context rot due to the lack of explicit control over both model behavior and context. We introduce RhinoInsight, a deep research framework that adds two control mechanisms to enhance robustness, traceability, and overall quality without parameter updates. First, a Verifiable Checklist module transforms user requirements into traceable and verifiable sub-goals, incorporates human or LLM critics for refinement, and compiles a hierarchical outline to anchor subsequent actions and prevent non-executable planning. Second, an Evidence Audit module structures search content, iteratively updates the outline, and prunes noisy context, while a critic ranks and binds high-quality evidence to drafted content to ensure verifiability and reduce hallucinations. Our experiments demonstrate that RhinoInsight achieves state-of-the-art performance on deep research tasks while remaining competitive on deep search tasks.
Large language models (LLMs) (OpenAI, 2025b; DeepSeek Team, 2025; Gemini, 2025) are evolving from fluent generators into tool-using problem-solvers capable of sustained reasoning and decision-making. In this work, we focus on deep research: an agentic capability that performs multi-step reasoning and searches for information on the internet to tackle complex research tasks. Deep research agents present an innovative approach to enhancing, and potentially transforming, human intellectual productivity.
Prevailing approaches (Chai et al., 2025; Liu et al., 2025; Consult, 2025; Xbench-Team, 2025; Lei et al., 2024) mainly adopt a linear pipeline composed of distinct actions such as planning, retrieval, writing, and reporting. These methods execute each step serially, as shown in Figure 2 (a), and rely on powerful agent models to mitigate the issues inherent to linear pipeline structures, such as error accumulation and context rot. In contrast, we argue that the key to improving the effectiveness of a deep research framework lies not in simply enhancing model capability, but in properly incorporating control mechanisms over both the context and model behavior. Without such guidance, each component faces distinct challenges. (1) Planning lacks executable and checkable anchors, so generated goals remain unclear and easily lead to incorrect actions in subsequent model generation. (2) Search, memory, and drafting modules often produce excessively long contexts containing substantial noisy or irrelevant content. As the context grows, the model struggles to maintain effectiveness due to context rot and performs poorly on lengthy, disorganized contexts. (3) The writing module operates at the end of the pipeline and is therefore highly sensitive to the cumulative context produced by all preceding modules, including erroneous actions and the noisy content they generate. Without intervention, this accumulated noise propagates into the final information aggregation stage, degrading report quality and undermining the reliability of the produced analysis. It is therefore necessary to introduce control mechanisms that guide model behavior (e.g., actions) and efficiently organize contextual information, keeping the entire process stable and effective.
In this paper, we introduce RhinoInsight, a deep research framework with two control mechanisms that drive overall improvement. Specifically, we introduce a verifiable checklist module to supervise and control model behavior, and an evidence audit module to organize contextual information across the planning, retrieval, memory, outlining, and writing modules. In the verifiable checklist module, a checklist generator produces traceable and verifiable sub-goals; a critic, either manual effort (e.g., when user queries are unclear in real-world applications) or an automated LLM, then checks these sub-goals; and the planning module compiles them into a hierarchical outline. Unlike previous works that use a planning module to generate outlines guiding subsequent steps, our approach ensures that each sub-goal is well-defined and prevents an unclear plan from causing incorrect actions later. Meanwhile, to avoid context rot and filter noisy information, the evidence audit module dynamically organizes the context by iteratively updating the outline, structuring the content, and preserving useful information. In this way, the context is effectively organized and the model can fully exploit it to provide high-quality answers. Together, these two modules provide control mechanisms over both context and model behavior, allowing deep research agents to achieve better performance without updating parameters.
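The interaction between the two modules can be pictured as a simple control loop: sub-goals are generated, filtered by a critic, compiled into an outline, and only critic-ranked evidence is bound to each outline section. The sketch below is a minimal, hypothetical illustration of this flow; the function names, the `;`-splitting stand-in for an LLM checklist generator, and the keyword-overlap scoring are our illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class SubGoal:
    text: str
    verified: bool = False

def generate_checklist(query):
    """Checklist generator: split the query into candidate sub-goals.
    (A real system would prompt an LLM; splitting on ';' is a stand-in.)"""
    return [SubGoal(part.strip()) for part in query.split(";") if part.strip()]

def critic_refine(subgoals, is_verifiable):
    """Critic pass: keep only sub-goals judged traceable and verifiable."""
    kept = []
    for g in subgoals:
        if is_verifiable(g.text):
            g.verified = True
            kept.append(g)
    return kept

def compile_outline(subgoals):
    """Compile verified sub-goals into a hierarchical outline (one section each)."""
    return {f"Section {i + 1}": g.text for i, g in enumerate(subgoals)}

def evidence_audit(outline, evidence, score):
    """Evidence audit: rank evidence per section, prune zero-scoring (noisy)
    snippets, and bind the survivors to the outline entry they support."""
    bound = {}
    for section, goal in outline.items():
        ranked = sorted(evidence, key=lambda e: score(goal, e), reverse=True)
        bound[section] = [e for e in ranked if score(goal, e) > 0]
    return bound
```

A usage pass might refine `"compare GPU vendors; estimate market size; vague vibes"` down to two verified sub-goals, compile them into a two-section outline, and bind only the evidence snippets whose score for a section is positive, discarding the rest as noise before writing begins.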
Our proposed RhinoInsight achieves new state-of-the-art results on various deep research tasks and shows competitive performance on deep search tasks, as shown in Figure 1. Specifically, RhinoInsight reaches 50.92 on DeepResearch Bench (Du et al., 2025) under the RACE evaluation and 6.82 on DeepConsult (Consult, 2025), surpassing Doubao-Research, Claude-DeepResearch, and OpenAI-DeepResearch and leading all systems on both benchmarks. On deep search tasks, RhinoInsight is also competitive with advanced LLMs such as OpenAI-o3 (OpenAI, 2025c), e.g., attaining a score of 68.9 on the text-only version of GAIA (Mialon et al., 2023).
For deep research tasks, e.g., generating professional financial reports, given a user query q, agents are expected to produce a high-quality, structured report R = (T, V, C):
• Textual Content: T represents textual content, such as factual statements, executive summaries