UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking
Authors: Chang Liu, Chuqiao Kuang, Tianyi Zhuang
Published as a conference paper at ICLR 2026

UIS-DIGGER: TOWARDS COMPREHENSIVE RESEARCH AGENT SYSTEMS FOR REAL-WORLD UNINDEXED INFORMATION SEEKING

Chang Liu 1*, Chuqiao Kuang 1*, Tianyi Zhuang 1*, Yuxin Cheng 2, Huichi Zhou 3, Xiaoguang Li 1, Lifeng Shang 1
1 Huawei Technologies Ltd.  2 The University of Hong Kong  3 University College London
{LiuChang730, kuangchuqiao, zhuangtianyi}@huawei.com
yxcheng@connect.hku.hk  huichi.Zhou.25@ucl.ac.uk
{lixiaoguang11, Shang.Lifeng}@huawei.com

ABSTRACT

Recent advancements in LLM-based information-seeking agents have achieved record-breaking performance on established benchmarks. However, these agents remain heavily reliant on search-engine-indexed knowledge, leaving a critical blind spot: Unindexed Information Seeking (UIS). This paper identifies and explores the UIS problem, where vital information is not captured by search engine crawlers, such as overlooked content, dynamic webpages, and embedded files. Despite its significance, UIS remains an underexplored challenge. To address this gap, we introduce UIS-QA, the first dedicated UIS benchmark, comprising 110 expert-annotated QA pairs. Notably, even state-of-the-art agents experience a drastic performance drop on UIS-QA (e.g., from 70.90 on GAIA and 46.70 on BrowseComp-zh to 24.55 on UIS-QA), underscoring the severity of the problem. To mitigate this, we propose UIS-Digger, a novel multi-agent framework that incorporates dual-mode browsing and enables simultaneous webpage searching and file parsing. With a relatively small ~30B-parameter backbone LLM optimized using SFT and RFT training strategies, UIS-Digger sets a strong baseline at 27.27%, outperforming systems integrating sophisticated LLMs such as O3 and GPT-4.1.
This demonstrates the importance of proactive interaction with unindexed sources for effective and comprehensive information seeking. Our work not only uncovers a fundamental limitation in current agent evaluation paradigms but also provides the first toolkit for advancing UIS research, defining a new and promising direction for robust information-seeking systems. The dataset has been released at: https://huggingface.co/datasets/UIS-Digger/UIS-QA.

1 INTRODUCTION

With the emergence of Large Language Models (LLMs) augmented by tool calls and agent-based workflow designs, modern AI systems have demonstrated impressive capabilities in performing complex real-world information-seeking tasks (OpenAI, 2025). These methods usually leverage powerful tools such as search engines and crawlers for retrieving external knowledge (Team, 2025b;a; Li et al., 2025a), which we term Indexed Information Seeking (IIS). While existing benchmarks such as GAIA (Mialon et al., 2023a) and BrowseComp (OpenAI Team, 2025) suggest current agent systems' advancement in information seeking, these benchmarks do not explicitly measure the extent of agents' reliance on search engines or their ability to discover information scattered across unindexed pages.

In real-world scenarios, however, many tasks involve unindexed information seeking (UIS), where necessary information is hidden in obscure corners of the Internet, embedded in files, or excluded from search engine indices due to crawling and ranking limitations. As shown in Fig. 1, for UIS questions, search engines may return related pages but fail to provide the direct content needed.

* Equal contribution.

Figure 1: The UIS problem. Previous information-seeking agents (bottom) focus primarily on indexed information and thus often fail to gather the evidence needed to answer complex queries, either refusing to answer or generating hallucinations.
In contrast, UIS agents (top) are equipped with additional tools and fine-tuned to excavate unindexed information, and are thus capable of interacting deeply with websites and solving UIS tasks reliably.

Obtaining the needed evidence often requires interactions such as date selection and visual graph reading. Recognizing that existing benchmarks overlook the intrinsic distinction between IIS and UIS, which leads to insufficient adaptation and evaluation of agents' UIS capability, we introduce UIS-QA, the first benchmark explicitly designed for evaluating UIS capability. It consists of 110 carefully annotated and cross-validated test samples, ensuring correctness, objectivity, and temporal stability. The tasks in UIS-QA cover a wide range of action spaces, including search, page crawling, file downloading, and webpage interaction (e.g., selecting options), requiring agents to skillfully interact with webpages and excavate unindexed information.

Our experimental results reveal that even the top information-seeking agents struggle on UIS-QA, with the strongest baseline yielding only 25.45% accuracy, much lower than their GAIA and BrowseComp-zh (Zhou et al., 2025b) performance (over 70% and 45%, respectively). These findings highlight the significance of a UIS benchmark for comprehensive evaluation of information-seeking agents. Through comprehensive analysis of the failure modes, we also identify two major causes of poor performance on UIS tasks, namely insufficient action spaces and limited foundation models. As mentioned above, solving UIS tasks usually involves a wide range of actions, which can be out of the action space of search-engine-based agents (Li et al., 2025a; Team, 2025b; openPanGu Team, 2025; Shi et al., 2025), making UIS problems theoretically unsolvable.
While the action space determines the upper bound, the capability of the foundation model sets the lower bound, determining whether the agent can make correct choices and form a rational strategy within the large action space.

To this end, we also propose UIS-Digger, a multi-agent system for deep research tasks, with a versatile framework and a tuned inner LLM supporting a rich set of searching and deep-browsing actions. The dual-mode browser within UIS-Digger allows the agent to dynamically switch between visual (screenshot) and textual modes, providing richer and more efficient webpage understanding. Furthermore, UIS-Digger also incorporates file readers and parallel tool execution, significantly strengthening its UIS-solving ability. We also tuned the underlying LLM on synthesized QA pairs through two stages: an initial supervised fine-tuning (SFT) round for cold start, followed by rejection sampling fine-tuning (RFT) for bootstrapping UIS capability. The final system achieves 27.27% accuracy on UIS-QA using ~30B backbone LLMs, surpassing all existing baselines, including those integrating sophisticated LLMs such as O3 and GPT-4.1.

We summarize the principal contributions of this work as follows:

• We identify and formalize the overlooked problem of Unindexed Information Seeking (UIS), highlighting its intrinsic distinction from IIS and demonstrating that even state-of-the-art information-seeking agents remain limited in UIS scenarios.

• We introduce UIS-QA, the first benchmark dedicated to UIS, featuring a rigorously validated dataset for systematically evaluating agent systems. Alongside, we propose UIS-Digger, a versatile multi-agent framework that serves as a strong baseline, achieving a best-in-class score of 27.27%.
• We conduct detailed analyses of failure cases and agent behavior evolution across training stages, offering concrete insights and resources to guide future research in advancing the UIS domain.

2 UIS-QA

Since few previous studies have explored the UIS problem, how to evaluate an agent system's ability under the UIS task setting is still a missing piece of the puzzle. Therefore, we propose a new benchmark named UIS-QA. In this section, we elaborate on the construction of UIS-QA in three parts: the problem formulation, the data collection procedure, and the UIS filtering.

2.1 PROBLEM DEFINITION

To begin with, the whole Internet can be understood as a structured collection of a vast number of webpages P. As one of the most prevalent entry points to the Internet, a search engine E normally has crawled and organized a large portion of the webpages, denoted as P(E), which are also known as "indexed pages". The information retrievable by a search engine is thus defined as "indexed information". We formalize this concept as follows:

P(E) = {p_i} = {(u_i, s_i) | u_i ∈ u, s_i ∈ s},    (1)
II = {x | x ∈ s ∪ crawl(u)},    (2)

where each webpage retrieved by the search engine is represented as a tuple of a URL u_i and a snippet s_i. The collections of all the URLs and snippets from P(E) are represented as u and s, respectively. In other words, all the information present in the page snippets or in the results of one-step crawling from indexed pages can be considered indexed information. Unless specified otherwise, in the following sections we use Google Serper 1 as the default search engine E. Conversely, "unindexed information" refers to all other information on the Internet excluded from II:

UI = {x | x ∈ P \ II}.    (3)

In practical terms, it is infeasible to examine all the pages indexed by a search engine. Thus, the above definition serves as a theoretical model.
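The set-level definitions in Eqs. (1)-(3) can be expressed as a toy Python model. The `crawl` function and the page tuples below are hypothetical stand-ins for illustration, not part of any real search API:

```python
# Toy model of Eqs. (1)-(3): indexed information II is everything in the
# snippets of search-engine-indexed pages plus one-step crawls of their URLs;
# anything else on the web is unindexed information UI.

def crawl(url):
    """Hypothetical one-step crawl: returns the set of facts on the page."""
    return {f"content-of:{url}"}

def indexed_information(indexed_pages):
    """Eq. (2): II = snippets ∪ crawl(indexed URLs)."""
    ii = set()
    for url, snippet in indexed_pages:   # each indexed page is a tuple (u_i, s_i)
        ii.add(snippet)
        ii.update(crawl(url))
    return ii

def is_uis_task(required_facts, ii):
    """Eq. (3): the task is UIS if some required fact lies outside II."""
    return any(fact not in ii for fact in required_facts)

pages = [("https://example.org/a", "snippet A"),
         ("https://example.org/b", "snippet B")]
ii = indexed_information(pages)
print(is_uis_task({"snippet A"}, ii))                    # False: answerable from II
print(is_uis_task({"fact behind a date picker"}, ii))    # True: needs UI
```

In the practical approximation of Eqs. (4)-(6), `pages` would contain only the few top results for the m issued queries, shrinking the accessible II and enlarging the effective UI.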
In reality, due to computational constraints, only a small set of search queries can be fed into the search engine, and only a few top pages returned from the searches will have a chance to be visited. Hence, we introduce approximations for II and UI:

A(Q) ⇝ P̃ = {E(q_i)}_{i=1,2,...,m} = {(ũ_j, s̃_j)},    (4)
ĨI = {x | x ∈ s̃ ∪ crawl(ũ)},    (5)
ŨI = {x | x ∈ P \ ĨI}.    (6)

Here, A denotes an arbitrary information-seeking agent system that receives a task Q from the user and formulates m search queries q_i for searching via E. ĨI represents the practically accessible indexed information based on the search engine E and queries {q_i}, which is a subset of the ideal II. Consequently, the remainder of P not included in ĨI becomes unindexed information ŨI. Compared to the ideal definition, in practice, ĨI is much smaller than II, making it more likely that the target information necessary to solve the user's task is located in ŨI. This practical limitation highlights the widespread and critical nature of the UIS problem.

Based on the definition of unindexed information, we further formalize the UIS problem as follows:

(Q, C) ⇒ z,    (7)
C = C^(I) ∪ C^(U) = {c | c ∈ ĨI} ∪ {c | c ∈ ŨI},    (8)
where |C^(U)| > 0,    (9)
and (Q, C^(I)) ⇏ z.    (10)

1 https://www.serper.dev

| Benchmark | Task Type | Real-World Web Environment | Unknown Startpoint | Unindexed-Information Dependence | Final Answer-Oriented Evaluation |
| WebArena | Computer Use | ✗ | ✗ | - | ✗ |
| Mind2Web | Computer Use | ✗ | ✗ | - | ✗ |
| Mind2Web-Live | Computer Use | ✓ | ✗ | ✓ | ✗ |
| Online-Mind2Web | Computer Use | ✓ | ✗ | ✓ | ✗ |
| Browsecomp-en/zh | Info Seeking | ✓ | ✓ | ✗ | ✓ |
| xbench-DeepSearch | Info Seeking | ✓ | ✓ | ✗ | ✓ |
| GAIA-textual-103 | Info Seeking | ✓ | ✓ | ✗ | ✓ |
| UIS-QA | Info Seeking | ✓ | ✓ | ✓ | ✓ |

Table 1: Comparison of our UIS-QA and existing benchmarks.
To solve a user's question Q, a context C consisting of both indexed and unindexed information is required, denoted as C^(I) and C^(U), respectively. If the required unindexed information is not empty and the correct answer z cannot be inferred from C^(I) alone, then Q is a UIS problem.

We compare UIS-QA with existing information-seeking and computer-use datasets along five key dimensions, as summarized in Tab. 1. Task Type: information-seeking datasets (e.g., Browsecomp-en/zh (OpenAI Team, 2025; Zhou et al., 2025b), xbench-DeepSearch (Chen et al., 2025b), GAIA-textual-103 (Mialon et al., 2023b)) require multi-step exploration on the open web, emphasizing search strategy and information extraction. In contrast, computer-use datasets (e.g., WebArena (Zhou et al., 2024), Mind2Web (Deng et al., 2023), Mind2Web-Live (Pan et al., 2024), Online-Mind2Web (Xue et al., 2025)) focus on performing interactive browser actions (e.g., click, type) to accomplish user goals, prioritizing tool-operation proficiency. Real-World Web Environment: UIS-QA evaluates agents on the live public Internet. This exposes agents to real-world complexities such as outdated information, distracting content, complex layouts, and advertisements, challenges largely absent in controlled settings. Unknown Startpoint: UIS-QA provides no predefined starting point. Agents must initiate searches using general-purpose engines (e.g., Google) and navigate the entire web, without being restricted to specific sites. Unindexed-Information Dependence: UIS-QA uniquely requires reliance on information not directly accessible via standard search results. Final Answer-Oriented Evaluation: the benchmark employs deterministic short-form answers for evaluation, minimizing subjective judgment and enabling fully automatic scoring.
In summary, UIS-QA holistically evaluates the integration of information-seeking and computer-use capabilities under realistic and demanding web-interaction settings.

2.2 DATA COLLECTION

Following the definition of UIS tasks, we formed an expert group to manually annotate question-answer (QA) pairs and filter out those that can be solved using indexed information alone. This process resulted in a test set of 110 high-quality UIS data samples. Specifically, the team is asked to navigate deeply into authoritative or official websites, performing interactive actions such as multi-round clicks, option-list selection, setting filters, intra-site searching, and downloading files. Afterward, the annotators arrive at an information source such as a specific webpage or file. Based on the content of this page or file, the annotator then formulates a question whose answer can be found or inferred from the available content. For each website, we restrict the annotators to composing a maximum of two QA pairs to ensure diversity.

To further improve the quality of the annotation process, we emphasized the following principles: Objectivity: unlike open-ended or subjective questions, our setting requires answers in the form of factual fill-in-the-blank questions. Thus, the answer z to each question Q is expected to be objective, deterministic, and unique. Authoritativeness: our golden answers are strictly derived from authoritative sources. Due to the intrinsic nature of UIS, such sources are often not searchable and demand strong world-modeling ability to know which websites contain the appropriate authoritative information. This challenges the model to identify reliable sources amid abundant secondary and conflicting information. Static Nature: given the dynamic nature of the Internet, some content may change significantly over time (e.g., "What is today's weather?"), making it unsuitable for our benchmark.
Therefore, annotators were instructed to ensure that answers are static, so that comparisons between agents remain fair across different testing times. Verifiability: to assess the performance of agent systems on UIS-QA, we use a rule-based LLM judge as a verification tool. Consequently, the answers must be verifiable. Most of the golden answers are presented in the form of numerical values, dates, logical statements, or proper nouns. Some answers are also defined by unambiguous rules (e.g., "including either A or B can be considered correct"). Accessibility: annotators are asked to avoid posing questions that would trigger human verification (e.g., CAPTCHAs) during the browsing process. Similarly, websites with access-restricted content requiring a login are also excluded from consideration.

2.3 UIS FILTERING

Even under the strict collection rules described above, some questions are still inevitably solvable using only indexed information. Therefore, we design a UIS filtering pipeline to remove IIS questions. First, each question is independently examined by three annotators, who use Google Search to check whether the target content can be directly retrieved from the search engine. If the search engine results page does not directly contain the target content but contains a link that redirects to the actual content page, the question is still considered UIS. In addition to manual verification, we employ z.ai 2 as an automatic verifier to filter out IIS questions. However, if a question can be answered by z.ai only after downloading a file, we classify it as UIS, since file access requires explicit browsing actions beyond indexed snippets. Next, we leverage an offline LLM (e.g., DeepSeek-R1) to filter out questions answerable from the LLM's inner knowledge (DeepSeek-AI, 2025). Finally, we obtain 110 high-quality samples that constitute UIS-QA.
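The three-stage filter above can be sketched as a simple predicate. The boolean inputs are hypothetical stand-ins for the annotator checks and the z.ai / offline-LLM verdicts, which in the real pipeline come from humans and LLM calls:

```python
# Sketch of the UIS filtering pipeline: a question is kept as UIS only if it
# survives (1) the manual Google Search check, (2) the automatic z.ai check
# (answers that require a file download still count as UIS), and (3) the
# offline-LLM inner-knowledge check. All flags are illustrative placeholders.

def is_uis(directly_searchable, zai_answers_without_file, llm_knows):
    if directly_searchable:          # stage 1: found on the SERP by an annotator
        return False
    if zai_answers_without_file:     # stage 2: z.ai solves it from indexed content
        return False
    if llm_knows:                    # stage 3: answerable from parametric memory
        return False
    return True

candidates = [
    ("Q1", True,  False, False),   # retrievable from the SERP -> IIS, dropped
    ("Q2", False, False, True),    # in the LLM's memory -> dropped
    ("Q3", False, False, False),   # survives all three filters -> kept as UIS
]
kept = [q for q, *flags in candidates if is_uis(*flags)]
print(kept)   # ['Q3']
```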
Among the 110 samples in UIS-QA, 84 questions are written in Chinese and the remaining in English. The questions span a variety of domains, including government announcements, official product introductions, source code repositories, games, and company annual reports.

2 https://chat.z.ai/

3 UIS-DIGGER: A MULTI-AGENT FRAMEWORK FOR UIS PROBLEMS

As mentioned above, few existing works have studied the UIS problem. Therefore, we propose a new agent-system framework for UIS problem solving, named UIS-Digger, which can serve as a fundamental methodology for UIS-QA. In this section, UIS-Digger is elaborated in three aspects: agent design, framework architecture, and the training process.

Figure 2: The UIS-Digger multi-agent system. The planner, web searcher, web surfer, and file reader work together to solve UIS problems. The web surfer can switch between textual and visual modes to observe webpages and make next-step decisions. Zoom in for a better view.

3.1 AGENT AND ARCHITECTURE DESIGN

In Fig. 2 we introduce the overall architecture of UIS-Digger. UIS-Digger is a multi-agent system comprising four agents: a planner, a web searcher, a web surfer, and a file reader. Each agent is equipped with a set of tools and assigned to a specific category of subtasks. For every new instruction, the agent initializes an empty memory and works in an iterative problem-solving process inspired by the ReAct paradigm (Yao et al., 2023). The agents communicate with each other and their corresponding tools via a request-response message system.

Planner: upon receiving a new user query, the top-level planner decomposes it into a set of subtasks, coordinates the execution among the three subordinate agents, and delivers the final answer to the user.
Web Searcher: the web searcher concurrently employs search engines and crawling tools to retrieve indexed information (Eq. 5), and may further delegate subtasks to the web surfer and file reader to obtain unindexed information from web URLs or files.

Web Surfer: the web surfer starts from a URL and operates a browser to access unindexed information. Its action space covers common interactions with websites, including clicking, scrolling, typing, selecting, navigating, submitting forms, downloading files, locating elements, and taking screenshots. Unlike previous browser-integrated methods with either purely textual or purely visual observations of the webpage (Zheng et al., 2025; CAMEL-AI.org, 2025), we introduce a dual-mode, memory-shared browsing strategy to balance completeness in functionality with high efficiency. Crucially, unlike previous multimodal agents, our surfer maintains a shared memory and consistent browser state across textual and visual modes. This design preserves a unified working history, eliminates synchronization overhead, and encourages efficient inference by prioritizing textual mode while reserving visual inspection for essential cases.

File Reader: both the web searcher and web surfer can download files, which are then processed by the file reader, supporting formats such as PDF, XLSX, and DOCX. When content exceeds the context window, it is incrementally read chunk by chunk, following Yu et al. (2025b).

3.2 AGENT TRAINING

UIS-Digger requires specialized capabilities from its inner LLMs, including task decomposition, tool usage, and integrating diverse information for UIS tasks. To this end, we construct synthesized training data and tune the inner LLMs in two stages: SFT and RFT.

3.2.1 TRAINING DATA CONSTRUCTION

For efficiency, we synthesize QA pairs rather than relying solely on manual annotation.
We draw upon both real-world information from the Internet and simulated environments, as illustrated in Fig. 3. To construct QA pairs from real-world information sources, over one hundred base websites are collected across domains such as public companies, product catalogs, government announcements, data dashboards, and code repositories. UIS-Digger is instructed to roam within these websites and extract five informative sections about a chosen entity, forming a context as defined in Eq. 8. It is designed to gather information from deeper webpages by performing various browsing actions. Then we deploy another LLM to compose a question and label the corresponding answer based on this context, followed by an LLM judge filtering out ambiguous or subjective questions. The prompts used for information collection and query generation are provided in Appendix B.

To address early weaknesses in handling interactive web elements, such as selecting a date in a datetime picker, we further developed three types of virtual websites that simulate flight-booking and statistical-data-lookup scenarios. These websites incorporate specific interactive elements that posed challenges to the earlier version of UIS-Digger. Each virtual site is provided with a fictitious JSON database (e.g., synthetic shopping records). QA pairs can be directly derived from the database, while UIS-Digger must solve them by interacting with the simulated website. This simulation strategy significantly enhances the agent's ability to manipulate widgets such as radio buttons, date selectors, filters, and graphs.

Based on the QA pairs constructed from real and virtual websites, we employ UIS-Digger to solve these questions and collect the trajectories, which are then filtered with a rejection-sampling method and used for tuning the inner LLM of UIS-Digger. The resulting trajectories are used in two stages, SFT and RFT, with disjoint question sets allocated to each.
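The rejection-sampling filter, together with the difficulty reweighting used later in the RFT stage, can be sketched as follows. The linear `keep_prob` schedule is an illustrative assumption; the paper does not specify the exact weighting:

```python
import random

# Sketch of rejection sampling over solver trajectories with difficulty-
# weighted retention. Each question has a group of sampled attempts; only
# correct, non-trivial trajectories survive, and questions with fewer correct
# attempts (i.e., harder ones) are retained with higher probability.

def filter_trajectories(question_groups, group_size=4, seed=0):
    """question_groups: {question: [(trajectory, is_correct, is_trivial), ...]}"""
    rng = random.Random(seed)
    kept = []
    for question, attempts in question_groups.items():
        valid = [traj for traj, ok, trivial in attempts if ok and not trivial]
        if not valid:
            continue                                  # no usable trajectory
        # Fewer correct attempts -> harder question -> higher retention.
        keep_prob = 1.0 - (len(valid) - 1) / group_size   # assumed schedule
        kept.extend(traj for traj in valid if rng.random() < keep_prob)
    return kept

groups = {"easy-q": [("tA", True, False)] * 4,       # 4/4 correct -> keep_prob 0.25
          "hard-q": [("tB", True, False),            # 1/4 correct -> keep_prob 1.0
                     ("tC", False, False),
                     ("tD", False, False),
                     ("tE", False, False)]}
kept = filter_trajectories(groups)
```

Because `keep_prob` is 1.0 for a single-success question, every trajectory from "hard-q" is retained, while "easy-q" trajectories are subsampled.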
Figure 3: QA-pair construction pipeline. (Left): Procedure for constructing QA pairs using real-world information. First, homepages potentially containing deep navigation structures and informative content are collected. UIS-Digger then explores these homepages to extract information from pages requiring multiple navigation steps. The collected information subsequently serves as context for query generation. (Right): Procedure for constructing QA pairs based on simulated webpages. We identify browsing actions that UIS-Digger struggles to perform, and generate webpages (along with a JSON database containing relevant statistics) that incorporate these actions. QA pairs are then generated from the information in the JSON database of these simulated webpages.

3.2.2 TWO-STAGE TRAINING

In the SFT stage, we integrate a powerful teacher model X* to solve sampled questions with temperature 0, producing one trajectory per question. A separate LLM judge verifies (1) the correctness of the final answer and (2) whether the question is trivial, i.e., whether the first-round reply already contains the golden answer z.
We adopt rejection sampling, retaining only correct and non-trivial trajectories. The resulting SFT-tuned model is denoted X_s, which is then used for RFT trajectory generation.

In the RFT stage, X_s is deployed to solve the remaining training questions with temperature 0.4 and a sampling group size of four, encouraging exploration. The same rejection-sampling strategy is applied. To emphasize challenging tasks, we reweight samples by difficulty, measured by the number of correct attempts. Specifically, trajectories from challenging questions are more likely to be retained than those from easier ones. Bootstrapping with these RFT trajectories yields the final model X_r, which is integrated as the default LLM in UIS-Digger. Unless otherwise specified, all subsequent experimental results are reported with X_r.

4 EXPERIMENTS

In this section, we present results and analyses on the proposed benchmark UIS-QA and the UIS-Digger system. Across nearly all baseline methods, we observe substantial performance gaps between UIS-QA and prior non-UIS tasks, underscoring the significance of UIS-QA as a novel benchmark. Furthermore, through error-case analysis across different models, we identify several key factors that determine an agent system's success in UIS.

4.1 EXPERIMENTAL SETTINGS

To distinguish the different nature of the proposed UIS-QA benchmark from existing ones, we conduct extensive evaluations of existing advanced information-seeking agents:

• Direct API Inference. These methods directly query a base LLM through the provider's API, with the action space (e.g., whether search tools can be used) not fully disclosed. We evaluate models such as DeepSeek-V3.1 (DeepSeek-AI, 2024), Claude-sonnet-4 (Anthropic, 2025), and GPT-5 (OpenAI, 2025).
| Name | crawl(s) | visual | file | browser | Backbone | UIS-QA | GAIA | BC-zh |
| Direct Inference |
| DeepSeek-V3.1 | - | - | - | - | DeepSeek-V3.1 | 1.8 | - | - |
| Claude-sonnet-4 | - | - | - | - | Claude-S4 | 2.7 | - | - |
| GPT-5 | - | - | - | - | GPT-5 | 0.9 | - | - |
| Commercial System |
| GLM-4.5 (auto-thinking, web search) | ✓ | ✗ | - | ✓ | GLM4.5 † | 11.8 | - | - |
| Doubao DeepThink | - | - | - | - | Doubao | 11.8 | - | - |
| Gemini-2.5-pro (google_search) | - | - | - | - | Gemini-2.5-pro | 4.5 | - | - |
| ReAct Agentic Framework |
| WebSailor | ✓ | ✗ | ✗ | ✗ | WebSailor-32B + Qwen3-72B | 7.3 | 53.2 ‡ | 25.5 |
| Tongyi-DR | ✓ | ✗ | ✗ | ✗ | TongyiDR-30B-A3B † + GPT-4o | 23.6 | 70.9 ‡ | 46.7 |
| Multi-agent Framework |
| DDv2 | ✓ | ✗ | ✗ | ✗ | Pangu-38B | 8.2 | - | 34.6 |
| OWL | ✓ | ✓ | ✓ | ✓ | O3-mini + 4o + Claude-S3.7 | 4.6 | 69.7 | - |
| MiroThinker v0.1 | ✓ | ✓ | ✓ | ✓ | MiroThinker-32B-DPO + GPT-4.1 + Claude-S3.7 | 7.3 | 57.9 ‡ | - |
| Memento | ✓ | ✓ | ✓ | ✗ | O3 + GPT-4.1 | 25.5 § | 79.4 | - |
| AWorld | ✓ | ✓ | ✓ | ✓ | Gemini-2.5-pro + GPT-4o | 5.5 | 32.2 | - |
| UIS-Digger (Pangu) | ✓ | ✓ | ✓ | ✓ | PanGu-38B | 27.3 | 50.5 | 32.5 |
| UIS-Digger (Qwen) | ✓ | ✓ | ✓ | ✓ | Qwen3-32B | 27.3 | 47.6 | 32.5 |

Table 2: Evaluation results on UIS-QA, GAIA, and BrowseComp-zh (BC-zh). † indicates reasoning-oriented LLMs. ‡ denotes results measured on GAIA-text-103 rather than the full GAIA benchmark. § indicates that the UIS-QA score for Memento (Zhou et al., 2025a) is reported without using its case bank, since UIS is a new task type and only limited cases have been previously allocated. The action-space columns cover crawl (read webpage content), visual (read images), file (download files), and browser (operate a browser).

• Commercial Systems. Beyond a single LLM, these systems adopt more sophisticated architectures that theoretically enable a broader action space, such as searching. GLM-4.5 (Team et al., 2025), Doubao (Seed team, 2025), and Gemini-2.5-pro (DeepMind, 2025) belong to this category.

• ReAct-based Frameworks. A straightforward agent design that couples reasoning and action, represented by WebSailor (Li et al., 2025a) and Tongyi DeepResearch (Team, 2025b).

• Multi-agent Frameworks.
These methods implement multi-agent architectures in which specialized agents handle different tasks such as webpage crawling, visual-signal interpretation, file reading, and browser operation. Many systems in this group achieve strong results on traditional benchmarks like GAIA and BrowseComp. Examples include DDv2 (openPanGu Team, 2025), OWL (CAMEL-AI.org, 2025), MiroThinker (Team, 2025a), Memento (Zhou et al., 2025a), and AWorld (Yu et al., 2025a).

The proposed UIS-Digger with backbone X_r is also evaluated in this part. We trained two versions of X_r: a 38B Pangu model (Chen et al., 2025a) and a Qwen3-32B model (Yang et al., 2025). During training, only LLM-generated tokens are updated with gradient backpropagation, while tool responses are excluded. Implementation details of the two stages are provided in Appendix C.

4.2 MAIN RESULTS ON UIS-QA

In Tab. 2, we present the evaluation results of the baseline methods and UIS-Digger. UIS-Digger achieves the highest score of 27.27% on the UIS-QA benchmark, outperforming even sophisticated systems powered by O3. In addition, it delivers competitive results on conventional information-seeking benchmarks such as GAIA and BC-zh, demonstrating strong generality. These findings suggest that UIS-Digger establishes a solid baseline for advancing research on the UIS problem.

By contrast, all baseline methods suffer substantial accuracy drops under the UIS setting. Even strong systems such as Tongyi-DR and Memento, which exceed 70% accuracy on GAIA, drop to only 23.6% and 25.5% on UIS-QA, corresponding to declines of 47.3 and 53.9 points, respectively. This sharp degradation reinforces our central motivation: UIS remains an underexplored and insufficiently addressed capability in current agent systems.
Beyond the ranking of the baseline methods, it is also worth noting that methods achieving higher scores on general information-seeking tasks such as GAIA also tend to perform relatively better on UIS-QA. This correlation suggests that a strong foundation model (e.g., O3 in Memento) is still essential for UIS tasks.

Nevertheless, when comparing ReAct-style methods with more complex agent frameworks, we observe that the relative distribution of UIS and IIS scores is not fundamentally different. Even within the same framework type and with similar action spaces, these methods exhibit large performance disparities, with gaps of up to 17.3% and 20.9%, respectively. We hypothesize that while a larger action space theoretically enables more diverse strategies, it also expands the search space and introduces new challenges. The main bottleneck, therefore, shifts to the underlying LLM's fundamental ability.

4.3 ANALYSIS

To systematically analyze the challenges faced by agent systems in solving UIS tasks, we conduct a detailed examination of their searching and browsing behaviors. Fig. 4 illustrates two key aspects: the proportion of trials successfully grounded to the golden information source (left), and the action-frequency distributions across correct and incorrect samples after different training stages (right).

Gains from SFT and RFT Training. Both the SFT and RFT training stages lead to substantial accuracy improvements on UIS-QA, demonstrating the effectiveness of the two-stage tuning strategy. For instance, UIS-Digger with a PanGu backbone gains 13.6% from SFT and an additional 4.6% from RFT. Further details and extended results are provided in Appendix D.1.

Error Analysis. On the left side of Fig. 4, we analyze the searching behaviors of four representative methods, Memento, Tongyi-DR, WebSailor, and UIS-Digger, on UIS-QA.
We evaluate whether an agent successfully retrieves and accesses the root website of the annotated golden information source. The root website is defined as the domain name of the ground-truth webpage, and actions such as crawling, surfing, or downloading the golden webpage URL are counted as visits. The three concentric rings in each pie chart, from innermost to outermost, denote: (1) final answer correctness, (2) whether the golden root website is retrieved during search, and (3) whether the golden root website is subsequently accessed. Observing different parts of the pie charts, we identify several key patterns. For clarity, sections are denoted by their colors from the inner to outer rings (e.g., BBR stands for Blue–Blue–Red). More illustrative examples are provided in Appendix E.

Missing retrieval (RRR) and knowledge sourcing (RBR) are two dominant failure modes. Without retrieving the root page, solving a UIS problem becomes theoretically impossible, underscoring the need for robust search capabilities. Even when homepages are retrieved, agents often fail to select the correct knowledge source among the results, highlighting the importance of precise source identification. These patterns emphasize the value of UIS-QA in exposing UIS-specific weaknesses in agent behaviors.

UIS remains difficult even when the source page is reached (RBB). Another substantial fraction of cases involves correctly retrieving and visiting the root website but still producing incorrect final answers. Such failures stem from the inherent complexity of UIS action spaces: even when starting from the correct source, agents must execute intricate operation sequences, such as multi-step navigation, filter adjustments, or repeated back-and-forth exploration. This calls for stronger continuous reasoning and long-horizon planning capabilities in future agent systems.

Intrinsic knowledge and alternative sources offer only limited shortcuts.
We also observe a small number of correct cases where the golden root website is neither retrieved nor visited. Our manual inspection suggests two explanations: (1) agents occasionally leverage intrinsic knowledge of URLs to directly access relevant pages, and (2) third-party websites sometimes redundantly host the required information. While such cases reveal that prior knowledge or external redundancy can occasionally "hack" UIS tasks, their rarity indicates they do not fundamentally mitigate the UIS challenge.

Figure 4: Action analysis. (Left): Search behaviors of UIS-Digger and three baseline methods. The pie charts show the proportions of cases where the agent successfully retrieves the root URL via search, and whether the root URL is subsequently accessed through crawling or browsing. (Right): Action frequency distributions of correct and incorrect cases for Pangu-38B UIS-Digger at different training stages. Zoom in for best view.

Tool Usage Across Training Stages. We observe clear shifts in tool-utilization patterns as the agent advances through training. As shown in Fig. 4 (right), the frequency of search tool calls increases across both correct and incorrect trajectories, reflecting a growing reliance on external retrieval, which is believed to reduce hallucination. In contrast, file-parsing actions remain largely unchanged, consistent with their role as a follow-up step once relevant files are downloaded. A critical difference emerges in the use of the crawl tool. The untrained model fails to invoke it altogether, whereas this capability appears after SFT and further improves with RFT, underscoring the importance of staged training for acquiring essential behaviors. Browsing actions reveal another important trend: in successful trajectories, browsing attempts sharply decrease over training, indicating more targeted and efficient navigation.
Conversely, unsuccessful trajectories show an increasing number of attempts, suggesting heavy unproductive exploration. Overall, correct trajectories follow a pattern of "learn then streamline": tool usage rises after SFT as the agent learns to solve more complex tasks with longer tool-use sequences, then declines as navigation efficiency improves with RFT. Incorrect trajectories, however, show a monotonic increase in tool calls, reflecting prolonged retries that fail to converge to a correct solution.

5 CONCLUSION

In this paper, we identify the overlooked problem of Unindexed Information Seeking (UIS), where indispensable information resides beyond the reach of search engines. To systematically evaluate this UIS capability, we introduce the UIS-QA benchmark, which provides a dedicated test set for assessing agent systems on UIS tasks. Although existing agents achieve strong performance on conventional information-seeking benchmarks, their ability to solve UIS problems remains limited. Consequently, we propose UIS-Digger, an agent system with enhanced web-interactive tools and trained through sequential SFT and RFT stages. Our results demonstrate that with an appropriate action space and tailored training strategy, UIS ability can be effectively bootstrapped, enabling UIS-Digger to achieve state-of-the-art performance on UIS-QA. Nevertheless, despite these improvements, the absolute accuracy of UIS-Digger at 27.27% remains far from satisfactory, underscoring the difficulty of UIS. We hope that UIS-QA will encourage further research in this direction and inspire the development of more practical and generalizable deep research agents.

REFERENCES

Moonshot AI. Kimi-Researcher: End-to-end RL training for emerging agentic capabilities, 2025. URL https://moonshotai.github.io/Kimi-Researcher/.

Anthropic. Introducing Claude 4, 2025.
URL https://www.anthropic.com/news/claude-4.

CAMEL-AI.org. OWL: Optimized workforce learning for general multi-agent assistance in real-world task automation. https://github.com/camel-ai/owl, 2025. Accessed: 2025-03-07.

Sky CH-Wang, Darshan Deshpande, Smaranda Muresan, Anand Kannappan, and Rebecca Qian. Browsing Lost Unformed Recollections: A benchmark for tip-of-the-tongue search and reasoning, March 2025.

Hanting Chen, Jiarui Qin, Jialong Guo, Tao Yuan, Yichun Yin, Huiling Zhen, Yasheng Wang, Jinpeng Li, Xiaojun Meng, Meng Zhang, Rongju Ruan, Zheyuan Bai, Yehui Tang, Can Chen, Xinghao Chen, Fisher Yu, Ruiming Tang, and Yunhe Wang. Pangu Light: Weight re-initialization for pruning and accelerating LLMs, 2025a.

Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, Chen Sun, Han Hou, Hui Yang, James Pan, Jianan Lou, Jiayi Mao, Jizheng Liu, Jinpeng Li, Kangyi Liu, Kenkun Liu, Rui Wang, Run Li, Tong Niu, Wenlong Zhang, Wenqi Yan, Xuanzheng Wang, Yuchen Zhang, Yi-Hsin Hung, Yuan Jiang, Zexuan Liu, Zihan Yin, Zijian Ma, and Zhiwen Mo. xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations, 2025b.

Google DeepMind. Gemini model & thinking updates (March 2025). https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-pro, 2025.

DeepSeek-AI. DeepSeek-V3 technical report, 2024.

DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025.

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web, 2023. URL https://arxiv.org/abs/2306.06070.

Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. DeepResearch Bench: A comprehensive benchmark for deep research agents, 2025.
URL https://arxiv.org/abs/2506.11763.

Google. Deep Research is now available on Gemini 2.5 Pro Experimental. Google Blog, 2025. URL https://blog.google/products/gemini/deep-research-gemini-2-5-pro-experimental/.

Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, Yali Liao, Zaiyuan Wang, Chenghao Yang, Qianyu Yang, Mingren Yin, Zhiyuan Zeng, Ge Zhang, Xinyi Zhang, Xiying Zhao, Zhenwei Zhu, Hongseok Namkoong, Wenhao Huang, and Yuwen Tang. FinSearchComp: Towards a realistic, expert-level evaluation of financial search and reasoning, 2025.

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint, 2025.

Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, and Jingren Zhou. WebSailor: Navigating super-human reasoning for web agent. arXiv preprint, 2025a.

Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, and Zhicheng Dou. WebThinker: Empowering large reasoning models with deep research capability, 2025b.

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, 2023a.

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants, 2023b. URL https://arxiv.org/abs/2311.12983.

OpenAI. Introducing GPT-5, 2025. URL https://openai.com/index/introducing-gpt-5/.

OpenAI. Introducing Deep Research, 2025.
URL https://openai.com/index/introducing-deep-research/. Accessed: 2025-09-13.

OpenAI Team. BrowseComp: A benchmark for browsing agents. https://openai.com/index/browsecomp/, 2025. Accessed: 2025-04-29.

openPanGu Team. DeepDiver V2 technical report, 2025. URL https://ai.gitcode.com/ascend-tribe/openPangu-Embedded-7B-DeepDiver/blob/main/docs/openpangu-deepdiver-v2-tech-report.pdf.

Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, and Zhengyang Wu. WebCanvas: Benchmarking web agents in online environments, 2024.

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity's Last Exam, 2025. URL https://arxiv.org/abs/2501.14249.

ByteDance Seed Team. Doubao-1.5-pro, 2025. URL https://seed.bytedance.com/en/special/doubao_1_5_pro.

Wenxuan Shi, Haochen Tan, Chuqiao Kuang, Xiaoguang Li, Xiaozhe Ren, Chen Zhang, Chen Hanting, Yasheng Wang, Lifeng Shang, Fisher Yu, and Yunhe Wang. Pangu DeepDiver: Adaptive search intensity scaling via open-web reinforcement learning. arXiv preprint, 2025.

Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. ZeroSearch: Incentivize the search capability of LLMs without searching, 2025a.

Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, Lei Fang, Zhongyuan Wang, and Ji-Rong Wen. SimpleDeepSearcher: Deep information seeking via web-powered reasoning trajectory synthesis, 2025b.

GLM-4.5 Team: Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, and
Yushi Bai, et al. GLM-4.5: Agentic, Reasoning, and Coding (ARC) foundation models, 2025.

MiroMind AI Team. MiroThinker: An open-source agentic model series trained for deep research and complex, long-horizon problem solving. https://github.com/MiroMindAI/MiroThinker, 2025a.

Tongyi DeepResearch Team. Tongyi-DeepResearch. https://github.com/Alibaba-NLP/DeepResearch, 2025b.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition, 2022. URL https://arxiv.org/abs/2108.00573.

Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models, 2024.

Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, and Ke Wang. WideSearch: Benchmarking agentic broad info-seeking, 2025.

Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. WebDancer: Towards autonomous information seeking agency, 2025.

xAI. Grok 3 Beta: The age of reasoning agents. https://x.ai/news/grok-3, 2025.

Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? Assessing the current state of web agents, 2025. URL https://arxiv.org/abs/2504.01382.
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 technical report, 2025.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering, 2018.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.

Chengyue Yu, Siyuan Lu, Chenyi Zhuang, Dong Wang, Qintong Wu, Zongyue Li, Runsheng Gan, Chunfeng Wang, Siqi Hou, Gaochi Huang, Wenlong Yan, Lifeng Hong, Aohui Xue, Yanfeng Wang, Jinjie Gu, David Tsai, and Tao Lin. AWorld: Orchestrating the training recipe for agentic AI, 2025a.

Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. MemAgent: Reshaping long-context LLM with multi-conv RL-based memory agent. arXiv preprint, 2025b.

Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments, 2025.
Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, and Jun Wang. Memento: Fine-tuning LLM agents without fine-tuning LLMs, 2025a.

Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, and Yining Hua. BrowseComp-zh: Benchmarking web browsing ability of large language models in Chinese, 2025b.

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/2307.13854.

A RELATED WORK

Information Seeking Benchmarks. Early benchmarks primarily focused on multi-hop question answering. For instance, HotpotQA (Yang et al., 2018) was designed to evaluate multi-hop retrieval and question answering, while SimpleQA (Wei et al., 2024) focused on short-form factual queries. MuSiQue (Trivedi et al., 2022) further tested multi-hop reasoning over single-hop evidence. More recent benchmarks demand deeper and more persistent search behaviors. Benchmarks such as BrowseComp-en/zh (OpenAI Team, 2025; Zhou et al., 2025b) and Blur (CH-Wang et al., 2025) incorporate deliberate information obfuscation, requiring agents to persistently navigate the web to locate hard-to-find information. Similarly, xbench-DeepSearch (Chen et al., 2025b) measures reasoning, tool usage, and memory in extended interaction chains. A common strategy among these benchmarks is to increase task difficulty by lengthening the reasoning chain, thereby necessitating multi-step browsing and tool invocation. However, they do not explicitly evaluate an agent's ability to retrieve unindexed information that is not easily accessible via search engines.
As a result, systems excelling at conventional search may perform well on these benchmarks but fail dramatically on our proposed UIS-QA.

Other works approach complex information seeking from different angles. WideSearch (Wong et al., 2025) evaluates broad-scale information retrieval across multiple domains and time spans, often drawing from official websites. HLE (Phan et al., 2025) focuses on challenging academic reasoning, while GAIA (Mialon et al., 2023a) emphasizes long-horizon tool use. More recently, benchmarks like FinSearchComp (Hu et al., 2025) and DeepResearch Bench (Du et al., 2025) tackle domain-specific information needs and, in doing so, occasionally involve unindexed sources through historical financial reports or specialized official data. Nevertheless, such exposure remains incidental. In contrast, our work systematically isolates Unindexed Information Seeking (UIS) as a core capability dimension, offering a principled evaluation framework.

Information Seeking Agents. Recent years have witnessed significant progress in the development of information-seeking agents. Technology companies have released deep research products, such as OpenAI Deep Research (OpenAI, 2025), Google Gemini Deep Research (Google, 2025), Kimi-Researcher (AI, 2025), and Grok-3 Deep Research (xAI, 2025). In parallel, the research community has explored multi-agent architectures for complex task orchestration. For example, OWL (CAMEL-AI.org, 2025) proposes a hierarchical framework of planning and specialized execution, while AWorld (Yu et al., 2025a) offers an open-source platform for large-scale agent-environment interaction.

Several studies focus on enhancing reasoning and exploration capabilities during web search. WebThinker (Li et al., 2025b) integrates reasoning processes with web exploration.
Search-R1 (Jin et al., 2025) employs reinforcement learning to enable LLMs to autonomously generate search queries during multi-step reasoning. To address training data scarcity, methods such as SimpleDeepSearcher (Sun et al., 2025b) synthesize data by simulating realistic user interactions in live search environments, and ZeroSearch (Sun et al., 2025a) uses LLMs to simulate a search engine during training. WebDancer (Wu et al., 2025) creates challenging training tasks that demand deeper multi-hop reasoning. Furthermore, DeepDiver V2 (openPanGu Team, 2025) trains a multi-agent system on both closed-ended problems requiring extensive information gathering and verification, and open-ended tasks aimed at producing comprehensive long-form content.

To explore the unindexed information seeking capabilities of agents, we propose an agent architecture that supports flexible interaction between a planner and specialized subagents capable of directly manipulating web elements. Additionally, we enhance the backbone model through carefully curated synthetic data using both supervised fine-tuning (SFT) and rejection sampling fine-tuning (RFT).

B PROMPTS USED FOR QA PAIR GENERATION

The prompts used for collecting information from homepage browsing and for generating the final queries are presented as follows.

Please explore the official website of {homepage}. You are encouraged to conduct searches and to select in-depth pages rich in substantive content for browsing. Finally, paraphrase the content of at least **five specific articles on different topics that contain a wealth of detailed entity information**. For example:
- Visit the Investor Relations section of a corporate website, locate the Q3 2024 report, download it, and record the shareholding percentage of the largest individual shareholder.
- Visit a museum's official website, find the "Treasures of the Museum" tab, click on it, and record all the listed treasures.

Note:
1. Paraphrase the specific content; do not use statements such as "for specific details, please refer to X document."
  1.1 You are encouraged to paraphrase information containing numerical values and names.
2. The collected content should ideally not be directly searchable via search engines.
  2.1 You are encouraged to visit related detailed pages. For example: Access "X Company's accounts payable for Q2 2025 is..." and collect the specific content, then access "X Company's accounts payable for Q1 2025 is..." and paraphrase both pieces of information.
  2.2 If documents are available on the website, you are encouraged to paraphrase the specific content within those documents.
3. The source of the specific content must be the original text you actually saw; DO NOT fabricate anything!!! Paraphrase these contents verbatim directly.
4. Select objective and specific content.
  4.1 The information provided by the content must be objective and definitive. For example: "In 2025, X's revenue rate was...", "X's standard numbers include...", "X was included in the National Patent Industrialization Demonstration Enterprise Cultivation Pool in month z of year y."
  4.2 The information provided by the content cannot be vague or allow for other possibilities. For example: "X's advantages include...", "X's main goals are...", "X focuses on aspects y and z.", "The reasons X does Y are...".
  4.3 The information provided by the content cannot be overview/summary in nature. For example: "X's key measures include...", "The difficulties in X's research include...", "X's prospects for the future include...".
  4.4 Do not select speech-type, manifesto-type, or address-type webpages.
5. Maintain rigor.
  5.1 For all content, considering the current date, version, etc., the collected content must include specific qualifying statements.
Do not say "Sales of X's flagship model were y yuan"; add conditions and change it to, for example, "Sales of X's 2024 flagship model in Mainland China were y yuan". Do not say "X has a total of 41 characters"; add conditions and change it to, for example, "Version 5.2.3 of X has a total of 41 heroes".
  5.2 For content specific to a particular institution or enterprise, include the institution or enterprise as a condition. Do not say "Investment meetings are held on the last day of each quarter"; add the enterprise condition and change it to, for example, "Enterprise X's investment meetings are held on the last day of each quarter".

It is currently {month}. Based on the specific information in the provided text segments, please create 5 objective questions from different perspectives that have definitive answers. Attach the answers and the rationale for each question, separating multiple rationales with semicolons. Use the format:
1. Question Design: XXX
Question: XXX
Answer: XXX
Rationale: XXX

Text Segment: {context}

Note:
1. Ask questions targeting specific information; do not focus on "task" descriptions.
2. The questions should ideally not be answerable by directly searching a search engine.
  2.1 You are encouraged to design multi-hop questions based on the text segment. For example: combine "What were X Company's accounts payable for Q2 2025?" and "What were X Company's accounts payable for Q1 2025?" into "By how much did X Company's accounts payable increase in Q2 2025 compared to Q1 2025?"
3. The source for the questions must be the original text you actually saw; DO NOT fabricate anything!!!
4. Ensure the objectivity and specificity of the questions.
  4.1 A question is objective and specific if it has an objective, definitive answer.
For example: "What was X's revenue rate in 2025?", "What are the standard numbers for X?", "When was X included in the National Patent Industrialization Demonstration Enterprise Cultivation Pool?"
  4.2 A question is *not* objective and specific if the answer is open to reasonable interpretation. Avoid questions like: "What are the advantages of X?", "What are the main goals of X?", "What aspects does X focus on?", "Why does X do Y?"
  4.3 A question is *not* objective and specific if multiple non-equivalent answers could be considered accurate. Avoid questions like: "List the key measures of X."
  4.4 A question is *not* objective and specific if it is overview/summary in nature. Avoid questions like: "What was reported in X?", "What are the research difficulties in X?"
5. Maintain rigor and ensure the uniqueness of the answer.
  5.1 For all content, considering the current date, version, etc., include specific qualifying statements. Do not ask "What were the sales of X's flagship model?"; add conditions and ask, for example, "What were the sales of X's 2024 flagship model in Mainland China?". Do not ask "How many characters does X have?"; add conditions and ask, for example, "How many heroes are in version 5.2.3 of X?"
  5.2 For content specific to a particular institution or enterprise, include the institution or enterprise as a condition. Do not ask "On which day are investment meetings held each quarter?"; add the enterprise condition and ask, for example, "On which day of the quarter does Enterprise X hold its investment meetings?". Do not ask "on the official website"; specify which official website.
6. Do not include specific webpage titles or file names in the questions. The answers must not contain phrases like "for specific details, please refer to X link".

C IMPLEMENTATION DETAILS

SFT Training. For supervised fine-tuning (SFT), we use a learning rate of 3 × 10⁻⁶ with a batch size of 32.
Each training instance is packed to a sequence length of 128k tokens. We train the model for a total of 3 epochs. After filtering for correct teacher answers, we retain 1,482 training queries, corresponding to 4,501 trajectories in total. Since our framework is a multi-agent system, a single query may correspond to multiple trajectories.

RFT Training. For rejection sampling fine-tuning (RFT), we use the same learning rate and batch size as in SFT. After filtering for correct responses, the full RFT dataset contains 12,959 trajectories associated with 3,317 queries. After applying difficulty-weighted sampling (oversampling difficult queries and undersampling simpler ones), the final number of trajectories actually used for training is 4,467.

Table 3: Performance of each training stage across different benchmarks.

            UIS-QA   BC-zh   GAIA   FinSearchComp (T2/T3)
Pangu-38B     9.1     12.1    25.2    48/3.4
Pangu-SFT    22.7     30.8    42.7    69.0/5.7
Pangu-RFT    27.3     32.5    50.5    73.0/11.4

D ABLATION STUDY

In this section, we analyze the contributions of UIS-Digger's modules and technical choices. Overall, the results confirm both the robustness and the effectiveness of our framework in tackling UIS problems.

D.1 PERFORMANCE GAINS FROM SFT AND RFT TRAINING

Beyond UIS-Digger's strong performance on UIS-QA, we conduct ablations to assess how different training stages contribute to accuracy. As shown in Fig. 5, performance consistently improves after each stage of SFT and RFT, though with diminishing returns. The most significant gain comes from the SFT stage, supporting our claim that vanilla agents lack awareness of UIS and perform poorly at the outset. RFT further improves performance by enabling the agent to explore diverse solving strategies and reinforce successful ones. This finding is encouraging: even under the UIS setting, self-improvement through reinforcement remains effective.
Nevertheless, UIS-Digger's absolute accuracy after RFT is still unsatisfactory, indicating substantial room for future work. We hypothesize two key limitations: (1) a distribution gap between synthesized QA pairs and the real test set, which weakens transfer, and (2) sparse supervision from rejection sampling, where feedback is based only on final answers, potentially reinforcing low-quality trajectories.

We also evaluate our method on other benchmarks to assess its generalizability. As shown in Tab. 3, consistent performance gains across various benchmarks are observed. Notably, some benchmarks exhibit even larger improvements than UIS-QA, validating the broad effectiveness of our SFT and RFT stages.

D.2 BACKBONE MODELS

To disentangle the impact of the backbone LLM from that of the UIS-Digger framework, we compare several models (Tab. 4). Both Pangu-38B and Qwen3-32B, when trained under UIS-Digger, achieve a high score of 27.3%, demonstrating that the framework and training pipeline generalize across backbones. Similarly, Claude-sonnet-4 reaches 23.6%, showing a substantial improvement over its original performance and indicating that UIS-Digger benefits even relatively weaker backbones.

In contrast, directly deploying GPT-4o as the main LLM leads to a dramatic drop to 8.2%, while the similarly untuned O3 yields 30.9%, even surpassing the tuned small models of Pangu and Qwen3. This finding suggests that raw foundation model capability is critical, and that compatibility with the framework can also significantly affect performance.

For the dual-mode web surfer, we also ablate the choice of VLM used to interpret visual signals. By replacing GPT-4o with QwenVL-max, UIS-Digger still achieves 25.5%, close to the original 27.3%. This demonstrates that UIS-Digger is robust to different VLM choices, with only minor performance variation.
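The difficulty-weighted sampling used to construct the RFT set (Appendix C) can be sketched as follows. This is a minimal illustration under stated assumptions: the paper does not specify the exact weighting function, so using per-query solve rates with weight `max(1 - rate, 0.05)` is our assumption, and all names are illustrative.

```python
import random
from collections import Counter

def difficulty_weighted_sample(queries, k, seed=0):
    """queries: list of (query_id, solve_rate) pairs; lower solve_rate = harder.
    Draw k samples, oversampling difficult queries and undersampling easy ones."""
    rng = random.Random(seed)
    # Assumed weighting: harder queries (low solve rate) get larger weight;
    # the 0.05 floor keeps fully-solved queries from vanishing entirely.
    weights = [max(1.0 - rate, 0.05) for _, rate in queries]
    return rng.choices(queries, weights=weights, k=k)

# A query solved 10% of the time is drawn far more often than one solved 90%.
pool = [("easy_q", 0.9), ("hard_q", 0.1)]
picks = difficulty_weighted_sample(pool, k=1000)
counts = Counter(qid for qid, _ in picks)
assert counts["hard_q"] > counts["easy_q"]
```

In an actual pipeline the sampled query IDs would then index into the pool of correct trajectories retained after rejection filtering.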
E CASE STUDY

This section provides detailed case analyses corresponding to the error categories discussed in Section 4.3. Each case (translated into English) illustrates a specific mode, detailing the agent's actions.

Figure 5: The UIS-QA score curve for UIS-Digger at different stages of training.

Table 4: Ablation results on backbone models and VLMs.

Backbone          VLM         UIS-QA
Pangu-38B         GPT-4o        27.3
Pangu-38B         QwenVL-max    25.5
Qwen3-32B         GPT-4o        27.3
O3-mini           GPT-4o         7.6
GPT-4o            GPT-4o         8.2
Qwen3-235b-a22b   GPT-4o        13.9
GPT-oss-120b      GPT-4o        19.9
Claude-sonnet-4   GPT-4o        23.6
O3                GPT-4o        30.9

Knowledge Sourcing Error. This case demonstrates a critical failure in source selection. Despite the presence of the correct, authoritative website in the search results, the agent committed a knowledge sourcing error by opting for a third-party source. The selected source was compromised by outdated historical data that was irrelevant to the query's specific context. Consequently, the agent produced a highly confident yet incorrect answer. Blue indicates the correct source, red indicates a third-party source, and orange indicates outdated information.

Question: What committees are under the Academic Committee of the Chengdu Giant Panda Base?
Gold Root Website: www.panda.org.cn (Chengdu Research Base of Giant Panda Breeding)
Browsed Webpage: clas.cas.cn (Chengdu Library and Information Center, Chinese Academy of Sciences)
Action:
## Observation
Through an in-depth exploration of the website of the Chengdu Documentation and Information Center, Chinese Academy of Sciences, I have obtained detailed organizational structure information regarding the Academic Committee of the Information Service Center for the Chengdu Giant Panda Breeding Research Base. The primary sources of information include: 1.
Successfully accessed the detailed page of the Academic Committee (http://clas.cas.cn/zztx2016/gltx2016/xswyh2016/) 2. Acquired the complete historical evolution and organizational structure of the Academic Committee from the 1st term (1981) to the 8th term (revised) (2019 to present) ...
## Reflection & Reasoning
I have successfully completed the user's task requirements. Through systematic exploration of the website, I located an authoritative official information source and obtained complete organizational structure information for the Academic Committee of the Information Service Center for the Chengdu Giant Panda Breeding Research Base.
Key Findings: 1. The Academic Committee is an institution with a long history, established in 1981, spanning over 40 years...

Successful Recall of the Gold Root Website from Internal Knowledge. This case demonstrates a scenario where the agent successfully leverages its internal knowledge of authoritative domains to navigate directly to the correct (gold root) website, bypassing the need for a potentially noisy web search.

Question: According to the Industrial and Commercial Bank of China's RMB spot exchange rate, what was the buying price of the Euro on 2025-05-01, and what was the specific release time?

Gold Root Website: www.icbc.com.cn (Industrial and Commercial Bank of China)

LLM Input (Received from Planner): Please help me complete the following task. When carrying out the task, please carefully meet the requirements specified in the system prompt. Your task is: Visit the official website of the Industrial and Commercial Bank of China (ICBC) and find the euro's RMB spot exchange buying rate for May 1, 2025, along with the specific release time. Please focus on checking sections such as 'Foreign Exchange Rates' or 'Financial Markets' on the ICBC website to locate the euro/RMB spot exchange buying rate data.
LLM Output (Tool Call):
{
  "task": "Visit the official website of Industrial and Commercial Bank of China (ICBC) to find the Euro to RMB spot foreign exchange buying rate for May 1, 2025, and the specific release time. Focus on checking sections like 'Foreign Exchange Rates', 'Financial Markets', or related sections on the ICBC website to locate the Euro/RMB spot forex buying price data.",
  "start_url": "https://www.icbc.com.cn/",
  "max_num_steps": 20
}

Failure to Interact with Webpage Elements. In this case, the agent successfully navigated to the correct (gold root) website. However, it failed to complete the task due to an inability to navigate to the specific page that shows the exhibition hall map.

Question: Which heroes in the game Honor of Kings had the highest number of appearances, ban rate, and average kills per game during the 2025 KPL Spring Season, and what were the respective figures?

Gold Root URL: www.chnmuseum.cn (National Museum of China)

Search Action:
"query_list": ["Number of restrooms/toilets on B1 floor of National Museum of China", "National Museum of China basement level toilets", "National Museum of China floor plan restrooms"]

Search Response:
... "title": "Partial Toilet Renovation and Upgrading Completed at the National Museum of China to Enhance Visitor Experience", "link": "https://www.chnmuseum.cn/zx/gbxw/202507/t20250728_272269.shtml", "snippet": "Humanized services were the focus of this renovation. During the renovation project, additional facilities such as toilet armrests and height-adjustable hand wash basins were added to facilitate the disabled, children, and other groups; small shelves and coat hooks were installed inside the toilet stalls to meet visitors' needs for placing personal items...", "date": "Jul 28, 2025", "position": 1 ...
Browsed Webpage: https://www.chnmuseum.cn/zx/gbxw/202507/t20250728_272269.shtml

Browsed Webpage Content: ...In terms of functional layout, addressing the restroom distribution issues raised by the audience, this renovation involved swapping the men's and women's sections of the ground-floor restroom on the north side. While maintaining the same number of men's stalls, the number of women's stalls was increased. Additionally, the cleaning room was relocated outward, further improving space utilization efficiency...

Gold Specific Webpage: https://www.chnmuseum.cn/cg/

Gold Specific Webpage Information: This is the service page of the National Museum of China, which contains comprehensive visitor information. Key details include the opening hours, reservation rules, and a map displaying the exhibition halls and public facilities (e.g., restrooms).

F LIMITATIONS AND FUTURE WORK

This paper highlights the overlooked problem of UIS, introducing the dedicated benchmark UIS-QA and a strong baseline method, UIS-Digger. While the results presented above are promising, several limitations remain to be addressed.

First, as shown in Fig. 5, UIS-Digger continues to improve after RFT training, but the gains are limited. This suggests that despite our careful data generation and filtering pipeline, the synthesized QA distribution may still differ from real-world cases. Moreover, the sparse supervision signal, focused solely on final answers, restricts the model's ability to distinguish between trajectories that are equally correct but vary in quality.

Second, because websites evolve unpredictably, even carefully chosen time-invariant sources may shift in accessibility. For example, new third-party websites might replicate the target information, effectively transforming a UIS case into an IIS one and altering the problem difficulty.
Looking forward, we plan to enhance UIS-Digger with more advanced self-improvement techniques such as reinforcement learning, and to synthesize higher-quality QA pairs that better reflect the complexity of real-world UIS scenarios.

G STATEMENT ON THE USE OF AI

AI techniques were employed solely to assist with language polishing and improving sentence fluency during the writing of this paper. All ideas, methods, and experimental results were conceived, designed, and executed entirely by the authors.

H ETHICS STATEMENT

This work involves human annotators in the data collection process. All annotators were compensated above the minimum wage specified by the local government. The primary goal of this research is to support the community in advancing UIS-capable agents. To promote transparency and reproducibility, the dataset will be open-sourced. No commercial or confidential information is included in the dataset.