Deep Research of Deep Research: From Transformer to Agent, From AI to AI for Science


Authors: Yipeng Yu

Abstract

With the advancement of large language models (LLMs) in their knowledge base and reasoning capabilities, their interactive modalities have evolved from pure text to multimodality and further to agentic tool use. Consequently, their applications have broadened from question answering to AI assistants and now to general-purpose agents. Deep research (DR) is a prototypical vertical application for general-purpose agents and an ideal approach for intelligent information processing, assisting humans in discovering and solving problems, with the goal of reaching or even surpassing the level of top human scientists. This paper provides a deep research of deep research. We articulate a clear and precise definition of deep research and unify perspectives from industry's deep research and academia's AI for Science (AI4S) within a developmental framework. We position LLMs and Stable Diffusion as the twin pillars of generative AI, and lay out a roadmap evolving from the Transformer to agents. We examine the progress of AI4S across various disciplines. We identify the predominant paradigms of human-AI interaction and the prevailing system architectures, and discuss the major challenges and fundamental research issues that remain. AI supports scientific innovation, and science can also contribute to AI growth (Science for AI, S4AI). We hope this paper helps bridge the gap between the AI and AI4S communities.

Keywords: Deep Research, AI for Science, AI4S, LLM, Diffusion, Agent, Agentic AI, AI Scientist, Scaling law, GenAI, Generative AI.

1 Introduction

"Mind the risks of AI, but fear the halt of its progress more.
" — This paper

*Email: yypzju@163.com

Since the emergence of ChatGPT on 30 Nov 2022, nations have gradually become aware of the tremendous advances in AI and have recognized its strategic significance. On 8 Oct 2025, the European Commission launched the "European Strategy for Artificial Intelligence (AI)" to harness the potential of AI technologies in science and support scientists in adopting them for their research. On 24 Nov 2025, the White House of the United States launched the "Genesis Mission", which aims to win the AI race. The academic community has increasingly applied LLMs to cutting-edge research areas such as biology (Gao et al., 2024; Rao et al., 2026), chemistry & materials (Tom et al., 2024), healthcare (Ong et al., 2026), mathematics (Ju and Dong, 2026; Wang et al., 2026a), physics, medicine, meteorology, and other fields (Gao and Wang, 2024; Gil and Moler, 2025; Chugunova et al., 2026), while industry has begun rolling out next-generation search engines capable of deep research, such as those from Google DeepMind, OpenAI, and Perplexity.

Unquestionably, as the capabilities of AI advance and its applications broaden, its integration into the research domain will feature progressively higher levels of automation and intelligence. However, "deep research" is a concept that has emerged only within the past two years and currently lacks a unified definition. Its relationship with similar concepts, such as "deep search" and "AI scientist", also remains ambiguous. Moreover, constrained by disparate resources and environments, industry and academia exhibit divergent motivations, methodologies, and outcomes in their studies of LLMs and deep research. Furthermore, AI researchers often have limited understanding of the research pain points of AI4S researchers, while many AI4S researchers are uncertain about the extent to which AI can contribute to their work.
This results in a gap between the two communities. On the other hand, the majority of current publications on deep research have not undergone rigorous peer review, as many are preprints released on platforms such as arXiv and bioRxiv. Consequently, their quality cannot be guaranteed. Furthermore, the few existing survey papers on the topic often fail to provide a comprehensive overview of deep research. They also lack the clarity and broad applicability necessary to be accessible and useful across different research communities.

In response to these issues, we conducted a deep research of deep research. We began by providing a precise definition of the concept and distinguishing it from related notions. Subsequently, we carried out a comprehensive investigation and synthesis of deep research as practiced in both industry and academia. We present the evolution of AI from the Transformer to agents to help AI4S scientists understand core principles. Additionally, we demonstrate how these scientists apply AI within their specific research fields. These practical insights assist AI researchers in refining and iterating deep research agents.

2 Related Work

As this area is relatively nascent, only a few surveys exist. Wang examined breakthroughs over the past decade, including self-supervised learning and geometric deep learning (Wang et al., 2023a). Mo conducted a review of conversational search systems, focusing on four modules: query reformulation, search clarification, conversational retrieval, and response generation (Mo et al., 2025). Lin surveyed the agentic RL foundations of search systems (Lin et al., 2025). Xi analyzed and categorized LLM-based search agents from the perspectives of architecture, optimization, application, and evaluation (Xi et al., 2025).
Li provided an overview of RL-based agentic search, including methods, evaluation, applications, and challenges (Li et al., 2025g). Zhang provided a systematic overview of the DR pipeline, which comprises four core stages: planning, question developing, web exploration, and report generation (Zhang et al., 2025d). Ren reviewed the architectures, design, benchmarks, applications, and ethical considerations surrounding LLM-based scientific agents (Ren et al., 2025). Shi understood DR through three progressive phases (agentic search, integrated search, and AI scientist) and introduced four key components (query planning, information acquisition, memory management, and answer generation) (Shi et al., 2025a). Xu defined the scope of DR and explored architectural patterns, implementation approaches, and domain-specific adaptations (Xu and Peng, 2025). Hu reviewed scientific LLMs from the perspectives of data, model architectures, and agent-based systems (Hu et al., 2025a). Wei offered a domain-oriented review of autonomous scientific discovery across life sciences, chemistry, materials, and physics, synthesizing research progress and advances within each discipline (Wei et al., 2025b). Huang conducted an analysis of the foundational technologies and architectural components that constitute DR agents (Huang et al., 2025b).

[Figure 1: An overview of deep research. The figure arranges concepts along three axes: the x-axis runs from search to review to research, with LLM-augmented counterparts such as deep search, search agent, RAG, deep survey/review, research agent, and AI scientist; the y-axis spans motivations (problem, question, finding, hypothesis) and goals (solution, explanation, validation, application); the z-axis spans the internet digital environment (IDE), the simulation experimental environment (SEE), and the real experimental environment (REE).]

However, existing surveys lack a clear definition of deep research. They often define it narrowly as an agent for search and report generation, and fail to distinguish it from related concepts.
Furthermore, they provide little discussion of the collaborative roles of humans and AI, nor do they thoroughly address the gap between industrial applications and academic research. Consequently, many critical questions remain unresolved for the audience, especially for researchers from AI4S. In contrast, our work is not merely another survey. Instead, we offer a deep research of deep research. Specifically, we provide a unified and evolving perspective by comprehensively investigating principles, datasets/benchmarks, models, agents, applications, and challenges. We also articulate promising directions for achieving AGI. Our work can guide and inspire future research for AI and AI4S.

3 Definition and Differences

Research is the systematic and diligent inquiry or investigation to discover and interpret facts, generate new knowledge, and gain a deeper understanding of a subject. Research typically comprises basic research and applied research, and it is expected to be reproducible and subjected to peer review. As shown in Figure 1, we provide an overview of deep research from a three-dimensional perspective. The x-axis presents the basic stages of research, from search to review and then to research. With the introduction of LLMs, these stages have also been associated with new AI-related terms such as "deep" and "agent". The y-axis represents the motivations for research, including the presence of problems, questions, new findings, and hypotheses, as well as the goals, such as providing solutions, answers, validation, and applications. The z-axis reflects a gradual progression of experimental settings for research, from the internet digital environment (IDE) to the simulation experimental environment (SEE) and finally to the real experimental environment (REE).

Definition 1 (Deep Research).
Centered on LLM-based AI and using tools to interact with the external environment in a multimodal and interactive manner with feedback, deep research assists humans in discovering and solving problems at different levels of automation, with the goal of reaching or even surpassing the level of top human scientists.

To clearly delineate the boundaries of deep research, we distinguish it from adjacent terminologies as follows:

• Differentiating from Search/RAG: Search constitutes a key step within deep research.

• Differentiating from Research: The addition of "deep" in deep research implies that the research process becomes more automated, more efficient, and more intelligent.

• Differentiating from Deep Review/Survey/Summarization: Review constitutes a key step within deep research.

• Differentiating from AI4S / AI for Science: Similar to "AI assistant", AI4S emphasizes the use of AI as a tool to support scientific research in various domains.

• Differentiating from Vibe Research: Vibe research can be regarded as a stage of deep research. It involves partial automation, but it still requires human intervention.

• Differentiating from AI Scientist: AI Scientist is closely related to deep research, but in industry, deep research places greater emphasis on the development of next-generation information-processing engines endowed with research capabilities.

4 Foundation: From Transformer to Agent

4.1 Machine Learning

AI is both a system and a goal. Machine learning (ML) (Mitchell, 1997) is an effective approach to realizing AI, and generative AI (GenAI) is an effective route toward it; a generative algorithm is a kind of ML algorithm. From an application standpoint, ML is typically categorized into classification, regression, and clustering. Based on research methodology, the field is further divided mainly into statistical machine learning, deep neural networks (LeCun et al., 2015; Goodfellow et al.
, 2016), reinforcement learning (RL), and evolutionary computation. Based on their approaches to modeling probability distributions, models are also classified as generative or discriminative. According to the use of labeled data, learning is categorized as supervised or unsupervised. Prior to the recent boom in GenAI, ML involved significantly smaller datasets, compute clusters, and model sizes.

4.2 Gemini of Generative AI

As illustrated in Figure 2, we regard LLMs (Pollux) and Stable Diffusion (Castor) as the Gemini of GenAI.

4.2.1 Pollux: LLM

The time from the adoption of deep neural networks for natural language processing (NLP) to the emergence of LLMs was only about ten years. Word2Vec introduced efficient neural methods to learn dense distributed word representations that capture semantic and syntactic relationships (Mikolov et al., 2013a,b). Better subword tokenization techniques such as BPE (Sennrich et al., 2016; Kudo, 2018), WordPiece, and SentencePiece further help to reduce vocabulary size and handle rare words by encoding text into subword units (tokens). Subsequently, the Transformer model was proposed (Vaswani et al., 2017).
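The subword idea behind BPE can be illustrated with a minimal sketch: repeatedly merge the most frequent adjacent symbol pair in the corpus. This is a toy illustration under a made-up four-word corpus, not the implementation of any particular tokenizer library.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
    Illustrative only; real tokenizers add end-of-word markers, byte
    fallbacks, and far larger corpora."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():      # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # learned merges, e.g. ('l','o') then ('lo','w')
```

After two merges, "low", "lower", and "lowest" all share the subword "low", which is how BPE shrinks the vocabulary while still covering rare words.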
[Figure 2: The Gemini of generative AI. (a) LLM: "next-token prediction" over tokenized input through L Transformer layers (β star, Pollux). (b) Stable Diffusion: "denoising by noise estimation at each time step" in a latent space, with a denoising U-Net and conditioning (text, semantic maps, images) via cross-attention (α star, Castor). (c) Multimodal generative model: a Transformer-based architecture with per-modality encoders, input/output projections, and diffusion decoders for image, audio, and video. (d) AI agent: an LLM-based model with profile/goals/instructions, memory, reasoning/planning, orchestration, critique, and tools/functions/skills in an agent runtime.]

Transformers work because they use self-attention to dynamically weigh the importance of different words in a sequence, enabling parallel processing and capturing long-range dependencies more effectively than recurrent or convolutional models. Following this, different architectures based on the Transformer demonstrated superior performance on NLP tasks. These included the decoder-only GPT-1 (Radford et al., 2018), the encoder-only BERT (Devlin et al., 2019), and the encoder-decoder T5 (Raffel et al., 2020). OpenAI continued to refine the decoder-only models (see Figure 2(a)) and published GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020). Note that GPT-3 ushered in the era of prompt engineering. In 2022, OpenAI released ChatGPT based on GPT-3.5.
This system marked the first human-like conversational experience and attracted widespread public attention. The following year, Meta's open-weight model Llama 2 accelerated the trend of LLMs moving from "closed" to "open" (Touvron et al., 2023). Later, more effective positional embedding approaches were proposed to integrate positional information among tokens (Su et al., 2024; Zheng et al., 2024a; Liu et al., 2025e), and LoRA (Hu et al., 2022) and RL (Stiennon et al., 2020; Weng et al., 2022) were used to fine-tune LLMs. It can be argued that LLMs are a success of brute force in computation. They demonstrate that larger models, more data, and stronger computational resources can lead to a qualitative improvement in neural network performance. This established the foundation for the ideas later summarized as the "scaling law" (Kaplan et al., 2020; Yan et al., 2026), "intelligence emerging" (Wei et al., 2022a), "grokking" (Power et al., 2022), "LLM is intelligence compression" (Deletang et al., 2024), and the "Aha moment". Above all, next-token prediction (NTP) in pretraining is the core and foundational training paradigm for autoregressive language modeling. Its training objective can be formulated as follows:

\[
\mathcal{L}(\theta) = -\sum_{t=1}^{T-1} \log P_\theta\big(x_{t+1} \mid x_1, x_2, \ldots, x_t\big) \tag{1}
\]

Here, P_θ(x_{t+1} | x_{1:t}) is the probability assigned by the model (parameterized by θ) to the true next token x_{t+1}, given the context x_{1:t}. The sum runs from t = 1 to T − 1, since there is no "next token" after the final token.

4.2.2 Castor: Stable Diffusion

A diffusion probabilistic model is a parameterized Markov chain trained using variational inference to produce samples matching the data after finite time (Ho et al., 2020).
The forward process (diffusion/noising) gradually adds Gaussian noise to an image over many steps until it becomes pure noise, while the reverse process (denoising/generation) involves a neural network (a U-Net (Ronneberger et al., 2015)) learning to predict and subtract that noise step by step, allowing it to generate new, realistic data from random noise by reversing the corruption. The U-Net learns to reverse the noising: taking a noisy image and a timestep, it predicts the noise component to subtract, effectively moving from x_t to x_{t−1}. Stable Diffusion (SD) (Rombach et al., 2022) performs these processes not on pixel data but in a compressed "latent space" using an autoencoder. This makes the process much faster and less computationally intensive than applying diffusion directly to high-resolution images. Moreover, by introducing cross-attention layers into the model architecture, SD turns diffusion models into powerful and flexible generators for general conditioning inputs (see Figure 2(b)). In the same year, Stability AI and Midjourney released their image generation products and sparked a wave of mass participation in image creation. The corresponding objective can be simplified as follows:

\[
\mathcal{L}_{\mathrm{SD}} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[ \big\| \epsilon - \epsilon_\theta\big(z_t, t, \tau_\theta(y)\big) \big\|_2^2 \Big] \tag{2}
\]

Here, the noise is drawn from a standard normal distribution, ε ∼ N(0, 1); the encoder \mathcal{E} encodes an image x into a latent representation z; the encoder τ_θ projects the condition y to an intermediate representation τ_θ(y); and the denoising U-Net ε_θ estimates the noise at each time step. Later, the Transformer replaced the U-Net (Peebles and Xie, 2023), and Rectified Flow (Esser et al., 2024), a new generative model formulation that finds a transport map between two empirically observed distributions by learning an ordinary differential equation, demonstrated superior performance compared to the diffusion formulations.
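One training step of the latent-diffusion objective in Eq. (2) can be sketched numerically. Everything below is a toy stand-in: the "encoder", "conditioner", and "denoiser" are random linear functions invented for illustration, not real SD components, and the noise schedule is a crude linear one.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)) * 0.05  # fixed random "denoiser" weights

def encoder(x):
    # Toy stand-in for the autoencoder E(x): compress the image to a latent z.
    return x.reshape(-1)[:16] * 0.1

def tau_theta(y):
    # Toy stand-in for the condition encoder tau_theta(y).
    return np.tanh(y)

def eps_theta(z_t, t, cond):
    # Toy stand-in for the denoising network: predict the noise from
    # the noisy latent z_t, the timestep t, and the condition embedding.
    return z_t @ W + 0.01 * t / 1000.0 + 0.01 * cond.sum()

def alpha_bar(t, T=1000):
    # Crude linear schedule: z_t = sqrt(a_bar)*z + sqrt(1 - a_bar)*eps.
    return 1.0 - t / T

x = rng.standard_normal((8, 8))   # toy "image"
y = rng.standard_normal(4)        # toy "text condition"
z = encoder(x)
t = 500
eps = rng.standard_normal(16)     # the true noise, eps ~ N(0, I)
z_t = np.sqrt(alpha_bar(t)) * z + np.sqrt(1 - alpha_bar(t)) * eps

# Eq. (2): squared L2 distance between the true and the predicted noise.
loss = float(np.sum((eps - eps_theta(z_t, t, tau_theta(y))) ** 2))
print(loss > 0)
```

Training would repeat this step over many (image, condition, timestep, noise) samples and update the denoiser's weights by gradient descent on this loss; sampling then runs the learned denoiser backwards from pure noise.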
Thus, flow and diffusion architectures based on the Transformer became the dominant approach for image generation (Ma et al., 2024b). Concurrently, ControlNet (Zhang et al., 2023b) and InstantID (Wang et al., 2024c) were proposed to provide finer-grained control over image generation using conditional inputs, and AnimateDiff (Guo et al., 2024) was developed to generate temporally consistent images, enabling video synthesis.

4.3 Multimodal Generative Model

As shown in Figure 2(c), multimodal generative models aim to integrate autoregressive language models and diffusion/flow models into a single framework, thereby extending model capabilities from a single modality to multiple modalities (Chen et al., 2024a). One approach is to unify understanding and generation across multiple modalities within the same NTP paradigm used by LLMs (Wu et al., 2025a; Chen et al., 2025e; Wang et al., 2026c); another approach cascades external diffusion models after the output of an MLLM to generate visual and audio modalities (Wang et al., 2024e; Ge et al., 2025); and a third approach combines next-token prediction with mask token prediction (Xie et al., 2025) or rectified flow (Ma et al., 2025) in a single LLM that can handle different modalities in distinct ways. The exploration of multimodal generative models is still at an early research stage. The leading products right now are Google's Nano Banana and Bytedance's Seedance 2.0.

4.4 Agent

"Token is cheap, show me your agent." — This paper

The term "agent" is not new. In earlier work, it typically referred to humans (Jiang et al., 2019; Yu et al., 2020). After the emergence of ChatGPT, it has attracted renewed attention, and its meaning has shifted from referring to humans to referring to AI. With the discovery of "Test-time Scaling" (Wang et al., 2023b; Lightman et al.
, 2023) and advances in the interaction methods and reasoning capabilities of LLMs (Wei et al., 2022b; Yao et al., 2023a,b; OpenAI et al., 2024; Guo et al., 2025), researchers have begun integrating LLMs as the cognitive core into existing agents, while also equipping LLMs with tools, memory, and feedback (Nakano et al., 2022; Schick et al., 2023; Shinn et al., 2023). Real-time and factual information from tools can help mitigate the hallucination issue and transcend the NTP paradigm of LLMs. LLMs are static, stateless, and passive, whereas agents are dynamic, stateful, and proactive. After pretraining, fine-tuning, or prompt engineering, an LLM can generate tokens that specify tool invocation. The interaction between an agent and an LLM typically follows the "Thought → Action → Observation" loop. First, a user submits a query and the agent calls the LLM to obtain an initial token sequence. Second, if the agent detects that the output requires a tool call, it executes the tool, obtains the result, concatenates the result with the context, and calls the LLM again. Third, the agent repeats these steps until a termination condition is met, and then returns the answer to the user. The core modules of an agent are shown in Figure 2(d).

Table 1: The GPU-driven golden decade of AI development from 2012 to 2022.

Year | GPU (Hardware)  | Hardware features                             | Framework (Software)     | Application / Model
2007 | G80, Tesla C870 | -                                             | CUDA, Theano             | -
2012 | GTX 580         | -                                             | -                        | AlexNet
2013 | -               | -                                             | Caffe                    | Word2Vec
2014 | GTX 980, 1080   | -                                             | cuDNN                    | GAN
2015 | -               | -                                             | Keras, TensorFlow        | ResNet
2016 | Tesla P100      | Double/single-precision, HBM2, NVLink         | TensorRT, PyTorch, MXNet | AlphaGo
2017 | Tesla V100      | Mixed-precision training (FP16), Tensor Cores | -                        | Transformer
2018 | -               | -                                             | -                        | BERT, GPT-1
2019 | -               | -                                             | Megatron-LM              | GPT-2
2020 | A100            | TF32, BF16, 40/80GB HBM, sparse computing     | DeepSpeed                | GPT-3
2022 | H100            | FP8, Transformer Engine, DPX                  | Triton                   | ChatGPT (GPT-3.5), Stable Diffusion
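The "Thought → Action → Observation" loop described above can be sketched in a few lines. The scripted call_llm and the fake search tool are stand-ins invented for illustration; a real agent would call an actual model and real tools.

```python
def call_llm(messages):
    """Toy stand-in for an LLM call. A real system would query a model;
    here we script two turns: one tool request, then a final answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "search", "args": {"query": "deep research definition"}}
    return {"answer": "Deep research is an LLM-centered agentic workflow."}

TOOLS = {
    # Fake tool: a real agent would hit a search API here.
    "search": lambda query: f"[3 results about '{query}']",
}

def agent_loop(user_query, max_steps=5):
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        out = call_llm(messages)                  # Thought
        if "tool" in out:
            result = TOOLS[out["tool"]](**out["args"])   # Action
            messages.append({"role": "tool", "content": result})  # Observation
        else:
            return out["answer"]                  # termination condition met
    return "Step budget exhausted."

print(agent_loop("What is deep research?"))
```

Note the termination conditions: the loop exits either when the LLM stops requesting tools or when the step budget runs out, matching the third step of the loop in the text.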
Different agents feature distinct orchestration structures and interact with the LLM through different processes. Notable open-source agent frameworks include LangChain, AutoGPT, LlamaIndex, AutoGen, MetaGPT, OpenClaw, and CrewAI, and popular applications mainly include the coding agent Cursor, the search agent Perplexity, and the deep research agents introduced in this paper.

4.5 GPU

As shown in Table 1, the development of AI has relied heavily on advances in NVIDIA's GPUs and their associated software infrastructure. The introduction of CUDA in 2007 enabled GPU computing power to be used not only for graphics rendering but also for general-purpose computation. In the same year, Theano laid the groundwork for modern deep learning frameworks. However, it was AlexNet, trained on only two GTX 580 GPUs in 2012, that truly established GPUs as essential hardware for deep learning, triggering what is often called the "Cambrian explosion" of AI. This paper refers to the period from 2012, when AlexNet was published, to 2022, when ChatGPT was released, as the golden decade of AI. During this time, GPUs evolved rapidly: CUDA core counts continued to increase, memory capacity grew larger, and memory bandwidth became higher. As a result, two key metrics, compute density (FLOPs) and interconnect bandwidth, improved significantly. The application domain of GPUs expanded from consumer gaming to AI data centers. Concurrently, deep learning frameworks matured, giving rise to two dominant platforms: TensorFlow and PyTorch. Among the GPU generations, the NVIDIA A100 made large-scale distributed training of large models practically feasible. ChatGPT can be regarded as a product of GPT-3.5 trained on A100/V100 GPUs with PyTorch. Leading LLMs today are typically trained on GPU clusters with more than 10,000 GPUs.

5 AI Perspective

"Train AI for the real world, not just for the leaderboard.
" — This paper

This section investigates deep research from an AI perspective, focusing on why and how AI developers design and build such agents based on LLMs. The motivation of DR is to automate complex, multi-step research by using LLMs to plan, search the web, analyze hundreds of sources, and synthesize information into detailed, cited reports, drastically cutting research time from days to minutes for tasks such as market analysis and legal reviews. It goes beyond simple Q&A by tackling intricate queries that require reasoning and gathering data from vast online sources, providing actionable insights and plans.

[Figure 3: Iterative deep research. A sequence diagram among User, System, MasterAgent, SubAgents, Memory, and ReviewAgent: iterative intent confirmation ("more information?"), planning and saving the plan to Memory, creating SubAgents that perform inner/offline/online search and tool use with interleaved evaluation, synthesizing results in an iterative research loop ("more research needed?"), and final checks on temporal consistency, authority, relevance, hallucination, facts, and format before results are persisted and returned.]

5.1 Benchmark

A benchmark is typically built on one or more datasets. It defines evaluation rubrics, metrics, parameters, and procedures to assess and rank agent performance. Benchmarks can be either public or internally proprietary. The datasets used in public benchmarks may be fully public, have hidden test sets, or be entirely private.
Table 2: Pioneering deep research agents (proprietary closed, proprietary open, external base) from the AI perspective. The five levels of automation correspond to L1-L5 in Figure 5. Note that the performance of these agents varies over time. AI4S researchers may also consider conducting studies based on open-weight models such as Llama 3 (Grattafiori et al., 2024), MiMO (Xiaomi et al., 2026), Mistral (Liu et al., 2026a), and LongCat (Meituan et al., 2026).

DR Agent    | Backbone       | Open or Closed | Search | Review | Research | ENV | Autonomy
Gemini      | Gemini         | Closed         | ✓      | ✓      | ✩✩★★★  | IDE | ◦◦•••
ChatGPT     | GPT            | Closed         | ✓      | ✓      | ✩✩★★★  | IDE | ◦◦•••
Claude      | Claude         | Closed         | ✓      | ✓      | ✩✩★★★  | IDE | ◦◦•••
Grok        | Grok           | Closed         | ✓      | ✓      | ✩✩✩★★  | IDE | ◦◦•••
Kimi        | Kimi           | Closed         | ✓      | ✓      | ✩✩✩★★  | IDE | ◦◦•••
Doubao      | Seed           | Closed         | ✓      | ✓      | ✩✩✩★★  | IDE | ◦◦•••
MiniMax     | MiniMax        | Closed         | ✓      | ✓      | ✩✩✩★★  | IDE | ◦◦•••
Ernie       | Ernie          | Closed         | ✓      | ✓      | ✩✩✩★★  | IDE | ◦◦•••
StepFun     | Step           | Closed         | ✓      | ✓      | ✩✩✩★★  | IDE | ◦◦•••
Qwen        | Qwen           | Open           | ✓      | ✓      | ✩✩★★★  | IDE | ◦◦•••
DeepSeek    | DeepSeek       | Open           | ✓      | ✓      | ✩✩✩★★  | IDE | ◦◦•••
GLM         | GLM            | Open           | ✓      | ✓      | ✩✩✩★★  | IDE | ◦◦•••
Perplexity  | DeepSeek       | Closed         | ✓      | ✓      | ✩✩★★★  | IDE | ◦◦•••
MiroThinker | Qwen           | Open           | ✓      | ✓      | ✩✩✩★★  | IDE | ◦◦•••
SciMaster   | DeepSeek       | Open           | ✓      | ✓      | ✩✩✩★★  | IDE | ◦◦•••
DeerFlow    | Model-agnostic | Open           | ✓      | ✓      | ✩✩✩★★  | IDE | ◦◦•••

Agent performance on these benchmarks can guide version iteration, regression testing, and deployment decisions. Common public benchmarks for deep research agents include GAIA (Mialon et al., 2023), GPQA (Rein et al., 2024), FRAMES (Krishna et al., 2025), BrowseComp (Wei et al., 2025a; Zhou et al., 2025c), WebWalkerQA (Wu et al., 2025c), DeepConsult, DeepResearchGym (Coelho et al., 2025), xbench-DeepSearch (Chen et al., 2025c), DeepResearch Bench (Du et al., 2026), ScholarQABench (Asai et al., 2026), and Humanity's Last Exam (Phan et al., 2026). Newly proposed public benchmarks that still require validation include SuperCLUE (Xu et al.
, 2023), WebAggregatorQA (Wang et al., 2025c), Mind2Web 2 (Gou et al., 2025), MLR-Bench (Chen et al., 2025b), ArXivBench (Li et al., 2025e), PaperBench (Starace et al., 2025), ResearchBench (Liu et al., 2025d), ResearcherBench (Xu et al., 2025a), DeepScholar-Bench (Patel et al., 2025), ResearchRubrics (Sharma et al., 2025), ReportBench (Li et al., 2025d), AcademicBrowse (Zhou et al., 2025b), AstaBench (Bragg et al., 2026), LiveDRBench (Java et al., 2026), LiveSearchBench (Zhou et al., 2025a), ExpertLongBench (Ruan et al., 2026), DeepResearch Arena (Wan et al., 2025a), SPOT (Son et al., 2025), DatasetResearch (Li et al., 2025b), RigorousBench (Yao et al., 2025), DRBench (Abaskohi et al., 2026), DeepShop (Lyu et al., 2025), FINDER (Zhang et al., 2025a), PHYBench (Qiu et al., 2025), ScienceAgentBench (Chen et al., 2025g), MicroVQA (Burgess et al., 2025), PDR-Bench (Liang et al., 2026), LiveResearchBench (Wang et al., 2026b), LiveNewsBench (Zhang et al., 2026b), SealQA (Pham et al., 2026), DR-Arena (Gao et al., 2026a), DeepSearchQA (Gupta et al., 2026), P2P (Sun et al., 2026), and DeepResearch Bench II (Li et al., 2026b). These cover domains such as finance, science, policy, and engineering. Research questions are typically provided in the form of text, images, audio, videos, or PDF documents. Most benchmarks supply standard answers, though a few rely on expert human evaluation.

5.2 Architecture

Inspired by Anthropic's Claude Research Agent and informed by published papers and open-source projects, the architecture of a deep research agent is generally as shown in Figure 3. When a user submits a query, the system first confirms user intents through a simple interactive procedure and then creates a MasterAgent that enters an iterative research process.
The MasterAgent begins by thinking through the approach and saving its plan to Memory to persist the context. It then creates specialized SubAgents with specific research tasks. Each SubAgent independently performs searches, evaluates tool results using interleaved thinking, and returns its findings to the MasterAgent. The MasterAgent synthesizes these results and decides whether more research is needed. Once sufficient information has been gathered, the system exits the research loop and passes all findings to a ReviewAgent, which ensures all claims are properly attributed to their sources. The final research results, complete with citations, are then returned to the user.

5.3 Pioneering Agents

Pioneering deep research agents from the AI perspective that provide users with accessible links are listed in Table 2. We can see that: 1) Backbone large models, after agentic training, all gain the ability to perform tool-augmented deep research. 2) These agents consistently demonstrate search and review capabilities, but their research capability remains comparatively weak, leaving a substantial gap between current systems and the vision of AI4S. 3) Open-weight agents lag behind closed-source systems in deep research performance, although the disparity is not significant. 4) DR agents built on open-weight models can outperform their backbones. 5) Current studies evaluate these agents primarily in the IDE rather than in the REE or SEE. 6) In terms of automation, DR systems are generally at level three. In addition, other notable open-source DR projects include GPT Researcher, langchain "open_deep_research" and "local-deep-researcher", node-DeepResearch, "Open Deep Research", deep-research, OpenDeepResearcher, OpenResearcher, FARS, autoresearch, DeerFlow, and WebThinker (Li et al., 2025h).

5.4 Key Aspects

Datasets & Benchmarks. As shown in Section 5.1, many datasets and benchmarks have not yet been widely adopted.
General DR agents can be trained on a broader range of high-quality datasets to improve the generalizability of their research capability and intelligence. Domain researchers can develop dedicated datasets for their fields and then build their own DR agents on either open-weight or proprietary foundation models.

Tools ToolkenGPT represented each tool as a token and learned an embedding for it, enabling tool calls in the same way as generating a regular word token (Hao et al., 2023). SciMaster utilized a tool-augmented reasoning agent designed to emulate human researchers by interacting flexibly with external tools during its reasoning process (Chai et al., 2025). AutoTools is a framework that enables LLMs to act as automated tool learners, automating the tool-use workflow (Shi et al., 2025b). Wu proposed an agentic reasoning framework integrating mind-map, search, and code tools (Wu et al., 2025d). WebDancer provides the agent with search and click tools (Wu et al., 2025b). FlashRAG is an efficient and modular open-source toolkit designed to assist researchers (Jin et al., 2025a). TTE enables agents to synthesize, verify, and evolve executable tools during inference for scientific reasoning (Lu et al., 2026b). FutureTools and SciencePedia provide tools for science in multiple fields.

Agent Framework ResearStudio is a human-intervenable framework for building controllable DR agents (Yang and Weng, 2025). SFR-DeepResearch is a native single agent featuring minimal web crawling and Python tool integration (Nguyen et al., 2025). SciAgent operationalized scientific problem solving under a hierarchical Coordinator–Worker–Subagents framework (Li et al., 2025i). DeepResearcher implemented a multi-agent architecture in which browsing agents extract relevant information from various webpage structures (Zheng et al., 2025b).
TTD-DR conceptualized research report generation as a diffusion process (Han et al., 2025a). FlowSearch is a multi-agent framework that actively constructs and evolves a dynamic structured knowledge flow to drive subtask execution and reasoning (Hu et al., 2025b). MARS is a multi-agent system that seamlessly integrates System 1's fast, intuitive thinking with System 2's deliberate reasoning (Chen et al., 2025a). MiroFlow is a three-tier hierarchical agent framework for general deep research tasks (MiroMind, 2025). WebWeaver is a dual-agent framework with a planner and a writer for open-ended deep research (Li et al., 2026d). WebWatcher combined vision-language reasoning and multi-tool interaction (Geng et al., 2026). O-Researcher is an open-ended DR model built via multi-agent distillation and agentic RL (Yao et al., 2026). Vision-DeepResearch performed multi-turn, multi-entity, and multiscale visual and textual search to robustly query real-world search engines under heavy noise (Huang et al., 2026). FS-Researcher is a file-system-based, dual-agent framework that scales deep research beyond the context window via a persistent workspace (Zhu et al., 2026).

Agentic Learning Atom-Searcher provided atomic thought rewards for fine-grained guidance to address conflicting gradients and reward sparsity in RL training (Deng et al., 2025). DeepDive designed a redundancy penalty that discourages repeated similar queries in multi-turn RL (Lu et al., 2025b). Hong designed M-GRPO RL training methods for vertical multi-agent DR systems (Hong et al., 2025). DeepPlanner trained the DR agent with GRPO and advantage shaping to scale its planning capability (Fan et al., 2025). DR Tulu used RL with evolving rubrics for learning on long-form tasks (Shao et al., 2025b).
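To make the reward-shaping idea behind such methods concrete, a redundancy penalty of the kind DeepDive describes can be sketched as follows. This is an illustrative sketch only: the Jaccard token-overlap similarity, the threshold, and the penalty weight are assumptions chosen here for simplicity, not the paper's actual formulation.

```python
def jaccard(query_a, query_b):
    """Token-overlap similarity between two search queries, in [0, 1]."""
    ta, tb = set(query_a.lower().split()), set(query_b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def shaped_reward(base_reward, queries, threshold=0.6, penalty=0.1):
    """Subtract a penalty for each query too similar to an earlier one.

    base_reward: task-level reward for the rollout (e.g., answer correctness).
    queries: the search queries issued across the multi-turn rollout.
    """
    redundant = sum(
        1
        for i, query in enumerate(queries)
        for prev in queries[:i]
        if jaccard(query, prev) >= threshold
    )
    return base_reward - penalty * redundant
```

The shaped reward then replaces the raw task reward in the RL objective, so policies that re-issue near-duplicate queries receive strictly lower returns.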
PokeeResearch-7B is trained with an annotation-free RLAIF framework that optimizes policies using LLM-based reward signals capturing factual accuracy, citation faithfulness, and instruction adherence (Wan et al., 2025b). IterResearch introduced efficiency-aware rewards and adaptive downsampling into the RL training framework (Chen et al., 2026). DeepSearch overcame the bottleneck of RL with verifiable rewards via Monte Carlo Tree Search (Wu et al., 2026). Search-R1++ is a strong DR agent adopting fast-thinking templates and trained via REINFORCE with an F1+ reward (Xu et al., 2026b).

Context & Memory The enterprise Dingtalk-DeepResearch was able to evolve via an entropy-guided, memory-aware online learning mechanism, retrieving high-value prior cases from an episodic memory bank and exploring diverse historical contexts (Chen et al., 2025d). WebResearcher (Qiao et al., 2025) and IterResearch (Chen et al., 2026) introduced a Markovian structure to build effective context and memory. EvoFSM proposed a self-evolving mechanism for DR with finite state machines (Zhang et al., 2026a). K-Dense BYOK is a free, open-source AI co-scientist that runs on the desktop, powered by Claude Scientific Skills (K-Dense Inc., 2026). PantheonOS implemented an extensible skill system encoding domain expertise as markdown templates with structured workflows for automatic genomics discovery (Xu et al., 2026a).

6 AI4S Perspective

"Training an AI system with a knowledge cutoff of 1911 and seeing if it could come up with general relativity like Einstein did in 1915." —Demis Hassabis

6.1 Related Summaries and Platforms

Tom provided an overview of self-driving laboratories for chemistry and materials science (Tom et al., 2024). Gao and Wang found that the use of AI in research is widespread throughout the sciences, growing especially rapidly since 2015 (Gao and Wang, 2024).
Messeri and Crockett were concerned that the proliferation of AI tools in science risks introducing a phase of scientific enquiry in which we produce more but understand less (Messeri and Crockett, 2024). OpenAI and researchers presented a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science (Bubeck et al., 2025). Si found that LLM-generated ideas are judged as more novel than human expert ideas while being judged slightly weaker on feasibility (Si et al., 2025a,b). Ren provided a review of the architectures, design, benchmarks, applications, and ethical considerations surrounding LLM-based scientific agents (Ren et al., 2025). Wei offered a domain-oriented review of autonomous scientific discovery across life sciences, chemistry, materials, and physics (Wei et al., 2025b). Hu reviewed recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of training datasets (Hu et al., 2025a). Zheng provided a conceptual architecture and strategic foresight to navigate and shape the future of AI-driven scientific discovery (Zheng et al., 2025a). Trehan and Chopra reported lessons from four autonomous research attempts using LLMs (Trehan and Chopra, 2026). Hao stated that AI tools expand scientists' impact but contract science's focus (Hao et al., 2026). At present, the accessible AI4S platforms include FutureHouse, Edison, ResearchRabbit, SciSpace, scite_, Sider, Elicit, Autoscience, Deep Principle, hypogenic.ai, and Intern-Discovery. The open-source AI4S platforms include ResearchClaw, autoresearch, Auto-claude-code-research-in-sleep, AutoResearchClaw, ScienceClaw, NanoResearch, dr-claw, ASI-Evolve, and AI-Scientist (Lu et al., 2026a).
Figure 4: Five interaction paradigms in AI4S.

6.2 Paradigm

As shown in Figure 4, we categorize the interaction paradigms between researchers and AI in AI4S into five types. The fourth type, U4, can be further divided into three subtypes, and the fifth type, U5, into two subtypes.

U1: Machine Learning as a Tool Before ChatGPT appeared, researchers commonly referred to AI precursors as machine learning algorithms. These algorithms were primarily used to process and model experimental data. Within this paradigm, machine learning functioned as a tool.

U2: Human-LLM Conversation After LLMs like ChatGPT gained acceptance, researchers began treating LLMs as more advanced search engines or smarter information-processing systems. In this paradigm, researchers typically interact with these models through conversation.

U3: Prompt Engineering Compared with U2, U3 involves more complex conversation. Researchers use prompt engineering to encourage LLMs to produce more effective responses.

U4: LLM Optimization In U4, researchers build their own LLMs, using methods such as pretraining (a), fine-tuning (b), and preference alignment (c). These models may be based on open-weight LLMs or developed entirely from scratch.
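Subtype (c), preference alignment, trains a policy model against a frozen reference model on pairs of preferred and dispreferred responses. As one common instantiation, used here purely as an illustration (Figure 4 depicts the reward-model-based RL route; Direct Preference Optimization is an alternative that folds the reward model into the loss), the per-pair loss can be written out directly:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_*: the policy's log-probabilities of the chosen/rejected responses.
    ref_logp_*: the frozen reference model's log-probabilities of the same responses.
    The loss shrinks as the policy favors the chosen response more strongly
    than the reference model does, with beta controlling the scale.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the policy and reference agree exactly, the margin is zero and the loss is ln 2; gradient descent on this loss pushes the policy toward the preferred responses without training a separate reward model.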
U5: Agent Refinement The use of tools drives the transition from U4 to U5. Researchers can train models using agentic RL (i). Alternatively, they can build agents with LLMs that already possess tool capabilities (ii).

6.3 Fields

For each field of scientific research, we first present the relevant datasets/benchmarks ([DBs]) and tools ([Tools]), then describe approaches following the paradigm in the order U1 → U2 → U3 → U4 → U5. Note that if a study uses a Transformer model with a relatively small number of parameters, we classify it as U1 rather than U3.

6.3.1 Task-Agnostic & Multi-Task

[DBs]: The QASA benchmark consists of 1,798 novel question-answering pairs that require full-stack reasoning over scientific articles in the AI and ML fields (Lee et al., 2023). SPIQA is a large-scale QA dataset specifically designed to interpret complex figures and tables within the context of scientific research articles across various domains of computer science (Pramanick et al., 2024). Multimodal ArXiv is a dataset for improving the scientific comprehension of large vision-language models (Li et al., 2024). SciEval is a multi-level LLM evaluation benchmark for scientific research (Sun et al., 2024). SciBench contains a carefully curated dataset featuring a range of collegiate-level scientific problems from the mathematics, chemistry, and physics domains (Wang et al., 2024d). OlympiadBench is an Olympiad-level bilingual multimodal scientific benchmark featuring 8,476 problems from Olympiad-level mathematics and physics competitions, including the Chinese college entrance exam (He et al., 2024). Song introduced a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics (Song et al., 2025). Scientist-Bench is a comprehensive benchmark comprising state-of-the-art papers across diverse AI research domains (Tang et al., 2025).
Liu introduced the ATLAS benchmark, a cross-disciplinary evaluation suite composed of approximately 800 original problems (Liu et al., 2025b). LIMITGEN is a benchmark for evaluating LLMs' capability to support early-stage feedback and complement human peer review (Xu et al., 2025b). ScienceAgentBench evaluated language agents for data-driven scientific discovery, extracting 102 tasks from 44 peer-reviewed publications in four disciplines (Chen et al., 2025g). SciArena is an open and collaborative platform for evaluating foundation models on scientific literature-grounded tasks (Zhao et al., 2025b). SCIVER is a benchmark designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context (Wang et al., 2025a). Humanity's Last Exam is a multimodal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage (Phan et al., 2026).

[Tools]: SciToolAgent leveraged a scientific tool knowledge graph across biology, chemistry, and materials science that enables intelligent tool selection and execution through graph-based RAG (Ding et al., 2025).

[U3]: PersLEARN is a tool designed to facilitate the cultivation of scientific perspectives, starting from a basic seed idea and progressing to a well-articulated framework (Shi et al., 2023).

[U4]: OpenResearcher was built on RAG to integrate LLMs with up-to-date, domain-specific knowledge (Zheng et al., 2024c). GraphEval is a lightweight graph-based LLM framework for idea evaluation (Feng et al., 2025). Goel leveraged the vast corpus of existing research papers to train LLMs that generate better research plans (Goel et al., 2025). Intern-S1 is a scientific multimodal foundation model with 28 billion activated parameters and 241 billion total parameters (Bai et al., 2025).
SciReasoner introduced a scientific language foundation model that bridges general-purpose large language modeling with the heterogeneous data and reasoning workflows of the natural sciences (Wang et al., 2025d). TTT-Discover performed RL at test time, so the LLM can continue to train with experience specific to the test scientific problem (Yuksekgonul et al., 2026). Innovator-VL is a scientific MLLM designed to advance multimodal understanding and reasoning across diverse scientific domains while maintaining excellent performance on general vision tasks (Wen et al., 2026).

[U5]: Le proposed a multi-agent deep research MLLM system for multimedia verification (Le et al., 2025). SciSciGPT is a multi-agent system designed to serve as a research collaborator for science-of-science researchers and practitioners (Shao et al., 2025a). aiXiv is a multi-agent ecosystem that allows research proposals and papers to be submitted, reviewed, and iteratively refined by both human and AI scientists (Zhang et al., 2025c). VIRSCI organized a team of agents to collaboratively generate, evaluate, and refine research ideas (Su et al., 2025). PiFlow is an information-theoretic multi-agent framework that treats automated scientific discovery as a structured uncertainty-reduction problem guided by principles (Pu et al., 2025). Denario is a multi-agent system designed to serve as a research assistant for scientific discovery (Villaescusa-Navarro et al., 2025). Kosmos is a multi-agent AI scientist that automates data-driven discovery (Mitchener et al., 2025). AI co-scientist is a multi-agent system that helps uncover new, original knowledge and formulate demonstrably novel research hypotheses and proposals (Gottweis et al., 2025). Li proposed a multi-agent framework to decompress scientific reasoning and construct a verifiable long-CoT knowledge base (Li et al., 2025j).
SAGA is a bi-level agent that accelerates scientific discovery for antibiotic design, inorganic materials design, functional DNA sequence design, and chemical process design (Du et al., 2025). InternAgent is a unified closed-loop multi-agent framework for conducting autonomous scientific research across various fields (InternAgent et al., 2025). AgentExpt is a framework for baseline and dataset recommendation (Li et al., 2025k). Deep Ideation integrated LLMs with scientific networks to generate novel and scientifically grounded research ideas (Zhao et al., 2025a). AI-Researcher is a multi-agent system orchestrating literature review, idea generation, algorithm implementation, experimental validation, and paper writing (Tang et al., 2025). The Chain of Ideas agent offered a promising and concise solution by organizing ideas into a chain structure, effectively mirroring the progressive development within a given research domain (Li et al., 2025c). URSA is a scientific agent ecosystem for accelerating research tasks, consisting of a set of modular agents and tools (Grosskopf et al., 2025). RDR is a generalizable pipeline capable of systematically analyzing AI, robotics, and beyond: identifying emerging trends, uncovering cross-domain opportunities, and offering concrete starting points for new inquiry (Zou et al., 2025). DEPLOY-MASTER constructed reproducible runtime environments for 50,112 scientific tools; each successful tool is validated by a minimal executable command and registered in SCIENCEPEDIA for search and reuse (Wang et al., 2026d). MARVEL is a locally deployable, open-source framework for domain-aware question answering and assisted scientific research (Mukund et al., 2026). EvoScientist is an evolving multi-agent AI scientist framework that continuously improves its research strategies through persistent memory and self-evolution (Lyu et al., 2026).
AI Scientist used existing foundation models to perform ideation, literature search, experiment planning and implementation, result analysis, manuscript writing, and peer review to produce complete, new papers of machine learning science (Lu et al., 2026a).

6.3.2 Biology

[U1]: AlphaFold used a Transformer-like neural network to predict the three-dimensional structure that a protein will adopt based solely on its amino acid sequence (Jumper et al., 2021). Wang used deep learning approaches for scaffolding protein functional sites without needing to prespecify the fold or secondary structure of the scaffold (Wang et al., 2022). AlphaMissense is an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity (Cheng et al., 2023). CLEAN is a contrastive learning algorithm for enzyme annotation (Yu et al., 2023). Lutz used Monte Carlo tree search with RL to design protein architectures (Lutz et al., 2023). The LinearDesign algorithm found an optimal mRNA design for the spike protein in just 11 minutes, concurrently optimizing stability and codon usage (Zhang et al., 2023a). RoseTTAFold was proposed to design protein structure and function, using a diffusion architecture to model protein backbone geometry and sequence-structure relationships (Watson et al., 2023). Chroma is a diffusion model for proteins and protein complexes that can directly sample novel protein structures and sequences (Ingraham et al., 2023). NAErnie is an RNA-focused pretrained model built upon the Transformer architecture (Wang et al., 2024a). AlphaFold 3 is a diffusion-based architecture capable of predicting the joint structure of complexes including proteins, nucleic acids, small molecules, ions, and modified residues (Abramson et al., 2024).
MxDNA is a framework developed to autonomously learn effective DNA tokenization strategies solely through gradient descent (Qiao et al., 2024). scGPT is a foundation model for single-cell biology based on a generative pretrained transformer trained across a repository of over 33 million cells (Cui et al., 2024). GEMORNA is a transformer encoder-decoder capable of designing mRNA sequences with unprecedented translational capacity and durability (Zhang et al., 2025b).

[U4]: Lin demonstrated direct inference of full atomic-level protein structure from primary sequence using an LLM (Lin et al., 2023). scFoundation is a large pretrained model with 100 million parameters covering about 20,000 genes, pretrained on over 50 million human single-cell transcriptomic profiles (Hao et al., 2024). BiomedGPT is an open-source and lightweight vision-language foundation model for diverse biomedical tasks (Zhang et al., 2024). UniFMIR is a pretrained foundation model for universal fluorescence microscopy image restoration (Ma et al., 2024a). scTranslator utilized an encoder-decoder Transformer-based architecture for translating single-cell transcriptomes to proteomes (Liu et al., 2025c). ESM3 is a multimodal generative language model that reasons over the sequence, structure, and function of proteins (Hayes et al., 2025). Omni-DNA is a family of models spanning 20M to 1.1B parameters that supports sequence understanding, long-context genomic reasoning, and natural language annotation (Li et al., 2025l). LucaOne is a pretrained foundation model trained on nucleic acid and protein sequences from 169,861 species (He et al., 2025). AlphaGenome used a U-Net-inspired backbone with transformer blocks to analyze the regulatory genome, predicting molecular functions and variant effects from DNA (Avsec et al., 2026).
EnzymeCAGE is a catalysis-specific geometric foundation model trained on approximately 1.5 million structure-informed enzyme-reaction pairs spanning over 3,000 species (Liu et al., 2026b). Evo is a biological foundation model trained on 9 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life (Nguyen et al., 2024; Merchant et al., 2025; Brixi et al., 2026).

[U5]: Biomni integrated LLM reasoning with RAG and code-based execution to help scientists dramatically enhance research productivity and generate testable hypotheses (Huang et al., 2025a). K-Dense Analyst is a hierarchical multi-agent system that achieves autonomous bioinformatics analysis through a dual-loop architecture (Li et al., 2025f). LabOS AI is a co-scientist for the biomedical domain that unites computational reasoning with physical experimentation through multimodal perception, self-evolving agents, and XR-enabled human-AI collaboration (Cong et al., 2025). Virtual Lab, an AI-human research collaboration multi-agent system, was used to design nanobody binders to recent variants of SARS-CoV-2 (Swanson et al., 2025). ChatNT is a multimodal conversational agent that bridges the gap between biology foundation models and conversational agents (de Almeida et al., 2025). CellWhisperer established a user-friendly approach for exploring scRNA-seq data, driven by chat-based analysis with natural language (Schaefer et al., 2025).

6.3.3 Materials

[U1]: Raccuglia used ML algorithms trained on reaction data to predict reaction outcomes for the crystallization of templated vanadium selenites (Raccuglia et al., 2016). Generative models were used for inverse molecular design in matter engineering (Sanchez-Lengeling and Aspuru-Guzik, 2018). Tshitoyan captured latent knowledge from materials science literature through unsupervised word embeddings (Tshitoyan et al., 2019).
Burger used a mobile robot with a batched Bayesian search algorithm to search for improved photocatalysts for hydrogen production from water (Burger et al., 2020). GNoME uses graph neural networks for the efficient discovery of inorganic materials (Merchant et al., 2023). A-Lab is an autonomous laboratory for the solid-state synthesis of inorganic powders using machine learning and active learning (Szymanski et al., 2023). InvDesFlow-AL is an active-learning-based diffusion model for designing target functional inorganic crystal materials across the periodic table (Han et al., 2025b). MatterGen is a diffusion-based generative model that generates stable, diverse inorganic materials across the periodic table and can be fine-tuned toward a wide range of downstream tasks for inverse materials design (Zeni et al., 2025). MatRIS leveraged attention to model three-body interactions for quantum-mechanical calculations in materials (Zhou et al., 2026).

[U4]: CrystaLLM is an autoregressive LLM for the versatile generation of crystal structures (Antunes et al., 2024).

[U5]: SciAgents is a multi-agent system designed to autonomously generate and refine research hypotheses by leveraging LLMs and a comprehensive ontological knowledge graph (Ghafarollahi and Buehler, 2024; Buehler, 2024). ChatMOF is an AI system for predicting and generating metal-organic frameworks using LLMs, tools, and evaluators (Kang and Kim, 2024).

6.3.4 Healthcare

[DBs]: MultiMedQA is a benchmark combining six existing medical question-answering datasets spanning professional medicine, research, and consumer queries, plus HealthSearchQA, a new dataset of medical questions searched online (Singhal et al., 2023). OmniMedVQA is a comprehensive medical visual question answering (VQA) benchmark, including 12 different modalities and covering more than 20 distinct anatomical regions (Hu et al., 2024).
GMAI-MMBench is a general medical benchmark with 284 datasets across 38 medical image modalities, 18 clinical tasks, 18 departments, and 4 perceptual granularities in a VQA format (Chen et al., 2024b).

[U1]: Esteva demonstrated classification of skin lesions using a single CNN, trained end-to-end directly from images, using only pixels and disease labels as inputs (Esteva et al., 2017). Barata utilized an RL model for AI-based decision support in skin cancer (Barata et al., 2023). Steyaert fused multimodal data for cancer biomarker discovery with deep learning (Steyaert et al., 2023). RadDiag is a transformer-based foundational model for large-scale long-tailed disease diagnosis on radiology images (Zheng et al., 2024b). OISA is a post-training framework based on a pretrained CLIP model for radiology report generation with self-generation, self-evaluation, self-alignment, and self-iteration (Xiao et al., 2025). MAOSS is a multimodal, transformer-based AI for opportunistic screening, staging, and progression risk stratification of steatotic liver disease (Gao et al., 2026b).

[U2]: Bean conducted a randomized study testing the effects of using LLMs to support medical self-assessment, highlighting the challenges of public deployments of LLMs for direct patient care (Bean et al., 2026).

[U4]: Flan-PaLM and Med-PaLM are instruction-tuned variants of PaLM on clinical data (Singhal et al., 2023, 2025). GMAI is a foundation model proposed for generalist medical artificial intelligence (Moor et al., 2023). Zhongjing is a LLaMA-based LLM for Chinese medicine (Yang et al., 2024). Delphi-2M is a GPT-based architecture that predicts the rates of more than 1,000 diseases, conditional on each individual's past disease history (Shmatko et al., 2025). SlideChat is a large vision-language assistant for whole-slide pathology image understanding (Chen et al., 2025f).
CSFM is a multimodal foundation model pretrained on data from 1.7 million individuals for cardiac health assessment across scenarios and devices (Gu et al., 2026).

[U5]: PathChat is a vision-language generalist AI assistant for human pathology (Lu et al., 2024a). LLM-RDF is a chemical synthesis development platform powered by GPT-4, comprising six specialized LLM-based agents: Literature Scouter, Experiment Designer, Hardware Executor, Spectrum Analyzer, Separation Instructor, and Result Interpreter (Ruan et al., 2024). AMIE is an LLM-based AI system optimized for diagnostic dialogue (Tu et al., 2025; McDuff et al., 2025). DeepRare is a multi-agent system with 40 specialized tools and up-to-date knowledge sources for rare disease differential diagnosis decision support (Zhao et al., 2026).

6.3.5 Medicine

[U1]: Similarity-based machine learning approaches were used to predict new molecular targets for known drugs (Keiser et al., 2009). Deep neural networks were used to predict molecules with antibacterial activity (Stokes et al., 2020). RosettaVS is a structure-based virtual screening method based on active learning that predicts docking poses and binding affinities for drug discovery (Zhou et al., 2024). DrugCLIP combined contrastive learning and dense retrieval based on a transformer architecture to achieve rapid and accurate genome-wide virtual screening (Jia et al., 2026).

[U4]: MMed-Llama 3 is an 8B multilingual language model for medicine (Qiu et al., 2024). InstructMol is a multimodal LLM that effectively aligns molecular structures with natural language via an instruction-tuning approach (Cao et al., 2025b).

[U5]: MolRL-MGPT used an RL algorithm with multiple GPT agents for drug molecular generation (Hu et al., 2023).
6.3.6 Chemistry

[DBs]: ChemBench is an automated framework for evaluating the chemical knowledge and reasoning abilities of LLMs against the expertise of chemists (Mirza et al., 2025).

[U1]: MCTS combined deep neural networks and symbolic rules to perform chemical synthesis planning (Segler et al., 2018). Reac-Discovery used ML models for process optimization and reactor geometry refinement (Tinajero et al., 2025).

[U4]: Chemma is a fully fine-tuned LLM trained on 1.28 million Q&A pairs about reactions, serving as an assistant to accelerate organic chemistry synthesis (Zhang et al., 2025f). ChemVLM is an open-source chemical MLLM trained on a bilingual multimodal dataset including molecular structures, reactions, and chemistry examination questions (Li et al., 2025a). QFANG is a scientific reasoning model for organic synthesis procedure generation (Liu et al., 2025a). MOSAIC is a computational framework that fine-tunes the open-weight Llama 3.1-8B-Instruct model into 2,498 specialized chemistry experts (Li et al., 2026a).

[U5]: Coscientist is a system driven by GPT-4 that autonomously designs, plans, and performs complex chemical experiments with tools such as internet and documentation search, code execution, and experimental automation (Boiko et al., 2023). ChemCrow is a chemistry agent with 18 expert-designed tools for tasks across organic synthesis, drug discovery, and materials design (M. Bran et al., 2024).

6.3.7 Mathematics

[DBs]: MATHVISTA is a benchmark designed to combine challenges from diverse mathematical and visual tasks (Lu et al., 2024b).

[U1]: Davies demonstrated a method by which machine learning can aid mathematicians in discovering new conjectures and theorems (Davies et al., 2021). Alfarano trained sequence-to-sequence transformers to discover Lyapunov functions that ensure the global stability of dynamical systems (Alfarano et al., 2024).
[U3]: DSP+ is an improved version of the Draft, Sketch, and Prove framework for advanced theorem proving using LLMs, featuring a fine-grained and integrated neurosymbolic enhancement (Cao et al., 2025a).

[U4]: LLEMMA is an open language model for mathematics, built by pretraining Code Llama on Proof-Pile-2 (Azerbayev et al., 2024). POSEIDON is a foundation model based on a multiscale operator transformer for learning the solution operators of PDEs (Herde et al., 2024). Math-Shepherd is a process-oriented math verifier that assigns a reward score to each step of the LLM's outputs on math problems (Wang et al., 2024b). FunSearch is an evolutionary procedure that pairs a pretrained LLM with a systematic evaluator for mathematical discoveries (Romera-Paredes et al., 2024). STP is a self-play LLM theorem prover with iterative conjecturing and proving (Dong and Ma, 2025).

[U5]: AlphaGeometry is a neuro-symbolic system that uses a neural language model to prove theorems in Euclidean plane geometry, trained by synthesizing millions of theorems and proofs across varying complexity levels (Trinh et al., 2024). TORA is a series of novel tool-integrated reasoning agents that synergistically combine natural-language rationales with program-based tool use for mathematical problem solving (Gou et al., 2024). AlphaProof is an AI agent that learns to find formal proofs through RL by training on millions of auto-formalized problems (Hubert et al., 2025). AlphaEvolve is an LLM-based code-mutation agent that helps researchers make advances in complexity theory (Nagda et al., 2026). Aletheia is a math research agent that iteratively generates, verifies, and revises solutions end-to-end in natural language, leveraging a novel inference-time scaling law built upon Gemini Deep Think (Feng et al., 2026a,b).
6.3.8 Physics

[DBs]: NewtonBench is a benchmark comprising 324 scientific law discovery tasks across 12 physics domains (Zheng et al., 2026). [U1]: Canabarro employed both unsupervised and supervised machine learning to identify quantum phase transitions (Canabarro et al., 2019). Wu investigated an "AI Physicist" learning agent for unsupervised learning (Wu and Tegmark, 2019). Seif used machine learning models to infer the direction of time's arrow and identified entropy production as the relevant physical quantity in their decision-making process (Seif et al., 2021). Degrave presented a paradigm for plasma magnetic confinement on tokamaks through deep RL in nuclear fusion (Degrave et al., 2022). TQS is a transformer-based model for quantum many-body problems (Zhang and Di Ventra, 2023). Reinschmidt introduced RL to cold-atom experiments and demonstrated a flexible and adaptive approach to controlling a magneto-optical trap (Reinschmidt et al., 2024). Belis used an unsupervised kernel machine and two clustering algorithms to perform quantum anomaly detection (Belis et al., 2024). Zhang also provided a technical and unified review of AI for quantum, atomistic, and continuum systems (Zhang et al., 2025e). [U3]: Pan carried out quantum many-body physics calculations using LLMs via multistep prompt templates (Pan et al., 2025). [U5]: SGA is a scientific generative agent in which LLMs act as knowledgeable and adaptable reasoners that propose scientific solutions such as physics equations or molecular structures, while simulations serve as experimental platforms that provide observational feedback and optimize continuous components like physical parameters (Ma et al., 2024c). AI-Newton is a concept-driven discovery workflow capable of autonomously deriving physical laws from raw data (Fang et al., 2025).
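Systems like AI-Newton and the NewtonBench tasks center on recovering symbolic laws from raw measurements. In the simplest case, a power law T = k·r^p becomes linear in log-log space and the exponent can be recovered by ordinary least squares. The sketch below uses synthetic Kepler-like data (the dataset and constants are invented for illustration, not taken from any of the cited systems):

```python
import math

# Synthetic "raw data": radii r and periods T generated from T = k * r**p
# with ground truth k = 2.0, p = 1.5 (a Kepler-like power law).
data = [(r, 2.0 * r ** 1.5) for r in [1.0, 2.0, 4.0, 8.0, 16.0]]

# Fit log T = log k + p * log r by ordinary least squares to "discover" the law.
xs = [math.log(r) for r, _ in data]
ys = [math.log(T) for _, T in data]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
p = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
k = math.exp(my - p * mx)
print(round(p, 3), round(k, 3))  # recovers exponent ~1.5 and prefactor ~2.0
```

Concept-driven systems go far beyond this single regression, searching over candidate functional forms and intermediate concepts, but the fit-then-interpret step above is the basic building block.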
6.3.9 Meteorology

[U1]: Ham used a CNN model to produce skilful ENSO forecasts at lead times of up to one and a half years (Ham et al., 2019). GraphCast is a machine-learning method trained on reanalysis data to learn skillful medium-range global weather forecasting (Lam et al., 2023). Pangu-Weather used a 3D transformer-based encoder-decoder model for fast and accurate global weather forecasting (Bi et al., 2023). NeuralGCM is a differentiable hybrid atmospheric model that combines the strengths of traditional general circulation models with machine learning for weather forecasting and climate simulation (Kochkov et al., 2024). GenCast is a conditional diffusion model for probabilistic weather forecasting (Price et al., 2025). FuXi-CFD is a machine-learning framework designed to generate detailed 3D near-surface wind fields at 30-meter horizontal resolution, using only coarse-resolution atmospheric inputs and high-resolution terrain information (Lin et al., 2026).

6.4 Present and Perspective of AI4S

"AI for Science is the new lens of discovery, much like cryo-electron microscopes for proteins, particle accelerators for physics, and telescopes for astronomy."
— This paper

Current research demonstrates substantial progress for AI4S in fields such as biology, materials, healthcare, medicine, chemistry, mathematics, physics, and meteorology. AI is also playing an increasing role in other fields such as robotics (Kaufmann et al., 2023; Radosavovic et al., 2024; Haarnoja et al., 2024; Lu et al., 2025a), neuroscience (Bashivan et al., 2019; Luo et al., 2025), aerospace (Reichstein et al., 2019), agriculture (Ying et al., 2025), operations research (Wang et al., 2025e), nuclear reactions (Spears et al., 2025), geography (Brown et al., 2025), and finance (Jin et al., 2025b). No single interaction paradigm is inherently superior to another.
The choice of paradigm depends on the specific research problem. Even simple multi-turn dialogues (U2) can provide valuable assistance to scientists. However, several limitations persist. Automation levels remain insufficient for fully autonomous research. While agent-based paradigms enhance efficiency, they often shift the manual burden rather than eliminating it, leaving scientists to perform the essential "scaffold work" for the agents. Furthermore, AI4S researchers often lack direct collaboration with researchers of AI foundation models. Although scientists frequently release datasets and benchmarks, it remains unclear whether these resources are directly applicable for training AI foundation models.

Based on the degree of automation, as shown in Figure 5, we introduce a five-level taxonomy (L1-L5) to organize human-AI collaboration in AI4S.

Figure 5: Human-AI collaboration in science. (The five levels range from L1, Function-Level auxiliary tool use, to L5, Autonomous-Level full autonomy, with increasing AI autonomy and decreasing human intervention.)

L1 is "Function-Level", in which AI is invoked as a tool, and the human executes and closes the loop. L2 is "Task-Level", in which the human decomposes and assigns research tasks, and AI executes the decomposed sub-tasks. L3 is "Collaborative-Level", in which AI executes the primary research task, and the human collaborates and supervises.
L4 is "Guidance-Level", in which AI provides expert-level services, and the human participates in key decisions. L5 is "Autonomous-Level", in which AI operates with full autonomy under human authorization, potentially exceeding human capabilities. It can be observed that most current research reaches at most L3 automation, with only a small fraction attaining L4. The current stage of AI4S can also be described as vibe research (Zhang, 2026).

7 Discussion and Future Directions

7.1 Key Challenges

LLM and Harness The knowledge and reasoning capacities of LLMs constitute the core basis of DR agents. However, LLMs can do some very complex things extremely well, yet fail at other tasks that seem simpler or closely related. "Jagged Intelligence" is a concept used to describe their uneven and unpredictable capabilities. Most related research focuses on improving a DR agent's executive capability, while enhancing its LLM's scientific taste remains largely underexplored (Tong et al., 2026a). In addition, the architecture of DR agents warrants further investigation. A harness is an agentic architecture that allows multiple agents to work with shared context across different sessions and context windows. Building reliable harnesses for DR sometimes matters more than the LLMs themselves.

Self-evolution The cost-effectiveness of pretraining is now facing diminishing returns. Approaches like prompt engineering, context engineering, test-time scaling, SFT, and RL offer a limited performance ceiling for specific tasks. Consequently, current LLMs and agents lack the capacity for robust lifelong learning (Dupoux et al., 2026). Unlike these models and agents, humans can continually learn from the environment through observation, interaction, and feedback. Therefore, DR agents must be capable of self-evolution after deployment and able to learn autonomously or online within research environments.
From IDE to REE A significant disparity exists between the research environments of human scientists and DR agents. Humans primarily conduct research in the REE while utilizing the IDE and SEE (see Figure 1). In contrast, DR agents operate mostly in the IDE and occasionally in the SEE. These environmental differences present three primary challenges. First, foundation models for DR must develop better perception and reasoning for physical properties such as olfactory sensing and spatial cognition. Second, these agents require a broader set of tools to interact with the REE. Third, DR agents need physical embodiment to move and perceive in the real world. This embodiment could be manifested as robotic systems or human proxies (Gemini et al., 2025).

7.2 Promising Avenues

"AI's next scaling law is AI itself."
— This paper

AI and AI4S are mutually reinforcing. Advances in AI can improve the quality and efficiency of scientific research (AI for Science). Progress in science can also advance AI (Science for AI). For example, DNNs were inspired by the human brain. Advances in physics and materials science can improve semiconductor manufacturing and chip design. In the context of Science for AI, researchers shift from passive data collection to active generation. This approach provides specific data designed for model training and structural reasoning. Data production thus prioritizes the needs of the model over the display of single experimental results.

The emergence of ChatGPT and subsequent advancements in LLMs suggest that AI has effectively passed the Turing Test (Turing, 2007). Current efforts in the field focus on reaching artificial general intelligence (AGI) (Hendrycks et al., 2025) or artificial superintelligence (ASI or SI). Beyond DR agents, here we briefly introduce three promising directions for AGI.
Agentic AI DR agents and coding agents (such as Cursor, Codex, OpenCode) are built for specific research and programming tasks. General agents serve a broader purpose: they aim to execute various tasks in unfamiliar settings without extensive domain-specific engineering. At present, there are two primary pathways to achieve agentic AI. The first is to incorporate agentic capabilities into LLMs, such as Claude-4, Gemini-3, GPT-5, and Kimi-2.5. The second is to build agent swarms on top of proprietary or open-weight LLMs, such as OpenClaw and Manus.

Embodied AI Embodied AI can be viewed as an advanced form of agentic AI. Its core components are the "world model" and the "embodied agent". World models, such as Marble and Genie 3, can be regarded as a multimodal and multidimensional extension of LLMs (Hafner et al., 2025; Wang et al., 2025b; Tong et al., 2026b; Maes et al., 2026). Current world models typically use NTP for language and diffusion for vision, as shown in Figure 2(c). Their training data go beyond text and include videos, image-text pairs, and even action-conditioned videos. The model evolutionary path is roughly "NTP (1D) → Diffusion (2D) → NeRF / Video Model (3D) → World Model (4D)". The embodied agent extends agents from the IDE to the REE. It is often built on top of the world models. The agent can explore and interact in a physics-based simulation (SEE) (Bolton et al., 2025). It can also interact with the real world through live cameras and voice interfaces, smart glasses, autonomous cars, IP camera surveillance, or robots like Boston Dynamics' Atlas and Unitree robots (Gemini et al., 2025; Li et al., 2026c). This allows embodied agents to learn directly through continuous interaction.

Neuromorphic Intelligence The term "Intelligence" is used rather than "AI" because biological brains play the primary role in this direction.
There are two main branches of neuromorphic intelligence. One branch focuses on brain-mimetic models (Maass, 1997) and hardware (Pei et al., 2019). These technologies derive intelligence from simulating biological neural architectures. The other branch is "Cyborg Intelligence" (Yu et al., 2016; Yu, 2016). This approach uses brain-computer interfaces (BCIs) to establish direct communication between biological brains and machines. This integration facilitates the fusion of biological and artificial intelligence. Within this framework, machines may handle rapid System 1 tasks while biological brains manage deliberate System 2 decision-making. Their roles are also interchangeable depending on the context.

8 Conclusions

This paper first provides a definition of deep research. We differentiate this concept from related terms for clarity. To help non-experts understand the core of AI, we track the technical evolution of deep research from the Transformer to agents. We then analyze AI4S across multiple disciplines, including biology, materials, chemistry, medicine, mathematics, physics, meteorology, and others. The specific roles and impacts of AI in each field are clarified. AI can advance science, and science can in turn inform AI. Finally, this paper summarizes the core challenges facing deep research. We also propose three promising research directions for achieving AGI.

Limitations

Open-weight models and commercial closed-source LLMs are continuously and rapidly evolving. Therefore, this paper reflects only the state of these models at the time of publication. The authors carefully reviewed all included papers. This applies to preprints from arXiv and bioRxiv as well. However, the authors primarily specialize in AI research. Consequently, the scope of the investigation into various AI for Science fields might be limited.
References

Amirhossein Abaskohi, Tianyi Chen, Miguel Muñoz-Mármol, Curtis Fox, Amrutha Varshini Ramesh, Étienne Marcotte, Xing Han Lù, Nicolas Chapados, Spandana Gella, Christopher Pal, Alexandre Drouin, and Issam H. Laradji. 2026. DRBench: A realistic benchmark for enterprise deep research. In The Fourteenth International Conference on Learning Representations.

Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J. Ballard, Joshua Bambrick, Sebastian W. Bodenstein, David A. Evans, Chia-Chun Hung, Michael O'Neill, David Reiman, Kathryn Tunyasuvunakool, Zachary Wu, Akvilė Žemgulytė, Eirini Arvaniti, and 29 others. 2024. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016):493–500.

Alberto Alfarano, Francois Charton, and Amaury Hayat. 2024. Global Lyapunov functions: a long-standing open problem in mathematics, with symbolic transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.

Luis M Antunes, Keith T Butler, and Ricardo Grau-Crespo. 2024. Crystal structure generation with autoregressive large language modeling. Nature Communications, 15(1):10570.

Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'Arcy, David Wadden, Matt Latzke, Jenna Sparks, Jena D. Hwang, Varsha Kishore, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, and 9 others. 2026. Synthesizing scientific literature with retrieval-augmented language models. Nature, 650(8103):857–863.

Žiga Avsec, Natasha Latysheva, Jun Cheng, Guido Novati, Kyle R. Taylor, Tom Ward, Clare Bycroft, Lauren Nicolaisen, Eirini Arvaniti, Joshua Pan, Raina Thomas, Vincent Dutordoir, Matteo Perino, Soham De, Alexander Karollus, Adam Gayoso, Toby Sargeant, Anne Mottram, Lai Hong Wong, and 8 others. 2026.
Advancing regulatory variant effect prediction with AlphaGenome. Nature, 649(8099):1206–1218.

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. 2024. Llemma: An open language model for mathematics. In The Twelfth International Conference on Learning Representations.

Lei Bai, Zhongrui Cai, Yuhang Cao, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, and 158 others. 2025. Intern-S1: A scientific multimodal foundation model. Preprint.

Catarina Barata, Veronica Rotemberg, Noel C. F. Codella, Philipp Tschandl, Christoph Rinner, Bengu Nisa Akay, Zoe Apalla, Giuseppe Argenziano, Allan Halpern, Aimilios Lallas, Caterina Longo, Josep Malvehy, Susana Puig, Cliff Rosendahl, H. Peter Soyer, Iris Zalaudek, and Harald Kittler. 2023. A reinforcement learning model for AI-based decision support in skin cancer. Nature Medicine, 29(8):1941–1946.

Pouya Bashivan, Kohitij Kar, and James J. DiCarlo. 2019. Neural population control via deep image synthesis. Science, 364(6439):eaav9436.

Andrew M. Bean, Rebecca Elizabeth Payne, Guy Parsons, Hannah Rose Kirk, Juan Ciro, Rafael Mosquera-Gómez, Sara Hincapié M, Aruna S. Ekanayaka, Lionel Tarassenko, Luc Rocher, and Adam Mahdi. 2026. Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nature Medicine, 32(2):609–615.

Vasilis Belis, Kinga Anna Woźniak, Ema Puljak, Panagiotis Barkoutsos, Günther Dissertori, Michele Grossi, Maurizio Pierini, Florentin Reiter, Ivano Tavernelli, and Sofia Vallecorsa. 2024. Quantum anomaly detection in the latent space of proton collision events at the LHC. Communications Physics, 7(1):334.
Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. 2023. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619(7970):533–538.

Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. 2023. Autonomous chemical research with large language models. Nature, 624(7992):570–578.

Adrian Bolton, Alexander Lerchner, Alexandra Cordell, Alexandre Moufarek, Andrew Bolt, Andrew Lampinen, Anna Mitenkova, Arne Olav Hallingstad, Bojan Vujatovic, Bonnie Li, Cong Lu, Daan Wierstra, Daniel P. Sawyer, Daniel Slater, David Reichert, Davide Vercelli, Demis Hassabis, Drew A. Hudson, Duncan Williams, and 46 others. 2025. SIMA 2: A generalist embodied agent for virtual worlds. Preprint.

Jonathan Bragg, Mike D'Arcy, Nishant Balepur, Dan Bareket, Bhavana Dalvi Mishra, Sergey Feldman, Dany Haddad, Jena D. Hwang, Peter Jansen, Varsha Kishore, Bodhisattwa Prasad Majumder, Aakanksha Naik, Sigal Rahamimov, Kyle Richardson, Amanpreet Singh, Harshit Surana, Aryeh Tiktinsky, Rosni Vasu, Guy Wiener, and 20 others. 2026. AstaBench: Rigorous benchmarking of AI agents with a scientific research suite. In The Fourteenth International Conference on Learning Representations.

Garyk Brixi, Matthew G. Durrant, Jerome Ku, Mohsen Naghipourfar, Michael Poli, Gwanggyu Sun, Greg Brockman, Daniel Chang, Alison Fanton, Gabriel A. Gonzalez, Samuel H. King, David B. Li, Aditi T. Merchant, Eric Nguyen, Chiara Ricci-Tam, David W. Romero, Jonathan C. Schmok, Ali Taghibakhshi, Anton Vorontsov, and 43 others. 2026. Genome modelling and design across all domains of life with Evo 2. Nature.

Christopher F. Brown, Michal R. Kazmierski, Valerie J. Pasquarella, William J.
Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Estefania Lahera, Olivia Wiles, Simon Ilyushchenko, Noel Gorelick, Lihui Lydia Zhang, Sophia Alj, Emily Schechter, Sean Askay, Oliver Guinan, Rebecca Moore, Alexis Boukouvalas, and Pushmeet Kohli. 2025. AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data. Preprint.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Sébastien Bubeck, Christian Coester, Ronen Eldan, Timothy Gowers, Yin Tat Lee, Alexandru Lupsasca, Mehtaab Sawhney, Robert Scherrer, Mark Sellke, Brian K. Spears, Derya Unutmaz, Kevin Weil, Steven Yin, and Nikita Zhivotovskiy. 2025. Early science acceleration experiments with GPT-5. Preprint.

Markus J. Buehler. 2024. Accelerating scientific discovery with generative knowledge extraction, graph-based representation, and multimodal intelligent graph reasoning. Machine Learning: Science and Technology.

Benjamin Burger, Phillip M. Maffettone, Vladimir V. Gusev, Catherine M. Aitchison, Yang Bai, Xiaoyan Wang, Xiaobo Li, Ben M. Alston, Buyi Li, Rob Clowes, Nicola Rankin, Brandon Harris, Reiner Sebastian Sprick, and Andrew I. Cooper. 2020. A mobile robotic chemist. Nature, 583(7815):237–241.

James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, and 1 others. 2025. MicroVQA: A multimodal reasoning benchmark for microscopy-based scientific research. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19552–19564.
Askery Canabarro, Felipe Fernandes Fanchini, André Luiz Malvezzi, Rodrigo Pereira, and Rafael Chaves. 2019. Unveiling phase transitions with machine learning. Phys. Rev. B, 100:045129.

Chenrui Cao, Liangcheng Song, Zenan Li, Xinyi Le, Xian Zhang, Hui Xue, and Fan Yang. 2025a. Reviving DSP for advanced theorem proving in the era of reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

He Cao, Zijing Liu, Xingyu Lu, Yuan Yao, and Yu Li. 2025b. InstructMol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. In Proceedings of the 31st International Conference on Computational Linguistics, pages 354–379.

Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Yuzhi Zhang, Linfeng Zhang, and Siheng Chen. 2025. SciMaster: Towards general-purpose scientific AI agents, part I. X-Master as foundation: Can we lead on Humanity's Last Exam? Preprint.

Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. 2026. IterResearch: Rethinking long-horizon agents via Markovian state reconstruction. In The Fourteenth International Conference on Learning Representations.

Guoxin Chen, Zile Qiao, Wenqing Wang, Donglei Yu, Xuanzhong Chen, Hao Sun, Minpeng Liao, Kai Fan, Yong Jiang, Pengjun Xie, Wayne Xin Zhao, Ruihua Song, and Fei Huang. 2025a. MARS: Optimizing dual-system deep research via multi-agent reinforcement learning. Preprint.

Hong Chen, Xin Wang, Yuwei Zhou, Bin Huang, Yipeng Zhang, Wei Feng, Houlun Chen, Zeyang Zhang, Siao Tang, and Wenwu Zhu. 2024a. Multi-modal generative AI: Multi-modal LLM, diffusion and beyond. ArXiv, abs/2409.14993.
Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, and Bryan Hooi. 2025b. MLR-Bench: Evaluating AI agents on open-ended machine learning research. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, Chen Sun, Han Hou, Hui Yang, James Pan, Jianan Lou, Jiayi Mao, Jizheng Liu, Jinpeng Li, Kangyi Liu, and 14 others. 2025c. xbench: Tracking agents' productivity scaling with profession-aligned real-world evaluations. Preprint.

Mengyuan Chen, Chengjun Dai, Xinyang Dong, Chengzhe Feng, Kewei Fu, Jianshe Li, Zhihan Peng, Yongqi Tong, Junshao Zhang, and Hong Zhu. 2025d. DingTalk DeepResearch: A unified multi-agent framework for adaptive intelligence in enterprise environments. Preprint.

Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, Benyou Wang, Shaoting Zhang, Bin Fu, Jianfei Cai, Bohan Zhuang, Eric J Seibel, Yu Qiao, and Junjun He. 2024b. GMAI-MMBench: a comprehensive multimodal evaluation benchmark towards general medical AI. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24.

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. 2025e. Janus-Pro: Unified multimodal understanding and generation with data and model scaling. Preprint.

Ying Chen, Guoan Wang, Yuanfeng Ji, Yanjun Li, Jin Ye, Tianbin Li, Ming Hu, Rongshan Yu, Yu Qiao, and Junjun He. 2025f. SlideChat: A large vision-language assistant for whole-slide pathology image understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5134–5143.
Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. 2025g. ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery. In The Thirteenth International Conference on Learning Representations.

Jun Cheng, Guido Novati, Joshua Pan, Clare Bycroft, Akvilė Žemgulytė, Taylor Applebaum, Alexander Pritzel, Lai Hong Wong, Michal Zielinski, Tobias Sargeant, Rosalia G. Schneider, Andrew W. Senior, John Jumper, Demis Hassabis, Pushmeet Kohli, and Žiga Avsec. 2023. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, 381(6664):eadg7492.

Marina Chugunova, Dietmar Harhoff, Katharina Hölzle, Verena Kaschub, Sonal Malagimani, Ulrike Morgalla, and Robert Rose. 2026. Who uses AI in research, and for what? Large-scale survey evidence from Germany. Research Policy, 55(2):105381.

João Coelho, Jingjie Ning, Jingyuan He, Kangrui Mao, Abhijay Paladugu, Pranav Setlur, Jiahe Jin, Jamie Callan, João Magalhães, Bruno Martins, and Chenyan Xiong. 2025. DeepResearchGym: A free, transparent, and reproducible evaluation sandbox for deep research. Preprint.

Le Cong, Zaixi Zhang, Xiaotong Wang, Yin Di, Ruofan Jin, Michal Gerasimiuk, Yinkai Wang, Ravi K. Dinesh, David Smerkous, Alex Smerkous, Xuekun Wu, Shilong Liu, Peishan Li, Yi Zhu, Simran Serrao, Ning Zhao, Imran A. Mohammad, John B. Sunwoo, Joseph C. Wu, and Mengdi Wang. 2025. LabOS: The AI-XR co-scientist that sees and works with humans. Preprint.

Haotian Cui, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, and Bo Wang. 2024. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods, 21(8):1470–1480.
Alex Davies, Petar Veličković, Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Tomašev, Richard Tanburn, Peter Battaglia, Charles Blundell, András Juhász, Marc Lackenby, Geordie Williamson, Demis Hassabis, and Pushmeet Kohli. 2021. Advancing mathematics by guiding human intuition with AI. Nature, 600(7887):70–74.

Bernardo P de Almeida, Guillaume Richard, Hugo Dalla-Torre, Christopher Blum, Lorenz Hexemer, Priyanka Pandey, Stefan Laurent, Chandana Rajesh, Marie Lopez, Alexandre Laterre, and 1 others. 2025. A multimodal conversational agent for DNA, RNA and protein tasks. Nature Machine Intelligence, pages 1–14.

Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de las Casas, Craig Donner, Leslie Fritz, Cristian Galperti, Andrea Huber, James Keeling, Maria Tsimpoukelli, Jackie Kay, Antoine Merle, Jean-Marc Moret, and 12 others. 2022. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897):414–419.

Gregoire Deletang, Anian Ruoss, Paul-Ambroise Duquenne, Elliot Catt, Tim Genewein, Christopher Mattern, Jordi Grau-Moya, Li Kevin Wenliang, Matthew Aitchison, Laurent Orseau, Marcus Hutter, and Joel Veness. 2024. Language modeling is compression. In The Twelfth International Conference on Learning Representations.

Yong Deng, Guoqing Wang, Zhenzhe Ying, Xiaofeng Wu, Jinzhen Lin, Wenwen Xiong, Yuqin Dai, Shuo Yang, Zhanwei Zhang, Qiwen Wang, Yang Qin, Yuan Wang, Quanxing Zha, Sunhao Dai, and Changhua Meng. 2025. Atom-Searcher: Enhancing agentic deep research via fine-grained atomic thought reward. Preprint.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 4171–4186.

Keyan Ding, Jing Yu, Junjie Huang, Yuchen Yang, Qiang Zhang, and Huajun Chen. 2025. SciToolAgent: a knowledge-graph-driven scientific agent for multitool integration. Nature Computational Science, pages 1–11.

Kefan Dong and Tengyu Ma. 2025. STP: Self-play LLM theorem provers with iterative conjecturing and proving. In Forty-second International Conference on Machine Learning.

Mingxuan Du, Benfeng Xu, Chiwei Zhu, Licheng Zhang, Xiaorui Wang, and Zhendong Mao. 2026. DeepResearch Bench: A comprehensive benchmark for deep research agents. In The Fourteenth International Conference on Learning Representations.

Yuanqi Du, Botao Yu, Tianyu Liu, Tony Shen, Junwu Chen, Jan G. Rittig, Kunyang Sun, Yikun Zhang, Zhangde Song, Bo Zhou, Cassandra Masschelein, Yingze Wang, Haorui Wang, Haojun Jia, Chao Zhang, Hongyu Zhao, Martin Ester, Teresa Head-Gordon, Carla P. Gomes, and 4 others. 2025. Accelerating scientific discovery with autonomous goal-evolving agents. Preprint.

Emmanuel Dupoux, Yann LeCun, and Jitendra Malik. 2026. Why AI systems don't learn and what to do about it: Lessons on autonomous learning from cognitive science. Preprint, arXiv:2603.15381.

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the 41st International Conference on Machine Learning, ICML '24.

Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, and Sebastian Thrun. 2017. Dermatologist-level classification of skin cancer with deep neural networks.
Nature, 542(7639):115–118.

Wei Fan, Wenlin Yao, Zheng Li, Feng Yao, Xin Liu, Liang Qiu, Qingyu Yin, Yangqiu Song, and Bing Yin. 2025. DeepPlanner: Scaling planning capability for deep research agents via advantage shaping. Preprint.

You-Le Fang, Dong-Shan Jian, Xiang Li, and Yan-Qing Ma. 2025. AI-Newton: A concept-driven physical law discovery system without prior physical knowledge. Preprint.

Tao Feng, Yihang Sun, and Jiaxuan You. 2025. GraphEval: A lightweight graph-based LLM framework for idea evaluation. In The Thirteenth International Conference on Learning Representations.

Tony Feng, Junehyuk Jung, Sang hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V. Le, and Thang Luong. 2026a. Aletheia tackles first-proof autonomously. Preprint.

Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, and 9 others. 2026b. Towards autonomous mathematics research. Preprint.

Jian Gao and Dashun Wang. 2024. Quantifying the use and potential benefits of artificial intelligence in scientific research. Nature Human Behaviour, 8(12):2281–2292.

Shanghua Gao, Ada Fang, Yepeng Huang, Valentina Giunchiglia, Ayush Noori, Jonathan Richard Schwarz, Yasha Ektefaie, Jovana Kondic, and Marinka Zitnik. 2024. Empowering biomedical discovery with AI agents. Cell, 187(22):6125–6151.

Yiwen Gao, Ruochen Zhao, Yang Deng, and Wenxuan Zhang. 2026a. DR-Arena: an automated evaluation framework for deep research agents.
Preprint.

Yuan Gao, Chunli Li, Wanxing Chang, Bai Du, Xianghua Ye, Yee Hui Yeo, Yingda Xia, Heng Guo, Xiaoming Zhang, Wei Liu, Ruobing Bai, Beibei Li, Yang Hong, Jiawen Yao, Le Lu, Kai Cao, Ke Yan, Jun Chen, Jie Li, and 3 others. 2026b. Multi-modal ai for opportunistic screening, staging and progression risk stratification of steatotic liver disease. Nature Communications, 17(1):1562.

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. 2025. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. Preprint.

Robotics Team Gemini, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, Steven Bohez, Konstantinos Bousmalis, Anthony Brohan, Thomas Buschmann, Arunkumar Byravan, Serkan Cabi, Ken Caluwaerts, Federico Casarini, Oscar Chang, and 99 others. 2025. Gemini robotics: Bringing ai into the physical world. Preprint.

Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Kuan Li, Yida Zhao, Huifeng Yin, Yong Jiang, Pengjun Xie, Fei Huang, Huaxiu Yao, Yi R. Fung, and Jingren Zhou. 2026. Webwatcher: Breaking new frontiers of vision-language deep research agent. In The Fourteenth International Conference on Learning Representations.

Alireza Ghafarollahi and Markus J Buehler. 2024. Sciagents: Automating scientific discovery through bioinspired multi-agent intelligent graph reasoning. Advanced Materials, page 2413523.

Darío Gil and Kathryn A. Moler. 2025. Accelerating science with ai. Science, 390(6777):965–965.

Shashwat Goel, Rishi Hazra, Dulhan Jayalath, Timon Willi, Parag Jain, William F. Shen, Ilias Leontiadis, Francesco Barbieri, Yoram Bachrach, Jonas Geiping, and Chenxi Whitehouse. 2025. Training ai co-scientists using rubric rewards.
Preprint.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, and 15 others. 2025. Towards an ai co-scientist. Preprint.

Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Botao Yu, Andrei Kopanev, Weijian Qi, Yiheng Shu, Jiaman Wu, Chan Hee Song, Bernal Jimenez Gutierrez, Yifei Li, Zeyi Liao, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Tianci Xue, Shijie Chen, and 7 others. 2025. Mind2web 2: Evaluating agentic search with agent-as-a-judge. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. 2024. ToRA: A tool-integrated reasoning agent for mathematical problem solving. In The Twelfth International Conference on Learning Representations.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. The llama 3 herd of models. Preprint.

Michael Grosskopf, Russell Bent, Rahul Somasundaram, Isaac Michaud, Arthur Lui, Nathan Debardeleben, and Earl Lawrence. 2025. Ursa: The universal research and scientific agent. Preprint.

Xiao Gu, Wei Tang, Jinpei Han, Veer Sangha, Fenglin Liu, Shreyank N Gowda, Antonio H Ribeiro, Patrick Schwab, Kim Branson, Lei Clifton, and 1 other. 2026.
Cardiac health assessment across scenarios and devices using a multimodal foundation model pretrained on data from 1.7 million individuals. Nature Machine Intelligence, 8(2):220–233.

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, and 175 others. 2025. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638.

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. 2024. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. International Conference on Learning Representations.

Nikita Gupta, Riju Chatterjee, Lukas Haas, Connie Tao, Andrew Wang, Chang Liu, Hidekazu Oiwa, Elena Gribovskaya, Jan Ackermann, John Blitzer, Sasha Goldshtein, and Dipanjan Das. 2026. Deepsearchqa: Bridging the comprehensiveness gap for deep research agents. Preprint.

Tuomas Haarnoja, Ben Moran, Guy Lever, Sandy H. Huang, Dhruva Tirumala, Jan Humplik, Markus Wulfmeier, Saran Tunyasuvunakool, Noah Y. Siegel, Roland Hafner, Michael Bloesch, Kristian Hartikainen, Arunkumar Byravan, Leonard Hasenclever, Yuval Tassa, Fereshteh Sadeghi, Nathan Batchelor, Federico Casarini, Stefano Saliceti, and 9 others. 2024. Learning agile soccer skills for a bipedal robot with deep reinforcement learning. Science Robotics, 9(89):eadi8022.

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. 2025. Mastering diverse control tasks through world models. Nature, 640(8059):647–653.

Yoo-Geun Ham, Jeong-Hwan Kim, and Jing-Jia Luo. 2019. Deep learning for multi-year enso forecasts. Nature, 573(7775):568–572.
Rujun Han, Yanfei Chen, Zoey CuiZhu, Lesly Miculicich, Guan Sun, Yuanjun Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solène Maître, George Lee, Vishy Tirumalashetty, Emily Xue, Zizhao Zhang, Salem Haykal, Burak Gokturk, Tomas Pfister, and Chen-Yu Lee. 2025a. Deep researcher with test-time diffusion. Preprint.

Xiao-Qi Han, Peng-Jie Guo, Ze-Feng Gao, Hao Sun, and Zhong-Yi Lu. 2025b. Invdesflow-al: active learning-based workflow for inverse design of functional materials. npj Computational Materials.

Minsheng Hao, Jing Gong, Xin Zeng, Chiming Liu, Yucheng Guo, Xingyi Cheng, Taifeng Wang, Jianzhu Ma, Xuegong Zhang, and Le Song. 2024. Large-scale foundation model on single-cell transcriptomics. Nature Methods, 21(8):1481–1491.

Qianyue Hao, Fengli Xu, Yong Li, and James Evans. 2026. Artificial intelligence tools expand scientists’ impact but contract science’s focus. Nature.

Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2023. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. In Thirty-seventh Conference on Neural Information Processing Systems.

Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, and 1 other. 2025. Simulating 500 million years of evolution with a language model. Science, 387(6736):850–858.

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. 2024. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3828–3850.

Yong He, Pan Fang, Yongtao Shan, Yuanfei Pan, Yanhong Wei, Yichang Chen, Yihao Chen, Yi Liu, Zhenyu Zeng, Zhan Zhou, and 1 other. 2025.
Generalized biological foundation model with unified nucleic acid and protein language. Nature Machine Intelligence, pages 1–12.

Dan Hendrycks, Dawn Song, Christian Szegedy, Honglak Lee, Yarin Gal, Erik Brynjolfsson, Sharon Li, Andy Zou, Lionel Levine, Bo Han, Jie Fu, Ziwei Liu, Jinwoo Shin, Kimin Lee, Mantas Mazeika, Long Phan, George Ingebretsen, Adam Khoja, Cihang Xie, and 14 others. 2025. A definition of agi. Preprint.

Maximilian Herde, Bogdan Raonić, Tobias Rohner, Roger Käppeli, Roberto Molinaro, Emmanuel de Bézenac, and Siddhartha Mishra. 2024. Poseidon: efficient foundation models for pdes. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20.

Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, Hualei Zhou, Yun Yue, Minghui Yang, Chunxiao Guo, Junwei Liu, Peng Wei, and Jinjie Gu. 2025. Multi-agent deep research: Training multi-agent systems with m-grpo. Preprint.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.

Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, and 84 others. 2025a. A survey of scientific large language models: From data foundations to agent frontiers. Preprint.

Xiuyuan Hu, Guoqing Liu, Yang Zhao, and Hao Zhang. 2023. De novo drug design using reinforcement learning with multiple gpt agents.
In Advances in Neural Information Processing Systems, volume 36, pages 7405–7418.

Yusong Hu, Runmin Ma, Yue Fan, Jinxin Shi, Zongsheng Cao, Yuhao Zhou, Jiakang Yuan, Xiangchao Yan, Wenlong Zhang, Lei Bai, and Bo Zhang. 2025b. Flowsearch: Advancing deep research with dynamic structured knowledge flow. Preprint.

Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. 2024. Omnimed-vqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22170–22183.

Kexin Huang, Serena Zhang, Hanchen Wang, Yuanhao Qu, Yingzhou Lu, Yusuf Roohani, Ryan Li, Lin Qiu, Junze Zhang, Yin Di, and 1 other. 2025a. Biomni: A general-purpose biomedical ai agent. bioRxiv, pages 2025–05.

Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, Zehui Chen, Yao Hu, Philip Torr, Feng Zhao, and Wanli Ouyang. 2026. Vision-deepresearch: Incentivizing deepresearch capability in multimodal large language models. Preprint.

Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, Jianye Hao, Kun Shao, and Jun Wang. 2025b. Deep research agents: A systematic examination and roadmap. Preprint.

Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z. Horváth, Goran Žužić, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, Ottavia Bertolli, Tom Zahavy, Amol Mandhane, Jessica Yung, Iuliya Beloshapka, Borja Ibarz, Vivek Veeriah, Lei Yu, Oliver Nash, and 20 others. 2025. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature.

John B. Ingraham, Max Baranov, Zak Costello, Karl W. Barber, Wujie Wang, Ahmed Ismail, Vincent Frappier, Dana M. Lord, Christopher Ng-Thow-Hing, Erik R.
Van Vlack, Shan Tie, Vincent Xue, Sarah C. Cowles, Alan Leung, João V. Rodrigues, Claudio L. Morales-Perez, Alex M. Ayoub, Robin Green, Katherine Puentes, and 7 others. 2023. Illuminating protein space with a programmable generative model. Nature, 623(7989):1070–1078.

Team InternAgent, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Runmin Ma, Yusong Hu, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, and 7 others. 2025. Internagent: When agent becomes the scientist – building closed-loop system from hypothesis to verification. Preprint.

Abhinav Java, Ashmit Khandelwal, Sukruta Prakash Midigeshi, Aaron Halfaker, Amit Deshpande, Navin Goyal, Ankur Gupta, Nagarajan Natarajan, and Amit Sharma. 2026. Characterizing deep research: A benchmark and formal definition. In The Fourteenth International Conference on Learning Representations.

Yinjun Jia, Bowen Gao, Jiaxin Tan, Jiqing Zheng, Xin Hong, Wenyu Zhu, Haichuan Tan, Yuan Xiao, Liping Tan, Hongyi Cai, Yanwen Huang, Zhiheng Deng, Xiangwei Wu, Yue Jin, Yafei Yuan, Jiekang Tian, Wei He, Weiying Ma, Yaqin Zhang, and 4 others. 2026. Deep contrastive learning enables genome-wide virtual screening. Science, 391(6781):eads9530.

Zhuoxuan Jiang, Jie Ma, Jingyi Lu, Guangyuan Yu, Yipeng Yu, and Shaochun Li. 2019. A general planning-based framework for goal-driven conversation assistant. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):9857–9858.

Jiajie Jin, Yutao Zhu, Zhicheng Dou, Guanting Dong, Xinyu Yang, Chenghao Zhang, Tong Zhao, Zhao Yang, and Ji-Rong Wen. 2025a. Flashrag: A modular toolkit for efficient retrieval-augmented generation research. In Companion Proceedings of the ACM on Web Conference 2025, WWW ’25, pages 737–740.

Song Jin, Shuqi Li, Shukun Zhang, and Rui Yan. 2025b.
Finrpt: Dataset, evaluation system and llm-based multi-agent framework for equity research report generation. Preprint.

Haocheng Ju and Bin Dong. 2026. Ai for mathematics: Progress, challenges, and prospects. Preprint.

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, and 15 others. 2021. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589.

K-Dense Inc. 2026. Claude scientific skills: A comprehensive collection of scientific tools for claude ai. https://github.com/K-Dense-AI/claude-scientific-skills.

Yeonghun Kang and Jihan Kim. 2024. Chatmof: an artificial intelligence system for predicting and generating metal-organic frameworks using large language models. Nature Communications, 15(1):4705.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. Preprint.

Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Müller, Vladlen Koltun, and Davide Scaramuzza. 2023. Champion-level drone racing using deep reinforcement learning. Nature, 620(7976):982–987.

Michael J. Keiser, Vincent Setola, John J. Irwin, Christian Laggner, Atheir I. Abbas, Sandra J. Hufeisen, Niels H. Jensen, Michael B. Kuijer, Roberto C. Matos, Thuy B. Tran, Ryan Whaley, Richard A. Glennon, Jérôme Hert, Kelan L. H. Thomas, Douglas D. Edwards, Brian K. Shoichet, and Bryan L. Roth. 2009. Predicting new molecular targets for known drugs. Nature, 462(7270):175–181.
Dmitrii Kochkov, Janni Yuval, Ian Langmore, Peter Norgaard, Jamie Smith, Griffin Mooers, Milan Klöwer, James Lottes, Stephan Rasp, Peter Düben, Sam Hatfield, Peter Battaglia, Alvaro Sanchez-Gonzalez, Matthew Willson, Michael P. Brenner, and Stephan Hoyer. 2024. Neural general circulation models for weather and climate. Nature, 632(8027):1060–1066.

Satyapriya Krishna, Kalpesh Krishna, Anhad Mohananey, Steven Schwarcz, Adam Stambler, Shyam Upadhyay, and Manaal Faruqui. 2025. Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4745–4759.

Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75.

Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, and Peter Battaglia. 2023. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421.

Huy Hoan Le, Van Sy Thinh Nguyen, Thi Le Chi Dang, Vo Thanh Khang Nguyen, Truong Thanh Hung Nguyen, and Hung Cao. 2025. Multimedia verification through multi-agent deep research multimodal large language models. In Proceedings of the 33rd ACM International Conference on Multimedia, MM ’25, pages 14034–14040.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521(7553):436–444.

Yoonjoo Lee, Kyungjae Lee, Sunghyun Park, Dasol Hwang, Jaehyeon Kim, Hong-in Lee, and Moontae Lee. 2023.
Qasa: advanced question answering on scientific articles. In Proceedings of the 40th International Conference on Machine Learning, ICML ’23.

Haote Li, Sumon Sarkar, Wenxin Lu, Patrick O. Loftus, Tianyin Qiu, Yu Shee, Abbigayle E. Cuomo, John-Paul Webster, H. Ray Kelly, Vidhyadhar Manee, Sanil Sreekumar, Frederic G. Buono, Robert H. Crabtree, Timothy R. Newhouse, and Victor S. Batista. 2026a. Collective intelligence for ai-assisted chemical synthesis. Nature.

Junxian Li, Di Zhang, Xunzhi Wang, Zeying Hao, Jingdi Lei, Qian Tan, Cai Zhou, Wei Liu, Yaotian Yang, Xinrui Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Wei Li, Mao Su, Shufei Zhang, Wanli Ouyang, Yuqiang Li, and Dongzhan Zhou. 2025a. Chemvlm: Exploring the power of multimodal large language models in chemistry area. Proceedings of the AAAI Conference on Artificial Intelligence, 39(1):415–423.

Keyu Li, Mohan Jiang, Dayuan Fu, Yunze Wu, Xiangkun Hu, Dequan Wang, and Pengfei Liu. 2025b. Datasetresearch: Benchmarking agent systems for demand-driven dataset discovery. Preprint.

Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. 2024. Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14369–14387.

Long Li, Weiwen Xu, Jiayan Guo, Ruochen Zhao, Xingxuan Li, Yuqian Yuan, Boqiang Zhang, Yuming Jiang, Yifei Xin, Ronghao Dang, Yu Rong, Deli Zhao, Tian Feng, and Lidong Bing. 2025c. Chain of ideas: Revolutionizing research via novel idea development with LLM agents. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 8971–9004.

Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, and Kai Jia. 2025d. Reportbench: Evaluating deep research agents via academic survey tasks. Preprint.

Ning Li, Jingran Zhang, and Justin Cui.
2025e. Arxivbench: Can llms assist researchers in conducting research? Preprint.

Orion Li, Vinayak Agarwal, Summer Zhou, Ashwin Gopinath, and Timothy Kassis. 2025f. K-dense analyst: Towards fully automated scientific analysis. Preprint.

Ruizhe Li, Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. 2026b. Deepresearch bench ii: Diagnosing deep research agents via rubrics from expert report. Preprint.

Wenjun Li, Zhi Chen, Jingru Lin, Hannan Cao, Wei Han, Sheng Liang, Zhi Zhang, Kuicai Dong, Dexun Li, Chen Zhang, and Yong Liu. 2025g. Reinforcement learning foundations for deep research systems: A survey. Preprint, arXiv:2509.06733.

Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, and Zhicheng Dou. 2025h. Webthinker: Empowering large reasoning models with deep research capability. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

Xinghang Li, Peiyan Li, Long Qian, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Xinlong Wang, Di Guo, Tao Kong, Hanbo Zhang, and Huaping Liu. 2026c. What matters in building vision–language–action models for generalist robots. Nature Machine Intelligence, 8(2):158–172.

Xuchen Li, Ruitao Wu, Xuanbo Liu, Xukai Wang, Jinbo Hu, Zhixin Bai, Bohan Zeng, Hao Liang, Leheng Chen, Mingrui Chen, Haitian Zhong, Xuanlin Yang, Xu-Yao Zhang, Liu Liu, Jia Li, Kaiqi Huang, Jiahao Xu, Haitao Mi, Wentao Zhang, and Bin Dong. 2025i. Sciagent: A unified multi-agent system for generalistic scientific reasoning. Preprint.

Yu Li, Yuan Huang, Tao Wang, Caiyu Fan, Xiansheng Cai, Sihan Hu, Xinzijian Liu, Cheng Shi, Mingjun Xu, Zhen Wang, Yan Wang, Xiangqi Jin, Tianhan Zhang, Linfeng Zhang, Lei Wang, Youjin Deng, Pan Zhang, Weijie Sun, Xingyu Li, and 4 others. 2025j. Inverse knowledge search over verifiable reasoning: Synthesizing a scientific encyclopedia from a long chains-of-thought knowledge base.
Preprint.

Yu Li, Lehui Li, Qingmin Liao, Fengli Xu, and Yong Li. 2025k. Agentexpt: Automating ai experiment design with llm-based resource retrieval agent. Preprint.

Zehui Li, Vallijah Subasri, Yifei Shen, Dongsheng Li, Wentao Gu, Guy-Bart Stan, Yiren Zhao, and Caihua Shan. 2025l. Omni-dna: A genomic model supporting sequence understanding, long-context, and textual annotation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, Jun Zhang, and Jingren Zhou. 2026d. Webweaver: Structuring web-scale evidence with dynamic outlines for open-ended deep research. In The Fourteenth International Conference on Learning Representations.

Yuan Liang, Jiaxian Li, Yuqing Wang, Wang Piaohong, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, Minghao Liu, Yuchen Eleanor Jiang, Ningyu Zhang, and Wangchunshu Zhou. 2026. Towards personalized deep research: Benchmarks and evaluations. In The Fourteenth International Conference on Learning Representations.

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let’s verify step by step. Preprint.

Chensen Lin, Ruian Tie, Shihong Yi, Dongqing Liu, Xiaohui Zhong, Zixin Hu, and Hao Li. 2026. Reconstructing fine-scale 3d wind fields with terrain-informed machine learning. Nature Communications.

Minhua Lin, Zongyu Wu, Zhichao Xu, Hui Liu, Xianfeng Tang, Qi He, Charu Aggarwal, Hui Liu, Xiang Zhang, and Suhang Wang. 2025. A comprehensive survey on reinforcement learning-based agentic search: Foundations, roles, optimizations, evaluations, and applications.
Preprint.

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, and 1 other. 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130.

Alexander H. Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andy Ehrenberg, Andy Lo, Anton Eliseev, Antonia Calvi, Avinash Sooriyarachchi, Baptiste Bout, and 101 others. 2026a. Ministral 3. Preprint.

Guoqing Liu, Junren Li, Zihan Zhao, Eray Inanc, Krzysztof Maziarz, Jose Garrido Torres, Victor Garcia Satorras, Shoko Ueda, Christopher M. Bishop, and Marwin Segler. 2025a. A scientific reasoning model for organic synthesis procedure generation. Preprint.

Hongwei Liu, Junnan Liu, Shudong Liu, Haodong Duan, Yuqiang Li, Mao Su, Xiaohong Liu, Guangtao Zhai, Xinyu Fang, Qianhong Ma, Taolin Zhang, Zihan Ma, Yufeng Zhao, Peiheng Zhou, Linchen Xiao, Wenlong Zhang, Shijie Zhou, Xingjian Ma, Siqi Sun, and 17 others. 2025b. Atlas: A high-difficulty, multidisciplinary benchmark for frontier scientific reasoning. Preprint.

Linjing Liu, Wei Li, Fang Wang, Yiming Li, Long-Kai Huang, Ka-Chun Wong, Fan Yang, and Jianhua Yao. 2025c. A pre-trained large generative model for translating single-cell transcriptomes to proteomes. Nature Biomedical Engineering.

Yong Liu, Chenqing Hua, Menglong Xu, Tao Zeng, Jiahua Rao, Zhongyue Zhang, Ruibo Wu, Jing-Ke Weng, Connor W Coley, and Shuangjia Zheng. 2026b. A geometric foundation model for enzyme retrieval with evolutionary insights. Nature Catalysis.

Yujie Liu, Zonglin Yang, Tong Xie, Jinjie Ni, Ben Gao, Yuqiang Li, Shixiang Tang, Wanli Ouyang, Erik Cambria, and Dongzhan Zhou. 2025d.
Researchbench: Benchmarking llms in scientific discovery via inspiration-based task decomposition. Preprint.

Zikang Liu, Longteng Guo, Yepeng Tang, Tongtian Yue, Junxian Cai, Kai Ma, Qingbin Liu, Xi Chen, and Jing Liu. 2025e. VRoPE: Rotary position embedding for video large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.

Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. 2026a. Towards end-to-end automation of ai research. Nature, 651(8107):914–919.

Jiaxuan Lu, Ziyu Kong, Yemin Wang, Rong Fu, Haiyuan Wan, Cheng Yang, Wenjie Lou, Haoran Sun, Lilong Wang, Yankai Jiang, Xiaosong Wang, Xiao Sun, and Dongzhan Zhou. 2026b. Beyond static tools: Test-time tool evolution for scientific reasoning. Preprint.

Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Melissa Zhao, Aaron K Chow, Kenji Ikemura, Ahrong Kim, Dimitra Pouli, Ankush Patel, and 1 other. 2024a. A multimodal generative ai copilot for human pathology. Nature, 634(8033):466–473.

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2024b. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In The Twelfth International Conference on Learning Representations.

Renzhi Lu, Zonghe Shao, Yuemin Ding, Ruijuan Chen, Dongrui Wu, Housheng Su, Tao Yang, Fumin Zhang, Jun Wang, Yang Shi, Zhong-Ping Jiang, Han Ding, and Hai-Tao Zhang. 2025a. Discovery of the reward function for embodied reinforcement learning agents. Nature Communications, 16(1):11064.

Rui Lu, Zhenyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu, Yujiang Li, Shi Feng, Jie Tang, and Yuxiao Dong. 2025b. Deepdive: Advancing deep search agents with knowledge graphs and multi-turn rl. Preprint.

Xiaoliang Luo, Akilles Rechardt, Guangzhi Sun, Kevin K.
Nejad, Felipe Yáñez, Bati Yilmaz, Kangjoo Lee, Alexandra O. Cohen, Valentina Borghesani, Anton Pashkov, Daniele Marinazzo, Jonathan Nicholas, Alessandro Salatiello, Ilia Sucholutsky, Pasquale Minervini, Sepehr Razavi, Roberta Rocca, Elkhan Yusifov, Tereza Okalova, and 20 others. 2025. Large language models surpass human experts in predicting neuroscience results. Nature Human Behaviour, 9(2):305–315.

Isaac D. Lutz, Shunzhi Wang, Christoffer Norn, Alexis Courbet, Andrew J. Borst, Yan Ting Zhao, Annie Dosey, Longxing Cao, Jinwei Xu, Elizabeth M. Leaf, Catherine Treichel, Patrisia Litvicov, Zhe Li, Alexander D. Goodson, Paula Rivera-Sánchez, Ana-Maria Bratovianu, Minkyung Baek, Neil P. King, Hannele Ruohola-Baker, and David Baker. 2023. Top-down design of protein architectures with reinforcement learning. Science, 380(6642):266–273.

Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, and Xiaohui Yan. 2026. Evoscientist: Towards multi-agent evolving ai scientists for end-to-end scientific discovery. Preprint.

Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, and Xiuying Chen. 2025. Deepshop: A benchmark for deep research shopping agents. Preprint.

Andres M. Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D White, and Philippe Schwaller. 2024. Augmenting large language models with chemistry tools. Nature Machine Intelligence, 6(5):525–535.

Chenxi Ma, Weimin Tan, Ruian He, and Bo Yan. 2024a. Pretraining a foundation model for generalizable fluorescence microscopy-based image restoration. Nature Methods, 21(8):1558–1567.

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. 2024b. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers.
In European Conference on Computer Vision, pages 23–40.

Pingchuan Ma, Tsun-Hsuan Wang, Minghao Guo, Zhiqing Sun, Joshua B. Tenenbaum, Daniela Rus, Chuang Gan, and Wojciech Matusik. 2024c. LLM and simulation as bilevel optimizers: A new paradigm to advance physical scientific discovery. In Forty-first International Conference on Machine Learning.

Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, and 1 other. 2025. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7739–7751.

Wolfgang Maass. 1997. Networks of spiking neurons: the third generation of neural network models. Neural Networks, 10(9):1659–1671.

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. 2026. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. Preprint.

Daniel McDuff, Mike Schaekermann, Tao Tu, Anil Palepu, Amy Wang, Jake Garrison, Karan Singhal, Yash Sharma, Shekoofeh Azizi, Kavita Kulkarni, Le Hou, Yong Cheng, Yun Liu, S. Sara Mahdavi, Sushant Prakash, Anupam Pathak, Christopher Semturs, Shwetak Patel, Dale R. Webster, and 9 others. 2025. Towards accurate differential diagnosis with large language models. Nature, 642(8067):451–457.

LongCat Team Meituan, Anchun Gui, Bei Li, Bingyang Tao, Bole Zhou, Borun Chen, Chao Zhang, Chao Zhang, Chen Gao, Chen Zhang, Chengcheng Han, Chenhui Yang, Chuyu Zhang, Cong Chen, Cunguang Wang, Daoru Pan, Defei Bu, Dengchang Zhao, Di Xiu, and 147 others. 2026. Longcat-flash-thinking-2601 technical report. Preprint.

Aditi T Merchant, Samuel H King, Eric Nguyen, and Brian L Hie. 2025. Semantic design of functional de novo genes from a genomic language model. Nature.

Amil Merchant, Simon Batzner, Samuel S.
Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. 2023. Scaling deep learning for materials discovery. Nature, 624(7990):80–85.

Lisa Messeri and M. J. Crockett. 2024. Artificial intelligence and illusions of understanding in scientific research. Nature, 627(8002):49–58.

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. 2023. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations.

Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, pages 3111–3119.

AI Team MiroMind. 2025. Miroflow: A high-performance open-source research agent framework. https://github.com/MiroMindAI/MiroFlow.

Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Martiño Ríos-García, Benedict Emoekabu, Aswanth Krishnan, Tanya Gupta, Mara Schilling-Wilhelmi, Macjonathan Okereke, Anagha Aneesh, Mehrdad Asgari, Juliane Eberhardt, Amir Mohammad Elahi, Hani M. Elbeheiry, María Victoria Gil, Christina Glaubitz, Maximilian Greiner, Caroline T. Holick, Tim Hoffmann, and 16 others. 2025. A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists. Nature Chemistry, 17(7):1027–1034.

T.M. Mitchell. 1997. Machine Learning. McGraw-Hill.

Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C. Landsness, Daniel L. Barabasi, Siddharth Narayanan, Nicky Evans, Shriya Reddy, Martha Foiani, Aizad Kamal, Leah P.
Shri ver , Fang Cao, As- mamaw T . W assie, Jon M. Laurent, Edwin Melville- Green, Mayk Caldas, and 18 others. 2025. Kosmos: An ai scientist for autonomous discov ery . Preprint , 28 Fengran Mo, Kelong Mao, Ziliang Zhao, Hongjin Qian, Haonan Chen, Y iruo Cheng, Xiaoxi Li, Y utao Zhu, Zhicheng Dou, and Jian-Y un Nie. 2025. A survey of con versational search. A CM T rans. Inf. Syst. , 43(6). Michael Moor , Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M. Krumholz, Jure Leskovec, Eric J. T opol, and Pranav Rajpurkar . 2023. Foundation mod- els for generalist medical artificial intelligence. Na- tur e , 616(7956):259–265. Nikhil Mukund, Y ifang Luo, Fan Zhang, Lisa Barsotti, and Erik Katsavounidis. 2026. Marvel: A multi agent-based research validator and enabler using large language models . Preprint , arXi v:2601.03436. Ansh Nagda, Prabhakar Ragha van, and Abhradeep Thakurta. 2026. Reinforced generation of combinato- rial structures: Hardness of approximation . Pr eprint , Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff W u, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, V ineet K osaraju, William Saunders, Xu Jiang, Karl Cobbe, T yna Eloundou, Gretchen Krueger , Ke vin Button, Matthe w Knight, Benjamin Chess, and John Schulman. 2022. W ebgpt: Bro wser- assisted question-answering with human feedback . Pr eprint , Eric Nguyen, Michael Poli, Matthe w G Durrant, Brian Kang, Dhruva Katrekar , David B Li, Liam J Bartie, Armin W Thomas, Samuel H King, Garyk Brixi, and 1 others. 2024. Sequence modeling and design from molecular to genome scale with ev o. Science , 386(6723):eado9336. Xuan-Phi Nguyen, Shrey Pandit, Re vanth Gangi Reddy , Austin Xu, Silvio Sav arese, Caiming Xiong, and Shafiq Joty . 2025. Sfr-deepresearch: T ow ards ef- fectiv e reinforcement learning for autonomously rea- soning single agents . Pr eprint , Jasmine Chiat Ling Ong, Y ilin Ning, Rui Y ang, Danielle S. Bitterman, Xiaoxuan Liu, Y ih Chung Tham, Gary S. 
Collins, Michelle María Jiménez de T avárez, Bilal A. Mateen, Kwesi Nyan Amissah- Arthur , Bin Sheng, Iain Bee Huat T an, Chuan Hong, Lionel Tim-Ee Cheng, Benjamin Alan Goldstein, Phuoc V . Le, Y un Liu, Hiang Khoon T an, Marcus Eng Hock Ong, and 9 others. 2026. Large language models in global health. Natur e Health , 1(1):35–47. OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer , Adam Richardson, Ahmed El-Kishky , Aiden Low , Alec Helyar , Aleksander Madry , Alex Beutel, Alex Carney , Alex Iftimie, Alex Karpenk o, Alex T achard Passos, Alexander Neitz, Alexander Prokofie v , Alexander W ei, Allison T am, Ally Bennett, and 243 others. 2024. Openai o1 system card . Pr eprint , Haining Pan, Nayantara Mudur , William T aranto, Maria T ikhanovskaya, Subhashini V enugopalan, Y asaman Bahri, Michael P . Brenner, and Eun-Ah Kim. 2025. Quantum many-body physics calculations with large language models. Communications Physics , 8(1):49. Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar , Ion Stoica, Matei Zaharia, and Carlos Guestrin. 2025. Deepscholar-bench: A live bench- mark and automated ev aluation for generativ e re- search synthesis . Pr eprint , W illiam Peebles and Saining Xie. 2023. Scalable dif- fusion models with transformers. In Proceedings of the IEEE/CVF international confer ence on computer vision , pages 4195–4205. Jing Pei, Lei Deng, Sen Song, Mingguo Zhao, Y ouhui Zhang, Shuang W u, Guanrui W ang, Zhe Zou, Zhen- zhi W u, W ei He, Feng Chen, Ning Deng, Si W u, Y u W ang, Y ujie W u, Zheyu Y ang, Cheng Ma, Guoqi Li, W entao Han, and 5 others. 2019. T owards ar- tificial general intelligence with hybrid tianjic chip architecture. Natur e , 572(7767):106–111. Thinh Pham, Nguyen Phan Nguyen, Pratibha Zunjare, W eiyuan Chen, Y u-Min Tseng, and T u V u. 2026. SealQA: Raising the bar for reasoning in search- augmented language models. In The F ourteenth In- ternational Confer ence on Learning Repr esentations . 
Long Phan, Alice Gatti, Nathaniel Li, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dan Hendrycks, Ziwen Han, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, and 281 others. 2026. A benchmark of expert-level academic questions to assess ai capabilities. Nature, 649(8099):1139–1146.
Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. 2022. Grokking: Generalization beyond overfitting on small algorithmic datasets. Preprint.
Shraman Pramanick, Rama Chellappa, and Subhashini Venugopalan. 2024. SPIQA: A dataset for multimodal question answering on scientific papers. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R. Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, Jacklynn Stott, Shakir Mohamed, Peter Battaglia, Remi Lam, and Matthew Willson. 2025. Probabilistic weather forecasting with machine learning. Nature, 637(8044):84–90.
Yingming Pu, Tao Lin, and Hongyu Chen. 2025. Piflow: Principle-aware scientific discovery with multi-agent collaboration. Preprint.
Lifeng Qiao, Peng Ye, Yuchen Ren, Weiqiang Bai, Chaoqi Liang, Xinzhu Ma, Nanqing Dong, and Wanli Ouyang. 2024. Model decides how to tokenize: adaptive dna sequence tokenization with mxdna. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24.
Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. 2025. Webresearcher: Unleashing unbounded reasoning capability in long-horizon agents. Preprint.
Pengcheng Qiu, Chaoyi Wu, Xiaoman Zhang, Weixiong Lin, Haicheng Wang, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2024. Towards building multilingual language model for medicine. Nature Communications, 15(1):8384.
Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Zhang Haoxu, Yi Hu, Chenyang Wang, Chencheng Tang, Haoling Chang, Qi Liu, Ziheng Zhou, Tianyu Zhang, Jingtian Zhang, Zhangyi Liu, Minghao Li, and 34 others. 2025. PHYBench: Holistic evaluation of physical perception and reasoning in large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Paul Raccuglia, Katherine C. Elbert, Philip D. F. Adler, Casey Falk, Malia B. Wenny, Aurelio Mollo, Matthias Zeller, Sorelle A. Friedler, Joshua Schrier, and Alexander J. Norquist. 2016. Machine-learning-assisted materials discovery using failed experiments. Nature, 533(7601):73–76.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, and 1 others. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Ilija Radosavovic, Tete Xiao, Bike Zhang, Trevor Darrell, Jitendra Malik, and Koushil Sreenath. 2024. Real-world humanoid locomotion with reinforcement learning. Science Robotics, 9(89):eadi9579.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1).
Vishwanatha M. Rao, Serena Zhang, Brian S. Plosky, Patrick D. Hsu, Bo Wang, James Zou, Marinka Zitnik, Eric J. Topol, and Pranav Rajpurkar. 2026. Generalist biological artificial intelligence in modeling the language of life. Nature Biotechnology.
Markus Reichstein, Gustau Camps-Valls, Bjorn Stevens, Martin Jung, Joachim Denzler, Nuno Carvalhais, and Prabhat. 2019. Deep learning and process understanding for data-driven earth system science. Nature, 566(7743):195–204.
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2024. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling.
Malte Reinschmidt, József Fortágh, Andreas Günther, and Valentin V. Volchkov. 2024. Reinforcement learning in cold atom experiments. Nature Communications, 15(1):8532.
Shuo Ren, Pu Jian, Zhenjiang Ren, Chunlin Leng, Can Xie, and Jiajun Zhang. 2025. Towards scientific intelligence: A survey of llm-based scientific agents. Preprint.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695.
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. 2024. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241.
Jie Ruan, Inderjeet Jayakumar Nair, Shuyang Cao, Amy Liu, Sheza Munir, Micah Pollens-Dempsey, Yune-Ting Tiffany Chiang, Lucy R. Kates, Nicholas David, Sihan Chen, Ruxin Yang, Yuqian Yang, Jihyun Jasmine Gump, Tessa Bialek, Vivek S Sankaran, Margo Schlanger, and Lu Wang. 2026. Expertlongbench: Benchmarking language models on expert-level long-form generation tasks with structured checklists. In The Fourteenth International Conference on Learning Representations.
Yixiang Ruan, Chenyin Lu, Ning Xu, Yuchen He, Yixin Chen, Jian Zhang, Jun Xuan, Jianzhang Pan, Qun Fang, Hanyu Gao, and 1 others. 2024. An automatic end-to-end chemical synthesis development platform powered by large language models. Nature communications, 15(1):10160.
Benjamin Sanchez-Lengeling and Alán Aspuru-Guzik. 2018. Inverse molecular design using machine learning: Generative models for matter engineering. Science, 361(6400):360–365.
Moritz Schaefer, Peter Peneder, Daniel Malzl, Salvo Danilo Lombardo, Mihaela Peycheva, Jake Burton, Anna Hakobyan, Varun Sharma, Thomas Krausgruber, Celine Sin, and 1 others. 2025. Multimodal learning enables chat-based exploration of single-cell data. Nature Biotechnology, pages 1–11.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessí, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: language models can teach themselves to use tools. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23.
Marwin H. S. Segler, Mike Preuss, and Mark P. Waller. 2018. Planning chemical syntheses with deep neural networks and symbolic ai. Nature, 555(7698):604–610.
Alireza Seif, Mohammad Hafezi, and Christopher Jarzynski. 2021. Machine learning the thermodynamic arrow of time. Nature Physics, 17(1):105–113.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.
Erzhuo Shao, Yifang Wang, Yifan Qian, Zhenyu Pan, Han Liu, and Dashun Wang. 2025a. Sciscigpt: advancing human–ai collaboration in the science of science. Nature Computational Science.
Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, and 2 others. 2025b. Dr tulu: Reinforcement learning with evolving rubrics for deep research. Preprint.
Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya Balwani, Denis Peskoff, Marcos Ayestaran, Sean M. Hendryx, Brad Kenstler, and Bing Liu. 2025. Researchrubrics: A benchmark of prompts and rubrics for evaluating deep research agents. Preprint.
Yu-Zhe Shi, Shiqian Li, Xinyi Niu, Qiao Xu, Jiawen Liu, Yifan Xu, Shiyu Gu, Bingru He, Xinyang Li, Xinyu Zhao, Zijian Zhao, Yidong Lyu, Zhen Li, Sijia Liu, Lin Qiu, Jinhao Ji, Lecheng Ruan, Yuxi Ma, Wenjuan Han, and Yixin Zhu. 2023. PersLEARN: Research training through the lens of perspective cultivation. In ACL: System Demonstrations.
Zhengliang Shi, Yiqun Chen, Haitao Li, Weiwei Sun, Shiyu Ni, Yougang Lyu, Run-Ze Fan, Bowen Jin, Yixuan Weng, Minjun Zhu, Qiujie Xie, Xinyu Guo, Qu Yang, Jiayi Wu, Jujia Zhao, Xiaqiang Tang, Xinbei Ma, Cunxiang Wang, Jiaxin Mao, and 7 others. 2025a. Deep research: A systematic survey. Preprints.
Zhengliang Shi, Shen Gao, Lingyong Yan, Yue Feng, Xiuyi Chen, Zhumin Chen, Dawei Yin, Suzan Verberne, and Zhaochun Ren. 2025b. Tool learning in the wild: Empowering language models as automatic tool agents. In Proceedings of the ACM on Web Conference 2025, WWW ’25, page 2222–2237.
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: language agents with verbal reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23.
Artem Shmatko, Alexander Wolfgang Jung, Kumar Gaurav, Søren Brunak, Laust Hvas Mortensen, Ewan Birney, Tom Fitzgerald, and Moritz Gerstung. 2025. Learning the natural history of human disease with generative transformers. Nature, 647(8088):248–256.
Chenglei Si, Tatsunori Hashimoto, and Diyi Yang. 2025a. The ideation-execution gap: Execution outcomes of llm-generated versus human research ideas. Preprint.
Chenglei Si, Diyi Yang, and Tatsunori Hashimoto. 2025b. Can LLMs generate novel research ideas? a large-scale human study with 100+ NLP researchers. In The Thirteenth International Conference on Learning Representations.
Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, and 13 others. 2023. Large language models encode clinical knowledge. Nature, 620(7972):172–180.
Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, and 1 others. 2025. Toward expert-level medical question answering with large language models. Nature Medicine, 31(3):943–950.
Guijin Son, Jiwoo Hong, Honglu Fan, Heejeong Nam, Hyunwoo Ko, Seungwon Lim, Jinyeop Song, Jinha Choi, Gonçalo Paulo, Youngjae Yu, and Stella Biderman. 2025. When ai co-scientists fail: Spot-a benchmark for automated verification of scientific research. Preprint.
Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M. Pruyn, Yue Huang, Kehan Guo, Xiuzhe Luo, Yuanhao Qu, Yi Qu, Yinkai Wang, Haorui Wang, Jeff Guo, Jingru Gan, Parshin Shojaee, Di Luo, Andres M Bran, Gen Li, Qiyuan Zhao, and 37 others. 2025. Evaluating large language models in scientific discovery. Preprint, arXiv:2512.15567.
Brian K. Spears, Scott Brandon, Dan T. Casey, John E. Field, Jim A. Gaffney, Kelli D. Humbird, Andrea L. Kritcher, Michael K. G. Kruse, Eugene Kur, Bogdan Kustowski, S. Langer, Dave Munro, Ryan Nora, J. Luc Peterson, Dave J. Schlossberg, Paul Springer, and Alex Zylstra. 2025. Predicting fusion ignition at the national ignition facility with physics-informed deep learning. Science, 389(6761):727–731.
Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. 2025. Paperbench: Evaluating AI’s ability to replicate AI research. In Forty-second International Conference on Machine Learning.
Sandra Steyaert, Marija Pizurica, Divya Nagaraj, Priya Khandelwal, Tina Hernandez-Boussard, Andrew J Gentles, and Olivier Gevaert. 2023. Multimodal data fusion for cancer biomarker discovery with deep learning. Nature machine intelligence, 5(4):351–362.
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. Learning to summarize from human feedback. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20.
Jonathan M Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina M Donghia, Craig R MacNair, Shawn French, Lindsey A Carfrae, Zohar Bloom-Ackermann, and 1 others. 2020. A deep learning approach to antibiotic discovery. Cell, 180(4):688–702.
Haoyang Su, Renqi Chen, Shixiang Tang, Zhenfei Yin, Xinzhe Zheng, Jinzhe Li, Biqing Qi, Qi Wu, Hui Li, Wanli Ouyang, Philip Torr, Bowen Zhou, and Nanqing Dong. 2025. Many heads are better than one: Improved scientific idea generation by a LLM-based multi-agent system. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28201–28240.
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: Enhanced transformer with rotary position embedding. Neurocomput., 568(C).
Liangtai Sun, Yang Han, Zihan Zhao, Da Ma, Zhennan Shen, Baocai Chen, Lu Chen, and Kai Yu. 2024. Scieval: A multi-level large language model evaluation benchmark for scientific research. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19053–19061.
Tao Sun, Enhao Pan, Zhengkai Yang, Kaixin Sui, Jiajun Shi, Xianfu Cheng, Tongliang Li, Ge Zhang, Wenhao Huang, Jian Yang, and Zhoujun Li. 2026. P2p: Automated paper-to-poster generation and fine-grained benchmark. In The Fourteenth International Conference on Learning Representations.
Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, and James Zou. 2025. The virtual lab of ai agents designs new sars-cov-2 nanobodies. Nature, 646:716–723.
Nathan J. Szymanski, Bernardus Rendy, Yuxing Fei, Rishi E. Kumar, Tanjin He, David Milsted, Matthew J. McDermott, Max Gallant, Ekin Dogus Cubuk, Amil Merchant, Haegyeom Kim, Anubhav Jain, Christopher J. Bartel, Kristin Persson, Yan Zeng, and Gerbrand Ceder. 2023. An autonomous laboratory for the accelerated synthesis of inorganic materials. Nature, 624(7990):86–91.
Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. 2025. AI-researcher: Autonomous scientific innovation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
Cristopher Tinajero, Marcileia Zanatta, Julián E Sánchez-Velandia, Eduardo García-Verdugo, and Victor Sans. 2025. Reac-discovery: an artificial intelligence–driven platform for continuous-flow catalytic reactor discovery and optimization. Nature Communications, 16(1):9062.
Gary Tom, Stefan P Schmid, Sterling G Baird, Yang Cao, Kourosh Darvish, Han Hao, Stanley Lo, Sergio Pablo-García, Ella M Rajaonson, Marta Skreta, and 1 others. 2024. Self-driving laboratories for chemistry and materials science. Chemical Reviews, 124(16):9633–9732.
Jingqi Tong, Mingzhe Li, Hangcheng Li, Yongzhuo Yang, Yurong Mou, Weijie Ma, Zhiheng Xi, Hongji Chen, Xiaoran Liu, Qinyuan Cheng, Ming Zhang, Qiguang Chen, Weifeng Ge, Qipeng Guo, Tianlei Ying, Tianxiang Sun, Yining Zheng, Xinchi Chen, Jun Zhao, and 4 others. 2026a. Ai can learn scientific taste. Preprint.
Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas, Amir Bar, Michael Rabbat, Jakob Verbeek, Luke Zettlemoyer, Koustuv Sinha, and 2 others. 2026b. Beyond language modeling: An exploration of multimodal pretraining. Preprint.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, and 49 others. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint.
Dhruv Trehan and Paras Chopra. 2026. Why llms aren’t scientists yet: Lessons from four autonomous research attempts. Preprint.
Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, and Thang Luong. 2024. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482.
Vahe Tshitoyan, John Dagdelen, Leigh Weston, Alexander Dunn, Ziqin Rong, Olga Kononova, Kristin A. Persson, Gerbrand Ceder, and Anubhav Jain. 2019. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature, 571(7763):95–98.
Tao Tu, Mike Schaekermann, Anil Palepu, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Yong Cheng, Elahe Vedadi, Nenad Tomasev, Shekoofeh Azizi, Karan Singhal, Le Hou, Albert Webson, Kavita Kulkarni, S. Sara Mahdavi, Christopher Semturs, and 7 others. 2025. Towards conversational diagnostic artificial intelligence. Nature, 642(8067):442–450.
Alan M Turing. 2007. Computing machinery and intelligence. In Parsing the Turing test: Philosophical and methodological issues in the quest for the thinking computer, pages 23–65.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.
Francisco Villaescusa-Navarro, Boris Bolliet, Pablo Villanueva-Domingo, Adrian E. Bayer, Aidan Acquah, Chetana Amancharla, Almog Barzilay-Siegal, Pablo Bermejo, Camille Bilodeau, Pablo Cárdenas Ramírez, Miles Cranmer, Urbano L. França, ChangHoon Hahn, Yan-Fei Jiang, Raul Jimenez, Jun-Young Lee, Antonio Lerario, Osman Mamun, Thomas Meier, and 17 others. 2025. The denario project: Deep knowledge ai agents for scientific discovery. Preprint, arXiv:2510.26887.
Haiyuan Wan, Chen Yang, Junchi Yu, Meiqi Tu, Jiaxuan Lu, Di Yu, Jianbao Cao, Ben Gao, Jiaqing Xie, Aoran Wang, Wenlong Zhang, Philip Torr, and Dongzhan Zhou. 2025a. Deepresearch arena: The first exam of llms’ research abilities via seminar-grounded tasks. Preprint.
Yi Wan, Jiuqi Wang, Liam Li, Jinsong Liu, Ruihao Zhu, and Zheqing Zhu. 2025b. Pokeeresearch: Effective deep research via reinforcement learning from ai feedback and robust reasoning scaffold. Preprint.
Chengye Wang, Yifei Shen, Zexi Kuang, Arman Cohan, and Yilun Zhao. 2025a. SciVer: Evaluating foundation models for multimodal scientific claim verification. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8562–8579.
Dilin Wang, Hyunyoung Jung, Tom Monnier, Kihyuk Sohn, Chuhang Zou, Xiaoyu Xiang, Yu-Ying Yeh, Di Liu, Zixuan Huang, Thu Nguyen-Phuoc, Yuchen Fan, Sergiu Oprea, Ziyan Wang, Roman Shapovalov, Nikolaos Sarafianos, Thibault Groueix, Antoine Toisoul, Prithviraj Dhar, Xiao Chu, and 6 others. 2025b. Worldgen: From text to traversable and interactive 3d worlds. Preprint.
Erik Y. Wang, Sumeet Motwani, James V. Roggeveen, Eliot Hodges, Dulhan Jayalath, Charles London, Kalyan Ramakrishnan, Flaviu Cipcigan, Philip Torr, and Alessandro Abate. 2026a. Horizonmath: Measuring ai progress toward mathematical discovery with automatic verification. Preprint, arXiv:2603.15617.
Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, and 1 others. 2023a. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60.
Jiayu Wang, Yifei Ming, Riya Dulepet, Qinglin Chen, Austin Xu, Zixuan Ke, Frederic Sala, Aws Albarghouthi, Caiming Xiong, and Shafiq Joty. 2026b. Liveresearchbench: Benchmarking single- and multi-agent systems for citation-grounded deep research. In The Fourteenth International Conference on Learning Representations.
Jue Wang, Sidney Lisanza, David Juergens, Doug Tischer, Joseph L. Watson, Karla M. Castro, Robert Ragotte, Amijai Saragovi, Lukas F. Milles, Minkyung Baek, Ivan Anishchenko, Wei Yang, Derrick R. Hicks, Marc Expòsit, Thomas Schlichthaerle, Jung-Ho Chun, Justas Dauparas, Nathaniel Bennett, Basile I. M. Wicky, and 5 others. 2022. Scaffolding protein functional sites using deep learning. Science, 377(6604):387–394.
Ning Wang, Jiang Bian, Yuchen Li, Xuhong Li, Shahid Mumtaz, Linghe Kong, and Haoyi Xiong. 2024a. Multi-purpose rna language modelling with motif-aware pretraining and type-guided fine-tuning. Nature Machine Intelligence, 6(5):548–557.
Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. 2024b. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439.
Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and Anthony Chen. 2024c. Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519.
Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang, Yi Chen, Boyang Xue, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu, and Kam-Fai Wong. 2025c. Explore to evolve: Scaling evolved aggregation logic via proactive online exploration for deep research agents. Preprint.
Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. 2024d. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. In Forty-first International Conference on Machine Learning.
Xinlong Wang, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Zhen Li, Yuqi Wang, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Chunlei Men, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, and 7 others. 2026c. Multimodal learning with next-token prediction for large multimodal models. Nature.
Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, and 6 others. 2024e. Emu3: Next-token prediction is all you need. Preprint.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023b. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
Yi Wang, Zhenting Huang, Zhaohan Ding, Ruoxue Liao, Yuan Huang, Xinzijian Liu, Jiajun Xie, Siheng Chen, and Linfeng Zhang. 2026d. Deploy-master: Automating the deployment of 50,000+ agent-ready scientific tools in one day. Preprint.
Yizhou Wang, Chen Tang, Han Deng, Jiabei Xiao, Jiaqi Liu, Jianyu Wu, Jun Yao, Pengze Li, Encheng Su, Lintao Wang, Guohang Zhuang, Yuchen Ren, Ben Fei, Ming Hu, Xin Chen, Dongzhan Zhou, Junjun He, Xiangyu Yue, Zhenfei Yin, and 13 others. 2025d. Scireasoner: Laying the scientific reasoning ground across disciplines. Preprint.
Zhiyuan Wang, Bokui Chen, Yinya Huang, Qingxing Cao, Ming He, Jianping Fan, and Xiaodan Liang. 2025e. ORMind: A cognitive-inspired end-to-end reasoning framework for operations research. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 104–131.
Joseph L. Watson, David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, Andrew J. Borst, Robert J. Ragotte, Lukas F. Milles, Basile I. M. Wicky, Nikita Hanikel, Samuel J. Pellock, Alexis Courbet, William Sheffler, Jue Wang, Preetham Venkatesh, Isaac Sappington, Susana Vázquez Torres, and 9 others. 2023. De novo design of protein structure and function with rfdiffusion. Nature, 620(7976):1089–1100.
Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. 2025a. Browsecomp: A simple yet challenging benchmark for browsing agents. Preprint.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022a. Emergent abilities of large language models. Transactions on Machine Learning Research.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022b. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22.
Jiaqi Wei, Yuejin Yang, Xiang Zhang, Yuhan Chen, Xiang Zhuang, Zhangyang Gao, Dongzhan Zhou, Guangshuai Wang, Zhiqiang Gao, Juntai Cao, Zijie Qiu, Ming Hu, Chenglong Ma, Shixiang Tang, Junjun He, Chunfeng Song, Xuming He, Qiang Zhang, Chenyu You, and 8 others. 2025b. From ai for science to agentic science: A survey on autonomous scientific discovery. Preprint, arXiv:2508.14111.
Zichen Wen, Boxue Yang, Shuang Chen, Yaojie Zhang, Yuhang Han, Junlong Ke, Cong Wang, Yicheng Fu, Jiawang Zhao, Jiangchao Yao, Xi Fang, Zhen Wang, Henxing Cai, Lin Yao, Zhifeng Gao, Yanhui Hong, Nang Yuan, Yixuan Li, Guojiang Zhao, and 15 others. 2026. Innovator-vl: A multimodal large language model for scientific discovery. Preprint.
Jiayi Weng, Huayu Chen, Dong Yan, Kaichao You, Alexis Duburcq, Minghao Zhang, Yi Su, Hang Su, and Jun Zhu. 2022. Tianshou: A highly modularized deep reinforcement learning library. Journal of Machine Learning Research, 23(267):1–6.
Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and 1 others. 2025a. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977.
Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, and Yejin Choi. 2026. Deepsearch: Overcome the bottleneck of reinforcement learning with verifiable rewards via monte carlo tree search. In The Fourteenth International Conference on Learning Representations.
Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhenglin Wang, Zhengwei Tao, Ding-Chu Zhang, Zekun Xi, Xiangru Tang, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. 2025b. Webdancer: Towards autonomous information seeking agency. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. 2025c. WebWalker: Benchmarking LLMs in web traversal. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10290–10305.
Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. 2025d. Agentic reasoning: A streamlined framework for enhancing LLM reasoning with agentic tools. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28489–28503.
Tailin Wu and Max Tegmark. 2019. Toward an artificial intelligence physicist for unsupervised learning. Phys. Rev. E, 100:033311.
Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, and Weinan Zhang. 2025. A survey of llm-based deep search agents: Paradigm, optimization, evaluation, and challenges. Preprint.
Ting Xiao, Lei Shi, Yang Zhang, HaoFeng Yang, Zhe Wang, and Chenjia Bai. 2025. Online iterative self-alignment for radiology report generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27799–27814.
Core T eam Xiaomi, Bangjun Xiao, Bingquan Xia, Bo Y ang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang W ang, Gang Xie, Hailin Zhang, Hanglong Lv , Hanyu Li, Heyu Chen, Hongshen Xu, Houbin Zhang, Huaqiu Liu, and 107 others. 2026. Mimo-v2-flash technical report . Pr eprint , Jinheng Xie, W eijia Mao, Zechen Bai, David Junhao Zhang, W eihao W ang, Ke vin Qinghong Lin, Y uchao Gu, Zhijie Chen, Zhenheng Y ang, and Mike Zheng Shou. 2025. Show-o: One single transformer to unify multimodal understanding and generation. In The Thirteenth International Confer ence on Learning Repr esentations . Liang Xu, Anqi Li, Lei Zhu, Hang Xue, Changtai Zhu, Kangkang Zhao, Haonan He, Xuanwei Zhang, Qiyue Kang, and Zhenzhong Lan. 2023. Superclue: A com- prehensiv e chinese large language model benchmark . Pr eprint , Renjun Xu and Jingwen Peng. 2025. A comprehensiv e surve y of deep research: Systems, methodologies, and applications . Pr eprint , T ianze Xu, Pengrui Lu, L yumanshan Y e, Xiangkun Hu, and Pengfei Liu. 2025a. Researcherbench: Ev alu- ating deep ai research systems on the frontiers of scientific inquiry . Pr eprint , W eize Xu, Erwin Poussi, Quan Zhong, Zehua Zeng, Christopher Zou, Xuehai W ang, Y ifan Lu, Miao Cui, Daiji Okamura, Cinlong Huang, Jiayuan Ding, Zhe Zhao, Y uheng Y ang, Xinhai Pan, V arshini V ijay , Naoki K onno, Nianping Liu, Lei Li, X. Rosa Ma, and 14 others. 2026a. Pantheonos: An ev olvable multi- agent frame work for automatic genomics discov ery . bioRxiv . Y inuo Xu, Shuo Lu, Jianjie Cheng, Meng W ang, Qian- long Xie, Xingxing W ang, Ran He, and Jian Liang. 2026b. How to train your deep research agent? prompt, rew ard, and policy optimization in search-r1 . Pr eprint , Zhijian Xu, Y ilun Zhao, Manasi Patwardhan, Lov ekesh V ig, and Arman Cohan. 2025b. Can LLMs iden- tify critical limitations within scientific research? a systematic e valuation on AI research papers. 
In Pr o- ceedings of the 63r d Annual Meeting of the Associa- tion for Computational Linguistics (V olume 1: Long P apers) , pages 20652–20706. Junxi Y an, Zixi W ei, Jingtao Zhan, Qingyao Ai, and Y iqun LIU. 2026. What scales in cross-entropy scal- ing law? In The F ourteenth International Confer ence on Learning Repr esentations . Linyi Y ang and Y ixuan W eng. 2025. ResearStudio: A human-intervenable framew ork for building con- trollable deep research agents. In Pr oceedings of the 2025 Confer ence on Empirical Methods in Nat- ural Langua ge Pr ocessing: System Demonstrations , pages 896–905. Songhua Y ang, Hanjie Zhao, Senbin Zhu, Guangyu Zhou, Hongfei Xu, Y uxiang Jia, and Hongying Zan. 2024. Zhongjing: Enhancing the chinese medical capabilities of lar ge language model through e xpert feedback and real-w orld multi-turn dialogue. Pr o- ceedings of the AAAI Conference on Artificial Intelli- gence , 38(17):19368–19376. Shunyu Y ao, Dian Y u, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Y uan Cao, and Karthik Narasimhan. 2023a. Tree of thoughts: deliberate problem solving with large language models. In Pr o- ceedings of the 37th International Conference on Neural Information Pr ocessing Systems , NIPS ’23. Shunyu Y ao, Jeffre y Zhao, Dian Y u, Nan Du, Izhak Shafran, Karthik Narasimhan, and Y uan Cao. 2023b. React: Syner gizing reasoning and acting in language models . Pr eprint , Y ang Y ao, Y ixu W ang, Y uxuan Zhang, Y i Lu, Tianle Gu, Lingyu Li, Dingyi Zhao, Keming W u, Haozhe W ang, Ping Nie, Y an T eng, and Y ingchun W ang. 2025. A rigorous benchmark with multidimensional ev aluation for deep research agents: From answers to reports . Pr eprint , Y i Y ao, He Zhu, Piaohong W ang, Jincheng Ren, Xin- long Y ang, Qianben Chen, Xiaowan Li, Dingfeng Shi, Jiaxian Li, Qiexiang W ang, Sinuo W ang, Xin- peng Liu, Jiaqi W u, Minghao Liu, and W angchunshu Zhou. 2026. O-researcher: An open ended deep re- search model via multi-agent distillation and agentic rl . 
Pr eprint , Jie Y ing, Zihong Chen, Zhefan W ang, W anli Jiang, Chenyang W ang, Zhonghang Y uan, Haoyang Su, Huanjun K ong, Fan Y ang, and Nanqing Dong. 2025. SeedBench: A multi-task benchmark for e valuating large language models in seed science. In Pr oceed- ings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long P a- pers) , pages 31395–31449. T ianhao Y u, Haiyang Cui, Jianan Canal Li, Y unan Luo, Guangde Jiang, and Huimin Zhao. 2023. Enzyme function prediction using contrastiv e learning. Sci- ence , 379(6639):1358–1363. 35 Y ipeng Y u. 2016. Cybor g Intelligent Systems Based on Brain-machine Inte gration: Researc h on Pr oto- types and Behavioral V erification . PhD Dissertation, Zhejiang Univ ersity , Hangzhou, China. Y ipeng Y u, Ran Guan, Jie Ma, Zhuoxuan Jiang, and Jingchang Huang. 2020. When and who? con- versation transition based on bot-agent symbiosis learning network. In Pr oceedings of the 28th Inter- national Confer ence on Computational Linguistics , pages 4056–4066. Y ipeng Y u, Gang Pan, Y ongyue Gong, Kedi Xu, Neng- gan Zheng, W eidong Hua, Xiaoxiang Zheng, and Zhaohui W u. 2016. Intelligence-augmented rat c y- borgs in maze solving. PLOS ONE , 11(2):1–18. Mert Y uksekgonul, Daniel K oceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong W ang, Jan Kautz, Y ejin Choi, James Zou, Carlos Guestrin, and Y u Sun. 2026. Learning to discov er at test time. arXiv pr eprint arXiv:2601.16175 . Claudio Zeni, Robert Pinsler, Daniel Zügner , Andrew Fo wler , Matthew Horton, Xiang Fu, Zilong W ang, Aliaksandra Shysheya, Jonathan Crabbé, Shoko Ueda, Roberto Sordillo, Lixin Sun, Jake Smith, Bich- lien Nguyen, Hannes Schulz, Sarah Lewis, Chin-W ei Huang, Ziheng Lu, Y ichi Zhou, and 7 others. 2025. A generati ve model for inorganic materials design. Natur e , 639(8055):624–632. 
Dingling Zhang, He Zhu, Jincheng Ren, Kangqi Song, Xinran Zhou, Boyu Feng, Shudong Liu, Jiabin Luo, W eihao Xie, Zhaohui W ang, Tianrui Qin, King Zhu, Y uqing W ang, Qianben Chen, Y uchen Eleanor Jiang, W ei W ang, Jiaheng Liu, and W angchunshu Zhou. 2025a. Ho w far are we from genuinely useful deep research agents? Pr eprint , He Zhang, Hailong Liu, Y ushan Xu, Haoran Huang, Y iming Liu, Jia W ang, Y an Qin, Haiyan W ang, Lili Ma, Zhiyuan Xun, Xuzhuang Hou, T imothy K. Lu, and Jicong Cao. 2025b. Deep generative models design mrna sequences with enhanced translational capacity and stability . Science , 390(6773):eadr8470. He Zhang, Liang Zhang, Ang Lin, Congcong Xu, Ziyu Li, Kaibo Liu, Boxiang Liu, Xiaopin Ma, Fanfan Zhao, Huiling Jiang, Chunxiu Chen, Haifa Shen, Hangwen Li, David H. Mathe ws, Y ujian Zhang, and Liang Huang. 2023a. Algorithm for optimized mrna design improv es stability and immunogenicity . Na- tur e , 621(7978):396–403. Kai Zhang, Rong Zhou, Eashan Adhikarla, Zhiling Y an, Y ixin Liu, Jun Y u, Zhengliang Liu, Xun Chen, Brian D Davison, Hui Ren, and 1 others. 2024. A gen- eralist vision–language foundation model for div erse biomedical tasks. Natur e Medicine , 30(11):3129– 3141. Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023b. Adding conditional control to text-to-image diffusion models. In 2023 IEEE/CVF International Confer ence on Computer V ision (ICCV) , pages 3813– 3824. Pengsong Zhang, Xiang Hu, Guo wei Huang, Y ang Qi, Heng Zhang, Xiuxu Li, Jiaxing Song, Jiabin Luo, Y ijiang Li, Shuo Y in, Chengxiao Dai, Eric Hanchen Jiang, Xiaoyan Zhou, Zhenfei Y in, Boqin Y uan, Jing Dong, Guinan Su, Guanren Qiao, Haiming T ang, and 4 others. 2025c. aixiv: A next-generation open access ecosystem for scientific disco very generated by ai scientists . Pr eprint , Shuo Zhang, Chaof a Y uan, Ryan Guo, Xiaomin Y u, Rui Xu, Zhangquan Chen, Zinuo Li, Zhi Y ang, Shuhao Guan, Zhenheng T ang, Sen Hu, Liwen Zhang, Rong- hao Chen, and Huacan W ang. 2026a. 
Evofsm: Con- trollable self-e volution for deep research with finite state machines . Pr eprint , W enlin Zhang, Xiaopeng Li, Y ingyi Zhang, Pengyue Jia, Y ichao W ang, Huifeng Guo, Y ong Liu, and Xiangyu Zhao. 2025d. Deep research: A sur- ve y of autonomous research agents . Pr eprint , Xuan Zhang, Limei W ang, Jacob Helwig, Y ouzhi Luo, Cong Fu, Y aochen Xie, Meng Liu, Y uchao Lin, Zhao Xu, K eqiang Y an, Keir Adams, Maurice W eiler, Xiner Li, T ianfan Fu, Y ucheng W ang, Alex Strasser , Haiyang Y u, Y uQing Xie, Xiang Fu, and 44 others. 2025e. Artificial intelligence for science in quantum, atomistic, and continuum systems. F oundations and T rends in Mac hine Learning , 18(4):385–912. Y ongjun Zhang. 2026. V ibe researching as w olf coming: Can ai agents with skills replace or augment social scientists? Pr eprint , Y u Zhang, Y ang Han, Shuai Chen, Ruijie Y u, Xin Zhao, Xianbin Liu, Kaipeng Zeng, Mengdi Y u, Jidong T ian, Feng Zhu, and 1 others. 2025f. Large language mod- els to accelerate organic chemistry synthesis. Natur e Machine Intelligence , pages 1–13. Y uan-Hang Zhang and Massimiliano Di V entra. 2023. T ransformer quantum state: A multipurpose model for quantum many-body problems. Phys. Rev . B , 107:075147. Y unfan Zhang, Kathleen McK eown, and Smaranda Muresan. 2026b. Li vene wsbench: Ev aluating llm web search capabilities with freshly curated news . Pr eprint , Ke yu Zhao, W eiquan Lin, Qirui Zheng, Fengli Xu, and Y ong Li. 2025a. Deep ideation: Designing llm agents to generate novel research ideas on scientific concept network . Pr eprint , W eike Zhao, Chaoyi W u, Y anjie Fan, Pengcheng Qiu, Xiaoman Zhang, Y uze Sun, Xiao Zhou, Shuju Zhang, Y u Peng, Y anfeng W ang, Xin Sun, Y a Zhang, Y ong- guo Y u, Kun Sun, and W eidi Xie. 2026. An agentic system for rare disease diagnosis with traceable rea- soning. Natur e . 
Y ilun Zhao, Kaiyan Zhang, T iansheng Hu, Sihong W u, Ronan Le Bras, Y ixin Liu, Xiangru T ang, Joseph Chee Chang, Jesse Dodge, Jonathan Bragg, 36 Chen Zhao, Hannaneh Hajishirzi, Doug Do wney , and Arman Cohan. 2025b. Sciarena: An open e valua- tion platform for non-verifiable scientific literature- grounded tasks . In The Thirty-ninth Annual Con- fer ence on Neural Information Pr ocessing Systems Datasets and Benchmarks T rac k . Chuanyang Zheng, Y ihang Gao, Han Shi, Minbin Huang, Jingyao Li, Jing Xiong, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, and Y u Li. 2024a. Dape: data-adaptive positional encoding for length extrapolation. In Pr oceedings of the 38th In- ternational Confer ence on Neural Information Pr o- cessing Systems , NIPS ’24. Qiaoyu Zheng, W eike Zhao, Chao yi W u, Xiaoman Zhang, Lisong Dai, Hengyu Guan, Y uehua Li, Y a Zhang, Y anfeng W ang, and W eidi Xie. 2024b. Large-scale long-tailed disease diagnosis on radiol- ogy images. Natur e Communications , 15(1):10147. T ianshi Zheng, Zheye Deng, Hong T ing Tsang, W eiqi W ang, Jiaxin Bai, Zihao W ang, and Y angqiu Song. 2025a. From automation to autonomy: A surv ey on large language models in scientific disco very . In Pr o- ceedings of the 2025 Conference on Empirical Meth- ods in Natural Langua ge Pr ocessing , pages 17733– 17750. T ianshi Zheng, Kelvin Kiu W ai T am, Newt Nguyen Kim Hue Nam, Baixuan Xu, Zhao wei W ang, Cheng Jiayang, Hong T ing Tsang, W eiqi W ang, Jiaxin Bai, T ianqing Fang, Y angqiu Song, Ginny W ong, and Simon See. 2026. Newtonbench: Benchmarking gen- eralizable scientific law discov ery in LLM agents . In The F ourteenth International Confer ence on Learn- ing Repr esentations . Y uxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, L yumanshan Y e, Pengrui Lu, and Pengfei Liu. 2025b. DeepResearcher: Scaling deep research via reinforce- ment learning in real-world environments. 
In Pr o- ceedings of the 2025 Conference on Empirical Meth- ods in Natural Langua ge Pr ocessing , pages 414–431. Y uxiang Zheng, Shichao Sun, Lin Qiu, Dongyu Ru, Cheng Jiayang, Xuefeng Li, Jifan Lin, Binjie W ang, Y un Luo, Renjie Pan, Y ang Xu, Qingkai Min, Zizhao Zhang, Y iwen W ang, W enjie Li, and Pengfei Liu. 2024c. OpenResearcher: Unleashing AI for acceler - ated scientific research. In Pr oceedings of the 2024 Confer ence on Empirical Methods in Natural Lan- guage Pr ocessing: System Demonstrations , pages 209–218. Guangfeng Zhou, Domnita-V aleria Rusnac, Hahnbeom Park, Daniele Canzani, Hai Minh Nguyen, Lance Stew art, Matthew F . Bush, Phuong Tran Nguyen, Heike W ulff, Vladimir Y arov-Y aro voy , Ning Zheng, and Frank DiMaio. 2024. An artificial intelligence accelerated virtual screening platform for drug dis- cov ery . Natur e Communications , 15(1):7761. Heng Zhou, Ao Y u, Y uchen F an, Jianing Shi, Li Kang, Hejia Geng, Y ongting Zhang, Y utao Fan, Y uhao W u, T iancheng He, Y iran Qin, Lei Bai, and Zhenfei Y in. 2025a. Liv esearchbench: An automatically con- structed benchmark for retrie val and reasoning ov er dynamic knowledge . Pr eprint , Junting Zhou, W ang Li, Y iyan Liao, Nengyuan Zhang, T ingjia Miao, Zhihui Qi, Y uhan W u, and T ong Y ang. 2025b. Scholarsearch: Benchmarking scholar search- ing ability of llms . Pr eprint , Peilin Zhou, Bruce Leon, Xiang Y ing, Can Zhang, Y ifan Shao, Qichen Y e, Dading Chong, Zhiling Jin, Chenx- uan Xie, Meng Cao, Y uxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, and Y ining Hua. 2025c. Browsecomp-zh: Benchmarking web browsing abil- ity of large language models in chinese . Pr eprint , Y uanchang Zhou, Siyu Hu, Xiangyu Zhang, Hongyu W ang, Guangming T an, and W eile Jia. 2026. Ma- tRIS: T o ward reliable and efficient pretrained ma- chine learning interaction potentials. In The F our- teenth International Confer ence on Learning Repre- sentations . 
Chiwei Zhu, Benfeng Xu, Mingxuan Du, Shaohan W ang, Xiaorui W ang, Zhendong Mao, and Y ongdong Zhang. 2026. Fs-researcher: T est-time scaling for long-horizon research tasks with file-system-based agents . Pr eprint , Xueyan Zou, Jianglong Y e, Hao Zhang, Xiaoyu Xi- ang, Mingyu Ding, Zhaojing Y ang, Y ong Jae Lee, Zhuowen T u, Sifei Liu, and Xiaolong W ang. 2025. Real deep research for ai, robotics and beyond . Pr eprint , 37