Cross-Disciplinary Knowledge Retrieval and Synthesis: A Compound AI Architecture for Scientific Discovery
📝 Abstract
The exponential growth of scientific knowledge has created significant barriers to cross-disciplinary knowledge discovery, synthesis, and research collaboration. In response to this challenge, we present BioSage, a novel compound AI architecture that integrates Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG), orchestrated specialized agents, and tools to enable discoveries across AI, data science, biomedical, and biosecurity domains. Our system features several specialized agents: a retrieval agent with query planning and response synthesis that enables knowledge retrieval across domains with citation-backed responses, cross-disciplinary translation agents that align specialized terminology and methodologies, and reasoning agents that synthesize domain-specific insights with transparency, traceability, and usability. We demonstrate the effectiveness of BioSage through a rigorous evaluation on scientific benchmarks (LitQA2, GPQA, WMDP, HLE-Bio) and introduce a new cross-modal benchmark for biology and AI, showing that BioSage agents powered by Llama 3.1 70B and GPT-4o models outperform vanilla and RAG approaches by 13%-21%. We perform causal investigations into compound AI system behavior and report significant performance improvements from adding RAG and agents to the vanilla models. Unlike other systems, our solution is driven by user-centric design principles and orchestrates specialized user-agent interaction workflows supporting scientific activities including, but not limited to, summarization, research debate, and brainstorming. Our ongoing work focuses on multimodal retrieval and reasoning over charts, tables, and structured scientific data, along with developing comprehensive multimodal benchmarks for cross-disciplinary discovery. Our compound AI solution demonstrates significant potential for accelerating scientific advancement by reducing barriers between traditionally siloed domains.
📄 Content
Cross-Disciplinary Knowledge Retrieval and Synthesis: A Compound AI Architecture for Scientific Discovery
Svitlana Volkova, Peter Bautista, Avinash Hiriyanna, Gabriel Ganberg, Isabel Erickson, Zachary Klinefelter, Nick Abele, Hsien-Te Kao, Grant Engberson
Aptima, Inc., Woburn, MA 01801
Abstract
The exponential growth of scientific knowledge has created significant barriers to cross-disciplinary knowledge discovery, synthesis, and research collaboration. In response to this challenge, we present BioSage, a novel compound AI architecture that integrates Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG), orchestrated specialized agents, and tools to enable discoveries across AI, data science, biomedical, and biosecurity domains. Our system features several specialized agents: a retrieval agent with query planning and response synthesis that enables knowledge retrieval across domains with citation-backed responses, cross-disciplinary translation agents that align specialized terminology and methodologies, and reasoning agents that synthesize domain-specific insights with transparency, traceability, and usability. We demonstrate the effectiveness of BioSage through a rigorous evaluation on scientific benchmarks (LitQA2, GPQA, WMDP, HLE-Bio) and introduce a new cross-modal benchmark for biology and AI, showing that BioSage agents powered by Llama 3.1 70B and GPT-4o models outperform vanilla and RAG approaches by 13%-21%. We perform causal investigations into compound AI system behavior and report significant performance improvements from adding RAG and agents to the vanilla models. Unlike other systems, our solution is driven by user-centric design principles and orchestrates specialized user-agent interaction workflows supporting scientific activities including, but not limited to, summarization, research debate, and brainstorming.
Our ongoing work focuses on multimodal retrieval and reasoning over charts, tables, and structured scientific data, along with developing comprehensive multimodal benchmarks for cross-disciplinary discovery. Our compound AI solution demonstrates significant potential for accelerating scientific advancement by reducing barriers between traditionally siloed domains.
39th Conference on Neural Information Processing Systems (NeurIPS 2025). arXiv:2511.18298v1 [cs.AI] 23 Nov 2025
1 Introduction
The extreme growth of scientific knowledge Bornmann et al. (2021) has transformed modern research into an increasingly challenging landscape for individual researchers to navigate. With over one million new papers published annually, alongside vast repositories of domain-specific resources such as chemical structures, medical imaging libraries, and specialized ontologies, researchers face unprecedented difficulty in synthesizing insights across disciplinary boundaries Wang et al. (2023); Park et al. (2023). This information explosion has created significant barriers to cross-disciplinary collaboration and knowledge discovery, despite evidence that scientific progress increasingly depends on bridging traditionally siloed domains Guimerà et al. (2020). Recent advances in artificial intelligence have demonstrated potential to address these challenges, with domain-specific applications showing promise in areas such as materials science Gómez-Bombarelli et al. (2016) and drug discovery Sadybekov et al. (2022). However, existing approaches primarily focus on information retrieval within siloed domains rather than facilitating meaningful knowledge synthesis across disciplinary boundaries.
To address these challenges, we present a novel compound AI system that integrates LLMs, RAG, knowledge graphs (KGs), specialized agents, and tools to enable breakthrough discoveries across AI, computer science and engineering, biomedical, and biosecurity domains, ultimately helping to bridge the communication gap and manage the overwhelming volume of cross-disciplinary declarative (“knowing what”), procedural (“knowing how”), and conditional (“knowing when and why”) scientific knowledge Anderson (2013); Ryle & Tanney (2009); Squire (2004); Star & Stylianides (2013).
The user-centric design of the BioSage compound AI system enables intuitive and transparent user-agent interactions through an interpretable conversational interface. As shown in Figure 1, the system incorporates several key features that enhance scientific workflows. When a user poses a cross-disciplinary question, the system selects specialized agents and knowledge corpora to address the query. Throughout the interaction, intermediate agent steps are explicitly explained to users, providing full transparency into the system’s reasoning process Dedhe et al. (2023); Fletcher & Carruthers (2012). Beyond simply answering queries, the system can represent synthesized information in structured formats that highlight capabilities, promises, and gaps. The system maintains conversation context through agent memory, allowing