LLM-Based Multi-Agent Blackboard System for Information Discovery in Data Science
Advances in large language models (LLMs) have created new opportunities in data science, but their deployment is often limited by the challenge of finding relevant data in large data lakes. Existing methods struggle here: both single- and multi-agent systems are quickly overwhelmed by large, heterogeneous file collections, and master-slave multi-agent systems rely on a rigid central controller that requires precise knowledge of each sub-agent's capabilities, which is unrealistic in large-scale settings where the main agent lacks full observability over sub-agents' knowledge and competencies. We propose a novel multi-agent paradigm inspired by the blackboard architecture from classical AI. In our framework, a central agent posts requests to a shared blackboard, and autonomous subordinate agents, each responsible for a partition of the data lake or for retrieval from the web, volunteer to respond based on their capabilities. This design improves scalability and flexibility by removing the need for a central coordinator to know each agent's expertise or internal knowledge. We evaluate the approach on three benchmarks that require data discovery: KramaBench and modified versions of DSBench and DA-Code. Results show that the blackboard architecture substantially outperforms strong baselines, achieving 13%-57% relative improvements in end-to-end success and up to a 9% relative gain in data-discovery F1 over the best baseline.
💡 Research Summary
The paper tackles the often‑overlooked bottleneck of data discovery in large, heterogeneous data lakes when using large language models (LLMs) for data‑science tasks. Traditional approaches either give a single LLM access to the entire file set—running into context‑length limits and scalability issues—or adopt a master‑slave multi‑agent architecture where a central controller must know the exact capabilities of each subordinate agent in order to assign tasks. Both paradigms break down as the number of files grows into the thousands and as the diversity of file formats and domains increases.
To overcome these limitations, the authors introduce a blackboard‑based multi‑agent system inspired by classic AI architectures from the 1980s. The system consists of a main agent (π_m) responsible for solving the user query, a set of file agents (π_f_i) each managing a cluster of files, and an optional web‑search agent (π_s) for external knowledge. The data lake is first partitioned into C clusters; in the default implementation clustering is performed purely on file names using Gemini‑2.5‑Pro, but the authors also demonstrate that embedding‑based clustering (E5 + K‑Means) yields comparable results, showing the method’s flexibility.
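The embedding-based clustering variant can be sketched as follows. This is a minimal, self-contained illustration, not the paper's code: a toy character-bigram hash stands in for real E5 text embeddings, and a small pure-Python K-Means replaces a library implementation; the function names (`embed`, `partition_data_lake`) are hypothetical.

```python
import math
import random

def embed(name):
    # Toy embedding: character-bigram hashing stands in for a real
    # E5 text-embedding model (an assumption for illustration).
    vec = [0.0] * 16
    for a, b in zip(name, name[1:]):
        vec[(ord(a) * 31 + ord(b)) % 16] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def kmeans(points, k, iters=20, seed=0):
    # Minimal Lloyd's algorithm over lists of floats.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers

def partition_data_lake(filenames, k):
    # Assign each file to one of k clusters; each cluster would then
    # be handed to its own file agent π_f_i.
    vecs = {f: embed(f) for f in filenames}
    centers = kmeans(list(vecs.values()), k)
    return {
        f: min(range(k),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
        for f, v in vecs.items()
    }

files = ["sales_2021.csv", "sales_2022.csv",
         "weather_station_a.json", "weather_station_b.json"]
groups = partition_data_lake(files, k=2)
```

In the default configuration the paper instead asks Gemini-2.5-Pro to group files by name directly; the point of the experiment is that either partitioning strategy works, so the clustering step is pluggable.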
The main agent operates within the ReAct framework, iteratively selecting actions from a predefined set: Planning, Reasoning, Executing Code, Requesting Help, and Answering. When the “Requesting Help” action is chosen, the agent posts a natural‑language request on a shared blackboard β without addressing any specific sub‑agent. All helper agents continuously monitor β; if an agent determines that it possesses the required knowledge—either because its file cluster contains relevant data or because it can retrieve needed information from the web—it writes a response on a separate response board β_r. The main agent then consumes these responses as observations and decides the next step. This design eliminates the need for a central controller to maintain an up‑to‑date model of each agent’s expertise, allowing agents to join or abstain autonomously based on their current capabilities and resources.
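The request/response protocol above can be sketched as a small simulation. This is a hedged illustration of the blackboard idea only: the class and method names are invented, and a keyword match stands in for the LLM call each helper would use to judge its own relevance.

```python
from dataclasses import dataclass, field

@dataclass
class Blackboard:
    requests: list = field(default_factory=list)   # the shared board β
    responses: list = field(default_factory=list)  # the response board β_r

class FileAgent:
    """A helper agent owning one cluster of files (illustrative interface)."""

    def __init__(self, name, keywords):
        self.name = name
        # Keyword matching stands in for an LLM relevance judgment.
        self.keywords = keywords

    def maybe_respond(self, board):
        # Monitor β; volunteer on β_r only for requests this agent can serve.
        for req in board.requests:
            if any(k in req.lower() for k in self.keywords):
                board.responses.append((self.name, f"I can help with: {req}"))

def request_help(board, agents, request):
    # The main agent posts to β without addressing any specific sub-agent;
    # helpers self-select, so no central model of agent expertise is needed.
    board.requests.append(request)
    for agent in agents:
        agent.maybe_respond(board)
    return board.responses

board = Blackboard()
agents = [
    FileAgent("climate-cluster", ["temperature", "rainfall"]),
    FileAgent("finance-cluster", ["revenue", "sales"]),
]
replies = request_help(board, agents, "Which file has monthly rainfall totals?")
```

Here only the climate cluster volunteers and the finance agent abstains, which is exactly the self-selection behavior that replaces master-slave task assignment.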
The authors evaluate the approach on three benchmarks that explicitly require data discovery: KramaBench (a newly released benchmark focused on data‑science queries that need file identification), and modified versions of DSBench and DA‑Code where a data‑discovery component was added. Baselines include Retrieval‑Augmented Generation (RAG), a conventional master‑slave multi‑agent system, and state‑of‑the‑art data‑science pipelines. Across all settings, the blackboard system achieves relative end‑to‑end success improvements ranging from 13% to 57% over the best baseline. Moreover, the F1 score for correctly identifying relevant files improves by up to 9% compared to the strongest baseline. These gains hold for both proprietary LLMs (Gemini‑2.5‑Pro) and open‑source models, underscoring the method's model‑agnostic nature.
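The data-discovery F1 metric compares the set of files a system actually used against the ground-truth set of relevant files. A short sketch of the standard computation (the function name is ours, not the benchmarks'):

```python
def discovery_f1(predicted, gold):
    # F1 over predicted vs. ground-truth relevant files (set semantics).
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Two of three predicted files are correct, and one gold file is missed:
# precision = recall = 2/3, so F1 = 2/3.
score = discovery_f1(["a.csv", "b.csv", "c.csv"],
                     ["a.csv", "b.csv", "d.csv"])
```

Because both missing a needed file and pulling in irrelevant ones hurt downstream code generation, F1 is a natural summary of discovery quality alongside end-to-end success.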
Key contributions of the work are:
- Novel Paradigm – Introducing a blackboard communication model for LLM‑driven multi‑agent systems that shifts task routing from a central scheduler to a distributed, self‑selecting process.
- Scalability and Flexibility – By partitioning the data lake and allowing agents to volunteer only when they can contribute, the system scales to thousands of files without overloading any single LLM’s context window.
- Autonomy and Robustness – Agents retain full autonomy, reducing the risk of assigning tasks to agents lacking the necessary knowledge or operating on outdated information.
- Empirical Validation – Comprehensive experiments on three benchmarks demonstrate consistent performance gains in both overall problem‑solving success and precise data‑source identification.
- Generalizability – The approach works with different clustering strategies, various LLM back‑ends, and can be extended to dynamic re‑clustering, meta‑learning for request routing, and security‑aware access controls.
In summary, the blackboard‑based multi‑agent framework provides a practical, scalable solution for the data‑discovery phase of data‑science pipelines, enabling LLMs to operate effectively in real‑world, large‑scale data environments. Future work may explore adaptive clustering, richer inter‑agent negotiation protocols, and integration with privacy‑preserving data governance mechanisms.