NotebookRAG: Retrieving Multiple Notebooks to Augment the Generation of EDA Notebooks for Crowd-Wisdom

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

High-quality exploratory data analysis (EDA) is essential in the data science pipeline, but remains highly dependent on analysts’ expertise and effort. While recent LLM-based approaches partially reduce this burden, they struggle to generate effective analysis plans and appropriate insights and visualizations when user intent is abstract. Meanwhile, a vast collection of analysis notebooks produced across platforms and organizations contains rich analytical knowledge that can potentially guide automated EDA. Retrieval-augmented generation (RAG) provides a natural way to leverage such corpora, but general methods often treat notebooks as static documents and fail to fully exploit their potential knowledge for automating EDA. To address these limitations, we propose NotebookRAG, a method that takes user intent, datasets, and existing notebooks as input to retrieve, enhance, and reuse relevant notebook content for automated EDA generation. For retrieval, we transform code cells into context-enriched executable components, which improve retrieval quality and enable rerun with new data to generate updated visualizations and reliable insights. For generation, an agent leverages enhanced retrieval content to construct effective EDA plans, derive insights, and produce appropriate visualizations. Evidence from a user study with 24 participants confirms the superiority of our method in producing high-quality and intent-aligned EDA notebooks.


💡 Research Summary

NotebookRAG introduces a retrieval‑augmented generation (RAG) framework that leverages multiple existing computational notebooks to automatically produce high‑quality exploratory data analysis (EDA) notebooks aligned with a user’s intent. The system accepts three inputs: (1) a tabular dataset, (2) a collection of notebooks that have previously analyzed the same or closely related data source, and (3) a natural‑language description of the analytical goal (e.g., “prepare data for time‑series forecasting”).

In the retrieval stage, each notebook is decomposed into code and markdown cells. Code cells undergo static analysis via abstract syntax trees (AST) to extract metadata such as the data columns referenced, the visualization libraries used, and the chart types produced. This information is used to transform each cell into an “executable component” annotated with its column dependencies. The components are indexed by column names, enabling fast lookup when a user query is mapped to a set of column‑based EDA sub‑queries (e.g., “plot average price by region”). Because the components are executable, they can be re‑run on the new dataset, automatically generating up‑to‑date visualizations and statistical results.
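The AST-based metadata extraction described above can be sketched with Python's standard `ast` module. This is a minimal illustration, not the paper's implementation: it collects column names from `df["col"]`-style subscripts and from string arguments to a small assumed set of pandas methods (`PANDAS_COLUMN_METHODS` is a hypothetical whitelist).

```python
import ast

# Hypothetical whitelist of pandas methods whose string arguments
# name columns; the paper's actual analysis is more general.
PANDAS_COLUMN_METHODS = {"groupby", "sort_values"}

def extract_columns(code: str) -> set:
    """Statically collect the data columns a code cell references."""
    cols = set()
    for node in ast.walk(ast.parse(code)):
        # df["price"]-style subscript access (Python 3.9+ AST shape)
        if isinstance(node, ast.Subscript):
            key = node.slice
            if isinstance(key, ast.Constant) and isinstance(key.value, str):
                cols.add(key.value)
        # df.groupby("region")-style method calls
        elif isinstance(node, ast.Call):
            fn = node.func
            if isinstance(fn, ast.Attribute) and fn.attr in PANDAS_COLUMN_METHODS:
                for arg in node.args:
                    if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                        cols.add(arg.value)
    return cols

cell = 'ax = df.groupby("region")["price"].mean().plot(kind="bar")'
print(extract_columns(cell))  # {'region', 'price'}
```

In a full system these extracted columns would become the index keys for the executable components, so a sub-query mentioning "price" and "region" can find this cell directly.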

The generation stage is driven by an LLM‑based agent. The agent first translates the high‑level user intent into a sequence of concrete EDA queries, then retrieves the relevant executable components, re‑executes them, and incorporates the fresh outputs into a coherent notebook. For insight extraction, NotebookRAG adopts a hybrid approach: visualizations are first fed to a vision‑language model (VLM) to produce natural‑language captions, and then a language model generates statistical code that validates and refines these captions, reducing hallucination and ensuring factual correctness. The final product is an executable Jupyter‑style notebook containing code cells, updated visualizations, and markdown explanations that can be directly edited by analysts.
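The retrieve-then-re-execute loop can be illustrated with a toy sketch. Everything here is an assumption for illustration: the `Component` schema, the Jaccard-overlap ranking, and the dict-of-lists stand-in for a DataFrame are not the paper's actual design, which indexes richer context-enriched components.

```python
from dataclasses import dataclass

@dataclass
class Component:
    """Context-enriched executable component (illustrative shape only)."""
    code: str
    columns: frozenset
    description: str = ""

def retrieve(components, query_columns, k=2):
    """Rank components by Jaccard overlap between their column
    dependencies and the columns named in an EDA sub-query."""
    def score(c):
        inter = len(c.columns & query_columns)
        union = len(c.columns | query_columns) or 1
        return inter / union
    return sorted(components, key=score, reverse=True)[:k]

def rerun(component, data):
    """Re-execute a component against a new dataset; `data` is a
    plain dict of column -> values standing in for a DataFrame."""
    env = {"data": data}
    exec(component.code, env)
    return env.get("result")

components = [
    Component('result = sum(data["price"]) / len(data["price"])',
              frozenset({"price"}), "average price"),
    Component('result = max(data["sales"])',
              frozenset({"sales"}), "peak sales"),
]
hits = retrieve(components, frozenset({"price", "region"}), k=1)
print(hits[0].description)                     # average price
print(rerun(hits[0], {"price": [2.0, 4.0]}))   # 3.0
```

Because components carry their own column dependencies and stay executable, the agent can refresh every retrieved visualization and statistic on the user's new dataset instead of pasting stale outputs from the source notebooks.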

A formative study with senior enterprise analysts and data‑science graduate students identified four design requirements: goal‑aligned extraction and enhancement, efficient and deep results, flexibility when retrieval fails, and seamless code reuse. Guided by these requirements, the authors conducted a within‑subject user study with 24 participants using realistic Kaggle datasets and data‑mining tasks. Participants compared NotebookRAG against three baselines: the ChatGPT Data Analyst plugin, a conventional notebook generator, and a generic RAG retrieval method. Across quantitative metrics (overall quality, intent alignment, visualization relevance, code reusability) and qualitative feedback, NotebookRAG achieved statistically significant improvements and was praised for providing richer, more actionable insights and for delivering notebooks that could be readily modified.

The paper’s contributions are threefold: (1) a novel retrieval technique that converts notebook code cells into context‑enriched, executable components annotated with column usage, (2) an LLM‑driven generation agent that integrates retrieved components to construct intent‑aligned EDA plans, and (3) a comprehensive evaluation demonstrating the superiority of the approach over existing automated EDA tools. Limitations include dependence on the quality of source notebooks, potential mismatches in column annotation, and residual errors in VLM‑generated captions. Future work will explore automatic notebook quality assessment, richer multimodal insight generation, and domain‑specific prompting to further enhance robustness and applicability.

