Many AI Analysts, One Dataset: Navigating the Agentic Data Science Multiverse

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Empirical conclusions depend not only on data but on analytic decisions made throughout the research process. Many-analyst studies have quantified this dependence: independent teams testing the same hypothesis on the same dataset regularly reach conflicting conclusions. But such studies require costly human coordination and are rarely conducted. We show that fully autonomous AI analysts built on large language models (LLMs) can, cheaply and at scale, replicate the structured analytic diversity observed in human multi-analyst studies. In our framework, each AI analyst independently executes a complete analysis pipeline on a fixed dataset and hypothesis; a separate AI auditor screens every run for methodological validity. Across three datasets spanning distinct domains, AI analyst-produced analyses exhibit substantial dispersion in effect sizes, p-values, and conclusions. This dispersion can be traced to identifiable analytic choices in preprocessing, model specification, and inference that vary systematically across LLM and persona conditions. Critically, the outcomes are "steerable": reassigning the analyst persona or LLM shifts the distribution of results even among methodologically sound runs. These results highlight a central challenge for AI-automated empirical science: when defensible analyses are cheap to generate, evidence becomes abundant and vulnerable to selective reporting. Yet the same capability that creates this risk may also help address it: treating analyst results as distributions makes analytic uncertainty visible, and deploying AI analysts against a published specification can reveal how much disagreement stems from underspecified design choices. Taken together, our results motivate a new transparency norm: AI-generated analyses should be accompanied by multiverse-style reporting and full disclosure of the prompts used, on par with code and data.


💡 Research Summary

This paper tackles the pervasive problem of analytic variability—often described as the “garden of forking paths”—by leveraging large language model (LLM) based AI agents as fully autonomous analysts. Traditional many‑analyst studies have shown that independent human teams can reach contradictory conclusions when analyzing the same dataset and hypothesis, but such studies are costly and rare. The authors propose a scalable alternative: AI analysts that receive a fixed dataset, a natural‑language hypothesis, and a pre‑specified primary estimand, then independently decide on every step of the analysis pipeline (data cleaning, variable construction, model specification, inference) and produce a reproducible code base and narrative report without human oversight.

The experimental design crosses three distinct dataset‑hypothesis pairs (the well‑known soccer‑referee bias study, a recent AI‑assisted programming randomized controlled trial, and a long‑running ANES time‑series survey), four contemporary LLM back‑ends (Claude Sonnet 4.5, Claude Haiku 4.5, Qwen3 480B, Qwen3 235B), and five analyst personas that encode different motivational frames (neutral, negative, positive, confirmation‑seeking, strong confirmation‑seeking). In total roughly 5,000 runs were generated, each using a temperature of 1.0 and capped at 250 messages or 60 minutes of compute. The analysts are implemented as ReAct‑style agents within the Inspect AI framework, granting them access to a persistent Python session, a shell, and a file editor, thereby mimicking a realistic data‑science workflow.
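As a rough sketch, the full-factorial design above can be expressed as a condition grid. The labels and the run-count arithmetic below are illustrative assumptions, not the paper's exact identifiers or replication scheme:

```python
from itertools import product

# Illustrative labels for the three experimental factors (names assumed).
datasets = ["soccer_referee", "ai_coding_rct", "anes_timeseries"]
llms = ["claude-sonnet-4.5", "claude-haiku-4.5", "qwen3-480b", "qwen3-235b"]
personas = ["neutral", "negative", "positive", "confirm", "strong_confirm"]

# Cross the factors, then replicate each cell enough times to reach
# roughly 5,000 independent runs in total.
cells = list(product(datasets, llms, personas))  # 3 * 4 * 5 = 60 cells
runs_per_cell = 5000 // len(cells)               # ~83 replicates per cell

runs = [
    {"dataset": d, "llm": m, "persona": p, "replicate": r}
    for (d, m, p) in cells
    for r in range(runs_per_cell)
]
print(len(cells), runs_per_cell, len(runs))  # 60 83 4980
```

Each dictionary in `runs` would then seed one independent analyst session with the corresponding dataset, model back-end, and persona prompt.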

Because autonomous agents can hallucinate or misinterpret the task, a separate AI auditor (Claude Sonnet 4.5 with a dedicated audit prompt) reviews the full transcript, code artifacts, and intermediate outputs of each run. The auditor flags violations such as misspecified estimands, invalid variable constructions, or inappropriate statistical tests. Only runs that pass this audit are retained for analysis, yet even among the “methodologically sound” subset a striking dispersion remains: effect sizes, p‑values, and binary support decisions vary widely across runs. Support rates differ by 34 to 66 percentage points depending on persona, and the distribution of estimates spans both positive and negative values for the same hypothesis.
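The audit-then-aggregate step amounts to filtering run records on the auditor's verdict and summarizing conclusions per condition. A minimal sketch, with a hypothetical record schema (the field names are assumptions, not the paper's):

```python
# Hypothetical run records: auditor verdict, persona, and the run's
# binary conclusion about whether the hypothesis is supported.
runs = [
    {"id": 1, "audit_passed": True,  "persona": "neutral", "supports": True},
    {"id": 2, "audit_passed": True,  "persona": "neutral", "supports": False},
    {"id": 3, "audit_passed": False, "persona": "confirm", "supports": True},
    {"id": 4, "audit_passed": True,  "persona": "confirm", "supports": True},
    {"id": 5, "audit_passed": True,  "persona": "confirm", "supports": True},
]

# Keep only runs the AI auditor judged methodologically sound.
sound = [r for r in runs if r["audit_passed"]]

def support_rate(rows, persona):
    """Share of a persona's sound runs that conclude the hypothesis holds."""
    group = [r for r in rows if r["persona"] == persona]
    return sum(r["supports"] for r in group) / len(group)

print(support_rate(sound, "neutral"))  # 0.5
print(support_rate(sound, "confirm"))  # 1.0
```

Comparing support rates across personas in this way is what surfaces the persona-driven gaps the paper reports.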

The authors trace this dispersion to concrete analytic choices. Differences in missing‑data handling, covariate selection, functional form (linear vs. non‑linear vs. Bayesian), and standard‑error calculation (clustered, bootstrapped, robust) systematically shift the results. Moreover, the outcomes are “steerable”: simply swapping the persona while holding the LLM constant moves the entire distribution toward more or less supportive conclusions; swapping the LLM while holding the persona constant also produces systematic shifts, with larger models tending to explore more complex specifications.
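One of these forking points, covariate selection, is enough to fan a single estimand out into a distribution of estimates. The toy specification multiverse below illustrates the mechanism on simulated data; all numbers are synthetic, and the setup is a simplified stand-in for the paper's actual analyses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: outcome y, binary treatment t, and two optional covariates.
n = 500
t = rng.integers(0, 2, n)
x1, x2 = rng.normal(size=(2, n))
y = 0.3 * t + 0.5 * x1 + rng.normal(size=n)

def ols_effect(covariates):
    """OLS coefficient on t, controlling for the chosen covariates."""
    X = np.column_stack([np.ones(n), t] + covariates)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta[1]

# A tiny multiverse: every subset of the covariate pool is a defensible
# specification, and each yields its own treatment-effect estimate.
covariate_sets = [[], [x1], [x2], [x1, x2]]
estimates = [ols_effect(c) for c in covariate_sets]
print([round(e, 3) for e in estimates])
```

A full multiverse would also cross in the other choices the paragraph lists (missing-data handling, functional form, standard-error calculation), multiplying the number of defensible specifications.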

These findings have two major implications. First, the cheap, high‑throughput generation of defensible analyses creates a new vulnerability: evidence can be cherry‑picked or p‑hacked at scale, amplifying the risk of selective reporting. Second, the same capability offers a powerful remedy: by treating the collection of AI‑generated analyses as a multiverse, researchers can make analytic uncertainty visible, quantify it, and report it alongside traditional sampling uncertainty. The authors therefore propose a new transparency norm for AI‑assisted research: every AI‑generated analysis should be accompanied by (i) a multiverse‑style report that presents the full distribution of results, (ii) complete disclosure of the prompts, LLM version, temperature, and persona used, and (iii) the full audit trail (code, logs, intermediate outputs) on par with data and code sharing.

In the discussion, the paper highlights the need for robust auditing mechanisms, standards for persona design, and policy frameworks that encourage multiverse reporting rather than single‑point claims. Future work is suggested on human‑AI collaborative multiverses, improving auditor reliability, and extending the approach to other scientific domains. Overall, the study demonstrates that autonomous LLM analysts can faithfully reproduce the analytic variability observed in human many‑analyst studies, while also providing a scalable tool for quantifying and managing that variability in the era of AI‑driven empirical science.

