Scaling Reproducibility: An AI-Assisted Workflow for Large-Scale Reanalysis

Reading time: 5 minutes

📝 Original Info

  • Title: Scaling Reproducibility: An AI-Assisted Workflow for Large-Scale Reanalysis
  • ArXiv ID: 2602.16733
  • Date: 2026-02-17
  • Authors: Lal, A. (lead researcher); Xu, B. (system implementation and AI orchestration); et al. (collaborating research team; see the original paper for the full list of co-authors)

📝 Abstract

Reproducibility is central to research credibility, yet large-scale reanalysis of empirical data remains costly because replication packages vary widely in structure, software environment, and documentation. We develop and evaluate an agentic AI workflow that addresses this execution bottleneck while preserving scientific rigor. The system separates scientific reasoning from computational execution: researchers design fixed diagnostic templates, and the workflow automates the acquisition, harmonization, and execution of replication materials using pre-specified, version-controlled code. A structured knowledge layer records resolved failure patterns, enabling adaptation across heterogeneous studies while keeping each pipeline version transparent and stable. We evaluate this workflow on 92 instrumental variable (IV) studies, including 67 with manually verified reproducible 2SLS estimates and 25 newly published IV studies selected under identical criteria. For each paper, we analyze up to three two-stage least squares (2SLS) specifications, totaling 215 specifications. Across the 92 papers, the system achieves 87% end-to-end success. Conditional on accessible data and code, reproducibility is 100% at both the paper and specification levels. The framework substantially lowers the cost of executing established empirical protocols and can be adapted to other empirical settings where analytic templates and norms of transparency are well established.

💡 Deep Analysis

📄 Full Content

Reproducibility is fundamental to research credibility and cumulative scientific progress. In empirical social science, reproducible analyses allow researchers to verify published claims, scrutinize identifying assumptions, and assess the practical relevance of new methodological developments. As empirical methods evolve rapidly, access to real-world data and code has become increasingly important not only for assessing research credibility, but also for advancing methodology through systematic reanalysis of existing studies.

Institutional norms have expanded the availability of replication materials. Leading journals in economics and political science now require authors to post data and code, and some conduct in-house replication checks before publication. Yet availability alone does not ensure reproducibility at scale. Replication packages vary widely in software environment, directory structure, naming conventions, documentation quality, and execution logic. Even when materials are public, reproducing results across many papers remains costly and fragile.

The bottleneck is operational: executing idiosyncratic replication materials in a standardized and auditable manner requires substantial researcher time.

This paper develops and evaluates an agentic AI workflow to address this execution bottleneck. The workflow combines adaptive coordination with deterministic computation.

A large language model (LLM) routes tasks across modular agents that ingest replication materials, identify specifications, reconstruct computational environments, execute models, and generate standardized diagnostic reports. A structured knowledge layer records previously resolved failure patterns and clarifies stage-level responsibilities, allowing the system to accumulate experience across studies while keeping each pipeline version transparent and stable. All numerical operations (data preparation, estimation, and diagnostic computation) are carried out by version-controlled program code. For a fixed pipeline version and fixed inputs, reruns produce identical numerical outputs and retain a complete audit trail of intermediate artifacts and logs. The paper does not propose new estimators or diagnostics.
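The excerpt does not include the pipeline's source code, so the following is only a minimal sketch of the described design, written in Python under stated assumptions: every stage is a deterministic function of its inputs, and the adaptive layer's only job is to consult a persisted store of resolved failure patterns before retrying a stage. All names here (KnowledgeLayer, run_pipeline, the stage stubs, the failure signature) are hypothetical.

```python
# Minimal, self-contained sketch (not the authors' code) of the described design:
# deterministic, version-controlled stage functions do all numerical work, while
# the adaptive layer consults a knowledge store of resolved failure patterns.
import json
import pathlib
from dataclasses import dataclass, field

PIPELINE_VERSION = "0.1.0"  # hypothetical; held fixed so reruns are comparable


@dataclass
class KnowledgeLayer:
    """Persists previously resolved failure patterns, keyed by a short signature."""
    path: pathlib.Path
    records: dict = field(default_factory=dict)

    def load(self) -> None:
        if self.path.exists():
            self.records = json.loads(self.path.read_text())

    def lookup(self, signature: str):
        return self.records.get(signature)

    def record(self, signature: str, resolution: dict) -> None:
        self.records[signature] = resolution
        self.path.write_text(json.dumps(self.records, indent=2))


def ingest(state: dict) -> dict:
    """Deterministic stage stub: list the files shipped in the replication package."""
    state["files"] = sorted(p.name for p in pathlib.Path(state["package"]).glob("*"))
    return state


def estimate(state: dict) -> dict:
    """Deterministic stage stub: run the pre-specified estimation script."""
    entry = state.get("entry_script") or next(
        (f for f in state["files"] if f in ("main.do", "main.R")), None)
    if entry is None:
        raise RuntimeError("missing-entry-script")   # a typical failure signature
    state["estimates"] = f"placeholder output from {entry}"
    return state


def run_pipeline(package: str, kl: KnowledgeLayer) -> dict:
    """Run the stages in order; on a known failure, apply the recorded fix and retry once."""
    state = {"package": package, "pipeline_version": PIPELINE_VERSION, "logs": []}
    for stage in (ingest, estimate):
        try:
            state = stage(state)
        except RuntimeError as exc:
            fix = kl.lookup(str(exc))
            if fix is None:
                raise                  # unresolved failure: surface for human review
            state.update(fix)          # e.g. {"entry_script": "run_analysis.do"}
            state = stage(state)       # retry with the recorded resolution applied
        state["logs"].append(f"{stage.__name__}: ok")
    return state
```

The design intent this sketch tries to capture is that, for a fixed PIPELINE_VERSION and fixed inputs, reruns of the stages are repeatable, while the knowledge layer changes only how a known failure is resolved, not what the stages compute.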

Instead, we ask whether established empirical protocols can be executed reliably and at scale under real-world conditions. Our evidence suggests that they can.

A central design principle of this workflow is the separation of scientific reasoning from computational execution. Human researchers design diagnostic templates that specify estimands, estimators, robustness checks, and summary measures appropriate for a given research design. Once these templates are fixed, reproduction largely consists of execution-oriented tasks: acquiring replication packages, reconstructing computational environments, locating and running prespecified specifications, extracting analysis datasets, and harmonizing outputs. At the current stage of development, AI systems cannot design diagnostic tools that meet the precision standards implied by econometric and statistical theory. We therefore treat diagnostics as human-designed inputs and evaluate whether AI can execute them reliably and reproducibly at scale. This division of labor may evolve as AI systems improve, but it aligns with current research needs.
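To make the division of labor concrete, here is a hypothetical encoding of a fixed diagnostic template as an immutable data structure. The listed checks (first-stage F statistic, weak-instrument robust intervals, overidentification test) are standard IV diagnostics chosen for illustration; they are not taken from the paper's actual template.

```python
# Hypothetical shape of a human-designed, fixed diagnostic template for IV studies.
from dataclasses import dataclass


@dataclass(frozen=True)   # frozen: the template is fixed before any execution
class DiagnosticTemplate:
    """Protocol designed by researchers and applied identically to every study."""
    estimand: str = "effect of the treatment identified by the instrument(s)"
    estimator: str = "2SLS"
    robustness_checks: tuple[str, ...] = (
        "first-stage F statistic",
        "weak-instrument robust confidence interval",
        "overidentification test when more than one instrument is used",
    )
    summary_measures: tuple[str, ...] = ("point estimate", "standard error")


TEMPLATE = DiagnosticTemplate()   # the workflow only executes this, never edits it
```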

We evaluate the workflow on a corpus of 92 studies with instrumental variable (IV) designs. Of these, 67 were previously analyzed in Lal et al. (2024), where the authors manually verified the reproducibility of at least one two-stage least squares (2SLS) coefficient in each study. We extend the analysis to 25 additional IV studies published after the original sample, applying identical inclusion criteria and the same diagnostic template. Across the combined corpus, the workflow targets up to three 2SLS specifications per paper. Each specification corresponds to a model defined by an outcome, a single treatment variable, one or more instruments, and a set of covariates, estimated on a particular sample. In the expanded set of 92 studies, the system achieves an 87% end-to-end success rate. The unsuccessful cases are caused by incomplete replication materials rather than computational instability. Conditional on accessible materials, the pipeline reproduces the benchmark 2SLS estimates exactly and completes all diagnostic tests.
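For readers less familiar with the estimator these specifications share, the sketch below works through 2SLS on synthetic data in Python; the pipeline itself runs the authors' original Stata and R code, and every variable name and coefficient here is invented.

```python
# Synthetic illustration of one 2SLS specification: outcome y, single treatment d,
# one instrument z, covariate x, estimated on the full simulated sample.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)                                   # exogenous covariate
z = rng.normal(size=n)                                   # excluded instrument
u = rng.normal(size=n)                                   # confounder: makes d endogenous
d = 0.8 * z + 0.5 * x + u + rng.normal(size=n)           # first stage
y = 1.5 * d + 1.0 * x - u + rng.normal(size=n)           # structural equation

exog = np.column_stack([np.ones(n), x])                  # intercept + covariates
Z = np.column_stack([exog, z])                           # full instrument matrix
D = np.column_stack([exog, d])                           # exogenous + endogenous regressors

# Stage 1: project the regressors onto the instrument set.
D_hat = Z @ np.linalg.lstsq(Z, D, rcond=None)[0]
# Stage 2: regress the outcome on the fitted values.
beta = np.linalg.lstsq(D_hat, y, rcond=None)[0]

print(dict(zip(["const", "x", "d"], np.round(beta, 3))))  # coefficient on d is near 1.5
```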

It is important to note that this level of reliability was not achieved in a single engineering pass. The corpus spans multiple programming languages (mostly Stata and R). Reproducing published estimates is not the same as replication, but it is a necessary first step toward credible inference and cumulative research.

The contributions of this paper are threefold. First, we design and implement an adaptive yet version-controlled agentic AI workflow that executes fixed empirical templates across heterogeneous replication materials. Second, we provide a systematic evaluation against a manually verified benchmark and a forward extension to newly published studies.


This content is AI-processed from open access ArXiv data.
