AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research


Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challenge. A key mechanism to obtain such insights is through ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 38% of the original ablations on average, below human-level performance. We observe an inverse performance trend between the author and reviewer tasks, which we attribute to differences in model grounding. Lastly, we analyze the limitations of current LMs on these tasks, and find that chain-of-thought prompting outperforms an agent-based approach. Our data is available at https://huggingface.co/collections/ai-coscientist/ablationbench, and our code is available at https://github.com/ai-scientist-bench/ablation-bench.


💡 Research Summary

AblationBench introduces a comprehensive benchmark for evaluating language‑model (LM) agents that plan ablation experiments in empirical AI research. The authors identify ablation planning as a crucial yet under‑explored step in the AI co‑scientist pipeline, distinct from code generation or result replication. To operationalize this, they define two complementary tasks.

AuthorAblation asks a model to generate an ablation plan given only the title, abstract, and the method section of a paper. The dataset comprises 83 conference papers spanning 14 venues, annotated with 230 human‑derived ablations that serve as ground‑truth (GT). Papers are split into 21 development and 62 test instances.

ReviewerAblation asks a model to propose missing ablations for a full submission, mimicking a reviewer’s role. The dataset contains 350 ICLR submissions (2023‑2025) together with official reviews that explicitly mention missing ablations. These reviews provide the GT missing‑ablation list.
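The two task formats above can be summarized as simple data schemas. The sketch below is illustrative only: the field names are assumptions, not the benchmark's actual dataset columns.

```python
from dataclasses import dataclass

@dataclass
class AuthorAblationInstance:
    """One AuthorAblation instance (field names are illustrative).

    The model sees only title, abstract, and method section; the
    human-derived ablations serve as ground truth (GT)."""
    title: str
    abstract: str
    method_section: str
    gt_ablations: list  # 230 human-derived ablations across 83 papers

@dataclass
class ReviewerAblationInstance:
    """One ReviewerAblation instance (field names are illustrative).

    The model sees the full submission; GT missing ablations are
    extracted from official ICLR reviews."""
    full_paper: str
    gt_missing_ablations: list
```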

For both tasks the authors build LM‑based judges that automatically compare a model’s generated ablations with GT. Matching is defined as targeting the same component of the method and applying a comparable modification. To mitigate known LM‑judge biases, they employ three strategies: (1) ensemble of three models with majority voting to reduce intra‑model bias, (2) random swapping of GT and plan labels (Side A/Side B) to counter contextual bias, and (3) random shuffling of ablation order to address positional bias. Human‑annotated evaluation sets (AuthorEval and ReviewerEval) are provided to validate judge reliability, achieving ~84% agreement with human labels.
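The three mitigation strategies can be sketched as a small evaluation loop. This is a minimal illustration under stated assumptions: `judge_model` is a hypothetical callable returning the judge's text answer, and the prompt wording is invented, not the paper's actual prompt.

```python
import random
from collections import Counter

def judge_match(judge_model, gt_ablation, proposed, swap):
    """Ask one LM judge whether two ablations target the same method
    component with a comparable modification. Randomly swapping which
    entry appears as Side A counters contextual bias."""
    a, b = (proposed, gt_ablation) if swap else (gt_ablation, proposed)
    prompt = (
        f"Side A: {a}\nSide B: {b}\n"
        "Do Side A and Side B ablate the same component of the method "
        "with a comparable modification? Answer yes or no."
    )
    return judge_model(prompt).strip().lower().startswith("yes")

def ensemble_judge(judges, gt_ablation, proposed, rng=random):
    """Majority vote over an ensemble of judges (three in the paper)
    to reduce intra-model bias."""
    votes = [
        judge_match(j, gt_ablation, proposed, swap=rng.random() < 0.5)
        for j in judges
    ]
    return Counter(votes).most_common(1)[0][0]
```

In the benchmark itself the order of ablation entries is also shuffled before judging, which addresses positional bias when a judge compares whole plans rather than single pairs.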

Two planner baselines are evaluated. LM‑Planner uses a single chain‑of‑thought (CoT) prompt to a large LM, feeding the paper’s metadata and raw text and asking for up to k ablation entries with reasoning. Agent‑Planner builds on the SWE‑agent (a ReAct‑style LM agent) that can inspect files, run shell commands, and iteratively refine its output before emitting the final plan. Both planners are tested with state‑of‑the‑art models such as GPT‑4‑Turbo and Claude‑3‑Opus.
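The LM‑Planner's single chain‑of‑thought prompt can be sketched as follows. The wording below is an assumption for illustration, not the paper's actual prompt template.

```python
def build_planner_prompt(title, abstract, method_section, k=5):
    """Construct a single CoT prompt for the LM-Planner baseline:
    paper context in, up to k ablation entries with reasoning out.
    (Illustrative wording, not the benchmark's exact template.)"""
    return (
        f"Title: {title}\n"
        f"Abstract: {abstract}\n"
        f"Method section:\n{method_section}\n\n"
        "Think step by step about which components of the method "
        f"drive the reported results, then propose up to {k} ablation "
        "experiments. For each, name the targeted component, the "
        "modification applied, and the reasoning behind it."
    )
```

The Agent‑Planner instead wraps the same goal in a ReAct‑style loop (inspect files, run commands, refine), yet, as the results below show, the extra machinery does not pay off here.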

Results show that even the strongest LM systems recover only a modest fraction of the GT ablations: average recall is 38% across tasks, with AuthorAblation recall at 31% and ReviewerAblation recall at 45%. Human performance on a subset of ten AuthorAblation instances yields an F1 of 0.65, compared to the best model’s 0.42, highlighting a substantial gap. Notably, the CoT‑based LM‑Planner consistently outperforms the more complex Agent‑Planner, suggesting that sophisticated tool use does not yet translate into better scientific reasoning for this problem.
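The recall and F1 figures above follow the standard definitions once the judge has decided which GT ablations were matched. A minimal sketch, assuming matching counts are already available from the judge:

```python
def plan_scores(matched_gt, total_gt, total_proposed):
    """Recall, precision, and F1 for one ablation plan, given how many
    GT ablations the judge matched to some proposed entry.
    (Simplified sketch of standard metric definitions.)"""
    recall = matched_gt / total_gt if total_gt else 0.0
    precision = matched_gt / total_proposed if total_proposed else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1
```

For example, matching 3 of 10 GT ablations with 5 proposals gives recall 0.3, precision 0.6, and F1 0.4.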

The paper also provides a detailed comparison with the prior AbGen benchmark, emphasizing three advances: (1) raw‑paper input rather than manually curated research context, (2) dual perspective (author and reviewer) covering a broader set of conferences and topics, and (3) fully automated evaluation via LM judges.

In the discussion, the authors acknowledge key limitations. First, the benchmark only evaluates recovery of existing ablations; generating novel, useful ablations without GT remains an open challenge. Second, the current judges cannot assess feasibility or impact of new proposals, which would require execution or simulation. Third, systematic biases in LM judges, despite mitigation, still affect reliability.

Future work directions include (a) designing metrics that can evaluate novel ablations (e.g., code‑execution feasibility, performance change), (b) incorporating multimodal knowledge (figures, code snippets) to improve component identification, and (c) expanding the benchmark to other scientific domains beyond AI (e.g., biology, physics) to test generality.

Overall, AblationBench supplies the community with a publicly available dataset, automatic evaluation pipeline, and baseline results, establishing a concrete yardstick for the emerging task of automated ablation planning. The reported performance gap underscores the need for more sophisticated reasoning, better grounding in domain knowledge, and richer evaluation frameworks in future AI co‑scientist systems.

