Assessing Reproducibility in Evolutionary Computation: A Case Study using Human- and LLM-based Assessment
Reproducibility is an important requirement in evolutionary computation, where results largely depend on computational experiments. In practice, reproducibility relies on how algorithms, experimental protocols, and artifacts are documented and shared. Despite growing awareness, there is still limited empirical evidence on the actual reproducibility levels of published work in the field. In this paper, we study reproducibility practices in papers published in the Evolutionary Combinatorial Optimization and Metaheuristics track of the Genetic and Evolutionary Computation Conference over a ten-year period. We introduce a structured reproducibility checklist and apply it through a systematic manual assessment of the selected corpus. In addition, we propose RECAP (REproducibility Checklist Automation Pipeline), an LLM-based system that automatically evaluates reproducibility signals from paper text and associated code repositories. Our analysis shows that papers achieve an average completeness score of 0.62, and that 36.90% of them provide additional material beyond the manuscript itself. We demonstrate that automated assessment is feasible: RECAP achieves substantial agreement with human evaluators (Cohen’s κ of 0.67). Together, these results highlight persistent gaps in reproducibility reporting and suggest that automated tools can effectively support large-scale, systematic monitoring of reproducibility practices.
💡 Research Summary
This paper investigates the state of reproducibility in the Evolutionary Computation (EC) community by conducting a systematic, ten‑year empirical study of papers published in the Evolutionary Combinatorial Optimization and Metaheuristics (ECOM) track of the Genetic and Evolutionary Computation Conference (GECCO). The authors assembled a corpus of 168 full‑length papers from 2016 to 2025, explicitly excluding any works co‑authored by the study’s own researchers to avoid conflict of interest.
To assess reproducibility, the authors first designed a structured checklist inspired by ACM’s Artifact Review and Badging guidelines, tailoring it to the specific methodological and experimental characteristics of EC. The checklist comprises five dimensions: (i) methodological clarity, (ii) experimental setup, (iii) results reporting, (iv) artifact evaluation, and (v) paper metadata. Each dimension contains multiple items that are marked as “Y” (present), “N” (absent), or “NA” (not applicable). The full checklist is provided in an appendix.
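The five-dimension checklist with ternary Y/N/NA verdicts can be modeled as a small data structure. This is a hypothetical sketch: the dimension keys mirror the five dimensions named above, but the individual item names are illustrative placeholders, since the full checklist is only given in the paper's appendix.

```python
from typing import Dict, List

# Five dimensions from the paper; item names below are illustrative only.
CHECKLIST: Dict[str, List[str]] = {
    "methodological_clarity": ["algorithm_described", "pseudocode_given"],
    "experimental_setup": ["hardware_described", "parameters_listed", "tuning_budget"],
    "results_reporting": ["result_tables", "statistical_tests"],
    "artifact_evaluation": ["code_available", "code_runs"],
    "paper_metadata": ["artifact_link_in_paper"],
}

VALID_VERDICTS = {"Y", "N", "NA"}  # present / absent / not applicable


def validate(assessment: Dict[str, str]) -> bool:
    """Check that an assessment covers every item with a Y/N/NA verdict."""
    items = [item for dim in CHECKLIST.values() for item in dim]
    return set(assessment) == set(items) and all(
        v in VALID_VERDICTS for v in assessment.values()
    )
```

A per-paper assessment is then just a mapping from item name to verdict, which is also the shape RECAP's JSON output plausibly takes.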
Two complementary assessment pipelines were applied to every paper. The manual assessment was carried out by trained human assessors who read the paper, inspected any supplementary material hosted on the conference website, and examined publicly referenced artifacts such as source‑code repositories, configuration files, and datasets. When code was available, assessors attempted to rerun the experiments within a two‑hour budget, using external documentation if needed but without any assistance from large language models (LLMs). The outcome for each checklist item was recorded conservatively: any explicit evidence, even if partial, qualified as “Y”.
The automated pipeline, named RECAP (REproducibility Checklist Automation Pipeline), leverages OpenAI’s GPT‑5 nano model. For each paper, the PDF is converted to plain text and fed to the LLM in a single request (thanks to the model’s large context window). A system prompt and a JSON response schema are supplied for every checklist field, guiding the model to output “Y”, “N”, or “NA” together with a brief justification. For artifact‑related fields, RECAP extracts repository URLs, clones the repositories into an isolated Docker sandbox pre‑installed with common scientific dependencies, and presents truncated text versions of code and data files to the LLM for documentation quality assessment. Functional executability is tested by attempting to run the code for up to five minutes; success yields a “Y”. All responses are stored as per‑paper JSON files for later analysis.
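The executability check described above can be sketched as a time-bounded subprocess run. This is a minimal illustration, not RECAP's actual implementation: the five-minute cap comes from the paper, but the command interface and the absence of the Docker layer are simplifying assumptions.

```python
import subprocess
from typing import List

EXEC_TIMEOUT_S = 300  # five-minute execution budget described in the paper


def check_executability(cmd: List[str], workdir: str = ".") -> str:
    """Return 'Y' if the command exits cleanly within the budget, else 'N'.

    A timeout, a missing binary, or a nonzero exit code all count as 'N',
    mirroring the conservative mapping of execution failures to a negative
    checklist verdict.
    """
    try:
        result = subprocess.run(
            cmd, cwd=workdir, capture_output=True, timeout=EXEC_TIMEOUT_S
        )
    except (subprocess.TimeoutExpired, FileNotFoundError, OSError):
        return "N"
    return "Y" if result.returncode == 0 else "N"
```

In the real pipeline the command would run inside the isolated Docker sandbox rather than on the host, so a crashing or malicious artifact cannot affect the assessment machine.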
Results of the manual assessment reveal an average reproducibility completeness score of 0.62 (on a 0–1 scale). Only 36.9% of the papers provide additional material beyond the manuscript itself (e.g., source code, datasets, experiment logs). Certain checklist items—such as hardware/machine description, tuning budget, and detailed parameter configuration—are frequently missing, with “Y” rates often below 20%. Conversely, items like the presence of a methodological description and basic result tables are more commonly reported. A modest upward trend is observable after 2020, but the overall level remains far from full compliance.
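One plausible reading of the completeness score is the fraction of "Y" verdicts among the applicable (non-"NA") items of a paper's assessment. The exact formula is an assumption here, not stated in this summary; the sketch below only illustrates that interpretation.

```python
from typing import Dict


def completeness(assessment: Dict[str, str]) -> float:
    """Fraction of 'Y' verdicts among applicable (non-'NA') checklist items.

    This is an assumed formula for illustration; 'NA' items are excluded
    from the denominator so papers are not penalized for items that do
    not apply to them.
    """
    applicable = [v for v in assessment.values() if v != "NA"]
    if not applicable:
        return 0.0
    return sum(v == "Y" for v in applicable) / len(applicable)
```

Under this reading, the reported average of 0.62 would mean that, on average, papers satisfy about 62% of the checklist items that apply to them.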
Comparison between manual and automated assessments yields Cohen’s κ = 0.67, conventionally interpreted as “substantial” agreement. The agreement is strongest for text‑based items (methodology, experimental setup, result reporting), where the LLM’s information‑retrieval role suffices. Discrepancies arise mainly in artifact‑related fields, where the LLM’s limited reasoning and the five‑minute execution cap sometimes lead to false negatives. Nonetheless, the overall performance demonstrates that LLM‑driven automation can approximate human judgment with reasonable reliability.
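Cohen’s κ corrects raw agreement for agreement expected by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e the chance agreement implied by each rater’s label frequencies. A minimal sketch for two raters over Y/N/NA verdicts (the function name and interface are illustrative, not the authors’ code):

```python
from collections import Counter
from typing import List


def cohens_kappa(a: List[str], b: List[str]) -> float:
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    p_e = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:  # both raters always emit the same single label
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

By the common Landis–Koch convention, values in the 0.61–0.80 range are read as “substantial” agreement, which is where the reported κ = 0.67 falls.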
The authors discuss several implications. First, the current state of reproducibility reporting in EC is inadequate; many essential details needed to rerun experiments are omitted, limiting the community’s ability to verify and build upon prior work. Second, the checklist and manual protocol provide a replicable framework for future meta‑studies across other sub‑domains of AI and computer science. Third, RECAP illustrates a viable path toward scalable, semi‑automated reproducibility monitoring that could be integrated into conference submission systems or journal editorial workflows, offering authors immediate feedback on missing reproducibility signals.
Limitations are acknowledged: the study focuses on the presence of information rather than its correctness; the manual assessment’s two‑hour code‑execution window may not capture more complex setups; and the LLM’s propensity for hallucination is mitigated by restricting it to extraction rather than reasoning tasks. Future work is proposed to enhance RECAP’s accuracy through multi‑model ensembles, richer tool‑use (e.g., automated environment provisioning with Conda or Nix), and tighter integration with reproducibility badges that could incentivize authors to meet higher standards.
In conclusion, this paper delivers the first large‑scale, ten‑year quantitative portrait of reproducibility practices in the EC community, reveals persistent gaps, and validates that LLM‑based automation can serve as an effective, scalable complement to human evaluation. The findings encourage the adoption of standardized checklists and automated tools to foster more transparent, reproducible research in evolutionary computation.