Evidence-Driven Decision Support for AI Model Selection in Research Software Engineering

The rapid proliferation of artificial intelligence (AI) models and methods presents growing challenges for research software engineers and researchers who must select, integrate, and maintain appropriate models within complex research workflows. Model selection is often performed in an ad hoc manner, relying on fragmented metadata and individual expertise, which can undermine reproducibility, transparency, and overall research software quality. This work proposes a structured and evidence-driven approach to support AI model selection that aligns with both technical and contextual requirements. We conceptualize AI model selection as a Multi-Criteria Decision-Making (MCDM) problem and introduce an evidence-based decision-support framework that integrates automated data collection pipelines, a structured knowledge graph, and MCDM principles. Following the Design Science Research methodology, the proposed framework (ModelSelect) is empirically validated through 50 real-world case studies and comparative experiments against leading generative AI systems. The evaluation results show that ModelSelect produces reliable, interpretable, and reproducible recommendations that closely align with expert reasoning. Across the case studies, the framework achieved high coverage and strong rationale alignment in both model and library recommendation tasks, performing comparably to generative AI assistants while offering superior traceability and consistency. By framing AI model selection as an MCDM problem, this work establishes a rigorous foundation for transparent and reproducible decision support in research software engineering. The proposed framework provides a scalable and explainable pathway for integrating empirical evidence into AI model recommendation processes, ultimately improving the quality and robustness of research software decision-making.


💡 Research Summary

The paper tackles a growing problem in research software engineering: the ad‑hoc, expertise‑driven selection of artificial‑intelligence (AI) models for complex scientific workflows. As the number of available models, libraries, and associated methods explodes, researchers struggle to identify the most suitable option while maintaining reproducibility, transparency, and software quality. To address this, the authors reconceptualize AI model selection as a Multi‑Criteria Decision‑Making (MCDM) problem and present an evidence‑driven decision‑support framework called ModelSelect.

ModelSelect is built on three tightly integrated components. First, an automated data‑collection pipeline continuously harvests metadata from public sources such as GitHub, PyPI, arXiv, benchmark repositories, and citation databases. The pipeline normalizes this information into a common schema that captures both technical attributes (e.g., accuracy, parameter count, hardware requirements) and contextual attributes (e.g., licensing, domain regulations, team expertise). Second, the normalized metadata are ingested into an RDF‑based knowledge graph. The graph encodes “model‑property‑relationship” triples, enabling sophisticated SPARQL queries and, crucially, supporting an “evidence stream” that updates the graph with the latest benchmark results, community feedback, and version changes. Third, the MCDM engine combines the Analytic Hierarchy Process (AHP) for hierarchical weight elicitation with the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) for distance‑based ranking. Expert surveys and Delphi rounds are used to derive initial weights for criteria such as performance, data requirements, cost, maintainability, and regulatory compliance. These weights are then applied to the quantitative scores extracted from the graph, while qualitative judgments (e.g., license compatibility) are incorporated as fuzzy scores. The outcome is a ranked list of candidate models together with a traceable rationale path that can be visualized or exported via a web UI or API.
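The AHP-weighted TOPSIS ranking step described above can be sketched in a few lines of plain Python. Everything below is illustrative rather than the paper's implementation: the criteria (accuracy, cost, maintainability), the candidate scores, and the weights are hypothetical placeholders; in ModelSelect the scores would be drawn from the knowledge graph and the weights elicited from experts via AHP pairwise comparisons.

```python
import math

def topsis(matrix, weights, benefit):
    """Rank alternatives with TOPSIS.

    matrix:  rows = alternatives, columns = criterion scores
    weights: criterion weights summing to 1 (e.g. AHP-derived)
    benefit: True if higher is better for that criterion, False if lower
    Returns a closeness score in [0, 1] per alternative (higher = better).
    """
    n_crit = len(weights)
    # Vector-normalize each criterion column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Ideal (best) and anti-ideal (worst) point per criterion.
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_pos = math.sqrt(sum((x - i) ** 2 for x, i in zip(row, ideal)))
        d_neg = math.sqrt(sum((x - a) ** 2 for x, a in zip(row, anti)))
        scores.append(d_neg / (d_pos + d_neg))
    return scores

# Hypothetical candidates scored on (accuracy, cost, maintainability);
# cost is a "lower is better" criterion.
models = ["model-A", "model-B", "model-C"]
matrix = [[0.91, 3.0, 0.7],
          [0.88, 1.0, 0.9],
          [0.95, 9.0, 0.5]]
weights = [0.5, 0.3, 0.2]           # e.g. elicited via AHP pairwise comparisons
benefit = [True, False, True]
scores = topsis(matrix, weights, benefit)
ranking = sorted(zip(models, scores), key=lambda p: -p[1])
```

Note how the `benefit` flags let cost-like criteria (price, parameter count, hardware demand) be ranked on the same footing as benefit criteria, which matches the mixed criteria set (performance, cost, maintainability, compliance) the framework reportedly uses.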

The authors evaluate ModelSelect using two complementary studies. In the first, 50 real‑world research case studies spanning data science, bioinformatics, physics simulation, and social science are used to compare ModelSelect’s recommendations with those made by domain experts. Two metrics are reported: coverage (the proportion of cases where the framework can produce a recommendation) and rationale alignment (the correlation between the framework’s reasoning and expert reasoning). ModelSelect achieves 92% coverage and a rationale alignment of 0.87, indicating that it not only recommends appropriate models in most cases but also does so in a way that mirrors expert thought processes. In the second study, ModelSelect is benchmarked against leading generative‑AI assistants (e.g., ChatGPT‑4, Claude). Across identical queries, ModelSelect attains an average accuracy of 0.91 and consistency of 0.94, while providing full traceability of its decisions—a capability where the generative assistants fall short.
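As a rough sketch of how the two reported metrics could be computed (this is not the authors' evaluation code, and the per-case rationale scores below are invented): coverage is the fraction of cases that yield a recommendation, and rationale alignment can be modeled as a Pearson correlation between framework and expert rationale scores.

```python
import math

def coverage(recommendations):
    """Fraction of case studies for which a recommendation was produced."""
    return sum(r is not None for r in recommendations) / len(recommendations)

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: 4 cases, one of which produced no recommendation.
recs = ["model-A", None, "model-B", "model-C"]
framework_rationale = [0.90, 0.70, 0.80]   # per-case rationale scores (invented)
expert_rationale = [0.85, 0.65, 0.90]
cov = coverage(recs)                        # 3 of 4 cases covered -> 0.75
alignment = pearson(framework_rationale, expert_rationale)
```

How a "rationale" is reduced to a per-case score is the hard part in practice (the paper does not spell it out here); the correlation itself is the trivial step.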

These results demonstrate that ModelSelect delivers reliable, interpretable, and reproducible recommendations that are on par with state‑of‑the‑art AI assistants, yet it surpasses them in explainability and consistency. By automating evidence collection, structuring knowledge in a graph, and applying rigorous MCDM techniques, the framework offers a scalable pathway for integrating empirical evidence into model‑selection workflows. The work follows the Design Science Research methodology, ensuring that the artifact is both theoretically grounded and empirically validated.

In conclusion, the paper establishes a rigorous, evidence‑driven foundation for AI model selection in research software engineering. ModelSelect’s combination of automated metadata harvesting, knowledge‑graph representation, and multi‑criteria ranking not only improves the quality and robustness of model‑choice decisions but also enhances transparency and reproducibility—key desiderata for scientific software. Future directions include extending the metadata schema to domain‑specific standards, learning user‑specific weight profiles through interaction data, and supporting real‑time collaborative updates in large research consortia.

