Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent

Reading time: 5 minutes

📝 Original Info

  • Title: Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent
  • ArXiv ID: 2512.14990
  • Date: 2025-12-17
  • Authors:
      - Mehil B. Shah (Dalhousie University, Halifax, Canada) – shahmehil@dal.ca
      - Mohammad Masudur Rahman (Dalhousie University, Halifax, Canada) – masud.rahman@dal.ca
      - Foutse Khomh (Polytechnique Montreal, Montreal, Canada) – foutse.khomh@polymtl.ca

📝 Abstract

Despite their wide adoption in various domains (e.g., healthcare, finance, software engineering), Deep Learning (DL)-based applications suffer from many bugs, failures, and vulnerabilities. Reproducing these bugs is essential for their resolution, but it is extremely challenging due to the inherent nondeterminism of DL models and their tight coupling with hardware and software environments. According to recent studies, only about 3% of DL bugs can be reliably reproduced using manual approaches. To address these challenges, we present RepGen, a novel, automated, and intelligent approach for reproducing deep learning bugs. RepGen constructs a learning-enhanced context from a project, develops a comprehensive plan for bug reproduction, employs an iterative generate-validate-refine mechanism, and thus generates such code using an LLM that reproduces the bug at hand. We evaluate RepGen on 106 real-world deep learning bugs and achieve a reproduction rate of 80.19%, a 19.81% improvement over the state-of-the-art measure. A developer study involving 27 participants shows that RepGen improves the success rate of DL bug reproduction by 23.35%, reduces the time to reproduce by 56.8%, and lowers participants' cognitive load.

CCS Concepts: • Software and its engineering → Software testing and debugging.
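The iterative generate-validate-refine mechanism named in the abstract can be pictured roughly as the loop below. This is a minimal Python sketch under stated assumptions, not RepGen's implementation: the function names (`generate_validate_refine`, `toy_generate`, `toy_validate`), the feedback string, and the iteration cap are all hypothetical, and the "generator" is a toy stand-in rather than a real LLM call.

```python
from typing import Callable, Optional


def generate_validate_refine(
    generate: Callable[[str, str], str],
    validate: Callable[[str], Optional[str]],
    context: str,
    max_iters: int = 3,
) -> Optional[str]:
    """Return the first candidate that passes validation, or None.

    `generate(context, feedback)` stands in for an LLM call, and
    `validate(code)` returns None on success or an error message.
    Both are hypothetical placeholders, not RepGen's actual interfaces.
    """
    feedback = ""
    for _ in range(max_iters):
        candidate = generate(context, feedback)
        error = validate(candidate)
        if error is None:
            return candidate
        feedback = error  # feed the failure back into the next generation
    return None


# Toy stand-ins so the sketch runs without any model or project.
def toy_generate(context: str, feedback: str) -> str:
    # Pretend the "model" only produces a failing script after seeing feedback.
    return "raise ValueError('shape mismatch')" if feedback else "pass"


def toy_validate(code: str) -> Optional[str]:
    # Treat a script as a reproduction only if it raises the reported error type.
    try:
        exec(code, {})
    except ValueError:
        return None
    except Exception as exc:
        return f"wrong failure: {exc!r}"
    return "script ran without reproducing the bug"


result = generate_validate_refine(toy_generate, toy_validate, context="bug report text")
print(result)
```

In this toy run the first candidate executes cleanly and is rejected, and the second raises the expected error and is accepted. A real validator would need a much stricter notion of "the bug is reproduced" than matching an exception type.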

📄 Full Content

Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent

Mehil B Shah, Dalhousie University, Halifax, Canada, shahmehil@dal.ca
Mohammad Masudur Rahman, Dalhousie University, Halifax, Canada, masud.rahman@dal.ca
Foutse Khomh, Polytechnique Montreal, Montreal, Canada, foutse.khomh@polymtl.ca

Abstract

Despite their wide adoption in various domains (e.g., healthcare, finance, software engineering), Deep Learning (DL)-based applications suffer from many bugs, failures, and vulnerabilities. Reproducing these bugs is essential for their resolution, but it is extremely challenging due to the inherent nondeterminism of DL models and their tight coupling with hardware and software environments. According to recent studies, only about 3% of DL bugs can be reliably reproduced using manual approaches. To address these challenges, we present RepGen, a novel, automated, and intelligent approach for reproducing deep learning bugs. RepGen constructs a learning-enhanced context from a project, develops a comprehensive plan for bug reproduction, employs an iterative generate-validate-refine mechanism, and thus generates such code using an LLM that reproduces the bug at hand. We evaluate RepGen on 106 real-world deep learning bugs and achieve a reproduction rate of 80.19%, a 19.81% improvement over the state-of-the-art measure. A developer study involving 27 participants shows that RepGen improves the success rate of DL bug reproduction by 23.35%, reduces the time to reproduce by 56.8%, and lowers participants’ cognitive load.

CCS Concepts

• Software and its engineering → Software testing and debugging.

Keywords

Deep learning bugs, deep learning bug reproduction, automated debugging, LLM-powered agents, code generation, machine learning systems, software testing and debugging

ACM Reference Format:

Mehil B Shah, Mohammad Masudur Rahman, and Foutse Khomh. 2026. Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent. In 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE ’26), April 12–18, 2026, Rio de Janeiro, Brazil. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3744916.3787795

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. © 2026 Copyright held by the owner/author(s). ACM ISBN 979-8-4007-2025-3/2026/04

1 Introduction

Artificial Intelligence (AI) has been widely adopted in many application domains, including software engineering [45, 46], autonomous vehicles [27], healthcare [64], finance [10], and cybersecurity [16]. The global market share of AI software reached $34.8 billion in 2023 and is projected to grow up to $360 billion by 2030 [26]. Over 67% of top-performing companies have incorporated AI in their business solutions, and 97% of Fortune 500 companies have invested in AI technologies [58], indicating their significance.

However, software applications empowered by Deep Learning (DL), the underlying technology behind current AI systems, remain prone to bugs, faults, and vulnerabilities, which could lead to major consequences (e.g., system crashes) and catastrophic failures (e.g., autonomous vehicle accidents) [69]. Unlike the bugs in traditional, developer-written software, the bugs in DL software are inherently challenging due to several factors. First, they are often non-deterministic due to randomness in model training, i.e., random weight initialization of the model layers [52]. Second, DL models perform high-dimensional tensor operations and suffer from a lack of interpretability, making their encountered bugs opaque [46]. Finally, these bugs also have multi-faceted dependencies on hardware (e.g., GPU), underlying frameworks (e.g., PyTorch, TensorFlow) [81], and data pipelines, making them highly complex [62].
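The first factor above, randomness in weight initialization, can be illustrated with a toy example. The sketch below uses only Python's standard library; `init_weights` is a hypothetical helper written for illustration and does not correspond to any DL framework API or to the paper's tooling.

```python
import random


def init_weights(n_weights, seed=None):
    """Draw small Gaussian weights for a toy layer; `seed` is optional."""
    # A dedicated generator; seed=None seeds from OS entropy or the clock.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.1) for _ in range(n_weights)]


# Unseeded runs draw different weights, so a bug that depends on the
# initial weights may appear in one run and vanish in the next.
run_a = init_weights(4)
run_b = init_weights(4)

# Fixing the seed makes this initialization repeatable across runs.
fixed_a = init_weights(4, seed=42)
fixed_b = init_weights(4, seed=42)

print("unseeded runs equal:", run_a == run_b)
print("seeded runs equal:  ", fixed_a == fixed_b)
```

Fixing a seed addresses only this one source of variation. Differences that stem from hardware, framework versions, or data pipelines, as the paragraph notes, may persist even with identical seeds.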
To resolve DL bugs, software developers must first systematically reproduce them on their local machines. Without a reproduction, they cannot confirm the presence of a bug or diagnose its root cause. However, reproduction of DL bugs can be effort-intensive, time-consuming, and frustrating due to various technical challenges. They include intricate data pipelines, hardware dependencies, and variations in software frameworks and library versions. Even when a bug is reproducible, developers frequently need to engage in trial-and-error, carefully tune environmental settings, and reason about the contextual factors that may influence the behaviour of DL programs, all of which can be tedious and error-prone [54].

Developers also face the challenge of missing or incomplete information when attempting to reproduce bugs from issue reports [71]. Reports may lack crucial details of a bug and omit relevant data or code snippets. In such cases, even experienced developers must spend substantial time reconstructing the missing detail of a bug, the target environment, and iteratively testing hypo
