Failure-Aware Enhancements for Large Language Model (LLM) Code Generation: An Empirical Study on Decision Framework

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Large language models (LLMs) show promise for automating software development by translating requirements into code. However, even advanced prompting workflows like progressive prompting often leave some requirements unmet. Although methods such as self-critique, multi-model collaboration, and retrieval-augmented generation (RAG) have been proposed to address these gaps, developers lack clear guidance on when to use each. In an empirical study of 25 GitHub projects, we found that progressive prompting achieves 96.9% average task completion, significantly outperforming direct prompting (80.5%, Cohen’s d=1.63, p<0.001) but still leaving 8 projects incomplete. For 6 of the most representative projects, we evaluated each enhancement strategy across 4 failure types. Our results reveal that method effectiveness depends critically on failure characteristics: Self-Critique succeeds on code-reviewable logic errors but fails completely on external service integration (0% improvement), while RAG achieves the highest completion across all failure types with superior efficiency. Based on these findings, we propose a decision framework that maps each failure pattern to the most suitable enhancement method, giving practitioners practical, data-driven guidance instead of trial-and-error.


💡 Research Summary

This paper investigates why even state‑of‑the‑art prompting techniques for large language model (LLM)‑based code generation still leave requirements unmet, and it proposes data‑driven guidance for selecting remediation strategies. The authors first benchmarked direct prompting against progressive prompting on 25 diverse open‑source GitHub projects. Progressive prompting—structured as requirements analysis, architectural design, test specification, and implementation—achieved an average task‑completion rate of 96.9 % (SD 5.8 %), significantly outperforming direct prompting’s 80.5 % (Cohen’s d = 1.63, p < 0.001). Nevertheless, eight projects (32 %) remained incomplete, exposing four high‑level failure categories: (1) Local Logic failures, (2) External Integration failures, (3) Domain Knowledge failures, and (4) Infrastructure Configuration failures.

To address these gaps, the study instantiated three concrete enhancement pipelines: (a) Self‑Critique, which iteratively asks the same LLM to review its generated code, identify missing requirements, and produce fixes; (b) Multi‑Model Collaboration, which leverages GPT‑5 for design and Claude Sonnet 4.5 for implementation, thereby exploiting complementary strengths; and (c) Retrieval‑Augmented Generation (RAG), which enriches the prompt with retrieved official documentation, similar open‑source examples, and implementation patterns relevant to the target stack. Six representative “challenge” projects (selected from the eight failures) were used to evaluate each method across the four failure types.
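The Self-Critique pipeline can be summarized as a review-and-repair loop. The sketch below is an illustration of that idea only, not the paper's actual implementation; the function names, prompts, and stopping rule are all assumptions.

```python
def self_critique(llm, requirements, max_rounds=3):
    """Illustrative Self-Critique loop: the same LLM generates code, reviews it
    against the requirements, and repairs any gaps it finds.

    `llm` is any callable mapping a prompt string to a response string.
    """
    code = llm(f"Implement the following requirements:\n{requirements}")
    for _ in range(max_rounds):
        review = llm(
            "Review this code against the requirements and list any unmet "
            "requirements, or reply 'OK' if none remain.\n"
            f"Requirements:\n{requirements}\nCode:\n{code}"
        )
        if review.strip() == "OK":
            break  # the model found no remaining gaps
        code = llm(f"Fix the code to address these issues:\n{review}\nCode:\n{code}")
    return code
```

A RAG variant would differ only in that retrieved documentation and examples are prepended to the initial prompt, while the multi-model variant would route the design prompt and the implementation prompt to different models.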

Results show that RAG achieved the highest overall performance, reaching 100 % completion on four of six projects and an average completion rate of 99.2 % (+6.3 pp over baseline). Self‑Critique succeeded only on Local Logic failures (e.g., P2 and P10), improving completion to 100 % for those cases but delivering 0 % improvement on External Integration and Domain Knowledge failures. Multi‑Model Collaboration also reached 99.2 % average completion and succeeded on most projects, yet it required the longest generation time. Efficiency was measured with a “minutes per percentage‑point” (min/pp) metric: RAG averaged 1.2 min/pp, Multi‑Model 2.0 min/pp, and Self‑Critique 3.5 min/pp, indicating RAG’s superior cost‑effectiveness. Statistical testing (Friedman test) did not find a significant difference across methods (χ² = 3.60, p = 0.165), but large effect sizes (Cohen’s d > 1.0) and practical differences in completion rates support the claim that method choice matters in practice.
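The min/pp metric is simply generation time divided by the completion-rate gain it buys. A minimal sketch, using illustrative input values (the paper reports the resulting averages, not these raw inputs):

```python
def minutes_per_pp(extra_minutes, completion_gain_pp):
    """Cost-effectiveness of an enhancement method: extra generation time
    (minutes) per percentage point of completion-rate improvement.
    Lower is better."""
    return extra_minutes / completion_gain_pp

# Illustrative: ~7.5 extra minutes for a +6.3 pp gain is about 1.2 min/pp,
# in the range reported for RAG.
```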

Based on these findings, the authors propose a decision framework that maps each failure pattern to the most appropriate enhancement technique. The framework advises developers to employ Self‑Critique when failures are “code‑reviewable” (i.e., missing logic that can be detected from the generated code itself), to use RAG when external documentation or integration details are lacking, and to consider Multi‑Model Collaboration when architectural complexity warrants leveraging distinct model capabilities, accepting a higher time cost.
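The framework's advice can be condensed into a simple lookup. The rules below paraphrase the summary's recommendations; the key names and the complexity flag are assumptions for illustration, not the paper's exact taxonomy labels.

```python
# Hypothetical distillation of the paper's decision framework.
DECISION_RULES = {
    "local_logic": "Self-Critique",          # code-reviewable; cheapest fix
    "external_integration": "RAG",           # needs retrieved docs/examples
    "domain_knowledge": "RAG",               # needs external knowledge
    "infrastructure_configuration": "RAG",   # config details lacking in prompt
}

def recommend(failure_type, high_architectural_complexity=False):
    """Map an observed failure pattern to an enhancement method."""
    if high_architectural_complexity:
        # Distinct model strengths justify the higher time cost.
        return "Multi-Model Collaboration"
    return DECISION_RULES.get(failure_type, "RAG")
```

For example, a project failing only on missing in-code logic would be routed to Self-Critique, while one failing on an external API integration would be routed to RAG.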

In summary, the study provides a systematic taxonomy of LLM code‑generation failures, empirically validates three remediation strategies, and delivers a practical, data‑driven decision aid. This work bridges the gap between academic proposals for LLM augmentation and real‑world software engineering practice, offering actionable guidance that can reduce trial‑and‑error cycles, improve development efficiency, and inform future design of LLM‑assisted programming tools.

