Unlocking LLM Repair Capabilities Through Cross-Language Translation and Multi-Agent Refinement
Recent advances in leveraging LLMs for APR have demonstrated impressive capabilities in fixing software defects. However, current LLM-based approaches predominantly focus on mainstream programming languages like Java and Python, neglecting less prevalent but emerging languages such as Rust due to expensive training resources, limited datasets, and insufficient community support. This narrow focus creates a significant gap in repair capabilities across the programming language spectrum, where the full potential of LLMs for comprehensive multilingual program repair remains largely unexplored. To address this limitation, we introduce LANTERN, a novel cross-language program repair approach that leverages LLMs’ differential proficiency across languages through a multi-agent iterative repair paradigm. Our technique strategically translates defective code from languages where LLMs exhibit weaker repair capabilities to languages where they demonstrate stronger performance, without requiring additional training. A key innovation of our approach is an LLM-based decision-making system that dynamically selects optimal target languages based on bug characteristics and continuously incorporates feedback from previous repair attempts. We evaluate our method on xCodeEval, a comprehensive multilingual benchmark comprising 5,068 bugs across 11 programming languages. Results demonstrate significant enhancement in repair effectiveness, particularly for underrepresented languages, with Rust showing a 22.09% improvement in Pass@10 metrics. Our research provides the first empirical evidence that cross-language translation significantly expands the repair capabilities of LLMs and effectively bridges the performance gap between programming languages with different levels of popularity, opening new avenues for truly language-agnostic automated program repair.
💡 Research Summary
The paper addresses a critical gap in large‑language‑model (LLM) based automated program repair (APR): while LLMs excel at fixing bugs in popular languages such as Java and Python, their performance drops sharply for emerging or less‑represented languages like Rust, Kotlin, or Go. This disparity stems from imbalanced training data, higher annotation costs, and limited community resources. To bridge this gap without costly retraining, the authors propose LANTERN (Cross‑Language Translation and Multi‑Agent Refinement), a novel framework that leverages the differential repair proficiency of LLMs across languages.
LANTERN operates in four stages. First, a decision‑making LLM analyzes the buggy snippet, extracts semantic and syntactic features, and consults a pre‑computed performance profile (e.g., Pass@10 scores per language, bug‑type success rates) to select an optimal target language where the LLM is known to be stronger. This selection is dynamic: historical feedback from previous repair attempts is stored in a memory module and influences future choices. Second, the buggy code is translated into the chosen target language using a state‑of‑the‑art code‑generation LLM (e.g., CodeLlama or GPT‑4o). The translation pipeline includes AST consistency checks and type‑preservation validation to minimize semantic drift. Third, a multi‑agent repair loop runs on the translated code. One agent generates candidate patches, another validates them through compilation, execution, and automated testing, and a third aggregates results, updates the feedback memory, and decides whether to iterate. Failed patches are not discarded; they become part of the context for subsequent iterations, allowing the system to refine its approach gradually. Fourth, once a patch passes the test suite in the target language, a reverse translation step converts the fixed code back to the original language, followed by a final validation against the original test suite. If the patch still fails, LANTERN may select a different target language and repeat the cycle.
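The four-stage loop above can be sketched in miniature. This is an illustrative Python sketch only, not LANTERN's actual implementation: all function and field names (`select_target_language`, `repair_loop`, `Memory`, etc.) are assumptions, and the LLM translation and patch-generation calls are replaced by stand-in stubs so the control flow is runnable.

```python
# Hypothetical sketch of LANTERN's four-stage pipeline; names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Feedback store: failed (bug, target-language) attempts bias later choices."""
    history: list = field(default_factory=list)

def select_target_language(bug, profile, memory):
    # Stage 1: choose the language with the best performance-profile score,
    # skipping target languages that already failed for this bug.
    tried = {h["lang"] for h in memory.history if h["bug"] == bug["id"]}
    candidates = {lang: s for lang, s in profile.items() if lang not in tried}
    return max(candidates, key=candidates.get) if candidates else None

def translate(code, src, dst):
    # Stage 2 (stub): stands in for the LLM translation call plus the
    # AST-consistency and type-preservation checks described in the paper.
    return f"// translated {src} -> {dst}\n{code}"

def repair_loop(code, run_tests, max_iters=3):
    # Stage 3 (stub): generator agent proposes a patch, validator agent tests it,
    # aggregator keeps failed patches in context for the next iteration.
    failed_context = []
    for _ in range(max_iters):
        patch = code + f"\n// patch attempt {len(failed_context) + 1}"
        if run_tests(patch):
            return patch
        failed_context.append(patch)
    return None

def lantern(bug, profile, run_src_tests, run_dst_tests, memory, max_rounds=3):
    for _ in range(max_rounds):
        dst = select_target_language(bug, profile, memory)
        if dst is None:
            return None                      # no viable target language left
        translated = translate(bug["code"], bug["lang"], dst)
        patch = repair_loop(translated, run_dst_tests)
        if patch is not None:
            # Stage 4: back-translate and re-validate on the original test suite.
            fixed = translate(patch, dst, bug["lang"])
            if run_src_tests(fixed):
                return fixed
        # Record the failure so the next round picks a different language.
        memory.history.append({"bug": bug["id"], "lang": dst})
    return None
```

Note how a failed round feeds the memory module, so `select_target_language` never retries the same target language for the same bug, mirroring the feedback-driven re-selection the paper describes.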
The authors evaluate LANTERN on xCodeEval, a multilingual benchmark comprising 5,068 bugs across 11 languages. Compared with a strong baseline LLM that directly attempts repairs in the source language, LANTERN raises the overall Pass@10 from 89.2% to 94.5%. The most striking improvement appears for Rust, where Pass@10 jumps from 65.58% to 87.67%, an absolute gain of 22.09 percentage points. Similar gains are observed for Kotlin and Go. Ablation studies confirm that the translation step, the dynamic language-selection policy, and the feedback-driven multi-agent loop each contribute significantly; removing any component degrades performance to near-baseline levels.
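For readers unfamiliar with the metric: Pass@k is the probability that at least one of k sampled repair attempts passes the test suite. The standard numerically stable estimator (from the Codex/HumanEval line of work) computes it from n total samples of which c pass; a minimal implementation:

```python
# Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), computed as a stable product.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total generated samples, c = samples that pass, k = budget."""
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

With n = 10 samples and k = 10, pass@10 simply asks whether any of the 10 attempts succeeded, which is why even a single successful repair among the samples counts for an underrepresented language.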
Implementation details reveal a modular architecture: the decision engine, translator, and repair agents are exposed as independent micro‑services, enabling easy substitution of LLM back‑ends. The system reuses existing LLMs without additional fine‑tuning, making it cost‑effective. Limitations include reduced robustness for large‑scale project translation (handling build scripts, dependencies), occasional semantic loss when translating code with heavy macro usage, and reliance on test‑suite quality for validation. The authors suggest future work on static‑analysis‑augmented verification, richer semantic mapping between languages, and extending the framework to full‑repository migration scenarios.
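The back-end substitution enabled by this modular design can be illustrated with a small interface sketch. Everything here is hypothetical (the class names `LLMBackend`, `EchoBackend`, and `Translator` are not LANTERN's API); it only shows the design choice of having agents depend on an abstract model interface rather than a concrete service.

```python
# Illustrative sketch of swappable LLM back-ends; names are assumptions.
from abc import ABC, abstractmethod

class LLMBackend(ABC):
    """Any model endpoint the decision engine, translator, or repair agents call."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class EchoBackend(LLMBackend):
    # Stand-in for a real hosted model (e.g. a GPT-4o or CodeLlama service).
    def complete(self, prompt: str) -> str:
        return f"[completion for: {prompt}]"

class Translator:
    # Depends only on the interface, so back-ends swap without code changes.
    def __init__(self, backend: LLMBackend):
        self.backend = backend

    def translate(self, code: str, src: str, dst: str) -> str:
        return self.backend.complete(f"Translate this {src} code to {dst}:\n{code}")
```

Because each micro-service holds only an `LLMBackend` reference, replacing the model behind the translator never touches the decision engine or the repair agents.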
In summary, LANTERN demonstrates that cross‑language translation, guided by an LLM‑driven selection strategy and reinforced by multi‑agent iterative refinement, can substantially close the performance gap of LLM‑based APR across diverse programming languages. This work opens a promising path toward truly language‑agnostic automated debugging, especially for emerging languages that have previously suffered from limited repair support.