Bridging the Knowledge Void: Inference-time Acquisition of Unfamiliar Programming Languages for Coding Tasks
The proficiency of Large Language Models (LLMs) in coding tasks largely reflects their extensive pre-training corpora, and it typically collapses when they are confronted with previously unseen programming languages. Departing from data-intensive fine-tuning, we investigate the paradigm of Inference-time Language Acquisition (ILA), in which an LLM masters an unfamiliar language through dynamic interaction with limited external resources. In this paper, we propose ILA-agent, a general ILA framework that equips LLMs with a set of behavioral primitives. By modeling essential human-like behaviors as a suite of tools, ILA-agent enables LLMs to incrementally explore, apply, and verify language knowledge through structured interactions with the official documentation and execution environment. To provide a rigorous evaluation in a low-resource setting, we construct Cangjie-bench, a multi-task benchmark based on the novel statically-typed language Cangjie. We instantiate ILA-agent for Cangjie and evaluate its performance across code generation, translation, and program repair tasks. Results with diverse LLMs demonstrate that ILA-agent significantly outperforms retrieval-augmented baselines. Further analysis of agent trajectories characterizes the emergent behavior patterns while highlighting persistent performance gaps.
💡 Research Summary
The paper tackles a fundamental limitation of large language models (LLMs) in software engineering: their performance collapses when faced with programming languages that were not present in the pre‑training corpus. Traditional remedies—fine‑tuning on a large corpus of code or using retrieval‑augmented generation (RAG) to fetch documentation—are either data‑hungry or insufficient because they lack an interactive verification loop. To address this, the authors propose a new paradigm called Inference‑time Language Acquisition (ILA), in which an LLM learns an unfamiliar language on the fly while solving a coding problem.
At the core of ILA is the ILA‑agent, a general framework that equips an LLM with a suite of “behavioral primitives” implemented as tools. These primitives are divided into two categories: exploration and verification. Exploration primitives let the model query and navigate the official language documentation. The framework provides two structural view tools (ViewStruct and ViewDetail) that expose the table‑of‑contents hierarchy and allow drilling down into specific sections, as well as a semantic search tool (SemSearch) that can retrieve relevant passages even when the model’s query uses different terminology from the language’s official lexicon. Verification primitives give the model access to the language’s execution environment: the Execute tool runs arbitrary code snippets and returns compiler or runtime diagnostics, while the Submit tool evaluates a complete solution against a hidden test suite and signals success.
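The five primitives can be pictured as a single tool interface. The sketch below is illustrative only: the tool names (ViewStruct, ViewDetail, SemSearch, Execute, Submit) come from the paper, but the method signatures, the in-memory documentation tree, and the `run_code` / `run_tests` callables are assumptions made for this example. SemSearch in particular is reduced to a toy keyword match standing in for the real semantic retriever.

```python
# Hypothetical sketch of the five ILA-agent behavioral primitives.
from dataclasses import dataclass, field


@dataclass
class DocSection:
    """One node of the official documentation's table-of-contents tree."""
    title: str
    body: str
    children: list = field(default_factory=list)


class ILATools:
    def __init__(self, docs, run_code, run_tests):
        self.docs = docs            # root of the documentation tree
        self.run_code = run_code    # sandbox: code snippet -> diagnostics
        self.run_tests = run_tests  # hidden suite: full solution -> bool

    def view_struct(self, node=None):
        """ViewStruct: expose the table-of-contents hierarchy (titles only)."""
        node = node or self.docs
        return {node.title: [self.view_struct(c) for c in node.children]}

    def view_detail(self, title):
        """ViewDetail: drill down into one named section and return its text."""
        stack = [self.docs]
        while stack:
            node = stack.pop()
            if node.title == title:
                return node.body
            stack.extend(node.children)
        return None

    def sem_search(self, query):
        """SemSearch: toy lexical stand-in for the semantic retriever."""
        hits, stack = [], [self.docs]
        while stack:
            node = stack.pop()
            if any(w in node.body.lower() for w in query.lower().split()):
                hits.append(node.title)
            stack.extend(node.children)
        return hits

    def execute(self, snippet):
        """Execute: run a fragment, surface compiler/runtime diagnostics."""
        return self.run_code(snippet)

    def submit(self, solution):
        """Submit: evaluate a complete solution against the hidden tests."""
        return "all tests passed" if self.run_tests(solution) else "failed"
```

In this framing, the exploration primitives only read from the documentation tree, while the verification primitives are the only path to ground-truth feedback from the execution environment.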
The ILA process is formalized as a partially observable Markov decision process (POMDP). A state consists of the original problem description together with the full history of action‑observation pairs. The policy, instantiated by the LLM, maps the current state to the next tool invocation. After each tool call the environment returns an observation (e.g., a documentation excerpt or execution error), which is appended to the state. The loop continues until the Submit tool reports that all public tests pass or a preset interaction limit is reached.
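The interaction loop described above can be sketched in a few lines. This is a minimal illustration of the POMDP formulation, not the paper's implementation: the `policy` and `env` callables, the action encoding as `(tool, argument)` tuples, and the sentinel observation string are all assumptions.

```python
# Minimal sketch of the ILA interaction loop: the state is the problem
# plus the action-observation history; the loop ends on a successful
# Submit or when the interaction budget is exhausted.
def ila_loop(problem, policy, env, max_steps=30):
    history = []                           # ordered (action, observation) pairs
    for _ in range(max_steps):
        action = policy(problem, history)  # LLM maps the state to a tool call
        observation = env(action)          # docs excerpt, diagnostics, verdict
        history.append((action, observation))
        if action[0] == "Submit" and observation == "all tests passed":
            return history, True           # solved within the budget
    return history, False                  # interaction limit reached
```

Because the full action-observation history is carried in the state, each new tool call can condition on every documentation excerpt and error message seen so far, which is what distinguishes this loop from single-shot retrieval.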
To evaluate ILA in a truly low‑resource setting, the authors construct Cangjie‑bench, a multi‑task benchmark built around Cangjie, a newly released statically‑typed language (first released June 2024). Cangjie‑bench comprises three tasks: (1) code generation from natural‑language specifications (155 problems adapted from HumanEval), (2) Java‑to‑Cangjie code translation (165 problems derived from the TransCoder dataset), and (3) program repair (32 buggy Cangjie programs ported from the QuixBugs suite). Because public code for Cangjie is virtually nonexistent, the benchmark simulates a cold‑start scenario where LLMs have no prior exposure to the language’s syntax or standard library.
Experiments involve three state‑of‑the‑art LLMs—DeepSeek‑V3.2, Qwen3‑Max, and Claude‑Sonnet‑4.5—compared against three baselines: (a) task‑specific fine‑tuning (Cangjie Generator, Cangjie Translator, OptCodeTrans), (b) single‑shot RAG (one retrieval round before generation), and (c) iterative RAG (up to five retrieval‑generation cycles). ILA‑agent consistently outperforms both the fine‑tuned and retrieval‑augmented baselines across all three tasks. For example, on code generation DeepSeek‑V3.2 improves from 63.23% (iterative RAG) to 71.94% with ILA‑agent; on program repair the same model reaches 100% success, whereas the best RAG variant tops out at 87.5%. Similar gains hold for Qwen3‑Max and Claude‑Sonnet‑4.5, with the latter achieving 81.94% on code generation, 86.45% on translation, and 100% on repair.
A deeper analysis of agent trajectories reveals a characteristic pattern: early interaction steps focus on documentation search (semantic queries, structural navigation), followed by a series of Execute calls that test small code fragments, and finally a Submit call once the model believes the solution is correct. The authors also identify failure modes: handling of complex generic types, library‑specific idioms, and multi‑module projects remains challenging, as the current primitives lack higher‑level reasoning about module dependencies or build systems.
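The explore-then-execute-then-submit pattern is the kind of finding a simple per-trajectory aggregation surfaces. The helper below is a hypothetical illustration (not the authors' analysis code), assuming a trajectory is recorded as a sequence of `(tool_name, observation)` pairs.

```python
# Hypothetical trajectory profiler: tally calls per tool and locate the
# step at which the agent first switches from exploration to execution.
from collections import Counter


def tool_profile(trajectory):
    """Return per-tool call counts and the index of the first Execute call."""
    counts = Counter(tool for tool, _ in trajectory)
    first_execute = next(
        (i for i, (tool, _) in enumerate(trajectory) if tool == "Execute"),
        None)
    return counts, first_execute
```

Aggregating `first_execute` over many trajectories would show how much documentation exploration typically precedes the first attempt to run code.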
The paper’s contributions are threefold: (1) introduction of the ILA paradigm and the ILA‑agent framework that unifies exploration and verification tools, (2) release of Cangjie‑bench as a rigorous low‑resource benchmark for new languages, and (3) extensive empirical evidence that interactive, tool‑driven language acquisition can bridge the knowledge gap far more effectively than static fine‑tuning or retrieval alone. The authors suggest future work on automatically generating language‑specific plugins (e.g., linters, static analyzers), extending ILA to multi‑language learning scenarios, and integrating more sophisticated software engineering artifacts such as build tools and package managers. Overall, the study demonstrates that LLMs can acquire functional proficiency in entirely novel programming languages through structured interaction with documentation and execution feedback, opening a path toward truly adaptable AI coding assistants.