JMigBench: A Benchmark for Evaluating LLMs on Source Code Migration (Java 8 to Java 11)

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

We build a benchmark to evaluate large language models (LLMs) on source code migration tasks, specifically upgrading functions from Java 8 to Java 11. We first collected a dataset of function pairs from open-source repositories, but limitations in data quality led us to construct a refined dataset covering eight categories of deprecated APIs. Using this dataset, the Mistral Codestral model was evaluated with CodeBLEU and keyword-based metrics to measure lexical and semantic similarity as well as migration correctness. Results show that the model can handle trivial one-to-one API substitutions with moderate success, producing migrations identical to the reference in 11.11% of cases, but it struggles with more complex migrations such as those involving CORBA or JAX-WS. These findings suggest Mistral Codestral can partially reduce developer effort by automating repetitive migration tasks but cannot yet replace humans within the scope of the JMigBench benchmark. The benchmark and analysis provide a foundation for future work on expanding datasets, refining prompting strategies, and improving migration performance across different LLMs.


💡 Research Summary

The paper introduces JMigBench, a new benchmark designed to evaluate large language models (LLMs) on the task of migrating Java source code from version 8 to version 11 at the function level. The authors begin by highlighting the practical importance of code migration: legacy Java 8 applications still dominate many production environments, and moving to Java 11, the first long-term-support release after Java 8, is a common yet labor-intensive engineering effort. Existing benchmarks such as CODEMENV and MigrationBench either focus on narrow migration scenarios or operate at the repository level without providing paired “before-and-after” function examples, making fine-grained assessment of LLM outputs difficult.

Dataset Construction
The authors first attempted an automated collection pipeline. They selected popular GitHub repositories (≥10 k stars) and searched release notes, issue trackers, and commit messages for explicit mentions of a Java 8→Java 11 migration. Using a keyword list of deprecated APIs (e.g., Service, Any, ORB), they extracted function‑level diffs where the function name remained the same but the implementation changed. This yielded two raw sub‑datasets: 131 pairs with identical signatures and 19 pairs with different signatures, totaling 150 functions. However, manual inspection revealed severe quality problems: (1) the distribution of deprecated terms was heavily skewed toward JAX‑WS and CORBA, (2) many “matches” were false positives where the keyword appeared in variable names rather than API calls, and (3) many repositories had already removed deprecated APIs before the Java 11 migration, resulting in a lack of genuine migration cases.

To address these issues, the authors curated a refined dataset. They revised the keyword list to include a broader, less ambiguous set of deprecated APIs, and they employed ChatGPT to assist in generating realistic Java 8 and Java 11 function pairs that embody clear migration patterns (e.g., Collections.singletonList → List.of, Thread.stop → Thread.interrupt). Human annotators verified functional equivalence and ensured that each pair contained the same number of parameters and a comparable length. The final dataset consists of 45 function pairs covering eight API categories (Collections, Concurrency, XML, Networking, etc.). Statistics show an average length of 9.69 lines for Java 8 functions and 8.33 lines for Java 11 functions, with a balanced distribution of deprecated terms across categories.
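One such function pair might look like the following. This is an illustrative sketch of the Collections.singletonList → List.of pattern, not an entry taken from the dataset; the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;

public class MigrationPairExample {
    // Java 8 style: singleton list via the Collections helper class
    static List<String> tagsJava8(String tag) {
        return Collections.singletonList(tag);
    }

    // Java 11 style: immutable list via the List.of factory (available since Java 9)
    static List<String> tagsJava11(String tag) {
        return List.of(tag);
    }

    public static void main(String[] args) {
        // Both produce lists that compare equal element-by-element
        System.out.println(tagsJava8("core").equals(tagsJava11("core"))); // prints true
    }
}
```

Pairs like this keep the signature and observable behavior identical, so the migration reduces to a one-to-one API substitution, which is exactly the granularity the benchmark targets.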

Experimental Methodology
Two research questions guide the evaluation:

RQ1 – Lexical and Semantic Similarity: How close is the LLM-generated Java 11 function to the human-written reference? The authors compute CodeBLEU scores (which combine weighted n-gram overlap with syntactic and semantic components) and a data-flow match score (similarity of the data-flow graphs extracted from the two functions).
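As a rough intuition for the n-gram component of such scores, the overlap between a candidate and a reference can be sketched as follows. This is a deliberate simplification: real CodeBLEU uses proper code tokenization and combines n-gram, weighted n-gram, AST, and data-flow components.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NGramOverlap {
    // Collects the distinct token n-grams of a whitespace-tokenized code string.
    static Set<String> ngrams(String code, int n) {
        String[] toks = code.trim().split("\\s+");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= toks.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(toks, i, i + n)));
        }
        return out;
    }

    // Fraction of reference n-grams that also appear in the candidate.
    static double overlap(String candidate, String reference, int n) {
        Set<String> refGrams = ngrams(reference, n);
        if (refGrams.isEmpty()) return 0.0;
        Set<String> candGrams = ngrams(candidate, n);
        long hits = refGrams.stream().filter(candGrams::contains).count();
        return (double) hits / refGrams.size();
    }
}
```

A generated function that swaps one API call but keeps the rest of the body intact scores high on this component even when the swapped call itself is wrong, which is one reason the paper pairs similarity metrics with the keyword-based correctness check of RQ2.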

RQ2 – Migration Correctness: Does the LLM actually remove deprecated syntax? A simple metric counts the number of deprecated keywords remaining in the generated code and reports an “effectiveness” percentage as 1 – (remaining/total).
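A minimal sketch of such a keyword-based metric is shown below. The keyword list here is a hypothetical three-entry stand-in; the paper's actual list spans eight API categories.

```java
import java.util.List;

public class MigrationEffectiveness {
    // Hypothetical keyword list; the benchmark's real list covers eight API categories.
    static final List<String> DEPRECATED =
        List.of("Collections.singletonList", "Thread.stop", "ORB.init");

    // Counts how many deprecated keywords occur in the given code.
    static long remaining(String code) {
        return DEPRECATED.stream().filter(code::contains).count();
    }

    // Effectiveness = 1 - (remaining / total), reported as a percentage.
    static double effectiveness(String java8Code, String generatedCode) {
        long total = remaining(java8Code);
        if (total == 0) return 100.0; // nothing to migrate
        return (1.0 - (double) remaining(generatedCode) / total) * 100.0;
    }
}
```

A string-containment check like this can over-count (a keyword inside a variable name still matches), which mirrors the false-positive problem the authors describe in their initial data collection.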

The LLM under test is Mistral Codestral (2501), accessed via the Mistral API. The prompt is fixed for all experiments: a system message tells the model to act as a senior Java engineer tasked with converting Java 8 code to Java 11, and a user message supplies the Java 8 function wrapped in <Java> tags, explicitly requesting only the Java 11 method as output.
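The prompt structure described above can be sketched roughly as follows; the wording is paraphrased from the summary, not quoted from the paper.

```java
public class MigrationPrompt {
    // Paraphrase of the fixed system message; exact phrasing is an assumption.
    static final String SYSTEM =
        "You are a senior Java engineer. Convert the given Java 8 code to Java 11.";

    // The Java 8 function is wrapped in <Java> tags, and only the
    // migrated Java 11 method is requested as output.
    static String userMessage(String java8Function) {
        return "<Java>\n" + java8Function + "\n</Java>\n"
             + "Return only the migrated Java 11 method.";
    }
}
```

Keeping the prompt fixed across all 45 pairs isolates model capability from prompt engineering, which the authors later flag as a threat to validity.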

Results
For RQ1, the average CodeBLEU score across the 45 pairs is 0.42, and the average data-flow match is 0.38. Only 11.11% of the generated functions are identical to the reference. For RQ2, the overall keyword-removal effectiveness is 31.82%. The model performs reasonably on trivial one-to-one API replacements (e.g., Collections.singletonList → List.of) but fails completely on more complex migrations involving CORBA (Any) or JAX-WS (Service), where the effectiveness drops to 0%.

The authors interpret these findings as evidence that current LLMs can automate repetitive, syntactically straightforward migrations but lack the deeper semantic reasoning required for multi‑step refactorings, interface redesigns, or changes that involve broader architectural considerations. The relatively low CodeBLEU scores also suggest that the model often produces code that is syntactically plausible yet semantically divergent from the reference.

Threats to Validity

  1. Dataset Size – With only 45 curated pairs, statistical conclusions are limited.
  2. Synthetic Nature – Although the pairs are human‑validated, they are partially generated with ChatGPT, which may not capture the full complexity of real‑world codebases.
  3. Prompt Uniformity – Using a single prompt design does not explore how prompt engineering could improve performance.
  4. Model Scope – Only one LLM (Codestral) is evaluated; results may not generalize to other models.

Conclusions and Future Work
JMigBench fills a gap in the evaluation ecosystem by providing a fine‑grained, function‑level benchmark with ground‑truth migrations from Java 8 to Java 11. The study demonstrates that while Mistral Codestral can reduce developer effort for simple API updates, it cannot yet replace human expertise for complex migration scenarios. Future research directions include expanding the benchmark with larger, more diverse real‑world samples, experimenting with multi‑prompt or chain‑of‑thought strategies, integrating LLMs with rule‑based refactoring tools (hybrid approaches), and benchmarking a broader set of LLMs (e.g., GPT‑4o, Claude 3). By establishing a reproducible evaluation framework, JMigBench is poised to accelerate progress toward reliable, automated code migration solutions.

