ArkEval: Benchmarking and Evaluating Automated Code Repair for ArkTS

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large language models have transformed code generation, enabling unprecedented automation in software development. As mobile ecosystems evolve, HarmonyOS has emerged as a critical platform requiring robust development tools. Software development for the HarmonyOS ecosystem relies heavily on ArkTS, a statically typed extension of TypeScript. Despite its growing importance, the ecosystem lacks robust tools for automated code repair, primarily due to the absence of a high-quality benchmark for evaluation. To address this gap, we present ArkEval, a unified framework for constructing benchmarks and evaluating automated repair workflows for ArkTS. It provides the first comprehensive benchmark specifically designed for ArkTS automated program repair. We constructed this benchmark by mining issues from a large-scale official Huawei repository containing over 400 independent ArkTS applications. Through a rigorous multi-stage filtering process, we curated 502 reproducible issues. To ensure testability, we employed a novel LLM-based test generation and voting mechanism involving Claude and other models. Furthermore, we standardized problem statements to facilitate fair evaluation. Finally, we evaluated four state-of-the-art Large Language Models (LLMs) on our benchmark using a retrieval-augmented repair workflow. Our results highlight the current capabilities and limitations of LLMs in repairing ArkTS code, paving the way for future research in this low-resource language domain.


💡 Research Summary

The paper introduces ArkEval, the first comprehensive benchmark and evaluation framework for automated program repair (APR) targeting ArkTS, the statically‑typed extension of TypeScript used in Huawei’s HarmonyOS ecosystem. Recognizing a “digital divide” where large language models (LLMs) excel on high‑resource languages (Python, Java, JavaScript) but falter on low‑resource domain‑specific languages (DSLs), the authors set out to create a high‑quality dataset that enables systematic measurement of LLM‑based repair capabilities for ArkTS.

Benchmark Construction
The authors mined the official Huawei HarmonyOS repository, which aggregates more than 400 independent ArkTS sample applications. From this pool they selected 149 projects that met strict quality criteria (size, commit history, issue tracking). A two‑stage filtering pipeline—automated metric‑based filtering (patch size < 300 LOC, compile‑time failures, dependency complexity) followed by manual verification—yielded 502 reproducible bugs. Each entry includes a concise problem statement, function signature, required dependencies, and the original source files.
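The automated stage of this pipeline reduces to a predicate over per-issue metrics. The sketch below is illustrative, not the paper's implementation: the field names (`patchLoc`, `compileFails`, `dependencyCount`) and the dependency threshold are assumptions, and since the summary does not say which direction the compile-time-failure criterion cuts, we assume issues reproducible as compile-time failures are kept.

```typescript
// Hypothetical shape of a mined candidate issue; field names are illustrative.
interface CandidateIssue {
  id: string;
  patchLoc: number;        // lines of code touched by the reference fix
  compileFails: boolean;   // bug reproduces as a compile-time failure (assumed kept)
  dependencyCount: number; // external dependencies the project pulls in
}

// Automated metric-based filter: patch size < 300 LOC, reproducible
// compile-time failure, and low dependency complexity (threshold assumed).
function passesAutomatedFilter(issue: CandidateIssue, maxDeps: number = 10): boolean {
  return issue.patchLoc < 300 && issue.compileFails && issue.dependencyCount <= maxDeps;
}
```

Issues surviving this predicate would then go to the manual-verification stage described above.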

Test Oracle Synthesis
Because ArkTS projects rarely contain regression tests, the authors devised a novel LLM‑based “vote oracle” pipeline. Multiple LLMs (Claude, GPT‑4, DeepSeek, among others) were prompted to generate input‑output test cases for each bug. The generated tests were then subjected to a voting mechanism: only test cases that received consensus across models were retained as execution‑based oracles. This approach, termed “LLM‑Vote Oracle Synthesis,” eliminates the need for extensive human labeling while still producing reliable test suites for evaluation.
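A minimal sketch of such a consensus vote follows. It assumes each generated test is associated with the set of models that independently produced or validated it; the actual voting rule (threshold, tie-breaking) is not specified in the summary, so the majority threshold below is an assumption.

```typescript
type ModelName = string;

// Keep only test cases endorsed by a strict majority of the participating
// models. `endorsements` maps a test-case id to the set of endorsing models.
function voteOracle(
  endorsements: Map<string, Set<ModelName>>,
  totalModels: number,
  threshold: number = 0.5
): string[] {
  const kept: string[] = [];
  endorsements.forEach((voters, testId) => {
    if (voters.size / totalModels > threshold) kept.push(testId);
  });
  return kept;
}
```

With three models, a test endorsed by two of them clears the 0.5 threshold, while a test endorsed by only one is discarded.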

Standardization and Retrieval‑Augmented Generation (RAG)
To facilitate fair comparison, the authors standardized problem descriptions, ensuring uniform formatting of requirements, signatures, and dependencies. They then built a retrieval‑augmented generation pipeline (ArkFix) that first retrieves relevant code snippets from the repository and injects them into the LLM prompt. This RAG strategy compensates for the scarcity of ArkTS‑specific pre‑training data and helps the model respect ArkTS’s strict static typing and declarative UI conventions.
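The retrieve-then-prompt flow can be sketched as below. The keyword-overlap scoring and the prompt template are stand-ins: ArkFix's actual retriever and template are not described in detail here, so treat both as assumptions.

```typescript
interface Snippet {
  path: string;
  code: string;
}

// Toy retriever: score each repository snippet by how many query terms it
// contains, then keep the top k. A real system would use embeddings or BM25.
function retrieveSnippets(query: string, corpus: Snippet[], k: number = 3): Snippet[] {
  const terms = query.toLowerCase().split(/\W+/).filter(t => t.length > 0);
  const scored = corpus.map(s => ({
    snippet: s,
    score: terms.filter(t => s.code.toLowerCase().includes(t)).length,
  }));
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, k).map(x => x.snippet);
}

// Assemble the LLM prompt: retrieved context first, then the standardized
// problem statement. The template wording is illustrative.
function buildRepairPrompt(problem: string, snippets: Snippet[]): string {
  const context = snippets.map(s => `// ${s.path}\n${s.code}`).join("\n\n");
  return `Relevant repository context:\n${context}\n\nProblem statement:\n${problem}\n\nReturn a corrected ArkTS patch.`;
}
```

Injecting real ArkTS snippets this way gives the model concrete examples of decorator usage and typing conventions it may never have seen in pre-training.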

Experimental Evaluation
Four state‑of‑the‑art LLMs—including models running on Huawei Ascend 910B hardware—were evaluated on ArkEval using two primary metrics: Compile@1 (whether the first generated patch compiles) and Pass@1 (whether it passes the generated test suite). Results reveal a sobering picture: while the models often produce syntactically plausible TypeScript code, their success rate on ArkTS‑specific compilation constraints is near zero. The most common failure mode is “semantic hallucination,” where the model generates code that looks correct but violates ArkTS rules such as the prohibition of the any type, strict layout declarations, or the special @State/@Link/@Prop decorator usage. Consequently, logical correctness is frequently achieved only after extensive manual editing.
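The two metrics as described reduce to simple ratios over first-attempt patch results. The sketch below assumes one generated patch per bug; the `PatchResult` shape is illustrative, not the paper's harness.

```typescript
// Outcome of the single (first) patch an LLM generated for one bug.
interface PatchResult {
  compiles: boolean;    // patch compiles under the ArkTS toolchain
  passesTests: boolean; // patch passes the synthesized vote-oracle tests
}

// Compile@1: fraction of bugs whose first patch compiles.
function compileAt1(results: PatchResult[]): number {
  return results.filter(r => r.compiles).length / results.length;
}

// Pass@1: fraction of bugs whose first patch compiles and passes the tests.
function passAt1(results: PatchResult[]): number {
  return results.filter(r => r.compiles && r.passesTests).length / results.length;
}
```

Because Pass@1 requires compilation as well, it is bounded above by Compile@1, which is why the near-zero compile rates reported here cap repair success so severely.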

Key Contributions

  1. ArkEval Benchmark – 502 real‑world ArkTS bugs with automatically synthesized, high‑confidence test oracles.
  2. LLM‑Vote Oracle Synthesis – A scalable method for generating execution‑based tests in low‑resource settings.
  3. ArkFix Framework – Demonstrates how retrieval‑augmented generation can improve repair for languages with strict static constraints.
  4. Empirical Insight – Highlights the current limitations of generic LLMs on ArkTS, underscoring the need for domain‑specific fine‑tuning and tighter integration of static analysis.

Limitations and Future Work
The oracle generation still relies on LLMs, so residual noise may affect evaluation fidelity; a limited round of human validation was performed, but it was not exhaustive. The evaluated models were not fine‑tuned on ArkTS data, which likely contributed to the low compile success rates. Future research directions include building a large ArkTS code corpus for pre‑training, integrating static type checkers directly into the generation loop, and extending the vote‑oracle methodology to other DSLs (e.g., Solidity, embedded C).

In summary, ArkEval fills a critical gap by providing the first executable benchmark for ArkTS automated repair, introducing innovative LLM‑driven test synthesis, and delivering a thorough empirical assessment that charts a roadmap for advancing AI‑assisted development in low‑resource, statically‑typed ecosystems.

