ArkTS-CodeSearch: An Open-Source ArkTS Dataset for Code Retrieval

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv source.

ArkTS is a core programming language in the OpenHarmony ecosystem, yet research on ArkTS code intelligence is hindered by the lack of public datasets and evaluation benchmarks. This paper presents a large-scale ArkTS dataset constructed from open-source repositories, targeting code retrieval and code evaluation tasks. We design a single-search task, where natural language comments are used to retrieve corresponding ArkTS functions. ArkTS repositories are crawled from GitHub and Gitee, and comment-function pairs are extracted using tree-sitter-arkts, followed by cross-platform deduplication and statistical analysis of ArkTS function types. We further evaluate existing open-source code embedding models on the single-search task and perform fine-tuning using both ArkTS and TypeScript training datasets, resulting in a high-performing model for ArkTS code understanding. This work establishes the first systematic benchmark for ArkTS code retrieval. Both the dataset and our fine-tuned model are available at https://huggingface.co/hreyulog/embedinggemma_arkts and https://huggingface.co/datasets/hreyulog/arkts-code-docstring.


💡 Research Summary

The paper addresses a critical gap in the OpenHarmony ecosystem: the lack of publicly available datasets and standardized benchmarks for ArkTS, the primary programming language used to develop applications on HarmonyOS. To fill this void, the authors construct the first large‑scale, open‑source ArkTS dataset and define a code‑search benchmark that mirrors the well‑known CodeSearchNet framework.

Data collection begins with a dual‑platform strategy. Using keyword‑based searches on GitHub and a curated list from Gitee’s OpenHarmony exploration page, the authors identify 1,577 unique repositories (560 from GitHub, 1,017 from Gitee). After deduplication, they recursively crawl each repository, selecting files with the .ets extension (the standard ArkTS source‑file extension). For every valid file, metadata such as repository name, commit hash, file path, and platform are recorded.
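The crawling step can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the function name and the metadata field names are assumptions, and it walks an already-cloned repository on disk rather than querying the GitHub/Gitee APIs.

```python
import os

def collect_ets_files(repo_root, repo_name, platform, commit_hash):
    """Walk a cloned repository and record metadata for every .ets file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        for fname in filenames:
            if fname.endswith(".ets"):  # standard ArkTS source extension
                records.append({
                    "repo": repo_name,
                    "platform": platform,  # "github" or "gitee"
                    "commit": commit_hash,
                    "path": os.path.relpath(os.path.join(dirpath, fname), repo_root),
                })
    return records
```

Recording the commit hash alongside each path makes the dataset reproducible even after upstream repositories change.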

The core of the dataset construction relies on tree‑sitter‑arkts, a language‑specific parser that converts ArkTS source code into abstract syntax trees (ASTs). Leveraging the AST, the pipeline extracts function‑ and class‑level code units together with their associated natural‑language docstrings. Each record therefore contains three elements: (1) the docstring, (2) the corresponding ArkTS function or class implementation, and (3) the AST representation. This AST‑driven extraction ensures syntactic correctness and captures ArkTS‑specific constructs such as declarative UI annotations, distributed‑application primitives, and cross‑language bindings that would be missed by simple regex‑based approaches.
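The pairing logic behind the extraction can be illustrated with a simplified stand-in for the AST walk. A real implementation traverses tree-sitter-arkts nodes; here, to keep the sketch self-contained, nodes are plain (type, text) tuples in source order, and the node-type names are assumptions:

```python
def extract_pairs(nodes):
    """Pair each comment node with the function declaration that
    immediately follows it (simplified stand-in for an AST traversal)."""
    pairs = []
    prev = None
    for node in nodes:  # nodes are (type, text) tuples in source order
        if node[0] == "function_declaration" and prev and prev[0] == "comment":
            docstring = prev[1].strip("/* \n")  # drop comment delimiters
            pairs.append({"docstring": docstring, "code": node[1]})
        prev = node
    return pairs
```

Working from parser output rather than regexes is what lets the pipeline handle ArkTS-specific syntax (decorators, declarative UI blocks) without false matches.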

All records are aggregated into a HuggingFace‑compatible format (JSONL), facilitating immediate reuse by the research community. Licensing is carefully considered: the source repositories span Apache, MIT, BSD, MPL and other common open‑source licenses. The derived dataset is presented as a transformative derivative intended for research under fair‑use principles, with a clear disclaimer that downstream commercial use must respect the original licenses.
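A JSONL record in this style might look like the following; the exact field names and values here are illustrative assumptions, not the dataset's published schema:

```python
import json

# Hypothetical record layout; field names are illustrative.
record = {
    "docstring": "Returns the device screen width in vp.",
    "code": "function getScreenWidth(): number { /* ... */ }",
    "repo": "example/harmony-demo",
    "license": "Apache-2.0",
}

# JSONL is simply one JSON object per line, which streams well
# and loads directly into HuggingFace datasets.
line = json.dumps(record, ensure_ascii=False)
assert json.loads(line) == record  # round-trips losslessly
```

Carrying the license field per record lets downstream users filter the dataset by license terms.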

For evaluation, the authors define a single‑search task: given a natural‑language docstring, retrieve the exact ArkTS function it documents. This mirrors the CodeSearchNet “docstring‑to‑code” paradigm. Retrieval performance is measured using three standard metrics—Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG@k), and Recall@k—providing a comprehensive view of early‑rank accuracy, ranking quality, and coverage.
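With exactly one relevant function per query, all three metrics reduce to simple functions of the gold item's 1-based rank. A minimal sketch (the function names are mine, not the paper's):

```python
import math

def mrr(ranks):
    """Mean Reciprocal Rank; ranks are 1-based positions of the gold item."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose gold item appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def ndcg_at_k(ranks, k):
    """With one relevant item per query the ideal DCG is 1, so NDCG@k
    reduces to 1/log2(rank + 1) when rank <= k, else 0."""
    return sum(1.0 / math.log2(r + 1) if r <= k else 0.0 for r in ranks) / len(ranks)
```

MRR rewards placing the gold function first, Recall@k measures coverage, and NDCG@k interpolates between the two with a logarithmic position discount.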

Two families of retrieval models are benchmarked. The first is a classic sparse method, BM25, with Chinese‑heavy docstrings tokenized by Jieba to handle multilingual documentation. The second family consists of several open‑source neural code‑embedding models (CodeBERT, GraphCodeBERT, CodeT5, among others). As expected, these models, originally trained on mainstream languages (Python, Java, JavaScript, etc.), perform poorly on ArkTS due to the language’s under‑representation in their pre‑training corpora.
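The sparse baseline can be sketched with a compact, pure-Python BM25 scorer. This is a simplified illustration, not the paper's implementation: it assumes pre-tokenized input (the paper runs Jieba over Chinese-heavy docstrings first) and uses common default parameters.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against a tokenized query with BM25.
    docs_tokens: list of token lists (e.g. Jieba output for Chinese text)."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter(t for d in docs_tokens for t in set(d))  # document frequency
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Because BM25 only matches surface tokens, it sets a lexical floor that neural embedding models must beat by capturing semantic similarity.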

To improve performance, the authors fine‑tune pre‑trained models using supervised contrastive learning on the newly created ArkTS pairs. They explore three training configurations: (1) ARKTS‑TRAINING – fine‑tuning solely on ArkTS data; (2) TS‑TRAINING – fine‑tuning solely on a large TypeScript corpus (340,116 function‑docstring pairs) extracted with tree‑sitter‑typescript; and (3) TS→ARKTS‑TRAINING – a two‑stage approach where the model is first adapted on TypeScript data and then further fine‑tuned on ArkTS. The inclusion of TypeScript is justified by the close syntactic and semantic relationship between the two languages, allowing the model to acquire relevant programming abstractions before specializing on ArkTS‑specific patterns.
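The core of supervised contrastive training is an in-batch loss of the InfoNCE form: each docstring embedding should score its paired function higher than every other function in the batch. A minimal pure-Python sketch (plain lists stand in for model outputs; the temperature value is an assumption):

```python
import math

def info_nce_loss(query_vecs, code_vecs, temperature=0.05):
    """Mean in-batch contrastive loss; query_vecs[i] is the docstring
    embedding paired with the function embedding code_vecs[i]."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    def cos(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

    loss = 0.0
    for i, q in enumerate(query_vecs):
        logits = [cos(q, c) / temperature for c in code_vecs]
        log_denom = math.log(sum(math.exp(x) for x in logits))
        loss += log_denom - logits[i]  # -log softmax at the positive index
    return loss / len(query_vecs)
```

Under this loss, the two-stage TS→ArkTS schedule simply reuses the same objective on two corpora in sequence, letting the TypeScript stage shape the embedding space before the scarcer ArkTS pairs specialize it.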

Experimental results show that the two‑stage TS→ARKTS fine‑tuning achieves the best scores, surpassing the baseline neural models by a substantial margin (e.g., MRR improvement from ~0.42 to 0.68, Recall@10 from ~0.55 to 0.82, and NDCG@10 gains of over 15%). This demonstrates that cross‑language transfer from a well‑represented parent language can effectively bridge the data scarcity of a newer language.

All resources—raw dataset, processed HuggingFace dataset, and the fine‑tuned embedding model (named “embedinggemma_arkts”)—are released publicly. By providing the full pipeline (repository discovery, AST parsing, data cleaning, benchmark definition, and model fine‑tuning) as open‑source code, the authors enable reproducibility and invite the community to extend the work to other tasks such as code summarization, automated code generation, defect detection, and cross‑ecosystem migration.

In summary, the paper makes four key contributions: (1) construction and public release of the first large‑scale ArkTS code‑docstring dataset; (2) establishment of a standardized code‑search benchmark for ArkTS; (3) comprehensive evaluation of existing code‑embedding models on this benchmark; and (4) demonstration that fine‑tuning with both ArkTS and related TypeScript data yields a high‑performing model for ArkTS code retrieval. This work lays the groundwork for systematic, reproducible research on ArkTS intelligence and paves the way for advanced AI‑assisted development tools within the rapidly growing OpenHarmony ecosystem.

