aiXcoder-7B-v2: Training LLMs to Fully Utilize the Long Context in Repository-level Code Completion
Large Language Models (LLMs) have shown promising results in repository-level code completion, which completes code based on the in-file and cross-file context of a repository. The cross-file context typically contains different types of information (e.g., relevant APIs and similar code) and is lengthy. In this paper, we find that LLMs struggle to fully utilize the information in the cross-file context. We hypothesize that one root cause of this limitation is the misalignment between pre-training (i.e., relying on nearby context) and repo-level code completion (i.e., frequently attending to long-range cross-file context). To address this misalignment, we propose Code Long-context Alignment (COLA), a purely data-driven approach that explicitly teaches LLMs to focus on the cross-file context. Specifically, COLA constructs a large-scale repo-level code completion dataset, COLA-132K, where each sample contains a long cross-file context (up to 128K tokens) and requires generating context-aware code (i.e., cross-file API invocations and code spans similar to the cross-file context). Through a two-stage training pipeline upon COLA-132K, LLMs learn to find relevant information in the cross-file context, thus aligning them with repo-level code completion. We apply COLA to multiple popular LLMs (e.g., aiXcoder-7B) and conduct extensive experiments on COLA-132K and a public benchmark, CrossCodeEval. Our experiments yield the following results. 1) Effectiveness. COLA substantially improves the performance of multiple LLMs in repo-level code completion. For example, it improves aiXcoder-7B by up to 19.7% in exact match. 2) Generalizability. The capability learned with COLA generalizes to new languages. 3) Enhanced context utilization. We design two probing experiments, which show that COLA improves the capability of LLMs to utilize the information in the cross-file context.
💡 Research Summary
The paper “aiXcoder-7B-v2: Training LLMs to Fully Utilize the Long Context in Repository-level Code Completion” addresses a critical limitation observed in Large Language Models (LLMs) when applied to repository-level code completion. While LLMs have shown promise in this task, which requires completing code based on both in-file and lengthy cross-file context from a repository, they often fail to fully utilize the information present in the long cross-file context. The authors hypothesize that a root cause of this failure is a misalignment between standard pre-training objectives and the demands of repo-level completion. During pre-training, LLMs primarily learn to rely on nearby tokens for next-token prediction, whereas repo-level completion frequently requires attending to and retrieving relevant information from distant tokens spread across different files.
To bridge this gap, the paper proposes Code Long-context Alignment (COLA), a purely data-driven method designed to explicitly teach LLMs to focus on and utilize information from the cross-file context. The core of COLA is the construction of a novel, large-scale dataset named COLA-132K. This dataset is specifically crafted for repo-level code completion and contains 132,000 samples (120k for training, 12k for testing) drawn from 2,000 high-quality open-source repositories across four popular programming languages: Python, Java, C++, and Go. Each sample in COLA-132K has two key characteristics: 1) It includes a long cross-file context, averaging 12,000 tokens and extending up to 128,000 tokens, and 2) The target code to be generated is “context-aware,” meaning it critically depends on information found within that cross-file context. The authors focus on two primary types of context-aware code: cross-file API invocations and code spans that are similar to examples found in the cross-file context. By forcing the model to generate such code, it must learn to actively search for and identify relevant patterns within the long context.
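The paper does not reproduce its data-construction code here, but the idea of a "context-aware" sample can be sketched. The sample layout, field names, and the regex heuristic below are illustrative assumptions, not the authors' implementation: a sample is kept only if its target invokes an identifier that is defined somewhere in the cross-file context, approximating the "cross-file API invocation" criterion.

```python
from dataclasses import dataclass
import re

@dataclass
class CompletionSample:
    # Hypothetical sample layout; field names are illustrative, not from the paper.
    cross_file_context: str  # concatenated snippets from other repository files
    in_file_prefix: str      # code preceding the completion point in the current file
    target: str              # ground-truth completion to generate

def is_context_aware(sample: CompletionSample) -> bool:
    """Keep only samples whose target calls an identifier defined in the
    cross-file context (a rough proxy for cross-file API invocations)."""
    # Names introduced by function/class definitions in other files.
    defined = set(re.findall(r"(?:def|class)\s+(\w+)", sample.cross_file_context))
    # Names that the target invokes.
    used = set(re.findall(r"(\w+)\s*\(", sample.target))
    return bool(defined & used)

sample = CompletionSample(
    cross_file_context="def load_config(path):\n    ...\n",
    in_file_prefix="cfg = ",
    target="load_config('settings.yaml')",
)
```

Filtering samples this way forces the model to search the long context during training: the target cannot be predicted from the in-file prefix alone.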
The training process involves a two-stage pipeline applied to the COLA-132K dataset. The first stage is supervised fine-tuning, where the model learns the basic patterns of generating context-aware code. The second stage employs reinforcement learning to further refine the model's ability by rewarding accurate completions. This pipeline aligns the LLM's behavior with the requirements of repository-level coding, enhancing its ability to locate and use distant information.
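The two stages can be illustrated schematically. Both functions below are hypothetical sketches under stated assumptions (the summary does not give the paper's exact loss masking or reward): stage one restricts the loss to target tokens so the model must draw on the context rather than reproduce it, and stage two scores sampled completions with an accuracy-based reward.

```python
def build_sft_example(context_ids, target_ids, ignore_index=-100):
    """Stage 1 (SFT) sketch: compute the loss only on target tokens.
    Assumed masking scheme; -100 is the conventional ignore index."""
    input_ids = list(context_ids) + list(target_ids)
    labels = [ignore_index] * len(context_ids) + list(target_ids)
    return input_ids, labels

def completion_reward(generated: str, reference: str) -> float:
    """Stage 2 (RL) sketch: hypothetical reward of 1.0 for an exact match
    after whitespace normalization, else the token-overlap fraction."""
    norm = lambda s: " ".join(s.split())
    if norm(generated) == norm(reference):
        return 1.0
    gen, ref = set(norm(generated).split()), set(norm(reference).split())
    return len(gen & ref) / len(ref) if ref else 0.0
```

In this sketch, an RL loop would sample completions from the SFT model, score them with `completion_reward`, and update the policy accordingly; the masked labels in stage one ensure the long cross-file context is conditioning input, not a prediction target.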
The authors apply the COLA methodology to several popular base LLMs, including aiXcoder-7B, DeepSeek-Coder-6.7B, and CodeLlama-7B. The model resulting from applying COLA to aiXcoder-7B is named aiXcoder-7B-v2. Extensive evaluation on the COLA-132K test set and the public CrossCodeEval benchmark demonstrates the effectiveness of COLA across multiple dimensions. First, it substantially improves the performance of all base LLMs on repo-level code completion metrics (Exact Match and BLEU). For instance, aiXcoder-7B-v2 shows an improvement of up to 19.7% in Exact Match over its base version and even surpasses the performance of much larger models like DeepSeek-Coder-33B. Second, the capability learned through COLA exhibits strong generalizability. Models trained with COLA on the four source languages show improved performance on unseen programming languages (C# and TypeScript), indicating that they learned a general skill of “finding information in context” rather than memorizing language-specific patterns. Third, through carefully designed probing experiments, the paper provides evidence that COLA specifically enhances the model’s capability to utilize two key types of cross-file information: relevant APIs and similar code snippets.
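Of the two reported metrics, Exact Match is the simpler: the fraction of completions identical to the reference. The sketch below uses whitespace-stripped string equality, a common but assumed normalization that may differ from the benchmark's exact definition.

```python
def exact_match(predictions, references):
    """Exact Match (EM): fraction of predictions identical to their reference
    after stripping surrounding whitespace (assumed normalization)."""
    assert len(predictions) == len(references) and references
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)
```

Because EM gives no partial credit, the reported BLEU scores complement it by rewarding near-miss completions that share n-grams with the reference.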
In summary, this work identifies a significant misalignment in LLMs for a practical software engineering task and introduces an elegant, data-centric solution. By curating a targeted dataset (COLA-132K) and employing a two-stage training pipeline, COLA successfully teaches LLMs to overcome their bias towards local context and effectively leverage long-range, cross-file dependencies. The results confirm that COLA is a model-agnostic, effective approach for aligning LLMs with the complex demands of repository-aware code completion, leading to significant performance gains and improved generalization.