C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

Reading time: 5 minutes
...

📝 Original Info

  • Title: C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling
  • ArXiv ID: 2512.21332
  • Date: 2025-12-24
  • Authors: Jin Qin∗, Zihan Liao∗, Ziyin Zhang∗, Hang Yu†, Peng Di†, Rui Wang†
  • Affiliations: 1 Ant Group, 2 Shanghai Jiao Tong University

📝 Abstract

We present C2LLM (Contrastive Code Large Language Models), a family of code embedding models in both 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embeddings from token embeddings, effectively 1) utilizing the LLM's causal representations acquired during pretraining, while also 2) being able to aggregate information from all tokens in the sequence, breaking the information bottleneck in EOS-based sequence embeddings, and 3) supporting flexible adaptation of the embedding dimension, serving as an alternative to MRL (Matryoshka Representation Learning). Trained on three million publicly available examples, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.
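The name Contrastive Code Large Language Models and the three-million-example training set imply a contrastive retrieval objective, although the abstract does not spell out the exact loss. As a point of reference only, the sketch below shows the standard in-batch-negative InfoNCE setup commonly used for embedding models of this kind; the temperature value and the batch construction are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  code_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negative contrastive loss over (query, code) embedding pairs.

    query_emb, code_emb: [batch, dim] tensors where row i of each forms a positive
    pair; every other row in the batch serves as a negative. The temperature is a
    placeholder value, not taken from the paper.
    """
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature             # [batch, batch] cosine similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)     # positives lie on the diagonal
```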

💡 Deep Analysis

Figure 1: MTEB-Code leaderboard.

📄 Full Content

C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

Jin Qin∗,1, Zihan Liao∗,1, Ziyin Zhang∗,1,2, Hang Yu†,1, Peng Di†,1, Rui Wang†,2
1Ant Group  2Shanghai Jiao Tong University
1{qj431428,liaozihan.lzh,hyu.hugo,dipeng.dp}@antgroup.com  2{daenerystargaryen,wangrui12}@sjtu.edu.cn
GitHub: https://github.com/codefuse-ai/CodeFuse-Embeddings
Hugging Face: https://huggingface.co/collections/codefuse-ai/codefuse-embeddings

Figure 1: MTEB-Code leaderboard. C2LLM-7B ranks 1st among all models, surpassing the best closed-source models, while C2LLM-0.5B ranks 1st among models with fewer than 1B parameters, and 6th overall. (Scores: C2LLM-7B 80.75, Seed1.6-Embed 80.71, Qwen3-Embed-8B 80.69, Qwen3-Embed-4B 80.07, Gemini-Embed-001 76.00, C2LLM-0.5B 75.46, Qwen3-Embed-0.6B 75.42, INF-Retriever-7B 69.70, EmbedGemma-0.3B 68.76, INF-Retriever-1.5B 68.49, Text-Embed-005 61.51, Granite-Embed-R2 57.22, Granite-Embed-S-R2 55.84.)

∗Equal Contribution. †Correspondence to: Hang Yu, Peng Di, Rui Wang.

1 Introduction

Large language models (LLMs) pretrained on source code and natural language have rapidly advanced a wide spectrum of software engineering applications, including code generation, automated issue resolution, and, notably, code retrieval (Zhang et al., 2024b). In the retrieval setting, a user supplies a natural-language query (e.g., "open a jsonl file in Python and read all lines"), and the system must return the most relevant snippet among millions or even billions of candidates stored in public or private codebases. Code retrieval is not only essential for interactive developer search engines but also forms a pivotal step in the workflow of emerging code agents - autonomous systems that iteratively plan, search, and edit code to accomplish complex programming tasks (Yang et al., 2024; Gao et al., 2025; Tao et al., 2025; Wang et al., 2025).
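To make the retrieval setting above concrete, here is a minimal sketch of dense retrieval with a sequence-embedding model: candidate snippets are embedded once offline, the query is embedded at search time, and candidates are ranked by cosine similarity. The `embed` callable is a hypothetical stand-in for any encoder of the C2LLM kind, not an actual API of the released models.

```python
import torch

def retrieve_top_k(embed, query: str, snippets: list[str], k: int = 5):
    """Rank candidate code snippets against a natural-language query.

    `embed` is assumed to map a list of strings to an [n, dim] tensor of
    L2-normalized embeddings; it is a placeholder, not a real model API.
    """
    corpus = embed(snippets)                   # [num_snippets, dim], built once offline
    q = embed([query])                         # [1, dim], computed at query time
    scores = (q @ corpus.T).squeeze(0)         # cosine similarity for normalized embeddings
    top = torch.topk(scores, k=min(k, len(snippets)))
    return [(snippets[i], scores[i].item()) for i in top.indices.tolist()]

# Illustration with a random-projection dummy encoder (not a trained model):
# dummy = lambda texts: torch.nn.functional.normalize(torch.randn(len(texts), 64), dim=-1)
# retrieve_top_k(dummy, "open a jsonl file and read all lines",
#                ["with open(p) as f: ...", "def add(a, b): ..."])
```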
At the core of code retrieval systems lie code embedding models. Despite the recent surge of general-purpose text embedding models (Zhang et al., 2025a; Lee et al., 2025a; Choi et al., 2025; Chen et al., 2024; Zhang et al., 2025b), directly transferring them to code embedding remains sub-optimal, as popular pooling strategies are ill-suited to code. State-of-the-art embedding models either adopt mean pooling over the outputs of an LLM (Lee et al., 2025a;b) or take the end-of-sequence (EOS) token representation as the sequence embedding (Choi et al., 2025; Zhang et al., 2025b). However, mean pooling is often paired with bidirectional attention, departing from the causal pretraining recipe of leading code LLMs (e.g. Qwen2.5-Coder, Hui et al., 2024), and therefore fails to unlock their full potential (Li et al., 2025b). Conversely, taking the EOS token embedding collapses all syntactic and semantic structure into one position, creating an information bottleneck that is especially harmful in the code domain, where input code files can easily contain thousands of tokens.

To address this challenge, we introduce Contrastive Code Large Language Models (C2LLM), a new code embedding model family optimized for code retrieval. C2LLM preserves the causal attention of its backbone LLM but sidesteps the dilemma between mean pooling and EOS representation by inserting a lightweight Pooling by Multihead Attention (PMA) module (Lee et al., 2019), which has been shown by Liao et al. (2024) to outperform both mean pooling and EOS representation. A single learnable query attends to all token representations produced by the LLM, simultaneously 1) aggregating sequence information into a single vector, and 2) providing support for dimensionality adaptation, making it ideal for real-world large-scale vector databases. Trained on 3 million publicly available examples, our 7B model achieves an average performance of 80.75 on the MTEB-Code benchmark, ranking 1st among all models on the leaderboard. Our smaller model, with 0.5B parameters, scores 75.46 and pushes the frontier of models around 1B size, surpassing simil...
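To ground the pooling discussion, the following sketch contrasts EOS pooling and mean pooling with a PMA-style head in the spirit of Lee et al. (2019): a single learnable query cross-attends to all token states produced by the causal backbone, and its output projection sets the final embedding size, which is how dimensionality can be adapted without MRL-style truncation. This is a minimal illustration under assumed layer sizes, head counts, and projection dimension, not the authors' released implementation.

```python
import torch
import torch.nn as nn

def eos_pool(hidden: torch.Tensor) -> torch.Tensor:
    """Use the last (EOS) token state: one position must summarize the whole sequence."""
    return hidden[:, -1, :]

def mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average token states; in practice usually paired with bidirectional attention."""
    m = mask.unsqueeze(-1).type_as(hidden)
    return (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-6)

class PMAPooler(nn.Module):
    """Pooling by Multihead Attention: one learnable query attends to all tokens."""

    def __init__(self, hidden_dim: int, num_heads: int = 8, out_dim: int = 1024):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # The output projection fixes the embedding size, so the dimension can be
        # chosen here instead of truncating an MRL-style embedding.
        self.proj = nn.Linear(hidden_dim, out_dim)

    def forward(self, hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq, hidden_dim] states from the causal LLM
        # mask:   [batch, seq], 1 for real tokens, 0 for padding
        q = self.query.expand(hidden.size(0), -1, -1)
        pooled, _ = self.attn(q, hidden, hidden, key_padding_mask=(mask == 0))
        return self.proj(pooled.squeeze(1))    # [batch, out_dim] sequence embedding
```

Because the PMA head sits on top of hidden states that were already computed with causal attention, the backbone's pretraining recipe is untouched, while the pooled vector can still draw on every token rather than only the EOS position.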


Reference

This content is AI-processed based on open access ArXiv data.
