C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling
Reading time: 5 minutes
...
📝 Original Info
Title: C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling
ArXiv ID: 2512.21332
Date: 2025-12-24
Authors: Jin Qin∗, Zihan Liao∗, Ziyin Zhang∗, Hang Yu†, Peng Di†, Rui Wang† (Affiliations: 1 Ant Group, 2 Shanghai Jiao Tong University)
📝 Abstract
We present C2LLM - Contrastive Code Large Language Models, a family of code embedding models in both 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embeddings from token embeddings, effectively 1) utilizing the LLM's causal representations acquired during pretraining, while also 2) being able to aggregate information from all tokens in the sequence, breaking the information bottleneck in EOS-based sequence embeddings, and 3) supporting flexible adaptation of the embedding dimension, serving as an alternative to MRL. Trained on three million publicly available training examples, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.
💡 Deep Analysis
📄 Full Content
C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling
Jin Qin∗,1  Zihan Liao∗,1  Ziyin Zhang∗,1,2  Hang Yu†,1  Peng Di†,1  Rui Wang†,2
1Ant Group
2Shanghai Jiao Tong University
1{qj431428,liaozihan.lzh,hyu.hugo,dipeng.dp}@antgroup.com
2{daenerystargaryen,wangrui12}@sjtu.edu.cn
https://github.com/codefuse-ai/CodeFuse-Embeddings
https://huggingface.co/collections/codefuse-ai/codefuse-embeddings
[Figure 1: bar chart of MTEB-Code performance (closed-source models flagged in the legend). Scores: Granite-Embed-S-R2 55.84, Granite-Embed-R2 57.22, Text-Embed-005 61.51, INF-Retriever-1.5B 68.49, EmbedGemma-0.3B 68.76, INF-Retriever-7B 69.70, Qwen3-Embed-0.6B 75.42, C2LLM-0.5B 75.46, Gemini-Embed-001 76.00, Qwen3-Embed-4B 80.07, Qwen3-Embed-8B 80.69, Seed1.6-Embed 80.71, C2LLM-7B 80.75.]
Figure 1: MTEB-Code leaderboard. C2LLM-7B ranks 1st among all models, surpassing the best closed-source models, while C2LLM-0.5B ranks 1st among models with less than 1B parameters, and 6th overall.
Abstract
We present C2LLM - Contrastive Code Large Language Models, a family of code embedding models in both 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embeddings from token embeddings, effectively 1) utilizing the LLM’s causal representations acquired during pretraining, while also 2) being able to aggregate information from all tokens in the sequence, breaking the information bottleneck in EOS-based sequence embeddings, and 3) supporting flexible adaptation of the embedding dimension, serving as an alternative to MRL. Trained on three million publicly available training examples, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.
∗Equal Contribution.
†Correspondence to: Hang Yu, Peng Di, Rui Wang.
1 Introduction
Large language models (LLMs) pretrained on source code and natural language have rapidly advanced a wide spectrum of software engineering applications, including code generation, automated issue resolution, and, notably, code retrieval (Zhang et al., 2024b). In the retrieval setting, a user supplies a natural-language query (e.g., “open a jsonl file in Python and read all lines”), and the system must return the most relevant snippet among millions or even billions of candidates stored in public or private codebases. Code retrieval is not only essential for interactive developer search engines but also forms a pivotal step in the workflow of emerging code agents - autonomous systems that iteratively plan, search, and edit code to accomplish complex programming tasks (Yang et al., 2024; Gao et al., 2025; Tao et al., 2025; Wang et al., 2025).
At the core of code retrieval systems lie code embedding models. Despite the recent surge of general-purpose text embedding models (Zhang et al., 2025a; Lee et al., 2025a; Choi et al., 2025; Chen et al., 2024; Zhang et al., 2025b), directly transferring them to code embedding remains sub-optimal, as popular pooling strategies are ill-suited to code. State-of-the-art embedding models either adopt mean pooling over the outputs of an LLM (Lee et al., 2025a;b) or take the end-of-sequence (EOS) token representation as the sequence embedding (Choi et al., 2025; Zhang et al., 2025b). However, mean pooling is often paired with bidirectional attention, departing from the causal pretraining recipe of leading code LLMs (e.g. Qwen2.5-Coder, Hui et al., 2024) and therefore fails to unlock their full potential (Li et al., 2025b). Conversely, taking the EOS token embedding collapses all syntactic and semantic structure into one position, creating an information bottleneck that is especially harmful in the code domain, where input code files could easily contain thousands of tokens.
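For concreteness, the sketch below illustrates the two pooling strategies discussed above on top of an LLM's last-layer hidden states. It is a minimal PyTorch illustration under assumed tensor shapes; the function names and layouts are ours, not those of any cited model.

```python
# Minimal sketch of the two common pooling strategies over an LLM's last-layer states.
# Tensor names and shapes are assumptions for illustration only.
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 = real token.
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)    # sum over non-padding tokens
    counts = mask.sum(dim=1).clamp(min=1.0)       # avoid division by zero
    return summed / counts                        # every token contributes equally

def eos_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Uses only the last non-padding (EOS) position: the single-vector bottleneck
    # that the paper argues is harmful for long code files.
    last_idx = attention_mask.sum(dim=1).long() - 1   # index of the final real token
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]         # (batch, dim)
```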
To address this challenge, we introduce Contrastive Code Large Language Models (C2LLM), a new code embedding model family optimized for code retrieval. C2LLM preserves the causal attention of its backbone LLM but sidesteps the dilemma between mean pooling and EOS representation by inserting a lightweight Pooling by Multihead Attention (PMA) module (Lee et al., 2019), which has been shown by Liao et al. (2024) to outperform both mean pooling and EOS representation. A single learnable query attends to all token representations produced by the LLM, simultaneously 1) aggregating sequence information into a single vector, and 2) providing support for dimensionality adaptation, making it ideal for real-world large-scale vector databases.
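The sketch below shows one minimal way to realize such a PMA head in PyTorch: a single learnable query cross-attends over all token states, and an output projection sets the final embedding dimension. Module and argument names (PMAPooler, embed_dim, padding_mask) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a Pooling-by-Multihead-Attention (PMA) head (Lee et al., 2019).
# Names such as PMAPooler, embed_dim, and padding_mask are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PMAPooler(nn.Module):
    """Pools token states (batch, seq_len, hidden_dim) into one embedding per sequence."""

    def __init__(self, hidden_dim: int, embed_dim: int, num_heads: int = 8):
        super().__init__()
        # One learnable "seed" query that cross-attends over all token representations.
        self.query = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)
        # num_heads must divide hidden_dim for nn.MultiheadAttention.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # The output projection fixes the final embedding dimension, which is one way
        # a pooling head can offer a configurable dimension instead of MRL truncation.
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, token_states: torch.Tensor, padding_mask: torch.Tensor = None):
        # token_states: (batch, seq_len, hidden_dim) from the causal LLM backbone.
        # padding_mask: (batch, seq_len) bool, True at padding positions (optional).
        q = self.query.expand(token_states.size(0), -1, -1)   # (batch, 1, hidden_dim)
        pooled, _ = self.attn(q, token_states, token_states,
                              key_padding_mask=padding_mask)  # attend over every token
        emb = self.proj(pooled.squeeze(1))                     # (batch, embed_dim)
        return F.normalize(emb, dim=-1)                        # unit norm for cosine retrieval
```

In use, such a head would sit on top of the causal backbone's final hidden states, with hidden_dim matching the backbone and embed_dim chosen to fit the target vector database; this is our reading of how the configurable output dimension can serve as an alternative to MRL-style truncation.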
Trained on 3 million publicly available training examples, our 7B model achieves an average performance of 80.75 on the MTEB-Code benchmark, ranking 1st among all models on the leaderboard. Our smaller model, with 0.5B parameters, scores 75.46 and pushes the frontier of models around 1B size, surpassing simil