📝 Original Info
- Title: Cross-modal Retrieval Models for Stripped Binary Analysis
- ArXiv ID: 2512.10393
- Date: 2025-12-11
- Authors: Guoqiang Chen, Lingyun Ying, Ziyang Song, Daguang Liu, Qiang Wang, Zhiqi Wang, Li Hu, Shaoyin Cheng, Weiming Zhang, Nenghai Yu
📝 Abstract
Retrieving binary code via natural language queries is a pivotal capability for downstream tasks in the software security domain, such as vulnerability detection and malware analysis. However, it is challenging to identify binary functions semantically relevant to a user query among thousands of candidates, as the absence of symbolic information distinguishes this task from source code retrieval. In this paper, we introduce BinSeek, a two-stage cross-modal retrieval framework for stripped binary code analysis. It consists of two models: BinSeek-Embedding is trained on a large-scale dataset to learn the semantic relevance between binary code and natural language descriptions, while BinSeek-Reranker learns to carefully judge the relevance of each candidate to the description with context augmentation. To this end, we built an LLM-based data synthesis pipeline to automate training data construction, also deriving a domain benchmark for future research. Our evaluation shows that BinSeek achieves state-of-the-art performance, surpassing same-scale models by 31.42% in Rec@3 and 27.17% in MRR@3, and outperforming advanced general-purpose models with 16 times more parameters.
📄 Full Content
Cross-modal Retrieval Models for Stripped Binary Analysis
Guoqiang Chen1,2, Lingyun Ying2, Ziyang Song2, Daguang Liu2,
Qiang Wang2, Zhiqi Wang2, Li Hu1, Shaoyin Cheng1,
Weiming Zhang1, Nenghai Yu1
1University of Science and Technology of China, 2QI-ANXIN Technology Research Institute
Abstract
Retrieving binary code via natural language queries is a pivotal capability for downstream tasks in the software security domain, such as vulnerability detection and malware analysis. However, it is challenging to identify binary functions semantically relevant to a user query among thousands of candidates, as the absence of symbolic information distinguishes this task from source code retrieval. In this paper, we introduce BinSeek, a two-stage cross-modal retrieval framework for stripped binary code analysis. It consists of two models: BinSeek-Embedding is trained on a large-scale dataset to learn the semantic relevance between binary code and natural language descriptions, while BinSeek-Reranker learns to carefully judge the relevance of each candidate to the description with context augmentation. To this end, we built an LLM-based data synthesis pipeline to automate training data construction, also deriving a domain benchmark for future research. Our evaluation shows that BinSeek achieves state-of-the-art performance, surpassing same-scale models by 31.42% in Rec@3 and 27.17% in MRR@3, and outperforming advanced general-purpose models with 16 times more parameters.
1 Introduction
Binary code analysis serves as a cornerstone of software security, supporting critical tasks such as vulnerability detection, malware analysis, and program auditing. Traditional manual analysis is labor-intensive and requires deep expertise, especially for binaries containing thousands of functions. To alleviate these scalability challenges, recent studies (Shang et al., 2025b, 2024) have begun exploring the application of Large Language Models (LLMs) to various binary code analysis tasks. Nevertheless, existing work has largely overlooked the problem of retrieving binary code through natural language (NL) queries, leaving a gap in effectively bridging users' high-level semantic intentions with large-scale binary codebases.
Retrieving stripped binary code with NL queries, however, presents unique and substantial challenges compared to source code retrieval. Unlike high-level programming languages (PLs), stripped binaries lack explicit semantic indicators such as variable names, function names, and type information. These symbols are generally stripped before binary programs are released, for reasons such as reducing file size or hiding functionality. This absence significantly complicates the understanding of binary code for both humans and models (Jin et al., 2023), degrading the performance of general-purpose retrieval models in this scenario.
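To make the impact of stripping concrete, consider a small, hypothetical illustration (not taken from the paper): a naive lexical-overlap scorer matching an NL query against the same decompiled function with and without symbols. The function bodies and the scorer below are invented for illustration only.

```python
import re

def lexical_overlap(query: str, code: str) -> float:
    """Fraction of query word tokens that also appear in the code text."""
    q = set(re.findall(r"[a-z]+", query.lower()))
    c = set(re.findall(r"[a-z]+", code.lower()))
    return len(q & c) / len(q) if q else 0.0

# Decompiled output when symbols are present (hypothetical example).
with_symbols = """
int parse_http_header(char *buffer, struct header *out) {
    char *colon = strchr(buffer, ':');
}
"""

# The same function after stripping: names become sub_/v/a placeholders.
stripped = """
int sub_401A2C(char *a1, void *a2) {
    char *v1 = strchr(a1, ':');
}
"""

query = "parse an http header into a structure"
print(lexical_overlap(query, with_symbols))  # noticeably higher
print(lexical_overlap(query, stripped))      # near zero
```

The query's content words (`parse`, `http`, `header`) survive only in the symbolized version; after stripping, lexical signals vanish, which is why retrieval must rely on deeper semantic representations.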
Semantic-based code search has been extensively explored in the context of source code. Existing approaches typically learn a joint embedding space to align NL and code, with pretrained models such as CodeBERT (Feng et al., 2020), GraphCodeBERT (Guo et al., 2021), and UniXcoder (Guo et al., 2022) achieving remarkable success. Despite these advancements, directly adapting source code retrieval models to the binary domain is ineffective due to the fundamental modality differences discussed above. Moreover, unlike the abundance of open-source repositories (e.g., GitHub) that fuel source code representation learning, high-quality, well-labeled datasets for binaries remain scarce, constraining the development of robust binary code retrieval models.
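The joint-embedding paradigm mentioned above can be sketched minimally: encode the query and each candidate into the same vector space and rank by cosine similarity. The encoder below is a toy character-trigram stand-in (purely illustrative); real systems substitute a trained bi-encoder such as a CodeBERT-style model.

```python
import math
from collections import Counter

def toy_encode(text: str) -> Counter:
    """Stand-in for a learned encoder: sparse character-trigram counts.
    A real retriever would use a trained neural bi-encoder here."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, codebase: dict, k: int = 3) -> list:
    """Rank candidate functions in the codebase by similarity to the NL query."""
    q = toy_encode(query)
    scored = [(name, cosine(q, toy_encode(code))) for name, code in codebase.items()]
    return [name for name, _ in sorted(scored, key=lambda x: -x[1])[:k]]
```

For example, `retrieve("copy a string", codebase, k=1)` returns the candidate whose text shares the most trigrams with the query; the same interface applies when the encoder is a learned model over decompiled code.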
Since retrieving binary code with NL queries is rarely studied, this work aims to develop an effective solution for retrieving functions from stripped binaries. In particular, we introduce BinSeek, a two-stage cross-modal retrieval framework tailored to this task: in the first stage, BinSeek-Embedding, an embedding model, retrieves candidates from a codebase decompiled from binaries, and in the second stage,
we further devise BinSeek-Reranker to reorder the candidates, augmented with calling context information, for more precise results. To train our expert models, we employ LLMs to automatically synthesize high-quality semantic labels in NL for binary functions. In this way, we also deliver the first domain benchmark for the binary code retrieval task, which is expected to facilitate future research in this domain. BinSeek achieves state-of-the-art (SOTA) performance with Rec@3 of 84.5% and MRR@3 of 80.25%, indicating its effectiveness in finding semantically related binary functions. Our main contributions are summarized as follows:
• We introduce BinSeek, a two-stage cross-modal retrieval framework for stripped binary code analysis, where BinSeek-Em
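The retrieve-then-rerank pipeline described above, and the Rec@k / MRR@k metrics used in the evaluation, can be sketched as follows. This is a structural sketch only: the two scoring callables are hypothetical placeholders for BinSeek-Embedding and BinSeek-Reranker, whose actual models and context augmentation are described in the paper.

```python
def recall_at_k(ranked_lists, gold, k=3):
    """Fraction of queries whose gold function appears in the top-k results."""
    hits = sum(1 for r, g in zip(ranked_lists, gold) if g in r[:k])
    return hits / len(gold)

def mrr_at_k(ranked_lists, gold, k=3):
    """Mean reciprocal rank, counting only gold items ranked within the top-k."""
    total = 0.0
    for r, g in zip(ranked_lists, gold):
        if g in r[:k]:
            total += 1.0 / (r.index(g) + 1)
    return total / len(gold)

def two_stage_search(query, codebase, embed_score, rerank_score, k=10, final=3):
    """Stage 1: a cheap embedding similarity narrows the codebase to k candidates.
    Stage 2: a more expensive reranker (which may see calling context) reorders them."""
    stage1 = sorted(codebase, key=lambda f: -embed_score(query, f))[:k]
    return sorted(stage1, key=lambda f: -rerank_score(query, f))[:final]
```

The design rationale is standard for two-stage retrieval: the embedding stage must scale to thousands of functions, so it scores each candidate independently, while the reranker sees only a short list and can afford richer, context-augmented pairwise judgment.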
This content is AI-processed based on open access ArXiv data.