Cross-modal Retrieval Models for Stripped Binary Analysis


📝 Original Info

  • Title: Cross-modal Retrieval Models for Stripped Binary Analysis
  • ArXiv ID: 2512.10393
  • Date: 2025-12-11
  • Authors: Guoqiang Chen, Lingyun Ying, Ziyang Song, Daguang Liu, Qiang Wang, Zhiqi Wang, Li Hu, Shaoyin Cheng, Weiming Zhang, Nenghai Yu

📝 Abstract

Retrieving binary code via natural language queries is a pivotal capability for downstream tasks in the software security domain, such as vulnerability detection and malware analysis. However, it is challenging to identify binary functions semantically relevant to a user query from thousands of candidates, as the absence of symbolic information distinguishes this task from source code retrieval. In this paper, we introduce BinSeek, a two-stage cross-modal retrieval framework for stripped binary code analysis. It consists of two models: BinSeek-Embedding is trained on a large-scale dataset to learn the semantic relevance between binary code and natural language descriptions, while BinSeek-Reranker learns to carefully judge the relevance of each candidate function to the description with context augmentation. To this end, we built an LLM-based data synthesis pipeline to automate training data construction, also deriving a domain benchmark for future research. Our evaluation results show that BinSeek achieves state-of-the-art performance, surpassing same-scale models by 31.42% in Rec@3 and 27.17% in MRR@3, as well as leading advanced general-purpose models that have 16 times more parameters.

💡 Deep Analysis

📄 Full Content

Cross-modal Retrieval Models for Stripped Binary Analysis

Guoqiang Chen1,2, Lingyun Ying2, Ziyang Song2, Daguang Liu2, Qiang Wang2, Zhiqi Wang2, Li Hu1, Shaoyin Cheng1, Weiming Zhang1, Nenghai Yu1
1University of Science and Technology of China, 2QI-ANXIN Technology Research Institute

Abstract

Retrieving binary code via natural language queries is a pivotal capability for downstream tasks in the software security domain, such as vulnerability detection and malware analysis. However, it is challenging to identify binary functions semantically relevant to a user query from thousands of candidates, as the absence of symbolic information distinguishes this task from source code retrieval. In this paper, we introduce BinSeek, a two-stage cross-modal retrieval framework for stripped binary code analysis. It consists of two models: BinSeek-Embedding is trained on a large-scale dataset to learn the semantic relevance between binary code and natural language descriptions, while BinSeek-Reranker learns to carefully judge the relevance of each candidate function to the description with context augmentation. To this end, we built an LLM-based data synthesis pipeline to automate training data construction, also deriving a domain benchmark for future research. Our evaluation results show that BinSeek achieves state-of-the-art performance, surpassing same-scale models by 31.42% in Rec@3 and 27.17% in MRR@3, as well as leading advanced general-purpose models that have 16 times more parameters.

1 Introduction

Binary code analysis serves as a cornerstone of software security, supporting critical tasks such as vulnerability detection, malware analysis, and program auditing. Traditional manual analysis is known to be labor-intensive and to require deep expertise, especially for binaries containing thousands of functions.
To alleviate these scalability challenges, recent studies (Shang et al., 2025b, 2024) have begun exploring the application of Large Language Models (LLMs) to various binary code analysis tasks. Nevertheless, existing work has largely overlooked the problem of retrieving binary code through natural language (NL) queries, leaving a gap in effectively bridging users' high-level semantic intentions with large-scale binary codebases.

Retrieving stripped binary code with NL queries, however, presents unique and substantial challenges compared to source code retrieval. Unlike high-level programming languages (PLs), stripped binaries lack explicit semantic indicators such as variable names, function names, and type information. These symbols are generally stripped before binary programs are released, for various reasons (e.g., reducing file size, hiding functionality). Their absence significantly complicates the understanding of binary code for both humans and models (Jin et al., 2023), degrading the performance of general-purpose retrieval models in this scenario.

Semantic-based code search has been extensively explored in the context of source code. Existing approaches typically learn a joint embedding space to align NL and code, with pretrained models such as CodeBERT (Feng et al., 2020), GraphCodeBERT (Guo et al., 2021), and UniXcoder (Guo et al., 2022) achieving remarkable success. Despite these advancements, directly adapting source code retrieval models to the binary domain is ineffective due to the fundamental modality differences discussed above. Moreover, unlike the abundance of open-source repositories (e.g., GitHub) that fuel source code representation learning, high-quality and well-labeled datasets for binaries remain scarce, constraining the development of robust binary code retrieval models.
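To make the joint-embedding idea concrete, here is a minimal sketch of dual-encoder retrieval: a query and every candidate function are mapped into a shared vector space, and candidates are ranked by cosine similarity. The vectors below are random stand-ins for real encoder outputs (this is an illustrative assumption, not BinSeek's actual models or dimensions).

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length so the dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical codebase: one embedding row per decompiled function.
codebase = normalize(rng.normal(size=(1000, 128)))

# Hypothetical NL query embedding, constructed near function 42 for the demo.
query = normalize(codebase[42] + 0.05 * rng.normal(size=128))

scores = codebase @ query          # cosine similarity against all functions
top3 = np.argsort(-scores)[:3]     # top-3 candidates, e.g. to pass to a reranker
print(list(top3))
```

In a two-stage design, this cheap vector comparison narrows thousands of functions down to a handful, which a slower but more accurate reranker can then reorder.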
Since retrieving binary code with NL queries is rarely studied, this work aims to develop an effective solution for retrieving functions from stripped binaries. In particular, we introduce BinSeek, a two-stage cross-modal retrieval framework tailored to this problem. In the first stage, BinSeek-Embedding, an embedding model, retrieves candidates from a codebase decompiled from binaries; in the second stage, BinSeek-Reranker reorders the candidates, augmented with calling-context information, for more precise results. To train our expert models, we employ LLMs to automatically synthesize high-quality semantic labels in NL for binary functions. In this way, we also deliver the first domain benchmark for the binary code retrieval task, which is expected to facilitate future research in this domain. Our BinSeek achieves state-of-the-art (SOTA) performance with Rec@3 of 84.5% and MRR@3 of 80.25%, indicating its effectiveness in finding semantically related binary functions. Our main contributions are summarized as follows:

• We introduce BinSeek, a two-stage cross-modal retrieval framework for stripped binary code analysis, where BinSeek-Em
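For readers unfamiliar with the reported metrics, Rec@3 (recall at rank 3) and MRR@3 (mean reciprocal rank truncated at 3) can be computed as follows. The ranked lists and gold labels here are toy values invented purely to illustrate the definitions, assuming one relevant function per query.

```python
def recall_at_k(ranked, gold, k=3):
    """Fraction of queries whose gold function appears in the top-k results."""
    hits = sum(1 for q in gold if gold[q] in ranked[q][:k])
    return hits / len(gold)

def mrr_at_k(ranked, gold, k=3):
    """Mean of 1/rank of the gold function, counting 0 when it is outside top-k."""
    total = 0.0
    for q in gold:
        top = ranked[q][:k]
        if gold[q] in top:
            total += 1.0 / (top.index(gold[q]) + 1)
    return total / len(gold)

# Toy example: q1 hits at rank 2, q2 at rank 1, q3 misses entirely.
ranked = {"q1": [7, 3, 9], "q2": [5, 1, 2], "q3": [4, 8, 6]}
gold = {"q1": 3, "q2": 5, "q3": 0}

print(recall_at_k(ranked, gold))  # 2/3
print(mrr_at_k(ranked, gold))     # (1/2 + 1 + 0) / 3 = 0.5
```

MRR additionally rewards placing the correct function higher within the top-k, which is exactly what a second-stage reranker is meant to improve.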

Reference

This content is AI-processed based on open access ArXiv data.
