MalCVE: Malware Detection and CVE Association Using Large Language Models


Malicious software attacks are having an increasingly significant economic impact. Commercial malware detection software can be costly, and tools that attribute malware to the specific software vulnerabilities it exploits are largely lacking. Understanding the connection between malware and the vulnerabilities it targets is crucial for analyzing past threats and proactively defending against current ones. In this study, we propose an approach that leverages large language models (LLMs) to detect binary malware, specifically within JAR files, and uses LLM capabilities combined with retrieval-augmented generation (RAG) to identify Common Vulnerabilities and Exposures (CVEs) that malware may exploit. We developed a proof-of-concept tool, MalCVE, that integrates binary code decompilation, deobfuscation, LLM-based code summarization, semantic similarity search, and LLM-based CVE classification. We evaluated MalCVE using a benchmark dataset of 3,839 JAR executables. MalCVE achieved a mean malware-detection accuracy of 97%, at a fraction of the cost of commercial solutions. In particular, the results demonstrate that LLM-based code summarization enables highly accurate and explainable malware identification. MalCVE is also the first tool to associate CVEs with binary malware, achieving a recall@10 of 65%, which is comparable to studies that perform similar analyses on source code.


💡 Research Summary

The paper introduces MalCVE, a novel framework that leverages large language models (LLMs) together with retrieval‑augmented generation (RAG) to detect malicious Java Archive (JAR) binaries and to associate them with the specific Common Vulnerabilities and Exposures (CVEs) they are likely to exploit. The authors motivate their work by pointing out the high cost and opacity of commercial malware detection products and the lack of tools that can directly link binary malware to the vulnerabilities it targets. Their research questions are: (RQ1) Can LLMs accurately classify decompiled binaries as malicious or benign? (RQ2) Can LLMs, aided by RAG, reliably retrieve the CVEs relevant to a given malicious binary?

MalCVE’s pipeline consists of eight stages. First, the JAR file is decompiled using two open‑source Java decompilers, CFR and Procyon, to obtain Java source code. Second, a custom deobfuscation tool built on JavaParser resolves constant string expressions and inlines them, mitigating basic obfuscation. Third, the decompiled and deobfuscated code is fed to an LLM with a “summarize‑and‑judge” prompt. The model returns a structured JSON object containing a verdict (malicious/benign), a confidence score, a concise behavioral summary, the libraries it identified, and an initial set of search queries. Fourth, a second prompt generates refined CVE‑search queries based solely on the summary, which reduces token usage and focuses the queries on salient keywords. Fifth, these queries are embedded and matched against a vector database of CVE descriptions using semantic similarity (e.g., cosine similarity). Sixth, the results are ranked by similarity score. Seventh, the top‑k (k = 10 in the experiments) CVEs are presented as the vulnerabilities the binary is likely to exploit. Finally, all outputs are saved for further analysis.
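The retrieval and ranking stages (five through seven) can be illustrated with a minimal sketch. This is not the authors' implementation: the three-dimensional vectors stand in for real embeddings, the IDs `CVE-A`, `CVE-B`, and `CVE-C` are placeholders, and scoring each CVE by its best match across all queries is an assumption made here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_cves(query_vectors, cve_index, k=10):
    """Return the k highest-scoring (cve_id, score) pairs.

    Assumption: a CVE's score is its best similarity to any query.
    """
    best = {}
    for cve_id, vec in cve_index.items():
        best[cve_id] = max(
            (cosine_similarity(q, vec) for q in query_vectors),
            default=0.0,
        )
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Toy "embeddings" keyed by placeholder CVE identifiers (hypothetical data).
cve_index = {
    "CVE-A": [1.0, 0.0, 0.0],
    "CVE-B": [0.0, 1.0, 0.0],
    "CVE-C": [0.7, 0.7, 0.0],
}
queries = [[1.0, 0.1, 0.0]]  # one embedded search query

top = rank_cves(queries, cve_index, k=2)
for cve_id, score in top:
    print(f"{cve_id}: {score:.3f}")
```

In a real deployment the dictionary would be replaced by a vector database and the vectors by the output of an embedding model; the ranking logic stays the same.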

The authors evaluate MalCVE on the publicly available MalDICT dataset, which contains 3,839 JAR executables evenly split between malicious and benign samples. For RQ1, the LLM‑based summarization step alone yields an average detection accuracy of 97%, outperforming many prior works on comparable tasks and doing so at a fraction of the cost of commercial solutions (approximately 1/66 the cost of CrowdStrike Falcon and 1/80 the cost of ANY.RUN). For RQ2, the system achieves a recall@10 of 65% for CVE association, a figure comparable to studies that perform CVE mapping on source code rather than binaries. The authors also highlight the explainability of the approach: the LLM’s natural‑language summary provides human‑readable rationale, facilitating analyst verification.
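For readers unfamiliar with the metric, the following sketch shows one common way to compute recall@k: the fraction of a sample's ground-truth CVEs that appear among the top k retrieved results, averaged over samples. The summary does not state the exact variant the paper uses, so this definition is an assumption, and the identifiers in the example are made up.

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of relevant items found in the first k ranked items."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined without relevant items")
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(samples, k=10):
    """Average recall@k over (ranked_list, relevant_set) pairs."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in samples]
    return sum(scores) / len(scores)


# Hypothetical retrieval results for two malware samples.
samples = [
    (["c1", "c2", "c3"], {"c2"}),        # the one relevant CVE is in the top 2
    (["c4", "c5"], {"c4", "c9"}),        # only one of two relevant CVEs found
]
print(mean_recall_at_k(samples, k=2))
```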

In the related‑work discussion, the paper positions itself against prior LLM‑based malware detection efforts that focus on Android APKs, PE files, or small, imbalanced datasets, and against vulnerability‑mapping research that targets CWE identifiers in source code. MalCVE bridges these gaps by (1) operating directly on decompiled Java binaries, (2) using LLM‑generated summaries as a bridge between detection and CVE retrieval, and (3) relying largely on open‑source tooling, with the LLM API being the primary commercial component.

Limitations are acknowledged. The deobfuscation step only handles constant‑folding and string inlining, leaving more sophisticated obfuscation techniques (control‑flow flattening, dynamic class loading) unaddressed. LLM hallucinations can produce vague or incorrect search queries, potentially lowering CVE recall. The current implementation is Java‑centric; extending to other binary formats such as PE or ELF would require additional decompilation and language‑specific handling.
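To make the scope of that deobfuscation step concrete, here is a toy illustration of folding concatenated string constants. It is deliberately not the authors' method: their tool works on a JavaParser syntax tree, whereas this sketch applies a regular expression to source text and would be confused by quotes inside comments or character literals. The function name and the sample snippet are hypothetical.

```python
import re

# A Java string literal: quotes around non-quote characters or escape pairs.
_LITERAL = r'"(?:[^"\\]|\\.)*"'
# Two literals joined by '+', e.g. "cm" + "d".
_CONCAT = re.compile(rf'({_LITERAL})\s*\+\s*({_LITERAL})')


def fold_string_constants(java_source):
    """Repeatedly merge adjacent string literals joined by '+'."""

    def merge(match):
        left, right = match.group(1), match.group(2)
        # Drop the closing quote of the left literal and the opening
        # quote of the right one, leaving a single literal.
        return left[:-1] + right[1:]

    previous = None
    while previous != java_source:  # iterate until nothing changes
        previous = java_source
        java_source = _CONCAT.sub(merge, java_source)
    return java_source


snippet = 'Runtime.getRuntime().exec("cm" + "d.e" + "xe /c " + payload);'
folded = fold_string_constants(snippet)
print(folded)
```

The non-constant operand `payload` is left untouched, which mirrors the limitation described above: anything that is not a compile-time constant, let alone control-flow flattening or dynamic class loading, is outside what this kind of folding can recover.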

Future work outlined includes (a) building a unified decompilation framework that supports multiple languages and binary formats, (b) integrating dynamic analysis (sandbox execution) with the RAG pipeline to enrich context, (c) fine‑tuning or applying parameter‑efficient adapters (e.g., LoRA) to specialize LLMs for security tasks, and (d) linking CVE results with broader security knowledge graphs (CWE, CAPEC) for richer threat intelligence.

In conclusion, MalCVE demonstrates that generic, pre‑trained LLMs can be harnessed to achieve high‑accuracy, cost‑effective malware detection and to provide actionable CVE associations directly from obfuscated binaries. The system’s explainable outputs and low entry barrier make it a promising foundation for next‑generation automated threat analysis and vulnerability intelligence pipelines.

