Identifying Adversary Tactics and Techniques in Malware Binaries with an LLM Agent
Understanding TTPs (Tactics, Techniques, and Procedures) in malware binaries is essential for security analysis and threat intelligence, yet remains challenging in practice. Real-world malware binaries are typically stripped of symbols, contain large numbers of functions, and distribute malicious behavior across multiple code regions, making TTP attribution difficult. Recent large language models (LLMs) offer strong code understanding capabilities, but applying them directly to this task faces challenges in identifying analysis entry points, reasoning under partial observability, and misalignment with TTP-specific decision logic. We present TTPDetect, the first LLM agent for recognizing TTPs in stripped malware binaries. TTPDetect combines dense retrieval with LLM-based neural retrieval to narrow the space of analysis entry points. TTPDetect further employs a function-level analyzing agent consisting of a Context Explorer that performs on-demand, incremental context retrieval and a TTP-Specific Reasoning Guideline that achieves inference-time alignment. We build a new dataset that labels decompiled functions with TTPs across diverse malware families and platforms. TTPDetect achieves 93.25% precision and 93.81% recall on function-level TTP recognition, outperforming baselines by 10.38% and 18.78%, respectively. When evaluated on real-world malware samples, TTPDetect recognizes TTPs with a precision of 87.37%. For malware with expert-written reports, TTPDetect recovers 85.7% of the documented TTPs and further discovers, on average, 10.5 previously unreported TTPs per malware sample.
💡 Research Summary
The paper introduces TTPDetect, the first large-language-model (LLM) agent designed to identify MITRE ATT&CK tactics, techniques, and procedures (TTPs) directly from stripped malware binaries. The authors observe that real-world malware binaries are often symbol-stripped, contain thousands of functions, and distribute malicious behavior across many code regions, making manual TTP attribution labor-intensive and error-prone. Existing approaches either rely on richer artifacts, such as threat-intel reports or dynamic execution traces, or treat the problem as a simple binary classification, which does not reveal which specific functions implement a given technique.
To overcome these challenges, TTPDetect follows a two-stage workflow that mirrors how human analysts work. Stage 1 – Entry-point narrowing uses a hybrid retrieval system that combines dense vector similarity search with LLM-driven neural re-ranking. First, each function in a binary is embedded using a pretrained code encoder; a fast approximate nearest-neighbor index then returns a shortlist of functions that are semantically close to the textual description of each ATT&CK technique. The LLM then re-examines the shortlist, scoring the alignment between function code and technique description, and produces a compact set of candidate function-technique pairs. This avoids the cost of evaluating every function against every technique while preserving high recall.
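The two-step narrowing in Stage 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are toy vectors, an exhaustive cosine search stands in for the approximate nearest-neighbor index, and `rerank_score` is a hypothetical placeholder for the LLM re-ranking call. The function names and the pairing of technique IDs with vectors are made up for the example.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def shortlist(func_vecs: Dict[str, Vector], technique_vec: Vector, k: int) -> List[str]:
    """Dense retrieval step: the k functions closest to a technique description."""
    ranked = sorted(
        func_vecs,
        key=lambda name: cosine(func_vecs[name], technique_vec),
        reverse=True,
    )
    return ranked[:k]


def narrow_entry_points(
    func_vecs: Dict[str, Vector],
    technique_vecs: Dict[str, Vector],
    rerank_score: Callable[[str, str], float],  # hypothetical stand-in for the LLM
    k: int = 2,
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Keep only shortlisted (function, technique) pairs the re-ranker accepts."""
    candidates = []
    for technique, tvec in technique_vecs.items():
        for func in shortlist(func_vecs, tvec, k):
            if rerank_score(func, technique) >= threshold:
                candidates.append((func, technique))
    return candidates


# Toy embeddings (assumed values, three dimensions for readability).
FUNC_VECS = {
    "sub_401000": [0.9, 0.1, 0.0],
    "sub_401200": [0.1, 0.9, 0.0],
    "sub_401400": [0.0, 0.2, 0.9],
}
TECHNIQUE_VECS = {
    "T1055": [1.0, 0.0, 0.0],
    "T1070": [0.0, 1.0, 0.1],
}


def toy_rerank(func: str, technique: str) -> float:
    """Pretend re-ranker: accepts two hand-picked pairs, rejects the rest."""
    accepted = {("sub_401000", "T1055"), ("sub_401200", "T1070")}
    return 1.0 if (func, technique) in accepted else 0.0


CANDIDATES = narrow_entry_points(FUNC_VECS, TECHNIQUE_VECS, toy_rerank)
```

With `k = 2`, each technique sends only two functions to the re-ranker instead of all three, which is the point of the shortlist: the expensive LLM call scales with `k` per technique, not with the number of functions in the binary.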
Stage 2 – Function‑level analysis employs a specialized analyzing agent composed of two components:
- Context Explorer – Recognizing that a single function often lacks sufficient evidence, the agent iteratively expands the analysis context by retrieving callers, callees, and other related functions. The expansion is guided by an explicit prompt that asks the LLM to reason about which neighboring functions are likely to contain complementary evidence. This on-demand, incremental retrieval mitigates the partial observability problem inherent in static function-level analysis.
- TTP-Specific Reasoning Guideline – For each ATT&CK technique, the authors craft a structured reasoning template that lists positive indicators, negative counter-examples, and differentiation criteria (e.g., “Impair Defenses” vs. “Indicator Removal”). The LLM is forced to follow this template and produce a binary “yes/no” decision rather than free-form enumeration. By anchoring the model’s output to a domain-specific decision logic, hallucinations and over-generalizations are substantially reduced.
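The two components can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Guideline` fields, the prompt wording, the toy call graph, and `pick_next`, a hypothetical callable standing in for the LLM's choice of which neighbor to pull in next. The summary does not give the authors' actual prompts or data structures.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Guideline:
    """Hypothetical container for one technique's reasoning template."""
    technique: str
    positive_indicators: List[str]
    negative_examples: List[str]
    differentiation: List[str]

    def render(self, context: List[str]) -> str:
        """Build a prompt that ends in a forced yes/no question."""
        lines = [f"Technique: {self.technique}", "Positive indicators:"]
        lines += [f"  + {item}" for item in self.positive_indicators]
        lines.append("Negative counter-examples:")
        lines += [f"  - {item}" for item in self.negative_examples]
        lines.append("Differentiation criteria:")
        lines += [f"  * {item}" for item in self.differentiation]
        lines.append("Functions in context: " + ", ".join(context))
        lines.append("Answer strictly 'yes' or 'no'.")
        return "\n".join(lines)


def parse_decision(reply: str) -> bool:
    """Accept only a bare yes/no reply; anything else is rejected."""
    answer = reply.strip().lower()
    if answer not in ("yes", "no"):
        raise ValueError(f"expected 'yes' or 'no', got {reply!r}")
    return answer == "yes"


def explore(
    start: str,
    call_graph: Dict[str, List[str]],
    pick_next: Callable[[List[str], List[str]], Optional[str]],
    max_steps: int = 4,
) -> List[str]:
    """Grow the context one function per step, as chosen by pick_next."""
    context = [start]
    for _ in range(max_steps):
        # Frontier: callers/callees of anything in context, not yet included.
        reachable = {n for f in context for n in call_graph.get(f, ())}
        frontier = sorted(reachable - set(context))
        if not frontier:
            break
        choice = pick_next(context, frontier)
        if choice is None or choice not in frontier:
            break  # the chooser signals that no further context is needed
        context.append(choice)
    return context


# Toy call graph: each entry lists a function's callers and callees together.
CALL_GRAPH = {
    "sub_401000": ["sub_401200", "sub_401400"],
    "sub_401200": ["sub_401000", "sub_401600"],
    "sub_401400": ["sub_401000"],
    "sub_401600": ["sub_401200"],
}


def toy_pick(context: List[str], frontier: List[str]) -> Optional[str]:
    """Pretend chooser: follows two hand-picked functions, then stops."""
    interesting = ("sub_401200", "sub_401600")
    for name in frontier:
        if name in interesting:
            return name
    return None


CONTEXT = explore("sub_401000", CALL_GRAPH, toy_pick)
GUIDELINE = Guideline(
    technique="Impair Defenses",
    positive_indicators=["stops or disables a security service"],
    negative_examples=["stops an unrelated service"],
    differentiation=["deleting logs or artifacts points to Indicator Removal"],
)
PROMPT = GUIDELINE.render(CONTEXT)
```

Rejecting any reply other than a bare "yes" or "no" in `parse_decision` mirrors the forced binary decision described above; a real system would likely also retry or log malformed replies.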
The authors also construct a new dataset, Function‑TTP, which contains over 45 k decompiled functions from diverse malware families (Windows PE, Linux ELF) manually labeled with ATT&CK techniques by expert reverse engineers. The dataset is split into training, validation, and test sets, and includes multi‑label annotations because a single function may implement several techniques.
Experimental results show that TTPDetect achieves 93.25% precision and 93.81% recall on the function-level test set, outperforming baseline prompting strategies (simple zero-shot or few-shot prompts) by 10.38% in precision and 18.78% in recall. When applied to 300 real-world malware samples, the system maintains 87.37% precision at the binary level. In a case study on recent malware samples with publicly released analyst reports, TTPDetect recovers 85.7% of the documented TTPs and discovers, on average, 10.5 previously unreported techniques per sample; manual verification confirms these as true positives.
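For readers unfamiliar with how precision and recall apply to multi-label function-level predictions, the sketch below computes both over (function, technique) pairs. The summary does not state which averaging scheme the paper uses, so pooling all pairs (micro-averaging) is an assumption here, and the labels are invented toy data.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (function name, technique ID)


def precision_recall(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float]:
    """Precision and recall over pooled (function, technique) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy labels: f1 implements two techniques, so it contributes two gold pairs.
GOLD = {("f1", "T1055"), ("f1", "T1070"), ("f2", "T1562"), ("f3", "T1027")}
PRED = {("f1", "T1055"), ("f2", "T1562"), ("f3", "T1070")}

P, R = precision_recall(PRED, GOLD)
```

In this toy case two of the three predicted pairs are correct (precision 2/3) and two of the four labeled pairs are found (recall 1/2). The report-coverage figure quoted above is the same recall computation with the analyst report's technique list as the reference set.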
The paper discusses several limitations. First, static analysis alone cannot capture techniques that are only triggered under specific runtime conditions, so some TTPs may remain undetected. Second, the reasoning guidelines must be updated as ATT&CK evolves, incurring maintenance overhead. Third, safety filters in commercial LLM APIs sometimes reject malicious code inputs, potentially limiting deployment. The authors suggest future work integrating dynamic traces (e.g., sandbox logs) for multimodal reasoning, automating guideline generation via meta‑learning, and exploring lightweight on‑device models for real‑time threat hunting.
In conclusion, TTPDetect showcases how a carefully engineered combination of dense retrieval, LLM‑driven neural re‑ranking, incremental context exploration, and domain‑specific reasoning can turn a generic code‑understanding model into a high‑precision TTP attribution engine for stripped binaries. This bridges a critical gap between raw binary analysis and actionable threat intelligence, offering a scalable, automated alternative to labor‑intensive manual reverse‑engineering.