From Sands to Mansions: Towards Automated Cyberattack Emulation with Classical Planning and Large Language Models
Evolving attacker capabilities demand realistic and continuously updated cyberattack emulation for threat-informed defense and security benchmarking. Towards automated attack emulation, this paper defines modular attack actions and a linking model to organize and chain heterogeneous attack tools into causality-preserving cyberattacks. Building on this foundation, we introduce Aurora: an automated cyberattack emulation system powered by symbolic planning and large language models (LLMs). Aurora crafts actionable, causality-preserving attack chains tailored to Cyber Threat Intelligence (CTI) reports and target environments, and automatically executes these emulations. Using Aurora, we generated an extensive cyberattack emulation dataset from 250 attack reports, 15 times larger than the leading expert-crafted dataset. Our evaluation shows that Aurora significantly outperforms existing methods in creating actionable, diverse, and realistic attack chains. We release the dataset and use it to evaluate three state-of-the-art intrusion detection systems, whose performance differed notably from results on older datasets, highlighting the need for up-to-date, automated attack emulation.
💡 Research Summary
The paper addresses the growing need for realistic, up‑to‑date cyber‑attack emulation to support threat‑informed defense and security benchmarking. Existing emulation approaches either rely on manually crafted playbooks (e.g., MITRE ATT&CK Evaluations) that are costly and limited in scale, or on procedural orchestration tools (Caldera, PurpleSharp) that sequence isolated actions without modeling the causal dependencies between steps. Consequently, current datasets are small, outdated, and often non‑actionable, limiting their usefulness for evaluating modern intrusion detection systems (IDS).
Key Contributions
- Attack Action Formalism – The authors define an “Attack Action” as the smallest executable unit of an attack, analogous to an atomic action in symbolic planning. Each action is described by a name, source tool, description, ATT&CK tactic/technique mapping, concrete execution instructions (commands or scripts), and the required executor (e.g., PowerShell, Metasploit module). This uniform representation enables heterogeneous tools to be treated interchangeably.
- Attack Action Linking Model (AALM) – AALM explicitly records preconditions (state predicates that must hold before an action) and effects (state predicates that become true after execution). By modeling these relationships, the system guarantees that generated attack chains preserve the causal order observed in real‑world campaigns.
- Symbolic Planning Engine – Using PDDL, Aurora formulates a planning problem where the initial state (I) reflects the known configuration of a testbed (OS versions, installed software, known vulnerabilities) and the goal state (G) encodes the objectives extracted from a CTI report (e.g., obtain admin rights, exfiltrate data). The planner searches the defined action space for a sequence that satisfies all preconditions and reaches G.
- LLM‑Driven Knowledge Extraction & Parameter Completion – Large language models are employed in two stages: (a) parsing unstructured CTI reports and tool documentation to automatically generate the catalog of attack actions; (b) filling missing parameters (file paths, port numbers, credential placeholders) in the planner’s output, tailoring the plan to the specific target environment. This replaces the labor‑intensive manual TTP‑to‑action mapping performed by experts.
- Aurora System – The end‑to‑end pipeline (CTI → action catalog → planning → Python script generation → automatic execution) is implemented as an open‑source framework. Aurora extracted over 5,500 actions from five popular third‑party tools and constructed more than 2,500 attack chains based on 250 recent CTI reports, yielding a dataset 15× larger than the leading expert‑crafted benchmark.
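The Attack Action formalism and the AALM's precondition/effect semantics can be sketched as a small data structure. This is an illustrative reconstruction, not the authors' implementation; the field names and the example predicates are assumptions made for the sketch:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttackAction:
    """One executable attack step, in the spirit of the paper's formalism."""
    name: str
    tool: str                                    # source tool, e.g. "Mimikatz"
    technique: str                               # ATT&CK technique ID, e.g. "T1003"
    command: str                                 # concrete execution instructions
    executor: str                                # e.g. "powershell", "cmd"
    preconditions: frozenset = frozenset()       # predicates that must hold before
    effects: frozenset = frozenset()             # predicates that become true after

def applicable(action: AttackAction, state: set) -> bool:
    """AALM rule: an action may fire only if all its preconditions hold."""
    return action.preconditions <= state

def apply_action(action: AttackAction, state: set) -> set:
    """Advance the world state by adding the action's effects."""
    return state | action.effects
```

Because every tool's capability is reduced to the same (preconditions, effects) interface, heterogeneous actions become interchangeable building blocks for the planner.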
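The planning step described above can be illustrated with a toy forward search over predicate sets. Aurora uses a PDDL planner; the breadth-first search below is only a minimal stand-in showing how a causality-preserving chain from initial state I to goal G emerges from preconditions and effects (the action and predicate names are invented for the example):

```python
from collections import deque

def plan(actions, initial, goal):
    """Breadth-first forward search: find an action sequence whose
    cumulative effects satisfy every goal predicate, firing each
    action only once its preconditions hold. A toy stand-in for the
    PDDL planner described in the paper."""
    initial, goal = frozenset(initial), frozenset(goal)
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        state, chain = queue.popleft()
        if goal <= state:
            return chain                       # goal state G reached
        for name, pre, eff in actions:
            if frozenset(pre) <= state:        # preconditions hold
                nxt = state | frozenset(eff)   # add effects
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, chain + [name]))
    return None                                # no chain reaches G
```

On a catalog like `[("escalate", {"foothold"}, {"admin_rights"}), ("dump_creds", {"admin_rights"}, {"credentials"}), ("exfiltrate", {"credentials"}, {"exfiltrated"})]`, the search can only emit the three actions in causal order, since each one's precondition is a previous one's effect.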
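The parameter-completion stage can be sketched as template filling. In Aurora an LLM supplies the environment-specific values; here a plain dictionary lookup stands in for it, and the placeholder syntax (`<name>` slots) is an assumption of this sketch, not the paper's format:

```python
import re

def complete_parameters(command_template: str, environment: dict) -> str:
    """Fill <placeholder> slots in a planner-emitted command with values
    for the concrete target environment. In Aurora this gap is filled by
    an LLM; a lookup stands in for it here. Unknown slots are left as-is,
    so they remain visible for post-generation verification."""
    def resolve(match):
        key = match.group(1)
        return str(environment.get(key, match.group(0)))
    return re.sub(r"<(\w+)>", resolve, command_template)
```

Keeping unresolved placeholders intact (rather than guessing) mirrors the limitation the authors note: LLM-generated parameters may be wrong, so downstream verification needs to see what was, and was not, filled in.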
Evaluation
The authors compare Aurora against existing automated planners (e.g., ChainReactor) and against human‑crafted playbooks on four metrics: (i) actionability (presence of concrete commands), (ii) causal coherence (precondition/effect consistency), (iii) diversity (variety of ATT&CK TTP combinations), and (iv) scalability (ease of adding new tools/techniques). Aurora outperforms baselines on all metrics, achieving a 92 % causal‑coherence rate versus ~45 % for prior work, and providing executable scripts for 100 % of its actions.
To demonstrate the practical impact, the newly generated dataset is used to evaluate three state‑of‑the‑art IDSs (a DeepLog‑style sequence model, a Zeek‑based rule engine, and a Graph Neural Network detector). All three systems show a notable drop in detection performance (≈15 % lower accuracy) compared with results obtained on older benchmark datasets, underscoring the importance of up‑to‑date, realistic attack emulations for robust IDS assessment.
Limitations & Future Work
Aurora assumes complete knowledge of the target environment (OS, vulnerabilities, network topology). It does not yet handle planning under uncertainty or dynamic changes during execution. The LLM‑generated parameters may occasionally be incorrect, suggesting a need for post‑generation verification. Future research directions include (a) extending the planner to operate with partial information, (b) integrating formal verification or sandbox feedback to validate LLM outputs, and (c) enabling real‑time adaptive replanning as the environment evolves.
Conclusion
The paper presents Aurora, the first system that unifies symbolic planning with large language models to automatically generate and execute causally coherent, customizable, and fully actionable cyber‑attack chains from raw CTI reports. By releasing a large, up‑to‑date attack dataset, the authors provide a valuable benchmark for the security community and set a new baseline for automated cyber‑attack emulation research.