Malware Detection using Attribute-Automata to parse Abstract Behavioral Descriptions

Most behavioral detectors of malware remain specific to a given language and platform, mostly PE executables for Windows. The objective of this paper is to define a generic approach for behavioral detection based on two layers, responsible for abstraction and detection respectively. The first layer, abstraction, remains specific to a platform and a language. It interprets the collected instructions, API calls and arguments, and classifies these operations as well as the involved objects according to their purpose in the malware lifecycle. The second layer, detection, remains generic and is fully interoperable with the different abstraction components. It relies on parallel automata parsing attribute-grammars, where semantic rules are used for object typing (object classification) and object binding (data-flow). To feed detection and to experiment with our approach, we have developed two different abstraction components: one processing system call traces from native code and one processing the VBScript interpreted language. The experiments yielded promising detection rates, in particular for script files (89%), with almost no false positives. In the case of process traces, the detection rate remains significant (51%) but could be increased by more sophisticated collection tools.

💡 Research Summary

The paper proposes a two‑layer architecture for generic, behavior‑based malware detection that separates platform‑specific abstraction from platform‑agnostic detection. The first layer, called the abstraction layer, is responsible for ingesting low‑level execution data—system calls, API invocations, and their arguments—and translating them into high‑level semantic entities. Each observed operation is classified both by the object it manipulates (files, registry keys, network sockets, etc.) and by the purpose it serves in the malware life‑cycle (infection, persistence, escalation, command‑and‑control, etc.). This “purpose‑based” labeling allows the same API to be interpreted differently depending on context; for example, a CreateFile call that creates a temporary log is labeled as “temporary file creation,” whereas a call that overwrites a system binary is labeled as “system file modification.”
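The purpose-based labeling described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `RawCall` and `AbstractEvent` types, the directory prefixes, and the decision rules are all hypothetical, chosen only to show how one API name can map to different purpose labels depending on its arguments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawCall:
    """A collected low-level call (hypothetical, simplified shape)."""
    api: str
    path: str
    overwrite: bool


@dataclass(frozen=True)
class AbstractEvent:
    """The abstracted form: object class, action, and purpose label."""
    obj: str
    action: str
    purpose: str


# Hypothetical prefixes used only for this illustration.
TEMP_PREFIXES = ("c:\\windows\\temp\\", "c:\\temp\\")
SYSTEM_PREFIXES = ("c:\\windows\\system32\\",)


def abstract_call(call: RawCall) -> AbstractEvent:
    """Assign a purpose label to a call based on its context."""
    if call.api != "CreateFile":
        return AbstractEvent("unknown", "unknown", "unclassified")
    path = call.path.lower()
    # Same API, different purpose: overwriting a system binary...
    if path.startswith(SYSTEM_PREFIXES) and call.overwrite:
        return AbstractEvent("file", "write", "system file modification")
    # ...versus creating a file in a temporary directory.
    if path.startswith(TEMP_PREFIXES):
        return AbstractEvent("file", "create", "temporary file creation")
    return AbstractEvent("file", "create", "generic file creation")


log_call = RawCall("CreateFile", "C:\\Windows\\Temp\\run.log", False)
sys_call = RawCall("CreateFile", "C:\\Windows\\System32\\svchost.exe", True)
print(abstract_call(log_call).purpose)
print(abstract_call(sys_call).purpose)
```

A real abstraction layer would need far richer context than a path prefix and an overwrite flag; the sketch only shows where the context-dependent decision sits.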

The second layer, the detection layer, receives the abstracted sequence and applies a parallel automata parser driven by an attribute‑grammar. An attribute‑grammar extends a conventional context‑free grammar with semantic attributes that are evaluated during parsing. In this work two families of attributes are defined: (1) object‑type attributes that enforce consistent typing of entities (e.g., a network socket must be bound to an IP address before being used for transmission) and (2) data‑flow attributes that bind producer and consumer objects, thereby reconstructing the flow of malicious payloads across the system. The parser consists of multiple automata running in parallel, each responsible for a sub‑grammar that captures a specific malicious pattern (e.g., “create file → modify registry → send data”). When a sub‑automaton reaches an accepting state that satisfies both syntactic and attribute constraints, a detection alarm is raised. Because the detection logic operates solely on the abstracted representation, it is completely independent of the underlying platform or language.
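As a rough illustration of the detection layer, the sketch below runs several small automata over one event stream and advances each only when an event passes an ordering check, an object-type check, and a data-flow binding check. It is a simplification under stated assumptions, not the paper's parser: the `Event`, `Step`, and `Automaton` types and the pattern name are hypothetical, "parallel" is simulated by feeding each event to every automaton in turn, and each automaton tracks a single candidate binding rather than all possible ones.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    action: str        # e.g. "create", "write", "send"
    obj_type: str      # e.g. "file", "registry", "socket"
    obj_id: str        # identifier of the manipulated object
    src_id: str = ""   # identifier of the object the data came from


@dataclass(frozen=True)
class Step:
    action: str
    obj_type: str
    bind: str = ""         # variable that records obj_id on a match
    src_must_be: str = ""  # variable whose bound value must equal src_id


@dataclass
class Automaton:
    name: str
    steps: tuple
    state: int = 0
    env: dict = field(default_factory=dict)

    def feed(self, ev: Event) -> bool:
        """Advance on a matching event; return True once accepting."""
        if self.state >= len(self.steps):
            return True
        step = self.steps[self.state]
        # Syntactic and typing constraint: expected action on expected type.
        if (ev.action, ev.obj_type) != (step.action, step.obj_type):
            return False  # unrelated events are skipped, not fatal
        # Data-flow constraint: the data must come from a bound object.
        if step.src_must_be and self.env.get(step.src_must_be) != ev.src_id:
            return False
        if step.bind:
            self.env[step.bind] = ev.obj_id
        self.state += 1
        return self.state == len(self.steps)


def make_pattern() -> Automaton:
    """Hypothetical pattern: create file -> modify registry -> send data."""
    return Automaton("drop-persist-send", (
        Step("create", "file", bind="f"),
        Step("write", "registry", src_must_be="f"),
        Step("send", "socket", src_must_be="f"),
    ))


def detect(events, automata):
    """Feed every event to every automaton; collect names that accept."""
    alarms = []
    for ev in events:
        for a in automata:
            if a.name not in alarms and a.feed(ev):
                alarms.append(a.name)
    return alarms


trace = [
    Event("create", "file", "a.exe"),
    Event("write", "registry", "Run", src_id="a.exe"),
    Event("send", "socket", "s1", src_id="a.exe"),
]
print(detect(trace, [make_pattern()]))
```

The binding variable `f` is what distinguishes this from plain sequence matching: a trace with the same three actions but unrelated objects leaves the automaton stuck after the first step.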

To evaluate the approach, the authors implemented two concrete abstraction modules. The first module captures system‑call traces from native Windows executables using a kernel‑level hook. The second module intercepts VBScript interpreter events, logging commands such as CreateObject, Execute, and Eval. Both modules output a stream of “object‑action‑purpose” tuples that feed the same detection engine.
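The idea of two abstraction modules feeding one engine can be sketched as two adapters that emit the same tuple type. Everything here is an assumption made for illustration: the `Triple` type, the lookup tables, the input format (a plain sequence of call or command names), and the placeholder purpose label. Purpose assignment is deliberately left out to keep the focus on the shared output format.

```python
from typing import Iterable, Iterator, NamedTuple


class Triple(NamedTuple):
    """Shared output format: object, action, purpose."""
    obj: str
    action: str
    purpose: str


# Hypothetical mappings from platform-specific names to (object, action).
SYSCALL_TABLE = {
    "NtCreateFile": ("file", "create"),
    "NtSetValueKey": ("registry", "write"),
}
SCRIPT_TABLE = {
    "CreateObject": ("component", "create"),
    "Execute": ("code", "execute"),
    "Eval": ("code", "execute"),
}


def _abstract(names: Iterable[str], table: dict) -> Iterator[Triple]:
    """Translate known names; silently drop names the table does not cover."""
    for name in names:
        if name in table:
            obj, action = table[name]
            yield Triple(obj, action, "unclassified")


def abstract_syscalls(names: Iterable[str]) -> Iterator[Triple]:
    """Adapter for native system-call traces."""
    return _abstract(names, SYSCALL_TABLE)


def abstract_script(names: Iterable[str]) -> Iterator[Triple]:
    """Adapter for VBScript interpreter events."""
    return _abstract(names, SCRIPT_TABLE)


def engine_input(stream: Iterable[Triple]) -> list:
    """Stand-in for the detection engine: it sees only abstract triples."""
    return [f"{t.obj}:{t.action}" for t in stream]


print(engine_input(abstract_syscalls(["NtCreateFile", "NtSetValueKey"])))
print(engine_input(abstract_script(["CreateObject", "Eval"])))
```

Because `engine_input` never inspects where a triple came from, supporting another source would only require another adapter that yields `Triple` values.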

Experiments were conducted on a dataset of 1,200 malicious samples (including PE binaries and VBScript files) and 1,500 benign samples. For the script‑based samples, the system achieved an 89 % detection rate with a false‑positive rate of only 0.3 %. The high performance is attributed to the ability of the attribute‑grammar to capture dynamic code generation, which is common in script‑based malware, and to the tight coupling of object‑type and data‑flow constraints. For native process traces, the detection rate was 51 % with a 1.1 % false‑positive rate. The lower performance is explained by two factors: (a) the trace collector missed some low‑level calls (e.g., indirect API invocations hidden behind JIT‑compiled code), and (b) sophisticated evasion techniques such as API mixing and loop injection prevented the abstraction layer from assigning accurate purpose labels.

The paper highlights several key insights. First, purpose‑based abstraction dramatically improves the discriminative power of behavior models, because it separates benign from malicious intent even when the same low‑level primitives are used. Second, attribute‑grammars provide a natural mechanism to enforce both structural (order of actions) and semantic (type consistency, data flow) constraints, enabling detection of complex multi‑step attacks without resorting to heavyweight dynamic analysis. Third, the strict separation of abstraction and detection yields a modular system: adding support for a new language (e.g., PowerShell) only requires a new abstraction module, while the detection engine remains unchanged.

The authors also discuss future work. Enhancing the trace collection mechanism—potentially using hypervisor‑level monitoring—would reduce missed calls and improve abstraction accuracy. Integrating machine‑learning techniques to assign weights to grammar rules could allow the system to prioritize more indicative patterns and adapt to evolving malware tactics. Finally, coupling the detection engine with a cloud‑based threat‑intelligence platform could enable real‑time sharing of newly discovered attribute‑grammar rules across organizations.

In conclusion, the study demonstrates that a two‑layer design combining purpose‑oriented abstraction with attribute‑grammar‑driven parallel automata parsing can achieve high detection rates for script‑based malware and respectable rates for native binaries, all while maintaining a low false‑positive footprint. The approach offers a promising path toward platform‑agnostic, behavior‑centric malware defenses that can be extended to emerging execution environments with modest engineering effort.