Introduction and Motivation

Modern software ecosystems increasingly depend on closed-source components and proprietary libraries embedded within operating systems. This is especially evident in Apple’s macOS. macOS ships with numerous private frameworks, which are collections of code libraries that provide essential services but lack official documentation or support . Reverse engineering is often the only means for security analysts and developers to understand these Application Programming Interfaces (APIs), identify vulnerabilities, and enable third-party solutions.

*Command execution across privilege boundaries via hidden APIs.* A sandboxed application invokes a private framework API, which registers an XPC messaging interface. Through this channel, attacker-controlled arguments can be used to execute code outside the sandbox.

Private macOS frameworks have repeatedly been at the center of severe security flaws . Consider a recent case involving the StorageKit framework in macOS Sonoma beta. Simply loading this private framework caused the system to automatically register a background service for inter-process communication (an XPC service). Deeper inspection revealed an undocumented messaging interface that allowed sandboxed applications to execute arbitrary commands with attacker-controlled arguments. By inferring the structure of this hidden interface, researchers reconstructed a communication proxy and demonstrated remote command execution from a sandboxed context. Although removed in macOS 14.0, this case reflects a recurring pattern of vulnerabilities (illustrated in Figure 1): a hidden API exposes an unintended capability, which then becomes an entry point for privilege escalation or policy bypass. Earlier examples include CVE-2019-8561 in PackageKit, which enabled System Integrity Protection (SIP) bypass via a race condition, and CVE-2021-30873 , which permitted process injection through private interface misuse. Accurate type inference in such contexts helps surface overly broad interfaces and implicit trust boundaries, offering structural cues that can inform vulnerability analysis without requiring direct exploit development.

Automated extraction and completion of private framework APIs represent a key step in the broader reverse engineering process for macOS security analysis. Accurate inference of API argument and return types enables researchers to reconstruct interfaces and reason about their intended use, shifting the research focus from tedious manual recovery to higher-level vulnerability detection. This capability extends beyond macOS itself: many third-party applications (e.g., , ), including widely distributed commercial software (e.g., ), rely on undocumented APIs for extended functionality. Vulnerabilities in these APIs can therefore propagate into higher-level software that inherits or misuses them. Improved access to reconstructed API definitions (Type inference) produced through type inference thus benefits both OS-level security research and the auditing of dependent applications.

Recovering type information from binary executables has long been a core objective in reverse engineering. Classical approaches rely on static analysis with handcrafted rules or data-flow reasoning to infer variable and function types from usage patterns . For example, TIE introduced a systematic reconstruction framework that applies constraint propagation across binary code to improve both precision and conservativeness compared to traditional decompiler heuristics. Subsequent work explored machine learning for identifying type signatures in disassembled code . Notably, TypeMiner employed classification techniques to recover data types, achieving 76–93% accuracy on C binaries.

Despite these advances, significant challenges remain: existing methods often struggle with complex object-oriented binaries, depend on large training corpora, or fall short in reconstructing complete method signatures and interface definitions. As a consequence, in practice, reverse engineers still expend significant manual effort to determine how undocumented macOS frameworks should be invoked . Persistent limitations hinder prior work: rule-based systems are brittle against evolving compiler idioms, while purely static analyses remain incomplete without broader semantic understanding. Existing learning-based efforts focus on general binary analysis rather than macOS framework reconstruction, leaving this domain largely unexplored.

Meanwhile, although LLMs have shown strong performance in source-code understanding and reasoning , their application to binary analysis remains largely unexplored. This motivates our central question: Can LLMs, when combined with program-analysis tools, autonomously recover accurate method signatures and interface definitions in undocumented macOS frameworks? To address this, we introduce MOTIF (Mach-O Type Inference Framework), a system that treats type inference and interface reconstruction as an iterative, LLM-guided analytical process. MOTIF tackles key challenges in binary analysis: (I) sparse type encodings, (II) incomplete disassembly metadata, and (III) the absence of ground-truth evaluation through three components:

MOTIF-agent: An LLM-guided reverse engineering pipeline that orchestrates external tools such as disassemblers, decompilers, and static linters to progressively reconstruct missing type information from incomplete headers (see §5).
MOTIF-bench: A benchmark dataset constructed from public macOS frameworks with ground-truth headers, used for quantitative evaluation of type inference accuracy (see §6).
MOTIF-model: A small LLM distilled from large-model interaction traces, specialized for local deployment and efficient type inference (see §7).

Community Reconstructions of Private Frameworks. Technical blogs and curated indexes (e.g., PrivateFrameworks repositories and The Apple Wiki ) document private APIs by extracting Objective-C metadata with tools such as class-dump and RuntimeBrowser. These sources are valuable as living catalogs and as starting points for auditing undocumented interfaces. However, the reconstructed headers are typically incomplete: many parameter and return types are recovered only as id or void*, which encode minimal semantic information and hinder static analysis or safe API usage. Additionally, signatures involving blocks (i.e., function-type closures) and protocol-qualified types (interfaces that constrain object behavior) are often missing or incorrectly inferred.

Type Reconstruction in Compiled Binaries. Constraint-based and dataflow-driven systems for stripped binaries (e.g., ) infer types from calling conventions, register usage, and memory layouts. This line of work is strong in statically typed C/C++ settings with regular Application Binary Interfaces (ABIs) and compiler artifacts. Objective-C breaks these assumptions in several fundamental ways.

First, message sends store only compact type encodings for selector names (e.g., [obj doSomething:]) but omit type annotations, making it difficult to determine parameter and return types from the call site alone.

Second, dynamic dispatch defers method resolution until runtime, shifting essential interface information into Objective-C–specific binary sections (e.g., __objc_classlist, __objc_methname), which fall outside the scope of conventional static analysis pipelines.

Third, the widespread use of id (a dynamically typed object placeholder) further erases static constraints: any object can be passed or returned under the id type, which carries no information about supported methods or internal structure.

Finally, blocks (closures capturing local context) introduce additional complexity through nested function signatures and implicit state capture. Classical propagation techniques, which rely on explicit type flow and fixed call graphs, cope poorly with such semantics. As a result, they often fail to recover protocol-qualified types or infer generic collection types accurately.

Large Language Models in Binary Analysis. Recent studies apply LLMs to reverse engineering tasks, with a growing number of works exploring static and dynamic contexts . ReSym recovers variable and data-structure symbols from stripped binaries, while PentestGPT explores autonomous offensive workflows. These approaches primarily target C or C++ binaries and are not designed for the dynamic, metadata-rich environment of macOS private frameworks. They generally operate independently of external tools such as disassemblers, metadata extractors, or static/dynamic linters, producing results that may appear plausible but cannot be systematically verified against the binary.

Background: macOS Frameworks and Objective-C Typing

Mach-O Binaries and Framework Architecture. Apple’s macOS and iOS use the Mach-O (Mach Object) file format as the native binary format for executables, libraries, and dynamically-loaded components.

In macOS, shared libraries are often packaged as frameworks, which are directory bundles (with a .framework extension) containing a dynamic shared library along with resources like headers, assets, and metadata. A framework bundle typically includes a versioned directory structure (e.g., Versions/A) that houses the actual Mach-O library and symlinks at the top level pointing to the current version. This design allows multiple versions of a library to coexist for binary compatibility, with the loader automatically linking against the current version of the framework’s dylib.

Public macOS frameworks (located in /System/Library/Frameworks/) export stable APIs with accompanying header files and documentation. In contrast, private frameworks (in /System/Library/PrivateFrameworks/) are intended for internal OS use and are not documented or exposed in the official SDK. These private frameworks are structurally similar to public ones (they are Mach-O dynamic libraries with Objective-C or C/C++ code inside), but because they lack published headers or documentation, their APIs are opaque to third-party developers. System applications and daemons frequently rely on private frameworks for functionality; for example, Disk Utility links against numerous private frameworks like DiskManagement.framework and StorageKit.framework.

Although private frameworks are not part of the official SDK, they still expose exported classes and selectors that serve as their callable interface. These symbols must remain accessible for any system or third-party binary that links against the framework, even though no documentation or header files are provided.

Finally, because macOS frameworks (including private ones) often make heavy use of Objective-C, their binaries embed Objective-C metadata (class names, method selectors, etc.) in Mach-O sections. This runtime metadata, along with the Mach-O structure, is key for disassemblers and other tools to interpret the contents of frameworks. In summary, macOS private frameworks are Mach-O dynamic libraries packaged in bundle form, loaded via the dynamic linker from a shared cache in modern systems. They are functionally analogous to public frameworks but without the benefit of documentation or readily available symbol information, which poses a significant challenge for analysis and type recovery.

Headers and Their Role in Reverse Engineering. Header files (C/C++ header files or Objective-C interface files) describe the public interface of libraries by declaring functions, methods, classes, constants, and data types. In the context of macOS frameworks, headers define the API contract (e.g. for example, for a public framework like Foundation or AppKit), Apple provides .h files in the Xcode SDK that list all class definitions, method signatures, and data structures developers can use. These headers are invaluable for both compilation and for understanding what functions exist and how to call them.

Naturally, no such headers are provided for private frameworks, which means reverse engineers must reconstruct the APIs themselves. To mitigate this, researchers use header reconstruction tools. One well-known utility is class-dump, which analyzes a Mach-O binary and generates Objective-C interface declarations (pseudo-headers) by reading the Objective-C runtime metadata embedded in the file. For example, running class-dump on a private framework binary will produce a header file with the class declarations and method signatures that the binary contains. This is how early iOS/macOS enthusiasts uncovered hidden APIs: by dumping private frameworks to see what classes and methods Apple has implemented. In a similar vein, the open-source RuntimeBrowser tool on macOS can load all private frameworks and use Objective-C runtime APIs to enumerate classes and selectors, presenting a list of methods and allowing export of header files. This approach bypasses direct Mach-O parsing but still extracts only the encoded selectors embedded in the binaries. Disassemblers like Hopper and IDA Pro also have features or plugins to export headers (e.g. they leverage the same Objective-C metadata to reconstruct class interfaces).

However, these reconstructed headers have limitations. (I) Missing symbols. If a binary has been stripped of names or heavily optimized, the extracted output will be incomplete. (II) Lack of type context. The generated headers often show generic placeholders (id, void*) because the extracted metadata contains only selector names and encoded type stubs, without information about actual usage. Without analyzing how methods are called (e.g., whether an id is consistently a NSString*), these tools cannot reconstruct precise types. (III) Ambiguous signatures. As a result, many method signatures look cryptic or truncated, with little information about the real classes or data structures involved. (IV) Non-ObjC code is invisible. Pure C functions and C++ methods (which may exist in the same framework) are not exposed in Objective-C metadata, so they must be identified through disassembly or other forms of analysis.

Objective-C Selectors and Message Passing. A large portion of macOS private frameworks are implemented in Objective-C, making an understanding of its dispatch model essential. Objective-C is a dynamically dispatched, message-passing language: rather than invoking methods through fixed function pointers or vtable offsets, it sends messages to objects.

At compile time, a call such as [object doSomething:arg] is lowered into a call to the C function objc_msgSend, which takes the receiver, a selector (SEL) identifying the method name, and the call arguments.

At runtime, the Objective-C system resolves the selector by inspecting the receiver’s class and consulting its method dispatch table, climbing the inheritance chain if necessary. This process realizes late binding, with the actual target implementation chosen dynamically based on the object’s class.

At the binary level, this indirection leaves only partial traces. The compiled code issues calls to objc_msgSend, while selector strings and their method encodings are stored in dedicated sections such as __objc_methname and __objc_methtype. A method encoding compactly describes the return type and argument layout (e.g., c44@0:8@16c24Q28{̂}@36), but the actual method body is not directly referenced. For reverse engineering, this means one can readily observe which selector is invoked, yet resolving the implementation requires inferring the receiver’s class.

In practice, without static type information, most Objective-C call sites in disassembly collapse into the same generic objc_msgSend, which makes it difficult to reconstruct call relationships and reason about object behavior. This indirection highlights why accurate type inference in Objective-C binaries is inherently challenging.

Scope, Assumptions, and Threat Model

We consider two related scenarios: direct inspection of a private framework binary, and analysis of a client application that depends on a private framework and thus provides an entry point or usage context. Here, framework denotes the particular private framework under investigation. Dependencies from a client binary can be discovered by (I) inspecting the application’s Info.plist and entitlement entries (for example, Mach service names, entitlement flags, or declared access to hardware/services), which often indicate which system services or private frameworks the binary interacts with, or (II) using dynamic-analysis commands (e.g., lm in lldb) to list linked libraries at runtime.

The targeted private framework satisfies several common properties: (I) Objective-C implementation. The framework is written in Objective-C and therefore exposes selector names as metadata. (II) Partial API surface. It exposes callable interfaces to dependent system components or applications, but without distributed headers describing exact argument and return types. This absence of type definitions prevents analysts from recovering full method signatures directly from the binaries. These conditions reflect the broader challenge: private frameworks are discoverable and usable, but lack the type signatures necessary for rigorous analysis.

In this scenario, an attacker with standard reverse-engineering capabilities (e.g., able to statically inspect binaries and dynamically load them into controlled environments, but without access to source code) has two primary strategies for dealing with the framework: (I) Decompile a client binary that depends on the framework (e.g., a system application such as Mail, Calendar, or Disk Utility, or a third-party app that links the private framework). (II) Directly reverse engineer the framework itself, reconstructing pseudo-headers with class and method declarations. While this provides structural visibility, the exact method types for arguments and return values remain unresolved. It is precisely this gap that our type-inference framework aims to address.

MOTIF-agent: LLM-guided Reverse Engineering framework

*Example of MOTIF in operation:* the agent interacts with analysis tools and a semantic linter in a closed-loop refinement process to recover hidden APIs.

*Overview of the MOTIF architecture.* The pipeline consists of: (I) binary preparation; (II) a constraint satisfaction feedback loop under linter-derived semantics, composed of (II.1) prompt assembly; (II.2) an agent system; and (II.3) semantic linter diagnostics. The loop iteratively refines candidate signatures until constraint satisfaction is achieved.

MOTIF-agent combines LLM reasoning with static constraints to recover Objective-C method signatures from stripped macOS binaries (Figure 2). An end-to-end inference trace appears in Appendix 9.1.

From Embedding Priors to Constraint-Guided Inference. Modern LLMs exhibit emergent competence in specialized domains even without task-specific supervision , . Our premise is that Objective-C idioms, naming conventions, and method patterns, in particular from Apple SDKs, are sufficiently represented in pretraining corpora (e.g., developer blogs, StackOverflow answers, public headers) to form strong embedding priors: latent associations encoded in its vector representations that link semantically or syntatically related concepts (e.g. methods named initWithCoder: usually take an argument of type NSCoder *). However, embeddings alone are insufficient for recovering nontrivial type signatures in private frameworks. The absence of header files, presence of anonymous structs, and stripped symbols create gaps that require grounded reasoning beyond what the model can recall. These limitations motivate a hybrid architecture that combines LLM-driven inference with constraint-guided feedback. The next section outlines the high-level design of this system, detailing how preparation, prompting, tool usage, and refinement interact to recover hidden types from stripped binaries.

High-Level Architecture. Figure 3 illustrates the full architecture of MOTIF, which operates as a hybrid inference loop interleaving pretrained LLM priors with symbolic and structural constraints derived from static analysis. The architecture decomposes into two main stages: (I) binary preparation and (II) constraint-guided inference.

Stage I: Static Binary Preparation (Inputs and Parsing). The system accepts as input a stripped Mach-O binary and extracts partial metadata, specifically Objective-C headers and symbol maps, using ipsw. Extracted headers are then parsed with a customized ANTLR4-based parser to identify underspecified declarations and candidate methods requiring type recovery.

Stage II: Constraint-Guided Inference (Context, Tools, Linter, Loop). This context is packaged into structured prompts that combine:

method string with underspecified types,
mangled symbol name,
tool definitions (available static and dynamic analysis primitives),
metadata (e.g., file path, framework name).

During inference, the model agent can query these tools within a ReAct loop, retrieving disassembly, typedef lookups, or symbol addresses as needed. Candidate Objective-C method signatures are synthesized and then validated by a semantic source-level linter enforcing structural, idiomatic, and compilation-level constraints. Diagnostic messages from the linter are fed back into the prompt construction step, creating a closed-loop refinement system. The loop terminates when all hard constraints specified in the system configuration are satisfied or a fixed number of refinement iterations is reached.

Iterative Constrained Refinement. Unlike static prompting, our system performs iterative constrained refinement by integrating the LLM within a tightly controlled feedback loop. Each model-generated signature is statically verified through a domain-specific linter developed in this work for macOS and Objective-C binaries, which emits structured warnings and hard constraints (e.g., unresolved anonymous structs, syntactically invalid pointer types, unsafe generics such as NSDictionary<NSString, id>).

Interactive Diagnostic Cycle and Convergence. The model is then re-prompted with this feedback, forming an interactive diagnostic cycle. This constrained loop acts as a syntactic and semantic convergence mechanism, filtering out implausible completions and driving the model toward signatures that satisfy Objective-C compiler requirements, idiomatic conventions, and disassembler-aligned structures.

Constraint Taxonomy and Linter Semantics. Our semantic linter emits structured diagnostics over candidate signatures, capturing violations of Objective-C typing and private/public framework conventions. Table [tab:linter-taxonomy] summarizes the taxonomy.

Name	Severity	Message Type	Example (Violation)	Suggested Correction

SyntaxErrors	High	Syntax Violation	`void) doSomething;`	`(void) doSomething;`
NoStructs	High	Inline Struct Detected	`struct { double x0; double x1; } center`	`CGPoint center`
SelectorMismatch	High	Selector Divergence	`doSomething:argument2:`	`doSomething:withArg2:`

StructRefs	Medium	Raw Struct Pointer Used	`struct _NSZone *zone`	`NSZoneRef zone`
GenericCollections	Medium	Missing Generic Parameter	`NSArray args`	`NSArray<NSString *> args`
NoIdGenerics	Medium	Generic Type is `id`	`NSArray<id> values`	`NSArray values`

ConventionalTypes	Low	Non-conventional Scalar Type	`_Bool isEnabled;`	`BOOL isEnabled;`

Tool Interface and Execution Layer. As formalized in Appendix 9.2, inference targets are derived from partial headers extracted from Mach-O binaries, parsed into abstract syntax trees, and filtered according to underspecification criteria. In Figure 4, we illustrate a real interaction where the model queries the disassembler for address 0x180017F48, observes the use of objc_msgSend, and revises its candidate method type accordingly.

Example tool call: LLM invokes disassembler to inspect method body.

Tool-Augmented Inference. Once target is constructed, the pipeline enters the inference phase. Here, the LLM operates as a tool-augmented agent, iteratively invokes tools, integrates returned signals, and converges toward a type-complete candidate signature satisfying both syntactic and semantic constraints. This loop forms the core of our type recovery mechanism.

Tool Inventory. The following tools are currently exposed to the model:

Symbol Address Resolution. Maps a selector to its corresponding memory address via a static lookup over symbol tables extracted from the firmware image using ipsw.
Disassembly View. Given an address, returns a disassembled instruction trace (ARM64) from a fixed-length window. Type inferences are drawn from operand behavior (e.g., pointer dereferences, integer arithmetic, or usage of objc_msgSend targets).
Decompiler View. Invokes a decompiler backend IDA Pro CLI to emit higher-level pseudocode. While incomplete for Objective-C, this often reveals control flow, return-type hints (e.g., scalar vs. reference), and selector dispatches on argument slots.
Header Inspection. Retrieves the full header in which the current selector is declared, enabling access to nearby type definitions, property declarations, and superclasses.
Header Index Scan. Enumerates the available Objective-C headers for the current framework bundle. This enables resolution of unknown types (e.g., CustomViewModel) to their defining interface.
Terminalization Operator. Produces the final inferred method signature and exits the ReAct loop. This tool is required to safely conclude inference in instruction-tuned models trained for aggressive tool-calling.

Each tool is embedded into the agent’s action space and selected via token-level planning during decoding. Tool outputs may be validated downstream by the semantic linter or recycled via constrained retries in the Iterative Constrained Refinement.

MOTIF-bench: A Reproducible Benchmark for Mach-O Type Inference

**Overview of Mach-O Type-Inference Framework Benchmark construction process.** The pipeline consists of five main stages. **Data collection:** Ground truth data are extracted from Xcode resources, while input data candidates are dumped from the dyld shared cache and processed. **Filtering:** Frameworks are binned by method count and equally sampled to ensure balanced representation across varying sizes and prevent sampling bias. **Data preprocessing:** Extracted headers are transformed and parsed using ANTLR4-based tools to reduce size and extract incomplete method signatures. **Matching:** Candidate methods are matched against the ground truth and unmatched methods are discarded. **Evaluation process:** Solutions generated by LLMs are parsed and compared against the ground truth on a per-signature basis, evaluating argument types and return types to generate benchmark metrics and reports.

We introduce the first benchmark specifically designed to evaluate the ability of analysis tools and language models to recover type signatures from compiled macOS frameworks. To our knowledge, no prior benchmark targets this problem space: existing datasets for code completion or decompilation do not capture the challenges unique to Objective-C binaries, where dynamic message passing and method encodings complicate static recovery.

Constructing such a benchmark is non-trivial, as it requires aligning authoritative ground-truth headers with incomplete binary metadata while ensuring balanced coverage across framework categories and scales. Motivated by the need for a principled evaluation environment to measure framework-level type inference performance, we developed MOTIF-bench. The benchmark is constructed from publicly documented Apple frameworks and incorporates real binary metadata, yielding a version-specific, reproducible, and objective foundation for type-inference evaluation.

Beyond serving as an evaluation suite, the benchmark and its datasets can also be used to fine-tune inference models (see Section 7), enabling a consistent pipeline from supervised training to empirical assessment.

Benchmark Construction. To provide a fair and reproducible evaluation environment, we constructed a dynamic dataset pipeline that extracts macOS frameworks directly from the target system version. The pipeline collects two complementary sources: (I) ground-truth headers from the corresponding version of Xcode, and (II) binary metadata from the system’s dyld shared cache. This dual-source design ensures that benchmark instances reflect both authoritative type specifications and the incomplete artifacts available in practice. The specific set of frameworks included varies with the macOS version under analysis; for example, the distribution shown in Figure 6 corresponds to macOS 26.0 Beta (build 25A5295e). Frameworks are stratified by Apple’s official category labels (e.g., System, Graphics & Games, App Services). Within each category, frameworks are further partitioned into bins by method count: small ($`\leq 10`$), medium (10–100), and large (100–1000). From each bin, frameworks are sampled in equal proportion, mitigating sampling bias and preventing large frameworks from dominating the dataset. Approximately 70% of sampled frameworks are reserved for evaluation, with the remainder allocated to model training dialogues. The detailed allocation is summarized in Table 1.

Example of dataset generated from MacOS Tahoe 26.0 Beta (25A5295e)

Category	Train Ratio	Test Ratio
System	0.74	0.26
Graphics and Games	0.76	0.24
App Services	0.73	0.27
Developer Tools	0.77	0.23
Media	0.74	0.26
App Frameworks	1.00	0.00
Web	0.92	0.08
Total	0.81	0.19

Train/Test Ratios Across Categories

This construction procedure yields a benchmark that is both balanced and extensible. By grounding in official Apple headers and dyld caches, MOTIF-bench captures the practical difficulty of type recovery while maintaining a clear ground truth. Its dynamic, version-specific design ensures that the dataset can be regenerated for future macOS releases, supporting long-term reproducibility of evaluations.

MOTIF-model: Lightweight Tool-Aware Language Model for Type Inference

Reverse engineering private or undocumented macOS frameworks often involves legally sensitive binaries, making reliance on proprietary API-based systems undesirable due to confidentiality, compliance, or data protection concerns. Furthermore, cloud APIs introduce high token costs and unpredictable latency, complicating reproducibility in controlled research settings. A lightweight, locally deployable model provides a privacy-preserving alternative. From a technical perspective, general-purpose instruction-tuned LLMs are not designed to synthesize type signatures under partial information, nor are they optimized to integrate tool-derived constraints. MOTIF-model addresses this gap. Its core design goal is to recover Objective-C method signatures when headers are absent, while conforming to structural and symbolic constraints available from static analysis. To support this objective, the model is explicitly aligned with tool-aware inference: it must not hallucinate signatures, but rather integrate tool outputs when available and abstain or defer when constraints cannot be satisfied.

Training Dataset and Preprocessing. The training corpus was distilled from MOTIF-bench, ensuring alignment between evaluation and fine-tuning. Approximately 3,000 dialogues were collected by running multiple candidate LLMs on benchmark tasks. To ensure quality, we applied a distillation filter with the following criteria: (I) Malformed tool invocations were excluded to avoid training on invalid call sequences. (II) Infinite recursion in the ReAct loop was discarded, preventing degenerate reasoning traces. (III) Low-quality generations with benchmark scores below 0.8 were removed.

The remaining dialogues captured complementary strengths across models, for example, some systems were more effective at generic collection typing, while others were better at resolving struct references. By merging them, we obtained a single coherent dataset of high-quality dialogues. We applied the Qwen-3 chat template to the dataset, converting all dialogues into a consistent multi-turn format with explicit tool calls and definitions preserved. Preprocessing enforced a structured prompt schema, where tool-related context was marked using control tokens such as <tools> for tool definitions and <tool_call> for tool invocations. This ensured that the model could reliably distinguish natural-language dialogue turns from tool-driven actions.

Fine-Tuning and Tool-Constraint Alignment. Fine-tuning was performed using the QLoRA approach via the Axolotl framework . QLoRA (Quantized Low-Rank Adaptation) operates on quantized weight representations while inserting low-rank adapters into frozen transformer layers, enabling efficient adaptation with a reduced memory footprint. This setup allowed training to complete on a single NVIDIA H200 GPU. Constraint alignment was realized by embedding tool definitions into the prompt context and penalizing generations that violated linter-enforced rules. In effect, the model was trained to synthesize type signatures consistent with tool-derivable evidence, rather than relying solely on prior knowledge.

Inference-Time Tool Integration. At runtime, MOTIF-model operates within a ReAct loop, where it must conform to available tool traces and abstain or flag inconsistencies if conflicts arise. Early experiments revealed a tendency toward aggressive tool invocation, sometimes exhausting the loop without converging on a stable signature. To mitigate this, we introduced a control operator that allows the model to explicitly exit the loop and return to the feedback cycle, which stabilized long-horizon reasoning and prevented unproductive tool churn. The final model was quantized and converted into Apple’s MLX format, enabling efficient execution on a single M2 Ultra (32 GB VRAM) or comparable consumer-grade accelerators. This deployment footprint allows researchers to conduct local, privacy-preserving type inference experiments at scale, without exposing binaries or research traces to cloud providers.

Evaluation Methodology and Results


Methods	Rank	Avg.	Numerical / Binary Answer Metrics						Signature Inference Subtasks

Static Analysis	17	24.5	14.9	0.0	–	–	–	–	98.0	0.0	0.0	0.0

Claude Sonnet 4	6	78.7	80.1	62.8	–	–	–	–	99.0	65.9	73.8	76.0
Deepseek R1	13	48.1	58.9	43.6	–	–	–	–	71.7	47.9	36.9	39.4
Gemini 2.0 flash	10	57.3	71.5	48.6	–	–	–	–	100.0	39.6	43.1	46.6
ChatGPT 4o	8	63.0	74.5	51.8	–	–	–	–	100.0	50.7	52.3	49.0

Qwen3 (8B) finetuned	3	42.4	83.5	64.7	99.5	16.3	100.0	0.0	99.4	32.2	9.6	28.6
Llama 3.1 (8B-Instruct) finetuned	6	37.6	80.3	58.1	97.9	9.1	100.0	0.0	93.6	33.0	3.8	20.1

GPT-4o	2	50.0	86.3	68.9	99.9	14.4	100.0	0.0	99.4	44.3	17.0	39.2
Deepseek R1	7	54.8	79.5	61.1	99.9	53.7	97.7	2.3	96.0	42.0	35.8	45.5
Claude Sonnet 4	1	94.7	86.7	72.3	77.4	0.1	100.0	0.0	100.0	83.2	100.0	95.5
Gemini-2.0 Flash	4	88.7	82.9	60.8	66.8	6.5	100.0	0.0	100.0	79.2	100.0	75.5
XAI Grok 4	5	56.4	82.1	73.0	99.7	5.6	100.0	0.0	82.1	50.7	48.1	44.6

Qwen3 (14B)	9	39.1	73.8	49.5	97.7	46.6	100.0	0.0	99.4	31.4	1.9	23.8
gpt-oss-120b	11	41.5	68.3	61.6	85.2	40.9	98.3	1.7	81.1	31.1	23.8	29.9
gpt-oss-20b	12	28.5	57.1	40.2	82.4	30.4	82.6	17.4	68.8	16.7	0.0	28.6
Llama 3.1 8B Instruct	14	19.4	35.9	19.0	91.5	10.3	100.0	0.0	58.5	7.6	2.0	9.4
Phi-3-medium	13	26.8	46.1	22.7	0.0	22.6	–	–	82.2	9.1	3.8	11.9

Our evaluation assesses both the empirical accuracy and practical utility of MOTIF. We (I) quantitative benchmark on public macOS frameworks using MOTIF-Bench to measure type-inference accuracy against ground-truth headers; and (II) qualitative validate on MOTIF-Private, a manually-curated dataset of private frameworks designed to assess reconstruction fidelity and practical utility in real-world reverse-engineering scenarios.

Experimental Environment. All experiments were conducted on a dedicated workstation running macOS Sonoma 14.7.6 (23H626) with an Apple M2 Ultra processor and 192 GB of unified memory.

Toolchain. The binary extraction pipeline relied on ipsw for dyld shared cache parsing and framework extraction, integrated features of ipsw for disassembly, and IDA Pro 9.1 for decompilation. Our custom ANTLR4-based grammar was used to parse incomplete Objective-C headers into abstract syntax trees (ASTs) and identify inference targets. Static validation was performed with a bespoke type-consistency linter.

Models. The primary system under evaluation is MOTIF-model, our fine-tuned 8B-parameter LLM based on Qwen3-8B and trained for tool-guided type inference. We compare this model against both proprietary API-based systems and open-source baselines. For comparison studies, we additionally evaluate one-shot prompting of these models without tool access, as well as a traditional static approach such as ipsw.

Public Framework Benchmarking. We begin by evaluating MOTIF on public macOS frameworks, where official headers provide reliable ground truth. All evaluation data are drawn from MOTIF-bench, which contains stratified samples of public frameworks extracted from the dyld shared cache and paired with header files from the corresponding Xcode SDK release. The benchmark ensures balanced coverage across framework categories and method scales, reducing sampling bias that could otherwise distort accuracy measurements.

Evaluation Metrics. We evaluate type inference quality using three primary correctness metrics:

Partial Match (PM) Accuracy: the average fraction of correctly inferred types across all argument and return positions. Formally,
```
\text{PM} = \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} \frac{|\{\, i \mid \hat{t}_{m,i} = t_{m,i} \,\}|}{|\{\, i \mid t_{m,i} \neq \bot \,\}|},
```
where $`\mathcal{M}`$ is the set of evaluated methods, $`t_{m,i}`$ is the ground-truth type, and $`\hat{t}_{m,i}`$ is the inferred type. This metric accounts for partial correctness, e.g., when some but not all types in a signature are recovered.
Exact Match (EM) Accuracy: the fraction of methods for which the entire inferred signature (all argument and return types) exactly matches the ground truth. This is a strict all-or-nothing measure:
```
\text{EM} = \frac{|\{\, m \in \mathcal{M} \mid \forall i: \hat{t}_{m,i} = t_{m,i} \,\}|}{|\mathcal{M}|}.
```

Beyond correctness, we track behavioral metrics:

Tool Usage Rate: the proportion of inference tasks where the agent invokes at least one external analysis tool (disassembler, decompiler, or static linter). This quantifies reliance on external program-analysis evidence rather than purely language-model predictions.
Inference Stability: the proportion of methods for which the iterative inference process converges to a fixed signature within $`K=10`$ steps. A method $`m`$ is considered stable if $`\hat{t}_{m,i}^{(k)} = \hat{t}_{m,i}^{(k-1)}`$ for all $`i`$ once some $`k^{*} \leq K`$ is reached.
Tool-Call Correctness (TCC): the fraction of tool invocations issued by MOTIF that exactly match the expected API usage (correct tool name, arguments, and file paths). Each tool call is parsed and validated against a formal schema of permissible arguments. TCC measures whether the agent is not only deciding to invoke a tool, but doing so in a syntactically and semantically valid manner.
Tool-Call Hallucination Rate (HR): the proportion of tool invocations that are invalid, redundant, or unsupported (e.g., calls to nonexistent binaries, malformed flags, or references to artifacts not produced by prior steps). This metric captures the extent to which the LLM attempts to fabricate functionality outside the actual tool API.

Private Framework	VulnHistory	#	Short description

SafariFoundation	–	1	Provides foundational services for Safari, including autocomplete, credential, and account handling across macOS.
DiskManagement		5	Implements low-level disk volume management routines exposed to utilities and system daemons.

StoreFoundation	–	5	Core framework supporting the Mac App Store’s app listing, download, update, and purchase operations.
CommerceKit	–	3	Provides transactional and purchase APIs tied to Apple’s commerce and payment infrastructure.

DFRFoundation	–	1	Framework for the Touch Bar subsystem, offering APIs to simulate and communicate with the Touch Bar.
PIP	–	1	Provides Picture-in-Picture support for video playback across macOS applications.
SystemUIPlugin	–	1	Hosts extensions and menu bar plug-ins for system UI elements.
AOSKit	–	3	Implements Apple Online Services authentication, including legacy SSO mechanisms.
SidecarCore	–	3	Provides the core logic for Sidecar, enabling an iPad to function as a secondary display.
CalendarFoundation		2	Implements calendar event storage and scheduling logic underpinning Calendar.app.
DoNotDisturb	–	3	Exposes APIs to control the Do Not Disturb (DND) system setting.
IMCore	–	5	Provides messaging core for iMessage and SMS relay integration on macOS.

Signature Inference Subtasks-Level Metrics. To capture fine-grained aspects of Objective-C type recovery, we evaluate MOTIF on four subtask-specific metrics. Each metric is defined over a subset of argument and return positions in the benchmark corpus $`\mathcal{M}`$, and accuracy is measured as the fraction of correctly recovered types within that subset.

Basic Type Completion (BTC): Let $`\mathcal{S}`$ be the set of positions whose ground-truth type is a primitive scalar (e.g., int, BOOL, long). We compute
```
\text{BTC} = \frac{|\{(m,i) \in \mathcal{S} \mid \hat{t}_{m,i} = t_{m,i}\}|}{|\mathcal{S}|}.
```
Collection Inference (CI): Let $`\mathcal{C}`$ be the set of positions where the ground-truth type is a concrete Objective-C collection (e.g., NSArray*, NSDictionary*, NSSet*). Accuracy is defined as
```
\text{CI} = \frac{|\{(m,i) \in \mathcal{C} \mid \hat{t}_{m,i} = t_{m,i}\}|}{|\mathcal{C}|}.
```
Delegate Protocol Inference (DPI): Let $`\mathcal{P}`$ be the set of positions where the ground-truth type includes a protocol-qualified annotation (e.g., id<NSCopying>). We measure
```
\text{DPI} = \frac{|\{(m,i) \in \mathcal{P} \mid \hat{t}_{m,i} = t_{m,i}\}|}{|\mathcal{P}|}.
```
Block Type Inference (BTI): Let $`\mathcal{B}`$ be the set of positions whose ground-truth type is an Objective-C block, represented as a function pointer with explicit return and parameter types. Equality requires the predicted block signature to match the ground truth in both return type and ordered parameter list. Formally,
```
\text{BTI} = \frac{|\{(m,i) \in \mathcal{B} \mid \hat{t}_{m,i} = t_{m,i}\}|}{|\mathcal{B}|}.
```

Baselines. We evaluate MOTIF against two classes of baselines that capture the current state of practice in reverse engineering and automated type inference:

Metadata-based utilities. Tools such as ipsw reconstruct Objective-C headers directly from runtime metadata embedded in Mach-O binaries. This utility is widely used in both academic and practitioner settings but is limited to exposing selector names and generic placeholder types (e.g., id, void*), without recovering precise argument or return types.
One-shot No-Tooling LLMs. We query general-purpose LLMs on the same incomplete method signatures but without tool access or iterative refinement. This baseline measures the extent to which improvements in MOTIF stem from tool integration and fine-tuning rather than raw model capability.

MOTIF-Private: A Manually-Curated Private Framework Benchmark


Methods	Rank	Avg.	Numerical / Binary Answer Metrics						Signature Inference Subtasks

Static Analysis	9	–	22.3	0.0	–	–	–	–	100.0	0.0	0.0	0.0

Qwen3 (8B)	1	–	75.2	48.4	100.0	0.0	100.0	0.0	60.0	40.0	0.0	33.3

GPT-4o	3	–	71.9	51.6	100.0	0.0	100.0	0.0	60.0	0.0	0.0	33.3
Deepseek R1	2	–	72.5	38.7	100.0	38.7	97.3	2.7	60.0	20.0	0.0	16.7
Claude Sonnet 4	5	–	66.7	56.2	100.0	0.0	100.0	0.0	33.3	0.0	0.0	50.0
Gemini-2.0 Flash	4	–	67.7	32.3	100.0	9.7	100.0	0.0	60.0	20.0	0.0	50.0
XAI Grok 4	6	–	58.3	48.4	100.0	6.5	100.0	0.0	40.0	0.0	0.0	0.0

Llama 3.1 8B Instruct	8	–	24.6	6.5	83.9	0.0	100.0	0.0	40.0	0.0	0.0	20.0
Phi-3-medium	7	–	37.3	6.5	0.0	6.5	–	–	60.0	0.0	0.0	0.0

While quantitative benchmarking provides measurable accuracy against public frameworks with known headers, the true motivation of our method lies in reverse engineering private macOS frameworks. To assess practical utility in this setting, we also manually constructed a small dataset of case studies across a curated set of private frameworks. Framework selection followed targeted criteria rather than random sampling, focusing on methods and frameworks that met at least one of the following conditions: (I) availability of partial ground truth through Apple’s open-source repositories; (II) active reliance by the community for building tools that are otherwise impossible without private frameworks; (III) historical association with vulnerabilities and security advisories; and (IV) coverage of diverse domains. An overview of these selected frameworks, including their security exposure (if any), functional role, and usage contexts within the macOS ecosystem, is provided in Table [tab:private-frameworks].

Table [tab:benchresults] reports the alignment between model-inferred signatures and manually reconstructed reference headers, comparing our fine-tuned tool-calling MOTIF-model(Qwen3-8B) with both static analysis and proprietary LLM APIs across numerical and structural inference metrics.

Integrated Evaluation and Analysis

Subtask-level inference accuracy.

Accuracy deltas per subtask.

Unified Performance Comparison. We first do performance comparison across the full evaluation suite shown in Table [tab:motifbenchresults], covering all four system configurations: static analysis, zero-shot LLM, tool-augmented LLM, and our fine-tuned 8B variant. Metrics are grouped into (I) scalar behavioural indicators (e.g., tool-usage efficiency, inference stability) and (II) structured subtask-level accuracy measures for signature inference.

As shown in Figures [fig:metric-num] and 7, static analysis achieves low accuracy across all axes, recovering fewer than 2% of valid structural signatures. Zero-shot LLMs improve moderately (up to +18 pp on PM accuracy) but remain brittle on type-inference and tool-alignment subtasks, with variance exceeding 0.25 across repeated runs. Introducing deterministic tool augmentation stabilises behaviour and increases overall signature recovery by 1.6$`\times`$.

Ablation results in Table [tab:oneshot-vs-feedback] further isolate the effect of iterative feedback, showing that feedback-conditioned prompting yields an additional 12–15 pp improvement in ambiguous-type precision and halves cross-run variance. This gain arises from grounding the model’s reasoning chain in verifiable tool outputs rather than unconstrained textual inference. Overall, the fine-tuned 8B configuration achieves parity with substantially larger proprietary LLMs while operating at one-eighth their parameter scale, reducing inference cost by over 60% and improving controllability through consistent tool-bounded reasoning (see Figure [fig:motif-cost]).

Failure Analysis and Limitations. Figure [fig:linter-distribution] summarises the empirical distribution of triggered constraint categories derived from linter diagnostics. The majority of violations arise from conventional type and generic collection inconsistencies (1,740 and 1,163 cases, respectively), together accounting for more than half of all detected issues. These correspond to incomplete or missing generic annotations in Objective-C container types (e.g., NSDictionary<NSString*,id>), confirming that ambiguity in parameterised structures remains the dominant failure mode. By contrast, structural mismatches, unresolved selectors, and unparsed method bodies contribute less than 15% of the total, indicating that syntactic correctness is largely stabilised by the tool-bounded parsing stage. Residual errors primarily stem from information not recoverable through static metadata alone (e.g., runtime type casts, opaque structs, or heavily optimised compiler output) representing intrinsic limitations of static-plus-LLM inference.

Subtask-Level Improvements. Figure 8 quantifies median accuracy gains obtained through iterative feedback across the benchmark subtasks. The most substantial improvement appears in DP Inference (+26.4 pp), followed by Collection Inference (+15.7 pp) and BT Inference (+12.7 pp), while BT Completion remains nearly unchanged (+0.2 pp). These trends suggest that feedback-conditioned prompting most effectively enhances reasoning over high-entropy semantic spaces ( particularly data-path and collection-type inference) where raw textual priors are insufficient. The relatively flat gain in boilerplate completion implies diminishing returns once syntactic structure is fully captured. Together, they explain the aggregate improvements reported in Figure 7 and Table [tab:oneshot-vs-feedback], showing that most accuracy gains originate from enforcing and resolving type-generic constraints within the iterative loop.

Setting	Exact Match Accuracy	Tool-Call Correctness	Inference Stability

LLM (one-shot)	23.1%	–	42.0%

LLM + Tool Context	44.7%	53.0%	55.8%

LLM + Tool + Feedback (Ours)	67.5%	71.2%	86.9%

Conclusion

We introduced MOTIF, which encompasses a unified framework, benchmark, and model for automated reverse engineering of private Objective-C frameworks. Together, these components establish both a methodology and a resource for advancing automated reverse engineering. Moreover, the design generalizes: by adapting tooling and prompt schemas, the same ReAct–feedback loop can be extended to other binary formats and operating systems.

Ethical Consideration

Our work operates in a sensitive realm of reverse engineering private parts of a proprietary operating system. Our research is conducted strictly under the banner of academic exploration and security research, not for building software intended for consumer distribution or retail within Apple’s ecosystem. To ensure ethical research, we do not distribute private framework binaries or reconstructed APIs in application packages intended for external deployment. Further, our tools and models are shared only for research purposes, not embedded in consumer-facing software, especially not via the App Store. By framing MOTIF as a research platform rather than a production tool, we aim to improve transparency and robustness of closed-system platforms while remaining mindful of legal and ethical constraints.

Case Study: MOTIF-agent System Trace for Recovering Hidden API Signatures

To illustrate the internal operation of the MOTIF-agent, we present a concrete end-to-end trace on a representative reverse-engineering task recovering the missing Objective-C method signature for a private macOS framework header. Table 2 summarises the key experimental context, including the target operating system version, framework, and header file under analysis.

Attribute	Value
Target OS version	macOS 14.5 (Sonoma)
Target framework	`IOUSBHost.framework`
Target header	`IOUSBHostObject.h`
Analysis date	August 2025
MOTIF agent version	v1.3 (default configuration)
LLM backend	`GPT-4o-mini`
Model size	$`\sim`$38B parameters
Context window	128k tokens
Candidate pool size	5 (top-1 correct)

Metadata for case study. The listed configuration corresponds to the runtime setup used in Figures 9, 10, 11.

This case study serves as a compact yet informative example, demonstrating how the agent orchestrates reasoning, tool invocation, and constraint-guided validation to infer precise type semantics.

Problem Setup and Symbolic Domains

Header corpus. We begin with a Mach-O binary with partially stripped Objective-C interface metadata. Using compiler-level introspection, we statically extract headers $`\mathcal{H}=\{h_1,\dots,h_m\}`$ and parse each $`h_i`$ under a deterministic LL(*) grammar $`\mathbb{G}_{\text{ObjC}}`$ into an abstract syntax tree (AST), forming $`\mathcal{A}=\bigcup_i \texttt{AST}_{h_i}`$.

Ambiguity set. Within $`\mathcal{A}`$ we target underspecified declarations $`\mathcal{G}\subset\mathcal{A}`$ whose return or parameter types belong to the finite ambiguous set $`\mathcal{T}_{\text{ambig}}=\{\texttt{id},\texttt{void *},\texttt{struct\{...\}}, \dots\}`$. Resolving these ambiguities requires mapping syntactic declarations to their compiled representations.

Symbol binding. For every $`g_i\in\mathcal{G}`$ we establish a static correspondence to a unique binary selector $`s_i\in\mathcal{S}`$ extracted from the Mach-O symbol table. Selectors encode both class and method components (e.g., +[CBUUID UUIDWithData:]), providing a bijective mapping between AST-level declarations and compiled symbols. This mapping anchors source-level ambiguity to a concrete binary artefact and enables cross-referencing of type hypotheses against compiled metadata.

Inference context (LLM input). For each recovery target $`g_i`$ we assemble the prompt context $`\mathcal{T}_i=(\mathcal{N}_i,\mathcal{I}_i)`$ that excludes semantic diagnostics. $`\mathcal{N}_i`$ captures natural source evidence (raw declaration text, lexical neighbourhood, identifier and file locality). $`\mathcal{I}_i`$ aggregates symbolic descriptors available to tools (disassembly and decompilation at $`s_i`$, typedef resolution in $`\mathcal{A}`$, and symbol metadata/addresses). The template $`P_i`$ instantiated with $`\mathcal{T}_i`$ yields an initial fully-specified candidate signature $`\hat{g}`$.

Constraint formulation and refinement loop. Our goal is to identify the most semantically admissible signature consistent with all high-severity constraints. Given $`\hat{g}`$, a stratified linter $`\mathcal{L}\!:\!\texttt{Signature}\!\rightarrow\!\mathcal{M}`$ produces a multiset of messages $`\mathcal{M}=\{(m_j,s_j)\}_{j=1}^{k}`$ with severities $`s_j\in\{\texttt{low},\texttt{medium},\texttt{high}\}`$. Messages induce Boolean constraints $`c_j`$ partitioned as $`\mathcal{C}_{\texttt{hard}}=\{c_j\mid s_j=\texttt{high}\}`$ and $`\mathcal{C}_{\texttt{soft}}=\{c_j\mid s_j\in\{\texttt{medium},\texttt{low}\}\}`$. We pose selection as a Weighted Partial MaxSAT objective over $`\texttt{Signature}`$:

\begin{align}
\hat{g}^* &= 
\arg\min_{\hat{g}}
\sum_{c_j \in \mathcal{C}_{\texttt{soft}}}
\lambda(s_j)\,\mathbb{1}[c_j(\hat{g})=\texttt{false}],
\\[2pt]
&\text{s.t. }\forall c_j\in\mathcal{C}_{\texttt{hard}},\;
c_j(\hat{g})=\texttt{true}.
\end{align}

Here $`\lambda(\cdot)`$ weights soft violations. If $`\mathcal{M}\neq\emptyset`$, the structured feedback is fed back to re-instantiations of $`P_i`$ (same $`\mathcal{T}_i`$, updated guidance), iterating until all hard constraints are satisfied and the soft cost stabilises. This separation keeps LLM inputs strictly evidential ($`\mathcal{N}_i,\mathcal{I}_i`$) while delegating admissibility to the constraint layer.