DAInfer+: Neurosymbolic Inference of API Specifications from Documentation via Embedding Models

MARYAM MASOUDIAN, The Hong Kong University of Science and Technology, China
ANSHUNKANG ZHOU, The Hong Kong University of Science and Technology, China
CHENGPENG WANG, The Hong Kong University of Science and Technology, China
CHARLES ZHANG, The Hong Kong University of Science and Technology, China

Modern software systems heavily rely on various libraries, which require understanding the API semantics in static analysis. However, summarizing API semantics remains challenging due to complex implementations or unavailable library code. This paper presents DAInfer+, a novel approach for inferring API specifications from library documentation. We employ Natural Language Processing (NLP) to interpret informal semantic information provided by the documentation, which enables us to reduce specification inference to an optimization problem. Specifically, we investigate the effectiveness of sentence embedding models and Large Language Models (LLMs) in deriving memory operation abstractions from API descriptions. These abstractions are used to retrieve data-flow and aliasing relations to generate comprehensive API specifications. To solve the optimization problem efficiently, we propose neurosymbolic optimization, yielding precise data-flow and aliasing specifications. Our evaluation of popular Java libraries shows that zero-shot sentence embedding models outperform few-shot prompted LLMs in robustness, capturing fine-grained semantic nuances more effectively. While our initial attempts using two-stage LLM prompting yielded promising results, we found that the embedding-based approach proved superior. Specifically, these models achieve over 82% recall and 85% precision for data-flow inference and 88% recall and 79% precision for alias relations, all within seconds. These results demonstrate the practical value of DAInfer+ in library-aware static analysis.
CCS Concepts: • Software and its engineering → Software libraries and repositories; Automated static analysis; • Applied computing → Document analysis.

Additional Key Words and Phrases: specification inference, documentation mining, alias analysis, data-flow analysis

1 INTRODUCTION

In modern programming languages, programmers often develop their applications based on various libraries, which provide fundamental building blocks for client-side implementation. Undoubtedly, the behaviors of library APIs directly affect the functionality of the application code. As targeted by existing studies [9, 29], several library APIs are essentially generalized store and load operations, forming aliasing relations through store-load matches. For example, the APIs HashMap.put and HashMap.get conduct the store and load operations, respectively. When they are invoked upon the same HashMap object with the same first parameters successively, the return value of HashMap.get can be aliased with the second parameter of HashMap.put. To identify value flows in the application code, a static analyzer should be aware of such API aliasing specifications, which play critical roles in pointer analysis and other downstream clients. According to our investigation, many existing static analysis techniques rely on manually specified library API aliasing specifications [4, 6, 32]. However, the emergence of third-party libraries introduces a large number of APIs, making this laborious effort unacceptable in practice. This work initially targets the API aliasing specification inference problem to support library-aware alias analysis. Existing approaches infer API aliasing specifications from three perspectives.
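To make the store-load matching concrete, the following minimal Python sketch shows how a static analyzer might consume such an aliasing specification for HashMap.put/HashMap.get. The tuple layout mirrors the (𝑚_1, 𝑚_2, 𝑃, 𝑡) form the paper defines in § 3.2, but the function name and the call-site encoding are our own illustrative choices, not part of DAInfer+.

```python
# Minimal sketch: an API aliasing specification as a tuple and a check that
# applies it to two successive calls on the same receiver object.
# (Illustrative only; `may_alias_return` and the call-site encoding are ours.)

# (store_api, load_api, parameter index pairs that must alias, stored param index)
HASHMAP_SPEC = ("HashMap.put", "HashMap.get", {(0, 0)}, 1)

def may_alias_return(spec, store_call, load_call):
    """store_call/load_call: (api_name, receiver, args). Returns True if the
    load's return value may alias the stored parameter according to the spec."""
    store_api, load_api, pairs, t = spec
    if store_call[0] != store_api or load_call[0] != load_api:
        return False
    if store_call[1] != load_call[1]:          # must be the same receiver object
        return False
    # precondition P: the designated parameter pairs must be aliased
    return all(store_call[2][i] == load_call[2][j] for i, j in pairs)

# m.put("k", v); x = m.get("k")  ->  x may alias v
print(may_alias_return(HASHMAP_SPEC,
                       ("HashMap.put", "m", ["k", "v"]),
                       ("HashMap.get", "m", ["k"])))   # True
```

With a different key (or a different receiver), the precondition fails and no aliasing is derived, which matches the "same object, same first parameters" condition above.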
Authors' addresses: Maryam Masoudian, The Hong Kong University of Science and Technology, Hong Kong, China, mamt@cse.ust.hk; Anshunkang Zhou, The Hong Kong University of Science and Technology, Hong Kong, China, azhouah@ust.hk; Chengpeng Wang, The Hong Kong University of Science and Technology, Hong Kong, China, cwangch@connect.ust.hk; Charles Zhang, The Hong Kong University of Science and Technology, Hong Kong, China, charlesz@cse.ust.hk.

Fig. 1. Examples of library documentation. We use 𝑚_𝑖 to denote the API with the ID 𝑖 in the paper.

The first line analyzes the source code statically [5, 61]. Although it can derive function summaries as API aliasing specifications, the solution suffers from scalability problems due to deep call chains [67]. More importantly, the implementation of several library APIs can depend on native code, such as System.arraycopy in the implementation of java.util.Vector, which makes static analysis intractable [9]. The second line of techniques constructs unit tests via active learning to trigger the execution of library APIs, so as to infer aliasing relations at runtime [9]. Compared to static analysis-based inference techniques, they are more applicable when the source code of the library is unavailable. However, it can be infeasible to generate unit tests that trigger the target library APIs due to the difficulties of constructing parameters with complex data structures and executing APIs on specific devices or environments. Third, several researchers learn aliasing specifications from applications using the libraries [29], which does not require the source code of the libraries or the execution of the programs. Unfortunately, this approach only discovers the API specifications used in the applications, ultimately causing low recall in the inference.
This paper presents a new perspective on inferring API aliasing specifications. Unlike existing studies, we utilize another important library artifact, the documentation, to analyze the semantics of library APIs. As shown in Figure 1, library documentation contains formal semantic properties, e.g., class hierarchy relations and type signatures, and informal semantic information, e.g., semantic descriptions and naming information. Although library documentation describes the API semantics in detail, it is far from trivial to derive API aliasing specifications from it. First, effectively understanding the informal semantic information is quite difficult. Even if we apply recent advances in large language models (LLMs), e.g., feeding the documentation of android.content.Intent to ChatGPT, we can only obtain nine API aliasing specifications, all of which are incorrect. Second, library documentation can be quite lengthy, which may introduce significant overhead. For example, feeding the lengthy documentation to ChatGPT not only demands much time but also introduces a high financial cost due to enormous token consumption. Beyond the cost, we observe a fundamental reliability gap. While LLMs show promise in software engineering tasks such as programming [1, 19, 33, 40, 45, 50, 73], program analysis [31, 37, 49, 70], and program repair [12, 83], they are prone to semantic over-engineering and hallucinations that compromise their reliability. For instance, an LLM may incorrectly attribute a memory write operation to the simple API method Stack.contains despite the method's signature and documentation clearly indicating it only reads from the stack. Advanced prompt engineering techniques do not consistently mitigate these errors; indeed, the two-staged prompting approach in our previous work [69] mistakenly attributes a memory deletion to this method.
Furthermore, LLMs struggle to isolate distinct operations within compound sentences in an API method's semantic description, such as the "removes the object... and returns..." phrasing of Stack.pop shown in Figure 1. For instance, an LLM may focus exclusively on the removal of an item from the stack while failing to infer the data-flow link to the return value. In the context of taint analysis, this oversight creates a broken propagation chain. These failures in distinguishing conjoined operations, coupled with high computational overhead, necessitate a shift toward more robust methodologies. We argue that sentence embedding models [59, 71] provide a superior foundation for this task. By focusing on semantic similarity within atomic, single-intent sentences, these models can robustly identify memory operations while maintaining computational efficiency and structural precision. To achieve the inference effectively and with high efficiency, we propose our inference algorithm, named DAInfer+, which originates from three key insights:

• The class hierarchy determines the available APIs of a given class, while type signatures enable us to over-approximate aliasing facts based on the types of API parameters and returns. If two values cannot be aliased, we do not need to analyze the naming information and semantic descriptions, which decreases the overhead by avoiding applying NLP models.

• The named entities in the names of APIs and parameters indicate the high-level semantics and narrow down aliasing relations between the parameters and return values. In Figure 1(a), the named entities in getIdentifier and the parameter name of Intent.setIdentifier are the same, indicating that the return value of Intent.getIdentifier can be aliased with the parameter of Intent.setIdentifier.

• Semantic descriptions reveal memory operations through specific verbs, supporting the identification of store-load matches that may introduce the derivation of aliasing facts.
In Figure 1(b), verbs such as push and look indicate that Stack.push and Stack.peek perform insertion and read operations, respectively. Based on our insights, we propose DAInfer+, an algorithm to infer API specifications by finding data-flow and aliasing relations. Technically, we introduce a graph representation to over-approximate the aliasing relations between parameters and return values based on type information. To interpret informal semantic information, we use NLP models to abstract memory operation kinds and the high-level semantics of API parameters and return values, respectively. We formulate the task as a mapping from the semantic information in API documentation to formal memory behaviors that govern data-flow and aliasing relations. We leverage a tagging model to infer the alias relations between return values and parameters of APIs. Then, we reduce the specification inference problem to an optimization problem that enforces as many aliasing pairs between API parameters as possible for precise semantic abstraction. In particular, the optimization problem poses constraints on the results of the two NLP models. To solve the problem efficiently, we propose a neurosymbolic optimization algorithm, which interacts with the two NLP models in a demand-driven manner, achieving low resource cost in the inference. To accurately infer memory operations for each API method, we previously proposed a staged prompting technique using generative LLMs [69]. In this paper, we introduce a more robust solution utilizing zero-shot sentence embedding models for semantic mapping. While the prompting approach tasks the LLM with simulating a developer to categorize API behaviors based on documentation, the embedding approach calculates the cosine similarity between API descriptions and standardized memory operation definitions (e.g., read, write, insert, delete).
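As a sketch of this cosine-similarity classification, the snippet below substitutes a toy bag-of-words vector for a real sentence embedding model (which a practical system would use instead); the four operation definitions are simplified paraphrases of our own, not the paper's standardized definitions.

```python
# Toy stand-in for a sentence embedding model: bag-of-words vectors plus
# cosine similarity. In a real pipeline a zero-shot sentence embedding model
# would produce the vectors; the operation definitions below are our own
# simplified paraphrases, used only to illustrate the retrieval mechanism.
import math
from collections import Counter

OPERATION_DEFS = {
    "read":   "returns or retrieves a value stored in the object",
    "write":  "sets or replaces a value stored in the object",
    "insert": "adds or pushes a new element into the object",
    "delete": "removes an element from the object",
}

def embed(text):
    # Bag-of-words "embedding": word -> count
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(description):
    """Map an API description to the memory operation whose definition
    has the most similar vector."""
    d = embed(description)
    return max(OPERATION_DEFS, key=lambda op: cosine(d, embed(OPERATION_DEFS[op])))

print(classify("Removes the object at the top of this stack"))   # delete
```

Because the inference is a deterministic nearest-definition lookup rather than free-form generation, there is nothing for the model to hallucinate, which is the robustness argument made above.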
Our evaluation (Section 6) demonstrates that embedding models achieve significantly higher recall and precision than LLMs using either zero-shot or few-shot prompting. By reducing the inference task to a semantic similarity comparison, we eliminate the hallucination risks associated with generative models while achieving greater efficiency. We implement our approach DAInfer+ and evaluate it upon Java classes in several popular libraries. When two-stage prompting is selected to retrieve memory operation abstractions, DAInfer+ achieves alias specification inference with a precision of 79.78% and a recall of 82.29%, consuming 5.35 seconds per class on average. Additionally, DAInfer+ promotes alias analysis by discovering 80.05% more aliasing facts for API return values and enables taint analysis to discover 85 more taint flows in the experimental subjects. Furthermore, we assess the capability of DAInfer+ in retrieving data-flow specifications using embedding models vs. LLMs. Our results demonstrate that the recall and precision of embedding models are more than 82% and 85%, respectively. In contrast, programming-trained LLMs could only reach up to 75% recall, albeit with a higher precision of 94%. This suggests that while specialized LLMs can be more precise, embedding models may provide a more comprehensive retrieval of specifications, which is crucial for effective static analysis. Furthermore, when employing embedding models, DAInfer+ achieves 88% recall and 79% precision in inferring alias specifications while maintaining high efficiency by retrieving results in only a few seconds. The main contributions of this work are:

• We propose a novel neurosymbolic optimization technique to solve the API specification inference problem efficiently.

• We introduce a comprehensive pipeline to perform memory operation inference, which serves as a foundational layer for subsequent alias inference.
• We introduce a new embedding-driven paradigm for inferring API specifications, replacing generative LLM prompting with a deterministic, embedding-based retrieval mechanism that leverages latent vector comparisons between API descriptions and memory operation abstractions.

• We conduct a comprehensive comparative analysis between general-purpose LLMs and state-of-the-art programming-specific models, demonstrating that our embedding-based approach achieves superior efficiency and precision in API specification inference.

• We extensively evaluate our approach over real-world libraries to demonstrate its improved accuracy and efficiency compared to existing techniques, and to quantify its impact on client analyses.

2 BACKGROUND AND OVERVIEW

In this section, we introduce the background of API data-flow and aliasing specification inference and outline our key ideas for inferring the data-flow and aliasing relations of API methods from their documentation.

2.1 Library-Aware Data-Flow and Alias Analysis

Modern software systems heavily depend on various libraries. A recent study found that a Java project can include an average of 48 libraries transitively [72]. This prevalence of library usage stimulates the demand for modeling API semantics in fundamental static analyses, such as data-flow and alias analysis. However, deep call chains and unavailable source code (e.g., native functions) complicate the scalability and applicability of static analysis. Many static analyzers use specifications that abstract the library API semantics to achieve library-aware analysis. A data-flow specification for an API 𝑚 presents the flows of data from its parameters to its body (acting as a data sink) or from its body to the return value (acting as a data source). This specification primarily represents the memory operations that a method performs upon its execution, such as read, write, insert, and delete.

Example 1.
Figure 1(a) indicates that the first parameter of Intent.putStringArrayListExtra is used to insert a new String list into an "Intent" object, whereas Intent.getStringArrayListExtra retrieves the list from the same object if invoked successively. In a data-flow context, the former method performs an insertion memory operation on the internal state of the "Intent" object, while the latter performs a read operation on the same object; this establishes a potential taint path through the "Intent" container. By identifying the specific memory operations within these individual specifications, one can derive higher-order relationships between multiple APIs. Specifically, the API aliasing specification for an API pair (𝑚_1, 𝑚_2) is established by matching their respective memory behaviors: when 𝑚_1 and 𝑚_2 conduct the store and load operations, respectively, the return value of 𝑚_2 may be aliased with the parameter of 𝑚_1 if 𝑚_2 is invoked after 𝑚_1 upon the same object. Based on the specification, a static analyzer can model the library API semantics without explicitly analyzing the implementations of 𝑚_1 and 𝑚_2, ultimately promoting the scalability and applicability of the overall analysis.

Example 2. Figure 1(a) indicates that when the first parameters of Intent.putStringArrayListExtra and Intent.getStringArrayListExtra are aliased, the return value of the latter can be aliased with the second parameter of the former if they are invoked successively upon the same "Intent" object.

2.2 Different Perspectives of Inferring API Specifications

With the increasing number of third-party libraries, manually specifying the API specifications demands incredibly laborious effort [4, 6, 32]. To mitigate this problem, previous studies infer data-flow and aliasing API specifications from different artifacts, including the library implementation [5], application code using libraries [23, 29], and tests synthesized via active learning [9] or coverage-guided fuzzing [47].
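The store-load matching that Example 2 relies on can be sketched as follows. Here the per-API memory-operation abstractions are hard-coded (DAInfer+ infers them from descriptions), and the naive all-pairs matching deliberately ignores the type and naming constraints introduced later, so it over-approximates the candidate pairs; the function and field names are ours.

```python
# Sketch: derive candidate store-load API pairs by matching APIs that store
# into an object's state (write/insert) with APIs that load from it (read).
# Memory-op abstractions are hard-coded here for illustration only.

APIS = {
    "Intent.putStringArrayListExtra": {"op": "insert", "cls": "Intent"},
    "Intent.getStringArrayListExtra": {"op": "read",   "cls": "Intent"},
    "Intent.setIdentifier":           {"op": "write",  "cls": "Intent"},
    "Intent.getIdentifier":           {"op": "read",   "cls": "Intent"},
}

STORE_OPS = {"write", "insert"}

def candidate_pairs(apis):
    """Yield (store_api, load_api) pairs declared on the same class."""
    return [(s, l)
            for s, si in apis.items() if si["op"] in STORE_OPS
            for l, li in apis.items() if li["op"] == "read"
            if si["cls"] == li["cls"]]

pairs = candidate_pairs(APIS)
```

This yields the two genuine pairs of Figure 1(a) but also spurious ones such as (putStringArrayListExtra, getIdentifier); pruning those is exactly what the type-signature and named-entity constraints of the later sections do.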
However, these solutions are hindered by three main drawbacks. First, analyzing the library implementation suffers from scalability issues due to complex program structures, such as deep call chains, and can even become inapplicable due to the unavailability of the implementation or the presence of native code. Second, inferring the specifications from application code using the libraries may fail to achieve high recall when specific APIs are not utilized in the application code. Third, deriving data-flow or aliasing facts from dynamic execution suffers from inapplicability when it is infeasible to construct executable tests in specific devices or environments. To fill the research gap, our work proposes another perspective for inferring API specifications. We realize that there is another essential library artifact, i.e., library documentation, demonstrating the library API semantics in a semi-formal structure. As shown in Figure 1, formal semantic properties, including class hierarchy relations and type signatures, are explicitly provided. Meanwhile, the naming information, e.g., the parameter names and API names, shows the intent of API parameters and return values, while semantic descriptions demonstrate the functionalities of the APIs informally. These ingredients permit us to understand how library APIs manipulate memory. Specifically, this enables the inference of data flows to and from the heap upon method invocation, which in turn facilitates the identification of aliasing relations between parameters and return values. More importantly, the documentation is often available for analysis, as developers tend to refer to it during development. Hence, inferring API data-flow and aliasing specifications from documentation exhibits better applicability than existing techniques.
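To illustrate how these ingredients might be normalized, the sketch below shows one plausible record for a documentation entry. The dataclass layout and field names are our own, not DAInfer+'s internal format; the concrete values follow the Intent example used throughout the paper (its Example 3 records the return type of putStringArrayListExtra as void).

```python
# Sketch: a documentation entry normalized into the four ingredients named
# in the text: class hierarchy, type signatures, naming information, and a
# semantic description. Field names here are our own illustrative choices.
from dataclasses import dataclass, field

@dataclass
class ApiEntry:
    cls: str                 # owning class
    name: str                # API (method) name
    param_types: list        # type signature of the parameters
    param_names: list        # naming information for the parameters
    return_type: str         # type signature of the return value
    description: str         # informal semantic description
    superclasses: list = field(default_factory=list)   # class hierarchy

entry = ApiEntry(
    cls="Intent",
    name="putStringArrayListExtra",
    param_types=["String", "ArrayList"],
    param_names=["name", "value"],
    return_type="void",      # as recorded in the paper's documentation model
    description="Add extended data to the intent.",
    superclasses=["Object"],
)
```

Records of this shape supply everything the documentation model L = (H, T, N, D) of § 3.1 needs: H from `superclasses`, T from the type fields, N from the name fields, and D from `description`.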
2.3 Overview of DAInfer+

Although the documentation guides developers in understanding the API semantics, there exists a gap between such API knowledge and API data-flow and aliasing specifications. Concretely, we need to understand how API parameters are stored and how API return values are loaded. However, achieving this is quite complicated in the presence of informal semantic information. Even if we leverage the new advances in LLMs, they cannot understand how the APIs manipulate memory and eventually fail to identify the aliasing relations between API parameters and return values. With lengthy documentation, frequent interactions with LLMs can incur significant time and token costs. To address these challenges, we propose a novel inference algorithm named DAInfer+, which effectively understands the API semantics and efficiently infers the API specifications from library documentation. Our key idea originates from three critical observations on the data-flow and aliasing relations between the parameters and return values of library APIs:

• The parameters and return values should be type-consistent if they are aliased. Specifically, their types should be the same, or one of them should be a subtype or supertype of the other. Such facts can be easily obtained from the class hierarchy relation and type signatures in the documentation. In Figure 1, for example, we can obtain the potential aliasing relation between the return value of Intent.getIdentifier and the parameter of Intent.setIdentifier, while the second parameter of Intent.putStringArrayListExtra cannot be aliased with the return value of Intent.getIdentifier.

• If the return values and parameters of two APIs are aliased, the named entities in their names tend to be the same, indicating the same high-level semantics.
For example, the APIs Intent.setIdentifier and Intent.getIdentifier in Figure 1(a) share the same named entity "identifier", indicating that they manipulate the same inner field. For general-purpose data structures, such as java.util.Stack in Figure 1(b), the API names of Stack.peek and Stack.pop do not have any named entities, indicating that their return values can be aliased with other parameters with consistent types.

• If a library API stores its parameters or loads an inner field as the return value, the verbs in its semantic description can reflect the memory operation kind intuitively. For example, the verbs "set" and "insert" are commonly used for APIs storing their parameters, while the verbs "get" and "return" are prevalent in the semantic descriptions of APIs loading inner fields.

Based on these observations, we realize that we can leverage type information to over-approximate aliasing relations and utilize named entities, verbs, and simple descriptive sentences to understand the high-level semantic meanings of the APIs and their data-flow facts. For any store-load API pair, we can finalize an API aliasing specification as long as we discover parameters and return values with the same semantic meanings and consistent types. According to these insights, we design our inference algorithm DAInfer+, whose workflow is shown in Figure 2. Our key technical design consists of three components.

• We introduce a new graph representation, namely the API value graph, to approximate aliasing relations. After converting a library documentation to a normalized documentation model, we encode the potential aliasing relations in the API value graph.

• We reduce the inference problem to an optimization problem upon the API value graph, where we aim to discover as many aliasing facts among parameters and return values as possible.
In particular, we leverage NLP models to extract the named entities and interpret the semantic descriptions to infer memory operation abstractions, respectively. These abstractions represent the underlying data-flow facts for each API method, serving as the foundation for our optimization-based inference.

• We instantiate the optimization problem and propose an efficient neurosymbolic optimization algorithm to solve it, whose solution induces the API aliasing specifications. Our neurosymbolic optimization algorithm interacts with the tagging model and the memory operation abstraction module in a demand-driven manner, significantly improving the efficiency of our algorithm.

Fig. 2. Workflow of DAInfer+

Benefiting from our insights, our inference algorithm DAInfer+ simultaneously achieves high precision, recall, and efficiency. The high availability of library documentation also promotes the applicability of our approach in real-world scenarios. In the following sections, we formulate our problem (§ 3) and provide our technical design (§ 4 and § 5) in detail.

3 PROBLEM FORMULATION

This section first formulates the documentation model (§ 3.1) and then defines the API aliasing specification (§ 3.2). Lastly, we provide the formal statement of the API aliasing specification inference problem and highlight the technical challenges (§ 3.3).

3.1 Documentation Model

Definition 1. (Documentation Model) Given a library, its documentation model is L := (H, T, N, D):

• Class hierarchy model H maps a class 𝑐 to a set of classes, which are the superclasses of 𝑐.

• Type signature model T maps (𝑐, 𝑚, 𝑖) to a type, where 𝑚 is an API of the class 𝑐 and 𝑖 is the index of the parameter. Without ambiguity, we regard the index of the return value as −1.

• Naming model N maps (𝑐, 𝑚, 𝑖) to a string indicating the parameter name or API name, where 𝑚 is an API of the class 𝑐 and 𝑖 is the index of the parameter.
Without ambiguity, N(𝑐, 𝑚, −1) indicates the name of the API 𝑚 of the class 𝑐.

• Description model D maps (𝑐, 𝑚) to a string indicating the API semantic description.

Example 3. According to the documentation of the class Intent in Figure 1, we have H(Intent) = {Object}, T(Intent, 𝑚_1, −1) = void, T(Intent, 𝑚_1, 1) = ArrayList, N(Intent, 𝑚_1, 0) = name, N(Intent, 𝑚_1, 1) = value, N(Intent, 𝑚_1, −1) = putStringArrayListExtra, and D(Intent, 𝑚_1) is "Add extended data to the intent". Here, 𝑚_1 is the API Intent.putStringArrayListExtra. Due to space limits, we do not discuss other APIs in detail.

Based on documentation, we can collect all the APIs offered by a specific class and its superclasses, forming the universe of available APIs when using the class. The naming information and API semantic descriptions are informal specifications, guiding developers to use proper APIs in their programming contexts. Based on the documentation model, not only do developers achieve their program logic conveniently, but analyzers can also understand the behavior of each API.

3.2 API Aliasing Specification

To support library-aware alias analysis, we concentrate on API aliasing specification inference and follow an important form of aliasing specifications formulated in a prior study [29], which is defined as follows.

Definition 2. (API Aliasing Specification) An API aliasing specification is a tuple (𝑚_1, 𝑚_2, 𝑃, 𝑡), where 𝑚_1 and 𝑚_2 are two APIs, 𝑃 := {(𝑖_1^(1), 𝑖_1^(2)), · · ·, (𝑖_𝑗^(1), 𝑖_𝑗^(2))} is a set of non-negative integer pairs, and 𝑡 is a non-negative integer. It indicates that the return value of 𝑚_2 can be aliased with the 𝑡-th parameter of 𝑚_1 if

• 𝑚_1 is called before 𝑚_2 upon the same object;

• the 𝑖_𝑘^(1)-th and 𝑖_𝑘^(2)-th parameters of 𝑚_1 and 𝑚_2 are aliased accordingly.
Here, 0 ≤ 𝑖_𝑘^(1) ≤ 𝑛_1, 0 ≤ 𝑖_𝑘^(2) ≤ 𝑛_2, and 0 ≤ 𝑘 ≤ 𝑗, where 𝑛_1 and 𝑛_2 are the numbers of parameters of 𝑚_1 and 𝑚_2, respectively. Without ambiguity, we call 𝑚_1 and 𝑚_2 a store-load API pair.

Definition 2 shows that the APIs 𝑚_1 and 𝑚_2 conduct the store and load operations upon the memory, respectively. Unlike simple load and store operations on pointers, storing and loading values in memory may depend on the values of other parameters, which are indicated by the set 𝑃, determining the memory location where the values are stored and loaded, respectively. Essentially, the set 𝑃 indicates the precondition of the aliasing relation between the return value of 𝑚_2 and the 𝑡-th parameter of 𝑚_1. If 𝑃 is empty, the parameters of 𝑚_1 and 𝑚_2 need not be aliased to enforce the aliasing relation between the return value of 𝑚_2 and the 𝑡-th parameter of 𝑚_1.

Example 4. In Figure 1(a), we have two API aliasing specifications (𝑚_1, 𝑚_4, {(0, 0)}, 1) and (𝑚_2, 𝑚_3, ∅, 0). Specifically, the API aliasing specification (𝑚_1, 𝑚_4, {(0, 0)}, 1) indicates that the return value of Intent.getStringArrayListExtra and the second parameter of Intent.putStringArrayListExtra are aliased when they are invoked upon the same object and their first parameters are aliased.

The API aliasing specification in Definition 2 is more general than the one targeted by USpec [29]. Specifically, USpec only infers that calling 𝑚_2 may return a value aliased with the 𝑡-th parameter of a preceding call of 𝑚_1 on the same object if all other parameters are aliased. However, there exist many store-load API pairs in which not all the other parameters are aliased. For instance, the API createBitmap of android.graphics.Bitmap sets the values of DisplayMetrics, Config, width, and height simultaneously, while the API getConfig only fetches the value of Config. Our formulation in Definition 2 is expressive enough to depict such a store-load API pair.
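Written out, Example 4's specifications and the preference for the strongest precondition (the |𝑃|-maximization stated in § 3.3) look as follows; the tuple encoding of Definition 2 is direct, while the helper name is our own.

```python
# Example 4's two aliasing specifications, encoded as (m1, m2, P, t) tuples
# following Definition 2, plus the "strongest precondition" preference
# (maximize |P|) stated in §3.3.

# Return of getStringArrayListExtra aliases putStringArrayListExtra's second
# parameter (t = 1), provided the first parameters alias: P = {(0, 0)}.
spec1 = ("putStringArrayListExtra", "getStringArrayListExtra", {(0, 0)}, 1)
# Return of getIdentifier aliases setIdentifier's parameter unconditionally.
spec2 = ("setIdentifier", "getIdentifier", set(), 0)

def stronger(a, b):
    """Between two specs for the same store-load pair, keep the larger P,
    i.e., the one with the stronger precondition."""
    return a if len(a[2]) >= len(b[2]) else b

# (m1, m4, {}, 1) is also a valid specification, but weaker than spec1:
weak = ("putStringArrayListExtra", "getStringArrayListExtra", set(), 1)
assert stronger(spec1, weak) is spec1
```

The inference problem below asks for exactly this: among the valid specifications for each store-load pair, return one whose set 𝑃 has maximal size.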
3.3 Problem Statement

We aim to address the API aliasing specification inference problem from another perspective. As demonstrated in § 3.1, the library documentation provides various forms of semantic information about the library APIs. Hence, we can hopefully derive the API aliasing specifications from documentation without conducting deep semantic analysis upon the source code or program runtime information. The API aliasing specification for a given store-load API pair may not be unique. In Example 4, for instance, (𝑚_1, 𝑚_4, ∅, 1) is also a valid specification, while it does not pose any restrictions upon the parameters of the two APIs as the precondition. In our work, we want to ensure that the inferred specifications exhibit preconditions that are as strong as possible, which implies a maximal size of the set 𝑃. Finally, we state the problem of API aliasing specification inference as follows.

Given a documentation model L = (H, T, N, D), infer a set of API aliasing specifications 𝑆_AS such that |𝑃| is maximized for each (𝑚_1, 𝑚_2, 𝑃, 𝑡) ∈ 𝑆_AS.

Technical Challenges. Although library documentation offers semantic information, solving the above problem is quite challenging. First, the naming information and semantic descriptions can be ambiguous. Without an effective interpretation, we cannot understand how the APIs operate on memory or identify aliasing relations between parameters and return values. Second, there are often many available APIs offered by a single class and even its superclasses. It is non-trivial to obtain high efficiency in the face of a large number of available APIs for each class.

Roadmap. In this work, we propose the inference algorithm DAInfer+ to address the two technical challenges. Specifically, we introduce the documentation model abstraction to formulate
semantic information, which enables us to reduce the original problem to an optimization problem (§ 4). Furthermore, we propose neurosymbolic optimization to efficiently solve the instantiated optimization problem (§ 5). We present the details of our implementation (§ 6) and demonstrate the evaluation quantifying the effectiveness and efficiency of DAInfer+ (§ 7).

Fig. 3. The API value graph of the documentation model induced by the documentation in Figure 1. (Node listing omitted; per the figure's labels, 𝑚_1–𝑚_6 are the Intent APIs putStringArrayListExtra, setIdentifier, getIdentifier, getStringArrayListExtra, normalizeMimeType, and fillIn, and 𝑚_7–𝑚_10 are the Stack APIs push, peek, pop, and empty.)

4 DOCUMENTATION MODEL ABSTRACTION

This section presents the abstraction of the documentation model. Specifically, we propose the concept of the API value graph (§ 4.1) and introduce two label abstractions over the graph (§ 4.2), which enable us to reduce the API aliasing specification problem to an optimization problem (§ 4.3).

4.1 API Value Graph

As shown in § 3.1, the formal semantic information, namely the class hierarchy and the type signatures, reveals potential aliasing relations between API parameters and return values, while the informal semantic information, e.g., naming information and semantic descriptions, shows how parameters and return values are utilized. To depict aliasing relations that can be introduced by API invocations, we propose a graph representation, namely the API value graph, as follows.

Definition 3 (API Value Graph).
Given a documentation model L = (H, T, N, D), its API value graph is the labeled graph 𝐺 := (𝑉, 𝐸, ℓ𝑛, ℓ𝑑), where

• The node set 𝑉 contains API parameters and return values, which are referred to as API values. (𝑐, 𝑚, 𝑖) ∈ 𝑉 if and only if (𝑐, 𝑚, 𝑖) ∈ dom(N) or there is 𝑐′ ∈ H(𝑐) such that (𝑐′, 𝑚, 𝑖) ∈ dom(N).
• The edge set 𝐸 ⊆ 𝑉 × 𝑉 indicates possible aliasing relations between API values. Specifically, (𝑣1, 𝑣2) ∈ 𝐸 if and only if T(𝑣1) = T(𝑣2), T(𝑣1) ∈ H(T(𝑣2)), or T(𝑣2) ∈ H(T(𝑣1)).
• The name label ℓ𝑛 is a function that maps an API value to its name, i.e., ℓ𝑛(𝑣) = N(𝑣).
• The description label ℓ𝑑 is a function that maps an API value to the semantic description of the API, i.e., ℓ𝑑(𝑣) = D(𝑐, 𝑚), where 𝑣 = (𝑐, 𝑚, 𝑖).

The API value graph regards API values, namely API parameters and return values, as first-class citizens, and depicts their high-level semantics with labels. Intuitively, an edge from (𝑐, 𝑚1, 𝑖1) to (𝑐, 𝑚2, 𝑖2) indicates that the two values may be aliased when 𝑚2 is invoked after 𝑚1 upon the same object. Meanwhile, the two labels attach the informal semantic information to API values, showing their usage intention. From a high-level perspective, the API value graph over-approximates aliasing relations according to class hierarchy relations and type signatures, and still preserves informal semantic information as labels to support further specification inference.

Example 5. Figure 3 shows the API value graph for the documentation model induced by the classes in Figure 1, where the name labels and description labels are shown in the left and right boxes, respectively. 𝑠𝑖 indicates the semantic description of 𝑚𝑖 in Figure 1.
Specifically, the edge from (Intent, 𝑚2, 0) to (Intent, 𝑚5, 0) indicates that the first parameters of Intent.setIdentifier and Intent.normalizeMimeType may be aliased when the two APIs are invoked successively.

4.2 Label Abstraction

Although the edges of the API value graph over-approximate aliasing relations over API values, not all the aliasing relations can hold when using APIs. In Figure 1, for example, the return value of getIdentifier and the first parameter of normalizeMimeType are unlikely to be aliased, as the named entities in their names are different, revealing the different usage intentions of the two API values. To formulate this key idea, we first introduce the concept of the semantic unit abstraction as follows.

Definition 4 (Semantic Unit Abstraction). A semantic unit abstraction 𝛼𝜏 is a function mapping a string 𝑠 to a set of named entities contained in 𝑠. We call the elements of 𝛼𝜏(𝑠) semantic units.

Example 6. The named entities in the API name of getStringArrayListExtra include string, array, list, and extra. Hence, we have 𝛼𝜏(getStringArrayListExtra) = {string, array, list, extra}.

Essentially, the semantic unit abstraction extracts the named entities from the names as semantic units, which reflect the high-level semantics of API values and enable us to refine aliasing relations according to the following two intuitions: (1) If two API values 𝑣1 and 𝑣2 have names with the same semantic units, we can be confident that they are very likely to indicate the same object in the memory; (2) If the name of an API value does not have any semantic units, we conservatively regard it as potentially aliased with any other API value of a consistent type. Hence, we formally define the semantic unit consistency to formulate the two intuitions.

Definition 5.
(Semantic Unit Consistency) Given a semantic unit abstraction 𝛼𝜏 upon an API value graph 𝐺 = (𝑉, 𝐸, ℓ𝑛, ℓ𝑑), two nodes 𝑣1 and 𝑣2 are semantic-unit consistent if and only if (1) 𝛼𝜏(ℓ𝑛(𝑣1)) = 𝛼𝜏(ℓ𝑛(𝑣2)), or (2) 𝛼𝜏(ℓ𝑛(𝑣1)) = ∅ ∨ 𝛼𝜏(ℓ𝑛(𝑣2)) = ∅.

Example 7. Consider the API value graph in Figure 3. We have 𝛼𝜏(getIdentifier) = {identifier}, so the return value of the API getIdentifier and the first parameter of setIdentifier are semantic-unit consistent for the class Intent. Also, we have 𝛼𝜏(item) = {item} and 𝛼𝜏(peek) = ∅, so the return value of peek and the first parameter of push are semantic-unit consistent for the class Stack.

Finally, we observe that semantic descriptions characterize how API values interact with memory. By leveraging these informal descriptions, we can systematically identify the underlying memory operations required to infer precise data-flow and aliasing specifications. To formalize this mapping, we define the concept of the memory operation abstraction as follows:

Definition 6 (Memory Operation Abstraction). A memory operation abstraction 𝛼𝑜 maps a semantic description 𝑠 to 𝛼𝑜(𝑠) ⊆ 𝑀, where 𝑀 = {I, D, R, W}. The elements in 𝑀 indicate the insertion (I), deletion (D), read (R), and write (W) operations on the memory.

Notably, we classify common memory operations into four categories for two major reasons. First, the write operation contains several sub-kinds, such as deletion and insertion. If we only categorized memory operations into read and write, we could not distinguish the APIs conducting deletion from those conducting insertion, such as pop and add for java.util.Stack, which may yield wrong API aliasing specifications. For example, the APIs pop and peek of java.util.Stack would be wrongly identified to form a store-load pair and thus induce an incorrect API aliasing specification. Second, objects can be organized in various structural manners.
When adding an object to a container-typed field, such as java.util.Stack and java.util.HashMap, the operation is an insertion. When storing an object in a non-container-typed field, the API writes a specific value to the field. The above operations are often described differently in natural language, so we formulate the memory operation abstraction in a fine-grained manner.

Example 8. According to Figure 1, we have 𝛼𝑜(𝑠1) = {I, W}, 𝛼𝑜(𝑠2) = {W}, and 𝛼𝑜(𝑠3) = 𝛼𝑜(𝑠4) = {R} for Intent. For Stack, we have 𝛼𝑜(𝑠7) = {I, W}, 𝛼𝑜(𝑠8) = {R}, and 𝛼𝑜(𝑠9) = {R, D, W}.

To sum up, the above two label abstractions interpret the informal semantic descriptions with the sets of semantic units and memory operations, based on which we can refine potential aliasing relations indicated by the edges of the API value graph and identify store-load API pairs. In § 5.2, we will demonstrate how to instantiate the two abstractions to support the specification inference.

4.3 Problem Reduction

Based on the two label abstractions, we can interpret the high-level semantics of API values and the memory operations conducted by the APIs. According to our problem statement in § 3.3, we need to identify the store-load API pairs and find as many aliased parameters as possible, which determine a strong precondition of the aliasing relation between loaded and stored values. Hence, we reduce the specification inference to an optimization problem over the API value graph as follows.

Definition 7 (Optimization Problem). Given a semantic unit abstraction 𝛼𝜏 and a memory operation abstraction 𝛼𝑜 upon an API value graph 𝐺 = (𝑉, 𝐸, ℓ𝑛, ℓ𝑑), find an edge set 𝐸∗ ⊆ 𝐸 with a maximal size |𝐸∗| satisfying the following constraints:

• (Degree constraint) For each 𝑣 ∈ 𝑉, the in-degree and out-degree of 𝑣 are not greater than 1.
• (Validity constraint) If (𝑣1, 𝑣2) ∈ 𝐸∗, where 𝑣1 and 𝑣2 indicate parameters, there exist 𝑢1, 𝑢2 ∈ 𝑉 such that (𝑢1, 𝑢2) ∈ 𝐸∗, where 𝑢1 and 𝑢2 indicate a parameter and a return value, respectively.
• (Semantic unit constraint) For any (𝑣1, 𝑣2) ∈ 𝐸∗, where 𝑣1 = (𝑐, 𝑚1, 𝑖1) and 𝑣2 = (𝑐, 𝑚2, 𝑖2), the semantic unit abstraction of the names of 𝑣1 and 𝑣2 should satisfy
  – (S1) If 𝑖2 ≠ −1, 𝑣1 and 𝑣2 are semantic-unit consistent.
  – (S2) If 𝑖2 = −1, 𝑣1 or 𝑣′1 is semantic-unit consistent with 𝑣2, where 𝑣′1 = (𝑐, 𝑚1, −1).
• (Memory operation constraint) For any (𝑣1, 𝑣2) ∈ 𝐸∗, the following two conditions are satisfied:
  – (M1) 𝑣1 satisfies I ∈ 𝛼𝑜(ℓ𝑑(𝑣1)) ∨ (W ∈ 𝛼𝑜(ℓ𝑑(𝑣1)) ∧ D ∉ 𝛼𝑜(ℓ𝑑(𝑣1))).
  – (M2) 𝑣2 satisfies R ∈ 𝛼𝑜(ℓ𝑑(𝑣2)).

Definition 7 aims to maximize |𝐸∗| to discover all the aliased parameters of each store-load API pair, which corresponds to maximizing |𝑃| in the original problem statement in § 3.3. The four kinds of constraints are imposed upon the selected edges. Specifically, the degree and validity constraints ensure that the edges induce the API aliasing specification defined in Definition 2. Besides, the parameters of the APIs 𝑚1 and 𝑚2 should be semantic-unit consistent if they are connected by a selected edge (S1). If a selected edge connects a parameter of 𝑚1 and the return value of 𝑚2, then that parameter of 𝑚1 or the return value of 𝑚1 should be semantic-unit consistent with the return value of 𝑚2 (S2). Lastly, the memory operation constraint ensures that the APIs 𝑚1 and 𝑚2 form a store-load API pair (M1 and M2). Finally, we can obtain the specifications based on the optimal solution as follows.
Given the optimal solution 𝐸∗ of the optimization problem defined in Definition 7, we can obtain the API aliasing specification (𝑚1, 𝑚2, 𝑃, 𝑡) ∈ 𝑆AS, where

• 𝑃 = {(𝑖1, 𝑖2) | ((𝑐, 𝑚1, 𝑖1), (𝑐, 𝑚2, 𝑖2)) ∈ 𝐸∗, 𝑖2 ≠ −1}
• 𝑡 satisfies ((𝑐, 𝑚1, 𝑡), (𝑐, 𝑚2, −1)) ∈ 𝐸∗

Example 9. Figure 4 shows the optimal solution to the optimization problem over the API value graph in Figure 3, where the sets shown in the two boxes demonstrate the extracted semantic units and the identified memory operations under the label abstractions in Examples 7 and 8. We discover six possible aliasing relations. Notably, although the semantic units of (Intent, 𝑚4, −1) are different from those of (Intent, 𝑚1, 1), they are exactly the same as the ones of (Intent, 𝑚1, −1), indicating that the second parameter of 𝑚1 can have the same semantics as the return value of 𝑚4. The optimal solution finally induces the API aliasing specifications in Example 4.

[Figure omitted: the selected edges annotated with their semantic units and memory operations.]
Fig. 4. An optimal solution to the problem instance over the API value graph shown in Figure 3

By reducing the original problem to the optimization problem in Definition 7, we only need to tackle two sub-problems for the specification inference. First, we have to instantiate the two label abstractions to precisely interpret the semantic meanings of names and the kinds of memory operations. Second, we need to design an efficient optimization algorithm to solve the optimization problem. In § 5, we will provide the technical details of addressing the two sub-problems.
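To make the reduction concrete, the following sketch instantiates Definition 7 on a toy fragment of the Stack example and converts the selected edges into a specification. The data, the greedy edge selection, and the simplified constraint checks are illustrative stand-ins only: the validity constraint is omitted, and the actual system solves the problem with OMT solving (§ 5.3).

```python
# Toy instance of Definition 7 plus the E*-to-specification conversion.
# Greedy selection stands in for OMT solving; data is illustrative only.

# API values are (class, method, index); index -1 denotes the return value.
push_p0 = ("Stack", "push", 0)
push_ret = ("Stack", "push", -1)
pop_ret = ("Stack", "pop", -1)

# Assumed outputs of the two label abstractions (cf. Examples 7 and 8).
units = {push_p0: {"item"}, push_ret: set(), pop_ret: set()}
ops = {push_p0: {"I", "W"}, push_ret: {"I", "W"}, pop_ret: {"R", "D", "W"}}

def su_consistent(v1, v2):
    # Definition 5: same semantic units, or one side has none.
    return units[v1] == units[v2] or not units[v1] or not units[v2]

def mem_ok(v1, v2):
    # M1: v1 inserts, or writes without deleting; M2: v2 reads.
    o1, o2 = ops[v1], ops[v2]
    return ("I" in o1 or ("W" in o1 and "D" not in o1)) and "R" in o2

# Type-consistent candidate edges of the API value graph.
candidates = [(push_p0, pop_ret), (push_ret, pop_ret)]

e_star, out_deg, in_deg = [], {}, {}
for v1, v2 in candidates:
    degree_ok = out_deg.get(v1, 0) == 0 and in_deg.get(v2, 0) == 0
    if degree_ok and su_consistent(v1, v2) and mem_ok(v1, v2):
        e_star.append((v1, v2))
        out_deg[v1] = out_deg.get(v1, 0) + 1
        in_deg[v2] = in_deg.get(v2, 0) + 1

# Conversion at the end of Section 4.3: P collects parameter pairs,
# t is the parameter aliased with the return value of m2.
P = [(i1, i2) for ((_, _, i1), (_, _, i2)) in e_star if i2 != -1]
t = [i1 for ((_, _, i1), (_, _, i2)) in e_star if i2 == -1]
print(e_star)  # only the first edge survives the degree constraint
print(("push", "pop", P, t[0]))
```

Here the second candidate edge is pruned because the return value of pop already has in-degree 1, and the surviving edge yields the specification that push's first parameter may alias pop's return value.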
5 SPECIFICATION INFERENCE VIA NEUROSYMBOLIC OPTIMIZATION

This section presents the technical details of our algorithm DAInfer+. Specifically, we demonstrate the overall algorithm in § 5.1 and detail the label abstraction instantiation in § 5.2. Besides, we present the neurosymbolic optimization in § 5.3 to instantiate and solve the optimization problem given in Definition 7. Lastly, we summarize our approach and highlight its advantages in § 5.4.

5.1 Overall Algorithm

Algorithm 1: Inference Algorithm
Input: L: Documentation model;
Output: 𝑆AS: API aliasing specifications;
1  𝐺 ← constructAVG(L);
2  𝛼𝜏 ← getSemanticUnitAbs();
3  𝛼𝑜 ← getMemoryOperationAbs();
4  P ← (L, 𝐺, 𝛼𝜏, 𝛼𝑜);
5  𝐸∗ ← neuroSymOpt(P);
6  𝑆AS ← convert(𝐸∗);
7  return 𝑆AS;

As demonstrated in § 4.3, we can reduce the API aliasing specification inference problem to an instance of the optimization problem given in Definition 7. Technically, we formulate our specification inference algorithm in Algorithm 1, which takes a documentation model L as input and generates a set of API aliasing specifications 𝑆AS as output. First, we derive the API value graph 𝐺 from the documentation model L based on Definition 3 (Line 1). Second, we instantiate the two label abstractions, i.e., 𝛼𝜏 and 𝛼𝑜 (Lines 2–3), and further construct an instance of the optimization problem P defined in Definition 7 (Line 4). Third, we propose the neurosymbolic optimization to solve the instance P (Line 5), and finally convert the optimal solution 𝐸∗ to a set of API aliasing specifications 𝑆AS (Line 6). Particularly, Definition 3 has demonstrated how to construct the API value graph, and converting the optimal solution to the specifications is explicitly formulated at the end of § 4.3.
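For orientation, the top-level data flow of Algorithm 1 can be sketched as a thin Python pipeline. Every helper below is a toy stand-in (the real constructAVG follows Definition 3, the abstractions are instantiated in § 5.2, and neuroSymOpt is Algorithm 2 in § 5.3); only the wiring between the steps mirrors the pseudocode.

```python
# Skeleton of Algorithm 1; all helpers are toy stand-ins, and only the
# wiring between them mirrors the pseudocode.

def construct_avg(L):                       # Line 1 (Definition 3)
    return {"nodes": L["values"], "edges": L["type_consistent_pairs"]}

def get_semantic_unit_abs():                # Line 2 (instantiated in 5.2.1)
    return lambda name: set(name.lower().split("_")) - {"get", "set"}

def get_memory_operation_abs():             # Line 3 (instantiated in 5.2.2)
    return lambda desc: {"R"} if "return" in desc.lower() else set()

def neuro_sym_opt(P):                       # Line 5 (Algorithm 2)
    _, G, _, _ = P
    return list(G["edges"])                 # placeholder: keep every edge

def convert(e_star):                        # Line 6 (end of Section 4.3)
    return [(m1, m2, [], i1)
            for ((_, m1, i1), (_, m2, i2)) in e_star if i2 == -1]

def infer(L):
    G = construct_avg(L)                        # Line 1
    a_tau = get_semantic_unit_abs()             # Line 2
    a_o = get_memory_operation_abs()            # Line 3
    P = (L, G, a_tau, a_o)                      # Line 4
    e_star = neuro_sym_opt(P)                   # Line 5
    return convert(e_star)                      # Line 6

L = {"values": [("Stack", "push", 0), ("Stack", "pop", -1)],
     "type_consistent_pairs": [(("Stack", "push", 0), ("Stack", "pop", -1))]}
print(infer(L))  # [('push', 'pop', [], 0)]
```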
In the rest of this section, we provide more details on the label abstraction instantiation (§ 5.2) and the neurosymbolic optimization algorithm (§ 5.3), which finalize the functions getSemanticUnitAbs, getMemoryOperationAbs, and neuroSymOpt in Algorithm 1, respectively.

5.2 Label Abstraction Instantiation

According to Definitions 4 and 6, the semantic unit abstraction requires attaching grammatical tags to name sub-words, while the memory operation abstraction demands identifying how an API manipulates memory. In what follows, we detail how to instantiate them with two different NLP models, respectively.

5.2.1 Instantiating Semantic Unit Abstraction. According to common programming practices, the developers of libraries tend to follow typical naming conventions [14], such as camel case, pascal case, and snake case. For example, userAccount is a parameter name using camel case, and get_account_balance is an API name using snake case. Notably, the sub-words are often separated with an underscore or begin with an uppercase letter. Hence, we can easily decompose each name 𝑠 into the concatenation of several sub-words and further determine the tag of each sub-word. However, the names of APIs or their parameters can hardly be valid phrases or sentences. Simply applying part-of-speech (POS) tagging would tag almost all the words as nouns. Also, POS tagging targets sentences, while the names of parameters and APIs are only concatenations of words in phrases. To obtain more precise tagging results, we leverage an existing probability model trained on the Brown Corpus [34], which can return all the possible grammatical tags of each word along with their occurrences. This enables us to determine whether a word is more likely to be a noun according to the probability model, without depending on the usage context of the word. Formally, we instantiate the semantic unit abstraction as follows.

Definition 8.
(Instantiation of Semantic Unit Abstraction) Assume that 𝑔𝜏 maps a word 𝑤 to a set of tag-occurrence pairs {(𝜏𝑗, 𝑘𝑗)}. Given a sub-word 𝑤 in a parameter/API name 𝑠, 𝑤 ∈ 𝛼𝜏(𝑠) if and only if (NOUN, 𝑘∗) ∈ 𝑔𝜏(𝑤) and 𝑘∗ is the largest occurrence in 𝑔𝜏(𝑤).

Example 10. Consider the API setIdentifier in Figure 1. After splitting the API name into two sub-words, namely "set" and "identifier", we discover that "set" is more likely to be a verb than a noun, while "identifier" is very likely to be a noun. Hence, our instantiated semantic unit abstraction 𝛼𝜏 maps setIdentifier to {identifier}, identifying identifier as the semantic unit of the API.

5.2.2 Instantiating Memory Operation Abstraction. NLP models are particularly effective at distilling program semantics from natural language descriptions, enabling the autonomous inference of API specifications, such as taint specifications [22, 49, 66] and alias specifications [69], from unstructured documentation. Human-written API descriptions serve as high-level functional specifications, providing valuable semantic information regarding a method's behavioral intent.

To instantiate an effective memory operation abstraction, we leverage two common programming practices: (1) developers typically summarize API functionality using full sentences or verb-object phrases as semantic descriptions, and (2) the verbs within these descriptions intuitively depict the underlying memory operations performed by the API. However, extracting these operations is hindered by the vocabulary mismatch problem; for example, the verbs "put", "insert", and "push" may all denote a single memory insertion primitive. The diverse choices of verbs describing a specific memory operation would make the inference suffer low recall if we just adopted a grep-like approach based on string matching. Initially, we were inspired by recent progress in the NLP community.
Hence, we realized that the latest advances in LLMs may provide new opportunities for resolving this issue [13, 54, 56]. Specifically, LLMs have excellent abilities in text understanding, especially under the guidance of a few-shot examples or descriptions of rules. We designed a two-staged prompting solution that infers the memory operation kind from the API description, considering the best practices followed by developers in selecting the name of an API method. While using LLMs to reason over descriptions helps, they are prone to hallucinations, occasionally misclassifying intent or inventing non-existent operations in linguistic structures. For instance, the description "Removes the object at the top of this stack and returns that object as the value of this function" for the API method pop() consists of two simple sentences connected with the connecting word "and". The verb "remove" implies a modification of the memory, while the verb "return" shows a memory read operation afterwards. Although the prompt shown in Figure 5 requests the LLM to choose all the relevant memory operations, it may be biased by one part of the complex sentence and fail to produce the desired output.

[Figure omitted: the two prompt templates, including the verbs retrieved for each memory operation and an example "Yes/No" answer for getParcelableArrayListExtra. (a) Retrieve typical verbs via prompting. (b) Instantiate the memory operation abstraction via prompting.]
Fig. 5. The prompt templates of two-staged prompting

In this extended version, we propose a more robust solution using embedding models. By mapping API descriptions and formal verb phrases describing memory primitives into a shared high-dimensional vector space, we can perform semantic similarity comparisons. This deterministic approach resolves vocabulary variance while providing a grounded safeguard against the hallucinations inherent in purely generative models.
Memory Abstraction with LLMs: To instantiate the memory operation abstraction, we propose two-stage prompting, whose prompt templates are shown in Figure 5.

• First, we design the prompt in Figure 5(a) to retrieve the verbs describing each memory operation and ask the LLM to sort them by preference. Although the verb lists may overlap, the top-1 verbs are representative enough to distinguish different memory operations.
• Second, we select the top-1 verbs recommended in the first stage and then construct the prompt describing the rules for the memory operation abstraction, which is shown in Figure 5(b). Finally, we obtain an LLM response containing four "Yes"/"No" answers separated by commas.

It is worth noting that we identify memory operation kinds via two-stage prompting instead of one-stage prompting. If we manually specified the typical verbs describing memory operations, the second prompt would rely on our manual setting, which demands expert knowledge. If we did not offer typical verbs as hints, the result would not be as interpretable as the current one. Our design actually utilizes the ability of LLMs to predict method names for coding tasks, self-prompting the memory operation identification with generated typical verbs. Note that the first stage is only conducted once. The typical verbs are shared when analyzing library APIs. Hence, the extra cost introduced by the first stage is negligible. Based on the above prompting process, we can obtain an instantiation of the memory operation abstraction, which is formally formulated as follows.

Definition 9 (Instantiation of Memory Operation Abstraction). 𝑔𝑜 is the function induced by the LLM via the two-staged prompting in Figure 5. Then the memory operation abstraction 𝛼𝑜 satisfies that op ∈ 𝛼𝑜(𝑠) if and only if the corresponding answer of op in 𝑔𝑜(𝑠) is "Yes", where op ∈ 𝑀.

Example 11.
In Figure 5(b), the output of the LLM is "Yes, No, No, No", indicating that Intent.getStringArrayListExtra only conducts the memory read. Hence, we have 𝛼𝑜(𝑠4) = {R}, where 𝑠4 is the semantic description of the API Intent.getStringArrayListExtra. Similarly, for the API Intent.normalizeMimeType, the verb "normalize" in its semantic description 𝑠5 is not a synonym of any of the four typical verbs, so 𝛼𝑜(𝑠5) = ∅, indicating that it does not contribute to any store-load match.

Memory Abstraction with Embedding Models: By representing both API descriptions and memory operation descriptions as vectors, we can quantify their semantic similarity and identify the most likely memory behaviors (e.g., read, write, insert, or remove operations) without relying on rigid keyword matching. A semantic description 𝑠 is an informal description of an API method's functionality through a set of sentences. To reliably infer the memory operation from a semantic description, it is essential to consider the overall semantics of all the sentences. However, interpreting the connection between the sentences when they are not simple is not straightforward. A sentence can have three different structures according to linguistic resources [8], shown in Figure 6. A simple sentence is only an independent clause consisting of a verb phrase and a subject. A compound sentence consists of two or more independent clauses joined by a coordinating conjunction (FANBOYS: For, And, Nor, But, Or, Yet, So) or a semicolon. A complex sentence includes one independent clause with at least one dependent clause (starting with a subordinating conjunction like when, because, or although). While a simple sentence typically implies a direct functional objective, the latter two structures often provide additional context or conditions regarding the API's behavior.
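The structure detection described above can be approximated with keyword heuristics. A real implementation would use a linguistic parser; the two regexes below, with their shortened conjunction lists, are only an illustrative sketch of the typing and decomposition steps.

```python
import re

# Keyword-based approximation of sentence typing and decomposition.
# The conjunction lists are deliberately incomplete toy heuristics.

COORDINATOR = re.compile(r",?\s+\b(and|but|or|nor|yet|so)\b\s+", re.I)
SUBORDINATOR = re.compile(r"\b(if|when|because|although|unless)\b", re.I)

def classify(desc):
    # Map a description to Simple / Compound / Complex.
    if SUBORDINATOR.search(desc):
        return "Complex"
    if COORDINATOR.search(desc):
        return "Compound"
    return "Simple"

def decompose(desc):
    # Split a compound sentence at its coordinating conjunctions.
    if classify(desc) != "Compound":
        return [desc]
    parts = COORDINATOR.split(desc)
    # re.split keeps the captured coordinators at odd indices; drop them.
    return [p.strip(" .") for i, p in enumerate(parts) if i % 2 == 0]

desc = ("Removes the object at the top of this stack and returns "
        "that object as the value of this function.")
print(classify(desc))   # Compound
print(decompose(desc))
```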
[Figure omitted: parse trees of the three sentence structures, e.g., the compound sentence "Remove the object ... and return the object" and the complex sentence "Tests if this stack is empty".]
Fig. 6. Visual representation of sentence structures: (a) Simple, (b) Compound, and (c) Complex.

In a compound sentence, coordinating conjunctions link multiple independent clauses, typically representing a sequence or a set of distinct actions performed by the API. In contrast, in a complex sentence, the independent clause generally denotes the primary action, while the dependent clause specifies the conditions, constraints, or consequences. Consequently, we design a solution that identifies the sentence structure of an API description and employs an induction approach to isolate and infer abstract memory operations by focusing exclusively on the described primary action. This process follows three steps:

• Step 1: Semantic Structural Decomposition. We first analyze the semantic description. If it is either a compound or complex sentence, we decompose it into a set of constituent simple sentences {𝑠𝑒𝑛𝑡1, ..., 𝑠𝑒𝑛𝑡𝑛}.
• Step 2: Memory Operation Abstraction Inference via Semantic Similarity. For each extracted sentence 𝑠𝑒𝑛𝑡𝑖, we determine the most likely abstract memory operation 𝑜𝑝∗𝑖 ∈ 𝑀 by computing the semantic similarity between its vector representation and the embeddings of our predefined memory operation descriptions. We then select the operation that yields the highest similarity score.
• Step 3: Memory Operation Abstraction Aggregation.
Finally, we aggregate the inferred operations according to the original sentence structure to finalize the set of abstract memory operations 𝑂𝑃∗ that characterize the data-flow behavior of the API method.

Definition 10 (Sentence Structure). For a given semantic description 𝑠, we define a typing function T(𝑠) ∈ {Simple, Compound, Complex} that maps the description to its primary linguistic structure based on the connectivity of its constituent clauses.

Definition 11 (Semantic Structural Decomposition). A semantic description 𝑠 is formally represented as the union of 𝑛 simple sentences {𝑠𝑒𝑛𝑡1, ..., 𝑠𝑒𝑛𝑡𝑛}, where 𝑛 > 0. Each 𝑠𝑒𝑛𝑡𝑖 represents a fundamental unit of the API method's functionality.

Example 12. The API documentation for the method Stack.pop() describes it as: "Removes the object at the top of this stack and returns that object as the value of this function." This description uses a compound sentence structure joined by the conjunction "and". We decompose it into two simple sentences:

• 𝑠𝑒𝑛𝑡1: "Removes the object at the top of this stack."
• 𝑠𝑒𝑛𝑡2: "Returns that object as the value of this function."

Definition 12 (Abstract Memory Operation Descriptions). The abstraction 𝛽 maps each memory operation 𝑜𝑝 ∈ 𝑀 = {𝑅, 𝑊, 𝐼, 𝐷} to a natural language semantic descriptor 𝑑 ∈ 𝐷𝑒𝑠𝑐. The mapping is defined as:

• 𝛽(𝑅): Gets value of something.
• 𝛽(𝑊): Sets value of something.
• 𝛽(𝐼): Inserts something into a collection.
• 𝛽(𝐷): Removes something from a collection.

Definition 13 (Sentence Similarity Function). The similarity function 𝑓 : 𝑆𝑒𝑛𝑡 × 𝐷𝑒𝑠𝑐 → [0, 1] evaluates the semantic alignment between the vector representation of a sentence 𝑠𝑒𝑛𝑡𝑖 and the vector representation of an operation description 𝛽(𝑜𝑝):

𝑠𝑐𝑜𝑟𝑒 = 𝑓(𝑠𝑒𝑛𝑡𝑖, 𝛽(𝑜𝑝)), 𝑜𝑝 ∈ 𝑀

For each simple sentence, the underlying memory operation is assumed to be encoded within its verb phrases.
We map these linguistic units to the set of operations 𝑀 = {𝐼, 𝐷, 𝑅, 𝑊} by identifying the operation 𝑜𝑝 ∈ 𝑀 whose description 𝛽(𝑜𝑝) most closely resembles the action of the sentence. Formally, for 𝑠𝑒𝑛𝑡𝑖, we select the 𝑜𝑝∗𝑖 that maximizes the similarity score 𝑓.

Definition 14 (Sentence-level Memory Operation Abstraction). The inferred operation 𝑜𝑝∗𝑖 for 𝑠𝑒𝑛𝑡𝑖 is defined as the operation that maximizes the similarity score:

𝑜𝑝∗𝑖 = arg max_{𝑜𝑝 ∈ 𝑀} 𝑓(𝑠𝑒𝑛𝑡𝑖, 𝛽(𝑜𝑝))

Once the memory operation abstraction is retrieved for each simple sentence, we infer the memory operation for the whole semantic description 𝑠 provided for an API method.

Example 13. Using a sentence embedding model (e.g., SBERT-MPNet), we calculate the cosine similarity scores between the simple sentence 𝑠𝑒𝑛𝑡1, "Removes the object at the top of this stack", and each of the four memory abstraction descriptions as below:

• 𝑠𝑐𝑜𝑟𝑒𝑟 = 𝑓(𝑠𝑒𝑛𝑡1, 𝛽(𝑅)) = 0.31
• 𝑠𝑐𝑜𝑟𝑒𝑤 = 𝑓(𝑠𝑒𝑛𝑡1, 𝛽(𝑊)) = 0.24
• 𝑠𝑐𝑜𝑟𝑒𝑖 = 𝑓(𝑠𝑒𝑛𝑡1, 𝛽(𝐼)) = 0.29
• 𝑠𝑐𝑜𝑟𝑒𝑑 = 𝑓(𝑠𝑒𝑛𝑡1, 𝛽(𝐷)) = 0.59

The optimal operation 𝑜𝑝∗1 is determined by selecting the abstraction with the highest similarity score: 𝑜𝑝∗1 = arg max_{𝑘 ∈ {𝑅,𝑊,𝐼,𝐷}} (𝑠𝑐𝑜𝑟𝑒𝑘) = 𝐷

Definition 15 (Instantiation of Memory Operation Abstraction). Let a semantic description 𝑠 be composed of 𝑛 constituent simple sentences {𝑠𝑒𝑛𝑡1, ..., 𝑠𝑒𝑛𝑡𝑛}.
The induction function G aggregates the individual operation abstractions 𝑜𝑝∗𝑖 into a global operation set 𝑂𝑃∗ based on the sentence structure T(𝑠) as follows:

𝑂𝑃∗ = G(𝑠, {𝑜𝑝∗𝑖}𝑛𝑖=1) =
  {𝑜𝑝∗1}                if T(𝑠) = Simple
  ∪𝑛𝑖=1 {𝑜𝑝∗𝑖}          if T(𝑠) = Compound
  {𝑜𝑝∗independent}      if T(𝑠) = Complex

Based on the sentence structure of the description, we interpret the memory operation differently with the induction function G as stated below:

• Simple Sentence: For 𝑛 = 1, the global operation is directly mapped from the single constituent sentence.
• Compound Sentence: The global operation is the union of operations from all clauses, representing a sequence or concurrent set of actions.
• Complex Sentence: The global operation is inherited exclusively from the independent clause, which identifies the primary functional intent, while dependent clauses (providing conditions or constraints) are disregarded.

By applying G, we transform unstructured natural language into a structured set of abstract memory operations that can be directly mapped to the nodes of the API value graph.

Notably, our intuition of the label abstraction upon the API value graph is applicable to general libraries in real-world production. Typically, the developers of libraries work in well-organized communities, follow good naming conventions, and use proper verbs in semantic descriptions. That is, they are unlikely to use different nouns to indicate objects with the same usage intention or to describe the memory operations conducted by the APIs with wrong verbs. Their good development habits permit us to correctly interpret the informal semantic properties of library APIs with NLP models, which can yield satisfactory precision and recall in the wild. Our evaluation also demonstrates the effectiveness of the label abstraction upon benchmarks used in existing studies [6, 9, 29].
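Definitions 12 through 15 chain together as an argmax over similarity scores followed by structure-aware aggregation. In the sketch below, a crude stop-word-filtered bag-of-words cosine replaces the sentence embedding model, so the scores differ from Example 13, but the pipeline shape (embed, score, argmax, aggregate) is the same.

```python
import math
import re

# Toy stand-in for Definitions 12-15: a bag-of-words cosine replaces
# the sentence embedding model, purely to illustrate the pipeline.

BETA = {  # Definition 12: descriptors for M = {R, W, I, D}
    "R": "Gets value of something.",
    "W": "Sets value of something.",
    "I": "Inserts something into a collection.",
    "D": "Removes something from a collection.",
}
STOP = {"the", "a", "of", "at", "this", "that", "as", "into", "from",
        "something"}

def embed(text):
    # Crude vector: multiset of lower-cased content words.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
    vec = {}
    for w in words:
        vec[w] = vec.get(w, 0) + 1
    return vec

def cos(u, v):
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def sentence_op(sent):
    # Definition 14: pick the operation maximizing the similarity score.
    return max(BETA, key=lambda op: cos(embed(sent), embed(BETA[op])))

def aggregate(structure, op_list):
    # Definition 15: the induction function G. For Complex sentences we
    # assume op_list[0] comes from the independent clause.
    if structure == "Compound":
        return set(op_list)
    return {op_list[0]}

sents = ["Removes the object at the top of this stack.",
         "Returns that object as the value of this function."]
op_list = [sentence_op(s) for s in sents]
print(op_list[0])                      # D, as in Example 13
print(aggregate("Compound", op_list))
```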
Furthermore, such well-structured natural language descriptions, including documentation and comments, have been utilized in various software engineering tasks, such as API name recommendation [1, 19, 73, 76], API misuse detection [60, 84], and unit test generation [11]. These approaches, which share similar assumptions about natural language descriptions as ours, have demonstrated their practical impact in understanding code semantics and benefiting downstream clients. We provide a detailed discussion of these approaches in § 8.

5.3 Neurosymbolic Optimization

As shown in § 5.2, our two label abstractions are achieved with different overheads. Specifically, the semantic unit abstraction only relies on the tagging model, which can be applied efficiently. For the memory operation abstraction, we utilize embedding models, a significant optimization over the full LLM inference used in previous iterations, as they offer a better balance between semantic accuracy and computational cost. To achieve high efficiency, we propose a solving technique, named neurosymbolic optimization, for the optimization problem defined in Definition 7.

For each API pair, we first check the satisfiability of the degree constraint 𝜙𝑑 and the validity constraint 𝜙𝑣 (Lines 2–5). If both of them are satisfied, we apply the tagging model to derive the semantic unit constraint 𝜙𝑠 (Line 6) and examine the satisfiability of the conjunction of the three constraints (Line 7). If it is satisfiable, we invoke the embedding models to achieve the memory operation abstraction and derive the memory operation constraint (Line 9). Based on OMT solving [10], we select the maximal number of edges connecting the API values (Line 10) and append them to the set 𝐸∗ (Line 11), which is returned as the solution to the optimization problem.
Algorithm 2: Neurosymbolic optimization
Input: P: an optimization problem
Output: E*: the optimal solution
1  foreach (c, m1), (c, m2) do
2      𝜙_d ← deriveDegreeConstraints(P);
3      𝜙_v ← deriveValidityConstraints(P);
4      if SMTSolve(𝜙_d ∧ 𝜙_v) = UNSAT then
5          continue;
6      𝜙_s ← deriveSUConstraints(P);
7      if SMTSolve(𝜙_d ∧ 𝜙_v ∧ 𝜙_s) = UNSAT then
8          continue;
9      𝜙_o ← deriveMOConstraints(P);
10     E′ ← Solve(obj(P), 𝜙_d ∧ 𝜙_v ∧ 𝜙_s ∧ 𝜙_o);
11     E* ← E* ∪ E′;
12 return E*;

Notably, the degree constraint and validity constraint do not depend on any NLP models and are instantiated symbolically, while the semantic unit constraint and memory operation constraint rely on the outputs of the tagging model and the embedding models, respectively, and are instantiated in a neural manner. By decoupling the symbolic constraints from the neural ones, DAInfer+ applies the NLP models with a lazy strategy. Note that in our previous attempt, LLM inference consumed much more time than SMT solving [69]. Our new design with embedding models can significantly reduce the time overhead and the hardware cost.

Example 14. Consider the APIs of Intent in Figure 1(a). When processing the APIs Intent.fillIn and Intent.getIdentifier, the validity constraint is not satisfied as there are no type-consistent parameters or return values. Hence, we do not apply the tagging model or the embedding model. For the APIs Intent.setIdentifier and Intent.normalizeMimeType, we find that their parameters and return values are not semantic-unit consistent, so we do not invoke the embedding model with their semantic descriptions. When we designed the label abstraction instantiation, we also considered directly prompting LLMs to validate semantic unit consistency. However, pairwise examining the names of an API and its parameters introduces a large number of LLM inferences, which increases time and token costs.
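The lazy strategy of Algorithm 2 can be sketched as below. The function names (symbolic_ok, tagging_model, embedding_model, solve) are hypothetical stand-ins for the constraint derivations and solver calls; the point is that the neural models are only invoked for pairs surviving the cheap symbolic checks:

```python
# A minimal sketch of the lazy decoupling in Algorithm 2: symbolic
# constraints (degree, validity) are checked first, and the neural models
# (tagging, embedding) are only applied to the API pairs that survive.

def neurosymbolic_optimize(api_pairs, symbolic_ok, tagging_model,
                           embedding_model, solve):
    edges, stats = set(), {"tagging_calls": 0, "embedding_calls": 0}
    for pair in api_pairs:
        # Degree + validity constraints: purely symbolic, cheap to check.
        if not symbolic_ok(pair):
            continue
        # Semantic unit constraint: needs the (neural) tagging model.
        stats["tagging_calls"] += 1
        if not tagging_model(pair):
            continue
        # Memory operation constraint: needs the (neural) embedding model.
        stats["embedding_calls"] += 1
        phi_o = embedding_model(pair)
        # OMT-style step: pick the maximal consistent edge set for the pair.
        edges |= solve(pair, phi_o)
    return edges, stats
```

With stub predicates, one can observe that a pair rejected symbolically never triggers a tagging call, and a pair rejected by the tagging model never triggers an embedding call, which is exactly the source of the time savings reported for DAInfer+.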
We add more discussion on the possibility of utilizing LLMs to improve DAInfer+ in this task in § 7.6.

5.4 Summary

DAInfer+ is the first attempt at inferring API specifications from documentation. It demonstrates the promising potential of utilizing new advances in the natural language processing community, especially embedding models, to solve traditional static analysis problems. Similar to traditional pointer analyses upon source code, such as Andersen-style pointer analysis [3], DAInfer+ establishes a constraint system over library documentation to pose restrictions upon pointer facts. To precisely understand the natural language, it utilizes NLP models as documentation interpreters to abstract informal semantic information, which supports instantiating an optimization problem for the specification inference. Our insight into utilizing NLP models for documentation interpretation can be generalized to other tasks, such as program synthesis [80] and test case generation [52].

6 IMPLEMENTATION

We implement the approach DAInfer+ as a prototype and release the source code online [25]. Specifically, we implement the documentation parser using the BeautifulSoup Python package. For each documentation page describing the API semantics, we extract four kinds of information, including class hierarchy relations, API type information, naming information, and API semantic descriptions. Since library documentation pages almost always have a uniform format, we do not have to make major changes to the implementation of the parser to adapt to different libraries. To instantiate the semantic unit abstraction, we utilize the conditional frequency distribution tool with the Brown Corpus provided by the Natural Language Toolkit [53]. This allows us to determine whether a word is most likely to be a noun.
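The noun-likelihood check can be sketched as follows. The paper uses NLTK's conditional frequency distributions over the Brown Corpus; here a tiny hand-tagged sample stands in for the corpus so the sketch is self-contained, and the tag names are simplified:

```python
# A minimal sketch of deciding whether a word is most likely a noun using
# a conditional frequency distribution (word -> tag -> count), mirroring
# NLTK's ConditionalFreqDist over a tagged corpus. The corpus below is a
# toy stand-in, not the Brown Corpus.
from collections import Counter, defaultdict

tagged_corpus = [  # (word, simplified POS tag) pairs, illustrative only
    ("key", "NOUN"), ("key", "NOUN"), ("key", "ADJ"),
    ("set", "VERB"), ("set", "VERB"), ("set", "NOUN"),
    ("index", "NOUN"),
]

# Conditional frequency distribution: condition = word, event = tag.
cfd = defaultdict(Counter)
for word, tag in tagged_corpus:
    cfd[word.lower()][tag] += 1

def is_likely_noun(word):
    """True iff the word's most frequent tag in the corpus is NOUN."""
    freqs = cfd.get(word.lower())
    if not freqs:
        return False  # unseen words carry no noun evidence
    return freqs.most_common(1)[0][0] == "NOUN"
```

For instance, "set" occurs mostly as a verb in the toy corpus, so it is not treated as a semantic unit candidate, whereas "key" and "index" are.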
Finally, we leverage an advanced NLP library called spaCy [38] to construct the dependency trees, enabling the extraction of clausal structures from simple, compound, and complex sentences. To instantiate the memory operation abstraction as in our previous research using LLMs [69], we adopt the gpt-3.5-turbo model with the chat completions API to interpret the API semantic descriptions [55]. Specifically, we invoke the ChatCompletion.create interface to feed the constructed prompts to the LLM and fetch its response. In our implementation, we set the temperatures for the two stages of prompting to 0.7. To extract memory operation abstractions using embedding models, we employed the pre-trained models from the Sentence-BERT (SBERT) [59] and E5 [71] frameworks. SBERT [59] is a modification of the BERT [28] architecture that uses Siamese and triplet network structures to generate semantically meaningful sentence embeddings. The fixed-size vectors in SBERT allow for highly efficient comparison. As a retrieval-first architecture, E5 [71] excels at resolving asymmetric semantic similarity. This capability stems from its extensive pre-training on large-scale web corpora. We used these models to compute the embedding vectors for each simple sentence. Independently, we also computed the vectors for the four memory operation descriptions. The semantic correspondence between a sentence and each operation was quantified using cosine similarity.
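The matching step can be sketched as below. Real DAInfer+ uses SBERT/E5 embedding vectors; the three-dimensional vectors in operation_vectors are illustrative stand-ins for the embeddings of the four memory operation descriptions:

```python
# A minimal sketch of matching a sentence embedding to the closest memory
# operation description by cosine similarity. The toy vectors are
# hypothetical; a real run would obtain them from an SBERT or E5 encoder.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical embeddings for the four memory operation descriptions.
operation_vectors = {
    "write":  [0.9, 0.1, 0.0],
    "read":   [0.1, 0.9, 0.0],
    "insert": [0.7, 0.0, 0.7],
    "remove": [0.0, 0.2, 0.9],
}

def best_operation(sentence_vector):
    """Return the memory operation whose description is most similar."""
    return max(operation_vectors,
               key=lambda op: cosine(sentence_vector, operation_vectors[op]))
```

A sentence like "Sets the identifier" would embed near the write description and thus be mapped to the write operation; the heatmaps in Figure 8 visualize exactly these pairwise similarity scores.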
This approach allows us to align the linguistic functionality of the API methods with formal memory operations based on their shared semantic space.

We implement the neurosymbolic optimization based on the Z3 solver [10, 26]. For any pair of APIs, we introduce (n1 + 1) · (n2 + 1) Boolean variables to indicate whether the two API values are aliased, where n1 and n2 are the numbers of the API parameters. We directly encode the degree constraint and validity constraint symbolically, while the semantic unit constraint and memory operation constraint are constructed and solved on demand, relying on the outputs of our desired NLP models. We count the number of Boolean variables assigned to True and set this count as the objective function. For better performance, we parallelize the invocations of the LLM in eight threads and introduce memoization to store the tagging result and the result of the memory operation abstraction for each semantic description. If a word or an API semantic description has been processed before, we directly reuse the previous result.

7 EVALUATION

We evaluate DAInfer+ by investigating the following research questions:
• RQ1: How accurately and efficiently does DAInfer+ generate data-flow and aliasing specifications?
• RQ2: How does DAInfer+ benefit library-aware static analysis clients?
• RQ3: How does DAInfer+ compare against other approaches?

Fig. 7. The zero- and few-shot prompt templates for inferring data-flow specifications for an API method.

7.1 Experimental Setup

All the experiments in DAInfer [69] are performed on a 64-bit machine with 40 Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20 GHz cores and 512 GB of physical memory. For DAInfer+, we used a machine with the same CPU setup, 251 GB of physical memory, and an NVIDIA RTX 3090 GPU for hosting LLMs and embedding models. We invoke the Z3 SMT solver with its default options.

Subjects.
To show the superiority of DAInfer+ in producing alias specifications, we evaluate Atlas [9], USpec [29], and DAInfer+ upon the same set of Java classes. Specifically, the Java classes are collected from: (1) the classes whose specifications are manually specified in FlowDroid [6]; (2) the classes appearing in the inference results of USpec [29]. Since the dataset of Atlas [9] is not publicly available, we cannot conduct experiments on it. In total, our benchmark contains 167 Java classes offering 8,342 APIs, which range from general-purpose libraries, including the Android framework and the Java Collections Framework, to specific-usage libraries, such as Gson. Without ambiguity, we call the first and the second kinds of classes the FlowDroid benchmark and the USpec benchmark, respectively. To evaluate the effectiveness of our memory operation abstraction module in inferring data-flow facts for constructing specifications, we evaluated DAInfer+ across a diverse suite of state-of-the-art LLMs trained on both general-purpose and programming-specific datasets. Our selection includes deepseek-v2 [27], gpt-4o-mini [56], qwen2.5-coder [39], and deepseek-coder-v2 [36]. We employed zero-shot and few-shot prompting strategies (illustrated in Figure 7) to task these models with inferring data-flow specifications. For each API method, the models were provided with the method signature, its containing class, and the corresponding documentation. To facilitate a rigorous comparison, we leveraged the same 167 Java classes originally specified in the FlowDroid framework, comparing the specifications generated by our tool against those generated by the LLMs.

7.2 Data-Flow Specification Inference

Effectiveness. To assess the effectiveness of our approach, we use the manually curated dataset of data-flow specifications from FlowDroid [6] as our ground truth.
Due to the extensive number of specifications in the original dataset, we randomly selected 141 classes comprising 1,094 API methods. Among these, 1,064 methods provide semantic descriptions within their documentation. After excluding 19 deprecated methods, our final evaluation set consisted of 1,045 methods. To ensure the accuracy of the ground truth and address potential flaws in the original collection, two researchers independently re-labeled the dataset by analyzing the source code and documentation of each API method to verify the corresponding data-flow specifications. As demonstrated in Table 1, DAInfer+ achieves its maximum effectiveness when utilizing embedding models rather than LLMs. General-purpose LLMs, such as gpt-4o-mini and deepseek-v2, achieve relatively high precision but lower overall recall, even with few-shot prompting. This indicates a tendency toward under-approximation: these models are often too conservative, failing to identify valid data-flows described in the documentation. However, when they do predict flows, they are still prone to spurious over-approximation in specific cases. While few-shot prompting techniques improve their recall, these models remain prone to generating spurious flows. For instance, gpt-4o-mini infers a data-flow from the first parameter of the API method android.os.Bundle.getStringArrayList(string) to "this" (the Bundle object). In addition, code-specialized LLMs achieve higher precision (>80%), but they still struggle to reach high recall. For example, qwen2.5-coder incorrectly infers a data-flow from the first parameter of android.content.Intent.getStringArrayExtra(string) to "this" (the Intent object). The lower false positive rate in code-specialized models is due to their training on structural code patterns; however, they remain limited by under-approximation.
By being overly conservative in predicting flows, these models frequently fail to identify valid data-flows, thereby increasing their false negative rates. Embedding models demonstrate superior performance in both recall and precision compared to LLMs. As shown in Table 1, these models consistently achieve recall and precision scores exceeding 82% and 85%, respectively, indicating higher reliability for specialized data-flow specification inference. This success is primarily due to the semantic alignment between the verbs used in API method descriptions and the definitions of memory operations. The heatmaps in Figure 8 support this claim by illustrating the similarity scores between various verbs found in API documentation and our memory operation descriptions using two popular SBERT models. Notably, these figures reveal that most descriptive verbs align closely with the correct memory operations.

Table 1. Accuracy (Acc.), Recall (Rec.), Precision (Pre.), and F1-score for data-flow specification retrieval.

Type                  Model              Acc.    Rec.    Pre.    F1-score
zero-shot prompting   deepseek-v2        23.83   24.69   87.29   0.38
                      gpt-4o-mini        61.83   72.25   81.09   0.76
                      qwen2.5-coder      60.49   65.56   87.66   0.75
                      deepseek-coder-v2  47.51   54.74   78.23   0.64
few-shot prompting    deepseek-v2        49.26   64.95   67.09   0.66
                      gpt-4o-mini        69.30   76.46   87.09   0.81
                      qwen2.5-coder      72.05   75.08   94.68   0.83
                      deepseek-coder-v2  50.74   58.05   80.11   0.67
embedding models      SBERT-Mini         74.93   82.36   87.25   0.86
                      SBERT-MPNet        75.60   83.90   89.42   0.86
                      E5-Base            73.32   83.95   85.37   0.85
                      E5-Large           76.34   84.69   87.50   0.87

Efficiency. To evaluate the efficiency of embedding models relative to LLMs, we utilize two primary metrics: (1) inference time and (2) cost. As indicated in Table 2, the selected LLMs require

(a) SBERT-MPNet embedding model. (b) SBERT-Mini embedding model. Fig. 8.
Cosine similarity scores for various verbs in API method descriptions and the verbs used in our designated memory operation descriptions, with different SBERT models.

significantly longer durations to infer data-flow specifications compared to their embedding counterparts. Even in the best-case scenario, where deepseek-coder-v2 achieves its peak performance, its analysis time remains orders of magnitude higher than that of the embedding models. Furthermore, the throughput of embedding models (e.g., SBERT-MPNet and E5-Base) significantly exceeds that of LLMs like qwen2.5-coder in few-shot mode. Here, throughput is defined as the volume of data processed per second (measured in tokens for LLMs and vectors for embedding models). On average, the LLMs processed 8,389 prompts to achieve inference, whereas the embedding models extracted and encoded 1,147 simple sentences. This empirical evidence confirms that for high-volume data-flow inference, the encoder-only architecture provides a superior efficiency-to-performance ratio compared to the decoder-only autoregressive approach, which is fundamentally bottlenecked by sequential, token-by-token generation. To quantify the total cost, we monitored resource utilization across an array of NVIDIA RTX 3090 GPUs. Our findings reveal a significant disparity in VRAM overhead. Even when utilizing 4-bit quantization to minimize memory requirements, the local LLM suite (qwen2.5-coder-32B and deepseek-coder-v2) requires approximately 20–30 GB of VRAM just to host the models. Comparatively, the embedding models occupy at most 2 GB of memory. Furthermore, we analyzed the operational costs considering the rental of a cloud service with the same setup as our environment. In addition, gpt-4o-mini consumed 4M output and 73K input tokens for retrieving the data-flow specifications of the entire dataset, totaling $2.41 USD.
While this API-based approach is economically viable for one-off inferences, the total time required for LLM-based inference is several orders of magnitude higher than our embedding-based approach, which processes the entire dataset in a few seconds. Even when utilizing the same hardware environment, the embedding models (E5, SBERT) demonstrate a massive advantage in computational efficiency.

Table 2. Comparative analysis of resource intensity between embedding models and LLMs with few-shot prompting.

Model              Parameters   Throughput      Cost                 Time Cost
                                (items/sec.)    (per 1M items)       (sec.)
deepseek-v2        16B          1.62            $0.11 *              3,358.2
qwen2.5-coder      32B          2.26            $0.08 *              2,344
gpt-4o-mini        proprietary  5.73            $0.15–$0.60 +        927
deepseek-coder-v2  7B           2.26            $0.03 *              900
SBERT-Mini         22M          6,966.66        $0.00 †              0.15
SBERT-MPNet        110M         2,223.40        $0.00 †              0.47
E5-Base            110M         460.35          $0.00 †              2.27
E5-Large           335M         176.55          $0.00 †              5.19

* Estimated based on a market rental rate of USD $0.56/hr for equivalent hardware (RTX 3090, 215 GB RAM).
† Embedding models incur negligible costs on standard hardware; values are rounded to the nearest cent.
+ Based on OpenAI official API pricing for input/output token blends.

7.3 Alias Specification Inference

Effectiveness. Although USpec offers its raw data and the source code of Atlas is available, the ground truth used in the two previous studies is not published. Also, the specifications offered by FlowDroid are manually specified by the developers and thus may contain several flaws and miss several correct specifications. Hence, we have to label the specifications of the benchmarks manually. Meanwhile, investigating all the classes demands tremendous manual effort. Following the recent study [29], we randomly select 60 classes that offer 2,771 APIs in total. For each API, we examine whether it forms store-load API pairs with other APIs offered by the same class, of which the number can reach 50 on average.
To make the manual examination more reliable, we invite five experienced engineers from industry as volunteers to specify the specifications independently. Specifically, they refer to the specifications specified by the developers of FlowDroid and inferred by existing works (i.e., USpec and Atlas), and meanwhile investigate the library documentation and implementation simultaneously. In the end, we merge the specifications specified by the five volunteers and resolve the inconsistent parts following the principle of majority voting, eventually obtaining 988 API aliasing specifications as the ground truth. According to our investigation, we find that DAInfer+ achieves high precision and recall upon the experimental subjects. In total, it successfully infers 2,680 API aliasing specifications. For the randomly selected 60 classes, DAInfer [69] infers 1,019 API aliasing specifications, 813 of which are correct, achieving a precision of 79.78%. We also discover that DAInfer misses specifications, achieving a recall of 82.29%. In our latest study, DAInfer+ achieves 79% precision and 88% recall when using SBERT-MPNet. By systematically decomposing complex descriptions and inferring abstract memory operations based on sentence structure, DAInfer+ effectively filters out non-functional linguistic noise. This structural approach significantly reduces false positives, as it isolates the primary action from the surrounding constraints and conditions that frequently mislead general-purpose LLMs.

Table 3. Efficiency of DAInfer and its ablations. # Inputs represents vectors for embedding models and prompts for LLMs.

Tool                      # Tagging   # Inputs   # Tokens    Time Cost (sec)
DAInfer+ (SBERT-Mini)     32,325      3,597      NA          87.86
DAInfer+ (SBERT-MPNet)    32,325      3,597      NA          88.88
DAInfer [69]              32,325      2,950      726,425     892.93
DAInfer-Type [69]         32,325      5,164      1,276,254   1,734.63
DAInfer-Exhaustive [69]   58,846      8,090      1,994,017   2,844.26
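The merging step can be sketched as below. The tuple encoding of a specification is illustrative, not the paper's exact format; the voting rule (a strict majority of the five volunteers) follows the text:

```python
# A minimal sketch of merging independently labeled specifications by
# majority vote, as done when reconciling the five volunteers' labels.
from collections import Counter

def merge_by_majority(label_sets, n_labelers, threshold=None):
    """Keep a specification iff a strict majority of labelers marked it."""
    if threshold is None:
        threshold = n_labelers // 2 + 1  # e.g. 3 of 5
    votes = Counter()
    for labels in label_sets:
        votes.update(set(labels))  # each labeler votes at most once per spec
    return {spec for spec, n in votes.items() if n >= threshold}

# Three of five volunteers agree on the (Map.put, Map.get) pair:
labelers = [
    {("Map.put", "Map.get")},
    {("Map.put", "Map.get"), ("List.add", "List.get")},
    {("Map.put", "Map.get")},
    {("List.add", "List.get")},
    set(),
]
ground_truth = merge_by_majority(labelers, n_labelers=5)
```

Specifications marked by only a minority (two of five here) are dropped, which is how inconsistent labels are resolved before fixing the 988-specification ground truth.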
After examining all the APIs of the selected classes, we collect the specifications whose API names contain "get" or "set", and interestingly discover that such specifications only take up 33.49% of all the inferred ones. This shows that DAInfer can understand how APIs operate upon memory even when diverse verbs are used. We also compare our results with the specifications in the FlowDroid and USpec benchmarks. DAInfer+ infers 170 of the 210 specifications in the FlowDroid benchmark and 65 of the 82 specifications inferred by USpec, achieving 81.0% and 79.3% recall upon the two benchmarks, respectively. The above results show that DAInfer, and DAInfer+, its latest update with embedding models, can effectively infer API aliasing specifications from documentation.

Efficiency. We quantify the efficiency of DAInfer and DAInfer+ with four metrics: the number of times the tagging model is applied, the input count (representing LLM prompts or vector calculations), the token cost, and the time cost. As shown in Table 3, DAInfer [69] applies the tagging model 32,325 times and interacts with the LLM 2,950 times using 726,425 tokens, and the overall time cost is 892.93 seconds (around 15 minutes). According to the OpenAI billing strategy, we only need to pay 1.09 USD in total. We also conduct an ablation study to demonstrate the benefit of the neurosymbolic optimization algorithm. Specifically, the ablation DAInfer-Exhaustive applies the two NLP models to all the APIs, while the ablation DAInfer-Type applies the NLP models only to the APIs satisfying the degree constraint and the validity constraint. As shown in Table 3, DAInfer-Type invokes the LLM 5,164 times with 1,276,254 tokens in total and finishes analyzing all subjects in 1,734.63 seconds.
Besides, DAInfer-Exhaustive has to apply the tagging model 58,846 times and invoke the LLM 8,090 times using 1,994,017 tokens, with the whole process finishing in 2,844.26 seconds. The key reason for the differences between the ablations is that the solving steps at Lines 4 and 7 in Algorithm 2 can effectively reduce the number of times the tagging model and the LLM are applied, respectively. Compared to DAInfer-Type and DAInfer-Exhaustive, DAInfer achieves the inference with 1.94× and 3.19× speed-ups when relying on two-staged prompting with LLMs. By utilizing DAInfer+ with embedding models and an optimized tagging-based inference approach, the specification retrieval process is significantly accelerated. From the original data, we extract a total of 3,597 simple sentences, which takes 87.4 seconds on average. The subsequent memory operation abstraction using embedding models requires at most 2 seconds, specifically 0.48 seconds for SBERT-Mini and 1.48 seconds for SBERT-MPNet. Consequently, the overall speed-up reaches 10.16× and 10.04× for SBERT-Mini and SBERT-MPNet, respectively, compared to using DAInfer with LLMs. Hence, our neurosymbolic optimization can efficiently support the specification inference.

7.4 Effects on Client Analysis

Following existing studies [9, 29], we choose alias analysis and taint analysis as two fundamental clients of DAInfer+ to quantify its effects.

Effect on Alias Analysis. We conduct a field- and context-sensitive alias analysis by running the static analyzer Pinpoint [63, 77] upon 15 Java projects in two settings. In the setting Alias-Empty, we provide empty specifications of library APIs, i.e., discarding all the possible alias facts introduced by library API calls. In the setting Alias-Infer, we apply the inferred correct API aliasing specifications to the pointer analysis. For each given pointer, Pinpoint computes its alias facts in a sound manner.
We quantify the alias set sizes of the return values of library APIs and compute the ratio size_infer / size_empty for each library API invocation, where size_infer and size_empty are the alias set sizes of the return value under the settings Alias-Infer and Alias-Empty, respectively. Figure 9 shows the histogram of the distribution of these ratios. According to the ratios of alias set sizes, the average increase ratio reaches 80.05% with the benefit of our inferred specifications. Outside the intervals (1, 1.2] and (1.2, 1.4], the ratio is larger than 1.4, i.e., the size increase exceeds 40%; the proportion of such library API invocations reaches 96.25%. Because our pointer analysis is sound and we investigate the same set of return values of library API calls, the increases in the alias set sizes demonstrate that DAInfer+ helps the alias analysis discover more alias facts in applications using libraries.

Fig. 9. The results of pointer analysis (distribution of the ratios of alias set sizes).

Effect on Taint Analysis. We choose three different settings of specifications for FlowDroid to conduct the taint analysis, namely Taint-Empty, Taint-Manual, and Taint-Infer. Here, Taint-Empty and Taint-Infer are similar to the two settings in the pointer analysis, and the sources and sinks are specified based on the default taint specification offered by FlowDroid. Under the setting Taint-Manual, we apply the manual specifications provided by FlowDroid directly. We select 23 popular Android applications in F-Droid [30], which cover different program domains, including navigation, security, and messaging applications.

Fig. 10. The results of taint analysis (number of taint flows per application).

Figure 10 shows the number of taint flows discovered under the three settings. Specifically, FlowDroid discovers 225 taint flows under Taint-Empty, while it finds 304 taint flows under Taint-Manual.
Notably, 79 of the 304 taint flows are induced by the aliasing relations among API parameters and return values. When we run FlowDroid under Taint-Infer, it discovers 310 taint flows, 85 of which are discovered based on the correct API aliasing specifications inferred by DAInfer+. There are six taint flows in three apps not discovered by FlowDroid under the setting Taint-Infer due to false negatives of our inference algorithm. However, 12 taint flows discovered under Taint-Infer are not discovered under Taint-Manual. The results demonstrate that DAInfer+ helps the taint analysis discover more taint flows. We do not seek confirmations of the taint flows, which may depend on the developers' subjective intentions and the choices of taint specifications. However, the ability to discover more taint flows has shown the practical impact of DAInfer+ in detecting potential taint-style vulnerabilities. This evaluation principle is also applied in many existing studies [9, 29, 64].

7.5 Comparison with Existing Techniques

We initially compare DAInfer+ with StubDroid [5] on constructing the data-flow specifications. Next, we compare DAInfer+ with the two most recent studies on API aliasing specification inference, i.e., Atlas [9] and USpec [29]. Besides, we construct another baseline, LLM-Alias, which feeds the documentation to ChatGPT and generates API aliasing specifications via in-context learning.

Comparison with StubDroid. To evaluate the semantic richness of DAInfer+, we compare the automated generation of data-flow specifications by StubDroid [5] with our solution. While StubDroid produces generic, method-level summary rules, DAInfer+ infers high-level memory primitives, specifically write, read, insert, and remove, for each API method.
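As an illustration, these four primitives can drive taint transfers as in the following sketch. The state encoding (a set of tainted names) and the transfer rules are our own illustrative assumptions, not the paper's implementation:

```python
# A minimal sketch of interpreting the four memory primitives as gen/kill
# transfers over a taint state: write/insert/read generate new taint facts,
# while remove performs a strong update that kills a fact.

def apply_primitive(taint, primitive, src=None, dst=None):
    """Return the taint set after one API call summarized by a primitive."""
    taint = set(taint)
    if primitive in ("write", "insert") and src in taint:
        taint.add(dst)      # gen: taint flows from src into dst
    elif primitive == "read" and src in taint:
        taint.add(dst)      # gen: reading a tainted container taints dst
    elif primitive == "remove":
        taint.discard(dst)  # kill: strong update invalidates the fact
    return taint

# e.g. list.add(x); v = list.get(0); list.clear() with x tainted:
state = {"x"}
state = apply_primitive(state, "insert", src="x", dst="list")
state = apply_primitive(state, "read", src="list", dst="v")
state = apply_primitive(state, "remove", dst="list")
```

The final state retains the taint on x and v but not on the cleared list, illustrating how a remove primitive lets the analysis kill stale data-flow facts instead of over-approximating.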
By translating these primitives into data-flow gen and kill operations, DAInfer+ provides the semantic context necessary for precision in collection-heavy applications. Specifically, while StubDroid often leads to over-approximation by failing to identify when a taint should be invalidated, DAInfer+ utilizes the remove primitive to trigger strong updates (kills). This allows the engine to remove specific data-flow facts from the analysis state, thereby significantly reducing false positives. In addition, StubDroid fails to extract the data-flow specifications for abstract interfaces such as java.util.List. Our assessments reveal that StubDroid achieves 46% recall and 51% precision in extracting the data-flow specifications for the classes selected in § 7.2. Comparatively, DAInfer+ with SBERT-MPNet achieves a recall of 82% and a precision of 88%. Furthermore, StubDroid requires an average of 394.84 seconds to generate a specification for a single class (ranging from a minimum of 0.36 to a maximum of 10,242.88 seconds). Conversely, DAInfer+ infers these specifications from documentation in just a few seconds, as shown in Table 2. Unlike StubDroid, which relies on heavyweight bytecode analysis to discover flows, DAInfer+ infers memory operation abstractions and then translates them directly into FlowDroid-compatible XML summaries in an efficient manner.

Comparison with Atlas. We run Atlas [7] upon all 167 classes and finish the inference in 74.48 minutes. Note that the output of Atlas is the library implementation derived from unit test executions. Automatically converting it into the specifications defined in Definition 2 requires static analysis techniques. Hence, we analyze the library implementation generated by Atlas with a field-sensitive pointer analysis, which matches the store-load operations upon the same fields, and eventually convert the output of Atlas to the API aliasing specifications defined in Definition 2.
For the classes labeled with ground truth in § 7.3, Atlas infers 546 specifications, 454 of which are correct, achieving 83.15% precision and 45.95% recall. After investigating the results, we find that Atlas fails to generate the specifications for 111 classes in the experimental subjects, such as android.os.Intent and android.os.Configuration. The root cause is that Atlas fails to infer the specifications when the creation of library function parameters is non-trivial, or when the unit test execution demands a specific environment, such as an Android emulator. In contrast, DAInfer+ can derive the API aliasing specifications for such classes. Also, the aliasing specifications generated by Atlas only depict the potential aliasing relations between parameters and return values, while they all miss the preconditions under which such aliasing relations hold. For example, Atlas only obtains that the return value of HashMap.get can be aliased with the second parameter of HashMap.put, missing the precondition over their first parameters. The restrictive templates used in the inference introduce this imprecision, which is also reported in the prior study [29].

Comparison with USpec. USpec is not open-sourced due to its commercial use [29]. To make the comparison, we asked the authors for the raw data of their evaluation. According to their results, USpec successfully obtains 124 API aliasing specifications upon 62 classes. Unfortunately, the precision of USpec only reaches 66.1% (82/124). For instance, USpec generates the incorrect aliasing specification (HashMap.put, HashMap.get, {(0, 1)}, 0) for the class java.util.HashMap. The root cause is that USpec infers possible aliasing relations according to the usage events, while the keys and values of HashMap objects may have the same type, making the inference algorithm unable to distinguish them. However, DAInfer+ successfully infers the specification via neurosymbolic optimization.
We also quantify USpec's recall based on our labeled specifications in § 7.3. It is shown that USpec misses 370 API aliasing specifications, so its recall in inferring API aliasing specifications is only 18.14%. The root cause of its low recall is that USpec can only generate the aliasing specifications for the APIs used in the applications' code.

Comparison with LLM-Alias. We compare DAInfer+ with LLM-Alias, which directly queries ChatGPT with the documentation. The response ChatGPT generates is a natural language sentence with an API aliasing specification. Due to the laborious effort involved, we only examine the inference results for the 60 classes randomly selected in § 7.3. The results show that LLM-Alias generates 801 API aliasing specifications for the examined classes, only 113 of which are correct, yielding a precision of 14.11% and a recall of 11.44%. Among the 688 incorrect specifications, 60 indicate the correct aliasing relations between parameters and return values but do not pose any restrictions on API parameters as preconditions. The results show that vanilla LLMs without special designs have poor performance in understanding the concept of the aliasing relation. In contrast, DAInfer+ achieves quite satisfactory precision and recall, which benefits from our insightful problem reduction and efficient neurosymbolic optimization.

Fig. 11. The results of taint analysis assisted with Atlas, USpec, and LLM-Alias (newly discovered taint flows per application).

Comparison upon Client Analyses. We also compare the effects of the baselines on client analyses with the same settings as in § 7.4. Specifically, Atlas introduces a 43.26% increase in the alias set sizes on average, which is lower than the one introduced by DAInfer+. USpec and LLM-Alias introduce 14.52% and 12.17% increases in the alias set sizes on average, respectively.
Although LLM-Alias infers slightly more API aliasing specifications than USpec, the specifications inferred by USpec contribute more to the aliasing facts, which might be caused by more frequent usage of the involved library APIs in the application code. DAInfer+ introduces the highest average increase ratio in alias sets among the different approaches. Similarly, we find that Atlas, USpec, and LLM-Alias discover fewer taint flows than DAInfer+, as shown by Figure 11. Specifically, DAInfer+ newly discovers 85 taint flows, while Atlas, USpec, and LLM-Alias detect 60, 29, and 35 taint flows in total, respectively. Therefore, DAInfer+ has overwhelming superiority over the existing techniques in assisting client analyses, including alias analysis and taint analysis.

7.6 Limitations and Future Work

Our approach has several drawbacks that demand further improvements. First, DAInfer+ cannot determine whether an API creates a new object. When the developers create any new objects, our inferred specifications can only depict data-flow facts instead of aliasing relations. For example, DAInfer+ infers an API aliasing specification for java.util.Map that the return value of Map.computeIfPresent can be aliased with the second parameter of Map.put when their first parameters are aliased. This is a wrong specification, as computeIfPresent returns a null value or a newly computed value instead of any existing value stored in the fields. Second, the semantic unit consistency requires two strings to be equal. In our evaluation, however, we notice that several semantic units are not the same strings while they indicate the same concept in several rare cases. For example, the first parameters of SparseArray.set and SparseArray.valueAt in the class android.util.SparseArray are key and index, respectively. The two different strings are actually indicators of the same semantic concept.
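A deliberately crude surface-level matcher makes the mismatch concrete. The bigram cosine similarity below is a toy stand-in for a real embedding model, and the 0.5 cutoff is a hypothetical threshold of ours; under any such string-level measure, key and index share nothing, even though they denote the same concept:

```java
import java.util.HashMap;
import java.util.Map;

public class SemanticUnitMatch {
    // Character-bigram counts: a toy proxy for an embedding vector.
    static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < s.length(); i++)
            counts.merge(s.substring(i, i + 2), 1, Integer::sum);
        return counts;
    }

    // Cosine similarity over the bigram vectors of two identifiers.
    static double similarity(String a, String b) {
        Map<String, Integer> va = bigrams(a), vb = bigrams(b);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : va.entrySet())
            dot += e.getValue() * vb.getOrDefault(e.getKey(), 0);
        for (int c : va.values()) normA += (double) c * c;
        for (int c : vb.values()) normB += (double) c * c;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double threshold = 0.5;  // hypothetical consistency cutoff
        // Surface similarity accepts near-identical names...
        System.out.println(similarity("key", "keys") > threshold);   // prints true
        // ...but rejects synonymous semantic units: a false negative.
        System.out.println(similarity("key", "index") > threshold);  // prints false
    }
}
```

Near-identical spellings pass the cutoff while true synonyms fail, which is exactly the false-negative pattern discussed here.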
Hence, DAInfer+ cannot infer the correct specification for the two APIs. Although there are several traditional ways to extract synonyms for natural languages, such as WordNet [51] and word embedding [81], they may fail to identify similar semantic units in programming languages; for example, the similarity between key and index is measured to be even smaller than 0.1 by WordNet. Even if we utilize several code models, such as code2vec [2] and CodeBERT [33], they can still yield a false negative/positive when the similarity of the names in a correct/wrong specification is below/above the preset threshold.

Furthermore, while relying on embedding models to infer memory operations and data-flow specifications enhances efficiency, these models are susceptible to false positives when API documentation is incomplete. For instance, developers often use cross-references, such as "Please use writeBundle(Bundle) instead", rather than providing redundant descriptions for similar methods, or omit return descriptions for state-updating API methods like java.io.IntBuffer.put(string). To mitigate this, we refined the description analysis by manually augmenting the missing contextual information. Additionally, certain verbs such as "copy" initially yielded low similarity scores for their corresponding memory operations. The dependency parser also occasionally struggled with complex structures. Specifically, it erroneously classified "Set the point's x and also the y coordinates" as a compound sentence, leading to misleading results.

To further improve DAInfer+, we can explore several directions in the future. First, we can leverage domain-specific LLMs or embedding models for code, which allow local deployment, to validate the semantic unit consistency. If the inference of general-purpose LLMs, such as GPT-4, becomes much more efficient and cheaper in the future, we can also prompt them directly without introducing significant overhead.
The above models can hopefully support us in identifying the semantic units indicating the same concept, even if they are not the same string. Second, DAInfer+ requires a manually specified parser for documentation. Since LLM latency remains a factor, developing more robust dependency parsers tailored for technical documentation will be critical for accurate verb identification and relationship mapping. Third, domain-specific fine-tuning of the embedding models would enhance the identification of memory-manipulation verbs, bridging the semantic gap between general-purpose language and API specifications. These general models are trained on broad corpora like Wikipedia, causing a semantic gap when interpreting specialized API documentation. For instance, a general model may not distinguish the subtle operational difference between "Transmitting an array" and "Copying an array", yet these imply fundamentally different data-flow facts. Likewise, pre-trained models like CodeBERT [33] offer bimodal understanding of code and text; however, they often struggle to distinguish specific memory actions in a zero-shot setting. Targeted training on technical corpora would allow embedding models to capture the precise memory semantics that general-purpose embeddings currently fail to resolve.

8 RELATED WORK

Library Specification Inference. The inference of library function specifications has always been a central topic in program analysis. Typically, IFDS/IDE-based approaches summarize the data-flow facts of libraries as their semantic abstractions [5, 61], which can be reused across various clients of data-flow analysis. Established upon a symbolic memory model, shape analysis computes the memory state for each statement of a library function as invariants and derives the preconditions/postconditions of each library function as its specification [15, 42, 62].
While the inferred specification accurately depicts the semantics of the library function, the analysis suffers from scalability problems, especially in the presence of complex program structures [16]. To mitigate these limitations, mining-based approaches leverage the program facts derived from applications to infer specific forms of specifications, e.g., points-to [9], aliasing [29], taint [20], and commutativity specifications [35], which support specific static analysis clients, e.g., taint analysis [6] and Andersen-style pointer analysis [32]. Another mining-based approach, AutoISES, automatically infers security specifications from high-quality application code and then guides the detection of security policy violations [65]. Recent advancements, such as the CSS framework [46], extend these capabilities by generating caller-sensitive specifications for native code via iterative static analysis. Other efforts, including ModelGen [23] and Spectre [47], utilize dynamic analysis to identify data-flow and alias specifications at runtime; however, these techniques remain inherently constrained by input dependency and incomplete code coverage.

Our work concentrates on data-flow and aliasing specification inference, which shares the same motivation as the existing studies [9, 29] while introducing a novel paradigm. Rather than relying on elusive code artifacts or limited execution traces, we leverage natural language documentation to unlock broader applicability. By employing embedding models, our method captures the latent data-flow intent within API descriptions. By synthesizing these semantic insights with named entity and type information, our framework employs optimization techniques to retrieve precise alias specifications independently of code analysis, thereby bypassing the visibility and scalability issues that hinder state-of-the-art static and dynamic analyzers.

Natural Language Specification Understanding.
Natural language specifications, such as comments and documentation, are widely utilized in various software engineering tasks, including test case generation [11, 52, 82], bug detection [60, 66, 82, 84], and code search [57, 76]. Typically, C2S [82] employs semantic parsing to derive formal specifications from comments, which aids in test case generation and taint bug detection. Similarly, Jdoctor [11] and Swami [52] translate natural language specifications to formal ones to facilitate the generation of test cases covering exceptional behavior and boundary conditions, while they only focus on specific patterns, such as exceptions and numeric relations. In the realm of specification mining, SuSi [58] introduced supervised machine learning to classify sources and sinks for information flow analysis. However, it relies heavily on manually engineered features, including both syntactic and semantic patterns in the API methods and their descriptions. Doc2Spec utilizes keywords, such as nouns and verbs indicating resource names and actions, respectively, to infer resource specifications, which promote resource misuse detection [84]. PreMA [76] extends this context to verb phrases to enhance the precision of the detection of similar APIs. More closely related to our work, DocFlow [66] uses contrastive learning to map resource names to sensitive categories (sources or sinks), and Fluffy [22] leverages a pre-trained embedding model (VarCLR [18]) to validate taint flows based on API naming conventions. Despite this progress, these solutions typically require intensive supervised training and suffer from high false-positive rates in the absence of large, annotated datasets [41].
Although DAInfer+ shares similarities with existing works [57, 84] in terms of technical choices, such as named-entity recognition [21], our effort explores a new paradigm of deriving data-flow facts and aliasing relations from documentation, which can be generalized to other static analysis problems. A key innovation of our framework is employing a zero-shot embedding model to infer data-flow relations. Unlike traditional specification mining solutions [22, 58, 66], our approach eliminates the need for large-scale manual labeling, offering a lightweight, scalable solution that remains effective in dynamic environments. Furthermore, by utilizing specialized embedding models rather than generative LLMs, we reduce the computational overhead while maintaining robust semantic inference.

Large Language Models. Large Language Models (LLMs) [54, 56], based on the decoder-only transformer architecture [68], are typically pre-trained on massive text corpora containing trillions of tokens. They exhibit exceptional zero/few-shot performance in a wide range of highly specific downstream tasks, including complex text generation [17], interactive decision making/planning [78, 85], and tool utilization [79]. Among various downstream tasks, reasoning has traditionally been regarded as a typical challenge for LLMs [24], which has attracted significant research interest. Specifically, there has been a line of literature exploring the use of LLMs in automated theorem proving within formal logic. Pioneering studies [43, 44, 75] have focused on employing LLMs to generate proofs for theorems expressed in formal logic. Several recent efforts aimed to integrate advanced LLMs that have demonstrated impressive zero/few-shot performance in code completion tasks into formal logic reasoning tasks [48, 74].
Inspired by these advancements, our previous research [69] leveraged LLMs to interpret memory operation kinds, a sub-problem addressed within our current approach. Modern LLMs, particularly those fine-tuned on code such as DeepSeek-Coder [36] and Qwen2.5-Coder [39], have shown impressive zero-shot reasoning in code completion and formal logic tasks. However, for rigorous static analysis, directly applying generative LLMs presents significant risks of hallucination and computational latency at scale. While LLMs offer powerful reasoning, our framework strategically balances their use with embedding models to optimize for both analytical depth and practical efficiency.

9 CONCLUSION

We proposed a new approach, DAInfer+, to infer API aliasing specifications from documentation. DAInfer+ adopts tagging and NLP models to interpret informal semantic information in documentation. It reduces the inference problem to an optimization problem that can be efficiently solved by our neurosymbolic optimization algorithm. The inferred specifications are further fed to static analysis clients for analyzing the applications using libraries. Our evaluation demonstrates that DAInfer+ achieves high precision and recall with significant gains in efficiency, particularly when leveraging embedding models for semantic interpretation. Furthermore, the results highlight the practical impact of our approach in enhancing library-aware pointer analysis and taint analysis. By bridging the gap between informal documentation and formal analysis, DAInfer+ provides a robust and scalable solution for understanding library semantics.

REFERENCES

[1] Waseem Akram, Yanjie Jiang, Yuxia Zhang, Haris Ali Khan, and Hui Liu. 2025. LLM-Based Method Name Suggestion with Automatically Generated Context-Rich Prompts. 2, FSE, Article FSE036 (June 2025), 22 pages. https://doi.org/10.1145/3715753
[2] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019.
code2vec: learning distributed representations of code. Proc. ACM Program. Lang. 3, POPL, Article 40 (Jan. 2019), 29 pages. https://doi.org/10.1145/3290353
[3] Lars Ole Andersen. 1994. Program analysis and specialization for the C programming language. (1994).
[4] Anastasios Antoniadis, Nikos Filippakis, Paddy Krishnan, Raghavendra Ramesh, Nicholas Allen, and Yannis Smaragdakis. 2020. Static analysis of Java enterprise applications: frameworks and caches, the elephants in the room. In Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020, Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 794–807. https://doi.org/10.1145/3385412.3386026
[5] Steven Arzt and Eric Bodden. 2016. StubDroid: automatic inference of precise data-flow summaries for the android framework. In Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016, Laura K. Dillon, Willem Visser, and Laurie A. Williams (Eds.). ACM, 725–735. https://doi.org/10.1145/2884781.2884816
[6] Steven Arzt, Siegfried Rasthofer, Christian Fritz, Eric Bodden, Alexandre Bartel, Jacques Klein, Yves Le Traon, Damien Octeau, and Patrick D. McDaniel. 2014. FlowDroid: precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, Edinburgh, United Kingdom - June 09-11, 2014, Michael F. P. O'Boyle and Keshav Pingali (Eds.). ACM, 259–269. https://doi.org/10.1145/2594291.2594299
[7] ATLAS. 2023. Source code of ATLAS. https://github.com/obastani/atlas. [Online; accessed 13-Sept-2023].
[8] Brian Backman. 2004. Building Sentence Skills: Tools for Writing the Amazing English Sentence. Teacher Created Resources.
[9] Osbert Bastani, Rahul Sharma, Alex Aiken, and Percy Liang. 2018. Active learning of points-to specifications.
In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, Philadelphia, PA, USA, June 18-22, 2018, Jeffrey S. Foster and Dan Grossman (Eds.). ACM, 678–692. https://doi.org/10.1145/3192366.3192383
[10] Nikolaj S. Bjørner, Anh-Dung Phan, and Lars Fleckenstein. 2015. νZ - An Optimizing SMT Solver. In Tools and Algorithms for the Construction and Analysis of Systems - 21st International Conference, TACAS 2015, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2015, London, UK, April 11-18, 2015. Proceedings (Lecture Notes in Computer Science, Vol. 9035), Christel Baier and Cesare Tinelli (Eds.). Springer, 194–199. https://doi.org/10.1007/978-3-662-46681-0_14
[11] Arianna Blasi, Alberto Goffi, Konstantin Kuznetsov, Alessandra Gorla, Michael D. Ernst, Mauro Pezzè, and Sergio Delgado Castellanos. 2018. Translating code comments to procedure specifications. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2018, Amsterdam, The Netherlands, July 16-21, 2018, Frank Tip and Eric Bodden (Eds.). ACM, 242–253. https://doi.org/10.1145/3213846.3213872
[12] Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2025. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering (Ottawa, Ontario, Canada) (ICSE '25). IEEE Press, 2188–2200. https://doi.org/10.1109/ICSE55347.2025.00157
[13] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M.
Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
[14] Simon Butler, Michel Wermelinger, and Yijun Yu. 2015. A survey of the forms of Java reference names. In Proceedings of the 2015 IEEE 23rd International Conference on Program Comprehension, ICPC 2015, Florence/Firenze, Italy, May 16-24, 2015, Andrea De Lucia, Christian Bird, and Rocco Oliveto (Eds.). IEEE Computer Society, 196–206. https://doi.org/10.1109/ICPC.2015.30
[15] Cristiano Calcagno, Dino Distefano, Peter W. O'Hearn, and Hongseok Yang. 2011. Compositional Shape Analysis by Means of Bi-Abduction. J. ACM 58, 6 (2011), 26:1–26:66. https://doi.org/10.1145/2049697.2049700
[16] Bor-Yuh Evan Chang, Cezara Dragoi, Roman Manevich, Noam Rinetzky, and Xavier Rival. 2020. Shape Analysis. Found. Trends Program. Lang. 6, 1-2 (2020), 1–158.
https://doi.org/10.1561/2500000037
[17] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. CoRR abs/2107.03374 (2021). arXiv:2107.03374 https://arxiv.org/abs/2107.03374
[18] Qibin Chen, Jeremy Lacomis, Edward J. Schwartz, Graham Neubig, Bogdan Vasilescu, and Claire Le Goues. 2022. VarCLR: variable semantic representation pre-training via contrastive learning (ICSE '22). Association for Computing Machinery, New York, NY, USA, 2327–2339. https://doi.org/10.1145/3510003.3510162
[19] Yujia Chen, Cuiyun Gao, Muyijie Zhu, Qing Liao, Yong Wang, and Guoai Xu. 2024. APIGen: Generative API Method Recommendation. In 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). 171–182. https://doi.org/10.1109/SANER60148.2024.00025
[20] Victor Chibotaru, Benjamin Bichsel, Veselin Raychev, and Martin T. Vechev. 2019. Scalable taint specification inference with big code.
In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, Phoenix, AZ, USA, June 22-26, 2019, Kathryn S. McKinley and Kathleen Fisher (Eds.). ACM, 760–774. https://doi.org/10.1145/3314221.3314648
[21] Nancy Chinchor. 1998. Appendix E: MUC-7 Named Entity Task Definition (version 3.5). In Seventh Message Understanding Conference: Proceedings of a Conference Held in Fairfax, Virginia, USA, MUC 1998, April 29 - May 1, 1998. ACL. https://aclanthology.org/M98-1028/
[22] Yiu Wai Chow, Max Schäfer, and Michael Pradel. 2023. Beware of the Unexpected: Bimodal Taint Analysis. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (Seattle, WA, USA) (ISSTA 2023). Association for Computing Machinery, New York, NY, USA, 211–222. https://doi.org/10.1145/3597926.3598050
[23] Lazaro Clapp, Saswat Anand, and Alex Aiken. 2015. ModelGen: mining explicit information flow specifications from concrete executions (ISSTA 2015). Association for Computing Machinery, New York, NY, USA, 129–140. https://doi.org/10.1145/2771783.2771810
[24] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems. CoRR abs/2110.14168 (2021). arXiv:2110.14168 https://arxiv.org/abs/2110.14168
[25] DAInferPlus. 2026. Implementation of DAInfer. https://github.com/maryammsd/DAInfer. [Online; accessed 14-Feb-2026].
[26] Leonardo Mendonça de Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Tools and Algorithms for the Construction and Analysis of Systems, 14th International Conference, TACAS 2008, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2008, Budapest, Hungary, March 29-April 6, 2008. Proceedings (Lecture Notes in Computer Science, Vol.
4963), C. R. Ramakrishnan and Jakob Rehof (Eds.). Springer, 337–340. https://doi.org/10.1007/978-3-540-78800-3_24
[27] DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jin Chen, Jingyang Yuan, Junjie Qiu, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qihao Zhu, Qinyu Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge, Ruizhe Pan, Runxin Xu, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Size Zheng, T. Wang, Tian Pei, Tian Yuan, Tianyu Sun, W. L. Xiao, Wangding Zeng, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wentao Zhang, X. Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Liu, Xin Xie, Xingkai Yu, Xinnan Song, Xinyi Zhou, Xinyu Yang, Xuan Lu, Xuecheng Su, Y. Wu, Y. K. Li, Y. X. Wei, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Li, Yaohui Wang, Yi Zheng, Yichao Zhang, Yiliang Xiong, Yilong Zhao, Ying He, Ying Tang, Yishi Piao, Yixin Dong, Yixuan Tan, Yiyuan Liu, Yongji Wang, Yongqiang Guo, Yuchen Zhu, Yuduan Wang, Yuheng Zou, Yukun Zha, Yunxian Ma, Yuting Yan, Yuxiang You, Yuxuan Liu, Z. Z.
Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhen Huang, Zhen Zhang, Zhenda Xie, Zhewen Hao, Zhihong Shao, Zhiniu Wen, Zhipeng Xu, Zhongyu Zhang, Zhuoshu Li, Zihan Wang, Zihui Gu, Zilin Li, and Ziwei Xie. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434 [cs.CL] https://arxiv.org/abs/2405.04434
[28] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[29] Jan Eberhardt, Samuel Steffen, Veselin Raychev, and Martin T. Vechev. 2019. Unsupervised learning of API alias specifications. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, Phoenix, AZ, USA, June 22-26, 2019, Kathryn S. McKinley and Kathleen Fisher (Eds.). ACM, 745–759. https://doi.org/10.1145/3314221.3314640
[30] F-Droid. 2023. F-Droid. https://f-droid.org/. [Online; accessed 1-Sept-2023].
[31] Chongzhou Fang, Ning Miao, Shaurya Srivastav, Jialin Liu, Ruoyu Zhang, Ruijie Fang, Asmita, Ryan Tsang, Najmeh Nazari, Han Wang, and Houman Homayoun. 2024. Large Language Models for Code Analysis: Do LLMs Really Do Their Job?. In 33rd USENIX Security Symposium (USENIX Security 24). USENIX Association, Philadelphia, PA, 829–846. https://www.usenix.org/conference/usenixsecurity24/presentation/fang
[32] Pratik Fegade and Christian Wimmer. 2020. Scalable pointer analysis of data structures using semantic models.
In CC '20: 29th International Conference on Compiler Construction, San Diego, CA, USA, February 22-23, 2020, Louis-Noël Pouchet and Alexandra Jimborean (Eds.). ACM, 39–50. https://doi.org/10.1145/3377555.3377885
[33] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 1536–1547. https://doi.org/10.18653/v1/2020.findings-emnlp.139
[34] W. Nelson Francis and Henry Kucera. 1967. Computational analysis of present-day American English. Providence, RI: Brown University Press.
[35] Timon Gehr, Dimitar K. Dimitrov, and Martin T. Vechev. 2015. Learning Commutativity Specifications. In Computer Aided Verification - 27th International Conference, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 9206), Daniel Kroening and Corina S. Pasareanu (Eds.). Springer, 307–323. https://doi.org/10.1007/978-3-319-21690-4_18
[36] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence. https://arxiv.org/abs/2401.14196
[37] Jinyao Guo*, Chengpeng Wang*, Xiangzhe Xu, Zian Su, and Xiangyu Zhang. 2025. RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing. In Proceedings of the 42nd International Conference on Machine Learning. *Equal contribution.
[38] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. (2020). https://doi.org/10.5281/zenodo.1212303
[39] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report. arXiv:2409.12186 [cs.CL] https://arxiv.org/abs/2409.12186
[40] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2020. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv:1909.09436 [cs.LG] https://arxiv.org/abs/1909.09436
[41] Hiroki Inayoshi, Shoichi Saito, and Akito Monden. 2025. Evaluating Taint Specification Generators for Identifying Taint Sources in Relation to Data Safety Section. In 2025 IEEE/ACM 12th International Conference on Mobile Software Engineering and Systems (MOBILESoft). 44–54. https://doi.org/10.1109/MOBILESoft66462.2025.00012
[42] Bertrand Jeannet, Alexey Loginov, Thomas W. Reps, and Mooly Sagiv. 2010. A relational approach to interprocedural shape analysis. ACM Trans. Program. Lang. Syst. 32, 2 (2010), 5:1–5:52. https://doi.org/10.1145/1667048.1667050
[43] Albert Qiaochu Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygózdz, Piotr Milos, Yuhuai Wu, and Mateja Jamnik. 2022. Thor: Wielding Hammers to Integrate Language Models and Automated Theorem Provers. In NeurIPS. http://papers.nips.cc/paper_files/paper/2022/hash/377c25312668e48f2e531e2f2c422483-Abstract-Conference.html
[44] Albert Qiaochu Jiang, Sean Welleck, Jin Peng Zhou, Timothée Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. 2023. Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs.
In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=SMa9EAovKMC
[45] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2025. A Survey on Large Language Models for Code Generation. ACM Trans. Softw. Eng. Methodol. (July 2025). https://doi.org/10.1145/3747588 Just Accepted.
[46] Shuangxiang Kan, Yuhao Gao, Zexin Zhong, and Yulei Sui. 2024. Cross-Language Taint Analysis: Generating Caller-Sensitive Native Code Specification for Java. IEEE Transactions on Software Engineering 50, 6 (2024), 1518–1533. https://doi.org/10.1109/TSE.2024.3392254
[47] Shuangxiang Kan, Yuekang Li, Weigang He, Zhenchang Xing, Liming Zhu, and Yulei Sui. 2025. Spectre: Automated Aliasing Specifications Generation for Library APIs with Fuzzing. ACM Trans. Softw. Eng. Methodol. (June 2025). https://doi.org/10.1145/3725811
[48] Guillaume Lample, Timothée Lacroix, Marie-Anne Lachaux, Aurélien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. 2022. HyperTree Proof Search for Neural Theorem Proving. In NeurIPS. http://papers.nips.cc/paper_files/paper/2022/hash/a8901c5e85fb8e1823bbf0f755053672-Abstract-Conference.html
[49] Puzhuo Liu, Chengnian Sun, Yaowen Zheng, Xuan Feng, Chuan Qin, Yuncheng Wang, Zhenyang Xu, Zhi Li, Peng Di, Yu Jiang, and Limin Sun. 2025. LLM-Powered Static Binary Taint Analysis. ACM Trans. Softw. Eng. Methodol. 34, 3, Article 83 (Feb. 2025), 36 pages. https://doi.org/10.1145/3711816
[50] Michael R. Lyu, Baishakhi Ray, Abhik Roychoudhury, Shin Hwei Tan, and Patanamon Thongtanunam. 2025. Automatic Programming: Large Language Models and Beyond. 34, 5, Article 140 (May 2025), 33 pages. https://doi.org/10.1145/3708519
[51] George A. Miller. 1995. WordNet: a lexical database for English. Commun. ACM 38, 11 (Nov. 1995), 39–41.
https://doi.org/10.1145/219717.219748
[52] Manish Motwani and Yuriy Brun. 2019. Automatically generating precise Oracles from structured natural language specifications. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019, Joanne M. Atlee, Tevfik Bultan, and Jon Whittle (Eds.). IEEE / ACM, 188–199. https://doi.org/10.1109/ICSE.2019.00035
[53] NLTK. 2023. Natural Language Toolkit. https://www.nltk.org/index.html. [Online; accessed 7-Sep-2023].
[54] OpenAI. 2022. Introducing ChatGPT. (2022). https://openai.com/blog/chatgpt
[55] OpenAI. 2023. GPT-3.5. https://platform.openai.com/docs/models/gpt-3-5. [Online; accessed 7-Sep-2023].
[56] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[57] Rahul Pandita, Xusheng Xiao, Hao Zhong, Tao Xie, Stephen Oney, and Amit M. Paradkar. 2012. Inferring method specifications from natural language API descriptions. In 34th International Conference on Software Engineering, ICSE 2012, June 2-9, 2012, Zurich, Switzerland, Martin Glinz, Gail C. Murphy, and Mauro Pezzè (Eds.). IEEE Computer Society, 815–825. https://doi.org/10.1109/ICSE.2012.6227137
[58] Siegfried Rasthofer, Steven Arzt, and Eric Bodden. 2014. A Machine-learning Approach for Classifying and Categorizing Android Sources and Sinks. In Proceedings of the 21st Network and Distributed System Security Symposium (NDSS), Vol. 14. 23–26.
[59] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv:1908.10084 [cs.CL] https://arxiv.org/abs/1908.10084
[60] Xiaoxue Ren, Xinyuan Ye, Zhenchang Xing, Xin Xia, Xiwei Xu, Liming Zhu, and Jianling Sun. 2020. API-Misuse Detection Driven by Fine-Grained API-Constraint Knowledge Graph. In 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020, Melbourne, Australia, September 21-25, 2020. IEEE, 461–472.
https://doi.org/10.1145/3324884.3416551
[61] Atanas Rountev, Mariana Sharp, and Guoqing Xu. 2008. IDE dataflow analysis in the presence of large object-oriented libraries. In International Conference on Compiler Construction. Springer, 53–68.
[62] Shmuel Sagiv, Thomas W. Reps, and Reinhard Wilhelm. 2002. Parametric shape analysis via 3-valued logic. ACM Trans. Program. Lang. Syst. 24, 3 (2002), 217–298. https://doi.org/10.1145/514188.514190
[63] Qingkai Shi, Xiao Xiao, Rongxin Wu, Jinguo Zhou, Gang Fan, and Charles Zhang. 2018. Pinpoint: fast and precise sparse value flow analysis for million lines of code. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, Philadelphia, PA, USA, June 18-22, 2018, Jeffrey S. Foster and Dan Grossman (Eds.). ACM, 693–706. https://doi.org/10.1145/3192366.3192418
[64] Johannes Späth, Lisa Nguyen Quang Do, Karim Ali, and Eric Bodden. 2016. Boomerang: Demand-Driven Flow- and Context-Sensitive Pointer Analysis for Java (Artifact). Dagstuhl Artifacts Ser. 2, 1 (2016), 12:1–12:2. https://doi.org/10.4230/DARTS.2.1.12
[65] Lin Tan, Xiaolan Zhang, Xiao Ma, Weiwei Xiong, and Yuanyuan Zhou. 2008. AutoISES: Automatically Inferring Security Specification and Detecting Violations. In Proceedings of the 17th USENIX Security Symposium, July 28-August 1, 2008, San Jose, CA, USA, Paul C. van Oorschot (Ed.). USENIX Association, 379–394. http://www.usenix.org/events/sec08/tech/full_papers/tan_l/tan_l.pdf
[66] Marcos Tileria, Jorge Blasco, and Santanu Kumar Dash. 2024. DocFlow: Extracting Taint Specifications from Software Documentation (ICSE '24). Association for Computing Machinery, New York, NY, USA, Article 61, 12 pages. https://doi.org/10.1145/3597503.3623312
[67] John Toman and Dan Grossman. 2017. Taming the Static Analysis Beast. In 2nd Summit on Advances in Programming Languages, SNAPL 2017, May 7-10, 2017, Asilomar, CA, USA (LIPIcs, Vol. 71), Benjamin S.
Lerner, Rastislav Bodík, and Shriram Krishnamurthi (Eds.). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 18:1–18:14. https://doi.org/10.4230/LIPIcs.SNAPL.2017.18
[68] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[69] Chengpeng Wang, Jipeng Zhang, Rongxin Wu, and Charles Zhang. 2024. DAInfer: Inferring API Aliasing Specifications from Library Documentation via Neurosymbolic Optimization. In The Proceedings of the ACM on Software Engineering, Vol. 1. https://doi.org/10.1145/3660816
[70] Chengpeng Wang, Wuqi Zhang, Zian Su, Xiangzhe Xu, Xiaoheng Xie, and Xiangyu Zhang. 2024. LLMDFA: analyzing dataflow in code with large language models. In Proceedings of the 38th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS '24). Curran Associates Inc., Red Hook, NY, USA, Article 4181, 30 pages.
[71] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Multilingual E5 Text Embeddings: A Technical Report. arXiv:2402.05672 [cs.CL] https://arxiv.org/abs/2402.05672
[72] Ying Wang, Ming Wen, Zhenwei Liu, Rongxin Wu, Rui Wang, Bo Yang, Hai Yu, Zhiliang Zhu, and Shing-Chi Cheung. 2018. Do the dependency conflicts in my project matter?. In Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, November 04-09, 2018, Gary T.
Leavens, Alessandro Garcia, and Corina S. Pasareanu (Eds.). ACM, 319–330. https://doi.org/10.1145/3236024.3236056
[73] Moshi Wei, Nima Shiri Harzevili, Yuchao Huang, Junjie Wang, and Song Wang. 2022. CLEAR: contrastive learning for API recommendation. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE '22). Association for Computing Machinery, New York, NY, USA, 376–387. https://doi.org/10.1145/3510003.3510159
[74] Sean Welleck, Jiacheng Liu, Ximing Lu, Hannaneh Hajishirzi, and Yejin Choi. 2022. NaturalProver: Grounded Mathematical Proof Generation with Language Models. In NeurIPS. http://papers.nips.cc/paper_files/paper/2022/hash/1fc548a8243ad06616eee731e0572927-Abstract-Conference.html
[75] Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. 2022. Autoformalization with Large Language Models. In NeurIPS. http://papers.nips.cc/paper_files/paper/2022/hash/d0c6bc641a56bebee9d985b937307367-Abstract-Conference.html
[76] Wenkai Xie, Xin Peng, Mingwei Liu, Christoph Treude, Zhenchang Xing, Xiaoxin Zhang, and Wenyun Zhao. 2020. API method recommendation via explicit matching of functionality verb phrases. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Virtual Event, USA) (ESEC/FSE 2020). Association for Computing Machinery, New York, NY, USA, 1015–1026. https://doi.org/10.1145/3368089.3409731
[77] Peisen Yao, Jinguo Zhou, Xiao Xiao, Qingkai Shi, Rongxin Wu, and Charles Zhang. 2021. Efficient Path-Sensitive Data-Dependence Analysis. CoRR abs/2109.07923 (2021). arXiv:2109.07923 https://arxiv.org/abs/2109.07923
[78] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. CoRR abs/2305.10601 (2023).
https://doi.org/10.48550/arXiv.2305.10601 arXiv:2305.10601
[79] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=WE_vluYUL-X
[80] Jane Yen, Tamás Lévai, Qinyuan Ye, Xiang Ren, Ramesh Govindan, and Barath Raghavan. 2021. Semi-automated protocol disambiguation and code generation. In Proceedings of the 2021 ACM SIGCOMM 2021 Conference (Virtual Event, USA) (SIGCOMM '21). Association for Computing Machinery, New York, NY, USA, 272–286. https://doi.org/10.1145/3452296.3472910
[81] Hamed Zamani and W. Bruce Croft. 2017. Relevance-based word embedding. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 505–514.
[82] Juan Zhai, Yu Shi, Minxue Pan, Guian Zhou, Yongxiang Liu, Chunrong Fang, Shiqing Ma, Lin Tan, and Xiangyu Zhang. 2020. C2S: translating natural language comments to formal program specifications. In ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, Prem Devanbu, Myra B. Cohen, and Thomas Zimmermann (Eds.). ACM, 25–37. https://doi.org/10.1145/3368089.3409716
[83] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (Vienna, Austria) (ISSTA 2024). Association for Computing Machinery, New York, NY, USA, 1592–1604. https://doi.org/10.1145/3650212.3680384
[84] Hao Zhong, Lu Zhang, Tao Xie, and Hong Mei. 2009. Inferring Resource Specifications from Natural Language API Documentation.
In ASE 2009, 24th IEEE/ACM International Conference on Automated Software Engineering, Auckland, New Zealand, November 16-20, 2009. IEEE Computer Society, 307–318. https://doi.org/10.1109/ASE.2009.94
[85] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. 2023. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=WZH7099tgfM