Lightweight Call-Graph Construction for Multilingual Software Analysis
Analysis of multilingual codebases is a topic of increasing importance. In prior work, we proposed the MLSA (MultiLingual Software Analysis) architecture, an approach to the lightweight analysis of multilingual codebases, and showed how it can be used to construct a single call graph from multilingual software with mutual calls. This paper addresses the challenge of constructing monolingual call graphs in a lightweight manner (consistent with the objective of MLSA) that nonetheless yields sufficient information for resolving language interoperability calls. A novel approach is proposed that leverages information from a compiler-generated AST to provide the necessary call-graph quality, while the program itself is written using an Island Grammar that parses the AST, which keeps it lightweight. Performance results are presented for a C/C++ implementation of the approach, PAIGE (Parsing AST using Island Grammar Call Graph Emitter), showing that despite its lightweight nature, it outperforms Doxygen, is robust to changes in the (Clang) AST, and is not restricted to C/C++.
💡 Research Summary
The paper addresses the growing need for static analysis tools that can handle multilingual codebases, building on the authors’ earlier MLSA (MultiLingual Software Analysis) architecture. The central contribution is PAIGE (Parsing AST using Island Grammar Call Graph Emitter), a lightweight filter that generates monolingual call graphs suitable for later multilingual stitching. PAIGE operates on a textual Abstract Syntax Tree (AST) produced by each language’s own tooling (Clang for C/C++, Python’s ast module, and SpiderMonkey for JavaScript) and uses an Island Grammar (IG) to extract only the nodes of interest (the “islands”, primarily function call expressions and their arguments) while ignoring the rest of the tree (the “water”).
Implementation details: Flex defines lexical tokens for key AST keywords (e.g., CallExpr, DeclRefExpr) and regular‑expression patterns for strings and identifiers. Bison then combines these tokens into a small context‑free grammar that distinguishes “islands” (CALL, ARGUMENT) from “water”. When a CALL token is seen, the following WORD token (the function name) is passed to a C++ routine that records the call; ARGUMENT tokens trigger recording of argument names, types, or literal values. The output is a CSV file containing, for each call, the callee name, the caller’s scope, the class (if a member function), and a detailed list of argument kinds (variable, string literal, numeric literal, subscript, member variable, nested call result, binary/unary expression, etc.).
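The island/water idea can be illustrated with a minimal sketch. This is not PAIGE's Flex/Bison implementation: it is a Python approximation that uses regular expressions in place of a generated lexer and parser, and it handles only a few argument kinds. The function names (`extract_calls`, `to_csv`), the CSV column names, and the sample AST text are assumptions made for illustration; the sample is loosely modeled on a Clang AST dump rather than copied from a real compiler run.

```python
import csv
import io
import re

# Illustrative text loosely modeled on a Clang AST dump (assumption:
# addresses and source ranges are shortened, not real compiler output).
SAMPLE_AST = """\
|-FunctionDecl 0x1 <t.c:1:1, col:20> col:5 used foo 'int (int)'
|-FunctionDecl 0x2 <line:3:1, line:6:1> line:3:5 main 'int ()'
| `-CompoundStmt 0x3 <col:12, line:6:1>
|   |-CallExpr 0x4 <line:4:3, col:9> 'int'
|   | |-ImplicitCastExpr 0x5 <col:3> 'int (*)(int)' <FunctionToPointerDecay>
|   | | `-DeclRefExpr 0x6 <col:3> 'int (int)' Function 0x1 'foo' 'int (int)'
|   | `-IntegerLiteral 0x7 <col:7> 'int' 42
|   `-CallExpr 0x8 <line:5:3, col:12> 'int'
|     |-ImplicitCastExpr 0x9 <col:3> 'int (*)(const char *)' <FunctionToPointerDecay>
|     | `-DeclRefExpr 0xa <col:3> 'int (const char *)' Function 0xb 'puts' 'int (const char *)'
|     `-ImplicitCastExpr 0xc <col:8> 'const char *' <ArrayToPointerDecay>
|       `-StringLiteral 0xd <col:8> 'char[3]' lvalue "hi"
"""

# Leading tree-drawing characters, then the node kind (e.g. CallExpr).
NODE = re.compile(r"^(?P<indent>[ |`-]*)(?P<kind>[A-Za-z]+)")
# Declared name: the word just before the trailing quoted type.
DECL_NAME = re.compile(r"(\w+) '[^']*'\s*$")
# Referenced declaration: its category, address, and quoted name.
DECL_REF = re.compile(r"(Function|Var|ParmVar) 0x[0-9a-f]+ '([^']+)'")
# Trailing double-quoted string literal.
STRING = re.compile(r'"(.*)"\s*$')


def extract_calls(ast_text):
    """Collect call records (islands); every other node kind is water."""
    calls = []       # all call records, in order of appearance
    open_calls = []  # stack of (depth, record) for calls still being read
    scope = None     # name of the enclosing function
    for line in ast_text.splitlines():
        node = NODE.match(line)
        if not node:
            continue
        depth, kind = len(node.group("indent")), node.group("kind")
        # A node at the same or shallower depth closes any deeper call.
        while open_calls and open_calls[-1][0] >= depth:
            open_calls.pop()
        if kind == "FunctionDecl":
            name = DECL_NAME.search(line)
            scope = name.group(1) if name else None
        elif kind == "CallExpr":
            record = {"callee": None, "scope": scope, "args": []}
            calls.append(record)
            open_calls.append((depth, record))
        elif open_calls:
            record = open_calls[-1][1]
            if kind == "DeclRefExpr":
                ref = DECL_REF.search(line)
                if ref and ref.group(1) == "Function" and record["callee"] is None:
                    record["callee"] = ref.group(2)
                elif ref:
                    record["args"].append(("variable", ref.group(2)))
            elif kind == "IntegerLiteral":
                record["args"].append(("numeric_literal", line.split()[-1]))
            elif kind == "StringLiteral":
                text = STRING.search(line)
                if text:
                    record["args"].append(("string_literal", text.group(1)))
    return calls


def to_csv(calls):
    """Serialize call records to CSV text (column names are hypothetical)."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["callee", "scope", "args"])
    for call in calls:
        args = ";".join(f"{kind}:{value}" for kind, value in call["args"])
        writer.writerow([call["callee"], call["scope"], args])
    return out.getvalue()
```

Calling `print(to_csv(extract_calls(SAMPLE_AST)))` on the sample yields one row per call, with `main` as the scope of both. Node kinds such as `CompoundStmt` and `ImplicitCastExpr` never appear in the patterns, so an AST format change that touches only those nodes would leave this scanner unaffected, which is the robustness argument made for the island-grammar approach.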
Key technical advantages:
- AST‑driven richness – The AST already encodes type information, class hierarchies, and implicit “this” pointers, allowing PAIGE to resolve dynamic dispatch and pointer dereferences without expensive type‑inference algorithms.
- Island Grammar efficiency – By scanning only for a handful of keywords, PAIGE processes large ASTs quickly and with low memory overhead; the IG is also robust to changes in the underlying AST format because it touches only a small, well‑defined subset.
- Modularity and language independence – The same IG pattern can be retargeted to other languages by swapping the AST source and adjusting the token list, demonstrated with Python and JavaScript.
- Tree‑structured output – Rather than a generic graph, PAIGE emits a call tree where each distinct argument list creates a separate branch, and recursive calls appear as leaf nodes. This simplifies visual inspection and downstream analysis.
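The tree-structured output described in the last point can be sketched as follows. This is an illustrative Python reconstruction under stated assumptions, not PAIGE's actual emitter: the `(caller, callee, args)` tuple layout and the names `build_call_tree` and `render` are hypothetical. Each distinct argument list becomes its own branch, and a call to a function already on the current path is kept as a leaf instead of being expanded again.

```python
from collections import defaultdict


def build_call_tree(calls, root="main"):
    """Build a call tree from (caller, callee, args) tuples.

    Each distinct (callee, args) pair under a caller becomes a separate
    branch; a callee already on the current path is left as a leaf.
    """
    by_caller = defaultdict(list)
    for caller, callee, args in calls:
        edge = (callee, args)
        if edge not in by_caller[caller]:  # identical calls share a branch
            by_caller[caller].append(edge)

    def expand(name, args, path):
        node = {"name": name, "args": args, "children": []}
        if name in path:
            return node  # recursive call: stop here so the tree stays finite
        for callee, callee_args in by_caller.get(name, []):
            node["children"].append(expand(callee, callee_args, path | {name}))
        return node

    return expand(root, (), frozenset())


def render(node, depth=0):
    """Return the tree as a list of indented text lines."""
    lines = ["  " * depth + f"{node['name']}({', '.join(node['args'])})"]
    for child in node["children"]:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical call records: main calls fact twice with different
# arguments (plus one duplicate), and fact calls itself.
CALLS = [
    ("main", "fact", ("5",)),
    ("main", "fact", ("n",)),
    ("main", "fact", ("5",)),
    ("fact", "fact", ("n - 1",)),
]

if __name__ == "__main__":
    print("\n".join(render(build_call_tree(CALLS))))
```

In this example `fact(5)` and `fact(n)` appear as separate branches under `main`, the duplicate `fact(5)` call is merged, and the recursive `fact(n - 1)` call ends each branch as a leaf.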
Experimental evaluation compared PAIGE against Doxygen on 100 C/C++ programs. PAIGE was on average 30 % faster in total parsing time and produced more complete call information, especially for constructs that Doxygen typically omits (pointer dereferences, member functions, cross‑file definitions, and implicit “this” usage). The generated CSVs were later fed into the MLSA pipeline, where Graphviz rendered DOT files illustrating both simple and complex call scenarios. The authors also report successful adaptation of the IG to Python and JavaScript ASTs, confirming the approach’s language‑agnostic promise.
In summary, PAIGE fulfills the MLSA design goals of being lightweight, modular, and static‑analysis‑centric while delivering high‑quality call graphs that capture the nuances of modern C/C++ (and other) code. Its reliance on compiler‑generated ASTs and Island Grammars yields a tool that is both fast and resilient to language evolution, making it a practical foundation for further multilingual software analyses such as security checks, quality metrics, or inter‑language interoperability studies. Future work includes extending language coverage, integrating dynamic profiling data, and applying the call‑graph data to concrete software‑engineering tasks.