ATLAS: Automated Tree-based Language Analysis System for C and C++ source programs


Analyzing non-compilable C/C++ submodules without a resolved build environment remains a critical bottleneck for industrial software evolution. Traditional static analysis tools often fail in these scenarios due to their reliance on successful compilation, while Large Language Models (LLMs) lack the structural context necessary to reason about complex program logic. We introduce ATLAS, a Python-based CLI that generates unified multi-view representations for large-scale C/C++ projects with high accuracy, achieving success rates up to 96.80% for CFGs and 91.38% for DFGs. ATLAS is characterized by: (i) inter-procedural, type-aware analysis across function boundaries; (ii) support for both full and partial analysis of non-compilable projects; (iii) graph optimizations such as variable collapsing and node blacklisting; and (iv) synchronized multi-view graphs that align syntax, execution paths, and data-flow logic. Evaluating ATLAS with DeepSeek V3.2 for automated test generation demonstrates a 34.71% increase in line coverage and 32.66% in branch coverage, matching or exceeding the performance of the symbolic execution tool KLEE on complex projects. With polynomial scalability, ATLAS provides a robust infrastructure for generating the information-dense datasets required by next-generation, graph-aware ML4SE models.

Video demonstration: https://youtu.be/QGuJZhj9CTA
Tool GitHub repository: https://github.com/jaid-monwar/ATLAS-multi-view-code-representation-tool.git


💡 Research Summary

The paper presents ATLAS, a Python‑based command‑line tool that automatically extracts and aligns three fundamental program representations from large C/C++ codebases without requiring a fully resolved build environment: the Abstract Syntax Tree (AST), the Control‑Flow Graph (CFG), and the Data‑Flow Graph (DFG). The authors motivate the work by highlighting the industrial bottleneck of analyzing “non‑compilable” sub‑modules. Traditional static analysis tools (LLVM, GCC) fail there because they depend on successful compilation, while large language models (LLMs) struggle to capture deep structural relationships when fed only raw source text.

ATLAS’s pipeline consists of three stages. First, a preprocessor normalizes multi‑file projects, expands macros, and removes non‑semantic tokens. It then uses Tree‑sitter to produce concrete syntax trees (CSTs) and builds a rich symbol table that records scopes, types, and declaration‑use links. Second, the Combined Code View Generator creates the three views:

  • The AST driver traverses the CST, prunes irrelevant tokens, and offers two user‑configurable optimizations—Variable Collapsing (merging all occurrences of a variable into a single node) and Node Blacklisting (excluding selected node categories).
  • The CFG driver builds a statement‑level, inter‑procedural graph where each node corresponds to a source statement and edges represent all possible execution transitions, including function calls, indirect calls via function pointers, and recursion.
  • The DFG driver applies Reaching Definition Analysis (RDA) to construct a graph of variable definitions, uses, and modifications. It is type‑aware, handling pass‑by‑reference, pointer indirection, and class‑member state across constructors and methods.
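
To make the Reaching Definition Analysis step concrete, the following is a minimal sketch of the textbook worklist algorithm on a tiny statement-level CFG, followed by the derivation of definition-to-use edges. It is not ATLAS's implementation: the `STATEMENTS` and `SUCCESSORS` structures and both function names are hypothetical, and the sketch ignores the type-aware cases mentioned above (pass-by-reference, pointer indirection, class-member state).

```python
from collections import defaultdict

# Hypothetical statement-level CFG for the C fragment:
#   1: x = 1;
#   2: y = 2;
#   3: if (x < y)
#   4:     x = y;
#   5: z = x;
STATEMENTS = {
    1: {"defs": {"x"}, "uses": set()},
    2: {"defs": {"y"}, "uses": set()},
    3: {"defs": set(), "uses": {"x", "y"}},
    4: {"defs": {"x"}, "uses": {"y"}},
    5: {"defs": {"z"}, "uses": {"x"}},
}
SUCCESSORS = {1: [2], 2: [3], 3: [4, 5], 4: [5], 5: []}


def reaching_definitions(statements, successors):
    """For each node, return the (variable, defining_node) pairs that reach it."""
    preds = defaultdict(list)
    for node, succs in successors.items():
        for succ in succs:
            preds[succ].append(node)

    in_sets = {n: set() for n in statements}
    out_sets = {n: set() for n in statements}
    worklist = list(statements)
    while worklist:
        node = worklist.pop(0)
        # IN[node] is the union of OUT over all predecessors.
        in_sets[node] = set().union(*(out_sets[p] for p in preds[node]))
        defs = statements[node]["defs"]
        # KILL: incoming definitions of variables redefined here.
        survivors = {(v, d) for (v, d) in in_sets[node] if v not in defs}
        # GEN: the definitions this statement creates.
        new_out = survivors | {(v, node) for v in defs}
        if new_out != out_sets[node]:
            out_sets[node] = new_out
            worklist.extend(successors[node])
    return in_sets


def data_flow_edges(statements, successors):
    """Connect each reaching definition to the statements that use the variable."""
    in_sets = reaching_definitions(statements, successors)
    edges = set()
    for node, info in statements.items():
        for var, def_node in in_sets[node]:
            if var in info["uses"]:
                edges.add((def_node, node, var))
    return sorted(edges)


DFG_EDGES = data_flow_edges(STATEMENTS, SUCCESSORS)
print(DFG_EDGES)
```

In this example both definitions of `x` (statements 1 and 4) reach statement 5, because the branch at statement 3 may or may not execute the assignment, so the resulting graph contains two edges into statement 5 for `x`.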

All three graphs share a common set of node identifiers, which lets ATLAS merge them into a unified multi‑view graph. Edges are annotated with their type so that downstream machine‑learning models can learn syntactic, control‑flow, and data‑flow semantics simultaneously. The tool exports the graphs as JSON (for ML pipelines) and as DOT or PNG (for human inspection), with color‑coded edges to distinguish view types.
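
The idea of merging views over shared node identifiers can be sketched as follows. This is an illustration under stated assumptions, not ATLAS's actual output schema: the node IDs, the JSON field names (`id`, `label`, `source`, `target`, `type`), the edge colors, and the functions `merge_views` and `to_dot` are all hypothetical.

```python
import json

# Hypothetical views of `x = 1; if (x) y = x;`, all keyed by the same node IDs.
NODES = {
    "n1": {"label": "x = 1"},
    "n2": {"label": "if (x)"},
    "n3": {"label": "y = x"},
}
AST_EDGES = [("n2", "n3")]                # the if-statement contains its body
CFG_EDGES = [("n1", "n2"), ("n2", "n3")]  # possible execution order
DFG_EDGES = [("n1", "n2"), ("n1", "n3")]  # definition of x flows to its uses

EDGE_COLORS = {"AST": "black", "CFG": "red", "DFG": "blue"}  # arbitrary choice


def merge_views(nodes, **views):
    """Combine per-view edge lists into one graph whose edges carry a view type."""
    edges = []
    for view, view_edges in views.items():
        for src, dst in view_edges:
            if src not in nodes or dst not in nodes:
                raise KeyError(f"{view} edge ({src}, {dst}) uses an unknown node ID")
            edges.append({"source": src, "target": dst, "type": view})
    return {
        "nodes": [{"id": node_id, **attrs} for node_id, attrs in nodes.items()],
        "edges": edges,
    }


def to_dot(graph):
    """Render the merged graph as Graphviz DOT text with one color per view."""
    lines = ["digraph combined {"]
    for node in graph["nodes"]:
        lines.append(f'  {node["id"]} [label="{node["label"]}"];')
    for edge in graph["edges"]:
        color = EDGE_COLORS[edge["type"]]
        lines.append(f'  {edge["source"]} -> {edge["target"]} [color={color}];')
    lines.append("}")
    return "\n".join(lines)


merged = merge_views(NODES, AST=AST_EDGES, CFG=CFG_EDGES, DFG=DFG_EDGES)
json_text = json.dumps(merged, indent=2)  # form suited to an ML pipeline
dot_text = to_dot(merged)                 # form suited to human inspection
print(dot_text)
```

Because every view refers to the same IDs, a pair such as `n1` and `n2` can carry both a CFG edge and a DFG edge, and a consumer can filter by the `type` field to recover any single view.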

The authors evaluate ATLAS on the TheAlgorithms dataset (406 C and 360 C++ files). CFG generation succeeds on 96.80% of C files and 91.67% of C++ files; DFG generation succeeds on 91.38% and 90.56%, respectively. Failures stem mainly from constructs that are hard to resolve statically, such as goto statements, multithreading primitives, complex pointer arithmetic, operator overloading, and static class variables.

To demonstrate practical impact, the authors feed ATLAS‑generated CFGs to DeepSeek V3.2, a large language model, for automated test generation. They compare three settings: DeepSeek with CFG context, DeepSeek without CFG context, and the symbolic execution tool KLEE. Across ten large C programs, the CFG‑augmented setting improves average line coverage by 34.71% and branch coverage by 32.66% relative to the CFG‑free setting, and matches or exceeds KLEE’s coverage on most benchmarks. Moreover, the proportion of generated tests that exercise at least one unique execution path rises from 67% (no CFG) to 75% (with CFG), indicating reduced redundancy.
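
The summary does not say how a "unique execution path" is counted, so the sketch below shows only one plausible reading: walking the tests in generation order, a test counts if it exercises at least one path that no earlier test exercised. The function name `unique_path_ratio`, the representation of a path as a tuple of CFG node IDs, and the sample suite are all hypothetical, and the numbers are illustrative rather than taken from the paper.

```python
def unique_path_ratio(tests):
    """Fraction of tests that exercise a path not seen in any earlier test.

    `tests` is a list in generation order; each entry is the collection of
    execution paths (tuples of CFG node IDs) that the test exercises.
    """
    if not tests:
        return 0.0
    seen = set()
    contributing = 0
    for paths in tests:
        current = set(paths)
        if current - seen:  # at least one path is new
            contributing += 1
        seen |= current
    return contributing / len(tests)


# Hypothetical suite of five tests over a small CFG.
SUITE = [
    [(1, 2, 3)],             # new path
    [(1, 2, 3)],             # repeats the first test
    [(1, 2, 4)],             # new path
    [(1, 2, 4), (1, 5)],     # adds (1, 5)
    [(1, 5)],                # nothing new
]
print(unique_path_ratio(SUITE))
```

Under this reading the metric depends on test order; an order-independent alternative would count a test only if it exercises a path that no other test in the suite exercises.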

Scalability is assessed on 20 programs ranging from 61 to 309 lines of code. Median analysis time is 6.89 seconds and median peak memory usage is 135 MB. Both metrics grow polynomially with code size, which the authors take as evidence that ATLAS remains practical for typical benchmark‑scale projects.
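
One common way to check a claim of polynomial growth is to fit a straight line to the measurements on log-log axes: if cost behaves like c * n^k, the slope of log(cost) against log(n) estimates k. The sketch below illustrates that technique only. The summary does not describe how the authors established the trend, the function name `loglog_slope` is hypothetical, and the data are synthetic values generated from an exactly quadratic formula, not measurements from the paper.

```python
import math


def loglog_slope(sizes, costs):
    """Least-squares slope of log(cost) versus log(size), estimating the exponent k."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic data: cost = 0.002 * n^2, so the fitted exponent should be close to 2.
SIZES = [50, 100, 200, 400]
COSTS = [0.002 * n ** 2 for n in SIZES]

slope = loglog_slope(SIZES, COSTS)
print(round(slope, 6))
```

With real measurements the points will not fall exactly on a line, so the fitted slope is only an estimate and is most informative when the program sizes span a wide range.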

The paper concludes that ATLAS fills a critical gap by providing a fast, build‑free, multi‑view graph extraction pipeline for C/C++ code, thereby enabling the creation of dense, aligned datasets required by next‑generation graph‑aware ML4SE models. Future work includes extending support for advanced C++ idioms (templates, concepts, lambdas), incorporating automated memory‑safety analysis, and modeling concurrency constructs to further improve the completeness of the generated graphs.

