Discovering 100+ Compiler Defects in 72 Hours via LLM-Driven Semantic Logic Recomposition

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Compilers constitute the foundational root-of-trust in software supply chains; however, their immense complexity inevitably conceals critical defects. Recent research has attempted to leverage historical bugs to design new mutation operators or fine-tune models to increase program diversity for compiler fuzzing. We observe, however, that bugs manifest primarily based on the semantics of input programs rather than their syntax. Unfortunately, current approaches, whether relying on syntactic mutation or general Large Language Model (LLM) fine-tuning, struggle to preserve the specific semantics found in the logic of bug-triggering programs. Consequently, these critical semantic triggers are often lost, which limits the diversity of generated programs. To explicitly reuse such semantics, we propose FeatureFuzz, a compiler fuzzer that combines features to generate programs. We define a feature as a decoupled primitive that encapsulates a natural language description of a bug-prone invariant, such as an out-of-bounds array access, alongside a concrete code witness of its realization. FeatureFuzz operates via a three-stage workflow: it first extracts features from historical bug reports, then synthesizes coherent groups of features, and finally instantiates these groups into valid programs for compiler fuzzing. We evaluated FeatureFuzz on GCC and LLVM. Over 24-hour campaigns, FeatureFuzz uncovered 167 unique crashes, which is 2.78x more than the second-best fuzzer. Furthermore, through a 72-hour fuzzing campaign, FeatureFuzz identified 113 bugs in GCC and LLVM, 97 of which have already been confirmed by compiler developers, validating the approach’s ability to stress-test modern compilers effectively.

💡 Research Summary

Compilers are the cornerstone of modern software supply chains, yet their massive codebases hide subtle bugs that often stem from the semantics of input programs rather than their surface syntax. Existing fuzzing approaches, whether grammar‑based generators, mutation‑based fuzzers, or large language model (LLM) generators, focus primarily on syntactic diversity. Consequently, they struggle to preserve the delicate semantic invariants (e.g., flow‑sensitive index bounds, intertwined control‑data dependencies) required to trigger deep compiler defects. This paper introduces FeatureFuzz, a novel compiler fuzzer that explicitly captures and recombines these semantic triggers using LLMs.

A feature is defined as a dual primitive: (1) a natural‑language description of a bug‑prone invariant (e.g., “the array index exceeds the array length”) and (2) a minimal code witness that implements the invariant. By decoupling the semantic description from concrete syntax, features become reusable building blocks that can be mixed across unrelated bug reports. FeatureFuzz operates through a three‑stage LLM pipeline:

Extraction (ExtractLLM) – Leveraging a pre‑trained LLM with a carefully crafted prompt, the system parses historical bug reports, PoC programs, and commit diffs (sourced from GCC Bugzilla, LLVM repositories, etc.) to distill a pool of features. The output is a structured set of (description, code) pairs, each representing a high‑level logical condition known to cause a compiler failure.
Group Synthesis (GroupLLM) – From the global feature pool, a subset of features is sampled. A fine‑tuned LLM then synthesizes “glue” features that reconcile variable names, control flow, and data dependencies, producing a coherent feature group. This step ensures that the sampled features, which may originate from disparate bugs, are logically compatible when combined.
Instantiation (InstanLLM) – The synthesized group is translated into a syntactically valid C/C++ program. InstanLLM automatically inserts declarations, type casts, initializations, and scaffolding code so that all semantic constraints are simultaneously satisfied. The resulting program is ready for compilation and execution.
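The data flow of the last two stages can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Feature` class, the `LLM` callable type, the prompt wording, and the function names are all assumptions made for the example, and the extraction stage is replaced by a small hand-written pool. Only the first feature description comes from the summary; the second is invented for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical LLM interface: takes a prompt, returns generated text.
LLM = Callable[[str], str]


@dataclass(frozen=True)
class Feature:
    description: str  # natural-language statement of the bug-prone invariant
    witness: str      # minimal code fragment that realizes it (may be empty)


def synthesize_group(pool: List[Feature], k: int, group_llm: LLM,
                     rng: random.Random) -> List[Feature]:
    """Sample up to k features and ask the model for a 'glue' feature linking them."""
    sampled = rng.sample(pool, min(k, len(pool)))
    prompt = "Describe how these invariants can share variables and control flow:\n"
    prompt += "\n".join(f"- {f.description}" for f in sampled)
    glue = Feature(description=group_llm(prompt), witness="")
    return sampled + [glue]


def instantiate(group: List[Feature], instan_llm: LLM) -> str:
    """Ask the model for one C program satisfying every feature in the group."""
    lines = ["Write a single valid C program that satisfies all of the following:"]
    for i, f in enumerate(group, start=1):
        lines.append(f"{i}. {f.description}")
        if f.witness:
            lines.append(f"   witness: {f.witness}")
    return instan_llm("\n".join(lines))


# Illustrative pool; in the paper this would come from the extraction stage.
POOL = [
    Feature("the array index exceeds the array length",
            "int a[4]; int i = 4; a[i] = 0;"),
    Feature("a loop bound depends on a value modified inside the loop",
            "for (int i = 0; i < n; i++) { n--; }"),
]
```

Passing the models in as plain callables keeps the sketch independent of any particular LLM client; in a real setup each callable would wrap a model request, and the returned program text would be handed to the compiler under test.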

FeatureFuzz also incorporates a coverage‑guided feedback loop: programs that increase code coverage cause their constituent features to receive higher rewards, biasing future sampling toward more “fruitful” semantics.
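One way to picture this feedback loop is a reward-weighted sampler. The summary does not specify the exact reward rule, so the additive bonus, the initial weight of 1.0, and the `FeatureSampler` class below are assumptions chosen only to illustrate the idea of biasing sampling toward features that contributed to new coverage.

```python
import random
from typing import Dict, List


class FeatureSampler:
    """Samples feature ids with probability proportional to accumulated reward."""

    def __init__(self, feature_ids: List[str], seed: int = 0) -> None:
        # Assumption: every feature starts with the same weight.
        self.weights: Dict[str, float] = {fid: 1.0 for fid in feature_ids}
        self.rng = random.Random(seed)

    def sample(self, k: int) -> List[str]:
        """Weighted sampling without replacement of up to k feature ids."""
        candidates = dict(self.weights)
        chosen: List[str] = []
        for _ in range(min(k, len(candidates))):
            ids = list(candidates)
            pick = self.rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]
            chosen.append(pick)
            del candidates[pick]
        return chosen

    def reward(self, group: List[str], new_edges: int, bonus: float = 1.0) -> None:
        """Raise the weight of each feature in a group that reached new coverage."""
        if new_edges > 0:
            for fid in group:
                self.weights[fid] += bonus
```

A fuzzing loop would call `sample`, generate and compile a program from the chosen features, measure how many previously unseen coverage edges it reached, and pass that count to `reward`.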

Evaluation was performed on GCC and LLVM. In 24‑hour campaigns, FeatureFuzz triggered 167 unique crashes, a 2.78× improvement over the second‑best baseline (MetaMut). In a separate 72‑hour campaign, the system reported 113 distinct bugs in GCC and LLVM, 97 of which have been confirmed by compiler developers, including 39 bugs in middle‑end or back‑end optimizations and 46 bugs that could appear in real‑world code. These results indicate that explicit semantic recombination substantially expands the reachable bug‑triggering space compared to purely syntactic fuzzers.

The authors formalize the notion of Semantic Collapse—the loss of meaningful bug‑triggering logic when it is compressed into shallow syntactic transformations or hidden inside model parameters. By using natural language as an intermediate representation, FeatureFuzz preserves semantic intent throughout the generation pipeline, effectively preventing collapse.

Limitations are acknowledged: (1) the quality of extracted features depends on the accuracy of the LLM during extraction; (2) the feature pool is biased toward bugs present in the historical dataset; (3) the current implementation targets C/C++ only, requiring new feature definitions and fine‑tuning for other languages. Future work may explore automated validation of features, multi‑language extensions, and integration with formal verification techniques to further strengthen the approach.

In summary, this paper proposes a paradigm shift from syntax‑centric mutation to semantic feature recombination for compiler fuzzing. By systematically extracting, synthesizing, and instantiating high‑level invariants with the aid of LLMs, FeatureFuzz achieves a substantial increase in bug discovery efficiency, offering a promising direction for enhancing compiler reliability and security.
