Understanding and Finding JIT Compiler Performance Bugs
Just-in-time (JIT) compilers are key components for many popular programming languages with managed runtimes (e.g., Java and JavaScript). JIT compilers perform optimizations and generate native code at runtime based on dynamic profiling data, to improve the execution performance of the running application. Like other software systems, JIT compilers might have software bugs, and prior work has developed a number of automated techniques for detecting functional bugs (i.e., generated native code does not semantically match that of the original code). However, no prior work has targeted JIT compiler performance bugs, which can cause significant performance degradation while an application is running. These performance bugs are challenging to detect due to the complexity and dynamic nature of JIT compilers. In this paper, we present the first work on demystifying JIT performance bugs. First, we perform an empirical study across four popular JIT compilers for Java and JavaScript. Our manual analysis of 191 bug reports uncovers common triggers of performance bugs, patterns in which these bugs manifest, and their root causes. Second, informed by these insights, we propose layered differential performance testing, a lightweight technique to automatically detect JIT compiler performance bugs, and implement it in a tool called Jittery. We incorporate practical optimizations into Jittery such as test prioritization, which reduces testing time by 92.40% without compromising bug-detection capability, and automatic filtering of false positives and duplicates, which substantially reduces manual inspection effort. Using Jittery, we discovered 12 previously unknown performance bugs in the Oracle HotSpot and Graal JIT compilers, with 11 confirmed and 6 fixed by developers.
💡 Research Summary
The paper tackles an under‑explored problem: performance bugs in just‑in‑time (JIT) compilers. While functional bugs (semantic mismatches) have received considerable attention, performance regressions—either excessive compilation time (“long compilation”) or unexpectedly slow generated code (“high‑order performance”)—have not been systematically studied. To fill this gap, the authors conduct a large‑scale empirical analysis of 191 real‑world performance bug reports from four widely used JIT compilers (HotSpot C1/C2, Graal, and V8). Their manual classification reveals three dominant characteristics: (1) almost half of the bugs can be reproduced with tiny micro‑benchmarks rather than full‑scale suites, (2) most were discovered via comparative signals such as performance regressions or execution‑time divergences between two compiler configurations, and (3) beyond classic optimization or code‑generation errors, JIT‑specific mechanisms—speculative optimizations, tiered compilation decisions, and runtime interactions (e.g., de‑optimizations, code‑cache management)—are frequent root causes.
Guided by these insights, the authors propose a novel detection technique called “layered differential performance testing” and implement it in a tool named Jittery. Jittery automatically generates thousands of small random programs, runs each under two JIT configurations that differ in tier, version, or flags, and reports programs whose execution times diverge significantly. To keep the approach lightweight, detection is organized into multiple layers with increasing measurement rigor: early layers quickly discard clearly benign programs using few iterations, while later layers apply many more iterations and collect detailed profiling data for the remaining candidates. Runtime information from earlier layers is used to prioritize promising candidates, reducing overall testing time by 92.40% compared with a naïve exhaustive approach. Additional heuristics automatically filter out false positives caused by system noise and collapse duplicate bug reports, further lowering manual inspection effort.
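The layering idea can be illustrated with a minimal Python sketch. This is not Jittery's implementation: the layer settings, the `Measure` hook, the configuration names, and the `simulated_measure` stand-in are all assumptions made for illustration, and a real harness would time actual runs of the runtime under each JIT configuration. The sketch only shows how cheap early layers can discard benign programs while later layers re-measure the survivors with more iterations.

```python
from statistics import median
from typing import Callable, List, Sequence, Tuple

# Hypothetical measurement hook: (program, config, iterations) -> per-iteration
# times in seconds. A real harness would launch the JIT-enabled runtime here.
Measure = Callable[[str, str, int], List[float]]

# Hypothetical layers as (iterations, minimum slowdown ratio); illustrative values.
DEFAULT_LAYERS: Sequence[Tuple[int, float]] = ((3, 1.5), (10, 2.0), (30, 2.0))


def slowdown_ratio(measure: Measure, program: str, cfg_a: str, cfg_b: str,
                   iterations: int) -> float:
    """Ratio of the slower median time to the faster one (always >= 1.0)."""
    t_a = median(measure(program, cfg_a, iterations))
    t_b = median(measure(program, cfg_b, iterations))
    fast, slow = min(t_a, t_b), max(t_a, t_b)
    return slow / fast if fast > 0 else float("inf")


def layered_differential_test(programs: Sequence[str], measure: Measure,
                              cfg_a: str, cfg_b: str,
                              layers: Sequence[Tuple[int, float]] = DEFAULT_LAYERS
                              ) -> List[Tuple[str, float]]:
    """Return (program, ratio) pairs that survive every layer, most divergent first."""
    candidates = [(p, 1.0) for p in programs]
    for iterations, threshold in layers:
        scored = [(p, slowdown_ratio(measure, p, cfg_a, cfg_b, iterations))
                  for p, _ in candidates]
        # Drop programs whose divergence is below this layer's threshold.
        survivors = [(p, r) for p, r in scored if r >= threshold]
        # Order the most divergent programs first for the next, costlier layer.
        survivors.sort(key=lambda pr: pr[1], reverse=True)
        candidates = survivors
    return candidates


def simulated_measure(program: str, config: str, iterations: int) -> List[float]:
    """Deterministic stand-in for real timing runs (made-up numbers)."""
    if config == "config-a":
        return [1.0] * iterations
    if program == "slow_under_b":
        return [3.0] * iterations  # consistently slower under config-b
    if program == "noisy":
        # Two early spikes, then steady: looks divergent only with few iterations.
        return [5.0, 5.0] + [1.0] * (iterations - 2)
    return [1.05] * iterations  # benign: negligible difference
```

With the simulated data, calling `layered_differential_test(["benign", "noisy", "slow_under_b"], simulated_measure, "config-a", "config-b")` drops `benign` in the first layer and `noisy` in the second, once more iterations smooth out its spikes, leaving only `slow_under_b`. Using the median rather than the mean is one simple way to dampen system noise; the filtering heuristics in the actual tool may differ.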
When applied to the HotSpot and Graal compilers, Jittery uncovered 12 previously unknown performance bugs; 11 were confirmed by the developers and 6 have already been fixed. The discovered bugs span speculative‑optimization failures (e.g., null‑check elimination or type specialization that trigger frequent de‑optimizations), code‑cache overflow or mis‑management, and incorrect tier‑transition heuristics that cause unnecessary high‑cost compilations. The authors release the full dataset of 191 bugs, the Jittery source code, and the scripts used for evaluation, enabling reproducibility and future research.
The contributions are fourfold: (1) the first in‑depth empirical study of JIT performance bugs, (2) a publicly available dataset of real‑world bugs, (3) the design and implementation of Jittery, a lightweight yet accurate automated detection framework, and (4) empirical evidence that Jittery can efficiently discover real bugs in production‑grade JIT compilers. The paper also discusses limitations—current focus on Java and JavaScript, reliance on randomly generated programs, and the need for deeper de‑optimization tracing—and outlines future directions such as extending the approach to other dynamic languages (Python, Ruby) and integrating hardware performance counters.
In summary, this work illuminates the nature of JIT performance regressions, provides concrete methodological advances for their detection, and demonstrates that automated, layered differential testing can be both fast and effective, offering a valuable tool for JIT compiler developers and maintainers.