StoneDetector: Conventional and versatile code clone detection for Java
Copy & paste is a widespread practice when developing software, and thus duplicated and subsequently modified code occurs frequently in software projects. Since such code clones, i.e., identical or similar fragments of code, can bloat software projects and cause issues like bug or vulnerability propagation, their identification is important. In this paper, we present StoneDetector and its underlying method for finding code clones in Java source code and bytecode. StoneDetector implements a conventional clone detection approach based on the textual comparison of paths derived from the code’s representation as dominator trees. In this way, the tool not only finds exact and syntactically similar near-miss code clones, but also code clones that are harder to detect due to greater syntactic variation. We demonstrate StoneDetector’s versatility as a conventional clone detection tool and analyze its various configuration parameters, including the choice of string metric, hashing algorithm, etc. In an exhaustive evaluation against other conventional clone detectors on several state-of-the-art benchmarks, we demonstrate StoneDetector’s performance and scalability in finding code clones in both Java source code and bytecode.
💡 Research Summary
The paper introduces StoneDetector, a novel clone detection tool for Java source code and Java bytecode that leverages dominator‑tree paths as the primary representation of program fragments. Traditional clone detectors typically rely on token streams, abstract syntax trees (ASTs), or control‑flow graphs, which perform well for exact or near‑identical clones (Type 1–2) but struggle with clones that exhibit substantial syntactic variation (Type 3–4). StoneDetector addresses this gap by constructing a dominator tree for each method, extracting all root‑to‑leaf paths, abstracting identifiers and literals, and then linearising each path into a string description.
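The path-extraction and abstraction steps can be sketched as follows. This is a minimal illustration: the tree encoding, node labels, and the placeholder scheme (`ID` for identifiers, `LIT` for literals) are assumptions for demonstration, not StoneDetector's actual format.

```python
import re

def root_to_leaf_paths(tree, node, prefix=None):
    """Collect every root-to-leaf path in a dominator tree as a list of labels."""
    prefix = (prefix or []) + [node]
    children = tree.get(node, [])
    if not children:
        return [prefix]
    paths = []
    for child in children:
        paths.extend(root_to_leaf_paths(tree, child, prefix))
    return paths

def abstract_label(label):
    """Replace literals and identifiers with placeholders, keeping keywords."""
    label = re.sub(r'"[^"]*"|\b\d+\b', 'LIT', label)          # literals -> LIT
    label = re.sub(r'\b(?!(?:if|while|return|LIT)\b)\w+\b',   # identifiers -> ID
                   'ID', label)
    return label

# Toy dominator tree of a small method: the entry node dominates the
# condition, which dominates both branches and the return.
tree = {
    'entry': ['if x > 0'],
    'if x > 0': ['y = 1', 'y = 2', 'return y'],
}

# Linearise each abstracted path into one string description.
descriptions = [
    '|'.join(abstract_label(n) for n in path)
    for path in root_to_leaf_paths(tree, 'entry')
]
```

Note how the two assignment branches yield identical descriptions after abstraction, which is exactly what makes syntactically varied clones comparable as plain strings.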
These path strings are compared using a configurable suite of string similarity metrics, including Levenshtein distance, Needleman‑Wunsch alignment, Hamming distance, and longest common subsequence (LCS). To improve scalability and to capture large‑gap clones, the authors integrate locality‑sensitive hashing (LSH) together with a modified LCS metric. The tool also supports a “function‑name preservation” toggle, which can be enabled to increase recall in API‑heavy code bases.
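A normalised longest-common-subsequence similarity over two abstracted path descriptions could look like the sketch below. The normalisation (LCS length over the longer sequence) and the token granularity are illustrative assumptions; StoneDetector's modified LCS metric may differ.

```python
def lcs_length(a, b):
    """Classic dynamic-programming LCS over two token sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcs_similarity(a, b):
    """Similarity in [0, 1]: LCS length divided by the longer sequence length."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))

# Two abstracted paths that differ only in the comparison operator.
p1 = ['ID', 'if ID > LIT', 'ID = LIT', 'return ID']
p2 = ['ID', 'if ID >= LIT', 'ID = LIT', 'return ID']
sim = lcs_similarity(p1, p2)  # 3 of 4 tokens match -> 0.75
```

A pair of paths would then be considered clones when this similarity exceeds a configured threshold.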
StoneDetector’s pipeline consists of: (1) lexical and syntactic analysis via the Spoon framework to obtain ASTs; (2) control‑flow graph generation and dominator‑tree construction using the WALA framework, augmented with an inter‑procedural exception analysis for finer‑grained flow; (3) path extraction and abstraction; (4) optional hashing of path descriptors; and (5) similarity matching using the chosen metric. The same methodology is applied to Java bytecode by first translating bytecode to the Jimple intermediate representation with Soot, then building stack‑ and register‑based dominator trees. This enables clone detection even when source code is unavailable or obfuscated.
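Step (4), the optional hashing of path descriptors, can be illustrated with a minimal MinHash-style LSH signature; the hash function, signature length, and token sets here are assumptions for demonstration, not the tool's actual hashing scheme.

```python
import hashlib

def minhash_signature(tokens, num_hashes=16):
    """One min-hash per seeded hash function over a set of path tokens."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f'{seed}:{t}'.encode()).digest()[:8], 'big')
            for t in tokens))
    return tuple(sig)

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Identical descriptor sets hash to identical signatures; a variant set
# shares only some slots, so candidate pairs can be found without
# comparing every pair of paths directly.
s1 = minhash_signature({'ID = LIT', 'if ID > LIT', 'return ID'})
s2 = minhash_signature({'ID = LIT', 'if ID > LIT', 'return ID'})
s3 = minhash_signature({'ID = LIT', 'while ID < LIT', 'return ID'})
```

In a full pipeline, signatures would additionally be banded into buckets so that only fragments sharing a bucket are compared with the expensive string metric.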
The authors evaluate StoneDetector on several large‑scale benchmarks: BigCloneBench, Google Code Jam, Project CodeNet, GPTCloneBench, and additional proprietary datasets. They compare against nine state‑of‑the‑art detectors (NiCad, iClones, SourcererCC, Deckard, CCAligner, NIL, CloneWorks, Oreo, and others). Results show that StoneDetector matches or exceeds competitors on Type 1–2 clones while achieving significantly higher recall and precision on Type 3–4 clones. For example, on a subset of BigCloneBench containing large‑gap clones, StoneDetector’s recall improves by 12 percentage points over NiCad, with a comparable increase in precision.
Scalability experiments demonstrate that StoneDetector can process code bases of up to 100 million lines of code within 1–2 hours on commodity hardware while using only 2–3 GB of RAM, placing its performance on par with the fastest existing tools.
A thorough parameter study explores the impact of minimum clone size, similarity threshold, hash bit length, and choice of string metric. The authors identify an optimal configuration for most benchmarks: LCS combined with LSH, a minimum clone length of 10 LOC, and a similarity threshold between 0.30 and 0.40. Under this setting, StoneDetector consistently achieves the highest F1 scores across all evaluated datasets. The study also reveals trade‑offs: Hamming distance yields the fastest runtime but fails to detect many Type 3–4 clones, whereas Levenshtein provides a balanced trade‑off between speed and detection quality.
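The reported optimal configuration can be expressed as a simple clone-pair filter. The field names and the concrete threshold of 0.35 (chosen from within the reported 0.30–0.40 range) are illustrative assumptions, not StoneDetector's actual configuration interface.

```python
from dataclasses import dataclass

@dataclass
class DetectorConfig:
    # Hypothetical names mirroring the parameters studied in the paper.
    min_clone_loc: int = 10            # minimum clone length in LOC
    similarity_threshold: float = 0.35 # within the reported 0.30-0.40 range

def is_clone_pair(similarity, loc_a, loc_b, cfg=DetectorConfig()):
    """Report a pair only if both fragments are long enough and similar enough."""
    return (min(loc_a, loc_b) >= cfg.min_clone_loc
            and similarity >= cfg.similarity_threshold)
```

Under such a filter, raising the threshold trades recall on Type 3–4 clones for precision, which is the trade-off the parameter study quantifies.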
Limitations are acknowledged. Dominator‑tree construction incurs higher upfront computational cost than simple token extraction, and highly dynamic features such as reflection or runtime code generation may lead to incomplete control‑flow models. Moreover, because the final comparison is still text‑based, semantic clones that share algorithmic intent but differ drastically in implementation may remain undetected.
Future work includes integrating dynamic analysis information to enrich the dominator model, exploring machine‑learning‑driven similarity functions, and extending the approach to other programming languages (e.g., Python, C++).
In summary, StoneDetector demonstrates that encoding dominator‑tree paths and applying flexible string‑matching techniques constitute an effective and versatile strategy for conventional clone detection. It achieves competitive performance on exact clones while substantially improving detection of syntactically diverse clones, and it scales to industrial‑size code bases, making it a valuable addition to the software maintenance and security analysis toolbox.