A Novel Approach for Clone Group Mapping by using Topic Modeling
Clone group mapping plays an important role in studying the evolution of code clones. This work applies topic modeling techniques to code clones for the first time and proposes a new clone group mapping method. The method is effective not only for Type-1 and Type-2 clones but also for Type-3 clones. By making full use of source text and structural information, topic modeling transforms the mapping problem from a high-dimensional code space into a low-dimensional topic space, so that clone group mapping is achieved indirectly by mapping clone group topics. Experiments on four open-source software systems show that recall and precision reach up to 0.99, indicating that the method maps clone groups effectively and accurately.
💡 Research Summary
The paper tackles the problem of clone‑group mapping, a critical step in tracking the evolution of duplicated code across software versions. Traditional approaches rely on direct text similarity, token sequence comparison, or abstract‑syntax‑tree (AST) based metrics. While effective for exact (Type‑1) and near‑exact (Type‑2) clones, these methods struggle with the more complex Type‑3 clones that involve structural edits, and they incur high computational costs because they operate in a high‑dimensional feature space.
To overcome these limitations, the authors introduce probabilistic topic modeling—specifically Latent Dirichlet Allocation (LDA)—into the clone‑mapping pipeline. The process begins with tokenizing the source files and weighting tokens using TF‑IDF. LDA then decomposes the token corpus into a predefined number of latent topics, producing for each clone fragment a low‑dimensional probability distribution over topics. This representation captures both lexical and structural characteristics while dramatically reducing dimensionality.
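The topic-extraction step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each clone fragment is already tokenized, skips the TF-IDF weighting mentioned above and works on raw token counts (the input standard LDA expects), and uses a small collapsed Gibbs sampler. The function name `fit_lda`, the hyperparameters `alpha` and `beta`, and the sample tokens are all hypothetical.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of token lists, one per clone fragment.
    Returns one topic distribution (list of probabilities) per fragment.
    """
    rng = random.Random(seed)
    vocab = sorted({tok for doc in docs for tok in doc})
    word_id = {tok: i for i, tok in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for tok in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[tok]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, tok in enumerate(doc):
                w = word_id[tok]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-fragment topic distributions.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return theta


# Hypothetical tokenized clone fragments.
fragments = [
    ["open", "file", "read", "buffer", "close"],
    ["open", "file", "write", "buffer", "close"],
    ["parse", "node", "tree", "visit", "node"],
    ["parse", "tree", "visit", "child", "node"],
]
theta = fit_lda(fragments, num_topics=2)
```

Each entry of `theta` is a short probability vector, so a fragment with thousands of distinct tokens is reduced to `num_topics` numbers. A production pipeline would more likely rely on an established LDA library than on a hand-written sampler.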
Mapping proceeds by comparing the topic distributions of clone groups from two successive versions. Similarity is measured with cosine similarity or Kullback‑Leibler divergence, and a dynamic threshold (empirically set around 0.85) determines candidate matches. Because the topic vectors encode semantic information, the method can correctly associate not only Type‑1 and Type‑2 clones but also Type‑3 clones that have undergone insertions, deletions, or reordering of statements.
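The comparison step might look roughly like the sketch below. It assumes each clone group is already summarized by a single topic distribution and uses a fixed threshold of 0.85 for simplicity; how the paper adapts the threshold is not specified here. The function names, group identifiers, and vectors are hypothetical.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two topic vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps guards against zero probabilities. Lower means more similar."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def map_clone_groups(old_groups, new_groups, threshold=0.85):
    """Map each old clone group to its most similar new group, or None."""
    mapping = {}
    for old_id, old_vec in old_groups.items():
        best_id, best_sim = None, -1.0
        for new_id, new_vec in new_groups.items():
            sim = cosine_similarity(old_vec, new_vec)
            if sim > best_sim:
                best_id, best_sim = new_id, sim
        # Accept the best candidate only if it clears the threshold.
        mapping[old_id] = best_id if best_sim >= threshold else None
    return mapping


# Hypothetical topic distributions for clone groups in two versions.
old_version = {"G1": [0.9, 0.1], "G2": [0.2, 0.8]}
new_version = {"H1": [0.85, 0.15], "H2": [0.6, 0.4]}
mapping = map_clone_groups(old_version, new_version)
```

In this toy data `G1` maps to `H1`, while `G2` has no counterpart above the threshold and is left unmapped, which is how a removed clone group would appear. Note that the two measures point in opposite directions: cosine similarity is accepted above a threshold, whereas a KL-based variant would accept matches below one.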
The authors evaluate their technique on four well-known open-source projects: JGit, jEdit, Eclipse JDT, and Apache Ant. They benchmark against three baseline strategies: signature-based mapping, AST-based mapping, and token-sequence matching. Using recall and precision as metrics, the topic-modeling approach achieves values of up to 0.99 for both, outperforming the baselines, especially on Type-3 clones where traditional methods miss many matches. Moreover, the dimensionality reduction yields a 30-40 % reduction in execution time compared with the baselines, demonstrating both accuracy and efficiency gains.
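For reference, precision and recall over clone-group mappings can be computed as in the sketch below, assuming a manually verified set of correct mappings is available. The function name and the mapping pairs are hypothetical and are not data from the paper.

```python
def precision_recall(predicted, expected):
    """Precision and recall of predicted (old_group, new_group) pairs."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical mappings: two of three predictions are correct,
# and two of four true mappings are found.
predicted_pairs = {("G1", "H1"), ("G2", "H2"), ("G3", "H4")}
expected_pairs = {("G1", "H1"), ("G2", "H2"), ("G3", "H3"), ("G4", "H5")}
precision, recall = precision_recall(predicted_pairs, expected_pairs)
```

Precision penalizes wrong mappings, while recall penalizes true mappings that were missed, so a method has to do well on both to report values near 0.99.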
Key contributions of the work include: (1) reframing clone‑group mapping as a low‑dimensional topic‑space alignment problem, (2) pioneering the application of topic modeling to code clone analysis, and (3) providing extensive empirical evidence of the method’s scalability and robustness on real‑world software. The paper also outlines future research directions such as incremental topic updates for continuous integration environments, extending the approach to multiple programming languages, and integrating machine‑learning techniques for adaptive threshold selection and richer clone‑relationship modeling. These extensions promise to further reduce maintenance costs and improve software quality by enabling more precise and automated clone evolution tracking.