Correlating Features and Code by Dynamic and Semantic Analysis
One major problem in maintaining a software system is understanding which functional features the system provides and how those features are implemented. This paper proposes a novel approach for locating features in code through combined semantic and dynamic analysis. The method consists of three steps. First, execution traces are treated as a text corpus, with the method calls involved in each trace acting as the terms of a document. Second, the method calls are ranked, and omnipresent methods are filtered out by applying a threshold. Third, feature-traces are treated as first-class entities: identifiers are extracted from the source code of the remaining methods, and a trace-by-identifier matrix is generated. A semantic analysis model, LDA, is then applied to the matrix to extract topics, which serve as functional features. By building several corresponding matrices, the relations between features and code can be obtained, supporting comprehension of the system's functional intent. A case study is presented, and the results of applying this approach can be used to guide future research.
💡 Research Summary
Maintaining large software systems often hinges on the ability to locate where functional features are implemented in the source code. Traditional approaches rely on static analysis, comment mining, or architectural heuristics, but they typically ignore the dynamic execution context and the semantic richness of identifiers. This paper introduces a hybrid methodology that combines dynamic trace analysis with semantic topic modeling to automatically discover feature‑code relationships.
The process consists of three main phases. First, execution traces collected from test suites or production runs are treated as a textual corpus. Each trace is represented as a “document” whose terms are the method calls observed during execution. By converting runtime behavior into a bag‑of‑words representation, the authors make the data amenable to information‑retrieval techniques.
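This first phase can be sketched as a small amount of bookkeeping. The traces and method names below are hypothetical, invented for illustration; the point is only that each trace becomes a "document" whose term counts are over observed method calls:

```python
from collections import Counter

# Hypothetical execution traces: each trace is the sequence of method
# calls observed while exercising one feature scenario.
traces = {
    "draw_rectangle": ["Canvas.paint", "Shape.draw", "Log.info", "Rect.area"],
    "save_file":      ["File.open", "File.write", "Log.info", "File.close"],
}

# Bag-of-words view: one "document" per trace, whose terms are method calls.
corpus = {name: Counter(calls) for name, calls in traces.items()}
```

Order and repetition within a trace are deliberately discarded; only call frequencies survive, which is what makes standard information-retrieval machinery applicable.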
Second, the method calls are weighted using a TF‑IDF‑like scheme, and a threshold is applied to filter out omnipresent methods such as logging utilities or framework entry points. This filtering step reduces noise and prevents ubiquitous calls from dominating the subsequent statistical model.
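A minimal sketch of such a weighting scheme follows. The corpus, the concrete TF-IDF formula, and the threshold value are illustrative assumptions, not the paper's exact definitions; note how a method appearing in every trace (here a hypothetical `Log.info`) gets an IDF of zero and is filtered out:

```python
import math
from collections import Counter

# Hypothetical trace corpus: trace name -> method-call frequencies.
corpus = {
    "draw_rectangle": Counter({"Shape.draw": 2, "Log.info": 3, "Rect.area": 1}),
    "save_file":      Counter({"File.write": 2, "Log.info": 4}),
    "load_file":      Counter({"File.read": 3, "Log.info": 2}),
}

n_docs = len(corpus)
df = Counter()                      # document frequency of each method
for doc in corpus.values():
    df.update(doc.keys())

def tfidf(term, doc):
    tf = doc[term] / sum(doc.values())
    idf = math.log(n_docs / df[term])   # 0 when the method is in every trace
    return tf * idf

# Drop methods whose weight falls at or below the threshold: omnipresent
# calls such as logging utilities score an IDF of 0 and disappear.
THRESHOLD = 0.0
filtered = {
    name: {m: w for m in doc if (w := tfidf(m, doc)) > THRESHOLD}
    for name, doc in corpus.items()
}
```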
The third phase is the core contribution. The traces, now containing only the surviving method calls, are treated as first‑class “feature‑traces”. For each remaining method, the source code is parsed to extract identifiers—variable names, class names, method names, and other programmer‑provided tokens that carry semantic meaning. A trace‑by‑identifier matrix is then constructed in which rows correspond to feature‑traces and columns to identifiers. This sparse matrix captures both dynamic usage (rows) and static semantics (columns).
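The matrix construction can be sketched as follows. The method sources, identifier lists, and trace contents are hypothetical; a common practical detail included here is splitting camelCase identifiers into their constituent terms before counting:

```python
import re

# Hypothetical identifiers extracted from each surviving method's source.
method_identifiers = {
    "Shape.draw": ["graphicsContext", "penWidth", "drawOutline"],
    "File.write": ["outputStream", "byteBuffer", "flushBuffer"],
}

# Which methods each feature-trace executed (after filtering).
traces = {
    "draw_rectangle": ["Shape.draw"],
    "save_file":      ["File.write"],
}

def split_camel(name):
    """Split a camelCase identifier into lowercase terms."""
    return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", name)]

# Vocabulary: every term occurring in any method's identifiers.
vocab = sorted({t
                for idents in method_identifiers.values()
                for ident in idents
                for t in split_camel(ident)})

# Trace-by-identifier matrix: rows = feature-traces, columns = terms.
matrix = [
    [sum(split_camel(ident).count(term)
         for m in methods
         for ident in method_identifiers[m])
     for term in vocab]
    for methods in traces.values()
]
```

In a real system this matrix is large and very sparse, so a sparse representation (e.g. dictionaries of nonzero counts) would replace the dense list-of-lists used here.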
Latent Dirichlet Allocation (LDA) is then applied to the matrix. LDA factorizes the matrix into a set of latent topics, each defined by a probability distribution over identifiers and a distribution over traces. In this context, a topic corresponds to a functional feature: identifiers with high probability in a topic are the lexical cues that developers used to express the feature, while traces with high probability are the runtime manifestations of that feature.
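To make the LDA step concrete, here is a minimal collapsed Gibbs sampler over a toy corpus of identifier terms. This is an illustrative sketch, not the paper's implementation; the toy documents, hyperparameters, and iteration count are assumptions, and a real study would use an established library such as gensim or MALLET:

```python
import random

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.

    docs: list of token lists (identifier terms per feature-trace).
    Returns (vocab, doc_topic_counts, topic_word_counts).
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}

    ndk = [[0] * n_topics for _ in docs]        # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                         # tokens per topic
    z = []                                      # topic assignment per token

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][widx[w]] += 1; nk[k] += 1
        z.append(zs)

    # Gibbs sweeps: resample each token's topic given all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][widx[w]] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][widx[w]] + beta)
                           / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][widx[w]] += 1; nk[k] += 1
    return vocab, ndk, nkw

# Toy corpus: identifier terms from two hypothetical feature-traces.
docs = [["draw", "shape", "graphics", "draw"],
        ["file", "stream", "save", "file"]]
vocab, doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
```

After sampling, `topic_word` gives each topic's weight over identifiers (the lexical cues of a feature) and `doc_topic` gives each trace's topic mixture (its runtime manifestation), matching the interpretation above.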
The authors evaluate the approach on an open‑source case study (e.g., JHotDraw). Multiple trace‑by‑identifier matrices are built, varying the number of topics (10–20) and the filtering threshold. The results show that methods belonging to the same functional concern (such as “draw shape” or “save file”) consistently cluster into the same LDA topic, confirming that the approach can recover coherent feature groups. Moreover, when the omnipresent‑method filter is omitted, topics become less distinct, demonstrating the importance of the second phase.
Key advantages of the proposed pipeline include: (1) dynamic traces capture real usage scenarios, allowing the method to reflect runtime dependencies that static analysis misses; (2) identifier extraction preserves the semantic intent embedded by developers; (3) probabilistic topic modeling can handle overlapping features and assign multi‑topic memberships to complex methods. However, the approach also has limitations. It depends on the coverage of the collected traces—unexercised code will not appear in the matrix. LDA’s hyper‑parameters (number of topics, α, β) are sensitive and require careful tuning. Finally, for very large systems the trace‑by‑identifier matrix can become extremely sparse and memory‑intensive, necessitating compression or distributed computation.
Future work suggested by the authors includes: (a) augmenting trace collection with automated test generation to improve coverage; (b) employing non‑parametric Bayesian models such as Hierarchical Dirichlet Processes to infer the optimal number of topics automatically; and (c) leveraging sparse matrix factorization or parallel LDA implementations to scale the technique to industrial‑size codebases.
In summary, the paper presents a novel, empirically validated framework that fuses dynamic execution information with static semantic cues to automatically map functional features to their implementing code. This hybrid analysis holds promise for improving feature‑oriented refactoring, impact analysis, and developer onboarding by reducing the manual effort required to understand the functional intent of large, evolving software systems.