Java Source-code Clustering: Unifying Syntactic and Semantic Features
This is a companion draft to the paper ‘Software Clustering: Unifying Syntactic and Semantic Features’, published in the proceedings of the 19th Working Conference on Reverse Engineering (WCRE 2012). It describes in detail the clustering process that appeared in the paper in abridged form, and it includes additional process steps that the WCRE paper did not cover. The process is described for applications written in Java. However, as argued in the WCRE paper, it can be adapted to many other programming paradigms.
💡 Research Summary
The paper provides a comprehensive, step‑by‑step description of a Java source‑code clustering technique that unifies syntactic and semantic information. The authors begin by motivating the need for automated modularization in large software systems, noting that prior work either focuses solely on structural cues (inheritance, call graphs, package hierarchies) or on textual cues (identifier names, comments) but rarely combines both in a principled way. Their solution consists of a multi‑stage pipeline.
First, the source files are parsed using the Eclipse JDT front‑end to generate abstract syntax trees (ASTs). From the ASTs they extract structural artifacts: class‑to‑class inheritance links, interface implementations, method call edges, and package nesting. Simultaneously, all identifiers, Javadoc blocks, and inline comments are collected; identifiers are split into constituent tokens (CamelCase, snake_case) and normalized.
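The identifier-splitting step described above can be illustrated with a minimal Java sketch. It is not the authors' implementation: the class name IdentifierTokenizer, the boundary rules, and the choice of lower-casing as the normalization are all assumptions made for illustration, and the summary does not say how acronyms or digits are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not from the paper): splits an identifier into
// lower-cased tokens at snake_case and CamelCase boundaries.
final class IdentifierTokenizer {

    // Assumed boundary rules:
    //   1. runs of '_' or '$' separate tokens (snake_case)
    //   2. a lower-case letter or digit followed by an upper-case letter (fooBar)
    //   3. the end of an acronym followed by a capitalized word (XMLParser)
    private static final String BOUNDARY =
            "[_$]+"
            + "|(?<=[a-z0-9])(?=[A-Z])"
            + "|(?<=[A-Z])(?=[A-Z][a-z])";

    private IdentifierTokenizer() { }

    static List<String> split(String identifier) {
        List<String> tokens = new ArrayList<>();
        for (String part : identifier.split(BOUNDARY)) {
            // A leading separator such as "_name" yields an empty first part.
            if (!part.isEmpty()) {
                // Normalization here is only lower-casing (an assumption).
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("parseHttpRequest")); // [parse, http, request]
        System.out.println(split("MAX_BUFFER_SIZE"));  // [max, buffer, size]
        System.out.println(split("XMLParserFactory")); // [xml, parser, factory]
    }
}
```

A fuller pipeline would probably also apply stop-word removal or stemming to the resulting tokens, but the summary does not specify those steps, so they are left out here.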
Second, syntactic features are transformed into a structural similarity matrix (S_syn). The authors compute pairwise coupling scores based on shared super‑classes, call‑graph proximity, and package co‑location, normalizing each component to the range