Empirical Confirmation (and Refutation) of Presumptions on Software

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Code metrics are easy to define, but not so easy to justify. It is hard to prove that a metric is valid, i.e., that measured numerical values imply anything about the vaguely defined yet crucial software properties such as complexity and maintainability. This paper employs statistical analysis and tests to check some “believable” presumptions about the behavior of software and of the metrics measured for it. Among those are the reliability presumption implicit in the application of any code metric, and the presumption that the magnitude of change in a software artifact is correlated with changes to its version number. Putting a suite of 36 metrics to the test, we confirm most of the presumptions. Unexpectedly, we show that a substantial portion of the reliability of some metrics can be observed even under random changes to the architecture. Another surprising result is that Boolean-valued metrics tend to flip their values more often in minor software version increments than in major increments.


💡 Research Summary

The paper “Empirical Confirmation (and Refutation) of Presumptions on Software” investigates two central, yet often implicit, assumptions in software metric research: (1) that code metrics are reliable—i.e., that their values remain stable across successive software versions—and (2) that the magnitude of changes in a software artifact correlates with the type of version-number change (major, minor, patch). To test these presumptions, the authors assembled a corpus of 19 Java open‑source projects drawn from the Qualitas Corpus, covering a total of 95 released versions and 76 consecutive version pairs. Each version was modeled as a directed graph where nodes represent types (classes) and edges capture direct dependencies (inheritance, method calls, field usage). The graph representation allowed the authors to compute a suite of 36 metrics, which they organized into three taxonomic groups: (i) marker or Boolean metrics (binary properties such as “implements interface X”), (ii) local numeric metrics (counts or measures computed per type or package, e.g., lines of code, number of methods), and (iii) global topological metrics (graph‑wide properties such as average path length, clustering coefficient, depth of inheritance tree).
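The graph model and metric taxonomy described above can be sketched as follows. This is an illustrative reconstruction, not the authors' tooling: the tiny example graph and the three metric functions (one per taxonomic family) are assumptions chosen for clarity.

```python
from collections import defaultdict

def build_graph(edges):
    """Adjacency-list view of a type-dependency graph:
    nodes are types, directed edges are dependencies."""
    g = defaultdict(set)
    for src, dst in edges:
        g[src].add(dst)
    return g

# One sketch per metric family in the taxonomy:
def has_no_dependencies(g, node):
    """Marker/Boolean metric: the type depends on no other type."""
    return len(g.get(node, set())) == 0

def fan_out(g, node):
    """Local numeric metric: count of outgoing dependencies."""
    return len(g.get(node, set()))

def average_degree(g):
    """Global topological metric: edges per node over the whole graph."""
    nodes = set(g) | {dst for dsts in g.values() for dst in dsts}
    n_edges = sum(len(dsts) for dsts in g.values())
    return n_edges / len(nodes) if nodes else 0.0

# Hypothetical three-type system: A uses B and C, B uses C.
g = build_graph([("A", "B"), ("A", "C"), ("B", "C")])
```

Here `fan_out(g, "A")` is 2, `has_no_dependencies(g, "C")` holds, and the average degree is 1.0 (three edges over three nodes).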

Statistical analysis relied on Kendall’s τ for ranking similarity, Pearson and Spearman correlations for assessing relationships between metric changes and version‑number cardinality, and simple consistency ratios to quantify reliability (the proportion of metric values that remain unchanged between two successive versions). Version‑number cardinality was defined as 1 for a change in the major component, ½ for a change in the minor component, ¼ for a patch change, and so on.
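The cardinality scheme and the consistency ratio are simple enough to state directly in code. A minimal sketch, assuming metric values are aligned by entity name across versions (the example dictionaries are invented):

```python
def cardinality(component_index):
    """Version-number cardinality as defined above: 1 for a major-component
    change (index 0), 1/2 for minor (index 1), 1/4 for patch (index 2),
    halving with each further component."""
    return 2.0 ** (-component_index)

def consistency_ratio(old, new):
    """Reliability proxy: fraction of metric values unchanged between two
    successive versions, over entities present in both (assumed alignment
    by name)."""
    common = old.keys() & new.keys()
    if not common:
        return 0.0
    return sum(old[k] == new[k] for k in common) / len(common)

# Hypothetical per-type metric values in two consecutive versions:
v1 = {"A": 10, "B": 7, "C": 3}
v2 = {"A": 10, "B": 8, "C": 3}  # only B changed -> ratio 2/3
```

Correlating such ratios (or metric deltas) against the cardinalities is then a direct application of `scipy.stats.spearmanr` or `kendalltau`.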

The empirical results confirm that most metrics exhibit high reliability: over 90 % of metric values are identical across consecutive versions, supporting the general belief that metrics are stable. However, the study uncovers nuanced behavior that challenges a naïve interpretation of this reliability. First, global topological metrics retain their high consistency even when the underlying architecture is randomly perturbed (nodes and edges are shuffled). This suggests that such metrics are largely driven by coarse‑grained properties (e.g., total node/edge counts) rather than genuine architectural characteristics, and therefore may overstate their usefulness for detecting design quality changes.
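The point about coarse-grained properties can be demonstrated with a toy perturbation. The sketch below (a crude randomization; the paper's exact procedure may differ) rewires every edge at random while keeping node and edge counts fixed, and a count-driven global metric such as graph density is unchanged by construction:

```python
import random

def density(n_nodes, edges):
    """Global metric driven purely by coarse counts: the fraction of
    possible directed edges that are present."""
    return len(edges) / (n_nodes * (n_nodes - 1))

def random_rewire(nodes, edges, seed=0):
    """Replace every edge with a random one over the same nodes,
    preserving the node and edge counts."""
    rng = random.Random(seed)
    rewired = set()
    while len(rewired) < len(edges):
        a, b = rng.sample(nodes, 2)  # two distinct nodes, no self-loops
        rewired.add((a, b))
    return rewired

nodes = ["A", "B", "C", "D", "E"]
edges = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A")}
shuffled = random_rewire(nodes, edges)
```

Density is identical before and after the shuffle, even though the architecture the edges encode has been destroyed; finer-grained local metrics would not survive this perturbation.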

Second, Boolean (marker) metrics display a counter‑intuitive pattern: they flip their values more frequently in minor version increments than in major ones. The authors interpret this as evidence that minor releases, while typically involving smaller functional changes, often introduce or remove binary features (e.g., adding an annotation, toggling an interface implementation) without a corresponding large structural overhaul. Conversely, major releases tend to preserve existing binary properties while focusing on broader architectural evolution.
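The flip-rate comparison behind this finding reduces to counting value changes per version-pair group. A minimal sketch with invented data, where the minor-increment pairs flip more often than the major ones:

```python
def flip_rate(pairs):
    """Fraction of (old, new) Boolean-metric value pairs that flipped."""
    if not pairs:
        return 0.0
    return sum(old != new for old, new in pairs) / len(pairs)

# Hypothetical observations of one marker metric across version pairs:
minor_pairs = [(True, False), (False, True), (True, True)]
major_pairs = [(True, True), (False, False), (True, True)]
```

With these figures the minor-increment flip rate is 2/3 against 0 for major increments, mirroring the counter-intuitive direction reported in the study.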

Third, local numeric metrics (e.g., lines of code per class) are more volatile in minor version changes than in major ones. This aligns with the observation that minor releases frequently involve bug fixes, small refactorings, or incremental feature additions that affect implementation details but not the overall architecture. Major releases, by contrast, are more likely to involve systematic restructuring, which can leave many local counts unchanged.

The paper also evaluates several “platitudes” commonly held by practitioners. The size‑variety assumption (software comes in many sizes) is empirically supported by the two‑order‑of‑magnitude variation in types, packages, and edges across the corpus. The belief that major version changes correspond to large code growth is partially confirmed: on average, types increase by 37 % and edges by 46 % between versions, but there are outliers where edges shrink dramatically due to aggressive refactoring. The “mostly‑evolutionary” assumption (most changes are incremental rather than revolutionary) is upheld by median‑vs‑mean comparisons of preserved types/edges, indicating that most version pairs retain a high proportion of existing elements. However, the “locality‑of‑change” assumption (changes are confined to a small subset of the code) is refuted: only about 17 % of types experience unchanged incoming or outgoing relationships, and roughly 3–6 % of types and edges disappear between versions, showing that changes often affect a broader portion of the system than expected.
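The preservation and disappearance figures above come from set comparisons between consecutive versions. A minimal sketch, with the type sets invented for illustration:

```python
def preserved_fraction(old_elems, new_elems):
    """Share of one version's elements (types or edges) that survive
    into the next version."""
    if not old_elems:
        return 1.0
    return len(old_elems & new_elems) / len(old_elems)

def lost_fraction(old_elems, new_elems):
    """Share of elements that disappear between versions."""
    return 1.0 - preserved_fraction(old_elems, new_elems)

# Hypothetical type sets of two consecutive versions: D removed, E added.
types_v1 = {"A", "B", "C", "D"}
types_v2 = {"A", "B", "C", "E"}
```

Applied per version pair, medians of `preserved_fraction` support the mostly-evolutionary assumption, while the `lost_fraction` values (roughly 3–6 % in the corpus) speak against strict locality of change.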

In summary, the study demonstrates that while code metrics are generally reliable, the source of that reliability varies by metric family. Global topological metrics may be reliable for the wrong reasons (insensitivity to structural perturbations), Boolean metrics are surprisingly sensitive to minor version increments, and local numeric metrics reflect implementation‑level churn more than architectural stability. These findings suggest that metric‑based quality assessments should be contextualized: practitioners must consider the type of metric, the nature of version changes, and the underlying assumptions about software evolution. The authors acknowledge limitations such as the exclusive focus on Java open‑source projects, the reliance on version‑number schemes that differ across projects, and the fact that random perturbations ignore semantic information. Future work is proposed to extend the analysis to other languages, to link metric stability with concrete quality outcomes (e.g., defect density), and to develop predictive models that incorporate metric dynamics into release‑management decisions.

