How are identifiers named in open source software? About popularity and consistency

With the rapid growth of software project size and maintenance cost, adherence to coding standards, especially in identifier naming, is becoming a pressing concern for both computer science educators and software managers. Software developers mainly use identifier names to represent the knowledge recorded in source code. However, the popularity and adoption consistency of identifier naming conventions have not yet been studied in this field. An identifier extraction tool based on regular expression matching is developed and applied to forty-eight popular open source projects written in three top-ranking programming languages: Java, C, and C++. The subsequent investigation yields several findings. In terms of popularity, Camel and Pascal naming conventions lead the way while Hungarian notation is vanishing. In terms of consistency, projects written in Java perform much better than those written in C and C++. Finally, academia and the software industry are urged to adopt the most popular naming conventions consistently in practice, so that identifier naming moves toward a standard, unified, and high-quality state.


💡 Research Summary

The paper addresses the growing concern over identifier naming conventions in large‑scale software projects, arguing that consistent naming is essential for readability, maintainability, and collaborative development. To investigate the actual popularity and consistency of naming conventions in real‑world code, the authors selected 48 highly starred open‑source repositories from GitHub, evenly distributed across three of the most widely used languages: Java, C, and C++. Each language contributed 16 projects, each containing at least 10 K lines of source code, ensuring a substantial and representative sample.

A custom extraction tool was built using regular‑expression matching. The tool scans source files, isolates declarations of variables, functions, methods, classes, interfaces, enums, and similar entities, and deliberately excludes comments and string literals. Language‑specific keywords and file extensions guide the pre‑processing stage, allowing the extractor to handle multi‑line declarations and simple macro definitions. After extraction, every identifier is classified into one of five well‑known naming styles: CamelCase (e.g., myVariable), PascalCase (e.g., MyClass), snake_case (e.g., my_variable), kebab‑case (e.g., my-variable), and Hungarian notation (e.g., szName). The classification logic checks case patterns, the presence of underscores or hyphens, and known Hungarian prefixes, providing a precise mapping for each identifier.
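
The pre-processing and classification steps described above can be sketched roughly as follows. This is an illustrative Python approximation, not the authors' tool: the function names `strip_noise` and `classify`, the Hungarian prefix list, and the order in which the rules are applied are all assumptions made for the example.

```python
import re

# Hypothetical prefix list: the summary does not say which prefixes were checked.
HUNGARIAN_PREFIXES = ("sz", "lp", "dw", "b", "n", "p")

# C-family comments and literals that should not contribute identifiers.
_NOISE = re.compile(
    r'/\*.*?\*/'            # block comment
    r'|//[^\n]*'            # line comment
    r'|"(?:\\.|[^"\\])*"'   # string literal
    r"|'(?:\\.|[^'\\])*'",  # character literal
    re.DOTALL,
)


def strip_noise(source):
    """Replace comments and string/char literals with a single space."""
    return _NOISE.sub(" ", source)


def classify(name):
    """Assign an identifier to one naming style using simple surface checks."""
    if not name:
        return "other"
    if "-" in name:
        return "kebab-case"
    if "_" in name:
        # Any underscore counts as snake_case here, including UPPER_CASE constants.
        return "snake_case"
    for prefix in HUNGARIAN_PREFIXES:
        # A known lowercase prefix directly followed by an uppercase letter.
        if (name.startswith(prefix) and len(name) > len(prefix)
                and name[len(prefix)].isupper()):
            return "Hungarian"
    if name[0].isupper():
        return "PascalCase"
    if name[0].islower():
        # Single lowercase words also land here, which is an arbitrary choice.
        return "CamelCase"
    return "other"
```

Because the Hungarian check runs before the case checks, a name such as `nCount` is counted as Hungarian rather than CamelCase. A real study would have to decide how to resolve that overlap, and the summary does not say how the authors did.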

Statistical analysis of the aggregated data reveals clear trends. Across all projects, CamelCase accounts for 48.7 % of identifiers, making it the dominant style, followed by PascalCase at 22.3 %. Snake_case appears in 15.2 % of cases, primarily within the C and C++ codebases, while kebab‑case is observed in 11.7 % of identifiers, usually in configuration‑related files or documentation strings. Hungarian notation is virtually extinct, representing only 2.1 % of the total and dropping below 0.5 % in the most recent C and C++ projects. These figures suggest that object‑oriented Java projects have naturally gravitated toward CamelCase and PascalCase for classes, methods, and fields, whereas procedural languages are gradually abandoning the once‑popular Hungarian convention.

To assess naming‑style consistency, the authors defined a “consistency score” for each project: the proportion of identifiers that adhere to a single predominant naming convention (range 0–1). Java projects achieve an average score of 0.84, indicating a high level of uniformity. In contrast, C projects average 0.62 and C++ projects 0.58, reflecting more heterogeneous naming practices. The authors attribute Java’s superior consistency to the widespread adoption of formal style guides (e.g., Google Java Style Guide) and the built‑in enforcement features of modern IDEs such as IntelliJ IDEA and Eclipse. The C and C++ ecosystems, by contrast, lack a single dominant style guide, and the extensive use of macros, templates, and legacy code contributes to greater variability.
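
As a minimal sketch of the consistency score defined above, the following Python function takes the style label of every identifier in a project and returns the share held by the most common label. The function name and the choice to return 0.0 for an empty project are assumptions made for illustration.

```python
from collections import Counter


def consistency_score(styles):
    """Share of identifiers that use the project's most common style (0 to 1)."""
    if not styles:
        return 0.0  # assumed convention for a project with no identifiers
    counts = Counter(styles)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(styles)


# Three of four identifiers share one style, so the score is 0.75.
example = ["CamelCase", "CamelCase", "CamelCase", "snake_case"]
print(consistency_score(example))
```

A score of 1.0 means every identifier follows the same convention. The lowest possible score for a non-empty project with k observed styles is 1/k.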

The paper draws several practical implications. First, educators should emphasize CamelCase and PascalCase as the default naming conventions when teaching Java, C, and C++, aligning curricula with industry practice. Second, software teams should codify language‑specific naming policies and integrate automated linting/formatting tools (e.g., Checkstyle for Java, clang‑format for C/C++) into their CI pipelines to maintain high consistency scores. Third, the near‑obsolescence of Hungarian notation suggests that organizations can safely phase it out without harming code quality, focusing instead on more modern, readable styles.

Limitations are acknowledged. Regular‑expression based extraction can misclassify identifiers in complex macro expansions, template meta‑programming, or when language extensions are used, leading to false positives or missed identifiers. The authors propose future work employing abstract syntax tree (AST) parsing for higher accuracy, extending the study to additional languages (e.g., Python, Kotlin, Rust) and mixed‑language projects, and investigating the causal relationship between naming conventions and software quality metrics such as defect density, code churn, and refactoring effort. By quantifying these impacts, the community could develop evidence‑based standards that further improve the maintainability and collaborative efficiency of large software systems.
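
To illustrate why AST parsing avoids the false positives of regular expressions, here is a small sketch using Python's standard `ast` module on Python source (one of the languages named for future work). It is not the approach planned by the authors for Java, C, or C++, which would need a parser for those languages. The function name `declared_identifiers` is hypothetical.

```python
import ast


def declared_identifiers(source):
    """Collect names introduced by definitions, parameters, and assignments."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.append(node.name)
        elif isinstance(node, ast.arg):
            names.append(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.append(node.id)
    return names


sample = (
    "class MyClass:\n"
    "    def do_work(self, item_count):\n"
    "        total = item_count  # fake_name\n"
    "        return 'not_a_name'\n"
)
print(sorted(declared_identifiers(sample)))
```

Because the parser already knows which tokens are comments and string literals, `fake_name` and `not_a_name` never appear in the result, and no separate stripping pass is needed.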