Algorithmic Programming Language Identification

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Motivated by the amount of code that goes unidentified on the web, we introduce a practical method for algorithmically identifying the programming language of source code. Our work is based on supervised learning and intelligent statistical features. We also explored, but abandoned, a grammatical approach. In testing, our implementation greatly outperforms an existing tool that relies on a Bayesian classifier. The code is written in Python and available under an MIT license.


💡 Research Summary

The paper addresses the practical problem of automatically identifying the programming language of source code fragments that appear on the web without any accompanying metadata. Existing tools either rely on heuristic syntax highlighting (e.g., Google Code Prettify) or on simple Bayesian classifiers (e.g., SourceClassifier), both of which struggle with noise introduced by comments, strings, and ambiguous syntax.

To overcome these limitations, the authors build a supervised learning system that leverages a large, curated dataset of 41 000 source files (≈324 MB) collected from GitHub. Files are initially labeled using GitHub tags and then filtered by file extension to reduce labeling noise. The dataset covers 24 languages, ranging from mainstream (C, Java, Python) to esoteric (Brainfuck, Chef).
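As a rough illustration of the filtering step, a file might be kept only when its extension agrees with its GitHub language tag. This is a hedged sketch, not the paper's actual pipeline; the extension table is a hypothetical stand-in for the full 24-language mapping:

```python
# Hypothetical extension map standing in for the paper's 24-language table.
EXTENSIONS = {
    "Python": {".py"},
    "Java": {".java"},
    "C": {".c", ".h"},
}

def keep_file(filename: str, github_tag: str) -> bool:
    """Accept a training file only when its extension matches its GitHub
    language tag, discarding likely mislabeled samples."""
    allowed = EXTENSIONS.get(github_tag, set())
    return any(filename.endswith(ext) for ext in allowed)
```

Agreement between two independent labels (tag and extension) is a cheap way to trade corpus size for label quality.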

A key preprocessing step is the detection and removal of comments and string literals. The authors define a “words property” – lines composed solely of alphabetic words separated by spaces – to locate candidate comment or string lines. The longest such capture on each line is retained, and a heuristic search expands outward to locate matching opening and closing delimiters. This step isolates language‑agnostic text, preventing it from contaminating statistical feature extraction.
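A minimal sketch of the words property and the longest-capture search described above; the outward search for matching delimiters is omitted, and the helper names are ours, not the paper's:

```python
import re

# A fragment has the "words property" when it is only alphabetic words
# separated by single spaces.
WORDS_RE = re.compile(r"^[A-Za-z]+( [A-Za-z]+)*$")

def words_property(fragment: str) -> bool:
    """True when the fragment consists solely of alphabetic words."""
    return bool(WORDS_RE.match(fragment.strip()))

def longest_words_capture(line: str) -> str:
    """Longest substring of a line satisfying the words property --
    a candidate comment or string body (delimiter search not shown)."""
    best = ""
    tokens = line.split(" ")
    for i in range(len(tokens)):
        for j in range(i, len(tokens)):
            candidate = " ".join(tokens[i:j + 1])
            if words_property(candidate) and len(candidate) > len(best):
                best = candidate
    return best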

The core of the system consists of seven statistical features, each quantified and compared against language‑specific reference distributions:

  1. Brackets – relative frequencies of parentheses, curly braces, square brackets, and angle brackets.
  2. FirstWord – the first token of each line, capturing language‑specific declarations such as “public” or “int”.
  3. Keywords – frequency of pure‑alphabetic words, serving as a bag‑of‑keywords model.
  4. LastCharacter – the terminal character of each line, useful for languages that terminate statements with a period (Prolog) or semicolon (Java, C).
  5. Operators – sequences of punctuation symbols stripped of letters and numbers, identifying language‑specific operator sets.
  6. Punctuation – the ratio of punctuation marks to letters, which distinguishes highly punctuated esoteric languages.
  7. Comments and Strings – direct matching of detected comment/string delimiters with those known for each language.
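
Two of the listed features are straightforward to sketch. The helpers below (our names, not the paper's) compute the LastCharacter distribution and the punctuation-to-letter ratio:

```python
from collections import Counter
import string

def last_character_distribution(code: str) -> dict:
    """Proportion of non-blank lines ending in each character (feature 4)."""
    lasts = [line.rstrip()[-1] for line in code.splitlines() if line.strip()]
    return {ch: n / len(lasts) for ch, n in Counter(lasts).items()}

def punctuation_ratio(code: str) -> float:
    """Punctuation marks per letter (feature 6)."""
    punct = sum(c in string.punctuation for c in code)
    letters = sum(c.isalpha() for c in code)
    return punct / max(letters, 1)
```

In Java- or C-like code most lines end in ";", while heavily punctuated esoteric languages such as Brainfuck push the punctuation ratio far above that of mainstream languages.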

For each feature, the authors compute a score \(s_l\) for each language \(l\). The distribution-based features use a squared-error formulation, \(s_l = \frac{1}{\sum_i (p_{i,l} - x_i)^2}\), where \(p_{i,l}\) is the reference proportion of item \(i\) in language \(l\) and \(x_i\) is the observed proportion in the unknown snippet. The FirstWord, Keywords, LastCharacter, and Operators features use the same formulation but sum only over the most frequent tokens. The Comments and Strings score is simply the count of exact delimiter matches. Each feature's scores are normalized to sum to one across languages, then summed across features to obtain a final language ranking.
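Under these definitions, the per-feature scoring and normalization could be sketched as follows; the reference distributions here are invented for illustration only:

```python
def squared_error_score(reference: dict, observed: dict) -> float:
    """s_l = 1 / sum_i (p_il - x_i)^2: higher when the observed feature
    distribution is closer to language l's reference distribution."""
    keys = set(reference) | set(observed)
    err = sum((reference.get(k, 0.0) - observed.get(k, 0.0)) ** 2 for k in keys)
    return 1.0 / err if err else float("inf")

def normalize(scores: dict) -> dict:
    """Scale one feature's per-language scores so they sum to one."""
    total = sum(scores.values())
    return {lang: s / total for lang, s in scores.items()}

# Invented LastCharacter reference distributions, for illustration only.
refs = {
    "Java": {";": 0.8, "{": 0.2},
    "Python": {":": 0.3, ")": 0.2, "t": 0.5},
}
observed = {";": 0.9, "{": 0.1}
scores = normalize({l: squared_error_score(r, observed) for l, r in refs.items()})
# A snippet whose lines mostly end in ';' should rank Java above Python here.
```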

The authors evaluated the system on 25 randomly selected files from their corpus, deliberately ignoring file extensions to simulate real‑world conditions. The results were: 48 % (12/25) correct at rank 1, 12 % at rank 2, another 12 % at rank 3, 4 % at rank 4, and 12 % not appearing in the top 5. For comparison, they retrained SourceClassifier on the same dataset and obtained only 12 % correct at rank 1, with the rest spread across lower ranks. This demonstrates a substantial performance gain over the baseline Bayesian approach.
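The rank-based accuracy reported above can be computed with a small helper along these lines (a sketch; the function name and interface are ours):

```python
def rank_accuracy(rankings, truths, k=5):
    """Fraction of test files whose true language appears at each rank
    1..k, plus the fraction missing from the top k entirely."""
    counts = [0] * (k + 1)  # counts[r] = files correct at rank r
    misses = 0
    for ranked, truth in zip(rankings, truths):
        rank = ranked.index(truth) + 1 if truth in ranked else None
        if rank is not None and rank <= k:
            counts[rank] += 1
        else:
            misses += 1
    n = len(truths)
    return [c / n for c in counts[1:]], misses / n

per_rank, miss = rank_accuracy([["C", "Java"], ["Java", "C"]], ["C", "C"], k=2)
```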

The paper also describes an exploratory grammar‑based approach using OMeta (a PEG‑oriented language). The idea was to learn small grammar fragments from code and combine them into a language‑wide grammar, scoring fragments by the proportion of tokens they successfully parse. However, the authors encountered two major obstacles: (1) defining a reliable scoring metric for partial grammars, and (2) determining the correct ordering and combination of fragments without prior knowledge of grammar precedence. Consequently, the grammar route was abandoned for future work.

In conclusion, the authors present a practical, statistically‑driven language identification framework that outperforms existing bag‑of‑words Bayesian classifiers. The system is implemented in Python, released under an MIT license, and includes the training corpus for reproducibility. Future directions include refining the comment/string detection heuristics, expanding the language set, integrating deep‑learning based token embeddings, and revisiting the grammar‑based component to create a hybrid model that could further boost accuracy.

