Algorithmic Programming Language Identification

February 23, 2026

Reading time: 5 minutes

...

📝 Abstract

Motivated by the amount of code that goes unidentified on the web, we introduce a practical method for algorithmically identifying the programming language of source code. Our work is based on supervised learning and intelligent statistical features. We also explored, but abandoned, a grammatical approach. In testing, our implementation greatly outperforms that of an existing tool that relies on a Bayesian classifier. Code is written in Python and available under an MIT license.

💡 Analysis

📄 Content

arXiv:1106.4064v2 [cs.LG] 9 Nov 2011

Algorithmic Programming Language Identification

David Klein, Kyle Murray and Simon Weber
University of Rochester
November 1, 2011

Abstract

Motivated by the amount of code that goes unidentified on the web, we introduce a practical method for algorithmically identifying the programming language of source code. Our work is based on supervised learning and intelligent statistical features. We also explored, but abandoned, a grammatical approach. In testing, our implementation greatly outperforms that of an existing tool that relies on a Bayesian classifier. Code is written in Python and available under an MIT license.

1 Introduction and Motivation

The purpose of algorithmic programming language detection is to determine the programming language with which a particular program or program fragment was written. The goal of this research is to devise techniques to recognize and accurately identify code from as many programming languages as possible, without starting with any information other than the code itself.

One large domain in which many intensely useful programs and code fragments go unidentified is the web. Such code appears in blogs, forums, mailing lists, documentation, and many other contexts where the supporting web software does not provide a facility for identifying the language of a particular program. With identifying information, various possibilities are unlocked. For instance, code found on the web now becomes more searchable, more so than the limited form of searchability that tools such as Google's code search provide for explicitly marked full programs. In addition, code fragments could become more readable when they are automatically syntax highlighted based on the proper language grammar. Other potential uses range from data recovery on damaged file systems to enabling automatic code grading systems to be language agnostic.

2 Existing Tools

Simple tools for identifying source code already exist.
Various syntax highlighting tools such as Google Code Prettify will automatically highlight syntax given some code [3]. However, these tools do not actually identify languages; instead, they use heuristics that will make the highlighting work well. In the case of Google Code Prettify, broad grammars (such as C-like, Bash-like, and Xml-like) are preprogrammed. These grammars are then used to scan code, and the best matching grammar is used in highlighting. Clearly, languages that share a grammar cannot be distinguished from one another (and they do not need to be for the highlighting to work).

More relevant is SourceClassifier, which will attempt to identify a programming language given some code [2]. However, it relies on a simple Bayesian classifier. Its strength is therefore limited by the quality of its training data, and it can easily be thrown off by strings and comments.

These existing tools show a practical need for our work, and we are confident that they can be improved upon.

3 Approach

Our approach involves obtaining a large database of source code, written in a multitude of languages. We then train our program on the database by giving it pieces of source code in identified languages. We can use the resulting knowledge to evaluate source code from unknown languages and determine which language it is likely to be written in.

3.1 Training Data

All training data was collected from the code hosting website Github by a custom web crawler. Github's code tagging was used to initially identify samples. However, this was unlikely to be accurate in all cases. To address this, we verified the samples via file extension; only files that matched the common file extension for a language were used to train that language. This minimized noise in our training data.
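
The file-extension check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not list the extensions it accepted, so the `COMMON_EXTENSIONS` mapping and the `verify_samples` helper are hypothetical names introduced here, not the authors' crawler code.

```python
from pathlib import Path

# Hypothetical mapping from a language tag to its accepted file extensions.
# The paper does not publish its list; these entries are illustrative only.
COMMON_EXTENSIONS = {
    "Python": {".py"},
    "Ruby": {".rb"},
    "Haskell": {".hs"},
    "C": {".c", ".h"},
}


def verify_samples(samples):
    """Keep only (language, path) pairs whose extension fits the tagged language.

    `samples` is an iterable of (language_tag, file_path) pairs, for example
    as produced by a crawler that trusts the hosting site's language tag.
    """
    verified = []
    for language, path in samples:
        allowed = COMMON_EXTENSIONS.get(language, set())
        # Unknown languages have no allowed extensions, so they are dropped.
        if Path(path).suffix.lower() in allowed:
            verified.append((language, path))
    return verified
```

Dropping a sample whenever the tag and the extension disagree trades some training data for cleaner labels, which matches the stated goal of minimizing noise.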
Our training data is 324 megabytes and contains over 41 thousand source code files in the following languages:

• Actionscript
• Ada
• Brainfuck
• C
• C#
• C++
• Common Lisp
• CSS
• Erlang
• Haskell
• HTML
• Java
• Javascript
• Lua
• Matlab
• Objective C
• Perl
• PHP
• Python
• Ruby
• Scala
• Scheme
• Smalltalk
• Latex
• XML

3.1.1 Comment and String Detection

Since comments and strings in any language can have arbitrary content, it is important to exclude their content from consideration during training and identification. Our algorithm for identifying comments and strings depends on a simple heuristic: comments and strings are some of the only parts of code that will contain natural language, and so will often have alphabetic-only words separated by spaces. We will define a line of code to have the words property if it contains a sequence of alphabetic-only words separated by spaces and ending in a space or a newline. This is used in combination with other heuristics throughout the algorithm.

The first step in our algorithm identifies lines that are likely to contain a comment or string. This is done by simply finding lines that match the words property; these will be referred to as candidate lines. For all candid
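
As a rough illustration of the words property defined above, the check can be written as a regular expression. This is a sketch, not the authors' implementation. It assumes that a "sequence" means at least two alphabetic-only words, that words are separated by single spaces, and that the sequence must start at a token boundary. The names `WORDS_PATTERN`, `has_words_property` and `candidate_lines` are hypothetical.

```python
import re

# Assumed reading of the words property: two or more alphabetic-only words,
# separated by single spaces, followed by a space, a newline, or end of line.
# The lookbehind keeps a match from starting in the middle of an identifier.
WORDS_PATTERN = re.compile(
    r"(?<![A-Za-z0-9_])(?:[A-Za-z]+ )+[A-Za-z]+(?= |\n|$)"
)


def has_words_property(line):
    """Return True if the line contains a run of alphabetic-only words."""
    return WORDS_PATTERN.search(line) is not None


def candidate_lines(source):
    """Return the lines of `source` that match the words property."""
    return [line for line in source.splitlines() if has_words_property(line)]
```

Under this reading, some ordinary code lines made of bare keywords and identifiers would also match, which is consistent with the text treating matches only as candidates to be combined with other heuristics.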

This content is AI-processed based on ArXiv data.
