Algorithmic Programming Language Identification
📝 Abstract
Motivated by the amount of code that goes unidentified on the web, we introduce a practical method for algorithmically identifying the programming language of source code. Our work is based on supervised learning and intelligent statistical features. We also explored, but abandoned, a grammatical approach. In testing, our implementation greatly outperforms that of an existing tool that relies on a Bayesian classifier. Code is written in Python and available under an MIT license.
📄 Content
arXiv:1106.4064v2 [cs.LG] 9 Nov 2011

David Klein, Kyle Murray and Simon Weber
University of Rochester
November 1, 2011

1 Introduction and Motivation

The purpose of algorithmic programming language detection is to determine the programming language in which a particular program or program fragment was written. The goal of this research is to devise techniques to recognize and accurately identify code from as many programming languages as possible, without starting with any information other than the code itself.

One large domain in which many useful programs and code fragments go unidentified is the web. Such code appears in blogs, forums, mailing lists, documentation, and many other contexts where the supporting web software does not provide a facility for identifying the language of a particular program. Identifying information unlocks various possibilities. For instance, code found on the web becomes more searchable, beyond the limited searchability that tools such as Google's code search provide for explicitly marked full programs. In addition, code fragments could become more readable when they are automatically syntax highlighted based on the proper language grammar. Other potential uses range from data recovery on damaged file systems to enabling automatic code grading systems to be language agnostic.

2 Existing Tools

Simple tools for identifying source code already exist.
Various syntax highlighting tools, such as Google Code Prettify, will automatically highlight syntax given some code [3]. However, these tools do not actually identify languages; instead, they use heuristics that make the highlighting work well. In the case of Google Code Prettify, broad grammars (such as C-like, Bash-like, and XML-like) are preprogrammed. These grammars are then used to scan code, and the best-matching grammar is used for highlighting. Clearly, languages that share a grammar cannot be distinguished from one another (and they do not need to be for the highlighting to work).

More relevant is SourceClassifier, which attempts to identify a programming language given some code [2]. However, it relies on a simple Bayesian classifier. Its strength is therefore limited by the quality of its training data, and it can easily be thrown off by strings and comments.

These existing tools show a practical need for our work, and we are confident that they can be improved upon.

3 Approach

Our approach involves obtaining a large database of source code written in a multitude of languages. We then train our program on the database by giving it pieces of source code in identified languages. We can use the resulting knowledge to evaluate source code from unknown languages and determine which language it is likely to be written in.

3.1 Training Data

All training data was collected from the code hosting website GitHub by a custom web crawler. GitHub's code tagging was used to initially identify samples. However, this was unlikely to be accurate in all cases. To address this, we verified the samples via file extension; only files that matched the common file extension for a language were used to train that language. This minimized noise in our training data.
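The extension check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' crawler: the `COMMON_EXTENSIONS` table, the `verify_samples` function, and the sample paths are all hypothetical, and the paper does not say which extension it treated as the common one for each language.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from a language tag to its accepted file extensions.
# The paper does not list the extensions it used; these are assumptions.
COMMON_EXTENSIONS = {
    "Python": {".py"},
    "Ruby": {".rb"},
    "Haskell": {".hs"},
    "C": {".c"},
}


def verify_samples(samples):
    """Keep only (language, path) pairs whose file extension matches the tag."""
    kept = []
    for language, path in samples:
        extension = PurePosixPath(path).suffix.lower()
        # Languages missing from the table are dropped rather than trusted.
        if extension in COMMON_EXTENSIONS.get(language, set()):
            kept.append((language, path))
    return kept


if __name__ == "__main__":
    tagged = [
        ("Python", "tools/crawl.py"),
        ("Python", "docs/README.md"),  # tagged Python by the host, but not a .py file
        ("Ruby", "lib/parser.rb"),
        ("C", "src/main.h"),           # header file: dropped under this strict table
    ]
    for language, path in verify_samples(tagged):
        print(language, path)
```

A strict table like this trades recall for precision: it discards some genuine samples (such as C headers) in exchange for fewer mislabeled files, which matches the stated aim of reducing noise in the training set.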
Our training data is 324 megabytes and contains over 41 thousand source code files in the following languages:

• ActionScript
• Ada
• Brainfuck
• C
• C#
• C++
• Common Lisp
• CSS
• Erlang
• Haskell
• HTML
• Java
• JavaScript
• Lua
• MATLAB
• Objective-C
• Perl
• PHP
• Python
• Ruby
• Scala
• Scheme
• Smalltalk
• LaTeX
• XML

3.1.1 Comment and String Detection

Since comments and strings in any language can have arbitrary content, it is important to exclude their content from consideration during training and identification. Our algorithm for identifying comments and strings depends on a simple heuristic: comments and strings are some of the only parts of code that will contain natural language, and so will often have alphabetic-only words separated by spaces. We define a line of code to have the words property if it contains a sequence of alphabetic-only words separated by spaces and ending in a space or a newline. This property is used in combination with other heuristics throughout the algorithm.

The first step in our algorithm identifies lines that are likely to contain a comment or string. This is done by simply finding lines that match the words property; these will be referred to as candidate lines. For all candid
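
The words property and the candidate-line step described above can be sketched with a regular expression. This is one possible reading, not the authors' implementation: the names `build_words_pattern`, `has_words_property`, and `candidate_lines` are hypothetical, the minimum of two words per sequence is an assumption (the text does not give a count), and treating the end of a string as a newline is a convenience for lines that have already been split.

```python
import re


def build_words_pattern(min_words=2):
    """Compile a pattern for a run of alphabetic-only words separated by spaces.

    Assumption: a "sequence" means at least `min_words` words; the paper
    does not state a minimum.
    """
    if min_words < 1:
        raise ValueError("min_words must be at least 1")
    # The lookbehind stops a match from starting in the middle of an identifier.
    # The lookahead requires the run to end in a space, a newline, or end of string.
    return re.compile(
        r"(?<![A-Za-z0-9_])(?:[A-Za-z]+ ){%d}[A-Za-z]+(?=[ \n]|$)" % (min_words - 1)
    )


WORDS_PATTERN = build_words_pattern()


def has_words_property(line):
    """Return True if the line contains a qualifying run of words."""
    return WORDS_PATTERN.search(line) is not None


def candidate_lines(source):
    """Return (line_number, line) pairs for lines that have the words property."""
    return [
        (number, line)
        for number, line in enumerate(source.splitlines(), start=1)
        if has_words_property(line)
    ]


if __name__ == "__main__":
    sample = "total = 0\n# add up all the values\ntotal += x\n"
    for number, line in candidate_lines(sample):
        print(number, line)
```

Under this reading, keyword sequences such as `for i in range(10):` also satisfy the property, which is consistent with the text's remark that it is used in combination with other heuristics rather than on its own.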