A Language-Agnostic Model for Semantic Source Code Labeling


Authors: Ben Gelman, Bryan Hoyle, Jessica Moore, Joshua Saxe, David Slater

Ben Gelman, Two Six Labs, LLC., Arlington, Virginia, USA (ben.gelman@twosixlabs.com)
Bryan Hoyle, Two Six Labs, LLC., Arlington, Virginia, USA (bryan.hoyle@twosixlabs.com)
Jessica Moore, Two Six Labs, LLC., Arlington, Virginia, USA (jessica.moore@twosixlabs.com)
Joshua Saxe, Sophos, Fairfax, Virginia, USA (joshua.saxe@sophos.com)
David Slater, Two Six Labs, LLC., Tacoma, Washington, USA (david.slater@twosixlabs.com)

ABSTRACT

Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new programming languages, libraries, and functionalities. Comprehensive labeling of source code enables users to search for documents of interest and obtain a high-level understanding of their contents. We use Stack Overflow code snippets and their tags to train a language-agnostic, deep convolutional neural network to automatically predict semantic labels for source code documents. On Stack Overflow code snippets, we demonstrate a mean area under ROC of 0.957 over a long-tailed list of 4,508 tags. We also manually validate the model outputs on a diverse set of unlabeled source code documents retrieved from GitHub, and obtain a top-1 accuracy of 86.6%. This strongly indicates that the model successfully transfers its knowledge from Stack Overflow snippets to arbitrary source code documents.

CCS CONCEPTS

• Computing methodologies → Artificial intelligence; Machine learning; Natural language processing; Machine learning approaches; Neural networks

KEYWORDS

deep learning, source code, natural language processing, multilabel classification, semantic labeling, crowdsourcing

ACM Reference Format:
Ben Gelman, Bryan Hoyle, Jessica Moore, Joshua Saxe, and David Slater. 2018. A Language-Agnostic Model for Semantic Source Code Labeling. In Proceedings of the 1st International Workshop on Machine Learning and Software Engineering in Symbiosis (MASES '18), September 3, 2018, Montpellier, France. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3243127.3243132

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor, or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
MASES '18, September 3, 2018, Montpellier, France
© 2018 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-5972-6/18/09...$15.00
https://doi.org/10.1145/3243127.3243132

1 INTRODUCTION

In recent years, the quantity of available source code has been growing exponentially [10]. Code reuse at this scale is predicated on understanding and searching through a massive number of projects and source code documents. The ability to generate meaningful, semantic labels is key to comprehending and searching for relevant source code, especially as the diversity of programming languages, libraries, and code content continues to expand.

The search functionality for current large code repositories, such as GitHub [12] and SourceForge [31], will match queried terms in the source code, comments, or documentation of a project. More sophisticated search approaches have shown better performance in retrieving relevant results, but they often insufficiently handle scale, breadth, or ease of use. Santanu and Prakash [26] develop pattern languages for C and PL/AS that allow users to write generic code-like schematics. Although the schematics locate specific source code constructs, they do not capture the general functionality of a program and scale poorly to large code corpora. Bajracharya et al. [3] develop a search engine called Sourcerer that enhances keyword search by extracting features from its code corpus. Sourcerer scales well to large corpora, but it is still hindered by custom language-specific parsers. Suffering from a similar problem, Exemplar [24] is a system that tracks the flow of data through various API calls in a project. Exemplar also uses the documentation for projects/API calls in order to match a user's keywords. Recent applied works have similar shortcomings [5] [17] [29] [32].

Creating a solution that operates across programming languages, libraries, and projects is difficult due to the complexity of modeling such a huge variety of code. As a step in that direction, we present a novel framework for generating labels for source code of arbitrary language, length, and domain. Using a machine learning approach, we capitalize on a wealth of crowdsourced data from Stack Overflow (SO) [25], a forum that provides a constantly growing source of code snippets that are user-labeled with programming languages, tool sets, and functionalities. Prior works have attempted to predict a single label for an SO post [19] [33] using both the post's text and source code as input. To our knowledge, our work is the first to use Stack Overflow to predict exclusively on source code. Additionally, prior methods do not attempt multilabel classification, which becomes a significant issue when labeling realistic source code documents instead of brief SO snippets.

Figure 1: An example prediction of our model. The input code snippet is on the left, while the predicted labels and their raw certainties are on the right. Keyword matching on the predicted labels would not have been able to locate this code.
Our approach utilizes SO's code snippets to simultaneously model thousands of concepts and predict on previously unseen source code, as demonstrated in Fig. 1. We construct a deep convolutional neural network that directly processes source code documents of arbitrary length and predicts their functionality using pre-existing Stack Overflow tags. As users ask questions about new programming languages and tools, the model can be retrained to maintain up-to-date representations. Our contributions are as follows:

• First work, to our knowledge, to introduce a baseline for multilabel tag prediction on Stack Overflow posts.
• A convolutional neural network architecture that can handle arbitrary-length source code documents and is agnostic to programming language.
• State-of-the-art top-1 accuracy (79% vs. 65% [33]) for predicting tags on Stack Overflow posts, using only code snippets as input.
• An approach that enables tagging of source code corpora external to Stack Overflow, which is validated by a human study.

We organize the rest of the paper as follows: section 2 discusses related works, section 3 details data preprocessing and correction, section 4 explains our neural network architecture and validation, section 5 displays our results, section 6 presents challenges and limitations, and section 7 considers future work.

2 RELATED WORK

Due to the parallels between source code and natural language [14] [1], we find that recent work in the natural language processing (NLP) domain is relevant to our problem. Modern NLP approaches have generated state-of-the-art results with long short-term memory neural networks (LSTMs) and convolutional neural networks (CNNs). Sundermeyer, Schlüter, and Ney [34] have shown that LSTMs perform better than n-grams for modeling word sequences, but the vocabulary size for word-level models is often large, requiring a massive parameter space.
Kim, Jernite, Sontag, and Rush [16] show that by combining a character-level CNN with an LSTM, they can achieve comparable results while having 60% fewer parameters. Further work shows that CNNs are able to achieve state-of-the-art performance without the training time and data required for LSTMs [8]. In the source code domain, however, prior work has utilized a wide variety of methods.

In 1991, Maarek, Berry, and Kaiser [22] recognized that there was a lack of usable code libraries. Libraries were difficult to find, adapt, and integrate without proper labeling, and locating components functionally close to a given topic posed a challenge. The authors developed an information retrieval approach leveraging the co-occurrence of neighboring terms in code, comments, and documentation.

More recently, Kuhn, Ducasse, and Gírba [18] apply Latent Semantic Indexing (LSI) and hierarchical clustering in order to analyze source code vocabulary without the use of external documentation. LSI-based methods have had success in the code comprehension domain, including document search engines [4] and IDE-integrated topic modeling [11]. Although the method seems to perform well, labeling an unseen source code document requires reclustering the entire dataset. This is a significant setback for maintaining a constantly growing corpus of labeled documents.

In the context of source code labeling, supervised methods are mostly unexplored. A critical issue in this task is the massive amount of labeled data required to create the model. A few efforts have recognized Stack Overflow for its wealth of crowdsourced data. Saxe, Turner, and Blokhin [28] search for Stack Overflow posts containing strings found in malware binaries, and use the retrieved tags to label the binaries. Kuo [19] attempts to predict tags on SO posts by computing the co-occurrence of tags and words in each post.
He achieves a 47% top-1 accuracy, which in this context is the task of predicting only one tag per post. Clayton and Byrne [33] also attempt to predict a tag for SO posts. They invest a great deal of effort in feature extraction inspired by ACT-R's declarative memory retrieval mechanisms [2]. Utilizing logistic regression, they achieve a 65% top-1 accuracy.

In this work, we generate a more complex machine learning model than those present in previous attempts. Because we intend to generalize our model to source code files, we make our tests stricter by only using the actual code inside Stack Overflow posts as inputs to the model. Despite the information loss from not taking advantage of the entire post text, we still further improve on the performance of prior work and obtain a 78.7% top-1 accuracy.

3 DATA

The primary goal of our work is to create a machine learning system that will classify the functionality of source code. We achieve this by leveraging Stack Overflow, a large, public question and answer forum focused on computer programming. Users can ask a question, provide a response, post code snippets, and vote on posts. The SO dataset provides several advantages in particular: a huge quantity of code snippets; a wide set of tags that cover concepts, tools, and functionalities; and a platform that is constantly updated by users asking and answering questions about new technologies. Due to the complexity of the data, we use this section to discuss the data's characteristics and our preprocessing procedures in detail.

Figure 2: A Stack Overflow thread with a question and answer. The thread's tags are boxed in red and the code snippets are boxed in blue. For the purpose of training our model, the tags are the output labels and the code snippets are the input features. We can see from this example that the longer snippets look like valid code, while the shorter snippets are not as useful.

Fig. 2 is an example of a Stack Overflow thread. Users who ask a question are allowed to attach a maximum of five tags. Although there is a long list of previously used tags, users are free to enter any text. The tags are often chosen with programming language, concepts, and functionality in mind. The tags for this example, boxed in red, are "python," "list," and "slice." Additionally, any user is allowed to block off text as a code snippet. In this example, the user providing an answer uses many code snippets, which have been boxed in blue. Although the code snippets may describe a particular functionality, they do not necessarily represent a complete or syntactically correct program.

Our initial intuition is that the code snippets can simply be input into a machine learning model with the user-created tags as labels. This trained model would then be able to accept any source code and provide tags as output. As we further analyze the data, several questions need to first be resolved, including how to associate tags with snippets, what constitutes a single code sample, and which data should be filtered from the dataset.

Figure 3: The distribution of snippet lengths in the full dataset, with frequencies logarithmically scaled. Although short code snippets are extremely common, they have limited value.

Stack Overflow's threads are the fundamental pieces of our training data. The publicly available SO data dump provides over 70 million source code snippets with labels that would be useful for real-world projects. Because the tags are selected at the thread level while snippets occur in individual posts, we assign the thread's tags to each post in that thread.
Since a single post can have many code snippets, we choose to concatenate the snippets using newline characters as separators in order to preserve a user's post as a single idea. Although these transformations ensure that a post will suffice as input to a language-level model, they do not guarantee the usefulness of the snippets themselves. The following section will address several problems: short, uninformative code snippets; user error in tagging posts and generating code with the correct functionality; and the long-tailed distribution of unusable tags.

3.1 Statistics and Data Correction

As of December 15, 2016, the Stack Overflow data dump contains 24,158,127 posts that have at least one code snippet, 73,934,777 individual code snippets, and 46,663 total tags. Despite the large amount of data, there is a severe long-tailed phenomenon that is common in many online communities [13]. The distributions of code-snippet length and number of tags per post are of particular importance to our problem.

Fig. 3 shows the distribution of individual snippet lengths, measured in number of characters, throughout Stack Overflow. As one would expect, the longer snippets are many orders of magnitude less frequent than the shorter snippets. Fig. 4 further demonstrates that, of the many short snippets, there is a huge quantity that are empty strings or are only a few characters long. There are several reasons why these snippets are poor choices for training data. First, a single character is usually not descriptive enough to characterize multiple tags. Saying that 'x' is a good indicator of python, machine learning, and databases does not make sense. Going back to Fig. 2, we can also see that the short snippets are often references to code, but not valid code themselves.
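Concretely, the tag assignment and snippet concatenation can be sketched as follows. This is a minimal illustration assuming a simple in-memory representation of threads — the field names are our own, not the actual data dump schema — and it also applies the snippet-length and post-score filters derived later in this section:

```python
# Illustrative sketch of the preprocessing described above. The thread/post
# field names ("tags", "posts", "score", "snippets") are assumptions for
# exposition, not the real Stack Overflow data dump schema.

MIN_SNIPPET_LEN = 10  # snippets of length 9 or below are filtered (Sec. 3.1)

def posts_to_examples(threads):
    """Convert SO threads into (code, tags) training examples."""
    examples = []
    for thread in threads:
        tags = thread["tags"]  # tags are chosen at the thread level
        for post in thread["posts"]:
            if post["score"] < 0:  # drop negatively scored posts (Sec. 3.1)
                continue
            # Keep only informative snippets, then concatenate the post's
            # snippets with newlines to preserve the post as a single idea.
            kept = [s for s in post["snippets"] if len(s) >= MIN_SNIPPET_LEN]
            if kept:
                examples.append(("\n".join(kept), tags))
    return examples
```

Each resulting pair is one training example: the concatenated code is the input and the thread's tags are the (up to five) positive labels.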
Figure 4: A zoomed view of the snippet length distribution, with 1 bin equal to 1 character. There are many strings that are empty or only a few characters long.

Figure 5: The mean and median number of punctuation marks at different snippet lengths. At a snippet length of 10 characters, the mean and median number of punctuation marks is 1, indicating a reasonable choice for minimum snippet length.

Figure 6: Distribution of tags per post. All posts on Stack Overflow must have at least one tag, but there is a maximum of five tags, resulting in missing labels.

In order to avoid cutting out snippets at an uninformed threshold, we investigate snippets of different lengths in more detail. We found punctuation to be a good indicator of code usefulness in short snippets. The occurrence of punctuation means that we are more likely to see programming language constructs such as "variable = value" or "class.method." However, simply removing all snippets without punctuation is not viable because of valuable alphanumeric keywords and punctuation-free code ("call WriteConsole"), so we instead decide to filter based on a threshold length. Fig. 5 shows the median and mean number of punctuation marks for different snippet lengths. At a snippet length of 10 characters, the mean and median are both greater than one, so we filter out all snippets that are length 9 or below from the data.

Additionally, Fig. 6 shows the distribution of tags per post. As stated previously, Stack Overflow allows a maximum of five tags for any given post. Although most posts contain three tags, there is still a significant number of posts with fewer tags. The combined effect from a high quantity of posts that have few tags and an enforced maximum creates a "missing label phenomenon." This is the situation where a given post is not tagged with all of the functionalities or concepts actually described in the post. This is a non-trivial challenge for machine learning models because a code snippet is considered a negative example for a given label if that label is missing.

Users can also add errors to the training data by simply being wrong about their tags or posted code on Stack Overflow. Because users can vote based on the quality of a post, we can use scores as an indicator for incorrectly tagged or poorly coded posts. Fig. 7 shows the distribution of scores for posts that have at least one code snippet. We cut all posts with negative scores from the training data. Although we considered cutting posts with zero score because they had not been validated by other users via voting, we ultimately choose to keep them because the score distribution shows that there is a large amount of data with zero score.

After filtering the data for the snippet length and score thresholds, one problem remained with the set of valid labels. Because users are allowed to enter any text as a tag for their posts, there is a long-tailed distribution of tags that are rarely used. Table 1 displays the magnitude of the problem. In the first 4,508 tags, the number of posts per tag drops from 2.5 million to just 1,000. In order to enable a 99% / 1% training/test split and still have 10 positive labels per tag to estimate performance, we cut off tags with fewer than one thousand positive samples. In the following section, we explain how we construct our models and perform validation using the snippet-, score-, and tag-filtered data.

Figure 7: The distribution of scores on Stack Overflow posts. Negative scores are often the result of poorly worded questions, incorrectly tagged posts, or flawed code snippets, so we filter them out of the training set.
We keep zero-scored snippets because they may not have been viewed enough to be voted on.

Table 1: Rankings are based on the number of posts that are labeled with a tag, after filtering data for snippet and score thresholds. This shows that the majority of tags have too few samples to train and validate a machine learning model.

Rank     Tag          # of Posts
1        javascript   2,585,182
8        html         1,279,137
73       apache       99,377
751      web-config   10,056
4,508    simplify     1,000
16,986   against      100
46,663   db-charmer   1

4 METHODOLOGY

Our motivations for using neural networks in this work are severalfold. As discussed in the introduction, convolutional neural networks have shown state-of-the-art performance in natural language tasks with less computation than LSTMs [16] [8]. Both natural language and source code tasks must model structure, semantic meaning, and context.

Neural networks also have the ability to efficiently handle multilabel classification problems: rather than training M classifiers for M different output labels, the output layer of a neural network can have M nodes, simultaneously providing predictions for multiple labels. This enables the neural network to learn features that are common across labels, whereas individual classifiers must learn those relationships separately.
Figure 8: An overview of the neural network architecture. (a) The characters from a given code snippet are converted to real-valued vectors using a character embedding. (b) We use stacked convolutional filters of different lengths with ReLU activations over the matrix of embeddings. (c) We perform sum-over-time pooling on the output of each stacked convolution. (d) A flattened vector is fed into two fully-connected, batch-normalized, dense layers with ReLU activations. (e) Each output node uses a logistic activation function to produce a value from 0 to 1, representing the probability of a given label.

4.1 Neural Network Architecture

Fig. 8 gives an overview of the neural network architecture. In part (a) of Fig. 8, we use a character embedding to transform each printable character into a 16-dimensional real-valued vector. We chose character embeddings over more commonly used word embeddings for multiple reasons. Creating an embedding for every word in the source code domain is problematic because of the massive set of unique identifiers. Forming a dictionary from words only seen in the training set will not generalize, and using all possible identifiers will be infeasible to optimize. The neural network only needs to optimize 100 embeddings when using the printable characters. Additionally, the character embeddings are able to function on any text, allowing the model to predict on source code without the use of language-specific parsers or features.

Figure 9: A two-dimensional PCA projection of the character embedding vectors that are optimized during model training. The model generates clear spatial relationships between various sets of characters. The separation between uppercase letters, lowercase letters, symbols, and numbers is of particular interest. In general, meaningful spatial relationships significantly improve the features extracted by the convolutional layers.

In order to provide an intuition of the character embedding, we use PCA to project the 16-dimensional embedding vectors down to two dimensions, as displayed in Fig. 9. This figure indicates that the model generates salient spatial relationships between the embedded characters during optimization, which is critical to the performance of the convolutional layers. The convolutions are able to preserve information about words and sequences by sliding over the embedding vectors of consecutive characters. We stack two convolutional layers of the same size for various filter lengths, which generates a stacked convolution matrix. Using sum-over-time pooling on the stacked convolution matrix allows us to obtain a fixed-length vector regardless of the initial input size.

After two batch-normalized dense layers, the last layer has a logistic activation for each neuron in order to output the probability of a tag occurring. The network is trained on binary vectors containing a 1 for every tag that occurs for a given code snippet and 0 otherwise. We use binary cross-entropy as our loss function.

4.2 Validation Setup

Since we train the model on Stack Overflow and predict on arbitrary source code, we must validate the model in both domains. On the SO data, we use a hold-out test-set strategy so that the model can be evaluated on previously unseen data. In the source code domain, we perform human validation to verify the accuracy of the model's outputs.
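As a concrete illustration of the data flow in Fig. 8, the sketch below implements one character-embedding, convolution, and sum-over-time pooling stage in plain NumPy with random, untrained weights. It is a deliberate simplification (a single unstacked filter length; no batch normalization, dense layers, or logistic outputs), and its point is that sum-over-time pooling yields a fixed-length vector for any input length:

```python
import string
import numpy as np

rng = np.random.default_rng(0)

CHARS = string.printable        # the 100 printable characters
EMBED_DIM = 16                  # 16-dimensional character embedding (Fig. 8a)
N_FILTERS, FILTER_LEN = 128, 3  # one of the paper's filter shapes (Fig. 8b)

# Random stand-ins for the learned parameters.
embedding = rng.normal(size=(len(CHARS), EMBED_DIM))
filters = rng.normal(size=(N_FILTERS, FILTER_LEN * EMBED_DIM))

def encode(snippet):
    """Embed characters, apply a ReLU convolution, then sum over time."""
    idx = [CHARS.index(ch) for ch in snippet if ch in CHARS]
    emb = embedding[idx]                               # (T, EMBED_DIM)
    windows = np.stack([emb[t:t + FILTER_LEN].ravel()  # sliding windows
                        for t in range(len(idx) - FILTER_LEN + 1)])
    acts = np.maximum(windows @ filters.T, 0.0)        # ReLU conv (T', 128)
    return acts.sum(axis=0)                            # fixed-length (128,)

short_vec = encode("def f(): pass")
long_vec = encode("def f():\n    return [i * i for i in range(10)]\n" * 50)
```

Regardless of snippet length, `short_vec` and `long_vec` both have shape `(128,)`, which is what lets the full network accept source code documents of arbitrary length.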
4.3 Stack O verow V alidation T o validate the neural network on Stack Overow , we tested a number of multilabel test set stratication algorithms. Stratication based on k-fold cross-validation, which is a standard technique for binar y and multiclass classication tasks, cannot be directly applied to the SO multilabel classication problem due to classes not being disjoint. Furthermore, due to the class imbalance caused by using a long-tailed tag distribution for labels, random stratication produces partitions of the data that do not generate good estimates for multilabel problems [ 30 ] [ 36 ]. In particular , the label counts for the top tag and the 4,508th tag dier by 3 orders of magnitude, which can r esult in classes with v ery few positive labels for the test set. Since deep CNN models take a long time to train and b enet from large datasets, we want to avoid cross validation and use as much of the dataset as possible to train our model. Our goal is to generate a 98% / 1% / 1% train/validation/test split that still provides a good estimate of performance. With an ideal stratication, this would ensure that even the rarest tags (with 1000 samples each) would have 10 samples in the validation and test sets, which is sucient for estimating performance. On our dataset, this would result in about 240,000 samples in validation and test sets. Multilabel stratication begins with the m -by- q label matrix Y , where m is the number of samples in the dataset D , q is the number of labels in the set of labels Λ , and Y [ i , j ] = 1 where sample i has label j , and Y [ i , j ] = 0 other wise. The goal is to generate a partition { D 1 , . . . , D k } of D that fullls certain desirable properties. First, the size of each partition should match a designated ratio of samples, in our case, | D train | , | D test | , | D val | | D | = ( 0 . 98 , 0 . 1 , 0 . 1 ) . 
Additionally, the proportion of positive examples of each label in each partition should be the same as in the full dataset; i.e.,

    ∀ s ∈ {train, test, val}, ∀ j ∈ Λ : ( Σ_{i ∈ D_s} Y[i, j] ) / |D_s| = c_j,

where c_j is the proportion of positive examples of label j in D.

Labelset stratification [36] considers each combination of labels, denoted labelsets, as a unique label and then performs standard k-fold stratification on those labelsets. This works well for multilabel problems where each labelset appears sufficiently often. However, this does not optimize for individual label counts, which is a problem for datasets like SO that include rare labels and rare label combinations. We found that iterative stratification [30], a greedy method that specifically focuses on balancing the label counts for rare labels, produced the best validation and test sets. To produce our partition, we ran iterative stratification twice with a 99% / 1% split, which resulted in a 98.01% / 0.99% / 1% train/validation/test split.

4.4 Source Code Validation

Validating the model's performance on source code poses a different challenge because of the lack of labeled documents. In order to obtain results, we performed human validation on source code that is randomly sampled from GitHub [12]. Specifically, we ran a script to download the master branches of random GitHub projects via the site's API until we had 146,408 individual files.

Figure 10: The GUI for human validation of model outputs on source code documents.

Table 2: Mean, median, and standard deviation of tag AUCs for each model.

Model                          Mean   Median  Stdev
Embedding CNN                  0.957  0.974   0.048
Embedding Logistic Regression  0.751  0.759   0.099
N-gram Logistic Regression     0.514  0.502   0.093
W e sampled 20 les for each of the following extensions, r esulting in a total of 200 source code documents: [py , xml, java, html, c, js, sql, asm, sh, cpp]. Note that the extensions were not presented to the users and that they do not inform the predictions of the model. W e created a GUI, displayed in Fig. 10 , that presents the top lab els and asks users if they agree with, disagree with, or are unsure about each label. Ther e were a total of 3 reviewers, each of whom answered the questions on the GUI for all 200 source co de documents. W e r emove the unsure answers and use simple voting among the remaining ratings to produce ground truth and compute an ROC curve. 5 RESULTS On the Stack Overow data, we rst calculated the top-1 accuracy previously use d by Kuo [ 19 ] and Clayton and Byrne [ 33 ]. W e obtain a 78.7% top-1 accuracy , which is a signicant improvement over the previous best of 65%. Howev er , we found that metric to be lacking: it only che cks if the model’s top prediction is in the SO p ost’s tag set. Our goal is to predict many tags pertinent to a source co de document, not just its primary tag. Because our work is introducing the multilabel tag prediction problem on Stack Overow code snippets, we train multiple baseline models to demonstrate the signicance of our convolutional neural network architecture . In order to evaluate the results, we computed the ar ea under ROC ( AUC) for each individual tag. This is a reasonable evaluation b ecause it demonstrates the performance of each model across the entire set of tags. Figure 11: The distribution of tag AUCs for each model. Be- cause our dataset uses 4,508 labels, there are 4,508 AUCs binned and plotted for each mo del. This graph demonstrates how well each model performs across all the lab els. W e used tw o additional models as baselines for this problem. The rst model performs logistic regression on a bag of n-grams. 
This model obtains the 75,000 most common n-grams (using n=1,2,3) from the training set to use as features. The second model performs logistic regression on a character emb edding of the input code using an embe dding dimension of 8. W e choose these two models as baselines because they test two dierent types of featurizations and they are able to eciently train and predict on multilabel problems. Fig. 11 shows the distributions of tag AUCs for the CNN model and the logistic regression baseline models. Because our dataset uses 4,508 tags, there are 4,508 AUC values that are binned and plotted for each model. The shape of the logistic regr ession distributions are similar - most of the tags fall within the central range of the models’ distributions and there ar e few tags that perform r elatively well or relatively poorly . Our convolutional architecture performs well on most of the tags, and instead has a long-tailed distribution of decreasing performance. T able 2 displays a summarized, quantitative view of the tag AUC distributions. The logistic regression models have similar standard deviations, but the n-gram mo del has a considerably lower mean and median, indicating that the n-gram features are not as eective as the character embeddings. The convolutional network has a signicantly higher mean and median, and a lower standard deviation. Although all of the models perform worse as the rarity of the tags increases, the lower standard deviation of the convolutional network implies that the model is more robust to the rarity of a given tag. For source code validation, we use human feedback on the con- volutional network to generate Fig. 12 . The mo del obtained a 0.769 AUC. For the sake of comparison, we compute top-1 accuracy with the human validation on source code and obtain an 86.6% accuracy . 
We note that this is better than the analogous performance on Stack Overflow, which indicates that, on source code, the model performs better for the first tag but worse for the rest.

MASES '18, September 3, 2018, Montpellier, France

Figure 12: Human validation ROC curve with a 0.769 AUC. This differs from the Stack Overflow AUC values because it operates on the results of human validation, which is limited to only a few tags per document.

As a final note on performance, we trained and tested our model using an NVIDIA 1080 GPU. Our model obtains speeds of about 317,000 characters per second. Assuming an average of 38 characters per line of code (calculated based on a random sample of source files from GitHub [12]), the model is able to achieve prediction speeds of 8,342 source lines of code per second. To put this in context, it would take the model less than an hour to predict on the 20+ million lines of code in the Linux kernel. It is also readily parallelized to quickly predict across much larger source code corpora.

6 CHALLENGES/LIMITATIONS

In the course of our research, we encountered a few limitations that require further study. First is the transfer learning problem between Stack Overflow code snippets and source code. The lack of labeled source code prevents us from training directly on the desired domain. The size of SO code snippets and the maximum number of tags per post are detrimental to the model's ability to predict on arbitrarily long source code. Due to the five-tags-per-post limit, predicting more tags will increase the model loss, resulting in predictions with few tags. The original hypothesis was that the model would associate few predictions with short snippets and many tags with longer snippets, but the source code evaluation did not strongly support this.
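The tag-cap effect described above can be illustrated numerically with binary cross-entropy. This is a toy sketch, not the training code: the probabilities are made up, and a sixth genuinely relevant tag is assumed to have been dropped by the five-tag cap.

```python
import math

def bce(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a tag vector."""
    return -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
                for t, p in zip(y_true, y_pred)) / len(y_true)

# Six tags genuinely apply, but the post's label vector is capped at five,
# so the sixth tag is (incorrectly) labeled 0.
labels = [1, 1, 1, 1, 1, 0]

capped = [0.95, 0.95, 0.95, 0.95, 0.95, 0.05]  # predicts only the 5 labeled tags
extra  = [0.95, 0.95, 0.95, 0.95, 0.95, 0.95]  # also predicts the correct 6th tag

print(bce(labels, capped) < bce(labels, extra))  # True: the 6th tag is penalized
```

Because the capped predictions achieve lower loss, the model learns to suppress tags beyond the fifth even when they are correct.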
Exploring approaches that utilize loss functions other than binary cross-entropy may address these tag limit problems. Another issue is that Stack Overflow users do not tag their code snippets directly, but rather their questions. For example, a user could post a code snippet of an XML document, ask how to parse it in Java, and tag the thread with "XML," "Java," and "parse." These tags are all extremely relevant to the user's question, but they do not describe the code snippet independently. During training, our model is only able to see that the XML document is an example of XML, Java, and parsing. This creates noise in the Java and parse labels. Finally, the human verification process is a noisy evaluation of the model's performance on source code. Verifying the predictions is an arduous process because the model is familiar with thousands of functionalities. It is infeasible for individuals to be masters of such a wide range of ideas and tools, which results in a significant amount of labeler disagreement.

7 CONCLUSIONS/FUTURE WORK

We leverage the crowdsourced data from Stack Overflow to train a deep convolutional neural network that can attach meaningful, semantic labels to source code documents of arbitrary language. While most current code search approaches locate documents by matching strings from user queries, our approach enables us to identify documents based on functional content instead of the literal characters used in source code or documentation. A logical next step is to apply this model to large source code corpora and build a search interface to find source code of interest. Unlike previous supervised SO tag-prediction models, we train and test strictly on code snippets, yet we still advance the top-1 prediction accuracy from 65% to 79% on Stack Overflow. We also achieve 87% on human-validated source code.
Using the area under ROC to measure performance, we obtain a mean AUC of 0.957 on the Stack Overflow dataset and an AUC of 0.769 on the human source code validation. Refining the methodology and data preprocessing by training the model with entire threads instead of posts could alleviate the performance drop caused by transfer learning. An alternative direction for future research is to investigate better metrics and loss functions for training and evaluating model performance on long-tailed multilabel datasets. This could prevent the model from being punished for predicting more than five tags. Finally, extensions of the architecture that broaden the contextual aperture of the convolutional layers may grant the model a deeper understanding of abstract code concepts and semantics. This would enable more sophisticated code search and comprehension.

ACKNOWLEDGMENTS

This project was sponsored by the Air Force Research Laboratory (AFRL) as part of the DARPA MUSE program.

REFERENCES

[1] Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and Charles Sutton. 2017. A survey of machine learning for big code and naturalness. arXiv preprint arXiv:1709.06182 (2017).
[2] John R Anderson, Daniel Bothell, Michael D Byrne, Scott Douglass, Christian Lebiere, and Yulin Qin. 2004. An integrated theory of the mind. Psychological Review 111, 4 (2004), 1036.
[3] Sushil Bajracharya, Trung Ngo, Erik Linstead, Yimeng Dou, Paul Rigor, Pierre Baldi, and Cristina Lopes. 2006. Sourcerer: a search engine for open source code supporting structure-based search. In Companion to the 21st ACM SIGPLAN Symposium on Object-Oriented Programming Systems, Languages, and Applications. ACM, 681–682.
[4] Michael W Berry, Susan T Dumais, and Gavin W O'Brien. 1995. Using linear algebra for intelligent information retrieval. SIAM Review 37, 4 (1995), 573–595.
[5] Black Duck. 2017. Open Hub.
https://www.openhub.net
[6] Francisco Charte, Antonio J Rivera, María J del Jesus, and Francisco Herrera. 2015. Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing 163 (2015), 3–16.
[7] Hoa Khanh Dam, Truyen Tran, and Trang Pham. 2016. A deep language model for software code. arXiv preprint arXiv:1608.02715 (2016).
[8] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with gated convolutional networks. arXiv preprint (2016).
[9] Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, Annibale Panichella, and Sebastiano Panichella. 2014. Labeling source code with information retrieval methods: an empirical study. Empirical Software Engineering 19, 5 (2014), 1383–1420.
[10] Amit Deshpande and Dirk Riehle. 2008. The total growth of open source. In IFIP International Conference on Open Source Systems. Springer, 197–209.
[11] Malcom Gethers, Trevor Savage, Massimiliano Di Penta, Rocco Oliveto, Denys Poshyvanyk, and Andrea De Lucia. 2011. CodeTopics: which topic am I coding now? In Proceedings of the 33rd International Conference on Software Engineering. ACM, 1034–1036.
[12] GitHub. 2017. GitHub. https://github.com
[13] A Grabowski, N Kruszewska, and RA Kosiński. 2008. Properties of on-line social systems. The European Physical Journal B-Condensed Matter and Complex Systems 66, 1 (2008), 107–113.
[14] Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 837–847.
[15] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[16] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2015. Character-aware neural language models.
arXiv preprint arXiv:1508.06615 (2015).
[17] Krugle. 2017. Krugle. http://opensearch.krugle.org
[18] Adrian Kuhn, Stéphane Ducasse, and Tudor Gírba. 2007. Semantic clustering: Identifying topics in source code. Information and Software Technology 49, 3 (2007), 230–243.
[19] Darren Kuo. 2011. On Word Prediction Methods. Technical Report. EECS Department, University of California, Berkeley.
[20] Otávio Augusto Lazzarini Lemos, Sushil Krishna Bajracharya, and Joel Ossher. 2007. CodeGenie: a tool for test-driven source code search. In Companion to the 22nd ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications Companion. ACM, 917–918.
[21] Bennet P Lientz, E. Burton Swanson, and Gail E Tompkins. 1978. Characteristics of application software maintenance. Commun. ACM 21, 6 (1978), 466–471.
[22] Yoelle S Maarek, Daniel M Berry, and Gail E Kaiser. 1991. An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering 17, 8 (1991), 800–813.
[23] Jon D McAuliffe and David M Blei. 2008. Supervised topic models. In Advances in Neural Information Processing Systems. 121–128.
[24] Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Chen Fu, and Qing Xie. 2012. Exemplar: A source code search engine for finding highly relevant applications. IEEE Transactions on Software Engineering 38, 5 (2012), 1069–1087.
[25] Stack Overflow. 2017. Stack Overflow. http://stackoverflow.com
[26] Santanu Paul and Atul Prakash. 1994. A framework for source code search using program patterns. IEEE Transactions on Software Engineering 20, 6 (1994), 463–475.
[27] Steven P Reiss. 2009. Semantics-based code search. In Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 243–253.
[28] Joshua Saxe, Rafael Turner, and Kristina Blokhin. 2014.
CrowdSource: Automated inference of high level malware functionality from low-level symbols using a crowd trained machine learning model. In Malicious and Unwanted Software: The Americas (MALWARE), 2014 9th International Conference on. IEEE, 68–75.
[29] searchcode. 2017. searchcode. https://searchcode.com
[30] Konstantinos Sechidis, Grigorios Tsoumakas, and Ioannis Vlahavas. 2011. On the stratification of multi-label data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 145–158.
[31] SourceForge. 2017. SourceForge. https://sourceforge.net
[32] Sourcegraph. 2017. Sourcegraph. https://sourcegraph.com
[33] Clayton Stanley and Michael D Byrne. 2013. Predicting tags for StackOverflow posts. In Proceedings of ICCM, Vol. 2013.
[34] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Interspeech. 194–197.
[35] Stephen W Thomas, Bram Adams, Ahmed E Hassan, and Dorothea Blostein. 2010. Validating the use of topic models for software evolution. In Source Code Analysis and Manipulation (SCAM), 2010 10th IEEE Working Conference on. IEEE, 55–64.
[36] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. 2009. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook. Springer, 667–685.
[37] Xi-Zhu Wu and Zhi-Hua Zhou. 2016. A unified view of multi-label performance measures. arXiv preprint arXiv:1609.00288 (2016).
