SampleFix: Learning to Generate Functionally Diverse Fixes



Hossein Hajipour¹, Apratim Bhattacharyya², Cristian-Alexandru Staicu¹, and Mario Fritz¹

¹ CISPA Helmholtz Center for Information Security, Germany
² Max Planck Institute for Informatics, Germany

Abstract. Automatic program repair holds the potential of dramatically improving the productivity of programmers during the software development process and the correctness of software in general. Recent advances in machine learning, deep learning, and NLP have rekindled the hope to eventually fully automate the process of repairing programs. However, previous approaches that aim to predict a single fix are prone to fail due to uncertainty about the true intent of the programmer. Therefore, we propose a generative model that learns a distribution over potential fixes. Our model is formulated as a deep conditional variational autoencoder that can efficiently sample fixes for a given erroneous program. In order to ensure diverse solutions, we propose a novel regularizer that encourages diversity over a semantic embedding space. Our evaluations on common programming errors show for the first time the generation of diverse fixes and strong improvements over state-of-the-art approaches by fixing up to 45% of the erroneous programs. We additionally show that for 65% of the repaired programs, our approach was able to generate multiple programs with diverse functionalities.

Keywords: Program repair · Generative models · Conditional variational autoencoder.

1 Introduction

Software development is a time-consuming and expensive process. Unfortunately, programs written by humans typically come with bugs, so significant effort needs to be invested to obtain code that is only likely to be correct. Debugging is also typically performed by humans and can itself contain mistakes.
This is neither desirable nor acceptable in many critical applications. Therefore, automatically locating and correcting program errors [11] offers the potential to increase productivity as well as improve the correctness of software.

Advances in deep learning [17,18], computer vision [9,26], and NLP [30,3] have dramatically boosted machines' ability to automatically learn representations of natural data, such as images and natural-language content, for various tasks. Deep learning models have also been successful in learning distributions over continuous [29,16] and discrete data [21,14], and in generating new and diverse data points [10]. These advances in machine learning and the advent of large corpora of source code [1] provide new opportunities for harnessing deep learning methods to understand, generate, or debug programs.

Fig. 1: Our SampleFix approach with the diversity regularizer promotes sampling of diverse fixes that account for the inherent uncertainty in the automated debugging task.

Fig. 2: SampleFix captures the inherent ambiguity of the possible fixes by sampling multiple potential fixes for the given erroneous real-world program. Potential fixes with the same functionality are highlighted with the same color, and the newly added tokens are underlined.

Prior work in automatic program repair predominantly relies on expert-designed rules and error models that describe the space of potential fixes [27,8]. Such hand-designed rules and error models are not easily adaptable to new domains and require a time-consuming process. In contrast, learning-based approaches provide an opportunity to adapt such models to new domains of errors. Therefore, there has been increasing interest in carrying over the success stories of deep learning in NLP and related techniques to learning-based approaches that tackle the "common programming errors" problem [13,12].
Such investigations have included compile-time errors such as missing scope delimiters, extraneous symbols, or incompatible operators. Novice programmers and even experienced developers often struggle with these types of errors [25], usually due to a lack of attention to the details of programs and/or the programmer's inexperience.

Recently, Gupta et al. [13] proposed a deep sequence-to-sequence model called DeepFix where, given an erroneous program, the model predicts the locations of the errors and a potential fix for each predicted location. The problem is formulated as a deterministic task, where the model is trained to predict a single fix for each error. However, different programs – and therefore also their fixes – can express the same functionality. Besides, there is also uncertainty about the intention of the programmer. Figure 1 illustrates the issue. Given an erroneous (buggy) program, there is a large number of programs within a certain edit distance. A subset of these will result in successful compilation. The remaining programs will still implement different functionalities, and – without additional information or assumptions – it is impossible to tell which program/functionality was intended. In addition, previous work [28] also identified overfitting as one of the major challenges for learning-based automatic program repair. We believe that one of the culprits for this is the poor objectives used in the training process, e.g., training a model to generate one particular target fix.

Let us consider the example in Figure 2 from the dataset of DeepFix [13]. This example program is incorrect due to an imbalanced number of curly brackets. In a traditional scenario, a compiler would warn the developer about this error.
For example, when trying to compile this code with GCC, the compiler terminates with the error "expected declaration or statement at end of input", indicating line 10 as the error location. Experienced developers would be able to understand this cryptic message and proceed to fixing the program. Based on their intention, they can decide to add a curly bracket either at line 6 (patch P1) or at line 9 (patch P2). Both of these solutions would fix the compilation error in the erroneous program, but the resulting programs have different semantics.

Hence, we propose a deep generative framework to automatically correct programming errors by learning the distribution of potential fixes. We investigate different solutions to model the distribution of the fixes and to sample multiple fixes, including different variants of Conditional Variational Autoencoders (CVAE) and beam search decoding. It turns out (as we will also show in our experiments) that CVAE and beam search decoding are complementary, while CVAE is computationally more efficient than beam search decoding. Furthermore, we encourage diversity in the candidate fixes through a novel regularizer which penalizes similar fixes for an identical erroneous program and significantly increases the effectiveness of our approach. The candidate fixes in Figure 2 are generated by our approach, illustrating its potential for generating both diverse and correct fixes. For a given erroneous program, our approach is capable of generating diverse fixes to resolve the syntax errors.

To summarize, the contributions of this paper are as follows:

1. We propose an efficient generative method to automatically correct common programming errors by learning the distribution over potential fixes.
2. We propose a novel regularizer to encourage the model to generate diverse fixes.
3. Our generative model together with the diversity regularizer shows an increase in the diversity and accuracy of fixes, and a strong improvement over state-of-the-art approaches.

2 Related Work

Our work builds on the general idea of sequence-to-sequence models as well as ideas from neural machine translation. We phrase our approach as a variational autoencoder and compare it to prior learning-based program repair approaches. We review the related work in order below.

2.1 Neural Machine Translation

Sutskever et al. [30] introduce neural machine translation and cast it as a sequence-to-sequence learning problem. The popular encoder-decoder architecture is introduced to map source sentences into target sentences. One of the major drawbacks of this model is that the sequence encoder needs to compress all of the extracted information into a fixed-length vector. Bahdanau et al. [3] address this issue by using an attention mechanism in the encoder-decoder architecture, which focuses on the most relevant part of the encoded information by learning to search over the encoded vectors. In our work, we employ a sequence-to-sequence model with attention to parameterize our generative model. This model takes an incorrect program as input and maps it to many potential fixes by drawing samples from the estimated distribution of the fixes.

2.2 Variational Autoencoders

The variational autoencoder [16,24] is a generative model designed to learn deep directed latent-variable graphical models of large datasets. The model is trained on the data distribution by maximizing the variational lower bound of the log-likelihood as the objective function. Bowman et al. [5] extend this framework by introducing an RNN-based variational autoencoder to enable the learning of latent-variable generative models on text data.
The proposed model is successful in generating diverse and coherent sentences. To model conditional distributions for structured output representations, Sohn et al. [29] extended variational autoencoders by introducing an objective that maximizes the conditional data log-likelihood. In our approach, we employ an RNN-based conditional variational autoencoder to model the distribution of potential fixes given erroneous programs. Variational autoencoder approaches enable the efficient sampling of accurate and diverse fixes.

2.3 Learning-based Program Repair

Recently there has been growing interest in using learning-based approaches to automatically repair programs [22]. Long and Rinard [20] proposed a probabilistic model that uses designed code features to rank potential fixes for a given program. Pu et al. [23] employ an encoder-decoder neural architecture to automatically correct programs. In these works and many other learning-based program repair approaches, enumerative search over programs is required to resolve all errors. In contrast, our proposed framework is capable of predicting the location and potential fixes by passing the whole program to the model. Besides this, unlike our approach, which only generates fixes for the given erroneous program, Pu et al. [23] need to predict whole program statements to resolve the errors.

There are two important program repair tasks explored in the literature: fixing syntactic errors and fixing semantic ones. While in the current work we propose a technique for fixing syntactic errors, we believe that our observation about the diversity of fixes has implications for approaches aimed at repairing semantic bugs as well. Most of the recent work in this domain aims to predict a unique fix, often extracted from a real-world repository.
For example, Getafix [2], a recent approach for automatically repairing six types of semantic bugs, is evaluated on a set of 1,268 unique fixes written by developers. Similarly, DLFix [19] considers a bug to be fixed only if the patch exactly matches a patch provided by the developer. While this is an improved methodology in the spirit of our proposal, it is highly dependent on the performance of the test-suite oracle, which may not always capture the developer's intent.

DeepFix [13], RLAssist [12], and DrRepair [32] use neural representations to repair syntax errors in programs. In detail, DeepFix [13] uses a sequence-to-sequence model to directly predict a fix for incorrect programs. In contrast, our generative framework is able to generate multiple fixes by learning the distribution of potential corrections. Therefore, our model does not penalize, but rather encourages, diverse fixes. RLAssist [12] repairs programs by employing a reinforcement learning approach. The authors train an agent that navigates over the program to locate and resolve syntax errors. In this work, they only address typographic errors, rely on a hand-designed action space, and meet problems due to the increasing size of the action space. In contrast, our method shows improved performance on typographic errors and also generalizes to missing variable declaration errors by generating diverse fixes. In a recent work, Yasunaga and Liang [32] proposed DrRepair to resolve syntax errors by introducing a program feedback graph. They connect the relevant symbols in the source code and the compiler error messages and employ a graph neural network on top to model the process of program repair. In this work, they rely on compiler error messages, which can be helpful but also limits the generality of the method.
However, our proposed approach does not rely on additional information such as compiler error messages, and it resolves the errors by directly modeling the underlying distribution of the potential correct fixes.

3 SampleFix: Generative Model for Diversified Code Fixes

Repairing common program errors is a challenging task due to ambiguity in potential corrections and a lack of representative data. Given a single erroneous program and a certain number of allowed changes, there are multiple ways to fix the program, resulting in different styles and functionality. Without further information, the true, intended style and/or functionality remains unknown. In order to account for this inherent ambiguity, we propose a deep generative model to learn a distribution over potential fixes given the erroneous program – in contrast to predicting a single fix. We frame this challenging learning problem as a conditional variational autoencoder (CVAE). However, standard sampling procedures and limitations of datasets and their construction make learning and the generation of diverse samples a challenge. We address this issue with a beam search decoding scheme in combination with a novel regularizer that encourages diversity of the samples in the embedding space of the CVAE.

Fig. 3: Overview of SampleFix at inference time, highlighting the generation of diverse fixes.

Figure 3 provides an overview of our proposed approach at inference time. For a given erroneous program, the generative model draws T intermediate candidate fixes ŷ from the learned conditional distribution. We use a compiler to select a subset of promising intermediate candidate fixes based on the number of remaining errors.
This procedure is applied iteratively until we arrive at a set of candidate fixes within the maximum number of prescribed changes. We then select a final set of candidate fixes that compile and have unique syntax according to our measure described below (Subsection 3.5). In the following, we formulate our proposed generative model with the diversity regularizer and provide details of our training and inference process.

3.1 Conditional Variational Autoencoders for Generating Fixes

Conditional Variational Autoencoders (CVAE) [29] model conditional distributions p_θ(y | x) using latent variables z. The conditioning introduced through z enables the modeling of complex multi-modal distributions. As powerful transformations can be learned using neural networks, z itself can have a simple distribution, which allows for efficient sampling. This model allows for sampling from p_θ(y | x) given an input sequence x by first sampling latent variables ẑ from the prior distribution p(z). During training, amortized variational inference is used and the latent variables z are learned using a recognition network q_φ(z | x, y), parametrized by φ. In detail, the variational lower bound of the model (Equation 1) is maximized,

log p(y | x) ≥ E_{q_φ(z | x, y)} [ log p_θ(y | z, x) ] − D_KL( q_φ(z | x, y) ‖ p(z | x) ).   (1)

Penalizing the divergence of q_φ(z | x, y) from the prior in Equation 1 allows for sampling from the prior p(z) during inference. In practice, the variational lower bound is estimated using Monte-Carlo integration,

L̂_CVAE = (1/T) Σ_{i=1}^{T} log p_θ(y | ẑ_i, x) − D_KL( q_φ(z | x, y) ‖ p(z | x) ),   (2)

where ẑ_i ∼ q_φ(z | x, y) and T is the number of samples. We cast our model for resolving program errors in the Conditional Variational Autoencoder framework.
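The Monte-Carlo estimate in Equation 2 is simply an average of per-sample log-likelihoods minus the KL term. A minimal numerical sketch, where the log-likelihood and KL values are illustrative stand-ins rather than outputs of a trained model:

```python
def cvae_lower_bound(log_liks, kl):
    """Estimate Eq. 2: average of log p_theta(y | z_i, x) over T latent
    samples, minus the KL divergence term."""
    T = len(log_liks)
    return sum(log_liks) / T - kl

# Hypothetical log-likelihoods for T = 4 latent samples and a KL value.
log_liks = [-2.0, -1.5, -3.0, -2.5]
kl = 0.1
print(cvae_lower_bound(log_liks, kl))  # average (-2.25) minus KL -> about -2.35
```

Because every sample's log-likelihood enters the average with equal weight, all samples are pushed toward the single ground-truth fix, which is exactly what limits diversity in the plain CVAE.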
Here, the input x is the erroneous program and y is the fix. However, the plain CVAE as described in [29] suffers from diversity issues. Usually, the drawn samples do not reflect the true variance of the posterior p(y | x). This would amount to the correct fix potentially missing from our candidate fixes. To mitigate this problem, we next introduce an objective that aims to enhance the diversity of our candidate fixes.

3.2 Enabling Diverse Samples using a Best of Many Objective

Here, we introduce the diversity-enhancing objective that we use. Casting our model in the Conditional Variational Autoencoder framework enables us to sample a set of candidate fixes for a given erroneous program. However, the standard variational lower bound objective does not encourage diversity in the candidate fixes, because the average likelihood of the candidate fixes is considered. In detail, as the average likelihood is considered, all candidate fixes must explain the "true" fix in the training set well. This discourages diversity and constrains the recognition network, which is already constrained to maintain a Gaussian latent variable distribution. In practice, the learned distribution fails to fully capture the variance of the true distribution. To encourage diversity, we employ the "Many Samples" (MS) objective proposed by Bhattacharyya et al. [4],

L̂_MS = log( (1/T) Σ_{i=1}^{T} p_θ(y | ẑ_i, x) ) − D_KL( q_φ(z | x, y) ‖ p(z | x) ).   (3)

In comparison to Equation 2, this objective (Equation 3) encourages diversity in the model by allowing multiple chances to draw highly likely candidate fixes. This enables the model to generate diverse candidate fixes while maintaining high likelihood. In practice, due to numerical stability issues, we use the "Best of Many Samples" (BMS) objective, which is an approximation of Equation 3.
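To see why Equation 3 is numerically delicate, note that it averages likelihoods (not log-likelihoods) inside the logarithm, which underflows for very negative log-likelihoods unless computed with the log-sum-exp trick, whereas the BMS approximation simply keeps the best sample. A sketch with illustrative numbers, not values from the paper:

```python
import math

def ms_objective(log_liks, kl):
    """Eq. 3: log of the mean likelihood over T samples, computed stably
    via the log-sum-exp trick."""
    T = len(log_liks)
    m = max(log_liks)
    # log((1/T) * sum exp(l_i)) = m + log(sum exp(l_i - m)) - log T
    log_mean = m + math.log(sum(math.exp(l - m) for l in log_liks)) - math.log(T)
    return log_mean - kl

def bms_objective(log_liks, kl):
    """Eq. 4: 'Best of Many Samples' keeps only the most likely sample."""
    return max(log_liks) - kl

log_liks = [-2.0, -1.5, -3.0, -2.5]
print(ms_objective(log_liks, 0.1))   # lies below the BMS value
print(bms_objective(log_liks, 0.1))  # best log-likelihood (-1.5) minus KL
```

Since the log of a mean never exceeds the log of the maximum, BMS upper-bounds MS while avoiding exponentiating large negative log-likelihoods.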
This objective retains the diversity-enhancing nature of Equation 3 while being easy to train,

L̂_BMS = max_i [ log p_θ(y | ẑ_i, x) ] − D_KL( q_φ(z | x, y) ‖ p(z | x) ).   (4)

3.3 DS-SampleFix: Encouraging Diversity with a Diversity-sensitive Regularizer

To increase diversity using Equation 4, we would need a substantial number of samples during training. This is computationally prohibitive, especially for large models, as memory requirements and computation time increase linearly in the number of such samples. On the other hand, for a small number of samples, the objective behaves similarly to the standard CVAE objective, as the recognition network has fewer and fewer chances to draw highly likely samples/candidate fixes, thus limiting diversity. Therefore, in order to encourage the model to generate diverse fixes even with a limited number of samples, we propose a novel regularizer that aims to increase the distance between the two closest candidate fixes (Equation 5). This penalizes generating similar candidate fixes for a given erroneous program and thus encourages diversity in the set of candidate fixes. In comparison to Equation 4, we observe considerable gains even with the use of only T = 2 candidate fixes. In detail, we maximize the following objective,

L̂_DS-BMS = max_i [ log p_θ(y | ẑ_i, x) ] + min_{i,j} d(ŷ_i, ŷ_j) − D_KL( q_φ(z | x, y) ‖ p(z | x) ).   (5)

Distance Metric. Here, we discuss the distance metric d in Equation 5. Note that the samples {ŷ_i, ŷ_j} can be of different lengths. Therefore, we first pad the shorter sample to equalize lengths. Empirically, we find that the Euclidean distance performs best, mainly because, in practice, the Euclidean distance is easier to optimize.
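The min_{i,j} d(ŷ_i, ŷ_j) term of Equation 5 can be sketched as follows, treating candidate fixes as integer token-id sequences; the token ids and zero-padding used here are assumptions for illustration only:

```python
import math

def euclidean(a, b, pad=0):
    """Pad the shorter token-id sequence, then take the Euclidean distance."""
    n = max(len(a), len(b))
    a = a + [pad] * (n - len(a))
    b = b + [pad] * (n - len(b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_pairwise_distance(candidates):
    """The min_{i,j} d(y_i, y_j) term of Eq. 5 over sampled candidate fixes."""
    return min(
        euclidean(candidates[i], candidates[j])
        for i in range(len(candidates))
        for j in range(i + 1, len(candidates))
    )

# Three hypothetical candidate fixes as token-id sequences: the first two
# are near-duplicates, so they dominate (minimize) the regularizer term.
fixes = [[5, 7, 9], [5, 7, 8], [1, 2]]
print(min_pairwise_distance(fixes))  # 1.0
```

Maximizing this term pushes apart whichever pair of samples is currently most similar, which is why diversity improves even with only T = 2 samples.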
3.4 Beam Search Decoding for Generating Fixes

Beam search decoding is a classical method to generate multiple outputs from a sequence-to-sequence model [31,7]. Given the distribution p_θ(y | x) of a sequence-to-sequence model, we can generate multiple outputs by unrolling the model in time and keeping the top-K tokens at each time step, where K is the beam width. In our generative model, we employ the beam search algorithm to sample multiple fixes. In detail, we decode with a beam width of size K for each sample z, for a total of T samples from p(z). We set T = 100 during inference.

3.5 Selecting Diverse Candidate Fixes

We extend the iterative repair procedure introduced by Gupta et al. [13] to the context of our proposed generative model, where the iterative procedure now leverages multiple candidate fixes. Given an erroneous program, the generative model outputs T candidate fixes. Each fix contains a potential erroneous line with the corresponding fix, so in each iteration we only edit one line of the given program. To select the best fixes, we take the candidate fixes and the input erroneous program and reconcile them to create T updated programs. We evaluate these fixes using a compiler and select up to the best N fixes, where N ≤ T. We only select the unique fixes which do not introduce any additional error messages. In the next iteration, we feed up to N programs back to the model. These programs are updated based on the selected fixes of the previous iteration. We keep up to the N programs with the lowest number of error messages over the iterations. At the end of the repair procedure, we obtain multiple potential candidate fixes. In the experiments where we are interested in a single repaired program, we pick the fix with the highest probability score according to our deep generative model.
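The iterative selection procedure of Subsection 3.5 can be sketched as below. `sample_fixes` (the generative model proposing one-line edits) and `count_errors` (the compiler's error-message count) are hypothetical stand-ins for the paper's actual model and compiler, not its real implementation:

```python
def apply_fix(program, fix):
    """Splice one repaired line into the program; a fix is a
    (line_number, new_line) pair."""
    lines = program.split("\n")
    line_no, text = fix
    lines[line_no] = text
    return "\n".join(lines)

def iterative_repair(program, sample_fixes, count_errors, iters=5, N=50):
    """Each round: expand every surviving program with its sampled one-line
    edits, drop duplicates, keep the N programs with the fewest errors."""
    pool = [program]
    for _ in range(iters):
        candidates = [apply_fix(p, f) for p in pool for f in sample_fixes(p)]
        unique = list(dict.fromkeys(candidates))  # drop duplicate programs
        unique.sort(key=count_errors)             # fewest error messages first
        pool = unique[:N] or pool
        if count_errors(pool[0]) == 0:            # a fully compiling program
            break
    return pool
```

A toy run with a brace-counting "compiler" and a fixed set of proposed edits terminates in one iteration once a balanced program appears in the pool.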
3.6 Model Architecture and Implementation Details

Fig. 4: Overview of the network architecture.

To ensure a fair comparison, our generative model is based on a sequence-to-sequence architecture similar to Gupta et al. [13]. Figure 4 shows the architecture of our approach in detail. Note that the recognition network is available to encode the fixes to latent variables z only during training. All of the networks in our framework consist of 4 layers of LSTM cells with 300 units. The network is optimized using the Adam optimizer [15] with the default settings. We use T = 2 samples to train our models and T = 100 samples during inference. To process a program through the networks, we tokenize it similarly to the setting used by Gupta et al. [13].

During inference, the conditioning erroneous program x is input to the encoder, which encodes the program into the vector v. To generate multiple fixes using our decoder, the code vector v, along with a sample of z from the prior p(z), is input to the decoder. For simplicity, we use a standard Gaussian N(0, I) prior, although more complex priors can easily be leveraged. The decoder is unrolled in time and outputs the logits of p_θ(y | ẑ_i, x).

4 Experiments

We evaluate our approach on the task of repairing common programming errors. We evaluate the diversity and accuracy of our sampled error corrections, and compare our proposed method with the state of the art.

4.1 Dataset

We use the dataset published by Gupta et al. [13], as it is sizable and includes real-world data. It contains C programs, covering 93 different tasks, written by students in an introductory programming course. The programs were collected using a web-based system [6].
These programs have token lengths in the range [75, 450] and contain typographic and missing variable declaration errors. To tokenize the programs and generate training and test data, different types of tokens are used, such as types, keywords, special characters, functions, literals, and variables. The dataset contains two sets of data, called synthetic and real-world data. The synthetic data contains erroneous programs which are synthesized by mutating correct programs written by students. The real-world data contains 6975 erroneous programs with 16766 error messages.

4.2 Evaluation

Table 1: Results of the performance comparison of DeepFix, Beam search (BS), SampleFix, and DS-SampleFix on synthetic data. Typo, Miss Dec, and All refer to typographic errors, missing variable declarations, and all of the errors, respectively.

Models            | Typo  | Miss Dec | All
DeepFix           | 84.7% | 78.8%    | 82.0%
Beam search (BS)  | 91.8% | 89.5%    | 90.7%
SampleFix         | 86.8% | 86.5%    | 86.6%
DS-SampleFix      | 95.6% | 88.1%    | 92.2%

Table 2: Results of the performance comparison of DeepFix, RLAssist, DrRepair, Beam search (BS), SampleFix, DS-SampleFix, and DS-SampleFix + BS. Typo, Miss Dec, and All refer to typographic errors, missing variable declarations, and all of the error messages, respectively. Speed denotes the computational time for sampling 100 fixes. "Prog" denotes successfully compiled programs, while "Msg" refers to resolved error messages.

Models            | Typo Prog | Typo Msg | Miss Dec Prog | Miss Dec Msg | All Prog | All Msg | Speed (s)
DeepFix [13]      | 23.3%     | 30.8%    | 10.1%         | 12.9%        | 33.4%    | 40.8%   | -
RLAssist [12]     | 26.6%     | 39.7%    | -             | -            | -        | -       | -
DrRepair [32]     | -         | -        | -             | -            | 34.0%    | -       | -
Beam search (BS)  | 25.9%     | 42.2%    | 20.3%         | 47.0%        | 44.7%    | 63.9%   | 4.82
SampleFix         | 24.8%     | 38.8%    | 16.1%         | 22.8%        | 40.9%    | 56.3%   | 0.88
DS-SampleFix      | 27.7%     | 40.9%    | 16.7%         | 24.7%        | 44.4%    | 61.0%   | 0.88
DS-SampleFix + BS | 27.8%     | 45.6%    | 19.2%         | 47.9%        | 45.2%    | 65.2%   | 1.17

We evaluate our approach on synthetic and real-world data. To evaluate our approach on the synthetic test set, we randomly select 20k pairs.
This data contains pairs of erroneous programs with the intended fixes. To evaluate our approach on real-world data, we use a real-world set of erroneous programs. Unlike for the synthetic test set, we do not have access to the intended fix(es) in the real-world data. However, we can check the correctness of a program using the evaluator (compiler). Following prior work, we train two networks, one for typographic errors and another to fix missing variable declaration errors. Note that there might be an overlap between the errors resolved by the network for typographic errors and the network for missing variable declaration errors, so we also provide the overall results for the resolved error messages.

Synthetic Data. Table 1 shows the comparison of our proposed approaches, Beam search (BS), SampleFix, and DS-SampleFix, with DeepFix [13] on the synthetic data in the first iteration. In Table 1 we can see that our approaches outperform DeepFix in generating intended fixes for the typographic and missing variable declaration errors. Beam search (BS), SampleFix, and DS-SampleFix generate 90.7%, 86.6%, and 92.2% of the intended fixes, respectively.

Real-World Data. In Table 2 we compare our approaches with state-of-the-art approaches (DeepFix [13], RLAssist [12], and DrRepair [32]) on the real-world data. In our experiments (Table 2) we show the performance of beam search decoding, CVAEs (SampleFix), and our proposed diversity-sensitive regularizer (DS-SampleFix). Furthermore, we show that DS-SampleFix can still take advantage of the beam search algorithm (DS-SampleFix + BS). To do that, for each sample z we decode with a beam width of size 5, and to sample 100 fixes we draw 20 samples from p(z). We also provide the sampling speed in terms of sampling 100 fixes for a given program, averaged over 100 runs.
The running time results show that CVAE-based models are at least 4x faster than beam search in sampling the fixes. In this experiment, we feed the programs through up to 5 iterations. Table 2 shows that our approaches outperform DeepFix [13], RLAssist [12], and DrRepair [32] in resolving the error messages. This shows that generating multiple diverse fixes can lead to a substantial improvement in performance. Beam search, SampleFix, DS-SampleFix, and DS-SampleFix + BS resolve 63.9%, 56.3%, 61.0%, and 65.2% of the error messages, respectively. Overall, our DS-SampleFix + BS is able to resolve all compile-time errors for 45.2% of the programs, around 12 percentage points improvement over DeepFix and 11 percentage points improvement over DrRepair. Furthermore, the performance advantage of DS-SampleFix over SampleFix shows the effectiveness of our novel regularizer. Note that DrRepair [32] has achieved further improvements by relying on the compiler. While utilizing the compiler output seems to be beneficial, it also limits the generality of the approach. For a fair comparison, we report the performance of DrRepair without the compiler output, but consider informing our model by the compiler output an interesting avenue for future work.

Fig. 5: An example illustrating that our DS-SampleFix can generate diverse fixes. Left: Example of a program with a typographic error. The error, i.e., a missing bracket, is highlighted at line 13. Right: Our DS-SampleFix proposes multiple fixes for the given error (line number with the corresponding fix), highlighting the ability of DS-SampleFix to generate diverse and accurate fixes.

Qualitative Example. We illustrate diverse fixes generated by our DS-SampleFix in Figure 5 using a code example with typographic errors, together with the corresponding two output samples of DS-SampleFix. In the example given in Figure 5, there is a missing closing curly bracket after line 13.
We can see that DS-SampleFix generates multiple correct fixes to resolve the error in the given program. This indicates that our approach is capable of handling the inherent ambiguity and uncertainty in predicting fixes for erroneous programs. The two fixes in Figure 5 are unique and compilable fixes that implement different functionalities for the given erroneous program. Note that generating multiple diverse fixes gives programmers the opportunity to choose the desired fix(es) among the compilable ones, based on their intention.

Generating Functionally Diverse Programs. Given an erroneous program, our approach can generate multiple potential fixes that result in a successful compilation. Since we do not have access to the user's intention, it is desirable to suggest multiple potential fixes with diverse functionalities. Here, we evaluate our approach in generating multiple programs with different functionalities. In order to assess the different functionalities, we use the following test-based approach. The dataset of Gupta et al. [13] consists of 93 different tasks. The description of each task, including the input/output format, is provided in the dataset. Based on the input/output format, we can provide input examples for each task. To measure the diversity in functionality of the programs for each task, we generate 10 input examples. Given a group of programs for a specific task, we run each program on the input examples and collect the outputs. We consider two programs to have different functionalities if they return different outputs given the same input example(s).

In order to generate multiple programs, we use our iterative selection strategy (Subsection 3.5). In each iteration, we keep up to the N programs with the lowest number of error messages. At the end of the repair procedure, we obtain multiple repaired programs.
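The functionality test described above reduces to grouping repaired programs by their outputs on a shared set of inputs. A sketch of that grouping, with Python callables standing in for compiled C programs and made-up test inputs:

```python
def functionality_groups(programs, inputs):
    """Group programs by their output signature over a shared set of test
    inputs; programs in different groups are functionally different."""
    groups = {}
    for prog in programs:
        signature = tuple(prog(x) for x in inputs)  # outputs on every input
        groups.setdefault(signature, []).append(prog)
    return groups

# Hypothetical repaired "programs": two compute the square, one doubles.
programs = [lambda x: x * x, lambda x: 2 * x, lambda x: x ** 2]
inputs = [0, 1, 2, 3]
print(len(functionality_groups(programs, inputs)))  # 2 distinct functionalities
```

Note this only witnesses difference on the chosen inputs: two programs in the same group may still differ on inputs outside the test set, so the count is a lower bound on functional diversity.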
As discussed (Figure 1), a subset of these programs will successfully compile. In this experiment, we use the real-world test set and set N = 50, as this number is large enough to allow us to study the diversity of the fixes without incurring an unnecessarily large load on our infrastructure. Our goals in the remainder of this section are:

1. For each erroneous program, to measure the number of generated unique fixes that successfully compile.
2. For each erroneous program, to measure the number of generated programs with different functionalities.

Figure 6a and Figure 6b show the syntactic diversity of the generated programs and the diversity in functionality of these programs, respectively. In Figure 6a we show the percentage of successfully compiled programs with unique fixes for a given erroneous program. The x-axis refers to the number of generated and successfully compiled unique programs, and the y-axis to the percentage of repaired programs for which that many unique fixes were generated. For example, for almost 20% of the repaired programs, DS-SampleFix + BS generates two unique fixes. Overall, we can see that DS-SampleFix and DS-SampleFix + BS generate more diverse programs in comparison to the other approaches.

Fig. 6: The results show the performance of Beam search (BS), SampleFix, DS-SampleFix, and DS-SampleFix + BS. (a) Percentage of the number of generated, successfully compiled, unique programs for the given erroneous programs. (b) Percentage of successfully compiled programs with different functionalities for the given erroneous programs.

Table 3: Results of performance comparison of Beam search (BS), SampleFix, DS-SampleFix, and DS-SampleFix + BS on generating diverse programs.
Diverse Prog refers to the percentage of cases where the models generate at least two successfully compiled unique programs. Diverse Func denotes the percentage of cases where the models generate at least two programs with different functionalities.

Models               Diverse Prog   Diverse Func
Beam search          55.6%          45.1%
SampleFix            44.6%          34.9%
DS-SampleFix         68.8%          53.4%
DS-SampleFix + BS    69.5%          60.4%

Figure 6b shows the percentage of successfully compiled programs with different functionalities for a given erroneous program. Here, the x-axis refers to the number of generated functionally different programs, and the y-axis refers to the percentage of erroneous programs with at least one fix for which we could generate that many diverse fixes. One can observe that in many cases, e.g., up to 60% of the time for SampleFix, the methods generate programs corresponding to a single functionality. However, in many other cases they generate functionally diverse fixes. For example, in almost 10% of the cases, DS-SampleFix generates 10 or more fixes with different functionalities. In Figure 6b we can see that all of the approaches have a higher percentage for generating programs with the same functionality in comparison to the results in Figure 6a. This indicates that for some of the given erroneous programs, we generate multiple unique programs with approximately the same functionality. These results show that DS-SampleFix and DS-SampleFix + BS generate programs with more diverse functionalities in comparison to the other approaches.

In Table 3 we compare the performance of our approaches in generating diverse programs and functionalities. We provide results for all four of our approaches, i.e., Beam search (BS), SampleFix, DS-SampleFix, and DS-SampleFix + BS.
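The Diverse Prog and Diverse Func percentages defined in the Table 3 caption can be computed as sketched below; the per-program counts here are hypothetical illustrative values, not the paper's data:

```python
# Sketch of the Table 3 metrics (illustrative values, not the paper's data).
# For each erroneous program we record the number of unique compiling fixes
# and the number of functionally distinct fixes; Diverse Prog / Diverse Func
# are the fractions with at least two of each, taken over the programs that
# have at least one successfully compiled fix.


def diversity_metrics(per_program):
    """per_program: list of (unique_compiling, functionally_distinct) counts."""
    repaired = [p for p in per_program if p[0] >= 1]
    diverse_prog = sum(1 for u, _ in repaired if u >= 2) / len(repaired)
    diverse_func = sum(1 for _, f in repaired if f >= 2) / len(repaired)
    return diverse_prog, diverse_func


# Hypothetical counts for four erroneous programs (one was never repaired,
# so it is excluded from both denominators).
counts = [(3, 2), (1, 1), (5, 1), (0, 0)]
prog, func = diversity_metrics(counts)
print(prog, func)  # 2/3 and 1/3 of the repaired programs
```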
We consider that an approach can generate diverse programs if it can produce two or more successfully compiled, unique programs for a given erroneous program. Similarly, we say that the approach produces functionally diverse programs if it can generate two or more programs with observable differences in functionality for a given erroneous program. Here, we report the percentage out of the total number of erroneous programs for which the model generates at least one successfully compiled program. The results in this table show that our DS-SampleFix + BS approach generates programs with more diverse functionalities in comparison to the other approaches.

5 Conclusion

We propose a novel approach to correct common programming errors. We recognize and model the inherent ambiguity and uncertainty in predicting multiple fixes. In contrast to previous approaches, our approach is able to learn the distribution over candidate fixes rather than the most likely fix. We achieve increased diversity of the sampled fixes by a novel diversity-sensitive regularizer. We show that our approach is capable of generating multiple diverse fixes with different functionalities. Furthermore, our evaluations on synthetic and real-world data show improvements over state-of-the-art methods.

References

1. Allamanis, M., Barr, E.T., Devanbu, P., Sutton, C.: A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) (2018)
2. Bader, J., Scott, A., Pradel, M., Chandra, S.: Getafix: learning to fix bugs automatically. Proc. ACM Program. Lang. 3 (OOPSLA) (2019)
3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate (2015)
4. Bhattacharyya, A., Schiele, B., Fritz, M.: Accurate and diverse sampling of sequences based on a "best of many" sample objective. In: CVPR (2018)
5.
Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. In: SIGNLL Conference on Computational Natural Language Learning (CoNLL) (2016)
6. Das, R., Ahmed, U.Z., Karkare, A., Gulwani, S.: Prutor: A system for tutoring CS1 and collecting student programs for analysis (2016)
7. Deshpande, A., Aneja, J., Wang, L., Schwing, A.G., Forsyth, D.: Fast, diverse and accurate image captioning guided by part-of-speech. In: CVPR (2019)
8. D'Antoni, L., Samanta, R., Singh, R.: Qlose: Program repair with quantitative objectives. In: CAV (2016)
9. Girshick, R.: Fast R-CNN. In: ICCV (2015)
10. Gottschlich, J., Solar-Lezama, A., Tatbul, N., Carbin, M., Rinard, M., Barzilay, R., Amarasinghe, S., Tenenbaum, J.B., Mattson, T.: The three pillars of machine programming. In: MAPL (2018)
11. Goues, C.L., Pradel, M., Roychoudhury, A.: Automated program repair. Commun. ACM 62(12), 56-65 (2019)
12. Gupta, R., Kanade, A., Shevade, S.: Deep reinforcement learning for programming language correction. In: AAAI (2019)
13. Gupta, R., Pal, S., Kanade, A., Shevade, S.K.: DeepFix: Fixing common C language errors by deep learning. In: AAAI (2017)
14. Jang, E., Gu, S., Poole, B.: Categorical reparameterization with Gumbel-Softmax. In: ICLR (2017)
15. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
16. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: ICLR (2014)
17. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
18. Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.: Unsupervised learning of hierarchical representations with convolutional deep belief networks. Communications of the ACM (2011)
19.
Li, Y., Wang, S., Nguyen, T.N.: DLFix: Context-based code transformation learning for automated program repair. In: International Conference on Software Engineering (ICSE) (2020)
20. Long, F., Rinard, M.: Automatic patch generation by learning correct code. In: ACM SIGPLAN Notices (2016)
21. Maddison, C.J., Mnih, A., Teh, Y.W.: The concrete distribution: A continuous relaxation of discrete random variables (2016)
22. Monperrus, M.: Automatic software repair: a bibliography. ACM Computing Surveys (CSUR) (2018)
23. Pu, Y., Narasimhan, K., Solar-Lezama, A., Barzilay, R.: sk_p: a neural program corrector for MOOCs. In: ACM SIGPLAN (2016)
24. Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows. In: ICML (2015)
25. Seo, H., Sadowski, C., Elbaum, S., Aftandilian, E., Bowdidge, R.: Programmers' build errors: a case study (at Google). In: ICSE (2014)
26. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
27. Singh, R., Gulwani, S., Solar-Lezama, A.: Automated feedback generation for introductory programming assignments. In: PLDI (2013)
28. Smith, E.K., Barr, E.T., Goues, C.L., Brun, Y.: Is the cure worse than the disease? Overfitting in automated program repair. In: Foundations of Software Engineering (ESEC/FSE) (2015)
29. Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. In: NIPS (2015)
30. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS (2014)
31. Wang, L., Schwing, A., Lazebnik, S.: Diverse and accurate image description using a variational auto-encoder with an additive Gaussian encoding space. In: NIPS (2017)
32. Yasunaga, M., Liang, P.: Graph-based, self-supervised program repair from diagnostic feedback. In: ICML (2020)
