SampleFix: Learning to Generate Functionally Diverse Fixes



Hossein Hajipour¹, Apratim Bhattacharyya², Cristian-Alexandru Staicu¹, and Mario Fritz¹

¹ CISPA Helmholtz Center for Information Security, Germany
² Max Planck Institute for Informatics, Germany

Abstract. Automatic program repair holds the potential of dramatically improving the productivity of programmers during the software development process and the correctness of software in general. Recent advances in machine learning, deep learning, and NLP have rekindled the hope to eventually fully automate the process of repairing programs. However, previous approaches that aim to predict a single fix are prone to fail due to uncertainty about the true intent of the programmer. Therefore, we propose a generative model that learns a distribution over potential fixes. Our model is formulated as a deep conditional variational autoencoder that can efficiently sample fixes for a given erroneous program. In order to ensure diverse solutions, we propose a novel regularizer that encourages diversity over a semantic embedding space. Our evaluations on common programming errors show for the first time the generation of diverse fixes and strong improvements over state-of-the-art approaches by fixing up to 45% of the erroneous programs. We additionally show that for 65% of the repaired programs, our approach was able to generate multiple programs with diverse functionalities.

Keywords: Program repair · Generative models · Conditional variational autoencoder.

1 Introduction

Software development is a time-consuming and expensive process. Unfortunately, programs written by humans typically come with bugs, so significant effort needs to be invested to obtain code that is only likely to be correct. Debugging is also typically performed by humans and can itself contain mistakes.
This is neither desirable nor acceptable in many critical applications. Therefore, automatically locating and correcting program errors [11] offers the potential to increase productivity as well as improve the correctness of software.

Advances in deep learning [17,18], computer vision [9,26], and NLP [30,3] have dramatically boosted machines' ability to automatically learn representations of natural data, such as images and natural-language content, for various tasks. Deep learning models have also been successful in learning distributions over continuous [29,16] and discrete data [21,14], and in generating new and diverse data points [10]. These advances in machine learning and the advent of large corpora of source code [1] provide new opportunities for harnessing deep learning methods to understand, generate, or debug programs.

Fig. 1: Our SampleFix approach with the diversity regularizer promotes sampling of diverse fixes that account for the inherent uncertainty in the automated debugging task.

Fig. 2: SampleFix captures the inherent ambiguity of the possible fixes by sampling multiple potential fixes for the given erroneous real-world program. Potential fixes with the same functionality are highlighted with the same color, and the newly added tokens are underlined.

Prior work in automatic program repair predominantly relies on expert-designed rules and error models that describe the space of potential fixes [27,8]. Such hand-designed rules and error models are not easily adaptable to new domains and require a time-consuming process. In contrast, learning-based approaches provide an opportunity to adapt such models to new domains of errors. Therefore, there has been increasing interest in carrying over the success stories of deep learning in NLP and related techniques to learning-based approaches that tackle the "common programming errors" problem [13,12].
Such investigations have included compile-time errors such as missing scope delimiters, extraneous symbols, or incompatible operators. Novice programmers and even experienced developers often struggle with these types of errors [25], usually due to a lack of attention to the details of programs and/or the programmer's inexperience.

Recently, Gupta et al. [13] proposed a deep sequence-to-sequence model called DeepFix where, given an erroneous program, the model predicts the locations of the errors and a potential fix for each predicted location. The problem is formulated as a deterministic task, where the model is trained to predict a single fix for each error. However, different programs – and therefore also their fixes – can express the same functionality. Besides, there is also uncertainty about the intention of the programmer. Figure 1 illustrates the issue. Given an erroneous (buggy) program, there is a large number of programs within a certain edit distance. A subset of these will result in successful compilation. The remaining programs will still implement different functionalities, and – without additional information or assumptions – it is impossible to tell which program/functionality was intended. In addition, previous work [28] also identified overfitting as one of the major challenges for learning-based automatic program repair. We believe that one of the culprits for this is the poor objectives used in the training process, e.g., training a model to generate one particular target fix.

Let us consider the example in Figure 2 from the dataset of DeepFix [13]. This example program is incorrect due to an imbalanced number of curly brackets. In a traditional scenario, a compiler would warn the developer about this error.
For example, when trying to compile this code with GCC, the compiler terminates with the error "expected declaration or statement at end of input", indicating line 10 as the error location. Experienced developers would be able to understand this cryptic message and proceed to fixing the program. Based on their intention, they can decide to add a curly bracket either at line 6 (patch P1) or at line 9 (patch P2). Both of these solutions would fix the compilation error in the erroneous program, but the resulting programs have different semantics.

Hence, we propose a deep generative framework to automatically correct programming errors by learning the distribution of potential fixes. We investigate different solutions to model the distribution of the fixes and to sample multiple fixes, including different variants of Conditional Variational Autoencoders (CVAE) and beam search decoding. It turns out (as we will also show in our experiments) that CVAE and beam search decoding are complementary, while CVAE is computationally more efficient than beam search decoding. Furthermore, we encourage diversity in the candidate fixes through a novel regularizer which penalizes similar fixes for an identical erroneous program and significantly increases the effectiveness of our approach. The candidate fixes in Figure 2 are generated by our approach, illustrating its potential for generating both diverse and correct fixes. For a given erroneous program, our approach is capable of generating diverse fixes to resolve the syntax errors.

To summarize, the contributions of this paper are as follows:

1. We propose an efficient generative method to automatically correct common programming errors by learning the distribution over potential fixes.
2. We propose a novel regularizer to encourage the model to generate diverse fixes.
3. Our generative model together with the diversity regularizer shows an increase in the diversity and accuracy of fixes, and a strong improvement over state-of-the-art approaches.

2 Related Work

Our work builds on the general idea of sequence-to-sequence models as well as ideas from neural machine translation. We phrase our approach as a variational autoencoder and compare it to prior learning-based program repair approaches. We review the related work in order below.

2.1 Neural Machine Translation

Sutskever et al. [30] introduce neural machine translation and cast it as a sequence-to-sequence learning problem. The popular encoder-decoder architecture is introduced to map source sentences into target sentences. One of the major drawbacks of this model is that the sequence encoder needs to compress all of the extracted information into a fixed-length vector. Bahdanau et al. [3] address this issue by using an attention mechanism in the encoder-decoder architecture, which focuses on the most relevant part of the encoded information by learning to search over the encoded vectors. In our work, we employ a sequence-to-sequence model with attention to parameterize our generative model. This model takes an incorrect program as input and maps it to many potential fixes by drawing samples from the estimated distribution of the fixes.

2.2 Variational Autoencoders

The variational autoencoder [16,24] is a generative model designed to learn deep directed latent-variable graphical models of large datasets. The model is trained on the data distribution by maximizing the variational lower bound of the log-likelihood as the objective function. Bowman et al. [5] extend this framework by introducing an RNN-based variational autoencoder to enable the learning of latent-variable generative models on text data.
The proposed model is successful in generating diverse and coherent sentences. To model conditional distributions for structured output representations, Sohn et al. [29] extended variational autoencoders by introducing an objective that maximizes the conditional data log-likelihood. In our approach, we employ an RNN-based conditional variational autoencoder to model the distribution of potential fixes given erroneous programs. Variational autoencoder approaches enable the efficient sampling of accurate and diverse fixes.

2.3 Learning-based Program Repair

Recently there has been growing interest in using learning-based approaches to automatically repair programs [22]. Long and Rinard [20] proposed a probabilistic model that uses designed code features to rank potential fixes for a given program. Pu et al. [23] employ an encoder-decoder neural architecture to automatically correct programs. In these works and many other learning-based program repair approaches, enumerative search over programs is required to resolve all errors. In contrast, our proposed framework is capable of predicting the location and potential fixes by passing the whole program to the model. Besides this, unlike our approach, which only generates fixes for the given erroneous program, Pu et al. [23] need to predict whole program statements to resolve the errors.

There are two important program repair tasks explored in the literature: fixing syntactic errors and fixing semantic ones. While in the current work we propose a technique for fixing syntactic errors, we believe that our observation about the diversity of fixes has implications for approaches aimed at repairing semantic bugs as well. Most of the recent work in this domain aims to predict a unique fix, often extracted from a real-world repository.
For example, Getafix [2], a recent approach for automatically repairing six types of semantic bugs, is evaluated on a set of 1,268 unique fixes written by developers. Similarly, DLFix [19] considers a bug to be fixed only if the patch exactly matches a patch provided by the developer. While this is an improved methodology in the spirit of our proposal, it is highly dependent on the performance of the test-suite oracle, which may not always capture the developer's intent.

DeepFix [13], RLAssist [12], and DrRepair [32] use neural representations to repair syntax errors in programs. In detail, DeepFix [13] uses a sequence-to-sequence model to directly predict a fix for incorrect programs. In contrast, our generative framework is able to generate multiple fixes by learning the distribution of potential corrections. Therefore, our model does not penalize, but rather encourages, diverse fixes. RLAssist [12] repairs programs by employing a reinforcement learning approach. The authors train an agent that navigates over the program to locate and resolve syntax errors. In this work, they only address typographic errors, rely on a hand-designed action space, and meet problems due to the increasing size of the action space. In contrast, our method shows improved performance on typographic errors and also generalizes to missing variable declaration errors by generating diverse fixes. In a recent work, Yasunaga and Liang [32] proposed DrRepair to resolve syntax errors by introducing a program feedback graph. They connect the relevant symbols in the source code and the compiler error messages and employ a graph neural network on top to model the process of program repair. In this work, they rely on compiler error messages, which can be helpful but also limits the generality of the method.
However, our proposed approach does not rely on additional information such as compiler error messages, and it resolves the errors by directly modeling the underlying distribution of the potential correct fixes.

3 SampleFix: Generative Model for Diversified Code Fixes

Repairing common program errors is a challenging task due to ambiguity in potential corrections and a lack of representative data. Given a single erroneous program and a certain number of allowed changes, there are multiple ways to fix the program, resulting in different styles and functionality. Without further information, the true, intended style and/or functionality remains unknown. In order to account for this inherent ambiguity, we propose a deep generative model to learn a distribution over potential fixes given the erroneous program – in contrast to predicting a single fix. We frame this challenging learning problem as a conditional variational autoencoder (CVAE). However, standard sampling procedures and limitations of datasets and their construction make learning and the generation of diverse samples a challenge. We address this issue with a beam search decoding scheme in combination with a novel regularizer that encourages diversity of the samples in the embedding space of the CVAE.

Fig. 3: Overview of SampleFix at inference time, highlighting the generation of diverse fixes.

Figure 3 provides an overview of our proposed approach at inference time. For a given erroneous program, the generative model draws T intermediate candidate fixes ŷ from the learned conditional distribution. We use a compiler to select a subset of promising intermediate candidate fixes based on the number of remaining errors.
This procedure is applied iteratively until we arrive at a set of candidate fixes within the maximum number of prescribed changes. We then select a final set of candidate fixes that compile and have unique syntax according to our measure described below (Subsection 3.5). In the following, we formulate our proposed generative model with the diversity regularizer and provide details of our training and inference process.

3.1 Conditional Variational Autoencoders for Generating Fixes

Conditional Variational Autoencoders (CVAE) [29] model conditional distributions p_θ(y | x) using latent variables z. The conditioning introduced through z enables the modeling of complex multi-modal distributions. As powerful transformations can be learned using neural networks, z itself can have a simple distribution, which allows for efficient sampling. This model allows for sampling from p_θ(y | x) given an input sequence x by first sampling latent variables ẑ from the prior distribution p(z). During training, amortized variational inference is used and the latent variables z are learned using a recognition network q_φ(z | x, y), parametrized by φ. In detail, the variational lower bound of the model (Equation 1) is maximized,

log p(y | x) ≥ E_{q_φ(z | x, y)} [ log p_θ(y | z, x) ] − D_KL( q_φ(z | x, y) ‖ p(z | x) ).   (1)

Penalizing the divergence of q_φ(z | x, y) from the prior in Equation 1 allows for sampling from the prior p(z) during inference. In practice, the variational lower bound is estimated using Monte-Carlo integration,

L̂_CVAE = (1/T) Σ_{i=1}^{T} log p_θ(y | ẑ_i, x) − D_KL( q_φ(z | x, y) ‖ p(z | x) ),   (2)

where ẑ_i ∼ q_φ(z | x, y) and T is the number of samples. We cast our model for resolving program errors in the Conditional Variational Autoencoder framework.
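The Monte-Carlo estimate in Equation 2 is simply an average of per-sample log-likelihoods minus the KL term. A minimal numerical sketch, where the log-likelihood and KL values are illustrative stand-ins rather than outputs of a trained model:

```python
def cvae_lower_bound(log_liks, kl):
    """Estimate Eq. 2: average of log p_theta(y | z_i, x) over T latent
    samples, minus the KL divergence term."""
    T = len(log_liks)
    return sum(log_liks) / T - kl

# Hypothetical log-likelihoods for T = 4 latent samples and a KL value.
log_liks = [-2.0, -1.5, -3.0, -2.5]
kl = 0.1
print(cvae_lower_bound(log_liks, kl))  # average (-2.25) minus KL -> about -2.35
```

Because every sample's log-likelihood enters the average with equal weight, all samples are pushed toward the single ground-truth fix, which is exactly what limits diversity in the plain CVAE.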
Here, the input x is the erroneous program and y is the fix. However, the plain CVAE as described in [29] suffers from diversity issues. Usually, the drawn samples do not reflect the true variance of the posterior p(y | x). This would amount to the correct fix potentially missing from our candidate fixes. To mitigate this problem, we next introduce an objective that aims to enhance the diversity of our candidate fixes.

3.2 Enabling Diverse Samples using a Best of Many Objective

Here, we introduce the diversity-enhancing objective that we use. Casting our model in the Conditional Variational Autoencoder framework enables us to sample a set of candidate fixes for a given erroneous program. However, the standard variational lower bound objective does not encourage diversity in the candidate fixes, because the average likelihood of the candidate fixes is considered. In detail, as the average likelihood is considered, all candidate fixes must explain the "true" fix in the training set well. This discourages diversity and constrains the recognition network, which is already constrained to maintain a Gaussian latent variable distribution. In practice, the learned distribution fails to fully capture the variance of the true distribution. To encourage diversity, we employ the "Many Samples" (MS) objective proposed by Bhattacharyya et al. [4],

L̂_MS = log( (1/T) Σ_{i=1}^{T} p_θ(y | ẑ_i, x) ) − D_KL( q_φ(z | x, y) ‖ p(z | x) ).   (3)

In comparison to Equation 2, this objective (Equation 3) encourages diversity in the model by allowing multiple chances to draw highly likely candidate fixes. This enables the model to generate diverse candidate fixes while maintaining high likelihood. In practice, due to numerical stability issues, we use the "Best of Many Samples" (BMS) objective, which is an approximation of Equation 3.
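To see why Equation 3 is numerically delicate, note that it averages likelihoods (not log-likelihoods) inside the logarithm, which underflows for very negative log-likelihoods unless computed with the log-sum-exp trick, whereas the BMS approximation simply keeps the best sample. A sketch with illustrative numbers, not values from the paper:

```python
import math

def ms_objective(log_liks, kl):
    """Eq. 3: log of the mean likelihood over T samples, computed stably
    via the log-sum-exp trick."""
    T = len(log_liks)
    m = max(log_liks)
    # log((1/T) * sum exp(l_i)) = m + log(sum exp(l_i - m)) - log T
    log_mean = m + math.log(sum(math.exp(l - m) for l in log_liks)) - math.log(T)
    return log_mean - kl

def bms_objective(log_liks, kl):
    """Eq. 4: 'Best of Many Samples' keeps only the most likely sample."""
    return max(log_liks) - kl

log_liks = [-2.0, -1.5, -3.0, -2.5]
print(ms_objective(log_liks, 0.1))   # lies below the BMS value
print(bms_objective(log_liks, 0.1))  # best log-likelihood (-1.5) minus KL
```

Since the log of a mean never exceeds the log of the maximum, BMS upper-bounds MS while avoiding exponentiating large negative log-likelihoods.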
This objective retains the diversity-enhancing nature of Equation 3 while being easy to train,

L̂_BMS = max_i [ log p_θ(y | ẑ_i, x) ] − D_KL( q_φ(z | x, y) ‖ p(z | x) ).   (4)

3.3 DS-SampleFix: Encouraging Diversity with a Diversity-sensitive Regularizer

To increase diversity using Equation 4, we would need a substantial number of samples during training. This is computationally prohibitive, especially for large models, as memory requirements and computation time increase linearly in the number of such samples. On the other hand, for a small number of samples, the objective behaves similarly to the standard CVAE objective, as the recognition network has fewer and fewer chances to draw highly likely samples/candidate fixes, thus limiting diversity. Therefore, in order to encourage the model to generate diverse fixes even with a limited number of samples, we propose a novel regularizer that aims to increase the distance between the two closest candidate fixes (Equation 5). This penalizes generating similar candidate fixes for a given erroneous program and thus encourages diversity in the set of candidate fixes. In comparison to Equation 4, we observe considerable gains even with the use of only T = 2 candidate fixes. In detail, we maximize the following objective,

L̂_DS-BMS = max_i [ log p_θ(y | ẑ_i, x) ] + min_{i,j} d(ŷ_i, ŷ_j) − D_KL( q_φ(z | x, y) ‖ p(z | x) ).   (5)

Distance Metric. Here, we discuss the distance metric d in Equation 5. Note that the samples {ŷ_i, ŷ_j} can be of different lengths. Therefore, we first pad the shorter sample to equalize lengths. Empirically, we find that the Euclidean distance performs best, mainly because, in practice, the Euclidean distance is easier to optimize.
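The min_{i,j} d(ŷ_i, ŷ_j) term of Equation 5 can be sketched as follows, treating candidate fixes as integer token-id sequences; the token ids and zero-padding used here are assumptions for illustration only:

```python
import math

def euclidean(a, b, pad=0):
    """Pad the shorter token-id sequence, then take the Euclidean distance."""
    n = max(len(a), len(b))
    a = a + [pad] * (n - len(a))
    b = b + [pad] * (n - len(b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_pairwise_distance(candidates):
    """The min_{i,j} d(y_i, y_j) term of Eq. 5 over sampled candidate fixes."""
    return min(
        euclidean(candidates[i], candidates[j])
        for i in range(len(candidates))
        for j in range(i + 1, len(candidates))
    )

# Three hypothetical candidate fixes as token-id sequences: the first two
# are near-duplicates, so they dominate (minimize) the regularizer term.
fixes = [[5, 7, 9], [5, 7, 8], [1, 2]]
print(min_pairwise_distance(fixes))  # 1.0
```

Maximizing this term pushes apart whichever pair of samples is currently most similar, which is why diversity improves even with only T = 2 samples.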
3.4 Beam Search Decoding for Generating Fixes

Beam search decoding is a classical method to generate multiple outputs from a sequence-to-sequence model [31,7]. Given the distribution p_θ(y | x) of a sequence-to-sequence model, we can generate multiple outputs by unrolling the model in time and keeping the top-K tokens at each time step, where K is the beam width. In our generative model, we employ the beam search algorithm to sample multiple fixes. In detail, we decode with a beam width of size K for each sample z, for a total of T samples from p(z). We set T = 100 during inference.

3.5 Selecting Diverse Candidate Fixes

We extend the iterative repair procedure introduced by Gupta et al. [13] to the context of our proposed generative model, where the iterative procedure now leverages multiple candidate fixes. Given an erroneous program, the generative model outputs T candidate fixes. Each fix contains a potential erroneous line with the corresponding fix, so in each iteration we only edit one line of the given program. To select the best fixes, we take the candidate fixes and the input erroneous program and reconcile them to create T updated programs. We evaluate these fixes using a compiler and select up to the best N fixes, where N ≤ T. We only select the unique fixes which do not introduce any additional error messages. In the next iteration, we feed up to N programs back to the model. These programs are updated based on the selected fixes of the previous iteration. We keep up to the N programs with the lowest number of error messages over the iterations. At the end of the repair procedure, we obtain multiple potential candidate fixes. In the experiments where we are interested in a single repaired program, we pick the fix with the highest probability score according to our deep generative model.
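The iterative selection procedure of Subsection 3.5 can be sketched as below. `sample_fixes` (the generative model proposing one-line edits) and `count_errors` (the compiler's error-message count) are hypothetical stand-ins for the paper's actual model and compiler, not its real implementation:

```python
def apply_fix(program, fix):
    """Splice one repaired line into the program; a fix is a
    (line_number, new_line) pair."""
    lines = program.split("\n")
    line_no, text = fix
    lines[line_no] = text
    return "\n".join(lines)

def iterative_repair(program, sample_fixes, count_errors, iters=5, N=50):
    """Each round: expand every surviving program with its sampled one-line
    edits, drop duplicates, keep the N programs with the fewest errors."""
    pool = [program]
    for _ in range(iters):
        candidates = [apply_fix(p, f) for p in pool for f in sample_fixes(p)]
        unique = list(dict.fromkeys(candidates))  # drop duplicate programs
        unique.sort(key=count_errors)             # fewest error messages first
        pool = unique[:N] or pool
        if count_errors(pool[0]) == 0:            # a fully compiling program
            break
    return pool
```

A toy run with a brace-counting "compiler" and a fixed set of proposed edits terminates in one iteration once a balanced program appears in the pool.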
3.6 Model Architecture and Implementation Details

Fig. 4: Overview of the network architecture.

To ensure a fair comparison, our generative model is based on a sequence-to-sequence architecture similar to Gupta et al. [13]. Figure 4 shows the architecture of our approach in detail. Note that the recognition network is available to encode the fixes to latent variables z only during training. All of the networks in our framework consist of 4 layers of LSTM cells with 300 units. The network is optimized using the Adam optimizer [15] with the default settings. We use T = 2 samples to train our models and T = 100 samples during inference. To process a program through the networks, we tokenize it similarly to the setting used by Gupta et al. [13].

During inference, the conditioning erroneous program x is input to the encoder, which encodes the program into the vector v. To generate multiple fixes using our decoder, the code vector v, along with a sample of z from the prior p(z), is input to the decoder. For simplicity, we use a standard Gaussian N(0, I) prior, although more complex priors can easily be leveraged. The decoder is unrolled in time and outputs the logits of p_θ(y | ẑ_i, x).

4 Experiments

We evaluate our approach on the task of repairing common programming errors. We evaluate the diversity and accuracy of our sampled error corrections, and compare our proposed method with the state of the art.

4.1 Dataset

We use the dataset published by Gupta et al. [13], as it is sizable and includes real-world data. It contains C programs, covering 93 different tasks, written by students in an introductory programming course. The programs were collected using a web-based system [6].
These programs have token lengths in the range [75, 450] and contain typographic and missing variable declaration errors. To tokenize the programs and generate training and test data, different types of tokens are used, such as types, keywords, special characters, functions, literals, and variables. The dataset contains two sets of data, called synthetic and real-world data. The synthetic data contains erroneous programs which are synthesized by mutating correct programs written by students. The real-world data contains 6975 erroneous programs with 16766 error messages.

4.2 Evaluation

Table 1: Results of the performance comparison of DeepFix, Beam search (BS), SampleFix, and DS-SampleFix on synthetic data. Typo, Miss Dec, and All refer to typographic errors, missing variable declarations, and all of the errors, respectively.

Models            | Typo  | Miss Dec | All
DeepFix           | 84.7% | 78.8%    | 82.0%
Beam search (BS)  | 91.8% | 89.5%    | 90.7%
SampleFix         | 86.8% | 86.5%    | 86.6%
DS-SampleFix      | 95.6% | 88.1%    | 92.2%

Table 2: Results of the performance comparison of DeepFix, RLAssist, DrRepair, Beam search (BS), SampleFix, DS-SampleFix, and DS-SampleFix + BS. Typo, Miss Dec, and All refer to typographic errors, missing variable declarations, and all of the error messages, respectively. Speed denotes the computational time for sampling 100 fixes. "Prog" denotes successfully compiled programs, while "Msg" refers to resolved error messages.

Models            | Typo Prog | Typo Msg | Miss Dec Prog | Miss Dec Msg | All Prog | All Msg | Speed (s)
DeepFix [13]      | 23.3%     | 30.8%    | 10.1%         | 12.9%        | 33.4%    | 40.8%   | -
RLAssist [12]     | 26.6%     | 39.7%    | -             | -            | -        | -       | -
DrRepair [32]     | -         | -        | -             | -            | 34.0%    | -       | -
Beam search (BS)  | 25.9%     | 42.2%    | 20.3%         | 47.0%        | 44.7%    | 63.9%   | 4.82
SampleFix         | 24.8%     | 38.8%    | 16.1%         | 22.8%        | 40.9%    | 56.3%   | 0.88
DS-SampleFix      | 27.7%     | 40.9%    | 16.7%         | 24.7%        | 44.4%    | 61.0%   | 0.88
DS-SampleFix + BS | 27.8%     | 45.6%    | 19.2%         | 47.9%        | 45.2%    | 65.2%   | 1.17

We evaluate our approach on synthetic and real-world data. To evaluate our approach on the synthetic test set, we randomly select 20k pairs.
This data contains pairs of erroneous programs with the intended fixes. To evaluate our approach on real-world data, we use a real-world set of erroneous programs. Unlike for the synthetic test set, we do not have access to the intended fix(es) in the real-world data. However, we can check the correctness of a program using the evaluator (compiler). Following prior work, we train two networks, one for typographic errors and another to fix missing variable declaration errors. Note that there might be an overlap between the errors resolved by the network for typographic errors and the network for missing variable declaration errors, so we also provide the overall results for the resolved error messages.

Synthetic Data. Table 1 shows the comparison of our proposed approaches, Beam search (BS), SampleFix, and DS-SampleFix, with DeepFix [13] on the synthetic data in the first iteration. In Table 1 we can see that our approaches outperform DeepFix in generating intended fixes for the typographic and missing variable declaration errors. Beam search (BS), SampleFix, and DS-SampleFix generate 90.7%, 86.6%, and 92.2% of the intended fixes, respectively.

Real-World Data. In Table 2 we compare our approaches with state-of-the-art approaches (DeepFix [13], RLAssist [12], and DrRepair [32]) on the real-world data. In our experiments (Table 2) we show the performance of beam search decoding, CVAEs (SampleFix), and our proposed diversity-sensitive regularizer (DS-SampleFix). Furthermore, we show that DS-SampleFix can still take advantage of the beam search algorithm (DS-SampleFix + BS). To do that, for each sample z we decode with a beam width of size 5, and to sample 100 fixes we draw 20 samples from p(z). We also provide the sampling speed in terms of sampling 100 fixes for a given program, averaged over 100 runs.
The running time results show that CVAE-based models are at least 4x faster than beam search in sampling the fixes. In this experiment, we feed the programs through up to 5 iterations. Table 2 shows that our approaches outperform DeepFix [13], RLAssist [12], and DrRepair [32] in resolving the error messages. This shows that generating multiple diverse fixes can lead to a substantial improvement in performance. Beam search, SampleFix, DS-SampleFix, and DS-SampleFix + BS resolve 63.9%, 56.3%, 61.0%, and 65.2% of the error messages, respectively. Overall, our DS-SampleFix + BS is able to resolve all compile-time errors for 45.2% of the programs, around 12 percentage points improvement over DeepFix and 11 percentage points improvement over DrRepair. Furthermore, the performance advantage of DS-SampleFix over SampleFix shows the effectiveness of our novel regularizer. Note that DrRepair [32] has achieved further improvements by relying on the compiler. While utilizing the compiler output seems to be beneficial, it also limits the generality of the approach. For a fair comparison, we report the performance of DrRepair without the compiler output, but consider informing our model by the compiler output an interesting avenue for future work.

Fig. 5: An example illustrating that our DS-SampleFix can generate diverse fixes. Left: Example of a program with a typographic error. The error, i.e., a missing bracket, is highlighted at line 13. Right: Our DS-SampleFix proposes multiple fixes for the given error (line number with the corresponding fix), highlighting the ability of DS-SampleFix to generate diverse and accurate fixes.

Qualitative Example. We illustrate diverse fixes generated by our DS-SampleFix in Figure 5 using a code example with typographic errors, together with the corresponding two output samples of DS-SampleFix. In the example given in Figure 5, there is a missing closing curly bracket after line 13.
We can see that DS-SampleFix generates multiple correct fixes to resolve the error in the given program. This indicates that our approach is capable of handling the inherent ambiguity and uncertainty in predicting fixes for erroneous programs. The two fixes in Figure 5 are unique and compilable fixes that implement different functionalities for the given erroneous program. Note that generating multiple diverse fixes gives programmers the opportunity to choose the desired fix(es) among the compilable ones, based on their intention.

Generating Functionally Diverse Programs. Given an erroneous program, our approach can generate multiple potential fixes that result in a successful compilation. Since we do not have access to the user's intention, it is desirable to suggest multiple potential fixes with diverse functionalities. Here, we evaluate our approach in generating multiple programs with different functionalities. In order to assess the different functionalities, we use the following test-based approach. The dataset of Gupta et al. [13] consists of 93 different tasks. The description of each task, including the input/output format, is provided in the dataset. Based on the input/output format, we can provide input examples for each task. To measure the diversity in functionality of the programs for each task, we generate 10 input examples. Given a group of programs for a specific task, we run each program on the input examples and collect the outputs. We consider two programs to have different functionalities if they return different outputs given the same input example(s).

In order to generate multiple programs, we use our iterative selection strategy (Subsection 3.5). In each iteration, we keep up to the N programs with the lowest number of error messages. At the end of the repair procedure, we obtain multiple repaired programs.
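The functionality test described above reduces to grouping repaired programs by their outputs on a shared set of inputs. A sketch of that grouping, with Python callables standing in for compiled C programs and made-up test inputs:

```python
def functionality_groups(programs, inputs):
    """Group programs by their output signature over a shared set of test
    inputs; programs in different groups are functionally different."""
    groups = {}
    for prog in programs:
        signature = tuple(prog(x) for x in inputs)  # outputs on every input
        groups.setdefault(signature, []).append(prog)
    return groups

# Hypothetical repaired "programs": two compute the square, one doubles.
programs = [lambda x: x * x, lambda x: 2 * x, lambda x: x ** 2]
inputs = [0, 1, 2, 3]
print(len(functionality_groups(programs, inputs)))  # 2 distinct functionalities
```

Note this only witnesses difference on the chosen inputs: two programs in the same group may still differ on inputs outside the test set, so the count is a lower bound on functional diversity.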
As discussed (Figure 1), a subset of these programs will successfully compile. In this experiment, we use the real-world test set and set N = 50, as this number is large enough to allow us to study the diversity of the fixes without incurring an unnecessarily large load on our infrastructure. Our goals in the remainder of this section are:

1. For each erroneous program, to measure the number of generated unique fixes that successfully compile.
2. For each erroneous program, to measure the number of generated programs with different functionalities.

Figure 6a and Figure 6b show the syntactic diversity of the generated programs and the diversity in functionality of these programs, respectively. In Figure 6a we show the percentage of successfully compiled programs with unique fixes for a given erroneous program. The x-axis refers to the number of generated and successfully compiled unique programs, and the y-axis to the percentage of repaired programs for which that many unique fixes were generated. For example, for almost 20% of the repaired programs, DS-SampleFix + BS generates two unique fixes. Overall, we can see that DS-SampleFix and DS-SampleFix + BS generate more diverse programs in comparison to the other approaches.

Fig. 6: The results show the performance of Beam search (BS), SampleFix, DS-SampleFix, and DS-SampleFix + BS. (a) Percentage of the number of generated, successfully compiled, unique programs for the given erroneous programs. (b) Percentage of successfully compiled programs with different functionalities for the given erroneous programs.

Table 3: Results of performance comparison of Beam search (BS), SampleFix, DS-SampleFix, and DS-SampleFix + BS on generating diverse programs.
Diverse Prog refers to the percentage of cases where the models generate at least two successfully compiled unique programs. Diverse Func denotes the percentage of cases where the models generate at least two programs with different functionalities.

Models               Diverse Prog   Diverse Func
Beam search          55.6%          45.1%
SampleFix            44.6%          34.9%
DS-SampleFix         68.8%          53.4%
DS-SampleFix + BS    69.5%          60.4%

Figure 6b shows the percentage of successfully compiled programs with different functionalities for a given erroneous program. Here, the x-axis refers to the number of generated functionally different programs, and the y-axis refers to the percentage of erroneous programs with at least one fix for which we could generate that many diverse fixes. One can observe that in many cases, e.g., up to 60% of the time for SampleFix, the methods generate programs corresponding to a single functionality. However, in many other cases they generate functionally diverse fixes. For example, in almost 10% of the cases, DS-SampleFix generates 10 or more fixes with different functionalities. In Figure 6b we can see that all of the approaches have a higher percentage for generating programs with the same functionality in comparison to the results in Figure 6a. This indicates that for some of the given erroneous programs, we generate multiple unique programs with approximately the same functionality. These results show that DS-SampleFix and DS-SampleFix + BS generate programs with more diverse functionalities in comparison to the other approaches.

In Table 3 we compare the performance of our approaches in generating diverse programs and functionalities. We provide results for all four of our approaches, i.e., Beam search (BS), SampleFix, DS-SampleFix, and DS-SampleFix + BS.
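The Diverse Prog and Diverse Func percentages defined in the Table 3 caption can be computed as sketched below; the per-program counts here are hypothetical illustrative values, not the paper's data:

```python
# Sketch of the Table 3 metrics (illustrative values, not the paper's data).
# For each erroneous program we record the number of unique compiling fixes
# and the number of functionally distinct fixes; Diverse Prog / Diverse Func
# are the fractions with at least two of each, taken over the programs that
# have at least one successfully compiled fix.


def diversity_metrics(per_program):
    """per_program: list of (unique_compiling, functionally_distinct) counts."""
    repaired = [p for p in per_program if p[0] >= 1]
    diverse_prog = sum(1 for u, _ in repaired if u >= 2) / len(repaired)
    diverse_func = sum(1 for _, f in repaired if f >= 2) / len(repaired)
    return diverse_prog, diverse_func


# Hypothetical counts for four erroneous programs (one was never repaired,
# so it is excluded from both denominators).
counts = [(3, 2), (1, 1), (5, 1), (0, 0)]
prog, func = diversity_metrics(counts)
print(prog, func)  # 2/3 and 1/3 of the repaired programs
```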
We consider that an approach can generate diverse programs if it can produce two or more successfully compiled, unique programs for a given erroneous program. Similarly, we say that the approach produces functionally diverse programs if it can generate two or more programs with observable differences in functionality for a given erroneous program. Here, we report the percentage out of the total number of erroneous programs for which the model generates at least one successfully compiled program. The results in this table show that our DS-SampleFix + BS approach generates programs with more diverse functionalities in comparison to the other approaches.

5 Conclusion

We propose a novel approach to correct common programming errors. We recognize and model the inherent ambiguity and uncertainty in predicting multiple fixes. In contrast to previous approaches, our approach is able to learn the distribution over candidate fixes rather than the most likely fix. We achieve increased diversity of the sampled fixes by a novel diversity-sensitive regularizer. We show that our approach is capable of generating multiple diverse fixes with different functionalities. Furthermore, our evaluations on synthetic and real-world data show improvements over state-of-the-art methods.

References

1. Allamanis, M., Barr, E.T., Devanbu, P., Sutton, C.: A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) (2018)
2. Bader, J., Scott, A., Pradel, M., Chandra, S.: Getafix: learning to fix bugs automatically. Proc. ACM Program. Lang. 3 (OOPSLA) (2019)
3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate (2015)
4. Bhattacharyya, A., Schiele, B., Fritz, M.: Accurate and diverse sampling of sequences based on a "best of many" sample objective. In: CVPR (2018)
5.
Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R., Bengio, S.: Generating sentences from a continuous space. In: SIGNLL Conference on Computational Natural Language Learning (CoNLL) (2016)
6. Das, R., Ahmed, U.Z., Karkare, A., Gulwani, S.: Prutor: A system for tutoring CS1 and collecting student programs for analysis (2016)
7. Deshpande, A., Aneja, J., Wang, L., Schwing, A.G., Forsyth, D.: Fast, diverse and accurate image captioning guided by part-of-speech. In: CVPR (2019)
8. D'Antoni, L., Samanta, R., Singh, R.: Qlose: Program repair with quantitative objectives. In: CAV (2016)
9. Girshick, R.: Fast R-CNN. In: ICCV (2015)
10. Gottschlich, J., Solar-Lezama, A., Tatbul, N., Carbin, M., Rinard, M., Barzilay, R., Amarasinghe, S., Tenenbaum, J.B., Mattson, T.: The three pillars of machine programming. In: MAPL (2018)
11. Goues, C.L., Pradel, M., Roychoudhury, A.: Automated program repair. Commun. ACM 62(12), 56-65 (2019)
12. Gupta, R., Kanade, A., Shevade, S.: Deep reinforcement learning for programming language correction. In: AAAI (2019)
13. Gupta, R., Pal, S., Kanade, A., Shevade, S.K.: DeepFix: Fixing common C language errors by deep learning. In: AAAI (2017)
14. Jang, E., Gu, S., Poole, B.: Categorical reparameterization with Gumbel-Softmax. In: ICLR (2017)
15. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
16. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: ICLR (2014)
17. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
18. Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.: Unsupervised learning of hierarchical representations with convolutional deep belief networks. Communications of the ACM (2011)
19.
Li, Y., Wang, S., Nguyen, T.N.: DLFix: Context-based code transformation learning for automated program repair. In: International Conference on Software Engineering (ICSE) (2020)
20. Long, F., Rinard, M.: Automatic patch generation by learning correct code. In: ACM SIGPLAN Notices (2016)
21. Maddison, C.J., Mnih, A., Teh, Y.W.: The concrete distribution: A continuous relaxation of discrete random variables (2016)
22. Monperrus, M.: Automatic software repair: a bibliography. ACM Computing Surveys (CSUR) (2018)
23. Pu, Y., Narasimhan, K., Solar-Lezama, A., Barzilay, R.: sk_p: a neural program corrector for MOOCs. In: ACM SIGPLAN (2016)
24. Rezende, D.J., Mohamed, S.: Variational inference with normalizing flows. In: ICML (2015)
25. Seo, H., Sadowski, C., Elbaum, S., Aftandilian, E., Bowdidge, R.: Programmers' build errors: a case study (at Google). In: ICSE (2014)
26. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
27. Singh, R., Gulwani, S., Solar-Lezama, A.: Automated feedback generation for introductory programming assignments. In: PLDI (2013)
28. Smith, E.K., Barr, E.T., Goues, C.L., Brun, Y.: Is the cure worse than the disease? Overfitting in automated program repair. In: Foundations of Software Engineering (ESEC/FSE) (2015)
29. Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. In: NIPS (2015)
30. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS (2014)
31. Wang, L., Schwing, A., Lazebnik, S.: Diverse and accurate image description using a variational auto-encoder with an additive Gaussian encoding space. In: NIPS (2017)
32. Yasunaga, M., Liang, P.: Graph-based, self-supervised program repair from diagnostic feedback. In: ICML (2020)
