A deep learning model for estimating story points

Morakot Choetkiertikul∗, Hoa Khanh Dam∗, Truyen Tran†, Trang Pham†, Aditya Ghose∗ and Tim Menzies‡
∗University of Wollongong, Australia. Email: mc650@uowmail.edu.au, hoa@uow.edu.au, aditya@uow.edu.au
†Deakin University, Australia. Email: truyen.tran@deakin.edu.au, phtra@deakin.edu.au
‡North Carolina State University, USA. Email: tim.menzies@gmail.com

Abstract—Although there has been substantial research in software analytics for effort estimation in traditional software projects, little work has been done for estimation in agile projects, especially estimating user stories or issues. Story points are the most common unit of measure used for estimating the effort involved in implementing a user story or resolving an issue. In this paper, we offer for the first time a comprehensive dataset for story points-based estimation that contains 23,313 issues from 16 open source projects. We also propose a prediction model for estimating story points based on a novel combination of two powerful deep learning architectures: long short-term memory and recurrent highway network. Our prediction system is end-to-end trainable from raw input data to prediction outcomes without any manual feature engineering. An empirical evaluation demonstrates that our approach consistently outperforms three common effort estimation baselines and two alternatives in both Mean Absolute Error and the Standardized Accuracy.

I. INTRODUCTION

Effort estimation is an important part of software project management, particularly for planning and monitoring a software project. The accuracy of effort estimation has implications for the outcome of a software project; underestimation may lead to schedule and budget overruns, while overestimation may have a negative impact on organizational competitiveness.
Research in software effort estimation dates back several decades, and existing methods can generally be divided into model-based and expert-based [1, 2]. Model-based approaches leverage data from old projects to make predictions about new projects. Expert-based methods rely on human expertise to make such judgements. Most of the existing work (e.g. [3–17]) in effort estimation focuses on waterfall-like software development. These approaches estimate the effort required for developing a complete software system, relying on a set of features manually designed for characterizing a software project.

In modern agile development settings, software is developed through repeated cycles (iterative) and in smaller parts at a time (incremental), allowing for adaptation to changing requirements at any point during a project's life. A project has a number of iterations (e.g. sprints in Scrum [18]). An iteration is usually a short (usually 2–4 weeks) period in which the development team designs, implements, tests and delivers a distinct product increment, e.g. a working milestone version or a working release. Each iteration requires the completion of a number of user stories, which are a common way for agile teams to express user requirements. This is a shift from a model where all functionalities are delivered together (in a single delivery) to a model involving a series of incremental deliveries. There is thus a need to focus on estimating the effort of completing a single user story at a time rather than the entire project. In fact, it has now become a common practice for agile teams to go through each user story and estimate its "size". Story points are commonly used as a unit of measure for specifying the overall size of a user story [19]. Currently, most agile teams heavily rely on experts' subjective assessment (e.g. planning poker, analogy, and expert judgment) to arrive at an estimate.
This may lead to inaccuracy and, more importantly, inconsistencies between estimates. In fact, a recent study [20] has found that the effort estimates of around half of the agile teams are inaccurate by 25% or more.

To facilitate research in effort estimation for agile development, we have developed a new dataset for story point effort estimation. This dataset contains 23,313 user stories or issues with ground truth story points. We collected these issues from 16 large open source projects in 9 repositories, namely Apache, Appcelerator, DuraSpace, Atlassian, Moodle, Lsstcorp, Mulesoft, Spring, and Talendforge. To the best of our knowledge, this is the first dataset for story point estimation where the focus is at the issue/user story level rather than at the project level as in traditional effort estimation datasets.

We also propose a prediction model which supports a team by recommending a story-point estimate for a given user story. Our model learns from the team's previous story point estimates to predict the size of new issues. This prediction system is intended to be used in conjunction with (rather than as a replacement for) existing estimation techniques practiced by the team. The key novelty of our approach resides in the combination of two powerful deep learning architectures: long short-term memory (LSTM) and recurrent highway network (RHN). LSTM allows us to model the long-term context in the textual description of an issue, while RHN provides us with a deep representation of that model. We name this approach Long-Deep Recurrent Neural Network (LD-RNN). Our LD-RNN model is a fully end-to-end system where raw data signals (i.e. words) are passed from input nodes up to the final output node for estimating story points, and the prediction errors are propagated from the output node all the way back to the word layer.
LD-RNN automatically learns semantic representations of user story or issue reports, thus liberating users from manually designing and extracting features. Our approach consistently outperforms three common baseline estimators and two alternatives in both Mean Absolute Error and the Standardized Accuracy. These claims have also been tested using a non-parametric Wilcoxon test and Vargha and Delaney's statistic to demonstrate the statistical significance and the effect size.

The remainder of this paper is organized as follows. Section II provides an example to motivate our work, and the story point dataset is described in Section III. We then present the LD-RNN model and explain how it can be trained in Section IV and Section V respectively. Section VI reports on the experimental evaluation of our approach. Related work is discussed in Section VII before we conclude and outline future work in Section VIII.

II. MOTIVATING EXAMPLE

When a team estimates with story points, it assigns a point value (i.e. story points) to each user story. A story point estimate reflects the relative amount of effort involved in implementing the user story: a user story that is assigned two story points should take twice as much effort as a user story assigned one story point. Many projects have now adopted this story point estimation approach [20]. Projects that use issue tracking systems (e.g. JIRA [21]) record their user stories as issues. Figure 1 shows an example of issue XD-2970 in the Spring XD project [22] which is recorded in JIRA. An issue typically has a title (e.g. "Standardize XD logging to align with Spring Boot") and a description. Projects that use JIRA Agile also record story points. For example, the issue in Figure 1 has 8 story points.

Fig. 1. An example of an issue with estimated story points

Story points are usually estimated by the whole project team.
For example, the widely-used Planning Poker [23] method suggests that each team member provides an estimate and a consensus estimate is reached after a few rounds of discussion and (re-)estimation. This practice differs from traditional approaches (e.g. function points) in several aspects. Both story points and function points are a measure of size. However, function points can be determined by an external estimator based on a standard set of rules (e.g. counting inputs, outputs, and inquiries) that can be applied consistently by any trained practitioner. On the other hand, story points are developed by a specific team based on the team's cumulative knowledge and biases, and thus may not be useful outside the team (e.g. in comparing performance across teams).

Story points are used to compute velocity, a measure of a team's rate of progress per iteration. Velocity is the sum of the story-point estimates of the issues that the team resolved during an iteration. For example, if the team resolves four stories each estimated at three story points, their velocity is twelve. Velocity is used for planning and predicting when a software product (or a release) should be completed. For example, if the team estimates the next release to include 100 story points and the team's current velocity is 20 points per 2-week iteration, then it would take 5 iterations (or 10 weeks) to complete the project. Hence, it is important that the team is consistent in their story point estimates, to avoid reducing the predictability in planning and managing their project.

Our proposal enables teams to be consistent in their estimation of story points. Achieving this consistency is central to effectively leveraging story points for project planning. The machine learner learns from past estimates made by the specific team which it is deployed to assist. The insights that the learner acquires are therefore team-specific.
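The velocity arithmetic from the example above can be checked with a short sketch (an illustration only; the function names are ours, not part of the paper's model):

```python
import math

def velocity(resolved_story_points):
    # Velocity: the sum of story-point estimates of the issues
    # the team resolved during one iteration.
    return sum(resolved_story_points)

def iterations_needed(release_points, points_per_iteration):
    # Iterations required to complete a release at the team's current velocity.
    return math.ceil(release_points / points_per_iteration)

# Four stories at three points each give a velocity of twelve.
assert velocity([3, 3, 3, 3]) == 12
# 100 story points at 20 points per 2-week iteration: 5 iterations (10 weeks).
assert iterations_needed(100, 20) == 5
```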
The intent is not to have the machine learner supplant existing agile estimation practices. The intent, instead, is to deploy the machine learner to complement these practices by playing the role of a decision support system. Teams would still meet, discuss user stories and generate estimates as per current practice, but would have the added benefit of access to the insights acquired by the machine learner. Teams would be free to reject the suggestions of the machine learner, as is the case with any decision support system. In every such estimation exercise, the actual estimates generated are recorded as data to be fed to the machine learner, independent of whether these estimates are based on the recommendations of the machine learner or not. This estimation process helps the team not only understand sufficient details about what it will take to resolve those issues, but also align with their previous estimations.

III. STORY POINT DATASETS

A number of publicly available datasets (e.g. the China, Desharnais, Finnish, Maxwell, and Miyazaki datasets in the PROMISE repository [24]) have become valuable assets for many research projects in software effort estimation in recent years. Those datasets however are only suitable for estimating effort at the project level (i.e. estimating effort for developing a complete software system). To the best of our knowledge, there is currently no publicly available dataset for effort estimation at the issue level (i.e. estimating effort for developing a single issue). Thus, we needed to build such a dataset for our study. We have made this dataset publicly available, both to enable verifiability of our results and also as a service to the research community.

TABLE I
DESCRIPTIVE STATISTICS OF OUR STORY POINT DATASET

Repo.         Project              Abb.  # issues  min SP  max SP  mean SP  median SP  mode SP  var SP  std SP  mean TD length  LOC
Apache        Mesos                ME    1,680     1       40      3.09     3          3        5.87    2.42    181.12          247,542 +
              Usergrid             UG    482       1       8       2.85     3          3        1.97    1.40    108.60          639,110 +
Appcelerator  Appcelerator Studio  AS    2,919     1       40      5.64     5          5        11.07   3.33    124.61          2,941,856 #
              Aptana Studio        AP    829       1       40      8.02     8          8        35.46   5.95    124.61          6,536,521 +
              Titanium SDK/CLI     TI    2,251     1       34      6.32     5          5        25.97   5.10    205.90          882,986 +
DuraSpace     DuraCloud            DC    666       1       16      2.13     1          1        4.12    2.03    70.91           88,978 +
Atlassian     Bamboo               BB    521       1       20      2.42     2          1        4.60    2.14    133.28          6,230,465 #
              Clover               CV    384       1       40      4.59     2          1        42.95   6.55    124.48          890,020 #
              JIRA Software        JI    352       1       20      4.43     3          5        12.35   3.51    114.57          7,070,022 #
Moodle        Moodle               MD    1,166     1       100     15.54    8          5        468.53  21.65   88.86           2,976,645 +
Lsstcorp      Data Management      DM    4,667     1       100     9.57     4          1        275.71  16.61   69.41           125,651 *
Mulesoft      Mule                 MU    889       1       21      5.08     5          5        12.24   3.50    81.16           589,212 +
              Mule Studio          MS    732       1       34      6.40     5          5        29.01   5.39    70.99           16,140,452 #
Spring        Spring XD            XD    3,526     1       40      3.70     3          1        10.42   3.23    78.47           107,916 +
Talendforge   Talend Data Quality  TD    1,381     1       40      5.92     5          8        26.96   5.19    104.86          1,753,463 #
              Talend ESB           TE    868       1       13      2.16     2          1        2.24    1.50    128.97          18,571,052 #
Total                                    23,313

SP: story points, TD length: the number of words in the title and description of an issue, LOC: lines of code (+: LOC obtained from www.openhub.net, *: LOC from GitHub, #: LOC from reverse engineering)

To collect data for our dataset, we looked for issues that were estimated with story points. JIRA is one of the few widely-used issue tracking systems that support agile development (and thus story point estimation) with its JIRA Agile plugin.
Hence, we selected a diverse collection of nine major open source repositories that use the JIRA issue tracking system: Apache, Appcelerator, DuraSpace, Atlassian, Moodle, Lsstcorp, MuleSoft, Spring, and Talendforge. Apache hosts a family of related projects sponsored by the Apache Software Foundation [25]. Appcelerator hosts a number of open source projects that focus on mobile application development [26]. DuraSpace contains digital asset management projects [27]. The Atlassian repository has a number of projects which provide project management systems and collaboration tools [28]. Moodle is an e-learning platform that allows everyone to join the community in several roles such as user, developer, tester, and QA [29]. Lsstcorp has a number of projects supporting research involving the Large Synoptic Survey Telescope [30]. MuleSoft provides software development tools and platform collaboration tools such as Mule Studio [31]. Spring has a number of projects supporting application development frameworks [32]. Talendforge is the open source integration software provider for data management solutions such as data integration and master data management [33].

We then used the Representational State Transfer (REST) API provided by JIRA to query and collect those issue reports. We collected all the issues which were assigned a story point measure from the nine open source repositories up until August 8, 2016. We then extracted the story points, title and description from the collected issue reports. Each repository contains a number of projects, and we chose to include in our dataset only projects that had more than 300 issues with story points. Issues that were assigned a story point of zero (e.g., a non-reproducible bug), as well as issues with a negative or unrealistically large story point (e.g. greater than 100), were filtered out. Ultimately, about 2.66% of the collected issues were filtered out in this fashion.
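The filtering rule just described can be sketched as follows (a hypothetical snippet; the issue keys and data layout are ours, not JIRA's):

```python
def keep_issue(story_points, max_sp=100):
    # Keep an issue only if its story points are positive and not
    # unrealistically large (the paper filters out zero, negative,
    # and greater-than-100 story points).
    return 0 < story_points <= max_sp

# Toy sample: only the first issue survives the filter.
issues = [("A-1", 3), ("A-2", 0), ("A-3", -1), ("A-4", 250)]
kept = [key for key, sp in issues if keep_issue(sp)]
assert kept == ["A-1"]
```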
In total, our dataset has 23,313 issues with story points from 16 different projects: Apache Mesos (ME), Apache Usergrid (UG), Appcelerator Studio (AS), Aptana Studio (AP), Titanium SDK/CLI (TI), DuraCloud (DC), Bamboo (BB), Clover (CV), JIRA Software (JI), Moodle (MD), Data Management (DM), Mule (MU), Mule Studio (MS), Spring XD (XD), Talend Data Quality (TD), and Talend ESB (TE). Table I summarizes the descriptive statistics of all the projects in terms of the minimum, maximum, mean, median, mode, variance, and standard deviation of the story points assigned, and the average length of the title and description of issues in each project. These sixteen projects bring diversity to our dataset in terms of both application domains and project characteristics. Specifically, they differ in the following aspects: number of observations (from 352 to 4,667 issues), technical characteristics (different programming languages and different application domains), sizes (from 88 KLOC to 18 million LOC), and team characteristics (different team structures and participants from different regions).

IV. APPROACH

Our overall research goal is to build a prediction system that takes as input the title and description of an issue and produces a story-point estimate for the issue. Title and description are required information for any issue tracking system. Although some issue tracking systems (e.g. JIRA) may elicit additional metadata for an issue (e.g. priority, type, affects versions, and fix versions), this information is not always provided at the time an issue is created. We therefore make a pessimistic assumption here and rely only on the issue's title and description. Thus, our prediction system can be used at any time, even when an issue has just been created. We combine the title and description of an issue report into a single text document where the title is followed by the description.
Our approach computes vector representations for these documents. These representations are then used as features to predict the story points of each issue. It is important to note that these features are automatically learned from raw text, hence removing the need for us to manually engineer features.

Fig. 2. Long-Deep Recurrent Neural Net (LD-RNN). The input layer (bottom) is a sequence of words (represented as filled circles). Words are first embedded into a continuous space, then fed into the LSTM layer. The LSTM outputs a sequence of state vectors, which are then pooled to form a document-level vector. This global vector is then fed into a Recurrent Highway Net for multiple transformations (see Eq. (2) for details). Finally, a regressor predicts an outcome (story points).

Figure 2 shows the Long-Deep Recurrent Neural Network (LD-RNN) that we have designed for the story point prediction system. It is composed of four components arranged sequentially: (i) word embedding, (ii) document representation using Long Short-Term Memory (LSTM) [34], (iii) deep representation using Recurrent Highway Net (RHWN) [35], and (iv) differentiable regression. Given a document which consists of a sequence of words s = (w_1, w_2, ..., w_n), e.g. the word sequence (Standardize, XD, logging, to, align, with, ...) in the title and description of issue XD-2970 in Figure 1, our LD-RNN can be summarized as follows:

y = Regress(RHWN(LSTM(Embed(s))))    (1)

We model a document's semantics based on the principle of compositionality: the meaning of a document is determined by the meanings of its constituents (e.g. words) and the rules used to combine them (e.g. one word followed by another). Hence, our approach models document representation in two stages.
It first converts each word in a document into a fixed-length vector (i.e. word embedding). These word vectors then serve as an input sequence to the Long Short-Term Memory (LSTM) layer, which computes a vector representation for the whole document (see Section IV-A for details). After that, the document vector is fed into the Recurrent Highway Network (RHWN), which transforms the document vector multiple times (see Section IV-B for details), before outputting a final vector which represents the text. The vector serves as input for the regressor which predicts the output story points. While many existing regressors can be employed, we are mainly interested in regressors that are differentiable with respect to the training signals and the input vector. In our implementation, we use simple linear regression, which outputs the story-point estimate. Our entire system is trainable end-to-end: (a) data signals are passed from the words in issue reports to the final output node; and (b) the prediction error is propagated from the output node all the way back to the word layer.

A. Document representation

We represent each word as a low-dimensional, continuous and real-valued vector, also known as a word embedding. Here we maintain a look-up table, which is a word embedding matrix M ∈ R^{d×|V|} where d is the dimension of a word vector and |V| is the vocabulary size. These word vectors are pre-trained from corpora of issue reports, as will be described in detail in Section V-B. Since an issue document consists of a sequence of words, we model the document by accumulating information from the start to the end of the sequence. A powerful accumulator is a Recurrent Neural Network (RNN) [36], which can be seen as multiple copies of the same single-hidden-layer network, each passing information to a successor and thus allowing information to be accumulated.
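As a toy numeric illustration of the composition in Eq. (1), the following sketch wires the four components together, with a plain tanh accumulator standing in for the real LSTM, average pooling, a shared-parameter highway stack, and a linear regressor. All dimensions, weights and the tiny vocabulary are made up; this is not the paper's Theano implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                     # toy embedding size
vocab = {"standardize": 0, "xd": 1, "logging": 2, "to": 3, "align": 4, "with": 5}
M = rng.normal(size=(d, len(vocab)))      # word-embedding matrix, d x |V|

def embed(words):
    return [M[:, vocab[w]] for w in words]

def accumulate(vectors):
    # Stand-in for the LSTM layer: a simple recurrent accumulator that
    # emits one state per word (a real LSTM adds gated memory).
    h, states = np.zeros(d), []
    for v in vectors:
        h = np.tanh(0.5 * h + 0.5 * v)
        states.append(h)
    return states

def pool(states):
    return np.mean(states, axis=0)        # average pooling, length-invariant

def rhwn(h, layers=3):
    # Recurrent highway stack: the same transformation (and the same
    # parameters) applied at every layer, as in Eq. (2).
    for _ in range(layers):
        alpha = 1.0 / (1.0 + np.exp(-h))            # highway gate
        h = alpha * h + (1.0 - alpha) * np.tanh(h)  # gated transform
    return h

w_out = rng.normal(size=d)
def regress(h):
    return float(w_out @ h)               # linear regression to a story point

s = ["standardize", "xd", "logging", "to", "align", "with"]
y = regress(rhwn(pool(accumulate(embed(s)))))       # Eq. (1)
assert isinstance(y, float)
```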
While a feedforward neural network maps an input vector into an output vector, an RNN maps a sequence into a sequence. Let w_1, ..., w_n be the input sequence (e.g. words in a sentence). At time step t, a standard RNN model reads the input w_t and the previous state h_{t−1} to compute the state h_t. Due to space limits, we refer the readers to [36] for more technical details of RNNs.

While RNNs are theoretically powerful, they are difficult to train for long sequences [36], which are often seen in issue reports (e.g. see the description of issue XD-2970 in Figure 1). Hence, our approach employs Long Short-Term Memory (LSTM) [34, 37], a special variant of RNN. The most important element of LSTM is a short-term memory: a vector that stores accumulated information over time. The information stored in the memory is refreshed at each time step through partially forgetting old, irrelevant information and accepting fresh new input. However, only some parts of the input will be added to the memory through a selective input gate. Once the memory has been refreshed, an output state will be read from the memory through an output gate. The reading of the new input, writing of the output, and the forgetting are all learnable. LSTM has demonstrated ground-breaking results in many applications such as language models [38], speech recognition [39] and video analysis [40]. Space limits preclude us from detailing how LSTM works, which can be found in its seminal paper [34].

After the output state has been computed for every word in the input sequence, a vector representation for the whole document is derived by pooling all the vector states. There are multiple ways to perform pooling, but the main requirement is that pooling must be length-invariant, that is, not sensitive to the variable length of the document. A simple but often effective pooling method is averaging, which we employ here.

B.
Deep representation using Recurrent Highway Network

Given that a vector representation of an issue report has been extracted by the LSTM layer, we could use a differentiable regressor for immediate prediction. However, this may be suboptimal since the network is rather shallow. We have therefore designed a deep representation that performs multiple non-linear transformations, using the idea from Highway Nets [41]. A Highway Net is a feedforward neural network that consists of a number of hidden layers, each of which performs a non-linear transformation of the input. Training very deep feedforward networks with many layers is difficult due to two main problems: (i) the number of parameters grows with the number of layers, leading to overfitting; and (ii) stacking many non-linear functions makes it difficult for the information and the gradients to pass through.

Our conception of a Recurrent Highway Network (RHN) addresses the first problem by sharing parameters between layers, i.e. all the hidden layers have the same hidden units (similarly to the notion of a recurrent net). It deals with the second problem by modifying the transformation taking place at a hidden unit to let information from lower layers pass through linearly. Specifically, the hidden state at layer l is defined as:

h_{l+1} = α_l ∗ h_l + (1 − α_l) ∗ σ_l(h_l)    (2)

where σ_l is a non-linear transform (e.g., a logistic or a tanh) and α_l = logit(h_l) is a logistic transform of h_l. Here α_l plays the role of a highway gate that lets information pass from layer l to layer l+1 without loss of information. For example, α_l → 1 enables simple copying. This gating scheme is highly effective: while traditional deep neural nets cannot go beyond several layers, the Highway Net can have up to a thousand layers [41]. In previous work [35] we found that the operation in Eq. (2) can be repeated multiple times with exactly the same set of parameters.
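A quick numeric check of the gating behaviour in Eq. (2), applied repeatedly with the same parameters (the values and the choice of tanh as σ are illustrative only):

```python
import numpy as np

def highway_step(h):
    # One application of Eq. (2): h' = a*h + (1 - a)*tanh(h), where the
    # gate a is a logistic transform of h.
    a = 1.0 / (1.0 + np.exp(-h))
    return a * h + (1.0 - a) * np.tanh(h)

h = np.array([10.0, 0.0])
out = h
for _ in range(3):          # repeated with exactly the same parameters
    out = highway_step(out)
# Where the gate saturates (a -> 1), the unit essentially copies its input,
# so information passes through the stack without loss.
assert abs(out[0] - 10.0) < 1e-2
```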
In other words, we can create a very compact version of the Recurrent Highway Network with only one set of parameters in α_l and σ_l. This has the great advantage of avoiding overfitting.

V. MODEL TRAINING

A. Training LD-RNN

We have implemented the LD-RNN model in Python using Theano [42]. To simplify our model, we set the size of the memory cell in an LSTM unit and the size of a recurrent layer in RHWN to be the same as the embedding size. We tuned some important hyper-parameters (e.g. embedding size and the number of hidden layers) by conducting experiments with different values, while for some other hyper-parameters we used the default values. This will be discussed in more detail in the evaluation section.

Recall that the entire network can be reduced to a parameterized function defined in Equation (1), which maps sequences of raw words (in issue reports) to story points. Let θ be the set of all parameters in the model. We define a loss function L(θ) that measures the quality of a particular set of parameters based on the difference between the predicted story points and the ground truth story points in the training data. A setting of the parameters θ that produces a prediction for an issue in the training data consistent with its ground truth story points would have a very low loss L. Hence, learning is achieved through the optimization process of finding the set of parameters θ that minimizes the loss function. Since every component in Equation (1) is differentiable, we use the popular stochastic gradient descent to perform optimization: through backpropagation, the model parameters θ are updated in the opposite direction of the gradient of the loss function L(θ). In this search, a learning rate η is used to control how large a step we take to reach a (local) minimum. We use RMSprop, an adaptive stochastic gradient method (unpublished note by Geoffrey Hinton), which is known to work best for recurrent models.
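The RMSprop update can be sketched as follows (a generic illustration of the adaptive scheme on a toy one-parameter loss, not the paper's Theano code; the hyper-parameter values are ours):

```python
import numpy as np

def rmsprop_step(theta, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    # Scale the step by a running average of squared gradients, giving
    # each parameter its own effective learning rate.
    cache = decay * cache + (1.0 - decay) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

# Toy use: minimise L(theta) = (theta - 3)^2 starting from theta = 0.
theta, cache = np.array([0.0]), np.array([0.0])
for _ in range(1000):
    grad = 2.0 * (theta - 3.0)            # dL/dtheta
    theta, cache = rmsprop_step(theta, grad, cache)
assert abs(theta[0] - 3.0) < 0.5          # converges near the minimiser
```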
We tuned RMSprop by partitioning the data into mutually exclusive training, validation, and test sets and running multiple training epochs. Specifically, the training set is used to learn a useful model. After each training epoch, the learned model was evaluated on the validation set and its performance was used to assess hyperparameter settings (e.g. the learning rate in gradient searches). Note that the validation set was not used to learn any of the model's parameters. The best-performing model on the validation set was chosen to be evaluated on the test set. We also employed the early stopping strategy, i.e. monitoring the model's performance during the validation phase and stopping when the performance got worse.

To prevent overfitting in our neural network, we implemented an effective solution called dropout in our model [43], where the elements of input and output states are randomly set to zeros during training. During testing, parameter averaging is used. In effect, dropout implicitly trains many models in parallel, and all of them share the same parameter set. The final model parameters represent the average of the parameters across these models. Typically, the dropout rate is set at 0.5.

An important step prior to optimization is parameter initialization. Typically the parameters are initialized randomly, but our experience shows that a good initialization (through pre-training) helps learning converge faster to good solutions.

B. Pre-training

Pre-training is a way to come up with a good parameter initialization without using the labels (i.e. ground-truth story points). We pre-train the lower layers of LD-RNN (i.e. embedding and LSTM), which operate at the word level. Pre-training is effective when labels are not abundant.
During pre-training, we do not use the ground-truth story points, but instead leverage two sources of information: the strong predictiveness of natural language, and the availability of free texts without labels (e.g. issue reports without story points). The first source comes from the property of languages that the next word can be predicted using previous words, thanks to grammars and common expressions. Thus, at each time step t, we can predict the next word w_{t+1} using the state h_t, using the softmax function:

P(w_{t+1} = k | w_{1:t}) = exp(U_k h_t) / Σ_{k′} exp(U_{k′} h_t)    (3)

where U_k is a free parameter. Essentially we are building a language model, i.e., P(s) = P(w_{1:n}), which can be factorized using the chain rule as: P(w_1) ∏_{t=2}^{n} P(w_t | w_{1:t−1}). The language model can be learned by optimizing the log-loss −log P(s). However, the main bottleneck is computational: Equation (3) costs |V| time to evaluate, where |V| is the vocabulary size, which can be hundreds of thousands for a big corpus. For that reason, we implemented an approximate but very fast alternative based on Noise-Contrastive Estimation [44], which reduces the time to M ≪ |V|, where M can be as small as 100. We also ran multiple epochs against a validation set to choose the best model. We use perplexity, a common intrinsic evaluation metric based on the log-loss, as a criterion for choosing the best model and early stopping. A smaller perplexity implies a better language model. The word embedding matrix M ∈ R^{d×|V|} (which is first randomly initialized) and the initialization for the LSTM parameters are learned through this pre-training process.

VI. EVALUATION

The empirical evaluation we carried out aimed to answer the following research questions:

• RQ1. Sanity Check: Is the proposed approach suitable for estimating story points?
This sanity check requires us to compare our LD-RNN prediction model with three common baseline benchmarks used in the context of effort estimation: Random Guessing, Mean Effort, and Median Effort. Random guessing is a naive benchmark used to assess if an estimation model is useful [45]. Random guessing performs random sampling (with equal probability) over the set of issues with known story points, chooses randomly one issue from the sample, and uses the story point value of that issue as the estimate of the target issue. Random guessing does not use any information associated with the target issue. Thus any useful estimation model should outperform random guessing. Mean and Median Effort estimations are commonly used as baseline benchmarks for effort estimation [12]. They use the mean or median story points of the past issues to estimate the story points of the target issue.

• RQ2. Benefits of deep representation: Does the use of Recurrent Highway Nets provide more accurate and robust estimates than using a traditional regression technique?
To answer this question, we replaced the Recurrent Highway Net component with a regressor for immediate prediction. Here, we chose a Random Forests (RF) regressor over other baselines (e.g. the one proposed in [46]) since ensemble methods like RF, which combine the estimates from multiple estimators, are the most effective method for effort estimation [15]. RF achieves a significant improvement over the decision tree approach by generating many classification and regression trees, each of which is built on a random resampling of the data, with a random subset of variables at each node split. Tree predictions are then aggregated through averaging. We then compare the performance of this alternative, namely LSTM+RF, against our LD-RNN model.

• RQ3.
Benefits of LSTM document representation: Does the use of LSTM for modeling issue reports provide more accurate results than the traditional Bag-of-Words (BoW) approach?
The most popular text representation is Bag-of-Words (BoW) [47], where a text is represented as a vector of word counts. For example, the title and description of issue XD-2970 in Figure 1 would be converted into a sparse vector of vocabulary size, whose elements are mostly zeros, except for those at the positions designated to "standardize", "XD", "logging", and so on. The BoW representation however effectively destroys the sequential nature of text. This question aims to explore whether LSTM, with its capability of modeling this sequential structure, would improve story point estimation. To answer this question, we feed two different feature vectors, one learned by LSTM and the other derived from the BoW technique, to the same Random Forests regressor, and compare the predictive performance of the former (i.e. LSTM+RF) against that of the latter (i.e. BoW+RF).
• RQ4. Cross-project estimation: Is the proposed approach suitable for cross-project estimation?
Story point estimation in new projects is often difficult due to the lack of training data. One common technique to address this issue is training a model using data from a (source) project and applying it to the new (target) project. Since our approach requires only the title and description of issues in the source and target projects, it is readily applicable to both within-project and cross-project estimation. In practice, however, story point estimation is known to be specific to teams and projects. Hence, this question aims to investigate whether our approach is suitable for cross-project estimation.

A. Experimental setting

We performed experiments on the sixteen projects in our dataset – see Table I for their details.
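The three baselines described under RQ1 are simple enough to state precisely. A minimal sketch in Python (the variable names are ours, not from the paper's replication package):

```python
import random
import statistics

def mean_effort(past_sp):
    """Mean Effort: estimate with the mean story points of past issues."""
    return statistics.mean(past_sp)

def median_effort(past_sp):
    """Median Effort: estimate with the median story points of past issues."""
    return statistics.median(past_sp)

def random_guess(past_sp, rng=random):
    """Random Guessing: sample one past issue (equal probability) and reuse
    its story points; no information about the target issue is used."""
    return rng.choice(past_sp)

past = [1, 2, 3, 5, 8, 8, 13]   # story points of past issues
print(mean_effort(past))         # 40/7 ≈ 5.71
print(median_effort(past))       # 5
print(random_guess(past))        # some value drawn from `past`
```

Because Random Guessing ignores the target issue entirely, any estimator that cannot beat it carries no predictive signal, which is why it serves as a sanity check.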
To mimic a real deployment scenario, in which a prediction for a current issue is made using knowledge from estimations of past issues, the issues in each project were split into a training set (60% of the issues), a development/validation set (20%), and a test set (20%) based on their creation time. The issues in the training set and the validation set were created before the issues in the test set, and the issues in the training set were also created before the issues in the validation set.

B. Performance measures

A range of measures is used in evaluating the accuracy of an effort estimation model. Most of them are based on the Absolute Error, i.e. |ActualSP − EstimatedSP|, where ActualSP is the real story points assigned to an issue and EstimatedSP is the outcome given by an estimation model. The Mean Magnitude of Relative Error (MMRE) and Prediction at level l [48], i.e. Pred(l), have also been used in effort estimation. However, a number of studies [49–52] have found that those measures are biased towards underestimation and are not stable when comparing effort estimation models. Thus, the Mean Absolute Error (MAE) and the Standardized Accuracy (SA) have recently been recommended for comparing the performance of effort estimation models [12]. MAE is defined as:

MAE = (1/N) Σ_{i=1}^{N} |ActualSP_i − EstimatedSP_i|

where N is the number of issues used for evaluating the performance (i.e. the test set), ActualSP_i is the actual story point, and EstimatedSP_i is the estimated story point, for issue i. SA is based on MAE and is defined as:

SA = (1 − MAE / MAE_rguess) × 100

where MAE_rguess is the MAE of a large number (e.g. 1,000 runs) of random guesses. SA measures the comparison against random guessing. Predictive performance can be improved by decreasing MAE or increasing SA. We assess the story point estimates produced by the estimation models using MAE and SA.
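Both measures follow directly from their definitions. A short sketch (the 1,000-run random-guessing loop for MAE_rguess is our reading of the text, not code from the paper):

```python
import random
import statistics

def mae(actual, estimated):
    """Mean Absolute Error over the evaluated issues."""
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)

def standardized_accuracy(actual, estimated, past_sp, runs=1000, rng=None):
    """SA = (1 - MAE / MAE_rguess) * 100, where MAE_rguess is the mean MAE
    over `runs` random-guessing passes that sample story points of past issues."""
    rng = rng or random.Random()
    guess_maes = [mae(actual, [rng.choice(past_sp) for _ in actual])
                  for _ in range(runs)]
    return (1 - mae(actual, estimated) / statistics.mean(guess_maes)) * 100
```

A perfect estimator has MAE = 0 and therefore SA = 100, while an estimator no better than random guessing has SA near 0.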
To compare the performance of two estimation models, we tested the statistical significance of the absolute errors achieved by the two models using the Wilcoxon Signed Rank Test [53]. The Wilcoxon test is a safe test since it makes no assumptions about the underlying data distributions. The null hypothesis here is: "the absolute errors provided by an estimation model are not significantly less than those provided by another estimation model". We set the confidence limit at 0.05 (i.e. p < 0.05).

In addition, we also employed a non-parametric effect size measure, the Vargha and Delaney Â12 statistic [54], to assess whether the effect size is interesting. The Â12 measure is chosen since it is agnostic to the underlying distribution of the data, and is suitable for assessing randomized algorithms in software engineering generally [55] and effort estimation in particular [12]. Specifically, given a performance measure (the Absolute Error from each estimation in our case), Â12 measures the probability that estimation model M achieves better results (with respect to the performance measure) than estimation model N, using the following formula:

Â12 = (r1/m − (m + 1)/2) / n

where r1 is the rank sum of the observations in the sample derived from M, and m and n are respectively the numbers of observations in the samples derived from M and N. If the performance of the two models is equivalent, then Â12 = 0.5. If M performs better than N, then Â12 > 0.5, and vice versa. All the measures we have used here are commonly used in evaluating effort estimation models [12, 55].

C. Hyper-parameter settings for training an LD-RNN model

We focused on tuning two important hyper-parameters: the number of word embedding dimensions and the number of hidden layers in the recurrent highway net component of our model. To do so, we fixed one parameter and varied the other to observe the MAE performance.
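The Â12 formula above reduces to a rank sum over the combined samples. A minimal sketch (average ranks for ties are our assumption, since the text leaves tie handling implicit):

```python
def vargha_delaney_a12(m_vals, n_vals):
    """Â12 = (r1/m - (m+1)/2) / n, where r1 is the rank sum of the first
    sample in the combined 1-based ranking (average ranks for ties)."""
    combined = sorted((v, i) for i, v in enumerate(m_vals + n_vals))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # extend j over the group of tied values starting at i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg_rank
        i = j + 1
    m, n = len(m_vals), len(n_vals)
    r1 = sum(ranks[:m])  # rank sum of the first sample
    return (r1 / m - (m + 1) / 2) / n
```

Note that this raw formula gives the probability that the first sample carries the larger values; when the measure is an error (lower is better), the probability that M achieves better results is presumably obtained by comparing 1 − Â12, or equivalently by passing the samples in the appropriate order.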
We chose to test with four different embedding sizes (10, 50, 100, and 200) and twelve settings of the number of hidden layers, ranging from 2 to 200. This tuning was done using the validation set. Figure 3 shows the results from experimenting with Apache Mesos. As can be seen, the setting with 50 embedding dimensions and 10 hidden layers gives the lowest MAE, and thus was chosen.

[Figure 3 plots MAE on the validation set against the number of hidden layers (2 to 200), with one curve per embedding size (10, 50, 100, 200).]
Fig. 3. Story point estimation performance with different hyper-parameters.

For pre-training we ran 100 epochs with a batch size of 50. The initial learning rate in pre-training was set to 0.02, the adaptation rate was 0.99, and the smoothing factor was 10^-7. For the main LD-RNN model we used 1,000 epochs and the batch size was set to 100. The initial learning rate in the main model was set to 0.01, the adaptation rate was 0.9, and the smoothing factor was 10^-6. Dropout rates for the RHW and LSTM layers were set to 0.5 and 0.2 respectively.

D. Pre-training

We used 50,000 issues without story points (i.e. without labels) in each repository for pre-training. Figure 4 shows the top-500 frequent words used in Apache. They are divided into 9 clusters (using K-means clustering) based on their embeddings, which were learned through the pre-training process. We show some representative words from several clusters for a brief illustration. Words that are semantically related are grouped in the same cluster. For example, words related to networking, like soap, configuration, tcp, and load, are in one cluster. This indicates that, to some extent, the learned vectors effectively capture the semantic relations between words, which is useful for the story-point estimation task we do later.
Fig. 4. Top-500 word clusters used in the Apache issue reports

E. Results

We report here the results in answering research questions RQs 1–4.

RQ1: Sanity check. The analysis of MAE and SA (see Table II) suggests that the estimates obtained with our approach, LD-RNN, are better than those achieved by using the Mean, Median, and Random estimates. LD-RNN consistently outperforms all three baselines in all sixteen projects. Averaging across all projects, LD-RNN achieves an accuracy of 2.09 MAE and 52.66 SA, while the best of the baselines achieves only 2.84 MAE and 36.36 SA.

Table III shows the results of the Wilcoxon test (together with the corresponding Â12 effect size, in brackets) measuring the statistical significance and effect size of the improved accuracy achieved by LD-RNN over the baselines: Mean Effort, Median Effort, and Random Guessing. In 45 of 48 cases, our LD-RNN significantly outperforms the baselines with effect sizes greater than 0.5. Our approach outperforms the baselines, thus passing the sanity check required by RQ1.

RQ2: Benefits of deep representation. The results of the Wilcoxon test comparing our approach of using Recurrent Highway Networks for the deep representation of issue reports against using Random Forests coupled with LSTM (i.e. LSTM+RF) are shown in Table IV. The improvement of our approach over LSTM+RF is significant (p < 0.05) with an effect size greater than 0.5 in 13 of 16 cases. The three projects (DC, MU, and MS) where the improvement is not significant have a very small number of issues. It is commonly known that deep highway recurrent networks tend to be significantly effective when we have large datasets.

TABLE II. EVALUATION RESULTS (THE BEST RESULTS ARE HIGHLIGHTED IN BOLD).
MAE – THE LOWER THE BETTER; SA – THE HIGHER THE BETTER.

Proj  Technique  MAE    SA    | Proj  Technique  MAE    SA
ME    LD-RNN     1.02  59.03  | JI    LD-RNN     1.38  59.52
      LSTM+RF    1.08  57.57  |       LSTM+RF    1.71  49.71
      BoW+RF     1.31  48.66  |       BoW+RF     2.10  38.34
      Mean       1.64  35.61  |       Mean       2.48  27.06
      Median     1.73  32.01  |       Median     2.93  13.88
UG    LD-RNN     1.03  52.66  | MD    LD-RNN     5.97  50.29
      LSTM+RF    1.07  50.70  |       LSTM+RF    9.86  17.86
      BoW+RF     1.19  45.24  |       BoW+RF    10.20  15.07
      Mean       1.48  32.13  |       Mean      10.90   9.16
      Median     1.60  26.29  |       Median     7.18  40.16
AS    LD-RNN     1.36  60.26  | DM    LD-RNN     3.77  47.87
      LSTM+RF    1.62  52.38  |       LSTM+RF    4.51  37.71
      BoW+RF     1.83  46.34  |       BoW+RF     4.78  33.84
      Mean       2.08  39.02  |       Mean       5.29  26.85
      Median     1.84  46.17  |       Median     4.82  33.38
AP    LD-RNN     2.71  42.58  | MU    LD-RNN     2.18  40.09
      LSTM+RF    2.97  37.09  |       LSTM+RF    2.23  38.73
      BoW+RF     2.96  37.34  |       BoW+RF     2.31  36.64
      Mean       3.15  33.30  |       Mean       2.59  28.82
      Median     3.71  21.54  |       Median     2.69  26.07
TI    LD-RNN     1.97  55.92  | MS    LD-RNN     3.23  17.17
      LSTM+RF    2.32  48.02  |       LSTM+RF    3.30  15.30
      BoW+RF     2.58  42.15  |       BoW+RF     3.29  15.58
      Mean       3.05  31.59  |       Mean       3.34  14.21
      Median     2.47  44.65  |       Median     3.30  15.42
DC    LD-RNN     0.68  69.92  | XD    LD-RNN     1.63  46.82
      LSTM+RF    0.69  69.52  |       LSTM+RF    1.81  40.99
      BoW+RF     0.96  57.78  |       BoW+RF     1.98  35.56
      Mean       1.30  42.88  |       Mean       2.27  26.00
      Median     0.73  68.08  |       Median     2.07  32.55
BB    LD-RNN     0.74  71.24  | TD    LD-RNN     2.97  48.28
      LSTM+RF    1.01  60.95  |       LSTM+RF    3.89  32.14
      BoW+RF     1.34  48.06  |       BoW+RF     4.49  21.75
      Mean       1.75  32.11  |       Mean       4.81  16.18
      Median     1.32  48.72  |       Median     3.87  32.43
CV    LD-RNN     2.11  50.45  | TE    LD-RNN     0.64  69.67
      LSTM+RF    3.08  27.58  |       LSTM+RF    0.66  68.51
      BoW+RF     2.98  29.91  |       BoW+RF     0.86  58.89
      Mean       3.49  17.84  |       Mean       1.14  45.86
      Median     2.84  33.33  |       Median     1.16  44.44

When we use MAE and SA as evaluation criteria (see Table II), LD-RNN is still the best approach, consistently outperforming LSTM+RF across all sixteen projects. Averaging across all projects, LSTM+RF achieves an accuracy of only 2.61 MAE (versus 2.09 MAE by LD-RNN) and 44.05 SA (versus 52.66 SA).
The proposed approach of using Recurrent Highway Networks is effective in building a deep representation of issue reports and consequently improving story point estimation.

RQ3: Benefits of LSTM document representation. To study the benefits of using LSTM instead of BoW in representing issue reports, we compared the accuracy achieved by Random Forests using the features derived from LSTM against that using the features derived from BoW. For a fair comparison we used Random Forests as the regressor in both settings, and the results are reported in Table V. The improvement of LSTM over BoW is significant (p < 0.05) with an effect size greater than 0.5 in 13 of 16 cases.

TABLE III. COMPARISON AGAINST THE EFFORT ESTIMATION BENCHMARKS USING THE WILCOXON TEST AND THE Â12 EFFECT SIZE (IN BRACKETS)

LD-RNN vs  Mean            Median          Random
ME         <0.001 [0.70]   <0.001 [0.74]   <0.001 [0.76]
UG         <0.001 [0.63]   <0.001 [0.66]   <0.001 [0.68]
AS         <0.001 [0.65]   <0.001 [0.64]   <0.001 [0.75]
AP          0.04  [0.58]   <0.001 [0.60]   <0.001 [0.60]
TI         <0.001 [0.72]   <0.001 [0.64]   <0.001 [0.73]
DC         <0.001 [0.71]    0.415 [0.54]   <0.001 [0.81]
BB         <0.001 [0.85]   <0.001 [0.73]   <0.001 [0.85]
CV         <0.001 [0.75]   <0.001 [0.70]   <0.001 [0.76]
JI         <0.001 [0.76]   <0.001 [0.79]   <0.001 [0.69]
MD         <0.001 [0.73]   <0.001 [0.75]   <0.001 [0.57]
DM         <0.001 [0.69]   <0.001 [0.57]   <0.001 [0.66]
MU          0.003 [0.56]   <0.001 [0.59]   <0.001 [0.65]
MS          0.799 [0.51]    0.842 [0.51]   <0.001 [0.64]
XD         <0.001 [0.66]   <0.001 [0.63]   <0.001 [0.70]
TD         <0.001 [0.77]   <0.001 [0.63]   <0.001 [0.67]
TE         <0.001 [0.66]   <0.001 [0.67]   <0.001 [0.87]

TABLE IV. COMPARISON OF THE RECURRENT HIGHWAY NET AND RANDOM FORESTS USING THE WILCOXON TEST AND THE Â12 EFFECT SIZE (IN BRACKETS)

Proj  LD-RNN vs LSTM+RF  | Proj  LD-RNN vs LSTM+RF
ME    <0.001 [0.53]      | JI     0.006 [0.64]
UG     0.004 [0.53]      | MD    <0.001 [0.67]
AS    <0.001 [0.59]      | DM    <0.001 [0.61]
AP    <0.001 [0.60]      | MU     0.846 [0.50]
TI    <0.001 [0.59]      | MS     0.502 [0.51]
DC     0.406 [0.54]      | XD    <0.001 [0.57]
BB    <0.001 [0.64]      | TD    <0.001 [0.62]
CV    <0.001 [0.69]      | TE     0.020 [0.53]

LSTM also performs better than BoW with respect to the MAE and SA measures in the same thirteen cases where we used the Wilcoxon test.

The proposed LSTM-based approach is effective in automatically learning semantic features representing issue descriptions, which improves story-point estimation.

RQ4: Cross-project estimation. We performed sixteen sets of cross-project estimation experiments to test two settings: (i) within-repository, where both the source and target projects (e.g. Apache Mesos and Apache Usergrid) were from the same repository, and pre-training was done using both projects and all other projects in the same repository; and (ii) cross-repository, where the source project (e.g. Appcelerator Studio) was in a different repository from the target project (Apache Usergrid), and pre-training was done using only the source project.

Table VI shows the performance of our LD-RNN model for cross-project estimation using the Mean Absolute Error measure. As a benchmark we used within-project estimation, where older issues of the target project were used for training. In all cases, the proposed approach performed worse when used for cross-project estimation than when used for within-project estimation (on average a 24.8% reduction in performance for within-repository and 81% for cross-repository). These results confirm a universal understanding [20] in agile development that story point estimation is specific to teams and projects.
TABLE V. COMPARISON OF RANDOM FORESTS WITH LSTM AND RANDOM FORESTS WITH BOW USING THE WILCOXON TEST AND THE Â12 EFFECT SIZE (IN BRACKETS)

Proj  LSTM vs BoW     | Proj  LSTM vs BoW
ME    <0.001 [0.58]   | JI     0.009 [0.60]
UG    <0.001 [0.56]   | MD     0.022 [0.54]
AS    <0.001 [0.56]   | DM    <0.001 [0.54]
AP     0.788 [0.51]   | MU     0.006 [0.53]
TI    <0.001 [0.55]   | MS     0.780 [0.49]
DC    <0.001 [0.66]   | XD    <0.001 [0.54]
BB    <0.001 [0.64]   | TD    <0.001 [0.60]
CV     0.387 [0.49]   | TE    <0.001 [0.61]

TABLE VI. EVALUATION RESULTS ON CROSS-PROJECT ESTIMATION

(i) within-repository     | (ii) cross-repository     |
Source  Target  MAE       | Source  Target  MAE       | MAE (within-project)
ME      UG      1.16      | AS      UG      1.57      | 1.03
UG      ME      1.13      | AS      ME      2.08      | 1.02
AS      AP      2.78      | MD      AP      5.37      | 2.71
AS      TI      2.06      | MD      TI      6.36      | 1.97
AP      AS      3.22      | XD      AS      3.11      | 1.36
AP      TI      3.45      | DM      TI      2.67      | 1.97
MU      MS      3.14      | UG      MS      4.24      | 3.23
MS      MU      2.31      | ME      MU      2.70      | 2.18
Average         2.41      |                 3.51      | 1.93

Given the specificity of story points to teams and projects, our proposed approach is more effective for within-project estimation.

F. Verifiability

We have created a replication package and will make it publicly available soon. The package contains the full dataset and the source code of our LD-RNN model and the benchmark models (i.e. the baselines, LSTM+RF, and BoW+RF). On this website, we also provide detailed instructions on how to run the code and replicate all the experiments reported in this paper, so that our results can be independently verified.

G. Threats to validity

We tried to mitigate threats to construct validity by using real-world data from issues recorded in large open source projects. We collected the title and description provided with these issue reports and the actual story points that were assigned to them. To minimize threats to conclusion validity, we carefully selected unbiased error measures and applied a number of statistical tests to verify our assumptions [54].
Our study was performed on datasets of different sizes. In addition, we carefully followed recent best practices in evaluating effort estimation models [45, 46, 55] to decrease conclusion instability [56]. Training deep neural networks like our LD-RNN system takes a substantial amount of time, and thus we did not have enough time to perform cross-fold validation; we leave it for future work. One potential research direction is therefore building up a community for sharing pre-trained networks, which can be used for initialization, thus reducing training times (similar to Model Zoo [57]). As a first step in this direction, we make our pre-trained models publicly available to the research community. Furthermore, our approach assumes that the team stays static over time, which might not be the case in practice. Team changes might affect the set of skills available and consequently story point estimation. Hence, our future work will consider the modeling of team dynamics.

To mitigate threats to external validity, we have considered 23,313 issues from sixteen open source projects, which differ significantly in size, complexity, team of developers, and community. We however acknowledge that our dataset may not be representative of all kinds of software projects, especially in commercial settings (although open source and commercial projects are similar in many aspects). One of the key differences between open source and commercial projects that may affect the estimation of story points is the nature of the contributors, developers, and project stakeholders. Further investigation of commercial agile projects is needed.

VII. RELATED WORK

Existing estimation methods can generally be classified into three major groups: expert-based, model-based, and hybrid approaches. Expert-based methods rely on human expertise to make estimations, and are the most popular technique in practice [2, 58].
Expert-based estimation however tends to incur large overheads and requires the availability of experts each time an estimate needs to be made. Single-expert opinion is also considered less useful in agile projects (than in traditional projects), since estimates are made for user stories or issues which require different skills from more than one person (rather than an expert in one specific task) [19]. Hence, group estimation is encouraged for agile projects, such as the planning poker technique [23], which is widely used in practice. Model-based approaches use data from past projects, but they vary in terms of building customized models or using fixed models. The well-known COnstructive COst MOdel (COCOMO) [3] is an example of a fixed model, where the factors and their relationships are already defined. Such estimation models were built based on data from a range of past projects. Hence, they tend to be suitable only for the kinds of project that were used to build the model. The customized model building approach requires context-specific data and uses various methods such as regression (e.g. [4, 5]), Neural Networks (e.g. [6, 7]), Fuzzy Logic (e.g. [8]), Bayesian Belief Networks (e.g. [9]), analogy-based methods (e.g. [10, 11]), and multi-objective evolutionary approaches (e.g. [12]). It is however likely that no single method will be the best performer for all project types [13–15]. Hence, some recent work (e.g. [15]) proposes to combine the estimates from multiple estimators. Hybrid approaches (e.g. [16, 17]) combine expert judgements with the available data, similarly to the notions of our proposal.

While most existing work focuses on estimating a whole project, little work has been done on building models specifically for agile projects. Today's agile, dynamic, and change-driven projects require different approaches to planning and estimating [19].
Some recent approaches leverage machine learning techniques to support effort estimation for agile projects. The work in [59] developed an effort prediction model for the iterative software development setting using regression models and neural networks. Differing from traditional effort estimation models, this model is built after each iteration (rather than at the end of a project) to estimate the effort for the next iteration. The work in [60] built a Bayesian network model for effort prediction in software projects which adhere to the agile Extreme Programming method. Their model however relies on several parameters (e.g. process effectiveness and process improvement) that require learning and extensive fine-tuning. Bayesian networks are also used in [61] to model dependencies between different factors (e.g. sprint progress and sprint-planning quality influencing product quality) in Scrum-based software development projects in order to detect problems in a project. Our work specifically focuses on estimating issues with story points, which is the key difference from previous work. Previous research (e.g. [62–65]) has also been done on predicting the elapsed time for fixing a bug or the delay risk of resolving an issue. However, effort estimation using story points is the preferred practice in agile development.

VIII. CONCLUSION

In this paper, we have contributed to the research community the first dataset for story point estimation, sourced from 16 large and diverse software projects. We have also proposed a deep learning-based, fully end-to-end prediction system for estimating story points, completely liberating users from manually hand-crafting features. A key novelty of our approach is the combination of two powerful deep learning architectures: Long Short-Term Memory (to learn a vector representation for issue reports) and Recurrent Highway Networks (for building a deep representation).
The proposed approach has consistently outperformed three common baselines and two alternatives according to our evaluation results.

Our future work involves expanding our study to commercial software projects and other large open source projects to further evaluate our proposed method. We also consider performing team analytics (e.g. features characterizing a team) to model team changes over time. Furthermore, we will look into experimenting with a sliding-window setting to explore incremental learning. In addition, we will also investigate how to best use an issue's metadata (e.g. priority and type) while maintaining the end-to-end nature of our entire model. Our future work also involves comparing our use of the LSTM model against other state-of-the-art models of natural language such as paragraph2vec [66] or Convolutional Neural Networks [67]. Finally, we would like to evaluate empirically the impact of our prediction system for story point estimation in practice. This would involve developing the model into a tool (e.g. a JIRA plugin) and then organising trial use in practice. This is an important part of our future work to confirm the ultimate benefits of our approach in general.

REFERENCES

[1] T. Menzies, Z. Chen, J. Hihn, and K. Lum, "Selecting best practices for effort estimation," IEEE Transactions on Software Engineering, vol. 32, no. 11, pp. 883–895, 2006.
[2] M. Jorgensen, "A review of studies on expert estimation of software development effort," Journal of Systems and Software, vol. 70, no. 1-2, pp. 37–60, 2004.
[3] B. W. Boehm, R. Madachy, and B. Steece, Software Cost Estimation with COCOMO II. Prentice Hall PTR, 2000.
[4] P. Sentas, L. Angelis, and I. Stamelos, "Multinomial logistic regression applied on software productivity prediction," in Proceedings of the 9th Panhellenic Conference in Informatics, 2003, pp. 1–12.
[5] P. Sentas, L. Angelis, I. Stamelos, and G.
Bleris, "Software productivity and effort prediction with ordinal regression," Information and Software Technology, vol. 47, no. 1, pp. 17–29, 2005.
[6] S. Kanmani, J. Kathiravan, S. S. Kumar, M. Shanmugam, and P. E. College, "Neural network based effort estimation using class points for OO systems," Evaluation, 2007.
[7] A. Panda, S. M. Satapathy, and S. K. Rath, "Empirical validation of neural network models for agile software effort estimation based on story points," Procedia Computer Science, vol. 57, pp. 772–781, 2015.
[8] S. Kanmani, J. Kathiravan, S. S. Kumar, and M. Shanmugam, "Class point based effort estimation of OO systems using fuzzy subtractive clustering and artificial neural networks," in Proceedings of the 1st India Software Engineering Conference (ISEC), 2008, pp. 141–142.
[9] S. Bibi, I. Stamelos, and L. Angelis, "Software cost prediction with predefined interval estimates," in Proceedings of the First Software Measurement European Forum, Rome, Italy, 2004, pp. 237–246.
[10] M. Shepperd and C. Schofield, "Estimating software project effort using analogies," IEEE Transactions on Software Engineering, vol. 23, no. 12, pp. 736–743, 1997.
[11] L. Angelis and I. Stamelos, "A simulation tool for efficient analogy based cost estimation," Empirical Software Engineering, vol. 5, no. 1, pp. 35–68, 2000.
[12] F. Sarro, A. Petrozziello, and M. Harman, "Multi-objective software effort estimation," in Proceedings of the 38th International Conference on Software Engineering (ICSE), 2016, pp. 619–630.
[13] M. Jorgensen and M. Shepperd, "A systematic review of software development cost estimation studies," IEEE Transactions on Software Engineering, vol. 33, no. 1, pp. 33–53, 2007.
[14] F. Collopy, "Difficulty and complexity as factors in software effort estimation," International Journal of Forecasting, vol. 23, no. 3, pp. 469–471, 2007.
[15] E. Kocaguneli, T. Menzies, and J. W.
Keung, "On the value of ensemble effort estimation," IEEE Transactions on Software Engineering, vol. 38, no. 6, pp. 1403–1416, 2012.
[16] R. Valerdi, "Convergence of expert opinion via the wideband Delphi method: An application in cost estimation models," 2011.
[17] S. Chulani, B. Boehm, and B. Steece, "Bayesian analysis of empirical software engineering cost models," IEEE Transactions on Software Engineering, vol. 25, no. 4, pp. 573–583, 1999.
[18] H. F. Cervone, "Understanding agile project management methods using Scrum," OCLC Systems & Services: International Digital Library Perspectives, vol. 27, no. 1, pp. 18–22, 2011.
[19] M. Cohn, Agile Estimating and Planning. Pearson Education, 2005.
[20] M. Usman, E. Mendes, F. Weidt, and R. Britto, "Effort estimation in agile software development: A systematic literature review," in Proceedings of the 10th International Conference on Predictive Models in Software Engineering (PROMISE), 2014, pp. 82–91.
[21] Atlassian, "Atlassian JIRA Agile software," 2016. [Online]. Available: https://www.atlassian.com/software/jira
[22] Spring, "Spring XD issue XD-2970," 2016. [Online]. Available: https://jira.spring.io/browse/XD-2970
[23] J. Grenning, Planning Poker or How to Avoid Analysis Paralysis While Release Planning, 2002, vol. 3.
[24] T. Menzies, B. Caglayan, Z. He, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, "The PROMISE repository of empirical software engineering data," 2012.
[25] Apache, "The Apache repository," 2016. [Online]. Available: https://issues.apache.org/jira
[26] Appcelerator, "The Appcelerator repository," 2016. [Online]. Available: https://jira.appcelerator.org
[27] DuraSpace, "The DuraSpace repository," 2016. [Online]. Available: https://jira.duraspace.org
[28] Atlassian, "The Atlassian repository," 2016. [Online]. Available: https://jira.atlassian.com
[29] Moodle, "The Moodle repository," 2016. [Online].
Available: https://tracker.moodle.org
[30] Lsstcorp, "The Lsstcorp repository," 2016. [Online]. Available: https://jira.lsstcorp.org
[31] Mulesoft, "The Mulesoft repository," 2016. [Online]. Available: https://www.mulesoft.org/jira
[32] Spring, "The Spring repository," 2016. [Online]. Available: https://spring.io/projects
[33] Talendforge, "The Talendforge repository," 2016. [Online]. Available: https://jira.talendforge.org
[34] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[35] T. Pham, T. Tran, D. Phung, and S. Venkatesh, "Faster training of very deep networks via p-norm gates," ICPR'16, 2016.
[36] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," 2001.
[37] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Computation, vol. 12, no. 10, pp. 2451–2471, 2000.
[38] M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM neural networks for language modeling," in INTERSPEECH, 2012, pp. 194–197.
[39] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 6645–6649.
[40] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4694–4702.
[41] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training very deep networks," arXiv preprint, 2015.
[42] The Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.0, 2016. [Online]. Available: http://deeplearning.net/software/theano
[43] N. Srivastava, G.
Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[44] M. U. Gutmann and A. Hyvärinen, "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics," Journal of Machine Learning Research, vol. 13, no. Feb, pp. 307–361, 2012.
[45] M. Shepperd and S. MacDonell, "Evaluating prediction systems in software project estimation," Information and Software Technology, vol. 54, no. 8, pp. 820–827, 2012. [Online]. Available: http://dx.doi.org/10.1016/j.infsof.2011.12.008
[46] P. A. Whigham, C. A. Owen, and S. G. MacDonell, "A baseline model for software effort estimation," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 24, no. 3, p. 20, 2015.
[47] P. Tirilly, V. Claveau, and P. Gros, "Language modeling for bag-of-visual words image categorization," in Proceedings of the 2008 International Conference on Content-Based Image and Video Retrieval, 2008, pp. 249–258.
[48] S. D. Conte, H. E. Dunsmore, and V. Y. Shen, Software Engineering Metrics and Models. Redwood City, CA, USA: Benjamin-Cummings Publishing Co., Inc., 1986.
[49] T. Foss, E. Stensrud, B. Kitchenham, and I. Myrtveit, "A simulation study of the model evaluation criterion MMRE," IEEE Transactions on Software Engineering, vol. 29, no. 11, pp. 985–995, 2003.
[50] B. Kitchenham, L. Pickard, S. MacDonell, and M. Shepperd, "What accuracy statistics really measure," IEE Proceedings - Software, vol. 148, no. 3, p. 81, 2001.
[51] M. Korte and D. Port, "Confidence in software cost estimation results based on MMRE and PRED," in Proceedings of the 4th International Workshop on Predictor Models in Software Engineering (PROMISE), 2008, pp. 63–70.
[52] D. Port and M.
Korte, “Comparative studies of the model ev aluation criterions mmre and pred in software cost estimation research, ” in Pr o- ceedings of the 2nd ACM-IEEE international symposium on Empirical softwar e engineering and measurement . ACM, 2008, pp. 51–60. [53] K. Muller , “Statistical power analysis for the behavioral sciences, ” T echnometrics , vol. 31, no. 4, pp. 499–500, 1989. [54] A. Arcuri and L. Briand, “A Hitchhiker’ s guide to statistical tests for assessing randomized algorithms in software engineering, ” Software T esting, V erification and Reliability , vol. 24, no. 3, pp. 219–250, 2014. [55] A. Arcuri and L. Briand, “A practical guide for using statistical tests to assess randomized algorithms in software engineering, ” in Proceedings of the 33rd International Conference on Software Engineering (ICSE) , 2011, pp. 1–10. [56] T . Menzies and M. Shepperd, “Special issue on repeatable results in software engineering prediction, ” Empirical Softwar e Engineering , vol. 17, no. 1-2, pp. 1–17, 2012. [57] T . Jia, Y angqing and Shelhamer , Evan and Donahue, Jeff and Karaye v , Serge y and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, “Caffe: Con volutional Architecture for Fast Feature Em- bedding, ” arXiv preprint , 2014. [58] M. Jorgensen and T . M. Gruschke, “The impact of lessons-learned sessions on ef fort estimation and uncertainty assessments, ” IEEE T rans- actions on Software Engineering , vol. 35, no. 3, pp. 368–383, 2009. [59] P . Abrahamsson, R. Moser, W . Pedrycz, A. Sillitti, and G. Succi, “Effort prediction in iterati ve software de velopment processes – incremental ver - sus global prediction models, ” 1st International Symposium on Empirical Softwar e Engineering and Measurement (ESEM) , pp. 344–353, 2007. [60] P . Hearty , N. Fenton, D. Marquez, and M. Neil, “Predicting Project V elocity in XP Using a Learning Dynamic Bayesian Network Model, ” IEEE T ransactions on Software Engineering , v ol. 35, no. 1, pp. 
124–137, 2009. [61] M. Perkusich, H. De Almeida, and A. Perkusich, “A model to detect problems on scrum-based software development projects, ” The ACM Symposium on Applied Computing , pp. 1037–1042, 2013. [62] E. Giger, M. Pinzger, and H. Gall, “Predicting the fix time of bugs, ” in Pr oceedings of the 2nd International W orkshop on Recommendation Systems for Software Engineering (RSSE) . A CM, 2010, pp. 52–56. [63] L. D. P anjer, “Predicting Eclipse Bug Lifetimes, ” in Pr oceedings of the 4th International W orkshop on Mining Software Repositories (MSR) , 2007, pp. 29–32. [64] P . Bhattacharya and I. Neamtiu, “Bug-fix time prediction models: can we do better?” in Pr oceedings of the 8th working conference on Mining softwar e r epositories (MSR) . A CM, 2011, pp. 207–210. [65] P . Hooimeijer and W . W eimer , “Modeling bug report quality , ” in Pr oceedings of the 22 IEEE/A CM international confer ence on A utomated softwar e engineering (ASE) . A CM Press, nov 2007, pp. 34 – 44. [66] Q. Le and T . Mikolov , “Distributed Representations of Sentences and Documents, ” in Pr oceedings of the 31st International Confer ence on Machine Learning (ICML) , vol. 32, 2014, pp. 1188–1196. [67] N. Kalchbrenner, E. Grefenstette, and P . Blunsom, “A Conv olutional Neural Network for Modelling Sentences, ” Pr oceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL) , pp. 655–665, 2014.