Predictive Business Process Monitoring with LSTM Neural Networks
Authors: Niek Tax, Ilya Verenich, Marcello La Rosa, Marlon Dumas
Predictive Business Process Monitoring with LSTM Neural Networks

Niek Tax (1), Ilya Verenich (2,3), Marcello La Rosa (2), and Marlon Dumas (3)

(1) Eindhoven University of Technology, The Netherlands, n.tax@tue.nl
(2) Queensland University of Technology, Australia, {ilya.verenich, m.larosa}@qut.edu.au
(3) University of Tartu, Estonia, marlon.dumas@ut.ee

Abstract. Predictive business process monitoring methods exploit logs of completed cases of a process in order to make predictions about running cases thereof. Existing methods in this space are tailor-made for specific prediction tasks. Moreover, their relative accuracy is highly sensitive to the dataset at hand, thus requiring users to engage in trial-and-error and tuning when applying them in a specific setting. This paper investigates Long Short-Term Memory (LSTM) neural networks as an approach to build consistently accurate models for a wide range of predictive process monitoring tasks. First, we show that LSTMs outperform existing techniques to predict the next event of a running case and its timestamp. Next, we show how to use models for predicting the next task in order to predict the full continuation of a running case. Finally, we apply the same approach to predict the remaining time, and show that this approach outperforms existing tailor-made methods.

1 Introduction

Predictive business process monitoring techniques are concerned with predicting the evolution of running cases of a business process based on models extracted from historical event logs. A range of such techniques have been proposed for a variety of prediction tasks: predicting the next activity [2], predicting the future path (continuation) of a running case [25], predicting the remaining cycle time [27], predicting deadline violations [22], and predicting the fulfillment of a property upon completion [20]. The predictions generated by these techniques have a range of applications.
For example, predicting the next activity (and its timestamp) or predicting the sequence of future activities in a case provides valuable input for planning and resource allocation. Meanwhile, predictions of the remaining execution time can be used to prioritize process instances in order to fulfill service-level objectives (e.g. to minimize deadline violations).

Existing predictive process monitoring approaches are tailor-made for specific prediction tasks and not readily generalizable. Moreover, their relative accuracy varies significantly depending on the input dataset and the point in time when the prediction is made. A technique may outperform another one for one log and a given prediction point (e.g. making the prediction at the mid-point of each trace), but under-perform it for another log at the same prediction point, or for the same log at an earlier prediction point [12,22]. In some cases, multiple techniques need to be combined [22] or considerable tuning is required (e.g. using hyperparameter optimization) [11] in order to achieve more consistent accuracy.

Recurrent neural networks with Long Short-Term Memory (LSTM) architectures [14] have been shown to deliver consistently high accuracy in several sequence modeling application domains, e.g. natural language processing [23] and speech recognition [13]. Recently, Evermann et al. [9] applied LSTMs to predictive process monitoring, specifically to predict the next activity in a case. Inspired by these results, this paper investigates the following questions: (i) can LSTMs be applied to a broad range of predictive process monitoring problems, and how? and (ii) do LSTMs achieve consistently high accuracy across a range of prediction tasks, event logs, and prediction points?
To address these questions, the paper puts forward LSTM architectures for predicting: (i) the next activity in a running case and its timestamp; (ii) the continuation of a case up to completion; and (iii) the remaining cycle time. The outlined LSTM architectures are empirically compared against tailor-made approaches with respect to their accuracy at different prediction points, using four real-life event logs.

The paper is structured as follows. Section 2 discusses related work. Section 3 introduces foundational concepts and notation. Section 4 describes a technique to predict the next activity in a case and its timestamp, and compares it against tailor-made baselines. Section 5 extends the previous technique to predict the continuation of a running case. Section 6 shows how this latter method can be used to predict the remaining time of a case, and compares it against tailor-made approaches. Section 7 concludes the paper and outlines future work directions.

2 Related Work

This section discusses existing approaches to predictive process monitoring for three prediction tasks: time-related predictions, predictions of the outcome of a case, and predictions of the continuation of a case and/or characteristics thereof.

2.1 Prediction of time-related properties

A range of research proposals have addressed the problem of predicting delays and deadline violations in business processes. Pika et al. [24] propose a technique for predicting deadline violations. Metzger et al. [21,22] present techniques for predicting "late show" events (i.e. delays between the expected and the actual time of arrival) in a freight transportation process. Senderovich et al. [28] apply queue mining techniques to predict delays in case executions.

Another body of work focuses on predicting the remaining cycle time of running cases. Van Dongen et al. predict the remaining time by using non-parametric regression models based on case variables [8].
Van der Aalst et al. [1] propose a remaining time prediction method by constructing a transition system from the event log using set, bag, or sequence abstractions. Rogge-Solti & Weske [27] use stochastic Petri nets to predict the remaining time of a process, taking into account the elapsed time since the last observed event. Folino et al. [10] develop an ad-hoc clustering approach to predict remaining time and overtime faults. In this paper, we show that prediction of the remaining cycle time can be approached as a special case of prediction of a process continuation. Specifically, our approach is shown to generally provide better accuracy than [1] and [8].

2.2 Prediction of case outcome

The goal of approaches in this category is to predict cases that will end up in an undesirable state. Maggi et al. [20] propose a framework to predict the outcome of a case (normal vs. deviant) based on the sequence of activities executed in a given case and the values of data attributes of the last executed activity in a case. This latter framework constructs a classifier on-the-fly (e.g. a decision tree or random forest) based on historical cases that are similar to the (incomplete) trace of a running case. Other approaches construct a collection of classifiers offline. For example, [19] construct one classifier for every possible prediction point (e.g. predicting the outcome after the first event, the second one, and so on). Meanwhile, [12] apply clustering techniques to group together similar prefixes of historical traces and then construct one classifier per cluster. The above approaches require one to extract a feature vector from a prefix of an ongoing trace. De Leoni et al. [18] propose a framework that classifies possible approaches to extract such feature vectors.
In this paper, we do not address the problem of case outcome prediction, although the proposed architectures could be extended in this direction.

2.3 Prediction of future event(s)

Breuker et al. [3] use probabilistic finite automata to tackle the next-activity prediction problem, while Evermann et al. [9] use LSTMs. Using the latter approach as a baseline, we propose an LSTM architecture that solves the next-activity prediction problem with higher accuracy than [9] and [3], and that can be generalized to other prediction problems. Pravilovic et al. [26] propose an approach that predicts both the next activity and its attributes (e.g. the involved resource). In this paper we use LSTMs to tackle a similar problem: predicting the next activity and its timestamp.

Lakshmanan et al. [16] use Markov chains to estimate the probability of future execution of a given task in a running case. Meanwhile, Van der Spoel et al. [29] address the more ambitious problem of predicting the entire continuation of a case using a shortest path algorithm over a causality graph. Polato et al. [25] refine this approach by mining an annotated transition system from an event log and annotating its edges with transition probabilities. In this paper, we take this latter approach as a baseline and show how LSTMs can improve over it while providing higher generalizability.

3 Background

In this section we introduce concepts used in later sections of this paper.

3.1 Event logs, traces and sequences

For a given set A, A* denotes the set of all sequences over A and σ = ⟨a_1, a_2, ..., a_n⟩ a sequence of length n; ⟨⟩ is the empty sequence and σ1 · σ2 is the concatenation of sequences σ1 and σ2. hd^k(σ) = ⟨a_1, a_2, ..., a_k⟩ is the prefix of length k (0 < k < n) of sequence σ, and tl^k(σ) = ⟨a_{k+1}, ..., a_n⟩ is its suffix.
For example, for a sequence σ1 = ⟨a, b, c, d, e⟩, hd^2(σ1) = ⟨a, b⟩ and tl^2(σ1) = ⟨c, d, e⟩.

Let E be the event universe, i.e., the set of all possible event identifiers, and T the time domain. We assume that events are characterized by various properties, e.g., an event has a timestamp, corresponds to an activity, is performed by a particular resource, etc. We do not impose a specific set of properties; however, given the focus of this paper, we assume that two of these properties are the timestamp and the activity of an event, i.e., there is a function π_T ∈ E → T that assigns timestamps to events, and a function π_A ∈ E → A that assigns to each event an activity from a finite set of process activities A. An event log is a set of events, each linked to one trace and globally unique, i.e., the same event cannot occur twice in a log. A trace in a log represents the execution of one case.

Definition 1 (Trace, Event Log). A trace is a finite non-empty sequence of events σ ∈ E* such that each event appears only once and time is non-decreasing, i.e., for 1 ≤ i < j ≤ |σ|: σ(i) ≠ σ(j) and π_T(σ(i)) ≤ π_T(σ(j)). C is the set of all possible traces. An event log is a set of traces L ⊆ C such that each event appears at most once in the entire log.

Given a trace and a property, we often need to compute a sequence consisting of the value of this property for each event in the trace. To this end, we lift the function f_p that maps an event to the value of its property p, in such a way that we can apply it to sequences of events (traces).

Definition 2 (Applying Functions to Sequences). A function f ∈ X → Y can be lifted to sequences over X using the following recursive definition: (1) f(⟨⟩) = ⟨⟩; (2) for any σ ∈ X* and x ∈ X: f(σ · ⟨x⟩) = f(σ) · ⟨f(x)⟩.

Finally, π_A(σ) transforms a trace σ into a sequence of its activities.
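To make the notation concrete, the prefix, suffix, and lifting operations can be sketched in Python, with tuples standing in for sequences (an illustrative sketch, not part of the paper's implementation; the event names and property map are hypothetical):

```python
# Sequences as Python tuples; hd^k and tl^k from Section 3.1,
# and the lifting of a per-event function to sequences (Definition 2).

def hd(sigma, k):
    # hd^k(sigma) = <a_1, ..., a_k>, the prefix of length k
    return sigma[:k]

def tl(sigma, k):
    # tl^k(sigma) = <a_{k+1}, ..., a_n>, the suffix after position k
    return sigma[k:]

def lift(f):
    # Lift f : X -> Y to sequences over X (Definition 2)
    return lambda sigma: tuple(f(x) for x in sigma)

sigma1 = ("a", "b", "c", "d", "e")
print(hd(sigma1, 2))  # -> ('a', 'b')
print(tl(sigma1, 2))  # -> ('c', 'd', 'e')

# A hypothetical activity property pi_A, given as a dict over two events:
pi_A = {"e1": "a", "e2": "b"}.get
print(lift(pi_A)(("e1", "e2")))  # -> ('a', 'b')
```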
For example, for trace σ = ⟨e1, e2⟩, with π_A(e1) = a and π_A(e2) = b, π_A(σ) = ⟨a, b⟩.

3.2 Neural Networks & Recurrent Neural Networks

A neural network consists of one layer of input units, one layer of output units, and multiple layers in-between, which are referred to as hidden units. The outputs of the input units form the inputs of the units of the first hidden layer (i.e., the first layer of hidden units), and the outputs of the units of each hidden layer form the input for each subsequent hidden layer. The outputs of the last hidden layer form the input for the output layer. The output of each unit is a function over the weighted sum of its inputs. The weights of this weighted sum performed in each unit are learned through gradient-based optimization from training data that consists of example inputs and desired outputs for those example inputs.

[Fig. 1. A simple recurrent neural network (taken from [17]).]

Recurrent Neural Networks (RNNs) are a special type of neural network where the connections between neurons form a directed cycle. RNNs can be unfolded, as shown in Figure 1. Each step in the unfolding is referred to as a time step, where x_t is the input at time step t. RNNs can take an arbitrary-length sequence as input, by providing the RNN a feature representation of one element of the sequence at each time step. s_t is the hidden state at time step t and contains information extracted from all time steps up to t. The hidden state s is updated with information of the new input x_t after each time step: s_t = f(U x_t + W s_{t−1}), where U and W are matrices of weights over the new inputs and the hidden state respectively. Function f, known as the activation function, is usually either the hyperbolic tangent or the logistic function, often referred to as the sigmoid function: sigmoid(x) = 1 / (1 + exp(−x)).
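The recurrence s_t = f(U x_t + W s_{t−1}) can be sketched numerically for a single scalar unit; the weights below are arbitrary illustrative values, not learned ones:

```python
import math

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

def rnn_step(s_prev, x_t, U, W):
    # One unfolded RNN time step for scalar state and input:
    # s_t = f(U * x_t + W * s_{t-1}), with f = sigmoid here
    return sigmoid(U * x_t + W * s_prev)

# Feed a short input sequence through the recurrence (arbitrary weights)
U, W = 0.5, 0.8
s = 0.0
for x in (1.0, 0.0, 1.0):
    s = rnn_step(s, x, U, W)
print(round(s, 4))  # the final hidden state summarizes the whole sequence
```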
In the neural network literature the sigmoid function is often represented with the letter σ, but we will write sigmoid in full to avoid confusion with traces. o_t is the output at step t.

3.3 Long Short-Term Memory for Sequence Modeling

A Long Short-Term Memory model (LSTM) [14] is a special recurrent neural network architecture with powerful modeling capabilities for long-term dependencies. The main distinction between a regular RNN and an LSTM is that the latter has a more complex memory cell C_t replacing s_t. Where the value of state s_t in an RNN is the result of a function over the weighted average of s_{t−1} and x_t, the LSTM state C_t is accessed, written, and cleared through controlling gates, respectively o_t, i_t, and f_t. Information on a new input is accumulated into the memory cell if i_t is activated. Additionally, the past memory cell state C_{t−1} can be "forgotten" if f_t is activated. The information of C_t is propagated to the output h_t based on the activation of output gate o_t. Combined, the LSTM model can be described with the following formulas:

f_t = sigmoid(W_f · [h_{t−1}, x_t] + b_f)
i_t = sigmoid(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t
o_t = sigmoid(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ∗ tanh(C_t)

In these formulas, all W variables are weight matrices and all b variables are bias vectors; both are learned during the training phase.

4 Next Activity and Timestamp Prediction

In this section we present and evaluate multiple architectures for next event and timestamp prediction using LSTMs.

4.1 Approach

We start by predicting the next activity in a case and its timestamp, by learning an activity prediction function f¹_a and a time prediction function f¹_t.
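Before detailing the prediction functions, the LSTM equations of Section 3.3 can be made concrete with a scalar-valued sketch (one unit, hand-picked weights; a real model learns matrix-valued W and vector-valued b):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(h_prev, C_prev, x_t, w, b):
    # One LSTM step for a single unit; w[gate] = (weight for h_{t-1}, weight for x_t).
    # Follows the equations of Section 3.3 with scalars instead of matrices.
    f_t = sigmoid(w["f"][0] * h_prev + w["f"][1] * x_t + b["f"])        # forget gate
    i_t = sigmoid(w["i"][0] * h_prev + w["i"][1] * x_t + b["i"])        # input gate
    C_tilde = math.tanh(w["C"][0] * h_prev + w["C"][1] * x_t + b["C"])  # candidate cell
    C_t = f_t * C_prev + i_t * C_tilde                                  # new cell state
    o_t = sigmoid(w["o"][0] * h_prev + w["o"][1] * x_t + b["o"])        # output gate
    h_t = o_t * math.tanh(C_t)                                          # new hidden state
    return h_t, C_t

# Arbitrary illustrative parameters
w = {g: (0.5, 1.0) for g in ("f", "i", "C", "o")}
b = {g: 0.0 for g in ("f", "i", "C", "o")}
h, C = 0.0, 0.0
for x in (1.0, -1.0, 1.0):
    h, C = lstm_step(h, C, x, w, b)
print(round(h, 4), round(C, 4))
```

Note how the forget gate f_t lets the cell carry C_{t−1} across many steps, which is what gives LSTMs their long-term memory compared to the plain RNN recurrence.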
We aim at functions f¹_a and f¹_t such that f¹_a(hd^k(σ)) = hd^1(tl^k(π_A(σ))) and f¹_t(hd^k(σ)) = hd^1(tl^k(π_T(σ))) for any prefix length k. We transform each event e ∈ hd^k(σ) into a feature vector and use these vectors as LSTM inputs x_1, ..., x_k.

We build the feature vector as follows. We start with |A| features that represent the type of activity of event e in a so-called one-hot encoding. We take an arbitrary but consistent ordering over the set of activities A, and use index ∈ A → {1, ..., |A|} to indicate the position of an activity in it. The one-hot encoding assigns the value 1 to feature number index(π_A(e)) and the value 0 to the other features. We add three time-based features to the one-hot encoded feature vector. The first time-based feature of event e = σ(i) is the time between the previous event in the trace and the current event:

fv_t1(e) = 0 if i = 1, and fv_t1(e) = π_T(e) − π_T(σ(i−1)) otherwise.

This feature allows the LSTM to learn dependencies between the time differences at different points (indexes) in the process. Many activities can only be performed during office hours; therefore we add a time feature fv_t2 that contains the time within the day (since midnight) and a feature fv_t3 that contains the time within the week (since midnight on Sunday). fv_t2 and fv_t3 enable the LSTM to learn that, for instance, if the last observed event occurred at the end of the working day or at the end of the working week, the time until the next event is expected to be longer.

At training time, we set the target output o^k_a of time step k to the one-hot encoding of the activity of the event one time step later. However, it can be the case that the case ends at time k, in which case there is no new event to predict. Therefore we add an extra element to the output one-hot encoding vector, which has value 1 when the case ends after k.
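The input encoding described above (one-hot activity encoding plus the three time features) can be sketched as follows; the activity set and timestamps are hypothetical, and times are expressed in seconds:

```python
from datetime import datetime

ACTIVITIES = ["register", "triage", "resolve"]    # hypothetical activity set A
INDEX = {a: i for i, a in enumerate(ACTIVITIES)}  # arbitrary but consistent ordering

def event_features(activity, ts, prev_ts):
    # |A| one-hot features for the activity of the event
    onehot = [0.0] * len(ACTIVITIES)
    onehot[INDEX[activity]] = 1.0
    # fv_t1: time since the previous event (0 for the first event of a case)
    fv_t1 = 0.0 if prev_ts is None else (ts - prev_ts).total_seconds()
    # fv_t2: time within the day (seconds since midnight)
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    fv_t2 = (ts - midnight).total_seconds()
    # fv_t3: time within the week (seconds since midnight on Sunday)
    fv_t3 = ((ts.weekday() + 1) % 7) * 86400.0 + fv_t2
    return onehot + [fv_t1, fv_t2, fv_t3]

t1 = datetime(2017, 3, 6, 9, 0)    # a Monday, 09:00
t2 = datetime(2017, 3, 6, 10, 30)  # same day, 10:30
x1 = event_features("register", t1, None)
x2 = event_features("triage", t2, t1)
print(x2)  # [0.0, 1.0, 0.0, 5400.0, 37800.0, 124200.0]
```

One such vector is produced per event of the prefix and fed to the LSTM one time step at a time.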
We set a second target output o^k_t equal to the fv_t1 feature of the next time step, i.e. the target is the time difference between the next and the current event. Since we know the timestamp of the current event, we can then calculate the timestamp of the following event. We optimize the weights of the neural network with the Adam learning algorithm [15], such that both the cross entropy between the ground-truth one-hot encoding of the next event and the predicted one-hot encoding of the next event, and the mean absolute error (MAE) between the ground-truth and predicted time until the next event, are minimized.

[Fig. 2. Neural network architectures with single-task layers (a), with shared multi-task layers (b), and with n + m layers of which n are shared (c).]

Modeling the next activity prediction function f¹_a and the time prediction function f¹_t with LSTMs can be done using several architectures. Firstly, we can train two separate models, one for f¹_a and one for f¹_t, both using the same input features at each time step, as represented in Figure 2 (a). Secondly, f¹_a and f¹_t can be learned jointly in a single LSTM model that generates two outputs, in a multi-task learning setting [4] (Figure 2 (b)). The use of LSTMs in a multi-task learning setting has been shown to improve performance on all individual tasks when jointly learning multiple natural language processing tasks, including part-of-speech tagging, named entity recognition, and sentence classification [6].
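The joint training objective, cross entropy on the activity output plus MAE on the time output, can be illustrated for a single time step (an illustrative sketch; in practice a framework such as Keras computes and minimizes these losses over the whole training set):

```python
import math

def cross_entropy(target_onehot, predicted_probs, eps=1e-12):
    # Cross entropy between the ground-truth one-hot vector and the
    # predicted probability distribution over next activities
    return -sum(t * math.log(p + eps) for t, p in zip(target_onehot, predicted_probs))

def mae(target_times, predicted_times):
    # Mean absolute error between ground-truth and predicted times until next event
    n = len(target_times)
    return sum(abs(t - p) for t, p in zip(target_times, predicted_times)) / n

# One time step: 3 activities plus 1 extra "end of case" element in the activity output
target_activity = [0, 1, 0, 0]
predicted_activity = [0.1, 0.7, 0.1, 0.1]
loss_a = cross_entropy(target_activity, predicted_activity)
loss_t = mae([5400.0], [6000.0])
print(round(loss_a, 4), loss_t)  # joint training minimizes the sum of both terms
```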
A hybrid option between the architectures of Figures 2 (a) and (b) is an architecture with a number of LSTM layers shared between both tasks, followed by a number of layers that specialize in either prediction of the next activity or prediction of the time until the next event, as shown in Figure 2 (c).

It should be noted that the activity prediction function f¹_a outputs a probability distribution over the possible continuations of the partial trace. For evaluation purposes, we only use the most likely continuation.

We implemented the technique as a set of Python scripts using the recurrent neural network library Keras [5]. The experiments were performed on a single NVIDIA Tesla K80 GPU, on which they took between 15 and 90 seconds per training iteration, depending on the neural network architecture. The execution time to make a prediction is in the order of milliseconds.

4.2 Experimental setup

In this section we describe and motivate the metrics, datasets, and baseline methods used for evaluation of the predictions of the next activities and of the timestamps of the next events.

To the best of our knowledge, there is no existing technique to predict both the next activity and its timestamp. Therefore, we utilize one baseline method for activity prediction and a different one for timestamp prediction. Well-known error metrics for regression tasks are Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). Time differences between events tend to be highly varying, with values of different orders of magnitude. We evaluate the predictions using MAE, since RMSE would be very sensitive to errors on outlier data points, where the time between two events in the log is very large. The remaining cycle time prediction method proposed by van der Aalst et al.
[1] can be naturally adjusted to predict the time until the next event. To do so, we build a transition system from the event log using either the set, bag, or sequence abstraction, as in [1], but instead annotate the transition system states with the average time until the next event. We use this approach as a baseline to predict the timestamp of the next event.

We evaluate the performance of predicting the next activity and its timestamp on two datasets. We use the chronologically ordered first 2/3 of the traces as training data, and evaluate the activity and time predictions on the remaining 1/3 of the traces. We evaluate the next activity and timestamp prediction on all prefixes hd^k(σ) of each trace σ in the set of test traces, for 2 ≤ k < |σ|. We do not make any predictions for trace prefixes of size one, since for those prefixes there is insufficient data available to base the prediction upon.

Helpdesk dataset. This log contains events from a ticketing management process of the help desk of an Italian software company.¹ The process consists of 9 activities, and all cases start with the insertion of a new ticket into the ticketing management system. Each case ends when the issue is resolved and the ticket is closed. The log contains 3,804 cases and 13,710 events.

BPI'12 subprocess W dataset. This event log originates from the Business Process Intelligence Challenge (BPI'12)² and contains data from the application procedure for financial products at a large financial institution. The process consists of three subprocesses: one that tracks the state of the application, one that tracks the states of work items associated with the application, and a third one that tracks the state of the offer. In the context of predicting the coming events and their timestamps, we are not interested in events that are performed automatically.
Thus, we narrow down our evaluation to the work items subprocess, which contains events that are manually executed. Further, we filter the log to retain only events of type complete. Two existing techniques [3,9] for next-activity prediction, described in Section 2, have been evaluated on this event log with identical preprocessing, enabling comparison.

¹ doi:10.17632/39bp3vv62t.1
² doi:10.4121/uuid:3926db30-f712-4394-aebc-75976070e91f

                                     Helpdesk                             BPI'12 W
Model  Layers Shared N/l | MAE@2 MAE@4 MAE@6  All   Accuracy | MAE@2 MAE@10 MAE@20  All   Accuracy
LSTM   4      4      100 | 3.64  2.79  2.22   3.82  0.7076   | 1.75  1.49   1.02    1.61  0.7466
       4      3      100 | 3.63  2.78  2.21   3.83  0.7075   | 1.74  1.47   1.01    1.59  0.7479
       4      2      100 | 3.59  2.82  2.27   3.81  0.7114   | 1.72  1.45   1.00    1.57  0.7497
       4      1      100 | 3.58  2.77  2.24   3.77  0.7074   | 1.70  1.46   1.01    1.59  0.7522
       4      0      100 | 3.78  2.98  2.41   3.95  0.7072   | 1.74  1.47   1.05    1.61  0.7515
       3      3      100 | 3.58  2.69  2.22   3.77  0.7116   | 1.69  1.47   1.02    1.58  0.7507
       3      2      100 | 3.59  2.69  2.21   3.80  0.7118   | 1.69  1.47   1.01    1.57  0.7512
       3      1      100 | 3.55  2.78  2.38   3.76  0.7123   | 1.72  1.47   1.04    1.59  0.7525
       3      0      100 | 3.62  2.71  2.23   3.82  0.6924   | 1.81  1.51   1.07    1.66  0.7506
       2      2      100 | 3.61  2.64  2.11   3.81  0.7117   | 1.72  1.46   1.02    1.58  0.7556
       2      1      100 | 3.57  2.61  2.11   3.77  0.7119   | 1.69  1.45   1.01    1.56  0.7600
       2      0      100 | 3.66  2.89  2.13   3.86  0.6985   | 1.74  1.46   0.99    1.60  0.7537
       1      1      100 | 3.54  2.71  3.16   3.75  0.7072   | 1.71  1.47   0.98    1.57  0.7486
       1      0      100 | 3.55  2.91  2.45   3.87  0.7110   | 1.72  1.46   1.05    1.59  0.7431
       3      1       75 | 3.73  2.81  2.23   3.89  0.7118   | 1.73  1.49   1.07    1.62  0.7503
       3      1      150 | 3.78  2.92  2.43   3.97  0.6918   | 1.81  1.52   1.14    1.71  0.7491
       2      1       75 | 3.73  2.79  2.32   3.90  0.7045   | 1.72  1.47   1.03    1.59  0.7544
       2      1      150 | 3.62  2.73  2.23   3.83  0.6982   | 1.74  1.49   1.08    1.65  0.7511
       1      1       75 | 3.74  2.87  2.35   3.87  0.6925   | 1.75  1.50   1.07    1.64  0.7452
       1      1      150 | 3.73  2.79  2.32   3.92  0.7103   | 1.72  1.48   1.02    1.60  0.7489
RNN    3      1      100 | 4.21  3.25  3.13   4.04  0.6581   | -     -      -       -     -
       2      1      100 | 4.12  3.23  3.05   3.98  0.6624   | -     -      -       -     -
       1      1      100 | 4.14  3.28  3.12   4.02  0.6597   | -     -      -       -     -
Time prediction baselines
Set abstraction [1]      | 6.15  4.25  4.07   5.83  -        | 2.71  1.64   1.02    1.97  -
Bag abstraction [1]      | 6.17  4.11  3.26   5.74  -        | 2.89  1.71   1.07    1.92  -
Sequence abstraction [1] | 6.17  3.53  2.98   5.67  -        | 2.89  1.69   1.07    1.91  -
Activity prediction baselines
Evermann et al. [9]      | -     -     -      -     -        | -     -      -       -     0.623
Breuker et al. [3]       | -     -     -      -     -        | -     -      -       -     0.719

Table 1. Experimental results for the Helpdesk and BPI'12 W logs. MAE@k denotes the MAE in days at prefix length k; N/l denotes the number of neurons per layer.

4.3 Results

Table 1 shows the performance of various LSTM architectures on the Helpdesk and the BPI'12 W subprocess logs in terms of MAE on the predicted time and accuracy of predicting the next event. The specific prefix sizes are chosen such that they represent short, medium, and long traces for each log. Thus, as the BPI'12 W log contains longer traces, the prefix sizes evaluated are higher for this log. In the table, "All" reports the average performance over all prefixes, not just the three prefix sizes reported in the preceding columns. The number of shared layers represents the number of layers that contribute to both time and activity prediction. Rows where the number of shared layers is 0 correspond to the architecture of Figure 2 (a), where the prediction of time and activities is performed with separate models. When the number of shared layers is equal to the number of layers, the neural network contains no specialized layers, corresponding to the architecture of Figure 2 (b). Table 1 also shows the results of predicting the time until the next event using the adjusted method from van der Aalst et al. [1] for comparison. All LSTM architectures outperform the baseline approach on all prefixes, as well as averaged over all prefixes, on both datasets.
Further, it can be observed that the performance gain between the best LSTM model and the best baseline model is much larger for short prefixes than for long prefixes. The best performance obtained on next activity prediction over all prefixes was a classification accuracy of 71% on the Helpdesk log. On the BPI'12 W log the best accuracy is 76%, which is higher than the 71.9% accuracy on this log reported by Breuker et al. [3] and the 62.3% accuracy reported by Evermann et al. [9]. In fact, the results obtained with LSTMs are consistently higher than both approaches. Even though Evermann et al. [9] also rely on LSTMs in their approach, there are several differences which are likely to cause the performance gap. First of all, [9] uses a technique called embedding [23] to create feature descriptions of events, instead of the features described above. Embeddings automatically transform each activity into a "useful" high-dimensional continuous feature vector. This approach has been shown to work very well in the field of natural language processing, where the number of distinct words that can be predicted is very large; but for process mining event logs, where the number of distinct activities is often in the order of hundreds or much less, no useful feature vector can be learned automatically. Second, [9] uses a two-layer architecture with 500 neurons per layer, and does not explore other variants. We found performance to decrease when increasing the number of neurons from 100 to 150, which makes it likely that the performance of a 500-neuron model will decrease further due to overfitting. A third and last explanation for the performance difference is the use of multi-task learning, which, as we showed, slightly improves prediction performance on the next activity.
Even though the performance differences between our three LSTM architectures are small for both logs, we observe that most of the best performances of the LSTM model, in terms of both time prediction and next activity prediction, are obtained either with the completely shared architecture of Figure 2 (b) or with the hybrid architecture of Figure 2 (c). We experimented with decreasing the number of neurons per layer to 75 and increasing it to 150 for architectures with one shared layer, but found that this results in decreasing performance on both tasks. It is likely that 75 neurons resulted in underfitting models, while 150 neurons resulted in overfitting models. We also experimented with traditional RNNs on one-layer architectures, and found that they perform significantly worse than LSTMs on both time and activity prediction.

5 Suffix Prediction

Applying functions f¹_a and f¹_t repeatedly allows us to make longer-term predictions that look further ahead than a single time step. We use f⊥_a and f⊥_t to refer to activity and time-until-next-event prediction functions that predict the whole continuation of a running case, and aim at those functions to be such that f⊥_a(hd^k(σ)) = tl^k(π_A(σ)) and f⊥_t(hd^k(σ)) = tl^k(π_T(σ)).

5.1 Approach

The suffix can be predicted by iteratively predicting the next activity and the time until the next event, until the next activity prediction function f¹_a predicts the end of the case, which we represent with ⊥.
More formally, we calculate the complete suffix of activities as follows:

f⊥_a(σ) = σ, if f¹_a(σ) = ⊥
f⊥_a(σ) = f⊥_a(σ · ⟨e⟩), otherwise, with e ∈ E such that π_A(e) = f¹_a(σ) and π_T(e) = f¹_t(σ) + π_T(σ(|σ|)),

and we calculate the suffix of times until the next events as follows:

f⊥_t(σ) = σ, if f¹_a(σ) = ⊥
f⊥_t(σ) = f⊥_t(σ · ⟨e⟩), otherwise, with e ∈ E such that π_A(e) = f¹_a(σ) and π_T(e) = f¹_t(σ) + π_T(σ(|σ|)).

5.2 Experimental Setup

For a given trace prefix hd^k(σ) we evaluate the performance of f⊥_a by calculating the distance between the predicted continuation f⊥_a(hd^k(σ)) and the actual continuation π_A(tl^k(σ)). Many sequence distance metrics exist, with Levenshtein distance being one of the most well-known. Levenshtein distance is defined as the minimum number of insertion, deletion, and substitution operations needed to transform one sequence into the other.

Levenshtein distance is not suitable when the business process includes parallel branches. Indeed, when ⟨a, b⟩ are the next predicted events and ⟨b, a⟩ are the actual next events, we consider this to be only a minor error, since it is often not relevant in which order two parallel activities are executed. However, Levenshtein distance would assign a cost of 2 to this prediction, as transforming the predicted sequence into the ground truth would require one deletion and one insertion operation. An evaluation measure that better reflects the prediction quality is the Damerau-Levenshtein distance [7], which adds a swapping operation to the set of operations used by Levenshtein distance. Damerau-Levenshtein distance would assign a cost of 1 to transform ⟨a, b⟩ into ⟨b, a⟩.
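The recursive definitions above amount to an iterative loop: repeatedly append a predicted event until f¹_a predicts the end symbol ⊥. The sketch below uses hypothetical stub functions in place of the trained LSTM predictors, and also shows how the last predicted timestamp yields the remaining cycle time (anticipating Section 6):

```python
END = None  # stands in for the end-of-case symbol ⊥

def predict_suffix(prefix, f1_a, f1_t, max_len=100):
    # prefix: tuple of (activity, timestamp) events; f1_a / f1_t take the
    # running trace and return the next activity (or END) and the time delta.
    trace = list(prefix)
    while len(trace) < max_len:
        a = f1_a(trace)
        if a is END:
            break
        t = trace[-1][1] + f1_t(trace)  # timestamp = previous timestamp + delta
        trace.append((a, t))
    suffix = tuple(trace[len(prefix):])
    remaining_time = trace[-1][1] - prefix[-1][1]  # predicted end minus current time
    return suffix, remaining_time

# Hypothetical stubs: always predict activity "c" with a 10-unit delay,
# ending the case once the trace reaches length 4.
f1_a = lambda tr: "c" if len(tr) < 4 else END
f1_t = lambda tr: 10.0

suffix, remaining = predict_suffix((("a", 0.0), ("b", 5.0)), f1_a, f1_t)
print(suffix)     # (('c', 15.0), ('c', 25.0))
print(remaining)  # 20.0
```

The max_len cap is an assumption added for safety, since a poor next-activity predictor might otherwise never emit ⊥.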
To obtain comparable results for traces of variable length, we normalize the Damerau-Levenshtein distance by the maximum of the length of the ground truth suffix and the length of the predicted suffix, and subtract the normalized Damerau-Levenshtein distance from 1 to obtain the Damerau-Levenshtein Similarity (DLS).

To the best of our knowledge, the most recent method to predict an arbitrary number of events ahead is the one by Polato et al. [25]. The authors first extract a transition system from the log and then learn a machine learning model for each transition system state to predict the next activity. They evaluate on predictions of a fixed number of events ahead, while we are interested in the continuation of the case until its end. We redid the experiments with their ProM plugin to obtain the performance on the predicted full case continuation.

For the LSTM experiments, we use a two-layer architecture with one shared layer and 100 neurons per layer, which showed good performance in terms of next activity prediction and predicting the time until the next event in the previous experiment (Table 1). In addition to the two previously introduced logs, we evaluate prediction of the suffix on an additional dataset, described below, which becomes feasible now that we have fixed the LSTM architecture.

Environmental permit dataset. This is a log of an environmental permitting process at a Dutch municipality.¹ Each case refers to one permit application. The log contains 937 cases and 38,944 events of 381 event types. Almost every case follows a unique path, making suffix prediction more challenging.

5.3 Results

Table 2 summarizes the results of suffix prediction for each log. As can be seen, the LSTM outperforms the baseline [25] on all logs.
Even though it improves over the baseline, the performance on the BPI'12 W log is low given that the log contains only 6 activities. After inspection we found that this log contains many sequences of two or more consecutive events of the same activity, where runs of 8 or more identical events are not uncommon. We found that LSTMs have problems dealing with this log characteristic, causing them to predict overly long sequences of the same activity, resulting in predicted suffixes that are much longer than the ground truth suffixes. Hence, we also evaluated suffix prediction on a modified version of the BPI'12 W log where we removed repeated occurrences of the same event, keeping only the first occurrence. However, we observed only a mild improvement over the unmodified log.

Method         Help desk   BPI'12 W   BPI'12 W (no duplicates)   Environmental permit
Polato [25]    0.2516      0.0458     0.0336                     0.0260
LSTM           0.7669      0.3533     0.3937                     0.1522

Table 2. Suffix prediction results in terms of Damerau-Levenshtein Similarity.

6 Remaining Cycle Time Prediction

The time prediction function f^⊥_t predicts the timestamps of all events in a running case that are still to come. Since the last predicted timestamp in a prediction generated by f^⊥_t is the timestamp of the end of the case, it is easy to see that f^⊥_t can be used for predicting the remaining cycle time of the running case. For a given unfinished case σ, σ̂_t = f^⊥_t(σ) contains the predicted timestamps of the next events, and σ̂_t(|σ̂_t|) contains the predicted end time of σ; therefore, the estimated remaining cycle time can be obtained as σ̂_t(|σ̂_t|) − π_T(σ(|σ|)).

6.1 Experimental Setup

We use the same architecture as for the suffix prediction experiments. We predict and evaluate the remaining time after each passed event, starting from prefix size 2. We use the remaining cycle time prediction methods of van der Aalst et al.
[1] and van Dongen et al. [8] as baseline methods.

¹ doi:10.4121/uuid:26aba40d-8b2d-435b-b5af-6d4bfbd7a270

6.2 Results

Figure 3 shows the mean absolute error (MAE) for each prefix size, for the four logs (Help desk, BPI'12 W, BPI'12 W with no duplicates, and Environmental permit). It can be seen that LSTM consistently outperforms the baselines on the Help desk log. An exception is the BPI'12 W log, where LSTM performs worse than the baselines on short prefixes. This is caused by the problem that LSTMs have in predicting the next event when the log has many repeated events, as described in Section 5. This problem causes the LSTM to predict suffixes that are too long compared to the ground truth, and thereby also to overestimate the remaining cycle time. We see that the LSTM does outperform the baseline on the modified version of the BPI'12 W log where we only kept the first occurrence of each repeated event in a sequence. Note that we do not remove the last event of the case, even if it is a repeated event, as that would change the ground truth remaining cycle time for the prefix.

[Figure 3: four line charts of MAE (in days) against prefix size; compared methods: van Dongen, set, sequence, bag, and LSTM.]
Fig. 3. MAE values using prefixes of different lengths for the help desk (a), BPI'12 W (b), BPI'12 W (no duplicates) (c) and environmental permit (d) datasets.

7 Conclusion & Future Work

The foremost contribution of this paper is a technique to predict the next activity of a running case and its timestamp using LSTM neural networks. We showed that this technique outperforms existing baselines on real-life data sets.
Additionally, we found that predicting the next activity and its timestamp via a single model (multi-task learning) yields a higher accuracy than predicting them using separate models. We then showed that this basic technique can be generalized to address two other predictive process monitoring problems: predicting the entire continuation of a running case and predicting the remaining cycle time. We empirically showed that the generalized LSTM-based technique outperforms tailor-made approaches to these problems. We also identified a limitation of LSTM models when dealing with traces with multiple occurrences of the same activity, in which case the model predicts overly long sequences of the same event. Addressing this latter limitation is a direction for future work.

The proposed technique can be extended to other prediction tasks, such as prediction of aggregate performance indicators and case outcomes. The latter task can be approached as a classification problem, wherein each neuron of the output layer predicts the probability of the corresponding outcome. Another avenue for future work is to extend feature vectors with additional case and event attributes (e.g. resources). Finally, we plan to extend the multi-task learning approach to predict other attributes of the next activity besides its timestamp.

Reproducibility. The source code and supplementary material required to reproduce the experiments reported in this paper can be found at http://verenich.github.io/ProcessSequencePrediction.

Acknowledgments. This research is funded by the Australian Research Council (grant DP150103356), the Estonian Research Council (grant IUT20-55) and the RISE BPM project (H2020 Marie Curie Program, grant 645751).

References

1. van der Aalst, W.M.P., Schonenberg, M.H., Song, M.: Time prediction based on process mining.
Information Systems 36(2), 450–475 (2011)
2. Becker, J., Breuker, D., Delfmann, P., Matzner, M.: Designing and implementing a framework for event-based predictive modelling of business processes. In: Proceedings of the 6th International Workshop on Enterprise Modelling and Information Systems Architectures. pp. 71–84. Springer (2014)
3. Breuker, D., Matzner, M., Delfmann, P., Becker, J.: Comprehensible predictive models for business processes. MIS Quarterly 40(4), 1009–1034 (2016)
4. Caruana, R.: Multitask learning. Machine Learning 28(1), 41–75 (1997)
5. Chollet, F.: Keras. https://github.com/fchollet/keras (2015)
6. Collobert, R., Weston, J.: A unified architecture for natural language processing: Deep neural networks with multitask learning. In: ICML. pp. 160–167. ACM (2008)
7. Damerau, F.J.: A technique for computer detection and correction of spelling errors. Communications of the ACM 7(3), 171–176 (1964)
8. van Dongen, B.F., Crooy, R.A., van der Aalst, W.M.P.: Cycle time prediction: when will this case finally be finished? In: CoopIS. pp. 319–336. Springer (2008)
9. Evermann, J., Rehse, J.R., Fettke, P.: A deep learning approach for predicting process behaviour at run time. In: Proceedings of the 1st International Workshop on Runtime Analysis of Process-Aware Information Systems. Springer (2016)
10. Folino, F., Guarascio, M., Pontieri, L.: Discovering context-aware models for predicting business process performances. In: CoopIS. pp. 287–304 (2012)
11. Francescomarino, C.D., Dumas, M., Federici, M., Ghidini, C., Maggi, F.M., Rizzi, W.: Predictive business process monitoring framework with hyperparameter optimization. In: CAiSE. pp. 361–376. Springer (2016)
12. Francescomarino, C.D., Dumas, M., Maggi, F.M., Teinemaa, I.: Clustering-based predictive process monitoring.
CoRR abs/1506.01428 (2015), to appear in Transactions on Services Computing
13. Graves, A., Mohamed, A.R., Hinton, G.: Speech recognition with deep recurrent neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6645–6649. IEEE (2013)
14. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
15. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference for Learning Representations (2015)
16. Lakshmanan, G.T., Shamsi, D., Doganata, Y.N., Unuvar, M., Khalaf, R.: A Markov prediction model for data-driven semi-structured business processes. Knowledge and Information Systems 42(1), 97–126 (2015)
17. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
18. de Leoni, M., van der Aalst, W.M.P., Dees, M.: A general process mining framework for correlating, predicting and clustering dynamic behavior based on event logs. Information Systems 56, 235–257 (2016)
19. Leontjeva, A., Conforti, R., Di Francescomarino, C., Dumas, M., Maggi, F.M.: Complex symbolic sequence encodings for predictive monitoring of business processes. In: BPM. pp. 297–313. Springer (2015)
20. Maggi, F.M., Di Francescomarino, C., Dumas, M., Ghidini, C.: Predictive monitoring of business processes. In: CAiSE. pp. 457–472. Springer (2014)
21. Metzger, A., Franklin, R., Engel, Y.: Predictive monitoring of heterogeneous service-oriented business networks: The transport and logistics case. In: 2012 Annual SRII Global Conference. pp. 313–322. IEEE (2012)
22. Metzger, A., Leitner, P., Ivanovic, D., Schmieders, E., Franklin, R., Carro, M., Dustdar, S., Pohl, K.: Comparing and combining predictive business process monitoring techniques. IEEE Trans. Systems, Man, and Cybernetics: Systems 45(2), 276–290 (2015)
23.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
24. Pika, A., van der Aalst, W.M.P., Fidge, C.J., ter Hofstede, A.H.M., Wynn, M.T.: Predicting deadline transgressions using event logs. In: BPM. pp. 211–216. Springer (2012)
25. Polato, M., Sperduti, A., Burattin, A., de Leoni, M.: Time and activity sequence prediction of business process instances. arXiv preprint arXiv:1602.07566 (2016)
26. Pravilovic, S., Appice, A., Malerba, D.: Process mining to forecast the future of running cases. In: International Workshop on New Frontiers in Mining Complex Patterns. pp. 67–81. Springer (2013)
27. Rogge-Solti, A., Weske, M.: Prediction of remaining service execution time using stochastic Petri nets with arbitrary firing delays. In: ICSOC. pp. 389–403. Springer (2013)
28. Senderovich, A., Weidlich, M., Gal, A., Mandelbaum, A.: Queue mining - predicting delays in service processes. In: CAiSE. pp. 42–57 (2014)
29. van der Spoel, S., van Keulen, M., Amrit, C.: Process prediction in noisy data sets: a case study in a Dutch hospital. In: International Symposium on Data-Driven Process Discovery and Analysis. pp. 60–83. Springer (2012)