ArduCode: Predictive Framework for Automation Engineering

Arquimedes Canedo¹∗, Palash Goyal²∗, Di Huang²∗, Amit Pandey¹, Gustavo Quiros¹
¹ Siemens Corporate Technology, ² USC Information Sciences Institute

Abstract: Automation engineering is the task of integrating, via software, various sensors, actuators, and controls to automate a real-world process. Today, automation engineering is supported by a suite of software tools including integrated development environments (IDE), hardware configurators, compilers, and runtimes. These tools focus on the automation code itself, but leave the automation engineer unassisted in their decision making. This can lead to longer software development cycles due to imperfections in the decision making that arise when integrating software and hardware. To address this problem, this paper considers multiple challenges often faced in automation engineering and proposes machine learning-based solutions to assist engineers in tackling these challenges. We show that machine learning can be leveraged to assist the automation engineer in classifying automation code, finding similar code snippets, and reasoning about the hardware selection of sensors and actuators. We validate our architecture on two real datasets consisting of 2,927 Arduino projects and 683 Programmable Logic Controller (PLC) projects. Our results show that paragraph embedding techniques can be used to classify automation code snippets with precision close to that of human annotation, giving an F1-score of 72%. Further, we show that such embedding techniques can help us find similar code snippets with high accuracy. Finally, we use autoencoder models for hardware recommendation and achieve a p@3 of 0.79 and a p@5 of 0.95. We also present the implementation of ArduCode in a proof-of-concept user interface integrated into an existing automation engineering system platform.
Note to practitioners: This paper is motivated by the use of artificial intelligence methods to improve the efficiency and quality of the automation engineering software development process. Our goal is to develop and integrate intelligent assistants in existing automation engineering development tools to minimally disrupt existing workflows. Practitioners should be able to adapt our framework to other tools and data. Our contributions address important practical problems: (a) we address the lack of realistic datasets in automation engineering with two publicly available data sources; (b) we make the reference implementation of our algorithms publicly available on GitHub for other practitioners to have a starting point for future research; (c) we demonstrate the integration of our framework as an add-on to an existing automation engineering toolchain.

∗ These authors contributed equally to this work.

I. INTRODUCTION

Industrial automation is undergoing a technological revolution referred to as the fourth industrial revolution [1], [2]. The first revolution was mechanization of production enabled by steam and water power. The second revolution was the mass production enabled by electricity. The third revolution was automated production enabled by electronics and information technologies. The fourth revolution is smart production enabled by recent breakthroughs in intelligent robotics, sensors, big data, advanced materials, edge supercomputing, internet of things, cyber-physical systems, and artificial intelligence. These systems are currently being integrated by software into factories, power grids, transportation systems, buildings, homes, and consumer devices. Automation engineering software (AES) integrates various sensors, actuators, and control with the purpose of automating real-world processes [3], [4]. AES development has several challenges that distinguish it from general purpose software development.
In this paper, we focus on two of the most prominent challenges: (a) AES software development is done by automation engineers, not by software experts; and (b) AES software interacts with the physical world by sampling sensors and writing outputs to actuators in well defined time intervals. These challenges have important implications in AES engineering. On the one hand, it is time consuming to develop automation code. This is partly because automation code is often not engineered for reusability. Thus, similar functionality is often developed from scratch. On the other hand, the interaction with the physical world requires automation engineers to understand the hardware configuration that defines how sensors, actuators, and other hardware are connected to the digital and analog inputs and outputs of the system. This often requires an iterative engineering approach between the hardware and the software, since any change in the hardware (e.g., a change of a component) has an impact on the software (e.g., input/output mappings), and vice versa. The tight coupling between the hardware and the software produces longer development cycles in AES.

The lifecycle of industrial automation systems is divided into two phases: engineering and runtime. Engineering refers to all activities that occur before the system is in operation. These engineering activities include hardware selection, hardware configuration, automation code development, testing, and simulation. Runtime, on the other hand, refers to all activities that occur during the system's operation. These runtime activities include control, signal processing, monitoring, prognostics, etc. Applications of artificial intelligence (AI) in industrial automation have been primarily focused on the runtime phase due to the availability of large volumes of data from sensors. For example, time series forecasting algorithms have been very successful in signal processing [5], [6].
Planning and constraint satisfaction are used in controls and control code generation [7], [8]. Anomaly detection algorithms are becoming very popular in cyber-attack monitoring [9], [10]. Probabilistic graphical models and neural networks for prognostics and health management of complex cyber-physical systems [11] have been deployed for systems such as wind [12] and gas turbines [13].

The use of machine learning in the engineering phase, on the other hand, has remained relatively unexplored. There may be several reasons for this. First, engineering data is very scarce because of its proprietary nature [14]. Second, the duration of the engineering phase is short compared to the runtime phase; some industrial automation systems are in operation for more than 30 years. Therefore, the engineering phase is often considered less critical than the runtime phase. Third, acquiring human intent and knowledge is difficult. Capturing engineering know-how in expert systems is time consuming and expensive.

This paper introduces the use of machine learning methods in AES to address three key tasks: code classification, semantic code search, and hardware recommendation. First, we demonstrate code classification on two real AES datasets. We learn representations of AES code via document embedding methods, using different artifacts such as function calls, includes, comments, tags, and the code itself. Then we train classifiers on the code embeddings to categorize code projects. Our results show that our approach captures code structure and that its performance is comparable to prediction from human annotations. Second, using the resulting code embeddings, we demonstrate a semantic code search capability for AES code capable of finding syntactically and structurally equivalent fragments of code. Third, we develop a hardware recommendation system to auto-complete partial hardware configurations. Our results show a 3× higher precision than the baselines.
The original contributions of this paper are as follows:
• The introduction of three AES tasks where AI has a big impact potential: code classification, semantic code search, and hardware recommendation.
• An unsupervised learning AES code embedding approach based on natural language processing suitable for code classification and semantic code search.
• The comparison of two hardware recommendation approaches using Bayesian Networks and Autoencoders.
• The evaluation of our AI models on two real AES datasets consisting of 2,927 Arduino projects [15] and 683 Programmable Logic Controller (PLC) programs [16].
• The ArduCode reference implementation in Python¹, and datasets for advancing AI research in automation consisting of: (i) AES source code and metadata; (ii) an expert evaluation of code structural and syntactic similarity for 50 code snippets; (iii) a manually curated silver standard for hardware recommendation systems with two levels of granularity.

¹ https://github.com/arducode-aes/arducode

This paper is organized as follows. Section II frames ArduCode in terms of the state of the art in AES and recent developments in code learning. Section III gives an overview of the two types of AES systems: industrial automation systems and maker automation systems. Section IV presents the proposed ArduCode architecture and methodology. Section V evaluates ArduCode's three AES learning tasks. Section VI describes a proof-of-concept implementation of these tasks in an AES tool. Section VII describes future directions of AI in automation engineering. Section VIII provides the concluding remarks.

II. PRELIMINARIES AND RELATED WORK

To the best of our knowledge, we are the first to investigate the use of machine learning in AES. However, there is a large body of work on AES.
In this section we motivate the use of machines to assist AES tasks, and frame our contributions relative to the state of the art in AES and related fields.

Over the last few years, manufacturing has been transforming itself from centralized mass production into distributed lot size one production. One-of-a-kind products, uniquely customized by customers, are being produced on demand. This shift is creating an unprecedented need for innovation in the engineering phase. Today, despite being short in duration (relative to the runtime phase), we have estimated that the engineering phase contributes about 50% of the total cost of automation. Thus, using AI in the engineering phase is an important technological development for lowering the cost of production. In mass production the engineering phase is done upfront (at the beginning of the lifecycle) and only once for a particular product. In distributed lot size one production, engineering is done in parallel with the runtime, as the production system must be adapted to satisfy all the variability associated with one-of-a-kind products. While flexible machines and autonomous production systems can help realize lot size one production, the engineering phase will become intertwined with the runtime phase in the future.

A. Code Classification

As production demands change rapidly, there is a need to efficiently integrate new functionality into production. Today, AES engineers invest a significant amount of time creating functional libraries to organize code according to its functionality. Publicly available examples of such libraries are the Arduino Library [17], PLCOpen [18], and OSCAT [16]. On the other hand, the majority of automation code functions are not neatly organized in libraries. In these cases, automation engineers rely on cloning code by copying and pasting [19].
Cloning code solves an immediate problem because it allows the software development to make progress, but it creates a long-term maintenance problem because engineers quickly forget what a code function does. To address this problem, the authors in [19] present a text-based system to detect code clones for IEC 61131-3 programs commonly used in PLC programming. Although they broadly identify clone classification as one of the tasks in clone analysis, their system seems to be primarily focused on identifying code clones.

In general, the lack of code classification tools in the automation domain motivates our work. This paper introduces the use of automated code classification to reduce the effort of creating and maintaining functional code libraries. AI-driven code classification can be used to organize code snippets according to their functionality. That is, code snippets can be automatically labeled according to what they do (e.g., signal processing, signal generation, robot motion control, or any organization-defined functionality). Code classification can be integrated in an engineering tool in such a way that as soon as a new function is released by an engineer, it is automatically classified into a category of a library. The engineer can be in the loop to confirm or correct the classification.

B. Semantic Code Search

Frequent reconfigurations of the production system demand a much higher degree of automation code assurance. This puts automation engineers under higher pressure to produce code that works as intended in much shorter engineering cycles. This problem can be broken down into two steps: (i) identifying potential errors, and (ii) coming up with an alternative solution to solve these errors. In the automation context, static analysis tools have been used to identify AES programming errors and defects. These defects are often referred to as code smells or technical debt indicators [20].
For example, the authors in [21] present a tool to detect issues in IEC 61131-3 programs. Their approach uses pattern matching on program structures; control-flow and data-flow analysis; and call graph and pointer analysis. Similarly, the authors in [22] present Arcade.PLC, a framework for the verification and analysis of PLC code that combines model checking and static analysis. Unfortunately, these tools are primarily focused on solving the first half of the problem. This motivates the development of new approaches to assist automation engineers in identifying alternative implementations that can fix the problems found by the static analysis tools.

Broadly, recent advances in code learning have shown that semantic code search is viable using machine learning. Code learning can be divided into two categories [23]: (i) language specific models, and (ii) language independent models. Language specific models use knowledge of the languages used in the code to generate low-dimensional representations. For example, code2vec [24] constructs an abstract syntax tree from Java code for the purpose of predicting a method's name from its content. It deconstructs the tree into several paths and learns a code embedding by aggregating the representations of these paths. func2vec [25] uses control flow graphs to generate embeddings of functions in the C language and uses these representations to detect function clones. Similarly, DeepRepair [26] uses a combination of word2vec on tokens and a recursive encoder on the abstract syntax tree for Java token embedding, and uses the representation to automatically repair programs with bugs. Several other works, such as DeepFix [27], use language specific code learning to identify bugs and programming errors in code. DeepTyper [28] uses recurrent neural networks to perform type inference in dynamically typed languages such as JavaScript and Python.
On the other hand, language independent models focus on syntactic representation learning. For example, the authors in [29] apply word2vec directly to tokens from code to learn their representations. They show that their model can help predict software vulnerabilities. The authors in [30] use a similar approach for the task of automated program repair. The authors in [31] introduce a syntactic model based on log-bilinear contexts to generate new method names using these embeddings. Such models, which do not use language syntax to learn code representations, are less widely used than language specific models and often do not perform as well. However, in this paper, we show that our proposed language independent model achieves high accuracy in automation engineering tasks.

C. Hardware Recommendation

Automation engineering is the task of integrating various hardware components in software to achieve a production goal. The selection of hardware components occurs early in the engineering process [14]. There are two tasks associated with hardware configuration: (i) the selection of a specific hardware component (e.g., temperature sensor model A); and (ii) the configuration of that hardware in terms of inputs and outputs for the automation software to interact with it. Depending on the complexity of the project, the selection of components may be done by the mechanical engineering department. However, the configuration of inputs and outputs is always done by the automation engineers. Also, note that the input and output configuration is tightly coupled to the hardware selection. An automation program written for a given hardware selection is not guaranteed to work for another hardware selection. Therefore, any hardware configuration change triggers a re-engineering process [32].

Today, AES engineers use an iterative process that is repeated several times because either the hardware was incomplete or the hardware selection was wrong.
Thus, reducing these hardware configuration iterations motivates the need for hardware recommendation systems. The goal of hardware recommendation systems is to predict a full hardware configuration from a partial hardware configuration. Recommendation or recommender systems are widely deployed in a variety of areas such as social media, video streaming, music streaming, news, dating, and consumer products [33]. Unfortunately, recommender systems in the context of automation have remained a relatively unexplored area. The authors in [14] present RESCOM, a multi-relational recommender system for an industrial purchasing system. This system assists users in selecting the hardware for complex engineering solutions based on shopping basket statistical patterns and semantic information. The focus on purchasing makes RESCOM more suitable for solving the hardware selection sub-problem. To the best of our knowledge, we are the first to propose a hardware recommender system for solving the sub-problem of hardware configuration for inputs and outputs.

III. AUTOMATION ENGINEERING SOFTWARE OVERVIEW

There are two types of automation systems: industrial automation systems and maker automation systems. This section introduces both. Despite some key differences, the most important aspect they have in common is the underlying computational model used to interact with the physical environment. Most automation systems work on a periodic task model. Every period, or cycle, is composed of three steps. The first step reads inputs from the hardware (e.g., rpm of a motor). The second step executes a task every T seconds. A task is similar to a thread and executes a user-defined automation program composed of one or more functions. The third step writes outputs to the hardware (e.g., a control signal). These three steps realize the classic closed-loop control system configuration.
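The three-step cycle described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `SimulatedMotor` plant, the proportional `speed_controller`, and the signal names `rpm` and `drive` are hypothetical, and the timer that would trigger a cycle every T seconds on a real controller is omitted so that the example stays deterministic.

```python
from typing import Callable, Dict, List

Signals = Dict[str, float]


def run_cycles(
    read_inputs: Callable[[], Signals],
    program: Callable[[Signals], Signals],
    write_outputs: Callable[[Signals], None],
    n_cycles: int,
) -> None:
    """Run the read -> execute -> write cycle n_cycles times."""
    for _ in range(n_cycles):
        inputs = read_inputs()      # step 1: sample the hardware inputs
        outputs = program(inputs)   # step 2: execute the automation program
        write_outputs(outputs)      # step 3: drive the hardware outputs


class SimulatedMotor:
    """Hypothetical plant: the speed changes by the applied drive signal."""

    def __init__(self) -> None:
        self.rpm = 0.0
        self.history: List[float] = []

    def read(self) -> Signals:
        return {"rpm": self.rpm}

    def write(self, outputs: Signals) -> None:
        self.rpm += outputs["drive"]
        self.history.append(self.rpm)


def speed_controller(inputs: Signals, setpoint: float = 100.0, gain: float = 0.5) -> Signals:
    """Proportional control: the drive signal is proportional to the speed error."""
    return {"drive": gain * (setpoint - inputs["rpm"])}


motor = SimulatedMotor()
run_cycles(motor.read, speed_controller, motor.write, n_cycles=10)
```

With these assumed values the simulated speed halves its distance to the setpoint on every cycle, which is the closed-loop behavior the periodic task model is meant to support.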
In addition to the computation and memory capabilities, an important measure to compare automation systems is the number of input/output (I/O) pins. The I/O capacity determines how many hardware devices can be wired to the automation system. I/O pins can be analog or digital. Automation systems can be interconnected to form industrial networks. Although communication can be done through the I/O pins, modern automation systems provide dedicated communication ports supporting Ethernet and industrial protocols such as Profinet [34] and OPC UA [35].

AES programming is done through an integrated development environment (IDE) referred to as the engineering system. This engineering system provides tools to support AES development including programming language editors, hardware configurators, library managers, static analysis checkers, debuggers, compilers, and build systems. Improving the engineering systems with AI is the main focus of this paper.

A. Industrial Automation Systems (IAS)

These systems control industrial processes, many of which are safety-critical. IAS are real-time and must guarantee a response within a specified timing constraint (i.e., cycle). IAS are designed to operate for decades, without downtime, in very harsh environments of extreme temperatures, pressure, humidity, and vibration. Variations of these systems are developed by different manufacturers and are targeted to different industries. For example, discrete manufacturing uses Programmable Logic Controllers (PLC), process plants use Distributed Control Systems (DCS), and power systems use Remote Terminal Units (RTU).

Over the years, IAS have been programmed using a variety of programming languages. Most of these programming languages were conceived for automation engineers, not software engineers. Some of these languages, e.g. ladder logic, adopted domain-specific and graphical notations that were used to design relay racks in manufacturing and process control.
Thus, given a syntax they were already familiar with, automation engineers were able to adopt these languages and write the AES code themselves. Today, these programming languages are standardized by the IEC 61131-3 standard [36]. Most vendors provide programming language interoperability, and an automation program can contain functions written in different languages. This provides high flexibility to the automation engineers.

B. Maker Automation Systems (MAS)

The MAS market has been enabled by low cost electronics, microcontrollers, sensors, and actuators. Makers, people who form a "do it yourself" (DIY) community and culture, have been the primary adopters of MAS. Arduino [15] is one of the most popular MAS. Its open source hardware and software are used by thousands of students and hobbyists around the world to develop DIY automation projects in robotics, home automation, entertainment, and wearables [37].

MAS today are considered non-real time. However, there are a few recent advances to bring real-time operating systems (RTOS) to Arduino [38], [39]. Despite this important difference with respect to IAS, MAS boards come with integrated analog and digital I/O pins, and follow the same computation paradigm used for automation systems. In the case of the Arduino IDE [40], automation programs are referred to as sketches. Sketches can be written in Arduino code files (INO), C, or C++. There is an extensive library of functions for working with hardware, manipulating data, and using control algorithms.

MAS code is written by hobbyists with very diverse backgrounds and skill sets. Comparing the quality of MAS and IAS code is not straightforward because they are subject to different programming environments. For example, Arduino's C/C++ environment provides the user full control over memory addressing, whereas PLC environments constrain it. However, we noticed that code cloning [19] is a common pattern in the MAS community.
The high availability of code for full projects makes it easy for hobbyists to copy and paste useful snippets. Many of these programmers give attribution to the original source through the use of comments in the code. A more in-depth analysis of MAS vs IAS code is an interesting future research direction.

IV. ARDUCODE: AUTOMATION ENGINEERING SOFTWARE LEARNING

In this section, we introduce our predictive framework for automation engineering (see Figure 1). First, we provide our data collection methodology. Then, we describe our technical approach for each of the three automation engineering tasks: (i) code classification, (ii) semantic code search, and (iii) hardware recommendation.

A. Data Collection

To validate our approach, we collected two real datasets representative of MAS and IAS. The following subsections summarize the two datasets.

1) Arduino Code: We collected the source code and textual metadata from 2,927 Arduino projects from the Arduino Project Hub [41]. The textual metadata consists of the project's category, title, abstract, tags, description, and hardware configuration. Each project is labeled with one category. In total, there are 12 categories as shown in Figure 2. We use these categories as the labels to predict in the code classification task. Makers are well known for helping and fostering collaboration in the DIY community. The documentation associated with the Arduino projects is extensive. Therefore, the project's title, abstract, tags, and description metadata provide an upper bound baseline for label classification using human annotations.

The hardware configuration is a list of components required to build the project.
[Figure 1: block diagram. Industrial (PLC) and maker (Arduino) automation code samples pass through preprocessing, feature selection, and unsupervised learning (doc2vec, TF-IDF) to produce code embeddings, which feed code classification (logistic regression, random forests) and semantic code search. Maker automation hardware configuration samples pass through manual data curation by an automation expert into a silver standard that feeds hardware recommendation (Bayesian networks, autoencoders).]
Fig. 1: AES code learning architecture.

In the 2,927 projects, there are 6,500 unique components. After manual inspection, we observed that different authors name the same component differently; e.g., "resistor 10k" and "resistor 10k ohm". To clean the data, we manually curated the hardware configuration lists and renamed the 6,500 components according to their functionality. An important contribution of this paper is the definition of two functional levels of abstraction for the hardware: level-1 and level-2 functionality. Our level-1 functionality consists of 9 categories: Actuators, Arduino, Communications, Electronics, Human Machine Interface, Materials, Memory, Power, Sensors. Our level-2 functionality further refines level-1 into a total of 45 categories:
• Actuators.{acoustic, air, flow, motor}
• Arduino.{large, medium, other, small}
• Communications.{ethernet, optical, radio, serial, wifi}
• Electronics.{capacitor, diode, relay, resistor, transistor}
• Human Machine Interface.{button, display, input, led}
• Materials.{adapter, board, screw, solder, wiring}
• Memory.{solid}
• Power.{battery, regulator, shifter, supply, transformer}
• Sensors.{accel, acoustic, camera, encoder, fluid, gps, misc, optical, photo, pv, rfid, temp}
We use these two levels of hardware configuration to validate our hardware recommendation algorithm.
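To make the two functional levels concrete, the following Python sketch maps free-text component names onto level-1 and level-2 labels. It is only an illustration: the curation described above was done manually, and the keyword table `KEYWORD_TO_CATEGORY`, the helper names, and the sample component names are hypothetical. Only the category names are taken from the text.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical keyword table: substring -> (level-1, level-2) category.
# The category names come from the text; the keywords are illustrative only.
KEYWORD_TO_CATEGORY: Dict[str, Tuple[str, str]] = {
    "resistor": ("Electronics", "resistor"),
    "capacitor": ("Electronics", "capacitor"),
    "servo": ("Actuators", "motor"),
    "temperature": ("Sensors", "temp"),
    "lcd": ("Human Machine Interface", "display"),
    "battery": ("Power", "battery"),
}


def categorize(component: str) -> Optional[Tuple[str, str]]:
    """Return (level-1, level-2) for a free-text component name, or None."""
    name = component.lower()
    for keyword, category in KEYWORD_TO_CATEGORY.items():
        if keyword in name:
            return category
    return None


def to_levels(components: List[str]) -> Tuple[List[str], List[str]]:
    """Map a component list to sorted, de-duplicated level-1 and level-2 labels."""
    level1, level2 = set(), set()
    for component in components:
        category = categorize(component)
        if category is None:
            continue  # unknown names would be left for manual review
        level1.add(category[0])
        level2.add(f"{category[0]}.{category[1]}")
    return sorted(level1), sorted(level2)


project = ["Resistor 10k", "resistor 10k ohm", "SG90 Micro-servo motor", "9V battery"]
level1, level2 = to_levels(project)
```

In this sketch the two differently spelled resistors collapse into the single label `Electronics.resistor`, which is the effect the renaming step is intended to have.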
[Figure 2: bar chart of the number of projects (logarithmic axis) per category: audio-sound, home-automation, installations, iot-wireless, lab-stuff, flying-things, gadgets-games-toys, lights-leds, motors-robotics, screens-displays, sensors-environment, wearables.]
Fig. 2: Hand-curated Level-1 Arduino Project Categories.

2) PLC Code: The OSCAT library [16] is the largest publicly available library of PLC programs. The OSCAT-LIB is vendor independent and provides reusable code functions in different categories such as signal processing (SIGPRO), geometry calculations (GEOMETRY), and string manipulation (STRINGS). These categories are extracted from the comment section of each file, marked by the line "FAMILY : X", where "X" is the category associated with that function. This line is eliminated from the dataset during training. In total, the OSCAT-LIB Basic version 3.21 contains 683 functions and 28 category labels. Figure 3 shows the label distribution. The OSCAT-LIB does not contain hardware configurations, and therefore it is only suitable for the tasks of code classification and semantic code search. All the code is written in the SCL language.

[Figure 3: bar chart of the number of programs (logarithmic axis) per OSCAT function category.]
Fig. 3: PLC function block categories.

B. Code Classification

Given a code snippet, the task of code classification is to predict its label. A label is a category associated with the code snippet, such as the level-1 and level-2 Arduino project categories, or the OSCAT library function categories. Our machine learning pipeline consists of four steps: preprocessing, feature selection, code embeddings, and classification.
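As an illustration of the label handling described for the OSCAT files (read the category from the "FAMILY : X" comment line, then remove that line before training), a minimal Python sketch might look as follows. The regular expression, the function name `split_label`, and the example source text are assumptions; the exact layout of the comment header in the library files is not specified here.

```python
import re
from typing import Optional, Tuple

# Matches a whole comment line such as "FAMILY : SIGPRO". The spacing and
# letter case tolerated here are assumptions about the file layout.
FAMILY_LINE = re.compile(r"^.*\bFAMILY\s*:\s*(\w+).*$", re.IGNORECASE | re.MULTILINE)


def split_label(source: str) -> Tuple[Optional[str], str]:
    """Return (label, source with the FAMILY line removed)."""
    match = FAMILY_LINE.search(source)
    if match is None:
        return None, source
    label = match.group(1).upper()
    start, end = match.span()
    # Also drop the newline that terminated the FAMILY line, if any.
    if end < len(source) and source[end] == "\n":
        end += 1
    return label, source[:start] + source[end:]


# Hypothetical file content, used only to exercise the function.
example = (
    "(* FUNCTION : hypothetical low-pass filter\n"
    "   FAMILY : SIGPRO\n"
    "*)\n"
    "FUNCTION_BLOCK FILTER\n"
    "END_FUNCTION_BLOCK\n"
)
label, cleaned = split_label(example)
```

Removing the line matters because leaving it in would leak the target label into the text that is later embedded.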
1) Preprocessing: We preprocess the automation projects and code snippets to expose the various features shown in Table I. The Arduino dataset contains more features than the PLC dataset. Therefore, not all features are available in the PLC dataset. For example, PLC code does not have includes, and project data such as tags, title, descriptions, and components is not available.

TABLE I: Features exposed by preprocessing.

Feature        Arduino   PLC   Description
Includes       yes       –     C/C++ includes
Functions      yes       yes   Function names
Comments       yes       yes   Comments in code
Tokens         yes       yes   All code tokens
Code           yes       yes   Code keywords
LOC            yes       yes   Lines of code
Tags           yes       –     Project tags
Title          yes       –     Project title
Descriptions   yes       –     Project descriptions
Labels         yes       yes   Labels to predict
Components     yes       –     Hardware configuration

2) Feature Selection: The purpose of feature selection is to provide the ArduCode framework with a feature space exploration mechanism to compare the performance of different code representations in the tasks of code classification and semantic code search. The quality of the code embeddings is expected to vary according to the provided features. Therefore, the feature selection generates different experiments by combining different sets of features. For example, code can be represented by different combinations of includes, functions, comments, tokens, and keywords. Code documentation can be represented by combinations of tags, titles, and descriptions. Alternatively, code representations and code documentation features can be combined to generate richer feature vectors for the code embeddings.

3) Code Embeddings: In machine learning, the projection of a high-dimensional input into a lower-dimensional representation space is called an embedding. The next step is to embed the textual representations generated by the feature selection into a learned vector representation suitable for classification.
We compare the performance of the embeddings generated by gensim doc2vec [42] with the embeddings generated by term frequency-inverse document frequency (tf-idf). Doc2vec is a state-of-the-art method for distributed representations of sentences and documents (here, tokens and code). Tf-idf, on the other hand, is a statistical measure that evaluates how relevant a word is to a document (again, tokens and code). Tf-idf is typically used as a baseline for validating other machine learning models. Both doc2vec and tf-idf generate an n-dimensional vector representation of the input. The doc2vec hyperparameters of interest are the embedding dimension and the training algorithm (distributed memory or distributed bag of words). We run all our experiments with a negative sampling value of 5; this draws 5 noise words to help the model differentiate the data from noise.

4) Classification: The final step is to train a supervised model for code classification using the code embeddings as the input samples and the code labels as the target values. We compare the performance of logistic regression and random forest classifiers using the F1-score metric. Logistic regression and random forest are two of the main workhorses for supervised learning classification. The F1-score is the harmonic mean of the precision and recall. An F1-score of 1 represents perfect precision and recall.

C. Semantic Code Search

Given a code snippet, the goal of semantic code search is to find similar programs. For automation engineering, similarity is defined in terms of syntax and structure. Syntax similarity helps engineers find useful functions in a given context, and structural similarity informs engineers about how other automation solutions have been engineered. As shown in Figure 1, semantic code search uses the code embeddings generated by doc2vec to identify similar documents. Doc2vec attempts to bring similar documents close to each other in the embedding space.
For a given code embedding of a code snippet, the nearest neighbors are expected to be similar, and the distance between embeddings is used as the basis for our semantic code search. While this approach is intuitive for syntactically similar documents, it is unclear whether functional structure is captured in the embeddings.

D. Hardware Recommendation

Given a partial list of hardware components, the task of hardware recommendation is to predict other hardware components that are typically used in combination with the partial list. The hand-curated silver standard described above is used to learn the joint probability distribution of the hardware components. The hardware recommendation task is then to compute the conditional probability of the missing hardware components given a partial list of components.

We compare two approaches for the hardware recommendation task. Our baseline consists of the predictions given on random hardware configurations. First, we learn a Bayesian network where the random variables are the hardware component categories. A Bayesian network is a probabilistic graphical model that represents a set of variables (e.g., hardware components) and their conditional dependencies (e.g., an 80% chance that a sensor of type X is connected to a controller of type Y) via a directed acyclic graph. We use the Pomegranate Python package [43] to learn the structure of the Bayesian network and fit the model with 70% of the hardware configuration data. The Bayesian network for level-1 components consists of 9 nodes and the one for level-2 components consists of 45 nodes. However, we were only able to fit the level-1 Bayesian network because the time needed to initialize the network grows super-exponentially with the number of variables. Typically, this cannot be done with more than two dozen variables, and the level-2 hardware configuration consists of 45 variables.
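One way to see why structure learning becomes impractical as the number of variables grows is to count the candidate structures. The sketch below uses Robinson's recurrence for the number of labeled directed acyclic graphs on n nodes. It illustrates the size of the search space only and is not a description of how Pomegranate searches it.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes.

    Robinson's recurrence:
        a(0) = 1
        a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k)
    """
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


# Number of decimal digits in the structure count for the two network
# sizes mentioned in the text (9 level-1 nodes, 45 level-2 nodes).
digits_level1 = len(str(count_dags(9)))
digits_level2 = len(str(count_dags(45)))
```

The count already reaches 29,281 for five variables and grows far faster than exponentially after that, so exhaustive or exact search over 45 variables is out of reach regardless of implementation details.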
To overcome this limitation, we use an autoencoder implemented in Keras [44]. An autoencoder is a neural network trained without supervision to learn a function that reconstructs its input. In our autoencoder, the encoder learns a lower-dimensional representation of the hardware configuration data, and the decoder learns to reconstruct the original input from that lower-dimensional representation. Overfitting occurs when a statistical model is tailored to a dataset and is unable to generalize to other datasets. To avoid overfitting in the autoencoder, we use L1 and L2 regularizers.

V. RESULTS

This section evaluates the performance of the three AES tasks in ArduCode: code classification, semantic code search, and hardware recommendation. Our results establish the baselines for these tasks in the AES domain.

A. Code Classification

First, we established the lower and upper bounds for code label classification. The lower bound is given by training the code label classifier using random embeddings. The upper bound is given by training the code label classifier using human annotations. The Arduino dataset provides human annotations in the form of tags and descriptions that can be combined in three configurations: tags, descriptions, and descriptions+tags. We first embed these three configurations using tf-idf and doc2vec, and compare the label classification performance using the F1-score. As shown in Figure 4, doc2vec yields better performance than tf-idf. The embedding dimension for doc2vec was set to 50, and the tf-idf models generated embedding dimensions of 1,469 for tags, 66,310 for descriptions, and 66,634 for descriptions+tags. Intuitively, the descriptions+tags configuration provides the upper bound, with an F1-score of 0.8213.

Fig. 4: Human annotation prediction performance (F1-score for tags, descriptions, and descriptions+tags; tf-idf vs. doc2vec).
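The tf-idf baseline compared in Figure 4 can be illustrated with a small, dependency-free sketch. It uses the textbook weighting tf × log(N / df). This is an assumption, since the paper does not state which tf-idf variant or library was used, and common libraries apply smoothed variants. The sample documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    Each vector has one entry per vocabulary term:
        tf(term, doc) * log(N / df(term))
    where tf is the relative term frequency in the document, N the number
    of documents, and df the number of documents containing the term.
    """
    vocabulary = sorted({tok for doc in documents for tok in doc})
    n_docs = len(documents)
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            (counts[tok] / len(doc)) * math.log(n_docs / doc_freq[tok])
            if doc else 0.0
            for tok in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical tag lists for three projects.
docs = [
    ["servo", "motor", "loop"],
    ["led", "loop"],
    ["servo", "led", "led"],
]
vocab, vectors = tfidf_vectors(docs)
```

The length of every vector equals the vocabulary size. This is consistent with the tf-idf dimensions reported above being in the thousands (for example 1,469 for tags) while the doc2vec dimension is fixed at 50.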
Figure 5 compares the performance of a logistic regression classifier (LR) from scikit-learn [45] and a random forest classifier (XGB) from XGBoost [46] using the 50-dimensional doc2vec embeddings of tags, descriptions, and descriptions+tags. Our choice of these libraries was driven purely by the convenience of their software interfaces and their popularity in the machine learning community. An implementation using alternative libraries should yield results similar to ours. Our results show that the LR classifier provides significantly better performance than XGB.

Fig. 5: Performance variation with respect to the classifier (F1-score for tags, descriptions, and descriptions+tags; LR vs. XGB).

We establish the lower bound by generating 50-dimensional random embeddings and predicting the labels using the LR classifier. Figure 6 shows that with both tf-idf and doc2vec the lower bound is 0.3538.

After establishing the upper and lower bounds, we use different code features to predict labels. Figure 6 shows that embedding includes and functions provides only slightly better performance than the random baseline because these features contain very little information: 1.82 includes and 4.70 functions on average. Other code features improve the classification accuracy significantly. For example, tokens and code are similar representations and give similar F1-scores of 0.63 and 0.67. These results also show that comments contain valuable information that can be used to predict the code label with a score of 0.67. Embedding code+comments and code+titles yields the highest F1-scores of 0.71. These results show that the prediction performance of code feature embeddings is comparable to that of human annotation embeddings, with improvements of 2.03× and 2.32× over the random baseline, respectively.
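The two quantities used in these comparisons, the F1-score and the improvement factor over the random baseline, reduce to short formulas. The sketch below shows both. The precision and recall values in the example are made up for illustration, whereas 0.8213 and 0.3538 are the upper and lower bounds reported above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def improvement(score, baseline):
    """Multiplicative improvement of a score over a baseline score."""
    return score / baseline


# Illustrative values only: precision 0.75 and recall 0.60.
example_f1 = f1_score(0.75, 0.60)

# Upper bound (descriptions+tags) relative to the random lower bound.
upper_bound_gain = improvement(0.8213, 0.3538)
```

The sketch covers a single precision/recall pair. For a multi-class problem such as code label prediction, the per-class scores still have to be averaged, and the paper does not state which averaging scheme was used.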
Fig. 6: Code label prediction using code (F1-score for random, includes, functions, title, tokens, comments, code, code+comments, and code+titles; tf-idf vs. doc2vec).

The classifier's confusion matrix for code feature embeddings using doc2vec is shown in Figure 7. Although the matrix is diagonally dominant, the dataset is imbalanced in terms of the number of samples per class, as shown in Figure 2. In particular, the classifier's poor performance in the categories flying-things, installations, and wearables corresponds to their small number of samples in the dataset.

Fig. 7: ArduCode classifier's confusion matrix.

Since the PLC dataset does not have any human annotation features, we can only compare the performance of code feature embeddings against the random baseline. The F1-score for the code embeddings is 0.9024 and for the random baseline is 0.2878, a 3.13× improvement. Compared to the Arduino dataset, the PLC dataset has fewer samples (683 vs 2,927), more category labels (28 vs 12), and fewer lines of code per file on average (55 vs 177). These are three factors that influence the higher prediction accuracy of ArduCode on the PLC dataset.

B. Semantic Code Search

To validate the quality of our code embeddings, we randomly sampled 50 Arduino code snippets and tasked a group of 6 software engineers with scoring the similarity of each code snippet to its top-3 nearest neighbors. For every code snippet pair, two similarity ratings are given: one for code syntax and one for code structure. A rating of 1 represents similarity, and a rating of 0 represents the lack of similarity. Code syntax refers to the use of similar variable and function names. Code structure refers to the use of similar code arrangements such as if-then-else and for loops. In addition, every rating has an associated confidence score from 1 (lowest confidence) to 5 (highest confidence) that represents the expert's self-assurance during the evaluation.
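Agreement among several raters who assign categorical ratings, such as the binary similarity ratings in this protocol, is commonly summarized with Fleiss' kappa. A minimal sketch follows. It assumes that every item is rated by the same number of raters, which may not hold once low-confidence ratings are filtered out, and the example counts are invented.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of rating counts.

    ratings[i][j] is the number of raters who assigned item i to
    category j. Every row must sum to the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(ratings[0])

    # Observed agreement: mean fraction of agreeing rater pairs per item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_item) / n_items

    # Agreement expected by chance from the overall category proportions.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_exp) / (1.0 - p_exp)


# Columns: [rated "not similar" (0), rated "similar" (1)], 6 raters per item.
perfect = fleiss_kappa([[6, 0], [0, 6]])
split = fleiss_kappa([[3, 3], [3, 3]])
```

Unanimous ratings give a kappa of 1, while evenly split ratings give a slightly negative value. This is why kappa is reported alongside the average similarity scores rather than in place of them.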
During the expert evaluation we eliminated 5 out of 50 samples where one of the top-3 nearest neighbors was either an empty file or contained code in a different programming language. We found three code snippets written in Python and JavaScript.

Table II shows the average code syntax and code structure similarity scores given by experts. We only report the high-confidence ratings (avg. confidence ≥ 4.5) in order to eliminate the influence of uncertain answers. We also measure the experts' agreement via Fleiss' kappa. These results show that the similarity scores for both syntax and structure are high for the top-1 neighbors (0.68 and 0.61 respectively) but drop significantly (under 0.50) for the top-2 and top-3 neighbors. The experts are in substantial agreement (0.61 ≤ κ ≤ 0.80) in their syntax similarity scores, and in moderate agreement (0.41 ≤ κ ≤ 0.60) in their structure similarity scores. While these results confirm that doc2vec code embeddings capture syntactic similarity, they also show that some structural similarity is captured in the top-1 neighbor.

After their individual assessments, the experts gathered as a group to discuss their findings. It quickly became clear that the experts' backgrounds contributed to the deviation in the ratings. The half of the experts with an automation background had additional insights that made them more confident and congruent in their structural similarity assessments. The other half, without an automation background, were less confident and congruent in their ratings. Two of the three non-automation experts expressed that additional context from the automation domain would have made their assessments more confident and congruent.

TABLE II: Average code structure and code syntax similarity and Fleiss' kappa values for high-confidence raters.

Similarity   Top 1 (κ)     Top 2 (κ)     Top 3 (κ)
Syntax       0.61 (0.75)   0.48 (0.70)   0.32 (0.61)
Structure    0.68 (0.53)   0.40 (0.44)   0.33 (0.66)

To gain further insight into our experiment, we selected four similar and three not similar code snippets, and measured the cosine similarity of their embeddings, as shown in Table III. The selected code snippets have strong agreement among the experts and high confidence in the similarity, or lack of similarity, across the top-3 nearest neighbors. These results confirm that the code snippets considered very similar by the experts are close to each other in the embedding space. On the other hand, code snippets considered not similar are far apart in the embedding space.

TABLE III: Code embedding cosine similarity for similar and not similar code snippets.

Snippet   Top 1    Top 2    Top 3
Similar code snippets
#2696     0.8768   0.7527   0.7642
#547      0.8719   0.8705   0.8506
#2815     0.9465   0.9445   0.9126
#54       0.8513   0.8056   0.7815
Not similar code snippets
#4512     0.5967   0.5497   0.5643
#4345     0.5415   0.4175   0.5192
#1730     0.5970   0.5035   0.5511

Figure 8 shows two similar Arduino code snippets (#54 and #2689) produced by ArduCode. There are similarities between these two programs at different levels. First, all Arduino programs are required to have the setup() and loop() functions to initialize the program and to specify the control logic executed on every cycle. Syntactically, the two programs use the same standard functions: pinMode() to configure the Arduino board pins where the hardware connects as inputs or outputs; analogRead() to read an analog value from a pin; Serial.print() to print data as ASCII text via the serial port; delay() to pause the program for the amount of time (in ms) specified by the parameter; and analogWrite() to write an analog value to a pin.
Semantically, the two programs read sensor values (only 1 value in #54 and 3 values in #2689), scale the sensor value to a range (from 300-1024 to 0-255 using map() in #54 and via (x + 100) / 4 in #2689), print the scaled sensor value via the serial port, write the analog value to an LED (a single LED in #54 and three LEDs in #2689), and pause the program (10 ms in #54 and 100 ms in #2689). Note that the order in which these operations are scheduled is different in the two programs. Functionally, the two programs perform the same task of creating a heatmap for a sensor value using LEDs. While there are some syntactic similarities, our semantic code search is also able to capture semantic and structural similarities.

Code Snippet #54:

    void setup() {
      Serial.begin(9600);
      pinMode(A2,OUTPUT);
      pinMode(A3,OUTPUT);
      pinMode(A4,OUTPUT);
      pinMode(A0,INPUT);
    }
    void loop() {
      sensorval = analogRead(A0);
      led = map(sensorval, 300,1024,0,255);
      Serial.println(led);
      delay(10);
      analogWrite(A2,led);
      analogWrite(A3,led);
      analogWrite(A4,led);
    }

Code Snippet #2689:

    void setup() {
      Serial.begin(9600);
      pinMode(led1, OUTPUT);
      pinMode(led2, OUTPUT);
      pinMode(led3, OUTPUT);
    }
    void loop() {
      int senzorNivo1 = analogRead(A0);
      int senzorNivo2 = analogRead(A1);
      int senzorNivo3 = analogRead(A2);
      osvetljaj1 = (senzorNivo1 + 100) /4;
      osvetljaj2 = (senzorNivo2 + 100) /4;
      osvetljaj3 = (senzorNivo3 + 100) /4;
      analogWrite(led1, 255 - osvetljaj1);
      analogWrite(led2, 255 - osvetljaj2);
      analogWrite(led3, 255 - osvetljaj3);
      Serial.print(senzorNivo1);
      Serial.print(" ");
      Serial.print(senzorNivo2);
      Serial.print(" ");
      Serial.println(senzorNivo3);
      delay(100);
    }

Fig. 8: Arduino semantic code search result. The figure's key highlights syntax similarity, sensor read, sensor scaling, and time control.

C. Hardware Recommendation

In hardware recommendation, we are interested in recommending the top-k hardware components. Therefore, we evaluate our models in terms of precision@k.
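A leave-one-out evaluation of a top-k recommender can be sketched as follows. The sketch assumes that a test case counts as a hit when the single held-out component appears among the first k recommendations, which is one reading of the protocol used in this section. The popularity-based recommender, component names, and test cases are all hypothetical.

```python
def precision_at_k(cases, recommend, k):
    """Fraction of test cases whose held-out component is in the top k.

    cases:     list of (partial_configuration, held_out_component) pairs
    recommend: callable mapping a partial configuration to a ranked list
               of candidate components, best first
    """
    if not cases:
        return 0.0
    hits = sum(
        1 for partial, held_out in cases if held_out in recommend(partial)[:k]
    )
    return hits / len(cases)


# Hypothetical global ranking used by a trivial stand-in recommender.
POPULARITY = ["sensor", "led", "motor", "display"]


def popularity_recommender(partial):
    """Recommend components not yet in the configuration, most popular first."""
    return [c for c in POPULARITY if c not in partial]


cases = [
    (["led"], "sensor"),
    (["sensor"], "motor"),
    (["sensor", "led"], "display"),
]
p_at_1 = precision_at_k(cases, popularity_recommender, k=1)
p_at_2 = precision_at_k(cases, popularity_recommender, k=2)
```

Any model that returns a ranked list, including a Bayesian network or an autoencoder, can be passed in as `recommend`, so the same loop can score all the approaches on equal terms.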
Precision@k is the proportion of recommended hardware components in the top-k set that are relevant. For each hardware configuration in the test data, we leave one hardware component out and measure its precision@k. Table IV shows the results for the random baseline, the Bayesian network, and the autoencoder. As expected, the performance of the random baseline improves linearly from p@1 = 0.10, p@3 = 0.32, and p@5 = 0.54 to p@9 = 1 for the level-1 hardware predictions. The Bayesian network also improves linearly from p@1 = 0.32 to p@3 = 0.59 and p@5 = 0.79. The autoencoder provides both the best performance and the best improvements, from p@1 = 0.36 to p@3 = 0.79 and p@5 = 0.95. Note that the autoencoder's p@3 matches the Bayesian network's p@5 of 0.79. Furthermore, the autoencoder achieves > 0.95 precision at p@5, and the Bayesian network at p@8.

TABLE IV: p@k results for level-1 hardware predictions.

p@k   Random Baseline   Bayesian Network   Autoencoder
p@1   0.10              0.32               0.36
p@3   0.32              0.59               0.79
p@5   0.54              0.79               0.95
p@9   1.00              1.00               1.00

Learning a Bayesian network for level-2 hardware components is computationally infeasible. Therefore, we rely on an autoencoder to accomplish this task, and the p@k results are reported in Table V. The overall p@k performance of the autoencoder for level-2 is comparatively lower than for level-1. The reason is that the level-2 hardware configuration is sparser than level-1. On average, level-2 configurations have 4/40 components and level-1 configurations have 4/12 components. However, the improvement over the random baseline is 10× for p@1, 5× for p@3, 4× for p@5, and 3× for p@10.

TABLE V: p@k results for level-2 hardware predictions.

p@k    Random Baseline   Autoencoder
p@1    0.02              0.21
p@3    0.06              0.34
p@5    0.11              0.45
p@10   0.21              0.69

VI. PROOF OF CONCEPT IMPLEMENTATION - COGNITIVE AUTOMATION ENGINEERING SYSTEM

This section presents our experience in developing a proof-of-concept system that attempts to transition this research into an application. Our goal is to explore concrete implementation ideas to assist the AES engineer during the development process in such a way that the AI-driven methods do not disrupt the original AES workflows. Figure 9 shows our implementation.

The system is centered around an engineering knowledge base, which is built in two complementary ways: by analyzing past engineering projects, and by direct input and curation of engineering know-how by experts. We leverage semantic web technologies and represent the engineering knowledge in an RDF triple store. Engineers with sufficient expertise can edit the graph directly, and we aim to develop interfaces that simplify the formalization of engineering knowledge by domain experts. Note that the past engineering projects and expert know-how are specific to the organization where the system would be deployed, and therefore ArduCode's models would reflect that prior expertise. Leveraging data from past projects is a very powerful approach because it allows the integration of existing libraries and unstructured repositories of functions into ArduCode's learning process. Thus, code classification, semantic code search, and hardware recommendation can be continuously improved over time as more data becomes available from the regular engineering process. On the other hand, direct expert input allows the resolution of issues when ArduCode's confidence in a prediction is low, and direct expert curation can enable the improvement of ArduCode's models over time.

We use an existing AES engineering system as a platform (see Figure 9(a)). This approach has multiple advantages.
First, an existing AES Engineering System provides well-known graphical user interfaces and workflows that engineers are familiar with. Second, we can leverage the APIs and open interfaces of the AES Engineering System to push and pull data from the current and other automation projects. Pulling data is critical for building a robust knowledge base for the AI methods to learn from. Pushing data is equally critical because it allows the AI methods to assist the engineer within the well-established workflows and interfaces. Third, it provides a development environment for plug-ins where novel concepts for AI-assistant user interfaces can be developed and tested with real users.

Fig. 9: Cognitive Automation Engineering System. (a) AES Engineering System Platform; (b) Cognitive Automation Engineering System; (c) Simulation and Validation. Callouts in the figure mark the hardware configuration recommendation, the semantic code search recommendations, the code classification output, and the AI knowledge base.

The proof-of-concept user interface plug-in that we developed is shown in Figure 9(b). The goal of this user interface is to provide the AES engineer with an assistant that continuously monitors the state of the AES project and provides recommendations from code classification, semantic code search, and hardware recommendation. The assistant system reads the user's automation project, which contains the hardware configuration and program code. Multiple analysis components are then executed on the data and generate recommendations based on the available information. As the user edits configurations or automation code, the code classifier, semantic code search, and hardware recommendation are invoked, and the user interface collects the results of the analyses and presents them to the user in an organized manner based on the context of the corresponding engineering task. In this user interface, the main interaction occurs through questions and recommendations.
The AI-assistant presents questions to the user, which the user can answer to provide new information to the system. Questions to the user are triggered by information known to be missing in the knowledge model. For instance, if an analysis depends on the value of an attribute of the project (e.g., safety integrity level), and this attribute is missing in the current project, then the assistant will ask the user for the value of this attribute (e.g., "What is the safety integrity level of this project?"). Based on the information gathered from the engineering tool and from user answers, the assistant generates recommendations that the user either accepts or rejects. When the user provides the project's SIL level as an answer, the system may trigger additional analyses using this information (e.g., determine hardware to recommend for the required safety level, classify the code in the automation project to separate the safety and non-safety code and reorganize the project structure accordingly, or find library code suitable for the safety application and recommend this code to the user). When the user accepts a recommendation, an action is taken and the project is modified accordingly. When the user rejects or ignores a recommendation, the system simply continues its operation. Note that these interactions are also useful for refining the AI models: they can be used in the future as data for a lifelong learning system that self-improves over time.

To keep the user informed about what the AI methods do, we provide an interface to the AI knowledge base in graphical form. We use this view to highlight the nodes and edges related to a given recommendation.

Just as in the traditional AES development process, validation through simulation plays an important role. We show that the simulation and validation loop can be tightly coupled to the proposed Cognitive Automation Engineering System.
After a recommendation is accepted and the change is pushed to the AES Engineering System, the change can be validated in simulation. Figure 9(c) shows a production line simulation that the AES project controls. This implementation shows that AI-assistants can be non-disruptive to the existing AES development process.

The assistant system has been used for empirical evaluation of the approach in a lab setting. Our plan is to evaluate this tool in a real production environment.

VII. DISCUSSION

ArduCode is the starting point for predictive automation engineering tasks. This section discusses its known limitations and motivates several directions for future work.

1) Capturing code structure: While doc2vec captures some structural code similarity, as confirmed by our set of experts, there are other recent approaches such as code2vec that are likely to better capture code structure and improve the code classification and semantic code search tasks. However, parsing C++ code requires full access to all the library dependencies. In the case of most Arduino programs, resolving the include paths requires a major manual effort. In the future, we expect to develop some automation to generate the abstract syntax trees and embed the code using models such as code2vec.

2) Hardware-software gap: ArduCode's hardware recommendation is limited to hardware components. This task would be even more useful if it incorporated software elements such as library or API recommendations. We obtained poor results with a supervised learning approach using the hardware configuration as the input samples, and includes and function names as the target values. Without context such as descriptions or titles, this is a hard task even for experts because the software references hardware only through non-descriptive variables (e.g., A0, A1).
To bridge this gap, one promising idea is to model software elements as random variables in the Bayesian network and use expert know-how to define their conditional probabilities. Another potential direction is to take into consideration wiring diagrams that describe how the hardware components are connected to the controller, and to determine the connection to the software based on this information. However, these diagrams are often not synchronized with the software view and could introduce many inconsistencies. A second challenge is that these diagrams are often created using ad-hoc methods by the software developers, and their syntax and semantics are not consistent.

3) Continuous integration of new knowledge: In the current architecture, ArduCode's model is updated via re-training. This means that new knowledge is not automatically integrated into ArduCode's knowledge base; it is manually integrated through a re-training step. Re-training takes time and requires a machine learning expert. However, with the high availability of cloud computing and graphics processing units (GPUs), we do not anticipate re-training being a bottleneck for the automation domain. An interesting direction for future work is to extend ArduCode into a lifelong learning architecture [47]. Lifelong learning is the ability of a machine learning system to sequentially retain learned knowledge and to transfer that knowledge over time when learning new tasks in order to improve its capabilities. Such an approach would continuously integrate new knowledge as it becomes available, instead of being limited to discrete re-training steps.

VIII. CONCLUSION

In this paper, we introduced and studied three automation engineering predictive tasks. First, we showed that our code classification approach based on doc2vec code embeddings and logistic regression achieves F1-scores of 72% and 90% on two real datasets.
Second, a group of 6 experts validated the semantic code search task by assessing the syntax and structure similarity of 50 code snippets. Third, we demonstrated a p@3 of 79% and a p@5 of 95% for the hardware recommendation task using an autoencoder. Additionally, we implemented these tasks in a proof-of-concept implementation of a cognitive automation system. This system has been used for empirical evaluation of ArduCode in a laboratory setting.

Future research directions are as follows. One is to evaluate ArduCode's doc2vec approach against recent approaches such as code2vec that are likely to better capture code structure and improve the code classification and semantic code search tasks. In addition, ArduCode's hardware recommendation is limited to hardware components. This task would be even more useful if it incorporated software elements such as library or API recommendations. One promising idea is to model software elements as random variables in the Bayesian network and use data mining techniques on existing projects to define their conditional probabilities.

ACKNOWLEDGEMENTS

We thank Evan Patterson, Jade Master, and Georg Muenzel for their valuable input and discussions.

REFERENCES

[1] M. Hermann, T. Pentek, and B. Otto, "Design principles for industrie 4.0 scenarios," in 2016 49th Hawaii International Conference on System Sciences (HICSS), pp. 3928–3937, 2016.
[2] R. Drath and A. Horch, "Industrie 4.0: Hit or hype? [industry forum]," IEEE Industrial Electronics Magazine, vol. 8, no. 2, pp. 56–58, 2014.
[3] V. Vyatkin, "Software engineering in industrial automation: State-of-the-art review," IEEE Transactions on Industrial Informatics, vol. 9, no. 3, pp. 1234–1249, 2013.
[4] B. Vogel-Heuser, C. Diedrich, A. Fay, S. Jeschke, S. Kowalewski, M. Wollschlaeger, and P. Göhner, "Challenges for software engineering in automation," Journal of Software Engineering and Applications, vol. 7, no. 5, 2014.
[5] Shyh-Jier Huang and Kuang-Rong Shih, "Short-term load forecasting via arma model identification including non-gaussian process considerations," IEEE Transactions on Power Systems, vol. 18, no. 2, pp. 673–679, 2003.
[6] D. Tulone and S. Madden, "Paq: Time series forecasting for approximate query answering in sensor networks," European Workshop on Wireless Sensor Networks, pp. 21–37, 2006.
[7] Steven M. LaValle, Planning Algorithms. Cambridge University Press, 2006.
[8] M. Dogar, A. Spielberg, S. Baker, and D. Rus, "Multi-robot grasp planning for sequential assembly operations," Autonomous Robots, vol. 43, pp. 649–664, 2019.
[9] C. Feng, T. Li, and D. Chana, "Multi-level anomaly detection in industrial control systems via package signatures and lstm networks," in 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 261–272, 2017.
[10] C. Ten, J. Hong, and C. Liu, "Anomaly detection for cybersecurity of the substations," IEEE Transactions on Smart Grid, vol. 2, no. 4, pp. 865–873, 2011.
[11] K. R. McNaught and A. Zagorecki, "Using dynamic bayesian networks for prognostic modelling to inform maintenance decision making," in 2009 IEEE International Conference on Industrial Engineering and Engineering Management, pp. 1155–1159, 2009.
[12] G. de Novaes Pires Leite, A. M. Araujo, and P. A. C. Rosas, "Prognostic techniques applied to maintenance of wind turbines: a concise and specific review," Renewable and Sustainable Energy Reviews, vol. 81, no. 2, pp. 1917–1925, 2018.
[13] Y. G. Li and P. Nilkitsaranont, "Gas turbine performance prognostic for condition-based maintenance," Applied Energy, vol. 86, no. 10, pp. 2152–2161, 2009.
[14] M. Hildebrandt, S. S. Sunder, S. Mogoreanu, I. Thon, V. Tresp, and T. Runkler, "Configuration of industrial automation solutions using multi-relational recommender systems," in Machine Learning and Knowledge Discovery in Databases, pp. 271–287, Springer International Publishing, 2019.
[15] Y. A. Badamasi, "The working principle of an arduino," in 2014 11th International Conference on Electronics, Computer and Computation (ICECCO), pp. 1–4, Sep. 2014.
[16] OSCAT, "Open source community for automation technology." http://www.oscat.de/, 2020.
[17] Arduino, "Arduino Libraries." https://www.arduino.cc/en/reference/libraries, 2020.
[18] PLCOpen. https://plcopen.org/, 2020.
[19] H. Thaller, R. Ramler, J. Pichler, and A. Egyed, "Exploring code clones in programmable logic controller software," in 2017 22nd IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), pp. 1–8, 2017.
[20] S. Bougouffa, Q. H. Dong, S. Diehm, F. Gemein, and B. Vogel-Heuser, "Technical Debt indication in PLC Code for automated Production Systems: Introducing a Domain Specific Static Code Analysis Tool," IFAC (International Federation of Automatic Control), vol. 51, no. 10, pp. 70–75, 2018.
[21] H. Prähofer, F. Angerer, R. Ramler, and F. Grillenberger, "Static Code Analysis of IEC 61131-3 Programs: Comprehensive Tool Support and Experiences from Large-Scale Industrial Application," IEEE Transactions on Industrial Informatics, vol. 13, pp. 37–47, Feb 2017.
[22] S. Biallas, S. Kowalewski, S. Stattelmann, and B. Schlich, "Static Analysis of Industrial Controller Code using Arcade.PLC." https://cs.au.dk/~amoeller/tapas2014/tapas2014_1.pdf, 2014.
[23] Z. Chen and M. Monperrus, "A literature study of embeddings on source code," arXiv preprint arXiv:1904.03061, 2019.
[24] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, "code2vec: Learning distributed representations of code," Proceedings of the ACM on Programming Languages, vol. 3, no. POPL, p. 40, 2019.
[25] D. DeFreez, A. V. Thakur, and C. Rubio-González, "Path-based function embedding and its application to error-handling specification mining," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2018, pp. 423–433, 2018.
[26] M. White, M. Tufano, M. Martinez, M. Monperrus, and D. Poshyvanyk, "Sorting and transforming program repair ingredients via deep learning code similarities," in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 479–490, IEEE, 2019.
[27] R. Gupta, S. Pal, A. Kanade, and S. Shevade, "Deepfix: Fixing common c language errors by deep learning," in AAAI Conference on Artificial Intelligence, pp. 1345–1351, AAAI, 2017.
[28] V. J. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis, "Deep learning type inference," in ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1345–1351, 2018.
[29] J. A. Harer, L. Y. Kim, R. L. Russell, O. Ozdemir, L. R. Kosta, A. Rangamani, L. H. Hamilton, G. I. Centeno, J. R. Key, P. M. Ellingwood, et al., "Automated software vulnerability detection with machine learning," arXiv preprint arXiv:1803.04497, 2018.
[30] Z. Chen and M. Monperrus, "The remarkable role of similarity in redundancy-based program repair," arXiv preprint, 2018.
[31] M. Allamanis, E. T. Barr, C. Bird, and C. Sutton, "Suggesting accurate method and class names," in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pp. 38–49, ACM, 2015.
[32] S. Feldmann, S. J. I. Herzig, K. Kernschmidt, T. Wolfenstetter, D. Kammerl, A. Qamar, U. Lindemann, H. Krcmar, C. J. J. Paredis, and B. Vogel-Heuser, "A comparison of inconsistency management approaches using a mechatronic manufacturing system design case study," in 2015 IEEE International Conference on Automation Science and Engineering (CASE), pp. 158–165, 2015.
[33] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutierrez, "Recommender systems survey," Knowledge-Based Systems, vol. 46, pp. 109–132, 2013.
[34] J. Feld, "Profinet - scalable factory communication for all applications," in IEEE International Workshop on Factory Communication Systems, 2004. Proceedings., pp. 33–38, 2004.
[35] T. Hannelius, M. Salmenpera, and S. Kuikka, "Roadmap to adopting opc ua," in 2008 6th IEEE International Conference on Industrial Informatics, pp. 756–761, July 2008.
[36] International Electrotechnical Commission (IEC), "IEC 61131-3 2013 Standard," 2020.
[37] M. Banzi and M. Shiloh, Getting started with Arduino: the open source electronics prototyping platform. Maker Media, 3 ed., 2014.
[38] B. Greiman, "FreeRTOS-Arduino." https://github.com/greiman/FreeRTOS-Arduino, 2019.
[39] P. Buonocunto, A. Biondi, M. Pagani, M. Marinoni, and G. Buttazzo, "Arte: Arduino real-time extension for programming multitasking applications," in Proceedings of the 31st Annual ACM Symposium on Applied Computing, SAC '16, pp. 1724–1731, ACM, 2016.
[40] Arduino, "Arduino IDE." https://www.arduino.cc/en/main/software, 2020.
[41] Arduino, "Arduino Project Hub." https://create.arduino.cc/projecthub, 2020.
[42] Q. Le and T. Mikolov, "Distributed representations of sentences and documents," in International Conference on Machine Learning, pp. 1188–1196, 2014.
[43] "Pomegranate." https://pomegranate.readthedocs.io/, 2020.
[44] "Keras: the Python Deep Learning Library." https://keras.io/, 2020.
[45] "Scikit-learn: Machine Learning in Python." https://scikit-learn.org/, 2020.
[46] "XGBoost: Scalable and Flexible Gradient Boosting." https://xgboost.ai/, 2020.
[47] D. L. Silver, Q. Yang, and L. Li, "Lifelong machine learning systems: Beyond learning algorithms," in AAAI Spring Symposium: Lifelong Machine Learning, vol. SS-13-05 of AAAI Technical Report, AAAI, 2013.
