Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks


Authors: Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, Yang Liu (Nanyang Technological University)

Abstract

Vulnerability identification is crucial to protect software systems from attacks for cyber security. It is especially important to localize the vulnerable functions among the source code to facilitate the fix. However, it is a challenging and tedious process, and also requires specialized security expertise. Inspired by the work on manually-defined patterns of vulnerabilities from various code representation graphs and the recent advance on graph neural networks, we propose Devign, a general graph neural network based model for graph-level classification through learning on a rich set of code semantic representations. It includes a novel Conv module to efficiently extract useful features from the learned rich node representations for graph-level classification. The model is trained over manually labeled datasets built on 4 diversified large-scale open-source C projects that incorporate the high complexity and variety of real source code, instead of the synthesis code used in previous works. The results of the extensive evaluation on the datasets demonstrate that Devign outperforms the state of the art significantly, with an average of 10.51% higher accuracy and 8.68% F1 score, and that the Conv module increases accuracy by 4.66% and F1 by 6.37% on average.
1 Introduction

The number of software vulnerabilities has been increasing rapidly recently, either reported publicly through CVE (Common Vulnerabilities and Exposures) or discovered internally in proprietary code. In particular, the prevalence of open-source libraries not only accounts for the increment, but also propagates the impact. These vulnerabilities, mostly caused by insecure code, can be exploited to attack software systems and cause substantial damage financially and socially.

Vulnerability identification is a crucial yet challenging problem in security. Besides classic approaches such as static analysis [1, 2], dynamic analysis [3-5] and symbolic execution, a number of advances have been made in applying machine learning as a complementary approach. In these early methods [6-8], features or patterns hand-crafted by human experts are taken as inputs by machine learning algorithms to detect vulnerabilities. However, the root causes of vulnerabilities vary by types of weaknesses [9] and libraries, making it impractical to characterize all vulnerabilities in numerous libraries with hand-crafted features.

Preprint. Under review.

To improve the usability of existing approaches and avoid the intense labor of human experts on feature extraction, recent works investigate the potential of deep neural networks for a more automated way of vulnerability identification [10-12]. However, all of these works have major limitations in learning comprehensive program semantics to characterize vulnerabilities of high diversity and complexity in real source code. First, in terms of learning approaches, they either treat the source code as a flat sequence, similar to natural language, or represent it with only partial information.
However, source code is actually more structural and logical than natural language, and has heterogeneous aspects of representation such as the Abstract Syntax Tree (AST), data flow, control flow, etc. Moreover, vulnerabilities are sometimes subtle flaws that require comprehensive investigation from multiple dimensions of semantics. Therefore, the drawbacks in the design of previous works limit their potential to cover various vulnerabilities. Second, in terms of training data, part of the data in [11] is labeled by static analyzers, which introduces a high percentage of false positives that are not real vulnerabilities. Another part, like [10], consists of simple artificial code (even with "good" or "bad" inside the code to distinguish vulnerable from non-vulnerable code) that is far from the complexity of real code [13].

To this end, we propose a novel graph neural network based model with a composite programming representation for factual vulnerability data. This allows us to encode a full set of classical programming code semantics to capture various vulnerability characteristics. A key innovation is a new Conv module which takes as input a graph's heterogeneous node features from gated recurrent units. The Conv module hierarchically chooses coarser features by leveraging traditional convolutional and dense layers for graph-level classification. Moreover, to testify both the potential of the composite programming embedding for source code and the proposed graph neural network model for the challenging task of vulnerability identification, we compiled manually labeled datasets from 4 popular and diversified libraries in the C programming language. We name this model Devign (Deep Vulnerability Identification via Graph Neural Networks).
• In the composite code representation, with ASTs as the backbone, we explicitly encode the program control and data dependency at different levels into a joint graph of heterogeneous edges, with each edge type denoting the connection of the corresponding representation. This comprehensive representation, not considered in previous works, facilitates capturing as extensive types and patterns of vulnerabilities as possible, and enables learning better node representations through graph neural networks.

• We propose the gated graph neural network model with the Conv module for graph-level classification. The Conv module learns hierarchically from the node features to capture higher-level representations for graph-level classification tasks.

• We implement Devign, and evaluate its effectiveness through manually labeled datasets (costing around 600 man-hours) collected from the 4 popular C libraries. We make two datasets public together with more details (https://sites.google.com/view/devign). The results show that Devign achieves an average 10.51% higher accuracy and 8.68% higher F1 score than baseline methods. Meanwhile, the Conv module brings an average 4.66% accuracy and 6.37% F1 gain. We apply Devign to the 40 latest CVEs collected from the 4 projects and get 74.11% accuracy, manifesting its usability for discovering new vulnerabilities.

2 The Devign Model

Vulnerability patterns manually crafted with code property graphs, integrating all syntax and dependency semantics, have been proved to be one of the most effective approaches [14] to detect software vulnerabilities. Inspired by this, we designed Devign to automate the above process on code property graphs, learning vulnerable patterns using graph neural networks [15].
The Devign architecture, shown in Figure 1, includes three sequential components: 1) the Graph Embedding Layer of Composite Code Semantics, which encodes the raw source code of a function into a joint graph structure with comprehensive program semantics; 2) the Gated Graph Recurrent Layers, which learn the features of nodes through aggregating and passing information of neighboring nodes in graphs; 3) the Conv module, which extracts meaningful node representations for graph-level prediction.

Figure 1: The Architecture of Devign

2.1 Problem Formulation

Most machine learning or pattern based approaches predict vulnerability at the coarse granularity level of a source file or an application, i.e., whether a source file or an application is potentially vulnerable [7, 14, 10, 12]. Here we analyze vulnerable code at the function level, which is a finer level of granularity in the overall flow of vulnerability analysis. We formalize the identification of vulnerable functions as a binary classification problem, i.e., learning to decide whether a given function in raw source code is vulnerable or not. Let a sample of data be defined as ((c_i, y_i) | c_i ∈ C, y_i ∈ Y), i ∈ {1, 2, ..., n}, where C denotes the set of functions in code, Y = {0, 1}^n represents the label set with 1 for vulnerable and 0 otherwise, and n is the number of instances. Since c_i is a function, we assume it is encoded as a multi-edged graph g_i(V, X, A) ∈ G (see Section 2.2 for the embedding details).
Let m be the total number of nodes in V. X ∈ R^{m×d} is the initial node feature matrix, where each vertex v_j in V is represented by a d-dimensional real-valued vector x_j ∈ R^d. A ∈ {0, 1}^{k×m×m} is the adjacency matrix, where k is the total number of edge types. An element e^p_{s,t} ∈ A equal to 1 indicates that nodes v_s and v_t are connected via an edge of type p, and 0 otherwise. The goal of Devign is to learn a mapping from G to Y, f: G → Y, to predict whether a function is vulnerable or not. The prediction function f can be learned by minimizing the loss function below:

    min Σ_{i=1}^{n} L(f(g_i(V, X, A)), y_i | c_i) + λω(f)    (1)

where L(·) is the cross entropy loss function, ω(·) is a regularization, and λ is an adjustable weight.

2.2 Graph Embedding Layer of Composite Code Semantics

As illustrated in Figure 1, the graph embedding layer EMB is a mapping from the function code c_i to graph data structures as the input of the model, i.e.,

    g_i(V, X, A) = EMB(c_i), ∀i ∈ {1, ..., n}    (2)

In this section, we describe the motivation and method on why and how to utilize the classical code representations to embed the code into a composite graph for feature learning.

2.2.1 Classical Code Graph Representations and Vulnerability Identification

In program analysis, various representations of the program are utilized to manifest deeper semantics behind the textual code; the classic concepts include ASTs, control flow graphs, and data flow graphs that capture the syntactic and semantic relationships among the different tokens of the source code. The majority of vulnerabilities, such as memory leaks, are too subtle to be spotted without a joint consideration of the composite code semantics [14]. For example, it is reported that ASTs alone can be used to find only insecure arguments [14].
By combining ASTs with control flow graphs, it is possible to cover two more types of vulnerabilities, i.e., resource leaks and some use-after-free vulnerabilities. By further integrating the three code graphs, it is possible to describe most types except two that need extra external information (i.e., race conditions depending on runtime properties, and design errors that are hard to model without details on the intended design of a program). Though [14] manually crafted the vulnerability templates in the form of graph traversals, it conveyed the key insight and proved the feasibility of learning a broader range of vulnerability patterns through integrating properties of ASTs, control flow graphs and data flow graphs into a joint data structure. Besides the three classical code structures, we also take the natural sequence of source code into consideration, since the recent advance on deep learning based vulnerability detection has demonstrated its effectiveness [10, 11]. It can complement the classical representations because its unique flat structure captures the relationships of code tokens in a 'human-readable' fashion.
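Concretely, the joint graph of Section 2.1 can be thought of as one shared node set plus a stack of binary adjacency matrices, one per edge type. The toy numpy sketch below (not the authors' code; sizes and edges are invented for illustration) shows the shape of such an encoding:

```python
import numpy as np

# A minimal sketch of the multi-edged graph encoding g_i(V, X, A):
# m nodes, d-dimensional initial features, and a k x m x m binary
# adjacency tensor with one m x m slice per edge type.
m, d, k = 4, 3, 2                    # toy sizes: 4 nodes, 3 features, 2 types

X = np.zeros((m, d))                 # initial node feature matrix, row per node
A = np.zeros((k, m, m), dtype=np.int8)

# Hypothetical edges: type 0 standing in for AST child-parent relations,
# type 1 standing in for CFG control transfer.
A[0, 0, 1] = 1                       # e^0_{0,1} = 1: type-0 edge v_0 -> v_1
A[0, 0, 2] = 1
A[1, 1, 3] = 1                       # e^1_{1,3} = 1: type-1 edge v_1 -> v_3

assert A.shape == (k, m, m)
print(int(A.sum()))                  # total edges across all types -> 3
```

Each slice A[p] is exactly the adjacency matrix A_p that the gated graph recurrent layers later use to route messages of edge type p.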
    short add(short b) {
        short a = 32767;
        if (b > 0) {
            a = a + b;
        }
        return a;
    }

Figure 2: Graph Representation of Code Snippet with Integer Overflow. (a) Code example of integer overflow (shown above); (b) its graph representation with AST (purple), CFG (green dashed), DFG (orange) and NCS (red) edges.

2.2.2 Graph Embedding of Code

Next we briefly introduce each type of the code representations and how we represent the various subgraphs as one joint graph, following a code example of integer overflow as in Figure 2(a) and its graph representation as shown in Figure 2(b).

Abstract Syntax Tree (AST). The AST is an ordered tree representation structure of source code. Usually, it is the first-step representation used by code parsers to understand the fundamental structure of the program and to examine syntactic errors. Hence, it forms the basis for the generation of many other code representations, and the node set of the AST, V_ast, includes all the nodes of the other three code representations used in this paper. Starting from the root node, the code is broken down into code blocks, statements, declarations, expressions, etc., and finally into the primary tokens that form the leaf nodes. The major AST nodes are shown in Figure 2. All the boxes are AST nodes, with the specific code in the first line and the node type annotated. The blue boxes are leaf nodes of the AST, and purple arrows represent the child-parent AST relations.

Control Flow Graph (CFG). The CFG describes all paths that might be traversed through a program during its execution.
The path alternatives are determined by conditional statements, e.g., if, for, and switch statements. In CFGs, nodes denote statements and conditions, and they are connected by directed edges to indicate the transfer of control. The CFG edges are highlighted with green dashed arrows in Figure 2. Particularly, the flow starts from the entry and ends at the exit, and two different paths derive at the if statement.

Data Flow Graph (DFG). The DFG tracks the usage of variables throughout the CFG. Data flow is variable oriented, and any data flow involves the access or modification of certain variables. A DFG edge represents the subsequent access or modification of the same variables. It is illustrated by orange double arrows in Figure 2, with the involved variables annotated over the edges. For example, the parameter b is used in both the if condition and the assignment statement.

Natural Code Sequence (NCS). In order to encode the natural sequential order of the source code, we use NCS edges to connect neighboring code tokens in the ASTs. The main benefit of such an encoding is to preserve the programming logic reflected by the sequence of source code. The NCS edges are denoted by red arrows in Figure 2, connecting all the leaf nodes of the AST.

Consequently, a function c_i can be denoted by a joint graph g with the four types of subgraphs (or 4 types of edges) sharing the same set of nodes V = V_ast. As shown in Figure 2, every node v ∈ V has two attributes, Code and Type. Code contains the source code represented by v, and Type denotes the node type. The initial node representation x_v shall reflect the two attributes. Hence, we encode Code by using a pre-trained word2vec model with the code corpus built on the whole source code files of the projects, and Type by label encoding. We concatenate the two encodings together as the initial node representation x_v.
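The construction of x_v can be sketched as below. This is an illustration only: the embedding table and type list are toy stand-ins for the pre-trained word2vec model and the parser node types, and how multi-token Code attributes are combined is not specified in the paper, so the token averaging here is an assumption.

```python
import numpy as np

# Toy stand-ins for the pre-trained word2vec vectors and the node-type list.
rng = np.random.default_rng(0)
word_vecs = {tok: rng.normal(size=8) for tok in ["a", "b", "a+b", "short"]}
node_types = ["Identifier", "AddExp", "Assignment"]  # label-encoded: 0, 1, 2

def init_node_repr(code, node_type):
    # Average the word2vec vectors of the Code tokens (an assumption;
    # the paper only says Code is encoded with word2vec)...
    code_emb = np.mean([word_vecs[t] for t in code.split()], axis=0)
    # ...and append the label encoding of the Type attribute.
    type_enc = np.array([float(node_types.index(node_type))])
    return np.concatenate([code_emb, type_enc])

x_v = init_node_repr("a b", "Identifier")
print(x_v.shape)   # (9,): 8-dim code embedding + 1-dim type label
```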
2.3 Gated Graph Recurrent Layers

The key idea of graph neural networks is to embed node representations from local neighborhoods through neighborhood aggregation. Based on the different techniques for aggregating neighborhood information, there are graph convolutional networks [16], GraphSAGE [17], gated graph recurrent networks [15] and their variants. We chose the gated graph recurrent network to learn the node embeddings, because it allows going deeper than the other two and is more suitable for our data with both semantics and graph structure [18].

Given an embedded graph g_i(V, X, A), for each node v_j ∈ V, we initialize the node state vector h^(1)_j ∈ R^z, z ≥ d, using the initial annotation by copying x_j into the first dimensions and padding extra 0's to allow hidden states larger than the annotation size, i.e., h^(1)_j = [x_j^⊤, 0]^⊤. Let T be the total number of time steps for neighborhood aggregation. To propagate information throughout the graph, at each time step t ≤ T, all nodes communicate with each other by passing information via edges, dependent on the edge type and direction (described by the p-th adjacency matrix A_p of A; from the definition, the number of adjacency matrices equals the number of edge types), i.e.,

    a^(t-1)_{j,p} = A_p^⊤ (W_p [h^(t-1)⊤_1, ..., h^(t-1)⊤_m] + b)    (3)

where W_p ∈ R^{z×z} is the weight to learn and b is the bias. In particular, a new state a_{j,p} of node v_j is calculated by aggregating the information of all neighboring nodes defined in the adjacency matrix A_p of edge type p.
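The routing effect of Eq. (3) can be checked on a toy example: multiplying by A_p^⊤ sums the transformed states of exactly the type-p neighbors of each node. Shapes, weights and edges below are invented for illustration, not taken from the paper:

```python
import numpy as np

# Sketch of Eq. (3): for edge type p, messages are linear transforms of all
# node states, routed by the adjacency matrix A_p.
m, z = 4, 5                            # m nodes, hidden size z
rng = np.random.default_rng(1)
H = rng.normal(size=(m, z))            # rows are h^(t-1)_1 ... h^(t-1)_m
W_p = rng.normal(size=(z, z))          # learnable weight for edge type p
b = np.zeros(z)                        # bias (zero here for the check)
A_p = np.zeros((m, m))
A_p[0, 1] = A_p[2, 1] = 1              # hypothetical type-p edges into node 1

# a^(t-1)_p = A_p^T (H W_p + b): node 1 sums messages from nodes 0 and 2.
A_msg = A_p.T @ (H @ W_p + b)
assert np.allclose(A_msg[1], (H[0] + H[2]) @ W_p)  # aggregated neighbors
assert np.allclose(A_msg[3], 0)                    # no incoming type-p edges
```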
The remaining steps are a gated recurrent unit (GRU) that incorporates the information from all edge types and the previous time step to get the current hidden state h^(t)_j of node v_j, i.e.,

    h^(t)_j = GRU(h^(t-1)_j, AGG({a^(t-1)_{j,p}}_{p=1}^{k}))    (4)

where AGG(·) denotes an aggregation function that could be one of {MEAN, MAX, SUM, CONCAT}, used to aggregate the information from the different edge types and compute the next time-step node embedding h^(t). We use the SUM function in the implementation. The above propagation procedure iterates over T time steps, and the state vectors at the last time step, H^(T)_i = {h^(T)_j}_{j=1}^{m}, form the final node representation matrix for the node set V.

2.4 The Conv Layer

The generated node features from the gated graph recurrent layers can be used as input to any prediction layer, e.g., for node-, link- or graph-level prediction, and then the whole model can be trained in an end-to-end fashion. In our problem, we need to perform graph-level classification to determine whether a function c_i is vulnerable or not. The standard approach to graph classification is gathering all the generated node embeddings globally, e.g., using a linear weighted summation to flatly add up all the embeddings [15, 19], as shown in Eq. (5),

    ỹ_i = Sigmoid(Σ MLP([H^(T)_i, x_i]))    (5)

where the sigmoid function is used for classification and MLP denotes a Multilayer Perceptron that maps the concatenation of H^(T)_i and x_i to an R^m vector. This kind of approach hinders effective classification over entire graphs [20, 21]. Thus, we design the Conv module to select the sets of nodes and features that are relevant to the current graph-level task.
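Putting Eqs. (3) and (4) together, one propagation round sums the per-edge-type messages (AGG = SUM) and feeds them through a GRU cell. The sketch below uses a hand-rolled GRU and random toy weights, so it illustrates the control flow rather than the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy sizes: m nodes, hidden size z, k edge types, T propagation steps.
m, z, k, T = 4, 5, 2, 3
rng = np.random.default_rng(2)
W_z, W_r, W_h = (rng.normal(size=(z, z), scale=0.1) for _ in range(3))
U_z, U_r, U_h = (rng.normal(size=(z, z), scale=0.1) for _ in range(3))
W_p = rng.normal(size=(k, z, z), scale=0.1)     # one message weight per type
A = (rng.random((k, m, m)) < 0.3).astype(float) # random toy adjacency tensor

def gru(h, a):                       # standard GRU update, applied per node
    u = sigmoid(a @ W_z + h @ U_z)   # update gate
    r = sigmoid(a @ W_r + h @ U_r)   # reset gate
    c = np.tanh(a @ W_h + (r * h) @ U_h)
    return (1 - u) * h + u * c

H = rng.normal(size=(m, z))          # h^(1): padded initial annotations
for t in range(T):                   # T rounds of neighborhood aggregation
    msgs = sum(A[p].T @ (H @ W_p[p]) for p in range(k))   # Eq. (3), AGG = SUM
    H = gru(H, msgs)                                      # Eq. (4)
print(H.shape)                       # final node representations H^(T): (4, 5)
```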
Previous work [21] proposed to use a SortPooling layer after the graph convolution layers to sort the node features into a consistent node order for graphs without a fixed ordering, so that traditional neural networks can be added after it and trained to extract useful features characterizing the rich information encoded in the graph. In our problem, each code representation graph has its own predefined order and connection of nodes encoded in the adjacency matrix, and the node features are learned through gated graph recurrent layers instead of graph convolution networks that require sorting the node features from different channels. Therefore, we directly apply 1-D convolutional and dense neural networks to learn the features relevant to the graph-level task for more effective prediction.¹ We define σ(·) as a 1-D convolutional layer with maxpooling, i.e.,

    σ(·) = MAXPOOL(Relu(CONV(·)))    (6)

Let l be the number of convolutional layers applied; then the Conv module can be expressed as

    Z^(1)_i = σ([H^(T)_i, x_i]), ..., Z^(l)_i = σ(Z^(l-1)_i)    (7)
    Y^(1)_i = σ(H^(T)_i), ..., Y^(l)_i = σ(Y^(l-1)_i)    (8)
    ỹ_i = Sigmoid(AVG(MLP(Z^(l)_i) ⊙ MLP(Y^(l)_i)))    (9)

where we first apply traditional 1-D convolutional and dense layers respectively on the concatenation [H^(T)_i, x_i] and the final node features H^(T)_i, followed by a pairwise multiplication of the two outputs, then an average aggregation of the resulting vector, and at last make the prediction.

¹ We also tried LSTMs and BiLSTMs (with and without attention mechanisms) on the nodes sorted in AST order; however, the convolutional networks work best overall.
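The pipeline of Eqs. (6)-(9) can be sketched end to end as follows. All filters, MLP weights and sizes here are random toy stand-ins (the paper's actual filter and stride settings are given in Section 3.3), so this shows the data flow only:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigma(M, w):
    # Eq. (6): sigma(.) = MAXPOOL(Relu(CONV(.))). A 1-D convolution along
    # the node axis with filter w, then stride-2 max pooling (a trailing
    # odd row is dropped).
    conv = np.stack([np.tensordot(w, M[i:i + len(w)], axes=1)
                     for i in range(M.shape[0] - len(w) + 1)])
    relu = np.maximum(conv, 0)
    return np.stack([relu[i:i + 2].max(axis=0)
                     for i in range(0, relu.shape[0] - 1, 2)])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

m, z, d = 12, 5, 3                     # toy sizes: 12 nodes, z = 5, d = 3
H_T = rng.normal(size=(m, z))          # learned node features H^(T)
x = rng.normal(size=(m, d))            # initial annotations x_i
w1, w2 = rng.normal(size=3), rng.normal(size=3)  # conv filters, l = 2
M1 = rng.normal(size=(z + d, 1))       # toy MLP on the Z branch
M2 = rng.normal(size=(z, 1))           # toy MLP on the Y branch

Z = sigma(sigma(np.concatenate([H_T, x], axis=1), w1), w2)   # Eq. (7)
Y = sigma(sigma(H_T, w1), w2)                                # Eq. (8)
y_hat = sigmoid(np.mean((Z @ M1) * (Y @ M2)))                # Eq. (9)
print(0.0 < y_hat < 1.0)   # a single graph-level probability -> True
```

The pairwise product of the two branches lets features derived from the learned states H^(T) gate the features derived from the raw annotations, before the average collapses everything to one graph-level score.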
Table 1: Data Sets Overview

Project        Sec. Rel. Commits   VFCs    Non-VFCs   Graphs   Vul Graphs   Non-Vul Graphs
Linux Kernel   12811               8647    4164       16583    11198        5385
QEMU           11910               4932    6978       15645    6648         8997
Wireshark      10004               3814    6190       20021    6386         13635
FFmpeg         13962               5962    8000       6716     3420         3296
Total          48687               23355   25332      58965    27652        31313

3 Evaluation

We evaluate the benefits of Devign against a number of state-of-the-art vulnerability discovery methods, with the goal of answering the following questions:

Q1 How does Devign compare to the other learning based vulnerability identification methods?

Q2 How does our Conv module powered Devign compare to Ggrn with the flat summation in Eq. (5) for the graph-level classification task?

Q3 Can Devign learn from each type of the code representations (e.g., a single-edged graph with one type of information)? And how do the Devign models with the composite graphs (i.e., all types of code representations) compare to each of the single-edged graphs?

Q4 Can Devign perform better than static analyzers in the real scenario where the dataset is imbalanced, with an extremely low percentage of vulnerable functions?

Q5 How does Devign perform on the latest vulnerabilities reported publicly through CVEs?

3.1 Data Preparation

It is never trivial to obtain high-quality datasets of vulnerable functions due to the demand for qualified expertise. We noticed that although [12] released datasets of vulnerable functions, the labels are generated by static analyzers and hence are not accurate. Other potential datasets used in [22] are not available. In this work, supported by our industrial partners, we invested a team of security researchers to collect and label the data from scratch. Besides raw function collection, we need to generate graph representations for each function and initial representations for each node in a graph. We describe the detailed procedures below.
Raw Data Gathering. To test the capability of Devign in learning vulnerability patterns, we evaluate on manually-labeled functions collected from 4 large C-language open-source projects that are popular among developers and diversified in functionality, i.e., Linux Kernel, QEMU, Wireshark, and FFmpeg.

To facilitate and ensure the quality of data labelling, we started by collecting security-related commits, which we would label as vulnerability-fix commits or non-vulnerability-fix commits, and then extracted vulnerable or non-vulnerable functions directly from the labeled commits. The vulnerability-fix commits (VFCs) are commits that fix potential vulnerabilities; from them we can extract vulnerable functions from the source code of the versions previous to the revision made in the commits. The non-vulnerability-fix commits (non-VFCs) are commits that do not fix any vulnerability; similarly, from them we can extract non-vulnerable functions from the source code before the modification. We adopted the approach proposed in [23] to collect the commits. It consists of the following two steps. 1) Commits Filtering. Since only a tiny part of commits are vulnerability related, we exclude the security-unrelated commits whose messages are not matched by a set of security-related keywords such as DoS and injection. The rest, more likely security-related, are left for manual labelling. 2) Manual Labelling. A team of four professional security researchers spent in total 600 man-hours to perform a two-round data labelling and cross-verification. Given a VFC or non-VFC, based on the modified functions, we extract the source code of these functions before the commit is applied, and assign the labels accordingly.
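The commits-filtering step amounts to a keyword match over commit messages. A minimal sketch follows; note the keyword list is illustrative (the paper names only "DoS" and "injection" explicitly), and the example commit messages are invented:

```python
import re

# Keep only commits whose messages match security-related keywords.
# This keyword list is a hypothetical stand-in for the full set used
# in the paper's filtering step.
KEYWORDS = re.compile(
    r"\b(dos|denial.of.service|injection|overflow|"
    r"use.after.free|leak|cve)\b",
    re.IGNORECASE,
)

def is_security_related(commit_message):
    return KEYWORDS.search(commit_message) is not None

commits = [
    "Fix heap overflow in packet parser",
    "Update README badges",
    "Prevent SQL injection in query builder",
]
print([m for m in commits if is_security_related(m)])  # keeps 1st and 3rd
```

Only the commits that survive this coarse filter go on to the 600 man-hour manual labelling round, which keeps the expensive human effort focused on plausible candidates.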
Graph Generation. We make use of Joern [14], the open-source code analysis platform for C/C++ based on code property graphs, to extract ASTs and CFGs for all functions in our datasets. Due to some inner compile errors and exceptions in Joern, we can only obtain ASTs and CFGs for part of the functions. We filter out the functions without ASTs and CFGs, or with obvious errors in the ASTs and CFGs. Since the original DFG edges are labeled with the variables involved, which tremendously increases the number of edge types and meanwhile complicates the embedded graphs, we substitute the DFGs with three other relations, LastRead (DFG_R), LastWrite (DFG_W), and ComputedFrom (DFG_C) [24], to make them more adaptive for the graph embedding. DFG_R represents the immediately last read of each occurrence of a variable; each occurrence can be directly recognized from the leaf nodes of the ASTs. DFG_W represents the immediately last write of each occurrence of a variable; similarly, we make these annotations on the leaf-node variables. DFG_C determines the sources of a variable: in an assignment statement, the left-hand-side (lhs) variable is assigned a new value by the right-hand-side (rhs) expression, and DFG_C captures such relations between the lhs variable and each of the rhs variables. Further, we remove functions with node size greater than 500 for computational efficiency, which accounts for 15%. We summarize the statistics of the datasets in Table 1.

Table 2: Classification accuracies and F1 scores in percentages. The two far-right columns give the maximum and average relative difference in accuracy/F1 compared to the Devign model with the composite code representations, i.e., Devign (Composite).

Method                 Linux Kernel   QEMU          Wireshark     FFmpeg        Combined      Max Diff      Avg Diff
                       ACC    F1      ACC    F1     ACC    F1     ACC    F1     ACC    F1     ACC    F1     ACC    F1
Metrics + Xgboost      67.17  79.14   59.49  61.27  70.39  61.31  67.17  63.76  61.36  63.76  14.84  11.80  10.30  8.71
3-layer BiLSTM         67.25  80.41   57.85  57.75  69.08  55.61  53.27  69.51  59.40  65.62  16.48  15.32  14.04  8.78
3-layer BiLSTM + Att   75.63  82.66   65.79  59.92  74.50  58.52  61.71  66.01  69.57  68.65  8.54   13.15  5.97   7.41
CNN                    70.72  79.55   60.47  59.29  70.48  58.15  53.42  66.58  63.36  60.13  16.16  13.78  11.72  9.82
Ggrn (AST)             72.65  81.28   70.08  66.84  79.62  64.56  63.54  70.43  67.74  64.67  6.93   8.59   4.69   5.01
Ggrn (CFG)             78.79  82.35   71.42  67.74  79.36  65.40  65.00  71.79  70.62  70.86  4.58   5.33   2.38   2.93
Ggrn (NCS)             78.68  81.84   72.99  69.98  78.13  59.80  65.63  69.09  70.43  69.86  3.95   8.16   2.24   4.45
Ggrn (DFG_C)           70.53  81.03   69.30  56.06  73.17  50.83  63.75  69.44  65.52  64.57  9.05   17.13  6.96   10.18
Ggrn (DFG_R)           72.43  80.39   68.63  56.35  74.15  52.25  63.75  71.49  66.74  62.91  7.17   16.72  6.27   9.88
Ggrn (DFG_W)           71.09  81.27   71.65  65.88  72.72  51.04  64.37  70.52  63.05  63.26  9.21   16.92  6.84   8.17
Ggrn (Composite)       74.55  79.93   72.77  66.25  78.79  67.32  64.46  70.33  70.35  69.37  5.12   6.82   3.23   3.92
Devign (AST)           80.24  84.57   71.31  65.19  79.04  64.37  65.63  71.83  69.21  69.99  3.95   7.88   2.33   3.37
Devign (CFG)           80.03  82.91   74.22  70.73  79.62  66.05  66.89  70.22  71.32  71.27  2.69   3.33   1.00   2.33
Devign (NCS)           79.58  81.41   72.32  68.98  79.75  65.88  67.29  68.89  70.82  68.45  2.29   4.81   1.46   3.84
Devign (DFG_C)         78.81  83.87   72.30  70.62  79.95  66.47  65.83  70.12  69.88  70.21  3.75   3.43   2.06   2.30
Devign (DFG_R)         78.25  80.33   73.77  70.60  80.66  66.17  66.46  72.12  71.49  70.92  3.12   4.64   1.29   2.53
Devign (DFG_W)         78.70  84.21   72.54  71.08  80.59  66.68  67.50  70.86  71.41  71.14  2.08   2.69   1.27   1.77
Devign (Composite)     79.58  84.97   74.33  73.07  81.32  67.96  69.58  73.55  72.26  73.26  -      -      -      -

3.2 Baseline Methods

In the performance comparison, we compare Devign with the state-of-the-art machine-learning-based vulnerability prediction methods, as well as the gated graph recurrent network (Ggrn) that uses the linearly weighted summation for classification.
Metrics + Xgboost [22]: We collect in total 4 complexity metrics and 11 vulnerability metrics for each function using Joern, and utilize Xgboost for classification. Here we did not use the proposed binning and ranking method because it was not learning based, but a heuristic designed to rank the likelihood of being vulnerable. We search for the best parameters via Bayesian optimization [25].

3-layer BiLSTM [10]: It treats the source code as natural language and inputs the tokenized code into bidirectional LSTMs with initial embeddings trained via word2vec. Here we implemented a 3-layer bidirectional LSTM for the best performance.

3-layer BiLSTM + Att: It is an improved version of [10] with the attention mechanism [26].

CNN [11]: Similar to [10], it takes source code as natural language and utilizes bag of words to get the initial embeddings of code tokens, and then feeds them to CNNs to learn.

3.3 Performance Evaluation

Devign Configuration. In the embedding layer, the dimension of word2vec for the initial node representation is 100. In the gated graph recurrent layer, we set the dimension of hidden states to 200, and the number of time steps to 6. For the Conv parameters of Devign, we apply a (1, 3) filter with ReLU activation for the first convolutional layer, followed by a max pooling layer with (1, 3) filter and (1, 2) stride, and a (1, 1) filter for the second convolutional layer, followed by a max pooling layer with (2, 2) filter and (1, 2) stride. We use the Adam optimizer with learning rate 0.0001 and batch size 128, and L2 regularization to avoid overfitting. We randomly shuffle each dataset and split 75% for training and the rest 25% for validation. We train our model on Nvidia Tesla M40 and P40 graphics cards, with 100-epoch patience for early stopping.

Table 3: Classification accuracies and F1 scores in percentages under the imbalanced setting

Method      Cppcheck     Flawfinder   CXXX        3-layer BiLSTM  3-layer BiLSTM + Att  CNN          Devign (Composite)
            ACC   F1     ACC   F1     ACC   F1    ACC    F1       ACC    F1             ACC    F1    ACC    F1
Linux       75.11 0      78.46 12.57  19.44 5.07  18.25  13.12    8.79   16.16          29.03  15.38 69.41  24.64
QEMU        89.21 0      86.24 7.61   33.64 9.29  29.07  15.54    78.43  10.50          75.88  18.80 89.27  41.12
Wireshark   89.19 10.17  89.92 9.46   33.26 3.95  91.39  10.75    84.90  28.35          86.09  8.69  89.37  42.05
FFmpeg      87.72 0      80.34 12.86  36.04 2.45  11.17  18.71    8.98   16.48          70.07  31.25 69.06  34.92
Combined    85.41 2.27   85.65 10.41  29.57 4.01  9.65   16.59    15.58  16.24          72.47  17.94 75.56  27.25

Results Analysis. We use accuracy and F1 score to measure performance. Table 2 summarizes all the experiment results. First, we analyze the results regarding Q1, the performance of Devign against the other learning based methods. From the results of the baseline methods, Ggrn, and Devign with composite code representations, we can see that both Ggrn and Devign significantly outperform the baseline methods on all the datasets. Especially, compared to all the baseline methods, the relative accuracy gain by Devign is on average 10.51%, and at least 8.54% on the QEMU dataset. Devign (Composite) outperforms the 4 baseline methods in terms of F1 score as well, i.e., the relative gain of F1 score is 8.68% on average, and the minimum relative gains on each dataset (Linux Kernel, QEMU, Wireshark, FFmpeg and Combined) are 2.31%, 11.80%, 6.65%, 4.04% and 4.61% respectively. As Linux follows best practices of coding style, the F1 score of 84.97 achieved by Devign on it is the highest among all datasets.
Hence, Devign with comprehensive semantics encoded in graphs performs significantly better than the state-of-the-art vulnerability identification methods.

Next, we investigate the answer to Q2 about the performance gain of Devign over Ggrn. We first look at the scores with the composite code representation. They demonstrate that, on all datasets, Devign reaches higher accuracy (an average of 3.23% more) than Ggrn, with the highest accuracy gain of 5.12% on the FFmpeg dataset. Devign also achieves a higher F1 score, an average of 3.92% above Ggrn, with the highest F1 gain of 6.82% on the QEMU dataset. Meanwhile, looking at the scores with each single code representation, we reach a similar conclusion: Devign generally outperforms Ggrn significantly, with a maximum accuracy gain of 9.21% for the DFG_W edge and a maximum F1 gain of 17.13% for DFG_C. Overall, the average accuracy and F1 gains of Devign over Ggrn are 4.66% and 6.37% across all cases, which indicates that the Conv module extracts more relevant nodes and features for graph-level prediction.

Then we check the results for Q3 to answer whether Devign can learn from different types of code representations and how it performs on composite graphs. Surprisingly, we find that the results learned from single-edged graphs are quite encouraging for both Ggrn and Devign. For Ggrn, the accuracy with some specific edge types is even slightly higher than with the composite graph; e.g., both the CFG and NCS graphs yield better results on the FFmpeg and Combined datasets. For Devign, in terms of accuracy, except on the Linux dataset, the composite graph representation is overall superior to any single-edged graph, with gains ranging from 0.11% to 3.75%.
In terms of F1 score, the improvement brought by the composite graph over the single-edged graphs is 2.69% on average for Devign, ranging from 0.4% to 7.88% across all tests. In summary, composite graphs help Devign learn better prediction models than single-edged graphs.

To answer Q4 on the comparison with static analyzers on realistic imbalanced data, we randomly sampled the test data to create imbalanced datasets with 10% vulnerable functions, in line with a large industrial-scale analysis [23]. We compare with the well-known open-source static analyzers Cppcheck and Flawfinder, and with a commercial tool CXXX whose name we hide for legal concerns. The results are shown in Table 3, where our approach significantly outperforms all static analyzers, with an F1 score 27.99 higher. Meanwhile, the static analyzers tend to miss most vulnerable functions and report many false positives; e.g., Cppcheck found 0 vulnerabilities in 3 out of the 4 single-project datasets.

Finally, to answer Q5 on the latest exposed vulnerabilities, we scrape the latest 10 CVEs of each project to check whether Devign can potentially be applied to identify zero-day vulnerabilities. Based on the commit fixes of these 40 CVEs, we obtain 112 vulnerable functions in total. We input these functions into the trained Devign model and achieve an average accuracy of 74.11%, which demonstrates Devign's potential for discovering new vulnerabilities in practical applications.

4 Related Work

The success of deep learning has inspired researchers to apply it for more automated solutions to vulnerability discovery in source code [12, 10, 11]. The recent works [10, 12, 11] treat source code as flat natural language sequences and explore the potential of natural language processing techniques in vulnerability detection.
For instance, [12, 10] built models upon LSTM/BiLSTM neural networks, while [11] proposed to use CNNs instead. To overcome the limitations of the aforementioned models in expressing the logic and structure of code, a number of works have attempted to apply more structural neural networks such as tree structures [27] or graph structures [15, 28, 24] to various tasks. For instance, [15] proposed to generate logical formulas for program verification through gated graph recurrent networks, and [24] aimed at the prediction of variable names and variable misuse. [28] proposed Gemini for binary code similarity detection, where functions in binary code are represented by attributed control flow graphs and input to Structure2vec [19] for learning graph embeddings. Different from all these works, our work targets vulnerability identification and incorporates comprehensive code representations to express as many types of vulnerabilities as possible. Besides, our work adopts the gated graph recurrent layers of [15] to consider the semantics of nodes (e.g., node annotations) as well as structural features, both of which are important in vulnerability identification, whereas Structure2vec focuses primarily on learning structural features. Compared with [24], which applies gated graph recurrent networks for variable prediction, we explicitly incorporate the control flow graph into the composite graph and propose the Conv module for efficient graph-level classification.

5 Conclusion and Future Work

We introduce a novel vulnerability identification model, Devign, which encodes a source-code function into a joint graph structure from multiple syntax and semantic representations and then leverages the composite graph representation to effectively learn to discover vulnerable code.
It achieves a new state of the art in machine-learning-based vulnerable function discovery on real open-source projects. Interesting future work includes efficient learning from big functions via integrating program slicing, applying the learned model to detect vulnerabilities across projects, and generating human-readable or explainable vulnerability assessments.

References

[1] Z. Xu, B. Chen, M. Chandramohan, Y. Liu, and F. Song, "Spain: Security patch analysis for binaries towards understanding the pain and pills," in Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 2017, pp. 462–472.

[2] M. Chandramohan, Y. Xue, Z. Xu, Y. Liu, C. Y. Cho, and H. B. K. Tan, "Bingo: Cross-architecture cross-OS binary search," in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2016, pp. 678–689.

[3] Y. Li, Y. Xue, H. Chen, X. Wu, C. Zhang, X. Xie, H. Wang, and Y. Liu, "Cerebro: Context-aware adaptive fuzzing for effective vulnerability detection," in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2019, pp. 533–544.

[4] H. Chen, Y. Xue, Y. Li, B. Chen, X. Xie, X. Wu, and Y. Liu, "Hawkeye: Towards a desired directed grey-box fuzzer," in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2018, pp. 2095–2108.

[5] Y. Li, B. Chen, M. Chandramohan, S.-W. Lin, Y. Liu, and A. Tiu, "Steelix: Program-state based binary fuzzing," in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 2017, pp. 627–637.

[6] S. Neuhaus, T. Zimmermann, C. Holler, and A. Zeller, "Predicting vulnerable software components," in Proceedings of the 14th ACM Conference on Computer and Communications Security, ser. CCS '07. New York, NY, USA: ACM, 2007, pp. 529–540. [Online]. Available: http://doi.acm.org/10.1145/1315245.1315311

[7] V. H. Nguyen and L. M. S. Tran, "Predicting vulnerable software components with dependency graphs," in Proceedings of the 6th International Workshop on Security Measurements and Metrics, ser. MetriSec '10. New York, NY, USA: ACM, 2010, pp. 3:1–3:8. [Online]. Available: http://doi.acm.org/10.1145/1853919.1853923

[8] Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, "Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities," IEEE Trans. Softw. Eng., vol. 37, no. 6, pp. 772–787, Nov. 2011. [Online]. Available: http://dx.doi.org/10.1109/TSE.2010.81

[9] "CWE List Version 3.1," https://cwe.mitre.org/data/index.html, 2018.

[10] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, "Vuldeepecker: A deep learning-based system for vulnerability detection," in 25th Annual Network and Distributed System Security Symposium (NDSS 2018), 2018.

[11] R. Russell, L. Kim, L. Hamilton, T. Lazovich, J. Harer, O. Ozdemir, P. Ellingwood, and M. McConley, "Automated vulnerability detection in source code using deep representation learning," in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2018, pp. 757–762.

[12] H. K. Dam, T. Tran, T. Pham, S. W. Ng, J. Grundy, and A. Ghose, "Automatic feature learning for vulnerability prediction," arXiv preprint arXiv:1708.02368, 2017.

[13] Juliet test suite. [Online]. Available: https://samate.nist.gov/SRD/around.php

[14] F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, "Modeling and discovering vulnerabilities with code property graphs," in Proceedings of the 2014 IEEE Symposium on Security and Privacy, ser. SP '14. Washington, DC, USA: IEEE Computer Society, 2014, pp. 590–604. [Online]. Available: http://dx.doi.org/10.1109/SP.2014.44

[15] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, "Gated graph sequence neural networks," arXiv preprint arXiv:1511.05493, 2015.

[16] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling, "Modeling relational data with graph convolutional networks," in European Semantic Web Conference. Springer, 2018, pp. 593–607.

[17] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," arXiv preprint arXiv:1710.10903, 2017.

[18] "Representation Learning on Networks," http://snap.stanford.edu/proj/embeddings-www/, 2018.

[19] H. Dai, B. Dai, and L. Song, "Discriminative embeddings of latent variable models for structured data," in International Conference on Machine Learning, 2016, pp. 2702–2711.

[20] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec, "Hierarchical graph representation learning with differentiable pooling," in Advances in Neural Information Processing Systems, 2018, pp. 4805–4815.

[21] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, "An end-to-end deep learning architecture for graph classification," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[22] X. Du, B. Chen, Y. Li, J. Guo, Y. Zhou, Y. Liu, and Y. Jiang, "Leopard: Identifying vulnerable code for vulnerability assessment through program metrics," arXiv preprint arXiv:1901.11479, 2019.

[23] Y. Zhou and A. Sharma, "Automated identification of security issues from commit messages and bug reports," in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2017. New York, NY, USA: ACM, 2017, pp. 914–919. [Online]. Available: http://doi.acm.org/10.1145/3106237.3117771

[24] M. Allamanis, M. Brockschmidt, and M. Khademi, "Learning to represent programs with graphs," 2017.

[25] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.

[26] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.

[27] L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin, "Convolutional neural networks over tree structures for programming language processing," in AAAI, vol. 2, no. 3, 2016, p. 4.

[28] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song, "Neural network-based graph embedding for cross-platform binary code similarity detection," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2017, pp. 363–376.
