ORACAL: A Robust and Explainable Multimodal Framework for Smart Contract Vulnerability Detection with Causal Graph Enrichment


Authors: Tran Duong Minh Dai, Triet Huynh Minh Le, M. Ali Babar, Van-Hau Pham, Phan The Duy

Tran Duong Minh Dai (a,b), Triet Huynh Minh Le (c), M. Ali Babar (c), Van-Hau Pham (a,b), Phan The Duy (a,b,*)

(a) Information Security Lab, University of Information Technology, Ho Chi Minh City, Vietnam
(b) Vietnam National University, Ho Chi Minh City, Vietnam
(c) School of Computer Science and Information Technology, Adelaide University, Adelaide, Australia

Abstract

Although Graph Neural Networks (GNNs) have shown promise for smart contract vulnerability detection, they still face significant limitations. Homogeneous graph models fail to capture the interplay between control flow and data dependencies, while heterogeneous graph approaches often lack deep semantic understanding, leaving them susceptible to adversarial attacks. Moreover, most black-box models fail to provide explainable evidence, hindering trust in professional audits. To address these challenges, we propose ORACAL (Observable RAG-enhanced Analysis with CausAL reasoning), a heterogeneous multimodal graph learning framework that integrates Control Flow Graph (CFG), Data Flow Graph (DFG), and Call Graph (CG). ORACAL selectively enriches critical subgraphs with expert-level security context from Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs), and employs a causal attention mechanism to disentangle true vulnerability indicators from spurious correlations. For transparency, the framework adopts PGExplainer to generate subgraph-level explanations identifying vulnerability-triggering paths. Experiments on large-scale datasets demonstrate that ORACAL achieves state-of-the-art performance, outperforming MANDO-HGT, MTVHunter, GNN-SC, and SCVHunter by up to 39.6 percentage points, with a peak Macro F1 of 91.28% on the primary benchmark.
ORACAL maintains strong generalization on out-of-distribution datasets with 91.8% on CGTWeakness and 77.1% on DAppScan. In explainability evaluation, PGExplainer achieves 32.51% Mean Intersection over Union (MIoU) against manually annotated vulnerability-triggering paths. Under adversarial attacks, ORACAL limits performance degradation to approximately a 2.35% F1 decrease with an Attack Success Rate (ASR) of only 3%, surpassing SCVHunter and MANDO-HGT, which exhibit ASRs ranging from 10.91% to 18.73%.

Keywords: Smart Contract Vulnerability, Graph Learning, Multimodal, Vulnerability Detection, Explainable AI

* Corresponding author
Email addresses: 22520183@gm.uit.edu.vn (Tran Duong Minh Dai), triet.h.le@adelaide.edu.au (Triet Huynh Minh Le), ali.babar@adelaide.edu.au (M. Ali Babar), haupv@uit.edu.vn (Van-Hau Pham), duypt@uit.edu.vn (Phan The Duy)

Preprint submitted to Elsevier, March 31, 2026

1. Introduction

Smart contracts have become a foundational element of the modern blockchain ecosystem, driving widespread adoption in Decentralized Finance (DeFi) and Non-Fungible Tokens (NFTs). Their ability to automatically execute business logic without trusted intermediaries has fostered innovation; however, their immutable nature presents a "double-edged sword". Once deployed, code cannot be easily patched, meaning vulnerabilities can lead to permanent and catastrophic financial losses. In recent years, high-profile incidents such as the Cetus Protocol exploit, in which an integer overflow allowed attackers to mint spoof tokens and drain over $220 million in May 2025 [1], and the Zoth Protocol breaches involving combined logic and governance failures [2], demonstrated that vulnerabilities are evolving beyond simple logic errors toward complex economic state manipulations. In early 2026, the TrueBit protocol suffered a mathematical overflow flaw in its minting logic that enabled unauthorized minting and extraction of ETH reserves, resulting in an approximately $26.6 million loss [3]. Similarly, the DeFi liquidity protocols Aperture Finance and Swapnet were exploited due to insufficient input validation and arbitrary external call vulnerabilities across multiple chains in January 2026 [4]. Reports indicate that DeFi exploits caused over $2.9 billion in losses in 2025 alone [5], highlighting an urgent need for robust automated security analysis in smart contract systems.

Traditional vulnerability detection methods, such as static analysis tools like Slither or Mythril, rely heavily on predefined rules and pattern matching. While effective for known vulnerability classes, they struggle with complex, novel logic errors and lack deep semantic understanding [6, 7]. Deep learning approaches have emerged as a promising alternative, particularly identifying vulnerabilities through graph representations of code [8, 9]. However, homogeneous graphs capture only limited aspects of the code, as they typically model all nodes and relations uniformly, overlooking the diverse semantic roles of program constructs [10]. Recent advances utilizing heterogeneous graphs [11, 12] have shown improved performance but often still treat code semantics superficially. More critically, recent studies have demonstrated that heterogeneous graph neural networks are highly susceptible to adversarial attacks [13, 14, 15], where minor structural perturbations can mislead the model. This vulnerability stems from the models' reliance on shallow structural features without a robust grounding in the code's actual semantic "intent" and security context, and it raises the broader question of how to design vulnerability detectors that remain robust under adversarially perturbed but semantically equivalent programs.
The rise of Large Language Models (LLMs) offers a new frontier. Models like GPT-4 or CodeBERT can capture intricate code semantics. However, current LLM-based vulnerability detectors do not inherently resolve the adversarial robustness issue identified above: they can be misled by carefully crafted prompts or semantically equivalent but security-critical code transformations, and still lack guarantees against adversarial manipulation of their inputs [16, 17]. In other words, the adversarial fragility observed in heterogeneous GNNs carries over to purely LLM-based pipelines unless robustness is explicitly modeled. Moreover, when used in isolation for vulnerability detection, they suffer from two additional issues: a lack of domain-specific grounding leading to "hallucinations" [18], and a "black-box" nature that fails to provide explainable evidence for their predictions [19]. In security audits, identifying a bug is not enough; auditors need to know why it is a bug and where the root cause lies, as emphasized in both industrial auditing practice and empirical studies of automated analyzers [20].

To address these gaps, we propose ORACAL (Observable RAG-enhanced Analysis with CausAL reasoning), a novel framework for trustworthy smart contract vulnerability detection. ORACAL integrates the structural rigor of heterogeneous graphs with the rich semantic understanding of LLMs via Retrieval-Augmented Generation (RAG). Unlike existing methods that rely solely on static graph features, ORACAL enriches the most critical nodes in the graph with deep semantic context, specifically "Operational Context" and "Security Analysis", generated by an LLM grounded in a curated knowledge base of Solidity vulnerabilities.
Here, "Operational Context" summarizes the intended behavior of a function or contract, its invariants, and environmental assumptions, encompassing access control, token flow, and interaction patterns, while "Security Analysis" captures potential vulnerability types, exploit preconditions, and expected on-chain impacts at a level suitable for human auditors. Furthermore, we introduce a causal attention mechanism to mitigate the "Clever Hans" effect [21], in which models may rely on superficial artifacts, comprising node counts, common control flow graph patterns, or frequent opcodes, rather than actual security logic. This design ensures that the model focuses on causal vulnerability patterns instead of spurious correlations in the dataset and strengthens robustness against adversarial perturbations. To enhance interpretability, ORACAL adopts PGExplainer [22] to generate subgraph-level explanations that identify the most influential nodes and edges contributing to each vulnerability prediction, enabling auditors to quickly verify the root cause of detected issues.

This paper makes the following key contributions:

• Heterogeneous Multimodal Graph Framework: We propose a novel method to construct a heterogeneous graph (combining CFG, DFG, and Call Graph) and enrich its critical subgraphs using an LLM-based RAG pipeline. This injects expert-level security knowledge directly into the graph structure.

• Causal Attention Learning: We design a dual-branch graph neural network that explicitly disentangles causal features (derived from RAG enrichment) from spurious features (node contents), significantly improving generalizability across different datasets.

• Explainability and Trustworthiness: We enhance transparency by employing PGExplainer to provide subgraph-level explanations for detected vulnerabilities, achieving 32.51% MIoU when evaluated against manually annotated vulnerability-triggering paths.
Furthermore, we evaluate robustness through adversarial attacks, demonstrating that our enriched context significantly reduces the Attack Success Rate (ASR) to approximately 3%, compared to up to 19% for existing heterogeneous GNN detectors (specifically, SCVHunter ASR 18.73%, MANDO-HGT ASR 12.41%, GNN-SC ASR 11.04%, and MTVHunter ASR 10.91%).

• Comprehensive Evaluation: We evaluate ORACAL on large-scale benchmark datasets derived from real-world Ethereum smart contracts (SoliAudit [23], CGTWeakness [24]), along with two out-of-distribution datasets that also serve as explainability benchmarks: the real-world DeFi/NFT vulnerability test set DAppScan [25] and the Line-Level Manually Annotated Vulnerabilities dataset (LLMAV) [26]. Our results show ORACAL achieves state-of-the-art performance with a Macro F1 of 91.28% on the primary test set, improving over prior heterogeneous GNN detectors ranging from MANDO-HGT (82.63%), MTVHunter (78.10%), and GNN-SC (70.74%) to SCVHunter (51.62%) by up to 39 percentage points in F1 [27, 28], and maintains strong generalization on out-of-distribution datasets (Macro F1 of 91.8% on CGTWeakness and 77.1% on DAppScan), confirming its robustness in real-world scenarios.

Paper structure. Section 2 reviews existing approaches to smart contract vulnerability detection and positions our work relative to prior methods. Section 3 presents the ORACAL framework in detail. Section 4 describes our experimental design and evaluation results. Section 5 analyzes threats to the validity of our study, and Section 6 concludes the paper and outlines future research directions.

2. Related Work

This section reviews prior work on smart contract vulnerability detection. As summarized in Table 1, we categorize existing approaches into three groups: Rule-based Static Analysis, Homogeneous Graph Learning, and Heterogeneous Graph Learning.
Compared to these methods, our proposed ORACAL framework introduces a Heterogeneous Multimodal Graph enriched with dynamic RAG-based security knowledge and provides explicit explainability via causal attention and subgraph-level explanations, which are capabilities absent from all prior approaches.

Table 1: Comparison of Our Work with Related Smart Contract Vulnerability Detection Methods

Method          | Year | Category        | Technique                      | Semantic          | Explainability
Oyente [29]     | 2016 | Static Analysis | Symbolic Execution             | No                | Yes
Slither [30]    | 2018 | Static Analysis | Rule-based                     | No                | Yes
Peculiar [31]   | 2019 | Deep Learning   | Homogeneous Graph              | No                | No
GNN-SC [28]     | 2021 | Deep Learning   | Homogeneous Graph              | No                | No
Mando-HGT [32]  | 2023 | Deep Learning   | Heterogeneous Graph            | No                | No
SCVHunter [11]  | 2024 | Deep Learning   | Heterogeneous Graph            | No                | No
MTVHunter [27]  | 2025 | Deep Learning   | Heterogeneous Graph            | Static Knowledge  | No
ORACAL (Ours)   | 2026 | Deep Learning   | Heterogeneous Multimodal Graph | Dynamic Knowledge | Yes

2.1. Static Analysis

As summarized in Table 1, static analysis tools exemplified by Slither [30] and Oyente [29] analyze Solidity source code or EVM bytecode using symbolic execution, control flow reasoning, and predefined vulnerability patterns. These tools provide rule-level traceability and interpretable warnings, which make them suitable for manual auditing workflows. Nevertheless, their detection capability is inherently bounded by handcrafted rules and predefined signatures. As a result, they primarily identify known vulnerability patterns and struggle to generalize to unseen or evolving attack strategies. Ghaleb and Pattabiraman [33] report substantial false-positive and false-negative rates across widely used smart contract analyzers. Their findings indicate that rule-based engines heavily rely on predefined vulnerability patterns, thereby limiting their reliability when analyzing complex real-world contracts.

2.2. Graph Learning Approaches

Homogeneous Graph Models.
To move beyond handcrafted rules, early deep learning approaches model smart contracts as graphs and apply graph neural networks for vulnerability detection. Peculiar [31] constructs a critical data flow graph that focuses specifically on security-sensitive operations and dangerous nodes, aiming to reduce noise and emphasize vulnerability-relevant dependencies. GNN-SC [28] formulates smart contract vulnerability detection as a multi-label classification problem over control flow graphs and employs a Graph Convolutional Network to learn structural representations from execution paths.

These homogeneous graph models improve generalization compared to static analyzers by learning structural features directly from program representations. However, they encode a single type of structural dependency, either data flow or control flow, within a unified graph schema. Such single-view modeling limits the ability to jointly reason about interactions across multiple semantic dimensions, which is often required for detecting complex vulnerabilities involving state transitions, cross-function interactions, or intertwined control and data dependencies.

Heterogeneous Graph Models. To overcome the structural constraints of homogeneous graphs, recent studies construct heterogeneous program graphs that integrate multiple semantic relations within a unified framework. MANDO-HGT [32] builds heterogeneous contract graphs (HCGs) from either source code or bytecode, incorporating control-flow and function-call relations, and employs heterogeneous graph transformers with customized meta-relations to model diverse node and edge types for both contract-level and line-level vulnerability detection.
SCVHunter [11] designs a heterogeneous semantic graph based on intermediate representations and applies a heterogeneous graph attention network to capture structural and semantic dependencies, while allowing lightweight expert guidance to emphasize critical nodes. These approaches demonstrate that multi-relational modeling provides richer contextual representations than single-structure graph encodings.

Despite these structural advances, heterogeneous graph models remain primarily syntax-driven. Node relationships are typically derived from static structural dependencies, without explicitly addressing noise interference or missing semantic information at the bytecode level. MTVHunter [27] introduces a multi-teacher framework to enhance bytecode vulnerability detection, incorporating an instruction denoising teacher and a semantic complementary teacher with neuron distillation to transfer opcode-level knowledge. However, its semantic enhancement remains largely static and task-specific, lacking mechanisms to dynamically adapt to evolving vulnerability patterns or higher-level reasoning requirements.

2.3. Explainability and Robustness Challenges

As highlighted in Table 1, a common failing across almost all graph-based deep learning methods is the lack of explainability. They operate as black boxes, providing a prediction without offering the "why," which is a critical requirement for trust in security auditing. A comprehensive survey by Li et al. [34] highlights that existing explanation techniques for graph neural networks are still insufficient for delivering faithful and human-interpretable justifications. In the context of vulnerability detection, Chu et al. [35] propose a counterfactual explanation framework to identify minimal structural changes that alter model predictions, demonstrating that additional mechanisms are required to interpret graph-based detectors. Similarly, Cao et al.
[36] introduce a causal explanation approach to improve both interpretability and robustness of GNN-based vulnerability detection systems. These studies indicate that explainability is not inherently supported by standard graph neural architectures and must be explicitly incorporated.

Robustness is another critical concern. Graph neural networks depend heavily on structural patterns learned from graph topology. Prior studies have shown that graph-based models can be sensitive to structural perturbations and spurious correlations [37, 14, 15]. Minor modifications to nodes or edges may significantly alter predictions even when program semantics remain unchanged. This structural dependency makes heterogeneous graph models potentially vulnerable to adversarial manipulation, raising concerns for deployment in obfuscated or adversarial blockchain environments.

2.4. Summary

In summary, static analyzers provide interpretable yet rule-constrained detection with limited generalization. Homogeneous graph models improve learning capacity but offer restricted structural coverage, while heterogeneous graph models enhance multi-relational representation yet remain syntax-driven and lack intrinsic semantic reasoning. Graph-based deep learning approaches also face persistent challenges in explainability and robustness. To bridge this gap between structural modeling and deep semantic understanding, ORACAL integrates Retrieval-Augmented Generation (RAG) with LLM-derived security knowledge into heterogeneous graph representations, and employs a causal attention mechanism to distinguish genuine vulnerability indicators from spurious correlations, aiming for accurate, robust, and interpretable detection suitable for practical smart contract auditing.

3. Methodology

We propose ORACAL (Observable RAG-enhanced Analysis with CausAL reasoning), a novel framework that synergizes Heterogeneous Graph Neural Networks with Retrieval-Augmented Generation (RAG) and Causal Attention. The workflow consists of four phases: (1) Graph Construction, (2) Critical Node Selection, (3) RAG-based Semantic Enrichment, and (4) Causal Disentanglement Learning. Figure 1 illustrates the overall architecture.

Given the Solidity source code of a contract, ORACAL first extracts control-flow, data-flow, and call relationships from the code. The pipeline then proceeds through four successive phases:

1. Phase (1) – Graph Construction: Build a heterogeneous contract graph from the extracted homogeneous graphs (CFG, DFG, and call graph) and obtain initial node embeddings with GraphCodeBERT [38].

2. Phase (2) – Critical Node Selection: Compute an importance score for each node and extract a connected top-k subgraph that summarizes the most security-critical execution region while preserving local context.

3. Phase (3) – RAG-based Semantic Enrichment: Translate the selected subgraph into a textual description, query a RAG pipeline over a security corpus, and encode the returned natural-language annotations for nodes and edges with GraphCodeBERT to obtain enriched features.

4. Phase (4) – Causal Attention Learning: Feed both original and enriched node features into a causal-attention GNN with dual branches (causal vs. spurious) to perform vulnerability classification.

The causal-attention GNN in Phase (4) is trained end-to-end under a multi-task loss (Section 3.4.2) that (i) supervises vulnerability prediction from the causal branch, (ii) regularizes the spurious branch towards high-entropy, near-uniform outputs, and (iii) enforces a contrastive separation between the two representations.
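To make the three loss terms concrete, they can be sketched in a simplified, framework-free form. This is a minimal sketch under our own assumptions, not ORACAL's exact implementation: we use mean binary cross-entropy for the multi-label causal branch, a KL-to-uniform penalty (log C minus entropy) for the spurious branch, and a cosine-similarity penalty as the separation term; the weighting coefficients and function names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs, eps=1e-9):
    return -sum(p * math.log(p + eps) for p in probs)

def bce(probs, labels, eps=1e-9):
    # mean binary cross-entropy over the C vulnerability types
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels)) / len(labels)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def multitask_loss(causal_probs, labels, spurious_logits,
                   h_causal, h_spurious, lam_unif=0.1, lam_sep=0.1):
    # (i) supervise vulnerability prediction from the causal branch
    l_cls = bce(causal_probs, labels)
    # (ii) push the spurious branch towards a high-entropy, near-uniform
    #      output: KL(p || uniform) = log C - H(p)
    p_sp = softmax(spurious_logits)
    l_unif = math.log(len(p_sp)) - entropy(p_sp)
    # (iii) contrastive separation: penalize alignment between the causal
    #       and spurious graph representations
    l_sep = max(0.0, cosine(h_causal, h_spurious))
    return l_cls + lam_unif * l_unif + lam_sep * l_sep
```

Under this formulation, a confidently peaked spurious branch or a spurious embedding aligned with the causal one strictly increases the loss, which is the disentanglement pressure the text describes.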
This shared objective encourages ORACAL to ground its decisions in semantically enriched, causally relevant evidence while actively unlearning dataset-specific artifacts, thereby improving robustness under distribution shift. Formally, given the Solidity source code of a contract, ORACAL outputs a multi-label vector ŷ ∈ {0, 1}^C over C vulnerability types (for example, with C = 4, ŷ = [1, 0, 1, 1] indicates that three of the four vulnerability categories are detected for this contract).

3.1. Heterogeneous Graph Construction

Prior work on smart contract analysis has shown that modeling contracts as graphs that capture control and data dependencies substantially improves vulnerability detection compared to token-level or AST-only representations [39, 40]. More recent systems build unified or heterogeneous graphs that integrate AST, CFG, and DFG information and report state-of-the-art performance on large Solidity benchmarks [41, 42]. Heterogeneous graph transformers tailored to smart contracts (exemplified by MANDO-HGT) similarly rely on combining call, control-flow, and semantic relations to capture complex exploit patterns [32]. These results motivate ORACAL's decision to explicitly construct and fuse CFG, DFG, and CG into a single heterogeneous contract graph.

The first phase involves extracting and constructing a heterogeneous graph from Solidity source code. We build on three standard graph representations of code:

• Control Flow Graph (CFG): Nodes represent basic blocks of execution; edges represent control flow paths (jumps, branches). A CFG captures the possible execution order and is effective for detecting logic errors and unreachable code, but it does not explicitly model data dependencies. We extract the CFG using EtherSolve [43], which analyzes EVM bytecode to produce accurate control flow structures, handling jumps and basic blocks efficiently.
• Data Flow Graph (DFG): Nodes represent variables and operations; edges track data usage, definitions, and influences. A DFG is crucial for detecting tainted data and improper computations, notably unsafe arithmetic and unchecked external calls, but it loses direct execution context. We construct the DFG using the Solidity Compiler (solc) [44] to generate the Abstract Syntax Tree (AST), followed by a custom SolidityExtractor that traces data dependencies (variable usage, assignments) and influences.

• Call Graph (CG): Nodes represent functions; edges represent function invocations (caller–callee). A CG captures high-level interaction structure and is vital for reasoning about reentrancy and cross-function data flow. We generate the CG using Slither [30], which identifies both internal and external function calls to map the execution hierarchy.

Figure 1: Overview of the ORACAL Framework. The pipeline proceeds from heterogeneous graph construction to identifying critical nodes, enriching them via RAG, and finally training a causal attention-based GNN.

Node types: The resulting heterogeneous graph G = (V, E) integrates three subgraph types:

• CFG Nodes: Represent basic blocks of EVM opcodes. Features include the list of opcodes and basic block metadata.

• DFG Nodes: Represent variables, expressions, and operations. Features include variable names, types, and values.

• CG Nodes: Represent functions. Features include function names, visibility, and modifiers.

Edge types: We establish cross-graph edges to capture multidimensional relationships based on opcode semantics:

• CFG → CG: Linked via function call opcodes including CALL and DELEGATECALL.

• CFG ↔ DFG: Linked via storage/memory operations. SSTORE/MSTORE link CFG blocks to DFG variables (Write), while SLOAD/MLOAD link DFG variables to CFG blocks (Read).

• CG → CFG: Connects function entry points (CG) to their corresponding usage in control flow (CFG).
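The node and edge typing above can be sketched as a small assembly routine over the three extracted views. The dictionary-based schema and the field names (opcodes, callee, reads, writes, entry_block) are illustrative assumptions rather than ORACAL's actual data format; the opcode-to-edge mapping follows the cross-graph rules listed in the text.

```python
# Opcode groups that trigger cross-graph edges (per the rules above)
CALL_OPS = {"CALL", "DELEGATECALL"}

def build_hetero_graph(cfg_blocks, dfg_vars, cg_funcs):
    """Assemble typed nodes and typed cross-graph edges from three views."""
    nodes, edges = {}, []
    # Typed nodes from each view
    for bid, block in cfg_blocks.items():
        nodes[("cfg", bid)] = {"opcodes": block["opcodes"]}
    for var, info in dfg_vars.items():
        nodes[("dfg", var)] = {"type": info["type"]}
    for fn, info in cg_funcs.items():
        nodes[("cg", fn)] = {"visibility": info["visibility"]}
    # CFG -> CG edges via call opcodes; CFG <-> DFG edges via storage ops
    for bid, block in cfg_blocks.items():
        for op in block["opcodes"]:
            if op in CALL_OPS and block.get("callee"):
                edges.append((("cfg", bid), "calls", ("cg", block["callee"])))
        for var in block.get("writes", []):   # SSTORE/MSTORE targets
            edges.append((("cfg", bid), "write", ("dfg", var)))
        for var in block.get("reads", []):    # SLOAD/MLOAD sources
            edges.append((("dfg", var), "read", ("cfg", bid)))
    # CG -> CFG edges from function entry points
    for fn, info in cg_funcs.items():
        if info.get("entry_block") is not None:
            edges.append((("cg", fn), "entry", ("cfg", info["entry_block"])))
    return nodes, edges
```

In the real pipeline each node in this structure would then receive a 768-dimensional GraphCodeBERT embedding; the sketch stops at the typed topology.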
Node Initialization: We utilize GraphCodeBERT to generate initial embeddings for all nodes. GraphCodeBERT is chosen for its pre-trained understanding of code structure and data flow, providing a rich starting representation (768 dimensions) before any GNN processing.

3.2. Critical Node Selection (Top-k Extraction)

Running RAG over every node in a large heterogeneous contract graph is prohibitively expensive: the cost of LLM inference grows roughly linearly with the number of input and output tokens, and frontier models can differ in price by two orders of magnitude [45, 46]. Analyses of RAG pipelines further highlight that hardware, vector search, and repeated LLM calls together make end-to-end RAG noticeably more expensive than standard neural inference, especially at scale [47, 48]. To optimize computational cost and focus the expensive RAG process on the most relevant parts of the code, we therefore select the top-k critical nodes. For example, when k = 50, only the 50 highest-scoring nodes are retained for subsequent RAG-based semantic enrichment.

This selection uses a hybrid importance score that combines graph-theoretic metrics with learned attention weights. Prior work on graph pooling and vital-node identification has shown that ranking nodes by a combination of structural centrality statistics, such as PageRank and k-core, and attention-based scores leads to more expressive and efficient graph-level representations than using either family of signals alone [49, 50, 51, 52]. Our design follows this line of research by integrating both learned and hand-crafted measures into a single importance score.

Importance Metrics:

1. GAT Attention: Derived from a pre-trained GAT model [53], highlighting nodes that receive high attention weights during preliminary classification.

2. K-Core Decomposition: Identifies the core nodes of the graph based on their degree-based subgraph membership [54].

3. PageRank: Measures node influence and prestige based on recursive connectivity within the graph [55].

4. Community & Centrality: Includes Louvain community detection [56] for identifying functional modularity and Betweenness Centrality [57] for measuring the extent to which a node lies on the shortest paths between others.

These metrics are normalized and combined into a final importance score:

Score(v_i) = Σ_j β_j · norm(metric_j(v_i))    (1)

where the weights β_j sum to 1.

Connected Top-k Selection Algorithm. To ensure that the selected nodes form a connected subgraph, we employ the greedy selection strategy described in Algorithm 1. Given a directed graph G = (V, E) with node-level importance scores, the algorithm incrementally constructs a connected node set S of size at most k.

The algorithm consists of two stages. First, all nodes are sorted in descending order of their importance scores and inserted into a priority queue Q, which determines the selection order. Second, nodes are examined sequentially from Q. The node with the highest importance score is selected to initialize the set S. For each subsequent candidate node v, the algorithm adds v to S only if v is reachable from at least one node already in S in the graph G. This reachability condition is verified using Breadth-First Search (BFS) restricted to the directed edges of G.

The selection process terminates when either k nodes have been added to S or when no remaining nodes in Q satisfy the connectivity constraint. In the case of a disconnected graph, the algorithm returns the largest connected subset of high-importance nodes with cardinality |S| ≤ k. This design guarantees that the resulting subgraph preserves structural connectivity while prioritizing semantically critical nodes, which is essential for effective context aggregation in the subsequent RAG-based semantic enrichment stage.

Algorithm 1 Connected Top-k Node Selection
Require: Directed graph G = (V, E), target number of nodes k, importance scores Score(v) for all v ∈ V
Ensure: A connected node set S with |S| ≤ k
 1: Sort all nodes in descending order of Score(v)
 2: Q ← priority queue containing all nodes sorted by score
 3: S ← ∅
 4: while |S| < k and Q ≠ ∅ do
 5:     v ← dequeue the highest-scoring node from Q
 6:     if S = ∅ then
 7:         S ← S ∪ {v}
 8:     else
 9:         if v is reachable from at least one node in S via BFS on G then
10:             S ← S ∪ {v}
11:         end if
12:     end if
13: end while
14: return S

3.3. RAG-based Semantic Enrichment

Building upon the previously constructed heterogeneous graphs, this phase further augments the representation with external security knowledge through a Retrieval-Augmented Generation (RAG) pipeline.

Traditional heterogeneous contract graphs mainly capture structural and syntactic relations, encompassing control-flow, data-flow, and call edges, which limits the amount of explicit semantic knowledge available to the model. We extend this representation by enriching the graph with security-domain knowledge retrieved via RAG. As shown in Figure 2, panel (a) contains only raw code-derived information, whereas panel (b) illustrates our enriched graph, in which nodes are augmented with semantic explanations and additional semantic edge relations. This produces a more informative input for downstream GNNs, enabling reasoning not only over structure but also over high-level security intent and risk patterns.

Figure 2: Comparison between (a) a standard heterogeneous graph and (b) our enriched heterogeneous graph.

3.3.1. RAG Architecture

Large Language Models (LLMs) can be effective at summarizing code intent, but they may lack specialized, up-to-date domain knowledge and can hallucinate plausible but incorrect details.
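Equation (1) and Algorithm 1 can be prototyped compactly. The sketch below assumes the raw metric values (PageRank, k-core, etc.) are precomputed, applies min-max normalization with β-weighting per Equation (1), and then runs the greedy connected selection with the BFS reachability check; variable names are our own.

```python
from collections import deque

def min_max_norm(raw):
    """Min-max normalize a {node: value} mapping into [0, 1]."""
    lo, hi = min(raw.values()), max(raw.values())
    span = (hi - lo) or 1.0
    return {v: (s - lo) / span for v, s in raw.items()}

def importance_scores(metrics, betas):
    # Eq. (1): Score(v) = sum_j beta_j * norm(metric_j(v)), sum(betas) = 1
    scores = {}
    for name, raw in metrics.items():
        for node, s in min_max_norm(raw).items():
            scores[node] = scores.get(node, 0.0) + betas[name] * s
    return scores

def reachable(adj, sources, target):
    # BFS over directed edges, as in Algorithm 1's connectivity check
    seen, queue = set(sources), deque(sources)
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w == target:
                return True
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False

def connected_top_k(adj, scores, k):
    """Greedy selection in descending score order (Algorithm 1)."""
    S = []
    for v in sorted(scores, key=scores.get, reverse=True):
        if len(S) >= k:
            break
        if not S or reachable(adj, S, v):
            S.append(v)
    return S
```

On a disconnected graph the loop simply skips unreachable candidates, so the returned set is the connected subset of high-importance nodes with |S| ≤ k, matching the termination behavior described above.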
Retrieval-Augmented Generation (RAG) addresses this by grounding generation in an external corpus: the model first retrieves the most relevant passages and then conditions the final output on those retrieved documents [58]. In our setting, this corresponds to (i) embedding the query (node/edge context from the selected subgraph), (ii) retrieving relevant Solidity/EVM/audit materials, and (iii) generating structured, evidence-grounded enrichments.

We leverage LangChain for orchestration, Milvus as the vector store, and Google Gemini 3 Flash (gemini-3-flash-preview) [59] as the LLM. Gemini 3 Flash is a cost-efficient, low-latency model designed for high-throughput generation while maintaining quality comparable to Google Gemini 2.5 Pro in structured reasoning and code-related tasks. Compared to many open-source LLMs, which often require substantially more computational resources and inference time to achieve similar enrichment quality, Gemini 3 Flash provides a favorable balance between speed and output fidelity. In addition, Gemini 3 Flash supports up to 65,536 output tokens per request, which is sufficient to enrich 50–100 nodes and their edges in a single call. This large context window enables batch-level enrichment without fragmentation, thereby reducing API overhead, minimizing latency, and keeping operational costs manageable at scale.

• Vector Store: A domain-specific corpus including Solidity documentation, official EVM specifications, vulnerability taxonomies, and real-world audit reports is constructed and segmented (size = 1000, overlap = 200). The corpus is embedded using BAAI/bge-large-en-v1.5 [60], a high-performance dense retrieval model validated on benchmarks such as MTEB [61]. We adopt BGE due to its instruction-aware contrastive training, which improves alignment between technical queries and structured documentation.
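The corpus segmentation (size = 1000, overlap = 200) amounts to a sliding character window. The actual pipeline uses LangChain's text splitters, which additionally respect separator boundaries; a minimal stand-alone equivalent of the windowing step alone looks like:

```python
def chunk_text(text, size=1000, overlap=200):
    """Fixed-size sliding-window segmentation with overlapping chunks."""
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    step = size - overlap  # each new chunk starts 800 chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```

The 200-character overlap ensures that a vulnerability pattern straddling a chunk boundary still appears intact in at least one chunk, at the cost of roughly 25% corpus duplication.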
This is particularly important for smart contract vulnerability analysis, where subtle semantic differences in opcode behavior and execution flow can lead to distinct security outcomes. The Solidity and EVM specifications provide authoritative protocol semantics [44, 62], while the SWC Registry [63] and public audit datasets such as SC-Bench [64] contribute standardized vulnerability patterns and real exploit evidence. Together, these sources ensure that retrieved context is both formally grounded and empirically informed, thereby enhancing reasoning reliability and reducing hallucination.

• Retrieval and Re-ranking: For each vulnerability-relevant subgraph, we first retrieve the top-10 semantically similar document chunks using dense vector similarity search. Although dense retrieval effectively captures global semantic relevance, it may not sufficiently prioritize fine-grained security-critical cues. To improve precision, we further apply bge-reranker-v2-m3 [65, 66], which adopts a cross-encoder architecture to jointly encode the query and candidate passage, enabling token-level interaction and more accurate relevance estimation. The top-3 re-ranked passages are then selected to construct the final prompt context. This two-stage retrieval strategy enhances evidence fidelity and ensures that the language model is conditioned on highly relevant and security-grounded information.

3.3.2. Prompt Engineering

To guide the LLM effectively, we construct a structured prompt with three distinct sections, as illustrated in Figures 3, 4, and 5. Figure 3 illustrates the CONTEXT part of the prompt, where relevant technical documents retrieved from the knowledge corpus are provided. This section supplies the LLM with the necessary definitions, specifications, and known vulnerability patterns related to the specific opcodes or functions present in the subgraph, acting as an external knowledge base.
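The two-stage retrieve-then-rerank step above (dense top-10, cross-encoder top-3) can be sketched with toy scoring functions; the real pipeline uses bge-large-en-v1.5 embeddings and the bge-reranker-v2-m3 cross-encoder, for which the plain cosine similarity and the `cross_score` callback here are only stand-ins.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def two_stage_retrieve(query_text, query_vec, corpus, cross_score,
                       k_dense=10, k_final=3):
    """Stage 1: keep the k_dense chunks most similar to the query embedding.
    Stage 2: re-rank those candidates with a pairwise (query, chunk) scorer
    and keep the k_final best. corpus is a list of (chunk_text, chunk_vec)."""
    candidates = sorted(corpus, key=lambda c: cosine(query_vec, c[1]),
                        reverse=True)[:k_dense]
    reranked = sorted(candidates, key=lambda c: cross_score(query_text, c[0]),
                      reverse=True)
    return [text for text, _ in reranked[:k_final]]
```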
RAG Prompt Part 1: Context Integration
System Instruction: Based on the provided context, answer the question clearly and concisely.
CONTEXT: [...Dynamically Retrieved Technical Docs, EVM Specifications, Known Vulnerability Patterns...]
QUESTION: {question}

Figure 3: RAG Prompt Structure: Injecting Knowledge Context.

Figure 4 depicts the first part of the QUESTION section. It begins with a detailed textual description of the subgraph structure, listing the critical nodes, their connections, and attributes. Following this, it instructs the LLM to perform a "Step 1: Comprehensive Subgraph Analysis" using a Chain-of-Thought approach. The model is asked to identify execution paths, vulnerability hotspots (like storage writes or external calls), and security patterns before generating individual node enrichments.

RAG Prompt Part 2: Subgraph Description
Role Definition: You are an expert in smart contract security analysis and graph-based vulnerability detection.
Task: Analyze ONE important subgraph (top-K nodes) representing critical execution paths relevant for vulnerability detection.
Input Data (Contextualized Subgraph):
• Structure: {num_nodes} critical nodes, {num_edges} edges.
• Important Nodes: List of nodes with ID, type, importance score, and rank.
• Important Edges: Source → Target relationships.
• Connectivity: Explicit BFS reachability map showing data flow.

Figure 4: RAG Prompt Structure: Subgraph Representation.

Figure 5 presents the second part of the QUESTION section, specifically the instructions for "Step 2: Enrich Nodes" and "Step 3: Enrich Edges". It strictly defines the JSON output format and details the specific fields required for each node and edge, ensuring the output is structured and machine-parsable for integration back into the graph pipeline.

RAG Prompt Part 3: Analysis & Output Schema
Chain-of-Thought Instructions: 1.
Holistic Analysis: Identify critical execution paths, vulnerability hotspots (storage writes, external calls), and flow patterns. 2. Enrich Nodes: Generate semantic meaning, operational context, and security analysis (Max 20 words). 3. Enrich Edges: Generate edge relationship.

Required Output Format (JSON):
{
  "enriched_nodes": {
    "node_id": {
      "semantic_meaning": "...",
      "operational_context": "...",
      "security_analysis": "..."
    }
  },
  "enriched_edges": [
    {
      "edge_id": "...",
      "source": "...",
      "target": "...",
      "edge_relationship": "..."
    }
  ]
}

Figure 5: RAG Prompt Structure: Reasoning Steps and JSON Schema.

3.3.3. Enriched Output Structure

Figures 6a and 6b present examples of the structured JSON output generated by the RAG system for nodes and edges, respectively. The output JSON contains specific fields to ensure traceability and correct mapping back to the heterogeneous graph:

• Identity Fields:
  – node_id (represented by "cfg_1227") and edge_id (represented by "cfg_1227__cfg_1236__control_flow") uniquely identify the elements in the original graph.
  – source and target explicitly define the edge directionality, essential for maintaining the causal flow during the enrichment integration.

• Semantic Fields: We specifically designed the output to contain four distinct fields to balance semantic depth with token efficiency:
  – Semantic Meaning: Captures the "What", representing the high-level intent of the code block (namely, "Token Transfer Logic"). This allows the GNN to understand the functional purpose beyond raw opcodes.
  – Operational Context: Captures the "How", referring to the node's role in the execution state (specifically, "Updates balance storage slot"). This provides state-transition awareness.
  – Security Analysis: Captures the "Risk", detailing specific vulnerability indicators (particularly, "Unchecked external call"). This directly injects domain expert knowledge into the feature space.
  – Edge Relationship: Captures the "Structure", establishing the logical connection between nodes (defined as "Control flow dependency"). This enriches the graph topology with semantic reasoning.

These textual descriptions are encoded via GraphCodeBERT and concatenated with the original embeddings to form the enriched feature set X_enriched ∈ R^1536 (768 original + 768 enriched).

3.4. Causal Attention Learning

Neural models often overfit to dataset artifacts (spurious correlations) that are predictive during training but not causally related to vulnerabilities, including variable naming conventions or functional style patterns. To address this issue, causal attention seeks to disentangle causally relevant features from spurious ones by introducing an intervention-style learning objective and an attention mechanism that emphasizes true causal signals [67, 68, 69], thereby improving robustness under distribution shift and adversarially induced superficial changes. Accordingly, to ensure trustworthy predictions, the model must distinguish between true causal features (valid security logic) and spurious correlations (biases). We implement the CausalAttentionHeteroClassifier, as detailed in Figure 7.

3.4.1. Detailed Layer Architecture

The framework operates through a sequence of specialized layers, each serving a distinct purpose in the disentanglement process:

1. Dual Feature Encoders (Input Processing). Two parallel feed-forward networks (Linear → LayerNorm → ReLU → Dropout) map the input feature space (768 dimensions) to a hidden dimension (128). The Causal Encoder exclusively processes X_enriched (RAG-derived features containing "Security Analysis"), while the Spurious Encoder processes X_original (raw GraphCodeBERT embeddings containing syntax/tokens). This physical separation allows the model to learn distinct representations for semantic logic versus structural patterns.

2. Node Attention Layer (Feature Selection).
A Softmax-based attention mechanism computes two scalar weights α_c and α_s for each node. This layer dynamically weighs the importance of each branch. For example, if a node's original embedding is noisy (say, standard boilerplate code), the model learns to assign a lower α_s and higher α_c, effectively filtering out the noise before it propagates.

3. Feature Fusion Layer (Combination). The weighted features are fused via

h_combined = Linear(Concat(α_c · h_enrich, α_s · h_orig)).

This creates a unified representation that prioritizes the "cleaned" causal signal while retaining necessary structural context for graph propagation.

4. Heterogeneous GNN Layers (Propagation). Stacked HeteroConv + GATConv layers (256 dimensions) propagate the refined causal information across the graph while preserving edge types. Since the input to the GNN is already "cleaned" by the attention layer, the message passing spreads valid vulnerability context, exemplified by "tainted flow", rather than irrelevant syntax patterns.

5. Graph Pooling (Aggregation). Mean pooling aggregates node representations into a graph-level vector to verify the vulnerability status of the entire contract based on its most critical components.

Example of RAG-enriched Node Output:
"cfg_1227": {
  "node_id": "cfg_1227",
  "semantic_meaning": "Control flow node indicating decision branch.",
  "operational_context": "Reached from cfg_1207, leads to cfg_1236 seq-flow.",
  "security_analysis": "Potential reentrancy point (state update post-loop).",
  "enrichment_source": "rag"
}

Example of RAG-enriched Edge Output:
{
  "edge_id": "cfg_1227__cfg_1236__control_flow",
  "source": "cfg_1227",
  "target": "cfg_1236",
  "edge_relationship": "Control flow representing sequential execution.",
  "relation": "control_flow",
  "enrichment_source": "rag"
}

Figure 6: Examples of RAG-enriched outputs: (a) node-level enrichment and (b) edge-level enrichment.

6.
Dual Classifiers (Prediction). Two separate classification heads yield logits z_c (Causal Prediction) and z_s (Spurious Prediction). The Main Prediction for final auditing is taken exclusively from z_c, ensuring decisions are based on the enriched, causal path.

7. Backdoor Adjustment Head (Do-Calculus). To ensure predictions remain stable under different spurious contexts, we simulate do-calculus by intervening on spurious features. We pool node-level causal and spurious representations to graph-level vectors H_c and H_s, then form H'_s by shuffling H_s within the batch (breaking the spurious–label link). An auxiliary classifier MLP_u predicts from the concatenation of causal and intervened spurious features:

z_u = MLP_u(H_c ∥ H'_s).    (2)

Training z_u to match the true label y (via L_causal) encourages the model to rely on stable causal features rather than spurious ones.

3.4.2. Multi-Task Loss Function

The following loss terms support causal attention learning by disentangling causal (enriched) from spurious (original) features and stabilizing predictions under intervention. Sui et al. [68] decompose the graph into causal and trivial attended-graphs via learned masks, train the causal branch to predict the label while pushing the trivial branch to uniformity, and apply backdoor adjustment at the representation level. We adopt the same spirit for vulnerability detection with dual encoders and node attention, and add a contrastive loss to keep the two representations distinct. We minimize a joint loss function:

L_total = L_sup + λ_1 L_unif + λ_2 L_causal + λ_3 L_contra    (3)

where λ_1, λ_2, and λ_3 are hyperparameters that determine the relative importance of each objective.

• L_sup (Supervised Loss): The causal branch prediction z_c is trained to match the true label y via binary cross-entropy (implemented as BCEWithLogitsLoss).
For multi-label targets, over the batch and labels:

L_sup = − E_D [ y⊤ log σ(z_c) + (1 − y)⊤ log(1 − σ(z_c)) ]    (4)

where σ denotes the sigmoid function. This ensures the model learns accurate vulnerability classification from the enriched, causal path.

• L_unif (Uniform Distribution Loss): To discourage reliance on superficial patterns (spurious features), we push the spurious-branch prediction z_s towards maximum uncertainty [70]. For multi-label detection the target is 0.5 per label. We minimize the mean squared error between σ(z_s) and the uniform target:

L_unif = E_D [ ‖σ(z_s) − 0.5‖₂² ]    (5)

By penalizing confident predictions in the spurious branch, the model shifts its decision-making to the causal branch.

• L_causal (Backdoor Adjustment / Do-Calculus): We simulate intervention on spurious features by forming graph-level H_c, H_s, and H'_s (shuffled H_s within the batch). The auxiliary prediction z_u = MLP_u(H_c ∥ H'_s) (Step 7) is trained to match y. Because H'_s is randomized, the model must rely on H_c to predict correctly [68]:

L_causal = − E_D [ y⊤ log σ(z_u) + (1 − y)⊤ log(1 − σ(z_u)) ]    (6)

Figure 7: Detailed Architecture of the Causal Attention HeteroClassifier. The input is split into Causal (h_enrich) and Spurious (h_orig) branches, processed by dual encoders, fused via attention, and refined by GNN layers before final classification.

• L_contra (Contrastive Loss): This loss enforces orthogonality between the learned representations of the two branches [71]. We maximize the Euclidean distance between the enriched representation h_enrich and the original representation h_orig, up to a margin m:

L_contra = max(0, m − ‖h_enrich − h_orig‖₂)    (7)

where m = 0.5. This prevents "feature leakage", where the enriched encoder might lazily copy the original features.
It guarantees that h_enrich captures new, distinct semantic information provided by the RAG process, rather than redundant structural data.

This multi-task objective ensures robust generalization by anchoring predictions in causal semantics while actively unlearning reliance on spurious artifacts.

4. Experiments and Evaluation

4.1. Research Questions

To evaluate ORACAL comprehensively, we address three Research Questions (RQs):
• RQ1: How do enrichment strategies and training paradigms (Standard vs. Causal) affect ORACAL's detection performance and its generalization across in-domain and out-of-distribution datasets?
• RQ2: How effectively does ORACAL generate explanations for its vulnerability predictions?
• RQ3: How does ORACAL compare with state-of-the-art GNN methods (Mando-HGT, SCVHunter, MTVHunter) in terms of detection accuracy and robustness under adversarial attacks?

4.2. Experimental Setup

This experiment is conducted in a Google Colab environment configured with 1× NVIDIA A100 80 GB Tensor Core GPU and an Intel Xeon CPU @ 2.20 GHz (6 physical cores, 12 threads), with approximately 230 GB of system RAM.

4.2.1. Datasets

We utilize four datasets to evaluate our model. SoliAudit and CGTWeakness serve as training and in-domain evaluation sources, while DAppScan and LLMAV are reserved entirely for out-of-distribution (OOD) evaluation and explainability assessment.

• SoliAudit [23]. SoliAudit is a multi-label smart contract vulnerability dataset introduced alongside a hybrid security analysis framework combining machine learning and fuzz testing. After rigorous preprocessing (filtering compilation errors, graph extraction failures, and unsupported versions), the dataset used in this study contains 10,655 valid samples and serves as the primary source for learning fundamental vulnerability patterns.

• CGTWeakness [24].
CGT (Consolidated Ground Truth) is a dataset of 3,103 smart contracts, built by aggregating several previously published benchmarks including CodeSmells, ContractFuzzer, SolidiFI, Zeus, etc. In this dataset, smart contracts were manually labeled by the authors of each original dataset; labels were then unified through voting algorithms to select a consistent label for each contract. After preprocessing and conversion to CFG, DFG, and CG graph representations, 1,345 contracts were successfully retained.

• DAppScan [25]. This is a real-world dataset constructed from 1,199 professional security audit reports, covering 682 decentralized applications (DApps). The DAPPSCAN-SOURCE subset includes 947 labeled Solidity source files. Due to complex dependency structures common in large DApp projects, only 124 contracts were successfully compiled, processed, and converted into graphs.

• LLMAV [26]. The Line-Level Manually Annotated Vulnerabilities (LLMAV) dataset originates from the empirical study by Salzano et al., which provides the largest collection of Solidity smart contracts with manually verified, line-level vulnerability annotations. The original dataset contains 2,081 annotated contracts covering seven DASP vulnerability categories. Unlike SoliAudit and CGTWeakness, which provide only contract-level labels, LLMAV offers fine-grained statement-level ground truth, making it uniquely suited for evaluating both detection generalization and the quality of model-generated explanations against manually annotated vulnerability-triggering paths.

Both DAppScan and LLMAV are excluded from training entirely and serve a dual purpose in our evaluation: they function as out-of-distribution benchmarks for assessing generalization to unseen contract populations, and they provide line-level or audit-level annotations that enable rigorous evaluation of the explainability component.
Table 2 presents the detailed distribution of vulnerabilities across all four datasets. In our multi-label classification context, "Positive" refers to the count of contracts explicitly identified as containing a specific vulnerability type, while "Negative" refers to those free from that specific issue. It is important to note that since a single smart contract can exhibit multiple vulnerabilities simultaneously (consider a contract with both Reentrancy and Arithmetic bugs), the sum of "Positive" samples across categories may exceed the total number of unique contracts. This distinction is crucial for understanding the class imbalance challenges our model must address. Notably, LLMAV exhibits a distinct distribution from the training sources: Arithmetic (472 positive) and Time Manipulation (415 positive) are the most prevalent categories, while Front Running is extremely rare (7 positive), reflecting a distribution that differs substantially from SoliAudit and thus provides a rigorous test of generalization.

4.2.2. Training and Testing Scenarios

To ensure consistency within our multimodal framework, we apply the proposed RAG-based semantic enrichment to all data prior to model training and evaluation. This guarantees that every split is represented in a uniformly enriched semantic space, enabling the model to capture contextual dependencies across multiple vulnerability categories.

For the experimental setup, we employ an 80:20 multi-label stratified split for SoliAudit and CGTWeakness to preserve the joint distribution of vulnerability labels. Specifically, 80% of each dataset is merged to construct a unified training set, while the remaining 20% portions are retained as in-domain hold-out test sets. In contrast, both DAppScan and LLMAV are excluded from training entirely and reserved for evaluation only.
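The idea of an 80:20 multi-label stratified split can be sketched as follows. This is a simplified stand-in, grouping samples by their exact label combination and splitting each group; the paper's actual split procedure (e.g., iterative stratification) is not specified here, so treat this as illustrative only.

```python
import random

def per_labelset_split(samples, labels, test_frac=0.2, seed=42):
    """Approximate multi-label stratified split: group samples by their exact
    label combination, then split each group test_frac : (1 - test_frac), so
    the joint label distribution is roughly preserved across splits."""
    groups = {}
    for s, y in zip(samples, labels):
        groups.setdefault(tuple(y), []).append(s)
    rng = random.Random(seed)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test
```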
DAppScan serves as a real-world out-of-distribution (OOD) benchmark representing production-grade DeFi contracts, while LLMAV provides an OOD benchmark with line-level annotations that additionally enables explainability evaluation. To ensure strict data separation and prevent data leakage, we perform contract address deduplication between LLMAV and the training sources. Of the original 2,081 contracts in the LLMAV dataset, 1,052 share overlapping addresses with contracts already present in SoliAudit or CGTWeakness; these are removed, yielding 1,029 unique contracts for OOD evaluation. This configuration allows us to assess both in-domain detection performance and cross-dataset generalization under realistic deployment conditions, while simultaneously validating the quality of model-generated explanations.

We train the model on four vulnerability classes, namely Arithmetic, Denial of Service, Low-Level Calls, and Time Manipulation, due to data imbalance considerations. This selection is guided by the label distribution analysis in Table 2, where these categories exhibit the most balanced positive-to-negative ratios across all datasets. As a result, they provide sufficient positive samples in both the training sources and the out-of-distribution test sets to support statistically meaningful evaluation. In contrast, other vulnerability types suffer from extreme class imbalance: Reentrancy has 503 positive samples out of 10,655 in SoliAudit but only 8 in DAppScan and 49 in LLMAV; Access Control has 864 positive samples in SoliAudit and only 5 in DAppScan; and Front Running has only 7 positive samples in LLMAV. This imbalance would make per-class F1 scores unreliable due to high variance caused by limited sample sizes. Therefore, focusing on these four classes during training ensures that the reported results reflect actual model capability rather than artifacts of sampling noise.
The detailed label-wise distribution across the resulting training and testing splits is summarized in Table 3.

4.2.3. Statistical Analysis

To compare the highest-performing configurations under the standard and causal attention training paradigms, we evaluate the macro F1 scores obtained from five independent runs with different random seeds. This repeated-run setting accounts for stochastic variability during model training. Since performance metrics in deep learning are often non-normally distributed and we conduct five different runs for each training paradigm, we adopt non-parametric statistical methods [72]. Specifically, we apply the Wilcoxon signed-rank test [73] to assess whether significant differences exist between the paired macro F1 scores of the two models. The test is conducted using an exact two-sided p-value under the null hypothesis of no difference. We consider p < 0.05 statistically significant and p < 0.10 marginally significant.

In addition to significance testing, we report the Vargha–Delaney Â₁₂ effect size [74] to quantify the magnitude of performance differences. This statistic represents the probability that a randomly selected macro F1 score from the causal model exceeds one from the standard model. A value of Â₁₂ = 0.5 indicates no difference, while Â₁₂ ≥ 0.71 corresponds to a large effect. Reporting both p-values and effect sizes provides a more reliable interpretation, particularly with small sample sizes.

4.2.4. Performance Metrics

We formulate vulnerability detection as multi-label classification, where a single smart contract can simultaneously exhibit multiple vulnerability types (e.g., both Reentrancy and Timestamp Dependence), represented as a binary vector over C classes [75]. This reflects the reality of smart contract security, where vulnerabilities are not mutually exclusive.
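The Vargha–Delaney Â₁₂ statistic is simple enough to compute directly (the Wilcoxon signed-rank test itself is available as `scipy.stats.wilcoxon`). A minimal sketch, with ties counted as half:

```python
def a12(treatment, control):
    """Vargha-Delaney effect size: probability that a random value drawn from
    `treatment` exceeds one drawn from `control`; ties contribute 0.5."""
    greater = sum(1 for t in treatment for c in control if t > c)
    ties = sum(1 for t in treatment for c in control if t == c)
    return (greater + 0.5 * ties) / (len(treatment) * len(control))
```

With five macro F1 scores per paradigm, `a12(causal_scores, standard_scores)` returns 0.5 for indistinguishable models and approaches 1.0 when the causal model dominates.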
Table 2: Distribution of Vulnerabilities Across Datasets

Vulnerability Type    SoliAudit (Pos./Neg.)   CGTWeakness (Pos./Neg.)   DAppScan (Pos./Neg.)   LLMAV (Pos./Neg.)
Reentrancy            503 / 10,152            775 / 570                 8 / 116                49 / 980
Access Control        864 / 9,791             778 / 567                 5 / 119                63 / 966
Arithmetic            9,849 / 806             530 / 815                 20 / 104               472 / 557
Low-Level Calls       3,109 / 7,546           801 / 544                 3 / 121                68 / 961
Denial of Service     4,657 / 5,998           776 / 569                 10 / 114               119 / 910
Front Running         1,609 / 9,046           595 / 750                 5 / 119                7 / 1,022
Time Manipulation     3,420 / 7,235           845 / 500                 7 / 117                415 / 614
Total Samples         10,655                  1,345                     124                    1,029

Table 3: Label-wise Distribution of Smart Contract Vulnerabilities in Training (SoliAudit + CGTWeakness) and Testing Sets. DAppScan and LLMAV are used exclusively as OOD test datasets and explainability benchmarks.

Dataset        Vulnerability        Train    Test    Total
SoliAudit      Reentrancy           402      101     503
               Access Control       691      173     864
               Arithmetic           7,879    1,970   9,849
               Low-Level Calls      2,487    622     3,109
               Denial of Service    3,726    931     4,657
               Front Running        1,287    322     1,609
               Time Manipulation    2,736    684     3,420
CGTWeakness    Reentrancy           620      155     775
               Access Control       622      156     778
               Arithmetic           424      106     530
               Low-Level Calls      641      160     801
               Denial of Service    621      155     776
               Front Running        476      119     595
               Time Manipulation    676      169     845
DAppScan       Reentrancy           0        8       8
               Access Control       0        5       5
               Arithmetic           0        20      20
               Low-Level Calls      0        3       3
               Denial of Service    0        10      10
               Front Running        0        5       5
               Time Manipulation    0        7       7
LLMAV          Reentrancy           0        49      49
               Access Control       0        63      63
               Arithmetic           0        472     472
               Low-Level Calls      0        68      68
               Denial of Service    0        119     119
               Front Running        0        7       7
               Time Manipulation    0        415     415

We evaluate ORACAL across three dimensions: detection performance (Accuracy, Precision, Recall, Macro F1), robustness under adversarial attack, and explainability quality. For robustness, we employ the evasion Attack Success Rate (ASR), defined as the fraction of correctly predicted vulnerable samples misclassified as safe (1 → 0 label flip) after perturbation.
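The ASR definition above can be written out per vulnerability class; the function name and the binary-vector inputs are illustrative, not the framework's API.

```python
def attack_success_rate(y_true, pred_clean, pred_adv):
    """Evasion ASR: among samples the model correctly flags as vulnerable
    (label 1 and clean prediction 1), the fraction whose prediction flips
    to safe (0) after the adversarial perturbation."""
    eligible = [(pc, pa) for y, pc, pa in zip(y_true, pred_clean, pred_adv)
                if y == 1 and pc == 1]
    if not eligible:
        return 0.0
    flipped = sum(1 for pc, pa in eligible if pa == 0)
    return flipped / len(eligible)
```

A lower ASR indicates a more robust detector: fewer true positives can be pushed across the decision boundary by the attack.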
For explainability, we adopt the Vulnerability Triggering Path (VTP) framework from Cao et al. [36], using Mean Statement Precision (MSP), Mean Statement Recall (MSR), and Mean Intersection over Union (MIoU) to quantify alignment between model-generated explanations and human-annotated vulnerability paths. All metric definitions are summarized in Table 4.

Table 4: Summary of performance metrics used for evaluation.

Category        Metric      Description
Detection       Accuracy    Jaccard score (intersection over union of predicted and ground-truth label sets per sample).
                Precision   The ratio of correctly predicted positive observations to the total predicted positive observations.
                Recall      The ratio of correctly predicted positive observations to all actual observations in the class.
                F1 Score    The harmonic mean of Precision and Recall, providing a balanced measure of performance.
Robustness      ASR         Attack Success Rate: the percentage of adversarial examples that successfully fool the model by flipping a vulnerable label to safe (1 → 0).
Explainability  MSP         Mean Statement Precision: the proportion of explanatory statements correctly related to the identified vulnerability.
                MSR         Mean Statement Recall: the ability of the explainer to cover all contextual statements within the vulnerability-triggering path.
                MIoU        Mean Intersection over Union: the overall overlap between explanatory and ground-truth statements.

4.2.5. Hyperparameters

Table 5 summarizes the hyperparameters for both model training and the Retrieval-Augmented Generation (RAG) semantic enrichment process. For training, we utilized the AdamW optimizer with a learning rate of 2 × 10⁻⁴ and a batch size of 64. The model architecture consists of 3 GNN layers with a hidden dimension of 256, chosen to maximize representation capacity while minimizing the risk of over-smoothing. The causal loss methodology employs specific weighting factors (λ_unif = 0.1, λ_causal = 0.5, λ_contra = 0.2) to balance standard classification with causal feature disentanglement.

Regarding the semantic enrichment via RAG, we partitioned contract source code into chunks of 1000 characters with an overlap of 200 to preserve contextual continuity. During retrieval, we set k = 10 to extract the most relevant snippets from the external security knowledge base, which were subsequently used to enrich the top 50 high-importance nodes identified through code analysis.

Table 5: Hyperparameters for Model Training and RAG Enrichment

Parameter Type       Parameter                Value
Training Parameters  Epochs                   50
                     Learning Rate            2e-4
                     Batch Size               64
                     Optimizer                AdamW
                     Weight Decay             0.0001
                     Dropout Rate             0.2
                     Hidden Dimension         256
                     GNN Layers               3
                     Enriched Attn Weight     2.0
Loss Parameters      λ_unif                   0.1
                     λ_causal                 0.5
                     λ_contra                 0.2
                     Margin                   0.5
RAG Parameters       Chunk Size               1000
                     Chunk Overlap            200
                     Retrieval k              10
                     Top-k Enrichment Nodes   50

Figure 8: Average code-level metrics per contract across four datasets. Bars represent: LOC (teal), Invocation (blue), StateVars (coral), CFComplexity (orange), and ExtCalls (dark gray).

4.3. Exploratory Data Analysis (EDA)

To further characterize the graph corpus, we report average code and graph metrics across all four data sources (Figure 8) described in Section 4.2.1. Figure 8 compares five metrics across the four datasets and illustrates differences in both size and structural complexity. LOC (Lines of Code) approximates the overall implementation size of a contract. Invocation is the size of the invocation space, i.e., the number of nodes in the call graph, corresponding to the number of functions (or callable entry points) in the contract. StateVars (State Variables) is a proxy for state-space size: it counts DFG nodes of type variable_declaration, including both contract-level state variables and local variables.
CFComplexity (Control Flow Complexity) is the McCabe cyclomatic complexity [76] of the control-flow graph, M = E − N + 2, where E and N are the number of edges and nodes in the CFG, capturing the number of linearly independent execution paths and thus the branching complexity of the contract. ExtCalls is the number of inter-function calls, i.e., edges in the call graph (caller–callee relations), which reflects how intensively a contract interacts with other functions and components.

As shown in Figure 8, DAppScan contains the largest and most complex contracts (LOC 327.2, StateVars 99.7, ExtCalls 14.6), representative of production-grade DeFi code with dense state spaces and frequent external interactions. SoliAudit follows with moderately high values (LOC 257.2, ExtCalls 11.1), reflecting medium-complexity contracts typical of curated security benchmarks. LLMAV presents a notable anomaly: despite lower LOC (225.7) than SoliAudit, it exhibits higher CFComplexity (52.2 vs. 47.7), suggesting denser branching logic per line of code. CGT is the most compact (LOC 185.7) yet carries the highest CFComplexity-to-LOC ratio, indicating algorithmically dense contracts with many independent execution paths. Together, the four corpora span a spectrum from complex production contracts (DAppScan) to compact but structurally dense benchmarks (CGT), providing a rigorous and diverse evaluation setting for graph-based vulnerability detection.

Figure 9: Average graph-scale statistics per contract across four datasets. Bars represent: Node count (teal), Edge count (blue), Node enrichment length in characters (coral), and Edge enrichment length in characters (tan).

Figure 9 summarizes graph-scale statistics across the four datasets. DAppScan produces the largest graphs (682.4 nodes, 565.3 edges), followed by SoliAudit (590.8 nodes, 481.5 edges), CGT (544.6 nodes, 493.6 edges), and LLMAV (493.9 nodes, 456.7 edges).
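The structural metrics used in this EDA follow directly from the graph counts. A minimal sketch (function and argument names are our own, not the framework's):

```python
def code_metrics(cfg_nodes, cfg_edges, cg_nodes, cg_edges):
    """Per-contract structural metrics as defined in Section 4.3."""
    return {
        # McCabe cyclomatic complexity of the CFG: M = E - N + 2
        "CFComplexity": len(cfg_edges) - len(cfg_nodes) + 2,
        # Invocation: number of call-graph nodes (callable entry points)
        "Invocation": len(cg_nodes),
        # ExtCalls: number of call-graph edges (caller-callee relations)
        "ExtCalls": len(cg_edges),
    }
```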
The node-enrichment text length is relatively consistent across datasets, ranging from 201.4 (DAppScan) to 215.1 (CGT) characters, indicating that the RAG pipeline generates similarly concise semantic descriptions regardless of graph complexity. Edge-enrichment text is shorter and more uniform (64.6–73.1 characters). These sizes are large enough to encode meaningful semantic cues, comprising intent, data-flow roles, or security-relevant hints, while remaining well within the token limits of GraphCodeBERT.

4.4. Results and Analysis

4.4.1. RQ1: Ablation Study on the SoliAudit Dataset

Motivation. To understand the contribution of each architectural and enrichment component in ORACAL, we perform a systematic ablation study. Each experiment isolates a specific module or enrichment strategy while keeping the remaining components unchanged. Examining individual components in this controlled manner allows us to determine how each element influences detection accuracy, robustness, and generalization behavior. This analysis clarifies whether performance gains originate from causal training, retrieval-based enrichment, or their interaction.

Method. We compare a baseline model (no enrichment) against seven enrichment variants (Only edge enrichment, All node + edge enrichment, All node enrichment, Only enrichment text, Only Operational Context, Only Security Analysis, and Only Semantic Meaning) under both Standard and Causal training, reporting detection performance and per-class F1 on the SoliAudit test set. We further conduct a structural bias analysis on misclassified samples and measure inference cost (ms/sample) to assess computational trade-offs. For generalization, we select the top-performing checkpoint from each paradigm and evaluate it on CGT, DAppScan, and LLMAV over 5 independent runs. The detailed evaluation protocol is described alongside the results below.

Scenario Definitions.
Each scenario controls which feature modalities are fed into the model. In all enriched scenarios, the original GraphCodeBERT embeddings (X_orig) are always present as the graph backbone; the variants differ in which additional enriched features are concatenated. Specifically: No enrichment (baseline) uses only X_orig; Only enrichment text is the exception—it replaces X_orig entirely with the concatenation of all enriched text fields (no graph features); Only Operational Context / Security Analysis / Semantic Meaning augments X_orig with one specific node-level field; All node enrichment augments X_orig with all three node-level fields; Only edge enrichment augments X_orig with the edge relationship field only; and All node + edge enrichment augments X_orig with all four fields (three node + one edge).

Ablation Study Results. Table 6 reports the quantitative comparison of Standard and Causal training in Macro F1 across the different enrichment configurations. Across all comparable settings, Causal training consistently achieves higher performance than the Standard paradigm. The performance gap between Causal and Standard training ranges from 2.61 to 4.09 percentage points in Macro F1, depending on the enrichment configuration. The largest improvement is observed in the “Only edge enrichment” setting, where Causal training achieves 91.28% compared to 87.19% under Standard training, a difference of 4.09 percentage points. Even in the baseline configuration without enrichment, Causal training improves performance from 86.94% to 89.55%, corresponding to a gain of 2.61 percentage points. Comparing enrichment strategies against their respective baselines further reveals that edge enrichment provides the most substantial benefit. For Causal training, performance increases from 89.55% in the baseline to 91.28% with only edge enrichment.
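The scenario logic above amounts to selecting which feature matrices are concatenated before being passed to the GNN. A minimal sketch, assuming NumPy arrays and our own illustrative field names (the actual pipeline operates on GraphCodeBERT embeddings inside a heterogeneous GNN):

```python
import numpy as np

def build_features(x_orig, enriched, scenario):
    """Assemble node features per ablation scenario (illustrative only).

    x_orig:   (N, d) backbone embeddings.
    enriched: dict of (N, d_e) arrays for the three node-level fields and
              the edge relationship field (shown node-aligned here purely
              for illustration).
    """
    node_fields = ["operational_context", "security_analysis", "semantic_meaning"]
    if scenario == "no_enrichment":
        return x_orig
    if scenario == "only_enrichment_text":  # drops the graph backbone entirely
        return np.concatenate([enriched[k] for k in sorted(enriched)], axis=1)
    if scenario == "all_node":
        return np.concatenate([x_orig] + [enriched[f] for f in node_fields], axis=1)
    if scenario == "all_node_edge":
        return np.concatenate([x_orig] + [enriched[k] for k in sorted(enriched)], axis=1)
    if scenario.startswith("only_"):        # e.g. "only_security_analysis"
        return np.concatenate([x_orig, enriched[scenario[len("only_"):]]], axis=1)
    raise ValueError(f"unknown scenario: {scenario}")

x = np.zeros((5, 768))
enr = {k: np.zeros((5, 64)) for k in
       ("operational_context", "security_analysis",
        "semantic_meaning", "edge_relationship")}
assert build_features(x, enr, "all_node_edge").shape == (5, 768 + 4 * 64)
```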
Under Standard training, performance increases from 86.94% to 87.19%. In contrast, other enrichment types yield smaller gains, indicating that structural semantic information embedded in edges plays a more critical role than additional node-level descriptions. These results suggest that message passing over semantically enriched edges contributes more effectively to vulnerability detection than isolated contextual augmentation.

Per-class performance analysis. We further visualize the performance details through class-wise F1-Score heatmaps for both paradigms on the SoliAudit test set in Figure 10. The heatmaps reveal the performance distribution across four key vulnerability classes: Denial of Service, Time Manipulation, Low-Level Calls, and Arithmetic. For instance, Causal training significantly boosts the detection of Time Manipulation (improving from ∼0.85 in Standard to ∼0.94 in Causal) and Denial of Service (improving from ∼0.85 to ∼0.89). Both paradigms perform exceptionally well on the Arithmetic class (0.96–0.97), but the overall stability and robustness across complex vulnerabilities are visibly superior in the Causal setting. Expectedly, the “Enrichment Only” scenario performs poorly in both, reaffirming that semantic text alone is insufficient without the underlying structural graph context.

Structural bias analysis. To better understand the strengths and limitations of the models across different complexity profiles, we analyze the structural properties of misclassified samples (False Positives and False Negatives) of the causal attention model. Using four graph-level metrics (CFComplexity, StateVars, ExtCalls, and Invocation) defined in Section 4.3, we compute the Mean Relative Bias:

Mean Relative Bias = (FN_avg − FP_avg) / FP_avg    (8)

where FN_avg and FP_avg denote the average structural metric values of False Negative and False Positive samples, respectively.
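Equation (8) is computed directly from the per-group averages of a structural metric; the helper below is our own illustrative sketch:

```python
def mean_relative_bias(fn_avg: float, fp_avg: float) -> float:
    """(FN_avg - FP_avg) / FP_avg: positive -> vulnerabilities tend to be
    missed in structurally more complex contracts; negative -> the model
    over-predicts on complex contracts, producing false alarms."""
    return (fn_avg - fp_avg) / fp_avg

# E.g. false negatives average 4.0 external calls vs 2.0 for false
# positives -> bias = +1.0 (misses skew toward call-heavy contracts).
print(mean_relative_bias(4.0, 2.0))  # prints 1.0
```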
This metric captures the structural divergence between missed vulnerabilities and false alarms, where a positive bias indicates a tendency to miss vulnerabilities in complex contracts, while a negative bias reflects over-prediction. The detailed results across vulnerability types are summarized in Figure 11.

As shown in Figure 11, there is a clear relationship between the model's failure cases and the structural characteristics of specific vulnerabilities. For instance, vulnerabilities heavily reliant on external interactions, notably Time Manipulation and Unchecked External Calls, generally exhibit a high positive bias for EXTCALL and INVOCATION. This indicates that detecting these vulnerabilities becomes disproportionately harder (yielding more False Negatives) when the contract contains extensive external calls and large invocation spaces. Notably, for Unchecked External Calls, the enrichment only scenario displays extreme positive bias (EXTCALL +2.03, INVOCATION +1.44), proving that text-based semantic enrichment completely fails to capture inter-contract dependencies without a structural graph backbone. Conversely, for Arithmetic vulnerabilities, we observe a consistent negative bias across most scenarios for CFCOMPLEXITY (up to −0.71) and STATEVAR (up to −0.73), meaning the models tend to over-predict arithmetic bugs in contracts with highly branched control flows and numerous state variables.

Additionally, different enrichment strategies introduce distinct structural trade-offs. The baseline model exhibits significant positive bias in EXTCALL for Time Manipulation (+2.08) and Arithmetic (+0.39), highlighting its struggle without semantic context. Node-only enrichments (sec analysis only, all node fields) show strong negative bias in CFCOMPLEXITY and STATEVAR, confirming that dense textual features without explicit structural edges can lead to over-prediction in intricate contracts. Notably, the semantic meaning enrichment scenario presents the most balanced structural profile, with relative biases closest to zero across most metrics. Although the all edge fields scenario achieves the highest overall macro F1 score (Table 6), it still retains moderate positive bias towards highly interactive contracts (specifically, Time Manipulation EXTCALL +1.11). In conclusion, this analysis demonstrates that misclassification patterns are fundamentally tied to the structural properties of the vulnerabilities themselves.

Table 6: Ablation Study Comparing Standard and Causal Training on the SoliAudit Test Set. Highlighted rows indicate the models with the highest F1-score (lighter red for Standard and darker red for Causal).

| Training Type | Scenario | Accuracy (%) | F1-Macro (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|---|
| Standard | No enrichment (baseline) | 82.84 | 86.94 | 90.36 | 84.05 |
| Standard | Only edge enrichment | 82.85 | 87.19 | 89.94 | 84.79 |
| Standard | All node + edge enrichment | 82.62 | 86.93 | 89.03 | 85.11 |
| Standard | All node enrichment | 82.02 | 86.74 | 88.37 | 85.29 |
| Standard | Only enrichment text | 72.63 | 72.21 | 79.38 | 67.92 |
| Standard | Only Operational Context | 82.90 | 87.14 | 90.32 | 84.35 |
| Standard | Only Security Analysis | 82.55 | 86.91 | 87.59 | 86.43 |
| Standard | Only Semantic Meaning | 83.09 | 87.13 | 89.14 | 85.42 |
| Causal | No enrichment (baseline) | 85.84 | 89.55 | 91.24 | 88.10 |
| Causal | Only edge enrichment | 87.40 | 91.28 | 92.99 | 89.79 |
| Causal | All node + edge enrichment | 86.59 | 90.70 | 92.63 | 89.00 |
| Causal | All node enrichment | 86.24 | 90.48 | 91.56 | 89.50 |
| Causal | Only enrichment text | 74.94 | 74.75 | 79.54 | 72.65 |
| Causal | Only Operational Context | 86.50 | 90.71 | 92.75 | 88.94 |
| Causal | Only Security Analysis | 87.15 | 90.96 | 91.60 | 90.40 |
| Causal | Only Semantic Meaning | 86.44 | 90.64 | 92.40 | 89.08 |
While causal attention with semantic edge enrichment significantly mitigates these weaknesses, stabilizing detection for vulnerabilities that are heavily dependent on external interactions within complex execution spaces remains a persistent challenge for current graph-based models.

Loss analysis. In Figure 12, we illustrate the training and testing loss trends across epochs to observe convergence differences between the Causal and Standard models. For the Causal models, although they optimize a more complex multi-objective loss function, the test loss curves show that the baseline scenario plateaus at a higher loss value (around 6 × 10⁻¹), whereas structurally enriched scenarios, notably All Edge Fields and All Node Fields, converge to lower test losses (around 5 × 10⁻¹). This indicates that semantic enrichment actively assists the causal objective in discovering better, generalizable semantic-structural representations. Conversely, in Standard training, the baseline model without enrichment achieves the lowest absolute test loss, yet its actual classification F1-score on the test set is lower than the enriched variants. This discrepancy highlights that standard cross-entropy training on purely structural features tends to quickly fit trivial subgraph patterns but struggles to generalize. In contrast, the Causal paradigm coupled with RAG-based enrichment forces the network to learn robust causal features, thereby preventing overfitting and enhancing overall detection accuracy.

Inference Time Comparison. To assess the computational trade-off of our enhancement strategies, we evaluate the inference time per sample (ms/sample) for each scenario, comparing Causal and Standard models as illustrated in Figure 13. The results highlight that while enrichment adds some overhead, the models maintain highly efficient inference speeds overall.
Specifically, baseline scenarios without enrichment exhibit the lowest processing times (1.40 ms for Causal and 1.33 ms for Standard). Most node-only text enrichments, such as Operational Context (Causal: 1.40 ms, Standard: 1.33 ms) and Security Analysis (Causal: 1.39 ms, Standard: 1.32 ms), incur almost no additional delay compared to the baseline.

However, incorporating edge-based enrichment noticeably increases inference latency. The highest times are recorded for the “Only Edge Enrichment” (Causal: 1.90 ms, Standard: 1.80 ms) and “All Node + Edge Enrichment” (Causal: 1.90 ms, Standard: 1.85 ms) variants. This indicates that processing the structural edge attributes and their corresponding semantic descriptions within the graph message-passing layers is the most computationally demanding step. Despite this increase, the maximum inference time remains under 2 milliseconds per sample, confirming that the trade-off is minor and the model remains viable for rapid large-scale deployments.

Figure 10: F1-Score heatmaps comparing Causal and Standard models on the SoliAudit test set.

Figure 11: Summary of structural bias across scenarios.

Generalization Results. To evaluate how well our proposed approaches adapt to different data environments, we select four representative checkpoints and test them on the in-domain CGT benchmark and the two out-of-distribution datasets, DAppScan and LLMAV. Specifically, we evaluate two enrichment scenarios under both training paradigms: (i) “ORACAL-edge”, corresponding to the all edge field scenario that achieves the highest F1 score on the SoliAudit test set (Table 6), and (ii) “ORACAL-node-edge”, corresponding to the all node + edge field scenario that represents our full proposed feature set combining both node-level and edge-level enrichments. For each scenario, both Causal and Standard checkpoints are evaluated, yielding four configurations in total.
We restrict this evaluation to these representative checkpoints rather than all enrichment scenarios due to the substantial computational cost of multi-run evaluation: each 5-run evaluation on three datasets requires approximately 15 full inference passes with associated GPU time for graph construction, embedding computation, and model forward passes. Running all 8 enrichment scenarios under this protocol would increase the total computation eightfold, exceeding the practical budget of our A100 GPU environment. To ensure robust performance estimates and account for variance, we conduct this evaluation over 5 independent runs with different random seeds. Table 7 reports the resulting mean and standard deviation for the different vulnerability classes.

Table 7 presents the Macro F1 scores on the CGT, DAppScan, and LLMAV test sets across five independent runs for all four configurations. Two key observations emerge from this comparison.

First, Causal training consistently outperforms Standard training under both enrichment scenarios. On the CGT test set, the ORACAL-edge Causal model attains a mean Macro F1 of 0.918 compared to 0.883 for ORACAL-edge Standard. The Wilcoxon signed-rank test (exact, two-tailed) yields p = 0.0625. While this p-value is slightly above the conventional 0.05 threshold due to the small sample size (n = 5), every paired difference is positive, indicating directional consistency. The Vargha–Delaney effect size reaches Â₁₂ = 1.00, meaning that a randomly selected run from the Causal model outperforms one from the Standard model with probability 1.00. This pattern holds identically for ORACAL-node-edge (0.908 vs. 0.874), confirming that the advantage of causal disentanglement is not contingent on a specific enrichment configuration.
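The two statistics used here are simple enough to sketch directly; in practice one would call scipy.stats.wilcoxon, but the exact small-n signed-rank p-value and the Vargha–Delaney Â₁₂ can be computed as follows (our own illustrative implementation, assuming distinct, non-zero paired differences):

```python
from itertools import product

def vargha_delaney_a12(xs, ys):
    """P(random draw from xs > random draw from ys); ties count 0.5."""
    wins = sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)
    return wins / (len(xs) * len(ys))

def wilcoxon_exact_two_tailed(diffs):
    """Exact two-tailed p for the signed-rank test (small n, no ties/zeros)."""
    ranks = [sorted(abs(d) for d in diffs).index(abs(d)) + 1 for d in diffs]
    w_obs = sum(r for d, r in zip(diffs, ranks) if d > 0)
    n = len(diffs)
    # Enumerate all 2^n sign assignments and count statistics at least
    # as far from the null mean n(n+1)/4 as the observed one.
    stats = [sum(r for s, r in zip(signs, ranks) if s)
             for signs in product([0, 1], repeat=n)]
    mean_w = n * (n + 1) / 4
    extreme = sum(abs(w - mean_w) >= abs(w_obs - mean_w) for w in stats)
    return extreme / 2 ** n

# Five hypothetical paired Causal-minus-Standard Macro-F1 differences,
# all positive -> the most extreme rank sum, so p = 2/32 = 0.0625.
diffs = [0.031, 0.035, 0.034, 0.036, 0.039]
print(wilcoxon_exact_two_tailed(diffs))  # prints 0.0625
```

With n = 5, p = 0.0625 is the smallest attainable two-tailed value, which is why uniformly positive differences still sit just above the 0.05 threshold.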
On the DAppScan test set, the Causal model again shows superiority across all five runs under both scenarios, achieving mean Macro F1 scores of 0.771 (ORACAL-edge) and 0.759 (ORACAL-node-edge) compared to 0.709 and 0.700 for their Standard counterparts. The Wilcoxon signed-rank test yields p = 0.0625 in both cases, and the effect size remains Â₁₂ = 1.00, confirming that the Causal paradigm produces uniformly higher performance on this real-world OOD benchmark. On the LLMAV dataset, the Causal model achieves mean Macro F1 scores of 0.813 (ORACAL-edge) and 0.802 (ORACAL-node-edge) compared to 0.757 and 0.748 for Standard. This dataset represents a particularly informative generalization test because LLMAV contracts were annotated by a completely independent group of researchers using line-level manual analysis, resulting in a label distribution that differs markedly from the training sources (specifically, Arithmetic constitutes 45.9% of positive labels in LLMAV versus 40.6% in SoliAudit). Across all four evaluated classes, the Causal model consistently exceeds the Standard baseline by 4.3 to 6.0 percentage points under both enrichment scenarios, demonstrating that the causal disentanglement mechanism provides robust gains even when the test distribution deviates substantially from the training data.

Second, ORACAL-edge slightly outperforms ORACAL-node-edge across all datasets.

Figure 12: Comparison of training and test losses across 50 epochs for both Causal and Standard training models on the SoliAudit test set.

Table 7: Generalization Performance (Macro F1, Mean ± Std Dev, 5 runs). ORACAL-edge: best SoliAudit F1 (all edge fields); ORACAL-node-edge: full feature set (all node + edge fields). Causal/Standard refer to training paradigms. Red-highlighted rows indicate the best-detected vulnerability class per dataset.

| Dataset | Vulnerability Class | ORACAL-edge Causal | ORACAL-edge Standard | ORACAL-node-edge Causal | ORACAL-node-edge Standard |
|---|---|---|---|---|---|
| CGT | Arithmetic | 0.982 ± 0.004 | 0.967 ± 0.004 | 0.976 ± 0.005 | 0.959 ± 0.006 |
| CGT | LowLevelCalls | 0.877 ± 0.015 | 0.836 ± 0.017 | 0.869 ± 0.017 | 0.827 ± 0.019 |
| CGT | DenialOfService | 0.882 ± 0.013 | 0.844 ± 0.018 | 0.874 ± 0.015 | 0.835 ± 0.020 |
| CGT | TimeManipulation | 0.921 ± 0.010 | 0.883 ± 0.019 | 0.912 ± 0.012 | 0.874 ± 0.021 |
| CGT | Overall (Macro F1) | 0.918 ± 0.004 | 0.883 ± 0.006 | 0.908 ± 0.006 | 0.874 ± 0.008 |
| DAppScan | Arithmetic | 0.696 ± 0.013 | 0.638 ± 0.012 | 0.688 ± 0.015 | 0.629 ± 0.014 |
| DAppScan | LowLevelCalls | 0.833 ± 0.017 | 0.759 ± 0.018 | 0.824 ± 0.019 | 0.750 ± 0.020 |
| DAppScan | DenialOfService | 0.761 ± 0.019 | 0.700 ± 0.017 | 0.752 ± 0.021 | 0.691 ± 0.019 |
| DAppScan | TimeManipulation | 0.780 ± 0.016 | 0.738 ± 0.018 | 0.771 ± 0.018 | 0.729 ± 0.020 |
| DAppScan | Overall (Macro F1) | 0.771 ± 0.011 | 0.709 ± 0.012 | 0.759 ± 0.013 | 0.700 ± 0.014 |
| LLMAV | Arithmetic | 0.862 ± 0.012 | 0.819 ± 0.015 | 0.854 ± 0.014 | 0.810 ± 0.017 |
| LLMAV | LowLevelCalls | 0.789 ± 0.019 | 0.729 ± 0.024 | 0.780 ± 0.021 | 0.720 ± 0.026 |
| LLMAV | DenialOfService | 0.773 ± 0.022 | 0.718 ± 0.021 | 0.764 ± 0.024 | 0.709 ± 0.023 |
| LLMAV | TimeManipulation | 0.817 ± 0.013 | 0.762 ± 0.016 | 0.808 ± 0.015 | 0.753 ± 0.018 |
| LLMAV | Overall (Macro F1) | 0.813 ± 0.009 | 0.757 ± 0.014 | 0.802 ± 0.011 | 0.748 ± 0.016 |
The all edge field configuration achieves 0.5–1.2 percentage points higher Macro F1 than the all node + edge field configuration under both Causal and Standard paradigms (witnessed in the 0.918 vs. 0.908 on CGT Causal and 0.771 vs. 0.759 on DAppScan Causal). This marginal gap is consistent with the ablation results in Table 6, where edge-only enrichment (91.28% Causal) yields a slightly higher SoliAudit F1 than the full node + edge enrichment (90.70% Causal). The result suggests that while node-level enrichments provide additional semantic context, the edge-level structural semantics contribute more decisively to vulnerability detection, and combining both modalities introduces a small amount of feature redundancy that marginally dilutes the edge signal. Nevertheless, the ORACAL-node-edge configuration remains competitive and may offer advantages in explainability tasks where richer node-level descriptions facilitate interpretation.

Figure 13: Inference time comparison (ms/sample) across Causal and Standard models for various enrichment scenarios on the SoliAudit test set.

Taken together, the statistical analysis reveals that Causal training provides stable, large, and fully consistent generalization gains across all three benchmarks under both enrichment configurations. The maximal effect sizes Â₁₂ = 1.00 across datasets, coupled with directional uniformity in all paired comparisons, indicate that the observed improvements are systematic and reproducible rather than attributable to random variation, despite the conservative nature of the test with limited runs.

Answer to RQ1
Causal training with edge enrichment yields the best performance: 91.28% Macro F1 on SoliAudit (+4.09 percentage points over Standard), with inference under 2 ms/sample. In cross-dataset generalization, ORACAL-edge Causal achieves 0.918 (CGT), 0.771 (DAppScan), and 0.813 (LLMAV), while ORACAL-node-edge Causal follows closely at 0.908, 0.759, and 0.802.
Both consistently outperform Standard counterparts ( ˆ A 12 = 1 . 00), confirming that causal disentanglement provides systematic generaliza- tion gains. 4.4.2. RQ2: Interpretability and Explainability Analysis Motivation. While achieving state-of-the-art detection accu- racy is essential, the ”black-box” nature of deep learning models often obscures the underlying reasoning, which can hinder trust in professional security audits. In the context of smart contract security , a detector must not only identify a vulnerability but also provide actionable e vidence that allows human auditors to quickly verify the root cause. This interpretability is crucial for streamlining the auditing process and ensuring that the model is learning meaningful security logic rather than exploiting spuri- ous syntactic correlations. Method. T o enhance the transparency of our framework and identify the most e ff ective explanation approach, we conduct a comparative study using three complementary explanation methods: GNNExplainer [ 77 ], PGExplainer [ 22 ], and Atten- tionExplainer . The rationale for ev aluating multiple explainers is twofold. First, di ff erent explainers operate under fundamen- tally di ff erent paradigms, and their suitability for heterogeneous graph structures is not kno wn a priori. GNNExplainer optimizes a per-instance soft mask over nodes and edges to identify the most influential subgraph for each individual prediction. PG- Explainer , in contrast, trains a parameterized neural network T able 8: Explanation quality comparison using VTP metrics. Bold values indicate the best-performing explainer for each dataset and metric. 
Dataset Explainer MSP(%) MSR(%) MIoU(%) LLMA V [ 26 ] GNNExplainer 35.86 39.72 27.96 PGExplainer 40.91 44.85 32.51 AttentionExplainer 27.94 31.62 19.17 D AppScan [ 25 ] GNNExplainer 33.74 37.91 26.05 PGExplainer 39.68 42.77 30.85 AttentionExplainer 25.81 29.47 17.08 to produce explanation masks that generalize across instances, av oiding the need for per-sample optimization and thus o ff ering better scalability and consistency . AttentionExplainer directly extracts attention weights from the Causal Attention mechanism learned during training to rank node importance, o ff ering ex- planations aligned with the model’ s internal decision process without requiring additional backward passes. By comparing these three approaches, we can determine which explanation paradigm best captures the vulnerability semantics encoded by ORA CAL ’ s heterogeneous graph structure. Second, raw importance scores from any single explainer are not directly interpretable without a structured ev aluation protocol. T o address this limitation, we adopt the V ulnerability T riggering Path (VTP) ev aluation framework using three met- rics: Mean Statement Precision (MSP), Mean Statement Recall (MSR), and Mean Intersection over Union (MIoU), as defined by Cao et al. [ 36 ]. These metrics quantify the ov erlap between model-generated explanations and manually annotated vulner- ability triggering paths, enabling a fine-grained assessment of explanation quality . Since the SoliAudit and CGTW eakness datasets do not pro- vide line-lev el vulnerability annotations, we ev aluate explanation quality on the LLMA V dataset [ 26 ], which contains manually verified vulnerability locations at the statement le vel. In addition, we include the D AppScan dataset, which has been described in Section 4.2.1 , to assess the rob ustness of explanations in more complex and di verse smart contract scenarios. For each predic- tion, each explainer returns the top-10 nodes with the highest importance scores. 
These nodes are then mapped back to the corresponding source code line numbers via the node-to-line mapping produced during graph construction, yielding a set of predicted vulnerability-relevant statements. This predicted set is compared against the ground-truth vulnerability triggering paths annotated in the dataset to compute the VTP metrics (MSP, MSR, MIoU). All experiments are conducted under the ORACAL-all configuration, which provides the most comprehensive structural representation and achieves the strongest predictive performance.

Quantitative Explainer Evaluation. Table 8 presents the VTP evaluation results on both the LLMAV and DAppScan datasets. The comparative study reveals a clear ranking among the three explanation methods, with PGExplainer emerging as the most effective approach for interpreting ORACAL's predictions.

PGExplainer consistently achieves the highest scores across all metrics on both datasets. On the LLMAV dataset, PGExplainer reaches 40.91% MSP, 44.85% MSR, and 32.51% MIoU, outperforming GNNExplainer by 5.05, 5.13, and 4.55 percentage points, respectively. This advantage stems from PGExplainer's parameterized architecture, which learns a global explanation model from the training data and produces explanation masks that generalize across instances. In contrast, GNNExplainer optimizes a separate mask for each prediction, which limits its ability to capture cross-instance vulnerability patterns. AttentionExplainer yields the weakest performance (19.17% MIoU on LLMAV), indicating that while attention scores reflect the model's learned focus, they do not directly optimize for subgraph-level explanation quality. The attention mechanism prioritizes features useful for classification rather than for isolating complete vulnerability triggering paths, resulting in lower alignment with ground-truth annotations compared to the mask-based explainers.
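The per-sample statement precision, recall, and IoU underlying the MSP/MSR/MIoU averages can be sketched as set operations over predicted and annotated line numbers (our illustrative reading of the metrics, not the reference implementation of Cao et al.):

```python
def vtp_metrics(predicted_lines: set, ground_truth_lines: set):
    """Statement-level precision, recall, and IoU between an explainer's
    predicted lines and the annotated vulnerability-triggering path.
    MSP/MSR/MIoU are the means of these values over all samples."""
    tp = len(predicted_lines & ground_truth_lines)
    precision = tp / len(predicted_lines) if predicted_lines else 0.0
    recall = tp / len(ground_truth_lines) if ground_truth_lines else 0.0
    union = len(predicted_lines | ground_truth_lines)
    iou = tp / union if union else 0.0
    return precision, recall, iou

# Hypothetical sample: explainer flags lines {11, 15, 16, 17, 3}; the
# annotated path is {11, 15, 16, 17, 22} -> p = 0.8, r = 0.8, iou = 4/6.
p, r, iou = vtp_metrics({11, 15, 16, 17, 3}, {11, 15, 16, 17, 22})
```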
Compared to LLMAV, all explainers obtain slightly lower scores on DAppScan (demonstrated by PGExplainer achieving 30.85% MIoU versus 32.51% on LLMAV). This degradation is expected, as DAppScan contains more complex and diverse execution patterns from production-grade DeFi contracts, making it more challenging to precisely localize vulnerability triggering paths. Nevertheless, the relative ranking among explainers remains consistent across both datasets, indicating the robustness and generalizability of PGExplainer's advantage. Based on these findings, we recommend PGExplainer as the primary explanation method for ORACAL, as it combines the strongest quantitative alignment with manually annotated vulnerability triggering paths and the practical benefit of instance-independent explanation generation.

Qualitative Case Study. To complement the quantitative evaluation, Figure 14 visualizes a line-level comparison of the three explainers against the ground truth for contract 0xc5B2508E878af367Ba4957BDBEb2bBc6DA5BB349.sol, a true positive Unchecked Low-Level Calls sample. The contract consists of an oracle contract (AmIOnTheFork) and a splitter contract (EthSplit). The split function (lines 7–19) routes ETH or ETC to different addresses depending on a fork check, using three low-level calls: ethAddress.call.value(msg.value)() at line 11, fees.send(fee) at line 16, and etcAddress.call.value(msg.value-fee)() at line 17. None of these return values are checked, constituting the core vulnerability. The ground truth additionally marks lines 22–23 (a fallback function using throw), line 27 (external contract instantiation), and line 28 (external address assignment) as vulnerability-relevant. Among the three explainers, PGExplainer achieves the closest alignment with the ground truth.
It correctly identifies all three unchecked call sites (lines 11, 16, 17), the fee computation (line 15), the fork-dependent branching logic (lines 9, 13, 14), and the external contract references (lines 27, 28). Crucially, PGExplainer produces very few false positives: it does not flag boilerplate lines such as the pragma directive (line 1) or function signatures that are not directly involved in the vulnerability. GNNExplainer identifies several critical lines (11, 15, 16, 17) but introduces additional false positives on structurally prominent but semantically irrelevant nodes. For instance, it marks line 3 (function forked()) despite this being a simple view function in the oracle contract that does not participate in the unchecked call pattern. This behavior reflects GNNExplainer's per-instance optimization, which can overfit to local structural salience. AttentionExplainer captures some vulnerability-relevant lines (lines 11, 16, 17) but also assigns high importance to the pragma directive (line 1) and the oracle function signature (line 3), which are syntactically prominent but not causally related to the vulnerability. This pattern confirms that attention weights, while useful for classification, do not reliably distinguish between structural importance and vulnerability-specific relevance. This case study demonstrates that PGExplainer's parameterized, globally trained explanation masks provide the most precise and actionable explanations for security auditors, correctly isolating the unchecked low-level call pattern while minimizing noise from structural artifacts.

Answer to RQ2
PGExplainer achieves the best explanation quality (32.51% MIoU on LLMAV, 30.85% on DAppScan), precisely isolating vulnerability-triggering statements ranging from unchecked call.value to send with minimal false positives.
GNNExplainer offers competitive recall but more false positives, while AttentionExplainer assigns importance to syntactically prominent but causally irrelevant nodes. Overall, ORACAL provides reliable, auditor-friendly explanations grounded in true vulnerability semantics.

4.4.3. RQ3: Comparison with SOTA and Robustness under Adversarial Attacks

Motivation. While high detection accuracy is fundamental, the practical utility of a vulnerability detector is limited if it remains susceptible to adversarial evasion or fails to remain competitive against established state-of-the-art architectures. In real-world security auditing, malicious actors may intentionally obfuscate contract logic or introduce semantic perturbations to bypass automated tools. This necessitates a comprehensive evaluation of ORACAL's performance relative to existing SOTA models, alongside an assessment of its structural and textual robustness, to ensure that the framework maintains reliable defense in adversarial environments.

Method. We first conduct a comparative analysis of detection performance against several SOTA graph-based models, including GNN-SC [28], SCVHunter [11], MTVHunter [27], and MANDO-HGT [32], as listed in Table 1 of Section 2. For this comparison, we evaluate three variants of our proposed framework: (i) ORACAL-base, which utilizes only the structural graph modality; (ii) ORACAL-enrich, which relies exclusively on the enriched semantic text modality; and (iii) ORACAL-all, our complete multimodal architecture that fuses node/edge attributes, enriched text, and graph topology.

Figure 14: Line-level explainer comparison for an Unchecked Low Level Calls vulnerability (contract 0xc5B2508...5BB349.sol). Columns show binary importance scores (1 = identified as important, 0 = not) assigned by Ground Truth, GNNExplainer, PGExplainer, and AttentionExplainer. Red values indicate disagreement with the ground truth.

Subsequently, we perform robustness tests by simulating realistic adversarial scenarios. In practice, an attacker may attempt to fool the detector by subtly modifying the contract's control- or data-flow (graph structure) or by introducing adversarial noise into the natural language descriptions (enriched text). Since ORACAL leverages both graph and text modalities, we must evaluate its robustness against both types of perturbations. Typically, adversarial attacks on graphs and text aim to minimize the perturbation budget while maximizing the misclassification rate.

As illustrated in Figure 15, textual adversarial attacks often involve word-level perturbations, replacing semantic indicator keywords with their antonyms (substituting “Critical” with “Safe”, “external” with “internal”, or “before” with “after”). Such modifications are designed to preserve the overall grammatical structure while reversing the semantic meaning to mislead the model's textual modality. Similarly, Figure 16 demonstrates structural attacks, which involve the addition of spurious edges or the removal of critical execution paths, notably adding an edge from CG to DFG or removing a CFG–DFG link. These structural alterations probe the model's reliance on specific graph topologies and its susceptibility to tampering in contract control- and data-flow.

To comprehensively evaluate robustness against these perturbations, we select representative attack methods:

• HSAttack [15]: For heterogeneous architectures (ORACAL, SCVHunter, MANDO-HGT), we employ the Metapath-free Structure Attack (HSAttack). HSA partitions the graph into single-edge-type subgraphs to identify the importance of specific relations without requiring structural metapaths. It then iteratively adds or removes high-impact edges to assess structural stability.
Figure 15: Illustration of textual adversarial perturbation via antonym substitution.

• CAMA [78]: For models using homogeneous graphs or simpler representations (MTVHunter, GNN-SC), we utilize the Class Activation Mapping-based Attack (CAMA). CAMA localizes structural vulnerabilities by producing node-level importance maps, enabling unnoticeable perturbations.

• SubAttack [79]: To evaluate the robustness of text-based modalities (ORACAL-enrich, ORACAL-all), we use SubAttack. This word-level attack identifies semantic indicator keywords and performs antonym substitution while ensuring the adversarial text remains readable.

All three methods are implemented as hard-label black-box evasion attacks, simulating a realistic threat model where the attacker has no access to the model's internal parameters and strikes during the inference phase. In this context, we evaluate robustness using the Attack Success Rate (ASR), as defined in Table 4.

Figure 16: Structural adversarial perturbation involving edge addition and removal.

Robustness evaluation setting. Due to limited computational resources, we evaluate robustness using a single representative setting for each attack type across all experiments. For structural perturbations (HSAttack and CAMA), we set the edge budget to 0.1, meaning the attacker can add or remove edges up to 10% of the total edge count in the original graph. For textual perturbations (SubAttack), we restrict the keyword budget to 10 words per node enrichment, ensuring that the adversarial modifications remain unnoticeable to human auditors while testing the model's reliance on semantic cues. The application of these attacks is model-dependent: CAMA is applied to models targeting homogeneous representations (GNN-SC and MTVHunter), as it was specifically designed for such architectures.
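The budgeted word-level antonym substitution illustrated in Figure 15 can be approximated by a simple replacement pass; the antonym table and function below are our own illustrative stand-ins mirroring the figure's examples, not SubAttack's actual keyword-selection algorithm:

```python
# Hypothetical antonym table based on the examples in Figure 15.
ANTONYMS = {"critical": "safe", "external": "internal", "before": "after"}

def substitute_keywords(text: str, budget: int = 10) -> str:
    """Word-level perturbation: swap up to `budget` indicator keywords
    for their antonyms, preserving grammar but flipping the semantics."""
    out, used = [], 0
    for word in text.split():
        key = word.lower().strip(".,;:")
        if used < budget and key in ANTONYMS:
            repl = ANTONYMS[key]
            out.append(repl.capitalize() if word[0].isupper() else repl)
            used += 1
        else:
            out.append(word)
    return " ".join(out)

print(substitute_keywords("Critical external call executes before state update."))
# prints: Safe internal call executes after state update.
```

The 10-keyword budget above matches the setting used in our robustness evaluation; a real attack would additionally query the model to pick substitutions that maximize misclassification.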
In contrast, HSAttack is utilized for the remaining heterogeneous models, including SCVHunter, MANDO-HGT, ORACAL-base, and the full ORACAL-all variant. Finally, adversarial text attacks (SubAttack) are exclusively performed on models that incorporate enriched natural language input, specifically ORACAL-enrich and ORACAL-all.

Detection performance comparison. Table 9 (Original setting) summarizes the detection accuracy of ORACAL compared to SOTA methods across the three datasets. ORACAL-all consistently achieves the highest Macro F1 scores (90.48% on SoliAudit, 90.83% on CGTWeakness, and 72.82% on DAppScan), outperforming baseline models such as MANDO-HGT (82.63%), MTVHunter (78.10%), and SCVHunter, which exhibits markedly lower performance on the real-world DAppScan dataset (28.44%).

Modality-specific analysis reveals that while structural information (ORACAL-base) is effective for curated benchmarks, semantic enrichment (ORACAL-enrich) provides superior invariance on complex, out-of-distribution datasets like DAppScan. By fusing both, ORACAL-all achieves 72.82% F1 on DAppScan, a 10.82 percentage-point lead over MANDO-HGT, demonstrating that multimodal integration is essential for capturing high-level logic that generalizes to industrial-grade contracts.

This trend is further supported by the Receiver Operating Characteristic (ROC) curves shown in Figure 20. On the SoliAudit dataset, ORACAL-all achieves a high Area Under the Curve (AUC) of 0.96, followed by ORACAL-base (0.95), while SCVHunter exhibits a lower curve with an AUC of 0.73. A similar trend is observed on CGTWeakness, where ORACAL-all maintains an AUC of 0.96. A notable performance gap appears on the real-world DAppScan dataset, where current SOTA methods exhibit limitations: SCVHunter's AUC decreases to 0.39, while ORACAL-all maintains resilience with an AUC of 0.76.
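For reference, the two headline metrics above can be stated simply: Macro F1 averages per-class F1 scores uniformly, and AUC equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A minimal, dependency-free sketch (illustrative only, not the paper's evaluation code):

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

def auc(y_true, scores):
    """AUC as the rank statistic P(score_pos > score_neg), ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The rank-statistic form of AUC makes the DAppScan gap concrete: an AUC of 0.39 means the detector ranks a random vulnerable contract above a random benign one less often than chance.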
The steep slope of ORACAL's curve near the y-axis across all datasets demonstrates its ability to maintain a high True Positive Rate (TPR) at low False Positive Rates (FPR), which is practical for security tools seeking to minimize manual audit overhead.

Robustness results. The results of the evasion attacks are summarized in Table 9. Quantitative analysis reveals that ORACAL-all maintains stability compared to SOTA GNN models. While SCVHunter exhibits a significant performance degradation under HSAttack, with a Macro F1 decrease of up to 18.58% and an Attack Success Rate (ASR) of 18.73% on SoliAudit, ORACAL-all remains relatively stable. Specifically, under the same HSAttack structural perturbation, ORACAL-all's F1 score decreases by only 2.35% on SoliAudit and 2.18% on DAppScan, maintaining an ASR between 2.88% and 3.21%.

The robustness of ORACAL-all is also evident in the textual and combined attack scenarios. As highlighted in Table 9, the model exhibits exceptional resilience against textual adversarial perturbations (SubAttack), achieving the lowest ASR (~2.0%) and minimal F1 degradation (1.11%–1.84%) across all datasets. Even in the "Both" scenario, where structural and textual attacks are applied simultaneously to simulate a worst-case evasion attempt, ORACAL-all preserves a high Macro F1 score of 85.91% on SoliAudit, with its performance degradation (-4.57%) still significantly lower than that of any single structural attack on baseline SOTA models (exemplified by the -9.69% decrease for MANDO-HGT).

This notable robustness can be attributed to two core mechanisms in ORACAL's architecture: (i) Multimodal Redundancy: the fusion of diverse modalities (graph topology and semantic enrichment) provides a "fail-safe" mechanism. If an attacker corrupts the graph structure, the model can still rely on the invariant semantic cues in the enriched text, and vice versa.
(ii) Causal Attention Learning: by disentangling causal invariant features from spurious correlations, ORACAL's training paradigm filters out adversarial noise that does not align with the true underlying causes of vulnerabilities, thereby preventing the model from being easily misled by superficial perturbations.

Answer to RQ3: ORACAL-all achieves the highest detection accuracy (90.83% F1 on CGTWeakness, ROC-AUC up to 0.96) and superior adversarial robustness compared to SOTA: while SCVHunter degrades by up to 18.58% F1 under HSAttack (ASR 18.73%), ORACAL-all limits degradation to 2.35% (ASR ~3%). This resilience stems from ORACAL's multimodal redundancy and causal attention, which filter out structural and semantic adversarial noise without relying on spurious correlations.

Figure 17: ROC Curve - SoliAudit
Figure 18: ROC Curve - CGTWeakness
Figure 19: ROC Curve - DAppScan
Figure 20: ROC-AUC comparison across datasets.

5. Threats to Validity

This section discusses threats to internal, external, construct, and conclusion validity, along with our mitigation strategies.

5.1. Internal Validity

• Data Preprocessing and Extraction: The reliance on compilation and AST extraction may bias our datasets toward well-structured contracts. We mitigate this by following standard preprocessing pipelines and ensuring our extraction scripts are deterministic for reproducibility.

• RAG Enrichment and LLM Stochasticity: Semantic enrichment quality depends on corpus composition and LLM stochasticity. To control for variability, we fix all random seeds, use zero temperature for LLM inference, construct the corpus from authoritative sources, and apply a two-stage retrieval with re-ranking.

• Checkpoint Selection for Generalization: Evaluating all configurations is computationally expensive, so generalization evaluation uses only the top-performing checkpoint from each paradigm.
We mitigate this limitation by selecting checkpoints based on primary benchmark performance to represent practically relevant models.

5.2. External Validity

• Generalizability: Our evaluation focuses on Solidity datasets covering DASP categories, which may not transfer to other languages or novel vulnerabilities ranging from flash loans to MEV. We address this by evaluating across structurally diverse datasets, including two independent out-of-distribution datasets (DAppScan and LLMAV), demonstrating generalization across distinct collection and labeling methodologies.

• Temporal Concept Drift: The model is trained on static snapshots, risking performance degradation as the Solidity ecosystem evolves and new vulnerabilities emerge. We partially mitigate this through ORACAL's modular RAG corpus, which can be updated independently, and its causal attention mechanism designed to learn invariant semantics, improving resilience to distributional shifts.

5.3. Construct Validity

• Label Reliability: Ground truth labels are derived from heterogeneous sources, introducing potential noise. We mitigate this by using consolidated datasets with multi-tool voting and cross-referencing automated labels with manual case studies to validate semantic consistency.

• Explainability Ground Truth: Explainability evaluation relies on LLMAV's manual line-level annotations, which involve subjective boundary decisions. We address this by evaluating on both LLMAV and DAppScan, demonstrating consistent explainer rankings across independent datasets and reducing the likelihood of methodology artifacts.

• Adversarial Budget Scope: We evaluate robustness under fixed perturbation budgets, while real-world adversaries may operate under different constraints. We acknowledge this boundary condition and note that evaluating a range of budgets would provide a more complete robustness profile.
• Evaluation Metrics: Standard metrics may not capture all aspects of practical auditing (specifically, Macro F1 treats all classes equally regardless of severity). We supplement aggregate metrics with per-class breakdowns, qualitative case studies, and structural bias analyses.

5.4. Conclusion Validity

• Statistical Soundness: The out-of-distribution test sets are relatively small, and Wilcoxon signed-rank tests (n = 5) yield p-values constrained by a discrete distribution. We complement p-values with Vargha-Delaney Â12 effect sizes, where consistently maximal values (Â12 = 1.00) provide strong evidence of systematic gains.

• Implicit Overfitting through Iterative Design: Repeated evaluation may inadvertently guide architectural choices toward test-specific optimizations. We mitigate this by maintaining strict train-test separation, reserving two independent OOD datasets (DAppScan and LLMAV) unused during model selection, and reporting performance across multiple independent benchmarks.

Table 9: Evasion attack performance comparison. Values in parentheses indicate the relative F1 decrease vs. the Original setting. The light-red row indicates the method achieving the highest Original F1 score (ORACAL-all). The dark-red row indicates the attack scenario with the lowest F1 degradation and ASR (ORACAL-all under SubAttack). "–" denotes not applicable (original, pre-attack setting; ASR is undefined).
Model          | Method    | SoliAudit F1(%) / ASR(%) | CGTWeakness F1(%) / ASR(%) | DAppScan F1(%) / ASR(%)
GNN-SC         | Original  | 70.74 / –                | 70.20 / –                  | 66.47 / –
GNN-SC         | HSAttack  | 61.92 (-8.82) / 11.04    | 61.07 (-9.13) / 11.62      | 57.90 (-8.57) / 10.31
MTVHunter      | Original  | 78.10 / –                | 78.73 / –                  | 58.94 / –
MTVHunter      | HSAttack  | 69.65 (-8.45) / 10.91    | 69.82 (-8.91) / 11.47      | 51.92 (-7.02) / 9.73
SCVHunter      | Original  | 51.62 / –                | 50.20 / –                  | 28.44 / –
SCVHunter      | HSAttack  | 33.04 (-18.58) / 18.73   | 32.71 (-17.49) / 19.16     | 17.29 (-11.15) / 12.02
MANDO-HGT      | Original  | 82.63 / –                | 83.77 / –                  | 62.00 / –
MANDO-HGT      | HSAttack  | 72.94 (-9.69) / 12.41    | 73.15 (-10.62) / 13.07     | 54.47 (-7.53) / 10.11
ORACAL-base    | Original  | 89.35 / –                | 90.32 / –                  | 63.65 / –
ORACAL-base    | HSAttack  | 85.54 (-3.81) / 4.22     | 86.33 (-3.99) / 4.51       | 60.47 (-3.18) / 4.63
ORACAL-enrich  | Original  | 74.60 / –                | 76.83 / –                  | 68.10 / –
ORACAL-enrich  | SubAttack | 71.18 (-3.42) / 4.68     | 73.02 (-3.81) / 4.94       | 64.61 (-3.49) / 5.03
ORACAL-all     | Original  | 90.48 / –                | 90.83 / –                  | 72.82 / –
ORACAL-all     | HSAttack  | 88.13 (-2.35) / 2.94     | 88.07 (-2.76) / 3.21       | 70.64 (-2.18) / 2.88
ORACAL-all     | SubAttack | 88.81 (-1.67) / 2.13     | 88.99 (-1.84) / 2.27       | 71.71 (-1.11) / 1.96
ORACAL-all     | Both      | 85.91 (-4.57) / 5.71     | 85.36 (-5.47) / 6.04       | 68.61 (-4.21) / 5.31

6. Conclusion and Future Work

In this paper, we presented ORACAL, a novel framework for smart contract vulnerability detection that bridges the gap between graph-based structural analysis and LLM-based semantic reasoning. By constructing a comprehensive heterogeneous graph and enriching critical nodes with "Operational Context" and "Security Analysis" via a trusted RAG pipeline, ORACAL achieves state-of-the-art detection performance, reaching a Macro F1 of 91.28% on the SoliAudit test set (with 97.34% per-class F1 on Arithmetic bugs). Our Causal Attention mechanism effectively disentangles true vulnerability signals from spurious correlations, yielding consistent generalization gains across diverse datasets: 91.8% Macro F1 on CGT, 77.1% on DAppScan, and 81.3% on LLMAV, outperforming the Standard paradigm by 3.5–6.2 percentage points across all benchmarks. Beyond detection accuracy, ORACAL demonstrates strong explainability and robustness.
In explainability evaluation, PGExplainer achieves 32.51% MIoU on the line-level annotated LLMAV dataset and 30.85% on DAppScan, providing actionable, auditor-friendly explanations that precisely isolate vulnerability-triggering statements such as unchecked low-level calls. In adversarial robustness testing, ORACAL-all maintains an Attack Success Rate (ASR) of approximately 2–3% under both structural (HSAttack) and textual (SubAttack) perturbations, compared to up to 18.73% ASR for existing methods such as SCVHunter, confirming that the multimodal architecture and causal training paradigm jointly provide effective defense against adversarial evasion.

Future work will focus on three directions: (1) extending the framework to node classification to pinpoint the exact locations of vulnerabilities within the code (line-level detection), (2) employing established evaluation metrics [80] to systematically assess the faithfulness and relevance of RAG-generated semantic descriptions, ensuring that the enriched content remains grounded in technical truth, and (3) exploring incremental learning techniques [81] to efficiently integrate new samples without full retraining. ORACAL represents a significant step towards transparent, automated, and explainable security auditing for the blockchain ecosystem.

Acknowledgement

This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number NCM2025-26-01. Phan The Duy was funded by the Postdoctoral Scholarship Programme of Vingroup Innovation Foundation (VINIF), VinUniversity, code VINIF.2025.STS.20.

References

[1] SecurityWeek. 223 million stolen in Cetus Protocol hack. https://www.securityweek.com/223-million-stolen-in-cetus-protocol-hack/, May 2025. Online.
[2] Halborn. Explained: The Zoth hack. https://www.halborn.com/blog/post/explained-the-zoth-hack-march-2025, March 2025. Online.
[3] CoinDesk. Truebit token crashes 99.9% after USD 26.6M exploit drains 8,535 ETH. https://www.coindesk.com/markets/2026/01/09/truebit-token-tru-crashes-99-9-after-usd26-6m-exploit-drains-8-535-eth, Jan 2026. Online.
[4] BlockSec. $17M closed-source smart contract exploit: Arbitrary-call vulnerability in SwapNet and Aperture Finance. https://blocksec.com/blog/17m-closed-source-smart-contract-exploit-arbitrary-call-swapnet-aperture, Jan 2026. Online.
[5] SlowMist. 2025 blockchain security and AML annual report. https://lazarus.day/media/post/files/2026/01/02/2025-Blockchain-Security-and-AML-Annual-ReportEN.pdf, December 2025. Online.
[6] Niosha Hejazi and Arash Habibi Lashkari. A comprehensive survey of smart contracts vulnerability detection tools: Techniques and methodologies. Journal of Network and Computer Applications, 237:104142, 2025.
[7] Gerardo Iuliano and Dario Di Nucci. Smart contract vulnerabilities, tools, and benchmarks: An updated systematic literature review. Journal of Systems and Software, page 112788, 2026.
[8] Michael Bresil, Pwc Prasad, Md Shohel Sayeed, and Umar Ali Bukar. Deep learning-based vulnerability detection solutions in smart contracts: A comparative and meta-analysis of existing approaches. IEEE Access, 2025.
[9] Joao Crisostomo, Fernando Bacao, and Victor Lobo. Machine learning methods for detecting smart contracts vulnerabilities within Ethereum blockchain: A review. Expert Systems With Applications, 268:126353, 2025.
[10] Ben Wang, Yanxiang Tong, Shunhui Ji, Hai Dong, Xiapu Luo, and Pengcheng Zhang. A review of learning-based smart contract vulnerability detection: A perspective on code representation. ACM Transactions on Software Engineering and Methodology, 2025.
[11] Feng Luo, Ruijie Luo, Ting Chen, Ao Qiao, Zheyuan He, Shuwei Song, Yu Jiang, and Sixing Li.
SCVHunter: Smart contract vulnerability detection based on heterogeneous graph attention network. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024.
[12] Zhaoyi Meng, Zexin Zhang, Wansen Wang, Jie Cui, and Hong Zhong. SmartScope: Smart contract vulnerability detection via heterogeneous graph embedding with local semantic enhancement. Expert Systems with Applications, page 129857, 2025.
[13] He Zhao, Zhiwei Zeng, Yongwei Wang, Deheng Ye, and Chunyan Miao. HGAttack: Transferable heterogeneous graph adversarial attack. In 2024 IEEE International Conference on Agents (ICA), pages 100–105. IEEE, 2024.
[14] Yuling Wang, Zihui Chen, Pengfei Jiao, and Xiao Wang. HeTa: Relation-wise heterogeneous graph foundation attack model. arXiv preprint arXiv:2506.07428, 2025.
[15] Haoran Li, Jian Xu, Long Yin, Qiang Wang, Yongzhen Jiang, and Jingyi Liu. Metapath-free adversarial attacks against heterogeneous graph neural networks. Information Sciences, 713:122143, 2025.
[16] Zeyu Yang, Zhao Meng, Xiaochen Zheng, and Roger Wattenhofer. Assessing adversarial robustness of large language models: An empirical study. arXiv preprint arXiv:2405.02764, 2024.
[17] Muhammad Shahid Jabbar, Sadam Al-Azani, Abrar Alotaibi, and Moataz Ahmed. Red teaming large language models: A comprehensive review and critical analysis. Information Processing & Management, 62(6):104239, 2025.
[18] Zongwei Li, Xiaoqi Li, Wenkai Li, and Xin Wang. SCALM: Detecting bad practices in smart contracts through LLMs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 470–477, 2025.
[19] Fahad Al Debeyan, Tracy Hall, and Lech Madeyski. Emerging results in using explainable AI to improve software vulnerability prediction. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, pages 561–565, 2025.
[20] Hanouf Al Ghanmi, Sabreen Ahmadjee, and Rami Bahsoon. Evaluating the need for explanations in blockchain smart contracts to reconcile surprises. ACM Transactions on Software Engineering and Methodology, 34(8):1–35, 2025.
[21] Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications, 10(1):1096, 2019.
[22] Dongsheng Luo, Wei Cheng, Dongkuan Xu, Wenchao Yu, Bo Zong, Haifeng Chen, and Xiang Zhang. Parameterized explainer for graph neural network. Advances in Neural Information Processing Systems, 33:19620–19631, 2020.
[23] Jian-Wei Liao, Tsung-Ta Tsai, Chia-Kang He, and Chin-Wei Tien. SoliAudit: Smart contract vulnerability assessment based on machine learning and fuzz testing. In 2019 Sixth International Conference on Internet of Things: Systems, Management and Security (IOTSMS), pages 458–465. IEEE, 2019.
[24] Monika Di Angelo and Gernot Salzer. Consolidation of ground truth sets for weakness detection in smart contracts. In International Conference on Financial Cryptography and Data Security, pages 439–455. Springer, 2023.
[25] Zibin Zheng, Jianzhong Su, Jiachi Chen, David Lo, Zhijie Zhong, and Mingxi Ye. DAppSCAN: Building large-scale datasets for smart contract weaknesses in DApp projects. IEEE Transactions on Software Engineering, 50(6):1360–1373, 2024.
[26] Francesco Salzano, Cosmo Kevin Antenucci, Simone Scalabrino, Giovanni Rosa, Rocco Oliveto, and Remo Pareschi. An empirical analysis of vulnerability detection tools for Solidity smart contracts using line-level manually annotated vulnerabilities. arXiv preprint, 2025.
[27] Guokai Sun, Yuan Zhuang, Shuo Zhang, Xiaoyu Feng, Zhenguang Liu, and Liguo Zhang. MTVHunter: Smart contracts vulnerability detection based on multi-teacher knowledge translation.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 15169–15176, 2025.
[28] Yoo-Young Cheong, Jihwan Shin, Taekyung Kim, Jinhyun Ahn, Dong-Hyuk Im, et al. GNN-based Ethereum smart contract multi-label vulnerability detection. In 2024 International Conference on Information Networking (ICOIN), pages 57–61. IEEE, 2024.
[29] Loi Luu, Duc-Hiep Chu, Hrishi Olickel, Prateek Saxena, and Aquinas Hobor. Making smart contracts smarter. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 254–269, 2016.
[30] Josselin Feist, Gustavo Grieco, and Alex Groce. Slither: A static analysis framework for smart contracts. In 2019 IEEE/ACM 2nd International Workshop on Emerging Trends in Software Engineering for Blockchain (WETSEB), pages 8–15. IEEE, 2019.
[31] Hongjun Wu, Zhuo Zhang, Shangwen Wang, Yan Lei, Bo Lin, Yihao Qin, Haoyu Zhang, and Xiaoguang Mao. Peculiar: Smart contract vulnerability detection based on crucial data flow graph and pre-training techniques. In 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE), pages 378–389. IEEE, 2021.
[32] Hoang H Nguyen, Nhat-Minh Nguyen, Chunyao Xie, Zahra Ahmadi, Daniel Kudendo, Thanh-Nam Doan, and Lingxiao Jiang. MANDO-HGT: Heterogeneous graph transformers for smart contract vulnerability detection. In 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), pages 334–346. IEEE, 2023.
[33] Asem Ghaleb and Karthik Pattabiraman. How effective are smart contract analysis tools? Evaluating smart contract static analysis tools using bug injection. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 415–427, 2020.
[34] Xuyan Li, Jie Wang, and Zheng Yan. Can graph neural networks be adequately explained? A survey. ACM Computing Surveys, 57(5):1–36, 2025.
[35] Zhaoyang Chu, Yao Wan, Qian Li, Yang Wu, Hongyu Zhang, Yulei Sui, Guandong Xu, and Hai Jin. Graph neural networks for vulnerability detection: A counterfactual explanation. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 389–401, 2024.
[36] Sicong Cao, Xiaobing Sun, Xiaoxue Wu, David Lo, Lili Bo, Bin Li, and Wei Liu. Coca: Improving and explaining graph neural network-based vulnerability detection systems. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024.
[37] Jiarong Xu, Junru Chen, Siqi You, Zhiqing Xiao, Yang Yang, and Jiangang Lu. Robustness of deep learning models on graphs: A survey. AI Open, 2:69–78, 2021.
[38] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. GraphCodeBERT: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366, 2020.
[39] Yuan Zhuang, Zhenguang Liu, Peng Qian, Qi Liu, Xiang Wang, and Qinming He. Smart contract vulnerability detection using graph neural networks. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pages 3283–3290, 2021.
[40] Zhenguang Liu, Peng Qian, Xiaoyang Wang, Yuan Zhuang, Lin Qiu, and Xun Wang. Combining graph neural networks with expert knowledge for smart contract vulnerability detection. IEEE Transactions on Knowledge and Data Engineering, 35(2):1296–1310, 2021.
[41] Aria Seo, Young-Tak Kim, Ji Seok Yang, YangSun Lee, and Yunsik Son. Software weakness detection in Solidity smart contracts using control and data flow analysis: A novel approach with graph neural networks. Electronics (2079-9292), 13(16), 2024.
[42] Jingjie Xu, Ting Wang, Mingqi Lv, Tieming Chen, Tiantian Zhu, and Baiyang Ji.
MVD-HG: Multigranularity smart contract vulnerability detection method based on heterogeneous graphs. Cybersecurity, 7(1):55, 2024.
[43] Filippo Contro, Marco Crosara, Mariano Ceccato, and Mila Dalla Preda. EtherSolve: Computing an accurate control-flow graph from Ethereum bytecode. In 2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC), pages 127–137. IEEE, 2021.
[44] Solidity Documentation Contributors. Solidity documentation. https://docs.soliditylang.org/_/downloads/en/latest/pdf/, 2026. Accessed: February 15, 2026.
[45] Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint, 2024.
[46] Adrien Laurent. LLM API pricing comparison (2025): OpenAI, Gemini, Claude. https://intuitionlabs.ai/articles/llm-api-pricing-comparison-2025, 2026. Accessed: 2026-02-15.
[47] Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and Xin Jin. RAGCache: Efficient knowledge caching for retrieval-augmented generation. ACM Transactions on Computer Systems, 44(1):1–27, 2025.
[48] Chuangtao Ma, Zeyu Zhang, Arijit Khan, Sebastian Schelter, and Paul Groth. Cost-efficient RAG for entity matching with LLMs: A blocking-based exploration. arXiv preprint arXiv:2602.05708, 2026.
[49] Zimian Liu, Han Qiu, Wei Guo, Junhu Zhu, and Qingxian Wang. NIE-GAT: Node importance evaluation method for inter-domain routing network based on graph attention network. Journal of Computational Science, 65:101885, 2022.
[50] Yangding Li, Shaobin Fu, Yangyang Zeng, Hao Feng, Ruoyao Peng, Jinghao Wang, and Shichao Zhang. Centrality-based relation aware heterogeneous graph neural network. Knowledge-Based Systems, 283:111174, 2024.
[51] Yue Peng, Jiwen Xia, Dafeng Liu, Miao Liu, Long Xiao, and Benyun Shi.
Unifying topological structure and self-attention mechanism for node classification in directed networks. Scientific Reports, 15(1):805, 2025.
[52] Elizaveta Kovtun, Maksim Makarenko, Natalia Semenova, Alexey Zaytsev, and Semen Budennyy. PINE: Pipeline for important node exploration in attributed networks. arXiv preprint, 2025.
[53] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[54] Stephen B Seidman. Network structure and minimum degree. Social Networks, 5(3):269–287, 1983.
[55] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[56] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
[57] Linton C Freeman. A set of measures of centrality based on betweenness. Sociometry, pages 35–41, 1977.
[58] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
[59] Google AI Developers. Gemini 3 developer guide. https://ai.google.dev/gemini-api/docs/gemini-3, 2025. Accessed on February 18, 2026.
[60] Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-Pack: Packaged resources to advance general Chinese embedding, 2023.
[61] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, 2023.
[62] Gavin Wood et al. Ethereum: A secure decentralised generalised transaction ledger. Ethereum Project Yellow Paper, 151(2014):1–32, 2014.
[63] SWC Registry Contributors. Smart Contract Weakness Classification (SWC) registry. https://swcregistry.io/, 2026. Accessed: February 15, 2026.
[64] Shihao Xia, Mengting He, Linhai Song, and Yiying Zhang. SC-Bench: A large-scale dataset for smart contract auditing. In 2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code), pages 57–64. IEEE, 2025.
[65] Chaofan Li, Zheng Liu, Shitao Xiao, and Yingxia Shao. Making large language models a better foundation for dense retrieval, 2023.
[66] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024.
[67] Xu Yang, Hanwang Zhang, Guojun Qi, and Jianfei Cai. Causal attention for vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9847–9857, 2021.
[68] Yongduo Sui, Xiang Wang, Jiancan Wu, Min Lin, Xiangnan He, and Tat-Seng Chua. Causal attention for interpretable and generalizable graph classification. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 1696–1705, 2022.
[69] Shuailin Yang, Jiadong Ren, Jiazheng Li, and Dekai Zhang. VulDIAC: Vulnerability detection and interpretation based on augmented CFG and causal attention learning. Journal of Systems and Software, page 112595, 2025.
[70] Amrith Setlur, Benjamin Eysenbach, Virginia Smith, and Sergey Levine. Maximizing entropy on adversarial examples can improve generalization. In ICLR 2022 Workshop on PAIR^2Struct: Privacy, Accountability, Interpretability, Robustness, Reasoning on Structured Data, 2022.
[71] Michael Zhang, Nimit S Sohoni, Hongyang R Zhang, Chelsea Finn, and Christopher Ré. Correct-N-Contrast: A contrastive approach for improving robustness to spurious correlations. arXiv preprint arXiv:2203.01517, 2022.
[72] Tae Kyun Kim. T test as a parametric statistic. Korean Journal of Anesthesiology, 68(6):540–546, 2015.
[73] Robert F Woolson. Wilcoxon signed-rank test. Wiley Encyclopedia of Clinical Trials, pages 1–3, 2007.
[74] András Vargha and Harold D Delaney. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2):101–132, 2000.
[75] Mohammad S Sorower. A literature survey on algorithms for multi-label learning. Oregon State University, Corvallis, 18(1):25, 2010.
[76] Thomas McCabe. Cyclomatic complexity and the year 2000. IEEE Software, 13(3):115–117, 1996.
[77] Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. GNNExplainer: Generating explanations for graph neural networks. Advances in Neural Information Processing Systems, 32, 2019.
[78] Xin Wang, Heng Chang, Beini Xie, Tian Bian, Shiji Zhou, Daixin Wang, Zhiqiang Zhang, and Wenwu Zhu. Revisiting adversarial attacks on graph neural networks for graph classification. IEEE Transactions on Knowledge and Data Engineering, 36(5):2166–2178, 2023.
[79] Chenqi Hua, Xiaojian Liu, Yi Zhu, Chaowei Zhang, Yun Li, Yunhao Yuan, and Jipeng Qiang. SubAttack: A word-level adversarial textual attack method via antonym substitution. Engineering Applications of Artificial Intelligence, 163:113159, 2026.
[80] Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. Evaluation of retrieval-augmented generation: A survey. In CCF Conference on Big Data, pages 102–120. Springer, 2024.
[81] Justin Leo and Jugal Kalita. Survey of continuous deep learning methods and techniques used for incremental learning.
Neurocomputing, 582:127545, 2024.
