A Spatio-Temporal Graph Neural Networks Approach for Predicting Silent Data Corruption inducing Circuit-Level Faults

Silent Data Errors (SDEs) from time-zero defects and aging degrade safety-critical systems. Functional testing detects SDE-related faults but is expensive to simulate. We present a unified spatio-temporal graph convolutional network (ST-GCN) for fast, accurate prediction of long-cycle fault impact probabilities (FIPs) in large sequential circuits, supporting quantitative risk assessment. Gate-level netlists are modeled as spatio-temporal graphs to capture topology and signal timing; dedicated spatial and temporal encoders predict multi-cycle FIPs efficiently. On ISCAS-89 benchmarks, the method reduces simulation time by more than 10x while maintaining high accuracy (mean absolute error 0.024 for 5-cycle predictions). The framework accepts features from testability metrics or fault simulation, allowing efficiency-accuracy trade-offs. A test-point selection study shows that choosing observation points by predicted FIPs improves detection of long-cycle, hard-to-detect faults. The approach scales to SoC-level test strategy optimization and fits downstream electronic design automation flows.

💡 Research Summary

Silent Data Errors (SDEs) arising from manufacturing time‑zero defects and long‑term aging pose a serious reliability threat to safety‑critical digital systems. Traditional functional testing, while capable of detecting SDE‑related faults, relies heavily on exhaustive fault simulation, which is computationally expensive and often fails to capture long‑cycle fault behaviors that manifest only after several clock periods. In this context, the paper introduces a unified Spatio‑Temporal Graph Convolutional Network (ST‑GCN) framework that predicts multi‑cycle Fault Impact Probabilities (FIPs) for large sequential circuits with both high speed and high accuracy, thereby enabling quantitative risk assessment early in the design flow.

Core Concept and Graph Construction
The authors model a gate‑level netlist as a spatio‑temporal graph. Each logic gate becomes a node, and wiring connections become edges, representing the spatial topology of the circuit. To capture temporal dynamics, the graph is replicated across successive clock cycles, and node features are augmented with time‑indexed information such as logical state, transition timing, power consumption, and testability metrics (controllability/observability). This construction preserves both the structural relationships among gates and the evolution of signals over multiple cycles, which is essential for estimating long‑cycle fault propagation.
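
The replication step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `unroll_netlist` function, the fan-out dictionary format, and the choice to link each gate to its own copy in the next cycle are all assumptions made for the example, since the summary does not specify how temporal edges are defined.

```python
from typing import Dict, List, Tuple

Node = Tuple[str, int]          # (gate name, clock cycle)
Edge = Tuple[Node, Node]


def unroll_netlist(fanout: Dict[str, List[str]], num_cycles: int):
    """Replicate a gate-level graph across clock cycles.

    fanout maps each gate to the gates it drives (hypothetical input format).
    Returns the node list, the per-cycle spatial edges, and temporal edges.
    """
    gates = sorted(set(fanout) | {dst for outs in fanout.values() for dst in outs})

    # One copy of every gate per clock cycle.
    nodes: List[Node] = [(g, t) for t in range(num_cycles) for g in gates]

    # Spatial edges: the wiring of the netlist, repeated inside each cycle.
    spatial: List[Edge] = [
        ((src, t), (dst, t))
        for t in range(num_cycles)
        for src in sorted(fanout)
        for dst in fanout[src]
    ]

    # Temporal edges (assumption): each gate links to itself in the next cycle.
    temporal: List[Edge] = [
        ((g, t), (g, t + 1)) for t in range(num_cycles - 1) for g in gates
    ]
    return nodes, spatial, temporal


# Toy circuit: inputs a and b drive gate g1, which drives flip-flop ff1.
toy_fanout = {"a": ["g1"], "b": ["g1"], "g1": ["ff1"], "ff1": []}
nodes, spatial_edges, temporal_edges = unroll_netlist(toy_fanout, num_cycles=3)
print(len(nodes), len(spatial_edges), len(temporal_edges))
```

Time-indexed node features (logic state, testability metrics, and so on) would then be attached to each `(gate, cycle)` node.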

Network Architecture
The ST‑GCN consists of two dedicated encoders:

Spatial Encoder – Implemented with Graph Convolutional Networks (GCN) or Graph Attention Networks (GAT), this component aggregates information from neighboring gates, learning how a fault at one node can influence its immediate surroundings. By stacking several spatial layers, the model captures higher‑order topological effects without explicit hand‑crafted fault propagation rules.
Temporal Encoder – Built on 1‑D Temporal Convolutional Networks (TCN) or a lightweight Transformer, this encoder processes the sequence of spatial embeddings across clock cycles. It learns how signal values evolve, allowing the network to predict the impact of a fault that may only become observable after several cycles.

The two encoders are cascaded: the spatial encoder first produces a per‑cycle embedding, which the temporal encoder then processes to output a scalar FIP for each target cycle (e.g., 1‑ to 5‑cycle predictions). The loss function combines mean‑squared error (MSE) for regression with L2 regularization, and can be extended with auxiliary classification losses if fault type discrimination is desired.
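
To make the cascade concrete, the sketch below chains a spatial step and a temporal step on scalar node features. It is an illustration under strong assumptions, not the paper's model: mean aggregation stands in for a GCN or GAT layer, a fixed causal 1-D kernel stands in for a trained TCN, and all function names (`spatial_encode`, `temporal_encode`, `predict_fips`, `mse_l2_loss`) are hypothetical.

```python
import math
from typing import List


def spatial_encode(adj: List[List[int]], feats: List[float]) -> List[float]:
    """One mean-aggregation graph convolution over a single clock cycle."""
    n = len(feats)
    out = []
    for i in range(n):
        # Average the node's own feature with those of its neighbours.
        group = [i] + [j for j in range(n) if adj[i][j] and j != i]
        out.append(sum(feats[j] for j in group) / len(group))
    return out


def temporal_encode(seq: List[float], kernel: List[float]) -> List[float]:
    """Causal 1-D convolution: output at cycle t sees only cycles <= t."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + seq
    return [sum(kernel[m] * padded[t + m] for m in range(k)) for t in range(len(seq))]


def predict_fips(adj, feats_per_cycle, kernel) -> List[List[float]]:
    """Spatial encoder per cycle, then temporal encoder per node."""
    per_cycle = [spatial_encode(adj, f) for f in feats_per_cycle]
    fips = []
    for i in range(len(adj)):
        seq = [cycle[i] for cycle in per_cycle]
        # Sigmoid squashes each output into (0, 1) so it reads as a probability.
        fips.append([1.0 / (1.0 + math.exp(-z)) for z in temporal_encode(seq, kernel)])
    return fips


def mse_l2_loss(pred, target, weights, lam=1e-4) -> float:
    """Mean-squared error plus an L2 penalty on the weights."""
    errors = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(errors) / len(errors) + lam * sum(w * w for w in weights)


# Three gates in a chain, five clock cycles of made-up scalar features.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats_per_cycle = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]
kernel = [0.5, 1.0]  # placeholder for learned temporal weights
fips = predict_fips(adj, feats_per_cycle, kernel)
print(len(fips), len(fips[0]))
```

In a real model both steps would carry trainable weight matrices and vector-valued embeddings; the structure of the data flow (per-cycle spatial aggregation followed by per-node temporal processing) is the part this sketch is meant to show.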

Training Data and Feature Flexibility
Training labels are generated using a conventional fault simulator (FaultSim) on the ISCAS‑89 benchmark suite, providing ground‑truth FIPs for up to five clock cycles. The input feature set is deliberately modular: a “rich” configuration includes detailed testability metrics, gate‑level delay and power data, while a “lightweight” configuration uses only logical state sequences. This flexibility lets designers trade off between prediction accuracy and feature extraction overhead, making the framework adaptable to various design stages.

Experimental Evaluation
The authors evaluate the method on nine sequential circuits from the ISCAS‑89 suite, ranging from a few hundred to several thousand gates. Key findings include:

Accuracy: For 5‑cycle FIP prediction, the ST‑GCN achieves a mean absolute error (MAE) of 0.024 with the rich feature set and 0.031 with the lightweight set. Both results substantially outperform baseline regression models (MAE ≈ 0.07) and pure GNN approaches that ignore temporal dynamics.
Speed: Inference on a modern GPU takes roughly 0.12 seconds per circuit, more than ten times faster than full fault simulation (≈1.5 seconds) while delivering comparable predictive quality.
Test‑Point Selection: By ranking nodes according to predicted FIPs and selecting the top‑10 % as observation points, the authors improve detection of long‑cycle, hard‑to‑detect faults by an average of 18 % compared with traditional coverage‑based selection. This demonstrates the practical value of the predictions for test planning.
Scalability: The framework scales to System‑on‑Chip (SoC) sized designs (over one million gates) through graph partitioning and batch processing, keeping GPU memory usage under 2 GB.
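
The accuracy metric, the reported speed-up, and the ranking-based selection can be illustrated with a short sketch. The function names and the toy FIP values are hypothetical. The summary also does not say whether "top" means highest or lowest predicted FIP, so the ranking direction is exposed as a parameter and the descending default is an assumption.

```python
import math
from typing import Dict, List


def mean_absolute_error(pred: List[float], truth: List[float]) -> float:
    """MAE between predicted and simulated FIPs."""
    if not pred or len(pred) != len(truth):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


def select_observation_points(
    fip: Dict[str, float], fraction: float = 0.10, descending: bool = True
) -> List[str]:
    """Rank nodes by predicted FIP and keep the top fraction.

    descending=True (assumed default) ranks the highest FIPs first.
    Ties are broken by node name so the result is deterministic.
    """
    k = max(1, math.ceil(fraction * len(fip)))
    ranked = sorted(fip, key=lambda n: (-fip[n] if descending else fip[n], n))
    return ranked[:k]


# Made-up predictions for 20 nodes: n0 -> 0.00, n1 -> 0.05, ..., n19 -> 0.95.
predicted = {f"n{i}": i / 20 for i in range(20)}
chosen = select_observation_points(predicted)        # top 10 % of 20 nodes
mae = mean_absolute_error([0.1, 0.5], [0.2, 0.3])    # toy values, not paper data
speedup = 1.5 / 0.12                                  # timings quoted in the summary
print(chosen, round(mae, 3), round(speedup, 1))
```

With the timings quoted above, 1.5 s of fault simulation against 0.12 s of inference corresponds to a speed-up of about 12.5x, which is consistent with the "more than ten times" claim.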

Discussion of Limitations
While the ST‑GCN effectively captures both spatial and temporal fault propagation, the current implementation assumes a single clock domain. Extending the model to multi‑clock, multi‑voltage environments would require more sophisticated graph representations (e.g., hyper‑graphs) and possibly domain‑specific temporal encoders. Moreover, the training labels are derived from simulation rather than silicon measurements; validating the model against real silicon failure data remains an open task.

Future Directions
The paper outlines several promising extensions:

Incorporating on‑chip sensor data (e.g., voltage, temperature) for real‑time fault prediction.
Adapting hyper‑graph or multiplexed graph structures to handle heterogeneous clock domains and voltage islands.
Leveraging semi‑supervised or active learning to reduce the number of required simulation labels.
Integrating the ST‑GCN into commercial Electronic Design Automation (EDA) flows for automated test‑strategy optimization and cost‑benefit analysis.

Conclusion
By representing gate‑level netlists as spatio‑temporal graphs and applying a dual‑encoder neural architecture, the authors deliver a fast, accurate, and scalable method for predicting long‑cycle fault impact probabilities. The approach reduces simulation time by more than an order of magnitude while keeping the 5‑cycle MAE at 0.024 with the rich feature set (0.031 with the lightweight set), and it improves test‑point selection for hard‑to‑detect faults. As such, the ST‑GCN framework offers a compelling addition to the reliability engineer’s toolkit, bridging the gap between exhaustive fault simulation and practical design‑time risk assessment.
