Analysis of LLM Vulnerability to GPU Soft Errors: An Instruction-Level Fault Injection Study
Large language models (LLMs) are highly compute- and memory-intensive, posing significant demands on high-performance GPUs. At the same time, advances in GPU technology driven by shrinking transistor sizes and lower operating voltages have made these devices increasingly susceptible to soft errors. While prior work has examined GPU reliability, most studies have focused on general-purpose applications or conventional neural networks mostly used for vision tasks such as classification and detection. In contrast, systematic analysis of modern large-scale LLMs remains limited, despite their rapid adoption in diverse application scenarios. Given the unique characteristics of LLMs, their resilience to soft errors may differ substantially from earlier models. To bridge this gap, we conduct the first instruction-level fault injection study of LLM inference. Our approach reveals reliability characteristics from multiple perspectives, highlighting the effects of model architecture, parameter scale, and task complexity. These findings provide new insights into LLM reliability and inform the design of more effective fault tolerance mechanisms.
💡 Research Summary
This paper presents the first comprehensive instruction‑level fault injection study of large language model (LLM) inference on modern GPUs. While prior reliability research has largely focused on general‑purpose applications or convolutional neural networks (CNNs) for vision tasks, the authors argue that the computational patterns of transformer‑based LLMs—such as key‑value cache handling, operator fusion, and extensive matrix multiplications—require a finer‑grained analysis. To this end, they adopt NVBitFI, an open‑source instruction‑level fault injection framework, and extend it to identify which GPU instructions belong to specific LLM layers or modules during execution. By dynamically flipping selected register bits, the framework emulates soft errors (single‑bit flips) that arise from voltage transients or high‑energy particle strikes.
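The core fault model, a single bit flip in a 32‑bit register value, can be illustrated with a short standalone sketch. This is an illustrative emulation in plain Python, not the authors' NVBitFI instrumentation, and `flip_bit` is a hypothetical helper name:

```python
import struct

def flip_bit(value, bit):
    """Flip one bit of a float's 32-bit IEEE-754 representation,
    emulating the kind of single-bit soft error injected into a
    GPU register (illustrative sketch, not the NVBitFI tool itself)."""
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float as a 32-bit integer, XOR the chosen bit,
    # and reinterpret the result as a float again.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits ^= 1 << bit
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# A flip in the lowest mantissa bit barely perturbs the value...
print(flip_bit(1.0, 0))   # 1.0000001192092896
# ...while flipping the top exponent bit of 1.0 yields +inf.
print(flip_bit(1.0, 30))  # inf
```

The asymmetry between mantissa and exponent bits already hints at the paper's later finding that bit position strongly determines fault severity.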
Experiments are conducted on an NVIDIA A100 (80 GB) GPU with CUDA 12.2, Python 3.12, and NVBit 1.7.5. Three representative transformer architectures—GPT‑2, Llama 3.2, and Qwen 3—are evaluated at two scales each (small and large), yielding six model variants ranging from 124 M to 3.21 B parameters. Six benchmark datasets cover a spectrum of tasks: Lambada (text generation), PIQA and HellaSwag (commonsense reasoning), WikiText‑2 (language modeling), XSum (summarization), and GSM8K (mathematical problem solving). For each run, the outcome is classified as a Detected Unrecoverable Error (DUE), Silent Data Corruption (SDC), or a masked error (no observable effect).
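The three‑way outcome classification can be sketched as a simple decision rule. This is an illustrative approximation; the paper's framework may use different detection signals, and `classify_outcome` is a hypothetical name:

```python
def classify_outcome(reference, faulty_output, crashed):
    """Classify one fault-injection run (terminology from the study).

    DUE    - Detected Unrecoverable Error: the run crashed or hung.
    SDC    - Silent Data Corruption: the run finished, but its output
             differs from the fault-free reference output.
    Masked - the injected fault had no observable effect.
    """
    if crashed or faulty_output is None:
        return "DUE"
    if faulty_output != reference:
        return "SDC"
    return "Masked"

# Example: a run that finishes with a wrong answer is an SDC.
print(classify_outcome("The capital is Paris.", "The capital is Parjs.", False))
```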
Key findings include: (1) The probability of DUE and SDC rises sharply with the number of injected bit flips. With a single‑bit fault, about 70 % of errors are masked; with eight simultaneous flips, the masked rate drops below 30 % and abnormal outcomes exceed 75 %. (2) Larger models exhibit higher masking rates and lower DUE/SDC percentages, suggesting that the sheer volume of parameters and activations dilutes the impact of any individual fault. (3) Architectural differences matter: GPT‑2 shows 5‑10 % lower DUE/SDC rates than the comparably sized Qwen 3‑1.7 B, likely due to its use of residual connections and layer normalization that dampen error propagation. (4) Instruction type and bit position are decisive factors. Memory load/store instructions, especially those handling the KV cache, have the highest fault propagation probability, and flips in the most significant bits of 32‑bit registers cause the greatest output distortion. Simple arithmetic instructions (ADD, MUL) are more often masked. (5) Task difficulty influences vulnerability: high‑complexity tasks such as GSM8K experience a larger increase in SDC (over 20 % with modest fault rates) compared to relatively tolerant generation tasks like Lambada.
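Finding (1), that more simultaneous flips leave less room for masking, can be reproduced qualitatively with a toy Monte‑Carlo model of a single dot product. This is purely illustrative; the rates it produces are not the paper's GPU measurements, and all names here are hypothetical:

```python
import random
import struct

def flip(v, bit):
    """Flip one bit of a float's 32-bit IEEE-754 encoding."""
    b = struct.unpack("<I", struct.pack("<f", v))[0] ^ (1 << bit)
    return struct.unpack("<f", struct.pack("<I", b))[0]

def masked_rate(n_flips, trials=500, dim=256, tol=1e-3, seed=0):
    """Fraction of trials in which n_flips random bit flips in the
    input leave a dot product within a relative tolerance of the
    fault-free result, i.e. the faults are effectively masked."""
    rng = random.Random(seed)
    masked = 0
    for _ in range(trials):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        w = [rng.gauss(0, 1) for _ in range(dim)]
        ref = sum(a * b for a, b in zip(x, w))
        y = list(x)
        for _ in range(n_flips):
            i = rng.randrange(dim)
            y[i] = flip(y[i], rng.randrange(32))
        out = sum(a * b for a, b in zip(y, w))
        # NaN/inf results fail this comparison and count as unmasked.
        if abs(out - ref) <= tol * max(1.0, abs(ref)):
            masked += 1
    return masked / trials

# More simultaneous flips -> fewer masked outcomes, mirroring finding (1).
```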
From these observations the authors propose practical mitigation strategies. Strengthening ECC or applying register remapping for memory‑intensive operations (e.g., attention) can substantially reduce error propagation. Deploying larger models in safety‑critical services can exploit the natural masking effect of scale. Quantifying task‑specific error tolerances enables service‑level agreements (SLAs) to define acceptable fault rates and guides the design of runtime error detection and recovery mechanisms. Finally, by releasing their NVBitFI‑based LLM reliability framework as open source, the paper paves the way for future studies across newer GPU architectures and emerging LLM designs.
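A minimal software analogue of the runtime error detection mentioned above is dual‑modular redundancy: execute the computation twice and flag any mismatch. This is a generic sketch, not the authors' proposed mechanism, and `detect_sdc` is a hypothetical name:

```python
def detect_sdc(compute, *args, runs=2):
    """Run the same computation `runs` times and compare the results.
    Agreement does not guarantee correctness, but any disagreement
    exposes a transient fault that would otherwise pass silently
    as data corruption (SDC)."""
    results = [compute(*args) for _ in range(runs)]
    if any(r != results[0] for r in results[1:]):
        raise RuntimeError("redundant executions disagree: possible SDC")
    return results[0]

# A deterministic, fault-free computation passes the check unchanged.
print(detect_sdc(lambda a, b: a + b, 2, 3))  # 5
```

The obvious cost is doubled compute, which is why the paper instead argues for targeted protection (e.g. stronger ECC on the memory‑intensive attention path) rather than blanket redundancy.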
In summary, the work demonstrates that LLM inference reliability on GPUs is shaped by a complex interplay of model size, architecture, instruction mix, bit‑level fault location, and task complexity. The instruction‑level perspective bridges the gap between micro‑architectural fault models and high‑level application outcomes, offering valuable insights for hardware‑software co‑design of fault‑tolerant LLM systems.