Performance Isolation for Inference Processes in Edge GPU Systems

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

This work analyzes the main isolation mechanisms available in modern NVIDIA GPUs (MPS, MIG, and the recent Green Contexts) as a means of ensuring predictable inference time in safety-critical applications that use deep learning models. The experimental methodology includes performance tests, an evaluation of partitioning impact, and an analysis of temporal isolation between processes on both the NVIDIA A100 and Jetson Orin platforms. The results show that MIG provides a high level of isolation, while Green Contexts represent a promising alternative for edge devices, enabling fine-grained SM allocation with low overhead, albeit without memory isolation. The study also identifies current limitations and outlines research directions for improving temporal predictability on shared GPUs.

💡 Research Summary

This paper investigates three isolation mechanisms available on modern NVIDIA GPUs, namely Multi‑Process Service (MPS), Multi‑Instance GPU (MIG), and the recently introduced Green Contexts (GC), with the goal of guaranteeing predictable inference latency for safety‑critical deep‑learning applications. The authors evaluate these mechanisms on two representative platforms: the data‑center‑class NVIDIA A100 and the edge‑oriented Jetson Orin Nano/AGX.

The motivation stems from the observation that inference workloads with batch size 1, typical in real‑time autonomous systems, under‑utilize GPU resources, opening the possibility of parallelizing multiple model inferences (e.g., ensemble voting) on a single GPU. However, concurrent execution introduces contention that can jeopardize the strict timing guarantees required by standards such as ISO 26262.

The paper first describes each isolation technology. MPS is a software layer that funnels work from multiple processes through a shared CUDA context, reducing context‑switch overhead and allowing kernels from different processes to run concurrently, but it does not provide physical separation of compute or memory resources. MIG, available on select GPUs such as the A100, partitions the device at the level of GPU Processing Clusters (GPCs), allocating dedicated SMs, caches, and memory to each instance, thus delivering strong spatial and temporal isolation. Green Contexts, introduced in CUDA 12.4, allow individual Streaming Multiprocessors (SMs) to be allocated to a CUDA context, offering SM‑level isolation without memory partitioning.
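
To make the difference in partitioning granularity concrete, the toy Python function below (a hypothetical helper, not part of CUDA or of the paper's tooling) carves a pool of SMs into equal groups, rounding each requested group size up to an assumed allocation granularity. It only loosely mirrors the idea behind SM-resource splitting in the CUDA driver's Green Contexts API (e.g. cuDevSmResourceSplitByCount); the real alignment rules depend on the GPU architecture and are not modeled here.

```python
from typing import List, Tuple


def split_sms(total_sms: int, sms_per_group: int,
              granularity: int = 1) -> Tuple[List[int], int]:
    """Toy model of SM-level partitioning (hypothetical helper).

    Rounds the requested group size up to a multiple of `granularity`
    (an assumed allocation unit), carves out as many full groups as fit,
    and returns (group_sizes, leftover_sms).
    """
    if total_sms <= 0 or sms_per_group <= 0 or granularity <= 0:
        raise ValueError("all arguments must be positive")
    # Ceiling division, then scale back up to the granularity.
    group_size = -(-sms_per_group // granularity) * granularity
    n_groups = total_sms // group_size
    leftover = total_sms - n_groups * group_size
    return [group_size] * n_groups, leftover


# Two 4-SM groups carved from 8 SMs (the partition sizes used on the Orin Nano).
print(split_sms(8, 4))                  # ([4, 4], 0)
# A 3-SM request with an assumed granularity of 2 is rounded up to 4 SMs.
print(split_sms(10, 3, granularity=2))  # ([4, 4], 2)
```

By contrast, MIG would hand out whole GPC-based slices, so the smallest unit a process can receive is much coarser than a single SM.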

The experimental methodology consists of two phases. In the first phase, the authors determine the maximum stable inference frequency (IMS) for six popular networks (ConvNeXt Base/Large, MobileNetV2, ResNet‑18, ViT‑B16, ViT‑L32) by iteratively increasing the target frequency until a timing violation (timeout) occurs, then validating the candidate frequency over multiple batches. In the second phase, a “process of interest” runs at its previously measured IMS while a competing process gradually raises its inference frequency from 1 up to its own IMS. The number of timeouts experienced by the process of interest quantifies the isolation capability of each mechanism.
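
The two-phase procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' harness: run_batch and run_interest are hypothetical callbacks standing in for real inference runs, a timeout is assumed to mean an inference latency longer than the period 1/f, frequencies are stepped in whole hertz, and the toy contention model (latency growing linearly with the competitor's load) is invented purely for demonstration.

```python
from typing import Callable, List

# Hypothetical callback types (assumptions, not the paper's API):
#   run_batch(freq_hz) -> per-inference latencies (s) of one solo batch
#   run_interest(interest_hz, competitor_hz) -> latencies (s) of the
#       process of interest while the competitor runs at competitor_hz
BatchRunner = Callable[[int], List[float]]
PairRunner = Callable[[int, int], List[float]]


def count_timeouts(latencies_s: List[float], freq_hz: int) -> int:
    """Count inferences whose latency exceeds the period 1 / freq_hz."""
    period_s = 1.0 / freq_hz
    return sum(1 for lat in latencies_s if lat > period_s)


def find_ims(run_batch: BatchRunner, max_hz: int = 1000,
             validation_batches: int = 5) -> int:
    """Phase 1: raise the target frequency until a timeout occurs, then
    re-validate the last passing frequency over several batches, stepping
    down on failure. Returns 0 if no frequency passes."""
    candidate = 0
    for freq in range(1, max_hz + 1):
        if count_timeouts(run_batch(freq), freq) > 0:
            break
        candidate = freq
    while candidate > 0:
        if all(count_timeouts(run_batch(candidate), candidate) == 0
               for _ in range(validation_batches)):
            return candidate
        candidate -= 1
    return 0


def interference_sweep(run_interest: PairRunner, interest_hz: int,
                       competitor_ims: int) -> int:
    """Phase 2: hold the process of interest at interest_hz while the
    competitor's frequency rises from 1 Hz to its own IMS; return the
    total number of timeouts seen by the process of interest."""
    total = 0
    for competitor_hz in range(1, competitor_ims + 1):
        latencies = run_interest(interest_hz, competitor_hz)
        total += count_timeouts(latencies, interest_hz)
    return total


def make_toy_pair(base_latency_s: float, contention: float,
                  competitor_ims: int, batch_size: int = 10) -> PairRunner:
    """Invented contention model: latency grows linearly with the
    competitor's load (competitor_hz / competitor_ims). contention=0
    mimics perfect isolation."""
    def run_interest(interest_hz: int, competitor_hz: int) -> List[float]:
        # interest_hz is ignored by this toy model; latency depends only on load.
        load = competitor_hz / competitor_ims
        return [base_latency_s * (1.0 + contention * load)] * batch_size
    return run_interest


# Demo with a fixed 9.5 ms solo latency (hypothetical numbers).
ims = find_ims(lambda freq: [0.0095] * 10)
isolated = make_toy_pair(0.0095, contention=0.0, competitor_ims=50)
shared = make_toy_pair(0.0095, contention=0.5, competitor_ims=50)
print(ims)                                    # 105
print(interference_sweep(isolated, ims, 50))  # 0
print(interference_sweep(shared, ims, 50))    # 500
```

With contention=0.0 the toy process of interest never misses its deadline, which is the behaviour a well-isolated mechanism should approach; with contention=0.5 every sweep point produces timeouts, resembling a mechanism without physical separation.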

On the A100, two equal MIG partitions of 3 GPCs each are compared against the standalone (unpartitioned) GPU and MPS. MIG delivers near‑zero timeouts, confirming its strong temporal isolation, but incurs a modest (~5%) throughput loss relative to the unpartitioned GPU due to internal overhead. MPS, lacking physical separation, shows a steep degradation of the process‑of‑interest’s IMS and a high timeout rate as the competing workload grows, indicating insufficient isolation for hard real‑time guarantees.
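
One plausible way to express partitioning overhead is the relative loss of aggregate throughput: compare the summed throughput of the partitions with that of the unpartitioned GPU. The paper may define the loss differently, and the numbers in the example are hypothetical placeholders chosen to land at 5%, not measurements from the paper.

```python
from typing import Sequence


def relative_throughput_loss(standalone_hz: float,
                             partition_hz: Sequence[float]) -> float:
    """Fraction of aggregate throughput lost by partitioning:
    1 - sum(partition throughputs) / standalone throughput."""
    if standalone_hz <= 0:
        raise ValueError("standalone throughput must be positive")
    return 1.0 - sum(partition_hz) / standalone_hz


# Hypothetical: 200 Hz unpartitioned vs. two MIG halves at 95 Hz each.
loss = relative_throughput_loss(200.0, [95.0, 95.0])
print(f"{loss:.1%}")  # 5.0%
```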

On the Jetson Orin Nano, the authors evaluate MPS, the standalone GPU, and Green Contexts with two 4‑SM partitions. Green Contexts achieve significantly fewer timeouts than MPS and maintain a linear relationship between allocated SMs and throughput, demonstrating effective SM‑level isolation. However, because GC does not partition device memory, memory‑pressure scenarios could still cause unpredictable delays, a limitation not present in MIG.
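
A simple way to check the reported linear scaling is an ordinary least-squares fit of throughput against the number of allocated SMs: a near-zero intercept and a stable slope (throughput per SM) indicate proportional scaling. The data points below are hypothetical and serve only to exercise the function.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical (SMs allocated, inferences per second) pairs.
sms = [2, 4, 6, 8]
throughput = [20.0, 40.0, 60.0, 80.0]
slope, intercept = fit_line(sms, throughput)
print(slope, intercept)  # 10.0 0.0
```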

The authors conclude that MIG provides the most robust isolation for safety‑critical systems that can afford the associated hardware (e.g., A100). For power‑constrained edge devices lacking MIG support, Green Contexts represent a promising alternative, offering fine‑grained compute isolation with low overhead, albeit without memory isolation. The paper also outlines future research directions: (1) hybrid mechanisms that combine SM‑level partitioning with memory isolation, (2) dynamic partition reconfiguration driven by real‑time schedulers, and (3) long‑term stability studies across diverse workloads and power budgets.