Guaranteeing Semantic and Performance Determinism in Flexible GPU Sharing



Zhenyuan Yang 1,2,4,*, Wenxin Zheng 3,*, and Mingyu Li 1,2

1 Key Laboratory of System Software (Chinese Academy of Sciences)
2 Institute of Software, Chinese Academy of Sciences
3 Shanghai Jiao Tong University
4 University of Chinese Academy of Sciences

Abstract

GPU sharing is critical for maximizing hardware utilization in modern data centers. However, existing approaches present a stark trade-off: coarse-grained temporal multiplexing incurs severe tail-latency spikes for interactive services, while fine-grained spatial partitioning often necessitates invasive kernel modifications that compromise behavioral equivalence. We present DETSHARE, a novel GPU sharing system that prioritizes determinism and transparency. DETSHARE ensures semantic determinism (unmodified kernels yield identical results) and performance determinism (predictable tail latency), all while maintaining complete transparency (zero code modification). DETSHARE introduces GPU coroutines, a new abstraction that decouples logical execution contexts from physical GPU resources. This decoupling enables flexible, fine-grained resource allocation via lightweight context migration.

Our evaluation demonstrates that DETSHARE improves training throughput by up to 79.2% compared to temporal sharing. In co-location scenarios, it outperforms state-of-the-art baselines, reducing P99 tail latency by 15.1% without compromising throughput. Furthermore, through workload-aware placement and our TPOT-FIRST scheduling policy, DETSHARE decreases average inference latency by 69.1% and reduces Time-Per-Output-Token (TPOT) SLO violations by 21.2% relative to default policies.
*Equal contribution

1 Introduction

Artificial intelligence (AI) applications, such as chatbots [1, 2], code generation [3, 4], and autonomous agents [5], have become the dominant consumers of cloud compute resources and rely heavily on powerful GPU accelerators [6-8]. These workloads exhibit highly dynamic resource demands, frequently alternating between compute-bound phases (e.g., prompt prefill) and memory-bound phases (e.g., autoregressive decoding). Current cloud abstractions fail to match this dynamism, primarily offering rigid, fixed-size instances to enforce isolation. This mismatch compels operators to over-provision for peak loads, resulting in significant resource waste during off-peak cycles. For instance, Microsoft reports that compute utilization drops to under 10% during the memory-bound decoding phase when serving Llama3-8B [9] on A100 GPUs [10]. Industry surveys indicate that 60%-70% of GPU capacity routinely remains idle across both training and inference workloads in production clusters, translating to billions of dollars in wasted capital expenditure annually. Attempts to mitigate this inefficiency via manual resource tuning prove impractical, as the process is labor-intensive and fails to scale [11, 12].

GPU resource sharing offers a natural solution: by co-locating workloads, the capacity left idle during one task's low-utilization phases can be reclaimed by co-resident jobs. Existing approaches can be broadly classified into two categories. Temporal sharing [13-16] time-slices the GPU, granting each job exclusive access for fixed quanta. While straightforward, this approach fails to reduce under-utilization during memory-bound phases and can cause severe latency spikes for latency-critical services [17, 18]. Spatial sharing [19-23] permits concurrent execution by partitioning GPU resources (e.g., Streaming Multiprocessors (SMs)) across workloads.
This approach is considered more promising for achieving high utilization, yet existing spatial-sharing systems remain impractical due to three main challenges:

• C1: Semantic Determinism (inequivalence after sharing). Fine-grained spatial sharing approaches [19, 24] often modify kernel code, fundamentally altering execution semantics. They change either the effective batch size (distorting normalization statistics) or the computation order (violating floating-point associativity), causing numerical divergence from standalone execution [25].

• C2: Performance Determinism (unpredictable SLO violations). Co-locating services on shared GPUs inherently introduces resource contention. Compounding this, existing infrastructure [26, 27] suffers from a loss of semantic information: it remains oblivious to request priority. This uniform handling exacerbates interference, jeopardizing strict tail-latency Service Level Objectives (SLOs).

• C3: Transparency and Practicality (the engineering tax). Techniques like searching for fusible kernels [28] or manual kernel rewriting [24] require deep domain expertise and invasive code modifications, imposing an engineering tax. This not only creates maintenance burdens across rapidly evolving model architectures and underlying software stacks, but also amplifies the risk of violating semantic determinism (C1).

This paper presents DETSHARE, a transparent GPU sharing system that simultaneously addresses all three challenges. Our key observation is that the modern GPU software-hardware stack already offers the necessary primitives for efficient spatial sharing without sacrificing semantic determinism: NVIDIA's Multi-Process Service (MPS) [29] enables spatial sharing with address space isolation, while Green Contexts [30] serve as lightweight execution units amenable to fine-grained preemption.
Leveraging these software-hardware mechanisms, DETSHARE introduces GPU coroutines, a novel abstraction that fully decouples logical execution contexts (visible to applications) from physical GPU resources (managed by the CUDA runtime). Applications operate on virtual contexts (vCtx) that are dynamically bound to physical contexts (pCtx) provisioned with varying SM quotas. A centralized, workload-aware scheduler orchestrates this binding, migrating contexts between physical resources to match instantaneous workload priorities.

Critically, by executing kernels without any modifications, DETSHARE ensures they yield identical results whether running in isolation or under sharing, thereby preserving both functional correctness and semantic determinism (C1). To meet tail-latency SLOs, DETSHARE incorporates workload-aware scheduling policies that prioritize latency-critical kernels while simultaneously maximizing aggregate throughput (C2). Finally, DETSHARE remains fully transparent: no manual tuning or kernel annotation is required (C3).

We evaluate DETSHARE on diverse workloads spanning DNN training, large language model (LLM) inference, and co-located training-inference scenarios. Our results show that DETSHARE improves throughput by up to 79.2% for DNN training workloads compared to temporal sharing. When co-locating DNN training and large-model inference tasks, DETSHARE can reduce the P99 tail latency of inference by up to 15.1% compared to the baseline while preserving training throughput. By introducing workload-aware job placement policies, DETSHARE reduces the average inference latency by up to 69.1% under real-world production traces. Furthermore, with our TPOT-FIRST scheduling policy, the TPOT SLO violation rate is reduced by an additional 21.2% relative to the default policy.

Contributions. We make the following contributions.
• We systematically classify existing GPU sharing mechanisms and identify three fundamental challenges that limit their applicability and efficiency.

• We propose GPU coroutines, a novel abstraction that fully decouples logical execution contexts from physical GPU resources, enabling flexible and fine-grained multiplexing with determinism.

• We design and implement DETSHARE, a high-performance, general-purpose GPU sharing system that outperforms state-of-the-art solutions in both DNN training and LLM inference, while maintaining strict SLO compliance.

2 Background and Motivation

This section reviews AI model serving applications and the existing mechanisms for running them on GPUs.

2.1 AI Model Serving

AI models are now ubiquitous across modern applications, powering search, personalization, automation, diagnostics, and conversational agents, and model serving has become a primary consumer of computing resources [6-8]. Serving AI models involves two broad phases: training produces a model, and inference makes the model's predictions available to various applications.

Training includes pre-training, referring to large-scale, often self-supervised training on massive data to build foundation models [31-33]. These pre-trained foundation models then serve as general-purpose backbones for downstream applications in natural language understanding, conversational agents, summarization, image representation learning, and multimodal reasoning. Post-training further adapts foundation models to domain-specific tasks through fine-tuning [34, 35], instruction tuning [36-38], or personalization [39, 40]. In addition, model compression [41, 42] and quantization [43] techniques are applied to reduce memory footprint and improve inference latency. Another kind of training is online or continual training [44], which incrementally updates deployed models using fresh or user-generated data.
Inference entails the execution of trained AI models to generate predictions for online or batch tasks. Within this domain, LLM serving has emerged as a dominant workload, underpinning a diverse array of critical applications ranging from conversational agents [1, 2] and neural code synthesis [3, 4] to Retrieval-Augmented Generation (RAG) pipelines [45, 46] and autonomous agentic systems [5]. Typical LLM inference can be conceptually divided into a prefill phase, where the input context is encoded and key-value caches are initialized, and a decode phase, where tokens are generated sequentially, often with beam search or sampling strategies. These phases exhibit distinct resource utilization patterns: prefill is primarily compute-bound, characterized by high arithmetic intensity due to parallel processing of input tokens. In contrast, decode is memory-bound, constrained by memory bandwidth due to the low arithmetic intensity of auto-regressive token generation and the overhead of KV cache loading [10, 47, 48].

2.2 GPU Sharing

An enduring challenge in systems and architecture, GPU sharing has attracted a broad spectrum of research. Existing solutions broadly fall into two paradigms: temporal and spatial sharing.

Temporal sharing [13-16] switches the GPU from one application to another. This offers strong isolation, but context switching introduces heavy overhead, and since only one context is active at a time, overall GPU utilization can be low when the workload is small [49, 50].

Spatial sharing [19-23] executes multiple workloads concurrently across partitioned GPU resources, spanning from fine-grained software multiplexing like CUDA Streams to coarse-grained hardware isolation. While CUDA Streams [20, 51-53] enable intra-application parallelism by issuing operations into distinct hardware work queues, they operate within a single CUDA context.
Although this eliminates context-switching overhead, the resulting unified address space enforces fate sharing: a fault in one stream crashes the entire context, offering zero fault isolation.

To provide stronger multi-tenant isolation, vendors offer dedicated spatial partitioning mechanisms like NVIDIA's Multi-Instance GPU (MIG) and Multi-Process Service (MPS). MIG [54] achieves strict hardware-level isolation by dividing a single physical GPU into fully isolated instances, each with dedicated compute and memory bandwidth. However, this isolation comes at a severe cost: MIG relies on heavyweight reconfiguration mechanisms that necessitate device resets, incurring latencies on the order of hundreds of milliseconds [28, 55]. Furthermore, its coarse-grained partitioning enforces fixed resource slice sizes, inevitably leading to significant resource fragmentation. In contrast to hardware-centric partitioning, MPS [29] operates at the software level, facilitating process-level spatial multiplexing via a client-server architecture. By multiplexing command streams from multiple independent clients into a single shared hardware context, MPS eliminates context-switching overheads while successfully maintaining separate virtual address spaces for each process. This design makes MPS particularly beneficial for multi-tenant inference workloads where clients can safely coexist within a context, effectively boosting overall utilization by absorbing leftover parallelism without the rigid reconfiguration penalties of MIG.

2.3 The Sources of Semantic Non-Determinism

Hardware-Level Floating-Point Non-Associativity. At the lowest level, the root cause of numerical divergence is the IEEE 754 standard for floating-point arithmetic [56]. Due to finite precision (especially in truncated formats like FP16 and BF16), floating-point addition is non-associative: (A + B) + C ≠ A + (B + C).
Consequently, any variation in the sequence of arithmetic reductions will yield divergent rounding errors.

System-Level Serving Optimizations. To maximize throughput, modern LLM serving engines dynamically adapt execution paths, which inherently introduces non-determinism. Techniques such as continuous batching [26] cause effective batch sizes to fluctuate heavily. Concurrently, high-performance computational kernels (e.g., cuBLAS) leverage dynamic tile sizing and Split-K reductions for matrix multiplications [25, 57]. By continually reshaping the underlying reduction trees, these performance-oriented optimizations alter the sequence of floating-point operations, directly exacerbating numerical divergence.

Application-Level Impact: Training-Inference Mismatch. This numerical non-determinism extends beyond mere theoretical artifacts; it severely impacts advanced training paradigms such as Reinforcement Learning from Human Feedback (RLHF). During RL fine-tuning, identical prompts must consistently yield identical token trajectories. Micro-divergences introduced during inference trigger a policy shift [58]: the model's behavior during online rollouts deviates from the policy evaluated during offline training. This mismatch effectively degrades online RL algorithms into suboptimal offline RL, substantially stalling convergence.

Recognizing this, recent serving frameworks (e.g., SGLang [59, 60]) and libraries [61] have introduced strict "deterministic modes". However, these application-level guarantees implicitly assume dedicated, unfragmented hardware. As we discuss in Section 3.1, when these workloads are subjected to spatial multiplexing and sharing mechanisms at the infrastructure layer, the semantic determinism meticulously preserved by the application is once again destroyed.
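The floating-point effects described in this section are easy to reproduce. The following NumPy sketch is illustrative only: `chunked_sum` is a hypothetical stand-in for a kernel whose reduction tree depends on its grid size, not DETSHARE code.

```python
import numpy as np

def chunked_sum(values, num_chunks):
    """Sum in float16 after splitting the array into num_chunks pieces,
    mimicking a kernel whose reduction tree depends on its grid size."""
    partials = []
    for chunk in np.array_split(values, num_chunks):
        acc = np.float16(0)
        for v in chunk:                  # sequential float16 accumulation
            acc = np.float16(acc + v)
        partials.append(acc)
    total = np.float16(0)
    for p in partials:                   # combine per-chunk partial sums
        total = np.float16(total + p)
    return total

# Plain non-associativity: (A + B) + C vs. A + (B + C) in float16.
# 0.1 is absorbed when added to 4096 first, so the two orders disagree.
a, b, c = np.float16(0.1), np.float16(4096.0), np.float16(-4096.0)
print(np.float16(np.float16(a + b) + c), np.float16(a + np.float16(b + c)))

# The same data summed under different "grid splits" can round differently.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float16)
for splits in (1, 16, 256):
    print(splits, chunked_sum(x, splits))
```

Because float16 has only a 10-bit mantissa, the spacing between representable values near 4096 is 4, so the small addend vanishes in one association order but survives in the other; the same mechanism underlies the grid-split divergence discussed in Section 3.1.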
3 Challenges of GPU Sharing Mechanisms

In this section, we classify and summarize the existing GPU sharing mechanisms, as shown in Table 1. Current GPU sharing mechanisms can be broadly categorized into temporal sharing, spatial sharing, and hybrid sharing approaches. While spatial sharing methods generally outperform temporal sharing in terms of utilization and SLO compliance, they often require significant human effort to tune and may not guarantee equivalence after sharing.

Table 1: Characterization of Existing GPU Sharing Mechanisms.

Sharing Mechanism | Key Techniques                   | Semantic Determinism | Performance Determinism | Transparency & Practicality | Utilization
Temporal Sharing  | Time-slicing                     | ❍                    | ●                       | ❍                           | Low
Spatial Sharing   | Kernel atomization or scheduling | ●                    | ◗                       | ◗                           | High
Hybrid Sharing    | Combination of both              | ●                    | ◗                       | ●                           | Medium

Note: ◗ Partially addressed, ❍ Fully addressed, ● Not addressed.

3.1 Challenge-#1: Semantic Determinism

The bedrock of production AI systems is semantic integrity: a model validated in an offline environment must behave identically when deployed online. The core requirement for GPU resource sharing, therefore, is behavioral equivalence: preserving each workload's functional behavior during shared execution compared to its standalone counterpart. This property is paramount for training-inference consistency [62]: models validated on dedicated hardware must yield identical predictions in inference scenarios. Any numerical divergence, however minute, can invalidate offline accuracy metrics.

Figure 1: Statistical Divergence. The impact of dynamic batch fragmentation on training loss (loss vs. iteration steps for batch sizes 16, 512, and 1024). Adapting batch sizes to fit resource fragments alters the statistical properties of the workload, leading to increased loss variance compared to the baseline.

While temporal sharing naturally preserves arithmetic determinism by granting exclusive access to full resources, spatial sharing introduces complex challenges. To maximize utilization, recent intrusive methods (e.g., LithOS [24]) employ kernel atomization: dynamically partitioning parallelizable dimensions to fit available resource fragments. We argue that such techniques fundamentally violate the abstraction boundary between logical computation and physical resource availability. Specifically, standard GPU kernels operate on a contract where the result is a function of input data (D) and model state (S): Result = F(D, S). However, by dynamically reshaping kernels to fit hardware fragments, atomization implicitly leaks instantaneous resource constraints (R(t)) into execution semantics, effectively altering the function to Result′ = F(D, S, R(t)). Since R(t) is stochastic in a multi-tenant cluster, the computation becomes non-deterministic. We categorize this divergence into two distinct levels:

• Statistical Divergence (Batch Sensitivity): Techniques that partition resources by splitting batches effectively alter the statistical population used for operations like Batch Normalization [63]. As demonstrated in Figure 1, reducing the effective batch size to fit fragmented resources distorts mean and variance statistics, leading to significant loss fluctuations and degraded convergence.

Figure 2: Numerical Divergence. Absolute numerical deviation of FP16 and BF16 summations compared to their unfragmented, standalone execution, as the computation grid is split from 64 up to 16384 ways. Dynamically splitting the computation grid inevitably alters the underlying reduction tree, exposing the non-associativity of floating-point arithmetic. This semantic divergence is especially severe for BF16.
• Numerical Divergence (FP Non-associativity): A more insidious issue arises in reduction-heavy kernels (e.g., Softmax, LayerNorm). Due to the non-associativity of floating-point arithmetic [56], the order of summation determines rounding error accumulation. As shown in Figure 2, we simulate resource fragmentation by splitting a large tensor summation into varying grid sizes. The results demonstrate that merely altering the reduction tree structure causes the absolute error (vs. standalone execution) to fluctuate significantly. This divergence is present in both FP16 and BF16, with BF16 showing an order of magnitude higher error, confirming that fragmentation breaks bitwise determinism.

Formalization. We formalize this observation as the Resource-Coupling Risk. Let Φ be a deterministic kernel function. For any two execution instances i and j with identical inputs x, the condition Φ(x)_i ≡ Φ(x)_j must hold. However, intrusive partitioning introduces a resource mapping function M : K → R_avail, where the kernel implementation K adapts its reduction strategy based on M. Consequently, the output divergence δ becomes non-zero:

    δ = |K(x, M(R_i)) − K(x, M(R_j))| > 0    (1)

In deep pipelines like LLM inference, these accumulated "micro-divergences" create a validation-deployment gap unacceptable for production critical paths.

Principle 1: Kernel execution semantics must remain unchanged to guarantee sharing equivalence.

3.2 Challenge-#2: Performance Determinism

Meeting strict tail-latency SLOs is a prerequisite for ensuring a satisfactory user experience in interactive deep learning workloads. However, achieving performance determinism (specifically, keeping P99 latency stable) is notoriously difficult in multi-tenant environments. While temporal sharing is the incumbent solution, it is fundamentally ill-suited for strict latency constraints.
The primary limitation is that time-slicing enforces serialization: it introduces unavoidable queuing delays for applications not currently scheduled on the GPU [15, 64]. As contention increases, these accumulated wait times frequently cause tail latencies to exceed strict deadlines.

Theoretical Containment (S_temporal ⊂ S_spatial). Spatial sharing offers a fundamentally more promising approach. We observe that any performance target achievable via temporal sharing is theoretically achievable via spatial sharing. Formally, let S_temporal be the set of valid schedules achievable via time-slicing (exclusive ownership for Δt) and S_spatial be the set of spatially partitioned schedules. Since temporal sharing is mathematically a degenerate case of spatial sharing (where partition sizes toggle strictly between 0% and 100%), it follows that S_temporal ⊂ S_spatial.

Corollary: If a workload set can satisfy SLOs under a temporal policy, there theoretically exists a spatial configuration that also satisfies these SLOs.

The Reality Gap: Unmanaged Interference. Despite this theoretical potential, translating spatial partitioning into practical service assurance is hindered by performance non-determinism, which we categorize into two distinct domains:

• Micro-architectural Contention: Spatial co-location implicitly multiplexes shared resources that are not explicitly partitioned, most notably Device Memory Bandwidth and Last-Level Cache capacity. As characterized in prior studies [17], a latency-sensitive inference job can suffer non-deterministic latency spikes (up to 3× [10]) when co-located with a bandwidth-heavy training job.

• Scheduler Obliviousness: Commodity GPU hardware schedulers are throughput-oriented and oblivious to high-level SLO constraints.
Without explicit intervention, a high-priority kernel may suffer from head-of-line blocking behind a massive, non-preemptible background kernel in the hardware command queue [65], rendering logical priority ineffective.

While spatial sharing theoretically encompasses the solution space for meeting SLOs, naive partitioning fails to guarantee performance determinism due to unmanaged resource contention. To bridge this gap, the sharing mechanism requires an SLO-aware scheduling policy coupled with low-latency kernel preemption, ensuring that high-priority queries can immediately reclaim resources from best-effort workloads.

Principle 2: The scheduling framework should be SLO-aware and support low-latency preemption.

3.3 Challenge-#3: Transparency and Practicality

We argue that widespread adoption of GPU sharing hinges on deployment transparency: the infrastructure must optimize resource usage without imposing any semantic or syntactic constraints on the application layer.

Ideally, the resource management runtime should strictly adhere to the end-to-end principle, treating Deep Learning models as opaque executables (i.e., "Black Boxes"). This decoupling allows for "Day-Zero" support: new model architectures can be deployed immediately without engineering intervention.

However, many existing spatial sharing designs [19, 28] sacrifice this practicality for theoretical utilization gains. Approaches relying on static compiler analysis, operator fusion, or manual kernel rewriting introduce a high "Engineering Tax". By requiring deep introspection into model graphs or invasive source code modifications, these methods tightly couple the sharing substrate with specific model implementations. This coupling creates a brittle ecosystem where even minor updates to the framework or model architecture can invalidate the optimization logic, necessitating expensive manual retuning.

The Hidden Cost: Verification Explosion.
Crucially, this coupling imposes a burden beyond development: it fundamentally expands the verification surface. In hyperscale production environments, maintaining offline-to-online equivalence is non-negotiable. Mechanisms that modify user GPU kernels or the runtime environment effectively invalidate rigorous offline validation processes. This operational reality is corroborated by infrastructure practitioners at ByteDance [66], who note that "mechanisms requiring code rewriting are typically rejected not just for complexity, but because they mandate a prohibitive re-verification process for every framework or model update."

Figure 3: The overall architecture of DETSHARE.

Principle 3: The sharing mechanism must be transparent, precluding any modification to user code or model semantics.

4 DETSHARE Design

This section details the design of DETSHARE, a system that reconciles execution determinism with high hardware utilization. Existing spatial sharing approaches often compromise semantic determinism by transparently altering block sizes to fit fragmented resources [24]. In contrast, DETSHARE adopts a transparent spatial multiplexing approach. The core design philosophy of DETSHARE is the strict decoupling of logical execution flows from physical resource provisioning.
By maintaining kernel launch configurations as immutable invariants, DETSHARE ensures that resource elasticity never alters the arithmetic behavior of the application.

We first present the overall system architecture and the core GPU Coroutine abstraction (§4.1). We then detail the runtime mechanisms for dynamic context binding and cooperative preemption (§4.2), followed by the state management optimizations that enable low-overhead migration (§4.3). Finally, we describe the extensible scheduling framework (§4.4) and our tiered fault isolation strategy (§4.5).

4.1 Overview

As illustrated in Figure 3, DETSHARE introduces a dynamic decoupling layer between the application and the GPU hardware. The core abstraction enabling this flexibility is the GPU Coroutine, which decouples the application's logical view of the device from the underlying physical hardware constraints. DETSHARE formalizes this decoupling through two primary entities:

• Virtual Context (vCtx): A persistent, application-facing handle that encapsulates the logical execution state. DETSHARE exposes vCtx as a fully transparent abstraction, functionally equivalent to a standard CUDA context from the client's perspective.

• Physical Context (pCtx): An ephemeral, hardware-backed resource container. Each pCtx is a strictly isolated execution unit provisioned with a hard limit on SMs.

Determinism via Abstraction. By separating the logical definition of a task from its physical execution capacity, DETSHARE guarantees semantic determinism. When resource availability fluctuates, DETSHARE does not perform kernel atomization (which risks numerical divergence). Instead, it employs a migrate-to-yield mechanism: the vCtx is transparently rebound to a different pCtx with an adjusted resource quota.
Consequently, the computational function remains Result = F(D, S), independent of the instantaneous resource state R(t).

4.2 Mechanism: GPU Coroutines

The DETSHARE runtime manages the lifecycle of GPU Coroutines, handling dynamic binding, execution signaling, and cooperative preemption.

The pCtx Pool. Dynamically instantiating hardware contexts incurs significant driver latency, which is prohibitive for microsecond-scale scheduling. DETSHARE mitigates this by maintaining a pre-provisioned pool of pCtx instances at initialization. These contexts are configured with discrete SM quota tiers (e.g., 25%, 50%, 100%). This pooling strategy removes control-plane overhead from the critical path, enabling zero-allocation-latency dispatch and migration.

Dynamic Binding and Dispatch. DETSHARE maintains a Binding Table to track the mapping between active vCtxs and physical resources. Upon intercepting a launch API (e.g., cuLaunchKernel), the runtime consults the global scheduler to determine the optimal placement. Depending on the current mapping and resource state, the dispatch process proceeds under two distinct scenarios:

• Direct Dispatch: If the vCtx is already bound to a pCtx that meets current SLO requirements, the kernel is directly enqueued to the hardware command queue. This execution flow incurs negligible overhead.

• Context Remapping: If the scheduler detects a priority inversion or resource fragmentation, it triggers a Remapping Event prior to execution. The vCtx is unbound from its current container and undergoes a lightweight migration to a target pCtx with the appropriate quota.

Cooperative Preemption via Signaling. To support latency-sensitive workloads, DETSHARE implements a low-latency preemption mechanism that strictly preserves kernel semantics. As shown in Figure 4, DETSHARE injects a lightweight Resident Control Kernel (RCK) into each active pCtx.
The RCK is a persistent, single-thread kernel consum- ing negligible resources ( < 0 . 1% SM occupancy), acting as a device-side signal handler . 6 L o w - P r io r i ty P r o g r a m H ig h - P r io r i ty P r o g r a m v C t x 2 pC t x 1 ( L a r g e SM Q u o ta ) G P U M e m o r y S M s K e r n e l 1 P r e e m p t i o n T r i g g e r R C K K e r n e l 2 K e r n e l 3 pC t x2 ( Sma ll SM Q u o ta ) G PU M e m o r y SM s K e r n e l 1 C o G P U Sc h e d u le r p C tx M ig r a tio n R C K pC t x P ool H ig h - P r io r i ty Fir s t Po lic y p C t x 1 v C t x 1 p C t x 2 … Up d a te p C t x 1 v C t x 2 p C t x 2 v C t x 1 … p C tx 3 ... v C t x 1 H i g h - P r i o r i t y K e r n e l I n t e r r u p t! K e r n e l 2 K e r n e l 3 K e r n e l R e a s s i g n L a z y C o p y in g p C tx 4 p C tx 5 M a p p in g ① ② ③ ④ ⑤ Figure 4: An e xample of Remapping Event and pCtx migra- tion. When the scheduler necessitates preemption (e.g., to re- claim SMs for a high-priority inference task), it asserts a flag in a shared memory region mapped to the target pCtx . The RCK detects this signal and triggers a cooperati ve yield. Unlike traditional mechanisms that terminate the ex ecuting kernel, D E T S H A R E lev erages CUDA Stream dependencies to pause ex ecution at the nearest kernel or instruction block boundary . This safe suspension allows D E T S H A R E to syn- chronize state and initiate the context migration process, effec- ti vely yielding ph ysical resources while preserving the logical progress of the preempted task. 4.3 State Management and Migration Minimizing the volume of data transferred between pCtx instances is critical for efficient context migration. Since inter - context memory movement typically dominates migration latency , D E T S H A R E employs three complementary optimiza- tions to mitigate this o v erhead: fine-grained working set track- ing, dependency-aware input analysis, and demand-driv en lazy copying. Fine-Grained Usage T racing. 
A naive migration strategy that snapshots the entire GPU memory address space incurs prohibitive overheads, particularly for GPU kernels that access only a fraction of their allocated buffers. To address this, DETSHARE implements fine-grained tracing to monitor the "dirty" state and access patterns of memory regions. By maintaining a precise view of the active memory footprint for each vCtx, the runtime identifies the exact subset of data requiring coherence in the destination pCtx, thereby filtering out stale or idle pages from the migration set.

Introspection-Based Input Analysis. To further prune the migration footprint, DETSHARE performs lightweight data-flow analysis on GPU kernel launch parameters. Prior to dispatch, the runtime intercepts execution to examine parameter structures and their associated device pointers, deriving the minimal closure of memory objects required for correctness. By isolating the kernel-reachable working set, DETSHARE avoids the indiscriminate transfer of the global heap, restricting data movement strictly to the regions touched during execution. This approach significantly reduces migration latency, especially for workloads exhibiting sparse access patterns over large memory allocations.

Demand-Driven Lazy Copying. Finally, DETSHARE leverages a lazy migration policy to amortize data transfer costs. During a migration event, only the critical execution state, comprising GPU kernel launch parameters and their immediate dependencies, is eagerly transferred to the target pCtx. The remainder of the working set is migrated on demand: if the resumed kernel accesses a non-resident region, DETSHARE triggers an asynchronous, incremental transfer in the background. This strategy eliminates synchronous blocking, effectively overlaps communication with computation, and ensures that migration remains transparent to the application's control flow.
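The interplay of the three optimizations can be sketched as follows. Region identifiers stand in for GPU memory ranges, and all names (MigrationPlanner, eager_set, on_access) are illustrative assumptions rather than DETSHARE's real data structures:

```cpp
#include <set>
#include <vector>
#include <algorithm>
#include <iterator>

// Illustrative sketch of the three migration optimizations.
struct MigrationPlanner {
    std::set<int> dirty;     // fine-grained usage tracing: touched regions
    std::set<int> resident;  // regions already materialized on the target pCtx

    void mark_dirty(int region) { dirty.insert(region); }

    // Introspection-based input analysis: intersect the dirty set with the
    // closure of regions reachable from the kernel's launch parameters, so
    // clean or unreachable buffers are never copied eagerly.
    std::vector<int> eager_set(const std::set<int>& kernel_reachable) {
        std::vector<int> out;
        std::set_intersection(dirty.begin(), dirty.end(),
                              kernel_reachable.begin(), kernel_reachable.end(),
                              std::back_inserter(out));
        return out;
    }

    // Demand-driven lazy copying: a non-resident region is transferred only
    // when the resumed kernel first touches it.
    bool on_access(int region) {
        if (resident.count(region)) return false;  // already local, no copy
        resident.insert(region);                   // incremental transfer
        return true;                               // a lazy copy was triggered
    }
};
```

The eager set is deliberately small (dirty AND kernel-reachable); everything else is deferred to first access, which is what lets the transfer overlap with computation.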
4.4 Extensible Scheduling Framework

DETSHARE does not enforce a monolithic scheduling algorithm. Instead, it adopts a decoupled architecture that strictly separates the scheduling mechanism (context switching, signal injection) from the scheduling policy. This design exposes an extensible framework, allowing operators to define custom logic tailored to specific application SLOs without modifying the underlying runtime.

The Pluggable Policy Interface. DETSHARE abstracts the scheduling decision process into a Policy Interface. Users or cluster administrators can implement this interface to define how DETSHARE reacts to lifecycle events. The framework exposes key hooks such as OnLaunch, OnCompletion, and OnCongestion. By implementing these methods, custom policies can inspect the global state, including the active pCtx pool status, queuing delays, and historical kernel execution profiles, to make granular decisions.

For example, a user can implement a "Cost-Aware" policy that prioritizes throughput during off-peak hours but switches to strict latency prioritization when a specific flag is set. DETSHARE treats these policies as pluggable modules, enabling the runtime to switch strategies dynamically without recompilation.

Reference Policies. While the framework is extensible, DETSHARE provides two production-grade reference policies:

• SLO-AWARE Policy (Default): This policy targets mixed inference-training workloads. It utilizes a lightweight predictor (based on historical grid sizes and durations) to estimate the Head-of-Line Blocking time. If an incoming high-priority kernel is predicted to miss its deadline due to a running background task, the policy returns a preempt decision, triggering the mechanism described in §4.2.

• TPOT-FIRST Policy: Designed specifically for generative AI, this policy (evaluated in §6.2) introspects the semantic phase of LLM inference.
It prioritizes the decoding phase over the prefill phase, ensuring deterministic inter-token latency even under heavy contention.

QoS Enforcement Mechanisms. Regardless of the chosen policy, DETSHARE operationalizes scheduling decisions through a robust QoS layer. This layer acts as a translation barrier, mapping high-level logical directives into concrete physical resource partitions. By leveraging hardware-assisted isolation primitives for both spatial partitioning and temporal throttling, DETSHARE ensures that user-defined policies are strictly bounded by hardware-enforced guarantees, preventing low-priority workloads from eroding the performance isolation of critical tasks.

4.5 Fault Isolation

A robust multi-tenant system must ensure that failures in one context do not propagate to others. DETSHARE implements a two-tiered fault isolation strategy to handle both fatal exceptions (which compromise correctness) and soft hangs (which degrade performance).

Fatal Exception Containment. Fatal exceptions are categorized based on their blast radius. Local exceptions (e.g., an illegal memory access within a kernel) corrupt only a single pCtx. DETSHARE leverages MPS-based static partitioning to strictly contain these failures; a crash in one pCtx terminates only the bound vCtx, while kernels in concurrent pCtx instances continue execution uninterrupted. Global exceptions (e.g., uncorrectable ECC errors or MMU faults) may contaminate the entire GPU device state. To mitigate cross-context impact, DETSHARE employs an emergency migration protocol. Upon detecting a global fault signal, the runtime pauses all unaffected vCtx instances and seamlessly migrates them to standby pCtx slots on alternative GPUs. This cross-device migration ensures high availability even in the presence of catastrophic hardware failures.

Soft Hang Mitigation.
Soft hangs, such as deadlocks or infinite loops, represent non-fatal but resource-wasting stalls. The DETSHARE scheduler detects these conditions using timeout-based heuristics: if a kernel exceeds its predicted execution window by a configurable threshold (e.g., 3×), it is flagged as potentially hung. To prevent these pathological workloads from monopolizing resources, the scheduling policy progressively demotes the offending vCtx. In subsequent scheduling rounds, the hung kernel is preempted and migrated to pCtx instances with strictly capped SM quotas (e.g., the minimal tier). This "quarantine" approach ensures that while the hung task is not immediately killed (preserving debugging state), its resource consumption is throttled to prevent performance degradation for well-behaved concurrent tasks.

5 Implementation

The core of DETSHARE comprises approximately 6,000 lines of C++ code and 1,000 lines of CUDA code. To ensure both application transparency and fault tolerance, the implementation is architected around two primary components: a lightweight interception shim layer and a standalone global scheduler daemon.

To interpose on workload submissions without modifying user code, the shim layer (see Figure 3) is transparently injected via LD_PRELOAD to intercept CUDA Driver and Runtime APIs (e.g., cuLaunchKernel, cudaMalloc). These API wrappers capture launch metadata and enqueue requests into lock-free, shared-memory ring buffers. By leveraging shared memory for Inter-Process Communication (IPC) with the global scheduler, DETSHARE minimizes submission latency and keeps the direct-dispatch overhead negligible.

At the execution level, DETSHARE employs a hardware-software co-design to enforce strict physical resource isolation.
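As a rough sketch of the shim-plus-ring-buffer design described above: a wrapper captures launch metadata into a lock-free single-producer/single-consumer ring before forwarding to the driver. In the real shim the driver symbol would be resolved via dlsym(RTLD_NEXT, ...); here a stub counter stands in for the forwarded call, and all names are hypothetical:

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical stand-in for the LD_PRELOAD shim and its IPC channel.
struct LaunchRecord { unsigned grid, block; };

template <size_t N>
struct SpscRing {
    std::array<LaunchRecord, N> buf;
    std::atomic<size_t> head{0}, tail{0};  // monotonically increasing counters

    bool push(LaunchRecord r) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false;  // full
        buf[t % N] = r;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(LaunchRecord& r) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;  // empty
        r = buf[h % N];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};

SpscRing<64> g_ring;        // shared-memory IPC channel to the scheduler daemon
int real_launch_calls = 0;  // stub: stands in for the forwarded driver call

// The interposed entry point: record metadata, then forward.
int shim_launch(unsigned grid, unsigned block) {
    g_ring.push({grid, block});
    ++real_launch_calls;  // in reality: real_cuLaunchKernel(...)
    return 0;
}
```

The application-facing call stays synchronous and cheap; the scheduler daemon drains the ring asynchronously, which is what keeps submission latency off the critical path.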
The global scheduler instantiates physical GPU contexts (the pCtx pool) utilizing NVIDIA Green Contexts, which achieve fine-grained spatial partitioning by enforcing hard quotas on SMs. To safely multiplex these hardware units while strictly preserving the execution semantics of GPU kernels, DETSHARE orchestrates the contexts over NVIDIA's MPS to guarantee robust address space isolation.

6 Evaluation

Our key takeaways from the evaluation are:

• Higher Training Throughput: By enabling efficient spatial sharing, DETSHARE improves training throughput by up to 79.2% compared to temporal sharing.

• Lower Inference Latency: With workload-aware job placement policies and lightweight context migration, DETSHARE reduces the average inference latency by up to 69.1%.

• Flexible Scheduling Policy: Utilizing our pluggable TPOT-FIRST policy, DETSHARE reduces the TPOT SLO violation rate by an additional 21.2% relative to the default policy, demonstrating robust adaptability to dynamic traffic.

Experimental Setup. We evaluate DETSHARE across two distinct hardware platforms to capture diverse resource bottlenecks: an NVIDIA A800 cluster and NVIDIA Hopper-architecture GPUs (e.g., H20). To accurately reflect multi-tenant production environments, our workloads span a broad spectrum of AI tasks. Rather than executing tasks in isolation, we construct rigorous colocation scenarios by pairing diverse throughput-oriented training jobs together, and by co-scheduling heavy background training tasks alongside latency-critical LLM inference streams.

Table 2: Experimental setup for Colocated AI Model Training. Workloads are categorized by their dominant resource bottleneck.
Category                  Config  Job 1                      Job 2
High Compute Contention   A       ResNet-50 [68]             ResNet-50
                          B       ResNet-101                 ResNet-101
                          C       BERT-Large [31]            BERT-Large
                          D       BERT-Large                 ResNet-50
Low Compute Contention    1       RNN [69]                   RNN
                          2       DeepFM [70]                ResNet-101
LLM Training              E       GLM-4-9B [71] (LoRA [72])  GLM-4-9B (LoRA)

6.1 End-to-End Performance Evaluation

We evaluate the end-to-end performance of DETSHARE to demonstrate its efficacy in maximizing aggregate throughput while maintaining strict performance isolation. Our evaluation covers two distinct colocation scenarios: (1) Homogeneous and Heterogeneous Model Training (Table 2); and (2) Mixed Workload Colocation, pairing throughput-oriented training with latency-critical LLM inference (Table 3).

Baselines. We compare DETSHARE against industry-standard hardware partitioning mechanisms (NVIDIA MPS and MIG), as well as state-of-the-art software-defined schedulers including Orion [28], Tick-Tock [53], and Salus [67]. Since the source code for LithOS [24] is not publicly available, we implemented a faithful reproduction of its scheduling logic based on its design principles to serve as a comparative baseline.

6.1.1 Colocated AI Model Training

Figure 5 reports the normalized throughput across all configurations. DETSHARE consistently achieves superior aggregate throughput by effectively balancing SM utilization and minimizing inter-job interference.

High Compute Contention (Configs A–D). These configurations represent arithmetic-bound scenarios involving compute-intensive models where the GPU's SMs are saturated. Traditional temporal sharing and coarse-grained spatial partitioning often degrade performance due to resource fragmentation. Taking Config C (dual BERT-Large) as an example, DETSHARE attains a normalized throughput of 0.58× per job. In contrast, rigid partitioning schemes like MIG are limited to 0.53×, while temporal sharing (Orion) drops to 0.50× due to high context-switching overheads. By leveraging pCtx pooling and dynamic vCtx-pCtx dispatch without stalling the pipeline, DETSHARE significantly reduces inter-job interference, maintaining high efficiency even under heavy arithmetic contention and outperforming static heuristics such as GSlice.

Low Compute Contention (Configs 1–2). Workloads such as RNNs and DeepFM exhibit intermittent GPU utilization, often bounded by memory bandwidth or host-device I/O. DETSHARE effectively harvests these ephemeral idle cycles ("execution bubbles") by rapidly migrating vCtxs to underutilized pCtxs. In Config 1 (dual RNN jobs), DETSHARE achieves near-optimal isolation with normalized speeds of 0.99× per job, effectively matching dedicated execution performance. While baselines like Tick-Tock and Salus plateau around 0.85×–0.90×, DETSHARE exhibits superior scheduling granularity in identifying and filling millisecond-level idle slots.

LLM Training (Config E). We further evaluate DETSHARE on large-scale foundation model fine-tuning (GLM-4-9B with LoRA). DETSHARE (0.95×) outperforms hardware-level partitioning (MIG, 0.93×) by enabling flexible resource redistribution rather than static allocation. Notably, static spatial schedulers like GSlice struggle in this scenario (0.75×–0.79×), failing to manage the massive kernel launches and dynamic memory footprints associated with LLMs. DETSHARE sustains high execution efficiency even under these intense resource demands, highlighting its robust applicability to modern GenAI workloads.

Table 3: Experimental setup for Mixed Workload Colocation. We pair a throughput-oriented batch job (Job 1) with a latency-critical inference job (Job 2).
Category             Config  Job 1 (Train)  Job 2 (Inference)
CV + LLM Inference   A       ResNet         Llama 3 [9]
                     B       ResNet         GPT-J [73]
NLP + LLM Inference  1       BERT           Llama 3
                     2       BERT           GPT-J

6.1.2 DNN Training with LLM Inference

To evaluate DETSHARE in mixed-criticality scenarios, we pair throughput-oriented training jobs with latency-sensitive LLM inference requests. The specific workload combinations (Table 3) cover Computer Vision (CV) and NLP models as background tasks, colocated with Llama 3 and GPT-J inference streams.

Figure 6 presents the performance trade-offs between batch training throughput and inference tail latency. DETSHARE demonstrates a superior Pareto frontier compared to all baselines, successfully maintaining strict SLOs for interactive workloads while maximizing the utilization of residual GPU cycles for background training.

Figure 5: Normalized throughput of colocated model training tasks (higher is better). DETSHARE consistently outperforms baselines, particularly in scenarios with high resource contention or varying interference patterns.

Figure 6: Performance comparison on mixed workloads. (a) DNN training job throughput (higher is better) and (b) LLM inference job p99 latency (lower is better) under four configurations. DETSHARE effectively balances system efficiency and SLOs, outperforming methods like Orion and LithOS in both metrics.

Guaranteeing Inference SLOs. For latency-critical LLM workloads, DETSHARE achieves the lowest 99th-percentile (p99) latency across all configurations. In Config A, DETSHARE reduces the p99 latency to 4.5 (normalized units), representing a 2.6× and 2.2× reduction compared to Orion (12.0) and MIG (10.0), respectively. Similarly, in the more memory-intensive Config B (ResNet + GPT-J), DETSHARE maintains a p99 latency of 5.9, significantly outperforming Orion (15.0) and MPS (10.2).
Hardware partitioning methods like MIG enforce inflexible spatial boundaries; they lack the dynamic prioritization required to accommodate bursty inference traffic. Conversely, existing software schedulers like Orion fail to provide strict performance isolation. Under coarse-grained task colocation, critical inference GPU kernels are easily starved by heavy training workloads, resulting in severe interference. DETSHARE, by contrast, couples compute resource isolation via Green Contexts with cooperative preemption driven by injected RCKs. This joint mechanism ensures strict performance isolation and immediate physical resource reclamation for latency-sensitive inference tasks, without violating semantic determinism.

Maximizing Training Throughput. While prioritizing inference, DETSHARE does not compromise the progress of background training jobs (Job 1). For example, in Config 1, DETSHARE achieves a normalized throughput of 0.60, outperforming LithOS (0.58), MPS (0.55), and Orion (0.53). Because inference workloads exhibit bursty arrival patterns, static partitioning (MIG/MPS) leaves compute units underutilized during idle gaps. DETSHARE's dynamic vCtx-pCtx remapping allows the background training job to instantly reclaim these idle cycles, maximizing aggregate GPU utilization.

6.2 Scheduling Policy Evaluation

We evaluate the impact of DETSHARE's extensible scheduling policies through a comprehensive case study.

6.2.1 Traces and Baselines

To demonstrate DETSHARE's transparency and effectiveness at the infrastructure layer, we evaluate an unmodified instance of the state-of-the-art serving system, SGLang [59], running entirely on top of DETSHARE to serve the Llama-3.1-8B-Instruct model. We compare this against a baseline of SGLang executing on a default, exclusively provisioned GPU.
We run across three traces representing distinct production scenarios:

• Azure LLM Trace 2024 [74]: Highly stochastic request arrival times, testing general adaptability.

• LongBench [75]: Long-context tasks imposing significant pressure on GPU memory capacity and bandwidth.

• BurstGPT [76]: Extreme traffic spikes challenging the scheduler's queue management.

We analyze DETSHARE under two pluggable configurations: the DEFAULT Policy (balancing resource fairness) and the TPOT-FIRST Policy (explicitly prioritizing the decoding phase).

Figure 7: End-to-end evaluation on Azure, LongBench, and BurstGPT traces: throughput (higher is better), latency breakdown and SLO violations (lower is better). DETSHARE reduces average and tail latency while maintaining throughput. Adopting the TPOT-FIRST strategy further decreases decoding tail latency.

6.2.2 Performance Analysis

Figure 7 depicts the throughput, latency breakdown, and SLO violation rates. Overall, DETSHARE-backed SGLang significantly outperforms the baseline. For instance, on the Azure trace, DETSHARE reduces the average TTFT by 69.1% (0.55 s to 0.17 s) while maintaining comparable throughput. Furthermore, DETSHARE demonstrates superior stability by reducing the P99 TPOT on LongBench by 4.7× (1723.9 ms vs. 366.2 ms). Given that strict tail-latency guarantees during token generation are paramount for a seamless interactive user experience, this dramatic reduction highlights DETSHARE's efficacy even in severely memory-constrained scenarios.

Impact of TPOT-FIRST Policy. The TPOT-FIRST Policy exhibits enhanced robustness in minimizing generation-latency violations, particularly under fluctuating loads.

• General Improvement: On the Azure trace, the TPOT-FIRST policy reduces TPOT SLO violations by 21.2% compared to the DEFAULT policy.
• Trade-off in Bursty Scenarios: The advantages of the TPOT-FIRST policy are most pronounced in the BurstGPT benchmark. As shown in Figure 7, this policy reduces TPOT SLO violations by 46.1% (17.77% to 9.58%) and dramatically lowers the P90 TPOT from 847.7 ms to 262.9 ms. However, we observe a concomitant increase in TTFT SLO violations, which rise from 4.33% to 6.37%.

Analysis: We attribute this phenomenon to a necessary scheduling trade-off. Under extreme burstiness, the TPOT-FIRST policy proactively delays the scheduling of new prefill requests (increasing TTFT) to reserve compute and memory bandwidth for ongoing decoding phases. This strategy effectively insulates active requests from interference caused by sudden spikes in arrival rates, ensuring a smooth generation experience at the cost of slightly higher initial queuing delays.

Figure 8: Overhead analysis. DETSHARE incurs only 4% and 12% overhead for context switching and preemption, respectively, normalized to exclusive execution.

6.3 Overhead Analysis

Figure 8 quantifies the performance overhead introduced by DETSHARE, normalized to the exclusive-GPU baseline. Overall, DETSHARE incurs a marginal overhead of 4% for context switching and 12% for preemption.

Analysis. The context-switching overhead is kept minimal because the pCtx leverages Context Memory Tracing to avoid redundant data migration between device and host memory. The 12% preemption overhead is primarily dominated by internal synchronization costs when the RCK interrupts the GPU execution flow. Given the substantial reductions in tail latency and SLO violations (e.g., 4.7× lower P99 TPOT), we argue that this moderate overhead is a highly favorable trade-off for multi-tenant production environments.

7 Related Work

Temporal Sharing. Temporal sharing remains the default mechanism for multi-tenant GPU execution.
Prior systems have significantly enhanced this approach for DNN workloads by mitigating context-switching overheads through state reuse (Salus [67]), dynamically managing containerized or runtime memory footprints (AntMan [15], TGS [13], Zico [51]), and employing predictive scheduling to bound inference tail latency (Clockwork [64], Gandiva [14]). However, these solutions inherently operate at the granularity of the entire GPU. Consequently, they enforce strict execution serialization and fail to overlap computation. This limitation prevents them from reclaiming idle capacity during memory-bound phases or bursty workloads, leaving significant hardware resources underutilized.

Spatial Sharing. Spatial sharing partitions hardware to enable concurrent execution. Foundational mechanisms include intra-context CUDA Streams [77] and NVIDIA's static partitioning tools, MPS and MIG. Systems such as GSlice [19], GPUlets [20], and COLTI [78] build upon these primitives to construct higher-level spatio-temporal abstractions. Yet, because their underlying substrates rely on rigid, static partitioning, they cannot elastically adapt to the highly dynamic resource demands characteristic of modern workloads.

Recent literature pushes for even finer-grained multiplexing. POD-Attention [10] rewrites attention kernels to collocate prefill and decode phases, while LithOS [24] atomizes kernels to pack execution fragments at the Texture Processing Cluster (TPC) level. While effective for utilization, these invasive techniques necessitate user-kernel modifications, risking behavioral inequivalence and complicating offline-to-online correctness guarantees. In contrast, DETSHARE avoids this semantic trap: by decoupling logical contexts from physical hardware, it dynamically remaps quotas at runtime to achieve elastic, fine-grained spatial multiplexing while strictly preserving execution determinism.
LLM Inference Scheduling. The unique resource asymmetries of LLM inference have spurred specialized scheduling techniques. Recent work manages the prefill-decode imbalance via stall-free chunking (Sarathi-Serve [47]) and fairness-aware phase balancing (FairBatching [79]). Other efforts explore temporal scheduling through preemptible context migrations (Llumnix [80]) and output-length-guided token-level preemption (TRAIL [81]), or focus on QoS mitigation via adapter-aware caching (Chameleon [82]) and real-time latency attribution (LLMVisor [83], Niyama [84]). While these systems optimize application-level scheduling, DETSHARE provides a transparent, infrastructure-level foundation. Its extensible policy framework allows operators to plug in diverse algorithms while enforcing execution through strict, hardware-level isolation primitives.

GPU Memory Management and Disaggregation. Frameworks like AvA [85], rCUDA [16], and DCUDA [86] leverage API remoting for disaggregated execution and flexible GPU kernel placement. Simultaneously, memory-centric systems like DeepUM [87], Zico [51], and TGS [13] optimize device utilization by reducing allocation overheads and shrinking active working-set footprints. These memory optimization and disaggregation techniques are entirely orthogonal and complementary to DETSHARE, presenting natural avenues for future integration to further maximize efficiency in shared environments.

8 Conclusion

In this paper, we identify and resolve the fundamental challenges of execution determinism, performance predictability, and practical extensibility in modern GPU sharing. To this end, we propose DETSHARE, a transparent GPU sharing system designed to maximize hardware utilization without compromising determinism.
By fundamentally decoupling logical execution contexts from physical hardware resources via a novel GPU coroutine abstraction, DETSHARE enables flexible spatial multiplexing with zero kernel modifications. Ultimately, DETSHARE successfully achieves strict semantic and performance determinism, providing a robust, zero-engineering-tax foundation for multi-tenant AI deployments.

References

[1] OpenAI. ChatGPT. https://chat.openai.com, 2023. Accessed: 2025-11-28.

[2] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[3] Xue Jiang, Yihong Dong, Lecheng Wang, Zheng Fang, Qiwei Shang, Ge Li, Zhi Jin, and Wenpin Jiao. Self-planning code generation with large language models. ACM Transactions on Software Engineering and Methodology, 33(7):1–30, 2024.

[4] Qiuhan Gu. LLM-based code generation method for Golang compiler testing. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 2201–2203, 2023.

[5] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19632–19642, March 2024.

[6] Gartner. Gartner says AI-optimized IaaS is poised to become the next growth engine for AI infrastructure, October 2025. Accessed: 2025-11-28.

[7] Tom Sorensen and Bob Sorensen. Cloud-based AI activity for HPC: Widespread but primarily exploratory. Technical Report HR4.0492.09.20.2024, Hyperion Research, September 2024. Accessed: 2025-11-28.

[8] Yueying Li, Zhanqiu Hu, Esha Choukse, Rodrigo Fonseca, G Edward Suh, and Udit Gupta. EcoServe: Designing carbon-aware AI inference systems. arXiv preprint, 2025.
[9] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kulkarni, Gaurav Goel, Kanshul Nguyen, Punit Kulkarni, et al. The Llama 3 herd of models. arXiv preprint, 2024.

[10] Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, and Ashish Panwar. POD-Attention: Unlocking full prefill-decode overlap for faster LLM inference. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 897–912, 2025.

[11] Edward Ionel. Improving GPU utilization: Strategies and best practices. Mirantis Blog.

[12] Debo Ray. Why your million-dollar GPU cluster is 80% idle. https://www.devzero.io/blog/why-your-gpu-cluster-is-idle.

[13] Bingyang Wu, Zili Zhang, Zhihao Bai, Xuanzhe Liu, and Xin Jin. Transparent GPU sharing in container clouds for deep learning workloads. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 69–85, 2023.

[14] Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, et al. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 595–610, 2018.

[15] Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. AntMan: Dynamic scaling on GPU clusters for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 533–548, 2020.

[16] Jose Duato, Antonio J Pena, Federico Silla, Juan C Fernandez, Rafael Mayo, and Enrique S Quintana-Orti. Enabling CUDA acceleration within virtual machines using rCUDA. In 2011 18th International Conference on High Performance Computing, pages 1–10. IEEE, 2011.

[17] Guin Gilman and Robert J. Walls.
Characterizing concurrency mechanisms for NVIDIA GPUs under deep learning workloads (extended abstract). SIGMETRICS Perform. Eval. Rev., 49(3):32–34, March 2022.

[18] Guin Gilman, Samuel S Ogden, Tian Guo, and Robert J Walls. Demystifying the placement policies of the NVIDIA GPU thread block scheduler for concurrent kernels. ACM SIGMETRICS Performance Evaluation Review, 48(3):81–88, 2021.

[19] Aditya Dhakal, Sameer G Kulkarni, and K. K. Ramakrishnan. GSlice: controlled spatial sharing of GPUs for a scalable inference platform. In Proceedings of the 11th ACM Symposium on Cloud Computing, SoCC '20, pages 492–506, New York, NY, USA, 2020. Association for Computing Machinery.

[20] Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh. Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing. In 2022 USENIX Annual Technical Conference (USENIX ATC 22), pages 199–216, 2022.

[21] Bing-Shiun Han, Tathagata Paul, Zhenhua Liu, and Anshul Gandhi. KACE: Kernel-aware colocation for efficient GPU spatial sharing. In Proceedings of the 2024 ACM Symposium on Cloud Computing, pages 460–469, 2024.

[22] Aditya Dhakal, Junguk Cho, Sameer G Kulkarni, KK Ramakrishnan, and Puneet Sharma. Spatial sharing of GPU for autotuning DNN models. arXiv preprint, 2020.

[23] Shulai Zhang, Quan Chen, Weihao Cui, Han Zhao, Chunyu Xue, Zhen Zheng, Wei Lin, and Minyi Guo. Improving GPU sharing performance through adaptive bubbleless spatial-temporal sharing. In Proceedings of the Twentieth European Conference on Computer Systems, pages 573–588, 2025.

[24] Patrick H Coppock, Brian Zhang, Eliot H Solomon, Vasilis Kypriotis, Leon Yang, Bikash Sharma, Dan Schatzberg, Todd C Mowry, and Dimitrios Skarlatos. LithOS: An operating system for efficient machine learning on GPUs. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 1–17, 2025.
[25] Horace He and Thinking Machines Lab. Defeating nondeterminism in LLM inference. Thinking Machines Lab: Connectionism, 2025. https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/.

[26] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022.

[27] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024.

[28] Foteini Strati, Xianzhe Ma, and Ana Klimovic. Orion: Interference-aware, fine-grained GPU sharing for ML applications. In Proceedings of the Nineteenth European Conference on Computer Systems, pages 1075–1092, 2024.

[29] NVIDIA Corporation. CUDA Multi-Process Service (MPS) Overview, 2025. Describes the MPS client-server model that multiplexes multiple processes into a single CUDA context to reduce context-switch overhead and enable concurrent kernel execution.

[30] NVIDIA. CUDA Driver API – Green Contexts, 2025. Accessed: 2025-12-10.

[31] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

[32] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.

[33] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi.
BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
[34] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.
[35] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235, 2023.
[36] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
[37] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023.
[38] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
[39] Yuchen Zhuang, Haotian Sun, Yue Yu, Rushi Qiang, Qifan Wang, Chao Zhang, and Bo Dai. Hydra: Model factorization framework for black-box llm personalization. Advances in Neural Information Processing Systems, 37:100783–100815, 2024.
[40] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 52(1):1–38, 2019.
[41] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016.
[42] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[43] Ron Banner, Yury Nahshan, and Daniel Soudry.
Post training 4-bit quantization of convolutional networks for rapid-deployment. Advances in Neural Information Processing Systems, 32, 2019.
[44] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.
[45] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
[46] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2(1), 2023.
[47] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 117–134, 2024.
[48] Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, et al. NanoFlow: Towards optimal large language model serving throughput. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 749–765, 2025.
[49] Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J Rossbach, and Onur Mutlu. MASK: Redesigning the gpu memory hierarchy to support multi-application concurrency. ACM SIGPLAN Notices, 53(2):503–518, 2018.
[50] Zhihao Bai, Zhen Zhang, Yibo Zhu, and Xin Jin. PipeSwitch: Fast pipelined context switching for deep learning applications.
In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 499–514, 2020.
[51] Gangmuk Lim, Jeongseob Ahn, Wencong Xiao, Youngjin Kwon, and Myeongjae Jeon. Zico: Efficient GPU memory sharing for concurrent DNN training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 161–175, 2021.
[52] Manos Pavlidakis, Stelios Mavridis, Antony Chazapis, Giorgos Vasiliadis, and Angelos Bilas. Arax: A runtime framework for decoupling applications from heterogeneous accelerators. In Proceedings of the 13th Symposium on Cloud Computing, pages 1–15, 2022.
[53] Guanhua Wang, Kehan Wang, Kenan Jiang, Xiangjun Li, and Ion Stoica. Wavelet: Efficient dnn training with tick-tock scheduling. Proceedings of Machine Learning and Systems, 3:696–710, 2021.
[54] NVIDIA Corporation. NVIDIA Multi-Instance GPU (MIG) User Guide, 2025. Describes GPU partitioning into multiple isolated GPU instances with dedicated compute, cache, and memory resources, enabling spatial sharing with strong isolation.
[55] Baolin Li, Tirthak Patel, Siddharth Samsi, Vijay Gadepally, and Devesh Tiwari. MISO: Exploiting multi-instance gpu capability on multi-tenant gpu clusters. In Proceedings of the 13th Symposium on Cloud Computing, pages 173–189, 2022.
[56] David Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys (CSUR), 23(1):5–48, 1991.
[57] NVIDIA Corporation. cuBLAS library documentation. https://docs.nvidia.com/cuda/cublas/index.html, 2025.
[58] Aleksei Petrenko, Ben Lipkin, Kevin Chen, Erik Wijmans, Marco Cusumano-Towner, Raja Giryes, and Philipp Krähenbühl. Entropy-preserving reinforcement learning. In International Conference on Learning Representations (ICLR), 2026.
[59] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng.
SGLang: Efficient execution of structured language model programs. In Proceedings of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS), 2024.
[60] LMSYS Org. SGLang v0.4.3: Deterministic attention and pytorch-native backend. https://lmsys.org/blog/2025-09-22-sglang-deterministic/, September 2025.
[61] Peichen Xie, Xian Zhang, and Shuo Chen. RepDL: Bit-level reproducible deep learning training and inference, 2025.
[62] Jiacai Liu, Yingru Li, Yuqian Fu, Jiawei Wang, Qian Liu, and Zhuo Jiang. When speed kills stability: Demystifying RL collapse from the training-inference mismatch. https://richardli.xyz/rl-collapse, September 2025.
[63] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
[64] Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. Serving dnns like clockwork: Performance predictability from the bottom up. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, OSDI '20, USA, 2020. USENIX Association.
[65] Weihang Shen, Mingcong Han, Jialong Liu, Rong Chen, and Haibo Chen. XSched: Preemptive scheduling for diverse xpus. In Proceedings of the 19th USENIX Conference on Operating Systems Design and Implementation, OSDI '25, USA, 2025. USENIX Association.
[66] ByteDance Infrastructure Team. Private communication regarding production gpu sharing constraints. Personal communication, 2026. Unpublished industry insights.
[67] Peifeng Yu and Mosharaf Chowdhury. Salus: Fine-grained GPU sharing primitives for deep learning applications. In Proceedings of the 3rd MLSys Conference (MLSys), Austin, TX, USA, March 2020.
[68] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[69] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[70] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. DeepFM: A factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), pages 1725–1731, 2017.
[71] GLM-4 Team and Zhipu AI. GLM-4: Towards open source language models for academic research. arXiv preprint arXiv:2406.12793, 2024.
[72] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.
[73] Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 billion parameter autoregressive language model. https://github.com/kingoflolz/mesh-transformer-jax, 2021. EleutherAI.
[74] Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. DynamoLLM: Designing LLM inference clusters for performance and energy efficiency. In Proceedings of the 31st IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2025. Best Paper Award.
[75] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
[76] Yuxin Wang, Yibo Chen, Zhaozhu Li, Xinyu Kang, Yinan Fang, Yangtian Zhou, Yujie Zheng, Zhennan Tang, Xiuming He, Rong Guo, Xin Wang, Qiang Wang, Aoying Zhou, and Xiaowen Chu.
BurstGPT: A real-world workload dataset to optimize LLM serving systems. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2025.
[77] CUDA Runtime API :: CUDA Toolkit Documentation. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html, November 2025. (Accessed on 01/12/2025).
[78] Jaiaid Mobin, Avinash Maurya, and M Mustafa Rafique. Colti: Towards concurrent and co-located dnn training and inference. In Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, pages 309–310, 2023.
[79] Hongtao Lyu, Boyue Liu, Mingyu Wu, and Haibo Chen. FairBatching: Fairness-aware batch formation for llm inference. arXiv preprint arXiv:2510.14392, 2025.
[80] Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: Dynamic scheduling for large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 173–191, 2024.
[81] Rana Shahout, Chunwei Liu, Weifan Jiang, Minlan Yu, Michael Mitzenmacher, et al. Don't stop me now: Embedding based scheduling for llms. In The Thirteenth International Conference on Learning Representations, 2025.
[82] Nikoleta Iliakopoulou, Jovan Stojkovic, Chloe Alverti, Tianyin Xu, Hubertus Franke, and Josep Torrellas. Chameleon: Adaptive caching and scheduling for many-adapter llm inference environments. In Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, pages 217–231, 2025.
[83] Shuowei Jin, Xueshen Liu, Jiaxin Shan, Le Xu, Tieying Zhang, Liguang Xie, and Z Morley Mao. Llmvisor: A real-time latency attribution model for multi-tenant llm serving.
[84] Kanishk Goel, Jayashree Mohan, Nipun Kwatra, Ravi Shreyas Anupindi, and Ramachandran Ramjee. Niyama: Breaking the silos of llm inference serving. arXiv preprint arXiv:2503.22562, 2025.
[85] Hangchen Yu, Arthur Michener Peters, Amogh Akshintala, and Christopher J Rossbach. AvA: Accelerated virtualization of accelerators. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 807–825, 2020.
[86] Fan Guo, Yongkun Li, John CS Lui, and Yinlong Xu. DCUDA: Dynamic gpu scheduling with live migration support. In Proceedings of the ACM Symposium on Cloud Computing, pages 114–125, 2019.
[87] Jaehoon Jung, Jinpyo Kim, and Jaejin Lee. DeepUM: Tensor migration and prefetching in unified memory. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 207–221, 2023.
