Serving Hybrid LLM Loads with SLO Guarantees Using CPU-GPU Attention Piggybacking

ZIZHAO MO, University of Macau, Macau SAR, China
JUNLIN CHEN, University of Macau, Macau SAR, China
HUANLE XU*, University of Macau, Macau SAR, China
CHENGZHONG XU, University of Macau, Macau SAR, China

Nowadays, service providers often deploy multiple types of LLM services within shared clusters. While such service co-location improves resource utilization, it introduces significant interference risks for latency-sensitive (LS) services, which have strict SLO requirements for inference latency, and severely constrains the service capacity of best-effort (BE) services due to limited available memory. To address interference, existing systems typically rely on reserving headroom to constrain BE resource usage. However, this approach's coarse granularity compromises the SLO compliance of the latency-sensitive service and unnecessarily restricts the generation potential of the best-effort service. In this paper, we propose OmniServe, a novel LLM serving system that efficiently harnesses both CPU and GPU resources to mitigate interference and improve throughput. Central to OmniServe is the Attention Piggybacking mechanism, which effectively offloads the Attention computation of BE services to CPUs on the fly. This mechanism also facilitates asynchronous communication between CPU and GPU streams, preventing GPUs from being blocked while aggregating Attention results. Additionally, OmniServe incorporates a dynamic batching control policy to adapt to fluctuating request arrivals, facilitating Dense module computation using layer-wise batching. Experimental results show that OmniServe improves the SLO attainment rate for LS services by up to 1.48× while enhancing BE serving throughput by up to 9.85× compared to state-of-the-art systems.

CCS Concepts: • Computer systems organization → Cloud computing.
Additional Key Words and Phrases: LLM serving, CPU-GPU collaboration, resource heterogeneity

ACM Reference Format:
Zizhao Mo, Junlin Chen, Huanle Xu, and Chengzhong Xu. 2026. Serving Hybrid LLM Loads with SLO Guarantees Using CPU-GPU Attention Piggybacking. Proc. ACM Manag. Data 4, 3 (SIGMOD), Article 230 (June 2026), 26 pages. https://doi.org/10.1145/3802107

1 Introduction

Nowadays, service providers aim to deliver a diverse range of services in datacenters using large language models (LLMs), offering versatility and flexibility across various domains such as programming assistance and question answering. Current foundational LLM models [4, 50] are typically trained on vast amounts of real-world data to capture knowledge from all aspects of human life.

*Corresponding author
Authors' Contact Information: Zizhao Mo, University of Macau, Macau SAR, China, yc17461@connect.um.edu.mo; Junlin Chen, University of Macau, Macau SAR, China, yc57440@um.edu.mo; Huanle Xu, University of Macau, Macau SAR, China, huanlexu@um.edu.mo; Chengzhong Xu, University of Macau, Macau SAR, China, czxu@um.edu.mo.
This work is licensed under a Creative Commons Attribution 4.0 International License.
© 2026 Copyright held by the owner/author(s). ACM 2836-6573/2026/6-ART230. https://doi.org/10.1145/3802107

While a variety of LLM services are deployed together in clusters to enhance resource utilization, their service requirements vary significantly. For example, online chatting is a latency-sensitive application [1] where users expect stringent service level objectives (SLOs) to ensure fast responses and high token generation rates. In contrast, many services focus on back-office tasks, such as benchmarking, form processing, and data wrangling [15, 19, 30].
These workloads have more lenient SLOs and relatively low priority, requiring service provisioning on a best-effort basis. Furthermore, even within the same service type, non-uniform service provisioning for different users is a commonly adopted strategy. For example, service providers may offer free users best-effort access while ensuring rapid responses for paid users.

As LLM-based services become increasingly popular in datacenters, it is critical to simultaneously optimize performance for different types of services [18, 23, 41]. This often requires maximizing the generation throughput of best-effort (BE) service requests without negatively impacting SLO guarantees for latency-sensitive (LS) services. To achieve this goal, Llumnix [49] proposes a priority-based scheduling strategy to dynamically schedule various requests across multiple LLM serving instances. To ensure performance isolation between different types of services, Llumnix limits the KV cache usage of BE services by reserving headroom for LS services, recognizing the positive correlation between memory usage and the level of interference introduced to LS services.

Despite its operational simplicity in production environments, Llumnix's GPU memory-centric paradigm inadequately sustains balanced quality-of-service for hybrid workloads. First, its coarse-grained memory controls fail to effectively address compute resource contention: our experiments reveal a persistent 50% LS latency variance even with fixed memory reservations for LS services (Fig. 2(a)), attributable to unmanaged competition for GPU SMs and memory bandwidth. Second, the framework's rigid prioritization during LS traffic surges triggers BE starvation, reducing BE throughput significantly due to GPU memory monopolization.

Fortunately, our analysis reveals that the underutilized CPU resources in LLM serving clusters can be strategically leveraged to enhance BE computation while minimizing interference with LS services.
Specifically, idle CPU cores can compute the offloaded BE Attention workloads to reduce contention on GPUs, while the abundant CPU memory accommodates significantly larger KV caches for BE requests. However, fully exploiting CPUs for BE throughput faces a critical bottleneck: the 498.1× performance gap in Dense module computation between GPUs and CPUs. This disparity necessitates reserving Dense computations for GPUs, creating a dependency where GPUs must wait for Attention results computed on CPUs. To avoid GPU blocking, a robust synchronization mechanism is required to orchestrate heterogeneous compute streams across devices. This challenge is further compounded by the dynamic nature of LLM workloads: fluctuating request arrival rates widen the CPU-GPU performance gap under larger batch sizes, making synchronization increasingly complex during traffic bursts.

To address these challenges associated with harnessing CPU resources, this paper proposes OmniServe, an efficient LLM system for serving hybrid loads with SLO-awareness. The first innovation of OmniServe is the introduction of Attention Piggybacking, a new CPU inference mechanism designed to effectively decouple CPU Attention computation from GPU inference. Specifically, Attention Piggybacking enables an asynchronous execution flow between CPUs and GPUs, allowing the GPU stream to perform inference without waiting to gather immediate Attention results from the CPU stream. Dense computations for these overflowed BE requests can be opportunistically piggybacked later using layer-wise batching, introducing minimal interference to LS services under fluctuating request arrival rates. Unlike token-wise batching [56], the set of requests executed within a batch under layer-wise batching can vary across layers during a single token iteration.
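The contrast between the two batching granularities can be sketched as follows. The request and scheduling representations below are hypothetical simplifications for illustration, not OmniServe's actual interfaces:

```python
# Token-wise batching: one fixed request set runs through every layer of a
# token iteration. Layer-wise batching: at each layer, overflowed BE requests
# whose CPU Attention results have already arrived may join the batch.

def token_wise_iteration(requests, num_layers):
    # The same batch membership is used for all layers of this iteration.
    return [list(requests) for _ in range(num_layers)]

def layer_wise_iteration(ls_requests, be_ready_at_layer, num_layers):
    """be_ready_at_layer maps a BE request id to the layer index at which its
    CPU-computed Attention output becomes available for piggybacking."""
    batches = []
    for layer in range(num_layers):
        joined = [r for r, l in be_ready_at_layer.items() if l <= layer]
        batches.append(list(ls_requests) + joined)
    return batches

# Two LS requests stay on the GPU; BE request "be7"'s CPU Attention result
# lands at layer 2, so it joins the Dense computation from that layer onward.
batches = layer_wise_iteration(["ls0", "ls1"], {"be7": 2}, num_layers=4)
assert batches[0] == ["ls0", "ls1"]
assert batches[2] == ["ls0", "ls1", "be7"]
```

Under token-wise batching, "be7" could not have been admitted mid-iteration; its Dense computation would have had to wait for the next full token iteration.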
The second innovation of OmniServe lies in dynamic batching control through explicit latency quantification. OmniServe leverages the high predictability of service latency at the module level to dynamically determine the appropriate number of BE requests that can be piggybacked with LS inference using layer-wise batching during decoding. This approach not only ensures the SLO for LS services but also enhances BE service throughput.

The Attention Piggybacking mechanism and the explicit batching control approach are highly flexible and can be seamlessly integrated with various parallelism techniques for distributed inference, as well as into state-of-the-art LLM serving systems such as Prefill-Decode disaggregation [38, 39, 63] and chunked prefill [10]. We have developed a prototype of OmniServe based on the vLLM framework [27]. Additionally, we conduct extensive experiments to evaluate OmniServe in a cluster consisting of one host with four A100 GPUs and four additional CPU-only hosts. All hosts feature Intel Xeon Gold 6342 CPUs. The results show that OmniServe can improve the SLO attainment rate for LS services by up to 1.48× while enhancing BE serving throughput by up to 9.85×, compared to existing hybrid serving systems.

To summarize, we make the following contributions in this paper:
• We conduct a comprehensive analysis of the interplay between BE and LS services, highlighting the need for a more efficient interference mitigation approach in hybrid serving.
• We design a novel Attention Piggybacking mechanism for efficient service co-location. With asynchronous CPU-GPU interaction, GPUs do not need to wait for CPU results, representing a fundamental departure from prior offloading approaches.
• We introduce a dynamic scheduling policy that builds on the Attention Piggybacking mechanism.
Its core innovation is a fine-grained, layer-level batching strategy for processing LS and BE requests concurrently, a method fundamentally distinct from token-wise batching.

2 Background and Motivation

2.1 LLM Serving Basics

2.1.1 LLM inference workflow. The LLM inference workflow, illustrated in Fig. 1, consists of two phases with distinct computational patterns: the Prefill phase and the Decoding phase. During the Prefill phase (bottom-left), the entire input prompt is processed in a single, highly parallelizable forward pass to generate the initial output token (e.g., "keep"), making this a compute-intensive operation. The subsequent Decoding phase (bottom-right) then generates tokens autoregressively, using the cumulative output sequence to produce one new token at a time (e.g., "the"). This phase is predominantly memory-bound, as each step requires loading the substantial KV cache of all previous tokens. Each token generation in both phases necessitates a complete forward pass through all sequential layers of the model. Notably, these layers share an identical structure, whose components are detailed next.

Dense computation. In transformer-based models, Dense computation is responsible for capturing token-wise patterns. Specifically, each layer transforms the hidden states into the Q, K, V matrices in the QKV module, projects Attention results in the proj module, and explores characteristics in a higher-dimensional space in the MLP module (replaced by the MoE module in some models). These operations are characterized by high computational intensity due to their matrix-multiplication-based computation pattern, which makes them well suited for execution on GPUs.

Attention computation. The Attention module plays a crucial role in LLM inference, enabling selective focus on specific parts of the context to highlight the most relevant information when generating responses [51].
Specifically, the Attention mechanism utilizes the representations of tokens, namely the Q, K, and V matrices, and computes Attention scores among them to capture their nuanced relationships.

Fig. 1. Illustration of the complete LLM inference workflow for token generation. All layers within the model must be executed sequentially and are composed of an identical set of computation modules.

In the decoding phase, the computation is formulated as:

    Attention(q, K, V) = softmax(q · K^T / √d_k) · V,    (1)

where only the query q of the last token is involved in computation with all previously stashed K and V matrices.

Residual connection. A residual connection provides an additional path for the input tensors of each layer to bypass the computations on the mainstream path [21]. This is presented in the upper part of Fig. 1, where the computational results of the proj and MLP modules are added and normalized with the previously stored activations. Since the residual connection enhances numerical convergence, it has become an essential part of LLM models.

2.1.2 LLM inference optimization. To improve inference performance, the following techniques have been proposed:

Continuous batching. The continuous batching mechanism has recently been proposed to enhance resource utilization during the inference process [56]. Specifically, it refines the batching granularity from the request level to the token level, allowing the number of requests in an inference batch that processes through all layers of the model to vary over time.
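A minimal sketch of this token-level batching loop follows; the scheduler and request interfaces are hypothetical simplifications, and the actual forward pass is elided:

```python
from collections import deque

def continuous_batching(waiting, max_batch, steps_left):
    """Token-level batching: after every token iteration, finished requests
    leave and waiting requests join, so batch membership varies over time."""
    waiting = deque(waiting)
    running, trace = [], []
    while waiting or running:
        while waiting and len(running) < max_batch:   # admit at token granularity
            running.append(waiting.popleft())
        trace.append(sorted(running))                  # one forward pass over all layers
        for r in running:
            steps_left[r] -= 1                         # each request decodes one token
        running = [r for r in running if steps_left[r] > 0]
    return trace

# Request "a" needs 3 tokens, "b" needs 1; "c" joins once "b" frees a slot.
trace = continuous_batching(["a", "b", "c"], max_batch=2,
                            steps_left={"a": 3, "b": 1, "c": 2})
assert trace == [["a", "b"], ["a", "c"], ["a", "c"]]
```

Note that a request is admitted as soon as a slot frees up, rather than waiting for the whole batch to finish as in request-level batching.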
This approach enables the saturation of computing resources for higher serving throughput, particularly in the decoding phase, which is less compute-intensive. To date, it has become a common practice in LLM inference [5, 8, 27].

Chunked prefill. The chunked prefill technique is proposed to process both prefill and decoding requests in a stall-free manner [10]. To be specific, it divides a prefill request into chunks, which are then included in a batch alongside additional decoding requests. This approach can significantly reduce computational costs by avoiding the repeated invocation of the same kernel with identical model parameters, such as Dense modules, that would occur if the requests were processed in separate phases.

2.1.3 LLM Inference Service Requirements. Fast token generation is critical to LLM applications such as online chatting [1], where users expect rapid responses from servers. Typically, the generation rate must meet a minimum threshold, i.e., the SLO constraint. Moreover, even for the same LLM service, varying service rates are expected across different phases. For instance, users are generally impatient to wait long for the generation of the first output token, while a relatively lower rate for subsequent tokens is acceptable due to limited human reading speed. To address these diverse token generation requirements, existing SLO-oriented systems like Splitwise and DistServe [39, 63] define key performance indicators for the prefill and decoding phases: the Time to First Token (TTFT) and the Time Per Output Token (TPOT), respectively. In stark contrast, back-end inference tasks such as benchmarking [30], form processing [15], and data wrangling [19] are of lower priority and do not require prompt responses. As such, no specific SLO requirement should be imposed for
these LLM services. In this paper, we define LLM requests with SLO requirements as LS services and categorize the others as BE services.

Fig. 2. The latency of co-hosting LS and BE requests on Llama-70B models. (a): Significant per-token latency variation under Llumnix's fixed-size isolation, where the latency varies with the BE request length. (b) and (c): Latency of the MLP and Attention modules as more BE requests are added.

Fig. 3. Memory hierarchy with bandwidth and capacity information (SRAM: 19 TB/s, dozens of MB; HBM: 1.5 TB/s, dozens of GB; DRAM: 12.8 GB/s, hundreds of GB), along with the prioritization of memory resource usage when LS and BE services are colocated on the same device.

2.2 Existing Systems for Hybrid LLM Serving Loads

Recently, a body of work has focused on enhancing LLM serving performance along two key aspects, throughput [27, 45, 56] and latency [10, 27, 29, 58, 59, 63], by effectively leveraging GPU or CPU resources. However, these approaches typically optimize for a single service type, overlooking the necessity of supporting multiple types of LLM services simultaneously. As a result, cluster administrators are forced to deploy separate instances for different workloads, resulting in inefficient resource utilization. This approach duplicates model parameters in GPU memory and leaves compute resources in LS instances underutilized when BE requests could otherwise consume them.
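The capacity asymmetry in Fig. 3 can be made concrete with a back-of-the-envelope KV-cache estimate. The configuration values below (80 layers, 8 KV heads under grouped-query attention, head dimension 128, fp16) are taken from the public Llama-2-70B model card, not from this paper, and the memory budgets are illustrative assumptions:

```python
# Back-of-the-envelope KV-cache footprint for an assumed Llama-2-70B setup:
# 80 layers, 8 KV heads (GQA), head dimension 128, 2-byte (fp16) entries.
def kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Both a K row and a V row of kv_heads * head_dim values per layer.
    return layers * kv_heads * head_dim * 2 * dtype_bytes

per_token = kv_cache_bytes_per_token()            # 327,680 B = 320 KiB per token
gpu_headroom_tokens = 10 * 2**30 // per_token     # ~10 GiB leftover HBM: 32,768 tokens
dram_tokens = 200 * 2**30 // per_token            # ~200 GiB host DRAM: 655,360 tokens
```

Under these assumed parameters, a BE cache confined to roughly 10 GB of leftover GPU memory holds only a few tens of thousands of context tokens, while hundreds of GB of host DRAM holds about 20× more, which is the capacity gap the BE offloading design exploits.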
Overcoming this resource waste necessitates multiplexing serving instances across both load types, complemented by delicate interference-aware resource management. Llumnix [49] is the first system that supports hybrid serving of LS and BE services. Specifically, it achieves runtime co-location by reserving dedicated cache space (headroom) for LS services, while the model parameters and KV cache space are shared by the two services. In this sense, the idle resources of the LS service at leisure times can be opportunistically utilized by the BE service, significantly enhancing resource efficiency and improving inference performance. However, this design is too coarse-grained and fails to guarantee SLO compliance, as it does not inherently address latency control. We analyze this limitation and the interference in detail in § 2.3.

2.3 Interference between LS and BE Services

In this part, we investigate the bi-directional interference between LS and BE services, motivating the necessity of effective interference mitigation for hybrid workloads.

2.3.1 Latency interference on LS service. We analyze the impact of hosting additional BE requests on the decoding performance of LS services. Prior work has established that expanding the inference batch size increases latency due to resource contention, underscoring the necessity of latency control mechanisms when co-locating BE and LS requests [27, 56, 63]. However, our experiments reveal that Llumnix's headroom-oriented technique alone fails to sufficiently mitigate SLO violations. For instance, when deploying a 4-way¹ tensor-parallel (TP) Llama-70B model on four A100 80GB GPUs with 80% of the KV cache reserved for LS requests, we observe a 1.47× increase in per-token latency for LS decoding under concurrent BE workloads compared to exclusive serving (Fig. 2(a)). This degradation varies with different BE context lengths. Notably, interference worsens when the same cache space is allocated to prefill-phase BE requests, as their computationally intensive operations exacerbate contention.

2.3.2 Serving capacity reduction on BE service. In hybrid load serving, LS services take precedence as the highest priority, inevitably impacting the performance of co-located BE services. Specifically, the number of BE requests that can be processed on GPUs is constrained by the limited memory available to BE services. As illustrated in Fig. 3, modern GPUs typically provide only tens of gigabytes of memory. After allocating space for LS service components, such as model parameters and dedicated caches, the remaining memory left for BE services becomes insufficient. This results in computational under-utilization: despite maintaining the SLO of LS services, the GPU's computational resources, such as SM cores, cannot be fully leveraged by BE workloads due to memory constraints.

2.4 Hybrid Serving: Opportunities and Challenges

2.4.1 Opportunities. While multiplexing LS and BE services can introduce significant interference, we identify opportunities that can be leveraged to ensure SLO compliance for LS services while enhancing BE throughput.

O1: Harnessing ample CPU resources for BE computation. CPU resources in LLM serving clusters are typically underutilized, as only the control flow for GPU workers is handled by the CPU [57], leaving the majority of cores idle during inference (e.g., the Intel Gold 6342 CPU has 24 physical cores, yet only four are actively utilized when running a 4-way inference instance) [5, 27, 62]. Consequently, offloading the Attention computation of BE requests to the CPU can mitigate interference for LS requests.
This strategy is grounded in empirical analysis: BE Attention modules, unlike their MLP counterparts, disproportionately degrade LS service performance due to intense contention for memory bandwidth and SRAM resources during decoding, as validated in Fig. 2(b-c). Concurrently, leveraging CPUs' abundant memory to host larger KV caches for BE requests (Fig. 3) reduces GPU memory pressure, enhancing BE throughput by minimizing contention while preserving LS performance.

O2: Leveraging the high predictability of service latency. Due to the layer-wise organization of LLMs and the uniform GEMM-based implementation of Dense modules, the whole computation process during inference is highly deterministic for specific inputs of given sizes, allowing precise latency estimation of inference requests. This accurate estimation enables a reliable quantification of the BE requests that can be multiplexed with LS services while preserving SLO guarantees. Additionally, the predictability of latency facilitates fine-grained load control, allowing BE requests to maximize the utilization of available GPU and CPU resources. Notably, the nearly constant execution time observed with increased loads on MLP computation (illustrated in Fig. 2(b)) benefits BE computation, thereby improving overall serving throughput.

2.4.2 Challenges. Despite the potential opportunities to benefit BE services while providing SLO guarantees for LS services, several fundamental challenges remain when we want to fully utilize CPU resources. To illustrate this, we conducted a series of experiments in our cluster, measuring inference performance for requests on an Intel Xeon Gold 6342 CPU and an NVIDIA A100 GPU, respectively.

¹ 4-way TP means the LLM inference instance is parallelized across four GPUs in a tensor-parallel manner.

Table 1. The computation power gap across modules in the Llama-2-70B model between an Intel Xeon Gold 6342 CPU and an A100 GPU. The lengths of the decoding context and the prefill request are both 1000 in the experiment.

                 Prefill                  Decode
                 Attention    MLP        Attention    MLP
  1 request      184.6×       288.2×     2.34×        65.2×
  10 requests    393.75×      212.1×     7.58×        498.1×

Our analysis focused on the CPU-GPU performance gap across different model modules (Attention versus MLP), operational phases (prefill versus decode), and batch sizes (1 versus 10). Specifically, the following key challenges were observed:

C1: Low resource efficiency in partitioning computation between CPU and GPU. As evidenced by Table 1, a substantial performance gap exists between MLP execution on CPUs and GPUs, even for single-request decoding. This renders offloading MLP computations, including partial-layer implementations (e.g., [6, 27]), to CPUs ineffective for throughput improvement. However, existing inference systems' reliance on token-wise continuous batching [56] forces all batched requests to stall their post-Attention Dense modules (e.g., proj in Fig. 1) until all Attention results, including those computed on CPUs, are synchronized. This tight CPU-GPU coupling risks blocking GPU execution due to the CPU's limited bandwidth and computational capacity, thereby prolonging LS service latency. Meanwhile, CPU resources remain underutilized while Dense modules execute on GPUs, further exacerbating cluster inefficiency.

C2: Offloading computation in the presence of fluctuating inference loads. The dynamic nature of LLM inference, characterized by unpredictable request arrivals, results in highly fluctuating serving loads. This variability further exacerbates the challenges posed by the performance gap between CPUs and GPUs.
Specifically, the computation gap in the Attention module between CPU and GPU increases from 2.34× to 7.58× as the batch size rises from 1 to 10, as shown in Table 1. This disparity significantly hampers the ability to ensure SLO guarantees for LS services, as the involvement of CPUs can easily degrade the token generation rate during peak BE loads due to poor synchronization with GPUs. Consequently, it is crucial to implement a load-adaptive offloading scheme, coupled with a meticulous batching strategy for all requests with dynamic arrivals.

3 OmniServe System

3.1 Overview of OmniServe

3.1.1 Key ideas. OmniServe is a novel LLM serving system designed to optimize performance for both LS and BE requests, with the capability to fully leverage CPU and GPU resources. To be specific, it is built on the following key design ideas:

I1: Integrating the Attention Piggybacking mechanism. OmniServe introduces the Attention Piggybacking mechanism, a novel design to enhance the performance of hybrid services across heterogeneous hardware. By decoupling CPU and GPU computation streams, this mechanism enables asynchronous execution: the GPU stream progresses with inference tasks without stalling for immediate Attention results from the CPU. Overflowed BE requests have their Dense computations (e.g., MLP) opportunistically piggybacked onto the GPU via a layer-wise batching technique once the dependent Attention outputs are available. As a consequence, the CPU stream focuses exclusively on processing Attention computations for these overflowed BE requests, while the GPU prioritizes executing the full generation pipeline for as many requests as possible. This symbiotic design ensures that both CPUs and GPUs operate at peak efficiency for their respective workloads, directly addressing Challenge C1.

Fig. 4. The system overview of OmniServe.

I2: Dynamic batching control through explicit latency quantification. To ensure the SLO for LS services amid varying computational demands during the prefill and decoding phases, OmniServe proposes a module-wise latency modeling approach, in light of the inconsistent latency increase across modules (Fig. 2(b-c)). By quantifying these effects, OmniServe enables the delicate piggybacking of Attention results to perform subsequent Dense computations using layer-wise batching. As a result, the throughput of BE generation is improved while still guaranteeing the SLO for LS services, particularly in the face of fluctuating LS serving loads, thereby addressing Challenge C2.

3.1.2 System architecture. The system architecture of OmniServe is illustrated in Fig. 4 and comprises four main components: the Profiler ❶, the Online Scheduler ❷, the Piggybacking Manager ❸, and the CPU Attention Manager ❹.

When OmniServe is deployed within an LLM serving instance, the Profiler initializes the modeling process to create precise performance profiles for the various modules, supporting dynamic inference across CPU and GPU resources. The Online Scheduler determines the execution of requests on the GPU and sustains the asynchronous CPU-GPU piggybacking based on SLO requirements.
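A latency-budget check of this kind can be sketched as follows. The latency model and its coefficients below are invented for illustration; OmniServe fits real per-module profiles via its Profiler:

```python
def predicted_token_latency(ls_batch, be_piggyback, num_layers=80,
                            attn_ms=0.001, dense_base_ms=0.001,
                            dense_slope_ms=1e-5):
    """Hypothetical module-wise latency model (per layer, in milliseconds):
    LS Attention cost grows with the LS batch size, while Dense modules are
    nearly flat in batch size (cf. Fig. 2(b)) with a small per-request slope."""
    attn = attn_ms * ls_batch
    dense = dense_base_ms + dense_slope_ms * (ls_batch + be_piggyback)
    return num_layers * (attn + dense)

def max_piggyback(ls_batch, tpot_slo_ms, limit=512):
    """Largest BE piggyback count whose predicted per-token latency meets the SLO."""
    best = 0
    for n in range(limit + 1):
        if predicted_token_latency(ls_batch, n) > tpot_slo_ms:
            break   # latency is monotone in n, so stop at the first violation
        best = n
    return best
```

For instance, with 8 LS requests and a 0.81 ms TPOT budget, this toy model admits 104 piggybacked BE requests per token iteration; shrinking the budget drives the count toward zero, which is the load-adaptive behavior the scheduler needs.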
Subsequently, leveraging analytical models for Dense and Attention computations, the Online Scheduler dynamically chooses an appropriate number of LS and BE requests for execution within a batch. It also decides the number of tokens to be piggybacked from CPUs using layer-wise batching. The online scheduling functionality is further facilitated by the Piggybacking Manager, which ensures seamless and efficient Attention Piggybacking within the GPU backend through efficient activation management and queue design. To fully harness resources for BE services, excessive requests access the KV cache on GPUs and swap it to memory, process intermediate results, and execute Attention with support from the CPU Attention Manager.

3.2 Attention Piggybacking

The core rationale behind Attention Piggybacking stems from the limitations of the continuous batching approach [56]. Specifically, continuous batching inherently aligns the inference processes of BE and LS requests within a single batch at token granularity, running them on the same hardware concurrently. This batching strategy hinders the effective utilization of either CPU or GPU resources, due to the substantial performance gap in Dense module computations and the stringent SLO demands of the LS service. In this part, we illustrate the Attention Piggybacking mechanism, encompassing Attention offloading and post-Attention piggybacking, to demonstrate how this approach benefits both BE and LS services simultaneously.

Fig. 5. The interference of colocating kernel execution within the same CUDA context is as high as 1.5× within a Llama-2-70B instance on A100 GPUs, motivating the importance of post-Attention piggybacking.

3.2.1 Attention offloading. The first characteristic of Attention Piggybacking lies in the Attention offloading design, which selectively offloads the Attention computation to CPUs while keeping the computations of other modules on GPUs. This design is depicted in the left portion of Fig. 6(a). In this example, the input tensor of the l-th layer consists of the data from three decoding LS requests and three decoding BE requests. Before this iteration, all the KV caches of these three BE requests have been offloaded to the CPU due to GPU memory shortage. As such, after the QKV computation is completed, the intermediate tensor of these three BE requests is forwarded to the CPU to proceed with their Attention computation there, while the LS requests continue their subsequent Attention computation on the GPU. Unlike synchronous execution flows, where GPU streams must wait for results from lower-end processors, OmniServe decouples the computation of Dense modules among LS and BE services on GPU devices without applying token-wise continuous batching, i.e., with no synchronous gathering after the Attention execution. This prevents the GPU's execution from being hindered by dependencies on immediate result gathering, particularly given the non-negligible data transfer latency over limited PCIe bandwidth.

3.2.2 Post-Attention Piggybacking. Attention Piggybacking, which defers immediate aggregation between LS and BE requests, inevitably raises a critical concern: when to transfer the CPU-computed Attention results back to the GPU and how to proceed with the subsequent computations. To mitigate interference with the computation of LS requests, one potential approach is to block the kernel for offloaded BE requests until the GPUs become idle.
However, this method may prevent BE requests from effectively utilizing GPU resources, which contradicts the goal of improving utilization for BE workloads. Alternatively, another strategy is to run both LS and BE inference workflows concurrently on the GPUs. While this approach allows for instantaneous BE execution, it introduces significant interference to LS services. In Fig. 5, we analyze the interference on various computation modules within Llama-2-70B when concurrently invoking a proj module (the function following the Attention module) on an A100 GPU. The MLP module is divided into gate_up and gate_down. Each impacted module performs computation for 50 requests, each with a context length of 500. 'Light' and 'heavy' loads refer to scenarios where the proj kernel carries 5 and 200 requests, respectively. We isolate the BE and LS workflows on the GPU using two different CUDA streams to enable parallel execution. Results show that triggering the proj computation uniformly disrupts the GPU workflow, regardless of which other kernel is invoked for LS service inference. Even with light loads on the proj kernel, handling only 5 requests, we observe slowdowns ranging from 1.12× to 1.3× for the concurrently running kernel. This interference increases to 1.5× when the proj kernel processes 200 requests. The primary reason for this high interference is the substantial loading of model parameters for Dense computation, which intensifies contention on GPU resources such as SRAM and memory bandwidth. While the Attention kernel of LS services experiences minimal interference due to its low computational demand and parameter-free nature [51], its subsequent computation still suffers interference.

Fig. 6. The Attention Piggybacking mechanism within OmniServe. (a) Module-level dataflow: the design includes Attention offloading (left) and post-Attention piggybacking (right). Activation values of overflowed BE requests are offloaded to the CPU before executing the Attention module, while the resulting Attention outputs are aggregated back into the GPU dataflow after the same layer's Attention module. (b) Temporal workflow view: the CPU and GPU are utilized for diverse computations at arbitrary times, enabling efficient resource utilization and minimizing idle periods.

OmniServe addresses this interference by leveraging the recurrence of kernel invocation during the token generation process. As illustrated in the right part of Fig. 6(a), the proj module, located after the Attention module, is repeatedly invoked throughout the auto-regressive generation process. This recurrence provides an opportunity to piggyback the computation of offloaded BE requests within the subsequent Dense modules, minimizing interference with LS requests. Coupled with the asynchronous Attention offloading mechanism, this design allows the CPU and GPU to handle diverse computations in parallel, as shown in Fig. 6(b), thereby achieving full utilization of computational resources.

3.2.3 Asynchronous CPU-GPU interaction. The Attention Piggybacking mechanism relies on an efficient CPU-GPU interaction design, as illustrated in Fig. 7. Specifically, upon completion of the QKV module computation for both LS and BE requests at layer l during the current iteration, the corresponding q, k, and v vectors of BE requests are transferred from GPUs to CPUs for subsequent Attention computation.
This communication is lightweight since only the last token of each request is involved. Moreover, the KV cache transfer between the CPU and GPU is a one-shot operation for each request. As such, communication over PCIe does not become a performance bottleneck. As the GPU stream advances, when it is time to re-execute the proj module at layer l after a series of token iterations, and the CPU has returned the Attention result of the l-th layer, the proj computation is performed using layer-wise batching on a modified input tensor that concatenates the returned Attention result from the CPU. Notably, adding additional BE decoding requests during the computation of Dense modules has minimal impact on LS requests, as depicted in Fig. 2(b), because computations from both service types can be processed in one GeMM kernel. Moreover, CPU Attention results are transferred to the GPUs during GPU kernel execution, effectively hiding the transfer overhead.

To facilitate efficient data interaction between the CPU stream and the GPU stream, the Piggybacking Manager incorporates two distinct queues for storing CPU Attention input and output. Within these queues, computed results are dispatched to the tail of the respective queue by the producer, while the consumer retrieves data from the head, establishing a producer-consumer pattern. Specifically, the CPU-side write and read operations are managed by the CPU Attention Manager, while the GPU-side write and read operations are handled by the Piggybacking Manager. According to queuing theory, in a stable state, the arrival rate of a queue equals its departure rate. This indicates that the CPU stream and the GPU stream maintain a steady pace during computation under this asynchronous execution paradigm. As a result, both the CPU and GPU resources can be fully utilized.
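The producer-consumer queue pair described above can be sketched with standard threading primitives. Everything here (names, the placeholder Attention computation) is illustrative only, not OmniServe's actual implementation:

```python
import queue
import threading

# Two-queue sketch: the GPU stream writes q/k/v tasks to an input queue, a CPU
# worker computes Attention and writes results to an output queue, and the GPU
# stream drains results when it later revisits the proj module.

attn_input_q = queue.Queue()   # GPU -> CPU: (req_id, layer, qkv)
attn_output_q = queue.Queue()  # CPU -> GPU: (req_id, layer, attn_out)

def cpu_attention_worker():
    """Consume offloaded tasks; a trivial sum stands in for real Attention."""
    while True:
        task = attn_input_q.get()
        if task is None:          # sentinel: shut down
            break
        req_id, layer, qkv = task
        attn_out = sum(qkv)       # placeholder for the actual Attention kernel
        attn_output_q.put((req_id, layer, attn_out))

worker = threading.Thread(target=cpu_attention_worker)
worker.start()

# GPU-stream side: enqueue three offloaded BE tokens, then collect whatever
# results the CPU worker has produced.
for req_id in range(3):
    attn_input_q.put((req_id, 0, [1.0, 2.0, 3.0]))

attn_input_q.put(None)
worker.join()
results = [attn_output_q.get() for _ in range(3)]
print(sorted(r[0] for r in results))  # → [0, 1, 2]
```

In the real system the consumer side does not block on `get`; it only drains results that are already available when the proj kernel of the matching layer is next launched.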
Consequently, with more idle CPU resources available in data centers, they can store additional KV caches and execute Attention computation more rapidly through token-level parallelization, thereby substantially boosting BE serving throughput. This queue-based design also benefits more complicated serving scenarios, such as handling CPU failures and offloading the KV cache to disk, because it decouples the serving workflow from the cache source, enabling smooth execution in the event of loading from disk or an abrupt fault.

Fig. 7. The CPU-GPU interaction in OmniServe, implemented using the CPU Attention Input Queue and the CPU Attention Output Queue. This queue-based design enables asynchronous, interference-free communication and prevents the LS service on the GPU from being affected by PCIe latency.

3.2.4 KV cache management. Co-hosting BE and LS requests significantly increases memory usage in clusters. As a result, frequent KV cache swapping between CPUs and GPUs for BE requests may occur in response to fluctuating LS serving loads, and the resulting excessive KV cache migration can degrade overall service performance. To address this issue, the CPU Attention Manager introduces an asynchronous KV cache swapping mechanism to manage memory swapping for BE requests. Specifically, it supports a non-blocking swap-out operation by overlapping computation and data transfer on the GPU, preventing interruptions to LS inference.
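The non-blocking swap-out idea can be illustrated with a background copy standing in for the GPU transfer stream; all names and buffers here are hypothetical:

```python
import threading

# Sketch: the KV blocks of a BE request are copied to host memory on a
# background "transfer" thread while the main loop keeps issuing compute work;
# the GPU-side blocks are freed only after the copy completes. Plain lists
# stand in for GPU and CPU memory.

gpu_kv = {"req7": [b"block0", b"block1", b"block2"]}
cpu_kv = {}

def swap_out(req_id):
    # simulate a PCIe copy of each KV block to host memory
    cpu_kv[req_id] = list(gpu_kv[req_id])

copier = threading.Thread(target=swap_out, args=("req7",))
copier.start()          # transfer proceeds in the background

compute_steps = 0
while copier.is_alive() or compute_steps == 0:
    compute_steps += 1  # LS inference keeps running, never blocked on the copy

copier.join()
del gpu_kv["req7"]      # free GPU memory only after the copy finished
print(len(cpu_kv["req7"]))  # → 3
```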
Additionally, the CPU Attention Manager implements a delayed swap-in scheme for BE requests. When LS loads are light, caches for BE requests are not swapped into GPU memory immediately; instead, the swap-in is triggered only after the k and v vectors of the last token have been generated for all layers. This delay helps mitigate excessive cache swapping over the PCIe channel, which can occur when LS loads fluctuate wildly.

3.3 Online Scheduling for Hybrid Loads

3.3.1 Inference latency models. Accurate performance modeling is essential for maintaining consistent performance isolation and maximizing the benefit of the Attention Piggybacking design on the fly. However, existing modeling techniques rely on the assumption that the input and output tensors of the modules in all model layers have identical sizes, rendering them unsuitable for our new Attention Piggybacking mechanism. To this end, the Profiler incorporates novel analytical models of high generality to estimate the execution time of Dense and Attention modules, represented by $f_D(\cdot)$ and $f_A(\cdot)$, which together constitute the inference latency.

Modeling GPU Attention computation latency. Since requests from different computational phases can be served within a single batch, the latency model must account for both prefill and decoding Attention simultaneously, a capability widely supported by advanced libraries [3, 9, 10, 27]. However, effective modeling remains challenging due to the multiple variables that can influence latency, such as the prompt and context lengths of each request. Fortunately, the pairwise token relationship evaluation (where causal masking governs token interactions in auto-regressive generation) enables a unified framework for latency modeling across diverse operational scenarios.
By systematically characterizing latency variations as a function of Attention computational intensity, we find that prefill-phase Attention latency scales linearly with the computation load and remains independent of input sequence lengths during chunked prefill operations, as shown in Fig. 8(a). Consequently, the Profiler models the prefill Attention latency $f_A^P(\cdot)$ with learnable parameters $\{a_A^P, b_A^P\}$ as:

$$f_A^P\big(c_A^P(t)\big) = a_A^P \cdot c_A^P(t) + b_A^P. \qquad (2)$$

Fig. 8. Attention and Dense computation characterization of the 4-way TP Llama-2-70B model. (a) Prefill Attention time is correlated only with computation load. (b) Decoding Attention time is related to computation load and the number of requests. (c) Dense computation time increases in a ladder pattern.

Here, for a prefill request $j$ with $l_j$ tokens already processed and $q_j$ tokens to be prefilled at time $t$, the Profiler models the computation load as $c_A^P(t) = \sum_{i=l_j+1}^{l_j+q_j} i$. During the decoding phase, only the latest token of each request requires Attention computation, which allows the computational load to be simplified as $c_A^D(t) = \sum_{j \in g(t)} (l_j + 1)$, where $g(t)$ denotes the decoding requests in the batch. Due to the reduced computational intensity, decoding Attention becomes memory-bandwidth-bound, enabling higher compute utilization through parallel processing of multiple requests. As shown in Fig. 8(b), latency improves with increasing $g(t)$, even when the total KV cache size remains constant.
To model this behavior, the Profiler defines the decoding Attention latency $f_A^D(\cdot)$ using learnable parameters $\{a_A^D, h_A^D, b_A^D\}$:

$$f_A^D\big(c_A^D(t), g(t)\big) = a_A^D \cdot c_A^D(t) + h_A^D \cdot g(t) + b_A^D, \qquad (3)$$

where the $h_A^D \cdot g(t)$ term captures the impact of compute utilization on latency. Combining $f_A^P(\cdot)$ and $f_A^D(\cdot)$ yields the total aggregated Attention latency $f_A(\cdot)$.

Modeling Dense computation latency. Dense modules employ stateless computations, meaning their latency depends solely on the number of query tokens, denoted $n(t)$. For decoding requests, $n(t)$ counts only the final token of each request, while for prefill requests it includes all prompt tokens. However, as demonstrated in Fig. 8(c), these modules exhibit non-linear latency scaling relative to $n(t)$, punctuated by spike events. This behavior arises because GPUs must allocate new thread blocks when input sizes exceed the hardware's tile capacity. Threads within these blocks execute in lockstep, even if some redundantly process data (pseudowork). Between spike events, the latency of Dense modules increases linearly with $n(t)$, primarily due to the variable overhead of transferring input tensors from GPU global memory to shared SRAM.

Algorithm 1 The interpolation-based algorithm for the latency modeling of Dense modules
1: function MODELING(min, max, threshold)
2:   min_latency ← latency(min)
3:   max_latency ← latency(max)
4:   if max_latency − min_latency ≤ threshold then
5:     Interpolate latency between min and max
6:   else
7:     mid ← ⌊(min + max)/2⌋
8:     mid_latency ← latency(mid)
9:     min_part ← MODELING(min, mid, threshold)
10:    max_part ← MODELING(mid + 1, max, threshold)
11:    Aggregate modeling results from min_part and max_part
12:  end if
13: end function
Motivated by these observations, we propose Alg. 1, a latency modeling algorithm designed to balance high fidelity and computational efficiency. The algorithm leverages a divide-and-conquer strategy: if a flat latency region is detected (line 4), it interpolates latency values within that interval; otherwise, it recursively localizes spikes in progressively smaller sub-regions (lines 9-10). The threshold for distinguishing flat regions from spikes is dynamically determined as the latency difference between input sizes 1 and 16 for each module. This approach achieves a complexity of O(log n), where n represents the maximum supported number of query tokens. Crucially, the algorithm inherently accounts for collective communication overhead, which scales linearly with input size under fixed parallelism conditions.

Modeling hybrid parallelism. Contemporary LLM deployments typically use hybrid parallelism, combining tensor- and pipeline-parallel techniques, to maximize serving capacity across hierarchical network topologies [22, 37, 46]. Our modeling approach integrates seamlessly with the parallelism adopted by users by explicitly accounting for the network overhead $\gamma(\cdot)$. In particular, we model the collective communication latency of tensor parallelism as $\gamma_T(\cdot)$ and the peer-to-peer transmission latency of pipeline parallelism as $\gamma_P(\cdot)$, both parameterized by $n(t)$, the number of tokens involved, following the Alpha-Beta communication model [2].

3.3.2 Scheduling policy. The Online Scheduler first determines a delicate scheduling order among BE and LS requests performing computations in the prefill and decoding phases. Since ensuring SLO guarantees is the top priority in a cluster, the scheduler prioritizes LS requests over their BE counterparts.
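Returning briefly to § 3.3.1, Algorithm 1 can be rendered as runnable Python. The `latency` function below is a synthetic ladder-shaped stand-in for real kernel measurements (not profiled data), and instead of aggregating return values this simplified version fills a lookup table:

```python
# Runnable rendition of Algorithm 1. latency() mimics the ladder pattern of
# Fig. 8(c): linear growth plus a spike at every 16-token tile boundary.

def latency(n, tile=16):
    """Toy per-module latency: 0.01 ms per token + 0.1 ms per thread-block tile."""
    return 0.01 * n + 0.1 * ((n + tile - 1) // tile)

def modeling(lo, hi, threshold, table):
    lo_lat, hi_lat = latency(lo), latency(hi)
    if hi_lat - lo_lat <= threshold:
        # Flat region detected (Alg. 1, line 4): interpolate linearly.
        for n in range(lo, hi + 1):
            frac = (n - lo) / (hi - lo) if hi > lo else 0.0
            table[n] = lo_lat + frac * (hi_lat - lo_lat)
    else:
        # Spike inside: recurse on both halves (Alg. 1, lines 9-10).
        mid = (lo + hi) // 2
        modeling(lo, mid, threshold, table)
        modeling(mid + 1, hi, threshold, table)

# Threshold heuristic from the paper: latency gap between input sizes 1 and 16.
threshold = latency(16) - latency(1)
table = {}
modeling(1, 256, threshold, table)
max_err = max(abs(table[n] - latency(n)) for n in range(1, 257))
```

On this synthetic ladder the recursion bottoms out exactly at tile boundaries, so the interpolated table matches the measured curve while sampling only O(log n) points per flat region.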
Additionally, it incorporates an admission control mechanism that dynamically determines the number of LS prefill requests that can be admitted based on the cluster load, preventing SLO violations. For all pending and ongoing requests, the Scheduler maintains the following scheduling order across the different types: ❶ LS decoding, ❷ LS chunk prefill, ❸ BE chunk prefill, and ❹ BE decoding. Decoding requests are prioritized over prefill requests for LS services to ensure SLO guarantees on token generation speed. Conversely, BE prefill requests take precedence over BE decoding requests to improve service throughput. Requests within the same type are executed in first-come-first-serve order. During inference, the Online Scheduler monitors and makes scheduling decisions based on the state parameters $c_A^P(t)$, $c_A^D(t)$, $g(t)$, and $n(t)$.

3.3.3 Admission control for LS requests. In a resource-constrained cluster, it becomes infeasible to maintain SLO guarantees for every incoming request during periods of bursty arrivals. In such cases, early rejection is preferable when a violation is likely, rather than investing time in waiting for execution [53]. The Online Scheduler therefore enforces admission control mechanisms [16, 42, 52] for LS requests, aiming to reduce unnecessary prefill queuing during peak periods and ensure the fulfillment of prefill SLO requirements. Specifically, for each arriving LS request $k$, the Online Scheduler admits it only if the total queuing and prefilling time stays within the prefill SLO constraint $\mathcal{S}_p$:

$$f_A^P\big(c_A^P(t)\big) + f_A^D\big(c_A^D(t), g(t)\big) + f_D\big(n(t)\big) \le \frac{\mathcal{S}_p}{d} - \gamma\big(n(t)\big).$$
Here, $c_A^P(t) = \sum_{j \in \mathcal{P}(t) \cup \{k\}} \sum_{i=l_j+1}^{p_j} i$, $g(t) = |\mathcal{D}(t)| + |\mathcal{P}(t)| + 1$, $c_A^D(t) = \sum_{j \in \mathcal{D}(t) \cup \mathcal{P}(t) \cup \{k\}} (l_j + 1)$, and $n(t) = \sum_{j \in \mathcal{P}(t) \cup \{k\}} (p_j - l_j) + |\mathcal{D}(t)|$, where $\mathcal{P}(t)$ and $\mathcal{D}(t)$ denote the sets of incomplete prefill and decoding LS requests respectively, $p_j$ and $l_j$ are the total prompt length and the context length already prefilled for request $j$, and $d$ is the number of model layers. Notably, this inequality offers an explicit TTFT quantification for each newly arrived request.

Fig. 9. The residual store that manages residual tensors for offloaded BE requests for inference correctness.

3.3.4 Chunked prefill control. After admitting the prefill requests, the Online Scheduler determines how many LS prefill loads can be accommodated alongside ongoing LS decoding requests while complying with the specified SLO. Specifically, for an LS prefill request $j$, it limits the number of tokens $q_j(t) \in \mathbb{Z}^+$ chunk-prefilled in the current iteration so as to satisfy the decoding constraint. This can be formulated as an optimization problem:

$$\max\; q_j(t) \quad \text{s.t.} \quad f_A^P\big(c'^P_A(t)\big) + f_A^D\big(c_A^D(t), g(t)\big) + f_D\big(n'(t)\big) \le \frac{\mathcal{S}_d}{d} - \gamma\big(n(t)\big), \qquad q_j(t) + l_j \le p_j.$$

Here, $c'^P_A(t) = c_A^P(t) + \sum_{i=l_j+1}^{l_j+q_j(t)} i$ and $n'(t) = n(t) + q_j(t)$. $\mathcal{S}_d$ denotes the decoding SLO, while $c_A^D(t)$, $c_A^P(t)$, and $n(t)$ are updated to reflect the new computation loads once request $j$ is admitted into the batch. This optimization problem can be solved efficiently with binary search, since the latency increases monotonically as additional load is introduced.
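Because the modeled latency grows monotonically in the chunk size, the maximization above reduces to a binary search. A minimal sketch with a toy integer-valued latency model (all constants are illustrative, not profiled values):

```python
# Chunked-prefill sizing sketch: binary-search the largest chunk q that keeps
# the estimated per-layer iteration latency within the decoding SLO budget.

def iter_latency_us(q, base_us=800, per_token_us=2):
    """Toy stand-in for f_A^P + f_A^D + f_D when q extra prefill tokens are added."""
    return base_us + per_token_us * q

def max_chunk(budget_us, remaining_tokens):
    """Largest q in [0, remaining_tokens] with iter_latency_us(q) <= budget_us.

    Valid because the latency model is monotonically increasing in q."""
    lo, hi, best = 0, remaining_tokens, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if iter_latency_us(mid) <= budget_us:
            best, lo = mid, mid + 1   # feasible: try a larger chunk
        else:
            hi = mid - 1              # infeasible: shrink the chunk
    return best

q = max_chunk(budget_us=1000, remaining_tokens=500)
print(q)  # → 100 (800 + 2*100 == 1000 exactly fits the budget)
```

A return value of 0 means no chunk can be admitted this iteration, matching the scheduler's behavior of deferring the prefill request.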
When scheduling a BE prefill request, the Online Scheduler determines the chunked-prefill load using the same method under a stricter decoding SLO constraint. Specifically, it replaces $\mathcal{S}_d/d$ with $\max\{0, \mathcal{S}_d/d - \omega\}$ if there are any available Attention results in the CPU Attention output queue, where $\omega$ represents the additional piggybacking overhead. Meanwhile, the BE chunked-prefill load is constrained by the need to satisfy the LS prefill SLO.

3.3.5 BE decoding control. After scheduling LS requests and BE prefill loads, the Scheduler seeks to accommodate additional BE decoding requests if there is spare capacity on the GPUs. Specifically, it determines whether a decoding request $j$ with context length $l_j$ can be executed on the GPU by verifying that the decoding SLO of the LS service will not be violated. This check simulates the impact of the request on system resources by updating the state parameters: $c'^D_A(t) = c_A^D(t) + l_j$ and $n'(t) = n(t) + 1$. The Online Scheduler also reserves room for piggybacked computations by incorporating $\max\{0, \mathcal{S}_d/d - \omega\}$ into the constraint. BE decoding requests that cannot be scheduled on GPUs are offloaded to CPUs for Attention computation. Conversely, if there are not enough BE decoding requests on the GPUs and the SLO is still maintained, the Scheduler notifies the CPU Attention Manager to swap BE decoding requests back to the GPUs, provided there is available GPU memory for the additional KV caches.

3.3.6 Piggybacking control. Finally, the Online Scheduler dynamically regulates Attention Piggybacking loads to maintain SLO compliance amidst fluctuating LLM inference workloads. This is achieved through layer-wise batching control, which constrains the number of piggybacked requests $p_l(t)$ per model layer $l$. To handle dynamically arriving BE piggybacked workloads, the system implements a greedy layer-wise admission strategy.
Specifically, the Scheduler incrementally admits BE requests starting from the lowest model layer (in ascending order) until the SLO thresholds are reached. This approach prevents starvation of BE requests at higher layers because: (1) new BE loads are not continuously admitted, owing to the prefill-rate limitations imposed by the chunked-prefill slots of LS workloads, and (2) requests admitted at lower layers progressively shift upward in subsequent processing cycles.

4 System Implementation

The implementation of CPU Attention. Parallelizing the Attention computation across multiple CPU cores and hosts benefits BE service performance. We adopt the OpenMP [17] library to spawn multiple threads across different CPU cores, where each thread performs a partial share of the computation and aggregates intermediate results. Within each core, we employ Intel Advanced Vector Extensions (e.g., AVX) instructions to vectorize the Attention computation. When multiple GPUs are attached to a single host, we dedicate an equal share of CPU resources (cores and memory) to each GPU worker. To isolate their interactions with the CPU, each worker is assigned private input and output queues for CPU Attention, and the corresponding Attention computation is performed independently within each worker's allocated resource partition.

The implementation of the CPU Attention queues. To mitigate the interference caused by write and load overhead on the Attention input and output queues, we implement these two queues in GPU memory as special tensors with additional head and tail pointers. This approach prevents blocking in the GPU stream due to costly CPU-GPU transmissions.
Only a small amount of additional memory is used, as the queues solely store temporary activations (i.e., q, k, v, and the Attention result of one layer). With CUDA IPC and the MPS service, the CPU process can read and write the queues without blocking the workflow on the GPU.

Distributed CPU Attention. Leveraging the abundant CPU-only servers in datacenters, we also implement a mechanism called hierarchical CPU Attention to fully utilize cluster resources. Specifically, offloaded BE requests are first served on the local host as long as memory suffices. Once local memory is fully occupied, the remaining requests are evenly distributed to remote CPU hosts. The KV cache migration and remote Attention functionalities are implemented on top of the Ray framework [36].

Residual management. Modern LLM models incorporate residual connections [21] in each layer, which span across modules and add complexity to the implementation of the Attention Piggybacking mechanism. In particular, the residual of each request lies on the critical path at every model layer. To preserve computational correctness for BE requests after CPU Attention, we implement a residual store (see Fig. 9), which manages the storage and retrieval of residual values throughout the offloaded inference process. For instance, when a BE request has its KV cache offloaded to CPU memory, its residual is saved to the residual store before entering the QKV module, indexed by req_id and layer. After several iterations, the Attention result is returned to the GPU at the same layer; the previously saved residual tensor is then retrieved from the store using the same identifiers and combined with the output of the out_proj module to produce the correct result.

Profiler and Online Scheduler. To develop accurate latency models as discussed in § 3.3.1, we utilize the linear regression functionality provided by sklearn [40] to capture the coefficients and intercepts of the linear functions.
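The per-module linear fits of § 3.3.1 amount to ordinary least squares. A dependency-free sketch (on exactly linear data like this, sklearn's `LinearRegression` would recover the same slope and intercept):

```python
# Fit f(c) = a*c + b from profiled (load, latency) samples via closed-form
# one-dimensional least squares.

def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var              # slope, e.g. a_A^P in Eq. (2)
    b = mean_y - a * mean_x    # intercept, e.g. b_A^P in Eq. (2)
    return a, b

# Synthetic profile: latency = 0.5 * load + 3 (exactly linear, so the fit
# recovers the generating coefficients).
loads = [1.0, 2.0, 4.0, 8.0, 16.0]
lats = [0.5 * c + 3.0 for c in loads]
a, b = fit_linear(loads, lats)
print(round(a, 6), round(b, 6))  # → 0.5 3.0
```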
Moreover, we rewrite the scheduler module in vLLM to enable our hybrid-load co-serving design.

5 Experimental Evaluation

5.1 Experiment Setup

5.1.1 Testbed and workloads. We primarily conducted our experiments in a cluster that includes a GPU server equipped with four A100 80GB GPUs, along with four CPU-only servers. The GPUs within the host are interconnected via PCIe channels. Each GPU and CPU server is powered by an Intel Xeon Gold 6342 CPU and features 400GB of available RAM. The CPU in the GPU server also participates in Attention computation, alongside the CPU-only servers. All servers are interconnected by default through 100 Gbps RoCE links. Additionally, to evaluate performance across diverse network environments, we report results from specific experiments conducted under a 10 Gbps LAN. We evaluated OmniServe (OS) on two model architectures, Yi-34B (deployed with 2-way tensor parallelism) and Llama-2-70B (4-way tensor parallelism), demonstrating its general applicability across diverse model scales. We also report experimental results from two environments: a single GPU server (1G) and a cluster of one GPU server plus four CPU servers (1G4C); each GPU server contains four GPUs. Unless otherwise specified, all reported results use the latter (1G4C) as the default configuration. All computations used the BF16 precision format on both GPU and CPU instances. The following services were benchmarked:

LS service - Online chatbot: Modeled after real-world conversational workloads, LS requests simulate an interactive chatbot using query-length distributions derived from ShareGPT [7]. Requests were continuously submitted to the system at a fixed rate with Poisson arrival patterns by default.
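Poisson arrivals at a fixed rate can be generated by drawing exponential inter-arrival gaps; a small sketch of such a workload generator (rate, duration, and seed are illustrative):

```python
import random

# A homogeneous Poisson process with rate lambda (req/s) has i.i.d.
# exponential inter-arrival times with mean 1/lambda.

def poisson_arrivals(rate, duration_s, seed=0):
    rng = random.Random(seed)   # seeded for reproducible traces
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)  # exponential gap <=> Poisson arrivals
        if t >= duration_s:
            return times
        times.append(t)

arrivals = poisson_arrivals(rate=4.0, duration_s=60.0)
print(len(arrivals))  # roughly rate * duration = 240 arrivals on average
```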
We also evaluated the SLO attainment of the LS service under highly dynamic submission rates, as discussed in § 5.2.1.

BE service: BE workloads were generated using two benchmarks: 1) LongBench-v2 [11]: average input/output lengths are 8,952/136 tokens, with a maximum length of 12K tokens. 2) DailyMails [43]: average input/output lengths are 1,964/397 tokens. The cluster was configured to simulate a BE request load by replaying a submission pattern from the Azure Public Datasets [48], where 182.6 requests are submitted per minute on average.

5.1.2 Performance metrics. We used SLO attainment and token generation throughput as the primary evaluation metrics for LS and BE services, respectively. For LS services, the attainment rate is assessed on both TTFT and TPOT. We defined fixed TPOT and TTFT constraints similar to previous work [63]. For BE services, we focused on token generation rates during both the prefill and decode phases. Unless stated otherwise, for the 34B and 70B models, the TTFT SLOs for the LS service are set to 2s and 3s, while the TPOT SLOs are set to 0.2s and 0.25s, respectively. All experiments were conducted over a period of 30 minutes.

5.1.3 Baselines. We adopted the following baselines:

Baseline A: Llumnix [49] on GPU + vLLM [27] on CPU. We adopt the Llumnix [49] system to host the hybrid inference loads on GPUs, which utilizes a memory-centric control policy to achieve performance isolation. Moreover, to fully utilize the CPU resources in clusters for a fair comparison with OmniServe, BE requests that could not be accommodated on GPU instances were offloaded to CPU-hosted vLLM [27] instances and their inference was computed on CPUs.

Baseline B: NEO [24]. We adapt NEO, which leverages both CPU and GPU for latency-oriented LLM inference. NEO identifies the decoding phase's Attention computation as relatively lightweight and offloads it entirely to the CPU.
To harness both devices, it employs a pipeline pattern: while the CPU processes the Attention for one micro-batch of requests, the GPU concurrently executes the non-Attention modules for another. However, NEO's performance is ultimately constrained by its reliance on the CPU for all Attention computation, making it vulnerable to bottlenecks in CPU processing power and PCIe bandwidth, especially under strict SLOs. For a fair comparison, we enhanced NEO with a latency control mechanism similar to OmniServe's to prevent SLO violations.

Baseline C: Sarathi-Serve [10]. We also adopt Sarathi-Serve as our SLO-optimal baseline. This system runs computations solely on the GPU, eliminating potential SLO degradation caused by CPU offloading. It consistently prioritizes LS requests, with overflowed BE requests queued for available GPU slots. Experimental results from this baseline establish the upper bound for SLO achievement without CPU assistance across diverse LS submission patterns.

Fig. 10. SLO attainment across LS arrival rates with the LongBench-v2 dataset. 1 GPU and 4 CPU hosts are used.

Fig. 11. SLO attainment across LS arrival rates with the LongBench-v2 dataset. Only 1 GPU server is used.

Fig. 12.
SLO attainment across LS arrival rates with the DailyMails dataset. Only 1 GPU server is used.

5.2 End-to-end Performance

We evaluated the end-to-end performance of OmniServe against the baselines under various conditions. We selected the request arrival rates and SLO constraints to ensure that all LS requests are admitted to the system without any rejections.

5.2.1 SLO attainment of LS services. In Fig. 10 and Fig. 11, we examine the SLO attainment of the systems across varying arrival rates under two hardware configurations, i.e., 1G4C and 1G; the BE requests are generated from the LongBench-v2 dataset. The SLO attainment rates of all baselines decrease as the LS request arrival rate increases, because the upper limit of serving capacity is constrained by the allocated GPU resources. As the number of LS requests increases and competes for resources with BE loads, Llumnix experiences significant performance degradation, primarily because its memory-based allocation fails to maintain the latency objectives; factors such as request length and inference phase also play critical roles. The NEO baseline also struggles to achieve high SLO rates in this scenario. Although NEO keeps latency below the specified thresholds, its CPU-GPU interaction design limits its capacity to handle a larger number of requests. Notably, NEO does not differentiate between LS and BE workflows, so the Attention of all LS requests is processed on the CPU. Consequently, both PCIe bandwidth and limited CPU computational power can become bottlenecks for LS services; for this reason, its SLO attainment is even worse given more CPUs, due to the increased communication overhead. In contrast, OmniServe consistently sustains SLO attainment rates as high as Sarathi-Serve's in both environments, with at most a 0.6% degradation in SLO attainment.
This is attributed to OmniServe's nuanced latency control strategy, which considers multiple factors, including token count and context length. Additionally, it decouples the servicing of LS and BE requests, preventing LS services from being bottlenecked by CPU resource limitations. Moreover, in Fig. 12, OmniServe achieves up to 1.42× higher SLO attainment with BE requests from the DailyMail dataset, proving its adaptability to the diverse computational intensity imposed by BE services.

Fig. 13. SLO attainment across SLO constraints with the LongBench-v2 dataset. Only 1 GPU server is used.

Fig. 14. SLO attainment under the bursty request arrival test. The submission rate randomly varies from 1 to 8 req/s for Yi-34B and from 1 to 5 req/s for Llama-70B. The intervals between submission-rate changes for the two models are 5s and 10s. Evaluations are conducted on the 1G hardware setting.

Fig. 15. Decoding throughput of the BE service using the LongBench-v2 dataset. 1 GPU and 4 CPU hosts are used.

Moreover, we evaluated SLO performance under varying latency requirements, with arrival rates set to 4 req/s and 3 req/s for the 34B and 70B models, respectively. As depicted in Fig. 13, the SLO attainment rate of all baselines improves as the latency constraint is relaxed, given a fixed allocation of GPU resources. However, Llumnix struggles to maintain high SLO attainment under stringent TPOT requirements, particularly when interfered with by BE requests. For instance, when the SLO is set to 0.15s, its TPOT attainment rate for the Llama-70B model drops to 62%, whereas OmniServe maintains a 91.6% SLO attainment rate, a 1.48× improvement. This performance gap occurs because stringent SLOs are more vulnerable to the latency exacerbation caused by admitting BE serving loads. A similar trend is observed with the DailyMail dataset.

Furthermore, we evaluated SLO attainment under dynamic workload arrival patterns based on a real-world LS submission trace with varying intensity over time. The request submission rate was modified at two different frequencies (5s and 10s), with each modification event assigning a random rate between 1–8 req/s for the Yi-34B model and 1–5 req/s for the Llama-70B model. As shown in Fig. 14, OmniServe consistently outperforms Llumnix and NEO, achieving up to 1.23× and 1.13× higher SLO attainment, respectively. Moreover, OmniServe always maintains nearly identical SLO attainment to Sarathi-Serve, proving that there is no performance sacrifice under bursty loads. This stems not only from the asynchronous CPU-GPU coordination but also from the cache management mechanism in § 3.2.4. By minimizing expensive KV cache migration under fluctuating loads, it helps maintain high inference efficiency.

5.2.2 BE serving throughput. This section evaluates BE throughput across the baseline systems. To comprehensively evaluate OmniServe's efficiency, we conducted benchmark tests under the following hardware configurations: (1) a single GPU server (1G), and (2) one GPU server plus four CPU servers (1G4C). In the latter configuration, we examined two network setups, i.e., 100 Gbps and 10 Gbps.

Fig. 16. Decoding throughput of the BE service using the LongBench-v2 dataset. Only 1 GPU server is used.

Fig. 17. Decoding throughput for BE requests under the DailyMail dataset. 1 GPU and 4 CPU hosts are used.

Fig. 18. The benefit and impact of admitting more CPUs in Attention Piggybacking on BE and LS services.

Across these environments, the baseline Sarathi-Serve always yields the lowest BE throughput, as it cannot leverage the CPU power. The BE performance under NEO is bottlenecked by the strict SLO requirements, which restrict the BE computation on the CPU. In Fig. 15 and Fig. 16, OmniServe consistently outperforms the baselines on the LongBench-v2 benchmark. Under light LS workloads, with ample GPU resources that the BE service can leverage, it achieves a modest 1.2× throughput improvement, as the GPU's computational capacity reduces dependence on CPU-assisted processing. However, as LS workloads intensify and GPU resources for BE requests diminish, efficient CPU utilization becomes critical. While the baseline systems struggle with inefficient CPU-based computation—creating significant bottlenecks—OmniServe's coordinated GPU-CPU orchestration delivers a 9.85× throughput advantage under heavy load.
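The orchestration described above can be illustrated with a minimal, runnable sketch. This is not OmniServe's implementation: plain Python threads and queues stand in for the CPU Attention workers and the GPU stream, a toy `score * 2` computation replaces the real Attention kernel, and `serve_layers`, `cpu_attention_worker`, and their parameters are hypothetical names. What it demonstrates is the non-blocking contract: at each layer, the GPU-side loop piggybacks only the BE Attention results that have already arrived instead of stalling on the CPU.

```python
import queue
import threading

def cpu_attention_worker(task_q, result_q):
    # Stand-in for a CPU Attention worker: consume offloaded BE tasks,
    # compute a toy "Attention" (score * 2), and post the result back.
    while True:
        task = task_q.get()
        if task is None:  # shutdown sentinel
            break
        req_id, score = task
        result_q.put((req_id, score * 2))

def serve_layers(num_layers, be_requests):
    # Simulated GPU loop. It never blocks on the CPU: at each layer it
    # drains only the Attention results already available and piggybacks
    # them onto that layer's Dense computation; late results simply join
    # a later layer instead of stalling the stream.
    task_q, result_q = queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=cpu_attention_worker, args=(task_q, result_q), daemon=True)
    worker.start()

    for req in be_requests:          # offload BE Attention up front
        task_q.put(req)

    aggregated = {}
    for _ in range(num_layers):
        while True:                  # non-blocking drain of ready results
            try:
                req_id, out = result_q.get_nowait()
            except queue.Empty:
                break
            aggregated[req_id] = out
        # ... the GPU would now run the Dense modules for the LS batch
        # plus the piggybacked BE requests ...

    task_q.put(None)                 # stop the worker, fold in stragglers
    worker.join()
    while not result_q.empty():
        req_id, out = result_q.get()
        aggregated[req_id] = out
    return aggregated
```

In the real system the queues carry intermediate tensors over PCIe or the network, and the drain step corresponds to the asynchronous aggregation of CPU Attention results into the GPU stream.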
This performance advantage persists in low-bandwidth environments: by transmitting only intermediate tensors, our lightweight Attention Piggybacking design minimizes communication overhead. Moreover, even under the configuration where only one CPU in the GPU server is involved, OmniServe still achieves a 3.47× improvement, proving its pervasive efficiency in CPU-limited conditions. The benefits are also pronounced for BE requests under the DailyMail benchmark in Fig. 17, where BE requests have relatively shorter context lengths. Although the baselines can serve more BE requests on the GPU in this case, OmniServe still achieves up to 9.1× improvement, proving its applicability in practical scenarios with diverse characteristics.

5.3 Effectiveness of Attention Piggybacking

In this section, we explore the benefits and impacts of the Attention Piggybacking mechanism. For BE services, experiments were conducted using the LongBench-v2 dataset, while LS workloads were tested with request rates of 4 req/s and 3 req/s for the 34B and 70B parameter models, respectively.

First, we conducted experiments to examine the benefit of leveraging a varying number of CPUs in clusters. In Fig. 18(a), we observe a significant and consistent improvement in BE throughput under OmniServe when utilizing more CPUs, compared to the throughput with only the CPU attached to the GPU host.

Fig. 19. (a) Illustration of OmniServe's Attention Piggybacking overhead, including residual and queue operations; (b) and (c) jointly reveal that the admission control design maintains the prefill SLO without sacrificing decoding throughput.

Thanks to the distributed Attention mechanism, we can fully utilize the CPU computational power in the cluster. Since the communication overhead of transferring intermediate results for BE requests is significantly smaller than the Attention computation time, it can be effectively hidden by the computation, yielding a near-linear throughput improvement. To this end, up to a 3.43× speedup is witnessed given four additional CPUs.

Second, we investigated the impact of involving more CPUs on SLO maintenance. As illustrated in Fig. 18(b), the median token generation latency remains nearly constant as the number of CPUs increases. Furthermore, the maximum latency consistently aligns with the decoding TPOT SLO. This demonstrates that the piggybacking control does not adversely affect LS services, thanks to the asynchronous communication and computation design, as well as the precise control over the number of BE requests piggybacked onto the computation of the Dense modules at each layer.

Third, we evaluated the implementation overhead of the Attention Piggybacking mechanism. Specifically, we quantified the latency introduced by auxiliary operations in the inference pipeline across layers, varying the number of piggybacked BE requests. In Fig. 19(a), these operations incur negligible overhead: queue read/write operations and residual storage required ≤ 75 µs even for 400 concurrent requests, owing to efficient contiguous memory access. However, residual loading from the residual store tensor—triggered when CPU-generated Attention results return out of sequence—introduces non-contiguous memory access, resulting in higher latency (approximately 0.5 ms for 400 requests).
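The residual and queue operations measured above can be pictured with a small bookkeeping sketch. This is an illustrative toy, not OmniServe's data structure: `ResidualStore` and its methods are hypothetical names, and floats stand in for residual tensors. It parks the Dense-path residual of a BE request whose Attention was offloaded, then merges it with the CPU's Attention output whenever that result returns, possibly out of sequence. In the real system the stash is a contiguous tensor write (the cheap ≤ 75 µs path), while the out-of-order merge incurs scattered reads (the costlier ~0.5 ms path); a dict cannot capture those memory-layout effects, only the bookkeeping.

```python
class ResidualStore:
    # Toy residual store for piggybacked BE requests: the residual is
    # stashed when Attention is offloaded to the CPU, and merged once
    # the Attention result returns, in whatever order results arrive.
    def __init__(self):
        self.pending = {}  # req_id -> stashed residual (a float here)

    def stash(self, req_id, residual):
        # "store re." / "write q." path: a cheap, contiguous write.
        self.pending[req_id] = residual

    def merge(self, req_id, attn_out):
        # "get re." path: load the stashed residual for a result that
        # returned out of sequence and add the Attention output to it.
        return self.pending.pop(req_id) + attn_out
```

For example, stashing residuals for requests 7 and 3 and merging the CPU results in the reverse order still yields the correct per-request sums.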
Critically, this cost remains marginal compared to the inference latency, as the piggybacking design inherently limits its frequency: Attention results are deferred by at least one iteration before residual loading occurs, ensuring such operations remain infrequent.

Finally, we conducted experiments to study the piggybacking overhead under varying numbers of CPUs. Specifically, we deployed the Yi-34B model on a GPU host and allocated 1, 2, 3, and 4 additional CPUs, respectively. For each configuration, we compared the SLO attainment rates with and without the BE service using the LongBench-v2 dataset. This comparison isolates the performance degradation introduced by the piggybacking mechanism for the BE service. Results show that SLO attainment is downgraded by only 0.36%, 0.34%, 0.28%, and 0.31% under these settings, proving consistently low overhead regardless of the number of CPUs allocated.

5.4 Ablation Study

5.4.1 Modeling accuracy. We investigated the modeling accuracy of the token generation latency. The Attention latency model is built by profiling 100 data samples. We evaluated the modeling accuracy of OmniServe across multiple parallel configurations using eight GPUs, collecting 1,000 samples for each test. As detailed in Table 2, the average accuracy for the Yi-34B model reaches 95.7%. For the larger Llama-2-70B model, which is more susceptible to inter-GPU communication overhead, our approach—which inherently profiles this overhead—maintains a high average accuracy of 94.5%. Moreover, the P90 accuracy—defined as the 90th percentile value in descending order for each configuration, measuring performance consistency—remains high across all tested setups, with values no lower than 93.6% and 92.7% for Yi-34B and Llama-2-70B, respectively, in any parallel configuration.

Table 2. The accuracy of OmniServe's latency model under different parallel configurations. Average and 90th-percentile values are reported for each setting (example: PP2&TP4 = 2 pipeline stages, each with a TP degree of 4).

             PP8&TP1        PP4&TP2        PP2&TP4       PP1&TP8
Yi-34B       94%, 92.2%     95.2%, 93.6%   94.9%, 93%    95.7%, 93.3%
Llama-70B    94.3%, 92.7%   93.8%, 91.5%   93.2%, 91%    94.5%, 92.1%

5.4.2 Admission control. We evaluated the effectiveness of OmniServe's admission control in optimizing performance for LS services. As illustrated in Fig. 19(b), without admission control, SLO attainment rates drop sharply as request rates increase, falling to critical levels under high load. In contrast, with admission control enabled, OmniServe sustains a 94.1% TTFT SLO attainment rate, even under aggressive workloads. This improvement stems from mitigating the adverse effects of indiscriminately admitting requests beyond system capacity—specifically, excessive prefill queuing delays and unmanageable KV cache contention. By judiciously regulating admissions, OmniServe achieves up to 43.3% higher prefill SLO compliance compared to the baseline approaches. Notably, this performance gain does not sacrifice LS serving throughput. As shown in Fig. 19(c), LS decoding throughput with admission control remains nearly identical to the unregulated scenario, with at most a marginal 6% difference. This indicates that OmniServe efficiently prioritizes high-priority requests without under-utilizing GPU resources. Furthermore, the scheduling overhead introduced by admission control is minimal, remaining below 1 ms even at peak loads (e.g., 8 requests per second). This overhead is negligible relative to the time required to generate a single token, ensuring the design's practicality for real-time inference.

5.4.3 Operational overhead.
Since we leverage Ray to implement the cluster-level control plane, we also evaluated the operational overhead introduced by this framework. Specifically, we measured the inference latency with and without the Ray framework to isolate its overhead. In the non-Ray baseline, all cross-host control logic (e.g., hardware discovery) was disabled, while the input data and model remained identical in both cases. Experimental results demonstrate that Ray induces at most a 0.98% and 0.9% latency increase for the Yi-34B and Llama-70B models, respectively.

5.4.4 Performance in the PD-disaggregation setting. We also evaluated OmniServe's performance under a Prefill-Decode disaggregation setup, demonstrating its strong compatibility. In this configuration, we deployed the LS service across separate prefill and decode instances, each using a tensor-parallel degree of 2. Hybrid serving on both instance types was facilitated by a customized SLO-aware scheduler operating in a manner analogous to § 3.3: it prioritizes LS requests and then utilizes any remaining computational resources for BE tasks. Additionally, the decode instance employs the piggybacking mechanism to offload part of the BE decoding Attention computation to the CPU. Using a Llama-2-70B model with an LS request rate of 4 req/s and BE workloads from LongBench-v2, OmniServe achieves up to a 1.48× higher SLO attainment rate and 6.94× greater BE decoding throughput compared to Llumnix.

6 Discussion

In this section, we present potential extensions and optimizations for leveraging OmniServe in LLM hybrid serving scenarios.

Supporting inference with multiple priority levels. This setting has gained significant attention recently in the field of efficient LLM serving [14, 28, 55, 64]. Typically, high-priority requests come with tight SLOs, whereas lower-priority requests are subject to more lenient ones.
The core idea behind OmniServe—asynchronous inter-request batching—can be adapted to this setting. Specifically, the computation for low-priority requests can be partially deferred at a given layer and then piggybacked onto the execution of the same layer several iterations later, with the deferral interval determined by the corresponding SLO requirement. This approach reduces token-generation latency for high-priority requests compared to continuous batching, as it alleviates contention for computational resources. Building on this, the scheduling policy introduced in § 3.3 can be naturally extended to formulate similar optimization problems that determine how many requests of each priority type can be scheduled.

Supporting more flexible CPU offloading designs. Recent research has proposed offloading non-Attention modules to the CPU as one such strategy [12, 26, 60]. Although these offloading strategies were not designed for hybrid LLM workloads with distinct priorities, they can be effectively integrated into OmniServe to improve the performance of BE services in various scenarios. For instance, in KTransformers [12], hybrid serving can be supported by partitioning CPU cores to separately handle MoE and Attention computations. Similarly, when BE workloads involve intensive prefill operations but few offloaded KV caches, OmniServe can leverage a subset of CPU cores equipped with modern instruction sets—such as Intel's Advanced Matrix Extensions (AMX)—to accelerate Dense computation within the BE service.

7 Related Work

LLM serving with CPUs. A growing body of work focuses on democratizing LLM serving on CPUs. Llama.cpp [6] and vLLM [27] support offloading partial layers to the CPU. PowerInfer [47] treats CPU memory as external storage, where model parameters and input tensors are transferred between the CPU and GPU during the inference process.
NEO and FastDecode [20, 24] identify the light computational cost of decoding Attention and leverage the CPU for decoding Attention to enhance the LS service. They split the inference batch into multiple sub-batches, enabling simultaneous execution on the CPU and GPU while reducing resource idle time. HeteGen [61] enables efficient small-batch LLM serving in resource-constrained environments where GPUs lack the memory to host the complete model parameters. To overcome this limitation, it offloads parameters to CPU memory and dynamically orchestrates computation between the CPU and GPU to minimize inference latency. KTransformers and LIA address the same problem as HeteGen but pursue a different optimization [12, 26]. They improve inference efficiency by leveraging the Advanced Matrix Extensions (AMX) technique, which provides matrix computation performance comparable to that of a GPU. However, since this specialized hardware is not universally available, the significant performance gap indicated in Table 1 limits the applicability of their strategies in environments without such support. Conversely, OmniServe focuses on the inference scenario where GPUs can host all model parameters. Moreover, these systems are primarily optimized for low latency and, as a result, largely overlook improvements in resource utilization for BE services.

SLO-oriented inference. A growing body of SLO-oriented work has been proposed. [10, 38, 39, 63] optimize SLO attainment for the LS service on GPU-only platforms. [12, 24, 26, 61] improve SLO achievement for the LS service on top of a CPU-GPU hybrid platform. Llumnix [49] is the first work that colocates LS and BE services, but it uses GPU resources only. In contrast, OmniServe stands out as the first work that sustains SLO achievement and improves BE throughput using both CPU and GPU resources.

LLM serving with heterogeneous GPU devices.
Recently, resource heterogeneity has become an important issue in datacenters [33–35], and several systems have been developed for efficient serving in heterogeneous GPU clusters [25, 31, 32], with delicate techniques to balance computational and memory resource usage across GPU devices. However, they focus on parallelizing computation and cache storage loads uniformly (e.g., sharding on tensor- or pipeline-parallelism) and applying synchronous data aggregation. As a result, they are also not well suited for adaptation to hybrid serving cases with CPUs and GPUs, considering the significant computational power gap in Dense modules.

Generalizing to advanced model architectures. The LoRA (i.e., Low-Rank Adaptation) architecture enables efficient adaptation of a shared foundational model to diverse service-specific generation requirements [13, 44, 54]. During inference, requests from distinct services are dynamically routed to specialized LoRA weights linked to the model's QKV modules. By retaining QKV computations on GPUs for BE requests—a resource-efficient design choice—OmniServe achieves seamless compatibility with LoRA-driven optimization strategies. Moreover, the OmniServe architecture also provides native support for MoE models. Because its design confines all Dense module computation to the GPU, MoE optimization strategies like Expert Parallelism are inherently compatible without modification.

8 Conclusion

This paper presents OmniServe, a new LLM inference system designed for serving hybrid loads with SLO awareness on CPU-GPU platforms.
By adopting an innovative Attention Piggybacking inference mechanism and a delicate batching control policy, OmniServe effectively leverages both CPU and GPU resources to optimize BE services while ensuring SLO guarantees for LS services within data centers. In its inference flow design, both CPUs and GPUs execute BE services through asynchronous communication, significantly reducing the interference introduced into the inference process of LS services. This work also creates an opportunity to harness CPU resources to enhance the serving capacity for a single type of service with extremely high request arrivals.

Acknowledgments

This work was supported in part by the Science and Technology Development Fund of Macau (0041/2025/RIA1, 0074/2025/AMJ), the Multi-Year Research Grant of the University of Macau (MYRG-GRG2024-00255-FST-UMDF, MYRG-GRG2025-00119-FST), and Alibaba Group through the Alibaba Innovative Research Program.

References

[1] 2024. ChatGPT. https://openai.com/chatgpt/.
[2] 2024. Communication Models. https://spcl.inf.ethz.ch/Teaching/2019-dphpc/lectures/lecture12-comm-models.pdf.
[3] 2024. flashinfer. https://github.com/flashinfer-ai/flashinfer.
[4] 2024. GPT-4o. https://openai.com/index/hello-gpt-4o/.
[5] 2024. Light-LLM. https://github.com/ModelTC/lightllm.
[6] 2024. Llama.cpp. https://github.com/ggerganov/llama.cpp.
[7] 2024. sharegpt. https://sharegpt.com/.
[8] 2024. Text Generation Inference. https://huggingface.co/docs/text-generation-inference/index.
[9] 2024. xformers. https://github.com/facebookresearch/xformers.
[10] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117–134.
[11] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. 2023. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508 (2023).
[12] Hongtao Chen, Weiyu Xie, Boxin Zhang, Jingqi Tang, Jiahao Wang, Jianwei Dong, Shaoyuan Chen, Ziwei Yuan, Chen Lin, Chengyu Qiu, Yuening Zhu, Qingliang Ou, Jiaqi Liao, Xianglin Chen, Zhiyuan Ai, Yongwei Wu, and Mingxing Zhang. 2025. Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models. In Proceedings of SOSP.
[13] Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, and Arvind Krishnamurthy. 2024. Punica: Multi-tenant LoRA serving. Proceedings of Machine Learning and Systems 6 (2024), 1–13.
[14] Siyuan Chen, Zhipeng Jia, Samira Khan, Arvind Krishnamurthy, and Phillip B Gibbons. 2025. SLOs-Serve: Optimized Serving of Multi-SLO LLMs. arXiv preprint arXiv:2504.08784 (2025).
[15] Xinyun Chen, Petros Maniatis, Rishabh Singh, Charles Sutton, Hanjun Dai, Max Lin, and Denny Zhou. 2021. SpreadsheetCoder: Formula prediction from semi-structured context. In International Conference on Machine Learning. PMLR, 1661–1672.
[16] Carlo Curino, Djellel E Difallah, Chris Douglas, Subru Krishnan, Raghu Ramakrishnan, and Sriram Rao. 2014. Reservation-based scheduling: If you're late don't blame us!. In Proceedings of the ACM Symposium on Cloud Computing. 1–14.
[17] Leonardo Dagum and Ramesh Menon. 1998. OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE 5, 1 (1998), 46–55.
[18] Christina Delimitrou and Christos Kozyrakis. 2013. iBench: Quantifying interference for datacenter applications. In 2013 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 23–33.
[19] Tim Furche, Georg Gottlob, Leonid Libkin, Giorgio Orsi, and Norman W Paton. 2016. Data wrangling for big data: Challenges and opportunities. In 19th International Conference on Extending Database Technology. 473–478.
[20] Jiaao He and Jidong Zhai. 2024. FastDecode: High-throughput GPU-efficient LLM serving using heterogeneous pipelines. arXiv preprint arXiv:2403.11421 (2024).
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[22] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems 32 (2019).
[23] Călin Iorgulescu, Reza Azimi, Youngjin Kwon, Sameh Elnikety, Manoj Syamala, Vivek Narasayya, Herodotos Herodotou, Paulo Tomita, Alex Chen, Jack Zhang, et al. 2018. PerfIso: Performance isolation for commercial Latency-Sensitive services. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). 519–532.
[24] Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, and Minlan Yu. 2024. Neo: Saving GPU memory crisis with CPU offloading for online LLM inference. arXiv preprint arXiv:2411.01142 (2024).
[25] Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, and Binhang Yuan. [n.d.]. HexGen: Generative Inference of Large Language Model over Heterogeneous Environment. In Forty-first International Conference on Machine Learning.
[26] Hyungyo Kim, Nachuan Wang, Qirong Xia, Jinghan Huang, Amir Yazdanbakhsh, and Nam Sung Kim. 2025. LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading. In Proceedings of the 52nd Annual International Symposium on Computer Architecture. 544–558.
[27] W oosuk K won, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Ecient memor y management for large language model ser ving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles . 611–626. [28] Zikun Li, Zhuofu Chen, Remi Delacourt, Gabriele Oliaro, Zeyu W ang, Qinghan Chen, Shuhuai Lin, April Y ang, Zhihao Zhang, Zhuoming Chen, et al . 2025. AdaServe: Accelerating Multi-SLO LLM Ser ving with SLO-Customized Speculative Decoding. arXiv preprint arXiv:2501.12162 (2025). [29] Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Y anping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gonzalez, et al . 2023. { AlpaServe } : Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symp osium on Operating Systems Design and Implementation (OSDI 23) . 663–679. [30] Percy Liang, Rishi Bommasani, T ony Lee, Dimitris T sipras, Dilara Soylu, Michihiro Y asunaga, Yian Zhang, Deepak Narayanan, Y uhuai Wu, Ananya Kumar , et al . 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022). [31] Yixuan Mei, Y onghao Zhuang, Xupeng Miao , Juncheng Y ang, Zhihao Jia, and Rashmi Vinayak. 2024. Helix: Distributed Serving of Large Language Models via Max-F low on Heterogeneous GP Us. arXiv preprint arXiv:2406.01566 (2024). [32] Zizhao Mo, Jianxiong Liao, Huanle Xu, Zhi Zhou, and Chengzhong Xu. 2025. Hetis: Ser ving llms in heterogeneous gpu clusters with ne-grained and dynamic parallelism. In Procee dings of the International Conference for High Performance Computing, Networking, Storage and A nalysis . 1710–1724. [33] Zizhao Mo, Huanle Xu, and Wing Cheong Lau. 2024. Optimal resource eciency with fairness in heterogeneous GP U clusters. In Proceedings of the 25th International Middleware Conference . 36–48. [34] Zizhao Mo, Huanle Xu, and Wing Cheong Lau. 2025. 
Fast and fair training for deep learning in heterogeneous GP U clusters. In Proceedings of the 39th A CM International Conference on Sup ercomputing . 324–338. [35] Zizhao Mo, Huanle Xu, and Chengzhong Xu. 2024. Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters. In Proceedings of the 29th ACM International Conference on A rchitectural Support for Programming Languages and Operating Systems, V olume 2 . 499–513. [36] Philipp Moritz, Robert Nishihara, Stephanie W ang, Alexey T umanov , Richard Liaw , Eric Liang, Melih Elibol, Zongheng Y ang, William Paul, Michael I Jordan, et al . 2018. Ray: A distributed framework for emerging { AI } applications. In Proc. ACM Manag. Data, V ol. 4, No. 3 (SIGMOD), Article 230. Publication date: June 2026. Serving Hybrid LLM Loads with SLO Guarantees Using CP U-GP U Aention Piggybacking 230:25 13th USENIX symposium on op erating systems design and implementation (OSDI 18) . 561–577. [37] Deepak Narayanan, Mohammad Sho eybi, Jared Casper , Patrick LeGresley , Mostofa Patwary , Vijay Korthikanti, Dmitri V ainbrand, Prethvi Kashinkunti, Julie Bernauer , Br yan Catanzaro, et al . 2021. Ecient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the international conference for high performance computing, networking, storage and analysis . 1–15. [38] Hyungjun Oh, Kihong Kim, Jaemin Kim, Sungkyun Kim, Junyeol Lee, Du-seong Chang, and Jiwon Seo. 2024. Exegpt: Constraint-aware resource scheduling for llm infer ence. In Proceedings of the 29th ACM International Conference on A rchitectural Support for Programming Languages and Op erating Systems, V olume 2 . 369–384. [39] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Ecient generative llm inference using phase splitting. In 2024 A CM/IEEE 51st A nnual International Symposium on Computer Ar chitecture (ISCA) . IEEE, 118–132. [40] F. Pedregosa, G. 
V aroquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer , R. W eiss, V . Dubourg, J. V anderplas, A. Passos, D. Cournapeau, M. Brucher , M. Perrot, and E. Duchesnay . 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830. [41] Haoran Qiu, Subho S Banerjee, Saurabh Jha, Zbigniew T Kalbar czyk, and Ravishankar K Iyer . 2020. { FIRM } : An intelli- gent ne-grained resource management framework for { SLO-Oriented } microservices. In 14th USENIX symposium on operating systems design and implementation (OSDI 20) . 805–825. [42] Sultan Mahmud Sajal, Luke Marshall, Beibin Li, Shandan Zhou, Abhisek Pan, Konstantina Mellou, Deepak Narayanan, Timothy Zhu, David Dion, Thomas Moscibroda, et al . 2023. Kerveros: Ecient and Scalable Cloud Admission Control. In 17th USENIX Symposium on Op erating Systems Design and Implementation (OSDI 23) . 227–245. [43] Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get T o The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th A nnual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers) . Association for Computational Linguistics, V ancouver , Canada, 1073–1083. doi:10.18653/v1/P17- 1099 [44] Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper , Nicholas Le e, Shuo Y ang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer , et al . 2023. S-lora: Ser ving thousands of concurrent lora adapters. arXiv preprint arXiv:2311.03285 (2023). [45] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. F lexgen: High-throughput generative infer ence of large language models with a single gpu. In International Conference on Machine Learning . PMLR, 31094–31116. [46] Mohammad Shoeybi, Mostofa Patwar y , Raul Puri, Patrick LeGr esley , Jared Casp er , and Bryan Catanzaro. 2019. 
Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019).
[47] Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2023. PowerInfer: Fast large language model serving with a consumer-grade GPU. arXiv preprint arXiv:2312.12456 (2023).
[48] Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. DynamoLLM: Designing LLM inference clusters for performance and energy efficiency. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1348–1362.
[49] Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic Scheduling for Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24).
[50] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).
[51] Ashish Vaswani. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017).
[52] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. 2015. Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems. 1–17.
[53] Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. 640–654.
[54] Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. {dLoRA}: Dynamically orchestrating requests and adapters for {LoRA} {LLM} serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 911–927.
[55] Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, Niloofar Gholipour, Parham Yassini, Hong Chang, Kan Chen, Qiantao Zhang, et al. 2025. HyperFlexis: Joint design of algorithms and systems for multi-SLO serving and fast scaling. arXiv preprint arXiv:2508.15919 (2025).
[56] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for {Transformer-Based} generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521–538.
[57] Chen Zhang, Lingxiao Ma, Jilong Xue, Yining Shi, Ziming Miao, Fan Yang, Jidong Zhai, Zhi Yang, and Mao Yang. 2023. Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning. In Proceedings of OSDI.
[58] Chengliang Zhang, Minchen Yu, Wei Wang, and Feng Yan. 2019. {MArk}: Exploiting cloud services for {Cost-Effective}, {SLO-Aware} machine learning inference serving. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). 1049–1062.
[59] Hong Zhang, Yupeng Tang, Anurag Khandelwal, and Ion Stoica. 2023. {SHEPHERD}: Serving {DNNs} in the wild. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 787–808.
[60] Qian Zhang, Jiyuan Wang, Guoqing Harry Xu, and Miryung Kim. 2022. HeteroGen: Transpiling C to heterogeneous HLS code with automated test generation and program repair. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 1017–1029.
[61] Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, and Yang You. 2024. HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices. arXiv preprint (2024).
[62] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2025. SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems 37 (2025), 62557–62583.
[63] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. {DistServe}: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 193–210.
[64] Kan Zhu, Haiyang Shi, Le Xu, Jiaxin Shan, Arvind Krishnamurthy, Baris Kasikci, and Liguang Xie. 2025. PolyServe: Efficient Multi-SLO Serving at Scale. arXiv preprint arXiv:2507.17769 (2025).

Received October 2025; revised January 2026; accepted February 2026

Proc. ACM Manag. Data, Vol. 4, No. 3 (SIGMOD), Article 230. Publication date: June 2026.