Low-Latency Edge LLM Handover via Joint KV Cache Transfer and Token Prefill
Authors: Seunghun Lee, Jihong Park, Ce Zheng, Hyuncheol Park
Seunghun Lee, Graduate Student Member, IEEE, Jihong Park, Senior Member, IEEE, Ce Zheng, Member, IEEE, and Hyuncheol Park, Senior Member, IEEE

Abstract—Edge deployment of large language models (LLMs) can reduce latency for interactive services, but mobility introduces service interruptions when a user equipment (UE) hands over between base stations (BSs). To promptly resume decoding, the target-side edge server must recover the UE context state, which can be provisioned either by token forwarding followed by prefill computation or by direct key-value (KV) cache transmission over backhaul. This paper proposes a unified handover (HO) design that jointly selects the prefill length and schedules backhaul KV cache delivery to minimize the worst-user LLM HO delay for multiple UEs. The resulting scheme admits a tractable step-wise solution with explicit feasibility conditions and a constructive rate-scheduling policy. Simulations show that the proposed method consistently outperforms baselines across a wide range of backhaul capacities, prefill speeds, and context sizes, providing practical guidelines for mobility-aware Edge LLM token streaming.

Index Terms—Token Streaming, Edge LLM, Key-Value Cache.

I. INTRODUCTION

Large language model (LLM) streaming has recently emerged as a key application for beyond-5G networks, with billions of users already accessing services such as ChatGPT and Gemini via mobile devices. Existing LLM streaming services are predominantly cloud-based, which can incur significant and time-varying latency over wireless links [1]. To enable low-latency and differentiated services, deploying LLMs at the network edge, referred to as Edge LLM, has recently been recognized as a promising approach [2], [3].
However, provisioning seamless Edge LLM streaming for mobile users remains challenging due to token decoding dependency during Edge LLM handover (HO). Specifically, LLM inference is autoregressive, where each token is generated conditioned on the key-value (KV) cache of previously decoded tokens. When a mobile user is handed over between base stations (BSs) equipped with Edge LLMs, the target BS has not decoded the user's past tokens, and therefore cannot immediately preserve the token streaming context. A straightforward solution is to transfer past tokens to the target BS and re-decode them to reconstruct the KV cache.

S. Lee and H. Park are with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, Republic of Korea (e-mail: seunghun21@kaist.ac.kr; hcpark@kaist.ac.kr). J. Park is with the ISTD Pillar, Singapore University of Technology and Design (SUTD), 8 Somapah Rd, Singapore 487372, Singapore (e-mail: jihong_park@sutd.edu.sg). C. Zheng is with the Department of Broadband Communication, Pengcheng Lab, Shenzhen 518066, China (e-mail: zhengc@pcl.ac.cn). This work was supported in part by A*STAR under its IAF-ICP (I2501E0064), in part by the IITP-ITRC grant funded by the Korean government (MSIT) (IITP-2026-RS-2023-00259991) (33%), in part by the SUTD Kickstarter Initiative (SKI 2021_06_08), and in part by the National Research Foundation, Singapore, and the Infocomm Media Development Authority under its Future Communications Research & Development Programme. (Corresponding authors: J. Park and H. Park.)

Fig. 1. An illustration of the proposed ctHO, which jointly exploits batch prefill and KV cache transfer to minimize the worst-user HO delay.
The KV cache reconstruction of this token-based HO (tHO) is equivalent to the prefill phase of LLM inference, which is computationally intensive and incurs a large time-to-first-token (TTFT), resulting in substantial HO delay. Furthermore, prefill is often batched across users, making the worst-user delay the bottleneck under simultaneous HOs. Alternatively, BSs can transfer the KV cache directly over data backhaul links. This cache-based HO (cHO) can significantly reduce Edge LLM HO delay without re-decoding past tokens. However, the KV cache can be large (e.g., GB-scale for billion-parameter models), and limited backhaul capacity may constrain inter-BS KV cache transfer, particularly when multiple HOs occur simultaneously.

Motivated by these challenges, we propose a joint cache-token based HO framework (ctHO) for Edge LLM systems, in which token-based partial KV cache reconstruction is performed concurrently with the remaining KV cache transfer, as shown in Fig. 1. This joint design minimizes the worst-user HO delay under backhaul capacity constraints. The main contributions of this paper are as follows:

• We propose ctHO, which minimizes Edge LLM HO delay by optimizing the length of the prefilled KV cache and the backhaul rate allocation.
• We develop a two-stage optimization that determines the backhaul scheduling for a given prefill length and then selects the prefill length, with a proof of global optimality.
• Simulations show that ctHO reduces the worst-user HO delay by up to 3.1× and 3.7× compared to cHO and tHO, respectively.

Token-based communication has been studied for multiple access [4], packetization [5], and multimodal transmission [6], but existing works assume static users. In contrast, we focus on HOs of mobile users.
Meanwhile, KV cache management, including compression [7] and cache-based inter-agent communication [8], [9], has recently emerged. These techniques are orthogonal to our approach and can further reduce the Edge LLM HO latency of the proposed framework.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. Multi-UE HO During Edge LLM Token Streaming

We consider a mobile Edge LLM token streaming service in which a source BS separately streams LLM-generated tokens to K UEs. During this service, we assume that the UEs move toward a target BS, and that UE i triggers HO to the target BS at time τ_i, where both BSs deploy the same LLM architecture and parameters. After HO occurs, token streaming is stopped and the target BS prepares the KV cache corresponding to the C_i tokens, where C_i denotes the number of tokens decoded for UE i up to τ_i. Streaming resumes only after the KV cache preparation is completed, and the resulting Edge LLM HO delay is the objective to minimize.

As shown in Fig. 1, the target BS can rebuild a portion of the required KV cache by receiving tokens from the source BS and performing prefill on them, while the remaining KV cache can be directly delivered via backhaul [7]. The delay for token transmission is assumed to be negligible compared to the KV cache transfer and prefill delays. Because token streaming can resume only after both processes are completed, the LLM HO delay is governed by the larger of the two delays. We therefore aim to minimize the LLM HO delay by jointly determining the amount of KV cache processed by prefill and the backhaul capacity allocation across UEs. To this end, we assume that the source BS knows {(τ_i, C_i)}_{i=1}^{K} and the prefill delay profile of the target BS. Without loss of generality, we sort UEs by their HO trigger times,

τ_1 ≤ τ_2 ≤ ⋯ ≤ τ_K.  (1)
At the target BS, to reduce the prefill delay compared with sequentially performing separate prefills for individual UEs, batch prefill jointly processes the tokens of multiple UEs in a single batch [10]. Accordingly, the KV cache is reconstructed for a common prefix of length L for all UEs, where L ∈ [0, C_max] and C_max ≜ max_i C_i. Because the same input token sequence length must be used across the batch, if C_i < L, only the available C_i tokens of UE i are involved in batch prefill, while the remaining token positions up to length L are zero-padded. Hence, the C_i tokens of UE i are split as

n_i^(pf)(L) ≜ min{C_i, L},  (2)
n_i^(tx)(L) ≜ C_i − n_i^(pf)(L) = (C_i − L)^+,  (3)

where (·)^+ = max{·, 0}, n_i^(pf)(L) is the number of tokens processed by prefill, and n_i^(tx)(L) is the number of remaining tokens whose corresponding KV cache is delivered from the source BS.

For backhaul transmission, the rates allocated to the UEs at time t, denoted by r_i(t), are subject to the capacity constraint

∑_{i=1}^{K} r_i(t) ≤ R  [tokens/s],  ∀t,  (4)

where R ≜ R_bh / s_KV is the backhaul capacity normalized by the KV cache payload size per token s_KV. Thus, both R and r_i(t) are expressed in tokens/s to match the token-based quantities in the prefill model, such as L and C_i.

B. Worst-user LLM HO Delay Minimization

Let T^(pf)(L) and T_i^(tx)(L, r_i, C_i) denote the completion times of batch prefill and cache transfer for UE i, respectively. Based on these completion times, the HO prefill delay and the cache transfer delay are defined as

D_i^(pf)(L) ≜ T^(pf)(L) − τ_i,  (5)
D_i^(tx)(L, r_i, C_i) ≜ T_i^(tx)(L, r_i, C_i) − τ_i,  (6)

respectively.
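For concreteness, the token split in Eqs. (2)–(3) can be sketched in a few lines of Python (illustrative only; the function name is ours, not from the paper):

```python
def token_split(C_i: int, L: int) -> tuple[int, int]:
    """Split UE i's C_i decoded tokens for a shared prefill length L.

    Returns (n_pf, n_tx): tokens rebuilt by batch prefill, Eq. (2),
    and tokens whose KV cache is sent over backhaul, Eq. (3).
    """
    n_pf = min(C_i, L)        # common-prefix part reconstructed via prefill
    n_tx = max(C_i - L, 0)    # remaining part, (C_i - L)^+, shipped as KV cache
    return n_pf, n_tx

# Example: three UEs with contexts 3072, 2048, 4096 tokens and L = 2560.
splits = [token_split(C_i, 2560) for C_i in (3072, 2048, 4096)]
```
A UE with C_i ≤ L contributes no backhaul traffic, matching the zero-padding convention in the batch prefill model.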
Since token decoding can resume only after both processes are completed and the KV cache is reconstructed, the resulting LLM HO delay of UE i is

D_i(L, r_i, C_i) ≜ max{ D_i^(pf)(L), D_i^(tx)(L, r_i, C_i) }.  (7)

Accordingly, the worst-user LLM HO delay is defined as

D(L, r) ≜ max_{i∈{1,...,K}} D_i(L, r_i, C_i).  (8)

We next specify the completion times of batch prefill and cache transfer. Let T_c denote the batch prefill cycle interval. Then batch prefill starts at the first cycle boundary after the latest HO time τ_K, i.e., t_s ≜ ⌈τ_K / T_c⌉ T_c. Hence, the batch prefill completion time is

T^(pf)(L) ≜ t_s + p(L),  (9)

where p(L) denotes the batch prefill delay. The cache transfer completion time T_i^(tx)(L, r_i, C_i) is the time at which the remaining KV cache corresponding to n_i^(tx)(L) tokens has been fully delivered. Accordingly,

T_i^(tx)(L, r_i, C_i) ≜ inf{ t ≥ τ_i : ∫_{τ_i}^{t} r_i(u) du ≥ n_i^(tx)(L) }.  (10)

Our goal is to jointly choose the shared prefill length and the backhaul rate allocation to minimize the worst-user LLM HO delay over all UEs:

P: min_{0 ≤ L ≤ C_max, r(·)} D(L, r)  (11a)
   s.t. ∑_i r_i(t) ≤ R.  (11b)

Although P is a joint optimization problem over L and r(·), it can be solved in a step-wise manner without loss of optimality. Specifically, for each fixed L, we first optimize r(·), and then optimize L. The following proposition shows that this decomposition is globally optimal. To formalize this step-wise decomposition, define the value function

V(L) ≜ min_{r(·)} D(L, r).  (12)

Let r⋆(L) ∈ arg min_{r(·)} D(L, r) denote an optimal rate allocation policy for a given L and the backhaul constraint, so that V(L) = D(L, r⋆(L)), and let L⋆ = arg min_{0 ≤ L ≤ C_max} V(L).

Proposition 1. The pair (L⋆, r⋆(·)) is a global optimum of P.

Proof.
For any feasible (L, r), by definition of V(L) we have V(L) ≤ D(L, r). Moreover, D(L⋆, r⋆(L⋆)) = V(L⋆) and V(L⋆) ≤ V(L) by optimality of L⋆. Hence,

D(L⋆, r⋆(L⋆)) = V(L⋆) ≤ V(L) ≤ D(L, r),  (13)

which proves global optimality.

III. STEP-WISE OPTIMIZATION OF BACKHAUL SCHEDULING AND PREFILL LENGTH

Based on Proposition 1, we decompose P into the following two subproblems. (11) is equivalently expressed as

P1: V(L) := min_{r(·)} D(L, r), for a given L,  (14)
P2: min_{0 ≤ L ≤ C_max} V(L).  (15)

In the following, we first characterize V(L) by solving P1 for any fixed L, then obtain the optimal L by solving P2.

A. Optimal Cache Transfer Delay and Prefill Length

To solve P1 for a fixed L, since the prefill delay is independent of r(·), optimizing P1 reduces to minimizing the worst-user cache transfer delay D^(tx)(L, r) as follows:

V(L) = min_{r(·)} D(L, r)
     = min_{r(·)} max_i max{ D_i^(pf)(L), D_i^(tx)(L, r_i, C_i) }
     = min_{r(·)} max{ max_i D_i^(pf)(L), max_i D_i^(tx)(L, r_i, C_i) }
     = max{ D^(pf)(L), min_{r(·)} D^(tx)(L, r) },  (16)

where D^(pf)(L) ≜ max_i D_i^(pf)(L) and D^(tx)(L, r) ≜ max_i D_i^(tx)(L, r_i, C_i). The following proposition characterizes the minimum achievable cache transfer delay for a fixed L by examining the cumulative remaining KV caches over the first k UEs. To this end, define

S_k(L) ≜ ∑_{i=1}^{k} n_i^(tx)(L),  k ∈ {1, ..., K},  (17)

as the cumulative remaining token amount for the first k UEs.

Proposition 2. For any fixed L ∈ [0, C_max], the minimum feasible cache transfer delay D⋆^(tx)(L) ≜ min_{r(·)} D^(tx)(L, r) is

D⋆^(tx)(L) = max_{k∈{1,...,K}} [ S_k(L)/R − (τ_k − τ_1) ]^+.  (18)
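The lower bound in Eq. (18) is directly computable from the remaining-token counts and the sorted HO times; a minimal sketch (the helper name is ours):

```python
def min_cache_transfer_delay(n_tx, tau, R):
    """Minimum worst-user cache transfer delay, Eq. (18).

    n_tx : remaining-token counts n_i^(tx)(L), one per UE, sorted by HO time.
    tau  : HO trigger times tau_i, sorted ascending (Eq. (1)).
    R    : normalized backhaul capacity in tokens/s.
    """
    S, best = 0.0, 0.0
    for k in range(len(n_tx)):
        S += n_tx[k]                                   # cumulative S_k(L), Eq. (17)
        best = max(best, S / R - (tau[k] - tau[0]))    # per-k bound, clamped at 0 below
    return max(best, 0.0)
```
The running maximum over k captures the bottleneck prefix of UEs whose combined KV cache cannot be delivered faster than R permits.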
Fig. 2. An illustration of the proposed design principle. The HO prefill delay D^(pf)(L) and the minimum cache transfer delay D⋆^(tx)(L) are set equal by choosing the batch prefill length L, so that neither process dominates the overall LLM HO delay.

Proof. For any feasible rate allocation r(·), by the definition D^(tx)(L, r) = max_i D_i^(tx)(L, r_i, C_i) and (6), every UE satisfies

T_i^(tx)(L, r_i, C_i) ≤ τ_i + D^(tx)(L, r),  ∀i ∈ {1, ..., K}.  (19)

Since the KV cache of UE i can be transmitted only after τ_i, by time τ_k + D^(tx)(L, r) the system must have delivered at least the total KV cache amount corresponding to the S_k(L) tokens released up to τ_k to the first k UEs. On the other hand, under the constraint ∑_i r_i(t) ≤ R, the source BS can deliver the KV cache corresponding to at most R(D^(tx)(L, r) + τ_k − τ_1) tokens over [τ_1, τ_k + D^(tx)(L, r)]. Therefore,

S_k(L) ≤ R(D^(tx)(L, r) + τ_k − τ_1),  ∀k ∈ {1, ..., K}.  (20)

Rearranging and combining with D^(tx)(L, r) ≥ 0 yields

D^(tx)(L, r) ≥ [ S_k(L)/R − (τ_k − τ_1) ]^+,  ∀k,  (21)

which holds for every feasible r(·), so D⋆^(tx)(L) ≥ max_k [ S_k(L)/R − (τ_k − τ_1) ]^+. Achievability follows from the policy in Section III-B that attains this bound.

In the following, we consider only the UEs whose remaining KV cache cannot be fully delivered before batch prefill, since batch prefill is unnecessary for UEs whose remaining KV cache can already be fully delivered.

We next solve P2 by optimizing L in (15). Since D^(pf)(L) is non-decreasing in L whereas D⋆^(tx)(L) is non-increasing in L, the minimum is achieved by choosing L such that the two terms are equal, whenever possible. The following proposition formalizes this condition.

Proposition 3.
If there exists an L ∈ [0, C_max] such that

D^(pf)(L) = D⋆^(tx)(L),  (22)

then any such L minimizes V(L) on [0, C_max]. Otherwise, the minimizer occurs at a boundary, i.e., L⋆ ∈ {0, C_max}.

Proof. Let f(L) ≜ D^(pf)(L) and g(L) ≜ D⋆^(tx)(L), where f is non-decreasing and g is non-increasing on [0, C_max], and, according to (16), V(L) = max{f(L), g(L)}. If an L_0 satisfies f(L_0) = g(L_0), then V(L) = g(L) ≥ g(L_0) for L < L_0 and V(L) = f(L) ≥ f(L_0) for L > L_0, so L_0 minimizes V(L). Otherwise, either f(L) > g(L) or f(L) < g(L) for all L, and the minimum is attained at L = 0 or L = C_max, respectively.

B. Optimal Backhaul Rate Scheduling Algorithm

Proposition 2 characterizes the optimal value D⋆^(tx)(L), whereas P1 also requires a feasible policy r(·) that attains it. Among possibly many such optimal policies, we present a simple one that achieves D⋆^(tx)(L) by allocating the backhaul rate to one UE at a time. Let π(t) ∈ {1, ..., K} denote the backhaul scheduler, i.e., the index of the UE that is served by the backhaul at time t. The scheduler is updated only when the active set changes, namely, when a new UE begins to be served at t = τ_i or when the currently served UE completes its remaining KV cache transmission. Let n_i^(rem)(t, L) ≜ [ n_i^(tx)(L) − ∫_{τ_i}^{t} r_i(s) ds ]^+ denote the remaining token amount of UE i at time t. Also, let

A(t) ≜ { i ∈ {1, ..., K} : t ≥ τ_i, n_i^(rem)(t, L) > 0 }  (23)

denote the active set of UEs whose HO has already occurred but whose KV cache transfer has not yet been completed. For a given L, we define the target cache transfer completion time of UE i as d_i(L) ≜ τ_i + D⋆^(tx)(L).
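Combining Propositions 2 and 3, L⋆ can be found numerically by bisection on D^(pf)(L) = D⋆^(tx)(L) when the paper's linear prefill model p(L) = aL + b is used. The sketch below is ours (helper names included) and only assumes the definitions of t_s and Eq. (18):

```python
import math

def optimal_prefill_length(C, tau, R, a, b, Tc, tol=1e-6):
    """Select L per Proposition 3: equalize the prefill and transfer delays
    if a crossing exists on [0, C_max]; otherwise take the better boundary."""
    C_max = max(C)
    t_s = math.ceil(tau[-1] / Tc) * Tc       # first prefill cycle boundary after tau_K

    def f(L):  # worst-user prefill delay D^(pf)(L) = t_s + p(L) - tau_1, non-decreasing
        return t_s + a * L + b - tau[0]

    def g(L):  # minimum transfer delay D*(tx)(L) from Eq. (18), non-increasing
        S, best = 0.0, 0.0
        for k in range(len(C)):
            S += max(C[k] - L, 0)            # cumulative remaining tokens S_k(L)
            best = max(best, S / R - (tau[k] - tau[0]))
        return max(best, 0.0)

    lo, hi = 0.0, float(C_max)
    if f(lo) >= g(lo):                       # prefill dominates everywhere -> L* = 0
        return lo
    if f(hi) <= g(hi):                       # transfer dominates everywhere -> L* = C_max
        return hi
    while hi - lo > tol:                     # bisection for f(L) = g(L)
        mid = 0.5 * (lo + hi)
        if f(mid) < g(mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```
Monotonicity of f and g guarantees that the bisection converges to the unique crossing whenever one exists.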
At any time t, the scheduler π(t) selects the UE with the smallest d_i(L) among A(t) and allocates the entire backhaul capacity to that UE, i.e.,

π(t) = arg min_{i∈A(t)} d_i(L),  (24)
r_i(t) = R · 1{π(t) = i}.  (25)

The full backhaul capacity is always assigned to a single UE, and π(t) changes only when A(t) changes (either at t = τ_i or when n_{π(t)}^(rem)(t, L) = 0).

Under the policy above, the backhaul is never idle whenever A(t) ≠ ∅, and it always serves the UE with the smallest d_i(L) among A(t). Hence, by time τ_k + D⋆^(tx)(L) the policy has transmitted at least R(D⋆^(tx)(L) + τ_k − τ_1) tokens' worth of the KV cache available up to each τ_k, which is sufficient to complete S_k(L) for every k. This construction yields a feasible r(·) that attains D⋆^(tx)(L) for the fixed L, completing the solution of P1.

IV. SIMULATION RESULTS

A. Simulation Setup

We consider a multi-UE HO scenario with K = 4 UEs. For each UE i, C_i is uniformly sampled from Unif[1024, C_max] tokens. T_c is set to 0.01 s. A 1D line-network model is adopted with two BSs located at x^(s) = 0 (source) and x^(t) = D_bs (target), where D_bs = 300 m [11]. Each UE starts from x_i(0) ∼ Unif[120, 130] m and moves with a constant speed v_i = 20 m/s, so that its position evolves as x_i(t) = x_i(0) + v_i t. HO is triggered when the UE crosses the boundary x_b = 150 m, which gives the HO time τ_i = (x_b − x_i(0)) / v_i. This 1D line-network and mobility setting abstracts a road-segment scenario where vehicles move along a line and sequentially associate with nearby access points deployed along the roadside.
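For concreteness, the one-UE-at-a-time policy of Eqs. (24)–(25) admits a simple event-driven realization. The routine below is an illustrative sketch (function and variable names are ours, not the authors' implementation):

```python
def simulate_edf_backhaul(n_tx, tau, R, d_star):
    """Serve the active UE with the earliest target completion time
    d_i = tau_i + D*(tx), allocating the full rate R to one UE at a time.
    Returns per-UE cache-transfer completion times (None if n_tx[i] == 0
    and the UE is never served)."""
    K = len(n_tx)
    rem = [float(x) for x in n_tx]           # remaining tokens n_i^(rem)
    done = [None] * K
    t = tau[0]
    while any(r > 1e-9 for r in rem):
        active = [i for i in range(K) if tau[i] <= t and rem[i] > 1e-9]
        if not active:                       # backhaul idles until the next HO arrival
            t = min(tau[i] for i in range(K) if rem[i] > 1e-9 and tau[i] > t)
            continue
        i = min(active, key=lambda j: tau[j] + d_star)   # earliest deadline d_i(L)
        arrivals = [tau[j] for j in range(K) if tau[j] > t]
        horizon = min(arrivals) if arrivals else float("inf")
        finish = t + rem[i] / R              # time to drain UE i at full rate R
        if finish <= horizon:
            t, rem[i], done[i] = finish, 0.0, finish
        else:                                # preempt bookkeeping at the next arrival
            rem[i] -= R * (horizon - t)
            t = horizon
    return done
```
Because all deadlines share the offset D⋆^(tx)(L) and the τ_i are sorted, the earliest-deadline rule reduces to serving the earliest-arrived active UE, and the resulting worst-user transfer delay matches the bound of Eq. (18).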
The KV cache payload size per token is s_KV = 2 · N_ℓ · N_kv · d_h · q bits, where the leading factor 2 accounts for the key and value tensors, N_ℓ is the number of Transformer layers, N_kv is the number of key-value heads, d_h is the head dimension, and q is the per-element precision. For the LLM, we adopt Qwen2.5-7B-Instruct, in which (N_ℓ, N_kv, d_h) = (28, 4, 128) and q = 16 bits. This yields s_KV = 458,752 bits per token, so that when C = 3072 tokens, the total KV cache size is 176 MB. We use the prefill model p(L) = aL + b, where a and b are set to 9.4267 × 10^−5 and 2.4 × 10^−3 by default.

B. Impact of Backhaul Transmission Rate, Prefill Speed, Cache Size, and Number of UEs

In this section, we compare ctHO with tHO and cHO:

tHO: L = C_max ⇒ n_i^(pf)(L) = C_i, n_i^(tx)(L) = 0, ∀i,  (26)
cHO: L = 0 ⇒ n_i^(pf)(L) = 0, n_i^(tx)(L) = C_i, ∀i,  (27)

where tHO relies entirely on batch prefill, whereas cHO relies entirely on KV cache transmission over backhaul.

Fig. 3(a) shows the average LLM HO delay as the backhaul rate R_bh changes. tHO remains unchanged since it does not rely on backhaul KV cache transfer. By contrast, the delay of cHO decreases as R_bh increases and approaches that of ctHO in the high-backhaul regime, whereas at R_bh = 2 Gbps, cHO exhibits more than 3.1× larger delay than ctHO.

Fig. 3(b) varies the compute capability by changing a in the prefill delay model p(L) = aL + b. cHO is unchanged because it does not use prefill. tHO improves as a decreases since faster prefill reduces prefill delay. ctHO consistently achieves the smallest delay by selecting L⋆ according to the compute condition: when prefill is slow (large a), it reduces L⋆ and relies more on backhaul transmission, and when prefill becomes faster (small a), it increases L⋆ to reduce the backhaul load.
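The KV cache payload model in Section IV-A can be checked numerically; the snippet below reproduces the quoted s_KV and the 176 MB figure for a 3072-token context (the function name is ours):

```python
def kv_bits_per_token(n_layers, n_kv_heads, d_head, bits_per_elem):
    """KV cache payload per token: 2 (key + value) x layers x KV heads
    x head dimension x per-element precision, in bits."""
    return 2 * n_layers * n_kv_heads * d_head * bits_per_elem

# Qwen2.5-7B-Instruct with GQA: 28 layers, 4 KV heads, head dim 128, FP16.
s_kv = kv_bits_per_token(28, 4, 128, 16)       # 458,752 bits/token
total_mb = s_kv * 3072 / 8 / 1e6               # ~176 MB for C = 3072 tokens
```
The small KV-head count (4 versus 28 query heads) already reflects grouped-query attention; with full multi-head attention the cache would be several times larger.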
In Fig. 3(c), the worst-user LLM HO delay increases with the maximum cache size C_max, since a larger context requires more computation and transmission before streaming can resume. Across the entire sweep, ctHO consistently achieves the lowest delay by jointly leveraging prefill and KV cache transmission.

Fig. 3(d) shows the worst-user LLM HO delay as K increases. As K increases, the delay of all methods increases because more UEs must share the same backhaul and prefill resources. Among the three schemes, ctHO consistently achieves the smallest delay over the entire range of K. The performance gap becomes more pronounced as K increases. For example, at K = 12, the LLM HO delay of ctHO is about 0.95 s, compared with 1.25 s for tHO and 3.35 s for cHO.

C. Comparison of Total Streaming Delay With and Without HO

To compare the delay with and without HO, we evaluate the worst-user total streaming delay, which includes the LLM HO delay and the token streaming delay after LLM HO, by sweeping D_bs while fixing the number of newly generated tokens after LLM HO to G = 1024 tokens. The signal-to-noise ratio (SNR) of UE i at time t is modeled as a function of the distance between the UE position x_i(t) and the serving BS location x^(b), where b = s with x^(s) = 0 denotes the no-HO case and b = t with x^(t) = D_bs denotes the HO case:

γ_i^(b)(t) = γ_ref ( d_ref / |x_i(t) − x^(b)| )^α,  (28)

(a) Worst-user LLM HO delay vs. backhaul rate R_bh (Gbps).
(b) Worst-user LLM HO delay vs. prefill speed 1/a with R_bh = 4.5 Gbps.
(c) Worst-user LLM HO delay vs. the maximum cache size C_max with R_bh = 4.5 Gbps.
(d) Worst-user LLM HO delay vs. the number of UEs K with R_bh = 4.5 Gbps.
Fig. 3. Worst-user LLM HO delay comparison of ctHO, tHO, and cHO under different system parameters.

Fig. 4. Worst-user total streaming delay vs. distance between BSs with R_bh = 4.5 Gbps. The total streaming delay includes both the LLM HO delay and the subsequent streaming delay to deliver the G generated tokens after LLM HO.

where (γ_ref, d_ref, α) = (10 dB, 20 m, 3.5). The corresponding wireless token streaming rate for each token is given by

r_t^(b) ≜ W log_2(1 + γ_i^(b)(t)) / s_tok  [tokens/s],  (29)

where the token size is s_tok = 12 bits/token and W = 2 MHz. The total streaming delay measured from τ_i is defined as

D_i(L, r, G) ≜ T_i(L, r) − τ_i + D_i^(st)(G, r^(b)(t))  [s],  (30)

where D_i^(st)(G, r^(b)(t)) denotes the delay required to stream the newly generated tokens after LLM HO. In the no-HO case, the LLM HO delay vanishes, and the total delay reduces to D_i^(st)(G, r^(b)(t)). We report the worst-user total streaming delay D^(tot)(D_bs) ≜ max_i D_i(L, r, G) while sweeping D_bs.

As D_bs increases, the no-HO delay grows much faster than the HO-based methods because the degraded link increases the streaming delay. In particular, a larger D_bs leads to a more severe SNR drop on the link between the source BS and the UEs. Although no-HO achieves a smaller delay when D_bs < 375 m due to the absence of LLM HO delay, as D_bs increases, the HO-based methods yield lower delay. At D_bs = 500 m, the no-HO delay exceeds that of ctHO by more than 2.1 s. Overall, ctHO consistently achieves the smallest worst-user delay among the HO-based methods over the entire D_bs range.
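The link model in Eqs. (28)–(29) can be sketched as follows (function names are ours; parameter defaults follow the stated setup, with γ_ref given in dB):

```python
import math

def snr_linear(x_ue, x_bs, gamma_ref_db=10.0, d_ref=20.0, alpha=3.5):
    """Distance-based SNR, Eq. (28): gamma_ref * (d_ref / distance)^alpha."""
    gamma_ref = 10 ** (gamma_ref_db / 10)          # 10 dB -> linear factor 10
    return gamma_ref * (d_ref / abs(x_ue - x_bs)) ** alpha

def token_rate(x_ue, x_bs, W=2e6, s_tok=12.0):
    """Wireless token streaming rate in tokens/s, Eq. (29):
    Shannon rate W * log2(1 + SNR) divided by the token size in bits."""
    return W * math.log2(1 + snr_linear(x_ue, x_bs)) / s_tok
```
The streaming delay for G tokens then follows by integrating 1/r along the UE trajectory, which is how a larger D_bs degrades the no-HO curve in Fig. 4.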
V. CONCLUSION

This paper investigated mobility-aware Edge LLM HO with multi-UE token streaming, where the target BS must recover the KV cache. We proposed ctHO, a joint method that optimizes the prefill amount and the KV cache backhaul rate, and developed a step-wise solution that selects the prefill length and backhaul scheduling to minimize the worst-user LLM HO delay. Simulations show that ctHO consistently outperforms the baselines under the simulation setting. Future work includes extending the current hard LLM HO framework to a soft LLM HO setting, where the target BS performs pre-computation and the source BS continues decoding for more seamless service continuity.

REFERENCES

[1] H. Yet, N. T. Tam, M. V. Ngo, L. Y. Shen, L. Wei, J. Park, B. Chen, and T. Q. S. Quek, "SLA-aware distributed LLM inference across device-RAN-cloud," 2026. [Online]. Available: https://arxiv.org/abs/2602.23722
[2] S. Jang and R. Morabito, "Edge-first language model inference: Models, metrics, and tradeoffs," in 2025 IEEE 45th International Conference on Distributed Computing Systems Workshops (ICDCSW), 2025, pp. 309–314.
[3] AI-RAN Alliance, "AI-on-RAN: Enabling monetizable differentiated connectivity for AI," Tech. Rep., 2026. [Online]. Available: https://ai-ran.org/documents/AI-RAN-WG3-AI-on-RAN-Whitepaper.pdf
[4] L. Qiao, M. B. Mashhadi, Z. Gao, and D. Gündüz, "Token-domain multiple access: Exploiting semantic orthogonality for collision mitigation," 2025. [Online]. Available: https://arxiv.org/abs/2502.06118
[5] S. Lee, J. Park, J. Choi, and H. Park, "Low-complexity semantic packet aggregation for token communication via lookahead search," arXiv preprint arXiv:2506.19451, 2025.
[6] L. Qiao, M. B. Mashhadi, Z. Gao, R. Tafazolli, M. Bennis, and D.
Niyato, "Token communications: A large model-driven framework for cross-modal context-aware semantic communications," IEEE Wireless Communications, vol. 32, no. 5, pp. 80–88, 2025.
[7] Y. Liu, H. Li, Y. Cheng, S. Ray, Y. Huang, Q. Zhang, K. Du, J. Yao, S. Lu, G. Ananthanarayanan et al., "CacheGen: KV cache compression and streaming for fast large language model serving," in Proceedings of the ACM SIGCOMM 2024 Conference, 2024, pp. 38–56.
[8] T. Fu, Z. Min, H. Zhang, J. Yan, G. Dai, W. Ouyang, and Y. Wang, "Cache-to-cache: Direct semantic communication between large language models," arXiv preprint arXiv:2510.03215, 2025.
[9] Z. Chen, Z. Li, H. H. Yang, T. Quek, and J. Park, "Federated inference for heterogeneous LLM communication and collaboration," in AAAI 2026 Workshop on Machine Learning for Wireless Communication and Networks (ML4Wireless), 2026.
[10] Z. Zheng, X. Ji, T. Fang, F. Zhou, C. Liu, and G. Peng, "BatchLLM: Optimizing large batched LLM inference with global prefix sharing and throughput-oriented token batching," arXiv preprint, 2024.
[11] R. Neetu, G. Ghatak, V. A. Bohara, and A. Srivastava, "Performance analysis of cache-enabled handover management for vehicular networks," IEEE Transactions on Network Science and Engineering, vol. 11, no. 1, pp. 1151–1164, Oct. 2023.