Energy Efficient Orchestration in Multiple-Access Vehicular Aerial-Terrestrial 6G Networks



Mohammad Farhoudi¹, Hamidreza Mazandarani², Masoud Shokrnezhad³, Tarik Taleb², and Ignacio Lacalle⁴
¹Oulu University, Finland; mohammad.farhoudi@oulu.fi
²Ruhr University Bochum (RUB), Germany; {hamidreza.mazandarani, tarik.taleb}@rub.de
³ICTFICIAL Oy, Espoo, Finland; masoud.shokrnezhad@ictficial.com
⁴Universitat Politècnica de València, Spain; iglaub@upv.es

Abstract: The proliferation of users, devices, and novel vehicular applications, propelled by advancements in autonomous systems and connected technologies, is precipitating an unprecedented surge in novel services. These emerging services require substantial bandwidth allocation, adherence to stringent Quality of Service (QoS) parameters, and energy-efficient implementations, particularly within highly dynamic vehicular environments. The complexity of these requirements necessitates a fundamental paradigm shift in service orchestration methodologies to facilitate seamless and robust service delivery. This paper addresses this challenge by presenting a novel framework for service orchestration in Unmanned Aerial Vehicle (UAV)-assisted 6G aerial-terrestrial networks. The proposed framework synergistically integrates UAV trajectory planning, Multiple-Access Control (MAC), and service placement to facilitate energy-efficient service coverage while maintaining ultra-low latency communication for vehicular user service requests. We first present a non-linear programming model that formulates the optimization problem. Next, to address the problem, we employ a Hierarchical Deep Reinforcement Learning (HDRL) algorithm that dynamically predicts service requests, user mobility, and channel conditions, addressing the challenges of interference, resource scarcity, and mobility in heterogeneous networks.
Simulation results demonstrate that the proposed framework outperforms state-of-the-art solutions in request acceptance, energy efficiency, and latency minimization, showcasing its potential to support the high demands of next-generation vehicular networks.

Index Terms: Service orchestration, Service placement, Predictive resource allocation, Hierarchical DRL, Multi time-scale optimization, and 6G aerial-terrestrial networks.

I. INTRODUCTION

The rapid proliferation of connected autonomous vehicles and associated devices, such as on-board units and short-range communication transceivers, is driving unprecedented growth in both technological sophistication and deployment volume. Technological advancements have enabled vehicular User Equipment (UEs) to integrate with diverse vehicular services, including object detection, traffic analysis, and high-precision cartographic updates [1], playing a pivotal role in augmenting vehicular functionality and enhancing aspects such as road safety protocols [2]. However, the exponential growth in deployment density has precipitated unprecedented traffic demands [3], a phenomenon amplified by the cumulative effect of each additional vehicle simultaneously transmitting increasingly data-intensive sensor readings, high-definition video streams, and telemetry information [4]. This multifaceted data proliferation renders the provision of continuous service access for UEs a critical technical challenge [5].

Vehicular services are architected through the integration of multiple fundamental functions, each executing a discrete task. For instance, real-time traffic monitoring represents a paradigmatic composed service, synthesizing distinct functional components including vehicle velocity monitoring, safety-critical message dissemination, and traffic density quantification. These constituent functions operate in parallel and synergistically to deliver comprehensive service functionality.
Such composition necessitates adherence to stringent Quality of Service (QoS) parameters, with ultra-low End-to-End (E2E) latency emerging as the predominant requirement for real-time communications [6], [7]. These exacting performance criteria present significant implementation challenges, as contemporary service orchestration approaches demonstrate inadequate capability to consistently maintain uninterrupted connectivity while satisfying the requisite E2E latency thresholds [8].

The evolution of networks has enabled advanced solutions tailored to emerging vehicular services, among which the vehicular edge-cloud continuum has emerged as a promising paradigm for real-time vehicular services. By leveraging resource-constrained edge nodes such as Roadside Units (RSUs) as servers to deliver services to users [9], the continuum eases computational loads for users and enhances their experience [10]. One of the primary challenges in effective orchestration within the continuum is optimizing service placement, which involves selecting the most suitable functions for UE requests while jointly allocating computing and networking resources. This approach promotes resource sharing and maintains a deterministic system to ensure requests are met according to their latency requirements [11]. However, meeting stringent QoS requirements during high-velocity vehicular mobility remains challenging [12]. Moreover, mobility induces spatiotemporal heterogeneity in service demand, creating localized congestion where UE demands exceed edge node capacities. These challenges necessitate innovative strategies that address both the stochastic nature of vehicular traffic patterns and the limitations of conventional terrestrial infrastructure.

Unmanned Aerial Vehicles (UAVs) have significant potential to enhance the edge-cloud continuum's capabilities in service delivery. Conventional terrestrial networks, characterized by sparse distribution, often struggle to maintain consistent connections, especially on busy roads and during peak traffic hours. In this context, aerial-terrestrial networks, which are cost-effective and flexible, can be employed to provide prompt responses in demanding environments [13]. UAVs serve as aerial base stations and edge servers to deliver high-bandwidth services to ground-based UEs. Moreover, UAVs are considered integral components of the upcoming 6G landscape, playing a crucial role in the envisioned ubiquitous connectivity that supports bandwidth-intensive and real-time vehicular applications. However, as UAVs traverse diverse routes and engage with multiple UEs while managing a variety of computing requests, trajectory planning optimization becomes essential to ensure service coverage in UAV-assisted vehicular networks [14], [15]. Furthermore, the variability of time-dependent channel dynamics, resulting from vehicles' high mobility, presents a challenge for maintaining E2E latency in continuous service delivery [16]. This necessitates the development of efficient trajectory and resource planning, as well as Multiple-Access Control (MAC) schemes, to mitigate mutual interference in a shared spectrum environment [17], [18].

Extant literature has advanced UAV-assisted vehicular networks; however, these approaches predominantly employ reactive mechanisms, artificially decouple the optimization of resource planning and MAC from trajectory planning, and insufficiently address composed service orchestration, collectively constraining system scalability and adaptability in dynamic environments.

(The manuscript is accepted for publication in IEEE Transactions on Vehicular Technology. Copyright © 2026 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.)
To fill this gap, we propose a novel service orchestration framework for vehicular aerial-terrestrial 6G networks that integrates MAC, UAV trajectory optimization, and composed service placement within a heterogeneous edge-cloud continuum where both RSUs and UAVs function as communication interfaces for UEs. The main contributions and novelties of this paper are outlined as follows:

• A multi-time-scale Mixed Integer Non-Linear Programming (MINLP) formulation to optimize coverage and energy consumption of vehicular composed requests under E2E latency requirements in aerial-terrestrial networks.

• A decomposition of the problem into multi-UAV trajectory planning, MAC, and composed service placement for complexity reduction. To the best of our knowledge, this is the first work to consider these interconnected aspects while managing user interference on shared channels.

• A predictive Hierarchical Deep Reinforcement Learning (HDRL) framework that combines DRL with a Bayesian algorithm to enhance the accuracy of request and channel-quality prediction. The HDRL framework balances long-term and short-term objectives by facilitating interactions between the trajectory planning, MAC, and service placement modules, considering their distinct dynamics and time scales to achieve a globally optimal solution.

The upcoming sections of the paper are organized as follows. Section II provides a detailed overview of existing works and their limitations. Section III details the system model, presenting the fundamental elements and their interactions. The problem formulation is introduced in Section IV. Section V elaborates on the proposed method, with a detailed explanation of its design and implementation. Subsequently, Section VI presents the simulation settings, analyzes the convergence, and compares the proposed method's performance against baseline approaches. Finally, Section VII encapsulates the study's key insights and future research directions.
II. RELATED WORKS

The field of service orchestration in the vehicular edge-cloud continuum, vital for enabling next-generation vehicular applications, has witnessed advancements in recent years. Research in this domain addresses diverse dimensions such as the transition from single-UAV to multi-UAV scenarios, the evolution from heuristic approaches to adaptive and learning algorithms, and the shift from isolated challenges to complex, integrated problems, including joint trajectory planning and service provisioning [19]. Table I provides a summary of the literature, offering a comparative analysis of existing works.

Given UAV mobility, the literature has focused on trajectory planning [20] or on assuming deterministic, predefined mobility patterns [21]. There is increasing attention to trajectory planning solutions that leverage UAVs' agility for rapid deployment, reliable Line-of-Sight (LoS) connectivity, and the flexibility to adapt their coverage areas [22]. In this regard, Santos et al. [23] proposed deploying multiple UAVs in underserved regions to ensure low-latency service delivery for mobile users. Similarly, Wei et al. [20] tackled UAV trajectory planning while accounting for physical and environmental obstacles.

Advancements in UAV mobility have spurred research on integrating joint trajectory planning and channel selection to optimize aerial-terrestrial networks with limited interfaces, as evidenced by Nabi et al. [24], who highlighted key challenges in aerial edge computing, including real-time adaptability and connectivity management for reliable communication. Consequently, some studies addressed these challenges by incorporating non-orthogonal multiple access in edge-cloud environments [25]. Pervez et al. [26] proposed an iterative algorithm for user association, channel power allocation, and segment-based UAV trajectories in integrated aerial-terrestrial networks for smart vehicular services.
Qin et al. [27] introduced a cluster-based air-ground integrated network with UAVs for access and high-altitude platforms for backhaul, optimizing UAV trajectories and subchannel selection to enhance energy and spectrum efficiency. Further, other works focused on joint scheduling and channel selection in the edge-cloud environment, accounting for dynamic channel variations to minimize latency and energy consumption [28]. Huang et al. [29] addressed the integration of satellite communications and aerial platforms through a DRL-based approach, optimizing both channel selection and trajectory planning. Qi et al. [30] extended this line of research by proposing an energy-efficient framework combining content placement, spectrum allocation, co-channel pairing, and power control, improving channel selection and system performance.

Table I. Summary of existing schemes and comparison based on considerations in trajectory planning, MAC, and service placement.

| Reference | Objective Function | Algorithm | Multi-UAV | UEs Mobility | Trajectory Planning | Multiple Access | Service Placement |
|---|---|---|---|---|---|---|---|
| He et al. [31] | Maximize acceptance & enhance energy | Actor-Critic & Q-learning | ✓ | Random | ✓ | – | ✓ |
| SAC-TORA [32] | Minimize energy consumption | Soft Actor-Critic | ✓ | – | ✓ | – | ✓ |
| Gupta et al. [33] | Max-min aggregate throughput | SCA optimization | ✓ | – | ✓ | – | – |
| Qin et al. [34] | Minimize energy consumption | PMADDPG | ✓ | – | ✓ | ✓ | ✓ |
| Li et al. [35] | Provisioning rate | GNN-DRL | – | – | ✓ | – | – |
| Muto et al. [15] | Minimize computational costs | Multi-agent DRL | ∼ | – | ✓ | – | ✓ |
| Dutriez et al. [36] | Maximize energy efficiency | Deep Q-Network | – | – | – | ∼ | – |
| FL-SNTD3 [37] | Provisioning rate & latency | Deep federated learning | ✓ | – | ✓ | – | – |
| DM-SAC-H [29] | Minimize energy consumption & latency | Soft Actor-Critic | ✓ | – | ✓ | – | ✓ |
| HaDDQN [30] | Energy efficiency | HaDDQN | ✓ | Random | ∼ | ✓ | ✓ |
| Wei et al. [20] | Service execution success rate | Deep Q-Network | – | – | ✓ | – | ✓ |
| SCOFT [38] | Minimize energy consumption | Hierarchical DRL (HDRL) | ✓ | Random | ✓ | – | ∼ |
| Proposed Solution | Maximize coverage & optimize energy | Hierarchical DRL (HDRL) | ✓ | Predictive | ✓ | ✓ | ✓ |

SCA: Successive Convex Approximation; PMADDPG: Probabilistic Multi-Agent Deep Deterministic Policy Gradients.

Some studies in the literature have examined service provisioning and trajectory planning together for non-terrestrial networks. He et al. [31] used an online DRL approach to investigate the interplay between continuous UAV trajectory planning and discrete service deployment actions. Le et al. [35] addressed optimization challenges in UAV-assisted edge networks, focusing on UAV trajectory and service provisioning in dynamic environments, using Graph Neural Networks (GNNs) to optimize UAV speed, heading, and service deployment. Ning et al. [15] developed a framework for UAV trajectory design that considers users' computational tasks and probabilistic service preferences. By facilitating decentralized trajectory optimization, they sought to minimize computational costs while maximizing service efficiency. Additionally, Li et al. [32] explored a multi-UAV-enabled orchestration scheme for heterogeneous services, utilizing collaborative capabilities to minimize overall energy consumption in the system.

Despite innovative strategies, existing research on service orchestration in UAV-assisted networks still exhibits noteworthy limitations. Approaches assuming static users often fail to provide timely responses in highly dynamic environments [20], [33], [34]. Several studies overlook the challenges of orchestrating composed services with diverse functions and shared resources across multiple users, which constrains scalability and flexibility [33], [35]–[37].
Current solutions are predominantly reactive, adapting UAV trajectories and deployment only after receiving feedback, underscoring the need for proactive orchestration that anticipates future conditions such as user mobility and demand. For instance, SCOFT [38] optimized UAV trajectory and service placement for energy efficiency; however, its decisions remain reactive and are based solely on instantaneous system states, whereas we explicitly integrate predictive models of user mobility and service demand. While some works consider communication links between network elements [30], [34], the complexities of dynamically sharing spectrum or considering channel qualities for UAV-assisted service orchestration remain unexplored. For example, HaDDQN does not predict channel dynamics and instead models them as stochastic processes, resulting in a reactive orchestration strategy that cannot anticipate future conditions. Likewise, Qin et al. [27] optimized decisions at a single control layer using only the current network state, without explicitly forecasting future demand. Recent advances in wireless techniques [39], [40] improve spectral efficiency but are limited to communication-layer offloading and do not jointly address trajectory planning and composed service orchestration. Finally, the coupled optimization of trajectory planning, MAC, and service placement, each significantly influencing the others, has yet to be holistically addressed, which is essential for next-generation vehicular 6G service orchestration.

Figure 1. System Model: Supporting UEs through RSUs and UAVs with service coverage, quality channel access, and deployed function availability. (The figure depicts UAV trajectories, high-speed mobile UEs, a flying UAV, fixed RSUs, coverage areas, interference links, functions deployed on network nodes, and examples of rejected UE requests: out of communication range, requested service functions not deployed, or poor channel quality in the area.)

III. SYSTEM MODEL

This section describes the system's detailed structures: network architecture, services, and interactions between the UEs, RSUs, and UAVs across the network, as shown in Fig. 1.

A. Vehicular Network Architecture

In this paper, a tiered aerial-terrestrial network is investigated, denoted by $G(\mathcal{N}, \mathcal{L}, \mathcal{P})$, which encompasses the coexistence of vehicular networks and the edge-cloud continuum. The network integrates $N$ computing nodes, such as core nodes, RSUs, and UAVs, each equipped with networking capabilities. The nodes are connected through links to deliver services over designated areas during uniform time frames, indexed by $t$. Nodes close to UEs provide limited computation capabilities at a high cost, whereas core nodes possess significant computing power with lower resource expenses [41]. Each node $n$ is characterized by its processing capability $p^C_n$ and its energy budget $E_n$. Wired and wireless links connect the different network elements, the set of which is denoted by $\mathcal{L}^t \subset \{l : (n, n') \mid n, n' \in \mathcal{N}\}$.¹ Each link $l$ is associated with a bandwidth capacity $p^L_l$ and transmission energy consumption $\xi_l$. Packets associated with request $r$ that traverse link $l$ experience latencies at time frame $t$ determined using a function denoted as $D^t_{r,l}$, which is deterministically computed as a function of the current network state, link length, and link load.
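To make the notation above concrete, the tiered network $G(\mathcal{N}, \mathcal{L}, \mathcal{P})$ can be sketched with simple data structures; the topology, the attribute values, and the toy latency function below are illustrative assumptions, not taken from the paper, which leaves $D^t_{r,l}$ abstract.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    # A computing node n with processing capability p^C_n and energy budget E_n.
    name: str
    p_c: float
    energy: float

@dataclass(frozen=True)
class Link:
    # A link l = (n, n') with bandwidth capacity p^L_l and transmission energy xi_l.
    head: str
    tail: str
    p_l: float
    xi: float

# Illustrative tiered topology: one core node, one RSU, one UAV.
nodes = {n.name: n for n in (
    Node("core", p_c=100.0, energy=500.0),
    Node("rsu1", p_c=10.0, energy=50.0),
    Node("uav1", p_c=5.0, energy=20.0),
)}
path = [Link("uav1", "rsu1", p_l=100.0, xi=0.5),
        Link("rsu1", "core", p_l=1000.0, xi=0.2)]

def path_latency(links, d):
    # E2E latency of a path as the sum of per-link latencies D^t_{r,l},
    # computed by a caller-supplied deterministic latency function d.
    return sum(d(link) for link in links)

# Toy latency model: inversely proportional to link bandwidth.
latency = path_latency(path, lambda link: 1.0 / link.p_l)
```

Any deterministic $D^t_{r,l}$ (e.g., one that also weighs link length and load) can be swapped in as the function `d` without changing the path abstraction.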
Based on the links available at each time frame, a set of paths, $\mathcal{P}^t$, is available for packet transmissions. Nodes forming a particular path $p$ are represented as $\mathcal{N}_p$, where $H^t_p$ and $T^t_p$ are the head and tail nodes, respectively, with $J^t_{p,l}$ denoting the inclusion of link $l$ in path $p$ in time frame $t$.

B. Services

Services are characterized by outlining essential aspects, including their functions, data model, and data graph [42]. Each composed service $s \in \mathcal{S} = \{1, 2, ..., S\}$ is segmented into functions, denoted as $\mathcal{F}_s = \{1, 2, ..., F_s\}$, where each function is implemented by virtual instances. The data model accounts for the complex interdependencies between these functions, ensuring that service execution starts with the initial function and proceeds with sequential or parallel execution of subsequent functions. The data graph, $G_s$, outlines the structure of the composed service, including inputs, outputs, preconditions, and results, along with the total service duration time $\overrightarrow{T}_s$.

C. Vehicular User Equipment

The set of vehicular UEs is defined as $\mathcal{U} = \{1, 2, \ldots, U\}$. UEs generate requests at various time intervals for vehicle-to-everything communication, enabling them to transmit requests via appropriate wireless communication protocols in heterogeneous networks. The first node a UE connects to is known as the Point of Attachment (PoA), through which UE requests are handled to reach desired services. Requests $r \in \{1, \ldots, R\}$ are sent by UEs, and $u_r$ identifies who generates request $r$. Each request $r$ enters the system at time $T'_r$, requesting a composed service $S_r$ over a predefined duration, which means the request should be processed within $\Delta_r = [T'_r, T'_r + \overrightarrow{T}_s]$. In each time frame, the number of active requests may vary due to factors like mobility patterns and bandwidth-saving strategies.
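The admission window $\Delta_r = [T'_r, T'_r + \overrightarrow{T}_s]$ determines in which time frames a request is active; a minimal sketch of that bookkeeping follows, with all request identifiers and durations fabricated for illustration.

```python
def active_requests(requests, t):
    # Request r is active in frame t iff t lies in Delta_r = [T'_r, T'_r + T_s].
    return sorted(r for r, (t_entry, t_service) in requests.items()
                  if t_entry <= t <= t_entry + t_service)

# Hypothetical requests: id -> (entry frame T'_r, service duration T_s).
reqs = {"r1": (0, 3), "r2": (2, 2), "r3": (5, 1)}
print(active_requests(reqs, 2))  # r1 (frames 0..3) and r2 (frames 2..4) cover t=2
```

This is exactly the quantity that varies per frame in the model: the orchestrator only needs to place functions and assign channels for the requests returned here.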
Requests come with specific requirements, including network bandwidth $q^L_r$, each atomic function's minimum capacity $q^I_{r,f}$, and E2E latency $q^D_r$ [43]. Successful service delivery entails meeting the capacity and QoS requirements.

D. Network Areas

The network is divided into areas denoted by $\mathcal{A} = \{1, 2, ..., A\}$ to ensure comprehensive coverage. The dimensions of areas vary based on geographical factors like obstacles. This abstraction is adopted for analytical and computational tractability: UAV movement is represented at the area level rather than as a fully continuous flight trajectory, with UAVs repositioning or hovering within areas to serve users. The abstraction is known to capture the dominant mobility effects with negligible loss of accuracy at the considered time scale [44]. UEs exhibit dynamic behavior by frequently moving between areas, while being assumed to remain within a single area during each time frame to simplify movement modeling. The presence of the UE issuing request $r$ in area $a$ at time frame $t$ is indicated by $A^t_{u_r,a}$. Computing nodes are located in various parts of the network, with core nodes and RSUs placed in fixed areas and UAVs moving between areas. The energy consumption of a UAV traveling between areas $a_1$ and $a_2$ is given by $\Lambda(a_1, a_2)$ in Eq. (1), which captures hovering and propelling energy [45]. The hovering power depends on the travel duration $\Delta_{a_1,a_2}$ between $a_1$ and $a_2$, the induced power coefficient $I$, the UAV's total weight $W_n$, the air density $\varphi$, and the rotor disk area $\upsilon_r$. The movement power accounts for aerodynamic drag, where $\varsigma$ and $\upsilon_f$ represent the drag coefficient and frontal area, and $V_w(t)$ is the UAV's instantaneous velocity.

¹ Mobility of both UEs and UAVs leads to a dynamically changing network topology, and UEs must stay within the coverage area of an RSU or UAV to maintain a connection; otherwise, links between them become unavailable.
$$\Lambda(a_1, a_2) = \int_0^{\Delta_{a_1,a_2}} \left( \frac{I \cdot W_n^{3/2}}{\sqrt{2 \cdot \varphi \cdot \upsilon_r}} + \frac{1}{2} \cdot \varsigma \cdot \varphi \cdot \upsilon_f \cdot V_w(t)^3 \right) dt \quad \text{(1)}$$

E. Wireless Channel Model

The wireless channel model is defined with a focus on uplink transmissions, wherein UE requests are dynamically scheduled to minimize the collision probability in each time frame through effective channel assignment. Each time frame $t$ is subdivided into $T^t$ smaller time slots to enable fine-grained MAC. The set of available channels is denoted by $\mathcal{C} = \{1, 2, ..., C\}$, and each channel $c$ is associated with a specific energy consumption $M_c$ and channel quality $q^{Q\tau}_{c,a}$ in area $a$, as energy requirements vary with operating frequency and path loss. To mitigate collisions resulting from simultaneous transmissions over the same channel in different locations, our model accommodates multiple uplink channels and allows channel reuse in neighboring areas. Meanwhile, we assume that the downlink channels used for service responses to UEs are collision-free, ensuring reliable response transmission.

UE and UAV movements lead to time-varying channel conditions, where the absence of a LoS link can degrade E2E transmission quality. To model this, $\theta^{LoS}_a$ is defined, representing the likelihood of establishing a LoS connection in area $a$, determined by environmental density, UAV altitude, and weather conditions [22], [46]. When a LoS connection exists, transmission quality is assumed to be ideal; otherwise, under Non-LoS (NLoS) conditions, occurring with probability $1 - \theta^{LoS}_a$, signal quality is governed by an instantaneous channel gain $H^\tau_{c,a}$ that incorporates path loss, small-scale fading, and weather-dependent absorption $\Omega^\tau$, expressed as Eq. (2). In this expression, $R^\tau_{c,a}$ and $\chi^\tau_{c,a}$ denote the Rayleigh and lognormal components, $D_0$ is the reference distance, and $d^\tau_{c,a}$ denotes the transmitter-receiver separation on channel $c$ in area $a$ at time slot $\tau$.
Also, the parameters $(\nu_s, \eta_s(\Omega^\tau))$ correspond to the path loss exponent and weather-dependent shadowing factor under LoS/NLoS propagation, and $\zeta(\Omega^\tau)$ models atmospheric attenuation such as fog or rain. The received signal-to-noise ratio (SNR) for the channel is given by Eq. (3), where $P_{tx}$ represents the transmit power, $N_0$ is the thermal noise spectral density, and $\Delta f$ is the subcarrier spacing. Following real-world 6G multiple-access modeling [36], a transmission is considered successful if $\gamma^\tau_{c,a}$ exceeds a predefined quality threshold $p^Q$ (determined by QoS requirements). Accordingly, the resulting binary channel-quality indicator $q^{Q\tau}_{c,a}$ equals 1 for successful transmissions (either through LoS or when the NLoS SNR is acceptable) and 0 when signal degradation becomes excessive,² thereby capturing the essential dynamics of UAV communication reliability under variable weather conditions.

$$H^\tau_{c,a} = R^\tau_{c,a} \cdot 10^{\chi^\tau_{c,a} \cdot \eta_s(\Omega^\tau)/10} \cdot \left( D_0 / d^\tau_{c,a} \right)^{\nu_s} \cdot 10^{-\zeta(\Omega^\tau)/10} \quad \text{(2)}$$

$$q^{Q\tau}_{c,a} = \begin{cases} 1, & \text{w.p. } \theta^{LoS}_a, \\[4pt] \mathbb{1}\!\left( \gamma^\tau_{c,a} = \dfrac{P_{tx} \cdot H^\tau_{c,a}}{N_0 \cdot \Delta f} \ge p^Q \right), & \text{w.p. } 1 - \theta^{LoS}_a \end{cases} \quad \text{(3)}$$

² Intuitively, $\mathbb{1}(C)$ equals one if the condition $C$ is satisfied.

IV. PROBLEM FORMULATION

In this section, the formulation of a MINLP optimization problem, termed energy-Aware muLtipLe-access service OrChestration for vehicular Aerial-TErrestrial networks (ALLOCATE), is presented. The problem pertains to optimizing energy-efficient service coverage through the placement of requested service functions on network nodes, the allocation of a specific set of functions to each request based on its service graph, the assignment of a channel and path for each request to facilitate data delivery from its PoA to the corresponding functions, and the subsequent return of data to the originating PoA. Table II shows the notations used in ALLOCATE.

Table II. List of notations used in the problem formulation.

| Notation | Description |
|---|---|
| $G(\mathcal{N}, \mathcal{L}, \mathcal{P})$ | Vehicular aerial-terrestrial edge-cloud network |
| $G_s$ | Service $s$ data graph |
| $t \in \mathcal{T}$ | Time frame (of total service time) |
| $\tau \in \mathcal{T}^t$ | Time slot (of time frame $t$) |
| $\overrightarrow{T}_s$ | Total time for delivering service $s$ |
| $T'_r$ | Entry time of UE request $r$ |
| $q^T_r$ | Minimum required time slots to send request $r$ |
| $\mathcal{N}$ / $\mathcal{A}$ | Set of network nodes / predefined areas |
| $\mathcal{L}^t$ / $\mathcal{P}^t$ | Set of (wireless links / active paths) at time $t$ |
| $\mathcal{U}$ / $\mathcal{R}$ / $\mathcal{C}$ | Set of (active UEs / requests / uplink channels) |
| $p^C_n$ / $E_n$ | (Processing / energy consumption) of node $n$ |
| $p^L_l$ / $\xi_l$ | (Bandwidth capacity / transmission energy) of link $l$ |
| $H^t_p$ / $T^t_p$ | Head/tail node of path $p$ at time frame $t$ |
| $J^t_{p,l}$ | Inclusion of link $l$ in path $p$ at time frame $t$ |
| $D^t_{r,l}$ | Latency experienced by request $r$ over link $l$ at time $t$ |
| $f \in \mathcal{F}_s$ | Atomic function $f$ (of the functions set) |
| $u_r$ / $S_r$ | (UE who sends / composed service of) request $r$ |
| $A^t_{u,a}$ | Indicator for UE $u$ located in area $a$ at time $t$ |
| $M_c$ | Energy consumption for using uplink channel $c$ |
| $q^{Q\tau}_{c,a}$ | Quality of uplink channel $c$ in area $a$ at time slot $\tau$ |
| $q^L_r$ | Network bandwidth required for request $r$ |
| $q^I_{r,f}$ | Minimum capacity required for function $f$ of request $r$ |
| $q^D_r$ | Latency requirement for request $r$ |
| $\tilde{X}^t_{r,f}$ | If function $f$ of request $r$ is selected at time frame $t$ |
| $\tilde{Y}^t_{f,n}$ | If function $f$ is placed on node $n$ at time frame $t$ |
| $\tilde{Z}^\tau_{r,c}$ | If channel $c$ is selected for request $r$ at time slot $\tau$ |
| $\tilde{S}^t_{n,a}$ | If node $n$ is deployed in area $a$ at time frame $t$ |
| $\tilde{B}^t_{u,n}$ | If UE $u$ is connected (binned) to node $n$ at time frame $t$ |
| $\overrightarrow{R}^t_{r,p}$ | If path $p$ is selected for request $r$ at time frame $t$ |

A. Objective Function

The objective function is formulated to maximize request acceptance, ensuring comprehensive service coverage, while minimizing energy consumption (OF). A scaling factor $\alpha$ is introduced to balance the trade-off between energy consumption and request acceptance, adjusting the relative importance of the two factors to ensure an optimal solution. It not only aligns with theoretical optimization goals but also reflects the practical constraints and behaviors of 6G aerial-terrestrial networks [46]. The total energy consumption $W$ (C1) accounts for the prioritization of nodes, from edge to cloud, with varying computational capabilities, communication channels, and links, as well as the energy required for UAVs to traverse between candidate areas (incorporating propulsion dynamics derived from the aerodynamic energy model of Eq. (1)). For a request to be satisfied, all required functions should be deployed during the request's duration (C2). The selection of key components within the system is governed by binary decision variables: $\tilde{X}^t_{r,f}$ indicates whether request $r$ is served by function $f$, $\tilde{Y}^t_{f,n}$ denotes the hosting node $n$ for function $f$, and $\tilde{Z}^\tau_{r,c}$ represents the selection of channel $c$ for request $r$. Also, $\tilde{S}^t_{n,a}$ specifies the candidate area $a$ for network node $n$ (RSU areas are always fixed), $\tilde{B}^t_{u,n}$ represents the PoA of UE $u$ at time frame $t$ as it moves, and $\overrightarrow{R}^t_{r,p}$ determines whether path $p$ is selected to send request $r$'s packets to the deployed nodes and receive the response.

ALLOCATE:
$$\max \; \sum_{\mathcal{R}} \tilde{X}_r - \alpha \cdot W \quad \text{s.t. C1--C12} \quad \text{(OF)}$$

$$W \triangleq \sum_{\mathcal{F}_s, \mathcal{N}, \mathcal{T}} \tilde{Y}^t_{f,n} \cdot E_n + \sum_{\mathcal{N}, \mathcal{A}, \mathcal{T}} \Lambda\!\left( \tilde{S}^{t+1}_{n,a_1} - \tilde{S}^t_{n,a_2} \right) + \sum_{\mathcal{L}^t, \mathcal{P}^t, \mathcal{R}, \Delta_r} \xi_l \cdot J^t_{p,l} \cdot \overrightarrow{R}^t_{r,p} + \sum_{\mathcal{R}, \mathcal{C}, \mathcal{T}, \mathcal{T}^t} \tilde{Z}^\tau_{r,c} \cdot M_c \quad \text{(C1)}$$

$$\tilde{X}_r = \prod_{\mathcal{F}_{s_r}, \Delta_r} \tilde{X}^t_{r,f} \quad \forall r \in \mathcal{R} \quad \text{(C2)}$$
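The stochastic channel-quality indicator of Eq. (3), which gates request acceptance through the channel constraints, can be read as a two-stage sampler: with probability $\theta^{LoS}_a$ the slot is ideal; otherwise fading and shadowing are drawn and the resulting SNR is thresholded against $p^Q$. A minimal Monte Carlo sketch follows; all parameter values are fabricated for illustration and the fading draws are one conventional reading of the Rayleigh/lognormal terms, not the paper's calibration.

```python
import random

def channel_gain(r_fade, chi, eta, nu, zeta, d0=1.0, d=50.0):
    # NLoS gain H of Eq. (2): fading x lognormal shadowing
    # x distance path loss x weather-dependent absorption.
    return r_fade * 10 ** (chi * eta / 10) * (d0 / d) ** nu * 10 ** (-zeta / 10)

def q_quality(theta_los, p_tx, n0, delta_f, p_q, rng):
    # Binary indicator q^Q of Eq. (3).
    if rng.random() < theta_los:        # LoS: transmission assumed ideal
        return 1
    r_fade = rng.expovariate(1.0)       # squared Rayleigh amplitude (exponential)
    chi = rng.gauss(0.0, 1.0)           # dB-domain Gaussian -> lognormal shadowing
    h = channel_gain(r_fade, chi, eta=4.0, nu=2.5, zeta=3.0)
    snr = p_tx * h / (n0 * delta_f)     # gamma of Eq. (3)
    return 1 if snr >= p_q else 0

rng = random.Random(0)
trials = [q_quality(0.6, p_tx=1.0, n0=1e-9, delta_f=1e5, p_q=5.0, rng=rng)
          for _ in range(10_000)]
success_rate = sum(trials) / len(trials)
# With this harsh NLoS threshold, the rate stays close to theta_los = 0.6.
```

A predictive orchestrator would replace this unconditional sampler with a forecast of $q^{Q\tau}_{c,a}$, which is precisely where the paper's Bayesian prediction component enters.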
B. Constraints

The constraints ensure: (1) appropriate channel allocation within network areas (MAC protocol); (2) optimal service deployment on network nodes with efficient packet routing from PoAs to service nodes (service placement); and (3) dynamic UAV adjustment to accommodate active service requests (trajectory planning). All optimization processes simultaneously satisfy capacity limitations and QoS requirements. Notably, the system operates on multiple time scales: time frames (denoted by $t$) and time slots (denoted by $\tau$), with each time frame comprising $T^t$ time slots. All resource allocation tasks, except for channel selection, operate at time frame granularity, while channel selection functions at the higher frequency of time slots.

Channel Selection: Efficient service delivery in a multiple-access environment necessitates avoiding simultaneous transmissions over the same channel in an area to prevent collisions. For each UE, no more than one channel should be selected for transmitting its request (C3). Other UEs within the same area should be prevented from using the same channel at the same time slot (C4). This is vital due to the limited available channels of sufficient quality ($q^{Q\tau}_{c,a}$), derived from the wireless channel model of Eq. (3). It also reflects UAVs' inability to handle multiple channels simultaneously, maintaining real-world consistency between link admission and channel quality. Requests are sent through the channels of the nodes to which UEs are directly connected, so the selected transmission channel affects the energy consumption $W$.

$$\sum_{\mathcal{C}} \tilde{Z}^\tau_{r,c} \le 1 \quad \forall r, \tau \in \mathcal{R}, \bigcup_{t \in \Delta_r} \mathcal{T}^t \quad \text{(C3)}$$

$$\sum_{\mathcal{R}} \tilde{Z}^\tau_{r,c} \cdot A^t_{u_r,a} \le 1 \quad \forall c, a, \tau \in \mathcal{C}, \mathcal{A}, \bigcup_{t \in \mathcal{T}} \mathcal{T}^t \quad \text{(C4)}$$

Function Placement: When dealing with composed services, it is necessary to consider the deployment of the various functions of those services.
Thus, each function targeted by at least one request should be deployed on an available network node for the duration of the request (C5). Moreover, each request $r$ should be assigned to the appropriate functions based on its required service $F_{s_r}$, only if it is transmitted on a quality channel not used by other UEs during transmission (C6). This constraint, along with C4, assesses whether a UE's transmissions over quality channels meet the minimum required number of time slots ($\hat{T}_r$). If they do, the service can be provided; otherwise, the variable $\tilde{X}^t_{r,f}$ is set to zero (the request is not accepted).

$\sum_{N} \tilde{Y}^t_{f,n} \geq \left( \sum_{R} \tilde{X}^t_{r,f} \right) / |R| \quad \forall f \in F_s,\ t \in T$ (C5)

$\tilde{X}^t_{r,f} \leq \left( \sum_{C, A, T_t} \tilde{Z}^\tau_{r,c} \cdot \hat{Q}^\tau_{c,a} \cdot A^t_{u_r,a} \right) / \hat{T}_r \quad \forall r \in R,\ f \in F_{s_r},\ t \in \Delta_r$ (C6)

Path Selection: For the effective transmission of inquiry traffic from a UE to its designated nodes, where the requested service is deployed, and the subsequent return of the response, feasible E2E routes should be provided. A unique inquiry path $\vec{R}^t_{r,p}$ is established for each request, originating at the UE's PoA ($\tilde{B}^t_{u,n}$). Considering UE mobility, the response is directed to the PoA corresponding to the location where the UE will be present when the request duration concludes. The requested service's functions are interconnected based on $G_{s_r}$ to reach their final destination, with the last function sending the response to $u_r$ (C7).

$\sum_{\substack{p \in P_t:\ H^t_p = \sum_{N} n \cdot \tilde{B}^{T'_r}_{u_r,n},\ T^t_p = \sum_{N} n \cdot \tilde{B}^{T'_r + \vec{T}_{s_r}}_{u_r,n}}} \vec{R}^t_{r,p} \cdot \mathbb{1}\left[ \tilde{Y}^t_{f,n} = 1 \right] = 1 \quad \forall r \in R,\ t \in \Delta_r$ (C7)

Capacity: Given the network's limited capabilities, it is essential to manage nodes' and links' capacities to maintain system stability. The total number of requests allocated to any node must not exceed its processing capacity (C8). Additionally, each link's capacity must not be exceeded during request and response transmission (C9).
$\sum_{R, F_{s_r}} \tilde{X}^t_{r,f} \cdot \tilde{Y}^t_{f,n} \cdot \hat{I}_{r,f} \leq \hat{C}_n \quad \forall n \in N,\ t \in T$ (C8)

$\sum_{R, P_t} J^t_{p,l} \cdot \vec{R}^t_{r,p} \cdot \hat{L}_r \leq \hat{L}_l \quad \forall l \in L_t,\ t \in T$ (C9)

UAV Trajectory Planning: Trajectory planning is essential for covering UEs and meeting latency requirements while minimizing energy consumption. Since the objective is to optimize energy consumption and changes in UAV locations affect energy usage in $W$, the optimization problem is driven to limit UAV mobility while ensuring vehicular UE connectivity. Each network node should remain within exactly one area during each time frame (C10), and each UE should be associated with one PoA within its area at a specified time frame (C11).

$\sum_{A} \tilde{S}^t_{n,a} = 1 \quad \forall n \in N,\ t \in T$ (C10)

$\sum_{N, A} \tilde{B}^t_{u_r,n} \cdot \tilde{S}^t_{n,a} \cdot A^t_{u_r,a} \leq 1 \quad \forall r \in R,\ t \in \Delta_r$ (C11)

QoS Requirements: Ensuring timely and consistent service delivery is paramount to meeting UEs' stringent QoS expectations. Constraint (C12) sets a maximum acceptable latency for request handling, verifying that if a request is accepted, its latency requirements are met. This constraint prevents requests from being accepted solely on the basis of low energy consumption, ensuring that accepted requests meet their QoS within the specified requirements. The latency threshold $\hat{D}_r$ aggregates transmission, propagation, and processing delays from $D^t_{r,l}$ to capture realistic E2E latency. Coupled with the channel-quality indicator $\hat{Q}^\tau_{c,a}$, only transmissions meeting the required SNR threshold are accepted, ensuring compliance with latency and reliability standards in UAV-assisted vehicular networks [47].

$\sum_{P_t, L_t, \Delta_r} J^t_{p,l} \cdot D^t_{r,l} \cdot \vec{R}^t_{r,p} \leq \hat{D}_r \quad \forall r \in R$ (C12)

C. Complexity Analysis

The ALLOCATE problem is NP-hard, by reduction from the multidimensional knapsack problem presented in [48]. This classification implies a worst-case computational complexity proportional to the size of the solution space [49].
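To make the capacity constraint (C8) and the latency constraint (C12) concrete, the following minimal sketch checks feasibility for a single node and a single candidate path. The function names and flat data structures are illustrative assumptions, not the paper's implementation; the checks simply mirror the two inequalities above.

```python
# Sketch: feasibility checks mirroring the node capacity constraint (C8)
# and the end-to-end latency constraint (C12). Data layout is hypothetical.

def node_capacity_ok(assigned_demands, node_capacity):
    """C8: the total required capacity of functions placed on a node
    must not exceed its processing capacity C_n."""
    return sum(assigned_demands) <= node_capacity

def latency_ok(path_link_latencies, latency_requirement):
    """C12: the accumulated per-link latency D_{r,l} along the selected
    path must stay within the request's requirement D_r."""
    return sum(path_link_latencies) <= latency_requirement

# Example: a node with 50 Mbps capacity hosting demands of 20 and 25 Mbps,
# and a path whose links add up to 82 ms against a 100 ms requirement.
print(node_capacity_ok([20, 25], 50))   # feasible
print(latency_ok([12, 30, 40], 100))    # feasible
```

In the full problem these checks are coupled across all requests, nodes, and paths, which is exactly what makes ALLOCATE combinatorial rather than separable.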
Determining the optimal solution for a set of requests requires interdependent evaluations: analyzing each UAV node in every area (trajectory planning), assessing each channel in each time slot (channel selection), and considering every node, function, and path (placement). Complexity arises because any allocation for any request in a given time frame impacts, and is influenced by, the allocations made for other requests. Consequently, all permutations of UAVs, requests, and times must be examined, creating an exponentially large solution space of size $T!\,(N_{UAV}!\,A) \cdot (C\,T_t) \cdot (N F_s P U!)$. In addition to this inherent complexity, in dynamic networks characterized by UAV and UE mobility, several key parameters remain uncertain. Without prior knowledge of UE areas ($A^t_{u,a}$) and their request arrivals, it is impossible to determine the appropriate channels for sending requests ($\tilde{Z}^\tau_{r,c}$) or to assess channel qualities ($\hat{Q}^\tau_{c,a}$). Consequently, UAV trajectory planning ($\tilde{S}^t_{n,a}$) cannot be performed ahead of time, as UE locations dictate UAV movement. The uncertainty also extends to function placement ($\tilde{Y}^t_{f,n}$) and path planning ($\vec{R}^t_{r,p}$), as the links connecting UAVs and network nodes are not predetermined. Therefore, tackling ALLOCATE requires a novel method that accommodates the dynamic nature and inherent uncertainties of the network.
Figure 2. The proposed method's process for time frame t, including Information Gathering and Service Orchestration phases.

V. PROPOSED METHOD

We propose predictive energy-efficient service orchestration (PERFECT) to tackle ALLOCATE, which operates in two phases: Information Gathering and Service Orchestration. As illustrated in Fig. 2, the PERFECT workflow is presented as a flowchart depicting the sequential execution and interconnection of its key components. The Information Gathering phase employs predictive learning to capture the temporal evolution of UE mobility and service request dynamics. Specifically, the process begins with the collection of request histories and network states, followed by the generation of predicted mobility and service demands.
Using information retrieved from the former, the latter allocates resources. To further manage the complexity of large-scale problems, the Service Orchestration phase is divided into three sub-problems: Trajectory Planning (TP), MAC, and Placement (PL) modules. The TP module determines UAV trajectories and PoA updates based on predicted UE distributions; its output defines feasible communication links and coverage areas for the next time slot. The MAC module subsequently manages channel access to mitigate interference and updates the channel quality estimations for the next time frame. Finally, the PL module decides where network functions should be deployed by evaluating node capacities, latency constraints, and energy efficiency. The outputs of these modules are looped back into the environment, updating the state for the next orchestration cycle. Fig. 3 offers a complementary perspective by detailing the technical underpinnings and the challenges each component addresses. The Information Gathering phase, implemented via a Double Deep Q-Network (DDQN), resolves imperfect-knowledge limitations. The DRL-based TP ensures energy-efficient UAV movement while maximizing coverage; the MAC's heuristic channel allocation achieves collision-aware transmission by evaluating channel qualities; and the action-masking-enhanced Dueling Double Deep Q-Learning (D3QL) PL enables energy-efficient, latency-optimized function deployment and path selection.
Figure 3. PERFECT's technologies and addressed challenges. The predictive module forecasts UE mobility and request patterns, while the HDRL framework integrates TP, MAC, and PL modules.

A. Motivation for Learning Algorithms

To facilitate adaptive decision-making in dynamic and uncertain scenarios, DRL demonstrates promise.
Classical value-based approaches such as DQL, one of DRL's principal techniques, approximate action-value functions through deep Neural Networks (NNs). Double DQL enhances the stability of DQL by decoupling action selection and evaluation, utilizing the target value defined in (4) [50]. This target incorporates reward $R$, state $O$, real action $a$, and predicted action $a'$ to update the DQL weights ($W$) for each observation-action pair of time $t$ ($O_t$, $a_t$). Here, $a' = \arg\max_{a \in A} Q(O_{t+1}, a; W_t)$, with $W$ representing evaluation weights updated at each step and $W^-$ the target weights, synchronized every $\hat{t} \gg 0$ steps. D3QL enhances DDQL by integrating Wang et al.'s dueling approach [51]. In D3QL, separate estimators calculate state values ($V$) and action advantages ($\rho$), combining them to compute Q-values (5) and weight updates (6). This approach improves training stability, accelerates convergence, and mitigates overestimation issues.

$Y_t = R_t + \Gamma \cdot Q(O_{t+1}, a'; W^-_t)$ (4)

$Q(O_t, a_t; W_t) = V(O_t; W_t) + \rho(O_t, a_t; W_t) - \frac{1}{|A|} \sum_{a' \in A} \rho(O_t, a'; W_t)$ (5)

$W_{t+1} \leftarrow W_t + \sigma \cdot [Y_t - Q(O_t, a_t; W_t)] \nabla_{W_t} Q(O_t, a_t; W_t)$ (6)

Although effective in small and moderate-sized decision spaces, these non-hierarchical methods operate on a single action space and a uniform time scale. Consequently, when applied to heterogeneous processes with multi-node orchestration, they exhibit degraded convergence, become sensitive to state dimensionality, and require large exploration budgets. Policy-gradient methods such as Proximal Policy Optimization (PPO) alleviate some instability issues through clipped surrogate objectives, yet they also suffer from slow convergence under combinatorial action spaces and lack mechanisms for decomposing multi-time-scale decisions.
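The two core computations of D3QL described above, the dueling aggregation of Eq. (5) and the double-Q target of Eq. (4), can be sketched numerically. This is a minimal NumPy illustration of the arithmetic only; the actual method uses NN estimators for $V$ and $\rho$, and the function names are ours.

```python
import numpy as np

def dueling_q(value, advantages):
    """Eq. (5): Q(O, a) = V(O) + rho(O, a) - mean over a' of rho(O, a')."""
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()

def double_q_target(reward, q_eval_next, q_target_next, gamma):
    """Eq. (4): the next action a' is chosen by the evaluation network
    (argmax), but its value is read from the target network W^-."""
    a_prime = int(np.argmax(q_eval_next))          # selection: evaluation net
    return reward + gamma * q_target_next[a_prime]  # evaluation: target net

# Dueling aggregation for a 3-action state with V = 1.0
q = dueling_q(value=1.0, advantages=[0.5, -0.5, 0.0])

# Double-Q target with Gamma = 0.8 (the paper's discount factor)
y = double_q_target(reward=1.0,
                    q_eval_next=np.array([0.2, 0.9, 0.1]),
                    q_target_next=np.array([0.3, 0.6, 0.8]),
                    gamma=0.8)
```

Subtracting the mean advantage in `dueling_q` makes the decomposition identifiable, while reading the bootstrap value from the target network in `double_q_target` is what curbs the overestimation bias of plain DQL.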
To handle these limitations in large state and action spaces, HDRL employs action-space and temporal abstraction, decomposing orchestration into coordinated high- and low-level modules. This design enhances training efficiency by enabling module interaction across distinct dynamics and timescales: high-level policies manage long-term strategy, while low-level policies handle short-term control. Each module operates within a tailored state-action domain, reducing learning complexity compared to monolithic DQN/PPO models. Hierarchical coupling allows high-level policies to integrate low-level outcomes, improving stability, sample efficiency, and scalability with increasing numbers of UAVs, UEs, and functions. To achieve a globally optimal solution and resolve the limited observation capability, careful coordination is needed to harmonize the modules' behaviors. By designing well-structured state representations and employing action masking in low-level policies, the proposed HDRL architecture avoids the convergence issues common in flat DRL. Thus, PERFECT achieves faster convergence, high service quality, and superior performance under multi-timescale constraints, establishing HDRL as an effective solution for complex large-scale problems.

B. Information Gathering Phase

The Information Gathering phase constructs a dynamic network graph $G(N, L, P)$, representing UE areas and anticipated requests for the next time frame. Following the strategy of Farhoudi et al. [43], we adopt an online model to handle the sporadic and dynamic nature of requests, as offline machine learning models cannot adapt to rapid changes in request patterns, rendering them insufficient for accurate predictions. The proposed framework inherently addresses real-time demand fluctuations and unpredictable mobility patterns through its learning-based approach.
Specifically, each PoA is equipped with a D3QL agent that continuously updates its policy based on newly observed transitions, enabling rapid adaptation to sudden traffic or request changes. These agents predict the probability of each request $r$ being issued in the next time frame, identifying the presence of $u_r$ in area $a$ ($A^{t+1}_{u_r,a}$). During the prediction process, the agent also returns a prioritized list of the requests with the highest likelihood, with a reward based on prediction accuracy and a state consisting of the received requests' history and the previous UE areas. Unlike static predictors, our online DRL method exploits Long Short-Term Memory (LSTM) layers to capture temporal variations in request arrivals caused by high UE mobility, while Convolutional Neural Network (CNN) layers extract spatial correlations between adjacent areas. A memory bank stores observed transitions to enable efficient NN updates through random sampling, as illustrated in Fig. 3.

C. Service Orchestration Phase

Following centralized aggregation of UE area predictions and anticipated service requests for the subsequent time frame, this phase deploys functions and allocates resources to meet the predicted requests and requirements. To tackle the multifaceted challenges of decision-making in dynamic, resource-constrained environments, we employ an HDRL framework. It facilitates decision-making by decomposing complex problems into manageable subproblems, employing high-level policies for strategic, long-term decisions and low-level policies for operational, short-term ones. The hierarchical interaction perfectly fits the proposed decomposition approach and ensures immediate action adaptation to real-time network dynamics without deviating from long-term objectives. Reward-sharing and feedback mechanisms enable continuous synchronization of short-term adaptations with global optimization goals.

Algorithm 1: Service Orchestration Phase
Input: $T$, $\epsilon'$, $\tilde{\epsilon}$
Output: $\tilde{S}$, $\tilde{B}$, $\tilde{Z}$, $\tilde{Y}$, $\vec{R}$
1  $W_{TP,PL} \leftarrow 0$, $W^-_{TP,PL} \leftarrow 0$, $\epsilon_{TP,PL} \leftarrow 1$, $\psi_{TP,PL} \leftarrow \{\}$
2  for $t$ in $[1 : T]$ do
3    ⋆ Trajectory Planning (high-level, frame-scale) ⋆
4    $\tilde{S}^{t+1} \leftarrow$ EpsilonGreedy($Q(O^t_{TP}, A^t_{TP}; W_{TP})$, $\epsilon_{TP}$)
5    $H \leftarrow \{t-H, \ldots, t\}$
6    Calculate $O^{t+1}_{TP}$ according to $H$ and (7)
7    $\epsilon_{TP} \leftarrow \max(\epsilon_{TP} - \epsilon', \tilde{\epsilon})$
8    Update $G(N, L_t, P_t)$ based on $\tilde{S}^{t+1}$
9    Select $\tilde{B}^{t+1}$ for each request
10   ⋆ Multiple-Access Control (low-level, slot-scale) ⋆
11   Calculate $O^{t+1}_{MAC}$ according to (10)
12   Calculate $Q^{t+1}$ based on $O^{t+1}_{MAC}$ (11)
13   $\tilde{Z} \leftarrow$ MAC($\tilde{B}^{t+1}$, $\tilde{S}^{t+1}$, $\omega^t$, $Q^{t+1}$, $T_t$)
14   Compute $R^t_{MAC}$ based on channel qualities
15   ⋆ Placement (low-level, frame-scale) ⋆
16   $\tilde{Y}^{t+1} \leftarrow$ EpsilonGreedy($Q(O^t_{PL}, A^t_{PL}; W_{PL})$, $\epsilon_{PL}$)
17   Calculate $O^{t+1}_{PL}$ according to (12)
18   $\epsilon_{PL} \leftarrow \max(\epsilon_{PL} - \epsilon', \tilde{\epsilon})$
19   Select $\vec{R}^{t+1}$ for each request
20   ⋆ Training and Reward Propagation ⋆
21   Calculate $R^t_{TP}$ and $R^t_{PL}$ based on (9) and (14)
22   if high level then
23     $R^t_{HL} = R^t_{TP} + \chi \cdot R^t_{MAC} + \kappa \cdot R^t_{PL}$
24     $R^t_{TP}, R^t_{PL} = R^t_{HL}$
25     Update global state to sync TP → MAC/PL
26   $\psi_{TP} \leftarrow \psi_{TP} \cup \{(O^t_{TP}, \tilde{S}^{t+1}, R^t_{TP}, O^{t+1}_{TP})\}$
27   Train $W_{TP}$ on a batch of samples from $\psi_{TP}$ (6)
28   $\psi_{PL} \leftarrow \psi_{PL} \cup \{(O^t_{PL}, \tilde{Y}^{t+1}, R^t_{PL}, O^{t+1}_{PL})\}$
29   Train $W_{PL}$ on a batch of samples from $\psi_{PL}$ (6)
The TP module serves as the high-level policy, responsible for long-term UAV movement decisions that optimize service coverage, feasible communication links, and the structural conditions under which the low-level modules operate. The MAC and PL modules operate as low-level policies, responsible for real-time, fine-grained decisions such as channel allocation, Resource Block (RB) management, function deployment, and path selection, based on rapidly changing environmental conditions. Each RB represents a time slot within a specific channel, determining the allocation of communication resources for transmitting UE requests and enabling the MAC to react at a higher temporal resolution than the TP. Using a simultaneous learning approach (Algorithm 1), the low-level policies are first stabilized independently to ensure reliable short-term behavior. MAC learns efficient channel allocation across time slots, while PL optimizes function deployment and path selection. Their decisions shape the environment observed by TP, and their instantaneous rewards form the components of the high-level reward.

Figure 4. A possible scenario where UE $r_1$ enters from area $a_3$ in time frame $t_1$. Although $u_2$ requires less energy to reach $a_3$ than $u_1$, the TP learning algorithm decides to move $u_1$, predicting that $r_2$ will enter from $a_1$. Each grid cell in this deployment across Oulu city represents approximately 2 km × 2 km.

Algorithm 2: Learning's EpsilonGreedy Process
Input: $Q(S, A; W)$, $\epsilon$
Output: Actions
1  $\zeta \leftarrow$ randomly generate a number from $[0 : 1]$
2  if $\zeta > \epsilon$ then
3    Actions $\leftarrow \arg\max_{a \in A} Q(O, a; W)$
4  else
5    Actions $\leftarrow$ select random actions
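Algorithm 2 above is a standard epsilon-greedy rule; a minimal Python sketch follows, together with the linear epsilon decay used in Algorithm 1 (steps 7 and 18). The helper names are ours, and the sketch uses a `>=` comparison so that epsilon = 0 is fully greedy.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Algorithm 2 sketch: exploit the argmax action when the random
    draw meets epsilon, otherwise explore a random action."""
    if rng.random() >= epsilon:
        return max(range(len(q_values)), key=lambda a: q_values[a])
    return rng.randrange(len(q_values))

def decay(epsilon, step, floor):
    """Linear decay: epsilon <- max(epsilon - epsilon', epsilon~)."""
    return max(epsilon - step, floor)

# With epsilon = 0 the choice is purely greedy (action index 1 here)
action = epsilon_greedy([0.1, 0.9, 0.2], epsilon=0.0)

# Decay with the paper's settings: epsilon' = 0.00005, floor = 0.0001
eps = decay(1.0, 0.00005, 0.0001)
```

With the paper's settings, epsilon shrinks from 1 toward the floor of 0.0001 over training, shifting the policies from exploration to exploitation.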
Once stabilized, their operational outcomes (aggregated rewards) are integrated into the training of the high-level policy, forming a closed feedback loop in which updated short-term results influence long-term optimization. Hence, $R^t_{HL}$ is considered the total reward, combining the rewards from the TP, MAC, and PL modules, scaled by factors $\chi$ and $\kappa$ to balance their contributions. Through this dynamic interaction across distinct time scales, low-level policies optimize within TP-defined constraints while TP learns to anticipate downstream effects, maintaining coherence between time scales and achieving real-time adaptability without sacrificing efficiency.

Trajectory Planning: The TP module determines optimal UAV areas for the upcoming time frame, aiming to minimize total UAV movement energy consumption while maximizing request coverage. A learning algorithm, rather than a heuristic approach, is used because of its ability to achieve long-term optimization, thereby reducing UAVs' overall energy consumption during movement. Using a D3QL algorithm, detailed in Algorithm 1 (steps 5-10), it predicts UAV movements based on past observations, prioritizing long-term energy optimization. A scenario representing the Oulu city area is illustrated in Fig. 4, where each grid cell represents approximately 2 km × 2 km. In this scenario, a UE request $r_1$ arises in area $a_3$ in time frame $t_1$, with UAVs $u_1$ located at $a_9$ and $u_2$ at $a_1$ in $t_0$. Although $u_1$ requires more energy to reach $a_3$ (based on $\Lambda$), moving $u_1$ is more efficient, as the algorithm predicts another request at $a_1$ and avoids unnecessary movement for $u_2$. In another scenario, if $r_1$ moves from $a_3$ at $t_1$ to $a_2$ at $t_2$, the algorithm pre-positions $u_3$ at $a_2$ from $t_0$, minimizing movement energy while ensuring timely delivery. The module is implemented by designing the state and action spaces in a scalable manner.
The state $O^t_{TP}$ incorporates the total requested capacities per area and the UAV locations, encoded in one-hot format (7), based on the last $H$ observations. This representation is independent of the number of requests, making it scalable for networks with varying UE numbers. The state serves as input to an NN architecture consisting of LSTM, CNN, and linear layers. The action $A^t_{TP}$ is then generated, representing the areas assigned to UAVs for the next time frame, categorized as $\tilde{S}^{t+1}$ (8). The action results in changes to the network graph, as UAVs are relocated across different areas, altering the links and paths. Thereafter, $\tilde{B}^{t+1}$ (PoAs) are determined based on $A^{t+1}_{u,a}$ (UE areas) and $\tilde{S}^{t+1}$ (network node areas). The reward function $R^t_{TP}$ maximizes coverage while optimizing UAV energy consumption, aligning with ALLOCATE's objective (9). This approach balances UE demand, service coverage, and energy efficiency, using predictive insights.

$o^t_{TP} = \left\{ \sum_{R, F_{s_r}} A^t_{u_r,a} \cdot \hat{I}_{r,f} \,\middle|\, a \in A \right\} \cup \left\{ \tilde{S}^t_{n,a} \,\middle|\, n, a \in N, A \right\}$, $\quad O^t_{TP} = \left\{ o^h_{TP} \,\middle|\, h \in \{t-H, \ldots, t\} \right\}$ (7)

$A^t_{TP} = \left\{ \tilde{S}^{t+1}_{n,a} \,\middle|\, n, a \in N, A \right\}$ (8)

$R^t_{TP} = \sum_{R, N} \tilde{B}^{t+1}_{u_r,n} - \alpha \cdot \sum_{N, A} \Lambda(\tilde{S}^{t+1}_{n,a_1} - \tilde{S}^t_{n,a_2})$ (9)

Multiple-Access Control: The MAC module allocates energy-efficient channels ($M_c$) to UEs for request transmission. It functions at the time-slot level, with each time frame containing $T_t$ time slots. Efficient allocation requires predicting channel qualities ($\hat{Q}^\tau_{c,a}$) in each area and mapping them to requests. Thus, the module is divided into two parts: channel quality prediction and heuristic-based channel assignment. Since channel predictions should update dynamically during HDRL training, this component is included in the MAC module rather than in the Information Gathering phase.
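The TP reward of Eq. (9) above is a coverage term minus a scaled movement-energy term. The following minimal sketch evaluates it for precomputed indicator and energy values; the function name and flat inputs are illustrative assumptions, since the real module derives both terms from the network state.

```python
def tp_reward(coverage_indicators, move_energies, alpha):
    """Eq. (9) sketch: reward = sum of PoA association indicators (B)
    minus alpha times the total UAV relocation energy (Lambda terms)."""
    return sum(coverage_indicators) - alpha * sum(move_energies)

# Three covered requests, two UAV moves costing 120 and 80 energy units,
# with alpha = 0.001 as in the paper's simulation settings
r = tp_reward([1, 1, 1], [120.0, 80.0], alpha=0.001)
```

Because alpha is small, coverage dominates the reward, and the energy term acts as a tie-breaker that discourages unnecessary UAV relocations.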
Table III. Mapping of PERFECT subproblems to system roles, decision layers, and applied technologies.

Subproblem | Physical Aspect | Decision Layer | Applied Technology (Rationale)
Info Gathering | UE mobility & demand prediction | Sensing & prediction | D3QL + CNN + LSTM (spatiotemporal feature extraction)
TP | UAV mobility, coverage & energy efficiency | Mobility management | D3QL (long-term control, stable convergence)
MAC | Channel allocation & latency control | Communication/MAC | Bayesian + heuristic (probabilistic-deterministic)
PL | Function deployment, routing & QoS assurance | Resource allocation | D3QL + masking + heuristic routing (energy-efficient placement)
HDRL | Cross-layer coordination across multiple time scales | Hierarchical control | Policy integration (temporal abstraction, scalability)

This hybrid design prioritizes predictive elements for critical tasks while employing deterministic algorithms for simpler processes, ensuring energy-efficient channel allocation. To predict channel quality, the MAC module utilizes a Bayesian algorithm that models and updates beliefs under uncertainty. The algorithm assigns an initial quality value of 0.5 to each channel in every area. As UAVs traverse areas, these estimates (posterior beliefs) are revised incrementally based on updated observations $O^t_{MAC}$ (10), weighted by $\lambda$, which governs the impact of new observations. This process, outlined in steps 12-13 of Algorithm 1, continuously refines the channel quality estimates $Q^t_{c,a}$, calculated using (11).

$O^t_{MAC} = \left\{ \frac{1}{|T_t|} \cdot \sum_{T_t} \hat{Q}^\tau_{c,a} \,\middle|\, c, a \in C, A \right\}$ (10)

$Q^t_{c,a} = \lambda \cdot O^t_{MAC} + (1 - \lambda) \cdot Q^{t-1}_{c,a}$ (11)

Following channel quality prediction, the MAC module allocates channels to requests, as detailed in Algorithm 3. The allocation strategy prioritizes proximity to request deadlines to meet E2E latency requirements $\hat{D}_r$. At the beginning of each time frame, co-located UAVs share the available RBs.
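The belief update of Eq. (11) above is an exponentially weighted average of the previous estimate and the new observation. A minimal sketch of that update follows; the function name and the example values (initial belief 0.5, lambda = 0.3) are illustrative, with 0.5 matching the paper's stated initial quality.

```python
def update_quality(prior, observation, lam):
    """Eq. (11) sketch: Q_t = lam * observation + (1 - lam) * Q_{t-1},
    where lam governs how strongly new observations move the belief."""
    return lam * observation + (1.0 - lam) * prior

# Beliefs start at 0.5; repeated good observations (0.9) pull the
# estimate upward, converging geometrically toward 0.9
q = 0.5
for obs in [0.9, 0.9, 0.9]:
    q = update_quality(q, obs, lam=0.3)
```

A larger lambda makes the estimate track recent slot-level observations more aggressively, while a smaller lambda smooths out transient fading; the estimate always stays between the prior and the observation.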
UAVs then allocate their RBs to connected UEs using the predicted qualities $Q^t_{c,a}$, priorities $\omega^t$, and required time slots $\hat{T}_r$. Channels and requests are sorted by quality and priority, respectively, and RBs are assigned iteratively until either the RBs are exhausted or the request requirements are fulfilled. Specifically, the algorithm calculates the minimum between the remaining and required time slots ($\acute{\tau}$) for each request. It allocates all slots within the interval $\tau : \acute{\tau}$ if no prior allocation exists, adhering to the single-channel transmission constraint (C3). This procedure ensures that high-priority UEs access superior-quality channels, optimizing resource usage while complying with QoS constraints.

Placement: This module determines the optimal deployment of functions on network nodes $\tilde{Y}^t_{f,n}$ and the paths to reach them $\vec{R}^t_{r,p}$, relying on the predicted requests' UE areas and the computed UAV locations. It aims to minimize energy consumption for function deployment ($E_n$) while adhering to node processing and link bandwidth constraints ($\hat{C}_n$, $\hat{L}_l$), reducing transmission energy ($\xi_l$), and meeting latency requirements ($\hat{D}_r$). Given the impact of UE and UAV mobility on placement dynamics, the PL module uses a D3QL learning algorithm (Algorithm 1, steps 16-19) to strategically deploy functions on nodes, ensuring that total energy consumption remains efficient and request requirements are met over time. The PL module implementation provides a highly efficient and deployable state space, designed for scalability and independence from request volume.
Its state $O^t_{PL}$ includes the total requested function capacities across all nodes with channel access, as well as the nodes' available capacities and associated energy consumption (12). Additionally, the reward function $R^t_{PL}$ focuses on maximizing accepted requests (coverage) while optimizing energy consumption, considering latency constraints (14). Furthermore, directly including all network nodes, functions, links, and paths in the action space poses scalability and convergence challenges due to its size. Two key strategies are employed to overcome this challenge. First, path selection is decoupled from learning via a heuristic integrated into the reward calculation. Specifically, after selecting nodes for function deployment, the algorithm prioritizes requests by latency requirements and identifies feasible paths that meet those requirements with minimal energy consumption ($\xi_l$). The algorithm penalizes invalid deployments with negative rewards, while otherwise rewarding the total energy consumed to establish the connection. Second, action masking reduces the effective action space without sacrificing optimality by assigning large negative values to infeasible actions. Infeasible actions include deploying functions with no predicted requests or exceeding capacity thresholds. The PL module action $A^t_{PL}$ is defined as in (13).

Algorithm 3: Channel Allocation Method
Input: $\tilde{B}^{t+1}$, $\tilde{S}^{t+1}$, $\omega^t$, $Q^{t+1}$, $T_t$
Output: $\{\tilde{Z}^\tau \mid \tau \in T_t\}$
1  $\tilde{Z}^\tau_{r,c} \leftarrow 0 \quad \forall r, c, \tau \in R, C, T_t$
2  for each $n$ in $N$ do
3    $C_{sorted} \leftarrow$ SORT$_c$($C$, key $= Q^{t+1}_{c,a} \mid \tilde{S}^{t+1}_{n,a} = 1$)
4    $R_{sorted} \leftarrow$ SORT$_r$($R$, key $= \omega^t \mid \tilde{B}^{t+1}_{u_r,n} = 1$)
5    for each $c$ in $C_{sorted}$ do
6      $\tau \leftarrow 0$
7      while $\tau \leq |T_t|$ do
8        for each $r$ in $R_{sorted}$ do
9          $\acute{\tau} \leftarrow \tau + \min(\hat{T}_r, |T_t| - \tau)$
10         if $\sum_{c,\tau} \tilde{Z}^{\tau:\acute{\tau}}_{r,c} == 0$ then
11           $\tilde{Z}^{\tau:\acute{\tau}}_{r,c} \leftarrow 1$
12           $\tau \leftarrow \acute{\tau}$
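The greedy structure of Algorithm 3 can be sketched compactly: visit channels best-quality first, requests highest-priority first, and give each request one contiguous run of slots on a single channel (constraint C3). The data layout below (tuples of ids, priorities, and slot counts) is an illustrative assumption and omits the per-UAV RB sharing of the full algorithm.

```python
def allocate_rbs(requests, channels, num_slots):
    """Greedy RB allocation in the spirit of Algorithm 3.
    requests: list of (request_id, priority, required_slots);
    channels: list of (channel_id, predicted_quality).
    Returns request_id -> (channel_id, start_slot, end_slot)."""
    allocation = {}
    for cid, _quality in sorted(channels, key=lambda c: c[1], reverse=True):
        slot = 0  # slots are consumed per channel within the time frame
        for rid, _prio, need in sorted(requests, key=lambda r: r[1],
                                       reverse=True):
            if rid in allocation or slot >= num_slots:
                continue  # single-channel rule (C3), or channel exhausted
            take = min(need, num_slots - slot)  # clip to remaining slots
            allocation[rid] = (cid, slot, slot + take)
            slot += take
    return allocation

# Two requests competing for the better channel c1 in a 10-slot frame
alloc = allocate_rbs(requests=[("r1", 0.9, 3), ("r2", 0.4, 4)],
                     channels=[("c1", 0.8), ("c2", 0.6)],
                     num_slots=10)
```

Here the higher-priority request r1 receives the first slots of the best channel, and r2 follows on the same channel, mirroring how the paper's heuristic lets high-priority UEs claim superior-quality channels first.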
$O^t_{PL} = \left\{ \sum_{R, C, T_t} \tilde{Z}^\tau_{r,c} \cdot \hat{I}_{r,f} \,\middle|\, f \in F_s \right\} \cup \left\{ (\hat{C}_n, E_n) \,\middle|\, n \in N \right\}$ (12)

$A^t_{PL} = \left\{ \tilde{Y}^{t+1}_{f,n} \,\middle|\, \text{action is not masked},\ f, n \in F_s, N \right\} = \left\{ \tilde{Y}^{t+1}_{f,n} \,\middle|\, \sum_{R} \hat{I}_{r,f} \leq \sum_{N} (\tilde{Y}^{t+1}_{f,n} \cdot \hat{C}_n),\ \sum_{R} (\tilde{X}^{t+1}_{r,f} \cdot \tilde{Y}^{t+1}_{f,n}) > 0,\ f, n \in F_s, N \right\}$ (13)

$R^t_{PL} = \sum_{R} \left( \prod_{F_{s_r}} \tilde{X}^{t+1}_{r,f} \right) - \alpha \left( \sum_{F_s, N, T} \tilde{Y}^{t+1}_{f,n} \cdot E_n + \sum_{L_{t+1}, P_{t+1}, R, \Delta_r} \xi_l \cdot J^{t+1}_{p,l} \cdot \vec{R}^{t+1}_{r,p} \right)$ (14)

Table III details the subproblems addressed in PERFECT, highlighting their physical aspects, corresponding decision layers, and underlying technologies. The modular decomposition enables hierarchical decision-making across multiple time scales, effectively enhancing adaptability, energy efficiency, and QoS assurance in aerial-terrestrial vehicular 6G networks.

VI. PERFORMANCE EVALUATION

This section evaluates the efficiency of the proposed method. It begins with a convergence analysis, examining the impact of hyperparameters on the algorithm's performance. Subsequently, PERFECT is benchmarked against baseline methods using diverse metrics, demonstrating its superiority.

A. Simulation settings

The simulations are conducted within a vehicular edge-cloud continuum, where UAVs initially fly at a fixed height and are randomly distributed across the network area. Key simulation parameters are summarized in Table IV, where parameters follow a uniform distribution $U$. The parameters ensure that energy consumption is physically modeled following an aerodynamic formulation, while the wireless channel reflects environment-dependent attenuation. This configuration enables a fair and realistic comparison between PERFECT and baseline frameworks under varying conditions. To model UE mobility and reflect realistic user dynamics, we employ Simulation of Urban Mobility (SUMO) [52], a microscopic vehicular mobility simulator designed for large-scale network environments.
Specifically, we consider a grid environment with bidirectional roads, while SUMO provides realistic trajectories across different urban zones of Oulu City in Finland, including both city-center and suburban areas, allowing us to reflect diverse densities and mobility patterns. UEs follow the Manhattan mobility model, moving straight with a 50% probability and turning left or right each with a 25% probability at intersections. The mobility model is inspired by Manhattan-like urban layouts and is implemented with additional stochasticity in vehicle interactions and departure processes, providing diverse and generalized dynamics. During the simulation, there are 50 UEs in the network, each capable of generating various types of service requests with heterogeneous data rate requirements, providing a practical context for assessing our proposed method.

B. Convergence Analysis

We conduct experiments with varying hyperparameters to assess the proposed method's convergence behavior. These parameters are critical for training efficiency and stability, with the learning rate controlling the magnitude of NN weight updates and the batch size defining the number of samples per training episode. Small learning rates result in prolonged training durations or even non-convergence because of ineffective loss minimization. Conversely, large rates lead to unstable training dynamics and prevent convergence, as rapid weight updates may overshoot optimal values or become trapped in local optima. Fig. 5 demonstrates that a learning rate of 0.001 achieves optimal convergence, offering high rewards and stability across training episodes. Also, smaller batch sizes

Table IV. Simulation Parameters.
Domain  | Parameter | Value
--------|-----------|------
Network | Node Processing Capacity (p^C_n) | ~U(25, 70) Mbps
Network | Node Energy Capability (E_n) | ~U(12, 36) Hz
Network | Link Bandwidth Capacity (p^L_l) | ~U(10, 30) Mbps
Network | Link Latency (D^t_{r,l}) | ~U(4, 16) ms
Network | Link Energy Consumption (ξ_l) | ~U(5, 8) Hz
UAV     | Weight, Velocity (W_n, V_w(t)) | U(4, 6) kg, U(8, 12) m/s
UAV     | Induced Power, Drag Coefficient (I, ς) | 0.08, 0.05
UAV     | Air Density (φ) | 1.225 kg·m^-3
UAV     | Rotor Disk, Frontal Area (υ_r, υ_f) | 0.6, 0.25 m^2
Area    | Areas (A) | 4×4 grid
Service | Number of Composed Services (S) | 20
Service | Number of Atomic Functions (F_s) | 32
Service | Service Duration (T_s) | ~U(3, 10) frames
Service | Required Time Slots (q^T_r) | ~U(3, 9) slots
UE      | Vehicular UEs (U) | 50
UE      | Bandwidth Requirement (q^L_r) | ~U(2, 8) Mbps
UE      | Capacity Requirement (q^I_{r,f}) | ~U(8, 20) Mbps
UE      | Latency Requirement (q^D_r) | ~U(50, 100) ms
Channel | Time Slots per Time Frame (T_t) | 10
Channel | LoS Probability (θ^LoS_a) | ~U(0.2, 0.8)
Channel | Rayleigh Coefficient (R^τ_{c,a}) | ~R(σ ~ U(0.2, 0.8))
Channel | Quality Threshold (p^Q) | 1
Channel | Channel Energy Consumption (M_c) | ~U(2, 8) Hz
Channel | Weather Attenuation (ζ(Ω_τ)) | {0, 2.5, 5.0} dB
Channel | Shadowing Std. Deviation (η_s(Ω_τ)) | {2.0, 3.0, 4.5} dB
HDRL    | Running Episodes | 100,000
HDRL    | Replay Memory (ψ_TP, ψ_PL) | 2000, 1000
HDRL    | Discount Factor (Γ) | 0.8
HDRL    | Scaling Factors (α, χ, κ) | 0.001, 0.5, 0.8
HDRL    | Epsilon-Greedy Process (ϵ, ϵ′, e_ϵ) | 1, 0.00005, 0.0001
TP NN   | Layers (LSTM / two CNN (kernel size, stride, pooling size) / three fully connected) | 128 units / (3, 2, 2) / (256, 128, 64 units)
TP NN   | Activation Function | Hyperbolic Tangent
PL NN   | Layers (three fully connected) | (512, 256, 128 units)
PL NN   | Activation Function | Leaky ReLU

U: uniform distribution; R: Rayleigh distribution; Mbps: megabits per second; ms: millisecond; Hz: Hertz; ReLU: Rectified Linear Unit.

introduce variability and instability due to random sampling, while larger batch sizes increase overhead and slow convergence.
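The learning-rate effects described above can be reproduced on a toy objective. This is an illustrative sketch only, unrelated to the paper's actual networks; the quadratic loss, constants, and seeds are made up for demonstration:

```python
import numpy as np

def sgd_quadratic(lr, batch_size, steps=200, seed=0):
    """Minibatch SGD on a toy loss E[(w - x)^2] with noisy samples
    x ~ N(1, 0.5). A tiny learning rate barely moves w toward the optimum
    (slow convergence); a moderate one converges; an overly large one makes
    updates overshoot and diverge. Returns the final parameter estimate."""
    rng = np.random.default_rng(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.normal(1.0, 0.5, size=batch_size)
        grad = 2.0 * np.mean(w - batch)  # gradient of the mean squared error
        w -= lr * grad
    return w
```

Running this with lr = 0.001 leaves w far from the optimum after 200 steps, lr = 0.1 lands close to it, and lr = 1.5 blows up, mirroring the slow-convergence / stable / unstable regimes discussed above. Batch size plays the variance role: smaller batches make the gradient estimate noisier per step.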
As shown in Fig. 6, a batch size of 32 strikes the right balance, ensuring faster and smoother convergence compared to the more variable results with a batch size of 8 and the slower convergence with a batch size of 64. A noteworthy feature of our HDRL framework is that it comprises two policies, one high-level and one low-level, leading to reward escalation once the high-level policy starts (episode 25,000).

Figure 5. Convergence performance of the PERFECT algorithm for different learning rates, highlighting its stability and reward optimization.

C. Comparison experiments

We compare our proposed method against baseline approaches to demonstrate its effectiveness. The compared approaches include the optimal solution of ALLOCATE derived via CPLEX (with complete knowledge); a random selection strategy for UAV trajectory planning, channel selection, and service deployment; the Successive Convex Approximation-based (SCA) method [33], which considers multi-UAV trajectory and resource planning; and the Hungarian and DDQN-based (HaDDQN) method [30], which uses a DDQN approach for service placement and a Hungarian algorithm for RB allocation in co-channel settings. The SCA method is modified to include latency constraints instead of the UAV recharging considered in the original study. Likewise, the HaDDQN method is adapted to align with our framework by adopting the Hungarian algorithm for UAV trajectory decisions. Three quantitative metrics critical to vehicular networks are considered to compare the methods. First, the number of accepted requests (request coverage) indicates scalability in dynamic environments, essential for connectivity. Second, energy consumption quantifies the energy required for UAV movement, communication through channels, and resource utilization, addressing sustainability goals and operability extension.
Third, E2E latency assesses the ability to ensure seamless user experiences, which is crucial for futuristic latency-sensitive applications like autonomous driving. These metrics collectively provide a comprehensive assessment of the method's effectiveness, highlighting its potential to balance service delivery, energy use, and E2E latency requirements. To evaluate the different aspects of the problem and their implications for service orchestration, we vary the number of requests, network nodes, and communication channels across simulation scenarios. The first scenario assesses the response to varying UEs and active requests to verify the ability to maintain high request acceptance under dynamic workload conditions. It simulates real-world request fluctuations, as seen in futuristic applications like intelligent transportation systems [53], where vehicles generate variable real-time data during peak hours or emergencies. Second, varying network nodes and areas through SUMO^3 evaluates the effect of network and infrastructure scale. This scenario reflects the expected growth in 6G infrastructure size [54] to ensure effective performance across diverse environments, from small-scale edge-cloud systems to expansive, distributed networks. In futuristic networks, densely populated urban areas with high vehicular density and connectivity demand can strain channel availability [55]. Thus, we consider varying the number of communication channels C as the third scenario, which determines adaptability to constrained channel availability across diverse conditions.

^3 The transition from sparse suburban to dense urban traffic is captured.

Figure 6. Convergence performance of the PERFECT algorithm for different batch sizes, showing the trade-offs between stability and convergence speed.

Fig.
7 illustrates comparisons between PERFECT and the baseline methods in terms of the percentage of accepted requests (1), as well as the energy consumption (2) and E2E latency (3) incurred by each request, under the increasing requests (a), nodes (b), and channels (c) scenarios. Notably, requests failing their latency requirements are counted as unaccepted, reflecting their inability to provide a satisfactory user experience. To ensure statistical robustness in a setting containing uniform distributions, we consider average values from multiple system runs, each using identical seed values for all methods. This approach ensures that each method operates under the same conditions in each iteration. In this regard, even in optimal solutions, energy consumption and E2E latency fluctuate due to dynamic factors like node/link capacities and evolving bandwidth requirements. Besides, the shaded regions around the trend lines represent standard deviations, indicating performance variability. The observed differences in shaded regions stem from the varying robustness of the methods to network dynamics, with the random method showing the highest variability due to uninformed decisions, while PERFECT and ALLOCATE achieve more stable performance through adaptive allocation.

Scenario I: This scenario begins with increasing the number of requests from 7 (minimal load) to 30 (extreme load) per time frame while keeping the network size (10 nodes) and channel availability (10 channels) constant. Fig. 7(a) demonstrates the scalability of the proposed framework under different request numbers. All methods exhibit a slight decline in accepted requests as the number of requests increases, since the number of network nodes remains fixed. Higher request volumes lead to increased E2E latencies due to diverse requests with varying requirements across networks with high-capacity nodes located far from PoAs.
Also, to accommodate high demand, methods are compelled to deploy additional functions and utilize more links for transmitting requests and responses, thereby increasing energy consumption.

Figure 7. (1) Supported requests percentage, (2) energy consumption, and (3) E2E latency compared between the ALLOCATE, PERFECT, SCA, HaDDQN, and random methods as the (a) request set, (b) network size, and (c) channel size expand. The shaded regions indicate the standard deviation across multiple runs.

PERFECT maintains high request coverage and relatively stable energy consumption, outperforming the HaDDQN, SCA, and random methods. Regarding E2E latency, both HaDDQN and PERFECT perform well, as they prioritize keeping latency within acceptable requirements. The random method's poor request handling leads to service interruptions, latency infringements, and high energy consumption, rendering it unsuitable for real-world applications. SCA's and HaDDQN's limited ability to predict requests contributes to their reduced performance, particularly at larger request volumes. Also, their TP methods overlook total energy minimization (discussed in Section V-C), incurring higher energy consumption. SCA's lack of shared channel support further diminishes its acceptance range compared to HaDDQN, despite its comparable performance under lighter loads. HaDDQN, benefiting from its learning-based PL, achieves superior energy efficiency over SCA in function deployment, though it remains less efficient than PERFECT. PERFECT's predictive capabilities and Bayesian-based channel sharing, combined with efficient TP and PL modules, ensure E2E latency compliance. It strategically deploys functions farther from UEs to reduce energy consumption, resulting in slightly higher latency that nonetheless remains within tolerances in most cases. The gap between ALLOCATE and PERFECT is minimal and stems from occasional prediction errors that necessitate deploying slightly more functions.
However, these increases are negligible in practical scenarios and underscore PERFECT's efficient prediction algorithm. Such consistency demonstrates the predictive module's adaptability to unexpected mobility or traffic patterns, as it dynamically updates its knowledge base to reflect new behavior, preventing service degradation and sustaining stable orchestration efficiency.

Scenario II: The second scenario scales the network infrastructure from 6 to 23 nodes, with fixed requests (15) and channels (10). As the network size increases, the number of areas also expands (from 4×4 to 6×6). Fig. 7(b) shows the scalability and flexibility of the proposed framework in varying environments and under different infrastructures. Initially, all methods show lower acceptance rates due to an imbalance between requests and limited UAVs. As network density increases, acceptance rates improve with the additional network resources. Energy consumption starts high due to reliance on energy-intensive edge nodes but gradually decreases as UAVs are deployed, requiring fewer links for service delivery. Similarly, E2E latency declines as UAVs are positioned closer to UEs, reducing transmission latencies. The proposed method exhibits strong scalability and efficient energy consumption while performing practically similarly to HaDDQN regarding latency. The random approach's energy-intensive node selection leads to high energy consumption and prolonged latency despite minor improvements from the increased nodes. As seen in the first scenario, SCA and HaDDQN struggle with request prediction, which affects their performance in smaller networks. However, as the network size increases, the acceptance of both methods improves since there are sufficient nodes in each area. SCA struggles with high latency in limited networks due to distant function deployments, occasionally violating latency requirements.
As the network size increases, SCA's latency improves, reflecting its ability to better utilize the expanded infrastructure. Energy consumption in PERFECT is notably efficient, with lower UAV movement and deployment energy compared to SCA and HaDDQN, owing to its HDRL algorithm that optimally selects proper nodes for service coverage and delivery. The HDRL algorithm enhances this efficiency through high-level policies that become increasingly impactful as the network size grows. While HaDDQN and PERFECT exhibit similar deployment energy usage, PERFECT's predictive capabilities offer minor improvements. As UAV numbers grow, PERFECT further reduces energy consumption by accurately predicting user behavior and avoiding unnecessary UAV movements. The information-gathering DRL agents continuously capture mobility-induced variations, enabling PERFECT to anticipate real-time mobility and adjust TP decisions. This optimization also lowers E2E latency, contributing to the method's overall effectiveness in handling scalability. ALLOCATE performs comparably, especially in the presence of sufficient network resources, with minor variations due to slight prediction errors during information gathering.

Scenario III: In this scenario, the number of channels is increased from 2 to 8 while keeping the number of requests (10) and network nodes (10) fixed. Channel availability influences system performance, with noticeable improvements observed when the number of channels exceeds two. This enhancement is attributed to the requirement of three or more time slots for most requests. With more channels, accepted requests increase across all methods due to the additional available RBs for transmission. However, this growth raises energy consumption because of the unique energy demands of each channel and the complexity of managing shared channels.
Similarly, the rise in accepted requests slightly increases E2E latency, reflecting the higher system load. In this scenario, PERFECT achieves higher request acceptance rates and superior energy efficiency than SCA and HaDDQN due to its Bayesian algorithm for channel quality prediction, which optimizes channel allocation. HaDDQN lacks channel prediction, leading to inefficient utilization and high energy consumption. Furthermore, the absence of support for shared channels leads to increased channel energy consumption and E2E latency for SCA. The difference in accepted requests between ALLOCATE and PERFECT stems from channel quality prediction errors and the proposed greedy channel selection. E2E latency remains stable for PERFECT across channel configurations, whereas other methods, especially SCA and random, experience rising latency with increasing channel availability, highlighting their limitations in managing higher channel counts.

D. Discussion

A comparison of PERFECT with alternative approaches highlights its superior timeliness. On average, PERFECT produces results 88% faster than the optimal approach. This speed advantage is supported by its complexity analysis: T((U + R)A + N_UAV A + N U C T_t + (N F_s + P U)). This complexity arises from (i) evaluating U user trajectories and R requests over areas (Prediction); (ii) evaluating each UAV in each area (TP); (iii) assessing nodes, UEs, channels, and RBs (MAC); and (iv) analyzing nodes and functions for function deployment and determining optimal paths for requests (PL). An increased number of nodes and requests affects the runtime only slightly, and this marginal cost is offset by performance gains in acceptance, latency, and energy, as demonstrated in Fig. 7. Notably, PERFECT's complexity is dominated by lightweight online inference, while the expensive offline training is amortized over long-term system usage.
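As a rough sanity check, the dominant terms of the three complexity expressions can be tallied numerically. The grouping of symbols below is our reading of the printed expressions (constant factors dropped), and the instance sizes are hypothetical:

```python
def perfect_cost(U, R, A, N, C, P, T_t, F_s, n_uav):
    """Per-frame cost read from T((U+R)A + N_UAV*A + N*U*C*T_t + (N*F_s + P*U))."""
    return T_t * ((U + R) * A + n_uav * A + N * U * C * T_t + (N * F_s + P * U))

def sca_cost(U, A, N, C, P, T_t):
    """SCA: interior-point over all nodes/areas/users/channels, ~(N*A*U*P*C*T_t)^3.5."""
    return (N * A * U * P * C * T_t) ** 3.5

def haddqn_cost(U, R, A, N, C, P, T_t, n_uav):
    """HaDDQN: cubic Hungarian assignment plus UAV-area association."""
    return T_t * (((U + R) * P * C * T_t) ** 3 + n_uav * A)
```

For a moderate instance (e.g., U=50, R=15, A=16, N=10, C=10, P=5, T_t=10, F_s=32, 5 UAVs), the linear-in-size PERFECT tally sits many orders of magnitude below the cubic HaDDQN term, which in turn sits below the degree-3.5 SCA term, consistent with the scaling argument above.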
The HDRL training stage is computationally demanding due to iterative exploration, reward evaluation, and parameter updates across many simulated episodes; however, this cost is incurred once, before deployment. In contrast, the online inference phase requires sub-second execution times on standard edge hardware. Hence, PERFECT achieves an effective balance between computational overhead and performance metrics, confirming the scalability and efficiency of the HDRL design in next-generation vehicular networks. To benchmark complexity, PERFECT was compared with SCA and HaDDQN. In SCA, the non-convex problem is decomposed into sequential convex subproblems, each solved via an interior-point method over all nodes, areas, users, and channels, yielding T(N A U P C T_t)^3.5 complexity. While SCA converges to a stationary point, its runtime scales poorly with network size and channel numbers. HaDDQN incurs a per-frame cost of T(((U + R) P C T_t)^3 + N_UAV A), dominated by the cubic complexity of the assignment operations, which rely on the Hungarian algorithm for user-resource and UAV-area association, followed by a DDQN-based decision. Thus, PERFECT maintains lower practical complexity and faster execution. As shown in Fig. 8(a), PERFECT consistently surpasses SCA and HaDDQN in the oracle's objective, owing to its prediction-aware decisions, Bayesian channel adaptation, and hierarchical coordination. Each radar in this figure is generated by first min-max normalizing the averaged values of accepted requests, energy consumption, and E2E latency collected across multiple scenarios at different densities, and then plotting the aggregated triplet for each algorithm to visualize its overall performance balance in each scenario. Fig. 8(b) presents heat maps of accepted requests and energy consumption, highlighting balanced behavior across different metrics.
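The radar-plot construction described above can be sketched as follows. Inverting energy and latency after normalization, so that larger values are uniformly better across all three axes, is our assumption for illustration; the text does not state how lower-is-better metrics are oriented:

```python
def minmax(values):
    """Min-max normalize a list of averaged metric values to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def radar_triplet(accepted, energy, latency):
    """Build one normalized (accepted, energy, latency) triplet per algorithm,
    as used to generate each radar plot. Energy and latency are inverted
    (assumption) so that 1.0 always denotes the best algorithm on an axis."""
    a = minmax(accepted)
    e = [1.0 - x for x in minmax(energy)]
    d = [1.0 - x for x in minmax(latency)]
    return list(zip(a, e, d))
```

With this convention, an algorithm that dominates on all three averaged metrics maps to the outermost (1.0, 1.0, 1.0) radar contour, and the weakest maps to the center.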
PERFECT achieves over 92% of ALLOCATE's optimal rates for varying requests, significantly outperforming the random approach, which drops to 24% at higher request levels. In the network scaling scenario, PERFECT's acceptance rates rise from 51% to 99%, achieving 93% of ALLOCATE's performance. Similarly, for channel variations, PERFECT excels at 92% of ALLOCATE, outperforming SCA and HaDDQN, which plateau at 71-74%. Besides, PERFECT's performance metrics remain stable as the number of users, network nodes, and channels increases, as depicted in Fig. 8(b). Each heat map is constructed by computing the ratio of total accepted requests to total energy consumption based on averaged simulation statistics for every method, providing a compact representation of the objective trade-off.

Figure 8. Performance metrics comparison of the baseline methods as the (1) request set, (2) network size, and (3) channel size expand. In (a), radar plots present the normalized trade-offs among accepted requests, latency, and energy consumption, demonstrating how each algorithm behaves under varying scenarios. In (b), the heat maps depict the ratio of accepted requests to energy consumption (as objectives) under the same conditions.

The illustrations of the PERFECT evaluations underscore the proposed method's scalability and its ability to (i) balance competing objectives effectively, (ii) sustain efficient operation, and (iii) adapt effectively to varying network conditions.

VII. CONCLUSION

In this study, we proposed a comprehensive framework for composed service orchestration in 6G aerial-terrestrial networks, addressing the intertwined challenges of mobility, resource planning, and service coverage. Our orchestration approach ensures that the diverse requirements of modern vehicular applications are met, resulting in efficient resource allocation and high-quality service provisioning.
An MINLP problem of service orchestration in an integrated aerial-terrestrial network, accounting for capacity constraints, changing user behavior, and E2E latencies, was first formulated to maximize service coverage while optimizing energy consumption. To solve this NP-hard problem, the integration of HDRL with predictive modeling enabled efficient UAV trajectory planning and resource-efficient service placement, ensuring QoS compliance and enhanced system performance. Our simulation results highlighted significant improvements in request acceptance, energy efficiency, and latency minimization, outperforming traditional and state-of-the-art methods. This framework underscores the transformative potential of HDRL-driven solutions for managing the complexity and scalability of next-generation vehicular networks. Future work will focus on further enhancing the scalability of the proposed method. Promising directions include integrating federated learning to enable decentralized training and decision-making, and exploring multi-agent RL to improve coordination among UAVs and network nodes in heterogeneous, multi-tiered environments involving satellite, optical wireless, and radio frequency communication nodes. Also, we plan to explore opportunities for using Large Language Models [56] for high-level decisions to assist in reasoning about resource allocation challenges in emerging quantum internet settings, such as qubit routing and link-level scheduling [57].

ACKNOWLEDGMENT

The work in this paper was supported in part by the Federal Ministry of Research, Technology, and Space (BMFTR), Germany, through the Project 6GEM+ under Grant 16KIS2411; and in part by the 6G-Path project (Grant No. 101139172).

REFERENCES

[1] Z. Chen and X. Wang, “Decentralized computation offloading for multi-user mobile edge computing: A deep reinforcement learning approach,” EURASIP J. Wireless Commun. Netw., vol. 2020, no. 1, p.
188, 2020.
[2] I. Lee and D. K. Kim, “Decentralized multi-agent DQN-based resource allocation for heterogeneous traffic in V2X communications,” IEEE Access, vol. 12, pp. 3070–3084, 2024.
[3] M. Shokrnezhad et al., “Semantic revolution from communications to orchestration for 6G: Challenges, enablers, and research directions,” IEEE Netw., vol. 38, no. 6, pp. 63–71, 2024.
[4] V. S. Hapanchak, A. Costa, J. Pereira, and M. J. Nicolau, “An intelligent path management in heterogeneous vehicular networks,” Veh. Commun., vol. 45, p. 100690, 2024.
[5] H. Zhou, W. Xu, J. Chen, and W. Wang, “Evolutionary V2X technologies toward the internet of vehicles: Challenges and opportunities,” Proceedings of the IEEE, vol. 108, no. 2, pp. 308–323, 2020.
[6] S. Wright, “Autonomous cars generate more than 300 TB of data per year,” Tech Blog, Tuxera, 2021. [Online]. Available: https://www.tuxera.com/blog/autonomous-cars-300-tb-of-data-per-year/
[7] X. Li et al., “Federated multi-agent deep reinforcement learning for resource allocation of Vehicle-to-Vehicle communications,” IEEE Trans. Veh. Technol., vol. 71, no. 8, pp. 8810–8824, 2022.
[8] M. N. Avcil, M. Soyturk, and B. Kantarci, “Fair and efficient resource allocation via vehicle-edge cooperation in 5G-V2X networks,” Veh. Commun., vol. 48, p. 100773, 2024.
[9] Q. Wu et al., “Mobility-aware cooperative caching in vehicular edge computing based on asynchronous federated and deep reinforcement learning,” IEEE J. Sel. Topics Signal Process., vol. 17, no. 1, pp. 66–81, 2023.
[10] F. Busacca, C. Grasso, S. Palazzo, and G. Schembra, “A smart road side unit in a microeolic box to provide edge computing for vehicular applications,” IEEE Trans. Green Commun. Netw., vol. 7, no. 1, pp. 194–210, 2023.
[11] M. Shokrnezhad et al., “Toward a dynamic future with adaptable computing and network convergence (ACNC),” IEEE Netw., vol. 39, no. 2, pp. 268–277, 2025.
[12] C. R.
Storck and F. Duarte-Figueiredo, “A survey of 5G technology evolution, standards, and infrastructure associated with Vehicle-to-Everything communications by internet of vehicles,” IEEE Access, vol. 8, pp. 117593–117614, 2020.
[13] H. Mazandarani, M. Shokrnezhad, and T. Taleb, “Semantic-aware dynamic and distributed power allocation: a multi-UAV area coverage use case,” 2025.
[14] F. Zhou, R. Q. Hu, Z. Li, and Y. Wang, “Mobile edge computing in unmanned aerial vehicle networks,” IEEE Trans. Wireless Commun., vol. 27, no. 1, pp. 140–146, 2020.
[15] Z. Ning et al., “Multi-agent deep reinforcement learning based UAV trajectory optimization for differentiated services,” IEEE Trans. Mobile Comput., vol. 23, no. 5, pp. 5818–5834, 2024.
[16] M. Z. Alam and A. Jamalipour, “Multi-agent DRL-based Hungarian algorithm (MADRLHA) for task offloading in multi-access edge computing internet of vehicles (IoVs),” IEEE Trans. Wireless Commun., vol. 21, no. 9, pp. 7641–7652, 2022.
[17] H. Mazandarani, M. Shokrnezhad, and T. Taleb, “A novel multiple access scheme for heterogeneous wireless communications using symmetry-aware continual deep reinforcement learning,” IEEE Trans. Mach. Learn. Commun. Netw., vol. 3, pp. 353–368, 2025.
[18] H. Mazandarani et al., “A semantic-aware multiple access scheme for distributed, dynamic 6G-based applications,” in Proc. IEEE Wireless Commun. and Networking Conf., 2024, pp. 1–6.
[19] N. I. Sarkar and S. Gul, “Artificial intelligence-based autonomous UAV networks: A survey,” Drones, vol. 7, no. 5, p. 322, 2023.
[20] X. Wei et al., “Joint UAV trajectory planning, DAG task scheduling, and service function deployment based on DRL in UAV-empowered edge computing,” IEEE Internet Things J., vol. 10, no. 14, pp. 12826–12838, 2023.
[21] Z. Md. Fadlullah and N.
Kato, “HCP: Heterogeneous computing platform for federated learning based collaborative content caching towards 6G networks,” IEEE Trans. Emerg. Topics Comput., vol. 10, no. 1, pp. 112–123, 2022.
[22] X. Liu, Y. Liu, and Y. Chen, “Reinforcement learning in multiple-UAV networks: Deployment and movement design,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 8036–8049, 2019.
[23] H. Santos et al., “A mobility-aware flying edge computing service orchestration with quality of service support,” in Proc. IEEE World Forum on Internet of Things (WF-IoT), 2023, pp. 01–06.
[24] A. Nabi, T. Baidya, and S. Moh, “Comprehensive survey on reinforcement learning-based task offloading techniques in aerial edge computing,” Internet of Things, p. 101342, 2024.
[25] S. Han et al., “DRL-assisted energy minimization for NOMA-based dynamic multi-user multi-access MEC networks,” IEEE Internet Things J., 2024.
[26] F. Pervez, L. Zhao, and C. Yang, “Joint user association, power optimization and trajectory control in an integrated satellite-aerial-terrestrial network,” IEEE Trans. Wireless Commun., vol. 21, no. 5, pp. 3279–3290, 2022.
[27] P. Qin et al., “Joint trajectory plan and resource allocation for UAV-enabled C-NOMA in air-ground integrated 6G heterogeneous network,” IEEE Trans. Netw. Sci. Eng., vol. 10, no. 6, pp. 3421–3434, 2023.
[28] J. Gao, Z. Kuang, J. Gao, and L. Zhao, “Joint offloading scheduling and resource allocation in vehicular edge computing: A two layer solution,” IEEE Trans. Veh. Technol., vol. 72, no. 3, pp. 3999–4009, 2023.
[29] C. Huang, G. Chen, P. Xiao, Y. Xiao, Z. Han, and J. A. Chambers, “Joint offloading and resource allocation for hybrid cloud and edge computing in SAGINs: A decision assisted hybrid action space deep reinforcement learning approach,” IEEE J. Sel. Areas Commun., 2024.
[30] W. Qi, Q. Song, L. Guo, and A.
Jamalipour, “Energy-efficient resource allocation for UAV-assisted vehicular networks with spectrum sharing,” IEEE Trans. Veh. Technol., vol. 71, no. 7, pp. 7691–7702, 2022.
[31] Q. He and J. Liang, “Online joint optimization of virtual network function deployment and trajectory planning for virtualized service provision in multiple-unmanned-aerial-vehicle mobile-edge networks,” Electronics, vol. 13, no. 5, p. 938, 2024.
[32] B. Li, R. Yang, L. Liu, and C. Wu, “Service placement and trajectory design for heterogeneous tasks in multi-UAV edge computing networks,” IEEE Internet Things J., 2024.
[33] N. Gupta, S. Agarwal, D. Mishra, and B. Kumbhani, “Trajectory and resource allocation for UAV replacement to provide uninterrupted service,” IEEE Trans. Commun., 2023.
[34] P. Qin, J. Li, J. Zhang, and Y. Fu, “Joint task allocation and trajectory optimization for multi-UAV collaborative air-ground edge computing,” IEEE Trans. Netw. Sci. Eng., 2024.
[35] K. Li, W. Ni, X. Yuan, A. Noor, and A. Jamalipour, “Exploring graph neural networks for joint cruise control and task offloading in UAV-enabled mobile edge computing,” in Proc. IEEE Veh. Technol. Conf., 2023, pp. 1–6.
[36] D. Clément et al., “Energy efficiency relaying election mechanism for 5G Internet of Things: A deep reinforcement learning technique,” in Proc. IEEE Wireless Commun. and Networking Conf., 2024, pp. 1–6.
[37] F. Li et al., “Multi-UAV hierarchical intelligent traffic offloading network optimization based on deep federated learning,” IEEE Internet Things J., 2024.
[38] Z. Chen, F. Wang, and J. Wang, “Joint optimization for service-caching, computation-offloading, and UAVs flight trajectories over rechargeable UAV-aided MEC using hierarchical multi-agent deep reinforcement learning,” Veh. Commun., vol. 50, p. 100844, 2024.
[39] Z. Lin et al.
, “Hybridrdn: Delay-optimal computation offloading for autonomous vehicle fleets based on RSMA,” IEEE Trans. Mobile Comput., vol. 24, no. 11, pp. 12456–12470, 2025.
[40] ——, “SMA-assisted distributed computation offloading in vehicular networks based on stochastic geometry,” IEEE Trans. Veh. Technol., vol. 74, no. 6, pp. 10047–10051, 2025.
[41] M. Farhoudi, M. Shokrnezhad, T. Taleb, R. Li, and J. Song, “Discovery of 6G services and resources in edge-cloud-continuum,” IEEE Netw., vol. 39, no. 3, pp. 223–232, 2024.
[42] M. Farhoudi, M. Shokrnezhad, S. Kianpisheh, and T. Taleb, “Deep learning based service composition in integrated aerial-terrestrial networks,” in International Conf. on Net. Softwarization, 2025, pp. 204–208.
[43] M. Farhoudi, M. Shokrnezhad, and T. Taleb, “QoS-aware service prediction and orchestration in cloud-network integrated beyond 5G,” in Proc. IEEE Global Telecommun. Conf., 2023, pp. 369–374.
[44] Y. Guo, C. You, C. Yin, and R. Zhang, “UAV trajectory and communication co-design: Flexible path discretization and path compression,” IEEE J. Sel. Areas Commun., vol. 39, no. 11, pp. 3506–3523, 2021.
[45] O. S. Oubbati et al., “A UAV-UGV cooperative system: Patrolling and energy management for urban monitoring,” IEEE Trans. Veh. Technol., vol. 74, no. 9, pp. 13521–13536, 2025.
[46] J. Alotaibi et al., “Optimizing disaster response with UAV-mounted RIS and HAP-enabled edge computing in 6G networks,” Journal of Network and Computer Applications, vol. 241, pp. 104–213, 2025.
[47] 3rd Generation Partnership Project (3GPP), “Study on Channel Model for Frequencies from 0.5 to 100 GHz,” ETSI, Tech. Rep. TR 38.901 V16.1.0, Nov. 2020.
[48] F. Faticanti et al., “Cutting throughput with the edge: App-aware placement in fog computing,” in IEEE International Conf. on Cyber Security and Cloud Comput., 2019, pp. 196–203.
[49] G. Pataki, M. Tural, and E. B.
Wong, “Basis reduction and the complexity of branch-and-bound,” in Proc. of ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Jan. 2010, pp. 1254–1261.
[50] H. v. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” Proc. of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, Mar. 2016.
[51] Z. Wang et al., “Dueling network architectures for deep reinforcement learning,” in Proc. Int. Conf. Mach. Learn., vol. 48, Jun. 2016, pp. 1995–2003.
[52] P. A. Lopez et al., “Microscopic traffic simulation using SUMO,” in International Conference on Intell. Transp. Syst. (ITSC), 2018, pp. 2575–2582.
[53] D. Oladimeji, K. Gupta, N. A. Kose, K. Gundogan, L. Ge, and F. Liang, “Smart transportation: An overview of technologies and applications,” Sensors, vol. 23, no. 8, 2023.
[54] T. Taleb et al., “6G system architecture: A service of services vision,” ITU J. on Future and Evol. Technol., vol. 3, no. 3, pp. 710–743, 2022.
[55] K. Deng, Z. He, H. Lin, H. Zhang, and D. Wang, “A novel channel-constrained model for 6G vehicular networks with traffic spikes,” in Proc. IEEE Wireless Commun. and Networking Conf., 2024, pp. 1–6.
[56] M. Shokrnezhad and T. Taleb, “An autonomous network orchestration framework integrating large language models with continual reinforcement learning,” IEEE Commun. Mag., vol. 63, no. 8, pp. 78–84, 2025.
[57] J. Prados-Garzon et al., “Deterministic 6GB-assisted quantum networks with slicing support: A new 6GB use case,” IEEE Netw., vol. 38, no. 1, pp. 87–95, 2024.
