DRL-Based Spectrum Sharing for RIS-Aided Local High-Quality Wireless Networks
This paper investigates a smart spectrum-sharing framework for reconfigurable intelligent surface (RIS)-aided local high-quality wireless networks (LHQWNs) within a mobile network operator (MNO) ecosystem. Although RISs are often considered potential…
Authors: Hamid Reza Hashempour, Mina Khadem, Eduard A. Jorswieck
IEEE, VOL. ?, NO. ?, 1

Hamid Reza Hashempour, Mina Khadem, and Eduard A. Jorswieck, Fellow, IEEE

Hamid Reza Hashempour is with the Center for Wireless Innovation (CWI), Queen's University Belfast, BT3 9DT Belfast, U.K. (e-mail: h.hashempoor@qub.ac.uk). Mina Khadem is with the Department of Engineering, Universitat Pompeu Fabra (UPF), 08002 Barcelona, Spain (e-mail: mina.khadem@upf.edu). Eduard A. Jorswieck is with the Institute for Communication Technology, Technische Universität Braunschweig, Germany (e-mail: e.jorswieck@tu-braunschweig.de). Eduard Jorswieck would like to thank the Federal Ministry of Research, Technology, and Space (BMFTR) for supporting the xG-RIC project as part of the research program Communication Systems "Souverän. Digital. Vernetzt." (grant number 16KIS2429K).

Abstract: This paper investigates a smart spectrum-sharing framework for reconfigurable intelligent surface (RIS)-aided local high-quality wireless networks (LHQWNs) within a mobile network operator (MNO) ecosystem. Although RISs are often considered potentially harmful due to interference, this work shows that properly controlled RISs can enhance quality of service (QoS). The proposed system enables temporary spectrum access for multiple vertical service providers (VSPs) by dynamically allocating radio resources according to traffic demand. The spectrum is divided into dedicated subchannels assigned to individual VSPs and reusable subchannels shared among multiple VSPs, while RIS is employed to improve propagation conditions. We formulate a multi-VSP utility maximization problem that jointly optimizes subchannel assignment, transmit power, and RIS phase configuration while accounting for spectrum access costs, RIS leasing costs, and QoS constraints. The resulting mixed-integer non-linear program (MINLP) is intractable using conventional optimization methods. To address this challenge, the problem is modeled as a Markov decision process (MDP) and solved using deep reinforcement learning (DRL). Specifically, deep deterministic policy gradient (DDPG) and soft actor–critic (SAC) algorithms are developed and compared. Simulation results show that SAC outperforms DDPG in convergence speed, stability, and achievable utility, reaching up to 96% of the exhaustive search benchmark and demonstrating the potential of RIS to improve overall utility in multi-VSP scenarios.

Index Terms: Spectrum sharing, reconfigurable intelligent surface (RIS), vertical service provider (VSP), deep reinforcement learning (DRL), licensed shared access (LSA), resource allocation.

I. INTRODUCTION

With the rapid growth of wireless communication networks, the increasing demand for spectrum resources has become a significant challenge. According to Cisco, global mobile data traffic is projected to grow exponentially, driven by the rise of emerging applications such as industrial automation, smart cities, and augmented reality (AR) [1]. The emergence of vertical service providers (VSPs), which lease spectrum from mobile network operators (MNOs) to deploy local high-quality wireless networks (LHQWNs), has been proposed as a solution to improve spectral efficiency and service customization [2].

However, traditional spectrum allocation schemes lack flexibility, leading to inefficient spectrum utilization. To address this issue, licensed shared access (LSA) and its evolution, evolved LSA (eLSA), have been proposed to enable controlled and dynamic spectrum sharing between MNOs and VSPs [3]. In the eLSA framework, spectrum resources are categorized into dedicated subchannels, allocated exclusively to a single VSP, and reusable subchannels, which can be shared among multiple VSPs simultaneously.
The operation, administration, and maintenance (OAM) system of the MNO is responsible for dynamically assigning spectrum resources to VSPs based on their demand and network conditions. However, interference among VSPs using reusable subchannels poses a major challenge, impacting quality of service (QoS) [4].

To enhance network performance, reconfigurable intelligent surfaces (RISs) have emerged as a promising technology. RISs can manipulate the wireless propagation environment to improve coverage, mitigate interference, and enhance spectral efficiency [5]. In the proposed framework, RISs are integrated into the eLSA ecosystem and can be leased by VSPs to satisfy application-specific QoS requirements through joint optimization [6].

To efficiently allocate resources, we formulate a utility maximization problem for VSPs that takes into account the costs of leasing subchannels and RIS elements, the cost of power consumption, and the revenue each VSP $v$ earns per unit of transmitted sum rate (in \$/Mbps). However, the formulated problem is a non-convex mixed-integer nonlinear programming (MINLP) model, which is difficult to solve due to interdependencies between subchannel allocation, base station (BS) power control, and RIS configuration [7].

To address the above challenges, we propose deep reinforcement learning (DRL)-based frameworks for dynamic spectrum sharing in RIS-aided local high-quality wireless networks. The considered resource allocation problem is first modeled as a Markov decision process (MDP), which captures the sequential and coupled nature of spectrum assignment, transmit power control, and RIS configuration. To tackle the resulting high-dimensional and hybrid continuous–discrete action space, we employ two representative DRL algorithms, namely deep deterministic policy gradient (DDPG) and soft actor–critic (SAC), where the latter provides improved stability and near-optimal performance.
The key contributions of this work are summarized as follows:

• We develop an RIS-assisted spectrum-sharing framework for the eLSA ecosystem with multiple VSPs leasing spectrum resources from MNOs. Within this framework, a multi-VSP utility maximization problem is formulated that jointly optimizes subchannel assignment, transmit power allocation, and RIS phase configuration while accounting for bandwidth leasing costs, RIS leasing costs, and QoS constraints.
• The resulting mixed discrete–continuous optimization problem is modeled as an MDP, enabling adaptive decision-making for joint spectrum allocation, power control, and RIS configuration under dynamic network conditions.
• Two state-of-the-art DRL algorithms, DDPG and SAC, are tailored to the considered problem. Appropriate action shaping and constraint-aware parameter mappings are designed to effectively handle hybrid resource variables.
• Simulation results demonstrate that the proposed SAC-based solution achieves faster convergence, improved stability, and higher utility compared to DDPG, approaching the performance of a near-optimal benchmark solution.

Extensive simulation results demonstrate that the proposed DRL-based framework significantly improves spectrum utilization and VSP utility. In particular, the SAC-based solution exhibits faster convergence and higher robustness than DDPG, and closely approaches the performance of exhaustive discrete search (EDS) under various network configurations.

A. Literature review

The existing literature can be broadly grouped into four categories, reflecting the topics addressed in this paper: 1) LSA-based spectrum sharing, 2) RIS-assisted networks, 3) spectrum sharing in high-quality networks, and 4) DRL for wireless resource management.
1) LSA-based spectrum sharing: LSA is a regulated spectrum sharing paradigm designed to provide predictable QoS for licensees while ensuring incumbent protection through predefined rules and regulatory supervision. In [8], an eLSA framework is proposed that combines a fair auction mechanism with UAV-assisted spectrum sensing and DRL to improve fairness and spectral efficiency in spectrum allocation among mobile network operators. In [9], an optimization framework for LSA systems is developed to jointly maximize spectral efficiency and energy efficiency during incumbent spectrum use. In [10], a dynamic LSA framework is proposed to optimize uplink and downlink power allocation, increasing spectral efficiency while protecting incumbents from harmful interference, with performance gains demonstrated across varying user densities and cell sizes. In [11], a multi-block ascending auction mechanism is introduced for LSA spectrum allocation, where the available bandwidth is partitioned into multiple blocks rather than assigned as a single unit to support more flexible spectrum assignment. However, none of these works address utility optimization in the eLSA framework by maximizing the total sum rate while accounting for the bandwidth leasing cost from MNOs.

2) RIS-aided networks: RISs are emerging as an energy-efficient approach to enhance spectral efficiency and QoS in future wireless networks. By adaptively configuring the phase of reflected signals, RISs enable passive beamforming that strengthens desired signals and can suppress interference with low hardware cost and convenient integration into existing infrastructure [12]–[16].
In [17], RIS-assisted designs are shown to substantially improve the achievable sum rate in multiple-input single-output (MISO) and multiple-input multiple-output (MIMO) systems by jointly optimizing active beamforming and RIS phase shifts, including formulations that explicitly enforce QoS constraints such as minimum rate requirements. Moreover, [18] shows that, by shaping propagation, RISs can create virtual line-of-sight links and partially compensate for blockage and fading effects, which is particularly beneficial in coverage-limited scenarios. In [19], RIS-assisted spectrum sharing is investigated by maximizing the secondary user rate subject to a primary user SINR target, via joint optimization of the secondary transmit power and RIS phase shifts. In addition, RIS-aided spectrum sensing is proposed to boost the received primary signal strength under severe path loss and fading, thereby improving detection performance for dynamic spectrum access [20]. However, to the best of our knowledge, the impact of RIS on utility maximization in a multi-VSP ecosystem has not been investigated.

3) Spectrum sharing in high-quality networks: High-quality local wireless networks, including private and non-public deployments as well as multi-tenant architectures, aim to deliver stringent service guarantees such as reliability, low latency, and isolation for vertical services within geographically limited areas. In [21], it is discussed that achieving these guarantees calls for tighter control of spectrum and radio access network (RAN) resources, particularly when infrastructure and spectrum are shared among multiple stakeholders. In [22], this requirement is further emphasized for public-network-integrated and RAN-sharing deployment scenarios, where resource sharing intensifies the need for coordinated spectrum and RAN management.
In [23], spectrum sharing in multi-tenant 5G is modeled and planned by leveraging tenant traffic characteristics and blocking behavior to guide spectrum allocation policies. In [24], a QoS-aware framework is proposed for beyond-5G and 6G spectrum management in which verticals act as spectrum leasers and the MNO allocates spectrum through auction mechanisms while enforcing minimum service requirements, with deep reinforcement learning used to learn efficient allocation policies under dynamic conditions. In [25], a utility-based interference coordination approach is introduced to improve the efficiency of local spectrum licensing by explicitly modeling spectrum holder utility and adjusting interference levels to increase aggregate utility across neighboring networks. In [26], a complementary analysis examines the tradeoffs for geographically constrained local services operating in shared bands, showing how different coordination mechanisms influence both performance and incentives under overlapping coverage.

4) DRL for wireless resource management: Wireless resource management problems (e.g., spectrum sharing, dynamic spectrum access (DSA), power control, and scheduling) are often time-varying, coupled across users, and difficult to solve optimally in real time. DRL has therefore been widely adopted to model such tasks and to learn resource control policies directly from interaction data [27]. In spectrum sharing, DRL has been used to learn allocation strategies that adapt to uncertain interference and traffic conditions. A heterogeneous-agent DRL approach for DSA in cognitive wireless networks is developed in [28], demonstrating how learning-based policies can coordinate spectrum access decisions in dynamic environments. Beyond access-only decisions, DRL has been applied to shared spectrum resource licensing and assignment problems.
Specifically, [29] develops a dynamic shared spectrum environment in which a centralized DRL agent assigns spectrum and jointly optimizes related resource decisions based on real-time demand. For local high-quality networks where verticals lease spectrum and require minimum service guarantees, DRL has also been integrated with economic mechanisms. In particular, [24] presents a spectrum management framework in which verticals lease spectrum and the MNO allocates resources through auctions while ensuring minimum service guarantees, with a DRL agent learning effective allocation policies under dynamic conditions. The work in [30] proposes an offline multi-agent reinforcement learning framework for radio resource management that learns scheduling policies for multiple access points while improving both sum rate and tail rate performance. DRL has also been applied to resource allocation in RIS-assisted systems: [31] studies DRL-based resource allocation in RIS-assisted vehicular networking scenarios, supporting the use of DRL when RIS configuration is part of the control policy.

B. Organization and Notation

The remainder of this paper is structured as follows: Section II presents the system model and problem formulation. Section III details the proposed DRL-based framework. Section IV provides numerical results and performance evaluation. Finally, Section V concludes the paper and outlines future research directions.

Notation: We use bold lowercase letters for vectors and bold uppercase letters for matrices. The notations $(\cdot)^T$ and $(\cdot)^H$ denote the transpose operator and the conjugate transpose operator, respectively. The symbol $\triangleq$ denotes a definition. The sets $\mathbb{R}^N$ and $\mathbb{C}^N$ represent real and complex $N$-dimensional vectors, respectively.
$\mathcal{CN}(0, \sigma^2)$ denotes a zero-mean complex circularly symmetric Gaussian random variable with variance $\sigma^2$. The operator $\mathrm{diag}\{\cdot\}$ constructs a diagonal matrix from its vector argument, and $[\mathbf{x}]_m$ denotes the $m$-th element of vector $\mathbf{x}$.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. Preliminaries

The concept of spectrum sharing has gained significant attention for enabling the provision of local area services, particularly those with QoS requirements. Several spectrum-sharing schemes exist; due to page limitations, we focus on the following three models:

• MNOs providing dedicated local area services within their licensed frequencies;
• MNOs subleasing parts of their spectrum to local service providers;
• Spectrum directly licensed to local area service providers.

For this work, we adopt the first approach, wherein local service areas are hosted by MNO networks, aiming to provision high-quality wireless networks within the MNO ecosystem. Because the local high-quality wireless networks are provisioned as part of the MNO's domain, they integrate more seamlessly with existing MNO infrastructure.

A functional architecture of this system is depicted in Fig. 1, where MNOs provide dedicated local area networks as service network areas. These areas belong to the MNO's domain, and the MNO's OAM system dynamically configures the radio resources according to the needs of the local wireless communication services. The local service areas are represented by entities that inform the MNO's OAM about their service requirements. These entities are responsible for ensuring minimal interference between service areas. Spectrum reuse is feasible between areas as long as there is no overlap, or if the MNO's OAM manages the interference effectively. This system fosters a dynamic and efficient spectrum allocation process while ensuring that service levels are met.
The interface between the MNO and the local service entities is vital for the success of this system. It should provide reliable monitoring and management to ensure that service level agreements (SLAs) are met for both the MNO and the VSPs. Additionally, private network infrastructures, such as femtocells, can be incorporated to densify the network, though the complexities of such deployments will require further study. The deployment of this scheme requires a radio access technology (RAT) that is compatible with the MNO's infrastructure, ensuring a harmonious coexistence of the shared resources.

A relevant approach for spectrum sharing in local area services is the European LSA framework, which ensures predictable QoS by allowing both the incumbent and LSA licensees to access spectrum while protecting against harmful interference. The eLSA framework introduces several considerations for its implementation [3]:

• Extending the LSA framework to include VSPs;
• Identifying appropriate frequency bands and establishing a clear sharing framework for VSPs;
• Simplifying the LSA licensing process to accommodate a high number of VSPs;
• Defining allowance zones, which specify local deployment areas where licensees can transmit at designated frequencies until the time allowance expires;
• Permitting allowance zones in both indoor and outdoor environments;
• Supporting flexible deployment durations, ranging from several hours to years, thus enabling adaptable spectrum allocation procedures;
• Ensuring deterministic channel arrangements, such as fixed channel plans, to meet the strict QoS requirements of local high-quality wireless networks.

In this paper, we focus on leveraging the LSA/eLSA framework to dynamically allocate spectrum resources, using RIS to enhance the network's QoS and spectral efficiency, as discussed in the following sections.

Fig. 1: Functional use case for the integration of local high-quality wireless networks into an MNO ecosystem as service network areas. (Diagram labels: MNO domain; OAM (Operation, Administration and Maintenance); VSP 1 to VSP 4; MNO spectrum resources; dedicated subchannel; reusable subchannel.)

TABLE I: Key notations used in this paper.

Notation | Definition
--- | ---
$\mathcal{V}$ / $V$ / $v$ | Set / cardinality / index of VSPs
$\mathcal{B}$ / $B$ / $b$ | Set / cardinality / index of BSs
$\mathcal{J}$ / $J$ / $j$ | Set / cardinality / index of RISs
$\mathcal{K}$ / $K$ / $k$ | Set / cardinality / index of users
$\mathcal{C}$ / $C$ / $c$ | Set / cardinality / index of subchannels
$\mathcal{C}_v^r$, $\mathcal{C}_v^d$ | Sets of reusable (shared) and dedicated subchannels per VSP
$\mathcal{M}_j$ / $M_j$ | Set / cardinality of reflecting elements of RIS $j$
$\delta_c$ | Reusability indicator of subchannel $c$
$\omega_{k,v}^{b,c}$ | Subchannel assignment indicator
$p_{k,v}^{b,c}$ | Transmit power from BS $b$ to user $k$ of VSP $v$ on subchannel $c$ (W)
$P_{\max}^{b,v}$ | Maximum transmit power of BS $b$ for VSP $v$ (W)
$\phi_{k,v}^{b}$ | BS association indicator
$d_{k,v}^{j}$ | RIS association indicator
$\boldsymbol{\Theta}_j$ | Phase-shift matrix of RIS $j$
$B_c$ | Bandwidth of subchannel $c$ (Hz)
$N_0$ | Noise power spectral density (dBm)
$\beta$ | Path-loss exponent
$U_v$ | Utility of VSP $v$ (\$)

B. System Model

In this section, we first briefly introduce the building blocks of the network. Table I presents the main parameters and variables to enhance the readability of the paper.
Without loss of generality, assume that the ecosystem comprises one MNO and multiple VSPs, as shown in Fig. 2. Let $\mathcal{V} = \{1, 2, \ldots, v, \ldots, V\}$ denote the set of VSPs in the MNO domain. For each VSP $v \in \mathcal{V}$, the set $\mathcal{B}_v = \{1, 2, \ldots, b_v, \ldots, B_v\}$ denotes the BSs serving its users, while $\mathcal{K}_v = \{1, 2, \ldots, k_v, \ldots, K_v\}$ represents the corresponding set of users. Moreover, $\mathcal{B} = \{\mathcal{B}_1, \mathcal{B}_2, \ldots, \mathcal{B}_V\}$ denotes the collection of BS sets associated with all VSPs, and the set of all users is denoted by $\mathcal{K} = \{\mathcal{K}_1, \mathcal{K}_2, \ldots, \mathcal{K}_v, \ldots, \mathcal{K}_V\}$. To improve the rate of users in the VSPs, a set $\mathcal{J} = \{1, 2, \ldots, j, \ldots, J\}$ of RISs is utilized. The RISs are part of the MNO infrastructure and are used by some VSPs to meet the QoS requirements of their users based on their applications.

We consider an enhanced LSA-based spectrum sharing method, where the spectrum is allocated to VSPs based on their demand. The MNO shares a set $\mathcal{C} = \{1, 2, \ldots, c, \ldots, C\}$ of available orthogonal subchannels with the VSPs. For each VSP $v$, we define a set $\mathcal{C}_v^d$ of dedicated subchannels, which assure the agreed level of QoS, and a set $\mathcal{C}_v^r$ of reusable subchannels, which can be used when the location areas of the VSPs do not overlap or when the MNO can handle the interference, where $\mathcal{C}_v^r, \mathcal{C}_v^d \subset \mathcal{C}$. In addition, we define a binary indicator variable $\delta_c$, which equals 1 if subchannel $c$ is reusable (i.e., $c \in \mathcal{C}_v^r$) and can be shared among VSPs, and 0 otherwise. All subchannels have the same bandwidth, denoted by $B_c$.

Let $\omega_{k,v}^{b,c}$ be the downlink binary subchannel assignment indicator of user $k$ served by BS $b$ of VSP $v$ over subchannel $c$, defined as
\[
\omega_{k,v}^{b,c} =
\begin{cases}
1, & \text{if subchannel } c \text{ is assigned to user } k \text{ in BS } b \text{ of VSP } v; \\
0, & \text{otherwise}.
\end{cases}
\tag{1}
\]
Subchannel $c$ cannot be assigned to more than $L_c$ users in the coverage of one BS simultaneously. Therefore, we introduce the following subchannel allocation constraint:
\[
\sum_{k \in \mathcal{K}_v} \omega_{k,v}^{b,c} \le L_c, \quad \forall v \in \mathcal{V},\ \forall b \in \mathcal{B}_v,\ \forall c \in \mathcal{C}.
\tag{2}
\]

Fig. 2: System model of an RIS-assisted multi-VSP wireless network within an MNO ecosystem with dedicated and reusable subchannels. (Diagram labels: MNO; VSP 1 to VSP 3; BS; RIS; user link; desired signal; intra-cell interference; inter-cell interference; dedicated subchannel; reusable subchannel.)

Let $p_{k,v}^{b,c} \ge 0$ denote the transmit power allocated by BS $b$ of VSP $v$ to user $k$ on subchannel $c$. The per-BS transmit power constraint is
\[
\sum_{c \in \mathcal{C}} \sum_{k \in \mathcal{K}_v} p_{k,v}^{b,c} \le P_{\max}^{b,v}, \quad \forall v \in \mathcal{V},\ \forall b \in \mathcal{B}_v,
\tag{3}
\]
and power is active only when the user is scheduled:
\[
0 \le p_{k,v}^{b,c} \le \omega_{k,v}^{b,c} P_{\max}^{b,v}, \quad \forall v, b, k, c.
\tag{4}
\]
Considering that different BSs may serve different sets of users, we define the binary BS-association indicator $\phi_{k,v}^{b} \in \{0, 1\}$, where $\phi_{k,v}^{b} = 1$ if user $k \in \mathcal{K}_v$ is associated with BS $b \in \mathcal{B}_v$ of VSP $v$, and $\phi_{k,v}^{b} = 0$ otherwise. Each user can be associated with at most one BS at any time, i.e.,
\[
\sum_{b \in \mathcal{B}_v} \phi_{k,v}^{b} \le 1, \quad \forall v \in \mathcal{V},\ \forall k \in \mathcal{K}_v.
\tag{5}
\]
Moreover, each user can be scheduled on at most one subchannel from its associated BS. This constraint is enforced by
\[
\sum_{b \in \mathcal{B}_v} \sum_{c \in \mathcal{C}} \omega_{k,v}^{b,c} \le 1, \quad \forall v \in \mathcal{V},\ \forall k \in \mathcal{K}_v.
\tag{6}
\]
Finally, scheduling is only allowed if the corresponding BS association holds, i.e.,
\[
\omega_{k,v}^{b,c} \le \phi_{k,v}^{b}, \quad \forall v \in \mathcal{V},\ \forall b \in \mathcal{B}_v,\ \forall k \in \mathcal{K}_v,\ \forall c \in \mathcal{C}.
\tag{7}
\]
The reflection-coefficient matrix of the $j$-th RIS is defined as
\[
\boldsymbol{\Theta}_j \triangleq \mathrm{diag}\left\{ e^{j\theta_{j,1}}, e^{j\theta_{j,2}}, \ldots, e^{j\theta_{j,M_j}} \right\}, \quad \forall j \in \mathcal{J},
\tag{8}
\]
where $\theta_{j,m} \in [0, 2\pi)$ for all $m \in \mathcal{M}_j$.
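As an illustration of how the scheduling, power, and association constraints (2)–(7) interact, the following minimal Python sketch checks them for a single VSP. The function name `violated_constraints`, the nested-list layout indexed as `[b][k][c]`, and the tolerance `tol` are assumptions made for this example only and are not part of the paper.

```python
from typing import List

Grid = List[List[List[float]]]  # hypothetical layout, indexed as [b][k][c]


def violated_constraints(omega: Grid, p: Grid, phi: List[List[int]],
                         L_c: int, P_max: List[float],
                         tol: float = 1e-9) -> List[str]:
    """Return the labels of constraints (2)-(7) violated by one VSP."""
    B, K, C = len(omega), len(omega[0]), len(omega[0][0])
    bs, users, chans = range(B), range(K), range(C)
    out = []
    # (2): at most L_c users share subchannel c within one BS.
    if any(sum(omega[b][k][c] for k in users) > L_c
           for b in bs for c in chans):
        out.append("(2)")
    # (3): the total transmit power of each BS stays within its budget.
    if any(sum(p[b][k][c] for k in users for c in chans) > P_max[b] + tol
           for b in bs):
        out.append("(3)")
    # (4): power is nonnegative and nonzero only on scheduled resources.
    if any(p[b][k][c] < -tol or p[b][k][c] > omega[b][k][c] * P_max[b] + tol
           for b in bs for k in users for c in chans):
        out.append("(4)")
    # (5): each user is associated with at most one BS.
    if any(sum(phi[b][k] for b in bs) > 1 for k in users):
        out.append("(5)")
    # (6): each user is scheduled on at most one subchannel.
    if any(sum(omega[b][k][c] for b in bs for c in chans) > 1
           for k in users):
        out.append("(6)")
    # (7): scheduling requires the corresponding BS association.
    if any(omega[b][k][c] > phi[b][k]
           for b in bs for k in users for c in chans):
        out.append("(7)")
    return out


# Toy instance: one BS, two users, two subchannels.
omega = [[[1, 0], [0, 1]]]
phi = [[1, 1]]
p_ok = [[[0.5, 0.0], [0.0, 0.4]]]
p_bad = [[[0.5, 0.2], [0.0, 0.4]]]  # power on an unscheduled subchannel
print(violated_constraints(omega, p_ok, phi, L_c=1, P_max=[1.0]))   # -> []
print(violated_constraints(omega, p_bad, phi, L_c=1, P_max=[1.0]))  # -> ['(3)', '(4)']
```

In the second call, the extra 0.2 W both exceeds the 1 W budget in (3) and sits on a subchannel the user is not scheduled on, which is what (4) rules out.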
Furthermore, we define $d_j^k$ as a binary indicator denoting whether user $k$ lies within the effective coverage region of RIS $j$, where $d_j^k = 1$ if RIS $j$ can assist user $k$, and $d_j^k = 0$ otherwise. Each user can be associated with at most one RIS, i.e.,
\[
\sum_{j \in \mathcal{J}} d_j^k \le 1, \quad \forall k \in \mathcal{K}.
\tag{9}
\]
The channel coefficients from BS $b$ to user $k$, from RIS $j$ to user $k$, and from BS $b$ to RIS $j$ on subchannel $c$ are denoted by $h_{b,k}^{c}$, $\mathbf{r}_{j,k}^{c} \in \mathbb{C}^{M_j \times 1}$, and $\mathbf{g}_{b,j}^{c} \in \mathbb{C}^{M_j \times 1}$, respectively. Then, the received interference at user $k$, associated with BS $b$ of VSP $v$ on subchannel $c$, is expressed as $I_{k,v}^{b,c} = I_1 + I_2 + I_3$, where
\[
I_1 \triangleq \sum_{u \in \mathcal{K}_v \setminus \{k\}} \omega_{u,v}^{b,c}\, p_{u,v}^{b,c} \left| \tilde{h}_{b,k}^{c} \right|^2
\tag{10}
\]
denotes the intra-cell interference,
\[
I_2 \triangleq \sum_{\substack{b' \in \mathcal{B}_v \\ b' \ne b}} \sum_{u \in \mathcal{K}_v} \omega_{u,v}^{b',c}\, p_{u,v}^{b',c} \left| \tilde{h}_{b',k}^{c} \right|^2
\tag{11}
\]
represents the intra-VSP interference, and
\[
I_3 \triangleq \delta_c \sum_{\substack{v' \in \mathcal{V} \\ v' \ne v}} \sum_{b' \in \mathcal{B}_{v'}} \sum_{u \in \mathcal{K}_{v'}} \omega_{u,v'}^{b',c}\, p_{u,v'}^{b',c} \left| \tilde{h}_{b',k}^{c} \right|^2
\tag{12}
\]
corresponds to the inter-VSP interference. The effective RIS-assisted channel is defined as
\[
\tilde{h}_{b,k}^{c} \triangleq h_{b,k}^{c} + \sum_{j \in \mathcal{J}} d_j^k \left( \mathbf{r}_{j,k}^{c} \right)^{H} \boldsymbol{\Theta}_j\, \mathbf{g}_{b,j}^{c}.
\tag{13}
\]
It is worth noting that an RIS is a passive reflecting element and does not actively generate interference. In this work, we therefore consider the RIS-reflected components of both the desired and interfering signals propagating through the BS–RIS–user cascaded links.

Remark. Each RIS is assumed to be deployed and controlled by its geographically nearest BS. Hence, the RIS–BS association is determined by the network topology and is not treated as an optimization variable. This assumption is consistent with practical RIS deployments, where each RIS is connected to a single BS controller via a wired or wireless control link. Due to severe path loss, blockage effects, and cascaded double fading, the links between users and non-associated RISs are assumed negligible and are therefore ignored. Consequently, each user can benefit from at most one RIS, and cross-RIS reflections are not considered in the received signal model.

The received signal-to-interference-plus-noise ratio (SINR) at user $k$ from the $b$-th BS over subchannel $c$ when decoding its own signal, denoted by $\gamma_{k,v}^{b,c}$, is obtained as
\[
\gamma_{k,v}^{b,c} = \frac{\omega_{k,v}^{b,c}\, p_{k,v}^{b,c} \left| h_{b,k}^{c} + \sum_{j \in \mathcal{J}} d_j^k \left( \mathbf{r}_{j,k}^{c} \right)^{H} \boldsymbol{\Theta}_j\, \mathbf{g}_{b,j}^{c} \right|^2}{I_{k,v}^{b,c} + B_c N_0}, \quad \forall v \in \mathcal{V},\ \forall b \in \mathcal{B}_v,\ \forall k \in \mathcal{K}_v,\ \forall c \in \mathcal{C},
\tag{14}
\]
where $N_0$ stands for the power spectral density of the noise. The corresponding achievable data rate is
\[
R_{k,v}^{b,c} = B_c \log_2\left( 1 + \gamma_{k,v}^{b,c} \right).
\tag{15}
\]
Thus, the total rate of the $k$-th user is
\[
R_{k,v} = \sum_{b \in \mathcal{B}_v} \sum_{c \in \mathcal{C}} R_{k,v}^{b,c}, \quad \forall k \in \mathcal{K}_v,\ \forall v \in \mathcal{V}.
\tag{16}
\]
All users seek to maximize their transmission rate while meeting a minimum QoS requirement $R_{k,v}^{\mathrm{th}}$. Thus, we enforce that the rate $R_{k,v}$ of the $k$-th user is not less than $R_{k,v}^{\mathrm{th}}$.

C. Problem Formulation

We aim to maximize the utility of the VSPs, where the utility of each VSP consists of a revenue function and a cost function. In the following, we formulate the cost, revenue, and utility functions in turn.

• Cost Function: As part of our system model, we take into account four types of costs: reusable subchannels, dedicated subchannels, RIS, and transmitted power. Accordingly, the total cost function of each VSP is denoted by $U_v^{\mathrm{Cost}}$ and defined as
\[
U_v^{\mathrm{Cost}} = \underbrace{N_v^r \lambda_r + N_v^d \lambda_d}_{\text{Cost of spectrum}} + \underbrace{N_v^j \psi_j}_{\text{Cost of RIS}} + \underbrace{\alpha B_c \sum_{b \in \mathcal{B}_v} \sum_{k \in \mathcal{K}_v} \sum_{c \in \mathcal{C}} p_{k,v}^{b,c}}_{\text{Cost of transmitted power}}, \quad \forall v \in \mathcal{V},
\tag{17}
\]
where $N_v^r$, $N_v^d$, and $N_v^j$ are the numbers of reusable subchannels, dedicated subchannels, and RISs used for transmission, respectively. These quantities are known to both the VSPs and the MNO. Let $\lambda_r > 0$, $\lambda_d > 0$, and $\psi_j > 0$ represent the price of each reusable subchannel, the price of each dedicated subchannel, and the leasing price of each RIS, respectively.
Let $B_c$ represent the bandwidth of each subchannel, assuming that all subchannels have the same bandwidth. The RISs belong to the MNO, and $\alpha > 0$ represents the unit price of the transmitted power (in \$/W/Hz).

• Revenue Function: Let $\beta_v > 0$ denote the profit of VSP $v$ per unit transmitted data rate (in \$/Mbps). We denote the revenue function of each VSP by $U_v^{\mathrm{Revenue}}$, which is formulated as
\[
U_v^{\mathrm{Revenue}} = \beta_v \sum_{b \in \mathcal{B}_v} \sum_{k \in \mathcal{K}_v} \sum_{c \in \mathcal{C}} R_{k,v}^{b,c} = \beta_v R_v, \quad \forall v \in \mathcal{V}.
\tag{18}
\]

• Utility Function: The utility function of VSP $v$ is defined as the difference between its revenue and cost:
\[
U_v = \Phi_1 U_v^{\mathrm{Revenue}} - \Phi_2 U_v^{\mathrm{Cost}}, \quad \forall v \in \mathcal{V},
\tag{19}
\]
where $\Phi_1, \Phi_2 > 0$ are scaling factors used to balance the contributions of the revenue and cost terms in the utility function.

Our objective is to jointly optimize the subchannel allocation, BS association, RIS association, and transmit power allocation so as to maximize the overall utility of the VSPs, while guaranteeing the QoS requirements of all users. Mathematically, the utility maximization problem for all VSPs is formulated as
\[
\begin{aligned}
\max_{\boldsymbol{\omega},\, \mathbf{p},\, \boldsymbol{\phi},\, \boldsymbol{\theta}} \quad & \sum_{v \in \mathcal{V}} U_v && \text{(20a)} \\
\text{s.t.} \quad & R_{k,v} \ge R_{k,v}^{\mathrm{th}}, \quad \forall v \in \mathcal{V},\ \forall k \in \mathcal{K}_v, && \text{(20b)} \\
& \omega_{k,v}^{b,c},\ \phi_{k,v}^{b} \in \{0, 1\}, \quad \forall v, b, k, c, && \text{(20c)} \\
& \theta_{j,m} \in [0, 2\pi), \quad \forall j \in \mathcal{J},\ \forall m \in \mathcal{M}_j, && \text{(20d)} \\
& \text{(2)–(7)}. && \text{(20e)}
\end{aligned}
\]
The boldface symbols denote the collections of the corresponding optimization variables, defined as
\[
\boldsymbol{\omega} \triangleq \{ \omega_{k,v}^{b,c} \mid \forall k \in \mathcal{K}_v,\ \forall v \in \mathcal{V},\ \forall b \in \mathcal{B}_v,\ \forall c \in \mathcal{C} \},
\tag{21}
\]
\[
\mathbf{p} \triangleq \{ p_{k,v}^{b,c} \mid \forall k \in \mathcal{K}_v,\ \forall v \in \mathcal{V},\ \forall b \in \mathcal{B}_v,\ \forall c \in \mathcal{C} \},
\tag{22}
\]
\[
\boldsymbol{\phi} \triangleq \{ \phi_{k,v}^{b} \mid \forall k \in \mathcal{K}_v,\ \forall v \in \mathcal{V},\ \forall b \in \mathcal{B}_v \},
\tag{23}
\]
\[
\boldsymbol{\theta} \triangleq \{ \theta_{j,m} \mid \forall j \in \mathcal{J},\ \forall m \in \mathcal{M}_j \}.
\tag{24}
\]
Moreover, constraint (20c) ensures that the corresponding decision variables are binary. The proposed problem formulation (20) is a non-convex mixed-integer nonlinear programming (MINLP) problem, which is difficult to solve in polynomial time.
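To make the chain from the effective channel (13) through the SINR (14) and rate (15) to the utility (19) concrete, the following Python sketch evaluates these expressions for a single user assisted by one RIS. It is only an illustration under stated assumptions: the function names, the toy channel values, the use of a linear noise density in W/Hz, and the conversion of the rate to Mbps before applying $\beta_v$ are all choices made for this example and are not specified in the paper.

```python
import cmath
import math
from typing import Sequence


def effective_channel(h: complex, r: Sequence[complex], g: Sequence[complex],
                      theta: Sequence[float]) -> complex:
    """Eq. (13) for one RIS: h + r^H diag(e^{j theta}) g."""
    reflected = sum(rm.conjugate() * cmath.exp(1j * th) * gm
                    for rm, th, gm in zip(r, theta, g))
    return h + reflected


def sinr(p_w: float, h_eff: complex, interference_w: float,
         bandwidth_hz: float, n0_w_per_hz: float) -> float:
    """Eq. (14) for a scheduled user (omega = 1); N0 is taken in W/Hz here."""
    return p_w * abs(h_eff) ** 2 / (interference_w + bandwidth_hz * n0_w_per_hz)


def rate_bps(bandwidth_hz: float, gamma: float) -> float:
    """Eq. (15): achievable rate in bit/s."""
    return bandwidth_hz * math.log2(1.0 + gamma)


def vsp_utility(rate_mbps: float, total_power_w: float, n_reusable: int,
                n_dedicated: int, n_ris: int, beta_v: float, lam_r: float,
                lam_d: float, psi: float, alpha: float, bandwidth_hz: float,
                phi1: float = 1.0, phi2: float = 1.0) -> float:
    """Eqs. (17)-(19): scaled revenue minus scaled cost for one VSP."""
    revenue = beta_v * rate_mbps                                  # (18)
    cost = (n_reusable * lam_r + n_dedicated * lam_d + n_ris * psi
            + alpha * bandwidth_hz * total_power_w)               # (17)
    return phi1 * revenue - phi2 * cost                           # (19)


# Toy two-element RIS (all values are illustrative, not from the paper).
h = 1 + 0j
r = [1j, 1 + 0j]
g = [1 + 0j, 1j]
# Phases that rotate every reflected path onto the direct path.
aligned = [(cmath.phase(h) - cmath.phase(rm.conjugate() * gm)) % (2 * math.pi)
           for rm, gm in zip(r, g)]
print(abs(effective_channel(h, r, g, [0.0, 0.0])))  # approx. 1.0
print(abs(effective_channel(h, r, g, aligned)))     # approx. 3.0
```

With zero phase shifts the two reflected paths cancel each other in this toy example, whereas the aligned phases triple the channel magnitude. The resulting increase in SINR and rate is the mechanism through which leasing an RIS can pay for its cost $\psi_j$ in (17).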
Moreover, the subchannel allocation, BS association, and power control strategies of each VSP are strongly coupled due to mutual interference. In addition, the dynamic wireless channel conditions and time-varying network environment further complicate the problem, making it challenging to solve using conventional optimization methods. These challenges motivate the adoption of a DRL-based solution, as described in the next section.

III. DRL-BASED SOLUTION

In this section, we propose two DRL-based frameworks to solve the utility maximization problem (20). Specifically, we first model the joint optimization problem as an MDP. Then, we develop DRL solutions based on the DDPG and SAC algorithms, which are well suited for high-dimensional continuous control problems with coupled decision variables. These methods enable efficient learning of joint scheduling, power allocation, and RIS configuration policies under dynamic network conditions.

A. MDP Formulation

We formulate the joint resource allocation problem as an MDP defined by the tuple
\[
\mathcal{M} \triangleq \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle,
\tag{25}
\]
where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ denotes the action space, $\mathcal{P}$ represents the state transition dynamics, and $\mathcal{R}$ is the reward function.

1) State Space: The state at time slot $t$, denoted by $s_t \in \mathcal{S}$, summarizes the essential information of the network environment required for sequential decision-making. It is defined as
\[
s_t = \left\{ \mathbf{H}(t),\ \mathbf{R}(t),\ a_{t-1} \right\},
\tag{26}
\]
where $\mathbf{H}(t)$ collects the instantaneous channel state information (CSI) of all communication links in the network, given by
\[
\mathbf{H}(t) \triangleq \left\{ h_{b,k}^{c}(t),\ \mathbf{g}_{b,j}^{c}(t),\ \mathbf{r}_{j,k}^{c}(t) \mid \forall b, k, j, c \right\}.
\tag{27}
\]
Moreover, $\mathbf{R}(t)$ denotes the vector of achieved user data rates at time slot $t$, i.e.,
\[
\mathbf{R}(t) \triangleq \left\{ R_{k,v}(t) \mid \forall v \in \mathcal{V},\ \forall k \in \mathcal{K}_v \right\}.
\tag{28}
\]
Finally, $a_{t-1}$ represents the previously executed feasible control action, including scheduling $\boldsymbol{\omega}$, transmit power $\mathbf{p}$, BS association $\boldsymbol{\phi}$, RIS association $d$, and RIS phase shifts $\boldsymbol{\theta}$. By incorporating the previous action, the state definition preserves the Markov property and enables the agent to capture the impact of past decisions on the current network dynamics.

2) Action Space: At each time step $t$, the agent selects a control action $a_t \in \mathcal{A}$, which jointly determines the scheduling, power allocation, and RIS configuration. The feasible action is defined as
\[
a_t = \left\{ \boldsymbol{\omega}(t),\ \mathbf{p}(t),\ \boldsymbol{\phi}(t),\ \boldsymbol{\theta}(t) \right\},
\tag{29}
\]
where $\boldsymbol{\omega}(t)$ and $\boldsymbol{\phi}(t)$ denote the discrete scheduling and BS association variables, respectively, while $\mathbf{p}(t)$ and $\boldsymbol{\theta}(t)$ represent the continuous transmit power allocation and RIS phase shifts.

Since standard DRL algorithms operate over continuous action spaces, the actor network outputs a raw continuous action $\tilde{a}_t$, consisting of relaxed representations of the discrete variables and unconstrained continuous values. This raw action is subsequently mapped onto the feasible set $\mathcal{A}$ through deterministic projection, thresholding, and normalization operations. In particular, the relaxed binary variables are converted into feasible binary decisions using element-wise thresholding, i.e.,
\[
x =
\begin{cases}
1, & \text{if } \tilde{x} \ge 0.5, \\
0, & \text{otherwise},
\end{cases}
\tag{30}
\]
while the continuous variables are clipped and rescaled to satisfy the corresponding box constraints.

3) State Transition: The state transition probability $\mathcal{P}(s(t+1) \mid s(t), a(t))$ is governed by the wireless channel evolution, user mobility, traffic dynamics, and the applied control actions. Since these dynamics are generally unknown and time-varying, a model-free DRL approach is adopted.

4) Reward Function: The immediate reward at time $t$ is designed based on the system utility and QoS satisfaction.
It is defined as

$$r(t) = \sum_{v \in \mathcal{V}} U_v(t) - \lambda_{\mathrm{qos}} \sum_{v \in \mathcal{V}} \sum_{k \in \mathcal{K}_v} \max\{ 0, \ R_{k,v}^{\mathrm{th}} - R_{k,v}(t) \}, \tag{31}$$

where the first term corresponds to the total utility of all VSPs, and the second term penalizes violations of QoS constraints with a weight $\lambda_{\mathrm{qos}} > 0$.

B. DDPG-Based Learning Framework

To solve the MDP formulated in Section III-A, we adopt the DDPG algorithm, which is particularly suitable for high-dimensional continuous control problems with coupled decision variables. In our setting, the action space consists of continuous transmit powers and RIS phase shifts, as well as relaxed representations of discrete scheduling and association decisions, making DDPG a natural choice.

DDPG follows an actor–critic architecture, where the actor network learns a deterministic policy that maps the observed system state to a control action, while the critic network evaluates the quality of the selected action through a learned Q-function. By combining policy gradient updates with value-function approximation, DDPG enables stable learning in complex and nonconvex environments. The objective of the learning process is to maximize the expected long-term discounted return

$$\max_{\pi} \ \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(t) \right], \tag{32}$$

where $r(t)$ is the instantaneous reward defined in (31) and $\gamma \in (0, 1)$ is the discount factor.

1) Learning Procedure: At each time step $t$, the actor network outputs a raw continuous action

$$\tilde{a}_t = \pi(s_t; \psi), \tag{33}$$

where $\pi(\cdot)$ denotes the deterministic policy parameterized by $\psi$. As described in the MDP formulation, $\tilde{a}_t$ contains continuous relaxations of the hybrid decision variables. It is therefore mapped onto the feasible action set $\mathcal{F}$ via a deterministic projection operator

$$a_t = \Pi_{\mathcal{F}}(\tilde{a}_t), \tag{34}$$

which enforces all system constraints, including power budgets, scheduling feasibility, and RIS phase bounds. After executing $a_t$, the agent observes the reward $r_t = r(s_t, a_t)$ and the next state $s_{t+1}$.
The transition tuple $(s_t, a_t, r_t, s_{t+1})$ is stored in the replay buffer $\mathcal{D}$. The critic network $Q(s, a; \xi)$ is trained by minimizing the temporal-difference (TD) loss

$$L_{\xi}^{C} = \mathbb{E}\left[ \big( r_t + \gamma Q'(s_{t+1}, \pi'(s_{t+1}); \xi') - Q(s_t, a_t; \xi) \big)^2 \right], \tag{35}$$

where $(\pi', Q')$ denote the corresponding target actor and critic networks. The actor network is updated by maximizing the critic's output, which is equivalently formulated as minimizing the following surrogate loss:

$$L_{\psi}^{A} = -Q(s_t, \pi(s_t; \psi); \xi). \tag{36}$$

Accordingly, the actor and critic parameters are updated via gradient descent as

$$\psi \leftarrow \psi - \eta_A \nabla_{\psi} L_{\psi}^{A}, \tag{37}$$

$$\xi \leftarrow \xi - \eta_C \nabla_{\xi} L_{\xi}^{C}, \tag{38}$$

where $\eta_A$ and $\eta_C$ denote the learning rates of the actor and critic networks, respectively. The corresponding target networks are softly updated using Polyak averaging:

$$\psi' \leftarrow \tau \psi + (1 - \tau) \psi', \tag{39}$$

$$\xi' \leftarrow \tau \xi + (1 - \tau) \xi', \tag{40}$$

where $\tau \in (0, 1)$ is the soft update factor.

2) DDPG Algorithm: The complete DDPG-based learning framework for solving Problem (20) is summarized in Algorithm 1.

Algorithm 1 DDPG-Based Learning for Solving Problem (20)
 1: Initialize replay buffer $\mathcal{D}$
 2: Initialize actor network $\pi(s; \psi)$ and critic network $Q(s, a; \xi)$
 3: Initialize target networks $\pi'(s; \psi') \leftarrow \pi(s; \psi)$, $Q'(s, a; \xi') \leftarrow Q(s, a; \xi)$
 4: for episode $e = 1, 2, \ldots, E$ do
 5:   Observe initial state $s_0$ defined in (26)
 6:   for time step $t = 0, 1, \ldots, T - 1$ do
 7:     Select raw action $\tilde{a}_t = \pi(s_t; \psi)$ using (33)
 8:     Project $\tilde{a}_t$ onto the feasible set: $a_t = \Pi_{\mathcal{F}}(\tilde{a}_t)$ using (34)
 9:     Execute $a_t$ and observe $r_t$ from (31) and next state $s_{t+1}$
10:     Store $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$
11:     Sample a mini-batch from $\mathcal{D}$
12:     Update critic by minimizing (35)
13:     Update actor by minimizing (36)
14:     Update target networks using (39)–(40)
15:     $s_t \leftarrow s_{t+1}$
16:   end for
17: end for

C. SAC-Based Learning Framework

To further enhance exploration efficiency and learning stability, we also adopt the SAC algorithm to solve the utility maximization problem (20). SAC is an off-policy actor–critic method that incorporates an entropy-regularized objective, enabling robust learning in high-dimensional and nonconvex control problems. This property is particularly desirable in our setting, where the action space consists of continuous power variables, RIS phase shifts, and relaxed representations of discrete scheduling and association decisions.

Unlike DDPG, which learns a deterministic policy, SAC learns a stochastic policy that maximizes both the expected cumulative reward and the entropy of the policy. Specifically, the SAC objective is given by [32]

$$\max_{\pi} \ \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} \Big( r(t) + \alpha \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right], \tag{41}$$

where $\mathcal{H}(\pi(\cdot \mid s_t))$ denotes the differential entropy of the policy at state $s_t$, and $\alpha > 0$ is the temperature parameter controlling the tradeoff between reward maximization and exploration, which is automatically tuned during training.

1) Feasibility Projection (Environment Mapping): As described in the MDP formulation, the actor outputs a raw continuous action $\tilde{a}(t)$, which may not directly satisfy the system constraints in (20). Therefore, the executed control action is obtained through a deterministic feasibility projection

$$a(t) = \Pi_{\mathcal{F}}\big(\tilde{a}(t)\big), \tag{42}$$

where $\Pi_{\mathcal{F}}(\cdot)$ enforces all scheduling, power, and RIS-related constraints via thresholding, normalization, and clipping operations. This projection mechanism is identically applied to both DDPG and SAC to ensure a fair comparison.

2) Critic Update: In the SAC framework, the actor network parameterized by $\psi$ outputs a stochastic policy $\pi(a_t \mid s_t; \psi)$, from which a raw action is sampled as

$$\tilde{a}_t \sim \pi(\cdot \mid s_t; \psi), \tag{43}$$

and then mapped to a feasible action $a_t = \Pi_{\mathcal{F}}(\tilde{a}_t)$ via (42).
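The feasibility projection in (42) is described only in terms of thresholding, normalization, and clipping, so the following is a minimal sketch of one way such an operator could look, not the authors' implementation. It assumes that the relaxed binary variables lie in $[0, 1]$, that the power constraint is a single sum-power budget `p_max`, and that the phases are wrapped modulo $2\pi$; the function name `project_action` and its arguments are hypothetical.

```python
import math


def project_action(raw_binary, raw_power, raw_phase, p_max):
    """Map a raw actor output onto a feasible action (illustrative sketch).

    raw_binary: relaxed binary variables, assumed to lie in [0, 1]
    raw_power:  unconstrained power values
    raw_phase:  unconstrained phase values in radians
    p_max:      sum-power budget (assumed form of the power constraint)
    """
    # Element-wise thresholding at 0.5, as in (30).
    binary = [1 if x >= 0.5 else 0 for x in raw_binary]

    # Clip powers at zero, then rescale only if the budget is exceeded.
    power = [max(0.0, p) for p in raw_power]
    total = sum(power)
    if total > p_max:
        power = [p * p_max / total for p in power]

    # Wrap phases into the interval [0, 2*pi) required by (20d).
    phase = [theta % (2.0 * math.pi) for theta in raw_phase]
    return binary, power, phase
```

Because the same deterministic mapping is applied to the output of either actor, the critic always sees the executed action rather than the raw one, which is what makes the DDPG and SAC comparison consistent.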
After applying $a_t$, the environment returns the next state and reward according to the MDP in Section III-A. In particular, the instantaneous reward $r_t$ is computed using (31), where the achieved rates are obtained from the corresponding SINR expression in (15).

To mitigate overestimation bias, SAC employs two critic networks $Q_1(s, a; \xi_1)$ and $Q_2(s, a; \xi_2)$ with target networks $Q'_1(\cdot; \xi'_1)$ and $Q'_2(\cdot; \xi'_2)$. For each transition $(s_t, a_t, r_t, s_{t+1})$, we define the soft target as

$$y_t = r_t + \gamma \Big( \min_{i=1,2} Q'_i(s_{t+1}, a_{t+1}; \xi'_i) - \alpha \log \pi(a_{t+1} \mid s_{t+1}; \psi) \Big), \tag{44}$$

where $a_{t+1} = \Pi_{\mathcal{F}}(\tilde{a}_{t+1})$ and $\tilde{a}_{t+1} \sim \pi(\cdot \mid s_{t+1}; \psi)$. The critics are trained by minimizing the soft Bellman residual

$$L_{\xi_i}^{C} = \mathbb{E}\left[ \big( Q_i(s_t, a_t; \xi_i) - y_t \big)^2 \right], \quad i \in \{1, 2\}. \tag{45}$$

3) Actor and Temperature Updates: The actor network is updated by minimizing the following policy loss:

$$L_{\psi}^{A} = \mathbb{E}\left[ \alpha \log \pi(a_t \mid s_t; \psi) - \min_{i=1,2} Q_i(s_t, a_t; \xi_i) \right], \tag{46}$$

where $a_t = \Pi_{\mathcal{F}}(\tilde{a}_t)$ with $\tilde{a}_t \sim \pi(\cdot \mid s_t; \psi)$. Moreover, the temperature parameter $\alpha$ can be adaptively adjusted during training by minimizing [32]

$$L(\alpha) = \mathbb{E}\left[ -\alpha \big( \log \pi(a_t \mid s_t; \psi) + \mathcal{H}_{\mathrm{target}} \big) \right], \tag{47}$$

where $\mathcal{H}_{\mathrm{target}}$ is a predefined target entropy. Finally, the target critic networks are softly updated as

$$\xi'_i \leftarrow \tau \xi_i + (1 - \tau) \xi'_i, \quad i \in \{1, 2\}. \tag{48}$$

Algorithm 2 SAC-Based Learning for Solving Problem (20)
 1: Initialize replay buffer $\mathcal{D}$
 2: Initialize actor network $\pi(\cdot \mid s; \psi)$ and critics $Q_1(\cdot; \xi_1)$, $Q_2(\cdot; \xi_2)$
 3: Initialize target critics $Q'_1(\cdot; \xi'_1) \leftarrow Q_1(\cdot; \xi_1)$, $Q'_2(\cdot; \xi'_2) \leftarrow Q_2(\cdot; \xi_2)$
 4: Initialize temperature parameter $\alpha > 0$
 5: for episode $e = 1, 2, \ldots, E$ do
 6:   Observe initial state $s_0$ as defined in (26)
 7:   for time step $t = 0, 1, \ldots, T - 1$ do
 8:     Sample raw action $\tilde{a}_t \sim \pi(\cdot \mid s_t; \psi)$ as in (43)
 9:     Project onto the feasible set: $a_t = \Pi_{\mathcal{F}}(\tilde{a}_t)$ using (42), enforcing the constraints of (20)
10:     Apply $a_t$ and observe next state $s_{t+1}$ and reward $r_t$ computed by (31)
11:     Store transition $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$
12:     Sample a mini-batch $\{(s_j, a_j, r_j, s_{j+1})\}_{j=1}^{B}$ from $\mathcal{D}$
13:     for $j = 1$ to $B$ do
14:       Sample $\tilde{a}_{j+1} \sim \pi(\cdot \mid s_{j+1}; \psi)$ and set $a_{j+1} = \Pi_{\mathcal{F}}(\tilde{a}_{j+1})$ using (42)
15:       Compute target $y_j$ using (44)
16:     end for
17:     Update critics by minimizing (45) for $i \in \{1, 2\}$
18:     Update actor by minimizing (46)
19:     Update $\alpha$ by minimizing (47)
20:     Update target critics using (48)
21:     $s_t \leftarrow s_{t+1}$
22:   end for
23: end for

By explicitly encouraging exploration through entropy regularization, SAC exhibits improved robustness and faster convergence compared to deterministic policy gradient methods, which makes it particularly suitable for the considered joint scheduling, power allocation, and RIS configuration problem.

D. EDS with SCA-Based Power Optimization Benchmark

We consider a two-stage benchmark in which the discrete resource allocation variables are first determined by EDS, while the transmit power allocation is subsequently refined using successive convex approximation (SCA). Specifically, in the first stage, the benchmark exhaustively enumerates all feasible combinations of the binary subchannel allocation and BS association variables, i.e., $(\boldsymbol{\omega}, \boldsymbol{\phi})$, that satisfy the corresponding constraints. During this enumeration stage, the transmit power allocation is fixed to a uniform allocation across all active links, and the RIS phase shifts are kept fixed. The best discrete configuration is then selected according to the resulting utility value.
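The first EDS stage can be illustrated with a small enumeration sketch. This is not the benchmark code used in the paper: it assumes a single BS, that each user occupies exactly one subchannel, and that each subchannel carries at most a fixed number of users, and it takes the utility evaluation (with uniform power and fixed RIS phases) as a caller-supplied function. All names below are hypothetical.

```python
from itertools import product


def enumerate_assignments(num_users, num_subchannels, max_users_per_subchannel):
    """Yield feasible assignments; entry k is the subchannel of user k."""
    for assignment in product(range(num_subchannels), repeat=num_users):
        loads = [assignment.count(c) for c in range(num_subchannels)]
        # Keep only assignments that respect the per-subchannel user limit.
        if max(loads) <= max_users_per_subchannel:
            yield assignment


def exhaustive_search(num_users, num_subchannels, max_users_per_subchannel, utility):
    """Return the feasible assignment with the largest utility value."""
    candidates = enumerate_assignments(
        num_users, num_subchannels, max_users_per_subchannel
    )
    return max(candidates, key=utility)


def toy_utility(assignment):
    """Placeholder utility that simply penalizes subchannel sharing."""
    return -sum(assignment.count(c) ** 2 for c in set(assignment))


best = exhaustive_search(4, 4, 2, toy_utility)
```

In the actual benchmark, `toy_utility` would be replaced by the sum utility in (20a) evaluated under the fixed uniform power allocation, and the enumeration would also range over the BS association variables.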
Denoting the optimal discrete configuration obtained by EDS as $(\boldsymbol{\omega}^*, \boldsymbol{\phi}^*)$, the second stage refines the transmit power vector while keeping $(\boldsymbol{\omega}^*, \boldsymbol{\phi}^*)$ and the RIS phase shifts fixed. For notational simplicity, given a feasible discrete configuration $(\boldsymbol{\omega}, \boldsymbol{\phi})$ and fixed RIS phase shifts $\boldsymbol{\theta}$, define $G_{k,v}^{b,c} \triangleq \omega_{k,v}^{b,c} |\tilde{h}_{b,k}^{c}|^2$, where $\tilde{h}_{b,k}^{c}$ is given in (13). Then, the SINR in (15) can be rewritten as

$$\gamma_{k,v}^{b,c}(\mathbf{p}) = \frac{p_{k,v}^{b,c} G_{k,v}^{b,c}}{I_{k,v}^{b,c}(\mathbf{p}) + B_c N_0}. \tag{49}$$

Accordingly, the achievable rate is

$$R_{k,v}^{b,c}(\mathbf{p}) = B_c \log_2\left( 1 + \frac{p_{k,v}^{b,c} G_{k,v}^{b,c}}{I_{k,v}^{b,c}(\mathbf{p}) + B_c N_0} \right). \tag{50}$$

The above rate expression can be equivalently written as

$$R_{k,v}^{b,c}(\mathbf{p}) = B_c \left[ \log_2 T_{k,v}^{b,c}(\mathbf{p}) - \log_2\big( I_{k,v}^{b,c}(\mathbf{p}) + B_c N_0 \big) \right], \tag{51}$$

where

$$T_{k,v}^{b,c}(\mathbf{p}) \triangleq I_{k,v}^{b,c}(\mathbf{p}) + B_c N_0 + p_{k,v}^{b,c} G_{k,v}^{b,c}. \tag{52}$$

Since (51) is a difference of two concave functions, the power allocation problem is non-convex. Therefore, we adopt SCA to obtain a tractable approximation. At iteration $n$, given a feasible point $\mathbf{p}^{(n)}$, the second logarithmic term in (51) is upper-bounded by its first-order Taylor expansion as

$$\log_2\big( I_{k,v}^{b,c}(\mathbf{p}) + B_c N_0 \big) \le \log_2\big( I_{k,v}^{b,c}(\mathbf{p}^{(n)}) + B_c N_0 \big) + \frac{I_{k,v}^{b,c}(\mathbf{p}) - I_{k,v}^{b,c}(\mathbf{p}^{(n)})}{\big( I_{k,v}^{b,c}(\mathbf{p}^{(n)}) + B_c N_0 \big) \ln 2} \triangleq \tilde{\varphi}_{k,v}^{b,c}(\mathbf{p}; \mathbf{p}^{(n)}). \tag{53}$$

Substituting (53) into (51), a concave lower bound of $R_{k,v}^{b,c}(\mathbf{p})$ is obtained as

$$\hat{R}_{k,v}^{b,c}(\mathbf{p}; \mathbf{p}^{(n)}) = B_c \left[ \log_2 T_{k,v}^{b,c}(\mathbf{p}) - \tilde{\varphi}_{k,v}^{b,c}(\mathbf{p}; \mathbf{p}^{(n)}) \right]. \tag{54}$$

Hence, the total user rate is approximated by

$$\hat{R}_{k,v}(\mathbf{p}; \mathbf{p}^{(n)}) = \sum_{b \in \mathcal{B}_v} \sum_{c \in \mathcal{C}} \hat{R}_{k,v}^{b,c}(\mathbf{p}; \mathbf{p}^{(n)}). \tag{55}$$

For the selected discrete configuration $(\boldsymbol{\omega}^*, \boldsymbol{\phi}^*)$, the SCA-based power optimization problem at iteration $n$ is formulated as

$$\begin{aligned}
\max_{\mathbf{p}} \quad & \sum_{v \in \mathcal{V}} \hat{U}_v(\mathbf{p}; \mathbf{p}^{(n)}) && \text{(56a)} \\
\text{s.t.} \quad & \hat{R}_{k,v}(\mathbf{p}; \mathbf{p}^{(n)}) \ge R_{k,v}^{\mathrm{th}}, \quad \forall v \in \mathcal{V}, \ \forall k \in \mathcal{K}_v, && \text{(56b)} \\
& \text{(3), (4)}, && \text{(56c)}
\end{aligned}$$

where

$$\hat{U}_v(\mathbf{p}; \mathbf{p}^{(n)}) = \Phi_1 \beta_v \sum_{k \in \mathcal{K}_v} \hat{R}_{k,v}(\mathbf{p}; \mathbf{p}^{(n)}) - \Phi_2 U_v^{\mathrm{Cost}}(\mathbf{p}). \tag{57}$$

Since the cost term is linear in the transmit power, problem (56) is convex and can be solved iteratively until convergence using off-the-shelf convex solvers. The resulting solution provides a strong near-optimal reference where the discrete variables are optimally selected via EDS and the continuous power allocation is refined via SCA.

E. Computational Complexity Analysis

This section analyzes the computational complexity of the proposed DRL-based spectrum sharing framework and compares it with the EDS benchmark. The computational burden of the DRL approaches mainly originates from deep neural network (DNN) operations, whereas the EDS benchmark is dominated by combinatorial enumeration of discrete resource allocation variables. In the proposed framework, the action vector includes subchannel scheduling, transmit power allocation, and RIS phase control. Since the RIS association is fixed by deployment, it is not treated as a decision variable. The action space dimension therefore scales as

$$|\mathcal{A}| = V B_v K_v C + V B_v K_v C + J M_j, \tag{58}$$

where $B_v$ denotes the number of BSs per VSP, $K_v$ is the number of users per VSP, $C$ is the number of subchannels, $J$ is the number of RISs, and $M_j$ is the number of reflecting elements per RIS. The state vector contains channel state information for both direct and cascaded links, user rates, and previous control actions. Its dimension scales approximately as

$$|\mathcal{S}| \propto V B_v K_v C + V B_v J C M_j + V K_v J C M_j. \tag{59}$$

1) Complexity of DRL Training: Both DDPG and SAC employ actor–critic architectures, where the main computational burden arises from forward and backward propagation during network training. For a mini-batch size $B$, the dominant complexity of a critic update scales as $\mathcal{O}(|\mathcal{S}| |\mathcal{A}| n)$, where $n$ denotes the number of neurons per hidden layer. For $E$ training episodes with $T$ interaction steps per episode, the overall training complexity of DDPG is given by

$$\mathcal{O}(E T B |\mathcal{S}| |\mathcal{A}| n). \tag{60}$$

SAC employs two critic networks to mitigate value overestimation and additionally updates an entropy temperature parameter. As a result, SAC introduces a slightly larger computational overhead per training iteration while maintaining the same asymptotic complexity order as DDPG.

2) Complexity of Online Decision Making: Once training is completed, DRL-based resource allocation requires only forward propagation through the actor network to generate control actions. For a two-hidden-layer neural network, the complexity of action generation at each time step is approximately

$$\mathcal{O}(|\mathcal{S}| n + n^2 + |\mathcal{A}| n), \tag{61}$$

which grows polynomially with the system size. Importantly, this complexity is independent of the combinatorial nature of the underlying resource allocation problem, enabling real-time decision making even in large-scale networks.

3) Complexity of EDS with SCA Power Optimization: The EDS benchmark exhaustively enumerates all feasible combinations of the binary subchannel allocation and BS association variables. Assigning $K_v$ users to $C$ subchannels leads to approximately $\mathcal{O}(C^{K_v})$ feasible scheduling configurations per VSP. Considering $V$ VSPs, the total discrete search complexity scales as $\mathcal{O}(C^{V K_v})$. For each discrete configuration, the transmit power allocation is subsequently refined using the SCA procedure. Each SCA iteration requires solving a convex optimization problem whose complexity grows polynomially with the number of active transmission links. Assuming $N_{\mathrm{sca}}$ SCA iterations, the overall complexity of the EDS benchmark can be approximated as

$$\mathcal{O}\big( C^{V K_v} N_{\mathrm{sca}} \big), \tag{62}$$

which increases exponentially with the number of users and subchannels.

4) Discussion: The EDS benchmark provides a near-optimal reference but suffers from exponential complexity growth due to exhaustive enumeration of discrete resource allocations.
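To make the scaling in (58), (61), and (62) concrete, the short sketch below evaluates these expressions for small systems. It is only an arithmetic illustration: the constants hidden by the $\mathcal{O}(\cdot)$ notation are set to one, and the function names are hypothetical.

```python
def eds_configurations(num_subchannels, num_vsps, users_per_vsp):
    """Number of discrete scheduling configurations, C^(V*K_v), as in (62)."""
    return num_subchannels ** (num_vsps * users_per_vsp)


def action_dim(num_vsps, bs_per_vsp, users_per_vsp, num_subchannels,
               num_ris, elements_per_ris):
    """Action dimension from (58): scheduling, power, and RIS phases."""
    links = num_vsps * bs_per_vsp * users_per_vsp * num_subchannels
    return 2 * links + num_ris * elements_per_ris


def actor_forward_cost(state_dim, action_dim_value, hidden_units):
    """Operation count of a two-hidden-layer actor, as in (61)."""
    return (state_dim * hidden_units
            + hidden_units ** 2
            + action_dim_value * hidden_units)


if __name__ == "__main__":
    # C = 4 subchannels and V = 2 VSPs, with K_v ranging from 3 to 6.
    for users_per_vsp in (3, 4, 5, 6):
        print(users_per_vsp, eds_configurations(4, 2, users_per_vsp))
```

With $C = 4$ and $V = 2$, increasing $K_v$ from 3 to 6 raises the number of discrete configurations from $4^{6} = 4096$ to $4^{12} = 16{,}777{,}216$, while the actor cost in (61) grows only linearly in the state and action dimensions.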
In contrast, the proposed DRL framework shifts the computational burden to an offline training phase and enables low-complexity online decision making via neural network inference. Consequently, the DRL-based approach offers significantly better scalability for large-scale wireless networks.

IV. NUMERICAL RESULTS

This section evaluates the performance of the proposed DRL-based framework for joint spectrum sharing and RIS configuration. We compare the proposed SAC- and DDPG-based learning approaches under various system configurations. The results demonstrate that SAC achieves faster convergence, higher long-term utility, and improved robustness in dynamic wireless environments.

A. Channel Model

We consider a frequency-selective SISO downlink system with $C$ orthogonal subchannels. For each subchannel $c \in \mathcal{C}$, the direct BS–UE channel between BS $b$ and user $k$ is modeled as

$$h_{b,k}^{c} = \sqrt{\rho_0 d_{b,k}^{-\beta}} \ \bar{h}_{b,k}^{c}, \tag{63}$$

where $d_{b,k}$ denotes the Euclidean distance, $\beta$ is the path-loss exponent, and $\rho_0$ is the reference channel gain at a distance of 1 m. The small-scale fading coefficient follows independent Rayleigh fading, i.e., $\bar{h}_{b,k}^{c} \sim \mathcal{CN}(0, 1)$, $\forall b, k, c$. For RIS-assisted links, the BS–RIS and RIS–UE channels on subchannel $c$ are given by

$$\mathbf{g}_{b,j}^{c} = \sqrt{\rho_0 d_{b,j}^{-\beta}} \ \bar{\mathbf{g}}_{b,j}^{c}, \tag{64}$$

$$\mathbf{r}_{j,k}^{c} = \sqrt{\rho_0 d_{j,k}^{-\beta}} \ \bar{\mathbf{r}}_{j,k}^{c}, \tag{65}$$

where $d_{b,j}$ and $d_{j,k}$ denote the BS–RIS and RIS–UE distances, respectively. The vectors $\bar{\mathbf{g}}_{b,j}^{c} \in \mathbb{C}^{M_j \times 1}$ and $\bar{\mathbf{r}}_{j,k}^{c} \in \mathbb{C}^{M_j \times 1}$ have i.i.d. entries distributed as $\mathcal{CN}(0, 1)$. The RIS phase shifts are designed with respect to the main carrier frequency and are assumed to be identical across all subchannels. Unless otherwise stated, fading is assumed to be independent across subchannels and links. Throughout the simulations, the path-loss exponent is set to $\beta = 2.5$ for all links.

B. Simulation Settings and Benchmarks

Unless otherwise stated, we consider a two-VSP spectrum-sharing network with $V = 2$. Each VSP operates one or two BSs, i.e., $|\mathcal{B}_v| \in \{1, 2\}$, and serves $K_v = 3$ to $6$ single-antenna users, resulting in a total of $K = 6$ to $12$ users depending on the simulation setup. The number of users is kept moderate to control the computational complexity of the optimization algorithms. Nevertheless, the proposed framework is general and can be readily extended to larger networks with more users and VSPs.

For simplicity, we assume uniform bandwidth partitioning across all VSPs, such that each VSP is assigned identical numbers of reusable and dedicated subchannels, i.e., $C_v^{r} = C^{r}$ and $C_v^{d} = C^{d}$. The system bandwidth is divided into $C_v = 4$ subchannels per VSP, of which $C^{r} = 2$ are reusable, while the remaining $C^{d} = 2$ are dedicated. Accordingly, users may experience (i) intra-VSP interference due to limited subchannels and multi-user sharing, and (ii) inter-VSP interference on reusable subchannels.

We consider a hybrid RIS-assisted architecture in which VSP 1 is equipped with an RIS, while VSP 2 operates without RIS for simplicity. Therefore, the RIS-association variables for users of VSP 2 are fixed to zero. Unless otherwise specified, a single RIS ($J = 1$) is deployed randomly near the center of VSP 1's coverage region. To investigate the impact of RIS size, the number of reflecting elements is varied from $M_1 = 4$ to $M_1 = 16$.

Each subchannel can be reused by at most $L_c = 2$ users, whereas each user is restricted to occupy at most one subchannel, in accordance with the problem formulation in (20).

The physical parameters are chosen according to practical cellular deployments. Specifically, the maximum BS transmit power is set to $P_{\max} = 30$ dBm and the thermal noise power spectral density is fixed to $N_0 = -174$ dBm/Hz. Each subchannel occupies a bandwidth of $B_c = 5$ MHz.
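As a quick sanity check on these physical values, the per-subchannel noise power $B_c N_0$ and the transmit power in linear units can be computed directly. The helper names below are illustrative and not part of the paper.

```python
import math


def dbm_to_watt(power_dbm):
    """Convert a power level from dBm to watts."""
    return 10.0 ** ((power_dbm - 30.0) / 10.0)


def noise_power_dbm(n0_dbm_per_hz, bandwidth_hz):
    """Noise power over a given bandwidth for a flat noise density."""
    return n0_dbm_per_hz + 10.0 * math.log10(bandwidth_hz)


# N_0 = -174 dBm/Hz over B_c = 5 MHz gives roughly -107 dBm per subchannel.
noise_dbm = noise_power_dbm(-174.0, 5e6)

# P_max = 30 dBm corresponds to 1 W.
p_max_watt = dbm_to_watt(30.0)
```

The gap between the two, about 137 dB before path loss, spans many orders of magnitude in linear units, which is one reason rescaling the power and noise values can help numerical conditioning during training.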
In the following simulations, we employ normalized power and noise values. This normalization preserves the underlying signal-to-noise ratios and therefore does not affect the optimal policy or the relative performance of different algorithms. It is adopted to improve numerical stability and facilitate the training of DRL agents.

Following the economic utility model in (17), each VSP incurs spectrum access costs, RIS leasing costs, and transmit power costs. Dedicated subchannels are priced higher than reusable ones, i.e., $\lambda_d > \lambda_r$. Unless otherwise stated, we set $\lambda_r = 0.2$ and $\lambda_d = 0.5$. Moreover, only VSP 1 pays an RIS leasing cost proportional to the number of deployed RISs, i.e., $N_v^{j} \psi_j$ (with $N_1^{j} = 1$ and $N_2^{j} = 0$), where $\psi_j = 0.3$. The transmit power cost coefficient is set to $\alpha_{\mathrm{power}} = 0.1$. To enforce user-level QoS, a minimum rate threshold $R^{\mathrm{th}} = 0.5$ is imposed, and violations are penalized using a coefficient $\lambda_{\mathrm{qos}} = 50$.

For DRL-based approaches, each training run consists of $T = 2 \times 10^4$ interaction steps. To reduce stochastic variability, all reported results are averaged over multiple independent runs with different random seeds. The compared methods are:
• DDPG: Algorithm 1.
• SAC: Algorithm 2.
• EDS: Exhaustive discrete search followed by SCA-based power allocation.
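To illustrate how the QoS penalty weight interacts with the utility terms, the reward in (31) can be sketched as follows, using $R^{\mathrm{th}} = 0.5$ and $\lambda_{\mathrm{qos}} = 50$ as default values. The per-VSP utilities are taken as given inputs here, and the function name `reward` is a hypothetical choice.

```python
def reward(utilities, rates, rate_threshold=0.5, qos_weight=50.0):
    """Reward of (31): total utility minus a weighted QoS shortfall.

    utilities: per-VSP utility values U_v(t)
    rates:     achieved rates R_{k,v}(t) of all users, flattened
    """
    # Only users below the threshold contribute to the penalty.
    shortfall = sum(max(0.0, rate_threshold - r) for r in rates)
    return sum(utilities) - qos_weight * shortfall


# Example values chosen for illustration only, not taken from the paper:
# one user falls short of the threshold by 0.2.
example = reward([10.0, 8.0], [0.6, 0.5, 0.3, 1.2])
```

In this example a single shortfall of 0.2 costs 10 reward units, more than half of the total utility of 18, so with this weight the agent is strongly pushed toward policies that satisfy every user's rate threshold.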
To enhance robustness in our quasi-static channel setting, we perform $G = 2$ gradient updates per environment interaction, improving sample efficiency and stabilizing SAC training without increasing interaction cost. The main simulation parameters and DRL hyperparameters are summarized in Tables II and III, respectively.

TABLE II: Simulation Parameters

| Parameter | Description | Value |
|---|---|---|
| $V$ | Number of VSPs | 2 |
| $\lvert\mathcal{B}_v\rvert$ | Number of BSs per VSP | 1–2 |
| $K_v$ | Number of users per VSP | 3–6 |
| $C$ | Number of subchannels per VSP | 4 |
| $C^{r}$ | Number of reusable subchannels | 2 |
| $C^{d}$ | Number of dedicated subchannels | 2 |
| $L_c$ | Maximum users per subchannel | 2 |
| $J$ | Number of RISs | 1 |
| $M_j$ | Number of reflecting elements per RIS | 4–16 |
| $B_c$ | Subchannel bandwidth | 5 MHz |
| $N_0$ | Noise power spectral density | −174 dBm/Hz |
| $P_{\max}$ | Maximum BS transmit power | 30 dBm |
| $\beta$ | Path-loss exponent | 2.5 |
| $R^{\mathrm{th}}$ | Minimum QoS rate threshold | 0.5 bps/Hz |
| $\lambda_{\mathrm{qos}}$ | QoS penalty coefficient | 50 |
| $\lambda_r$ | Reused subchannel price | 0.2 |
| $\lambda_d$ | Dedicated subchannel price | 0.5 |
| $\psi_j$ | RIS leasing cost | 0.3 |
| $\alpha_{\mathrm{power}}$ | Power consumption cost coefficient | 0.1 |

TABLE III: Hyperparameters for DRL-Based Algorithms

| Parameter | Description | Value |
|---|---|---|
| Loss | Critic loss function | MSE |
| $\gamma$ | Discount factor | 0.99 |
| $\tau$ | Target-network soft update factor | $5 \times 10^{-3}$ |
| $B$ | Mini-batch size | 256 |
| $\mathcal{D}$ | Replay buffer size | $2 \times 10^{5}$ |
| $T$ | Training steps per run | $2 \times 10^{4}$ |
| Activation | Hidden-layer activation | ReLU |
| Hidden layers | Number of hidden layers | 2 |
| Hidden units | Units per hidden layer | 256 |
| Output activation | Actor output squashing | tanh |
| $\mu_\pi$ | Actor learning rate | $10^{-4}$ |
| $\mu_Q$ | Critic learning rate | $10^{-4}$ |
| $\mu_\alpha$ (SAC) | Temperature learning rate | $10^{-4}$ |
| Target entropy (SAC) | Entropy regularization target | $-\lvert\mathcal{A}\rvert$ |
| $G$ (SAC) | Gradient updates per step | 2 |
| Policy delay (SAC) | Actor update interval | 2 |
| Warm-up steps | Random exploration | 1000 |

The geometry of our simulation for one realization is shown in Fig. 3, where each VSP operates two BSs deployed within a circular service region.
Users, BSs, and the RIS are uniformly and independently distributed inside the corresponding VSP region. The illustrated geometry corresponds to one representative realization, while all numerical results are obtained by averaging over multiple random user and channel realizations. The radius of each VSP region is 500 m, the centers of the two VSPs are separated by 800 m, and the number of users per VSP is set to 3.

[Fig. 3: Simulation geometry for a realization with $|\mathcal{B}_v| = 2$, $K_v = 3$, and $J = 1$. Users, BSs, and the RIS are randomly distributed within each VSP region. Axes: x (m), y (m).]

C. Performance in the Presence of RIS

This experiment investigates the impact of RIS on the overall system utility for $K_v = 4$. To isolate the effect of the RIS and maintain a tractable experimental setup, we consider a simplified scenario with a single BS per VSP and one RIS deployed in the environment. The extension to multiple BSs or multiple RISs is conceptually straightforward and therefore omitted for clarity.

Fig. 4 compares the learning performance of SAC and DDPG against the EDS benchmark in terms of the average reward over multiple random seeds. The horizontal lines indicate the average benchmark rewards for $M_1 = 4$ and $M_1 = 16$. It is observed that SAC converges significantly faster than DDPG and achieves a final performance close to the corresponding EDS benchmark. In particular, SAC attains approximately 96% of the benchmark reward for $M_1 = 16$, demonstrating its ability to learn near-optimal joint scheduling, RIS configuration, and power allocation policies. In contrast, DDPG exhibits slower convergence, larger performance fluctuations, and saturates at a substantially lower reward level.
More specifically, the performance of SAC with $M_1 = 4$ is comparable to that of DDPG with $M_1 = 16$, while SAC with $M_1 = 16$ achieves at least a 33% higher final reward and continues to improve without clear saturation. These results highlight the superior robustness of SAC in mixed discrete–continuous resource allocation problems, where entropy regularization and stochastic exploration enable more effective navigation of combinatorial decisions and highly nonconvex reward landscapes.

[Fig. 4: Convergence comparison of SAC and DDPG against the EDS benchmark in terms of the average sum utility of the VSPs for $M_1 = 4$ and $M_1 = 16$. Solid curves denote the mean reward over different seeds.]

D. Performance of Spectrum Sharing

In the next experiment, we evaluate the impact of the number of reusable and dedicated subchannels on the total utility. For simplicity, RIS is omitted in this scenario and $K_v = 4$, while all other parameters follow the simulation setup in Table II. The results are illustrated in Fig. 5, where two extreme cases are considered: fully dedicated bandwidth ($C^{d} = 4$) and fully reusable bandwidth ($C^{r} = 2$). It can be observed that when VSPs use dedicated subchannels, the performance improves due to reduced interference. In particular, for the case $C^{d} = 4$, each VSP serves four users using orthogonal bandwidth allocation, resulting in interference-free transmission. Consequently, the achieved reward reaches approximately 35 after 20k training steps. In contrast, when $C^{r} = 2$, full spectrum reuse is employed, leading to severe interference. In this case, each resource block is shared by two users within each VSP and experiences additional interference from two users in the other VSP. As a result, each user is affected by three interference sources in total.
Under this configuration, the reward saturates early and converges to approximately 5, which results in a performance gap of nearly 30 compared to the fully dedicated case.

Moreover, SAC significantly outperforms DDPG, which achieves a reward of only approximately 15 after 20k training steps for $C^{d} = 4$. The superior robustness of SAC over DDPG stems from several architectural and optimization advantages. In particular, SAC employs a double critic network, which mitigates the overestimation bias commonly observed in value-function approximation. Moreover, SAC incorporates an entropy regularization term in the objective function, which promotes exploration and improves policy stability during training. In contrast, DDPG relies on a deterministic policy and a single critic network, making it more sensitive to hyperparameter selection and prone to premature convergence to suboptimal policies. Consequently, SAC demonstrates more stable learning dynamics and achieves higher long-term utility compared to DDPG.

It is worth emphasizing that this result is obtained under the assumption of single-antenna BSs, where transmissions are omnidirectional and interference is maximized. In multi-antenna systems, spatial beamforming can limit interference, which is expected to further improve the overall performance.

E. Performance with Multiple BSs per VSP

In this experiment, we evaluate the impact of intracell interference by considering two BSs per VSP, as illustrated in Fig. 3, which enables simultaneous modeling of both inter-VSP and intra-VSP interference. The number of users per VSP is varied from $K_v = 3$ to $K_v = 6$ (i.e., $K = 6$ to $K = 12$ users in total). Two deployment scenarios are considered: with RIS ($M_1 = 10$) and without RIS. Since SAC achieved the best performance in previous experiments, only SAC is evaluated here. Fig. 6 shows the final average reward for different user densities.
RIS deployment consistently improves performance due to enhanced spatial diversity and improved effective channel gains, which increase the achievable sum rate and overall utility.

[Fig. 5: Convergence comparison of DRL under different configurations of reusable and dedicated subchannels for a fixed number of subchannels per VSP, $C_v = 4$. Axes: training steps versus average reward. Curves: SAC with $(C^{d}, C^{r}) = (4, 0), (2, 2), (1, 2), (0, 3), (0, 2)$ and DDPG with $(C^{d}, C^{r}) = (4, 0)$.]

[Fig. 6: Final average reward achieved by SAC with RIS ($M_1 = 10$) and without RIS versus the total number of users, $K \in \{6, 8, 10, 12\}$.]

As the number of users increases, the reward generally decreases because higher user density leads to stronger inter-VSP interference on reusable subchannels and increased intra-VSP resource contention. However, the maximum reward is observed at $K_v = 4$ ($K = 8$). This occurs because the number of users matches the available subchannel resources ($C = 4$), allowing efficient user allocation with minimal subchannel sharing. When fewer users are present ($K_v = 3$), some subchannels remain underutilized, limiting the achievable throughput. Conversely, when the number of users exceeds this level, increased subchannel sharing introduces stronger interference, reducing the overall system utility.

F. Impact of DRL Hyperparameters

This subsection evaluates the sensitivity of the DRL algorithms to key training hyperparameters, namely the learning rate and mini-batch size $B$. For simplicity, the learning rates of the actor and critic are set equal, i.e., $\mu = \mu_\pi = \mu_Q$.
The experiments follow the simulation setup in Section IV-B, considering two VSPs with one BS per VSP, $K_v = 4$ users, $C = 4$ subchannels, and RIS assistance enabled for VSP 1 with $M_1 = 8$ elements. Each curve is obtained using the same random seed and smoothed using a moving average window of 500 steps to highlight transient learning behavior.

[Fig. 7: Impact of the learning rate $\mu$ on training performance for (a) DDPG and (b) SAC, with $\mu \in \{5 \times 10^{-5}, 10^{-4}, 2 \times 10^{-4}, 5 \times 10^{-4}, 10^{-3}\}$. RIS assistance is enabled for VSP 1 with $M_1 = 8$ reflecting elements. Axes: training step versus total reward.]

[Fig. 8: Impact of the mini-batch size $B$ on training performance for (a) DDPG and (b) SAC, with $B \in \{16, 32, 64, 128, 256\}$. RIS assistance is enabled for VSP 1 with $M_1 = 8$ reflecting elements. Axes: training step versus total reward.]

Fig. 7 shows the impact of the learning rate $\mu$. The results indicate that SAC maintains stable convergence across all tested values of $\mu$, achieving final rewards in the range of approximately 40–48. In contrast, DDPG is highly sensitive to large learning rates. When $\mu = 5 \times 10^{-4}$ or $\mu = 10^{-3}$, DDPG converges to suboptimal local solutions, resulting in a performance loss exceeding 20 reward units compared to the best configuration. This sensitivity arises from the single-critic structure of DDPG, which is more vulnerable to unstable value estimation under aggressive gradient updates.

Fig. 8 illustrates the effect of the mini-batch size $B$. SAC again demonstrates consistent performance across all tested batch sizes, showing limited performance variation.
Conversely, DDPG exhibits notable degradation when small batch sizes are used. In particular, the final reward for $B = 16$ is more than 10 units lower than the best-performing case ($B = 64$), mainly due to increased gradient variance that destabilizes critic learning.

Overall, SAC exhibits superior robustness and training stability compared to DDPG. This improvement is mainly attributed to entropy regularization and the double-critic structure, which reduce value overestimation and stabilize policy updates.

V. CONCLUSION

This paper investigated dynamic spectrum sharing for RIS-assisted LHQWNs operating within an MNO–VSP ecosystem. The joint optimization of subchannel allocation, transmit power control, and RIS phase configuration was formulated as a utility maximization problem under spectrum leasing costs, RIS deployment costs, and QoS constraints. Due to the resulting mixed-integer nonlinear structure, the problem was modeled as an MDP and solved using DRL techniques. Two actor–critic algorithms, DDPG and SAC, were developed and evaluated.

Simulation results demonstrated that the SAC-based solution consistently achieves near-optimal performance, attaining up to 96% of the utility obtained by the EDS benchmark, while significantly reducing computational complexity. Moreover, SAC exhibits superior training stability and robustness to hyperparameter variations compared to DDPG.

The results further confirmed the performance benefits of RIS deployment. Since the RIS leasing cost is fixed, optimizing RIS phase configurations significantly enhances effective channel gains and overall VSP utility, providing substantial performance improvement compared to RIS-free scenarios. In addition, spectrum partitioning was shown to strongly impact system performance. Dedicated subchannels significantly reduce interference and can increase utility by up to seven times compared to heavily reused spectrum configurations.
SAC also demonstrated improved adaptability across different spectrum resource allocations and network densities.

Overall, the proposed framework provides an effective and scalable solution for mixed discrete–continuous resource optimization in RIS-assisted spectrum sharing environments. Future work will extend this framework to multi-antenna massive MIMO systems, multi-RIS cooperative deployments, and dynamic environments with time-varying CSI and user mobility.