Bridging Network Fragmentation: A Semantic-Augmented DRL Framework for UAV-aided VANETs
Gaoxiang Cao, Wenke Yuan, Huasen He, Member, IEEE, Yunpeng Hou, Xiaofeng Jiang, Member, IEEE, Shuangwu Chen, Member, IEEE, Jian Yang, Senior Member, IEEE

Abstract—Vehicular Ad-hoc Networks (VANETs) are the digital cornerstone of autonomous driving, yet they suffer from severe network fragmentation in urban environments due to physical obstructions. Unmanned Aerial Vehicles (UAVs), with their high mobility, have emerged as a vital solution to bridge these connectivity gaps. However, traditional Deep Reinforcement Learning (DRL)-based UAV deployment strategies lack semantic understanding of road topology, often resulting in blind exploration and sample inefficiency. By contrast, Large Language Models (LLMs) possess powerful reasoning capabilities capable of identifying topological importance, though applying them to control tasks remains challenging. To address this, we propose the Semantic-Augmented DRL (SA-DRL) framework. Firstly, we propose a fragmentation quantification method based on Road Topology Graphs (RTG) and Dual Connected Graphs (DCG). Subsequently, we design a four-stage pipeline to transform a general-purpose LLM into a domain-specific topology expert. Finally, we propose the Semantic-Augmented PPO (SA-PPO) algorithm, which employs a Logit Fusion mechanism to inject the LLM's semantic reasoning directly into the policy as a prior, effectively guiding the agent toward critical intersections. Extensive high-fidelity simulations demonstrate that SA-PPO achieves state-of-the-art performance with remarkable efficiency, reaching baseline performance levels using only 26.6% of the training episodes. Ultimately, SA-PPO improves two key connectivity metrics by 13.2% and 23.5% over competing methods, while reducing energy consumption to just 28.2% of the baseline.

Index Terms—UAV-assisted VANETs, Deep Reinforcement Learning, Large Language Models, Network Connectivity, Semantic Augmentation

G. Cao, W. Yuan, H. He, Y. Hou, X. Jiang, S. Chen and J. Yang are with the Department of Automation, University of Science & Technology of China. H. He (hehuasen@ustc.edu.cn) and Y. Hou (hyp314@mail.ustc.edu.cn) are the corresponding authors. The source code will be publicly available upon acceptance.

I. INTRODUCTION

Within the ambitious framework of sixth-generation (6G) mobile communications and Intelligent Transportation Systems (ITS), wireless connectivity characterized by ubiquitous coverage, ultra-low latency, and high reliability is regarded as the fundamental digital infrastructure for enabling Level 4/Level 5 (L4/L5) autonomous driving [1]. As the medium for information exchange in Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communications, Vehicular Ad-Hoc Networks (VANETs) undertake critical tasks, including cooperative sensing, traffic condition warnings, and computation offloading [2]. However, the physical characteristics of modern urban canyons constitute a substantial barrier to wireless wave propagation. High-density building complexes induce severe shadow fading [3], while the high mobility of vehicles causes the network topology to fluctuate drastically on a time scale of seconds [4].
This dual spatiotemporal dynamicity frequently partitions the network into isolated, fragmented subnets [5], severely constraining the continuity and safety of Internet of Vehicles (IoV) services.

Although traditional terrestrial Roadside Unit (RSU) deployment constitutes the network backbone, it is constrained by construction costs and land resources, making it difficult to achieve seamless coverage in all blind zones [6]. Furthermore, fixed infrastructure lacks resilience, particularly in scenarios involving sudden traffic congestion or disasters [7]. In this context, Unmanned Aerial Vehicles (UAVs), functioning as aerial base stations or mobile relays, have emerged as an ideal remedial solution for enhancing VANET connectivity [8]. This is attributed to their high probability of establishing Line-of-Sight (LoS) communication links [9] and their flexible three-dimensional maneuverability. A UAV with high mobility is capable of dynamically tracking ground traffic hotspots and repairing broken links in real time, thereby constructing a resilient air-ground integrated network architecture.

However, controlling UAV nodes to dynamically track and bridge the most critical disconnected areas on the ground under limited onboard energy constraints, thereby maximizing overall network connectivity, represents a typical non-convex, non-linear, and highly dynamic optimization problem [10]. Traditional UAV path planning methods, such as Mixed Integer Linear Programming (MILP), convex optimization, or heuristic algorithms based on potential fields, often rely on precise prior knowledge of environmental models. Furthermore, they suffer from excessive computational complexity when confronting high-dimensional state spaces and real-time dynamic variations, making it difficult to satisfy the real-time requirements of VANETs [11]. In recent years, Deep Reinforcement Learning (DRL) has gradually become a mainstream approach for addressing UAV dynamic deployment problems, owing to its end-to-end decision-making capabilities and adaptability to unknown environments [12]. Specifically, the Proximal Policy Optimization (PPO) algorithm [13] has been widely applied in UAV attitude control and trajectory planning due to its stability in continuous control tasks, sample efficiency, and robustness to hyperparameters [14].

On the other hand, the explosive development of Large Language Models (LLMs) over the past two years has provided novel perspectives for the dynamic deployment of UAVs [15], [16]. LLMs, pre-trained on massive textual datasets, have not only mastered natural language processing capabilities but have also demonstrated emergent abilities in commonsense reasoning, spatial planning, and causal inference [17]. Research indicates that LLMs are capable of comprehending textual descriptions of environments and generating rational action recommendations or sequences of sub-goals based on common sense [18]. This suggests that LLMs can serve as high-level planners or knowledge bases to guide DRL agents in complex decision-making tasks. For instance, LLMs can provide prior knowledge about urban layouts, traffic patterns, and UAV operational constraints, which can be integrated into the DRL framework to enhance learning efficiency and policy performance.
Therefore, we propose integrating the semantic reasoning capabilities of LLMs into the decision-making process of UAVs. Our objective is to endow the agent with topological priors, enabling it to intelligently identify critical connectivity hubs, thereby maximizing network coverage while minimizing energy consumption.

A. Related Works

1) Connectivity of UAV-Aided VANETs: Existing research has achieved notable progress in UAV-aided VANETs, particularly regarding VANET connectivity. In [19], VANET connectivity under varying vehicle densities was investigated by employing fast global k-means (FGKM)-based vehicle clustering (FGVC) and the interior penalty function method (IPM) to approximately maximize the data transmission rate while simultaneously minimizing energy consumption and latency. Similarly, a drone-assisted cooperative routing (DACR) protocol was designed in [20] that integrates the Internet of Things (IoT) to enhance connectivity and data exchange capabilities within VANETs in dynamic, high-density environments. Furthermore, the augmentation of Roadside Units (RSUs) by equipping UAVs with Dedicated Short-Range Communications (DSRC) transceivers was proposed in [21], which improved the network coverage of VANETs through the construction of Voronoi diagrams. To address the dynamicity of UAV network topologies, the study in [22] designed a delay-constrained time-varying graph (DTVG) and proposed a novel data dissemination algorithm, thereby enhancing network connectivity and throughput. However, a critical limitation persists in these studies: the majority of existing works model the UAV deployment area either as a continuous Euclidean space or simplify it into a grid-based rasterized representation. Such abstractions fundamentally neglect the topological constraints inherent to urban mobility: specifically, that vehicular movement is strictly confined to road networks rather than being freely distributed across a planar surface. Consequently, these approaches may fail to capture the complex connectivity relationships among fragmented vehicle clusters. Moreover, while metrics based on distance or Signal-to-Interference-plus-Noise Ratio (SINR) are effective for evaluating local link quality, they are insufficient for characterizing the fragmentation state of the global network. To address these challenges, we propose a graph-theoretic modeling approach to rigorously quantify connectivity and guide agents toward topologically critical intersections.

2) UAV Deployment Using DRL Methods: Extensive efforts have been made in the existing literature to apply DRL methods to UAV deployment problems. For instance, a DRL-based algorithm was proposed in [23] for complex scenarios involving multiple obstacles and dependent tasks represented by Directed Acyclic Graphs (DAGs), achieving joint optimization of UAV trajectory planning, task scheduling, and Service Function (SF) deployment. Similarly, the study in [24] presented a DRL scheme based on Dueling Double DQN (D3QN) to jointly optimize bandwidth allocation, UAV 3D coordinates, and the phase shifts of Reconfigurable Intelligent Surfaces (RIS). Concurrently, researchers have attempted to apply DRL to UAV trajectory planning for assisting VANETs. Addressing connectivity disruptions in VANETs, a heterogeneous UAV-aided framework was proposed in [25] integrating a modified density-based clustering algorithm (MDBSCAN) and adaptive dual-model routing (ADMR).
By employing a Multi-Agent Soft Actor-Critic approach for UAV trajectory optimization, this work significantly improved network reachability and data transmission performance. For UAV swarm-assisted Intelligent Transportation Systems, the work in [26] introduced over-the-air computation (AirComp) into edge services and proposed a DRL-based dual-scale intelligent AirComp (D2IAC) algorithm. This method jointly optimized UAV coordinates, service configuration, and power control, leading to substantial gains in coverage and resource efficiency.

However, existing work on DRL-based connectivity enhancement in UAV-assisted VANETs faces several core challenges:

• Standard exploration strategies, such as ε-greedy or entropy regularization, are typically topology-agnostic, failing to incorporate semantic understanding of the road network. This deficiency leads to extremely slow training convergence or even failure to converge to an effective policy.

• DRL typically acquires all knowledge from scratch through trial-and-error learning. However, urban road networks inherently contain rich prior knowledge within their topological structures (e.g., intersections are critical nodes for connectivity, or arterial roads have high traffic density). Traditional DRL methods struggle to directly leverage this structured semantic information, forcing UAVs to spend a vast number of episodes rediscovering the importance of intersections, thereby resulting in a massive waste of computational resources.

• Existing DRL-based solutions typically suffer from limited generalization capabilities when confronting significant variations in traffic patterns. Real-world urban traffic flows exhibit distinct characteristics in different time periods (e.g., the sharp contrast in vehicle density and distribution between rush hours and midnight). A policy trained on specific traffic scenarios often struggles to adapt to these unseen or drastically different flow distributions without extensive retraining, severely limiting the practical applicability of DRL models in continuous, long-term UAV operations.

To address the aforementioned problems, we fine-tune an LLM to leverage the prior knowledge embedded in its vast knowledge base. By integrating this domain-specific topological expertise into the DRL agent's learning process, we provide effective semantic guidance that transcends simple trial-and-error exploration. This approach not only accelerates convergence by avoiding blind exploration but also significantly enhances the agent's generalization capability against dynamic shifts in urban traffic distributions.

3) UAVs Controlled by LLM Agents: In the domain of UAV control, several attempts have been made to utilize LLMs for high-level task planning or the translation of natural language instructions into control code. For instance, to address GPS-denied indoor environments, a visual path planning method based on a fine-tuned LLM was proposed in [27]. By analyzing depth information and pedestrian location data, this approach generated flight trajectories with superior safety compared to traditional methods. Similarly, the study in [28] introduced a hybrid framework that utilized a Visual-Language Encoder (VLE) and a Retrieval-Augmented Generation (RAG)-empowered LLM to dynamically infer scene-specific safety margins, thereby guiding model predictive control (MPC) to achieve safe landings in dynamic environments.
Furthermore, some studies employed LLMs to guide DRL training through reward shaping. Specifically, the LLM-Enhanced Q-Learning Approach (LLM-QL) was developed in [29], which utilizes an LLM to generate heuristic reward terms, thereby guiding multi-UAV cooperative path exploration in complex, unknown environments. Targeting complex urban wind environments, the work in [30] proposed a hierarchical control architecture employing a fine-tuned LLM as a meta-decision maker. This system selected aggressive, balanced, or cautious Pareto-optimal RL flight policies in real time based on building density and wind conditions. However, existing LLM-guided UAV schemes typically suffer from two primary limitations:

• The LLM functions solely as a high-level planner, generating coarse-grained waypoints, while low-level execution relies on traditional Proportional-Integral-Derivative (PID) controllers or simple path planning algorithms. This approach overlooks the subtle dynamic variations inherent in the low-level environment.

• While utilizing LLMs to generate auxiliary reward functions for guiding RL training is effective, designing appropriate prompts to yield stable numerical rewards is challenging and prone to introducing bias, potentially leading to reward hacking.

Consequently, we propose a deeper and more direct method of LLM integration. Specifically, we propose the Logit Fusion mechanism to intervene directly at the probability distribution level of action selection, thereby injecting the commonsense intuition of the LLM as a prior distribution into the reinforcement learning process.

B. Contributions

In response to the aforementioned challenges, we model the connectivity enhancement problem of UAV-aided VANETs as a dynamic dual graph connectivity maximization problem, thereby achieving a rigorous quantification of network connectivity metrics. Furthermore, we propose a novel four-stage architecture termed the Semantic-Augmented DRL (SA-DRL) Framework, in which LLMs are employed to interpret the complex urban environment, specifically extracting high-level road topological semantics that characterize the structural connectivity of the map. Building upon this foundation, we develop the Semantic-Augmented Proximal Policy Optimization (SA-PPO) algorithm, which incorporates the Logit Fusion mechanism to integrate these semantic priors, thereby explicitly guiding the agent's action decision-making toward more effective deployment strategies. The main contributions of this paper are summarized as follows:

• We propose a graph-theoretic approach to rigorously quantify the fragmentation level of VANETs in complex urban environments. By constructing the Road Topology Graph (RTG) and the Dual Connected Graph (DCG), we reformulate the VANET fragmentation mitigation problem as a dynamic dual graph connectivity maximization problem, which exploits the UAV to merge disjoint vehicle clusters and maximize the average size of connected components.

• We design the SA-DRL Framework, which establishes a novel four-stage pipeline to bridge the gap between LLM reasoning and domain-specific control tasks. Through the integration of Experience Collection, Semantic Prior Construction, and Knowledge Alignment, this framework transforms a general-purpose LLM into a topology expert.
Crucially, the framework leverages this aligned expert to actively guide the DRL agent's exploration, thereby addressing the inefficiency of learning from scratch.

• We propose SA-PPO, a semantic-prior-augmented DRL algorithm that deeply integrates the fine-tuned LLM into the PPO decision-making loop. By employing a novel Logit Fusion mechanism, we directly inject the domain-specific expertise of the LLM as a semantic prior distribution into the policy generation process. This approach achieves a synergy between high-level expert reasoning and data-driven reinforcement learning, which significantly mitigates VANET fragmentation.

• We conduct extensive experiments on a high-fidelity simulator driven by real-world urban trajectories. Results demonstrate that SA-PPO outperforms state-of-the-art methods by improving key connectivity metrics by 13.2% and 23.5% while reducing energy consumption to 28.2%. Notably, it achieves these gains using only 26.6% of the training episodes, effectively mitigating the mode collapse and blind exploration observed in traditional methods. Furthermore, the framework exhibits superior generalization across varying traffic distributions.

The remainder of this paper is organized as follows. Section II introduces the system model and formulates the optimization problem. Section III elaborates on the proposed SA-DRL framework for addressing the optimization problem. Section IV presents the experimental results and discussions. Finally, Section V concludes this paper.

II. SYSTEM MODEL

A. Network Model

Fig. 1. UAV-aided VANET scenario: a UAV u hovering at altitude H above the road network, intersections v_i and v_j connected by road segments e_k = (v_i, v_j), RSU-equipped intersections V_RSU, V2V links, and UAV/RSU relay links.

As illustrated in Fig. 1, we consider a UAV-aided VANET system deployed within an urban region comprising n intersections and m road segments. The set of intersections within the task area is denoted as I_road = {v_1, v_2, ..., v_n}, while we represent the set of road segments as S_road = {e_1, e_2, ..., e_m}. Specifically, each intersection v_i possesses fixed 2D coordinates p_i = (x_i, y_i), and a road segment e_k = (v_i, v_j) connects intersections v_i and v_j. Furthermore, a subset of intersections is equipped with fixed Roadside Units (RSUs), denoted as V_RSU ⊂ I_road. These intersections and their connecting segments maintain permanent network connectivity.

We assume that the operational timeline is discretized into a finite horizon of T time slots, denoted as T = {1, 2, ..., T}, where the duration of each slot is τ. In time slot t, there are N(t) vehicles traversing the road network. If a specific vehicle veh_l is located on road segment e_k at time slot t, its position is denoted as pos_{veh,l} ∈ e_k. While vehicle mobility follows the Intelligent Driver Model (IDM) [31], we assume that vehicle positions are known at the beginning of each time slot.

Vehicles on the road establish connections with one another via onboard wireless interfaces, forming connected clusters. However, due to high vehicular mobility, intermittent connectivity gaps persist between these clusters, resulting in VANET fragmentation. Consequently, we consider the deployment of a UAV u as a communication relay to mitigate fragmentation and enhance network connectivity. The position of the UAV is denoted as p_u(t) = (x_u(t), y_u(t), H), where H represents a fixed flight altitude.
Considering that urban structures (e.g., skyscrapers) can obstruct wireless signals, we assume that the UAV's flight destination in each time slot is positioned directly above a specific intersection v ∈ I_road to maximize the probability of establishing LoS connections with ground vehicles. It is worth noting that while we employ a single-UAV scenario for ease of discussion and implementation, our proposed method is equally applicable to multi-UAV scenarios and can be readily extended.

B. Communication Model

Given the specific characteristics of the urban environment, we adopt a probabilistic Air-to-Ground (A2G) channel model. The probability P_LoS of establishing a LoS link between the UAV and a ground node depends primarily on the elevation angle θ [32], which is expressed as

$$P_{\mathrm{LoS}}(\theta) = \frac{1}{1 + a \exp(-b(\theta - a))}. \tag{1}$$

Here, θ = (180/π) arctan(H/d) represents the elevation angle between the UAV and the ground vehicle, d is the horizontal Euclidean distance between the UAV and the ground vehicle, while a and b are S-curve parameters determined by the environment (e.g., dense urban, suburban, high-rise). The average path loss L(d), modeled as the weighted sum of LoS and Non-Line-of-Sight (NLoS) losses, is given by

$$L(d) = P_{\mathrm{LoS}} \cdot L_{\mathrm{LoS}}(d) + (1 - P_{\mathrm{LoS}}) \cdot L_{\mathrm{NLoS}}(d), \tag{2}$$

where L_LoS and L_NLoS represent the free-space path loss plus additional attenuation loss in the two respective states, expressed as

$$L_{\xi}(d) = 20 \log\left(\frac{4 \pi d f_c}{c}\right) + \eta_{\xi}, \quad \xi \in \{\mathrm{LoS}, \mathrm{NLoS}\}. \tag{3}$$

Here, f_c is the carrier frequency, c is the speed of light, while η_LoS and η_NLoS denote the average additional (shadowing) losses under LoS and NLoS conditions, respectively.

Let P_tx be the UAV transmission power, and G be the sum of the receiver and transmitter antenna gains. The signal power received by a ground vehicle is given by

$$P_{\mathrm{rx}}(d) = P_{\mathrm{tx}} + G - L(d). \tag{4}$$

The maximum distance d satisfying P_rx(d) ⩾ P_th is defined as the UAV coverage radius R_cov, where P_th is the minimum received power required to satisfy Quality of Service (QoS) requirements. We assume that if the UAV is positioned directly above intersection v, and its wireless coverage radius R_cov exceeds half the length of a road segment, the UAV is capable of covering all vehicles situated on the road segments incident to intersection v. Furthermore, we assume that the communication range of the UAV under unobstructed conditions is sufficient to cover half the length of any road segment, thereby ensuring that vehicles located on the same road segment can be interconnected through multi-hop transmission.
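To make the link budget concrete, the following Python sketch evaluates Eqs. (1)-(4) and numerically searches for the coverage radius R_cov. It is a minimal illustration: the S-curve constants, additional losses, transmit power, QoS threshold, and altitude are placeholder values rather than our simulation settings, and Eq. (3) is evaluated with the horizontal distance d exactly as written.

```python
import numpy as np

# Illustrative A2G channel parameters (placeholders, not the paper's values).
A, B = 9.61, 0.16                # S-curve parameters a, b (urban-like)
ETA_LOS, ETA_NLOS = 1.0, 20.0    # additional losses eta_xi [dB]
FC, C_LIGHT = 2.4e9, 3.0e8       # carrier frequency [Hz], speed of light [m/s]
P_TX, G_ANT, P_TH = 20.0, 0.0, -80.0  # Tx power [dBm], gains [dBi], QoS threshold [dBm]
H = 100.0                        # fixed flight altitude [m]

def p_los(d):
    """Eq. (1): LoS probability from the elevation angle theta (degrees)."""
    theta = np.degrees(np.arctan2(H, d))
    return 1.0 / (1.0 + A * np.exp(-B * (theta - A)))

def path_loss(d):
    """Eqs. (2)-(3): average of LoS/NLoS path loss, weighted by P_LoS."""
    fspl = 20.0 * np.log10(4.0 * np.pi * d * FC / C_LIGHT)
    p = p_los(d)
    return p * (fspl + ETA_LOS) + (1.0 - p) * (fspl + ETA_NLOS)

def coverage_radius(step=1.0, d_max=5000.0):
    """Largest horizontal d with P_rx(d) >= P_th per Eq. (4).
    P_rx decreases monotonically in d, so a linear scan suffices."""
    d = step
    while d < d_max and P_TX + G_ANT - path_loss(d) >= P_TH:
        d += step
    return d - step

print(f"R_cov ~ {coverage_radius():.0f} m")
```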
C. Graph-Theoretic Model of VANET Fragmentation

Definition 1 (Road Topology Graph (RTG)): An undirected weighted graph G = (V, E) is defined as the RTG corresponding to the urban road network (I_road, S_road) if there exist bijections ϕ: I_road → V and ψ: S_road → E such that for every s = (i_1, i_2) ∈ S_road, the condition ψ(s) = (ϕ(i_1), ϕ(i_2)) holds. Meanwhile, the weight of a vertex v ∈ V is defined as

$$p_v(t) = \begin{cases} 0, & \text{if } \phi^{-1}(v) \text{ is currently not covered}, \\ 1, & \text{if } \phi^{-1}(v) \text{ is covered by an RSU}, \\ 2, & \text{if UAV } u \text{ is located at } \phi^{-1}(v). \end{cases} \tag{5}$$

In particular, when the UAV u is positioned at an intersection covered by an RSU, we define p_v(t) = 2. The weight of an edge e ∈ E is defined as

$$p_e(t) = \mathrm{NumVehicles}(e, t), \tag{6}$$

which represents the total number of vehicles located on road segment e during time slot t.

The vertex weights of the RTG characterize the coverage status of intersections, while the edge weights describe the traffic load on urban roads. To facilitate optimization in large-scale networks, we define the c-edge and c-vertex as follows to characterize the connectivity of RTG vertices and edges. This allows us to abstract the complex determination of VANET connectivity into a graph-theoretic coverage problem.

Definition 2 (Connected Vertex (c-vertex)): Consider a vertex v ∈ V; v is designated as a c-vertex in time slot t if and only if at least one of the following conditions is met: 1) p_v(t) > 0; 2) ∃ e ∈ δ(v), p_e(t) > 0. Here, δ(v) denotes the set of edges incident to vertex v.

Definition 3 (Connected Edge (c-edge)): Consider an edge e = (v_1, v_2) ∈ E; e is designated as a c-edge in time slot t if and only if at least one of the following conditions is met: 1) p_e(t) > 0; 2) v_1 or v_2 is a c-vertex.

Building upon the concepts of c-edge and c-vertex, we define the c-graph as follows.

Definition 4 (Connected Graph (c-graph)): Let V_c(t) and E_c(t) denote the sets of all c-vertices and c-edges in the RTG G at time slot t, respectively. We define the c-graph as the subgraph G_c(t) = (V'(t), E'(t)), where the edge set is given by E'(t) = E_c(t), and the vertex set V'(t) is composed of V_c(t) and the associated edge endpoints, i.e.,

$$V'(t) = V_c(t) \cup \{v \in V \mid \exists e \in E_c(t), v \in e\}. \tag{7}$$

Since vehicles in VANETs are primarily located on road segments rather than at intersections, the connected components of G_c are insufficient to accurately quantify the degree of VANET fragmentation. This is because connected components are typically based on vertex reachability, whereas VANET connectivity should fundamentally be regarded as a form of edge-based connectivity. Consequently, we employ a dual graph approach by considering the following definition.

Definition 5 (Dual Connected Graph (DCG)): Consider the c-graph G_c(t) = (V'(t), E'(t)). For every e ∈ E'(t), based on a bijection φ: E'(t) → V*(t), we define a unique vertex v* = φ(e) ∈ V*(t). For every v ∈ V'(t) and any two incident edges e_1, e_2 ∈ δ(v), we define an edge (φ(e_1), φ(e_2)) ∈ E*(t). The undirected weighted graph G*(t) = (V*(t), E*(t)) is termed the DCG corresponding to G_c(t). For any v* ∈ V*(t), its weight is defined as the weight of the corresponding edge in the primal graph, which can be expressed as

$$p_{v^*}(t) = p_{\varphi^{-1}(v^*)}(t). \tag{8}$$

The connected subnets (i.e., network fragments) of the VANET exhibit a one-to-one correspondence with the connected components of the DCG. Moreover, the number of vehicles within a connected subnet is equivalent to the sum of the vertex weights of the corresponding connected component. Consequently, we quantify the degree of VANET fragmentation using the connected components of the DCG. Let K(t) denote the number of connected components in G*(t), with the corresponding sums of vertex weights denoted as n_1(t), ..., n_{K(t)}(t). We define

$$C(t) = \frac{1}{K(t)} \sum_{i=1}^{K(t)} n_i(t). \tag{9}$$

Here, K(t) represents the number of VANET network fragments, while C(t) signifies the average number of vehicles within each fragment. Consequently, a smaller K(t) and a larger C(t) indicate better network connectivity.
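Definitions 2-5 map directly onto standard graph primitives: the DCG of the c-graph is exactly its line graph, so K(t) and C(t) in Eq. (9) can be computed with off-the-shelf tools. The following sketch illustrates this, assuming networkx; the toy graph and weight values are invented purely for illustration.

```python
import networkx as nx

# A sketch of Definitions 2-5 and Eq. (9). p_v holds vertex weights per
# Eq. (5); p_e holds edge weights (vehicle counts) per Eq. (6), keyed by
# frozenset so that edge orientation does not matter.
def fragmentation_metrics(G, p_v, p_e):
    # Definition 2: covered vertex, or incident to an occupied edge.
    c_vertices = {v for v in G
                  if p_v[v] > 0 or any(p_e[frozenset(e)] > 0
                                       for e in G.edges(v))}
    # Definition 3: occupied edge, or an endpoint is a c-vertex.
    c_edges = {frozenset(e) for e in G.edges
               if p_e[frozenset(e)] > 0
               or e[0] in c_vertices or e[1] in c_vertices}
    # Definition 4: the c-graph over the c-edges and their endpoints.
    Gc = nx.Graph(tuple(e) for e in c_edges)
    # Definition 5: the DCG is the line graph of the c-graph.
    dcg = nx.line_graph(Gc)
    comps = list(nx.connected_components(dcg))
    K = len(comps)  # number of VANET fragments
    # Eq. (9): DCG vertex weight equals the primal edge weight (Eq. (8)).
    C = sum(sum(p_e[frozenset(e)] for e in comp)
            for comp in comps) / max(K, 1)
    return K, C

# Toy example: RSU at intersection 0, seven vehicles on segment (4, 5).
G = nx.path_graph(6)
p_v = {v: 0 for v in G}
p_v[0] = 1
p_e = {frozenset(e): 0 for e in G.edges}
p_e[frozenset((4, 5))] = 7
print(fragmentation_metrics(G, p_v, p_e))  # -> (2, 3.5): two fragments
```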
D. Energy Consumption Model

We model the energy consumption of UAV u in time slot t as the sum of propulsion energy e_f(t), hovering energy e_h(t), and communication energy e_c(t). To ensure reliable communication quality and mitigate Doppler shifts caused by UAV mobility, we adopt the hover-and-transmit paradigm [33]. Similar to the approach in [34], we assume a constant flight velocity for the UAV. Thus, e_f(t) is proportional to the flight distance, while e_h(t) and e_c(t) are proportional to the hovering duration. Suppose the UAV flies from intersection v_i to intersection v_j in time slot t with velocity v_u; the total energy consumption in time slot t is given by

$$E(t) = e_f(t) + e_h(t) + e_c(t) = (\varepsilon_1 + \varepsilon_2)\left(\tau - \frac{l(v_i, v_j)}{v_u}\right) + \varepsilon_3 \cdot \frac{l(v_i, v_j)}{v_u}. \tag{10}$$

Here, l(v_i, v_j) represents the Euclidean distance between the two intersections, while ε_1, ε_2, and ε_3 denote the power consumption for hovering, communication, and propulsion, respectively.

E. Problem Formulation

To simultaneously minimize VANET fragmentation and UAV energy consumption, we formulate the dynamic deployment problem for the UAV-assisted VANET as

$$\max_{P_u} \sum_{t=1}^{T} (C(t) - E(t)) \tag{11}$$

$$\text{s.t.} \quad p_u(t) \in I_{\mathrm{road}}, \quad t \in [1, T], \tag{11a}$$

$$l(p_u(t-1), p_u(t)) \leqslant v_u \tau, \quad t \in [1, T], \tag{11b}$$

$$E_{\mathrm{battery}} - \sum_{t=1}^{T} E(t) > 0, \tag{11c}$$

where P_u = {p_u(1), ..., p_u(T)} represents the sequence of UAV positions at all time slots, and E_battery is the UAV battery capacity. Constraint (11a) requires that the UAV's target destination in each time slot is an intersection. Constraint (11b) ensures that the flight distance in each time slot does not exceed the maximum distance feasible under the velocity limit. Constraint (11c) mandates that the UAV's energy is not depleted before the mission concludes.

The above problem is a typical Mixed-Integer Non-Linear Programming (MINLP) problem. Due to the stochastic nature of the vehicular distribution p_e(t) and the vastness of the state space, where the number of trajectory combinations grows exponentially with the number of intersections n, traditional optimization methods are computationally intractable. This challenge motivates the introduction of DRL in this study.
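As a concrete reading of Eq. (10) and the objective in Eq. (11), the sketch below evaluates the per-slot energy under the hover-and-transmit paradigm. The power ratings and slot parameters are illustrative placeholders.

```python
import math

# Illustrative power ratings eps_1..eps_3 [W] and slot parameters.
EPS_HOVER, EPS_COMM, EPS_PROP = 150.0, 10.0, 200.0
V_U, TAU = 15.0, 60.0  # flight speed [m/s], slot duration [s]

def slot_energy(p_from, p_to):
    """Eq. (10): (hover + comm) power over the hover time plus
    propulsion power over the flight time, at constant speed."""
    dist = math.dist(p_from, p_to)  # l(v_i, v_j)
    t_fly = dist / V_U
    assert t_fly <= TAU, "violates the reachability constraint (11b)"
    return (EPS_HOVER + EPS_COMM) * (TAU - t_fly) + EPS_PROP * t_fly

def cumulative_objective(C_vals, E_vals):
    """Eq. (11): sum of per-slot connectivity-minus-energy terms."""
    return sum(c - e for c, e in zip(C_vals, E_vals))
```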
III. PROPOSED APPROACH

To address the aforementioned challenges, we propose a novel LLM-guided DRL framework, termed Semantic-Augmented Deep Reinforcement Learning (SA-DRL), to solve the optimization problem formulated in (11). As illustrated in Fig. 2, the SA-DRL framework operates via a four-stage pipeline. Specifically, we gather experience through environmental exploration, extract semantics from this experience to construct datasets, and achieve knowledge alignment of the pre-trained LLM regarding the specific topology of urban networks via Parameter-Efficient Fine-Tuning (PEFT). Moreover, we propose the Semantic-Augmented Proximal Policy Optimization (SA-PPO) algorithm, which integrates the semantic reasoning capabilities of the LLM into the decision-making loop of the DRL agent through a Logit Fusion mechanism. This joint architecture effectively mitigates the challenges associated with blind exploration and the deficiency in topological understanding inherent in traditional DRL.

Fig. 2. The four-stage pipeline of the SA-DRL framework: (1) Experience Collection: high-fidelity state-space sampling via extensive environment interaction with a lightweight baseline policy; (2) Semantic Prior Construction: textual serialization of graph topological features and generation of a heuristic semantic supervision dataset; (3) Knowledge Alignment: LoRA-based alignment from general reasoning capabilities to domain-specific topological cognition; (4) Semantic-Augmented Training & Execution: semantic prior injection and dual-stream collaborative decision-making based on Logit Fusion.

In the following, we first introduce the key elements of DRL. Subsequently, we elaborate on the specific details of each stage within the pipeline. Finally, we summarize the proposed SA-DRL framework.

A. Key Elements of DRL

We model the dynamic UAV deployment task as a Markov Decision Process (MDP) to solve the optimization problem formulated in (11) using DRL algorithms. An MDP is defined by the tuple (S, A, r, γ), representing the state space, action space, reward function, and discount factor.

1) Action Space: We constrain the UAV's flight destination to always be an intersection. Therefore, A = V, and the size of the agent's action space is n.

2) State Space: We define the state at time slot t as

$$s(t) = \{p_e(t) \mid e \in E\} \cup \{p_v(t) \mid v \in V\}. \tag{12}$$

When used as input for the DRL algorithm, s(t) requires normalization.

3) Reward Function: We define the agent's reward at time slot t as

$$r(t) = \alpha \cdot \frac{1}{K(t)} - \beta \cdot \frac{E(t)}{E_0}, \tag{13}$$

where α and β are weighting coefficients, and E_0 represents the maximum energy consumption of the UAV per time slot.

B. Stage 1: Experience Collection

Algorithm 1: Experience Collection
Input: Target number N_target, maximum number of training episodes K_max.
Output: State database D_State.
1: Initialize empty database D_State ← ∅
2: Initialize baseline PPO policy π_θ with random weights
3: Initialize PPO replay buffer B
4: for k ← 1 to K_max do
5:   Reset environment and observe initial state s_0
6:   for t = 1 to T do
7:     Select action a_t ∼ π_θ(·|s_t)
8:     Execute action a_t; observe reward r_t and next state s_{t+1}
9:     if s_t ∉ D_State then
10:      D_State ← D_State ∪ {s_t}
11:      if |D_State| ≥ N_target then return D_State
12:    Store transition (s_t, a_t, r_t, s_{t+1}) in B
13:    s_t ← s_{t+1}
14:  if update condition met then
15:    Update π_θ using the data in B via the standard PPO loss; clear B
16: return D_State

As outlined in Algorithm 1, in this stage we collect states that the agent may encounter during operation through simple environmental exploration and store them in a state database D_state. Specifically, we train a lightweight baseline PPO agent to explore the environment briefly and collect the states encountered during the rollout process, which are added to the database after deduplication. Compared to random-walk sampling, the states collected by a lightweight DRL baseline are closer to the real task distribution. This approach enhances dataset quality, thereby improving training efficiency during LLM fine-tuning.
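For concreteness, the following sketch shows one way the state vector of Eq. (12) and the reward of Eq. (13) might be assembled. The weighting coefficients and normalization constants are illustrative, and fragmentation_metrics and slot_energy refer to the earlier sketches.

```python
import numpy as np

# Illustrative MDP ingredients: ALPHA, BETA, E0, and the vehicle-count
# normalizer MAX_VEH are placeholder values, not the paper's settings.
ALPHA, BETA, E0, MAX_VEH = 1.0, 0.5, 21600.0, 50.0

def build_state(p_e, p_v):
    """Eq. (12): edge weights (vehicle counts) and vertex weights
    (coverage status), normalized for the PPO network input."""
    edge_part = [p_e[k] / MAX_VEH for k in sorted(p_e, key=sorted)]
    vert_part = [p_v[v] / 2.0 for v in sorted(p_v)]
    return np.asarray(edge_part + vert_part, dtype=np.float32)

def reward(K_t, E_t):
    """Eq. (13): fewer fragments and lower energy give higher reward."""
    return ALPHA / K_t - BETA * E_t / E0
```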
C. Stage 2: Semantic Prior Construction

Algorithm 2: Semantic Prior Construction
Input: State database D_state, system prompt I_task.
Output: SFT dataset D_sft.
1: Initialize empty dataset D_sft ← ∅
2: foreach s in D_state do
3:   Initialize reward vector r ← zeros(n)
4:   for each a in A do
5:     Restore environment to state s
6:     Execute action a and observe immediate reward r_a
7:     r[a] ← r_a
8:   Compute discrete action scores Y_t from r using Eq. (14)
9:   X_t ← Serialize(s)
10:  D_t ← {"instruction": I, "input": X_t, "output": Y_t}
11:  D_sft ← D_sft ∪ {D_t}
12: return D_sft

As shown in Algorithm 2, in this stage we transform numerical graph states into semantic tokens processable by the LLM and construct a Supervised Fine-Tuning (SFT) dataset based on the state database D_state. Considering that directly inputting the graph adjacency matrix into the LLM is inefficient, we propose a concise serialization method to capture key topological features. To capture the semantic essence of VANET fragmentation, we transform the numerical state s_t into a textual description of the connectivity landscape. Specifically, the vehicle count per edge does not merely enumerate cars but semantically represents traffic-density hotspots that require coverage, while the node weight identifies backhaul-enabled intersections. This allows the LLM to perceive the road network not as a matrix, but as a set of disconnected vehicular clusters waiting to be bridged.

To transform the LLM into a domain expert, we construct an instruction fine-tuning dataset D_sft = {(I, X_t, Y_t)}. Here, I is the system prompt guiding the task, informing the LLM of basic task information; X_t is the serialized s_t; and Y_t represents the output action scores. Since directly computing the optimal action-value function Q*(s_t, ·) is intractable, we employ the immediate reward r(s_t, ·) as an approximate action score. Although the immediate reward is myopic, it serves as a robust indicator of topological criticality within the dynamic network: a high score signifies an intersection capable of healing partitions in the DCG or merging disjoint vehicle platoons. Through SFT on this dataset, the LLM learns the mapping f_LLM: X_t → P_prior(a|s), thereby providing a commonsense distribution over the action space.

It is important to note that floating-point data not only consume more tokens but also increase the learning difficulty for the LLM. Therefore, the state vectors fed into the LLM must use integer components and strictly avoid normalization. Since the immediate reward defined in (13) is in floating-point format, we map it from the continuous interval (−1, 1) to the discrete interval [0, score*] ∩ Z via (14). In practical implementation, to ensure a fixed output length for the LLM, score* is set to 9, ensuring that each component of Y_t occupies only a single character. Thus, we have

$$Y_t = \mathrm{round}\left(\frac{r(s_t, \cdot) - \min r(s_t, \cdot)}{\max r(s_t, \cdot) - \min r(s_t, \cdot)} \cdot score^*\right). \tag{14}$$
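The sketch below illustrates the Stage 2 data path: discretizing immediate rewards into single-digit scores via Eq. (14) and serializing a state into integer-valued text. The prompt wording and JSON layout are our own illustrative choices, not the exact template used for fine-tuning.

```python
import json
import numpy as np

SCORE_MAX = 9  # score* in Eq. (14): one character per action score

def discretize_scores(r):
    """Eq. (14): min-max map of immediate rewards onto {0, ..., 9}."""
    r = np.asarray(r, dtype=np.float64)
    span = r.max() - r.min()
    if span == 0.0:
        return np.zeros(len(r), dtype=int)
    return np.rint((r - r.min()) / span * SCORE_MAX).astype(int)

def serialize_state(p_e, p_v):
    """Textual serialization X_t using integer weights only."""
    nodes = "; ".join(f"node {v}: status {w}" for v, w in sorted(p_v.items()))
    edges = "; ".join(f"road {sorted(k)}: {c} vehicles" for k, c in p_e.items())
    return f"Vertex coverage: {nodes}. Traffic load: {edges}."

def make_sft_record(instruction, p_e, p_v, rewards):
    """One record D_t = {instruction, input, output} of D_sft."""
    scores = "".join(map(str, discretize_scores(rewards)))
    return {"instruction": instruction,
            "input": serialize_state(p_e, p_v),
            "output": json.dumps({"scores": scores})}
```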
D. Stage 3: Knowledge Alignment

Although pre-trained LLMs possess extensive commonsense knowledge, they lack physical intuition regarding urban mobility and propagation constraints. General LLMs do not inherently understand that vehicles are confined to road geometries or that specific intersections act as LoS bottlenecks due to urban canyons. Moreover, without fine-tuning, LLMs struggle to consistently generate structured outputs (e.g., in JSON format). Current solutions primarily rely on designing rigorous prompts or employing SFT. However, the former imposes high demands on the model's intrinsic capabilities, making it difficult to implement on smaller models. Moreover, the study in [35] indicates that such strict prompting strategies may compromise the LLM's performance. Given the substantial inference demands associated with utilizing LLMs to guide DRL training, we adopt Low-Rank Adaptation (LoRA) [36] for knowledge alignment. This approach enables the adaptation of the LLM to the action scoring task without incurring the prohibitive computational costs of full-parameter training. As shown in Fig. 3, LoRA freezes the pre-trained backbone weights W and injects trainable low-rank adapter matrices A and B into the linear layers. The forward pass is approximated as

$$Y = X(W + \Delta W) = X(W + BA), \tag{15}$$

where B and A are matrices of rank r, with r being significantly smaller than the original model dimension. By optimizing A and B exclusively on D_sft, the model aligns its probability distribution with the topological requirements of the urban grid while preserving its inherent reasoning capabilities. Furthermore, this ensures that the model's output consistently adheres to a strict JSON format.

Fig. 3. Schematic diagram of the LoRA fine-tuning mechanism (frozen pre-trained weights W; trainable rank-r adapters initialized as B = 0 and A ∼ N(0, σ²)).
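Eq. (15) amounts to a thin wrapper around a frozen linear layer. The following standalone PyTorch layer is illustrative (our implementation delegates adapter management to LLaMA Factory); the rank, scaling factor, and initialization scale are placeholder choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Eq. (15): Y = X(W + BA), with W frozen and rank-r A, B trainable."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights W
            p.requires_grad = False
        # B = 0 and A ~ N(0, sigma^2), so training starts at the base model.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Base path plus the low-rank correction x A^T B^T.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```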
E. Stage 4: Semantic-Augmented Training & Execution

In this stage, we train the final SA-PPO agent. As illustrated in Fig. 4, to mitigate the issue of blind exploration inherent in traditional DRL, we deeply integrate the semantic priors derived from the fine-tuned LLM with the PPO policy network via a Logit Fusion mechanism. Furthermore, we employ an objective function incorporating KL regularization to guide the agent's exploration within the semantic space. Benefiting from the robust semantic understanding capabilities and the extensive pre-trained knowledge base of LLMs, the incorporation of semantic priors enhances the algorithm's generalization ability and simultaneously improves training efficiency.

Fig. 4. Schematic diagram of the SA-PPO training and execution mechanism: the fine-tuned semantic prior model and the PPO actor produce z_LLM and z_PPO, which are combined via Logit Fusion into π̃(a|s); training uses the standard PPO loss plus a KL-divergence term, with gradient backpropagation through the PPO branch.

1) Dual-Stream Inference and Logit Fusion: To combine semantic priors with real-time control policies, we design a dual-stream inference architecture. The PPO Actor network receives the state vector s_t and outputs the raw logits vector z_PPO over the action space via a Multilayer Perceptron (MLP), i.e.,

$$z_{PPO} = \mathrm{MLP}_{actor}(s_t). \tag{16}$$

Simultaneously, we serialize the current state into a text token sequence X_t and obtain the semantic prior Y_t output by the fine-tuned LLM. This is then standardized to serve as the raw logits vector of the LLM, i.e.,

$$z_{LLM} = \frac{Y_t - \mu(Y_t)}{\sigma(Y_t) + \varepsilon}. \tag{17}$$

Here, µ is the mean, σ denotes the standard deviation, and ε = 10⁻⁷ is a small positive constant. The final execution policy π̃ is generated by the weighted fusion of these two logits followed by Softmax activation. We term this step Logit Fusion; it establishes a hierarchical control synergy. The LLM acts globally, analyzing the RTG to recommend bridge nodes essential for network integrity. Simultaneously, the PPO agent acts locally, refining these recommendations based on real-time feedback to handle transient vehicle fluctuations. This fusion ensures that the UAV executes a semantic-aware deployment, anchoring to critical hubs while adaptively maneuvering for optimal signal coverage. The fused policy is given by

$$\tilde{\pi}(\cdot|s_t) = \frac{\exp(z_{PPO} + \lambda \cdot z_{LLM})}{\sum_{j=1}^{n} \exp(z_{PPO}^{(j)} + \lambda \cdot z_{LLM}^{(j)})}. \tag{18}$$

2) Objective Function with KL Regularization: To enable the LLM's commonsense recommendations to effectively guide PPO learning, we introduce a KL divergence regularization term into the standard PPO loss function. The total loss function L(θ) for SA-PPO is defined as

$$L = \mathbb{E}_t\left[-L_t^{CLIP} + c_1 L_t^{VF} - c_2 S[\tilde{\pi}](s_t) + \beta D_{KL}(\tilde{\pi}(\cdot|s_t) \,\|\, \pi_{LLM}(\cdot|X_t))\right], \tag{19}$$

where L_t^CLIP is the PPO clipped surrogate objective, L_t^VF denotes the PPO value function loss, S[π̃](s_t) represents the PPO entropy regularization, and D_KL is the KL divergence. The KL term prevents the PPO policy distribution from deviating excessively from the semantic prior π_LLM provided by the LLM, thereby acting as a guardrail that confines the agent's exploration to a valid semantic space and effectively improves the training efficiency of the algorithm. Finally, the training and execution procedure of SA-PPO is outlined in Algorithm 3.

Algorithm 3: Semantic-Augmented Training & Execution of SA-PPO
Input: Fine-tuned LLM M_SFT, maximum number of training episodes K_max.
1: Initialize SA-PPO agent M_θ with random weights θ
2: Initialize replay buffer B
3: for k ← 1 to K_max do
4:   Reset environment
5:   for t = 1 to T do
6:     Observe current state s_t
7:     X_t ← Serialize(s_t)
8:     z_LLM ← M_SFT(I, X_t)
9:     z_PPO ← M_θ(s_t)
10:    Compute π̃(·|s_t) with Eq. (18)
11:    Select action a_t ∼ π̃(·|s_t)
12:    Execute action a_t; observe reward r_t and next state s_{t+1}
13:    Store transition (s_t, a_t, r_t, s_{t+1}, z_LLM, z_PPO) in B
14:  if update condition met then
15:    Update M_θ using the data in B with Eq. (19); clear B

F. High-Throughput Parallel Training System

A major bottleneck in LLM-assisted DRL is inference latency. Considering that existing LLM inference engines can significantly increase inference speed when handling batched tasks (as opposed to serial execution), we implement a vectorized parallel training system. Instead of using a single thread for interaction sampling, we run N parallel environments to simultaneously collect a batch of states S_batch = {s_1, ..., s_N}. These states are serialized and fed to the LLM inference engine as a single batch. This mechanism renders on-policy training with LLM priors computationally feasible in large-scale urban scenarios.
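The core of SA-PPO reduces to a few tensor operations. The sketch below implements the score standardization of Eq. (17), the fused policy of Eq. (18), and the KL term of Eq. (19); the fusion weight λ and KL coefficient β are illustrative, and the remaining loss terms follow the standard clipped-surrogate PPO implementation.

```python
import torch
import torch.nn.functional as F

LAMBDA, BETA_KL, EPS = 1.0, 0.1, 1e-7  # illustrative coefficients

def fuse_policy(z_ppo, llm_scores):
    """Eqs. (17)-(18): standardize the LLM scores into logits, add them
    to the actor logits, and normalize. Shapes: (batch, n_actions)."""
    z_llm = (llm_scores - llm_scores.mean(-1, keepdim=True)) / (
        llm_scores.std(-1, keepdim=True) + EPS)
    pi = F.softmax(z_ppo + LAMBDA * z_llm, dim=-1)
    return pi, z_llm

def kl_to_prior(pi, z_llm):
    """KL term of Eq. (19), with pi_LLM = softmax(z_llm)."""
    log_prior = F.log_softmax(z_llm, dim=-1)
    kl = (pi * (pi.clamp_min(1e-12).log() - log_prior)).sum(-1)
    return kl.mean()

# Inside the PPO update the regularizer is simply added to the loss:
# loss = -l_clip + c1 * l_vf - c2 * entropy + BETA_KL * kl_to_prior(pi, z_llm)
```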
G. SA-DRL Framework

Algorithm 4 outlines the execution workflow of the SA-DRL framework. First, a lightweight exploration phase (Algorithm 1) is executed to construct a database D_State encompassing diverse states. Subsequently, a high-quality supervised fine-tuning dataset D_sft is constructed via state serialization and action scoring (Algorithm 2). Building upon this, LoRA adapters are introduced to fine-tune the pre-trained model M_base, yielding an intermediate model M_SFT equipped with domain-specific semantic priors. Finally, the SA-PPO agent M_θ is initialized to perform online reinforcement learning under the semantic guidance provided by M_SFT (Algorithm 3), ultimately resulting in a converged policy model.

Algorithm 4: The SA-DRL Framework
Input: Pre-trained LLM M_base, system prompt I_task, K_max, N_target.
Output: Optimized SA-PPO agent M_θ.
1: Execute lightweight exploration to collect the state database D_State via Algorithm 1
2: Serialize states and compute action scores to construct the SFT dataset D_sft via Algorithm 2
3: Initialize LoRA adapters for M_base
4: Fine-tune M_base on D_sft to obtain the domain-specific semantic model M_SFT
5: Initialize the SA-PPO agent M_θ
6: Train M_θ online, guided by the semantic priors from M_SFT, via Algorithm 3
7: return the converged SA-PPO agent M_θ

IV. PERFORMANCE EVALUATION

A. System Setup

We utilize the dataset from [37], which provides topological urban maps and vehicle trajectories collected by large-scale traffic surveillance systems in two Chinese cities. The original dataset contains approximately 5 million records over four days. To speed up simulations, we generate a small-scale subset based on the data for a specific area in Shenzhen on April 16, 2021, consisting of 47 nodes, 88 edges, and 5,000 trajectory records. Based on this dataset, we construct a Python simulation system to implement the four-stage SA-DRL pipeline. Figure 5 depicts the RTG of the simulated network, where the red node marks the UAV's starting position (Node 9), green nodes denote intersections equipped with RSUs (Nodes 11-13), and blue nodes represent standard road intersections. Following this setup, we conduct comparative experiments and ablation studies to verify the algorithm's effectiveness. In terms of implementation, we use PyTorch for the SA-PPO algorithm, LLaMA Factory [38] for LLM fine-tuning, and the vLLM framework [39] for inference acceleration. The experimental hardware consists of an Intel Core i7-14700KF @ 3.42 GHz CPU and an NVIDIA GeForce RTX 4080 Super GPU.

Fig. 5. The RTG corresponding to the road network used in simulation.

B. Analysis of the Semantic Prior Model

1) Performance Evaluation of Semantic Knowledge Alignment: In addition to standard LLM evaluation metrics such as the Bilingual Evaluation Understudy (cumulative 4-gram score, BLEU-4) and the Recall-Oriented Understudy for Gisting Evaluation (based on the Longest Common Subsequence, ROUGE-L), we introduce three evaluation metrics tailored to the algorithm's characteristics:

• JSON Parsing Success Rate (JSON PSR): Given that the DRL training loop requires automated interaction, the LLM output must strictly adhere to the defined JSON schema. We define the JSON Parsing Success Rate as the proportion of generated responses that can be successfully decoded by a standard JSON parser without syntax errors (e.g., missing brackets, unescaped characters):

$$PSR = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(\mathrm{isValidJSON}(o_i)). \tag{20}$$

• Kendall's Rank Correlation Coefficient (Kendall's τ): This metric quantifies the ordinal association between the semantic scores predicted by the LLM and the ground-truth heuristic rewards.
Unlike absolute-value-based metrics (e.g., Mean Squared Error), Kendall's τ focuses on the relative ranking of candidates, which is crucial for DRL agents that select actions based on relative dominance. Given two sequences of equal length x, y, we enumerate all pairs of positions (i, j). Let C be the number of concordant pairs, i.e., those with (x_i − x_j)(y_i − y_j) > 0, and D the number of discordant pairs, i.e., those with (x_i − x_j)(y_i − y_j) < 0. Then, Kendall's τ is defined as

$$\tau = \frac{C - D}{C + D}. \tag{21}$$

A larger τ indicates a higher rank correlation between the two sequences: τ = 1 implies identical ranking, while τ = −1 implies completely opposite ranking.

• Top-k Hit Rate (HR_k): Beyond global ranking correlation, a critical requirement is that optimal actions are included within the high-probability candidate set. We therefore adopt a set-based Top-k Hit Rate. Specifically, for each state, we sort both the predicted semantic scores and the ground-truth heuristic rewards in descending order to extract two index subsets: the top-k predicted set S_pred^(k) and the top-k ground-truth set S_gt^(k). We then measure the cardinality of their intersection relative to k:

$$HR_k = \frac{1}{M} \sum_{i=1}^{M} \frac{|S_{pred,i}^{(k)} \cap S_{gt,i}^{(k)}|}{k}, \tag{22}$$

where M is the number of samples. This metric quantifies the overlap ratio, indicating the accuracy with which the LLM captures the set of dominant topological nodes. In the subsequent presentation of results, we set k = 10.

To evaluate the impact of different backbone architectures on topological reasoning and instruction-following capabilities, we benchmark four state-of-the-art lightweight LLMs: Qwen2.5-3B, Qwen3-4B, Llama3.2-3B, and Gemma3-4B. We focus on models with fewer parameters to ensure the feasibility of deployment on resource-constrained UAVs. The comparative results are summarized in Table I.

TABLE I
PERFORMANCE EVALUATION OF SEMANTIC KNOWLEDGE ALIGNMENT

LLM          Qwen3   Qwen3   Qwen2.5  Llama3.2  Gemma3
Parameters   4B      4B      3B       3B        4B
SFT          None    LoRA    LoRA     LoRA      LoRA
BLEU-4       0.20    70.32   74.32    44.64     55.50
ROUGE-L      7.78    78.42   81.68    69.81     74.84
Kendall's τ  nan     78.82   85.03    85.60     88.89
JSON PSR     0.00    100.00  100.00   100.00    100.00
HR_10        0.00    51.11   56.60    55.53     58.60

As demonstrated in the first two columns of Table I, the necessity of the knowledge alignment phase (Stage 3) is undeniable. The base model (without SFT) exhibits a complete failure in the instruction-following task, yielding a JSON PSR of 0.00% and a BLEU score approaching zero. This indicates that without domain-specific fine-tuning, general-purpose LLMs are incapable of generating valid JSON structures or comprehending the action scoring logic. In contrast, after applying LoRA fine-tuning (Qwen3-4B), the success rate surges to 100%, and the topological reasoning capability is significantly enhanced, achieving a Kendall's τ of 78.82. This validates that our parameter-efficient fine-tuning strategy successfully transforms the LLM from a generic chatbot into a well-trained topology expert. However, notable performance disparities are observed among the fine-tuned models. While Gemma3-4B achieves marginally superior topological ranking accuracy, Qwen2.5-3B demonstrates enhanced performance in terms of textual consistency. This trade-off between precision and potential computational cost necessitates a further evaluation incorporating inference efficiency to determine the final model selection.
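The three custom metrics above reduce to a few lines each. The sketch below is a minimal implementation, assuming scipy; note that scipy's kendalltau computes the tie-corrected τ-b, which coincides with Eq. (21) when there are no ties.

```python
import json
import numpy as np
from scipy.stats import kendalltau

def json_psr(outputs):
    """Eq. (20): fraction of responses that parse as valid JSON."""
    ok = 0
    for o in outputs:
        try:
            json.loads(o)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)

def mean_kendall_tau(pred_scores, true_scores):
    """Eq. (21): average rank correlation over samples (tau-b variant)."""
    taus = [kendalltau(p, y)[0] for p, y in zip(pred_scores, true_scores)]
    return float(np.nanmean(taus))

def top_k_hit_rate(pred_scores, true_scores, k=10):
    """Eq. (22): mean overlap of top-k predicted and ground-truth sets."""
    hits = [len(set(np.argsort(p)[-k:]) & set(np.argsort(y)[-k:])) / k
            for p, y in zip(pred_scores, true_scores)]
    return float(np.mean(hits))
```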
Fig. 6. Inference efficiency (steps/s versus batch size) for the different LLM architectures.

2) Inference Efficiency Analysis: While semantic alignment is imperative, the inherently high-frequency interaction of DRL imposes stringent constraints on inference latency. Although our proposed parallel training system enables batch inference, the scalability of the underlying LLM remains a critical bottleneck determining overall training efficiency. As illustrated in Fig. 6, we observe a critical performance divergence as the batch size increases. The throughput of Gemma3-4B saturates rapidly, stagnating at approximately 9 steps/s regardless of the increase in batch size. This indicates a severe VRAM bottleneck (e.g., the VRAM available for the KV cache is insufficient), likely attributable to its larger parameter count and architectural overhead, rendering it unsuitable for parallelized training environments. In contrast, Qwen2.5-3B demonstrates superior scalability: its throughput increases almost linearly with batch size, peaking at approximately 58 steps/s at a batch size of 128, more than 6 times faster than Gemma3-4B. This high-throughput capability enables the SA-PPO agent to collect samples and update its policy significantly faster, thereby substantially reducing the total training duration. In summary, although Gemma3-4B holds a marginal advantage in semantic reasoning, its suboptimal inference efficiency imposes an unacceptable computational burden on the system. Consequently, we select Qwen2.5-3B as the backbone network and set the batch size to 128, thereby striking an optimal balance between reasoning capability and system efficiency.

3) Visual Analysis of Semantic Alignment: To provide a more intuitive understanding of the efficacy of semantic alignment, we visualize the action score distributions generated by the LLMs. Fig. 7 presents a heatmap comparison between the average semantic scores predicted by the different fine-tuned models and the ground truth (Label) on the test set. In this visualization, the x-axis represents the index of intersection nodes (Action Index), while the color intensity corresponds to the average action score. Variations in color distribution reflect the models' differing capabilities in learning the environment. However, the overall trends reveal the intrinsic topological characteristics of the urban landscape. Since the heatmap represents scores aggregated across all test states, the vertical bands consistently exhibiting high scores (e.g., Nodes 34 and 10) identify nodes possessing high global centrality or strategic importance. By cross-referencing with the road topology map in Fig. 5, we observe that these high-scoring indices correspond to critical traffic hubs (notably, they are all cut vertices of the graph). This correspondence confirms that the fine-tuned LLMs have successfully internalized the spatial structure of the road network, learning to prioritize topologically critical nodes while abstracting away transient traffic fluctuations.

Fig. 7. Heatmap comparison of average action scores predicted by the different fine-tuned LLMs against the ground truth.
Furthermore, to validate the stability of the selected backbone model, Fig. 8 details the prediction performance of Qwen2.5-3B across multiple random samples. The upper row displays the ground-truth labels, while the lower row shows the model predictions. The results indicate that the model generalizes successfully across diverse topological states, consistently identifying key intersections and reproducing the sparse distribution characteristics of the reward function. This confirms that the LoRA-based fine-tuning has successfully aligned the reasoning capabilities of the LLM with the domain-specific requirements of urban vehicular networks.

Fig. 8. Comparison of action score predictions by Qwen2.5-3B against the ground truth (labels in the upper row, predictions in the lower row, over five random samples).

C. Performance Evaluation of SA-PPO

1) Comparative Experiments: To validate the effectiveness of the proposed SA-PPO algorithm in dynamic UAV deployment, we compare it against three representative DRL baselines. These baselines encompass standard on-policy and off-policy algorithms, as well as graph-reinforced DRL methods.

• Soft Actor-Critic (SAC): An off-policy DRL algorithm utilized in [40] to optimize the Age of Information (AoI) in UAV-aided vehicular edge computing. SAC enhances exploration by simultaneously maximizing reward and policy entropy.

• Vanilla PPO [13]: The standard Proximal Policy Optimization algorithm (on-policy) without any semantic or structural enhancements. The agent observes raw state vectors and learns solely through trial-and-error interactions with the environment.

• GAT-PPO: A PPO variant that employs a Graph Attention Network (GAT) as the feature extractor, replacing the standard Multi-Layer Perceptron (MLP). Since our environment is modeled on the RTG, GAT-PPO is designed to explicitly extract structured features from the graph.

• SA-PPO (ours): The proposed algorithm, which integrates the domain-specific topological expertise of a fine-tuned LLM into the PPO training loop via the Logit Fusion mechanism.

Fig. 9. Training reward curves for the different algorithms.

Fig. 10. Comparison of VANET connectivity metrics (average car number in connected blocks / average connected block size / average UAV flight distance): SAC 36.9 / 16.8 / 45.7 m; Vanilla PPO 37.0 / 17.7 / 793.2 m; GAT-PPO 35.3 / 17.3 / 1158.2 m; SA-PPO 41.9 / 21.0 / 223.7 m.

As shown in Fig. 9, SA-PPO converges significantly faster than all baseline algorithms. By leveraging the semantic priors provided by the LLM to prune the invalid action space, SA-PPO reaches a stable high-reward state around 2,500 episodes, whereas the other baselines require substantially more interactions. The quantitative metrics in Fig. 10 reveal distinct operational patterns among the algorithms.
SA-PPO achieves the highest average number of vehicles within the connected component (41.9), surpassing both GAT-PPO and Vanilla PPO. Crucially, SA-PPO achieves this while maintaining an extremely low average UAV flight distance (223.7 m). This indicates that SA-PPO accurately identifies topologically critical intersections and stations there, moving only when necessary. This stands in sharp contrast to the lazy behavior of SAC (45.7 m), which sacrifices connectivity to save energy, and the inefficient behaviors of the other baselines. SA-PPO thus resolves the multi-objective optimization problem, achieving maximum connectivity coverage with minimal energy consumption.

It is noteworthy in Fig. 9 and Fig. 10 that, despite employing a Graph Neural Network (GNN), GAT-PPO performs comprehensively worse than the simple Vanilla PPO. Not only is its connectivity performance inferior, but its energy consumption increases drastically, with a flight distance reaching 1158.2 m, nearly 1.5 times that of Vanilla PPO and more than 5 times that of SA-PPO. We attribute this to the graph attention mechanism's hypersensitivity to local traffic noise. In scenarios with a static road network topology but fluctuating vehicle density, the GAT-based agent fails to filter out transient noise, leading to unstable, continuous maneuvering between adjacent nodes in pursuit of fleeting traffic hotspots. This behavior wastes energy without forming a stable coverage structure. In contrast, the simple MLP-based Vanilla PPO is less susceptible to such over-smoothing and jitter, and thus outperforms GAT-PPO. This underscores the value of our SA-PPO approach: rather than enhancing performance by increasing the complexity of the feature extractor, we guide the agent toward global optimality and policy stability by injecting stable semantic common sense via the LLM.
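For reference, the following minimal sketch shows a GAT feature extractor of the kind used by the GAT-PPO baseline, written with PyTorch Geometric; the layer widths, attention-head count, and pooling choice are illustrative assumptions rather than our exact configuration.

```python
import torch
from torch_geometric.nn import GATConv, global_mean_pool

class GATExtractor(torch.nn.Module):
    """Graph-attention feature extractor replacing the MLP in PPO.

    Encodes the RTG (nodes = intersections, edges = road segments) into a
    fixed-size embedding consumed by the PPO actor and critic heads.
    """
    def __init__(self, node_dim: int, hidden: int = 64, heads: int = 4):
        super().__init__()
        self.gat1 = GATConv(node_dim, hidden, heads=heads)    # -> hidden * heads
        self.gat2 = GATConv(hidden * heads, hidden, heads=1)  # -> hidden

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.gat1(x, edge_index))
        h = torch.relu(self.gat2(h, edge_index))
        return global_mean_pool(h, batch)  # one embedding per graph in the batch

# Example: a 4-node toy road graph with 3-dimensional node features.
x = torch.randn(4, 3)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])  # edges of a ring
batch = torch.zeros(4, dtype=torch.long)                 # all nodes in graph 0
print(GATExtractor(node_dim=3)(x, edge_index, batch).shape)  # torch.Size([1, 64])
```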
Furthermore, to qualitatively analyze the decision-making logic, we visualize the UAV trajectories generated by the different algorithms in simulation snapshots, as shown in Fig. 11. SA-PPO exhibits a highly intelligent "Strategic Stationing" pattern: the trajectory shows the UAV flying directly to critical cut-vertex intersections in the road network and maintaining a precise hover. This behavior aligns with its extremely low flight distance (223.7 m). Guided by the LLM's semantic priors, the agent confidently ignores transient traffic noise in fringe areas and anchors itself to global connectivity hubs, thereby achieving maximum connectivity enhancement with minimal energy expenditure. The trajectory of Vanilla PPO appears chaotic and lacks strategic direction, often wandering in low-density edge regions. Consistent with the quantitative data, the SAC agent remains nearly static, confirming the issue of Mode Collapse. These visualizations confirm that the common sense injected by the LLM is successfully translated into spatially rational flight patterns.

Fig. 11. UAV trajectory visualization under different algorithms.

2) Ablation Studies: To strictly verify the contribution of semantic priors and the effectiveness of the Logit Fusion mechanism, we conduct ablation studies comparing the proposed SA-PPO with two variants:

1) w/o LLM (Vanilla PPO): The semantic branch is removed, and the agent learns solely through the PPO actor. This serves as a benchmark for blind exploration.
2) w/o Logit Fusion (Pure Semantic Policy): The RL branch is removed, and actions are selected by directly sampling from the fine-tuned LLM output distribution, i.e., $a \sim \mathrm{Softmax}(z_{\mathrm{LLM}})$. This variant is used to evaluate the quality of the raw semantic priors without RL adaptation.

The comparative results are summarized in Fig. 12. The w/o Logit Fusion variant achieves a significantly higher connectivity score than Vanilla PPO. This result proves that the fine-tuned LLM has successfully internalized the topological "common sense" of the urban road network: even without any interaction-based reinforcement learning, the domain knowledge embedded in the LLM outperforms the policy learned by the standard DRL agent through trial and error, validating the necessity of the knowledge alignment phase. However, the pure semantic policy reveals a fatal flaw regarding energy consumption. As shown in Fig. 12, it exhibits an extremely high flight distance (980.6 m). This reflects the myopia of the LLM, which continuously moves between various high-value intersections without considering energy penalties. In contrast, SA-PPO achieves the optimal balance: it obtains the highest connectivity (41.9) with the lowest flight distance (223.7 m). This demonstrates the value of the Logit Fusion mechanism, in which the PPO component effectively constrains the hyperactive tendencies of the LLM, teaching the agent to perform strategic stationing at key nodes identified by the LLM rather than blindly patrolling.

Fig. 12. Ablation study results comparing SA-PPO with its variants.
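To make the fused policy concrete, the sketch below shows one plausible realization of the Logit Fusion step, in which the LLM's semantic logits additively bias the PPO actor's logits before sampling. The additive form and the scalar weight alpha are assumptions for illustration, not necessarily the exact fusion rule used in our framework.

```python
import torch

def fused_action(z_ppo: torch.Tensor, z_llm: torch.Tensor,
                 alpha: float = 1.0) -> torch.Tensor:
    """Sample an action from the fused policy.

    z_ppo: PPO actor logits over intersection nodes, shape [num_nodes].
    z_llm: semantic logits from the fine-tuned LLM, same shape.
    alpha: assumed fusion weight; alpha = 0 recovers Vanilla PPO, while
           sampling from z_llm alone is the "w/o Logit Fusion" variant.
    """
    dist = torch.distributions.Categorical(logits=z_ppo + alpha * z_llm)
    return dist.sample()

# Toy example over 5 candidate intersections.
z_ppo = torch.randn(5)
z_llm = torch.tensor([0.0, 3.0, 0.0, 0.0, 0.0])  # LLM strongly favors node 1
print(fused_action(z_ppo, z_llm))
```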
3) Robustness and Generalization: To evaluate the robustness of the algorithms against temporal distribution shifts in traffic patterns, all models are trained on traffic flow data starting at 08:20 and tested in six distinct time periods ranging from 08:00 to 18:00 with identical duration. As illustrated in Fig. 13, SA-PPO demonstrates exceptional scenario-adaptive intelligence. In hard, sparse scenarios such as 12:00, where connectivity is scarce, it leverages topological priors to achieve a superior connectivity score of 50.3, surpassing Vanilla PPO (44.0) by over 14%. Conversely, in easy, dense scenarios such as 16:00, it adopts a rational trade-off strategy: while maintaining a competitive connectivity level (79.4 vs. 86.2), it reduces energy consumption by approximately 50% (471.3 m vs. 954.0 m) compared to the hyperactive Vanilla PPO. Furthermore, SA-PPO exhibits dynamic maneuverability, adjusting its flight distance from 152.1 m at 14:00 to 471.3 m at 16:00 according to real-time traffic demands, thereby proving its ability to generalize effectively to unseen traffic distributions.

Fig. 13. Robustness comparison of connectivity metrics and UAV flight distance across different task start times.

V. CONCLUSION

To address the fragmentation issue of VANETs in complex urban environments, we proposed the SA-DRL framework, effectively overcoming the blind exploration and sample inefficiency that afflict traditional methods lacking topological semantic understanding. In this work, we quantified network fragmentation using the RTG and DCG, and designed a four-stage pipeline incorporating the SA-PPO algorithm. This architecture integrated the reasoning capabilities of LLMs as semantic priors into policy learning via a Logit Fusion mechanism. Simulation experiments based on real-world trajectories demonstrated that our method effectively avoided the Mode Collapse issue observed in the Soft Actor-Critic algorithm. Furthermore, compared to the Vanilla PPO baseline, SA-PPO not only exhibited stronger generalization against dynamic traffic flows but also achieved convergence using only 26.6% of the training episodes, reducing UAV energy consumption to 28.2% of the baseline while improving two key connectivity metrics by 13.2% and 23.5%, respectively.

REFERENCES

[1] V.-L. Nguyen, R.-H. Hwang, P.-C. Lin, A. Vyas, and V.-T. Nguyen, "Toward the age of intelligent vehicular networks for connected and autonomous vehicles in 6G," IEEE Network, vol. 37, no. 3, pp. 44–51, 2023.
[2] M. J. N. Mahi, S. Chaki, S. Ahmed, M. Biswas, M. S. Kaiser, M. S. Islam, M. Sookhak, A. Barros, and M. Whaiduzzaman, "A review on VANET research: Perspective of recent emerging technologies," IEEE Access, vol. 10, pp. 65760–65783, 2022.
[3] T. Abbas, K. Sjöberg, J. Karedal, and F. Tufvesson, "A measurement based shadow fading model for vehicle-to-vehicle network simulations," Int. J. Antennas Propag., vol. 2015, no. 1, p. 190607, 2015.
[4] N. Akhtar, S. C. Ergen, and O. Ozkasap, "Vehicle mobility and communication channel models for realistic and efficient highway VANET simulation," IEEE Trans. Veh. Technol., vol. 64, no. 1, pp. 248–262, 2015.
[5] O. S. Oubbati, M. Atiquzzaman, A. Baz, H. Alhakami, and J. Ben-Othman, "Dispatch of UAVs for urban vehicular networks: A deep reinforcement learning approach," IEEE Trans. Veh. Technol., vol. 70, no. 12, pp. 13174–13189, 2021.
[6] M. Lehsaini, N. Gaouar, and T. Nebbou, "Efficient deployment of roadside units in vehicular networks using optimization methods," Int. J. Commun. Syst., vol. 35, no. 14, p. e5265, 2022.
[7] J. Clancy, D. Mullins, B. Deegan, J. Horgan, E. Ward, C. Eising, P. Denny, E. Jones, and M. Glavin, "Wireless access for V2X communications: Research, challenges and opportunities," IEEE Commun. Surv. Tutorials, vol. 26, no. 3, pp. 2082–2119, 2024.
[8] H. Kurunathan, H. Huang, K. Li, W. Ni, and E. Hossain, "Machine learning-aided operations and communications of unmanned aerial vehicles: A contemporary survey," IEEE Commun. Surv. Tutorials, vol. 26, no. 1, pp. 496–533, 2024.
[9] M. Gapeyenko, D. Moltchanov, S. Andreev, and R. W. Heath, "Line-of-sight probability for mmWave-based UAV communications in 3D urban grid deployments," IEEE Trans. Wireless Commun., vol. 20, no. 10, pp. 6566–6579, 2021.
[10] J. Sabzehali, V. K. Shah, Q. Fan, B. Choudhury, L. Liu, and J. H. Reed, "Optimizing number, placement, and backhaul connectivity of multi-UAV networks," IEEE Internet Things J., vol. 9, no. 21, pp. 21548–21560, 2022.
[11] H. S. Yahia and A. S. Mohammed, "Path planning optimization in unmanned aerial vehicles using meta-heuristic algorithms: A systematic review," Environ. Monit. Assess., vol. 195, no. 1, p. 30, 2023.
[12] Y. Bai, H. Zhao, X. Zhang, Z. Chang, R. Jäntti, and K. Yang, "Toward autonomous multi-UAV wireless network: A survey of reinforcement learning-based approaches," IEEE Commun. Surv. Tutorials, vol. 25, no. 4, pp. 3038–3067, 2023.
[13] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint, 2017.
[14] Y. Guan, S. Zou, H. Peng, W. Ni, Y. Sun, and H. Gao, "Cooperative UAV trajectory design for disaster area emergency communications: A multiagent PPO method," IEEE Internet Things J., vol. 11, no. 5, pp. 8848–8859, 2024.
[15] S. Javaid, H. Fahim, B. He, and N. Saeed, "Large language models for UAVs: Current state and pathways to the future," IEEE Open J. Veh. Technol., vol. 5, pp. 1166–1192, 2024.
[16] L. Yuan, C. Deng, D.-J. Han, I. Hwang, S. Brunswicker, and C. G. Brinton, "Next-generation LLM for UAV: From natural language to autonomous flight," arXiv preprint, 2025.
[17] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., "Sparks of artificial general intelligence: Early experiments with GPT-4," arXiv preprint arXiv:2303.12712, 2023.
[18] W. Zu, W. Song, R. Chen, Z. Guo, F. Sun, Z. Tian, W. Pan, and J. Wang, "Language and sketching: An LLM-driven interactive multimodal multitask robot navigation framework," in 2024 IEEE Int. Conf. Robot. Autom. (ICRA), 2024, pp. 1019–1025.
[19] S. Mokhtari, N. Nouri, J. Abouei, A. Avokh, and K. N. Plataniotis, "Relaying data with joint optimization of energy and delay in cluster-based UAV-assisted VANETs," IEEE Internet Things J., vol. 9, no. 23, pp. 24541–24559, 2022.
[20] O. Chughtai, N. N. Qadri, Z. Kaleem, and C. Yuen, "Drone-assisted cooperative routing scheme for seamless connectivity in V2X communication," IEEE Access, vol. 12, pp. 17369–17381, 2024.
[21] A. Andreou, C. X. Mavromoustakis, J. M. Batalla, E. K. Markakis, and G. Mastorakis, "UAV-assisted RSUs for V2X connectivity using Voronoi diagrams in 6G+ infrastructures," IEEE Trans. Intell. Transp. Syst., vol. 24, no. 12, pp. 15855–15865, 2023.
[22] X. Fan, H. Zhang, Y. Huang, Y. Su, H. Li, J. Huo, C. Sun, S. Hao, and L. Zhen, "Temporal data dissemination in UAV-assisted VANETs through time-varying graphs," IEEE Trans. Veh. Technol., vol. 73, no. 10, pp. 14835–14846, 2024.
[23] X. Wei, L. Cai, N. Wei, P. Zou, J. Zhang, and S. Subramaniam, "Joint UAV trajectory planning, DAG task scheduling, and service function deployment based on DRL in UAV-empowered edge computing," IEEE Internet Things J., vol. 10, no. 14, pp. 12826–12838, 2023.
[24] Y. Yao, K. Lv, S. Huang, and W. Xiang, "3D deployment and energy efficiency optimization based on DRL for RIS-assisted air-to-ground communications networks," IEEE Trans. Veh. Technol., vol. 73, no. 10, pp. 14988–15003, 2024.
[25] J. Chen, D. Huang, Y. Wang, Z. Yu, Z. Zhao, X. Cao, Y. Liu, T. Q. S. Quek, and D. Oliver Wu, "Enhancing routing performance through trajectory planning with DRL in UAV-aided VANETs," IEEE Trans. Mach. Learn. Commun. Netw., vol. 3, pp. 517–533, 2025.
[26] P. Hou, Y. Huang, H. Zhu, Z. Lu, S.-C. Huang, Y. Yang, and H. Chai, "Distributed DRL-based intelligent over-the-air computation in unmanned aerial vehicle swarm-assisted intelligent transportation system," IEEE Internet Things J., vol. 11, no. 21, pp. 34382–34397, 2024.
[27] H. Samma and S. El-Ferik, "UAV visual path planning using large language models," Transp. Res. Procedia, vol. 84, pp. 339–345, 2025.
[28] S. Cai, Y. Wu, and L. Zhou, "LLM-land: Large language models for context-aware drone landing," arXiv preprint, 2025.
[29] Q. Zhou, J. Wu, M. Zhu, Y. Zhou, F. Xiao, and Y. Zhang, "LLM-QL: A LLM-enhanced Q-learning approach for scheduling multiple parallel drones," IEEE Trans. Knowl. Data Eng., vol. 37, no. 9, pp. 5393–5406, 2025.
[30] J. Wu, H. You, B. Sun, and J. Du, "LLM-driven Pareto-optimal multi-mode reinforcement learning for adaptive UAV navigation in urban wind environments," IEEE Access, vol. 13, pp. 163550–163570, 2025.
[31] S. Albeaik, A. Bayen, M. T. Chiri, X. Gong, A. Hayat, N. Kardous, A. Keimer, S. T. McQuade, B. Piccoli, and Y. You, "Limitations and improvements of the intelligent driver model (IDM)," SIAM J. Appl. Dyn. Syst., vol. 21, no. 3, pp. 1862–1892, 2022.
[32] A. Al-Hourani, S. Kandeepan, and S. Lardner, "Optimal LAP altitude for maximum coverage," IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569–572, 2014.
[33] Y. Ding, Q. Zhang, W. Lu, N. Zhao, A. Nallanathan, X. Wang, and X. Yang, "Collaborative communication and computation for secure UAV-enabled MEC against active aerial eavesdropping," IEEE Trans. Wireless Commun., vol. 23, no. 11, pp. 15915–15929, 2024.
[34] W. Yuan, G. Cao, Y. Hou, J. Wang, S. Chen, H. He, and J. Yang, "Deep transfer reinforcement learning based exploration enhanced multi-UAV trajectory planning," IEEE Trans. Commun., pp. 1–1, 2025.
[35] J. He, M. Rungta, D. Koleczek, A. Sekhon, F. X. Wang, and S. Hasan, "Does prompt formatting have any impact on LLM performance?" arXiv preprint arXiv:2411.10541, 2024.
[36] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., "LoRA: Low-rank adaptation of large language models," ICLR, vol. 1, no. 2, p. 3, 2022.
[37] F. Yu, H. Yan, R. Chen, G. Zhang, Y. Liu, M. Chen, and Y. Li, "City-scale vehicle trajectory data from traffic camera videos," Sci. Data, vol. 10, no. 1, p. 711, Oct. 2023. [Online]. Available: https://doi.org/10.1038/s41597-023-02589-y
[38] Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, Z. Feng, and Y. Ma, "LlamaFactory: Unified efficient fine-tuning of 100+ language models," in Proc. 62nd Annu. Meeting Assoc. Comput. Ling. (ACL), Bangkok, Thailand, 2024. [Online]. Available: http://arxiv.org/abs/2403.13372
[39] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, "Efficient memory management for large language model serving with PagedAttention," in Proc. 29th ACM Symp. Oper. Syst. Princ. (SOSP), 2023.
[40] S. Goudarzi, S. Ahmad Soleymani, M. Hossein Anisi, A. Jindal, and P. Xiao, "Optimizing UAV-assisted vehicular edge computing with age of information: An SAC-based solution," IEEE Internet Things J., vol. 12, no. 5, pp. 4555–4569, 2025.