Critic-Free Deep Reinforcement Learning for Maritime Coverage Path Planning on Irregular Hexagonal Grids


Authors: Carlos S. Sepúlveda, Gonzalo A. Ruz

Carlos S. Sepúlveda a,b,∗, Gonzalo A. Ruz a,c,d

a Facultad de Ingeniería y Ciencias, Universidad Adolfo Ibáñez, Av. Diagonal las Torres 2640, Santiago, Chile
b Dirección de Programas, Investigación y Desarrollo, Armada de Chile, General del Canto 398, Valparaíso, Chile
c Millennium Nucleus for Social Data Science (SODAS), Santiago, Chile
d Millennium Nucleus in Data Science for Plant Resilience (PhytoLearning), Santiago, Chile

Abstract

Maritime surveillance missions, such as search and rescue and environmental monitoring, rely on the efficient allocation of sensing assets over vast and geometrically complex areas. Traditional Coverage Path Planning (CPP) approaches depend on decomposition techniques that struggle with irregular coastlines, islands, and exclusion zones, or require computationally expensive re-planning for every instance. We propose a Deep Reinforcement Learning (DRL) framework to solve CPP on hexagonal grid representations of irregular maritime areas. Unlike conventional methods, we formulate the problem as a neural combinatorial optimization task in which a Transformer-based pointer policy autoregressively constructs coverage tours. To overcome the instability of value estimation in long-horizon routing problems, we implement a critic-free Group-Relative Policy Optimization (GRPO) scheme. This method estimates advantages through within-instance comparisons of sampled trajectories rather than relying on a value function. Experiments on 1,000 unseen synthetic maritime environments demonstrate that a trained policy achieves a 99.0% Hamiltonian success rate, more than double the best heuristic (46.0%), while producing paths 7% shorter and with 24% fewer heading changes than the closest baseline.
All three inference modes (greedy, stochastic sampling, and sampling with 2-opt refinement) operate under 50 ms per instance on a laptop GPU, confirming feasibility for real-time on-board deployment.

Keywords: Deep Reinforcement Learning, Coverage Path Planning, Maritime Surveillance, Hexagonal Grid, Neural Combinatorial Optimization, Maritime Domain Awareness

∗ Corresponding author.
Email addresses: carlos.sepulveda@alumnos.uai.cl (Carlos S. Sepúlveda), gonzalo.ruz@uai.cl (Gonzalo A. Ruz)

1. Introduction

Maritime surveillance (MS) underpins safety- and security-critical missions such as search and rescue (SAR), protection of critical infrastructure, prevention of illicit activities, and environmental monitoring (Boraz, 2009). Maritime transport carries the majority of global merchandise trade by volume, making the maritime domain a key component of national and international resilience (United Nations Trade and Development (UNCTAD), 2024). Consequently, dense traffic, constrained chokepoints, and the strategic relevance of offshore infrastructure increase the operational demand for persistent Maritime Domain Awareness (MDA) (Bueger et al., 2024; Bischof et al., 2018).

Achieving MDA typically relies on heterogeneous sensing assets whose capabilities must be allocated over large areas of interest (AOIs) with limited endurance. This operational requirement gives rise to planning problems formalized as Coverage Path Planning (CPP) (Choset, 2001; Galceran and Carreras, 2013; Cabreira et al., 2019; Fevgas et al., 2022). CPP seeks trajectories that systematically observe an AOI under constraints on motion, endurance, and obstacles.

However, maritime environments present specific geometric challenges that complicate classical CPP. Realistic AOIs are rarely convex polygons; they frequently involve irregular coastlines, exclusion zones, islands, and navigational hazards (Nielsen et al., 2019; Mier et al., 2023).
Standard sweep patterns (e.g., boustrophedon paths) typically require an exact cellular decomposition of the target area into simpler, sweepable sub-regions (Choset, 2001; Galceran and Carreras, 2013). This process relies heavily on the geometric morphology of the environment; irregular coastlines and dense exclusion zones generate numerous critical points that severely fragment the area. Consequently, routing a vehicle through these disjointed sub-regions becomes computationally intensive and kinematically inefficient, forcing excessive transit flights and sharp heading reversals (Cabreira et al., 2019; Nielsen et al., 2019). Furthermore, exact optimization-based approaches such as Mixed-Integer Linear Programming (MILP) or evolutionary algorithms typically solve each instance from scratch (Choi et al., 2019, 2020; Azad et al., 2017). This lack of generalization prevents the reuse of computational effort across different mission geometries, limiting their utility for rapid on-board replanning.

Reinforcement Learning (RL) offers a paradigm for learning policies that generalize across problem instances (Sutton and Barto, 2018; Kaelbling et al., 1996). In particular, attention-based Neural Combinatorial Optimization (NCO) has demonstrated that Transformer-based policies can learn to construct high-quality solutions for routing problems such as the Traveling Salesman Problem (TSP) (Kool et al., 2018; Berto et al., 2025). This paradigm is highly compatible with CPP over discretized AOIs, where the agent must select a sequence of nodes to visit. We adopt a hexagonal discretization due to its favorable geometric properties, such as equidistant neighbors and reduced anisotropy compared to square grid maps (Boots et al., 1999; Kadioglu et al., 2019; Cho et al., 2021b).

In this work, we propose a Transformer-based pointer policy that operates on a graph representation of the hexagonal AOI.
The policy constructs valid coverage tours by selecting feasible moves via dynamic action masking. A key innovation in our approach is the use of a critic-free Group-Relative Policy Optimization (GRPO) scheme. While Actor-Critic methods like Proximal Policy Optimization (PPO) are standard, learning an accurate value function for combinatorial problems with sparse rewards remains challenging. GRPO stabilizes training by estimating advantages from within-instance comparisons of multiple sampled trajectories (Shao et al., 2024), avoiding the bias and instability of a learned critic.

The main contributions of this work are:

• We formulate maritime CPP over irregular AOIs as a graph traversal problem on hexagonal grids with holes and obstacles, focusing on minimizing path length and turns while ensuring complete coverage.
• We design a Transformer-based pointer policy that constructs feasible coverage trajectories using validity masking, enabling the handling of variable-sized AOIs and arbitrary obstacle configurations.
• We adapt GRPO to the CPP domain, demonstrating that critic-free learning via trajectory comparison is effective for stabilizing training in long-horizon routing tasks.
• We empirically demonstrate that a single learned policy generalizes to unseen geometries, achieving near-oracle feasibility rates and producing shorter, smoother paths than 13 classical heuristics without instance-specific retraining.

The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 formalizes the problem. Section 4 presents the proposed method and training scheme. Section 5 describes the experimental setup, Section 6 presents the results and discussion, and Section 7 concludes.
2. Related Work

This section reviews the lines of work most relevant to our formulation: (i) maritime surveillance and persistent monitoring, (ii) coverage path planning and hexagonal tessellations, and (iii) deep reinforcement learning for combinatorial routing. We close by positioning our approach within this landscape (visualized in Fig. 1).

2.1. Maritime surveillance and persistent monitoring

Maritime Domain Awareness (MDA) is a strategic requirement for safety and security missions, including search and rescue (SAR) and critical infrastructure protection (Boraz, 2009; Bueger et al., 2024). Traditional surveillance relies on heterogeneous sensor constellations like coastal radars, the Automatic Identification System (AIS), and space-based assets, whose data must be fused into a consistent situational picture (Soldi et al., 2021; Dogancay et al., 2021). As national-level policy frameworks increasingly mandate persistent monitoring of vast maritime zones (de Ministros para el desarrollo de una Política Oceánica Nacional, 2023), the efficient scheduling and routing of mobile assets such as UAVs and USVs becomes operationally critical (Li and Savkin, 2021).

From an optimization perspective, maritime surveillance has been modeled as routing and mission-planning problems since at least the late 1990s: SAR mission planning as TSP variants (Panton and Elbers, 1999; John et al., 2001), regional platform routing for surface surveillance (Grob, 2006), and persistent aerial surveillance as mixed-integer linear programs (MILP) maximizing information gathered under endurance constraints (Zuo et al., 2020). Comprehensive surveys on UAV routing and trajectory optimization further highlight the complexity of integrating realistic vehicle dynamics, endurance constraints, and heterogeneous sensing objectives into tractable planning models (Coutinho et al., 2018; Otto et al., 2018).
These contributions underline that, even before considering learning-based approaches, surveillance planning is inherently a high-dimensional combinatorial optimization problem.

2.2. Coverage path planning and hexagonal tessellations

CPP seeks a trajectory that visits every point in a region of interest while optimizing criteria such as path length or time (Galceran and Carreras, 2013). In maritime and aerial robotics, standard sweep patterns (e.g., boustrophedon) require decomposing complex areas into simpler sub-regions (Choi et al., 2019; Nielsen et al., 2019), a process that is highly sensitive to the irregular morphology of coastlines and exclusion zones. Hexagonal tessellations address this limitation: they minimize overlap and provide better isotropy than square grids, making motion costs direction-agnostic (Boots et al., 1999; Azpúrua et al., 2018). Hexagonal decompositions have been formulated in MILP models for multi-UAV maritime SAR (Cho et al., 2021b,a; Kadioglu et al., 2019), and the existence of Hamiltonian circuits in hexagonal grid graphs has been studied theoretically (Islam et al., 2007), establishing connectivity conditions directly relevant to complete-coverage feasibility. These results position hexagonal grids as a principled compromise between representational fidelity and algorithmic tractability.

Closely related are sweep and barrier coverage formulations, which focus on periodic traversal or boundary monitoring of target zones (Li et al., 2011; Benahmed and Benahmed, 2019). While historically addressed with static sensor deployments, these problems converge with CPP when mobile platforms (UAVs, USVs) are deployed to dynamically maintain coverage (Li et al., 2019).
2.3. Reinforcement learning and neural combinatorial optimization

Deep Reinforcement Learning (DRL) has been applied to enhance barrier and sweep coverage with mobile UAVs cooperating with sensor networks (Li and Chen, 2022), and more broadly to local ship maneuvering and collision avoidance (Gao et al., 2022), UAV path optimization under threat (Alpdemir, 2022), and dynamic maritime CPP using grid-based Markov decision processes (Ai et al., 2021; Wu et al., 2024). However, these controllers often rely on simple discretizations and small action spaces, struggling to scale to large, irregularly weighted grids.

A complementary paradigm is Neural Combinatorial Optimization (NCO), which treats DRL as a solver for routing problems. Starting with Pointer Networks (Vinyals et al., 2015), attention-based models have evolved into highly competitive solvers for TSP and Vehicle Routing Problems (Kool et al., 2018; Nazari et al., 2018; Berto et al., 2025). These graph RL policies operate directly on graph-structured inputs, learning construction heuristics that match exact solvers in quality while operating orders of magnitude faster at inference (Darvariu et al., 2024). Key advances include invalid-action masking for structured discrete spaces (Huang and Ontañón, 2022), and shared-instance baseline mechanisms such as POMO (Kwon et al., 2020), which drastically reduce gradient variance by comparing multiple rollouts from the same problem instance, an idea closely related to our critic-free GRPO training scheme (Shao et al., 2024).

2.4. Positioning of the present work

While CPP has been extensively studied (Cabreira et al., 2019) and maritime surveillance heavily relies on routing models (Zuo et al., 2020; Grob, 2006), few works explicitly address large-scale maritime CPP on hexagonal grids using learning-based combinatorial solvers.
Existing barrier and sweep heuristics lack DRL integration, and attention-based NCO has rarely been applied to coverage topologies with geometric obstacles and sparse adjacency constraints. Our work bridges this gap by modeling maritime surveillance as a CPP on a hexagonal grid, and tackling it with an attention-based DRL pointer policy trained via critic-free group-relative optimization. To synthesize the context reviewed above, Fig. 1 depicts a visual taxonomy of related problem families. A detailed breakdown of objectives and representative literature per category is provided in Table A.7 (Appendix A).

[Figure 1 near here. The taxonomy organizes surveillance-related coverage, patrolling, and routing problems into three columns: coverage of static regions (region coverage/CPP; sweep coverage; barrier coverage; sensor deployment and WSN coverage), patrolling and persistent surveillance (patrolling and persistent area coverage; target/region revisit scheduling and info-gain models), and routing and task assignment (VRP/TSP/orienteering variants; spatial crowdsourcing and task assignment; learning-based combinatorial optimization with DRL and attention-based NCO), each annotated with representative references.]

Figure 1: Taxonomy of surveillance-related coverage, patrolling, and routing problems underpinning our CPP formulation on hexagonal grids. The three columns mirror the structure of the Related work section, while Table A.7 provides a concise summary of the corresponding problem families.

3. Problem Formulation

3.1. Area of Interest and Coverage

Let the area of interest (AOI) be a planar polygonal region P ⊂ R² with a set of polygonal holes/obstacles {H_k}_{k=1}^{K}. The feasible surveillance region is therefore

A = P \ ⋃_{k=1}^{K} H_k.   (1)

We consider a single mobile sensing platform whose planar position at discrete time t ∈ {0, ..., T} is p_t ∈ A.¹ Coverage path planning (CPP) is defined as generating a trajectory that guarantees complete or persistent sensing/visiting of a region under platform and sensor constraints (Choset, 2001; Galceran and Carreras, 2013; Cabreira et al., 2019; Fevgas et al., 2022; Tan et al., 2021).

In wide-area surveillance and reconnaissance, a standard abstraction is to model the instantaneous sensor footprint as a compact region around the platform position (often approximated by a disk), which aligns with classical geometric coverage models (e.g., unit-disk-type covering problems) (Biniaz et al., 2017). Accordingly, we model the effective footprint as a closed disk of radius r_s > 0,

F(p_t) = { q ∈ R² : ∥q − p_t∥₂ ≤ r_s }.   (2)

A point q ∈ A is covered by a trajectory {p_t}_{t=0}^{T} if q ∈ ⋃_{t=0}^{T} F(p_t).

¹ The formulation naturally extends to multiple platforms via task partitioning; this extension is discussed in Section 7.
Directly optimizing trajectories over the continuous set A is generally difficult, especially when A is non-convex and includes holes; therefore CPP pipelines typically adopt cellular discretization to obtain a finite representation amenable to routing/optimization (Choset, 2001; Galceran and Carreras, 2013; Cabreira et al., 2019; Fevgas et al., 2022). We adopt a hexagonal tessellation and graph construction pipeline detailed in Sections 3.3–3.4.

3.2. Why Approximate Discretizations

The CPP literature distinguishes exact cellular decompositions (which preserve geometry with algorithmic guarantees) from approximate decompositions (grid/tessellation-based) (Choset, 2001; Galceran and Carreras, 2013; Cabreira et al., 2019). Exact decompositions are attractive when one can exploit geometric structure, but they become cumbersome as the AOI grows in complexity (multiple holes, narrow passages) and when additional operational constraints must be incorporated (energy limits, risk maps, multi-vehicle coordination), often requiring repeated recomputation and complicated bookkeeping (Cabreira et al., 2019; Fevgas et al., 2022; Tan et al., 2021). Approximate discretizations provide a controllable resolution parameter (cell size) that trades geometric fidelity for computational tractability, while enabling uniform neighborhood relations and direct mapping to graph-based routing formulations (Cabreira et al., 2019; Fevgas et al., 2022; Tan et al., 2021). This trade-off is particularly relevant in maritime surveillance and other large-area missions where scalability and robustness are first-order requirements (Grob, 2006; Cabreira et al., 2019).

3.3. Hexagonal Tessellation of the AOI

We discretize A by a hexagonal tessellation. In practice, we generate the grid in an oriented bounding box (OBB) frame to reduce boundary artifacts; the resulting tessellation-and-filtering sequence is illustrated in Fig. 2(a)–(c).
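As an illustration of this tessellation-and-filtering step, the following sketch generates centroids of a pointy-top hexagonal grid in a bounding-box frame and filters them against the AOI polygon and holes. The function names and the pure-Python ray-casting containment test are illustrative, not the authors' implementation.

```python
import math

def hex_centroids(width, height, r):
    """Centroids of a pointy-top hexagonal grid with circumradius r
    covering a width x height box (e.g., the OBB frame)."""
    dx = math.sqrt(3) * r          # horizontal centre-to-centre spacing
    dy = 1.5 * r                   # vertical spacing between rows
    pts, row, y = [], 0, 0.0
    while y <= height:
        x = 0.0 if row % 2 == 0 else dx / 2   # odd rows are offset
        while x <= width:
            pts.append((x, y))
            x += dx
        y += dy
        row += 1
    return pts

def point_in_polygon(q, poly):
    """Ray-casting point-in-polygon test (even-odd rule)."""
    x, y = q
    inside = False
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

def feasible_cells(poly, holes, width, height, r):
    """Keep centroids inside the AOI polygon and outside every hole."""
    return [s for s in hex_centroids(width, height, r)
            if point_in_polygon(s, poly)
            and not any(point_in_polygon(s, h) for h in holes)]
```

In a production pipeline, a geometry library would typically replace the hand-rolled containment test; the sketch only conveys the sensor-driven tessellate-then-filter structure.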
Let G_h denote a regular hexagonal grid with characteristic size (e.g., circumradius) r_h. Hexagonal tessellations are a canonical class of spatial tessellations (Boots et al., 1999) and have been used for coverage and surveillance planning due to their near-isotropic adjacency and reduced directional bias compared to square grids, which is beneficial when motion costs should be direction-agnostic (Boots et al., 1999). Moreover, hexagonal decompositions have been explicitly adopted in UAV coverage and multi-UAV maritime SAR settings (Kadioglu et al., 2019; Azpúrua et al., 2018; Cho et al., 2021b).

Beyond geometric convenience, hexagonal discretization is consistent with disk-like footprint abstractions: when coverage is approximated by disks, the discretization should avoid anisotropies that can distort effective coverage radii and neighborhood costs (Biniaz et al., 2017; Boots et al., 1999). While the present work does not claim optimality of any single discretization, the above considerations motivate a hexagonal grid as a principled compromise between representational fidelity and algorithmic simplicity (Boots et al., 1999; Cho et al., 2021b; Kadioglu et al., 2019; Azpúrua et al., 2018).

In this pipeline, the hexagon size is selected to be sensor-driven, aligning the discretization resolution with the effective footprint. Specifically, the circumradius r_h of each hexagonal cell is set equal to the sensor footprint radius r_s, so that a single visit to a cell centroid guarantees coverage of the entire cell area. Consequently, the number of cells |V| is determined not by the absolute area of the AOI but by the ratio Area(A)/Area(c) between the feasible region and the individual cell area, making the graph size scale-invariant.

Let C = {c_1, ..., c_N} be the set of hexagonal cells whose interiors intersect A, and let s_i ∈ R² be the centroid of cell c_i.
Cells whose centroids fall inside holes are removed, yielding the feasible cell set

V = { i ∈ {1, ..., N} : s_i ∈ A }.   (3)

In practice, obstacles may be defined either as explicit polygonal holes H_k or, equivalently, by directly removing selected cells from V. Both approaches yield the same graph G; the latter is used in our dataset generation for simplicity (Section 5.1).

Discrete coverage. We say cell c_i is covered at time t if the sensor footprint centered at p_t contains its centroid s_i:

1{c_i covered at t} = 1{∥s_i − p_t∥₂ ≤ r_s}.   (4)

Under the common discretization assumption p_t ∈ {s_i}_{i∈V} (the platform moves between cell centroids), coverage depends only on visited cells and can be represented combinatorially.

3.4. Graph Representation

The discretized AOI naturally induces a graph. Define an undirected adjacency relation N(i) over cells in V such that j ∈ N(i) if cells c_i and c_j share an edge and the straight-line transition between s_i and s_j does not intersect any hole.² This yields a graph G = (V, E) with edges E = {(i, j) : j ∈ N(i)}. The edge construction from cell centers and the resulting AOI graph are depicted in Fig. 2(d)–(e). Let c_ij ≥ 0 be the travel cost associated with moving from i to j (e.g., distance, time, or a risk-weighted metric). Graph abstractions of this form are standard in navigation and enable the use of graph-based reinforcement learning and combinatorial optimization techniques (Darvariu et al., 2024; Zweig et al., 2020).

A discrete path is a sequence π = (v_0, v_1, ..., v_T) with v_t ∈ V and (v_t, v_{t+1}) ∈ E. Let v_0 = b denote a designated base/initial cell (and optionally v_T = b for return-to-base missions). The base node b is positioned outside the tessellated AOI and connected to every cell on the outer ring of V whose straight-line segment from b to s_i does not intersect any obstacle.
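The adjacency relation admits a simple geometric construction: for a regular hexagonal grid with circumradius r, two cells share an edge exactly when their centroids lie √3·r apart, so neighbors can be recovered by a distance test. The sketch below uses illustrative names and omits the line-of-sight check against holes described above.

```python
import math

def build_adjacency(centroids, r, eps=1e-6):
    """Undirected hex adjacency: cells sharing an edge have centroids at
    distance sqrt(3)*r. A line-of-sight test against obstacle polygons
    would additionally prune edges in a maritime AOI."""
    d_adj = math.sqrt(3) * r
    adj = {i: set() for i in range(len(centroids))}
    for i, (xi, yi) in enumerate(centroids):
        for j in range(i + 1, len(centroids)):
            xj, yj = centroids[j]
            if abs(math.hypot(xi - xj, yi - yj) - d_adj) < eps:
                adj[i].add(j)   # undirected edge (i, j)
                adj[j].add(i)
    return adj
```

The quadratic pairwise loop is for clarity; a spatial hash over axial hex coordinates would make this linear in the number of cells.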
To represent the end of the mission, a duplicate terminal node b′ is introduced, which in our current closed-loop experiments shares the exact same adjacency set. Architecturally, explicitly decoupling the departure and arrival nodes enables the framework to natively support open-path missions (e.g., launching from a mother ship and recovering at a distinct coastal base) by simply assigning distinct spatial coordinates and adjacency masks to b′, without requiring any modifications to the underlying neural policy. Finally, this line-of-sight constraint ensures that both departure and return transits safely avoid overflying exclusion zones.

² Additional edge-level constraints (e.g., no-go zones, directional restrictions) can be incorporated by removing or redirecting edges; the graph abstraction remains unchanged.

Figure 2: Hexagonal tessellation-to-graph pipeline for an irregular AOI with an internal obstacle. Hexagon size is set by the sensor footprint. (a) AOI, obstacle, and OBB. (b) OBB-aligned hex grid (sensor-driven cell size). (c) Visitable cells after AOI/obstacle filtering. (d) Adjacency graph from cell centers (including the base node). (e) Final AOI graph used for coverage planning. (The pipeline is illustrated with an explicit polygonal obstacle for visual clarity; in the experimental dataset, obstacles are implemented as individual cell removals—see Section 5.1.)

3.5. Combinatorial Optimization View: Relation to TSP/VRP

When coverage is enforced by visiting a set of discrete locations, CPP becomes closely related to routing problems such as the TSP, vehicle routing problems (VRP), and their generalizations (Bähnemann et al., 2021; Raza et al., 2022). In the simplest complete-coverage case, one seeks a minimum-cost route that visits all required cells, which mirrors a Hamiltonian-path structure on the AOI graph.

However, the grid-based maritime CPP addressed here differs from the classical TSP in three operationally significant ways: (i) sparse feasibility: movements are restricted to local hexagonal adjacency rather than a fully connected graph, and irregular obstacle configurations can render certain node orderings infeasible; (ii) kinematic costs: heading changes incur maneuvering penalties that break the symmetry of Euclidean edge weights, coupling the cost of visiting a node to the direction of approach; and (iii) dead-end risk: unlike the classical TSP, where every permutation of nodes constitutes a valid tour, the sparse and irregular topology of hex grids with obstacles means that certain visitation sequences lead to unrecoverable states from which complete coverage is impossible, making future feasibility non-trivial to assess without lookahead mechanisms.

Note that, by design, the hexagonal cell size is matched to the sensor footprint radius r_s (Section 3.3), so that complete area coverage requires visiting every cell in V. This deliberately reduces the problem to a constrained Hamiltonian-path formulation on G, avoiding the additional complexity of partial-coverage models where sensing range exceeds cell size. Extensions to non-uniform coverage priorities and multi-vehicle coordination, while operationally relevant, are beyond the scope of this work and are discussed in Section 7.

3.6. Markov Decision Process Formulation

While the CPP problem can be naturally formulated as a MILP targeting the Hamiltonian path, exact solvers become computationally intractable for grids scaling beyond approximately N ≈ 15 nodes due to the NP-hard nature of the underlying routing constraints (Cho et al., 2021b; Zuo et al., 2020).
In maritime scenarios requiring rapid replanning over complex areas, relying on solvers that scale exponentially is operationally unfeasible. We therefore cast coverage path planning on the AOI graph G = (V, E) as a finite-horizon Markov decision process (MDP) solved via DRL.

State. Each node i ∈ V carries geometric features (normalized centroid coordinates) and a nonnegative priority weight w_i (hexscore). The state at step t comprises: (i) the current node v_t, (ii) the set of already visited nodes S_t ⊆ V, represented as a binary mask, and (iii) the static instance data (G, {w_i}). This state representation allows the MDP formulation to natively support both uniform coverage path planning (w_i = 0 for all i) and priority-weighted coverage (w_i ≥ 0 drawn from a spatial distribution). While the neural architecture fully integrates these priority maps, the experiments in this paper isolate the geometric routing challenge by focusing exclusively on the uniform coverage case; heterogeneous priority fields are deferred to future work.

Actions and feasibility masking. The action a_t selects the next node v_{t+1} among the unvisited feasible neighbors of v_t:

a_t ∈ A(s_t) = { j ∈ N(v_t) : j ∉ S_t },   (5)

where N(v_t) denotes the graph neighborhood of v_t. Invalid actions are prevented by applying an additive mask to the policy logits before the softmax activation. This technique has been shown to preserve valid policy gradients while strictly outperforming penalty-based constraints in discrete action spaces (Huang and Ontañón, 2022). Crucially, our mask enforces two strict topological rules: (1) it assigns −∞ to non-neighbors and already-visited nodes to guarantee dynamically feasible, self-avoiding paths, and (2) it explicitly masks the terminal base node until 100% of the target cells have been visited.
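A minimal NumPy sketch of this additive masking, with a large negative constant standing in for −∞ (the helper name and its arguments are illustrative, not the paper's implementation):

```python
import numpy as np

NEG_INF = -1e9  # additive stand-in for -infinity

def masked_policy(logits, neighbors, visited, terminal_idx, all_covered):
    """Additive feasibility mask before the softmax: non-neighbours and
    visited nodes receive -inf, and the terminal base node stays masked
    until every target cell has been visited (terminal-masking)."""
    mask = np.full_like(logits, NEG_INF)
    for j in neighbors:
        if not visited[j]:
            mask[j] = 0.0             # unvisited neighbour: feasible
    if not all_covered:
        mask[terminal_idx] = NEG_INF  # forbid early return to base
    z = logits + mask
    z = z - z.max()                   # numerically stabilised softmax
    p = np.exp(z)
    return p / p.sum()
```

Because the mask is added to the logits (rather than zeroing probabilities after the softmax), infeasible actions receive exactly zero probability and contribute no gradient.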
This terminal-masking prevents the agent from exploiting early-return behaviors to minimize kinematic costs, forcing it to navigate the entire region before it can extract the episodic completion reward.

Terminal conditions. An episode terminates in one of two ways: (i) successful completion, when the agent has visited all target cells and returned to the base node, receiving a large positive reward; or (ii) dead-end failure, when no unvisited neighbor is reachable, in which case a penalty is applied. This latter condition is common in irregular hex grids, where narrow passages and obstacles create geometric choke points. We address it with a Breadth-First Search (BFS) lookahead mechanism described in Section 4.3.

Reward function. The total return of a trajectory π of length T is:

R(π) = Σ_{t=0}^{T−1} ( r_step^{(t)} + r_hex^{(t)} − c_dist^{(t)} − c_turn^{(t)} ) + R_episodic,   (6)

where the dense components are evaluated at each transition (v_t, v_{t+1}), and R_episodic is a terminal modifier. The reward components, calibrated to prevent degenerate behaviors such as premature termination or redundant looping, are summarized in Table 1.

Table 1: Dense and episodic components of the reward function.

Component  | Value          | Type     | Operational justification
r_step     | +2.0           | Dense    | Incentive per newly visited hexagonal cell.
r_hex      | +0.5 · w_{v_t} | Dense    | Time-decayed priority bonus for early visitation of high-value zones (inactive in uniform-coverage experiments)†.
c_dist     | −1.0 · d̄      | Dense    | Penalty proportional to normalized inter-cell distance.
c_turn     | −0.25 · f(θ)   | Dense    | Kinematic penalty for sharp heading changes, promoting smooth trajectories.
r_complete | +100.0         | Episodic | Sparse reward for full coverage and return to base.
r_death    | −40.0          | Episodic | Penalty for falling into an unrecoverable geometric dead-end.
† The hexscore channel w_i is architecturally supported but set to zero in all reported experiments (Section 5.1).

Here, the normalized distance d̄_t = ∥s_{v_t} − s_{v_{t+1}}∥₂ · √|V| scales the inter-cell Euclidean distance by a density factor that normalizes the penalty across instances of varying size, ensuring that larger grids do not trivially dominate the cost structure. The turn penalty f(θ_t) is a composite function of the heading change angle θ_t ∈ [0, π] between consecutive movement vectors. To reflect the physical reality of maritime vehicles, where initiating a course change involves a fixed mechanical overhead (e.g., rudder actuation) plus an angle-dependent hydrodynamic resistance, the penalty combines a discrete activation cost and a quadratic magnitude cost:

f(θ_t) = (θ_t/π)² + c_base   if θ_t > 0,
f(θ_t) = 0                   if θ_t = 0,   (7)

where c_base is a constant maneuvering penalty. In our implementation, we empirically calibrate c_base = 1/12. This specific value ensures that for the minimum required course correction on a hexagonal grid (60°), the fixed mechanical overhead (1/12 ≈ 0.083) and the quadratic dynamic resistance ((1/3)² ≈ 0.111) are of comparable magnitude, appropriately reflecting the high inertia and rudder-shift costs of maritime platforms. For larger maneuvers, the quadratic term naturally dominates. Because the minimum non-zero heading change is 60°, every maneuver robustly activates this full penalty structure. Straight-line motion (θ_t = 0) and the first step of each episode incur no turn cost. Consequently, the fixed term heavily penalizes the total number of turns, while the quadratic term strictly limits sharp reversals, explicitly discouraging zigzagging and promoting dynamically smooth trajectories.

The objective is to learn a stochastic policy π_θ(a_t | s_t) that maximizes the expected return max_θ E_{π_θ}[R(π)] over a distribution of irregular AOI geometries.
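Using the values in Table 1, the dense per-transition reward can be sketched as follows. The helper names are illustrative; the time-decay of the hexscore bonus and the episodic terms (±100/−40) are omitted for brevity.

```python
import math

C_BASE = 1.0 / 12.0  # fixed manoeuvring overhead (rudder actuation), Eq. (7)

def turn_penalty(theta):
    """Composite turn cost f(theta): quadratic magnitude term plus a
    fixed activation cost for any non-zero heading change."""
    if theta == 0.0:
        return 0.0
    return (theta / math.pi) ** 2 + C_BASE

def step_reward(dist, n_cells, theta, newly_visited, w=0.0):
    """Dense reward for one transition (v_t, v_{t+1}), following Table 1:
    +2.0 for a newly visited cell, +0.5*w priority bonus (zero in the
    uniform-coverage experiments), minus the density-normalised distance
    and 0.25 times the turn penalty."""
    d_bar = dist * math.sqrt(n_cells)          # normalised distance
    r = (2.0 if newly_visited else 0.0) + 0.5 * w
    return r - 1.0 * d_bar - 0.25 * turn_penalty(theta)
```

For the minimum hexagonal course correction of 60° this yields f(π/3) = (1/3)² + 1/12 ≈ 0.194, so the fixed and quadratic terms are indeed of comparable size.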
4. Proposed Method

This section describes the three components of our approach: (i) a Transformer-based pointer policy that constructs coverage tours autoregressively (Section 4.1), (ii) a critic-free GRPO training scheme that avoids learning a value function (Section 4.2), and (iii) an early dead-end detection mechanism that accelerates convergence on irregular grids (Section 4.3). We close with a description of the training procedure, including data generation and augmentation (Section 4.4).

4.1. Transformer-based pointer policy

Our policy follows the attention-based neural combinatorial optimization paradigm (Kool et al., 2018; Berto et al., 2025): an encoder produces contextual node embeddings for the AOI graph, and an autoregressive decoder points to the next node to visit. The full architecture is depicted in Fig. 3.

Node features and embedding. Each node i ∈ V is represented by a feature vector x_i = [x_i, y_i, w_i, m_i]. To ensure the policy generalizes robustly across AOIs of varying absolute physical dimensions, the spatial coordinates (x_i, y_i) are centered at the base node and normalized by the maximum radial extent of the graph within the instance; w_i is the hexscore priority, and m_i is a binary indicator identifying the base node. A learnable linear projection maps x_i to an initial embedding h_i^{(0)} ∈ R^d.

Graph-aware encoder. A multi-layer Transformer encoder exchanges information across nodes. To respect the sparse structure of the hex graph, attention is restricted to k-hop neighborhoods using the adjacency matrix as a mask, yielding graph-structured self-attention. After L layers, we obtain contextual embeddings h_i = h_i^{(L)} and a global graph embedding \bar{h} = \frac{1}{|V|} \sum_i h_i.
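The k-hop neighborhood restriction can be derived directly from the adjacency matrix; a minimal numpy sketch (not the authors' implementation) of building the boolean mask that limits self-attention to k-hop neighborhoods:

```python
import numpy as np

def khop_attention_mask(adj, k):
    """Boolean mask M where M[i, j] = True iff node j is within k hops of i.

    adj: (N, N) boolean adjacency matrix of the (undirected) hex graph.
    A node always attends to itself (0-hop).
    """
    n = adj.shape[0]
    reach = np.eye(n, dtype=bool)      # 0-hop: each node reaches itself
    frontier = np.eye(n, dtype=bool)
    for _ in range(k):                 # expand reachability one hop at a time
        frontier = (frontier @ adj.astype(int)) > 0
        reach |= frontier
    return reach
```

Attention scores at positions where the mask is False are set to −∞ before the softmax, yielding the graph-structured self-attention described above.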
Autoregressive decoder. At step t, the decoder constructs a context-aware query from four sources: (i) the embedding of the current node v_t, (ii) the embedding of the first node v_0, (iii) the global graph embedding \bar{h}, and (iv) a set of environmental signals encoding coverage progress, heading direction, frontier statistics, and reachability status. These signals are projected to the model dimension and concatenated before passing through a context network (a two-layer MLP with a residual connection) that produces the decoder output q_t ∈ R^d.

Figure 3: Proposed Transformer-based pointer policy for coverage path planning. The Graph Encoder pre-computes static node embeddings, while the Decoder dynamically aggregates the agent's spatial context and environmental signals to generate a query. The Pointer Network computes attention scores over valid nodes, strictly constrained by a feasibility mask to guarantee valid routing.
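The masked pointer step summarized in Fig. 3 can be sketched as follows (numpy, single decoding step; dimensions, the clipping constant C = 10, and the treatment of W_b as a per-dimension vector are illustrative assumptions, and the glimpse refinement is omitted). The mask is applied after clipping so that infeasible nodes keep exactly zero probability, a practical reading of Eq. (9):

```python
import numpy as np

def pointer_probs(q, H, visited, mask_valid, Wq, Wk, wb, v, alpha, C=10.0):
    """Additive pointer attention with feasibility masking (Eqs. (8)-(9)).

    q: (d,) decoder query; H: (N, d) node embeddings; visited: (N,) in {0, 1};
    mask_valid: (N,) bool, True for feasible actions (unvisited neighbors);
    Wq, Wk: (d, d) projections; wb: (d,) visit-status embedding; alpha: gate;
    C: clipping constant bounding logit magnitudes.
    Returns a probability distribution over nodes; infeasible nodes get mass 0.
    """
    d = q.shape[0]
    # (1/sqrt(d)) * (Wq q + Wk h_j + alpha * wb * v_j), for every node j
    pre = (Wq @ q + H @ Wk.T + alpha * np.outer(visited, wb)) / np.sqrt(d)
    logits = np.tanh(pre) @ v                       # additive (Bahdanau) scores
    logits = np.clip(logits, -C, C)                 # bound logit magnitudes
    logits = np.where(mask_valid, logits, -np.inf)  # feasibility mask
    ex = np.exp(logits - logits[mask_valid].max())  # numerically stable softmax
    return ex / ex.sum()
```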
Pointer mechanism with masking. The decoder output attends to the key projections of all node embeddings through K_glimpse attention glimpses (Kool et al., 2018), iteratively refining the query vector q_t. The final logits are computed via additive (Bahdanau-style) attention with a tanh activation and a learnable projection vector v:

\ell_t(j) = v^\top \tanh\!\left( \frac{1}{\sqrt{d}} \left( W_q q_t + W_k h_j + \alpha W_b v_j \right) \right),    (8)

where v_j ∈ {0, 1} encodes the visitation status of node j, α is a learnable gating scalar mapping the binary status into the continuous embedding space, and W_q, W_k, W_b are learnable weight matrices. An additive action mask M_t(j) ∈ {−∞, 0} enforces feasibility:

\pi_\theta(a_t = j \mid s_t) = \mathrm{softmax}_j\!\left( \mathrm{clip}\left( \ell_t(j) + M_t(j),\, C \right) \right),    (9)

where C > 0 is a tanh clipping constant that bounds logit magnitudes. The mask assigns −∞ to non-neighbors and already-visited nodes, ensuring that all sampled trajectories are feasible by construction. During training, actions are sampled from the categorical distribution. For evaluation during training (validation), greedy decoding is used exclusively, to monitor policy convergence strictly. At test time, however, the trained policy supports multiple decoding strategies of varying computational cost that trade off latency for path quality, as detailed in Section 5.3.

While some attention-based routing models employ scaled dot-product mechanisms for the final pointer logits (Kool et al., 2018), we adopt an additive (Bahdanau-style) attention head following the original Pointer Network formulation (Vinyals et al., 2015; Bello et al., 2016). Additive attention evaluates the compatibility between the decoder's query and the candidate keys through a non-linear projection (tanh), acting effectively as a single-hidden-layer multi-layer perceptron. This non-linearity provides greater expressive capacity to capture complex geometric relationships between the agent's current state and the surrounding irregular grid. Furthermore, the intrinsic saturation of the inner tanh naturally bounds the pre-logit activations. Combined with the outer clipping parameter C, this architectural choice significantly enhances numerical stability during policy-gradient updates, preventing premature entropy collapse and maintaining healthy exploration in the early stages of training.

Complexity. With neighbor-restricted attention, encoder complexity scales as O(|V| · d_max · d^2) per layer, where d_max is the maximum node degree (at most 6 for hexagonal grids). Decoding is O(|V| · d^2) per step. This is advantageous for large AOIs compared to full-attention Transformers, which scale quadratically in |V|.

4.2. Critic-free Group-Relative Policy Optimization

Standard actor-critic methods such as PPO (Schulman et al., 2017) require learning a value function V_φ(s) to estimate advantages. In combinatorial routing, the value function must generalize across highly diverse graph topologies with sparse terminal rewards, which often leads to high bias and training instability. Shared-instance baseline mechanisms, such as POMO (Kwon et al., 2020), mitigate this by evaluating multiple parallel trajectories constructed from diverse starting nodes, using their average return as an instance-specific baseline.

Because our maritime CPP formulation models missions deploying from a fixed base node, we cannot exploit starting-node symmetries. Instead, we adapt this shared-baseline principle through GRPO (Shao et al., 2024). For each training instance n, we sample a group of G trajectories {π_n^{(g)}}_{g=1}^{G}. To guarantee constructive diversity across the group despite the single fixed starting node, we rely on the policy's stochastic categorical sampling, which is heavily promoted during early training via a high initial temperature annealing schedule (T_init = 1.5 → 1.0). We then compute the relative advantages by standardizing the episodic returns (Eq.
6) within the group:

A_n^{(g)} = \frac{R_n^{(g)} - \mu_n}{\sigma_n + \epsilon},    (10)

where \mu_n = \frac{1}{G} \sum_g R_n^{(g)}, \sigma_n = \sqrt{\frac{1}{G} \sum_g \left( R_n^{(g)} - \mu_n \right)^2}, and ϵ is a small constant for numerical stability. This formulation evaluates whether a trajectory outperformed its peers on the exact same map, effectively bypassing the generalization bottleneck of a global critic network.

The policy is then optimized via a clipped surrogate objective applied at the per-step level. Let r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) be the importance-sampling ratio for action a_t. The GRPO loss is:

\mathcal{L}_{\mathrm{GRPO}}(\theta) = - \frac{1}{|\mathcal{T}|} \sum_{(n,g,t) \in \mathcal{T}} \min\!\left( r_t(\theta) A_n^{(g)},\, \mathrm{clip}\left( r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon \right) A_n^{(g)} \right) - \beta \bar{H},    (11)

where \mathcal{T} denotes the set of valid (non-padding) tokens across all trajectories, ε is the clipping parameter, \bar{H} is the mean per-step policy entropy, and β is the entropy coefficient. The per-step formulation, as opposed to per-trajectory clipping, ensures that the trust-region constraint is enforced at each decision point, which is critical for long-horizon routing, where per-trajectory ratios can grow exponentially with sequence length.

Multi-epoch reuse. For each batch of B instances with G rollouts each, we perform K inner optimization epochs over shuffled minibatches, re-evaluating log-probabilities and entropy under the current policy parameters at each step. This is analogous to the inner-loop structure of PPO and provides substantial sample efficiency compared to single-update REINFORCE methods.
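A minimal numpy sketch of the group-relative advantage of Eq. (10) and the per-step clipped surrogate term of Eq. (11) (entropy bonus and token averaging omitted for brevity; not the authors' implementation):

```python
import numpy as np

def group_advantages(returns, eps=1e-8):
    """Eq. (10): standardize episodic returns within one instance's group."""
    returns = np.asarray(returns, dtype=float)
    return (returns - returns.mean()) / (returns.std() + eps)

def clipped_step_objective(ratio, advantage, clip_eps=0.2):
    """Per-step clipped surrogate term of Eq. (11), to be averaged and negated.

    ratio: importance-sampling ratio r_t(theta); advantage: A_n^(g).
    """
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage)
```

Because every trajectory in the group shares the same instance, the baseline is instance-specific by construction, which is the property that replaces the learned critic.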
4.3. Early dead-end detection via BFS

On irregularly shaped hexagonal grids, narrow passages, peninsulas, and obstacle configurations frequently create situations where the agent, having visited certain nodes, can no longer reach all remaining targets (a geometric dead-end). Without early detection, the agent continues making decisions along a doomed trajectory, receiving misleading intermediate rewards before eventually triggering the terminal death penalty. This contaminates credit assignment: the policy cannot distinguish between the critical misstep that created the dead-end and the subsequent (irrelevant) actions.

We integrate a Breadth-First Search (BFS) reachability check into the environment. At each step t, after the agent moves to node v_t, the BFS computes the set of unvisited nodes reachable from v_t through unvisited (or transit-eligible) intermediate nodes, and verifies whether the terminal base node remains accessible. If either condition fails, i.e., some target nodes or the base are unreachable, the episode is terminated immediately with the death penalty r_death.

This mechanism provides two benefits:

1. Sharper credit assignment. The penalty is applied at the step where the dead-end becomes inevitable, rather than many steps later. The policy gradient correctly attributes the negative outcome to the responsible decision.

2. Reduced wasted computation. Doomed trajectories are truncated early, freeing rollout budget for informative trajectories. Empirically, enabling BFS detection substantially reduces the fraction of non-informative rollouts during early training, allowing the policy gradient to receive a meaningful credit-assignment signal from the first epoch.

The BFS has worst-case complexity O(|V| + |E|) per step. For hex grids with |V| ≤ 46 and |E| ≤ 6|V|, this overhead is negligible compared to the Transformer forward pass.

4.4. Training procedure

Dataset generation. Training and evaluation use a synthetic dataset of 10,000 irregular hexagonal AOI instances (|V| ∈ [28, 46]) with stochastic cell-level obstacle patterns; the generation procedure and morphological families are described in Section 5.1.
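The BFS reachability check of Section 4.3 can be sketched as follows (a minimal version; here the only node that may be re-traversed is the base, a simplifying assumption about "transit-eligible" nodes):

```python
from collections import deque

def dead_end(current, visited, base, adjacency, targets):
    """True if some unvisited target, or the base node, is unreachable from
    `current` through unvisited nodes (the dead-end condition of Sec. 4.3).

    adjacency: dict node -> iterable of neighbor nodes (the hex graph).
    visited: set of already-visited nodes; targets: set of required nodes.
    """
    reachable = {current}
    queue = deque([current])
    while queue:                       # standard BFS over admissible nodes
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in reachable and (w not in visited or w == base):
                reachable.add(w)
                queue.append(w)
    unvisited_targets = targets - visited
    return (not unvisited_targets <= reachable) or (base not in reachable)
```

On a path graph the check fires as soon as the agent commits to a branch that forces a revisit, which is exactly the step where the failure becomes inevitable.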
Geometric augmentation. During training, each batch undergoes stochastic geometric augmentation with probability p_aug = 0.9: random rotations, reflections, and coordinate permutations are applied to the AOI graph. This exposes the policy to geometric invariances and significantly improves generalization to unseen AOI shapes without increasing the dataset size.

Temperature annealing. To balance exploration and exploitation, the sampling temperature is linearly annealed from T_init = 1.5 to T_final = 1.0 over the first 10 epochs. A higher initial temperature encourages diverse trajectory exploration during early training, while convergence to unit temperature ensures that the final policy is evaluated under standard softmax probabilities.

Hyperparameters. The full training configuration is summarized in Table 2. Key choices include G = 16 rollouts per instance (balancing baseline quality against computational cost), K = 4 inner PPO-style epochs per batch, a learning rate of 3 × 10^{-5} with linear annealing, and an entropy coefficient of β = 0.02.

Table 2: Training hyperparameters.

Parameter        Description                      Value
d                Model dimension                  128
L                Encoder layers                   3
n_h              Attention heads                  8
K_glimpse        Decoder glimpses                 2
G                Rollouts per instance            16
K                Inner optimization epochs        4
ε                PPO clip parameter               0.2
β                Entropy coefficient              0.02
lr               Learning rate                    3 × 10^{-5}
–                LR schedule                      Linear annealing
B                Batch size (instances)           32
B_mb             Minibatch size (trajectories)    8
–                Optimizer                        Adam
‖∇‖              Max gradient norm                0.5
–                Max epochs (early stop at 30)    300
–                Augmentation probability         0.9
T_init / T_final Temperature annealing            1.5 / 1.0
–                Early dead-end detection         Enabled
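The random rotations used for augmentation can be realized exactly on a hexagonal lattice; the following sketch uses cube coordinates, a representation choice the paper does not specify (reflections can be composed analogously by swapping axes):

```python
def rotate60(cell):
    """Rotate a hex cell one 60-degree step about the origin.

    `cell` is a cube coordinate (x, y, z) with x + y + z == 0. Six successive
    applications return the original cell, giving the six-fold rotation group
    available for augmentation on a hex grid.
    """
    x, y, z = cell
    return (-z, -x, -y)

def rotate_grid(cells, steps):
    """Apply `steps` 60-degree rotations to every cell of an AOI graph."""
    for _ in range(steps % 6):
        cells = [rotate60(c) for c in cells]
    return cells
```

Because the rotation permutes cube axes with sign flips, it preserves the x + y + z = 0 invariant and all pairwise hex distances, so the augmented instance is geometrically equivalent to the original.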
5. Experimental Setup

5.1. Dataset description

Training instances are generated synthetically to represent a range of maritime AOI morphologies. Each instance is constructed in three stages: (i) an irregular polygon is sampled to define the outer boundary of the AOI, (ii) a hexagonal tessellation is applied with cell circumradius r_h = r_s matching the sensor footprint (Section 3.3), and (iii) a subset of interior cells is randomly removed to simulate islands, shoals, or other navigational exclusion zones. This cell-level obstacle generation is mathematically equivalent to inscribing small polygonal holes within the AOI and discarding any hexagon whose centroid falls inside them, but operates directly on the graph topology without requiring explicit geometric intersection tests.

The outer polygons are drawn from three morphological families representing common maritime scenarios: (i) compact convex regions (open-water patrol zones), (ii) elongated or concave polygons (coastal strips, fjords, channel approaches), and (iii) irregular shapes with narrow passages created by the interaction of boundary concavities and internal cell removals. The combination of polygon shape and stochastic cell removal produces a wide spectrum of graph topologies, including bottlenecks, peninsulas, and disconnected-looking corridors that are characteristic of real maritime environments with islands and reefs.

Since the number of visitable cells |V| depends on the ratio between the AOI area and the cell area (Section 3.3), the geometric bounds of our dataset were calibrated to reflect real-world maritime domain parameters. We generate operational areas ranging from 1,600 to 3,600 square nautical miles (NM²). Crucially, by constraining the total area rather than bounding-box dimensions, the generator naturally accommodates highly diverse morphologies under the same operational footprint, spanning from compact open-water Search and Rescue (SAR) sectors to elongated coastal patrol corridors. The cell circumradius is sampled between 5.0 and 7.0 NM, matching the effective detection horizon of an X-band marine radar on a medium-sized Autonomous Surface Vehicle (ASV) or the EO/IR footprint of a tactical UAV. Additionally, the base node is stochastically placed at a standoff distance of 100 to 250 NM, simulating a realistic offshore transit from a coastal naval base or a mother ship.

By driving the tessellation strictly through these doctrinal operational capabilities, the resulting spatial graphs consistently emerge with |V| ∈ [28, 46] valid target cells. This demonstrates that instances of this topological scale are not merely mathematical abstractions, but rather the exact combinatorial resolution required to plan persistent surveillance missions for modern maritime assets. A total of 10,000 instances are generated and partitioned into 8,000 training, 1,000 validation, and 1,000 test instances. Polygon vertices, cell-removal patterns, and base-node locations are sampled randomly, subject to connectivity constraints ensuring that the resulting graph remains connected.

Hamiltonian-path feasibility audit. Since our formulation requires visiting every cell exactly once (Section 3.5), it is essential that every instance in the dataset admits at least one feasible Hamiltonian path. We verify this exhaustively using a depth-first search with strict backtracking over the full graph. All 10,000 instances pass this audit; the exhaustive DFS thus serves as a feasibility oracle confirming a 100% theoretical solve rate. However, because the DFS merely finds the first valid topological sequence without optimizing kinematic costs or distance, its path quality is generally poor. It is therefore used strictly as a ground-truth upper bound on feasibility, not as a target benchmark for path optimization.
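The feasibility audit can be sketched as an exhaustive DFS with backtracking (exponential in the worst case, but tractable at |V| ≤ 46 because hex graphs are sparse; a minimal version, not the authors' implementation, and the return-to-base leg is omitted):

```python
def hamiltonian_path_exists(adjacency, start):
    """Exhaustive DFS audit: does a path from `start` visiting every node
    exactly once exist? adjacency: dict node -> iterable of neighbors."""
    n = len(adjacency)
    visited = {start}

    def dfs(u):
        if len(visited) == n:       # every node visited exactly once
            return True
        for w in adjacency[u]:
            if w not in visited:
                visited.add(w)
                if dfs(w):
                    return True
                visited.remove(w)   # strict backtracking
        return False

    return dfs(start)
```

Run once per generated instance (from the base node), this yields the 100% feasibility certificate described above without asserting anything about path quality.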
In the experiments reported here, all instances use uniform coverage priority (w_i = 0 for all i ∈ V), reducing the objective to minimum-cost complete coverage. The architecture natively supports non-uniform priorities via the hexscore input channel; experiments with heterogeneous priority fields are deferred to future work (Section 7).

5.2. Baseline methods

To evaluate the learned policy against a broad spectrum of classical CPP strategies, we implement 13 heuristic baselines spanning six algorithmic families. All baselines operate on the same hexagonal grid graph under identical adjacency constraints. However, a crucial methodological distinction exists regarding visitation constraints. While our RL policy is strictly constrained to single-visit paths via its action masking (terminating an episode if a dead-end is reached), classical heuristic baselines may traverse previously visited nodes (via backtracking, flyback transit, or overlapping sweeps) as a natural consequence of their construction logic, thereby achieving complete area coverage at the cost of redundant motion. This difference highlights a fundamental operational trade-off: traditional heuristics guarantee 100% coverage completion at the cost of inefficient overlapping paths, whereas our learned policy prioritizes maximum kinematic and distance efficiency by attempting to solve the stricter Hamiltonian-path variant.

The families and their members are summarized in Table 3. The six families represent fundamentally different algorithmic paradigms in CPP: geometric decomposition (sweep families), morphology-following (contour/spiral), topological coverage guarantees (STC), local graph heuristics, and space-filling curves. This diversity ensures that performance comparisons are not biased toward any single planning philosophy.

Table 3: Baseline heuristic methods grouped by algorithmic family. All methods operate on the same hex-graph representation under identical constraints.

Linear sweep (Boustrophedon)
  sweep_boustrophedon       Classic lawnmower pattern with alternating row directions, minimizing inter-row transit.
  sweep_row_oneway          Unidirectional row sweep; all rows traversed in the same direction with flyback transit.
  sweep_segment_snake       Obstacle-aware boustrophedon that decomposes the AOI into convex segments and applies snake patterns within each.

Interleaved sweep (Skip-row)
  sweep_row_interleave      Skips rows to allow wider turning arcs; rows are visited in an interleaved order (e.g., 1, 7, 2, 8, ...).
  sweep_segment_interleave  Combines segment decomposition with interleaved row ordering for kinematically constrained vehicles.

Contour/Spiral (Boundary rings)
  boundary_spiral_inward    Traces concentric rings from the AOI perimeter inward toward the centroid.
  boundary_spiral_outward   Starts at the centroid and spirals outward to the perimeter.
  sweep_boundary_peel       Iteratively erodes the outermost layer of unvisited cells, adapting dynamically if the remaining area splits.

Spanning-tree coverage (STC)
  stc_tree_coverage         Builds a minimum spanning tree over grouped cells; the agent circumnavigates the tree edges.
  stc_like                  Adapted STC for hex grids with mandatory start/end nodes, using BFS to connect dead branches.

Graph-based local search
  warnsdorff                Adapts the Knight's Tour heuristic: always moves to the unvisited neighbor with the fewest remaining neighbors, clearing difficult corners first.
  dfs_backtrack             Depth-first greedy traversal with shortest-path backtracking when a dead-end is reached.

Space-filling curve
  morton_zorder             Orders cell centroids by their Morton code (bit-interleaved coordinates), producing a Z-order curve that preserves spatial locality.

5.3. Inference strategies

Unlike classical exact solvers that compute a single solution, generative routing models allow multiple decoding strategies at inference time. We evaluate the trained policy on the test set using three modes that trade computational cost for solution quality:

1. Greedy (RL-Greedy): At each step, the node with the highest logit is selected deterministically. This provides single-pass, lowest-latency inference (O(|V|) steps).

2. Best-of-K sampling (RL-BoK): K independent trajectories are sampled in parallel from the stochastic policy. We select the trajectory that completes the Hamiltonian path with the highest episodic return (i.e., minimum kinematic and distance cost). If no trajectory achieves strict complete coverage (i.e., all K rollouts reach a geometric dead-end before visiting every cell), we employ a fallback: the model outputs the partial trajectory that maximizes the total covered area before failure, breaking ties by minimum distance. Thanks to GPU batching, generating K = 16 parallel rollouts is computationally efficient and only marginally slower than greedy decoding.

3. Best-of-K with local search (RL-BoK+2opt): The best sampled trajectory is refined via a 2-opt local search that iteratively reverses sub-segments of the tour to untangle crossings, accepting improvements in total path cost. Crucially, because the environment is a sparse graph rather than a fully connected Euclidean space, feasibility (strict adjacency and single-visit constraints) is explicitly verified after each proposed reversal. This demonstrates the seamless integration of neural construction heuristics with classical refinement.

The RL-BoK and RL-BoK+2opt variants are applied only at test time to the selected trained model and should therefore be interpreted as inference-time enhancements rather than separately trained policies.

5.4. Evaluation metrics

To capture the trade-off between guaranteed coverage, path efficiency, and computational latency, we evaluate all methods along five dimensions:

1. Hamiltonian Success Rate (HSR, %): the fraction of test instances for which a method produces a valid start-to-terminal path that visits every required node exactly once.
For the proposed RL policy, this is the primary success criterion. For classical baselines, this metric indicates whether their output happens to satisfy the stricter Hamiltonian requirement, even though they were not designed for it.

2. Coverage Completion Rate (CCR, %): the fraction of test instances for which all required nodes are covered and the terminal node is reached, regardless of revisits. This reflects the native operating regime of classical coverage heuristics.

3. Node Revisits: the average number of previously visited nodes traversed during the generated route. This measures the overlap cost incurred by revisit-allowed strategies. By construction, the proposed RL policy yields zero revisits on instances counted as Hamiltonian successes.

4. Normalized Path Distance: the total Euclidean route length normalized by the characteristic inter-cell spacing. To isolate route quality from outright feasibility, this metric is reported conditionally on the pairwise common solved subset between the compared methods.

5. Number of Turns: the number of non-zero heading changes along the route. As with normalized distance, this metric is reported on the pairwise common solved subset in order to compare route quality independently of failure cases.

For the main route-quality comparison, we use the subset

I_∩^{(m)} = { i ∈ D_test | RL-BoK+2opt solves i under HSR and baseline m completes i },

and compute distance and turn metrics only on I_∩^{(m)}. This conditional analysis should be interpreted as a comparison of route quality on matched instances, not as a substitute for global feasibility metrics.

5.5. Implementation details

All experiments were conducted under Windows Subsystem for Linux 2 (WSL2) running Ubuntu 24.04.4 LTS on a workstation equipped with an AMD Ryzen 9 5900HX CPU, 31 GiB of system RAM, and an NVIDIA GeForce RTX 3070 Laptop GPU (8 GB VRAM). The software stack used Python 3.12.2, PyTorch 2.4.0, and CUDA 12.4.
The neural policy comprises approximately 2.05 million trainable parameters. Training was configured for up to 300 epochs with early stopping (patience of 4 epochs on validation success rate); the best checkpoint was selected at epoch 30 (val. SR = 95.5%), after which validation performance began to decrease while training performance continued to rise. The total wall-clock time for the 34 epochs executed was approximately 440 h (seed 42). Model selection was based exclusively on validation performance under greedy decoding. Final test-set evaluation was performed on the selected checkpoint using the three inference modes described in Section 5.3.

To assess the feasibility of real-time on-board deployment, all reported inference latencies reflect single-instance end-to-end processing (batch size 1), including data transfer, environment construction, and the neural forward pass, with explicit device synchronization; we report the median over the 1,000 test instances. Classical heuristic baselines were executed sequentially on the CPU; neural inference used GPU acceleration. Detailed latency results are reported in Section 6.3.

6. Results and Discussion

All metrics reported in this section are computed exclusively on the 1,000 unseen test instances, which were held out during both training and hyperparameter selection.

Figure 4: Training and validation success rate (greedy decoding) across epochs. The dashed line marks the selected checkpoint (epoch 30, val. SR = 95.5%). The shaded region indicates overfitting, where validation performance declines while training performance remains stable.

6.1. Coverage performance and feasibility

Table 4 reports the Hamiltonian Success Rate (HSR), Coverage Completion Rate (CCR), and mean node revisits for every evaluated method. Two structural patterns emerge immediately.
First, all 12 revisit-allowed heuristics achieve 100% CCR by design: when a geometric dead-end is encountered, they backtrack or fly back through previously visited cells, trading path efficiency for coverage guarantees. However, none of these heuristics produces a Hamiltonian (zero-revisit) path on any test instance, with the sole exception of DFS_Backtrack, which achieves HSR = 33.8%. This confirms that on irregular hex grids with obstacles, achieving strict single-visit coverage is a qualitatively harder objective than relaxed coverage.

Second, Warnsdorff, the only heuristic explicitly designed for Hamiltonian-like traversal, achieves HSR = CCR = 46.0%. Because it forbids revisits, every failure is simultaneously a coverage failure. This 46% solve rate on non-trivial hex topologies establishes the state of the art for greedy construction heuristics in our setting.

Fig. 4 shows the evolution of training and validation success rates across epochs. The policy reaches near-saturation training performance (> 98%) within the first 12 epochs, while validation performance converges more gradually. Beyond epoch 30, validation SR begins to decline while training SR continues to increase marginally, confirming standard overfitting behavior and validating the early-stopping criterion.

Against this landscape, the proposed RL policy delivers a qualitative leap. Under greedy decoding alone, RL-Greedy achieves HSR = 95.7%, more than double the best heuristic. Stochastic Best-of-16 sampling (RL-BoK16) raises this to 98.7%, and the addition of adjacency-aware 2-opt refinement (RL-BoK16+2opt) reaches 99.0%, within 1% of the theoretical oracle (Exact_DFS, HSR = 100%). All three RL variants achieve zero revisits on every solved instance, confirming that the action-masking mechanism (Section 3.6) enforces strict Hamiltonian feasibility by construction.
The progression from 95.7% to 99.0% across the three inference modes illustrates the value of stochastic sampling and local refinement at test time: sampling explores alternative branching decisions in topologically challenging instances, and 2-opt untangles residual crossings without breaking adjacency constraints. Manual inspection of the 10 failed instances reveals a consistent topological pattern: all exhibit internal obstacles that partition the AOI into sub-regions connected by narrow corridors (one or two cells wide). In every case, the policy successfully covers one sub-region but, in doing so, exhausts the cells forming the connecting passage, leaving the remaining sub-region unreachable. This sequencing failure arises because the optimal visitation order requires committing to a globally non-obvious entry/exit sequence through the bottleneck, a long-horizon planning challenge that the local attention mechanism and 16 stochastic samples are insufficient to resolve. Increasing the sampling budget K or incorporating explicit graph-connectivity lookahead into the decoder are potential mitigations for future work.

Table 4: Coverage performance on 1,000 unseen test instances. HSR: Hamiltonian Success Rate (strict single-visit). CCR: Coverage Completion Rate (revisits permitted). Methods are grouped by type and sorted by HSR within each group.
Type        Method                    HSR (%)  CCR (%)  Revisits µ ± σ
Oracle      Exact_DFS                 100.0    100.0    0.0 ± 0.0
RL (ours)   RL-BoK16+2opt             99.0     99.0     0.0 ± 0.0
            RL-BoK16                  98.6     98.6     0.0 ± 0.0
            RL-Greedy                 95.7     95.7     0.0 ± 0.0
Heuristic   Warnsdorff                46.0     46.0     0.0 ± 0.0
            DFS_Backtrack             33.8     100.0    3.2 ± 6.0
            Sweep_Boustrophedon       0.0      100.0    6.4 ± 2.0
            Sweep_Segment_Snake       0.0      100.0    6.5 ± 2.1
            Sweep_Row_OneWay          0.0      100.0    8.5 ± 3.4
            Boundary_Spiral_outward   0.0      100.0    12.3 ± 12.6
            Boundary_Spiral_inward    0.0      100.0    12.4 ± 12.7
            Sweep_Segment_Interleave  0.0      100.0    16.0 ± 4.1
            Sweep_Boundary_Peel       0.0      100.0    16.2 ± 12.5
            Morton_Zorder             0.0      100.0    18.5 ± 5.4
            Sweep_Row_Interleave      0.0      100.0    18.4 ± 4.2
            STC_Tree_Coverage         0.0      100.0    35.8 ± 3.5
            STC_like                  0.0      100.0    35.8 ± 3.5

6.2. Path quality analysis

To compare route quality independently of feasibility differences, distance and turn metrics are computed on the pairwise common solved subset I_∩^{(m)} between each method m and the reference solver RL-BoK16+2opt, which solved 990 of 1,000 test instances. Table 5 summarizes these results.

Figure 5: Failure mode analysis of the proposed RL-BoK16+2opt policy. In the rare instances (approx. 1.0%) where stochastic sampling fails to secure a strict single-visit Hamiltonian path, the failures consistently correspond to severe geometric constraints: (a) corridor traps, where the agent is forced down a narrow 1D passage with no remaining exits, resulting in a geometric dead-end; (b) graph bisections, where the agent traverses a critical isthmus, splitting the remaining unvisited nodes into disconnected components; and (c) self-occlusion, where the trajectory loops around an internal obstacle, trapping the agent against its own previously visited path. In these configurations, the agent commits to a sub-optimal branch that leaves remaining nodes unreachable without revisiting cells, thereby triggering the early dead-end penalty. These edge cases highlight the fundamental fragility of strictly Hamiltonian routing on highly irregular grids and strongly motivate future operational extensions that incorporate a bounded node-revisitation budget.

Distance. RL-BoK16+2opt achieves the lowest mean normalized distance (3.100 ± 0.315) across all 990 common instances, outperforming every heuristic and the oracle DFS. The best heuristic, Warnsdorff (3.328 ± 0.381), is 7.4% longer, but is comparable on only 453 instances due to its low solve rate. Among full-coverage heuristics, the best performer is DFS_Backtrack (3.340 ± 0.449), 7.7% longer than the RL policy. The most commonly deployed survey pattern, Boustrophedon (3.564 ± 0.429), incurs 15.0% more distance. At the extreme, the spanning-tree methods (STC) produce paths 49% longer than the RL solution.

The 2-opt refinement provides a modest but consistent distance improvement over raw sampling: RL-BoK16+2opt is 0.4% shorter than RL-BoK16 (3.111 ± 0.314). Greedy decoding and Best-of-16 produce identical mean distances, confirming that stochastic sampling primarily improves feasibility rather than path geometry.

Turns. The RL policy produces dramatically smoother paths. RL-BoK16 achieves the fewest turns (32.2 ± 4.7), 25.9% fewer than the best heuristic, Warnsdorff (42.5 ± 5.6), and 37.6% fewer than Boustrophedon (51.6 ± 6.6). This reduction is a direct consequence of the continuous turn penalty f(θ) in the reward function (Eq. 7), which the policy internalizes to produce kinematically smooth trajectories. The 2-opt refinement increases turns slightly (34.1 ± 5.2) relative to raw BoK because segment reversals can introduce heading changes while reducing distance, a classical distance–turns trade-off in 2-opt optimization on constrained graphs.

Revisits.
All three RL variants achieve exactly zero revisits by construction (action masking forbids revisits). Among heuristics, the cost of guaranteed coverage ranges from 3.2 revisits (DFS_Backtrack) to 35.8 (STC), which translates to 8–95% redundant motion. This redundancy represents wasted fuel and endurance in maritime operations, a cost entirely avoided by the RL policy.

Table 5: Path quality on the pairwise common solved subset with RL-BoK16+2opt (990 of 1,000 test instances solved). n denotes the number of instances solved by both the listed method and the RL reference. Methods sorted by normalized distance.

Type       Method                   n    Dist. µ ± σ    Turns µ ± σ  Steps  Revisits
RL         RL-BoK16+2opt            990  3.100 ± 0.314  34.1 ± 5.2   37.8   0.0
           RL-BoK16                 986  3.110 ± 0.315  32.2 ± 4.7   37.8   0.0
           RL-Greedy                957  3.111 ± 0.314  32.9 ± 4.9   37.8   0.0
Oracle     Exact_DFS                990  3.308 ± 0.374  43.5 ± 6.7   37.8   0.0
Heuristic  Warnsdorff               454  3.325 ± 0.379  42.5 ± 5.6   37.2   0.0
           DFS_Backtrack            990  3.340 ± 0.449  49.5 ± 12.4  41.1   3.2
           Sweep_Boustrophedon      990  3.564 ± 0.429  51.6 ± 6.6   44.2   6.4
           Sweep_Segment_Snake      990  3.567 ± 0.430  51.7 ± 6.5   44.3   6.5
           Sweep_Row_OneWay         990  3.672 ± 0.484  54.4 ± 6.9   46.3   8.5
           Boundary_Spiral_inward   990  3.864 ± 0.772  51.3 ± 19.1  50.3   12.5
           Sweep_Boundary_Peel      990  3.866 ± 0.731  60.0 ± 17.8  54.0   16.2
           Boundary_Spiral_outward  990  3.867 ± 0.765  51.1 ± 18.2  50.1   12.3
           Sweep_Segment_Interl.    990  3.981 ± 0.565  60.9 ± 8.0   53.8   16.0
           Sweep_Row_Interleave     990  4.058 ± 0.554  54.1 ± 7.4   56.2   18.4
           Morton_Zorder            990  4.134 ± 0.620  72.5 ± 9.7   56.2   18.4
           STC_Tree_Coverage        990  4.622 ± 0.721  76.0 ± 12.9  73.6   35.8
           STC_like                 990  4.622 ± 0.721  78.8 ± 12.0  73.6   35.8

6.3. Computational efficiency

Table 6 reports per-instance end-to-end inference latencies measured as described in Section 5.5. Among the heuristics, simple graph-based methods (Warnsdorff, STC) execute sequentially on the CPU in under 1 ms, while sweep-family methods that require geometric axis-selection subproblems take 27–31 ms.
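The zero-revisit guarantee noted above follows from invalid-action masking (Huang and Ontañón, 2022): cells already visited receive a probability of exactly zero before the policy selects its next move. A minimal plain-Python sketch of this mechanism (an illustration only, not the paper's implementation, which operates on batched GPU tensors of logits):

```python
import math

def masked_distribution(logits, visited):
    """Invalid-action masking: visited cells get a logit of -inf, so their
    softmax probability is exactly zero and a revisit can never be sampled."""
    masked = [-math.inf if v else l for l, v in zip(logits, visited)]
    m = max(x for x in masked if x != -math.inf)  # shift for numerical stability
    exps = [0.0 if x == -math.inf else math.exp(x - m) for x in masked]
    z = sum(exps)
    return [e / z for e in exps]

def greedy_step(logits, visited):
    """Greedy decoding: pick the highest-probability unvisited cell."""
    probs = masked_distribution(logits, visited)
    return max(range(len(probs)), key=probs.__getitem__)
```

For example, with logits [2.0, 1.0, 0.5] and the first cell already visited, the masked distribution assigns that cell probability 0.0 and greedy decoding selects the second cell, even though the first had the highest raw logit.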
In contrast, the RL policy leverages GPU acceleration.

Table 6: Per-instance inference latency (ms). Heuristics run on CPU; RL methods on GPU (RTX 3070). Latencies are for inference only, excluding I/O.

Method                   ms / instance
Warnsdorff               0.4
STC_Tree_Coverage        0.4
STC_like                 0.4
DFS_Backtrack            0.6
Morton_Zorder            0.7
Sweep_Boundary_Peel      0.7
Boundary_Spiral_inward   1.1
Boundary_Spiral_outward  1.2
Sweep_Segment_Snake      26.7
Sweep_Boustrophedon      27.1
Sweep_Row_OneWay         29.2
Sweep_Row_Interleave     29.8
Sweep_Segment_Interl.    31.5
RL-Greedy (GPU)          30.6
RL-BoK16 (GPU)           31.2
RL-BoK16+2opt (GPU)      32.2

Under greedy decoding, the neural forward pass requires 30.6 ms per instance. While this is comparable to the slower CPU heuristics, the neural policy is solving a strictly harder, tightly constrained Hamiltonian path problem rather than a relaxed coverage sweep. Crucially, Best-of-16 sampling adds only 0.6 ms over the greedy baseline (31.2 ms total) because all 16 rollouts are processed simultaneously in a single batched GPU operation. The subsequent 2-opt refinement adds a further 1.0 ms (32.2 ms total), as its O(n²) inner loop over short paths (|V| ≤ 46) is highly optimized via Numba JIT compilation.

Although comparing CPU-bound heuristics with GPU-accelerated neural policies involves different hardware paradigms, it reflects modern operational realities: autonomous surface vehicles are increasingly equipped with edge AI accelerators (e.g., NVIDIA Jetson modules) designed specifically to host such neural workloads. At approximately 32 ms per instance, all three RL inference modes operate orders of magnitude faster than the typical decision and replanning cycles of maritime autonomous systems, which range from seconds to minutes, validating the feasibility of real-time on-board deployment.

6.4. Qualitative path visualization

Fig. 6 presents a comparative visualization of selected methods on representative test instances.
The instances were chosen to illustrate three characteristic scenarios: (a) a compact AOI with a central obstacle, (b) an elongated corridor with narrow passages, and (c) an irregular shape with multiple peninsulas. Across all scenarios, the RL policy produces visually smoother trajectories with fewer sharp heading changes, consistent with the quantitative turn reduction reported in Section 6.2.

The Boustrophedon pattern, while achieving complete coverage, exhibits the characteristic zigzag structure with frequent 180° reversals. Warnsdorff produces relatively smooth paths when it succeeds, but fails entirely on instances (b) and (c), where narrow passages create topological traps. The STC methods produce the most redundant paths, with extensive backtracking visible as dense, overlapping segments. Overall, these qualitative visualizations consistently reflect the quantitative trends observed across the entire 1,000-instance test set, confirming the RL policy's superior ability to adapt to complex maritime topologies without relying on redundant motion.

6.5. Study limitations

The present study should be interpreted with three scope limitations in mind. First, the proposed RL policy is optimized for the strict no-revisit Hamiltonian formulation, whereas most classical baselines are naturally designed for revisit-allowed complete coverage. For this reason, Hamiltonian feasibility and relaxed coverage completion are reported separately, and route-quality metrics are evaluated conditionally on common solved subsets. Second, the experiments are conducted on a large synthetic dataset of irregular maritime AOIs. Although the generator is calibrated to operationally meaningful area scales, obstacle densities, and offshore base distances, transfer to real nautical charts with shoreline artifacts, bathymetric constraints, and sensor uncertainty remains to be validated.
Third, the training results reported here correspond to a single full training run due to the substantial computational cost of the end-to-end DRL pipeline. Consequently, the reported uncertainty reflects test-set variability rather than run-to-run optimization variability. Extending the study to multiple independent training seeds is a natural next step for future work.

7. Conclusion and Future Work

This paper introduced a Deep Reinforcement Learning framework for Maritime Coverage Path Planning on irregular hexagonal grids. By framing the problem as a neural combinatorial optimization task and training a Transformer-based pointer policy via a critic-free Group-Relative Policy Optimization (GRPO) scheme, we bypassed the limitations of classical geometric decomposition methods. The incorporation of an early dead-end detection mechanism via BFS significantly sharpened credit assignment during training.

Extensive benchmarking over 1,000 unseen irregular test areas demonstrates that the learned policy, particularly under Best-of-16 sampling with adjacency-aware 2-opt refinement (RL-BoK16+2opt), achieves a 99.0% Hamiltonian success rate, more than doubling the best classical heuristic (46.0%), while producing paths 7% shorter and with up to 24% fewer heading changes than the closest comparable baseline, all with zero node revisits. Inference latency of approximately 32 ms per instance on a laptop GPU confirms the viability of real-time on-board deployment on modern autonomous maritime platforms.

Several extensions of this work merit investigation. First, non-uniform coverage priorities can be incorporated by activating the hexscore field already embedded in the node features, enabling time-critical surveillance where high-value zones (e.g., distress areas informed by drift models) are visited preferentially.
Second, multi-platform coordination can be addressed by partitioning the AOI graph and assigning sub-tours to heterogeneous assets, extending the single-agent formulation to multi-agent fleet-level planning. Third, node revisitation under a time budget, architecturally supported but not evaluated in the present study, could improve coverage robustness on highly constrained topologies where strict single-visit Hamiltonian paths are fragile. Fourth, scaling to larger AOIs (e.g., hyper-resolution maritime grids with 100+ cells) will involve exploring architectural variants, such as alternative attention mechanisms, sparse graph representations, or region-level hierarchical decomposition, to further accelerate training and inference.

Figure 6: Comparative path visualizations across three representative maritime topologies. Panels (a)–(o) show, for each of three instances, Exact DFS (H), RL-BoK+2opt, Warnsdorff, Boustrophedon, and STC. The leftmost column (highlighted in blue) displays the Exact DFS (denoted by H for Hamiltonian) serving as a topological feasibility oracle. Row 1 shows a small open-water configuration where all methods succeed, though STC introduces dense overlaps. Row 2 presents an H-shaped coastal corridor; here, the RL policy achieves a smooth trajectory, whereas greedy local heuristics like Warnsdorff fail to escape the resulting geometric dead-end. Row 3 illustrates a generic irregular shape, where the RL policy successfully identifies a valid single-visit sequence that avoids the costly kinematic reversals typical of Boustrophedon sweeps.

Finally, validation on real maritime charts (e.g., Chilean archipelago coastlines) and
deployment on autonomous surface vehicles under realistic oceanographic conditions and sensor models represent the natural applied extension of this framework toward operational Maritime Domain Awareness.

CRediT authorship contribution statement

Carlos S. Sepúlveda: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing – original draft, Writing – review & editing. Gonzalo A. Ruz: Supervision, Validation, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The synthetic dataset and trained model checkpoint used in this study will be made available upon reasonable request to the corresponding author. Evaluation scripts and baseline implementations will be released in a public repository upon acceptance.

Acknowledgments

This work is also supported by the Chilean Navy through the Directorate of Programs, Research and Development (Armada de Chile), which provided the necessary time and authorization for the development of this research. The authors thank ANID FONDECYT 1230315, ANID-MILENIO-NCN2024_103, ANID-MILENIO-NCN2024_047, Centro de Modelamiento Matemático (CMM) FB210005 (BASAL funds for centers of excellence from ANID-Chile), and the ANID Doctorado Nacional Scholarship, grant number 21210465.

Appendix A. Taxonomy of Surveillance and Routing Problems

Table A.7 summarizes the main families of coverage, patrolling, and routing problems underpinning our approach.

References

Ai, B., Jia, M., Xu, H., Xu, J., Wen, Z., Li, B., Zhang, D., 2021. Coverage path planning for maritime search and rescue using reinforcement learning. Ocean Engineering 241. doi:10.1016/j.oceaneng.2021.110098.

Alpdemir, M.N., 2022. Tactical UAV path optimization under radar threat using deep reinforcement learning.
Neural Computing and Applications 34, 5649–5664.

Azad, A.S., Islam, M.M., Chakraborty, S., 2017. A heuristic initialized stochastic memetic algorithm for MDPVRP with interdependent depot operations. IEEE Transactions on Cybernetics 47, 4302–4315. doi:10.1109/TCYB.2016.2607220.

Table A.7: Taxonomy of surveillance-related coverage, patrolling, and routing problems underpinning our CPP formulation on hexagonal grids.

Family: Region coverage (CPP)
Objective and formulation: Achieve complete coverage of a known area of interest (AOI) while minimising path length, time, or energy. Classical CPP uses cellular decompositions (boustrophedon, trapezoidal, convex) and sweep patterns (e.g., lawnmower), extended to multi-robot and UAV settings.
Relation to this work: Our hexagonal AOI discretisation is a discrete CPP formulation in which each hex cell represents a region of the maritime AOI. We follow CPP surveys in robotics and UAVs (Choset, 2001; Galceran and Carreras, 2013; Cabreira et al., 2019; Fevgas et al., 2022) and connect to hex-based CPP and SAR planning in maritime contexts (Azpúrua et al., 2018; Cho et al., 2021b; Kadioglu et al., 2019).

Family: Sweep and barrier coverage
Objective and formulation: Ensure that every point of a region or a virtual barrier is periodically visited by mobile sensors, subject to revisit intervals or detection constraints. Models often rely on TSP/graph tours, Eulerian paths, or approximation algorithms for large-scale sweep coverage and barrier placement.
Relation to this work: When maritime AOIs degenerate into corridors (routes, shorelines, chokepoints), our hex-grid CPP reduces to sweep or barrier-type problems over 1D/2D structures. Sweep-coverage and barrier-coverage results in WSNs (Li et al., 2011; Gorain and Mandal, 2014; Benahmed and Benahmed, 2019; Nguyen and So-In, 2018; Li et al., 2019; Kong et al., 2016) motivate modeling "barriers" or critical routes via specialized boundary constraints and periodic revisit scheduling.

Family: Patrolling and persistent surveillance
Objective and formulation: Maintain high-quality surveillance over long horizons by minimizing maximal idle time, maximizing information gain, or enforcing revisit-frequency constraints under vehicle range and kinematic limitations. Formulations often use infinite-horizon control or MILP-based routing with information metrics.
Relation to this work: Our weighted coverage objective can be interpreted as a single-visit proxy for persistent-surveillance value on hex grids. Patrolling and persistent-surveillance works with UAVs and surface vehicles (Nigam et al., 2009; Zuo et al., 2020; Bandarupalli et al., 2021; Savkin and Huang, 2019; Luis et al., 2020, 2021) motivate the use of information-based metrics that could inform future extensions with non-uniform priorities.

Family: Routing for surveillance and mission planning
Objective and formulation: Plan routes for one or multiple platforms that visit regions or tasks while minimizing mission cost (time, fuel) and respecting range, time windows, and motion constraints. Typical formulations are TSP/VRP variants, orienteering, and Dubins-type routing, often solved via MILP or heuristics.
Relation to this work: Our approach inherits the routing structure of maritime surveillance mission-planning models: hex cells act as "customers" to be visited, and tours must satisfy sensor and endurance constraints. Classical radar/SAR routing and UAV mission-planning work (Panton and Elbers, 1999; John et al., 2001; Grob, 2006; Karasakal, 2016; Coutinho et al., 2018; Otto et al., 2018; Cho et al., 2021b,a) serve as baselines to benchmark our learned policies.

Family: Spatial crowdsourcing and task assignment
Objective and formulation: Assign spatial sensing or monitoring tasks to agents (humans, vehicles, sensors) to maximise utility or coverage subject to capacity, location, and temporal constraints. Models are often matching or assignment problems with uncertainty in agent availability and task locations.
Relation to this work: In multi-platform maritime surveillance, our hex-based CPP can be extended with task-assignment layers that allocate subsets of hex cells or sub-tours to heterogeneous assets. Spatial crowdsourcing and mobile-sensing literature (Wu et al., 2019; Bhatti et al., 2021; Tong et al., 2020; Zhou et al., 2019; Chen et al., 2020) informs potential extensions where different vehicles or sensor types share the same hex-grid representation.

Family: Learning-based combinatorial optimisation (NCO/RL4CO)
Objective and formulation: Learn parametric policies that output near-optimal solutions for routing, scheduling, or CPP problems, enabling reuse across instances without re-solving from scratch. Approaches include pointer networks and attention-based models trained with RL or supervised signals on graph-structured inputs.
Relation to this work: Our transformer-like pointer policy over hex graphs belongs to this family. We adapt ideas from attention-based routing and RL4CO (Vinyals et al., 2015; Bello et al., 2016; Nazari et al., 2018; Kool et al., 2018; Xin et al., 2021; Li et al., 2021a; Berto et al., 2025; Darvariu et al., 2024) and combine them with advanced training strategies for long-horizon coverage on hex grids, bridging maritime CPP with neural combinatorial optimisation.

Azpúrua, H., Freitas, G.M., Macharet, D.G., Campos, M.F., 2018. Multi-robot coverage path planning using hexagonal segmentation for geophysical surveys. Robotica 36, 1144–1166. doi:10.1017/S0263574718000292.

Bähnemann, R., Lawrance, N., Chung, J.J., Pantic, M., Siegwart, R., Nieto, J., 2021.
Revisiting boustrophedon coverage path planning as a generalized traveling salesman problem, in: Field and Service Robotics: Results of the 12th International Conference, Springer, pp. 277–290.

Bandarupalli, A., Swarup, D., Weston, N., Chaterji, S., 2021. Persistent airborne surveillance using semi-autonomous drone swarms, in: Proceedings of the 7th Workshop on Micro Aerial Vehicle Networks, Systems, and Applications, pp. 19–24. doi:10.1145/3469259.3470487.

Bello, I., Pham, H., Le, Q.V., Norouzi, M., Bengio, S., 2016. Neural combinatorial optimization with reinforcement learning. 5th International Conference on Learning Representations, ICLR 2017 Workshop Track Proceedings.

Benahmed, T., Benahmed, K., 2019. Optimal barrier coverage for critical area surveillance using wireless sensor networks. International Journal of Communication Systems 32. doi:10.1002/dac.3955.

Berto, F., Hua, C., Park, J., Luttmann, L., Ma, Y., Bu, F., Wang, J., Ye, H., Kim, M., Choi, S., et al., 2025. RL4CO: an extensive reinforcement learning for combinatorial optimization benchmark, in: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pp. 5278–5289.

Bhatti, S.S., Fan, J., Wang, K., Gao, X., Wu, F., Chen, G., 2021. An approximation algorithm for bounded task assignment problem in spatial crowdsourcing. IEEE Transactions on Mobile Computing 20, 2536–2549. doi:10.1109/TMC.2020.2984380.

Biniaz, A., Liu, P., Maheshwari, A., Smid, M., 2017. Approximation algorithms for the unit disk cover problem in 2D and 3D. Computational Geometry 60, 8–18. doi:10.1016/J.COMGEO.2016.04.002.

Bischof, Z.S., Fontugne, R., Bustamante, F.E., 2018. Untangling the world-wide mesh of undersea cables, in: Proceedings of the 17th ACM Workshop on Hot Topics in Networks, pp. 78–84.

Boots, B., Okabe, A., Sugihara, K., 1999. Spatial tessellations, in: Geographical Information Systems, pp. 503–526.

Boraz, S.C., 2009.
Maritime domain awareness: Myths and realities. Naval War College Review 62, 137–146. doi:10.2307/26397039.

Bueger, C., Edmunds, T.P., Stockbruegger, J., 2024. Securing the seas: A comprehensive assessment of global maritime security.

Cabreira, T.M., Brisolara, L.B., Paulo, R.F., 2019. Survey on coverage path planning with unmanned aerial vehicles. Drones 3, 1–38. doi:10.3390/drones3010004.

Chen, C.C., Chang, C.Y., Chen, P.Y., 2015. Linear time approximation algorithms for the relay node placement problem in wireless sensor networks with hexagon tessellation. Journal of Sensors 2015. doi:10.1155/2015/565983.

Chen, M., Wang, T., Ota, K., Dong, M., Zhao, M., Liu, A., 2020. Intelligent resource allocation management for vehicles network: An A3C learning approach. Computer Communications 151, 485–494. doi:10.1016/J.COMCOM.2019.12.054.

Cho, S.W., Park, H.J., Lee, H., Shim, D.H., Kim, S.Y., 2021a. Coverage path planning for multiple unmanned aerial vehicles in maritime search and rescue operations. Computers and Industrial Engineering 161. doi:10.1016/j.cie.2021.107612.

Cho, S.W., Park, J.H., Park, H.J., Kim, S., 2021b. Multi-UAV coverage path planning based on hexagonal grid decomposition in maritime search and rescue. Mathematics 10, 83.

Choi, Y., Choi, Y., Briceno, S., Mavris, D.N., 2019. Multi-UAS path-planning for a large-scale disjoint disaster management, in: 2019 International Conference on Unmanned Aircraft Systems (ICUAS), pp. 799–807.

Choi, Y., Choi, Y., Briceno, S., Mavris, D.N., 2020. Energy-constrained multi-UAV coverage path planning for an aerial imagery mission using column generation. Journal of Intelligent & Robotic Systems 97, 125–139.

Choset, H., 2001. Coverage for robotics: a survey of recent results. Annals of Mathematics and Artificial Intelligence 31, 113–126.

Coutinho, W.P., Battarra, M., Fliege, J., 2018.
The unmanned aerial vehicle routing and trajectory optimisation problem, a taxonomic review. Computers & Industrial Engineering 120, 116–128. doi:10.1016/J.CIE.2018.04.037.

Darvariu, V.A., Hailes, S., Musolesi, M., 2024. Graph reinforcement learning for combinatorial optimization: A survey and unifying perspective. arXiv preprint arXiv:2404.06492.

Dogancay, K., Tu, Z., Ibal, G., 2021. Research into vessel behaviour pattern recognition in the maritime domain: Past, present and future. Digital Signal Processing 119, 103191. doi:10.1016/J.DSP.2021.103191.

Fevgas, G., Lagkas, T., Argyriou, V., Sarigiannidis, P., 2022. Coverage path planning methods focusing on energy efficient and cooperative strategies for unmanned aerial vehicles. Sensors 22. doi:10.3390/s22031235.

Galceran, E., Carreras, M., 2013. A survey on coverage path planning for robotics. Robotics and Autonomous Systems 61, 1258–1276.

Gao, M., Kang, Z., Zhang, A., Liu, J., Zhao, F., 2022. MASS autonomous navigation system based on AIS big data with dueling deep Q networks prioritized replay reinforcement learning. Ocean Engineering 249. doi:10.1016/J.OCEANENG.2022.110834.

Gorain, B., Mandal, P.S., 2014. Approximation algorithms for sweep coverage in wireless sensor networks. Journal of Parallel and Distributed Computing 74, 2699–2707.

Grob, M.J., 2006. Routing of platforms in a maritime surface surveillance operation. European Journal of Operational Research 170, 613–628. doi:10.1016/j.ejor.2004.02.029.

Huang, S., Ontañón, S., 2022. A closer look at invalid action masking in policy gradient algorithms, in: The International FLAIRS Conference Proceedings.

Islam, K., Meijer, H., Rodríguez, Y.N., Rappaport, D., Xiao, H., 2007. Hamilton circuits in hexagonal grid graphs, in: CCCG, pp. 85–88.

John, M., Panton, D., White, K., 2001. Mission planning for regional surveillance. Annals of Operations Research 108, 157–173.

Kadioglu, E., Urtis, C., Papanikolopoulos, N., 2019.
UAV coverage using hexagonal tessellation, in: 27th Mediterranean Conference on Control and Automation, MED 2019 Proceedings, Institute of Electrical and Electronics Engineers Inc., pp. 37–42. doi:10.1109/MED.2019.8798564.

Kaelbling, L.P., Littman, M.L., Moore, A.W., 1996. Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4, 237–285. doi:10.1613/JAIR.301.

Karapetyan, N., Braude, A., Moulton, J., Burstein, J.A., White, S., O'Kane, J.M., Rekleitis, I., 2019. Riverine coverage with an autonomous surface vehicle over known environments, in: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp. 3098–3104.

Karasakal, O., 2016. Minisum and maximin aerial surveillance over disjoint rectangles. Top 24, 705–724. doi:10.1007/s11750-016-0416-1.

Kong, L., Lin, S., Xie, W., Qiao, X., Jin, X., Zeng, P., Ren, W., Liu, X.Y., 2016. Adaptive barrier coverage using software defined sensor networks. IEEE Sensors Journal 16, 7364–7372. doi:10.1109/JSEN.2016.2566808.

Kool, W., Hoof, H.V., Welling, M., 2018. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475.

Kumar, K., Kumar, N., 2023. Region coverage-aware path planning for unmanned aerial vehicles: A systematic review. Physical Communication 59, 102073. doi:10.1016/J.PHYCOM.2023.102073.

Kwon, Y.D., Choo, J., Kim, B., Yoon, I., Gwon, Y., Min, S., 2020. POMO: Policy optimization with multiple optima for reinforcement learning. Advances in Neural Information Processing Systems 33, 21188–21198.

Li, J., Ma, Y., Gao, R., Cao, Z., Lim, A., Song, W., Zhang, J., 2021a. Deep reinforcement learning for solving the heterogeneous capacitated vehicle routing problem. IEEE Transactions on Cybernetics 54, 13572–13585. doi:10.1109/TCYB.2021.3111082.

Li, J., Xin, L., Cao, Z., Lim, A., Song, W., Zhang, J., 2021b.
Heterogeneous attentions for solving pickup and delivery problem via deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems.

Li, J., Xiong, Y., She, J., Wu, M., 2020. A path planning method for sweep coverage with multiple UAVs. IEEE Internet of Things Journal 7, 8967–8978. doi:10.1109/JIOT.2020.2999083.

Li, L., Chen, H., 2022. UAV enhanced target-barrier coverage algorithm for wireless sensor networks based on reinforcement learning. Sensors 22, 6381. doi:10.3390/S22176381.

Li, M., Cheng, W., Liu, K., Liu, Y., Li, X., Liao, X., 2011. Sweep coverage with mobile sensors. IEEE Transactions on Mobile Computing 10, 1534–1545.

Li, S., Shen, H., Huang, Q., Guo, L., 2019. Optimizing the sensor movement for barrier coverage in a sink-based deployed mobile sensor network. IEEE Access 7, 156301–156314. doi:10.1109/ACCESS.2019.2949025.

Li, X., Savkin, A.V., 2021. Networked unmanned aerial vehicles for surveillance and monitoring: A survey. Future Internet 13. doi:10.3390/FI13070174.

Liu, Z., Liu, Y., 2021. Agent-based simulation of multi-UAV search-track for dynamic targets in sweep coverage, in: Journal of Physics: Conference Series, IOP Publishing, p. 012033.

Luis, S.Y., Reina, D.G., Marin, S.L., 2021. A multiagent deep reinforcement learning approach for path planning in autonomous surface vehicles: The Ypacaraí lake patrolling case. IEEE Access 9, 17084–17099. doi:10.1109/ACCESS.2021.3053348.

Luis, S.Y., Reina, D.G., Marín, S.L.T., 2020. A deep reinforcement learning approach for the patrolling problem of water resources through autonomous surface vehicles: The Ypacarai lake case. IEEE Access 8, 204076–204093. doi:10.1109/ACCESS.2020.3036938.

Mier, G., Valente, J., Bruin, S.D., 2023. Fields2Cover: An open-source coverage path planning library for unmanned agricultural vehicles. IEEE Robotics and Automation Letters 8, 2166–2172.
Nazari, M., Oroojlooy, A., Takáč, M., Snyder, L.V., 2018. Reinforcement learning for solving the vehicle routing problem. Advances in Neural Information Processing Systems 31.

Nguyen, T.G., So-In, C., 2018. Distributed deployment algorithm for barrier coverage in mobile sensor networks. IEEE Access 6, 21042–21052. doi:10.1109/ACCESS.2018.2822263.

Nielsen, L.D., Sung, I., Nielsen, P., 2019. Convex decomposition for a coverage path planning for autonomous vehicles: Interior extension of edges. Sensors 19. doi:10.3390/s19194165.

Nigam, N., Bieniawski, S., Kroo, I., Vian, J., 2009. Control of multiple UAVs for persistent surveillance: Algorithm description and hardware demonstration. IEEE Transactions on Control Systems Technology 20, 1236–1251.

Otto, A., Agatz, N., Campbell, J., Golden, B., Pesch, E., 2018. Optimization approaches for civil applications of unmanned aerial vehicles (UAVs) or aerial drones: A survey. Networks 72, 411–458. doi:10.1002/NET.21818.

Panton, D.M., Elbers, A.W., 1999. Mission planning for synthetic aperture radar surveillance. Interfaces 29, 73–88. doi:10.1287/inte.29.2.73.

Comité de Ministros para el desarrollo de una Política Oceánica Nacional, 2023. Programa oceánico nacional: Plan oceánico sostenible Chile 2023.

Raza, S.M., Sajid, M., Singh, J., 2022. Vehicle routing problem using reinforcement learning: Recent advancements, in: Lecture Notes in Electrical Engineering, Springer Science and Business Media Deutschland GmbH, pp. 269–280. doi:10.1007/978-981-19-0840-8_20.

Savkin, A.V., Huang, H., 2019. Proactive deployment of aerial drones for coverage over very uneven terrains: A version of the 3D art gallery problem. Sensors 19, 1438. doi:10.3390/S19061438.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., et al., 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

Siew, P.M., Jang, D., Roberts, T.G., Linares, R., Fletcher, J., 2022. Cislunar space situational awareness sensor tasking using deep reinforcement learning agents, in: 2022 Advanced Maui Optical and Space Surveillance Technologies Conference (AMOS), Maui, Hawaii.

Siew, P.M., Linares, R., 2022. Optimal tasking of ground-based sensors for space situational awareness using deep reinforcement learning. Sensors 22. doi:10.3390/s22207847.

Soldi, G., Gaglione, D., Forti, N., Simone, A.D., Daffinà, F.C., Bottini, G., Quattrociocchi, D., Millefiori, L.M., Braca, P., Carniel, S., Willett, P., Iodice, A., Riccio, D., Farina, A., 2021. Space-based global maritime surveillance. Part I: Satellite technologies. IEEE Aerospace and Electronic Systems Magazine 36, 8–28.

Sutton, R.S., Barto, A.G., 2018. Reinforcement Learning: An Introduction, second ed. MIT Press.

Tan, C.S., Mohd-Mokhtar, R., Arshad, M.R., 2021. A comprehensive review of coverage path planning in robotics using classical and heuristic algorithms. IEEE Access 9, 119310–119342. doi:10.1109/ACCESS.2021.3108177.

Tong, Y., Zhou, Z., Zeng, Y., Chen, L., Shahabi, C., 2020. Spatial crowdsourcing: a survey. VLDB Journal 29, 217–250. doi:10.1007/S00778-019-00568-7/METRICS.

United Nations Trade and Development (UNCTAD), 2024. Review of Maritime Transport 2024: Navigating maritime chokepoints. Stylus Publishing, LLC.

Vinyals, O., Fortunato, M., Jaitly, N., 2015. Pointer networks. Advances in Neural Information Processing Systems 28.

Wu, J., Cheng, L., Chu, S., Song, Y., 2024. An autonomous coverage path planning algorithm for maritime search and rescue of persons-in-water based on deep reinforcement learning. Ocean Engineering 291, 116403.
Wu, L., Xiong, Y., Wu, M., He, Y., She, J., 2019. A task assignment method for sweep coverage optimization based on crowdsensing. IEEE Internet of Things Journal 6, 10686–10699. doi:10.1109/JIOT.2019.2940717.

Xin, L., Song, W., Cao, Z., Zhang, J., 2021. Multi-decoder attention model with embedding glimpse for solving vehicle routing problems, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12042–12049.

Zelenka, J., Kasanický, T., Bundzel, M., Andoga, R., 2020. Self-adaptation of a heterogeneous swarm of mobile robots to a covered area. Applied Sciences (Switzerland) 10. doi:10.3390/app10103562.

Zhou, Z., Liu, P., Feng, J., Zhang, Y., Mumtaz, S., Rodriguez, J., 2019. Computation resource allocation and task assignment optimization in vehicular fog computing: A contract-matching approach. IEEE Transactions on Vehicular Technology 68, 3113–3125. doi:10.1109/TVT.2019.2894851.

Zuo, Y., Tharmarasa, R., Jassemi-Zargani, R., Kashyap, N., Thiyagalingam, J., Kirubarajan, T.T., 2020. MILP formulation for aircraft path planning in persistent surveillance. IEEE Transactions on Aerospace and Electronic Systems 56, 3796–3811. doi:10.1109/TAES.2020.2983532.

Zweig, A., Ahmed, N., Willke, T.L., Ma, G., 2020. Neural algorithms for graph navigation, in: Learning Meets Combinatorial Algorithms at NeurIPS 2020.
