SutureAgent: Learning Surgical Trajectories via Goal-conditioned Offline RL in Pixel Space

Huanrong Liu 1, Chunlin Tian 1, Tongyu Jia 2, Tailai Zhou 2, Qin Liu 1, Yu Gao 2, Yutong Ban 4, Yun Gu 4, Guy Rosman 3, Xin Ma 2, and Qingbiao Li 1

1 University of Macau, Macau, China
2 The Chinese PLA General Hospital, Beijing, China
3 Duke University, Durham, North Carolina, USA
4 Shanghai Jiao Tong University, Shanghai, China
{qingbiaoli}@um.edu.mo

Abstract. Predicting surgical needle trajectories from endoscopic video is critical for robot-assisted suturing, enabling anticipatory planning, real-time guidance, and safer motion execution. Existing methods that directly learn motion distributions from visual observations tend to overlook the sequential dependency among adjacent motion steps. Moreover, sparse waypoint annotations often fail to provide sufficient supervision, further increasing the difficulty of supervised or imitation learning methods. To address these challenges, we formulate image-based needle trajectory prediction as a sequential decision-making problem in which the needle tip is treated as an agent that moves step by step in pixel space. This formulation naturally captures the continuity of needle motion and enables explicit modeling of physically plausible pixel-wise state transitions over time. From this perspective, we propose SutureAgent, a goal-conditioned offline reinforcement learning framework that converts sparse annotations into dense reward signals via cubic spline interpolation, encouraging the policy to exploit limited expert guidance while exploring plausible future motion paths. SutureAgent encodes variable-length clips using an observation encoder to capture both local spatial cues and long-range temporal dynamics, and autoregressively predicts future waypoints through actions composed of discrete directions and continuous magnitudes.
To enable stable offline policy optimization from expert demonstrations, we adopt Conservative Q-Learning (CQL) with Behavioral Cloning (BC) regularization. Experiments on a new kidney wound suturing dataset containing 1,158 trajectories from 50 patients show that SutureAgent reduces Average Displacement Error (ADE) by 58.6% compared with the strongest baseline, demonstrating the effectiveness of modeling needle trajectory prediction as pixel-level sequential action learning.

Keywords: Surgical Trajectory Prediction · Offline Reinforcement Learning · Robot-Assisted Surgery · Endoscopic Video Analysis

1 Introduction

Robot-assisted surgery is gradually evolving from pure tele-operation toward task-level autonomy, where intelligent systems are expected to anticipate surgical intent and provide proactive assistance [25, 1]. Within this paradigm, predicting instrument (e.g., needle) trajectories is especially valuable during suturing, one of the most technically demanding and outcome-sensitive maneuvers in minimally invasive procedures. Accurate future trajectory estimation can support anticipatory planning, real-time guidance, and safer motion execution. From the perspective of bounded rationality [20], surgeons operate under inherent cognitive constraints: limited attention span, finite working memory, and time pressure collectively bound the optimality of intraoperative decisions. A learning-based trajectory prediction system that distills expert demonstrations into anticipatory guidance can therefore serve as a decision-support mechanism, helping less experienced surgeons approximate expert-level performance and ultimately improving access to high-quality surgical care.
Despite growing interest in deep learning for surgical scene analysis [14], including workflow recognition [10] and scene understanding [15], research on precise procedural assistance for trajectory prediction remains nascent. Currently, most approaches depend on robot kinematic signals, including joint angles, end-effector poses, and gripper states [17, 19, 23]. Such data are available only on platforms with accessible kinematic interfaces (e.g., the da Vinci Research Kit) and often require direct collaboration with device manufacturers. This dependency fundamentally limits transferability: a policy trained on one robotic platform cannot generalize to another, let alone to the vast archive of conventional laparoscopic video where no kinematic readout exists. In addition, many methods require dense temporal annotations, which are prohibitively expensive to obtain.

Learning control directly from vision is a challenging but transformative idea, one that has seen remarkable success in adjacent fields. For instance, purely vision-based imitation learning agents have mastered complex control tasks like driving, learning to navigate and plan by directly mapping image inputs to steering commands [16, 3]. Cai et al. [4] demonstrated that cost functions for reinforcement learning (RL) can be learned directly from images, bypassing the need for explicit state estimation. Similarly, Tamar et al. proposed Value Iteration Networks (VIN) [21], showing that a neural network can learn to perform goal-directed reasoning, generalizing its policy to novel environments by embedding a computational process akin to planning within its architecture. These insights suggest that the "kinematic bottleneck" in surgical trajectory prediction is not a technological inevitability but a design choice, although closing it is more challenging in fine-grained surgical manipulation scenarios.
Recent work on surgical trajectory prediction has only just emerged, indicating that such high-level assistance tasks can support preoperative training and intraoperative guidance but remain underexplored. For example, Li et al. [12] introduced imitation learning for dissection trajectory prediction, using an implicit diffusion policy (iDiff-IL) to model a joint state-action distribution that captures uncertainty in future dissection trajectories; it nevertheless struggles to predict uncertain movements and to generalize across varied scenes. Subsequent work has attempted to learn motion distributions directly from surgical videos. Zhao et al.'s multi-scale phase-conditioned diffusion (MS-PCD) [26] cascades diffusion models across multiple scales to generate conditioned action sequences for procedure planning. Xu et al. proposed DP4AuSu, applying diffusion policies to autonomous suturing with dynamic time warping and locally weighted regression to capture multimodal demonstration trajectories, achieving 94% insertion success in simulation and 85% suture success on a real robot [24]. However, these diffusion-based generative approaches either predict an entire trajectory in one pass or depend on slow autoregressive decoding, which increases inference complexity and can accumulate errors. More importantly, they fail to explicitly model stepwise motion dependency, and their reliance on sparse annotations provides insufficient supervision for long-horizon trajectory learning.

These limitations motivate a different modeling approach: rather than casting image-driven needle trajectory prediction as a one-off regression or generative problem, we formulate it as a stepwise decision-making process in which the needle tip is treated as an agent progressing through pixel space.
This view better conforms to the physical continuity of motion and allows explicit modeling of pixel-wise state transitions; sparse annotations can be exploited as reward signals rather than direct supervision, making reinforcement learning (RL) a natural choice for learning from limited guidance. RL has already been explored in surgical robotics: Ji et al. [9] proposed a heuristically accelerated deep Q-network for safe planning of neurosurgical flexible needle insertion paths, demonstrating that RL can plan complex trajectories but requires many training episodes; Lin et al. [13] built a world-model-driven pixel-level deep RL framework (GAS) that achieves robust grasping for various surgical instruments. Nevertheless, RL has not been fully exploited for image-based needle trajectory prediction. In this work, we leverage RL to model the dependency between adjacent motion steps and to learn effective trajectory policies from sparse supervision.

To this end, we propose SutureAgent, which reformulates surgical trajectory prediction as goal-conditioned visual navigation in pixel space using offline expert demonstrations. A goal-conditioned policy iteratively predicts future needle waypoints based solely on local visual context and sparse keyframe guidance. The observation encoder combines a spatial CNN and a Transformer to capture spatial cues and long-range temporal dependencies. A discrete-action Conservative Q-Learning (CQL) agent then autoregressively generates trajectory waypoints using a 9-direction action space with continuous step magnitudes, conditioned on keyframe goals during training and on polynomial extrapolation at inference. In practice, SutureAgent requires only endoscopic video frames and 9 sparse keyframe annotations per trajectory. To our knowledge, this is the first framework to formulate surgical trajectory prediction as a visual navigation task with offline RL in pixel space.
Our method outperforms current state-of-the-art methods, demonstrating the viability of offline RL for surgical trajectory prediction. In summary, our contributions are threefold:

– We reformulate image-based surgical needle trajectory prediction as a sequential decision-making problem in pixel space, providing a new perspective that explicitly captures motion continuity and stepwise state transitions.
– We propose a goal-conditioned offline RL framework that converts sparse waypoint annotations into dense reward signals via cubic spline interpolation, allowing limited supervision to guide policy learning more effectively than direct supervised or imitation-based formulations.
– We validate the proposed method on a new kidney wound suturing dataset with 1,158 trajectories from 50 patients, where it achieves substantial improvements over strong baselines, demonstrating the promise of our paradigm for surgical trajectory prediction.

2 Methods

2.1 Problem Statement and Annotation Interpolation

Given a variable-length observed video segment $V_{0:T_{obs}} = \{I_t\}_{t=0}^{T_{obs}}$ and the corresponding needle-tip positions $\{(x_t, y_t)\}_{t=0}^{T_{obs}}$, the goal is to predict the future trajectory $\{(\hat{x}_t, \hat{y}_t)\}_{t=T_{obs}+1}^{T_{obs}+T_{pred}}$. We cast this task as goal-conditioned sequential prediction in pixel space and learn the predictor entirely from an offline dataset of expert demonstrations. Specifically, each training trajectory is transformed into a sequence of state-transition tuples extracted fully from annotated expert motion, without any online interaction. The overall framework consists of an observation encoder that summarizes visual-temporal context and a goal-conditioned policy head that autoregressively predicts future waypoint displacements.
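As a concrete illustration, the conversion of one annotated trajectory into offline transition tuples might be sketched as follows. This is a simplified sketch, not the paper's implementation: function names (`quantize_direction`, `make_transitions`) are hypothetical, actions are zero-indexed here, and raw normalized positions stand in for the fully encoded states.

```python
import numpy as np

def quantize_direction(delta):
    """Map a 2D displacement to the nearest of 8 compass actions
    (0..7, spaced 45 degrees apart), or the idle action 8 for a
    near-zero step. The angle convention matches unit vectors of the
    form [sin(theta), -cos(theta)] in image coordinates."""
    if np.linalg.norm(delta) < 1e-6:
        return 8
    angle = np.arctan2(delta[0], -delta[1]) % (2 * np.pi)
    return int(np.round(angle / (np.pi / 4))) % 8

def make_transitions(positions, rewards):
    """positions: (K, 2) dense per-frame needle-tip coordinates in [0, 1];
    rewards: K-1 per-step rewards. Returns offline transition tuples
    (s_k, a_k, r_k, s_{k+1}, done) with positions standing in for states."""
    transitions = []
    for k in range(len(positions) - 1):
        a_k = quantize_direction(positions[k + 1] - positions[k])
        done = (k == len(positions) - 2)  # terminate at trajectory end
        transitions.append((positions[k], a_k, rewards[k], positions[k + 1], done))
    return transitions
```

In this convention a purely rightward step maps to action index 2 and a zero step to the idle action; the actual framework would replace the raw positions with the encoded states described in Sec. 2.3.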
To obtain dense per-frame annotations, we independently apply cubic spline interpolation [5] to the x and y coordinate sequences as functions of the frame index. Given the 9 keyframe positions, we fit two natural cubic spline functions, $S_x(t)$ and $S_y(t)$, and evaluate them at every intermediate frame within the temporal range. The interpolated coordinates are rounded to the nearest integer to yield valid pixel locations, without extrapolation beyond the first and last keyframes. Each interpolated frame receives a confidence score from 0.45 to 0.9 based on its temporal proximity to the nearest keyframe, assigning higher scores to closer frames. The expert-annotated keyframes retain a confidence score of 1.0.

2.2 Observation Encoder

At each observed time step, we extract a 128 × 128 RGB crop centered at the annotated needle-tip location in the full image. To complement local appearance cues with sparse trajectory supervision, inspired by G2RL [22], we construct an additional guidance channel over the same crop by rasterizing the available trajectory path into a single-channel heatmap. The heatmap intensity is modulated by annotation confidence, with manually annotated keyframes assigned higher confidence than interpolated intermediate positions. The RGB crop and the guidance map are concatenated to form a 4-channel input for each observed frame.

All 4-channel crops are concatenated and processed by a spatial feature extractor to obtain a unified embedding, which is then fed into a Transformer encoder to model temporal dependencies across the observation window. To support variable-length observation sequences, masked attention is applied so that the encoder attends only to valid time steps and ignores padded positions. In addition, we adopt bucketed batch sampling to reduce excessive padding and improve training efficiency. The final contextual representation, denoted by $z_c$, summarizes the visual and temporal evidence available up to the current prediction step.

Fig. 1. Overview of the proposed framework. (i) Given the observed video segment, the observation encoder extracts local visual-guidance features from needle-centered crops and aggregates their temporal dependencies with a Transformer to obtain the contextual representation $z_c$. (ii) At each prediction step $k$, the goal-conditioned state encoder constructs the state $s_k$ by combining $z_c$ with the encoded current position $\hat{p}_k$, guidance coordinate $g_k$, relative displacement $g_k - \hat{p}_k$, and normalized step ratio $k/T_{pred}$. (iii) The policy head then predicts a discrete motion direction and a continuous step magnitude to autoregressively update the needle-tip position and generate the future trajectory. (iv) Training is performed entirely offline on expert transitions using Conservative Q-Learning with twin critics, together with auxiliary behavior-cloning and magnitude-regression objectives.

2.3 Goal-conditioned Encoder

At prediction step $k$, the policy operates on a state vector

$$s_k = \phi(z_c, \hat{p}_k, g_k, g_k - \hat{p}_k, k/T_{pred}) \quad (1)$$
where $z_c$ is the contextual representation produced by the observation encoder, $\hat{p}_k$ is the current predicted position, $g_k$ is a step-specific guidance coordinate, $g_k - \hat{p}_k$ is the relative displacement from the current position to the guidance target, and $k/T_{pred}$ encodes the normalized prediction progress. Three independent sinusoidal coordinate encoders embed $\hat{p}_k$, $g_k$, and $g_k - \hat{p}_k$, respectively, while a linear layer projects the scalar progress ratio into the same feature space. Their concatenation with $z_c$ is then mapped by $\phi(\cdot)$ to the final state representation $s_k$.

We parameterize each motion step using a discrete direction and a continuous magnitude. The discrete action space contains nine actions: eight movement directions uniformly spaced at 45° intervals and one idle action. Let $a \in \{1, \ldots, 9\}$ denote the direction action. The corresponding unit vectors are defined as:

$$u_a = \begin{cases} \left[\sin\frac{(a-1)\pi}{4},\; -\cos\frac{(a-1)\pi}{4}\right]^{\top}, & a = 1, \ldots, 8 \\ (0, 0)^{\top}, & a = 9 \end{cases} \quad (2)$$

The policy predicts a categorical distribution $\pi(a \mid s_k)$ over the nine direction actions, while a separate magnitude head outputs a non-negative step length $\hat{m}_k \in [0, \delta_{max}]$. During inference, we compute the expected direction vector:

$$\hat{d}_k = \sum_{a=1}^{9} \pi(a \mid s_k)\, u_a \quad (3)$$

and generate the next displacement and position as:

$$\Delta \hat{p}_k = \hat{m}_k \hat{d}_k, \qquad \hat{p}_{k+1} = \mathrm{clip}(\hat{p}_k + \Delta \hat{p}_k,\, 0,\, 1) \quad (4)$$

where clipping enforces the normalized image-coordinate constraint $[0, 1]^2$. This formulation yields smooth autoregressive motion while retaining a discrete directional action space for value learning.

The guidance coordinate $g_k$ is defined differently for training and inference. During training, $g_k$ is taken from the ground-truth future trajectory available in the expert demonstration.
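The stepwise update in Eqs. (2)–(4) — an expected direction under the categorical policy, scaled by the predicted magnitude and clipped to the image — can be sketched as follows. `UNIT` and `step` are illustrative names, not the paper's implementation, and actions are zero-indexed in the sketch.

```python
import numpy as np

# Nine-action compass: eight unit vectors at 45-degree intervals,
# following u_a = [sin((a-1)pi/4), -cos((a-1)pi/4)] for a = 1..8,
# plus an idle action.
UNIT = np.array(
    [[np.sin((a - 1) * np.pi / 4), -np.cos((a - 1) * np.pi / 4)]
     for a in range(1, 9)]
    + [[0.0, 0.0]]  # idle action (a = 9)
)

def step(pos, probs, magnitude):
    """One autoregressive update in normalized coordinates.

    pos: current (x, y) in [0, 1]^2; probs: 9 action probabilities
    from the policy; magnitude: non-negative step length m_k."""
    expected_dir = probs @ UNIT                       # Eq. (3)
    return np.clip(pos + magnitude * expected_dir, 0.0, 1.0)  # Eq. (4)
```

For example, a one-hot distribution on the rightward action (index 2 in this sketch) moves the tip horizontally by exactly `magnitude`, and the clip keeps the updated position inside the image.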
During inference, future ground-truth positions are unavailable; therefore, we replace them with pseudo-guidance obtained by polynomial extrapolation from the last $L$ observed points ($L = 10$ by default). A quadratic least-squares fit is used when at least five observed points are available; otherwise, linear extrapolation is used. The extrapolated coordinates are also clipped to $[0, 1]^2$. Stated differently, the model is trained with oracle future guidance and evaluated with an extrapolated test-time surrogate.

2.4 Offline Conservative Q-Learning

The model is trained entirely offline from a static set of expert trajectories. For each training trajectory, we traverse the annotated sequence and extract expert transitions of the form $(s_k, a_k, r_k, s_{k+1}, d_k)$, where $a_k$ is obtained by quantizing the expert displacement to its nearest compass direction, $r_k$ is the step reward, and $d_k$ indicates episode termination. The next state $s_{k+1}$ is constructed from the successor position in the demonstration trajectory rather than generated through online environment interaction. This setup corresponds to offline reinforcement learning on logged expert transitions.

To learn conservative action values from fixed offline data, we adopt twin Q-functions optimized with Conservative Q-Learning (CQL) [11]. The critic objective combines a Bellman regression term with a conservative regularizer that suppresses Q-values for unsupported actions:

$$L_{Q_i} = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}\left[\left(Q_i(s, a) - y\right)^2\right] + \alpha_{cql}\, \mathbb{E}_{s \sim \mathcal{D}}\left[\log \sum_a \exp Q_i(s, a) - Q_i(s, a_{exp})\right] \quad (5)$$

Here, $a_{exp}$ denotes the expert action and $\alpha_{cql}$ controls the strength of the conservative penalty. The soft value target is computed as

$$V(s') = \sum_{a'} \pi(a' \mid s') \left[\min\left(Q_1^{tgt}(s', a'),\, Q_2^{tgt}(s', a')\right) - \alpha \log \pi(a' \mid s')\right] \quad (6)$$

with entropy temperature $\alpha = 0.2$.
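For a discrete 9-action critic, the conservative objective in Eq. (5) reduces to a log-sum-exp over the Q-row minus the expert action's value. A minimal NumPy sketch of this loss for one critic (the actual model applies it to each of the twin critics over encoded states; names here are illustrative):

```python
import numpy as np

def cql_critic_loss(q, a_exp, y, alpha_cql=0.01):
    """Conservative critic objective for one discrete-action critic.

    q: (B, 9) Q-values per state; a_exp: (B,) expert action indices;
    y: (B,) Bellman targets. On logged expert transitions the dataset
    action coincides with a_exp, so q_exp serves both terms."""
    q_exp = q[np.arange(len(q)), a_exp]          # Q_i(s, a_exp)
    bellman = np.mean((q_exp - y) ** 2)          # regression to target y
    # log sum_a exp Q_i(s, a) - Q_i(s, a_exp): pushes down Q-values of
    # actions unsupported by the data; always non-negative.
    conservative = np.mean(np.log(np.exp(q).sum(axis=1)) - q_exp)
    return bellman + alpha_cql * conservative
```

Because the log-sum-exp upper-bounds any single Q-value, the penalty is zero only in the limit where the expert action dominates all others, which is what drives the pessimism on unsupported actions.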
The policy is optimized using the entropy-regularized objective

$$L_{\pi} = \mathbb{E}_{s \sim \mathcal{D}}\left[\sum_a \pi(a \mid s)\left(\alpha \log \pi(a \mid s) - \min\left(Q_1(s, a),\, Q_2(s, a)\right)\right)\right] \quad (7)$$

Following our implementation, policy learning is further stabilized by a behavior-cloning term

$$L_{BC} = \mathrm{CE}\left(\text{logits},\, a_{exp}\right) \quad (8)$$

and the magnitude head is supervised with

$$L_{mag} = \lambda_{mag}\, \mathbb{E}\left[\left(\|\Delta \hat{p}_k\|_2 - \|p_{k+1} - p_k\|_2\right)^2\right] \quad (9)$$

Thus, the final training objective combines conservative value learning, expert-action imitation, and direct step-length supervision.

A key implementation detail is gradient-flow separation. Specifically, the critic consumes state features with detached gradients so that critic updates do not propagate into the observation encoder. In contrast, gradients from the actor loss and the magnitude-regression loss are allowed to update the shared encoder. This choice is motivated by the offline setting: the CQL critic is explicitly regularized to be pessimistic on unsupported actions, so allowing critic gradients to dominate the encoder could bias the representation toward conservative value suppression rather than accurate motion modeling. We therefore restrict encoder updates to objectives that are directly tied to displacement prediction.

2.5 Reward Design under Sparse Keyframe Supervision

Since only nine keyframes are manually annotated for each trajectory, we densify supervision by fitting a cubic spline to the annotated 2D coordinates and evaluating it at intermediate frames. This produces dense frame-wise reference positions $\{\tilde{p}_k\}_{k=0}^{K-1}$ for reward construction, while preserving the original keyframes as the most reliable supervisory anchors. Each interpolated frame is further assigned a confidence score according to its temporal proximity to the nearest manually annotated keyframe; manually annotated keyframes are assigned confidence 1.0, whereas interpolated positions receive lower confidence values.

At prediction step $k$, let $\hat{p}_k$ denote the predicted position and $\tilde{p}_k$ the corresponding dense reference position. Their Euclidean distance in normalized image coordinates is defined as

$$d_k = \|\hat{p}_k - \tilde{p}_k\|_2 \quad (10)$$

The reward at step $k$ consists of three components: a constant time penalty, a confidence-weighted proximity term, and a terminal shaping term applied only at the final prediction step:

$$r_k = r_{time} + r_{prox,k} + r_{term,k} \quad (11)$$

where the time penalty is fixed as $r_{time} = -0.01$ and the proximity reward is defined as:

$$r_{prox,k} = w_k\, r_{prox,max}\left(1 - \frac{d_k}{\tau}\right) \quad (12)$$

where $\tau = 0.02$ is the distance threshold in normalized coordinates and $r_{prox,max} = 0.5$. To reflect annotation reliability, the confidence weight $w_k$ is defined as

$$w_k = \begin{cases} 1.0, & \text{for manually annotated keyframes} \\ 0.5 + 0.5 \cdot \mathrm{confidence}_k, & \text{for interpolated frames} \end{cases} \quad (13)$$

where $\mathrm{confidence}_k \in [0, 1)$ denotes the confidence assigned to the interpolated reference position at step $k$.

To emphasize endpoint accuracy, we further introduce a terminal shaping term at the final prediction step:

$$r_{term,k} = \begin{cases} w_k\, r_{prox,max}\left(1 - \frac{d_k}{\tau}\right), & k = K - 1 \\ 0, & \text{otherwise} \end{cases} \quad (14)$$

where $K$ is the total number of prediction steps for the current sample. This term shares the same distance-dependent form as the proximity reward but is applied only at the final step, placing additional emphasis on the accuracy of the predicted endpoint. Overall, this reward design provides dense step-wise learning signals from sparsely annotated trajectories while down-weighting less reliable interpolated supervision.

3 Experiments

Datasets. We evaluate SutureAgent on a clinical dataset of robotic-assisted laparoscopic kidney wound suturing performed by expert surgeons.
The dataset comprises surgical videos from 50 patients and contains a total of 1,158 trajectories, where clinical experts annotate 9 keyframes within each trajectory. The dataset is partitioned at the patient level into training, validation, and test sets, consisting of 35, 8, and 7 patients (corresponding to 861, 151, and 146 trajectories), respectively.

Implementation Details. The model is trained for 100 epochs with batch size 8 on an NVIDIA RTX 5090 GPU. We use four Adam optimizers with cosine annealing decaying to 1% of the initial learning rates: $1 \times 10^{-4}$ for the observation encoder and $3 \times 10^{-4}$ for the actor, critic, and magnitude heads. CQL hyperparameters are $\alpha_{CQL} = 0.01$, $\gamma = 0.95$, with soft target updates $\tau = 0.005$. Policy and magnitude updates subsample up to 2,048 transitions per batch.

Evaluation Metrics. All metrics are computed in pixel space by rescaling normalized coordinates to the original 1264 × 902 resolution. We report three complementary metrics (lower is better for all):

Average Displacement Error (ADE) measures the mean point-wise accuracy over the entire predicted trajectory:

$$\mathrm{ADE} = \frac{1}{T_{pred}} \sum_{k=1}^{T_{pred}} \|\hat{p}_k - p_k\|_2 \quad (15)$$

Final Displacement Error (FDE) captures the positional accuracy at the trajectory endpoint, which is particularly relevant for suturing tasks where the needle must arrive at a precise target:

$$\mathrm{FDE} = \|\hat{p}_{T_{pred}} - p_{T_{pred}}\|_2 \quad (16)$$

Discrete Fréchet Distance (FD) evaluates global shape similarity between the predicted and ground-truth trajectories, defined recursively as:

$$\mathrm{dp}[i, j] = \max\left(\min\left(\mathrm{dp}[i-1, j],\, \mathrm{dp}[i-1, j-1],\, \mathrm{dp}[i, j-1]\right),\, \|\hat{p}_i - p_j\|_2\right), \qquad \mathrm{FD} = \mathrm{dp}[T_{pred}, T_{pred}] \quad (17)$$

All three metrics are averaged over the test set.
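The FD recursion in Eq. (17) is a straightforward dynamic program over all prediction/ground-truth index pairs; a sketch (the function name is illustrative, and boundary cells take the running maximum along the first row and column):

```python
import numpy as np

def discrete_frechet(pred, gt):
    """Discrete Frechet distance between two waypoint sequences.

    pred, gt: (T, 2) arrays of trajectory points; dp[i, j] couples the
    first i+1 predicted points with the first j+1 ground-truth points."""
    T = len(pred)
    dp = np.zeros((T, T))
    for i in range(T):
        for j in range(T):
            d = np.linalg.norm(pred[i] - gt[j])
            if i == 0 and j == 0:
                dp[i, j] = d
            elif i == 0:
                dp[i, j] = max(dp[i, j - 1], d)      # first row
            elif j == 0:
                dp[i, j] = max(dp[i - 1, j], d)      # first column
            else:
                dp[i, j] = max(min(dp[i - 1, j],
                                   dp[i - 1, j - 1],
                                   dp[i, j - 1]), d)  # Eq. (17)
    return dp[T - 1, T - 1]
```

Unlike ADE, which averages pointwise errors under a fixed index alignment, the min over the three predecessor cells lets the coupling "wait" on either curve, so a prediction shifted by a constant offset scores exactly that offset.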
4 Results

4.1 Comparison with Baselines

To evaluate the effectiveness of SutureAgent, we compare against the following baselines: (1) Behavioral Cloning (BC) [2], which encodes stacked frames via a U-Net [18] and regresses future coordinates with an MLP; (2) Generative Adversarial Imitation Learning (GAIL) [8], which generates trajectories conditioned on a random latent vector, with a discriminator distinguishing real from generated pairs; (3) Implicit Behavioral Cloning (IBC) [6], an energy-based model trained with an InfoNCE loss and sampled via Langevin MCMC at inference; (4) Implicit Diffusion Policy (iDiff-IL) [12], a joint image-trajectory diffusion method using a dual-head U-Net to denoise both spaces simultaneously; (5) Conditional Diffusion Policy (CondDiff) [12], a variant applying DDPM denoising in trajectory space only, conditioned on the image; (6) Motion Indeterminacy Diffusion (MID) [7], a latent-space diffusion method that encodes frames into a context vector and applies a Transformer-based DDPM denoiser for trajectory generation; and (7) VanillaCNN, a strawman baseline implementing only a simple 4-layer CNN that directly regresses trajectory points from stacked input frames.

Table 1. Quantitative comparison on the test set under two settings (Obs = 6, Pred = 3 and Obs = 3, Pred = 6). Specifically, Obs = 6 denotes an observed surgical trajectory consisting of 6 annotated keyframes, with the remaining labels generated by interpolation at a frame interval of 1, whereas Pred = 3 denotes the prediction horizon containing 3 future keyframes. All metrics are evaluated in pixel space (lower is better).

Obs = 6, Pred = 3
Method          Seed   ADE (↓)           FDE (↓)           FD (↓)
VanillaCNN      –      208.79 ± 93.65    221.39 ± 101.68   234.47 ± 100.29
BC [2]          –      128.15 ± 85.30    146.24 ± 100.87   156.83 ± 99.81
GAIL [8]        –      268.31 ± 128.76   281.42 ± 136.55   301.98 ± 133.52
IBC [6]         –      233.84 ± 132.22   250.76 ± 150.40   276.25 ± 141.66
iDiff-IL [12]   –      179.65 ± 104.41   197.80 ± 115.62   208.52 ± 114.79
CondDiff [12]   –      155.02 ± 90.48    176.03 ± 105.68   186.28 ± 104.52
MID [7]         –      148.64 ± 93.01    165.28 ± 107.93   185.39 ± 105.36
Ours            42     56.01 ± 32.81     85.51 ± 59.51     88.15 ± 57.21
Ours            123    55.54 ± 29.93     84.77 ± 51.15     85.67 ± 50.33
Ours            456    52.95 ± 29.73     80.22 ± 52.41     81.97 ± 50.98
Ours            789    55.11 ± 31.02     84.76 ± 54.17     85.82 ± 53.24
Ours            1024   58.54 ± 32.09     93.05 ± 54.67     93.68 ± 53.94

Obs = 3, Pred = 6
Method          Seed   ADE (↓)           FDE (↓)           FD (↓)
VanillaCNN      –      194.86 ± 84.86    227.73 ± 112.16   249.74 ± 101.67
BC [2]          –      135.65 ± 75.23    182.31 ± 109.62   193.86 ± 104.57
GAIL [8]        –      242.24 ± 117.38   268.84 ± 141.19   314.21 ± 126.19
IBC [6]         –      214.63 ± 105.23   250.88 ± 153.55   288.38 ± 136.58
iDiff-IL [12]   –      185.78 ± 95.33    228.86 ± 134.08   246.57 ± 123.23
CondDiff [12]   –      158.20 ± 90.33    203.38 ± 128.04   219.98 ± 119.80
MID [7]         –      154.65 ± 80.46    213.77 ± 128.01   225.18 ± 120.95
Ours            42     93.22 ± 45.54     156.38 ± 89.90    160.75 ± 86.71
Ours            123    97.57 ± 49.59     167.84 ± 103.11   170.45 ± 101.23
Ours            456    96.51 ± 46.44     165.49 ± 99.19    170.64 ± 96.10
Ours            789    98.40 ± 46.65     165.96 ± 97.21    170.64 ± 93.95
Ours            1024   97.75 ± 45.25     167.29 ± 87.88    169.68 ± 86.63

4.2 Result Analysis

Quantitative Results. Table 1 reports results under two experimental settings. SutureAgent consistently outperforms all baselines across all metrics and 5 random seeds.
Under Obs = 6, Pred = 3 (settings as defined in Table 1), SutureAgent achieves an ADE of 52.95 (a 58.6% reduction over BC at 128.15) and an FDE of 80.22 (vs. 146.24 for BC), indicating substantially better endpoint accuracy. Under the more challenging Obs = 3, Pred = 6 setting, all methods degrade as historical observational information decreases, yet ours remains the best, achieving ADE 93.22, FDE 156.38, and FD 160.75. SutureAgent maintains clear advantages in ADE (31.3% reduction over BC), FDE (14.2% reduction), and FD (17.1% reduction), demonstrating that the CQL-based policy produces more accurate trajectories with better shape consistency even under sparse observations. The goal-conditioned navigation formulation proves particularly beneficial when visual context is limited, as explicit target guidance compensates for reduced observational evidence.

Fig. 2. Qualitative comparison of predicted trajectories on the test set (success cases and corner cases from three patients). The yellow curve denotes the observed trajectory, the green curve represents the ground-truth future trajectory, the red curve shows the prediction from our SutureAgent, and the blue curve indicates the best baseline prediction.
Beyond achieving superior performance, our method also exhibits lower per-trajectory standard deviation across all metrics and random seeds compared to all baselines, demonstrating stronger robustness and consistency in handling diverse surgical trajectory patterns. Notably, the performance of these baselines differs substantially from prior studies, primarily for two reasons: first, we compute our evaluation metrics at the original image resolution rather than a downsampled one; and second, the surgical scenes in our kidney wound suturing dataset present considerably higher complexity.

Visualized Results. Fig. 2 illustrates representative predicted trajectories overlaid on surgical images, alongside the corresponding ground-truth paths and prediction results from the state-of-the-art baseline method. As observed across diverse suturing scenarios from multiple patients, the baseline method frequently struggles to capture the complex spatial dynamics of the surgical instruments, resulting in predicted trajectories that significantly deviate from the true paths. In contrast, our proposed approach consistently maintains high fidelity to the ground truth. The proposed model generates smooth trajectories that conform to the general curvature of the suturing path.

Fig. 3. Distribution of Average Displacement Error (ADE) across all methods on the test set. (a) Violin plot showing the ADE distribution for each method, with individual data points overlaid; black diamonds indicate the mean and white horizontal lines the median. (b) Empirical cumulative distribution function (CDF) of ADE.
The dashed vertical line marks the ADE = 100 pixel threshold, below which our method places 90% of trajectories.

Statistical Analysis. Fig. 3 presents the ADE distribution across all methods on the test set. SutureAgent achieves the lowest mean ADE of 55.99 pixels (median: 49.10), substantially outperforming the second-best method, BC (mean: 128.15, median: 104.61), by a margin of 56.3%. As shown in the CDF plot, 90% of our predictions fall below an ADE of 100 pixels, compared to only 45% for BC. Wilcoxon signed-rank tests confirm that our method significantly outperforms all baseline methods (p < 0.001). The compact distribution in the violin plot further demonstrates that SutureAgent maintains consistently low prediction errors across diverse surgical trajectories, with markedly lower variance (std: 32.71) than the other methods (std: 85.30–130.58).

Generalization Analysis. To evaluate the generalisability of the learned Q-function across trajectories of varying lengths, we visualise the per-step Q-value curves on four randomly selected test trajectories with prediction horizons T ∈ {short, medium-short, medium-long, long}, as shown in Fig. 4. Across all horizons, Q_policy(s_k, a_k^π) consistently meets or exceeds Q_expert(s_k, a_k^*), indicating that the CQL conservative penalty has successfully shaped the value function to prefer the policy's actions over suboptimal offline demonstrations. Both curves exhibit a monotonically decreasing trend as k increases, which is consistent with the discounted return structure of the reward function. Notably, this behaviour is preserved even for the longest trajectory (T = 48), demonstrating that the Q-network generalises stably across variable-length surgical sequences without degradation in value-estimation quality.
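The CDF reading in Fig. 3(b) amounts to the fraction of per-trajectory ADEs below a pixel threshold, tested for significance on paired per-trajectory errors. A minimal sketch with hypothetical values (the signed-rank test itself, e.g. via scipy.stats.wilcoxon, is omitted to keep the snippet dependency-light):

```python
import numpy as np


def fraction_below(errors, threshold=100.0):
    """Empirical CDF value at a pixel threshold (cf. Fig. 3b)."""
    return float(np.mean(np.asarray(errors) <= threshold))


# Hypothetical paired per-trajectory ADEs for illustration only;
# the paper applies a Wilcoxon signed-rank test to such pairs.
ours = np.array([48.0, 55.0, 60.0, 42.0, 70.0])
baseline = np.array([110.0, 95.0, 140.0, 88.0, 120.0])
diff = ours - baseline  # negative differences favour our method
```

Pairing the errors per trajectory (rather than comparing marginal distributions) is what makes the signed-rank test appropriate here, since both methods are evaluated on the same test trajectories.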
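The per-step quantity plotted for each trajectory is the elementwise pessimistic estimate min(Q1, Q2) over the two critics' outputs, standard in twin-critic offline RL. A toy sketch with hypothetical critic values (the critic networks themselves are omitted):

```python
import numpy as np


def pessimistic_q(q1_values, q2_values):
    """Per-step pessimistic value estimate min(Q1, Q2) along a trajectory,
    as plotted in Fig. 4. Inputs are the two critics' scalar outputs at
    each prediction step."""
    return np.minimum(np.asarray(q1_values), np.asarray(q2_values))


# Hypothetical critic outputs along a T = 5 step rollout.
q1 = np.array([8.1, 7.0, 5.9, 4.2, 3.0])
q2 = np.array([7.8, 7.2, 5.5, 4.4, 2.9])
q_pi = pessimistic_q(q1, q2)  # -> [7.8, 7.0, 5.5, 4.2, 2.9]
```

Taking the minimum of two independently trained critics counteracts overestimation bias; the monotone decrease along the rollout mirrors the discounted return structure discussed above.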
Fig. 4. Per-trajectory Q-value curves on four test trajectories of increasing prediction horizon (T = 11, 16, 22, and 46 steps), demonstrating generalisation across variable-length sequences. Q_policy(s_k, a_k^π) (solid blue) is the pessimistic value estimate min(Q1, Q2) of the policy's chosen action at step k; Q_expert(s_k, a_k^*) (dashed red) is the value of the corresponding ground-truth expert action. Orange vertical lines indicate keyframe positions. Q_policy ≥ Q_expert holds consistently across all horizons, confirming that the CQL conservative penalty successfully shapes the Q-function to rank the learned policy above suboptimal offline actions, regardless of trajectory length.

5 Conclusion

This paper introduces SutureAgent, a novel framework that reformulates surgical trajectory prediction in pixel space as goal-conditioned offline reinforcement learning using Conservative Q-Learning. By requiring only 9 keyframe annotations per trajectory and operating without robot kinematics, the framework is broadly applicable to existing clinical video archives. Experimental results on 1,158 trajectories demonstrate that SutureAgent significantly outperforms the best diffusion and imitation learning baselines, achieving up to a 58.6% reduction in ADE. Future work will extend SutureAgent to broader laparoscopic procedures and validate its performance through ex-vivo porcine experiments on robotic platforms, advancing its translation toward real-world cognitive surgical assistance.
References

1. Attanasio, A., Scaglioni, B., De Momi, E., Fiorini, P., Valdastri, P.: Autonomy in surgical robotics. Annual Review of Control, Robotics, and Autonomous Systems 4, 651–679 (2021). https://doi.org/10.1146/annurev-control-062420-090543
2. Bain, M., Sammut, C.: A framework for behavioural cloning. In: Machine Intelligence 15, Intelligent Agents [St. Catherine's College, Oxford, July 1995]. pp. 103–129. Oxford University, GBR (1999)
3. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L.D., Monfort, M., Muller, U., Zhang, J., et al.: End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 (2016)
4. Cai, P., Wang, H., Huang, H., Liu, Y., Liu, M.: Vision-based autonomous car racing using deep imitative reinforcement learning. IEEE Robotics and Automation Letters 6(4), 7262–7269 (2021)
5. De Boor, C.: A practical guide to splines, vol. 27. Springer, New York (1978)
6. Florence, P., Lynch, C., Zeng, A., Ramirez, O.A., Wahid, A., Downs, L., Wong, A., Lee, J., Mordatch, I., Tompson, J.: Implicit behavioral cloning. In: Faust, A., Hsu, D., Neumann, G. (eds.) Proceedings of the 5th Conference on Robot Learning. Proceedings of Machine Learning Research, vol. 164, pp. 158–168. PMLR (2022), https://proceedings.mlr.press/v164/florence22a.html
7. Gu, T., Chen, G., Li, J., Lin, C., Rao, Y., Zhou, J., Lu, J.: Stochastic trajectory prediction via motion indeterminacy diffusion. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 17092–17101 (2022). https://doi.org/10.1109/CVPR52688.2022.01660
8. Ho, J., Ermon, S.: Generative adversarial imitation learning. In: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). pp. 4572–4580. Curran Associates Inc., Red Hook, NY, USA (2016)
9. Ji, G., Gao, Q., Zhang, T., Cao, L., Sun, Z.: A heuristically accelerated reinforcement learning-based neurosurgical path planner. Cyborg and Bionic Systems 4, 0026 (2023)
10. Jin, Y., Long, Y., Gao, X., Stoyanov, D., Dou, Q., Heng, P.A.: Trans-SVNet: hybrid embedding aggregation transformer for surgical workflow analysis. International Journal of Computer Assisted Radiology and Surgery 17(12), 2193–2202 (2022)
11. Kumar, A., Zhou, A., Tucker, G., Levine, S.: Conservative Q-learning for offline reinforcement learning. In: Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS 2020). Curran Associates Inc., Red Hook, NY, USA (2020)
12. Li, J., Jin, Y., Chen, Y., Yip, H.C., Scheppach, M., Chiu, P.W.Y., Yam, Y., Meng, H.M.L., Dou, Q.: Imitation learning from expert video data for dissection trajectory prediction in endoscopic surgical procedure. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2023: 26th International Conference, Vancouver, BC, Canada, October 8–12, 2023, Proceedings, Part IX. pp. 494–504. Springer-Verlag, Berlin, Heidelberg (2023). https://doi.org/10.1007/978-3-031-43996-4_47
13. Lin, H., Li, B., Wong, C.W., Rojas, J., Chu, X., Au, K.W.S.: World models for general surgical grasping. arXiv preprint arXiv:2405.17940 (2024)
14. Maier-Hein, L., Vedula, S.S., Speidel, S., Navab, N., Kikinis, R., Park, A., Eisenmann, M., Feussner, H., Forestier, G., Giannarou, S., et al.: Surgical data science for next-generation interventions. Nature Biomedical Engineering 1(9), 691–696 (2017)
15. Nwoye, C.I., Yu, T., Gonzalez, C., Seeliger, B., Mascagni, P., Mutter, D., Marescaux, J., Padoy, N.: Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos. Medical Image Analysis 78, 102433 (2022)
16. Pomerleau, D.A.: ALVINN: an autonomous land vehicle in a neural network. In: Proceedings of the 2nd International Conference on Neural Information Processing Systems (NIPS'88). pp. 305–313. MIT Press, Cambridge, MA, USA (1988)
17. Qin, Y., Feyzabadi, S., Allan, M., Burdick, J.W., Azizian, M.: daVinciNet: Joint prediction of motion and surgical state in robot-assisted surgery. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 2921–2928 (2020). https://doi.org/10.1109/IROS45743.2020.9340723
18. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI. pp. 234–241. Springer (2015)
19. Shi, C., Zheng, Y., Fey, A.M.: Recognition and prediction of surgical gestures and trajectories using transformer models in robot-assisted surgery. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 8017–8024 (2022). https://doi.org/10.1109/IROS47612.2022.9981611
20. Simon, H.A., et al.: Theories of bounded rationality. Decision and Organization 1(1), 161–176 (1972)
21. Tamar, A., Wu, Y., Thomas, G., Levine, S., Abbeel, P.: Value iteration networks. Advances in Neural Information Processing Systems 29 (2016)
22. Wang, B., Liu, Z., Li, Q., Prorok, A.: Mobile robot path planning in dynamic environments through globally guided reinforcement learning. IEEE Robotics and Automation Letters 5(4), 6932–6939 (2020). https://doi.org/10.1109/LRA.2020.3026638
23. Weerasinghe, K., Reza Roodabeh, S.H., Hutchinson, K., Alemzadeh, H.: Multimodal transformers for real-time surgical activity prediction. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 13323–13330 (2024). https://doi.org/10.1109/ICRA57147.2024.10611048
24. Xu, W., Tan, Z., Cao, Z., Ma, H., Wang, G., Wang, H., Wang, W., Du, Z.: DP4AuSu: Autonomous surgical framework for suturing manipulation using diffusion policy with dynamic time warping-based locally weighted regression. The International Journal of Medical Robotics and Computer Assisted Surgery 21(3), e70072 (2025)
25. Yang, G.Z., Cambias, J., Cleary, K., Daimler, E., Drake, J., Dupont, P.E., Hata, N., Kazanzides, P., Martel, S., Patel, R.V., et al.: Medical robotics—regulatory, ethical, and legal considerations for increasing levels of autonomy (2017)
26. Zhao, Z., Fang, F., Yang, X., Xu, Q., Guan, C., Zhou, S.K.: See, Predict, Plan: Diffusion for procedure planning in robotic surgical videos. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. LNCS, vol. 15006. Springer Nature Switzerland (2024)