Long-Short Term Agents for Pure-Vision Bronchoscopy Robotic Autonomy
Accurate intraoperative navigation is essential for robot-assisted endoluminal intervention, but remains difficult because of the limited endoscopic field of view and dynamic artifacts. Existing navigation platforms often rely on external localization technologies, such as electromagnetic tracking or shape sensing, which increase hardware complexity and remain vulnerable to intraoperative anatomical mismatch. We present a vision-only autonomy framework that performs long-horizon bronchoscopic navigation using preoperative CT-derived virtual targets and live endoscopic video, without external tracking during navigation. The framework uses hierarchical long-short agents: a short-term reactive agent for continuous low-latency motion control, and a long-term strategic agent for decision support at anatomically ambiguous points. When their recommendations conflict, a world-model critic predicts future visual states for candidate actions and selects the action whose predicted state best matches the target view. We evaluated the system in a high-fidelity airway phantom, three ex vivo porcine lungs, and a live porcine model. The system reached all planned segmental targets in the phantom, maintained 80% success to the eighth generation ex vivo, and achieved in vivo navigation performance comparable to that of an expert bronchoscopist. These results support the preclinical feasibility of sensor-free autonomous bronchoscopic navigation.
💡 Research Summary
The paper addresses a critical challenge in robot‑assisted bronchoscopy: achieving accurate, long‑horizon navigation without relying on external localization hardware such as electromagnetic trackers or shape‑sensing fibers. The authors propose a fully vision‑only autonomy framework that combines a hierarchical pair of agents—a short‑term reactive controller and a long‑term strategic supervisor—together with a world‑model critic that resolves conflicts by predicting future visual states.
System Overview
Pre‑operative CT scans are automatically segmented to extract the airway tree and any target lesion. From this anatomy a sequence of virtual bronchoscopic views (sub‑targets) is rendered, forming a visual trajectory that the robot must follow. During the procedure the robot receives live endoscopic video. The short‑term reactive agent processes the current frame and the active virtual target using an EfficientNet‑B0 backbone and a decoder‑only transformer, outputting low‑latency discrete control commands (forward/backward translation, four bending directions, and a "target‑reached" flag). This agent continuously drives the bronchoscope toward visual alignment with the virtual target, providing fast response to dynamic artifacts such as fluid occlusion or motion blur.
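The reactive agent's output can be pictured as a small discrete action vocabulary plus a stop signal. The sketch below is illustrative only: the action names, `reached_prob`, and the 0.5 threshold are assumptions, and in the actual system the logits come from the transformer policy, not from a hand-written function.

```python
# Hypothetical action vocabulary: translation plus four bending directions.
ACTIONS = ["forward", "backward", "bend_up", "bend_down", "bend_left", "bend_right"]

def reactive_step(logits, reached_prob, threshold=0.5):
    """One low-latency control step: stop if the 'target-reached' flag fires,
    otherwise execute the highest-scoring discrete action."""
    if reached_prob >= threshold:
        return "target_reached"
    best = max(range(len(logits)), key=lambda i: logits[i])
    return ACTIONS[best]
```

A loop calling `reactive_step` on every frame captures the agent's role: it reacts per frame and never plans ahead, which is why the strategic layer is needed at bifurcations.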
The long‑term strategic agent is invoked only at anatomically ambiguous points (e.g., bifurcations). It fuses two guidance sources: (1) Pre‑operative Guidance, which maps the CT‑derived centerlines to a series of discrete actions and selects the majority vote over the last ten frames; and (2) LLM Guidance, a large multimodal language model that receives the virtual target image, a directional arrow, and a textual prompt, then proposes a short high‑level action sequence. When both agents are active, a consensus mechanism checks whether the strategic action lies within the top‑K logits of the reactive policy; if so, the action is executed immediately.
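The two mechanisms above (majority vote over recent frames, and the top-K consensus check) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the window size, K value, and function names are hypothetical, not taken from the paper.

```python
from collections import Counter

def majority_vote(recent_actions):
    """Pre-operative guidance: most frequent discrete action over the
    last few frames (the paper uses the last ten)."""
    return Counter(recent_actions).most_common(1)[0][0]

def in_consensus(strategic_action, reactive_logits, actions, k=3):
    """Consensus check: accept the strategic action only if it ranks
    within the reactive policy's top-K logits."""
    ranked = sorted(range(len(reactive_logits)), key=lambda i: -reactive_logits[i])
    return strategic_action in {actions[i] for i in ranked[:k]}
```

If `in_consensus` returns `True`, the strategic action is executed immediately; otherwise a conflict is declared and passed to the world-model critic.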
Conflict Resolution via World‑Model Critic
If the strategic suggestion falls outside the reactive top‑K, a conflict is declared. For conflicts originating from the pre‑operative guidance the system simply discards the suggestion. For LLM‑derived conflicts, a learned world‑model predicts a short rollout of future endoscopic frames for each candidate action. The predicted frames are compared to the intended virtual target using LPIPS (Learned Perceptual Image Patch Similarity). The action minimizing perceptual distance is selected, ensuring that the chosen motion will most closely bring the visual appearance to the target view.
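The selection rule reduces to an argmin over candidate actions. The sketch below substitutes a mean absolute pixel difference as a stand-in for LPIPS (which is a learned metric and needs a trained network), and treats the world model as an opaque rollout function; both substitutions are assumptions for illustration.

```python
def perceptual_distance(frame_a, frame_b):
    """Stand-in for LPIPS: mean absolute difference over flattened frames.
    The paper uses the learned LPIPS metric, not raw pixel distance."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def resolve_conflict(candidate_actions, rollout, target_view):
    """World-model critic: rollout(action) returns the predicted endoscopic
    frame; keep the action whose prediction is closest to the target view."""
    return min(candidate_actions,
               key=lambda a: perceptual_distance(rollout(a), target_view))
```

In the actual system `rollout` would be the learned world model predicting a short sequence of future frames for each candidate action.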
Learning
The policies are trained by imitation learning on expert demonstration data collected from a skilled bronchoscopist. Cross‑entropy loss drives the transformer to reproduce the expert’s action distribution. Data augmentation introduces realistic visual disturbances to improve robustness.
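The imitation objective is standard: the cross-entropy between the policy's action distribution and the expert's chosen action. A minimal numerically stable sketch (log-sum-exp form), assuming a single discrete expert label per frame:

```python
import math

def cross_entropy_loss(expert_idx, logits):
    """Negative log-likelihood of the expert's action under the policy's
    softmax, computed via the log-sum-exp trick for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[expert_idx]
```

Averaged over a batch of expert demonstrations, minimizing this loss pushes the transformer's logits toward reproducing the bronchoscopist's action choices.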
Experimental Validation
Three experimental settings were used:
- High‑fidelity airway phantom – 17 anatomical lung segments were defined. The system navigated to every segment, matching expert performance and outperforming two prior autonomous baselines (GNM and VINT); beyond the eighth airway generation, only the proposed method and the expert reached the targets.
- Ex‑vivo porcine lungs – three different lungs were tested under static conditions. The method achieved an 80% success rate up to the eighth bronchial generation, whereas the baseline methods failed earlier.
- In‑vivo live porcine model – with active respiration, the system's success rate and time‑to‑target were comparable to those of an expert bronchoscopist, demonstrating feasibility under realistic physiological motion.
Quantitative analyses revealed a trade‑off between navigation speed and control efficiency, but overall the hierarchical approach maintained low latency while providing high‑level decision support.
Key Contributions and Insights
- Demonstrates that pure visual feedback, combined with pre‑operative CT‑derived targets, is sufficient for long‑horizon bronchoscopic navigation.
- Introduces a novel hierarchical agent architecture that separates fast reactive control from strategic reasoning, enabling both responsiveness and contextual awareness.
- Leverages a large language model for high‑level semantic guidance in ambiguous visual contexts, a first in bronchoscopic autonomy.
- Employs a world‑model critic that predicts future endoscopic views and uses perceptual similarity to resolve action conflicts, effectively bridging the gap between low‑level control and high‑level planning.
- Validates the approach across increasing levels of realism, culminating in live animal experiments that match expert performance.
Limitations and Future Work
The world‑model’s predictive accuracy limits conflict resolution; errors could propagate to unsafe motions. The LLM’s reasoning is constrained by the quality of prompts and may produce inconsistent suggestions. Real clinical deployment will need to handle additional complexities such as bleeding, mucus accumulation, patient‑specific airway deformation, and regulatory safety requirements. Future directions include improving the world‑model with longer rollouts, integrating multimodal sensors (e.g., optical flow, ultrasound), and conducting human clinical trials to assess safety, efficacy, and workflow integration.
In summary, this work provides a compelling proof‑of‑concept that sensor‑free, vision‑only autonomous bronchoscopy is achievable using a carefully designed hierarchy of learning‑based agents and a perceptual world‑model, opening a path toward more streamlined, cost‑effective robotic airway interventions.