OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis
Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision-language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision-language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a dataset of 5,000 panoramic images with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with a rubric-based reward and a conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions with region-level annotations across multiple difficulty levels. OralGPT-Plus delivers consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis.
💡 Research Summary
OralGPT‑Plus is an agentic vision‑language model designed for the analysis of panoramic dental radiographs (OPG). The authors observe that clinical dentists repeatedly inspect a global view, zoom into suspicious regions, and compare those regions with their contralateral counterparts due to the strong bilateral symmetry of the oral cavity. Existing approaches—object detectors that output only bounding boxes or single‑pass VLMs that generate a static description—cannot emulate this iterative, symmetry‑aware workflow and therefore miss subtle lesions.
To bridge this gap, the paper introduces three major contributions. First, a new tool set: "Zoom‑In" for high‑resolution local inspection and "Mirror‑In" for automatically extracting the horizontally mirrored region across the midline. Mirror‑In is defined as reflecting the selected region across the facial midline and then horizontally flipping the resulting crop, giving the model an aligned contralateral view for direct bilateral comparison.
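The two tools can be sketched with plain array operations. This is a minimal illustration, not the paper's implementation: it assumes pixel-coordinate boxes `(x1, y1, x2, y2)` and a midline at half the image width; the function names mirror the tool names for readability.

```python
import numpy as np

def zoom_in(image: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    """Crop a region (x1, y1, x2, y2) for high-resolution local inspection."""
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

def mirror_in(image: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    """Extract the contralateral region across the vertical midline and
    horizontally flip it so it aligns with the original crop."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    # Reflect the box across the assumed midline x = w / 2.
    mx1, mx2 = w - x2, w - x1
    crop = image[y1:y2, mx1:mx2]
    return crop[:, ::-1]  # horizontal flip for direct side-by-side comparison
```

On a perfectly left-right symmetric image, `mirror_in` returns exactly the same pixels as `zoom_in` on the original box, so any difference between the two views is a candidate asymmetric finding.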
Second, the DentalProbe dataset, comprising 5,000 panoramic images drawn from four public sources plus 2,562 newly collected scans. Expert dentists authored more than 8,000 diagnostic trajectories that encode a step‑by‑step reasoning process: global scan → suspicion proposal → zoom → optional mirror comparison → final diagnosis. The trajectories are generated by a rule‑based pipeline, refined by a multi‑agent tool‑decision module, enriched with region‑level visual captions, and rewritten by several language models to ensure linguistic diversity. A sampling‑based evaluation by dentists confirms high annotation quality.
Third, a reinforcement‑learning (RL) framework called Reinspection‑Driven RL. Standard PPO with binary rewards suffers from sparsity because panoramic images often contain multiple, spatially distributed lesions. The authors therefore design a continuous rubric‑based reward that scores diagnostic completeness, accuracy, and explanation quality, and a conditioned diagnostic‑driven reward that only encourages further inspection when the rubric confidence exceeds a threshold. These signals are merged into a hybrid reward that stabilizes long‑horizon policy optimization while penalizing unnecessary tool usage.
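The reward structure described above can be sketched as follows. The specific weights, the confidence threshold, and the linear combination are illustrative assumptions; the paper's summary only specifies the three ingredients (continuous rubric scores, a conditioned inspection bonus, and a tool-use penalty).

```python
def hybrid_reward(rubric_scores: dict[str, float],
                  found_new_lesion: bool,
                  num_tool_calls: int,
                  confidence_threshold: float = 0.6,
                  inspect_bonus: float = 0.5,
                  tool_penalty: float = 0.05) -> float:
    """Illustrative sketch of the hybrid reward; all constants are assumptions.

    rubric_scores: per-criterion scores in [0, 1] for diagnostic
    completeness, accuracy, and explanation quality.
    """
    # Continuous rubric-based reward: mean of the criterion scores,
    # avoiding the sparsity of a binary correct/incorrect signal.
    rubric = sum(rubric_scores.values()) / len(rubric_scores)
    # Conditioned diagnostic-driven reward: encourage further inspection
    # only when rubric confidence already exceeds the threshold.
    diagnostic = inspect_bonus if (rubric > confidence_threshold
                                   and found_new_lesion) else 0.0
    # Penalize unnecessary tool usage to keep trajectories efficient.
    return rubric + diagnostic - tool_penalty * num_tool_calls
```

The conditioning term is the key design choice: it prevents the policy from farming inspection bonuses on low-quality diagnoses, which is what would destabilize long-horizon optimization under a naive additive reward.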
Training proceeds in two stages. In the instruction‑tuning stage, the language model is fully fine‑tuned on the expert trajectories while the vision encoder and projector remain frozen, ensuring that visual features stay consistent. In the RL stage, the policy learns to select actions (Zoom‑In, Mirror‑In, or Finalize) over up to ten interaction steps per image, using the hybrid reward to balance exploration and clinical relevance.
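The RL-stage interaction loop might look like the sketch below. The `policy` and `tools` interfaces are hypothetical stand-ins for the model's action head and the Zoom‑In/Mirror‑In operations; only the action set and the ten-step budget come from the summary.

```python
MAX_STEPS = 10  # up to ten interaction steps per image, per the summary

def run_episode(policy, tools, image, question):
    """Illustrative agentic loop. `policy(obs)` returns an (action, box)
    pair with action in {"Zoom-In", "Mirror-In", "Finalize"}; `tools`
    maps the two inspection actions to crop-extraction functions.
    Both interfaces are assumptions for this sketch."""
    actions = []
    obs = {"image": image, "question": question, "crops": []}
    for _ in range(MAX_STEPS):
        action, box = policy(obs)
        actions.append(action)
        if action == "Finalize":
            break
        # Append the new view so the next policy call sees all evidence.
        obs["crops"].append(tools[action](image, box))
    return actions, obs["crops"]
```

Because every non-terminal step adds a crop to the observation, the policy conditions on all evidence gathered so far, which is what makes the hybrid reward's tool-use penalty meaningful: each extra step must earn its cost.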
For evaluation, the authors construct MMOral‑X, a benchmark of 300 open‑ended questions with region‑level ground truth and three difficulty tiers (easy, medium, hard). OralGPT‑Plus outperforms strong baselines—including YOLO‑based detectors and single‑pass VLMs such as LLaVA—by 5–12 percentage points in overall accuracy and explanation metrics. The gains are especially pronounced for subtle findings like tiny carious lesions and apical periodontitis. Moreover, the reinspection policy reduces redundant zoom actions by over 30%, improving inference efficiency.
In summary, OralGPT‑Plus demonstrates that embedding clinically realistic tool use and symmetry‑aware reasoning into a VLM, supported by high‑quality trajectory data and a carefully crafted RL reward, yields more reliable and interpretable panoramic dental diagnoses. The approach is likely transferable to other symmetric medical imaging domains, suggesting a broader impact for agentic, tool‑augmented vision‑language systems in healthcare.