IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans


Authors: Huimin Xiong, Zijie Meng, Tianxiang Hu, Chenyi Zhou, Yang Feng, Zuozhu Liu

Huimin Xiong¹,², Zijie Meng¹,², Tianxiang Hu¹,², Chenyi Zhou¹, Yang Feng³, and Zuozhu Liu¹,²

¹ ZJU-UIUC Institute, Zhejiang University, Haining, 314400, China (zuozhuliu@intl.zju.edu.cn)
² Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Hangzhou, 310058, China
³ Angelalign Research Institute, Angel Align Inc., Shanghai, 200011, China

Abstract. 3D intraoral scans (IOS) are increasingly adopted in routine dentistry due to abundant geometric evidence, and unified multi-disease diagnosis is desirable for clinical documentation and communication. While recent works introduce dental vision-language models (VLMs) to enable unified diagnosis and report generation on 2D images or multi-view images rendered from IOS, they do not fully leverage native 3D geometry. Such work is necessary and also challenging, due to: (i) heterogeneous scan forms and the complex IOS topology, (ii) multi-disease co-occurrence with class imbalance and fine-grained morphological ambiguity, (iii) limited paired 3D IOS-text data. Thus, we present IOSVLM, an end-to-end 3D VLM that represents scans as point clouds and follows a 3D encoder-projector-LLM design for unified diagnosis and generative visual question answering (VQA), together with IOSVQA, a large-scale multi-source IOS diagnosis VQA dataset comprising 19,002 cases and 249,055 VQA pairs over 23 oral diseases and heterogeneous scan types. To address the distribution gap between color-free IOS data and color-dependent 3D pretraining, we propose a geometry-to-chromatic proxy that stabilizes fine-grained geometric perception and cross-modal alignment. A two-stage curriculum training strategy further enhances robustness.
IOSVLM consistently outperforms strong baselines, achieving gains of at least +9.58% macro accuracy and +1.46% macro F1, indicating the effectiveness of direct 3D geometry modeling for IOS-based diagnosis.

Keywords: IOS · Unified diagnosis · VLM

1 Introduction

Intraoral scanning (IOS) is rapidly becoming routine in clinical dentistry [5, 12]. Its high-fidelity 3D surface geometry preserves fine-grained tooth-gingiva morphology beyond conventional 2D imaging, enabling more consistent and auditable assessment of subtle abnormalities [10]. Clinically, a single scan often contains multiple co-existing diseases, requiring integrated multi-disease diagnosis and natural-language reporting for documentation and dentist-patient communication, beyond segmentation or single-lesion detection [19].

Recent dentistry work has explored vision-language models (VLMs) for unified diagnosis. DentVLM [13], DentalGPT [3], and OralGPT [25] support visual question answering (VQA) and generative reporting on diverse 2D dental images, providing a general interface for fusing evidence into natural-language conclusions. However, they do not directly model 3D surface geometry, where many abnormalities present as fine-grained morphological changes. OralGPT-Omni [8] and ArchMap [24] instead render 3D scans into multi-view images and apply 2D VLMs, but this relies on view selection and can weaken spatial relationships and geometric cues. Hence, a key gap remains: end-to-end multi-disease diagnosis with language generation from native 3D IOS inputs.

However, direct 3D IOS-based unified diagnosis faces challenges. First, inputs are highly heterogeneous: single-arch and occluded-arches scans acquired in practice differ substantially in coverage and occlusion/contact visibility, and IOSs exhibit complex morphology and topology, complicating representation learning [6, 10, 22].
Second, multiple diseases often co-exist in one scan. Class imbalance and subtle geometric differences exacerbate inter-disease confusion, demanding unified reasoning from local morphology to global semantics [13]. Third, paired 3D IOS-text data and high-quality annotations are rare [11, 15]; existing public resources are insufficient for systematic 3D VLM training and evaluation.

To this end, we propose IOSVLM, a 3D VLM that directly consumes native 3D IOS geometry for unified multi-disease diagnosis and generative VQA, together with IOSVQA, a paired training and benchmark suite that covers 23 oral diseases and two common clinical settings (single-arch and occluded-arches scans), totalling 19,002 IOS cases and 249,055 VQA pairs. To our knowledge, it is among the most comprehensive IOS diagnostic VQA resources to date, explicitly capturing the real-world setting of co-existing diseases and heterogeneous inputs.

IOSVLM adopts a 3D encoder-projector-LLM framework in which the 3D encoder captures multi-scale geometric features, the projectors map them into the LLM token space, and the LLM generates communicable diagnostic outputs. Importantly, IOS often lacks reliable color, while common 3D pretraining implicitly assumes colored point clouds; naively dropping or padding color causes distribution shift and weakens cross-modal alignment. We thus introduce a geometry-to-chromatic proxy to compensate for this, improving fine-grained morphology encoding and language alignment. IOSVLM is trained with a curriculum-style schedule: pre-training on larger but noisy supervision to build 3D perception and geometry-language alignment, and fine-tuning on higher-quality supervision to enhance reliability and interpretability under realistic label noise and limited explanations.
Experiments show that IOSVLM substantially outperforms strong state-of-the-art baselines, improving macro accuracy by at least 9.58% and macro F1 by at least 1.46%. Our main contributions are:

1. We construct the first large-scale multi-source IOS diagnostic VQA dataset that supports multiple scan-type inputs for multi-disease diagnosis.
2. We introduce the first end-to-end VLM that takes native 3D IOS geometry as input for unified diagnosis, achieving clear performance advantages.
3. We propose the geometry-to-chromatic proxy to bridge the distribution gap between color-free IOSs and color-dependent point-cloud pretraining, improving fine-grained geometry encoding and cross-modal alignment.

Table 1. Statistics of IOSVQA and its source datasets. S: single-arch. O: occluded-arch. SD: single-arch disease. OD: occluded-arch disease. # denotes the case number.

| Dataset | Privacy | Source | IOS types | #SD | #OD | #cases | #VQA |
|---|---|---|---|---|---|---|---|
| MaloccIOS | Private | China | S+O | 2 | 11 | 14,630 | 208,370 |
| DiseaseIOS | Private | China | S | 8 | — | 4,172 | 33,376 |
| Bits2Bites | Public | Italy | O | — | 5 | 200 | 7,309 |
| IOSVQA | — | Mixed | S+O | 8 | 15 | 19,002 | 249,055 |

Fig. 1. Disease statistics of occluded-arches IOSs (a) and single-arch IOSs (b) in the IOSVQA dataset, and visualization of the normal-based geometry-to-chromatic proxy (GCP) (c).

2 Method

2.1 Problem Formulation

IOSVLM targets two common IOS inputs: single-arch scans and occluded-arches scans (see Fig. 3). Let D = {1, ..., C} denote the set of oral diseases, where C = 23. Each disease d ∈ D corresponds to a multi-class label set Y_d. Given an input pair (M̃, q_d) with IOS mesh M̃ and question q_d, the diagnosis is formulated as a generative VQA task that produces an answer A_d ∈ {y, (y, r)} with y ∈ Y_d, i.e., a label y or a label-rationale pair (y, r). Since multiple diseases may co-exist within a single scan, each scan is paired with multiple disease questions.
Training and evaluation are conducted over scan-disease pairs as disease-level instances.

Fig. 2. Overview of our framework and dataset construction: (a) dataset construction, (b) model architecture and two-stage training, (c) performance comparison.

2.2 IOSVQA Dataset Construction

We first construct a large-scale VQA dataset, IOSVQA. It comprises 19,002 IOS cases and 249,055 QA pairs spanning 23 disease categories, with disease statistics detailed in Fig. 1. Data were aggregated from three sources: MaloccIOS, DiseaseIOS, and Bits2Bites [2] (Table 1). To make geometric representations comparable across sources, IOSs were globally registered to standardize orientation and relative maxillomandibular pose.
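Following the formulation in Sec. 2.1, each scan-disease pair yields one generative VQA sample whose target is either a label y or a label-rationale pair (y, r). A minimal sketch of this pairing (template wording and field names are illustrative, not taken from the released dataset):

```python
# Hypothetical sketch: one scan-disease pair becomes one generative VQA
# sample. The question comes from a predefined template list; the answer is
# the label alone or a label followed by a rationale.
TEMPLATES = {
    "dental_spacing": "Does the patient have a dental spacing malocclusion issue?",
}

def build_sample(scan_id, disease, label, rationale=None):
    # Label-only samples supervise y; rationale samples supervise (y, r).
    answer = label if rationale is None else f"{label}. {rationale}"
    return {"scan": scan_id, "question": TEMPLATES[disease], "answer": answer}

s = build_sample("case_0001", "dental_spacing", "No")
```

Since a scan may carry several co-existing diseases, this builder would be called once per disease question for the same scan.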
Each case provides one IOS paired with a disease label to form QA samples; questions were sampled randomly from a predefined list, and answers correspond predominantly to disease labels.

Disease Label Consolidation. For MaloccIOS, 2D imaging diagnoses were extracted from clinical reports. A randomly selected subset of 557 cases was manually corrected by 28 orthodontists, yielding 7,628 high-quality samples, while the remaining 200,742 samples contain partial label noise. Senior dentists defined rule-based mappings to convert 2D diagnoses into IOS labels. Labels for DiseaseIOS and Bits2Bites were annotated by 5 and 1 orthodontic experts, respectively, and are treated as high quality.

Data Split and Rationale Strategy. IOSVQA is divided into Stage-1 training, Stage-2 training, and testing, with each stage covering all three data sources to maintain disease diversity. The 7,628 high-quality MaloccIOS samples are fully assigned to Stage-2. For interpretability, approximately 50% of Stage-2 samples are augmented with GPT-4o-generated chain-of-thought (CoT) rationales. Motivated by prior evidence that limited CoT supervision can induce reasoning behavior [23], we apply rationale annotation to a subset of high-quality data to balance cost, efficiency, and reasoning performance. To mitigate scale imbalance, the smaller Bits2Bites dataset is augmented by pairing each case with 9 and 3 questions for Stage-1 and Stage-2 construction, respectively.

2.3 Model Architecture

We adopt a multi-modal VLM architecture that combines a large-scale pre-trained 3D encoder with multi-branch projectors and an LLM (Fig. 2). The 3D encoder injects strong geometric priors, while the structured projectors map multi-granularity visual features into semantic tokens aligned with the LLM.
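A shape-level sketch of this multi-branch token construction, with toy dimensions and a pad/truncate stand-in for the MLP projectors (all numbers are assumptions, not the paper's implementation):

```python
# Toy sketch: each feature branch is projected to the LLM token dimension
# and prefixed with learnable prompt tokens, then all branches are
# concatenated into the fused point tokens F_p.
def project(features, dim_llm):
    """Stand-in for an MLP projector: pad/truncate each feature vector to
    dim_llm, purely to make the shapes concrete."""
    return [(f + [0.0] * dim_llm)[:dim_llm] for f in features]

def fuse_branches(branches, prompts):
    """Concatenate [prompt; projected features] per branch, then across branches."""
    tokens = []
    for feats, prompt in zip(branches, prompts):
        tokens.extend(prompt)
        tokens.extend(feats)
    return tokens

dim = 8                                    # toy LLM token dimension (assumed)
F_ape    = project([[1.0, 2.0]] * 5, dim)  # 5 position-embedding tokens
F_local  = project([[3.0]] * 3, dim)       # 3 local-feature tokens
F_global = project([[4.0]] * 1, dim)       # 1 global-descriptor token
prompts  = [[[0.0] * dim] * 2 for _ in range(3)]  # 2 prompt tokens per branch

F_p = fuse_branches([F_ape, F_local, F_global], prompts)
# 3 branches x 2 prompts + (5 + 3 + 1) feature tokens = 15 tokens of width 8
```

In the actual model the prompts are learned parameters and the projectors are trained MLPs; this sketch only illustrates how the token sequence is assembled.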
The LLM then performs reasoning to produce diagnostic outputs, enabling unified perception and explainable diagnosis across heterogeneous IOS inputs.

To map the raw IOS mesh M̃ into the semantic space of the LLM, we first convert it to a point cloud P̃ by taking the gravity-center point of each face. P̃ is then randomly down-sampled to obtain P with N points. A pre-trained 3D point-cloud encoder, ReCon++ [14], extracts complementary absolute position embeddings F_ape, local geometric features F_local, and a global descriptor F_global. They are mapped into the LLM token space via three dedicated MLP projectors φ_ape, φ_local, φ_global, and each projected feature is concatenated with a corresponding learnable visual prompt V_ape, V_local, V_global to balance their contributions:

F_p = [ V_ape; φ_ape(F_ape); V_local; φ_local(F_local); V_global; φ_global(F_global) ].   (1)

The fused point tokens F_p are concatenated with text tokens and jointly fed into the LLM, enabling unified language-visual representations.

Geometry-to-Chromatic Proxy. Point-cloud models are often pretrained with xyz+RGB inputs. In practical IOS pipelines, downstream storage and exchange typically preserve geometry while discarding or de-emphasizing color/texture, making appearance cues unavailable. Removing RGB thus introduces a pretraining mismatch and discards channels the encoder has learned to exploit. Moreover, RGB is often beneficial not due to semantic color, but as a local separability cue that supports boundary discovery and stable local matching. We therefore propose the Geometry-to-Chromatic Proxy (GCP): a geometry-derived proxy that mimics this separability component, enabling better reuse of color-pretraining priors while remaining purely geometry-driven.

We instantiate GCP using surface normals, which encode local surface orientation and capture curvature changes.
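As a concrete sketch of this normal-based instantiation (our illustration, not the authors' code), the pseudo-color is obtained by unit-normalizing each surface normal and taking absolute values:

```python
import math

# Sketch of the geometry-to-chromatic proxy for one surface normal:
# unit-normalize, then take absolute values to remove sign ambiguity,
# giving a flip-robust pseudo-RGB triple in [0, 1]^3.
def gcp(normal):
    norm = math.sqrt(sum(c * c for c in normal))
    return tuple(abs(c / norm) for c in normal)

color = gcp((0.0, 0.0, -2.0))  # a downward-facing normal
# Flip-robustness: a flipped normal maps to the same pseudo-color.
assert gcp((1.0, -1.0, 0.0)) == gcp((-1.0, 1.0, 0.0))
```

The resulting triple can be fed to the encoder in place of the RGB channels it was pretrained with.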
For each point i with normal n_i, we define a flip-robust mapping GCP(n_i) = | n_i / ||n_i||_2 | ∈ R^3. The ℓ2 normalization standardizes magnitude, and the absolute value removes sign ambiguity, avoiding spurious color inversions. Mapping GCP(n_i) onto the IOS as pseudo-colors (Fig. 1(c)) yields spatially coherent patterns with clear transitions at anatomical boundaries, supporting that GCP provides structured, locally discriminative cues analogous to RGB. This formulation is extensible: other geometry descriptors (e.g., curvature) may serve as alternative proxies and yield gains. Overall, GCP provides a geometry-driven substitute for the discriminative component of RGB, improving fine-grained geometric perception when color is unreliable.

2.4 Training Strategy

We adopt a curriculum-style two-stage training: Stage-1 learns robust 3D geometry and geometry-language alignment, while Stage-2 improves diagnosis and generation under higher-quality (partially rationale-augmented) supervision.

Stage-1. We train the 3D encoder and projectors and freeze the LLM, leveraging the large-scale Stage-1 data to build strong geometric representations and stable token alignment. Point features with GCP are used to reduce the gap to color-dependent point-cloud pretraining.

Stage-2. We freeze the 3D encoder and fine-tune the projectors and LLM with LoRA, leveraging higher-quality annotations to further enhance semantic modeling and generative reasoning, while preventing degradation of the learned geometric representations. A unified generative objective is applied: label-only samples supervise y, while rationale samples supervise (y, r).

Table 2. Comparison with baseline methods on multi-disease IOS diagnosis. The best result is bolded, while the second-best result is underlined.

| Models | Input | Acc | F1 | Preci | Recall | PR |
|---|---|---|---|---|---|---|
| GPT-5 [17] | Multi-View Images | 62.26 | 44.73 | 49.41 | 49.20 | 99.84 |
| Gemini 3 Pro [7] | Multi-View Images | 67.65 | 48.93 | 56.78 | 51.85 | 99.71 |
| Qwen3VL-8B [1] | Multi-View Images | 57.88 | 36.46 | 45.21 | 46.58 | 100 |
| InternVL3.5-8B [18] | Multi-View Images | 56.23 | 34.77 | 40.36 | 45.47 | 100 |
| InternVL3.5-14B [18] | Multi-View Images | 35.68 | 24.61 | 29.14 | 46.87 | 100 |
| MedGemma-1.5 [16] | Multi-View Images | 52.21 | 31.31 | 36.58 | 45.25 | 100 |
| HuatuoGPT-V-7B [4] | Multi-View Images | 59.26 | 39.14 | 40.33 | 45.56 | 100 |
| HuluMed-7B [9] | Multi-View Images | 61.21 | 34.68 | 40.09 | 45.12 | 100 |
| HuluMed-14B [9] | Multi-View Images | 53.72 | 36.87 | 41.63 | 46.58 | 100 |
| PointLLM-7B [20] | 3D Point Cloud | 43.03 | 34.18 | 44.62 | 44.80 | 99.85 |
| PointLLM-13B [20] | 3D Point Cloud | 34.57 | 26.76 | 42.04 | 45.03 | 96.78 |
| ShapeLLM-7B [14] | 3D Point Cloud | 30.27 | 19.23 | 14.66 | 44.95 | 100 |
| ShapeLLM-13B [14] | 3D Point Cloud | 26.13 | 17.78 | 13.61 | 45.10 | 100 |
| IOSVLM (Ours) | 3D Point Cloud | 77.23 | 50.39 | 52.19 | 52.96 | 100 |

3 Experimental Results

3.1 Experimental Setup

We evaluate on IOSVQA, with 229,943/15,598/5,884 samples for Stage-1 training, Stage-2 training, and testing, respectively. We report Macro Accuracy (Acc), Macro F1 (F1), Precision (Preci), and Recall, averaged over tasks, and additionally Parsing Rate (PR), measuring the fraction of generated answers that can be parsed into a valid label. IOSVLM uses the LLM from Qwen3VL-8B-Instruct [1] with N = 10,000 points and 32 learnable visual prompts.
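A minimal sketch of these per-task metrics (accuracy, macro F1 over a task's label set, and parsing rate); the paper-level numbers macro-average such per-task values over all disease tasks, and the labels below are illustrative:

```python
# Sketch of the evaluation metrics for one disease task. A prediction
# counts toward the parsing rate only if it maps to a valid label.
def task_metrics(preds, golds, labels):
    pr = sum(p in labels for p in preds) / len(preds)   # parsing rate
    acc = sum(p == g for p, g in zip(preds, golds)) / len(golds)
    f1s = []
    for c in labels:  # one-vs-rest F1 per class, then macro-average
        tp = sum(p == c and g == c for p, g in zip(preds, golds))
        fp = sum(p == c and g != c for p, g in zip(preds, golds))
        fn = sum(p != c and g == c for p, g in zip(preds, golds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return acc, sum(f1s) / len(f1s), pr

# "???" stands for an unparsable generated answer.
acc, f1, pr = task_metrics(["Yes", "No", "???"], ["Yes", "Yes", "No"], {"Yes", "No"})
```

Unparsable answers lower both the parsing rate and accuracy, which is why PR is reported alongside the diagnostic metrics.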
3.2 Comparison with State-of-the-Arts

Table 2 compares IOSVLM with four categories of representative baselines: (i) proprietary multimodal LLMs [17, 7], (ii) open-source general-purpose 2D MLLMs [1, 18], (iii) open-source medical 2D MLLMs [16, 4, 9], and (iv) open-source general 3D MLLMs operating on point clouds [20, 14]. For fair comparison, all 3D baselines use the same point-cloud preprocessing as IOSVLM, while 2D baselines receive conventional multi-view renderings of IOS: 5 standard intraoral views for occluded-arch scans and 4 views for single-arch scans.

Table 3. Ablation studies. V/P/L denote the 3D vision encoder, projector, and LLM, respectively. "*" indicates tuning with rationales. GCP: Geometry-to-Chromatic Proxy.

| Ablation part | Method choice | Acc | F1 | Preci | Recall | PR |
|---|---|---|---|---|---|---|
| LLM | Qwen3-4B [21] | 71.10 | 39.22 | 35.55 | 45.15 | 100 |
|  | Qwen3-8B [21] | 70.73 | 38.84 | 36.42 | 45.26 | 100 |
|  | Qwen3VL-8B | 67.02 | 43.10 | 48.94 | 46.70 | 95.08 |
| GCP | ✗ | 67.02 | 43.10 | 48.94 | 46.70 | 95.08 |
|  | ✓ | 72.28 | 48.06 | 51.25 | 49.82 | 97.00 |
| Training mode | Stage 1: V & P | 72.28 | 48.06 | 51.25 | 49.82 | 97.00 |
|  | *Stage 2: P & L | 77.23 | 50.39 | 52.19 | 52.96 | 100 |
|  | Stage 2: P & L | 77.92 | 50.35 | 51.81 | 53.61 | 99.74 |
|  | Stage 1: V & P & L | 75.38 | 47.79 | 49.73 | 51.07 | 99.97 |
|  | *Stage 2: P & L | 75.72 | 50.78 | 52.37 | 52.62 | 100 |

Overall performance. IOSVLM achieves the best overall results, obtaining the highest Acc (77.23%) and F1 (50.39%), as well as the best Recall (52.96%). Notably, IOSVLM outperforms all open-source 2D MLLMs by a large margin (at least +16.02% Acc/+11.25% F1) and substantially surpasses open-source 3D MLLMs (at least +34.20% Acc/+16.21% F1), highlighting the advantage of directly modeling native 3D IOS geometry.

Comparison to proprietary MLLMs. Compared with GPT-5, IOSVLM improves by +14.97% Acc and +5.66% F1; compared with Gemini 3 Pro, it yields +9.58% Acc and +1.46% F1 and achieves higher recall (52.96% vs. 51.85%).
Notably, IOSVLM uses only an 8B-scale LLM, yet surpasses these substantially larger proprietary MLLMs, highlighting that our gains primarily come from geometry-aware representation and alignment rather than model size. Gemini 3 Pro attains the best precision (56.78%), while IOSVLM provides a more balanced profile with the strongest accuracy, F1, and recall, which is crucial for multi-disease diagnosis under class imbalance and fine-grained ambiguity.

Parsing rate (PR). IOSVLM reaches 100% macro PR, indicating consistently parsable outputs in the task-specific label space, while proprietary models show slightly lower PR (99.84%/99.71%). We observe that, despite near-perfect PR, proprietary models still exhibit sporadic empty outputs that break parsability. Together, these results demonstrate that IOSVLM not only improves diagnostic performance but also produces more reliably structured predictions and informative output for large-scale multi-task evaluation.

Fig. 3. Qualitative examples of generated diagnostic responses.

3.3 Ablation Studies

Effect of Geometry-to-Chromatic Proxy (GCP). In Table 3, "✗" denotes using a constant "white" color (i.e., no separability cue), while "✓" replaces the color channels with the normal-based proxy. GCP yields consistent gains (+5.26% Acc and +4.96% F1), supporting our hypothesis that the discriminative component of RGB, not its semantics, is what benefits pretrained encoders. This also validates normal vectors as an effective instantiation of GCP, and suggests the proxy can be further extended to other geometric cues (e.g., curvature).

Training strategy. Our two-stage curriculum achieves strong performance, with the default setting yielding the best overall accuracy and F1. Interestingly, training V&P&L in Stage-1 outperforms V&P initially, but becomes slightly inferior after Stage-2 in the Acc-F1 trade-off.
We attribute this to early LLM updates under mixed-quality supervision, which may bias generation and limit subsequent high-quality adaptation. Nevertheless, all variants are competitive, suggesting that native 3D IOS input with a VLM pipeline forms a strong foundation, while the staged curriculum provides more stable optimization under multi-disease ambiguity.

Rationale supervision: reliability over raw F1. Using rationales in Stage-2 yields macro F1 comparable to label-only tuning, suggesting that rationales mainly regularize generation rather than improving geometric discriminability. Importantly, rationale tuning improves output reliability and usability: it maintains diagnostic performance while achieving higher macro PR, and we qualitatively observe fewer degenerate behaviors such as repeated labels. This indicates improved controllability and better-calibrated free-form answers, which is crucial for downstream clinical pipelines.

Choice of LLM. Among LLM candidates, Qwen3VL-8B yields the best macro F1 under our setting. Although its macro accuracy is slightly lower than the other LLMs, we observe that in our highly imbalanced tasks those LLMs tend to collapse to majority-class predictions, inflating accuracy but hurting minority coverage. In contrast, Qwen3VL-8B exhibits better class-coverage awareness and less majority bias, leading to a more balanced precision-recall profile.

3.4 Qualitative Analysis

In Fig. 3, IOSVLM shows more accurate diagnoses and more reliable, parsable outputs than the strong Gemini 3 Pro, supporting that directly modeling native 3D IOS geometry improves sensitivity to fine-grained morphological cues and enhances output robustness for clinical use.

4 Conclusion

We presented IOSVLM, an end-to-end 3D VLM for unified multi-disease IOS diagnosis and generative VQA, together with IOSVQA, a large-scale benchmark spanning 23 oral diseases and heterogeneous scan types.
We introduce a geometry-to-chromatic proxy and a curriculum-style training strategy to better leverage color-dependent point-cloud pretraining and improve robustness. Experiments demonstrate clear gains over strong baselines, supporting direct 3D geometry modeling for practical IOS-based clinical use.

References

1. Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint (2025)
2. Borghi, L., Lumetti, L., Cremonini, F., Rizzo, F., Grana, C., Lombardo, L., Bolelli, F.: Bits2Bites: Intra-oral scans occlusal classification. In: Oral and Dental Image aNalysis Workshop (2025)
3. Cai, Z., Zhang, J., Zhao, J., Zeng, Z., Li, Y., Liang, J., Chen, J., Yang, Y., You, J., Deng, S., et al.: DentalGPT: Incentivizing multimodal complex reasoning in dentistry. arXiv preprint arXiv:2512.11558 (2025)
4. Chen, J., Gui, C., Ouyang, R., Gao, A., Chen, S., Chen, G.H., Wang, X., Cai, Z., Ji, K., Wan, X., et al.: Towards injecting medical visual knowledge into multimodal LLMs at scale. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7346–7370 (2024)
5. Eggmann, F., Blatz, M.B.: Recent advances in intraoral scanners. Journal of Dental Research 103(13), 1349–1357 (2024)
6. Ender, A., Zimmermann, M., Mehl, A.: Accuracy of complete- and partial-arch impressions of actual intraoral scanning systems in vitro. Int J Comput Dent 22(1), 11–19 (2019)
7. Google DeepMind: Gemini 3 Pro model card. Model card (PDF) (Dec 2025), https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf, accessed 2026-02-27
8. Hao, J., Liang, Y., Lin, L., Fan, Y., Zhou, W., Guo, K., Ye, Z., Sun, Y., Zhang, X., Yang, Y., et al.: OralGPT-Omni: A versatile dental multimodal large language model. arXiv preprint arXiv:2511.22055 (2025)
9. Jiang, S., Wang, Y., Song, S., Hu, T., Zhou, C., Pu, B., Zhang, Y., Yang, Z., Feng, Y., Zhou, J.T., et al.: Hulu-Med: A transparent generalist model towards holistic medical vision-language understanding. arXiv preprint arXiv:2510.08668 (2025)
10. Lian, C., Wang, L., Wu, T.H., Wang, F., Yap, P.T., Ko, C.C., Shen, D.: Deep multi-scale mesh feature learning for automated labeling of raw dental surfaces from 3D intraoral scanners. IEEE Transactions on Medical Imaging 39(7), 2440–2450 (2020)
11. Liu, Z., He, X., Wang, H., Xiong, H., Zhang, Y., Wang, G., Hao, J., Feng, Y., Zhu, F., Hu, H.: Hierarchical self-supervised learning for 3D tooth segmentation in intra-oral mesh scans. IEEE Transactions on Medical Imaging 42(2), 467–480 (2023). https://doi.org/10.1109/TMI.2022.3222388
12. Mangano, F., Gandolfi, A., Luongo, G., Logozzo, S.: Intraoral scanners in dentistry: a review of the current literature. BMC Oral Health 17(1), 149 (2017)
13. Meng, Z., Hao, J., Dai, X., Feng, Y., Liu, J., Feng, B., Wu, H., Gai, X., Zhu, H., Hu, T., et al.: DentVLM: A multimodal vision-language model for comprehensive dental diagnosis and enhanced clinical practice. arXiv preprint (2025)
14. Qi, Z., Dong, R., Zhang, S., Geng, H., Han, C., Ge, Z., Yi, L., Ma, K.: ShapeLLM: Universal 3D object understanding for embodied interaction. In: European Conference on Computer Vision, pp. 214–238. Springer (2024)
15. Rodríguez-Ortega, J., Pérez-Hernández, F., Tabik, S.: CharNet: conditioned heatmap regression for robust dental landmark localization. arXiv e-prints (2025)
16. Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., Kohlberger, T., Xu, S., Jamil, F., Hughes, C., Lau, C., et al.: MedGemma technical report. arXiv preprint arXiv:2507.05201 (2025)
17. Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267 (2025)
18. Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
19. Xiong, H., Li, K., Tan, K., Feng, Y., Zhou, J.T., Hao, J., Ying, H., Wu, J., Liu, Z.: TSegFormer: 3D tooth segmentation in intraoral scans with geometry-guided transformer. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 421–432. Springer (2023)
20. Xu, R., Wang, X., Wang, T., Chen, Y., Pang, J., Lin, D.: PointLLM: Empowering large language models to understand point clouds. In: European Conference on Computer Vision, pp. 131–147. Springer (2024)
21. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
22. Zanjani, F.G., Moin, D.A., Verheij, B., Claessen, F., Cherici, T., Tan, T., et al.: Deep learning approach to semantic segmentation in 3D point cloud intra-oral scans of teeth. In: International Conference on Medical Imaging with Deep Learning, pp. 557–571. PMLR (2019)
23. Zelikman, E., Wu, Y., Mu, J., Goodman, N.D.: STaR: Self-taught reasoner bootstrapping reasoning with reasoning. In: Proceedings of the 36th International Conference on Neural Information Processing Systems (2024)
24. Zhang, B., Miao, Y., Wu, T., Chen, T., Jiang, J., Li, Z., Tang, Z., Yu, L., Su, J.: ArchMap: Arch-flattening and knowledge-guided vision-language model for tooth counting and structured dental understanding. arXiv preprint (2025)
25. Zhang, J., Du, B., Miao, Y., Sun, D., Cao, X.: OralGPT: A two-stage vision-language model for oral mucosal disease diagnosis and description. arXiv preprint arXiv:2510.13911 (2025)
