Singpath-VL Technical Report
We present Singpath-VL, a large vision-language model that addresses the absence of AI assistants in cervical cytology. Recent advances in multi-modal large language models (MLLMs) have significantly propelled the field of computational pathology. However, their application in cytopathology, particularly cervical cytology, remains underexplored, primarily due to the scarcity of large-scale, high-quality annotated datasets. To bridge this gap, we first develop a novel three-stage pipeline to synthesize a million-scale image-description dataset. The pipeline leverages multiple general-purpose MLLMs as weak annotators, refines their outputs through consensus fusion and expert knowledge injection, and produces high-fidelity descriptions of cell morphology. Using this dataset, we then fine-tune the Qwen3-VL-4B model via a multi-stage strategy to create a specialized cytopathology MLLM. The resulting model, named Singpath-VL, demonstrates superior performance in fine-grained morphological perception and cell-level diagnostic classification. To advance the field, we will open-source a portion of the synthetic dataset and benchmark.
💡 Research Summary
The paper introduces Singpath‑VL, a vision‑language large model (MLLM) specifically designed for cervical cytology assistance. Recognizing that existing multimodal models excel in computational pathology but lack sufficient fine‑grained data for cytopathology, the authors devise a three‑stage pipeline to automatically generate a million‑scale image‑description dataset called Singpath‑CytoText. In Stage 1, three state‑of‑the‑art open‑source MLLMs (Qwen3‑VL‑32B, InternVL‑3.5‑38B, Baichuan‑Omni‑1.5) are queried in parallel to produce independent captions for each cervical cell tile. Stage 2 employs a large language model to fuse these captions: it extracts consensus morphological features, resolves conflicts, and filters out dimensions that all weak annotators miss. Stage 3 introduces a domain‑specific expert model, fine‑tuned on cervical cytopathology, which reviews the original image together with the fused description and injects missing details such as precise nuclear‑to‑cytoplasmic ratios, chromatin patterns, and subtle membrane irregularities. The resulting dataset contains one million high‑quality image‑text pairs that capture both structured morphological attributes and coherent narrative summaries.
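The three stages above can be sketched roughly as follows. This is an illustrative stub, not the paper's implementation: the annotator outputs are hard-coded, and the majority-agreement rule and `stage3_expert_inject` helper are assumptions standing in for the LLM fuser and the fine-tuned expert model.

```python
# Hypothetical sketch of the three-stage annotation pipeline. The real pipeline
# queries Qwen3-VL-32B, InternVL-3.5-38B, and Baichuan-Omni-1.5, fuses their
# captions with an LLM, and injects details from a domain expert model.

def stage1_weak_annotate(image_tile):
    """Stage 1: query several general-purpose MLLMs in parallel (stubbed here)."""
    return [
        {"nuclear_enlargement": "mild", "chromatin": "fine"},
        {"nuclear_enlargement": "mild", "chromatin": "coarse"},
        {"nuclear_enlargement": "mild", "n_c_ratio": "elevated"},
    ]

def stage2_consensus_fuse(captions):
    """Stage 2: keep only morphological features the weak annotators agree on."""
    fused = {}
    for key in set().union(*(c.keys() for c in captions)):
        values = [c[key] for c in captions if key in c]
        # A feature survives only if at least two annotators report the same value.
        if len(values) >= 2 and len(set(values)) == 1:
            fused[key] = values[0]
    return fused

def stage3_expert_inject(image_tile, fused):
    """Stage 3: a domain expert model adds details the generalists missed (stubbed)."""
    enriched = dict(fused)
    enriched.setdefault("n_c_ratio", "approximately 1:3")  # expert-only detail
    return enriched

tile = None  # placeholder for a cervical cell image tile
description = stage3_expert_inject(tile, stage2_consensus_fuse(stage1_weak_annotate(tile)))
```

In this toy run, the conflicting `chromatin` values are filtered out by the consensus rule, while the expert stage supplies the nuclear-to-cytoplasmic ratio that only one weak annotator mentioned.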
Using this dataset, the authors fine‑tune the Qwen3‑VL‑4B foundation model through a multi‑stage training recipe. Stage 1 aligns visual and textual representations within the cytopathology domain via full‑parameter vision‑language alignment. Stage 2 converts the aligned pairs into instruction‑following data, employing The Bethesda System (TBS) reporting templates to simulate realistic diagnostic queries and multi‑turn dialogues. Stage 3 mitigates catastrophic forgetting by replaying two streams of knowledge: (i) domain‑specific QA pairs generated by the original Qwen3‑VL‑4B on the curated data, and (ii) general‑domain QA pairs sampled from the base model’s broader corpus. This replay strategy preserves the model’s general reasoning abilities while deepening its domain expertise.
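The Stage-3 replay idea can be illustrated with a minimal data-mixing sketch. The `build_replay_mix` helper and the 30 % general-domain ratio are hypothetical assumptions; the paper does not specify its sampler or mixing proportion.

```python
import random

# Hypothetical sketch of replay-based forgetting mitigation: domain-specific QA
# pairs are interleaved with general-domain QA sampled from the base model's
# corpus. Ratio and sampling scheme are illustrative, not the paper's values.

def build_replay_mix(domain_qa, general_qa, general_ratio=0.3, seed=0):
    """Return a shuffled training stream with ~general_ratio general samples."""
    rng = random.Random(seed)
    # Solve general / (domain + general) = general_ratio for the general count.
    n_general = round(len(domain_qa) * general_ratio / (1 - general_ratio))
    replay = rng.sample(general_qa, min(n_general, len(general_qa)))
    mix = list(domain_qa) + replay
    rng.shuffle(mix)
    return mix

domain = [("What is the N/C ratio?", "Elevated, roughly 1:2.")] * 7
general = [("Capital of France?", "Paris.")] * 10
stream = build_replay_mix(domain, general)
```

Interleaving the two streams in one shuffled batch sequence, rather than training on them sequentially, is the standard way replay counters catastrophic forgetting.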
Evaluation is conducted on two newly constructed benchmarks. MorphoPercept‑Bench comprises binary labels for nine key morphological observations (nuclear enlargement, atypia, hyperchromasia, chromatin texture, nuclear count, nuclear‑to‑cytoplasmic ratio, nuclear membrane, etc.) across a curated set of cell images. Singpath‑VL achieves an average accuracy of 89 % on these tasks, surpassing the inter‑rater agreement among expert cytopathologists (≈78 %) and dramatically outperforming baseline open‑source MLLMs (accuracies ranging from 40 % to 70 %). The second benchmark, CytoCell‑Bench, includes 29,102 cell images annotated with TBS diagnostic categories (NILM, ASC‑US, LSIL, ASC‑H, HSIL, AGC). Singpath‑VL attains perfect (100 %) accuracy on NILM and an overall average accuracy of 80 %, exceeding both a dedicated EfficientNet‑B0 classifier and all general‑purpose MLLMs, with particularly large gains on the ambiguous ASC‑US and ASC‑H categories (5.5 % and 3.7 %, respectively).
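The per-category scoring behind numbers like these can be sketched as below. The `per_category_accuracy` helper and the dummy labels are illustrative assumptions, not the paper's evaluation code; macro-averaging over categories is one common convention for reporting an "overall average accuracy" on imbalanced label sets like TBS.

```python
from collections import defaultdict

# Hypothetical per-category accuracy scorer in the style of CytoCell-Bench.
TBS_CATEGORIES = ["NILM", "ASC-US", "LSIL", "ASC-H", "HSIL", "AGC"]

def per_category_accuracy(labels, preds):
    """Return per-category accuracies and their macro average (observed categories only)."""
    correct, total = defaultdict(int), defaultdict(int)
    for y, p in zip(labels, preds):
        total[y] += 1
        correct[y] += int(y == p)
    accs = {c: correct[c] / total[c] for c in TBS_CATEGORIES if total[c]}
    macro = sum(accs.values()) / len(accs)
    return accs, macro

# Dummy ground truth and predictions for illustration only.
labels = ["NILM", "NILM", "LSIL", "HSIL"]
preds = ["NILM", "NILM", "LSIL", "LSIL"]
accs, macro = per_category_accuracy(labels, preds)
```

Here the HSIL case mis-predicted as LSIL drags the macro average down even though 3 of 4 samples are correct, which is exactly why per-category reporting exposes weaknesses on rare or ambiguous classes such as ASC-US and ASC-H.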
Qualitative case studies further illustrate Singpath‑VL’s superiority: for both NILM and LSIL examples, the model generates precise morphological descriptions and correct TBS classifications, whereas competing models tend to over‑classify as positive due to biases in their training data. The authors acknowledge limitations: the model is currently confined to cervical cytology, downstream tasks such as whole‑slide screening and integration with clinical history remain untested, and extending the approach to other cytopathology domains (e.g., thyroid FNAB, urine cytology) would require substantial adaptation. Future work is outlined to explore causal reasoning, diagnostic inference, slide‑level efficiency, and broader domain generalization. To foster community progress, the authors commit to open‑sourcing a portion of Singpath‑CytoText and the benchmark datasets.