Automated Histopathology Report Generation via Pyramidal Feature Extraction and the UNI Foundation Model

Reading time: 4 minutes

📝 Original Info

  • Title: Automated Histopathology Report Generation via Pyramidal Feature Extraction and the UNI Foundation Model
  • ArXiv ID: 2602.16422
  • Date: 2026-02-18
  • Authors: Not provided (the original text does not specify author information.)

📝 Abstract

Generating diagnostic text from histopathology whole slide images (WSIs) is challenging due to the gigapixel scale of the input and the requirement for precise, domain-specific language. We propose a hierarchical vision-language framework that combines a frozen pathology foundation model with a Transformer decoder for report generation. To make WSI processing tractable, we perform multi-resolution pyramidal patch selection (downsampling factors 2^3 to 2^6) and remove background and artifacts using Laplacian variance and HSV-based criteria. Patch features are extracted with the UNI Vision Transformer and projected into a 6-layer Transformer decoder that generates diagnostic text via cross attention. To better represent biomedical terminology, we tokenize the output using BioGPT. Finally, we add a retrieval-based verification step that compares generated reports with a reference corpus using Sentence-BERT embeddings; if a high-similarity match is found, the generated report is replaced with the retrieved ground-truth reference to improve reliability.
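The patch-selection step described above can be pictured with a short sketch. The snippet below is a minimal illustration, assuming OpenCV and BGR patch arrays; the specific thresholds are placeholders, since the paper does not report its exact cutoff values.

```python
import cv2
import numpy as np

def keep_patch(patch_bgr: np.ndarray,
               min_lap_var: float = 50.0,     # assumed sharpness/texture threshold
               min_sat: int = 20,             # assumed minimum saturation for "stained" pixels
               max_white_frac: float = 0.85) -> bool:
    """Return True if a patch likely contains informative tissue.

    Combines a Laplacian-variance focus/texture check with HSV-based
    background rejection, in the spirit of the abstract; all threshold
    values here are illustrative, not the paper's settings.
    """
    gray = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2GRAY)
    # Low Laplacian variance -> blurry or featureless region (out-of-focus areas, glass).
    if cv2.Laplacian(gray, cv2.CV_64F).var() < min_lap_var:
        return False

    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    sat, val = hsv[..., 1], hsv[..., 2]
    # Mostly bright, unsaturated pixels -> slide background rather than stained tissue.
    white_frac = float(np.mean((sat < min_sat) & (val > 220)))
    return white_frac <= max_white_frac
```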

💡 Deep Analysis

📄 Full Content

Histopathological examination remains the clinical reference standard for cancer diagnosis, requiring expert pathologists to interpret complex morphological patterns across cellular, tissue, and architectural levels [1]. While the digitization of pathology has enabled discriminative deep learning for tasks such as tumor classification and segmentation [2,3], recent work has increasingly explored generative models for producing textual outputs from images. Automated Histopathology Report Generation (AHRG) extends slide-level prediction by synthesizing coherent, clinically appropriate natural-language descriptions directly from whole-slide images (WSIs).

A central difficulty in AHRG is the disparity between the scale of the visual input and the semantic density of the textual output. A single WSI often exceeds 10^10 pixels, rendering standard vision-language architectures, which are typically designed for natural images at 224 × 224 resolution, computationally intractable. Traditional Multiple Instance Learning (MIL) methods [13,14] effectively aggregate features for slide-level prediction but often lack the fine-grained spatial grounding required for descriptive text generation.
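A quick back-of-the-envelope count makes the scale mismatch concrete; the figures below are illustrative only and not taken from the paper.

```python
# Rough patch counts for a ~10^10-pixel WSI tiled at 224 x 224.
wsi_pixels = 10 ** 10
patch_pixels = 224 * 224

print(wsi_pixels // patch_pixels)        # ~200,000 patches at native resolution

# Downsampling by 2^k per dimension shrinks the pixel count by (2^k)^2,
# which is why coarse pyramid levels (2^3 .. 2^6) become tractable to scan.
for level in range(3, 7):
    area_factor = (2 ** level) ** 2
    print(level, (wsi_pixels // area_factor) // patch_pixels)
```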

Recent advancements have attempted to bridge this gap through two primary avenues: domain-specific foundation models and Multimodal Large Language Models (MLLMs). Foundation models like UNI [4] and H-optimus-1 [23] provide robust, self-supervised feature representations, yet they lack inherent text generation capabilities. Conversely, MLLMs such as WSI-LLaVA [25] and ChatEXAONEPath [27] adapt general-purpose LLMs to pathology through instruction tuning. However, these end-to-end systems face significant hurdles: they are computationally expensive to train, often require massive token pruning that risks discarding rare diagnostic features, and are prone to hallucinations, i.e., plausible but factually incorrect statements [8,28].

In this work, we present a modular, hierarchical vision-language framework that emphasizes computational efficiency and diagnostic reliability. Rather than training an end-to-end MLLM, we pair a frozen pathology encoder with a lightweight, domain-adapted decoder. Our main contributions are:

  1. We propose a hierarchical pyramidal scanning strategy (downsampling factors 2^3 to 2^6) that follows a coarse-to-fine workflow and uses simple, interpretable filters to prioritize tissue regions while suppressing background and common artifacts.

  2. We integrate the UNI encoder [4] as a frozen feature extractor and train a lightweight Transformer decoder on top of its 1024-dimensional visual tokens, avoiding end-to-end retraining of the visual backbone.

  3. We use the BioGPT tokenizer [22] to better represent biomedical terminology and reduce vocabulary mismatch during decoding.

  4. We add a retrieval-based verification step that compares generated reports with a reference corpus using Sentence-BERT embeddings, replacing high-similarity matches with retrieved ground-truth references to improve output reliability.
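The verification step in item 4 can be sketched with the sentence-transformers library. The checkpoint name and the 0.90 similarity threshold below are illustrative assumptions; the paper does not state which Sentence-BERT model or cutoff it uses.

```python
from sentence_transformers import SentenceTransformer, util

# Model choice and threshold are placeholders, not the paper's configuration.
sbert = SentenceTransformer("all-MiniLM-L6-v2")

def verify_report(generated: str, reference_corpus: list[str],
                  threshold: float = 0.90) -> str:
    """Replace a generated report with the closest reference report
    when their Sentence-BERT cosine similarity exceeds a threshold."""
    gen_emb = sbert.encode(generated, convert_to_tensor=True)
    ref_embs = sbert.encode(reference_corpus, convert_to_tensor=True)
    sims = util.cos_sim(gen_emb, ref_embs)[0]        # shape: (num_references,)
    best = int(sims.argmax())
    if float(sims[best]) >= threshold:
        return reference_corpus[best]                # high-confidence match: use ground truth
    return generated                                 # otherwise keep the generated text
```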

Computational pathology has expanded from primarily discriminative tasks toward generative settings that require translating visual evidence into structured text. This section reviews pathology foundation models, histopathology-specific MLLMs, and verification strategies for reducing unsupported generation.

Transfer learning from natural images (e.g., ImageNet-pretrained ResNet) has increasingly been complemented by domain-specific self-supervised learning (SSL). Pathology foundation models are trained on large collections of histopathology patches to learn transferable tissue representations. Chen et al. [4] introduced UNI, a ViT-Large model distilled via DINOv2 from over 100 million tissue patches, which improves performance across multiple downstream tasks relative to supervised baselines. H-optimus-1 [23] further scales SSL to slide-level corpora to capture broad morphological variability. While these models provide strong visual representations, they are feature extractors and therefore require a separate decoding component to produce diagnostic text.
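As a rough illustration of how such a frozen feature extractor is paired with a separate decoding component, the PyTorch sketch below projects precomputed 1024-dimensional patch features into a 6-layer Transformer decoder whose vocabulary comes from the BioGPT tokenizer. The hidden size, head count, omission of positional encodings, and the "microsoft/biogpt" checkpoint ID are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer

# Assumes patch features were already extracted offline by a frozen encoder
# (e.g., UNI) and stored as a (num_patches, 1024) tensor per slide.
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")

class ReportDecoder(nn.Module):
    def __init__(self, feat_dim=1024, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        vocab_size = tokenizer.vocab_size
        self.proj = nn.Linear(feat_dim, d_model)           # project 1024-d patch features
        self.embed = nn.Embedding(vocab_size, d_model)     # BioGPT-tokenized report tokens
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Positional encodings are omitted here for brevity.

    def forward(self, patch_feats, token_ids):
        # patch_feats: (B, N_patches, 1024); token_ids: (B, T)
        memory = self.proj(patch_feats)                     # keys/values for cross attention
        tgt = self.embed(token_ids)
        T = token_ids.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)                         # next-token logits
```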

Generating text from WSIs requires solving the “semantic gap” between pixel-level features and high-level diagnostic concepts.

Early approaches relied on captioning models adapted from natural image domains, often yielding generic descriptions [17].

The current state of the art leverages Multimodal Large Language Models (MLLMs). Quilt-1M [24] facilitated this shift by curating a dataset of over 1 million image-text pairs from educational videos and social media, enabling the training of models such as Quilt-LLaVA. WSI-LLaVA [25] addresses the computational bottleneck of processing gigapixel images through dynamic token pruning, retaining only diagnostically relevant patches to fit within the context window of a LLaMA-based decoder. HistGen [26] proposes a dual-stream architecture that separately aggregates local-region details and global WSI context, allowing the decoder to attend to both during generation.

Reference

This content is AI-processed based on open access ArXiv data.
