Multiscale Vector-Quantized Variational Autoencoder for Endoscopic Image Synthesis
Gastrointestinal (GI) imaging via Wireless Capsule Endoscopy (WCE) generates a large number of images requiring manual screening. Deep learning-based Clinical Decision Support (CDS) systems can assist screening, yet their performance relies on the existence of large, diverse medical training datasets. However, the scarcity of such data, due to privacy constraints and annotation costs, hinders CDS development. Generative machine learning offers a viable solution to this limitation. While current Synthetic Data Generation (SDG) methods, such as Generative Adversarial Networks and Variational Autoencoders, have been explored, they often face challenges with training stability and capturing sufficient visual diversity, especially when synthesizing abnormal findings. This work introduces a novel VAE-based methodology for medical image synthesis and presents its application to the generation of WCE images. The novel contributions of this work include: a) a multiscale extension of the Vector Quantized VAE model, named the Multiscale Vector Quantized Variational Autoencoder (MSVQ-VAE); b) unlike other VAE-based SDG models for WCE image generation, MSVQ-VAE seamlessly introduces abnormalities into normal WCE images; c) it enables conditional generation of synthetic images, allowing different types of abnormalities to be introduced into normal WCE images; d) experiments covering a variety of abnormality types, including polyps, vascular and inflammatory conditions. The utility of the generated images for CDS is assessed via image classification. Comparative experiments demonstrate that training a CDS classifier with the abnormal images generated by the proposed methodology yields results comparable to those of a classifier trained with only real data. The generality of the proposed methodology promises applicability to various domains related to medical multimedia.
💡 Research Summary
The paper addresses the critical shortage of large, diverse, and well‑annotated datasets for Wireless Capsule Endoscopy (WCE) – a modality that generates tens of thousands of frames per examination. While deep‑learning‑based Clinical Decision Support (CDS) systems can alleviate the manual review burden, their performance is limited by data scarcity caused by privacy regulations and high annotation costs. Existing synthetic data generation (SDG) approaches, especially GANs, suffer from training instability, mode collapse, and difficulty reproducing the wide visual variability of gastrointestinal lesions (polyps, vascular anomalies, inflammation). Conventional VAEs improve stability but are constrained by a low‑dimensional continuous latent space, leading to limited resolution and diversity.
To overcome these limitations, the authors propose a novel Vector‑Quantized Variational Autoencoder (VQ‑VAE) extension called Multiscale VQ‑VAE (MSVQ‑VAE). The architecture consists of three main parts: an encoder, a vector‑quantization (VQ) layer, and a decoder. The key innovation lies in the Multiscale Feature Extraction Block (MSB), which simultaneously applies three convolutional kernels (3×3, 5×5, 7×7) to capture fine‑to‑coarse texture and structural cues. Outputs from the three scales are concatenated along the channel dimension, then compressed via a 1×1 pointwise convolution followed by a 3×3 bottleneck convolution. This design preserves rich multi‑scale information while keeping computational cost modest, which is crucial for training on relatively small medical datasets.
Both encoder and decoder are built from stacks of MSB modules interleaved with stride‑2 (or transposed stride‑2) convolutions, progressively halving or doubling spatial resolution. Crucially, each MSB is bypassed by a residual connection that adds the block’s input (after a 3×3 convolution) to its output before the pointwise convolution. Unlike typical residual links that skip a single layer, these connections skip the entire multi‑scale block, ensuring that high‑level features propagate unchanged and mitigating vanishing‑gradient problems.
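The MSB described above can be sketched as a PyTorch module. This is a minimal reconstruction from the text, not the authors' code: channel widths, the activation function, and the exact placement of normalization are assumptions. Following the description, the residual 3×3 branch maps the input to the concatenated width and is added before the 1×1 pointwise compression.

```python
import torch
import torch.nn as nn

class MultiscaleBlock(nn.Module):
    """Sketch of the Multiscale Feature Extraction Block (MSB).

    Three parallel convolutions (3x3, 5x5, 7x7) run on the same input and are
    concatenated along the channel axis; a residual 3x3 branch, skipping the
    whole multiscale stage, is added before a 1x1 pointwise compression and a
    final 3x3 bottleneck. Channel widths and ReLU activations are assumptions.
    """

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        # One branch per kernel size; 'same' padding preserves spatial dims.
        self.branches = nn.ModuleList(
            nn.Conv2d(c_in, c_out, k, padding=k // 2) for k in (3, 5, 7)
        )
        # Residual path: 3x3 conv mapping the input to the concatenated width.
        self.residual = nn.Conv2d(c_in, 3 * c_out, 3, padding=1)
        # 1x1 pointwise compression followed by the 3x3 bottleneck conv.
        self.pointwise = nn.Conv2d(3 * c_out, c_out, 1)
        self.bottleneck = nn.Conv2d(c_out, c_out, 3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi = torch.cat([self.act(b(x)) for b in self.branches], dim=1)
        multi = multi + self.residual(x)  # skip over the entire block
        return self.bottleneck(self.act(self.pointwise(multi)))
```

In an encoder, stacks of such blocks would alternate with stride-2 convolutions to halve resolution; the decoder mirrors this with transposed stride-2 convolutions.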
The latent representation is discretized by a VQ layer with an Exponential Moving Average (EMA) codebook update rule. EMA stabilizes codebook learning even with sparse mini‑batches, preventing collapse and noisy updates that are common when using gradient‑based updates on small, non‑diverse datasets. The loss comprises a reconstruction term (mean‑squared error) and a commitment term weighted by a hyperparameter; the embedding loss present in the original VQ‑VAE is omitted because the codebook is updated independently of back‑propagation.
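A minimal NumPy sketch of one EMA-based quantization step may clarify why no embedding loss is needed: the codebook is moved toward assigned encoder outputs by running averages rather than by gradients, and only the commitment term enters back-propagation. The decay, commitment weight, and smoothing constant below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def vq_ema_step(z_e, codebook, counts, sums, decay=0.99, beta=0.25, eps=1e-5):
    """One vector-quantization step with an EMA codebook update (sketch).

    z_e: (N, D) encoder outputs; codebook: (K, D) embeddings.
    counts/sums hold per-code EMA statistics (cluster size and cluster sum).
    """
    # Nearest-code assignment by squared Euclidean distance.
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d.argmin(1)
    one_hot = np.eye(codebook.shape[0])[idx]                     # (N, K)

    # EMA updates of cluster size and cluster sum: no gradients involved.
    counts[:] = decay * counts + (1 - decay) * one_hot.sum(0)
    sums[:] = decay * sums + (1 - decay) * one_hot.T @ z_e
    n = counts.sum()
    # Laplace smoothing keeps rarely used codes from collapsing to zero.
    stable = (counts + eps) / (n + codebook.shape[0] * eps) * n
    codebook[:] = sums / stable[:, None]

    z_q = codebook[idx]
    # The MSE reconstruction term is computed downstream from z_q; only the
    # commitment term remains here (the embedding loss is dropped since EMA
    # replaces gradient-based codebook updates).
    commit = beta * ((z_e - z_q) ** 2).mean()
    return z_q, idx, commit
```

Because each code's statistics are smoothed over many mini-batches, a code that receives few assignments in one batch is not wrenched around by that batch alone, which is the stabilizing effect the summary attributes to EMA on small datasets.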
Conditional generation is achieved by mapping lesion labels to specific codebook indices. During synthesis, the latent code sequence of a normal image is altered by inserting the indices corresponding to the desired pathology (polyp, vascular lesion, or inflammation). The decoder then reconstructs an image where the selected abnormality appears seamlessly within the realistic bowel background.
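The index-editing step can be illustrated with a short sketch. The patch shape, placement, and the existence of a per-pathology index set are illustrative assumptions; the paper's exact mapping from lesion labels to codebook indices is not reproduced here.

```python
import numpy as np

def insert_abnormality(index_grid, lesion_patch, top_left):
    """Sketch of conditional synthesis by latent-index editing.

    index_grid:   (H, W) integer codebook indices of a *normal* image's
                  quantized latent map.
    lesion_patch: (h, w) indices associated with the desired pathology
                  (e.g. polyp, vascular lesion, or inflammation).
    A region of the grid is overwritten with the lesion codes; decoding the
    edited grid renders the abnormality inside the normal bowel background.
    """
    edited = index_grid.copy()
    r, c = top_left
    h, w = lesion_patch.shape
    edited[r:r + h, c:c + w] = lesion_patch
    return edited
```

In the full pipeline, `decoder(codebook[edited])` would then produce the synthetic abnormal image; everything outside the edited patch keeps the original normal-image codes, which is what makes the pathology appear seamlessly embedded.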
Experiments were conducted on the publicly available KID WCE dataset, using 224×224 RGB images. The model was trained on normal and abnormal samples (≈2,000 images per class). Generated images were evaluated qualitatively by expert endoscopists and quantitatively using Kernel Inception Distance (KID). Across all lesion types, the synthetic images achieved KID scores comparable to real images and were judged indistinguishable by clinicians.
To assess practical utility, the authors trained a ResNet‑50 based CDS classifier under three regimes: (1) real data only, (2) synthetic data only, and (3) a mix of real and synthetic data. The classifier trained solely on synthetic images attained accuracy, sensitivity, and specificity statistically indistinguishable from the real‑data baseline. Moreover, augmenting the real dataset with synthetic samples yielded modest performance gains, demonstrating that MSVQ‑VAE can effectively enrich scarce training sets.
In conclusion, MSVQ‑VAE provides a stable, high‑resolution, and conditionally controllable framework for generating medically plausible WCE images with diverse pathologies. Its multiscale feature extraction, residual‑bypassed blocks, and EMA‑driven codebook make it especially suitable for limited‑data scenarios common in medical imaging. The authors suggest future work on scaling the codebook, exploring additional scales, and extending the approach to other domains such as histopathology and radiology, where multi‑scale texture and lesion variability are equally critical.