A texture-based framework for foundational ultrasound models
Ultrasound is the most widely used medical imaging modality, yet the images it produces are fundamentally unique: tissue-dependent scattering, reflection, and speed-of-sound variations produce a constrained set of characteristic textures that differ markedly from natural-image statistics. These acoustically driven patterns make ultrasound challenging for algorithms originally designed for natural images. To bridge this gap, the field has increasingly turned to foundation models, hoping to leverage their generalization capabilities. However, these models often falter in ultrasound applications because they are merely trained on ultrasound data rather than designed for ultrasound physics. It is therefore essential to integrate ultrasound-specific domain knowledge into established learning frameworks. We achieve this by reformulating self-supervised learning as a texture-analysis problem, introducing texture ultrasound semantic analysis (TUSA). TUSA lets models leverage highly scalable contrastive methods to extract truly domain-specific representations directly from simple B-mode images. We train a TUSA model on a combination of open-source, simulated, and in vivo data and compare its latent space to several larger foundation models, demonstrating that our approach gives TUSA models better generalizability for difficult downstream tasks on unique online datasets as well as a clinical eye dataset collected for this study. Our model achieves higher accuracy in detecting COVID-19 (70%), spinal hematoma (100%), and vitreous hemorrhage (97%), and correlates more closely with quantitative parameters such as liver steatosis (r = 0.83), ejection fraction (r = 0.63), and oxygen saturation (r = 0.38). We open-source the model weights and training script: https://github.com/talg2324/tusa
💡 Research Summary
This paper introduces a novel self‑supervised learning framework for ultrasound imaging called Texture Ultrasound Semantic Analysis (TUSA). Recognizing that ultrasound B‑mode images are fundamentally different from natural photographs because they are generated by tissue‑dependent scattering, reflection, and speed‑of‑sound variations, the authors argue that conventional vision models fail to capture the limited yet characteristic texture patterns inherent to ultrasound. To embed this domain knowledge directly into the learning process, TUSA reformulates self‑supervised learning as a texture‑analysis problem.
The core architecture is a two‑stage autoencoder built on Swin‑UNETR, a hybrid CNN‑Transformer segmentation backbone. In the first stage, the input B‑mode image is segmented into K (empirically set to five) texture channels using a Sparsemax‑activated decoder that forces each pixel to belong to a single texture. In the second stage, each texture channel is processed by its own learnable convolutional kernel; the resulting feature maps are then merged via a depth‑wise separable convolution followed by a 1×1 tanh projection to reconstruct the original intensity image. This design explicitly separates “shape” (segmentation) from “texture” (channel‑wise convolution) while keeping the overall pipeline fully differentiable and label‑free.
Training follows the SimCLR paradigm. Each image undergoes two random spatial augmentations, each of which is further perturbed by color‑based augmentations (random erasing, brightness, contrast, Gaussian blur). The four augmented views are passed through the network, and a normalized temperature‑scaled cross‑entropy (NT‑Xent) loss encourages consistent latent representations across augmentations. Reconstruction quality is enforced by a composite loss comprising L1 distance, structural similarity index (SSIM), and learned perceptual image patch similarity (LPIPS) with a RadImageNet backbone. To prevent the model from collapsing onto a subset of texture channels, a weak penalty proportional to the negative channel‑wise entropy of the output is added, encouraging balanced usage of all channels.
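The NT-Xent objective above treats the augmented views of the same image as positives and every other embedding in the batch as a negative. A self-contained NumPy sketch for the standard two-view case (the function name and shapes are illustrative; the paper's four-view variant extends the same idea):

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """Normalized temperature-scaled cross-entropy (NT-Xent) from SimCLR.

    z1[i] and z2[i] are embeddings of two augmented views of image i;
    each view must identify its partner among the other 2N - 1 embeddings.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    # index of the positive partner for each of the 2N rows
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

The loss is minimized when the two views of each image map to nearly identical (unit-normalized) embeddings while staying dissimilar from every other image in the batch.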
The authors assemble a large and diverse training corpus of roughly 100 000 grayscale B‑mode images drawn from public ultrasound datasets covering nine anatomical regions (abdomen, cardiac, breast, fetal, musculoskeletal, etc.), supplemented with k‑wave simulated phantoms and mouse tumor scans. All images are resized to 128 × 128 and normalized to the range
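The preprocessing step can be sketched as follows. Note the normalization target here ([0, 1]) is an assumption for illustration, since the summary above only states that images are resized to 128 × 128 and intensity-normalized; nearest-neighbor resizing is likewise a stand-in for whatever interpolation the pipeline actually uses:

```python
import numpy as np

def preprocess(bmode, size=128):
    """Resize a grayscale B-mode frame to size x size and rescale intensity.

    Nearest-neighbor resize and a [0, 1] min-max normalization are
    illustrative assumptions, not the paper's exact preprocessing.
    """
    h, w = bmode.shape
    rows = np.arange(size) * h // size          # nearest-neighbor row indices
    cols = np.arange(size) * w // size          # nearest-neighbor col indices
    img = bmode[np.ix_(rows, cols)].astype(float)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)
```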