📝 Original Info
- Title: MedNeXt-v2: Scaling 3D ConvNeXts for Large-Scale Supervised Representation Learning in Medical Image Segmentation
- ArXiv ID: 2512.17774
- Date: 2025-12-19
- Authors: Saikat Roy, Yannick Kirchhoff, Constantin Ulrich, Maximillian Rokuss, Tassilo Wald, Fabian Isensee, Klaus Maier-Hein
📝 Abstract
Large-scale supervised pretraining is rapidly reshaping 3D medical image segmentation. However, existing efforts focus primarily on increasing dataset size and overlook the question of whether the backbone network is an effective representation learner at scale. In this work, we address this gap by revisiting ConvNeXt-based architectures for volumetric segmentation and introducing MedNeXt-v2, a compound-scaled 3D ConvNeXt that leverages improved micro-architecture and data scaling to deliver state-of-the-art performance. First, we show that routinely used backbones in large-scale pretraining pipelines are often suboptimal. Subsequently, we use comprehensive backbone benchmarking prior to scaling and demonstrate that stronger from-scratch performance reliably predicts stronger downstream performance after pretraining. Guided by these findings, we incorporate a 3D Global Response Normalization module and use depth, width, and context scaling to improve our architecture for effective representation learning. We pretrain MedNeXt-v2 on 18k CT volumes and demonstrate state-of-the-art performance when fine-tuning across six challenging CT and MR benchmarks (144 structures), showing consistent gains over seven publicly released pretrained models. Beyond these improvements, our benchmarking of these models also reveals that stronger backbones yield better results on similar data, that representation scaling disproportionately benefits pathological segmentation, and that modality-specific pretraining offers negligible benefit once full fine-tuning is applied. In conclusion, our results establish MedNeXt-v2 as a strong backbone for large-scale supervised representation learning in 3D medical image segmentation. Our code and pretrained models are made available with the official nnUNet repository at: https://www.github.com/MIC-DKFZ/nnUNet
💡 Deep Analysis
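The paper's central micro-architectural addition is a 3D Global Response Normalization (GRN) module. As a rough illustration of the mechanism, the sketch below extends the GRN of ConvNeXt-V2 (per-channel global L2 aggregation, divisive normalization across channels, and learnable calibration on a residual path) to volumetric tensors. The module name, channels-first layout, and zero initialization are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GRN3d(nn.Module):
    """Global Response Normalization for volumetric features.

    A minimal sketch extending ConvNeXt-V2's GRN to (N, C, D, H, W)
    tensors; the exact MedNeXt-v2 variant may differ in placement
    and parameterization.
    """
    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        # Per-channel scale and bias, zero-initialized so the module
        # starts out as an identity mapping (residual formulation).
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1, 1))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Global aggregation: L2 norm of each channel over all voxels.
        gx = torch.norm(x, p=2, dim=(2, 3, 4), keepdim=True)  # (N, C, 1, 1, 1)
        # Divisive normalization: each channel's response relative to the mean.
        nx = gx / (gx.mean(dim=1, keepdim=True) + self.eps)
        # Recalibrate features, keeping the identity path.
        return self.gamma * (x * nx) + self.beta + x
```

Because gamma and beta start at zero, the block initially acts as an identity, which makes it cheap to drop into an existing ConvNeXt-style backbone before pretraining.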
📄 Full Content
MedNeXt-v2: Scaling 3D ConvNeXts for Large-Scale Supervised Representation Learning in Medical Image Segmentation
Saikat Roy∗,1,2, Yannick Kirchhoff∗,1,2,3, Constantin Ulrich1,4,5, Maximillian Rokuss1,2, Tassilo Wald1,2,6, Fabian Isensee1,6, Klaus Maier-Hein1,2,7
∗Contributed equally. Each author may denote themselves as positional first author in their CVs.
1German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing, Germany
2Faculty of Mathematics and Computer Science, Heidelberg University, Germany
3HIDSS4Health - Helmholtz Information and Data Science School for Health, Karlsruhe/Heidelberg, Germany
4Medical Faculty Heidelberg, Heidelberg University, Germany
5National Center for Tumor Diseases (NCT), Heidelberg, Germany
6Helmholtz Imaging, German Cancer Research Center, Germany
7Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Germany
saikat.roy@dkfz-heidelberg.de; yannick.kirchhoff@dkfz-heidelberg.de
1. Introduction

Automated segmentation of medical images is one of the most common tasks in biomedical image analysis [12, 28, 49, 52]. Despite rapid development in deep learning based approaches over the last decade [1, 42, 50], UNet-based [59] deep convolutional networks (ConvNets) have remained central to high-performing methodologies for 3D medical image segmentation [31, 32]. Although alternative approaches such as Transformers have been popular in recent years [36], their limited inductive bias has proved a hindrance for training from scratch on the currently available medical segmentation datasets, typically containing sparse annotations [42]. This has led to ConvNeXt-based [46] approaches leveraging the scalability of the Transformer while retaining the strong inductive bias of ConvNets to offer effective solutions for 3D medical image segmentation [10, 38, 43, 56, 61].

[Figure 1: bar chart of mean DSC over datasets (y-axis roughly 70-84) for SegVol, Vista3D, TotalSeg, CADS, MedNeXt-v1, and MedNeXt-v2 (ours), with gains of +11.33, +2.69, +2.50, +1.15, and +1.23 DSC over the respective baselines.]
Figure 1. MedNeXt-v2 sets a new state-of-the-art in 3D medical image segmentation. By leveraging micro-architectural improvements and large-scale pretraining, it outperforms powerful existing networks across multiple 3D medical segmentation tasks.
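To make the design argument concrete, below is a minimal sketch of a generic 3D ConvNeXt-style block in PyTorch: a large-kernel depthwise convolution performs spatial mixing (the convolutional analogue of self-attention) and a 1x1x1 inverted bottleneck performs channel mixing (the analogue of the Transformer MLP). Kernel size, GroupNorm, and the expansion ratio are illustrative assumptions; the actual MedNeXt-v2 block additionally incorporates GRN and compound scaling.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock3d(nn.Module):
    """A minimal 3D ConvNeXt-style block (illustrative sketch,
    not the exact MedNeXt-v2 block)."""
    def __init__(self, channels: int, kernel_size: int = 7, expansion: int = 4):
        super().__init__()
        # Depthwise conv mixes spatial context per channel, mirroring
        # the token-mixing role of self-attention.
        self.dwconv = nn.Conv3d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)
        self.norm = nn.GroupNorm(num_groups=channels, num_channels=channels)
        # Inverted bottleneck of 1x1x1 convs mixes channels, mirroring
        # the Transformer's MLP.
        self.pwconv1 = nn.Conv3d(channels, expansion * channels, kernel_size=1)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv3d(expansion * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        x = self.norm(x)
        x = self.pwconv2(self.act(self.pwconv1(x)))
        return x + residual
```

Stacking more such blocks, growing `channels`, and enlarging `kernel_size` is one natural reading of the depth, width, and context scaling the abstract refers to.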
However, following significant advances in computer vision [13, 55, 66] over the last decade, the field of medical image segmentation has also been gradually moving towards incorporating large-scale supervised pretraining of deep networks [9, 26, 68, 72]. In recent years, the availability of large monolithic datasets [19, 73] or aggregated collections of previously available small-scale datasets [3, 39, 76] has led to initial attempts at pretraining large-scale deep learning models for the segmentation of 3D medical images. Notably, while approaches in 2D computer vision have moved towards self-supervised learning (SSL) owing to the abundance of unlabeled data [23, 34], the domain of 3D medical image segmentation continues to leverage supervised pretraining.
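In the supervised setting described here, the pretrained model is consumed downstream by full fine-tuning: the pretrained weights are loaded into a new network and only the label-specific segmentation head is reinitialized for the target dataset. A minimal sketch, assuming the checkpoint stores a plain PyTorch state dict and using a hypothetical `seg_head` prefix (not the actual nnUNet checkpoint layout):

```python
import torch
import torch.nn as nn

def load_pretrained_backbone(model: nn.Module, ckpt_path: str,
                             head_prefix: str = "seg_head") -> nn.Module:
    """Load pretrained weights, dropping the pretraining segmentation
    head so the target task's freshly initialized head is kept.

    `ckpt_path` and `head_prefix` are illustrative assumptions.
    """
    # Assumes the checkpoint is a plain state dict of parameter tensors.
    state = torch.load(ckpt_path, map_location="cpu")
    # Discard head weights: their output channels match the pretraining
    # label set, not the target dataset's labels.
    state = {k: v for k, v in state.items() if not k.startswith(head_prefix)}
    # strict=False tolerates the intentionally missing head entries.
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"reinitialized: {missing} | ignored: {unexpected}")
    return model
```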
Despite s