A 2D dilated residual U-Net for multi-organ segmentation in thoracic CT
Automatic segmentation of organs-at-risk (OAR) in computed tomography (CT) is an essential part of planning effective treatment strategies for lung and esophageal cancer. Accurate segmentation of the organs surrounding a tumour helps account for the variation in position and morphology across patients, thereby facilitating adaptive and computer-assisted radiotherapy. Although manual delineation of OARs is still highly prevalent, it is prone to errors due to complex variations in the shape and position of organs across patients, and the low soft-tissue contrast between neighbouring organs in CT images. Recently, deep convolutional neural networks (CNNs) have gained tremendous traction and achieved state-of-the-art results in medical image segmentation. In this paper, we propose a deep learning framework to segment OARs in thoracic CT images, specifically the heart, esophagus, trachea, and aorta. Our approach employs dilated convolutions and aggregated residual connections in the bottleneck of a U-Net-style network, which incorporates both global context and dense information. Our method achieved an overall Dice score of 91.57% on 20 unseen test samples from the ISBI 2019 SegTHOR challenge.
💡 Research Summary
This paper addresses the clinically important problem of automatically segmenting organs‑at‑risk (OAR) in thoracic computed tomography (CT) scans, focusing on four structures that are critical for radiotherapy planning: the heart, esophagus, trachea, and aorta. The authors propose a modified 2‑dimensional U‑Net architecture, termed U‑Net+DR (Dilated Residual), which incorporates two key innovations to overcome the limitations of the classic U‑Net when applied to low‑contrast, anatomically complex CT data.
First, the bottleneck of the encoder‑decoder network is equipped with four dilated convolutional layers whose dilation rates range from 1 to 4. By inserting gaps between the kernel weights, a 3 × 3 filter attains the effective receptive field of a 7 × 7 (or larger) filter without increasing the number of parameters. This design enables the network to capture long‑range contextual information that is essential for distinguishing adjacent organs with ambiguous boundaries.
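The arithmetic behind this claim can be checked directly. A k × k convolution with dilation rate d covers an effective k + (k − 1)(d − 1) window, and a stack of stride-1 dilated convolutions accumulates these windows. The short sketch below computes this; the stack of dilation rates 1–4 is taken from the description above.

```python
def effective_kernel(k, d):
    """Effective kernel size of a k x k convolution with dilation rate d."""
    return k + (k - 1) * (d - 1)

def stacked_receptive_field(k, dilations):
    """Receptive field of a stack of k x k dilated convolutions at stride 1."""
    rf = 1
    for d in dilations:
        rf += effective_kernel(k, d) - 1
    return rf

print(effective_kernel(3, 3))                    # 7  (a 3x3 filter "sees" 7x7)
print(stacked_receptive_field(3, [1, 2, 3, 4]))  # 21 (the four bottleneck layers)
```

So the four bottleneck layers together see a 21 × 21 region of the incoming feature map while using only 3 × 3 kernels.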
Second, each encoder block is transformed into a residual block: the input to the first convolution is concatenated with the output of the second convolution before the subsequent max‑pooling operation. This residual connection mitigates gradient vanishing, facilitates the flow of multi‑scale features, and stabilizes training, especially given the modest size of the dataset (60 patients).
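The wiring of such a block can be sketched in a few lines. This is an illustrative NumPy mock-up, not the authors' Keras code: `conv3x3_stub` stands in for a real convolution + ReLU, and only the channel-wise concatenation of the block input with the second convolution's output is shown.

```python
import numpy as np

def conv3x3_stub(x):
    # Stand-in for a 3x3 convolution followed by ReLU; a real
    # implementation would convolve with learned weights.
    return np.maximum(x, 0)

def residual_encoder_block(x):
    """Encoder block as described above: the block input is concatenated
    (along the channel axis) with the output of the second convolution,
    before max-pooling."""
    skip = x
    h = conv3x3_stub(x)
    h = conv3x3_stub(h)
    return np.concatenate([skip, h], axis=-1)  # channel count doubles

x = np.random.randn(288, 288, 32)
y = residual_encoder_block(x)
print(y.shape)  # (288, 288, 64)
```

Note that concatenation (rather than element-wise addition, as in a classic ResNet block) doubles the channel count, which the following layer must accommodate.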
The dataset used is the ISBI 2019 SegTHOR challenge set, comprising 60 thoracic CT volumes (512 × 512 pixels per slice, 150–284 slices per volume). The authors split the data into 40 patients for training (7 390 slices) and 20 patients for testing (3 694 slices). Pre‑processing includes slice‑wise contrast limited adaptive histogram equalization (CLAHE), mean‑std normalization, and a central crop to 288 × 288 pixels to focus on the region of interest while reducing memory consumption.
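The crop and normalization steps are straightforward to sketch in NumPy (CLAHE is omitted here, since it would require an image-processing library such as OpenCV; this is a simplified sketch, not the authors' pipeline):

```python
import numpy as np

def center_crop(slice_2d, size=288):
    """Crop a size x size square from the centre of a 2D CT slice."""
    h, w = slice_2d.shape
    top, left = (h - size) // 2, (w - size) // 2
    return slice_2d[top:top + size, left:left + size]

def normalize(slice_2d):
    """Mean-std (zero-mean, unit-variance) normalization of one slice."""
    return (slice_2d - slice_2d.mean()) / (slice_2d.std() + 1e-8)

ct_slice = np.random.randn(512, 512)       # stand-in for a 512x512 CT slice
out = normalize(center_crop(ct_slice))
print(out.shape)  # (288, 288)
```

Cropping to 288 × 288 before normalization keeps the statistics focused on the thoracic region of interest rather than the surrounding air.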
Training proceeds in two stages. In the first stage, a five‑fold cross‑validation is performed without online augmentation, using a multiclass soft‑Dice loss that averages Dice scores over the five classes (background + four organs). In the second stage, the best model from stage one is fine‑tuned for 50 additional epochs with extensive online augmentations (random rotations, scaling, shearing, translations, and cropping). Because Dice loss does not explicitly address class imbalance—particularly the esophagus, which occupies a small volume—the authors replace it with the Tversky loss (α = β = 0.5) during fine‑tuning. This loss balances false positives and false negatives and improves performance on minority classes.
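The Tversky loss generalizes the soft-Dice loss by weighting false positives and false negatives separately. A minimal single-class sketch (the authors' implementation averages over all five classes):

```python
import numpy as np

def tversky_loss(y_true, y_pred, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky loss: 1 - TP / (TP + alpha*FP + beta*FN).
    With alpha = beta = 0.5 this reduces exactly to the soft-Dice loss;
    raising beta penalizes false negatives more, which helps small
    structures such as the esophagus."""
    tp = np.sum(y_true * y_pred)
    fp = np.sum((1.0 - y_true) * y_pred)
    fn = np.sum(y_true * (1.0 - y_pred))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

y_true = np.array([1, 1, 0, 0], dtype=float)
y_pred = np.array([1, 0, 0, 0], dtype=float)
print(tversky_loss(y_true, y_pred))  # 1 - 1/1.5 = 0.333...
```

Note that at α = β = 0.5 the Tversky index coincides with the Dice coefficient, so any gain from the switch at those settings comes from the combination with the second-stage augmentations rather than from a different loss surface.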
Optimization is carried out with the Adam optimizer (β₁ = 0.9, β₂ = 0.999), an initial learning rate of 1 × 10⁻⁴, and a decay factor of 0.2 when validation loss plateaus. The network is implemented in Keras and trained on an NVIDIA TITAN X‑Pascal GPU (12 GB). Post‑processing consists of retaining only the largest connected component for each organ, which marginally raises the average Dice (≈ 0.2 %) but substantially reduces the Hausdorff distance by eliminating spurious outliers.
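The largest-connected-component step can be sketched with a simple breadth-first search over a binary mask (in practice one would use a library routine such as `scipy.ndimage.label`; this dependency-free 2D version just illustrates the idea, which the authors apply per organ):

```python
from collections import deque

def largest_component(mask):
    """Keep only the largest 4-connected component of a binary 2D mask
    (list of lists of 0/1); everything else is zeroed out."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                comp, q = [], deque([(i, j)])  # BFS over one component
                seen[i][j] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            q.append((ny, nx))
                if len(comp) > len(best):
                    best = comp
    out = [[0] * w for _ in range(h)]
    for y, x in best:
        out[y][x] = 1
    return out

mask = [[1, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 1],
        [0, 1, 0, 1]]
print(largest_component(mask))  # only the 3-pixel right-hand component survives
```

Removing the smaller spurious blobs is exactly what drives the Hausdorff distance down, since HD is dominated by the farthest outlier pixel.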
Evaluation uses two standard metrics: Dice similarity coefficient (DSC) and Hausdorff distance (HD). The final model achieves an overall DSC of 91.57 % on the unseen test set, with organ‑specific results: esophagus 85.8 % (HD 0.331 mm), heart 94.1 % (HD 0.226 mm), trachea 92.6 % (HD 0.193 mm), and aorta 93.8 % (HD 0.297 mm). The authors also compare their results against the SharpMask + Conditional Random Field approach reported as the SegTHOR baseline.
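Of the two metrics, HD is the less familiar: it is the largest distance from any point on one boundary to the closest point on the other, taken symmetrically. A naive pure-Python sketch over 2D point sets (illustrative only; real evaluations operate on segmentation surfaces):

```python
def hausdorff(a, b):
    """Symmetric Hausdorff distance between two 2D point sets (Euclidean)."""
    def directed(p, q):
        # Farthest point in p from its nearest neighbour in q.
        return max(
            min(((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5 for (x2, y2) in q)
            for (x1, y1) in p
        )
    return max(directed(a, b), directed(b, a))

a = [(0, 0), (1, 0)]
b = [(0, 0), (3, 0)]
print(hausdorff(a, b))  # 2.0 -- (3,0) is 2 units from its nearest point in a
```

Because it is a maximum over boundary points, a single stray false-positive blob far from the organ inflates HD dramatically, which is why the connected-component post-processing helps this metric most.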