Artifact Reduction in Undersampled 3D Cone-Beam CTs using a Hybrid 2D-3D CNN Framework
Undersampled CT volumes minimize acquisition time and radiation exposure but introduce artifacts that degrade image quality and diagnostic utility. Reducing these artifacts is critical for high-quality imaging. We propose a computationally efficient hybrid deep-learning framework that combines the strengths of 2D and 3D models. First, a 2D U-Net operates on individual slices of undersampled CT volumes to extract feature maps. These slice-wise feature maps are then stacked across the volume and used as input to a 3D decoder, which exploits contextual information across slices to predict an artifact-free 3D CT volume. The proposed two-stage approach balances the computational efficiency of 2D processing with the volumetric consistency provided by 3D modeling. The results show substantial improvements in inter-slice consistency in the coronal and sagittal directions at low computational overhead. This hybrid framework presents a robust and efficient solution for high-quality 3D CT image post-processing. The code for this project is available on GitHub: https://github.com/J-3TO/2D-3DCNN_sparseview/.
💡 Research Summary
The paper addresses the persistent problem of artifacts that arise when cone‑beam computed tomography (CT) is acquired with a reduced number of projection views. While undersampling dramatically shortens scan time and lowers patient radiation dose, it introduces streaking, noise, and inter‑slice inconsistencies that limit diagnostic usefulness. Existing deep‑learning solutions either operate slice‑by‑slice with 2‑D convolutional networks—thereby ignoring the three‑dimensional nature of CT data—or employ fully 3‑D networks that are computationally demanding and often constrained to small model capacities.
To reconcile these opposing requirements, the authors propose a hybrid two‑stage framework that leverages the efficiency of 2‑D processing together with the volumetric consistency of 3‑D modeling. In the first stage, a conventional 2‑D U‑Net is trained on individual axial slices. The encoder consists of five down‑sampling blocks that reduce a 512 × 512 input to a 32 × 32 feature map, preserving rich local information while keeping memory usage modest. After training, this encoder is used to extract multi‑scale feature maps from every slice of a full CT volume. The resulting 2‑D feature maps are stacked along the slice dimension, forming a 3‑D tensor that encodes spatial context across neighboring slices.
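The first stage can be sketched as follows. This is a minimal PyTorch illustration, not the paper's exact architecture: `SliceEncoder`, the channel counts, and the demo sizes (48 slices of 64 × 64 instead of full 512 × 512 slices) are all assumptions chosen to keep the example small. The point is the data flow: run a 2-D encoder on every axial slice, then stack the slice-wise feature maps along a new depth axis to form a 3-D tensor.

```python
import torch
import torch.nn as nn

class SliceEncoder(nn.Module):
    """Toy stand-in for the paper's 2-D U-Net encoder (names and sizes are illustrative)."""
    def __init__(self, in_ch=1, base_ch=8, n_down=4):
        super().__init__()
        blocks, ch = [], in_ch
        for i in range(n_down):
            out_ch = base_ch * 2 ** i
            blocks += [nn.Conv2d(ch, out_ch, 3, padding=1), nn.ReLU(),
                       nn.MaxPool2d(2)]          # each block halves the in-plane size
            ch = out_ch
        self.net = nn.Sequential(*blocks)

    def forward(self, x):                        # x: (n_slices, 1, H, W)
        return self.net(x)

encoder = SliceEncoder().eval()
volume = torch.randn(48, 1, 64, 64)              # 48 axial slices, 64x64 for the demo
with torch.no_grad():
    feats = encoder(volume)                      # (48, C, h, w) slice-wise feature maps
# Stack along a new depth axis -> 3-D tensor (1, C, D, h, w) for the 3-D decoder
feat_volume = feats.permute(1, 0, 2, 3).unsqueeze(0)
print(feat_volume.shape)                         # torch.Size([1, 64, 48, 4, 4])
```

Because the encoder is frozen after 2-D training, the feature volumes can be precomputed once per scan, which is what makes the storage cost mentioned later an issue.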
The second stage introduces a 3‑D decoder that mirrors the architecture of the 2‑D decoder but replaces all 2‑D convolutions with 3‑D convolutions (kernel size 3 × 3 × 3). This decoder receives two inputs: the stacked 3‑D feature tensor and the original undersampled volume. By processing the data with 3‑D kernels, the network can learn inter‑slice dependencies and thereby suppress artifacts that manifest primarily in the coronal and sagittal planes. The overall pipeline is illustrated in Figure 1 of the paper.
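A matching sketch of the second stage, again illustrative rather than the paper's exact design: the channel schedule and the late-concatenation fusion of the undersampled volume are assumptions. What it does show is the two-input structure and the use of 3 × 3 × 3 kernels, which let the network mix information across neighboring slices.

```python
import torch
import torch.nn as nn

class Decoder3D(nn.Module):
    """Illustrative 3-D decoder: upsamples stacked 2-D feature maps with 3x3x3
    convolutions and fuses the original undersampled volume before the output
    layer. Channel sizes and the fusion scheme are assumptions."""
    def __init__(self, feat_ch=64, n_up=4):
        super().__init__()
        blocks, ch = [], feat_ch
        for _ in range(n_up):
            out_ch = max(ch // 2, 8)
            # Upsample only in-plane (H, W); the slice dimension D is already at full resolution
            blocks += [nn.Upsample(scale_factor=(1, 2, 2), mode='trilinear', align_corners=False),
                       nn.Conv3d(ch, out_ch, 3, padding=1), nn.ReLU()]
            ch = out_ch
        self.up = nn.Sequential(*blocks)
        self.head = nn.Conv3d(ch + 1, 1, 3, padding=1)   # +1 channel for the input volume

    def forward(self, feats, volume):      # feats: (B, C, D, h, w), volume: (B, 1, D, H, W)
        x = self.up(feats)                 # -> (B, c, D, H, W)
        x = torch.cat([x, volume], dim=1)  # fuse the undersampled volume
        return self.head(x)                # predicted artifact-free volume

decoder = Decoder3D().eval()
feats = torch.randn(1, 64, 48, 4, 4)       # stacked slice-wise features
volume = torch.randn(1, 1, 48, 64, 64)     # original undersampled CT volume
with torch.no_grad():
    pred = decoder(feats, volume)
print(pred.shape)                          # torch.Size([1, 1, 48, 64, 64])
```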
The experimental dataset originates from the RSNA Pulmonary Embolism Detection Challenge (2020), comprising 7,279 contrast‑enhanced chest CT scans. For each scan, a sparse‑view reconstruction was generated using a cone‑beam geometry with only 128 projection views, followed by filtered back‑projection (FDK) via the ASTRA toolbox. The authors reserved 100 scans for testing. For training the 2‑D U‑Net, 5,251 scans (≈2.6 M slices) were used for training and 1,313 scans for validation. The 3‑D decoder was trained on a much smaller set of 150 full volumes (training) and 50 volumes (validation) because the extracted feature maps consume substantial storage.
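The sparse-view simulation amounts to keeping only 128 gantry angles out of a dense set before reconstructing with FDK. A small NumPy sketch of the angle subsampling follows; the dense view count of 1,024 is an assumption for illustration (the paper specifies only the 128 retained views), and the actual geometry definition and FDK reconstruction would be done with the ASTRA toolbox.

```python
import numpy as np

# Sparse-view setup: retain 128 evenly spaced gantry angles over a full rotation.
# n_full = 1024 is an illustrative dense acquisition, not a value from the paper.
n_full, n_sparse = 1024, 128
full_angles = np.linspace(0, 2 * np.pi, n_full, endpoint=False)
sparse_idx = np.arange(0, n_full, n_full // n_sparse)
sparse_angles = full_angles[sparse_idx]

# ASTRA would take these angles when defining its cone-beam projection geometry,
# then run FDK (cone-beam filtered back-projection) on the sparse projections.
print(len(sparse_angles))                                         # 128
print(np.allclose(np.diff(sparse_angles), 2 * np.pi / n_sparse))  # True
```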
Training employed PyTorch Lightning, an MSE loss, and the AdamW optimizer. The 2‑D stage was run on an NVIDIA RTX 3090 GPU, while the 3‑D stage used an NVIDIA A100 GPU. During 3‑D training, 48 consecutive slices were randomly sampled to form each training batch, which helped to limit GPU memory usage.
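The 48-slice sampling strategy is simple to reproduce. A minimal sketch, with an assumed volume depth of 300 slices (the source does not state typical scan depths):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chunk(volume, chunk=48):
    """Randomly pick `chunk` consecutive axial slices from a full volume,
    as done for the 3-D training stage to bound GPU memory usage."""
    depth = volume.shape[0]
    start = rng.integers(0, depth - chunk + 1)
    return volume[start:start + chunk]

vol = np.zeros((300, 512, 512), dtype=np.float32)  # illustrative full CT volume
sub = sample_chunk(vol)
print(sub.shape)                                   # (48, 512, 512)
```

Sampling consecutive slices (rather than random ones) matters here: the 3-D kernels need genuine inter-slice continuity to learn volumetric consistency.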
Quantitative results are reported in Table 1. The baseline sparse‑view reconstructions achieved a PSNR of 24.81 dB and an SSIM of 0.637. After 2‑D U‑Net post‑processing, PSNR rose to 39.29 dB and SSIM to 0.949. The hybrid 3‑D decoder produced a PSNR of 38.09 dB and SSIM of 0.938. Although the 2‑D model yields slightly higher numerical scores, visual inspection (Figure 2) shows that the 2‑D approach leaves noticeable streaks in the coronal and sagittal views, whereas the 3‑D decoder markedly improves inter‑slice consistency, delivering smoother, artifact‑free volumes.
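For reference, the PSNR values above follow the standard definition; a minimal NumPy implementation is shown below (SSIM needs a windowed computation, e.g. `skimage.metrics.structural_similarity`, and is omitted). The toy inputs are illustrative, not data from the paper.

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    mse = np.mean((pred - target) ** 2)
    return 10 * np.log10(data_range ** 2 / mse)

target = np.ones((8, 8))
noisy = target + 0.01                 # constant error of 0.01 -> MSE = 1e-4
print(round(psnr(noisy, target), 1))  # 40.0
```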
Runtime analysis reveals a trade‑off: the 2‑D post‑processing requires only 2.4 ± 0.65 seconds per volume, whereas the full hybrid pipeline—including feature extraction and 3‑D decoding—takes 20.3 ± 5.77 seconds per volume on the A100. Thus, the 3‑D stage is roughly eight times slower but provides superior volumetric quality.
The discussion acknowledges that the modest performance gap (2‑D > 3‑D) likely stems from the limited amount of training data available for the 3‑D decoder and the overhead of storing pre‑computed features. The authors suggest that on‑the‑fly feature extraction, multi‑GPU distributed training, and larger training sets could close this gap. They also propose evaluating the impact of the hybrid reconstruction on downstream clinical tasks such as pulmonary embolism detection or organ segmentation, which would better quantify practical benefits.
In conclusion, the study demonstrates that a hybrid 2‑D‑3‑D convolutional neural network can effectively reduce undersampling artifacts in cone‑beam CT while preserving inter‑slice consistency, all with a computational cost that remains feasible on modern GPU hardware. The approach offers a promising compromise between the speed of 2‑D methods and the volumetric fidelity of fully 3‑D networks, paving the way for faster, lower‑dose CT protocols in both research and clinical settings.