One-pass Multi-task Networks with Cross-task Guided Attention for Brain Tumor Segmentation


Authors: Chenhong Zhou, Changxing Ding, Xinchao Wang

Chenhong Zhou, Changxing Ding, Xinchao Wang, Zhentai Lu, Dacheng Tao

Abstract—Class imbalance has emerged as one of the major challenges for medical image segmentation. The model cascade (MC) strategy, a popular scheme, significantly alleviates the class imbalance issue by running a set of individual deep models for coarse-to-fine segmentation. Despite its outstanding performance, however, this method leads to undesired system complexity and also ignores the correlation among the models. To handle these flaws in the MC approach, we propose in this paper a light-weight deep model, i.e., the One-pass Multi-task Network (OM-Net), which solves class imbalance better than MC does while requiring only one-pass computation for brain tumor segmentation. First, OM-Net integrates the separate segmentation tasks into one deep model, which consists of shared parameters to learn joint features, as well as task-specific parameters to learn discriminative features. Second, to optimize OM-Net more effectively, we take advantage of the correlation among tasks to design both an online training data transfer strategy and a curriculum learning-based training strategy. Third, we further propose sharing prediction results between tasks, which enables us to design a cross-task guided attention (CGA) module. By following the guidance of the prediction results provided by the previous task, CGA can adaptively recalibrate channel-wise feature responses based on category-specific statistics. Finally, a simple yet effective post-processing method is introduced to refine the segmentation results of the proposed attention network. Extensive experiments are conducted to demonstrate the effectiveness of the proposed techniques. Most impressively, we achieve state-of-the-art performance on the BraTS 2015 testing set and the BraTS 2017 online validation set.
Using these proposed approaches, we also won joint third place in the BraTS 2018 challenge among 64 participating teams. The code is publicly available at https://github.com/chenhong-zhou/OM-Net.

Index Terms—Brain tumor segmentation, magnetic resonance imaging, class imbalance, convolutional neural networks, multi-task learning, channel attention.

I. INTRODUCTION

Brain tumors are one of the most deadly cancers worldwide. Among these tumors, glioma is the most common type [1]. The average survival time for glioblastoma patients is less than 14 months [2]. Timely diagnosis of brain tumors is thus vital to ensuring appropriate treatment planning, surgery, and follow-up visits [3]. As a popular non-invasive technique, Magnetic Resonance Imaging (MRI) produces markedly different types of tissue contrast and has thus been widely used by radiologists to diagnose brain tumors [4]. However, the manual segmentation of brain tumors from MRI images is both subjective and time-consuming [5]. Therefore, it is highly desirable to design automatic and robust brain tumor segmentation tools.

(Corresponding author: Changxing Ding. C. Zhou and C. Ding are with the School of Electronic and Information Engineering, South China University of Technology, Guangzhou, 510000, China (e-mail: eezhouch@mail.scut.edu.cn; chxding@scut.edu.cn). X. Wang is with the Department of Computer Science, Stevens Institute of Technology, Hoboken, USA. Z. Lu is with the Guangdong Provincial Key Laboratory of Medical Image Processing, School of Biomedical Engineering, Southern Medical University, Guangzhou, China. D. Tao is with the UBTech Sydney Artificial Intelligence Centre and the School of Information Technologies, Faculty of Engineering and Information Technologies, The University of Sydney, Darlington, NSW 2008, Australia.)
Recently, deep learning-based methods such as convolutional neural networks (CNNs) [5]–[18] have become increasingly popular and achieved significant progress in brain tumor segmentation tasks. Unfortunately, a severe class imbalance problem usually emerges between healthy tissue and tumor tissue, as well as between intra-tumoral classes. This problem causes the healthy tissue to be dominant during the training phase and degrades the optimization quality of the model. To handle the class imbalance problem, many recent studies have employed the Model Cascade (MC) strategy [19]–[27]. More specifically, the MC strategy decomposes medical image segmentation into two or more tasks, each of which is handled by an individual model. The most common MC framework for segmentation tasks [19]–[25] incorporates two models: the first detects regions of interest (ROIs) via coarse segmentation, while the second conducts fine segmentation within the ROIs. Therefore, MC can effectively alleviate class imbalance via coarse-to-fine segmentation.

In spite of its effectiveness, however, MC has several disadvantages. First, it usually requires multiple deep models to be trained, which substantially increases both the system complexity and the storage space consumption. Second, each model is trained separately using its own training data, which ignores the correlation between the deep models. Third, MC runs the deep models one by one, which leads to alternate GPU-CPU computations and a lack of online interaction between tasks.

Here, we propose adopting multi-task learning to overcome the shortcomings of MC. In more detail, we aim to decompose multi-class brain tumor segmentation into three separate yet interconnected tasks. While MC trains one individual network for each task, as shown in Fig. 1(a), we incorporate the three tasks into a single model and propose a One-pass Multi-task Network (OM-Net).
The proposed OM-Net not only makes use of the relevance of these tasks to each other during the training stage, but also simplifies the prediction stage by implementing one-pass computation, as shown in Fig. 1(b). Furthermore, an effective training scheme inspired by curriculum learning is also designed: instead of training the three tasks together all the time, we gradually add the tasks to OM-Net in order of increasing difficulty; this is beneficial for improving the convergence quality of the model.

Fig. 1. Illustrations of (a) a three-model cascade pipeline and (b) our proposed OM-Net. The model cascade pipeline contains three networks that segment different tumor regions sequentially. OM-Net is a novel end-to-end deep model that simplifies prediction using one-pass computation.

Moreover, the fact that OM-Net integrates three tasks provides the possibility of online interaction between these tasks, which produces further benefits. First, the online training data transfer strategy we propose here enables the three tasks to share training data. Therefore, certain tasks obtain more training data, and the overall optimization quality can be improved. Second, we construct a novel channel attention module, named Cross-task Guided Attention (CGA), by sharing prediction results between tasks. In CGA, the prediction results of a preceding task can guide the following task to obtain category-specific statistics for each channel beforehand. This category-specific information further enables CGA to predict channel-wise dependencies with regard to a specific category of voxels. By contrast, existing self-attention models, such as the popular ‘squeeze & excitation’ (SE) block [28], do not make use of such external guidance. Without external guidance, SE blocks only predict a single weight for each channel.
However, there are usually multiple categories in a patch, and the importance of each channel varies across categories. The proposed CGA module handles this problem by predicting category-specific channel attention. To further refine the segmentation results of OM-Net, we also propose a new post-processing scheme.

The efficacy of the proposed methods is systematically evaluated on three popular brain tumor segmentation datasets, namely BraTS 2015, 2017, and 2018. Experimental results indicate that OM-Net outperforms MC, despite having only one-third of the model parameters of MC. The CGA module further promotes the performance of OM-Net by a significant margin. A preliminary version of this paper has previously been published in [29]. Compared with the conference version, this version proposes the novel CGA module, improves the post-processing method, and includes more experimental investigation.

The remainder of this paper is organized as follows. We briefly review related works on brain tumor segmentation in Section II, then provide the details of the OM-Net model in Section III. Experimental settings and datasets are detailed in Section IV, while the experimental results and analysis are presented in Section V. Finally, we conclude the paper in Section VI.

II. RELATED WORKS

In this section, we briefly review approaches in two domains related to our proposed model: namely, brain tumor segmentation and the attention mechanism.

A. Brain Tumor Segmentation

In recent years, deep learning-based methods such as CNNs have dominated the field of automatic brain tumor segmentation. The architectures of deep models [5]–[18] have developed rapidly from single-label prediction (classifying only the central voxel of the input patch) to dense prediction (making predictions for all voxels in the input patch simultaneously). For instance, Pereira et al.
[5] designed a deep model equipped with small convolutional kernels to classify the central voxel of the input 2D patch. Moreover, Havaei et al. [6] introduced a novel 2D two-pathway deep model to explore additional contextual information. The above methods make predictions based on 2D patches, ignoring 3D contextual information. To handle this problem, Kamnitsas et al. [7] introduced the DeepMedic model, which extracts information from 3D patches using 3D convolutional kernels. The abovementioned methods make predictions only for a single voxel or a set of central voxels within the input patch, meaning that they are slow in the inference stage. To promote efficiency, encoder-decoder architectures such as fully convolutional networks (FCNs) [8] and U-Net [9] have been widely adopted to realize dense prediction. For instance, Chen et al. [10] designed a voxelwise residual network (VoxResNet) to make predictions for all voxels within the input 3D patch. Zhao et al. [11] introduced a unified framework integrating FCNs and conditional random fields (CRFs) [30]. This framework realizes end-to-end dense prediction with appearance and spatial consistency.

The issue of class imbalance is commonly encountered in medical image segmentation, especially brain tumor segmentation. To address this problem, many recent studies have adopted the MC strategy [19]–[27] to perform coarse-to-fine segmentation. In particular, the common two-model cascaded framework has been widely adopted in many applications, including renal segmentation in dynamic contrast-enhanced MRI (DCE-MRI) images [19], cancer cell detection in phase contrast microscopy images [20], liver and lesion segmentation [21], volumetric pancreas segmentation in CT images [22], and calcium scoring in low-dose chest CT images [23]. Moreover, MC can incorporate more stages to achieve better segmentation performance. For instance, Wang et al.
[27] divided brain tumor segmentation into three successive binary segmentation problems: namely, the segmentation of the complete tumor, tumor core, and enhancing tumor areas in MRI images. Since MC effectively alleviates class imbalance, its results are very encouraging.

Despite its effectiveness, however, MC is cumbersome in terms of system complexity; moreover, it ignores the correlation among tasks. Accordingly, in this paper, we adopt multi-task learning to overcome the disadvantages inherent in MC. By sharing model parameters and training data, our proposed OM-Net outperforms MC using only one-third of the model parameters of MC.

B. Attention Mechanism

Attention is a popular tool in deep learning that highlights useful information in feature maps while suppressing irrelevant information. The majority of existing studies [28], [31]–[36] belong to the category of self-attention models, meaning that they infer attention based only on feature maps. These models can be roughly categorized into three types, namely hard regional attention, soft spatial attention, and channel attention. Moreover, some studies combine two or more types of attention in one unified model [32]–[34]. The spatial transformer network (STN) [31] is a representative example of a hard attention model. STN selects and reshapes important regions in feature maps to a canonical pose to simplify inference. STN operates at the coarse region level while neglecting fine pixel-level saliency [32]. In comparison, soft spatial attention models aim to evaluate pixel-wise importance in the spatial dimension. For instance, Wang et al. [33] proposed a residual attention learning method that adds soft weights to feature maps via a residual unit in order to refine the feature maps. Complementary to spatial attention, channel attention aims to recalibrate channel-wise feature responses.
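To make the channel-recalibration idea concrete, the canonical squeeze-and-excitation scheme can be sketched in a few lines of NumPy. This is a schematic only: in the real SE block the two gating matrices are learned end to end, whereas here they are plain function arguments.

```python
import numpy as np

def se_recalibrate(F, W1, W2):
    """Schematic squeeze-and-excitation channel attention.

    F : feature maps, shape (W, H, L, C).
    W1, W2 : gating weights of shapes (C, C//r) and (C//r, C);
    learned end to end in the real block, supplied here for illustration.
    """
    # Squeeze: global average pooling gives one statistic per channel.
    z = F.mean(axis=(0, 1, 2))                                   # (C,)
    # Excitation: bottleneck MLP + sigmoid yields one weight per channel.
    s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ W1, 0.0) @ W2)))    # (C,)
    # Recalibrate: every voxel of a channel is scaled by the same weight.
    return F * s
```

Note that the single weight per channel, applied uniformly to every voxel, is exactly the property the following discussion takes issue with for segmentation.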
The ‘squeeze & excitation’ (SE) block [28] is one of the most popular channel attention models due to its simplicity and efficiency. However, this model was originally proposed for image classification and object detection tasks and may not be optimal for image segmentation; this is because SE blocks are based on the average response of all voxels in each channel and recalibrate each channel with a single weight, regardless of which category the voxels belong to. However, there are usually multiple categories of voxels in one patch, and the importance of each channel varies between the different categories. Several existing studies have already tried to alleviate the abovementioned problems experienced by SE blocks in segmentation tasks. For example, Pereira et al. [35] designed a segmentation SE (SegSE) block that produces a channel descriptor for each voxel in the feature maps; consequently, the obtained channel attention map is of the same size as the feature maps. Voxel-wise multiplication between the feature maps and the attention map produces the re-weighted features.

Furthermore, the proposed CGA module also aims to solve the problems of SE blocks for segmentation. Unlike the existing self-attention models [28], [31]–[36], we make use of the special structure of OM-Net to provide cross-task guidance for the learning of category-specific channel attention.

III. METHOD

In this section, we first present a strong segmentation baseline based on MC, then introduce the model structure and training strategy of OM-Net. Next, we further explain the principles of OM-Net from the perspective of the attention mechanism, and subsequently propose the CGA module, which promotes the performance of OM-Net by predicting robust channel attention. Finally, we propose a simple but effective post-processing method in order to refine the segmentation results of the attention network.

A.
A Strong Baseline Based on Model Cascade

According to [3], tumors can be divided into the following classes: edema (ED), necrotic (NCR), non-enhancing tumor (NET), and enhancing tumor (ET). Following [37], we consistently merge NCR and NET into one class. Performance is evaluated on three re-defined tumor regions: complete tumor (including all tumor classes), tumor core (including all tumor classes except edema), and enhancing tumor (including only the enhancing tumor class). Under this definition, the three regions form a hierarchical structure of tumor subregions, each of which completely covers the subsequent one. Based on this observation, brain tumor segmentation can be decomposed into three separate yet interconnected tasks. In the following, we design an MC model that includes three independent networks and forms a strong baseline for OM-Net. Each network is trained for a specific task. These three tasks are detailed below.

1) Coarse segmentation of the complete tumor. We utilize the first network to detect the complete tumor region as an ROI. We randomly sample training patches within the brain and train the network on a five-class segmentation task: three tumor classes, normal tissue, and background. In the testing stage, we simply sum the predicted probabilities of all tumor classes to obtain a coarse tumor mask.

2) Refined segmentation of the complete tumor and its intra-tumoral classes. We dilate the above coarse tumor mask by 5 voxels in order to reduce false negatives. Next, the labels of all voxels in the dilated region are predicted again, as a five-class segmentation task, by the second network. Training data are sampled randomly within the dilated ground-truth complete tumor area.

3) Precise segmentation of the enhancing tumor. Due to the extreme class imbalance, it is very difficult to conduct precise segmentation of the enhancing tumor.
To handle this problem, a third network is introduced specifically for the segmentation of the enhancing tumor. Similarly, the training patches for this task are randomly sampled within the ground-truth tumor core area. It should be noted here that the training patches for the three tasks are sampled independently, even though the sampling areas for the three tasks are nested hierarchically.

The network structures for the above three tasks are the same except for the final classification layer. The adopted structure is a 3D variant of FusionNet [38], as shown in Fig. 2. We crop the MRI images to patches of size 32 × 32 × 16 × 4 voxels as input for the network. Here, the first three numbers correspond to the input volume, while the last number (4) denotes the four MRI modalities: FLAIR, T1-weighted (T1), T1 with gadolinium enhancing contrast (T1c), and T2-weighted (T2). Due to the lack of contextual information, the segmentation results of the boundary voxels in the patch may be inaccurate. Accordingly, we adopt the overlap-tile strategy proposed in [9] during inference. In brief, we sample 3D patches with a stride of 20 × 20 × 5 voxels within the MRI image. For each patch, we retain only the predictions of voxels within the central region (20 × 20 × 5 voxels) and abandon the predictions of the boundary voxels. The predictions of all central regions from the sampled patches are stitched together to constitute the segmentation results of the whole brain. This strategy is also utilized in the following models.

[Fig. 2 legend: convolutional layer 3 × 3 × 3 (BN) + ReLU; downsampling with max-pooling layer 2 × 2 × 2; residual block with 3 convolutional layers and 1 skip connection; upsampling with deconvolutional layer 2 × 2 × 2 (stride 2); convolutional layer 1 × 1 × 1.]

Fig. 2. The network structure for each task, which is composed of five basic building blocks.
Each block is represented by a different type of colored cube. The number below each cube refers to the number of feature maps. C equals 5, 5, and 2 for the first, second, and third tasks, respectively. SoftmaxWithLoss is adopted as the loss function and applied to the output of each task. (Best viewed in color)

During the inference stage of MC, the three networks have to be run one by one, since the ROI of one task is obtained by considering the results of all preceding tasks. More specifically, we employ the first network to generate a coarse mask for the complete tumor; subsequently, all voxels within the dilated area of the mask are classified by the second network, from which the precise complete tumor and tumor core areas can be obtained. Finally, the third network is utilized to scan all voxels in the tumor core region in order to determine the precise enhancing tumor area. Thus, three alternate GPU-CPU computations are carried out during the MC inference process.

B. One-pass Multi-task Network

Despite its promising performance, MC not only suffers from system complexity, but also neglects the relevance among tasks. We can observe that the essential difference among these tasks lies in the training data rather than the model architecture. Therefore, we propose a multi-task learning model that integrates the three tasks involved in MC into one network. Each task in this model has its own training data, which is exactly the same as the data in MC. Moreover, each task has an independent convolutional layer, a classification layer, and a loss layer. The other parameters are shared to make use of the correlation among the tasks. Thanks to the multi-task model, we can obtain the prediction results of the three classifiers simultaneously through one-pass computation. As a result, we name the proposed model the One-pass Multi-task Network, or OM-Net.
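The shared-backbone / task-specific-head split at the heart of the one-pass design can be illustrated with a toy forward pass. This is a deliberately minimal sketch: plain matrix products stand in for the actual 3D convolutional backbone and heads, and all names and shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_stub(x, w):
    """Stand-in for a convolutional layer: channel-mixing matmul + ReLU."""
    return np.maximum(x @ w, 0.0)

# Shared backbone parameters plus one task-specific head per task.
# Head widths mirror the text: 5, 5, and 2 output classes.
shared_w = rng.standard_normal((4, 32)) * 0.1
heads = {"complete": rng.standard_normal((32, 5)) * 0.1,
         "refine":   rng.standard_normal((32, 5)) * 0.1,
         "enhance":  rng.standard_normal((32, 2)) * 0.1}

def om_net_forward(patch):
    """One-pass forward: shared features feed all three task heads at once."""
    feat = conv_stub(patch, shared_w)          # shared parameters, joint features
    return {task: feat @ w for task, w in heads.items()}  # task-specific outputs

outs = om_net_forward(rng.random((10, 4)))     # 10 voxels, 4 MRI modalities
```

The point of the sketch is the control flow: one backbone evaluation serves all three classifiers, in contrast to MC's three separate networks.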
As the difficulty levels of these three tasks progressively increase, we propose to train OM-Net more effectively by employing curriculum learning [39], which is useful for improving the convergence quality of machine learning models. More specifically, instead of training the three tasks together all the time, we gradually introduce the tasks to the model in order of increasing difficulty. The model structure and training strategy of OM-Net are illustrated in Fig. 3. First, OM-Net is trained with only the first task in order to learn the basic knowledge required to differentiate between tumors and normal tissue. This training process lasts until the loss curve displays a flattening trend. The second task is then added to OM-Net, meaning that the first and second tasks are trained together. As illustrated in Fig. 3, we concatenate Data-1 and Data-2 along the batch dimension to form the input for OM-Net. We split the features generated by the shared backbone model along the batch dimension; here, the splitting position in the batch dimension is the same as the concatenation position of the training data. We then obtain task-specific features and use the sliced features to optimize task-specific parameters.

Moreover, we argue that not only knowledge (model parameters) but also learning material (training data) can be transferred from the easier course (task) to the more difficult course (task) during curriculum learning. Since the sampling areas for the three tasks are hierarchically nested, we propose the following online training data transfer strategy. The training patches in Data-1 that satisfy the following sampling condition can be transferred to assist the training of the second task:

\frac{\sum_{i=1}^{N} \mathbf{1}\{l_i \in C_{complete}\}}{N} \geq 0.4, \qquad (1)

where l_i is the label of the i-th voxel in the patch, C_complete denotes the set of all tumor classes, N is the number of voxels in the input patch, and 0.4 is set to meet the patch sampling condition of the second task. We thus concatenate the features of these patches in Data-1 with Feature-2, then compute the loss for the second task. The training process in this step continues until the loss curve of the second task displays a flattening trend. Finally, the third task and its training data are introduced to OM-Net, meaning that all three tasks are trained together. The concatenation and slicing operations are similar to those in the second step. The training patches from Data-1 and Data-2 that satisfy the following sampling condition can be transferred to the third task:

\frac{\sum_{i=1}^{N} \mathbf{1}\{l_i \in C_{core}\}}{N} \geq 0.5, \qquad (2)

where C_core indicates the tumor classes belonging to the tumor core. Similarly, 0.5 is chosen to meet the patch sampling condition of the third task. The threshold in Ineq. (1) is smaller than that in Ineq. (2); this is because the center points of training patches for the second task are sampled within the dilated area of the complete tumor.

Fig. 3. Network structure of OM-Net in the training stage. For the i-th task, its training data, feature maps, and outputs of the classification layer are denoted as Data-i, Feature-i, and Output-i, respectively. The light blue rectangles marked with ‘Concat’ and ‘Split’ represent the concatenation and splitting operations, respectively, while the blue arrows represent the batch dimension. SoftmaxWithLoss is adopted as the loss function and applied to the output of each task. The shared backbone model refers to the network layers outlined by the yellow dashed line in Fig. 2.
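The transfer conditions in Ineqs. (1) and (2) are simple per-patch label statistics. A minimal sketch, assuming a hypothetical integer label coding in which the tumor classes form C_complete and the core classes form C_core:

```python
import numpy as np

# Hypothetical label coding: 0 = background, 1 = normal tissue,
# 2 = edema, 3 = necrosis/non-enhancing tumor, 4 = enhancing tumor.
C_COMPLETE = {2, 3, 4}          # all tumor classes
C_CORE = {3, 4}                 # tumor core classes (edema excluded)

def tumor_fraction(labels, classes):
    """Fraction of voxels in the patch whose label lies in `classes`."""
    return np.isin(labels, list(classes)).mean()

def transferable_to_task2(labels):
    """Ineq. (1): a Data-1 patch may assist task 2 if >= 40% of its
    voxels carry a complete-tumor label."""
    return tumor_fraction(labels, C_COMPLETE) >= 0.4

def transferable_to_task3(labels):
    """Ineq. (2): a Data-1/Data-2 patch may assist task 3 if >= 50% of
    its voxels carry a tumor-core label."""
    return tumor_fraction(labels, C_CORE) >= 0.5
```

In training, patches passing these checks would simply be routed to the additional task's loss as well, which is all the "online transfer" amounts to computationally.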
By comparison, the center points of training patches for the third task are sampled within the ground-truth area of the tumor core. This means that one training patch for the second task may include more than 50% non-tumor voxels, while one training patch for the third task includes at most 50% non-core voxels. Therefore, we set the thresholds in Ineq. (1) and Ineq. (2) to 0.4 and 0.5, respectively. The three tasks are trained together until convergence occurs.

In conclusion, OM-Net equipped with the curriculum learning-based training strategy has three main components: 1) a deep model based on multi-task learning that realizes coarse-to-fine segmentation via one-pass computation; 2) a stepwise training scheme that progresses from easy to difficult; 3) the transfer of training data from the easier tasks to the more difficult tasks.

In the inference stage, the data concatenation, feature slicing, and data transfer operations in Fig. 3 are removed, and the 3D patches of an MRI image are fed into the shared backbone model. Feature-1, Feature-2, and Feature-3 are now the same for each patch. The prediction results of the three tasks can be obtained simultaneously by OM-Net. These results are fused in exactly the same way as in the MC baseline. Moreover, OM-Net is different from the existing multi-task learning models for brain tumor segmentation [40], [41]. The principle behind these models [40], [41] involves the provision of multiple supervisions for the same training data. By contrast, OM-Net aims to achieve coarse-to-fine segmentation by integrating tasks with their own training data into a single model.

C. Cross-task Guided Attention

The coarse-to-fine segmentation strategy adopted by OM-Net can be regarded as a type of cascaded spatial attention, since the segmentation results of one task determine the ROI for the following task. In the following, we further enhance the performance of OM-Net from the perspective of channel attention.
In particular, we propose a novel and effective channel attention model that makes use of cross-task guidance to solve the problems experienced by the popular SE block in the segmentation task. As explained in Section II, the global average pooling (GAP) operation in the SE block ignores the dramatic variation in the volume of each class within the input patch. We solve this problem by computing statistics in category-specific regions rather than over a whole patch. However, the category-specific regions for common CNNs are unknown until we reach the final classification layer; therefore, this is a chicken-and-egg problem. Fortunately, OM-Net allows us to estimate category-specific regions beforehand by sharing the prediction results between tasks. More specifically, in the training stage, we let Feature-2 and Feature-3 (shown in Fig. 3) pass through the first and second tasks of OM-Net, respectively. In this way, we obtain the coarse segmentation results for the second and third tasks, respectively. It is worth noting that this strategy introduces only negligible additional computation in the training stage and no extra computation in the testing stage. This is because Feature-1, Feature-2, and Feature-3 are exactly the same in the testing stage, as the concatenation and slicing operations in Fig. 3 are removed during testing. Since we introduce cross-task guidance for the proposed channel attention block, we refer to it as Cross-task Guided Attention (CGA). Once equipped with CGA, Classifier-2 and Classifier-3 in Fig. 3 are renamed CGA-tumor and CGA-core, respectively. The overall architecture of OM-Net equipped with CGA modules is illustrated in Fig. 4. Note that cross-task guidance takes place only in the forward pass, meaning that the back-propagations of the three tasks are still independent. As illustrated in Fig. 5, we take CGA-tumor as an example to illustrate the structure of the CGA module.
This module is composed of two blocks: (a) a category-specific channel importance (CSCI) block, and (b) a complementary segmentation (CompSeg) block. In Fig. 5, all 4D feature maps and probability maps are simplified into 3D cubes to enable better visualization. In more detail, the height of one cube denotes the number of channels in the feature maps. P ∈ R^{W×H×L×5} denotes a probability tensor predicted by the preceding task and is represented as a pink cube. The number 5 next to the cube for P refers to the number of classes, which we have explained in the MC baseline. The grey cube F ∈ R^{W×H×L×32} denotes the input feature maps of the current task for the CGA module. Moreover, the number 32 next to the cube for F denotes its number of channels.

Fig. 4. The overall architecture of OM-Net equipped with CGA modules in the training stage. D-i, F-i, and O-i are abbreviations for Data-i, Feature-i, and Output-i, which denote the training data, features, and outputs for the i-th task, respectively. Feature-2 and Feature-3 pass through the first and second tasks, respectively, to obtain their own coarse prediction results beforehand. We refer to these two operations as the cross-task forward flow for Data-2 and Data-3, represented by the green and red dotted lines, respectively. These coarse predictions are utilized as cross-task guidance to help generate category-specific channel attention. SoftmaxWithLoss is adopted as the loss function and applied to the output of each task. The training data transfer strategy described in Fig. 3 is omitted from this figure for clarity.

1) CSCI Block: As illustrated in Fig.
5(a), the CSCI block utilizes both P, estimated by the preceding task, and F to estimate the importance of each channel for the segmentation of the tumor and non-tumor categories, respectively. More specifically, we first compute P_t and P_n, which refer to the probabilities of one voxel belonging to the tumor and non-tumor categories, respectively:

P_t(i, j, k) = \sum_{c \in C_{tumor}} P(i, j, k, c), \qquad (3)

P_n(i, j, k) = \sum_{c \in C_{non-tumor}} P(i, j, k, c), \qquad (4)

where {P_t, P_n} ∈ R^{W×H×L×1}. C_tumor and C_non-tumor refer to the sets of classes that belong to the tumor and non-tumor categories, respectively. C_tumor includes all tumor classes, while C_non-tumor contains normal tissue and background. We then reshape P_t and P_n to R^{N×1}, respectively, where N is equal to W × H × L and denotes the number of voxels in a patch. Similarly, F is reshaped to R^{N×32}. Subsequently, we perform matrix multiplication between the reshaped F and the reshaped P_t, then apply L1 normalization to the obtained vector:

m_t(i) = \frac{Re(F)^T_{i,:} \cdot Re(P_t)}{\sum_{k=1}^{32} Re(F)^T_{k,:} \cdot Re(P_t)}, \qquad (5)

where m_t ∈ R^{32×1}, while Re(·) denotes the reshape operation. Similarly,

m_n(i) = \frac{Re(F)^T_{i,:} \cdot Re(P_n)}{\sum_{k=1}^{32} Re(F)^T_{k,:} \cdot Re(P_n)}. \qquad (6)

Elements in m_t and m_n describe the importance of each channel for the segmentation of the tumor and non-tumor categories, respectively.

Fig. 5. Structure of the CGA-tumor module. (a) Category-specific channel importance (CSCI) block; (b) complementary segmentation (CompSeg) block. The elements in m_t and m_n describe the importance of each channel for one category of voxels within the patch, and are utilized to recalibrate the channel-wise feature responses.
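The CSCI computation in Eqs. (3)–(6) reduces to a matrix product followed by L1 normalization. A minimal NumPy sketch (function names hypothetical; it assumes post-ReLU, non-negative features so that L1 normalization is a plain division by the sum):

```python
import numpy as np

def csci_weights(F, P, tumor_classes):
    """Category-specific channel importance, Eqs. (3)-(6).

    F : feature maps, shape (W, H, L, 32), assumed non-negative (post-ReLU).
    P : class probabilities from the preceding task, shape (W, H, L, 5).
    tumor_classes : indices of tumor classes along P's last axis.
    Returns (m_t, m_n): L1-normalized channel-importance vectors, shape (32,).
    """
    # Eqs. (3)-(4): collapse class probabilities into tumor / non-tumor maps.
    P_t = P[..., tumor_classes].sum(axis=-1)   # (W, H, L)
    P_n = 1.0 - P_t                            # class probabilities sum to 1
    # Reshape F to (N, 32) with N = W*H*L voxels, as in the text.
    C = F.shape[-1]
    F2 = F.reshape(-1, C)
    # Eqs. (5)-(6): per-channel inner product with the probability map,
    # then L1 normalization over the 32 channels.
    m_t = F2.T @ P_t.reshape(-1)
    m_n = F2.T @ P_n.reshape(-1)
    return m_t / m_t.sum(), m_n / m_n.sum()
```

Each entry of `m_t` is large when a channel responds strongly precisely where the preceding task predicts tumor, which is the category-specific statistic that a GAP-based weight cannot provide.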
Compared with the popular SE block, which squeezes the global information of each channel into a single value in order to describe its importance, CGA makes use of finer category-specific statistics to evaluate the importance of each channel for one specific category.

2) CompSeg Block: Inspired by [42], we further propose a complementary segmentation (CompSeg) block that performs segmentation via two complementary pathways. These two pathways make full use of the category-specific channel importance information (m_t and m_n) to improve segmentation performance. As shown in Fig. 5(b), the two pathways focus on the segmentation of the tumor and non-tumor voxels, respectively. The CompSeg block can be described in more detail as follows. First, m_t and m_n are used to recalibrate each channel in F, respectively:

U_t = F_scale(m_t, F) = [m_t^1 f_1, m_t^2 f_2, ..., m_t^32 f_32],    (7)

U_n = F_scale(m_n, F) = [m_n^1 f_1, m_n^2 f_2, ..., m_n^32 f_32],    (8)

where f_i ∈ R^{W×H×L} is the i-th channel in F, and m_t^i and m_n^i are the i-th elements in m_t and m_n, respectively. The recalibrated feature maps U_t and U_n highlight the more important channels and suppress the less important ones for tumors and non-tumors, respectively. These maps are then individually fed into a 1×1×1 convolutional classification layer to produce their own score maps, S_t and S_n, where {S_t, S_n} ∈ R^{W×H×L×C} and C refers to the number of classes for the current task. The two score maps are more sensitive to the tumor and non-tumor classes, respectively. Therefore, we merge the two score maps via weighted averaging:

S̃(i, j, k, c) = P_t(i, j, k) · S_t(i, j, k, c) + P_n(i, j, k) · S_n(i, j, k, c),    (9)

where S̃ ∈ R^{W×H×L×C}. Finally, we feed S̃ into another 1×1×1 convolutional layer to obtain the ultimate prediction results S for the current task. As illustrated in Fig.
5, P_t and P_n are used twice in the CGA module. The first time, we use P_t and P_n in the CSCI block to provide category-specific probabilities, from which we calculate m_t and m_n; these two vectors embed the interdependencies between channels with regard to different categories. The second time, we use them in the CompSeg block as soft spatial masks to merge the two score maps by means of weighted averaging and produce the final segmentation results. Because all tasks are integrated in OM-Net, we are able to obtain P_t and P_n for use as cross-task guidance to compute category-specific statistics for each channel, thereby obtaining improved channel attentions. By comparison, the popular SE block ignores category-specific statistics and reweights each channel with a single weight. In the experiment section, we justify the effectiveness of both the CSCI block and the CompSeg block.

The model structures of the CGA-tumor and CGA-core modules are almost the same; there are only two trivial and intuitive differences. As illustrated in Fig. 4, the first difference lies in the position where the cross-task guidance is introduced; for the CGA-core module, it is introduced from the second task of OM-Net. Second, the counterparts of P_t and P_n in CGA-core indicate the probability of each voxel belonging to the core and non-core tumor categories, respectively.

D. Post-processing

We can observe from many prior studies [5]–[7], [11], [12], [43], [44] that post-processing is an efficient way to improve segmentation performance by refining the results of CNNs. For example, some small clusters of the predicted tumors are removed in [5], [12], [43], [44]. In addition, a conditional random field (CRF) is commonly used as a post-processing step in [6], [7]. In particular, Zhao et al. [11] proposed a post-processing method comprising six steps that boosts segmentation performance by a large margin.
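Returning to the CGA module above, the CompSeg computation of Eqs. (7)–(9) can be sketched in the same NumPy style. The two learned 1×1×1 convolutional classifiers are replaced here by plain per-voxel linear projections; `Wt` and `Wn` are random stand-ins for learned weights, not the paper's parameters:

```python
import numpy as np

def compseg(F, m_t, m_n, P_t, P_n, Wt, Wn):
    """Complementary segmentation block (Eqs. 7-9).

    F: (W, H, L, 32) features; m_t, m_n: (32,) channel importances;
    P_t, P_n: (W, H, L) category probabilities; Wt, Wn: (32, C) linear
    maps standing in for the 1x1x1 convolutional classifiers.
    """
    # Eqs. (7)-(8): channel-wise recalibration (broadcast over voxels).
    U_t = F * m_t
    U_n = F * m_n
    # Per-voxel "1x1x1 convolution": 32 channels -> C class scores.
    S_t = U_t @ Wt                     # (W, H, L, C)
    S_n = U_n @ Wn
    # Eq. (9): merge with category probabilities as soft spatial masks.
    return P_t[..., None] * S_t + P_n[..., None] * S_n

# Toy usage with C = 2 classes for the current task.
rng = np.random.default_rng(1)
F = rng.random((4, 4, 2, 32))
m_t = rng.random(32); m_t /= m_t.sum()
m_n = rng.random(32); m_n /= m_n.sum()
P_t = rng.random((4, 4, 2)); P_n = 1.0 - P_t
S_merged = compseg(F, m_t, m_n, P_t, P_n,
                   rng.random((32, 2)), rng.random((32, 2)))
```

The final 1×1×1 convolution applied to S̃ in the paper is omitted here; the sketch only covers the recalibrate-classify-merge steps.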
In this paper, we introduce a simple and flexible post-processing method in order to refine the predictions of the proposed networks. Our method is mainly inspired by [11], but consists of fewer steps and adopts K-means clustering to achieve automatic classification, rather than relying on the manually defined voxel-intensity thresholds of [11].

Step 1: We remove isolated small clusters with volumes smaller than a threshold τ_VOL = min(2000, 0.1 × V_max), where V_max denotes the volume of the largest 3D connected tumor area predicted by the proposed model. This step can slightly improve the Dice score for the complete tumor, as false positives are removed.

Step 2: It is observed that non-enhancing voxels are likely to be misclassified as edema if the predicted enhancing tumor area is small [11]. Accordingly, we propose a K-means-based method to handle this problem, as follows. Let vol_e and vol_t denote the volumes of enhancing tumor and complete tumor in the predicted results, respectively. Moreover, vol_e(n) and vol_t(n) refer to the volumes of enhancing tumor and complete tumor in the n-th 3D connected tumor area, respectively. If vol_e/vol_t < 0.1, vol_e(n)/vol_t(n) < 0.05, and vol_e(n) < 1000, the K-means clustering algorithm is employed. Based on their intensity values in the MRI images, the segmented edema voxels in the n-th connected component are clustered into two groups. Finally, the average intensity of each group in the T1c channel is computed. We convert the labels of the voxels in the group with the lower average intensity to the non-enhancing class; the labels of the voxels in the other group remain unchanged. In the experiment section, we show that the second step significantly improves the Dice score of the tumor core. As this step only changes the labels of voxels predicted as edema, it does not affect the segmentation results of the complete tumor or enhancing tumor.
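The two post-processing steps above can be sketched as follows. This is a simplified illustration: the 3D connected-component extraction is assumed to be done elsewhere, and a minimal hand-rolled 1-D 2-means stands in for the K-means clustering:

```python
import numpy as np

def volume_threshold(v_max):
    """Step 1: tau_VOL = min(2000, 0.1 * V_max); predicted clusters
    smaller than this volume are removed as false positives."""
    return min(2000, 0.1 * v_max)

def needs_relabeling(vol_e, vol_t, vol_e_n, vol_t_n):
    """Step 2 trigger: the enhancing tumor is small relative to the
    complete tumor, globally and within the n-th connected component."""
    return vol_e / vol_t < 0.1 and vol_e_n / vol_t_n < 0.05 and vol_e_n < 1000

def two_means_1d(x, iters=20):
    """Minimal 1-D 2-means on voxel intensities (K-means stand-in)."""
    c = np.array([x.min(), x.max()], dtype=float)
    for _ in range(iters):
        labels = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                c[k] = x[labels == k].mean()
    return labels, c

def edema_to_nonenhancing_mask(t1c_intensities):
    """Step 2: cluster the edema voxels of one connected component into
    two intensity groups; voxels in the lower-mean group (True in the
    returned mask) are converted to the non-enhancing class."""
    labels, centers = two_means_1d(np.asarray(t1c_intensities, dtype=float))
    return labels == centers.argmin()
```

With a clearly bimodal set of T1c intensities, `edema_to_nonenhancing_mask` flags exactly the low-intensity group for relabeling, mirroring the rule described above.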
IV. EXPERIMENTAL SETUP

In this section, we describe the datasets used to validate our approaches, the evaluation metrics, and the implementation details.

A. Datasets

To demonstrate the effectiveness of the proposed methods, we conduct experiments on the BraTS 2018 [3], [45]–[48], BraTS 2017 [3], [45]–[47], and BraTS 2015 [3], [49] datasets. There are four modalities for each MRI image. All images in the three datasets have been co-registered, interpolated, and skull-stripped. The dimensions of all images are 240 × 240 × 155 voxels.

The BraTS 2018 dataset contains three subsets: the training set, testing set, and validation set. The training set comprises 210 cases of high-grade gliomas (HGG) and 75 cases of low-grade gliomas (LGG). The testing and validation sets contain 191 cases and 66 cases respectively, with hidden ground truth. The evaluation metrics on the testing and validation sets are computed using an online evaluation platform [50].

The BraTS 2017 dataset shares an identical training set with BraTS 2018. Compared with BraTS 2018, it has a smaller validation set comprising 46 cases. The evaluation of the validation set is conducted online [50].

The BraTS 2015 dataset consists of a training set including 274 MRI images and a testing set including 110 MRI images. Performance evaluation on the testing set is also conducted using an online evaluation platform [51].

B. Evaluation Metrics

We follow the official evaluation metrics for each dataset.
There are a number of different metrics, namely the Dice score, Positive Predictive Value (PPV), Sensitivity, and Hausdorff distance, each of which is defined below:

Dice = 2TP / (FP + 2TP + FN),    (10)

PPV = TP / (FP + TP),    (11)

Sensitivity = TP / (TP + FN),    (12)

Haus(T, P) = max{ sup_{t∈T} inf_{p∈P} d(t, p), sup_{p∈P} inf_{t∈T} d(t, p) },    (13)

where the numbers of false negative, true negative, true positive, and false positive voxels are denoted as FN, TN, TP, and FP, respectively. sup represents the supremum and inf the infimum, while t and p denote the points on the surface T of the ground-truth regions and the surface P of the predicted regions, respectively. d(·, ·) is the function that computes the distance between points t and p.

The Dice score, PPV, and Sensitivity measure the voxel-wise overlap between the ground truth and the predicted results [3]. The Hausdorff distance evaluates the distance between the surface of the ground-truth regions and that of the predicted regions. Moreover, Hausdorff95 is a variant of the Hausdorff distance that measures the 95% quantile of the surface distance. As the Dice score is the overall evaluation metric adopted consistently across all the BraTS challenges, we adopt it as the main metric for evaluation, in line with existing works [5]–[7], [11], [12], [15], [27], [35], [40], [41], [43], [44], [48].

C. Implementation Details

During pre-processing, we normalize the voxel intensities within the brain area to have zero mean and unit variance for each MRI modality. The numbers of training patches are around 400,000, 400,000, and 200,000 for the first, second, and third tasks, respectively. SoftmaxWithLoss is adopted as the loss function throughout. All implementations are based on the C3D package [52]–[54] (https://github.com/facebook/C3D), which is a modified 3D version of Caffe [54].
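The metric definitions in Eqs. (10)–(13) of Section IV-B above translate directly into code. This is a sketch for binary masks and small point sets; the online evaluation platform's exact implementation may differ:

```python
import numpy as np

def counts(gt, pred):
    """True positives, false positives, false negatives for boolean masks."""
    tp = np.logical_and(gt, pred).sum()
    fp = np.logical_and(~gt, pred).sum()
    fn = np.logical_and(gt, ~pred).sum()
    return tp, fp, fn

def dice(gt, pred):                      # Eq. (10)
    tp, fp, fn = counts(gt, pred)
    return 2 * tp / (fp + 2 * tp + fn)

def ppv(gt, pred):                       # Eq. (11)
    tp, fp, _ = counts(gt, pred)
    return tp / (fp + tp)

def sensitivity(gt, pred):               # Eq. (12)
    tp, _, fn = counts(gt, pred)
    return tp / (tp + fn)

def hausdorff(T, P):                     # Eq. (13)
    """Symmetric Hausdorff distance between two point sets of shape (n, d)."""
    d = np.linalg.norm(T[:, None, :] - P[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Tiny example: masks over four voxels.
gt = np.array([1, 1, 0, 0], dtype=bool)
pred = np.array([1, 0, 1, 0], dtype=bool)
```

For Hausdorff95, the outer `max` over minimum distances would be replaced by the 95th percentile of those distances.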
The models are trained using stochastic gradient descent with a momentum of 0.99 and a batch size of 20 for each task. The learning rate of all networks is initially set to 0.001 and is divided by 2 after every four epochs. We train each network in MC for 20 epochs. Similarly, we train OM-Net for 1 epoch, 1 epoch, and 18 epochs in its three steps, respectively. Therefore, the three tasks in OM-Net are trained for 20, 19, and 18 epochs, respectively.

TABLE I
ABLATION STUDIES ON THE LOCAL VALIDATION SUBSET OF BRATS 2018

Method         | Parameters | Dice (%): Complete | Core  | Enhancing
MC1            | 13.813 M   | 90.41              | 78.48 | 72.91
MC2            | 27.626 M   | 91.08              | 79.11 | 75.14
MC3            | 41.439 M   | 91.08              | 79.11 | 79.53
OM-Net         | 13.869 M   | 91.10              | 79.87 | 80.87
OM-Net_0       | 13.869 M   | 90.40              | 79.41 | 79.96
OM-Net_d       | 13.869 M   | 91.11              | 79.93 | 80.26
OM-Net + SE    | 13.870 M   | 91.03              | 80.20 | 80.72
OM-Net + CGA   | 13.814 M   | 91.34              | 82.15 | 80.73
OM-Net + CGA−  | 13.814 M   | 91.06              | 80.28 | 80.78
OM-Net + CGA_t | 13.814 M   | 90.65              | 80.27 | 80.10
OM-Net + CGA_n | 13.814 M   | 89.75              | 79.87 | 76.00
OM-Net_p       | 13.869 M   | 91.28              | 82.50 | 80.84
OM-Net + CGA_p | 13.814 M   | 91.59              | 82.74 | 80.73

V. EXPERIMENTAL RESULTS AND DISCUSSION

We first carry out ablation studies to demonstrate the validity of each contribution proposed in this paper. We then compare the performance of the proposed methods with state-of-the-art brain tumor segmentation approaches on the BraTS 2015, 2017, and 2018 datasets.

A. Ablation Studies

The training set of BraTS 2018 is randomly divided into two subsets to enable convenient evaluation: a training subset and a local validation subset, consisting of 260 and 25 MRI images respectively. Quantitative results on this local validation subset are presented in Table I. Here, the one-model, two-model, and three-model cascades are denoted as MC1, MC2, and MC3, respectively.
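The step learning-rate schedule from the implementation details above can be sketched as follows. Whether the halving happens exactly at every fourth epoch boundary is an assumption about the schedule's precise timing:

```python
def learning_rate(epoch, base_lr=0.001):
    """Halve the base learning rate after every four epochs,
    as described in the implementation details."""
    return base_lr / (2 ** (epoch // 4))
```

Under this reading, epochs 0–3 train at 0.001, epochs 4–7 at 0.0005, and so on.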
In the following, we conduct a series of experiments to prove the effectiveness of each component of the proposed approach.

1) Effectiveness of the Model Cascade Strategy: From Table I, we can observe that as the number of models in MC increases, the Dice scores steadily improve. These results prove the contribution of each deep network in MC. Unfortunately, the number of parameters also increases with the number of models, leading to increased storage consumption and system complexity.

2) Effectiveness of the One-pass Multi-task Network: We compare the performance of OM-Net with that of the MC strategy in Table I. Despite having only one-third of the parameters of MC3, OM-Net consistently obtains better segmentation performance, especially in terms of Dice scores on the tumor core and enhancing tumor. Moreover, we additionally train OM-Net_0 (a naive multi-task learning model without stepwise training or training data transfer) and OM-Net_d (a multi-task learning model without stepwise training, but with training data transfer). It can be seen that OM-Net outperforms both OM-Net_0 and OM-Net_d; this demonstrates the effectiveness of the data transfer strategy and the curriculum learning-based training strategy.

3) Effectiveness of Cross-task Guided Attention: To compare the performance of the CGA module with that of the SE block, we further test the OM-Net + SE model, where an SE block is inserted before each Classifier-i (1 ≤ i ≤ 3) module of OM-Net in Fig. 3. Experimental results in Table I show that OM-Net + CGA outperforms both OM-Net and OM-Net + SE. In particular, it outperforms OM-Net by as much as 2.28% in Dice score on the tumor core region. There is a slight performance drop of 0.14% for the enhancing tumor; however, on much larger datasets (see Tables II, III, and IV), where the experimental results are more stable, we observe that CGA consistently improves the Dice score on the enhancing tumor.
In comparison, there is no clear difference in performance between OM-Net and OM-Net + SE. This can be explained from two perspectives. First, the GAP operation in the SE block ignores category-specific statistics; second, recalibrating each channel with the same weight for all categories is suboptimal for segmentation. The proposed CGA module effectively handles the above two problems and therefore achieves better performance than the SE block. In addition, the model size of OM-Net + CGA is smaller than that of both OM-Net and OM-Net + SE. We can thus safely attribute the performance gains to the CGA module rather than to additional parameters.

Next, we conduct additional experimental investigation into the CGA module in order to prove the validity of both the CSCI block and the CompSeg block.

• To justify the effectiveness of the CSCI block, we visualize some feature maps produced by the shared backbone model, as shown in Fig. 6. It should be noted that, in the interests of intuitive and clear visualization, we choose to visualize the feature maps of a complete 2D slice rather than those of a 3D patch. To achieve this, we stitch F-2 and O-1 of the 3D patches in Fig. 4 to form the whole feature maps and probability maps, respectively. We then select the feature maps F and probability map P of a certain slice and calculate m_t and m_n corresponding to this slice, according to Eqs. 5 and 6, respectively. The channels corresponding to the five largest and five smallest values in m_t are presented in Fig. 6(c) and Fig. 6(d), respectively. Similarly, the channels with the five largest and five smallest values in m_n are presented in Fig. 6(e) and Fig. 6(f), respectively. As deconvolution layers are used in the model, there are inevitable checkerboard artifacts; however, these do not affect the observations. It is clear that the feature maps shown in Fig.
6(c) do indeed have strong responses in the tumor region, which should be highlighted for the segmentation of the tumor region. In contrast, the feature maps in Fig. 6(d) have weak responses in the tumor region but strong responses in the non-tumor region; therefore, they will be suppressed in CGA for the segmentation of the tumor region. Similarly, consistent observations can be made in Fig. 6(e) and Fig. 6(f). The above analysis proves the validity of the CSCI block in generating the category-specific channel dependence.

• Furthermore, we also justify the effectiveness of the CompSeg block, where P_t and P_n are used a second time. We additionally train a model that does not use P_t and P_n but simply performs an element-wise addition between S_t and S_n in Fig. 5(b), denoted as OM-Net + CGA−. Experimental results in Table I reveal that OM-Net + CGA significantly outperforms OM-Net + CGA−; this is because OM-Net + CGA employs soft spatial masks (P_t and P_n) to fuse two complementary prediction results. This performance comparison proves the validity of P_t and P_n as used in the CompSeg block. In addition, we also test another two models: OM-Net + CGA_t, whose CompSeg block only includes the upper branch in Fig. 5(b), and OM-Net + CGA_n, whose CompSeg block only incorporates the lower branch in Fig. 5(b). The experimental results are reported in Table I. From the table, it can be seen that OM-Net + CGA outperforms both OM-Net + CGA_t and OM-Net + CGA_n. Accordingly, we can conclude that the two pathways in the CompSeg block make full use of the complementary information, which is beneficial for the segmentation task.

4) Effectiveness of Post-processing: To justify the effectiveness of the proposed post-processing operation, we apply it to refine the segmentation results of both OM-Net and OM-Net + CGA, denoted as OM-Net_p and OM-Net + CGA_p respectively in Table I.
First, compared with OM-Net, OM-Net_p slightly improves the Dice score for the complete tumor, owing to the false positives removed in the first post-processing step; meanwhile, it significantly improves the Dice score of the tumor core by 2.6% because of the second post-processing step. Moreover, the performance comparison between OM-Net + CGA and OM-Net + CGA_p shows that the post-processing operation consistently brings about performance improvements for the complete tumor and tumor core regions.

In conclusion, the above experimental results justify the effectiveness of the proposed techniques. Qualitative comparisons between MC3, OM-Net, OM-Net + CGA, and OM-Net + CGA_p are also provided in Fig. 7. It is clear that the proposed methods steadily improve the quality of brain tumor segmentation, which is consistent with the quantitative comparisons in Table I.

Fig. 6. Visualization of the feature maps output from the shared backbone model in OM-Net + CGA. We present the feature maps of a complete 2D slice for intuitive and clear visualization. (a) The Flair modality on the 75th slice of the sample BraTS18_2013_21_1 in the BraTS 2018 training set. (b) Its corresponding ground truth. (c) Heat maps corresponding to the channels with the five largest values in m_t. (d) Heat maps corresponding to the channels with the five smallest values in m_t. (e) Heat maps corresponding to the channels with the five largest values in m_n. (f) Heat maps corresponding to the channels with the five smallest values in m_n.

B. Performance Comparison on the BraTS 2015 Testing Set

In this experiment, the performance of MC3, OM-Net, and OM-Net + CGA is evaluated on the testing set of the BraTS 2015 dataset. Each of these models is trained using the entire training set of the dataset. Experimental results are tabulated in Table II.
From this table, we can make the following observations. First, we compare the segmentation performance of MC3, OM-Net, OM-Net + CGA, and OM-Net + CGA_p. It is evident that OM-Net has clear advantages over MC3, with 1% higher Dice scores on both the tumor core and enhancing tumor. OM-Net + CGA further promotes the Dice score of OM-Net by 1% on the enhancing tumor. Moreover, following refinement by the proposed post-processing method, the Dice scores of OM-Net + CGA improve significantly, by 1% and 4% on the complete tumor and tumor core, respectively. These results are consistent with the comparisons on the local validation subset of BraTS 2018.

Second, we compare the performance of OM-Net + CGA_p with state-of-the-art methods. Our results exhibit clear advantages over those obtained by the comparison methods [7], [11], [44], [55]. In particular, our results outperform the popular DeepMedic model [7] by 2%, 8%, and 2% in terms of Dice scores on the complete tumor, tumor core, and enhancing tumor, respectively. Furthermore, OM-Net + CGA_p also outperforms the method in [11], which adopts more pre-processing and post-processing operations. At the time of this submission, OM-Net + CGA_p ranks first on the online leaderboard of BraTS 2015, demonstrating the effectiveness of the proposed methods.

C. Performance Comparison on the BraTS 2017 Validation Set

Since access to the testing set of BraTS 2017 was closed after the challenge, we evaluate the proposed methods on the online validation set and compare them with other participants in Table III.

First, we train the MC3, OM-Net, OM-Net + SE, and OM-Net + CGA models using the training subset of 260 MRI images. The following observations can be made. OM-Net outperforms MC3, especially in terms of Dice scores on the enhancing tumor. The SE block cannot improve the performance of OM-Net in terms of Dice scores; in fact, it reduces the Dice score of OM-Net by 1.18% on the enhancing tumor.
In comparison, OM-Net + CGA outperforms both OM-Net and OM-Net + SE by a significant margin. Specifically, its Dice scores are 1.88% and 2.09% higher than those of OM-Net on the tumor core and enhancing tumor, respectively; this again proves the effectiveness of the CGA module. In addition, OM-Net + CGA_p considerably improves the Dice score on the tumor core, demonstrating the effectiveness of the proposed post-processing method.

Second, to further boost the performance of OM-Net + CGA, we divide the training data into ten folds and obtain an ensemble system using the following scheme: we train one model using nine folds of training data and pick the best snapshot on the remaining fold. We repeat this process to obtain 10 models as an ensemble (denoted as OM-Net + CGA*). A similar strategy was used in [43]. Table III shows that OM-Net + CGA* consistently obtains higher Dice scores than OM-Net + CGA. We further apply the proposed post-processing method to OM-Net + CGA*, denoted as OM-Net + CGA*_p. It can clearly be seen that the proposed post-processing method improves the Dice score of OM-Net + CGA* by as much as 1.48% for the tumor core.

Third, we present comparisons between OM-Net + CGA*_p and some state-of-the-art methods on the online validation leaderboard, which comprises more than 60 entries. It is clear that OM-Net + CGA*_p outperforms all other methods in terms of Dice scores. It is further worth noting that other top entries, such as Kamnitsas et al. [12], also combined multiple models to boost performance; moreover, Wang et al. [27] integrated nine single-view models from three orthogonal views to achieve excellent performance.

Fig. 7. Example segmentation results on the local validation subset of BraTS 2018. From left to right: ground truth, MC3, OM-Net, OM-Net + CGA, and OM-Net + CGA_p results overlaid on the FLAIR image; edema (green), necrosis and non-enhancing (blue), and enhancing (red).
The above comparisons demonstrate the superiority of our proposed methods.

D. Performance Comparison on the BraTS 2018 Dataset

We also make additional comparisons on the BraTS 2018 dataset. As BraTS 2018 and 2017 share the same training dataset, we directly evaluate the same models from the previous experiments on the validation set of BraTS 2018. The BraTS 2018 Challenge is intensely competitive, with more than 100 entries displayed on the online validation leaderboard. Therefore, we only present comparisons between our methods and the top entries in Table IV. Our observations are as follows.

First, the comparison results between MC3, OM-Net, OM-Net + SE, and OM-Net + CGA are consistent with those in Table III. We can see that OM-Net achieves better performance than MC3 on the enhancing tumor by a visible margin. Moreover, despite having only a single model and using no post-processing, OM-Net is able to outperform more than 70% of the entries on the leaderboard. In addition, OM-Net + CGA outperforms OM-Net by 1.26% and 1.45% in terms of Dice scores on the tumor core and enhancing tumor, respectively. By comparison, SE cannot improve the performance of OM-Net in terms of Dice scores.

Second, by implementing the model ensemble and the post-processing operation, OM-Net + CGA*_p obtains higher Dice scores as expected, achieving very competitive performance on the leaderboard. It is worth noting that the model ensemble strategy was also applied in [12], [15], [43]. The approach described in [15] also decomposes the multi-class brain tumor segmentation into three tasks.
It achieves top performance by training 10 models as an ensemble, the inputs of which are large patches of size 160 × 192 × 128 voxels. Large input patches lead to considerable memory consumption; therefore, 32 GB GPUs are employed in [15] to train this model. In comparison, OM-Net utilizes small patches of size 32 × 32 × 16 voxels, making it memory-efficient and capable of being trained or deployed on low-cost GPU devices. We can thus conclude that OM-Net is very competitive and has its own advantages.

TABLE II
PERFORMANCE ON THE BRATS 2015 TESTING SET (%)

Method               | Dice (Complete/Core/Enh.) | PPV (Complete/Core/Enh.) | Sensitivity (Complete/Core/Enh.)
MC3                  | 86 / 70 / 63              | 86 / 82 / 60             | 88 / 67 / 72
OM-Net               | 86 / 71 / 64              | 86 / 83 / 61             | 88 / 68 / 72
OM-Net + CGA         | 86 / 71 / 65              | 87 / 84 / 63             | 88 / 67 / 70
OM-Net + CGA_p       | 87 / 75 / 65              | 89 / 85 / 63             | 88 / 73 / 70
Isensee et al. [55]  | 85 / 74 / 64              | 83 / 80 / 63             | 91 / 73 / 72
Chen et al. [44]     | 85 / 72 / 61              | 86 / 83 / 66             | 86 / 68 / 63
Zhao et al. [11]     | 84 / 73 / 62              | 89 / 76 / 63             | 82 / 76 / 67
Kamnitsas et al. [7] | 85 / 67 / 63              | 85 / 86 / 63             | 88 / 60 / 67

TABLE III
MEAN VALUES OF DICE AND HAUSDORFF95 METRICS ON THE BRATS 2017 VALIDATION SET

Method                | Dice (Enh./Whole/Core)     | Hausdorff95 (mm) (Enh./Whole/Core)
MC3                   | 0.7424 / 0.8991 / 0.7937   | 4.9901 / 4.6085 / 8.5537
OM-Net                | 0.7534 / 0.9007 / 0.7934   | 3.6547 / 7.2524 / 8.4676
OM-Net + SE           | 0.7416 / 0.8997 / 0.7938   | 3.5115 / 6.2859 / 7.0154
OM-Net + CGA          | 0.7743 / 0.8988 / 0.8122   | 3.8820 / 4.8380 / 6.7953
OM-Net + CGA_p        | 0.7743 / 0.9016 / 0.8320   | 3.8820 / 4.6663 / 6.7312
OM-Net + CGA*         | 0.7852 / 0.9065 / 0.8274   | 3.2991 / 4.4886 / 6.9896
OM-Net + CGA*_p       | 0.7852 / 0.9071 / 0.8422   | 3.2991 / 4.3815 / 7.5614
Wang et al. [27]      | 0.7859 / 0.9050 / 0.8378   | 3.2821 / 3.8901 / 6.4790
MIC DKFZ              | 0.7756 / 0.9027 / 0.8194   | 3.1626 / 6.7673 / 8.6419
inpm                  | 0.7723 / 0.8998 / 0.8085   | 4.7852 / 9.0029 / 7.2359
xfeng                 | 0.7511 / 0.8922 / 0.7991   | 4.7547 / 16.3018 / 8.6847
Kamnitsas et al. [12] | 0.738 / 0.901 / 0.797      | 4.50 / 4.23 / 6.56
More impressively, benefitting from the techniques proposed in this paper, we obtained the joint third position among 64 teams on the testing set of the BraTS 2018 Challenge (https://www.med.upenn.edu/sbia/brats2018/rankings.html). More detailed results of this challenge are reported in [48]. In conclusion, the effectiveness of the proposed methods has been demonstrated through comparisons on the above three datasets.

TABLE IV
MEAN VALUES OF DICE AND HAUSDORFF95 METRICS ON THE BRATS 2018 VALIDATION SET

Method                | Dice (Enh./Whole/Core)     | Hausdorff95 (mm) (Enh./Whole/Core)
MC3                   | 0.7732 / 0.9015 / 0.8233   | 4.1624 / 4.7198 / 7.6082
OM-Net                | 0.7882 / 0.9034 / 0.8273   | 3.1003 / 6.5218 / 7.1974
OM-Net + SE           | 0.7791 / 0.9034 / 0.8259   | 2.9950 / 5.7685 / 6.4289
OM-Net + CGA          | 0.8027 / 0.9033 / 0.8399   | 3.4437 / 4.7609 / 6.4339
OM-Net + CGA_p        | 0.8027 / 0.9052 / 0.8536   | 3.4437 / 4.6236 / 6.3892
OM-Net + CGA*         | 0.8112 / 0.9074 / 0.8461   | 2.8697 / 4.9105 / 6.6243
OM-Net + CGA*_p       | 0.8111 / 0.9078 / 0.8575   | 2.8810 / 4.8840 / 6.9322
Myronenko [15]        | 0.8233 / 0.9100 / 0.8668   | 3.9257 / 4.5160 / 6.8545
SHealth               | 0.8154 / 0.9120 / 0.8565   | 4.0461 / 4.2362 / 7.2181
Isensee et al. [43]†  | 0.8048 / 0.9072 / 0.8514   | 2.81 / 5.23 / 7.23
MedAI                 | 0.8053 / 0.9104 / 0.8545   | 3.6695 / 4.1369 / 5.9821
BIGS2                 | 0.8054 / 0.9104 / 0.8506   | 2.7543 / 4.8444 / 7.4548
SCAN                  | 0.7925 / 0.9008 / 0.8474   | 3.6035 / 4.0626 / 4.9885
† To facilitate fair comparison, we report the performance of [43] without private training data.

VI. CONCLUSION

In this paper, we propose a novel model, OM-Net, for brain tumor segmentation that is tailored to handle the class imbalance problem. Unlike the popular MC framework, OM-Net requires only one-pass computation to perform coarse-to-fine segmentation. OM-Net is superior to MC because it not only significantly reduces the model size and system complexity, but also thoroughly exploits the correlation between the tasks by sharing parameters, training data, and even prediction results.
In particular, we propose a CGA module that makes use of cross-task guidance information to learn category-specific channel attention, enabling it to significantly outperform the popular SE block. In addition, we introduce a novel and effective post-processing method for refining the segmentation results in order to achieve better accuracy. Extensive experiments were conducted on three popular datasets; the results of these experiments prove the effectiveness of the proposed OM-Net model, and further demonstrate that OM-Net has clear advantages over existing state-of-the-art methods for brain tumor segmentation.

REFERENCES

[1] A. Işın, C. Direkoğlu, and M. Şah, "Review of MRI-based brain tumor image segmentation using deep learning methods," Procedia Comput. Sci., vol. 102, pp. 317–324, 2016.
[2] E. G. Van Meir, C. G. Hadjipanayis, A. D. Norden, H.-K. Shu, P. Y. Wen, and J. J. Olson, "Exciting new advances in neuro-oncology: the avenue to a cure for malignant glioma," CA: A Cancer Journal for Clinicians, vol. 60, no. 3, pp. 166–193, 2010.
[3] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest et al., "The multimodal brain tumor image segmentation benchmark (BRATS)," IEEE Trans. Med. Imag., vol. 34, no. 10, pp. 1993–2024, 2015.
[4] S. Bauer, R. Wiest, L.-P. Nolte, and M. Reyes, "A survey of MRI-based medical image analysis for brain tumor studies," Phys. Med. Biol., vol. 58, no. 13, pp. R97–R129, 2013.
[5] S. Pereira, A. Pinto, V. Alves, and C. A. Silva, "Brain tumor segmentation using convolutional neural networks in MRI images," IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1240–1251, 2016.
[6] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal, P.-M. Jodoin, and H. Larochelle, "Brain tumor segmentation with deep neural networks," Med. Imag. Anal., vol. 35, pp. 18–31, 2017.
[7] K. Kamnitsas, C.
Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker, "Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation," Med. Imag. Anal., vol. 36, pp. 61–78, 2017.
[8] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3431–3440.
[9] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent, 2015, pp. 234–241.
[10] H. Chen, Q. Dou, L. Yu, J. Qin, and P.-A. Heng, "VoxResNet: Deep voxelwise residual networks for brain segmentation from 3D MR images," NeuroImage, vol. 170, pp. 446–455, 2018.
[11] X. Zhao, Y. Wu, G. Song, Z. Li, Y. Zhang, and Y. Fan, "A deep learning model integrating FCNNs and CRFs for brain tumor segmentation," Med. Imag. Anal., vol. 43, pp. 98–111, 2018.
[12] K. Kamnitsas, W. Bai, E. Ferrante, S. McDonagh, M. Sinclair, N. Pawlowski, M. Rajchl, M. Lee, B. Kainz, D. Rueckert et al., "Ensembles of multiple models and architectures for robust brain tumour segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent Brainlesion Workshop, 2017, pp. 450–462.
[13] M. Saha and C. Chakraborty, "Her2Net: A deep framework for semantic segmentation and classification of cell membranes and nuclei in breast cancer evaluation," IEEE Trans. Image Process., vol. 27, no. 5, pp. 2189–2200, 2018.
[14] A. Farag, L. Lu, H. R. Roth, J. Liu, E. Turkbey, and R. M. Summers, "A bottom-up approach for pancreas segmentation using cascaded superpixels and (deep) image patch labeling," IEEE Trans. Image Process., vol. 26, no. 1, pp. 386–399, 2017.
[15] A. Myronenko, "3D MRI brain tumor segmentation using autoencoder regularization," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist.
Intervent Brainlesion W orkshop , 2018, pp. 311–320. [16] H. Fehri, A. Gooya, Y . Lu, E. Meijering, S. A. Johnston, and A. F . Frangi, “Bayesian polytrees with learned deep features for multi-class cell segmentation, ” IEEE T rans. Image Pr ocess. , 2019. [17] D. Xiang, H. T ian, X. Y ang, F . Shi, W . Zhu, H. Chen, and X. Chen, “ Automatic segmentation of retinal layer in OCT images with choroidal neov ascularization, ” IEEE Tr ans. Imag e Pr ocess. , vol. 27, no. 12, pp. 5880–5891, 2018. [18] C. Zhou, S. Chen, C. Ding, and D. T ao, “Learning contextual and attentiv e information for brain tumor segmentation, ” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent Brainlesion W orkshop , 2018, pp. 497–507. [19] M. Haghighi, S. K. W arfield, and S. Kurugol, “ Automatic renal segmen- tation in DCE-MRI using con volutional neural networks, ” in Pr oc. IEEE Int. Symposium on Biomed. Imag. (ISBI) , 2018, pp. 1534–1537. [20] H. Hu, Q. Guan, S. Chen, Z. Ji, and L. Y ao, “Detection and recognition for life state of cell cancer using two-stage cascade CNNs, ” IEEE/ACM T rans. Comput. Biol. Bioinf. , 2017. [21] P . F . Christ, M. E. A. Elshaer, F . Ettlinger, S. T atavarty , M. Bickel, P . Bilic, M. Rempfler , M. Armbruster , F . Hofmann, M. DAnastasi et al. , “ Automatic liv er and lesion segmentation in CT using cascaded fully con volutional neural networks and 3D conditional random fields, ” in Pr oc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent , 2016, pp. 415–423. [22] Z. Zhu, Y . Xia, W . Shen, E. K. Fishman, and A. L. Y uille, “ A 3D coarse- to-fine framework for automatic pancreas segmentation, ” arXiv preprint arXiv:1712.00201 , 2017. [23] N. Lessmann, B. van Ginneken, M. Zreik, P . A. de Jong, B. D. de V os, M. A. V ierge ver , and I. I ˇ sgum, “ Automatic calcium scoring in low-dose chest CT using deep neural networks with dilated conv olutions, ” IEEE T rans. Med. Imag . , vol. 37, no. 2, pp. 615–625, 2018. [24] J. Zhang, M. Liu, and D. 
Shen, “Detecting anatomical landmarks from limited medical imaging data using two-stage task-oriented deep neural networks, ” IEEE Tr ans. Image Pr ocess. , vol. 26, no. 10, pp. 4753–4764, 2017. [25] J. Guo, W . Zhu, F . Shi, D. Xiang, H. Chen, and X. Chen, “ A framework for classification and segmentation of branch retinal artery occlusion in SD-OCT , ” IEEE Tr ans. Image Pr ocess. , vol. 26, no. 7, pp. 3518–3527, 2017. [26] Y . T ang and X. W u, “Scene text detection and segmentation based on cascaded conv olution neural networks, ” IEEE Tr ans. Image Pr ocess. , vol. 26, no. 3, pp. 1509–1520, 2017. [27] G. W ang, W . Li, S. Ourselin, and T . V ercauteren, “ Automatic brain tumor segmentation using cascaded anisotropic con volutional neural networks, ” in Pr oc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent Brainlesion W orkshop , 2017, pp. 178–190. [28] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks, ” in Proc. IEEE Conf. Comput. V is. P attern Recognit. , 2018, pp. 7132–7141. [29] C. Zhou, C. Ding, Z. Lu, X. W ang, and D. T ao, “One-pass multi-task con volutional neural networks for ef ficient brain tumor segmentation, ” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent , 2018, pp. 637–645. [30] P . Kr ¨ ahenb ¨ uhl and V . Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials, ” in Proc. Adv . Neural Inf. Process. Syst. , 2011, pp. 109–117. [31] M. Jaderberg, K. Simonyan, A. Zisserman et al. , “Spatial transformer networks, ” in Pr oc. Adv . Neural Inf. Pr ocess. Syst. , 2015, pp. 2017–2025. [32] W . Li, X. Zhu, and S. Gong, “Harmonious attention network for person re-identification, ” in Pr oc. IEEE Conf. Comput. V is. P attern Recognit. , vol. 1, 2018, p. 2. [33] F . W ang, M. Jiang, C. Qian, S. Y ang, C. Li, H. Zhang, X. W ang, and X. T ang, “Residual attention network for image classification, ” in Proc. IEEE Conf. Comput. V is. P attern Recognit. , 2017, pp. 6450–6458. [34] A. G. 
Ro y , N. Nav ab, and C. W achinger , “Concurrent spatial and channel squeeze & excitation in fully con volutional networks, ” in Pr oc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent , 2018, pp. 421–429. [35] S. Pereira, V . Alves, and C. A. Silva, “ Adaptive feature recombination and recalibration for semantic segmentation: application to brain tumor segmentation in MRI, ” in Pr oc. Int. Conf. Med. Image Comput. Comput.- Assist. Intervent , 2018, pp. 706–714. [36] Y . Zhu, C. Zhao, H. Guo, J. W ang, X. Zhao, and H. Lu, “ Attention couplenet: Fully convolutional attention coupling network for object detection, ” IEEE T rans. Image Process. , vol. 28, no. 1, pp. 113–126, 2019. [37] Section for Biomedical Image Analysis, MICCAI BraTS 2018. [Online]. A vailable: https://www .med.upenn.edu/sbia/brats2018/data.html [38] T . M. Quan, D. G. Hilderbrand, and W .-K. Jeong, “Fusionnet: A deep fully residual conv olutional neural network for image segmentation in connectomics, ” arXiv pr eprint arXiv:1612.05360 , 2016. [39] Y . Bengio, J. Louradour, R. Collobert, and J. W eston, “Curriculum learning, ” in Proc. Int. Conf . Mach. Learn. (ICML) , 2009, pp. 41–48. [40] H. Shen, R. W ang, J. Zhang, and S. McKenna, “Multi-task fully con volutional netw ork for brain tumour segmentation, ” in Annu. Conf. Med. Imag. Understand. Anal. (MIUA) , 2017, pp. 239–248. [41] H. Shen, R. W ang, J. Zhang, and S. J. McKenna, “Boundary-aware fully con volutional network for brain tumor segmentation, ” in Pr oc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent , 2017, pp. 433–441. [42] R. Dey and Y . Hong, “CompNet: Complementary segmentation network for brain MRI extraction, ” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent , 2018, pp. 628–636. [43] F . Isensee, P . Kickingereder, W . W ick, M. Bendszus, and K. H. Maier- Hein, “No new-net, ” in Proc. Int. Conf. Med. Image Comput. Comput.- Assist. Intervent Brainlesion W orkshop , 2018, pp. 234–244. 14 [44] X. 
Chen, J. H. Liew , W . Xiong, C.-K. Chui, and S.-H. Ong, “Focus, segment and erase: An efficient network for multi-label brain tumor segmentation, ” in Proc. Eur . Conf. Comput. V is. , 2018, pp. 674–689. [45] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby , J. B. Freymann, K. Farahani, and C. Da vatzikos, “ Adv ancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features, ” Nat. Sci. Data , vol. 4, p. 170117, 2017. [46] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby , J. Freymann, K. Farahani, and C. Dav atzikos, “Segmentation labels and radiomic features for the pre-operative scans of the TCGA-LGG collection, ” Cancer Imaging Arc h. , 2017. [47] S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby , J. Freymann, K. Farahani, and C. Dav atzikos, “Segmentation labels and radiomic features for the pre-operative scans of the TCGA-GBM collection, ” Cancer Imaging Arc h. , 2017. [48] S. Bakas, M. Reyes, A. Jakab, S. Bauer, M. Rempfler , A. Crimi, R. T . Shinohara, C. Ber ger, S. M. Ha, M. Rozycki et al. , “Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and ov erall surviv al prediction in the BRA TS challenge, ” arXiv preprint arXiv:1811.02629 , 2018. [49] M. Kistler , S. Bonaretti, M. Pfahrer , R. Niklaus, and P . B ¨ uchler , “The virtual skeleton database: an open access repository for biomedical research and collaboration, ” J . Med. Internet Res. , vol. 15, no. 11, 2013. [50] Center for Biomedical Image Computing and Analytics, University of Pennsylvania, “Image processing portal - https://ipp.cbica.upenn.edu/, ” A web accessible platform for ima ging analytics , 2015. [51] V irtualSkeleton, BRA TS 2013 Sep. 30, 2015 [Online]. A vailable: https://www .virtualskeleton.ch/BRA TS/Start2015 [52] D. Tran, L. Bourdev , R. Fergus, L. T orresani, and M. 
Paluri, “Learning spatiotemporal features with 3D con volutional networks, ” in Proc. IEEE Int. Conf. Comput. V is. , 2015, pp. 4489–4497. [53] A. Karpathy , G. T oderici, S. Shetty , T . Leung, R. Sukthankar , and L. Fei-Fei, “Large-scale video classification with con volutional neural networks, ” in Pr oc. IEEE Conf. Comput. V is. P attern Recognit. , 2014, pp. 1725–1732. [54] Y . Jia, E. Shelhamer , J. Donahue, S. Karayev , J. Long, R. Girshick, S. Guadarrama, and T . Darrell, “Caffe: Conv olutional architecture for fast feature embedding, ” in Pr oc. 22nd ACM Int. Conf. Multimedia , 2014, pp. 675–678. [55] F . Isensee, P . Kickingereder, W . W ick, M. Bendszus, and K. H. Maier- Hein, “Brain tumor segmentation and radiomics surviv al prediction: Contribution to the BRA TS 2017 challenge, ” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent Brainlesion W orkshop , 2017, pp. 287–297.
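To make the cross-task guided attention idea concrete, the following is a minimal NumPy sketch of the core mechanism: instead of the global average pooling used by the SE block, the channel descriptor is computed only over the region predicted by the previous task, so the resulting channel gate reflects category-specific statistics. This is an illustrative simplification, not the authors' implementation; the function name, the two-layer gate shape, and the random toy weights are all assumptions for demonstration.

```python
import numpy as np

def cross_task_channel_attention(feat, prior_mask, w1, w2):
    """Simplified sketch of cross-task guided channel attention.

    feat: (C, H, W) feature map of the current task.
    prior_mask: (H, W) binary prediction from the previous task.
    w1, w2: weights of a two-layer bottleneck gate, as in SE.
    """
    C = feat.shape[0]
    # Category-specific descriptor: pool features only inside the
    # region predicted by the previous task, not over the whole map.
    area = prior_mask.sum() + 1e-6
    z = (feat * prior_mask).reshape(C, -1).sum(axis=1) / area  # (C,)
    # Two-layer gate (ReLU then sigmoid), as in squeeze-and-excitation.
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0.0))))  # (C,)
    # Recalibrate channel-wise feature responses.
    return feat * s[:, None, None]

# Toy usage with random weights (illustrative only).
rng = np.random.default_rng(0)
C, r = 8, 2                                 # channels, reduction ratio
feat = rng.standard_normal((C, 6, 6))
mask = np.zeros((6, 6)); mask[2:5, 2:5] = 1.0   # prior task's prediction
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
out = cross_task_channel_attention(feat, mask, w1, w2)
assert out.shape == feat.shape
```

Because the sigmoid gate lies in (0, 1), each channel is attenuated rather than amplified; the useful effect comes from channels relevant to the predicted region being suppressed less than irrelevant ones.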