Improving RetinaNet for CT Lesion Detection with Dense Masks from Weak RECIST Labels

Martin Zlocha, Qi Dou, and Ben Glocker
Biomedical Image Analysis Group, Imperial College London, UK
{martin.zlocha15, qi.dou, b.glocker}@imperial.ac.uk

Abstract. Accurate, automated lesion detection in Computed Tomography (CT) is an important yet challenging task due to the large variation of lesion types, sizes, locations and appearances. Recent work on CT lesion detection employs two-stage region proposal based methods trained with centroid or bounding-box annotations. We propose a highly accurate and efficient one-stage lesion detector, by re-designing a RetinaNet to meet the particular challenges in medical imaging. Specifically, we optimize the anchor configurations using a differential evolution search algorithm. For training, we leverage the response evaluation criteria in solid tumors (RECIST) annotations which are measured in clinical routine. We incorporate dense masks from weak RECIST labels, obtained automatically using GrabCut, into the training objective, which in combination with other advancements yields new state-of-the-art performance. We evaluate our method on the public DeepLesion benchmark, consisting of 32,735 lesions across the body. Our one-stage detector achieves a sensitivity of 90.77% at 4 false positives per image, significantly outperforming the best reported methods by over 5%.

1 Introduction

Detection and localization of abnormalities in Computed Tomography (CT) scans is a critical routine task for radiologists. Accurate, automated detection of suspicious regions could greatly support screening, diagnosis and monitoring of disease progression. Most previous work focuses on a specific type of lesion within a relatively constrained (anatomical) context, such as lymph nodes, lung nodules and brain microbleeds. Recently, Yan et al.
[15] pioneered the study of universal lesion detection and introduced today's largest data repository, i.e., the DeepLesion dataset. Detecting diverse types of lesions across the body using one single model is very challenging due to the large variation of lesion types, sizes, locations and heterogeneous appearances. For example, DeepLesion consists of eight types of lesions with diameters ranging from 0.21 to 342.5 mm. In addition, lesions may appear with limited contrast compared to nearby normal tissue, which further increases the difficulty of detecting subtle signs of disease.

Automated lesion detection has been central in medical image computing. Recent work employs two-stage methods with candidate proposal and false positive reduction steps. State-of-the-art performance on the DeepLesion benchmark has been achieved by Yan et al. [13]. They propose a two-stage, region-based method called 3DCE to effectively incorporate 3D context into 2D regional CNNs. Their method achieves a sensitivity of 85.65% at 4 false positives per image, outperforming the popular detection method Faster R-CNN [7] on the same dataset. However, their detection sensitivity for small lesions is much lower, which is an important limitation in the critical context of detecting early signs of disease.

Some recent work takes advantage of mask information for improving detection accuracy. Jaeger et al. [4] propose a Retina U-Net, showing that aggregating pixel-wise supervision to train the detector is helpful. Their method shows effectiveness in two scenarios, i.e., lung lesions in CT and breast lesions in MRI. As pixel-wise annotations are tedious and expensive to obtain, Tang et al. [12] generate pseudo masks by fitting ellipses based on the response evaluation criteria in solid tumors (RECIST) [2] diameters.
Using a 2D Mask R-CNN [3] with generated lesion masks and other strategies, [12] achieves a sensitivity of 84.38% at 4 false positives per image on the DeepLesion dataset. Their pseudo-mask generation procedure relies heavily on the assumption of elliptical lesion geometry, which may yield imprecise masks, limiting the efficacy of dense supervision.

We propose a one-stage detector which directly localizes lesions without the need for candidate region proposals. To meet the specific challenge of detecting small lesions, we revisit the RetinaNet [6] and optimize the feature pyramid scheme and anchor configuration by employing a differential evolution search algorithm. To enhance the model, we leverage high-quality dense masks obtained automatically from weak RECIST labels using GrabCut [8]. Incorporating these generated masks into pixel-wise supervision shows great benefit for training the detector. In addition, we make use of the coherence between lesion mask predictions and bounding-box regressions to calibrate the detector outputs. We further investigate recent strategies for boosting the detection performance, such as integrating an attention mechanism into our feature pyramids. We evaluate the contributions of each part using the DeepLesion benchmark, achieving a new state-of-the-art sensitivity of 90.77% at 4 false positives per image, significantly outperforming the currently best performing method 3DCE [13] by over 5%.

2 Improving RetinaNet

An overview of our proposed one-stage lesion detector is illustrated in Fig. 1 (a). We first describe the model design before elaborating on how we obtain dense masks from weak RECIST labels and incorporate them into the training process. We then show the attention mechanism for further improving detection performance.
2.1 Model Design with Optimized Anchor Configuration

The backbone of our approach is a RetinaNet [6], a recent one-stage method for object detection. The use of a focal loss addresses the common problem of class imbalance in detection tasks. The feature pyramids and lateral connections with a top-down architecture [5] are adopted for detecting objects at different scales. This is an important difference to methods such as 3DCE [13], since the feature pyramids can effectively capture information about lesions of varying sizes, including very small ones.

Fig. 1: (a) Overview of our improved RetinaNet. (b) Automatic dense mask generation from weak RECIST diameters using GrabCut [8].

Our specific network follows the structure of VGG-19 [10]. We also explored ResNet-50 as used originally, but its performance was worse on DeepLesion, which is in line with results reported in [14]. The anchor configuration is crucial for the detector, and we find the default anchor sizes (32, 64, 128, 256 and 512), aspect ratios (1:2, 1:1 and 2:1) and scales (2^(0/3), 2^(1/3) and 2^(2/3)) turn out to be ineffective for detecting lesions of small size and large aspect ratio. We employ a differential evolution search algorithm [11] to optimize the ratios and scales of anchors on the validation set. This algorithm iteratively improves a population of candidate solutions with regard to an objective function, creating new solutions by combining existing ones. We aim to find the best anchor settings for 3 scales and 5 ratios. The objective is to maximise the overlap between the lesion bounding-box and the best anchor on the validation dataset.
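A minimal sketch of this search, assuming a classic DE/rand/1/bin loop written in plain NumPy rather than the authors' exact implementation; `gt_wh` is a hypothetical stand-in for the validation-set lesion box sizes, and the parameterization and bounds follow this section:

```python
import numpy as np

ANCHOR_SIZES = np.array([32, 64, 128, 256, 512])

def mean_best_anchor_iou(params, gt_wh):
    """Objective: mean IoU between each lesion box and its best anchor
    (centres aligned). params = (ratio1, ratio2, scale1, scale2, scale3)."""
    r1, r2, s1, s2, s3 = params
    # One 1:1 ratio plus two reciprocal pairs, as described in Sec. 2.1.
    ratios = np.array([1.0, r1, 1.0 / r1, r2, 1.0 / r2])
    sides = (ANCHOR_SIZES[:, None] * np.array([s1, s2, s3])[None, :]).ravel()
    aw = (sides[:, None] * np.sqrt(ratios)[None, :]).ravel()  # anchor widths
    ah = (sides[:, None] / np.sqrt(ratios)[None, :]).ravel()  # anchor heights
    gw, gh = gt_wh[:, :1], gt_wh[:, 1:]
    inter = np.minimum(gw, aw) * np.minimum(gh, ah)
    union = gw * gh + aw * ah - inter
    return (inter / union).max(axis=1).mean()

def optimize_anchors(gt_wh, pop_size=20, iters=40, F=0.8, CR=0.9, seed=0):
    """Simplified DE/rand/1/bin minimising negative mean best-anchor IoU."""
    objective = lambda p: -mean_best_anchor_iou(p, gt_wh)
    # Bounds from the paper: ratios in [1, 2] and [2, 4], scales in [0.4, 1.6].
    lo = np.array([1.0, 2.0, 0.4, 0.4, 0.4])
    hi = np.array([2.0, 4.0, 1.6, 1.6, 1.6])
    rng = np.random.default_rng(seed)
    pop = rng.uniform(lo, hi, (pop_size, 5))
    scores = np.array([objective(x) for x in pop])
    for _ in range(iters):
        for i in range(pop_size):
            a, b, c = pop[rng.choice(pop_size, 3, replace=False)]
            mutant = np.clip(a + F * (b - c), lo, hi)
            cross = rng.random(5) < CR
            cross[rng.integers(5)] = True  # at least one gene from the mutant
            trial = np.where(cross, mutant, pop[i])
            s = objective(trial)
            if s < scores[i]:  # greedy selection
                pop[i], scores[i] = trial, s
    return pop[scores.argmin()]
```

SciPy's `scipy.optimize.differential_evolution` could replace the hand-rolled loop; the explicit version is shown here to make the search itself visible.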
We fix one ratio as 1:1, and define the other ratios as reciprocal pairs (i.e., if one ratio is 1:γ then another is γ:1). Thus, we need to optimise only five variables, i.e., two ratio pairs and three scales. When initialising the population of candidate solutions, all scales are bounded to the range [0.4, 1.6] and the two ratios are respectively bounded in [1, 2] and [2, 4]. We obtain optimal scales of 0.425, 0.540 and 0.680, and ratios of 3.27:1, 1.78:1, 1:1, 1:1.78 and 1:3.27, which fit objects of small size and large aspect ratio. Anchor sizes remain (32, 64, 128, 256 and 512). These optimised configurations are then used for training the detector.

2.2 Dense Mask Supervision from Weak RECIST Labels

Although annotations of bounding-boxes are relatively easy to obtain, there are other "weak" labels which are routinely generated in clinical practice, such as RECIST diameters. RECIST is used to track lesion growth and consists of a pair of diameters measuring the lesion extent (cf. Fig. 1 (b)). To leverage this highly valuable information, we automatically generate dense lesion masks from RECIST labels (provided in the DeepLesion dataset) using GrabCut [8]. We initialize a trimap into background (T_B), foreground (T_F) and unclear (T_U) pixels. A segmentation mask is then generated based on iterative graph-cuts. Initialization can largely affect the final result, as it defines the Gaussian mixture models capturing the foreground and background intensity distributions. Cai et al. [1] previously adopted GrabCut to initialise lesion masks of the RECIST slice for the task of weakly-supervised lesion segmentation in 3D. Their T_B is set as the pixels outside the bounding-box defined by the RECIST axes, and T_F is obtained by dilation of the diameters.
Such an initialisation may be sub-optimal, specifically for large lesions, where a considerable number of lesion pixels, which quite certainly belong to the foreground, lie outside the dilation and are omitted from T_F. For small lesions, the dilation risks hard-labelling background pixels into T_F, which cannot be corrected in the optimization. To achieve higher-quality masks using GrabCut, we propose a new strategy, as illustrated in Fig. 1 (b). We build a quadrilateral by consecutively connecting the four endpoints of the RECIST diameters. A pixel is labelled as foreground if it falls inside the quadrilateral. As most lesions show convex outlines, this is a simple yet reliable strategy. With the annotation of the bounding-box provided in the dataset, the pixels outside the box are hard-labelled as background T_B. All remaining pixels are assigned to T_U and estimated through GrabCut.

To exploit these generated dense labels, we add two more upsampling layers (connecting to P2 and P1) and a segmentation prediction layer to the detector. Skip connections are employed by fusing features obtained from C1 (via a 1×1 convolution) and the input (via two 3×3 convolutions), as shown in Fig. 1 (a). To retain sufficient resolution of feature maps for small lesions, we shift the sub-network operation (i.e., classification and regression) from pyramid levels P3-P7 to P2-P6. Using dense supervision to help the detection task shares the idea of Retina U-Net [4], but we avoid the need for tedious labelling, as our dense masks are automatically generated from labels that are already recorded in clinical routine. Additionally, we leverage the IoU between a bounding-box around the predicted segmentation mask and the directly regressed box (yellow sub-networks in Fig. 1) to calibrate the prediction probability p̃ = p × (1 + IoU) of a lesion.
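The quadrilateral trimap initialization described above can be sketched as follows (a hypothetical plain-NumPy illustration; the resulting trimap would then be refined with OpenCV's `cv2.grabCut`, which is not called here):

```python
import numpy as np

BG, FG, UNCLEAR = 0, 1, 2  # trimap labels T_B, T_F, T_U

def recist_trimap(shape, diam1, diam2, bbox):
    """Trimap from two RECIST diameters and the lesion bounding-box.

    diam1, diam2: ((x0, y0), (x1, y1)) endpoint pairs of the diameters.
    bbox: (x_min, y_min, x_max, y_max). shape: (height, width).
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    trimap = np.full(shape, UNCLEAR, dtype=np.uint8)
    # Pixels outside the bounding-box are certain background (T_B).
    x0, y0, x1, y1 = bbox
    trimap[(xs < x0) | (xs > x1) | (ys < y0) | (ys > y1)] = BG
    # Connect the four diameter endpoints consecutively into a quadrilateral;
    # pixels inside it are certain foreground (T_F). Endpoints of the two
    # crossing diameters alternate so the polygon is non-self-intersecting.
    quad = np.array([diam1[0], diam2[0], diam1[1], diam2[1]], dtype=float)
    cross = np.empty((4, h, w))
    for i in range(4):
        (ax, ay), (bx, by) = quad[i], quad[(i + 1) % 4]
        cross[i] = (bx - ax) * (ys - ay) - (by - ay) * (xs - ax)
    # Inside the convex quadrilateral: same sign of the cross product
    # for all four edges (works for either vertex orientation).
    inside = (cross >= 0).all(axis=0) | (cross <= 0).all(axis=0)
    trimap[inside] = FG
    return trimap
```

The label values and function name are illustrative; OpenCV's GrabCut uses its own mask constants (`GC_BGD`, `GC_FGD`, `GC_PR_FGD`), to which such a trimap would be mapped.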
High coherence between segmentation and detection results indicates high confidence in the lesion prediction, and benefits sensitivity at low FP rates.

2.3 Attention Mechanism for Gated Feature Fusion

A recent attention gate (AG) model proposed by Schlemper et al. [9] learns to focus on target structures by producing an attention map. According to this work, this may be beneficial for small, varying structures. We explore AGs to filter the feature responses propagated through the skip connections, using the features upsampled from the coarser scale as the gating signal. The AG module only uses 1×1 convolutions and produces a single attention map, which makes it computationally light-weight. The output of the AG is the element-wise multiplication of the attention map and the feature map from the skip connection.

Training: We follow the loss used in the original RetinaNet for detection, and our segmentation uses a focal loss with cross-entropy. We employ the Adam optimizer with a learning rate of 10^-4, which is reduced during training by a factor of 10 when the mean average precision (mAP) has not improved for 2 consecutive epochs. The batch size is 4 during training. To reduce overfitting, early stopping is used if the mAP has not improved for 4 consecutive epochs on the validation set. We use an NVIDIA GeForce GTX 1080 for training and testing.

3 Experiments

3.1 Dataset, Pre-Processing, and Augmentation

The public DeepLesion dataset [15] consists of 32,120 axial CT slices from 10,594 studies of 4,427 unique patients. There are 1-3 lesions in each slice, adding up to 32,735 lesions altogether. For each lesion, there is usually 30 mm of extra slices above and below the key slice to provide contextual information. In most cases, the slices have 1 or 5 mm thickness, but this varies, with some being 0.625 or 2 mm.
The 2D bounding-boxes and RECIST diameters for lesions are annotated on the key slice. The dataset covers a wide range of lesions of the lung, liver, mediastinum (mainly lymph nodes), kidney, pelvis, bone, abdomen and soft tissue. Sizes vary significantly, with diameters ranging from 0.21 to 342.5 mm.

We perform lightweight pre-processing where images are resized to 512×512 pixels, resulting in a voxel spacing between 0.175 and 0.977 mm with a mean of 0.802 mm. The Hounsfield units (HU) are clipped to the range [-1024, 1050]. We normalize the intensities to the range [-1, 1] as input to the network. In our experiments, we use three adjacent slices after resampling to 2 mm slice thickness. In rare cases where a neighboring slice of the lesion slice is not provided, we duplicate the lesion slice to fill the missing input channels. We use data augmentation where images are flipped in horizontal and vertical directions with 50% chance. We also use random affine transformations with rotation/shearing up to 0.1 radians, and scaling/translation up to 10% of the image size.

3.2 Detection Results on DeepLesion Benchmark

The DeepLesion dataset is provided with splits into 70% for training, 15% for validation, and 15% for testing. Thus, our results can be directly compared with numbers reported in the literature. The current best results have been achieved by Yan et al. [13] and Tang et al. [12]. We also quote their provided baseline performance using popular detection methods, i.e., Faster R-CNN [7] (reported in [13]) and Mask R-CNN [3] (reported in [12]). We further provide the results of our own baseline RetinaNet [6] using its default configuration. A predicted box is regarded as correct if its IoU with a ground truth box is larger than 0.5. In Table 1, we present the lesion detection sensitivities at different numbers of false positives (FP) per image.
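The IoU matching criterion used to score predictions can be sketched as follows (hypothetical helper names, boxes given as corner coordinates):

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_correct(pred_box, gt_box, thresh=0.5):
    """A predicted box counts as correct if its IoU with ground truth > 0.5."""
    return box_iou(pred_box, gt_box) > thresh
```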
Our improved RetinaNet consistently outperforms existing methods across all FP rates. Specifically, at 4 FPs per image, which is commonly reported in the literature, we achieve a sensitivity of 90.77%, a 5.12% improvement over 3DCE [13] and 6.39% over ULDor [12]. The free-response receiver operating characteristic (FROC) curves of the different methods are shown on the left in Fig. 2. We observe that networks optimized for lesion detection are generally better than out-of-the-box detectors such as Faster R-CNN, Mask R-CNN and RetinaNet. When comparing sensitivity at low FP rates, our improved models perform much better than the others, indicating the benefit of task-specific optimization and the incorporation of additional mask information.

Fig. 2: Left: FROC curves for our improved RetinaNet variants and baselines on the DeepLesion dataset. Right: Per lesion size results compared to 3DCE [13].

Table 1: Detection performance (sensitivity in % at FPs per image) of different methods and our ablation study.

Method                      0.5     1      2      4      8     16   run time
Faster R-CNN [7]          56.90  67.26  75.57  81.62  85.83  88.74   32 ms
Mask R-CNN [3]            39.82  52.66  65.58  77.73  85.54  91.80      -
ULDor (Tang et al. [12])  52.86  64.80  74.84  84.38  87.17  91.80      -
3DCE (Yan et al. [13])    62.48  73.37  80.70  85.65  89.09  91.06  114 ms
original RetinaNet [6]    45.80  54.17  62.50  69.80  75.34  79.48   28 ms
+ anchor optimization     64.82  74.98  82.29  87.87  92.20  94.90   31 ms
+ dense supervision       70.24  78.28  85.10  90.39  93.81  96.01   39 ms
+ attention gate          72.15  80.07  86.40  90.77  94.09  96.32   41 ms

The sensitivity for detecting lesions of different sizes at 4 FPs is shown on the right in Fig. 2. We divide the lesions into three size groups according to the diameter, following [13] for direct comparison. For small lesions with diameters less than 10 mm, our sensitivity is 88.35% compared to 80% for 3DCE.
Using a feature pyramid to retain responses from small lesions, together with dense supervision with a focal loss, seems beneficial for detecting subtle signs of disease. While 3DCE uses richer 3D context, this seems less helpful for small, local structures. Our model works well across all lesion sizes: compared to 3DCE, we improve sensitivity from 87% to 91.73% for lesions of 10-30 mm, and from 84% to 93.02% for lesions larger than 30 mm.

We also record the average inference time per image during testing, as listed in Table 1. All detection results are obtained using a single network without model ensembling or test-time augmentation. Our one-stage detector is highly efficient, eliminating the need for generating lesion proposals. The integration of dense supervision and the attention mechanism adds minimal computational overhead, taking about 41 ms per image. Run times for 3DCE and Faster R-CNN are reported in [13], but the comparison is only indicative due to different GPUs being used.

Fig. 3: Visual results for lesion detection at 0.5 FP rate using our improved RetinaNet. The first three columns show different sizes from small to large. The right column shows heatmaps from the segmentation layer overlaid on detections. Yellow boxes are ground truth, green are true positives, red are false positives. The last row shows intriguing failure cases with possibly incorrect ground truth.

3.3 Contribution of Individual Improvements

We investigate the individual impact of the proposed additions leading to our final improved RetinaNet. In an ablation study, we first evaluate the original RetinaNet [6] with default settings, then incrementally add our improvements, i.e., automatic anchor optimization, dense supervision using lesion masks from weak labels, and the attention mechanism. Table 1 and Fig. 2 (left) summarize these results.
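The FROC operating points reported above (sensitivity at a fixed number of FPs per image) can be computed as in the following sketch (hypothetical names; it assumes each ground-truth lesion is matched by at most one prediction, so true positives count distinct lesions):

```python
import numpy as np

def sensitivity_at_fp(scores, is_tp, n_lesions, n_images, fp_per_image=4.0):
    """Sensitivity at a fixed FP-per-image budget, pooled over all images.

    scores: confidence of every predicted box; is_tp: whether that box
    matched a distinct ground-truth lesion (IoU > 0.5).
    """
    order = np.argsort(scores)[::-1]    # sweep the threshold from high to low
    hits = np.asarray(is_tp, dtype=bool)[order]
    tp = np.cumsum(hits)                # lesions found at each threshold
    fp = np.cumsum(~hits)               # false positives at each threshold
    ok = fp <= fp_per_image * n_images  # thresholds within the FP budget
    return float(tp[ok].max() / n_lesions) if ok.any() else 0.0
```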
The original RetinaNet with the default anchor configuration performs poorly on the lesion detection task, indicating that out-of-the-box approaches from computer vision are sub-optimal. Remarkably, after employing the automatic search algorithm to optimize the anchor configuration, the simple RetinaNet already outperforms the previous state-of-the-art. The sensitivity at 0.5 FP is 2.34% higher than 3DCE and 11.96% higher than ULDor.

Adding dense supervision with segmentation masks generated from RECIST diameters significantly boosts detection sensitivity across all FP rates, with a 5.42% improvement at 0.5 FP. The pixel-wise supervision adds an important training signal, providing more precise localization information in addition to bounding-boxes. Consistency between bounding-box regression and dense classification helps to reduce false positives. Finally, adding the attention mechanism further improves the performance, achieving a sensitivity of 90.77% at 4 FPs, with an improvement of almost 10% at 0.5 FP over the best reported results.

Visual examples of detected lesions on test images are shown in Figs. 3 and 4. The probability threshold is set to 0.3, yielding 0.5 FP per image. Lesions of various sizes, appearances and types are localized accurately. The segmentation masks look sensible, indicating good quality of the automatically generated dense labels used for training.

4 Conclusion

Our improved RetinaNet shows impressive performance on CT lesion detection, outperforming the state-of-the-art by a significant margin. Interestingly, we could show that task-specific optimization of an out-of-the-box detector already achieves results superior to the best reported in the literature. The exploitation of clinically available RECIST annotations bears great promise, as large amounts of such training data should be available in many hospitals.
With a sensitivity of about 91% at 4 FPs per image, our system may reach clinical readiness. Future work will focus on new applications such as whole-body MRI in oncology.

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 757173, project MIRA, ERC-2017-STG).

References

1. Cai, J., Tang, Y., Lu, L., Harrison, A.P., Yan, K., Xiao, J., Yang, L., Summers, R.M.: Accurate weakly-supervised deep lesion segmentation using large-scale clinical annotations. In: MICCAI (2018)
2. Eisenhauer, E.A., Therasse, P., Bogaerts, J., Schwartz, L.H., et al.: New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). European Journal of Cancer 45(2), 228-247 (2009)
3. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
4. Jaeger, P.F., Kohl, S.A., Bickelhaupt, S., Isensee, F., Kuder, T.A., Schlemmer, H.P., Maier-Hein, K.H.: Retina U-Net: Embarrassingly simple exploitation of segmentation supervision for medical object detection. arXiv preprint arXiv:1811.08661 (2018)
5. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)
6. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV (2017)
7. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NIPS (2015)
8. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: Interactive foreground extraction using iterated graph cuts. In: ACM Transactions on Graphics, vol. 23 (2004)
9. Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., Rueckert, D.: Attention gated networks: Learning to leverage salient regions in medical images.
Medical Image Analysis (2019)
10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
11. Storn, R., Price, K.: Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization 11(4) (1997)
12. Tang, Y., Yan, K., Tang, Y., Liu, J., Xiao, J., Summers, R.M.: ULDor: A universal lesion detector for CT scans with pseudo masks and hard negative example mining. arXiv preprint arXiv:1901.06359 (2019)
13. Yan, K., Bagheri, M., Summers, R.M.: 3D context enhanced region-based convolutional neural network for end-to-end lesion detection. In: MICCAI (2018)
14. Yan, K., Wang, X., Lu, L., Summers, R.M.: DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. Journal of Medical Imaging 5(3) (2018)
15. Yan, K., Wang, X., Lu, L., Zhang, L., Harrison, A.P., Bagheri, M., Summers, R.M.: Deep lesion graphs in the wild: relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database. In: CVPR (2018)

Fig. 4: More visual results for lesion detection at 0.5 FP rate using our improved RetinaNet. The rows correspond to bone, abdomen, mediastinum, liver, lung, kidney, soft tissue, and pelvis lesions, respectively. Each row contains examples of lesions of different sizes ordered from smallest to largest. Yellow boxes are ground truth, green are true positives, red are false positives.