Test Sensitivity in the Computer-Aided Detection of Breast Cancer from Clinical Mammographic Screening: a Meta-analysis
Authors: Jacob Levman
Corresponding Author: Jacob Levman, PhD

1. Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Parks Road, Oxford, OX1 3PJ, United Kingdom. jacob.levman@eng.ox.ac.uk. Telephone: 44 01865 617696
2. Sunnybrook Research Institute, Imaging Research, University of Toronto, 2075 Bayview Ave., Toronto, ON M5N 3M5, Canada

Abstract

Objectives: To assess evaluative methodologies for comparative measurements of test sensitivity in clinical mammographic screening trials of computer-aided detection (CAD) technologies.

Materials and Methods: This meta-analysis was performed by analytically reviewing the relevant literature on the clinical application of computer-aided detection (CAD) technologies as part of a breast cancer screening program based on x-ray mammography. Each clinical study's method for measuring the CAD system's improvement in test sensitivity is examined in this meta-analysis. The impact of the chosen sensitivity measurement on the study's conclusions is analyzed.

Results: This meta-analysis demonstrates that some studies have inappropriately compared sensitivity measurements between control groups and CAD-enabled groups. The inappropriate comparison of control groups and CAD-enabled groups can lead to an underestimation of the benefits of the clinical application of computer-aided detection technologies.

Conclusions: The potential for the sensitivity measurement issues raised in this meta-analysis to alter the conclusions of multiple existing large clinical studies is discussed. Two large-scale studies are substantially affected by the analysis provided in this study, and this meta-analysis demonstrates that computer-aided detection systems are successfully assisting in the breast cancer screening process.
Keywords: Computer-aided detection, breast cancer screening, mammography, clinical studies, sensitivity.

Introduction

Evidence suggests that the early detection of breast cancer through periodic mammographic screening reduces the mortality associated with the disease [1, 2]. Computer-aided detection (CAD) systems have the potential to improve the breast cancer screening process by marking suspicious tissues as potentially malignant on x-ray mammograms, thus minimizing the likelihood of a cancer being missed by the interpreting radiologist. CAD systems mark suspect cancerous tissues but also incorrectly mark non-malignant tissues; thus, although a CAD system may improve the rate of detection of cancers and the sensitivity of the screening process, it may also cause false positives, leading to higher recall rates and unnecessary biopsies. This article focuses on how clinical CAD studies measure a test's sensitivity and how a clinical study's methodology can affect its conclusions.

Research has been ongoing in the design and development of CAD systems to help radiologists with the breast cancer screening process. The research and development of a CAD system typically incorporates rounds of lab-based evaluation, comparing the CAD-marked results with ground truth data. Once a CAD system has performed successfully in earlier evaluative studies, the technology may reach the stage whereby it is evaluated clinically on ongoing examinations that are being actively relied upon for the detection of breast cancer. Typically, testing a CAD system on an active screening population is performed with commercially available CAD technology. This meta-analysis focuses on those CAD studies that were performed in a clinical setting.
The studies included in the detailed analysis herein tested CAD systems in a prospective manner, by measuring the CAD system's performance on a set of mammographic examinations actively being used to screen for breast cancer. Thus, this meta-analysis exclusively analyzes those CAD studies that have reached a relatively advanced state of clinical use.

In the course of a typical clinical CAD study, numerous performance metrics are computed to assist with the process of evaluating the performance of the CAD technology being tested (cancer detection rates, test sensitivity, recall rates, biopsy rates, size and stage of detected cancers, test specificity, etc.). Variations exist in the way a clinical study calculates its performance metrics and, in particular, in how those performance metrics are compared. This meta-analysis reviews variations in the methods for measuring a test's sensitivity in clinical CAD studies for breast cancer detection from x-ray mammography and discusses the potential negative effects of relying on a direct comparison of the sensitivity measurements used.

Clinical studies of the effects of CAD technology can be divided into two groups: matched studies and longitudinal studies. In matched studies, a radiologist analyzes a mammogram and is then exposed to the results of the CAD technology, which may change their diagnosis. Since the imaging examinations are carefully matched and analyzed first without and then with CAD, one can be confident that comparing the measured test sensitivity before and after CAD-based screening appropriately reflects the potential performance improvements achieved by the CAD system. True sensitivities are not computed in these studies, as the number of missed cancers is an inherent unknown.
Instead, it should be recognized that a sensitivity measurement is relative, and as such great care needs to be taken to ensure that two measured sensitivity values are indeed appropriate for direct comparison between the control group and the CAD-enabled group. Potential problems with a clinical CAD study can occur when the test sensitivities of two groups are compared inappropriately (i.e. between CAD-enabled screening and screening with no CAD technology).

When a CAD system's relative improvement over non-CAD-enabled screening is measured in longitudinal studies, it is typically assessed by comparing the test metrics between two large groups (CAD-enabled and non-CAD-enabled). These groups are not matched, and as such problems can occur when we compare large population groups with a measurement like the test's sensitivity. If a CAD system is introduced and in its first year of operation increases the yield of cancers detected (i.e. a real increase in the cancer detection rate), then it is normal to expect to see an associated increase in the test's measured sensitivity; however, this is not necessarily the case. When the mammograms are not matched in the study design, the sensitivity of the control group can be inadvertently inflated relative to the CAD-enabled group. This can occur because cancerous exams missed in the control group that would have been caught had CAD been used are counted as true negatives when they are in fact false negatives. This makes comparing sensitivity values between the control group and the CAD-enabled group potentially misleading. When the control group counts false negatives as true negatives, its sensitivity is artificially inflated relative to the experimental CAD-enabled group. This effect is discussed in more detail in the Discussion.
Materials and Methods

Measuring Sensitivity

A test's sensitivity is typically assessed as the amount of disease detected relative to the total cases of known disease in the population. The typical definition for a test's sensitivity is provided in equation 1.

Sensitivity = TP / (TP + FN)    (1)

where TP are the true positives (the malignancies caught by the given screening method) and FN are the false negatives (cases of known missed cancers).

This meta-analysis involved a detailed literature search in order to identify large-scale clinical studies looking at the benefits of computer-aided detection enabled breast cancer screening. Each large-scale clinical CAD study was analyzed based on the methodology used to compare CAD-enabled screening with alternative screening methodologies. In this meta-analysis, each large-scale clinical study's test evaluation methodology was analyzed and the potential impact of comparing test sensitivities inappropriately is discussed.

Results

The clinical assessment of computer-aided detection technologies for x-ray mammographic breast cancer screening has yielded numerous matched studies [3-14]. Such studies typically incorporate an initial non-CAD-enabled reading by a radiologist, followed by reinterpretation with the benefit of the analyzed results produced by the CAD system being tested. These studies are not affected by the arguments made in this meta-analysis, as their study designs carefully examine individual screening examinations with and without the use of CAD technology. Thus cases where the CAD system detects an otherwise missed tumour are clearly recorded and included in the analysis. Potential benefits of CAD-enabled screening are clearly analyzed in these types of studies. Non-matched population-based analyses of CAD technologies account for 7 studies found in the literature [15-21].
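The definition in equation 1 can be expressed as a small helper function (an illustrative sketch, not code from any of the cited studies):

```python
def sensitivity(tp, fn):
    """Equation 1: true positives over all known disease cases (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("no known disease cases")
    return tp / (tp + fn)

# A screening round that catches 800 cancers while 200 known cancers are missed:
print(sensitivity(800, 200))  # 0.8
```

The key point of the meta-analysis is that the value returned depends entirely on how FN is counted: miscounting false negatives as true negatives shrinks the denominator and inflates the result.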
Each study is summarized in Table 1, along with the potential extent to which this meta-analysis may contribute to reinterpreting the study's results.

Table 1: CAD studies for x-ray mammography – potential effect of this meta-analysis

Study | Reported Sensitivity Improvement | Impact of this Meta-analysis
Gur et al. [15] (JNCI, 2004) | None – other metrics used | None
Cupples et al. [16] (AJR, 2005) | None – other metrics used | None
Fenton et al. [17] (NEJM, 2007) | 3.6% over pre-CAD | Substantial
Gromet [18] (AJR, 2008) | 9% over single reader | Very small
Gilbert et al. [19] (NEJM, 2008) | Equal to double reader | Very small
James et al. [20] (Radiology, 2010) | Equal to double reader | Very small
Fenton et al. [21] (JNCI, 2011) | 1.4% over pre-CAD; CAD had a lower sensitivity compared with non-CAD controls | Substantial

Discussion

Critical to evaluating a new screening technology is the correct accounting (true positives, false negatives) of those cancers that could be caught by the new CAD-enabled screening process. In a non-matched longitudinal population study, a misleading problem can occur with respect to evaluating the quality of the CAD system. Consider those malignant lesions that potentially can be caught by CAD but are not caught because they are in the control group where CAD wasn't implemented. Cases that can be caught by CAD but are not caught by manual radiological analysis are the most critical cases for accurately assessing the sensitivity improvement of CAD-enabled screening. Non-matched studies that compare test sensitivities between control groups and CAD-enabled groups can potentially be misleading due to an assumption of a lack of cancers in the control group's negative screening findings.
In a typical non-matched longitudinal population study, malignant lesions presenting on mammography that are missed by a radiologist, but able to be caught by the CAD system and that would be caught by future rounds of regular screening, are counted as true negatives (i.e. they are diagnosed as non-cancerous and then erroneously evaluated as correctly diagnosed). These cases of missed cancers that could have been caught by CAD are critical to assessing the CAD system's performance improvement over standard screening. Instead of being counted as true negatives, these cases should be counted as false negatives (i.e. an erroneous non-cancerous diagnosis). An increase in false negatives contributes to lowering the measured test sensitivity (see equation 1). Comparing test sensitivities of a control group with a CAD-enabled group in a non-matched study can result in an underestimation of the difference in test sensitivity between the CAD-enabled group and the control group. Thus comparing sensitivity measures between two independent groups can imply a smaller sensitivity improvement than was actually accomplished by the introduction of CAD technology, because missed cancers in the control group that were CAD-detectable are regularly incorrectly counted as correctly diagnosed non-cancers (true negatives) when they are actually incorrectly diagnosed cancers (false negatives).

In preclinical CAD trials, the outlined comparative sensitivity problem is usually not an issue, as pre-clinical CAD evaluation does not tend to compare two pools of samples separately. Instead, individual exams are typically matched such that multiple screening methods are tested on the exact same mammograms and directly compared. Thus a typical lab-based evaluation of a CAD system does not suffer from the comparative sensitivity problems discussed in this paper.
The matched clinical studies investigating the use of CAD [3-14] also do not appear to suffer from the problems with analyzing results by comparing the sensitivity of two different screening methods. In these situations we can be confident that any sensitivity measures produced are a reasonable method by which to compare two screening methodologies.

The longitudinal studies, which compare separate populations, one with CAD and one without, can lead to erroneous conclusions when the analysis of screening technologies is based on comparing sensitivities measured for each group. Two of the seven longitudinal CAD studies included in Table 1 avoided the test sensitivity as an evaluative metric and so are not affected by the comparative sensitivity issues presented in this meta-analysis [15, 16]. Three additional longitudinal studies from Table 1 are affected by this comparative sensitivity issue; however, those studies also included dual reading in the control group [18-20], which should help minimize the potential negative effects described in this analysis. This is because in the control group, the second reader often identifies extra malignancies that would have been caught by the CAD system, which in turn prevents those cases from being miscounted as true negatives. Thus the expected impact of this meta-analysis on Gromet's study [18], Gilbert's study [19] and James' study [20] is very small, as indicated in Table 1. Two of the seven longitudinal CAD studies are potentially substantially affected by the arguments raised in this analysis [17, 21]. Those two studies' conclusions emphasized only the existence of extremely minor benefits of CAD-enabled screening.
The effect described in this meta-analysis is liable to have reduced the difference between the measured sensitivities of the control groups relative to the CAD-enabled groups, because the control groups' sensitivities are not degraded for the missed cancers that would have been caught by CAD had it been deployed in the control population. This may help explain why the more recent study in the Journal of the National Cancer Institute [21] reported higher sensitivities in the non-CAD-enabled control group relative to the CAD-enabled group, an initially surprising finding. This may also explain why only a very small sensitivity improvement was reported when comparing the CAD-enabled group with the pre-CAD control group (1.4% improvement). It is interesting to note that although both studies emphasized meager benefits from CAD technology [17, 21], the earlier study in the New England Journal of Medicine [17] indicates that CAD screening actually resulted in an increase in the rate of detection of ductal carcinoma in situ (DCIS) by 34% and a decrease in the rate of detection of invasive cancers by 12%, indicating that CAD use may have contributed to shifting the tumour yield towards earlier-stage pre-invasive cancers (which are known to have more favourable prognostic characteristics). The later study of the two, from the Journal of the National Cancer Institute [21], also showed a statistically significant shift towards more DCIS yield in CAD-enabled screening [22], indicating that, contrary to the authors' conclusions, CAD has exhibited a constructive role in the breast cancer screening process.

In order to help illustrate the problems with comparing measured sensitivities in unmatched longitudinal trials, consider the simple example of a screening center which annually catches 800 cancers with mammography.
Furthermore, 200 patients screened at that center are diagnosed with breast cancer annually even though they had a negative mammogram, for a total of 1000 patients diagnosed with breast cancer annually. By the methods used in the literature [17, 21], the center's sensitivity will be 800/(800+200) = 80%. If we then add a CAD system that catches 100 extra tumours in its first round of screening, and those tumours would have otherwise been caught in a subsequent round of screening (CAD contributes to catching those tumours earlier), then the sensitivity (as measured in [17, 21]) would be 900/(900+200) = 81.8%. Direct comparison reveals just a 1.8% absolute increase in the test sensitivity, even though the introduction of the CAD system resulted in an increased yield of malignant tumours of 12.5% (100/800) in the first year of operation. If this hypothetical screening center had implemented CAD technology one year earlier, then we would expect the CAD system to have yielded an additional 100 lesions in the year it was introduced. The computed sensitivity for the control group prior to implementation of CAD does not account for these missed malignancies that would have been caught had CAD been implemented earlier. Instead those malignancies are missed and erroneously counted as correct diagnoses of exams without malignancies. Consider the control group in a situation where there are 100 missed cancers that could have been caught by CAD and would eventually be caught by a future round of traditional non-CAD-based screening. The aforementioned testing methodologies [17, 21] would only account for the normal 200 cases of cancer caught in spite of the implementation of mammographic screening (screened women who present with cancer after a negative mammogram).
The control group's sensitivity would be computed as 800/(800+200) = 80%, when there are in fact 100 extra missed cancers that were not accounted for, and so a more accurate sensitivity for the control population would be 800/(800+300) = 72.7%. Comparing this hypothetical control with a CAD-enabled group yielding an 81.8% sensitivity demonstrates a solid improvement in the test's sensitivity. Such a comparison is not made, however, as these cancers are not accounted for in the computation of the test's sensitivity [17, 21]. It should be noted that the above is just a simple example; however, it clearly illustrates that it is possible for a real improvement in cancer yield of 12.5% (100/800) to result in as little as a 1.8% improvement in sensitivity when relying on problematic methods for comparing sensitivity measurements [17, 21].

The earlier study in the New England Journal of Medicine [17] looked exclusively at the performance of a particular CAD technology (ImageChecker, R2 Technology) and averaged its performance across a variety of centers. The more recent study from the Journal of the National Cancer Institute [21] averages together 25 CAD screening centers. It was not reported whether all of these centers employed the same commercial CAD technologies. It is expected that different CAD technologies from different vendors will produce different benefits. Furthermore, Nishikawa and Pesce's analysis demonstrates that different CAD systems for mammographic breast cancer detection can range in disease detection rate improvements from 5 to 20% in cross-sectional trials [23]. Considerable variation also exists between individual radiologists and between different screening centers.

A safer method for evaluating a disease screening technology in a longitudinal clinical context would be to look at the increase in the disease detection rate during the first year/round of screening with a new detection technology.
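The hypothetical screening-center arithmetic above can be reproduced numerically (a sketch using the paper's illustrative numbers, not data from the cited trials):

```python
def sensitivity(tp, fn):
    """Equation 1: true positives over all known disease cases (TP + FN)."""
    return tp / (tp + fn)

# Hypothetical center: 800 cancers caught annually, 200 interval cancers
# (diagnosed after a negative mammogram).
control_as_reported = sensitivity(800, 200)  # FN counts only interval cancers
cad_enabled = sensitivity(900, 200)          # CAD catches 100 extra tumours

# Recounting the 100 CAD-detectable missed cancers in the control group
# as false negatives rather than true negatives:
control_corrected = sensitivity(800, 300)

print(f"reported control:  {control_as_reported:.1%}")  # 80.0%
print(f"CAD-enabled:       {cad_enabled:.1%}")          # 81.8%
print(f"corrected control: {control_corrected:.1%}")    # 72.7%
print(f"cancer yield increase: {100 / 800:.1%}")        # 12.5%
```

The direct comparison (80.0% vs 81.8%) shows only a 1.8% absolute gain, while the corrected comparison (72.7% vs 81.8%) reflects the real improvement delivered by the 12.5% increase in cancer yield.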
Measuring the disease detection rate beyond this period in a longitudinal trial can lead to misleading conclusions because of the lowered prevalence of disease in the population after introduction of a new, more sensitive screening method [23, 24]. Additionally, examining the detected tumours' size and stage are appropriate measures for the evaluation of a screening technology. Sensitivities are relative measurements; when comparing two sensitivities, great care must be taken to ensure that the comparison is appropriate, so as not to result in misleading conclusions. Of the many large-scale clinical studies included in this meta-analysis, the two that are most affected by this paper [17, 21] also demonstrate some of the least benefits from computer-aided detection enabled mammographic screening, indicating that computer-aided detection systems are in fact assisting in mammographic breast cancer screening. The results of this meta-analysis have been published in the journal Radiology [25].

Acknowledgements

This work was supported in part by the Canadian Breast Cancer Foundation.

References

[1] Humphrey LL, et al. Breast cancer screening: a summary of the evidence for the U.S. Preventive Services Task Force. Ann Intern Med. 2002;137:347-60.

[2] Tabar L, et al. The natural history of breast carcinoma: what have we learned from screening? Cancer. 1999;86:449-62.

[3] Freer TW, Ulissey MJ. Screening mammography with computer-aided detection: prospective study of 12,860 patients in a community breast center. Radiology. 2001;220:781-786.

[4] Helvie MA, Hadjiiski L, Makariou E, et al. Sensitivity of noncommercial computer-aided detection system for mammographic breast cancer detection: pilot clinical trial. Radiology. 2004;231:208-214.

[5] Birdwell RL, Bandodkar P, Ikeda DM. Computer-aided detection with screening mammography in a university hospital setting. Radiology. 2005;236:451-457.
[6] Morton MJ, Whaley DH, Brandt KR, Amrami KK. Screening mammograms: interpretation with computer-aided detection - prospective evaluation. Radiology. 2006;239:375-383.

[7] Dean JC, Ilvento CC. Improved cancer detection using computer-aided detection with diagnostic and screening mammography: prospective study of 104 cancers. Am J Roentgenol. 2006;187:20-28.

[8] Ko JM, Nicholas MJ, Mendel JB, Slanetz PJ. Prospective assessment of computer-aided detection in interpretation of screening mammography. Am J Roentgenol. 2006;187:1483-1491.

[9] Khoo LAL, Taylor P, Given-Wilson RM. Computer-aided detection in the United Kingdom National Breast Screening Programme: prospective study. Radiology. 2005;237:444-449.

[10] Georgian-Smith D, et al. Blinded comparison of computer-aided detection with human second reading in screening mammography. Am J Roentgenol. 2007;189:1135-1141.

[11] Brem RF, et al. Improvement in sensitivity of screening mammography with computer-aided detection: a multiinstitutional trial. Am J Roentgenol. 2003;181(3):687-693.

[12] Ciatto S, et al. Comparison of standard reading and computer aided detection (CAD) on a national proficiency test of screening mammography. Eur J Radiol. 2003;45(2):135-138.

[13] Morton MJ, et al. Screening mammograms: interpretation with computer-aided detection - prospective evaluation. Radiology. 2006;239:375-383.

[14] Destounis SV, et al. Can computer-aided detection with double reading of screening mammograms help decrease the false-negative rate? Initial experience. Radiology. 2004;232:578-584.

[15] Gur D, Sumkin JH, Rockette HE, et al. Changes in breast cancer detection and mammography recall rates after the introduction of a computer-aided detection system. J Natl Cancer Inst. 2004;96:185-190.

[16] Cupples TE, Cunningham JE, Reynolds JC. Impact of computer-aided detection in a regional screening mammography program.
Am J Roentgenol. 2005;185:944-950.

[17] Fenton JJ, Taplin SH, Carney PA, et al. Influence of computer-aided detection on performance of screening mammography. N Engl J Med. 2007;356:1399-1409.

[18] Gromet M. Comparison of computer-aided detection to double reading of screening mammograms: review of 231,221 mammograms. Am J Roentgenol. 2008;190:854-859.

[19] Gilbert FJ, et al. Single reading with computer-aided detection for screening mammography. N Engl J Med. 2008;359:1675-1684.

[20] James JJ, et al. Mammographic features of breast cancers at single reading with computer-aided detection and at double reading in a large multicenter prospective trial of computer-aided detection: CADET II. Radiology. 2010;256(2):379-86.

[21] Fenton JJ, et al. Effectiveness of computer-aided detection in community mammography practice. J Natl Cancer Inst. 2011;103(15):1152-1161.

[22] Levman J. Re: Effectiveness of computer-aided detection in community mammography practice. J Natl Cancer Inst. 2012;104(1):77-78.

[23] Nishikawa RM, Pesce LL. Computer-aided detection evaluation methods are not created equal. Radiology. 2009;251:634-636.

[24] Levman J. Disease detection rates are not necessarily a good way to evaluate a disease detection method in a longitudinal study. Eur J Public Health. 2011; E-letter, April 12, 2011.

[25] Levman J. Clinical mammographic screening: cross-sectional and longitudinal studies demonstrate benefits from computer-aided detection. Radiology. Published online February 4, 2013. URL: http://radiology.rsna.org/content/266/1/123/reply#radiology_el_256741