Cross-Domain Conditional Generative Adversarial Networks for Stereoscopic Hyperrealism in Surgical Training
Sandy Engelhardt (1,3), Lalith Sharan (1,3), Matthias Karck (2), Raffaele De Simone (2), and Ivo Wolf (1)

(1) Faculty of Computer Science, Mannheim University of Applied Sciences, Germany, s.engelhardt@hs-mannheim.de
(2) Department of Cardiac Surgery, Heidelberg University Hospital, Germany
(3) Dep. of Simulation and Graphics, Magdeburg University, Germany

Abstract. Phantoms for surgical training are able to mimic the cutting and suturing properties and the patient-individual shape of organs, but lack a realistic visual appearance that captures the heterogeneity of surgical scenes. To overcome this in endoscopic approaches, hyperrealistic concepts based on deep image-to-image transformation methods have been proposed for use in an augmented-reality setting. Such concepts are able to generate realistic representations of phantoms learned from real intraoperative endoscopic sequences. Conditioned on frames from the surgical training process, the learned models generate impressive results by transforming unrealistic parts of the image (e.g. the uniform phantom texture is replaced by the more heterogeneous texture of the tissue). Image-to-image synthesis usually learns a mapping G : X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y. However, it does not necessarily force the generated images to be consistent and free of artifacts. In the endoscopic image domain this can affect depth cues and the stereo consistency of a stereo image pair, which ultimately impairs surgical vision. We propose a cross-domain conditional generative adversarial network (GAN) approach that aims to generate more consistent stereo pairs. The results show substantial improvements in depth perception and realism, evaluated by 3 domain experts and 3 medical students on a 3D monitor, over the baseline method. In 84 of 90 instances, our proposed method was preferred over or rated equal to the baseline.

Keywords: Generative adversarial networks, minimally-invasive surgical training, augmented reality, mitral valve simulator, laparoscopy

1 Introduction

Minimally invasive surgery is characterized by a restricted view of the surgical target. In many such procedures, the only way to observe the surgical field is through endoscopic vision on an external display, which is commonly associated with impaired depth perception. This situation requires excellent hand-eye coordination and exceptional skill and dexterity with instruments, which should be trained on surgical simulators before operating on patients. On such simulators, photo-realistic fidelity and depth perception are key features that can improve the transfer ratio of trainees to real surgeries. However, most virtual or physical simulators lack such properties, due to limited capabilities for modelling realistic textures or to the materials currently used for surgical simulation.

In our previous work [1], a deep learning-based concept to tackle the issue of photo-realism of surgical simulations was presented. The approach was coined hyperrealism; it maps patterns learned from intraoperative video sequences onto the video stream captured during simulated surgery on anatomical replicas.
Used within an augmented reality setting, hyperrealism is defined as a new augmented-reality paradigm on the Reality-Virtuality continuum [2], which is closer to 'full reality' than other concepts in which artificial overlays are superimposed on a video frame. In a hyperrealistic surgical training environment, the parts of the simulated environment that look unnatural are replaced by realistic appearances; parts that already look natural ideally stay the same. Such an approach has been shown to greatly improve the reproduction of the intraoperative appearance during training and therefore makes minimally-invasive surgical training more realistic. The approach is in principle also employable for enhancing the photo-realism of virtual training simulators, as shown by Luengo et al. [3], who used a deep learning approach that relies on style transfer.

Methodologically, our proposed approach is based on so-called unpaired deep image-to-image transformation methods [4]. The underlying concept is to use adversarial training of a generator and a discriminator network, and to employ a cycle between the two input domains to generate realistic-looking images. The key to the success of such generative adversarial networks (GANs) is the idea of an adversarial loss that forces the generated images to be, in principle, indistinguishable from real images. Such concepts are able to generate realistic representations of phantoms learned from real intraoperative endoscopic sequences [1]. Conditioned on frames from the surgical training process, the learned models generate impressive results by transforming unrealistic parts of the image (e.g. the uniform phantom texture is replaced by the more heterogeneous texture of the tissue). However, the traditional CycleGAN approach [4] neither enforces temporal coherence, which was incorporated in our previous contribution [1], nor enforces a stereo pair to be consistent.

A recent work published at CVPR 2018 [5] was the first to address stereoscopic neural style transfer; approaches before it only dealt with monocular style transfer. The authors showed that applying stylization approaches independently to the left and right views of stereoscopic images does not preserve the original disparity consistency in the final stylization results, which causes 3D fatigue for viewers on a 3D display [5]. Chen et al. [5] incorporated a disparity loss into the style loss function that is employable in virtual scenes with known disparity information. In this contribution, we want to tackle the same issue for physical surgical simulators, i.e. the generation of more consistent stereo images without relying on ground-truth disparity, which is more complicated to obtain in the medical domain. To achieve this, a novel cross-domain conditional GAN is introduced in the following.

Fig. 1. Proposed architecture, showing the X → Y → X cycle for a stereo pair (x_l, x_r). In contrast to classical generators, each generator G and F takes two inputs and generates one output. The second input image, e.g. y_W or x_W, is taken from the other domain and can be chosen randomly. To enable better consistency, the output of G, which is y'_l, is chosen as the second input in the generation cycle of the right image. Discriminators D_Y and D_X evaluate real and fake images (marked in orange and green in the figure).
2 Methods

Given unpaired training samples in two image domains X and Y, the CycleGAN model proposed by Zhu et al. [4] learns a mapping (generator) G : X → Y and the reverse mapping F : Y → X. These generators are trained to produce outputs that are indistinguishable from real images in the respective target domains for the discriminator networks D_Y and D_X. Additionally, the consistency of the cyclic mappings F(G(x)) and G(F(y)) with the respective inputs x ∈ X and y ∈ Y is enforced using the L1-norm, i.e. by penalizing ||F(G(x)) − x||_1 and ||G(F(y)) − y||_1. The proposed method builds upon this idea to learn a style transfer between an image stream from the source domain X of surgical simulation and a target domain Y of intraoperative surgeries, and vice versa, in the absence of paired endoscopic image samples.

2.1 Cross-Domain Conditional GAN

For our task, the forward generator G : X → Y of the standard CycleGAN tended to create unrealistic colors and artifacts in the generated intraoperative scenes. To overcome this, we introduce cross-domain conditional GANs. Traditional conditional GANs [6] learn a mapping from an observed input domain X and random noise Z to the output domain Y, G : X × Z → Y. Isola et al. [7] found the additional random noise vector to be ineffective and dropped it from the paired image translation predecessor of the unpaired CycleGAN as well as from the CycleGAN itself (in contrast to the concurrent DualGAN [8]). We propose to re-introduce an additional input, but to use a sample y from the target domain distribution p_Y instead of random noise to guide the training of the generator G : X × Y → Y towards realistic coloring and preservation of detail (and analogously F : Y × X → X). For our stereo image translation task, we use a random sample y_W ~ p_Y for the translation of the left image x_l of the stereo pair, and the generated output of the generator, y'_l := G(x_l, y_W), for generating the right image, y'_r := G(x_r, y'_l), to additionally support the generation of consistently colored stereo pairs; see Fig. 1.

2.2 Network Architectures

The network architectures of the generators and discriminators are largely the same as in the original CycleGAN approach [4]. A TensorFlow implementation provided on GitHub (https://github.com/LynnHo/CycleGAN-Tensorflow-PyTorch-Simple) was used as the basis and extended. All discriminators take the complete input images, which differs from the 70 × 70 PatchGAN approach [4]. For the generators, 7 instead of 9 residual blocks are used, because experiments on our data showed better results for this configuration. Moreover, the generators were changed to handle a 6-channel input.
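A minimal PyTorch sketch of this stereo forward pass is given below. It is an illustrative re-implementation, not the authors' TensorFlow code: the class names (CondGenerator, ResBlock) and the exact layer layout are assumptions, but it reflects the two design choices stated above, namely the 6-channel generator input formed by channel-wise concatenation with the conditioning image, and the reuse of the generated left view y'_l as the condition for the right view.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Standard residual block, as in the CycleGAN generator (sketch)."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class CondGenerator(nn.Module):
    """Illustrative stand-in for the paper's generator: a 6-channel input
    (source image concatenated with a conditioning image from the target
    domain) and 7 residual blocks, as stated in Sec. 2.2. The real network
    additionally down- and up-samples like the CycleGAN generator."""
    def __init__(self, n_res_blocks: int = 7, ch: int = 64):
        super().__init__()
        layers = [nn.Conv2d(6, ch, kernel_size=7, padding=3), nn.ReLU(inplace=True)]
        layers += [ResBlock(ch) for _ in range(n_res_blocks)]
        layers += [nn.Conv2d(ch, 3, kernel_size=7, padding=3), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, x, cond):
        # Cross-domain conditioning: the condition enters as extra channels.
        return self.net(torch.cat([x, cond], dim=1))

G = CondGenerator()
x_l = torch.randn(1, 3, 256, 512)   # left phantom frame
x_r = torch.randn(1, 3, 256, 512)   # right phantom frame
y_W = torch.randn(1, 3, 256, 512)   # random sample from the intraoperative domain
y_l = G(x_l, y_W)                   # y'_l = G(x_l, y_W)
y_r = G(x_r, y_l)                   # y'_r = G(x_r, y'_l): reuse the generated left view
```

Feeding the condition as extra input channels, rather than as noise, lets the generator copy color statistics and texture cues from a genuine target-domain sample while the adversarial and cycle losses keep the scene content anchored to the source frame.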
3 Evaluation

A minimally invasive mitral valve repair simulator (MICS MVR surgical simulator, Fehling Instruments GmbH & Co. KG, Karlstein, Germany) was extended with patient-specific silicone mitral valves. In comparison to other simulators, the valve replicas consist of all anatomical parts, i.e. the annulus, the leaflets, the chordae tendineae and the papillary muscles. Details on the valve model production are elaborated on in a previous work [9]. An expert segmented pathological mitral valves at the end-systolic time step from echocardiographic data, which are represented as virtual models. From these models, 3D-printable molds and suitable prosthetic rings were automatically generated and 3D-printed. Subsequently, 15 silicone valves were manufactured that could be anchored in the simulator on a custom valve holder. We asked different experts and trainees to perform mitral valve repair techniques (annuloplasty, triangular leaflet resection, neo-chordae implantation) on these valves and recorded the endoscopic video stream [10].

3.1 Data and Training of the Network

In total, approx. 240,000 stereo pair frames from the surgical training procedures were captured in full HD resolution or larger, which sums up to 9 h of video material. Most of the videos were captured at 25 fps; due to a change in recording equipment, a subset of approx. 20,000 stereo frames was acquired at 1 fps. The training data for the network was sampled every 240th frame or every 40th frame, respectively. In total, the network training set consists of 1400 stereo pair frames from the training with the surgical phantom. To avoid overfitting of the model, valve replicas shown in videos used for network training were not used for network testing.

Intraoperatively, more than 620,000 stereo pairs were captured during three minimally invasive mitral valve repair surgeries. The frame rate varied between 60 fps and 25 fps. Scenes where the valve itself was not visible were discarded. For network training, a stereo pair after each 120th or 240th frame was sampled retrospectively from these videos, which sums up to approx. 1200 stereo pairs for network training. The scenes are highly diverse, as the valve's appearance changes drastically over time (e.g. due to cutting of tissue, implanting sutures and prostheses, and fluids such as blood and saline solution); see Fig. 2. Furthermore, occlusions or lens fogging often disturbed the recording.

Fig. 2. Mono- and stereoscopic examples from mitral valve repair. The scenes are diverse: with or without prosthetic ring, sutures, instruments and needles, blood etc.

Besides the stereo data collection just described, we pre-trained our network on a monoscopic database. A strength of the proposed method is that it does not rely solely on stereo pairs as input, but can also be trained un-stereo-paired. The monoscopic data was assembled from recordings of four open mitral valve surgeries with a monocular endoscope, in which case fewer lens occlusions and less fogging occurred. For the phantom recordings, half of the frames used for the monoscopic pre-training are also represented in the stereo data set. In total, the source and target domain consisted of approx. 1500 single frames each. All monoscopic frames, and each left and right image of the stereo pairs, were randomly cropped and re-scaled to 256 × 512. Further data augmentation was performed by random horizontal flipping and intensity re-scaling.

For all experiments, the consistency loss was weighted with λ = 20. The Adam solver with a batch size of 1 and a learning rate of 0.0001 without linear decay to zero was used. Similar to Zhu et al. [4], the objective was divided by 2 while optimizing D, which slows down the rate at which D learns relative to G. Discriminators are updated using a history of 50 generated images rather than the ones produced by the latest generator networks [4]. The CycleGAN network was pre-trained for 40 epochs on the monoscopic data and then trained on the left image of the stereo pair for another 40 epochs. Similarly, our proposed network was trained for 40+40 epochs. For using the proposed network in the monoscopic case, y_W is randomly chosen from the other domain in the X → Y → X cycle, and x_W accordingly in the reverse cycle.
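The 50-image history mentioned above follows the image-pool strategy of Zhu et al. [4]. Below is a minimal sketch of such a pool (an illustrative simplification, not the code actually used): once the buffer is full, the discriminator sees, with probability 0.5, a stored fake in place of the freshly generated one, which reduces oscillation during adversarial training.

```python
import random
import torch

class ImagePool:
    """History buffer of previously generated images; discriminators are
    updated from this pool rather than only from the newest fakes [4]."""
    def __init__(self, pool_size: int = 50):
        self.pool_size = pool_size
        self.images = []

    def query(self, image: torch.Tensor) -> torch.Tensor:
        # Fill the pool first; afterwards, return a historical fake half of
        # the time, swapping the new image into the vacated slot.
        if len(self.images) < self.pool_size:
            self.images.append(image.detach().clone())
            return image
        if random.random() < 0.5:
            idx = random.randrange(self.pool_size)
            old = self.images[idx]
            self.images[idx] = image.detach().clone()
            return old
        return image

pool = ImagePool(pool_size=50)
fake = torch.randn(1, 3, 256, 512)   # a freshly generated image
fake_for_D = pool.query(fake)        # image the discriminator is trained on
```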
3.2 Evaluation

The most important factors for the proposed application are related to perception. Therefore, we first evaluated whether three domain experts (cardiac surgeons who each have at least assisted in mitral valve repair surgeries) and three non-experts (medical students) are able to perceive depth on a 3D monitor showing interlaced stereo pairs. Secondly, we asked the surgeons how real the generated intraoperative stereo frames appear based on their experience. All answers had to be given on a 5-point Likert scale, with 5 being the answer with the highest agreement. Related to this, we found it crucial to ask the domain experts clinical questions in order to show the reliability of the transformation, which is associated with a change in the appearance of the scene. Reliability of the transformation requires that the scene does not change too drastically, meaning that neither should the shape of objects be altered, nor should additional parts be added or taken away. The surgeons were asked 1) to diagnose the pathology of the presented valve, 2) to name the surgical instrument visible in the scene, and 3) to state which phase of the surgery is presented.

We evaluated these questions by extracting 15 random samples from our test set; each sample was shown in interlaced format on a 3D monitor. For each frame, the corresponding result from the original CycleGAN was shown directly afterwards, enabling a direct comparison between our results and the baseline. At the start of the experiments, we asked the participants of the study to rate the depth of two realistic stereo frames from the surgical phantom, referred to in the following as the Test1 and Test2 examples.

4 Results

Example results of our method in comparison to the baseline CycleGAN are provided in Fig. 3. A rough visual analysis shows that the structure of the valves and of the instruments is better preserved by our method. Furthermore, the left and right images of the stereo pairs appear very consistent.

Fig. 3. Examples from the CycleGAN baseline [4] and our proposed method.

The same was confirmed by the user study. Fig. 4 illustrates the ratings by the non-experts for depth perception for each of the presented scenes. The medians for each participant on our results in comparison to the baseline are 3 to 2, 4 to 3, and 3 to 3, which means that our method was clearly favored over the baseline and that the participants had a three-dimensional impression from the synthesized stereo images. Fig. 4 also shows that depth perception even on real images of the silicone phantom is not assessed as completely perfect (Test1 and Test2). These ratings help to relate the assessment of the generated stereo images to samples taken from the real world. In 39 instances, our method was preferred over the baseline by the participants (10 instances are better by Δ2 and 29 are better by Δ1). In three instances, both methods were assessed as equally good, and in three other instances our method was rated worse in stereo consistency.
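For illustration, preference counts and margins Δ of the kind reported here can be tallied from paired per-scene Likert ratings as in the short sketch below; the ratings used in it are made up, not the study data.

```python
from collections import Counter

# Hypothetical 5-point Likert ratings per scene: (ours, baseline)
ratings = [(4, 3), (3, 3), (4, 2), (2, 3), (5, 4)]

margins = Counter()
preferred = equal = worse = 0
for ours, baseline in ratings:
    delta = ours - baseline
    if delta > 0:
        preferred += 1
        margins[delta] += 1   # count how far ours was rated above baseline
    elif delta == 0:
        equal += 1
    else:
        worse += 1

print(f"preferred: {preferred}, equal: {equal}, worse: {worse}")
for delta, n in sorted(margins.items()):
    print(f"  {n} instance(s) better by Δ{delta}")
```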
Considering the evaluation by the experts, a similar picture can be drawn. The respective diagram on the assessment of depth perception is provided in Fig. 4. In 35 instances, our method was preferred over the baseline by the experts (1 instance better by Δ3, 8 instances better by Δ2 and 26 instances better by Δ1). In seven cases, both methods were assessed as equal, and in three other instances our method was assessed as worse in comparison to the baseline. With regard to realism, our method is also superior, and was perceived as less artifact-prone and more similar to an intraoperative scene. Fig. 4 illustrates the ratings by the experts: in 37 cases, our method was preferred over the baseline (5 instances better by Δ3, 10 instances better by Δ2 and 22 instances better by Δ1).

Pathology assessment on the synthesized stereo frames yielded a good result: 37 of 45 correct decisions were made solely by watching the generated stereo pair. Furthermore, in 42 of 45 cases, the correct instrument shown in the scene was named. Please note that in some instances, the instruments are only visible by a small margin in the actual clipped image. Moreover, motion artifacts complicated the assessment in 2 cases (examples 13 and 14). In 43 of 45 assessments, the correct surgical phase was identified by the participants.

Fig. 4. Expert and non-expert ratings for depth perception and realism (panels: Depth Perception (Expert), Realism (Expert), Depth Perception (Non-Expert); 5-point Likert scale per example scene, including the Test1 and Test2 reference frames). Symbols indicate the rating per participant of the samples generated by the cross-domain conditional GAN. Arrows show the difference to CycleGAN [4].

5 Discussion

To the best of our knowledge, the presented approach is the first to address stereo-endoscopic scene transformation for minimally-invasive surgical training. In this paper, we propose a novel cross-domain conditional GAN, which is superior in synthesizing consistent and more realistic stereo data in comparison to the unpaired CycleGAN approach [4]. Due to the conditioning on a second image, which is drawn from the target domain (real or generated content), the network is also able to generate images with fewer artifacts and with more realistic color, heterogeneous textures, specularities and blood. The reliability of the generated samples was indirectly assessed by asking clinically relevant end points concerning the visible pathology, surgical instrument and surgical phase. We want to especially emphasize that almost all of the questions could be correctly answered with high confidence. In general, we decided against conducting a Visual Turing Test, as some shape-related features in the scene (e.g. a personalized ring shape instead of a standard commercial ring) would have been easily identified by an expert surgeon.

Future work includes the usage of the presented approach together with depth-sensing technologies, which are currently not applicable during surgery due to sterilization restrictions. The acquired depth information can be leveraged as ground-truth data for training disparity estimation models from transformed mono- or stereo-endoscopic images.

References
1. S. Engelhardt, R. De Simone, P. M. Full, M. Karck, and I. Wolf, "Improving surgical training phantoms by hyperrealism: Deep unpaired image-to-image translation from real surgeries," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, 2018, pp. 747–755.
2. P. Milgram and F. Kishino, "A taxonomy of mixed reality visual displays," IEICE Trans Inf Syst, vol. 77, no. 12, pp. 1321–1329, 1994.
3. I. Luengo, E. Flouty, P. Giataganas, P. Wisanuvej, J. Nehme, and D. Stoyanov, "Surreal: enhancing surgical simulation realism using style transfer," in British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3–6, 2018, 2018, p. 116.
4. J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2242–2251.
5. D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua, "Stereoscopic neural style transfer," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018, pp. 6654–6663.
6. M. Mirza and S. Osindero, "Conditional Generative Adversarial Nets," arXiv:1411.1784, Nov. 2014.
7. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks," arXiv:1611.07004, Nov. 2016.
8. Z. Yi, H. Zhang, P. Tan, and M. Gong, "DualGAN: Unsupervised Dual Learning for Image-To-Image Translation," in The IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 2868–2876.
9. S. Engelhardt, S. Sauerzapf, B. Preim, M. Karck, I. Wolf, and R. De Simone, "Flexible and Comprehensive Patient-Specific Mitral Valve Silicone Models with Chordae Tendineae Made From 3D-Printable Molds," International Journal of Computer Assisted Radiology and Surgery (IPCAI Special Issue), vol. 14, no. 7, 2019.
10. S. Engelhardt, S. Sauerzapf, A. Brčić, M. Karck, I. Wolf, and R. De Simone, "Replicated mitral valve models from real patients offer training opportunities for minimally invasive mitral valve repair," Interact Cardiovasc Thorac Surg., 2019.