Multitask Learning of Temporal Connectionism in Convolutional Networks using a Joint Distribution Loss Function to Simultaneously Identify Tools and Phase in Surgical Videos


Authors: Shanka Subhra Mondal, Rachana Sathish, Debdoot Sheet

Indian Institute of Technology, Kharagpur

Abstract

Surgical workflow analysis is of importance for understanding the onset and persistence of surgical phases and individual tool usage across a surgery and within each phase. It is beneficial for clinical quality control and to hospital administrators for understanding surgery planning. Video acquired during surgery can typically be leveraged for this task. Currently, a combination of convolutional neural networks (CNN) and recurrent neural networks (RNN) is popularly used for video analysis in general, not being restricted to surgical videos. In this paper, we propose a multi-task learning framework using a CNN followed by a bi-directional long short term memory (Bi-LSTM) to learn to encapsulate both forward and backward temporal dependencies. Further, the joint distribution indicating the set of tools associated with a phase is used as an additional loss during learning to correct for their co-occurrence in predictions. Experimental evaluation is performed using the Cholec80 dataset. We report mean average precision (mAP) scores of 0.99 and 0.86 for tool and phase identification respectively, which are higher than prior art in the field.

1 Introduction

Surgical workflow analysis using videos acquired from an endoscope is of assistance to surgeons and hospital administrators in assessing the quality and progress of surgery and for medico-legal litigation. Being able to provide tool usage information during surgery along with its phase, report generation, and determining the duration of surgery and the time to its completion are some such useful outputs. This summarization also makes it easy to find aberrations in the usage pattern of a particular tool during a surgery by comparing it with reports of past procedures. This paper* presents a multi-task deep learning framework which simultaneously infers both tool and phase information in video frames. The method is summarized in Fig. 1.

[Figure 1: Summary of the presented method. The phase indices 0-6 are, in order, {Preparation, CalotTriangleDissection, ClippingCutting, GallbladderDissection, GallbladderPackaging, CleaningCoagulation, GallbladderRetraction}. The tool indices 0-6 are, in order, {Grasper, Bipolar, Hook, Scissors, Clipper, Irrigator, SpecimenBag}.]

Challenges: In surgical videos the tools often appear occluded behind anatomical structures, which makes tool detection difficult. The endoscope used to acquire the video also suffers from motion jitter, leading to variation in the scene background and in the degree of illumination. Specular reflection is another related artifact. This makes the analysis challenging on account of the large-scale visual appearance modeling that must be performed. Relatedly, this also requires training data consisting of a large number of frames annotated with both tool and phase information, spanning the microcosms that make up such wide-scale visual variation, which is a tedious job and challenging to collect.
Approach: In the proposed framework a convolutional neural network (CNN) [12] is trained simultaneously for both tool and phase detection, where it learns to extract high-level visual features from the frames. It is trained with an additional weighted joint distribution loss function which captures the joint probability of co-occurrence of a particular set of tools generally associated with a phase of the surgery. The visual features extracted from the trained CNN are used to train a bi-directional long short term memory network (Bi-LSTM) [20] to capture both forward and backward temporal information across video frames. This temporal connectionism is important since in a surgery the phases are executed sequentially and there is an order in which a set of tools is used in each phase.

* Accepted paper at the 5th MedImage workshop of the 11th Indian Conference on Computer Vision, Graphics and Image Processing, Hyderabad, India, 2018.

Impact: On account of the weighted joint probabilistic loss function introduced in this multi-task training of a CNN, together with bidirectional temporal learning with an LSTM, the mean average precision for tools is higher than in all previous works in this domain, whereas for phase detection the mean average precision is comparable to the state of the art. The significant advancement over known prior art in the field is the ability to use a single network to solve both phase and tool detection simultaneously, at the highest performance metric, achieved through its auto-correcting learning ability using joint distribution modeling.

Organization of the paper: Earlier works on surgical tool and phase detection are briefly described in Sec. 2. The problem statement is presented in Sec. 3. The methodology is explained in Sec. 4. The experiments are detailed with results in Sec. 5. Sec. 6 presents the discussion, and Sec. 7 the conclusion.

2 Prior Work

Various types of video analysis solutions have been proposed through the years. Use of 3D CNNs [8], combining optical flow information with 2D images [22], and use of an RNN/LSTM along with a CNN to model long-term dependencies [6] are some widely known techniques. Many variants of these approaches have been used in both surgical phase and tool detection. A CNN was trained to sort surgical video frames so as to learn the temporal context between frames, and was combined with gated recurrent units (GRU) for surgical phase detection [3]. Later, [25] proposed a multi-task CNN framework for both tool and phase detection, extracting features from it and applying a hierarchical hidden Markov model (HHMM) for final phase detection. Another work [17] used a CNN for phase classification in cataract surgery, and improved accuracy through dataset purification and balancing. Later, [5] constructed a surgical process model, extracted various descriptors from images, and classified them using an AdaBoost classifier; the temporal aspect was further exploited using a hidden semi-Markov model. In [9] an evolutionary search in the space of global image features, using a genetic-programming-based approach, was proposed for phase detection in cholecystectomy videos.
Another approach processes information about tool usage from non-visual electromagnetic tracking sensors and the endoscopic camera for phase detection in laparoscopic surgeries, using a left-right hidden Markov model (HMM) [16]. Later, [10] proposed a framework to automatically detect surgical phases from microscope videos. It first manually defined visual cues that can help discriminate the high-level tasks; the visual cues are then detected automatically by image-based classifiers, and the obtained time series are aligned with a reference surgery using the dynamic time warping (DTW) algorithm for phase detection. Successively, [11] used a spatio-temporal CNN, which also encoded tool and temporal information, to extract visual features from surgical frames, and then built a classifier using DTW. In a prior work [15], tool presence in surgical video frames was detected by extracting visual features from a CNN and feeding them to an LSTM to learn the temporal connectionism. In a similar vein, [24] proposed a tool detection system for minimally invasive surgery based on a multiclass ensemble classifier built using gradient boosted regression trees. Subsequently, [25] used a CNN-only approach for tool detection in each frame, without considering temporal information across video frames. Later, [23] proposed an automatic method for detecting instruments in endoscopic images by segmenting the tip of the instrument and then recognizing it based on three-dimensional instrument models. Earlier work [18] used image processing techniques such as k-means clustering and Kalman filtering for localization and tracking of tools in surgical videos. [19] combined features extracted from pretrained and fine-tuned ImageNet models to create contextual features for tool detection, and later proposed label-set sampling to reduce bias. Later, [1] proposed using optical flow information between surgical images to exploit spatial redundancies between consecutive images. Subsequently, [2] proposed a CNN along with an RNN framework, followed by boosting of both networks and finally smoothing of the predictions for surgical tool detection.

All of the methods described above for tool and phase detection use a CNN, a CNN + RNN framework, or statistical methods, but none of them captures the joint probability distribution between the tools associated with a given phase while building a multitask learning framework. Also, while temporal information is captured in most of these works, they only consider the effect of past frames in determining the present tool or phase. It is equally important to look into the future as much as into the past for more accurate prediction in the current frame, and to consider a multitask framework in the temporal domain.

3 Problem Statement

A given video frame F_t contains information about a particular phase of surgery and about the multiple tools in use, which varies from a minimum of no tool to a maximum of three tools. In a surgical video dataset, the ground truth for the phase annotation of a frame is represented as a one-hot tensor y_t^1 ∈ {0,1} of size N_1 × 1, where N_1 is the number of surgical phases. The surgical tool ground truth is represented as a multi-hot tensor y_t^2 ∈ {0,1} of size N_2 × 1, where N_2 is the number of surgical tools.
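To make these two target encodings concrete, the following minimal sketch constructs them for a single frame (the tensor names and the chosen indices are illustrative, not taken from the paper; the experiments in Sec. 5 later extend the tool vector to N_2 = 8 with a no-tool class):

```python
import torch

N1, N2 = 7, 7  # number of surgical phases and of surgical tools

# Phase ground truth y_t^1: one-hot tensor, exactly one active phase per frame.
phase_idx = 2                 # e.g. ClippingCutting (illustrative)
y1 = torch.zeros(N1)
y1[phase_idx] = 1.0

# Tool ground truth y_t^2: multi-hot tensor, zero to three active tools per frame.
tools_present = [0, 4]        # e.g. Grasper and Clipper (illustrative)
y2 = torch.zeros(N2)
y2[tools_present] = 1.0
```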
The prediction problem is modelled as {ŷ_t^1, ŷ_t^2} ← H(F_t), where ŷ_t^1 and ŷ_t^2 are the phase and tool prediction tensors obtained from the trained multitask network H, which processes F_t. In the case of tools, none or more than one of the tool indices can be one in a given frame, so detecting the tools in a given video frame is a multilabel multiclass classification problem in which a subset of the total set of N_2 tools must be predicted.

4 Exposition to the Solution

[Figure 2: The full training pipeline. The two one-hot tensors (left) are the ground truths for phase and tool respectively. The arrows through the ResNet-50 and Bi-LSTM blocks show the gradient flow. WCE, WMLSF and WJPL stand for weighted cross entropy loss, weighted multi-label soft margin loss and weighted joint probabilistic loss respectively. The other arrows show which tensors contribute to each loss being computed. The bars denote the confidence levels of the corresponding predictions, with the top predictions marked in red.]

We propose a multitask learning framework using a CNN+LSTM to jointly solve both tool and phase detection, while learning with a weighted joint probability based loss function to model the dependence of tool and phase occurrence in a given frame. First, we train a CNN alone in the multi-task setting. Second, we use the features from the penultimate fully-connected layer of that CNN to train a bidirectional LSTM (Bi-LSTM) in a multi-task framework. The full training pipeline is shown in Fig. 2. These stages are detailed below.

4.1 Multitask learning of a CNN for phase and tool detection

Since the amount of tool-annotated data is small, training a deep CNN architecture from scratch has been observed to lead to convergence challenges as well as slow convergence. To speed up training we therefore use a CNN previously trained on ImageNet for the Large Scale Visual Recognition Challenge (ILSVRC) [4]. ResNet-50 [7] is used as the feature extractor and is fine-tuned on the task-specific dataset after replacing its output layer. The input to ResNet-50 is an image of size 224 × 224 px, and features of dimension 2,048 are taken from the penultimate fully connected layer. The output layer of ResNet-50 is replaced to accommodate both tool and phase classification, with tensors matching the properties of y_t^1 and y_t^2. Three different loss functions are used during training. For phase detection, the weighted cross entropy loss is used:

$$L_1(y_t^1, \hat{y}_t^1) = w_1[n]\left(-\hat{y}_t^1[n] + \log \sum_{j=0}^{N_1-1} \exp\left(\hat{y}_t^1[j]\right)\right) \qquad (1)$$

where n = arg max_j { y_t^1[j] } is the index of the ground-truth phase, and w_1[j] is the weight associated with the j-th of the N_1 phase classes, obtained by median frequency balancing to compensate for the high class imbalance in the training data.
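Equation (1) is the standard weighted cross entropy over raw logits, so it maps directly onto PyTorch's nn.CrossEntropyLoss with per-class weights. A minimal sketch follows; the phase-frame counts are invented for illustration, only the weighting scheme (median frequency balancing) is from the paper:

```python
import torch
import torch.nn as nn

def median_frequency_weights(class_counts):
    """Median frequency balancing: w[c] = median(freq) / freq[c]."""
    freq = class_counts.float() / class_counts.sum()
    return freq.median() / freq

# Hypothetical per-phase frame counts for the N1 = 7 phases (illustrative only).
phase_counts = torch.tensor([5000, 38000, 6700, 34000, 3900, 7100, 3300])
w1 = median_frequency_weights(phase_counts)

# Equation (1) corresponds to weighted cross entropy over the phase logits.
phase_loss = nn.CrossEntropyLoss(weight=w1)

logits = torch.randn(4, 7)             # y_hat_t^1 for a batch of 4 frames
targets = torch.tensor([1, 3, 3, 0])   # ground-truth phase indices n
loss = phase_loss(logits, targets)
```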
For tool detection, a weighted multi-label soft margin loss is used:

$$L_2(y_t^2, \hat{y}_t^2) = -\sum_{i=0}^{N_2-1} w_2[i]\left[\, y_t^2[i] \log\!\left(\frac{1}{1+\exp(-\hat{y}_t^2[i])}\right) + \left(1-y_t^2[i]\right) \log\!\left(\frac{\exp(-\hat{y}_t^2[i])}{1+\exp(-\hat{y}_t^2[i])}\right) \right] \qquad (2)$$

where ŷ_t^2[i] is the prediction for the i-th tool in F_t, y_t^2[i] is the ground-truth annotation of that tool's presence, and w_2[i] is the tool class weight, obtained by median frequency balancing to compensate for the high class imbalance in the training data.

The third component of the loss takes into consideration a model of the joint distribution of tool and phase occurrence, given as

$$L_3(\hat{x}_t^1, \hat{x}_t^2) = \sum_{i=0}^{N_2-1} \sum_{j=0}^{N_1-1} \hat{x}_t^2[i]\,\hat{x}_t^1[j]\,IF(i,j) \qquad (3)$$

where x̂_t^1 = SoftMax(ŷ_t^1) and x̂_t^2 = σ(ŷ_t^2), with σ(·) the sigmoid non-linearity, and IF(i,j) denotes the inverse of the frequency of occurrence of tool i ∈ [0, N_2 - 1] with phase j ∈ [0, N_1 - 1]. Using the information present in the annotated training data we create a phase-tool co-occurrence matrix C = {c_{i,j} ∀ i ∈ [0, N_2 - 1], j ∈ [0, N_1 - 1]} whose entries count the number of frames, over all videos, in which the i-th tool was being used during the j-th phase of surgery. Subsequently we form a normalized matrix Ĉ = {ĉ_{i,j}} with

$$\hat{c}_{i,j} = \frac{c_{i,j}}{\sum_{i=0}^{N_2-1} c_{i,j}} \quad \forall j \in [0, N_1-1].$$

This is used to define the inverse-frequency function IF(i,j) = 1/(ĉ_{i,j} + ε), where ε is the smallest value representable in the number system being used. The function is characterized such that if the frequency of a phase-tool co-occurrence is zero, IF takes a very large value and induces a very high loss in that case.
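A minimal sketch of how the co-occurrence statistics and the loss in (3) might be implemented is given below; the function names are ours, and the batch-averaging choice is an assumption. In the same spirit, (2) corresponds to PyTorch's nn.MultiLabelSoftMarginLoss with its weight argument set to w_2:

```python
import torch
import torch.nn.functional as F

def inverse_frequency(counts, eps=torch.finfo(torch.float32).eps):
    """counts: (N2, N1) phase-tool co-occurrence matrix C, where counts[i, j]
    is the number of training frames with tool i present during phase j.
    Returns IF(i, j) = 1 / (c_hat[i, j] + eps); unseen pairs get a huge value."""
    c_hat = counts.float() / counts.sum(dim=0, keepdim=True)  # normalize over tools
    return 1.0 / (c_hat + eps)

def joint_distribution_loss(phase_logits, tool_logits, IF):
    """Equation (3): penalize predicted (tool, phase) pairs that rarely co-occur."""
    x1 = F.softmax(phase_logits, dim=-1)   # (B, N1) phase probabilities
    x2 = torch.sigmoid(tool_logits)        # (B, N2) per-tool probabilities
    # sum_i sum_j x2[i] * x1[j] * IF(i, j), averaged over the batch
    return torch.einsum('bi,bj,ij->b', x2, x1, IF).mean()
```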
4.2 Multitask learning of a Bi-LSTM

The features extracted from the penultimate fully connected layer of the previously trained ResNet-50 are used to train a multitask Bi-LSTM [13] in a similar learning framework, using the same cost functions as in (1), (2) and (3). A whitening transform [21] is applied to all features across the training data fed to the Bi-LSTM. Due to its bidirectional nature, the Bi-LSTM maintains two hidden layers: one propagates left to right through the time-unrolled sequence, the other right to left. The final classification result is generated by combining the scores produced by both LSTM hidden layers. The input to the Bi-LSTM is the sequence of visual features extracted from the frames of an entire video. A single-layer Bi-LSTM with 1,024 hidden neurons was used. Finally, median filtering is applied to the phase predictions to remove any abrupt changes.

5 Experiments and Results

5.1 Dataset Description

The proposed method is evaluated on the Cholec80† dataset, which contains 80 videos of cholecystectomy surgeries performed by 13 surgeons at the University Hospital of Strasbourg. Phase annotation is provided for all frames at 25 frames per second (fps), whereas tools are annotated in one of every 25 frames, giving a 1 fps annotation rate on a 25 fps video. These annotations are rate-matched to 1 fps. The dataset is split into two equal parts: the first 40 videos are used for training the multitask CNN and Bi-LSTM, and the last 40 videos are used for validation and testing. The visual appearance and the list of the 7 surgical tools in the Cholec80 dataset are given in Fig. 3. The seven different surgical phases and the mean ± standard deviation of their durations are given in Table 1. The dataset is also imbalanced with respect to both surgical phases and tools, as evident in Fig. 4(a) and Fig. 4(b) respectively. Accordingly, N_1 = 7, corresponding to the 7 phases of surgery, and N_2 = 8, corresponding to the 7 tools plus the no-tool case. The phase-tool co-occurrence matrix C is visualized in Fig. 5.

† http://camma.u-strasbg.fr/datasets

Table 1: Mean ± standard deviation of duration for the seven phases in the Cholec80 dataset.

Phase Id | Phase Name                | Duration (secs)
P1       | Preparation               | 125 ± 95
P2       | Calot triangle dissection | 954 ± 538
P3       | Clipping and cutting      | 168 ± 152
P4       | Gallbladder dissection    | 857 ± 551
P5       | Gallbladder packaging     | 98 ± 53
P6       | Cleaning and coagulation  | 178 ± 166
P7       | Gallbladder retraction    | 83 ± 56

[Figure 3: List of the seven surgical tools present in the Cholec80 dataset.]

[Figure 4: Tool and phase distribution in the Cholec80 training dataset. (a) Phase occurrence. (b) Tool co-occurrence.]

[Figure 5: Co-occurrence matrix of tools and phases in the Cholec80 training dataset.]

5.2 Training

The multitask CNN (Sec. 4.1) is trained using the stochastic gradient descent (SGD) algorithm with a learning rate of 1 × 10^-4, a learning rate scheduler which scales the learning rate by a factor of 0.9 when the validation loss does not decrease for more than 5 consecutive epochs, a batch size of 100 frames, weight decay of 5 × 10^-4 and momentum of 0.9. The multitask Bi-LSTM (Sec. 4.2) is trained with a learning rate of 1 × 10^-2, a scheduler which scales the learning rate by a factor of 0.5 when the validation loss does not decrease for more than 5 consecutive epochs, and a batch size of 1 video, with the remaining parameters kept the same.
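The sketch below shows one plausible shape for the second-stage model and its optimization setup; the two linear heads and the median-filter window are our assumptions, while the layer sizes, optimizer and scheduler settings follow Sec. 4.2 and 5.2:

```python
import torch
import torch.nn as nn
from scipy.signal import medfilt

class MultitaskBiLSTM(nn.Module):
    """Single-layer Bi-LSTM over per-frame ResNet-50 features with a phase
    head and a tool head; a sketch of the second training stage (Sec. 4.2)."""
    def __init__(self, feat_dim=2048, hidden=1024, n_phases=7, n_tools=8):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.phase_head = nn.Linear(2 * hidden, n_phases)  # both directions
        self.tool_head = nn.Linear(2 * hidden, n_tools)

    def forward(self, feats):                  # feats: (1, T, 2048), one video
        h, _ = self.lstm(feats)                # (1, T, 2 * hidden)
        return self.phase_head(h), self.tool_head(h)

model = MultitaskBiLSTM()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
# Scale the LR by 0.5 when the validation loss plateaus for 5 epochs (Sec. 5.2);
# call scheduler.step(val_loss) once per epoch.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.5, patience=5)

# Example inference over one video (frame count and features are invented).
with torch.no_grad():
    feats = torch.randn(1, 250, 2048)
    phase_logits, _ = model(feats)
    phase_pred = phase_logits.argmax(dim=-1).squeeze(0).numpy()
# Median-filter the per-frame phase labels to remove abrupt changes; the
# window length (9) is our assumption, not taken from the paper.
phase_pred = medfilt(phase_pred.astype(float), kernel_size=9)
```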
5.3 Baselines

For comparison with the proposed method, seven baselines are considered. BL1 is the modified multi-label multi-class ResNet-50, which predicts the tools present in an individual frame without using any temporal information. BL2 is BL1 together with a Bi-LSTM. BL3 is a modified multi-class ResNet-50 used only for phase prediction on individual frames. BL4 is BL3 + Bi-LSTM. BL5 is a modified ResNet-50 which jointly predicts both tool and phase on individual frames only, trained with the 3 loss functions. BL6 is EndoNet [25], which predicts both tool and phase. BL7 is the boosted CNN + RNN of [2], which predicts only tools. The proposed method is essentially BL5 + Bi-LSTM.

5.4 Implementation

The experiments were implemented in PyTorch 0.4‡, accelerated with Nvidia CUDA 9.0§ and cuDNN 7.3¶, on Ubuntu 16.04 LTS Server OS. The server consisted of 2x Intel Xeon E5-2699 v3 CPUs, 2x 32 GB DDR4 ECC registered RAM, a 4 TB HDD and 1x Nvidia Quadro P6000 GPU with 24 GB DDR5 RAM. The CNN models (BL1, BL3, BL5) were trained for 200 epochs, while the Bi-LSTMs of the adjunct models (BL2, BL4, proposed method) were trained for 1,000 epochs.

‡ https://pytorch.org
§ https://developer.nvidia.com/cuda-90-download-archive
¶ https://developer.nvidia.com/cudnn

Table 2: Performance comparison of the proposed method with the baselines (✓ = task addressed, - = not applicable). AP, Rec. and Acc. denote average precision, average recall and average accuracy.

Baseline        | Tool | AP     | Rec.  | Acc.   | Phase | AP    | Rec.   | Acc.
BL1             | ✓    | 0.955  | 0.928 | 0.958  | -     | -     | -      | -
BL2             | ✓    | 0.963  | 0.936 | 0.964  | -     | -     | -      | -
BL3             | -    | -      | -     | -      | ✓     | 0.515 | 0.63   | 0.88
BL4             | -    | -      | -     | -      | ✓     | 0.63  | 0.717  | 0.92
BL5             | ✓    | 0.974  | 0.88  | 0.938  | ✓     | 0.705 | 0.6944 | 0.935
BL6             | ✓    | 0.81   | -     | -      | ✓     | 0.848 | 0.883  | 0.92
BL7             | ✓    | 0.9789 | -     | -      | -     | -     | -      | -
Proposed Method | ✓    | 0.99   | 0.912 | 0.9353 | ✓     | 0.857 | 0.835  | 0.966

5.5 Results

The comparison between the baselines and the proposed method on three metrics, namely average precision, average recall and average accuracy, is shown in Table 2. The performance of the baselines (BL1, BL2, BL5) and of the proposed method on tool-wise precision is shown in Fig. 6. The performance of the baselines (BL3, BL4, BL5) and of the proposed method on phase-wise precision and accuracy is shown in Fig. 7 and Fig. 8 respectively. All results are reported on the validation set (the last 40 videos of the Cholec80 dataset).

[Figure 6: Performance of BL1, BL2, BL5 and the proposed method for tool-wise precision.]

[Figure 7: Performance of BL3, BL4, BL5 and the proposed method for phase-wise precision.]

[Figure 8: Performance of BL3, BL4, BL5 and the proposed method for phase-wise accuracy.]
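The average-precision figures in Table 2 are frame-level, per-class scores. One plausible way to compute the tool mAP over the validation frames is sketched below; this is our reading of the metric, not the authors' evaluation code:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def tool_map(y_true, y_score):
    """y_true: (frames, N2) multi-hot tool ground truth over the test videos;
    y_score: (frames, N2) sigmoid tool confidences from the network.
    Returns the mean of the per-tool average precisions (mAP)."""
    aps = [average_precision_score(y_true[:, i], y_score[:, i])
           for i in range(y_true.shape[1])]
    return float(np.mean(aps))
```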
6 Discussion

In this paper we have proposed a new loss function for multitask learning, a weighted joint probabilistic loss which models the dependency of a set of tools on a phase in laparoscopic surgeries, together with a CNN and Bi-LSTM framework which jointly predicts tool and phase. We show through experiments that the mean average precision (mAP) obtained for tool detection outperforms all previous architectures, and that phase detection yields better results with respect to mAP as well as higher accuracy. This indicates that the visual features learned by the CNN provide valuable information through rich features to the Bi-LSTM. Furthermore, the interdependence between tool and phase conveyed to the network through the weighted joint probabilistic loss function, which ultimately affects the gradients and the parameter updates, helps achieve better convergence. Another important aspect of our framework is the use of a Bi-LSTM, which has an inherent capability to capture long-term dependencies both into the past and into the future, as is expected to be required for better prediction in the temporal domain. For the Bi-LSTM, stacking a full video as one batch and applying the whitening transform to the CNN features prior to learning yield significantly better performance and faster convergence. The median filtering applied to the phase predictions obtained from the Bi-LSTM also resulted in a slight improvement in mAP and accuracy, owing to the removal of abrupt changes.

The results are reported on the Cholec80 dataset, which contains 80 videos of cholecystectomy surgeries. Some previous works used fewer than 20 surgery videos for surgical workflow analysis, which limited their performance on account of the inability to learn the richness of visual appearances associated with tools and phases. Without using any data augmentation techniques to compensate for the tool and phase imbalance seen in Fig. 4(a) and Fig. 4(b), the model gave significantly better results, which suggests that it is robust to data imbalance. The dataset also contains substantial variability in phase duration, as seen in Table 1, which does not affect the phase detection results to any significant extent, thereby demonstrating the network's capability to tackle such challenges.

Although the model can overcome the challenges described above, there are some limitations. Firstly, the Cholec80 dataset is limited to surgeons from one institution, which can easily lead to over-fitting; a dataset containing surgeries by multiple surgeons from different institutions should be used for training, which can yield more generalized results. Secondly, no image processing techniques were applied to the raw frames extracted from the videos to remove redundant information, which could help the CNN learn better features. Thirdly, the framework requires training the CNN first, followed by the Bi-LSTM; making it an end-to-end system would require training only once, which would be less computationally expensive and is desirable.

7 Conclusion

A multitask deep learning framework comprising a ResNet-50 and a Bi-LSTM with a weighted joint distribution loss function has been proposed. It gives better mAP for tool detection and comparable results for phase detection. The applicability of the proposed method is not necessarily limited to tool and phase detection; other areas such as tool localization, estimation of the completion time of surgery and recognition of anatomy should be explored. Also, the tools in many images can have various orientations with respect to the camera depending on the surgery, so the use of vector convolutions [14] could make the system rotation-invariant; this is seen as future work to improve tool and phase prediction with the ability to learn from a limited annotated data corpus.

References

[1] Al Hajj, H., Lamard, M., Charrière, K., Cochener, B., Quellec, G.: Surgical tool detection in cataract surgery videos through multi-image fusion inside a convolutional neural network. In: IEEE Ann. Int. Conf. Engg. Medicine Bio. Soc., pp. 2002-2005 (2017)

[2] Al Hajj, H., Lamard, M., Conze, P.H., Cochener, B., Quellec, G.: Monitoring tool usage in surgery videos using boosted convolutional and recurrent neural networks. Med. Image Anal. 47, 203-218 (2018)

[3] Bodenstedt, S., Wagner, M., Katić, D., Mietkowski, P., Mayer, B., Kenngott, H., Müller-Stich, B., Dillmann, R., Speidel, S.: Unsupervised temporal context learning using convolutional neural networks for laparoscopic workflow analysis. arXiv preprint arXiv:1702.03684 (2017)

[4] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: Proc. IEEE Conf. Comp. Vis. Patt. Recog., pp. 248-255 (2009)

[5] Dergachyova, O., Bouget, D., Huaulmé, A., Morandi, X., Jannin, P.: Automatic data-driven real-time segmentation and recognition of surgical workflow. Int. J. Comp. Assist. Radio. Surgery 11(6), 1081-1089 (2016)

[6] Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proc. IEEE Conf. Comp. Vis. Patt. Recog., pp. 2625-2634 (2015)

[7] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. IEEE Conf. Comp. Vis. Patt. Recog., pp. 770-778 (2016)

[8] Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Patt. Anal. Machine Intell. 35(1), 221-231 (2013)
[9] Klank, U., Padoy, N., Feussner, H., Navab, N.: Automatic feature generation in endoscopic images. Int. J. Comp. Assist. Radio. Surgery 3(3-4), 331-339 (2008)

[10] Lalys, F., Jannin, P.: Surgical process modelling: a review. Int. J. Comp. Assist. Radio. Surgery 9(3), 495-511 (2014)

[11] Lea, C., Choi, J.H., Reiter, A., Hager, G.D.: Surgical phase recognition: from instrumented ORs to hospitals around the world. In: Int. Conf. Med. Image Comput. Comp. Assist. Interv. - M2CAI workshop, pp. 45-54 (2016)

[12] LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series. In: The Handbook of Brain Theory and Neural Networks (1995)

[13] Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proc. Ann. Meeting, Assoc. Comput. Linguistics, vol. 1, pp. 1064-1074 (2016)

[14] Marcos, D., Volpi, M., Komodakis, N., Tuia, D.: Rotation equivariant vector field networks. In: Proc. IEEE Int. Conf. Comp. Vis., pp. 5048-5057 (2017)

[15] Mishra, K., Sathish, R., Sheet, D.: Learning latent temporal connectionism of deep residual visual abstractions for identifying surgical tools in laparoscopy procedures. In: Proc. IEEE Conf. Comp. Vis. Patt. Recog. Workshops, pp. 58-65 (2017)

[16] Padoy, N., Blum, T., Feussner, H., Berger, M.O., Navab, N.: On-line recognition of surgical activity for monitoring in the operating room. In: AAAI Conf. Artif. Intell., pp. 1718-1724 (2008)

[17] Primus, M.J., Putzgruber-Adamitsch, D., Taschwer, M., Münzer, B., El-Shabrawi, Y., Böszörmenyi, L., Schoeffmann, K.: Frame-based classification of operation phases in cataract surgery videos. In: Int. Conf. Multimedia Model., pp. 241-253 (2018)

[18] Ryu, J., Choi, J., Kim, H.C.: Endoscopic vision based tracking of multiple surgical instruments in robot-assisted surgery. In: Int. Conf. Control, Autom. Sys., pp. 2195-2198 (2012)

[19] Sahu, M., Mukhopadhyay, A., Szengel, A., Zachow, S.: Tool and phase recognition using contextual CNN features. arXiv preprint (2016)

[20] Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Sig. Proces. 45(11), 2673-2681 (1997)

[21] Shental, N., Hertz, T., Weinshall, D., Pavel, M.: Adjustment learning and relevant component analysis. In: Proc. Eur. Conf. Comp. Vis., pp. 776-790 (2002)

[22] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Adv. Neural Info. Proces. Sys., pp. 568-576 (2014)

[23] Speidel, S., Benzko, J., Krappe, S., Sudra, G., Azad, P., Müller-Stich, B.P., Gutt, C., Dillmann, R.: Automatic classification of minimally invasive instruments based on endoscopic image sequences. In: Med. Imag.: Vis., Image-Guided Proced. Modeling, vol. 7261, p. 72610A (2009)

[24] Sznitman, R., Becker, C., Fua, P.: Fast part-based classification for instrument detection in minimally invasive surgery. In: Int. Conf. Med. Image Comput. Comp. Assist. Interv., pp. 692-699 (2014)

[25] Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., Padoy, N.: EndoNet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Trans. Medical Imag. 36(1), 86-97 (2017)
