Transfer Learning for Ultrasound Tongue Contour Extraction with Different Domains
Authors: M. Hamed Mozaffari, Won-Sook Lee
School of Electrical Engineering and Computer Science, University of Ottawa, 800 King-Edward Avenue, Ottawa, Ontario, Canada, K1N-6N5.

Abstract. Medical ultrasound technology is widely used in routine clinical applications such as disease diagnosis and treatment, as well as in other applications like real-time monitoring of human tongue shape and motion as visual feedback in second language training. Due to the low contrast and noisy nature of ultrasound images, non-expert users may lack the expertise needed to recognize tongue gestures. Manual tongue segmentation is a cumbersome, subjective, and error-prone task; furthermore, it is not a feasible solution for real-time applications. In the last few years, deep learning methods have been used for delineating and tracking the tongue dorsum. Deep convolutional neural networks (DCNNs), which have proven successful in medical image analysis tasks, are typically weak at the same task on different domains: in many cases, DCNNs trained on data acquired with one ultrasound device do not perform well on data from a different ultrasound device or acquisition protocol. Domain adaptation is an alternative solution to this difficulty, in which the weights of a model trained on a large annotated legacy dataset are transferred to a new model that is then adapted to a different dataset by fine-tuning. In this study, after conducting extensive experiments, we address the problem of domain adaptation on small ultrasound datasets for tongue contour extraction. We trained a U-net network comprising an encoder-decoder path from scratch, and then, in several surrogate scenarios, fine-tuned parts of the trained network on another dataset to obtain domain-adapted networks.
We repeated the scenarios from target to source domains to find a balance point for knowledge transfer from source to target and vice versa. The performance of the new fine-tuned networks was evaluated on the same task with images from different domains.

Keywords: Automatic ultrasound tongue contour extraction · Domain adaptation · Fully convolutional neural network · Transfer learning · Ultrasound image segmentation.

1 Introduction

Ultrasound imaging is safe, relatively affordable, and capable of real-time performance. This technology has been utilized for many real-time medical applications. Recently, ultrasound has been used to visualize and characterize human tongue shape and motion in real-time speech, in order to study healthy or impaired speech production in applications such as visual second language training [4] or silent speech interfaces [2]. However, recognizing tongue shape and motion in noisy and low-contrast ultrasound data requires expertise that non-expert users lack. To address this problem, and to enable quantitative analysis, the tongue surface (dorsum) can be extracted, tracked, and visualized superimposed on the whole tongue region. Delineating the tongue surface from each frame is a cumbersome, subjective, and error-prone task. Moreover, the rapidity and complexity of tongue gestures make it challenging, and manual segmentation is not a feasible solution for real-time applications.

Over the years, several image processing techniques, such as active contour models [7], have shown their capability for automatic tongue contour extraction. Many of those traditional methods frequently need manual labelling, initialization, monitoring, and manipulation. Furthermore, those methods are computationally expensive, since the image gradient must be calculated for each frame [5].
In recent years, convolutional neural networks (CNNs) have been the method of choice for medical image analysis, with outstanding results [8]. A few studies have already investigated automatic tongue contour extraction [14] and tracking [10] using CNNs. In spite of their excellent results on a specific dataset from one ultrasound device, the generalization of those methods to test data with different distributions, from different ultrasound devices, is often not investigated or evaluated. Although ultrasound tongue datasets have different distributions, there is always a correlation between the movements of the tongue and its possible positions in the mouth. Therefore, domain adaptation might provide a universal solution for automatic, real-time tongue contour extraction, applicable to the majority of ultrasound datasets.

In transfer learning [12], a domain D can be expressed by a feature space R together with a marginal probability distribution P(X), where X = {x_1, ..., x_n} is defined on that feature space, i.e. D = {R, P(X)}. On a specific domain D, a task T = {Y, f(·)} comprises a label space Y and an objective function f(·); the objective function f(·) can be optimized and learned from the training data, which consists of pairs {x_i, y_i} with x_i ∈ X and y_i ∈ Y, in a supervised fashion using one CNN model. After the optimization process terminates, the trained CNN model, denoted f̃(·), can predict the label of a new instance x. Transfer learning is defined as the procedure of enhancing the target prediction function f_T(·) in D_T using the information in D_S and T_S, where a source domain D_S with a learning task T_S and a target domain D_T with a learning task T_T are given and D_S ≠ D_T, or T_S ≠ T_T [3]. Therefore, the prediction function f̃_ST(·) is first trained on the source domain D_S and then fine-tuned for the target domain D_T.
Conversely, f̃_TS(·) is initially trained on the target task and then domain-adapted to the source dataset.

Fully convolutional networks (FCNs), consisting of consecutive convolutional and pooling layers (encoder) and one up-sampling layer (decoder), were successfully exploited for the semantic segmentation problem in [9]. Due to the loss of information in pooling layers, a single up-sampling layer cannot retrieve the input-sized resolution in the output prediction map. Concatenating feature maps from deconvolutional layers (DeconvNet) [11] with encoder layers in U-net [13] significantly improved segmentation accuracy. The encoder of U-net learns simple visual image features, especially in its first few layers, while the decoder aims to reconstruct the input-sized output prediction map from the complicated, abstract, and task-dependent features of the last encoder layer. Although encoder-decoder models like U-net have been used for tongue contour extraction, it is still not obvious how much knowledge is preserved during the transfer learning process for domain adaptation.

In this study, the performance of U-net in different scenarios was analyzed to answer some fundamental questions in domain adaptation. We investigated how many layers of the decoder should be fine-tuned to achieve the best segmentation accuracy in both the source and target domains at the same time (we call this a balance point). Furthermore, the effects of the target dataset size and of the skip and concatenation operations on the performance of U-net were explored on the problem of ultrasound tongue contour extraction.
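As a toy illustration of the f̃_ST procedure defined above (pre-train f̃_S on the source domain D_S, then continue optimizing the same weights on the target domain D_T), the following sketch substitutes a simple linear classifier for the CNN; the synthetic data, shapes, and training schedule are our assumptions, not the paper's setup:

```python
import numpy as np

def train(w, X, y, lr=0.1, steps=200):
    """Plain gradient descent on the binary cross-entropy loss of a linear model."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # sigmoid prediction f(x)
        w -= lr * X.T @ (p - y) / len(y)        # BCE gradient step
    return w

rng = np.random.default_rng(0)
# Source domain D_S and a smaller target domain D_T with a shifted distribution.
Xs = rng.normal(0.0, 1.0, (200, 5)); ys = (Xs.sum(axis=1) > 0).astype(float)
Xt = rng.normal(0.5, 1.2, (50, 5));  yt = (Xt.sum(axis=1) > 0).astype(float)

w_src = train(np.zeros(5), Xs, ys)              # f~_S: learned on the source
w_st  = train(w_src.copy(), Xt, yt, steps=50)   # f~_ST: fine-tuned on the target
```

The second `train` call is the fine-tuning step: it starts from the source-trained weights rather than from scratch, which is the essence of the weight-transfer procedure studied in the paper.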
2 Materials and Method

2.1 Dataset

Ultrasound video frames were randomly selected from videos recorded with a linear transducer connected to a Sonix Tablet ultrasound device at the University of Ottawa, as well as from videos of the Seeing Speech project [6]. Using the informed undersampling method [1], we generated two 2050-image datasets with different distributions: dataset I (uOttawa) and dataset II (SeeingSpeech). In this method, an average intensity image is calculated over the entire dataset, and each frame is assigned a score depending on its intensity distance to the average image. After sorting the data by rank, we selected the 2000 frames with the highest rank (high variance) and the 50 images with the lowest rank (low variation). Ground truth labels corresponding to each image were annotated semi-automatically by two experts using our custom annotation software. Off-line augmentation comprising natural transformations of ultrasound data (e.g., horizontal flipping, restricted rotation, and zooming) was employed to create larger datasets (50K frames each). We split each dataset into training, validation, and test sets using a 90%/5%/5% ratio.

2.2 Network Architecture and Training

Fully convolutional networks (FCNs) can be considered as dense classification networks (e.g., VGG-nets) with consecutive convolutional and pooling layers, in which a fully convolutional layer substitutes for the fully connected layer (e.g., the softmax in the last layer). Similarly, DeconvNet is an FCN with several deconvolutional layers in the up-sampling path. In U-net [13], which is a DeconvNet architecture, feature maps (coarse contextual information) skip from each down-sampling layer and are concatenated with the deconvolutional layers to increase the accuracy of the output segmentation. Structural details of U-net are presented in Fig. 1.

Fig. 1. An overview of network structures.
Numbers in circles show several scenarios for finding the best model for domain adaptation.

The DeconvNet comprises 9 double convolutional layers of 3×3 filters with the Rectified Linear Unit (ReLU) activation function as the non-linearity. Activations of all layers were normalized using batch normalization layers to speed up convergence. In the down-sampling path, there are four max-pooling layers, for the sake of translation invariance and of saving memory by decreasing the number of learnable parameters. In contrast, in the up-sampling path, there are four deconvolutional layers, which retrieve the original receptive field and spatial resolution. Finally, high-level reasoning is done by a fully convolutional layer at the end.

Network models were deployed using the Keras API with the publicly available TensorFlow framework as the backend library. Network parameters were initialized with randomly distributed values. Adam optimization was chosen, with a fixed momentum value of 0.9, for finding the optimum solution of a binary cross-entropy loss function. Each network model was trained using the mini-batch method, employing one NVIDIA 1080 GPU installed in a Windows PC with a 4.2 GHz Core i7 and 32 GB of RAM. Our hyper-parameter tuning revealed that, besides the network architecture size, the learning rate has the most significant effect on the accuracy of each architecture. Testing fixed and scheduled decaying learning rates showed that a variable learning rate might provide better results, but it requires different initializations of the decay factor and decay steps. Therefore, for the sake of a fair comparison, we only report results using fixed learning rates. To alleviate over-fitting, we regularized our networks with a drop-out rate of 0.5.
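The architecture described above (double 3×3 convolutions with batch normalization and ReLU, four max-pooling steps, four deconvolutional layers with an optional skip/concatenation path, and a final fully convolutional layer) can be sketched in Keras roughly as follows; the filter counts, input size, and exact layer arrangement are our assumptions, not the paper's published configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def double_conv(x, filters):
    # Two 3x3 convolutions, each followed by batch normalization and ReLU.
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

def build_unet(size=128, base=16, skip=True):
    # skip=True gives the U-net variant; skip=False the plain DeconvNet.
    inp = tf.keras.Input((size, size, 1))
    x, feats = inp, []
    for i in range(4):                       # encoder: 4 double-conv blocks + max-pooling
        x = double_conv(x, base * 2 ** i)
        feats.append(x)
        x = layers.MaxPooling2D()(x)
    # Bottleneck: with the 4 encoder and 4 decoder blocks this makes 9 double-conv blocks.
    x = double_conv(x, base * 16)
    x = layers.Dropout(0.5)(x)               # drop-out regularization (rate 0.5)
    for i in reversed(range(4)):             # decoder: 4 deconvolutional steps
        x = layers.Conv2DTranspose(base * 2 ** i, 2, strides=2, padding="same")(x)
        if skip:                             # inject encoder feature maps
            x = layers.Concatenate()([x, feats[i]])
        x = double_conv(x, base * 2 ** i)
    out = layers.Conv2D(1, 1, activation="sigmoid")(x)   # fully convolutional head
    model = tf.keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Calling `build_unet(skip=False)` drops the concatenation path, giving a DeconvNet-style baseline for the comparison made in the experiments; the output map has the same spatial size as the input.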
Networks were trained for a maximum of five epochs, each of 5000 iterations, with a mini-batch size of 10.

2.3 Domain Adaptation Scenarios

Models for f̃_ST(·) were built in several scenarios by transferring the learned weights from f̃_S while freezing the encoder and parts of the decoder. Specifically, in scenario I, we transferred the weights of the whole encoder as well as portions of the DeconvNet decoder from f̃_S, which was learned on dataset I (D_S); we then froze those sections up to the i-th deconvolutional layer and fine-tuned the remaining (4 − i) deconvolutional layers using dataset II (D_T) (see the circled numbers in Fig. 1). In scenario II, we investigated the opposite transfer by switching the source and target datasets to build the model f̃_TS(·), in order to see the effect of negative transfer. In similar scenarios, we repeated the same experiments with the skip operator and concatenation of U-net, to investigate the effect of transferring knowledge by injecting feature maps from the encoder into the decoder.

3 Experimental Results

To evaluate the models, we investigated and compared the different scenarios of tongue contour extraction described in the previous section. In each case, we first trained the whole DeconvNet and U-net on the source domain (the base models) and applied them directly to both the source and target domains, to expose each model's weakness in generalizing from one domain to another. As anticipated, Table 1 shows that in both scenarios the base networks predicted better instances for their source domains than for their target domains. The results of each scenario for DeconvNet and U-net are presented in Table 1.
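The weight-transfer-and-freeze step of Sect. 2.3 can be sketched as below. Here `tiny_net` is only a small stand-in for the full encoder-decoder (its layer names are illustrative), and the paper's choice of freezing everything up to the i-th deconvolutional layer is reduced to freezing the first `n_frozen` layers:

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_net():
    # Stand-in for the full encoder-decoder; layer names are illustrative only.
    return tf.keras.Sequential([
        tf.keras.Input((32, 32, 1)),
        layers.Conv2D(4, 3, padding="same", activation="relu", name="enc1"),
        layers.MaxPooling2D(name="pool1"),
        layers.Conv2D(8, 3, padding="same", activation="relu", name="enc2"),
        layers.Conv2DTranspose(4, 2, strides=2, padding="same", name="dec1"),
        layers.Conv2D(1, 1, activation="sigmoid", name="head"),
    ])

def adapt(source_model, target_model, n_frozen):
    # Transfer the weights learned on the source domain (f~_S), then freeze the
    # first n_frozen layers so that only the remaining ones are fine-tuned on D_T.
    target_model.set_weights(source_model.get_weights())
    for k, layer in enumerate(target_model.layers):
        layer.trainable = k >= n_frozen
    # Recompile so the new trainable flags take effect during fine-tuning.
    target_model.compile(optimizer="adam", loss="binary_crossentropy")
    return target_model
```

After `adapt`, calling `fit` on target-domain images updates only the unfrozen tail of the network, which is the fine-tuning operation varied across the scenarios.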
The results in the table reveal that, on average, fine-tuning the whole decoder section is best for achieving the highest accuracy in the target domain, while negative transfer can be seen clearly in these cases. For instance, in scenario I, the U-net base model achieved a Dice coefficient of 0.6884 on the source domain and 0.4664 on the target domain. When the whole decoder was fine-tuned, a better Dice coefficient of 0.6306 was achieved on the target domain, with 0.5818 on the source domain. As can be seen, freezing more layers of the decoder section (conv7, conv8, and conv9) significantly increases the difference between the Dice coefficient values on the source and target domains. For DeconvNet this does not hold, and the difference decreases in the higher layers. Table 1 also indicates a considerable improvement in scenario I for U-net compared to DeconvNet, due to the concatenation and skip operations.

To identify a sufficient target dataset size for transfer learning, in separate experiments we tuned two transferred U-net models (encoder and conv9) on three datasets of 100, 1000, and 10000 images. We used the same network architecture and training procedure across the different experiments. Figure 2 shows the differences between the Dice coefficients and the cross-entropy losses in the source and target domains for scenario I. It can be seen that more data samples enhance the accuracy of each model.

Table 1. Quantitative results of each scenario. Negative knowledge transfer can be seen in the first two columns for both models.
Scenario I (D_S → D_T)

                  DecNet, transferred up to              U-net, transferred up to
                  Base    Encod   conv7   conv8   conv9  Base    Encod   conv7   conv8   conv9
Test D_S  Loss    0.2888  0.3212  0.3296  0.3711  0.4434 0.2269  0.3034  0.3203  0.3431  0.2464
          Dice    0.6584  0.5957  0.5744  0.5768  0.5891 0.6884  0.5818  0.5777  0.6274  0.6213
Test D_T  Loss    0.4999  0.3573  0.3252  0.4100  0.4513 0.4805  0.3129  0.3600  0.3963  0.4627
          Dice    0.5011  0.5779  0.5760  0.5131  0.5622 0.4664  0.6306  0.5074  0.5558  0.3808

Scenario II (D_T → D_S)

                  DecNet, transferred up to              U-net, transferred up to
                  Base    Encod   conv7   conv8   conv9  Base    Encod   conv7   conv8   conv9
Test D_T  Loss    0.3286  0.4423  0.4981  0.4856  0.4332 0.3736  0.5100  0.5777  0.5591  0.6015
          Dice    0.6570  0.4685  0.3977  0.3611  0.4732 0.5901  0.4052  0.2931  0.3032  0.2301
Test D_S  Loss    0.4831  0.2571  0.2997  0.2986  0.2908 0.4253  0.2635  0.3363  0.3296  0.3270
          Dice    0.5299  0.6378  0.5816  0.5659  0.5921 0.5283  0.6211  0.5263  0.5361  0.5268

Fig. 2. Effect of increasing dataset size on the accuracy of the two transferred models in scenario I. Each column shows the difference in cross-entropy loss and Dice coefficient between the source and target domains.

Fig. 3 illustrates the qualitative results of scenario I for the U-net model applied to a test instance. The U-net base model, trained on images from the source domain (f̃_S), achieved a Dice coefficient of 0.65 and a binary cross-entropy loss of 0.28, while for the same model, without fine-tuning, the Dice score and loss on the target domain were 0.50 and 0.49. This means that although the result on the target domain is not significant, the U-net base model can still predict instances in both the source and target domains. Nevertheless, in real-time testing, when some frames contain rapid tongue movement along with a noisy dorsum region and artifacts (see Fig. 3c), the model fails to predict on the target domain.
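The two quantities reported in Table 1, the Dice coefficient and the binary cross-entropy loss, can be computed for a predicted mask as in the following NumPy sketch (the paper's exact implementation may differ, e.g. in its smoothing constant):

```python
import numpy as np

def dice_coefficient(pred, truth, eps=1e-7):
    """Dice = 2|P intersect T| / (|P| + |T|) on masks binarized at 0.5."""
    p, t = pred > 0.5, truth > 0.5
    return (2.0 * np.sum(p & t) + eps) / (np.sum(p) + np.sum(t) + eps)

def binary_cross_entropy(pred, truth, eps=1e-7):
    """Pixel-averaged BCE between predicted probabilities and 0/1 labels."""
    p = np.clip(pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(truth * np.log(p) + (1 - truth) * np.log(1 - p)))
```

A perfect prediction gives a Dice coefficient of 1 and a loss near 0; the gap between the source-domain and target-domain values of these two metrics is what the scenarios above try to balance.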
Our experimental results revealed that fine-tuning the whole decoder of the U-net alleviates this problem significantly. For instance, the Dice score and loss values become approximately 0.58 and 0.34 for both the source and target domains when the whole decoder is fine-tuned on the target domain. In general, we observed a balance point in the number of refined layers with respect to both the source and target domains. At the balance point, the model achieves similar, acceptable results in both the source and target domains, although the segmentation accuracy is worse than the model's performance on only one domain (see Fig. 3d).

Fig. 3. Prediction results of U-net in scenario I, D_S → D_T. (a) Sample data, (b) corresponding ground truth labels, (c) prediction result of f̃_S (U-net base), (d) prediction result of f̃_ST when the whole decoder was fine-tuned on D_T.

4 Discussion and Conclusions

In the transfer learning literature, researchers usually focus on finding an f̃_S that performs well on a source domain D_S, and then attempt domain adaptation from the source to the target task D_T to find f̃_ST. However, a reliable and universal method is one that can also provide acceptable results in the opposite direction, from target to source. Our experimental results showed that there is a balance point for the U-net model where it provides reasonable predictions on both the source and the target domain (f̃_ST ∼ f̃_TS). For instance, transferring the whole decoder of U-net to the target domain yielded binary cross-entropy loss values of 0.3034 and 0.3129 for the source and target test data. Furthermore, our qualitative study shows that domain adaptation can improve segmentation results for frames with significant noise and artifacts. The effects of using the skip operator and concatenation and of increasing the target dataset size
indicate a slight improvement in the final results. In contrast with other research fields with large datasets, a typical ultrasound tongue dataset contains no more than ∼200K frames, so it makes more sense to fine-tune one model on several datasets to find the knowledge balance point, yielding a universal model for real-time applications on various ultrasound devices. Using smaller learning rates in the target domain might further increase the segmentation accuracy on both the source and target domains at the balance point.

References

1. Berry, J., Fasel, I., Fadiga, L., Archangeli, D.: Training deep nets with imbalanced and unlabeled data. In: Thirteenth Annual Conference of the International Speech Communication Association (2012)
2. Csapó, T.G., Grósz, T., Gosztolya, G., Tóth, L., Markó, A.: DNN-based ultrasound-to-speech conversion for a silent speech interface (2017)
3. Ghafoorian, M., Mehrtash, A., Kapur, T., Karssemeijer, N., Marchiori, E., Pesteie, M., Guttmann, C.R., de Leeuw, F.E., Tempany, C.M., van Ginneken, B., et al.: Transfer learning for domain adaptation in MRI: Application in brain lesion segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 516–524. Springer (2017)
4. Gick, B., Bernhardt, B., Bacsfalvi, P., Wilson, I., Zampini, M.: Ultrasound imaging applications in second language acquisition. Phonology and second language acquisition 36, 315–328 (2008)
5. Laporte, C., Ménard, L.: Multi-hypothesis tracking of the tongue surface in ultrasound video recordings of normal and impaired speech. Medical Image Analysis 44, 98–114 (2018)
6. Lawson, E., Stuart-Smith, J., Scobbie, J.M., Nakai, S., Beavan, D., Edmonds, F., Edmonds, I., Turk, A., Timmins, C., Beck, J.M., et al.: Seeing Speech: an articulatory web resource for the study of phonetics [website] (2015)
7. Li, M., Kambhamettu, C., Stone, M.: Automatic contour tracking in ultrasound images. Clinical Linguistics & Phonetics 19(6-7), 545–554 (2005)
8. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Medical Image Analysis 42, 60–88 (2017)
9. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3431–3440 (2015)
10. Mozaffari, M.H., Guan, S., Wen, S., Wang, N., Lee, W.S.: Guided learning of pronunciation by visualizing tongue articulation in ultrasound image sequences. In: 2018 IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications (CIVEMSA). pp. 1–5. IEEE (2018)
11. Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1520–1528 (2015)
12. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10), 1345–1359 (2010)
13. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
14. Zhu, J., Styler, W., Calloway, I.C.: Automatic tongue contour extraction in ultrasound images with convolutional neural networks. The Journal of the Acoustical Society of America 143(3), 1966–1966 (2018)