Neural Transfer Learning for Cry-based Diagnosis of Perinatal Asphyxia

Charles C. Onu¹,², Jonathan Lebensold¹,², William L. Hamilton¹,³, Doina Precup¹,⁴
¹Mila - Québec Artificial Intelligence Institute, McGill University, ²Ubenwa Health, ³Facebook AI Research, ⁴Google DeepMind
{charles.onu@mail, jonathan.maloney-lebensold@mail, wlh@cs, dprecup@cs}.mcgill.ca

Abstract

Despite continuing medical advances, the rate of newborn morbidity and mortality globally remains high, with over 6 million casualties every year. The prediction of pathologies affecting newborns based on their cry is thus of significant clinical interest, as it would facilitate the development of accessible, low-cost diagnostic tools. However, the inadequacy of clinically annotated datasets of infant cries limits progress on this task. This study explores a neural transfer learning approach to developing accurate and robust models for identifying infants that have suffered from perinatal asphyxia. In particular, we explore the hypothesis that representations learned from adult speech could inform and improve the performance of models developed on infant speech. Our experiments show that models based on such representation transfer are resilient to different types and degrees of noise, as well as to signal loss in the time and frequency domains.

1. Introduction

Perinatal asphyxia—i.e., the inability of a newborn to breathe spontaneously after birth—is responsible for one-third of newborn mortalities and disabilities worldwide [1]. The high cost and expertise required to use standard medical devices for blood gas analysis make it extremely challenging to conduct early diagnosis in many parts of the world. In this work, we develop and analyze neural transfer models [2] for predicting perinatal asphyxia based on the infant cry.
We ask whether such models could be more accurate and robust than previous approaches, which primarily focused on classical machine learning algorithms due to limited data. Clinical research has shown that there exists a significant alteration in the crying patterns of newborns affected by asphyxia [3]. The unavailability of reasonably sized, clinically annotated datasets limits progress in developing effective approaches for predicting asphyxia from cry. The Baby Chillanto Infant Cry database [4], based on 69 infants, remains the only known available database for this task. Previous work using this data has mainly focused on classical machine learning methods or very limited-capacity feed-forward neural networks [4, 5].

We take advantage of freely available large datasets of adult speech to investigate a transfer learning approach to this problem using deep neural networks. In numerous domains (e.g., speech, vision, and text), transfer learning has led to substantial performance improvements by pre-training deep neural networks on a different but related task [6, 7, 8]. In our setting, we seek to transfer models trained on adult speech to improve performance on the relatively small Baby Chillanto Infant Cry dataset. Unlike newborns—whose cry is a direct response to stimuli—adults have voluntary control of their vocal organs, and their speech patterns have been influenced, over time, by the environment. We nevertheless explore the hypothesis that there exists some underlying similarity in the mechanism of the vocal tract between adults and infants, and that model parameters learned from adult speech could serve as a better initialization (than random) for training models on infant speech.

Of course, the choice of source task matters. The task on which the model is pre-trained should capture variations that are relevant to those in the target task.
For instance, a model pre-trained on a speaker identification task would likely learn embeddings that identify individuals, whereas a word recognition model would likely discover an embedding space that characterizes the content of utterances. What kind of embedding space would transfer well to diagnosing perinatal asphyxia is not clear a priori. For this reason, we evaluate and compare 3 different source tasks on adult speech: speaker identification, gender classification, and word recognition. We study how different source tasks affect the performance, robustness, and nature of the learned representations for detecting perinatal asphyxia.

Key results. On the target task of predicting perinatal asphyxia, we find that a classical approach using support vector machines (SVMs) represents a hard-to-beat baseline. Of the 3 neural transfer models, one (the word recognition task) surpassed the SVM's performance, achieving the highest unweighted average recall (UAR) of 86.5%. By observing the response of each model to different degrees and types of noise, and to signal loss in the time and frequency domains, we find that all neural models show better robustness than the SVM.

2. Related Work

Detecting pathologies from infant cry. The physiological interconnectedness of crying and respiration has long been appreciated. Crying presupposes functioning of the respiratory muscles [9]. In addition, cry generation and respiration are both coordinated by the same regions of the brain [10, 11]. The study of how pathologies affect infant crying dates back to the 1970s and 1980s with the work of Michelsson et al. [3, 12, 13]. Using spectrographic analysis, it was found that the cries of asphyxiated newborns showed shorter duration, lower amplitude, an increased fundamental frequency, and a significant increase in the "rising" melody type.

The Chillanto Infant Cry database. In 2004, Reyes et al.
[4] collected the Chillanto Infant Cry database with the objective of applying statistical learning techniques to classifying deafness, asphyxia, pain, and other conditions. The authors experimented with audio representations such as linear predictive coefficients (LPC) and mel-frequency cepstral coefficients (MFCC), training a time-delay neural network as the classifier. They achieved a precision and recall of 72.7% and 68%, respectively. Building on this work, Onu et al. [5] improved the precision and recall to 73.4% and 85.3%, respectively, using support vector machines (SVMs). It is worth noting that both works represent an overestimate of performance, as the authors split the train/test sets by examples, not by subjects.

Figure 1: Structure of learning pipeline. Weights of feature extraction stage were pre-loaded during transfer learning.

Weight initialization and neural transfer learning. Modern neural networks often contain millions of parameters, leading to highly non-linear decision surfaces with many local optima. The careful initialization of these parameters has been a subject of continuous research, with the goal of increasing the probability of reaching a favorable optimum [14, 15]. Initialization-based transfer learning is based on the idea that, instead of hand-designing a choice of random initialization, the weights from a neural network trained on similar data or a similar task could offer a better initialization. This pre-training could be done in an unsupervised [16] or supervised [17, 18] manner.

3. Methods

In this section, we describe our approach to designing and evaluating transfer learning models for the detection of perinatal asphyxia in infant cry. We present the selected source tasks along with representative datasets. We further describe pre-processing steps, the choice of model architectures, and the analysis of trained models.

3.1. Tasks

3.1.1.
Source tasks

We choose 3 source tasks — speaker identification, gender classification, and word recognition — with corresponding audio datasets: VCTK [19], Speakers in the Wild (SITW) [20], and Speech Commands [21]. Table 1 briefly describes the dataset used for each task.

3.1.2. Target task: Perinatal asphyxia detection

Our target task is the detection of perinatal asphyxia from newborn cry. We develop and evaluate our models using the Chillanto Infant Cry Database. The database contains 1,049 cry recordings of normal infants and 340 cry recordings of infants clinically confirmed to have perinatal asphyxia. Audio recordings were 1 second long and sampled at frequencies between 8 kHz and 16 kHz with 16-bit PCM encoding.

Table 1: Source tasks and corresponding datasets used in pre-training the neural network. Size: number of audio files.

Dataset         | Description                                                          | Size
VCTK            | Speaker identification. 109 English speakers reading sentences from newspapers. | 44K
SITW            | Gender classification. Speech samples from media of 299 speakers.    | 2K
Speech Commands | Word recognition. Utterances from 1,881 speakers of a set of 30 words. | 65K

3.2. Pre-processing

All audio samples are pre-processed similarly, to allow for an even comparison between source tasks and compatibility with the target task. Raw audio recordings are downsampled to 8 kHz and converted to mel-frequency cepstral coefficients (MFCCs). To do this, spectrograms are computed for overlapping frames of 30 ms with a 10 ms shift, across 40 mel bands. For each frame, only frequency components between 20 and 4000 Hz are considered. The discrete cosine transform is then applied to the spectrogram output to compute the MFCCs. The resulting coefficients from each frame are stacked in time to form a spatial (40 × 101), 2D representation of the input audio.

3.3. Model Architecture and Transfer Learning

We adopt a residual network (ResNet) [22] architecture with average pooling for training.
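Before detailing the architecture, the MFCC front end of Section 3.2 can be sketched in NumPy. This is a simplified approximation for illustration only: the Hann window, reflective centre-padding, and FFT size of 256 are our assumptions rather than details reported in the paper, but the sketch reproduces the 40 × 101 feature shape for a 1-second, 8 kHz clip.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=8000, frame_ms=30, hop_ms=10, n_mels=40,
         fmin=20.0, fmax=4000.0, n_fft=256):
    frame_len = int(sr * frame_ms / 1000)          # 240 samples (30 ms)
    hop = int(sr * hop_ms / 1000)                  # 80 samples (10 ms)
    # Centre-pad (assumption) so a 1 s clip yields 101 frames.
    signal = np.pad(signal, frame_len // 2, mode="reflect")
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # (n_frames, n_fft//2 + 1)
    # Triangular mel filterbank restricted to 20–4000 Hz, as in Section 3.2.
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT over the mel axis gives the cepstral coefficients; stack in time.
    return dct(logmel, type=2, axis=1, norm="ortho").T  # (n_mels, n_frames)

coeffs = mfcc(np.random.default_rng(0).standard_normal(8000))
print(coeffs.shape)  # (40, 101), the 2D input shape fed to the network
```

A production pipeline would more likely use a library front end (e.g., librosa) with these parameters; the point here is only the shape and the order of operations.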
Consider a convolutional layer that learns a mapping function F(x) of the input, parameterized by some weights. A residual block adds a shortcut, or skip connection, such that the output of the layer is the sum of F(x) and the input x, i.e., y = F(x) + x. This structure helps control overfitting by allowing the network to learn the identity mapping y = x as necessary, and facilitates the training of even deeper networks. ResNets represent an effective architecture for speech, achieving several state-of-the-art results in recent years [23].

To ensure an even comparison across source tasks, and to facilitate transfer learning, we adopt a single network architecture: the res8 of Tang et al. [23]. The model takes as input the 2D MFCC representation of an audio signal, transforms it through a collection of 6 residual blocks (flanked on either side by a convolutional layer), employs average pooling to extract a fixed-dimension embedding, and computes a k-way softmax to predict the classes of interest. Figure 1 shows the overall structure of our system. Each convolutional layer consists of 45 3 × 3 kernels.

We train the res8 on each source task to achieve performance comparable with the state of the art. The learned model weights (except those of the softmax layer) are used as initialization for training the network on the Chillanto dataset. During this post-training, the entire network is tuned.

3.4. Baselines

We implement and compare the performance of our transfer models against 2 baselines. One is a model based on a radial basis function support vector machine (SVM), similar to [5]. The other is a res8 model whose initial weights are drawn randomly from a uniform Glorot distribution [14], i.e., according to U(−k, k), where k = √(6 / (n_i + n_o)), and n_i and n_o are the numbers of units in the input and output layers, respectively.

Figure 2: Audio length analysis highlighting the impact of using shorter amounts of input audio on UAR performance.
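The residual mapping y = F(x) + x and the Glorot uniform initialization can be illustrated with a minimal NumPy sketch. Note the simplifications: dense layers stand in for the 3 × 3 convolutions of res8, and the layer width is illustrative, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(n_in, n_out):
    # U(-k, k) with k = sqrt(6 / (n_i + n_o)), the baseline scheme of [14].
    k = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-k, k, size=(n_in, n_out))

def residual_block(x, W1, W2):
    # y = F(x) + x: a two-layer mapping F with a ReLU nonlinearity,
    # plus the identity shortcut. Dense layers used for illustration.
    return np.maximum(x @ W1, 0.0) @ W2 + x

d = 45  # res8 uses 45 feature maps per convolutional layer
W1, W2 = glorot_uniform(d, d), glorot_uniform(d, d)
x = rng.standard_normal(d)
y = residual_block(x, W1, W2)

# With F == 0 the block reduces exactly to the identity mapping y = x,
# which is what makes very deep stacks easy to train.
assert np.allclose(residual_block(x, np.zeros((d, d)), np.zeros((d, d))), x)
```

In transfer learning, the Glorot draw above is simply replaced by weights loaded from the source-task model; only the softmax layer is re-initialized.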
This initialization scheme scales the weights so that they are neither too small, diminishing signals, nor too large, exploding them, as they pass through the network's layers during training.

3.5. Analysis

3.5.1. Performance

We evaluate the performance of our models on the target task by tracking the following metrics: sensitivity (recall on the asphyxia class), specificity (recall on the normal class), and the unweighted average recall (UAR). We use the UAR on the validation set to choose the best hyperparameter settings. The UAR is preferred over accuracy since the classes in the Chillanto dataset are imbalanced.

3.5.2. Robustness

Noise. We analyze our models for robustness to 4 different noise conditions: Gaussian noise N(0, 0.1), sounds of children playing, dogs barking, and sirens. In each case, we add the noise in increasing magnitude to the test data and monitor the impact on the classification performance of the model.

Audio length. We also evaluate the response of each model to varying lengths of audio, since in the real world a diagnostic system must be able to work with as much data as is available. To achieve this, we test the models on increasing lengths of the test data, from 0.1 s to the full 1 s segment, in 0.1 s increments.

Frequency response. The response of the models to variations in the frequency domain is important, as it could reveal underlying characteristics of the data. We know as well that perinatal asphyxia alters the frequency patterns in cry. To discover which frequency ranges are most sensitive for detecting perinatal asphyxia, we conduct an ablation exercise in which features extracted from different filterbanks in the MFCC are zeroed out. We measure the response of our models by monitoring the drop in performance for the frequency ranges in each mel filterbank.

3.5.3.
MFCC Embeddings

To further investigate the nature of the embedding learned by each model, we apply principal component analysis (PCA) to the learned final-layer embeddings of all models [24]. By applying PCA, we hope to gain insight into the extent to which the embedding space captures unique information.

Figure 3: Frequency response analysis of the relative importance of different mel filterbanks on UAR performance. Each point represents the performance after removing the corresponding mel filterbank.

Table 2: Performance – mean (standard error) – of different models in predicting perinatal asphyxia.

Model         | UAR %      | Sensitivity % | Specificity %
SVM           | 84.4 (0.4) | 81.6 (0.7)    | 87.2 (0.2)
no-transfer   | 80.0 (2.5) | 71.8 (5.8)    | 88.1 (0.8)
sc-transfer   | 86.5 (1.1) | 84.1 (2.2)    | 88.9 (0.4)
sitw-transfer | 81.1 (1.7) | 72.7 (3.5)    | 89.5 (0.2)
vctk-transfer | 80.7 (1.0) | 72.2 (2.1)    | 89.1 (0.3)

4. Experiments

4.1. Training details

There were a total of 1,389 infant cry samples (1,049 normal and 340 asphyxiated) in the Chillanto dataset. The samples were split into training, validation, and test sets in a 60:20:20 ratio, under the constraint that samples from the same patients were placed in the same set. Each source task was trained, fine-tuning hyperparameters as necessary to obtain performance comparable with the literature. For transfer learning on the target task, models were trained for 50 epochs using stochastic gradient descent with an initial learning rate of 0.001 (decreasing to 0.0001 after 15 epochs), a fixed momentum of 0.9, a batch size of 50, and a hinge loss function. We used a weighted balanced sampling procedure for mini-batches to account for class imbalance. We also applied data augmentation via random time-shifting of the audio recordings. Both led to up to 7% better UAR scores when training source and target models.

4.2.
Performance on source tasks

Our model architecture achieved accuracies of 94.8% on the word recognition task (Speech Commands), 91.9% on speaker identification (VCTK), and 90.2% on gender classification (SITW). These results are comparable to previous work; see [23, 25] for reference.¹

¹To our knowledge, SITW has not previously been used for gender classification, even though this data is available.

Figure 4: Performance of models under different noise conditions.

4.3. Performance on target task

Table 2 summarizes the performance of all models on the target task. The best-performing model was pre-trained on the word recognition task (sc-transfer) and attained a UAR of 86.5%. This model also achieved the highest sensitivity and specificity, 84.1% and 88.9% respectively. All other transfer models performed better than no-transfer, suggesting that transfer learning resulted in a better, or at least as good, initialization. The SVM was the second-best-performing model and had the lowest variance among all models in its predictions.

4.4. Robustness Analysis

In most cases, our results suggest that the neural models have overall increased robustness. We focused on the top transfer model, sc-transfer, as well as no-transfer and the SVM. Figure 4 shows the response of the models to different types of noise, revealing that in all but one case the neural models degrade more slowly than the SVM. Results in Figure 2 suggest that the neural models are also capable of high UAR scores for short audio lengths, with sc-transfer maintaining peak performance when evaluated on only half (0.5 s) of the test signals. From our analysis of the models' responses to filterbank frequencies (Figure 3), we observe that (i) the performance of all models (unsurprisingly) only drops in the range of the fundamental frequency of infant cries, i.e., up to 500 Hz [26], and (ii) sc-transfer is again the most resilient model across the frequency spectrum.

4.5.
Visualization of embeddings

Figure 5 shows the cumulative variance explained by the principal components (PCs) of the neural model embeddings. Whereas in no-transfer the top 2 PCs explain nearly all variance in the data (91%), in sc-transfer they represent only 52%—suggesting that neural transfer leads to an embedding that is intrinsically higher-dimensional and richer than its no-transfer counterpart.

Figure 5: Cumulative variance explained by all principal components (left) and the top 2 principal components on the Chillanto test data (right), based on embeddings of the no-transfer model.

5. Conclusion and Discussion

We compared the performance of a residual neural network (ResNet) pre-trained on several speech tasks in classifying perinatal asphyxia. Among the transfer models, the one based on a word recognition task performed best, suggesting that the variations learned for this task are most analogous and useful to our target task. The support vector machine trained directly on MFCC features proved to be a strong benchmark, and, if variance in predictions were of concern, a preferred model. The SVM, however, was clearly less robust to perturbations in the time and frequency domains than the neural models. This work reinforces the modelling power of deep neural networks. More importantly, it demonstrates the value of a transfer learning approach to the task of predicting perinatal asphyxia from infant cries—a task of critical relevance for improving the accessibility of pediatric diagnostic tools.

6. References

[1] World Health Organisation, "Children: reducing mortality," Media Centre, 2017.

[2] Y. Bengio, "Deep learning of representations for unsupervised and transfer learning," in Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 17–36.

[3] K. Michelsson, P. Sirviö, and O. Wasz-Höckert, "Pain cry in full-term asphyxiated newborn infants correlated with late findings," Acta Pædiatrica, vol.
66, no. 5, pp. 611–616, 1977.

[4] O. F. Reyes-Galaviz and C. A. Reyes-Garcia, "A system for the processing of infant cry to recognize pathologies in recently born babies with neural networks," in 9th Conference Speech and Computer, 2004.

[5] C. C. Onu, "Harnessing infant cry for swift, cost-effective diagnosis of perinatal asphyxia in low-resource settings," in 2014 IEEE Canada International Humanitarian Technology Conference (IHTC). IEEE, 2014, pp. 1–4.

[6] J. Howard and S. Ruder, "Universal language model fine-tuning for text classification," 2018.

[7] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1717–1724.

[8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.

[9] L. L. LaGasse, A. R. Neal, and B. M. Lester, "Assessment of infant cry: acoustic cry analysis and parental perception," Mental Retardation and Developmental Disabilities Research Reviews, vol. 11, no. 1, pp. 83–93, 2005.

[10] B. M. Lester, C. Z. Boukydis, C. T. Garcia-Coll, and W. T. Hole, "Colic for developmentalists," Infant Mental Health Journal, vol. 11, no. 4, pp. 321–333, 1990.

[11] P. S. Zeskind and B. Lester, "Analysis of infant crying," Biobehavioral Assessment of the Infant, pp. 149–166, 2001.

[12] K. Michelsson, P. Sirviö, and O. Wasz-Höckert, "Sound spectrographic cry analysis of infants with bacterial meningitis," Developmental Medicine & Child Neurology, vol. 19, no. 3, pp. 309–315, 1977.

[13] K. Michelsson, K. Eklund, P. Leppänen, and H.
Lyytinen, "Cry characteristics of 172 healthy 1- to 7-day-old infants," Folia Phoniatrica et Logopaedica, vol. 54, no. 4, pp. 190–200, 2002.

[14] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.

[15] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.

[16] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?" Journal of Machine Learning Research, vol. 11, no. Feb, pp. 625–660, 2010.

[17] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.

[18] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems, 2007, pp. 153–160.

[19] C. Veaux and J. Yamagishi, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2017.

[20] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, "The Speakers in the Wild (SITW) speaker recognition database," in Interspeech, 2016, pp. 818–822.

[21] P. Warden, "Speech Commands: A dataset for limited-vocabulary speech recognition," 2018.

[22] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[23] R. Tang and J. Lin, "Deep residual learning for small-footprint keyword spotting," Tech. Rep., 2018.

[24] I.
Jolliffe, Principal Component Analysis. Springer, 2011.

[25] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, "Neural voice cloning with a few samples," in Advances in Neural Information Processing Systems, 2018, pp. 10040–10050.

[26] R. P. Daga and A. M. Panditrao, "Acoustical analysis of pain cries in neonates: Fundamental frequency," Int. J. Comput. Appl., Spec. Issue Electron. Inf. Commun. Eng. (ICEICE), vol. 3, pp. 18–21, 2011.