Voiceprint recognition of Parkinson patients based on deep learning
Zhijing Xu¹, Juan Wang¹, Ying Zhang¹, Xiangjian He²

¹ College of Information Engineering, Shanghai Maritime University, Shanghai, China
zjxu@shmtu.edu.cn, wangjuan_y@foxmail.com, Jennif3r@foxmail.com
² Global Big Data Technologies Centre, University of Technology Sydney, Sydney, Australia
Xiangjian.He@uts.edu.au

Abstract

More than 90% of Parkinson's Disease (PD) patients suffer from vocal cord disorders, and speech impairment is an early indicator of PD. This study therefore focuses on PD diagnosis through voiceprint features. In this paper, a method based on Deep Neural Network (DNN) recognition and classification, combined with Mini-Batch Gradient Descent (MBGD), is proposed to distinguish PD patients from healthy people using voiceprint features. To extract the voiceprint features from patients, Weighted Mel Frequency Cepstrum Coefficients (WMFCC) are applied. The proposed method is tested on experimental data obtained from voice recordings of three sustained vowels, /a/, /o/ and /u/, from 48 PD patients and 20 healthy participants. The results show that the proposed method diagnoses PD patients with higher accuracy than conventional methods such as the Support Vector Machine (SVM) and the other methods mentioned in this paper, achieving an accuracy of 89.5%. The WMFCC approach solves the problem that the high-order cepstrum coefficients are small and the feature components' ability to represent the audio is weak. MBGD reduces the computational load of the loss function and increases the training speed of the system, and the DNN classifier enhances the classification ability of the voiceprint features. Therefore, the above approaches provide a solid solution for quick auxiliary diagnosis of PD in its early stage.

1. Introduction

Parkinson's disease (PD) is the second most common neurological disorder after Alzheimer's disease [1].
Vocal characteristics are considered to be one of the earliest signs of this disease. At the early stage, the subtle anomalies of the voice are imperceptible to a listener, but the recorded speech signals can be acoustically analyzed for objective evaluation. Existing PD tests use PET-CT imaging equipment to detect whether the dopaminergic neurons are reduced, but such tests are expensive and involve high radiation levels, which limits their practicality for diagnosis. Thus, a highly accurate, convenient, non-intrusive and inexpensive diagnostic method is required. In the 1990s, a variety of machine learning models were proposed, the most prominent one being the Support Vector Machine (SVM) [2]. In 2015, Benba et al. proposed Mel Frequency Cepstrum Coefficients (MFCC) and SVM for voiceprint analysis of PD patients, to distinguish between PD patients and healthy people [3]. In 2016, the authors of [4] proposed to obtain voiceprints by extracting the cepstral coefficients of RelAtive SpecTrAl Perceptual Linear Prediction (RASTA-PLP) and to classify them with an SVM classifier, k-Nearest Neighbor (k-NN), and other classifiers. Five different classifiers along with a leave-one-subject-out scheme were used to distinguish between PD patients and patients with other neurological diseases. Later, they compared the MFCC, PLP and RASTA-PLP methods for extracting voiceprints and found that the highest accuracy was 90%, achieved by PLP combined with an SVM classifier [5]. In addition, the authors of [6] further compared SVM's Multilayer Perceptron (MLP) kernel function with other kernel functions; however, the MFCC features used rely on very small low-order cepstral coefficients, the kernel-function classifier has a large computational complexity and long training time, and the discrimination accuracy of 82.5% needs to be improved. In 2017, Benba et al.
proposed a new improvement that used Human Factor Cepstral Coefficients (HFCC) to extract the voiceprint features and an SVM classifier with a linear kernel to obtain the desired results [7]. In [8], Max A. Little used Linear Discriminant Analysis (LDA) to remove the remaining nuisance effects in the channel vectors, and SVM or Probabilistic Linear Discriminant Analysis (PLDA) to classify the different types of distortion in Parkinson's voices. However, SVM and similar classifiers are nonparametric classifiers with a shallow structure [9]. Their drawback is that the ability to represent complex functions is limited in the case of finite samples and computational units [10]. In comparison, a deep-learning approach has achieved good speaker-recognition accuracy on a large-scale voiceprint corpus [11]; its recognition rate is stable and robust at about 95%, which means that it could be used in real applications. Deep learning can achieve complex function approximation by learning a deep nonlinear network structure, and demonstrates a powerful ability to learn the essential features of data sets from a small sample set [12]. In recent years, with the advent of the big-data era and the improvement of computing capability, deep learning has made breakthroughs in image recognition, machine translation, natural language processing and other applications. The core of deep learning is enabling the machine to imitate human activities such as audio-visual perception, thinking and learning. It therefore solves many complicated pattern-recognition problems and makes great progress in Artificial Intelligence (AI) techniques. As a widely used Artificial Neural Network (ANN) model in deep learning, the Deep Neural Network (DNN) has made lots of achievements in areas such as signal processing, language modeling and bioinformatics [13,14,15].
Its nonlinear structure has powerful modeling capability. Based on the ideas, advantages and disadvantages of the algorithms discussed above, in this paper we propose a new DNN-based voiceprint recognition model, which uses voiceprint features to discriminate between PD patients and healthy people and to detect the disease in its early stages. The main contributions of this paper are the following. Weighted MFCC (WMFCC) is used to extract the voiceprint features, enhancing the sensitive components, and then to obtain the dysphonia-detection parameters of PD patients; the WMFCC approach, based on the entropy method, solves the problems that the high-order cepstrum coefficients are small and that the feature components represent the audio only weakly. Multi-layer neural-network recognition and classification with a DNN is used to improve the accuracy of discriminating PD patients; during training, the model first obtains good initial parameters through an unsupervised pre-training algorithm, and on this basis uses supervised training to optimize the parameters further. Model optimization based on the Mini-Batch Gradient Descent (MBGD) algorithm is proposed to reduce the amount of computation of the loss function and increase the training speed of the system. Compared with approaches based on the traditional Support Vector Machine (SVM) and other state-of-the-art classification methods for testing and classifying the samples in the PD dataset, our approach achieves the highest accuracy in diagnosing Parkinson's disease. In the next section, we will introduce the dataset and the acquisition technique. Section 3 will describe our proposed approach for feature extraction, model optimization, recognition and classification. In Section 4, our experiments and results will be presented.
Section 5 will provide a discussion in which the results are interpreted in depth, with a highlight on the comparison of our work with the state of the art. The conclusion is presented in Section 6.

2. Datasets

The dataset used in this study was collected and used in [16]. It contains 20 patients with PD (6 women and 14 men) and 20 healthy people (10 women and 10 men). For the PD patients, the time since diagnosis ranges between 0 and 6 years, and their age ranges between 43 and 77 (mean: 64.86, standard deviation: 8.97). The age of the healthy people ranges between 45 and 83 (mean: 62.55, standard deviation: 10.79). All the recordings were performed with a Trust MC-1500 microphone with a frequency range between 50 Hz and 13 kHz. The microphone was set to 96 kHz, 30 dB and placed at a 15 cm distance from the subjects. All the samples were recorded in stereo-channel mode and saved in WAV format. In this study, we use 3 types of recordings, obtained by inviting each of the 40 participants (20 PD and 20 healthy) to pronounce three sustained vowels /a/, /o/ and /u/ at a comfortable level. This gives us a dataset containing 120 voice samples, on which the analyses are made. An independent dataset was also collected using the same recording devices with the same physicians. 28 PD patients were invited to pronounce the sustained vowels /a/ and /o/ three times, and we then selected only one sample of each of the two vowels. For these patients, the time since diagnosis ranges between 0 and 13 years, and their age ranges between 39 and 79 (mean: 62.67, standard deviation: 10.96). This dataset was used to test and validate the results obtained with the first dataset.

3. Methodology

To recognize PD patients and distinguish them from healthy people using voiceprint features, this paper builds a DNN-based PD patient recognition model, as shown in Fig. 1.
Fig. 1. PD patient identification system model

3.1. WMFCC voiceprint feature extraction

The extraction of speech feature parameters is crucial in voiceprint recognition. In this field, the most commonly used feature is the MFCC [17]. The speech signal changes at a slow rate: when observed over a short time, it is generally considered to be stable over intervals of 10-30 ms [18]. Therefore, it should be analyzed by short-time spectrum analysis. The Mel scale is used to model the frequency perception of the human ear, calibrated so that 1000 Hz corresponds to 1000 Mel. This paper uses temporal speech quality, spectrum and cepstrum domains to develop more objective assessments for detecting speech impairments [19]. These measurements include the fundamental frequency of vocal cord vibration (F0), absolute sound pressure level, jitter, shimmer and the harmonics-to-noise ratio (HNR). The details, from [20], are shown in Table 1 (values are given as mean ± standard deviation).

Table 1. Acoustic analysis results of healthy males and females, and males and females with PD

Group | Gender | Average age | F0 (Hz) | Jitter (%) | Shimmer (%) | HNR (dB)
Healthy | male | 58.4 ± 12.5 | 128.4 ± 17.6 | 0.04 ± 0.36 | 0.26 ± 0.10 | 14.8 ± 4.6
Healthy | female | 55.6 ± 11.9 | 205.4 ± 37.6 | 1.16 ± 1.15 | 0.35 ± 0.46 | 11.0 ± 7.1
PD | male | 61.2 ± 9.6 | 120.5 ± 20.8 | 0.94 ± 0.76 | 0.37 ± 0.16 | 10.4 ± 3.7
PD | female | 61.7 ± 10.6 | 193.8 ± 16.4 | 1.94 ± 1.30 | 0.68 ± 0.91 | 8.1 ± 5.1

Based on the pronunciation characteristics of PD patients, the characteristic parameters were extracted for analysis. However, each component contained in the feature parameters has a different voiceprint characterization capability for different speech samples. The traditional MFCC method extracts voiceprint features with small low-order cepstral coefficients, and the feature components have poor representational capabilities for the audio.
In order to enhance the sensitive components for recognition, this paper analyzes the contribution of each dimension of the feature parameters to the voiceprint representation by calculating the entropy of a multi-dimensional corpus, and extracts the voiceprint features by the entropy-weighting method, thus improving the recognition accuracy of the system. Fig. 2 is a flow chart of the WMFCC voiceprint feature extraction.

Fig. 2. Flow chart of WMFCC extraction

The detailed extraction process is as follows.

3.1.1. Pre-emphasis, framing

In order to eliminate the effect of the lips and vocal cords during vocalization, the high-frequency part of the speech signal suppressed by the vocal system is compensated, highlighting the high-frequency formants. A first-order difference equation is applied to the speech samples to increase the amplitude of the high-frequency formants [21]. In effect, the speech signal is passed through a high-pass filter:

H(z) = 1 - kz^{-1},   (1)

where k is the pre-emphasis coefficient, which should be in the range [0, 1] (usually 0.97). In the framing step, the speech signal is divided into frames of N samples. In order to avoid excessive change between two adjacent frames, there is an overlapping area of M sampling points (M < N) between adjacent frames.

3.1.2. Hamming window

The purpose of applying a Hamming window is to reduce signal discontinuities and make the end of each frame smooth enough to connect with the beginning. Let the signal after framing be s_n, n = 1, ..., N, where N is the size of the frame. The windowed signal s'_n is:

s'_n = s_n [0.54 - 0.46 cos(2\pi (n-1)/(N-1))].   (2)

3.1.3. Fast Fourier transform

The Fast Fourier Transform (FFT) is used to convert the N samples from the time domain to the frequency domain. The FFT is used because it is a fast algorithm that implements the Discrete Fourier Transform (DFT).
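As an illustration, the pre-emphasis filter of Eq. (1) and the framing and Hamming-window steps of Eq. (2) can be sketched as follows (a minimal NumPy sketch; the function names, frame length and hop size are our own choices, not from the paper):

```python
import numpy as np

def pre_emphasis(signal, k=0.97):
    """Eq. (1): pass the signal through H(z) = 1 - k*z^-1,
    i.e. y[n] = x[n] - k*x[n-1]."""
    return np.append(signal[0], signal[1:] - k * signal[:-1])

def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping frames of N = frame_len samples;
    adjacent frames share frame_len - hop sampling points."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop:i * hop + frame_len]
                     for i in range(n_frames)])

def hamming_window(frames):
    """Eq. (2): multiply each frame by 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    n = frames.shape[1]
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(n) / (n - 1))
    return frames * w
```

For 10-30 ms frames at the 96 kHz sampling rate used here, frame_len would be on the order of 1000-3000 samples; small values are used below only for illustration.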
The DFT is defined on the N samples, and the DFT of the speech signal is:

S_k = \sum_{n=0}^{N-1} s_n e^{-j 2\pi kn/N},  k = 0, 1, 2, ..., N-1,   (3)

where s_n is the input speech signal and N is the number of sampling points of the Fourier transform.

3.1.4. Filter bank analysis

There are several redundant signals in the frequency domain, and the filter bank streamlines the amplitudes of the frequency domain. The human ear's perception of sound is not linear; it is better described by a logarithmic relationship. The relationship between Mel frequency and speech-signal frequency is shown in Equation (4) [22]:

Mel(f) = 2595 \log_{10}(1 + f/700),   (4)

where Mel(f) represents the Mel frequency, whose unit is mel, and f is the frequency of the speech signal, whose unit is Hz.

3.1.5. Logarithm / Discrete Cosine Transform

In this phase, the MFCC is calculated from the log filter-bank amplitudes m_j through the Discrete Cosine Transform (DCT) [7]:

c_i = \sqrt{2/N} \sum_{j=1}^{N} m_j \cos(\pi i (j - 0.5)/N),   (5)

where N is the number of filter-bank channels and m_j is the amplitude of the j-th log filter bank.

3.1.6. Weighting

The main advantage of cepstral coefficients is that they are not correlated with each other, so it is convenient to analyze the cepstral coefficients of each order. However, the high-order cepstral coefficients are very small and the sensitive components are not obvious; this reduces the recognition rate of the extracted effective features and the subsequent classification recognition rate, as shown in Fig. 3. Therefore, based on the MFCC method, the entropy method is used to improve the ability of the feature components to characterize the voiceprint features. This method is simple and considers the interaction among the feature components.
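The Mel mapping of Eq. (4) and the DCT of Eq. (5) can be sketched as below (a minimal sketch assuming the base-10 form of the Mel formula, which matches the 1000 Hz ↔ 1000 Mel calibration; function names are illustrative):

```python
import numpy as np

def hz_to_mel(f):
    """Eq. (4): Mel(f) = 2595 * log10(1 + f/700), so 1000 Hz maps
    to approximately 1000 Mel."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def dct_cepstrum(log_amps, n_ceps):
    """Eq. (5): c_i = sqrt(2/N) * sum_j m_j * cos(pi*i*(j-0.5)/N),
    where m_j are the N log filter-bank amplitudes."""
    N = len(log_amps)
    j = np.arange(1, N + 1)
    return np.array([np.sqrt(2.0 / N) *
                     np.sum(log_amps * np.cos(np.pi * i * (j - 0.5) / N))
                     for i in range(1, n_ceps + 1)])
```

Since Eq. (5) is a DCT, a constant filter-bank vector yields only a zeroth coefficient, so all cepstra c_1, ..., c_n vanish for flat input.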
Fig. 3. The first 20 Mel-frequency cepstral coefficients of a PD patient before weighting (without using the first cepstral coefficient)

The entropy method is an objective weighting method for calculating the weights among mutually independent variables. The weight of each component is determined according to the information entropy of the calculated components [23]. The larger the value of the entropy, the less information is carried and the smaller the weight of the component; conversely, the smaller the entropy, the more information is carried and the larger the weight of the component. Re-weighting these cepstral coefficients is therefore a crucial step (Fig. 4). It is achieved as follows. The voiceprint features of a speech sample of the PD database are

MFCC = (M_1, M_2, M_3, ..., M_i, ..., M_N),   (6)

where M_i = (mel_{i(1)}, ..., mel_{i(j)}, ..., mel_{i(D)}) is the feature vector of the i-th frame of the voiceprint feature, D is the feature-parameter dimension, N is the number of frames of the speech sample, and mel_{i(j)} is the j-th feature value of the i-th frame of the voiceprint feature.
First, the feature matrix is normalized as shown in Equation (7):

mel''_{i(j)} = (\max_j\{mel\} - mel_{i(j)}) / (\max_j\{mel\} - \min_j\{mel\}).   (7)

The entropy is defined as shown in Equation (8):

e_j = -k \sum_{i=1}^{N} Y_{ij} \ln Y_{ij},   (8)

where Y_{ij} = mel''_{i(j)} / \sum_{i=1}^{N} mel''_{i(j)} and k = 1/\ln N. Equation (9) gives the entropy weight of each feature component:

w_j = (1 - e_j) / \sum_{j=1}^{D} (1 - e_j).   (9)

Finally, the weights of the components of the MFCC are calculated by Equation (9), and the new weighted parameters obtained are:

wM_i = (w_1 mel_{i(1)}, ..., w_D mel_{i(D)}).   (10)

Taking a speech sample as an example, the first 20 Mel-frequency cepstral coefficients of a PD patient are extracted, and the weights of the feature components are calculated by the entropy-weighting method, as shown in Fig. 4.

Fig. 4. The first 20 Mel-frequency cepstral coefficients of a PD patient after weighting (without using the first cepstral coefficient)

The multi-order cepstral coefficients of the WMFCC are extracted from each of the obtained speech samples, and the extracted coefficients range from 1 to 20. Note that the first cepstral coefficient loses its reference meaning due to its large amplitude change. We continue to obtain the optimum coefficient order required for the best classification accuracy in this way. Next, the corresponding voiceprint is obtained by averaging over all the frames, as shown in Fig. 5.

Fig. 5. Voiceprint of the first 20 Mel-frequency cepstral coefficients of a PD patient (without using the first cepstral coefficient)

Comparing Fig. 3 and Fig. 5, it can be intuitively concluded that WMFCC solves the problem that the high-order cepstral coefficients are very small.
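The entropy-weighting steps of Eqs. (7)-(10) can be sketched compactly for an (N frames × D dimensions) MFCC matrix, taking k = 1/ln N as in the standard entropy method (function names and the small epsilon guarding log(0) are our own):

```python
import numpy as np

def entropy_weights(mfcc):
    """Entropy weights per Eqs. (7)-(9) for an (N, D) MFCC matrix."""
    n, d = mfcc.shape
    # Eq. (7): min-max normalise each cepstral dimension over the frames
    hi, lo = mfcc.max(axis=0), mfcc.min(axis=0)
    norm = (hi - mfcc) / (hi - lo)
    # Y_ij: share of frame i within dimension j
    Y = norm / norm.sum(axis=0)
    # Eq. (8): entropy of each dimension with k = 1/ln(N)
    e = -(Y * np.log(Y + 1e-12)).sum(axis=0) / np.log(n)
    # Eq. (9): weight from the information content 1 - e_j
    return (1.0 - e) / (1.0 - e).sum()

def weighted_mfcc(mfcc):
    """Eq. (10): scale each feature dimension by its entropy weight."""
    return mfcc * entropy_weights(mfcc)
```

Dimensions whose values vary little across frames have near-maximal entropy and therefore receive near-zero weight, while informative dimensions are amplified.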
After the weighted averaging, the sensitive components of the MFCC parameters are also highlighted, and the high-order cepstral coefficients now contribute to the recognition rate of the subsequent effective features.

3.2. DNN classification

According to the voiceprint features extracted from the dataset by the WMFCC, the DNN training method is used for feature classification. The model structure, the training process and the network optimization algorithm of the DNN are described as follows.

3.2.1. DNN structure

A DNN is a multilayer perceptron with multiple hidden layers. Because it contains multiple hidden layers, it can abstract useful high-level features or attributes from highly redundant low-level features, and then discover the inherent distribution of the data [24]. The neural network designed in this paper includes an input layer, hidden layers and an output layer. As shown in Fig. 6, the input layer is written as Layer 0, while the output layer is written as Layer L. A DNN can have multiple hidden layers, and the output of the current hidden layer is the input of the next hidden layer or of the output layer. We use the Back-Propagation (BP) algorithm to calculate the gradient of each layer's parameters. The activation function is the Rectified Linear Unit (ReLU), which has the advantage that the network can introduce sparsity on its own and greatly improve the training speed.

Fig. 6. Schematic diagram of the DNN structure

For any layer l (0 < l ≤ L),

z^l = W^l v^{l-1} + b^l,   (11)
v^l = f(z^l),   (12)

where z^l ∈ R^{N_l × 1} is the excitation vector, v^l ∈ R^{N_l × 1} is the activation vector, W^l ∈ R^{N_l × N_{l-1}} is the weight matrix, b^l ∈ R^{N_l × 1} is the bias, and N_l is the number of neurons in the l-th layer. f is the ReLU activation function, with the mathematical expression

ReLU(z) = \max(0, z),  z ∈ R.   (13)

3.2.2. Parameter pre-training algorithm

The DNN pre-training uses a layer-by-layer pre-training method based on the Restricted Boltzmann Machine (RBM).
The DNN is treated as a Deep Belief Network (DBN) consisting of a number of stacked RBMs, which are pre-trained bottom-up layer by layer [25]. The detailed process is as follows. If the input is a continuous feature, a Gauss-Bernoulli-distributed RBM is trained; if the input is a binomially distributed feature, a Bernoulli-Bernoulli-distributed RBM is trained. The output of each hidden layer is then used as the input data for the next Gauss-Bernoulli or Bernoulli-Bernoulli RBM, depending on the input feature. This process does not require label information and is an unsupervised training process. Supervised training is performed after pre-training: according to the task and application requirements of this study, the labels of the training data and the output criterion are added at the top level, and the back-propagation algorithm is used to adjust the parameters of the network.

3.2.3. Back-propagation algorithm

When using back propagation for parameter training, the model parameters of the DNN are trained on a training set {(x_i, y_i)}, i = 1, ..., N, where x_i is the feature vector of the i-th sample and y_i is the corresponding label. The back-propagation algorithm is summarized below.

1. Input x: set the corresponding activation values of the input layer.
2. Forward propagation: compute Equations (11) and (12) for each layer.
3. Output-layer error e^L: the error vector is calculated by

e^L = \nabla_{z^L} J(W, b; x, y).   (14)

4. Back propagation: the error of layer l - 1 is defined as

e^{l-1} = \mathrm{diag}(f'(z^{l-1})) (W^l)^T e^l.   (15)

5. Output: the gradients with respect to the weight matrix and bias of each layer are calculated by Equations (16) and (17), respectively:

\nabla_{W^l} J = e^l (v^{l-1})^T,   (16)
\nabla_{b^l} J = e^l.   (17)
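The forward pass of Eqs. (11)-(13) and the backward recursion of Eqs. (14)-(17) can be sketched together for a mean-squared-error loss J = ||v^L - y||²/2 (a minimal sketch; the choice of loss and the function names are our own):

```python
import numpy as np

def relu_grad(z):
    """Derivative of the ReLU of Eq. (13): 1 where z > 0, else 0."""
    return (z > 0).astype(float)

def backprop(x, y, weights, biases):
    """Return the gradients dJ/dW^l and dJ/db^l for every layer."""
    # forward pass, Eqs. (11)-(12), caching z^l and v^l per layer
    vs, zs = [x], []
    for W, b in zip(weights, biases):
        zs.append(W @ vs[-1] + b)
        vs.append(np.maximum(0.0, zs[-1]))
    # Eq. (14): output-layer error for J = 0.5*||v^L - y||^2
    e = (vs[-1] - y) * relu_grad(zs[-1])
    grads_W, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        # Eqs. (16)-(17): dJ/dW^l = e^l (v^{l-1})^T, dJ/db^l = e^l
        grads_W.insert(0, np.outer(e, vs[l]))
        grads_b.insert(0, e.copy())
        if l > 0:
            # Eq. (15): e^{l-1} = diag(f'(z^{l-1})) (W^l)^T e^l
            e = (weights[l].T @ e) * relu_grad(zs[l - 1])
    return grads_W, grads_b
```

The returned gradients are what the parameter-update step of the next subsection consumes.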
3.2.4. Mini-batch gradient descent optimization algorithm

The BP algorithm is the core algorithm for training a DNN. It optimizes the parameter values in the network according to a predefined loss function, and optimizing the parameters of the neural-network model is an important step in determining the quality of the network model [26]. Gradient Descent (GD) solves for the minimum of the loss function step by step, obtaining the minimum loss and the corresponding model parameter values. Since each parameter update traverses all the sample data, it is a complete gradient descent and has high accuracy; however, it spends a lot of time traversing the whole collection. The Stochastic Gradient Descent (SGD) algorithm, in contrast, does not optimize the loss function on all training samples, but randomly optimizes the loss function of a single training sample at each iteration [27]. As a result, each round of parameter updates is greatly accelerated. However, since SGD optimizes the loss function on one sample at a time, its disadvantage is obvious: a small loss on a local sample does not mean that the loss on all samples is small. A neural network optimized by stochastic gradient descent has difficulty reaching the global optimum, and its accuracy is lower than that of the complete gradient-descent algorithm. To overcome the shortcomings of both the gradient-descent and the stochastic-gradient-descent algorithms, this paper adopts a fusion of the two, the Mini-Batch Gradient Descent (MBGD) algorithm, which calculates the loss function on only a small number of training samples for each parameter update. Such a small portion of the samples is referred to as a batch in this paper.
On the one hand, using matrix operations, optimizing the parameters of a neural network on a batch is comparable in cost to a single sample. On the other hand, using a small portion of the samples each time greatly reduces the number of iterations required for convergence, and the accuracy obtained is close to the result of the full gradient-descent algorithm. Therefore, the MBGD algorithm can overcome the shortcomings of the above two algorithms, and at the same time retain their advantages. The MBGD algorithm randomly extracts m samples X_1, X_2, ..., X_i, ..., X_m from all the training samples. W and b are the sets of weights and biases in the network, Y_i and A_i are the expected output and the actual output for the i-th input sample, and ||·|| is a norm operation. The mean squared error is calculated as follows:

C(W, b) = \frac{1}{2m} \sum_{i=1}^{m} ||Y_i - A_i||^2 = \frac{1}{m} \sum_{i=1}^{m} C_{X_i},   (18)

where C_{X_i} = ||Y_i - A_i||^2 / 2. Accordingly, the overall gradient is represented as

\nabla C \approx \frac{1}{m} \sum_{i=1}^{m} \nabla C_{X_i}.   (19)

Equation (19) estimates the overall gradient using the m sampled data; the larger m is, the more accurate the estimate. The update formulas are then:

W_k' = W_k - \frac{\eta}{m} \sum_{i=1}^{m} \frac{\partial C_{X_i}}{\partial W_k},   (20)

b_l' = b_l - \frac{\eta}{m} \sum_{i=1}^{m} \frac{\partial C_{X_i}}{\partial b_l},   (21)

where η is a positive number in the range [0, 1], called the learning rate. Based on the MBGD optimization algorithm, a flow chart is drawn, as shown in Fig. 7.

Fig. 7. Flow chart of the MBGD optimization algorithm

The PD database is known to have a total of 120 samples. Since the total number of samples is limited, after several trials it was finally determined that two samples are taken each time as a batch to calculate the loss function and update the parameters.
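The batch update of Eqs. (18)-(21) can be sketched generically: each step averages per-sample gradients over a randomly drawn batch and applies them with learning rate η (a minimal sketch with illustrative names; grad_fn stands for the per-sample gradient ∂C_X/∂θ):

```python
import numpy as np

def mbgd(params, grad_fn, samples, batch_size=2, lr=0.1, epochs=60, seed=0):
    """Mini-batch gradient descent, Eqs. (20)-(21): for each batch of m
    samples, params <- params - (lr/m) * sum of per-sample gradients."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(samples))
    for _ in range(epochs):
        rng.shuffle(idx)                       # fresh random batches each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            g = np.mean([grad_fn(params, samples[i]) for i in batch], axis=0)
            params = params - lr * g
    return params
```

With the 120 samples and batch_size = 2 used here, one pass over the data corresponds to the 60 updates per epoch described below.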
After 60 iterations, the training over the entire speech-sample set is completed; this is called an epoch. Since the loss function of each update is calculated from multiple samples, the loss calculation and the parameter updates are more informative, the decline of the loss function is more stable, and the convergence speed is faster. At the same time, the use of small-batch calculations also reduces the amount of computation. This process is computed by Equations (18)-(21).

4. Experimental results

This paper uses the compressed frames of 20 healthy people and 20 patients with PD for training, with k-fold cross-validation. This method is used to measure the predictive performance of the built model, so that the trained model performs well on new data. On the one hand, it can greatly reduce over-fitting; on the other hand, it can obtain as much valid information as possible from the limited sample data. It is a cross-validation method used when the sample size is small. The k-fold cross-validation works as follows. The initial sample is divided into k sub-samples, of which a single sub-sample is kept as the validation data, while the remaining k - 1 sub-samples are used for training. This process is repeated k times, so that each sub-sample is used for validation exactly once. The k results obtained are then averaged to evaluate the performance of the model. When k = n (where n is the total number of samples), this is called the leave-one-out method [3,5]. The test set of each training round then consists of only a single sample, and a total of n trainings and predictions are performed. The training samples selected by this method are only one sample fewer than the total dataset, so they are closest to the distribution of the original samples.
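The split described above can be sketched as follows (a pure-Python sketch; function names are our own). With k = n it degenerates to the leave-one-out scheme:

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Return k (train, validation) index splits: each fold is used
    once for validation and k-1 times for training."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # k roughly equal folds
    splits = []
    for i in range(k):
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        splits.append((train, folds[i]))
    return splits
```

Model accuracy is then averaged over the k validation folds, as in the evaluation below.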
The proposed approach is also tested using an independent test set of 28 PD patients collected by the same physicians. Furthermore, it is compared with the SVM methods with different kernels studied by Benba et al., namely the Radial Basis Function (RBF), linear, polynomial (POL) and MLP kernels [6]. To test the success rate of these classifiers in identifying PD patients and healthy people, the accuracy, sensitivity and specificity are calculated. Accuracy represents the success rate of distinguishing between the two groups of participants, sensitivity represents the accuracy of detecting healthy people, and specificity represents the accuracy of detecting PD patients. TP is true positive (healthy people classified correctly), TN is true negative (PD patients classified correctly), FP is false positive (PD patients classified incorrectly), and FN is false negative (healthy people classified incorrectly). In addition, two further evaluation criteria, the Matthews Correlation Coefficient (MCC) and the Probability Excess (PE), are calculated to show the quality of the binary classification. The calculation formulas are Equations (22)-(26):

Accuracy = (TP + TN) / (TP + TN + FP + FN),   (22)

Sensitivity = TP / (TP + FN),   (23)

Specificity = TN / (TN + FP),   (24)

MCC = (TP \cdot TN - FP \cdot FN) / \sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)},   (25)

PE = (TP \cdot TN - FP \cdot FN) / ((TP+FN)(TN+FP)) = Sensitivity + Specificity - 1.   (26)

Preprocessing and feature extraction are performed on each type of speech sample (/a/, /o/ and /u/). Then the DNN is used for training, and the evaluation indicators accuracy, sensitivity, specificity, MCC and PE are calculated. Table 2 shows the classification accuracy on the vowel /a/ in the same data set for each classifier. With the DNN classification method, the features of the extracted 12-th and 14-th Mel-frequency cepstral coefficients achieve the maximum classification accuracy of 84.5%; thus 33 people are correctly classified, and only 7 are misclassified.
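The five criteria of Eqs. (22)-(26) can be computed directly from the confusion counts. In the sketch below, the counts TP = 16, TN = 11, FP = 9, FN = 4 are an assumed illustration chosen to be consistent with the RBF row of Table 2 (sensitivity 80%, specificity 55%), not figures reported by the paper:

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, MCC and PE, Eqs. (22)-(26)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sens = tp / (tp + fn)                      # detecting healthy people
    spec = tn / (tn + fp)                      # detecting PD patients
    mcc = (tp * tn - fp * fn) / (
        ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5)
    pe = sens + spec - 1.0                     # probability excess, Eq. (26)
    return acc, sens, spec, mcc, pe
```

With the assumed counts this returns accuracy 0.675, sensitivity 0.80, specificity 0.55, MCC ≈ 0.3615 and PE 0.35, matching the RBF row of Table 2.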
With the same parameters, the maximum sensitivity is 85% and the specificity is 80%; that is, 17 healthy people and 16 PD patients are classified correctly. The maximum MCC is 0.6917 and the PE is 0.6500.

Table 2. Results using vowel /a/

Classifier | Accuracy (%) | Sensitivity (%) | Specificity (%) | MCC | PE | Coefficients
RBF | 67.50 | 80.00 | 55.00 | 0.3615 | 0.3500 | 9
Linear | 72.50 | 80.00 | 65.00 | 0.4551 | 0.4500 | 4-8
POL | 70.00 | 65.00 | 75.00 | 0.4020 | 0.4000 | 18
MLP | 80.00 | 85.00 | 75.00 | 0.6030 | 0.6000 | 20
DNN | 84.50 | 85.00 | 80.00 | 0.6917 | 0.6500 | 12,14

Table 3 shows the classification accuracy on the vowel /o/ in the same data set for each classifier. Using the DNN classification method, the extracted sound spectrum of the third Mel-frequency cepstral coefficient achieves the maximum classification accuracy of 84.5%; thus 33 people are correctly classified, and only 7 are misclassified. With the same parameters, the maximum sensitivity is 80% and the specificity is 85%; that is, 16 healthy people and 17 PD patients are classified correctly. The maximum MCC is 0.5774 and the PE is 0.7238.

Table 3. Results using vowel /o/

Classifier | Accuracy (%) | Sensitivity (%) | Specificity (%) | MCC | PE | Coefficients
RBF | 67.50 | 80.00 | 55.00 | 0.3615 | 0.3500 | 7,8,10
Linear | 72.50 | 80.00 | 65.00 | 0.4551 | 0.4500 | 4,6,7
POL | 67.50 | 70.00 | 65.00 | 0.3504 | 0.3500 | 3
MLP | 77.50 | 80.00 | 75.00 | 0.5507 | 0.5500 | 6
DNN | 84.50 | 80.00 | 85.00 | 0.5774 | 0.7238 | 3

Table 4 shows the classification accuracy using only the vowel /u/. Using the DNN classification method, the feature of the extracted sixth Mel-frequency cepstral coefficient achieves the maximum classification accuracy of 89.5%; thus 35 people are correctly classified, and only 5 are misclassified. With the same parameters, the maximum sensitivity is 80% and the specificity is 95%. From these results, 16 healthy people and 19 PD patients are correctly classified. The maximum MCC is 0.6773 and the PE is 0.7440.
These results suggest that the vowel /u/ speech samples contain more discriminative information than the other speech samples.

Table 4. Results using vowel /u/
Classifier  Accuracy(%)  Sensitivity(%)  Specificity(%)  MCC     PE      Coefficients
RBF         80.00        85.00           75.00           0.6030  0.6000  9
Linear      70.00        75.00           65.00           0.4020  0.4000  4-8
POL         72.50        70.00           75.00           0.4506  0.4500  18
MLP         82.50        80.00           85.00           0.6508  0.6500  20
DNN         89.50        80.00           95.00           0.6773  0.7440  6

Table 5 shows the results obtained using all types of voice recordings. These recordings contain the pronunciations of the vowels /a/, /o/ and /u/ by 40 participants, for a total of 120 speech samples (3 × 40). Using the DNN classification method, the features extracted from the 7th and 11th Mel-frequency cepstral coefficients achieve the maximum classification accuracy of 85.00%. The same parameters give a maximum sensitivity of 85.00% and a specificity of 85.00%, from which it can be concluded that 102 person-times are correctly classified and only 18 are misclassified; 51 healthy people and 51 PD patients are correctly classified. The maximum MCC is 0.6619 and the PE is 0.7080.

Table 5. Results using vowels /a/, /o/ and /u/
Classifier  Accuracy(%)  Sensitivity(%)  Specificity(%)  MCC     PE      Coefficients
RBF         80.00        85.00           75.00           0.6030  0.6000  19
Linear      70.00        75.00           65.00           0.4020  0.4000  16 8
POL         72.50        70.00           75.00           0.4506  0.4500  7
MLP         82.50        80.00           85.00           0.6508  0.6500  6
DNN         85.00        85.00           85.00           0.6619  0.7080  7,11

After training and testing on both groups of subjects (PD patients and healthy individuals), independent data sets containing 28 PD patients are used to test and validate the results. The classifiers are trained on vowel /a/ from 20 PD patients and 20 healthy subjects and then tested on the independent data sets. The results are shown in Table 6. The best classification accuracy, 100%, is achieved by the SVM with the MLP and polynomial kernels and by the DNN.
Here, the features of the 5th, 3rd and 7th Mel-frequency cepstral coefficients, respectively, achieve this accuracy, meaning that all 28 PD patients are classified correctly.

Table 6. Test results using vowel /a/
Classifier  Accuracy(%)  Coefficients
RBF         10.71        4,5,6
Linear      10.71        6-10
POL         100          5
MLP         100          3
DNN         100          7

The training for vowel /o/ is performed in the same way, and the vowel /o/ recordings from the independent data set of 28 PD patients are tested and verified. As shown in Table 7, the best classification accuracy, 100%, is achieved by the SVM with the polynomial kernel using the features of the 3rd and 4th Mel-frequency cepstral coefficients, and by the DNN using the 6th Mel-frequency cepstral coefficient. This means that all 28 PD patients are successfully classified.

Table 7. Test results using vowel /o/
Classifier  Accuracy(%)  Coefficients
RBF         32.14        3
Linear      89.28        3
POL         100          3,4
MLP         89.28        5
DNN         100          6

Finally, all speech recordings of the data set are used in the same experiment. These recordings contain the pronunciations of the vowels /a/ and /o/ by 28 participants, for a total of 56 speech samples (2 × 28). After training on vowels /a/ and /o/ from 20 PD patients and 20 healthy individuals, the vowels /a/ and /o/ from the independent data set of 28 PD patients are tested and verified. As can be seen from Table 8, the DNN using the third Mel-frequency cepstral coefficient has the best classification accuracy, 89.07%. This means that 51 of the 56 PD speech samples are successfully classified and only 5 person-times are misdiagnosed.

Table 8. Test results using vowels /a/ and /o/
Classifier  Accuracy(%)  Coefficients
RBF         10.71        4-8
Linear      42.85        3
POL         87.50        5
MLP         46.43        5
DNN         89.07        3

5. Discussion

The vocal impairment of PD patients does not appear suddenly. It develops slowly, and its early-stage symptoms may be overlooked.
To improve the evaluation of Parkinson's disease, this article uses WMFCC, calculating the entropy and taking the average value, to extract the participants' voiceprint features and thereby obtain the dysphonia-detection parameters of PD patients. Compared with the earlier RASTA-PLP method, WMFCC improves both overall performance and frame classification. RASTA-PLP characterizes the speech signal through short-time spectral analysis. Its speech spectrum takes the auditory characteristics of the human ear into account, because the input speech signal is processed by an auditory model, which facilitates the extraction of noise-robust speech features. However, the cepstral coefficients extracted by RASTA-PLP contain many frames, which consumes considerable processing time during classification and hinders fast, accurate diagnosis. HFCC re-adjusts the cepstral coefficients to quite similar amplitudes by liftering them, but the calculation is complicated and time-consuming. WMFCC offers the same anti-noise performance and can extract more high-frequency components from speech signals, so the extracted features have stronger representation ability. The samples in the PD database are then used for training, and 28 PD patients are tested. The results show that the feature of the sixth Mel-frequency cepstral coefficient, classified with the DNN, achieves the highest classification accuracy of 89.5%. This accuracy is higher than those obtained with the RBF, linear, polynomial and MLP kernel functions of the SVM, and with PLDA. In other words, the vowel /u/ speech samples contain more discriminative information than the other speech samples. The traditional SVM method is robust for high-dimensional problems, but it offers no universal solution to nonlinear problems.
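The paper's exact WMFCC formula is not reproduced in this section; the sketch below only illustrates the general idea of weighting cepstral coefficients by an entropy measure so that small-amplitude high-order coefficients contribute more. The function name and the per-coefficient Shannon-entropy weight are this sketch's assumptions, not the authors' definitions.

```python
import numpy as np

def weighted_cepstra(cepstra):
    """Entropy-weighted cepstral coefficients (illustrative WMFCC-style
    weighting, not the paper's exact formula).

    cepstra: array of shape (n_frames, n_coeffs) holding per-frame MFCCs.
    Each coefficient track is weighted by its Shannon entropy across
    frames, boosting the contribution of informative high-order components
    whose raw amplitudes are small."""
    mag = np.abs(cepstra)
    p = mag / (mag.sum(axis=0, keepdims=True) + 1e-12)  # pseudo-distribution per coefficient
    h = -(p * np.log2(p + 1e-12)).sum(axis=0)           # entropy of each coefficient track
    w = h / h.sum()                                     # normalised weights
    return cepstra * w

# Toy usage: 100 frames of 13 MFCCs -> one utterance-level feature vector
mfcc = np.random.default_rng(0).normal(size=(100, 13))
feat = weighted_cepstra(mfcc).mean(axis=0)
```

Averaging the weighted frames, as in the last line, matches the utterance-level "take the average value" step described above.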
The difficulty lies in the choice of kernel functions and in handling large-scale samples. PLDA is a channel-compensation technique based on i-vectors; when applied to classification, it is hard to separate speaker information from channel information. Compared with these two classifiers, the DNN adopts layer-by-layer pre-training based on restricted Boltzmann machines for the unsupervised pre-training stage, followed by supervised fine-tuning. This greatly improves training efficiency and alleviates the problem of local optima. On the independent test sets of 28 PD patients, the maximum classification accuracy is always obtained by the DNN. These results show that the DNN can substantially improve the classification performance between PD patients and healthy people, and that the DNN network structure enhances the classification ability of voiceprint features.

6. Conclusion

WMFCC can effectively extract the high-order cepstral coefficients of the voiceprint and hence strengthen the ability of the feature components to characterize the audio. Using the MBGD algorithm to optimize the DNN classifier reduces the computational complexity of the loss function and improves the training speed of the system. The experimental results on the PD database show that the classification and recognition model in this paper is superior to the traditional SVM classification methods in terms of sensitivity and accuracy, and provides a new solution for the early diagnosis of Parkinson's disease based on voiceprint features. Future work will further study the intrinsic links between the feature extraction of samples and the training of a classifier, and between classification and the complexity of a dataset, and will continue the theoretical derivation.

Acknowledgement

This work is supported by the National Science Foundation of China (Grant No. 61673259).
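As a closing illustration of the mini-batch gradient descent (MBGD) training referred to above, the sketch below trains a small feed-forward network on synthetic two-class data: each update evaluates the loss gradient on a small batch only, so the per-step cost is independent of the full training-set size. It omits the RBM pre-training stage, and the data, layer sizes and learning rate are made up for the example; this is not the paper's network or dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic two-class data standing in for utterance-level voiceprint
# features (label 0 = healthy, 1 = PD). Made-up data, not the PD database.
X = np.vstack([rng.normal(-1.0, 1.0, (100, 12)),
               rng.normal(+1.0, 1.0, (100, 12))])
y = np.hstack([np.zeros(100), np.ones(100)])

# One hidden layer; sizes are illustrative choices.
W1 = rng.normal(0.0, 0.1, (12, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.1, (16, 1));  b2 = np.zeros(1)
lr, batch = 0.1, 20

def forward(xb):
    h = np.maximum(0.0, xb @ W1 + b1)            # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))     # sigmoid output
    return h, p.ravel()

for epoch in range(50):
    idx = rng.permutation(len(X))                # reshuffle every epoch
    for s in range(0, len(X), batch):
        sel = idx[s:s + batch]
        xb, yb = X[sel], y[sel]
        h, p = forward(xb)
        g = (p - yb)[:, None] / len(sel)         # d(cross-entropy)/d(logit)
        gW2, gb2 = h.T @ g, g.sum(0)
        gh = (g @ W2.T) * (h > 0)                # back-prop through ReLU
        gW1, gb1 = xb.T @ gh, gh.sum(0)
        W2 -= lr * gW2; b2 -= lr * gb2           # mini-batch update
        W1 -= lr * gW1; b1 -= lr * gb1

_, p = forward(X)
acc = float(((p > 0.5) == y).mean())             # training accuracy
```

On this well-separated toy data the network reaches near-perfect training accuracy; the batch size trades per-step cost against gradient noise, which is the speed benefit the conclusion attributes to MBGD.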
Conflict of interest statement

We declare that we have no conflicts of interest related to this work. We also declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work.

References

[1] Graham S F, Rey N L, Yilmaz A, et al. Biochemical profiling of the brain and blood metabolome in a mouse model of prodromal Parkinson's disease reveals distinct metabolic profiles. Journal of Proteome Research, 2018.
[2] Returi K D, Mohan V M, Radhika Y. A novel approach for speaker recognition by using wavelet analysis and support vector machines. Proceedings of the Second International Conference on Computer and Communication Technologies. Springer, New Delhi, 2016: 163-174.
[3] Benba A, Jilbab A, Hammouch A, et al. Voiceprints analysis using MFCC and SVM for detecting patients with Parkinson's disease. 2015 International Conference on Electrical and Information Technologies (ICEIT). IEEE, 2015: 300-304.
[4] Benba A, Jilbab A, Hammouch A. Discriminating between patients with Parkinson's and neurological diseases using cepstral analysis. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2016, 24(10): 1100-1103.
[5] Benba A, Jilbab A, Hammouch A, Sandabad S. Using RASTA-PLP for discriminating between different neurological diseases. 2016 International Conference on Electrical and Information Technologies (ICEIT), Tangiers, 2016: 406-408.
[6] Benba A, Jilbab A, Hammouch A. Analysis of multiple types of voice recordings in cepstral domain using MFCC for discriminating between patients with Parkinson's disease and healthy people. International Journal of Speech Technology, 2016, 19(3): 449-456.
[7] Benba A, Jilbab A, Hammouch A. Using human factor cepstral coefficient on multiple types of voice recordings for detecting patients with Parkinson's disease. 2017: 38(6).
[8] Poorjam A H, Little M A, Jensen J R. A parametric approach for classification of distortions in pathological voices. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018: 286-290.
[9] Mountrakis G, Im J, Ogole C. Support vector machines in remote sensing: a review. ISPRS Journal of Photogrammetry & Remote Sensing, 2011, 66: 247-259.
[10] Rosales-Perez A, Garcia S, Terashima-Marin H, et al. MC2ESVM: multiclass classification based on cooperative evolution of support vector machines. IEEE Computational Intelligence Magazine, 2018, 13(2): 18-29.
[11] Feng Yong, Cai Xinyuan, Ji Ruifang. Evaluation of the deep nonlinear metric learning based speaker identification on the large scale of voiceprint corpus. 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2016: 1-4.
[12] Hao Q, Zhang H, Ding J. The hidden layer design for stacked denoising autoencoder. International Computer Conference on Wavelet Active Media Technology and Information Processing. IEEE, 2016: 150-153.
[13] Li N, Mak M W, Chien J T. Deep neural network driven mixture of PLDA for robust i-vector speaker verification. Spoken Language Technology Workshop. IEEE, 2017.
[14] Esteva A, Kuprel B, Novoa R A, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 2017, 542(7639): 115-118.
[15] Liu W, Wang Z, Liu X, et al. A survey of deep neural network architectures and their applications. Neurocomputing, 2017, 234: 11-26.
[16] Sakar B E, Isenkul M E, Sakar C O, et al. Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings. IEEE Journal of Biomedical and Health Informatics, 2013, 17(4): 828-834.
[17] Wahyuni E S. Arabic speech recognition using MFCC feature extraction and ANN classification. 2017 2nd International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE). IEEE, 2017: 22-25.
[18] Selva Nidhyananthan S, Shantha Selva Kumari R, Arun Prakash A. A review on speech enhancement algorithms and why to combine with environment classification. International Journal of Modern Physics C, 2014, 25(10): 1430002.
[19] Holi M S. Automatic detection of neurological disordered voices using mel cepstral coefficients and neural networks. 2013 IEEE Point-of-Care Healthcare Technologies (PHT). IEEE, 2013: 76-79.
[20] Zhang Yuhai, Du Huaidong, Chen Huijun, et al. Characteristic of voice in Parkinson disease. Journal of Audiology and Speech Pathology, 2001, 9(2).
[21] Malewadi D, Ghule G. Development of speech recognition technique for Marathi numerals using MFCC & LFZI algorithm. International Conference on Computing Communication Control and Automation. IEEE, 2017: 1-6.
[22] Benba A, Jilbab A, Hammouch A. Voice analysis for detecting persons with Parkinson's disease using PLP and VQ. Journal of Theoretical & Applied Information Technology, 2014, 70(3): 443.
[23] Li Jin, Jiang Cheng. An improved speech endpoint detection based on spectral subtraction and adaptive sub-band spectral entropy. 2010 International Conference on Intelligent Computation Technology and Automation, Changsha, China, 2010: 591-593.
[24] Tan Z, Mak M W, Mak B K W. DNN-based score calibration with multitask learning for noise robust speaker verification. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 2018, 26(4): 700-712.
[25] Li L, Zhao Y, Zhao F. Hybrid deep neural network-hidden Markov model (DNN-HMM) based speech emotion recognition. 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, Geneva, Switzerland, 2013: 313-315.
[26] Zhang Wei, Li Zhong, Xu Weidong, Zhou Haiquan. A classifier of satellite signals based on the back-propagation neural network. 2015 8th International Congress on Image and Signal Processing (CISP), Shenyang, China, 2015: 1353-1356.
[27] Qian Q, Jin R, Yi J, et al. Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD). Machine Learning, 2015, 99(3): 353-372.