Model-based Speech Enhancement for Intelligibility Improvement in Binaural Hearing Aids



Mathew Shaji Kavalekalam, Student Member, IEEE, Jesper Kjær Nielsen, Member, IEEE, Jesper Bünsow Boldt, Member, IEEE, and Mads Græsbøll Christensen, Senior Member, IEEE

(Mathew S. Kavalekalam, Jesper K. Nielsen and Mads G. Christensen are with the Audio Analysis Lab, Department of Architecture, Design and Media Technology, Aalborg University. Jesper Boldt is with GN Hearing, Ballerup, Denmark.)

Abstract—Speech intelligibility is often severely degraded for hearing impaired individuals in situations such as the cocktail party scenario, and the performance of current hearing aid technology has been observed to be limited in these scenarios. In this paper, we propose a binaural speech enhancement framework that takes the speech production model into consideration. The enhancement framework proposed here is based on the Kalman filter, which allows us to take the speech production dynamics into account during the enhancement process. The use of a Kalman filter requires the estimation of the clean speech and noise short term predictor (STP) parameters, and of the clean speech pitch parameters. In this work, a binaural codebook-based method is proposed for estimating the STP parameters, and a directional pitch estimator based on the harmonic model and the maximum likelihood principle is used to estimate the pitch parameters. The proposed method estimates the STP and pitch parameters jointly from the information at the left and right ears, leading to a more robust estimation of the filter parameters. Objective measures such as PESQ and STOI have been used to evaluate the enhancement framework in different acoustic scenarios representative of the cocktail party scenario. We have also conducted subjective listening tests on a set of nine normal hearing subjects to evaluate the performance in terms of intelligibility and quality improvement. The listening tests show that the proposed algorithm, even with access to only a single channel noisy observation, significantly improves the overall speech quality, and the speech intelligibility by up to 15%.

Index Terms—Kalman filter, binaural enhancement, pitch estimation, autoregressive model.

I. INTRODUCTION

Normal hearing (NH) individuals have the ability to concentrate on a single speaker even in the presence of multiple interfering speakers. This phenomenon is termed the cocktail party effect. Hearing impaired individuals, however, lack this ability to separate out a single speaker in the presence of multiple competing speakers, which leads to listener fatigue and isolation of the hearing aid (HA) user. Mimicking the cocktail party effect in a digital HA is therefore very much desired in such scenarios [1]. Thus, to help the HA user focus on a particular speaker, speech enhancement has to be performed to reduce the effect of the interfering speakers. The primary objectives of a speech enhancement system in a HA are to improve the intelligibility and the quality of the degraded speech. Often, a hearing impaired person is fitted with HAs at both ears. Modern HAs have the technology to communicate wirelessly with each other, making it possible to share information between the HAs. This property enables the use of binaural speech enhancement algorithms.
The binaural processing of noisy signals has been shown to be more effective than processing the noisy signal independently at each ear, due to the utilization of spatial information [2]. Apart from better noise reduction performance, binaural algorithms make it possible to preserve the binaural cues, which contribute to spatial release from masking [3]. Often, HAs are fitted with multiple microphones at both ears, and binaural speech enhancement algorithms have been developed for such cases [4], [5]. In [4], a multichannel Wiener filter for HA applications is proposed, which results in a minimum mean squared error (MMSE) estimate of the target speech. These methods were shown to distort the binaural cues of the interfering noise while maintaining the binaural cues of the target. Consequently, a method was proposed in [6] that introduced a parameter to trade off between noise reduction and cue preservation. The above mentioned algorithms have reported improvements in speech intelligibility.

We are here mainly concerned with the binaural enhancement of speech with access to only one microphone per HA [7]-[9]. More specifically, this paper is concerned with a two-input two-output system. This situation is encountered in in-the-ear (ITE) HAs, where space constraints limit the number of microphones per HA. Moreover, in the case where we have multiple microphones per HA, beamforming can be applied individually on each HA to form the two inputs, which can then be processed further by the proposed dual channel enhancement framework. One of the first approaches to dual channel speech enhancement was that of [7], where a two channel spectral subtraction was combined with an adaptive Wiener post-filter. This led to a distortion of the binaural cues, as different gains were applied to the left and right channels. Another approach to dual channel speech enhancement was proposed in [8]; this solution consisted of two stages, the first dealing with the estimation of the interference signals using equalisation-cancellation theory, and the second being an adaptive Wiener filter. The intelligibility improvements obtainable with the algorithms stated above have not been studied well. These algorithms perform the enhancement in the frequency domain by assuming that the speech and noise components are uncorrelated, and do not take the nature of the speech production process into account.

In this paper, we propose a binaural speech enhancement framework that takes the speech production model into account. The model used here is based on the source-filter model, where the filter corresponds to the vocal tract and the source corresponds to the excitation signal produced by the vocal cords. Using a physically meaningful model not only gives us a sufficiently accurate way of explaining how the signals were generated, but also helps in reducing the number of parameters to be estimated. One way to exploit this speech production model for the enhancement process is to use a Kalman filter, as the speech production dynamics can be modelled within the Kalman filter using the state space equations, while also accounting for the background noise. Kalman filtering for single channel speech enhancement in the presence of white background noise was first proposed in [10]. This work was later extended to deal with coloured noise in [11], [12].
One of the main limitations of Kalman filtering based enhancement is that the state space parameters required for the formulation of the state space equations need to be known or estimated. The estimation of the state space parameters is a difficult problem due to the non-stationary nature of speech and the presence of noise. The state space parameters are the autoregressive (AR) coefficients and the excitation variances of the speech and noise, respectively. Henceforth, the AR coefficients along with the excitation variances will be denoted as the short term predictor (STP) parameters. In [11], [12], these STP parameters were estimated using an approximated expectation-maximisation algorithm. However, the performance of these algorithms was noted to be unsatisfactory in non-stationary noise environments. Moreover, these algorithms assumed the excitation signal in the source-filter model to be white Gaussian noise. Even though this assumption is appropriate for modelling unvoiced speech, it is not very suitable for modelling voiced speech. This issue was handled in [13] by using a modified model for the excitation signal capable of modelling both voiced and unvoiced speech. The usage of this model for the enhancement process required the estimation of the pitch parameters in addition to the STP parameters. This modification of the excitation signal was found to improve the performance in voiced speech regions, but the performance of the algorithm in the presence of non-stationary background noise was still observed to be unsatisfactory. This was primarily due to the poor estimation of the model parameters in non-stationary background noise. The noise STP parameters were estimated in [13] by assuming that the first 100 milliseconds of the speech segment contained only noise, and the parameters were then assumed to be constant.

In this work, we introduce a binaural model-based speech enhancement framework which addresses the poor parameter estimation explained above. We propose a binaural codebook-based method for estimating the STP parameters, and a directional pitch estimator based on the harmonic model for estimating the pitch parameters. The estimated parameters are subsequently used in a binaural speech enhancement framework based on the signal model used in [13]. Codebook-based approaches for estimating STP parameters in the single channel case have been previously proposed in [14], and have been used to estimate the filter parameters required for the Kalman filter for single channel speech enhancement in [15]. In this work, we extend this to the dual channel case, where we assume that there is a wireless link between the HAs. The estimation of the STP and pitch parameters using the information from both the left and right channels leads to a more robust estimation of these parameters. Thus, in this work, we propose a binaural speech enhancement method that is model-based in several ways: 1) the state space equations involved in the Kalman filter take the dynamics of the speech production model into account; 2) the estimation of the STP parameters utilised in the Kalman filter is based on trained spectral models of speech and noise; and 3) the pitch parameters used within the Kalman filter are estimated based on the harmonic model, which is a good model for voiced speech. We remark that this paper is an extension of the previous conference papers [16], [17].
In comparison to [16], [17], we have used an improved method for estimating the excitation variances. Moreover, the proposed enhancement framework has been evaluated in more realistic scenarios, and subjective listening tests have been conducted to validate the results obtained using objective measures.

II. PROBLEM FORMULATION

In this section, we formulate the problem and state the assumptions used in this work. The noisy signals at the left/right ears at time index n are denoted by

  z_{l/r}(n) = s_{l/r}(n) + w_{l/r}(n),  ∀ n = 0, 1, 2, ...,   (1)

where z_{l/r}, s_{l/r} and w_{l/r} denote the noisy, clean and noise components at the left/right ears, respectively. It is assumed that the clean speech component is statistically independent of the noise component. Our objective is to obtain estimates of the clean speech signals, denoted ŝ_{l/r}(n), from the noisy signals. Processing the noisy speech with a speech enhancement system to estimate the clean speech signal requires knowledge of the speech and noise statistics. To obtain this, it is convenient to assume a statistical model for the speech and noise components, making it easier to estimate the statistics from the noisy signal. In this work, we model the clean speech as an AR process, which is a common model for the speech production process [18]. We also assume that the speech source is in the nose direction of the listener, so that the clean speech components at the left and right ears can be represented by AR processes having the same parameters,

  s_{l/r}(n) = \sum_{i=1}^{P} a_i s_{l/r}(n-i) + u(n),   (2)

where a = [-a_1, ..., -a_P]^T is the set of speech AR coefficients, P is the order of the speech AR process and u(n) is the excitation signal corresponding to the speech signal. Often, u(n) is modelled as white Gaussian noise with variance σ_u², and this will be referred to as the unvoiced (UV) model [11]. It should be noted that we do not model reverberation here. Similarly to the speech, the noise components are represented by AR processes as

  w_{l/r}(n) = \sum_{i=1}^{Q} c_i w_{l/r}(n-i) + v(n),   (3)

where c = [-c_1, ..., -c_Q]^T is the set of noise AR coefficients, Q is the order of the noise AR process and v(n) is white Gaussian noise with variance σ_v².

Fig. 1: Basic block diagram of the binaural enhancement framework: the noisy signals z_l(n) and z_r(n) feed a joint parameter estimation stage and two Kalman smoothers, which output ŝ_l(n) and ŝ_r(n).

As noted above, the excitation signal u(n) in (2) is often modelled as white Gaussian noise. Although this assumption is suitable for representing unvoiced speech, it is not appropriate for modelling voiced speech. Thus, inspired by [13], the enhancement framework here models u(n) as

  u(n) = b(p) u(n-p) + d(n),   (4)

where d(n) is white Gaussian noise with variance σ_d², p is the pitch period and b(p) ∈ (0, 1) is the degree of voicing. In portions containing predominantly voiced speech, b(p) is assumed to be close to 1 and the variance of d(n) is assumed to be small, whereas in portions of unvoiced speech, b(p) is assumed to be close to zero, so that (2) simplifies to the conventional unvoiced AR model. The excitation model in (4), when used together with (2), is referred to as the voiced-unvoiced (V-UV) model.
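To make the signal model concrete, the following minimal numpy sketch (ours, not part of the original paper) draws a signal from the V-UV model in (2) and (4); the AR coefficients, pitch period and variances are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_vuv(a, b_p, p, sigma_d, n_samples):
    """Draw a signal from the V-UV model:
    u(n) = b(p) u(n-p) + d(n),  s(n) = sum_i a_i s(n-i) + u(n)."""
    P = len(a)
    u = np.zeros(n_samples)
    s = np.zeros(n_samples)
    for n in range(n_samples):
        d = rng.normal(scale=sigma_d)                  # white Gaussian d(n)
        u[n] = (b_p * u[n - p] if n >= p else 0.0) + d
        hist = s[max(0, n - P):n][::-1]                # most recent samples first
        s[n] = np.dot(a[:len(hist)], hist) + u[n]
    return s

# Voiced segment: b(p) close to 1 and a small sigma_d; with b_p = 0 the
# model reduces to the conventional unvoiced AR model of (2).
a = np.array([1.3, -0.6])      # hypothetical, stable AR(2) vocal tract
s = synthesize_vuv(a, b_p=0.95, p=80, sigma_d=0.01, n_samples=2000)
```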
This model can easily be incorporated into the speech enhancement framework by modifying the state space equations. The incorporation of the V-UV model into the enhancement framework requires the pitch parameters, p and b(p), to be estimated from the noisy signal in addition to the STP parameters. We remark that these parameters are usually time varying in the case of speech and noise signals. Herein, these parameters are assumed to be quasi-stationary, and are estimated for every frame index f_n = ⌊n/M⌋ + 1, where M is the frame length. The estimation of these parameters is explained in the subsequent section.

III. PROPOSED ENHANCEMENT FRAMEWORK

A. Overview

The enhancement framework proposed here assumes that there is a communication link between the two HAs that makes it possible to exchange information. Fig. 1 shows the basic block diagram of the proposed enhancement framework. The noisy signals at the left and right ears are enhanced using a fixed-lag Kalman smoother (FLKS), which requires the estimation of the STP and pitch parameters. These parameters are estimated jointly using the information in the left and right channels. The use of identical filter parameters at both ears leads to the preservation of the binaural cues. In this paper, the details of the proposed binaural framework are explained, and its performance is compared with that of the bilateral framework, in which it is assumed that there is no communication link between the two HAs, so that the filter parameters are estimated independently at each ear. We now explain the different components of the proposed enhancement framework in detail.

B. FLKS for speech enhancement

As alluded to in the introduction, a Kalman filter allows us to take the speech production dynamics into account in the form of state space equations, while also accounting for the observation noise. In this work, we use an FLKS, which is a variant of the Kalman filter. An FLKS gives better performance than a Kalman filter, but has a higher delay. In this section, we explain the functioning of the FLKS for both the UV and V-UV models introduced in Section II. We assume here that the model parameters are known. For the UV model, the usage of an FLKS (with a smoother delay of d_s ≥ P) from a speech enhancement perspective requires the AR signal model in (2) to be written in the state space form

  \bar{s}_{l/r}(n) = A(f_n) \bar{s}_{l/r}(n-1) + \Gamma_1 u(n),   (5)

where \bar{s}_{l/r}(n) = [s_{l/r}(n), s_{l/r}(n-1), ..., s_{l/r}(n-d_s)]^T is the state vector containing the d_s + 1 most recent speech samples, Γ_1 = [1, 0, ..., 0]^T is a (d_s+1) × 1 vector, u(n) = d(n), and A(f_n) is the (d_s+1) × (d_s+1) speech state transition matrix

  A(f_n) = \begin{bmatrix} -a(f_n)^T & 0^T & 0 \\ I_P & 0 & 0 \\ 0 & I_{d_s-P} & 0 \end{bmatrix}.   (6)

The state space equation for the noise signal in (3) is similarly written as

  \bar{w}_{l/r}(n) = C(f_n) \bar{w}_{l/r}(n-1) + \Gamma_2 v(n),   (7)

where \bar{w}_{l/r}(n) = [w_{l/r}(n), w_{l/r}(n-1), ..., w_{l/r}(n-Q+1)]^T, Γ_2 = [1, 0, ..., 0]^T is a Q × 1 vector, and

  C(f_n) = \begin{bmatrix} [c_1(f_n), \ldots, c_{Q-1}(f_n)] & c_Q(f_n) \\ I_{Q-1} & 0 \end{bmatrix}   (8)

is a Q × Q matrix.
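As an illustration, the companion-form structure of A(f_n) in (6) and C(f_n) in (8) can be constructed as follows. This is a sketch with our own naming; the input vectors follow the sign conventions a = [-a_1, ..., -a_P]^T and c = [-c_1, ..., -c_Q]^T used above.

```python
import numpy as np

def speech_transition_matrix(a, d_s):
    """(d_s+1) x (d_s+1) speech state transition matrix A(f_n) of (6),
    given a = [-a_1, ..., -a_P]^T."""
    P = len(a)
    A = np.zeros((d_s + 1, d_s + 1))
    A[0, :P] = -np.asarray(a)      # first row: [a_1, ..., a_P, 0, ..., 0]
    A[1:, :-1] = np.eye(d_s)       # delay line: shift state entries down
    return A

def noise_transition_matrix(c):
    """Q x Q noise state transition matrix C(f_n) of (8),
    given c = [-c_1, ..., -c_Q]^T."""
    Q = len(c)
    C = np.zeros((Q, Q))
    C[0, :] = -np.asarray(c)       # first row: [c_1, ..., c_Q]
    C[1:, :-1] = np.eye(Q - 1)
    return C
```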
The state space equations in (5) and (7) are combined to form a concatenated state space equation for the UV model,

  \begin{bmatrix} \bar{s}_{l/r}(n) \\ \bar{w}_{l/r}(n) \end{bmatrix} = \begin{bmatrix} A(f_n) & 0 \\ 0 & C(f_n) \end{bmatrix} \begin{bmatrix} \bar{s}_{l/r}(n-1) \\ \bar{w}_{l/r}(n-1) \end{bmatrix} + \begin{bmatrix} \Gamma_1 & 0 \\ 0 & \Gamma_2 \end{bmatrix} \begin{bmatrix} d(n) \\ v(n) \end{bmatrix},

which can be rewritten as

  \bar{x}^{UV}_{l/r}(n) \triangleq F^{UV}(f_n) \bar{x}^{UV}_{l/r}(n-1) + \Gamma_3 y(n),   (9)

where \bar{x}^{UV}_{l/r}(n) = [\bar{s}_{l/r}(n)^T \ \bar{w}_{l/r}(n)^T]^T is the concatenated state vector and F^{UV}(f_n) is the concatenated state transition matrix for the UV model. The observation equation producing the noisy signal is then written as

  z_{l/r}(n) = \Gamma_{UV}^T \bar{x}^{UV}_{l/r}(n),   (10)

where Γ_UV = [Γ_1^T Γ_2^T]^T. The state space equation (9) and the observation equation (10) can then be used to formulate the prediction and correction stages of the FLKS for the UV model.

We now explain the formulation of the state space equations for the V-UV model. The state space equation for the V-UV model of speech is

  \bar{s}_{l/r}(n) = A(f_n) \bar{s}_{l/r}(n-1) + \Gamma_1 u(n),   (11)

where the excitation signal in (4) is itself modelled by the state space equation

  \bar{u}(n) = B(f_n) \bar{u}(n-1) + \Gamma_4 d(n),   (12)

where \bar{u}(n) = [u(n), u(n-1), ..., u(n-p_max+1)]^T, p_max is the maximum pitch period in integer samples, Γ_4 = [1, 0, ..., 0]^T is a p_max × 1 vector, and

  B(f_n) = \begin{bmatrix} [b(1), \ldots, b(p_{max}-1)] & b(p_{max}) \\ I_{p_{max}-1} & 0 \end{bmatrix}   (13)

is a p_max × p_max matrix with b(i) = 0 ∀ i ≠ p(f_n). The concatenated state space equation for the V-UV model is

  \begin{bmatrix} \bar{s}_{l/r}(n) \\ \bar{u}(n+1) \\ \bar{w}_{l/r}(n) \end{bmatrix} = \begin{bmatrix} A(f_n) & \Gamma_1 \Gamma_2^T & 0 \\ 0 & B(f_n) & 0 \\ 0 & 0 & C(f_n) \end{bmatrix} \begin{bmatrix} \bar{s}_{l/r}(n-1) \\ \bar{u}(n) \\ \bar{w}_{l/r}(n-1) \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ \Gamma_4 & 0 \\ 0 & \Gamma_2 \end{bmatrix} \begin{bmatrix} d(n+1) \\ v(n) \end{bmatrix},

which can also be written as

  \bar{x}^{V-UV}_{l/r}(n+1) \triangleq F^{V-UV}(f_n) \bar{x}^{V-UV}_{l/r}(n) + \Gamma_5 g(n+1),   (14)

where \bar{x}^{V-UV}_{l/r}(n+1) = [\bar{s}_{l/r}(n)^T \ \bar{u}(n+1)^T \ \bar{w}_{l/r}(n)^T]^T is the concatenated state vector, g(n+1) = [d(n+1) v(n)]^T and F^{V-UV}(f_n) is the concatenated state transition matrix for the V-UV model. The observation equation producing the noisy signal is written as

  z_{l/r}(n) = \Gamma_{V-UV}^T \bar{x}^{V-UV}_{l/r}(n+1),   (15)

where Γ_{V-UV} = [Γ_1^T 0^T Γ_2^T]^T. The state space equation (14) and the observation equation (15) can then be used to formulate the prediction and correction stages of the FLKS for the V-UV model (see Appendix A). It can be seen that the formulation of the prediction and correction stages of the FLKS requires knowledge of the speech and noise STP parameters, and of the clean speech pitch parameters. The estimation of these model parameters is explained in the subsequent sections.
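A single prediction/correction recursion of the FLKS (detailed in Appendix A) can be sketched as below. This is our own minimal rendering with generic names; since the observation here is a noise-free linear function of the concatenated state, no measurement-noise term appears in the gain.

```python
import numpy as np

def flks_step(x_post, M_post, z_n, F, Gamma5, Q_exc, gamma_obs):
    """One FLKS prediction/correction step (cf. Appendix A).
    x_post, M_post : a posteriori state estimate/covariance at n-1
    z_n            : scalar noisy observation z(n)
    F              : concatenated state transition matrix
    Gamma5, Q_exc  : excitation input matrix and diag(sigma_d^2, sigma_v^2)
    gamma_obs      : observation vector (Gamma_UV or Gamma_V-UV)"""
    # Prediction
    x_pred = F @ x_post
    M_pred = F @ M_post @ F.T + Gamma5 @ Q_exc @ Gamma5.T
    # Kalman gain; the innovation variance is a scalar here
    s = gamma_obs @ M_pred @ gamma_obs
    K = (M_pred @ gamma_obs) / s
    # Correction
    x_new = x_pred + K * (z_n - gamma_obs @ x_pred)
    M_new = M_pred - np.outer(K, gamma_obs) @ M_pred
    return x_new, M_new
```

The enhanced sample at lag d_s + 1 is then read out of the corrected state vector, as in (44) of Appendix A.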
C. Codebook-based binaural estimation of STP parameters

As mentioned in the introduction, the estimation of the speech and noise STP parameters forms a very critical part of the proposed enhancement framework. These parameters are here estimated using a codebook-based approach. The estimation of STP parameters using a codebook-based approach with access to a single channel noisy signal has been previously proposed in [14], [19]. Here, we extend this to the case where we have access to binaural noisy signals.

Codebook-based estimation of STP parameters uses a priori information about the speech and noise spectral shapes, stored in trained speech and noise codebooks in the form of speech and noise AR coefficients, respectively. The codebooks offer an elegant way of including prior information about the speech and noise spectral models: if the enhancement system in the HA has to operate in a particular noisy environment, or mainly process speech from a particular set of speakers, the codebooks can be trained accordingly. Conversely, if we do not have any specific information regarding the speaker or the noisy environment, we can still train general codebooks from a large database consisting of different speakers and noise types. We remark that we assume the UV model of speech for the estimation of the STP parameters. A Bayesian framework is utilised to estimate the parameters for every frame index. Thus, the random variables (r.v.) corresponding to the parameters to be estimated for the f_n-th frame are concatenated into a single vector θ(f_n) = [θ_s(f_n)^T θ_w(f_n)^T]^T = [a(f_n)^T σ_d²(f_n) c(f_n)^T σ_v²(f_n)]^T, where a(f_n) and c(f_n) are r.v. representing the speech and noise AR coefficients, and σ_d²(f_n) and σ_v²(f_n) are r.v. representing the speech and noise excitation variances. The MMSE estimate of the parameter vector is

  \hat{\theta}(f_n) = E(\theta(f_n) \mid z_l(f_n M), z_r(f_n M)),   (16)

where E(·) is the expectation operator and z_{l/r}(f_n M) = [z_{l/r}(f_n M), ..., z_{l/r}(f_n M + m), ..., z_{l/r}(f_n M + M - 1)]^T denotes the f_n-th frame of noisy speech at the left/right ear. The frame index f_n is left out for the remainder of this section for notational convenience. Equation (16) is then rewritten as

  \hat{\theta} = \int_{\Theta} \theta \frac{p(z_l, z_r \mid \theta) p(\theta)}{p(z_l, z_r)} \, d\theta,   (17)

where Θ denotes the combined support space of the parameters to be estimated. Since we have assumed that the speech and noise are independent (see Section II), it follows that p(θ) = p(θ_s) p(θ_w), where θ_s and θ_w denote the speech and noise STP parameters, respectively. Furthermore, the speech and noise AR coefficients are assumed to be independent of the excitation variances, leading to p(θ_s) = p(a) p(σ_d²) and p(θ_w) = p(c) p(σ_v²). Using these assumptions, (17) is rewritten as

  \hat{\theta} = \int_{\Theta} \theta \frac{p(z_l, z_r \mid \theta) p(a) p(\sigma_d^2) p(c) p(\sigma_v^2)}{p(z_l, z_r)} \, d\theta.   (18)

The probability density of the AR coefficients is here modelled as a sum of Dirac delta functions centred at the codebook entries: p(a) = (1/N_s) \sum_{i=1}^{N_s} δ(a - a_i) and p(c) = (1/N_w) \sum_{j=1}^{N_w} δ(c - c_j), where a_i is the i-th entry of the speech codebook (of size N_s) and c_j is the j-th entry of the noise codebook (of size N_w). Defining θ_{ij} ≜ [a_i^T σ_d² c_j^T σ_v²]^T, (18) can be rewritten as

  \hat{\theta} = \frac{1}{N_s N_w} \sum_{i=1}^{N_s} \sum_{j=1}^{N_w} \int_{\sigma_d^2} \int_{\sigma_v^2} \theta_{ij} \frac{p(z_l, z_r \mid \theta_{ij}) p(\sigma_d^2) p(\sigma_v^2)}{p(z_l, z_r)} \, d\sigma_d^2 \, d\sigma_v^2.   (19)

For a particular set of speech and noise AR coefficients, a_i and c_j, it can be shown that the likelihood p(z_l, z_r | θ_{ij}) decays rapidly from its maximum value for small deviations of the excitation variances from their true values [14] (see Appendix B).
If we then approximate the true values of the excitation variances by the corresponding maximum likelihood (ML) estimates, denoted σ²_{d,ij} and σ²_{v,ij}, the likelihood term can be approximated as p(z_l, z_r | θ_{ij}) δ(σ_d² - σ²_{d,ij}) δ(σ_v² - σ²_{v,ij}). Defining θ^{ML}_{ij} ≜ [a_i^T σ²_{d,ij} c_j^T σ²_{v,ij}]^T, and using the above approximation together with the sifting property ∫ f(x) δ(x - x_0) dx = f(x_0), we can rewrite (19) as

  \hat{\theta} = \frac{1}{N_s N_w} \sum_{i=1}^{N_s} \sum_{j=1}^{N_w} \theta^{ML}_{ij} \frac{p(z_l, z_r \mid \theta^{ML}_{ij}) p(\sigma^2_{d,ij}) p(\sigma^2_{v,ij})}{p(z_l, z_r)},   (20)

where

  p(z_l, z_r) = \frac{1}{N_s N_w} \sum_{i=1}^{N_s} \sum_{j=1}^{N_w} p(z_l, z_r \mid \theta^{ML}_{ij}) p(\sigma^2_{d,ij}) p(\sigma^2_{v,ij}).

Details regarding the prior distributions used for the excitation variances are given in Appendix C. It can be seen from (20) that the final estimate of the parameter vector is a weighted linear combination of the θ^{ML}_{ij}, with weights proportional to p(z_l, z_r | θ^{ML}_{ij}) p(σ²_{d,ij}) p(σ²_{v,ij}). To compute this, we first need the ML estimates of the excitation variances for a given set of speech and noise AR coefficients, a_i and c_j:

  \{\sigma^2_{d,ij}, \sigma^2_{v,ij}\} = \arg\max_{\sigma_d^2, \sigma_v^2 \geq 0} p(z_l, z_r \mid \theta_{ij}).   (21)

For the models assumed in Section II, it can be shown that z_l and z_r are statistically independent given θ_{ij} [20, Sec. 8.2.2], which results in p(z_l, z_r | θ_{ij}) = p(z_l | θ_{ij}) p(z_r | θ_{ij}). We first derive the likelihood for the left channel, p(z_l | θ_{ij}), using the assumptions introduced in Section II. Under these assumptions, the frames of the speech and noise components associated with the noisy frame z_l, denoted s_l and w_l respectively, are distributed as

  p(s_l \mid \sigma_d^2, a_i) \sim \mathcal{N}(0, \sigma_d^2 R_s(a_i)), \quad p(w_l \mid \sigma_v^2, c_j) \sim \mathcal{N}(0, \sigma_v^2 R_w(c_j)),

where R_s(a_i) is the normalised speech covariance matrix and R_w(c_j) is the normalised noise covariance matrix. These matrices can be asymptotically approximated as circulant matrices, which are diagonalised by the Fourier transform as [14], [21] R_s(a_i) = F D_{s_i} F^H and R_w(c_j) = F D_{w_j} F^H, where F is the discrete Fourier transform (DFT) matrix, [F]_{m,k} = (1/\sqrt{M}) \exp(\imath 2\pi mk / M), ∀ m, k = 0, ..., M-1, with k the frequency index, and

  D_{s_i} = (\Lambda_{s_i}^H \Lambda_{s_i})^{-1}, \quad \Lambda_{s_i} = \mathrm{diag}\left( \sqrt{M} F^H [1 \ a_i^T \ 0^T]^T \right),
  D_{w_j} = (\Lambda_{w_j}^H \Lambda_{w_j})^{-1}, \quad \Lambda_{w_j} = \mathrm{diag}\left( \sqrt{M} F^H [1 \ c_j^T \ 0^T]^T \right).

Thus, we obtain the likelihood for the left channel as p(z_l | θ_{ij}) ∼ N(0, σ_d² F D_{s_i} F^H + σ_v² F D_{w_j} F^H). The log-likelihood ln p(z_l | θ_{ij}) is then given by

  \ln p(z_l \mid \theta_{ij}) \stackrel{c}{=} \ln \left| \sigma_d^2 F D_{s_i} F^H + \sigma_v^2 F D_{w_j} F^H \right|^{-\frac{1}{2}} - \frac{1}{2} z_l^T \left( \sigma_d^2 F D_{s_i} F^H + \sigma_v^2 F D_{w_j} F^H \right)^{-1} z_l,   (22)

where \stackrel{c}{=} denotes equality up to a constant and |·| denotes the matrix determinant. Denoting by 1/A_s^i(k) the k-th diagonal element of D_{s_i} and by 1/A_w^j(k) the k-th diagonal element of D_{w_j}, (22) can be rewritten as

  \ln p(z_l \mid \theta_{ij}) \stackrel{c}{=} \ln \prod_{k=0}^{K-1} \left( \frac{\sigma_d^2}{A_s^i(k)} + \frac{\sigma_v^2}{A_w^j(k)} \right)^{-\frac{1}{2}} - \frac{1}{2} z_l^T F \, \mathrm{diag}_k \left( \frac{\sigma_d^2}{A_s^i(k)} + \frac{\sigma_v^2}{A_w^j(k)} \right)^{-1} F^H z_l.   (23)

Defining the modelled spectrum

  \hat{P}_{z_{ij}}(k) \triangleq \frac{\sigma_d^2}{A_s^i(k)} + \frac{\sigma_v^2}{A_w^j(k)},

(23) can be written as

  \ln p(z_l \mid \theta_{ij}) \stackrel{c}{=} \ln \prod_{k=0}^{K-1} \left( \hat{P}_{z_{ij}}(k) \right)^{-\frac{1}{2}} - \frac{1}{2} \sum_{k=0}^{K-1} \frac{P_{z_l}(k)}{\hat{P}_{z_{ij}}(k)},   (24)

where P_{z_l}(k) is the squared magnitude of the k-th element of the vector F^H z_l. Thus,

  \ln p(z_l \mid \theta_{ij}) \stackrel{c}{=} -\frac{1}{2} \sum_{k=0}^{K-1} \left( \frac{P_{z_l}(k)}{\hat{P}_{z_{ij}}(k)} + \ln \hat{P}_{z_{ij}}(k) \right).   (25)

We can then see that the log-likelihood is equal, up to a constant, to the Itakura-Saito (IS) divergence between P_{z_l} and \hat{P}_{z_{ij}}, which is defined as [22]

  d_{IS}(P_{z_l}, \hat{P}_{z_{ij}}) = \frac{1}{K} \sum_{k=0}^{K-1} \left( \frac{P_{z_l}(k)}{\hat{P}_{z_{ij}}(k)} - \ln \frac{P_{z_l}(k)}{\hat{P}_{z_{ij}}(k)} - 1 \right),

where P_{z_l} = [P_{z_l}(0), ..., P_{z_l}(K-1)]^T and \hat{P}_{z_{ij}} = [\hat{P}_{z_{ij}}(0), ..., \hat{P}_{z_{ij}}(K-1)]^T. Using the same result for the right ear, the optimisation problem in (21) can, under the aforementioned conditions, be equivalently written as

  \{\sigma^2_{d,ij}, \sigma^2_{v,ij}\} = \arg\min_{\sigma_d^2, \sigma_v^2 \geq 0} \left[ d_{IS}(P_{z_l}, \hat{P}_{z_{ij}}) + d_{IS}(P_{z_r}, \hat{P}_{z_{ij}}) \right].   (26)

Unfortunately, it is not possible to obtain a closed form expression for the excitation variances by minimising (26). Instead, (26) is solved iteratively using the multiplicative update (MU) method [23]. For notational convenience, \hat{P}_{z_{ij}} can be written as \hat{P}_{z_{ij}} = P_{s,i} σ_d² + P_{w,j} σ_v², where

  P_{s,i} = \left[ \frac{1}{A_s^i(0)}, \ldots, \frac{1}{A_s^i(K-1)} \right]^T, \quad P_{w,j} = \left[ \frac{1}{A_w^j(0)}, \ldots, \frac{1}{A_w^j(K-1)} \right]^T.

Defining P_{ij} = [P_{s,i} P_{w,j}] and Σ^{(l)}_{ij} = [σ^{2(l)}_{d,ij} σ^{2(l)}_{v,ij}]^T, where σ^{2(l)}_{d,ij} and σ^{2(l)}_{v,ij} denote the ML estimates of the excitation variances at the l-th MU iteration, the excitation variances are computed iteratively as [24]

  \sigma^{2(l+1)}_{d,ij} \leftarrow \sigma^{2(l)}_{d,ij} \frac{P_{s,i}^T \left[ (P_{ij} \Sigma^{(l)}_{ij})^{-2} \odot (P_{z_l} + P_{z_r}) \right]}{2 P_{s,i}^T (P_{ij} \Sigma^{(l)}_{ij})^{-1}},   (27)

  \sigma^{2(l+1)}_{v,ij} \leftarrow \sigma^{2(l)}_{v,ij} \frac{P_{w,j}^T \left[ (P_{ij} \Sigma^{(l)}_{ij})^{-2} \odot (P_{z_l} + P_{z_r}) \right]}{2 P_{w,j}^T (P_{ij} \Sigma^{(l)}_{ij})^{-1}},   (28)

where ⊙ denotes element-wise multiplication and (·)^{-2} the element-wise inverse square. The excitation variances estimated using (27) and (28) minimise the cost function in (26). Using these results, p(z_l, z_r | θ^{ML}_{ij}) can be written as

  p(z_l, z_r \mid \theta^{ML}_{ij}) = C e^{ -\frac{M}{2} \left[ d_{IS}(P_{z_l}, \hat{P}^{ML}_{z_{ij}}) + d_{IS}(P_{z_r}, \hat{P}^{ML}_{z_{ij}}) \right] },   (29)

where C is a normalisation constant, \hat{P}^{ML}_{z_{ij}} = [\hat{P}^{ML}_{z_{ij}}(0), ..., \hat{P}^{ML}_{z_{ij}}(K-1)]^T and

  \hat{P}^{ML}_{z_{ij}}(k) = \frac{\sigma^2_{d,ij}}{A_s^i(k)} + \frac{\sigma^2_{v,ij}}{A_w^j(k)}.   (30)

Once the likelihoods have been calculated using (29), they are substituted into (20) to obtain the final estimates of the speech and noise STP parameters.
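The MU iterations (27)-(28) can be written compactly in numpy as follows. This is a sketch with our own names, assuming the codebook spectra P_{s,i}, P_{w,j} and the periodograms P_{z_l}, P_{z_r} are given as length-K vectors; the initialisation and iteration count are illustrative.

```python
import numpy as np

def ml_excitation_variances(P_s, P_w, P_zl, P_zr, n_iter=50, eps=1e-12):
    """Multiplicative updates (27)-(28) for the ML excitation variances
    of one codebook pair (a_i, c_j)."""
    P_obs = P_zl + P_zr
    var_d, var_v = 1.0, 1.0                    # non-negative initialisation
    for _ in range(n_iter):
        P_hat = var_d * P_s + var_v * P_w      # modelled spectrum P_ij Sigma_ij
        var_d *= (P_s @ (P_obs / (P_hat**2 + eps))) / \
                 (2.0 * (P_s @ (1.0 / (P_hat + eps))))
        var_v *= (P_w @ (P_obs / (P_hat**2 + eps))) / \
                 (2.0 * (P_w @ (1.0 / (P_hat + eps))))
    return var_d, var_v
```

Because the updates are multiplicative, the variances remain non-negative throughout, which is exactly the constraint in (26).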
Some other practicalities involved in the estimation procedure of the STP parameters are explained next.

1) Adaptive noise codebook: The noise codebook used for the estimation of the STP parameters is usually generated using a training sample of the noise type of interest. However, there might be scenarios where the noise type is not known a priori. In such scenarios, to make the enhancement system more robust, the noise codebook can be appended with an entry corresponding to the noise power spectral density (PSD) estimated using another dual channel method. Here, we utilise the dual channel noise PSD estimator of [7], which requires the transmission of the noisy signals between the HAs. The estimated dual channel noise PSD, \hat{P}^{DC}_w(k), is then used to find the AR coefficients and the variance representing the noise spectral envelope. First, the autocorrelation coefficients corresponding to the noise PSD estimate are computed using the Wiener-Khinchin theorem as

  r_{ww}(q) = \sum_{k=0}^{K-1} \hat{P}^{DC}_w(k) \exp\left( \frac{\imath 2\pi q k}{K} \right), \quad 0 \leq q \leq Q.

Subsequently, the AR coefficients, denoted ĉ_DC = [1, ĉ_1^DC, ..., ĉ_Q^DC]^T, and the excitation variance corresponding to the dual channel noise PSD estimate are obtained by the Levinson-Durbin recursion [25, p. 100]. The estimated AR coefficient vector ĉ_DC is then appended to the noise codebook. The final estimate of the noise excitation variance can be taken as the mean of the variance obtained from the dual channel estimate and the variance obtained from (20). It should be noted that, in case a noise codebook is not available a priori, the speech codebook can be used in conjunction with the dual channel noise PSD estimate alone, which leads to a reduction in the computational complexity. Other dual channel noise PSD estimation algorithms from the literature, e.g. [26], [27], could in principle also be included in the noise codebook.
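The PSD-to-AR conversion described above can be sketched as follows: the autocorrelation is obtained as the inverse DFT of the PSD estimate, after which a standard Levinson-Durbin recursion yields the AR coefficients and the excitation variance. This is our own minimal implementation, not the paper's code.

```python
import numpy as np

def psd_to_ar(P_w, Q):
    """AR coefficients [1, c_1, ..., c_Q] and excitation variance from a
    noise PSD estimate via Wiener-Khinchin and Levinson-Durbin."""
    r = np.real(np.fft.ifft(P_w))[:Q + 1]   # autocorrelation lags 0..Q
    a = np.zeros(Q + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, Q + 1):               # Levinson-Durbin recursion
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        for i in range(1, m):
            a[i] = a_prev[i] + k * a_prev[m - i]
        a[m] = k
        err *= 1.0 - k * k                  # prediction error update
    return a, err
```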
D. Directional pitch estimator

As we have seen previously, the formulation of the state transition matrix in (12) requires the estimation of the pitch parameters. In this paper, we propose a parametric method to estimate the pitch parameters of clean speech in noise. The babble noise generally encountered in a cocktail party scenario is spectrally coloured. As the pitch estimator proposed here is optimal only for white Gaussian noise, pre-whitening is first performed on the noisy signal to whiten the noise component. Pre-whitening is performed using the estimated noise AR coefficients as

  \tilde{z}_{l/r}(n) = z_{l/r}(n) + \sum_{i=1}^{Q} \hat{c}_i(f_n) z_{l/r}(n-i).   (31)

The method proposed here operates on signal vectors \tilde{z}_{l/r_c}(f_n M) ∈ C^M, defined as \tilde{z}_{l/r_c}(f_n M) = [\tilde{z}_{l/r_c}(f_n M), ..., \tilde{z}_{l/r_c}(f_n M + M - 1)]^T, where \tilde{z}_{l/r_c}(n) is the complex signal corresponding to \tilde{z}_{l/r}(n), obtained using the Hilbert transform. The method uses the harmonic model to represent the clean speech as a sum of L harmonically related complex sinusoids. Using the harmonic model, the noisy signal at the left ear, in a vector of Gaussian noise \tilde{w}_{l_c}(f_n M) with covariance matrix Q_l(f_n), is represented as

  \tilde{z}_{l_c}(f_n M) = V(f_n) D_l q(f_n) + \tilde{w}_{l_c}(f_n M),   (32)

where q(f_n) is a vector of complex amplitudes and V(f_n) is the Vandermonde matrix V(f_n) = [v_1(f_n) ... v_L(f_n)] with [v_p(f_n)]_m = e^{\imath \omega_0 p (f_n M + m - 1)}, ω_0 being the fundamental frequency, and D_l is the directivity matrix from the source to the left ear. The directivity matrix contains frequency and angle dependent delay and magnitude terms along its diagonal, designed using the method in [28, eq. 3]. Similarly, the noisy signal at the right ear is written as

  \tilde{z}_{r_c}(f_n M) = V(f_n) D_r q(f_n) + \tilde{w}_{r_c}(f_n M).   (33)

The frame index f_n is omitted for the remainder of this section for notational convenience. Assuming independence between the channels, the likelihood can, due to Gaussianity, be expressed as

  p(\tilde{z}_{l_c}, \tilde{z}_{r_c} \mid \epsilon) = \mathcal{CN}(\tilde{z}_{l_c}; V D_l q, Q_l) \, \mathcal{CN}(\tilde{z}_{r_c}; V D_r q, Q_r),   (34)

where ε is the parameter set containing ω_0, the complex amplitudes, the directivity matrices and the noise covariance matrices. Assuming that the noise is white in both channels, the likelihood is rewritten as

  p(\tilde{z}_{l_c}, \tilde{z}_{r_c} \mid \epsilon) = \frac{ e^{ -\left( \frac{\| \tilde{z}_{l_c} - V D_l q \|^2}{\sigma_l^2} + \frac{\| \tilde{z}_{r_c} - V D_r q \|^2}{\sigma_r^2} \right) } }{ (\pi \sigma_l \sigma_r)^{2M} },   (35)

and the log-likelihood is then

  \ln p(\tilde{z}_{l_c}, \tilde{z}_{r_c} \mid \epsilon) = -M (\ln \pi \sigma_l^2 + \ln \pi \sigma_r^2) - \left( \frac{\| \tilde{z}_{l_c} - V D_l q \|^2}{\sigma_l^2} + \frac{\| \tilde{z}_{r_c} - V D_r q \|^2}{\sigma_r^2} \right).   (36)

Assuming the fundamental frequency to be known, the ML estimate of the amplitudes is obtained as

  \hat{q} = (H^H H)^{-1} H^H y,   (37)

where H = [(V D_l)^T (V D_r)^T]^T and y = [\tilde{z}_{l_c}^T \tilde{z}_{r_c}^T]^T. These amplitude estimates are further used to estimate the noise variances as

  \hat{\sigma}^2_{l/r} = \frac{1}{M} \| \hat{\tilde{w}}_{l/r_c} \|^2 = \frac{1}{M} \| \tilde{z}_{l/r_c} - V D_{l/r} \hat{q} \|^2.   (38)

Substituting these into (36), we obtain the log-likelihood

  \ln p(\tilde{z}_{l_c}, \tilde{z}_{r_c} \mid \epsilon) \stackrel{c}{=} -M (\ln \hat{\sigma}_l^2 + \ln \hat{\sigma}_r^2).   (39)

The ML estimate of the fundamental frequency is then

  \hat{\omega}_0 = \arg\min_{\omega_0 \in \Omega_0} (\ln \hat{\sigma}_l^2 + \ln \hat{\sigma}_r^2),   (40)

where Ω_0 is the set of candidate fundamental frequencies, i.e., (40) is evaluated on a grid of candidate fundamental frequencies. The pitch is then obtained by rounding the reciprocal of the estimated fundamental frequency in Hz. We remark that the model order L is estimated here using the maximum a posteriori (MAP) rule [29, p. 38]. The degree of voicing is calculated as the ratio between the energy (computed as the square of the l_2-norm) present at integer multiples of the fundamental frequency and the total energy present in the signal. This is motivated by the observation that, in highly voiced regions, the energy of the signal is concentrated at the harmonics.

Figures 2 and 3 show pitch estimates obtained from the binaural noisy signal (SNR = 3 dB) for the proposed method (which uses information from the two channels) and for a single channel pitch estimation method using only the left channel, respectively. It can be seen that the use of the two channels leads to a more robust pitch estimation.

Fig. 2: Fundamental frequency estimates using the proposed method (SNR = 3 dB). The red line indicates the true fundamental frequency and the blue asterisks denote the estimated fundamental frequency.

Fig. 3: Fundamental frequency estimates using the corresponding single channel method [29] (SNR = 3 dB).

The main steps involved in the proposed enhancement framework for the V-UV model are shown in Algorithm 1. The enhancement framework for the UV model differs from the V-UV model in that it does not require estimation of the pitch parameters, and in that the FLKS equations are derived based on (9) and (10) instead of (14) and (15).

Algorithm 1 Main steps involved in the binaural enhancement framework
 1: while new time-frames are available do
 2:   Estimate the dual channel noise PSD and append the noise codebook with the AR coefficients corresponding to the estimated noise PSD \hat{P}^{DC}_w (see Section III-C1).
 3:   for ∀ i ∈ N_s do
 4:     for ∀ j ∈ N_w do
 5:       Compute the ML estimates of the excitation noise variances (σ²_{d,ij} and σ²_{v,ij}) using (27) and (28).
 6:       Compute the modelled spectrum \hat{P}^{ML}_{z_{ij}} using (30).
 7:       Compute the likelihood values p(z_l, z_r | θ^{ML}_{ij}) using (29).
 8:     end for
 9:   end for
10:   Get the final estimates of the STP parameters using (20).
11:   Estimate the pitch parameters using the algorithm explained in Section III-D.
12:   Use the estimated STP parameters and the pitch parameters in the FLKS equations (see Appendix A) to obtain the enhanced signal.
13: end while
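As an illustration of the pitch estimator of Section III-D (step 11 of Algorithm 1), the following simplified sketch performs the grid search of (40). To keep the example self-contained, the directivity matrices D_{l/r} are taken as identity matrices (frontal source) and the harmonic order is fixed rather than MAP-estimated; the grid, frame length and sampling rate are illustrative.

```python
import numpy as np
from scipy.signal import hilbert

def estimate_f0(z_l, z_r, fs, f0_grid, L=10):
    """Grid-search ML fundamental frequency estimation, cf. (37)-(40),
    with D_l = D_r = I. z_l, z_r are real, pre-whitened noisy frames."""
    zc_l, zc_r = hilbert(z_l), hilbert(z_r)         # analytic signals
    M = len(zc_l)
    n = np.arange(M)
    best_cost, best_f0 = np.inf, None
    for f0 in f0_grid:
        w0 = 2.0 * np.pi * f0 / fs
        L_h = min(L, int(np.pi / w0))               # keep harmonics below Nyquist
        V = np.exp(1j * w0 * np.outer(n, np.arange(1, L_h + 1)))
        H = np.vstack([V, V])                       # stacked model, cf. (37)
        y = np.concatenate([zc_l, zc_r])
        q, *_ = np.linalg.lstsq(H, y, rcond=None)   # ML amplitude estimates
        var_l = np.mean(np.abs(zc_l - V @ q) ** 2)  # residual variances (38)
        var_r = np.mean(np.abs(zc_r - V @ q) ** 2)
        cost = np.log(var_l) + np.log(var_r)        # cost function of (40)
        if cost < best_cost:
            best_cost, best_f0 = cost, f0
    return best_f0

f0 = estimate_f0(np.random.randn(200), np.random.randn(200),
                 fs=8000, f0_grid=np.arange(80.0, 400.0, 0.5))
```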
IV. SIMULATION RESULTS

In this section, we present the experiments that have been carried out to evaluate the proposed enhancement framework.

A. Implementation details

The test audio files used for the experiments consisted of speech from the GRID database [30], re-sampled to 8 kHz. The noisy signals were generated using the simulation set-ups explained in Section IV-B. The speech and noise STP parameters required for the enhancement process were estimated every 25 ms using the codebook-based approach explained in Section III-C. The speech and noise codebooks used for the estimation of the STP parameters were obtained by the generalised Lloyd algorithm [31]. During the training process, AR coefficients (converted into line spectral frequency coefficients) are extracted from windowed frames of the training signal and passed as input to the vector quantiser. Working in the line spectral frequency domain is guaranteed to result in stable inverse filters [32]. Codebook vectors are then obtained as the output of the vector quantiser, depending on the size of the codebook. For our experiments, we have used both a speaker-specific codebook and a general speech codebook. A speaker-specific codebook of 64 entries was generated using head related impulse response (HRIR) convolved speech from the specific speaker of interest. A general speech codebook of 256 entries was generated from a training sample of 30 minutes of HRIR convolved speech from 30 different speakers. Using a speaker-specific codebook instead of a general speech codebook leads to an improvement in performance; a comparison between the two was made in [15]. It should be noted that the sentences used for training the codebooks were not included in the test sequences. The noise codebook, consisting of only 8 entries, was generated using thirty seconds of noise signal [33]. The AR model order for both the speech and the noise signal was empirically chosen to be 14. The pitch period and the degree of voicing were estimated as explained in Section III-D, where the cost function in (40) was evaluated on a 0.5 Hz grid for fundamental frequencies in the range 80-400 Hz. For each fundamental frequency candidate ω_0, the model orders considered were L = {1, ..., ⌊2π/ω_0⌋}.
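The codebook training procedure described above can be sketched as follows. This is our own illustration: kmeans2 stands in for the generalised Lloyd algorithm of [31], the per-frame AR coefficients are obtained with the autocorrelation (Yule-Walker) method, the LSF conversion follows the standard construction via the symmetric and antisymmetric polynomials, and frame extraction and windowing are assumed to be done beforehand.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.linalg import solve_toeplitz

def ar_to_lsf(a_full):
    """Line spectral frequencies of A(z) = 1 + a_1 z^-1 + ... + a_P z^-P,
    via the roots of P(z) = A(z) + z^-(P+1) A(z^-1) and
    Q(z) = A(z) - z^-(P+1) A(z^-1)."""
    p_poly = np.concatenate([a_full, [0.0]]) + np.concatenate([[0.0], a_full[::-1]])
    q_poly = np.concatenate([a_full, [0.0]]) - np.concatenate([[0.0], a_full[::-1]])
    ang = np.concatenate([np.angle(np.roots(p_poly)), np.angle(np.roots(q_poly))])
    # keep the angles strictly inside (0, pi); trivial roots at z = +/-1 drop out
    return np.sort(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])

def train_codebook(frames, P, n_entries):
    """Train an AR-envelope codebook in the LSF domain."""
    feats = []
    for x in frames:
        r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + P]
        ar = solve_toeplitz(r[:P], -r[1:P + 1])     # Yule-Walker equations
        feats.append(ar_to_lsf(np.concatenate([[1.0], ar])))
    centroids, _ = kmeans2(np.array(feats), n_entries, minit='++', seed=0)
    return centroids
```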
B. Simulation set-up

In this paper, we have considered two simulation set-ups representative of the cocktail party scenario. The details of the two set-ups are given below.

1) Set-up 1: The clean signals were first convolved with an anechoic binaural HRIR corresponding to the nose direction, taken from a database [34]. Noisy signals were then generated by adding binaurally recorded babble noise taken from the ETSI database [33].

2) Set-up 2: The noisy signals were generated using the McRoomSim acoustic simulation software [35]. Fig. 4 shows the geometry of the room along with the speaker, listener and interferers, representing a typical cocktail party scenario. The dimensions of the room are 10 × 6 × 4 m, and the reverberation time of the room was chosen to be 0.4 s.

Fig. 4: Set-up 2 showing the cocktail party scenario, where 1 (red) indicates the speaker of interest, 2-10 (red) are the interferers, and 1, 2 (blue) are the microphones on the left and right ears, respectively.

C. Evaluated enhancement frameworks

In this section, we give an overview of the binaural and bilateral enhancement frameworks that have been evaluated in this paper using objective and subjective scores.

1) Binaural enhancement framework: In the binaural enhancement framework, we assume that there is a wireless link between the HAs. Thus, the filter parameters are estimated jointly using the information at the left and right channels.

Proposed methods: The binaural enhancement framework utilising the V-UV model, when used in conjunction with a general speech codebook, is denoted Bin-S(V-UV), whereas Bin-Spkr(V-UV) denotes the case where we use a speaker-specific codebook. The binaural enhancement framework utilising the UV model, when used in conjunction with a general speech codebook, is denoted Bin-S(UV), whereas Bin-Spkr(UV) denotes the case where we use a speaker-specific codebook.

Reference methods: For comparison, we have used the methods proposed in [7] and [8], which we denote TwoChSS and TS-WF, respectively. We chose these methods for comparison as TwoChSS was one of the first methods designed for a two-input two-output configuration and TS-WF is one of the state of the art methods in this class.

2) Bilateral enhancement framework: In the bilateral enhancement framework, single channel speech enhancement techniques are performed independently at each ear.

Proposed methods: The bilateral enhancement framework utilising the V-UV model, when used in conjunction with a general speech codebook, is denoted Bil-S(V-UV), whereas Bil-Spkr(V-UV) denotes the case where we use a speaker-specific codebook. The bilateral enhancement framework utilising the UV model, when used in conjunction with a general speech codebook, is denoted Bil-S(UV), whereas Bil-Spkr(UV) denotes the case where we use a speaker-specific codebook. The bilateral case differs from the binaural case in the estimation of the filter parameters: they are estimated independently for each ear, which leads to different filter parameters at each ear; e.g., the STP parameters are estimated using the method in [19] independently for each ear.

Reference methods: For comparison, we have used the methods proposed in [36] and [37], which we denote MMSE-GGP and PMBE, respectively.

D. Objective measures

The objective measures STOI [38] and PESQ [39] have been used to evaluate the intelligibility and quality obtained with the different enhancement frameworks. We have evaluated the performance of the algorithms separately for the two simulation set-ups explained in Section IV-B.
Tables I and II show the objective measures obtained for the binaural and bilateral enhancement frameworks, respectively, when evaluated in set-up 1. The test signals used for the binaural and bilateral enhancement frameworks are identical, and the scores shown in the tables are averaged across the left and right channels. In contrast to the reference methods, which reduce the STOI scores, all of the proposed methods improve the STOI scores, and Bin-Spkr(V-UV) performs best in terms of STOI. In addition to preserving the binaural cues, it is evident from the scores that the binaural frameworks in general perform better than the bilateral frameworks, and that the improvement of the binaural framework over the bilateral framework is more pronounced at low SNRs. It can also be seen that the V-UV model, which takes the pitch information into account, performs better than the UV model.

TABLE I: Comparison of objective measures (PESQ & STOI) for the different BINAURAL enhancement frameworks at four signal to noise ratios. The noisy signals used for the evaluation were generated using simulation set-up 1.

        SNR    Bin-Spkr(UV)  Bin-Spkr(V-UV)  Bin-S(UV)  Bin-S(V-UV)  TS-WF  TwoChSS  Noisy
  STOI  0 dB   0.71          0.75            0.68       0.72         0.62   0.64     0.67
        3 dB   0.80          0.82            0.77       0.79         0.69   0.72     0.73
        5 dB   0.84          0.85            0.81       0.83         0.74   0.77     0.78
        10 dB  0.91          0.91            0.90       0.90         0.85   0.86     0.87
  PESQ  0 dB   1.43          1.53            1.37       1.45         1.40   1.49     1.33
        3 dB   1.67          1.72            1.58       1.68         1.55   1.66     1.43
        5 dB   1.80          1.85            1.73       1.78         1.68   1.79     1.50
        10 dB  2.24          2.22            2.13       2.14         2.13   2.20     1.70

TABLE II: Comparison of objective measures (PESQ & STOI) for the different BILATERAL enhancement frameworks at four signal to noise ratios. The noisy signals used for the evaluation were generated using simulation set-up 1.

        SNR    Bil-Spkr(UV)  Bil-Spkr(V-UV)  Bil-S(UV)  Bil-S(V-UV)  MMSE-GGP  PMBE  Noisy
  STOI  0 dB   0.68          0.72            0.66       0.70         0.66      0.66  0.67
        3 dB   0.77          0.79            0.75       0.78         0.73      0.73  0.73
        5 dB   0.81          0.83            0.80       0.82         0.78      0.78  0.78
        10 dB  0.90          0.90            0.89       0.90         0.87      0.87  0.87
  PESQ  0 dB   1.37          1.45            1.34       1.40         1.26      1.30  1.33
        3 dB   1.58          1.65            1.53       1.60         1.43      1.43  1.43
        5 dB   1.72          1.76            1.66       1.72         1.50      1.56  1.50
        10 dB  2.12          2.10            2.04       2.05         1.73      1.79  1.70

Tables III and IV show the objective measures obtained for the different binaural and bilateral enhancement frameworks, respectively, when evaluated in simulation set-up 2. The results obtained for set-up 2 show trends similar to those for set-up 1. We also remark that, in the range 0.6-0.8, an increase of 0.05 in STOI score corresponds to an increase of approximately 16 percentage points in subjective intelligibility [40].

TABLE III: Comparison of STOI scores for the different BINAURAL enhancement frameworks at four signal to noise ratios. The noisy signals used for the evaluation were generated using simulation set-up 2.

        SNR    Bin-Spkr(UV)  Bin-Spkr(V-UV)  Bin-S(UV)  Bin-S(V-UV)  TS-WF  TwoChSS  Noisy
  STOI  0 dB   0.63          0.68            0.61       0.66         0.62   0.58     0.60
        3 dB   0.73          0.75            0.71       0.74         0.69   0.67     0.68
        5 dB   0.78          0.80            0.76       0.79         0.73   0.72     0.73
        10 dB  0.88          0.89            0.87       0.88         0.81   0.83     0.84

TABLE IV: Comparison of STOI scores for the different BILATERAL enhancement frameworks at four signal to noise ratios. The noisy signals used for the evaluation were generated using simulation set-up 2.

        SNR    Bil-Spkr(UV)  Bil-Spkr(V-UV)  Bil-S(UV)  Bil-S(V-UV)  MMSE-GGP  PMBE  Noisy
  STOI  0 dB   0.61          0.65            0.60       0.64         0.58      0.60  0.60
        3 dB   0.71          0.74            0.69       0.73         0.66      0.68  0.68
        5 dB   0.76          0.79            0.75       0.78         0.72      0.73  0.73
        10 dB  0.87          0.88            0.86       0.88         0.83      0.84  0.84

E. Inter-aural errors

We now evaluate the proposed algorithm in terms of binaural cue preservation. This is evaluated objectively using the inter-aural time difference (ITD) and inter-aural level difference (ILD) errors, as also used in [8]. The ITD error is calculated as

  ITD = \frac{ \left| \angle C_{enh} - \angle C_{clean} \right| }{ \pi },   (41)

where ∠C_enh and ∠C_clean denote the phases of the cross PSDs of the enhanced and clean signals, respectively, given by C_enh = E{Ŝ_l Ŝ_r} and C_clean = E{S_l S_r}, where Ŝ_{l/r} denotes the spectrum of the enhanced signal at the left/right ear and S_{l/r} the spectrum of the clean signal at the left/right ear. The expectation is computed by averaging over all frames and frequency indices (omitted here for notational convenience). The ILD error is calculated as

  ILD = \left| 10 \log_{10} \frac{I_{enh}}{I_{clean}} \right|,   (42)

where I_enh = E{|Ŝ_l|²} / E{|Ŝ_r|²} and I_clean = E{|S_l|²} / E{|S_r|²}. Fig. 5 shows the ILD and ITD errors for the proposed method Bin-Spkr(V-UV), TwoChSS and TS-WF for different angles of arrival. It can be seen that the proposed method has lower ITD and ILD errors than TwoChSS and TS-WF. It should be noted that the proposed method and TwoChSS do not use the angle of arrival and assume that the speaker of interest is in the nose direction of the listener; TS-WF, on the other hand, requires a priori knowledge of the angle of arrival. Thus, to make the comparison fair, we have included the inter-aural cues for TS-WF when the speaker of interest is assumed to be in the nose direction.

Fig. 5: Inter-aural cues for different speaker positions: (a) ILD, (b) ITD, for the proposed method, TwoChSS and TS-WF.
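The inter-aural error measures (41)-(42) can be computed from STFTs as follows. This is a sketch with our own names; we use the conventional conjugate in the cross-PSD, and average over all time-frequency points as described above.

```python
import numpy as np
from scipy.signal import stft

def interaural_errors(s_l, s_r, e_l, e_r, fs=8000):
    """ITD and ILD errors of (41)-(42) between the clean (s) and
    enhanced (e) binaural signals."""
    S_l = stft(s_l, fs)[2]
    S_r = stft(s_r, fs)[2]
    E_l = stft(e_l, fs)[2]
    E_r = stft(e_r, fs)[2]
    C_clean = np.mean(S_l * np.conj(S_r))        # cross PSDs
    C_enh = np.mean(E_l * np.conj(E_r))
    itd = np.abs(np.angle(C_enh) - np.angle(C_clean)) / np.pi
    I_clean = np.mean(np.abs(S_l)**2) / np.mean(np.abs(S_r)**2)
    I_enh = np.mean(np.abs(E_l)**2) / np.mean(np.abs(E_r)**2)
    ild = np.abs(10.0 * np.log10(I_enh / I_clean))
    return itd, ild
```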
F. Listening tests

We have conducted listening tests to measure the performance of the proposed algorithm in terms of quality and intelligibility improvements. The tests were conducted on a set of nine NH subjects, in a silent room, using a set of Beyerdynamic DT 990 Pro headphones. The speech enhancement method evaluated in the listening tests is Bil-Spkr(V-UV) for a single channel. We chose this case for the tests as we wanted to test the simpler, but more challenging, case of intelligibility and quality improvement with access to only a single channel. Moreover, as the tests were conducted with NH subjects, we also wanted to eliminate any bias in the results that could be caused by the binaural cues [41], as the benefit of using binaural cues is higher for an NH person than for a hearing impaired person.

1) Quality tests: The quality performance of the proposed algorithm was evaluated using MUSHRA experiments [42]. The test subjects were asked to evaluate the quality of the processed audio files in a MUSHRA set-up. The subjects were presented with the clean, processed and noisy signals. The processing algorithms considered here were Bil-Spkr(V-UV) and MMSE-GGP, and the SNR of the noisy signal was 10 dB. The subjects were asked to rate the presented signals on a scale of 0-100. Fig. 6 shows the mean scores along with the 95% confidence intervals obtained for the different methods. It can be seen from the figure that the proposed method performs significantly better than the reference method.

Fig. 6: Mean scores and 95% confidence intervals obtained in the MUSHRA test for the clean, Bil-Spkr(V-UV), MMSE-GGP and noisy signals.

2) Intelligibility tests: Intelligibility tests were conducted using sentences from the GRID database [30]. The GRID database contains sentences spoken by 34 different speakers (18 males and 16 females).
The sentences have the following syntax: "Bin Blue (Color) by S (Letter) 5 (Digit) please". Table V shows the syntax of all possible sentences. The subjects were asked to identify the color, letter and digit after listening to each sentence. The sentences were played back in the SNR range -8 to 0 dB for the different algorithms. This SNR range was chosen because all the subjects were NH, which led to the intelligibility of the unprocessed signal above 2 dB being close to 100%. A total of nine test subjects took part in the experiments, and the average time taken for the listening test was approximately two hours per person. The noise signal used for the tests was the babble signal from the AURORA database [43]. The test subjects evaluated the noisy signal (unp) and two versions of the processed signal, nr100 and nr85. The first version, nr100, refers to the fully enhanced signal, and the second version, nr85, refers to a mixture of the enhanced and noisy signals with 85% of the enhanced signal and 15% of the noisy signal. This mixing combination was chosen empirically [44].

TABLE V: Sentence syntax of the GRID database.

  command  color  preposition  letter      digit  adverb
  bin      blue   at           A-Z (no W)  0-9    again
  lay      green  by                              now
  place    red    in                              please
  set      white  with                            soon

Figures 7, 8 and 9 show the intelligibility percentages, along with 90% probability intervals, obtained for the digit, color and letter fields, respectively, as a function of SNR for the different methods. It can be seen that nr85 consistently performs best, followed by nr100 and unp. Fig. 10 shows the mean accuracy over all three fields; it can be seen from the figure that nr85 gives up to 15% improvement in intelligibility at -8 dB SNR.

Fig. 7: Mean percentage of correct answers given by the participants for the digit field as a function of SNR for the different methods: unp is the noisy signal, nr100 the fully enhanced signal, and nr85 a mixture of 85% enhanced signal and 15% noisy signal.

Fig. 8: Mean percentage of correct answers given by the participants for the color field as a function of SNR for the different methods.

Fig. 9: Mean percentage of correct answers given by the participants for the letter field as a function of SNR for the different methods.

Fig. 10: Mean percentage of correct answers given by the participants for all fields as a function of SNR for the different methods.

We have also computed the probabilities that a particular method is better than the unprocessed signal in terms of intelligibility. For the computation of these probabilities, the posterior probability of success for each method is modelled using a beta distribution. Table VI shows these probabilities at the different SNRs for the three fields; P(nr85 > unp) denotes the probability that nr85 is better than unp. It can be seen from the table that nr85 consistently has a very high probability of being better than unp at all SNRs, whereas nr100 has a high probability of decreasing the intelligibility for the color field at -2 dB and the letter field at 0 dB; this can also be seen in Figures 8 and 9. In terms of the mean intelligibility across all fields, the probability that nr85 performs better than unp is 1 at all SNRs. Similarly, the probability that nr100 performs better than unp is also very high across all SNRs.

TABLE VI: Probabilities that a particular method is better than the unprocessed signal.

  SNR (dB)                 -8     -6     -4     -2     0
  Digit   P(nr85 > unp)    1      1      1      1      1
          P(nr100 > unp)   1      1      1      0.91   0.99
  Color   P(nr85 > unp)    1      0.99   0.99   0.99   0.99
          P(nr100 > unp)   0.98   0.91   0.89   0.24   0.27
  Letter  P(nr85 > unp)    1      1      1      0.96   0.99
          P(nr100 > unp)   1      0.44   0.99   0.22   0.19
  Mean    P(nr85 > unp)    1      1      1      1      1
          P(nr100 > unp)   1      0.99   1      0.50   0.87
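The probabilities in Table VI can, in principle, be reproduced as follows: with a uniform Beta(1, 1) prior (our assumption, as the paper does not state its prior), k correct answers out of n trials give a Beta(k+1, n-k+1) posterior for the success probability, and P(A > B) is estimated by Monte Carlo sampling. The counts in the usage line are hypothetical.

```python
import numpy as np

def prob_better(k_a, n_a, k_b, n_b, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P(p_A > p_B) for Beta posteriors
    Beta(k+1, n-k+1) under a uniform prior."""
    rng = np.random.default_rng(seed)
    p_a = rng.beta(k_a + 1, n_a - k_a + 1, size=n_draws)
    p_b = rng.beta(k_b + 1, n_b - k_b + 1, size=n_draws)
    return float(np.mean(p_a > p_b))

# e.g. nr85 scoring 52/60 vs unp scoring 40/60 (hypothetical counts)
print(prob_better(52, 60, 40, 60))
```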
V. DISCUSSION

The noise reduction capabilities of a HA are limited, especially in situations such as the cocktail party scenario. Single channel speech enhancement algorithms that do not use any prior information regarding the speech and noise type have not been able to show much improvement in speech intelligibility [45]. A class of algorithms that has received significant attention recently is deep neural network (DNN) based speech enhancement. These algorithms use a priori information about speech and noise types to learn the structure of the mapping function between noisy and clean speech features, and they have been able to show improvements in speech intelligibility when trained for very specific scenarios. Recently, the performance of a general DNN based enhancement system was investigated in terms of objective measures and intelligibility tests [46]. Even though the general system showed improvements in the objective measures, the intelligibility tests failed to show consistent improvements across the SNR range. In this paper, we have proposed a model-based speech enhancement framework that takes into account the speech production model, characterised by the vocal tract and the excitation signal. The proposed framework uses a priori information regarding the speech spectral envelopes (used for modelling the characteristics of the vocal tract) and the noise spectral envelopes. In comparison to DNN based algorithms, the training data required by the proposed algorithm, and the number of parameters to be trained, are significantly smaller. The parameters to be trained in the proposed algorithm are the AR coefficients corresponding to the speech and noise spectral shapes, which are considerably fewer than the weights of a DNN. As the number of parameters to be trained is much smaller, it should also be possible to train these parameters on-line in noise-only or speech-only scenarios.
The proposed framework was able to show consistent improvements in the intelligibility tests even for the single channel case, as shown in Section IV-F2. Moreover, we have shown the benefit of using multiple channels for enhancement by means of objective experiments. We would like to remark that the enhancement algorithm proposed in this paper is computationally more complex than conventional speech enhancement algorithms such as [36]. However, there exist methods in the literature that can reduce the computational complexity of the proposed algorithm: the pitch estimation can be sped up using the principles proposed in [47], and there are efficient ways of performing Kalman filtering that exploit the structured and sparse matrices involved in the operation of a Kalman filter [13].

TABLE VI: Probabilities that a particular method is better than the unprocessed signal, at different SNRs.

         SNR (dB)         -8     -6     -4     -2     0
Digit    P(nr85 > unp)     1      1      1      1     1
         P(nr100 > unp)    1      1      1     0.91  0.99
Color    P(nr85 > unp)     1     0.99   0.99   0.99  0.99
         P(nr100 > unp)   0.98   0.91   0.89   0.24  0.27
Letter   P(nr85 > unp)     1      1      1     0.96  0.99
         P(nr100 > unp)    1     0.44   0.99   0.22  0.19
Mean     P(nr85 > unp)     1      1      1      1     1
         P(nr100 > unp)    1     0.99    1     0.50  0.87

VI. CONCLUSION

In this paper, we have proposed a model-based method for performing binaural/bilateral speech enhancement in HAs. The proposed enhancement framework takes the speech production dynamics into account by using an FLKS for the enhancement process. The filter parameters required for the functioning of the FLKS are estimated jointly using the information at the left and right microphones. The filter parameters considered here are the speech and noise STP parameters and the speech pitch parameters. The estimation of these parameters is not trivial due to the highly non-stationary nature of the speech and the noise in a cocktail party scenario. In this work, we have proposed a binaural codebook-based method, trained on spectral models of speech and noise, for estimating the speech and noise STP parameters, and a pitch estimator based on the harmonic model for estimating the pitch parameters. We then evaluated the proposed enhancement framework in two experimental set-ups representative of the cocktail party scenario. The objective measures STOI and PESQ were used for evaluating the proposed enhancement framework, and the proposed method showed considerable improvement in both scores in comparison to a number of reference methods. Subjective listening tests with access to only a single channel noisy observation also showed improvements in terms of intelligibility and quality. In the case of the intelligibility tests, a mean improvement of about 15% was observed at -8 dB SNR.

APPENDIX A
PREDICTION AND CORRECTION STAGES OF THE FLKS

This section gives the prediction and correction stages involved in the FLKS for the V-UV model. The same equations apply for the UV model, except that the state vector and the state transition matrices will be different. The prediction stage of the FLKS, which computes the a priori estimates of the state vector and the error covariance matrix, is given by

\hat{\bar{x}}^{V\text{-}UV}_{l/r}(n|n-1) = F^{V\text{-}UV}(f_n)\, \hat{\bar{x}}^{V\text{-}UV}_{l/r}(n-1|n-1),

M(n|n-1) = F^{V\text{-}UV}(f_n)\, M(n-1|n-1)\, F^{V\text{-}UV}(f_n)^T + \Gamma_5 \begin{bmatrix} \sigma_d^2(f_n) & 0 \\ 0 & \sigma_v^2(f_n) \end{bmatrix} \Gamma_5^T.

The Kalman gain is computed as

K(n) = M(n|n-1)\, \Gamma_{V\text{-}UV} \left[ \Gamma_{V\text{-}UV}^T M(n|n-1)\, \Gamma_{V\text{-}UV} \right]^{-1}. \quad (43)

The correction stage of the FLKS, which computes the a posteriori estimates of the state vector and the error covariance matrix, is given by

\hat{\bar{x}}^{V\text{-}UV}_{l/r}(n|n) = \hat{\bar{x}}^{V\text{-}UV}_{l/r}(n|n-1) + K(n)\left[ z_{l/r}(n) - \Gamma_{V\text{-}UV}^T\, \hat{\bar{x}}^{V\text{-}UV}_{l/r}(n|n-1) \right],

M(n|n) = \left( I - K(n)\, \Gamma_{V\text{-}UV}^T \right) M(n|n-1).

Finally, the enhanced signal at time index n - (d_s + 1) is obtained by taking the (d_s + 1)th entry of the a posteriori estimate of the state vector as

\hat{s}_{l/r}(n - (d_s + 1)) = \left[ \hat{\bar{x}}^{V\text{-}UV}_{l/r}(n|n) \right]_{d_s + 1}. \quad (44)
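To make the recursions concrete, below is a minimal numerical sketch of one FLKS time step. It assumes the matrices F, Gamma (the observation vector) and Gamma5, and the excitation variances, have already been constructed from the estimated STP and pitch parameters; their construction for the V-UV model follows the main text and is not reproduced here, and the matrices in the usage example are placeholders. Note that the observation model is noise-free (the observation noise is part of the augmented state), which is why the gain in (43) contains no separate measurement noise term.

```python
import numpy as np

def flks_step(x_post, M_post, z, F, Gamma, Gamma5, Q, ds):
    """One prediction + correction step of the fixed-lag Kalman smoother.

    x_post, M_post : a posteriori state estimate / error covariance at n-1
    z              : scalar noisy observation z(n)
    F              : state transition matrix built from the STP/pitch params
    Gamma          : observation vector selecting the observed state entry
    Gamma5         : maps the speech/noise excitations into the state
    Q              : 2x2 diagonal of the speech/noise excitation variances
    ds             : smoothing lag; returns the estimate of s(n - (ds + 1))
    """
    # Prediction (a priori estimates)
    x_pri = F @ x_post
    M_pri = F @ M_post @ F.T + Gamma5 @ Q @ Gamma5.T
    # Kalman gain, cf. (43); the observation z(n) = Gamma^T x(n) is noise-free
    K = M_pri @ Gamma / (Gamma @ M_pri @ Gamma)
    # Correction (a posteriori estimates)
    x_new = x_pri + K * (z - Gamma @ x_pri)
    M_new = (np.eye(len(x_pri)) - np.outer(K, Gamma)) @ M_pri
    # Delayed clean-speech estimate, cf. (44): the (ds+1)-th entry (1-based)
    s_hat = x_new[ds]
    return x_new, M_new, s_hat

# Toy usage with placeholder matrices (not the actual V-UV construction):
n, ds = 6, 2
F = 0.5 * np.eye(n, k=-1) + 0.1 * np.eye(n)
Gamma = np.zeros(n); Gamma[0] = 1.0
Gamma5 = np.zeros((n, 2)); Gamma5[0, 0] = 1.0; Gamma5[3, 1] = 1.0
Q = np.diag([1e-3, 1e-3])
x, M = np.zeros(n), np.eye(n)
x, M, s_hat = flks_step(x, M, z=0.3, F=F, Gamma=Gamma,
                        Gamma5=Gamma5, Q=Q, ds=ds)
```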
APPENDIX B
BEHAVIOUR OF THE LIKELIHOOD FUNCTION

For a given set of speech and noise AR coefficients, we show the behaviour of the likelihood p(z_l, z_r | θ) as a function of the speech and noise excitation variances. For the experiments, we have set both excitation variances to 10^{-3}. Fig. 11 plots the likelihood as a function of the speech and noise excitation variances. It can be seen from the figure that the likelihood is maximal at the true values and decays rapidly as the variances deviate from their true values. This behaviour motivates the approximation in Section III-C.

Fig. 11: Likelihood shown as a function of the speech and noise excitation variances (both axes span 0.5-3 × 10^{-3}).

APPENDIX C
A PRIORI INFORMATION ON THE DISTRIBUTION OF THE EXCITATION VARIANCES

It can be seen from (20) that the prior distributions of the excitation variances are used in the estimation of the STP parameters. In the case of no a priori knowledge regarding the excitation variances, a uniform distribution can be used, as done in [14], but a priori knowledge regarding the distribution of the noise excitation variance can be beneficial. Fig. 12 shows the histogram of the noise excitation variance for a minute of babble noise [43]. It can be observed from the figure that the histogram approximately follows a Gamma distribution. Thus, we here use a Gamma distribution, with shape parameter κ and scale parameter ζ, to model the a priori information about the noise excitation variance:

p(\sigma_v^2) = \frac{1}{\Gamma(\kappa)\, \zeta^{\kappa}} \left(\sigma_v^2\right)^{\kappa - 1} e^{-\sigma_v^2 / \zeta}, \quad (45)

where Γ(·) is the Gamma function. The parameters ζ and κ can be learned from the training data, as sketched below.

Fig. 12: Histogram of the noise excitation variance, fitted with a two-parameter Gamma distribution (red curve).
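The fitting of κ and ζ can be done in closed form by moment matching (or by maximum likelihood, e.g. scipy.stats.gamma.fit with floc=0). Below is a minimal sketch, assuming a sequence of per-frame noise excitation variance estimates is already available; the synthetic data here stands in for the values extracted from the babble training noise and is not the paper's actual data.

```python
import numpy as np

def fit_gamma_moments(x):
    """Method-of-moments fit of the Gamma prior in (45):
    E[x] = kappa * zeta and Var[x] = kappa * zeta**2 give
    kappa = mean^2 / var and zeta = var / mean."""
    m, v = np.mean(x), np.var(x)
    return m * m / v, v / m   # (kappa, zeta)

# Synthetic stand-in for per-frame noise excitation variance estimates.
rng = np.random.default_rng(1)
sigma2_v = rng.gamma(shape=2.5, scale=4e-4, size=3000)

kappa, zeta = fit_gamma_moments(sigma2_v)
print(f"kappa = {kappa:.2f}, zeta = {zeta:.2e}")  # close to 2.5 and 4e-4
```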
ACKNOWLEDGMENT

The authors would like to thank Innovation Fund Denmark (Grant No. 99-2014-1) for the financial support.

REFERENCES

[1] S. Kochkin, "10-year customer satisfaction trends in the US hearing instrument market," Hearing Review, vol. 9, no. 10, pp. 14-25, 2002.
[2] T. V. D. Bogaert, S. Doclo, J. Wouters, and M. Moonen, "Speech enhancement with multichannel Wiener filter techniques in multimicrophone binaural hearing aids," The Journal of the Acoustical Society of America, vol. 125, no. 1, pp. 360-371, 2009.
[3] A. Bronkhorst and R. Plomp, "The effect of head-induced interaural time and level differences on speech intelligibility in noise," The Journal of the Acoustical Society of America, vol. 83, no. 4, pp. 1508-1516, 1988.
[4] S. Doclo, S. Gannot, M. Moonen, and A. Spriet, "Acoustic beamforming for hearing aid applications," Handbook on Array Processing and Sensor Networks, pp. 269-302, 2008.
[5] B. Cornelis, S. Doclo, T. Van dan Bogaert, M. Moonen, and J. Wouters, "Theoretical analysis of binaural multimicrophone noise reduction techniques," IEEE Trans. Audio, Speech, and Language Process., vol. 18, no. 2, pp. 342-355, 2010.
[6] T. J. Klasen, T. V. D. Bogaert, M. Moonen, and J. Wouters, "Binaural noise reduction algorithms for hearing aids that preserve interaural time delay cues," IEEE Trans. Signal Process., vol. 55, no. 4, pp. 1579-1585, 2007.
[7] M. Dorbecker and S. Ernst, "Combination of two-channel spectral subtraction and adaptive Wiener post-filtering for noise reduction and dereverberation," in Proc. European Signal Processing Conference, 1996, pp. 1-4.
[8] J. Li, S. Sakamoto, S. Hongo, M. Akagi, and Y. Suzuki, "Two-stage binaural speech enhancement with Wiener filter for high-quality speech communication," Speech Communication, vol. 53, no. 5, pp. 677-689, 2011.
[9] T. Lotter and P. Vary, "Dual-channel speech enhancement by superdirective beamforming," EURASIP Journal on Advances in Signal Processing, vol. 2006, no. 1, pp. 1-14, 2006.
[10] K. K. Paliwal and A. Basu, "A speech enhancement method based on Kalman filtering," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 1987.
[11] J. D. Gibson, B. Koo, and S. D. Gray, "Filtering of colored noise for speech enhancement and coding," IEEE Trans. Signal Process., vol. 39, no. 8, pp. 1732-1742, 1991.
[12] S. Gannot, D. Burshtein, and E. Weinstein, "Iterative and sequential Kalman filter-based speech enhancement algorithms," IEEE Trans. Acoust., Speech, Signal Process., vol. 6, no. 4, pp. 373-385, 1998.
[13] Z. Goh, K. C. Tan, and B. T. G. Tan, "Kalman-filtering speech enhancement method based on a voiced-unvoiced speech model," IEEE Trans. Acoust., Speech, Signal Process., vol. 7, no. 5, pp. 510-524, 1999.
[14] S. Srinivasan, J. Samuelsson, and W. B. Kleijn, "Codebook-based Bayesian speech enhancement for nonstationary environments," IEEE Trans. Audio, Speech, and Language Process., vol. 15, no. 2, pp. 441-452, 2007.
[15] M. S. Kavalekalam, M. G. Christensen, F. Gran, and J. B. Boldt, "Kalman filter for speech enhancement in cocktail party scenarios using a codebook based approach," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 2016.
[16] M. S. Kavalekalam, M. G. Christensen, and J. B. Boldt, "Binaural speech enhancement using a codebook based approach," in Proc. Int. Workshop on Acoustic Signal Enhancement, 2016.
[17] ——, "Model based binaural enhancement of voiced and unvoiced speech," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 2017.
[18] J. Makhoul, "Linear prediction: A tutorial review," Proceedings of the IEEE, vol. 63, no. 4, pp. 561-580, 1975.
[19] Q. He, F. Bao, and C. Bao, "Multiplicative update of auto-regressive gains for codebook-based speech enhancement," IEEE Trans. Audio, Speech, and Language Process., vol. 25, no. 3, pp. 457-468, 2017.
[20] C. M. Bishop, Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006.
[21] R. M. Gray et al., "Toeplitz and circulant matrices: A review," Foundations and Trends in Communications and Information Theory, vol. 2, no. 3, pp. 155-239, 2006.
[22] F. Itakura, "Analysis synthesis telephony based on the maximum likelihood method," in Proc. 6th International Congress on Acoustics, 1968, pp. 280-292.
[23] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, 2001, pp. 556-562.
[24] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793-830, 2009.
[25] P. Stoica, R. L. Moses et al., Spectral Analysis of Signals. Pearson Prentice Hall, Upper Saddle River, NJ, 2005, vol. 452.
[26] A. H. Kamkar-Parsi and M. Bouchard, "Improved noise power spectrum density estimation for binaural hearing aids operating in a diffuse noise field environment," IEEE Trans. Audio, Speech, and Language Process., vol. 17, no. 4, pp. 521-533, 2009.
[27] M. Jeub, C. Nelke, H. Kruger, C. Beaugeant, and P. Vary, "Robust dual-channel noise power spectral density estimation," in Proc. 19th European Signal Processing Conference, 2011, pp. 2304-2308.
[28] P. C. Brown and R. O. Duda, "A structural model for binaural sound synthesis," IEEE Trans. Acoust., Speech, Signal Process., vol. 6, no. 5, pp. 476-488, 1998.
[29] M. G. Christensen and A. Jakobsson, "Multi-pitch estimation," Synthesis Lectures on Speech & Audio Processing, vol. 5, no. 1, pp. 1-160, 2009.
[30] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421-2424, 2006.
[31] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Communications, vol. 28, no. 1, pp. 84-95, 1980.
[32] A. Gray and J. Markel, "Distance measures for speech processing," IEEE Trans. Acoust., Speech, Signal Process., vol. 24, no. 5, pp. 380-391, 1976.
[33] ETSI 202 396-1, "Speech and multimedia transmission quality; Part 1: Background noise simulation technique and background noise database," 2009.
[34] H. Kayser, S. D. Ewert, J. Anemüller, T. Rohdenburg, V. Hohmann, and B. Kollmeier, "Database of multichannel in-ear and behind-the-ear head-related and binaural room impulse responses," EURASIP Journal on Advances in Signal Processing, vol. 2009, no. 1, pp. 1-10, 2009.
[35] A. Wabnitz, N. Epain, C. Jin, and A. Van Schaik, "Room acoustics simulation for multichannel microphone arrays," in Proceedings of the International Symposium on Room Acoustics, 2010, pp. 1-6.
[36] J. S. Erkelens, R. C. Hendriks, R. Heusdens, and J. Jensen, "Minimum mean-square error estimation of discrete Fourier coefficients with generalized gamma priors," IEEE Trans. Audio, Speech, and Language Process., vol. 15, no. 6, pp. 1741-1752, 2007.
[37] P. C. Loizou, "Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum," IEEE Trans. Acoust., Speech, Signal Process., vol. 13, no. 5, pp. 857-869, 2005.
[38] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Trans. Audio, Speech, and Language Process., vol. 19, no. 7, pp. 2125-2136, 2011.
[39] ITU-T Recommendation P.862, "Perceptual evaluation of speech quality, an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs," 2001.
[40] T. H. Falk, V. Parsa, J. F. Santos, K. Arehart, O. Hazrati, R. Huber, J. M. Kates, and S. Scollie, "Objective quality and intelligibility prediction for users of assistive listening devices: Advantages and limitations of existing tools," IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 114-124, 2015.
[41] A. W. Bronkhorst and R. Plomp, "A clinical test for the assessment of binaural speech perception in noise," Audiology, vol. 29, no. 5, pp. 275-285, 1990.
[42] ITU-R Recommendation BS.1534-1, "Method for the subjective assessment of intermediate quality level of coding systems," International Telecommunication Union, 2003.
[43] H.-G. Hirsch and D. Pearce, "The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in ASR2000 - Automatic Speech Recognition: Challenges for the New Millenium, ISCA Tutorial and Research Workshop (ITRW), 2000.
[44] M. C. Anzalone, L. Calandruccio, K. A. Doherty, and L. H. Carney, "Determination of the potential benefit of time-frequency gain manipulation," Ear and Hearing, vol. 27, no. 5, p. 480, 2006.
[45] P. C. Loizou and G. Kim, "Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions," IEEE Trans. Audio, Speech, and Language Process., vol. 19, no. 1, pp. 47-56, 2011.
[46] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems," IEEE Trans. Audio, Speech, and Language Process., vol. 25, no. 1, pp. 153-167, 2017.
[47] J. K. Nielsen, T. L. Jensen, J. R. Jensen, M. G. Christensen, and S. H. Jensen, "Fast fundamental frequency estimation: Making a statistically efficient estimator computationally efficient," Signal Processing, vol. 135, pp. 188-197, 2017.

Mathew Shaji Kavalekalam was born in Thrissur, India, in 1989. He received his B.Tech in electronics and communications engineering from Amrita University and his M.Sc in communications engineering from RWTH Aachen University in 2011 and 2014, respectively. He is currently a PhD student at the Audio Analysis Lab, Department of Architecture, Design and Media Technology, Aalborg University. His research interests include speech enhancement for hearing aid applications.

Jesper Kjær Nielsen (S'12-M'13) received the M.Sc (Cum Laude) and Ph.D. degrees in electrical engineering with a specialisation in signal processing from Aalborg University, Denmark, in 2009 and 2012, respectively. From 2012 to 2016, he was with the Department of Electronic Systems, Aalborg University, as an industrial postdoctoral researcher (2012-2015) and as a non-tenured associate professor (2015-2016). Bang & Olufsen A/S (B&O) was the industrial partner in these four years. Jesper is currently with the Audio Analysis Lab, Aalborg University, in a three-year position as an assistant professor in Statistical Signal Processing.
He is part-time employed by B&O and part-time employed on a research project with the Danish hearing aid company GN ReSound. Jesper has been a Visiting Scholar in the Signal Processing and Communications Laboratory, University of Cambridge, in 2009 and at the Department of Computer Science, University of Illinois at Urbana-Champaign in 2011. Moreover, he has been a guest researcher in the Signal & Information Processing Lab at TU Delft in 2014. His research interests include spectral estimation, (sinusoidal) parameter estimation, microphone array processing, as well as statistical and Bayesian methods for signal processing.

Jesper Bünsow Boldt received the M.Sc. degree in Electrical Engineering in 2003 and the Ph.D. degree in Signal Processing in 2010, both from Aalborg University (AAU) in Denmark. After his Masters studies he joined Oticon as Hearing Aid Algorithm Developer and from 2007 as Industrial Ph.D. Researcher jointly with Aalborg University and the Technical University of Denmark (DTU). He has been a visiting researcher at both Columbia University and Eriksholm Research Centre. In 2013 he joined GN ReSound as Senior Research Scientist, and in 2015 he became Research Team Manager in GN Advanced Science. His main interest is the cocktail party problem and the research that has the potential to solve this problem for hearing impaired individuals. This includes speech, audio, and acoustic signal processing, but also auditory signal processing, psychoacoustics, and perception.

Mads Græsbøll Christensen (S'00-M'05-SM'11) received the M.Sc. and Ph.D. degrees in 2002 and 2005, respectively, from Aalborg University (AAU) in Denmark, where he is currently employed at the Dept. of Architecture, Design & Media Technology as Professor in Audio Processing and is head and founder of the Audio Analysis Lab. He was formerly with the Dept. of Electronic Systems at AAU and has held visiting positions at Philips Research Labs, ENST, UCSB, and Columbia University. He has published 3 books and more than 200 papers in peer-reviewed conference proceedings and journals, and he has given multiple tutorials at EUSIPCO, SMC, and INTERSPEECH and a keynote talk at IWAENC. His research interests lie in audio and acoustic signal processing, where he has worked on topics such as microphone arrays, noise reduction, signal modeling, speech analysis, audio classification, and audio coding. Dr. Christensen has received several awards, including best paper awards, the Spar Nord Foundation's Research Prize, a Danish Independent Research Council Young Researcher's Award, the Statoil Prize, the EURASIP Early Career Award, and an IEEE SPS best paper award. He is a beneficiary of major grants from the Independent Research Fund Denmark, the Villum Foundation, and Innovation Fund Denmark. He is a former Associate Editor for IEEE/ACM Trans. on Audio, Speech, and Language Processing and IEEE Signal Processing Letters, a member of the IEEE Audio and Acoustic Signal Processing Technical Committee, and a founding member of the EURASIP Special Area Team in Acoustic, Sound and Music Signal Processing. He is a Senior Member of the IEEE, Member of EURASIP, and Member of the Danish Academy of Technical Sciences.
