Harmonic Detection from Noisy Speech with Auditory Frame Gain for Intelligibility Enhancement
Authors: A. Queiroz, R. Coelho
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING

A. Queiroz, Student Member, IEEE, and R. Coelho, Senior Member, IEEE

Abstract—This paper introduces a novel method, HDAG (Harmonic Detection for Auditory Gain), for speech intelligibility enhancement in noisy scenarios. In the proposed scheme, a series of selective Gammachirp filters is adopted to emphasize the harmonic components of speech, reducing the masking effects of acoustic noises. The fundamental frequency is estimated by the HHT-Amp technique. Harmonic patterns estimated with low accuracy are detected and adjusted according to the FSFFE low/high pitch separation. The center frequencies of the filterbank are defined considering the third-octave subbands that are best suited to cover the regions most relevant to intelligibility. Before signal reconstruction, the gammachirp-filtered components are amplified by gain factors regulated by the FSFFE classification. The proposed HDAG solution and three baseline techniques are examined considering six background noises with four signal-to-noise ratios. Three objective measures are adopted for the evaluation of speech intelligibility and quality. Several experiments are conducted to demonstrate that the proposed scheme achieves better speech intelligibility improvement than the competing approaches. A perceptual listening test is further considered and corroborates the objective results.

Index Terms—Gammachirp filtering, low/high frequency separation, harmonic detection, noisy speech.

I. INTRODUCTION

Acoustic noise is a strong masking effect that impairs speech intelligibility [1][2].
This interference underlies several research studies such as speech enhancement [3][4][5], source localization [6][7], robot audition [8], and speech and speaker recognition [9][10]. Thus, its mitigation is a relevant element of interest for intelligibility and quality enhancement. Several signal processing methods are described in the literature to attenuate noise interference for speech quality improvement [11]. However, this achievement does not necessarily lead to speech intelligibility improvement [12]. On the other hand, acoustic masks [13][14][15] are defined to emulate the cocktail party effect. These solutions provide intelligibility enhancement for the target speech signal.

In recent years, the analysis of harmonic components of noisy speech [16][17] has encouraged the proposal of new strategies for intelligibility gain [18][19]. For these, harmonic components such as the fundamental frequency (F0) and formants [20] play an interesting role for intelligibility in noisy conditions [16][21][22]. Time-domain adaptive solutions are designed to deal with the harmonics of the speech signal to reduce the noise effects. In [23], the formant center frequencies from voiced segments of speech are shifted away from the region of noise.

This work was supported in part by the National Council for Scientific and Technological Development (CNPq) under Grant 305488/2022-8 and the Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ) under Grant 200518/2023, and in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) under Grant Code 001. The authors are with the Laboratory of Acoustic Signal Processing, Military Institute of Engineering (IME), Rio de Janeiro, RJ 22290-270, Brazil (e-mail: coelho@ime.eb.br).
This formant shifting procedure [24] simulates the human strategy of producing a more audible signal in a noisy environment, i.e., the Lombard effect [25]. Results showed that the Smoothed Shifting of Formants for Voiced segments (SSFV) is able to improve the intelligibility of speech signals in a car noise environment. A different approach was proposed in [26], where the HHT-Amp [27] F0 estimation technique was applied to the harmonic components of noisy speech. The F0-based Gammatone Filtering (GTF_F0) method considered integer multiples of the estimated F0 as center frequencies of a time-domain auditory filterbank. Finally, the outputs are amplified to emphasize the harmonics of the speech signal, leading to intelligibility gain.

The use of the Gammatone filterbank in the GTF_F0 method may be limited by high-level masking effects [28]. To overcome this issue, the Gammachirp proposed in [29] produces a filter with an asymmetric amplitude spectrum. This auditory filter provides an interesting fit to various sets of noise masking data. The center frequencies of its filterbank must be well defined, considering the ones most relevant for intelligibility. In this context, the Extended Short-Time Objective Intelligibility (ESTOI) [30] measure performs an evaluation of noisy speech in third-octave subbands. ESTOI also considers the temporal modulation frequencies relevant to speech intelligibility, whose values range from 1 to 12.5 Hz [31][32][33]. These subbands and this modulation frequency range are able to assist in regulating the bandwidth of filterbanks to cover the harmonic components of speech most relevant for intelligibility.

This paper introduces the HDAG (Harmonic Detection with Auditory Gain) method to attain intelligibility enhancement for the harmonic components of noisy speech signals. The proposed solution is performed in four steps.
Initially, the HHT-Amp method [27] is applied to estimate the F0 of speech frames. In the second step, these frames are separated into low-pitch or high-pitch ones with the FSFFE [34] technique. The separation leads to the detection and adjustment of F0 values according to some typical errors [35] that may occur in the estimation, improving its accuracy. In sequence, the third stage consists in filtering the harmonic components of noisy speech with the Gammachirp. The central frequencies and bandwidths of the filterbank are selectively defined to cover the most relevant regions for speech intelligibility, as stated in [30]. Finally, the filtered components are amplified by a gain factor to highlight the harmonic components of speech. This amplification mitigates the masking effects of the background noise, leading to intelligibility enhancement.

Fig. 1. Block diagram of the proposed HDAG method to improve the intelligibility of noisy speech signals.

Several experiments are conducted to examine the effectiveness of the HDAG method. For this purpose, speech utterances collected from the TIMIT [36] database are corrupted by six real acoustic noises, considering four SNR values: -10 dB, -5 dB, 0 dB and 5 dB. The proposed method and three baseline approaches are examined in terms of intelligibility enhancement. To this end, ESTOI [30] and the Short-Time Approximated Speech Intelligibility Index (ASII_ST) [37] are considered in the evaluation. Moreover, results for the Perceptual Evaluation of Speech Quality (PESQ) [38] demonstrate that HDAG also achieves quality improvement. Objective results indicate that the proposal outperforms the competitive approaches in terms of speech intelligibility, and also in quality scores. These results are corroborated by a subjective listening evaluation test.
The main contributions of this work are:
• Introduction of the HDAG method to improve the intelligibility and quality of acoustic noisy speech.
• Definition of the filterbank configuration using the third-octave bands and specific modulation frequencies, with higher resolution in the regions most relevant to intelligibility.
• Adoption of the asymmetry coefficient from the Gammachirp to adjust the filterbank to the noise-masked components of speech.
• Interesting intelligibility and quality improvements attained with adaptive gain factors defined according to the FSFFE separation.

The remainder of this paper is organized as follows. Section II describes the steps of the proposed HDAG method for intelligibility enhancement. An explanation of the competitive approaches SSFV, PACO (pitch-adaptive complex-valued Kalman filter) [39] and GTF_F0 is included in Section III. Section IV presents the evaluation experiments and results. Finally, Section V concludes this work.

II. THE HDAG METHOD

The proposed method includes four main steps: harmonic detection, third-octave bands configuration, gammachirp filtering, and output samples amplification by a gain factor. Finally, the overlap-and-add method is applied to achieve the reconstructed version of the target speech signal. Fig. 1 illustrates the block diagram of the HDAG method.

A. F0 Estimation

The fundamental frequency (F0) is estimated from the noisy speech signal with the HHT-Amp method [27]. This F0 estimator ensures [27][34] interesting accuracy results from noisy speech signals. HHT-Amp has been evaluated in a wide range of noisy scenarios, outperforming four competing estimators in terms of accuracy.
It applies the time-frequency EEMD (Ensemble Empirical Mode Decomposition) [40][41] to decompose a voiced sample sequence x_q(t) such that

  x_q(t) = \sum_{k=1}^{K} IMF_{k,q}(t) + r_q(t)                                  (1)

where IMF_{k,q}(t) is the k-th mode of x_q(t) and r_q(t) is the last residual. Then, instantaneous amplitude functions are computed by

  a_{k,q}(t) = |Z_{k,q}(t)|,   k = 1, ..., K,                                    (2)

from the analytic signals defined as

  Z_{k,q}(t) = IMF_{k,q}(t) + j H{IMF_{k,q}(t)},                                 (3)

where H{IMF_{k,q}(t)} refers to the Hilbert transform of IMF_{k,q}(t). The autocorrelation function (ACF) is calculated as

  r_{k,q}(\tau) = \sum_{t} a_{k,q}(t) a_{k,q}(t + \tau).                         (4)

For each decomposition mode k, let τ_0 be the lowest τ value that corresponds to an ACF peak, subject to τ_min ≤ τ_0 ≤ τ_max. The frequency restriction is applied according to the range [F_min, F_max] of possible F0 values. The k-th pitch period candidate is defined as τ_0/f_s, where f_s refers to the sampling rate. Finally, a decision criterion [27] is applied to select the best pitch candidate T̂_0, and the estimated F0 is given by f_est = 1/T̂_0.

B. Harmonic Detection and Adjustment

Severe noise masking effects may impact the harmonic components of voiced speech, leading to low-accuracy F0 estimates. In order to detect and adjust the erroneous F0 values, the FSFFE (Frequency Separation for Fundamental Frequency Estimation) [34] technique is applied to the harmonic frames. This strategy separates the noisy speech frames into low-pitch or high-pitch ones. Possible errors in the F0 estimates can be detected by comparing their values with this separation. Fig. 2 illustrates the block diagram of the FSFFE method.

Fig. 2. Block diagram of the FSFFE technique for the low/high pitch classification of speech frames.
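The ACF-based candidate selection of Section II-A (Eqs. (2)-(4)) can be illustrated with a minimal sketch. This is our own simplification, not the full HHT-Amp pipeline of [27]: the ACF peak search is applied directly to a synthetic voiced frame rather than to the EEMD amplitude envelopes a_{k,q}(t), and the helper name `f0_from_acf` is hypothetical:

```python
import numpy as np

def f0_from_acf(frame, fs, f_min=50.0, f_max=400.0):
    """Pick the pitch-period candidate tau0 as the highest ACF peak
    whose lag lies inside [fs/f_max, fs/f_min], as in Eq. (4)."""
    frame = frame - frame.mean()
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    tau_min, tau_max = int(fs / f_max), int(fs / f_min)
    tau0 = tau_min + int(np.argmax(acf[tau_min:tau_max + 1]))
    return fs / tau0                      # f_est = 1 / T0

fs = 16000
t = np.arange(int(0.032 * fs)) / fs       # one 32 ms voiced frame
frame = np.sin(2 * np.pi * 120.0 * t)     # synthetic 120 Hz harmonic frame
f_est = f0_from_acf(frame, fs)            # close to 120 Hz
```

The lag restriction [τ_min, τ_max] is the frequency restriction [F_min, F_max] of the text expressed in samples; the decision criterion of [27] that ranks candidates across the K modes is omitted here.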
After the EEMD decomposition as in (1), pitch estimation is performed in the voiced frames of each IMF using the PEFAC [42] algorithm. Let F̂0_{k,q} denote the pitch value estimated from frame q of IMF_k(t); the vector F̂0_q is composed as

  \hat{F0}_q = [\hat{F0}_{1,q}, \hat{F0}_{2,q}, ..., \hat{F0}_{K,q}]^T,          (5)

to express the tendency that the frame is placed in a low/high pitch region. Only the first four IMFs (K = 4) are considered in order to avoid the acoustic noise masking effect. The energy of these unwanted components is mostly concentrated at low frequencies (k > 6) [43][5][44]. A normalized distance is computed between IMFs for the successive frames to detect and overcome the differences in the estimated F0. Let k and k′ denote IMF indexes; the distance is described as

  \delta_q^{\hat{F0}}(k, k') = |\hat{F0}_{k,q} - \hat{F0}_{k',q}| / (\hat{F0}_{k,q} + \hat{F0}_{k',q}).   (6)

The δ_q^{F̂0}(k, k′) values are computed for the different indexes k and k′, resulting in a 4×4 distance matrix δ_q^{F̂0}. The row components of the matrix are summed to obtain the variation property for the k-th IMF. The frequency region is defined by the mean value of the PEFAC F0 estimates (F̄0_q) of the two IMFs with the smallest variation scores. Finally, the low/high pitch separation is performed according to the threshold γ as

  \bar{F0}_q ≤ γ : low-frequency frame;   \bar{F0}_q > γ : high-frequency frame.  (7)

The threshold γ is fixed at 200 Hz, which is related to the average values between male (50-200 Hz) and female (120-350 Hz) speakers [45].

Fig. 3. Ground truth and F0 estimated with the HHT-Amp technique for: (a) clean speech segment, (b) noisy signal with babble noise at SNR = -5 dB and (c) the same noisy segment with estimates improved by FSFFE.
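The separation of Eqs. (5)-(7) can be sketched as follows. This is a minimal illustration with hypothetical helper names; the per-IMF PEFAC estimates are passed in as plain numbers rather than computed:

```python
import numpy as np

GAMMA = 200.0  # low/high pitch threshold gamma (Hz), as in the text

def classify_frame(f0_per_imf):
    """Low/high-pitch decision from the K = 4 per-IMF estimates.
    Builds the 4x4 normalized distance matrix of Eq. (6), sums each row
    to get the variation score, averages the two most consistent IMFs,
    and thresholds as in Eq. (7)."""
    f = np.asarray(f0_per_imf, dtype=float)
    delta = np.abs(f[:, None] - f[None, :]) / (f[:, None] + f[None, :])
    scores = delta.sum(axis=1)                 # row-wise variation score
    k1, k2 = np.argsort(scores)[:2]            # two smallest-variation IMFs
    f0_bar = 0.5 * (f[k1] + f[k2])             # mean PEFAC estimate
    return f0_bar, ("low" if f0_bar <= GAMMA else "high")

# Third IMF is an outlier; the two consistent IMFs (110, 112 Hz) win.
f0_bar, cls = classify_frame([110.0, 112.0, 230.0, 95.0])  # 111.0, "low"
```

The outlier IMF accumulates a large row sum in the distance matrix and is excluded from the F̄0_q average, which is the intended error-rejection behavior of the separation.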
The F0 adjustment is conducted according to the low/high pitch classification in (7). The F0 estimates are prone to doubling errors in low-pitch frames. Hence, a low-pitch frame that presents an F0 value (f_est,q) ranging in [200-400] Hz is adjusted to f_adj,q = 0.5 f_est,q. On the other hand, a high-pitch frame is adjusted according to possible halving and quartering [35] errors as follows:

  f_adj,q = 4 f_est,q,   50 ≤ f_est,q ≤ 100,
            2 f_est,q,   100 < f_est,q ≤ 200.                                    (8)

Fig. 3 illustrates the F0 adjustment procedure in frames of a 1200 ms speech signal. Fig. 3(a) refers to the F0 attained with the HHT-Amp method for the clean speech. The estimated values match the ground truth in the high-pitch region. Fig. 3(b) presents the F0 estimates related to the noisy version of the same speech segment for babble noise [46] with SNR = -5 dB. Note that the accuracy decreases significantly and halving errors appear in harmonic components, e.g., around 100 ms or 600 ms. These regions are adjusted with FSFFE, as can be seen in Fig. 3(c). Observe that the proposed adjustment leads to accuracy improvement even in this severe noisy condition. The correction in harmonic detection is especially important in this case, particularly because important components for speech intelligibility are placed at higher frequencies.

C. Third-octave Bands Configuration

Third-octave filterbanks have been shown to loosely approximate the measured bands of the auditory filters [47]. Objective speech metrics consider the analysis of clean and noisy speech in third-octave subspaces. This is the case of the ESTOI [30] intelligibility measure, which gives a prediction through the correlation of third-octave spectrograms from the reference and processed signals.
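The adjustment rules of Eq. (8), together with the low-pitch doubling correction described just before it, reduce to a small lookup. A sketch (the function name is ours):

```python
def adjust_f0(f_est, pitch_class):
    """Correct typical doubling/halving/quartering errors (Eq. (8)).
    `pitch_class` is the FSFFE decision of Eq. (7): "low" or "high"."""
    if pitch_class == "low" and 200.0 <= f_est <= 400.0:
        return 0.5 * f_est           # octave-doubling error in a low-pitch frame
    if pitch_class == "high":
        if 50.0 <= f_est <= 100.0:
            return 4.0 * f_est       # quartering error
        if 100.0 < f_est <= 200.0:
            return 2.0 * f_est       # halving error
    return f_est                     # estimate consistent with its class

adjust_f0(240.0, "low")    # -> 120.0 (doubling corrected)
adjust_f0(80.0, "high")    # -> 320.0 (quartering corrected)
```

Note that the same numeric estimate, e.g. 150 Hz, is left untouched in a low-pitch frame but doubled in a high-pitch frame: the FSFFE class, not the raw value, decides which error model applies.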
TABLE I
ESTOI [×10^-2] SCORES FOR DIFFERENT ASYMMETRY COEFFICIENTS.

                          Gammachirp coefficient c
Noise        SNR      2.0   1.5   1.0   0.5   0.0  -0.5  -1.0  -1.5  -2.0
Babble    -10 dB     18.5  19.3  18.5  18.9  19.0  18.4  19.4  18.4  19.1
           -5 dB     28.8  29.7  28.9  29.2  29.5  28.8  29.8  28.8  29.6
            0 dB     41.1  42.0  41.3  41.6  41.8  41.1  42.2  41.1  42.0
            5 dB     54.5  55.4  54.7  54.9  55.2  54.5  55.6  54.4  55.5
         Average     35.7  36.6  35.9  36.1  36.4  35.7  36.8  35.7  36.5
SSN       -10 dB     20.5  21.3  20.6  20.9  21.0  20.4  21.5  20.4  21.2
           -5 dB     30.2  31.1  30.4  30.7  30.8  30.3  31.3  30.1  31.1
            0 dB     41.6  42.4  41.8  42.0  42.2  41.6  42.6  41.5  42.5
            5 dB     54.2  55.0  54.4  54.6  54.8  54.2  55.3  54.1  55.2
         Average     36.6  37.4  36.8  37.0  37.2  36.6  37.7  36.5  37.5

This work proposes the definition of an auditory filterbank based on the third-octave bands. The accurate harmonic detection f_adj,q is adopted as the center frequency of the first band of the filterbank (k = 0). The center frequencies for the following k bands are attained adaptively by

  f_c(k, q) = 2^{k/3} f_adj,q.                                                   (9)

The resulting set of filters in each frame q provides better resolution in the frequencies near the harmonics of speech, which are the most important for intelligibility [30].

D. Gammachirp Filtering

In this step, a set of L Gammachirp filters [29] {h_k(t), k = 1, ..., L} is applied to successively filter the input sample sequence x_q(t). Each filter h_k(t) is implemented on the noisy signal considering frames of 32 ms, order n = 4, and center frequencies given by (9).
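The adaptive center-frequency rule of Eq. (9) is a one-liner; a sketch (the function name is ours):

```python
def center_frequencies(f_adj, L=10):
    """Third-octave-spaced center frequencies of Eq. (9): f_c(k) = 2**(k/3) * f_adj.
    The first band (k = 0) sits exactly on the adjusted harmonic detection."""
    return [2.0 ** (k / 3.0) * f_adj for k in range(L)]

cf = center_frequencies(120.0, L=4)   # 120.0, ~151.2, ~190.5, 240.0 Hz
```

Every third band doubles the frequency (2^{3/3} = 2), so the bank climbs one octave per three filters, mirroring the third-octave analysis grid used by ESTOI.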
In order to align the impulse response functions, phase compensation is applied to all filters, which corresponds to the non-causal filters

  h_k(t) = a (t + t_c)^{n-1} cos(2π f_c t + c ln t) e^{-2π b (t + t_c)},   t ≥ -t_c,   (10)

where c is the gammachirp coefficient of the filter and t_c = (n-1)/(2π b), which ensures that the peaks of all filters occur at t = 0. The bandwidth b is defined here according to the modulation transfer function frequencies considered in [31][32][33]. The results presented in [33] demonstrated that the frequency range relevant for the intelligibility of male speech sentences is [1-12.5] Hz. Nevertheless, female sentences presented a larger range, with noticeable relevance for frequencies ≤ 20 Hz. Therefore, this work proposes a harmonic-adaptive bandwidth, given by b = 0.15 f_adj,q. Let x_q^0(t) = x_q(t); the filtered signals y_q^k(t), k = 1, ..., L, are recursively computed by

  y_q^k(t) = x_q^{k-1}(t) * h_k(t),
  x_q^k(t) = x_q^{k-1}(t) - y_q^k(t),   k = 1, ..., L.                           (11)

The residual signal is defined as r_q(t) = x_q^L(t) to guarantee the completeness of the input sequence, i.e.,

  x_q(t) = \sum_{k=1}^{L} y_q^k(t) + r_q(t).                                     (12)

Fig. 4. ESTOI curves of (a) low-pitch and (b) high-pitch frames, averaged over the SNR values -10 dB, -5 dB, 0 dB and 5 dB of babble noise, according to the gain factor G_k of each gammachirp filter.

Table I presents the ESTOI scores for different asymmetry coefficients c of the Gammachirp filter. The intelligibility is predicted for a training subset of 48 speech signals of TIMIT [36] defined in [48].
The ESTOI scores for the different values of c are computed for the Babble [46] and SSN [49] noisy scenarios. Note that the coefficient c = -1 achieves the highest intelligibility rates for all the noisy conditions. This can be justified by the fact that acoustic noises might shift the harmonic detection. Therefore, the asymmetry of the gammachirp has the role of fine-tuning these harmonic components.

E. Frames Reconstruction with a Gain Factor

After the Gammachirp filtering, the amplitudes of the output samples y_q^k(t), k = 1, ..., L, are amplified by a gain factor G_k ≥ 1. The idea is to emphasize the presence of the harmonic features of speech, which will lead to speech intelligibility improvement, without introducing any noticeable distortion to the speech signal. The reconstruction of a voiced frame q ∈ S_v leads to the sample sequence

  \hat{x}_q(t) = [\sum_{k=1}^{L} G_k y_q^k(t)] + r_q(t).                         (13)

The reconstructed voiced frames in S_v and all the remaining frames in S_u are joined together, keeping the original frame indices. Thus, all frames are overlap-added to reconstruct the modified version x̂(t) of the target speech signal. The completeness and continuity of x̂(t) are guaranteed by the adoption of the Hanning window that multiplies all frames before the overlap-and-add method. This means that the reconstructed signal x̂(t) and the original signal x(t) would be exactly the same if each frame were reconstructed considering G_k = 1 for every k ∈ {1, ..., L}.

The set of gains G_k is empirically determined for each filter using the same training subset of 48 speech signals attained from the TIMIT database. Fig. 4 illustrates the ESTOI curves for noisy speech signals with Babble noise, averaged over the four SNR values.

Algorithm 1 Intelligibility Enhancement Scheme HDAG
  for each frame q do
    Input: x_q(t)
    Harmonic Detection:
      f_est,q ← F0 estimation with HHT-Amp as in Section II-A.
      F̂0_q ← PEFAC (5) for the K = 4 decomposed modes of (1).
      δ_q^{F̂0} ← normalized distance matrix using (6).
      Low/high pitch classification (7) according to F̄0_q.
    Gammachirp Filtering:
      for k = 1, ..., L do
        h_k(t) ← impulse response of the non-causal filters (10)
        y_q^k(t) = x_q^{k-1}(t) * h_k(t)
        x_q^k(t) = x_q^{k-1}(t) - y_q^k(t)
      end for
      r_q(t) = x_q^L(t) ← residual components
    x̂_q(t) ← voiced frame reconstruction as in (13) with G_k from (14).
    x̂(t) ← overlap-and-add technique.
  end for
  return x̂(t)

The configuration starts from the first filter (F1), and the gain is incremented until ESTOI reaches its maximum value (highlighted point in Fig. 4). This gain is then fixed, and the process is repeated for the subsequent filters. Observe that two different sets of gains are presented: one for low-pitch frames (Fig. 4(a)) and another for high-pitch frames (Fig. 4(b)). Therefore, the G_k values for L = 10 filters that lead to the highest ESTOI intelligibility scores are defined as

  G_k = {14, 1, 4, 8, 4, 3.5, 3, 2, 2, 1.5},        low-pitch;
        {14, 1, 1, 4.5, 2, 3.5, 2.5, 2, 1.5, 1.5},  high-pitch.                  (14)

The proposed HDAG method is summarized in Algorithm 1. This algorithm is tailored to the harmonic detection scheme considered in this paper. However, Algorithm 1 can also be used with any other F0 estimation technique.

III. HARMONIC-BASED COMPARATIVE METHODS

This section briefly describes the baseline methods SSFV, PACO and GTF_F0. They also consider the harmonic components of noisy speech to attain intelligibility and quality improvement.

A. SSFV

The main idea of this solution consists in transforming the original signal by adopting a Lombard effect strategy [25][50].
In this effect, the central frequencies of the formants are shifted (formant shifting). This moves the energy at these frequencies away from the region of spectral action of the noise. The formant shifting process is described in [23] and optimized to operate in environments with the presence of car noise (composed of radio, message alert and telephone sounds). Initially, LPC (Linear Prediction Coding) is used to estimate the poles and formant frequencies of the voiced speech signal. In the LPC model, a 25 ms frame of the signal s(n, m) can be represented by linear predictions of order p [51], that is,

  s(n, m) = \sum_{j=1}^{p} a_j s(n - j, m) + e(n, m),                            (15)

where a_j are the linear prediction coefficients, e(n, m) indicates the residual error and p = 12. The variables n and m represent the signal sample and time frame indices, respectively. The LP filter A(z) is obtained from the coefficients a_j, so that

  A(z) = 1 + \sum_{j=1}^{p} a_j z^{-j}.                                          (16)

The poles P are obtained from the roots of the LP coefficients, and the formant frequencies F are defined as the estimated pole angles. The formants obtained are shifted according to a function δ(F) [24] determined according to the characteristics of the acoustic noise. The displacement of the formants is carried out according to the criterion

  \hat{F}(f) = F(f) + δ(f),  f_1 < f < f_3;   F(f),  otherwise,                  (17)

where f_1 and f_3 are the first and third formants, respectively. Finally, the resulting set of formants F̂ is obtained from these modifications.

B. PACO

The pitch-adaptive complex-valued Kalman filter (PACO) [39] is also adopted as a competitive technique for comparison with the proposed HDAG method. It applies harmonic signal modeling to estimate the complex-valued speech AR parameters required for the Kalman filter.
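Returning to the SSFV analysis stage, the LPC computation of Eqs. (15)-(16) and the pole-angle formant estimate can be sketched as follows. This is a minimal autocorrelation-method sketch with our own helper names and a synthetic one-resonance test signal; the noise-dependent shifting function δ(F) of Eq. (17) is omitted:

```python
import numpy as np

def lpc_coefficients(x, p=12):
    """Autocorrelation-method LPC: solve the normal equations for the a_j
    of Eq. (15), so that s(n) is approximated by sum_j a_j s(n-j)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])

def formant_frequencies(x, fs, p=12):
    """Formants as the angles of the LP-filter poles (Section III-A)."""
    a = lpc_coefficients(x, p)
    poles = np.roots(np.concatenate(([1.0], -a)))  # prediction-error filter roots
    poles = poles[poles.imag > 1e-9]               # one pole per conjugate pair
    return np.sort(np.angle(poles) * fs / (2 * np.pi))

# Synthetic AR(2) signal with a single resonance near 500 Hz (pole radius 0.97)
fs, r0, f0 = 8000, 0.97, 500.0
a1, a2 = 2 * r0 * np.cos(2 * np.pi * f0 / fs), -r0 * r0
e = np.random.default_rng(1).standard_normal(4000)
x = np.zeros(4000)
for n in range(2, 4000):
    x[n] = a1 * x[n - 1] + a2 * x[n - 2] + e[n]
F = formant_frequencies(x, fs, p=2)   # single "formant" close to 500 Hz
```

With p = 2 the estimator recovers the planted pole pair, so the recovered angle sits near the 500 Hz resonance; real speech uses p = 12 and yields several formant candidates per frame.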
To this end, fundamental frequency estimation is performed for each 32 ms signal frame y(n, l), and the phase progression ψ̂_h(l) is recursively estimated for each harmonic h according to

  ψ_h(l) = ψ_h(l-1) + (π L / f_s)(f_h(l) + f_h(l-1)).                            (18)

Successive speech DFT bins of y(n, l) are computed by incorporating the harmonic phase progression into a state-transition model. The AR coefficients â(l) are defined from the DFT bins [39], which are the input for the Kalman filter gain G_K, to obtain an estimate X̂(k, l) such that

  \hat{X}(k, l) = G_K(k, l)(Y(k, l) - \hat{X}_{prop}(k, l)),                     (19)

where X̂_prop is the state propagation estimate for the k-th bin. Finally, the inverse DFT is applied and the processed speech signal is reconstructed by performing overlap-and-add.

C. GTF_F0

In the GTF_F0 [26] method, a set of L Gammatone filters {h_k(t), k = 1, ..., L} is applied to successively filter the input sample sequence x_q(t). Each filter h_k(t) is implemented¹ in frames of 32 ms, considering order n = 4, center frequencies

  f_c = k F0,                                                                    (20)

and bandwidth b = 0.25 F0. The time-domain impulse response function described in (10) is applied for GTF_F0 without the asymmetry coefficient. Thus, it can be considered a specific case of the Gammachirp filterbank in which c = 0. After the Gammatone filtering, the amplitudes of the output samples y_q^k(t), k = 1, ..., L, are amplified by a gain factor G_k ≥ 1. The integer multiples of F0 are amplified as in [26] with the following linear gains: G_1 = G_2 = 5.0, G_3 = 4.0 and G_4 = 2.5.

TABLE II
INTELLIGIBILITY AND QUALITY RESULTS WITH THE PROPOSED HDAG AND COMPETITIVE METHODS.

                            ESTOI                              PESQ
Noise       SNR     UNP  SSFV  PACO  GTF_F0  HDAG     UNP  SSFV  PACO  GTF_F0  HDAG
Babble    -10 dB   0.18  0.17  0.17   0.24   0.28    0.56  0.99  1.25   1.77   2.06
           -5 dB   0.29  0.29  0.30   0.37   0.40    1.52  1.54  1.94   2.30   2.50
            0 dB   0.41  0.42  0.43   0.50   0.53    1.90  1.92  2.45   2.72   2.86
            5 dB   0.55  0.55  0.58   0.64   0.66    2.35  2.36  2.94   3.12   3.22
         Average   0.36  0.36  0.37   0.44   0.47    1.58  1.70  2.14   2.48   2.66
Cafeteria -10 dB   0.20  0.19  0.19   0.27   0.31    1.00  1.30  1.50   1.97   2.20
           -5 dB   0.31  0.31  0.31   0.40   0.43    1.63  1.67  2.01   2.48   2.63
            0 dB   0.44  0.44  0.45   0.54   0.56    2.07  2.09  2.52   2.90   3.02
            5 dB   0.58  0.58  0.61   0.67   0.69    2.50  2.51  2.97   3.29   3.37
         Average   0.38  0.38  0.39   0.47   0.50    1.80  1.89  2.25   2.66   2.81
Traffic   -10 dB   0.38  0.38  0.44   0.44   0.47    1.59  1.58  2.82   2.34   2.51
           -5 dB   0.51  0.50  0.56   0.58   0.60    2.04  2.04  3.27   2.73   2.88
            0 dB   0.63  0.63  0.68   0.70   0.71    2.55  2.55  3.62   3.12   3.26
            5 dB   0.74  0.74  0.79   0.79   0.80    3.06  3.06  3.86   3.52   3.62
         Average   0.48  0.48  0.52   0.56   0.58    2.31  2.31  3.39   2.93   3.07
Train     -10 dB   0.32  0.30  0.36   0.38   0.42    1.33  1.38  1.92   2.09   2.29
           -5 dB   0.43  0.43  0.47   0.51   0.54    1.82  1.83  2.55   2.61   2.75
            0 dB   0.55  0.54  0.58   0.63   0.65    2.33  2.34  3.03   3.06   3.18
            5 dB   0.65  0.65  0.69   0.73   0.75    2.78  2.79  3.37   3.42   3.54
         Average   0.49  0.48  0.53   0.56   0.59    2.06  2.08  2.72   2.79   2.94
Helicopter -10 dB  0.30  0.30  0.33   0.39   0.43    1.55  1.59  2.21   2.34   2.54
           -5 dB   0.41  0.41  0.45   0.52   0.54    1.89  1.91  2.71   2.74   2.87
            0 dB   0.53  0.53  0.59   0.64   0.66    2.33  2.34  3.17   3.15   3.26
            5 dB   0.66  0.65  0.72   0.75   0.76    2.76  2.76  3.53   3.51   3.60
         Average   0.47  0.47  0.52   0.58   0.60    2.13  2.15  2.91   2.93   3.06
SSN       -10 dB   0.17  0.16  0.20   0.24   0.29    1.22  1.41  1.95   1.88   2.17
           -5 dB   0.28  0.28  0.32   0.37   0.41    1.45  1.47  2.41   2.25   2.41
            0 dB   0.41  0.41  0.45   0.51   0.54    1.84  1.85  2.89   2.68   2.80
            5 dB   0.54  0.54  0.59   0.64   0.66    2.32  2.33  3.29   3.11   3.20
         Average   0.35  0.35  0.39   0.44   0.47    1.70  1.77  2.63   2.48   2.65
Overall            0.44  0.43  0.47   0.52   0.54    1.93  1.98  2.67   2.71   2.86
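Returning to the PACO baseline of Section III-B, the phase recursion of Eq. (18) amounts to a trapezoidal integration of the harmonic frequency track. A small sketch, interpreting L as the frame advance in samples (which Eq. (18) does not spell out), with a helper name of our own:

```python
import numpy as np

def phase_progression(psi_prev, f_prev, f_cur, L, fs):
    """Harmonic phase recursion of Eq. (18): trapezoidal integration of the
    (possibly time-varying) harmonic frequency over a hop of L samples."""
    return psi_prev + np.pi * (L / fs) * (f_cur + f_prev)

# With a constant 100 Hz harmonic and a 256-sample hop at 16 kHz, the phase
# advances by 2*pi*100*(256/16000) radians per frame, i.e. 1.6 cycles.
psi = phase_progression(0.0, 100.0, 100.0, L=256, fs=16000)
```

For a constant frequency the trapezoid collapses to the usual 2πf·(L/f_s) advance; when f_h changes between frames, averaging f_h(l) and f_h(l-1) keeps the accumulated phase consistent with the pitch-adaptive harmonic model.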
IV. RESULTS AND DISCUSSION

This section presents objective results for the intelligibility and quality of acoustic signals processed by the HDAG method in comparison to the SSFV, PACO and GTF_F0 baseline techniques. ESTOI [30] and ASII_ST [37] are considered to evaluate the speech intelligibility improvement, and PESQ [38] compares the quality of the competitive methods. Then, results for a perceptual test are presented in order to corroborate the objective evaluation.

The experimental scenario considers a subset² of the TIMIT [36] database to evaluate the competitive methods. The set considered is composed of 128 speech signals spoken by 8 male and 8 female speakers, sampled at 16 kHz and with 3 s average duration. The F0 reference values and voiced/unvoiced information for the training and test datasets are obtained from [48]. Six noises are used to corrupt the speech utterances: acoustic Babble and Traffic attained from RSG-10 [46], Cafeteria, Train and Helicopter from Freesound.org³, and Speech Shaped Noise (SSN) from the DEMAND [49] database. Experiments are conducted considering noisy signals with four SNR values (-10 dB, -5 dB, 0 dB and 5 dB). In this study, it is assumed that the FSFFE separation into high-pitch and low-pitch speech frames is perfect and generates no errors in the whole system.

¹ Code available at http://staffwww.dcs.shef.ac.uk/people/N.Ma/
² Available at: http://www.ee.ic.ac.uk/hp/staff/dmb/data/TIMITfxv.zip
³ [Online]. Available: https://freesound.org.

Fig. 5.
ΔASII_ST intelligibility enhancement [×10^-2] averaged for speech signals corrupted by the noises: (a) Babble, (b) Cafeteria, (c) Traffic, (d) Train, (e) Helicopter and (f) SSN.

A. Intelligibility and Quality Objective Evaluation

Table II shows the intelligibility and quality objective results with the ESTOI and PESQ measures, respectively. Note that the Babble and SSN noises present the most challenging scenarios among those evaluated in terms of intelligibility. For instance, the ESTOI values averaged over the SNR conditions for the UNP speech signals are 0.36 and 0.35 for these respective noises. Moreover, observe that the HDAG method achieves the best results for all 24 noise conditions, even in the most challenging scenarios with negative SNR values. The scores of HDAG are particularly interesting for the non-stationary noises, i.e., Babble and Cafeteria. For these noise sources, the ESTOI values attained are considerably higher than those of all the competing solutions for all SNR values. The highest ESTOI gain accomplished by HDAG, 13 p.p., can be observed for Helicopter noise with SNR = -10 dB. According to the overall average, the proposed solution outperforms the competitive approaches with an ESTOI of 0.54, against 0.52, 0.47 and 0.43 for GTF_F0, PACO and SSFV, respectively.

The PESQ score is computed here from the 30% most relevant harmonic frames of noisy speech. These frames are selected from those with the lowest signal-to-noise ratio values. Note that HDAG outperforms the competing approaches for most of the noisy speech conditions in terms of quality. The proposed solution achieves the highest PESQ, except for the Traffic and SSN (0 dB and 5 dB) noises. For instance, in Helicopter noise with SNR = -10 dB, the PESQ score attained by HDAG is 1.02 higher than UNP, followed by increments of 0.79, 0.66 and 0.04 presented by GTF_F0, PACO and SSFV, respectively.
In summary, the overall PESQ obtained with HDAG is 2.86, against 2.71 for the competing approach GTF_F0. Therefore, these results indicate that the proposed solution also provides quality improvement.

TABLE III
ASII_ST [×10⁻²] SCORES FOR UNP NOISY SPEECH

SNR       Babble  Cafeteria  Traffic  Train  Helicopter  SSN
-10 dB    23.1    24.3       35.8     34.6   34.1        19.3
-5 dB     26.6    27.9       39.2     39.9   40.2        23.7
0 dB      33.0    34.3       43.7     47.2   43.5        30.0
5 dB      40.9    42.3       47.5     54.9   51.0        37.9
Average   30.9    32.2       41.6     44.2   42.2        27.7

Table III presents the average ASII_ST results for the unprocessed (UNP) noisy speech signals. Here the SSN and Babble noises attain the lowest scores for the SNR value of -10 dB, with ASII_ST of 19.3 and 23.1, respectively. The ASII_ST values incremented by each competitive method (ΔASII_ST) are depicted in Fig. 5 for the six acoustic noises. Observe that the proposed solution accomplishes the highest scores for most conditions, except for Traffic (SNR = -10 dB). The best ΔASII_ST (10.1 × 10⁻²) is achieved for the challenging SSN noise at -10 dB. As with the ESTOI results, the SSFV approach does not present a noticeable ASII_ST increment. Moreover, for the non-stationary Cafeteria noise the proposed solution attains an average intelligibility enhancement of 5.4 × 10⁻², compared with 3.5 × 10⁻², 1.8 × 10⁻² and 0.3 × 10⁻² for the baselines GTF_F0, PACO and SSFV. Therefore, these results reinforce the robustness of the proposed method against several noisy masking effects.

Fig. 6.
Perceptual intelligibility evaluation with SSN additive acoustic noise for (a) male volunteers, (b) female volunteers and (c) overall scores. Each case denotes: 1-UNP, 2-SSFV, 3-PACO, 4-GTF_F0 and 5-HDAG.

B. Perceptual Intelligibility Evaluation

A subjective listening test [52] is conducted considering a scenario of phonetically balanced words⁴. Ten native male and ten native female Brazilian volunteers perform the test; their ages range from 19 to 57 years, with an average of 32. The SSN noise is adopted with SNRs of -5 dB, 0 dB and 5 dB. Ten words are applied for each of the 15 test conditions, i.e., three SNR levels and four methods plus the unprocessed case. Participants are introduced to the task in a training session with 4 words. The material is diotically presented using a pair of Roland RH-200S headphones. Listeners hear each word once, in an arbitrary presentation order, and are asked to indicate the word in a sheet list.

⁴The complete test database is available at lasp.ime.eb.br.

TABLE IV
NORMALIZED MEAN PROCESSING TIME

SSFV    PACO    GTF_F0    HDAG
0.32    0.67    0.89      1.00

The intelligibility results for each method are presented in Fig. 6. Each boxplot depicts the median and deviation of the scores (%) for one scenario, separating (a) the male volunteers, (b) the female volunteers, and (c) the overall scores. The proposed method accomplishes higher intelligibility than the competing approaches under all conditions. For male listeners, HDAG obtained average intelligibility scores of 66%, 85% and 93%, compared to 52%, 66% and 86% for the GTF_F0 technique, for SNR values of -5 dB, 0 dB and 5 dB, respectively. Furthermore, the female volunteers presented higher intelligibility rates than the male volunteers, mainly for -5 dB, with 75% and 65% for HDAG and GTF_F0.
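The per-listener scores summarized in Fig. 6 are word-recognition percentages. A minimal sketch of how such a score is computed from a listener's answer sheet (the function name and list representation are illustrative, not part of the test protocol):

```python
def words_correct_pct(responses, targets):
    """Percentage of test words correctly identified by one listener.
    `responses` are the words the listener wrote down; `targets` are
    the words actually presented, in the same order."""
    hits = sum(1 for r, t in zip(responses, targets) if r == t)
    return 100.0 * hits / len(targets)
```

Per condition (one SNR level and one method, ten words each), one such percentage is obtained per listener, and the boxplots then show the distribution of these percentages across listeners.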
The overall results show again the superiority of HDAG, with average scores of 71%, 86% and 92%, surpassing GTF_F0 (59%, 71% and 86%) and PACO (43%, 64% and 78%). In accordance with the findings of the objective measures ESTOI and ASII_ST, SSFV attains scores less than or equal to the UNP case.

C. Normalized Processing Time

Table IV indicates the computational complexity, which refers to the normalized processing time required by each method, evaluated for 512 samples per frame. These values are obtained with an Intel(R) Core(TM) i7-9700 CPU with 8 GB RAM, and are normalized by the execution time of the proposed HDAG solution. The processing time required for F0 estimation and accurate harmonic adjustment is also considered here. Note that the HDAG and GTF_F0 schemes present a longer processing time, since the FSFFE low/high pitch classification and the HHT-Amp estimation are based on the EEMD and demand a relevant computational cost.

V. CONCLUSION

This paper introduced the HDAG method for speech intelligibility enhancement of the harmonic components of noisy speech. It is composed of four main steps. First, the HHT-Amp technique is adopted to estimate the F0 from voiced frames. The FSFFE separation is used for the detection and adjustment of these estimates, improving their accuracy. Then, a selective Gammachirp filterbank is applied to the frames considering third-octave bands to best cover the regions most relevant to intelligibility. Finally, the filtered components are amplified by gain factors regulated by the low/high pitch classification. Extensive experiments were conducted to evaluate the intelligibility enhancement provided by the HDAG method and the competitive approaches. Six acoustic noises were considered with four SNR values. Three objective measures were adopted for the evaluation of speech intelligibility and quality.
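The normalization used in Table IV (each method's mean processing time divided by HDAG's) can be sketched as below. The dictionary of method callables and their names are hypothetical placeholders, not the authors' implementations; only the normalize-by-reference step reflects the table.

```python
import time

def normalized_times(methods, frames, reference="HDAG"):
    """Measure wall-clock processing time of each method over the test
    frames and normalize by the reference method's time (Table IV style).
    `methods` maps a name to a per-frame callable (illustrative)."""
    raw = {}
    for name, process_frame in methods.items():
        start = time.perf_counter()
        for frame in frames:
            process_frame(frame)
        raw[name] = time.perf_counter() - start
    ref_time = raw[reference]
    return {name: t / ref_time for name, t in raw.items()}
```

By construction the reference method maps to 1.00, and faster methods map to values below 1, matching the layout of Table IV.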
The results demonstrate that the HDAG method outperformed the competitive approaches, with higher intelligibility and quality in most noisy environments. A perceptual test with male and female listeners corroborated the objective results. Future research includes the investigation of the proposed method under other conditions, such as intelligibility enhancement for noisy reverberant speech.

REFERENCES

[1] N. R. French and J. C. Steinberg, "Factors governing the intelligibility of speech sounds," J. Acoust. Soc. Amer., vol. 19, pp. 90–119, 1947.
[2] P. Assmann and Q. Summerfield, "The perception of speech under adverse conditions," in Speech Processing in the Auditory System, pp. 231–308. Berlin, Germany: Springer, 2004.
[3] T. Gerkmann and R. C. Hendriks, "Unbiased MMSE-based noise power estimation with low complexity and low tracking delay," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 4, pp. 1383–1393, 2012.
[4] R. Tavares and R. Coelho, "Speech enhancement with nonstationary acoustic noise detection in time domain," IEEE Signal Process. Lett., vol. 23, no. 1, pp. 6–10, 2016.
[5] C. Medina, R. Coelho, and L. Zão, "Impulsive noise detection for speech enhancement in HHT domain," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 2244–2253, 2021.
[6] E. Dranka and R. Coelho, "Robust maximum likelihood acoustic energy based source localization in correlated noisy sensing environments," IEEE J. Sel. Topics Signal Process., vol. 9, no. 2, pp. 259–267, 2015.
[7] C. Evers and P. A. Naylor, "Acoustic SLAM," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 9, pp. 1484–1498, 2018.
[8] J. Martinez-Carranza and C. Rascon, "A review on auditory perception for unmanned aerial vehicles," Sensors, vol. 20, no. 24, 2020.
[9] A. Ljolje, "Speech recognition using fundamental frequency and voicing in acoustic modeling," in Proc. Int. Conf. Spoken Lang. Process., 2002.
[10] A. Venturini, L. Zão, and R. Coelho, "On speech features fusion, integration Gaussian modeling and multi-style training for noise robust speaker classification," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1951–1964, 2014.
[11] L. Zão, R. Coelho, and P. Flandrin, "Speech enhancement with EMD and hurst-based mode selection," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 5, pp. 899–911, 2014.
[12] P. C. Loizou and G. Kim, "Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 1, pp. 47–56, 2011.
[13] Y. Li and D. Wang, "On the optimality of ideal binary time–frequency masks," Speech Communication, vol. 51, no. 3, pp. 230–239, 2009.
[14] G. Kim and P. C. Loizou, "Improving speech intelligibility in noise using a binary mask that is based on magnitude spectrum constraints," IEEE Signal Process. Lett., vol. 17, no. 12, pp. 1010–1013, 2010.
[15] F. Farias and R. Coelho, "Blind adaptive mask to improve intelligibility of non-stationary noisy speech," IEEE Signal Process. Lett., vol. 28, pp. 1170–1174, 2021.
[16] C. Brown and S. Bacon, "Fundamental frequency and speech intelligibility in background noise," Hear. Res., vol. 266, pp. 52–59, 2010.
[17] D. Ealey, H. Kelleher, and D. Pearce, "Harmonic tunnelling: Tracking nonstationary noises during speech," in Proc. 7th Eur. Conf. Speech Commun. Technol., pp. 437–440, 2001.
[18] T. Wang, W. Zhu, Y. Gao, S. Zhang, and J. Feng, "Harmonic attention for monaural speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 2424–2436, 2023.
[19] S. Y. Barysenka and V. I. Vorobiov, "SNR-based inter-component phase estimation using bi-phase prior statistics for single-channel speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 2365–2381, 2023.
[20] L. Welling and H. Ney, "Formant estimation for speech recognition," IEEE Trans. Speech Audio Process., vol. 6, no. 1, pp. 36–48, 1998.
[21] L. Wang and F. Chen, "Factors affecting the intelligibility of low-pass filtered speech," in Proc. INTERSPEECH, pp. 563–566, Aug. 2017.
[22] L. Wang, D. Zheng, and F. Chen, "Understanding low-pass-filtered Mandarin sentences: Effects of fundamental frequency contour and single-channel noise suppression," J. Acoust. Soc. Amer., vol. 143, no. 3, pp. 141–145, 2018.
[23] K. Nathwani, M. Daniel, G. Richard, B. David, and V. Roussarie, "Formant shifting for speech intelligibility improvement in car noise environment," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), pp. 5375–5379, 2016.
[24] K. Nathwani, G. Richard, B. David, P. Prablanc, and V. Roussarie, "Speech intelligibility improvement in car noise environment by voice transformation," Speech Communication, vol. 91, pp. 17–27, 2017.
[25] E. Lombard, "Le signe de l'élévation de la voix," Maladies Oreille, Larynx, Nez, Pharynx, vol. 37, no. 25, pp. 101–119, 1911.
[26] A. Queiroz and R. Coelho, "F0-based gammatone filtering for intelligibility gain of acoustic noisy signals," IEEE Signal Process. Lett., vol. 28, pp. 1225–1229, 2021.
[27] L. Zão and R. Coelho, "On the estimation of fundamental frequency from nonstationary noisy speech signals based on Hilbert-Huang transform," IEEE Signal Process. Lett., vol. 25, no. 2, pp. 248–252, 2018.
[28] R. A. Lutfi and R. D. Patterson, "On the growth of masking asymmetry with stimulus intensity," J. Acoust. Soc. Amer., vol. 76, pp. 739–745, 1984.
[29] T. Irino and R. D. Patterson, "A time-domain, level-dependent auditory filter: The gammachirp," J. Acoust. Soc. Amer., vol. 101, no. 1, pp. 412–419, 1997.
[30] J. Jensen and C. H. Taal, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 11, pp. 2009–2022, 2016.
[31] R. Drullman, J. M. Festen, and R. Plomp, "Effect of temporal envelope smearing on speech reception," J. Acoust. Soc. Amer., vol. 95, pp. 1053–1064, 1994.
[32] R. Drullman, J. M. Festen, and R. Plomp, "Effect of reducing slow temporal modulation on speech reception," J. Acoust. Soc. Amer., vol. 95, pp. 2670–2680, 1994.
[33] T. M. Elliott and F. E. Theunissen, "The modulation transfer function for speech intelligibility," PLOS Comput. Biol., vol. 5, no. 3, pp. 1–14, 2009.
[34] A. Queiroz and R. Coelho, "Noisy speech based temporal decomposition to improve fundamental frequency estimation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 2504–2513, 2022.
[35] M. Khadem-hosseini, S. Ghaemmaghami, A. Abtahi, S. Gazor, and F. Marvasti, "Error correction in pitch detection using a deep learning based classification," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 990–999, 2020.
[36] J. Garofolo, L. Lamel, W. Fischer, J. Fiscus, D. Pallett, N. Dahlgren, and V. Zue, "TIMIT acoustic-phonetic continuous speech corpus," Linguist. Data Consortium, Philadelphia, PA, USA, 1993.
[37] R. C. Hendriks, J. B. Crespo, J. Jensen, and C. Taal, "Optimal near-end speech intelligibility improvement incorporating additive noise and late reverberation under an approximation of the short-time SII," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 5, pp. 851–862, 2015.
[38] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., vol. 2, pp. 749–752, 2001.
[39] J. Stahl and P. Mowlaee, "Exploiting temporal correlation in pitch-adaptive speech enhancement," Speech Communication, vol. 111, pp. 1–13, 2019.
[40] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N. C. Yen, C. C. Tung, and H. H. Liu, "The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis," Proc. Roy. Soc. London Ser. A: Math., Phys., Eng. Sci., vol. 454, no. 1971, pp. 903–995, 1998.
[41] M. E. Torres, M. A. Colominas, G. Schlotthauer, and P. Flandrin, "A complete ensemble empirical mode decomposition with adaptive noise," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., pp. 4144–4147, 2011.
[42] S. Gonzalez and M. Brookes, "A pitch estimation filter robust to high levels of noise (PEFAC)," in Proc. Eur. Signal Process. Conf. (EUSIPCO), pp. 451–455, 2011.
[43] L. Zão, R. Coelho, and P. Flandrin, "Speech enhancement with EMD and hurst-based mode selection," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 5, pp. 897–909, 2014.
[44] N. Chatlani and J. Soraghan, "EMD-based filtering (EMDF) of low-frequency noise for speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 4, pp. 1158–1166, 2012.
[45] I. R. Titze, Principles of Voice Production. Englewood Cliffs, NJ, USA: Prentice Hall, 1994.
[46] H. J. Steeneken and F. W. Geurtsen, "Description of the RSG-10 noise database," Report IZF, vol. 3, 1988.
[47] IEC 61260, Electroacoustics - Octave-band and fractional-octave-band filters. Geneva, Switzerland: International Electrotechnical Commission, 1995.
[48] S. Gonzalez, "Pitch of the core TIMIT database set," 2014.
[49] J. Thiemann, N. Ito, and E. Vincent, "DEMAND: a collection of multi-channel recordings of acoustic noise in diverse environments," in Proc. Meetings Acoust., 2013.
[50] J. Junqua, "The Lombard reflex and its role on human listeners and automatic speech recognizers," J. Acoust. Soc. Amer., vol. 93, no. 1, pp. 510–524, 1993.
[51] L. Rabiner and R. Schafer, Digital Processing of Speech Signals. Prentice Hall, 1978.
[52] S. Ghimire, "Speech intelligibility measurement on the basis of ITU-T recommendation P.863," 2012.