Harmonic Detection from Noisy Speech with Auditory Frame Gain for Intelligibility Enhancement
Authors: A. Queiroz, R. Coelho
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING

A. Queiroz, Student Member, IEEE, and R. Coelho, Senior Member, IEEE

Abstract—This paper introduces a novel method, HDAG (Harmonic Detection for Auditory Gain), for speech intelligibility enhancement in noisy scenarios. In the proposed scheme, a series of selective Gammachirp filters is adopted to emphasize the harmonic components of speech, reducing the masking effects of acoustic noises. The fundamental frequency is estimated by the HHT-Amp technique. Harmonic patterns estimated with low accuracy are detected and adjusted according to the FSFFE low/high pitch separation. The center frequencies of the filterbank are defined considering the third-octave subbands that are best suited to cover the regions most relevant to intelligibility. Before signal reconstruction, the gammachirp-filtered components are amplified by gain factors regulated by the FSFFE classification. The proposed HDAG solution and three baseline techniques are examined considering six background noises with four signal-to-noise ratios. Three objective measures are adopted for the evaluation of speech intelligibility and quality. Several experiments are conducted to demonstrate that the proposed scheme achieves better speech intelligibility improvement than the competing approaches. A perceptual listening test is further considered and corroborates the objective results.

Index Terms—Gammachirp filtering, low/high frequency separation, harmonic detection, noisy speech.

I. INTRODUCTION

Acoustic noise is a strong masking effect that impairs speech intelligibility [1][2].
This interference underlies several research studies such as speech enhancement [3][4][5], source localization [6][7], robot audition [8], and speech and speaker recognition [9][10]. Thus, its mitigation is a relevant element of interest for intelligibility and quality enhancement. Several signal processing methods are described in the literature to attenuate noise interference for speech quality improvement [11]. However, this achievement does not necessarily lead to speech intelligibility improvement [12]. On the other hand, acoustic masks [13][14][15] are defined to emulate the cocktail party effect. These solutions provide intelligibility enhancement for the target speech signal.

In recent years, the analysis of harmonic components of noisy speech [16][17] has encouraged the proposal of new strategies for intelligibility gain [18][19]. For these, harmonic components such as the fundamental frequency (F0) and formants [20] play an interesting role for intelligibility in noisy conditions [16][21][22]. Time-domain adaptive solutions are designed to deal with the harmonics of the speech signal to reduce the noise effects. In [23], the formant center frequencies from voiced segments of speech are shifted away from the region of noise.

This work was supported in part by the National Council for Scientific and Technological Development (CNPq) under Grant 305488/2022-8 and the Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ) under Grant 200518/2023, and in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) under Grant Code 001. The authors are with the Laboratory of Acoustic Signal Processing, Military Institute of Engineering (IME), Rio de Janeiro, RJ 22290-270, Brazil (e-mail: coelho@ime.eb.br).
This formant shifting procedure [24] simulates the human strategy of producing a more audible signal in a noisy environment, i.e., the Lombard effect [25]. Results showed that the Smoothed Shifting of Formants for Voiced segments (SSFV) is able to improve the intelligibility of speech signals in a car noise environment. A different approach was proposed in [26], where the HHT-Amp [27] F0 estimation technique was applied to the harmonic components of noisy speech. The F0-based Gammatone Filtering (GTF_F0) method considered integer multiples of the estimated F0 as center frequencies of a time-domain auditory filterbank. Finally, the outputs are amplified to emphasize the harmonics of the speech signal, leading to intelligibility gain.

The use of the Gammatone filterbank in the GTF_F0 method may be limited by high-level masking effects [28]. To overcome this issue, the Gammachirp proposed in [29] produces a filter with an asymmetric amplitude spectrum. This auditory filter provides an interesting fit to various sets of noise masking data. The center frequencies of its filterbank must be well defined, considering the ones most relevant for intelligibility. In this context, the Extended Short-Time Objective Intelligibility (ESTOI) [30] measure performs an evaluation of noisy speech in third-octave subbands. ESTOI also considers the temporal modulation frequencies relevant to speech intelligibility, whose values range from 1 to 12.5 Hz [31][32][33]. These subbands and this modulation frequency range are able to assist in regulating the bandwidth of filterbanks to cover the harmonic components of speech most relevant for intelligibility.

This paper introduces the HDAG (Harmonic Detection with Auditory Gain) method to attain intelligibility enhancement for the harmonic components of noisy speech signals. The proposed solution is performed in four steps.
Initially, the HHT-Amp method [27] is applied to estimate the F0 of speech frames. In the second step, these frames are separated into low-pitch or high-pitch ones with the FSFFE [34] technique. The separation leads to the detection and adjustment of F0 values according to some typical errors [35] that may occur in the estimation, improving its accuracy. In sequence, the third stage consists in filtering the harmonic components of noisy speech with the Gammachirp. The central frequencies and bandwidths of the filterbank are selectively defined to cover the most relevant regions for speech intelligibility, as stated in [30]. Finally, the filtered components are amplified by a gain factor to highlight the harmonic components of speech. This amplification mitigates the masking effects of the background noise, leading to intelligibility enhancement.

Fig. 1. Block diagram of the proposed HDAG method to improve the intelligibility of noisy speech signals.

Several experiments are conducted to examine the effectiveness of the HDAG method. For this purpose, speech utterances collected from the TIMIT [36] database are corrupted by six real acoustic noises, considering four SNR values: -10 dB, -5 dB, 0 dB and 5 dB. The proposed method and three baseline approaches are examined in terms of intelligibility enhancement. To this end, ESTOI [30] and the Short-Time Approximated Speech Intelligibility Index (ASII_ST) [37] are considered in the evaluation. Moreover, results for the Perceptual Evaluation of Speech Quality (PESQ) [38] demonstrate that HDAG also achieves quality improvement. Objective results indicate that the proposal outperforms the competitive approaches in terms of speech intelligibility, and also in quality scores. These results are corroborated by a subjective listening evaluation test.
The main contributions of this work are:
• Introduction of the HDAG method to improve the intelligibility and quality of acoustic noisy speech.
• Definition of the filterbank configuration using the third-octave bands and specific modulation frequencies, with higher resolution in the regions most relevant to intelligibility.
• Adoption of the asymmetry coefficient from the Gammachirp to adjust the filterbank to the noise-masked components of speech.
• Interesting intelligibility and quality improvements attained with adaptive gain factors defined according to the FSFFE separation.

The remainder of this paper is organized as follows. Section II describes the steps of the proposed HDAG method for intelligibility enhancement. An explanation of the competitive approaches SSFV, PACO (pitch-adaptive complex-valued Kalman filter) [39] and GTF_F0 is included in Section III. Section IV presents the evaluation experiments and results. Finally, Section V concludes this work.

II. THE HDAG METHOD

The proposed method includes four main steps: harmonic detection, third-octave bands configuration, gammachirp filtering, and output samples amplification by a gain factor. Finally, the overlap-and-add method is applied to achieve the reconstructed version of the target speech signal. Fig. 1 illustrates the block diagram of the HDAG method.

A. F0 Estimation

The fundamental frequency (F0) is estimated from the noisy speech signal with the HHT-Amp method [27]. This F0 estimator ensures [27][34] interesting accuracy results from noisy speech signals. HHT-Amp has been evaluated in a wide range of noisy scenarios, outperforming four competing estimators in terms of accuracy.
It applies the time-frequency EEMD (Ensemble Empirical Mode Decomposition) [40][41] to decompose a voiced sample sequence x_q(t) such that

  x_q(t) = \sum_{k=1}^{K} IMF_{k,q}(t) + r_q(t)                                  (1)

where IMF_{k,q}(t) is the k-th mode of x_q(t) and r_q(t) is the last residual. Then, instantaneous amplitude functions are computed by

  a_{k,q}(t) = |Z_{k,q}(t)|,   k = 1, ..., K,                                    (2)

from the analytic signals defined as

  Z_{k,q}(t) = IMF_{k,q}(t) + j H{IMF_{k,q}(t)},                                 (3)

where H{IMF_{k,q}(t)} refers to the Hilbert transform of IMF_{k,q}(t). The autocorrelation function (ACF) is calculated as

  r_{k,q}(\tau) = \sum_{t} a_{k,q}(t) a_{k,q}(t + \tau).                         (4)

For each decomposition mode k, let τ_0 be the lowest τ value that corresponds to an ACF peak, subject to τ_min ≤ τ_0 ≤ τ_max. The frequency restriction is applied according to the range [F_min, F_max] of possible F0 values. The k-th pitch period candidate is defined as τ_0/f_s, where f_s refers to the sampling rate. Finally, a decision criterion [27] is applied to select the best pitch candidate T̂_0, and the estimated F0 is given by f_est = 1/T̂_0.

B. Harmonic Detection and Adjustment

Severe noise masking effects may impact the harmonic components of voiced speech, leading to low-accuracy F0 estimates. In order to detect and adjust the erroneous F0 values, the FSFFE (Frequency Separation for Fundamental Frequency Estimation) [34] technique is applied to the harmonic frames. This strategy separates the noisy speech frames into low-pitch or high-pitch ones. Possible errors in the F0 estimates can be detected by comparing their values with this separation. Fig. 2 illustrates the block diagram of the FSFFE method.

Fig. 2. Block diagram of the FSFFE technique for the low/high pitch classification of speech frames.
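The ACF-based candidate selection of Section II-A (Eqs. (2)-(4)) can be illustrated with a minimal sketch. This is our own simplification, not the full HHT-Amp pipeline of [27]: the ACF peak search is applied directly to a synthetic voiced frame rather than to the EEMD amplitude envelopes a_{k,q}(t), and the helper name `f0_from_acf` is hypothetical:

```python
import numpy as np

def f0_from_acf(frame, fs, f_min=50.0, f_max=400.0):
    """Pick the pitch-period candidate tau0 as the highest ACF peak
    whose lag lies inside [fs/f_max, fs/f_min], as in Eq. (4)."""
    frame = frame - frame.mean()
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    tau_min, tau_max = int(fs / f_max), int(fs / f_min)
    tau0 = tau_min + int(np.argmax(acf[tau_min:tau_max + 1]))
    return fs / tau0                      # f_est = 1 / T0

fs = 16000
t = np.arange(int(0.032 * fs)) / fs       # one 32 ms voiced frame
frame = np.sin(2 * np.pi * 120.0 * t)     # synthetic 120 Hz harmonic frame
f_est = f0_from_acf(frame, fs)            # close to 120 Hz
```

The lag restriction [τ_min, τ_max] is the frequency restriction [F_min, F_max] of the text expressed in samples; the decision criterion of [27] that ranks candidates across the K modes is omitted here.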
After the EEMD decomposition as in (1), pitch estimation is performed in the voiced frames of each IMF using the PEFAC [42] algorithm. Let F̂0_{k,q} denote the pitch value estimated from frame q of IMF_k(t); the vector F̂0_q is composed as

  \hat{F0}_q = [\hat{F0}_{1,q}, \hat{F0}_{2,q}, ..., \hat{F0}_{K,q}]^T,          (5)

to express the tendency that the frame is placed in a low/high pitch region. Only the first four IMFs (K = 4) are considered in order to avoid the acoustic noise masking effect. The energy of these unwanted components is mostly concentrated at low frequencies (k > 6) [43][5][44]. A normalized distance is computed between IMFs for the successive frames to detect and overcome the differences in the estimated F0. Let k and k′ denote IMF indexes; the distance is described as

  \delta_q^{\hat{F0}}(k, k') = |\hat{F0}_{k,q} - \hat{F0}_{k',q}| / (\hat{F0}_{k,q} + \hat{F0}_{k',q}).   (6)

The δ_q^{F̂0}(k, k′) values are computed for the different indexes k and k′, resulting in a 4×4 distance matrix δ_q^{F̂0}. The row components of the matrix are summed to obtain the variation property for the k-th IMF. The frequency region is defined by the mean value of the PEFAC F0 estimates (F̄0_q) of the two IMFs with the smallest variation scores. Finally, the low/high pitch separation is performed according to the threshold γ as

  \bar{F0}_q ≤ γ : low-frequency frame;   \bar{F0}_q > γ : high-frequency frame.  (7)

The threshold γ is fixed at 200 Hz, which is related to the average values between male (50-200 Hz) and female (120-350 Hz) speakers [45].

Fig. 3. Ground truth and F0 estimated with the HHT-Amp technique for: (a) clean speech segment, (b) noisy signal with babble noise at SNR = -5 dB and (c) the same noisy segment with estimates improved by FSFFE.
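The separation of Eqs. (5)-(7) can be sketched as follows. This is a minimal illustration with hypothetical helper names; the per-IMF PEFAC estimates are passed in as plain numbers rather than computed:

```python
import numpy as np

GAMMA = 200.0  # low/high pitch threshold gamma (Hz), as in the text

def classify_frame(f0_per_imf):
    """Low/high-pitch decision from the K = 4 per-IMF estimates.
    Builds the 4x4 normalized distance matrix of Eq. (6), sums each row
    to get the variation score, averages the two most consistent IMFs,
    and thresholds as in Eq. (7)."""
    f = np.asarray(f0_per_imf, dtype=float)
    delta = np.abs(f[:, None] - f[None, :]) / (f[:, None] + f[None, :])
    scores = delta.sum(axis=1)                 # row-wise variation score
    k1, k2 = np.argsort(scores)[:2]            # two smallest-variation IMFs
    f0_bar = 0.5 * (f[k1] + f[k2])             # mean PEFAC estimate
    return f0_bar, ("low" if f0_bar <= GAMMA else "high")

# Third IMF is an outlier; the two consistent IMFs (110, 112 Hz) win.
f0_bar, cls = classify_frame([110.0, 112.0, 230.0, 95.0])  # 111.0, "low"
```

The outlier IMF accumulates a large row sum in the distance matrix and is excluded from the F̄0_q average, which is the intended error-rejection behavior of the separation.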
The F0 adjustment is conducted according to the low/high pitch classification in (7). The F0 estimates are prone to doubling errors in low-pitch frames. Hence, a low-pitch frame that presents an F0 value (f_est,q) ranging in [200-400] Hz is adjusted to f_adj,q = 0.5 f_est,q. On the other hand, a high-pitch frame is adjusted according to possible halving and quartering [35] errors as follows:

  f_adj,q = 4 f_est,q,   50 ≤ f_est,q ≤ 100,
            2 f_est,q,   100 < f_est,q ≤ 200.                                    (8)

Fig. 3 illustrates the F0 adjustment procedure in frames of a 1200 ms speech signal. Fig. 3(a) refers to the F0 attained with the HHT-Amp method for the clean speech. The estimated values match the ground truth in the high-pitch region. Fig. 3(b) presents the F0 estimates related to the noisy version of the same speech segment for babble noise [46] with SNR = -5 dB. Note that the accuracy decreases significantly and halving errors appear in harmonic components, e.g., around 100 ms or 600 ms. These regions are adjusted with FSFFE, as can be seen in Fig. 3(c). Observe that the proposed adjustment leads to accuracy improvement even in this severe noisy condition. The correction in harmonic detection is especially important in this case, particularly because important components for speech intelligibility are placed at higher frequencies.

C. Third-octave Bands Configuration

Third-octave filterbanks have been shown to loosely approximate the measured bands of the auditory filters [47]. Objective speech metrics consider the analysis of clean and noisy speech in third-octave subspaces. This is the case of the ESTOI [30] intelligibility measure, which gives a prediction through the correlation of third-octave spectrograms from the reference and processed signals.
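The adjustment rules of Eq. (8), together with the low-pitch doubling correction described just before it, reduce to a small lookup. A sketch (the function name is ours):

```python
def adjust_f0(f_est, pitch_class):
    """Correct typical doubling/halving/quartering errors (Eq. (8)).
    `pitch_class` is the FSFFE decision of Eq. (7): "low" or "high"."""
    if pitch_class == "low" and 200.0 <= f_est <= 400.0:
        return 0.5 * f_est           # octave-doubling error in a low-pitch frame
    if pitch_class == "high":
        if 50.0 <= f_est <= 100.0:
            return 4.0 * f_est       # quartering error
        if 100.0 < f_est <= 200.0:
            return 2.0 * f_est       # halving error
    return f_est                     # estimate consistent with its class

adjust_f0(240.0, "low")    # -> 120.0 (doubling corrected)
adjust_f0(80.0, "high")    # -> 320.0 (quartering corrected)
```

Note that the same numeric estimate, e.g. 150 Hz, is left untouched in a low-pitch frame but doubled in a high-pitch frame: the FSFFE class, not the raw value, decides which error model applies.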
TABLE I
ESTOI [×10^-2] SCORES FOR DIFFERENT ASYMMETRY COEFFICIENTS.

                          Gammachirp coefficient c
Noise        SNR      2.0   1.5   1.0   0.5   0.0  -0.5  -1.0  -1.5  -2.0
Babble    -10 dB     18.5  19.3  18.5  18.9  19.0  18.4  19.4  18.4  19.1
           -5 dB     28.8  29.7  28.9  29.2  29.5  28.8  29.8  28.8  29.6
            0 dB     41.1  42.0  41.3  41.6  41.8  41.1  42.2  41.1  42.0
            5 dB     54.5  55.4  54.7  54.9  55.2  54.5  55.6  54.4  55.5
         Average     35.7  36.6  35.9  36.1  36.4  35.7  36.8  35.7  36.5
SSN       -10 dB     20.5  21.3  20.6  20.9  21.0  20.4  21.5  20.4  21.2
           -5 dB     30.2  31.1  30.4  30.7  30.8  30.3  31.3  30.1  31.1
            0 dB     41.6  42.4  41.8  42.0  42.2  41.6  42.6  41.5  42.5
            5 dB     54.2  55.0  54.4  54.6  54.8  54.2  55.3  54.1  55.2
         Average     36.6  37.4  36.8  37.0  37.2  36.6  37.7  36.5  37.5

This work proposes the definition of an auditory filterbank based on the third-octave bands. The accurate harmonic detection f_adj,q is adopted as the center frequency of the first band of the filterbank (k = 0). The center frequencies for the following k bands are attained adaptively by

  f_c(k, q) = 2^{k/3} f_adj,q.                                                   (9)

The resulting set of filters in each frame q provides better resolution in the frequencies near the harmonics of speech, which are the most important for intelligibility [30].

D. Gammachirp Filtering

In this step, a set of L Gammachirp filters [29] {h_k(t), k = 1, ..., L} is applied to successively filter the input sample sequence x_q(t). Each filter h_k(t) is implemented on the noisy signal considering frames of 32 ms, order n = 4, and center frequencies given by (9).
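The adaptive center-frequency rule of Eq. (9) is a one-liner; a sketch (the function name is ours):

```python
def center_frequencies(f_adj, L=10):
    """Third-octave-spaced center frequencies of Eq. (9): f_c(k) = 2**(k/3) * f_adj.
    The first band (k = 0) sits exactly on the adjusted harmonic detection."""
    return [2.0 ** (k / 3.0) * f_adj for k in range(L)]

cf = center_frequencies(120.0, L=4)   # 120.0, ~151.2, ~190.5, 240.0 Hz
```

Every third band doubles the frequency (2^{3/3} = 2), so the bank climbs one octave per three filters, mirroring the third-octave analysis grid used by ESTOI.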
In order to align the impulse response functions, phase compensation is applied to all filters, which corresponds to the non-causal filters

  h_k(t) = a (t + t_c)^{n-1} cos(2π f_c t + c ln t) e^{-2π b (t + t_c)},   t ≥ -t_c,   (10)

where c is the gammachirp coefficient of the filter and t_c = (n-1)/(2π b), which ensures that the peaks of all filters occur at t = 0. The bandwidth b is defined here according to the modulation transfer function frequencies considered in [31][32][33]. The results presented in [33] demonstrated that the frequency range relevant for the intelligibility of male speech sentences is [1-12.5] Hz. Nevertheless, female sentences presented a larger range, with noticeable relevance for frequencies ≤ 20 Hz. Therefore, this work proposes a harmonic-adaptive bandwidth, given by b = 0.15 f_adj,q. Let x_q^0(t) = x_q(t); the filtered signals y_q^k(t), k = 1, ..., L, are recursively computed by

  y_q^k(t) = x_q^{k-1}(t) * h_k(t),
  x_q^k(t) = x_q^{k-1}(t) - y_q^k(t),   k = 1, ..., L.                           (11)

The residual signal is defined as r_q(t) = x_q^L(t) to guarantee the completeness of the input sequence, i.e.,

  x_q(t) = \sum_{k=1}^{L} y_q^k(t) + r_q(t).                                     (12)

Fig. 4. ESTOI curves of (a) low-pitch and (b) high-pitch frames, averaged over the SNR values -10 dB, -5 dB, 0 dB and 5 dB of babble noise, according to the gain factor G_k of each gammachirp filter.

Table I presents the ESTOI scores for different asymmetry coefficients c of the Gammachirp filter. The intelligibility is predicted for a training subset of 48 speech signals of TIMIT [36] defined in [48].
The ESTOI scores for the different values of c are computed for the Babble [46] and SSN [49] noisy scenarios. Note that the coefficient c = -1 achieves the highest intelligibility rates for all the noisy conditions. This can be justified by the fact that acoustic noises might shift the harmonic detection. Therefore, the asymmetry of the gammachirp has the role of fine-tuning these harmonic components.

E. Frames Reconstruction with a Gain Factor

After the Gammachirp filtering, the amplitudes of the output samples y_q^k(t), k = 1, ..., L, are amplified by a gain factor G_k ≥ 1. The idea is to emphasize the presence of the harmonic features of speech, which will lead to speech intelligibility improvement, without introducing any noticeable distortion to the speech signal. The reconstruction of a voiced frame q ∈ S_v leads to the sample sequence

  \hat{x}_q(t) = [\sum_{k=1}^{L} G_k y_q^k(t)] + r_q(t).                         (13)

The reconstructed voiced frames in S_v and all the remaining frames in S_u are joined together, keeping the original frame indices. Thus, all frames are overlap-added to reconstruct the modified version x̂(t) of the target speech signal. The completeness and continuity of x̂(t) are guaranteed by the adoption of the Hanning window that multiplies all frames before the overlap-and-add method. This means that the reconstructed signal x̂(t) and the original signal x(t) would be exactly the same if each frame were reconstructed considering G_k = 1 for every k ∈ {1, ..., L}.

The set of gains G_k is empirically determined for each filter using the same training subset of 48 speech signals attained from the TIMIT database. Fig. 4 illustrates the ESTOI curves for noisy speech signals with Babble noise, averaged over the four SNR values.

Algorithm 1 Intelligibility Enhancement Scheme HDAG
  for each frame q do
    Input: x_q(t)
    Harmonic Detection:
      f_est,q ← F0 estimation with HHT-Amp as in Section II-A.
      F̂0_q ← PEFAC (5) for the K = 4 decomposed modes of (1).
      δ_q^{F̂0} ← normalized distance matrix using (6).
      Low/high pitch classification (7) according to F̄0_q.
    Gammachirp Filtering:
      for k = 1, ..., L do
        h_k(t) ← impulse response of the non-causal filters (10)
        y_q^k(t) = x_q^{k-1}(t) * h_k(t)
        x_q^k(t) = x_q^{k-1}(t) - y_q^k(t)
      end for
      r_q(t) = x_q^L(t) ← residual components
    x̂_q(t) ← voiced frame reconstruction as in (13) with G_k from (14).
    x̂(t) ← overlap-and-add technique.
  end for
  return x̂(t)

The configuration starts from the first filter (F1), and the gain is incremented until ESTOI reaches its maximum value (highlighted point in Fig. 4). This gain is then fixed, and the process is repeated for the subsequent filters. Observe that two different sets of gains are presented: one for low-pitch frames (Fig. 4(a)) and another for high-pitch frames (Fig. 4(b)). Therefore, the G_k values for L = 10 filters that lead to the highest ESTOI intelligibility scores are defined as

  G_k = {14, 1, 4, 8, 4, 3.5, 3, 2, 2, 1.5},        low-pitch;
        {14, 1, 1, 4.5, 2, 3.5, 2.5, 2, 1.5, 1.5},  high-pitch.                  (14)

The proposed HDAG method is summarized in Algorithm 1. This algorithm is tailored to the harmonic detection scheme considered in this paper. However, Algorithm 1 can also be used with any other F0 estimation technique.

III. HARMONIC-BASED COMPARATIVE METHODS

This section briefly describes the baseline methods SSFV, PACO and GTF_F0. They also consider the harmonic components of noisy speech to attain intelligibility and quality improvement.

A. SSFV

The main idea of this solution consists in transforming the original signal by adopting a Lombard effect strategy [25][50].
In this effect, the central frequencies of the formants are shifted (formant shifting). This moves the energy at these frequencies away from the region of spectral action of the noise. The formant shifting process is described in [23] and optimized to operate in environments with the presence of car noise (composed of radio, message alert and telephone sounds). Initially, LPC (Linear Prediction Coding) is used to estimate the poles and formant frequencies of the voiced speech signal. In the LPC model, a 25 ms frame of the signal s(n, m) can be represented by linear predictions of order p [51], that is,

  s(n, m) = \sum_{j=1}^{p} a_j s(n - j, m) + e(n, m),                            (15)

where a_j are the linear prediction coefficients, e(n, m) indicates the residual error and p = 12. The variables n and m represent the signal sample and time frame indices, respectively. The LP filter A(z) is obtained from the coefficients a_j, so that

  A(z) = 1 + \sum_{j=1}^{p} a_j z^{-j}.                                          (16)

The poles P are obtained from the roots of the LP coefficients, and the formant frequencies F are defined as the estimated pole angles. The formants obtained are shifted according to a function δ(F) [24] determined according to the characteristics of the acoustic noise. The displacement of the formants is carried out according to the criterion

  \hat{F}(f) = F(f) + δ(f),  f_1 < f < f_3;   F(f),  otherwise,                  (17)

where f_1 and f_3 are the first and third formants, respectively. Finally, the resulting set of formants F̂ is obtained from these modifications.

B. PACO

The pitch-adaptive complex-valued Kalman filter (PACO) [39] is also adopted as a competitive technique for comparison with the proposed HDAG method. It applies harmonic signal modeling to estimate the complex-valued speech AR parameters required for the Kalman filter.
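Returning to the SSFV analysis stage, the LPC computation of Eqs. (15)-(16) and the pole-angle formant estimate can be sketched as follows. This is a minimal autocorrelation-method sketch with our own helper names and a synthetic one-resonance test signal; the noise-dependent shifting function δ(F) of Eq. (17) is omitted:

```python
import numpy as np

def lpc_coefficients(x, p=12):
    """Autocorrelation-method LPC: solve the normal equations for the a_j
    of Eq. (15), so that s(n) is approximated by sum_j a_j s(n-j)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])

def formant_frequencies(x, fs, p=12):
    """Formants as the angles of the LP-filter poles (Section III-A)."""
    a = lpc_coefficients(x, p)
    poles = np.roots(np.concatenate(([1.0], -a)))  # prediction-error filter roots
    poles = poles[poles.imag > 1e-9]               # one pole per conjugate pair
    return np.sort(np.angle(poles) * fs / (2 * np.pi))

# Synthetic AR(2) signal with a single resonance near 500 Hz (pole radius 0.97)
fs, r0, f0 = 8000, 0.97, 500.0
a1, a2 = 2 * r0 * np.cos(2 * np.pi * f0 / fs), -r0 * r0
e = np.random.default_rng(1).standard_normal(4000)
x = np.zeros(4000)
for n in range(2, 4000):
    x[n] = a1 * x[n - 1] + a2 * x[n - 2] + e[n]
F = formant_frequencies(x, fs, p=2)   # single "formant" close to 500 Hz
```

With p = 2 the estimator recovers the planted pole pair, so the recovered angle sits near the 500 Hz resonance; real speech uses p = 12 and yields several formant candidates per frame.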
To this end, fundamental frequency estimation is performed for each 32 ms signal frame y(n, l), and the phase progression ψ̂_h(l) is recursively estimated for each harmonic h according to

  ψ_h(l) = ψ_h(l-1) + (π L / f_s)(f_h(l) + f_h(l-1)).                            (18)

Successive speech DFT bins of y(n, l) are computed by incorporating the harmonic phase progression into a state-transition model. The AR coefficients â(l) are defined from the DFT bins [39], which are the input for the Kalman filter gain G_K, to obtain an estimate X̂(k, l) such that

  \hat{X}(k, l) = G_K(k, l)(Y(k, l) - \hat{X}_{prop}(k, l)),                     (19)

where X̂_prop is the state propagation estimate for the k-th bin. Finally, the inverse DFT is applied and the processed speech signal is reconstructed by performing overlap-and-add.

C. GTF_F0

In the GTF_F0 [26] method, a set of L Gammatone filters {h_k(t), k = 1, ..., L} is applied to successively filter the input sample sequence x_q(t). Each filter h_k(t) is implemented¹ in frames of 32 ms, considering order n = 4, center frequencies

  f_c = k F0,                                                                    (20)

and bandwidth b = 0.25 F0. The time-domain impulse response function described in (10) is applied for GTF_F0 without the asymmetry coefficient. Thus, it can be considered a specific case of the Gammachirp filterbank in which c = 0. After the Gammatone filtering, the amplitudes of the output samples y_q^k(t), k = 1, ..., L, are amplified by a gain factor G_k ≥ 1. The integer multiples of F0 are amplified as in [26] with the following linear gains: G_1 = G_2 = 5.0, G_3 = 4.0 and G_4 = 2.5.

TABLE II
INTELLIGIBILITY AND QUALITY RESULTS WITH THE PROPOSED HDAG AND COMPETITIVE METHODS.

                            ESTOI                              PESQ
Noise       SNR     UNP  SSFV  PACO  GTF_F0  HDAG     UNP  SSFV  PACO  GTF_F0  HDAG
Babble    -10 dB   0.18  0.17  0.17   0.24   0.28    0.56  0.99  1.25   1.77   2.06
           -5 dB   0.29  0.29  0.30   0.37   0.40    1.52  1.54  1.94   2.30   2.50
            0 dB   0.41  0.42  0.43   0.50   0.53    1.90  1.92  2.45   2.72   2.86
            5 dB   0.55  0.55  0.58   0.64   0.66    2.35  2.36  2.94   3.12   3.22
         Average   0.36  0.36  0.37   0.44   0.47    1.58  1.70  2.14   2.48   2.66
Cafeteria -10 dB   0.20  0.19  0.19   0.27   0.31    1.00  1.30  1.50   1.97   2.20
           -5 dB   0.31  0.31  0.31   0.40   0.43    1.63  1.67  2.01   2.48   2.63
            0 dB   0.44  0.44  0.45   0.54   0.56    2.07  2.09  2.52   2.90   3.02
            5 dB   0.58  0.58  0.61   0.67   0.69    2.50  2.51  2.97   3.29   3.37
         Average   0.38  0.38  0.39   0.47   0.50    1.80  1.89  2.25   2.66   2.81
Traffic   -10 dB   0.38  0.38  0.44   0.44   0.47    1.59  1.58  2.82   2.34   2.51
           -5 dB   0.51  0.50  0.56   0.58   0.60    2.04  2.04  3.27   2.73   2.88
            0 dB   0.63  0.63  0.68   0.70   0.71    2.55  2.55  3.62   3.12   3.26
            5 dB   0.74  0.74  0.79   0.79   0.80    3.06  3.06  3.86   3.52   3.62
         Average   0.48  0.48  0.52   0.56   0.58    2.31  2.31  3.39   2.93   3.07
Train     -10 dB   0.32  0.30  0.36   0.38   0.42    1.33  1.38  1.92   2.09   2.29
           -5 dB   0.43  0.43  0.47   0.51   0.54    1.82  1.83  2.55   2.61   2.75
            0 dB   0.55  0.54  0.58   0.63   0.65    2.33  2.34  3.03   3.06   3.18
            5 dB   0.65  0.65  0.69   0.73   0.75    2.78  2.79  3.37   3.42   3.54
         Average   0.49  0.48  0.53   0.56   0.59    2.06  2.08  2.72   2.79   2.94
Helicopter -10 dB  0.30  0.30  0.33   0.39   0.43    1.55  1.59  2.21   2.34   2.54
           -5 dB   0.41  0.41  0.45   0.52   0.54    1.89  1.91  2.71   2.74   2.87
            0 dB   0.53  0.53  0.59   0.64   0.66    2.33  2.34  3.17   3.15   3.26
            5 dB   0.66  0.65  0.72   0.75   0.76    2.76  2.76  3.53   3.51   3.60
         Average   0.47  0.47  0.52   0.58   0.60    2.13  2.15  2.91   2.93   3.06
SSN       -10 dB   0.17  0.16  0.20   0.24   0.29    1.22  1.41  1.95   1.88   2.17
           -5 dB   0.28  0.28  0.32   0.37   0.41    1.45  1.47  2.41   2.25   2.41
            0 dB   0.41  0.41  0.45   0.51   0.54    1.84  1.85  2.89   2.68   2.80
            5 dB   0.54  0.54  0.59   0.64   0.66    2.32  2.33  3.29   3.11   3.20
         Average   0.35  0.35  0.39   0.44   0.47    1.70  1.77  2.63   2.48   2.65
Overall            0.44  0.43  0.47   0.52   0.54    1.93  1.98  2.67   2.71   2.86
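Returning to the PACO baseline of Section III-B, the phase recursion of Eq. (18) amounts to a trapezoidal integration of the harmonic frequency track. A small sketch, interpreting L as the frame advance in samples (which Eq. (18) does not spell out), with a helper name of our own:

```python
import numpy as np

def phase_progression(psi_prev, f_prev, f_cur, L, fs):
    """Harmonic phase recursion of Eq. (18): trapezoidal integration of the
    (possibly time-varying) harmonic frequency over a hop of L samples."""
    return psi_prev + np.pi * (L / fs) * (f_cur + f_prev)

# With a constant 100 Hz harmonic and a 256-sample hop at 16 kHz, the phase
# advances by 2*pi*100*(256/16000) radians per frame, i.e. 1.6 cycles.
psi = phase_progression(0.0, 100.0, 100.0, L=256, fs=16000)
```

For a constant frequency the trapezoid collapses to the usual 2πf·(L/f_s) advance; when f_h changes between frames, averaging f_h(l) and f_h(l-1) keeps the accumulated phase consistent with the pitch-adaptive harmonic model.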
IV. RESULTS AND DISCUSSION

This section presents objective results for the intelligibility and quality of acoustic signals processed by the HDAG method in comparison to the SSFV, PACO and GTF_F0 baseline techniques. ESTOI [30] and ASII_ST [37] are considered to evaluate the speech intelligibility improvement, and PESQ [38] compares the quality of the competitive methods. Then, results for a perceptual test are presented in order to corroborate the objective evaluation.

The experimental scenario considers a subset² of the TIMIT [36] database to evaluate the competitive methods. The set considered is composed of 128 speech signals spoken by 8 male and 8 female speakers, sampled at 16 kHz and with 3 s average duration. The F0 reference values and voiced/unvoiced information for the training and test datasets are obtained from [48]. Six noises are used to corrupt the speech utterances: acoustic Babble and Traffic attained from RSG-10 [46], Cafeteria, Train and Helicopter from Freesound.org³, and Speech Shaped Noise (SSN) from the DEMAND [49] database. Experiments are conducted considering noisy signals with four SNR values (-10 dB, -5 dB, 0 dB and 5 dB). In this study, it is assumed that the FSFFE separation into high-pitch and low-pitch speech frames is perfect and generates no errors in the whole system.

¹ Code available at http://staffwww.dcs.shef.ac.uk/people/N.Ma/
² Available at: http://www.ee.ic.ac.uk/hp/staff/dmb/data/TIMITfxv.zip
³ [Online]. Available: https://freesound.org.

Fig. 5.
ΔASII_ST intelligibility enhancement [×10^-2] averaged for speech signals corrupted by the noises: (a) Babble, (b) Cafeteria, (c) Traffic, (d) Train, (e) Helicopter and (f) SSN.

A. Intelligibility and Quality Objective Evaluation

Table II shows the intelligibility and quality objective results with the ESTOI and PESQ measures, respectively. Note that the Babble and SSN noises present the most challenging scenarios among those evaluated in terms of intelligibility. For instance, the ESTOI values averaged over the SNR conditions for the UNP speech signals are 0.36 and 0.35 for these respective noises. Moreover, observe that the HDAG method achieves the best results for all 24 noise conditions, even in the most challenging scenarios with negative SNR values. The scores of HDAG are particularly interesting for the non-stationary noises, i.e., Babble and Cafeteria. For these noise sources, the ESTOI values attained are considerably higher than those of all the competing solutions for all SNR values. The highest ESTOI gain accomplished by HDAG, 13 p.p., can be observed for Helicopter noise with SNR = -10 dB. According to the overall average, the proposed solution outperforms the competitive approaches with an ESTOI of 0.54, against 0.52, 0.47 and 0.43 for GTF_F0, PACO and SSFV, respectively.

The PESQ score is computed here from the 30% most relevant harmonic frames of noisy speech. These frames are selected from those with the lowest signal-to-noise ratio values. Note that HDAG outperforms the competing approaches for most of the noisy speech conditions in terms of quality. The proposed solution achieves the highest PESQ, except for the Traffic and SSN (0 dB and 5 dB) noises. For instance, in Helicopter noise with SNR = -10 dB, the PESQ score attained by HDAG is 1.02 higher than UNP, followed by increments of 0.79, 0.66 and 0.04 presented by GTF_F0, PACO and SSFV, respectively.
In summary, the overall PESQ obtained with HDAG is 2.86, against 2.71 for the competing approach GTF_F0. Therefore, these results indicate that the proposed solution also provides quality improvement.

TABLE III
ASII_ST [×10⁻²] SCORES FOR UNP NOISY SPEECH

SNR       Babble  Cafeteria  Traffic  Train  Helicopter  SSN
-10 dB    23.1    24.3       35.8     34.6   34.1        19.3
-5 dB     26.6    27.9       39.2     39.9   40.2        23.7
0 dB      33.0    34.3       43.7     47.2   43.5        30.0
5 dB      40.9    42.3       47.5     54.9   51.0        37.9
Average   30.9    32.2       41.6     44.2   42.2        27.7

Table III presents the average ASII_ST results for the unprocessed (UNP) noisy speech signals. Here the SSN and Babble noises attain the lowest scores for the SNR value of -10 dB, with ASII_ST of 19.3 and 23.1, respectively. The ASII_ST values incremented by each competitive method (ΔASII_ST) are depicted in Fig. 5 for the six acoustic noises. Observe that the proposed solution accomplishes the highest scores for most conditions, except for Traffic (SNR = -10 dB). The best ΔASII_ST (10.1 × 10⁻²) is achieved for the challenging SSN noise at -10 dB. As with the ESTOI results, the SSFV approach does not present a noticeable ASII_ST increment. Moreover, for the non-stationary Cafeteria noise the proposed solution attains an average intelligibility enhancement of 5.4 × 10⁻², compared with 3.5 × 10⁻², 1.8 × 10⁻² and 0.3 × 10⁻² for the baselines GTF_F0, PACO and SSFV. Therefore, these results reinforce the robustness of the proposed method against several noisy masking effects.

Fig. 6.
Perceptual intelligibility evaluation with SSN additive acoustic noise for (a) male volunteers, (b) female volunteers and (c) overall scores. Each case denotes: 1-UNP, 2-SSFV, 3-PACO, 4-GTF_F0 and 5-HDAG.

B. Perceptual Intelligibility Evaluation

A subjective listening test [52] is conducted considering a scenario of phonetically balanced words⁴. Ten native male and ten native female Brazilian volunteers perform the test; their ages range from 19 to 57 years, with an average of 32. The SSN noise is adopted with SNRs of -5 dB, 0 dB and 5 dB. Ten words are applied for each of the 15 test conditions, i.e., three SNR levels and four methods plus the unprocessed case. Participants are introduced to the task in a training session with 4 words. The material is diotically presented using a pair of Roland RH-200S headphones. Listeners hear each word once, in an arbitrary presentation order, and are asked to indicate the word in a sheet list.

⁴The complete test database is available at lasp.ime.eb.br.

TABLE IV
NORMALIZED MEAN PROCESSING TIME

SSFV    PACO    GTF_F0    HDAG
0.32    0.67    0.89      1.00

The intelligibility results for each method are presented in Fig. 6. Each boxplot depicts the median and deviation of the scores (%) for one scenario, separating (a) the male volunteers, (b) the female volunteers, and (c) the overall scores. The proposed method accomplishes higher intelligibility than the competing approaches under all conditions. For male listeners, HDAG obtained average intelligibility scores of 66%, 85% and 93%, compared to 52%, 66% and 86% for the GTF_F0 technique, for SNR values of -5 dB, 0 dB and 5 dB, respectively. Furthermore, the female volunteers presented higher intelligibility rates than the male volunteers, mainly for -5 dB, with 75% and 65% for HDAG and GTF_F0.
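The per-listener scores summarized in Fig. 6 are word-recognition percentages. A minimal sketch of how such a score is computed from a listener's answer sheet (the function name and list representation are illustrative, not part of the test protocol):

```python
def words_correct_pct(responses, targets):
    """Percentage of test words correctly identified by one listener.
    `responses` are the words the listener wrote down; `targets` are
    the words actually presented, in the same order."""
    hits = sum(1 for r, t in zip(responses, targets) if r == t)
    return 100.0 * hits / len(targets)
```

Per condition (one SNR level and one method, ten words each), one such percentage is obtained per listener, and the boxplots then show the distribution of these percentages across listeners.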
The overall results show again the superiority of HDAG, with average scores of 71%, 86% and 92%, surpassing GTF_F0 (59%, 71% and 86%) and PACO (43%, 64% and 78%). In accordance with the findings of the objective measures ESTOI and ASII_ST, SSFV attains scores less than or equal to the UNP case.

C. Normalized Processing Time

Table IV indicates the computational complexity, which refers to the normalized processing time required by each method, evaluated for 512 samples per frame. These values are obtained with an Intel(R) Core(TM) i7-9700 CPU with 8 GB RAM, and are normalized by the execution time of the proposed HDAG solution. The processing time required for F0 estimation and accurate harmonic adjustment is also considered here. Note that the HDAG and GTF_F0 schemes present a longer processing time, since the FSFFE low/high pitch classification and the HHT-Amp estimation are based on the EEMD and demand a relevant computational cost.

V. CONCLUSION

This paper introduced the HDAG method for speech intelligibility enhancement of the harmonic components of noisy speech. It is composed of four main steps. First, the HHT-Amp technique is adopted to estimate the F0 from voiced frames. The FSFFE separation is used for the detection and adjustment of these estimates, improving their accuracy. Then, a selective Gammachirp filterbank is applied to the frames considering third-octave bands to best cover the regions most relevant to intelligibility. Finally, the filtered components are amplified by gain factors regulated by the low/high pitch classification. Extensive experiments were conducted to evaluate the intelligibility enhancement provided by the HDAG method and the competitive approaches. Six acoustic noises were considered with four SNR values. Three objective measures were adopted for the evaluation of speech intelligibility and quality.
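The normalization used in Table IV (each method's mean processing time divided by HDAG's) can be sketched as below. The dictionary of method callables and their names are hypothetical placeholders, not the authors' implementations; only the normalize-by-reference step reflects the table.

```python
import time

def normalized_times(methods, frames, reference="HDAG"):
    """Measure wall-clock processing time of each method over the test
    frames and normalize by the reference method's time (Table IV style).
    `methods` maps a name to a per-frame callable (illustrative)."""
    raw = {}
    for name, process_frame in methods.items():
        start = time.perf_counter()
        for frame in frames:
            process_frame(frame)
        raw[name] = time.perf_counter() - start
    ref_time = raw[reference]
    return {name: t / ref_time for name, t in raw.items()}
```

By construction the reference method maps to 1.00, and faster methods map to values below 1, matching the layout of Table IV.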
The results demonstrate that the HDAG method outperformed the competitive approaches, with higher intelligibility and quality in most noisy environments. A perceptual test with male and female listeners corroborated the objective results. Future research includes the investigation of the proposed method under other conditions, such as intelligibility enhancement for noisy reverberant speech.

REFERENCES

[1] N. R. French and J. C. Steinberg, "Factors governing the intelligibility of speech sounds," J. Acoust. Soc. Amer., vol. 19, pp. 90–119, 1947.
[2] P. Assmann and Q. Summerfield, "The perception of speech under adverse conditions," in Speech Processing in the Auditory System, pp. 231–308. Berlin, Germany: Springer, 2004.
[3] T. Gerkmann and R. C. Hendriks, "Unbiased MMSE-based noise power estimation with low complexity and low tracking delay," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 4, pp. 1383–1393, 2012.
[4] R. Tavares and R. Coelho, "Speech enhancement with nonstationary acoustic noise detection in time domain," IEEE Signal Process. Lett., vol. 23, no. 1, pp. 6–10, 2016.
[5] C. Medina, R. Coelho, and L. Zão, "Impulsive noise detection for speech enhancement in HHT domain," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 2244–2253, 2021.
[6] E. Dranka and R. Coelho, "Robust maximum likelihood acoustic energy based source localization in correlated noisy sensing environments," IEEE J. Sel. Topics Signal Process., vol. 9, no. 2, pp. 259–267, 2015.
[7] C. Evers and P. A. Naylor, "Acoustic SLAM," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 9, pp. 1484–1498, 2018.
[8] J. Martinez-Carranza and C. Rascon, "A review on auditory perception for unmanned aerial vehicles," Sensors, vol. 20, no. 24, 2020.
[9] A. Ljolje, "Speech recognition using fundamental frequency and voicing in acoustic modeling," in Proc. Int. Conf. Spoken Lang. Process., 2002.
[10] A. Venturini, L. Zão, and R. Coelho, "On speech features fusion, integration Gaussian modeling and multi-style training for noise robust speaker classification," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1951–1964, 2014.
[11] L. Zão, R. Coelho, and P. Flandrin, "Speech enhancement with EMD and hurst-based mode selection," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 5, pp. 899–911, 2014.
[12] P. C. Loizou and G. Kim, "Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 1, pp. 47–56, 2011.
[13] Y. Li and D. Wang, "On the optimality of ideal binary time–frequency masks," Speech Communication, vol. 51, no. 3, pp. 230–239, 2009.
[14] G. Kim and P. C. Loizou, "Improving speech intelligibility in noise using a binary mask that is based on magnitude spectrum constraints," IEEE Signal Process. Lett., vol. 17, no. 12, pp. 1010–1013, 2010.
[15] F. Farias and R. Coelho, "Blind adaptive mask to improve intelligibility of non-stationary noisy speech," IEEE Signal Process. Lett., vol. 28, pp. 1170–1174, 2021.
[16] C. Brown and S. Bacon, "Fundamental frequency and speech intelligibility in background noise," Hear. Res., vol. 266, pp. 52–59, 2010.
[17] D. Ealey, H. Kelleher, and D. Pearce, "Harmonic tunnelling: Tracking nonstationary noises during speech," in Proc. 7th Eur. Conf. Speech Commun. Technol., pp. 437–440, 2001.
[18] T. Wang, W. Zhu, Y. Gao, S. Zhang, and J. Feng, "Harmonic attention for monaural speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 2424–2436, 2023.
[19] S. Y. Barysenka and V. I. Vorobiov, "SNR-based inter-component phase estimation using bi-phase prior statistics for single-channel speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 31, pp. 2365–2381, 2023.
[20] L. Welling and H. Ney, "Formant estimation for speech recognition," IEEE Trans. Speech Audio Process., vol. 6, no. 1, pp. 36–48, 1998.
[21] L. Wang and F. Chen, "Factors affecting the intelligibility of low-pass filtered speech," in Proc. INTERSPEECH, pp. 563–566, Aug. 2017.
[22] L. Wang, D. Zheng, and F. Chen, "Understanding low-pass-filtered Mandarin sentences: Effects of fundamental frequency contour and single-channel noise suppression," J. Acoust. Soc. Amer., vol. 143, no. 3, pp. 141–145, 2018.
[23] K. Nathwani, M. Daniel, G. Richard, B. David, and V. Roussarie, "Formant shifting for speech intelligibility improvement in car noise environment," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), pp. 5375–5379, 2016.
[24] K. Nathwani, G. Richard, B. David, P. Prablanc, and V. Roussarie, "Speech intelligibility improvement in car noise environment by voice transformation," Speech Communication, vol. 91, pp. 17–27, 2017.
[25] E. Lombard, "Le signe de l'élévation de la voix," Maladies Oreille, Larynx, Nez, Pharynx, vol. 37, no. 25, pp. 101–119, 1911.
[26] A. Queiroz and R. Coelho, "F0-based gammatone filtering for intelligibility gain of acoustic noisy signals," IEEE Signal Process. Lett., vol. 28, pp. 1225–1229, 2021.
[27] L. Zão and R. Coelho, "On the estimation of fundamental frequency from nonstationary noisy speech signals based on Hilbert-Huang transform," IEEE Signal Process. Lett., vol. 25, no. 2, pp. 248–252, 2018.
[28] R. A. Lutfi and R. D. Patterson, "On the growth of masking asymmetry with stimulus intensity," J. Acoust. Soc. Amer., vol. 76, pp. 739–745, 1984.
[29] T. Irino and R. D. Patterson, "A time-domain, level-dependent auditory filter: The gammachirp," J. Acoust. Soc. Amer., vol. 101, no. 1, pp. 412–419, 1997.
[30] J. Jensen and C. H. Taal, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 11, pp. 2009–2022, 2016.
[31] R. Drullman, J. M. Festen, and R. Plomp, "Effect of temporal envelope smearing on speech reception," J. Acoust. Soc. Amer., vol. 95, pp. 1053–1064, 1994.
[32] R. Drullman, J. M. Festen, and R. Plomp, "Effect of reducing slow temporal modulation on speech reception," J. Acoust. Soc. Amer., vol. 95, pp. 2670–2680, 1994.
[33] T. M. Elliott and F. E. Theunissen, "The modulation transfer function for speech intelligibility," PLOS Comput. Biol., vol. 5, no. 3, pp. 1–14, 2009.
[34] A. Queiroz and R. Coelho, "Noisy speech based temporal decomposition to improve fundamental frequency estimation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 2504–2513, 2022.
[35] M. Khadem-hosseini, S. Ghaemmaghami, A. Abtahi, S. Gazor, and F. Marvasti, "Error correction in pitch detection using a deep learning based classification," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 990–999, 2020.
[36] J. Garofolo, L. Lamel, W. Fischer, J. Fiscus, D. Pallett, N. Dahlgren, and V. Zue, "TIMIT acoustic-phonetic continuous speech corpus," Linguist. Data Consortium, Philadelphia, PA, USA, 1993.
[37] R. C. Hendriks, J. B. Crespo, J. Jensen, and C. Taal, "Optimal near-end speech intelligibility improvement incorporating additive noise and late reverberation under an approximation of the short-time SII," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 5, pp. 851–862, 2015.
[38] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., vol. 2, pp. 749–752, 2001.
[39] J. Stahl and P. Mowlaee, "Exploiting temporal correlation in pitch-adaptive speech enhancement," Speech Communication, vol. 111, pp. 1–13, 2019.
[40] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N. C. Yen, C. C. Tung, and H. H. Liu, "The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis," Proc. Roy. Soc. London Ser. A: Math., Phys., Eng. Sci., vol. 454, no. 1971, pp. 903–995, 1998.
[41] M. E. Torres, M. A. Colominas, G. Schlotthauer, and P. Flandrin, "A complete ensemble empirical mode decomposition with adaptive noise," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., pp. 4144–4147, 2011.
[42] S. Gonzalez and M. Brookes, "A pitch estimation filter robust to high levels of noise (PEFAC)," in Proc. Eur. Signal Process. Conf. (EUSIPCO), pp. 451–455, 2011.
[43] L. Zão, R. Coelho, and P. Flandrin, "Speech enhancement with EMD and hurst-based mode selection," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 5, pp. 897–909, 2014.
[44] N. Chatlani and J. Soraghan, "EMD-based filtering (EMDF) of low-frequency noise for speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 4, pp. 1158–1166, 2012.
[45] I. R. Titze, Principles of Voice Production. Englewood Cliffs, NJ, USA: Prentice Hall, 1994.
[46] H. J. Steeneken and F. W. Geurtsen, "Description of the RSG-10 noise database," Report IZF, vol. 3, 1988.
[47] IEC 61260, Electroacoustics - Octave-band and fractional-octave-band filters. Geneva, Switzerland: International Electrotechnical Commission, 1995.
[48] S. Gonzalez, "Pitch of the core TIMIT database set," 2014.
[49] J. Thiemann, N. Ito, and E. Vincent, "DEMAND: a collection of multi-channel recordings of acoustic noise in diverse environments," in Proc. Meetings Acoust., 2013.
[50] J. Junqua, "The Lombard reflex and its role on human listeners and automatic speech recognizers," J. Acoust. Soc. Amer., vol. 93, no. 1, pp. 510–524, 1993.
[51] L. Rabiner and R. Schafer, Digital Processing of Speech Signals. Prentice Hall, 1978.
[52] S. Ghimire, "Speech intelligibility measurement on the basis of ITU-T recommendation P.863," 2012.