Speech Enhancement in Adverse Environments Based on Non-stationary Noise-driven Spectral Subtraction and SNR-dependent Phase Compensation

Md Tauhidul Islam^a, Asaduzzaman^b, Celia Shahnaz^b,*, Wei-Ping Zhu^c, M. Omair Ahmad^c

^a Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas, USA-77840
^b Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka-1000, Bangladesh
^c Department of Electrical and Computer Engineering, Concordia University, Montreal, Quebec H3G 1M8, Canada

Abstract

A two-step enhancement method based on spectral subtraction and phase spectrum compensation is presented in this paper for noisy speech in adverse environments involving non-stationary noise and medium to low levels of SNR. In the first step of the proposed method, the magnitude of the noisy speech spectrum is modified by a spectral subtraction approach, where a new noise estimation method based on the low-frequency information of the noisy speech is introduced. We argue that this method of noise estimation is capable of estimating non-stationary noise accurately. In the second step, the phase spectrum of the noisy speech is modified by phase spectrum compensation, where an SNR-dependent approach is incorporated to determine the amount of compensation to be imposed on the phase spectrum. A modified complex spectrum is obtained by aggregating the magnitude from the spectral subtraction step and the modified phase spectrum from the phase compensation step, which is found to be a better representation of the enhanced speech spectrum. Speech files available in the NOIZEUS database are used to carry out extensive simulations for evaluation of the proposed method.
Keywords: Speech enhancement, spectral subtraction, magnitude compensation, phase compensation, noise estimation

1. Introduction

Corruption of speech signals by additive or multiplicative noise deteriorates the performance of speech processing applications such as speech communication and speech recognition. Removing the disturbing noise while preserving the speech is desirable for proper operation of these systems. Many speech enhancement methods have been proposed to achieve this goal. Spectral subtraction [1, 2, 3, 4, 5], Wiener filter [6, 7], minimum mean square error (MMSE) estimator [8, 9], subspace based methods [6, 7], thresholding methods based on wavelet transform [10, 11, 12, 13] and Kalman filtering [14] are the prominent ones. Subspace and wavelet based approaches are computationally slow. Frequency-domain methods are computationally fast, but most of them need an estimate of the noise to perform speech enhancement. Low computation with good performance in stationary noise makes spectral subtraction a very attractive and widely used method.

In spectral subtraction, the noise spectrum is estimated and subtracted from the noisy speech spectrum. If the noise does not vary with time, i.e., the noise is stationary, the method works well. However, in the presence of non-stationary noise, the performance of this method degrades because of its inability to estimate the noise properly. Another problem of this method is the presence of musical noise, a noise of increasing variance. The first problem is the main concern of [15], where noise is estimated based on the high-order Yule-Walker equations without finding the non-speech frames. This method can track non-stationary noise but is computationally exhaustive.
Another method, named minimum statistics based spectral subtraction [16], can estimate non-stationary noise with less computation, but it depends on the noise estimates of past frames, which sometimes produces wrong estimates or leads to speech distortion. In [17], the noise spectrum is estimated based on information in the high-frequency spectrum of the current frame. This method requires a very high sampling rate, which creates significant problems in the context of speech processing applications.

Corresponding author: Celia Shahnaz. Preprint submitted to Elsevier, February 18, 2018.

In the above mentioned methods, although the spectrum of the noisy speech is complex-valued, only the magnitude is modified based on the estimate of the noise spectrum and the phase remains unchanged. This was done for a long time based on the assumption that the human auditory system is phase-deaf, i.e., cannot differentiate a change of phase, until the authors in [18] showed that the phase spectrum can also be very useful in speech enhancement. The authors used the phase spectrum in a spectral subtraction based approach to obtain enhanced speech. Later, the authors in [19, 20] also used this idea for speech enhancement. But these methods did not consider the magnitude spectrum at all. In this paper, we consider both magnitude and phase spectra and compensate both of them based on the noise characteristics. We develop a noise estimation approach that can track the time variation of non-stationary noise for magnitude spectrum compensation. The phase spectrum is compensated in an SNR-dependent phase compensation step.
We aggregate the modified magnitude and phase from these two steps and we find this modified complex spectrum effective in producing enhanced speech of improved quality with minimal speech distortion as compared to some of the state-of-the-art speech enhancement methods.

The paper is organized as follows. Section 2 presents the proposed method. Section 3 describes results. Concluding remarks are presented in Section 4.

2. Problem Formulation and Proposed Method

In any analysis, modification and synthesis (AMS) framework, noisy speech frames are first transformed by a transformation method. Then modifications are carried out in the transformed domain and finally, the inverse transform followed by the overlap-add method is performed to reconstruct the enhanced speech. The proposed method is based on the AMS framework, where speech is analyzed, modified and synthesized frame-wise.

In the presence of additive noise $d[n]$, a clean speech signal $x[n]$ gets contaminated and produces noisy speech $y[n]$. The noisy speech can be segmented into overlapping frames by using a sliding window. The $\ell$-th windowed noisy speech frame can be expressed in the time domain as

$$y_\ell[n] = x_\ell[n] + d_\ell[n], \quad 1 \le \ell \le T \tag{1}$$

where $T$ is the total number of speech frames. If $Y_\ell[k]$, $X_\ell[k]$ and $D_\ell[k]$ are the short-time Fourier transform (STFT) representations of $y_\ell[n]$, $x_\ell[n]$ and $d_\ell[n]$, respectively, we can write

$$Y_\ell[k] = X_\ell[k] + D_\ell[k] \tag{2}$$

where $k = 0, 1, 2, \ldots, N-1$ and $N$ is the total number of samples in a frame. The $N$-point STFT $Y_\ell[k]$ of $y_\ell[n]$ can be computed as

$$Y_\ell[k] = \sum_{n=0}^{N-1} y_\ell[n] \, e^{-j 2 \pi n k / N} \tag{3}$$

The Fourier transform of the noisy speech frame, $Y_\ell[k]$, is modified in the proposed method to obtain an estimate of the clean speech spectrum $X_\ell[k]$.
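As a rough illustration of the analysis stage, the sketch below frames a signal with a sliding window and evaluates the N-point transform of Eq. (3) term by term, then checks the additivity stated in Eq. (2). It is a minimal pure-Python sketch and not the authors' MATLAB implementation; the helper names (`frame_signal`, `hamming`, `dft`, `windowed_spectra`), the toy signals, and the frame length of 8 with a hop of 4 are assumptions made only for illustration.

```python
import cmath
import math


def frame_signal(y, frame_len, hop):
    """Split y into overlapping frames; a tail shorter than frame_len is dropped."""
    return [y[s:s + frame_len] for s in range(0, len(y) - frame_len + 1, hop)]


def hamming(n_samples):
    """Hamming window of length n_samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def dft(frame):
    """N-point transform of one frame, written out term by term as in Eq. (3)."""
    n_samples = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * n * k / n_samples)
                for n in range(n_samples))
            for k in range(n_samples)]


def windowed_spectra(signal, frame_len, hop):
    """Window every frame of the signal and transform it."""
    win = hamming(frame_len)
    return [dft([w * s for w, s in zip(win, frame)])
            for frame in frame_signal(signal, frame_len, hop)]


# Toy data (assumed): a sinusoid standing in for speech plus an alternating "noise".
x = [math.sin(2 * math.pi * 0.05 * n) for n in range(32)]
d = [0.1 * (-1) ** n for n in range(32)]
y = [xn + dn for xn, dn in zip(x, d)]

X = windowed_spectra(x, frame_len=8, hop=4)
D = windowed_spectra(d, frame_len=8, hop=4)
Y = windowed_spectra(y, frame_len=8, hop=4)

# Additivity of Eq. (2): Y[k] = X[k] + D[k] in every frame, up to rounding error.
max_err = max(abs(Y[i][k] - (X[i][k] + D[i][k]))
              for i in range(len(Y)) for k in range(8))
```

In practice an FFT routine would replace the direct sum; the explicit form is used here only so that the code mirrors Eq. (3).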
An overview of the proposed speech enhancement method is shown by a block diagram in Fig. 1. It is seen from Fig. 1 that the Fourier transform is first applied to each input speech frame. The magnitude of the Fourier spectrum is modified by a spectral subtraction method based on non-stationary noise estimation, which we call step-1. The modified magnitude from step-1 is then combined with the unchanged phase to obtain the modified complex spectrum. Using the inverse fast Fourier transform (IFFT) and overlap-add, an intermediate speech signal is obtained. The spectrum of the intermediate speech is sent to step-2, which consists of phase spectrum compensation (PSC) [18]. PSC modifies the phase spectrum based on the SNR of the intermediate speech. Using the modified phase spectrum with the modified magnitude spectrum from the first step, we obtain an enhanced complex spectrum. Finally, using IFFT and overlap-add, the enhanced speech is constructed. The full AMS process is carried out for both steps to retain full flexibility in using different window sizes and parameters.

Figure 1: Block diagram of the proposed method.

2.1. Magnitude Spectrum Compensation by Spectral Subtraction Based on Non-stationary Noise Estimation

In this section, the magnitude spectrum of the noisy speech is modified based on the estimation of non-stationary noise. Unlike conventional methods, the estimate of the noise spectrum is updated in every silence period and the low-frequency region of the magnitude spectrum is taken into consideration in order to compensate for the noise estimation errors that may be induced when the additive noise is non-stationary, i.e., changes its amplitude drastically with time.
We propose to obtain an estimate of $|X_\ell[k]|$, denoted $|Z_\ell[k]|$, as

$$|Z_\ell[k]| = \begin{cases} H_\ell[k] & \text{if } H_\ell[k] > 0 \\ \beta_s |Y_\ell[k]| & \text{otherwise} \end{cases} \tag{4}$$

where

$$H_\ell[k] = |Y_\ell[k]| - \alpha_\ell |D_S[k]| \tag{5}$$

In (4), $\beta_s$ refers to the spectral floor parameter introduced to prevent any negative value in $|Z_\ell[k]|$. In (5), $\alpha_\ell$ symbolizes the tracking factor, which tracks the change of amplitude of the non-stationary noise spectrum with time, and $|D_S[k]|$ denotes the noise spectrum estimated in the previous silence frame. In the proposed spectral subtraction based noise reduction scheme, the noise spectrum estimated from the beginning silence frames is updated during each silence period as follows

$$|D_S[k]| = \begin{cases} \dfrac{|Y_1[k]| + \cdots + |Y_{N_s}[k]|}{N_s} & \text{for } S = 1 \\ v_n |D_P[k]| + (1 - v_n) |Y_S[k]| & \text{otherwise} \end{cases} \tag{6}$$

where $N_s$ is the number of initial silence frames, $P$ refers to the index of the previous silence frame with respect to $S$, and $v_n$ is the forgetting factor.

Considering that this estimate of the noise power spectrum is updated only during a silence period while the noise may change drastically with time, it is insufficient to use a constant value of the tracking factor to compensate for the errors induced in the noise spectrum to be subtracted from the noisy speech spectrum at each frame. In order to track the time variation of the noise, $\alpha_\ell$ should be adjusted at each frame after a silence period. According to the spectral characteristics of human speech, the low-frequency band, typically from 0 to 50 Hz, contains no speech information. Thus, for noisy speech, the low-frequency band, say [0, 50] Hz, contains only noise. In view of this fact, in order to change the value of $\alpha_\ell$ for the $\ell$-th frame, we propose to use the ratio of $|Y_\ell[k]|$ and $|D_S[k]|$ in the low-frequency band $\delta$ as

$$\alpha_\ell = \rho \, \frac{|Y_\ell[k]|}{|D_S[k]|}, \quad k \in \delta \tag{7}$$

where $\delta = [0, 50]$ Hz and $\rho$ is a constant determined empirically.
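The magnitude step described above (noise update during silence, tracking factor from the low-frequency band, subtraction with a spectral floor) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names and the variable names `alpha`, `beta_s` and `rho` are labels chosen here for the tracking factor, the spectral floor and the empirical constant; aggregating the low-band ratio by summing magnitudes over the band bins is an assumption; and the fallback used when the noise estimate is zero in that band is likewise an assumption.

```python
def initial_noise_estimate(silence_mags):
    """Average the magnitude spectra of the initial silence frames (first case of Eq. (6))."""
    n_frames = len(silence_mags)
    return [sum(column) / n_frames for column in zip(*silence_mags)]


def update_noise_estimate(prev_noise, silence_mag, v_n=0.167):
    """Recursive update in a later silence frame (second case of Eq. (6))."""
    return [v_n * d_prev + (1.0 - v_n) * y_s
            for d_prev, y_s in zip(prev_noise, silence_mag)]


def tracking_factor(noisy_mag, noise_mag, low_bins, rho=0.1):
    """Low-band ratio of Eq. (7).

    ASSUMPTION: the ratio is formed from magnitudes summed over low_bins.
    ASSUMPTION: if the noise estimate is zero in the band, rho is returned.
    """
    numerator = sum(noisy_mag[k] for k in low_bins)
    denominator = sum(noise_mag[k] for k in low_bins)
    return rho * numerator / denominator if denominator > 0 else rho


def spectral_subtract(noisy_mag, noise_mag, alpha, beta_s=0.1):
    """Eqs. (4)-(5): subtract the scaled noise estimate, flooring non-positive results."""
    cleaned = []
    for y_k, d_k in zip(noisy_mag, noise_mag):
        h_k = y_k - alpha * d_k
        cleaned.append(h_k if h_k > 0 else beta_s * y_k)
    return cleaned


# Toy usage with made-up magnitude spectra (3 bins, bin 0 treated as the low band).
noise = initial_noise_estimate([[1.0, 2.0, 1.0], [3.0, 4.0, 1.0]])
noise = update_noise_estimate(noise, [2.0, 3.0, 1.0])
frame_mag = [4.0, 5.0, 0.5]
alpha = tracking_factor(frame_mag, noise, low_bins=[0], rho=1.0)
enhanced_mag = spectral_subtract(frame_mag, noise, alpha)
```

Keeping the noise update, the tracking factor and the subtraction in separate functions mirrors the text: the first two depend on the voice activity decision and the low band, while the last one is applied to every frame.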
In the low-frequency band of the $\ell$-th frame, the variation of the noisy speech spectrum is equivalent to the noise spectrum of that frame. Thus, use of $\alpha_\ell$ defined in (7) serves as a relative weighting factor with respect to the estimated noise spectrum $|D_S[k]|$, leading to reasonable tracking of the time variation of the noise if it is non-stationary. Please note that the voice activity detector from [1] is used in the proposed scheme for detecting the speech and silence frames. Aggregating the modified magnitude spectrum with the unchanged phase of the noisy speech, we obtain a modified complex spectrum as

$$Z_\ell[k] = |Z_\ell[k]| \, e^{j \angle Y_\ell[k]} \tag{8}$$

After applying the IFFT to $Z_\ell[k]$ and overlap-adding the real part of the resulting signal, we obtain the time-domain intermediate speech $z[n]$.

2.2. SNR Dependent Phase Spectrum Compensation

If we apply the STFT to $z[n]$, we obtain $Z_t[k]$, where $t$ is the frame number for step-2. In step-2, the modified complex spectrum $Z_t[k]$ is modified in such a way that the low-energy components cancel out more than the high-energy components. The modified complex spectrum thus obtained is a better representation of $X_t[k]$:

$$\hat{X}_t[k] = |Z_t[k]| \, e^{j \angle (Z_t[k] + \Lambda_t[k])} \tag{9}$$

$z_t[n]$, the $t$-th frame of the intermediate speech, is a real-valued signal and therefore its FFT is conjugate symmetric, i.e.,

$$Z_t[k] = Z_t^*[N_t - k] \tag{10}$$

where $N_t$ is the number of samples in a frame in step-2. The conjugates are obtained as a result of applying the FFT to $z_t[n]$; they arise naturally from the symmetry of the magnitude spectrum and the anti-symmetry of the phase spectrum. During the IFFT operation needed for synthesis of the enhanced speech, the conjugates are summed together to produce a larger real-valued signal.
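To make the summation of conjugates concrete, the following short derivation writes out the contribution of a single conjugate pair to the inverse transform. It assumes the usual $1/N_t$ normalization of the inverse DFT and an integer sample index $n$, and uses only the symmetry in (10):

$$\frac{1}{N_t}\left( Z_t[k] \, e^{j 2 \pi n k / N_t} + Z_t[N_t - k] \, e^{j 2 \pi n (N_t - k) / N_t} \right) = \frac{1}{N_t}\left( Z_t[k] \, e^{j 2 \pi n k / N_t} + Z_t^*[k] \, e^{-j 2 \pi n k / N_t} \right) = \frac{2}{N_t} \, |Z_t[k]| \cos\!\left( \frac{2 \pi n k}{N_t} + \angle Z_t[k] \right)$$

The two terms are complex conjugates of each other, so their imaginary parts cancel and their real parts add. If the phases of the two bins are changed so that they are no longer conjugates, the real parts add only partially, which is the effect that phase spectrum compensation exploits.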
If the conjugates are modified, the degree to which they sum together can be influenced, and this can contribute constructively or destructively to the reconstruction of the enhanced time-domain speech. We propose the degree of phase spectrum compensation to be dependent on the SNR estimate of the current frame, thus facilitating the handling of time- and frequency-varying non-stationary noise conditions. For this purpose, we formulate a phase spectrum compensation function as

$$\Lambda_t[k] = \lambda \, \Psi[k] \, V_t \tag{11}$$

where $V_t$ is the root mean square value of $\mathbf{Z}_t$, with $\mathbf{Z}_t = (Z_t[1], \ldots, Z_t[N_t])^T$ [18]. In (11), $\lambda$ is a real-valued constant and $\Psi[k]$ represents a weighting function expressed as

$$\Psi[k] = \begin{cases} 1 & \text{if } 0 < k/N < 1/2 \\ -1 & \text{if } 1/2 < k/N < 1 \\ 0 & \text{otherwise} \end{cases} \tag{12}$$

Here, zero weighting is assigned to the values of $k$ corresponding to the non-conjugate vectors of the FFT, such as $k = 0$ and $k = N/2$ if $N$ is even.

2.2.1. SNR Dependent $\lambda$

Unlike [18], instead of considering $\lambda$ as a constant, we propose to determine it as

$$\lambda = 3 \ldots \tag{13}$$

where $\gamma$ is defined as

$$\gamma = \frac{|Y_t[k]|^2}{V_t^2} \tag{14}$$

The right-hand side of (14) is the a posteriori SNR of the $t$-th intermediate speech frame, and the plot of $\lambda$ against SNR is shown in Fig. 2. It is seen from Fig. 2 that if the SNR increases, the value of $\lambda$ decreases and the phase compensation becomes smaller, so that there is no distortion in the signal. On the contrary, when the noise increases to a higher level, $\lambda$ increases. As a result, the phase compensation on the signal increases and denoising is obtained to a significant extent.

Figure 2: $\lambda$ as a function of a posteriori SNR.
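The effect of the compensation function on strong and weak conjugate pairs can be checked numerically. The sketch below applies Eqs. (9), (11) and (12) to one toy frame spectrum. It is a minimal sketch, not the authors' implementation: `lam` is passed in as a plain parameter, so the SNR-dependent rule of (13) is not reproduced; `v_t` is computed as the RMS of all bins of the frame; and the helper names and toy values are assumptions chosen for illustration.

```python
import cmath
import math


def psc_weighting(n_fft):
    """Anti-symmetric weighting of Eq. (12): +1 below N/2, -1 above, 0 at k = 0 and k = N/2."""
    psi = [0.0] * n_fft
    for k in range(1, n_fft):
        if 2 * k < n_fft:
            psi[k] = 1.0
        elif 2 * k > n_fft:
            psi[k] = -1.0
    return psi


def rms(values):
    """Root mean square of the magnitudes of a sequence."""
    return math.sqrt(sum(abs(v) ** 2 for v in values) / len(values))


def phase_compensate(spectrum, lam):
    """Keep each magnitude, replace each phase by the phase of Z[k] + Lambda[k] (Eq. (9))."""
    v_t = rms(spectrum)                      # ASSUMPTION: RMS over all bins of this frame
    psi = psc_weighting(len(spectrum))
    compensated = []
    for z_k, psi_k in zip(spectrum, psi):
        offset = lam * psi_k * v_t           # real-valued compensation of Eq. (11)
        compensated.append(abs(z_k) * cmath.exp(1j * cmath.phase(z_k + offset)))
    return compensated


def pair_gain(original, modified, k):
    """Amplitude of the real signal produced by bins k and N-k, relative to the original pair."""
    n_fft = len(original)
    return abs(modified[k] + modified[n_fft - k].conjugate()) / (2 * abs(original[k]))


# Toy conjugate-symmetric spectrum (assumed values): one strong pair and one weak pair.
spectrum = [0j] * 8
spectrum[1], spectrum[7] = 10 + 0j, 10 + 0j   # strong conjugate pair
spectrum[2], spectrum[6] = 0.5j, -0.5j        # weak conjugate pair

compensated = phase_compensate(spectrum, lam=1.0)
strong_gain = pair_gain(spectrum, compensated, 1)
weak_gain = pair_gain(spectrum, compensated, 2)
```

With these toy values the strong pair is left essentially untouched while the weak pair is heavily attenuated, which is the behaviour the text attributes to the compensation; a larger `lam` pushes more pairs into the attenuated regime.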
Since the estimate of the noise magnitude spectrum $V_t$ is constant, introduction of the weighting function $\Psi[k]$ in (11) produces an anti-symmetric compensation function $\Lambda_t[k]$ that acts as the cause for changing the angular phase relationship in order to achieve noise cancellation during synthesis. Although a detailed explanation of the phase compensation method is given in [18], we revisit the explanation for clarity of our method. Explanations for two cases of a single conjugate pair and their corresponding modifications, i.e., when the estimated speech vector from the first step is greater and smaller than the phase compensation function, are presented in Fig. 3, where both the time and frequency indexes are omitted for convenience and clarity. We will denote the phase compensation function as $\Lambda$, the two conjugates of $Z[k]$ as $Z$ and $Z^*$, and those of $X[k]$ as $X$ and $X^*$.

For the representation in Fig. 3(a), the magnitudes of $Z$ and $Z^*$ are considered larger than $\Lambda$. Column one of Fig. 3(a) shows the conjugate vectors $Z$ and $Z^*$ as well as their summation vector $Z + Z^*$; in column two, the real parts of $Z$ and $Z^*$ are shown to be offset by $\Lambda$ and $-\Lambda$, respectively. Altering the angles of the vectors $Z$ and $Z^*$ while keeping their magnitudes unchanged thus produces vectors $X$ and $X^*$, respectively. It is seen from column three that the vector $X + X^*$ is produced as a result of adding the modified vectors $X$ and $X^*$. Column four demonstrates the real part of the addition vector $X + X^*$, while its imaginary part is discarded in order to avoid complex time-domain frames after the IFFT operation. Comparing columns one and four of Fig. 3(a), it is clear that a limited change of the original signal occurs if $Z$ and $Z^*$ are greater than $\Lambda$. In Fig. 3(b), a similar illustration is shown by considering $Z$ and $Z^*$ smaller than $\Lambda$, and it is found that a significant change of the original signal occurs. Since $\Lambda$ is anti-symmetric, the angles of the conjugate pair in each case of Fig. 3 are pushed in opposite directions, one towards 0 radian and the other towards $\pi$ radian. The further they are pushed apart, the more out of phase they become. This justifies that the FFT spectrum of noisy speech with larger magnitude undergoes less attenuation and that with smaller magnitude undergoes more.

Figure 3: Phase spectrum compensation (a) when $|Z| > \Lambda$ (b) when $|Z| < \Lambda$.

2.3. Resynthesis of Enhanced Signal

The enhanced speech frame is synthesized by performing the IFFT on the resulting $\hat{X}_t[k]$,

$$\hat{x}_t[n] = \mathrm{Re}\left\{ \mathrm{IFFT}\left( \hat{X}_t[k] \right) \right\} \tag{15}$$

where $\mathrm{Re}(\cdot)$ denotes the real part of the number inside it and $\hat{x}_t[n]$ represents the enhanced speech frame. The final enhanced speech is synthesized by using the standard overlap-add method [21].

3. Results

In this section, a number of simulations are carried out to evaluate the performance of the proposed method.

3.1. Implementation

The proposed method, which we call non-stationary noise-driven spectral subtraction with SNR-dependent phase compensation (NSSP), is implemented in the MATLAB R2016b graphical user interface development environment (GUIDE). The MATLAB software with its user manual is attached as supplementary material with the paper. This software also includes implementations of some recent methods, i.e., multi-band spectral subtraction (MBSS) [22], phase spectrum compensation (PSC) [18] and soft mask estimator with posteriori SNR uncertainty (SMPO) [23].

Table 1: Constants used in the spectral subtraction step

Constant     Value
$\beta_s$    0.1
$v_n$        0.167
$\rho$       0.1
The implementations of MBSS, PSC and SMPO have been taken from publicly available and trusted sources. The MATLAB implementations of the calculations of segmental and overall SNR improvement are taken from [24].

3.2. Simulation Conditions

Real speech sentences from the NOIZEUS database are employed for the experiments, where the speech data are sampled at 8 kHz. To imitate a noisy environment, a noise sequence is added to the clean speech samples at different SNR levels ranging from 10 dB down to -20 dB. Two different types of noise, babble and street, are adopted from the NOIZEUS database. In order to obtain overlapping analysis frames in the spectral subtraction step, Hamming windowing is performed, where the size of each frame is 96 samples with 50% overlap between successive frames. In the phase compensation step, Griffin and Lim's modified Hanning window is used and the size of each frame is 256 samples with 25% overlap. Values of the constants used in the first step are given in Table 1.

3.3. Comparison Metrics

Standard objective metrics [39], namely, segmental SNR (SNRSeg) improvement in dB, overall SNR improvement in dB and perceptual evaluation of speech quality (PESQ), are used for the evaluation of the proposed NSSP method. The proposed method is subjectively evaluated in terms of the spectrogram representations of the clean speech, noisy speech and enhanced speech. Formal listening tests are also carried out in order to find the analogy between the objective metrics and subjective sound quality. The performance of our method is compared with MBSS [22], PSC [24] and SMPO [23] in both objective and subjective senses.

3.4. Objective Evaluation

3.4.1. Results for speech signals with street noise

SNRSeg improvement, overall SNR improvement and PESQ scores for speech signals corrupted with street noise for MBSS, PSC, SMPO and NSSP are shown for an SNR range of -20 dB to 10 dB in Figs. 4 and 5 and Table 2. In Fig. 4, we see that the SNRSeg improvement for NSSP is the highest at the lowest SNR of -20 dB. The nearest SNRSeg improvement is shown by SMPO, which is almost half of the SNRSeg improvement by NSSP. With increasing SNR, the SNRSeg improvement for NSSP decreases. But at the highest SNR of 10 dB, NSSP shows an SNRSeg improvement of 4.1 dB, which is much better than MBSS, SMPO and PSC. Another interesting fact is that the SNRSeg improvement for NSSP increases monotonically with decreasing SNR, whereas the SNRSeg improvement for the other methods increases up to an SNR of 5 dB and then starts to decrease. The higher SNRSeg improvement of the proposed NSSP method at all SNRs attests that NSSP can enhance noisy speech better than the other competing methods in favorable as well as adverse environments.

In Fig. 5, where we plot the overall SNR improvement for all the methods for the SNR range of -20 to 10 dB, we see that NSSP provides an excellent overall SNR improvement of 14 dB at an SNR level of -20 dB. The other methods provide overall SNR improvements of 11, 9 and 8.2 dB at that SNR level. NSSP continues to provide higher overall SNR improvement up to 0 dB; from there to 10 dB SNR, it provides competitive improvements in comparison to PSC and better improvements than SMPO and MBSS.

Figure 4: SNRSeg improvement for different methods in street noise.

Table 2: PESQ for different methods

SNR (dB)   MBSS   PSC    SMPO   NSSP
-20        1.15   1.16   1.35   1.30
-15        1.37   1.23   1.47   1.51
-10        1.51   1.32   1.65   1.65
-5         1.69   1.43   1.77   1.83
0          2.07   1.69   1.89   1.74
5          2.38   1.93   2.57   2.45
10         2.60   2.14   2.78   2.69

Figure 5: Overall SNR improvement for different methods in street noise.

PESQ values for different methods for all the SNR levels for street noise-corrupted speech are shown in Table 2. For a higher SNR such as 10 dB, we see that all the methods provide better PESQ. But with decreasing SNR, the PESQ values for all the cases start to fall. The proposed method provides very competitive PESQ values for all SNR levels in comparison to SMPO but performs better than the other two competing methods. As the PESQ value indicates the perceptual quality of the enhanced speech, this table shows that the proposed method provides better enhanced speech in street noise corrupted speech at high as well as low SNRs than MBSS and PSC.

3.4.2. Results for speech signals with multi-talker babble noise

SNRSeg improvement, overall SNR improvement and PESQ scores for speech signals corrupted with babble noise for MBSS, PSC, SMPO and NSSP are shown in Figs. 6, 7 and 8. In Fig. 6, the performance of NSSP is compared with the performance of the other methods at different levels of SNR in terms of SNRSeg improvement. From this figure, we see that the SNRSeg improvement in dB increases as SNR decreases for NSSP for all SNR levels. This is not true for MBSS, PSC and SMPO. Below an SNR level of 10 dB, most of these three methods start to lose overall SNR improvement.
At a low SNR of -20 dB, NSSP yields the highest SNRSeg improvement of more than 6 dB. Such larger values of SNRSeg improvement at a low level of SNR attest the capability of NSSP in producing enhanced speech with better quality for speech severely corrupted by babble noise.

The overall SNR improvements of MBSS, PSC, SMPO and NSSP are shown in Fig. 7, where it is seen that NSSP provides an improvement of almost 18 dB at an SNR level of -20 dB, which is significantly better than the other methods. This trend continues up to 0 dB. After that, NSSP provides competitive values in comparison to PSC and SMPO.

PESQ values for different methods are shown in Fig. 8 for noisy speech in babble noise. We see from this figure that although NSSP provides competitive PESQ scores in comparison to the other methods for SNR levels of -20 to -10 dB, it provides higher PESQ scores for all other SNR levels.

Figure 6: SNRSeg improvement for different methods in babble noise.

3.5. Subjective Evaluation

To evaluate the performance of the proposed method and the other competing methods subjectively, we use two commonly used tools. The first one is the plot of the spectrograms of the outputs of all the methods, which lets us compare their performance in terms of preservation of harmonics and capability to remove noise. The spectrograms of the clean speech, the noisy speech, and the enhanced speech signals obtained by using the proposed method and all other methods are presented in Fig. 9 for street noise corrupted speech at an SNR of 10 dB. It is evident from the spectrograms that the proposed method preserves the harmonics significantly better than all the other competing methods.
The noise is also reduced at every time point for the proposed method, which attests our claim of better performance in terms of higher SNRSeg improvement, higher overall SNR improvement and higher PESQ values in the objective evaluation. Another collection of spectrograms for the proposed method and the other methods for speech signals corrupted with babble noise is shown in Fig. 10. This figure also attests that our proposed method has better performance in terms of harmonics preservation and noise removal in the presence of babble noise.

Figure 7: Overall SNR improvement for different methods in babble noise.

Figure 8: PESQ for different methods in babble noise.

The second tool we used for subjective evaluation of the proposed method and the competing methods is formal listening tests. We add street and babble noises to all thirty speech sentences of the NOIZEUS database at -20 to 10 dB SNR levels and process them with all the competing methods. We allow ten listeners to listen to the enhanced speech from these methods and evaluate it subjectively. Following [13] and [25], we use SIG, BAK and OVL scales on a range of 1 to 5. The details of these scales and the procedure of this listening test are discussed in [13]. More details on this testing methodology can be obtained from [26]. We show the mean scores of the SIG, BAK and OVL scales for all the methods for speech signals corrupted with 10 dB street noise in Tables 3, 4 and 5 and for speech signals corrupted with 10 dB babble noise in Tables 6, 7 and 8.
The higher values for the proposed method in comparison to the other methods in these tables clearly attest that the proposed method is better than the competing methods in terms of lower signal distortion (higher SIG scores), efficient noise removal (higher BAK scores) and overall sound quality (higher OVL scores) for all SNR levels.

Table 3: Mean scores of SIG scale for different methods in presence of street noise at 10 dB

Listener   MBSS   PSC   SMPO   NSSP
1          4.2    3.5   4.0    4.1
2          3.8    3.4   3.8    3.9
3          4.0    3.4   4.1    4.3
4          4.1    3.9   4.2    4.6
5          3.2    3.3   3.9    4.5
6          3.4    3.2   4.6    3.4
7          3.5    3.4   3.8    4.3
8          3.6    3.2   4.1    4.3
9          3.4    3.2   4.5    3.4
10         3.7    3.9   4.8    4.5

Table 4: Mean scores of BAK scale for different methods in presence of street noise at 10 dB

Listener   MBSS   PSC   SMPO   NSSP
1          4.2    4.1   4.5    5.0
2          4.4    4.2   4.9    4.8
3          4.1    4.3   4.4    4.5
4          4.2    4.5   4.7    4.6
5          4.2    4.4   4.8    4.7
6          4.4    3.7   4.6    4.5
7          3.2    3.5   3.9    4.4
8          4.4    4.2   4.6    4.5
9          3.9    3.8   3.8    4.6
10         4.4    4.1   4.5    4.6

Table 5: Mean scores of OVL scale for different methods in presence of street noise at 10 dB

Listener   MBSS   PSC   SMPO   NSSP
1          4.2    2.9   4.0    4.3
2          3.8    3.8   3.9    3.9
3          4.6    3.5   4.0    4.4
4          4.4    3.5   4.2    4.3
5          3.5    3.4   3.8    4.4
6          4.1    3.3   3.6    4.6
7          3.2    3.2   3.8    4.7
8          4.5    3.8   3.7    4.5
9          4.4    3.9   3.9    4.4
10         4.2    3.4   3.9    4.8

Table 6: Mean scores of SIG scale for different methods in presence of babble noise at 10 dB

Listener   MBSS   PSC   SMPO   NSSP
1          4.0    3.6   4.0    4.3
2          3.9    3.3   3.9    3.9
3          4.0    3.9   4.0    4.4
4          4.2    3.4   4.2    4.7
5          3.8    3.2   3.8    4.4
6          3.6    2.9   3.6    3.9
7          3.8    3.8   3.8    4.4
8          3.6    3.4   3.6    4.3
9          3.9    3.5   3.9    3.9
10         3.8    3.7   3.8    3.9

Table 7: Mean scores of BAK scale for different methods in presence of babble noise at 10 dB

Listener   MBSS   PSC   SMPO   NSSP
1          4.5    4.0   4.5    4.6
2          4.9    4.3   4.9    4.4
3          4.4    4.2   4.4    4.7
4          4.7    4.4   4.7    4.8
5          4.8    4.2   4.8    4.8
6          4.6    3.9   4.6    4.7
7          3.9    3.8   3.9    4.6
8          4.6    4.4   4.6    4.5
9          3.9    3.5   3.9    4.5
10         4.8    4.7   4.8    4.6

Table 8: Mean scores of OVL scale for different methods in presence of babble noise at 10 dB

Listener   MBSS   PSC   SMPO   NSSP
1          4.1    2.9   4.0    4.4
2          3.9    3.4   3.8    3.8
3          4.5    3.4   4.1    4.3
4          4.1    3.3   4.2    4.3
5          3.8    3.2   3.9    4.5
6          4.4    3.7   4.6    4.6
7          3.9    3.4   3.8    4.5
8          4.0    3.6   4.1    4.2
9          4.4    3.1   4.5    4.7
10         4.5    3.1   4.8    4.9

Figure 9: Spectrograms of (a) clean signal (b) noisy signal with 10 dB street noise; spectrograms of enhanced speech from (c) MBSS (d) PSC (e) SMPO (f) NSSP. Horizontal axis: Time (s).

Figure 10: Spectrograms of (a) clean signal (b) noisy signal with 10 dB babble noise; spectrograms of enhanced speech from (c) MBSS (d) PSC (e) SMPO (f) NSSP. Horizontal axis: Time (s).

4. Conclusions

An improved speech enhancement method based on magnitude and phase compensation is presented in this paper for enhancement of noisy speech in adverse environments. Spectral subtraction is used in the first step for magnitude compensation depending on a new non-stationary noise estimation. In the second step, an SNR-dependent phase spectrum compensation is used to compensate the phase. For noisy speech with medium to low levels of SNR, simulation results show that the proposed method yields consistently better results in the sense of higher segmental SNR improvement, overall SNR improvement, and output PESQ than those of the existing speech enhancement methods.

References

[1] S. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Transactions on Acoustics, Speech, and Signal Processing 27 (2) (1979) 113–120.
[2] K. Yamashita, T. Shimamura, Nonstationary noise estimation using low-frequency regions for spectral subtraction, IEEE Signal Processing Letters 12 (6) (2005) 465–468.
[3] Y. Lu, P. C. Loizou, A geometric approach to spectral subtraction, Speech Communication 50 (6) (2008) 453–466.
[4] M. T. Islam, C. Shahnaz, S. Fattah, Speech enhancement based on a modified spectral subtraction method, in: 2014 IEEE 57th International Midwest Symposium on Circuits and Systems (MWSCAS), IEEE, 2014, pp. 1085–1088.
[5] M. T. Islam, A. B. Hussain, K. T. Shahid, U. Saha, C. Shahnaz, Speech enhancement based on noise compensated magnitude spectrum, in: Informatics, Electronics & Vision (ICIEV), 2014 International Conference on, IEEE, 2014, pp. 1–5.
[6] Y. Ephraim, H. L. Van Trees, A signal subspace approach for speech enhancement, IEEE Transactions on Speech and Audio Processing 3 (4) (1995) 251–266.
[7] Y. Hu, P. C. Loizou, A generalized subspace approach for enhancing speech corrupted by colored noise, IEEE Transactions on Speech and Audio Processing 11 (4) (2003) 334–341.
[8] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing 33 (2) (1985) 443–445.
[9] Y. Hu, P. C. Loizou, Subjective comparison and evaluation of speech enhancement algorithms, Speech Communication 49 (7) (2007) 588–601.
[10] D. L. Donoho, De-noising by soft-thresholding, IEEE Transactions on Information Theory 41 (3) (1995) 613–627.
[11] M. Bahoura, J. Rouat, Wavelet speech enhancement based on the Teager energy operator, IEEE Signal Process. Lett. 8 (2001) 10–12.
[12] Y. Ghanbari, M. Mollaei, A new approach for speech enhancement based on the adaptive thresholding of the wavelet packets, Speech Commun. 48 (2006) 927–940.
[13] M. T. Islam, C. Shahnaz, W.-P. Zhu, M. O. Ahmad, Speech enhancement based on student t modeling of Teager energy operated perceptual wavelet packet coefficients and a custom thresholding function, IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (11) (2015) 1800–1811.
[14] N. Ma, M. Bouchard, R. A. Goubran, Speech enhancement using a masking threshold constrained Kalman filter and its heuristic implementations, IEEE Transactions on Audio, Speech, and Language Processing 14 (1) (2006) 19–32.
[15] K. Paliwal, Estimation of noise variance from the noisy AR signal and its application in speech enhancement, IEEE Transactions on Acoustics, Speech, and Signal Processing 36 (2) (1988) 292–294.
[16] R. Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics, IEEE Transactions on Speech and Audio Processing 9 (5) (2001) 504–512.
[17] J. Yamauchi, T. Shimamura, Noise estimation using high frequency regions for spectral subtraction, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 85 (3) (2002) 723–727.
[18] K. Wójcicki, M. Milacic, A. Stark, J. Lyons, K. Paliwal, Exploiting conjugate symmetry of the short-time Fourier spectrum for speech enhancement, IEEE Signal Processing Letters 15 (2008) 461–464.
[19] A. P. Stark, K. K. Wójcicki, J. G. Lyons, K. K. Paliwal, Noise driven short-time phase spectrum compensation procedure for speech enhancement, in: INTERSPEECH, 2008, pp. 549–552.
[20] M. T. Islam, C. Shahnaz, Speech enhancement based on noise-compensated phase spectrum, in: Electrical Engineering and Information & Communication Technology (ICEEICT), 2014 International Conference on, IEEE, 2014, pp. 1–5.
[21] D. O'Shaughnessy, Speech communication: human and machine, Universities Press, 1987.
[22] S. Kamath, P.
Lo izou, A multi-band spectral su btraction metho d for enhancin g sp eech corrupted by co lored noise, in: IEEE internati onal conferen ce on acoustic s speech and signal pro cessing, V ol . 4, Citeseer, 2002, pp. 4164–41 64. [23] Y . Lu , P . C . Lo izou, Estimators of the m agnitude-squared spectrum and meth ods for incorporati ng snr uncertai nty , IEEE t ransactions on audi o, speech, and lan guage processing 19 (5 ) (2011) 1123–11 37. [24] Y . Hu , P . C. Loizo u, Evaluat ion of objectiv e qualit y m easures for s pee ch enhancem ent, IEEE Transactions o n a udio, speech, and language processin g 16 (1) (20 08) 229–238. [25] Y . Hu, P . L oi zou, Su bjectiv e comparison a nd e valuation of spe ech enhanceme nt algorithms, Speec h Commun. 49 (20 07) 588–601. [26] IT U, P83 5 IT: sub jectiv e te s t methodolo gy for evaluating speech commu nication systems that inclu de noise supp ression algorithms., ITU- T Reco mmendation (ITU, G ene va) (2003) 835. 15
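As an illustrative cross-check that is not part of the original evaluation, the per-listener OVL scores for street noise at 10 dB (Table 5) can be averaged per method to see how the listening-test comparison is summarized. The sketch below is a minimal example under stated assumptions: the dictionary `OVL_STREET_10DB` and the helper names `method_means` and `best_method` are hypothetical, and the numbers are simply transcribed from the table.

```python
from statistics import mean

# Listener scores transcribed from Table 5 (OVL scale, street noise, 10 dB).
# Each list holds the scores of listeners 1 to 10 for one method.
# The variable and function names are hypothetical, chosen for illustration.
OVL_STREET_10DB = {
    "MBSS": [4.2, 3.8, 4.6, 4.4, 3.5, 4.1, 3.2, 4.5, 4.4, 4.2],
    "PSC":  [2.9, 3.8, 3.5, 3.5, 3.4, 3.3, 3.2, 3.8, 3.9, 3.4],
    "SMPO": [4.0, 3.9, 4.0, 4.2, 3.8, 3.6, 3.8, 3.7, 3.9, 3.9],
    "NSSP": [4.3, 3.9, 4.4, 4.3, 4.4, 4.6, 4.7, 4.5, 4.4, 4.8],
}


def method_means(scores):
    """Return the mean score of each method, rounded to two decimals."""
    return {method: round(mean(values), 2) for method, values in scores.items()}


def best_method(scores):
    """Return the name of the method with the highest mean score."""
    means = method_means(scores)
    return max(means, key=means.get)


if __name__ == "__main__":
    for method, avg in method_means(OVL_STREET_10DB).items():
        print(f"{method}: {avg:.2f}")
    print("highest mean:", best_method(OVL_STREET_10DB))
```

The same two helpers can be applied to the scores of any of the other tables by swapping in a dictionary with that table's columns; averaging over listeners is only one possible summary, and the per-listener rows remain the primary data.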
