INCREASE APPARENT PUBLIC SPEAKING FLUENCY BY SPEECH AUGMENTATION

Sagnik Das, Nisha Gandhi, Tejas Naik, Roy Shilkrot
Human Interaction Lab, Department of Computer Science, Stony Brook University, Stony Brook, NY, USA

ABSTRACT

Fluent and confident speech is desirable to every speaker, but professional speech delivery requires a great deal of experience and practice. In this paper, we propose a speech stream manipulation system which can help non-professional speakers produce fluent, professional-like speech content, in turn contributing towards better listener engagement and comprehension. We propose to achieve this by manipulating the disfluencies in human speech, such as the sounds "uh" and "um", filler words, and awkward long silences. Given any unrehearsed speech, we segment and silence the filled pauses, and adjust the duration of the imposed silences, as well as of other long (disfluent) pauses, using a predictive model learned from a professional speech dataset. Finally, we output an audio stream in which the speaker sounds more fluent, confident, and practiced than in the original recording. According to our quantitative evaluation, we significantly increase the fluency of speech by reducing the rate of pauses and fillers.

Index Terms: Speech disfluency detection, speech disfluency repair, speech processing, assistive technologies in speech

1. INTRODUCTION

Professional speakers, who make their living from their speech, speak clearly and fluently, with very few repetitions and revisions. This kind of error-free utterance is the result of many hours of practice and experience. A regular speaker, on the other hand, generally speaks with no real practice of articulation and delivery. Naturally, an unrehearsed speech contains unintentional disfluencies that interrupt its flow. Speech disfluency generally comprises long pauses, discourse markers, repeated words, phrases or sentences, and fillers or filled pauses like "uh" and "um". According to research by [1], approximately 6% of speech is non-pause disfluency. Filled pauses, or filler words, are the most common disfluency in any unrehearsed, impromptu speech [2].

Examples: https://sagniklp.github.io/pub-speaker-aug/

Fig. 1. The proposed speaker augmentation pipeline: filler-word segmentation, disfluent silence detection, filler-word removal, and silence synthesis.

Numerous linguistics studies have investigated the effect of speech disfluencies on the listener's comprehension and on the speaker's cognitive state, such as uncertainty, confidence, thoughtfulness, and cognitive load ([3, 2]). Different studies claim that disfluencies often reflect uncertainty in the speaker's mind about upcoming statements; consequently, less confident speakers tend to be more disfluent [2]. Moreover, it has also been observed that filled pauses specifically indicate the speaker's level of cognitive difficulty [4]. Generally, disfluencies occur before a longer utterance [5] or when the topic is unfamiliar to the speaker [6]. Fluency reflects the speaker's ability to focus the listener's attention on his/her message rather than inviting the listener to focus on the idea and try to self-interpret it [7]. Considering the diverse factors affecting speaker fluency, our idea is to doctor a speech to make it fluent by taking care of the temporal factors contributing to it.
In this work, we propose a system to detect, segment, and remove the most common disfluencies, namely filler words and long, unnatural pauses, from a speech to aid speakers' fluency. Our system takes a raw speech track as input and outputs a modified, fluent version of it by intelligently removing the filled pauses and adjusting the "long" silences.

We interpret the occurrence of a disfluency in a speech as an acoustic event, and take a segmentation approach for detection. A combined CNN and RNN architecture, a convolutional recurrent neural network (CRNN), is used to achieve the task, inspired by [8]. Further, a binary classification approach is taken to detect long pauses between words. After deleting the filler words and adjusting the silences, the fluent version of the speech is obtained. The performance of our system is evaluated quantitatively on speeches of non-native speakers of English, using the fluency metrics proposed by [9]. We also propose an assistive user interface which can be used to help users visualize and comparatively analyze their speech.

The essential contributions of this paper are:

1. A disfluency detection mechanism that works directly on acoustic features without using any language features.
2. A silence modeling scheme directly conditioned on the preceding speech.
3. A disfluency repair technique to help users improve a pre-delivered speech.

2. RELATED WORKS

In recent years, there have been many works related to speech disfluencies, spanning the domains of psychology, linguistics, and natural language processing (NLP). Where psychology and linguistics researchers have focused on defining disfluencies and on their causes and effects from a language and cognitive aspect, NLP researchers have focused more on detecting them in speech transcripts to help language understanding and recognition systems.

The prime motivation for disfluency detection in NLP is to better interpret speech-to-text transcripts for natural language understanding systems. One of the first works [10] focuses on classifying edit words (restarts and repairs) in text using a boosted classifier. Another contemporary approach applies a noisy channel model to detect and correct speech disfluencies [11, 12, 13]. Later, Hidden Markov Model (HMM), Conditional Random Field (CRF), and Integer Linear Programming (ILP) based methods [14, 15] were introduced. A classification approach using lexical features is taken by [16], focusing specifically on dialogs with schizophrenic patients. Some incremental [17, 18, 19, 20], multi-step [21], and joint-task (parsing and disfluency detection) [22, 17] methods were introduced recently. Though all of these provide convincing results, they are limited to pre-defined feature templates (lexical, acoustic, and prosodic). With advances in deep learning, the most recent methods rely on recurrent neural networks (RNNs) [23, 24, 25, 26, 27]. These methods use word embeddings and acoustic features instead of pre-defined feature templates.

All the techniques above make one fundamental assumption: that any disfluency detection pipeline includes an automatic speech recognizer (ASR).
Consequently, to the best of our knowledge, all disfluency detection schemes presented so far work at the transcript level. Also, these systems have never been paired with an acoustic-level repair scheme aimed at exploring use cases from the perspective of the listener and the speaker.

In our work, we address these motivations by devising a disfluency detection and repair method that relies solely on acoustic features to synthesize temporally fluent speech segments, from a human-interaction perspective.

3. PROPOSED METHOD

3.1. Disfluency Detection

Our work focuses on building a system that can be used not only as a disfluency detection system but also as a way to understand users' disfluency better. The primary motivations of this work are the following:

• Work with disfluencies at the acoustic level, without using any transcript.
• A significant portion of a disfluent speech consists of long pauses. At the transcript level this is not an issue, but at the acoustic level it matters a lot in determining a speaker's fluency.
• Repair disfluent segments to help users understand the possible improvements to their speech, as well as to create fluent speech content without much hassle.

The types of disfluencies we consider in this work are the use of filler words and intermittent long pauses.

3.1.1. Dataset

The dataset used for filler word segmentation is obtained from the Switchboard transcription (https://www.isip.piconepress.com/projects/switchboard/). We also used the AutoManner [28] transcription (https://www.cs.rochester.edu/hci/currentprojects.php?proj=automanner) for additional data. This gives more generalization to our training samples, since it contains recordings from standard interfaces.

To label disfluent silences, we use a combination of a silence probability model [29] and a disfluency detection model [25]. First, we locate the silences and segment each word pair from the dataset; then, according to the probability model, it is decided whether a silence is disfluent. For each word-pair utterance, the silence probability model gives the probability $P_{sil}$ of a silence occurring between the two words. A word pair with low $P_{sil}$ but a significant amount of silence is labeled as disfluent. If a word pair does not exist in the model vocabulary, we resort to the following approach: since general disfluencies accompany longer silences, any silence within a disfluent segment is labeled as an unnatural pause. Additionally, word pairs surrounded by silences longer than 0.7 seconds are labeled similarly. This choice is experimental and can be considered safe because it is considerably higher than the suggested quantitative measure of (fluent) micro-pauses, 0.2 seconds [30]. Additional fluent pairs are collected from TIMIT [31].

3.1.2. Features

In this step, frame-level acoustic features (log mel band energy or mel frequency cepstral coefficients (MFCCs)) are obtained at each timestep $t$, resulting in a feature vector $m_t \in \mathbb{R}^C$. Here, $C$ is the number of features (in the frequency dimension) at frame $t$. The task of segmenting the filler words is formulated as binary classification of each frame into its correct class $k$:

$$\arg\max_k P(y_t^{(k)} \mid m_t, \theta) \tag{1}$$

where $k \in \{1, 2\}$ and $\theta$ are the parameters of the classifier.

In the training data, the target class $y_t^{(k)} = 1$ if frame $t$ belongs to class $k$ (determined using the onset/offset timeline of $k$ associated with a sound segment), and zero otherwise. Each soundtrack $S$ is divided into multiple fixed-length sequences of frames $M_{t:t+T-1}$, where $T$ is the length of the frame sequence. The corresponding class label matrix $Y_{t:t+T-1}$ contains all the $y_t$.
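To illustrate this feature step, the sketch below computes per-frame MFCCs and slices them into the fixed-length labeled sequences $M_{t:t+T-1}$, $Y_{t:t+T-1}$. It is a minimal sketch, assuming librosa, a 16 kHz sample rate, and the 30 ms / 15 ms framing reported later in Section 4.1.2; the helper names are our own, not the authors'.

```python
import numpy as np
import librosa

def extract_features(wav_path, n_mfcc=40, frame_ms=30, hop_ms=15):
    """Frame-level MFCC features m_t in R^C (30 ms frames, 15 ms overlap)."""
    y, sr = librosa.load(wav_path, sr=16000)   # 16 kHz is our assumption
    n_fft = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    # Shape (C, T_total): one C-dimensional feature vector per frame t.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)

def make_sequences(feats, labels, T=128):
    """Cut a track into fixed-length sequences M_{t:t+T-1} with labels Y."""
    C, total = feats.shape
    xs, ys = [], []
    for start in range(0, total - T + 1, T):
        xs.append(feats[:, start:start + T])
        ys.append(labels[start:start + T])   # y_t = 1 if frame t is a filler
    return np.stack(xs), np.stack(ys)
```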
3.1.3. CRNN for filler word segmentation

Here we propose a convolutional recurrent neural network (CRNN) for filler word segmentation. A similar architecture has previously been used for sound event detection (SED) [8] and speech recognition [32]. The architecture is a combination of convolutional and recurrent layers, followed by feed-forward layers.

Fig. 2. Block diagram of the filler-word segmentation network: MFCC input, convolution and max-pooling, stacking, GRU layers, FC1 (ReLU), and FC2 with softmax.

The sequence of extracted features $M \in \mathbb{R}^{C \times T}$ is fed to the CNN layers with rectified linear unit (ReLU) activations. The filters of the CNN layers span the feature and time dimensions. Max-pooling is applied only over the frequency dimension. The output of max-pooling is a tensor $P_c \in \mathbb{R}^{F \times M' \times T}$, where $F$ is the number of filters of the final convolution layer and $M'$ is the truncated frequency dimension after the max-pooling operation.

To learn features over the time axis, the $F$ feature maps are then stacked along the frequency axis, which yields a tensor $P_s \in \mathbb{R}^{(F \times M') \times T}$. This is fed to the RNN as a sequence of frames $p_t$, which outputs a hidden vector $\hat{p}_t$. The $i$-th recurrent layer output is given by Eq. 2, where $\mathcal{F}$ is a function learned by each RNN unit. In this work, we use GRUs as presented in [33].

$$\hat{p}_t^i = \mathcal{F}(\hat{p}_t^{i-1}, \hat{p}_{t-1}^i) \tag{2}$$

The RNN final-layer outputs $\hat{p}_t^f$ are then fed to a fully connected (FC) layer with ReLU activation, yielding $G \in \mathbb{R}^{FC_1 \times T}$, where $FC_1$ is the number of neurons of the layer. Finally, another layer with softmax activation is applied to obtain the class probabilities. Let $\hat{G} \in \mathbb{R}^{K \times T}$ be the output tensor of the final FC layer; the probabilities are then given by:

$$P(y_t \mid m_{0:t}, \theta) = \mathrm{Softmax}(\hat{g}_t) \tag{3}$$

The CRNN training objective is to minimize the cross-entropy loss with $\ell_2$ regularization:

$$L(\theta) = -\sum_{0:t} \sum_k \log P(y_t^{(k)}) + \lambda \lVert \theta \rVert \tag{4}$$
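To make the data flow of Fig. 2 concrete, here is a minimal PyTorch sketch using the final "mfcc" settings of Table 1 (conv filters 32 and 64, frequency-only pooling of 5 and 4, three GRU layers of 128 units, FC1 of 100 units). The module name, the "same" padding on both axes, and the exact pooling layout are our assumptions where the text leaves them open; this is a sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FillerCRNN(nn.Module):
    """CRNN of Fig. 2: conv + frequency-only max-pooling, stacked feature
    maps fed to GRUs, then FC layers (sizes follow Table 1 for mfcc)."""
    def __init__(self, n_feats=40, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(8, 8), padding='same'), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(5, 1)),   # pool the frequency axis only
            nn.Conv2d(32, 64, kernel_size=(4, 4), padding='same'), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 1)),
            nn.Dropout(0.25),
        )
        m_prime = n_feats // 5 // 4             # truncated frequency dim M' = 2
        self.gru = nn.GRU(64 * m_prime, 128, num_layers=3, batch_first=True)
        self.fc1 = nn.Sequential(nn.Linear(128, 100), nn.ReLU(), nn.Dropout(0.5))
        self.fc2 = nn.Linear(100, n_classes)

    def forward(self, m):                       # m: (B, C, T)
        p = self.conv(m.unsqueeze(1))           # P_c: (B, F, M', T)
        p = p.flatten(1, 2).transpose(1, 2)     # stack: (B, T, F*M') = p_t
        h, _ = self.gru(p)                      # hidden vectors \hat{p}_t
        return self.fc2(self.fc1(h))            # per-frame class logits
```

Training per Eq. (4) could then use `nn.CrossEntropyLoss` on the flattened per-frame logits with `torch.optim.Adagrad(model.parameters(), lr=0.01, weight_decay=0.01)`, matching the optimizer, learning rate, and λ reported in Section 4.1.2 (weight decay approximates the regularizer).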
3.1.4. Disfluent silence classification

The problem is formulated as a binary classification task: given a silent segment $Z$, decide whether it is a disfluent or a non-disfluent silence. Classifying a silence only makes sense when it is combined with the adjacent utterances, because the occurrence of a silence is driven by the utterance and heavily influenced by disfluencies. Thus, it is not always evident that every pause above a significant threshold is disfluent; the illustration in Fig. 3 gives an idea of this fact.

Fig. 3. The visualization interface; top: speech track with colored segmentation outputs (brown: disfluent silences, blue: fillers, green: fluent silences); bottom: modified speech.

We train a support vector machine (SVM) to achieve this task. Given a silent segment $Z$, it is first padded with the one-word utterances on its left and right ($\hat{Z}$). Then MFCC features are extracted, and we take the mean over the frequency axis to create the feature vector $z_i \in \mathbb{R}^T$, where $T$ is the number of frames in $z_i$. Segments are of variable length, so $z_i$ is padded with trailing zeros prior to classification. During testing we do not use the previous and next word boundaries; instead, a fixed-length time window is used. In our experiments, we found that windows of 0.8-1.0 seconds give good results.

3.2. Disfluency Repair

First, the fillers are replaced with silences. We found that it is often helpful (such as when ambient noise is present) to use a decomposition mechanism [34] on the speech to separate the background noise and vocals; each filler is then replaced with its corresponding background segment. The modified track is then used to segment the silences ($Z$), and finally the classification is done.

Fig. 4. Silence modification pipeline: filler replacement splits the track into vocals and background; silence segmentation feeds the silence classifier, which separates fluent from disfluent silence intervals and yields the modified silences. The dashed line on the histogram of fluent silence times marks their median.

All the silence segment lengths are then modified to make the speech fluent (Fig. 4). The goal is to reduce the amount of long, unnatural pauses that hurt the fluency of the speech. It is also required to keep the pace of the speech intact: too much reduction of silences makes the speech sound unnatural and broken. We take the fluent silence times (i.e., those suggested by our silence classifier), obtain a histogram, and find that taking the median of the histogram bins as the optimal amount of silence works quite well. In this way, the distribution of silences along the speech progression is confined to a near-constant distribution, and the speaker sounds more consistent and fluent in the modified speech.
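As an illustration of this silence adjustment, the sketch below replaces each disfluent pause length with the median fluent-silence time read from a 10-bin histogram (the dashed line in Fig. 4). The text only loosely specifies how the median is taken over the bins, so the bin-center-at-half-mass reading here is our assumption.

```python
import numpy as np

def modified_silence_lengths(fluent_durs, disfluent_durs, n_bins=10):
    """Replace each disfluent pause with the median fluent-silence
    time read off a histogram of the fluent pauses (cf. Fig. 4)."""
    counts, edges = np.histogram(fluent_durs, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2
    # Bin center where the cumulative count first crosses one half:
    # our reading of "the median of the histogram bins".
    cdf = np.cumsum(counts) / counts.sum()
    idx = min(np.searchsorted(cdf, 0.5), len(centers) - 1)
    target = centers[idx]
    return [target] * len(disfluent_durs)
```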
4. RESULTS & ANALYSIS

4.1. Experimental Settings

4.1.1. Datasets

The experiments are performed on Switchboard [35], AutoManner [28], and our own dataset of public speaker recordings. To train the CRNN we use the segments from Switchboard; the CRNN test results are reported on held-out data from Switchboard-I. Silence classification results are reported on held-out data from TIMIT [31], Switchboard, and AutoManner. All the fluency metrics are evaluated on our dataset, containing recordings of 20 non-native speakers of English. The speakers were asked to talk on a specific topic for 50-60 seconds.

4.1.2. Parameter settings

We experimented with different configurations of the CNN and RNN parameters and different features.

Types of features: Initial experiments were performed on mel frequency cepstrum coefficients (mfcc), mel spectrograms, log mel spectrograms (log mel), spectral contrast, zero crossing rate, and tonnetz. Based on the experimental results, the mfcc (40 × t) and log mel (128 × t) features are used for filler segmentation. For the silence classification, mfcc features are used after taking the mean over the frequency axis. The feature dimensions used are shown in Table 1. All features are extracted in 30 ms frames with 15 ms overlap.

RNN & CRNN parameters: In experiments with the CNN and CRNN, we explore {1, 2, 3} convolutional (conv) layers with combinations of max pooling and average pooling. At each layer, ReLU activation is used. The following settings are used for conv filters: {16, 32, 64}, and kernel sizes: {2, 3, 4, 5, 8}. All the conv layers use same padding. The pooling size was varied within {2, 3, 4, 5, 8}; pooling is performed only on the frequency dimension. We tried dropout ratios of {0.3, 0.5, 0.75}. The RNN we use is gated recurrent units (GRU). Experiments are performed with {2, 3} layers (l) and {64, 128, 256} hidden units (d). No intermediate dropout (dr) is applied. The final fully connected layer (FC1 in Fig. 2) is experimented with hidden units (d) of {100, 200} and dropout ratios of {0.3, 0.5, 0.75}.

Features   CNN                                        RNN          FC
mfcc       conv1 [32,(8,8)], conv2 [64,(4,4)];        l=3, d=128   d=100, dr=0.5
           maxpool1 [5,5], maxpool2 [4,4]; dr=0.25
log mel    conv1 [32,(8,8)], conv2 [64,(4,4)];        l=3, d=128   d=100, dr=0.5
           maxpool1 [8,4], maxpool2 [4,2]; dr=0.25

Table 1. Final parameters for the CNN, RNN, and fully connected layers.

The networks are trained in an end-to-end fashion using the AdaGrad algorithm for 200 epochs. The learning rate was set to 0.01, and the regularization constant (λ) was set to 0.01. The final parameters are given in Table 1.

Silence classification parameters: The maximum length of the sequences was set to 128. Final parameters are given in Table 3.

SVM                          LogReg          XGBoost
itr=1500, kernel=rbf, C=10   itr=100, C=10   depth=3, lr=0.1, estimators=100

Table 3. Final parameters used in silence classification.

4.1.3. Evaluation Metrics

To evaluate the filler word segmentation we use the following frame-level statistic:

• F1 score (F1): The F1 score is calculated at the frame level (30 ms) using TP, the frames where fillers are correctly detected; TN, the frames where non-fillers are correctly detected; FP, the frames where fillers are wrongly detected; and FN, the frames where non-fillers are wrongly detected.

The silence classification is evaluated using the F1 score w.r.t. the disfluent silence class. To evaluate the quality of the augmented speech produced by our system, we use the following metrics defined in [9]:

• Speech rate:
$$SR = \frac{\#\,\text{of syllables}}{\text{total time} - ufp[<3]} \times 60 \tag{5}$$
where $ufp[<3]$ is the total time of unfilled pauses shorter than 3 seconds, since pauses longer than 3 seconds are considered articulation pauses [30].

• Articulation rate:
$$AR = \frac{\#\,\text{of syllables}}{\text{total time}} \times 60 \tag{6}$$

• Phonation-time ratio:
$$PTR = \frac{\text{speaking time}}{\text{total time}} \tag{7}$$

• Mean length of runs:
$$MLR = \frac{\#\,\text{of syllables}}{\#\,\text{of utterances between } p[>0.25]} \tag{8}$$
where $p[>0.25]$ denotes pauses longer than 0.25 seconds.

• Mean length of pauses:
$$MLP = \frac{\text{total duration of } p[>0.2]}{\#\,\text{of } p[>0.2]} \tag{9}$$

• Filled pauses per minute:
$$FPM = \frac{\#\,\text{of filled pauses}}{\text{total time}} \tag{10}$$
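For concreteness, the sketch below computes Eqs. (5)-(10) from pause annotations. It is a minimal sketch under our own conventions: all times in seconds, the run count for Eq. (8) taken as (number of pauses > 0.25 s) + 1, and a seconds-to-minutes factor added to Eq. (10), both of which the equations leave implicit.

```python
def fluency_metrics(n_syllables, total_time, speaking_time, pauses, n_filled):
    """Fluency metrics of Eqs. (5)-(10); `pauses` holds the
    unfilled-pause durations, all times in seconds."""
    ufp_lt3 = sum(p for p in pauses if p < 3.0)    # pauses >= 3 s are articulation pauses [30]
    runs = sum(1 for p in pauses if p > 0.25) + 1  # utterances between p[>0.25]: our reading
    p_gt_02 = [p for p in pauses if p > 0.2]
    return {
        "SR":  n_syllables / (total_time - ufp_lt3) * 60,  # speech rate, Eq. (5)
        "AR":  n_syllables / total_time * 60,              # articulation rate, Eq. (6)
        "PTR": speaking_time / total_time,                 # phonation-time ratio, Eq. (7)
        "MLR": n_syllables / runs,                         # mean length of runs, Eq. (8)
        "MLP": sum(p_gt_02) / max(len(p_gt_02), 1),        # mean length of pauses, Eq. (9)
        "FPM": n_filled / total_time * 60,                 # filled pauses per minute; the
                                                           # x60 is our assumption, Eq. (10)
    }
```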
4.2. Filler Word Segmentation

The filler word segmentation evaluation results are given in Tables 4 and 5. In Table 4 we report the comparative performance of the CRNN using different features. To better gauge the credibility of the CRNN, in Table 5 we compare its results against an automatic speech recognizer available with Kaldi (ASpIRE chain model, https://github.com/kaldi-asr/kaldi/tree/master/egs/aspire). Considering the simplicity of our network, it performs quite close to the ASR in terms of F1 score. All results are evaluated on a subset of the Switchboard-I dataset.

Features   Precision   Recall   F1
mfcc       0.9482      0.9610   0.9534
log mel    0.9495      0.9629   0.9550

Table 4. Performance of the CRNN with different features.

Method   Precision   Recall   F1
ASR      0.9774      0.9792   0.9775
CRNN     0.9495      0.9629   0.9550

Table 5. Performance of filler word segmentation compared to an automatic speech recognizer.

The only drawback we have observed while comparing our method and the ASR is that our classifier sometimes detects segments that merely sound similar to "uh" or "um".

4.3. Disfluent Silence Classification

For this task we experimented with an SVM, logistic regression (LogReg), and XGBoost. The results are summarized in Table 6. We used 10-fold cross-validation to report our results.

Method   SVM      LogReg   XGBoost
F1       0.9055   0.9200   0.9207

Table 6. Silence classification performance on TIMIT, Switchboard, and AutoManner.

4.4. Disfluency Repair

After processing the speeches by removing the fillers and long silences, the fluent speech is obtained. To compare the fluency of the synthesized and the original speech, the metrics discussed in Section 4.1.3 are used. The results are reported in Table 2; the mean of each metric across all samples is reported.

Metrics →   SR ↑      AR ↑      PTR ↑    MLR ↑   MLP ↓   FPM ↓
Original    165.357   171.099   58.865   0.400   0.654   3.659
Processed   186.241   186.241   65.570   0.495   0.365   1.762

Table 2. The fluency metrics before and after processing the speeches. ↑ means higher is better and ↓ means lower is better.

From the numbers it is clear that we improve the fluency. Notably, in the processed speech the articulation rate and speech rate increase to the same value, since we take care of all the unfilled pauses in the speech and introduce a more uniform silence production. Beyond the numbers, for a qualitative impression, some processed samples are available at the examples page (https://sagniklp.github.io/pub-speaker-aug/).

5. FUTURE WORK

This work is motivated by the fact that disfluency detection is not only useful for intelligent agents but is also a practical problem definition for helping users produce a better, more confident, and fluent talk. With respect to the variety of disfluencies produced in speech, this work is a small step towards a bigger goal: repairing the disfluencies in a speech from the speaker's perspective. Along with addressing the pitfalls of our method, the following could be future directions of this work:

• Improving the filler word segmentation performance, as well as devising techniques to segment other kinds of common disfluencies (repetitions, discourse markers, corrections) and speech impairments (stuttering).
• Devising a dynamic and online repair scheme that generates the necessary (disfluent) portions of speech instead of replacing them.

6. CONCLUSION

Disfluency detection is a well-explored problem in the speech processing community, performed mostly on speech transcripts to aid intelligent conversational agents. In this work, we interpret disfluency detection from the speaker's perspective and introduce an additional component: repairing the disfluencies. Consequently, we work solely in the acoustic domain, removing the need for a complex system like an ASR before disfluency detection. With the results of our detection and repair scheme, we show improved fluency in speakers' dialogues, given a less fluent speech. To the best of our knowledge, this is the first work on disfluency repair for the user's own sake, and it can be further extended to assist users with speech impairments and other general disfluencies.

7. ACKNOWLEDGEMENTS

We are thankful to Faizaan Charania and Mahima Parashar for curating the dataset and working on some essential observations. We would also like to thank the participating speakers for the speeches they provided. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp and P6000 GPUs used for this research.
8. REFERENCES

[1] Jean E. Fox Tree, "The effects of false starts and repetitions on the processing of subsequent words in spontaneous speech," Journal of Memory and Language, vol. 34, no. 6, pp. 709–738, 1995.
[2] Kathryn Womack, Wilson McCoy, Cecilia Ovesdotter Alm, Cara Calvelli, Jeff B. Pelz, Pengcheng Shi, and Anne Haake, "Disfluencies as extra-propositional indicators of cognitive processing," in Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics. Association for Computational Linguistics, 2012, pp. 1–9.
[3] Martin Corley and Oliver W. Stewart, "Hesitation disfluencies in spontaneous speech: The meaning of um," Language and Linguistics Compass, vol. 2, no. 4, pp. 589–602, 2008.
[4] Dale J. Barr and Mandana Seyfeddinipur, "The role of fillers in listener attributions for speaker disfluency," Language and Cognitive Processes, vol. 25, no. 4, pp. 441–455, 2010.
[5] Elizabeth Shriberg, "Disfluencies in Switchboard," in Proceedings of the International Conference on Spoken Language Processing, 1996, vol. 96, pp. 11–14.
[6] Sandra Merlo and Letícia Lessa Mansur, "Descriptive discourse: Topic familiarity and disfluencies," Journal of Communication Disorders, vol. 37, no. 6, pp. 489–503, 2004.
[7] Paul Lennon, "Investigating fluency in EFL: A quantitative approach," Language Learning, vol. 40, no. 3, pp. 387–417, 1990.
[8] Emre Cakir, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," arXiv preprint, 2017.
[9] Judit Kormos and Mariann Dénes, "Exploring measures and perceptions of fluency in the speech of second language learners," System, vol. 32, no. 2, pp. 145–164, 2004.
[10] Eugene Charniak and Mark Johnson, "Edit detection and parsing for transcribed speech," in Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Association for Computational Linguistics, 2001, pp. 1–9.
[11] Matthias Honal and Tanja Schultz, "Correction of disfluencies in spontaneous speech using a noisy-channel approach," in Eighth European Conference on Speech Communication and Technology, 2003.
[12] Mark Johnson and Eugene Charniak, "A TAG-based noisy-channel model of speech repairs," in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 2004.
[13] Simon Zwarts, Mark Johnson, and Robert Dale, "Detecting speech repairs incrementally using a noisy channel approach," in Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010, pp. 1371–1378.
[14] Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin Hillard, Mari Ostendorf, and Mary Harper, "Enriching speech recognition with automatic detection of sentence boundaries and disfluencies," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1526–1540, 2006.
[15] Kallirroi Georgila, "Using integer linear programming for detecting speech disfluencies," in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers. Association for Computational Linguistics, 2009, pp. 109–112.
[16] Christine Howes, Matt Purver, Rose McCabe, P. G. Healey, and Mary Lavelle, "Helping the medicine go down: Repair and adherence in patient-clinician dialogues," in Proceedings of the 16th SemDial Workshop on the Semantics and Pragmatics of Dialogue (SeineDial), 2012, pp. 19–21.
[17] Matthew Honnibal and Mark Johnson, "Joint incremental disfluency detection and dependency parsing," Transactions of the Association for Computational Linguistics, vol. 2, no. 1, pp. 131–142, 2014.
[18] Julian Hough and Matthew Purver, "Strongly incremental repair detection," arXiv preprint arXiv:1408.6788, 2014.
[19] Christine Howes, Julian Hough, Matthew Purver, and Rose McCabe, "Helping, I mean assessing psychiatric communication: An application of incremental self-repair detection," 2014.
[20] James Ferguson, Greg Durrett, and Dan Klein, "Disfluency detection with a semi-Markov model and prosodic features," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 257–262.
[21] Xian Qian and Yang Liu, "Disfluency detection using multi-step stacked learning," in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 820–825.
[22] Mohammad Sadegh Rasooli and Joel Tetreault, "Joint parsing and disfluency detection in linear time," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 124–129.
[23] Julian Hough and David Schlangen, "Recurrent neural networks for incremental disfluency detection," Interspeech 2015, 2015.
[24] Shaolei Wang, Wanxiang Che, and Ting Liu, "A neural attention model for disfluency detection," in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 278–287.
[25] Vicky Zayats, Mari Ostendorf, and Hannaneh Hajishirzi, "Disfluency detection using a bidirectional LSTM," arXiv preprint, 2016.
[26] Julian Hough and David Schlangen, "Joint, incremental disfluency detection and utterance segmentation from speech," in Proceedings of the Annual Meeting of the European Chapter of the Association for Computational Linguistics (EACL), 2017.
[27] Shaolei Wang, Wanxiang Che, Yue Zhang, Meishan Zhang, and Ting Liu, "Transition-based disfluency detection using LSTMs," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2785–2794.
[28] M. Iftekhar Tanveer, Ru Zhao, Kezhen Chen, Zoe Tiet, and Mohammed Ehsan Hoque, "AutoManner: An automated interface for making public speakers aware of their mannerisms," in Proceedings of the 21st International Conference on Intelligent User Interfaces. ACM, 2016, pp. 385–396.
[29] Guoguo Chen, Hainan Xu, Minhua Wu, Daniel Povey, and Sanjeev Khudanpur, "Pronunciation and silence probability modeling for ASR," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[30] Heidi Riggenbach, "Toward an understanding of fluency: A microanalysis of nonnative speaker conversations," Discourse Processes, vol. 14, no. 4, pp. 423–441, 1991.
[31] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, and David S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, 1993.
[32] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4580–4584.
[33] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint, 2014.
[34] Zafar Rafii and Bryan Pardo, "Music/voice separation using the similarity matrix," in ISMIR, 2012, pp. 583–588.
[35] John J. Godfrey, Edward C. Holliman, and Jane McDaniel, "Switchboard: Telephone speech corpus for research and development," in Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on. IEEE, 1992, vol. 1, pp. 517–520.
