INCREASE APPARENT PUBLIC SPEAKING FLUENCY BY SPEECH AUGMENTATION

Sagnik Das, Nisha Gandhi, Tejas Naik, Roy Shilkrot
Human Interaction Lab, Department of Computer Science, Stony Brook University, Stony Brook, NY, USA

ABSTRACT

Fluent and confident speech is desirable to every speaker, but professional speech delivery requires a great deal of experience and practice. In this paper, we propose a speech stream manipulation system which can help non-professional speakers produce fluent, professional-like speech content, in turn contributing towards better listener engagement and comprehension. We propose to achieve this by manipulating the disfluencies in human speech, such as the sounds "uh" and "um", filler words, and awkward long silences. Given any unrehearsed speech, we segment and silence the filled pauses, and adjust the duration of the imposed silences, as well as of other long (disfluent) pauses, using a predictive model learned from a professional speech dataset. Finally, we output an audio stream in which the speaker sounds more fluent, confident, and practiced than in the original recording. According to our quantitative evaluation, we significantly increase the fluency of speech by reducing the rate of pauses and fillers.

Index Terms: Speech disfluency detection, speech disfluency repair, speech processing, assistive technologies in speech

1. INTRODUCTION

Professional speakers, who make their living from their speech, speak clearly and fluently, with very few repetitions and revisions. This kind of error-free utterance is the result of many hours of practice and experience. A regular speaker, on the other hand, generally speaks with no real practice of articulation and delivery. Naturally, an unrehearsed speech contains unintentional disfluencies that interrupt its flow. Speech disfluency generally comprises long pauses, discourse markers, repeated words, phrases or sentences, and fillers or filled pauses like "uh" and "um". According to research by [1], approximately 6% of speech is non-pause disfluency. Filled pauses, or filler words, are the most common disfluency in any unrehearsed, impromptu speech [2].

Examples: https://sagniklp.github.io/pub-speaker-aug/

Fig. 1. The proposed speaker augmentation pipeline: filler-word segmentation, disfluent silence detection, filler-word removal, and silence synthesis.

Numerous linguistics studies have investigated the effect of speech disfluencies on the listener's comprehension and on the speaker's cognitive state, such as uncertainty, confidence, thoughtfulness, and cognitive load ([3, 2]). Different studies claim that disfluencies often reflect uncertainty in the speaker's mind about upcoming statements; consequently, less confident speakers tend to be more disfluent [2]. Moreover, it has also been observed that filled pauses specifically indicate the speaker's level of cognitive difficulty [4]. Generally, disfluencies occur before a longer utterance [5] or when the topic is unfamiliar to the speaker [6]. Fluency reflects the speaker's ability to focus the listener's attention on his/her message rather than inviting the listener to focus on the idea and try to self-interpret it [7]. Considering the diverse factors affecting speaker fluency, our idea is to doctor a speech to make it fluent by taking care of the temporal factors contributing to it.
In this work, we propose a system to detect, segment, and remove the most common disfluencies, namely filler words and long, unnatural pauses, from a speech to aid speakers' fluency. Our system takes a raw speech track as input and outputs a modified, fluent version of it by intelligently removing the filled pauses and adjusting the "long" silences.

We interpret the occurrence of a disfluency in a speech as an acoustic event, and take a segmentation approach for detection. A combined CNN and RNN architecture, a convolutional recurrent neural network (CRNN), is used to achieve the task, inspired by [8]. Further, a binary classification approach is taken to detect long pauses between words. After deleting the filler words and adjusting the silences, the fluent version of the speech is obtained. The performance of our system is evaluated quantitatively on speeches of non-native speakers of English, using the fluency metrics proposed by [9]. We also propose an assistive user interface which can be used to help users visualize and comparatively analyze their speech.

The essential contributions of this paper are:

1. A disfluency detection mechanism that works directly on acoustic features without using any language features.
2. A silence modeling scheme directly conditioned on the preceding speech.
3. A disfluency repair technique to help users improve a pre-delivered speech.

2. RELATED WORKS

In recent years, there have been many works related to speech disfluencies, spanning the domains of psychology, linguistics, and natural language processing (NLP). Where psychology and linguistics researchers have focused on defining disfluencies and on their causes and effects from a language and cognitive aspect, NLP researchers have focused more on detecting them in speech transcripts to help language understanding and recognition systems.

The prime motivation for disfluency detection in NLP is to better interpret speech-to-text transcripts for natural language understanding systems. One of the first works [10] focuses on classifying edit words (restarts and repairs) in text using a boosted classifier. Another contemporary approach applies a noisy channel model to detect and correct speech disfluencies [11, 12, 13]. Later, Hidden Markov Model (HMM), Conditional Random Field (CRF), and Integer Linear Programming (ILP) based methods [14, 15] were introduced. A classification approach using lexical features is taken by [16], focusing specifically on dialogs with schizophrenic patients. Some incremental [17, 18, 19, 20], multi-step [21], and joint-task (parsing and disfluency detection) [22, 17] methods were introduced recently. Though all of these provide convincing results, they are limited to pre-defined feature templates (lexical, acoustic, and prosodic). With advances in deep learning, the most recent methods rely on recurrent neural networks (RNNs) [23, 24, 25, 26, 27]. These methods use word embeddings and acoustic features instead of pre-defined feature templates.

All the techniques above make one fundamental assumption: that any disfluency detection pipeline includes an automatic speech recognizer (ASR).
Consequently, to the best of our knowledge, all disfluency detection schemes presented so far work at the transcript level. Also, these systems have never been paired with an acoustic-level repair scheme aimed at exploring use cases from the perspective of the listener and the speaker.

In our work, we address these motivations by devising a disfluency detection and repair method that relies solely on acoustic features to synthesize temporally fluent speech segments, from a human-interaction perspective.

3. PROPOSED METHOD

3.1. Disfluency Detection

Our work focuses on building a system that can be used not only as a disfluency detection system but also as a way to understand users' disfluency better. The primary motivations of this work are the following:

• Work with disfluencies at the acoustic level, without using any transcript.
• A significant portion of a disfluent speech consists of long pauses. At the transcript level this is not an issue, but at the acoustic level it matters a lot in determining a speaker's fluency.
• Repair disfluent segments to help users understand the possible improvements to their speech, as well as to create fluent speech content without much hassle.

The types of disfluencies we consider in this work are the use of filler words and intermittent long pauses.

3.1.1. Dataset

The dataset used for filler word segmentation is obtained from the Switchboard transcription (https://www.isip.piconepress.com/projects/switchboard/). We also used the AutoManner [28] transcription (https://www.cs.rochester.edu/hci/currentprojects.php?proj=automanner) for additional data. This gives more generalization to our training samples, since it contains recordings from standard interfaces.

To label disfluent silences, we use a combination of a silence probability model [29] and a disfluency detection model [25]. First, we locate the silences and segment each word pair from the dataset; then, according to the probability model, it is decided whether a silence is disfluent. For each word-pair utterance, the silence probability model gives the probability $P_{sil}$ of a silence occurring between the two words. A word pair with low $P_{sil}$ but a significant amount of silence is labeled as disfluent. If a word pair does not exist in the model vocabulary, we resort to the following approach: since general disfluencies accompany longer silences, any silence within a disfluent segment is labeled as an unnatural pause. Additionally, word pairs surrounded by silences longer than 0.7 seconds are labeled similarly. This choice is experimental and can be considered safe because it is considerably higher than the suggested quantitative measure of (fluent) micro-pauses, 0.2 seconds [30]. Additional fluent pairs are collected from TIMIT [31].

3.1.2. Features

In this step, frame-level acoustic features (log mel band energy or mel frequency cepstral coefficients (MFCCs)) are obtained at each timestep $t$, resulting in a feature vector $m_t \in \mathbb{R}^C$. Here, $C$ is the number of features (in the frequency dimension) at frame $t$. The task of segmenting the filler words is formulated as binary classification of each frame into its correct class $k$:

$$\arg\max_k P(y_t^{(k)} \mid m_t, \theta) \tag{1}$$

where $k \in \{1, 2\}$ and $\theta$ are the parameters of the classifier.

In the training data, the target class $y_t^{(k)} = 1$ if frame $t$ belongs to class $k$ (determined using the onset/offset timeline of $k$ associated with a sound segment), and zero otherwise. Each soundtrack $S$ is divided into multiple fixed-length sequences of frames $M_{t:t+T-1}$, where $T$ is the length of the frame sequence. The corresponding class label matrix $Y_{t:t+T-1}$ contains all the $y_t$.
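To illustrate this feature step, the sketch below computes per-frame MFCCs and slices them into the fixed-length labeled sequences $M_{t:t+T-1}$, $Y_{t:t+T-1}$. It is a minimal sketch, assuming librosa, a 16 kHz sample rate, and the 30 ms / 15 ms framing reported later in Section 4.1.2; the helper names are our own, not the authors'.

```python
import numpy as np
import librosa

def extract_features(wav_path, n_mfcc=40, frame_ms=30, hop_ms=15):
    """Frame-level MFCC features m_t in R^C (30 ms frames, 15 ms overlap)."""
    y, sr = librosa.load(wav_path, sr=16000)   # 16 kHz is our assumption
    n_fft = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    # Shape (C, T_total): one C-dimensional feature vector per frame t.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)

def make_sequences(feats, labels, T=128):
    """Cut a track into fixed-length sequences M_{t:t+T-1} with labels Y."""
    C, total = feats.shape
    xs, ys = [], []
    for start in range(0, total - T + 1, T):
        xs.append(feats[:, start:start + T])
        ys.append(labels[start:start + T])   # y_t = 1 if frame t is a filler
    return np.stack(xs), np.stack(ys)
```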
3.1.3. CRNN for filler word segmentation

Here we propose a convolutional recurrent neural network (CRNN) for filler word segmentation. A similar architecture has previously been used for sound event detection (SED) [8] and speech recognition [32]. The architecture is a combination of convolutional and recurrent layers, followed by feed-forward layers.

Fig. 2. Block diagram of the filler-word segmentation network: MFCC input, convolution and max-pooling, stacking, GRU layers, FC1 (ReLU), and FC2 with softmax.

The sequence of extracted features $M \in \mathbb{R}^{C \times T}$ is fed to the CNN layers with rectified linear unit (ReLU) activations. The filters of the CNN layers span the feature and time dimensions. Max-pooling is applied only over the frequency dimension. The output of max-pooling is a tensor $P_c \in \mathbb{R}^{F \times M' \times T}$, where $F$ is the number of filters of the final convolution layer and $M'$ is the truncated frequency dimension after the max-pooling operation.

To learn features over the time axis, the $F$ feature maps are then stacked along the frequency axis, which yields a tensor $P_s \in \mathbb{R}^{(F \times M') \times T}$. This is fed to the RNN as a sequence of frames $p_t$, which outputs a hidden vector $\hat{p}_t$. The $i$-th recurrent layer output is given by Eq. 2, where $\mathcal{F}$ is a function learned by each RNN unit. In this work, we use GRUs as presented in [33].

$$\hat{p}_t^i = \mathcal{F}(\hat{p}_t^{i-1}, \hat{p}_{t-1}^i) \tag{2}$$

The RNN final-layer outputs $\hat{p}_t^f$ are then fed to a fully connected (FC) layer with ReLU activation, yielding $G \in \mathbb{R}^{FC_1 \times T}$, where $FC_1$ is the number of neurons of the layer. Finally, another layer with softmax activation is applied to obtain the class probabilities. Let $\hat{G} \in \mathbb{R}^{K \times T}$ be the output tensor of the final FC layer; the probabilities are then given by:

$$P(y_t \mid m_{0:t}, \theta) = \mathrm{Softmax}(\hat{g}_t) \tag{3}$$

The CRNN training objective is to minimize the cross-entropy loss with $\ell_2$ regularization:

$$L(\theta) = -\sum_{0:t} \sum_k \log P(y_t^{(k)}) + \lambda \lVert \theta \rVert \tag{4}$$
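To make the data flow of Fig. 2 concrete, here is a minimal PyTorch sketch using the final "mfcc" settings of Table 1 (conv filters 32 and 64, frequency-only pooling of 5 and 4, three GRU layers of 128 units, FC1 of 100 units). The module name, the "same" padding on both axes, and the exact pooling layout are our assumptions where the text leaves them open; this is a sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FillerCRNN(nn.Module):
    """CRNN of Fig. 2: conv + frequency-only max-pooling, stacked feature
    maps fed to GRUs, then FC layers (sizes follow Table 1 for mfcc)."""
    def __init__(self, n_feats=40, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(8, 8), padding='same'), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(5, 1)),   # pool the frequency axis only
            nn.Conv2d(32, 64, kernel_size=(4, 4), padding='same'), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 1)),
            nn.Dropout(0.25),
        )
        m_prime = n_feats // 5 // 4             # truncated frequency dim M' = 2
        self.gru = nn.GRU(64 * m_prime, 128, num_layers=3, batch_first=True)
        self.fc1 = nn.Sequential(nn.Linear(128, 100), nn.ReLU(), nn.Dropout(0.5))
        self.fc2 = nn.Linear(100, n_classes)

    def forward(self, m):                       # m: (B, C, T)
        p = self.conv(m.unsqueeze(1))           # P_c: (B, F, M', T)
        p = p.flatten(1, 2).transpose(1, 2)     # stack: (B, T, F*M') = p_t
        h, _ = self.gru(p)                      # hidden vectors \hat{p}_t
        return self.fc2(self.fc1(h))            # per-frame class logits
```

Training per Eq. (4) could then use `nn.CrossEntropyLoss` on the flattened per-frame logits with `torch.optim.Adagrad(model.parameters(), lr=0.01, weight_decay=0.01)`, matching the optimizer, learning rate, and λ reported in Section 4.1.2 (weight decay approximates the regularizer).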
3.1.4. Disfluent silence classification

The problem is formulated as a binary classification task: given a silent segment $Z$, decide whether it is a disfluent or a non-disfluent silence. Classifying a silence only makes sense when it is combined with the adjacent utterances, because the occurrence of a silence is driven by the utterance and heavily influenced by disfluencies. Thus, it is not always evident that every pause above a significant threshold is disfluent; the illustration in Fig. 3 gives an idea of this fact.

Fig. 3. The visualization interface; top: speech track with colored segmentation outputs (brown: disfluent silences, blue: fillers, green: fluent silences); bottom: modified speech.

We train a support vector machine (SVM) to achieve this task. Given a silent segment $Z$, it is first padded with the one-word utterances on its left and right ($\hat{Z}$). Then MFCC features are extracted, and we take the mean over the frequency axis to create the feature vector $z_i \in \mathbb{R}^T$, where $T$ is the number of frames in $z_i$. Segments are of variable length, so $z_i$ is padded with trailing zeros prior to classification. During testing we do not use the previous and next word boundaries; instead, a fixed-length time window is used. In our experiments, we found that windows of 0.8-1.0 seconds give good results.

3.2. Disfluency Repair

First, the fillers are replaced with silences. We found that it is often helpful (such as when ambient noise is present) to use a decomposition mechanism [34] on the speech to separate the background noise and vocals; each filler is then replaced with its corresponding background segment. The modified track is then used to segment the silences ($Z$), and finally the classification is done.

Fig. 4. Silence modification pipeline: filler replacement splits the track into vocals and background; silence segmentation feeds the silence classifier, which separates fluent from disfluent silence intervals and yields the modified silences. The dashed line on the histogram of fluent silence times marks their median.

All the silence segment lengths are then modified to make the speech fluent (Fig. 4). The goal is to reduce the amount of long, unnatural pauses that hurt the fluency of the speech. It is also required to keep the pace of the speech intact: too much reduction of silences makes the speech sound unnatural and broken. We take the fluent silence times (i.e., those suggested by our silence classifier), obtain a histogram, and find that taking the median of the histogram bins as the optimal amount of silence works quite well. In this way, the distribution of silences along the speech progression is confined to a near-constant distribution, and the speaker sounds more consistent and fluent in the modified speech.
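As an illustration of this silence adjustment, the sketch below replaces each disfluent pause length with the median fluent-silence time read from a 10-bin histogram (the dashed line in Fig. 4). The text only loosely specifies how the median is taken over the bins, so the bin-center-at-half-mass reading here is our assumption.

```python
import numpy as np

def modified_silence_lengths(fluent_durs, disfluent_durs, n_bins=10):
    """Replace each disfluent pause with the median fluent-silence
    time read off a histogram of the fluent pauses (cf. Fig. 4)."""
    counts, edges = np.histogram(fluent_durs, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2
    # Bin center where the cumulative count first crosses one half:
    # our reading of "the median of the histogram bins".
    cdf = np.cumsum(counts) / counts.sum()
    idx = min(np.searchsorted(cdf, 0.5), len(centers) - 1)
    target = centers[idx]
    return [target] * len(disfluent_durs)
```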
4. RESULTS & ANALYSIS

4.1. Experimental Settings

4.1.1. Datasets

The experiments are performed on Switchboard [35], AutoManner [28], and our own dataset of public speaker recordings. To train the CRNN we use the segments from Switchboard; the CRNN test results are reported on held-out data from Switchboard-I. Silence classification results are reported on held-out data from TIMIT [31], Switchboard, and AutoManner. All the fluency metrics are evaluated on our dataset, containing recordings of 20 non-native speakers of English. The speakers were asked to talk on a specific topic for 50-60 seconds.

4.1.2. Parameter settings

We experimented with different configurations of the CNN and RNN parameters and different features.

Types of features: Initial experiments were performed on mel frequency cepstrum coefficients (mfcc), mel spectrograms, log mel spectrograms (log mel), spectral contrast, zero crossing rate, and tonnetz. Based on the experimental results, the mfcc (40 × t) and log mel (128 × t) features are used for filler segmentation. For the silence classification, mfcc features are used after taking the mean over the frequency axis. The feature dimensions used are shown in Table 1. All features are extracted in 30 ms frames with 15 ms overlap.

RNN & CRNN parameters: In experiments with the CNN and CRNN, we explore {1, 2, 3} convolutional (conv) layers with combinations of max pooling and average pooling. At each layer, ReLU activation is used. The following settings are used for conv filters: {16, 32, 64}, and kernel sizes: {2, 3, 4, 5, 8}. All the conv layers use same padding. The pooling size was varied within {2, 3, 4, 5, 8}; pooling is performed only on the frequency dimension. We tried dropout ratios of {0.3, 0.5, 0.75}. The RNN we use is gated recurrent units (GRU). Experiments are performed with {2, 3} layers (l) and {64, 128, 256} hidden units (d). No intermediate dropout (dr) is applied. The final fully connected layer (FC1 in Fig. 2) is experimented with hidden units (d) of {100, 200} and dropout ratios of {0.3, 0.5, 0.75}.

Features   CNN                                        RNN          FC
mfcc       conv1 [32,(8,8)], conv2 [64,(4,4)];        l=3, d=128   d=100, dr=0.5
           maxpool1 [5,5], maxpool2 [4,4]; dr=0.25
log mel    conv1 [32,(8,8)], conv2 [64,(4,4)];        l=3, d=128   d=100, dr=0.5
           maxpool1 [8,4], maxpool2 [4,2]; dr=0.25

Table 1. Final parameters for the CNN, RNN, and fully connected layers.

The networks are trained in an end-to-end fashion using the AdaGrad algorithm for 200 epochs. The learning rate was set to 0.01, and the regularization constant (λ) was set to 0.01. The final parameters are given in Table 1.

Silence classification parameters: The maximum length of the sequences was set to 128. Final parameters are given in Table 3.

SVM                          LogReg          XGBoost
itr=1500, kernel=rbf, C=10   itr=100, C=10   depth=3, lr=0.1, estimators=100

Table 3. Final parameters used in silence classification.

4.1.3. Evaluation Metrics

To evaluate the filler word segmentation we use the following frame-level statistic:

• F1 score (F1): The F1 score is calculated at the frame level (30 ms) using TP, the frames where fillers are correctly detected; TN, the frames where non-fillers are correctly detected; FP, the frames where fillers are wrongly detected; and FN, the frames where non-fillers are wrongly detected.

The silence classification is evaluated using the F1 score w.r.t. the disfluent silence class. To evaluate the quality of the augmented speech produced by our system, we use the following metrics defined in [9]:

• Speech rate:
$$SR = \frac{\#\,\text{of syllables}}{\text{total time} - ufp[<3]} \times 60 \tag{5}$$
where $ufp[<3]$ is the total time of unfilled pauses shorter than 3 seconds, since pauses longer than 3 seconds are considered articulation pauses [30].

• Articulation rate:
$$AR = \frac{\#\,\text{of syllables}}{\text{total time}} \times 60 \tag{6}$$

• Phonation-time ratio:
$$PTR = \frac{\text{speaking time}}{\text{total time}} \tag{7}$$

• Mean length of runs:
$$MLR = \frac{\#\,\text{of syllables}}{\#\,\text{of utterances between } p[>0.25]} \tag{8}$$
where $p[>0.25]$ denotes pauses longer than 0.25 seconds.

• Mean length of pauses:
$$MLP = \frac{\text{total duration of } p[>0.2]}{\#\,\text{of } p[>0.2]} \tag{9}$$

• Filled pauses per minute:
$$FPM = \frac{\#\,\text{of filled pauses}}{\text{total time}} \tag{10}$$
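For concreteness, the sketch below computes Eqs. (5)-(10) from pause annotations. It is a minimal sketch under our own conventions: all times in seconds, the run count for Eq. (8) taken as (number of pauses > 0.25 s) + 1, and a seconds-to-minutes factor added to Eq. (10), both of which the equations leave implicit.

```python
def fluency_metrics(n_syllables, total_time, speaking_time, pauses, n_filled):
    """Fluency metrics of Eqs. (5)-(10); `pauses` holds the
    unfilled-pause durations, all times in seconds."""
    ufp_lt3 = sum(p for p in pauses if p < 3.0)    # pauses >= 3 s are articulation pauses [30]
    runs = sum(1 for p in pauses if p > 0.25) + 1  # utterances between p[>0.25]: our reading
    p_gt_02 = [p for p in pauses if p > 0.2]
    return {
        "SR":  n_syllables / (total_time - ufp_lt3) * 60,  # speech rate, Eq. (5)
        "AR":  n_syllables / total_time * 60,              # articulation rate, Eq. (6)
        "PTR": speaking_time / total_time,                 # phonation-time ratio, Eq. (7)
        "MLR": n_syllables / runs,                         # mean length of runs, Eq. (8)
        "MLP": sum(p_gt_02) / max(len(p_gt_02), 1),        # mean length of pauses, Eq. (9)
        "FPM": n_filled / total_time * 60,                 # filled pauses per minute; the
                                                           # x60 is our assumption, Eq. (10)
    }
```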
4.2. Filler Word Segmentation

The filler word segmentation evaluation results are given in Tables 4 and 5. In Table 4 we report the comparative performance of the CRNN using different features. To better gauge the credibility of the CRNN, in Table 5 we compare its results against an automatic speech recognizer available with Kaldi (ASpIRE chain model, https://github.com/kaldi-asr/kaldi/tree/master/egs/aspire). Considering the simplicity of our network, it performs quite close to the ASR in terms of F1 score. All results are evaluated on a subset of the Switchboard-I dataset.

Features   Precision   Recall   F1
mfcc       0.9482      0.9610   0.9534
log mel    0.9495      0.9629   0.9550

Table 4. Performance of the CRNN with different features.

Method   Precision   Recall   F1
ASR      0.9774      0.9792   0.9775
CRNN     0.9495      0.9629   0.9550

Table 5. Performance of filler word segmentation compared to an automatic speech recognizer.

The only drawback we have observed while comparing our method and the ASR is that our classifier sometimes detects segments that merely sound similar to "uh" or "um".

4.3. Disfluent Silence Classification

For this task we experimented with an SVM, logistic regression (LogReg), and XGBoost. The results are summarized in Table 6. We used 10-fold cross-validation to report our results.

Method   SVM      LogReg   XGBoost
F1       0.9055   0.9200   0.9207

Table 6. Silence classification performance on TIMIT, Switchboard, and AutoManner.

4.4. Disfluency Repair

After processing the speeches by removing the fillers and long silences, the fluent speech is obtained. To compare the fluency of the synthesized and the original speech, the metrics discussed in Section 4.1.3 are used. The results are reported in Table 2; the mean of each metric across all samples is reported.

Metrics →   SR ↑      AR ↑      PTR ↑    MLR ↑   MLP ↓   FPM ↓
Original    165.357   171.099   58.865   0.400   0.654   3.659
Processed   186.241   186.241   65.570   0.495   0.365   1.762

Table 2. The fluency metrics before and after processing the speeches. ↑ means higher is better and ↓ means lower is better.

From the numbers it is clear that we improve the fluency. Notably, in the processed speech the articulation rate and speech rate increase to the same value, since we take care of all the unfilled pauses in the speech and introduce a more uniform silence production. Beyond the numbers, for a qualitative impression, some processed samples are available at the examples page (https://sagniklp.github.io/pub-speaker-aug/).

5. FUTURE WORK

This work is motivated by the fact that disfluency detection is not only useful for intelligent agents but is also a practical problem definition for helping users produce a better, more confident, and fluent talk. With respect to the variety of disfluencies produced in speech, this work is a small step towards a bigger goal: repairing the disfluencies in a speech from the speaker's perspective. Along with addressing the pitfalls of our method, the following could be future directions of this work:

• Improving the filler word segmentation performance, as well as devising techniques to segment other kinds of common disfluencies (repetitions, discourse markers, corrections) and speech impairments (stuttering).
• Devising a dynamic and online repair scheme that generates the necessary (disfluent) portions of speech instead of replacing them.

6. CONCLUSION

Disfluency detection is a well-explored problem in the speech processing community, performed mostly on speech transcripts to aid intelligent conversational agents. In this work, we interpret disfluency detection from the speaker's perspective and introduce an additional component: repairing the disfluencies. Consequently, we work solely in the acoustic domain, removing the need for a complex system like an ASR before disfluency detection. With the results of our detection and repair scheme, we show improved fluency in speakers' dialogues, given a less fluent speech. To the best of our knowledge, this is the first work on disfluency repair for the user's own sake, and it can be further extended to assist users with speech impairments and other general disfluencies.

7. ACKNOWLEDGEMENTS

We are thankful to Faizaan Charania and Mahima Parashar for curating the dataset and working on some essential observations. We would also like to thank the participating speakers for the speeches they provided. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp and P6000 GPUs used for this research.
8. REFERENCES

[1] Jean E. Fox Tree, "The effects of false starts and repetitions on the processing of subsequent words in spontaneous speech," Journal of Memory and Language, vol. 34, no. 6, pp. 709–738, 1995.
[2] Kathryn Womack, Wilson McCoy, Cecilia Ovesdotter Alm, Cara Calvelli, Jeff B. Pelz, Pengcheng Shi, and Anne Haake, "Disfluencies as extra-propositional indicators of cognitive processing," in Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics. Association for Computational Linguistics, 2012, pp. 1–9.
[3] Martin Corley and Oliver W. Stewart, "Hesitation disfluencies in spontaneous speech: The meaning of um," Language and Linguistics Compass, vol. 2, no. 4, pp. 589–602, 2008.
[4] Dale J. Barr and Mandana Seyfeddinipur, "The role of fillers in listener attributions for speaker disfluency," Language and Cognitive Processes, vol. 25, no. 4, pp. 441–455, 2010.
[5] Elizabeth Shriberg, "Disfluencies in Switchboard," in Proceedings of the International Conference on Spoken Language Processing, 1996, vol. 96, pp. 11–14.
[6] Sandra Merlo and Letícia Lessa Mansur, "Descriptive discourse: Topic familiarity and disfluencies," Journal of Communication Disorders, vol. 37, no. 6, pp. 489–503, 2004.
[7] Paul Lennon, "Investigating fluency in EFL: A quantitative approach," Language Learning, vol. 40, no. 3, pp. 387–417, 1990.
[8] Emre Cakir, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," arXiv preprint, 2017.
[9] Judit Kormos and Mariann Dénes, "Exploring measures and perceptions of fluency in the speech of second language learners," System, vol. 32, no. 2, pp. 145–164, 2004.
[10] Eugene Charniak and Mark Johnson, "Edit detection and parsing for transcribed speech," in Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Association for Computational Linguistics, 2001, pp. 1–9.
[11] Matthias Honal and Tanja Schultz, "Correction of disfluencies in spontaneous speech using a noisy-channel approach," in Eighth European Conference on Speech Communication and Technology, 2003.
[12] Mark Johnson and Eugene Charniak, "A TAG-based noisy-channel model of speech repairs," in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 2004.
[13] Simon Zwarts, Mark Johnson, and Robert Dale, "Detecting speech repairs incrementally using a noisy channel approach," in Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010, pp. 1371–1378.
[14] Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin Hillard, Mari Ostendorf, and Mary Harper, "Enriching speech recognition with automatic detection of sentence boundaries and disfluencies," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1526–1540, 2006.
[15] Kallirroi Georgila, "Using integer linear programming for detecting speech disfluencies," in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers. Association for Computational Linguistics, 2009, pp. 109–112.
[16] Christine Howes, Matt Purver, Rose McCabe, P. G. Healey, and Mary Lavelle, "Helping the medicine go down: Repair and adherence in patient-clinician dialogues," in Proceedings of the 16th SemDial Workshop on the Semantics and Pragmatics of Dialogue (SeineDial), 2012, pp. 19–21.
[17] Matthew Honnibal and Mark Johnson, "Joint incremental disfluency detection and dependency parsing," Transactions of the Association for Computational Linguistics, vol. 2, no. 1, pp. 131–142, 2014.
[18] Julian Hough and Matthew Purver, "Strongly incremental repair detection," arXiv preprint arXiv:1408.6788, 2014.
[19] Christine Howes, Julian Hough, Matthew Purver, and Rose McCabe, "Helping, I mean assessing psychiatric communication: An application of incremental self-repair detection," 2014.
[20] James Ferguson, Greg Durrett, and Dan Klein, "Disfluency detection with a semi-Markov model and prosodic features," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 257–262.
[21] Xian Qian and Yang Liu, "Disfluency detection using multi-step stacked learning," in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 820–825.
[22] Mohammad Sadegh Rasooli and Joel Tetreault, "Joint parsing and disfluency detection in linear time," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 124–129.
[23] Julian Hough and David Schlangen, "Recurrent neural networks for incremental disfluency detection," Interspeech 2015, 2015.
[24] Shaolei Wang, Wanxiang Che, and Ting Liu, "A neural attention model for disfluency detection," in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 278–287.
[25] Vicky Zayats, Mari Ostendorf, and Hannaneh Hajishirzi, "Disfluency detection using a bidirectional LSTM," arXiv preprint, 2016.
[26] Julian Hough and David Schlangen, "Joint, incremental disfluency detection and utterance segmentation from speech," in Proceedings of the Annual Meeting of the European Chapter of the Association for Computational Linguistics (EACL), 2017.
[27] Shaolei Wang, Wanxiang Che, Yue Zhang, Meishan Zhang, and Ting Liu, "Transition-based disfluency detection using LSTMs," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2785–2794.
[28] M. Iftekhar Tanveer, Ru Zhao, Kezhen Chen, Zoe Tiet, and Mohammed Ehsan Hoque, "AutoManner: An automated interface for making public speakers aware of their mannerisms," in Proceedings of the 21st International Conference on Intelligent User Interfaces. ACM, 2016, pp. 385–396.
[29] Guoguo Chen, Hainan Xu, Minhua Wu, Daniel Povey, and Sanjeev Khudanpur, "Pronunciation and silence probability modeling for ASR," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[30] Heidi Riggenbach, "Toward an understanding of fluency: A microanalysis of nonnative speaker conversations," Discourse Processes, vol. 14, no. 4, pp. 423–441, 1991.
[31] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, and David S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, 1993.
[32] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4580–4584.
[33] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint, 2014.
[34] Zafar Rafii and Bryan Pardo, "Music/voice separation using the similarity matrix," in ISMIR, 2012, pp. 583–588.
[35] John J. Godfrey, Edward C. Holliman, and Jane McDaniel, "Switchboard: Telephone speech corpus for research and development," in Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on. IEEE, 1992, vol. 1, pp. 517–520.
