A Method for Analysis of Patient Speech in Dialogue for Dementia Detection


Authors: Saturnino Luz, Sofia de la Fuente, Pierre Albert

Usher Institute of Population Health Sciences & Informatics, Edinburgh Medical School
The University of Edinburgh, Scotland, UK
{s.luz,sofia.delafuente,pierre.albert}@ed.ac.uk

Abstract

We present an approach to automatic detection of Alzheimer's type dementia based on characteristics of spontaneous spoken language dialogue consisting of interviews recorded in natural settings. The proposed method employs additive logistic regression (a machine learning boosting method) on content-free features extracted from dialogical interaction to build a predictive model. The model training data consisted of 21 dialogues between patients with Alzheimer's and interviewers, and 17 dialogues between patients with other health conditions and interviewers. Features analysed included speech rate, turn-taking patterns and other speech parameters. Despite relying solely on content-free features, our method obtains an overall accuracy of 86.5%, a result comparable to those of state-of-the-art methods that employ more complex lexical, syntactic and semantic features. While further investigation is needed, the fact that we were able to obtain promising results using only features that can be easily extracted from spontaneous dialogues suggests the possibility of designing non-invasive and low-cost mental health monitoring tools for use at scale.

Keywords: Dementia diagnosis and prediction, Alzheimer's disease, dialogue analysis, speech features, vocalisation graphs, content-free analysis.
1. Introduction

Research into early detection of Alzheimer's disease (AD) has intensified in the last few years, driven by the realisation that, in order to implement effective measures for secondary prevention of Alzheimer's type dementia (ATD), it may be necessary to detect AD pathology decades before a clinical diagnosis of dementia is made (Ritchie et al., 2017). While imaging (PET, MRI scans) and cerebrospinal fluid analysis provide accurate diagnostic methods, there is an acknowledged need for alternative, less invasive and more cost-effective tools for AD screening and diagnostics (Laske et al., 2015). A number of neuropsychological tests have been developed which can identify signs of AD with varying levels of accuracy (Mortamais et al., 2017; Ritchie et al., 2017). However, the proliferation of technologies that enable personal health monitoring in daily life points towards the possibility of developing tools to predict AD based on the processing of behavioural signals.

Speech is relatively easy to elicit and has proven to be a valuable source of clinical information. It is closely related to cognitive status, having been used as the primary input in a number of applications for mental health assessment. It is also ubiquitous and can be seamlessly acquired. In recent years, combinations of signal processing, machine learning and natural language processing have been proposed for the diagnosis of AD based on the patient's speech and language (Fraser et al., 2016). Models built on phonetic, lexical and syntactic features have borne out the observation that these linguistic processes are increasingly affected as the disease progresses (Kirshner, 2012).
However, most machine learning research in this area has employed either recorded narrative speech (Lopez-De-Ipiña et al., 2012) or recorded scene descriptions (Luz, 2017; Fraser et al., 2016) collected as part of a neuropsychological assessment test, such as the Boston "cookie theft" picture description task (Becker et al., 1994).

In contrast to those methods, our approach employs spontaneous conversational data, exploring patterns of dialogue as basic input features. Content-free interaction patterns of this kind were first used in the characterisation of psychopathology by Jaffe and Feldstein (1970), who represented therapist-patient dialogues as Markov chains. Here, we build on these ideas to analyse patient data from the Carolina Conversations Collection (CCC) (Pope and Davis, 2011). We trained machine learning models on these data to differentiate AD and non-AD speech. This work is, to the best of our knowledge, the first to employ low-level dialogue interaction data (as opposed to lexical features, or data from narrations or other forms of monologue) as a basis for AD detection in spontaneous speech.

2. Background

One of the greatest challenges facing developed countries, and increasingly the developing world, is that of improving the quality of life of older people. In 2015, the First Ministerial Conference of the WHO on Global Action Against Dementia estimated that there are 47.5 million cases of this condition in the world. Cohort studies show between 10 and 15 new cases per thousand people every year for dementia, and between 5 and 8 for Alzheimer's disease. Prognosis is usually poor, with an average life expectancy of 7 years from diagnosis; less than 3% of those diagnosed live longer than 14 years. Current statistics predict that the population aged over 65 will triple between the years 2000 and 2050 (World Health Organization and others, 2015).
This will lead to structural and societal changes, accentuating what is already becoming a highly demanding issue for health care systems. Dementia is therefore set to become a very common cause of disability, placing a heavy burden on carers and patients alike. While there is currently neither a cure nor a way to entirely prevent the progress of the disease, it is hoped that a better understanding of language and communication patterns will contribute to secondary prevention. A characterisation of communication patterns and their relation to cognitive functioning and decline could be useful in the design of assistive technologies such as adaptive interfaces and social robotics (Wada et al., 2008). These technologies might help provide respite to carers, and stimulate cognitive, physical and social activity, which can slow disease progression and improve the patient's quality of life (Middleton and Yaffe, 2009). Collecting relevant real-life observational data and assembling prior and current knowledge (Wada et al., 2008) could lead to new, effective and personalised interventions.

Assessing people's behaviour in natural settings might also contribute to earlier detection (Parsey and Schmitter-Edgecombe, 2013; Mortamais et al., 2017). Language impairment is a common feature of dementia, manifesting in signs such as word-finding and comprehension difficulties, blurred speech or disrupted coherence (American Psychiatric Association, 2000). Although language is a good source of clinical information regarding cognitive status, manual analysis of language by mental health professionals for diagnostic purposes is challenging and time-consuming.
Advances in speech and language technology could help by providing tools for detecting reliable differences between patients with dementia and controls (Bucks et al., 2000), distinguishing among dementia stages (Thomas et al., 2005) and differentiating various types of dementia (Fraser et al., 2016).

Features such as grammatical constituents, vocabulary richness, syntactic complexity, psycholinguistic measures, information content, repetitiveness, acoustics, speech coherence and prosody have been explored in conjunction with machine learning methods to identify Alzheimer's and other types of dementia through the patient's speech. This is not only because language is impaired in these patients, but also because language relies on other cognitive functions, such as executive functions, which allow us to interact in a sound and meaningful way. These functions are responsible for decision making, strategy planning, foreseeing consequences and problem solving, all of which are essential to successful communication but are impaired by AD (Fraser et al., 2016; Marklund et al., 2009; Satt et al., 2013). Although hardly perceptible to the speakers themselves, patterns of impairment are thought to occur even in informal and spontaneous conversations (Bucks et al., 2000; Cohen and Elvevåg, 2014).

Our hypothesis in this paper is that people with an AD diagnosis will show identifiable patterns during dialogue interactions, including disrupted turn taking and differences in speech rate. These indices relate to the fact that, in general, patients with AD show poorer conversational abilities and their normal turn-taking is repeatedly interrupted. We therefore expect less conversational fluidity overall in the AD group dialogues, as compared to the non-AD group.
Our approach, which does not rely on transcription but only on speech-silence patterns and basic prosodic information, obtains levels of accuracy comparable to state-of-the-art systems that rely on more complex feature sets.

3. Related work

Potential applications of the kind of speech technology described in this paper include the development of interactive assistive technologies, and the monitoring of users for signs of cognitive decline with a view to mitigating further decline. From the perspective of potential applications of automatic speech analysis to technology-assisted care, there is evidence (Rudzicz et al., 2014b) that it is psychologically more acceptable for a user to be aided by another person or a robot than by ambient sensors and devices which are unable to offer meaningful interaction. Therefore, the development of such assistive applications involves research on speech processing for natural conversations rather than scripted speech or monologues (Conway and O'Connor, 2016).

From the perspective of monitoring for early detection, it is known that AD disrupts one's ability to follow dialogues, even in simple, routine interactions. At later stages of the disease, failure to perform meaningful interactions appears (Watson, 1999). This has a negative impact on tasks such as following instructions regarding household activities and medication, as well as preventing rewarding social interactions. Here, once again, the focus should be on natural interaction data, as scripted talk cannot be compared to spontaneous conversation in terms of information richness and external validity of results (Kato et al., 2013). Over the last decades, different approaches have targeted early detection of AD on spontaneously generated data through automatic and non-invasive intelligent methods.
Some of these approaches have focused on the analysis of speech parameters: automatic spontaneous speech analysis (ASSA) and emotional temperature (ET) (Lopez-De-Ipiña et al., 2012), as well as voiceless segments and phonological fluency, have been shown to explain significant variance in neuropsychological test results (García Meilán et al., 2012). These methods are not only non-invasive and free from side-effects, but also relatively cheap in time and resources. Another approach that relies on easily extracted acoustic features, such as the ones we propose in this paper, though not in dialogical or spontaneous speech settings, is presented by Satt et al. (2013). This approach extracts a number of voice features (voiced segments, average utterance duration, etc.) from recordings of picture description, sentence repetition, and repeated pronunciation of three syllables used in diadochokinetic tests in succession. The method achieves accuracy levels of over 80% in the detection of AD and mild cognitive impairment (MCI).

Other approaches have used time-aligned transcripts and syntactic parsing, extracting speech features and using them to distinguish healthy elderly subjects from subjects suffering from AD or MCI, as well as for other tasks. This classification has been done either by comparing impaired to healthy speech performance (speech quality in terms of lexicon, coherence, etc.), or by comparing classifier performance when only neuropsychological tests are included against performance when such tests are used together with speech features, generally with statistically significant improvements (Roark et al., 2011; Fraser et al., 2016). Analyses performed on similar corpora provide good insight into the performance achieved using different features.
A first analysis (Fraser et al., 2016), based on a monologue corpus (DementiaBank), identified four main linguistic factors as descriptors: syntactic, semantic and information impairments, and acoustic abnormality. They achieved an accuracy of up to 92.05% using a full-scale analysis of 25 features, selected from an original set of 370 features after extensive experimentation. An analysis of the CCC corpus by Guinn and Habash (2012) used similar linguistic features. Unlike the work presented in this paper, their analysis focused on the differences between interviewers and subjects in the subset of patients with AD. They achieved a combined accuracy of 75-79.5% using decision trees, with a large discrepancy between AD (38-42%) and non-AD (74-100%) recognition accuracy.

Work on dialogue so far has identified features such as conversational confusion (AD increases confusion rates, which relates to slower and shorter speech; Rudzicz et al., 2014a), prosodic measures (Gonzalez-Moreira et al., 2015), and emotion (Devillers et al., 2005). These studies used machine learning methods (neural networks, naïve Bayes, and random forests, respectively), reporting accuracies in the 70-90% range. Although these results are promising, they are difficult to generalise, because they are primarily content dependent. That is, they employ lexical, and sometimes syntactic, information, which presents a number of potential disadvantages. The content of a conversation is likely to change greatly depending on whether a participant belongs to the control group or to the group with Alzheimer's disease, especially if the conversational partner is their doctor. In addition, such content is difficult to acquire in spontaneous speech settings.
Despite the advances in automatic speech recognition, recognition (word) error rates in unconstrained settings are still over 11%, even for fairly clear telephone dialogues (Xiong et al., 2016). Another difficulty with these approaches is the fact that they are language-dependent, and therefore require building different models for different languages, which in the context of global mental health could be a major shortcoming. Therefore, these models should aim to be as content-independent as possible in order to be generalisable (Satt et al., 2013). In contrast to content-based approaches, our method focuses on the interaction patterns themselves, rather than on characteristics of the speech and language content as such.

4. Methods

4.1. Dataset

We have conducted our analysis using the Carolina Conversations Collection (Pope and Davis, 2011). The dataset is a digital collection of recordings of conversations about health, including both audio and video data, with corresponding transcriptions. The corpus consists of natural conversations involving an older person (over the age of 65) with a medical condition. Several demographic and clinical variables are also available, including: age range, gender, occupation prior to retirement, disease diagnosed, and level of education (in years). The interviewers were gerontology and linguistics students or researchers to whom the patients spoke at least twice a year. A unique alias was assigned to each patient to protect their identity. Access to the data was provided after complying with the ethical requirements of the University of Edinburgh and the Medical University of South Carolina.

Figure 1: Vocalisation diagram for a patient dialogue.
In order to ensure that the results described here are reproducible, we will provide, on request, the identifiers of the dialogues used in our experiments so that interested researchers can recreate our dataset upon being granted access to the CCC. The source code used for processing the data is available on a University of Edinburgh gitlab server¹.

For the research described here we selected a total of 38 patient dialogues: 21 patients had a diagnosis of Alzheimer's disease (15 females, 6 males), and 17 patients (12 females, 5 males) had other diseases (diabetes, cardiac issues, etc., excluding neuropsychological conditions), but not AD. These groups were selected for matching age ranges and gender frequencies so as to avoid statistical bias. The dataset also included time-aligned transcripts, which we did not use except for the computation of an alternative speech rate feature, as described below.

4.2. Data Preparation

The speech data selected as previously described were pre-processed in order to generate vocalisation graphs, that is, Markov diagrams encoding the first-order conditional transition probabilities between vocalisation events, together with steady-state probabilities (Luz, 2013). Vocalisation events are classified as speech by either the patient or the interviewer/others, joint talk (overlapping speech), or silence events (also known as 'floor' events), which are further divided in the diagrams into pauses and switching pauses, according to whether the floor is retained by the same speaker or taken by another speaker, respectively. An example of a vocalisation graph is shown in Figure 1.

Vocalisation and pause patterns have been successfully employed in the analysis of dialogues in a mental-health context (Jaffe and Feldstein, 1970), in the segmentation (Luz and Su, 2010) and classification of dialogues, and more recently in the characterisation of participant role and performance in collaborative tasks (Luz, 2013).
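A vocalisation graph of this kind can be estimated from a labelled event sequence by simple counting. The following is a minimal sketch: the event labels and the example sequence are illustrative assumptions, not the exact annotations produced by the CCC processing pipeline.

```python
from collections import Counter

def vocalisation_graph(events):
    """Estimate a first-order vocalisation graph (Markov diagram) by counting.

    Returns the conditional transition probabilities P(next | current) and the
    steady-state (relative frequency) probabilities of each event type.
    """
    pair_counts = Counter(zip(events, events[1:]))  # first-order transitions
    out_totals = Counter(events[:-1])               # outgoing counts per state
    transitions = {(a, b): n / out_totals[a]
                   for (a, b), n in pair_counts.items()}
    steady = {s: n / len(events) for s, n in Counter(events).items()}
    return transitions, steady

# Illustrative event sequence; the labels are hypothetical stand-ins for the
# patient/interviewer speech, pause and switching-pause events of the paper.
events = ["patient", "pause", "patient", "switching_pause",
          "interviewer", "pause", "interviewer", "switching_pause", "patient"]
trans, steady = vocalisation_graph(events)
```

Each row of the resulting transition table sums to one, and the steady-state probabilities sum to one over all event types, matching the quantities encoded in the VGO representation described below.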
Models that employ basic turn-taking statistics have also been proposed for dementia diagnosis (Mirheidari et al., 2016), though not in a systematic content-free framework as in our proposed method.

¹ https://cybermat.tardis.ed.ac.uk/pial/CCCdataset

The distributions of event counts according to vocalisation event type are shown in Figure 2. It can be observed that patients with AD tend to produce more vocalisation events than their interviewers (and, consequently, produce more silence events). This is consistent with findings in the literature on language changes in AD (American Psychiatric Association, 2000).

Figure 2: Distribution of vocalisation event counts for patients with and without AD in CCC dialogues.

Speech rate was estimated using De Jong's syllable nuclei detection algorithm (Jong and Wempe, 2009), which is an unsupervised method; that is, it can be applied directly to the acoustic signal, with no need for human annotation. However, as the audio quality of the CCC recordings is uneven, and as the dataset provides no gold standard against which one could assess syllable counts, we decided to validate the use of De Jong's method against the time-stamped transcripts provided. Using these transcripts one could, in principle, estimate average words per minute (WPM) for individual utterances, as is sometimes done (Hayakawa et al., 2017). However, this method of measuring WPM based on transcription has a number of limitations. Words have variable length, and their articulation can vary greatly due to a number of speech-related phenomena, such as phonological stress, frequency, contextual predictability and repetition (Bell et al., 2009). In order to mitigate these problems, we instead produced speech rate ratio estimates normalised through a speech synthesizer, employing the method proposed by
Hayakawa et al. (2017). These estimates represent deviations from a "normalised" pace of 160 words per minute (WPM) synthesised using the MaryTTS system (Schröder and Trouvain, 2003). We therefore computed the ratio of the duration of the synthesised speech to the actual duration of the patient's speech. The speech rate ratio correlated well with the syllables-per-minute rate extracted using only the recorded audio (ρ = 0.502, t(30) = 3.19, p = 0.003), indicating that speech rate can be estimated with an acceptable level of reliability through the unsupervised method, even in fairly noisy settings.

A Python script was employed to extract basic speaker turn time stamps, speaker role information, and transcriptions from the original XML-encoded CCC data. The resulting data were then processed using the R language in order to detect silence intervals, and to categorise turn transitions and pause events.

Some descriptive statistics on the dialogues can be seen in Table 1. These statistics include: average turn duration (how many seconds a participant speaks on average), total turn duration (how many seconds the participant's turns lasted in total), normalised turn duration (the ratio of a participant's turn duration to the total duration of AD or non-AD dialogues, according to the participant's class), number of words generated (total per class and on average per participant in each class), and number of words per minute (average per participant in each class).

Table 1: Descriptive statistics on dialogue turn-taking (duration given in seconds).

Feature                          non-AD        AD
Dialogue duration                4107.3    7628.4
Dialogue duration TTS            7618.8    7618.8
Avg turn duration                  97.3     255.8
Total turn duration              1654.3    4348.3
Norm. total turn duration           3.0       4.1
Avg turn duration TTS             107.6     238.0
Total turn duration TTS          1829.7    4046.1
Norm. total turn duration TTS       3.0       4.2
Avg number of words               314.6     742.5
Total number of words            5348.0   12622.0
Avg words per minute              155.9     166.5
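The speech rate ratio described above normalises the actual utterance duration against a synthesised rendering of the same words. A minimal sketch follows; where no true MaryTTS duration is available, it approximates the synthesised duration from the word count at the 160 WPM reference pace, which is an illustrative assumption rather than the exact published procedure.

```python
NORMALISED_WPM = 160.0  # reference pace of the synthesised speech

def speech_rate_ratio(n_words, actual_duration_s, synth_duration_s=None):
    """Ratio of synthesised to actual utterance duration.

    The paper's estimates use the duration of a MaryTTS rendering of the
    transcript; when that duration is not supplied, this sketch approximates
    it from the word count at the 160 WPM reference pace (an assumption for
    illustration only).
    """
    if synth_duration_s is None:
        synth_duration_s = n_words / NORMALISED_WPM * 60.0
    return synth_duration_s / actual_duration_s

# A 40-word utterance spoken in 12 s: the reference pace would take 15 s,
# so a ratio above 1 indicates faster-than-reference speech.
ratio = speech_rate_ratio(40, 12.0)
```

Under this convention, a ratio of 1.0 corresponds to the 160 WPM reference pace, with higher values indicating faster speech.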
Contrary to our expectations, we did not observe a statistically significant difference in speech rate in syllables per minute between patients with and without AD (Welch two-sample t-test, t(30.5) = 1.15, p = 0.28), even though the mean for non-AD patients (M = 180.8 syllables/min, sd = 28.4) was higher than that for patients with AD (M = 168 syllables/min, sd = 35.6).

Two alternative data representations were generated. The first (henceforth referred to as VGO) was based on the vocalisation graphs only. That is, VGO encodes the probabilities of each possible pair of transitions, including self-transitions, which tend to dominate the sampled Markov chains, and the steady-state probabilities for each vocalisation event. The second form of representation (VGS) simply consists of the VGO with information about the participant's speech rate (mean and variance) added to the vocalisation statistics. With the exception of the speech rate ratio, which necessitates transcription, all the information needed to build VGO and VGS instances can be extracted through straightforward signal processing methods.

4.3. Machine learning

The data instances in the two alternative representation schemes were annotated for the presence or absence of Alzheimer's disease (AD). A supervised learning procedure was employed in order to train classifiers to predict such annotations on unseen data.

We trained a boosting model (Schapire and Freund, 2014) using decision stumps (i.e. decision trees with a single split node) as weak learners. The training process consisted of 10 iterations whereby, for each training instance (x_i), a weak classifier f̂_m was fitted using weights on the data which were iteratively computed so that the instances misclassified in the preceding step had their weights increased by a factor proportional to the weighted training error.
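The training loop just described can be sketched in a few lines of code. The following is a minimal Real AdaBoost with decision stumps, assuming labels in {-1, +1}; the Gini-based stump selection and the toy data are simplifications for illustration, not the exact Weka-based setup used in the experiments.

```python
import numpy as np

EPS = 1e-6  # probability clipping keeps the half log-odds finite

def fit_stump(X, y01, w):
    """Weighted decision stump returning leaf estimates of P(y=1 | x).

    Scans every (feature, threshold) split and keeps the one with the
    lowest weighted Gini impurity.
    """
    best, best_score = None, float("inf")
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= thr
            score, leaf_p = 0.0, []
            for mask in (left, ~left):
                wm = w[mask].sum()
                p = np.dot(w[mask], y01[mask]) / wm if wm > 0 else 0.5
                score += wm * p * (1.0 - p)
                leaf_p.append(p)
            if score < best_score:
                best_score, best = score, (j, thr, leaf_p[0], leaf_p[1])
    j, thr, p_left, p_right = best
    return lambda Xq: np.where(Xq[:, j] <= thr, p_left, p_right)

def real_adaboost(X, y, n_rounds=10):
    """Real AdaBoost (Friedman et al., 2000) with stump weak learners."""
    y01 = (y + 1) / 2.0                   # map {-1,+1} labels to {0,1}
    w = np.full(len(y), 1.0 / len(y))     # uniform initial instance weights
    stumps = []
    for _ in range(n_rounds):
        prob = fit_stump(X, y01, w)
        p = np.clip(prob(X), EPS, 1.0 - EPS)
        f = 0.5 * np.log(p / (1.0 - p))   # additive logistic contribution
        w = w * np.exp(-y * f)            # up-weight misclassified instances
        w = w / w.sum()
        stumps.append(prob)

    def F(Xq):
        total = np.zeros(len(Xq))
        for prob in stumps:
            p = np.clip(prob(Xq), EPS, 1.0 - EPS)
            total += 0.5 * np.log(p / (1.0 - p))
        return np.sign(total)             # sign of the summed contributions
    return F
```

The final classifier returned by `real_adaboost` is the sign of the summed weak-learner contributions, which is exactly the combination rule given in Equation (1) below.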
In this case, class probability estimates P(ad = 1 | data) were used to compute these weights and to weigh the final classification decision (additive logistic regression), following the Real AdaBoost algorithm (Friedman et al., 2000):

    F̂(x) = sign[ Σ_{m=1}^{M} f̂_m(x) ]    (1)

Classification performance was assessed through a 10-fold cross-validation procedure. As the dataset is reasonably balanced, results were assessed in terms of accuracy, precision (the ratio of the number of true positives to the number of instances classified as AD), recall (or sensitivity: the ratio of true positives to the number of AD cases) and F1 score (the harmonic mean of precision and recall). Micro (µ) and macro (M) averages for these scores are given by taking means over the entire set of classification decisions and over individual classifiers, respectively, across the 10 folds. As the data set is fairly small, we also ran a leave-one-out cross-validation (LOOCV) procedure to obtain better estimates of generalisation accuracy. This consisted of selecting one instance for testing, building a classification model on the remaining instances, and iterating this procedure until all instances had been selected as testing instances. Macro averages are uninformative in LOOCV, so we only report overall accuracy figures for this procedure.

ROC curves showing the relationship between true positive and false positive rates as the classification threshold is varied were also plotted. Simulation was employed in order to smooth these ROC curves, by running 10 rounds of 10-fold cross-validation tests with a randomised selection of instances making up the hold-out sets.

5. Results

Our first approach, based on the VGO data representation scheme, produced promising results.
Accuracy levels were well above the baseline, with overall accuracy reaching 81.1%, showing that turn-taking patterns can provide useful cues for the detection of AD in dialogues. The results for the VGO-based classification are shown in Table 2. The corresponding ROC curve is shown in Figure 3.

Table 2: AD detection results for the VGO data representation scheme.

                 AD      non-AD
Accuracy_µ      0.812    0.714
Precision_µ     0.765    0.769
Recall_µ        0.812    0.714
F1_µ            0.788    0.741
Precision_M     0.667    0.792
Recall_M        0.722    0.729
F1_M            0.685    0.721
Overall accuracy (LOOCV): 0.811

Figure 3: ROC curve for VGO-based classifiers (AUC = 0.798).

Adding speech rate information (VGS representation) contributed to further enhancing AD detection, bringing the overall accuracy score to about 86.5%. Detailed evaluation metrics are shown in Table 3. The ROC curve for the VGS-based classification approach is shown in Figure 4. It can be seen that the addition of features for the mean and variance of the speech rate ratio over dialogues had the effect of improving classification trade-offs, particularly reducing false positives while increasing true positives at low threshold cut-offs.

For comparison, we ran the same testing procedure using some of the other classifiers employed in the literature reviewed in Section 3, namely: logistic regression, naïve Bayes (Gaussian kernel), decision trees (C4.5 algorithm), SVM trained using sequential minimal optimisation with a polynomial kernel (Platt, 1998), and random forests (Breiman, 2001), in the Weka implementation (Hall et al., 2009). The overall (LOOCV) accuracy figures are shown in Table 4.
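The micro- and macro-averaged scores reported in Tables 2 and 3 follow the definitions given in Section 4.3: micro-averaging pools the classification decisions across folds before scoring, while macro-averaging scores each fold's classifier and averages the results. A minimal sketch, with illustrative (not actual) per-fold counts:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from decision counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def micro_macro(fold_counts):
    """Micro and macro averages over per-fold (tp, fp, fn) counts.

    Micro: pool the counts over all folds, then score once.
    Macro: score each fold separately, then average the scores.
    """
    pooled = [sum(c[i] for c in fold_counts) for i in range(3)]
    micro = prf(*pooled)
    per_fold = [prf(*c) for c in fold_counts]
    macro = tuple(sum(s[i] for s in per_fold) / len(per_fold)
                  for i in range(3))
    return micro, macro

# Two hypothetical folds: (tp, fp, fn) counts for the AD class.
micro, macro = micro_macro([(1, 0, 1), (2, 1, 0)])
```

The two averages can differ noticeably on small folds, which is why both are reported for the 10-fold results while only overall accuracy is reported for LOOCV.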
There is little difference in performance between our chosen method (Real AdaBoost) and the other methods used in the literature, except for logistic regression, which underperforms the machine learning methods. Real AdaBoost slightly outperforms the SVM and random forest classifiers, and matches C4.5 decision trees, with a slight advantage over the latter on the target AD class (F_M = 0.878 vs. F_M = 0.872).

Table 3: Results for the VGS data representation scheme.

                 AD      non-AD
Accuracy_µ      0.882    0.769
Precision_µ     0.833    0.833
Recall_µ        0.882    0.769
F1_µ            0.857    0.800
Precision_M     0.796    0.708
Recall_M        0.833    0.708
F1_M            0.811    0.700
Overall accuracy (LOOCV): 0.865

Figure 4: ROC curve for VGS-based classifiers (AUC = 0.894).

Table 4: Comparison of accuracy results obtained with different classification algorithms, on VGS-based datasets.

Classification method    Accuracy (LOOCV)
Logistic regression            75.7%
Real AdaBoost                  86.5%
Decision trees                 86.5%
SVM                            83.7%
Random forests                 81.1%

Although there is considerable room for improvement upon this level of classification performance, the levels obtained with these simple models are comparable to the accuracy of approaches that employ more detailed linguistic information, which is presumably harder to acquire in everyday conversational situations, as it would involve a level of speech recognition accuracy beyond the capabilities of current systems for spontaneous speech in noisy environments.

6. Conclusion and Further Work

Dementia prevention and quality of life in elderly care are important societal challenges. Automatic detection of signs of AD in speech can provide useful tools for the design of technologies for care-giving and cognitive health monitoring to help address these challenges.
This paper presented initial results of a new method to automatically recognise the first signs of disrupted communication using dialogue features. This method obtained an overall accuracy of 0.83, with a micro F-measure of 0.83 and a macro F-measure of 0.76 on the classification of patients as "AD" and "non-AD". Although it is difficult to compare these results directly to related works (Fraser et al., 2016; Guinn and Habash, 2012), our accuracy figures are situated within a similar range, 0.70-0.80, with a smaller discrepancy between the classification of the two groups, while relying on features that can be more robustly extracted from spontaneous speech.

Thanks to the increasingly important role of social technology, longitudinal studies may become richer in terms of the number of variables measured, the frequency of measurements, and the places where measures are taken (living settings), allowing for larger datasets. As more data are gathered in natural settings, we expect to obtain more reliable and generalisable results.

There are several linguistic parameters that are promising for the assessment of cognitive functioning. In current approaches, these features have typically been extracted from data collected through structured interviews, storytelling or picture descriptions. The work presented here contributes a new perspective to feature extraction by focusing on spontaneous dialogues. Dialogue processing provides a convenient framework for the analysis of natural conversations, in which readily available predictors, such as turn-taking behaviour, have already yielded satisfactory results.
We plan to further analyse verbal and non-verbal parameters to obtain a better characterisation of AD, in order to infer neuropsychological assessment results through speech and language processing, and subsequently to combine such features with actual neuropsychological evaluations and other relevant variables, building accurate models to achieve detection of AD at the time of onset.

The data set used in the present study has some limitations. Due to its constraints, the study was performed on a restricted subset of 21+17 sessions. In addition, the interview setting includes a degree of bias, as the interviewer's objective is to get the patient to perform a certain task (e.g. describing a picture, driving the discussion), thereby influencing the patient's speech. In order to mitigate these limitations, we plan to collect further data in more spontaneous dialogue settings in the near future.

7. Acknowledgements

This research has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 769661, SAAM project. Sofia de la Fuente and Pierre Albert are supported by the Medical Research Council (MRC). The authors would also like to acknowledge Charlene Pope and Boyd H. Davis, from the Medical University of South Carolina, who host the Carolina Conversations Collection, for providing access to the dataset and help in completing the required procedures.

8. Bibliographical References

American Psychiatric Association. (2000). Delirium, dementia, and amnestic and other cognitive disorders. In American Psychiatric Association, editor, Diagnostic and Statistical Manual of Mental Disorders, Text Revision (DSM-IV-TR), chapter 2. Arlington, VA, 4th edition.

Becker, J., Boiler, F., Lopez, O., Saxton, J., and McGonigle, K. (1994). The natural history of Alzheimer's disease: Description of study cohort and accuracy of diagnosis. Archives of Neurology, 51(6):585–594.

Bell, A., Brenier, J.
M., Gregory , M., Girand, C., and Ju- rafsky , D. (2009). Predictability ef fects on durations of content and function words in con versational english. 60(1):92–111. Breiman, L. (2001). Random forests. Mac hine Learning , 45(1):5–32. Bucks, R., Singh, S., Cuerden, J., and W ilcock, G. (2000). Analysis of spontaneous, con versational speech in de- mentia of Alzheimer type: Ev aluation of an objective technique for analysing lexical performance. Aphasiol- ogy , 14(1):71–91. Cohen, A. S. and Elvev ˚ ag, B. (2014). Automated Com- puterized Analysis of Speech in Psychiatric Disorders. Curr ent opinion in psychiatry , 27(3):203–209. Conway , M. and O’Connor , D. (2016). Social media, big data, and mental health: Current advances and ethical implications. Curr ent Opinion in Psychology , 9:77–82. Devillers, L., V idrascu, L., and Lamel, L. (2005). Chal- lenges in real-life emotion annotation and machine learn- ing based detection. Neural Networks , 18(4):407–422. Fraser , K. C., Meltzer , J. A., and Rudzicz, F . (2016). Lin- guistic Features Identify Alzheimer’ s Disease in Narra- tiv e Speech. Journal of Alzheimer’ s Disease , 49(2):407– 422, October . Friedman, J., Hastie, T ., and T ibshirani, R. (2000). Ad- ditiv e logistic regression: a statistical view of boosting. The Annals of Statistics , 28(2):337–407, April. Garc ´ ıa Meil ´ an, J. J., Mart ´ ınez-S ´ anchez, F ., Carro, J., S ´ anchez, J. a., and P ´ erez, E. (2012). Acoustic Markers Associated with Impairment in Language Processing in Alzheimer’ s Disease. The Spanish Journal of Psychol- ogy , 15(2):487–494. Gonzalez-Moreira, E., T orres-Boza, D., Kairuz, H., Fer- rer , C., Garcia-Zamora, M., Espinoza-Cuadros, F ., and Hernandez-G ´ omez, L. (2015). Automatic prosodic anal- ysis to identify mild dementia. BioMed Resear ch Inter- national . Guinn, C. I. and Habash, A. (2012). Language analysis of speakers with dementia of the alzheimer’ s type. 
In AAAI F all Symposium: Artificial Intelligence for Ger ontech- nology , pages 8–13. Hall, M., Frank, E., Holmes, G., Pfahringer , B., Reute- mann, P ., and Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD explor ations newsletter , 11(1):10–18. Hayakawa, A., V ogel, C., Luz, S., and Campbell, N. (2017). Speech rate comparison when talking to a sys- tem and talking to a human: A study from a speech-to- speech, machine translation mediated map task. In Pr oc. Interspeech 2017 , pages 3286–3290. Jaffe, J. and Feldstein, S. (1970). Rhythms of dialogue . Personality and Psychopathology . Academic Press, New Y ork. Jong, N. H. d. and W empe, T . (2009). Praat script to detect syllable nuclei and measure speech rate automatically . Behavior Resear ch Methods , 41(2):385–390, May . Kato, S., Endo, H., Homma, A., Sakuma, T ., and W atan- abe, K. (2013). Early detection of cognitiv e impair- ment in the elderly based on Bayesian mining using speech prosody and cerebral blood flow activ ation. Pr o- ceedings of the 37 th Annual International Confer ence of the IEEE Engineering in Medicine and Biology Society , 2013:5813–6. Kirshner , H. S. (2012). Primary Progressi ve Aphasia and Alzheimer’ s Disease: Brief History , Recent Evi- dence. Curr ent Neur ology and Neuroscience Reports , 12(6):709–714. Laske, C., Sohrabi, H. R., Frost, S. M., de Ipi ˜ na, K. L., Garrard, P ., Buscema, M., Dauwels, J., Soekadar , S. R., Mueller , S., Linnemann, C., Bridenbaugh, S. A., Kana- gasingam, Y ., Martins, R. N., and O’Bryant, S. E. (2015). Innov ati ve diagnostic tools for early detec- tion of alzheimer’ s disease. Alzheimer’ s & Dementia , 11(5):561–578. Lopez-De-Ipi ˜ na, K., Alonso, J., Sol ´ e-Casals, J., Barroso, N., Faundez, M., Ecay , M., Tra vieso, C., Ezeiza, A., and Estanga, A. (2012). Alzheimer disease diagnosis based on automatic spontaneous speech analysis. 
In Pr oceed- ings of the 4th International J oint Confer ence on Com- putational Intelligence , pages 698–705. Luz, S. and Su, J. (2010). The relev ance of timing, pauses and overlaps in dialogues: Detecting topic changes in scenario based meetings. In Pr oceedings of INTER- SPEECH 2010 , pages 1369–1372, Chiba, Japan. ISCA. Luz, S. (2013). Automatic Identification of Experts and Performance Prediction in the Multimodal Math Data Corpus through Analysis of Speech Interaction. Pr o- ceedings of the 15th ACM on International conference on multimodal interaction, ICMI’13 , pages 575–582. Luz, S. (2017). Longitudinal monitoring and detection of Alzheimer’ s type dementia from spontaneous speech data. In Computer Based Medical Systems , pages 45–46. IEEE Press. Marklund, P ., Sikstr ¨ om, S., B ˚ a ˚ ath, R., and Nilsson, L. G. (2009). Age effects on semantic coherence: Latent Se- mantic Analysis applied to letter fluency data. 3r d Inter- national Confer ence on Advances in Semantic Pr ocess- ing - SEMAPR O 2009 , pages 73–76. Middleton, L. E. and Y af fe, K. (2009). Promising strate- gies for the prev ention of dementia. Ar ch Neur ol , 66(10):1210–1215. Mirheidari, B., Blackburn, D., Reuber, M., W alker , T ., and Christensen, H. (2016). Diagnosing people with demen- tia using automatic con versation analysis. In Pr oceed- ings of Interspeech 2016 , pages 1220–1224. ISCA. Mortamais, M., Ash, J. A., Harrison, J., Kaye, J., Kramer , J., Randolph, C., Pose, C., Albala, B., Ropacki, M., Ritchie, C. W ., and Ritchie, K. (2017). Detecting cogni- tiv e changes in preclinical Alzheimer’ s disease: A revie w of its feasibility . Alzheimer’ s & Dementia , 13(4):468– 492. Parse y , C. M. and Schmitter-Edgecombe, M. (2013). Applications of technology in neuropsychological as- sessment. The Clinical neur opsycholo gist , 27(8):1328– 1361. Platt, J. (1998). Fast training of support vector ma- chines using sequential minimal optimization. In B. 
Schoelkopf, et al., editors, Advances in Kernel Meth- ods - Support V ector Learning . MIT Press. Pope, C. and Davis, B. H. (2011). Finding a balance: The carolinas con versation collection. Corpus Linguis- tics and Linguistic Theory , 7(1):143–161. Ritchie, K., Carri ` ere, I., Su, L., O’Brien, J. T ., Lov estone, S., W ells, K., and Ritchie, C. W . (2017). The midlife cognitiv e profiles of adults at high risk of late-onset alzheimer’ s disease: The PREVENT study . Alzheimer’s & Dementia , 13(10):1089–1097. Roark, B., Mithcell, M., Hosom, J.-P ., Hollingshead, K., and Kaye, J. (2011). Spoken Language Deriv ed Mea- sures for Detecting Mild Cogniti ve Impairment. The New England journal of medicine , 19(7):2081–2090. Rudzicz, F ., Chan Currie, L., Danks, A., Mehta, T ., and Zhao, S. (2014a). Automatically Identifying T rouble- indicating Speech Behaviors in Alzheimer’ s Disease. In 16th International A CM SIGA CCESS Conference on Computers & Accessibility , pages 241–242. Rudzicz, F ., W ang, R., Begum, M., and Mihailidis, A. (2014b). Speech recognition in Alzheimer’ s disease with personal assistiv e robots. Pr oceedings of the 5th W ork- shop on Speec h and Language Pr ocessing for Assistive T echnologies , pages 20–28. Satt, A., Sorin, A., T oledo-Ronen, O., Barkan, O., K ompat- siaris, I., K okonozi, A., and Tsolaki, M. (2013). Eval- uation of speech-based protocol for detection of early- stage dementia. Pr oceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH , (August):1692–1696. Schapire, R. E. and Freund, Y . (2014). Boosting: F ounda- tions and Algorithms . The MIT Press, January . Schr ¨ oder , M. and Trouv ain, J. (2003). The German text-to- speech synthesis system MAR Y: A tool for research, de- velopment and teaching. International J ournal of Speech T echnology , 6(4):365–377. Thomas, C., Keselj, V ., Cercone, N., Rockwood, K., and Asp, E. (2005). 
Automatic detection and rating of dementia of Alzheimer type through lexical analysis of spontaneous speech. IEEE International Confer ence Mechatr onics and A utomation, 2005 , 3(February):1569– 1574. W ada, K., Shibata, T ., Musha, T ., and Kimura, S. (2008). Robot Therapy for Elders affected by Dementia. (Au- gust). W atson, C. M. (1999). An analysis of trouble and repair in the natural con versations of people with dementia of the alzheimer’ s type. Aphasiology , 13(3):195–218. W orld Health Organization et al. (2015). First who min- isterial conference on global action against dementia: meeting report, who headquarters, gene va, switzerland, 16-17 march 2015. Xiong, W ., Droppo, J., Huang, X., Seide, F ., Seltzer, M., Stolcke, A., Y u, D., and Zweig, G. (2016). Achiev- ing human parity in con versational speech recognition. CoRR , abs/1610.05256.
