Towards a Method for Synthetic Generation of Persons with Aphasia Transcripts

In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank, yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when real data are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method uses a procedural programming approach, while the second uses the Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity levels (Mild, Moderate, Severe, Very Severe) through word dropping, filler insertion, and paraphasia substitution. Overall, we found that, compared to human-elicited transcripts, Mistral 7b Instruct best captures key aspects of linguistic degradation observed in aphasia, showing the most realistic directional changes in NDW, word count, and word length among the synthetic generation methods. Based on the results, future work should create a larger dataset, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of the synthetic transcripts.


💡 Research Summary

The paper addresses a critical bottleneck in aphasia research: the scarcity of annotated speech transcripts needed for training automated language assessment tools. While large language models (LLMs) are trained on billions of tokens, the publicly available AphasiaBank contains only about 600 transcripts, limiting the development of reliable automatic CIU (Correct Information Units) coding systems. To mitigate this data deficit, the authors propose and evaluate two synthetic data generation pipelines for the “Cat Rescue” picture‑description task, each designed to emulate four clinically defined severity levels—Mild, Moderate, Severe, and Very Severe.

The first pipeline follows a procedural programming approach. Starting from a set of normal (non-aphasic) transcripts, the authors apply a series of deterministic transformations: (1) word dropping at severity-dependent rates (10%–40% of tokens), (2) insertion of filler words such as "uh" and "um" at predefined probabilities (5%–20%), and (3) substitution of selected words with paraphasic equivalents drawn from a curated synonym/paraphasia lexicon. This rule-based method yields a controlled but relatively rigid set of degraded utterances.
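The three transformations can be sketched as a single pass over the tokens. This is a minimal illustration, not the authors' implementation: the exact severity-to-rate mapping and the paraphasia lexicon below are assumptions chosen to match the ranges quoted above.

```python
import random

# Assumed severity-to-rate mapping, spanning the quoted ranges
# (10%-40% word dropping, 5%-20% filler insertion).
SEVERITY = {
    "Mild":        {"drop": 0.10, "filler": 0.05},
    "Moderate":    {"drop": 0.20, "filler": 0.10},
    "Severe":      {"drop": 0.30, "filler": 0.15},
    "Very Severe": {"drop": 0.40, "filler": 0.20},
}

# Tiny illustrative lexicon; the paper uses a curated synonym/paraphasia list.
PARAPHASIAS = {"cat": "dog", "tree": "bush", "ladder": "stairs"}

def degrade(transcript: str, severity: str, seed: int = 0) -> str:
    """Apply word dropping, filler insertion, and paraphasia
    substitution to a normal transcript."""
    rng = random.Random(seed)
    params = SEVERITY[severity]
    out = []
    for word in transcript.split():
        if rng.random() < params["drop"]:
            continue                                   # (1) word dropping
        if rng.random() < params["filler"]:
            out.append(rng.choice(["uh", "um"]))       # (2) filler insertion
        out.append(PARAPHASIAS.get(word.lower(), word))  # (3) paraphasia sub
    return " ".join(out)
```

Because every decision is an independent coin flip at a fixed rate, the degradation is uniform across the utterance, which is exactly the rigidity the evaluation later attributes to this pipeline.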

The second pipeline leverages two open‑source instruction‑tuned LLMs—Mistral 7b Instruct and Llama 3.1 8b Instruct. The authors craft detailed prompts that encode the target severity level and the desired linguistic manipulations (e.g., “Delete roughly 30 % of the words, insert filler tokens, and replace some words with near‑synonyms that could represent a paraphasia”). By feeding the original normal transcript together with the severity‑specific prompt, the models generate synthetic aphasic versions. The LLM approach allows the model’s internal knowledge of language patterns to guide the placement of errors, producing more context‑sensitive and varied degradations than the procedural pipeline.
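A severity-conditioned prompt of the kind described might be assembled as below. The template wording and the `build_prompt` helper are assumptions for illustration; only the Severe instruction echoes the example quoted above, and the authors' exact prompts may differ.

```python
# Assumed per-severity instructions; the "Severe" entry mirrors the
# example prompt quoted in the summary, the others are extrapolations.
SEVERITY_INSTRUCTIONS = {
    "Mild": "Delete roughly 10% of the words and insert occasional fillers.",
    "Moderate": ("Delete roughly 20% of the words, insert fillers, "
                 "and replace a few words with near-synonyms."),
    "Severe": ("Delete roughly 30% of the words, insert filler tokens, "
               "and replace some words with near-synonyms that could "
               "represent a paraphasia."),
    "Very Severe": ("Delete roughly 40% of the words, insert frequent "
                    "fillers, and replace many words with paraphasic "
                    "substitutions."),
}

def build_prompt(transcript: str, severity: str) -> str:
    """Pair a normal transcript with a severity-specific instruction,
    ready to send to an instruction-tuned model such as Mistral 7b
    Instruct or Llama 3.1 8b Instruct."""
    return (
        f"You are simulating {severity} aphasia in a picture description.\n"
        f"{SEVERITY_INSTRUCTIONS[severity]}\n\n"
        f"Original transcript:\n{transcript}\n\n"
        "Rewritten transcript:"
    )
```

The key design difference from the procedural pipeline is that the rates here are only targets stated in natural language: the model decides *where* to drop, insert, and substitute, which is what yields the more context-sensitive degradations.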

Both pipelines generate 30 synthetic transcripts per severity level, resulting in 240 artificial samples. The authors evaluate the synthetic data against a held‑out set of human‑elicited aphasic transcripts using three quantitative metrics commonly employed in aphasia research: (i) Number of Different Words (NDW) to capture lexical diversity, (ii) total word count as a proxy for speech fluency, and (iii) mean word length (character count) to reflect morphological simplification. Statistical analysis (ANOVA with Tukey post‑hoc tests) reveals distinct patterns. The procedural method consistently reduces NDW and word count across severity levels, but the magnitude of change is uniform and does not align closely with the human data. In contrast, Mistral 7b Instruct produces a graded decline that mirrors clinical observations: NDW drops by up to 45 % and mean word length shortens from 2.1 to 1.5 characters in the Very Severe condition, with deviations from real aphasic transcripts of only 3 %–5 % for NDW and total words. Llama 3.1 8b Instruct also shows directional changes but with weaker intensity and occasional repetitive paraphasic substitutions, resulting in a larger gap from the human baseline.
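The three evaluation metrics are straightforward to compute from a whitespace-tokenized transcript. A minimal sketch (assuming lowercased whitespace tokenization; the paper's tokenization details are not given):

```python
def transcript_metrics(text: str) -> dict:
    """Compute the three evaluation metrics: Number of Different
    Words (NDW), total word count, and mean word length in characters."""
    words = text.lower().split()
    if not words:
        return {"ndw": 0, "word_count": 0, "mean_word_length": 0.0}
    return {
        "ndw": len(set(words)),                                   # lexical diversity
        "word_count": len(words),                                 # fluency proxy
        "mean_word_length": sum(len(w) for w in words) / len(words),
    }

# Example: "the cat the cat ran" has 3 distinct words among 5 tokens.
print(transcript_metrics("the cat the cat ran"))
```

Comparing these three numbers per severity level between synthetic and human-elicited transcripts is what drives the ANOVA analysis described above.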

The authors conclude that LLM‑based synthetic generation, particularly with Mistral 7b Instruct, offers a more realistic approximation of aphasic speech degradation than a purely rule‑based system. However, they acknowledge limitations: the study is confined to a single picture‑description task, the sample size per severity level is modest, and the prompts still require manually specified dropout and filler rates. Future work is outlined to (1) expand the synthetic corpus across multiple discourse tasks, (2) fine‑tune LLMs on existing aphasic data to improve domain specificity, and (3) involve speech‑language pathologists in systematic realism assessments, thereby creating a robust, scalable resource for training and evaluating automated aphasia assessment tools.

