What Do LLMs Know About Alzheimer's Disease? Fine-Tuning, Probing, and Data Synthesis for AD Detection

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original paper on arXiv.

Reliable early detection of Alzheimer’s disease (AD) is challenging, particularly due to limited availability of labeled data. While large language models (LLMs) have shown strong transfer capabilities across domains, adapting them to the AD domain through supervised fine-tuning remains largely unexplored. In this work, we fine-tune an LLM for AD detection and investigate how task-relevant information is encoded within its internal representations. We employ probing techniques to analyze intermediate activations across transformer layers, and we observe that, after fine-tuning, the probing values of specific words and special markers change substantially, indicating that these elements assume a crucial role in the model’s improved detection performance. Guided by this insight, we design a curated set of task-aware special markers and train a sequence-to-sequence model as a data-synthesis tool that leverages these markers to generate structurally consistent and diagnostically informative synthetic samples. We evaluate the synthesized data both intrinsically and by incorporating it into downstream training pipelines.


💡 Research Summary

This paper investigates how large language models (LLMs) can be adapted for early detection of Alzheimer’s disease (AD) in a low‑resource clinical setting. The authors first fine‑tune two open‑source instruction‑tuned LLMs—Meta’s Llama‑3‑1B‑Instruct and Qwen‑2.5‑1.5B‑Instruct—on the DementiaBank corpus, which contains transcribed “Cookie‑Theft” picture‑description interviews from AD patients and age‑matched healthy controls. To explore the impact of different training objectives, they experiment with four loss configurations: standard cross‑entropy (CE), CE with label smoothing (ε = 0.1), focal loss (α = 0.25, γ = 2.0), and a hybrid CE + contrastive loss (α = 0.1, margin = 1.0). All experiments share identical hyper‑parameters (learning rate 2e‑5, effective batch size 16, 10 epochs) and use an 80/20 train‑validation split (1,044 AD, 247 control samples). Results show that CE and label‑smoothed training achieve the highest accuracy (0.853) and recall (0.975), while contrastive and focal variants improve training stability but do not surpass the baseline in final classification metrics.
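The four loss configurations above can be illustrated with a minimal NumPy sketch. The paper's exact implementation is not given, so this is only an assumed formulation of standard cross-entropy with optional label smoothing (ε = 0.1) and of focal loss (α = 0.25, γ = 2.0), matching the hyper-parameters reported; the contrastive term is omitted for brevity:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels, eps=0.0):
    """Cross-entropy with optional label smoothing.

    With eps > 0, the one-hot target is replaced by
    (1 - eps) on the true class plus eps/K spread uniformly.
    """
    probs = softmax(logits)
    n, k = probs.shape
    target = np.full((n, k), eps / k)
    target[np.arange(n), labels] += 1.0 - eps
    return -(target * np.log(probs + 1e-12)).sum(axis=-1).mean()

def focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    """Focal loss: down-weights well-classified examples by (1 - p_t)^gamma."""
    probs = softmax(logits)
    p_t = probs[np.arange(len(labels)), labels]
    return (-alpha * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)).mean()
```

On confidently correct predictions, label smoothing raises the loss (it penalizes over-confidence) while focal loss shrinks it, which is consistent with the reported observation that the variants stabilize training rather than improve final accuracy.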

Beyond performance, the core contribution lies in probing the fine‑tuned models’ internal representations. The authors train linear ridge‑regression probes on hidden states from each transformer layer to predict the binary AD label. Probes are evaluated at both the token level and the sentence level using mean‑squared‑error loss with L2 regularization. The analysis reveals that after fine‑tuning, the probe scores for certain lexical items (e.g., “kid”, “corner”) and, more importantly, for CHAT‑style special markers (e.g., %pause, %rep, %unintelligible) shift dramatically. These markers encode pauses, repetitions, and unintelligible speech, features known to be characteristic of AD‑related language impairment. The strongest changes appear in the middle layers, suggesting that the model learns to amplify domain‑specific cues without discarding its general linguistic knowledge.
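A linear ridge probe of this kind is straightforward to sketch. The snippet below is an assumed closed-form implementation (the paper does not publish its probing code): hidden states of shape (tokens, hidden_dim) are regressed onto the binary label, and the fitted weights give a per-token AD score. Random vectors stand in for real layer activations:

```python
import numpy as np

def fit_ridge_probe(hidden, labels, lam=1.0):
    """Fit a linear probe by closed-form ridge regression.

    hidden: (n, d) hidden states from one transformer layer.
    labels: (n,) binary AD labels (0 = control, 1 = AD).
    lam: L2 regularization strength (an assumed value).
    Returns the weight vector including a bias term.
    """
    n, d = hidden.shape
    X = np.hstack([hidden, np.ones((n, 1))])  # append bias column
    return np.linalg.solve(X.T @ X + lam * np.eye(d + 1), X.T @ labels)

def probe_score(hidden, w):
    """Scalar probe score per token; higher means more AD-directed."""
    X = np.hstack([hidden, np.ones((len(hidden), 1))])
    return X @ w
```

Repeating the fit over every layer and comparing scores before versus after fine-tuning is what exposes which tokens (words, special markers) moved toward the AD direction.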

Guided by these probing insights, the authors design a data‑synthesis pipeline. They curate a set of task‑aware special markers that capture AD‑relevant speech phenomena and train a T5‑based sequence‑to‑sequence model to transform plain transcripts into “marker‑rich” transcripts. The synthesis model learns to insert appropriate pauses, repetitions, and other disfluencies in a way that mirrors real AD speech patterns. Synthetic data are evaluated on three fronts: (1) statistical similarity of marker distributions to authentic AD transcripts; (2) token‑level probe scores, confirming that generated texts occupy the same AD‑directional subspace as real data; and (3) downstream impact when the synthetic samples are added to the original training set. In the downstream AD classification task, augmenting with synthetic data yields consistent gains of 2–3 % in accuracy and F1, with a notable boost in recall for the minority AD class.
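The first intrinsic check, comparing marker distributions between synthetic and authentic transcripts, can be sketched as follows. The marker names come from the probing analysis above, but the paper does not specify its similarity metric, so total-variation distance is used here purely as an illustrative choice:

```python
from collections import Counter

# Task-aware special markers discussed in the paper's probing analysis.
MARKERS = ["%pause", "%rep", "%unintelligible"]

def marker_distribution(transcripts):
    """Relative frequency of each special marker across a transcript set."""
    counts = Counter()
    for text in transcripts:
        for m in MARKERS:
            counts[m] += text.count(m)
    total = sum(counts.values()) or 1
    return {m: counts[m] / total for m in MARKERS}

def total_variation(p, q):
    """Total-variation distance between two marker distributions (0 = identical)."""
    return 0.5 * sum(abs(p[m] - q[m]) for m in MARKERS)
```

A small distance between the real-AD and synthetic distributions would indicate that the T5 synthesis model inserts disfluency markers at realistic rates, complementing the probe-score and downstream-accuracy checks.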

The paper therefore demonstrates a full loop: (i) fine‑tune LLMs for a clinical classification task; (ii) use linear probing to uncover which linguistic signals the model deems informative; (iii) leverage those signals to create synthetic, diagnostically meaningful data; and (iv) show that the synthetic data improve model performance in a low‑resource regime. Limitations include the modest size and single‑domain nature of DementiaBank, the reliance on manually defined special markers (which may not generalize across corpora), and the focus on text‑only transcripts without integrating acoustic features. Future work is suggested to scale to larger, multi‑institution datasets, to automate marker extraction from raw audio, and to explore multimodal LLMs that jointly process speech and text.

In summary, the study provides a compelling methodology for adapting LLMs to specialized medical NLP tasks, offers concrete evidence that domain‑specific cues become linearly encoded after fine‑tuning, and introduces a probing‑driven synthetic data generation technique that can alleviate label scarcity in Alzheimer’s disease detection.

