LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models

Reading time: 33 minutes
...

📝 Original Info

  • Title: LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models
  • ArXiv ID: 2512.23025
  • Date: 2025-12-28
  • Authors: Wenxuan Xu, Arvind Pillai, Subigya Nepal, Amanda C Collins, Daniel M Mackin, Michael V Heinz, Tess Z Griffin, Nicholas C Jacobson, Andrew Campbell

📝 Abstract

Multimodal health sensing offers rich behavioral signals for assessing mental health, yet translating these numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. LENS first constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To enable native time-series integration, we train a patch-level encoder that projects raw sensor signals directly into an LLM's representation space. Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-produced narratives are comprehensive and clinically meaningful. Ultimately, our approach advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making.

📄 Full Content

Mental health conditions on the spectrum of anxiety and depression affect an estimated 18% and 9.5% of adults in the United States each year (Johns Hopkins Medicine, 2023). Traditional screening methods typically rely on structured clinical interviews and validated self-report instruments such as the Patient Health Questionnaire (PHQ-9) (Kroenke et al., 2001) and the Generalized Anxiety Disorder scale (GAD-2) (Spitzer et al., 2006). However, these assessments are limited by their high burden on clinicians, dependence on retrospective self-reports, and reduced ecological validity, because they are administered in controlled settings that do not capture an individual’s real-world context (Abd-Alrazaq et al., 2023).

Figure 1: Illustration of the LENS idea. Mobile and wearable sensing signals, combined with a question, are passed to LENS, which produces a natural-language description. Clinicians can then view an interpretable snapshot of the user's mental state instead of raw sensor streams.

To address these challenges, recent work has used mobile and wearable technologies to collect passive sensing data (for example, phone usage, speech features, and heart rate) and to administer ecological momentary assessments (EMA) (Xu et al., 2023;Gomes et al., 2023;Nepal et al., 2024). A growing body of evidence shows that behavioral and physiological signals such as activity levels, sleep patterns, mobility, voice characteristics, and smartphone interactions can indicate the severity of depression and anxiety symptoms (Wang et al., 2014;Saeb et al., 2015;Sheikh et al., 2021;Jacobson et al., 2021). Together, these findings highlight passive sensing as a promising complementary approach for mental health monitoring.

In parallel, recent work has demonstrated the potential of large language models (LLMs) for mental health assessment. Studies show that prompt engineering and fine-tuning can enable depression detection, symptom severity inference, physiological indicator prediction, and even generation of psychological rationales (Yang et al., 2024;Moon et al., 2025;Kim et al., 2024). Despite these promising capabilities, LLMs struggle with long-duration time series due to limitations in context length and tokenizer design, which prevent them from directly ingesting raw numerical sequences (Spathis and Kawsar, 2024). In contrast, multimodal vision and language models benefit from mature frameworks that support native visual understanding (Li et al., 2022;Dosovitskiy et al., 2021;Radford et al., 2021). However, comparable methods for native time-series integration remain scarce (Tan et al., 2025). As a result, the few studies that combine passive health sensing with LLMs rarely operate directly on raw sensor streams (Kim et al., 2024;Justin et al., 2024;Englhardt et al., 2024). Overall, progress in integrating behavioral sensing data with language is limited by the lack of large datasets that pair raw sensor streams with text and by the limited capabilities of existing methods that align time-series signals with language models.

Toward the goal of aligning sensing with language models for mental health, our contributions are threefold. (1) LENS data synthesis pipeline: We introduce a pipeline that generates semantic mental health descriptions of multimodal sensing data from EMA responses (Section 3.1), producing a dataset of more than 100,000 sensor-text pairs and directly addressing the shortage of such resources. (2) LENS training: We propose a training strategy based on a patch-level time-series encoder that projects sensor signals into the language model’s representation space (Section 3.2). By interleaving time-series and text embeddings and using a two-stage curriculum, we show that LENS can generate clinically grounded narratives. (3) Comprehensive evaluation: We evaluate our approach on a clinical dataset of 258 participants comprising 50,957 unique samples (Section 4.2). Beyond custom LLM metrics, we conduct a user study with 13 mental health experts who manually assessed 117 narratives.

Recent work explores mobile and wearable data for mental health prediction, primarily focusing on symptom classification (Kim et al., 2024;Englhardt et al., 2024) or improving reasoning via multimodal encoders (Justin et al., 2024). These systems prioritize prediction over natural language generation, often producing text only as secondary reasoning traces. Furthermore, methods that serialize numerical data into tokens face significant scalability and tokenization constraints (Pillai et al., 2025;Spathis and Kawsar, 2024;Yoon et al., 2024). In contrast, LENS anchors narrative generation in raw sensor measurements and clinically validated PHQ/GAD items to produce meaningful symptom descriptions.

LLMs are increasingly used to synthesize paired datasets for time-series tasks by generating artificial sequences or surrogate signals for QA pairs and explanations (Xie et al., 2025;Li et al., 2025b;Yan et al., 2023;Imran et al., 2024). While scalable, these pipelines rely on synthetic inputs that fail to capture the complexity of real physiological and behavioral signals. LENS instead pairs real-world sensor streams with clinical assessments, using LLMs only to refine linguistic fluency via rigorous quality checks.

While vision-language alignment has progressed rapidly (Liu et al., 2023), time-series and language integration remains limited. Existing text-based serialization is constrained by numerical encoding and context length (Xue and Salim, 2023;Gruver et al., 2023), while vision-based methods convert series into images, introducing plot-engineering biases and indirect representations (Yoon et al., 2024;Zhang et al., 2023). Alignment-based approaches project encoder representations into LLM hidden states (Ming et al., 2023;Xie et al., 2025). However, by optimizing for synthetic QA, they rarely generate natural language aligned with real-world clinical constructs. Our work differs from these approaches by producing symptom-oriented narratives tied to psychometric instruments and grounded directly in raw signals.

LENS consists of two components: a scalable dataset-construction pipeline that transforms EMA responses into high-quality sensor-text pairs (Section 3.1), and a sensor-text alignment method that enables native integration of raw time-series signals into an LLM through a patch-based encoder (Section 3.2).

Our longitudinal study investigates intra-day fluctuations in mental health symptoms among individuals diagnosed with major depressive disorder. In this 90-day study, we recruited participants aged 18 years and older, residing in the United States. Each participant wore a Garmin vivoactive 3 device and installed our Android application. This setup enabled the collection of passive sensing data, which are behavioral and physiological signals automatically recorded from smartphones and wearables, and ecological momentary assessments (EMAs), in which participants actively reported depression and anxiety symptoms experienced over the past four hours. The study data will be made publicly available through the funding body after an embargo period used to verify that personally identifiable information is protected.

The EMA consists of 13 items adapted from the PHQ-9 and GAD-4 (Table 2; Appendix §B). It is administered three times per day in the morning, afternoon, and evening, customized to each participant’s waking time. Each item is rated on a continuous 0 (“Not at all”) to 100 (“Constantly”) scale. In addition to the EMAs, we collect mobile and wearable time-series signals, including GPS traces, step counts, accelerometer-derived zero-crossing rate (ZCR) and energy, conversation time, phone lock and unlock events, heart rate, sleep estimates, and stress levels. These signals have been shown to correlate with depressive and anxiety symptoms (Choudhary et al., 2022). To align self-reported EMAs with corresponding sensing data, we use each EMA’s completion time to retrieve the preceding four hours of sensor data. This procedure results in a temporally aligned, multimodal dataset suitable for narrative synthesis and modeling. Ultimately, we utilize 50,957 EMAs from 258 participants.
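The following is a minimal sketch of the alignment step described above: for each completed EMA, the preceding four hours of sensor data are retrieved by completion time. The DataFrame columns and variable names are illustrative assumptions, not the released codebase.

```python
# Sketch: pair each EMA with the 4-hour sensor window preceding its completion time.
import pandas as pd

WINDOW = pd.Timedelta(hours=4)

def align_window(ema_time: pd.Timestamp, sensor_df: pd.DataFrame) -> pd.DataFrame:
    """Return sensor rows in the 4-hour window ending at the EMA completion time."""
    mask = (sensor_df["timestamp"] > ema_time - WINDOW) & (sensor_df["timestamp"] <= ema_time)
    return sensor_df.loc[mask]

# Usage: one aligned multimodal sample per completed EMA.
# aligned = {ema_id: align_window(t, heart_rate_df) for ema_id, t in ema_times.items()}
```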

The sensing data includes two types of signals:

(1) continuous streams, which contain time-series data such as steps, heart rate, accelerometer-derived zero-crossing rate (ZCR) and energy, phone lock and unlock state, stress level, and GPS traces, and (2) aggregated streams, which contain daily or window-level values such as sleep duration from the previous night and total conversation time. After preprocessing, all signals are standardized to fixed sampling rates to ensure consistent temporal alignment across modalities (see Appendix §C for specific details).

To address the lack of sensor-text datasets describing mental health symptoms, LENS constructs ground-truth question-answer pairs as follows.

Responses → Answers (A). EMA responses are converted into narrative labels as shown in Figure 2. For each question, we first transform the EMA item into a symptom-focused template sentence. We then insert the frequency phrase associated with its numeric range (0-25: not at all, 26-50: sometimes, 51-75: often, 76-100: constantly) to obtain the raw item-level narrative. Question 14 is treated as binary, and overall severity is categorized as mild, moderate, or severe. Summary-level narratives are formed by concatenating item-level narratives before applying LLM refinement. To reduce the mechanical tone of the templates, we use prompt-based rewriting with the locally deployed Qwen2.5-14B model. The system prompt enforces factual accuracy, consistent terminology, and stigma-free language, while the user prompt supplies the rule-based text and requests a fluent rewrite (see prompt template in Appendix §G). This process produces 101,914 item-level narratives and 50,957 summary-level narratives that serve as ground truth for training LENS.
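A small sketch of the rule-based templating step, assuming the score ranges quoted above; the template sentence itself is hypothetical, since the paper's per-item templates are not reproduced here.

```python
# Sketch: map a continuous 0-100 EMA score to its frequency phrase and fill a template.
def frequency_phrase(score: float) -> str:
    if score <= 25:
        return "not at all"
    if score <= 50:
        return "sometimes"
    if score <= 75:
        return "often"
    return "constantly"

def item_narrative(symptom: str, score: float) -> str:
    # Hypothetical template wording; the actual templates are symptom-specific.
    return f"In the past 4 hours, the user has {frequency_phrase(score)} experienced {symptom}."

print(item_narrative("trouble concentrating", 62))
# -> "In the past 4 hours, the user has often experienced trouble concentrating."
```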

Questions (Q). The ground-truth questions consist of item-level prompts for each EMA question and a summary-level prompt that covers all items along with the overall severity statement. To increase linguistic diversity, we use GPT-4o to generate paraphrased variants of every question. For each original prompt, we created ten semantically equivalent but lexically distinct phrasings. During QA dataset construction, one paraphrased variant is randomly sampled for each instance.

Quality Control. To ensure narrative reliability, we implement an automatic quality-control pipeline using a multi-model LLM-judge system (Figure 3). Each judge model follows an LLM-as-a-Judge prompting template (Li et al., 2025a) with rule-augmented instructions that embed evaluation principles, baseline references, and dimension-specific rubrics directly in the prompt. Judges compare the template-based and enhanced narratives in a pairwise format and output five dimension scores (1-5), confidence values, and a short rationale, providing structured and transparent assessments. Because individual LLM judges can be biased, we use three independent models (Mistral-7B (Jiang et al., 2023), Llama-3.1-8B (Grattafiori et al., 2024), and Qwen2.5-7B (Qwen et al., 2025)). Dimension scores are averaged and rounded, and their sum forms a total quality score; confidence values are averaged similarly. A narrative is accepted only if its average total score exceeds 20 (out of 25) and its mean confidence exceeds 0.8. Otherwise, it is returned for regeneration and re-evaluation (FAIL in Figure 3). This iterative refine-and-judge loop, akin to Refine-n-Judge (Cayir et al., 2025), filters out low-quality outputs and ensures that finalized narratives faithfully reflect EMA responses while improving fluency and diversity without introducing distortions.
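A minimal sketch of the accept/regenerate decision, assuming each of the three judges returns five 1-5 dimension scores plus a confidence in [0, 1]; the thresholds follow the text, but the data layout is an assumption.

```python
# Sketch: aggregate three judges and decide whether a narrative passes quality control.
from statistics import mean

def accept_narrative(judge_outputs: list[dict]) -> bool:
    """judge_outputs: [{'scores': [5 ints in 1..5], 'confidence': float}, ...] from 3 judges."""
    n_dims = len(judge_outputs[0]["scores"])
    # Average each dimension across judges, round, and sum into a total quality score.
    dim_means = [round(mean(j["scores"][d] for j in judge_outputs)) for d in range(n_dims)]
    total_score = sum(dim_means)
    mean_conf = mean(j["confidence"] for j in judge_outputs)
    return total_score > 20 and mean_conf > 0.8

# A FAIL sends the narrative back for regeneration and re-judging, as in the loop above.
```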

We encode each time-series stream independently using a lightweight patch-based module that converts raw scalar values into language-model-compatible embeddings. Given a univariate sequence $S = \{s_t\}_{t=1}^{T}$, we first apply a reversible value normalization to stabilize scale while keeping absolute magnitudes recoverable. Let $\mu$ and $\sigma$ denote the per-stream mean and standard deviation; then $\tilde{s}_t = (s_t - \mu)/\sigma$, and auxiliary statistics such as $(\mu, \sigma, m_{\min}, m_{\max})$ are inserted into the textual prompt as metadata, allowing the model to reason about original numerical ranges inside the language space (Langer et al., 2025;Xie et al., 2025). The normalized sequence is divided into $N = \lceil T/k \rceil$ non-overlapping patches of width $k$, where $k = 8$ for all streams in our experiments. Each timestep is augmented with a learnable positional embedding $q_t \in \mathbb{R}^{d_p}$ with $d_p = 16$. For each patch, scalar values and positional codes are concatenated to form a patch vector $p_n \in \mathbb{R}^{k(1+d_p)}$.

Each patch representation is projected into the hidden space of the pretrained language model via a multilayer perceptron $f_\theta$ consisting of 5 layers with hidden width 5120, matching the LLM embedding dimension $d$. The encoder outputs one embedding per patch, $z_n = f_\theta(p_n) \in \mathbb{R}^{d}$.

This design captures localized temporal patterns at the patch level while preserving absolute numerical semantics through reversible normalization.
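A sketch of this encoder in PyTorch, following the hyperparameters stated above (k = 8, d_p = 16, five linear layers of width 5120); padding, activation choice, and the positional-embedding table size are assumptions.

```python
# Sketch: patch-level time-series encoder (reversible normalization -> patches -> MLP).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEncoder(nn.Module):
    def __init__(self, k: int = 8, d_p: int = 16, d_llm: int = 5120, max_len: int = 4096):
        super().__init__()
        self.k = k
        self.pos = nn.Embedding(max_len, d_p)               # learnable per-timestep code q_t
        dims = [k * (1 + d_p)] + [5120] * 4 + [d_llm]       # 5 linear layers, hidden width 5120
        layers = []
        for i in range(len(dims) - 1):
            layers += [nn.Linear(dims[i], dims[i + 1]), nn.GELU()]
        self.mlp = nn.Sequential(*layers[:-1])              # drop the activation after the last layer

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        """s: (T,) raw univariate stream -> (N, d_llm), one embedding per patch."""
        mu, sigma = s.mean(), s.std().clamp_min(1e-6)
        s_norm = (s - mu) / sigma                           # reversible; (mu, sigma) also go in the prompt
        s_norm = F.pad(s_norm, (0, (-s_norm.numel()) % self.k))   # pad to a multiple of k
        q = self.pos(torch.arange(s_norm.numel(), device=s.device))  # (T', d_p)
        x = torch.cat([s_norm.unsqueeze(-1), q], dim=-1)    # (T', 1 + d_p)
        patches = x.reshape(-1, self.k * x.shape[-1])       # (N, k * (1 + d_p))
        return self.mlp(patches)                            # (N, d_llm)
```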

Given an instruction text $X = \{x_1, \ldots, x_M\}$ and $K$ time-series streams $\{S^{(k)}\}_{k=1}^{K}$, we inject the normalization metadata described above into the text. After this step, the prompt $X$ contains special placeholder tokens that mark the position of each stream. The text $X$ is then tokenized and passed through the pretrained LLM embedding layer $f^{\text{emb}}_{\phi}$, producing a sequence of text embeddings $E_{\text{text}} = \{e_1, \ldots, e_L\} \in \mathbb{R}^{L \times d}$. In parallel, each time-series stream $S^{(k)}$ is processed by the encoder, yielding patch embeddings $Z^{(k)} = \{z^{(k)}_1, \ldots, z^{(k)}_{N_k}\} \in \mathbb{R}^{N_k \times d}$.

The multimodal embedding sequence is formed by concatenating the text embeddings and the patch embeddings at the positions referenced by the placeholders, i.e., $H = \mathrm{concat}\big(E_{\text{text}}, Z^{(1)}, \ldots, Z^{(K)}\big) \in \mathbb{R}^{L_{mm} \times d}$, which interleaves natural-language and time-series representations in a single ordered context. This unified sequence is then fed into the subsequent transformer blocks of the pretrained LLM, enabling multimodal reasoning over both textual instructions and temporal patterns.
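A minimal sketch of building H by splicing patch embeddings into the text embeddings at the placeholder positions; how placeholder tokens are located is an assumption, since the special tokens themselves are not specified here.

```python
# Sketch: interleave text embeddings and per-stream patch embeddings into one sequence H.
import torch

def build_multimodal_sequence(text_emb: torch.Tensor,
                              placeholder_positions: list[int],
                              patch_embs: list[torch.Tensor]) -> torch.Tensor:
    """text_emb: (L, d); patch_embs[k]: (N_k, d); returns H with shape (L_mm, d)."""
    pieces, prev = [], 0
    for pos, z in zip(placeholder_positions, patch_embs):
        pieces.append(text_emb[prev:pos])   # text up to the k-th stream placeholder
        pieces.append(z)                    # patch embeddings for stream k
        prev = pos + 1                      # skip the placeholder token itself
    pieces.append(text_emb[prev:])
    return torch.cat(pieces, dim=0)         # unified sequence fed to the LLM blocks
```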

Stage 1: Encoder Alignment. The first stage aims to establish a stable alignment between temporal features and their textual descriptions. We use the alignment dataset from ChatTS (Xie et al., 2025), which takes a QA form and probes time-series understanding across different signal types and temporal behaviors, enabling the encoder to capture trends, correlations, and local pattern variations rather than raw numeric fluctuations. However, our final objective is to generate clinically coherent symptom narratives rather than attribute-only descriptions. To prevent the encoder from overspecializing on low-level attribute queries, we interleave a small portion of narrative and general QA samples into the alignment corpus (8:1:1 with alignment data), giving the model early exposure to natural question forms and narrative structure and better preparing it for downstream symptom-focused generation.

Figure: LENS architecture. Text is embedded by the pretrained LLM embedding layer ($f^{\text{emb}}_{\phi}$), while time-series data is encoded by a trainable patch-based encoder ($f_\theta$). The resulting embeddings are concatenated into a unified sequence ($H$) and processed by the LLM backbone to generate a natural-language response ($Y$).

Stage 2: Supervised Fine-Tuning on Symptom Narratives. After encoder alignment, we train the model to perform symptom-centered question answering and narrative generation. We fine-tune LENS on the two EMA-derived QA datasets described in Section 3.1.3, which respectively provide item-level supervision for single-symptom interpretation and summary-level supervision for multi-symptom synthesis. To prevent the model from overfitting to a single response style and to enhance its ability to follow structured outputs, we additionally include an instruction-following (IF) dataset constructed from predefined templates. This dataset exposes the model to consistent answer formats, encouraging stable response organization under different prompts. Because real-world time-series queries vary widely in temporal resolution, we further interleave an alignment-random subset in which sequence lengths are uniformly sampled between 64 and 1024, enabling the model to generalize across heterogeneous sampling rates and time spans. During supervised fine-tuning, the four datasets are mixed using a 0.3:0.3:0.2:0.2 ratio (item-level QA : summary-level QA : IF : alignment-random), which balances symptom narrative grounding, structural instruction patterns, and robustness to sequence-length variability.

Given an instruction text $X = \{x_1, \ldots, x_M\}$ and $K$ time-series streams $\{S^{(k)}\}_{k=1}^{K}$, the model is trained in a standard supervised autoregressive manner. For a target response $Y = \{y_1, \ldots, y_U\}$, the objective is

$$\mathcal{L}(\phi) = -\sum_{u=1}^{U} \log p_{\phi}\!\left(y_u \mid H, y_{<u}\right),$$

where $\phi$ denotes all trainable parameters. We adopt full-parameter fine-tuning, updating both the pretrained LLM backbone and the time-series encoder. See Appendix §A for hardware and batching details.
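A minimal sketch of this objective with prompt and time-series positions excluded from the loss; the -100 masking convention is an assumption borrowed from common supervised fine-tuning pipelines, not a detail stated in the paper.

```python
# Sketch: next-token cross-entropy over response tokens only.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (L_mm, vocab); labels: (L_mm,) with -100 at prompt/time-series positions."""
    # Shift by one so position t predicts token t+1; masked positions are ignored.
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
```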

Baselines. To contextualize LENS’s performance, we compare it with baselines that follow common strategies for modeling time-series signals with LLMs. Prior work often encodes time-series values as raw text (Kim et al., 2024;Gruver et al., 2023), so we adopt the same principles and use Qwen2.5-14B (Yang et al., 2025) with few-shot prompting as the backbone; we refer to this as TS-Text. Another promising approach converts time-series data into visual representations to support downstream reasoning (Yoon et al., 2024;Liu et al., 2025). Following this approach, we transform each signal into a plot and generate narratives using the Qwen2.5-VL-32B model in a few-shot setting. This baseline is denoted as TS-Image. Inference for all baseline models uses default configurations to ensure a controlled comparison, and we use participant-level splits (approx. 70:15:15) to prevent information leakage, ensuring that all data from a given individual appears in only one split (Appendix §A).

Metrics. We evaluate model performance using three categories of metrics. (1) Linguistic Metrics: Generated narratives are assessed with ROUGE-1/2/L (Lin, 2004), BLEU-4 (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and BERTScore (Zhang et al., 2020), which measure lexical overlap, structural consistency, and semantic similarity with reference narratives.

(2) Symptom-grounded Evaluation: To assess clinical alignment, we employ a structured LLM-as-a-judge protocol using gpt-4.1-mini (see prompt template in Appendix §G); the judge is queried with temperature set to 0 to ensure deterministic and reproducible evaluations. For each of the PHQ-related symptom categories, the judge first outputs a JSON record containing ref_presence, pred_presence, ref_severity, and pred_severity, indicating whether the model omits or hallucinates symptom dimensions and whether predicted severity is faithful to the ordinal reference. (3) Item-level QA Evaluation: For single-question outputs, evaluation reduces to a record specifying ref_severity and pred_severity; symptom presence is implicit in the question itself, allowing severity correctness to be measured without aggregating across narrative spans. Detailed definitions of coverage, presence-aware severity alignment, and the weighting procedure used for ordinal scoring are provided in Appendix §D.
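An illustrative example of the structured record the judge returns per symptom category; the field names follow the text, while the values and severity vocabulary are made up.

```python
# Sketch: one judge record for a single symptom category (values are illustrative).
example_record = {
    "ref_presence": 1,        # symptom mentioned in the reference narrative
    "pred_presence": 1,       # symptom mentioned in the generated narrative
    "ref_severity": "often",
    "pred_severity": "sometimes",
}
# Omissions correspond to ref_presence=1 with pred_presence=0; hallucinations to the reverse.
```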

User study. To assess the characteristics of LENS-generated narratives, we recruited 13 mental health experts to evaluate LENS (14B; zero-shot) along with the two baselines: TS-Image (32B, few-shot) and TS-Text (14B, few-shot), across four rating dimensions. Each example is a narrative paired with a table of 12 ground-truth symptoms and assessed as follows: (1) Comprehensiveness: “Does the narrative mention the symptom?”; (2) Accuracy: “Does the narrative accurately describe the symptom severity (including synonyms)?”; (3) Clinical Utility: “Does the narrative provide clinically useful information that informs care?”; and (4) Language Cohesion: “Is the narrative coherent, easy to understand, and focused on depression and anxiety symptoms?” Comprehensiveness and accuracy were collected as yes/no responses and grouped into five categories. For comprehensiveness: 0-2 = Very Poor, 3-5 = Poor, 6-7 = Adequate, 8-11 = Good, 12 = Excellent. For accuracy: 0-2 = Major Discrepancy, 3-5 = Substantial Discrepancy, 6-7 = Partial Agreement, 8-11 = High Agreement, 12 = Complete Agreement. Clinical utility and language cohesion were rated on a 1-5 Likert scale.

In total, we evaluated 117 narratives, with each expert rating 9 narratives (3 samples from each model). Additional information regarding the user study and expert background is provided in Appendix §F. To compare the models across the four domains, we first computed per-rater mean scores for each model by averaging over the three evaluated samples, and then performed paired t-tests across raters with Bonferroni correction. Effect sizes are reported using Cohen’s dz.
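A sketch of this statistical comparison for one rating dimension, assuming per-rater mean scores are already computed; the number of corrected comparisons and the array names are illustrative.

```python
# Sketch: paired t-test across raters with a Bonferroni-adjusted alpha and Cohen's d_z.
import numpy as np
from scipy import stats

def compare_models(scores_a: np.ndarray, scores_b: np.ndarray, n_comparisons: int = 3):
    """scores_*: (n_raters,) per-rater mean ratings for one model on one dimension."""
    t, p = stats.ttest_rel(scores_a, scores_b)
    diffs = scores_a - scores_b
    dz = diffs.mean() / diffs.std(ddof=1)         # Cohen's d_z for paired samples
    significant = p < 0.05 / n_comparisons        # Bonferroni correction
    return t, p, significant, dz
```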

LENS consistently outperforms baselines in generating comprehensive clinical narratives in the summary-level evaluation. LENS also exhibits strong performance in fine-grained, item-level query answering: compared to summary-level narrative generation, the performance gap in the item-level evaluation is more pronounced, indicating that the task requires more precise retrieval (Table 1). For example, LENS achieves a ROUGE-L score of 0.603, which is more than four times higher than both TS-Text (0.142) and TS-Image (0.136). This trend holds for semantic and clinical metrics as well, with LENS securing the highest Presence Alignment of 0.732 (10.1% improvement over TS-Text and 6.6% over TS-Image) and a BERTScore of 0.832 (over 37% improvement against both baselines). These results suggest that while baselines struggle to isolate specific symptom intensities in a QA format, LENS maintains high fidelity to the reference severity levels.

Mental health experts rate LENS narratives significantly higher than TS-Text. As shown in Figures 5 to 7, LENS consistently outperforms TS-Text across all four evaluation dimensions. Specifically, LENS narratives receive significantly higher expert ratings for comprehensiveness (4.23 vs. 2.97; T = 7.02; p < 0.01; dz = 1.12), accuracy (2.61 vs. 1.53; T = 6.49; p < 0.01; dz = 1.04), clinical utility (3.25 vs. 1.84; T = 7.16; p < 0.01; dz = 1.15), and language cohesion (3.82 vs. 2.23; T = 7.34; p < 0.01; dz = 1.17). These results demonstrate that experts find narratives derived from native time-series integration to be not only more accurate but also more practically useful for care than those generated from text descriptions alone.

Expert evaluation demonstrates performance parity with larger visual baselines. Comparing LENS to TS-Image (Figures 5 to 7), we observe no statistically significant differences between the two models across any of the four evaluation dimensions. Expert ratings are comparable for comprehensiveness (4.23 vs. 4.46; T = -2.47; p > 0.01; dz = -0.39), accuracy (2.61 vs. 2.69; T = -0.42; p > 0.01; dz = -0.06), clinical utility (3.82 vs. 3.71; T = 1.52; p > 0.01; dz = 0.24), and language cohesion (3.25 vs. 2.94; T = 0.56; p > 0.01; dz = 0.09). These results indicate that LENS matches the performance of the 2.2× larger TS-Image while offering a significantly streamlined inference process. By replacing complex visual plot engineering with native time-series encoding, LENS achieves these results in a zero-shot setting. Interestingly, experts identified LENS as having better real-world applicability, evidenced by its marginally higher ratings for clinical utility and language cohesion.

Ablation and scalability analysis. We analyze LENS’s architecture through ablation and scaling studies detailed in Appendix §E. An evaluation comparing LENS to a fine-tuned text-only baseline (TS-Text-FT) reveals that explicit time-series encoding is essential for capturing temporal structures; LENS significantly outperforms the text-only variant, particularly in item-level QA where the ROUGE-L gap is markedly wider (0.6030 vs. 0.1717) (Table 4). Furthermore, LENS demonstrates superior computational efficiency, requiring approximately 930 tokens per sample, a reduction of roughly 94% compared to verbose text serialization and 4× relative to vision-based models (Figure 8). We also observe that while basic instruction-following can be established with 10% of the training data, the full dataset is necessary to maximize clinical fidelity in complex narratives, with Presence Alignment improving monotonically as data size increases (Table 5). Finally, experiments with a lighter LENS-7B variant show that it maintains linguistic performance (ROUGE-L 0.409 vs. 0.410) while trailing slightly in complex clinical alignment (0.559 vs. 0.601) (Table 6).

Some limitations of this work should be noted.

First, although we demonstrate symptom-level generalization for depression and anxiety within a clinical population, we do not evaluate whether the approach transfers to other demographic groups or additional self-report psychological instruments. Examining generalization across broader populations and diverse measurement scales remains an important direction for future work and would strengthen the case for large-scale generation of paired sensor data and corresponding narrative descriptions. Second, the current framework primarily focuses on short temporal windows, analyzing behavior over the preceding four hours. While this design supports in-the-moment assessment and responsiveness, it may fail to capture longer-term behavioral patterns and symptom trajectories that unfold over weeks or months. Incorporating multi-scale temporal modeling could improve sensitivity to chronic or slowly evolving mental health states.

Third, although the model integrates multiple sensor streams, it may still lack sufficient environmental or situational context to fully disambiguate certain physiological or behavioral signals. For example, elevated heart rate or reduced mobility may reflect physical activity, illness, or stress, depending on context that is not always observable from passive sensing alone. Future work could mitigate this limitation by incorporating richer contextual signals or structured user input. Finally, the data synthesis pipeline relies on an iterative Refine-n-Judge loop involving multiple LLMs. While this approach improves output quality and consistency, it introduces additional computational overhead. As a result, scaling the pipeline to real-time deployment or very large datasets may be challenging under constrained computational budgets. Exploring more efficient training or distillation strategies could help address this limitation.

Study Participation. The study procedures received approval from an institutional review board, and all participants provided both written and verbal informed consent before data collection began. Of the 300 participants recruited via online advertisements, data from 258 were utilized for this analysis. Study eligibility was determined through a Structured Clinical Interview for DSM-5 (SCID) conducted by trained clinicians, with inclusion limited to individuals diagnosed with Major Depressive Disorder (MDD) who did not have comorbid bipolar disorder, active suicidality at intake, or psychosis. The participant sample was predominantly White (79%) and female (84%), with a mean age of 40 years. Representation of racial minority groups, including 12% Hispanic or Latino participants, was comparable to the broader U.S. population with MDD. Most participants (93%) had some college education, and 61% were employed, with household incomes broadly aligned with national distributions. Participants were compensated $1 for each completed EMA, plus a $50 bonus for achieving a 90% completion rate. Safety was ensured through ongoing clinical oversight and automated safeguards within the mobile application that notified clinicians in the event of active suicidality.

Systems & Software. All modeling was performed within a secure, closed computing environment using locally deployed large language models. Raw participant data were never transmitted to external APIs, cloud services, or third-party platforms; only AI-generated outputs were permitted outside the secure environment. Access to sensitive data was restricted to authorized researchers who had completed institutional review board-approved human subjects training. The modeling environment employed role-based access controls, audit logging, and continuous monitoring to track data usage and prevent unauthorized access. Collectively, these safeguards reduced the risk of re-identification of personally identifiable information and ensured adherence to IRB-approved data protection and privacy protocols.

Adverse Usage. LLMs for interpreting time-series data in mental health contexts offer substantial potential benefits. However, we caution against several forms of misuse to protect participant safety, privacy, and clinical integrity. Although LLM narratives provide valuable insights, they must augment clinical workflows rather than serve as standalone diagnostic tools. Over-reliance on automated summaries may obscure critical nuances within the raw sensor data, while the utilization of sensitive signals like GPS traces necessitates strict governance and access controls to prevent unauthorized surveillance or secondary data use. Furthermore, as current LLMs cannot entirely eliminate the risk of hallucinations, all generated outputs must be reviewed by qualified experts to ensure clinical judgment remains the final authority for interpretation and intervention.

Training Specifications. Training is conducted on 4×H200 GPUs under a DeepSpeed distributed environment with bf16 precision. We use the AdamW optimizer with a cosine learning-rate schedule. The per-device batch size is set to 4, and gradients are accumulated for 32 steps, resulting in an effective batch size of 128 sequences per update. The model is fine-tuned with a base learning rate of 1 × 10⁻⁵.

Baseline Inference Configurations. For all baseline models, inference parameters follow the default configurations provided in the official Qwen2.5 Hugging Face repositories, with temperature set to 0.7, top-p to 0.8, and top-k to 20. Few-shot exemplars are selected from the training split and formatted according to the prompt templates in Appendix §G.

Dataset Splits. All experiments use participant-level splits to prevent information leakage across EMA windows from the same individual. The 258 participants are partitioned into disjoint training (180; 69.77%), validation (38; 14.73%), and test (40; 15.50%) sets, following an approximate 70:15:15 ratio. All windows from a given participant appear in only one split, ensuring evaluation on unseen individuals.
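A minimal sketch of a participant-level split with these proportions; the seed, ID format, and shuffling policy are illustrative assumptions.

```python
# Sketch: split participant IDs (not windows) into disjoint train/val/test sets.
import random

def participant_split(participant_ids: list[str], seed: int = 0):
    ids = sorted(participant_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train_ids, val_ids, test_ids = participant_split([f"P{i:03d}" for i in range(258)])
# Yields 180 / 38 / 40 participants; EMA windows are then routed to splits by participant ID.
```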

The Ecological Momentary Assessment (EMA) used in this study consists of 13 primary items designed to monitor intra-day variations in mental health (Table 2). These items were specifically adapted from the Patient Health Questionnaire (PHQ) (Kroenke et al., 2001) and the Generalized Anxiety Disorder (GAD) scale (Spitzer et al., 2006) to assess symptoms over the preceding four-hour window. As shown in Table 2, the questionnaire covers 14 distinct categories, including core depression indicators like anhedonia and depressed mood, alongside physical and cognitive markers such as somatic discomfort, fatigue, and concentration. It also includes a negative event question to provide context for mental health symptoms.

The sensor streams are preprocessed as follows. For steps and heart rate, we applied empirical thresholds to remove outliers. Stress and accelerometer-derived features were already processed by the data collection software and required no additional filtering. GPS data was reduced to latitude and longitude pairs. Phone lock and unlock state was encoded as a binary sequence, where 0 indicates locked and 1 indicates unlocked, and downsampled from a 1-second to a 60-second resolution to avoid excessively long sequences. Conversation events were logged only when speech was detected, and for each EMA-aligned window we summed event durations to obtain total conversational time in seconds. Sleep duration was recorded once per day as the previous night’s total hours slept.

After preprocessing the sensor streams, we standardized all signals to fixed sampling rates, including heart rate every 10 seconds, GPS every 10 minutes, steps every 60 seconds, stress every 60 seconds, ZCR every 30 seconds, and phone lock and unlock every 60 seconds. Sleep duration and conversation length were stored as scalar values for each window. This unified representation ensured consistent temporal alignment across all sensing modalities.
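A sketch of standardizing one stream to a fixed sampling rate with pandas; the target rates follow the text, but the aggregation and interpolation policy are assumptions.

```python
# Sketch: resample a raw stream onto a fixed grid so modalities stay temporally aligned.
import pandas as pd

def resample_stream(df: pd.DataFrame, rate: str) -> pd.Series:
    """df has 'timestamp' and 'value' columns; rate e.g. '10s', '60s', '10min'."""
    s = df.set_index("timestamp")["value"].sort_index()
    return s.resample(rate).mean().interpolate(limit_direction="both")

# hr_fixed = resample_stream(heart_rate_df, "10s")    # heart rate every 10 seconds
# steps_fixed = resample_stream(steps_df, "60s")      # steps every 60 seconds
```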

Table 2: EMA Questions. Categories of depression- and anxiety-related symptoms and their corresponding questions or statements.

4. Fatigue: In the past 4 hours, how tired or low in energy has the user been?
5. Appetite Change: In the past 4 hours, how much has the user shown a poor appetite or overeating?
6. Self-worth / Guilt: In the past 4 hours, how much has the user felt bad about themselves?
7. Concentration: In the past 4 hours, how much trouble has the user had concentrating?
8. Psychomotor Change: In the past 4 hours, how much has the user been moving or speaking more slowly than usual?
9. Suicidal Ideation: In the past 4 hours, how often has the user had thoughts of harming themselves or wishing to be dead?
10. Somatic Discomfort: In the past 4 hours, how much has the user experienced headache, abdominal discomfort, or body aches?
11. Inverted Question: An inverted question randomized from Q1, Q4, or Q7.
12. Anxiety Arousal: In the past 4 hours, how much has the user felt nervous, anxious, or on edge?
13. Uncontrollable Worry: In the past 4 hours, how much has the user been unable to stop or control worrying?
14. Negative Event: In the past 4 hours, did the user experience a negative event? If yes: How negative was the event?
Overall Summary: Please summarize the user’s overall mental and physical state in the past 4 hours, integrating mood, energy, sleep, appetite, concentration, and physical symptoms.

Presence detection is computed over all symptom categories. Let $a_j, \hat{a}_j \in \{0, 1\}$ denote reference and predicted presence for category $j$; counts of correct mentions, hallucinations, and omissions are accumulated as $TP$, $FP$, and $FN$, respectively. Severity is scored only when either the reference or the prediction marks a symptom as present. For each such $j \in J$, a weight $w_j$ is assigned based on ordinal deviation, and the final alignment score averages these weights over $J$. This captures both hallucination penalties ($a_j \neq \hat{a}_j$) and ordinal severity mismatch.
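Since the original equations did not survive extraction, the following is a hedged reconstruction consistent with the definitions above (presence counts, a recall-style coverage, an ordinal weight, and an averaged alignment score); the paper's exact formulas may differ.

```latex
% Plausible reconstruction of the metric definitions (assumption, not the paper's exact formulas).
\[
TP = \sum_{j} \mathbf{1}[a_j = 1 \wedge \hat{a}_j = 1], \qquad
FP = \sum_{j} \mathbf{1}[a_j = 0 \wedge \hat{a}_j = 1], \qquad
FN = \sum_{j} \mathbf{1}[a_j = 1 \wedge \hat{a}_j = 0]
\]
\[
\text{Coverage} = \frac{TP}{TP + FN}
\]
\[
w_j =
\begin{cases}
0, & a_j \neq \hat{a}_j,\\[4pt]
1 - \dfrac{\lvert \mathrm{sev}_j - \widehat{\mathrm{sev}}_j \rvert}{\Delta_{\max}}, & \text{otherwise},
\end{cases}
\qquad
\text{Alignment} = \frac{1}{\lvert J \rvert} \sum_{j \in J} w_j
\]
```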

Impact of Time Series Modality. To disentangle the effect of supervised fine-tuning from the contribution of native time-series modeling, we conduct an ablation using a text-only baseline (TS-Text-FT). In this setting, we remove the time-series encoder from LENS and fine-tune the same Qwen2.5-14B backbone on the identical sensor-text training pairs. Instead of being encoded by a patch-based time-series encoder, sensor streams are directly serialized as textual inputs. To control for sequence truncation effects, the text-only model is trained with a cutoff length of 20,000 tokens, covering more than 99.95% of training samples. As reported in Table 4, LENS consistently outperforms TS-Text-FT across all evaluation dimensions. For summary-level generation, LENS achieves higher clinical alignment (0.601 vs. 0.583) and stronger linguistic quality (METEOR score of 0.467 vs. 0.408). The gap widens substantially in the item-level QA setting, where TS-Text-FT exhibits difficulty reasoning over long, unstructured numerical sequences, resulting in markedly lower ROUGE-L scores (0.172 vs. 0.603). These results indicate that supervised fine-tuning alone is insufficient to capture the temporal structure of sensor data, and that explicit time-series encoding plays a central role in LENS’s performance.

Computational Efficiency and Token Consumption. In addition to accuracy, we analyze the computational efficiency of different modeling paradigms by comparing their token consumption. Specifically, we measure the mean prefill token count per sample and the total token consumption over the Narrative (n = 8,192) and Item-level QA (n = 16,384) datasets for three approaches: text-only (Qwen2.5), vision-based (Qwen2.5-VL-32B), and LENS.

For LENS, effective prefill tokens are computed as the sum of textual tokens and patch-based embeddings (patch size k = 8) that replace raw time-series placeholders. For the vision-language baseline, token counts follow the model’s native processor, which accounts for both textual inputs and image-derived tokens. For the text-only baseline, the tokenizer needs to handle all the sensor streams as text, resulting in substantially longer input sequences.
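A back-of-the-envelope sketch of this accounting: textual prompt tokens plus one patch embedding per ceil(T/k) window of each stream. The stream lengths come from the example prompt later in this appendix; the 600-token text budget is an illustrative assumption.

```python
# Sketch: effective prefill tokens for LENS = text tokens + sum of per-stream patch counts.
import math

def lens_prefill_tokens(n_text_tokens: int, stream_lengths: list[int], k: int = 8) -> int:
    return n_text_tokens + sum(math.ceil(T / k) for T in stream_lengths)

# Seven continuous streams from the example prompt (lengths 1440, 480, 240, 240, 24, 24, 240):
print(lens_prefill_tokens(600, [1440, 480, 240, 240, 24, 24, 240]))
# -> 936, in the same ballpark as the ~930 tokens/sample reported above.
```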

Figure 8 summarizes the results. Across both tasks, LENS exhibits the lowest token consumption by a large margin. Compared to the text-only baseline, which averages over 15,800 tokens per sample due to verbose serialization of high-frequency signals, LENS requires approximately 930 tokens per sample, corresponding to a reduction of about 94%. Relative to the vision-language model, which represents signals as plots, LENS remains roughly four times more token-efficient (933 vs. 3,679 tokens). This reduction in context length provides a practical explanation for the degraded QA performance observed in text-only fine-tuning, where long serialized inputs are prone to attention dilution in extended contexts. By encoding raw signals into compact patch-level representations, LENS substantially reduces sequence length while preserving task-relevant temporal information, enabling more efficient and scalable inference for long-duration health monitoring scenarios.

Impact of Data Scale. To investigate the influence of training data volume, we trained LENS using varying proportions (10%, 50%, and 100%) of the generated dataset, keeping the base model architecture consistent. As shown in Table 5, the impact of data scale varies across different evaluation dimensions. For summary-level narrative generation, Presence Alignment improves strictly monotonically with data size, rising from 0.545 (10% data) to 0.601 (100% data), indicating that larger datasets are essential for minimizing hallucinations in complex clinical summaries. However, Symptom Coverage is not strictly monotonic; the model trained on 50% data achieved the highest coverage (0.823), slightly outperforming the full model (0.801) and the 10% model (0.783). In the Item-level QA task, the full 100% model achieves the best overall performance in both linguistic metrics (ROUGE-L 0.603) and clinical alignment (0.732).

Notably, the model trained on only 10% data exhibits surprisingly strong performance in QA tasks, achieving a higher alignment score (0.729) than the 50% model (0.706). This suggests that while basic instruction-following for short-form QA can be established with smaller data scales, the full dataset is necessary to maximize structural coherence and clinical fidelity in longer narratives.

Smaller LENS Variants Remain Competitive. To address computational feasibility for deployment in resource-constrained environments, we evaluate LENS-7B, a lighter variant of our framework, alongside models trained on reduced data proportions. As shown in our comparison, LENS-7B achieves slightly lower performance on complex clinical alignment than LENS-14B (0.559 vs. 0.601), reflecting the trade-off between model size and reasoning depth. Nevertheless, LENS-7B matches the 14B model’s mean performance on linguistic metrics (e.g., ROUGE-L 0.409 vs. 0.410) and closely trails it in Item-level QA tasks (Presence Alignment 0.727 vs. 0.732). Similarly, models trained on limited data demonstrate surprising robustness in short-form generation. These results suggest that LENS-7B or low-data fine-tuning provides a strong balance between efficiency and performance, making it a cost-effective choice for real-world, resource-limited scenarios such as edge-device deployment.

Survey Design. The survey was designed with input from both a clinical psychologist and a therapist. Each rating example includes a narrative presented alongside a ground-truth symptom table, followed by questions assessing accuracy and comprehensiveness. Note that the user study excludes the inverted question (Q11 in Table 2), which is a semantic reversal of an existing item, and the Sleep Disturbance question, which is administered only in morning EMAs, to ensure non-redundant and temporally consistent evaluation. Because our expert reviewers noted that mentally tracking symptoms across the narrative was difficult, we incorporated two design changes: (1) presenting the narrative and symptom table simultaneously on the screen, and (2) adding Yes/No checkboxes to help raters track individual symptoms when answering these questions. For the clinical utility and language cohesion items, we display the narrative again to remind raters of the content. After receiving positive design feedback from our expert reviewers, we administered the survey to the 13 mental health experts.

Pre-survey Questions. Before beginning the rating process, raters were asked to answer four background questions:

  1. What is your professional role or background? (Options: Psychiatrist, Clinical Psychologist, Therapist, Other)

  2. What is your highest degree held? (Options: High school/Diploma, Bachelor’s, Master’s, Doctorate/MD)

  3. How familiar are you with the Patient Health Questionnaire-9 (PHQ-9) for depression screening?

  4. How familiar are you with the Generalized Anxiety Disorder Questionnaire (GAD-7) for anxiety screening?

For Questions 3 and 4, the response options were: Not familiar at all, Slightly familiar, Moderately familiar, and Very familiar. The experts’ pre-survey responses are summarized in Table 7.

Survey Flow. After completing the pre-survey questions, each expert rater is introduced to the task through a detailed explanation of the overall goal, problem setting, and rating procedure. They are then shown an annotated example that includes sample answers and reasoning to illustrate how the evaluation should be performed. In every survey, the first example presented is a dummy item intended solely to help raters become familiar with the interface and question format. This is followed by the nine narratives that constitute the actual rating set.

Prompt Templates. Tables 8-15 summarize the prompt templates used throughout our experiments for narrative generation, question answering, and evaluation. Tables 8 and 9 define the prompts for rewriting rule-based EMA-derived symptom descriptions into fluent narrative labels and for LLM-based quality assessment of generated narratives. Tables 10 and 11 specify structured evaluation prompts for extracting symptom presence and severity, as well as for assessing ordinal severity alignment in single-item QA. Tables 12 and 13 present the prompts used for vision-language baselines, where multivariate sensor streams are provided as multi-panel time-series visualizations together with contextual features. Finally, Tables 14 and 15 describe the text-based baseline prompts, in which the same sensor data are serialized as numerical sequences using placeholder tokens. We adopt few-shot prompting for non-fine-tuned baselines to mitigate systematic mismatches in output format observed under zero-shot settings; as shown in Table 16, zero-shot baselines often fail to follow the required PHQ-9-style narrative structure and instead produce generic, data-centric descriptions. By applying few-shot exemplars to the baseline models, we raise their performance floor and isolate differences arising from input representation and architectural design rather than prompt or format misalignment.

Prompt:

The monitoring system provides seven continuous data streams. Heart rate, measured as beats per minute (bpm), has a length of 1440 and is represented as [stream placeholder]. Pseudo-actigraphy, measured as the Garmin watch accelerometer (product of ZCR count and energy), has a length of 480 and is represented as [stream placeholder]. Steps per minute has a length of 240 and is represented as [stream placeholder]. Stress level, measured as a Garmin watch estimate related to HRV, has a length of 240 and is represented as [stream placeholder]. GPS longitude, measured as coordinates, has a length of 24 and is represented as [stream placeholder]. GPS latitude, measured as coordinates, has a length of 24 and is represented as [stream placeholder]. Phone unlock status, measured as binary events per minute, has a length of 240 and is represented as [stream placeholder]. Sleep duration is 7.770 hours. Conversation length is 6665 seconds.

Question: Over the last 4 hours, how noticeable was the user’s anxiety?


The user demonstrates a pattern of moderate physical activity, with periods of increased movement and heart rate, suggesting engagement in various tasks or activities. Sleep duration of 10.02 hours indicates good rest, potentially supporting mood stability. The stress level shows fluctuations, with peaks that may correlate with increased activity or cognitive demands, as evidenced by frequent phone unlocking. The GPS data reveals limited spatial mobility, possibly reflecting a routine or sedentary lifestyle. The conversation length of 2303 seconds suggests active social engagement, which can be beneficial for mood. Overall, the user appears to have a balanced routine with adequate rest and social interaction, though the fluctuating stress levels warrant further investigation to assess potential underlying anxiety or worry. The data does not strongly indicate significant mood disturbances or psychomotor changes, but continued monitoring is advised to capture any emerging patterns related to depression or anxiety.

Table 16: Example output from a zero-shot baseline model. Without few-shot guidance, the model focuses on descriptive interpretations of sensor data and behavioral signals, rather than producing a PHQ-9-style narrative summary aligned with EMA-derived symptom categories and severity framing.


Ablation study: with or without the time-series encoder. Comparison between LENS and TS-Text-FT (fine-tuned on text-serialized time series).

