MotionTeller: Multi-modal Integration of Wearable Time-Series with LLMs for Health and Behavioral Understanding

Reading time: 6 minutes

📝 Abstract

As wearable sensing becomes increasingly pervasive, a key challenge remains: how can we generate natural language summaries from raw physiological signals such as actigraphy (minute-level movement data collected via accelerometers)? In this work, we introduce MotionTeller, a generative framework that natively integrates minute-level wearable activity data with large language models (LLMs). MotionTeller combines a pretrained actigraphy encoder with a lightweight projection module that maps behavioral embeddings into the token space of a frozen decoder-only LLM, enabling free-text, autoregressive generation of daily behavioral summaries. We construct a novel dataset of 54,383 (actigraphy, text) pairs derived from real-world NHANES recordings, and train the model using cross-entropy loss with supervision only on the language tokens. MotionTeller achieves high semantic fidelity (BERTScore-F1 = 0.924) and lexical accuracy (ROUGE-1 = 0.722), outperforming prompt-based baselines by 7 percent in ROUGE-1. The average training loss converges to 0.38 by epoch 15, indicating stable optimization. Qualitative analysis confirms that MotionTeller captures circadian structure and behavioral transitions, while PCA plots reveal enhanced cluster alignment in embedding space post-training. Together, these results position MotionTeller as a scalable, interpretable system for transforming wearable sensor data into fluent, human-centered descriptions, introducing new pathways for behavioral monitoring, clinical review, and personalized health interventions.

📄 Content

In recent years, the proliferation of wearable sensor devices has transformed how we monitor and understand human behavior in everyday settings. Actigraphy, the minute-level activity traces captured via wrist-worn accelerometers, has become a core modality in both clinical and consumer health contexts. It provides a non-invasive, high-resolution view into physical movement and is widely used to assess sleep quality, circadian rhythm, depression, and medication adherence (Yoo et al. 2023; Dunn et al. 2021).

At the same time, large language models (LLMs) have revolutionized natural language understanding and generation, demonstrating strong generalization across diverse tasks through pretrained contextual representations (Naveed et al. 2024). Their applications in healthcare are expanding rapidly: LLMs have shown promise in generating medical explanations, supporting peer-based interventions, and simulating empathic conversation in mental health contexts (De Choudhury, Pendse, and Kumar 2023; Sharma et al. 2023). Yet as recent evaluations suggest, current LLMs often underperform on the competencies most relevant to behavioral health (such as contextual sensitivity, emotional nuance, and ethical decision-making), particularly when applied to high-stakes counseling scenarios (Nguyen et al. 2025).

A key missing link is the ability to natively connect LLMs with behavioral sensor inputs such as actigraphy. Despite their complementary strengths, LLMs and actigraphy have traditionally existed almost completely independently. Sensor-based models, such as CNNs, RNNs, and transformers, have been used to classify behavioral states and predict health outcomes from raw actigraphy (Heinz et al. 2022; F. Ruan et al. 2024; Dorris, Oh, and Jacobson 2024), but they produce structured outputs like labels or scores instead of language. Moreover, many of these CNN and LSTM models rely on short time windows or handcrafted features and struggle to capture long-range dependencies in continuous, high-dimensional sequences (Rahman and Adjeroh 2019; Patterson et al. 2023).

Conversely, while LLMs are inherently generative, they lack the capacity to interpret raw time-series inputs directly. Attempts to bridge the gap between sensing and generation have typically relied on manual preprocessing or prompt engineering, such as tokenizing numeric sequences into digit-wise text strings or formatting prompts as question-answering tasks (Gruver et al. 2023; Kim et al. 2024; Nepal et al. 2024). These approaches often fragment the temporal structure of the original signal, leading to representations that are decoupled from the dynamics of real-world time-series data. Similarly, earlier interpretable models for clinical time-series, like RETAIN, require heavily structured and aggregated inputs, limiting flexibility when dealing with dense or high-frequency behavioral trajectories (Choi et al. 2016). This reliance on preprocessing and structure constrains the interpretability and adaptability of outputs, particularly in complex real-world contexts.
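To make the fragmentation concrete, here is a minimal sketch of the digit-wise serialization used by prompt-based approaches: each numeric value is rendered as space-separated digits so a text-only LLM can consume it. The function name and formatting conventions are illustrative assumptions, not any specific paper's implementation.

```python
# Hypothetical sketch of digit-wise serialization of a numeric time series.
# Each value becomes individual digit tokens; time steps are comma-delimited.
# Note how the signal's temporal structure is flattened into plain text.

def serialize_series(values, decimals=0):
    """Render a numeric sequence as a digit-wise text string."""
    tokens = []
    for v in values:
        s = f"{v:.{decimals}f}" if decimals else str(int(round(v)))
        tokens.append(" ".join(s))   # split the number into single digits
    return " , ".join(tokens)        # delimit successive time steps

# Example: three minute-level activity counts
print(serialize_series([120, 45, 7]))
# -> "1 2 0 , 4 5 , 7"
```

A single minute-level day (1,440 values) serialized this way consumes thousands of tokens, which is one reason such encodings scale poorly compared with learned sensor embeddings.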

To address this, we introduce MotionTeller, a unified framework that aligns actigraphy and language through a generative pipeline. MotionTeller combines a pretrained transformer-based actigraphy encoder (PAT) (F. Y. Ruan et al. 2024), originally trained on over 29,000 participants from the NHANES dataset, with a lightweight projection module that maps sensor-derived embeddings into the token space of a frozen decoder-only LLM. This setup enables the model to condition autoregressive generation on raw activity traces and produce fluent, context-aware behavioral summaries. We also release a novel dataset of over 54,000 ⟨raw sequence, generated label⟩ pairs, each consisting of a 24-hour actigraphy sequence and a GPT-generated summary crafted via few-shot prompting. Finally, we examine how semantic representations evolve across training and evaluate performance both quantitatively and qualitatively. Together, these contributions form a foundation for behavior-aware LLMs that translate physiological signals into human-readable narratives, offering new possibilities for personalized feedback, clinical interpretability, and scalable behavioral health tools.
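The alignment step described above can be sketched as follows. This is not the authors' code: it is a minimal NumPy illustration, with invented dimensions, of the general pattern the paper describes: frozen-encoder patch embeddings are mapped by a small trainable linear projection into the LLM's token-embedding space and prepended to the text embeddings, and the cross-entropy loss is computed over the language-token positions only.

```python
import numpy as np

# Illustrative sketch of a MotionTeller-style pipeline (all dims are made up).
rng = np.random.default_rng(0)
ENC_DIM, LLM_DIM, VOCAB = 96, 128, 1000
B, S, T = 2, 4, 6                     # batch, sensor patches, text tokens

W = rng.normal(size=(ENC_DIM, LLM_DIM)) * 0.02  # trainable projection
sensor_emb = rng.normal(size=(B, S, ENC_DIM))   # frozen-encoder output (stand-in)
text_emb = rng.normal(size=(B, T, LLM_DIM))     # frozen LLM token embeddings

# Prepend projected sensor embeddings to the language embeddings.
inputs = np.concatenate([sensor_emb @ W, text_emb], axis=1)  # (B, S+T, LLM_DIM)

def masked_xent(logits, labels, mask):
    """Mean cross-entropy over positions where mask is True."""
    logits = logits - logits.max(axis=-1, keepdims=True)     # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(logp, labels[..., None], axis=-1)[..., 0]
    return (nll * mask).sum() / mask.sum()

logits = rng.normal(size=(B, S + T, VOCAB))     # stand-in for LLM output
labels = rng.integers(0, VOCAB, size=(B, S + T))
mask = np.zeros((B, S + T), dtype=bool)
mask[:, S:] = True                              # supervise language tokens only

loss = masked_xent(logits, labels, mask)
print(inputs.shape, loss > 0)
```

In a real implementation the loss would be produced by the frozen LLM's forward pass rather than random logits, and only `W` (plus any projection-module parameters) would receive gradients; the mask plays the role that an `ignore_index` on sensor positions plays in standard framework losses.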

As large language models (LLMs) and wearable sensing technologies continue to evolve, a growing body of work explores how to connect time-series representations with language-based reasoning. Prior research spans traditional actigraphy modeling, health-oriented LLMs, early attempts at sensor-to-text generation, and architectural insights from multimodal foundation models. Yet, there remains a significant gap in directly aligning raw sensor data with autoregressive LLMs for free-form generation, which MotionTeller is designed to fill.

Actigraphy has long been a core modality in health monitoring, particularly in studies of circadian rhythms, psychiatric disorders, and medication use. Traditional work relied on engineered features or statistical summaries.

This content is AI-processed based on ArXiv data.
