Evaluating Few-Shot Temporal Reasoning of LLMs for Human Activity Prediction in Smart Environments

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please consult the original arXiv paper.

Anticipating human activities and their durations is essential in applications such as smart-home automation, simulation-based architectural and urban design, activity-based transportation system simulation, and human-robot collaboration, where adaptive systems must respond to human activities. Existing data-driven agent-based models, from rule-based to deep learning, struggle in low-data environments, limiting their practicality. This paper investigates whether large language models, pre-trained on broad human knowledge, can fill this gap by reasoning about everyday activities from compact contextual cues. We adopt a retrieval-augmented prompting strategy that integrates four sources of context (temporal, spatial, behavioral history, and persona) and evaluate it on the CASAS Aruba smart-home dataset. The evaluation spans two complementary tasks: next-activity prediction with duration estimation, and multi-step daily sequence generation, each tested with various numbers of few-shot examples provided in the prompt. Analyzing few-shot effects reveals how much contextual supervision is sufficient to balance data efficiency and predictive accuracy, particularly in low-data environments. Results show that large language models exhibit strong inherent temporal understanding of human behavior: even in zero-shot settings, they produce coherent daily activity predictions, while adding one or two demonstrations further refines duration calibration and categorical accuracy. Beyond a few examples, performance saturates, indicating diminishing returns. Sequence-level evaluation confirms consistent temporal alignment across few-shot conditions. These findings suggest that pre-trained language models can serve as promising temporal reasoners, capturing both recurring routines and context-dependent behavioral variations, thereby strengthening the behavioral modules of agent-based models.


💡 Research Summary

Background and Motivation
Predicting what a person will do next and for how long is a cornerstone capability for a wide range of intelligent systems, from adaptive smart‑home controllers to urban‑design simulators and human‑robot collaboration platforms. Traditional rule‑based, classical‑ML, and deep‑learning agents achieve high accuracy only when abundant labeled activity data are available. In low‑data regimes—common in newly deployed or privacy‑sensitive environments—these models deteriorate sharply, creating a trade‑off between predictive performance and data availability.

Objective
The paper investigates whether large language models (LLMs), pre‑trained on massive textual corpora and thus endowed with commonsense knowledge about human routines, can serve as “knowledge‑transfer” modules for activity forecasting when only compact contextual cues are provided. Two tasks are examined: (1) next‑activity prediction with duration estimation, and (2) multi‑step daily roll‑out (generating an ordered sequence of future activities with durations). The central experimental factor is the number of few‑shot demonstrations (N) embedded in the prompt, ranging from zero to five.

Dataset and Pre‑processing
Experiments use the CASAS Aruba smart‑home dataset, which records binary motion and door sensor events for a single resident over 219 days. The raw event stream is filtered to keep only activity start/end markers, then paired to form (label, start‑time, end‑time) intervals. For evaluation, each day is expanded to a one‑minute resolution categorical timeline, enabling fine‑grained temporal metrics.
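The pre-processing described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the event tuples, timestamp format, and field order are assumptions modeled on CASAS-style annotated logs.

```python
from datetime import datetime, timedelta

# Pair activity begin/end markers into (label, start, end) intervals,
# then expand one day to a one-minute categorical timeline (1440 slots).
# Sample events and their format are hypothetical.
events = [
    ("2010-11-04 05:40:51", "Sleeping", "begin"),
    ("2010-11-04 08:01:12", "Sleeping", "end"),
    ("2010-11-04 08:11:09", "Meal_Preparation", "begin"),
    ("2010-11-04 08:27:02", "Meal_Preparation", "end"),
]

def to_intervals(events):
    open_acts, intervals = {}, []
    for ts, label, marker in events:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        if marker == "begin":
            open_acts[label] = t
        elif label in open_acts:  # close the matching open marker
            intervals.append((label, open_acts.pop(label), t))
    return intervals

def minute_timeline(intervals, day, fill="Other"):
    # One categorical slot per minute of the day.
    slots = [fill] * 1440
    for label, start, end in intervals:
        cur = start.replace(second=0)
        while cur < end:
            if cur.date() == day:
                slots[cur.hour * 60 + cur.minute] = label
            cur += timedelta(minutes=1)
    return slots

intervals = to_intervals(events)
timeline = minute_timeline(intervals, datetime(2010, 11, 4).date())
```

The one-minute timeline is what the fine-grained temporal metrics (and the DTW comparison later) operate on.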

Prompt Engineering and Retrieval‑Augmented Strategy
Four contextual pillars are encoded in a structured JSON format: (i) temporal (day of week, clock time), (ii) spatial (room layout, sensor placement), (iii) behavioral history (most recent activity‑duration pairs), and (iv) persona (natural‑language summary of the resident’s habits). An embedding model captures temporal‑behavioral similarity across training instances; a Maximal Marginal Relevance (MMR) selector then retrieves N examples that are both highly relevant to the current query and mutually diverse. The retrieved (context → ground‑truth) pairs are inserted into the prompt together with a system instruction, an indexed list of permissible activities, and a strict requirement that the model output a JSON object with fields next_activity and duration_minutes. Structured decoding enforces this format, eliminating free‑form text.
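The MMR retrieval step can be sketched with the classic greedy formulation: each candidate is scored by its relevance to the query minus its redundancy with examples already selected. The embeddings and the trade-off weight `lam` below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Greedy Maximal Marginal Relevance (MMR): pick N few-shot examples that
# are relevant to the query embedding yet mutually diverse.
def mmr_select(query_vec, cand_vecs, n, lam=0.7):
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    selected, remaining = [], list(range(len(cand_vecs)))
    while remaining and len(selected) < n:
        def score(i):
            relevance = cos(query_vec, cand_vecs[i])
            redundancy = max(
                (cos(cand_vecs[i], cand_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices of the retrieved examples
```

With `lam` near 1 the selector behaves like plain nearest-neighbor retrieval; lowering it trades relevance for diversity among the N demonstrations.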

Evaluation Metrics
Classification performance is measured by accuracy, micro/macro/weighted F1, precision, and recall. Duration quality is assessed with Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and a composite Joint Success@10 min (correct activity label and duration error ≤ 10 min). For the multi‑step roll‑out, Dynamic Time Warping (DTW) distance and a normalized DTW (DTW / day length) capture temporal alignment between predicted and ground‑truth sequences.
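The composite Joint Success@10 min metric can be written in a few lines: a prediction counts as a success only if the activity label matches and the absolute duration error is within the tolerance. Variable names below are illustrative.

```python
# Joint Success@10 min: fraction of predictions with the correct label
# AND a duration error of at most `tol_min` minutes.
def joint_success_at_10(preds, truths, tol_min=10):
    hits = sum(
        1
        for (p_label, p_dur), (t_label, t_dur) in zip(preds, truths)
        if p_label == t_label and abs(p_dur - t_dur) <= tol_min
    )
    return hits / len(truths)

preds = [("Sleeping", 420), ("Eating", 25), ("Relax", 90)]
truths = [("Sleeping", 428), ("Eating", 40), ("Relax", 85)]
# Only the first and third pairs satisfy both conditions.
```

Coupling the label and duration criteria penalizes models that get the category right but are poorly calibrated on timing, which neither accuracy nor MAE alone captures.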

Results

  • Zero‑shot: Activity label accuracy ≈ 62 %, MAE ≈ 28 min, Joint Success@10 min ≈ 38 %. The model already leverages generic daily routines learned during pre‑training.
  • One‑shot: Accuracy jumps to ≈ 78 % (+16 pp), MAE drops to ≈ 18 min, Joint Success rises to ≈ 55 %.
  • Two‑shot: Accuracy reaches ≈ 84 % (+6 pp), MAE ≈ 15 min, Joint Success ≈ 62 %.
  • Three‑shot and Five‑shot provide marginal gains (≤ 2 pp accuracy, ≤ 1 min MAE improvement).

In the roll‑out task, raw DTW decreases from ~210 min (zero‑shot) to ~150 min (two‑shot) and ~130 min (five‑shot), with normalized DTW showing a similar 30 % reduction. Beyond two demonstrations, performance plateaus, indicating diminishing returns.
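The DTW comparison used for sequence-level evaluation can be sketched with the standard dynamic-programming recurrence over two per-minute activity timelines. The 0/1 label-mismatch cost is an assumption; the paper may use a different local cost.

```python
# Classic DTW between two categorical sequences with 0/1 mismatch cost.
def dtw_distance(seq_a, seq_b):
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
            # Extend the cheapest of the three admissible warping moves.
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]

# Normalized DTW divides by the day length so scores are comparable
# across days of different coverage.
def normalized_dtw(seq_a, seq_b):
    return dtw_distance(seq_a, seq_b) / max(len(seq_a), len(seq_b))
```

Because DTW allows local stretching, a predicted routine that starts a bit early or late still aligns cheaply with the ground truth, so the distance mostly reflects genuinely wrong or missing activities.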

Analysis and Discussion
The steep improvement from 0 → 2 shots demonstrates that LLMs possess a strong inherent temporal model of human behavior; a handful of concrete examples suffice to specialize this knowledge to a specific resident and sensor configuration. The plateau after two shots suggests that additional in‑context data mainly increase prompt length (and token cost) without providing new information, making 1–2‑shot prompting the most cost‑effective configuration. The structured JSON requirement and the inclusion of a persona description were crucial for maintaining output consistency and for guiding the model toward realistic duration estimates.

Limitations include: (i) focus on a single resident, preventing assessment of multi‑occupant dynamics; (ii) reliance on motion and door sensors only, leaving richer modalities (temperature, audio, vision) unexplored; (iii) static prompt composition—dynamic weighting of the four contextual pillars could further improve adaptability; (iv) lack of post‑processing (e.g., Bayesian smoothing) to mitigate LLM stochasticity in duration predictions.

Conclusion and Future Work
The study provides empirical evidence that LLMs can act as powerful temporal reasoners in low‑data smart‑environment settings, achieving competitive activity‑forecasting performance with only one or two in‑context demonstrations. This opens a pathway to integrate LLM‑based predictors into traditional agent‑based modeling pipelines, reducing the need for extensive labeled datasets. Future research directions include extending the framework to multi‑resident households, incorporating multimodal sensor streams, developing adaptive prompt‑generation mechanisms, and coupling LLM predictions with probabilistic filters to enhance robustness. Ultimately, such hybrid systems could substantially improve situation‑aware services across smart homes, cities, and collaborative robotics.

