Towards understanding and modelling office daily life
Measuring and modeling human behavior is a very complex task. In this paper we present our initial thoughts on modeling and automatic recognition of some human activities in an office. We argue that to successfully model human activities, we need to consider both individual behavior and group dynamics. To demonstrate these theoretical approaches, we introduce an experimental system for analyzing everyday activity in our office.
Research Summary
The paper presents an early-stage exploration of how to automatically recognize and model everyday human activities that occur within a typical office environment. The authors argue that successful modeling of office behavior must simultaneously account for individual-level patterns (such as personal work speed, desk-to-desk movement, and keyboard and mouse usage habits) and group-level dynamics (such as meeting attendance, collaborative interactions, and informal conversations). To illustrate these ideas, they design and implement an experimental sensing platform that collects multimodal data from a real office setting.
The hardware setup includes multiple overhead and wall-mounted cameras, ambient microphones, and per-desk logging devices that capture computer interaction events (keystrokes, mouse clicks, application usage). Video streams are processed with person detection and tracking algorithms to assign a unique identifier to each employee, enabling the construction of time-ordered action sequences. Audio streams are analyzed to detect speech activity and to differentiate between meeting contexts and casual chatter. Keyboard and mouse logs provide fine-grained cues about the type of task being performed (e.g., document editing, coding, email composition). All modalities are synchronized into fixed-length time windows, preserving both temporal continuity and spatial relationships.
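The synchronization step can be sketched with a toy windowing function. The event tuple layout, modality names, and payload fields below are assumptions made for illustration, not the authors' actual schema:

```python
from collections import defaultdict

def window_events(events, window_s=10.0):
    """Bucket timestamped events from several modalities into fixed-length
    time windows, keyed by (window_index, modality)."""
    windows = defaultdict(list)
    for ts, modality, payload in events:
        idx = int(ts // window_s)   # index of the fixed-length window
        windows[(idx, modality)].append(payload)
    return dict(windows)

# Toy streams: a video track, a speech-activity flag, and a keyboard event.
events = [
    (1.2, "video", {"person": "A", "desk": 3}),
    (4.7, "audio", {"speech": True}),
    (11.0, "keys", {"person": "A", "app": "editor"}),
]
w = window_events(events, window_s=10.0)
# Events at t=1.2 s and t=4.7 s share window 0; t=11.0 s falls in window 1.
```

Keying by `(window, modality)` keeps the modalities aligned on a common clock, which is what makes later fusion over a shared time axis possible.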
For the modeling stage, the authors propose a hybrid deep learning architecture that combines a graph neural network (GNN) to capture the relational structure of group interactions with a convolutional neural network (CNN) or temporal convolutional network (TCN) to process the sequential features of individual behavior. The GNN receives a dynamic interaction graph whose nodes represent employees and whose edges encode co-presence in meetings or shared screen sessions; edge weights evolve over time based on proximity and communication cues. The CNN/TCN processes per-person feature vectors derived from video pose estimates, audio speech activity, and interaction logs, producing a latent representation of each individual's activity state. The two streams are fused to predict the current activity label (e.g., "focused work", "meeting", "break", "informal discussion") and to estimate transition probabilities for a hidden Markov model that can be used for longer-term behavior forecasting.
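The paper does not spell out the fusion or forecasting equations, so the following is a minimal sketch, with invented numbers, of how per-stream activity scores could be fused into a posterior and then fed through one HMM forward step:

```python
import math

# Illustrative sketch: fuse scores from the group (GNN) and individual
# (CNN/TCN) streams into an activity posterior, then take a single HMM
# forward step. All logits and transition values are made up.

ACTIVITIES = ["focused work", "meeting", "break", "informal discussion"]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical per-window logits for one employee from each stream.
gnn_logits = [0.2, 1.5, -0.3, 0.1]   # group stream: co-presence hints at a meeting
tcn_logits = [1.0, 0.8, -0.5, 0.0]   # individual stream: steady desk activity
fused = [g + t for g, t in zip(gnn_logits, tcn_logits)]
posterior = softmax(fused)           # current activity distribution

# Row-stochastic HMM transition matrix (illustrative values):
# T[i][j] = P(next activity = j | current activity = i).
T = [
    [0.70, 0.15, 0.10, 0.05],
    [0.20, 0.60, 0.10, 0.10],
    [0.40, 0.10, 0.30, 0.20],
    [0.30, 0.20, 0.10, 0.40],
]

# One forward step: P(next = j) = sum_i P(current = i) * T[i][j].
n = len(ACTIVITIES)
next_dist = [sum(posterior[i] * T[i][j] for i in range(n)) for j in range(n)]
```

Additive late fusion of logits is just one plausible choice here; the actual system could equally concatenate the latent vectors and learn a joint classifier head.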
Preliminary experiments on a dataset collected over several weeks from a medium-size office (approximately 30 employees) reveal several key findings. First, there is substantial intra-role variability: even employees with the same job title exhibit distinct work rhythms, suggesting that a one-size-fits-all model would be insufficient. Second, incorporating explicit group-level signals (meeting schedules, co-attendance graphs) improves activity classification accuracy by roughly 12% compared with models that rely solely on individual sensor streams. Third, multimodal fusion markedly increases robustness to environmental noise; for instance, the system maintains high recognition rates despite fluctuating lighting conditions or background conversations that would degrade a vision-only approach.
The authors acknowledge several limitations. Manual annotation of activity labels was labor-intensive, limiting the size of the training set and raising concerns about scalability. Privacy considerations were only superficially addressed; the raw video and audio data contain personally identifiable information, and the paper does not explore techniques such as differential privacy, on-device processing, or federated learning to mitigate these risks. Moreover, the current system assumes a static office layout; any major reconfiguration of workstations or a shift toward remote or hybrid work would require substantial retraining or redesign of the sensing infrastructure.
Future work is outlined along three main directions. (1) Reduce labeling effort through semi-supervised or self-supervised learning, leveraging clustering of unlabeled sensor streams to generate pseudo-labels. (2) Integrate privacy-preserving mechanisms, such as applying homomorphic encryption to feature extraction or employing federated learning so that raw data never leave individual workstations. (3) Generalize the model to handle diverse office topologies and hybrid work scenarios by incorporating additional data sources (e.g., calendar APIs, virtual meeting logs) and by designing layout-agnostic feature representations.
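Direction (1), pseudo-labeling from clusters of unlabeled sensor windows, can be sketched with plain k-means. The per-window features, function names, and cluster count are illustrative only, not part of the paper:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_pseudo_labels(features, k=2, iters=20, seed=0):
    """Assign cluster indices to unlabeled feature vectors with plain
    k-means; the indices can then serve as pseudo-labels for training."""
    rng = random.Random(seed)
    centroids = list(rng.sample(features, k))
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest centroid for each feature vector.
        labels = [min(range(k), key=lambda c: dist2(x, centroids[c]))
                  for x in features]
        # Update step: recompute each centroid from its members.
        for c in range(k):
            members = [x for x, lab in zip(features, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels

# Toy per-window features: (keystrokes per minute, fraction of time speaking).
features = [(60, 0.0), (55, 0.1), (5, 0.8), (2, 0.9)]
labels = kmeans_pseudo_labels(features, k=2)
# The two "typing" windows cluster apart from the two "talking" windows.
```

In a semi-supervised setup, a small manually labeled set would then map each cluster index to a human-readable activity name before training the classifier on the pseudo-labeled windows.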
If these challenges are addressed, the authors contend that automatic office activity recognition could become a valuable tool for a range of practical applications: real-time workload balancing, space utilization optimization, early detection of burnout or disengagement, and data-driven organizational culture assessments. The paper thus sets a conceptual foundation and a prototype implementation that invite further research into scalable, privacy-aware, and context-rich modeling of everyday professional life.