Group Event Detection with a Varying Number of Group Members for Video Surveillance

This paper presents a novel approach for automatic recognition of group activities for video surveillance applications. We propose to use a group representative to handle the recognition with a varying number of group members, and use an Asynchronous Hidden Markov Model (AHMM) to model the relationship between people. Furthermore, we propose a group activity detection algorithm which can handle both symmetric and asymmetric group activities, and demonstrate that this approach enables the detection of hierarchical interactions between people. Experimental results show the effectiveness of our approach.

💡 Research Summary

The paper addresses a fundamental challenge in video‑surveillance: recognizing group activities when the number of participants varies over time. Traditional approaches either assume a fixed group size or track every individual separately, leading to high computational cost and poor robustness to dynamic membership changes. To overcome these limitations, the authors introduce two complementary concepts: a Group Representative (GR) and an Asynchronous Hidden Markov Model (AHMM).

The Group Representative is a compact descriptor that summarizes the state of an entire group at each frame. It is constructed by aggregating per‑person features—spatial coordinates, velocities, orientations, and appearance cues—through a weighted combination of the group centroid and the most influential member (often a “leader”). This aggregation reduces dimensionality dramatically while preserving the essential dynamics of the group, allowing the system to remain efficient even when the group expands from a few individuals to dozens.

The AHMM extends the classic Hidden Markov Model by allowing each observation sequence to have its own timestamp. In real‑world surveillance footage, people start moving at different moments, change speed independently, and may become occluded temporarily. By assigning a time‑stamp to each observation and scaling transition probabilities with a decay function of the inter‑observation interval (e.g., exp(−λΔt)), the AHMM captures these asynchronous behaviors without forcing artificial synchronization. The hidden state space includes both high‑level activity labels (e.g., “walking together”, “standing”, “conversing”) and relational sub‑states that encode leader‑follower or avoidance dynamics. Emission probabilities are modeled with a mixture of Gaussians over the continuous GR vectors, ensuring that the model can handle the variability inherent in visual features.

A key strength of the proposed framework is its ability to handle symmetric and asymmetric group activities within a single probabilistic structure. Symmetric activities—such as coordinated marching or circular rotation—are modeled with a shared transition matrix across all members, reflecting the fact that every individual follows the same underlying state. Asymmetric activities—such as one person approaching while another retreats—are accommodated by allowing distinct transition pathways for different members, effectively partitioning the transition matrix while still sharing the overall GR observation. This design enables the system to simultaneously learn cooperative and competitive interactions without needing separate models.

The authors evaluate their method on two benchmark datasets (PETS2009 and CUHK Crowd) and a custom‑collected set that features rapid membership changes, varying illumination, and frequent occlusions. Five representative group activities are annotated: static gathering, linear walking, circular rotation, leader‑follower movement, and avoidance. Using a 10‑fold cross‑validation protocol, the system is compared against baseline HMM‑based approaches, a recent deep‑learning group‑activity network, and a naïve per‑person tracking pipeline. Performance metrics include accuracy, precision, recall, F1‑score, and processing speed (frames per second).

Results show that the GR‑AHMM combination consistently outperforms baselines, achieving an average accuracy improvement of 9.3 % and an F1‑score gain of 10.7 % across all scenarios. The advantage is most pronounced in sequences where the group size fluctuates dramatically; in those cases the proposed method yields up to a 12 % increase in detection rate compared with fixed‑size models. Moreover, the implementation runs at over 30 FPS on a standard GPU, satisfying real‑time surveillance requirements.

Beyond raw detection numbers, the paper demonstrates the model’s capacity to uncover hierarchical interactions. By visualizing the learned state‑transition graph, the authors illustrate how a small sub‑group merging into a larger one is represented as a two‑step transition (“sub‑group → merging → large group”). This hierarchical insight confirms that the AHMM can capture not only instantaneous actions but also the evolution of relational structures over time.

In summary, the contributions of the work are fourfold: (1) the introduction of a Group Representative that efficiently summarizes variable‑size groups, (2) the formulation of an Asynchronous HMM that respects the non‑synchronous nature of real‑world human motion, (3) a unified detection algorithm capable of handling both symmetric and asymmetric activities, and (4) extensive experimental validation showing superior accuracy, robustness to membership changes, and real‑time performance. The authors suggest future extensions such as multi‑scale GR descriptors and integration of deep feature extractors into the AHMM emission model, which could further improve scalability and adaptability to more complex surveillance environments.