DIFEM: Key-points Interaction based Feature Extraction Module for Violence Recognition in Videos
Violence detection in surveillance videos is a critical task for ensuring public safety. As a result, there is an increasing need for efficient and lightweight systems for the automatic detection of violent behaviour. In this work, we propose an effective method which leverages human skeleton key-points to capture inherent properties of violence, such as the rapid movement of specific joints and their close proximity. At the heart of our method is our novel Dynamic Interaction Feature Extraction Module (DIFEM), which captures features such as joint velocity and joint intersections, effectively modeling the dynamics of violent behavior. With the features extracted by our DIFEM, we use various classification algorithms such as Random Forest, Decision Tree, AdaBoost and k-Nearest Neighbors. Our approach incurs substantially lower parameter expense than existing state-of-the-art (SOTA) methods employing deep learning techniques. We perform extensive experiments on three standard violence recognition datasets, showing promising performance on all three. Our proposed method surpasses several SOTA violence recognition methods.
💡 Research Summary
Violence detection in surveillance footage is a critical task for public safety, yet most existing approaches rely on deep neural networks that demand large model sizes and substantial computational resources. This paper introduces a lightweight alternative called the Dynamic Interaction Feature Extraction Module (DIFEM), which exploits human skeleton key‑points to capture the essential dynamics of violent actions. First, OpenPose is used to extract 25 joint coordinates per frame; from these, eleven joints most indicative of aggressive motion (both wrists, elbows, hips, knees, ankles, and neck) are selected and assigned importance weights.
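The joint selection step above can be sketched as follows. This is a minimal illustration, not the paper's code: the indices assume OpenPose's BODY_25 output layout, and the importance weights are hypothetical placeholders, since the summary does not report the paper's actual values.

```python
import numpy as np

# Indices follow the OpenPose BODY_25 keypoint layout.
SELECTED_JOINTS = {
    "neck": 1,
    "r_elbow": 3, "r_wrist": 4,
    "l_elbow": 6, "l_wrist": 7,
    "r_hip": 9, "r_knee": 10, "r_ankle": 11,
    "l_hip": 12, "l_knee": 13, "l_ankle": 14,
}

# Hypothetical importance weights: wrists and elbows tend to move
# fastest during aggressive motion, so they get a higher weight here.
JOINT_WEIGHTS = np.array([
    1.0 if ("wrist" in name or "elbow" in name) else 0.5
    for name in SELECTED_JOINTS
])

def select_joints(pose_25x2: np.ndarray) -> np.ndarray:
    """Keep only the eleven DIFEM joints from a (25, 2) pose array."""
    idx = list(SELECTED_JOINTS.values())
    return pose_25x2[idx]  # shape (11, 2)
```

Given a per-frame `(25, 2)` array of `(x, y)` coordinates from OpenPose, `select_joints` returns the `(11, 2)` subset used by the rest of the pipeline.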
Temporal dynamics are quantified by computing a weighted Euclidean distance between each joint’s position in consecutive frames, yielding a velocity value for every joint at every time step. For each video, the mean, maximum, and variance of these velocities are aggregated, producing three temporal features that reflect rapid, erratic movements typical of fights. Spatial dynamics are captured through a joint‑overlap measure: for each frame, the algorithm checks whether any selected joint of person A falls inside the bounding box of person B, counting such intersections. The average and variance of these overlap counts across the video form two spatial features, indicating close physical interaction between individuals.
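The temporal and spatial computations described above can be sketched as below. This is an illustrative reconstruction under the summary's description, not the authors' implementation; the function names and the bounding-box representation are assumptions.

```python
import numpy as np

def temporal_features(joints: np.ndarray, weights: np.ndarray):
    """joints: (T, J, 2) joint positions over T frames; weights: (J,).
    Returns (mean, max, variance) of weighted per-joint speeds."""
    # Euclidean distance between consecutive frames = per-joint speed,
    # scaled by each joint's importance weight.
    disp = np.linalg.norm(np.diff(joints, axis=0), axis=-1)  # (T-1, J)
    speed = disp * weights                                   # broadcast over J
    return speed.mean(), speed.max(), speed.var()

def spatial_features(joints_a: np.ndarray, boxes_b: np.ndarray):
    """joints_a: (T, J, 2) person A's joints; boxes_b: (T, 4) person B's
    per-frame bounding boxes as (x1, y1, x2, y2).
    Returns (mean, variance) of per-frame overlap counts."""
    counts = []
    for pts, (x1, y1, x2, y2) in zip(joints_a, boxes_b):
        inside = ((pts[:, 0] >= x1) & (pts[:, 0] <= x2) &
                  (pts[:, 1] >= y1) & (pts[:, 1] <= y2))
        counts.append(int(inside.sum()))
    counts = np.asarray(counts, dtype=float)
    return counts.mean(), counts.var()
```

The three temporal statistics and two spatial statistics together form the five-dimensional descriptor used for classification.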
The five features (temporal mean, temporal max, temporal variance, spatial mean overlap, spatial variance) are concatenated into a compact 1‑D descriptor. Unlike deep learning pipelines that learn high‑dimensional representations, DIFEM’s descriptor is handcrafted, interpretable, and requires negligible storage. The authors feed this descriptor into four conventional machine‑learning classifiers—Random Forest (100 trees, Gini impurity), Decision Tree, AdaBoost (100 estimators), and k‑Nearest Neighbors (k = 5)—implemented with scikit‑learn.
Experiments were conducted on three publicly available violence datasets: RWF‑2000 (2,000 videos, balanced), Hockey‑Fight (1,000 videos, balanced), and Crowd Violence (246 real‑world clips, balanced). Evaluation metrics include accuracy, precision, recall, and F1‑score. The Random Forest classifier achieved the best results, reaching 93.2 % accuracy on RWF‑2000, 91.5 % on Hockey‑Fight, and 88.7 % on Crowd Violence, surpassing several prior methods such as ViF + OViF and MoWLD + BoW. Importantly, the total number of model parameters and the inference time are an order of magnitude lower than those of convolutional or graph‑based networks, making the approach suitable for real‑time deployment on edge devices.
A detailed error analysis reveals that performance degrades when OpenPose fails to detect joints due to severe occlusion, poor lighting, or scenes with more than two interacting persons, where the simple overlap count may become ambiguous. The authors acknowledge these limitations and propose future work that integrates multi‑person graph representations, learns adaptive joint weights, and explores end‑to‑end pipelines that combine pose estimation and feature extraction in a unified framework.
In summary, DIFEM demonstrates that carefully engineered, interpretable features derived from human pose can rival sophisticated deep models for violence recognition, offering a practical solution for resource‑constrained surveillance systems while preserving high detection accuracy.