Large-Scale Automatic Labeling of Video Events with Verbs Based on Event-Participant Interaction
We present an approach to labeling short video clips with English verbs as event descriptions. A key distinguishing aspect of this work is that it labels videos with verbs that describe the spatiotemporal interaction between event participants (humans and objects interacting with each other), abstracting away all object-class information and fine-grained image characteristics and relying solely on the coarse-grained motion of the event participants. We apply our approach to a large set of 22 distinct verb classes and a corpus of 2,584 videos, yielding two surprising outcomes. First, a classification accuracy of greater than 70% on a 1-out-of-22 labeling task and greater than 85% on a variety of 1-out-of-10 subsets of this labeling task is independent of the choice of which of two different time-series classifiers we employ. Second, we achieve this level of accuracy using a highly impoverished intermediate representation consisting solely of the bounding boxes of one or two event participants as a function of time. This indicates that successful event recognition depends more on the choice of appropriate features that characterize the linguistic invariants of the event classes than on the particular classifier algorithms.
💡 Research Summary
The paper tackles the problem of assigning English verb labels to short video clips by focusing exclusively on the spatiotemporal interaction between event participants, rather than on object categories or detailed visual appearance. The authors argue that verbs fundamentally describe how participants move relative to each other, so a representation that captures only the motion of one or two bounding boxes over time should be sufficient for classification.
To test this hypothesis, they assembled a corpus of 2,584 video clips, each lasting between two and five seconds, covering 22 distinct verbs such as “pick up,” “push,” “pull,” “throw,” and “catch.” Each clip was manually annotated with a single verb label, and the dataset includes a wide variety of backgrounds, lighting conditions, and camera angles. For every frame, a state‑of‑the‑art object detector was used to locate humans and objects, producing axis‑aligned bounding boxes. These boxes were linked across frames to form time‑series data. From the raw (x, y, width, height) coordinates the authors derived a set of coarse‑grained motion features: positional deltas, velocities, accelerations, inter‑box distances, angular changes, and size‑ratio dynamics. Simple low‑pass filtering was applied to reduce noise, resulting in a multivariate time series of six to eight dimensions per clip.
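The feature pipeline described above can be sketched in a few lines. The summary names the feature families (positional deltas, inter-box distances, angular changes, size-ratio dynamics) but not their exact definitions, so the formulas below are illustrative assumptions rather than the paper's actual implementation; `motion_features` and `smooth` are hypothetical helper names.

```python
import math

def motion_features(boxes):
    """Derive coarse motion features from two aligned bounding-box tracks.

    `boxes` is a list with one entry per frame, each entry a pair of
    (x, y, width, height) tuples: (person_box, object_box). Returns one
    feature vector per frame transition. The specific feature definitions
    here are assumptions, chosen to match the families named in the summary.
    """
    feats = []
    for (p0, _o0), (p1, o1) in zip(boxes, boxes[1:]):
        # positional deltas of the person-box centre (a velocity proxy)
        dx = (p1[0] + p1[2] / 2) - (p0[0] + p0[2] / 2)
        dy = (p1[1] + p1[3] / 2) - (p0[1] + p0[3] / 2)
        # inter-box centre distance at the current frame
        cx = (p1[0] + p1[2] / 2) - (o1[0] + o1[2] / 2)
        cy = (p1[1] + p1[3] / 2) - (o1[1] + o1[3] / 2)
        dist = math.hypot(cx, cy)
        # direction of the person's motion (angular component)
        angle = math.atan2(dy, dx)
        # size-ratio dynamics: person-box area relative to object-box area
        ratio = (p1[2] * p1[3]) / max(o1[2] * o1[3], 1e-6)
        feats.append([dx, dy, dist, angle, ratio])
    return feats

def smooth(seq, alpha=0.5):
    """Simple exponential low-pass filter over a feature sequence,
    standing in for the unspecified noise filtering in the paper."""
    out = [seq[0][:]]
    for frame in seq[1:]:
        out.append([alpha * f + (1 - alpha) * prev
                    for f, prev in zip(frame, out[-1])])
    return out
```

Higher-order features such as accelerations would follow the same pattern, differencing the velocity proxies once more.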
Two classic time‑series classifiers were employed for the labeling task. The first is a Hidden Markov Model (HMM) with five hidden states per verb and Gaussian mixture emissions; parameters were learned via Baum‑Welch, and classification was performed by evaluating the log‑likelihood of a test sequence under each verb‑specific model. The second is a Dynamic Time Warping (DTW)-based k‑Nearest‑Neighbour approach, where the DTW distance between a test sequence and all training sequences is computed and the label is decided by majority vote among the three nearest neighbours. Both classifiers were trained on exactly the same feature set, allowing a clean comparison of algorithmic impact.
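The DTW/3-NN side of this comparison is simple enough to sketch end to end. The classic dynamic program below and the majority-vote helper are a minimal illustration of the technique the summary describes, not the paper's code; the Euclidean frame cost and the function names are assumptions.

```python
from collections import Counter

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two multivariate sequences.

    `a` and `b` are lists of feature vectors (one per frame). Classic
    O(len(a) * len(b)) dynamic program with Euclidean per-frame cost,
    allowing the two sequences to be non-linearly aligned in time.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # dp[i][j] = cost of the best alignment of a[:i] with b[:j]
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            dp[i][j] = cost + min(dp[i - 1][j],      # stretch b
                                  dp[i][j - 1],      # stretch a
                                  dp[i - 1][j - 1])  # match frames
    return dp[n][m]

def classify(test_seq, train_seqs, train_labels, k=3):
    """Label a test clip by majority vote among its k DTW-nearest
    training clips (k=3 matching the three-neighbour vote in the summary)."""
    ranked = sorted((dtw_distance(test_seq, s), lbl)
                    for s, lbl in zip(train_seqs, train_labels))
    votes = Counter(lbl for _, lbl in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because DTW tolerates differing sequence lengths, clips of two to five seconds can be compared directly without resampling. An HMM counterpart would instead fit one verb-specific model and pick the verb whose model assigns the test sequence the highest log-likelihood.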
Experimental results are striking. On the full 22‑verb classification task, the HMM achieved 71.3% accuracy and DTW 70.8%, well above random chance (≈4.5%). When the authors restricted the problem to several 1‑out‑of‑10 subsets that avoid semantically similar verbs, accuracies rose to over 85% for both methods. Per‑verb analysis shows that actions involving explicit object contact (“pick up,” “throw,” “catch”) are recognized with >80% accuracy, while pure human locomotion verbs (“run,” “walk”) are slightly harder, hovering around 65%. Importantly, the two classifiers differ by less than half a percentage point, indicating that the choice of classifier is far less critical than the choice of features that capture linguistic invariants.
The study demonstrates that a highly impoverished intermediate representation—merely the trajectories of one or two bounding boxes—contains enough discriminative information to label videos with verbs at a level comparable to more complex, appearance‑rich approaches. This suggests that successful event recognition hinges on identifying features that reflect the abstract relational structure of events rather than on sophisticated visual modeling.
The authors acknowledge several limitations. Their current pipeline handles at most two participants, so multi‑agent interactions (e.g., “handshake,” “group dance”) remain out of scope. Bounding‑box detection errors can propagate into the motion features, motivating future work on more robust tracking and error correction. Additionally, verbs often exhibit lexical ambiguity (e.g., “run” as a physical motion versus “run” as an operation), which could be resolved by integrating contextual metadata or language models.
In conclusion, the paper provides compelling evidence that verb‑centric video labeling can be achieved with minimal visual information, opening the door to lightweight, real‑time video understanding systems. Future research directions include extending the representation to handle multiple participants, richer interaction graphs, and tighter integration with natural‑language processing to address polysemy and compositional semantics.