Video In Sentences Out
We present a system that produces sentential descriptions of video: who did what to whom, and where and how they did it. Action class is rendered as a verb, participant objects as noun phrases, properties of those objects as adjectival modifiers in those noun phrases, spatial relations between those participants as prepositional phrases, and characteristics of the event as prepositional-phrase adjuncts and adverbial modifiers. Extracting the information needed to render these linguistic entities requires an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.


💡 Research Summary

The paper introduces “Video In Sentences Out,” a complete pipeline that converts raw video clips into coherent natural-language sentences describing who performed what action on whom, where, and how. The system is organized into four tightly coupled modules:

1. Detection and tracking. A state-of-the-art object detector and multi-object tracker produce per-frame detections and maintain consistent trajectories using a combination of deep convolutional features, color histograms, shape descriptors, and motion vectors. A Kalman filter and graph-based data association keep tracks stable even under occlusion.

2. Role assignment. Each trajectory is assigned a grammatical role (agent, patient, instrument, etc.) through a Bayesian network that combines a predefined role template for each action class with dynamic cues such as speed, direction changes, and proximity to other tracks. Temporal dependencies are explicitly modeled, allowing smooth role transitions in complex, overlapping activities.

3. Posture extraction. A pose estimator (e.g., OpenPose) supplies joint coordinates, which are mapped to a set of posture labels (standing, sitting, running) and posture-change verbs (lifting, lowering). These fine-grained posture cues later select appropriate adverbs and verb modifiers, enriching the description beyond a simple “who-did-what” predicate.

4. Language generation. A generation component builds a syntactic skeleton from the role-relation-attribute mapping and realizes it as a complete sentence.
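The occlusion-robust tracking in the first module can be illustrated with a minimal constant-velocity Kalman filter. This is a generic sketch, not the paper's implementation: the state layout, motion model, and noise levels below are assumptions chosen for illustration.

```python
import numpy as np

class KalmanTrack:
    """One track with state [x, y, vx, vy]; measurements are box centers [x, y]."""

    def __init__(self, x, y):
        self.state = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 10.0                               # state covariance
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = 1.0                       # constant-velocity model
        self.H = np.zeros((2, 4))
        self.H[0, 0] = self.H[1, 1] = 1.0                       # observe position only
        self.Q = np.eye(4) * 0.01                               # process noise (assumed)
        self.R = np.eye(2) * 1.0                                # measurement noise (assumed)

    def predict(self):
        # Propagate one frame; used alone when the object is occluded.
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]

    def update(self, z):
        # Fold in a matched detection center z = [x, y].
        y = z - self.H @ self.state                             # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)                # Kalman gain
        self.state = self.state + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

track = KalmanTrack(0.0, 0.0)
for t in range(1, 6):                    # object moving right at 1 px/frame
    track.predict()
    track.update(np.array([float(t), 0.0]))
print(track.state[:2])                   # estimate converges toward the detections
```

During occlusion one would call `predict()` without `update()`, which is what keeps the trajectory alive until the detector reacquires the object.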
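The role-assignment idea in the second module can be sketched without a full Bayesian network by scoring every track-to-role labeling with simple motion cues. The scoring weights and the two-role restriction below are invented for illustration and are not taken from the paper.

```python
from itertools import permutations

def role_score(roles, tracks):
    """Score one assignment {role: track id}. Higher is better."""
    agent = tracks[roles["agent"]]
    patient = tracks[roles["patient"]]
    score = agent["speed"] - patient["speed"]         # agents tend to move more
    score -= 0.1 * abs(agent["x"] - patient["x"])     # participants should be close
    return score

def assign_roles(tracks):
    """Exhaustively pick the best (agent, patient) labeling of the tracks."""
    return max(
        ({"agent": a, "patient": p} for a, p in permutations(tracks, 2)),
        key=lambda r: role_score(r, tracks),
    )

tracks = {
    "person": {"speed": 3.0, "x": 10.0},
    "ball":   {"speed": 0.5, "x": 12.0},
}
print(assign_roles(tracks))   # the fast mover is labeled the agent
```

The exhaustive `max` over permutations stands in for probabilistic inference; the summary's Bayesian network would additionally condition on the action class's role template and on the previous frame's assignment.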
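The third module's mapping from joint coordinates to posture labels can be sketched with a single geometric cue: the knee angle. The threshold and the standing/sitting rule are assumptions for illustration, not the system's actual classifier.

```python
import math

def posture_label(hip, knee, ankle):
    """Classify a leg as 'standing' or 'sitting' from 2-D joints (image coords, y down)."""
    # Vectors from the knee to the hip and to the ankle.
    v1 = (hip[0] - knee[0], hip[1] - knee[1])
    v2 = (ankle[0] - knee[0], ankle[1] - knee[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
    # A nearly straight leg (angle close to 180 degrees) suggests standing.
    return "standing" if angle > 150 else "sitting"

print(posture_label(hip=(0, 0), knee=(0, 50), ankle=(0, 100)))   # straight leg
print(posture_label(hip=(0, 0), knee=(0, 50), ankle=(50, 50)))   # bent leg
```

Tracking how this label changes across frames is what would yield the posture-change verbs (lifting, lowering) mentioned above.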
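The fourth module's syntactic skeleton can be sketched as template filling over the role-relation-attribute mapping. The template shape and field names below are illustrative assumptions, not the paper's grammar.

```python
def realize(event):
    """Render a role/attribute mapping as one English sentence."""
    # Noun phrase for the agent, with adjectival modifiers.
    np_agent = "the " + " ".join(event["agent"]["adjectives"] + [event["agent"]["noun"]])
    np_patient = "the " + event["patient"]["noun"]
    parts = [np_agent, event["verb"], np_patient]
    if "manner" in event:                                 # adverbial modifier
        parts.append(event["manner"])
    if "location" in event:                               # prepositional-phrase adjunct
        prep, landmark = event["location"]
        parts.append(f"{prep} the {landmark}")
    return " ".join(parts).capitalize() + "."

event = {
    "verb": "carried",
    "agent": {"noun": "person", "adjectives": ["tall"]},
    "patient": {"noun": "box"},
    "manner": "slowly",
    "location": ("across", "field"),
}
print(realize(event))  # → "The tall person carried the box slowly across the field."
```

Each field corresponds to one of the linguistic entities listed in the abstract: the verb from the action class, noun phrases from the participants, the adverb from posture cues, and the prepositional phrase from spatial relations.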