CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as black-box classification, often confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. We present CamReasoner, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to decode spatio-temporal cues such as trajectories and view frustums within an explicit reasoning block. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. Notably, we are the first to employ RL for logical alignment in this domain, ensuring motion inferences are grounded in physical geometry rather than contextual guesswork. By applying reinforcement learning to the O-T-A reasoning paradigm, CamReasoner effectively suppresses hallucinations and achieves state-of-the-art performance across multiple benchmarks.


💡 Research Summary

CamReasoner tackles the problem of camera motion understanding by reframing it from a black‑box classification task into a structured inference process called Observation‑Thinking‑Answer (O‑T‑A). The authors argue that existing approaches either rely on traditional geometric pipelines (SfM/SLAM) that need intrinsic parameters and struggle with dynamic scenes, or on modern video‑language models that, while powerful, often hallucinate and confuse physically distinct motions such as a dolly versus a pan because they lack explicit reasoning steps.

The O‑T‑A paradigm forces the model to first generate a detailed observation block describing visual cues such as trajectory, view frustum, and background stability; then a thinking segment in which it logically maps those cues to a specific camera motion (e.g., “the subject stays centered while the background shifts left, indicating a leftward dolly”); and finally a concise answer that selects the motion label. This mirrors human cinematic analysis and provides interpretability.
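A three-part output of this kind can be split into its blocks with a small parser. The following is only a minimal sketch: the tag names `<observation>`, `<think>`, and `<answer>`, as well as the helper names `OTAResponse` and `parse_ota`, are assumptions made for illustration, since the summary does not state the exact markup the model emits.

```python
import re
from typing import NamedTuple, Optional

# Assumed tag names; the summary does not give the exact markup.
OTA_PATTERN = re.compile(
    r"\s*<observation>(?P<observation>.*?)</observation>\s*"
    r"<think>(?P<think>.*?)</think>\s*"
    r"<answer>(?P<answer>.*?)</answer>\s*",
    re.DOTALL,
)


class OTAResponse(NamedTuple):
    """The three blocks of one Observation-Thinking-Answer output."""
    observation: str
    thinking: str
    answer: str


def parse_ota(text: str) -> Optional[OTAResponse]:
    """Return the blocks if the text consists of them in O-T-A order, else None."""
    match = OTA_PATTERN.fullmatch(text)
    if match is None:
        return None
    return OTAResponse(
        observation=match["observation"].strip(),
        thinking=match["think"].strip(),
        answer=match["answer"].strip(),
    )


if __name__ == "__main__":
    sample = (
        "<observation>The subject stays centered while the background "
        "shifts left.</observation>\n"
        "<think>Background parallax with a fixed subject suggests the camera "
        "is translating, not rotating.</think>\n"
        "<answer>dolly left</answer>"
    )
    print(parse_ota(sample))
```

Using `fullmatch` rather than `search` means any text outside the three blocks, or blocks in the wrong order, is rejected outright, which is one plausible reading of a "strict" template.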

To teach this behavior, the authors construct two large‑scale datasets. CamReasoning‑SFT‑18k contains 18 541 high‑quality samples generated by prompting a large multimodal LLM (Qwen2.5‑VL‑72B) to produce O‑T‑A chains from existing video‑QA pairs, followed by multi‑stage filtering for format, label correctness, and logical consistency. This dataset serves as a supervised fine‑tuning (SFT) cold‑start, teaching the model the required template and reasoning flow. CamReasoning‑RL‑38k consists of 38 000 question‑answer pairs derived from the CameraBench training split, without any pre‑written reasoning. In the reinforcement learning (RL) stage the model must generate its own O‑T‑A chain, and its quality is judged by a composite reward.

The reward combines a format reward (r_fmt), which grants a point only if the output strictly follows the observation, thinking, answer order, and an accuracy reward (r_acc), which grants a point when the predicted label matches the ground truth. A weight λ balances the two, so that the model does not sacrifice logical structure for raw accuracy.
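To make the composite reward concrete, here is a hedged sketch. The summary does not give the exact formula, so the combination `r = r_acc + lam * r_fmt`, the default `lam = 0.5`, the tag names, and the case-insensitive label comparison are all assumptions chosen for illustration, not the authors' definition.

```python
import re

# Assumed markup for the three blocks; not specified in the summary.
_FORMAT = re.compile(
    r"\s*<observation>.*?</observation>\s*"
    r"<think>.*?</think>\s*"
    r"<answer>.*?</answer>\s*",
    re.DOTALL,
)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(output: str) -> float:
    """1.0 only if the output is exactly the three blocks in O-T-A order."""
    return 1.0 if _FORMAT.fullmatch(output) else 0.0


def accuracy_reward(output: str, label: str) -> float:
    """1.0 if the first answer block matches the label (assumed normalization)."""
    match = _ANSWER.search(output)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == label.strip().lower() else 0.0


def composite_reward(output: str, label: str, lam: float = 0.5) -> float:
    """Assumed combination: accuracy plus a lambda-weighted format term."""
    return accuracy_reward(output, label) + lam * format_reward(output)


if __name__ == "__main__":
    well_formed = (
        "<observation>Background shifts left.</observation>"
        "<think>This indicates translation.</think>"
        "<answer>Dolly left</answer>"
    )
    print(composite_reward(well_formed, "dolly left"))
    print(composite_reward("<answer>dolly left</answer>", "dolly left"))
```

Under this assumed weighting, a correct label without the full template still earns the accuracy point but forgoes the format term, so the policy is nudged toward producing both.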

Training uses Group Relative Policy Optimization (GRPO), which estimates a baseline from the average reward of a sampled group of outputs, avoiding a separate value network. To stabilize the noisy single‑task RL updates, the authors adopt EMA‑GRPO, a variant that tracks an exponential moving average of the reward standard deviation; this smooths the advantage estimates and prevents large, unstable policy jumps.
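The advantage computation can be illustrated with a short sketch. It assumes that each advantage is the reward minus the group mean, divided by an exponentially averaged standard deviation that is carried across training steps. The class name `EmaGroupAdvantage`, the `decay` and `eps` parameters, and the initialization from the first group are hypothetical choices, and the authors' exact EMA‑GRPO formulation may differ.

```python
from statistics import fmean, pstdev
from typing import List, Optional


class EmaGroupAdvantage:
    """Group-relative advantages scaled by an EMA of the reward std (a sketch)."""

    def __init__(self, decay: float = 0.9, eps: float = 1e-6) -> None:
        self.decay = decay      # assumed smoothing factor
        self.eps = eps          # guards against division by zero
        self.ema_std: Optional[float] = None

    def __call__(self, rewards: List[float]) -> List[float]:
        # Group baseline: the mean reward of the sampled outputs,
        # which replaces a learned value network.
        baseline = fmean(rewards)
        group_std = pstdev(rewards)

        # Update the running estimate of the reward standard deviation.
        if self.ema_std is None:
            self.ema_std = group_std
        else:
            self.ema_std = (
                self.decay * self.ema_std + (1.0 - self.decay) * group_std
            )

        scale = self.ema_std + self.eps
        return [(r - baseline) / scale for r in rewards]


if __name__ == "__main__":
    advantage = EmaGroupAdvantage(decay=0.9)
    print(advantage([1.5, 0.5, 1.5, 0.0]))   # first group sets the EMA
    print(advantage([1.5, 1.5, 1.5, 1.0]))   # later groups reuse the smoothed scale
```

Normalizing by a per-group standard deviation alone can inflate advantages when a group happens to have very low reward variance; dividing by a smoothed estimate instead keeps the scale comparable from step to step, which is the stabilizing effect the summary attributes to EMA‑GRPO.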

Empirically, CamReasoner‑7B achieves 78.4 % accuracy on binary camera‑motion classification and 74.5 % on visual question answering, outperforming prior multimodal baselines. On the “Confusable Motion” subset, where motions are visually similar but physically different (e.g., dolly vs. pan, zoom vs. tilt), the model reaches 60.7 %, which suggests that the structured reasoning captures geometric distinctions. Qualitative examples show coherent observation and thinking texts that explain the final decision, offering transparency useful for downstream video editing or synthesis tasks.

In summary, CamReasoner contributes three major advances: (1) a novel O‑T‑A reasoning framework that makes camera‑motion inference explicit and interpretable; (2) a large, curated dataset of structured reasoning chains for both supervised and reinforcement learning phases; (3) the first application of RL‑based logical alignment to this domain, with a stable EMA‑GRPO training scheme. By bridging raw visual perception with cinematic logic, CamReasoner pushes multimodal large language models toward true “cinematic reasoning,” opening avenues for physically grounded video generation, editing, and analysis.

